* [Qemu-devel] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
@ 2016-10-28  7:56 Alexey Kardashevskiy
  2016-10-28 10:07 ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-28  7:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Paolo Bonzini,
	Michael Roth, Alex Williamson

At the moment sPAPR PHB creates a root bus of TYPE_PCI_BUS type.
This means that vfio-pci devices attached to it (and this is
the default behaviour) hide PCIe extended capabilities, as
the bus does not pass the pci_bus_is_express(pdev->bus) check.

This change adds a PCI bus type property to sPAPR PHB
and uses TYPE_PCIE_BUS if none is passed; older machines get TYPE_PCI_BUS
for backward compatibility. As the bus type is used in the bus name,
the root bus name becomes "pcie.0" instead of "pci.0".

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

What can possibly go wrong with such a change of name?
From the devices' perspective, I cannot see any problem.

libvirt might get upset as "pci.0" will not be available;
would it make sense to create pcie.0 as the root bus and always
add a PCIe->PCI bridge and name its bus "pci.0"?

Or create the root bus from TYPE_PCIE_BUS and force its name to "pci.0"?
pci_register_bus() can do this.
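
For reference, the latter would be a one-argument change in
spapr_phb_realize() (sketch only, not part of this patch; the second
parameter of pci_register_bus() is the bus name):

  bus = pci_register_bus(dev, "pci.0",     /* keep the old name */
                         pci_spapr_set_irq, pci_spapr_map_irq, sphb,
                         &sphb->memspace, &sphb->iospace,
                         PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCIE_BUS);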


---
 hw/ppc/spapr.c              | 5 +++++
 hw/ppc/spapr_pci.c          | 5 ++++-
 include/hw/pci-host/spapr.h | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 0b3820b..a268511 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2541,6 +2541,11 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
         .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
         .property = "mem64_win_size",               \
         .value    = "0",                            \
+    },                                              \
+    {                                               \
+        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
+        .property = "root_bus_type",                \
+        .value    = TYPE_PCI_BUS,                   \
     },
 
 static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 7cde30e..2fa1f22 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1434,7 +1434,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     bus = pci_register_bus(dev, NULL,
                            pci_spapr_set_irq, pci_spapr_map_irq, sphb,
                            &sphb->memspace, &sphb->iospace,
-                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
+                           PCI_DEVFN(0, 0), PCI_NUM_PINS,
+                           sphb->root_bus_type ? sphb->root_bus_type :
+                           TYPE_PCIE_BUS);
     phb->bus = bus;
     qbus_set_hotplug_handler(BUS(phb->bus), DEVICE(sphb), NULL);
 
@@ -1590,6 +1592,7 @@ static Property spapr_phb_properties[] = {
     DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
                        (1ULL << 12) | (1ULL << 16)),
     DEFINE_PROP_UINT32("numa_node", sPAPRPHBState, numa_node, -1),
+    DEFINE_PROP_STRING("root_bus_type", sPAPRPHBState, root_bus_type),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index b92c1b5..a4ee873 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -79,6 +79,7 @@ struct sPAPRPHBState {
     uint64_t dma64_win_addr;
 
     uint32_t numa_node;
+    char *root_bus_type;
 };
 
 #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
-- 
2.5.0.rc3
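
Note: with the property as posted, root_bus_type takes a QOM type name;
assuming the usual definitions (TYPE_PCI_BUS expands to "PCI" and
TYPE_PCIE_BUS to "PCIE"), forcing the type from the command line would
look like:

  qemu-system-ppc64 -machine pseries \
      -global spapr-pci-host-bridge.root_bus_type=PCIE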


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-10-28  7:56 [Qemu-devel] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default Alexey Kardashevskiy
@ 2016-10-28 10:07 ` Greg Kurz
  2016-10-31  2:53   ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Greg Kurz @ 2016-10-28 10:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Michael Roth, Alex Williamson, qemu-ppc,
	Paolo Bonzini, David Gibson

On Fri, 28 Oct 2016 18:56:40 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> At the moment sPAPR PHB creates a root bus of TYPE_PCI_BUS type.
> This means that vfio-pci devices attached to it (and this is
> the default behaviour) hide PCIe extended capabilities, as
> the bus does not pass the pci_bus_is_express(pdev->bus) check.
> 
> This change adds a PCI bus type property to sPAPR PHB
> and uses TYPE_PCIE_BUS if none is passed; older machines get TYPE_PCI_BUS
> for backward compatibility. As the bus type is used in the bus name,
> the root bus name becomes "pcie.0" instead of "pci.0".
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> What can possibly go wrong with such a change of name?
> From the devices' perspective, I cannot see any problem.
> 
> libvirt might get upset as "pci.0" will not be available;
> would it make sense to create pcie.0 as the root bus and always
> add a PCIe->PCI bridge and name its bus "pci.0"?
> 
> Or create the root bus from TYPE_PCIE_BUS and force its name to "pci.0"?
> pci_register_bus() can do this.
> 
> 
> ---
>  hw/ppc/spapr.c              | 5 +++++
>  hw/ppc/spapr_pci.c          | 5 ++++-
>  include/hw/pci-host/spapr.h | 1 +
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 0b3820b..a268511 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2541,6 +2541,11 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
>          .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
>          .property = "mem64_win_size",               \
>          .value    = "0",                            \
> +    },                                              \
> +    {                                               \
> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
> +        .property = "root_bus_type",                \
> +        .value    = TYPE_PCI_BUS,                   \
>      },
>  
>  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 7cde30e..2fa1f22 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1434,7 +1434,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      bus = pci_register_bus(dev, NULL,
>                             pci_spapr_set_irq, pci_spapr_map_irq, sphb,
>                             &sphb->memspace, &sphb->iospace,
> -                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
> +                           PCI_DEVFN(0, 0), PCI_NUM_PINS,
> +                           sphb->root_bus_type ? sphb->root_bus_type :
> +                           TYPE_PCIE_BUS);

Shouldn't we ensure that sphb->root_bus_type is either TYPE_PCIE_BUS or
TYPE_PCI_BUS ?

Otherwise, if we pass something like:

-global spapr-pci-host-bridge.root_bus_type=pcie

we get the following not really explicit error:

ERROR:/home/greg/Work/qemu/qemu-spapr/qom/object.c:474:object_new_with_type: assertion failed: (type != NULL)
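
A minimal sketch of such a check, assuming the string property from this
RFC (error_setg() and the TYPE_* constants are standard QEMU; placing
this fragment in spapr_phb_realize() before pci_register_bus() is
hypothetical):

  if (sphb->root_bus_type &&
      strcmp(sphb->root_bus_type, TYPE_PCI_BUS) != 0 &&
      strcmp(sphb->root_bus_type, TYPE_PCIE_BUS) != 0) {
      error_setg(errp, "root_bus_type must be '%s' or '%s'",
                 TYPE_PCI_BUS, TYPE_PCIE_BUS);
      return;
  }

That would turn the assertion above into a proper error message.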

>      phb->bus = bus;
>      qbus_set_hotplug_handler(BUS(phb->bus), DEVICE(sphb), NULL);
>  
> @@ -1590,6 +1592,7 @@ static Property spapr_phb_properties[] = {
>      DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>                         (1ULL << 12) | (1ULL << 16)),
>      DEFINE_PROP_UINT32("numa_node", sPAPRPHBState, numa_node, -1),
> +    DEFINE_PROP_STRING("root_bus_type", sPAPRPHBState, root_bus_type),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index b92c1b5..a4ee873 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -79,6 +79,7 @@ struct sPAPRPHBState {
>      uint64_t dma64_win_addr;
>  
>      uint32_t numa_node;
> +    char *root_bus_type;
>  };
>  
>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-10-28 10:07 ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
@ 2016-10-31  2:53   ` David Gibson
  2016-10-31  4:10     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-10-31  2:53 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Alexey Kardashevskiy, qemu-devel, Michael Roth, Alex Williamson,
	qemu-ppc, Paolo Bonzini


On Fri, Oct 28, 2016 at 12:07:12PM +0200, Greg Kurz wrote:
> On Fri, 28 Oct 2016 18:56:40 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > At the moment sPAPR PHB creates a root bus of TYPE_PCI_BUS type.
> > This means that vfio-pci devices attached to it (and this is
> > the default behaviour) hide PCIe extended capabilities, as
> > the bus does not pass the pci_bus_is_express(pdev->bus) check.
> > 
> > This change adds a PCI bus type property to sPAPR PHB
> > and uses TYPE_PCIE_BUS if none is passed; older machines get TYPE_PCI_BUS
> > for backward compatibility. As the bus type is used in the bus name,
> > the root bus name becomes "pcie.0" instead of "pci.0".
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > 
> > What can possibly go wrong with such a change of name?
> > From the devices' perspective, I cannot see any problem.
> > 
> > libvirt might get upset as "pci.0" will not be available;
> > would it make sense to create pcie.0 as the root bus and always
> > add a PCIe->PCI bridge and name its bus "pci.0"?
> > 
> > Or create the root bus from TYPE_PCIE_BUS and force its name to "pci.0"?
> > pci_register_bus() can do this.
> > 
> > 
> > ---
> >  hw/ppc/spapr.c              | 5 +++++
> >  hw/ppc/spapr_pci.c          | 5 ++++-
> >  include/hw/pci-host/spapr.h | 1 +
> >  3 files changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 0b3820b..a268511 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2541,6 +2541,11 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
> >          .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
> >          .property = "mem64_win_size",               \
> >          .value    = "0",                            \
> > +    },                                              \
> > +    {                                               \
> > +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
> > +        .property = "root_bus_type",                \
> > +        .value    = TYPE_PCI_BUS,                   \
> >      },
> >  
> >  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
> > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> > index 7cde30e..2fa1f22 100644
> > --- a/hw/ppc/spapr_pci.c
> > +++ b/hw/ppc/spapr_pci.c
> > @@ -1434,7 +1434,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >      bus = pci_register_bus(dev, NULL,
> >                             pci_spapr_set_irq, pci_spapr_map_irq, sphb,
> >                             &sphb->memspace, &sphb->iospace,
> > -                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
> > +                           PCI_DEVFN(0, 0), PCI_NUM_PINS,
> > +                           sphb->root_bus_type ? sphb->root_bus_type :
> > +                           TYPE_PCIE_BUS);
> 
> Shouldn't we ensure that sphb->root_bus_type is either TYPE_PCIE_BUS or
> TYPE_PCI_BUS ?

Yes, I think so.  In fact, I think it would be better to make the
property a boolean that just selects PCI-E, rather than this which
exposes qemu (semi-)internal type names on the command line.
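
A sketch of that boolean variant (the "pcie-root" name and the bool
field are hypothetical at this point in the thread):

  /* in spapr_phb_properties[], replacing the string property: */
  DEFINE_PROP_BOOL("pcie-root", sPAPRPHBState, pcie_root, true),

  /* in spapr_phb_realize(): */
  bus = pci_register_bus(dev, NULL,
                         pci_spapr_set_irq, pci_spapr_map_irq, sphb,
                         &sphb->memspace, &sphb->iospace,
                         PCI_DEVFN(0, 0), PCI_NUM_PINS,
                         sphb->pcie_root ? TYPE_PCIE_BUS : TYPE_PCI_BUS);

Compat machine types would then set pcie-root=off instead of a QOM type
name.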

> Otherwise, if we pass something like:
> 
> -global spapr-pci-host-bridge.root_bus_type=pcie
> 
> we get the following not really explicit error:
> 
> ERROR:/home/greg/Work/qemu/qemu-spapr/qom/object.c:474:object_new_with_type: assertion failed: (type != NULL)
> 
> >      phb->bus = bus;
> >      qbus_set_hotplug_handler(BUS(phb->bus), DEVICE(sphb), NULL);
> >  
> > @@ -1590,6 +1592,7 @@ static Property spapr_phb_properties[] = {
> >      DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> >                         (1ULL << 12) | (1ULL << 16)),
> >      DEFINE_PROP_UINT32("numa_node", sPAPRPHBState, numa_node, -1),
> > +    DEFINE_PROP_STRING("root_bus_type", sPAPRPHBState, root_bus_type),
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> >  
> > diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> > index b92c1b5..a4ee873 100644
> > --- a/include/hw/pci-host/spapr.h
> > +++ b/include/hw/pci-host/spapr.h
> > @@ -79,6 +79,7 @@ struct sPAPRPHBState {
> >      uint64_t dma64_win_addr;
> >  
> >      uint32_t numa_node;
> > +    char *root_bus_type;
> >  };
> >  
> >  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-10-31  2:53   ` David Gibson
@ 2016-10-31  4:10     ` Alexey Kardashevskiy
  2016-11-01  2:46       ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-31  4:10 UTC (permalink / raw)
  To: David Gibson, Greg Kurz
  Cc: qemu-devel, Michael Roth, Alex Williamson, qemu-ppc, Paolo Bonzini


On 31/10/16 13:53, David Gibson wrote:
> On Fri, Oct 28, 2016 at 12:07:12PM +0200, Greg Kurz wrote:
>> On Fri, 28 Oct 2016 18:56:40 +1100
>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>> At the moment sPAPR PHB creates a root bus of TYPE_PCI_BUS type.
>>> This means that vfio-pci devices attached to it (and this is
>>> the default behaviour) hide PCIe extended capabilities, as
>>> the bus does not pass the pci_bus_is_express(pdev->bus) check.
>>>
>>> This change adds a PCI bus type property to sPAPR PHB
>>> and uses TYPE_PCIE_BUS if none is passed; older machines get TYPE_PCI_BUS
>>> for backward compatibility. As the bus type is used in the bus name,
>>> the root bus name becomes "pcie.0" instead of "pci.0".
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>
>>> What can possibly go wrong with such a change of name?
>>> From the devices' perspective, I cannot see any problem.
>>>
>>> libvirt might get upset as "pci.0" will not be available;
>>> would it make sense to create pcie.0 as the root bus and always
>>> add a PCIe->PCI bridge and name its bus "pci.0"?
>>>
>>> Or create the root bus from TYPE_PCIE_BUS and force its name to "pci.0"?
>>> pci_register_bus() can do this.
>>>
>>>
>>> ---
>>>  hw/ppc/spapr.c              | 5 +++++
>>>  hw/ppc/spapr_pci.c          | 5 ++++-
>>>  include/hw/pci-host/spapr.h | 1 +
>>>  3 files changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>> index 0b3820b..a268511 100644
>>> --- a/hw/ppc/spapr.c
>>> +++ b/hw/ppc/spapr.c
>>> @@ -2541,6 +2541,11 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
>>>          .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
>>>          .property = "mem64_win_size",               \
>>>          .value    = "0",                            \
>>> +    },                                              \
>>> +    {                                               \
>>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
>>> +        .property = "root_bus_type",                \
>>> +        .value    = TYPE_PCI_BUS,                   \
>>>      },
>>>  
>>>  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>> index 7cde30e..2fa1f22 100644
>>> --- a/hw/ppc/spapr_pci.c
>>> +++ b/hw/ppc/spapr_pci.c
>>> @@ -1434,7 +1434,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>      bus = pci_register_bus(dev, NULL,
>>>                             pci_spapr_set_irq, pci_spapr_map_irq, sphb,
>>>                             &sphb->memspace, &sphb->iospace,
>>> -                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
>>> +                           PCI_DEVFN(0, 0), PCI_NUM_PINS,
>>> +                           sphb->root_bus_type ? sphb->root_bus_type :
>>> +                           TYPE_PCIE_BUS);
>>
>> Shouldn't we ensure that sphb->root_bus_type is either TYPE_PCIE_BUS or
>> TYPE_PCI_BUS ?
> 
> Yes, I think so.  In fact, I think it would be better to make the
> property a boolean that just selects PCI-E, rather than this which
> exposes qemu (semi-)internal type names on the command line.


Sure, a "pcie-root" boolean property should do.

However, this is not my main concern; I rather wonder whether we have
to have pci.0 when we pick PCIe for the root.
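
(For illustration: if a "pci.0" is still wanted, it could also be
provided explicitly rather than built in, e.g. with the i82801b11-bridge
device; a hypothetical command line, untested on ppc64:

  -device i82801b11-bridge,bus=pcie.0,id=pci.0

after which legacy devices could keep using bus=pci.0.)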



>> Otherwise, if we pass something like:
>>
>> -global spapr-pci-host-bridge.root_bus_type=pcie
>>
>> we get the following not really explicit error:
>>
>> ERROR:/home/greg/Work/qemu/qemu-spapr/qom/object.c:474:object_new_with_type: assertion failed: (type != NULL)
>>
>>>      phb->bus = bus;
>>>      qbus_set_hotplug_handler(BUS(phb->bus), DEVICE(sphb), NULL);
>>>  
>>> @@ -1590,6 +1592,7 @@ static Property spapr_phb_properties[] = {
>>>      DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>                         (1ULL << 12) | (1ULL << 16)),
>>>      DEFINE_PROP_UINT32("numa_node", sPAPRPHBState, numa_node, -1),
>>> +    DEFINE_PROP_STRING("root_bus_type", sPAPRPHBState, root_bus_type),
>>>      DEFINE_PROP_END_OF_LIST(),
>>>  };
>>>  
>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>>> index b92c1b5..a4ee873 100644
>>> --- a/include/hw/pci-host/spapr.h
>>> +++ b/include/hw/pci-host/spapr.h
>>> @@ -79,6 +79,7 @@ struct sPAPRPHBState {
>>>      uint64_t dma64_win_addr;
>>>  
>>>      uint32_t numa_node;
>>> +    char *root_bus_type;
>>>  };
>>>  
>>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>>
> 


-- 
Alexey



* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-10-31  4:10     ` Alexey Kardashevskiy
@ 2016-11-01  2:46       ` David Gibson
  2016-11-15 14:02         ` Andrea Bolognani
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-11-01  2:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Greg Kurz, Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	Michael Roth, abologna


On Mon, Oct 31, 2016 at 03:10:23PM +1100, Alexey Kardashevskiy wrote:
> On 31/10/16 13:53, David Gibson wrote:
> > On Fri, Oct 28, 2016 at 12:07:12PM +0200, Greg Kurz wrote:
> >> On Fri, 28 Oct 2016 18:56:40 +1100
> >> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>
> >>> At the moment sPAPR PHB creates a root bus of TYPE_PCI_BUS type.
> >>> This means that vfio-pci devices attached to it (and this is
> >>> the default behaviour) hide PCIe extended capabilities, as
> >>> the bus does not pass the pci_bus_is_express(pdev->bus) check.
> >>>
> >>> This change adds a PCI bus type property to sPAPR PHB
> >>> and uses TYPE_PCIE_BUS if none is passed; older machines get TYPE_PCI_BUS
> >>> for backward compatibility. As the bus type is used in the bus name,
> >>> the root bus name becomes "pcie.0" instead of "pci.0".
> >>>
> >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>> ---
> >>>
> >>> What can possibly go wrong with such a change of name?
> >>> From the devices' perspective, I cannot see any problem.
> >>>
> >>> libvirt might get upset as "pci.0" will not be available;
> >>> would it make sense to create pcie.0 as the root bus and always
> >>> add a PCIe->PCI bridge and name its bus "pci.0"?
> >>>
> >>> Or create the root bus from TYPE_PCIE_BUS and force its name to "pci.0"?
> >>> pci_register_bus() can do this.
> >>>
> >>>
> >>> ---
> >>>  hw/ppc/spapr.c              | 5 +++++
> >>>  hw/ppc/spapr_pci.c          | 5 ++++-
> >>>  include/hw/pci-host/spapr.h | 1 +
> >>>  3 files changed, 10 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index 0b3820b..a268511 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -2541,6 +2541,11 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
> >>>          .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
> >>>          .property = "mem64_win_size",               \
> >>>          .value    = "0",                            \
> >>> +    },                                              \
> >>> +    {                                               \
> >>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
> >>> +        .property = "root_bus_type",                \
> >>> +        .value    = TYPE_PCI_BUS,                   \
> >>>      },
> >>>  
> >>>  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
> >>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>> index 7cde30e..2fa1f22 100644
> >>> --- a/hw/ppc/spapr_pci.c
> >>> +++ b/hw/ppc/spapr_pci.c
> >>> @@ -1434,7 +1434,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>      bus = pci_register_bus(dev, NULL,
> >>>                             pci_spapr_set_irq, pci_spapr_map_irq, sphb,
> >>>                             &sphb->memspace, &sphb->iospace,
> >>> -                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
> >>> +                           PCI_DEVFN(0, 0), PCI_NUM_PINS,
> >>> +                           sphb->root_bus_type ? sphb->root_bus_type :
> >>> +                           TYPE_PCIE_BUS);
> >>
> >> Shouldn't we ensure that sphb->root_bus_type is either TYPE_PCIE_BUS or
> >> TYPE_PCI_BUS ?
> > 
> > Yes, I think so.  In fact, I think it would be better to make the
> > property a boolean that just selects PCI-E, rather than this which
> > exposes qemu (semi-)internal type names on the command line.
> 
> 
> Sure, a "pcie-root" boolean property should do.
> 
> However, this is not my main concern; I rather wonder whether we have
> to have pci.0 when we pick PCIe for the root.

Right.

I've added Andrea Bolognani to the CC list to get a libvirt perspective.

Andrea,

To summarise the issue here:
   * As I've said before the PAPR spec kinda-sorta abstracts the
     difference between vanilla PCI and PCI-E
   * However, because within qemu we're declaring the bus as PCI that
     means some PCI-E devices aren't working right
   * In particular it means that PCI-E extended config space isn't
     available

The proposal is to change (on newer machine types) the spapr PHB code
to declare a PCI-E bus instead.  AIUI this still won't make the root
complex guest visible (which it's not supposed to be under PAPR), and
the guest shouldn't see a difference in most cases - it will still see
the PAPR abstracted PCIish bus, but will now be able to get extended
config space.

The possible problem from a libvirt perspective is that doing this in
the simplest way in qemu would change the name of the default bus from
pci.0 to pcie.0.  We have two suggested ways to mitigate this:
   1) Automatically create a PCI-E to PCI bridge, so that new machine
      types will have both a pcie.0 and pci.0 bus
   2) Force the name of the bus to be pci.0, even though it's treated
      as PCI-E in other ways.

We're trying to work out exactly what will and won't cause trouble for
libvirt.
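
In user-visible terms the difference is just the bus= name used when
placing devices; illustrative command lines:

  -device virtio-scsi-pci,bus=pci.0      # pseries up to 2.8
  -device virtio-scsi-pci,bus=pcie.0     # new machine types, as proposed

Option 1 would keep both names valid; option 2 would keep only "pci.0".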

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-01  2:46       ` David Gibson
@ 2016-11-15 14:02         ` Andrea Bolognani
  2016-11-17  2:02           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: Andrea Bolognani @ 2016-11-15 14:02 UTC (permalink / raw)
  To: David Gibson, Alexey Kardashevskiy
  Cc: Greg Kurz, Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	libvir-list, Michael Roth

On Tue, 2016-11-01 at 13:46 +1100, David Gibson wrote:
> On Mon, Oct 31, 2016 at 03:10:23PM +1100, Alexey Kardashevskiy wrote:
> > 
> > On 31/10/16 13:53, David Gibson wrote:
> > > 
> > > On Fri, Oct 28, 2016 at 12:07:12PM +0200, Greg Kurz wrote:
> > > > 
> > > > On Fri, 28 Oct 2016 18:56:40 +1100
> > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > 
> > > > > 
> > > > > At the moment sPAPR PHB creates a root bus of TYPE_PCI_BUS type.
> > > > > This means that vfio-pci devices attached to it (and this is
> > > > > the default behaviour) hide PCIe extended capabilities, as
> > > > > the bus does not pass the pci_bus_is_express(pdev->bus) check.
> > > > > 
> > > > > This change adds a PCI bus type property to sPAPR PHB
> > > > > and uses TYPE_PCIE_BUS if none is passed; older machines get TYPE_PCI_BUS
> > > > > for backward compatibility. As the bus type is used in the bus name,
> > > > > the root bus name becomes "pcie.0" instead of "pci.0".
> > > > > 
> > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > > ---
> > > > > 
> > > > > What can possibly go wrong with such a change of name?
> > > > > From the devices' perspective, I cannot see any problem.
> > > > > 
> > > > > libvirt might get upset as "pci.0" will not be available;
> > > > > would it make sense to create pcie.0 as the root bus and always
> > > > > add a PCIe->PCI bridge and name its bus "pci.0"?
> > > > > 
> > > > > Or create the root bus from TYPE_PCIE_BUS and force its name to "pci.0"?
> > > > > pci_register_bus() can do this.
> > > > > 
> > > > > 
> > > > > ---
> > > > >  hw/ppc/spapr.c              | 5 +++++
> > > > >  hw/ppc/spapr_pci.c          | 5 ++++-
> > > > >  include/hw/pci-host/spapr.h | 1 +
> > > > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > > > index 0b3820b..a268511 100644
> > > > > --- a/hw/ppc/spapr.c
> > > > > +++ b/hw/ppc/spapr.c
> > > > > @@ -2541,6 +2541,11 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
> > > > >          .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
> > > > >          .property = "mem64_win_size",               \
> > > > >          .value    = "0",                            \
> > > > > +    },                                              \
> > > > > +    {                                               \
> > > > > +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
> > > > > +        .property = "root_bus_type",                \
> > > > > +        .value    = TYPE_PCI_BUS,                   \
> > > > >      },
> > > > >  
> > > > >  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
> > > > > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> > > > > index 7cde30e..2fa1f22 100644
> > > > > --- a/hw/ppc/spapr_pci.c
> > > > > +++ b/hw/ppc/spapr_pci.c
> > > > > @@ -1434,7 +1434,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> > > > >      bus = pci_register_bus(dev, NULL,
> > > > >                             pci_spapr_set_irq, pci_spapr_map_irq, sphb,
> > > > >                             &sphb->memspace, &sphb->iospace,
> > > > > -                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
> > > > > +                           PCI_DEVFN(0, 0), PCI_NUM_PINS,
> > > > > +                           sphb->root_bus_type ? sphb->root_bus_type :
> > > > > +                           TYPE_PCIE_BUS);
> > > > 
> > > > Shouldn't we ensure that sphb->root_bus_type is either TYPE_PCIE_BUS or
> > > > TYPE_PCI_BUS ?
> > > 
> > > Yes, I think so.  In fact, I think it would be better to make the
> > > property a boolean that just selects PCI-E, rather than this which
> > > exposes qemu (semi-)internal type names on the command line.
> > 
> > Sure, a "pcie-root" boolean property should do.
> > 
> > However, this is not my main concern; I rather wonder whether we have
> > to have pci.0 when we pick PCIe for the root.
> 
> Right.
> 
> I've added Andrea Bolognani to the CC list to get a libvirt perspective.

Thanks for doing so: changes such as this one can have quite
an impact on the upper layers of the stack, so the earlier
libvirt is involved in the discussion, the better.

I'm going to go a step further and cross-post to libvir-list
in order to give other libvirt contributors a chance to chime
in too.

> Andrea,
> 
> To summarise the issue here:
>    * As I've said before the PAPR spec kinda-sorta abstracts the
>      difference between vanilla PCI and PCI-E
>    * However, because within qemu we're declaring the bus as PCI that
>      means some PCI-E devices aren't working right
>    * In particular it means that PCI-E extended config space isn't
>      available
> 
> The proposal is to change (on newer machine types) the spapr PHB code
> to declare a PCI-E bus instead.  AIUI this still won't make the root
> complex guest visible (which it's not supposed to be under PAPR), and
> the guest shouldn't see a difference in most cases - it will still see
> the PAPR abstracted PCIish bus, but will now be able to get extended
> config space.
> 
> The possible problem from a libvirt perspective is that doing this in
> the simplest way in qemu would change the name of the default bus from
> pci.0 to pcie.0.  We have two suggested ways to mitigate this:
>    1) Automatically create a PCI-E to PCI bridge, so that new machine
>       types will have both a pcie.0 and pci.0 bus
>    2) Force the name of the bus to be pci.0, even though it's treated
>       as PCI-E in other ways.
> 
> We're trying to work out exactly what will and won't cause trouble for
> libvirt.

Option 2) is definitely a no-no, as we don't want to be piling
up even more hacks and architecture-specific code: the PCI
Express Root Bus should be called pcie.0, just as it is on q35
and mach-virt machine types.

Option 1) doesn't look too bad, but devices that are added
automatically by QEMU are an issue since we need to hardcode
knowledge of them into libvirt if we want the rest of the PCI
address allocation logic to handle them correctly.

Moreover libvirt now has the ability of building a legacy PCI
topology without user intervention, if needed to plug in
legacy devices, on machines that have a PCI Express Root Bus,
which makes the additional bridge fully redundant...

... or at least it would, if we actually had a proper
PCIe-to-PCI bridge; AFAIK, though, the closest we have is the
i82801b11-bridge that is Intel-specific despite having so far
been abused as a generic PCIe-to-PCI bridge. I'm not even
sure whether it would work at all on ppc64.

Moving from legacy PCI to PCI Express would definitely be an
improvement, in my opinion. As mentioned, that's already the
case for at least two other architectures, so the more we can
standardize on that, the better.

That said, considering that a big part of the PCI address
allocation logic is based on whether the specific machine
type exposes a legacy PCI Root Bus or a PCI Express Root Bus,
libvirt will need a way to tell which one is which.

Version checks are pretty much out of the question, as they
fail as soon as downstream releases enter the picture. A
few ways we could deal with the situation:

  1) switch to PCI Express on newer machine types, and
     expose some sort of capability through QMP so that
     libvirt can know about the switch

  2) switch between legacy PCI and PCI Express based on a
     machine type option. libvirt would be able to find out
     whether the option is available or not, and default to
     either

       <controller type='pci' model='pci-root'/>

     or

       <controller type='pci' model='pcie-root'/>

     based on that. In order to support multiple PHBs
     properly, those would have to be switchable with an
     option as well

  3) create an entirely new machine type, eg. pseries-pcie
     or whatever someone with the ability to come up with
     decent names can suggest :) That would make ppc64
     similar to x86, where i440fx and q35 have different
     root buses. libvirt would learn about the new machine
     type, know that it has a PCI Express Root Bus, and
     behave accordingly

Option 1) would break horribly with existing libvirt
versions, and so would Option 2) if we default to using
PCI Express. Option 2) defaulting to legacy PCI and
option 3) would work just fine with existing libvirt
versions AFAICT, but wouldn't of course expose the new
capabilities.

Option 3) is probably the one that will be less confusing
to users; we might even decide to take the chance and fix
other small annoyances with the current pseries machine
type, if there's any. On the other hand, it might very well
be considered to be too big a hammer for such a small nail.
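
As a sketch, option 2) could look like a machine property on the QEMU
side (option and value names purely hypothetical):

  qemu-system-ppc64 -machine pseries-2.9,pcie-root=on ...

with libvirt probing for the property before relying on it.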

-- 
Andrea Bolognani / Red Hat / Virtualization


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-15 14:02         ` Andrea Bolognani
@ 2016-11-17  2:02           ` Alexey Kardashevskiy
  2016-11-18  6:11             ` David Gibson
  2016-11-18  8:17             ` Andrea Bolognani
  0 siblings, 2 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-11-17  2:02 UTC (permalink / raw)
  To: Andrea Bolognani, David Gibson
  Cc: Greg Kurz, Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	libvir-list, Michael Roth

On 16/11/16 01:02, Andrea Bolognani wrote:
> On Tue, 2016-11-01 at 13:46 +1100, David Gibson wrote:
>> On Mon, Oct 31, 2016 at 03:10:23PM +1100, Alexey Kardashevskiy wrote:
>>>  
>>> On 31/10/16 13:53, David Gibson wrote:
>>>>  
>>>> On Fri, Oct 28, 2016 at 12:07:12PM +0200, Greg Kurz wrote:
>>>>>  
>>>>> On Fri, 28 Oct 2016 18:56:40 +1100
>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>  
>>>>>>  
>>>>>> At the moment sPAPR PHB creates a root bus of TYPE_PCI_BUS type.
>>>>>> This means that vfio-pci devices attached to it (and this is
>>>>>> the default behaviour) hide PCIe extended capabilities, as
>>>>>> the bus does not pass the pci_bus_is_express(pdev->bus) check.
>>>>>>
>>>>>> This change adds a PCI bus type property to sPAPR PHB
>>>>>> and uses TYPE_PCIE_BUS if none is passed; older machines get TYPE_PCI_BUS
>>>>>> for backward compatibility. As the bus type is used in the bus name,
>>>>>> the root bus name becomes "pcie.0" instead of "pci.0".
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>
>>>>>> What can possibly go wrong with such a change of name?
>>>>>> From the devices' perspective, I cannot see any problem.
>>>>>>
>>>>>> libvirt might get upset as "pci.0" will not be available;
>>>>>> would it make sense to create pcie.0 as the root bus and always
>>>>>> add a PCIe->PCI bridge and name its bus "pci.0"?
>>>>>>
>>>>>> Or create the root bus from TYPE_PCIE_BUS and force its name to "pci.0"?
>>>>>> pci_register_bus() can do this.
>>>>>>  
>>>>>>  
>>>>>> ---
>>>>>>   hw/ppc/spapr.c              | 5 +++++
>>>>>>   hw/ppc/spapr_pci.c          | 5 ++++-
>>>>>>   include/hw/pci-host/spapr.h | 1 +
>>>>>>   3 files changed, 10 insertions(+), 1 deletion(-)
>>>>>>  
>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>> index 0b3820b..a268511 100644
>>>>>> --- a/hw/ppc/spapr.c
>>>>>> +++ b/hw/ppc/spapr.c
>>>>>> @@ -2541,6 +2541,11 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
>>>>>>           .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
>>>>>>           .property = "mem64_win_size",               \
>>>>>>           .value    = "0",                            \
>>>>>> +    },                                              \
>>>>>> +    {                                               \
>>>>>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
>>>>>> +        .property = "root_bus_type",                \
>>>>>> +        .value    = TYPE_PCI_BUS,                   \
>>>>>>       },
>>>>>>   
>>>>>>   static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>>>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>>>> index 7cde30e..2fa1f22 100644
>>>>>> --- a/hw/ppc/spapr_pci.c
>>>>>> +++ b/hw/ppc/spapr_pci.c
>>>>>> @@ -1434,7 +1434,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>>>       bus = pci_register_bus(dev, NULL,
>>>>>>                              pci_spapr_set_irq, pci_spapr_map_irq, sphb,
>>>>>>                              &sphb->memspace, &sphb->iospace,
>>>>>> -                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
>>>>>> +                           PCI_DEVFN(0, 0), PCI_NUM_PINS,
>>>>>> +                           sphb->root_bus_type ? sphb->root_bus_type :
>>>>>> +                           TYPE_PCIE_BUS);
>>>>>  
>>>>> Shouldn't we ensure that sphb->root_bus_type is either TYPE_PCIE_BUS or
>>>>> TYPE_PCI_BUS ?
>>>>  
>>>> Yes, I think so.  In fact, I think it would be better to make the
>>>> property a boolean that just selects PCI-E, rather than this which
>>>> exposes qemu (semi-)internal type names on the command line.
>>>  
>>> Sure, a "pcie-root" boolean property should do.
>>>  
>>> However, this is not my main concern; I rather wonder whether we have
>>> to have pci.0 when we pick PCIe for the root.
>>  
>> Right.
>>  
>> I've added Andrea Bolognani to the CC list to get a libvirt perspective.
> 
> Thanks for doing so: changes such as this one can have quite
> an impact on the upper layers of the stack, so the earlier
> libvirt is involved in the discussion, the better.
> 
> I'm going to go a step further and cross-post to libvir-list
> in order to give other libvirt contributors a chance to chime
> in too.
> 
>> Andrea,
>>  
>> To summarise the issue here:
>>     * As I've said before the PAPR spec kinda-sorta abstracts the
>>       difference between vanilla PCI and PCI-E
>>     * However, because within qemu we're declaring the bus as PCI that
>>       means some PCI-E devices aren't working right
>>     * In particular it means that PCI-E extended config space isn't
>>       available
>>  
>> The proposal is to change (on newer machine types) the spapr PHB code
>> to declare a PCI-E bus instead.  AIUI this still won't make the root
>> complex guest visible (which it's not supposed to be under PAPR), and
>> the guest shouldn't see a difference in most cases - it will still see
>> the PAPR abstracted PCIish bus, but will now be able to get extended
>> config space.
>>  
>> The possible problem from a libvirt perspective is that doing this in
>> the simplest way in qemu would change the name of the default bus from
>> pci.0 to pcie.0.  We have two suggested ways to mitigate this:
>>     1) Automatically create a PCI-E to PCI bridge, so that new machine
>>        types will have both a pcie.0 and pci.0 bus
>>     2) Force the name of the bus to be pci.0, even though it's treated
>>        as PCI-E in other ways.
>>  
>> We're trying to work out exactly what will and won't cause trouble for
>> libvirt.
> 
> Option 2) is definitely a no-no, as we don't want to be piling
> up even more hacks and architecture-specific code: the PCI
> Express Root Bus should be called pcie.0, just as it is on q35
> and mach-virt machine types.
> 
> Option 1) doesn't look too bad, but devices that are added
> automatically by QEMU are an issue since we need to hardcode
> knowledge of them into libvirt if we want the rest of the PCI
> address allocation logic to handle them correctly.
> 
> Moreover libvirt now has the ability of building a legacy PCI
> topology without user intervention, if needed to plug in
> legacy devices, on machines that have a PCI Express Root Bus,
> which makes the additional bridge fully redundant...
> 
> ... or at least it would, if we actually had a proper
> PCIe-to-PCI bridge; AFAIK, though, the closest we have is the
> i82801b11-bridge that is Intel-specific despite having so far
> been abused as a generic PCIe-to-PCI bridge. I'm not even
> sure whether it would work at all on ppc64.
> 
> Moving from legacy PCI to PCI Express would definitely be an
> improvement, in my opinion. As mentioned, that's already the
> case for at least two other architectures, so the more we can
> standardize on that, the better.
> 
> That said, considering that a big part of the PCI address
> allocation logic is based on whether the specific machine
> type exposes a legacy PCI Root Bus or a PCI Express Root Bus,
> libvirt will need a way to tell which one is which.
> 
> Version checks are pretty much out of the question, as they
> fail as soon as downstream releases enter the picture. A
> few ways we could deal with the situation:
> 
>   1) switch to PCI Express on newer machine types, and
>      expose some sort of capability through QMP so that
>      libvirt can know about the switch
> 
>   2) switch between legacy PCI and PCI Express based on a
>      machine type option. libvirt would be able to find out
>      whether the option is available or not, and default to
>      either
> 
>        <controller type='pci' model='pci-root'/>
> 
>      or
> 
>        <controller type='pci' model='pcie-root'/>
> 
>      based on that. In order to support multiple PHBs
>      properly, those would have to be switchable with an
>      option as well
> 
>   3) create an entirely new machine type, eg. pseries-pcie
>      or whatever someone with the ability to come up with
>      decent names can suggest :) That would make ppc64
>      similar to x86, where i440fx and q35 have different
>      root buses. libvirt would learn about the new machine
>      type, know that it has a PCI Express Root Bus, and
>      behave accordingly
> 
> Option 1) would break horribly with existing libvirt
> versions, and so would Option 2) if we default to using


How exactly will 1) break libvirt? Migrating from pseries-2.7 to
pseries-2.8 does not work anyway, and machines are allowed to behave
differently from version to version, so what distinct difference will
using "pseries-pcie-X.Y" make? I believe that once we introduce the very
first pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.



> PCI Express. Option 2) defaulting to legacy PCI and
> option 3) would work just fine with existing libvirt
> versions AFAICT, but wouldn't of course expose the new
> capabilities.
> 
> Option 3) is probably the one that will be less confusing
> to users; we might even decide to take the chance and fix
> other small annoyances with the current pseries machine
> type, if there's any. On the other hand, it might very well
> be considered to be too big a hammer for such a small nail.



-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-17  2:02           ` Alexey Kardashevskiy
@ 2016-11-18  6:11             ` David Gibson
  2016-11-18  8:17             ` Andrea Bolognani
  1 sibling, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-11-18  6:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Andrea Bolognani, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth


On Thu, Nov 17, 2016 at 01:02:57PM +1100, Alexey Kardashevskiy wrote:
> On 16/11/16 01:02, Andrea Bolognani wrote:
> > On Tue, 2016-11-01 at 13:46 +1100, David Gibson wrote:
> >> On Mon, Oct 31, 2016 at 03:10:23PM +1100, Alexey Kardashevskiy wrote:
> >>>  
> >>> On 31/10/16 13:53, David Gibson wrote:
> >>>>  
> >>>> On Fri, Oct 28, 2016 at 12:07:12PM +0200, Greg Kurz wrote:
> >>>>>  
> >>>>> On Fri, 28 Oct 2016 18:56:40 +1100
> >>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>  
> >>>>>>  
> >>>>>> At the moment sPAPR PHB creates a root bus of TYPE_PCI_BUS type.
> >>>>>> This means that vfio-pci devices attached to it (and this is
> >>>>>> the default behaviour) hide PCIe extended capabilities, as
> >>>>>> the bus does not pass the pci_bus_is_express(pdev->bus) check.
> >>>>>>
> >>>>>> This change adds a PCI bus type property to sPAPR PHB
> >>>>>> and uses TYPE_PCIE_BUS if none is passed; older machines get TYPE_PCI_BUS
> >>>>>> for backward compatibility. As the bus type is used in the bus name,
> >>>>>> the root bus name becomes "pcie.0" instead of "pci.0".
> >>>>>>
> >>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>> ---
> >>>>>>
> >>>>>> What can possibly go wrong with such a change of name?
> >>>>>> From the devices' perspective, I cannot see any problem.
> >>>>>>
> >>>>>> libvirt might get upset as "pci.0" will not be available;
> >>>>>> would it make sense to create pcie.0 as the root bus and always
> >>>>>> add a PCIe->PCI bridge and name its bus "pci.0"?
> >>>>>>
> >>>>>> Or create the root bus from TYPE_PCIE_BUS and force its name to "pci.0"?
> >>>>>> pci_register_bus() can do this.
> >>>>>>  
> >>>>>>  
> >>>>>> ---
> >>>>>>   hw/ppc/spapr.c              | 5 +++++
> >>>>>>   hw/ppc/spapr_pci.c          | 5 ++++-
> >>>>>>   include/hw/pci-host/spapr.h | 1 +
> >>>>>>   3 files changed, 10 insertions(+), 1 deletion(-)
> >>>>>>  
> >>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>>>> index 0b3820b..a268511 100644
> >>>>>> --- a/hw/ppc/spapr.c
> >>>>>> +++ b/hw/ppc/spapr.c
> >>>>>> @@ -2541,6 +2541,11 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
> >>>>>>           .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
> >>>>>>           .property = "mem64_win_size",               \
> >>>>>>           .value    = "0",                            \
> >>>>>> +    },                                              \
> >>>>>> +    {                                               \
> >>>>>> +        .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,     \
> >>>>>> +        .property = "root_bus_type",                \
> >>>>>> +        .value    = TYPE_PCI_BUS,                   \
> >>>>>>       },
> >>>>>>   
> >>>>>>   static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
> >>>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>>>>> index 7cde30e..2fa1f22 100644
> >>>>>> --- a/hw/ppc/spapr_pci.c
> >>>>>> +++ b/hw/ppc/spapr_pci.c
> >>>>>> @@ -1434,7 +1434,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>>>>       bus = pci_register_bus(dev, NULL,
> >>>>>>                              pci_spapr_set_irq, pci_spapr_map_irq, sphb,
> >>>>>>                              &sphb->memspace, &sphb->iospace,
> >>>>>> -                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
> >>>>>> +                           PCI_DEVFN(0, 0), PCI_NUM_PINS,
> >>>>>> +                           sphb->root_bus_type ? sphb->root_bus_type :
> >>>>>> +                           TYPE_PCIE_BUS);
> >>>>>  
> >>>>> Shouldn't we ensure that sphb->root_bus_type is either TYPE_PCIE_BUS or
> >>>>> TYPE_PCI_BUS ?
> >>>>  
> >>>> Yes, I think so.  In fact, I think it would be better to make the
> >>>> property a boolean that just selects PCI-E, rather than this which
> >>>> exposes qemu (semi-)internal type names on the command line.
> >>>  
> >>> Sure, a "pcie-root" boolean property should do.
> >>>  
> >>> However, this is not my main concern; I rather wonder whether we have
> >>> to have pci.0 when we pick PCIe for the root.
> >>  
> >> Right.
> >>  
> >> I've added Andrea Bolognani to the CC list to get a libvirt perspective.
> > 
> > Thanks for doing so: changes such as this one can have quite
> > an impact on the upper layers of the stack, so the earlier
> > libvirt is involved in the discussion, the better.
> > 
> > I'm going to go a step further and cross-post to libvir-list
> > in order to give other libvirt contributors a chance to chime
> > in too.
> > 
> >> Andrea,
> >>  
> >> To summarise the issue here:
> >>     * As I've said before the PAPR spec kinda-sorta abstracts the
> >>       difference between vanilla PCI and PCI-E
> >>     * However, because within qemu we're declaring the bus as PCI that
> >>       means some PCI-E devices aren't working right
> >>     * In particular it means that PCI-E extended config space isn't
> >>       available
> >>  
> >> The proposal is to change (on newer machine types) the spapr PHB code
> >> to declare a PCI-E bus instead.  AIUI this still won't make the root
> >> complex guest visible (which it's not supposed to be under PAPR), and
> >> the guest shouldn't see a difference in most cases - it will still see
> >> the PAPR abstracted PCIish bus, but will now be able to get extended
> >> config space.
> >>  
> >> The possible problem from a libvirt perspective is that doing this in
> >> the simplest way in qemu would change the name of the default bus from
> >> pci.0 to pcie.0.  We have two suggested ways to mitigate this:
> >>     1) Automatically create a PCI-E to PCI bridge, so that new machine
> >>        types will have both a pcie.0 and pci.0 bus
> >>     2) Force the name of the bus to be pci.0, even though it's treated
> >>        as PCI-E in other ways.
> >>  
> >> We're trying to work out exactly what will and won't cause trouble for
> >> libvirt.
> > 
> > Option 2) is definitely a no-no, as we don't want to be piling
> > up even more hacks and architecture-specific code: the PCI
> > Express Root Bus should be called pcie.0, just as it is on q35
> > and mach-virt machine types.
> > 
> > Option 1) doesn't look too bad, but devices that are added
> > automatically by QEMU are an issue since we need to hardcode
> > knowledge of them into libvirt if we want the rest of the PCI
> > address allocation logic to handle them correctly.
> > 
> > Moreover libvirt now has the ability of building a legacy PCI
> > topology without user intervention, if needed to plug in
> > legacy devices, on machines that have a PCI Express Root Bus,
> > which makes the additional bridge fully redundant...
> > 
> > ... or at least it would, if we actually had a proper
> > PCIe-to-PCI bridge; AFAIK, though, the closest we have is the
> > i82801b11-bridge that is Intel-specific despite having so far
> > been abused as a generic PCIe-to-PCI bridge. I'm not even
> > sure whether it would work at all on ppc64.
> > 
> > Moving from legacy PCI to PCI Express would definitely be an
> > improvement, in my opinion. As mentioned, that's already the
> > case for at least two other architectures, so the more we can
> > standardize on that, the better.
> > 
> > That said, considering that a big part of the PCI address
> > allocation logic is based on whether the specific machine
> > type exposes a legacy PCI Root Bus or a PCI Express Root Bus,
> > libvirt will need a way to tell which one is which.
> > 
> > Version checks are pretty much out of the question, as they
> > fail as soon as downstream releases enter the picture. A
> > few ways we could deal with the situation:
> > 
> >   1) switch to PCI Express on newer machine types, and
> >      expose some sort of capability through QMP so that
> >      libvirt can know about the switch
> > 
> >   2) switch between legacy PCI and PCI Express based on a
> >      machine type option. libvirt would be able to find out
> >      whether the option is available or not, and default to
> >      either
> > 
> >        <controller type='pci' model='pci-root'/>
> > 
> >      or
> > 
> >        <controller type='pci' model='pcie-root'/>
> > 
> >      based on that. In order to support multiple PHBs
> >      properly, those would have to be switchable with an
> >      option as well
> > 
> >   3) create an entirely new machine type, eg. pseries-pcie
> >      or whatever someone with the ability to come up with
> >      decent names can suggest :) That would make ppc64
> >      similar to x86, where i440fx and q35 have different
> >      root buses. libvirt would learn about the new machine
> >      type, know that it has a PCI Express Root Bus, and
> >      behave accordingly
> > 
> > Option 1) would break horribly with existing libvirt
> > versions, and so would Option 2) if we default to using
> 
> 
> How exactly will 1) break libvirt? Migrating from pseries-2.7 to
> pseries-2.8 does not work anyway, and machines are allowed to behave
> differently from version to version, so what distinct difference will
> using "pseries-pcie-X.Y" make? I believe that once we introduce the very
> first pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.

IIUC, it's because when libvirt wants to attach a PCI device, it will
just try to attach it to bus pci.0, which will no longer exist.
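
For example (illustrative monitor session; "Bus '...' not found" is
QEMU's standard report for a missing bus):

  (qemu) device_add virtio-net-pci,id=net0,bus=pci.0
  Bus 'pci.0' not found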

> > PCI Express. Option 2) defaulting to legacy PCI and
> > option 3) would work just fine with existing libvirt
> > versions AFAICT, but wouldn't of course expose the new
> > capabilities.
> > 
> > Option 3) is probably the one that will be less confusing
> > to users; we might even decide to take the chance and fix
> > other small annoyances with the current pseries machine
> > type, if there's any. On the other hand, it might very well
> > be considered to be too big a hammer for such a small nail.
> 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-17  2:02           ` Alexey Kardashevskiy
  2016-11-18  6:11             ` David Gibson
@ 2016-11-18  8:17             ` Andrea Bolognani
  2016-11-21  2:12               ` Alexey Kardashevskiy
  2016-11-23  5:00               ` [Qemu-devel] " David Gibson
  1 sibling, 2 replies; 38+ messages in thread
From: Andrea Bolognani @ 2016-11-18  8:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: Greg Kurz, Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	libvir-list, Michael Roth

On Thu, 2016-11-17 at 13:02 +1100, Alexey Kardashevskiy wrote:
> > That said, considering that a big part of the PCI address
> > allocation logic is based on whether the specific machine
> > type exposes a legacy PCI Root Bus or a PCI Express Root Bus,
> > libvirt will need a way to tell which one is which.
> > 
> > Version checks are pretty much out of the question, as they
> > fail as soon as downstream releases enter the picture. A
> > few ways we could deal with the situation:
> > 
> >   1) switch to PCI Express on newer machine types, and
> >      expose some sort of capability through QMP so that
> >      libvirt can know about the switch

[...]
> > Option 1) would break horribly with existing libvirt
> > versions, and so would Option 2) if we default to using
> 
> How exactly will 1) break libvirt? Migrating from pseries-2.7 to
> pseries-2.8 does not work anyway, and machines are allowed to behave
> differently from version to version, so what distinct difference will
> using "pseries-pcie-X.Y" make?

Existing libvirt versions assume that pseries guests have
a legacy PCI root bus, and will base their PCI address
allocation / PCI topology decisions on that fact: they
will, for example, use legacy PCI bridges.
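
(Illustrative libvirt XML: on a legacy PCI machine the allocator adds
controllers like

  <controller type='pci' model='pci-bridge'/>

whereas on a PCI Express machine such as q35 it instead builds on

  <controller type='pci' model='dmi-to-pci-bridge'/>
  <controller type='pci' model='pci-bridge'/>

so guessing the root bus type wrong produces an invalid topology.)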

So if you used a new QEMU binary with a libvirt version
that doesn't know about the change, new guests would end up
using the wrong controllers. Existing guests would not be
affected as they would stick with the older machine types,
of course.

> I believe that once we introduce the very first
> pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.

Isn't i440fx still being updated despite the fact that q35
exists? Granted, there are a lot more differences between
those two machine types than just the root bus type.

Even if no newer pseries-x.y were to be added after
introducing pseries-pcie, you could still easily create
guests that use either root bus, so no loss in functionality.

-- 
Andrea Bolognani / Red Hat / Virtualization


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-18  8:17             ` Andrea Bolognani
@ 2016-11-21  2:12               ` Alexey Kardashevskiy
  2016-11-21 13:08                 ` Andrea Bolognani
  2016-11-23  5:00               ` [Qemu-devel] " David Gibson
  1 sibling, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-11-21  2:12 UTC (permalink / raw)
  To: Andrea Bolognani, David Gibson
  Cc: Greg Kurz, Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	libvir-list, Michael Roth

On 18/11/16 19:17, Andrea Bolognani wrote:
> On Thu, 2016-11-17 at 13:02 +1100, Alexey Kardashevskiy wrote:
>>> That said, considering that a big part of the PCI address
>>> allocation logic is based off whether the specific machine
>>> type exposes a legacy PCI Root Bus or a PCI Express Root Bus,
>>> libvirt will need a way to be able to tell which one is which.
>>>  
>>> Version checks are pretty much out of the question, as they
>>> fail as soon as downstream releases enter the picture. A
>>> few ways we could deal with the situation:
>>>  
>>>    1) switch to PCI Express on newer machine types, and
>>>       expose some sort of capability through QMP so that
>>>       libvirt can know about the switch
> 
> [...]
>>> Option 1) would break horribly with existing libvirt
>>> versions, and so would Option 2) if we default to using
>>  
>> How exactly will 1) break libvirt? Migrating from pseries-2.7 to
>> pseries-2.8 does not work anyway, and machines are allowed to behave
>> differently from version to version, so what difference will using
>> "pseries-pcie-X.Y" make?
> 
> Existing libvirt versions assume that pseries guests have

If libvirt is using just "pseries" (without a version), then having a
virtual PCIe-PCI bridge (and "pci.0" always available by default) will do it.
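Roughly like this (a sketch only, not a finished patch - the bridge
model name and the bus-naming plumbing are hand-waved here):

    /* Root bus is PCIe... */
    bus = pci_register_bus(dev, "pcie.0",
                           pci_spapr_set_irq, pci_spapr_map_irq, sphb,
                           &sphb->memspace, &sphb->iospace,
                           PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCIE_BUS);

    /* ...but the machine always plugs a PCIe-to-PCI bridge into it,
     * and that bridge's secondary bus gets named "pci.0".
     * "pcie-pci-bridge" stands in for whatever bridge model we
     * actually end up using. */
    pci_create_simple(bus, PCI_DEVFN(1, 0), "pcie-pci-bridge");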

If libvirt is using a specific version of pseries, then it already knows
that <=2.7 has pci.0 as the root and pcie.0 otherwise. libvirt knows
what QEMU version has what, right?

In what scenario will an additional machine type help?

> a legacy PCI root bus, and will base their PCI address
> allocation / PCI topology decisions on that fact: they
> will, for example, use legacy PCI bridges.
> 
> So if you used a new QEMU binary with a libvirt version
> that doesn't know about the change, new guests would end up
> using the wrong controllers. Existing guests would not be
> affected as they would stick with the older machine types,
> of course.
> 
>> I believe that after we introduce the very first
>> pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
> 
> Isn't i440fx still being updated despite the fact that q35
> exists? Granted, there are a lot more differences between
> those two machine types than just the root bus type.

I do not know about i440<->q35 but in pseries the difference is going to be
very simple.

For example, we did not change the machine type when we switched from
the default OHCI to XHCI; switching from PCI to PCIe does not sound like
it needs a whole new machine type for this either.

> Even if no newer pseries-x.y were to be added after
> introducing pseries-pcie, you could still easily create
> guests that use either root bus, so no loss in functionality.

I could do this with the existing pseries if the machine had a "root bus
type" property.


-- 
Alexey

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-21  2:12               ` Alexey Kardashevskiy
@ 2016-11-21 13:08                 ` Andrea Bolognani
  2016-11-22  2:26                   ` Alexey Kardashevskiy
  2016-11-22 14:07                   ` [Qemu-devel] [libvirt] " Eric Blake
  0 siblings, 2 replies; 38+ messages in thread
From: Andrea Bolognani @ 2016-11-21 13:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: Greg Kurz, Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	libvir-list, Michael Roth

On Mon, 2016-11-21 at 13:12 +1100, Alexey Kardashevskiy wrote:
> > > >    1) switch to PCI Express on newer machine types, and
> > > >       expose some sort of capability through QMP so that
> > > >       libvirt can know about the switch
> > > > 
> > > > [...]
> > > > Option 1) would break horribly with existing libvirt
> > > > versions, and so would Option 2) if we default to using
> > > 
> > > How exactly will 1) break libvirt? Migrating from pseries-2.7 to
> > > pseries-2.8 does not work anyway, and machines are allowed to behave
> > > differently from version to version, so what difference will using
> > > "pseries-pcie-X.Y" make?
> > 
> > Existing libvirt versions assume that pseries guests have
> 
> If libvirt is using just "pseries" (without a version), then having a
> virtual PCIe-PCI bridge (and "pci.0" always available by default) will do it.

Please don't. Any device that is included in the guest
by default and can't be turned off makes libvirt's life
significantly more difficult (see below).

>> If libvirt is using a specific version of pseries, then it already knows
>> that <=2.7 has pci.0 as the root and pcie.0 otherwise. libvirt knows
>> what QEMU version has what, right?

It doesn't yet, that's the point :)

We *could* add such knowledge to libvirt[1], but *existing*
libvirt versions would still not know about it, which means
that upgrading QEMU without upgrading libvirt will result
in failure to create new guests.

> In what scenario will an additional machine type help?

Because then libvirt could learn that

  pseries-x.y       <->  pci.0
  pseries-pcie-x.y  <->  pcie.0

the same way it already knows that

  pc-i440fx-x.y     <->  pci.0
  pc-q35-x.y        <->  pcie.0

and choosing between one or the other would be, I think,
much easier for the user as well.
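To give a rough idea of the kind of lookup libvirt would end up
performing (illustrative C only, not actual libvirt code):

    #include <stdbool.h>
    #include <string.h>

    /* Map a machine type name prefix to the kind of root bus it
     * exposes.  Longer prefixes must come first, so that
     * "pseries-pcie-2.8" is not swallowed by the "pseries" entry. */
    static const struct {
        const char *prefix;
        bool pcieRoot;
    } rootBusMap[] = {
        { "pc-q35",       true  },
        { "pc-i440fx",    false },
        { "pseries-pcie", true  },   /* the hypothetical new type */
        { "pseries",      false },
    };

    static bool machineHasPCIeRoot(const char *machine)
    {
        for (size_t i = 0; i < sizeof(rootBusMap) / sizeof(rootBusMap[0]); i++) {
            if (strncmp(machine, rootBusMap[i].prefix,
                        strlen(rootBusMap[i].prefix)) == 0)
                return rootBusMap[i].pcieRoot;
        }
        return false;   /* unknown machine: assume legacy PCI */
    }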

> > a legacy PCI root bus, and will base their PCI address
> > allocation / PCI topology decisions on that fact: they
> > will, for example, use legacy PCI bridges.
> > 
> > So if you used a new QEMU binary with a libvirt version
> > that doesn't know about the change, new guests would end up
> > using the wrong controllers. Existing guests would not be
> > affected as they would stick with the older machine types,
> > of course.
> > 
> > > I believe that after we introduce the very first
> > > pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
> > 
> > Isn't i440fx still being updated despite the fact that q35
> > exists? Granted, there are a lot more differences between
> > those two machine types than just the root bus type.
> 
> I do not know about i440<->q35 but in pseries the difference is going to be
> very simple.
> 
> For example, we did not change the machine type when we switched from
> the default OHCI to XHCI; switching from PCI to PCIe does not sound like
> it needs a whole new machine type for this either.

The change from OHCI to XHCI only affected the *default* USB
controller, which libvirt tries its best not to use anyway:
instead, it will prefer to use '-M ...,usb=off' along with
'-device ...' and set both the controller model and its PCI
address explicitly, partially to shield its users from such
changes in QEMU.
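For a pseries guest that means a command line along the lines of
(ids and addresses are illustrative):

    -machine pseries,usb=off \
    -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x3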

Let's not forget that libvirt is a management layer, and as
such it needs to have as much control as possible over the
virtual hardware presented to the guest, mostly for guest ABI
compatibility purposes. Defaulting to XHCI instead of OHCI,
same as pretty much all other defaults, is good for people who
run QEMU directly, but libvirt needs to be able to control
such knobs itself in order to effectively perform the task it
was designed for.

Moreover, we're talking about a more fundamental change here:
the PCI Root Bus is not just any random device, it's the one
foundation upon which the entire PCI hierarchy is built. Based
on whether the machine exposes a PCI Express Root Bus or a
legacy PCI Root Bus, libvirt will create very
different PCI hierarchies, eg. using several ioh3420 devices
instead of a single pci-bridge device.
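Simplified, the two resulting topologies look like this:

    pcie.0 -> ioh3420 -> device                 (one root port per
    pcie.0 -> ioh3420 -> device                  hotpluggable device)

    pci.0 -> pci-bridge -> device, device, ...  (one bridge, many devices)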

I'm still not sure if that's enough to warrant an entirely new
machine type, but it definitely has a more far-reaching impact
than simply flipping the default USB controller from OHCI to
XHCI.

> > Even if no newer pseries-x.y were to be added after
> > introducing pseries-pcie, you could still easily create
> > guests that use either root bus, so no loss in functionality.
> 
>> I could do this with the existing pseries if the machine had a "root bus
>> type" property.

That was indeed one of my original proposals ;)

Just note, once again, that the default for this property
would have to be "off" or we would run into the same issues
described above.


[1] Even though, as I mentioned earlier in the thread,
    performing version checks on machine types is frowned
    upon as it turns into a minefield as soon as backports
    and downstream machine types enter the picture... It's
    much better to expose some flag through QMP for libvirt
    to introspect
-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-21 13:08                 ` Andrea Bolognani
@ 2016-11-22  2:26                   ` Alexey Kardashevskiy
  2016-11-23  5:02                     ` David Gibson
  2016-11-22 14:07                   ` [Qemu-devel] [libvirt] " Eric Blake
  1 sibling, 1 reply; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-11-22  2:26 UTC (permalink / raw)
  To: Andrea Bolognani, David Gibson
  Cc: Greg Kurz, Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	libvir-list, Michael Roth

On 22/11/16 00:08, Andrea Bolognani wrote:
> On Mon, 2016-11-21 at 13:12 +1100, Alexey Kardashevskiy wrote:
>>>>>     1) switch to PCI Express on newer machine types, and
>>>>>        expose some sort of capability through QMP so that
>>>>>        libvirt can know about the switch
>>>>>  
>>>>> [...]
>>>>> Option 1) would break horribly with existing libvirt
>>>>> versions, and so would Option 2) if we default to using
>>>>  
>>>> How exactly will 1) break libvirt? Migrating from pseries-2.7 to
>>>> pseries-2.8 does not work anyway, and machines are allowed to behave
>>>> differently from version to version, so what difference will using
>>>> "pseries-pcie-X.Y" make?
>>>  
>>> Existing libvirt versions assume that pseries guests have
>>  
>> If libvirt is using just "pseries" (without a version), then having a
>> virtual PCIe-PCI bridge (and "pci.0" always available by default) will do it.
> 
> Please don't. Any device that is included in the guest
> by default and can't be turned off makes libvirt's life
> significantly more difficult (see below).
>
>> If libvirt is using a specific version of pseries, then it already knows
>> that <=2.7 has pci.0 as the root and pcie.0 otherwise. libvirt knows
>> what QEMU version has what, right?
> 
> It doesn't yet, that's the point :)
> 
> We *could* add such knowledge to libvirt[1], but *existing*
> libvirt versions would still not know about it, which means
> that upgrading QEMU without upgrading libvirt will result
> in failure to create new guests.
>
> 
>> In what scenario will an additional machine type help?
> 
> Because then libvirt could learn that
> 
>   pseries-x.y       <->  pci.0
>   pseries-pcie-x.y  <->  pcie.0
> 
> the same way it already knows that
> 
>   pc-i440fx-x.y     <->  pci.0
>   pc-q35-x.y        <->  pcie.0
> 
> and choosing between one or the other would be, I think,
> much easier for the user as well.
>
>>> a legacy PCI root bus, and will base their PCI address
>>> allocation / PCI topology decisions on that fact: they
>>> will, for example, use legacy PCI bridges.
>>>  
>>> So if you used a new QEMU binary with a libvirt version
>>> that doesn't know about the change, new guests would end up
>>> using the wrong controllers. Existing guests would not be
>>> affected as they would stick with the older machine types,
>>> of course.
>>>  
>>>> I believe that after we introduce the very first
>>>> pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
>>>  
>>> Isn't i440fx still being updated despite the fact that q35
>>> exists? Granted, there are a lot more differences between
>>> those two machine types than just the root bus type.
>>  
>> I do not know about i440<->q35 but in pseries the difference is going to be
>> very simple.
>>  
>> For example, we did not change the machine type when we switched from
>> the default OHCI to XHCI; switching from PCI to PCIe does not sound like
>> it needs a whole new machine type for this either.
> 
> The change from OHCI to XHCI only affected the *default* USB
> controller, which libvirt tries its best not to use anyway:
> instead, it will prefer to use '-M ...,usb=off' along with
> '-device ...' and set both the controller model and its PCI
> address explicitly, partially to shield its users from such
> changes in QEMU.


Ok. I always liked this approach, really. We should start moving in this
direction with the PHB: stop adding the default PHB at all when -nodefaults is
passed (or -machine pseries,pci=off ?) and let libvirt manage PHBs itself
(and provide another spapr-phb type like spapr-pcie-host-bridge, or add a
"pcie_root_bus_type" property to the existing PHB type).

What will be wrong with this approach?



> 
> Let's not forget that libvirt is a management layer, and as
> such it needs to have as much control as possible over the
> virtual hardware presented to the guest, mostly for guest ABI
> compatibility purposes. Defaulting to XHCI instead of OHCI,
> same as pretty much all other defaults, is good for people who
> run QEMU directly, but libvirt needs to be able to control
> such knobs itself in order to effectively perform the task it
> was designed for.
> 
> Moreover, we're talking about a more fundamental change here:
> the PCI Root Bus is not just any random device, it's the one
> foundation upon which the entire PCI hierarchy is built. Based
> on whether the machine exposes a PCI Express Root Bus or a
> legacy PCI Root Bus, libvirt will create very
> different PCI hierarchies, eg. using several ioh3420 devices
> instead of a single pci-bridge device.
> 
> I'm still not sure if that's enough to warrant an entirely new
> machine type, but it definitely has a more far-reaching impact
> than simply flipping the default USB controller from OHCI to
> XHCI.
> 
>>> Even if no newer pseries-x.y were to be added after
>>> introducing pseries-pcie, you could still easily create
>>> guests that use either root bus, so no loss in functionality.
>>  
>> I could do this with the existing pseries if the machine had a "root bus
>> type" property.
> 
> That was indeed one of my original proposals ;)
> 
> Just note, once again, that the default for this property
> would have to be "off" or we would run into the same issues
> described above.
> 
> 
> [1] Even though, as I mentioned earlier in the thread,
>     performing version checks on machine types is frowned
>     upon as it turns into a minefield as soon as backports
>     and downstream machine types enter the picture... It's
>     much better to expose some flag through QMP for libvirt
>     to introspect
> -- 
> Andrea Bolognani / Red Hat / Virtualization
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [libvirt] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-21 13:08                 ` Andrea Bolognani
  2016-11-22  2:26                   ` Alexey Kardashevskiy
@ 2016-11-22 14:07                   ` Eric Blake
  1 sibling, 0 replies; 38+ messages in thread
From: Eric Blake @ 2016-11-22 14:07 UTC (permalink / raw)
  To: Andrea Bolognani, Alexey Kardashevskiy, David Gibson
  Cc: Alex, libvir-list, Greg Kurz, qemu-devel, qemu-ppc, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 957 bytes --]

On 11/21/2016 07:08 AM, Andrea Bolognani wrote:

>> If libvirt is using a specific version of pseries, then it already knows
>> that <=2.7 has pci.0 as the root and pcie.0 otherwise. libvirt knows
>> what QEMU version has what, right?
> 
> It doesn't yet, that's the point :)
> 
> We *could* add such knowledge to libvirt[1], but *existing*
> libvirt versions would still not know about it, which means
> that upgrading QEMU without upgrading libvirt will result
> in failure to create new guests.

But that's okay.  In general, we try to promise that:

old qemu + old libvirt works
new qemu + new libvirt works
old qemu + new libvirt works
new qemu + old libvirt is untested, and may require a libvirt upgrade

In other words, as long as you update your stack from top down (libvirt
before qemu), you should be fine.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-18  8:17             ` Andrea Bolognani
  2016-11-21  2:12               ` Alexey Kardashevskiy
@ 2016-11-23  5:00               ` David Gibson
  2016-11-25 13:46                 ` Andrea Bolognani
  1 sibling, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-11-23  5:00 UTC (permalink / raw)
  To: Andrea Bolognani
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth

[-- Attachment #1: Type: text/plain, Size: 2862 bytes --]

On Fri, Nov 18, 2016 at 09:17:22AM +0100, Andrea Bolognani wrote:
> On Thu, 2016-11-17 at 13:02 +1100, Alexey Kardashevskiy wrote:
> > > That said, considering that a big part of the PCI address
> > > allocation logic is based off whether the specific machine
> > > type exposes a legacy PCI Root Bus or a PCI Express Root Bus,
> > > libvirt will need a way to be able to tell which one is which.
> > > 
> > > Version checks are pretty much out of the question, as they
> > > fail as soon as downstream releases enter the picture. A
> > > few ways we could deal with the situation:
> > > 
> > >   1) switch to PCI Express on newer machine types, and
> > >      expose some sort of capability through QMP so that
> > >      libvirt can know about the switch
> 
> [...]
> > > Option 1) would break horribly with existing libvirt
> > > versions, and so would Option 2) if we default to using
> > 
> > How exactly will 1) break libvirt? Migrating from pseries-2.7 to
> > pseries-2.8 does not work anyway, and machines are allowed to behave
> > differently from version to version, so what difference will using
> > "pseries-pcie-X.Y" make?
> 
> Existing libvirt versions assume that pseries guests have
> a legacy PCI root bus, and will base their PCI address
> allocation / PCI topology decisions on that fact: they
> will, for example, use legacy PCI bridges.

Um.. yeah.. trouble is libvirt's PCI-E address allocation probably
won't work for spapr PCI-E either, because of the weird PCI-E without
root complex presentation we get in PAPR.

> So if you used a new QEMU binary with a libvirt version
> that doesn't know about the change, new guests would end up
> using the wrong controllers. Existing guests would not be
> affected as they would stick with the older machine types,
> of course.
> 
> > I believe that after we introduce the very first
> > pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
> 
> Isn't i440fx still being updated despite the fact that q35
> exists? Granted, there are a lot more differences between
> those two machine types than just the root bus type.

Right, there are heaps of differences between i440fx and q35, and
reasons to keep both updated.  For pseries we have neither the impetus
nor the resources to maintain two different machine type variants,
where the only difference is between legacy PCI and weirdly presented
PCI-E.

> Even if no newer pseries-x.y were to be added after
> introducing pseries-pcie, you could still easily create
> guests that use either root bus, so no loss in functionality.
> 
> -- 
> Andrea Bolognani / Red Hat / Virtualization
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-22  2:26                   ` Alexey Kardashevskiy
@ 2016-11-23  5:02                     ` David Gibson
  2016-11-25 14:36                       ` Andrea Bolognani
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-11-23  5:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Andrea Bolognani, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth

[-- Attachment #1: Type: text/plain, Size: 4566 bytes --]

On Tue, Nov 22, 2016 at 01:26:49PM +1100, Alexey Kardashevskiy wrote:
> On 22/11/16 00:08, Andrea Bolognani wrote:
> > On Mon, 2016-11-21 at 13:12 +1100, Alexey Kardashevskiy wrote:
> >>>>>     1) switch to PCI Express on newer machine types, and
> >>>>>        expose some sort of capability through QMP so that
> >>>>>        libvirt can know about the switch
> >>>>>  
> >>>>> [...]
> >>>>> Option 1) would break horribly with existing libvirt
> >>>>> versions, and so would Option 2) if we default to using
> >>>>  
> >>>> How exactly will 1) break libvirt? Migrating from pseries-2.7 to
> >>>> pseries-2.8 does not work anyway, and machines are allowed to behave
> >>>> differently from version to version, so what difference will using
> >>>> "pseries-pcie-X.Y" make?
> >>>  
> >>> Existing libvirt versions assume that pseries guests have
> >>  
> >> If libvirt is using just "pseries" (without a version), then having a
> >> virtual PCIe-PCI bridge (and "pci.0" always available by default) will do it.
> > 
> > Please don't. Any device that is included in the guest
> > by default and can't be turned off makes libvirt's life
> > significantly more difficult (see below).
> >
> >> If libvirt is using a specific version of pseries, then it already knows
> >> that <=2.7 has pci.0 as the root and pcie.0 otherwise. libvirt knows
> >> what QEMU version has what, right?
> > 
> > It doesn't yet, that's the point :)
> > 
> > We *could* add such knowledge to libvirt[1], but *existing*
> > libvirt versions would still not know about it, which means
> > that upgrading QEMU without upgrading libvirt will result
> > in failure to create new guests.
> >
> > 
> >> In what scenario will an additional machine type help?
> > 
> > Because then libvirt could learn that
> > 
> >   pseries-x.y       <->  pci.0
> >   pseries-pcie-x.y  <->  pcie.0
> > 
> > the same way it already knows that
> > 
> >   pc-i440fx-x.y     <->  pci.0
> >   pc-q35-x.y        <->  pcie.0
> > 
> > and choosing between one or the other would be, I think,
> > much easier for the user as well.
> >
> >>> a legacy PCI root bus, and will base their PCI address
> >>> allocation / PCI topology decisions on that fact: they
> >>> will, for example, use legacy PCI bridges.
> >>>  
> >>> So if you used a new QEMU binary with a libvirt version
> >>> that doesn't know about the change, new guests would end up
> >>> using the wrong controllers. Existing guests would not be
> >>> affected as they would stick with the older machine types,
> >>> of course.
> >>>  
> >>>> I believe that after we introduce the very first
> >>>> pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
> >>>  
> >>> Isn't i440fx still being updated despite the fact that q35
> >>> exists? Granted, there are a lot more differences between
> >>> those two machine types than just the root bus type.
> >>  
> >> I do not know about i440<->q35 but in pseries the difference is going to be
> >> very simple.
> >>  
> >> For example, we did not change the machine type when we switched from
> >> the default OHCI to XHCI; switching from PCI to PCIe does not sound like
> >> it needs a whole new machine type for this either.
> > 
> > The change from OHCI to XHCI only affected the *default* USB
> > controller, which libvirt tries its best not to use anyway:
> > instead, it will prefer to use '-M ...,usb=off' along with
> > '-device ...' and set both the controller model and its PCI
> > address explicitly, partially to shield its users from such
> > changes in QEMU.
> 
> 
> Ok. I always liked this approach, really. We should start moving in this
> direction with the PHB: stop adding the default PHB at all when -nodefaults is
> passed (or -machine pseries,pci=off ?) and let libvirt manage PHBs itself
> (and provide another spapr-phb type like spapr-pcie-host-bridge, or add a
> "pcie_root_bus_type" property to the existing PHB type).
> 
> What will be wrong with this approach?

Hm, that's a good point.  If we were removing the default PHB entirely,
then I would consider that a possible case for a new machine type.  Putting
construction of the PHBs into libvirt's hands could make life simpler
there, although it would make it a bit of a pain for people starting
qemu by hand.

I think this option is worth some thought.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-23  5:00               ` [Qemu-devel] " David Gibson
@ 2016-11-25 13:46                 ` Andrea Bolognani
  2016-12-02  4:18                   ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Andrea Bolognani @ 2016-11-25 13:46 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth

On Wed, 2016-11-23 at 16:00 +1100, David Gibson wrote:
> > Existing libvirt versions assume that pseries guests have
> > a legacy PCI root bus, and will base their PCI address
> > allocation / PCI topology decisions on that fact: they
> > will, for example, use legacy PCI bridges.
> 
> Um.. yeah.. trouble is libvirt's PCI-E address allocation probably
> won't work for spapr PCI-E either, because of the weird PCI-E without
> root complex presentation we get in PAPR.

So, would the PCIe Root Bus in a pseries guest behave
differently than the one in a q35 or mach-virt guest?

Does it have a different number of slots, do we have to
plug different controllers into them, ...?

Regardless of how we decide to move forward with the
PCIe-enabled pseries machine type, libvirt will have to
know about this so it can behave appropriately.

> > > I believe that after we introduce the very first
> > > pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
> > 
> > Isn't i440fx still being updated despite the fact that q35
> > exists? Granted, there are a lot more differences between
> > those two machine types than just the root bus type.
> 
> Right, there are heaps of differences between i440fx and q35, and
> reasons to keep both updated.  For pseries we have neither the impetus
> nor the resources to maintain two different machine type variants,
> where the only difference is between legacy PCI and weirdly presented
> PCI-E.

Calling the PCIe machine type either pseries-2.8 or
pseries-pcie-2.8 would result in the very same amount of
work, and in both cases it would be understood that the
legacy PCI machine type is no longer going to be updated,
but can still be used to run existing guests.

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-23  5:02                     ` David Gibson
@ 2016-11-25 14:36                       ` Andrea Bolognani
  2016-12-02  3:37                         ` David Gibson
  0 siblings, 1 reply; 38+ messages in thread
From: Andrea Bolognani @ 2016-11-25 14:36 UTC (permalink / raw)
  To: David Gibson, Alexey Kardashevskiy
  Cc: Greg Kurz, Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	libvir-list, Michael Roth

On Wed, 2016-11-23 at 16:02 +1100, David Gibson wrote:
> > > The change from OHCI to XHCI only affected the *default* USB
> > > controller, which libvirt tries its best not to use anyway:
> > > instead, it will prefer to use '-M ...,usb=off' along with
> > > '-device ...' and set both the controller model and its PCI
> > > address explicitly, partially to shield its users from such
> > > changes in QEMU.
> > 
> > Ok. I always liked this approach, really. We should start moving in this
> > direction with the PHB: stop adding the default PHB at all when -nodefaults is
> > passed (or -machine pseries,pci=off ?) and let libvirt manage PHBs itself
> > (and provide another spapr-phb type like spapr-pcie-host-bridge, or add a
> > "pcie_root_bus_type" property to the existing PHB type).
> > 
> > What will be wrong with this approach?
> 
> Hm, that's a good point.  If we were removing the default PHB entirely,
> then I would consider that a possible case for a new machine type.  Putting
> construction of the PHBs into libvirt's hands could make life simpler
> there, although it would make it a bit of a pain for people starting
> qemu by hand.
> 
> I think this option is worth some thought.

Note that libvirt always runs QEMU with -nodefaults, so we
could just remove the default PHB if that flag is present,
and leave it in if that's not the case.

The idea itself sounds good, but of course it will require
more work from the libvirt side than simply making the PCIe
machine type behave like q35 and mach-virt.

Moreover, we already have an RFE for supporting additional
PHBs, we could solve both issues in one fell swoop and have
the libvirt guest XML be a more faithful representation of
the actual virtual hardware, which is always a win in my
book.

That will be even more important if it turns out that the
rules for PCIe device assignment (eg. what device/controller
can be plugged into which slot) are different for PCIe PHBs
than they are for q35/mach-virt PCIe Root Bus. I've asked
for clarifications about this elsewhere in the thread.

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-25 14:36                       ` Andrea Bolognani
@ 2016-12-02  3:37                         ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-12-02  3:37 UTC (permalink / raw)
  To: Andrea Bolognani
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth

[-- Attachment #1: Type: text/plain, Size: 2807 bytes --]

On Fri, Nov 25, 2016 at 03:36:25PM +0100, Andrea Bolognani wrote:
> On Wed, 2016-11-23 at 16:02 +1100, David Gibson wrote:
> > > > The change from OHCI to XHCI only affected the *default* USB
> > > > controller, which libvirt tries its best not to use anyway:
> > > > instead, it will prefer to use '-M ...,usb=off' along with
> > > > '-device ...' and set both the controller model and its PCI
> > > > address explicitly, partially to shield its users from such
> > > > changes in QEMU.
> > > 
> > > Ok. I always liked this approach, really. We should start moving in this
> > > direction with the PHB: stop adding the default PHB at all when -nodefaults is
> > > passed (or -machine pseries,pci=off ?) and let libvirt manage PHBs itself
> > > (and provide another spapr-phb type like spapr-pcie-host-bridge, or add a
> > > "pcie_root_bus_type" property to the existing PHB type).
> > > 
> > > What will be wrong with this approach?
> > 
> > Hm, that's a good point.  If we were removing the default PHB entirely,
> > then I would consider that a possible case for a new machine type.  Putting
> > construction of the PHBs into libvirt's hands could make life simpler
> > there, although it would make it a bit of a pain for people starting
> > qemu by hand.
> > 
> > I think this option is worth some thought.
> 
> Note that libvirt always runs QEMU with -nodefaults, so we
> could just remove the default PHB if that flag is present,
> and leave it in if that's not the case.
> 
> The idea itself sounds good, but of course it will require
> more work from the libvirt side than simply making the PCIe
> machine type behave like q35 and mach-virt.

Yeah, but the latter basically just won't work.

> Moreover, we already have an RFE for supporting additional
> PHBs, we could solve both issues in one fell swoop and have
> the libvirt guest XML be a more faithful representation of
> the actual virtual hardware, which is always a win in my
> book.

Right.  And the general recommendation for PAPR guests is that each
passed through device (or rather, each passed through iommu group)
have a separate virtual PHB on the guest.  With this rework libvirt
could switch over to implementing that recommendation.
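I.e. something like this on the command line (the ids and the exact
bus naming are illustrative, don't take them literally):

    -device spapr-pci-host-bridge,index=1,id=phb1 \
    -device vfio-pci,host=0001:03:00.0,bus=phb1.0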

> That will be even more important if it turns out that the
> rules for PCIe device assignment (eg. what device/controller
> can be plugged into which slot) are different for PCIe PHBs
> than they are for q35/mach-virt PCIe Root Bus. I've asked
> for clarifications about this elsewhere in the thread.
> 
> -- 
> Andrea Bolognani / Red Hat / Virtualization
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-11-25 13:46                 ` Andrea Bolognani
@ 2016-12-02  4:18                   ` David Gibson
  2016-12-02  5:17                     ` Benjamin Herrenschmidt
                                       ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: David Gibson @ 2016-12-02  4:18 UTC (permalink / raw)
  To: Andrea Bolognani
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth, benh

[-- Attachment #1: Type: text/plain, Size: 7062 bytes --]

On Fri, Nov 25, 2016 at 02:46:21PM +0100, Andrea Bolognani wrote:
> On Wed, 2016-11-23 at 16:00 +1100, David Gibson wrote:
> > > Existing libvirt versions assume that pseries guests have
> > > a legacy PCI root bus, and will base their PCI address
> > > allocation / PCI topology decisions on that fact: they
> > > will, for example, use legacy PCI bridges.
> > 
> > Um.. yeah.. trouble is libvirt's PCI-E address allocation probably
> > won't work for spapr PCI-E either, because of the weird PCI-E without
> > root complex presentation we get in PAPR.
> 
> So, would the PCIe Root Bus in a pseries guest behave
> differently than the one in a q35 or mach-virt guest?

Yes.  I had a long discussion with BenH and got a somewhat better idea
about this.

If only a single host PE (== iommu group) is passed through and there
are no emulated devices, the difference isn't too bad: basically on
pseries you'll see the subtree that would be below the root complex on
q35.

But if you pass through multiple groups, things get weird.  On q35,
you'd generally expect physically separate (different slot) devices to
appear under separate root complexes.  Whereas on pseries they'll
appear as siblings on a virtual bus (which makes no physical sense for
point-to-point PCI-E).

I suppose we could try treating all devices on pseries as though they
were chipset builtin devices on q35, which will appear on the root
PCI-E bus without root complex.  But I suspect that's likely to cause
trouble with hotplug, and it will certainly need different address
allocation from libvirt.

> Does it have a different number of slots, do we have to
> plug different controllers into them, ...?
> 
> Regardless of how we decide to move forward with the
> PCIe-enabled pseries machine type, libvirt will have to
> know about this so it can behave appropriately.

So there are kind of two extremes of how to address this.  There are a
variety of options in between, but I suspect they're going to be even
more muddled and hideous than the extremes.

1) Give up.  You said there's already a flag that says a PCI-E bus is
able to accept vanilla-PCI devices.  We add a hack flag that says a
vanilla-PCI bus is able to accept PCI-E devices.  We keep address
allocation as it is now - the pseries topology really does resemble
vanilla-PCI much better than it does PCI-E. But, we allow PCI-E
devices, and PAPR has mechanisms for accessing the extended config
space.  PCI-E standard hotplug and error reporting will never work,
but PAPR provides its own mechanisms for those, so that should be ok.
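In code terms I imagine option (1) as something as dumb as this
(the field and helper names are invented for illustration):

    /* Hypothetical PCIBus flag: a legacy PCI bus which explicitly
     * tolerates PCI-E devices - the mirror image of the existing
     * "PCI-E bus accepts vanilla-PCI devices" flag. */
    bool allows_express;

    static bool pci_bus_accepts_device(PCIBus *bus, PCIDevice *dev)
    {
        if (pci_is_express(dev)) {
            return pci_bus_is_express(bus) || bus->allows_express;
        }
        return true;
    }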

2) Start exposing the PCI-E hierarchy for pseries guests much more
like q35, root complexes and all.  It's not clear that PAPR actually
*forbids* exposing the root complex, it just doesn't require it and
that's not what PowerVM does.  But.. there are big questions about
whether existing guests will cope with this or not.  When you start
adding in multiple passed through devices and particularly virtual
functions as well, things could get very ugly - we might need to
construct multiple emulated virtual root complexes or other messes.

In the short to medium term, I'm thinking option (1) seems pretty
compelling.

> > > > I believe that after we introduce the very first
> > > > pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
> > > 
> > > Isn't i440fx still being updated despite the fact that q35
> > > exists? Granted, there are a lot more differences between
> > > those two machine types than just the root bus type.
> > 
> > Right, there are heaps of differences between i440fx and q35, and
> > reasons to keep both updated.  For pseries we have neither the impetus
> nor the resources to maintain two different machine type variants,
> > where the only difference is between legacy PCI and weirdly presented
> > PCI-E.
> 
> Calling the PCIe machine type either pseries-2.8 or
> pseries-pcie-2.8 would result in the very same amount of
> work, and in both cases it would be understood that the
> legacy PCI machine type is no longer going to be updated,
> but can still be used to run existing guests.

So, I'm not sure if the idea of a new machine type has legs or not,
but let's think it through a bit further.  Suppose we have a new
machine type, let's call it 'papr'.  I'm thinking it would be (at
least with -nodefaults) basically a super-minimal version of pseries:
so each PHB would have to be explicitly created, the VIO bridge would
have to be explicitly created, likewise the NVRAM.  Not sure about the
"devices" which really represent firmware features - the RTC, RNG,
hypervisor event source and so forth.
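With -nodefaults, starting such a machine might look something like
this (everything here is hypothetical, since none of it exists yet):

    qemu-system-ppc64 -machine papr -nodefaults \
        -device spapr-pci-host-bridge,index=0 \
        -device spapr-vio-bridge \
        -device spapr-nvram ...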

Might have some advantages.  Then again, it doesn't really solve the
specific problem here.  It means libvirt (or the user) has to
explicitly choose a PCI or PCI-E PHB to put things on, but libvirt's
PCI-E address allocation will still be wrong in all probability.

Guh.


As an aside, here's a RANT.

libvirt address allocation.  Seriously, W. T. F!

libvirt insists on owning address allocation.  That's so it can
recreate the exact same machine at the far end of a migration.  So far
so good, except it insists on recording that information in the domain
XML in kinda-sorta-but-not-really back end independent form.  But the
thing is libvirt fundamentally CAN NOT get this right.  There are all
sorts of possible machine specific address allocation constraints that
can exist - from simply which devices are already created by default
(for varying values of "default") to complicated constraints depending
on details of board wiring.  The back end has to know about these - it
implements them.  The ONLY way libvirt can get this (temporarily)
right is by duplicating a huge chunk of the back end's allocation
logic, which will inevitably get out of date causing problems just
like this.

Basically the back end will *always* have better information about how
to place devices than libvirt can.  So, libvirt should be allowing the
back end to do the allocation, then snapshotting that in a back end
specific format which can be used for creating migration
destinations. But that breaks libvirt's the-domain-XML-is-everything
model.

In this regard libvirt doesn't just have a design flaw, it has design
flaws which breed more design flaws like pestilent cancer.  And what's
worse the consequences of those design flaws are now making sane
design decisions increasingly difficult in adjacent projects like
qemu.

I'd feel better about this if there seemed to be some recognition of
it, and some necessarily long term plan to improve it, but if there is
I haven't heard of it.  Or at least the closest thing seems to be
coming from the qemu side (witness Eduardo's talk at the last KVM
forum, and mine at the one before).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-02  4:18                   ` David Gibson
@ 2016-12-02  5:17                     ` Benjamin Herrenschmidt
  2016-12-02  5:50                       ` David Gibson
  2016-12-05 19:06                     ` [Qemu-devel] [libvirt] " Laine Stump
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2016-12-02  5:17 UTC (permalink / raw)
  To: David Gibson, Andrea Bolognani
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth

On Fri, 2016-12-02 at 15:18 +1100, David Gibson wrote:
> But if you pass through multiple groups, things get weird.  On q35,
> you'd generally expect physically separate (different slot) devices to
> appear under separate root complexes.  Whereas on pseries they'll
> appear as siblings on a virtual bus (which makes no physical sense for
> point-to-point PCI-E).

It's also somewhat broken if they aren't in the same iommu domain
because the way we pass the iommu buid is via the parent node, so
a given iommu domain must reside below a common parent and not share
it.

> I suppose we could try treating all devices on pseries as though they
> were chipset builtin devices on q35, which will appear on the root
> PCI-E bus without root complex.  But I suspect that's likely to cause
> trouble with hotplug, and it will certainly need different address
> allocation from libvirt.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-02  5:17                     ` Benjamin Herrenschmidt
@ 2016-12-02  5:50                       ` David Gibson
  2016-12-02 21:41                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-12-02  5:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrea Bolognani, Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini,
	Alex Williamson, qemu-ppc, qemu-devel, libvir-list, Michael Roth

[-- Attachment #1: Type: text/plain, Size: 1418 bytes --]

On Fri, Dec 02, 2016 at 04:17:50PM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2016-12-02 at 15:18 +1100, David Gibson wrote:
> > But if you pass through multiple groups, things get weird.  On q35,
> > you'd generally expect physically separate (different slot) devices to
> > appear under separate root complexes.  Whereas on pseries they'll
> > appear as siblings on a virtual bus (which makes no physical sense for
> > point-to-point PCI-E).
> 
> It's also somewhat broken if they aren't in the same iommu domain
> because the way we pass the iommu buid is via the parent node, so
> a given iommu domain must reside below a common parent and not share
> it.

Uh.. I don't entirely follow you.  From the host point of view there
are multiple iommu groups (PEs), but from the guest point of view
there's only one.  On the guest side iommu granularity is always
per-vPHB.

> > I suppose we could try treating all devices on pseries as though they
> > were chipset builtin devices on q35, which will appear on the root
> > PCI-E bus without root complex.  But I suspect that's likely to cause
> > trouble with hotplug, and it will certainly need different address
> > allocation from libvirt.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-02  5:50                       ` David Gibson
@ 2016-12-02 21:41                         ` Benjamin Herrenschmidt
  2016-12-03  1:02                           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2016-12-02 21:41 UTC (permalink / raw)
  To: David Gibson
  Cc: Andrea Bolognani, Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini,
	Alex Williamson, qemu-ppc, qemu-devel, libvir-list, Michael Roth

On Fri, 2016-12-02 at 16:50 +1100, David Gibson wrote:
> 
> Uh.. I don't entirely follow you.  From the host point of view there
> are multiple iommu groups (PEs), but from the guest point of view
> there's only one.  On the guest side iommu granularity is always
> per-vPHB.

Ok, so the H_PUT_TCE calls affect all the underlying tables?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-02 21:41                         ` Benjamin Herrenschmidt
@ 2016-12-03  1:02                           ` Alexey Kardashevskiy
  0 siblings, 0 replies; 38+ messages in thread
From: Alexey Kardashevskiy @ 2016-12-03  1:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson
  Cc: Andrea Bolognani, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth

On 03/12/16 08:41, Benjamin Herrenschmidt wrote:
> On Fri, 2016-12-02 at 16:50 +1100, David Gibson wrote:
>>
>> Uh.. I don't entirely follow you.  From the host point of view there
>> are multiple iommu groups (PEs), but from the guest point of view
>> there's only one.  On the guest side iommu granularity is always
>> per-vPHB.
> 
> Ok, so the H_PUT_TCE calls affect all the underlying tables?

Yes, it updates all hardware tables attached to the same guest's LIOBN
(which in reality can only be seen on power7 as power8 shares tables
between PEs).


-- 
Alexey

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [libvirt] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-02  4:18                   ` David Gibson
  2016-12-02  5:17                     ` Benjamin Herrenschmidt
@ 2016-12-05 19:06                     ` Laine Stump
  2016-12-05 20:54                     ` Laine Stump
  2016-12-06 17:30                     ` [Qemu-devel] " Andrea Bolognani
  3 siblings, 0 replies; 38+ messages in thread
From: Laine Stump @ 2016-12-05 19:06 UTC (permalink / raw)
  To: David Gibson, Andrea Bolognani
  Cc: Alexey Kardashevskiy, benh, Greg Kurz, qemu-devel, libvir-list,
	qemu-ppc, Paolo Bonzini, Eduardo Habkost, Jaroslav Suchanek,
	Daniel P. Berrange

On 12/01/2016 11:18 PM, David Gibson wrote:
> On Fri, Nov 25, 2016 at 02:46:21PM +0100, Andrea Bolognani wrote:
>> On Wed, 2016-11-23 at 16:00 +1100, David Gibson wrote:
>>>> Existing libvirt versions assume that pseries guests have
>>>> a legacy PCI root bus, and will base their PCI address
>>>> allocation / PCI topology decisions on that fact: they
>>>> will, for example, use legacy PCI bridges.
>>>   
>>> Um.. yeah.. trouble is libvirt's PCI-E address allocation probably
>>> won't work for spapr PCI-E either, because of the weird PCI-E without
>>> root complex presentation we get in PAPR.
>> So, would the PCIe Root Bus in a pseries guest behave
>> differently than the one in a q35 or mach-virt guest?
> Yes.  I had a long discussion with BenH and got a somewhat better idea
> about this.
>
> If only a single host PE (== iommu group) is passed through and there
> are no emulated devices, the difference isn't too bad: basically on
> pseries you'll see the subtree that would be below the root complex on
> q35.
>
> But if you pass through multiple groups, things get weird.  On q35,
> you'd generally expect physically separate (different slot) devices to
> appear under separate root complexes.  Whereas on pseries they'll
> appear as siblings on a virtual bus (which makes no physical sense for
> point-to-point PCI-E).
>
> I suppose we could try treating all devices on pseries as though they
> were chipset builtin devices on q35, which will appear on the root
> PCI-E bus without root complex.  But I suspect that's likely to cause
> trouble with hotplug, and it will certainly need different address
> allocation from libvirt.
>
>> Does it have a different number of slots, do we have to
>> plug different controllers into them, ...?
>>
>> Regardless of how we decide to move forward with the
>> PCIe-enabled pseries machine type, libvirt will have to
>> know about this so it can behave appropriately.
> So there are kind of two extremes of how to address this.  There are a
> variety of options in between, but I suspect they're going to be even
> more muddled and hideous than the extremes.
>
> 1) Give up.  You said there's already a flag that says a PCI-E bus is
> able to accept vanilla-PCI devices.  We add a hack flag that says a
> vanilla-PCI bus is able to accept PCI-E devices.  We keep address
> allocation as it is now - the pseries topology really does resemble
> vanilla-PCI much better than it does PCI-E. But, we allow PCI-E
> devices, and PAPR has mechanisms for accessing the extended config
> space.  PCI-E standard hotplug and error reporting will never work,
> but PAPR provides its own mechanisms for those, so that should be ok.
>
> 2) Start exposing the PCI-E hierarchy for pseries guests much more
> like q35, root complexes and all.  It's not clear that PAPR actually
> *forbids* exposing the root complex, it just doesn't require it and
> that's not what PowerVM does.  But.. there are big questions about
> whether existing guests will cope with this or not.  When you start
> adding in multiple passed through devices and particularly virtual
> functions as well, things could get very ugly - we might need to
> construct multiple emulated virtual root complexes or other messes.
>
> In the short to medium term, I'm thinking option (1) seems pretty
> compelling.
>
>>>>> I believe that after we introduce the very first
>>>>> pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
>>>>   
>>>> Isn't i440fx still being updated despite the fact that q35
>>>> exists? Granted, there are a lot more differences between
>>>> those two machine types than just the root bus type.
>>>   
>>> Right, there are heaps of differences between i440fx and q35, and
>>> reasons to keep both updated.  For pseries we have neither the impetus
>>> nor the resources to maintain two different machine type variants,
>>> where the only difference is between legacy PCI and weirdly presented
>>> PCI-E.
>> Calling the PCIe machine type either pseries-2.8 or
>> pseries-pcie-2.8 would result in the very same amount of
>> work, and in both cases it would be understood that the
>> legacy PCI machine type is no longer going to be updated,
>> but can still be used to run existing guests.
> So, I'm not sure if the idea of a new machine type has legs or not,
> but let's think it through a bit further.  Suppose we have a new
> machine type, let's call it 'papr'.  I'm thinking it would be (at
> least with -nodefaults) basically a super-minimal version of pseries:
> so each PHB would have to be explicitly created, the VIO bridge would
> have to be explicitly created, likewise the NVRAM.  Not sure about the
> "devices" which really represent firmware features - the RTC, RNG,
> hypervisor event source and so forth.
>
> Might have some advantages.  Then again, it doesn't really solve the
> specific problem here.  It means libvirt (or the user) has to
> explicitly choose a PCI or PCI-E PHB to put things on, but libvirt's
> PCI-E address allocation will still be wrong in all probability.

That's a broad statement. Why? If qemu reports the default devices and 
characteristics of the devices properly (and libvirt uses that 
information) there's no reason for it to make the wrong decision.

>
> Guh.
>
>
> As an aside, here's a RANT.
>
> libvirt address allocation.  Seriously, W. T. F!
>
> libvirt insists on owning address allocation.  That's so it can
> recreate the exact same machine at the far end of a migration.  So far
> so good, except it insists on recording that information in the domain
> XML in kinda-sorta-but-not-really back end independent form.

Explain "kinda-sorta-but-not-really". If there's a deficiency in the 
model maybe it can be fixed.


>   But the
> thing is libvirt fundamentally CAN NOT get this right.

True, but not for the reasons you think. If qemu is able to respond to 
queries with adequate details about the devices available for a 
machinetype (and what buses are in place by default), there's no reason 
that libvirt can't add devices addressed such that all the connections 
are legal; what libvirt *can't* get right is the policy requested in the 
next higher layer of management (and ultimately of the user) - does this 
device need to be hotpluggable? Does the user want to keep all devices 
on the root complex to avoid extra PCI controllers?

And qemu fundamentally CAN NOT get it right either. qemu knows what is 
possible and what is allowed, but it doesn't know what the user *wants* 
(beyond "they want device X", which is only 1/2 the story), and has no 
way of being told what the user wants other than with a PCI address.

To back up for a minute, some background info: once a device has been 
added to a domain, at *any* time in the future (not just during a 
migration, but forever more until the end of time) that device must 
always have the same PCI address as it had that first time. In order to 
guarantee that, libvirt needs to either:

a) keep track of the order the devices were added and always put the 
devices in the same order on the commandline (assuming that qemu 
guarantees that it actually assigns addresses based on the order of the 
devices' appearance on the commandline, which has never been stated 
anywhere as an API guarantee of qemu),

or

b) remember the address of each device as it is added and specify that 
on the commandline in the future. libvirt chooses (b). And where is the 
logical place to store that address? In the config.
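For example, once a NIC has been placed, its config will carry
something like this from then on (the values are illustrative):

    <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>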

So we've established that the PCI address of a device needs to be stored 
in the config. So why does libvirt need to choose it the first time?

1) Because qemu doesn't have (and CAN NOT have) all the information
about what the user's plans for that device are:

   a) It has no idea if the user wants that device to be hotpluggable 
(on a root-port)
       or not (on root complex as an integrated device)

   b) it doesn't know if the user wants the device to be placed on an 
expander bus
       so that its NUMA status can be discovered by the guest OS.

If there is a choice, there must be a way to make that choice. The way 
that qemu provides to make the choice is by specifying an address. So 
libvirt must specify an address in its config.

2) Because qemu is unable/unwilling to automatically add PCIe root ports 
when necessary, it's *not even possible* (on PCIe machinetypes) for it 
to place a device on a hotpluggable port without libvirt specifying a 
PCI address the very first time the device is added (and also adding in 
a root-port), but libvirt's default policy is that (almost) all devices 
should be hotpluggable. If we were to follow your recommendation 
("libvirt never specifies PCI addresses, but instead allows qemu to 
assign them"), hotplug on PCIe-based machinetypes would not be possible, 
though.

There have even been suggestions that libvirt is too *low* in the 
stack to be specifying the PCI address of devices (i.e. that all PCI 
address decisions should be up to higher level management applications) 
- I had posted a patch that would allow specifying 
"hotpluggable='yes|no'" in the XML rather than forcing the caller to 
specify an address, and this was NACKed because it was seen as libvirt 
dictating policy. (In the end, libvirt *does* dictate a default policy 
- it's just that the only way to modify that policy is by manually 
specifying addresses: libvirt's default PCI address policy is that 
(almost) all devices will be assigned an address that makes them 
hotpluggable, and will not be placed on a non-0 NUMA node.)

So, in spite of libvirt's effort, in the end we *still* need to expose 
address configuration to higher level management applications, since 
they may want to force all devices onto the root complex (e.g. 
libguestfs, which does it to reduce PCI controller count, and thus 
startup time) or force certain devices to be on a non-0 NUMA node (e.g. 
OpenStack when it wants to place a VFIO assigned device on the same NUMA 
node in the guest as it is in the host).
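
(As a sketch of the NUMA case - host address and ids invented - on q35 
OpenStack ends up wanting something along the lines of:

   -device pxb-pcie,id=pxb1,bus=pcie.0,bus_nr=0x80,numa_node=1
   -device ioh3420,id=rp2,bus=pxb1,chassis=2,slot=1
   -device vfio-pci,host=0000:81:00.0,bus=rp2

and only the management layer knows that node 1 is the right choice.)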

With all of that, I fail to see how it would be at all viable to simply 
leave PCI address assignment up to qemu.


> There are all
> sorts of possible machine specific address allocation constraints that
> can exist - from simply which devices are already created by default
> (for varying values of "default") to complicated constraints depending
> on details of board wiring.  The back end has to know about these - it
> implements them.

And since qemu knows about them, it should be able to report them. Which 
is what Eduardo's work is doing. And then libvirt will know about all 
the constraints in a programmatic manner (rather than the horrible 
(tedious, error-prone) hardcoding of all those details that we've had to 
suffer with until now).
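
(For reference, the generic QMP introspection we have today looks like:

   { "execute": "qom-list-types",
     "arguments": { "implements": "pci-device", "abstract": false } }
   { "execute": "device-list-properties",
     "arguments": { "typename": "ioh3420" } }

That tells you which device types exist and what properties they have, 
but not which buses they can plug into or which slots are hotpluggable 
- that, as I understand it, is the gap Eduardo's work fills.)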

>   The ONLY way libvirt can get this (temporarily)
> right is by duplicating a huge chunk of the back end's allocation
> logic, which will inevitably get out of date causing problems just
> like this.
>
> Basically the back end will *always* have better information about how
> to place devices than libvirt can.

And no matter how hard qemu might try to come up with a policy for 
address assignment that will satisfy the needs of 100% of the users 100% 
of the time, it will fail (because different users have different 
needs). Because qemu will be unable to properly place all devices all 
the time, libvirt (and higher level management) will still need to do 
it. Even in the basic case qemu doesn't provide what libvirt requires as 
default - that devices be hotpluggable.

>    So, libvirt should be allowing the
> back end to do the allocation, then snapshotting that in a back end
> specific format which can be used for creating migration
> destinations. But that breaks libvirt's the-domain-XML-is-everything
> model.

No, that doesn't work because qemu would in many situations place the 
devices at the wrong address / on the wrong controller, because there 
are many possible topologies that are legal, and the user may (for 
perfectly valid reasons) want something different from what qemu would 
have chosen.

(An example of two differing (and valid) policies - libguestfs needs 
guests to start up as quickly as possible, and thus wants as few PCI 
controllers as possible (this makes a noticeable difference in Linux 
boot time), so it wants all devices to be integrated on the root 
complex. On the other hand, a generic guest in libvirt should make all 
devices hotpluggable just in case the user wants to unplug them, so by 
default it tries to place all devices on a pcie-root-port. You can't 
support both of these if addressing is all left up to qemu)

>
> In this regard libvirt doesn't just have a design flaw, it has design
> flaws which breed more design flaws like pestilent cancer.


It may make you feel good to say that, but the facts don't back it up. 
Any project makes design mistakes, but in the specific case you're 
discussing here, I think you haven't looked from a wide enough viewpoint 
to see the necessity of what libvirt is doing and why it can't be done 
by qemu (certainly not all the time anyway).


>    And what's
> worse the consequences of those design flaws are now making sane
> design decisions increasingly difficult in adjacent projects like
> qemu.

libvirt has always done the best that could be done with the information 
provided by qemu. The problem isn't that libvirt is creating new 
problems for qemu out of thin air; it's that qemu is unable to 
automatically address PCI devices for all possible situations and user 
policy preferences, so higher levels need to make the decisions about 
addressing to satisfy their policies (i.e. what they *want*, e.g. 
hotpluggable, integrated on root complex). And qemu hasn't (until 
Eduardo's patches) been able to provide adequate information about what 
is *legal* (e.g. which devices can be plugged into which model of pci 
controller, what slots are available on each type of controller, whether 
those slots are hotpluggable) in a programmatic way, so libvirt has had 
to hardcode rules about bus-device compatibility and capabilities, slot 
ranges, etc. in order to make proper decisions itself when possible, and 
to sanity-check decisions about addresses made by higher level 
management when not. I don't think that's a design flaw. I think that's 
making the best of a "less than ideal" situation.


> I'd feel better about this if there seemed to be some recognition of
> it, and some necessarily long term plan to improve it, but if there is
> I haven't heard of it.  Or at least the closest thing seems to be
> coming from the qemu side (witness Eduardo's talk at the last KVM
> forum, and mine at the one before).

Eduardo's work isn't being done to make up for some mythical design flaw 
in libvirt. It is being done in order to give libvirt the (previously 
unavailable) information it needs to do a necessary job, and is being 
done at least partly at the request of libvirt (we've certainly been 
demanding some of that stuff for a long time!)

The summary is that, for several reasons (at least), it's impossible for 
qemu to correctly decide where to put new devices, especially in a PCIe 
hierarchy; because of this, libvirt (and higher level management) needs to 
be able to assign addresses to devices, and in order for us/them to be 
able to do that properly, qemu needs to provide detailed and accurate 
information about what buses/controllers/devices are in each 
machinetype, what controllers/devices are available to add, and what are 
the legal ways of connecting those devices and controllers.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-02  4:18                   ` David Gibson
                                       ` (2 preceding siblings ...)
  2016-12-05 20:54                     ` Laine Stump
@ 2016-12-06 17:30                     ` Andrea Bolognani
  2016-12-07  4:11                       ` David Gibson
  3 siblings, 1 reply; 38+ messages in thread
From: Andrea Bolognani @ 2016-12-06 17:30 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth, benh

On Fri, 2016-12-02 at 15:18 +1100, David Gibson wrote:
> > So, would the PCIe Root Bus in a pseries guest behave
> > differently than the one in a q35 or mach-virt guest?
> 
> Yes.  I had a long discussion with BenH and got a somewhat better idea
> about this.

Sorry, but I'm afraid you're going to have to break this
down even further for me :(

> If only a single host PE (== iommu group) is passed through and there
> are no emulated devices, the difference isn't too bad: basically on
> pseries you'll see the subtree that would be below the root complex on
> q35.
> 
> But if you pass through multiple groups, things get weird.

Is the difference between q35 and pseries guests with
respect to PCIe only relevant when it comes to assigned
devices, or in general? I'm asking this because you seem to
focus entirely on assigned devices.

> On q35,
> you'd generally expect physically separate (different slot) devices to
> appear under separate root complexes.

This part I don't get at all, so please bear with me.

The way I read it you're claiming that eg. a SCSI controller
and a network adapter, being physically separate and assigned
to separate PCI slots, should have a dedicated PCIe Root
Complex each on a q35 guest.

That doesn't match with my experience, where you would simply
assign them to separate slots of the default PCIe Root Bus
(pcie.0), eg. 00:01.0 and 00:02.0.

Maybe you're referring to the fact that you might want to
create multiple PCIe Root Complexes in order to assign the
host devices to separate guest NUMA nodes? How is creating
multiple PCIe Root Complexes on q35 using pxb-pcie different
than creating multiple PHBs using spapr-pci-host-bridge on
pseries?
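
In command-line terms the two look quite similar to me - assuming I 
have the properties right:

   q35:     -device pxb-pcie,id=pcie.1,bus_nr=0x80,numa_node=1
   pseries: -device spapr-pci-host-bridge,index=1

hence my confusion.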

> Whereas on pseries they'll
> appear as siblings on a virtual bus (which makes no physical sense for
> point-to-point PCI-E).

What is the virtual bus in question? Why would it matter
that they're siblings?

I'm possibly missing the point entirely, but so far it
looks to me like there are different configurations you
might want to use depending on your goal, and both q35
and pseries give you comparable tools to achieve such
configurations.

> I suppose we could try treating all devices on pseries as though they
> were chipset builtin devices on q35, which will appear on the root
> PCI-E bus without root complex.  But I suspect that's likely to cause
> trouble with hotplug, and it will certainly need different address
> allocation from libvirt.

PCIe Integrated Endpoint Devices are not hotpluggable on
q35, that's why libvirt will follow QEMU's PCIe topology
recommendations and place a PCIe Root Port between them;
I assume the same could be done for pseries guests as
soon as QEMU grows support for generic PCIe Root Ports,
something Marcel has already posted patches for.

Again, sorry for clearly misunderstanding your explanation,
but I'm still not seeing the issue here. I'm sure it's very
clear in your mind, but I'm afraid you're going to have to
walk me through it :(

> > Regardless of how we decide to move forward with the
> > PCIe-enabled pseries machine type, libvirt will have to
> > know about this so it can behave appropriately.
> 
> So there are kind of two extremes of how to address this.  There are a
> variety of options in between, but I suspect they're going to be even
> more muddled and hideous than the extremes.
> 
> 1) Give up.  You said there's already a flag that says a PCI-E bus is
> able to accept vanilla-PCI devices.  We add a hack flag that says a
> vanilla-PCI bus is able to accept PCI-E devices.  We keep address
> allocation as it is now - the pseries topology really does resemble
> vanilla-PCI much better than it does PCI-E. But, we allow PCI-E
> devices, and PAPR has mechanisms for accessing the extended config
> space.  PCI-E standard hotplug and error reporting will never work,
> but PAPR provides its own mechanisms for those, so that should be ok.

We can definitely special-case pseries guests and take
the "anything goes" approach to PCI vs PCIe, but it would
certainly be nicer if we could avoid presenting our users
the head-scratching situation of PCIe devices being plugged
into legacy PCI slots and still showing up as PCIe in the
guest.

What about virtio devices, which present themselves either
as legacy PCI or PCIe depending on the kind of slot they
are plugged into? Would they show up as PCIe or legacy PCI
on a PCIe-enabled pseries guest?
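
For what it's worth, my understanding is that this is controlled by 
the virtio disable-legacy/disable-modern properties, with the default 
disable-legacy=auto meaning modern-only when the device sits behind a 
PCIe port, transitional otherwise, e.g. (rp1 being some root port):

   -device virtio-net-pci,bus=rp1,disable-legacy=auto

but I'm not sure what "behind a PCIe port" would even mean on the 
flat PAPR topology.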

> 2) Start exposing the PCI-E hierarchy for pseries guests much more
> like q35, root complexes and all.  It's not clear that PAPR actually
> *forbids* exposing the root complex, it just doesn't require it and
> that's not what PowerVM does.  But.. there are big questions about
> whether existing guests will cope with this or not.  When you start
> adding in multiple passed through devices and particularly virtual
> functions as well, things could get very ugly - we might need to
> construct multiple emulated virtual root complexes or other messes.
> 
> In the short to medium term, I'm thinking option (1) seems pretty
> compelling.

Is the Root Complex not currently exposed? The Root Bus
certainly is, otherwise PCI devices won't work at all, I
assume. And I can clearly see a pci.0 bus in the output
of 'info qtree' for a pseries guest, and a pci.1 too if
I add a spapr-pci-host-bridge.

Maybe I just don't quite get the relationship between Root
Complexes and Root Buses, but I guess my question is: what
is preventing us from simply doing whatever a
spapr-pci-host-bridge is doing in order to expose a legacy
PCI Root Bus (pci.*) to the guest, and create a new
spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
instead?

> So, I'm not sure if the idea of a new machine type has legs or not,
> but let's think it through a bit further.  Suppose we have a new
> machine type, let's call it 'papr'.  I'm thinking it would be (at
> least with -nodefaults) basically a super-minimal version of pseries:
> so each PHB would have to be explicitly created, the VIO bridge would
> have to be explicitly created, likewise the NVRAM.  Not sure about the
> "devices" which really represent firmware features - the RTC, RNG,
> hypervisor event source and so forth.
> 
> Might have some advantages.  Then again, it doesn't really solve the
> specific problem here.  It means libvirt (or the user) has to
> explicitly choose a PCI or PCI-E PHB to put things on,

libvirt would probably add a

  <controller type='pci' model='pcie-root'/>

to the guest XML by default, resulting in a
spapr-pcie-host-bridge providing pcie.0 and the same
controller / address allocation logic as q35; the user
would be able to use

  <controller type='pci' model='pci-root'/>

instead to stick with legacy PCI. This would only matter when using 
'-nodefaults' anyway; when that flag is not present, a PCIe (or legacy 
PCI) PHB could be created by QEMU to make things more convenient for 
people who are not using libvirt.

Maybe we should have a different model, specific to
pseries guests, instead, so that all PHBs would look the
same in the guest XML, something like

  <controller type='pci' model='phb-pcie'/>

It would require shuffling libvirt's PCI address allocation
code around quite a bit, but it should be doable. And if it
makes life easier for our users, then it's worth it.

> but libvirt's
> PCI-E address allocation will still be wrong in all probability.
> 
> Guh.

> As an aside, here's a RANT.
[...]

Laine already addressed your points extensively, but I'd
like to add a few thoughts of my own.

* PCI addresses for libvirt guests need to be stable not
  only when performing migration, but also to guarantee
  that no change in guest ABI will happen as a consequence
  of eg. a simple power cycle.

* Even if libvirt left all PCI address assignment to QEMU,
  we would need a way for users to override QEMU's choices,
  because one size never fits all and users have all kinds
  of crazy, yet valid, requirements. So the first time we
  run QEMU, we would have to take the backend-specific
  format you suggest, parse it to extract the PCI addresses
  that have been assigned, and reflect them in the guest
  XML so that the user can change a bunch of them. Then I
  guess we could re-encode it in the backend-specific format
  and pass it to QEMU the next time we run it but, at that
  point, what's the difference with simply putting the PCI
  addresses on the command line directly?

* It's not just about the addresses, by the way, but also
  about the controllers - what model is used, how they are
  plugged together and so on. More stuff that would have to
  round-trip because users need to be able to take matters
  into their own hands.

* Design mistakes in any software, combined with strict
  backwards compatibility requirements, make it difficult
  to make changes in both related components and the
  software itself, even when the changes would be very
  beneficial. It can be very frustrating at times, but
  it's the reality of things and unfortunately there's only
  so much we can do about it.

* Eduardo's work, which you mentioned, is going to be very
  beneficial in the long run; in the short run, Marcel's
  PCIe device placement guidelines, a document that has seen
  contributions from QEMU, OVMF and libvirt developers, have
  been invaluable to improve libvirt's PCI address allocation
  logic. So we're already doing better, and more improvements
  are on the way :)

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [libvirt] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-05 20:54                     ` Laine Stump
@ 2016-12-07  3:34                       ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-12-07  3:34 UTC (permalink / raw)
  To: Laine Stump
  Cc: qemu-ppc, Andrea Bolognani, Alexey Kardashevskiy, benh,
	Greg Kurz, qemu-devel, libvir-list, Paolo Bonzini

On Mon, Dec 05, 2016 at 03:54:49PM -0500, Laine Stump wrote:
> (Sorry for any duplicates. I sent it from the wrong address the first time)
> 
> On 12/01/2016 11:18 PM, David Gibson wrote:
> > On Fri, Nov 25, 2016 at 02:46:21PM +0100, Andrea Bolognani wrote:
> > > On Wed, 2016-11-23 at 16:00 +1100, David Gibson wrote:
> > > > > Existing libvirt versions assume that pseries guests have
> > > > > a legacy PCI root bus, and will base their PCI address
> > > > > allocation / PCI topology decisions on that fact: they
> > > > > will, for example, use legacy PCI bridges.
> > > >   Um.. yeah.. trouble is libvirt's PCI-E address allocation probably
> > > > won't work for spapr PCI-E either, because of the weird PCI-E without
> > > > root complex presentation we get in PAPR.
> > > So, would the PCIe Root Bus in a pseries guest behave
> > > differently than the one in a q35 or mach-virt guest?
> > Yes.  I had a long discussion with BenH and got a somewhat better idea
> > about this.
> > 
> > If only a single host PE (== iommu group) is passed through and there
> > are no emulated devices, the difference isn't too bad: basically on
> > pseries you'll see the subtree that would be below the root complex on
> > q35.
> > 
> > But if you pass through multiple groups, things get weird.  On q35,
> > you'd generally expect physically separate (different slot) devices to
> > appear under separate root complexes.  Whereas on pseries they'll
> > appear as siblings on a virtual bus (which makes no physical sense for
> > point-to-point PCI-E).
> > 
> > I suppose we could try treating all devices on pseries as though they
> > were chipset builtin devices on q35, which will appear on the root
> > PCI-E bus without root complex.  But I suspect that's likely to cause
> > trouble with hotplug, and it will certainly need different address
> > allocation from libvirt.
> > 
> > > Does it have a different number of slots, do we have to
> > > plug different controllers into them, ...?
> > > 
> > > Regardless of how we decide to move forward with the
> > > PCIe-enabled pseries machine type, libvirt will have to
> > > know about this so it can behave appropriately.
> > So there are kind of two extremes of how to address this.  There are a
> > variety of options in between, but I suspect they're going to be even
> > more muddled and hideous than the extremes.
> > 
> > 1) Give up.  You said there's already a flag that says a PCI-E bus is
> > able to accept vanilla-PCI devices.  We add a hack flag that says a
> > vanilla-PCI bus is able to accept PCI-E devices.  We keep address
> > allocation as it is now - the pseries topology really does resemble
> > vanilla-PCI much better than it does PCI-E. But, we allow PCI-E
> > devices, and PAPR has mechanisms for accessing the extended config
> > space.  PCI-E standard hotplug and error reporting will never work,
> > but PAPR provides its own mechanisms for those, so that should be ok.
> > 
> > 2) Start exposing the PCI-E hierarchy for pseries guests much more
> > like q35, root complexes and all.  It's not clear that PAPR actually
> > *forbids* exposing the root complex, it just doesn't require it and
> > that's not what PowerVM does.  But.. there are big questions about
> > whether existing guests will cope with this or not.  When you start
> > adding in multiple passed through devices and particularly virtual
> > functions as well, things could get very ugly - we might need to
> > construct multiple emulated virtual root complexes or other messes.
> > 
> > In the short to medium term, I'm thinking option (1) seems pretty
> > compelling.
> > 
> > > > > > I believe after we introduced the very first
> > > > > > pseries-pcie-X.Y, we will just stop adding new pseries-X.Y.
> > > > >   Isn't i440fx still being updated despite the fact that q35
> > > > > exists? Granted, there are a lot more differences between
> > > > > those two machine types than just the root bus type.
> > > >   Right, there are heaps of differences between i440fx and q35, and
> > > > reasons to keep both updated.  For pseries we have neither the impetus
> > > > nor the resources to maintain two different machine type variant,
> > > > where the only difference is between legacy PCI and weirdly presented
> > > > PCI-E.
> > > Calling the PCIe machine type either pseries-2.8 or
> > > pseries-pcie-2.8 would result in the very same amount of
> > > work, and in both cases it would be understood that the
> > > legacy PCI machine type is no longer going to be updated,
> > > but can still be used to run existing guests.
> > So, I'm not sure if the idea of a new machine type has legs or not,
> > but let's think it through a bit further.  Suppose we have a new
> > machine type, let's call it 'papr'.  I'm thinking it would be (at
> > least with -nodefaults) basically a super-minimal version of pseries:
> > so each PHB would have to be explicitly created, the VIO bridge would
> > have to be explicitly created, likewise the NVRAM.  Not sure about the
> > "devices" which really represent firmware features - the RTC, RNG,
> > hypervisor event source and so forth.
> > 
> > Might have some advantages.  Then again, it doesn't really solve the
> > specific problem here.  It means libvirt (or the user) has to
> > explicitly choose a PCI or PCI-E PHB to put things on, but libvirt's
> > PCI-E address allocation will still be wrong in all probability.
> 
> That's a broad statement. Why? If qemu reports the default devices and
> characteristics of the devices properly (and libvirt uses that information)
> there's no reason for it to make the wrong decision.

Actually, unlike the comments below, this assertion is neither
particularly broad, nor critical of libvirt (it is obliquely critical
of PAPR and PowerVM).

The point is that the way PCI-E devices are presented under PAPR is
different from the way they'd appear on a hardware bus or on
virtualized PC or ARM.  Specifically we don't present the pseudo-P2P
bridges of the root complexes to the guest.  Instead (unless you
explicitly ask for a bridge) all the devices will appear directly on
the root "bus" - obviously this bus is a fiction since PCI-E is point
to point in hardware.

In short, from the point of view of topology, PAPR PCI-E will look
much more like vanilla PCI than PCI-E on an x86.  So to the extent
that libvirt address allocation differs between PCI and PCI-E, and I
gather it does, the PCI-E variant will be wrong for PAPR.

That's why for the medium term I advocate option (1) above of treating
PAPR PCI-E as if it was a vanilla PCI bus, but with a flag indicating
that PCI-E devices can be attached to it.

> > Guh.
> > 
> > 
> > As an aside, here's a RANT.
> > 
> > libvirt address allocation.  Seriously, W. T. F!
> > 
> > libvirt insists on owning address allocation.  That's so it can
> > recreate the exact same machine at the far end of a migration.  So far
> > so good, except it insists on recording that information in the domain
> > XML in kinda-sorta-but-not-really back end independent form.
> 
> Explain "kinda-sorta-but-not-really". If there's a deficiency in the model
> maybe it can be fixed.

My point is that the variety of constraints we can have in the back
end is so broad that it's impossible to construct a representation at
that level of detail which is truly backend independent.

> > But the
> > thing is libvirt fundamentally CAN NOT get this right.
> 
> True, but not for the reasons you think. If qemu is able to respond to
> queries with adequate details about the devices available for a machinetype
> (and what buses are in place by default), there's no reason that libvirt
> can't add devices addressed such that all the connections are legal;

The problem here is "adequate details".  Again, AFAICT the variety of
possible constraints given all the vagaries of different platforms is
so great that I don't think any format short of code is sufficient to
describe it.

> what
> libvirt *can't* get right is the policy requested in the next higher layer
> of management (and ultimately of the user) - does this device need to be
> hotpluggable? Does the user want to keep all devices on the root complex to
> avoid extra PCI controllers?
> 
> And qemu fundamentally CAN NOT get it right either. qemu knows what is
> possible and what is allowed, but it doesn't know what the user *wants*
> (beyond "they want device X", which is only 1/2 the story), and has no way
> of being told what the user wants other than with a PCI address.

So, here's a more specific point.  You *don't* give PCI addresses to
qemu - rather you give device/slot numbers, and which bus/bridge
they're attached to.  libvirt's use of DDDD:BB:SS.F addresses is a
fundamental error, because such addresses aren't meaningful until
*after* the bus is enumerated.  The bus number is explicitly allocated
during enumeration (by guest firmware or OS).  You can kind of predict
what they'll do, but it requires knowing about every bridge and device
on the system, which means it risks breakage every time the backend
changes its default setup - and the interpretation of the addresses
could be changed by an insertion of a bridge somewhere apparently
unrelated in the XML.

The domain number is an entirely guest-local concept, except - sorta -
on x86 where there's an obvious choice of the index that's used to
control the multiplexed access to all the host bridges' config spaces.
On platforms which have separate registers for each host bridge, it's
entirely possible for libvirt's idea of domain numbers to differ from
the guest's, which is wonderfully confusing.
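
(To illustrate: on the qemu command line a device is placed by parent 
bus id plus slot/function - the guest-visible bus number never appears. 
Ids here are made up:

   -device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0,addr=0x3
   -device e1000,bus=bridge1,addr=0x2

The "BB" for that e1000 only comes into existence once the guest 
enumerates bridge1.)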

> To back up for a minute, some background info: once a device has been added
> to a domain, at *any* time in the future (not just during a migration, but
> forever more until the end of time) that device must always have the same
> PCI address as it had that first time. In order to guarantee that, libvirt
> needs to either:

I dispute the "any time".  I'll grant you want to keep the virtual
hardware stable for an extended period.  But if you ever want to
update your backend to newer behaviour (meaning new machine type
version, not just new qemu version) then you should revisit address
allocation.  And, yes, that potentially breaks the guest, but so could
the backend changes anyway, just like any hardware update for a
physical machine.  In most cases - assuming it's been set up sensibly -
the guest OS will cope with minor hw reshuffles.  That is, after all,
why RHEL locates filesystems by UUID and NICs by MAC in normal
operation.

> a) keep track of the order the devices were added and always put the devices
> in the same order on the commandline (assuming that qemu guarantees that it
> actually assigns addresses based on the order of the devices' appearance on
> the commandline, which has never been stated anywhere as an API guarantee of
> qemu),
> 
> or
> 
> b) remember the address of each device as it is added and specify that on
> the commandline in the future. libvirt chooses (b). And where is the logical
> place to store that address? In the config.

Putting it in the config seems like the logical place at first, but it's
a mistake:

First, the information we need to ensure stable guest hardware is
inherently backend specific.  On qemu we need to say which bridges and
bus ids each device is connected to.  Some other system could use PCI
addresses (I think that would be a design flaw in the backend, but
if libvirt can't deal with imperfect backends it has failed its
remit), for containers the whole question doesn't even make sense.  A
really simplistic hypervisor could take simply an (ordered) list of
plugin devices and always put them into a fixed set of slots.

Putting this into the general config forces us to attempt a solution
to the insoluble: creating a neutral representation translatable to
any backend specific representation.

Second, this means that the config now contains both information that
was originally entered by the user (or next level up) as what they
want for the VM, and also information that libvirt decided once upon a
time and is now sticking with.  We can no longer separate which bits
of information are which.  That means when we want to update the
backend, we can't tell what we should regenerate.

[Meta rant, when it comes to design flaws that breed other design
 flaws the fact that the domain XML is both a read and write interface
 is just about the granddaddy of them all.  It has two awful
 consequences:
  1) After the VM is first run, we can no longer tell which config
     constraints were ones actually requested by the user, and which
     were added by some version of libvirt in the past and are now
     just there so we make the same decisions as before.  That
     distinction matters.
  2) Because the user could be relying on *seeing* something in the
     XML, not just that whatever it put in there will still be
     understood by new libvirt versions means we basically can't ever
     fix errors in the XML design.]

> So we've established that the PCI address of a device needs to be stored in
> the config. So why does libvirt need to choose it the first time?
> 
> 1) Because qemu doesn't have (and CAN NOT have) all the information about
> the user's plans for that device:

CAN not?  Why not?  Fundamentally it seems like we need to get the
information about what the user wants and what the virtual hardware
can do into the same place.  At least from the examples you've
mentioned so far, pushing "needs to be hotpluggable" and so forth
flags down into qemu seems a lot more feasible than pulling up the
complex and platform dependent set of constraints into libvirt.

And yes, qemu is very far from perfect in this area - the
introspection interfaces need to be greatly improved.

>   a) It has no idea if the user wants that device to be hotpluggable
>      (on a root-port) or not (on root complex as an integrated device)
> 
>   b) it doesn't know if the user wants the device to be placed on an
>      expander bus so that its NUMA status can be discovered by the guest OS.

So both (a) and (b) are precisely examples of what I'm describing.
libvirt is assuming
	(a) requires hotplug => must be on a root port
	(b) device bound to NUMA node => use expander bus

But (a) is only true for guest platforms using the standard hotplug
mechanism.  pseries does not.  (b) is essentially only true for x86 -
the whole concept of the expander bus is a hack to work around x86's
historical inability to have truly independent PCI host bridges.  On
pseries we can create independent host bridges and assign them each a
NUMA node (including the default one).

qemu already knew that, libvirt needs to be taught.  Just as it seems
like libvirt needs to be taught about practically every platform
quirk.
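
Concretely - property names from memory, so don't quote me on the 
details - on pseries the NUMA case needs no expander bus at all; 
something like:

   -device spapr-pci-host-bridge,index=1,numa_node=1

gives you an independent PHB bound to guest node 1 directly.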

> If there is a choice, there must be a way to make that choice. The way that
> qemu provides to make the choice is by specifying an address. So libvirt
> must specify an address in its config.

Again, no, qemu does not take addresses at all.  At least not in the
"DDDD:BB:SS.F" sense.  Even if it did, DDDD:BB:SS.F addresses have no
place in a supposedly back-end independent config.

> 2) Because qemu is unable/unwilling to automatically add PCIe root ports
> when necessary, it's *not even possible* (on PCIe machinetypes) for it to
> place a device on a hotpluggable port without libvirt specifying a PCI
> address the very first time the device is added (and also adding in a
> root-port), but libvirt's default policy is that (almost) all devices should
> be hotpluggable. If we were to follow your recommendation ("libvirt never
> specifies PCI addresses, but instead allows qemu to assign them"), hotplug
> on PCIe-based machinetypes would simply not be possible.

Ok, so, yes, I see that a lot of this allocation mess originates
because qemu didn't provide sufficient auto-allocation, or
introspection, so libvirt had to work around that.  The trouble is
that those workarounds have now been baked into the libvirt interface
- primarily by the fact that the data is integrated back into the
primary config rather than stored as adjunct back end specific data.
That means it's now insanely difficult to improve the interfaces to
qemu, because libvirt XMLs now have all this address information.  A
few of them are because the user really did request that for some
reason, but many are because that was how libvirt tried to implement
some more abstract user request.  We can now no longer tell the
difference, which means we can't translate those abstract requests
into a new saner qemu interface.

> There have even been suggestions that libvirt is too *low* in the stack to
> be specifying the PCI address of devices (i.e. that all PCI address
> decisions should be up to higher level management applications)

Frankly, if the higher level management needs that much control over
the PCI topology, going via an abstraction layer like libvirt seems
like the wrong choice.  In that case the management needs to understand
the guest platform in great detail anyway, so what does libvirt buy you?

> - I had
> posted a patch that would allow specifying "hotpluggable='yes|no'" in the
> XML rather than forcing the caller to specify an address, and this was
> NACKed because it was seen as libvirt dictating policy.

Huh.  Your patch makes perfect sense to me - a step in the right
direction - and the reason for the NACK seems completely bogus.

> (In the end, libvirt *does*
> dictate a default policy - it's just that the only way to modify that policy
> is by manually specifying addresses: libvirt's default PCI address policy
> is that (almost) all devices will be assigned an address that makes them
> hotpluggable, and will not be placed on a non-0 NUMA node.)

Yeah, but "address which makes them hotpluggable" depends on guest
platform (machine type).  As does how you assign them to NUMA nodes.

> So, in spite of libvirt's effort, in the end we *still* need to expose
> address configuration to higher level management applications, since they
> may want to force all devices onto the root complex (e.g. libguestfs, which
> does it to reduce PCI controller count, and thus startup time) or force
> certain devices to be on a non-0 NUMA node (e.g. OpenStack when it wants to
> place a VFIO assigned device on the same NUMA node in the guest as it is in
> the host).

So, I'm not necessarily against allowing an explicit address override
for special cases (although DDDD:BB:SS.F is the wrong way to encode
it).  If that breaks due to a backend update, it's reasonable to just
punt the error back to the user and get them to fix it - if they
didn't understand the guest platform well enough to fix this, they
shouldn't have been overriding addresses.

But it's not reasonable to punt back to the user when the only thing
that got broken by the backend update was information libvirt (maybe
not even the current version) decided some time in the past, based on
faulty assumptions about the guest platform.

But libvirt has no way to distinguish these cases because all the info
is folded back into the one XML blob.

> With all of that, I fail to see how it would be at all viable to simply
> leave PCI address assignment up to qemu.
> 
> > There are all
> > sorts of possible machine specific address allocation constraints that
> > can exist - from simply which devices are already created by default
> > (for varying values of "default") to complicated constraints depending
> > on details of board wiring.  The back end has to know about these - it
> > implements them.
> 
> And since qemu knows about them, it should be able to report them. Which is
> what Eduardo's work is doing. And then libvirt will know about all the
> constraints in a programmatic manner (rather than the horrible (tedious,
> error-prone) hardcoding of all those details that we've had to suffer with
> until now).

I haven't had a chance to look in detail at Eduardo's latest stuff,
but yes it should be an improvement.  But I think you vastly
underestimate how many different platform quirks we could have which
won't be easy to encode.

> > The ONLY way libvirt can get this (temporarily)
> > right is by duplicating a huge chunk of the back end's allocation
> > logic, which will inevitably get out of date causing problems just
> > like this.
> > 
> > Basically the back end will *always* have better information about how
> > to place devices than libvirt can.
> 
> And no matter how hard qemu might try to come up with a policy for
> address assignment that will satisfy the needs of 100% of the users 100% of
> the time, it will fail (because different users have different needs).
> Because qemu will be unable to properly place all devices all the time,
> libvirt (and higher level management) will still need to do it. Even in the
> basic case qemu doesn't provide what libvirt requires as default - that
> devices be hotpluggable.
> 
> > So, libvirt should be allowing the
> > back end to do the allocation, then snapshotting that in a back end
> > specific format which can be used for creating migration
> > destinations. But that breaks libvirt's the-domain-XML-is-everything
> > model.
> 
> No, that doesn't work because qemu would in many situations place the
> devices at the wrong address / on the wrong controller, because there are
> many possible topologies that are legal, and the user may (for perfectly
> valid reasons) want something different from what qemu would have chosen.

Right, there are certainly reasons for overrides.  But at the moment
*every* device is overridden, which means when there is something
wrong in libvirt's assumptions about how to place things, every device
can break, not just the ones the user explicitly overrode.

> (An example of two differing (and valid) policies - libguestfs needs guests
> to start up as quickly as possible, and thus wants as few PCI controllers as
> possible (this makes a noticeable difference in Linux boot time), so it
> wants all devices to be integrated on the root complex. On the other hand, a
> generic guest in libvirt should make all devices hotpluggable just in case
> the user wants to unplug them, so by default it tries to place all devices
> on a pcie-root-port. You can't support both of these if addressing is all
> left up to qemu)

With the tools that qemu currently gives you.  But now the assumption
that address allocation belongs to libvirt means that it's insanely
difficult to build those tools without breaking libvirt.

> > In this regard libvirt doesn't just have a design flaw, it has design
> > flaws which breed more design flaws like pestilent cancer.
> 
> 
> It may make you feel good to say that, but the facts don't back it up. Any
> project makes design mistakes, but in the specific case you're discussing
> here, I think you haven't looked from a wide enough viewpoint to see the
> necessity of what libvirt is doing and why it can't be done by qemu
> (certainly not all the time anyway).
> 
> 
> > And what's
> > worse the consequences of those design flaws are now making sane
> > design decisions increasingly difficult in adjacent projects like
> > qemu.
> 
> libvirt has always done the best that could be done with the information
> provided by qemu. The problem isn't that libvirt is creating new problems
> for qemu out of thin air; it's that qemu is unable to automatically address
> PCI devices for all possible situations and user policy preferences, so
> higher levels need to make the decisions about addressing to satisfy their
> policies (i.e. what they *want*, e.g. hotpluggable, integrated on root
> complex). And qemu hasn't (until Eduardo's patches) been able to provide
> adequate information about what is *legal* (e.g. which devices can be
> plugged into which model of pci controller, what slots are available on
> each type of controller, whether those slots are hotpluggable) in a
> programmatic way, so libvirt has had to hardcode rules about bus-device
> compatibility and capabilities, slot ranges, etc. in order to make proper
> decisions itself when possible, and to sanity-check decisions about
> addresses made by higher level management when not. I don't think that's a
> design flaw. I think that's making the best of a "less than ideal"
> situation.
> 
> 
> > I'd feel better about this if there seemed to be some recognition of
> > it, and some necessarily long term plan to improve it, but if there is
> > I haven't heard of it.  Or at least the closest thing seems to be
> > coming from the qemu side (witness Eduardo's talk at the last KVM
> > forum, and mine at the one before).
> 
> Eduardo's work isn't being done to make up for some mythical design flaw in
> libvirt. It is being done in order to give libvirt the (previously
> unavailable) information it needs to do a necessary job, and is being done
> at least partly at the request of libvirt (we've certainly been demanding
> some of that stuff for a long time!)

Well, I did warn you it was a rant.  That said, while there are
understandable historical reasons for them, design flaws in libvirt
which make it really difficult to design other things properly are,
alas, not a myth.  Three examples:

1) Read/write XML interface.  Makes it nigh-impossible to fix flaws by
updating to a more coherent internal representation (with backwards
compat glue to parse the old format into the new one).

2) Not separating original user config from libvirt additions /
annotations.  Discussed repeatedly above.

3) Use of DDDD:BB:SS.F addresses (rather than a reference to a bus
attachment point + SS.F).  You can't use addresses to construct a
machine in an address space which only exists once the machine is
constructed.  (See the sketch below.)

And that's just the first three I came up with that have at least a
slight connection to the problem at hand.
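
To illustrate (3): what libvirt records today is something like

  <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>

where bus 0x02 doesn't exist until the controllers providing it have
been instantiated.  Something that named the attachment point instead,
say (hypothetical syntax)

  <address type='pci' controller='pci-bridge-1' slot='0x01' function='0x0'/>

could be resolved while the machine is still being constructed.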

> The summary is that, for a few reasons (at least), it's impossible for
> qemu to correctly decide where to put new devices, especially in a PCIe
> hierarchy; because of this, libvirt (and higher level management) needs to be
> able to assign addresses to devices, and in order for us/them to be able to
> do that properly, qemu needs to provide detailed and accurate information
> about what buses/controllers/devices are in each machinetype, what
> controllers/devices are available to add, and what are the legal ways of
> connecting those devices and controllers.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-06 17:30                     ` [Qemu-devel] " Andrea Bolognani
@ 2016-12-07  4:11                       ` David Gibson
  2016-12-07 16:42                         ` Andrea Bolognani
  0 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-12-07  4:11 UTC (permalink / raw)
  To: Andrea Bolognani
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth, benh

On Tue, Dec 06, 2016 at 06:30:47PM +0100, Andrea Bolognani wrote:
> On Fri, 2016-12-02 at 15:18 +1100, David Gibson wrote:
> > > So, would the PCIe Root Bus in a pseries guest behave
> > > differently than the one in a q35 or mach-virt guest?
> > 
> > Yes.  I had a long discussion with BenH and got a somewhat better idea
> > about this.
> 
> Sorry, but I'm afraid you're going to have to break this
> down even further for me :(
> 
> > If only a single host PE (== iommu group) is passed through and there
> > are no emulated devices, the difference isn't too bad: basically on
> > pseries you'll see the subtree that would be below the root complex on
> > q35.
> > 
> > But if you pass through multiple groups, things get weird.
> 
> Is the difference between q35 and pseries guests with
> respect to PCIe only relevant when it comes to assigned
> devices, or in general? I'm asking this because you seem to
> focus entirely on assigned devices.

Well, in a sense that's up to us.  The only existing model we have is
PowerVM, and PowerVM only does device passthrough, no emulated
devices.  PAPR doesn't really distinguish one way or the other, but
it's written from the perspective of assuming that all PCI devices
correspond to physical devices on the host

> > On q35,
> > you'd generally expect physically separate (different slot) devices to
> > appear under separate root complexes.
> 
> This part I don't get at all, so please bear with me.
> 
> The way I read it you're claiming that eg. a SCSI controller
> and a network adapter, being physically separate and assigned
> to separate PCI slots, should have a dedicated PCIe Root
> Complex each on a q35 guest.

Right, my understanding was that if the devices were slotted, rather
than integrated, each one would sit under a separate root complex, the
root complex being a pseudo PCI to PCI bridge.

> That doesn't match with my experience, where you would simply
> assign them to separate slots of the default PCIe Root Bus
> (pcie.0), eg. 00:01.0 and 00:02.0.

The qemu default, or the libvirt default?  I think this represents
treating the devices as though they were integrated devices in the
host bridge.  I believe on q35 they would not be hotpluggable - but on
pseries they would be (because we don't use the standard hot plug
controller).

> Maybe you're referring to the fact that you might want to
> create multiple PCIe Root Complexes in order to assign the
> host devices to separate guest NUMA nodes? How is creating
> multiple PCIe Root Complexes on q35 using pxb-pcie different
> than creating multiple PHBs using spapr-pci-host-bridge on
> pseries?

Uh.. AIUI the root complex is the PCI to PCI bridge under which PCI-E
slots appear.  PXB is something different - essentially different host
bridges as you say (though with some weird hacks to access config
space, which make it dependent on the primary bus in a way which spapr
PHBs are not).

I'll admit I'm pretty confused myself about the exact distinction
between root complex, root port and upstream and downstream ports.

> > Whereas on pseries they'll
> > appear as siblings on a virtual bus (which makes no physical sense for
> > point-to-point PCI-E).
> 
> What is the virtual bus in question? Why would it matter
> that they're siblings?

On pseries it won't.  But my understanding is that libvirt won't
create them that way on q35 - instead it will insert the RCs / P2P
bridges to allow them to be hotplugged.  Inserting that bridge may
confuse pseries guests which aren't expecting it.

> I'm possibly missing the point entirely, but so far it
> looks to me like there are different configurations you
> might want to use depending on your goal, and both q35
> and pseries give you comparable tools to achieve such
> configurations.
> 
> > I suppose we could try treating all devices on pseries as though they
> > were chipset builtin devices on q35, which will appear on the root
> > PCI-E bus without root complex.  But I suspect that's likely to cause
> > trouble with hotplug, and it will certainly need different address
> > allocation from libvirt.
> 
> PCIe Integrated Endpoint Devices are not hotpluggable on
> q35, that's why libvirt will follow QEMU's PCIe topology
> recommendations and place a PCIe Root Port between them;
> I assume the same could be done for pseries guests as
> soon as QEMU grows support for generic PCIe Root Ports,
> something Marcel has already posted patches for.

Here you've hit on it.  No, we should not do that for pseries,
AFAICT.  PAPR doesn't really have the concept of integrated endpoint
devices, and all devices can be hotplugged via the PAPR mechanisms
(and none can via the PCI-E standard hotplug mechanism).

> Again, sorry for clearly misunderstanding your explanation,
> but I'm still not seeing the issue here. I'm sure it's very
> clear in your mind, but I'm afraid you're going to have to
> walk me through it :(

I wish it were entirely clear in my mind.  Like I say I'm still pretty
confused by exactly the root complex entails.

> > > Regardless of how we decide to move forward with the
> > > PCIe-enabled pseries machine type, libvirt will have to
> > > know about this so it can behave appropriately.
> > 
> > So there are kind of two extremes of how to address this.  There are a
> > variety of options in between, but I suspect they're going to be even
> > more muddled and hideous than the extremes.
> > 
> > 1) Give up.  You said there's already a flag that says a PCI-E bus is
> > able to accept vanilla-PCI devices.  We add a hack flag that says a
> > vanilla-PCI bus is able to accept PCI-E devices.  We keep address
> > allocation as it is now - the pseries topology really does resemble
> > vanilla-PCI much better than it does PCI-E. But, we allow PCI-E
> > devices, and PAPR has mechanisms for accessing the extended config
> > space.  PCI-E standard hotplug and error reporting will never work,
> > but PAPR provides its own mechanisms for those, so that should be ok.
> 
> We can definitely special-case pseries guests and take
> the "anything goes" approach to PCI vs PCIe, but it would
> certainly be nicer if we could avoid presenting our users
> the head-scratching situation of PCIe devices being plugged
> into legacy PCI slots and still showing up as PCIe in the
> guest.
> 
> What about virtio devices, which present themselves either
> as legacy PCI or PCIe depending on the kind of slot they
> are plugged into? Would they show up as PCIe or legacy PCI
> on a PCIe-enabled pseries guest?

That we'd have to address on the qemu side with some 

> > 2) Start exposing the PCI-E heirarchy for pseries guests much more
> > like q35, root complexes and all.  It's not clear that PAPR actually
> > *forbids* exposing the root complex, it just doesn't require it and
> > that's not what PowerVM does.  But.. there are big questions about
> > whether existing guests will cope with this or not.  When you start
> > adding in multiple passed through devices and particularly virtual
> > functions as well, things could get very ugly - we might need to
> > construct multiple emulated virtual root complexes or other messes.
> > 
> > In the short to medium term, I'm thinking option (1) seems pretty
> > compelling.
> 
> Is the Root Complex not currently exposed? The Root Bus
> certainly is,

Like I say, I'm fairly confused myself, but I'm pretty sure that Root
Complex != Root Bus.  The RC sits under the root bus IIRC.. or
possibly it consists of the root bus plus something under it as well.

Now... from what Laine was saying it sounds like more of the
differences between PCI-E placement and PCI placement may be
implemented by libvirt than qemu than I realized.  So possibly we do
want to make the bus be PCI-E on the qemu side, but have libvirt use
the vanilla-PCI placement guidelines rather than PCI-E for pseries.
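
(i.e. on the libvirt side, keep a

  <controller type='pci' model='pcie-root'/>

for pseries, but assign devices straight to slots on the root bus,
with no root port controllers in between.  A sketch only, of course;
the details would need working out.)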

> otherwise PCI devices won't work at all, I
> assume. And I can clearly see a pci.0 bus in the output
> of 'info qtree' for a pseries guest, and a pci.1 too if
> I add a spapr-pci-host-bridge.
> 
> Maybe I just don't quite get the relationship between Root
> Complexes and Root Buses, but I guess my question is: what
> is preventing us from simply doing whatever a
> spapr-pci-host-bridge is doing in order to expose a legacy
> PCI Root Bus (pci.*) to the guest, and create a new
> spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
> instead?

Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
bridge came up before.  I think one of us spotted a problem with that,
but I don't recall what it was now.  I guess one is how libvirt would
map its stupid-fake-domain-numbers to which root bus to use.

> > So, I'm not sure if the idea of a new machine type has legs or not,
> > but let's think it through a bit further.  Suppose we have a new
> > machine type, let's call it 'papr'.  I'm thinking it would be (at
> > least with -nodefaults) basically a super-minimal version of pseries:
> > so each PHB would have to be explicitly created, the VIO bridge would
> > have to be explicitly created, likewise the NVRAM.  Not sure about the
> > "devices" which really represent firmware features - the RTC, RNG,
> > hypervisor event source and so forth.
> > 
> > Might have some advantages.  Then again, it doesn't really solve the
> > specific problem here.  It means libvirt (or the user) has to
> > explicitly choose a PCI or PCI-E PHB to put things on,
> 
> libvirt would probably add a
> 
>   <controller type='pci' model='pcie-root'/>
> 
> to the guest XML by default, resulting in a
> spapr-pcie-host-bridge providing pcie.0 and the same
> controller / address allocation logic as q35; the user
> would be able to use
> 
>   <controller type='pci' model='pci-root'/>
> 
> instead to stick with legacy PCI. This would only matter
> when using '-nodefaults' anyway, when that flag is not
> present a PCIe (or legacy PCI) could be created by QEMU
> to make it more convenient for people that are not using
> libvirt.
> 
> Maybe we should have a different model, specific to
> pseries guests, instead, so that all PHBs would look the
> same in the guest XML, something like
> 
>   <controller type='pci' model='phb-pcie'/>
> 
> It would require shuffling libvirt's PCI address allocation
> code around quite a bit, but it should be doable. And if it
> makes life easier for our users, then it's worth it.

Hrm.  So my first inclination would be to stick with the generic
names, and map those to creating new pseries host bridges on pseries
guests.  I would have thought that would be the easier option for
users.  But I may not have realized all the implications yet.

> > but libvirt's
> > PCI-E address allocation will still be wrong in all probability.
> > 
> > Guh.
> 
> > As an aside, here's a RANT.
> [...]
> 
> Laine already addressed your points extensively, but I'd
> like to add a few thoughts of my own.
> 
> * PCI addresses for libvirt guests don't need to be stable
>   only when performing migration, but also to guarantee
>   that no change in guest ABI will happen as a consequence
>   of eg. a simple power cycle.
> 
> * Even if libvirt left all PCI address assignment to QEMU,
>   we would need a way for users to override QEMU's choices,
>   because one size never fits all and users have all kinds
>   of crazy, yet valid, requirements. So the first time we
>   run QEMU, we would have to take the backend-specific
>   format you suggest, parse it to extract the PCI addresses
>   that have been assigned, and reflect them in the guest
>   XML so that the user can change a bunch of them. Then I
>   guess we could re-encode it in the backend-specific format
>   and pass it to QEMU the next time we run it but, at that
>   point, what's the difference with simply putting the PCI
>   addresses on the command line directly?
> 
> * It's not just about the addresses, by the way, but also
>   about the controllers - what model is used, how they are
>   plugged together and so on. More stuff that would have to
>   round-trip because users need to be able to take matters
>   into their own hands.
> 
> * Design mistakes in any software, combined with strict
>   backwards compatibility requirements, make it difficult
>   to make changes in both related components and the
>   software itself, even when the changes would be very
>   beneficial. It can be very frustrating at times, but
>   it's the reality of things and unfortunately there's only
>   so much we can do about it.

I think the above I've touched on in my reply to Laine.

> * Eduardo's work, which you mentioned, is going to be very
>   beneficial in the long run; in the short run, Marcel's
>   PCIe device placement guidelines, a document that has seen
>   contributions from QEMU, OVMF and libvirt developers, have
>   been invaluable to improve libvirt's PCI address allocation
>   logic. So we're already doing better, and more improvements
>   are on the way :)

Right.. so here's the thing, I strongly suspect that Marcel's
guidelines will not be correct for pseries.  I'm not sure if they'll
be definitively wrong, or just different enough from PowerVM that it
might confuse guests, but either way.  Can you send me a link to that
document though, which might help me figure this out?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-07  4:11                       ` David Gibson
@ 2016-12-07 16:42                         ` Andrea Bolognani
  2016-12-13 12:25                           ` Marcel Apfelbaum
  0 siblings, 1 reply; 38+ messages in thread
From: Andrea Bolognani @ 2016-12-07 16:42 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth, benh,
	Marcel Apfelbaum

[Added Marcel to CC]

On Wed, 2016-12-07 at 15:11 +1100, David Gibson wrote:
> > Is the difference between q35 and pseries guests with
> > respect to PCIe only relevant when it comes to assigned
> > devices, or in general? I'm asking this because you seem to
> > focus entirely on assigned devices.
> 
> Well, in a sense that's up to us.  The only existing model we have is
> PowerVM, and PowerVM only does device passthrough, no emulated
> devices.  PAPR doesn't really distinguish one way or the other, but
> it's written from the perspective of assuming that all PCI devices
> correspond to physical devices on the host

Okay, that makes sense.

> > > On q35,
> > > you'd generally expect physically separate (different slot) devices to
> > > appear under separate root complexes.
> > 
> > This part I don't get at all, so please bear with me.
> > 
> > The way I read it you're claiming that eg. a SCSI controller
> > and a network adapter, being physically separate and assigned
> > to separate PCI slots, should have a dedicated PCIe Root
> > Complex each on a q35 guest.
> 
> Right, my understanding was that if the devices were slotted, rather
> than integrated, each one would sit under a separate root complex, the
> root complex being a pseudo PCI to PCI bridge.

I assume "slotted" means "plugged into a slot that's not one
of those provided by pcie.0" or something along those lines.

More on the root complex bit later.

> > That doesn't match with my experience, where you would simply
> > assign them to separate slots of the default PCIe Root Bus
> > (pcie.0), eg. 00:01.0 and 00:02.0.
> 
> The qemu default, or the libvirt default?

I'm talking about the libvirt default, which is supposed to
follow Marcel's PCIe Guidelines.

> I think this represents
> treating the devices as though they were integrated devices in the
> host bridge.  I believe on q35 they would not be hotpluggable

Yeah, that's indeed not quite what libvirt would do by
default: in reality, there would be an ioh3420 between the
pcie.0 slots and each device exactly to enable hotplug.

> but on
> pseries they would be (because we don't use the standard hot plug
> controller).

We can account for that in libvirt and avoid adding the
extra ioh3420s (or rather the upcoming generic PCIe Root
Ports) for pseries guests.

> > Maybe you're referring to the fact that you might want to
> > create multiple PCIe Root Complexes in order to assign the
> > host devices to separate guest NUMA nodes? How is creating
> > multiple PCIe Root Complexes on q35 using pxb-pcie different
> > than creating multiple PHBs using spapr-pci-host-bridge on
> > pseries?
> 
> Uh.. AIUI the root complex is the PCI to PCI bridge under which PCI-E
> slots appear.  PXB is something different - essentially different host
> bridges as you say (though with some weird hacks to access config
> space, which make it dependent on the primary bus in a way which spapr
> PHBs are not).
> 
> I'll admit I'm pretty confused myself about the exact distinction
> between root complex, root port and upstream and downstream ports.

I think we both need to get our terminology straight :)
I'm sure Marcel will be happy to point us in the right
direction.

My understanding is that the PCIe Root Complex is the piece
of hardware that exposes a PCIe Root Bus (pcie.0 in QEMU);
PXBs can be connected to slots in pcie.0 to create more buses
that behave, for the most part, like pcie.0 and are mostly
useful to bind devices to specific NUMA nodes. Same applies
to legacy PCI with the pxb (instead of pxb-pcie) device.

In a similar fashion, PHBs are the hardware thingies that
expose a PCI Root Bus (pci.0 and so on), the main difference
being that they are truly independent: so a q35 guest will
always have a "primary" PCIe Root Bus and (optionally) a
bunch of "secondary" ones, but the same will not be the case
for pseries guests.

I don't think the difference is that important though, at
least from libvirt's point of view: whether you're creating
a pseries guest with two PHBs, or a q35 guest with its
built-in PCIe Root Complex and an extra PCIe Expander Bus,
you will end up with two "top level" buses that you can plug
more devices into. If we had spapr-pcie-host-bridge, we
could treat them mostly the same - with caveats such as the
one described above, of course.
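
(In QEMU terms, the comparison I have in mind is
roughly

  -M q35 -device pxb-pcie,id=pxb1,bus_nr=32

versus

  -M pseries -device spapr-pci-host-bridge,index=1

either way you end up with a second "top level" bus
to populate.)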

> > > Whereas on pseries they'll
> > > appear as siblings on a virtual bus (which makes no physical sense for
> > > point-to-point PCI-E).
> > 
> > What is the virtual bus in question? Why would it matter
> > that they're siblings?
> 
> On pseries it won't.  But my understanding is that libvirt won't
> create them that way on q35 - instead it will insert the RCs / P2P
> bridges to allow them to be hotplugged.  Inserting that bridge may
> confuse pseries guests which aren't expecting it.

libvirt will automatically add PCIe Root Ports to make the
devices hotpluggable on q35 guests, yes. But, as mentioned
above, we can teach it not to.

> > I'm possibly missing the point entirely, but so far it
> > looks to me like there are different configurations you
> > might want to use depending on your goal, and both q35
> > and pseries give you comparable tools to achieve such
> > configurations.
> > > 
> > > I suppose we could try treating all devices on pseries as though they
> > > were chipset builtin devices on q35, which will appear on the root
> > > PCI-E bus without root complex.  But I suspect that's likely to cause
> > > trouble with hotplug, and it will certainly need different address
> > > allocation from libvirt.
> > 
> > PCIe Integrated Endpoint Devices are not hotpluggable on
> > q35, that's why libvirt will follow QEMU's PCIe topology
> > recommendations and place a PCIe Root Port between them;
> > I assume the same could be done for pseries guests as
> > soon as QEMU grows support for generic PCIe Root Ports,
> > something Marcel has already posted patches for.
> 
> Here you've hit on it.  No, we should not do that for pseries,
> AFAICT.  PAPR doesn't really have the concept of integrated endpoint
> devices, and all devices can be hotplugged via the PAPR mechanisms
> (and none can via the PCI-E standard hotplug mechanism).

Cool, I get it now.

> > Again, sorry for clearly misunderstanding your explanation,
> > but I'm still not seeing the issue here. I'm sure it's very
> > clear in your mind, but I'm afraid you're going to have to
> > walk me through it :(
> 
> I wish it were entirely clear in my mind.  Like I say I'm still pretty
> confused by exactly the root complex entails.

Same here, but this back-and-forth is helping! :)

[...]
> > What about virtio devices, which present themselves either
> > as legacy PCI or PCIe depending on the kind of slot they
> > are plugged into? Would they show up as PCIe or legacy PCI
> > on a PCIe-enabled pseries guest?
> 
> That we'd have to address on the qemu side with some 

Unfinished sentence?

[...]
> > Is the Root Complex not currently exposed? The Root Bus
> > certainly is,
> 
> Like I say, I'm fairly confused myself, but I'm pretty sure that Root
> Complex != Root Bus.  The RC sits under the root bus IIRC.. or
> possibly it consists of the root bus plus something under it as well.
> 
> Now... from what Laine was saying it sounds like more of the
> differences between PCI-E placement and PCI placement may be
> implemented by libvirt than qemu than I realized.  So possibly we do
> want to make the bus be PCI-E on the qemu side, but have libvirt use
> the vanilla-PCI placement guidelines rather than PCI-E for pseries.

Basically the special casing I was mentioning earlier.

[...]
> > Maybe I just don't quite get the relationship between Root
> > Complexes and Root Buses, but I guess my question is: what
> > is preventing us from simply doing whatever a
> > spapr-pci-host-bridge is doing in order to expose a legacy
> > PCI Root Bus (pci.*) to the guest, and create a new
> > spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
> > instead?
> 
> Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
> bridge came up before.  I think one of us spotted a problem with that,
> but I don't recall what it was now.  I guess one is how libvirt would
> map its stupid-fake-domain-numbers to which root bus to use.

That issue is relevant whether or not we have different PHB
flavors, isn't it? As soon as multiple PHBs are present in
a pseries guest, multiple PCI domains will be there as well,
and we need to handle that somehow.

On q35, on the other hand, I haven't been able to find a way
to create extra PCI domains: adding a pxb-pcie certainly
didn't work the same as adding an extra spapr-pci-host-bridge
in that regard.
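
(To make it concrete: with a second
spapr-pci-host-bridge the guest will see devices at
addresses like 0001:00:01.0, whereas anything behind
a pxb-pcie still shows up under domain 0000.)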

[...]
> > Maybe we should have a different model, specific to
> > pseries guests, instead, so that all PHBs would look the
> > same in the guest XML, something like
> > 
> >   <controller type='pci' model='phb-pcie'/>
> > 
> > It would require shuffling libvirt's PCI address allocation
> > code around quite a bit, but it should be doable. And if it
> > makes life easier for our users, then it's worth it.
> 
> Hrm.  So my first inclination would be to stick with the generic
> names, and map those to creating new pseries host bridges on pseries
> guests.  I would have thought that would be the easier option for
> users.  But I may not have realized all the implications yet.

You're probably right, but I can't immediately see how we
would make the user aware of which PHB is which. Maybe we
could add some sub-element or extra attribute...

Anyway, we should not focus too much on this specific bit
at the moment, deciding on a specific XML is mostly
bikeshedding :)

[...]
> > * Eduardo's work, which you mentioned, is going to be very
> >   beneficial in the long run; in the short run, Marcel's
> >   PCIe device placement guidelines, a document that has seen
> >   contributions from QEMU, OVMF and libvirt developers, have
> >   been invaluable to improve libvirt's PCI address allocation
> >   logic. So we're already doing better, and more improvements
> >   are on the way :)
> 
> Right.. so here's the thing, I strongly suspect that Marcel's
> guidelines will not be correct for pseries.  I'm not sure if they'll
> be definitively wrong, or just different enough from PowerVM that it
> might confuse guests, but either way.

Those guidelines have been developed with q35/mach-virt in
mind[1], so I wouldn't at all be surprised if they didn't
apply to pseries guests. And in fact, we just found out
that they don't!

My point is that we could easily create a similar document
for pseries guests, and then libvirt will be able to pick
up whatever recommendations we come up with just like it
did for q35/mach-virt.

> Can you send me a link to that
> document though, which might help me figure this out?

It's docs/pcie.txt in QEMU's git repository.


[1] Even though I now realize that this is not immediately
    clear by looking at the document itself
-- 
Andrea Bolognani / Red Hat / Virtualization

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-07 16:42                         ` Andrea Bolognani
@ 2016-12-13 12:25                           ` Marcel Apfelbaum
  2016-12-13 13:15                             ` Greg Kurz
                                               ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Marcel Apfelbaum @ 2016-12-13 12:25 UTC (permalink / raw)
  To: Andrea Bolognani, David Gibson
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth, benh,
	Marcel Apfelbaum

On 12/07/2016 06:42 PM, Andrea Bolognani wrote:
> [Added Marcel to CC]
>


Hi,

Sorry for the late reply.

> On Wed, 2016-12-07 at 15:11 +1100, David Gibson wrote:
>>> Is the difference between q35 and pseries guests with
>>> respect to PCIe only relevant when it comes to assigned
>>> devices, or in general? I'm asking this because you seem to
>>> focus entirely on assigned devices.
>>
>> Well, in a sense that's up to us.  The only existing model we have is
>> PowerVM, and PowerVM only does device passthrough, no emulated
>> devices.  PAPR doesn't really distinguish one way or the other, but
>> it's written from the perspective of assuming that all PCI devices
>> correspond to physical devices on the host
>
> Okay, that makes sense.
>
>>>> On q35,
>>>> you'd generally expect physically separate (different slot) devices to
>>>> appear under separate root complexes.
>>>
>>> This part I don't get at all, so please bear with me.
>>>
>>> The way I read it you're claiming that eg. a SCSI controller
>>> and a network adapter, being physically separate and assigned
>>> to separate PCI slots, should have a dedicated PCIe Root
>>> Complex each on a q35 guest.
>>

Not a PCIe Root Complex, but a PCIe Root port.

>> Right, my understanding was that if the devices were slotted, rather
>> than integrated, each one would sit under a separate root complex, the
>> root complex being a pseudo PCI to PCI bridge.
>
> I assume "slotted" means "plugged into a slot that's not one
> of those provided by pcie.0" or something along those lines.
>
> More on the root complex bit later.
>
>>> That doesn't match with my experience, where you would simply
>>> assign them to separate slots of the default PCIe Root Bus
>>> (pcie.0), eg. 00:01.0 and 00:02.0.
>>
>> The qemu default, or the libvirt default?
>
> I'm talking about the libvirt default, which is supposed to
> follow Marcel's PCIe Guidelines.
>
>> I think this represents
>> treating the devices as though they were integrated devices in the
>> host bridge.  I believe on q35 they would not be hotpluggable
>

Correct. Please have a look at the new document
regarding pcie: docs/pcie.txt and the corresponding presentations.


> Yeah, that's indeed not quite what libvirt would do by
> default: in reality, there would be an ioh3420 between the
> pcie.0 slots and each device exactly to enable hotplug.
>
>> but on
>> pseries they would be (because we don't use the standard hot plug
>> controller).
>
> We can account for that in libvirt and avoid adding the
> extra ioh3420s (or rather the upcoming generic PCIe Root
> Ports) for pseries guests.
>
>>> Maybe you're referring to the fact that you might want to
>>> create multiple PCIe Root Complexes in order to assign the
>>> host devices to separate guest NUMA nodes? How is creating
>>> multiple PCIe Root Complexes on q35 using pxb-pcie different
>>> than creating multiple PHBs using spapr-pci-host-bridge on
>>> pseries?
>>
>> Uh.. AIUI the root complex is the PCI to PCI bridge under which PCI-E
>> slots appear.  PXB is something different - essentially different host
>> bridges as you say (though with some weird hacks to access config
>> space, which make it dependent on the primary bus in a way which spapr
>> PHBs are not).
>>
>> I'll admit I'm pretty confused myself about the exact distinction
>> between root complex, root port and upstream and downstream ports.
>
> I think we both need to get our terminology straight :)
> I'm sure Marcel will be happy to point us in the right
> direction.
>
> My understanding is that the PCIe Root Complex is the piece
> of hardware that exposes a PCIe Root Bus (pcie.0 in QEMU);

right

> PXBs can be connected to slots in pcie.0 to create more buses
> that behave, for the most part, like pcie.0 and are mostly
> useful to bind devices to specific NUMA nodes.

right

> Same applies
> to legacy PCI with the pxb (instead of pxb-pcie) device.
>

pxb should not be used for PCIe machines, only for legacy PCI ones.
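
(That is: -device pxb,bus_nr=... on a legacy PCI machine such as
i440fx, -device pxb-pcie,bus_nr=... on q35.)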

> In a similar fashion, PHBs are the hardware thingies that
> expose a PCI Root Bus (pci.0 and so on), the main difference
> being that they are truly independent: so a q35 guest will
> always have a "primary" PCIe Root Bus and (optionally) a
> bunch of "secondary" ones, but the same will not be the case
> for pseries guests.
>

OK

> I don't think the difference is that important though, at
> least from libvirt's point of view: whether you're creating
> a pseries guest with two PHBs, or a q35 guest with its
> built-in PCIe Root Complex and an extra PCIe Expander Bus,
> you will end up with two "top level" buses that you can plug
> more devices into.

I agree

> If we had spapr-pcie-host-bridge, we
> could treat them mostly the same - with caveats such as the
> one described above, of course.
>
>>>> Whereas on pseries they'll
>>>> appear as siblings on a virtual bus (which makes no physical sense for
>>>> point-to-point PCI-E).
>>>
>>> What is the virtual bus in question? Why would it matter
>>> that they're siblings?
>>
>> On pseries it won't.  But my understanding is that libvirt won't
>> create them that way on q35 - instead it will insert the RCs / P2P
>> bridges to allow them to be hotplugged.  Inserting that bridge may
>> confuse pseries guests which aren't expecting it.
>
> libvirt will automatically add PCIe Root Ports to make the
> devices hotpluggable on q35 guests, yes. But, as mentioned
> above, we can teach it not to.
>
>>> I'm possibly missing the point entirely, but so far it
>>> looks to me like there are different configurations you
>>> might want to use depending on your goal, and both q35
>>> and pseries give you comparable tools to achieve such
>>> configurations.
>>>>
>>>> I suppose we could try treating all devices on pseries as though they
>>>> were chipset builtin devices on q35, which will appear on the root
>>>> PCI-E bus without root complex.

Actually the root PCIe bus is part of a root complex.

>>>> But I suspect that's likely to cause
>>>> trouble with hotplug, and it will certainly need different address
>>>> allocation from libvirt.
>>>
>>> PCIe Integrated Endpoint Devices are not hotpluggable on
>>> q35, that's why libvirt will follow QEMU's PCIe topology
>>> recommendations and place a PCIe Root Port between them;
>>> I assume the same could be done for pseries guests as
>>> soon as QEMU grows support for generic PCIe Root Ports,
>>> something Marcel has already posted patches for.
>>
>> Here you've hit on it.  No, we should not do that for pseries,
>> AFAICT.  PAPR doesn't really have the concept of integrated endpoint
>> devices, and all devices can be hotplugged via the PAPR mechanisms
>> (and none can via the PCI-E standard hotplug mechanism).
>

This seems to be interfering with the PCIe spec:
   1. No PCIe root ports?  Those are part of the spec.
   2. Only integrated devices?  Then hotplug is not PCIe-native?

> Cool, I get it now.
>
>>> Again, sorry for clearly misunderstanding your explanation,
>>> but I'm still not seeing the issue here. I'm sure it's very
>>> clear in your mind, but I'm afraid you're going to have to
>>> walk me through it :(
>>
>> I wish it were entirely clear in my mind.  Like I say I'm still pretty
>> confused by exactly the root complex entails.
>
> Same here, but this back-and-forth is helping! :)
>
> [...]
>>> What about virtio devices, which present themselves either
>>> as legacy PCI or PCIe depending on the kind of slot they
>>> are plugged into? Would they show up as PCIe or legacy PCI
>>> on a PCIe-enabled pseries guest?
>>
>> That we'd have to address on the qemu side with some
>
> Unfinished sentence?
>
> [...]
>>> Is the Root Complex not currently exposed? The Root Bus
>>> certainly is,
>>
>> Like I say, I'm fairly confused myself, but I'm pretty sure that Root
>> Complex != Root Bus.  The RC sits under the root bus IIRC.. or
>> possibly it consists of the root bus plus something under it as well.
>>

The Root complex includes the PCI bus, some configuration registers if
needed, provides access to the configuration space, translates relevant CPU
reads/writes to PCI(e) transactions...

>> Now... from what Laine was saying it sounds like more of the
>> differences between PCI-E placement and PCI placement may be
>> implemented by libvirt than qemu than I realized.  So possibly we do
>> want to make the bus be PCI-E on the qemu side, but have libvirt use
>> the vanilla-PCI placement guidelines rather than PCI-E for pseries.
>
> Basically the special casing I was mentioning earlier.

That looks complicated.. I wish I knew more about the pseries
PCIe stuff; does anyone know where I can get the info? (besides 'google it'...)

>
> [...]
>>> Maybe I just don't quite get the relationship between Root
>>> Complexes and Root Buses, but I guess my question is: what
>>> is preventing us from simply doing whatever a
>>> spapr-pci-host-bridge is doing in order to expose a legacy
>>> PCI Root Bus (pci.*) to the guest, and create a new
>>> spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
>>> instead?
>>
>> Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
>> bridge came up before.  I think one of us spotted a problem with that,
>> but I don't recall what it was now.  I guess one is how libvirt would
>> map its stupid-fake-domain-numbers to which root bus to use.
>

This would be a weird configuration, I never heard of something like that
on a bare metal machine, but I never worked on pseries, who knows...


> That issue is relevant whether or not we have different PHB
> flavors, isn't it? As soon as multiple PHBs are present in
> a pseries guest, multiple PCI domains will be there as well,
> and we need to handle that somehow.
>
> On q35, on the other hand, I haven't been able to find a way
> to create extra PCI domains: adding a pxb-pcie certainly
> didn't work the same as adding an extra spapr-pci-host-bridge
> in that regard.
>

Indeed, all the pxb-pcie devices "share" the same domain.

> [...]
>>> Maybe we should have a different model, specific to
>>> pseries guests, instead, so that all PHBs would look the
>>> same in the guest XML, something like
>>>
>>>    <controller type='pci' model='phb-pcie'/>
>>>
>>> It would require shuffling libvirt's PCI address allocation
>>> code around quite a bit, but it should be doable. And if it
>>> makes life easier for our users, then it's worth it.
>>
>> Hrm.  So my first inclination would be to stick with the generic
>> names, and map those to creating new pseries host bridges on pseries
>> guests.  I would have thought that would be the easier option for
>> users.  But I may not have realized all the implications yet.
>
> You're probably right, but I can't immediately see how we
> would make the user aware of which PHB is which. Maybe we
> could add some sub-element or extra attribute...
>
> Anyway, we should not focus too much on this specific bit
> at the moment, deciding on a specific XML is mostly
> bikeshedding :)
>
> [...]
>>> * Eduardo's work, which you mentioned, is going to be very
>>>    beneficial in the long run; in the short run, Marcel's
>>>    PCIe device placement guidelines, a document that has seen
>>>    contributions from QEMU, OVMF and libvirt developers, have
>>>    been invaluable to improve libvirt's PCI address allocation
>>>    logic. So we're already doing better, and more improvements
>>>    are on the way :)
>>
>> Right.. so here's the thing, I strongly suspect that Marcel's
>> guidelines will not be correct for pseries.

We should make the document stick for all PCIe archs; if we need
to modify it, let's do it.

>> I'm not sure if they'll
>> be definitively wrong, or just different enough from PowerVM that it
>> might confuse guests, but either way.
>

I really need to understand how it would confuse the guests;
it does not deviate from the PCIe spec, it only adds some restrictions.

> Those guidelines have been developed with q35/mach-virt in
> mind[1], so I wouldn't at all be surprised if they didn't
> apply to pseries guests. And in fact, we just found out
> that they don't!
>
> My point is that we could easily create a similar document
> for pseries guests, and then libvirt will be able to pick
> up whatever recommendations we come up with just like it
> did for q35/mach-virt.
>
>> Can you send me a link to that
>> document though, which might help me figure this out?
>
> It's docs/pcie.txt in QEMU's git repository.
>
>
> [1] Even though I now realize that this is not immediately
>     clear by looking at the document itself


I kind of miss the core issue: what is the main problem?

Thanks,
Marcel

> --
> Andrea Bolognani / Red Hat / Virtualization
>


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-13 12:25                           ` Marcel Apfelbaum
@ 2016-12-13 13:15                             ` Greg Kurz
  2016-12-13 15:15                             ` Benjamin Herrenschmidt
  2016-12-14  2:46                             ` David Gibson
  2 siblings, 0 replies; 38+ messages in thread
From: Greg Kurz @ 2016-12-13 13:15 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: Andrea Bolognani, David Gibson, Alexey Kardashevskiy,
	Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	libvir-list, Michael Roth, benh, Marcel Apfelbaum

On Tue, 13 Dec 2016 14:25:44 +0200
Marcel Apfelbaum <marcel@redhat.com> wrote:

> >> Now... from what Laine was saying it sounds like more of the
> >> differences between PCI-E placement and PCI placement may be
> >> implemented by libvirt than qemu than I realized.  So possibly we do
> >> want to make the bus be PCI-E on the qemu side, but have libvirt use
> >> the vanilla-PCI placement guidelines rather than PCI-E for pseries.  
> >
> > Basically the special casing I was mentioning earlier.  
> 
> That looks complicated.. I wish I knew more about the pseries
> PCIe stuff; does anyone know where I can get the info? (besides 'google it'...)

This is described in the LoPAPR specification, available at:

https://openpowerfoundation.org/wp-content/uploads/2016/05/LoPAPR_DRAFT_v11_24March2016_cmt1.pdf

Cheers.

--
Greg

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-13 12:25                           ` Marcel Apfelbaum
  2016-12-13 13:15                             ` Greg Kurz
@ 2016-12-13 15:15                             ` Benjamin Herrenschmidt
  2016-12-14  2:48                               ` David Gibson
  2016-12-14 12:02                               ` Marcel Apfelbaum
  2016-12-14  2:46                             ` David Gibson
  2 siblings, 2 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2016-12-13 15:15 UTC (permalink / raw)
  To: Marcel Apfelbaum, Andrea Bolognani, David Gibson
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth,
	Marcel Apfelbaum

On Tue, 2016-12-13 at 14:25 +0200, Marcel Apfelbaum wrote:
> > > Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
> > > bridge came up before.  I think one of us spotted a problem with that,
> > > but I don't recall what it was now.  I guess one is how libvirt would
> > > map its stupid-fake-domain-numbers to which root bus to use.
> 
> This would be a weird configuration, I never heard of something like that
> on a bare metal machine, but I never worked on pseries, who knows...

That's the thing though, it's *not* bare metal ;-)

On bare metal, we use the "powernv" platform, for which we haven't yet upstreamed
the PCIe support, but there we have real PCIe root ports with everything that's
expected on them.

They come on physically different PHB, one root complex per PHB, but they are
"proper" PCIe. Hotplug is a bit weird still because it has to go through some
FW interactions as the HW doesn't use the PCIe standard slot control bits (and
our qemu model doesn't handle it yet) but otherwise it's standard.

"pseries" on the other hand is a paravirtualized platform. It's the environment
presented to guests by our propritary hypervisor (PowerVM, aka pHyp) and
KVM/qemu. I think there's a public spec these days but to cut the story short,
it doesn't expose PCIe "properly" for bad historical reasons.

It tends to hide the parent: i.e., the root port for slots connected directly to
a PHB, or the entire hierarchy from the root down to (and including) the
downstream switch port for slots under a switch.

So you end up with PCIe devices devoid of a proper "parent" which is a bit
weird.

Hotplug is implemented using specific firmware interfaces that are identical
to those on pre-PCIe (i.e. PCI/PCI-X) systems. The FW has interfaces for
supporting 3 kinds of hotplug:

   - Entire PHBs
   - P2P Bridges
   - Individual devices on an existing PHB or P2P bridge

In practice these days mostly the first one is used for everything under pHyp,
though I think we have a mix of 1 and 3 for KVM/qemu.

Cheers,
Ben.

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-13 12:25                           ` Marcel Apfelbaum
  2016-12-13 13:15                             ` Greg Kurz
  2016-12-13 15:15                             ` Benjamin Herrenschmidt
@ 2016-12-14  2:46                             ` David Gibson
  2016-12-14 18:26                               ` Marcel Apfelbaum
  2 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2016-12-14  2:46 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: Andrea Bolognani, Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini,
	Alex Williamson, qemu-ppc, qemu-devel, libvir-list, Michael Roth,
	benh, Marcel Apfelbaum

On Tue, Dec 13, 2016 at 02:25:44PM +0200, Marcel Apfelbaum wrote:
> On 12/07/2016 06:42 PM, Andrea Bolognani wrote:
> > [Added Marcel to CC]
> > 
> 
> 
> Hi,
> 
> Sorry for the late reply.
> 
> > On Wed, 2016-12-07 at 15:11 +1100, David Gibson wrote:
> > > > Is the difference between q35 and pseries guests with
> > > > respect to PCIe only relevant when it comes to assigned
> > > > devices, or in general? I'm asking this because you seem to
> > > > focus entirely on assigned devices.
> > > 
> > > Well, in a sense that's up to us.  The only existing model we have is
> > > PowerVM, and PowerVM only does device passthrough, no emulated
> > > devices.  PAPR doesn't really distinguish one way or the other, but
> > > it's written from the perspective of assuming that all PCI devices
> > > correspond to physical devices on the host
> > 
> > Okay, that makes sense.
> > 
> > > > > On q35,
> > > > > you'd generally expect physically separate (different slot) devices to
> > > > > appear under separate root complexes.
> > > > 
> > > > This part I don't get at all, so please bear with me.
> > > > 
> > > > The way I read it you're claiming that eg. a SCSI controller
> > > > and a network adapter, being physically separate and assigned
> > > > to separate PCI slots, should have a dedicated PCIe Root
> > > > Complex each on a q35 guest.
> > > 
> 
> Not a PCIe Root Complex, but a PCIe Root port.

Ah, sorry.  As I said, I've been pretty confused by all the terminology.

> > > Right, my understanding was that if the devices were slotted, rather
> > > than integrated, each one would sit under a separate root complex, the
> > > root complex being a pseudo PCI to PCI bridge.
> > 
> > I assume "slotted" means "plugged into a slot that's not one
> > of those provided by pcie.0" or something along those lines.
> > 
> > More on the root complex bit later.
> > 
> > > > That doesn't match with my experience, where you would simply
> > > > assign them to separate slots of the default PCIe Root Bus
> > > > (pcie.0), eg. 00:01.0 and 00:02.0.
> > > 
> > > The qemu default, or the libvirt default?
> > 
> > I'm talking about the libvirt default, which is supposed to
> > follow Marcel's PCIe Guidelines.
> > 
> > > I think this represents
> > > treating the devices as though they were integrated devices in the
> > > host bridge.  I believe on q35 they would not be hotpluggable
> > 
> 
> Correct. Please have a look at the new document
> regarding pcie: docs/pcie.txt and the corresponding presentations.
> 
> 
> > Yeah, that's indeed not quite what libvirt would do by
> > default: in reality, there would be an ioh3420 between the
> > pcie.0 slots and each device exactly to enable hotplug.
> > 
> > > but on
> > > pseries they would be (because we don't use the standard hot plug
> > > controller).
> > 
> > We can account for that in libvirt and avoid adding the
> > extra ioh3420s (or rather the upcoming generic PCIe Root
> > Ports) for pseries guests.
> > 
> > > > Maybe you're referring to the fact that you might want to
> > > > create multiple PCIe Root Complexes in order to assign the
> > > > host devices to separate guest NUMA nodes? How is creating
> > > > multiple PCIe Root Complexes on q35 using pxb-pcie different
> > > > than creating multiple PHBs using spapr-pci-host-bridge on
> > > > pseries?
> > > 
> > > Uh.. AIUI the root complex is the PCI to PCI bridge under which PCI-E
> > > slots appear.  PXB is something different - essentially different host
> > > bridges as you say (though with some weird hacks to access config
> > > space, which make it dependent on the primary bus in a way which spapr
> > > PHBs are not).
> > > 
> > > I'll admit I'm pretty confused myself about the exact distinction
> > > between root complex, root port and upstream and downstream ports.
> > 
> > I think we both need to get our terminology straight :)
> > I'm sure Marcel will be happy to point us in the right
> > direction.
> > 
> > My understanding is that the PCIe Root Complex is the piece
> > of hardware that exposes a PCIe Root Bus (pcie.0 in QEMU);
> 
> right

Oh.. I wasn't as clear as I'd like to be on what the root complex is.
But I thought the root complex did have some guest visible presence in
the PCI tree.  What you're describing here seems equivalent to what
I'd call the PCI Host Bridge (== PHB).

> > PXBs can be connected to slots in pcie.0 to create more buses
> > that behave, for the most part, like pcie.0 and are mostly
> > useful to bind devices to specific NUMA nodes.
> 
> right
> 
> > Same applies
> > to legacy PCI with the pxb (instead of pxb-pcie) device.
> > 
> 
> pxb should not be used for PCIe machines, only for legacy PCI ones.

Noted.  And not for pseries at all.  Note that because we have a
para-virtualized platform (all PCI config access goes via hypercalls)
the distinction between PCI and PCI-E is much blurrier than in the x86
case.
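
(Concretely: the guest reads and writes config space via the
ibm,read-pci-config / ibm,write-pci-config RTAS calls, rather than via
0xCF8/0xCFC or ECAM MMIO, so the access mechanism itself never tells
the guest whether a device is express.)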

> > In a similar fashion, PHBs are the hardware thingies that
> > expose a PCI Root Bus (pci.0 and so on), the main difference
> > being that they are truly independent: so a q35 guest will
> > always have a "primary" PCIe Root Bus and (optionally) a
> > bunch of "secondary" ones, but the same will not be the case
> > for pseries guests.
> 
> OK
> 
> > I don't think the difference is that important though, at
> > least from libvirt's point of view: whether you're creating
> > a pseries guest with two PHBs, or a q35 guest with its
> > built-in PCIe Root Complex and an extra PCIe Expander Bus,
> > you will end up with two "top level" buses that you can plug
> > more devices into.
> 
> I agree
> 
> > If we had spapr-pcie-host-bridge, we
> > could treat them mostly the same - with caveats such as the
> > one described above, of course.
> > 
> > > > > Whereas on pseries they'll
> > > > > appear as siblings on a virtual bus (which makes no physical sense for
> > > > > point-to-point PCI-E).
> > > > 
> > > > What is the virtual bus in question? Why would it matter
> > > > that they're siblings?
> > > 
> > > On pseries it won't.  But my understanding is that libvirt won't
> > > create them that way on q35 - instead it will insert the RCs / P2P
> > > bridges to allow them to be hotplugged.  Inserting that bridge may
> > > confuse pseries guests which aren't expecting it.
> > 
> > libvirt will automatically add PCIe Root Ports to make the
> > devices hotpluggable on q35 guests, yes. But, as mentioned
> > above, we can teach it not to.
> > 
> > > > I'm possibly missing the point entirely, but so far it
> > > > looks to me like there are different configurations you
> > > > might want to use depending on your goal, and both q35
> > > > and pseries give you comparable tools to achieve such
> > > > configurations.
> > > > > 
> > > > > I suppose we could try treating all devices on pseries as though they
> > > > > were chipset builtin devices on q35, which will appear on the root
> > > > > PCI-E bus without root complex.
> 
> Actually the root PCIe bus is part of a root complex.

So I think what I meant above was "root port".  The point is that
there won't be the (pseudo) PCI to PCI bridge appearing above the
device that there typically would be on q35.

> > > > > But I suspect that's likely to cause
> > > > > trouble with hotplug, and it will certainly need different address
> > > > > allocation from libvirt.
> > > > 
> > > > PCIe Integrated Endpoint Devices are not hotpluggable on
> > > > q35, that's why libvirt will follow QEMU's PCIe topology
> > > > recommendations and place a PCIe Root Port between them;
> > > > I assume the same could be done for pseries guests as
> > > > soon as QEMU grows support for generic PCIe Root Ports,
> > > > something Marcel has already posted patches for.
> > > 
> > > Here you've hit on it.  No, we should not do that for pseries,
> > > AFAICT.  PAPR doesn't really have the concept of integrated endpoint
> > > devices, and all devices can be hotplugged via the PAPR mechanisms
> > > (and none can via the PCI-E standard hotplug mechanism).
> > 
> 
> This seems to be interfering with the PCIe spec:
>   1. No PCIe root ports?  Those are part of the spec.

Yes, I dare say it does interfere with the spec.  Nonetheless, there
it is.

>   2. Only integrated devices?  Then hotplug is not PCIe-native?

That's correct.  PAPR supplies its own hotplug mechanism, which works
for both PCI and PCI-E devices and is different from the standard
PCI-E hotplug mechanism.
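
(From the management side it still looks like an ordinary hotplug,
e.g. at the monitor:

  (qemu) device_add virtio-net-pci,bus=pci.0,id=hotnet0

but the guest is notified through the PAPR hotplug event mechanism,
not through the PCI-E slot registers.)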

> 
> > Cool, I get it now.
> > 
> > > > Again, sorry for clearly misunderstanding your explanation,
> > > > but I'm still not seeing the issue here. I'm sure it's very
> > > > clear in your mind, but I'm afraid you're going to have to
> > > > walk me through it :(
> > > 
> > > I wish it were entirely clear in my mind.  Like I say I'm still pretty
> > > confused by exactly the root complex entails.
> > 
> > Same here, but this back-and-forth is helping! :)
> > 
> > [...]
> > > > What about virtio devices, which present themselves either
> > > > as legacy PCI or PCIe depending on the kind of slot they
> > > > are plugged into? Would they show up as PCIe or legacy PCI
> > > > on a PCIe-enabled pseries guest?
> > > 
> > > That we'd have to address on the qemu side with some
> > 
> > Unfinished sentence?
> > 
> > [...]
> > > > Is the Root Complex not currently exposed? The Root Bus
> > > > certainly is,
> > > 
> > > Like I say, I'm fairly confused myself, but I'm pretty sure that Root
> > > Complex != Root Bus.  The RC sits under the root bus IIRC.. or
> > > possibly it consists of the root bus plus something under it as well.
> > > 
> 
> The Root complex includes the PCI bus, some configuration registers if
> needed, provides access to the configuration space, translates relevant CPU
> reads/writes to PCI(e) transactions...

Do those configuration registers appear within PCI space, or outside
it (e.g. raw MMIO or PIO registers)?

> > > Now... from what Laine was saying it sounds like more of the
> > > differences between PCI-E placement and PCI placement may be
> > > implemented by libvirt, rather than qemu, than I realized.  So possibly we do
> > > want to make the bus be PCI-E on the qemu side, but have libvirt use
> > > the vanilla-PCI placement guidelines rather than PCI-E for pseries.
> > 
> > Basically the special casing I was mentioning earlier.
> 
> That looks complicated.. I wish I knew more about the pseries
> PCIe stuff, does anyone know where I can get the info? (besides 'google it'...)

Andrea gave a pointer to the PAPR document.  Unfortunately how much it
covers here I'm not sure about.  In particular I'm not sure how much
of this is actually PAPR mandated, and how much is just copying
PowerVM as the pre-existing PAPR implementation.

> 
> > 
> > [...]
> > > > Maybe I just don't quite get the relationship between Root
> > > > Complexes and Root Buses, but I guess my question is: what
> > > > is preventing us from simply doing whatever a
> > > > spapr-pci-host-bridge is doing in order to expose a legacy
> > > > PCI Root Bus (pci.*) to the guest, and create a new
> > > > spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
> > > > instead?
> > > 
> > > Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
> > > bridge came up before.  I think one of us spotted a problem with that,
> > > but I don't recall what it was now.  I guess one is how libvirt would
> > > map its stupid-fake-domain-numbers to which root bus to use.
> 
> This would be a weird configuration, I never heard of something like that
> on a bare metal machine, but I never worked on pseries, who knows...

Which aspect?  Having multiple independent host bridges is perfectly
reasonable - x86 just doesn't do it well for rather stupid historical
reasons.

PAPR is quite explicitly a paravirtual platform, you cannot have a
bare-metal PAPR machine.

> > That issue is relevant whether or not we have different PHB
> > flavors, isn't it? As soon as multiple PHBs are present in
> > a pseries guest, multiple PCI domains will be there as well,
> > and we need to handle that somehow.
> > 
> > On q35, on the other hand, I haven't been able to find a way
> > to create extra PCI domains: adding a pxb-pcie certainly
> > didn't work the same as adding an extra spapr-pci-host-bridge
> > in that regard.
> > 
> 
> Indeed, all the pxb-pcie devices "share" the same domain.
> 
> > [...]
> > > > Maybe we should have a different model, specific to
> > > > pseries guests, instead, so that all PHBs would look the
> > > > same in the guest XML, something like
> > > > 
> > > >    <controller type='pci' model='phb-pcie'/>
> > > > 
> > > > It would require shuffling libvirt's PCI address allocation
> > > > code around quite a bit, but it should be doable. And if it
> > > > makes life easier for our users, then it's worth it.
> > > 
> > > Hrm.  So my first inclination would be to stick with the generic
> > > names, and map those to creating new pseries host bridges on pseries
> > > guests.  I would have thought that would be the easier option for
> > > users.  But I may not have realized all the implications yet.
> > 
> > You're probably right, but I can't immediately see how we
> > would make the user aware of which PHB is which. Maybe we
> > could add some sub-element or extra attribute...
> > 
> > Anyway, we should not focus too much on this specific bit
> > at the moment, deciding on a specific XML is mostly
> > bikeshedding :)
> > 
> > [...]
> > > > * Eduardo's work, which you mentioned, is going to be very
> > > >    beneficial in the long run; in the short run, Marcel's
> > > >    PCIe device placement guidelines, a document that has seen
> > > >    contributions from QEMU, OVMF and libvirt developers, have
> > > >    been invaluable to improve libvirt's PCI address allocation
> > > >    logic. So we're already doing better, and more improvements
> > > >    are on the way :)
> > > 
> > > Right.. so here's the thing, I strongly suspect that Marcel's
> > > guidelines will not be correct for pseries.
> 
> We should make the document hold for all PCIe archs; if we need
> to modify it, let's do it.

Yeah, I'm not sure it's possible to cover both x86 and pseries at
once.  As you noted, it looks rather like PAPR is contradicting the
PCI-E spec.

Again, one possible option here is to continue to treat pseries as
having a vanilla-PCI bus, but with a special flag saying that it's
magically able to connect PCI-E devices.

> > > I'm not sure if they'll
> > > be definitively wrong, or just different enough from PowerVM that it
> > > might confuse guests, but either way.
> > 
> 
> I really need to understand how it would confuse the guests;
> it does not deviate from the PCIe spec, it only adds some restrictions.

Because the guests are written to work with PowerVM, which seems to do
something other than the PCIe spec...

> > Those guidelines have been developed with q35/mach-virt in
> > mind[1], so I wouldn't at all be surprised if they didn't
> > apply to pseries guests. And in fact, we just found out
> > that they don't!
> > 
> > My point is that we could easily create a similar document
> > for pseries guests, and then libvirt will be able to pick
> > up whatever recommendations we come up with just like it
> > did for q35/mach-virt.
> > 
> > > Can you send me a link to that
> > > document though, which might help me figure this out.
> > 
> > It's docs/pcie.txt in QEMU's git repository.
> > 
> > 
> > [1] Even though I now realize that this is not immediately
> >     clear by looking at the document itself
> 
> 
> I'm kind of missing the core issue, what is the main problem?

The core problem is that at this stage it's not possible to attach
PCIe devices (either emulated or passthrough) to a pseries guest.  We
need to be able to do that - specifically allowing the guest to access
PCIe extended config space.

The PAPR virtualized PCI interfaces definitely do allow a guest to
access extended config space, but in most other regards they behave
more like vanilla-PCI than PCIe.
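
To make that concrete on the QEMU side (a simplified sketch of the
helper in include/hw/pci/pci.h, not the full logic): only a device
that qualifies as express gets the 4KiB extended config space,
otherwise it is capped at the conventional 256 bytes.

/* Simplified from include/hw/pci/pci.h: config space size depends
 * on whether the device is treated as PCI Express. */
#define PCI_CONFIG_SPACE_SIZE   0x100    /* conventional PCI: 256 bytes */
#define PCIE_CONFIG_SPACE_SIZE  0x1000   /* PCIe: 4KiB, incl. extended  */

static inline int pci_config_size(const PCIDevice *d)
{
    return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE
                             : PCI_CONFIG_SPACE_SIZE;
}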

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-13 15:15                             ` Benjamin Herrenschmidt
@ 2016-12-14  2:48                               ` David Gibson
  2016-12-14 12:02                               ` Marcel Apfelbaum
  1 sibling, 0 replies; 38+ messages in thread
From: David Gibson @ 2016-12-14  2:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Marcel Apfelbaum, Andrea Bolognani, Alexey Kardashevskiy,
	Greg Kurz, Paolo Bonzini, Alex Williamson, qemu-ppc, qemu-devel,
	libvir-list, Michael Roth, Marcel Apfelbaum


On Tue, Dec 13, 2016 at 09:15:37AM -0600, Benjamin Herrenschmidt wrote:
> On Tue, 2016-12-13 at 14:25 +0200, Marcel Apfelbaum wrote:
> > > > Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
> > > > bridge came up before.  I think one of us spotted a problem with that,
> > > > but I don't recall what it was now.  I guess one is how libvirt would
> > > > map its stupid-fake-domain-numbers to which root bus to use.
> > 
> > This would be a weird configuration, I never heard of something like that
> > on a bare metal machine, but I never worked on pseries, who knows...
> 
> That's the thing though, it's *not* bare metal ;-)
> 
> On bare metal, we use the "powernv" platform for which we haven't yet upstreamed
> the PCIE support, but there we have real PCIe root ports with all that's expected
> on them.
> 
> They come on physically different PHBs, one root complex per PHB, but they are
> "proper" PCIe. Hotplug is a bit weird still because it has to go through some
> FW interactions as the HW doesn't use the PCIe standard slot control bits (and
> our qemu model doesn't handle it yet) but otherwise it's standard.
> 
> "pseries" on the other hand is a paravirtualized platform. It's the environment
> presented to guests by our propritary hypervisor (PowerVM, aka pHyp) and
> KVM/qemu. I think there's a public spec these days but to cut the story short,
> it doesn't expose PCIe "properly" for bad historical reasons.
> 
> It tends to hide the parent, ie, the root port for slots connected directly to
> a PHB or the entire hierarchy from the root to (and including) the downstream
> switch port for slots under a switch.
> 
> So you end up with PCIe devices devoid of a proper "parent" which is a bit
> weird.
> 
> Hotplug is implemented using specific firmware interfaces that are identical
> with pre-PCIe (ie PCI/PCI-X) systems. The FW has interfaces for supporting 3
> kinds of hotplug:
> 
>    - Entire PHBs
>    - P2P Bridges
>    - Individual devices on an existing PHB or P2P bridge
> 
> In practice these days mostly the first one is used for everything under pHyp,
> though I think we have a mix of 1 and 3 for KVM/qemu.

Alas, no, we only have 3 on KVM/qemu.  Well, possibly you could do (2)
by plugging a bridge device using (3) then a device onto the bridge.
Whole PHB hotplug has never been implemented for qemu/KVM.. but we
probably want to.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-13 15:15                             ` Benjamin Herrenschmidt
  2016-12-14  2:48                               ` David Gibson
@ 2016-12-14 12:02                               ` Marcel Apfelbaum
  1 sibling, 0 replies; 38+ messages in thread
From: Marcel Apfelbaum @ 2016-12-14 12:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Andrea Bolognani, David Gibson
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth,
	Marcel Apfelbaum

On 12/13/2016 05:15 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2016-12-13 at 14:25 +0200, Marcel Apfelbaum wrote:
>>>> Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
>>>> bridge came up before.  I think one of us spotted a problem with that,
>>>> but I don't recall what it was now.  I guess one is how libvirt would
>>>> map its stupid-fake-domain-numbers to which root bus to use.
>>
>> This would be a weird configuration, I never heard of something like that
>> on a bare metal machine, but I never worked on pseries, who knows...
>
> That's the thing though, it's *not* bare metal ;-)
>
> On bare metal, we use the "powernv" platform for which we haven't yet upstreamed
> the PCIE support, but there we have real PCIe root ports with all that's expected
> on them.
>
> They come on physically different PHBs, one root complex per PHB, but they are
> "proper" PCIe. Hotplug is a bit weird still because it has to go through some
> FW interactions as the HW doesn't use the PCIe standard slot control bits (and
> our qemu model doesn't handle it yet) but otherwise it's standard.
>

Thanks for the explanations, so bare metal is standard minus hotplug.

> "pseries" on the other hand is a paravirtualized platform. It's the environment
> presented to guests by our proprietary hypervisor (PowerVM, aka pHyp) and
> KVM/qemu. I think there's a public spec these days but to cut the story short,
> it doesn't expose PCIe "properly" for bad historical reasons.
>
> It tends to hide the parent, ie, the root port for slots connected directly to
> a PHB or the entire hierarchy from the root to (and including) the downstream
> switch port for slots under a switch.
>
> So you end up with PCIe devices devoid of a proper "parent" which is a bit
> weird.
>

OK, now it's more "clear". I can get back to reading the thread from the start :)

Thanks,
Marcel

> Hotplug is implemented using specific firmware interfaces that are identical
> with pre-PCIe (ie PCI/PCI-X) systems. The FW has interfaces for supporting 3
> kinds of hotplug:
>
>    - Entire PHBs
>    - P2P Bridges
>    - Individual devices on an existing PHB or P2P bridge
>
> In practice these days mostly the first one is used for everything under pHyp,
> though I think we have a mix of 1 and 3 for KVM/qemu.
>
> Cheers,
> Ben.
>


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-14  2:46                             ` David Gibson
@ 2016-12-14 18:26                               ` Marcel Apfelbaum
  2016-12-15 21:59                                 ` Benjamin Herrenschmidt
  2016-12-19 17:39                                 ` Andrea Bolognani
  0 siblings, 2 replies; 38+ messages in thread
From: Marcel Apfelbaum @ 2016-12-14 18:26 UTC (permalink / raw)
  To: David Gibson
  Cc: Andrea Bolognani, Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini,
	Alex Williamson, qemu-ppc, qemu-devel, libvir-list, Michael Roth,
	benh, Marcel Apfelbaum, Eduardo Habkost

On 12/14/2016 04:46 AM, David Gibson wrote:
> On Tue, Dec 13, 2016 at 02:25:44PM +0200, Marcel Apfelbaum wrote:
>> On 12/07/2016 06:42 PM, Andrea Bolognani wrote:
>>> [Added Marcel to CC]
>>>
>>
>>
>> Hi,
>>
>> Sorry for the late reply.
>>
>>> On Wed, 2016-12-07 at 15:11 +1100, David Gibson wrote:
>>>>> Is the difference between q35 and pseries guests with
>>>>> respect to PCIe only relevant when it comes to assigned
>>>>> devices, or in general? I'm asking this because you seem to
>>>>> focus entirely on assigned devices.
>>>>
>>>> Well, in a sense that's up to us.  The only existing model we have is
>>>> PowerVM, and PowerVM only does device passthrough, no emulated
>>>> devices.  PAPR doesn't really distinguish one way or the other, but
>>>> it's written from the perspective of assuming that all PCI devices
>>>> correspond to physical devices on the host
>>>
>>> Okay, that makes sense.
>>>
>>>>>> On q35,
>>>>>> you'd generally expect physically separate (different slot) devices to
>>>>>> appear under separate root complexes.
>>>>>
>>>>> This part I don't get at all, so please bear with me.
>>>>>
>>>>> The way I read it you're claiming that eg. a SCSI controller
>>>>> and a network adapter, being physically separate and assigned
>>>>> to separate PCI slots, should have a dedicated PCIe Root
>>>>> Complex each on a q35 guest.
>>>>
>>
>> Not a PCIe Root Complex, but a PCIe Root port.
>
> Ah, sorry.  As I said, I've been pretty confused by all the terminology.
>
>>>> Right, my understanding was that if the devices were slotted, rather
>>>> than integrated, each one would sit under a separate root complex, the
>>>> root complex being a pseudo PCI to PCI bridge.
>>>
>>> I assume "slotted" means "plugged into a slot that's not one
>>> of those provided by pcie.0" or something along those lines.
>>>
>>> More on the root complex bit later.
>>>
>>>>> That doesn't match with my experience, where you would simply
>>>>> assign them to separate slots of the default PCIe Root Bus
>>>>> (pcie.0), eg. 00:01.0 and 00:02.0.
>>>>
>>>> The qemu default, or the libvirt default?
>>>
>>> I'm talking about the libvirt default, which is supposed to
>>> follows Marcel's PCIe Guidelines.
>>>
>>>> I think this represents
>>>> treating the devices as though they were integrated devices in the
>>>> host bridge.  I believe on q35 they would not be hotpluggable
>>>
>>
>> Correct. Please have a look at the new document
>> regarding pcie: docs/pcie.txt and the corresponding presentations.
>>
>>
>>> Yeah, that's indeed not quite what libvirt would do by
>>> default: in reality, there would be a ioh3420 between the
>>> pcie.0 slots and each device exactly to enable hotplug.
>>>
>>>> but on
>>>> pseries they would be (because we don't use the standard hot plug
>>>> controller).
>>>
>>> We can account for that in libvirt and avoid adding the
>>> extra ioh3420s (or rather the upcoming generic PCIe Root
>>> Ports) for pseries guests.
>>>
>>>>> Maybe you're referring to the fact that you might want to
>>>>> create multiple PCIe Root Complexes in order to assign the
>>>>> host devices to separate guest NUMA nodes? How is creating
>>>>> multiple PCIe Root Complexes on q35 using pxb-pcie different
>>>>> than creating multiple PHBs using spapr-pci-host-bridge on
>>>>> pseries?
>>>>
>>>> Uh.. AIUI the root complex is the PCI to PCI bridge under which PCI-E
>>>> slots appear.  PXB is something different - essentially different host
>>>> bridges as you say (though with some weird hacks to access config
>>>> space, which make it dependent on the primary bus in a way which spapr
>>>> PHBs are not).
>>>>
>>>> I'll admit I'm pretty confused myself about the exact distinction
>>>> between root complex, root port and upstream and downstream ports.
>>>
>>> I think we both need to get our terminology straight :)
>>> I'm sure Marcel will be happy to point us in the right
>>> direction.
>>>
>>> My understanding is that the PCIe Root Complex is the piece
>>> of hardware that exposes a PCIe Root Bus (pcie.0 in QEMU);
>>
>> right
>
> Oh.. I wasn't as clear as I'd like to be on what the root complex is.
> But I thought the root complex did have some guest visible presence in
> the PCI tree.  What you're describing here seems equivalent to what
> I'd call the PCI Host Bridge (== PHB).
>

Yes, a Root Complex is a type of Host Bridge in the sense that it bridges
between the CPU/Memory Controller and the PCI subsystem.

>>> PXBs can be connected to slots in pcie.0 to create more buses
>>> that behave, for the most part, like pcie.0 and are mostly
>>> useful to bind devices to specific NUMA nodes.
>>
>> right
>>
>>> Same applies
>>> to legacy PCI with the pxb (instead of pxb-pcie) device.
>>>
>>
>> pxb should not be used for PCIe machines, only for legacy PCI ones.
>
> Noted.  And not for pseries at all.  Note that because we have a
> para-virtualized platform (all PCI config access goes via hypercalls)
> the distinction between PCI and PCI-E is much blurrier than in the x86
> case.
>

OK

>>> In a similar fashion, PHBs are the hardware thingies that
>>> expose a PCI Root Bus (pci.0 and so on), the main difference
>>> being that they are truly independent: so a q35 guest will
>>> always have a "primary" PCIe Root Bus and (optionally) a
>>> bunch of "secondary" ones, but the same will not be the case
>>> for pseries guests.
>>
>> OK
>>
>>> I don't think the difference is that important though, at
>>> least from libvirt's point of view: whether you're creating
>>> a pseries guest with two PHBs, or a q35 guest with its
>>> built-in PCIe Root Complex and an extra PCIe Expander Bus,
>>> you will end up with two "top level" buses that you can plug
>>> more devices into.
>>
>> I agree
>>
>>> If we had spapr-pcie-host-bridge, we
>>> could treat them mostly the same - with caveats such as the
>>> one described above, of course.
>>>
>>>>>> Whereas on pseries they'll
>>>>>> appear as siblings on a virtual bus (which makes no physical sense for
>>>>>> point-to-point PCI-E).
>>>>>
>>>>> What is the virtual bus in question? Why would it matter
>>>>> that they're siblings?
>>>>
>>>> On pseries it won't.  But my understanding is that libvirt won't
>>>> create them that way on q35 - instead it will insert the RCs / P2P
>>>> bridges to allow them to be hotplugged.  Inserting that bridge may
>>>> confuse pseries guests which aren't expecting it.
>>>
>>> libvirt will automatically add PCIe Root Ports to make the
>>> devices hotpluggable on q35 guests, yes. But, as mentioned
>>> above, we can teach it not to.
>>>
>>>>> I'm possibly missing the point entirely, but so far it
>>>>> looks to me like there are different configurations you
>>>>> might want to use depending on your goal, and both q35
>>>>> and pseries give you comparable tools to achieve such
>>>>> configurations.
>>>>>>
>>>>>> I suppose we could try treating all devices on pseries as though they
>>>>>> were chipset builtin devices on q35, which will appear on the root
>>>>>> PCI-E bus without root complex.
>>
>> Actually the root PCIe bus is part of a root complex.
>
> So I think what I meant above was "root port".

Yes

> The point is that
> there won't be the (pseudo) PCI to PCI bridge appearing above the
> device that there typically would be on q35.
>

Understood, on one hand we have no PCIe Root Ports, on the other
hand the devices are not integrated - they can be hot-plugged
by platform-specific means.

>>>>>> But I suspect that's likely to cause
>>>>>> trouble with hotplug, and it will certainly need different address
>>>>>> allocation from libvirt.
>>>>>
>>>>> PCIe Integrated Endpoint Devices are not hotpluggable on
>>>>> q35, that's why libvirt will follow QEMU's PCIe topology
>>>>> recommendations and place a PCIe Root Port between them;
>>>>> I assume the same could be done for pseries guests as
>>>>> soon as QEMU grows support for generic PCIe Root Ports,
>>>>> something Marcel has already posted patches for.
>>>>
>>>> Here you've hit on it.  No, we should not do that for pseries,
>>>> AFAICT.  PAPR doesn't really have the concept of integrated endpoint
>>>> devices, and all devices can be hotplugged via the PAPR mechanisms
>>>> (and none can via the PCI-E standard hotplug mechanism).
>>>
>>
>> This seems to be interfering with the PCIe spec:
>>   1. No PCIe root ports? Those are part of the spec.
>
> Yes, I dare say it does interfere with the spec.  Nonetheless, there
> it is.
>
>>   2. Only integrated devices? Hotplug is not PCIe native?
>
> That's correct.  PAPR supplies its own hotplug mechanism, which works
> for both PCI and PCI-E devices, which is different from the standard
> PCI-E hotplug mechanism.
>

OK, the HW is PCIe, but configuration/hot-plug
is platform-specific.

>>
>>> Cool, I get it now.
>>>
>>>>> Again, sorry for clearly misunderstanding your explanation,
>>>>> but I'm still not seeing the issue here. I'm sure it's very
>>>>> clear in your mind, but I'm afraid you're going to have to
>>>>> walk me through it :(
>>>>
>>>> I wish it were entirely clear in my mind.  Like I say I'm still pretty
>>>> confused by exactly what the root complex entails.
>>>
>>> Same here, but this back-and-forth is helping! :)
>>>
>>> [...]
>>>>> What about virtio devices, which present themselves either
>>>>> as legacy PCI or PCIe depending on the kind of slot they
>>>>> are plugged into? Would they show up as PCIe or legacy PCI
>>>>> on a PCIe-enabled pseries guest?
>>>>
>>>> That we'd have to address on the qemu side with some
>>>
>>> Unfinished sentence?
>>>
>>> [...]
>>>>> Is the Root Complex not currently exposed? The Root Bus
>>>>> certainly is,
>>>>
>>>> Like I say, I'm fairly confused myself, but I'm pretty sure that Root
>>>> Complex != Root Bus.  The RC sits under the root bus IIRC.. or
>>>> possibly it consists of the root bus plus something under it as well.
>>>>
>>
>> The Root complex includes the PCI bus, some configuration registers if
>> needed, provides access to the configuration space, translates relevant CPU
>> reads/writes to PCI(e) transactions...
>
> Do those configuration registers appear within PCI space, or outside
> it (e.g. raw MMIO or PIO registers)?
>

Root Complexes use MMIO to expose the PCI configuration space,
they call it ECAM (enhanced configuration access mechanism) or MMConfig.
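
Roughly, ECAM is just a flat MMIO window in which bus/device/function
select a 4KiB per-function slice.  A minimal sketch of the address
arithmetic (the ecam_base mapping itself is platform-specific and
assumed here):

/* ECAM address arithmetic: 4KiB of config space per function,
 * 8 functions per device, 32 devices per bus, 256 buses. */
#include <stdint.h>

static inline uint64_t ecam_addr(uint64_t ecam_base, uint8_t bus,
                                 uint8_t dev, uint8_t fn, uint16_t reg)
{
    return ecam_base
         | ((uint64_t)bus << 20)
         | ((uint64_t)(dev & 0x1f) << 15)
         | ((uint64_t)(fn & 0x7) << 12)
         | (reg & 0xfff);          /* reaches extended space too */
}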

>>>> Now... from what Laine was saying it sounds like more of the
>>>> differences between PCI-E placement and PCI placement may be
>>>> implemented by libvirt, rather than qemu, than I realized.  So possibly we do
>>>> want to make the bus be PCI-E on the qemu side, but have libvirt use
>>>> the vanilla-PCI placement guidelines rather than PCI-E for pseries.
>>>
>>> Basically the special casing I was mentioning earlier.
>>
>> That looks complicated.. I wish I knew more about the pseries
>> PCIe stuff, does anyone know where I can get the info? (besides 'google it'...)
>
> Andrea gave a pointer to the PAPR document.  Unfortunately how much it
> covers here I'm not sure about.  In particular I'm not sure how much
> of this is actually PAPR mandated, and how much is just copying
> PowerVM as the pre-existing PAPR implementation.
>

Understood, I'll have a look, but with low expectations :).
Anyway, by now I do have some basic notions of spapr, thanks!

>>
>>>
>>> [...]
>>>>> Maybe I just don't quite get the relationship between Root
>>>>> Complexes and Root Buses, but I guess my question is: what
>>>>> is preventing us from simply doing whatever a
>>>>> spapr-pci-host-bridge is doing in order to expose a legacy
>>>>> PCI Root Bus (pci.*) to the guest, and create a new
>>>>> spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
>>>>> instead?
>>>>
>>>> Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
>>>> bridge came up before.  I think one of us spotted a problem with that,
>>>> but I don't recall what it was now.  I guess one is how libvirt would
>>>> map its stupid-fake-domain-numbers to which root bus to use.
>>
>> This would be a weird configuration, I never heard of something like that
>> on a bare metal machine, but I never worked on pseries, who knows...
>
> Which aspect?  Having multiple independent host bridges is perfectly
> reasonable - x86 just doesn't do it well for rather stupid historical
> reasons.
>

I agree about the multiple host-bridges, that's actually what pxb/pxb-pcie
devices (kind of) do.

I was talking about having one PCI PHB and another PHB which is PCI Express,
i.e. one system having both PCI and PCIe PHBs.

> PAPR is quite explicitly a paravirtual platform, you cannot have a
> bare-metal PAPR machine.
>

Understood.

>>> That issue is relevant whether or not we have different PHB
>>> flavors, isn't it? As soon as multiple PHBs are present in
>>> a pseries guest, multiple PCI domains will be there as well,
>>> and we need to handle that somehow.
>>>
>>> On q35, on the other hand, I haven't been able to find a way
>>> to create extra PCI domains: adding a pxb-pcie certainly
>>> didn't work the same as adding an extra spapr-pci-host-bridge
>>> in that regard.
>>>
>>
>> Indeed, all the pxb-pcie devices "share" the same domain.
>>
>>> [...]
>>>>> Maybe we should have a different model, specific to
>>>>> pseries guests, instead, so that all PHBs would look the
>>>>> same in the guest XML, something like
>>>>>
>>>>>    <controller type='pci' model='phb-pcie'/>
>>>>>
>>>>> It would require shuffling libvirt's PCI address allocation
>>>>> code around quite a bit, but it should be doable. And if it
>>>>> makes life easier for our users, then it's worth it.
>>>>
>>>> Hrm.  So my first inclination would be to stick with the generic
>>>> names, and map those to creating new pseries host bridges on pseries
>>>> guests.  I would have thought that would be the easier option for
>>>> users.  But I may not have realized all the implications yet.
>>>
>>> You're probably right, but I can't immediately see how we
>>> would make the user aware of which PHB is which. Maybe we
>>> could add some sub-element or extra attribute...
>>>
>>> Anyway, we should not focus too much on this specific bit
>>> at the moment, deciding on a specific XML is mostly
>>> bikeshedding :)
>>>
>>> [...]
>>>>> * Eduardo's work, which you mentioned, is going to be very
>>>>>    beneficial in the long run; in the short run, Marcel's
>>>>>    PCIe device placement guidelines, a document that has seen
>>>>>    contributions from QEMU, OVMF and libvirt developers, have
>>>>>    been invaluable to improve libvirt's PCI address allocation
>>>>>    logic. So we're already doing better, and more improvements
>>>>>    are on the way :)
>>>>
>>>> Right.. so here's the thing, I strongly suspect that Marcel's
>>>> guidelines will not be correct for pseries.
>>
>> We should make the document hold for all PCIe archs; if we need
>> to modify it, let's do it.
>
> Yeah, I'm not sure it's possible to cover both x86 and pseries at
> once.  As you noted, it looks rather like PAPR is contradicting the
> PCI-E spec.
>

At this point I agree that PAPR PCI-E is not really "by the book",
so the PCIe guidelines will not work for PAPR.

> Again, one possible option here is to continue to treat pseries as
> having a vanilla-PCI bus, but with a special flag saying that it's
> magically able to connect PCI-E devices.
>

A PCIe bus supporting PCI devices is strange (QEMU allows it ...),
but a PCI bus supporting PCIe devices is hard to "swallow".

I would say maybe make it a special case of a PCIe bus with different rules.
It can derive from the PCIe bus class and override the usual behavior
with PAPR-specific rules, which happen to be similar to the PCI bus rules.
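
A rough sketch of that idea (the type name and the subclass are
hypothetical; only the QOM mechanics and TYPE_PCIE_BUS are real):

/* Hypothetical: a PAPR-flavoured bus derived from TYPE_PCIE_BUS,
 * so pci_bus_is_express() holds while the PAPR plug rules can be
 * hung off the subclass. */
#include "hw/pci/pci.h"

#define TYPE_SPAPR_PCIE_BUS "spapr-pcie-bus"   /* made-up name */

static void spapr_pcie_bus_class_init(ObjectClass *klass, void *data)
{
    /* override the usual PCIe plug/placement checks with
     * PAPR-specific rules here */
}

static const TypeInfo spapr_pcie_bus_info = {
    .name       = TYPE_SPAPR_PCIE_BUS,
    .parent     = TYPE_PCIE_BUS,
    .class_init = spapr_pcie_bus_class_init,
};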

Adding Eduardo, he is currently working on a way to properly expose
the information on what devices can be plugged on what bus/slot.

>>>> I'm not sure if they'll
>>>> be definitively wrong, or just different enough from PowerVM that it
>>>> might confuse guests, but either way.
>>>
>>
>> I really need to understand how it would confuse the guests;
>> it does not deviate from the PCIe spec, it only adds some restrictions.
>
> Because the guests are written to work with PowerVM, which seems to do
> something other than the PCIe spec...
>
>>> Those guidelines have been developed with q35/mach-virt in
>>> mind[1], so I wouldn't at all be surprised if they didn't
>>> apply to pseries guests. And in fact, we just found out
>>> that they don't!
>>>
>>> My point is that we could easily create a similar document
>>> for pseries guests, and then libvirt will be able to pick
>>> up whatever recommendations we come up with just like it
>>> did for q35/mach-virt.
>>>
>>>> Can you send me a link to that
>>>> document though, which might help me figure this out.
>>>
>>> It's docs/pcie.txt in QEMU's git repository.
>>>
>>>
>>> [1] Even though I now realize that this is not immediately
>>>     clear by looking at the document itself
>>
>>
>> I'm kind of missing the core issue, what is the main problem?
>
> The core problem is that at this stage it's not possible to attach
> PCIe devices (either emulated or passthrough) to a pseries guest.  We
> need to be able to do that - specifically allowing the guest to access
> PCIe extended config space.
>

Do we have code in QEMU to expose the extended config space
by means other than MMIO (as used by x86)?

> The PAPR virtualized PCI interfaces definitely do allow a guest to
> access extended config space, but in most other regards they behave
> more like vanilla-PCI than PCIe.
>

So the problem is related to how to expose the information to libvirt?
If yes, maybe Eduardo can help.


Thanks,
Marcel


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-14 18:26                               ` Marcel Apfelbaum
@ 2016-12-15 21:59                                 ` Benjamin Herrenschmidt
  2016-12-19 17:39                                 ` Andrea Bolognani
  1 sibling, 0 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2016-12-15 21:59 UTC (permalink / raw)
  To: Marcel Apfelbaum, David Gibson
  Cc: Andrea Bolognani, Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini,
	Alex Williamson, qemu-ppc, qemu-devel, libvir-list, Michael Roth,
	Marcel Apfelbaum, Eduardo Habkost

On Wed, 2016-12-14 at 20:26 +0200, Marcel Apfelbaum wrote:
> > > The Root complex includes the PCI bus, some configuration
> > > registers if
> > > needed, provides access to the configuration space, translates
> > > relevant CPU
> > > reads/writes to PCI(e) transactions...
> > 
> > Do those configuration registers appear within PCI space, or
> > outside
> > it (e.g. raw MMIO or PIO registers)?
> > 
> 
> Root Complexes use MMIO to expose the PCI configuration space,
> they call it ECAM (enhanced configuration access mechanism) or
> MMConfig.

Not all of them do. There are plenty of PCIe RCs out there that
do config space using different mechanisms such as the good old
indirect address/data method, such as ours. I.e. even in real "bare
metal" powernv, where the root port is visible to Linux, we still
go through firmware for config space to get through to those
registers (among other things).

My PHB3 implementation (not upstream yet and a bit bitrotted by now)
exposed PCIe that way including extended config space and that was
working fine.
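
For reference, the indirect method has the same shape as the ancient
x86 mechanism: write an address to one register, read the data from
its companion.  A sketch in its best-known CF8/CFC form (the PHB3
registers are of course different, and the outl/inl helpers are
assumed):

/* Classic indirect address/data config read (x86 0xCF8/0xCFC form).
 * This form only reaches the first 256 bytes of config space; an RC
 * like PHB3 can use wider indirect registers to reach the extended
 * space as well. */
#include <stdint.h>

extern void outl(uint16_t port, uint32_t val);  /* assumed helpers */
extern uint32_t inl(uint16_t port);

static uint32_t cam_config_read(uint8_t bus, uint8_t dev, uint8_t fn,
                                uint8_t reg)
{
    uint32_t addr = (1u << 31)                     /* enable bit   */
                  | ((uint32_t)bus << 16)
                  | ((uint32_t)(dev & 0x1f) << 11)
                  | ((uint32_t)(fn & 0x7) << 8)
                  | (reg & 0xfc);                  /* dword aligned */
    outl(0xCF8, addr);                             /* address reg   */
    return inl(0xCFC);                             /* data reg      */
}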

Cheers,
Ben.


* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
  2016-12-14 18:26                               ` Marcel Apfelbaum
  2016-12-15 21:59                                 ` Benjamin Herrenschmidt
@ 2016-12-19 17:39                                 ` Andrea Bolognani
  1 sibling, 0 replies; 38+ messages in thread
From: Andrea Bolognani @ 2016-12-19 17:39 UTC (permalink / raw)
  To: Marcel Apfelbaum, David Gibson
  Cc: Alexey Kardashevskiy, Greg Kurz, Paolo Bonzini, Alex Williamson,
	qemu-ppc, qemu-devel, libvir-list, Michael Roth, benh,
	Marcel Apfelbaum, Eduardo Habkost

On Wed, 2016-12-14 at 20:26 +0200, Marcel Apfelbaum wrote:
> > > > > > Maybe I just don't quite get the relationship between Root
> > > > > > Complexes and Root Buses, but I guess my question is: what
> > > > > > is preventing us from simply doing whatever a
> > > > > > spapr-pci-host-bridge is doing in order to expose a legacy
> > > > > > PCI Root Bus (pci.*) to the guest, and create a new
> > > > > > spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
> > > > > > instead?
> > > > > 
> > > > > Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
> > > > > bridge came up before.  I think one of us spotted a problem with that,
> > > > > but I don't recall what it was now.  I guess one is how libvirt would
> > > > > map its stupid-fake-domain-numbers to which root bus to use.
> > > 
> > > This would be a weird configuration, I never heard of something like that
> > > on a bare metal machine, but I never worked on pseries, who knows...
> > 
> > Which aspect?  Having multiple independent host bridges is perfectly
> > reasonable - x86 just doesn't do it well for rather stupid historical
> > reasons.
> 
> I agree about the multiple host-bridges, that's actually what pxb/pxb-pcie
> devices (kind of) do.
> 
> I was talking about having one PCI PHB and another PHB which is PCI Express,
> i.e. one system having both PCI and PCIe PHBs.

Sorry this confused you: what I was talking about was just
having both a legacy PCI PHB and a PCI Express PHB available
in QEMU, not using both for the same guest. The user would
pick one or the other and stick with it.
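
For concreteness, a hypothetical sketch of how such a flavour could
sit next to the existing PHB (the type name is made up; the QOM
registration boilerplate is real):

/* Hypothetical: a separate PHB type whose realize would register
 * its root bus as TYPE_PCIE_BUS, so the guest sees a PCIe root
 * bus (pcie.*) instead of a legacy one (pci.*). */
#include "hw/pci-host/spapr.h"

#define TYPE_SPAPR_PCIE_HOST_BRIDGE "spapr-pcie-host-bridge"

static const TypeInfo spapr_pcie_phb_info = {
    .name   = TYPE_SPAPR_PCIE_HOST_BRIDGE,
    .parent = TYPE_SPAPR_PCI_HOST_BRIDGE,
    /* realize/class_init would differ only in the bus type used */
};

static void spapr_pcie_phb_register_types(void)
{
    type_register_static(&spapr_pcie_phb_info);
}

type_init(spapr_pcie_phb_register_types)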

> > Again, one possible option here is to continue to treat pseries as
> > having a vanilla-PCI bus, but with a special flag saying that it's
> > magically able to connect PCI-E devices.
> 
> A PCIe bus supporting PCI devices is strange (QEMU allows it ...),
> but a PCI bus supporting PCIe devices is hard to "swallow".

I agree.

> I would say maybe make it a special case of a PCIe bus with different rules.
> It can derive from the PCIe bus class and override the usual behavior
> with PAPR-specific rules, which happen to be similar to the PCI bus rules.

It's pretty clear by now that libvirt will need to have some
special handling of pseries guests. Having to choose between
"weird legacy PCI" and "weird PCI Express", I think I'd rather
go with the latter :)

-- 
Andrea Bolognani / Red Hat / Virtualization


end of thread

Thread overview: 38+ messages
2016-10-28  7:56 [Qemu-devel] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default Alexey Kardashevskiy
2016-10-28 10:07 ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
2016-10-31  2:53   ` David Gibson
2016-10-31  4:10     ` Alexey Kardashevskiy
2016-11-01  2:46       ` David Gibson
2016-11-15 14:02         ` Andrea Bolognani
2016-11-17  2:02           ` Alexey Kardashevskiy
2016-11-18  6:11             ` David Gibson
2016-11-18  8:17             ` Andrea Bolognani
2016-11-21  2:12               ` Alexey Kardashevskiy
2016-11-21 13:08                 ` Andrea Bolognani
2016-11-22  2:26                   ` Alexey Kardashevskiy
2016-11-23  5:02                     ` David Gibson
2016-11-25 14:36                       ` Andrea Bolognani
2016-12-02  3:37                         ` David Gibson
2016-11-22 14:07                   ` [Qemu-devel] [libvirt] " Eric Blake
2016-11-23  5:00               ` [Qemu-devel] " David Gibson
2016-11-25 13:46                 ` Andrea Bolognani
2016-12-02  4:18                   ` David Gibson
2016-12-02  5:17                     ` Benjamin Herrenschmidt
2016-12-02  5:50                       ` David Gibson
2016-12-02 21:41                         ` Benjamin Herrenschmidt
2016-12-03  1:02                           ` Alexey Kardashevskiy
2016-12-05 19:06                     ` [Qemu-devel] [libvirt] " Laine Stump
2016-12-05 20:54                     ` Laine Stump
2016-12-07  3:34                       ` David Gibson
2016-12-06 17:30                     ` [Qemu-devel] " Andrea Bolognani
2016-12-07  4:11                       ` David Gibson
2016-12-07 16:42                         ` Andrea Bolognani
2016-12-13 12:25                           ` Marcel Apfelbaum
2016-12-13 13:15                             ` Greg Kurz
2016-12-13 15:15                             ` Benjamin Herrenschmidt
2016-12-14  2:48                               ` David Gibson
2016-12-14 12:02                               ` Marcel Apfelbaum
2016-12-14  2:46                             ` David Gibson
2016-12-14 18:26                               ` Marcel Apfelbaum
2016-12-15 21:59                                 ` Benjamin Herrenschmidt
2016-12-19 17:39                                 ` Andrea Bolognani
