All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
@ 2016-09-01 13:22 Marcel Apfelbaum
  2016-09-01 13:27 ` Peter Maydell
                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-01 13:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: mst, lersek

Proposes best practices on how to use PCIe/PCI device
in PCIe based machines and explain the reasoning behind them.

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---

Hi,

Please add your comments on what to add/remove/edit to make this doc usable.

Thanks,
Marcel

 docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)
 create mode 100644 docs/pcie.txt

diff --git a/docs/pcie.txt b/docs/pcie.txt
new file mode 100644
index 0000000..52a8830
--- /dev/null
+++ b/docs/pcie.txt
@@ -0,0 +1,145 @@
+PCI EXPRESS GUIDELINES
+======================
+
+1. Introduction
+================
+The doc proposes best practices on how to use PCIe/PCI device
+in PCIe based machines and explains the reasoning behind them.
+
+
+2. Device placement strategy
+============================
+QEMU does not have a clear socket-device matching mechanism
+and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
+Plugging a PCI device into a PCIe device might not always work and
+is weird anyway since it cannot be done for "bare metal".
+Plugging a PCIe device into a PCI slot will hide the Extended
+Configuration Space thus is also not recommended.
+
+The recommendation is to separate the PCIe and PCI hierarchies.
+PCIe devices should be plugged only into PCIe Root Ports and
+PCIe Downstream ports (let's call them PCIe ports).
+
+2.1 Root Bus (pcie.0)
+=====================
+Plug only legacy PCI devices as Root Complex Integrated Devices
+even if the PCIe spec does not forbid PCIe devices. The existing
+hardware uses mostly PCI devices as Integrated Endpoints. In this
+way we may avoid some strange Guest OS-es behaviour.
+Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports)
+or DMI-PCI bridges to start legacy PCI hierarchies.
+
+
+   pcie.0 bus
+   --------------------------------------------------------------------------
+        |                |                    |                   |
+   -----------   ------------------   ------------------  ------------------
+   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |  | DMI-PCI bridge |
+   -----------   ------------------   ------------------  ------------------
+
+2.2 PCIe only hierarchy
+=======================
+Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream
+Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches
+can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports.
+
+
+   pcie.0 bus
+   ----------------------------------------------------
+        |                |               |
+   -------------   -------------   -------------
+   | Root Port |   | Root Port |   | Root Port |
+   ------------   --------------   -------------
+         |                               |
+    ------------                 -----------------
+    | PCIe Dev |                 | Upstream Port |
+    ------------                 -----------------
+                                  |            |
+                     -------------------    -------------------
+                     | Downstream Port |    | Downstream Port |
+                     -------------------    -------------------
+                             |
+                         ------------
+                         | PCIe Dev |
+                         ------------
+
+2.3 PCI only hierarchy
+======================
+Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
+into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
+and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
+only into pcie.0 bus.
+
+   pcie.0 bus
+   ----------------------------------------------
+        |                            |
+   -----------               ------------------
+   | PCI Dev |               | DMI-PCI BRIDGE |
+   ----------                ------------------
+                               |            |
+                        -----------    ------------------
+                        | PCI Dev |    | PCI-PCI Bridge |
+                        -----------    ------------------
+                                         |           |
+                                  -----------     -----------
+                                  | PCI Dev |     | PCI Dev |
+                                  -----------     -----------
+
+
+
+3. IO space issues
+===================
+PCIe Ports are seen by Firmware/Guest OS as PCI bridges and
+as required by PCI spec will reserve a 4K IO range for each.
+The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
+it by allocation the IO space only if there is at least a device
+with IO BARs plugged into the bridge.
+Behind a PCIe PORT only one device may be plugged, resulting in
+the allocation of a whole 4K range for each device.
+The IO space is limited resulting in ~10 PCIe ports per system
+if devices with IO BARs are plugged into IO ports.
+
+Using the proposed device placing strategy solves this issue
+by using only PCIe devices with PCIe PORTS. The PCIe spec requires
+PCIe devices to work without IO BARs.
+The PCI hierarchy has no such limitations.
+
+
+4. Hot Plug
+============
+The root bus pcie.0 does not support hot-plug, so Integrated Devices,
+DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.
+
+PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
+in QEMU preventing it to work, but it would be solved soon).
+The PCI hotplug is ACPI based and can work side by side with the PCIe
+native hotplug.
+
+PCIe devices can be natively hot-plugged/hot-unplugged into/from
+PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.
+Keep in mind you always need to have at least one PCIe Port available
+for hotplug, the PCIe Ports themselves are not hot-pluggable.
+
+
+5. Device assignment
+====================
+Host devices are mostly PCIe and should be plugged only into PCIe ports.
+PCI-PCI bridge slots can be used for legacy PCI host devices.
+
+
+6. Virtio devices
+=================
+Virtio devices plugged into the PCI hierarchy or as an Integrated Devices
+will remain PCI and have transitional behaviour as default.
+Virtio devices plugged into PCIe ports are Express devices and have
+"1.0" behavior by default without IO support.
+In both case disable-* properties can be used to override the behaviour.
+
+
+7. Conclusion
+==============
+The proposal offers a usage model that is easy to understand and follow
+and in the same time overcomes some PCIe limitations.
+
+
+
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-01 13:22 [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines Marcel Apfelbaum
@ 2016-09-01 13:27 ` Peter Maydell
  2016-09-01 13:51   ` Marcel Apfelbaum
  2016-09-05 16:24 ` Laszlo Ersek
  2016-09-06 15:38 ` Alex Williamson
  2 siblings, 1 reply; 52+ messages in thread
From: Peter Maydell @ 2016-09-01 13:27 UTC (permalink / raw)
  To: Marcel Apfelbaum; +Cc: QEMU Developers, Laszlo Ersek, Michael S. Tsirkin

On 1 September 2016 at 14:22, Marcel Apfelbaum <marcel@redhat.com> wrote:
> Proposes best practices on how to use PCIe/PCI device
> in PCIe based machines and explain the reasoning behind them.
>
> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
> ---
>
> Hi,
>
> Please add your comments on what to add/remove/edit to make this doc usable.

As somebody who doesn't really understand the problem space, my
thoughts:

(1) is this intended as advice for developers writing machine
models and adding pci controllers to them, or is it intended as
advice for users (and libvirt-style management layers) about
how to configure QEMU?

(2) it seems to be a bit short on concrete advice (either
"you should do this" instructions to machine model developers,
or "use command lines like this" instructions to end-users.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-01 13:27 ` Peter Maydell
@ 2016-09-01 13:51   ` Marcel Apfelbaum
  2016-09-01 17:14     ` Laszlo Ersek
  0 siblings, 1 reply; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-01 13:51 UTC (permalink / raw)
  To: Peter Maydell; +Cc: QEMU Developers, Laszlo Ersek, Michael S. Tsirkin

On 09/01/2016 04:27 PM, Peter Maydell wrote:
> On 1 September 2016 at 14:22, Marcel Apfelbaum <marcel@redhat.com> wrote:
>> Proposes best practices on how to use PCIe/PCI device
>> in PCIe based machines and explain the reasoning behind them.
>>
>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>> ---
>>
>> Hi,
>>
>> Please add your comments on what to add/remove/edit to make this doc usable.
>

Hi Peter,

> As somebody who doesn't really understand the problem space, my
> thoughts:
>
> (1) is this intended as advice for developers writing machine
> models and adding pci controllers to them, or is it intended as
> advice for users (and libvirt-style management layers) about
> how to configure QEMU?
>

Is it intended for management layers as they have no way to
understand how to "consume" the Q35 machine,
but also for firmware developers (OVMF/SeaBIOS) to help them
understand the usage model so they can optimize IO/MEM
resources allocation for both boot time and hot-plug.

QEMU users/developers can also benefit from it as the PCIe arch
is more complex supporting both PCI/PCIe devices and
several PCI/PCIe controllers with no clear rules on what goes where.

> (2) it seems to be a bit short on concrete advice (either
> "you should do this" instructions to machine model developers,
> or "use command lines like this" instructions to end-users.
>

Thanks for the point. I'll be sure to add detailed command line examples
to the next version.

Thanks,
Marcel

> thanks
> -- PMM
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-01 13:51   ` Marcel Apfelbaum
@ 2016-09-01 17:14     ` Laszlo Ersek
  0 siblings, 0 replies; 52+ messages in thread
From: Laszlo Ersek @ 2016-09-01 17:14 UTC (permalink / raw)
  To: Marcel Apfelbaum, Peter Maydell; +Cc: QEMU Developers, Michael S. Tsirkin

On 09/01/16 15:51, Marcel Apfelbaum wrote:
> On 09/01/2016 04:27 PM, Peter Maydell wrote:
>> On 1 September 2016 at 14:22, Marcel Apfelbaum <marcel@redhat.com> wrote:
>>> Proposes best practices on how to use PCIe/PCI device
>>> in PCIe based machines and explain the reasoning behind them.
>>>
>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>>> ---
>>>
>>> Hi,
>>>
>>> Please add your comments on what to add/remove/edit to make this doc
>>> usable.
>>
> 
> Hi Peter,
> 
>> As somebody who doesn't really understand the problem space, my
>> thoughts:
>>
>> (1) is this intended as advice for developers writing machine
>> models and adding pci controllers to them, or is it intended as
>> advice for users (and libvirt-style management layers) about
>> how to configure QEMU?
>>
> 
> Is it intended for management layers as they have no way to
> understand how to "consume" the Q35 machine,
> but also for firmware developers (OVMF/SeaBIOS) to help them
> understand the usage model so they can optimize IO/MEM
> resources allocation for both boot time and hot-plug.
> 
> QEMU users/developers can also benefit from it as the PCIe arch
> is more complex supporting both PCI/PCIe devices and
> several PCI/PCIe controllers with no clear rules on what goes where.
> 
>> (2) it seems to be a bit short on concrete advice (either
>> "you should do this" instructions to machine model developers,
>> or "use command lines like this" instructions to end-users.
>>
> 
> Thanks for the point. I'll be sure to add detailed command line examples
> to the next version.

I think that would be a huge benefit!

(I'll try to read the document later, and come back with remarks.)

Thanks!
Laszlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-01 13:22 [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines Marcel Apfelbaum
  2016-09-01 13:27 ` Peter Maydell
@ 2016-09-05 16:24 ` Laszlo Ersek
  2016-09-05 20:02   ` Marcel Apfelbaum
                     ` (2 more replies)
  2016-09-06 15:38 ` Alex Williamson
  2 siblings, 3 replies; 52+ messages in thread
From: Laszlo Ersek @ 2016-09-05 16:24 UTC (permalink / raw)
  To: Marcel Apfelbaum, qemu-devel
  Cc: mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani,
	Alex Williamson, Gerd Hoffmann

On 09/01/16 15:22, Marcel Apfelbaum wrote:
> Proposes best practices on how to use PCIe/PCI device
> in PCIe based machines and explain the reasoning behind them.
> 
> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
> ---
> 
> Hi,
> 
> Please add your comments on what to add/remove/edit to make this doc usable.

I'll give you a brain dump below -- most of it might easily be
incorrect, but I'll just speak my mind :)

> 
> Thanks,
> Marcel
> 
>  docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 145 insertions(+)
>  create mode 100644 docs/pcie.txt
> 
> diff --git a/docs/pcie.txt b/docs/pcie.txt
> new file mode 100644
> index 0000000..52a8830
> --- /dev/null
> +++ b/docs/pcie.txt
> @@ -0,0 +1,145 @@
> +PCI EXPRESS GUIDELINES
> +======================
> +
> +1. Introduction
> +================
> +The doc proposes best practices on how to use PCIe/PCI device
> +in PCIe based machines and explains the reasoning behind them.

General request: please replace all occurrences of "PCIe" with "PCI
Express" in the text (not command lines, of course). The reason is that
the "e" letter is a minimal difference, and I've misread PCIe as PC
several times, while interpreting this document. Obviously the resultant
confusion is terrible, as you are explaining the difference between PCI
and PCI Express in the entire document :)

> +
> +
> +2. Device placement strategy
> +============================
> +QEMU does not have a clear socket-device matching mechanism
> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
> +Plugging a PCI device into a PCIe device might not always work and

s/PCIe device/PCI Express slot/

> +is weird anyway since it cannot be done for "bare metal".
> +Plugging a PCIe device into a PCI slot will hide the Extended
> +Configuration Space thus is also not recommended.
> +
> +The recommendation is to separate the PCIe and PCI hierarchies.
> +PCIe devices should be plugged only into PCIe Root Ports and
> +PCIe Downstream ports (let's call them PCIe ports).

Please do not use the shorthand; we should always spell out downstream
ports and root ports. Assume people reading this document are dumber
than I am wrt. PCI / PCI Express -- I'm already pretty dumb, and I
appreciate the detail! :) If they are smart, they won't mind the detail;
if they lack expertise, they'll appreciate the detail, won't they. :)

> +
> +2.1 Root Bus (pcie.0)

Can we call this Root Complex instead?

> +=====================
> +Plug only legacy PCI devices as Root Complex Integrated Devices
> +even if the PCIe spec does not forbid PCIe devices.

I suggest "even though the PCI Express spec does not forbid PCI Express
devices as Integrated Devices". (Detail is good!)

Also, as Peter suggested, this (but not just this) would be a good place
to provide command line fragments.

> The existing
> +hardware uses mostly PCI devices as Integrated Endpoints. In this
> +way we may avoid some strange Guest OS-es behaviour.
> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports)
> +or DMI-PCI bridges to start legacy PCI hierarchies.

Hmmmm, I had to re-read this paragraph (while looking at the diagram)
five times until I mostly understood it :) What about the following wording:

--------
Place only the following kinds of devices directly on the Root Complex:

(1) For devices with dedicated, specific functionality (network card,
graphics card, IDE controller, etc), place only legacy PCI devices on
the Root Complex. These will be considered Integrated Endpoints.
Although the PCI Express spec does not forbid PCI Express devices as
Integrated Endpoints, existing hardware mostly integrates legacy PCI
devices with the Root Complex. Guest OSes are suspected to behave
strangely when PCI Express devices are integrated with the Root Complex.

(2) PCI Express Root Ports, for starting exclusively PCI Express
hierarchies.

(3) PCI Express Switches (connected with their Upstream Ports to the
Root Complex), also for starting exclusively PCI Express hierarchies.

(4) For starting legacy PCI hierarchies: DMI-PCI bridges.

> +
> +
> +   pcie.0 bus

"bus" is correct in QEMU lingo, but I'd still call it complex here.

> +   --------------------------------------------------------------------------
> +        |                |                    |                   |
> +   -----------   ------------------   ------------------  ------------------
> +   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |  | DMI-PCI bridge |
> +   -----------   ------------------   ------------------  ------------------
> +

Please insert a separate (brief) section here about pxb-pcie devices --
just mention that they are documented in a separate spec txt in more
detail, and that they create new root complexes in practice.

In fact, maybe option (5) would be better for pxb-pcie devices, under
section 2.1, than a dedicated section!

> +2.2 PCIe only hierarchy
> +=======================
> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream
> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches
> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports.

- Please name the maximum number of the root ports that's allowed on the
root complex (cmdline example?)

- Also, this is the first time you mention "slot". While the PCI Express
spec allows for root ports / downstream ports not implementing a slot
(IIRC), I think we shouldn't muddy the waters here, and restrict the
word "slot" to the command line examples only.

- What you say here about switches (upstream ports) matches what I've
learned from you thus far :), but it doesn't match bullet (3) in section
2.1. That is, if we suggest to *always* add a Root Port between the Root
Complex and the Upstream Port of a switch, then (3) should not be
present in section 2.1. (Do we suggest that BTW?)

We're not giving a technical description here (the PCI Express spec is
good enough for that), we're dictating policy. We shouldn't be shy about
minimizing the accepted use cases.

Our main guidance here should be the amount of bus numbers used up by
the hierarchy. Parts of the document might later apply to
qemu-system-aarch64 -M virt, and that machine is severely starved in the
bus numbers department (it has MMCONFIG space for 16 buses only!)

So how about this:

* the basic idea is good I think: always go for root ports, unless the
root complex is fully populated

* if you run out of root ports, use a switch with downstream ports, but
plug the upstream port directly in the root complex (make it an
integrated device). This would save us a bus number, and match option
(3) in section 2.1, but it doesn't match the diagram below, where a root
port is between the root complex and the upstream port. (Of course, if a
root port is *required* there, then 2.1 (3) is wrong, and should be
removed.)

* the "population algorithm" should be laid out in a bit more detail.
You mention a possible depth of 6-7, but I think it would be best to
keep the hierarchy as flat as possible (let's not waste bus numbers on
upstream ports, and time on deep enumeration!). In other words, only
plug upstream ports in the root complex (and without intervening root
ports, if that's allowed). For example:

-  1-32 ports needed: use root ports only

- 33-64 ports needed: use 31 root ports, and one switch with 2-32
downstream ports

- 65-94 ports needed: use 30 root ports, one switch with 32 downstream
ports, another switch with 3-32 downstream ports

- 95-125 ports needed: use 29 root ports, two switches with 32
downstream ports each, and a third switch with 2-32 downstream ports

- 126-156 ports needed: use 28 root ports, three switches with 32
downstream ports each, and a fourth switch with 2-32 downstream ports

- 157-187 ports needed: use 27 root ports, four switches with 32
downstream ports each, and a fifth switch with 2-32 downstream ports

- 188-218 ports: 26 root ports, 5 fully populated switches, sixth switch
with 2-32 downstream ports,

- 219-249 ports: 25 root ports, 6 fully pop. switches, seventh switch
with 2-32 downstream ports

(And I think this is where it ends, because the 7 upstream ports total
in the switches take up 7 bus numbers, so we'd need 249 + 7 = 256 bus
numbers, not counting the root complex, so 249 ports isn't even attainable.)

You might argue that this is way too detailed, but with the "problem
space" offering so much freedom (consider libvirt too...), I think it
would be helpful. This would also help trim the "explorations" of
downstream QE departments :)

> +
> +
> +   pcie.0 bus
> +   ----------------------------------------------------
> +        |                |               |
> +   -------------   -------------   -------------
> +   | Root Port |   | Root Port |   | Root Port |
> +   ------------   --------------   -------------
> +         |                               |
> +    ------------                 -----------------
> +    | PCIe Dev |                 | Upstream Port |
> +    ------------                 -----------------
> +                                  |            |
> +                     -------------------    -------------------
> +                     | Downstream Port |    | Downstream Port |
> +                     -------------------    -------------------
> +                             |
> +                         ------------
> +                         | PCIe Dev |
> +                         ------------
> +

So the upper right root port should be removed, probably.

Also, I recommend to draw a "container" around the upstream port plus
the two downstream ports, and tack a "switch" label to it.

> +2.3 PCI only hierarchy
> +======================
> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
> +only into pcie.0 bus.
> +
> +   pcie.0 bus
> +   ----------------------------------------------
> +        |                            |
> +   -----------               ------------------
> +   | PCI Dev |               | DMI-PCI BRIDGE |
> +   ----------                ------------------
> +                               |            |
> +                        -----------    ------------------
> +                        | PCI Dev |    | PCI-PCI Bridge |
> +                        -----------    ------------------
> +                                         |           |
> +                                  -----------     -----------
> +                                  | PCI Dev |     | PCI Dev |
> +                                  -----------     -----------

Works for me, but I would again elaborate a little bit on keeping the
hierarchy flat.

First, in order to preserve compatibility with libvirt's current
behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
even if that's possible otherwise. Let's just say

- there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
is required),

- only PCI-PCI bridges should be plugged into the DMI-PCI bridge,

- let's recommend that each PCI-PCI bridge be populated until it becomes
full, at which point another PCI-PCI bridge should be plugged into the
same one DMI-PCI bridge. Theoretically, with 32 legacy PCI devices per
PCI-PCI bridge, and 32 PCI-PCI bridges stuffed into the one DMI-PCI
bridge, we could have ~1024 legacy PCI devices (not counting the
integrated ones on the root complex(es)). There's also multi-function,
so I can't see anyone needing more than this.

For practical reasons though (see later), we should state here that we
recommend no more than 9 (nine) PCI-PCI bridges in total, all located
directly under the 1 (one) DMI-PCI bridge that is integrated into the
pcie.0 root complex. Nine PCI-PCI bridges should allow for 288 legacy
PCI devices. (And then there's multifunction.)

> +
> +
> +
> +3. IO space issues
> +===================
> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and

(please spell out downstream + root port)

> +as required by PCI spec will reserve a 4K IO range for each.
> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
> +it by allocation the IO space only if there is at least a device
> +with IO BARs plugged into the bridge.

This used to be true, but is no longer true, for OVMF. And I think it's
actually correct: we *should* keep the 4K IO reservation per PCI-PCI bridge.

(But, certainly no IO reservation for PCI Express root port, upstream
port, or downstream port! And i'll need your help for telling these
apart in OVMF.)

Let me elaborate more under section "4. Hot Plug". For now let me just
say that I'd like this language about optimization to be dropped.

> +Behind a PCIe PORT only one device may be plugged, resulting in

(again, please spell out downstream and root port)

> +the allocation of a whole 4K range for each device.
> +The IO space is limited resulting in ~10 PCIe ports per system

(limited to 65536 byte-wide IO ports, but it's fragmented, so we have
about 10 * 4K free)

> +if devices with IO BARs are plugged into IO ports.

not "into IO ports" but "into PCI Express downstream and root ports".

> +
> +Using the proposed device placing strategy solves this issue

> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires

(please spell out root / downstream etc)

> +PCIe devices to work without IO BARs.
> +The PCI hierarchy has no such limitations.

I'm sorry to have fragmented this section with so many comments, but the
idea is actually splendid, in my opinion!


... Okay, still speaking resources, could you insert a brief section
here about bus numbers? Under "3. IO space issues", you already explain
how "practically everything" qualifies as a PCI bridge. We should
mention that all those things, such as:
- root complex pcie.0,
- root complex added by pxb-pcie,
- root ports,
- upstream ports, downstream ports,
- bridges, etc

take up bus numbers, and we have 256 bus numbers in total.

In the next section you state that PCI hotplug (ACPI based) and PCI
Express hotplug (native) can work side by side, which is correct, and
the IO space competition is eliminated by the scheme proposed in section
3 -- and the MMIO space competition is "obvious" --, but the bus number
starvation is *very much* non-obvious. It should be spelled out. I think
it deserves a separate section. (Again, with an eye toward
qemu-system-aarch64 -M virt -- we've seen PCI Express failures there,
and they were due to bus number starvation. It wasn't fun to debug.
(Well, it was, but don't tell anyone :)))

> +
> +
> +4. Hot Plug
> +============
> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,

s/root bus/root complex/? Also, any root complexes added with pxb-pcie
don't support hotplug.

> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.

I would say: ... so anything that plugs *only* into a root complex,
cannot be hotplugged. Then the list is what you mention here (also
referring back to options (1), (2) and (4) in section 2.1), *plus* I
would also add option (5): pxb-pcie can also not be hotplugged.

> +
> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
> +in QEMU preventing it to work, but it would be solved soon).

What bug?

Anyway, I'm unsure we should add this remark here -- it's a guide, not a
status report. I'm worried that whenever we fix that bug, we forget to
remove this remark.

> +The PCI hotplug is ACPI based and can work side by side with the PCIe
> +native hotplug.
> +
> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.

I would mention the order (upstream port, downstream port), also add
some command lines maybe.

> +Keep in mind you always need to have at least one PCIe Port available
> +for hotplug, the PCIe Ports themselves are not hot-pluggable.

Well, the downstream ports of a switch that is being added *are*, aren't
they?

But, this question is actually irrelevant IMO, because here I would add
another subsection about *planning* for hot-plug. (I think that's pretty
important.) And those plans should make the hotplugging of switches
unnecessary!

* For the PCI Express hierarchy, I recommended a flat structure above.
The 256 bus numbers can easily be exhausted / covered by 25 root ports
plus 7 switches (each switch being fully populated with downstream
ports). This should allow all sysadmins to estimate their expected
numbers of hotplug PCI Express devices, in advance, and create enough
root ports / downstream ports. (Sysadmins are already used to planning
for hotplug, see VCPUs, memory (DIMM), memory (balloon).)

* For the PCI hierarchy, it should be even simpler, but worth a mention
-- start with enough PCI-PCI bridges under the one DMI-PCI bridge.

* Finally, this is the spot where we should design and explain our
resource reservation for hotplug:

  - For PCI Express hotplug, please explain that only such PCI Express
    devices can be hotplugged that require no IO space -- in section
    "3. IO space issues" you mention that this is a valid restriction.
    Furthermore, please state the MMIO32 and/or MMIO64 sizes that the
    firmware needs to reserve for root ports / downstream ports, and
    also explain that these sizes will act as maximum size and
    alignment limits for *individual* hotplug devices.

    We can invent fw_cfg switches for this, or maybe even a special PCI
    Express capability (to be placed in the config space of root and
    downstream ports).

  - For legacy PCI hotplug, this is where my evil plan of "no more than
    9 (nine) PCI-PCI bridges under the 1 (one) DMI-PCI bridge" unfolds!

    We've stated above (in Section 3) that we have about 10*4KB IO port
    space. One of those 4K chunks will go (collectively) to the
    Integrated PCI devices that sit on pcie.0. (If there are other root
    complexes from pxb-pcie devices, then those will get one chunk each
    too.) The rest -- hence assume 9 or fewer chunks -- will be consumed
    by the 9 (or respectively fewer) PCI-PCI bridges, for hotplug
    reservation. The upshot is that as long as a sysadmin sticks with
    our flat, "9 PCI-PCI bridges total" recommendation for the legacy

    PCI hierarchy, the IO reservation will be covered *immediately*.
    Simply put: don't create more PCI-PCI bridges than you have IO
    space for -- and that should leave you with about 9 "sibling"
    bridges, which are plenty enough for a huge number of legacy PCI
    devices!

    Furthermore, please state the MMIO32 / MMIO64 amount to reserve
    *per PCI-PCI bridge*. The firmware programmers need to know this,
    and people planning for legacy PCI hotplug should be informed that
    those limits are for all devices *together* on the same PCI-PCI
    bridge.

    Again, we could expose this via fw_cfg, or in a special capability
    (as suggested by Gerd IIRC) in the PCI config space of the PCI-PCI
    bridge.

> +
> +
> +5. Device assignment
> +====================
> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
> +PCI-PCI bridge slots can be used for legacy PCI host devices.

Please provide a command line (lspci) so that users can easily determine
if the device they wish to assign is legacy PCI or PCI Express.

> +
> +
> +6. Virtio devices
> +=================
> +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices

(drop "an")

> +will remain PCI and have transitional behaviour as default.

(Please add one sentence about what "transitional" means in this context
-- they'll have both IO and MMIO BARs.)

> +Virtio devices plugged into PCIe ports are Express devices and have
> +"1.0" behavior by default without IO support.
> +In both case disable-* properties can be used to override the behaviour.

Please emphasize that setting disable-legacy=off (that is, enabling
legacy behavior) for PCI Express virtio devices will cause them to
require IO space, which, given our PCI Express hierarchy, may quickly
lead to resource exhaustion, and is therefore strongly discouraged.

> +
> +
> +7. Conclusion
> +==============
> +The proposal offers a usage model that is easy to understand and follow
> +and in the same time overcomes some PCIe limitations.

I agree!

Thanks!
Laszlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-05 16:24 ` Laszlo Ersek
@ 2016-09-05 20:02   ` Marcel Apfelbaum
  2016-09-06 13:31     ` Laszlo Ersek
  2016-09-06 11:35   ` Gerd Hoffmann
  2016-10-04 14:59   ` Daniel P. Berrange
  2 siblings, 1 reply; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-05 20:02 UTC (permalink / raw)
  To: Laszlo Ersek, qemu-devel
  Cc: mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani,
	Alex Williamson, Gerd Hoffmann

On 09/05/2016 07:24 PM, Laszlo Ersek wrote:
> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>> Proposes best practices on how to use PCIe/PCI device
>> in PCIe based machines and explain the reasoning behind them.
>>
>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>> ---
>>
>> Hi,
>>
>> Please add your comments on what to add/remove/edit to make this doc usable.
>

Hi Laszlo,

> I'll give you a brain dump below -- most of it might easily be
> incorrect, but I'll just speak my mind :)
>

Thanks for taking the time to go over it, I'll do my best to respond
to all the questions.

>>
>> Thanks,
>> Marcel
>>
>>  docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 145 insertions(+)
>>  create mode 100644 docs/pcie.txt
>>
>> diff --git a/docs/pcie.txt b/docs/pcie.txt
>> new file mode 100644
>> index 0000000..52a8830
>> --- /dev/null
>> +++ b/docs/pcie.txt
>> @@ -0,0 +1,145 @@
>> +PCI EXPRESS GUIDELINES
>> +======================
>> +
>> +1. Introduction
>> +================
>> +The doc proposes best practices on how to use PCIe/PCI device
>> +in PCIe based machines and explains the reasoning behind them.
>
> General request: please replace all occurrences of "PCIe" with "PCI
> Express" in the text (not command lines, of course). The reason is that
> the "e" letter is a minimal difference, and I've misread PCIe as PC
> several times, while interpreting this document. Obviously the resultant
> confusion is terrible, as you are explaining the difference between PCI
> and PCI Express in the entire document :)
>

Sure

>> +
>> +
>> +2. Device placement strategy
>> +============================
>> +QEMU does not have a clear socket-device matching mechanism
>> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
>> +Plugging a PCI device into a PCIe device might not always work and
>
> s/PCIe device/PCI Express slot/
>

Thanks!

>> +is weird anyway since it cannot be done for "bare metal".
>> +Plugging a PCIe device into a PCI slot will hide the Extended
>> +Configuration Space thus is also not recommended.
>> +
>> +The recommendation is to separate the PCIe and PCI hierarchies.
>> +PCIe devices should be plugged only into PCIe Root Ports and
>> +PCIe Downstream ports (let's call them PCIe ports).
>
> Please do not use the shorthand; we should always spell out downstream
> ports and root ports. Assume people reading this document are dumber
> than I am wrt. PCI / PCI Express -- I'm already pretty dumb, and I
> appreciate the detail! :) If they are smart, they won't mind the detail;
> if they lack expertise, they'll appreciate the detail, won't they. :)
>

Sure

>> +
>> +2.1 Root Bus (pcie.0)
>
> Can we call this Root Complex instead?
>

Sorry, but we can't. The Root Complex is a type of Host-Bridge
(and can actually "have" multiple Host-Bridges), not a bus.
It stands between the CPU/Memory controller/APIC and the PCI/PCI Express fabric.
(as you can see, I am not using PCIe even for the comments :))

The Root Complex *includes* an internal bus (pcie.0) but also
can include some Integrated Devices, its own Configuration Space Registers
(e.g Root Complex Register Block), ...

One of the main functions of the Root Complex is to
generate PCI Express Transactions on behalf of the CPU(s) and
to "translate" the corresponding PCI Express Transactions into DMA accesses.

I can change it to "PCI Express Root Bus", it will help?

>> +=====================
>> +Plug only legacy PCI devices as Root Complex Integrated Devices
>> +even if the PCIe spec does not forbid PCIe devices.
>
> I suggest "even though the PCI Express spec does not forbid PCI Express
> devices as Integrated Devices". (Detail is good!)
>
Thanks

> Also, as Peter suggested, this (but not just this) would be a good place
> to provide command line fragments.
>

I've already added some examples, I'll appreciate if you can have a look on v2
that I will post really soon.

>> The existing
>> +hardware uses mostly PCI devices as Integrated Endpoints. In this
>> +way we may avoid some strange Guest OS-es behaviour.
>> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports)
>> +or DMI-PCI bridges to start legacy PCI hierarchies.
>
> Hmmmm, I had to re-read this paragraph (while looking at the diagram)
> five times until I mostly understood it :) What about the following wording:
>
> --------
> Place only the following kinds of devices directly on the Root Complex:
>
> (1) For devices with dedicated, specific functionality (network card,
> graphics card, IDE controller, etc), place only legacy PCI devices on
> the Root Complex. These will be considered Integrated Endpoints.
> Although the PCI Express spec does not forbid PCI Express devices as
> Integrated Endpoints, existing hardware mostly integrates legacy PCI
> devices with the Root Complex. Guest OSes are suspected to behave
> strangely when PCI Express devices are integrated with the Root Complex.
>
> (2) PCI Express Root Ports, for starting exclusively PCI Express
> hierarchies.
>
> (3) PCI Express Switches (connected with their Upstream Ports to the
> Root Complex), also for starting exclusively PCI Express hierarchies.
>
> (4) For starting legacy PCI hierarchies: DMI-PCI bridges.
>

Thanks for the re-wording!
Actually I had a bug, even the Switches should be connected to Root Ports, not directly
to the PCI Express Root Bus (pcie.0) , I'll delete (3) to make it clear.


>> +
>> +
>> +   pcie.0 bus
>
> "bus" is correct in QEMU lingo, but I'd still call it complex here.
>

explained above

>> +   --------------------------------------------------------------------------
>> +        |                |                    |                   |
>> +   -----------   ------------------   ------------------  ------------------
>> +   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |  | DMI-PCI bridge |
>> +   -----------   ------------------   ------------------  ------------------
>> +
>
> Please insert a separate (brief) section here about pxb-pcie devices --
> just mention that they are documented in a separate spec txt in more
> detail, and that they create new root complexes in practice.
>
> In fact, maybe option (5) would be better for pxb-pcie devices, under
> section 2.1, than a dedicated section!
>

Good idea, I'll add the pxb-pcie device.

>> +2.2 PCIe only hierarchy
>> +=======================
>> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream
>> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches
>> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports.
>
> - Please name the maximum number of the root ports that's allowed on the
> root complex (cmdline example?)
>

I'll try:
The PCI Express Root Bus (pcie.0) is an internal bus that similar to a PCI bus
supports up to 32 Integrated Devices/PCI Express Root Ports.

> - Also, this is the first time you mention "slot". While the PCI Express
> spec allows for root ports / downstream ports not implementing a slot
> (IIRC), I think we shouldn't muddy the waters here, and restrict the
> word "slot" to the command line examples only.
>

OK

> - What you say here about switches (upstream ports) matches what I've
> learned from you thus far :), but it doesn't match bullet (3) in section
> 2.1. That is, if we suggest to *always* add a Root Port between the Root
> Complex and the Upstream Port of a switch, then (3) should not be
> present in section 2.1. (Do we suggest that BTW?)
>
> We're not giving a technical description here (the PCI Express spec is
> good enough for that), we're dictating policy. We shouldn't be shy about
> minimizing the accepted use cases.
>
> Our main guidance here should be the amount of bus numbers used up by
> the hierarchy. Parts of the document might later apply to
> qemu-system-aarch64 -M virt, and that machine is severely starved in the
> bus numbers department (it has MMCONFIG space for 16 buses only!)
>
> So how about this:
>
> * the basic idea is good I think: always go for root ports, unless the
> root complex is fully populated
>
> * if you run out of root ports, use a switch with downstream ports, but
> plug the upstream port directly in the root complex (make it an
> integrated device). This would save us a bus number, and match option
> (3) in section 2.1, but it doesn't match the diagram below, where a root
> port is between the root complex and the upstream port. (Of course, if a
> root port is *required* there, then 2.1 (3) is wrong, and should be
> removed.)
>

A Root Port is required, thanks for spotting the bug.

> * the "population algorithm" should be laid out in a bit more detail.
> You mention a possible depth of 6-7, but I think it would be best to
> keep the hierarchy as flat as possible (let's not waste bus numbers on
> upstream ports, and time on deep enumeration!). In other words, only
> plug upstream ports in the root complex (and without intervening root
> ports, if that's allowed). For example:
>
> -  1-32 ports needed: use root ports only
>
> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
> downstream ports
>
> - 65-94 ports needed: use 30 root ports, one switch with 32 downstream
> ports, another switch with 3-32 downstream ports
>
> - 95-125 ports needed: use 29 root ports, two switches with 32
> downstream ports each, and a third switch with 2-32 downstream ports
>
> - 126-156 ports needed: use 28 root ports, three switches with 32
> downstream ports each, and a fourth switch with 2-32 downstream ports
>
> - 157-187 ports needed: use 27 root ports, four switches with 32
> downstream ports each, and a fifth switch with 2-32 downstream ports
>
> - 188-218 ports: 26 root ports, 5 fully populated switches, sixth switch
> with 2-32 downstream ports,
>
> - 219-249 ports: 25 root ports, 6 fully pop. switches, seventh switch
> with 2-32 downstream ports
>

I can add it as a "best practice".

> (And I think this is where it ends, because the 7 upstream ports total
> in the switches take up 7 bus numbers, so we'd need 249 + 7 = 256 bus
> numbers, not counting the root complex, so 249 ports isn't even attainable.)
>

Theoretically we can implement multiple PCI domains, each domain can have
256 PCI buses, but we don't have that yet. An implementation should
start with the pxb-pcie using separate PCI domains instead of "stealing" bus ranges.
for the Root Complex. But this is for another thread.

> You might argue that this is way too detailed, but with the "problem
> space" offering so much freedom (consider libvirt too...), I think it
> would be helpful. This would also help trim the "explorations" of
> downstream QE departments :)
>

Well, we can accentuate that while nesting is supported, "deep nesting"
is not recommended and even not strictly necessary.

>> +
>> +
>> +   pcie.0 bus
>> +   ----------------------------------------------------
>> +        |                |               |
>> +   -------------   -------------   -------------
>> +   | Root Port |   | Root Port |   | Root Port |
>> +   ------------   --------------   -------------
>> +         |                               |
>> +    ------------                 -----------------
>> +    | PCIe Dev |                 | Upstream Port |
>> +    ------------                 -----------------
>> +                                  |            |
>> +                     -------------------    -------------------
>> +                     | Downstream Port |    | Downstream Port |
>> +                     -------------------    -------------------
>> +                             |
>> +                         ------------
>> +                         | PCIe Dev |
>> +                         ------------
>> +
>
> So the upper right root port should be removed, probably.
>

No, my bug in explanation, sorry.

> Also, I recommend to draw a "container" around the upstream port plus
> the two downstream ports, and tack a "switch" label to it.
>

Really? :) I had an interesting time with these "drawings". I'll try.

>> +2.3 PCI only hierarchy
>> +======================
>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>> +only into pcie.0 bus.
>> +
>> +   pcie.0 bus
>> +   ----------------------------------------------
>> +        |                            |
>> +   -----------               ------------------
>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>> +   ----------                ------------------
>> +                               |            |
>> +                        -----------    ------------------
>> +                        | PCI Dev |    | PCI-PCI Bridge |
>> +                        -----------    ------------------
>> +                                         |           |
>> +                                  -----------     -----------
>> +                                  | PCI Dev |     | PCI Dev |
>> +                                  -----------     -----------
>
> Works for me, but I would again elaborate a little bit on keeping the
> hierarchy flat.
>
> First, in order to preserve compatibility with libvirt's current
> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
> even if that's possible otherwise. Let's just say
>
> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
> is required),
>
> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,
>
> - let's recommend that each PCI-PCI bridge be populated until it becomes
> full, at which point another PCI-PCI bridge should be plugged into the
> same one DMI-PCI bridge. Theoretically, with 32 legacy PCI devices per
> PCI-PCI bridge, and 32 PCI-PCI bridges stuffed into the one DMI-PCI
> bridge, we could have ~1024 legacy PCI devices (not counting the
> integrated ones on the root complex(es)). There's also multi-function,
> so I can't see anyone needing more than this.
>

I can "live" with that. Even if it contradicts a little you flattening
argument if you need more room for PCI devices but you don't need hotplug.
In this case adding PCI devices to the DMI-PCI Bridge should be enough.
But I agree we should keep it as simple as possible and your idea makes sense, thanks.


> For practical reasons though (see later), we should state here that we
> recommend no more than 9 (nine) PCI-PCI bridges in total, all located
> directly under the 1 (one) DMI-PCI bridge that is integrated into the
> pcie.0 root complex. Nine PCI-PCI bridges should allow for 288 legacy
> PCI devices. (And then there's multifunction.)
>

OK... BTW the ~9 bridges limitation is the same for non PCI Express machines
e.g. i440FX machine.

>> +
>> +
>> +
>> +3. IO space issues
>> +===================
>> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and
>
> (please spell out downstream + root port)
>

OK

>> +as required by PCI spec will reserve a 4K IO range for each.
>> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
>> +it by allocation the IO space only if there is at least a device
>> +with IO BARs plugged into the bridge.
>
> This used to be true, but is no longer true, for OVMF. And I think it's
> actually correct: we *should* keep the 4K IO reservation per PCI-PCI bridge.
>

I'll change to "should".

> (But, certainly no IO reservation for PCI Express root port, upstream
> port, or downstream port! And i'll need your help for telling these
> apart in OVMF.)
>

Just let me know how can I help.

> Let me elaborate more under section "4. Hot Plug". For now let me just
> say that I'd like this language about optimization to be dropped.
>
>> +Behind a PCIe PORT only one device may be plugged, resulting in
>
> (again, please spell out downstream and root port)
>

OK

>> +the allocation of a whole 4K range for each device.
>> +The IO space is limited resulting in ~10 PCIe ports per system
>
> (limited to 65536 byte-wide IO ports, but it's fragmented, so we have
> about 10 * 4K free)
>
>> +if devices with IO BARs are plugged into IO ports.
>
> not "into IO ports" but "into PCI Express downstream and root ports".
>

oops, thanks

>> +
>> +Using the proposed device placing strategy solves this issue
>
>> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires
>
> (please spell out root / downstream etc)
>

OK

>> +PCIe devices to work without IO BARs.
>> +The PCI hierarchy has no such limitations.
>
> I'm sorry to have fragmented this section with so many comments, but the
> idea is actually splendid, in my opinion!
>

Thanks!

>
> ... Okay, still speaking resources, could you insert a brief section
> here about bus numbers? Under "3. IO space issues", you already explain
> how "practically everything" qualifies as a PCI bridge. We should
> mention that all those things, such as:
> - root complex pcie.0,
> - root complex added by pxb-pcie,
> - root ports,
> - upstream ports, downstream ports,
> - bridges, etc
>
> take up bus numbers, and we have 256 bus numbers in total.
>

I'll add a section for bus numbers, sure.

> In the next section you state that PCI hotplug (ACPI based) and PCI
> Express hotplug (native) can work side by side, which is correct, and
> the IO space competition is eliminated by the scheme proposed in section
> 3 -- and the MMIO space competition is "obvious" --, but the bus number
> starvation is *very much* non-obvious. It should be spelled out. I think
> it deserves a separate section. (Again, with an eye toward
> qemu-system-aarch64 -M virt -- we've seen PCI Express failures there,
> and they were due to bus number starvation. It wasn't fun to debug.
> (Well, it was, but don't tell anyone :)))
>

Got it, I'll try to make PCI Bus numbering a limitation as important as IO.
And we need to start looking at ways to solve this:
   1. pxb-pcie starting different PCI domains - pxb-pcie became another Root Complex
   2. switches can theoretically start PCI domains - emulate a Switch doing this.
Long term plans, of course.

>> +
>> +
>> +4. Hot Plug
>> +============
>> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,
>
> s/root bus/root complex/? Also, any root complexes added with pxb-pcie
> don't support hotplug.
>

Actually pxb-pcie should support PCI Express Native Hotplug. If they don't is a bug
and I'll take care of it.
For pxb-pci (the PCI counter-part) is another story, it needs ACPI code
to be emitted and the feature is not yet implemented.


>> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.
>
> I would say: ... so anything that plugs *only* into a root complex,
> cannot be hotplugged. Then the list is what you mention here (also
> referring back to options (1), (2) and (4) in section 2.1), *plus* I
> would also add option (5): pxb-pcie can also not be hotplugged.
>

because is actually an Integrated Device.

>> +
>> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
>> +in QEMU preventing it to work, but it would be solved soon).
>
> What bug?
>

As stated above, PCI hotplug is based on emitting ACPI code for
recognizing the right slot (see "bsel" ACPI variables). Basically each
PCI-Bridge slot has a different "bsel" value used during
hotplug mechanism to identify the slot where the device is hot-plugged/hot-unplugged.

For PC machine the ACPI is generated while for Q35 is not.
(I think I've already sent an RFC some time ago for that)

> Anyway, I'm unsure we should add this remark here -- it's a guide, not a
> status report. I'm worried that whenever we fix that bug, we forget to
> remove this remark.
>

will remove it

>> +The PCI hotplug is ACPI based and can work side by side with the PCIe
>> +native hotplug.
>> +
>> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
>> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.
>
> I would mention the order (upstream port, downstream port), also add
> some command lines maybe.
>

I'll add some hmp example. Should I try it before :) ?

>> +Keep in mind you always need to have at least one PCIe Port available
>> +for hotplug, the PCIe Ports themselves are not hot-pluggable.
>
> Well, the downstream ports of a switch that is being added *are*, aren't
> they?

Nope, you cannot hotplug a PCI Express Root Port or a PCI Express Downstream Port.
The reason: The PCI Express Native Hotplug is based on SHPCs (Standard HotPlug Controllers)
which are integrated only in the mentioned ports and not in Upstream Ports or the Root Complex.
The "other" reason: When you buy a switch/server it has a number of ports and that's it.
You cannot add "later".

>
> But, this question is actually irrelevant IMO, because here I would add
> another subsection about *planning* for hot-plug. (I think that's pretty
> important.) And those plans should make the hotplugging of switches
> unnecessary!
>

I'll add a subsection for it. But when you are out of options you *can*
hotplug a switch if your sysadmin skills are limited...

> * For the PCI Express hierarchy, I recommended a flat structure above.
> The 256 bus numbers can easily be exhausted / covered by 25 root ports
> plus 7 switches (each switch being fully populated with downstream
> ports). This should allow all sysadmins to estimate their expected
> numbers of hotplug PCI Express devices, in advance, and create enough
> root ports / downstream ports. (Sysadmins are already used to planning
> for hotplug, see VCPUs, memory (DIMM), memory (balloon).)
>

That's another good idea, I'll add it to the doc, thanks!

> * For the PCI hierarchy, it should be even simpler, but worth a mention
> -- start with enough PCI-PCI bridges under the one DMI-PCI bridge.
>

OK

> * Finally, this is the spot where we should design and explain our
> resource reservation for hotplug:
>
>   - For PCI Express hotplug, please explain that only such PCI Express
>     devices can be hotplugged that require no IO space -- in section
>     "3. IO space issues" you mention that this is a valid restriction.
>     Furthermore, please state the MMIO32 and/or MMIO64 sizes that the
>     firmware needs to reserve for root ports / downstream ports, and
>     also explain that these sizes will act as maximum size and
>     alignment limits for *individual* hotplug devices.
>

OK

>     We can invent fw_cfg switches for this, or maybe even a special PCI
>     Express capability (to be placed in the config space of root and
>     downstream ports).
>

Gerd explicitly asked for the second idea (vendor specific capability)

>   - For legacy PCI hotplug, this is where my evil plan of "no more than
>     9 (nine) PCI-PCI bridges under the 1 (one) DMI-PCI bridge" unfolds!
>

The same as for PC, should this doc deal with this if is a common issue?
Maybe a simple comment is enough.

>     We've stated above (in Section 3) that we have about 10*4KB IO port
>     space. One of those 4K chunks will go (collectively) to the
>     Integrated PCI devices that sit on pcie.0. (If there are other root
>     complexes from pxb-pcie devices, then those will get one chunk each
>     too.) The rest -- hence assume 9 or fewer chunks -- will be consumed
>     by the 9 (or respectively fewer) PCI-PCI bridges, for hotplug
>     reservation. The upshot is that as long as a sysadmin sticks with
>     our flat, "9 PCI-PCI bridges total" recommendation for the legacy
>
>     PCI hierarchy, the IO reservation will be covered *immediately*.
>     Simply put: don't create more PCI-PCI bridges than you have IO
>     space for -- and that should leave you with about 9 "sibling"
>     bridges, which are plenty enough for a huge number of legacy PCI
>     devices!
>

I'll use that, thanks!

>     Furthermore, please state the MMIO32 / MMIO64 amount to reserve
>     *per PCI-PCI bridge*. The firmware programmers need to know this,
>     and people planning for legacy PCI hotplug should be informed that
>     those limits are for all devices *together* on the same PCI-PCI
>     bridge.
>

Yes... we'll ask here for the minimum 8MB MMIO because of the virtio 1.0
behavior and use that when "pushing" the patches for OVMF/SeaBIOS.
Its fun, first we "make" the rules, then we say "Hey, is written in QEMU docs".

>     Again, we could expose this via fw_cfg, or in a special capability
>     (as suggested by Gerd IIRC) in the PCI config space of the PCI-PCI
>     bridge.
>

Agreed

>> +
>> +
>> +5. Device assignment
>> +====================
>> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
>> +PCI-PCI bridge slots can be used for legacy PCI host devices.
>
> Please provide a command line (lspci) so that users can easily determine
> if the device they wish to assign is legacy PCI or PCI Express.
>

OK, something like:

lspci -s 03:00.0 -v (as root)
03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
	Subsystem: Intel Corporation Dual Band Wireless-AC 7260
	Flags: bus master, fast devsel, latency 0, IRQ 50
	Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [40] Express Endpoint, MSI 00

                             ^^^^^^^^^^^^^^^

	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
	Capabilities: [14c] Latency Tolerance Reporting
	Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014 <?>



>> +
>> +
>> +6. Virtio devices
>> +=================
>> +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices
>
> (drop "an")
>

OK

>> +will remain PCI and have transitional behaviour as default.
>
> (Please add one sentence about what "transitional" means in this context
> -- they'll have both IO and MMIO BARs.)
>

OK

>> +Virtio devices plugged into PCIe ports are Express devices and have
>> +"1.0" behavior by default without IO support.
>> +In both case disable-* properties can be used to override the behaviour.
>
> Please emphasize that setting disable-legacy=off (that is, enabling
> legacy behavior) for PCI Express virtio devices will cause them to
> require IO space, which, given our PCI Express hierarchy, may quickly
> lead to resource exhaustion, and is therefore strongly discouraged.
>

Sure

>> +
>> +
>> +7. Conclusion
>> +==============
>> +The proposal offers a usage model that is easy to understand and follow
>> +and in the same time overcomes some PCIe limitations.
>
> I agree!
>
> Thanks!
> Laszlo
>


Thanks for the detailed review! I was planning on sending V2 today,
but to follow your comments I will need another day.
Don't get me wrong, it totally worth it :)


Thanks,
Marcel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-05 16:24 ` Laszlo Ersek
  2016-09-05 20:02   ` Marcel Apfelbaum
@ 2016-09-06 11:35   ` Gerd Hoffmann
  2016-09-06 13:58     ` Laine Stump
                       ` (2 more replies)
  2016-10-04 14:59   ` Daniel P. Berrange
  2 siblings, 3 replies; 52+ messages in thread
From: Gerd Hoffmann @ 2016-09-06 11:35 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Marcel Apfelbaum, qemu-devel, mst, Peter Maydell, Drew Jones,
	Laine Stump, Andrea Bolognani, Alex Williamson

  Hi,

> > +Plug only legacy PCI devices as Root Complex Integrated Devices
> > +even if the PCIe spec does not forbid PCIe devices.
> 
> I suggest "even though the PCI Express spec does not forbid PCI Express
> devices as Integrated Devices". (Detail is good!)

While talking about integrated devices:  There is docs/q35-chipset.cfg,
which documents how to mimic q35 with integrated devices as close and
complete as possible.

Usage:
  qemu-system-x86_64 -M q35 -readconfig docs/q35-chipset.cfg $args

Side note for usb: In practice you don't want to use the tons of
uhci/ehci controllers present in the original q35 but plug xhci into one
of the pcie root ports instead (unless your guest doesn't support xhci).

> > +as required by PCI spec will reserve a 4K IO range for each.
> > +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
> > +it by allocation the IO space only if there is at least a device
> > +with IO BARs plugged into the bridge.
> 
> This used to be true, but is no longer true, for OVMF. And I think it's
> actually correct: we *should* keep the 4K IO reservation per PCI-PCI bridge.
> 
> (But, certainly no IO reservation for PCI Express root port, upstream
> port, or downstream port! And i'll need your help for telling these
> apart in OVMF.)

IIRC the same is true for seabios, it looks for the pcie capability and
skips io space allocation on pcie ports only.

Side note: the linux kernel allocates io space nevertheless, so
checking /proc/ioports after boot doesn't tell you what the firmware
did.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-05 20:02   ` Marcel Apfelbaum
@ 2016-09-06 13:31     ` Laszlo Ersek
  2016-09-06 14:46       ` Marcel Apfelbaum
  2016-09-07  6:21       ` Gerd Hoffmann
  0 siblings, 2 replies; 52+ messages in thread
From: Laszlo Ersek @ 2016-09-06 13:31 UTC (permalink / raw)
  To: Marcel Apfelbaum, qemu-devel
  Cc: mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani,
	Alex Williamson, Gerd Hoffmann

On 09/05/16 22:02, Marcel Apfelbaum wrote:
> On 09/05/2016 07:24 PM, Laszlo Ersek wrote:
>> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>>> Proposes best practices on how to use PCIe/PCI device
>>> in PCIe based machines and explain the reasoning behind them.
>>>
>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>>> ---
>>>
>>> Hi,
>>>
>>> Please add your comments on what to add/remove/edit to make this doc
>>> usable.
>>
> 
> Hi Laszlo,
> 
>> I'll give you a brain dump below -- most of it might easily be
>> incorrect, but I'll just speak my mind :)
>>
> 
> Thanks for taking the time to go over it, I'll do my best to respond
> to all the questions.
> 
>>>
>>> Thanks,
>>> Marcel
>>>
>>>  docs/pcie.txt | 145
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 145 insertions(+)
>>>  create mode 100644 docs/pcie.txt
>>>
>>> diff --git a/docs/pcie.txt b/docs/pcie.txt
>>> new file mode 100644
>>> index 0000000..52a8830
>>> --- /dev/null
>>> +++ b/docs/pcie.txt
>>> @@ -0,0 +1,145 @@
>>> +PCI EXPRESS GUIDELINES
>>> +======================
>>> +
>>> +1. Introduction
>>> +================
>>> +The doc proposes best practices on how to use PCIe/PCI device
>>> +in PCIe based machines and explains the reasoning behind them.
>>
>> General request: please replace all occurrences of "PCIe" with "PCI
>> Express" in the text (not command lines, of course). The reason is that
>> the "e" letter is a minimal difference, and I've misread PCIe as PC
>> several times, while interpreting this document. Obviously the resultant
>> confusion is terrible, as you are explaining the difference between PCI
>> and PCI Express in the entire document :)
>>
> 
> Sure
> 
>>> +
>>> +
>>> +2. Device placement strategy
>>> +============================
>>> +QEMU does not have a clear socket-device matching mechanism
>>> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
>>> +Plugging a PCI device into a PCIe device might not always work and
>>
>> s/PCIe device/PCI Express slot/
>>
> 
> Thanks!
> 
>>> +is weird anyway since it cannot be done for "bare metal".
>>> +Plugging a PCIe device into a PCI slot will hide the Extended
>>> +Configuration Space thus is also not recommended.
>>> +
>>> +The recommendation is to separate the PCIe and PCI hierarchies.
>>> +PCIe devices should be plugged only into PCIe Root Ports and
>>> +PCIe Downstream ports (let's call them PCIe ports).
>>
>> Please do not use the shorthand; we should always spell out downstream
>> ports and root ports. Assume people reading this document are dumber
>> than I am wrt. PCI / PCI Express -- I'm already pretty dumb, and I
>> appreciate the detail! :) If they are smart, they won't mind the detail;
>> if they lack expertise, they'll appreciate the detail, won't they. :)
>>
> 
> Sure
> 
>>> +
>>> +2.1 Root Bus (pcie.0)
>>
>> Can we call this Root Complex instead?
>>
> 
> Sorry, but we can't. The Root Complex is a type of Host-Bridge
> (and can actually "have" multiple Host-Bridges), not a bus.
> It stands between the CPU/Memory controller/APIC and the PCI/PCI Express
> fabric.
> (as you can see, I am not using PCIe even for the comments :))
> 
> The Root Complex *includes* an internal bus (pcie.0) but also
> can include some Integrated Devices, its own Configuration Space Registers
> (e.g Root Complex Register Block), ...
> 
> One of the main functions of the Root Complex is to
> generate PCI Express Transactions on behalf of the CPU(s) and
> to "translate" the corresponding PCI Express Transactions into DMA
> accesses.
> 
> I can change it to "PCI Express Root Bus", it will help?

Yes, it will, thank you.

All my other "root complex" mentions below were incorrect, in light of
your clarification, so please consider those accordingly.

> 
>>> +=====================
>>> +Plug only legacy PCI devices as Root Complex Integrated Devices
>>> +even if the PCIe spec does not forbid PCIe devices.
>>
>> I suggest "even though the PCI Express spec does not forbid PCI Express
>> devices as Integrated Devices". (Detail is good!)
>>
> Thanks
> 
>> Also, as Peter suggested, this (but not just this) would be a good place
>> to provide command line fragments.
>>
> 
> I've already added some examples, I'll appreciate if you can have a look
> on v2
> that I will post really soon.
> 
>>> The existing
>>> +hardware uses mostly PCI devices as Integrated Endpoints. In this
>>> +way we may avoid some strange Guest OS-es behaviour.
>>> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream
>>> ports)
>>> +or DMI-PCI bridges to start legacy PCI hierarchies.
>>
>> Hmmmm, I had to re-read this paragraph (while looking at the diagram)
>> five times until I mostly understood it :) What about the following
>> wording:
>>
>> --------
>> Place only the following kinds of devices directly on the Root Complex:
>>
>> (1) For devices with dedicated, specific functionality (network card,
>> graphics card, IDE controller, etc), place only legacy PCI devices on
>> the Root Complex. These will be considered Integrated Endpoints.
>> Although the PCI Express spec does not forbid PCI Express devices as
>> Integrated Endpoints, existing hardware mostly integrates legacy PCI
>> devices with the Root Complex. Guest OSes are suspected to behave
>> strangely when PCI Express devices are integrated with the Root Complex.
>>
>> (2) PCI Express Root Ports, for starting exclusively PCI Express
>> hierarchies.
>>
>> (3) PCI Express Switches (connected with their Upstream Ports to the
>> Root Complex), also for starting exclusively PCI Express hierarchies.
>>
>> (4) For starting legacy PCI hierarchies: DMI-PCI bridges.
>>
> 
> Thanks for the re-wording!
> Actually I had a bug, even the Switches should be connected to Root
> Ports, not directly
> to the PCI Express Root Bus (pcie.0) , I'll delete (3) to make it clear.

Ah, okay. That puts a lot of what I wrote in a different perspective :),
but I think the "as flat as possible" hierarchy should remain a valid
suggestion.

> 
> 
>>> +
>>> +
>>> +   pcie.0 bus
>>
>> "bus" is correct in QEMU lingo, but I'd still call it complex here.
>>
> 
> explained above
> 
>>> +  
>>> --------------------------------------------------------------------------
>>>
>>> +        |                |                    |                   |
>>> +   -----------   ------------------   ------------------ 
>>> ------------------
>>> +   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |  | DMI-PCI
>>> bridge |
>>> +   -----------   ------------------   ------------------ 
>>> ------------------
>>> +
>>
>> Please insert a separate (brief) section here about pxb-pcie devices --
>> just mention that they are documented in a separate spec txt in more
>> detail, and that they create new root complexes in practice.
>>
>> In fact, maybe option (5) would be better for pxb-pcie devices, under
>> section 2.1, than a dedicated section!
>>
> 
> Good idea, I'll add the pxb-pcie device.
> 
>>> +2.2 PCIe only hierarchy
>>> +=======================
>>> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe
>>> switches (Upstream
>>> +Ports + several Downstream Ports) if out of PCIe Root Ports slots.
>>> PCIe switches
>>> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe
>>> Ports.
>>
>> - Please name the maximum number of the root ports that's allowed on the
>> root complex (cmdline example?)
>>
> 
> I'll try:
> The PCI Express Root Bus (pcie.0) is an internal bus that similar to a
> PCI bus
> supports up to 32 Integrated Devices/PCI Express Root Ports.

Thanks, sounds good. Also, apparently, I wasn't wrong about the number 32 :)

>> - Also, this is the first time you mention "slot". While the PCI Express
>> spec allows for root ports / downstream ports not implementing a slot
>> (IIRC), I think we shouldn't muddy the waters here, and restrict the
>> word "slot" to the command line examples only.
>>
> 
> OK
> 
>> - What you say here about switches (upstream ports) matches what I've
>> learned from you thus far :), but it doesn't match bullet (3) in section
>> 2.1. That is, if we suggest to *always* add a Root Port between the Root
>> Complex and the Upstream Port of a switch, then (3) should not be
>> present in section 2.1. (Do we suggest that BTW?)
>>
>> We're not giving a technical description here (the PCI Express spec is
>> good enough for that), we're dictating policy. We shouldn't be shy about
>> minimizing the accepted use cases.
>>
>> Our main guidance here should be the amount of bus numbers used up by
>> the hierarchy. Parts of the document might later apply to
>> qemu-system-aarch64 -M virt, and that machine is severely starved in the
>> bus numbers department (it has MMCONFIG space for 16 buses only!)
>>
>> So how about this:
>>
>> * the basic idea is good I think: always go for root ports, unless the
>> root complex is fully populated
>>
>> * if you run out of root ports, use a switch with downstream ports, but
>> plug the upstream port directly in the root complex (make it an
>> integrated device). This would save us a bus number, and match option
>> (3) in section 2.1, but it doesn't match the diagram below, where a root
>> port is between the root complex and the upstream port. (Of course, if a
>> root port is *required* there, then 2.1 (3) is wrong, and should be
>> removed.)
>>
> 
> A Root Port is required, thanks for spotting the bug.
> 
>> * the "population algorithm" should be laid out in a bit more detail.
>> You mention a possible depth of 6-7, but I think it would be best to
>> keep the hierarchy as flat as possible (let's not waste bus numbers on
>> upstream ports, and time on deep enumeration!). In other words, only
>> plug upstream ports in the root complex (and without intervening root
>> ports, if that's allowed). For example:
>>
>> -  1-32 ports needed: use root ports only
>>
>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
>> downstream ports
>>
>> - 65-94 ports needed: use 30 root ports, one switch with 32 downstream
>> ports, another switch with 3-32 downstream ports
>>
>> - 95-125 ports needed: use 29 root ports, two switches with 32
>> downstream ports each, and a third switch with 2-32 downstream ports
>>
>> - 126-156 ports needed: use 28 root ports, three switches with 32
>> downstream ports each, and a fourth switch with 2-32 downstream ports
>>
>> - 157-187 ports needed: use 27 root ports, four switches with 32
>> downstream ports each, and a fifth switch with 2-32 downstream ports
>>
>> - 188-218 ports: 26 root ports, 5 fully populated switches, sixth switch
>> with 2-32 downstream ports,
>>
>> - 219-249 ports: 25 root ports, 6 fully pop. switches, seventh switch
>> with 2-32 downstream ports
>>
> 
> I can add it as a "best practice".

That would be highly appreciated, thanks! Of course, with the root ports
being mandatory between the PCI Express Root Bus and the upstream port
of every switch, we hit the bus number limit a bit earlier:

> 
>> (And I think this is where it ends, because the 7 upstream ports total
>> in the switches take up 7 bus numbers, so we'd need 249 + 7 = 256 bus
>> numbers, not counting the root complex, so 249 ports isn't even
>> attainable.)

we'd need 249+7+7.

>>
> 
> Theoretically we can implement multiple PCI domains, each domain can have
> 256 PCI buses, but we don't have that yet. An implementation should
> start with the pxb-pcie using separate PCI domains instead of "stealing"
> bus ranges.
> for the Root Complex. But this is for another thread.

Definitely for another thread :)

> 
>> You might argue that this is way too detailed, but with the "problem
>> space" offering so much freedom (consider libvirt too...), I think it
>> would be helpful. This would also help trim the "explorations" of
>> downstream QE departments :)
>>
> 
> Well, we can accentuate that while nesting is supported, "deep nesting"
> is not recommended and even not strictly necessary.

I agree, thanks.

> 
>>> +
>>> +
>>> +   pcie.0 bus
>>> +   ----------------------------------------------------
>>> +        |                |               |
>>> +   -------------   -------------   -------------
>>> +   | Root Port |   | Root Port |   | Root Port |
>>> +   ------------   --------------   -------------
>>> +         |                               |
>>> +    ------------                 -----------------
>>> +    | PCIe Dev |                 | Upstream Port |
>>> +    ------------                 -----------------
>>> +                                  |            |
>>> +                     -------------------    -------------------
>>> +                     | Downstream Port |    | Downstream Port |
>>> +                     -------------------    -------------------
>>> +                             |
>>> +                         ------------
>>> +                         | PCIe Dev |
>>> +                         ------------
>>> +
>>
>> So the upper right root port should be removed, probably.
>>
> 
> No, my bug in explanation, sorry.
> 
>> Also, I recommend to draw a "container" around the upstream port plus
>> the two downstream ports, and tack a "switch" label to it.
>>
> 
> Really? :) I had an interesting time with these "drawings". I'll try.

Haha, thanks :) If you are an emacs user, it should be easy. (I'm not an
emacs user, but my editor does support macros that makes it okay-ish to
draw some basic ASCII art.)

> 
>>> +2.3 PCI only hierarchy
>>> +======================
>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI
>>> bridges
>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>>> +only into pcie.0 bus.
>>> +
>>> +   pcie.0 bus
>>> +   ----------------------------------------------
>>> +        |                            |
>>> +   -----------               ------------------
>>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>>> +   ----------                ------------------
>>> +                               |            |
>>> +                        -----------    ------------------
>>> +                        | PCI Dev |    | PCI-PCI Bridge |
>>> +                        -----------    ------------------
>>> +                                         |           |
>>> +                                  -----------     -----------
>>> +                                  | PCI Dev |     | PCI Dev |
>>> +                                  -----------     -----------
>>
>> Works for me, but I would again elaborate a little bit on keeping the
>> hierarchy flat.
>>
>> First, in order to preserve compatibility with libvirt's current
>> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
>> even if that's possible otherwise. Let's just say
>>
>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
>> is required),
>>
>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,
>>
>> - let's recommend that each PCI-PCI bridge be populated until it becomes
>> full, at which point another PCI-PCI bridge should be plugged into the
>> same one DMI-PCI bridge. Theoretically, with 32 legacy PCI devices per
>> PCI-PCI bridge, and 32 PCI-PCI bridges stuffed into the one DMI-PCI
>> bridge, we could have ~1024 legacy PCI devices (not counting the
>> integrated ones on the root complex(es)). There's also multi-function,
>> so I can't see anyone needing more than this.
>>
> 
> I can "live" with that. Even if it contradicts a little you flattening
> argument if you need more room for PCI devices but you don't need hotplug.
> In this case adding PCI devices to the DMI-PCI Bridge should be enough.
> But I agree we should keep it as simple as possible and your idea makes
> sense, thanks.
> 
> 
>> For practical reasons though (see later), we should state here that we
>> recommend no more than 9 (nine) PCI-PCI bridges in total, all located
>> directly under the 1 (one) DMI-PCI bridge that is integrated into the
>> pcie.0 root complex. Nine PCI-PCI bridges should allow for 288 legacy
>> PCI devices. (And then there's multifunction.)
>>
> 
> OK... BTW the ~9 bridges limitation is the same for non PCI Express
> machines
> e.g. i440FX machine.
> 
>>> +
>>> +
>>> +
>>> +3. IO space issues
>>> +===================
>>> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and
>>
>> (please spell out downstream + root port)
>>
> 
> OK
> 
>>> +as required by PCI spec will reserve a 4K IO range for each.
>>> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
>>> +it by allocation the IO space only if there is at least a device
>>> +with IO BARs plugged into the bridge.
>>
>> This used to be true, but is no longer true, for OVMF. And I think it's
>> actually correct: we *should* keep the 4K IO reservation per PCI-PCI
>> bridge.
>>
> 
> I'll change to "should".
> 
>> (But, certainly no IO reservation for PCI Express root port, upstream
>> port, or downstream port! And i'll need your help for telling these
>> apart in OVMF.)
>>
> 
> Just let me know how can I help.

Well, in the EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
implementation, I'll have to look at the PCI config space of the
"bridge-like" PCI device that the generic PCI Bus driver of edk2 passes
back to me, asking me about resource reservation.

Based on the config space, I should be able to tell apart "PCI-PCI
bridge" from "PCI Express downstream or root port". So what I'd need
here is a semi-formal natural language description of these conditions.

Hmm, actually I think I've already written code, for another patch, that
identifies the latter category. So everything where that check doesn't
fire can be deemed "PCI-PCI bridge". (This hook gets called only for
bridges.)

Yet another alternative: if we go for the special PCI capability, for
exposing reservation sizes from QEMU to the firmware, then I can simply
search the capability list for just that capability. I think that could
be the easiest for me.

> 
>> Let me elaborate more under section "4. Hot Plug". For now let me just
>> say that I'd like this language about optimization to be dropped.
>>
>>> +Behind a PCIe PORT only one device may be plugged, resulting in
>>
>> (again, please spell out downstream and root port)
>>
> 
> OK
> 
>>> +the allocation of a whole 4K range for each device.
>>> +The IO space is limited resulting in ~10 PCIe ports per system
>>
>> (limited to 65536 byte-wide IO ports, but it's fragmented, so we have
>> about 10 * 4K free)
>>
>>> +if devices with IO BARs are plugged into IO ports.
>>
>> not "into IO ports" but "into PCI Express downstream and root ports".
>>
> 
> oops, thanks
> 
>>> +
>>> +Using the proposed device placing strategy solves this issue
>>
>>> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires
>>
>> (please spell out root / downstream etc)
>>
> 
> OK
> 
>>> +PCIe devices to work without IO BARs.
>>> +The PCI hierarchy has no such limitations.
>>
>> I'm sorry to have fragmented this section with so many comments, but the
>> idea is actually splendid, in my opinion!
>>
> 
> Thanks!
> 
>>
>> ... Okay, still speaking resources, could you insert a brief section
>> here about bus numbers? Under "3. IO space issues", you already explain
>> how "practically everything" qualifies as a PCI bridge. We should
>> mention that all those things, such as:
>> - root complex pcie.0,
>> - root complex added by pxb-pcie,
>> - root ports,
>> - upstream ports, downstream ports,
>> - bridges, etc
>>
>> take up bus numbers, and we have 256 bus numbers in total.
>>
> 
> I'll add a section for bus numbers, sure.
> 
>> In the next section you state that PCI hotplug (ACPI based) and PCI
>> Express hotplug (native) can work side by side, which is correct, and
>> the IO space competition is eliminated by the scheme proposed in section
>> 3 -- and the MMIO space competition is "obvious" --, but the bus number
>> starvation is *very much* non-obvious. It should be spelled out. I think
>> it deserves a separate section. (Again, with an eye toward
>> qemu-system-aarch64 -M virt -- we've seen PCI Express failures there,
>> and they were due to bus number starvation. It wasn't fun to debug.
>> (Well, it was, but don't tell anyone :)))
>>
> 
> Got it, I'll try to make PCI Bus numbering a limitation as important as IO.
> And we need to start looking at ways to solve this:
>   1. pxb-pcie starting different PCI domains - pxb-pcie became another
> Root Complex
>   2. switches can theoretically start PCI domains - emulate a Switch
> doing this.
> Long term plans, of course.

Right, let's not rush that; first I'd like to reach a status where PCI
and PCI Express hotplug "just works" with OVMF... And if it fails, I
should be able to point at the user's config, and this document, and say
"wrong configuration". That's the goal. :)

> 
>>> +
>>> +
>>> +4. Hot Plug
>>> +============
>>> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,
>>
>> s/root bus/root complex/? Also, any root complexes added with pxb-pcie
>> don't support hotplug.
>>
> 
> Actually pxb-pcie should support PCI Express Native Hotplug.

Huh, interesting.

> If they don't is a bug and I'll take care of it.

Hmm, a bit lower down you mention that PCI Express native hot plug is
based on SHPCs. So, when you say that pxb-pcie should support PCI
Express Native Hotplug, you mean that it should occur through SHPC, right?

However, for pxb-pci*, we had to disable SHPC: see QEMU commit
d10dda2d60c8 ("hw/pci-bridge: disable SHPC in PXB"), in June 2015.

For background, the series around it was
<https://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05136.html>
-- I think v7 was the last version.

... Actually, now I wonder if d10dda2d60c8 should be possible to revert
at this point! Namely, in OVMF I may have unwittingly fixed this issue
-- obviously much later than the QEMU commit: in March 2016. See

https://github.com/tianocore/edk2/commit/8f35eb92c419

If you look at the commit message of the QEMU patch, it says

    [...]

    Unfortunately, when this happens, the PCI_COMMAND_MEMORY bit is
    clear in the root bus's command register [...]

which I think should no longer be true, thanks to edk2 commit 8f35eb92c419.

So maybe we should re-evaluate QEMU commit d10dda2d60c8. If pxb-pci and
pxb-pcie work with current OVMF, due to edk2 commit 8f35eb92c419, then
maybe we should revert QEMU commit d10dda2d60c8.

Not urgent for me :), obviously, I'm just explaining so you can make a
note for later, if you wish to (if hot-plugging directly into pxb-pcie
should be necessary -- I think it's very low priority).

> For pxb-pci (the PCI counter-part) is another story, it needs ACPI code
> to be emitted and the feature is not yet implemented.
> 
> 
>>> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.
>>
>> I would say: ... so anything that plugs *only* into a root complex,
>> cannot be hotplugged. Then the list is what you mention here (also
>> referring back to options (1), (2) and (4) in section 2.1), *plus* I
>> would also add option (5): pxb-pcie can also not be hotplugged.
>>
> 
> because is actually an Integrated Device.
> 
>>> +
>>> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
>>> +in QEMU preventing it to work, but it would be solved soon).
>>
>> What bug?
>>
> 
> As stated above, PCI hotplug is based on emitting ACPI code for
> recognizing the right slot (see "bsel" ACPI variables). Basically each
> PCI-Bridge slot has a different "bsel" value used during
> hotplug mechanism to identify the slot where the device is
> hot-plugged/hot-unplugged.
> 
> For PC machine the ACPI is generated while for Q35 is not.
> (I think I've already sent an RFC some time ago for that)
> 
>> Anyway, I'm unsure we should add this remark here -- it's a guide, not a
>> status report. I'm worried that whenever we fix that bug, we forget to
>> remove this remark.
>>
> 
> will remove it
> 
>>> +The PCI hotplug is ACPI based and can work side by side with the PCIe
>>> +native hotplug.
>>> +
>>> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
>>> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.
>>
>> I would mention the order (upstream port, downstream port), also add
>> some command lines maybe.
>>
> 
> I'll add some hmp example. Should I try it before :) ?

Seeing it function as expected wouldn't hurt! :)

> 
>>> +Keep in mind you always need to have at least one PCIe Port available
>>> +for hotplug, the PCIe Ports themselves are not hot-pluggable.
>>
>> Well, the downstream ports of a switch that is being added *are*, aren't
>> they?
> 
> Nope, you cannot hotplug a PCI Express Root Port or a PCI Express
> Downstream Port.
> The reason: The PCI Express Native Hotplug is based on SHPCs (Standard
> HotPlug Controllers)
> which are integrated only in the mentioned ports and not in Upstream
> Ports or the Root Complex.
> The "other" reason: When you buy a switch/server it has a number of
> ports and that's it.
> You cannot add "later".

Makes sense, thank you. I think if you add the HMP example, it will make
it clear. I only assumed that you needed several monitor commands for
hotplugging a single switch (i.e., one command per one port) because on
the QEMU command line you do need a separate -device option for the
upstream port, and every single downstream port, of the same switch.

If, using the monitor, it's just one device_add for the upstream port,
and the downstream ports are added automatically, then I guess it'll be
easy to understand.

> 
>>
>> But, this question is actually irrelevant IMO, because here I would add
>> another subsection about *planning* for hot-plug. (I think that's pretty
>> important.) And those plans should make the hotplugging of switches
>> unnecessary!
>>
> 
> I'll add a subsection for it. But when you are out of options you *can*
> hotplug a switch if your sysadmin skills are limited...

You probably can, but then we'll run into the resource allocation
problem again:

(1) The user will hotplug a switch (= S1) under a root port with, say,
two downstream ports (= S1-DP1, S1-DP2).

(2) They'll then plug a PCI Express device into one of those downstream
ports (S1-DP1-Dev1).

(3) Then they'll want to hot-plug *another* switch into the *other*
downstream port (S1-DP2-S2).


                         DP1 -- Dev1 (2)
                        /
     root port -- S1 (1)
                        \
                         DP2 -- S2 (3)

However, concerning the resource needs of S2 (and especially the devices
hot-plugged under S2!), S1 won't have enough left over, because Dev1
(under DP1) will have eaten into them, and Dev1's BARs will have been
programmed!

We could never credibly explain our way out of this situation in a bug
report. For that reason, I think we should discourage hotplug ideas that
would change the topology, and require recursive resource allocation at
higher levels and/or parallel branches of the topology.

I know Linux can do that, and it even succeeds if there is enough room,
but from the messages seen in the guest dmesg when it fails, how do you
explain to the user that they should have plugged in S2 first, and Dev1
second?

So, we should recommend *not* to hotplug switches or PCI-PCI bridges.
Instead,
- keep a very flat hierarchy from the start;
- for PCI Express, add as many root ports and downstream ports as you
deem enough for future hotplug needs (keeping the flat formula I described);
- for legacy PCI, add as many sibling PCI-PCI bridges directly under the
one DMI-PCI bridge as you deem sufficient for future hotplug needs.

In short, don't change the hierarchy at runtime by hotplugging internal
nodes; hotplug *leaf nodes* only.

> 
>> * For the PCI Express hierarchy, I recommended a flat structure above.
>> The 256 bus numbers can easily be exhausted / covered by 25 root ports
>> plus 7 switches (each switch being fully populated with downstream
>> ports). This should allow all sysadmins to estimate their expected
>> numbers of hotplug PCI Express devices, in advance, and create enough
>> root ports / downstream ports. (Sysadmins are already used to planning
>> for hotplug, see VCPUs, memory (DIMM), memory (balloon).)
>>
> 
> That's another good idea, I'll add it to the doc, thanks!
> 
>> * For the PCI hierarchy, it should be even simpler, but worth a mention
>> -- start with enough PCI-PCI bridges under the one DMI-PCI bridge.
>>
> 
> OK
> 
>> * Finally, this is the spot where we should design and explain our
>> resource reservation for hotplug:
>>
>>   - For PCI Express hotplug, please explain that only such PCI Express
>>     devices can be hotplugged that require no IO space -- in section
>>     "3. IO space issues" you mention that this is a valid restriction.
>>     Furthermore, please state the MMIO32 and/or MMIO64 sizes that the
>>     firmware needs to reserve for root ports / downstream ports, and
>>     also explain that these sizes will act as maximum size and
>>     alignment limits for *individual* hotplug devices.
>>
> 
> OK
> 
>>     We can invent fw_cfg switches for this, or maybe even a special PCI
>>     Express capability (to be placed in the config space of root and
>>     downstream ports).
>>
> 
> Gerd explicitly asked for the second idea (vendor specific capability)

Nice, thank you for confirming it; let's do this then. It will also
simplify my work in the
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() function: it should
suffice to scan the config space of the bridge, regardless of the
"PCI-PCI bridge / PCI Express root or downstream port" distinction.

> 
>>   - For legacy PCI hotplug, this is where my evil plan of "no more than
>>     9 (nine) PCI-PCI bridges under the 1 (one) DMI-PCI bridge" unfolds!
>>
> 
> The same as for PC, should this doc deal with this if is a common issue?
> Maybe a simple comment is enough.
> 
>>     We've stated above (in Section 3) that we have about 10*4KB IO port
>>     space. One of those 4K chunks will go (collectively) to the
>>     Integrated PCI devices that sit on pcie.0. (If there are other root
>>     complexes from pxb-pcie devices, then those will get one chunk each
>>     too.) The rest -- hence assume 9 or fewer chunks -- will be consumed
>>     by the 9 (or respectively fewer) PCI-PCI bridges, for hotplug
>>     reservation. The upshot is that as long as a sysadmin sticks with
>>     our flat, "9 PCI-PCI bridges total" recommendation for the legacy
>>
>>     PCI hierarchy, the IO reservation will be covered *immediately*.
>>     Simply put: don't create more PCI-PCI bridges than you have IO
>>     space for -- and that should leave you with about 9 "sibling"
>>     bridges, which are plenty enough for a huge number of legacy PCI
>>     devices!
>>
> 
> I'll use that, thanks!
> 
>>     Furthermore, please state the MMIO32 / MMIO64 amount to reserve
>>     *per PCI-PCI bridge*. The firmware programmers need to know this,
>>     and people planning for legacy PCI hotplug should be informed that
>>     those limits are for all devices *together* on the same PCI-PCI
>>     bridge.
>>
> 
> Yes... we'll ask here for the minimum 8MB MMIO because of the virtio 1.0
> behavior and use that when "pushing" the patches for OVMF/SeaBIOS.
> Its fun, first we "make" the rules, then we say "Hey, is written in QEMU
> docs".

Absolutely. This is how it should work! ;)

The best part of Gerd's suggestion, as far as the firmwares are
concerned, is that we won't have to hard-code any constants in the
firmware. We'll just have to parse the PCI config spaces of the bridges,
for the vendor specific capability, and use the numbers from the
capability for resource reservation. The OVMF and SeaBIOS patches won't
have any constants in them. :)

Should we change our minds in QEMU later, no firmware patches should be
necessary.

> 
>>     Again, we could expose this via fw_cfg, or in a special capability
>>     (as suggested by Gerd IIRC) in the PCI config space of the PCI-PCI
>>     bridge.
>>
> 
> Agreed
> 
>>> +
>>> +
>>> +5. Device assignment
>>> +====================
>>> +Host devices are mostly PCIe and should be plugged only into PCIe
>>> ports.
>>> +PCI-PCI bridge slots can be used for legacy PCI host devices.
>>
>> Please provide a command line (lspci) so that users can easily determine
>> if the device they wish to assign is legacy PCI or PCI Express.
>>
> 
> OK, something like:
> 
> lspci -s 03:00.0 -v (as root)
> 03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
>     Subsystem: Intel Corporation Dual Band Wireless-AC 7260
>     Flags: bus master, fast devsel, latency 0, IRQ 50
>     Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
>     Capabilities: [c8] Power Management version 3
>     Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>     Capabilities: [40] Express Endpoint, MSI 00
> 
>                             ^^^^^^^^^^^^^^^
> 
>     Capabilities: [100] Advanced Error Reporting
>     Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
>     Capabilities: [14c] Latency Tolerance Reporting
>     Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1
> Len=014 <?>

Yep, looks great.

> 
> 
> 
>>> +
>>> +
>>> +6. Virtio devices
>>> +=================
>>> +Virtio devices plugged into the PCI hierarchy or as an Integrated
>>> Devices
>>
>> (drop "an")
>>
> 
> OK
> 
>>> +will remain PCI and have transitional behaviour as default.
>>
>> (Please add one sentence about what "transitional" means in this context
>> -- they'll have both IO and MMIO BARs.)
>>
> 
> OK
> 
>>> +Virtio devices plugged into PCIe ports are Express devices and have
>>> +"1.0" behavior by default without IO support.
>>> +In both case disable-* properties can be used to override the
>>> behaviour.
>>
>> Please emphasize that setting disable-legacy=off (that is, enabling
>> legacy behavior) for PCI Express virtio devices will cause them to
>> require IO space, which, given our PCI Express hierarchy, may quickly
>> lead to resource exhaustion, and is therefore strongly discouraged.
>>
> 
> Sure
> 
>>> +
>>> +
>>> +7. Conclusion
>>> +==============
>>> +The proposal offers a usage model that is easy to understand and follow
>>> +and in the same time overcomes some PCIe limitations.
>>
>> I agree!
>>
>> Thanks!
>> Laszlo
>>
> 
> 
> Thanks for the detailed review! I was planning on sending V2 today,
> but to follow your comments I will need another day.
> Don't get me wrong, it totally worth it :)

Thank you. I'm very happy about this document coming along. And, I too
will likely need a separate day to review your v2. :)

Cheers!
Laszlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 11:35   ` Gerd Hoffmann
@ 2016-09-06 13:58     ` Laine Stump
  2016-09-07  7:04       ` Gerd Hoffmann
  2016-09-06 14:47     ` Marcel Apfelbaum
  2016-09-07  7:53     ` Laszlo Ersek
  2 siblings, 1 reply; 52+ messages in thread
From: Laine Stump @ 2016-09-06 13:58 UTC (permalink / raw)
  To: Gerd Hoffmann, Laszlo Ersek
  Cc: Marcel Apfelbaum, qemu-devel, mst, Peter Maydell, Drew Jones,
	Andrea Bolognani, Alex Williamson

On 09/06/2016 07:35 AM, Gerd Hoffmann wrote:
> While talking about integrated devices:  There is docs/q35-chipset.cfg,
> which documents how to mimic q35 with integrated devices as close and
> complete as possible.
>
> Usage:
>   qemu-system-x86_64 -M q35 -readconfig docs/q35-chipset.cfg $args
>
> Side note for usb: In practice you don't want to use the tons of
> uhci/ehci controllers present in the original q35 but plug xhci into one
> of the pcie root ports instead (unless your guest doesn't support xhci).

I've wondered about that recently. For i440fx machinetypes if you don't 
specify a USB controller in libvirt's domain config, you will 
automatically get the PIIX3 USB controller added. In order to maintain 
consistency on the topic of "auto-adding USB when not specified", if the 
machinetype is Q35 we will autoadd a set of USB2 (uhci/ehci) controllers 
(I think I added that based on your comments at the time :-). But 
recently I've mostly been hearing that people should use xhci instead. 
So should libvirt add a single xhci (rather than the uhci/ehci set) at 
the same port when no USB is specified?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 13:31     ` Laszlo Ersek
@ 2016-09-06 14:46       ` Marcel Apfelbaum
  2016-09-07  6:21       ` Gerd Hoffmann
  1 sibling, 0 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-06 14:46 UTC (permalink / raw)
  To: Laszlo Ersek, qemu-devel
  Cc: mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani,
	Alex Williamson, Gerd Hoffmann

On 09/06/2016 04:31 PM, Laszlo Ersek wrote:
> On 09/05/16 22:02, Marcel Apfelbaum wrote:
>> On 09/05/2016 07:24 PM, Laszlo Ersek wrote:
>>> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>>>> Proposes best practices on how to use PCIe/PCI device
>>>> in PCIe based machines and explain the reasoning behind them.
>>>>
>>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>>>> ---
>>>>
>>>> Hi,
>>>>
>>>> Please add your comments on what to add/remove/edit to make this doc
>>>> usable.
>>>
>>

[...]

>>
>>> (But, certainly no IO reservation for PCI Express root port, upstream
>>> port, or downstream port! And i'll need your help for telling these
>>> apart in OVMF.)
>>>
>>
>> Just let me know how can I help.
>
> Well, in the EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
> implementation, I'll have to look at the PCI config space of the
> "bridge-like" PCI device that the generic PCI Bus driver of edk2 passes
> back to me, asking me about resource reservation.
>
> Based on the config space, I should be able to tell apart "PCI-PCI
> bridge" from "PCI Express downstream or root port". So what I'd need
> here is a semi-formal natural language description of these conditions.

You can use PCI Express Spec: 7.8.2. PCI Express Capabilities Register (Offset 02h)

Bit 7:4 Register Description:
Device/Port Type – Indicates the specific type of this PCI
Express Function. Note that different Functions in a multi-
Function device can generally be of different types.
Defined encodings are:
0000b PCI Express Endpoint
0001b Legacy PCI Express Endpoint
0100b Root Port of PCI Express Root Complex*
0101b Upstream Port of PCI Express Switch*
0110b Downstream Port of PCI Express Switch*
0111b PCI Express to PCI/PCI-X Bridge*
1000b PCI/PCI-X to PCI Express Bridge*
1001b Root Complex Integrated Endpoint



>
> Hmm, actually I think I've already written code, for another patch, that
> identifies the latter category. So everything where that check doesn't
> fire can be deemed "PCI-PCI bridge". (This hook gets called only for
> bridges.)
>
> Yet another alternative: if we go for the special PCI capability, for
> exposing reservation sizes from QEMU to the firmware, then I can simply
> search the capability list for just that capability. I think that could
> be the easiest for me.
>

That would be a "later" step.
BTW, following a offline chat with Michael S. Tsirkin
regarding virto 1.0 requiring 8M MMIO by default we arrived to a conclusion that
is not really needed and we came up with an alternative that will require less then 2M
MMIO space.
I put this here because the above solution will give us some time to deal with
the MMIO ranges reservation.

[...]

>>>> +
>>>> +
>>>> +4. Hot Plug
>>>> +============
>>>> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,
>>>
>>> s/root bus/root complex/? Also, any root complexes added with pxb-pcie
>>> don't support hotplug.
>>>
>>
>> Actually pxb-pcie should support PCI Express Native Hotplug.
>
> Huh, interesting.
>
>> If they don't is a bug and I'll take care of it.
>
> Hmm, a bit lower down you mention that PCI Express native hot plug is
> based on SHPCs. So, when you say that pxb-pcie should support PCI
> Express Native Hotplug, you mean that it should occur through SHPC, right?
>

Yes, but I was talking about the Integrated SHPCs of the PCI Express
Root Ports and PCI Express Downstream Ports. (devices plugged into them)


> However, for pxb-pci*, we had to disable SHPC: see QEMU commit
> d10dda2d60c8 ("hw/pci-bridge: disable SHPC in PXB"), in June 2015.
>

This is only for the pxb device (not pxb-pcie) and only for the internal pci-bridge that comes with it.
And... we don't use SHPC based hot-plug for PCI, only for PCI Express.
For PCI we are using only the ACPI hotplug. So disabling it is not so bad.

The pxb-pcie does not have the internal PCI bridge. You don't need it because:
1. You can't have Integrated Devices for pxb-pcie
2. The PCI Express Upstream Port is a type of PCI-Bridge anyway.


> For background, the series around it was
> <https://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05136.html>
> -- I think v7 was the last version.
>
> ... Actually, now I wonder if d10dda2d60c8 should be possible to revert
> at this point! Namely, in OVMF I may have unwittingly fixed this issue
> -- obviously much later than the QEMU commit: in March 2016. See
>
> https://github.com/tianocore/edk2/commit/8f35eb92c419
>
> If you look at the commit message of the QEMU patch, it says
>
>     [...]
>
>     Unfortunately, when this happens, the PCI_COMMAND_MEMORY bit is
>     clear in the root bus's command register [...]
>
> which I think should no longer be true, thanks to edk2 commit 8f35eb92c419.
>
> So maybe we should re-evaluate QEMU commit d10dda2d60c8. If pxb-pci and
> pxb-pcie work with current OVMF, due to edk2 commit 8f35eb92c419, then
> maybe we should revert QEMU commit d10dda2d60c8.
>
> Not urgent for me :), obviously, I'm just explaining so you can make a
> note for later, if you wish to (if hot-plugging directly into pxb-pcie
> should be necessary -- I think it's very low priority).
>

As stated above, since we don't use it anyway it doesn't matter.

[...]

>> Nope, you cannot hotplug a PCI Express Root Port or a PCI Express
>> Downstream Port.
>> The reason: The PCI Express Native Hotplug is based on SHPCs (Standard
>> HotPlug Controllers)
>> which are integrated only in the mentioned ports and not in Upstream
>> Ports or the Root Complex.
>> The "other" reason: When you buy a switch/server it has a number of
>> ports and that's it.
>> You cannot add "later".
>
> Makes sense, thank you. I think if you add the HMP example, it will make
> it clear. I only assumed that you needed several monitor commands for
> hotplugging a single switch (i.e., one command per one port) because on
> the QEMU command line you do need a separate -device option for the
> upstream port, and every single downstream port, of the same switch.
>
> If, using the monitor, it's just one device_add for the upstream port,
> and the downstream ports are added automatically, then I guess it'll be
> easy to understand.
>

No it doesn't work like that, you would need to add them one by one (upstream ports and then downstream ports)
as far as I understand it.
Actually I've never done it before, I'll try it first and update the doc on
how it should be done. (if it can be done...)

>>
>>>
>>> But, this question is actually irrelevant IMO, because here I would add
>>> another subsection about *planning* for hot-plug. (I think that's pretty
>>> important.) And those plans should make the hotplugging of switches
>>> unnecessary!
>>>
>>
>> I'll add a subsection for it. But when you are out of options you *can*
>> hotplug a switch if your sysadmin skills are limited...
>
> You probably can, but then we'll run into the resource allocation
> problem again:
>
> (1) The user will hotplug a switch (= S1) under a root port with, say,
> two downstream ports (= S1-DP1, S1-DP2).
>
> (2) They'll then plug a PCI Express device into one of those downstream
> ports (S1-DP1-Dev1).
>
> (3) Then they'll want to hot-plug *another* switch into the *other*
> downstream port (S1-DP2-S2).
>
>
>                          DP1 -- Dev1 (2)
>                         /
>      root port -- S1 (1)
>                         \
>                          DP2 -- S2 (3)
>
> However, concerning the resource needs of S2 (and especially the devices
> hot-plugged under S2!), S1 won't have enough left over, because Dev1
> (under DP1) will have eaten into them, and Dev1's BARs will have been
> programmed!
>

Theoretically the Guest OS should trigger PCI resources re-allocation
but I agree we should not count on them.

> We could never credibly explain our way out of this situation in a bug
> report. For that reason, I think we should discourage hotplug ideas that
> would change the topology, and require recursive resource allocation at
> higher levels and/or parallel branches of the topology.
>
> I know Linux can do that, and it even succeeds if there is enough room,
> but from the messages seen in the guest dmesg when it fails, how do you
> explain to the user that they should have plugged in S2 first, and Dev1
> second?
>
> So, we should recommend *not* to hotplug switches or PCI-PCI bridges.
> Instead,
> - keep a very flat hierarchy from the start;
> - for PCI Express, add as many root ports and downstream ports as you
> deem enough for future hotplug needs (keeping the flat formula I described);
> - for legacy PCI, add as many sibling PCI-PCI bridges directly under the
> one DMI-PCI bridge as you deem sufficient for future hotplug needs.
>
> In short, don't change the hierarchy at runtime by hotplugging internal
> nodes; hotplug *leaf nodes* only.
>

Agreed. I'll re-use some of your comments in the doc.

>>

[...]

>>
>> Gerd explicitly asked for the second idea (vendor specific capability)
>
> Nice, thank you for confirming it; let's do this then. It will also
> simplify my work in the
> EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() function: it should
> suffice to scan the config space of the bridge, regardless of the
> "PCI-PCI bridge / PCI Express root or downstream port" distinction.
>

Will do, but since we have a quick way to deal with the current issue
(virtio 1.0 requiring 8MB MMIO while firmware reserving 2MB for PCI-Bridges hotplug)

[...]

Thanks,
Marcel

>
> Cheers!
> Laszlo
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 11:35   ` Gerd Hoffmann
  2016-09-06 13:58     ` Laine Stump
@ 2016-09-06 14:47     ` Marcel Apfelbaum
  2016-09-07  7:53     ` Laszlo Ersek
  2 siblings, 0 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-06 14:47 UTC (permalink / raw)
  To: Gerd Hoffmann, Laszlo Ersek
  Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump,
	Andrea Bolognani, Alex Williamson

On 09/06/2016 02:35 PM, Gerd Hoffmann wrote:
>   Hi,
>
>>> +Plug only legacy PCI devices as Root Complex Integrated Devices
>>> +even if the PCIe spec does not forbid PCIe devices.
>>
>> I suggest "even though the PCI Express spec does not forbid PCI Express
>> devices as Integrated Devices". (Detail is good!)
>
> While talking about integrated devices:  There is docs/q35-chipset.cfg,
> which documents how to mimic q35 with integrated devices as close and
> complete as possible.
>
> Usage:
>   qemu-system-x86_64 -M q35 -readconfig docs/q35-chipset.cfg $args
>
> Side note for usb: In practice you don't want to use the tons of
> uhci/ehci controllers present in the original q35 but plug xhci into one
> of the pcie root ports instead (unless your guest doesn't support xhci).
>

Hi Gerd,

Thanks for the comments, I'll be sure to refer them in the doc.
Marcel


[...]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-01 13:22 [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines Marcel Apfelbaum
  2016-09-01 13:27 ` Peter Maydell
  2016-09-05 16:24 ` Laszlo Ersek
@ 2016-09-06 15:38 ` Alex Williamson
  2016-09-06 18:14   ` Marcel Apfelbaum
  2 siblings, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2016-09-06 15:38 UTC (permalink / raw)
  To: Marcel Apfelbaum; +Cc: qemu-devel, lersek, mst

On Thu,  1 Sep 2016 16:22:07 +0300
Marcel Apfelbaum <marcel@redhat.com> wrote:

> Proposes best practices on how to use PCIe/PCI device
> in PCIe based machines and explain the reasoning behind them.
> 
> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
> ---
> 
> Hi,
> 
> Please add your comments on what to add/remove/edit to make this doc usable.
> 
> Thanks,
> Marcel
> 
>  docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 145 insertions(+)
>  create mode 100644 docs/pcie.txt
> 
> diff --git a/docs/pcie.txt b/docs/pcie.txt
> new file mode 100644
> index 0000000..52a8830
> --- /dev/null
> +++ b/docs/pcie.txt
> @@ -0,0 +1,145 @@
> +PCI EXPRESS GUIDELINES
> +======================
> +
> +1. Introduction
> +================
> +The doc proposes best practices on how to use PCIe/PCI device
> +in PCIe based machines and explains the reasoning behind them.
> +
> +
> +2. Device placement strategy
> +============================
> +QEMU does not have a clear socket-device matching mechanism
> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
> +Plugging a PCI device into a PCIe device might not always work and
> +is weird anyway since it cannot be done for "bare metal".
> +Plugging a PCIe device into a PCI slot will hide the Extended
> +Configuration Space thus is also not recommended.
> +
> +The recommendation is to separate the PCIe and PCI hierarchies.
> +PCIe devices should be plugged only into PCIe Root Ports and
> +PCIe Downstream ports (let's call them PCIe ports).
> +
> +2.1 Root Bus (pcie.0)
> +=====================
> +Plug only legacy PCI devices as Root Complex Integrated Devices
> +even if the PCIe spec does not forbid PCIe devices. The existing

Surely we can have PCIe device on the root complex??

> +hardware uses mostly PCI devices as Integrated Endpoints. In this
> +way we may avoid some strange Guest OS-es behaviour.
> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports)
> +or DMI-PCI bridges to start legacy PCI hierarchies.
> +
> +
> +   pcie.0 bus
> +   --------------------------------------------------------------------------
> +        |                |                    |                   |
> +   -----------   ------------------   ------------------  ------------------
> +   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |  | DMI-PCI bridge |
> +   -----------   ------------------   ------------------  ------------------

Do you have a spec reference for plugging an upstream port directly
into the root complex?  IMHO this is invalid, an upstream port can only
be attached behind a downstream port, ie. a root port or downstream
switch port.

> +
> +2.2 PCIe only hierarchy
> +=======================
> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream
> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches
> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports.

This seems to contradict 2.1, but I agree more with this statement to
only start a PCIe sub-hierarchy with a root port, not an upstream port
connected to the root complex.  The 2nd sentence is confusing, I don't
know if you're referring to fan-out via PCIe switch downstream of a
root port or again suggesting to use upstream switch ports directly on
the root complex.

> +
> +
> +   pcie.0 bus
> +   ----------------------------------------------------
> +        |                |               |
> +   -------------   -------------   -------------
> +   | Root Port |   | Root Port |   | Root Port |
> +   ------------   --------------   -------------
> +         |                               |
> +    ------------                 -----------------
> +    | PCIe Dev |                 | Upstream Port |
> +    ------------                 -----------------
> +                                  |            |
> +                     -------------------    -------------------
> +                     | Downstream Port |    | Downstream Port |
> +                     -------------------    -------------------
> +                             |
> +                         ------------
> +                         | PCIe Dev |
> +                         ------------
> +
> +2.3 PCI only hierarchy
> +======================
> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
> +only into pcie.0 bus.
> +
> +   pcie.0 bus
> +   ----------------------------------------------
> +        |                            |
> +   -----------               ------------------
> +   | PCI Dev |               | DMI-PCI BRIDGE |
> +   ----------                ------------------
> +                               |            |
> +                        -----------    ------------------
> +                        | PCI Dev |    | PCI-PCI Bridge |
> +                        -----------    ------------------
> +                                         |           |
> +                                  -----------     -----------
> +                                  | PCI Dev |     | PCI Dev |
> +                                  -----------     -----------
> +

I really wish we had generic PCIe-to-PCI bridges rather than this DMI
bridge thing...

> +
> +
> +3. IO space issues
> +===================
> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and

Yeah, I've lost the meaning of Ports here, this statement is true for
upstream ports as well.

> +as required by PCI spec will reserve a 4K IO range for each.
> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
> +it by allocation the IO space only if there is at least a device
> +with IO BARs plugged into the bridge.
> +Behind a PCIe PORT only one device may be plugged, resulting in

Here I think you're trying to specify root/downstream ports, but
upstream ports have the same i/o port allocation problems and do not
have this one device limitation.

> +the allocation of a whole 4K range for each device.
> +The IO space is limited resulting in ~10 PCIe ports per system
> +if devices with IO BARs are plugged into IO ports.
> +
> +Using the proposed device placing strategy solves this issue
> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires
> +PCIe devices to work without IO BARs.
> +The PCI hierarchy has no such limitations.

Actually it does, but it's mostly not an issue since we have 32 slots
available (minus QEMU/libvirt excluding 1 for no good reason)
downstream of each bridge.

> +
> +
> +4. Hot Plug
> +============
> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,
> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.
> +
> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
> +in QEMU preventing it to work, but it would be solved soon).

Probably want to give some sort of date/commit references to these
current state of affairs facts, a reader is not likely to lookup the
git commit for this verbiage and extrapolate it to a QEMU version.

> +The PCI hotplug is ACPI based and can work side by side with the PCIe
> +native hotplug.
> +
> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.

Why?  This seems like a QEMU bug.  Clearly we need the downstream ports
in place when the upstream switch is hot-added, but this should be
feasible.

> +Keep in mind you always need to have at least one PCIe Port available
> +for hotplug, the PCIe Ports themselves are not hot-pluggable.

If a user cares about hotplug...

> +
> +
> +5. Device assignment
> +====================
> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
> +PCI-PCI bridge slots can be used for legacy PCI host devices.

I don't think we have any evidence to suggest this as a best practice.
We have a lot of experience placing PCIe host devices into a
conventional PCI topology on 440FX.  We don't have nearly as much
experience placing them into downstream PCIe ports.  This seems like
how we would like for things to behave to look like real hardware
platforms, but it's just navel gazing whether it's actually the right
thing to do.  Thanks,

Alex

> +
> +
> +6. Virtio devices
> +=================
> +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices
> +will remain PCI and have transitional behaviour as default.
> +Virtio devices plugged into PCIe ports are Express devices and have
> +"1.0" behavior by default without IO support.
> +In both case disable-* properties can be used to override the behaviour.
> +
> +
> +7. Conclusion
> +==============
> +The proposal offers a usage model that is easy to understand and follow
> +and in the same time overcomes some PCIe limitations.
> +
> +
> +

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 15:38 ` Alex Williamson
@ 2016-09-06 18:14   ` Marcel Apfelbaum
  2016-09-06 18:32     ` Alex Williamson
  0 siblings, 1 reply; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-06 18:14 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, lersek, mst

On 09/06/2016 06:38 PM, Alex Williamson wrote:
> On Thu,  1 Sep 2016 16:22:07 +0300
> Marcel Apfelbaum <marcel@redhat.com> wrote:
>
>> Proposes best practices on how to use PCIe/PCI device
>> in PCIe based machines and explain the reasoning behind them.
>>
>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>> ---
>>
>> Hi,
>>
>> Please add your comments on what to add/remove/edit to make this doc usable.
>>
>> Thanks,
>> Marcel
>>
>>  docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 145 insertions(+)
>>  create mode 100644 docs/pcie.txt
>>
>> diff --git a/docs/pcie.txt b/docs/pcie.txt
>> new file mode 100644
>> index 0000000..52a8830
>> --- /dev/null
>> +++ b/docs/pcie.txt
>> @@ -0,0 +1,145 @@
>> +PCI EXPRESS GUIDELINES
>> +======================
>> +
>> +1. Introduction
>> +================
>> +The doc proposes best practices on how to use PCIe/PCI device
>> +in PCIe based machines and explains the reasoning behind them.
>> +
>> +
>> +2. Device placement strategy
>> +============================
>> +QEMU does not have a clear socket-device matching mechanism
>> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
>> +Plugging a PCI device into a PCIe device might not always work and
>> +is weird anyway since it cannot be done for "bare metal".
>> +Plugging a PCIe device into a PCI slot will hide the Extended
>> +Configuration Space thus is also not recommended.
>> +
>> +The recommendation is to separate the PCIe and PCI hierarchies.
>> +PCIe devices should be plugged only into PCIe Root Ports and
>> +PCIe Downstream ports (let's call them PCIe ports).
>> +
>> +2.1 Root Bus (pcie.0)
>> +=====================
>> +Plug only legacy PCI devices as Root Complex Integrated Devices
>> +even if the PCIe spec does not forbid PCIe devices. The existing
>

Hi Alex,
Thanks for the review.


> Surely we can have PCIe device on the root complex??
>

Yes, we can, is not forbidden. Even so, my understanding is
the main use for Integrated Devices is for legacy devices
like sound cards or nics that come with the motherboard.
Because of that my concern is we might be missing some support
for that in QEMU or even in linux kernel.

One example I got from Jason about an issue with Integrated Points in kernel:

commit d14053b3c714178525f22660e6aaf41263d00056
Author: David Woodhouse <David.Woodhouse@intel.com>
Date:   Thu Oct 15 09:28:06 2015 +0100

     iommu/vt-d: Fix ATSR handling for Root-Complex integrated endpoints

     The VT-d specification says that "Software must enable ATS on endpoint
     devices behind a Root Port only if the Root Port is reported as
     supporting ATS transactions."
   ....

We can say is a bug and is solved, what's the problem?
But my point it, why do it in the first place?
We are the hardware "vendors" and we can decide not to add PCIe
devices as Integrated Devices.



>> +hardware uses mostly PCI devices as Integrated Endpoints. In this
>> +way we may avoid some strange Guest OS-es behaviour.
>> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports)
>> +or DMI-PCI bridges to start legacy PCI hierarchies.
>> +
>> +
>> +   pcie.0 bus
>> +   --------------------------------------------------------------------------
>> +        |                |                    |                   |
>> +   -----------   ------------------   ------------------  ------------------
>> +   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |  | DMI-PCI bridge |
>> +   -----------   ------------------   ------------------  ------------------
>
> Do you have a spec reference for plugging an upstream port directly
> into the root complex?  IMHO this is invalid, an upstream port can only
> be attached behind a downstream port, ie. a root port or downstream
> switch port.
>

Yes, is a bug, both me and Laszlo spotted it and the 2.2 figure shows it right.
Thanks for finding it.

>> +
>> +2.2 PCIe only hierarchy
>> +=======================
>> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream
>> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches
>> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports.
>
> This seems to contradict 2.1,

Yes, please forgive the bug, it will not appear in v2

  but I agree more with this statement to
> only start a PCIe sub-hierarchy with a root port, not an upstream port
> connected to the root complex.  The 2nd sentence is confusing, I don't
> know if you're referring to fan-out via PCIe switch downstream of a
> root port or again suggesting to use upstream switch ports directly on
> the root complex.
>

The PCIe hierarchy always starts with PCI Express Root Ports, the switch
is to be plugged in the PCi Express ports. I will try to re-phrase to be more
clear.


>> +
>> +
>> +   pcie.0 bus
>> +   ----------------------------------------------------
>> +        |                |               |
>> +   -------------   -------------   -------------
>> +   | Root Port |   | Root Port |   | Root Port |
>> +   ------------   --------------   -------------
>> +         |                               |
>> +    ------------                 -----------------
>> +    | PCIe Dev |                 | Upstream Port |
>> +    ------------                 -----------------
>> +                                  |            |
>> +                     -------------------    -------------------
>> +                     | Downstream Port |    | Downstream Port |
>> +                     -------------------    -------------------
>> +                             |
>> +                         ------------
>> +                         | PCIe Dev |
>> +                         ------------
>> +
>> +2.3 PCI only hierarchy
>> +======================
>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>> +only into pcie.0 bus.
>> +
>> +   pcie.0 bus
>> +   ----------------------------------------------
>> +        |                            |
>> +   -----------               ------------------
>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>> +   ----------                ------------------
>> +                               |            |
>> +                        -----------    ------------------
>> +                        | PCI Dev |    | PCI-PCI Bridge |
>> +                        -----------    ------------------
>> +                                         |           |
>> +                                  -----------     -----------
>> +                                  | PCI Dev |     | PCI Dev |
>> +                                  -----------     -----------
>> +
>
> I really wish we had generic PCIe-to-PCI bridges rather than this DMI
> bridge thing...
>

Thank you, that's a very good idea and I intend to implement it.

>> +
>> +
>> +3. IO space issues
>> +===================
>> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and
>
> Yeah, I've lost the meaning of Ports here, this statement is true for
> upstream ports as well.
>

Laslzo asked me to enumerate all the controllers instead of "PCIe",
I am starting to see why...

>> +as required by PCI spec will reserve a 4K IO range for each.
>> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
>> +it by allocation the IO space only if there is at least a device
>> +with IO BARs plugged into the bridge.
>> +Behind a PCIe PORT only one device may be plugged, resulting in
>
> Here I think you're trying to specify root/downstream ports, but
> upstream ports have the same i/o port allocation problems and do not
> have this one device limitation.
>

I'll be more specific, sure.

>> +the allocation of a whole 4K range for each device.
>> +The IO space is limited resulting in ~10 PCIe ports per system
>> +if devices with IO BARs are plugged into IO ports.
>> +
>> +Using the proposed device placing strategy solves this issue
>> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires
>> +PCIe devices to work without IO BARs.
>> +The PCI hierarchy has no such limitations.
>
> Actually it does, but it's mostly not an issue since we have 32 slots
> available (minus QEMU/libvirt excluding 1 for no good reason)
> downstream of each bridge.
>

This is what I meant, I'll make it more clear.

>> +
>> +
>> +4. Hot Plug
>> +============
>> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,
>> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.
>> +
>> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
>> +in QEMU preventing it to work, but it would be solved soon).
>
> Probably want to give some sort of date/commit references to these
> current state of affairs facts, a reader is not likely to lookup the
> git commit for this verbiage and extrapolate it to a QEMU version.
>

I'll delete it (as Laslo proposed) since is not a "status" doc.
It will be taken care of eventually.

>> +The PCI hotplug is ACPI based and can work side by side with the PCIe
>> +native hotplug.
>> +
>> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
>> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.
>
> Why?  This seems like a QEMU bug.  Clearly we need the downstream ports
> in place when the upstream switch is hot-added, but this should be
> feasible.
>

I don't get understand the question. I do think switches can be hot-plugged,
but I am not sure if QEMU allows it. If not, this is something we should solve.

>> +Keep in mind you always need to have at least one PCIe Port available
>> +for hotplug, the PCIe Ports themselves are not hot-pluggable.
>
> If a user cares about hotplug...
>

...he should reserve enough empty PCI Express Root Ports/ PCI Express Downstrewm ports.
Laszlo had some numbers and ideas on how this user can plan in advance for hotplug,
maybe we should bring them together :)

>> +
>> +
>> +5. Device assignment
>> +====================
>> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
>> +PCI-PCI bridge slots can be used for legacy PCI host devices.
>
> I don't think we have any evidence to suggest this as a best practice.
> We have a lot of experience placing PCIe host devices into a
> conventional PCI topology on 440FX.  We don't have nearly as much
> experience placing them into downstream PCIe ports.  This seems like
> how we would like for things to behave to look like real hardware
> platforms, but it's just navel gazing whether it's actually the right
> thing to do.  Thanks,
>

I had to look up the "navel gazing"...
Why I do agree with your statements I prefer a cleaner PCI Express machine
with as little legacy PCI as possible. I use this document as an opportunity
to start gaining experience with device assignment into PCI Express Root Ports
and Downstream Ports and solve the issues long the way.


Your review really helped, thanks!
Marcel

> Alex
>
>> +
>> +
>> +6. Virtio devices
>> +=================
>> +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices
>> +will remain PCI and have transitional behaviour as default.
>> +Virtio devices plugged into PCIe ports are Express devices and have
>> +"1.0" behavior by default without IO support.
>> +In both case disable-* properties can be used to override the behaviour.
>> +
>> +
>> +7. Conclusion
>> +==============
>> +The proposal offers a usage model that is easy to understand and follow
>> +and in the same time overcomes some PCIe limitations.
>> +
>> +
>> +
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 18:14   ` Marcel Apfelbaum
@ 2016-09-06 18:32     ` Alex Williamson
  2016-09-06 18:59       ` Marcel Apfelbaum
  2016-09-07  7:44       ` Laszlo Ersek
  0 siblings, 2 replies; 52+ messages in thread
From: Alex Williamson @ 2016-09-06 18:32 UTC (permalink / raw)
  To: Marcel Apfelbaum; +Cc: qemu-devel, lersek, mst

On Tue, 6 Sep 2016 21:14:11 +0300
Marcel Apfelbaum <marcel@redhat.com> wrote:

> On 09/06/2016 06:38 PM, Alex Williamson wrote:
> > On Thu,  1 Sep 2016 16:22:07 +0300
> > Marcel Apfelbaum <marcel@redhat.com> wrote:
> >  
> >> Proposes best practices on how to use PCIe/PCI device
> >> in PCIe based machines and explain the reasoning behind them.
> >>
> >> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
> >> ---
> >>
> >> Hi,
> >>
> >> Please add your comments on what to add/remove/edit to make this doc usable.
> >>
> >> Thanks,
> >> Marcel
> >>
> >>  docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 145 insertions(+)
> >>  create mode 100644 docs/pcie.txt
> >>
> >> diff --git a/docs/pcie.txt b/docs/pcie.txt
> >> new file mode 100644
> >> index 0000000..52a8830
> >> --- /dev/null
> >> +++ b/docs/pcie.txt
> >> @@ -0,0 +1,145 @@
> >> +PCI EXPRESS GUIDELINES
> >> +======================
> >> +
> >> +1. Introduction
> >> +================
> >> +The doc proposes best practices on how to use PCIe/PCI device
> >> +in PCIe based machines and explains the reasoning behind them.
> >> +
> >> +
> >> +2. Device placement strategy
> >> +============================
> >> +QEMU does not have a clear socket-device matching mechanism
> >> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
> >> +Plugging a PCI device into a PCIe device might not always work and
> >> +is weird anyway since it cannot be done for "bare metal".
> >> +Plugging a PCIe device into a PCI slot will hide the Extended
> >> +Configuration Space thus is also not recommended.
> >> +
> >> +The recommendation is to separate the PCIe and PCI hierarchies.
> >> +PCIe devices should be plugged only into PCIe Root Ports and
> >> +PCIe Downstream ports (let's call them PCIe ports).
> >> +
> >> +2.1 Root Bus (pcie.0)
> >> +=====================
> >> +Plug only legacy PCI devices as Root Complex Integrated Devices
> >> +even if the PCIe spec does not forbid PCIe devices. The existing  
> >  
> 
> Hi Alex,
> Thanks for the review.
> 
> 
> > Surely we can have PCIe device on the root complex??
> >  
> 
> Yes, we can, is not forbidden. Even so, my understanding is
> the main use for Integrated Devices is for legacy devices
> like sound cards or nics that come with the motherboard.
> Because of that my concern is we might be missing some support
> for that in QEMU or even in linux kernel.
> 
> One example I got from Jason about an issue with Integrated Points in kernel:
> 
> commit d14053b3c714178525f22660e6aaf41263d00056
> Author: David Woodhouse <David.Woodhouse@intel.com>
> Date:   Thu Oct 15 09:28:06 2015 +0100
> 
>      iommu/vt-d: Fix ATSR handling for Root-Complex integrated endpoints
> 
>      The VT-d specification says that "Software must enable ATS on endpoint
>      devices behind a Root Port only if the Root Port is reported as
>      supporting ATS transactions."
>    ....
> 
> We can say is a bug and is solved, what's the problem?
> But my point it, why do it in the first place?
> We are the hardware "vendors" and we can decide not to add PCIe
> devices as Integrated Devices.
> 
> 
> 
> >> +hardware uses mostly PCI devices as Integrated Endpoints. In this
> >> +way we may avoid some strange Guest OS-es behaviour.
> >> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports)
> >> +or DMI-PCI bridges to start legacy PCI hierarchies.
> >> +
> >> +
> >> +   pcie.0 bus
> >> +   --------------------------------------------------------------------------
> >> +        |                |                    |                   |
> >> +   -----------   ------------------   ------------------  ------------------
> >> +   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |  | DMI-PCI bridge |
> >> +   -----------   ------------------   ------------------  ------------------  
> >
> > Do you have a spec reference for plugging an upstream port directly
> > into the root complex?  IMHO this is invalid, an upstream port can only
> > be attached behind a downstream port, ie. a root port or downstream
> > switch port.
> >  
> 
> Yes, is a bug, both me and Laszlo spotted it and the 2.2 figure shows it right.
> Thanks for finding it.
> 
> >> +
> >> +2.2 PCIe only hierarchy
> >> +=======================
> >> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream
> >> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches
> >> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports.  
> >
> > This seems to contradict 2.1,  
> 
> Yes, please forgive the bug, it will not appear in v2
> 
>   but I agree more with this statement to
> > only start a PCIe sub-hierarchy with a root port, not an upstream port
> > connected to the root complex.  The 2nd sentence is confusing, I don't
> > know if you're referring to fan-out via PCIe switch downstream of a
> > root port or again suggesting to use upstream switch ports directly on
> > the root complex.
> >  
> 
> The PCIe hierarchy always starts with PCI Express Root Ports, the switch
> is to be plugged in the PCi Express ports. I will try to re-phrase to be more
> clear.
> 
> 
> >> +
> >> +
> >> +   pcie.0 bus
> >> +   ----------------------------------------------------
> >> +        |                |               |
> >> +   -------------   -------------   -------------
> >> +   | Root Port |   | Root Port |   | Root Port |
> >> +   ------------   --------------   -------------
> >> +         |                               |
> >> +    ------------                 -----------------
> >> +    | PCIe Dev |                 | Upstream Port |
> >> +    ------------                 -----------------
> >> +                                  |            |
> >> +                     -------------------    -------------------
> >> +                     | Downstream Port |    | Downstream Port |
> >> +                     -------------------    -------------------
> >> +                             |
> >> +                         ------------
> >> +                         | PCIe Dev |
> >> +                         ------------
> >> +
> >> +2.3 PCI only hierarchy
> >> +======================
> >> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
> >> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
> >> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
> >> +only into pcie.0 bus.
> >> +
> >> +   pcie.0 bus
> >> +   ----------------------------------------------
> >> +        |                            |
> >> +   -----------               ------------------
> >> +   | PCI Dev |               | DMI-PCI BRIDGE |
> >> +   ----------                ------------------
> >> +                               |            |
> >> +                        -----------    ------------------
> >> +                        | PCI Dev |    | PCI-PCI Bridge |
> >> +                        -----------    ------------------
> >> +                                         |           |
> >> +                                  -----------     -----------
> >> +                                  | PCI Dev |     | PCI Dev |
> >> +                                  -----------     -----------
> >> +  
> >
> > I really wish we had generic PCIe-to-PCI bridges rather than this DMI
> > bridge thing...
> >  
> 
> Thank you, that's a very good idea and I intend to implement it.
> 
> >> +
> >> +
> >> +3. IO space issues
> >> +===================
> >> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and  
> >
> > Yeah, I've lost the meaning of Ports here, this statement is true for
> > upstream ports as well.
> >  
> 
> Laslzo asked me to enumerate all the controllers instead of "PCIe",
> I am starting to see why...
> 
> >> +as required by PCI spec will reserve a 4K IO range for each.
> >> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
> >> +it by allocation the IO space only if there is at least a device
> >> +with IO BARs plugged into the bridge.
> >> +Behind a PCIe PORT only one device may be plugged, resulting in  
> >
> > Here I think you're trying to specify root/downstream ports, but
> > upstream ports have the same i/o port allocation problems and do not
> > have this one device limitation.
> >  
> 
> I'll be more specific, sure.
> 
> >> +the allocation of a whole 4K range for each device.
> >> +The IO space is limited resulting in ~10 PCIe ports per system
> >> +if devices with IO BARs are plugged into IO ports.
> >> +
> >> +Using the proposed device placing strategy solves this issue
> >> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires
> >> +PCIe devices to work without IO BARs.
> >> +The PCI hierarchy has no such limitations.  
> >
> > Actually it does, but it's mostly not an issue since we have 32 slots
> > available (minus QEMU/libvirt excluding 1 for no good reason)
> > downstream of each bridge.
> >  
> 
> This is what I meant, I'll make it more clear.
> 
> >> +
> >> +
> >> +4. Hot Plug
> >> +============
> >> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,
> >> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.
> >> +
> >> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
> >> +in QEMU preventing it to work, but it would be solved soon).  
> >
> > Probably want to give some sort of date/commit references to these
> > current state of affairs facts, a reader is not likely to lookup the
> > git commit for this verbiage and extrapolate it to a QEMU version.
> >  
> 
> I'll delete it (as Laslo proposed) since is not a "status" doc.
> It will be taken care of eventually.
> 
> >> +The PCI hotplug is ACPI based and can work side by side with the PCIe
> >> +native hotplug.
> >> +
> >> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
> >> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.  
> >
> > Why?  This seems like a QEMU bug.  Clearly we need the downstream ports
> > in place when the upstream switch is hot-added, but this should be
> > feasible.
> >  
> 
> I don't get understand the question. I do think switches can be hot-plugged,
> but I am not sure if QEMU allows it. If not, this is something we should solve.

Sorry, I read to quickly and inserted a <not> in there, I thought the
statement was that switches are not hot-pluggable.  I think the issue
will be the ordering of hot-adding the downstream switch ports prior to
the upstream switch port since or we're going to need to invent a
switch with hot-pluggable downstream ports.  I expect the guest is only
going to scan for downstream ports once after the upstream port is
discovered.
 
> >> +Keep in mind you always need to have at least one PCIe Port available
> >> +for hotplug, the PCIe Ports themselves are not hot-pluggable.  
> >
> > If a user cares about hotplug...
> >  
> 
> ...he should reserve enough empty PCI Express Root Ports/ PCI Express Downstrewm ports.
> Laszlo had some numbers and ideas on how this user can plan in advance for hotplug,
> maybe we should bring them together :)
> 
> >> +
> >> +
> >> +5. Device assignment
> >> +====================
> >> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
> >> +PCI-PCI bridge slots can be used for legacy PCI host devices.  
> >
> > I don't think we have any evidence to suggest this as a best practice.
> > We have a lot of experience placing PCIe host devices into a
> > conventional PCI topology on 440FX.  We don't have nearly as much
> > experience placing them into downstream PCIe ports.  This seems like
> > how we would like for things to behave to look like real hardware
> > platforms, but it's just navel gazing whether it's actually the right
> > thing to do.  Thanks,
> >  
> 
> I had to look up the "navel gazing"...
> Why I do agree with your statements I prefer a cleaner PCI Express machine
> with as little legacy PCI as possible. I use this document as an opportunity
> to start gaining experience with device assignment into PCI Express Root Ports
> and Downstream Ports and solve the issues long the way.

That's exactly what I mean, there's an ulterior, personal motivation in
this suggestion that's not really backed by facts.  You'd like to make
the recommendation to place PCIe assigned devices into PCIe slots, but
that's not necessarily the configuration with the best track record
right now.  In fact there's really no advantage to a user to do this
unless they have a device that needs PCIe (radeon and tg3
potentially come to mind here).  So while I agree with you from an
ideological standpoint, I don't think that's sufficient to make the
recommendation you're proposing here.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 18:32     ` Alex Williamson
@ 2016-09-06 18:59       ` Marcel Apfelbaum
  2016-09-07  7:44       ` Laszlo Ersek
  1 sibling, 0 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-06 18:59 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, lersek, mst

On 09/06/2016 09:32 PM, Alex Williamson wrote:
> On Tue, 6 Sep 2016 21:14:11 +0300
> Marcel Apfelbaum <marcel@redhat.com> wrote:
>
>> On 09/06/2016 06:38 PM, Alex Williamson wrote:
>>> On Thu,  1 Sep 2016 16:22:07 +0300
>>> Marcel Apfelbaum <marcel@redhat.com> wrote:
>>>
>>>> Proposes best practices on how to use PCIe/PCI device
>>>> in PCIe based machines and explain the reasoning behind them.
>>>>
>>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>>>> ---
>>>>
>>>> Hi,
>>>>
>>>> Please add your comments on what to add/remove/edit to make this doc usable.
>>>>
>>>> Thanks,
>>>> Marcel
>>>>

[...]

>>
>>>> +The PCI hotplug is ACPI based and can work side by side with the PCIe
>>>> +native hotplug.
>>>> +
>>>> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
>>>> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.
>>>
>>> Why?  This seems like a QEMU bug.  Clearly we need the downstream ports
>>> in place when the upstream switch is hot-added, but this should be
>>> feasible.
>>>
>>
>> I don't get understand the question. I do think switches can be hot-plugged,
>> but I am not sure if QEMU allows it. If not, this is something we should solve.
>
> Sorry, I read to quickly and inserted a <not> in there, I thought the
> statement was that switches are not hot-pluggable.  I think the issue
> will be the ordering of hot-adding the downstream switch ports prior to
> the upstream switch port since or we're going to need to invent a
> switch with hot-pluggable downstream ports.  I expect the guest is only
> going to scan for downstream ports once after the upstream port is
> discovered.
>

The problem I see is that I need to specify a bus to
plug the Downstream Port, but this is the id of the upstream port
I haven't added yet. I need to think a little bit more on how to do it,
or I am missing something.

[...]

>>>> +5. Device assignment
>>>> +====================
>>>> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
>>>> +PCI-PCI bridge slots can be used for legacy PCI host devices.
>>>
>>> I don't think we have any evidence to suggest this as a best practice.
>>> We have a lot of experience placing PCIe host devices into a
>>> conventional PCI topology on 440FX.  We don't have nearly as much
>>> experience placing them into downstream PCIe ports.  This seems like
>>> how we would like for things to behave to look like real hardware
>>> platforms, but it's just navel gazing whether it's actually the right
>>> thing to do.  Thanks,
>>>
>>
>> I had to look up the "navel gazing"...
>> Why I do agree with your statements I prefer a cleaner PCI Express machine
>> with as little legacy PCI as possible. I use this document as an opportunity
>> to start gaining experience with device assignment into PCI Express Root Ports
>> and Downstream Ports and solve the issues long the way.
>
> That's exactly what I mean, there's an ulterior, personal motivation in
> this suggestion that's not really backed by facts.

Ulterior yes, personal no. Several developers of both ARM and x86 PCI Express
machines see the new machines as an opportunity to get rid of legacy and keep
them as modern as possible. Funny thing, I *personally* prefer to see Q35
as a replacement for PC machines, no need to keep and support them both.


   You'd like to make
> the recommendation to place PCIe assigned devices into PCIe slots, but
> that's not necessarily the configuration with the best track record
> right now.

Since we haven't use Q35 at all until now (a speculation, but probably true)
the track record is kind of clean...

  In fact there's really no advantage to a user to do this
> unless they have a device that needs PCIe (radeon and tg3
> potentially come to mind here).

The advantage is  to avoid making the PCI Express "purists" (where are you now??)
to start a legacy PCI hierarchy to plug a modern device into
a modern PCI Express machine.

Another advantage is to avoid tainting the ACPI tables with ACPI hotplug
support for the PCI-bridge devices and stuff like that.

I agree is safer to plug assigned devices into PCI slots from
an "enterprise" point of view, but this is upstream, right ? :)
We look at the future... (and we don't have known issues yet anyway)

  So while I agree with you from an
> ideological standpoint, I don't think that's sufficient to make the
> recommendation you're proposing here.  Thanks,
>

I'll find a way to rephrase it, maybe:

Host devices are mostly PCIe and they can be plugged into PCI Express Root Ports/Downstream Ports,
however we have no experience doing that. As a fall-back the PCI hierarchy can be used to plug
an assigned device into a PCI slot.

Thanks,
Marcel


 >>>> +PCI-PCI bridge slots can be used for legacy PCI host devices.

> Alex
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 13:31     ` Laszlo Ersek
  2016-09-06 14:46       ` Marcel Apfelbaum
@ 2016-09-07  6:21       ` Gerd Hoffmann
  2016-09-07  8:06         ` Laszlo Ersek
  2016-09-07  8:06         ` Marcel Apfelbaum
  1 sibling, 2 replies; 52+ messages in thread
From: Gerd Hoffmann @ 2016-09-07  6:21 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Marcel Apfelbaum, qemu-devel, mst, Peter Maydell, Drew Jones,
	Laine Stump, Andrea Bolognani, Alex Williamson

  Hi,

> >> ports, if that's allowed). For example:
> >>
> >> -  1-32 ports needed: use root ports only
> >>
> >> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
> >> downstream ports

I expect you rarely need any switches.  You can go multifunction with
the pcie root ports.  Which is how physical q35 works too btw, typically
the root ports are on slot 1c for intel chipsets:

nilsson root ~# lspci -s1c
00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
Family PCI Express Root Port 1 (rev c4)
00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
Family PCI Express Root Port 2 (rev c4)
00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
Family PCI Express Root Port 3 (rev c4)

Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @
1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots.  With
8 functions each you can have up to 224 root ports without any switches,
and you have not many pci bus numbers left until you hit the 256 busses
limit ...

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 13:58     ` Laine Stump
@ 2016-09-07  7:04       ` Gerd Hoffmann
  2016-09-07 18:20         ` Laine Stump
  0 siblings, 1 reply; 52+ messages in thread
From: Gerd Hoffmann @ 2016-09-07  7:04 UTC (permalink / raw)
  To: Laine Stump
  Cc: Laszlo Ersek, Marcel Apfelbaum, qemu-devel, mst, Peter Maydell,
	Drew Jones, Andrea Bolognani, Alex Williamson

  Hi,

> > Side note for usb: In practice you don't want to use the tons of
> > uhci/ehci controllers present in the original q35 but plug xhci into one
> > of the pcie root ports instead (unless your guest doesn't support xhci).
> 
> I've wondered about that recently. For i440fx machinetypes if you don't 
> specify a USB controller in libvirt's domain config, you will 
> automatically get the PIIX3 USB controller added. In order to maintain 
> consistency on the topic of "auto-adding USB when not specified", if the 
> machinetype is Q35 we will autoadd a set of USB2 (uhci/ehci) controllers 
> (I think I added that based on your comments at the time :-). But 
> recently I've mostly been hearing that people should use xhci instead. 
> So should libvirt add a single xhci (rather than the uhci/ehci set) at 
> the same port when no USB is specified?

Big advantage of xhci is that the hardware design is much more
virtualization friendly, i.e. it needs alot less cpu cycles to emulate
than uhci/ohci/ehci.  Also xhci can handle all usb speeds, so you don't
need the complicated uhci/ehci companion setup with uhci for usb1 and
ehci for usb2 devices.

The problem with xhci is guest support.  Which becomes less and less of
a problem over time of course.  All our firmware (seabios/edk2/slof) has
xhci support meanwhile.  ppc64 switched from ohci to xhci by default in
rhel-7.3.  Finding linux guests without xhci support is pretty hard
meanwhile.  Maybe RHEL-5 qualifies.  Windows 8 + newer ships with xhci
drivers.

So, yea, maybe it's time to switch the default for q35 to xhci,
especially if we keep uhci as default for i440fx and suggest to use that
machine type for oldish guests.  But I'd suggest to place xhci in a pcie
root port then, so maybe wait with that until libvirt can auto-add pcie
root ports as needed ...

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 18:32     ` Alex Williamson
  2016-09-06 18:59       ` Marcel Apfelbaum
@ 2016-09-07  7:44       ` Laszlo Ersek
  1 sibling, 0 replies; 52+ messages in thread
From: Laszlo Ersek @ 2016-09-07  7:44 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Marcel Apfelbaum, qemu-devel, mst

On 09/06/16 20:32, Alex Williamson wrote:
> On Tue, 6 Sep 2016 21:14:11 +0300
> Marcel Apfelbaum <marcel@redhat.com> wrote:
> 
>> On 09/06/2016 06:38 PM, Alex Williamson wrote:
>>> On Thu,  1 Sep 2016 16:22:07 +0300
>>> Marcel Apfelbaum <marcel@redhat.com> wrote:

>>>> +5. Device assignment
>>>> +====================
>>>> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
>>>> +PCI-PCI bridge slots can be used for legacy PCI host devices.  
>>>
>>> I don't think we have any evidence to suggest this as a best practice.
>>> We have a lot of experience placing PCIe host devices into a
>>> conventional PCI topology on 440FX.  We don't have nearly as much
>>> experience placing them into downstream PCIe ports.  This seems like
>>> how we would like for things to behave to look like real hardware
>>> platforms, but it's just navel gazing whether it's actually the right
>>> thing to do.  Thanks,
>>>  
>>
>> I had to look up the "navel gazing"...
>> Why I do agree with your statements I prefer a cleaner PCI Express machine
>> with as little legacy PCI as possible. I use this document as an opportunity
>> to start gaining experience with device assignment into PCI Express Root Ports
>> and Downstream Ports and solve the issues long the way.
> 
> That's exactly what I mean, there's an ulterior, personal motivation in
> this suggestion that's not really backed by facts.  You'd like to make
> the recommendation to place PCIe assigned devices into PCIe slots, but
> that's not necessarily the configuration with the best track record
> right now.  In fact there's really no advantage to a user to do this
> unless they have a device that needs PCIe (radeon and tg3
> potentially come to mind here).  So while I agree with you from an
> ideological standpoint, I don't think that's sufficient to make the
> recommendation you're proposing here.  Thanks,

To reinforce what Marcel already replied, this document is all about
ideology / policy, and not a status report. We should be looking
forward, not backward. Permitting an exception for plugging a PCI
Express device into a legacy PCI slot just because the PCI Express
device is an assigned, physical one, dilutes the message, and will lead
to all kinds of mess elsewhere.

I'm acutely aware that conforming to the "PCI Express into PCI Express"
recommendation might not *work* in practice, but that doesn't matter
right now. This document should translate to a task list for QEMU and
firmware developers alike. At least I need this document to exist
primarily so I know what to do in OVMF, and what topologies in QE's BZs
to reject out of hand. If the "PCI Express into PCI Express" guideline
will require some VFIO work, and causes Q35 (not i440fx) users some
pain, so be it, IMO.

I'm saying this knowing that you know about ten billion times more about
PCI / PCI Express than I do.

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-06 11:35   ` Gerd Hoffmann
  2016-09-06 13:58     ` Laine Stump
  2016-09-06 14:47     ` Marcel Apfelbaum
@ 2016-09-07  7:53     ` Laszlo Ersek
  2016-09-07  7:57       ` Marcel Apfelbaum
  2 siblings, 1 reply; 52+ messages in thread
From: Laszlo Ersek @ 2016-09-07  7:53 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Marcel Apfelbaum, qemu-devel, mst, Peter Maydell, Drew Jones,
	Laine Stump, Andrea Bolognani, Alex Williamson

On 09/06/16 13:35, Gerd Hoffmann wrote:
>   Hi,
> 
>>> +Plug only legacy PCI devices as Root Complex Integrated Devices
>>> +even if the PCIe spec does not forbid PCIe devices.
>>
>> I suggest "even though the PCI Express spec does not forbid PCI Express
>> devices as Integrated Devices". (Detail is good!)
> 
> While talking about integrated devices:  There is docs/q35-chipset.cfg,
> which documents how to mimic q35 with integrated devices as close and
> complete as possible.
> 
> Usage:
>   qemu-system-x86_64 -M q35 -readconfig docs/q35-chipset.cfg $args
> 
> Side note for usb: In practice you don't want to use the tons of
> uhci/ehci controllers present in the original q35 but plug xhci into one
> of the pcie root ports instead (unless your guest doesn't support xhci).
> 
>>> +as required by PCI spec will reserve a 4K IO range for each.
>>> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
>>> +it by allocation the IO space only if there is at least a device
>>> +with IO BARs plugged into the bridge.
>>
>> This used to be true, but is no longer true, for OVMF. And I think it's
>> actually correct: we *should* keep the 4K IO reservation per PCI-PCI bridge.
>>
>> (But, certainly no IO reservation for PCI Express root port, upstream
>> port, or downstream port! And i'll need your help for telling these
>> apart in OVMF.)
> 
> IIRC the same is true for seabios, it looks for the pcie capability and
> skips io space allocation on pcie ports only.
> 
> Side note: the linux kernel allocates io space nevertheless, so
> checking /proc/ioports after boot doesn't tell you what the firmware
> did.

Yeah, we've got to convince Linux to stop doing that. Earlier Alex
mentioned the "hpiosize" and "hpmemsize" PCI subsystem options for the
kernel:

          hpiosize=nn[KMG]        The fixed amount of bus space which is
                          reserved for hotplug bridge's IO window.
                          Default size is 256 bytes.
          hpmemsize=nn[KMG]       The fixed amount of bus space which is
                          reserved for hotplug bridge's memory window.
                          Default size is 2 megabytes.

This document (once complete) would be the basis for tweaking that stuff
in the kernel too. Primarily, "hpiosize" should default to zero, because
its current nonzero default (which gets rounded up to 4KB somewhere) is
what exhausts the IO space, if we have more than a handful of PCI
Express downstream / root ports.

Maybe we can add a PCI quirk for this to the kernel, for QEMU's PCI
Express ports (all of them -- root, upstream, downstream).

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07  7:53     ` Laszlo Ersek
@ 2016-09-07  7:57       ` Marcel Apfelbaum
  0 siblings, 0 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-07  7:57 UTC (permalink / raw)
  To: Laszlo Ersek, Gerd Hoffmann
  Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump,
	Andrea Bolognani, Alex Williamson

On 09/07/2016 10:53 AM, Laszlo Ersek wrote:
> On 09/06/16 13:35, Gerd Hoffmann wrote:
>>   Hi,
>>

[...]

>>
>> Side note: the linux kernel allocates io space nevertheless, so
>> checking /proc/ioports after boot doesn't tell you what the firmware
>> did.
>
> Yeah, we've got to convince Linux to stop doing that. Earlier Alex
> mentioned the "hpiosize" and "hpmemsize" PCI subsystem options for the
> kernel:
>
>           hpiosize=nn[KMG]        The fixed amount of bus space which is
>                           reserved for hotplug bridge's IO window.
>                           Default size is 256 bytes.
>           hpmemsize=nn[KMG]       The fixed amount of bus space which is
>                           reserved for hotplug bridge's memory window.
>                           Default size is 2 megabytes.
>
> This document (once complete) would be the basis for tweaking that stuff
> in the kernel too. Primarily, "hpiosize" should default to zero, because
> its current nonzero default (which gets rounded up to 4KB somewhere) is
> what exhausts the IO space, if we have more than a handful of PCI
> Express downstream / root ports.
>
> Maybe we can add a PCI quirk for this to the kernel, for QEMU's PCI
> Express ports (all of them -- root, upstream, downstream).
>

Yes, once we will have our "own" controllers and not Intel emulations as today.

Thanks,
Marcel

> Thanks
> Laszlo
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07  6:21       ` Gerd Hoffmann
@ 2016-09-07  8:06         ` Laszlo Ersek
  2016-09-07  8:23           ` Marcel Apfelbaum
  2016-09-07  8:06         ` Marcel Apfelbaum
  1 sibling, 1 reply; 52+ messages in thread
From: Laszlo Ersek @ 2016-09-07  8:06 UTC (permalink / raw)
  To: Gerd Hoffmann, Marcel Apfelbaum
  Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump,
	Andrea Bolognani, Alex Williamson

On 09/07/16 08:21, Gerd Hoffmann wrote:
>   Hi,
> 
>>>> ports, if that's allowed). For example:
>>>>
>>>> -  1-32 ports needed: use root ports only
>>>>
>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
>>>> downstream ports
> 
> I expect you rarely need any switches.  You can go multifunction with
> the pcie root ports.  Which is how physical q35 works too btw, typically
> the root ports are on slot 1c for intel chipsets:
> 
> nilsson root ~# lspci -s1c
> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
> Family PCI Express Root Port 1 (rev c4)
> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
> Family PCI Express Root Port 2 (rev c4)
> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
> Family PCI Express Root Port 3 (rev c4)
> 
> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @
> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots.  With
> 8 functions each you can have up to 224 root ports without any switches,
> and you have not many pci bus numbers left until you hit the 256 busses
> limit ...

This is an absolutely great idea. I wonder if it allows us to rip out
all the language about switches, upstream ports and downstream ports. It
would be awesome if we didn't have to mention and draw those things *at
all* (better: if we could summarily discourage their use).

Marcel, what do you think?

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07  6:21       ` Gerd Hoffmann
  2016-09-07  8:06         ` Laszlo Ersek
@ 2016-09-07  8:06         ` Marcel Apfelbaum
  2016-09-07 16:08           ` Alex Williamson
  2016-09-07 17:55           ` Laine Stump
  1 sibling, 2 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-07  8:06 UTC (permalink / raw)
  To: Gerd Hoffmann, Laszlo Ersek
  Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump,
	Andrea Bolognani, Alex Williamson

On 09/07/2016 09:21 AM, Gerd Hoffmann wrote:
>   Hi,
>
>>>> ports, if that's allowed). For example:
>>>>
>>>> -  1-32 ports needed: use root ports only
>>>>
>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
>>>> downstream ports
>
> I expect you rarely need any switches.  You can go multifunction with
> the pcie root ports.  Which is how physical q35 works too btw, typically
> the root ports are on slot 1c for intel chipsets:
>
> nilsson root ~# lspci -s1c
> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
> Family PCI Express Root Port 1 (rev c4)
> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
> Family PCI Express Root Port 2 (rev c4)
> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
> Family PCI Express Root Port 3 (rev c4)
>
> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @
> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots.  With
> 8 functions each you can have up to 224 root ports without any switches,
> and you have not many pci bus numbers left until you hit the 256 busses
> limit ...
>

Good point, maybe libvirt can avoid adding switches unless the user explicitly
asked for them. I checked and it a actually works fine in QEMU.

BTW, when we will implement ARI we can have up to 256 Root Ports on a single slot...

Thanks,
Marcel

> cheers,
>   Gerd
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07  8:06         ` Laszlo Ersek
@ 2016-09-07  8:23           ` Marcel Apfelbaum
  0 siblings, 0 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-07  8:23 UTC (permalink / raw)
  To: Laszlo Ersek, Gerd Hoffmann
  Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump,
	Andrea Bolognani, Alex Williamson

On 09/07/2016 11:06 AM, Laszlo Ersek wrote:
> On 09/07/16 08:21, Gerd Hoffmann wrote:
>>   Hi,
>>
>>>>> ports, if that's allowed). For example:
>>>>>
>>>>> -  1-32 ports needed: use root ports only
>>>>>
>>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
>>>>> downstream ports
>>
>> I expect you rarely need any switches.  You can go multifunction with
>> the pcie root ports.  Which is how physical q35 works too btw, typically
>> the root ports are on slot 1c for intel chipsets:
>>
>> nilsson root ~# lspci -s1c
>> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>> Family PCI Express Root Port 1 (rev c4)
>> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>> Family PCI Express Root Port 2 (rev c4)
>> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>> Family PCI Express Root Port 3 (rev c4)
>>
>> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @
>> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots.  With
>> 8 functions each you can have up to 224 root ports without any switches,
>> and you have not many pci bus numbers left until you hit the 256 busses
>> limit ...
>
> This is an absolutely great idea. I wonder if it allows us to rip out
> all the language about switches, upstream ports and downstream ports. It
> would be awesome if we didn't have to mention and draw those things *at
> all* (better: if we could summarily discourage their use).
>
> Marcel, what do you think?

While I do think using multi-function Root Ports is definitely the preferred
way to go, keeping the switches around is not so bad, even to have
all PCI Express controllers available for testing scenarios.
We can (and will) of course state  we prefer multi-function Root Ports over
switches and ask libvirt/other management software to not add switches
unless are specifically requested by users.

Thanks,
Marcel

>
> Thanks
> Laszlo
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07  8:06         ` Marcel Apfelbaum
@ 2016-09-07 16:08           ` Alex Williamson
  2016-09-07 19:32             ` Marcel Apfelbaum
  2016-09-07 17:55           ` Laine Stump
  1 sibling, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2016-09-07 16:08 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: Gerd Hoffmann, Laszlo Ersek, qemu-devel, mst, Peter Maydell,
	Drew Jones, Laine Stump, Andrea Bolognani

On Wed, 7 Sep 2016 11:06:45 +0300
Marcel Apfelbaum <marcel@redhat.com> wrote:

> On 09/07/2016 09:21 AM, Gerd Hoffmann wrote:
> >   Hi,
> >  
> >>>> ports, if that's allowed). For example:
> >>>>
> >>>> -  1-32 ports needed: use root ports only
> >>>>
> >>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
> >>>> downstream ports  
> >
> > I expect you rarely need any switches.  You can go multifunction with
> > the pcie root ports.  Which is how physical q35 works too btw, typically
> > the root ports are on slot 1c for intel chipsets:
> >
> > nilsson root ~# lspci -s1c
> > 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
> > Family PCI Express Root Port 1 (rev c4)
> > 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
> > Family PCI Express Root Port 2 (rev c4)
> > 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
> > Family PCI Express Root Port 3 (rev c4)
> >
> > Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @
> > 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots.  With
> > 8 functions each you can have up to 224 root ports without any switches,
> > and you have not many pci bus numbers left until you hit the 256 busses
> > limit ...
> >  
> 
> Good point, maybe libvirt can avoid adding switches unless the user explicitly
> asked for them. I checked and it a actually works fine in QEMU.
> 
> BTW, when we will implement ARI we can have up to 256 Root Ports on a single slot...

"Root Ports on a single slot"... The entire idea of ARI is that there
is no slot, the slot/function address space is combined into one big
8-bit free-for-all.  Besides, can you do ARI on the root complex?
Typically you need to look at whether the port upstream of a given
device supports ARI to enable, there's no upstream PCI device on the
root complex.  This is the suggestion I gave you for switches, if the
upstream switch port supports ARI then we can have 256 downstream
switch ports (assuming ARI isn't only specific to downstream ports).
Thanks,

Alex

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07  8:06         ` Marcel Apfelbaum
  2016-09-07 16:08           ` Alex Williamson
@ 2016-09-07 17:55           ` Laine Stump
  2016-09-07 19:39             ` Marcel Apfelbaum
  2016-09-08  7:33             ` Gerd Hoffmann
  1 sibling, 2 replies; 52+ messages in thread
From: Laine Stump @ 2016-09-07 17:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Gerd Hoffmann, Laszlo Ersek, mst,
	Peter Maydell, Drew Jones, Andrea Bolognani, Alex Williamson

On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote:
> On 09/07/2016 09:21 AM, Gerd Hoffmann wrote:
>>   Hi,
>>
>>>>> ports, if that's allowed). For example:
>>>>>
>>>>> -  1-32 ports needed: use root ports only
>>>>>
>>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
>>>>> downstream ports
>>
>> I expect you rarely need any switches.  You can go multifunction with
>> the pcie root ports.  Which is how physical q35 works too btw, typically
>> the root ports are on slot 1c for intel chipsets:
>>
>> nilsson root ~# lspci -s1c
>> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>> Family PCI Express Root Port 1 (rev c4)
>> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>> Family PCI Express Root Port 2 (rev c4)
>> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>> Family PCI Express Root Port 3 (rev c4)
>>
>> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @
>> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots.  With
>> 8 functions each you can have up to 224 root ports without any switches,
>> and you have not many pci bus numbers left until you hit the 256 busses
>> limit ...
>>
>
> Good point, maybe libvirt can avoid adding switches unless the user
> explicitly
> asked for them. I checked and it a actually works fine in QEMU.

I'm just now writing the code that auto-adds *-ports as they are needed, 
and doing it this way simplifies it *immensely*.

When I had to think about the possibility of needing upstream/downstream 
switches, as an endpoint device was added, I would need to check if a 
(root|downstream)-port was available and if not I might be able to just 
add a root-port, or I might have to add a downstream-port; if the only 
option was a downstream port, then *that* might require adding a new 
*upstream* port.

If I can limit libvirt to only auto-adding root-ports (and if there is 
no downside to putting multiple root ports on a single root bus port), 
then I just need to find an empty function of an empty slot on the root 
bus, add a root-port, and I'm done (and since 224 is *a lot*, I think at 
least for now it's okay to punt once they get past that point).

So, *is* there any downside to doing this?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07  7:04       ` Gerd Hoffmann
@ 2016-09-07 18:20         ` Laine Stump
  2016-09-08  7:26           ` Gerd Hoffmann
  0 siblings, 1 reply; 52+ messages in thread
From: Laine Stump @ 2016-09-07 18:20 UTC (permalink / raw)
  To: Gerd Hoffmann, qemu-devel
  Cc: Laszlo Ersek, Marcel Apfelbaum, mst, Peter Maydell, Drew Jones,
	Andrea Bolognani, Alex Williamson

On 09/07/2016 03:04 AM, Gerd Hoffmann wrote:
>   Hi,
>
>>> Side note for usb: In practice you don't want to use the tons of
>>> uhci/ehci controllers present in the original q35 but plug xhci into one
>>> of the pcie root ports instead (unless your guest doesn't support xhci).
>>
>> I've wondered about that recently. For i440fx machinetypes if you don't
>> specify a USB controller in libvirt's domain config, you will
>> automatically get the PIIX3 USB controller added. In order to maintain
>> consistency on the topic of "auto-adding USB when not specified", if the
>> machinetype is Q35 we will autoadd a set of USB2 (uhci/ehci) controllers
>> (I think I added that based on your comments at the time :-). But
>> recently I've mostly been hearing that people should use xhci instead.
>> So should libvirt add a single xhci (rather than the uhci/ehci set) at
>> the same port when no USB is specified?
>
> Big advantage of xhci is that the hardware design is much more
> virtualization friendly, i.e. it needs alot less cpu cycles to emulate
> than uhci/ohci/ehci.  Also xhci can handle all usb speeds, so you don't
> need the complicated uhci/ehci companion setup with uhci for usb1 and
> ehci for usb2 devices.
>
> The problem with xhci is guest support.  Which becomes less and less of
> a problem over time of course.  All our firmware (seabios/edk2/slof) has
> xhci support meanwhile.  ppc64 switched from ohci to xhci by default in
> rhel-7.3.  Finding linux guests without xhci support is pretty hard
> meanwhile.  Maybe RHEL-5 qualifies.  Windows 8 + newer ships with xhci
> drivers.
>
> So, yea, maybe it's time to switch the default for q35 to xhci,
> especially if we keep uhci as default for i440fx and suggest to use that
> machine type for oldish guests.  But I'd suggest to place xhci in a pcie
> root port then, so maybe wait with that until libvirt can auto-add pcie
> root ports as needed ...

I'm doing that right now  (giving libvirt the ability to auto-add root 
ports) :-)

I had understood that the xhci could be a legacy PCI device or a PCI 
Express device depending on the socket it was plugged into (or was that 
possibly just someone doing some hand-waving over the fact that 
obscuring the PCI Express capabilities effectively turns it into a 
legacy PCI device?). If that's the case, why do you prefer the default 
USB controller to be added in a root-port rather than as an integrated 
device (which is what we do with the group of USB2 controllers, as well 
as the primary video device)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07 16:08           ` Alex Williamson
@ 2016-09-07 19:32             ` Marcel Apfelbaum
  0 siblings, 0 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-07 19:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Gerd Hoffmann, Laszlo Ersek, qemu-devel, mst, Peter Maydell,
	Drew Jones, Laine Stump, Andrea Bolognani

On 09/07/2016 07:08 PM, Alex Williamson wrote:
> On Wed, 7 Sep 2016 11:06:45 +0300
> Marcel Apfelbaum <marcel@redhat.com> wrote:
>
>> On 09/07/2016 09:21 AM, Gerd Hoffmann wrote:
>>>   Hi,
>>>
>>>>>> ports, if that's allowed). For example:
>>>>>>
>>>>>> -  1-32 ports needed: use root ports only
>>>>>>
>>>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
>>>>>> downstream ports
>>>
>>> I expect you rarely need any switches.  You can go multifunction with
>>> the pcie root ports.  Which is how physical q35 works too btw, typically
>>> the root ports are on slot 1c for intel chipsets:
>>>
>>> nilsson root ~# lspci -s1c
>>> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>>> Family PCI Express Root Port 1 (rev c4)
>>> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>>> Family PCI Express Root Port 2 (rev c4)
>>> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>>> Family PCI Express Root Port 3 (rev c4)
>>>
>>> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @
>>> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots.  With
>>> 8 functions each you can have up to 224 root ports without any switches,
>>> and you have not many pci bus numbers left until you hit the 256 busses
>>> limit ...
>>>
>>
>> Good point, maybe libvirt can avoid adding switches unless the user explicitly
>> asked for them. I checked and it a actually works fine in QEMU.
>>
>> BTW, when we will implement ARI we can have up to 256 Root Ports on a single slot...
>
> "Root Ports on a single slot"... The entire idea of ARI is that there
> is no slot, the slot/function address space is combined into one big
> 8-bit free-for-all.  Besides, can you do ARI on the root complex?

No, we can't :(
Indeed, for the Root Complex bus we need the (bus:)dev:fn tuple, because we can
have multiple devices plugged in. Thanks for the correction.

Thanks,
Marcel

> Typically you need to look at whether the port upstream of a given
> device supports ARI to enable, there's no upstream PCI device on the
> root complex.  This is the suggestion I gave you for switches, if the
> upstream switch port supports ARI then we can have 256 downstream
> switch ports (assuming ARI isn't only specific to downstream ports).
> Thanks,
>
> Alex
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07 17:55           ` Laine Stump
@ 2016-09-07 19:39             ` Marcel Apfelbaum
  2016-09-07 20:34               ` Laine Stump
  2016-09-15  8:38               ` Andrew Jones
  2016-09-08  7:33             ` Gerd Hoffmann
  1 sibling, 2 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-07 19:39 UTC (permalink / raw)
  To: Laine Stump, qemu-devel
  Cc: Gerd Hoffmann, Laszlo Ersek, mst, Peter Maydell, Drew Jones,
	Andrea Bolognani, Alex Williamson

On 09/07/2016 08:55 PM, Laine Stump wrote:
> On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote:
>> On 09/07/2016 09:21 AM, Gerd Hoffmann wrote:
>>>   Hi,
>>>
>>>>>> ports, if that's allowed). For example:
>>>>>>
>>>>>> -  1-32 ports needed: use root ports only
>>>>>>
>>>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
>>>>>> downstream ports
>>>
>>> I expect you rarely need any switches.  You can go multifunction with
>>> the pcie root ports.  Which is how physical q35 works too btw, typically
>>> the root ports are on slot 1c for intel chipsets:
>>>
>>> nilsson root ~# lspci -s1c
>>> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>>> Family PCI Express Root Port 1 (rev c4)
>>> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>>> Family PCI Express Root Port 2 (rev c4)
>>> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>>> Family PCI Express Root Port 3 (rev c4)
>>>
>>> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @
>>> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots.  With
>>> 8 functions each you can have up to 224 root ports without any switches,
>>> and you have not many pci bus numbers left until you hit the 256 busses
>>> limit ...
>>>
>>
>> Good point, maybe libvirt can avoid adding switches unless the user
>> explicitly
>> asked for them. I checked and it a actually works fine in QEMU.
>
> I'm just now writing the code that auto-adds *-ports as they are needed, and doing it this way simplifies it *immensely*.
>
> When I had to think about the possibility of needing upstream/downstream switches, as an endpoint device was added, I would need to check if a (root|downstream)-port was available and if not I might
> be able to just add a root-port, or I might have to add a downstream-port; if the only option was a downstream port, then *that* might require adding a new *upstream* port.
>
> If I can limit libvirt to only auto-adding root-ports (and if there is no downside to putting multiple root ports on a single root bus port), then I just need to find an empty function of an empty
> slot on the root bus, add a root-port, and I'm done (and since 224 is *a lot*, I think at least for now it's okay to punt once they get past that point).
>
> So, *is* there any downside to doing this?
>

No downside I can think of.
Just be sure to emphasize the auto-add mechanism stops at 'x' devices. If the user needs more,
he should manually add switches and manually assign the devices to the Downstream Ports.

Thanks,
Marcel


>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07 19:39             ` Marcel Apfelbaum
@ 2016-09-07 20:34               ` Laine Stump
  2016-09-15  8:38               ` Andrew Jones
  1 sibling, 0 replies; 52+ messages in thread
From: Laine Stump @ 2016-09-07 20:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Marcel Apfelbaum, Peter Maydell, Drew Jones, mst,
	Andrea Bolognani, Alex Williamson, Gerd Hoffmann, Laszlo Ersek

On 09/07/2016 03:39 PM, Marcel Apfelbaum wrote:
> On 09/07/2016 08:55 PM, Laine Stump wrote:
>> On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote:
>>> On 09/07/2016 09:21 AM, Gerd Hoffmann wrote:
>>>>   Hi,
>>>>
>>>>>>> ports, if that's allowed). For example:
>>>>>>>
>>>>>>> -  1-32 ports needed: use root ports only
>>>>>>>
>>>>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
>>>>>>> downstream ports
>>>>
>>>> I expect you rarely need any switches.  You can go multifunction with
>>>> the pcie root ports.  Which is how physical q35 works too btw,
>>>> typically
>>>> the root ports are on slot 1c for intel chipsets:
>>>>
>>>> nilsson root ~# lspci -s1c
>>>> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>>>> Family PCI Express Root Port 1 (rev c4)
>>>> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>>>> Family PCI Express Root Port 2 (rev c4)
>>>> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset
>>>> Family PCI Express Root Port 3 (rev c4)
>>>>
>>>> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @
>>>> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots.
>>>> With
>>>> 8 functions each you can have up to 224 root ports without any
>>>> switches,
>>>> and you have not many pci bus numbers left until you hit the 256 busses
>>>> limit ...
>>>>
>>>
>>> Good point, maybe libvirt can avoid adding switches unless the user
>>> explicitly
>>> asked for them. I checked and it a actually works fine in QEMU.
>>
>> I'm just now writing the code that auto-adds *-ports as they are
>> needed, and doing it this way simplifies it *immensely*.
>>
>> When I had to think about the possibility of needing
>> upstream/downstream switches, as an endpoint device was added, I would
>> need to check if a (root|downstream)-port was available and if not I
>> might
>> be able to just add a root-port, or I might have to add a
>> downstream-port; if the only option was a downstream port, then *that*
>> might require adding a new *upstream* port.
>>
>> If I can limit libvirt to only auto-adding root-ports (and if there is
>> no downside to putting multiple root ports on a single root bus port),
>> then I just need to find an empty function of an empty
>> slot on the root bus, add a root-port, and I'm done (and since 224 is
>> *a lot*, I think at least for now it's okay to punt once they get past
>> that point).
>>
>> So, *is* there any downside to doing this?
>>
>
> No downside I can think of.
> Just be sure to emphasize the auto-add mechanism stops at 'x' devices.
> If the user needs more,
> he should manually add switches and manually assign the devices to the
> Downstream Ports.

Actually, just the former - once the downstream ports are added, they'll 
automatically be used for endpoint devices (and even new upstream ports) 
as needed.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07 18:20         ` Laine Stump
@ 2016-09-08  7:26           ` Gerd Hoffmann
  0 siblings, 0 replies; 52+ messages in thread
From: Gerd Hoffmann @ 2016-09-08  7:26 UTC (permalink / raw)
  To: Laine Stump
  Cc: qemu-devel, Laszlo Ersek, Marcel Apfelbaum, mst, Peter Maydell,
	Drew Jones, Andrea Bolognani, Alex Williamson

  Hi,

> I had understood that the xhci could be a legacy PCI device or a PCI 
> Express device depending on the socket it was plugged into (or was that 
> possibly just someone doing some hand-waving over the fact that 
> obscuring the PCI Express capabilities effectively turns it into a 
> legacy PCI device?).

That is correct, it'll work both ways.

> If that's the case, why do you prefer the default 
> USB controller to be added in a root-port rather than as an integrated 
> device (which is what we do with the group of USB2 controllers, as well 
> as the primary video device)

Trying to mimic real hardware as close as possible.  The ich9 uhci/ehci
controllers are actually integrated chipset devices.  The nec xhci is a
express device in physical hardware.

That is more a personal preference though, there are no strong technical
reasons to do it that way.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07 17:55           ` Laine Stump
  2016-09-07 19:39             ` Marcel Apfelbaum
@ 2016-09-08  7:33             ` Gerd Hoffmann
  1 sibling, 0 replies; 52+ messages in thread
From: Gerd Hoffmann @ 2016-09-08  7:33 UTC (permalink / raw)
  To: Laine Stump
  Cc: qemu-devel, Marcel Apfelbaum, Laszlo Ersek, mst, Peter Maydell,
	Drew Jones, Andrea Bolognani, Alex Williamson

  Hi,

> > Good point, maybe libvirt can avoid adding switches unless the user
> > explicitly
> > asked for them. I checked and it a actually works fine in QEMU.

> So, *is* there any downside to doing this?

I don't think so.

The only issue I can think of when it comes to multifunction is hotplug,
because hotplug works at slot level in pci so you can't hotplug single
functions.

But as you can't hotplug the root ports in the first place this is
nothing we have to worry about in this specific case.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-07 19:39             ` Marcel Apfelbaum
  2016-09-07 20:34               ` Laine Stump
@ 2016-09-15  8:38               ` Andrew Jones
  2016-09-15 14:20                 ` Marcel Apfelbaum
  1 sibling, 1 reply; 52+ messages in thread
From: Andrew Jones @ 2016-09-15  8:38 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: Laine Stump, qemu-devel, Peter Maydell, mst, Andrea Bolognani,
	Alex Williamson, Gerd Hoffmann, Laszlo Ersek

On Wed, Sep 07, 2016 at 10:39:28PM +0300, Marcel Apfelbaum wrote:
> On 09/07/2016 08:55 PM, Laine Stump wrote:
> > On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote:
[snip]
> > > Good point, maybe libvirt can avoid adding switches unless the user
> > > explicitly
> > > asked for them. I checked and it a actually works fine in QEMU.
> > 
> > I'm just now writing the code that auto-adds *-ports as they are needed, and doing it this way simplifies it *immensely*.
> > 
> > When I had to think about the possibility of needing upstream/downstream switches, as an endpoint device was added, I would need to check if a (root|downstream)-port was available and if not I might
> > be able to just add a root-port, or I might have to add a downstream-port; if the only option was a downstream port, then *that* might require adding a new *upstream* port.
> > 
> > If I can limit libvirt to only auto-adding root-ports (and if there is no downside to putting multiple root ports on a single root bus port), then I just need to find an empty function of an empty
> > slot on the root bus, add a root-port, and I'm done (and since 224 is *a lot*, I think at least for now it's okay to punt once they get past that point).
> > 
> > So, *is* there any downside to doing this?
> > 
> 
> No downside I can think of.
> Just be sure to emphasize the auto-add mechanism stops at 'x' devices. If the user needs more,
> he should manually add switches and manually assign the devices to the Downstream Ports.
> 

Just catching up on mail after vacation and read this thread. Thanks
Marcel for writing this document (I guess a v1 is coming soon). This
will be very useful for determining the best default configuration of
a virtio-pci mach-virt.

FWIW, here is the proposal that I started formulating when I experimented
with this several months ago;

 - PCIe-only (disable-modern=off, disable-legacy=on)
 - No legacy PCI support, i.e. no bridges (yup, I'm a PCIe purist,
   but don't have a leg to stand on if push came to shove)
 - use one or more ports for virtio-scsi controllers for disks, one is
   probably enough
 - use one or more ports with multifunction, allowing up to 8 functions,
   for virtio-net, one port is probably enough
 - Add N extra ports for hotplug, N defaulting to 2
   - hotplug devices to first N-1 ports, reserving last for a switch
   - if switch is needed, hotplug it with M downstream ports
     (M defaulting to 2*(N-1)+1)
 - Encourage somebody to develop generic versions of ports and switches,
   hi Marcel :-), and exclusively use those in the configuration

Thanks,
drew

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-15  8:38               ` Andrew Jones
@ 2016-09-15 14:20                 ` Marcel Apfelbaum
  2016-09-16 16:50                   ` Andrea Bolognani
  0 siblings, 1 reply; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-09-15 14:20 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Laine Stump, qemu-devel, Peter Maydell, mst, Andrea Bolognani,
	Alex Williamson, Gerd Hoffmann, Laszlo Ersek

On 09/15/2016 11:38 AM, Andrew Jones wrote:
> On Wed, Sep 07, 2016 at 10:39:28PM +0300, Marcel Apfelbaum wrote:
>> On 09/07/2016 08:55 PM, Laine Stump wrote:
>>> On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote:
> [snip]
>>>> Good point, maybe libvirt can avoid adding switches unless the user
>>>> explicitly
>>>> asked for them. I checked and it a actually works fine in QEMU.
>>>
>>> I'm just now writing the code that auto-adds *-ports as they are needed, and doing it this way simplifies it *immensely*.
>>>
>>> When I had to think about the possibility of needing upstream/downstream switches, as an endpoint device was added, I would need to check if a (root|downstream)-port was available and if not I might
>>> be able to just add a root-port, or I might have to add a downstream-port; if the only option was a downstream port, then *that* might require adding a new *upstream* port.
>>>
>>> If I can limit libvirt to only auto-adding root-ports (and if there is no downside to putting multiple root ports on a single root bus port), then I just need to find an empty function of an empty
>>> slot on the root bus, add a root-port, and I'm done (and since 224 is *a lot*, I think at least for now it's okay to punt once they get past that point).
>>>
>>> So, *is* there any downside to doing this?
>>>
>>
>> No downside I can think of.
>> Just be sure to emphasize the auto-add mechanism stops at 'x' devices. If the user needs more,
>> he should manually add switches and manually assign the devices to the Downstream Ports.
>>
>
> Just catching up on mail after vacation and read this thread. Thanks
> Marcel for writing this document (I guess a v1 is coming soon).

Yes, I am sorry but I got caught up with other stuff and I am
going to be in PTO for a week, so V1 will take a little more time
than I planned.

This
> will be very useful for determining the best default configuration of
> a virtio-pci mach-virt.
>

It will be very good if this doc will match both x86 and match-virt PCIe machines.
Your review would be appreciated.

> FWIW, here is the proposal that I started formulating when I experimented
> with this several months ago;
>
>  - PCIe-only (disable-modern=off, disable-legacy=on)

If the virtio devices are plugged into PCI Express Root Ports
or Downstream Ports this is already the default configuration,
you don't need to add the disable-* anymore.

>  - No legacy PCI support, i.e. no bridges (yup, I'm a PCIe purist,
>    but don't have a leg to stand on if push came to shove)

Yes... We'll say that legacy PCI support is optional.

>  - use one or more ports for virtio-scsi controllers for disks, one is
>    probably enough
>  - use one or more ports with multifunction, allowing up to 8 functions,
>    for virtio-net, one port is probably enough

As Alex Williamson mentioned, PCI Express Root Ports are actually functions,
not devices, so you can have up to 8 Ports per slot. This will be better
than make the virtio-* multi-function devices because then you would need
to hot-plug/hot-unplug all of them.
If hot-plug is not an issue, using multi-function devices into multi-function
Ports will save a lot of bus numbers witch are, Like Laszlo mentioned a
scarce resource.

>  - Add N extra ports for hotplug, N defaulting to 2
>    - hotplug devices to first N-1 ports, reserving last for a switch
>    - if switch is needed, hotplug it with M downstream ports
>      (M defaulting to 2*(N-1)+1)

We would prefer multi-function ports to switches, since you'll go out
of bus numbers before you use all the PCI Express Root Ports anyway. (see previous mails)

However the switches will be supported for cases when you have a
lot of Integrated Devices and the Root Ports are not enough, or
to enable some testing scenarios.

>  - Encourage somebody to develop generic versions of ports and switches,
>    hi Marcel :-), and exclusively use those in the configuration
>

My goal is to try to come up with them for 2.8, but since I haven't started
to work on them yet, I can't commit :)

Thanks,
Marcel

> Thanks,
> drew
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-15 14:20                 ` Marcel Apfelbaum
@ 2016-09-16 16:50                   ` Andrea Bolognani
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Bolognani @ 2016-09-16 16:50 UTC (permalink / raw)
  To: Marcel Apfelbaum, Andrew Jones
  Cc: Laine Stump, qemu-devel, Peter Maydell, mst, Alex Williamson,
	Gerd Hoffmann, Laszlo Ersek

On Thu, 2016-09-15 at 17:20 +0300, Marcel Apfelbaum wrote:
> > Just catching up on mail after vacation and read this thread. Thanks
> > Marcel for writing this document (I guess a v1 is coming soon).
> 
> Yes, I am sorry but I got caught up with other stuff and I am
> going to be in PTO for a week, so V1 will take a little more time
> than I planned.

I finally caught up as well, and while I don't have much
value to contribute to the conversation, let me say this:
everything about this thread is absolutely awesome!

The amount of information one can absorb from the discussion
alone is amazing, but the guidelines contained in the
document we're crafting will certainly prove to be invaluable
to users and people working higher up in the stack alike.

Thanks Marcel and everyone involved. You guys rock! :)

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-09-05 16:24 ` Laszlo Ersek
  2016-09-05 20:02   ` Marcel Apfelbaum
  2016-09-06 11:35   ` Gerd Hoffmann
@ 2016-10-04 14:59   ` Daniel P. Berrange
  2016-10-04 15:40     ` Laszlo Ersek
  2016-10-04 15:45     ` Alex Williamson
  2 siblings, 2 replies; 52+ messages in thread
From: Daniel P. Berrange @ 2016-10-04 14:59 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Marcel Apfelbaum, qemu-devel, Peter Maydell, Drew Jones, mst,
	Andrea Bolognani, Alex Williamson, Gerd Hoffmann, Laine Stump

On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:
> On 09/01/16 15:22, Marcel Apfelbaum wrote:
> > +2.3 PCI only hierarchy
> > +======================
> > +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
> > +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
> > +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
> > +only into pcie.0 bus.
> > +
> > +   pcie.0 bus
> > +   ----------------------------------------------
> > +        |                            |
> > +   -----------               ------------------
> > +   | PCI Dev |               | DMI-PCI BRIDGE |
> > +   ----------                ------------------
> > +                               |            |
> > +                        -----------    ------------------
> > +                        | PCI Dev |    | PCI-PCI Bridge |
> > +                        -----------    ------------------
> > +                                         |           |
> > +                                  -----------     -----------
> > +                                  | PCI Dev |     | PCI Dev |
> > +                                  -----------     -----------
> 
> Works for me, but I would again elaborate a little bit on keeping the
> hierarchy flat.
> 
> First, in order to preserve compatibility with libvirt's current
> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
> even if that's possible otherwise. Let's just say
> 
> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
> is required),

Why do you suggest this ? If the guest has multiple NUMA nodes
and you're creating a PXB for each NUMA node, then it looks valid
to want to have a DMI-PCI bridge attached to each PXB, so you can
have legacy PCI devices on each NUMA node, instead of putting them
all on the PCI bridge without NUMA affinity.

> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,

What's the rational for that, as opposed to plugging devices directly
into the DMI-PCI bridge which seems to work ?

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 14:59   ` Daniel P. Berrange
@ 2016-10-04 15:40     ` Laszlo Ersek
  2016-10-04 16:10       ` Laine Stump
  2016-10-04 15:45     ` Alex Williamson
  1 sibling, 1 reply; 52+ messages in thread
From: Laszlo Ersek @ 2016-10-04 15:40 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Marcel Apfelbaum, qemu-devel, Peter Maydell, Drew Jones, mst,
	Andrea Bolognani, Alex Williamson, Gerd Hoffmann, Laine Stump

On 10/04/16 16:59, Daniel P. Berrange wrote:
> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:
>> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>>> +2.3 PCI only hierarchy
>>> +======================
>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>>> +only into pcie.0 bus.
>>> +
>>> +   pcie.0 bus
>>> +   ----------------------------------------------
>>> +        |                            |
>>> +   -----------               ------------------
>>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>>> +   ----------                ------------------
>>> +                               |            |
>>> +                        -----------    ------------------
>>> +                        | PCI Dev |    | PCI-PCI Bridge |
>>> +                        -----------    ------------------
>>> +                                         |           |
>>> +                                  -----------     -----------
>>> +                                  | PCI Dev |     | PCI Dev |
>>> +                                  -----------     -----------
>>
>> Works for me, but I would again elaborate a little bit on keeping the
>> hierarchy flat.
>>
>> First, in order to preserve compatibility with libvirt's current
>> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
>> even if that's possible otherwise. Let's just say
>>
>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
>> is required),
> 
> Why do you suggest this ? If the guest has multiple NUMA nodes
> and you're creating a PXB for each NUMA node, then it looks valid
> to want to have a DMI-PCI bridge attached to each PXB, so you can
> have legacy PCI devices on each NUMA node, instead of putting them
> all on the PCI bridge without NUMA affinity.

You are right. I meant the above within one PCI Express root bus.

Small correction to your wording though: you don't want to attach the
DMI-PCI bridge to the PXB device, but to the extra root bus provided by
the PXB.

> 
>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,
> 
> What's the rational for that, as opposed to plugging devices directly
> into the DMI-PCI bridge which seems to work ?

The rationale is that libvirt used to do it like this. And the rationale
for *that* is that DMI-PCI bridges cannot accept hotplugged devices,
while PCI-PCI bridges can.

Technically nothing forbids (AFAICT) cold-plugging PCI devices into
DMI-PCI bridges, but this document is expressly not just about technical
constraints -- it's a policy document. We want to simplify / trim the
supported PCI and PCI Express hierarchies as much as possible.

All valid *high-level* topology goals should be permitted / covered one
way or another by this document, but in as few ways as possible --
hopefully only one way. For example, if you read the rest of the thread,
flat hierarchies are preferred to deeply nested hierarchies, because
flat ones save on bus numbers, are easier to setup and understand,
probably perform better, and don't lose any generality for cold- or hotplug.

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 14:59   ` Daniel P. Berrange
  2016-10-04 15:40     ` Laszlo Ersek
@ 2016-10-04 15:45     ` Alex Williamson
  2016-10-04 16:25       ` Laine Stump
  1 sibling, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2016-10-04 15:45 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Laszlo Ersek, Marcel Apfelbaum, qemu-devel, Peter Maydell,
	Drew Jones, mst, Andrea Bolognani, Gerd Hoffmann, Laine Stump

On Tue, 4 Oct 2016 15:59:11 +0100
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:
> > On 09/01/16 15:22, Marcel Apfelbaum wrote:  
> > > +2.3 PCI only hierarchy
> > > +======================
> > > +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
> > > +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
> > > +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
> > > +only into pcie.0 bus.
> > > +
> > > +   pcie.0 bus
> > > +   ----------------------------------------------
> > > +        |                            |
> > > +   -----------               ------------------
> > > +   | PCI Dev |               | DMI-PCI BRIDGE |
> > > +   ----------                ------------------
> > > +                               |            |
> > > +                        -----------    ------------------
> > > +                        | PCI Dev |    | PCI-PCI Bridge |
> > > +                        -----------    ------------------
> > > +                                         |           |
> > > +                                  -----------     -----------
> > > +                                  | PCI Dev |     | PCI Dev |
> > > +                                  -----------     -----------  
> > 
> > Works for me, but I would again elaborate a little bit on keeping the
> > hierarchy flat.
> > 
> > First, in order to preserve compatibility with libvirt's current
> > behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
> > even if that's possible otherwise. Let's just say
> > 
> > - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
> > is required),  
> 
> Why do you suggest this ? If the guest has multiple NUMA nodes
> and you're creating a PXB for each NUMA node, then it looks valid
> to want to have a DMI-PCI bridge attached to each PXB, so you can
> have legacy PCI devices on each NUMA node, instead of putting them
> all on the PCI bridge without NUMA affinity.

Seems like this is one of those "generic" vs "specific" device issues.
We use the DMI-to-PCI bridge as if it were a PCIe-to-PCI bridge, but
DMI is actually an Intel proprietary interface, the bridge just has the
same software interface as a PCI bridge.  So while you can use it as a
generic PCIe-to-PCI bridge, it's at least going to make me cringe every
time.
 
> > - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,  
> 
> What's the rational for that, as opposed to plugging devices directly
> into the DMI-PCI bridge which seems to work ?

IIRC, something about hotplug, but from a PCI perspective it doesn't
make any sense to me either.  Same with the restriction from using slot
0 on PCI bridges, there's no basis for that except on the root bus.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 15:40     ` Laszlo Ersek
@ 2016-10-04 16:10       ` Laine Stump
  2016-10-04 16:43         ` Laszlo Ersek
  2016-10-04 17:54         ` Laine Stump
  0 siblings, 2 replies; 52+ messages in thread
From: Laine Stump @ 2016-10-04 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum,
	Peter Maydell, Drew Jones, mst, Andrea Bolognani,
	Alex Williamson, Gerd Hoffmann

On 10/04/2016 11:40 AM, Laszlo Ersek wrote:
> On 10/04/16 16:59, Daniel P. Berrange wrote:
>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:
>>> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>>>> +2.3 PCI only hierarchy
>>>> +======================
>>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
>>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
>>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>>>> +only into pcie.0 bus.
>>>> +
>>>> +   pcie.0 bus
>>>> +   ----------------------------------------------
>>>> +        |                            |
>>>> +   -----------               ------------------
>>>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>>>> +   ----------                ------------------
>>>> +                               |            |
>>>> +                        -----------    ------------------
>>>> +                        | PCI Dev |    | PCI-PCI Bridge |
>>>> +                        -----------    ------------------
>>>> +                                         |           |
>>>> +                                  -----------     -----------
>>>> +                                  | PCI Dev |     | PCI Dev |
>>>> +                                  -----------     -----------
>>>
>>> Works for me, but I would again elaborate a little bit on keeping the
>>> hierarchy flat.
>>>
>>> First, in order to preserve compatibility with libvirt's current
>>> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
>>> even if that's possible otherwise. Let's just say
>>>
>>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
>>> is required),
>>
>> Why do you suggest this ? If the guest has multiple NUMA nodes
>> and you're creating a PXB for each NUMA node, then it looks valid
>> to want to have a DMI-PCI bridge attached to each PXB, so you can
>> have legacy PCI devices on each NUMA node, instead of putting them
>> all on the PCI bridge without NUMA affinity.
>
> You are right. I meant the above within one PCI Express root bus.
>
> Small correction to your wording though: you don't want to attach the
> DMI-PCI bridge to the PXB device, but to the extra root bus provided by
> the PXB.

This made me realize something - the root bus on a pxb-pcie controller 
has a single slot and that slot can accept either a pcie-root-port 
(ioh3420) or a dmi-to-pci-bridge. If you want to have both express and 
legacy PCI devices on the same NUMA node, then you would either need to 
create one pxb-pcie for the pcie-root-port and another for the 
dmi-to-pci-bridge, or you would need to put the pcie-root-port and 
dmi-to-pci-bridge onto different functions of the single slot. Should 
the latter work properly?


>
>>
>>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,
>>
>> What's the rational for that, as opposed to plugging devices directly
>> into the DMI-PCI bridge which seems to work ?
>
> The rationale is that libvirt used to do it like this.


Nah, that's just the *result* of the rationale that we wanted the 
devices to be hotpluggable. At some later date we learned the hotplug on 
a pci-bridge device doesn't work on a Q35 machine anyway, so it was kind 
of pointless (but we still do it because we hold out hope that hotplug 
of legacy PCI devices into a pci-bridge on Q35 machines will work one day)


> And the rationale
> for *that* is that DMI-PCI bridges cannot accept hotplugged devices,
> while PCI-PCI bridges can.
>
> Technically nothing forbids (AFAICT) cold-plugging PCI devices into
> DMI-PCI bridges, but this document is expressly not just about technical
> constraints -- it's a policy document. We want to simplify / trim the
> supported PCI and PCI Express hierarchies as much as possible.
>
> All valid *high-level* topology goals should be permitted / covered one
> way or another by this document, but in as few ways as possible --
> hopefully only one way. For example, if you read the rest of the thread,
> flat hierarchies are preferred to deeply nested hierarchies, because
> flat ones save on bus numbers

Do they?

>, are easier to setup and understand,
> probably perform better, and don't lose any generality for cold- or hotplug.
>
> Thanks
> Laszlo
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 15:45     ` Alex Williamson
@ 2016-10-04 16:25       ` Laine Stump
  2016-10-05 10:03         ` Marcel Apfelbaum
  0 siblings, 1 reply; 52+ messages in thread
From: Laine Stump @ 2016-10-04 16:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Daniel P. Berrange, Laszlo Ersek,
	Marcel Apfelbaum, Peter Maydell, Drew Jones, mst,
	Andrea Bolognani, Gerd Hoffmann

On 10/04/2016 11:45 AM, Alex Williamson wrote:
> On Tue, 4 Oct 2016 15:59:11 +0100
> "Daniel P. Berrange" <berrange@redhat.com> wrote:
>
>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:
>>> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>>>> +2.3 PCI only hierarchy
>>>> +======================
>>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
>>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
>>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>>>> +only into pcie.0 bus.
>>>> +
>>>> +   pcie.0 bus
>>>> +   ----------------------------------------------
>>>> +        |                            |
>>>> +   -----------               ------------------
>>>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>>>> +   ----------                ------------------
>>>> +                               |            |
>>>> +                        -----------    ------------------
>>>> +                        | PCI Dev |    | PCI-PCI Bridge |
>>>> +                        -----------    ------------------
>>>> +                                         |           |
>>>> +                                  -----------     -----------
>>>> +                                  | PCI Dev |     | PCI Dev |
>>>> +                                  -----------     -----------
>>>
>>> Works for me, but I would again elaborate a little bit on keeping the
>>> hierarchy flat.
>>>
>>> First, in order to preserve compatibility with libvirt's current
>>> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
>>> even if that's possible otherwise. Let's just say
>>>
>>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
>>> is required),
>>
>> Why do you suggest this ? If the guest has multiple NUMA nodes
>> and you're creating a PXB for each NUMA node, then it looks valid
>> to want to have a DMI-PCI bridge attached to each PXB, so you can
>> have legacy PCI devices on each NUMA node, instead of putting them
>> all on the PCI bridge without NUMA affinity.
>
> Seems like this is one of those "generic" vs "specific" device issues.
> We use the DMI-to-PCI bridge as if it were a PCIe-to-PCI bridge, but
> DMI is actually an Intel proprietary interface, the bridge just has the
> same software interface as a PCI bridge.  So while you can use it as a
> generic PCIe-to-PCI bridge, it's at least going to make me cringe every
> time.


If using it this way makes kittens cry or something, then we'd be happy 
to use a generic pcie-to-pci bridge if somebody created one :-)


>
>>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,
>>
>> What's the rational for that, as opposed to plugging devices directly
>> into the DMI-PCI bridge which seems to work ?
>
> IIRC, something about hotplug, but from a PCI perspective it doesn't
> make any sense to me either.


At one point Marcel and Michael were discussing the possibility of 
making hotplug work on a dmi-to-pci-bridge. Currently it doesn't even 
work for pci-bridge so (as I think I said in another message just now) 
it is kind of pointless, although when I asked about eliminating use of 
pci-bridge in favor of just using dmi-to-pci-bridge directly, I got lots 
of "no" votes.


>  Same with the restriction from using slot
> 0 on PCI bridges, there's no basis for that except on the root bus.

I tried allowing devices to be plugged into slot 0 of a pci-bridge in 
libvirt - qemu barfed, so I moved the "minSlot" for pci-bridge back up 
to 1. Slot 0 is completely usable on a dmi-to-pci-bridge though (and 
libvirt allows it). At this point, even if qemu enabled using slot 0 of 
a pci-bridge, libvirt wouldn't be able to expose that to users (unless 
the min/max slot of each PCI controller was made visible somewhere via QMP)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 16:10       ` Laine Stump
@ 2016-10-04 16:43         ` Laszlo Ersek
  2016-10-04 18:08           ` Laine Stump
  2016-10-04 17:54         ` Laine Stump
  1 sibling, 1 reply; 52+ messages in thread
From: Laszlo Ersek @ 2016-10-04 16:43 UTC (permalink / raw)
  To: Laine Stump, qemu-devel
  Cc: Daniel P. Berrange, Marcel Apfelbaum, Peter Maydell, Drew Jones,
	mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann

On 10/04/16 18:10, Laine Stump wrote:
> On 10/04/2016 11:40 AM, Laszlo Ersek wrote:
>> On 10/04/16 16:59, Daniel P. Berrange wrote:
>>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:
>>>> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>>>>> +2.3 PCI only hierarchy
>>>>> +======================
>>>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated
>>>>> Devices or
>>>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI
>>>>> bridges
>>>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>>>>> +only into pcie.0 bus.
>>>>> +
>>>>> +   pcie.0 bus
>>>>> +   ----------------------------------------------
>>>>> +        |                            |
>>>>> +   -----------               ------------------
>>>>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>>>>> +   ----------                ------------------
>>>>> +                               |            |
>>>>> +                        -----------    ------------------
>>>>> +                        | PCI Dev |    | PCI-PCI Bridge |
>>>>> +                        -----------    ------------------
>>>>> +                                         |           |
>>>>> +                                  -----------     -----------
>>>>> +                                  | PCI Dev |     | PCI Dev |
>>>>> +                                  -----------     -----------
>>>>
>>>> Works for me, but I would again elaborate a little bit on keeping the
>>>> hierarchy flat.
>>>>
>>>> First, in order to preserve compatibility with libvirt's current
>>>> behavior, let's not plug a PCI device directly in to the DMI-PCI
>>>> bridge,
>>>> even if that's possible otherwise. Let's just say
>>>>
>>>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
>>>> is required),
>>>
>>> Why do you suggest this ? If the guest has multiple NUMA nodes
>>> and you're creating a PXB for each NUMA node, then it looks valid
>>> to want to have a DMI-PCI bridge attached to each PXB, so you can
>>> have legacy PCI devices on each NUMA node, instead of putting them
>>> all on the PCI bridge without NUMA affinity.
>>
>> You are right. I meant the above within one PCI Express root bus.
>>
>> Small correction to your wording though: you don't want to attach the
>> DMI-PCI bridge to the PXB device, but to the extra root bus provided by
>> the PXB.
> 
> This made me realize something - the root bus on a pxb-pcie controller
> has a single slot and that slot can accept either a pcie-root-port
> (ioh3420) or a dmi-to-pci-bridge. If you want to have both express and
> legacy PCI devices on the same NUMA node, then you would either need to
> create one pxb-pcie for the pcie-root-port and another for the
> dmi-to-pci-bridge, or you would need to put the pcie-root-port and
> dmi-to-pci-bridge onto different functions of the single slot. Should
> the latter work properly?

Yes, I expect so. (Famous last words? :))

> 
> 
>>
>>>
>>>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,
>>>
>>> What's the rational for that, as opposed to plugging devices directly
>>> into the DMI-PCI bridge which seems to work ?
>>
>> The rationale is that libvirt used to do it like this.
> 
> 
> Nah, that's just the *result* of the rationale that we wanted the
> devices to be hotpluggable. At some later date we learned the hotplug on
> a pci-bridge device doesn't work on a Q35 machine anyway, so it was kind
> of pointless (but we still do it because we hold out hope that hotplug
> of legacy PCI devices into a pci-bridge on Q35 machines will work one day)
> 
> 
>> And the rationale
>> for *that* is that DMI-PCI bridges cannot accept hotplugged devices,
>> while PCI-PCI bridges can.
>>
>> Technically nothing forbids (AFAICT) cold-plugging PCI devices into
>> DMI-PCI bridges, but this document is expressly not just about technical
>> constraints -- it's a policy document. We want to simplify / trim the
>> supported PCI and PCI Express hierarchies as much as possible.
>>
>> All valid *high-level* topology goals should be permitted / covered one
>> way or another by this document, but in as few ways as possible --
>> hopefully only one way. For example, if you read the rest of the thread,
>> flat hierarchies are preferred to deeply nested hierarchies, because
>> flat ones save on bus numbers
> 
> Do they?

Yes. Nesting implies bridges, and bridges take up bus numbers. For
example, in a PCI Express switch, the upstream port of the switch
consumes a bus number, with no practical usefulness.

IIRC we collectively devised a flat pattern elsewhere in the thread
where you could exhaust the 0..255 bus number space such that almost
every bridge (= taking up a bus number) would also be capable of
accepting a hot-plugged or cold-plugged PCI Express device. That is,
practically no wasted bus numbers.

Hm.... search this message for "population algorithm":

https://www.mail-archive.com/qemu-devel@nongnu.org/msg394730.html

and then Gerd's big improvement / simplification on it, with multifunction:

https://www.mail-archive.com/qemu-devel@nongnu.org/msg395437.html

In Gerd's scheme, you'd only need only one or two (I'm lazy to count
exactly :)) PCI Express switches, to exhaust all bus numbers. Minimal
waste due to upstream ports.

Thanks
Laszlo

>> , are easier to setup and understand,
>> probably perform better, and don't lose any generality for cold- or
>> hotplug.
>>
>> Thanks
>> Laszlo
>>
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 16:10       ` Laine Stump
  2016-10-04 16:43         ` Laszlo Ersek
@ 2016-10-04 17:54         ` Laine Stump
  2016-10-05  9:17           ` Marcel Apfelbaum
  1 sibling, 1 reply; 52+ messages in thread
From: Laine Stump @ 2016-10-04 17:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum,
	Peter Maydell, Drew Jones, mst, Andrea Bolognani,
	Alex Williamson, Gerd Hoffmann

On 10/04/2016 12:10 PM, Laine Stump wrote:
> On 10/04/2016 11:40 AM, Laszlo Ersek wrote:

>> Small correction to your wording though: you don't want to attach the
>> DMI-PCI bridge to the PXB device, but to the extra root bus provided by
>> the PXB.
>
> This made me realize something - the root bus on a pxb-pcie controller
> has a single slot and that slot can accept either a pcie-root-port
> (ioh3420) or a dmi-to-pci-bridge. If you want to have both express and
> legacy PCI devices on the same NUMA node, then you would either need to
> create one pxb-pcie for the pcie-root-port and another for the
> dmi-to-pci-bridge, or you would need to put the pcie-root-port and
> dmi-to-pci-bridge onto different functions of the single slot. Should
> the latter work properly?

We were discussing pxb-pcie today while Dan was trying to get a 
particular configuration working, and there was some disagreement about 
two points that I stated above as fact (but which may just be 
misunderstanding again):

1) Does pxb-pcie only provide a single slot (0)? Or does it provide 32 
slots (0-31) just like the pcie root complex?

2) can you really only plug a pcie-root-port (ioh3420) into a pxb-pcie? 
Or will it accept anything that pcie.0 accepts?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 16:43         ` Laszlo Ersek
@ 2016-10-04 18:08           ` Laine Stump
  2016-10-04 18:52             ` Alex Williamson
  2016-10-04 18:56             ` Laszlo Ersek
  0 siblings, 2 replies; 52+ messages in thread
From: Laine Stump @ 2016-10-04 18:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum,
	Peter Maydell, Drew Jones, mst, Andrea Bolognani,
	Alex Williamson, Gerd Hoffmann

On 10/04/2016 12:43 PM, Laszlo Ersek wrote:
> On 10/04/16 18:10, Laine Stump wrote:
>> On 10/04/2016 11:40 AM, Laszlo Ersek wrote:
>>> On 10/04/16 16:59, Daniel P. Berrange wrote:
>>>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:
>>> All valid *high-level* topology goals should be permitted / covered one
>>> way or another by this document, but in as few ways as possible --
>>> hopefully only one way. For example, if you read the rest of the thread,
>>> flat hierarchies are preferred to deeply nested hierarchies, because
>>> flat ones save on bus numbers
>>
>> Do they?
>
> Yes. Nesting implies bridges, and bridges take up bus numbers. For
> example, in a PCI Express switch, the upstream port of the switch
> consumes a bus number, with no practical usefulness.

I'ts all just idle number games, but what I was thinking of was the 
difference between plugging  a bunch of root-port+upstream+downstreamxN 
combos directly into pcie-root (flat), vs. plugging the first into 
pcie-root, and then subsequent ones into e.g. the last downstream port 
of the previous set. Take the simplest case of needing 63 hotpluggable 
slots. In the "flat" case, you have:

    2 x pcie-root-port
    2 x pcie-switch-upstream-port
    63 x pcie-switch-downstream-port

In the "nested" or "chained" case you have:

    1 x pcie-root-port
    1 x pcie-switch-upstream-port
    32 x pcie-downstream-port
    1 x pcie-switch-upstream-port
    32 x pcie-switch-downstream-port

so you use the same number of PCI controllers.

Of course if you're talking about the difference between using 
upstream+downstream vs. just having a bunch of pcie-root-ports directly 
on pcie-root then you're correct, but only marginally - for 63 
hotpluggable ports, you would need 63 x pcie-root-port, so a savings of 
4 controllers - about 6.5%. (Of course this is all moot since you run 
out of ioport space after, what, 7 controllers needing it anyway? :-P)

>
> IIRC we collectively devised a flat pattern elsewhere in the thread
> where you could exhaust the 0..255 bus number space such that almost
> every bridge (= taking up a bus number) would also be capable of
> accepting a hot-plugged or cold-plugged PCI Express device. That is,
> practically no wasted bus numbers.
>
> Hm.... search this message for "population algorithm":
>
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg394730.html
>
> and then Gerd's big improvement / simplification on it, with multifunction:
>
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg395437.html
>
> In Gerd's scheme, you'd only need only one or two (I'm lazy to count
> exactly :)) PCI Express switches, to exhaust all bus numbers. Minimal
> waste due to upstream ports.

Yep. And in response to his message, that's what I'm implementing as the 
default strategy in libvirt :-)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 18:08           ` Laine Stump
@ 2016-10-04 18:52             ` Alex Williamson
  2016-10-10 12:02               ` Andrea Bolognani
  2016-10-04 18:56             ` Laszlo Ersek
  1 sibling, 1 reply; 52+ messages in thread
From: Alex Williamson @ 2016-10-04 18:52 UTC (permalink / raw)
  To: Laine Stump
  Cc: qemu-devel, Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum,
	Peter Maydell, Drew Jones, mst, Andrea Bolognani, Gerd Hoffmann

On Tue, 4 Oct 2016 14:08:45 -0400
Laine Stump <laine@redhat.com> wrote:

> On 10/04/2016 12:43 PM, Laszlo Ersek wrote:
> > On 10/04/16 18:10, Laine Stump wrote:  
> >> On 10/04/2016 11:40 AM, Laszlo Ersek wrote:  
> >>> On 10/04/16 16:59, Daniel P. Berrange wrote:  
> >>>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:  
> >>> All valid *high-level* topology goals should be permitted / covered one
> >>> way or another by this document, but in as few ways as possible --
> >>> hopefully only one way. For example, if you read the rest of the thread,
> >>> flat hierarchies are preferred to deeply nested hierarchies, because
> >>> flat ones save on bus numbers  
> >>
> >> Do they?  
> >
> > Yes. Nesting implies bridges, and bridges take up bus numbers. For
> > example, in a PCI Express switch, the upstream port of the switch
> > consumes a bus number, with no practical usefulness.  
> 
> I'ts all just idle number games, but what I was thinking of was the 
> difference between plugging  a bunch of root-port+upstream+downstreamxN 
> combos directly into pcie-root (flat), vs. plugging the first into 
> pcie-root, and then subsequent ones into e.g. the last downstream port 
> of the previous set. Take the simplest case of needing 63 hotpluggable 
> slots. In the "flat" case, you have:
> 
>     2 x pcie-root-port
>     2 x pcie-switch-upstream-port
>     63 x pcie-switch-downstream-port
> 
> In the "nested" or "chained" case you have:
> 
>     1 x pcie-root-port
>     1 x pcie-switch-upstream-port
>     32 x pcie-downstream-port
>     1 x pcie-switch-upstream-port
>     32 x pcie-switch-downstream-port

You're not thinking in enough dimensions.  A single root port can host
multiple sub-hierarchies on it's own.  We can have a multi-function
upstream switch, so you can have 8 upstream ports (00.{0-7}).  If we
implemented ARI on the upstream ports, we could have 256 upstream ports
attached to a single root port, but of course then we've run out of
bus numbers before we've even gotten to actual devices buses.

Another option, look at the downstream ports, why do they each need to
be in separate slots?  We have the address space of an entire bus to
work with, so we can also create multi-function downstream ports, which
gives us 256 downstream ports per upstream port.  Oops, we just ran out
of bus numbers again, but at least actual devices can be attached.
Thanks,

Alex


> so you use the same number of PCI controllers.
> 
> Of course if you're talking about the difference between using 
> upstream+downstream vs. just having a bunch of pcie-root-ports directly 
> on pcie-root then you're correct, but only marginally - for 63 
> hotpluggable ports, you would need 63 x pcie-root-port, so a savings of 
> 4 controllers - about 6.5%. (Of course this is all moot since you run 
> out of ioport space after, what, 7 controllers needing it anyway? :-P)
> 
> >
> > IIRC we collectively devised a flat pattern elsewhere in the thread
> > where you could exhaust the 0..255 bus number space such that almost
> > every bridge (= taking up a bus number) would also be capable of
> > accepting a hot-plugged or cold-plugged PCI Express device. That is,
> > practically no wasted bus numbers.
> >
> > Hm.... search this message for "population algorithm":
> >
> > https://www.mail-archive.com/qemu-devel@nongnu.org/msg394730.html
> >
> > and then Gerd's big improvement / simplification on it, with multifunction:
> >
> > https://www.mail-archive.com/qemu-devel@nongnu.org/msg395437.html
> >
> > In Gerd's scheme, you'd only need only one or two (I'm lazy to count
> > exactly :)) PCI Express switches, to exhaust all bus numbers. Minimal
> > waste due to upstream ports.  
> 
> Yep. And in response to his message, that's what I'm implementing as the 
> default strategy in libvirt :-)
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 18:08           ` Laine Stump
  2016-10-04 18:52             ` Alex Williamson
@ 2016-10-04 18:56             ` Laszlo Ersek
  1 sibling, 0 replies; 52+ messages in thread
From: Laszlo Ersek @ 2016-10-04 18:56 UTC (permalink / raw)
  To: Laine Stump, qemu-devel
  Cc: Daniel P. Berrange, Marcel Apfelbaum, Peter Maydell, Drew Jones,
	mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann

On 10/04/16 20:08, Laine Stump wrote:
> On 10/04/2016 12:43 PM, Laszlo Ersek wrote:
>> On 10/04/16 18:10, Laine Stump wrote:
>>> On 10/04/2016 11:40 AM, Laszlo Ersek wrote:
>>>> On 10/04/16 16:59, Daniel P. Berrange wrote:
>>>>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:
>>>> All valid *high-level* topology goals should be permitted / covered one
>>>> way or another by this document, but in as few ways as possible --
>>>> hopefully only one way. For example, if you read the rest of the
>>>> thread,
>>>> flat hierarchies are preferred to deeply nested hierarchies, because
>>>> flat ones save on bus numbers
>>>
>>> Do they?
>>
>> Yes. Nesting implies bridges, and bridges take up bus numbers. For
>> example, in a PCI Express switch, the upstream port of the switch
>> consumes a bus number, with no practical usefulness.
> 
> I'ts all just idle number games, but what I was thinking of was the
> difference between plugging  a bunch of root-port+upstream+downstreamxN
> combos directly into pcie-root (flat), vs. plugging the first into
> pcie-root, and then subsequent ones into e.g. the last downstream port
> of the previous set. Take the simplest case of needing 63 hotpluggable
> slots. In the "flat" case, you have:
> 
>    2 x pcie-root-port
>    2 x pcie-switch-upstream-port
>    63 x pcie-switch-downstream-port
> 
> In the "nested" or "chained" case you have:
> 
>    1 x pcie-root-port
>    1 x pcie-switch-upstream-port
>    32 x pcie-downstream-port
>    1 x pcie-switch-upstream-port
>    32 x pcie-switch-downstream-port
> 
> so you use the same number of PCI controllers.
> 
> Of course if you're talking about the difference between using
> upstream+downstream vs. just having a bunch of pcie-root-ports directly
> on pcie-root then you're correct, but only marginally - for 63
> hotpluggable ports, you would need 63 x pcie-root-port, so a savings of
> 4 controllers - about 6.5%.

We aim at 200+ ports.

Also, nesting causes recursion in any guest code that traverses the
hierarchy. I think it has some performance impact, plus, for me at
least, interpreting PCI enumeration logs with deep recursion is way
harder than the flat stuff. The bus number space is flat, and for me
it's easier to "map back" to the topology if the topology is also mostly
flat.

> (Of course this is all moot since you run
> out of ioport space after, what, 7 controllers needing it anyway? :-P)

No, it's not moot. The idea is that PCI Express devices must not require
IO space for correct operation -- I believe this is actually mandated by
the PCI Express spec --, so in the PCI Express hierarchy we wouldn't
reserve IO space at all. We discussed this earlier up-thread, please see:

http://lists.nongnu.org/archive/html/qemu-devel/2016-09/msg00672.html

    * Finally, this is the spot where we should design and explain our
      resource reservation for hotplug: [...]

>> IIRC we collectively devised a flat pattern elsewhere in the thread
>> where you could exhaust the 0..255 bus number space such that almost
>> every bridge (= taking up a bus number) would also be capable of
>> accepting a hot-plugged or cold-plugged PCI Express device. That is,
>> practically no wasted bus numbers.
>>
>> Hm.... search this message for "population algorithm":
>>
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg394730.html
>>
>> and then Gerd's big improvement / simplification on it, with
>> multifunction:
>>
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg395437.html
>>
>> In Gerd's scheme, you'd only need only one or two (I'm lazy to count
>> exactly :)) PCI Express switches, to exhaust all bus numbers. Minimal
>> waste due to upstream ports.
> 
> Yep. And in response to his message, that's what I'm implementing as the
> default strategy in libvirt :-)

Sounds great, thanks!
Laszlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 17:54         ` Laine Stump
@ 2016-10-05  9:17           ` Marcel Apfelbaum
  2016-10-10 11:09             ` Andrea Bolognani
  0 siblings, 1 reply; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-10-05  9:17 UTC (permalink / raw)
  To: Laine Stump, qemu-devel
  Cc: Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst,
	Andrea Bolognani, Alex Williamson, Gerd Hoffmann

On 10/04/2016 08:54 PM, Laine Stump wrote:
> On 10/04/2016 12:10 PM, Laine Stump wrote:
>> On 10/04/2016 11:40 AM, Laszlo Ersek wrote:
>
>>> Small correction to your wording though: you don't want to attach the
>>> DMI-PCI bridge to the PXB device, but to the extra root bus provided by
>>> the PXB.
>>
>> This made me realize something - the root bus on a pxb-pcie controller
>> has a single slot and that slot can accept either a pcie-root-port
>> (ioh3420) or a dmi-to-pci-bridge. If you want to have both express and
>> legacy PCI devices on the same NUMA node, then you would either need to
>> create one pxb-pcie for the pcie-root-port and another for the
>> dmi-to-pci-bridge, or you would need to put the pcie-root-port and
>> dmi-to-pci-bridge onto different functions of the single slot. Should
>> the latter work properly?
>

Hi,

> We were discussing pxb-pcie today while Dan was trying to get a particular configuration working, and there was some disagreement about two points that I stated above as fact (but which may just be
> misunderstanding again):
>
> 1) Does pxb-pcie only provide a single slot (0)? Or does it provide 32 slots (0-31) just like the pcie root complex?
>

It provides 32 slots behaving like a PCI Express Root Complex.

> 2) can you really only plug a pcie-root-port (ioh3420) into a pxb-pcie? Or will it accept anything that pcie.0 accepts?

It supports only PCI Express Root Ports. It does not support Integrated Devices.

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 16:25       ` Laine Stump
@ 2016-10-05 10:03         ` Marcel Apfelbaum
  0 siblings, 0 replies; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-10-05 10:03 UTC (permalink / raw)
  To: Laine Stump, qemu-devel
  Cc: Alex Williamson, Daniel P. Berrange, Laszlo Ersek, Peter Maydell,
	Drew Jones, mst, Andrea Bolognani, Gerd Hoffmann

On 10/04/2016 07:25 PM, Laine Stump wrote:
> On 10/04/2016 11:45 AM, Alex Williamson wrote:
>> On Tue, 4 Oct 2016 15:59:11 +0100
>> "Daniel P. Berrange" <berrange@redhat.com> wrote:
>>
>>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote:
>>>> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>>>>> +2.3 PCI only hierarchy
>>>>> +======================
>>>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
>>>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
>>>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>>>>> +only into pcie.0 bus.
>>>>> +
>>>>> +   pcie.0 bus
>>>>> +   ----------------------------------------------
>>>>> +        |                            |
>>>>> +   -----------               ------------------
>>>>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>>>>> +   ----------                ------------------
>>>>> +                               |            |
>>>>> +                        -----------    ------------------
>>>>> +                        | PCI Dev |    | PCI-PCI Bridge |
>>>>> +                        -----------    ------------------
>>>>> +                                         |           |
>>>>> +                                  -----------     -----------
>>>>> +                                  | PCI Dev |     | PCI Dev |
>>>>> +                                  -----------     -----------
>>>>
>>>> Works for me, but I would again elaborate a little bit on keeping the
>>>> hierarchy flat.
>>>>
>>>> First, in order to preserve compatibility with libvirt's current
>>>> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
>>>> even if that's possible otherwise. Let's just say
>>>>
>>>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
>>>> is required),
>>>
>>> Why do you suggest this ? If the guest has multiple NUMA nodes
>>> and you're creating a PXB for each NUMA node, then it looks valid
>>> to want to have a DMI-PCI bridge attached to each PXB, so you can
>>> have legacy PCI devices on each NUMA node, instead of putting them
>>> all on the PCI bridge without NUMA affinity.
>>
>> Seems like this is one of those "generic" vs "specific" device issues.
>> We use the DMI-to-PCI bridge as if it were a PCIe-to-PCI bridge, but
>> DMI is actually an Intel proprietary interface, the bridge just has the
>> same software interface as a PCI bridge.  So while you can use it as a
>> generic PCIe-to-PCI bridge, it's at least going to make me cringe every
>> time.
>
>
> If using it this way makes kittens cry or something, then we'd be happy to use a generic pcie-to-pci bridge if somebody created one :-)
>
>
>>
>>>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,
>>>
>>> What's the rational for that, as opposed to plugging devices directly
>>> into the DMI-PCI bridge which seems to work ?
>>

Hi,

>> IIRC, something about hotplug, but from a PCI perspective it doesn't
>> make any sense to me either.
>

Indeed, the reason to plug the PCI bridge into the DMI-TO-PCI bridge
would be the hot-plug support.
The PCI bridges can support hotplug on Q35.
There is even an RFC on the list doing that:
     https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg05681.html

With the DMI-PCI bridge is another story. From what I understand the actual
device (i82801b11) do not support hotplug and the chances to make it work
are minimal.


>
> At one point Marcel and Michael were discussing the possibility of making hotplug work on a dmi-to-pci-bridge. Currently it doesn't even work for pci-bridge so (as I think I said in another message
> just now) it is kind of pointless, although when I asked about eliminating use of pci-bridge in favor of just using dmi-to-pci-bridge directly, I got lots of "no" votes.
>

Since we have an RFC showing it is possible to have hotplug for PCI devices pluged into PCI bridges
it is better to continue using the PCI bridge until one of the bellow will happen:
  1 - pci-bridge ACPI hotplug will be possible
  2 - i82801b11 ACPI hotplug will be possible
  3 - a new pcie-pci bridge will be coded

>
>>  Same with the restriction from using slot
>> 0 on PCI bridges, there's no basis for that except on the root bus.
>
> I tried allowing devices to be plugged into slot 0 of a pci-bridge in libvirt - qemu barfed, so I moved the "minSlot" for pci-bridge back up to 1. Slot 0 is completely usable on a dmi-to-pci-bridge
> though (and libvirt allows it). At this point, even if qemu enabled using slot 0 of a pci-bridge, libvirt wouldn't be able to expose that to users (unless the min/max slot of each PCI controller was
> made visible somewhere via QMP)
>

The reason for not being able to plug a device into slot 0 of a PCI Bridge is the SHPC (Hot-plug controller)
device embedded in the PCI bridge by default. The SHPC spec requires this.
If one disables it with shpc=false, he should be able to use the slot 0.

Funny thing, the SHPC is not actually used by neither i440fx or Q35 machines,
for i440fx we use ACPI based PCI hotplug and for Q35 we use PCIe native hotplug.

Should we default the shpc to off?

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-05  9:17           ` Marcel Apfelbaum
@ 2016-10-10 11:09             ` Andrea Bolognani
  2016-10-10 14:15               ` Marcel Apfelbaum
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Bolognani @ 2016-10-10 11:09 UTC (permalink / raw)
  To: Marcel Apfelbaum, Laine Stump, qemu-devel
  Cc: Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst,
	Alex Williamson, Gerd Hoffmann

On Wed, 2016-10-05 at 12:17 +0300, Marcel Apfelbaum wrote:
> > 2) can you really only plug a pcie-root-port (ioh3420)
> > into a pxb-pcie? Or will it accept anything that pcie.0
> > accepts?
> 
> It supports only PCI Express Root Ports. It does not
> support Integrated Devices.

So no PCI Express Switch Upstream Ports? What about
DMI-to-PCI Bridges?

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-04 18:52             ` Alex Williamson
@ 2016-10-10 12:02               ` Andrea Bolognani
  2016-10-10 14:36                 ` Marcel Apfelbaum
  0 siblings, 1 reply; 52+ messages in thread
From: Andrea Bolognani @ 2016-10-10 12:02 UTC (permalink / raw)
  To: Alex Williamson, Laine Stump
  Cc: qemu-devel, Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum,
	Peter Maydell, Drew Jones, mst, Gerd Hoffmann

On Tue, 2016-10-04 at 12:52 -0600, Alex Williamson wrote:
> > I'ts all just idle number games, but what I was thinking of was the 
> > difference between plugging  a bunch of root-port+upstream+downstreamxN 
> > combos directly into pcie-root (flat), vs. plugging the first into 
> > pcie-root, and then subsequent ones into e.g. the last downstream port 
> > of the previous set. Take the simplest case of needing 63 hotpluggable 
> > slots. In the "flat" case, you have:
> > 
> >     2 x pcie-root-port
> >     2 x pcie-switch-upstream-port
> >     63 x pcie-switch-downstream-port
> > 
> > In the "nested" or "chained" case you have:
> > 
> >     1 x pcie-root-port
> >     1 x pcie-switch-upstream-port
> >     32 x pcie-downstream-port
> >     1 x pcie-switch-upstream-port
> >     32 x pcie-switch-downstream-port
> 
> You're not thinking in enough dimensions.  A single root port can host
> multiple sub-hierarchies on it's own.  We can have a multi-function
> upstream switch, so you can have 8 upstream ports (00.{0-7}).  If we
> implemented ARI on the upstream ports, we could have 256 upstream ports
> attached to a single root port, but of course then we've run out of
> bus numbers before we've even gotten to actual devices buses.
> 
> Another option, look at the downstream ports, why do they each need to
> be in separate slots?  We have the address space of an entire bus to
> work with, so we can also create multi-function downstream ports, which
> gives us 256 downstream ports per upstream port.  Oops, we just ran out
> of bus numbers again, but at least actual devices can be attached.

What's the advantage in using ARI to stuff more than eight
of anything that's not Endpoint Devices in a single slot?

I mean, if we just fill up all 32 slots in a PCIe Root Bus
with 8 PCIe Root Ports each we already end up having 256
hotpluggable slots[1]. Why would it be preferable to use
ARI, or even PCIe Switches, instead?


[1] The last slot will have to be limited to 7 PCIe Root
    Ports if we don't want to run out of bus numbers
-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-10 11:09             ` Andrea Bolognani
@ 2016-10-10 14:15               ` Marcel Apfelbaum
  2016-10-11 13:30                 ` Andrea Bolognani
  0 siblings, 1 reply; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-10-10 14:15 UTC (permalink / raw)
  To: Andrea Bolognani, Laine Stump, qemu-devel
  Cc: Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst,
	Alex Williamson, Gerd Hoffmann

On 10/10/2016 02:09 PM, Andrea Bolognani wrote:
> On Wed, 2016-10-05 at 12:17 +0300, Marcel Apfelbaum wrote:
>>> 2) can you really only plug a pcie-root-port (ioh3420)
>>> into a pxb-pcie? Or will it accept anything that pcie.0
>>> accepts?
>>
>> It supports only PCI Express Root Ports. It does not
>> support Integrated Devices.
>
> So no PCI Express Switch Upstream Ports?

The switch upstream ports can only be plugged into PCIe Root Ports.
There is an error in the RFC showing otherwise, it is already
corrected in V1, not yet upstream.


What about DMI-to-PCI Bridges?

Yes, the dmi-to-pci bridge can be plugged into the pxb-pcie, I'll
be sure to emphasize it.

Thanks,
Marcel

>
> --
> Andrea Bolognani / Red Hat / Virtualization
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-10 12:02               ` Andrea Bolognani
@ 2016-10-10 14:36                 ` Marcel Apfelbaum
  2016-10-11 15:37                   ` Andrea Bolognani
  0 siblings, 1 reply; 52+ messages in thread
From: Marcel Apfelbaum @ 2016-10-10 14:36 UTC (permalink / raw)
  To: Andrea Bolognani, Alex Williamson, Laine Stump
  Cc: qemu-devel, Laszlo Ersek, Daniel P. Berrange, Peter Maydell,
	Drew Jones, mst, Gerd Hoffmann

On 10/10/2016 03:02 PM, Andrea Bolognani wrote:
> On Tue, 2016-10-04 at 12:52 -0600, Alex Williamson wrote:
>>> I'ts all just idle number games, but what I was thinking of was the
>>> difference between plugging  a bunch of root-port+upstream+downstreamxN
>>> combos directly into pcie-root (flat), vs. plugging the first into
>>> pcie-root, and then subsequent ones into e.g. the last downstream port
>>> of the previous set. Take the simplest case of needing 63 hotpluggable
>>> slots. In the "flat" case, you have:
>>>
>>>     2 x pcie-root-port
>>>      2 x pcie-switch-upstream-port
>>>      63 x pcie-switch-downstream-port
>>>
>>> In the "nested" or "chained" case you have:
>>>
>>>      1 x pcie-root-port
>>>      1 x pcie-switch-upstream-port
>>>      32 x pcie-downstream-port
>>>      1 x pcie-switch-upstream-port
>>>      32 x pcie-switch-downstream-port
>>
>> You're not thinking in enough dimensions.  A single root port can host
>> multiple sub-hierarchies on it's own.  We can have a multi-function
>> upstream switch, so you can have 8 upstream ports (00.{0-7}).  If we
>> implemented ARI on the upstream ports, we could have 256 upstream ports
>> attached to a single root port, but of course then we've run out of
>> bus numbers before we've even gotten to actual devices buses.
>>
>> Another option, look at the downstream ports, why do they each need to
>> be in separate slots?  We have the address space of an entire bus to
>> work with, so we can also create multi-function downstream ports, which
>> gives us 256 downstream ports per upstream port.  Oops, we just ran out
>> of bus numbers again, but at least actual devices can be attached.
>
> What's the advantage in using ARI to stuff more than eight
> of anything that's not Endpoint Devices in a single slot?
>
> I mean, if we just fill up all 32 slots in a PCIe Root Bus
> with 8 PCIe Root Ports each we already end up having 256
> hotpluggable slots[1]. Why would it be preferable to use
> ARI, or even PCIe Switches, instead?
>

What if you need more devices (functions actually) ?

If some of the pcie.0 slots are occupied by other Integrated devices
and you need more than 256 functions you can:
(1) Add a PCIe Switch - if you need hot-plug support -an you are pretty limited
     by the bus numbers, but it will give you a few more slots.
(2) Use multi-function devices per root port if you are not interested in hotplug.
     In this case ARI will give you up to 256 devices per Root Port.

Now the question is why ARI? Better utilization of the "problematic"
resources like Bus numbers and IO space; all that if you need an insane
number of devices, but we don't judge :).

Thanks,
Marcel

>
> [1] The last slot will have to be limited to 7 PCIe Root
>     Ports if we don't want to run out of bus numbers

I don't follow how this will 'save' us. If all the root ports
are in use and you leave space for one more, what can you do with it?

> --
> Andrea Bolognani / Red Hat / Virtualization
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-10 14:15               ` Marcel Apfelbaum
@ 2016-10-11 13:30                 ` Andrea Bolognani
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Bolognani @ 2016-10-11 13:30 UTC (permalink / raw)
  To: Marcel Apfelbaum, Laine Stump, qemu-devel
  Cc: Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst,
	Alex Williamson, Gerd Hoffmann

On Mon, 2016-10-10 at 17:15 +0300, Marcel Apfelbaum wrote:
> > > > 2) can you really only plug a pcie-root-port (ioh3420)
> > > > into a pxb-pcie? Or will it accept anything that pcie.0
> > > > accepts?
> > > 
> > > It supports only PCI Express Root Ports. It does not
> > > support Integrated Devices.
> > 
> > So no PCI Express Switch Upstream Ports?
> 
> The switch upstream ports can only be plugged into PCIe Root Ports.
> There is an error in the RFC showing otherwise, it is already
> corrected in V1, not yet upstream.

I was pretty sure that was the case, but I wanted to
double-check just to be on the safe side ;)

> > What about DMI-to-PCI Bridges?
> 
> Yes, the dmi-to-pci bridge can be plugged into the pxb-pcie, I'll
> be sure to emphasize it.

Cool. I would have been very surprised if that would have
not been the case, considering how we need to use multiple
pxb-pcie to link PCI devices to specific NUMA nodes.

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
  2016-10-10 14:36                 ` Marcel Apfelbaum
@ 2016-10-11 15:37                   ` Andrea Bolognani
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Bolognani @ 2016-10-11 15:37 UTC (permalink / raw)
  To: Marcel Apfelbaum, Alex Williamson, Laine Stump
  Cc: qemu-devel, Laszlo Ersek, Daniel P. Berrange, Peter Maydell,
	Drew Jones, mst, Gerd Hoffmann

On Mon, 2016-10-10 at 17:36 +0300, Marcel Apfelbaum wrote:
> > What's the advantage in using ARI to stuff more than eight
> > of anything that's not Endpoint Devices in a single slot?
> > 
> > I mean, if we just fill up all 32 slots in a PCIe Root Bus
> > with 8 PCIe Root Ports each we already end up having 256
> > hotpluggable slots[1]. Why would it be preferable to use
> > ARI, or even PCIe Switches, instead?
> 
> What if you need more devices (functions actually) ?
> 
> If some of the pcie.0 slots are occupied by other Integrated devices
> and you need more than 256 functions you can:
> (1) Add a PCIe Switch - if you need hot-plug support -an you are pretty limited
>      by the bus numbers, but it will give you a few more slots.
> (2) Use multi-function devices per root port if you are not interested in hotplug.
>      In this case ARI will give you up to 256 devices per Root Port.
> 
> Now the question is why ARI? Better utilization of the "problematic"
> resources like Bus numbers and IO space; all that if you need an insane
> number of devices, but we don't judge :).

My point is that AIUI ARI is something you only care about
for endpoint devices that want to have more than 8 functions.

When it comes to controller, there's no advantage that I can
think of in having 1 slot with 256 functions as opposed to 32
slots with 8 functions each; if anything, I expect that at
least some guest OSs would be quite baffled in finding eg. a
network adapter, a SCSI controller and a GPU as separate
functions of a single PCI slot.

> > [1] The last slot will have to be limited to 7 PCIe Root
> >     Ports if we don't want to run out of bus numbers
> 
> I don't follow how this will 'save' us. If all the root ports
> are in use and you leave space for one more, what can you do with it?

Probably my math is off, but if we can only have 256 PCI
buses (0-255) and we plug a PCIe Root Port in each of the
8 functions (0-7) of the 32 ports (0-31) available on the
PCIe Root Bus, we end up with

  0:00.[0-7] -> [001-008]:0.[0-7]
  0:01.[0-7] -> [009-016]:0.[0-7]
  0:02.[0-7] -> [017-024]:0.[0-7]
  ...
  0.30.[0-7] -> [241-248]:0.[0-7]
  0.31.[0-7] -> [249-256]:0.[0-7]

but 256 is not a valid bus number, so we should skip that
last PCIe Root Port and stop at 255.

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2016-10-11 15:37 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-01 13:22 [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines Marcel Apfelbaum
2016-09-01 13:27 ` Peter Maydell
2016-09-01 13:51   ` Marcel Apfelbaum
2016-09-01 17:14     ` Laszlo Ersek
2016-09-05 16:24 ` Laszlo Ersek
2016-09-05 20:02   ` Marcel Apfelbaum
2016-09-06 13:31     ` Laszlo Ersek
2016-09-06 14:46       ` Marcel Apfelbaum
2016-09-07  6:21       ` Gerd Hoffmann
2016-09-07  8:06         ` Laszlo Ersek
2016-09-07  8:23           ` Marcel Apfelbaum
2016-09-07  8:06         ` Marcel Apfelbaum
2016-09-07 16:08           ` Alex Williamson
2016-09-07 19:32             ` Marcel Apfelbaum
2016-09-07 17:55           ` Laine Stump
2016-09-07 19:39             ` Marcel Apfelbaum
2016-09-07 20:34               ` Laine Stump
2016-09-15  8:38               ` Andrew Jones
2016-09-15 14:20                 ` Marcel Apfelbaum
2016-09-16 16:50                   ` Andrea Bolognani
2016-09-08  7:33             ` Gerd Hoffmann
2016-09-06 11:35   ` Gerd Hoffmann
2016-09-06 13:58     ` Laine Stump
2016-09-07  7:04       ` Gerd Hoffmann
2016-09-07 18:20         ` Laine Stump
2016-09-08  7:26           ` Gerd Hoffmann
2016-09-06 14:47     ` Marcel Apfelbaum
2016-09-07  7:53     ` Laszlo Ersek
2016-09-07  7:57       ` Marcel Apfelbaum
2016-10-04 14:59   ` Daniel P. Berrange
2016-10-04 15:40     ` Laszlo Ersek
2016-10-04 16:10       ` Laine Stump
2016-10-04 16:43         ` Laszlo Ersek
2016-10-04 18:08           ` Laine Stump
2016-10-04 18:52             ` Alex Williamson
2016-10-10 12:02               ` Andrea Bolognani
2016-10-10 14:36                 ` Marcel Apfelbaum
2016-10-11 15:37                   ` Andrea Bolognani
2016-10-04 18:56             ` Laszlo Ersek
2016-10-04 17:54         ` Laine Stump
2016-10-05  9:17           ` Marcel Apfelbaum
2016-10-10 11:09             ` Andrea Bolognani
2016-10-10 14:15               ` Marcel Apfelbaum
2016-10-11 13:30                 ` Andrea Bolognani
2016-10-04 15:45     ` Alex Williamson
2016-10-04 16:25       ` Laine Stump
2016-10-05 10:03         ` Marcel Apfelbaum
2016-09-06 15:38 ` Alex Williamson
2016-09-06 18:14   ` Marcel Apfelbaum
2016-09-06 18:32     ` Alex Williamson
2016-09-06 18:59       ` Marcel Apfelbaum
2016-09-07  7:44       ` Laszlo Ersek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.