* [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines @ 2016-09-01 13:22 Marcel Apfelbaum 2016-09-01 13:27 ` Peter Maydell ` (2 more replies) 0 siblings, 3 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-01 13:22 UTC (permalink / raw) To: qemu-devel; +Cc: mst, lersek Proposes best practices on how to use PCIe/PCI device in PCIe based machines and explain the reasoning behind them. Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> --- Hi, Please add your comments on what to add/remove/edit to make this doc usable. Thanks, Marcel docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 145 insertions(+) create mode 100644 docs/pcie.txt diff --git a/docs/pcie.txt b/docs/pcie.txt new file mode 100644 index 0000000..52a8830 --- /dev/null +++ b/docs/pcie.txt @@ -0,0 +1,145 @@ +PCI EXPRESS GUIDELINES +====================== + +1. Introduction +================ +The doc proposes best practices on how to use PCIe/PCI device +in PCIe based machines and explains the reasoning behind them. + + +2. Device placement strategy +============================ +QEMU does not have a clear socket-device matching mechanism +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot. +Plugging a PCI device into a PCIe device might not always work and +is weird anyway since it cannot be done for "bare metal". +Plugging a PCIe device into a PCI slot will hide the Extended +Configuration Space thus is also not recommended. + +The recommendation is to separate the PCIe and PCI hierarchies. +PCIe devices should be plugged only into PCIe Root Ports and +PCIe Downstream ports (let's call them PCIe ports). + +2.1 Root Bus (pcie.0) +===================== +Plug only legacy PCI devices as Root Complex Integrated Devices +even if the PCIe spec does not forbid PCIe devices. The existing +hardware uses mostly PCI devices as Integrated Endpoints. In this +way we may avoid some strange Guest OS-es behaviour. 
+Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports) +or DMI-PCI bridges to start legacy PCI hierarchies. + + + pcie.0 bus + -------------------------------------------------------------------------- + | | | | + ----------- ------------------ ------------------ ------------------ + | PCI Dev | | PCIe Root Port | | Upstream Port | | DMI-PCI bridge | + ----------- ------------------ ------------------ ------------------ + +2.2 PCIe only hierarchy +======================= +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports. + + + pcie.0 bus + ---------------------------------------------------- + | | | + ------------- ------------- ------------- + | Root Port | | Root Port | | Root Port | + ------------ -------------- ------------- + | | + ------------ ----------------- + | PCIe Dev | | Upstream Port | + ------------ ----------------- + | | + ------------------- ------------------- + | Downstream Port | | Downstream Port | + ------------------- ------------------- + | + ------------ + | PCIe Dev | + ------------ + +2.3 PCI only hierarchy +====================== +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged +only into pcie.0 bus. + + pcie.0 bus + ---------------------------------------------- + | | + ----------- ------------------ + | PCI Dev | | DMI-PCI BRIDGE | + ---------- ------------------ + | | + ----------- ------------------ + | PCI Dev | | PCI-PCI Bridge | + ----------- ------------------ + | | + ----------- ----------- + | PCI Dev | | PCI Dev | + ----------- ----------- + + + +3. 
IO space issues +=================== +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and +as required by PCI spec will reserve a 4K IO range for each. +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize +it by allocation the IO space only if there is at least a device +with IO BARs plugged into the bridge. +Behind a PCIe PORT only one device may be plugged, resulting in +the allocation of a whole 4K range for each device. +The IO space is limited resulting in ~10 PCIe ports per system +if devices with IO BARs are plugged into IO ports. + +Using the proposed device placing strategy solves this issue +by using only PCIe devices with PCIe PORTS. The PCIe spec requires +PCIe devices to work without IO BARs. +The PCI hierarchy has no such limitations. + + +4. Hot Plug +============ +The root bus pcie.0 does not support hot-plug, so Integrated Devices, +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged. + +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug +in QEMU preventing it to work, but it would be solved soon). +The PCI hotplug is ACPI based and can work side by side with the PCIe +native hotplug. + +PCIe devices can be natively hot-plugged/hot-unplugged into/from +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable. +Keep in mind you always need to have at least one PCIe Port available +for hotplug, the PCIe Ports themselves are not hot-pluggable. + + +5. Device assignment +==================== +Host devices are mostly PCIe and should be plugged only into PCIe ports. +PCI-PCI bridge slots can be used for legacy PCI host devices. + + +6. Virtio devices +================= +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices +will remain PCI and have transitional behaviour as default. +Virtio devices plugged into PCIe ports are Express devices and have +"1.0" behavior by default without IO support. +In both case disable-* properties can be used to override the behaviour. 
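The "~10 PCIe ports per system" estimate in section 3 above follows from simple arithmetic; a quick sketch (the fragmentation allowance is an assumed round figure, not a QEMU constant):

```python
# Back-of-the-envelope IO space budget for PCI Express ports.
# A PCI Express Port is seen as a PCI bridge, and the PCI spec makes
# each bridge reserve a 4 KiB IO window out of 64 KiB total IO space.
IO_SPACE = 64 * 1024   # byte-wide IO ports on x86
IO_WINDOW = 4 * 1024   # per-bridge IO reservation

theoretical_ports = IO_SPACE // IO_WINDOW  # 16 windows, best case

# Chipset devices, integrated endpoints and alignment eat into this,
# leaving roughly 10 usable 4 KiB chunks -- hence "~10 PCIe ports".
usable_ports = 10

print(theoretical_ports, usable_ports)  # 16 10
```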
+ + +7. Conclusion +============== +The proposal offers a usage model that is easy to understand and follow +and in the same time overcomes some PCIe limitations. + + + -- 2.5.5 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-01 13:22 [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines Marcel Apfelbaum @ 2016-09-01 13:27 ` Peter Maydell 2016-09-01 13:51 ` Marcel Apfelbaum 2016-09-05 16:24 ` Laszlo Ersek 2016-09-06 15:38 ` Alex Williamson 2 siblings, 1 reply; 52+ messages in thread From: Peter Maydell @ 2016-09-01 13:27 UTC (permalink / raw) To: Marcel Apfelbaum; +Cc: QEMU Developers, Laszlo Ersek, Michael S. Tsirkin On 1 September 2016 at 14:22, Marcel Apfelbaum <marcel@redhat.com> wrote: > Proposes best practices on how to use PCIe/PCI device > in PCIe based machines and explain the reasoning behind them. > > Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> > --- > > Hi, > > Please add your comments on what to add/remove/edit to make this doc usable. As somebody who doesn't really understand the problem space, my thoughts: (1) is this intended as advice for developers writing machine models and adding pci controllers to them, or is it intended as advice for users (and libvirt-style management layers) about how to configure QEMU? (2) it seems to be a bit short on concrete advice (either "you should do this" instructions to machine model developers, or "use command lines like this" instructions to end-users. thanks -- PMM ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-01 13:27 ` Peter Maydell @ 2016-09-01 13:51 ` Marcel Apfelbaum 2016-09-01 17:14 ` Laszlo Ersek 0 siblings, 1 reply; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-01 13:51 UTC (permalink / raw) To: Peter Maydell; +Cc: QEMU Developers, Laszlo Ersek, Michael S. Tsirkin On 09/01/2016 04:27 PM, Peter Maydell wrote: > On 1 September 2016 at 14:22, Marcel Apfelbaum <marcel@redhat.com> wrote: >> Proposes best practices on how to use PCIe/PCI device >> in PCIe based machines and explain the reasoning behind them. >> >> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> >> --- >> >> Hi, >> >> Please add your comments on what to add/remove/edit to make this doc usable. > Hi Peter, > As somebody who doesn't really understand the problem space, my > thoughts: > > (1) is this intended as advice for developers writing machine > models and adding pci controllers to them, or is it intended as > advice for users (and libvirt-style management layers) about > how to configure QEMU? > It is intended for management layers, as they have no way to understand how to "consume" the Q35 machine, but also for firmware developers (OVMF/SeaBIOS) to help them understand the usage model so they can optimize IO/MEM resources allocation for both boot time and hot-plug. QEMU users/developers can also benefit from it, as the PCIe arch is more complex, supporting both PCI/PCIe devices and several PCI/PCIe controllers with no clear rules on what goes where. > (2) it seems to be a bit short on concrete advice (either > "you should do this" instructions to machine model developers, > or "use command lines like this" instructions to end-users. > Thanks for the point. I'll be sure to add detailed command line examples to the next version. Thanks, Marcel > thanks > -- PMM >

* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-01 13:51 ` Marcel Apfelbaum @ 2016-09-01 17:14 ` Laszlo Ersek 0 siblings, 0 replies; 52+ messages in thread From: Laszlo Ersek @ 2016-09-01 17:14 UTC (permalink / raw) To: Marcel Apfelbaum, Peter Maydell; +Cc: QEMU Developers, Michael S. Tsirkin On 09/01/16 15:51, Marcel Apfelbaum wrote: > On 09/01/2016 04:27 PM, Peter Maydell wrote: >> On 1 September 2016 at 14:22, Marcel Apfelbaum <marcel@redhat.com> wrote: >>> Proposes best practices on how to use PCIe/PCI device >>> in PCIe based machines and explain the reasoning behind them. >>> >>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> >>> --- >>> >>> Hi, >>> >>> Please add your comments on what to add/remove/edit to make this doc >>> usable. >> > > Hi Peter, > >> As somebody who doesn't really understand the problem space, my >> thoughts: >> >> (1) is this intended as advice for developers writing machine >> models and adding pci controllers to them, or is it intended as >> advice for users (and libvirt-style management layers) about >> how to configure QEMU? >> > > Is it intended for management layers as they have no way to > understand how to "consume" the Q35 machine, > but also for firmware developers (OVMF/SeaBIOS) to help them > understand the usage model so they can optimize IO/MEM > resources allocation for both boot time and hot-plug. > > QEMU users/developers can also benefit from it as the PCIe arch > is more complex supporting both PCI/PCIe devices and > several PCI/PCIe controllers with no clear rules on what goes where. > >> (2) it seems to be a bit short on concrete advice (either >> "you should do this" instructions to machine model developers, >> or "use command lines like this" instructions to end-users. >> > > Thanks for the point. I'll be sure to add detailed command line examples > to the next version. I think that would be a huge benefit! (I'll try to read the document later, and come back with remarks.) 
Thanks! Laszlo
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-01 13:22 [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines Marcel Apfelbaum 2016-09-01 13:27 ` Peter Maydell @ 2016-09-05 16:24 ` Laszlo Ersek 2016-09-05 20:02 ` Marcel Apfelbaum ` (2 more replies) 2016-09-06 15:38 ` Alex Williamson 2 siblings, 3 replies; 52+ messages in thread From: Laszlo Ersek @ 2016-09-05 16:24 UTC (permalink / raw) To: Marcel Apfelbaum, qemu-devel Cc: mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 09/01/16 15:22, Marcel Apfelbaum wrote: > Proposes best practices on how to use PCIe/PCI device > in PCIe based machines and explain the reasoning behind them. > > Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> > --- > > Hi, > > Please add your comments on what to add/remove/edit to make this doc usable. I'll give you a brain dump below -- most of it might easily be incorrect, but I'll just speak my mind :) > > Thanks, > Marcel > > docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 145 insertions(+) > create mode 100644 docs/pcie.txt > > diff --git a/docs/pcie.txt b/docs/pcie.txt > new file mode 100644 > index 0000000..52a8830 > --- /dev/null > +++ b/docs/pcie.txt > @@ -0,0 +1,145 @@ > +PCI EXPRESS GUIDELINES > +====================== > + > +1. Introduction > +================ > +The doc proposes best practices on how to use PCIe/PCI device > +in PCIe based machines and explains the reasoning behind them. General request: please replace all occurrences of "PCIe" with "PCI Express" in the text (not command lines, of course). The reason is that the "e" letter is a minimal difference, and I've misread PCIe as PC several times, while interpreting this document. Obviously the resultant confusion is terrible, as you are explaining the difference between PCI and PCI Express in the entire document :) > + > + > +2. 
Device placement strategy > +============================ > +QEMU does not have a clear socket-device matching mechanism > +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot. > +Plugging a PCI device into a PCIe device might not always work and s/PCIe device/PCI Express slot/ > +is weird anyway since it cannot be done for "bare metal". > +Plugging a PCIe device into a PCI slot will hide the Extended > +Configuration Space thus is also not recommended. > + > +The recommendation is to separate the PCIe and PCI hierarchies. > +PCIe devices should be plugged only into PCIe Root Ports and > +PCIe Downstream ports (let's call them PCIe ports). Please do not use the shorthand; we should always spell out downstream ports and root ports. Assume people reading this document are dumber than I am wrt. PCI / PCI Express -- I'm already pretty dumb, and I appreciate the detail! :) If they are smart, they won't mind the detail; if they lack expertise, they'll appreciate the detail, won't they. :) > + > +2.1 Root Bus (pcie.0) Can we call this Root Complex instead? > +===================== > +Plug only legacy PCI devices as Root Complex Integrated Devices > +even if the PCIe spec does not forbid PCIe devices. I suggest "even though the PCI Express spec does not forbid PCI Express devices as Integrated Devices". (Detail is good!) Also, as Peter suggested, this (but not just this) would be a good place to provide command line fragments. > The existing > +hardware uses mostly PCI devices as Integrated Endpoints. In this > +way we may avoid some strange Guest OS-es behaviour. > +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports) > +or DMI-PCI bridges to start legacy PCI hierarchies. 
Hmmmm, I had to re-read this paragraph (while looking at the diagram) five times until I mostly understood it :) What about the following wording: -------- Place only the following kinds of devices directly on the Root Complex: (1) For devices with dedicated, specific functionality (network card, graphics card, IDE controller, etc), place only legacy PCI devices on the Root Complex. These will be considered Integrated Endpoints. Although the PCI Express spec does not forbid PCI Express devices as Integrated Endpoints, existing hardware mostly integrates legacy PCI devices with the Root Complex. Guest OSes are suspected to behave strangely when PCI Express devices are integrated with the Root Complex. (2) PCI Express Root Ports, for starting exclusively PCI Express hierarchies. (3) PCI Express Switches (connected with their Upstream Ports to the Root Complex), also for starting exclusively PCI Express hierarchies. (4) For starting legacy PCI hierarchies: DMI-PCI bridges. > + > + > + pcie.0 bus "bus" is correct in QEMU lingo, but I'd still call it complex here. > + -------------------------------------------------------------------------- > + | | | | > + ----------- ------------------ ------------------ ------------------ > + | PCI Dev | | PCIe Root Port | | Upstream Port | | DMI-PCI bridge | > + ----------- ------------------ ------------------ ------------------ > + Please insert a separate (brief) section here about pxb-pcie devices -- just mention that they are documented in a separate spec txt in more detail, and that they create new root complexes in practice. In fact, maybe option (5) would be better for pxb-pcie devices, under section 2.1, than a dedicated section! > +2.2 PCIe only hierarchy > +======================= > +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream > +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches > +can be nested until a depth of 6-7. 
Plug only PCIe devices into PCIe Ports. - Please name the maximum number of the root ports that's allowed on the root complex (cmdline example?) - Also, this is the first time you mention "slot". While the PCI Express spec allows for root ports / downstream ports not implementing a slot (IIRC), I think we shouldn't muddy the waters here, and restrict the word "slot" to the command line examples only. - What you say here about switches (upstream ports) matches what I've learned from you thus far :), but it doesn't match bullet (3) in section 2.1. That is, if we suggest to *always* add a Root Port between the Root Complex and the Upstream Port of a switch, then (3) should not be present in section 2.1. (Do we suggest that BTW?) We're not giving a technical description here (the PCI Express spec is good enough for that), we're dictating policy. We shouldn't be shy about minimizing the accepted use cases. Our main guidance here should be the amount of bus numbers used up by the hierarchy. Parts of the document might later apply to qemu-system-aarch64 -M virt, and that machine is severely starved in the bus numbers department (it has MMCONFIG space for 16 buses only!) So how about this: * the basic idea is good I think: always go for root ports, unless the root complex is fully populated * if you run out of root ports, use a switch with downstream ports, but plug the upstream port directly in the root complex (make it an integrated device). This would save us a bus number, and match option (3) in section 2.1, but it doesn't match the diagram below, where a root port is between the root complex and the upstream port. (Of course, if a root port is *required* there, then 2.1 (3) is wrong, and should be removed.) * the "population algorithm" should be laid out in a bit more detail. You mention a possible depth of 6-7, but I think it would be best to keep the hierarchy as flat as possible (let's not waste bus numbers on upstream ports, and time on deep enumeration!). 
In other words, only plug upstream ports in the root complex (and without intervening root ports, if that's allowed). For example: - 1-32 ports needed: use root ports only - 33-64 ports needed: use 31 root ports, and one switch with 2-32 downstream ports - 65-94 ports needed: use 30 root ports, one switch with 32 downstream ports, another switch with 3-32 downstream ports - 95-125 ports needed: use 29 root ports, two switches with 32 downstream ports each, and a third switch with 2-32 downstream ports - 126-156 ports needed: use 28 root ports, three switches with 32 downstream ports each, and a fourth switch with 2-32 downstream ports - 157-187 ports needed: use 27 root ports, four switches with 32 downstream ports each, and a fifth switch with 2-32 downstream ports - 188-218 ports: 26 root ports, 5 fully populated switches, sixth switch with 2-32 downstream ports, - 219-249 ports: 25 root ports, 6 fully pop. switches, seventh switch with 2-32 downstream ports (And I think this is where it ends, because the 7 upstream ports total in the switches take up 7 bus numbers, so we'd need 249 + 7 = 256 bus numbers, not counting the root complex, so 249 ports isn't even attainable.) You might argue that this is way too detailed, but with the "problem space" offering so much freedom (consider libvirt too...), I think it would be helpful. 
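The bracketed figures above can be reproduced with a small helper; a sketch under the scheme's stated assumptions (32 slots on the root complex, up to 32 downstream ports per switch, every upstream port plugged directly into the root complex; the function name is made up for illustration):

```python
def plan_ports(needed):
    """Return (root_ports, switches) for a flat PCI Express hierarchy.

    Assumes 32 slots on the root complex; each switch occupies one
    slot with its upstream port and offers up to 32 downstream ports.
    """
    ROOT_SLOTS = 32
    PORTS_PER_SWITCH = 32
    if needed <= ROOT_SLOTS:
        return needed, 0
    # Every switch trades one root-complex slot (its upstream port)
    # for up to 32 downstream ports: a net gain of 31 ports.
    extra = needed - ROOT_SLOTS
    switches = -(-extra // (PORTS_PER_SWITCH - 1))  # ceiling division
    return ROOT_SLOTS - switches, switches

print(plan_ports(20))   # (20, 0)  -- root ports only
print(plan_ports(40))   # (31, 1)  -- one switch needed
print(plan_ports(100))  # (29, 3)  -- matches the 95-125 bracket
```

The helper also makes the ceiling visible: each additional switch trades one root-complex slot for at most 32 downstream ports, so capacity grows by only 31 ports per switch.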
This would also help trim the "explorations" of downstream QE departments :) > + > + > + pcie.0 bus > + ---------------------------------------------------- > + | | | > + ------------- ------------- ------------- > + | Root Port | | Root Port | | Root Port | > + ------------ -------------- ------------- > + | | > + ------------ ----------------- > + | PCIe Dev | | Upstream Port | > + ------------ ----------------- > + | | > + ------------------- ------------------- > + | Downstream Port | | Downstream Port | > + ------------------- ------------------- > + | > + ------------ > + | PCIe Dev | > + ------------ > + So the upper right root port should be removed, probably. Also, I recommend to draw a "container" around the upstream port plus the two downstream ports, and tack a "switch" label to it. > +2.3 PCI only hierarchy > +====================== > +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or > +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges > +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged > +only into pcie.0 bus. > + > + pcie.0 bus > + ---------------------------------------------- > + | | > + ----------- ------------------ > + | PCI Dev | | DMI-PCI BRIDGE | > + ---------- ------------------ > + | | > + ----------- ------------------ > + | PCI Dev | | PCI-PCI Bridge | > + ----------- ------------------ > + | | > + ----------- ----------- > + | PCI Dev | | PCI Dev | > + ----------- ----------- Works for me, but I would again elaborate a little bit on keeping the hierarchy flat. First, in order to preserve compatibility with libvirt's current behavior, let's not plug a PCI device directly in to the DMI-PCI bridge, even if that's possible otherwise. 
Let's just say - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy is required), - only PCI-PCI bridges should be plugged into the DMI-PCI bridge, - let's recommend that each PCI-PCI bridge be populated until it becomes full, at which point another PCI-PCI bridge should be plugged into the same one DMI-PCI bridge. Theoretically, with 32 legacy PCI devices per PCI-PCI bridge, and 32 PCI-PCI bridges stuffed into the one DMI-PCI bridge, we could have ~1024 legacy PCI devices (not counting the integrated ones on the root complex(es)). There's also multi-function, so I can't see anyone needing more than this. For practical reasons though (see later), we should state here that we recommend no more than 9 (nine) PCI-PCI bridges in total, all located directly under the 1 (one) DMI-PCI bridge that is integrated into the pcie.0 root complex. Nine PCI-PCI bridges should allow for 288 legacy PCI devices. (And then there's multifunction.) > + > + > + > +3. IO space issues > +=================== > +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and (please spell out downstream + root port) > +as required by PCI spec will reserve a 4K IO range for each. > +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize > +it by allocation the IO space only if there is at least a device > +with IO BARs plugged into the bridge. This used to be true, but is no longer true, for OVMF. And I think it's actually correct: we *should* keep the 4K IO reservation per PCI-PCI bridge. (But, certainly no IO reservation for PCI Express root port, upstream port, or downstream port! And i'll need your help for telling these apart in OVMF.) Let me elaborate more under section "4. Hot Plug". For now let me just say that I'd like this language about optimization to be dropped. > +Behind a PCIe PORT only one device may be plugged, resulting in (again, please spell out downstream and root port) > +the allocation of a whole 4K range for each device. 
> +The IO space is limited resulting in ~10 PCIe ports per system (limited to 65536 byte-wide IO ports, but it's fragmented, so we have about 10 * 4K free) > +if devices with IO BARs are plugged into IO ports. not "into IO ports" but "into PCI Express downstream and root ports". > + > +Using the proposed device placing strategy solves this issue > +by using only PCIe devices with PCIe PORTS. The PCIe spec requires (please spell out root / downstream etc) > +PCIe devices to work without IO BARs. > +The PCI hierarchy has no such limitations. I'm sorry to have fragmented this section with so many comments, but the idea is actually splendid, in my opinion! ... Okay, still speaking resources, could you insert a brief section here about bus numbers? Under "3. IO space issues", you already explain how "practically everything" qualifies as a PCI bridge. We should mention that all those things, such as: - root complex pcie.0, - root complex added by pxb-pcie, - root ports, - upstream ports, downstream ports, - bridges, etc take up bus numbers, and we have 256 bus numbers in total. In the next section you state that PCI hotplug (ACPI based) and PCI Express hotplug (native) can work side by side, which is correct, and the IO space competition is eliminated by the scheme proposed in section 3 -- and the MMIO space competition is "obvious" --, but the bus number starvation is *very much* non-obvious. It should be spelled out. I think it deserves a separate section. (Again, with an eye toward qemu-system-aarch64 -M virt -- we've seen PCI Express failures there, and they were due to bus number starvation. It wasn't fun to debug. (Well, it was, but don't tell anyone :))) > + > + > +4. Hot Plug > +============ > +The root bus pcie.0 does not support hot-plug, so Integrated Devices, s/root bus/root complex/? Also, any root complexes added with pxb-pcie don't support hotplug. > +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged. I would say: ... 
so anything that plugs *only* into a root complex, cannot be hotplugged. Then the list is what you mention here (also referring back to options (1), (2) and (4) in section 2.1), *plus* I would also add option (5): pxb-pcie can also not be hotplugged. > + > +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug > +in QEMU preventing it to work, but it would be solved soon). What bug? Anyway, I'm unsure we should add this remark here -- it's a guide, not a status report. I'm worried that whenever we fix that bug, we forget to remove this remark. > +The PCI hotplug is ACPI based and can work side by side with the PCIe > +native hotplug. > + > +PCIe devices can be natively hot-plugged/hot-unplugged into/from > +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable. I would mention the order (upstream port, downstream port), also add some command lines maybe. > +Keep in mind you always need to have at least one PCIe Port available > +for hotplug, the PCIe Ports themselves are not hot-pluggable. Well, the downstream ports of a switch that is being added *are*, aren't they? But, this question is actually irrelevant IMO, because here I would add another subsection about *planning* for hot-plug. (I think that's pretty important.) And those plans should make the hotplugging of switches unnecessary! * For the PCI Express hierarchy, I recommended a flat structure above. The 256 bus numbers can easily be exhausted / covered by 25 root ports plus 7 switches (each switch being fully populated with downstream ports). This should allow all sysadmins to estimate their expected numbers of hotplug PCI Express devices, in advance, and create enough root ports / downstream ports. (Sysadmins are already used to planning for hotplug, see VCPUs, memory (DIMM), memory (balloon).) * For the PCI hierarchy, it should be even simpler, but worth a mention -- start with enough PCI-PCI bridges under the one DMI-PCI bridge. 
* Finally, this is the spot where we should design and explain our resource reservation for hotplug: - For PCI Express hotplug, please explain that only such PCI Express devices can be hotplugged that require no IO space -- in section "3. IO space issues" you mention that this is a valid restriction. Furthermore, please state the MMIO32 and/or MMIO64 sizes that the firmware needs to reserve for root ports / downstream ports, and also explain that these sizes will act as maximum size and alignment limits for *individual* hotplug devices. We can invent fw_cfg switches for this, or maybe even a special PCI Express capability (to be placed in the config space of root and downstream ports). - For legacy PCI hotplug, this is where my evil plan of "no more than 9 (nine) PCI-PCI bridges under the 1 (one) DMI-PCI bridge" unfolds! We've stated above (in Section 3) that we have about 10*4KB IO port space. One of those 4K chunks will go (collectively) to the Integrated PCI devices that sit on pcie.0. (If there are other root complexes from pxb-pcie devices, then those will get one chunk each too.) The rest -- hence assume 9 or fewer chunks -- will be consumed by the 9 (or respectively fewer) PCI-PCI bridges, for hotplug reservation. The upshot is that as long as a sysadmin sticks with our flat, "9 PCI-PCI bridges total" recommendation for the legacy PCI hierarchy, the IO reservation will be covered *immediately*. Simply put: don't create more PCI-PCI bridges than you have IO space for -- and that should leave you with about 9 "sibling" bridges, which are plenty enough for a huge number of legacy PCI devices! Furthermore, please state the MMIO32 / MMIO64 amount to reserve *per PCI-PCI bridge*. The firmware programmers need to know this, and people planning for legacy PCI hotplug should be informed that those limits are for all devices *together* on the same PCI-PCI bridge. 
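The "nine bridges" budget above works out as follows (a sketch; the ten-chunk figure is the rough IO-space estimate given in section 3):

```python
# IO chunks available vs. PCI-PCI bridges that can be IO-provisioned.
IO_CHUNKS = 10          # ~10 free 4 KiB IO chunks per system (estimate)
INTEGRATED_CHUNKS = 1   # one chunk for integrated devices on pcie.0
SLOTS_PER_BRIDGE = 32   # legacy PCI slots behind each PCI-PCI bridge

bridges = IO_CHUNKS - INTEGRATED_CHUNKS  # 9 PCI-PCI bridges
devices = bridges * SLOTS_PER_BRIDGE     # 288 legacy PCI devices
print(bridges, devices)  # 9 288
```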
Again, we could expose this via fw_cfg, or in a special capability (as suggested by Gerd IIRC) in the PCI config space of the PCI-PCI bridge. > + > + > +5. Device assignment > +==================== > +Host devices are mostly PCIe and should be plugged only into PCIe ports. > +PCI-PCI bridge slots can be used for legacy PCI host devices. Please provide a command line (lspci) so that users can easily determine if the device they wish to assign is legacy PCI or PCI Express. > + > + > +6. Virtio devices > +================= > +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices (drop "an") > +will remain PCI and have transitional behaviour as default. (Please add one sentence about what "transitional" means in this context -- they'll have both IO and MMIO BARs.) > +Virtio devices plugged into PCIe ports are Express devices and have > +"1.0" behavior by default without IO support. > +In both case disable-* properties can be used to override the behaviour. Please emphasize that setting disable-legacy=off (that is, enabling legacy behavior) for PCI Express virtio devices will cause them to require IO space, which, given our PCI Express hierarchy, may quickly lead to resource exhaustion, and is therefore strongly discouraged. > + > + > +7. Conclusion > +============== > +The proposal offers a usage model that is easy to understand and follow > +and in the same time overcomes some PCIe limitations. I agree! Thanks! Laszlo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-05 16:24 ` Laszlo Ersek @ 2016-09-05 20:02 ` Marcel Apfelbaum 2016-09-06 13:31 ` Laszlo Ersek 2016-09-06 11:35 ` Gerd Hoffmann 2016-10-04 14:59 ` Daniel P. Berrange 2 siblings, 1 reply; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-05 20:02 UTC (permalink / raw) To: Laszlo Ersek, qemu-devel Cc: mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 09/05/2016 07:24 PM, Laszlo Ersek wrote: > On 09/01/16 15:22, Marcel Apfelbaum wrote: >> Proposes best practices on how to use PCIe/PCI device >> in PCIe based machines and explain the reasoning behind them. >> >> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> >> --- >> >> Hi, >> >> Please add your comments on what to add/remove/edit to make this doc usable. > Hi Laszlo, > I'll give you a brain dump below -- most of it might easily be > incorrect, but I'll just speak my mind :) > Thanks for taking the time to go over it, I'll do my best to respond to all the questions. >> >> Thanks, >> Marcel >> >> docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 145 insertions(+) >> create mode 100644 docs/pcie.txt >> >> diff --git a/docs/pcie.txt b/docs/pcie.txt >> new file mode 100644 >> index 0000000..52a8830 >> --- /dev/null >> +++ b/docs/pcie.txt >> @@ -0,0 +1,145 @@ >> +PCI EXPRESS GUIDELINES >> +====================== >> + >> +1. Introduction >> +================ >> +The doc proposes best practices on how to use PCIe/PCI device >> +in PCIe based machines and explains the reasoning behind them. > > General request: please replace all occurrences of "PCIe" with "PCI > Express" in the text (not command lines, of course). The reason is that > the "e" letter is a minimal difference, and I've misread PCIe as PC > several times, while interpreting this document. 
Obviously the resultant > confusion is terrible, as you are explaining the difference between PCI > and PCI Express in the entire document :) > Sure >> + >> + >> +2. Device placement strategy >> +============================ >> +QEMU does not have a clear socket-device matching mechanism >> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot. >> +Plugging a PCI device into a PCIe device might not always work and > > s/PCIe device/PCI Express slot/ > Thanks! >> +is weird anyway since it cannot be done for "bare metal". >> +Plugging a PCIe device into a PCI slot will hide the Extended >> +Configuration Space thus is also not recommended. >> + >> +The recommendation is to separate the PCIe and PCI hierarchies. >> +PCIe devices should be plugged only into PCIe Root Ports and >> +PCIe Downstream ports (let's call them PCIe ports). > > Please do not use the shorthand; we should always spell out downstream > ports and root ports. Assume people reading this document are dumber > than I am wrt. PCI / PCI Express -- I'm already pretty dumb, and I > appreciate the detail! :) If they are smart, they won't mind the detail; > if they lack expertise, they'll appreciate the detail, won't they. :) > Sure >> + >> +2.1 Root Bus (pcie.0) > > Can we call this Root Complex instead? > Sorry, but we can't. The Root Complex is a type of Host-Bridge (and can actually "have" multiple Host-Bridges), not a bus. It stands between the CPU/Memory controller/APIC and the PCI/PCI Express fabric. (as you can see, I am not using PCIe even for the comments :)) The Root Complex *includes* an internal bus (pcie.0) but also can include some Integrated Devices, its own Configuration Space Registers (e.g Root Complex Register Block), ... One of the main functions of the Root Complex is to generate PCI Express Transactions on behalf of the CPU(s) and to "translate" the corresponding PCI Express Transactions into DMA accesses. I can change it to "PCI Express Root Bus", it will help? 
>> +===================== >> +Plug only legacy PCI devices as Root Complex Integrated Devices >> +even if the PCIe spec does not forbid PCIe devices. > > I suggest "even though the PCI Express spec does not forbid PCI Express > devices as Integrated Devices". (Detail is good!) > Thanks > Also, as Peter suggested, this (but not just this) would be a good place > to provide command line fragments. > I've already added some examples, I'll appreciate if you can have a look on v2 that I will post really soon. >> The existing >> +hardware uses mostly PCI devices as Integrated Endpoints. In this >> +way we may avoid some strange Guest OS-es behaviour. >> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports) >> +or DMI-PCI bridges to start legacy PCI hierarchies. > > Hmmmm, I had to re-read this paragraph (while looking at the diagram) > five times until I mostly understood it :) What about the following wording: > > -------- > Place only the following kinds of devices directly on the Root Complex: > > (1) For devices with dedicated, specific functionality (network card, > graphics card, IDE controller, etc), place only legacy PCI devices on > the Root Complex. These will be considered Integrated Endpoints. > Although the PCI Express spec does not forbid PCI Express devices as > Integrated Endpoints, existing hardware mostly integrates legacy PCI > devices with the Root Complex. Guest OSes are suspected to behave > strangely when PCI Express devices are integrated with the Root Complex. > > (2) PCI Express Root Ports, for starting exclusively PCI Express > hierarchies. > > (3) PCI Express Switches (connected with their Upstream Ports to the > Root Complex), also for starting exclusively PCI Express hierarchies. > > (4) For starting legacy PCI hierarchies: DMI-PCI bridges. > Thanks for the re-wording! 
Actually I had a bug, even the Switches should be connected to Root Ports, not directly to the PCI Express Root Bus (pcie.0) , I'll delete (3) to make it clear. >> + >> + >> + pcie.0 bus > > "bus" is correct in QEMU lingo, but I'd still call it complex here. > explained above >> + -------------------------------------------------------------------------- >> + | | | | >> + ----------- ------------------ ------------------ ------------------ >> + | PCI Dev | | PCIe Root Port | | Upstream Port | | DMI-PCI bridge | >> + ----------- ------------------ ------------------ ------------------ >> + > > Please insert a separate (brief) section here about pxb-pcie devices -- > just mention that they are documented in a separate spec txt in more > detail, and that they create new root complexes in practice. > > In fact, maybe option (5) would be better for pxb-pcie devices, under > section 2.1, than a dedicated section! > Good idea, I'll add the pxb-pcie device. >> +2.2 PCIe only hierarchy >> +======================= >> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream >> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches >> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports. > > - Please name the maximum number of the root ports that's allowed on the > root complex (cmdline example?) > I'll try: The PCI Express Root Bus (pcie.0) is an internal bus that similar to a PCI bus supports up to 32 Integrated Devices/PCI Express Root Ports. > - Also, this is the first time you mention "slot". While the PCI Express > spec allows for root ports / downstream ports not implementing a slot > (IIRC), I think we shouldn't muddy the waters here, and restrict the > word "slot" to the command line examples only. > OK > - What you say here about switches (upstream ports) matches what I've > learned from you thus far :), but it doesn't match bullet (3) in section > 2.1. 
That is, if we suggest to *always* add a Root Port between the Root > Complex and the Upstream Port of a switch, then (3) should not be > present in section 2.1. (Do we suggest that BTW?) > > We're not giving a technical description here (the PCI Express spec is > good enough for that), we're dictating policy. We shouldn't be shy about > minimizing the accepted use cases. > > Our main guidance here should be the amount of bus numbers used up by > the hierarchy. Parts of the document might later apply to > qemu-system-aarch64 -M virt, and that machine is severely starved in the > bus numbers department (it has MMCONFIG space for 16 buses only!) > > So how about this: > > * the basic idea is good I think: always go for root ports, unless the > root complex is fully populated > > * if you run out of root ports, use a switch with downstream ports, but > plug the upstream port directly in the root complex (make it an > integrated device). This would save us a bus number, and match option > (3) in section 2.1, but it doesn't match the diagram below, where a root > port is between the root complex and the upstream port. (Of course, if a > root port is *required* there, then 2.1 (3) is wrong, and should be > removed.) > A Root Port is required, thanks for spotting the bug. > * the "population algorithm" should be laid out in a bit more detail. > You mention a possible depth of 6-7, but I think it would be best to > keep the hierarchy as flat as possible (let's not waste bus numbers on > upstream ports, and time on deep enumeration!). In other words, only > plug upstream ports in the root complex (and without intervening root > ports, if that's allowed). 
> For example:
>
> - 1-32 ports needed: use root ports only
>
> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
> downstream ports
>
> - 65-94 ports needed: use 30 root ports, one switch with 32 downstream
> ports, another switch with 3-32 downstream ports
>
> - 95-125 ports needed: use 29 root ports, two switches with 32
> downstream ports each, and a third switch with 2-32 downstream ports
>
> - 126-156 ports needed: use 28 root ports, three switches with 32
> downstream ports each, and a fourth switch with 2-32 downstream ports
>
> - 157-187 ports needed: use 27 root ports, four switches with 32
> downstream ports each, and a fifth switch with 2-32 downstream ports
>
> - 188-218 ports: 26 root ports, 5 fully populated switches, sixth switch
> with 2-32 downstream ports,
>
> - 219-249 ports: 25 root ports, 6 fully pop. switches, seventh switch
> with 2-32 downstream ports
>

I can add it as a "best practice".

> (And I think this is where it ends, because the 7 upstream ports total
> in the switches take up 7 bus numbers, so we'd need 249 + 7 = 256 bus
> numbers, not counting the root complex, so 249 ports isn't even
> attainable.)
>

Theoretically we can implement multiple PCI domains, each domain can have
256 PCI buses, but we don't have that yet. An implementation should start
with the pxb-pcie using separate PCI domains instead of "stealing" bus
ranges for the Root Complex. But this is for another thread.

> You might argue that this is way too detailed, but with the "problem
> space" offering so much freedom (consider libvirt too...), I think it
> would be helpful. This would also help trim the "explorations" of
> downstream QE departments :)
>

Well, we can accentuate that while nesting is supported, "deep nesting"
is not recommended and even not strictly necessary.
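For v2, the flat layout could perhaps be illustrated with a fragment along
these lines (a sketch only; it uses the ioh3420 root port and xio3130
switch devices QEMU already ships, and all the ids and slot/chassis
numbers are invented for the example):

```shell
# Sketch: two PCI Express Root Ports on pcie.0; the second one feeds a
# switch (one upstream port, two downstream ports) for extra slots.
# All ids and slot/chassis numbers below are arbitrary examples.
qemu-system-x86_64 -M q35 [...] \
  -device ioh3420,id=root_port1,bus=pcie.0,slot=1 \
  -device ioh3420,id=root_port2,bus=pcie.0,slot=2 \
  -device x3130-upstream,id=upstream_port1,bus=root_port2 \
  -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=3,slot=0 \
  -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=4,slot=0 \
  -device virtio-net-pci,bus=root_port1
```

Note the upstream port goes behind a root port, matching the corrected
rule above.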
>> + >> + >> + pcie.0 bus >> + ---------------------------------------------------- >> + | | | >> + ------------- ------------- ------------- >> + | Root Port | | Root Port | | Root Port | >> + ------------ -------------- ------------- >> + | | >> + ------------ ----------------- >> + | PCIe Dev | | Upstream Port | >> + ------------ ----------------- >> + | | >> + ------------------- ------------------- >> + | Downstream Port | | Downstream Port | >> + ------------------- ------------------- >> + | >> + ------------ >> + | PCIe Dev | >> + ------------ >> + > > So the upper right root port should be removed, probably. > No, my bug in explanation, sorry. > Also, I recommend to draw a "container" around the upstream port plus > the two downstream ports, and tack a "switch" label to it. > Really? :) I had an interesting time with these "drawings". I'll try. >> +2.3 PCI only hierarchy >> +====================== >> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or >> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges >> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged >> +only into pcie.0 bus. >> + >> + pcie.0 bus >> + ---------------------------------------------- >> + | | >> + ----------- ------------------ >> + | PCI Dev | | DMI-PCI BRIDGE | >> + ---------- ------------------ >> + | | >> + ----------- ------------------ >> + | PCI Dev | | PCI-PCI Bridge | >> + ----------- ------------------ >> + | | >> + ----------- ----------- >> + | PCI Dev | | PCI Dev | >> + ----------- ----------- > > Works for me, but I would again elaborate a little bit on keeping the > hierarchy flat. > > First, in order to preserve compatibility with libvirt's current > behavior, let's not plug a PCI device directly in to the DMI-PCI bridge, > even if that's possible otherwise. 
> Let's just say
>
> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
> is required),
>
> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,
>
> - let's recommend that each PCI-PCI bridge be populated until it becomes
> full, at which point another PCI-PCI bridge should be plugged into the
> same one DMI-PCI bridge. Theoretically, with 32 legacy PCI devices per
> PCI-PCI bridge, and 32 PCI-PCI bridges stuffed into the one DMI-PCI
> bridge, we could have ~1024 legacy PCI devices (not counting the
> integrated ones on the root complex(es)). There's also multi-function,
> so I can't see anyone needing more than this.
>

I can "live" with that. Even if it contradicts your flattening argument a
little when you need more room for PCI devices but don't need hotplug; in
that case adding PCI devices to the DMI-PCI Bridge should be enough. But
I agree we should keep it as simple as possible and your idea makes
sense, thanks.

> For practical reasons though (see later), we should state here that we
> recommend no more than 9 (nine) PCI-PCI bridges in total, all located
> directly under the 1 (one) DMI-PCI bridge that is integrated into the
> pcie.0 root complex. Nine PCI-PCI bridges should allow for 288 legacy
> PCI devices. (And then there's multifunction.)
>

OK... BTW the ~9 bridges limitation is the same for non PCI Express
machines, e.g. the i440FX machine.
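In the doc, the legacy PCI shape being agreed on here could be shown with
something like the following (a sketch only; i82801b11-bridge is QEMU's
DMI-PCI bridge device, and the ids, chassis_nr and addr values are made
up for the example):

```shell
# Sketch: one DMI-PCI bridge on pcie.0, with sibling PCI-PCI bridges
# under it; legacy PCI devices go on the PCI-PCI bridges. All ids and
# numbers below are arbitrary examples.
qemu-system-x86_64 -M q35 [...] \
  -device i82801b11-bridge,id=dmi_pci_bridge,bus=pcie.0 \
  -device pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge,chassis_nr=1 \
  -device pci-bridge,id=pci_bridge2,bus=dmi_pci_bridge,chassis_nr=2 \
  -device e1000,bus=pci_bridge1,addr=1
```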
> I'll change to "should". > (But, certainly no IO reservation for PCI Express root port, upstream > port, or downstream port! And i'll need your help for telling these > apart in OVMF.) > Just let me know how can I help. > Let me elaborate more under section "4. Hot Plug". For now let me just > say that I'd like this language about optimization to be dropped. > >> +Behind a PCIe PORT only one device may be plugged, resulting in > > (again, please spell out downstream and root port) > OK >> +the allocation of a whole 4K range for each device. >> +The IO space is limited resulting in ~10 PCIe ports per system > > (limited to 65536 byte-wide IO ports, but it's fragmented, so we have > about 10 * 4K free) > >> +if devices with IO BARs are plugged into IO ports. > > not "into IO ports" but "into PCI Express downstream and root ports". > oops, thanks >> + >> +Using the proposed device placing strategy solves this issue > >> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires > > (please spell out root / downstream etc) > OK >> +PCIe devices to work without IO BARs. >> +The PCI hierarchy has no such limitations. > > I'm sorry to have fragmented this section with so many comments, but the > idea is actually splendid, in my opinion! > Thanks! > > ... Okay, still speaking resources, could you insert a brief section > here about bus numbers? Under "3. IO space issues", you already explain > how "practically everything" qualifies as a PCI bridge. We should > mention that all those things, such as: > - root complex pcie.0, > - root complex added by pxb-pcie, > - root ports, > - upstream ports, downstream ports, > - bridges, etc > > take up bus numbers, and we have 256 bus numbers in total. > I'll add a section for bus numbers, sure. 
> In the next section you state that PCI hotplug (ACPI based) and PCI > Express hotplug (native) can work side by side, which is correct, and > the IO space competition is eliminated by the scheme proposed in section > 3 -- and the MMIO space competition is "obvious" --, but the bus number > starvation is *very much* non-obvious. It should be spelled out. I think > it deserves a separate section. (Again, with an eye toward > qemu-system-aarch64 -M virt -- we've seen PCI Express failures there, > and they were due to bus number starvation. It wasn't fun to debug. > (Well, it was, but don't tell anyone :))) > Got it, I'll try to make PCI Bus numbering a limitation as important as IO. And we need to start looking at ways to solve this: 1. pxb-pcie starting different PCI domains - pxb-pcie became another Root Complex 2. switches can theoretically start PCI domains - emulate a Switch doing this. Long term plans, of course. >> + >> + >> +4. Hot Plug >> +============ >> +The root bus pcie.0 does not support hot-plug, so Integrated Devices, > > s/root bus/root complex/? Also, any root complexes added with pxb-pcie > don't support hotplug. > Actually pxb-pcie should support PCI Express Native Hotplug. If they don't is a bug and I'll take care of it. For pxb-pci (the PCI counter-part) is another story, it needs ACPI code to be emitted and the feature is not yet implemented. >> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged. > > I would say: ... so anything that plugs *only* into a root complex, > cannot be hotplugged. Then the list is what you mention here (also > referring back to options (1), (2) and (4) in section 2.1), *plus* I > would also add option (5): pxb-pcie can also not be hotplugged. > because is actually an Integrated Device. >> + >> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug >> +in QEMU preventing it to work, but it would be solved soon). > > What bug? 
>

As stated above, PCI hotplug is based on emitting ACPI code for
recognizing the right slot (see "bsel" ACPI variables). Basically each
PCI-Bridge slot has a different "bsel" value used during the hotplug
mechanism to identify the slot where the device is
hot-plugged/hot-unplugged. For the PC machine the ACPI is generated,
while for Q35 it is not. (I think I've already sent an RFC some time ago
for that)

> Anyway, I'm unsure we should add this remark here -- it's a guide, not a
> status report. I'm worried that whenever we fix that bug, we forget to
> remove this remark.
>

will remove it

>> +The PCI hotplug is ACPI based and can work side by side with the PCIe
>> +native hotplug.
>> +
>> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
>> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.
>
> I would mention the order (upstream port, downstream port), also add
> some command lines maybe.
>

I'll add some hmp example. Should I try it first :) ?

>> +Keep in mind you always need to have at least one PCIe Port available
>> +for hotplug, the PCIe Ports themselves are not hot-pluggable.
>
> Well, the downstream ports of a switch that is being added *are*, aren't
> they?

Nope, you cannot hotplug a PCI Express Root Port or a PCI Express
Downstream Port. The reason: The PCI Express Native Hotplug is based on
SHPCs (Standard HotPlug Controllers) which are integrated only in the
mentioned ports and not in Upstream Ports or the Root Complex.
The "other" reason: When you buy a switch/server it has a number of ports
and that's it. You cannot add ports "later".

>
> But, this question is actually irrelevant IMO, because here I would add
> another subsection about *planning* for hot-plug. (I think that's pretty
> important.) And those plans should make the hotplugging of switches
> unnecessary!
>

I'll add a subsection for it. But when you are out of options you *can*
hotplug a switch if your sysadmin skills are limited...
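The hmp example would probably be something like this (untested sketch;
it assumes an empty root port with the made-up id root_port1 was created
on the command line):

```shell
# In the QEMU monitor: hot-plug a PCI Express device into an empty,
# cold-plugged root port, then hot-unplug it again.
(qemu) device_add virtio-net-pci,id=hp_nic,bus=root_port1
(qemu) device_del hp_nic
```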
> * For the PCI Express hierarchy, I recommended a flat structure above. > The 256 bus numbers can easily be exhausted / covered by 25 root ports > plus 7 switches (each switch being fully populated with downstream > ports). This should allow all sysadmins to estimate their expected > numbers of hotplug PCI Express devices, in advance, and create enough > root ports / downstream ports. (Sysadmins are already used to planning > for hotplug, see VCPUs, memory (DIMM), memory (balloon).) > That's another good idea, I'll add it to the doc, thanks! > * For the PCI hierarchy, it should be even simpler, but worth a mention > -- start with enough PCI-PCI bridges under the one DMI-PCI bridge. > OK > * Finally, this is the spot where we should design and explain our > resource reservation for hotplug: > > - For PCI Express hotplug, please explain that only such PCI Express > devices can be hotplugged that require no IO space -- in section > "3. IO space issues" you mention that this is a valid restriction. > Furthermore, please state the MMIO32 and/or MMIO64 sizes that the > firmware needs to reserve for root ports / downstream ports, and > also explain that these sizes will act as maximum size and > alignment limits for *individual* hotplug devices. > OK > We can invent fw_cfg switches for this, or maybe even a special PCI > Express capability (to be placed in the config space of root and > downstream ports). > Gerd explicitly asked for the second idea (vendor specific capability) > - For legacy PCI hotplug, this is where my evil plan of "no more than > 9 (nine) PCI-PCI bridges under the 1 (one) DMI-PCI bridge" unfolds! > The same as for PC, should this doc deal with this if is a common issue? Maybe a simple comment is enough. > We've stated above (in Section 3) that we have about 10*4KB IO port > space. One of those 4K chunks will go (collectively) to the > Integrated PCI devices that sit on pcie.0. 
(If there are other root > complexes from pxb-pcie devices, then those will get one chunk each > too.) The rest -- hence assume 9 or fewer chunks -- will be consumed > by the 9 (or respectively fewer) PCI-PCI bridges, for hotplug > reservation. The upshot is that as long as a sysadmin sticks with > our flat, "9 PCI-PCI bridges total" recommendation for the legacy > > PCI hierarchy, the IO reservation will be covered *immediately*. > Simply put: don't create more PCI-PCI bridges than you have IO > space for -- and that should leave you with about 9 "sibling" > bridges, which are plenty enough for a huge number of legacy PCI > devices! > I'll use that, thanks! > Furthermore, please state the MMIO32 / MMIO64 amount to reserve > *per PCI-PCI bridge*. The firmware programmers need to know this, > and people planning for legacy PCI hotplug should be informed that > those limits are for all devices *together* on the same PCI-PCI > bridge. > Yes... we'll ask here for the minimum 8MB MMIO because of the virtio 1.0 behavior and use that when "pushing" the patches for OVMF/SeaBIOS. Its fun, first we "make" the rules, then we say "Hey, is written in QEMU docs". > Again, we could expose this via fw_cfg, or in a special capability > (as suggested by Gerd IIRC) in the PCI config space of the PCI-PCI > bridge. > Agreed >> + >> + >> +5. Device assignment >> +==================== >> +Host devices are mostly PCIe and should be plugged only into PCIe ports. >> +PCI-PCI bridge slots can be used for legacy PCI host devices. > > Please provide a command line (lspci) so that users can easily determine > if the device they wish to assign is legacy PCI or PCI Express. 
>

OK, something like:

lspci -s 03:00.0 -v (as root)

03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
        Subsystem: Intel Corporation Dual Band Wireless-AC 7260
        Flags: bus master, fast devsel, latency 0, IRQ 50
        Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [c8] Power Management version 3
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [40] Express Endpoint, MSI 00
                           ^^^^^^^^^^^^^^^
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
        Capabilities: [14c] Latency Tolerance Reporting
        Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014 <?>

>> +
>> +
>> +6. Virtio devices
>> +=================
>> +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices
>
> (drop "an")
>

OK

>> +will remain PCI and have transitional behaviour as default.
>
> (Please add one sentence about what "transitional" means in this context
> -- they'll have both IO and MMIO BARs.)
>

OK

>> +Virtio devices plugged into PCIe ports are Express devices and have
>> +"1.0" behavior by default without IO support.
>> +In both case disable-* properties can be used to override the behaviour.
>
> Please emphasize that setting disable-legacy=off (that is, enabling
> legacy behavior) for PCI Express virtio devices will cause them to
> require IO space, which, given our PCI Express hierarchy, may quickly
> lead to resource exhaustion, and is therefore strongly discouraged.
>

Sure

>> +
>> +
>> +7. Conclusion
>> +==============
>> +The proposal offers a usage model that is easy to understand and follow
>> +and in the same time overcomes some PCIe limitations.
>
> I agree!
>
> Thanks!
> Laszlo
>

Thanks for the detailed review! I was planning on sending V2 today, but
to follow your comments I will need another day. Don't get me wrong, it's
totally worth it :)

Thanks,
Marcel

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-05 20:02 ` Marcel Apfelbaum @ 2016-09-06 13:31 ` Laszlo Ersek 2016-09-06 14:46 ` Marcel Apfelbaum 2016-09-07 6:21 ` Gerd Hoffmann 0 siblings, 2 replies; 52+ messages in thread From: Laszlo Ersek @ 2016-09-06 13:31 UTC (permalink / raw) To: Marcel Apfelbaum, qemu-devel Cc: mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 09/05/16 22:02, Marcel Apfelbaum wrote: > On 09/05/2016 07:24 PM, Laszlo Ersek wrote: >> On 09/01/16 15:22, Marcel Apfelbaum wrote: >>> Proposes best practices on how to use PCIe/PCI device >>> in PCIe based machines and explain the reasoning behind them. >>> >>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> >>> --- >>> >>> Hi, >>> >>> Please add your comments on what to add/remove/edit to make this doc >>> usable. >> > > Hi Laszlo, > >> I'll give you a brain dump below -- most of it might easily be >> incorrect, but I'll just speak my mind :) >> > > Thanks for taking the time to go over it, I'll do my best to respond > to all the questions. > >>> >>> Thanks, >>> Marcel >>> >>> docs/pcie.txt | 145 >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 145 insertions(+) >>> create mode 100644 docs/pcie.txt >>> >>> diff --git a/docs/pcie.txt b/docs/pcie.txt >>> new file mode 100644 >>> index 0000000..52a8830 >>> --- /dev/null >>> +++ b/docs/pcie.txt >>> @@ -0,0 +1,145 @@ >>> +PCI EXPRESS GUIDELINES >>> +====================== >>> + >>> +1. Introduction >>> +================ >>> +The doc proposes best practices on how to use PCIe/PCI device >>> +in PCIe based machines and explains the reasoning behind them. >> >> General request: please replace all occurrences of "PCIe" with "PCI >> Express" in the text (not command lines, of course). The reason is that >> the "e" letter is a minimal difference, and I've misread PCIe as PC >> several times, while interpreting this document. 
Obviously the resultant >> confusion is terrible, as you are explaining the difference between PCI >> and PCI Express in the entire document :) >> > > Sure > >>> + >>> + >>> +2. Device placement strategy >>> +============================ >>> +QEMU does not have a clear socket-device matching mechanism >>> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot. >>> +Plugging a PCI device into a PCIe device might not always work and >> >> s/PCIe device/PCI Express slot/ >> > > Thanks! > >>> +is weird anyway since it cannot be done for "bare metal". >>> +Plugging a PCIe device into a PCI slot will hide the Extended >>> +Configuration Space thus is also not recommended. >>> + >>> +The recommendation is to separate the PCIe and PCI hierarchies. >>> +PCIe devices should be plugged only into PCIe Root Ports and >>> +PCIe Downstream ports (let's call them PCIe ports). >> >> Please do not use the shorthand; we should always spell out downstream >> ports and root ports. Assume people reading this document are dumber >> than I am wrt. PCI / PCI Express -- I'm already pretty dumb, and I >> appreciate the detail! :) If they are smart, they won't mind the detail; >> if they lack expertise, they'll appreciate the detail, won't they. :) >> > > Sure > >>> + >>> +2.1 Root Bus (pcie.0) >> >> Can we call this Root Complex instead? >> > > Sorry, but we can't. The Root Complex is a type of Host-Bridge > (and can actually "have" multiple Host-Bridges), not a bus. > It stands between the CPU/Memory controller/APIC and the PCI/PCI Express > fabric. > (as you can see, I am not using PCIe even for the comments :)) > > The Root Complex *includes* an internal bus (pcie.0) but also > can include some Integrated Devices, its own Configuration Space Registers > (e.g Root Complex Register Block), ... 
> > One of the main functions of the Root Complex is to > generate PCI Express Transactions on behalf of the CPU(s) and > to "translate" the corresponding PCI Express Transactions into DMA > accesses. > > I can change it to "PCI Express Root Bus", it will help? Yes, it will, thank you. All my other "root complex" mentions below were incorrect, in light of your clarification, so please consider those accordingly. > >>> +===================== >>> +Plug only legacy PCI devices as Root Complex Integrated Devices >>> +even if the PCIe spec does not forbid PCIe devices. >> >> I suggest "even though the PCI Express spec does not forbid PCI Express >> devices as Integrated Devices". (Detail is good!) >> > Thanks > >> Also, as Peter suggested, this (but not just this) would be a good place >> to provide command line fragments. >> > > I've already added some examples, I'll appreciate if you can have a look > on v2 > that I will post really soon. > >>> The existing >>> +hardware uses mostly PCI devices as Integrated Endpoints. In this >>> +way we may avoid some strange Guest OS-es behaviour. >>> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream >>> ports) >>> +or DMI-PCI bridges to start legacy PCI hierarchies. >> >> Hmmmm, I had to re-read this paragraph (while looking at the diagram) >> five times until I mostly understood it :) What about the following >> wording: >> >> -------- >> Place only the following kinds of devices directly on the Root Complex: >> >> (1) For devices with dedicated, specific functionality (network card, >> graphics card, IDE controller, etc), place only legacy PCI devices on >> the Root Complex. These will be considered Integrated Endpoints. >> Although the PCI Express spec does not forbid PCI Express devices as >> Integrated Endpoints, existing hardware mostly integrates legacy PCI >> devices with the Root Complex. Guest OSes are suspected to behave >> strangely when PCI Express devices are integrated with the Root Complex. 
>> >> (2) PCI Express Root Ports, for starting exclusively PCI Express >> hierarchies. >> >> (3) PCI Express Switches (connected with their Upstream Ports to the >> Root Complex), also for starting exclusively PCI Express hierarchies. >> >> (4) For starting legacy PCI hierarchies: DMI-PCI bridges. >> > > Thanks for the re-wording! > Actually I had a bug, even the Switches should be connected to Root > Ports, not directly > to the PCI Express Root Bus (pcie.0) , I'll delete (3) to make it clear. Ah, okay. That puts a lot of what I wrote in a different perspective :), but I think the "as flat as possible" hierarchy should remain a valid suggestion. > > >>> + >>> + >>> + pcie.0 bus >> >> "bus" is correct in QEMU lingo, but I'd still call it complex here. >> > > explained above > >>> + >>> -------------------------------------------------------------------------- >>> >>> + | | | | >>> + ----------- ------------------ ------------------ >>> ------------------ >>> + | PCI Dev | | PCIe Root Port | | Upstream Port | | DMI-PCI >>> bridge | >>> + ----------- ------------------ ------------------ >>> ------------------ >>> + >> >> Please insert a separate (brief) section here about pxb-pcie devices -- >> just mention that they are documented in a separate spec txt in more >> detail, and that they create new root complexes in practice. >> >> In fact, maybe option (5) would be better for pxb-pcie devices, under >> section 2.1, than a dedicated section! >> > > Good idea, I'll add the pxb-pcie device. > >>> +2.2 PCIe only hierarchy >>> +======================= >>> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe >>> switches (Upstream >>> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. >>> PCIe switches >>> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe >>> Ports. >> >> - Please name the maximum number of the root ports that's allowed on the >> root complex (cmdline example?) 
>> > > I'll try: > The PCI Express Root Bus (pcie.0) is an internal bus that similar to a > PCI bus > supports up to 32 Integrated Devices/PCI Express Root Ports. Thanks, sounds good. Also, apparently, I wasn't wrong about the number 32 :) >> - Also, this is the first time you mention "slot". While the PCI Express >> spec allows for root ports / downstream ports not implementing a slot >> (IIRC), I think we shouldn't muddy the waters here, and restrict the >> word "slot" to the command line examples only. >> > > OK > >> - What you say here about switches (upstream ports) matches what I've >> learned from you thus far :), but it doesn't match bullet (3) in section >> 2.1. That is, if we suggest to *always* add a Root Port between the Root >> Complex and the Upstream Port of a switch, then (3) should not be >> present in section 2.1. (Do we suggest that BTW?) >> >> We're not giving a technical description here (the PCI Express spec is >> good enough for that), we're dictating policy. We shouldn't be shy about >> minimizing the accepted use cases. >> >> Our main guidance here should be the amount of bus numbers used up by >> the hierarchy. Parts of the document might later apply to >> qemu-system-aarch64 -M virt, and that machine is severely starved in the >> bus numbers department (it has MMCONFIG space for 16 buses only!) >> >> So how about this: >> >> * the basic idea is good I think: always go for root ports, unless the >> root complex is fully populated >> >> * if you run out of root ports, use a switch with downstream ports, but >> plug the upstream port directly in the root complex (make it an >> integrated device). This would save us a bus number, and match option >> (3) in section 2.1, but it doesn't match the diagram below, where a root >> port is between the root complex and the upstream port. (Of course, if a >> root port is *required* there, then 2.1 (3) is wrong, and should be >> removed.) >> > > A Root Port is required, thanks for spotting the bug. 
> >> * the "population algorithm" should be laid out in a bit more detail. >> You mention a possible depth of 6-7, but I think it would be best to >> keep the hierarchy as flat as possible (let's not waste bus numbers on >> upstream ports, and time on deep enumeration!). In other words, only >> plug upstream ports in the root complex (and without intervening root >> ports, if that's allowed). For example: >> >> - 1-32 ports needed: use root ports only >> >> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 >> downstream ports >> >> - 65-94 ports needed: use 30 root ports, one switch with 32 downstream >> ports, another switch with 3-32 downstream ports >> >> - 95-125 ports needed: use 29 root ports, two switches with 32 >> downstream ports each, and a third switch with 2-32 downstream ports >> >> - 126-156 ports needed: use 28 root ports, three switches with 32 >> downstream ports each, and a fourth switch with 2-32 downstream ports >> >> - 157-187 ports needed: use 27 root ports, four switches with 32 >> downstream ports each, and a fifth switch with 2-32 downstream ports >> >> - 188-218 ports: 26 root ports, 5 fully populated switches, sixth switch >> with 2-32 downstream ports, >> >> - 219-249 ports: 25 root ports, 6 fully pop. switches, seventh switch >> with 2-32 downstream ports >> > > I can add it as a "best practice". That would be highly appreciated, thanks! Of course, with the root ports being mandatory between the PCI Express Root Bus and the upstream port of every switch, we hit the bus number limit a bit earlier: > >> (And I think this is where it ends, because the 7 upstream ports total >> in the switches take up 7 bus numbers, so we'd need 249 + 7 = 256 bus >> numbers, not counting the root complex, so 249 ports isn't even >> attainable.) we'd need 249+7+7. >> > > Theoretically we can implement multiple PCI domains, each domain can have > 256 PCI buses, but we don't have that yet. 
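For illustration, the flat "root ports first, then switches" population scheme can be sketched in a few lines of Python (the constants and helper names are invented for this sketch, and a Root Port is assumed between pcie.0 and every Upstream Port, per the correction above; the numbers are illustrative, not normative):

```python
import math

MAX_SLOTS = 32   # slots on pcie.0, and downstream ports per switch

def plan_flat_hierarchy(needed):
    """Return (root_ports, switches) for `needed` hot-pluggable ports.

    Each switch hangs off one Root Port (so it consumes one pcie.0
    slot) and offers up to MAX_SLOTS downstream ports; with s switches
    the capacity is (MAX_SLOTS - s) + s * MAX_SLOTS = 32 + 31 * s.
    """
    if needed <= MAX_SLOTS:
        return needed, 0                       # root ports only
    switches = math.ceil((needed - MAX_SLOTS) / (MAX_SLOTS - 1))
    return MAX_SLOTS, switches

def buses_used(root_ports, switches):
    # pcie.0 itself, one bus per root port, and per switch one bus
    # for the upstream port plus one per downstream port
    return 1 + root_ports + switches * (1 + MAX_SLOTS)
```

For example, plan_flat_hierarchy(40) comes out as 32 root ports plus one switch, and buses_used(25, 7) already crosses the 256 available bus numbers, matching the arithmetic above showing that 249 ports isn't attainable.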
An implementation should > start with the pxb-pcie using separate PCI domains instead of "stealing" > bus ranges. > for the Root Complex. But this is for another thread. Definitely for another thread :) > >> You might argue that this is way too detailed, but with the "problem >> space" offering so much freedom (consider libvirt too...), I think it >> would be helpful. This would also help trim the "explorations" of >> downstream QE departments :) >> > > Well, we can accentuate that while nesting is supported, "deep nesting" > is not recommended and even not strictly necessary. I agree, thanks. > >>> + >>> + >>> + pcie.0 bus >>> + ---------------------------------------------------- >>> + | | | >>> + ------------- ------------- ------------- >>> + | Root Port | | Root Port | | Root Port | >>> + ------------ -------------- ------------- >>> + | | >>> + ------------ ----------------- >>> + | PCIe Dev | | Upstream Port | >>> + ------------ ----------------- >>> + | | >>> + ------------------- ------------------- >>> + | Downstream Port | | Downstream Port | >>> + ------------------- ------------------- >>> + | >>> + ------------ >>> + | PCIe Dev | >>> + ------------ >>> + >> >> So the upper right root port should be removed, probably. >> > > No, my bug in explanation, sorry. > >> Also, I recommend to draw a "container" around the upstream port plus >> the two downstream ports, and tack a "switch" label to it. >> > > Really? :) I had an interesting time with these "drawings". I'll try. Haha, thanks :) If you are an emacs user, it should be easy. (I'm not an emacs user, but my editor does support macros that makes it okay-ish to draw some basic ASCII art.) > >>> +2.3 PCI only hierarchy >>> +====================== >>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or >>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI >>> bridges >>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged >>> +only into pcie.0 bus. 
>>> + >>> + pcie.0 bus >>> + ---------------------------------------------- >>> + | | >>> + ----------- ------------------ >>> + | PCI Dev | | DMI-PCI BRIDGE | >>> + ---------- ------------------ >>> + | | >>> + ----------- ------------------ >>> + | PCI Dev | | PCI-PCI Bridge | >>> + ----------- ------------------ >>> + | | >>> + ----------- ----------- >>> + | PCI Dev | | PCI Dev | >>> + ----------- ----------- >> >> Works for me, but I would again elaborate a little bit on keeping the >> hierarchy flat. >> >> First, in order to preserve compatibility with libvirt's current >> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge, >> even if that's possible otherwise. Let's just say >> >> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy >> is required), >> >> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge, >> >> - let's recommend that each PCI-PCI bridge be populated until it becomes >> full, at which point another PCI-PCI bridge should be plugged into the >> same one DMI-PCI bridge. Theoretically, with 32 legacy PCI devices per >> PCI-PCI bridge, and 32 PCI-PCI bridges stuffed into the one DMI-PCI >> bridge, we could have ~1024 legacy PCI devices (not counting the >> integrated ones on the root complex(es)). There's also multi-function, >> so I can't see anyone needing more than this. >> > > I can "live" with that. Even if it contradicts a little you flattening > argument if you need more room for PCI devices but you don't need hotplug. > In this case adding PCI devices to the DMI-PCI Bridge should be enough. > But I agree we should keep it as simple as possible and your idea makes > sense, thanks. > > >> For practical reasons though (see later), we should state here that we >> recommend no more than 9 (nine) PCI-PCI bridges in total, all located >> directly under the 1 (one) DMI-PCI bridge that is integrated into the >> pcie.0 root complex. 
Nine PCI-PCI bridges should allow for 288 legacy >> PCI devices. (And then there's multifunction.) >> > > OK... BTW the ~9 bridges limitation is the same for non PCI Express > machines > e.g. i440FX machine. > >>> + >>> + >>> + >>> +3. IO space issues >>> +=================== >>> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and >> >> (please spell out downstream + root port) >> > > OK > >>> +as required by PCI spec will reserve a 4K IO range for each. >>> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize >>> +it by allocation the IO space only if there is at least a device >>> +with IO BARs plugged into the bridge. >> >> This used to be true, but is no longer true, for OVMF. And I think it's >> actually correct: we *should* keep the 4K IO reservation per PCI-PCI >> bridge. >> > > I'll change to "should". > >> (But, certainly no IO reservation for PCI Express root port, upstream >> port, or downstream port! And i'll need your help for telling these >> apart in OVMF.) >> > > Just let me know how can I help. Well, in the EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() implementation, I'll have to look at the PCI config space of the "bridge-like" PCI device that the generic PCI Bus driver of edk2 passes back to me, asking me about resource reservation. Based on the config space, I should be able to tell apart "PCI-PCI bridge" from "PCI Express downstream or root port". So what I'd need here is a semi-formal natural language description of these conditions. Hmm, actually I think I've already written code, for another patch, that identifies the latter category. So everything where that check doesn't fire can be deemed "PCI-PCI bridge". (This hook gets called only for bridges.) Yet another alternative: if we go for the special PCI capability, for exposing reservation sizes from QEMU to the firmware, then I can simply search the capability list for just that capability. I think that could be the easiest for me. 
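To pin down the "telling these apart" condition, here is a rough sketch (plain Python rather than OVMF code, and assuming the caller has already located the PCI Express capability in config space; the constant names are made up): the Device/Port Type field is bits 7:4 of the PCI Express Capabilities Register, at offset 02h into the capability.

```python
# Device/Port Type encodings (bits 7:4 of the PCI Express
# Capabilities Register, offset 02h into the capability structure)
PCIE_ROOT_PORT       = 0b0100
PCIE_UPSTREAM_PORT   = 0b0101
PCIE_DOWNSTREAM_PORT = 0b0110

def device_port_type(pcie_cap_reg):
    """pcie_cap_reg: the 16-bit PCI Express Capabilities Register value."""
    return (pcie_cap_reg >> 4) & 0xF

def needs_no_io_reservation(pcie_cap_reg):
    # No IO reservation for root ports, upstream ports, or downstream
    # ports; a bridge without a PCI Express capability at all (the hook
    # is only called for bridges) is treated as a PCI-PCI bridge.
    return device_port_type(pcie_cap_reg) in (PCIE_ROOT_PORT,
                                              PCIE_UPSTREAM_PORT,
                                              PCIE_DOWNSTREAM_PORT)
```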
> >> Let me elaborate more under section "4. Hot Plug". For now let me just >> say that I'd like this language about optimization to be dropped. >> >>> +Behind a PCIe PORT only one device may be plugged, resulting in >> >> (again, please spell out downstream and root port) >> > > OK > >>> +the allocation of a whole 4K range for each device. >>> +The IO space is limited resulting in ~10 PCIe ports per system >> >> (limited to 65536 byte-wide IO ports, but it's fragmented, so we have >> about 10 * 4K free) >> >>> +if devices with IO BARs are plugged into IO ports. >> >> not "into IO ports" but "into PCI Express downstream and root ports". >> > > oops, thanks > >>> + >>> +Using the proposed device placing strategy solves this issue >> >>> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires >> >> (please spell out root / downstream etc) >> > > OK > >>> +PCIe devices to work without IO BARs. >>> +The PCI hierarchy has no such limitations. >> >> I'm sorry to have fragmented this section with so many comments, but the >> idea is actually splendid, in my opinion! >> > > Thanks! > >> >> ... Okay, still speaking resources, could you insert a brief section >> here about bus numbers? Under "3. IO space issues", you already explain >> how "practically everything" qualifies as a PCI bridge. We should >> mention that all those things, such as: >> - root complex pcie.0, >> - root complex added by pxb-pcie, >> - root ports, >> - upstream ports, downstream ports, >> - bridges, etc >> >> take up bus numbers, and we have 256 bus numbers in total. >> > > I'll add a section for bus numbers, sure. > >> In the next section you state that PCI hotplug (ACPI based) and PCI >> Express hotplug (native) can work side by side, which is correct, and >> the IO space competition is eliminated by the scheme proposed in section >> 3 -- and the MMIO space competition is "obvious" --, but the bus number >> starvation is *very much* non-obvious. It should be spelled out. 
I think >> it deserves a separate section. (Again, with an eye toward >> qemu-system-aarch64 -M virt -- we've seen PCI Express failures there, >> and they were due to bus number starvation. It wasn't fun to debug. >> (Well, it was, but don't tell anyone :))) >> > > Got it, I'll try to make PCI Bus numbering a limitation as important as IO. > And we need to start looking at ways to solve this: > 1. pxb-pcie starting different PCI domains - pxb-pcie became another > Root Complex > 2. switches can theoretically start PCI domains - emulate a Switch > doing this. > Long term plans, of course. Right, let's not rush that; first I'd like to reach a status where PCI and PCI Express hotplug "just works" with OVMF... And if it fails, I should be able to point at the user's config, and this document, and say "wrong configuration". That's the goal. :) > >>> + >>> + >>> +4. Hot Plug >>> +============ >>> +The root bus pcie.0 does not support hot-plug, so Integrated Devices, >> >> s/root bus/root complex/? Also, any root complexes added with pxb-pcie >> don't support hotplug. >> > > Actually pxb-pcie should support PCI Express Native Hotplug. Huh, interesting. > If they don't is a bug and I'll take care of it. Hmm, a bit lower down you mention that PCI Express native hot plug is based on SHPCs. So, when you say that pxb-pcie should support PCI Express Native Hotplug, you mean that it should occur through SHPC, right? However, for pxb-pci*, we had to disable SHPC: see QEMU commit d10dda2d60c8 ("hw/pci-bridge: disable SHPC in PXB"), in June 2015. For background, the series around it was <https://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05136.html> -- I think v7 was the last version. ... Actually, now I wonder if d10dda2d60c8 should be possible to revert at this point! Namely, in OVMF I may have unwittingly fixed this issue -- obviously much later than the QEMU commit: in March 2016. 
See https://github.com/tianocore/edk2/commit/8f35eb92c419 If you look at the commit message of the QEMU patch, it says [...] Unfortunately, when this happens, the PCI_COMMAND_MEMORY bit is clear in the root bus's command register [...] which I think should no longer be true, thanks to edk2 commit 8f35eb92c419. So maybe we should re-evaluate QEMU commit d10dda2d60c8. If pxb-pci and pxb-pcie work with current OVMF, due to edk2 commit 8f35eb92c419, then maybe we should revert QEMU commit d10dda2d60c8. Not urgent for me :), obviously, I'm just explaining so you can make a note for later, if you wish to (if hot-plugging directly into pxb-pcie should be necessary -- I think it's very low priority). > For pxb-pci (the PCI counter-part) is another story, it needs ACPI code > to be emitted and the feature is not yet implemented. > > >>> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged. >> >> I would say: ... so anything that plugs *only* into a root complex, >> cannot be hotplugged. Then the list is what you mention here (also >> referring back to options (1), (2) and (4) in section 2.1), *plus* I >> would also add option (5): pxb-pcie can also not be hotplugged. >> > > because is actually an Integrated Device. > >>> + >>> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug >>> +in QEMU preventing it to work, but it would be solved soon). >> >> What bug? >> > > As stated above, PCI hotplug is based on emitting ACPI code for > recognizing the right slot (see "bsel" ACPI variables). Basically each > PCI-Bridge slot has a different "bsel" value used during > hotplug mechanism to identify the slot where the device is > hot-plugged/hot-unplugged. > > For PC machine the ACPI is generated while for Q35 is not. > (I think I've already sent an RFC some time ago for that) > >> Anyway, I'm unsure we should add this remark here -- it's a guide, not a >> status report. I'm worried that whenever we fix that bug, we forget to >> remove this remark. 
>> > > will remove it > >>> +The PCI hotplug is ACPI based and can work side by side with the PCIe >>> +native hotplug. >>> + >>> +PCIe devices can be natively hot-plugged/hot-unplugged into/from >>> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable. >> >> I would mention the order (upstream port, downstream port), also add >> some command lines maybe. >> > > I'll add some hmp example. Should I try it before :) ? Seeing it function as expected wouldn't hurt! :) > >>> +Keep in mind you always need to have at least one PCIe Port available >>> +for hotplug, the PCIe Ports themselves are not hot-pluggable. >> >> Well, the downstream ports of a switch that is being added *are*, aren't >> they? > > Nope, you cannot hotplug a PCI Express Root Port or a PCI Express > Downstream Port. > The reason: The PCI Express Native Hotplug is based on SHPCs (Standard > HotPlug Controllers) > which are integrated only in the mentioned ports and not in Upstream > Ports or the Root Complex. > The "other" reason: When you buy a switch/server it has a number of > ports and that's it. > You cannot add "later". Makes sense, thank you. I think if you add the HMP example, it will make it clear. I only assumed that you needed several monitor commands for hotplugging a single switch (i.e., one command per one port) because on the QEMU command line you do need a separate -device option for the upstream port, and every single downstream port, of the same switch. If, using the monitor, it's just one device_add for the upstream port, and the downstream ports are added automatically, then I guess it'll be easy to understand. > >> >> But, this question is actually irrelevant IMO, because here I would add >> another subsection about *planning* for hot-plug. (I think that's pretty >> important.) And those plans should make the hotplugging of switches >> unnecessary! >> > > I'll add a subsection for it. 
But when you are out of options you *can* > hotplug a switch if your sysadmin skills are limited... You probably can, but then we'll run into the resource allocation problem again:

(1) The user will hotplug a switch (= S1) under a root port with, say, two downstream ports (= S1-DP1, S1-DP2).

(2) They'll then plug a PCI Express device into one of those downstream ports (S1-DP1-Dev1).

(3) Then they'll want to hot-plug *another* switch into the *other* downstream port (S1-DP2-S2).

                  DP1 -- Dev1   (2)
                 /
  root port -- S1               (1)
                 \
                  DP2 -- S2     (3)

However, concerning the resource needs of S2 (and especially the devices hot-plugged under S2!), S1 won't have enough left over, because Dev1 (under DP1) will have eaten into them, and Dev1's BARs will have been programmed! We could never credibly explain our way out of this situation in a bug report. For that reason, I think we should discourage hotplug ideas that would change the topology, and require recursive resource allocation at higher levels and/or parallel branches of the topology. I know Linux can do that, and it even succeeds if there is enough room, but from the messages seen in the guest dmesg when it fails, how do you explain to the user that they should have plugged in S2 first, and Dev1 second?

So, we should recommend *not* to hotplug switches or PCI-PCI bridges. Instead,

- keep a very flat hierarchy from the start;
- for PCI Express, add as many root ports and downstream ports as you deem enough for future hotplug needs (keeping the flat formula I described);
- for legacy PCI, add as many sibling PCI-PCI bridges directly under the one DMI-PCI bridge as you deem sufficient for future hotplug needs.

In short, don't change the hierarchy at runtime by hotplugging internal nodes; hotplug *leaf nodes* only.

> >> * For the PCI Express hierarchy, I recommended a flat structure above.
>> The 256 bus numbers can easily be exhausted / covered by 25 root ports >> plus 7 switches (each switch being fully populated with downstream >> ports). This should allow all sysadmins to estimate their expected >> numbers of hotplug PCI Express devices, in advance, and create enough >> root ports / downstream ports. (Sysadmins are already used to planning >> for hotplug, see VCPUs, memory (DIMM), memory (balloon).) >> > > That's another good idea, I'll add it to the doc, thanks! > >> * For the PCI hierarchy, it should be even simpler, but worth a mention >> -- start with enough PCI-PCI bridges under the one DMI-PCI bridge. >> > > OK > >> * Finally, this is the spot where we should design and explain our >> resource reservation for hotplug: >> >> - For PCI Express hotplug, please explain that only such PCI Express >> devices can be hotplugged that require no IO space -- in section >> "3. IO space issues" you mention that this is a valid restriction. >> Furthermore, please state the MMIO32 and/or MMIO64 sizes that the >> firmware needs to reserve for root ports / downstream ports, and >> also explain that these sizes will act as maximum size and >> alignment limits for *individual* hotplug devices. >> > > OK > >> We can invent fw_cfg switches for this, or maybe even a special PCI >> Express capability (to be placed in the config space of root and >> downstream ports). >> > > Gerd explicitly asked for the second idea (vendor specific capability) Nice, thank you for confirming it; let's do this then. It will also simplify my work in the EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() function: it should suffice to scan the config space of the bridge, regardless of the "PCI-PCI bridge / PCI Express root or downstream port" distinction. > >> - For legacy PCI hotplug, this is where my evil plan of "no more than >> 9 (nine) PCI-PCI bridges under the 1 (one) DMI-PCI bridge" unfolds! >> > > The same as for PC, should this doc deal with this if is a common issue? 
> Maybe a simple comment is enough. > >> We've stated above (in Section 3) that we have about 10*4KB IO port >> space. One of those 4K chunks will go (collectively) to the >> Integrated PCI devices that sit on pcie.0. (If there are other root >> complexes from pxb-pcie devices, then those will get one chunk each >> too.) The rest -- hence assume 9 or fewer chunks -- will be consumed >> by the 9 (or respectively fewer) PCI-PCI bridges, for hotplug >> reservation. The upshot is that as long as a sysadmin sticks with >> our flat, "9 PCI-PCI bridges total" recommendation for the legacy >> >> PCI hierarchy, the IO reservation will be covered *immediately*. >> Simply put: don't create more PCI-PCI bridges than you have IO >> space for -- and that should leave you with about 9 "sibling" >> bridges, which are plenty enough for a huge number of legacy PCI >> devices! >> > > I'll use that, thanks! > >> Furthermore, please state the MMIO32 / MMIO64 amount to reserve >> *per PCI-PCI bridge*. The firmware programmers need to know this, >> and people planning for legacy PCI hotplug should be informed that >> those limits are for all devices *together* on the same PCI-PCI >> bridge. >> > > Yes... we'll ask here for the minimum 8MB MMIO because of the virtio 1.0 > behavior and use that when "pushing" the patches for OVMF/SeaBIOS. > Its fun, first we "make" the rules, then we say "Hey, is written in QEMU > docs". Absolutely. This is how it should work! ;) The best part of Gerd's suggestion, as far as the firmwares are concerned, is that we won't have to hard-code any constants in the firmware. We'll just have to parse the PCI config spaces of the bridges, for the vendor specific capability, and use the numbers from the capability for resource reservation. The OVMF and SeaBIOS patches won't have any constants in them. :) Should we change our minds in QEMU later, no firmware patches should be necessary. 
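That parse is just the standard capability-list walk. A rough sketch, in Python for brevity (the vendor-specific capability ID 0x09 and the list-walk offsets are standard PCI; the layout of the reservation fields inside the capability is still to be defined, so only locating the capability is shown):

```python
PCI_STATUS          = 0x06
PCI_STATUS_CAP_LIST = 0x10   # Status register bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # capability list head pointer
PCI_CAP_ID_VNDR     = 0x09   # vendor-specific capability ID

def find_vendor_cap(cfg):
    """Walk the capability list of `cfg` (a byte view of a function's
    config space) and return the offset of the vendor-specific
    capability, or None if the function doesn't expose one."""
    status = cfg[PCI_STATUS] | (cfg[PCI_STATUS + 1] << 8)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    while pos and pos not in seen:      # guard against malformed loops
        seen.add(pos)
        cap_id, nxt = cfg[pos], cfg[pos + 1]
        if cap_id == PCI_CAP_ID_VNDR:
            return pos
        pos = nxt & 0xFC
    return None
```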
> >> Again, we could expose this via fw_cfg, or in a special capability >> (as suggested by Gerd IIRC) in the PCI config space of the PCI-PCI >> bridge. >> > > Agreed > >>> + >>> + >>> +5. Device assignment >>> +==================== >>> +Host devices are mostly PCIe and should be plugged only into PCIe >>> ports. >>> +PCI-PCI bridge slots can be used for legacy PCI host devices. >> >> Please provide a command line (lspci) so that users can easily determine >> if the device they wish to assign is legacy PCI or PCI Express. >> > > OK, something like: > > lspci -s 03:00.0 -v (as root) > 03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83) > Subsystem: Intel Corporation Dual Band Wireless-AC 7260 > Flags: bus master, fast devsel, latency 0, IRQ 50 > Memory at f0400000 (64-bit, non-prefetchable) [size=8K] > Capabilities: [c8] Power Management version 3 > Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ > Capabilities: [40] Express Endpoint, MSI 00 > > ^^^^^^^^^^^^^^^ > > Capabilities: [100] Advanced Error Reporting > Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20 > Capabilities: [14c] Latency Tolerance Reporting > Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 > Len=014 <?> Yep, looks great. > > > >>> + >>> + >>> +6. Virtio devices >>> +================= >>> +Virtio devices plugged into the PCI hierarchy or as an Integrated >>> Devices >> >> (drop "an") >> > > OK > >>> +will remain PCI and have transitional behaviour as default. >> >> (Please add one sentence about what "transitional" means in this context >> -- they'll have both IO and MMIO BARs.) >> > > OK > >>> +Virtio devices plugged into PCIe ports are Express devices and have >>> +"1.0" behavior by default without IO support. >>> +In both case disable-* properties can be used to override the >>> behaviour. 
>> >> Please emphasize that setting disable-legacy=off (that is, enabling >> legacy behavior) for PCI Express virtio devices will cause them to >> require IO space, which, given our PCI Express hierarchy, may quickly >> lead to resource exhaustion, and is therefore strongly discouraged. >> > > Sure > >>> + >>> + >>> +7. Conclusion >>> +============== >>> +The proposal offers a usage model that is easy to understand and follow >>> +and in the same time overcomes some PCIe limitations. >> >> I agree! >> >> Thanks! >> Laszlo >> > > > Thanks for the detailed review! I was planning on sending V2 today, > but to follow your comments I will need another day. > Don't get me wrong, it totally worth it :) Thank you. I'm very happy about this document coming along. And, I too will likely need a separate day to review your v2. :) Cheers! Laszlo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 13:31 ` Laszlo Ersek @ 2016-09-06 14:46 ` Marcel Apfelbaum 2016-09-07 6:21 ` Gerd Hoffmann 1 sibling, 0 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-06 14:46 UTC (permalink / raw) To: Laszlo Ersek, qemu-devel Cc: mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 09/06/2016 04:31 PM, Laszlo Ersek wrote: > On 09/05/16 22:02, Marcel Apfelbaum wrote: >> On 09/05/2016 07:24 PM, Laszlo Ersek wrote: >>> On 09/01/16 15:22, Marcel Apfelbaum wrote: >>>> Proposes best practices on how to use PCIe/PCI device >>>> in PCIe based machines and explain the reasoning behind them. >>>> >>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> >>>> --- >>>> >>>> Hi, >>>> >>>> Please add your comments on what to add/remove/edit to make this doc >>>> usable. >>> >> [...] >> >>> (But, certainly no IO reservation for PCI Express root port, upstream >>> port, or downstream port! And i'll need your help for telling these >>> apart in OVMF.) >>> >> >> Just let me know how can I help. > > Well, in the EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() > implementation, I'll have to look at the PCI config space of the > "bridge-like" PCI device that the generic PCI Bus driver of edk2 passes > back to me, asking me about resource reservation. > > Based on the config space, I should be able to tell apart "PCI-PCI > bridge" from "PCI Express downstream or root port". So what I'd need > here is a semi-formal natural language description of these conditions. You can use PCI Express Spec: 7.8.2. PCI Express Capabilities Register (Offset 02h) Bit 7:4 Register Description: Device/Port Type – Indicates the specific type of this PCI Express Function. Note that different Functions in a multi- Function device can generally be of different types. 
Defined encodings are:

0000b PCI Express Endpoint
0001b Legacy PCI Express Endpoint
0100b Root Port of PCI Express Root Complex*
0101b Upstream Port of PCI Express Switch*
0110b Downstream Port of PCI Express Switch*
0111b PCI Express to PCI/PCI-X Bridge*
1000b PCI/PCI-X to PCI Express Bridge*
1001b Root Complex Integrated Endpoint

> > Hmm, actually I think I've already written code, for another patch, that > identifies the latter category. So everything where that check doesn't > fire can be deemed "PCI-PCI bridge". (This hook gets called only for > bridges.) > > Yet another alternative: if we go for the special PCI capability, for > exposing reservation sizes from QEMU to the firmware, then I can simply > search the capability list for just that capability. I think that could > be the easiest for me. > That would be a "later" step. BTW, following an offline chat with Michael S. Tsirkin regarding virtio 1.0 requiring 8M MMIO by default, we arrived at the conclusion that it is not really needed, and we came up with an alternative that will require less than 2M MMIO space. I put this here because the above solution will give us some time to deal with the MMIO ranges reservation. [...] >>>> + >>>> + >>>> +4. Hot Plug >>>> +============ >>>> +The root bus pcie.0 does not support hot-plug, so Integrated Devices, >>> >>> s/root bus/root complex/? Also, any root complexes added with pxb-pcie >>> don't support hotplug. >>> >> >> Actually pxb-pcie should support PCI Express Native Hotplug. > > Huh, interesting. > >> If they don't, it's a bug and I'll take care of it. > > Hmm, a bit lower down you mention that PCI Express native hot plug is > based on SHPCs. So, when you say that pxb-pcie should support PCI > Express Native Hotplug, you mean that it should occur through SHPC, right? > Yes, but I was talking about the Integrated SHPCs of the PCI Express Root Ports and PCI Express Downstream Ports.
(devices plugged into them) > However, for pxb-pci*, we had to disable SHPC: see QEMU commit > d10dda2d60c8 ("hw/pci-bridge: disable SHPC in PXB"), in June 2015. > This is only for the pxb device (not pxb-pcie) and only for the internal pci-bridge that comes with it. And... we don't use SHPC based hot-plug for PCI, only for PCI Express. For PCI we are using only the ACPI hotplug. So disabling it is not so bad. The pxb-pcie does not have the internal PCI bridge. You don't need it because: 1. You can't have Integrated Devices for pxb-pcie 2. The PCI Express Upstream Port is a type of PCI-Bridge anyway. > For background, the series around it was > <https://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05136.html> > -- I think v7 was the last version. > > ... Actually, now I wonder if d10dda2d60c8 should be possible to revert > at this point! Namely, in OVMF I may have unwittingly fixed this issue > -- obviously much later than the QEMU commit: in March 2016. See > > https://github.com/tianocore/edk2/commit/8f35eb92c419 > > If you look at the commit message of the QEMU patch, it says > > [...] > > Unfortunately, when this happens, the PCI_COMMAND_MEMORY bit is > clear in the root bus's command register [...] > > which I think should no longer be true, thanks to edk2 commit 8f35eb92c419. > > So maybe we should re-evaluate QEMU commit d10dda2d60c8. If pxb-pci and > pxb-pcie work with current OVMF, due to edk2 commit 8f35eb92c419, then > maybe we should revert QEMU commit d10dda2d60c8. > > Not urgent for me :), obviously, I'm just explaining so you can make a > note for later, if you wish to (if hot-plugging directly into pxb-pcie > should be necessary -- I think it's very low priority). > As stated above, since we don't use it anyway it doesn't matter. [...] >> Nope, you cannot hotplug a PCI Express Root Port or a PCI Express >> Downstream Port. 
>> The reason: The PCI Express Native Hotplug is based on SHPCs (Standard >> HotPlug Controllers) >> which are integrated only in the mentioned ports and not in Upstream >> Ports or the Root Complex. >> The "other" reason: When you buy a switch/server it has a number of >> ports and that's it. >> You cannot add "later". > > Makes sense, thank you. I think if you add the HMP example, it will make > it clear. I only assumed that you needed several monitor commands for > hotplugging a single switch (i.e., one command per one port) because on > the QEMU command line you do need a separate -device option for the > upstream port, and every single downstream port, of the same switch. > > If, using the monitor, it's just one device_add for the upstream port, > and the downstream ports are added automatically, then I guess it'll be > easy to understand. > No it doesn't work like that, you would need to add them one by one (upstream ports and then downstream ports) as far as I understand it. Actually I've never done it before, I'll try it first and update the doc on how it should be done. (if it can be done...) >> >>> >>> But, this question is actually irrelevant IMO, because here I would add >>> another subsection about *planning* for hot-plug. (I think that's pretty >>> important.) And those plans should make the hotplugging of switches >>> unnecessary! >>> >> >> I'll add a subsection for it. But when you are out of options you *can* >> hotplug a switch if your sysadmin skills are limited... > > You probably can, but then we'll run into the resource allocation > problem again: > > (1) The user will hotplug a switch (= S1) under a root port with, say, > two downstream ports (= S1-DP1, S1-DP2). > > (2) They'll then plug a PCI Express device into one of those downstream > ports (S1-DP1-Dev1). > > (3) Then they'll want to hot-plug *another* switch into the *other* > downstream port (S1-DP2-S2). 
> > > DP1 -- Dev1 (2) > / > root port -- S1 (1) > \ > DP2 -- S2 (3) > > However, concerning the resource needs of S2 (and especially the devices > hot-plugged under S2!), S1 won't have enough left over, because Dev1 > (under DP1) will have eaten into them, and Dev1's BARs will have been > programmed! > Theoretically the Guest OS should trigger PCI resources re-allocation but I agree we should not count on them. > We could never credibly explain our way out of this situation in a bug > report. For that reason, I think we should discourage hotplug ideas that > would change the topology, and require recursive resource allocation at > higher levels and/or parallel branches of the topology. > > I know Linux can do that, and it even succeeds if there is enough room, > but from the messages seen in the guest dmesg when it fails, how do you > explain to the user that they should have plugged in S2 first, and Dev1 > second? > > So, we should recommend *not* to hotplug switches or PCI-PCI bridges. > Instead, > - keep a very flat hierarchy from the start; > - for PCI Express, add as many root ports and downstream ports as you > deem enough for future hotplug needs (keeping the flat formula I described); > - for legacy PCI, add as many sibling PCI-PCI bridges directly under the > one DMI-PCI bridge as you deem sufficient for future hotplug needs. > > In short, don't change the hierarchy at runtime by hotplugging internal > nodes; hotplug *leaf nodes* only. > Agreed. I'll re-use some of your comments in the doc. >> [...] >> >> Gerd explicitly asked for the second idea (vendor specific capability) > > Nice, thank you for confirming it; let's do this then. It will also > simplify my work in the > EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() function: it should > suffice to scan the config space of the bridge, regardless of the > "PCI-PCI bridge / PCI Express root or downstream port" distinction. 
> Will do, but since we have a quick way to deal with the current issue (virtio 1.0 requiring 8MB MMIO while firmware reserving 2MB for PCI-Bridges hotplug) [...] Thanks, Marcel > > Cheers! > Laszlo > ^ permalink raw reply [flat|nested] 52+ messages in thread
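Marcel's point earlier in this exchange — that a switch cannot be hotplugged as one unit, only one component at a time — might look roughly like the following in the HMP monitor. This is an untested sketch (Marcel notes above he has not tried it yet): `rp3` is assumed to be a cold-plugged root port with a free slot, `x3130-upstream`/`xio3130-downstream` are the TI switch models QEMU currently ships, and the chassis/slot numbers are illustrative:

```
(qemu) device_add x3130-upstream,id=up1,bus=rp3
(qemu) device_add xio3130-downstream,id=dp1,bus=up1,chassis=9,slot=0
(qemu) device_add xio3130-downstream,id=dp2,bus=up1,chassis=9,slot=1
(qemu) device_add virtio-net-pci,bus=dp1
```

If this works at all, the caveat discussed above still applies: the resources available under the new switch are limited to whatever the firmware reserved behind rp3's bridge windows.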
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 13:31 ` Laszlo Ersek 2016-09-06 14:46 ` Marcel Apfelbaum @ 2016-09-07 6:21 ` Gerd Hoffmann 2016-09-07 8:06 ` Laszlo Ersek 2016-09-07 8:06 ` Marcel Apfelbaum 1 sibling, 2 replies; 52+ messages in thread From: Gerd Hoffmann @ 2016-09-07 6:21 UTC (permalink / raw) To: Laszlo Ersek Cc: Marcel Apfelbaum, qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson Hi, > >> ports, if that's allowed). For example: > >> > >> - 1-32 ports needed: use root ports only > >> > >> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 > >> downstream ports I expect you rarely need any switches. You can go multifunction with the pcie root ports. Which is how physical q35 works too btw, typically the root ports are on slot 1c for intel chipsets: nilsson root ~# lspci -s1c 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4) 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 2 (rev c4) 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 3 (rev c4) Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @ 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots. With 8 functions each you can have up to 224 root ports without any switches, and you have not many pci bus numbers left until you hit the 256 busses limit ... cheers, Gerd ^ permalink raw reply [flat|nested] 52+ messages in thread
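As a sketch, Gerd's multifunction layout maps to a QEMU command line like the one below. This is illustrative only: ioh3420 is the Intel root-port model QEMU provides today, and the IDs, chassis/slot numbers, and the 0x1c slot placement are assumptions chosen to mirror the lspci output above:

```
qemu-system-x86_64 -M q35 ... \
  -device ioh3420,id=rp1,bus=pcie.0,addr=0x1c.0x0,multifunction=on,chassis=1,slot=1 \
  -device ioh3420,id=rp2,bus=pcie.0,addr=0x1c.0x1,chassis=2,slot=2 \
  -device ioh3420,id=rp3,bus=pcie.0,addr=0x1c.0x2,chassis=3,slot=3 \
  -device virtio-net-pci,bus=rp1
```

Only function 0 of the slot carries multifunction=on; the remaining five functions of slot 0x1c stay free for further root ports.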
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 6:21 ` Gerd Hoffmann @ 2016-09-07 8:06 ` Laszlo Ersek 2016-09-07 8:23 ` Marcel Apfelbaum 2016-09-07 8:06 ` Marcel Apfelbaum 1 sibling, 1 reply; 52+ messages in thread From: Laszlo Ersek @ 2016-09-07 8:06 UTC (permalink / raw) To: Gerd Hoffmann, Marcel Apfelbaum Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson On 09/07/16 08:21, Gerd Hoffmann wrote: > Hi, > >>>> ports, if that's allowed). For example: >>>> >>>> - 1-32 ports needed: use root ports only >>>> >>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 >>>> downstream ports > > I expect you rarely need any switches. You can go multifunction with > the pcie root ports. Which is how physical q35 works too btw, typically > the root ports are on slot 1c for intel chipsets: > > nilsson root ~# lspci -s1c > 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset > Family PCI Express Root Port 1 (rev c4) > 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset > Family PCI Express Root Port 2 (rev c4) > 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset > Family PCI Express Root Port 3 (rev c4) > > Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @ > 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots. With > 8 functions each you can have up to 224 root ports without any switches, > and you have not many pci bus numbers left until you hit the 256 busses > limit ... This is an absolutely great idea. I wonder if it allows us to rip out all the language about switches, upstream ports and downstream ports. It would be awesome if we didn't have to mention and draw those things *at all* (better: if we could summarily discourage their use). Marcel, what do you think? Thanks Laszlo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 8:06 ` Laszlo Ersek @ 2016-09-07 8:23 ` Marcel Apfelbaum 0 siblings, 0 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-07 8:23 UTC (permalink / raw) To: Laszlo Ersek, Gerd Hoffmann Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson On 09/07/2016 11:06 AM, Laszlo Ersek wrote: > On 09/07/16 08:21, Gerd Hoffmann wrote: >> Hi, >> >>>>> ports, if that's allowed). For example: >>>>> >>>>> - 1-32 ports needed: use root ports only >>>>> >>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 >>>>> downstream ports >> >> I expect you rarely need any switches. You can go multifunction with >> the pcie root ports. Which is how physical q35 works too btw, typically >> the root ports are on slot 1c for intel chipsets: >> >> nilsson root ~# lspci -s1c >> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >> Family PCI Express Root Port 1 (rev c4) >> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >> Family PCI Express Root Port 2 (rev c4) >> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >> Family PCI Express Root Port 3 (rev c4) >> >> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @ >> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots. With >> 8 functions each you can have up to 224 root ports without any switches, >> and you have not many pci bus numbers left until you hit the 256 busses >> limit ... > > This is an absolutely great idea. I wonder if it allows us to rip out > all the language about switches, upstream ports and downstream ports. It > would be awesome if we didn't have to mention and draw those things *at > all* (better: if we could summarily discourage their use). > > Marcel, what do you think? 
While I do think using multi-function Root Ports is definitely the preferred way to go, keeping the switches around is not so bad, if only to have all PCI Express controllers available for testing scenarios. We can (and will) of course state that we prefer multi-function Root Ports over switches and ask libvirt/other management software not to add switches unless they are specifically requested by users. Thanks, Marcel > > Thanks > Laszlo > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 6:21 ` Gerd Hoffmann 2016-09-07 8:06 ` Laszlo Ersek @ 2016-09-07 8:06 ` Marcel Apfelbaum 2016-09-07 16:08 ` Alex Williamson 2016-09-07 17:55 ` Laine Stump 1 sibling, 2 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-07 8:06 UTC (permalink / raw) To: Gerd Hoffmann, Laszlo Ersek Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson On 09/07/2016 09:21 AM, Gerd Hoffmann wrote: > Hi, > >>>> ports, if that's allowed). For example: >>>> >>>> - 1-32 ports needed: use root ports only >>>> >>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 >>>> downstream ports > > I expect you rarely need any switches. You can go multifunction with > the pcie root ports. Which is how physical q35 works too btw, typically > the root ports are on slot 1c for intel chipsets: > > nilsson root ~# lspci -s1c > 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset > Family PCI Express Root Port 1 (rev c4) > 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset > Family PCI Express Root Port 2 (rev c4) > 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset > Family PCI Express Root Port 3 (rev c4) > > Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @ > 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots. With > 8 functions each you can have up to 224 root ports without any switches, > and you have not many pci bus numbers left until you hit the 256 busses > limit ... > Good point, maybe libvirt can avoid adding switches unless the user explicitly asked for them. I checked and it actually works fine in QEMU. BTW, once we implement ARI we could have up to 256 Root Ports on a single slot... Thanks, Marcel > cheers, > Gerd > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 8:06 ` Marcel Apfelbaum @ 2016-09-07 16:08 ` Alex Williamson 2016-09-07 19:32 ` Marcel Apfelbaum 2016-09-07 17:55 ` Laine Stump 1 sibling, 1 reply; 52+ messages in thread From: Alex Williamson @ 2016-09-07 16:08 UTC (permalink / raw) To: Marcel Apfelbaum Cc: Gerd Hoffmann, Laszlo Ersek, qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani On Wed, 7 Sep 2016 11:06:45 +0300 Marcel Apfelbaum <marcel@redhat.com> wrote: > On 09/07/2016 09:21 AM, Gerd Hoffmann wrote: > > Hi, > > > >>>> ports, if that's allowed). For example: > >>>> > >>>> - 1-32 ports needed: use root ports only > >>>> > >>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 > >>>> downstream ports > > > > I expect you rarely need any switches. You can go multifunction with > > the pcie root ports. Which is how physical q35 works too btw, typically > > the root ports are on slot 1c for intel chipsets: > > > > nilsson root ~# lspci -s1c > > 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset > > Family PCI Express Root Port 1 (rev c4) > > 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset > > Family PCI Express Root Port 2 (rev c4) > > 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset > > Family PCI Express Root Port 3 (rev c4) > > > > Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @ > > 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots. With > > 8 functions each you can have up to 224 root ports without any switches, > > and you have not many pci bus numbers left until you hit the 256 busses > > limit ... > > > > Good point, maybe libvirt can avoid adding switches unless the user explicitly > asked for them. I checked and it a actually works fine in QEMU. > > BTW, when we will implement ARI we can have up to 256 Root Ports on a single slot... "Root Ports on a single slot"... 
The entire idea of ARI is that there is no slot, the slot/function address space is combined into one big 8-bit free-for-all. Besides, can you do ARI on the root complex? Typically you need to look at whether the port upstream of a given device supports ARI to enable, there's no upstream PCI device on the root complex. This is the suggestion I gave you for switches, if the upstream switch port supports ARI then we can have 256 downstream switch ports (assuming ARI isn't only specific to downstream ports). Thanks, Alex ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 16:08 ` Alex Williamson @ 2016-09-07 19:32 ` Marcel Apfelbaum 0 siblings, 0 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-07 19:32 UTC (permalink / raw) To: Alex Williamson Cc: Gerd Hoffmann, Laszlo Ersek, qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani On 09/07/2016 07:08 PM, Alex Williamson wrote: > On Wed, 7 Sep 2016 11:06:45 +0300 > Marcel Apfelbaum <marcel@redhat.com> wrote: > >> On 09/07/2016 09:21 AM, Gerd Hoffmann wrote: >>> Hi, >>> >>>>>> ports, if that's allowed). For example: >>>>>> >>>>>> - 1-32 ports needed: use root ports only >>>>>> >>>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 >>>>>> downstream ports >>> >>> I expect you rarely need any switches. You can go multifunction with >>> the pcie root ports. Which is how physical q35 works too btw, typically >>> the root ports are on slot 1c for intel chipsets: >>> >>> nilsson root ~# lspci -s1c >>> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >>> Family PCI Express Root Port 1 (rev c4) >>> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >>> Family PCI Express Root Port 2 (rev c4) >>> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >>> Family PCI Express Root Port 3 (rev c4) >>> >>> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @ >>> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots. With >>> 8 functions each you can have up to 224 root ports without any switches, >>> and you have not many pci bus numbers left until you hit the 256 busses >>> limit ... >>> >> >> Good point, maybe libvirt can avoid adding switches unless the user explicitly >> asked for them. I checked and it a actually works fine in QEMU. >> >> BTW, when we will implement ARI we can have up to 256 Root Ports on a single slot... > > "Root Ports on a single slot"... 
The entire idea of ARI is that there > is no slot, the slot/function address space is combined into one big > 8-bit free-for-all. Besides, can you do ARI on the root complex? No, we can't :( Indeed, for the Root Complex bus we need the (bus:)dev:fn tuple, because we can have multiple devices plugged in. Thanks for the correction. Thanks, Marcel > Typically you need to look at whether the port upstream of a given > device supports ARI to enable, there's no upstream PCI device on the > root complex. This is the suggestion I gave you for switches, if the > upstream switch port supports ARI then we can have 256 downstream > switch ports (assuming ARI isn't only specific to downstream ports). > Thanks, > > Alex > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 8:06 ` Marcel Apfelbaum 2016-09-07 16:08 ` Alex Williamson @ 2016-09-07 17:55 ` Laine Stump 2016-09-07 19:39 ` Marcel Apfelbaum 2016-09-08 7:33 ` Gerd Hoffmann 1 sibling, 2 replies; 52+ messages in thread From: Laine Stump @ 2016-09-07 17:55 UTC (permalink / raw) To: qemu-devel Cc: Marcel Apfelbaum, Gerd Hoffmann, Laszlo Ersek, mst, Peter Maydell, Drew Jones, Andrea Bolognani, Alex Williamson On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote: > On 09/07/2016 09:21 AM, Gerd Hoffmann wrote: >> Hi, >> >>>>> ports, if that's allowed). For example: >>>>> >>>>> - 1-32 ports needed: use root ports only >>>>> >>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 >>>>> downstream ports >> >> I expect you rarely need any switches. You can go multifunction with >> the pcie root ports. Which is how physical q35 works too btw, typically >> the root ports are on slot 1c for intel chipsets: >> >> nilsson root ~# lspci -s1c >> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >> Family PCI Express Root Port 1 (rev c4) >> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >> Family PCI Express Root Port 2 (rev c4) >> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >> Family PCI Express Root Port 3 (rev c4) >> >> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @ >> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots. With >> 8 functions each you can have up to 224 root ports without any switches, >> and you have not many pci bus numbers left until you hit the 256 busses >> limit ... >> > > Good point, maybe libvirt can avoid adding switches unless the user > explicitly > asked for them. I checked and it a actually works fine in QEMU. I'm just now writing the code that auto-adds *-ports as they are needed, and doing it this way simplifies it *immensely*. 
When I had to think about the possibility of needing upstream/downstream switches, as an endpoint device was added, I would need to check if a (root|downstream)-port was available and if not I might be able to just add a root-port, or I might have to add a downstream-port; if the only option was a downstream port, then *that* might require adding a new *upstream* port. If I can limit libvirt to only auto-adding root-ports (and if there is no downside to putting multiple root ports on a single root bus port), then I just need to find an empty function of an empty slot on the root bus, add a root-port, and I'm done (and since 224 is *a lot*, I think at least for now it's okay to punt once they get past that point). So, *is* there any downside to doing this? ^ permalink raw reply [flat|nested] 52+ messages in thread
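The simplified allocation Laine describes can be sketched in a few lines of Python. This is hypothetical pseudologic, not libvirt code; the reserved-slot set mirrors Gerd's example q35 layout above, and it also confirms the 224-port ceiling he computes:

```python
# Sketch: pick the next free (slot, function) on the root bus for a new
# root port, filling all 8 functions of a slot before moving to the next.
# Reserved slots mirror Gerd's example: host bridge @ 00, vga @ 01,
# pci bridge @ 1e, lpc+sata @ 1f.
RESERVED_SLOTS = {0x00, 0x01, 0x1e, 0x1f}

def next_root_port_addr(used):
    """used: set of (slot, fn) pairs already occupied on pcie.0."""
    for slot in range(32):
        if slot in RESERVED_SLOTS:
            continue
        for fn in range(8):
            if (slot, fn) not in used:
                return slot, fn
    return None  # all 28 usable slots full

# Exhausting the bus reproduces the 224-root-port limit mentioned above.
used = set()
while (addr := next_root_port_addr(used)) is not None:
    used.add(addr)
print(len(used))  # 224
```

The endpoint-add path then reduces to one lookup: if `next_root_port_addr` returns an address, add a root port there; only past 224 ports would switches become necessary.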
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 17:55 ` Laine Stump @ 2016-09-07 19:39 ` Marcel Apfelbaum 2016-09-07 20:34 ` Laine Stump 2016-09-15 8:38 ` Andrew Jones 2016-09-08 7:33 ` Gerd Hoffmann 1 sibling, 2 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-07 19:39 UTC (permalink / raw) To: Laine Stump, qemu-devel Cc: Gerd Hoffmann, Laszlo Ersek, mst, Peter Maydell, Drew Jones, Andrea Bolognani, Alex Williamson On 09/07/2016 08:55 PM, Laine Stump wrote: > On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote: >> On 09/07/2016 09:21 AM, Gerd Hoffmann wrote: >>> Hi, >>> >>>>>> ports, if that's allowed). For example: >>>>>> >>>>>> - 1-32 ports needed: use root ports only >>>>>> >>>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 >>>>>> downstream ports >>> >>> I expect you rarely need any switches. You can go multifunction with >>> the pcie root ports. Which is how physical q35 works too btw, typically >>> the root ports are on slot 1c for intel chipsets: >>> >>> nilsson root ~# lspci -s1c >>> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >>> Family PCI Express Root Port 1 (rev c4) >>> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >>> Family PCI Express Root Port 2 (rev c4) >>> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >>> Family PCI Express Root Port 3 (rev c4) >>> >>> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @ >>> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots. With >>> 8 functions each you can have up to 224 root ports without any switches, >>> and you have not many pci bus numbers left until you hit the 256 busses >>> limit ... >>> >> >> Good point, maybe libvirt can avoid adding switches unless the user >> explicitly >> asked for them. I checked and it a actually works fine in QEMU. 
> > I'm just now writing the code that auto-adds *-ports as they are needed, and doing it this way simplifies it *immensely*. > > When I had to think about the possibility of needing upstream/downstream switches, as an endpoint device was added, I would need to check if a (root|downstream)-port was available and if not I might > be able to just add a root-port, or I might have to add a downstream-port; if the only option was a downstream port, then *that* might require adding a new *upstream* port. > > If I can limit libvirt to only auto-adding root-ports (and if there is no downside to putting multiple root ports on a single root bus port), then I just need to find an empty function of an empty > slot on the root bus, add a root-port, and I'm done (and since 224 is *a lot*, I think at least for now it's okay to punt once they get past that point). > > So, *is* there any downside to doing this? > No downside I can think of. Just be sure to emphasize the auto-add mechanism stops at 'x' devices. If the user needs more, he should manually add switches and manually assign the devices to the Downstream Ports. Thanks, Marcel > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 19:39 ` Marcel Apfelbaum @ 2016-09-07 20:34 ` Laine Stump 2016-09-15 8:38 ` Andrew Jones 1 sibling, 0 replies; 52+ messages in thread From: Laine Stump @ 2016-09-07 20:34 UTC (permalink / raw) To: qemu-devel Cc: Marcel Apfelbaum, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann, Laszlo Ersek On 09/07/2016 03:39 PM, Marcel Apfelbaum wrote: > On 09/07/2016 08:55 PM, Laine Stump wrote: >> On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote: >>> On 09/07/2016 09:21 AM, Gerd Hoffmann wrote: >>>> Hi, >>>> >>>>>>> ports, if that's allowed). For example: >>>>>>> >>>>>>> - 1-32 ports needed: use root ports only >>>>>>> >>>>>>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32 >>>>>>> downstream ports >>>> >>>> I expect you rarely need any switches. You can go multifunction with >>>> the pcie root ports. Which is how physical q35 works too btw, >>>> typically >>>> the root ports are on slot 1c for intel chipsets: >>>> >>>> nilsson root ~# lspci -s1c >>>> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >>>> Family PCI Express Root Port 1 (rev c4) >>>> 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >>>> Family PCI Express Root Port 2 (rev c4) >>>> 00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset >>>> Family PCI Express Root Port 3 (rev c4) >>>> >>>> Root bus has 32 slots, a few are taken (host bridge @ 00.0, lpc+sata @ >>>> 1f.*, pci bridge @ 1e.0, maybe vga @ 01.0), leaving 28 free slots. >>>> With >>>> 8 functions each you can have up to 224 root ports without any >>>> switches, >>>> and you have not many pci bus numbers left until you hit the 256 busses >>>> limit ... >>>> >>> >>> Good point, maybe libvirt can avoid adding switches unless the user >>> explicitly >>> asked for them. I checked and it a actually works fine in QEMU. 
>> >> I'm just now writing the code that auto-adds *-ports as they are >> needed, and doing it this way simplifies it *immensely*. >> >> When I had to think about the possibility of needing >> upstream/downstream switches, as an endpoint device was added, I would >> need to check if a (root|downstream)-port was available and if not I >> might >> be able to just add a root-port, or I might have to add a >> downstream-port; if the only option was a downstream port, then *that* >> might require adding a new *upstream* port. >> >> If I can limit libvirt to only auto-adding root-ports (and if there is >> no downside to putting multiple root ports on a single root bus port), >> then I just need to find an empty function of an empty >> slot on the root bus, add a root-port, and I'm done (and since 224 is >> *a lot*, I think at least for now it's okay to punt once they get past >> that point). >> >> So, *is* there any downside to doing this? >> > > No downside I can think of. > Just be sure to emphasize the auto-add mechanism stops at 'x' devices. > If the user needs more, > he should manually add switches and manually assign the devices to the > Downstream Ports. Actually, just the former - once the downstream ports are added, they'll automatically be used for endpoint devices (and even new upstream ports) as needed. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 19:39 ` Marcel Apfelbaum 2016-09-07 20:34 ` Laine Stump @ 2016-09-15 8:38 ` Andrew Jones 2016-09-15 14:20 ` Marcel Apfelbaum 1 sibling, 1 reply; 52+ messages in thread From: Andrew Jones @ 2016-09-15 8:38 UTC (permalink / raw) To: Marcel Apfelbaum Cc: Laine Stump, qemu-devel, Peter Maydell, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann, Laszlo Ersek On Wed, Sep 07, 2016 at 10:39:28PM +0300, Marcel Apfelbaum wrote: > On 09/07/2016 08:55 PM, Laine Stump wrote: > > On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote: [snip] > > > Good point, maybe libvirt can avoid adding switches unless the user > > > explicitly > > > asked for them. I checked and it a actually works fine in QEMU. > > > > I'm just now writing the code that auto-adds *-ports as they are needed, and doing it this way simplifies it *immensely*. > > > > When I had to think about the possibility of needing upstream/downstream switches, as an endpoint device was added, I would need to check if a (root|downstream)-port was available and if not I might > > be able to just add a root-port, or I might have to add a downstream-port; if the only option was a downstream port, then *that* might require adding a new *upstream* port. > > > > If I can limit libvirt to only auto-adding root-ports (and if there is no downside to putting multiple root ports on a single root bus port), then I just need to find an empty function of an empty > > slot on the root bus, add a root-port, and I'm done (and since 224 is *a lot*, I think at least for now it's okay to punt once they get past that point). > > > > So, *is* there any downside to doing this? > > > > No downside I can think of. > Just be sure to emphasize the auto-add mechanism stops at 'x' devices. If the user needs more, > he should manually add switches and manually assign the devices to the Downstream Ports. > Just catching up on mail after vacation and read this thread. 
Thanks Marcel for writing this document (I guess a v1 is coming soon). This will be very useful for determining the best default configuration of a virtio-pci mach-virt. FWIW, here is the proposal that I started formulating when I experimented with this several months ago; - PCIe-only (disable-modern=off, disable-legacy=on) - No legacy PCI support, i.e. no bridges (yup, I'm a PCIe purist, but don't have a leg to stand on if push came to shove) - use one or more ports for virtio-scsi controllers for disks, one is probably enough - use one or more ports with multifunction, allowing up to 8 functions, for virtio-net, one port is probably enough - Add N extra ports for hotplug, N defaulting to 2 - hotplug devices to first N-1 ports, reserving last for a switch - if switch is needed, hotplug it with M downstream ports (M defaulting to 2*(N-1)+1) - Encourage somebody to develop generic versions of ports and switches, hi Marcel :-), and exclusively use those in the configuration Thanks, drew ^ permalink raw reply [flat|nested] 52+ messages in thread
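The switch-sizing rule in Drew's proposal can be jotted down as a trivial sketch (the function name is made up; only the formula comes from his mail):

```python
# Drew's proposed defaults: N extra root ports for hotplug (N=2), the
# last one reserved for a switch with M downstream ports, where
# M defaults to 2*(N-1)+1.
def switch_downstream_ports(n_extra_ports=2):
    return 2 * (n_extra_ports - 1) + 1

print(switch_downstream_ports())   # 3
print(switch_downstream_ports(4))  # 7
```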
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-15 8:38 ` Andrew Jones @ 2016-09-15 14:20 ` Marcel Apfelbaum 2016-09-16 16:50 ` Andrea Bolognani 0 siblings, 1 reply; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-15 14:20 UTC (permalink / raw) To: Andrew Jones Cc: Laine Stump, qemu-devel, Peter Maydell, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann, Laszlo Ersek On 09/15/2016 11:38 AM, Andrew Jones wrote: > On Wed, Sep 07, 2016 at 10:39:28PM +0300, Marcel Apfelbaum wrote: >> On 09/07/2016 08:55 PM, Laine Stump wrote: >>> On 09/07/2016 04:06 AM, Marcel Apfelbaum wrote: > [snip] >>>> Good point, maybe libvirt can avoid adding switches unless the user >>>> explicitly >>>> asked for them. I checked and it a actually works fine in QEMU. >>> >>> I'm just now writing the code that auto-adds *-ports as they are needed, and doing it this way simplifies it *immensely*. >>> >>> When I had to think about the possibility of needing upstream/downstream switches, as an endpoint device was added, I would need to check if a (root|downstream)-port was available and if not I might >>> be able to just add a root-port, or I might have to add a downstream-port; if the only option was a downstream port, then *that* might require adding a new *upstream* port. >>> >>> If I can limit libvirt to only auto-adding root-ports (and if there is no downside to putting multiple root ports on a single root bus port), then I just need to find an empty function of an empty >>> slot on the root bus, add a root-port, and I'm done (and since 224 is *a lot*, I think at least for now it's okay to punt once they get past that point). >>> >>> So, *is* there any downside to doing this? >>> >> >> No downside I can think of. >> Just be sure to emphasize the auto-add mechanism stops at 'x' devices. If the user needs more, >> he should manually add switches and manually assign the devices to the Downstream Ports. 
>> > > Just catching up on mail after vacation and read this thread. Thanks > Marcel for writing this document (I guess a v1 is coming soon). Yes, I am sorry but I got caught up with other stuff and I am going to be on PTO for a week, so V1 will take a little more time than I planned. This > will be very useful for determining the best default configuration of > a virtio-pci mach-virt. > It would be very good if this doc matched both x86 and mach-virt PCIe machines. Your review would be appreciated. > FWIW, here is the proposal that I started formulating when I experimented > with this several months ago; > > - PCIe-only (disable-modern=off, disable-legacy=on) If the virtio devices are plugged into PCI Express Root Ports or Downstream Ports, this is already the default configuration; you don't need to add the disable-* options anymore. > - No legacy PCI support, i.e. no bridges (yup, I'm a PCIe purist, > but don't have a leg to stand on if push came to shove) Yes... We'll say that legacy PCI support is optional. > - use one or more ports for virtio-scsi controllers for disks, one is > probably enough > - use one or more ports with multifunction, allowing up to 8 functions, > for virtio-net, one port is probably enough As Alex Williamson mentioned, PCI Express Root Ports are actually functions, not devices, so you can have up to 8 Ports per slot. This is better than making the virtio-* devices multi-function, because then you would need to hot-plug/hot-unplug all of them together. If hot-plug is not an issue, plugging multi-function devices into multi-function Ports will save a lot of bus numbers, which are, as Laszlo mentioned, a scarce resource.
> - Add N extra ports for hotplug, N defaulting to 2 > - hotplug devices to first N-1 ports, reserving last for a switch > - if switch is needed, hotplug it with M downstream ports > (M defaulting to 2*(N-1)+1) We would prefer multi-function ports to switches, since you'll run out of bus numbers before you use all the PCI Express Root Ports anyway (see previous mails). However, switches will still be supported for cases where you have a lot of Integrated Devices and the Root Ports are not enough, or to enable some testing scenarios. > - Encourage somebody to develop generic versions of ports and switches, > hi Marcel :-), and exclusively use those in the configuration > My goal is to try to come up with them for 2.8, but since I haven't started to work on them yet, I can't commit :) Thanks, Marcel > Thanks, > drew > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-15 14:20 ` Marcel Apfelbaum @ 2016-09-16 16:50 ` Andrea Bolognani 0 siblings, 0 replies; 52+ messages in thread From: Andrea Bolognani @ 2016-09-16 16:50 UTC (permalink / raw) To: Marcel Apfelbaum, Andrew Jones Cc: Laine Stump, qemu-devel, Peter Maydell, mst, Alex Williamson, Gerd Hoffmann, Laszlo Ersek On Thu, 2016-09-15 at 17:20 +0300, Marcel Apfelbaum wrote: > > Just catching up on mail after vacation and read this thread. Thanks > > Marcel for writing this document (I guess a v1 is coming soon). > > Yes, I am sorry but I got caught up with other stuff and I am > going to be in PTO for a week, so V1 will take a little more time > than I planned. I finally caught up as well, and while I don't have much value to contribute to the conversation, let me say this: everything about this thread is absolutely awesome! The amount of information one can absorb from the discussion alone is amazing, but the guidelines contained in the document we're crafting will certainly prove to be invaluable to users and people working higher up in the stack alike. Thanks Marcel and everyone involved. You guys rock! :) -- Andrea Bolognani / Red Hat / Virtualization ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 17:55 ` Laine Stump 2016-09-07 19:39 ` Marcel Apfelbaum @ 2016-09-08 7:33 ` Gerd Hoffmann 1 sibling, 0 replies; 52+ messages in thread From: Gerd Hoffmann @ 2016-09-08 7:33 UTC (permalink / raw) To: Laine Stump Cc: qemu-devel, Marcel Apfelbaum, Laszlo Ersek, mst, Peter Maydell, Drew Jones, Andrea Bolognani, Alex Williamson Hi, > > Good point, maybe libvirt can avoid adding switches unless the user > > explicitly > > asked for them. I checked and it actually works fine in QEMU. > So, *is* there any downside to doing this? I don't think so. The only issue I can think of when it comes to multifunction is hotplug, because hotplug works at slot level in PCI so you can't hotplug single functions. But as you can't hotplug the root ports in the first place, this is nothing we have to worry about in this specific case. cheers, Gerd ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-05 16:24 ` Laszlo Ersek 2016-09-05 20:02 ` Marcel Apfelbaum @ 2016-09-06 11:35 ` Gerd Hoffmann 2016-09-06 13:58 ` Laine Stump ` (2 more replies) 2016-10-04 14:59 ` Daniel P. Berrange 2 siblings, 3 replies; 52+ messages in thread From: Gerd Hoffmann @ 2016-09-06 11:35 UTC (permalink / raw) To: Laszlo Ersek Cc: Marcel Apfelbaum, qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson Hi, > > +Plug only legacy PCI devices as Root Complex Integrated Devices > > +even if the PCIe spec does not forbid PCIe devices. > > I suggest "even though the PCI Express spec does not forbid PCI Express > devices as Integrated Devices". (Detail is good!) While talking about integrated devices: There is docs/q35-chipset.cfg, which documents how to mimic q35 with integrated devices as close and complete as possible. Usage: qemu-system-x86_64 -M q35 -readconfig docs/q35-chipset.cfg $args Side note for usb: In practice you don't want to use the tons of uhci/ehci controllers present in the original q35 but plug xhci into one of the pcie root ports instead (unless your guest doesn't support xhci). > > +as required by PCI spec will reserve a 4K IO range for each. > > +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize > > +it by allocation the IO space only if there is at least a device > > +with IO BARs plugged into the bridge. > > This used to be true, but is no longer true, for OVMF. And I think it's > actually correct: we *should* keep the 4K IO reservation per PCI-PCI bridge. > > (But, certainly no IO reservation for PCI Express root port, upstream > port, or downstream port! And i'll need your help for telling these > apart in OVMF.) IIRC the same is true for seabios, it looks for the pcie capability and skips io space allocation on pcie ports only. 
Side note: the linux kernel allocates io space nevertheless, so checking /proc/ioports after boot doesn't tell you what the firmware did. cheers, Gerd ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 11:35 ` Gerd Hoffmann @ 2016-09-06 13:58 ` Laine Stump 2016-09-07 7:04 ` Gerd Hoffmann 2016-09-06 14:47 ` Marcel Apfelbaum 2016-09-07 7:53 ` Laszlo Ersek 2 siblings, 1 reply; 52+ messages in thread From: Laine Stump @ 2016-09-06 13:58 UTC (permalink / raw) To: Gerd Hoffmann, Laszlo Ersek Cc: Marcel Apfelbaum, qemu-devel, mst, Peter Maydell, Drew Jones, Andrea Bolognani, Alex Williamson On 09/06/2016 07:35 AM, Gerd Hoffmann wrote: > While talking about integrated devices: There is docs/q35-chipset.cfg, > which documents how to mimic q35 with integrated devices as close and > complete as possible. > > Usage: > qemu-system-x86_64 -M q35 -readconfig docs/q35-chipset.cfg $args > > Side note for usb: In practice you don't want to use the tons of > uhci/ehci controllers present in the original q35 but plug xhci into one > of the pcie root ports instead (unless your guest doesn't support xhci). I've wondered about that recently. For i440fx machinetypes if you don't specify a USB controller in libvirt's domain config, you will automatically get the PIIX3 USB controller added. In order to maintain consistency on the topic of "auto-adding USB when not specified", if the machinetype is Q35 we will autoadd a set of USB2 (uhci/ehci) controllers (I think I added that based on your comments at the time :-). But recently I've mostly been hearing that people should use xhci instead. So should libvirt add a single xhci (rather than the uhci/ehci set) at the same port when no USB is specified? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 13:58 ` Laine Stump @ 2016-09-07 7:04 ` Gerd Hoffmann 2016-09-07 18:20 ` Laine Stump 0 siblings, 1 reply; 52+ messages in thread From: Gerd Hoffmann @ 2016-09-07 7:04 UTC (permalink / raw) To: Laine Stump Cc: Laszlo Ersek, Marcel Apfelbaum, qemu-devel, mst, Peter Maydell, Drew Jones, Andrea Bolognani, Alex Williamson Hi, > > Side note for usb: In practice you don't want to use the tons of > > uhci/ehci controllers present in the original q35 but plug xhci into one > > of the pcie root ports instead (unless your guest doesn't support xhci). > > I've wondered about that recently. For i440fx machinetypes if you don't > specify a USB controller in libvirt's domain config, you will > automatically get the PIIX3 USB controller added. In order to maintain > consistency on the topic of "auto-adding USB when not specified", if the > machinetype is Q35 we will autoadd a set of USB2 (uhci/ehci) controllers > (I think I added that based on your comments at the time :-). But > recently I've mostly been hearing that people should use xhci instead. > So should libvirt add a single xhci (rather than the uhci/ehci set) at > the same port when no USB is specified? The big advantage of xhci is that the hardware design is much more virtualization friendly, i.e. it needs a lot fewer CPU cycles to emulate than uhci/ohci/ehci. Also xhci can handle all usb speeds, so you don't need the complicated uhci/ehci companion setup with uhci for usb1 and ehci for usb2 devices. The problem with xhci is guest support, which becomes less and less of a problem over time of course. All our firmware (seabios/edk2/slof) has xhci support meanwhile. ppc64 switched from ohci to xhci by default in rhel-7.3. Finding linux guests without xhci support is pretty hard meanwhile. Maybe RHEL-5 qualifies. Windows 8 and newer ship with xhci drivers. 
So, yea, maybe it's time to switch the default for q35 to xhci, especially if we keep uhci as default for i440fx and suggest to use that machine type for oldish guests. But I'd suggest to place xhci in a pcie root port then, so maybe wait with that until libvirt can auto-add pcie root ports as needed ... cheers, Gerd ^ permalink raw reply [flat|nested] 52+ messages in thread
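A minimal sketch of the layout Gerd suggests — an illustrative fragment only: the device ids are invented, and nec-usb-xhci was the xHCI model available at the time:

```
qemu-system-x86_64 -M q35 \
  -device ioh3420,id=rp1,bus=pcie.0,chassis=1,slot=1 \
  -device nec-usb-xhci,id=usb0,bus=rp1
```

One xHCI controller in one Root Port replaces the whole uhci/ehci companion set, and handles USB1/2/3 devices by itself.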
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 7:04 ` Gerd Hoffmann @ 2016-09-07 18:20 ` Laine Stump 2016-09-08 7:26 ` Gerd Hoffmann 0 siblings, 1 reply; 52+ messages in thread From: Laine Stump @ 2016-09-07 18:20 UTC (permalink / raw) To: Gerd Hoffmann, qemu-devel Cc: Laszlo Ersek, Marcel Apfelbaum, mst, Peter Maydell, Drew Jones, Andrea Bolognani, Alex Williamson On 09/07/2016 03:04 AM, Gerd Hoffmann wrote: > Hi, > >>> Side note for usb: In practice you don't want to use the tons of >>> uhci/ehci controllers present in the original q35 but plug xhci into one >>> of the pcie root ports instead (unless your guest doesn't support xhci). >> >> I've wondered about that recently. For i440fx machinetypes if you don't >> specify a USB controller in libvirt's domain config, you will >> automatically get the PIIX3 USB controller added. In order to maintain >> consistency on the topic of "auto-adding USB when not specified", if the >> machinetype is Q35 we will autoadd a set of USB2 (uhci/ehci) controllers >> (I think I added that based on your comments at the time :-). But >> recently I've mostly been hearing that people should use xhci instead. >> So should libvirt add a single xhci (rather than the uhci/ehci set) at >> the same port when no USB is specified? > > Big advantage of xhci is that the hardware design is much more > virtualization friendly, i.e. it needs alot less cpu cycles to emulate > than uhci/ohci/ehci. Also xhci can handle all usb speeds, so you don't > need the complicated uhci/ehci companion setup with uhci for usb1 and > ehci for usb2 devices. > > The problem with xhci is guest support. Which becomes less and less of > a problem over time of course. All our firmware (seabios/edk2/slof) has > xhci support meanwhile. ppc64 switched from ohci to xhci by default in > rhel-7.3. Finding linux guests without xhci support is pretty hard > meanwhile. Maybe RHEL-5 qualifies. Windows 8 + newer ships with xhci > drivers. 
> > So, yea, maybe it's time to switch the default for q35 to xhci, > especially if we keep uhci as default for i440fx and suggest to use that > machine type for oldish guests. But I'd suggest to place xhci in a pcie > root port then, so maybe wait with that until libvirt can auto-add pcie > root ports as needed ... I'm doing that right now (giving libvirt the ability to auto-add root ports) :-) I had understood that the xhci could be a legacy PCI device or a PCI Express device depending on the socket it was plugged into (or was that possibly just someone doing some hand-waving over the fact that obscuring the PCI Express capabilities effectively turns it into a legacy PCI device?). If that's the case, why do you prefer the default USB controller to be added in a root-port rather than as an integrated device (which is what we do with the group of USB2 controllers, as well as the primary video device) ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 18:20 ` Laine Stump @ 2016-09-08 7:26 ` Gerd Hoffmann 0 siblings, 0 replies; 52+ messages in thread From: Gerd Hoffmann @ 2016-09-08 7:26 UTC (permalink / raw) To: Laine Stump Cc: qemu-devel, Laszlo Ersek, Marcel Apfelbaum, mst, Peter Maydell, Drew Jones, Andrea Bolognani, Alex Williamson Hi, > I had understood that the xhci could be a legacy PCI device or a PCI > Express device depending on the socket it was plugged into (or was that > possibly just someone doing some hand-waving over the fact that > obscuring the PCI Express capabilities effectively turns it into a > legacy PCI device?). That is correct, it'll work both ways. > If that's the case, why do you prefer the default > USB controller to be added in a root-port rather than as an integrated > device (which is what we do with the group of USB2 controllers, as well > as the primary video device) Trying to mimic real hardware as closely as possible. The ich9 uhci/ehci controllers are actually integrated chipset devices. The nec xhci is an express device in physical hardware. That is more a personal preference though; there are no strong technical reasons to do it that way. cheers, Gerd ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 11:35 ` Gerd Hoffmann 2016-09-06 13:58 ` Laine Stump @ 2016-09-06 14:47 ` Marcel Apfelbaum 2016-09-07 7:53 ` Laszlo Ersek 2 siblings, 0 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-06 14:47 UTC (permalink / raw) To: Gerd Hoffmann, Laszlo Ersek Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson On 09/06/2016 02:35 PM, Gerd Hoffmann wrote: > Hi, > >>> +Plug only legacy PCI devices as Root Complex Integrated Devices >>> +even if the PCIe spec does not forbid PCIe devices. >> >> I suggest "even though the PCI Express spec does not forbid PCI Express >> devices as Integrated Devices". (Detail is good!) > > While talking about integrated devices: There is docs/q35-chipset.cfg, > which documents how to mimic q35 with integrated devices as close and > complete as possible. > > Usage: > qemu-system-x86_64 -M q35 -readconfig docs/q35-chipset.cfg $args > > Side note for usb: In practice you don't want to use the tons of > uhci/ehci controllers present in the original q35 but plug xhci into one > of the pcie root ports instead (unless your guest doesn't support xhci). > Hi Gerd, Thanks for the comments, I'll be sure to refer them in the doc. Marcel [...] ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 11:35 ` Gerd Hoffmann 2016-09-06 13:58 ` Laine Stump 2016-09-06 14:47 ` Marcel Apfelbaum @ 2016-09-07 7:53 ` Laszlo Ersek 2016-09-07 7:57 ` Marcel Apfelbaum 2 siblings, 1 reply; 52+ messages in thread From: Laszlo Ersek @ 2016-09-07 7:53 UTC (permalink / raw) To: Gerd Hoffmann Cc: Marcel Apfelbaum, qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson On 09/06/16 13:35, Gerd Hoffmann wrote: > Hi, > >>> +Plug only legacy PCI devices as Root Complex Integrated Devices >>> +even if the PCIe spec does not forbid PCIe devices. >> >> I suggest "even though the PCI Express spec does not forbid PCI Express >> devices as Integrated Devices". (Detail is good!) > > While talking about integrated devices: There is docs/q35-chipset.cfg, > which documents how to mimic q35 with integrated devices as close and > complete as possible. > > Usage: > qemu-system-x86_64 -M q35 -readconfig docs/q35-chipset.cfg $args > > Side note for usb: In practice you don't want to use the tons of > uhci/ehci controllers present in the original q35 but plug xhci into one > of the pcie root ports instead (unless your guest doesn't support xhci). > >>> +as required by PCI spec will reserve a 4K IO range for each. >>> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize >>> +it by allocation the IO space only if there is at least a device >>> +with IO BARs plugged into the bridge. >> >> This used to be true, but is no longer true, for OVMF. And I think it's >> actually correct: we *should* keep the 4K IO reservation per PCI-PCI bridge. >> >> (But, certainly no IO reservation for PCI Express root port, upstream >> port, or downstream port! And i'll need your help for telling these >> apart in OVMF.) > > IIRC the same is true for seabios, it looks for the pcie capability and > skips io space allocation on pcie ports only. 
> > Side note: the linux kernel allocates io space nevertheless, so > checking /proc/ioports after boot doesn't tell you what the firmware > did. Yeah, we've got to convince Linux to stop doing that. Earlier Alex mentioned the "hpiosize" and "hpmemsize" PCI subsystem options for the kernel: hpiosize=nn[KMG] The fixed amount of bus space which is reserved for hotplug bridge's IO window. Default size is 256 bytes. hpmemsize=nn[KMG] The fixed amount of bus space which is reserved for hotplug bridge's memory window. Default size is 2 megabytes. This document (once complete) would be the basis for tweaking that stuff in the kernel too. Primarily, "hpiosize" should default to zero, because its current nonzero default (which gets rounded up to 4KB somewhere) is what exhausts the IO space, if we have more than a handful of PCI Express downstream / root ports. Maybe we can add a PCI quirk for this to the kernel, for QEMU's PCI Express ports (all of them -- root, upstream, downstream). Thanks Laszlo ^ permalink raw reply [flat|nested] 52+ messages in thread
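For reference, the knobs Laszlo quotes are passed through the `pci=` option on the guest kernel command line; a fragment (the values here are illustrative, not a recommendation) would look like:

```
# appended to the guest kernel command line
pci=hpiosize=0,hpmemsize=2M
```

With hpiosize=0 the kernel stops reserving a 4K IO window behind every hotplug-capable port, which is exactly what exhausts the 64K IO space when many Root Ports are present.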
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-07 7:53 ` Laszlo Ersek @ 2016-09-07 7:57 ` Marcel Apfelbaum 0 siblings, 0 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-07 7:57 UTC (permalink / raw) To: Laszlo Ersek, Gerd Hoffmann Cc: qemu-devel, mst, Peter Maydell, Drew Jones, Laine Stump, Andrea Bolognani, Alex Williamson On 09/07/2016 10:53 AM, Laszlo Ersek wrote: > On 09/06/16 13:35, Gerd Hoffmann wrote: >> Hi, >> [...] >> >> Side note: the linux kernel allocates io space nevertheless, so >> checking /proc/ioports after boot doesn't tell you what the firmware >> did. > > Yeah, we've got to convince Linux to stop doing that. Earlier Alex > mentioned the "hpiosize" and "hpmemsize" PCI subsystem options for the > kernel: > > hpiosize=nn[KMG] The fixed amount of bus space which is > reserved for hotplug bridge's IO window. > Default size is 256 bytes. > hpmemsize=nn[KMG] The fixed amount of bus space which is > reserved for hotplug bridge's memory window. > Default size is 2 megabytes. > > This document (once complete) would be the basis for tweaking that stuff > in the kernel too. Primarily, "hpiosize" should default to zero, because > its current nonzero default (which gets rounded up to 4KB somewhere) is > what exhausts the IO space, if we have more than a handful of PCI > Express downstream / root ports. > > Maybe we can add a PCI quirk for this to the kernel, for QEMU's PCI > Express ports (all of them -- root, upstream, downstream). > Yes, once we will have our "own" controllers and not Intel emulations as today. Thanks, Marcel > Thanks > Laszlo > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-05 16:24 ` Laszlo Ersek 2016-09-05 20:02 ` Marcel Apfelbaum 2016-09-06 11:35 ` Gerd Hoffmann @ 2016-10-04 14:59 ` Daniel P. Berrange 2016-10-04 15:40 ` Laszlo Ersek 2016-10-04 15:45 ` Alex Williamson 2 siblings, 2 replies; 52+ messages in thread From: Daniel P. Berrange @ 2016-10-04 14:59 UTC (permalink / raw) To: Laszlo Ersek Cc: Marcel Apfelbaum, qemu-devel, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann, Laine Stump On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: > On 09/01/16 15:22, Marcel Apfelbaum wrote: > > +2.3 PCI only hierarchy > > +====================== > > +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or > > +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges > > +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged > > +only into pcie.0 bus. > > + > > + pcie.0 bus > > + ---------------------------------------------- > > + | | > > + ----------- ------------------ > > + | PCI Dev | | DMI-PCI BRIDGE | > > + ---------- ------------------ > > + | | > > + ----------- ------------------ > > + | PCI Dev | | PCI-PCI Bridge | > > + ----------- ------------------ > > + | | > > + ----------- ----------- > > + | PCI Dev | | PCI Dev | > > + ----------- ----------- > > Works for me, but I would again elaborate a little bit on keeping the > hierarchy flat. > > First, in order to preserve compatibility with libvirt's current > behavior, let's not plug a PCI device directly in to the DMI-PCI bridge, > even if that's possible otherwise. Let's just say > > - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy > is required), Why do you suggest this ? 
If the guest has multiple NUMA nodes and you're creating a PXB for each NUMA node, then it looks valid to want to have a DMI-PCI bridge attached to each PXB, so you can have legacy PCI devices on each NUMA node, instead of putting them all on the PCI bridge without NUMA affinity. > - only PCI-PCI bridges should be plugged into the DMI-PCI bridge, What's the rationale for that, as opposed to plugging devices directly into the DMI-PCI bridge, which seems to work? Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :| ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 14:59 ` Daniel P. Berrange @ 2016-10-04 15:40 ` Laszlo Ersek 2016-10-04 16:10 ` Laine Stump 2016-10-04 15:45 ` Alex Williamson 1 sibling, 1 reply; 52+ messages in thread From: Laszlo Ersek @ 2016-10-04 15:40 UTC (permalink / raw) To: Daniel P. Berrange Cc: Marcel Apfelbaum, qemu-devel, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann, Laine Stump On 10/04/16 16:59, Daniel P. Berrange wrote: > On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: >> On 09/01/16 15:22, Marcel Apfelbaum wrote: >>> +2.3 PCI only hierarchy >>> +====================== >>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or >>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges >>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged >>> +only into pcie.0 bus. >>> + >>> + pcie.0 bus >>> + ---------------------------------------------- >>> + | | >>> + ----------- ------------------ >>> + | PCI Dev | | DMI-PCI BRIDGE | >>> + ---------- ------------------ >>> + | | >>> + ----------- ------------------ >>> + | PCI Dev | | PCI-PCI Bridge | >>> + ----------- ------------------ >>> + | | >>> + ----------- ----------- >>> + | PCI Dev | | PCI Dev | >>> + ----------- ----------- >> >> Works for me, but I would again elaborate a little bit on keeping the >> hierarchy flat. >> >> First, in order to preserve compatibility with libvirt's current >> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge, >> even if that's possible otherwise. Let's just say >> >> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy >> is required), > > Why do you suggest this ? 
If the guest has multiple NUMA nodes > and you're creating a PXB for each NUMA node, then it looks valid > to want to have a DMI-PCI bridge attached to each PXB, so you can > have legacy PCI devices on each NUMA node, instead of putting them > all on the PCI bridge without NUMA affinity. You are right. I meant the above within one PCI Express root bus. Small correction to your wording though: you don't want to attach the DMI-PCI bridge to the PXB device, but to the extra root bus provided by the PXB. > >> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge, > > What's the rational for that, as opposed to plugging devices directly > into the DMI-PCI bridge which seems to work ? The rationale is that libvirt used to do it like this. And the rationale for *that* is that DMI-PCI bridges cannot accept hotplugged devices, while PCI-PCI bridges can. Technically nothing forbids (AFAICT) cold-plugging PCI devices into DMI-PCI bridges, but this document is expressly not just about technical constraints -- it's a policy document. We want to simplify / trim the supported PCI and PCI Express hierarchies as much as possible. All valid *high-level* topology goals should be permitted / covered one way or another by this document, but in as few ways as possible -- hopefully only one way. For example, if you read the rest of the thread, flat hierarchies are preferred to deeply nested hierarchies, because flat ones save on bus numbers, are easier to setup and understand, probably perform better, and don't lose any generality for cold- or hotplug. Thanks Laszlo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 15:40 ` Laszlo Ersek @ 2016-10-04 16:10 ` Laine Stump 2016-10-04 16:43 ` Laszlo Ersek 2016-10-04 17:54 ` Laine Stump 0 siblings, 2 replies; 52+ messages in thread From: Laine Stump @ 2016-10-04 16:10 UTC (permalink / raw) To: qemu-devel Cc: Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 10/04/2016 11:40 AM, Laszlo Ersek wrote: > On 10/04/16 16:59, Daniel P. Berrange wrote: >> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: >>> On 09/01/16 15:22, Marcel Apfelbaum wrote: >>>> +2.3 PCI only hierarchy >>>> +====================== >>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or >>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges >>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged >>>> +only into pcie.0 bus. >>>> + >>>> + pcie.0 bus >>>> + ---------------------------------------------- >>>> + | | >>>> + ----------- ------------------ >>>> + | PCI Dev | | DMI-PCI BRIDGE | >>>> + ---------- ------------------ >>>> + | | >>>> + ----------- ------------------ >>>> + | PCI Dev | | PCI-PCI Bridge | >>>> + ----------- ------------------ >>>> + | | >>>> + ----------- ----------- >>>> + | PCI Dev | | PCI Dev | >>>> + ----------- ----------- >>> >>> Works for me, but I would again elaborate a little bit on keeping the >>> hierarchy flat. >>> >>> First, in order to preserve compatibility with libvirt's current >>> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge, >>> even if that's possible otherwise. Let's just say >>> >>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy >>> is required), >> >> Why do you suggest this ? 
If the guest has multiple NUMA nodes >> and you're creating a PXB for each NUMA node, then it looks valid >> to want to have a DMI-PCI bridge attached to each PXB, so you can >> have legacy PCI devices on each NUMA node, instead of putting them >> all on the PCI bridge without NUMA affinity. > > You are right. I meant the above within one PCI Express root bus. > > Small correction to your wording though: you don't want to attach the > DMI-PCI bridge to the PXB device, but to the extra root bus provided by > the PXB. This made me realize something - the root bus on a pxb-pcie controller has a single slot and that slot can accept either a pcie-root-port (ioh3420) or a dmi-to-pci-bridge. If you want to have both express and legacy PCI devices on the same NUMA node, then you would either need to create one pxb-pcie for the pcie-root-port and another for the dmi-to-pci-bridge, or you would need to put the pcie-root-port and dmi-to-pci-bridge onto different functions of the single slot. Should the latter work properly? > >> >>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge, >> >> What's the rational for that, as opposed to plugging devices directly >> into the DMI-PCI bridge which seems to work ? > > The rationale is that libvirt used to do it like this. Nah, that's just the *result* of the rationale that we wanted the devices to be hotpluggable. At some later date we learned the hotplug on a pci-bridge device doesn't work on a Q35 machine anyway, so it was kind of pointless (but we still do it because we hold out hope that hotplug of legacy PCI devices into a pci-bridge on Q35 machines will work one day) > And the rationale > for *that* is that DMI-PCI bridges cannot accept hotplugged devices, > while PCI-PCI bridges can. > > Technically nothing forbids (AFAICT) cold-plugging PCI devices into > DMI-PCI bridges, but this document is expressly not just about technical > constraints -- it's a policy document. 
We want to simplify / trim the > supported PCI and PCI Express hierarchies as much as possible. > > All valid *high-level* topology goals should be permitted / covered one > way or another by this document, but in as few ways as possible -- > hopefully only one way. For example, if you read the rest of the thread, > flat hierarchies are preferred to deeply nested hierarchies, because > flat ones save on bus numbers Do they? >, are easier to setup and understand, > probably perform better, and don't lose any generality for cold- or hotplug. > > Thanks > Laszlo > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 16:10 ` Laine Stump @ 2016-10-04 16:43 ` Laszlo Ersek 2016-10-04 18:08 ` Laine Stump 2016-10-04 17:54 ` Laine Stump 1 sibling, 1 reply; 52+ messages in thread From: Laszlo Ersek @ 2016-10-04 16:43 UTC (permalink / raw) To: Laine Stump, qemu-devel Cc: Daniel P. Berrange, Marcel Apfelbaum, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 10/04/16 18:10, Laine Stump wrote: > On 10/04/2016 11:40 AM, Laszlo Ersek wrote: >> On 10/04/16 16:59, Daniel P. Berrange wrote: >>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: >>>> On 09/01/16 15:22, Marcel Apfelbaum wrote: >>>>> +2.3 PCI only hierarchy >>>>> +====================== >>>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated >>>>> Devices or >>>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI >>>>> bridges >>>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged >>>>> +only into pcie.0 bus. >>>>> + >>>>> + pcie.0 bus >>>>> + ---------------------------------------------- >>>>> + | | >>>>> + ----------- ------------------ >>>>> + | PCI Dev | | DMI-PCI BRIDGE | >>>>> + ---------- ------------------ >>>>> + | | >>>>> + ----------- ------------------ >>>>> + | PCI Dev | | PCI-PCI Bridge | >>>>> + ----------- ------------------ >>>>> + | | >>>>> + ----------- ----------- >>>>> + | PCI Dev | | PCI Dev | >>>>> + ----------- ----------- >>>> >>>> Works for me, but I would again elaborate a little bit on keeping the >>>> hierarchy flat. >>>> >>>> First, in order to preserve compatibility with libvirt's current >>>> behavior, let's not plug a PCI device directly in to the DMI-PCI >>>> bridge, >>>> even if that's possible otherwise. Let's just say >>>> >>>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy >>>> is required), >>> >>> Why do you suggest this ? 
If the guest has multiple NUMA nodes >>> and you're creating a PXB for each NUMA node, then it looks valid >>> to want to have a DMI-PCI bridge attached to each PXB, so you can >>> have legacy PCI devices on each NUMA node, instead of putting them >>> all on the PCI bridge without NUMA affinity. >> >> You are right. I meant the above within one PCI Express root bus. >> >> Small correction to your wording though: you don't want to attach the >> DMI-PCI bridge to the PXB device, but to the extra root bus provided by >> the PXB. > > This made me realize something - the root bus on a pxb-pcie controller > has a single slot and that slot can accept either a pcie-root-port > (ioh3420) or a dmi-to-pci-bridge. If you want to have both express and > legacy PCI devices on the same NUMA node, then you would either need to > create one pxb-pcie for the pcie-root-port and another for the > dmi-to-pci-bridge, or you would need to put the pcie-root-port and > dmi-to-pci-bridge onto different functions of the single slot. Should > the latter work properly? Yes, I expect so. (Famous last words? :)) > > >> >>> >>>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge, >>> >>> What's the rational for that, as opposed to plugging devices directly >>> into the DMI-PCI bridge which seems to work ? >> >> The rationale is that libvirt used to do it like this. > > > Nah, that's just the *result* of the rationale that we wanted the > devices to be hotpluggable. At some later date we learned the hotplug on > a pci-bridge device doesn't work on a Q35 machine anyway, so it was kind > of pointless (but we still do it because we hold out hope that hotplug > of legacy PCI devices into a pci-bridge on Q35 machines will work one day) > > >> And the rationale >> for *that* is that DMI-PCI bridges cannot accept hotplugged devices, >> while PCI-PCI bridges can. 
>> >> Technically nothing forbids (AFAICT) cold-plugging PCI devices into >> DMI-PCI bridges, but this document is expressly not just about technical >> constraints -- it's a policy document. We want to simplify / trim the >> supported PCI and PCI Express hierarchies as much as possible. >> >> All valid *high-level* topology goals should be permitted / covered one >> way or another by this document, but in as few ways as possible -- >> hopefully only one way. For example, if you read the rest of the thread, >> flat hierarchies are preferred to deeply nested hierarchies, because >> flat ones save on bus numbers > > Do they? Yes. Nesting implies bridges, and bridges take up bus numbers. For example, in a PCI Express switch, the upstream port of the switch consumes a bus number, with no practical usefulness. IIRC we collectively devised a flat pattern elsewhere in the thread where you could exhaust the 0..255 bus number space such that almost every bridge (= taking up a bus number) would also be capable of accepting a hot-plugged or cold-plugged PCI Express device. That is, practically no wasted bus numbers. Hm.... search this message for "population algorithm": https://www.mail-archive.com/qemu-devel@nongnu.org/msg394730.html and then Gerd's big improvement / simplification on it, with multifunction: https://www.mail-archive.com/qemu-devel@nongnu.org/msg395437.html In Gerd's scheme, you'd need only one or two (I'm lazy to count exactly :)) PCI Express switches, to exhaust all bus numbers. Minimal waste due to upstream ports. Thanks Laszlo >> , are easier to setup and understand, >> probably perform better, and don't lose any generality for cold- or >> hotplug. >> >> Thanks >> Laszlo >> > ^ permalink raw reply [flat|nested] 52+ messages in thread
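Laszlo's point about nesting wasting bus numbers can be put in numbers. The sketch below is an illustrative model, not QEMU code; the layout parameters (63 slots, 32-port switches) are taken from the figures discussed in this thread:

```python
# Model: every bridge (root port, switch upstream port, switch
# downstream port) consumes one PCI bus number out of the 0..255 space.

def flat_layout(root_ports):
    """Root ports plugged straight into pcie.0: each consumed bus
    number is also a hotpluggable PCIe slot -- nothing is wasted."""
    return {"buses_used": root_ports, "hotpluggable_slots": root_ports}

def switch_layout(root_ports, downstream_per_switch):
    """One switch behind each root port: the upstream port burns a
    bus number that can never hold an endpoint device."""
    buses = root_ports * (1 + 1 + downstream_per_switch)
    slots = root_ports * downstream_per_switch
    return {"buses_used": buses, "hotpluggable_slots": slots}

flat = flat_layout(63)        # 63 buses for 63 slots
nested = switch_layout(2, 32) # 68 buses for 64 slots
```

The difference per switch is small (one upstream-port bus), which is why the waste only matters when chasing the full 0..255 space.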
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 16:43 ` Laszlo Ersek @ 2016-10-04 18:08 ` Laine Stump 2016-10-04 18:52 ` Alex Williamson 2016-10-04 18:56 ` Laszlo Ersek 0 siblings, 2 replies; 52+ messages in thread From: Laine Stump @ 2016-10-04 18:08 UTC (permalink / raw) To: qemu-devel Cc: Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 10/04/2016 12:43 PM, Laszlo Ersek wrote: > On 10/04/16 18:10, Laine Stump wrote: >> On 10/04/2016 11:40 AM, Laszlo Ersek wrote: >>> On 10/04/16 16:59, Daniel P. Berrange wrote: >>>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: >>> All valid *high-level* topology goals should be permitted / covered one >>> way or another by this document, but in as few ways as possible -- >>> hopefully only one way. For example, if you read the rest of the thread, >>> flat hierarchies are preferred to deeply nested hierarchies, because >>> flat ones save on bus numbers >> >> Do they? > > Yes. Nesting implies bridges, and bridges take up bus numbers. For > example, in a PCI Express switch, the upstream port of the switch > consumes a bus number, with no practical usefulness. It's all just idle number games, but what I was thinking of was the difference between plugging a bunch of root-port+upstream+downstreamxN combos directly into pcie-root (flat), vs. plugging the first into pcie-root, and then subsequent ones into e.g. the last downstream port of the previous set. Take the simplest case of needing 63 hotpluggable slots. 
Of course if you're talking about the difference between using upstream+downstream vs. just having a bunch of pcie-root-ports directly on pcie-root then you're correct, but only marginally - for 63 hotpluggable ports, you would need 63 x pcie-root-port, so a savings of 4 controllers - about 6.5%. (Of course this is all moot since you run out of ioport space after, what, 7 controllers needing it anyway? :-P) > > IIRC we collectively devised a flat pattern elsewhere in the thread > where you could exhaust the 0..255 bus number space such that almost > every bridge (= taking up a bus number) would also be capable of > accepting a hot-plugged or cold-plugged PCI Express device. That is, > practically no wasted bus numbers. > > Hm.... search this message for "population algorithm": > > https://www.mail-archive.com/qemu-devel@nongnu.org/msg394730.html > > and then Gerd's big improvement / simplification on it, with multifunction: > > https://www.mail-archive.com/qemu-devel@nongnu.org/msg395437.html > > In Gerd's scheme, you'd only need only one or two (I'm lazy to count > exactly :)) PCI Express switches, to exhaust all bus numbers. Minimal > waste due to upstream ports. Yep. And in response to his message, that's what I'm implementing as the default strategy in libvirt :-) ^ permalink raw reply [flat|nested] 52+ messages in thread
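Laine's tally above checks out either way; a quick, purely illustrative recount of the two layouts he lists:

```python
# Flat: two switches hang off two root ports; 63 downstream ports total.
flat = {"pcie-root-port": 2, "upstream-port": 2, "downstream-port": 63}

# Chained: the second switch plugs into a downstream port of the first,
# so one of the 64 downstream ports is spent hosting it.
chained = {"pcie-root-port": 1, "upstream-port": 2, "downstream-port": 64}

flat_total = sum(flat.values())        # 67 controllers
chained_total = sum(chained.values())  # 67 controllers

flat_slots = flat["downstream-port"]            # 63 free hotpluggable slots
chained_slots = chained["downstream-port"] - 1  # 63 (one hosts the 2nd switch)
```

Both shapes cost 67 controllers for 63 free slots, which is exactly why the interesting savings only appear against the root-ports-only layout.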
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 18:08 ` Laine Stump @ 2016-10-04 18:52 ` Alex Williamson 2016-10-10 12:02 ` Andrea Bolognani 2016-10-04 18:56 ` Laszlo Ersek 1 sibling, 1 reply; 52+ messages in thread From: Alex Williamson @ 2016-10-04 18:52 UTC (permalink / raw) To: Laine Stump Cc: qemu-devel, Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Gerd Hoffmann On Tue, 4 Oct 2016 14:08:45 -0400 Laine Stump <laine@redhat.com> wrote: > On 10/04/2016 12:43 PM, Laszlo Ersek wrote: > > On 10/04/16 18:10, Laine Stump wrote: > >> On 10/04/2016 11:40 AM, Laszlo Ersek wrote: > >>> On 10/04/16 16:59, Daniel P. Berrange wrote: > >>>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: > >>> All valid *high-level* topology goals should be permitted / covered one > >>> way or another by this document, but in as few ways as possible -- > >>> hopefully only one way. For example, if you read the rest of the thread, > >>> flat hierarchies are preferred to deeply nested hierarchies, because > >>> flat ones save on bus numbers > >> > >> Do they? > > > > Yes. Nesting implies bridges, and bridges take up bus numbers. For > > example, in a PCI Express switch, the upstream port of the switch > > consumes a bus number, with no practical usefulness. > > I'ts all just idle number games, but what I was thinking of was the > difference between plugging a bunch of root-port+upstream+downstreamxN > combos directly into pcie-root (flat), vs. plugging the first into > pcie-root, and then subsequent ones into e.g. the last downstream port > of the previous set. Take the simplest case of needing 63 hotpluggable > slots. 
In the "flat" case, you have: > > 2 x pcie-root-port > 2 x pcie-switch-upstream-port > 63 x pcie-switch-downstream-port > > In the "nested" or "chained" case you have: > > 1 x pcie-root-port > 1 x pcie-switch-upstream-port > 32 x pcie-downstream-port > 1 x pcie-switch-upstream-port > 32 x pcie-switch-downstream-port You're not thinking in enough dimensions. A single root port can host multiple sub-hierarchies on it's own. We can have a multi-function upstream switch, so you can have 8 upstream ports (00.{0-7}). If we implemented ARI on the upstream ports, we could have 256 upstream ports attached to a single root port, but of course then we've run out of bus numbers before we've even gotten to actual devices buses. Another option, look at the downstream ports, why do they each need to be in separate slots? We have the address space of an entire bus to work with, so we can also create multi-function downstream ports, which gives us 256 downstream ports per upstream port. Oops, we just ran out of bus numbers again, but at least actual devices can be attached. Thanks, Alex > so you use the same number of PCI controllers. > > Of course if you're talking about the difference between using > upstream+downstream vs. just having a bunch of pcie-root-ports directly > on pcie-root then you're correct, but only marginally - for 63 > hotpluggable ports, you would need 63 x pcie-root-port, so a savings of > 4 controllers - about 6.5%. (Of course this is all moot since you run > out of ioport space after, what, 7 controllers needing it anyway? :-P) > > > > > IIRC we collectively devised a flat pattern elsewhere in the thread > > where you could exhaust the 0..255 bus number space such that almost > > every bridge (= taking up a bus number) would also be capable of > > accepting a hot-plugged or cold-plugged PCI Express device. That is, > > practically no wasted bus numbers. > > > > Hm.... 
search this message for "population algorithm": > > > > https://www.mail-archive.com/qemu-devel@nongnu.org/msg394730.html > > > > and then Gerd's big improvement / simplification on it, with multifunction: > > > > https://www.mail-archive.com/qemu-devel@nongnu.org/msg395437.html > > > > In Gerd's scheme, you'd only need only one or two (I'm lazy to count > > exactly :)) PCI Express switches, to exhaust all bus numbers. Minimal > > waste due to upstream ports. > > Yep. And in response to his message, that's what I'm implementing as the > default strategy in libvirt :-) > > ^ permalink raw reply [flat|nested] 52+ messages in thread
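Alex's "more dimensions" argument can be put in numbers too. A hedged sketch of the theoretical fan-out he describes (a pure model; no QEMU implements exactly this today):

```python
BUS_NUMBERS = 256     # buses 0..255
FUNCS_PER_DEVICE = 8  # functions 0..7, without ARI
SLOTS_PER_BUS = 32    # device numbers 0..31

# Multi-function upstream switch: 8 upstream ports at 00.{0-7}.
upstreams_per_root_port = FUNCS_PER_DEVICE

# Multi-function downstream ports: 32 slots x 8 functions per upstream bus.
downstreams_per_upstream = SLOTS_PER_BUS * FUNCS_PER_DEVICE  # 256

# Bus numbers needed for the plumbing behind just ONE root port:
# its own secondary bus, 8 upstream-port buses, and one bus per
# downstream port -- before a single endpoint is attached.
buses_needed = (1 + upstreams_per_root_port
                + upstreams_per_root_port * downstreams_per_upstream)
# 2057 buses wanted vs 256 available: exhausted long before endpoints.
```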
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 18:52 ` Alex Williamson @ 2016-10-10 12:02 ` Andrea Bolognani 2016-10-10 14:36 ` Marcel Apfelbaum 0 siblings, 1 reply; 52+ messages in thread From: Andrea Bolognani @ 2016-10-10 12:02 UTC (permalink / raw) To: Alex Williamson, Laine Stump Cc: qemu-devel, Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum, Peter Maydell, Drew Jones, mst, Gerd Hoffmann On Tue, 2016-10-04 at 12:52 -0600, Alex Williamson wrote: > > I'ts all just idle number games, but what I was thinking of was the > > difference between plugging a bunch of root-port+upstream+downstreamxN > > combos directly into pcie-root (flat), vs. plugging the first into > > pcie-root, and then subsequent ones into e.g. the last downstream port > > of the previous set. Take the simplest case of needing 63 hotpluggable > > slots. In the "flat" case, you have: > > > > 2 x pcie-root-port > > 2 x pcie-switch-upstream-port > > 63 x pcie-switch-downstream-port > > > > In the "nested" or "chained" case you have: > > > > 1 x pcie-root-port > > 1 x pcie-switch-upstream-port > > 32 x pcie-downstream-port > > 1 x pcie-switch-upstream-port > > 32 x pcie-switch-downstream-port > > You're not thinking in enough dimensions. A single root port can host > multiple sub-hierarchies on it's own. We can have a multi-function > upstream switch, so you can have 8 upstream ports (00.{0-7}). If we > implemented ARI on the upstream ports, we could have 256 upstream ports > attached to a single root port, but of course then we've run out of > bus numbers before we've even gotten to actual devices buses. > > Another option, look at the downstream ports, why do they each need to > be in separate slots? We have the address space of an entire bus to > work with, so we can also create multi-function downstream ports, which > gives us 256 downstream ports per upstream port. Oops, we just ran out > of bus numbers again, but at least actual devices can be attached. 
What's the advantage in using ARI to stuff more than eight of anything that's not Endpoint Devices in a single slot? I mean, if we just fill up all 32 slots in a PCIe Root Bus with 8 PCIe Root Ports each we already end up having 256 hotpluggable slots[1]. Why would it be preferable to use ARI, or even PCIe Switches, instead? [1] The last slot will have to be limited to 7 PCIe Root Ports if we don't want to run out of bus numbers -- Andrea Bolognani / Red Hat / Virtualization ^ permalink raw reply [flat|nested] 52+ messages in thread
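Andrea's count (and the footnote) is easy to verify mechanically; an illustrative sketch:

```python
SLOTS = 32  # slots 0..31 on the PCIe Root Bus
FUNCS = 8   # functions 0..7 per slot

root_ports = SLOTS * FUNCS   # 256 root ports cold-plugged on pcie.0
secondary_buses = 256 - 1    # buses 1..255 (bus 0 is pcie.0 itself)

usable = min(root_ports, secondary_buses)  # 255 hotpluggable slots
# i.e. the last slot can host only 7 root ports, as the footnote says.
```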
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-10 12:02 ` Andrea Bolognani @ 2016-10-10 14:36 ` Marcel Apfelbaum 2016-10-11 15:37 ` Andrea Bolognani 0 siblings, 1 reply; 52+ messages in thread From: Marcel Apfelbaum @ 2016-10-10 14:36 UTC (permalink / raw) To: Andrea Bolognani, Alex Williamson, Laine Stump Cc: qemu-devel, Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst, Gerd Hoffmann On 10/10/2016 03:02 PM, Andrea Bolognani wrote: > On Tue, 2016-10-04 at 12:52 -0600, Alex Williamson wrote: >>> I'ts all just idle number games, but what I was thinking of was the >>> difference between plugging a bunch of root-port+upstream+downstreamxN >>> combos directly into pcie-root (flat), vs. plugging the first into >>> pcie-root, and then subsequent ones into e.g. the last downstream port >>> of the previous set. Take the simplest case of needing 63 hotpluggable >>> slots. In the "flat" case, you have: >>> >>> 2 x pcie-root-port >>> 2 x pcie-switch-upstream-port >>> 63 x pcie-switch-downstream-port >>> >>> In the "nested" or "chained" case you have: >>> >>> 1 x pcie-root-port >>> 1 x pcie-switch-upstream-port >>> 32 x pcie-downstream-port >>> 1 x pcie-switch-upstream-port >>> 32 x pcie-switch-downstream-port >> >> You're not thinking in enough dimensions. A single root port can host >> multiple sub-hierarchies on it's own. We can have a multi-function >> upstream switch, so you can have 8 upstream ports (00.{0-7}). If we >> implemented ARI on the upstream ports, we could have 256 upstream ports >> attached to a single root port, but of course then we've run out of >> bus numbers before we've even gotten to actual devices buses. >> >> Another option, look at the downstream ports, why do they each need to >> be in separate slots? We have the address space of an entire bus to >> work with, so we can also create multi-function downstream ports, which >> gives us 256 downstream ports per upstream port. 
Oops, we just ran out >> of bus numbers again, but at least actual devices can be attached. > > What's the advantage in using ARI to stuff more than eight > of anything that's not Endpoint Devices in a single slot? > > I mean, if we just fill up all 32 slots in a PCIe Root Bus > with 8 PCIe Root Ports each we already end up having 256 > hotpluggable slots[1]. Why would it be preferable to use > ARI, or even PCIe Switches, instead? > What if you need more devices (functions, actually)? If some of the pcie.0 slots are occupied by other Integrated devices and you need more than 256 functions you can: (1) Add a PCIe Switch - if you need hot-plug support - and you are pretty limited by the bus numbers, but it will give you a few more slots. (2) Use multi-function devices per root port if you are not interested in hotplug. In this case ARI will give you up to 256 devices per Root Port. Now the question is why ARI? Better utilization of the "problematic" resources like Bus numbers and IO space; all that if you need an insane number of devices, but we don't judge :). Thanks, Marcel > > [1] The last slot will have to be limited to 7 PCIe Root > Ports if we don't want to run out of bus numbers I don't follow how this will 'save' us. If all the root ports are in use and you leave space for one more, what can you do with it? > -- > Andrea Bolognani / Red Hat / Virtualization > ^ permalink raw reply [flat|nested] 52+ messages in thread
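The ARI gain Marcel mentions applies per PCIe link: on a point-to-point link only device 0 exists, so without ARI a root port's secondary bus sees at most 8 functions of that one device; ARI reuses the device-number bits to allow 256 functions. A hedged sketch of that limit:

```python
def functions_behind_port(ari: bool) -> int:
    """Max functions of the (single) device on a point-to-point PCIe
    link. The 5-bit device / 3-bit function split normally allows only
    device 0 with 8 functions; ARI collapses the split into a single
    8-bit function number, giving 256 functions."""
    return 256 if ari else 8
```

So ARI buys a 32x density increase behind a root port, at the cost of losing per-slot hotplug, which is exactly the trade-off described above.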
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-10 14:36 ` Marcel Apfelbaum @ 2016-10-11 15:37 ` Andrea Bolognani 0 siblings, 0 replies; 52+ messages in thread From: Andrea Bolognani @ 2016-10-11 15:37 UTC (permalink / raw) To: Marcel Apfelbaum, Alex Williamson, Laine Stump Cc: qemu-devel, Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst, Gerd Hoffmann On Mon, 2016-10-10 at 17:36 +0300, Marcel Apfelbaum wrote: > > What's the advantage in using ARI to stuff more than eight > > of anything that's not Endpoint Devices in a single slot? > > > > I mean, if we just fill up all 32 slots in a PCIe Root Bus > > with 8 PCIe Root Ports each we already end up having 256 > > hotpluggable slots[1]. Why would it be preferable to use > > ARI, or even PCIe Switches, instead? > > What if you need more devices (functions actually) ? > > If some of the pcie.0 slots are occupied by other Integrated devices > and you need more than 256 functions you can: > (1) Add a PCIe Switch - if you need hot-plug support -an you are pretty limited > by the bus numbers, but it will give you a few more slots. > (2) Use multi-function devices per root port if you are not interested in hotplug. > In this case ARI will give you up to 256 devices per Root Port. > > Now the question is why ARI? Better utilization of the "problematic" > resources like Bus numbers and IO space; all that if you need an insane > number of devices, but we don't judge :). My point is that AIUI ARI is something you only care about for endpoint devices that want to have more than 8 functions. When it comes to controller, there's no advantage that I can think of in having 1 slot with 256 functions as opposed to 32 slots with 8 functions each; if anything, I expect that at least some guest OSs would be quite baffled in finding eg. a network adapter, a SCSI controller and a GPU as separate functions of a single PCI slot. 
> > [1] The last slot will have to be limited to 7 PCIe Root > > Ports if we don't want to run out of bus numbers > > I don't follow how this will 'save' us. If all the root ports > are in use and you leave space for one more, what can you do with it? Probably my math is off, but if we can only have 256 PCI buses (0-255) and we plug a PCIe Root Port in each of the 8 functions (0-7) of the 32 slots (0-31) available on the PCIe Root Bus, we end up with 0:00.[0-7] -> [001-008]:0.[0-7] 0:01.[0-7] -> [009-016]:0.[0-7] 0:02.[0-7] -> [017-024]:0.[0-7] ... 0:30.[0-7] -> [241-248]:0.[0-7] 0:31.[0-7] -> [249-256]:0.[0-7] but 256 is not a valid bus number, so we should skip that last PCIe Root Port and stop at 255. -- Andrea Bolognani / Red Hat / Virtualization ^ permalink raw reply [flat|nested] 52+ messages in thread
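Andrea's slot-to-bus mapping can be generated mechanically; an illustrative reconstruction of the same arithmetic:

```python
def assign_buses():
    """Give each root port at 0:slot.fn its own secondary bus,
    starting at bus 1 (bus 0 is the root bus itself)."""
    mapping = {}
    bus = 1
    for slot in range(32):
        for fn in range(8):
            if bus > 255:       # 256 would overflow the 8-bit bus number
                return mapping
            mapping[(slot, fn)] = bus
            bus += 1
    return mapping

m = assign_buses()
# 255 root ports fit; slot 31 stops at function 6, as the footnote says.
```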
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 18:08 ` Laine Stump 2016-10-04 18:52 ` Alex Williamson @ 2016-10-04 18:56 ` Laszlo Ersek 1 sibling, 0 replies; 52+ messages in thread From: Laszlo Ersek @ 2016-10-04 18:56 UTC (permalink / raw) To: Laine Stump, qemu-devel Cc: Daniel P. Berrange, Marcel Apfelbaum, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 10/04/16 20:08, Laine Stump wrote: > On 10/04/2016 12:43 PM, Laszlo Ersek wrote: >> On 10/04/16 18:10, Laine Stump wrote: >>> On 10/04/2016 11:40 AM, Laszlo Ersek wrote: >>>> On 10/04/16 16:59, Daniel P. Berrange wrote: >>>>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: >>>> All valid *high-level* topology goals should be permitted / covered one >>>> way or another by this document, but in as few ways as possible -- >>>> hopefully only one way. For example, if you read the rest of the >>>> thread, >>>> flat hierarchies are preferred to deeply nested hierarchies, because >>>> flat ones save on bus numbers >>> >>> Do they? >> >> Yes. Nesting implies bridges, and bridges take up bus numbers. For >> example, in a PCI Express switch, the upstream port of the switch >> consumes a bus number, with no practical usefulness. > > I'ts all just idle number games, but what I was thinking of was the > difference between plugging a bunch of root-port+upstream+downstreamxN > combos directly into pcie-root (flat), vs. plugging the first into > pcie-root, and then subsequent ones into e.g. the last downstream port > of the previous set. Take the simplest case of needing 63 hotpluggable > slots. 
In the "flat" case, you have: > > 2 x pcie-root-port > 2 x pcie-switch-upstream-port > 63 x pcie-switch-downstream-port > > In the "nested" or "chained" case you have: > > 1 x pcie-root-port > 1 x pcie-switch-upstream-port > 32 x pcie-downstream-port > 1 x pcie-switch-upstream-port > 32 x pcie-switch-downstream-port > > so you use the same number of PCI controllers. > > Of course if you're talking about the difference between using > upstream+downstream vs. just having a bunch of pcie-root-ports directly > on pcie-root then you're correct, but only marginally - for 63 > hotpluggable ports, you would need 63 x pcie-root-port, so a savings of > 4 controllers - about 6.5%. We aim at 200+ ports. Also, nesting causes recursion in any guest code that traverses the hierarchy. I think it has some performance impact, plus, for me at least, interpreting PCI enumeration logs with deep recursion is way harder than the flat stuff. The bus number space is flat, and for me it's easier to "map back" to the topology if the topology is also mostly flat. > (Of course this is all moot since you run > out of ioport space after, what, 7 controllers needing it anyway? :-P) No, it's not moot. The idea is that PCI Express devices must not require IO space for correct operation -- I believe this is actually mandated by the PCI Express spec --, so in the PCI Express hierarchy we wouldn't reserve IO space at all. We discussed this earlier up-thread, please see: http://lists.nongnu.org/archive/html/qemu-devel/2016-09/msg00672.html * Finally, this is the spot where we should design and explain our resource reservation for hotplug: [...] >> IIRC we collectively devised a flat pattern elsewhere in the thread >> where you could exhaust the 0..255 bus number space such that almost >> every bridge (= taking up a bus number) would also be capable of >> accepting a hot-plugged or cold-plugged PCI Express device. That is, >> practically no wasted bus numbers. >> >> Hm.... 
search this message for "population algorithm": >> >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg394730.html >> >> and then Gerd's big improvement / simplification on it, with >> multifunction: >> >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg395437.html >> >> In Gerd's scheme, you'd only need only one or two (I'm lazy to count >> exactly :)) PCI Express switches, to exhaust all bus numbers. Minimal >> waste due to upstream ports. > > Yep. And in response to his message, that's what I'm implementing as the > default strategy in libvirt :-) Sounds great, thanks! Laszlo ^ permalink raw reply [flat|nested] 52+ messages in thread
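The ioport ceiling Laine joked about comes from fixed-size bridge IO windows; a rough model of it (the 4 KiB window granularity is the PCI-to-PCI bridge minimum, and the totals are assumptions about a typical x86 guest, not figures from this thread):

```python
IO_SPACE = 0x10000         # 64 KiB of x86 port-IO space in total
BRIDGE_IO_WINDOW = 0x1000  # minimum IO window a bridge can decode (4 KiB)

max_io_windows = IO_SPACE // BRIDGE_IO_WINDOW  # 16, before any reservations
# Chipset devices, legacy ports, etc. eat into this, so only a handful of
# hotpluggable bridges can each get an IO window in practice -- which is
# why Laszlo's point matters: PCI Express endpoints are expected to work
# without IO BARs, so the Express hierarchy can skip IO reservation.
```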
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 16:10 ` Laine Stump 2016-10-04 16:43 ` Laszlo Ersek @ 2016-10-04 17:54 ` Laine Stump 2016-10-05 9:17 ` Marcel Apfelbaum 1 sibling, 1 reply; 52+ messages in thread From: Laine Stump @ 2016-10-04 17:54 UTC (permalink / raw) To: qemu-devel Cc: Laszlo Ersek, Daniel P. Berrange, Marcel Apfelbaum, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 10/04/2016 12:10 PM, Laine Stump wrote: > On 10/04/2016 11:40 AM, Laszlo Ersek wrote: >> Small correction to your wording though: you don't want to attach the >> DMI-PCI bridge to the PXB device, but to the extra root bus provided by >> the PXB. > > This made me realize something - the root bus on a pxb-pcie controller > has a single slot and that slot can accept either a pcie-root-port > (ioh3420) or a dmi-to-pci-bridge. If you want to have both express and > legacy PCI devices on the same NUMA node, then you would either need to > create one pxb-pcie for the pcie-root-port and another for the > dmi-to-pci-bridge, or you would need to put the pcie-root-port and > dmi-to-pci-bridge onto different functions of the single slot. Should > the latter work properly? We were discussing pxb-pcie today while Dan was trying to get a particular configuration working, and there was some disagreement about two points that I stated above as fact (but which may just be misunderstanding again): 1) Does pxb-pcie only provide a single slot (0)? Or does it provide 32 slots (0-31) just like the pcie root complex? 2) can you really only plug a pcie-root-port (ioh3420) into a pxb-pcie? Or will it accept anything that pcie.0 accepts? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 17:54 ` Laine Stump @ 2016-10-05 9:17 ` Marcel Apfelbaum 2016-10-10 11:09 ` Andrea Bolognani 0 siblings, 1 reply; 52+ messages in thread From: Marcel Apfelbaum @ 2016-10-05 9:17 UTC (permalink / raw) To: Laine Stump, qemu-devel Cc: Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Alex Williamson, Gerd Hoffmann On 10/04/2016 08:54 PM, Laine Stump wrote: > On 10/04/2016 12:10 PM, Laine Stump wrote: >> On 10/04/2016 11:40 AM, Laszlo Ersek wrote: > >>> Small correction to your wording though: you don't want to attach the >>> DMI-PCI bridge to the PXB device, but to the extra root bus provided by >>> the PXB. >> >> This made me realize something - the root bus on a pxb-pcie controller >> has a single slot and that slot can accept either a pcie-root-port >> (ioh3420) or a dmi-to-pci-bridge. If you want to have both express and >> legacy PCI devices on the same NUMA node, then you would either need to >> create one pxb-pcie for the pcie-root-port and another for the >> dmi-to-pci-bridge, or you would need to put the pcie-root-port and >> dmi-to-pci-bridge onto different functions of the single slot. Should >> the latter work properly? > Hi, > We were discussing pxb-pcie today while Dan was trying to get a particular configuration working, and there was some disagreement about two points that I stated above as fact (but which may just be > misunderstanding again): > > 1) Does pxb-pcie only provide a single slot (0)? Or does it provide 32 slots (0-31) just like the pcie root complex? > It provides 32 slots behaving like a PCI Express Root Complex. > 2) can you really only plug a pcie-root-port (ioh3420) into a pxb-pcie? Or will it accept anything that pcie.0 accepts? It supports only PCI Express Root Ports. It does not support Integrated Devices. Thanks, Marcel ^ permalink raw reply [flat|nested] 52+ messages in thread
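Marcel's answer translates into a topology like the following. The command line is a sketch assembled from device names used in this thread (pxb-pcie, ioh3420); the IDs and the bus_nr value are illustrative choices, not taken from the discussion:

```python
# Sketch: an extra NUMA-local root bus with a root port and one endpoint.
qemu_args = [
    "-machine", "q35",
    "-device", "pxb-pcie,id=pcie.1,bus=pcie.0,bus_nr=64,numa_node=0",
    # only PCIe Root Ports (or a dmi-to-pci bridge) go on the pxb's root bus:
    "-device", "ioh3420,id=rp1,bus=pcie.1",
    # the PCIe endpoint then plugs into the root port, not the pxb directly:
    "-device", "virtio-net-pci,bus=rp1",
]
print(" ".join(qemu_args))
```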
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-05 9:17 ` Marcel Apfelbaum @ 2016-10-10 11:09 ` Andrea Bolognani 2016-10-10 14:15 ` Marcel Apfelbaum 0 siblings, 1 reply; 52+ messages in thread From: Andrea Bolognani @ 2016-10-10 11:09 UTC (permalink / raw) To: Marcel Apfelbaum, Laine Stump, qemu-devel Cc: Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst, Alex Williamson, Gerd Hoffmann On Wed, 2016-10-05 at 12:17 +0300, Marcel Apfelbaum wrote: > > 2) can you really only plug a pcie-root-port (ioh3420) > > into a pxb-pcie? Or will it accept anything that pcie.0 > > accepts? > > It supports only PCI Express Root Ports. It does not > support Integrated Devices. So no PCI Express Switch Upstream Ports? What about DMI-to-PCI Bridges? -- Andrea Bolognani / Red Hat / Virtualization ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-10 11:09 ` Andrea Bolognani @ 2016-10-10 14:15 ` Marcel Apfelbaum 2016-10-11 13:30 ` Andrea Bolognani 0 siblings, 1 reply; 52+ messages in thread From: Marcel Apfelbaum @ 2016-10-10 14:15 UTC (permalink / raw) To: Andrea Bolognani, Laine Stump, qemu-devel Cc: Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst, Alex Williamson, Gerd Hoffmann On 10/10/2016 02:09 PM, Andrea Bolognani wrote: > On Wed, 2016-10-05 at 12:17 +0300, Marcel Apfelbaum wrote: >>> 2) can you really only plug a pcie-root-port (ioh3420) >>> into a pxb-pcie? Or will it accept anything that pcie.0 >>> accepts? >> >> It supports only PCI Express Root Ports. It does not >> support Integrated Devices. > > So no PCI Express Switch Upstream Ports? The switch upstream ports can only be plugged into PCIe Root Ports. There is an error in the RFC showing otherwise, it is already corrected in V1, not yet upstream. What about DMI-to-PCI Bridges? Yes, the dmi-to-pci bridge can be plugged into the pxb-pcie, I'll be sure to emphasize it. Thanks, Marcel > > -- > Andrea Bolognani / Red Hat / Virtualization > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-10 14:15 ` Marcel Apfelbaum @ 2016-10-11 13:30 ` Andrea Bolognani 0 siblings, 0 replies; 52+ messages in thread From: Andrea Bolognani @ 2016-10-11 13:30 UTC (permalink / raw) To: Marcel Apfelbaum, Laine Stump, qemu-devel Cc: Laszlo Ersek, Daniel P. Berrange, Peter Maydell, Drew Jones, mst, Alex Williamson, Gerd Hoffmann On Mon, 2016-10-10 at 17:15 +0300, Marcel Apfelbaum wrote: > > > > 2) can you really only plug a pcie-root-port (ioh3420) > > > > into a pxb-pcie? Or will it accept anything that pcie.0 > > > > accepts? > > > > > > It supports only PCI Express Root Ports. It does not > > > support Integrated Devices. > > > > So no PCI Express Switch Upstream Ports? > > The switch upstream ports can only be plugged into PCIe Root Ports. > There is an error in the RFC showing otherwise, it is already > corrected in V1, not yet upstream. I was pretty sure that was the case, but I wanted to double-check just to be on the safe side ;) > > What about DMI-to-PCI Bridges? > > Yes, the dmi-to-pci bridge can be plugged into the pxb-pcie, I'll > be sure to emphasize it. Cool. I would have been very surprised if that would have not been the case, considering how we need to use multiple pxb-pcie to link PCI devices to specific NUMA nodes. -- Andrea Bolognani / Red Hat / Virtualization ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 14:59 ` Daniel P. Berrange 2016-10-04 15:40 ` Laszlo Ersek @ 2016-10-04 15:45 ` Alex Williamson 2016-10-04 16:25 ` Laine Stump 1 sibling, 1 reply; 52+ messages in thread From: Alex Williamson @ 2016-10-04 15:45 UTC (permalink / raw) To: Daniel P. Berrange Cc: Laszlo Ersek, Marcel Apfelbaum, qemu-devel, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Gerd Hoffmann, Laine Stump On Tue, 4 Oct 2016 15:59:11 +0100 "Daniel P. Berrange" <berrange@redhat.com> wrote: > On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: > > On 09/01/16 15:22, Marcel Apfelbaum wrote: > > > +2.3 PCI only hierarchy > > > +====================== > > > +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or > > > +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges > > > +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged > > > +only into pcie.0 bus. > > > + > > > + pcie.0 bus > > > + ---------------------------------------------- > > > + | | > > > + ----------- ------------------ > > > + | PCI Dev | | DMI-PCI BRIDGE | > > > + ---------- ------------------ > > > + | | > > > + ----------- ------------------ > > > + | PCI Dev | | PCI-PCI Bridge | > > > + ----------- ------------------ > > > + | | > > > + ----------- ----------- > > > + | PCI Dev | | PCI Dev | > > > + ----------- ----------- > > > > Works for me, but I would again elaborate a little bit on keeping the > > hierarchy flat. > > > > First, in order to preserve compatibility with libvirt's current > > behavior, let's not plug a PCI device directly in to the DMI-PCI bridge, > > even if that's possible otherwise. Let's just say > > > > - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy > > is required), > > Why do you suggest this ? 
If the guest has multiple NUMA nodes > and you're creating a PXB for each NUMA node, then it looks valid > to want to have a DMI-PCI bridge attached to each PXB, so you can > have legacy PCI devices on each NUMA node, instead of putting them > all on the PCI bridge without NUMA affinity. Seems like this is one of those "generic" vs "specific" device issues. We use the DMI-to-PCI bridge as if it were a PCIe-to-PCI bridge, but DMI is actually an Intel proprietary interface, the bridge just has the same software interface as a PCI bridge. So while you can use it as a generic PCIe-to-PCI bridge, it's at least going to make me cringe every time. > > - only PCI-PCI bridges should be plugged into the DMI-PCI bridge, > > What's the rational for that, as opposed to plugging devices directly > into the DMI-PCI bridge which seems to work ? IIRC, something about hotplug, but from a PCI perspective it doesn't make any sense to me either. Same with the restriction from using slot 0 on PCI bridges, there's no basis for that except on the root bus. Thanks, Alex ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 15:45 ` Alex Williamson @ 2016-10-04 16:25 ` Laine Stump 2016-10-05 10:03 ` Marcel Apfelbaum 0 siblings, 1 reply; 52+ messages in thread From: Laine Stump @ 2016-10-04 16:25 UTC (permalink / raw) To: qemu-devel Cc: Alex Williamson, Daniel P. Berrange, Laszlo Ersek, Marcel Apfelbaum, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Gerd Hoffmann On 10/04/2016 11:45 AM, Alex Williamson wrote: > On Tue, 4 Oct 2016 15:59:11 +0100 > "Daniel P. Berrange" <berrange@redhat.com> wrote: > >> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: >>> On 09/01/16 15:22, Marcel Apfelbaum wrote: >>>> +2.3 PCI only hierarchy >>>> +====================== >>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or >>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges >>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged >>>> +only into pcie.0 bus. >>>> + >>>> + pcie.0 bus >>>> + ---------------------------------------------- >>>> + | | >>>> + ----------- ------------------ >>>> + | PCI Dev | | DMI-PCI BRIDGE | >>>> + ---------- ------------------ >>>> + | | >>>> + ----------- ------------------ >>>> + | PCI Dev | | PCI-PCI Bridge | >>>> + ----------- ------------------ >>>> + | | >>>> + ----------- ----------- >>>> + | PCI Dev | | PCI Dev | >>>> + ----------- ----------- >>> >>> Works for me, but I would again elaborate a little bit on keeping the >>> hierarchy flat. >>> >>> First, in order to preserve compatibility with libvirt's current >>> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge, >>> even if that's possible otherwise. Let's just say >>> >>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy >>> is required), >> >> Why do you suggest this ? 
If the guest has multiple NUMA nodes >> and you're creating a PXB for each NUMA node, then it looks valid >> to want to have a DMI-PCI bridge attached to each PXB, so you can >> have legacy PCI devices on each NUMA node, instead of putting them >> all on the PCI bridge without NUMA affinity. > > Seems like this is one of those "generic" vs "specific" device issues. > We use the DMI-to-PCI bridge as if it were a PCIe-to-PCI bridge, but > DMI is actually an Intel proprietary interface, the bridge just has the > same software interface as a PCI bridge. So while you can use it as a > generic PCIe-to-PCI bridge, it's at least going to make me cringe every > time. If using it this way makes kittens cry or something, then we'd be happy to use a generic pcie-to-pci bridge if somebody created one :-) > >>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge, >> >> What's the rational for that, as opposed to plugging devices directly >> into the DMI-PCI bridge which seems to work ? > > IIRC, something about hotplug, but from a PCI perspective it doesn't > make any sense to me either. At one point Marcel and Michael were discussing the possibility of making hotplug work on a dmi-to-pci-bridge. Currently it doesn't even work for pci-bridge so (as I think I said in another message just now) it is kind of pointless, although when I asked about eliminating use of pci-bridge in favor of just using dmi-to-pci-bridge directly, I got lots of "no" votes. > Same with the restriction from using slot > 0 on PCI bridges, there's no basis for that except on the root bus. I tried allowing devices to be plugged into slot 0 of a pci-bridge in libvirt - qemu barfed, so I moved the "minSlot" for pci-bridge back up to 1. Slot 0 is completely usable on a dmi-to-pci-bridge though (and libvirt allows it). 
At this point, even if qemu enabled using slot 0 of a pci-bridge, libvirt wouldn't be able to expose that to users (unless the min/max slot of each PCI controller was made visible somewhere via QMP) ^ permalink raw reply [flat|nested] 52+ messages in thread
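The slot-0 asymmetry Laine describes can be tried from the HMP monitor; a sketch, with hypothetical device IDs, assuming a running Q35 guest that already has a dmi-to-pci-bridge "dmi0" and a pci-bridge "br0":

```shell
# Slot 0 of a dmi-to-pci-bridge is usable:
(qemu) device_add e1000,bus=dmi0,addr=0x0,id=nic0
# Slot 0 of a pci-bridge (with its default embedded SHPC) is rejected:
(qemu) device_add e1000,bus=br0,addr=0x0,id=nic1
```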
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-10-04 16:25 ` Laine Stump @ 2016-10-05 10:03 ` Marcel Apfelbaum 0 siblings, 0 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-10-05 10:03 UTC (permalink / raw) To: Laine Stump, qemu-devel Cc: Alex Williamson, Daniel P. Berrange, Laszlo Ersek, Peter Maydell, Drew Jones, mst, Andrea Bolognani, Gerd Hoffmann On 10/04/2016 07:25 PM, Laine Stump wrote: > On 10/04/2016 11:45 AM, Alex Williamson wrote: >> On Tue, 4 Oct 2016 15:59:11 +0100 >> "Daniel P. Berrange" <berrange@redhat.com> wrote: >> >>> On Mon, Sep 05, 2016 at 06:24:48PM +0200, Laszlo Ersek wrote: >>>> On 09/01/16 15:22, Marcel Apfelbaum wrote: >>>>> +2.3 PCI only hierarchy >>>>> +====================== >>>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or >>>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges >>>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged >>>>> +only into pcie.0 bus. >>>>> + >>>>> + pcie.0 bus >>>>> + ---------------------------------------------- >>>>> + | | >>>>> + ----------- ------------------ >>>>> + | PCI Dev | | DMI-PCI BRIDGE | >>>>> + ---------- ------------------ >>>>> + | | >>>>> + ----------- ------------------ >>>>> + | PCI Dev | | PCI-PCI Bridge | >>>>> + ----------- ------------------ >>>>> + | | >>>>> + ----------- ----------- >>>>> + | PCI Dev | | PCI Dev | >>>>> + ----------- ----------- >>>> >>>> Works for me, but I would again elaborate a little bit on keeping the >>>> hierarchy flat. >>>> >>>> First, in order to preserve compatibility with libvirt's current >>>> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge, >>>> even if that's possible otherwise. Let's just say >>>> >>>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy >>>> is required), >>> >>> Why do you suggest this ? 
If the guest has multiple NUMA nodes >>> and you're creating a PXB for each NUMA node, then it looks valid >>> to want to have a DMI-PCI bridge attached to each PXB, so you can >>> have legacy PCI devices on each NUMA node, instead of putting them >>> all on the PCI bridge without NUMA affinity. >> >> Seems like this is one of those "generic" vs "specific" device issues. >> We use the DMI-to-PCI bridge as if it were a PCIe-to-PCI bridge, but >> DMI is actually an Intel proprietary interface, the bridge just has the >> same software interface as a PCI bridge. So while you can use it as a >> generic PCIe-to-PCI bridge, it's at least going to make me cringe every >> time. > > > If using it this way makes kittens cry or something, then we'd be happy to use a generic pcie-to-pci bridge if somebody created one :-) > > >> >>>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge, >>> >>> What's the rational for that, as opposed to plugging devices directly >>> into the DMI-PCI bridge which seems to work ? >> Hi, >> IIRC, something about hotplug, but from a PCI perspective it doesn't >> make any sense to me either. > Indeed, the reason to plug the PCI bridge into the DMI-TO-PCI bridge would be the hot-plug support. The PCI bridges can support hotplug on Q35. There is even an RFC on the list doing that: https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg05681.html With the DMI-PCI bridge is another story. From what I understand the actual device (i82801b11) do not support hotplug and the chances to make it work are minimal. > > At one point Marcel and Michael were discussing the possibility of making hotplug work on a dmi-to-pci-bridge. Currently it doesn't even work for pci-bridge so (as I think I said in another message > just now) it is kind of pointless, although when I asked about eliminating use of pci-bridge in favor of just using dmi-to-pci-bridge directly, I got lots of "no" votes. 
> Since we have an RFC showing it is possible to have hotplug for PCI devices plugged into PCI bridges, it is better to continue using the PCI bridge until one of the below happens: 1 - pci-bridge ACPI hotplug will be possible 2 - i82801b11 ACPI hotplug will be possible 3 - a new pcie-pci bridge will be coded > >> Same with the restriction from using slot >> 0 on PCI bridges, there's no basis for that except on the root bus. > > I tried allowing devices to be plugged into slot 0 of a pci-bridge in libvirt - qemu barfed, so I moved the "minSlot" for pci-bridge back up to 1. Slot 0 is completely usable on a dmi-to-pci-bridge > though (and libvirt allows it). At this point, even if qemu enabled using slot 0 of a pci-bridge, libvirt wouldn't be able to expose that to users (unless the min/max slot of each PCI controller was > made visible somewhere via QMP) > The reason for not being able to plug a device into slot 0 of a PCI Bridge is the SHPC (Hot-plug controller) device embedded in the PCI bridge by default. The SHPC spec requires this. If one disables it with shpc=false, he should be able to use slot 0. Funny thing: the SHPC is not actually used by either the i440fx or Q35 machines; for i440fx we use ACPI based PCI hotplug and for Q35 we use PCIe native hotplug. Should we default the shpc to off? Thanks, Marcel ^ permalink raw reply [flat|nested] 52+ messages in thread
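Marcel's SHPC explanation suggests an experiment along these lines (a sketch; the IDs are invented, and whether slot 0 then behaves as expected would need testing):

```shell
# Disable the embedded SHPC hot-plug controller on the pci-bridge;
# per the explanation above, slot 0 should then become usable.
qemu-system-x86_64 -M q35 \
  -device i82801b11-bridge,id=dmi0,bus=pcie.0 \
  -device pci-bridge,id=br0,bus=dmi0,chassis_nr=1,shpc=off \
  -device e1000,bus=br0,addr=0x0
```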
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-01 13:22 [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines Marcel Apfelbaum 2016-09-01 13:27 ` Peter Maydell 2016-09-05 16:24 ` Laszlo Ersek @ 2016-09-06 15:38 ` Alex Williamson 2016-09-06 18:14 ` Marcel Apfelbaum 2 siblings, 1 reply; 52+ messages in thread From: Alex Williamson @ 2016-09-06 15:38 UTC (permalink / raw) To: Marcel Apfelbaum; +Cc: qemu-devel, lersek, mst On Thu, 1 Sep 2016 16:22:07 +0300 Marcel Apfelbaum <marcel@redhat.com> wrote: > Proposes best practices on how to use PCIe/PCI device > in PCIe based machines and explain the reasoning behind them. > > Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> > --- > > Hi, > > Please add your comments on what to add/remove/edit to make this doc usable. > > Thanks, > Marcel > > docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 145 insertions(+) > create mode 100644 docs/pcie.txt > > diff --git a/docs/pcie.txt b/docs/pcie.txt > new file mode 100644 > index 0000000..52a8830 > --- /dev/null > +++ b/docs/pcie.txt > @@ -0,0 +1,145 @@ > +PCI EXPRESS GUIDELINES > +====================== > + > +1. Introduction > +================ > +The doc proposes best practices on how to use PCIe/PCI device > +in PCIe based machines and explains the reasoning behind them. > + > + > +2. Device placement strategy > +============================ > +QEMU does not have a clear socket-device matching mechanism > +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot. > +Plugging a PCI device into a PCIe device might not always work and > +is weird anyway since it cannot be done for "bare metal". > +Plugging a PCIe device into a PCI slot will hide the Extended > +Configuration Space thus is also not recommended. > + > +The recommendation is to separate the PCIe and PCI hierarchies. 
> +PCIe devices should be plugged only into PCIe Root Ports and > +PCIe Downstream ports (let's call them PCIe ports). > + > +2.1 Root Bus (pcie.0) > +===================== > +Plug only legacy PCI devices as Root Complex Integrated Devices > +even if the PCIe spec does not forbid PCIe devices. The existing Surely we can have PCIe device on the root complex?? > +hardware uses mostly PCI devices as Integrated Endpoints. In this > +way we may avoid some strange Guest OS-es behaviour. > +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports) > +or DMI-PCI bridges to start legacy PCI hierarchies. > + > + > + pcie.0 bus > + -------------------------------------------------------------------------- > + | | | | > + ----------- ------------------ ------------------ ------------------ > + | PCI Dev | | PCIe Root Port | | Upstream Port | | DMI-PCI bridge | > + ----------- ------------------ ------------------ ------------------ Do you have a spec reference for plugging an upstream port directly into the root complex? IMHO this is invalid, an upstream port can only be attached behind a downstream port, ie. a root port or downstream switch port. > + > +2.2 PCIe only hierarchy > +======================= > +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream > +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches > +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports. This seems to contradict 2.1, but I agree more with this statement to only start a PCIe sub-hierarchy with a root port, not an upstream port connected to the root complex. The 2nd sentence is confusing, I don't know if you're referring to fan-out via PCIe switch downstream of a root port or again suggesting to use upstream switch ports directly on the root complex. 
> + > + > + pcie.0 bus > + ---------------------------------------------------- > + | | | > + ------------- ------------- ------------- > + | Root Port | | Root Port | | Root Port | > + ------------ -------------- ------------- > + | | > + ------------ ----------------- > + | PCIe Dev | | Upstream Port | > + ------------ ----------------- > + | | > + ------------------- ------------------- > + | Downstream Port | | Downstream Port | > + ------------------- ------------------- > + | > + ------------ > + | PCIe Dev | > + ------------ > + > +2.3 PCI only hierarchy > +====================== > +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or > +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges > +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged > +only into pcie.0 bus. > + > + pcie.0 bus > + ---------------------------------------------- > + | | > + ----------- ------------------ > + | PCI Dev | | DMI-PCI BRIDGE | > + ---------- ------------------ > + | | > + ----------- ------------------ > + | PCI Dev | | PCI-PCI Bridge | > + ----------- ------------------ > + | | > + ----------- ----------- > + | PCI Dev | | PCI Dev | > + ----------- ----------- > + I really wish we had generic PCIe-to-PCI bridges rather than this DMI bridge thing... > + > + > +3. IO space issues > +=================== > +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and Yeah, I've lost the meaning of Ports here, this statement is true for upstream ports as well. > +as required by PCI spec will reserve a 4K IO range for each. > +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize > +it by allocation the IO space only if there is at least a device > +with IO BARs plugged into the bridge. 
> +Behind a PCIe PORT only one device may be plugged, resulting in Here I think you're trying to specify root/downstream ports, but upstream ports have the same i/o port allocation problems and do not have this one device limitation. > +the allocation of a whole 4K range for each device. > +The IO space is limited resulting in ~10 PCIe ports per system > +if devices with IO BARs are plugged into IO ports. > + > +Using the proposed device placing strategy solves this issue > +by using only PCIe devices with PCIe PORTS. The PCIe spec requires > +PCIe devices to work without IO BARs. > +The PCI hierarchy has no such limitations. Actually it does, but it's mostly not an issue since we have 32 slots available (minus QEMU/libvirt excluding 1 for no good reason) downstream of each bridge. > + > + > +4. Hot Plug > +============ > +The root bus pcie.0 does not support hot-plug, so Integrated Devices, > +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged. > + > +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug > +in QEMU preventing it to work, but it would be solved soon). Probably want to give some sort of date/commit references to these current state of affairs facts, a reader is not likely to lookup the git commit for this verbiage and extrapolate it to a QEMU version. > +The PCI hotplug is ACPI based and can work side by side with the PCIe > +native hotplug. > + > +PCIe devices can be natively hot-plugged/hot-unplugged into/from > +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable. Why? This seems like a QEMU bug. Clearly we need the downstream ports in place when the upstream switch is hot-added, but this should be feasible. > +Keep in mind you always need to have at least one PCIe Port available > +for hotplug, the PCIe Ports themselves are not hot-pluggable. If a user cares about hotplug... > + > + > +5. 
Device assignment > +==================== > +Host devices are mostly PCIe and should be plugged only into PCIe ports. > +PCI-PCI bridge slots can be used for legacy PCI host devices. I don't think we have any evidence to suggest this as a best practice. We have a lot of experience placing PCIe host devices into a conventional PCI topology on 440FX. We don't have nearly as much experience placing them into downstream PCIe ports. This seems like how we would like for things to behave to look like real hardware platforms, but it's just navel gazing whether it's actually the right thing to do. Thanks, Alex > + > + > +6. Virtio devices > +================= > +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices > +will remain PCI and have transitional behaviour as default. > +Virtio devices plugged into PCIe ports are Express devices and have > +"1.0" behavior by default without IO support. > +In both case disable-* properties can be used to override the behaviour. > + > + > +7. Conclusion > +============== > +The proposal offers a usage model that is easy to understand and follow > +and in the same time overcomes some PCIe limitations. > + > + > + ^ permalink raw reply [flat|nested] 52+ messages in thread
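As a concrete counterpart to the section 2.2 topology under review, the root-port/switch hierarchy might be built with the Intel-flavoured port devices QEMU ships (a sketch; chassis/slot numbers and IDs are arbitrary):

```shell
# Two root ports on pcie.0; one carries an endpoint directly, the
# other carries a switch (upstream port + downstream port) with an
# endpoint behind it.
qemu-system-x86_64 -M q35 \
  -device ioh3420,id=rp1,bus=pcie.0,chassis=1,slot=1 \
  -device virtio-net-pci,bus=rp1 \
  -device ioh3420,id=rp2,bus=pcie.0,chassis=2,slot=2 \
  -device x3130-upstream,id=up1,bus=rp2 \
  -device xio3130-downstream,id=dn1,bus=up1,chassis=3,slot=0 \
  -device virtio-net-pci,bus=dn1
```

Per section 6 of the draft, the virtio devices behind rp1 and dn1 would come up as Express, "1.0" devices without IO support by default.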
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 15:38 ` Alex Williamson @ 2016-09-06 18:14 ` Marcel Apfelbaum 2016-09-06 18:32 ` Alex Williamson 0 siblings, 1 reply; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-06 18:14 UTC (permalink / raw) To: Alex Williamson; +Cc: qemu-devel, lersek, mst On 09/06/2016 06:38 PM, Alex Williamson wrote: > On Thu, 1 Sep 2016 16:22:07 +0300 > Marcel Apfelbaum <marcel@redhat.com> wrote: > >> Proposes best practices on how to use PCIe/PCI device >> in PCIe based machines and explain the reasoning behind them. >> >> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> >> --- >> >> Hi, >> >> Please add your comments on what to add/remove/edit to make this doc usable. >> >> Thanks, >> Marcel >> >> docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 145 insertions(+) >> create mode 100644 docs/pcie.txt >> >> diff --git a/docs/pcie.txt b/docs/pcie.txt >> new file mode 100644 >> index 0000000..52a8830 >> --- /dev/null >> +++ b/docs/pcie.txt >> @@ -0,0 +1,145 @@ >> +PCI EXPRESS GUIDELINES >> +====================== >> + >> +1. Introduction >> +================ >> +The doc proposes best practices on how to use PCIe/PCI device >> +in PCIe based machines and explains the reasoning behind them. >> + >> + >> +2. Device placement strategy >> +============================ >> +QEMU does not have a clear socket-device matching mechanism >> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot. >> +Plugging a PCI device into a PCIe device might not always work and >> +is weird anyway since it cannot be done for "bare metal". >> +Plugging a PCIe device into a PCI slot will hide the Extended >> +Configuration Space thus is also not recommended. >> + >> +The recommendation is to separate the PCIe and PCI hierarchies. >> +PCIe devices should be plugged only into PCIe Root Ports and >> +PCIe Downstream ports (let's call them PCIe ports). 
>> + >> +2.1 Root Bus (pcie.0) >> +===================== >> +Plug only legacy PCI devices as Root Complex Integrated Devices >> +even if the PCIe spec does not forbid PCIe devices. The existing > Hi Alex, Thanks for the review. > Surely we can have PCIe device on the root complex?? > Yes, we can; it is not forbidden. Even so, my understanding is that the main use for Integrated Devices is for legacy devices like sound cards or NICs that come with the motherboard. Because of that my concern is we might be missing some support for that in QEMU or even in the Linux kernel. One example I got from Jason about an issue with Integrated Endpoints in the kernel: commit d14053b3c714178525f22660e6aaf41263d00056 Author: David Woodhouse <David.Woodhouse@intel.com> Date: Thu Oct 15 09:28:06 2015 +0100 iommu/vt-d: Fix ATSR handling for Root-Complex integrated endpoints The VT-d specification says that "Software must enable ATS on endpoint devices behind a Root Port only if the Root Port is reported as supporting ATS transactions." .... We can say it is a bug and it is solved, so what's the problem? But my point is: why do it in the first place? We are the hardware "vendors" and we can decide not to add PCIe devices as Integrated Devices. >> +hardware uses mostly PCI devices as Integrated Endpoints. In this >> +way we may avoid some strange Guest OS-es behaviour. >> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports) >> +or DMI-PCI bridges to start legacy PCI hierarchies.
IMHO this is invalid, an upstream port can only > be attached behind a downstream port, ie. a root port or downstream > switch port. > Yes, it is a bug; both Laszlo and I spotted it, and the 2.2 figure shows it right. Thanks for finding it. >> + >> +2.2 PCIe only hierarchy >> +======================= >> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream >> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches >> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports. > > This seems to contradict 2.1, Yes, please forgive the bug, it will not appear in v2 but I agree more with this statement to > only start a PCIe sub-hierarchy with a root port, not an upstream port > connected to the root complex. The 2nd sentence is confusing, I don't > know if you're referring to fan-out via PCIe switch downstream of a > root port or again suggesting to use upstream switch ports directly on > the root complex. > The PCIe hierarchy always starts with PCI Express Root Ports; the switch is to be plugged into the PCI Express ports. I will try to re-phrase to be clearer.
DMI-BRIDGES should be plugged >> +only into pcie.0 bus. >> + >> + pcie.0 bus >> + ---------------------------------------------- >> + | | >> + ----------- ------------------ >> + | PCI Dev | | DMI-PCI BRIDGE | >> + ---------- ------------------ >> + | | >> + ----------- ------------------ >> + | PCI Dev | | PCI-PCI Bridge | >> + ----------- ------------------ >> + | | >> + ----------- ----------- >> + | PCI Dev | | PCI Dev | >> + ----------- ----------- >> + > > I really wish we had generic PCIe-to-PCI bridges rather than this DMI > bridge thing... > Thank you, that's a very good idea and I intend to implement it. >> + >> + >> +3. IO space issues >> +=================== >> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and > > Yeah, I've lost the meaning of Ports here, this statement is true for > upstream ports as well. > Laslzo asked me to enumerate all the controllers instead of "PCIe", I am starting to see why... >> +as required by PCI spec will reserve a 4K IO range for each. >> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize >> +it by allocation the IO space only if there is at least a device >> +with IO BARs plugged into the bridge. >> +Behind a PCIe PORT only one device may be plugged, resulting in > > Here I think you're trying to specify root/downstream ports, but > upstream ports have the same i/o port allocation problems and do not > have this one device limitation. > I'll be more specific, sure. >> +the allocation of a whole 4K range for each device. >> +The IO space is limited resulting in ~10 PCIe ports per system >> +if devices with IO BARs are plugged into IO ports. >> + >> +Using the proposed device placing strategy solves this issue >> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires >> +PCIe devices to work without IO BARs. >> +The PCI hierarchy has no such limitations. 
> > Actually it does, but it's mostly not an issue since we have 32 slots > available (minus QEMU/libvirt excluding 1 for no good reason) > downstream of each bridge. > This is what I meant; I'll make it clearer. >> + >> + >> +4. Hot Plug >> +============ >> +The root bus pcie.0 does not support hot-plug, so Integrated Devices, >> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged. >> + >> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug >> +in QEMU preventing it to work, but it would be solved soon). > > Probably want to give some sort of date/commit references to these > current state of affairs facts, a reader is not likely to lookup the > git commit for this verbiage and extrapolate it to a QEMU version. > I'll delete it (as Laszlo proposed) since this is not a "status" doc. It will be taken care of eventually. >> +The PCI hotplug is ACPI based and can work side by side with the PCIe >> +native hotplug. >> + >> +PCIe devices can be natively hot-plugged/hot-unplugged into/from >> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable. > > Why? This seems like a QEMU bug. Clearly we need the downstream ports > in place when the upstream switch is hot-added, but this should be > feasible. > I don't quite understand the question. I do think switches can be hot-plugged, but I am not sure if QEMU allows it. If not, this is something we should solve. >> +Keep in mind you always need to have at least one PCIe Port available >> +for hotplug, the PCIe Ports themselves are not hot-pluggable. > > If a user cares about hotplug... > ...he should reserve enough empty PCI Express Root Ports/PCI Express Downstream Ports. Laszlo had some numbers and ideas on how this user can plan in advance for hotplug, maybe we should bring them together :) >> + >> + >> +5. Device assignment >> +==================== >> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
>> +PCI-PCI bridge slots can be used for legacy PCI host devices. > > I don't think we have any evidence to suggest this as a best practice. > We have a lot of experience placing PCIe host devices into a > conventional PCI topology on 440FX. We don't have nearly as much > experience placing them into downstream PCIe ports. This seems like > how we would like for things to behave to look like real hardware > platforms, but it's just navel gazing whether it's actually the right > thing to do. Thanks, > I had to look up "navel gazing"... While I do agree with your statements, I prefer a cleaner PCI Express machine with as little legacy PCI as possible. I use this document as an opportunity to start gaining experience with device assignment into PCI Express Root Ports and Downstream Ports and to solve the issues along the way. Your review really helped, thanks! Marcel > Alex > >> + >> + >> +6. Virtio devices >> +================= >> +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices >> +will remain PCI and have transitional behaviour as default. >> +Virtio devices plugged into PCIe ports are Express devices and have >> +"1.0" behavior by default without IO support. >> +In both case disable-* properties can be used to override the behaviour. >> + >> + >> +7. Conclusion >> +============== >> +The proposal offers a usage model that is easy to understand and follow >> +and in the same time overcomes some PCIe limitations. >> + >> + >> + > ^ permalink raw reply [flat|nested] 52+ messages in thread
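Putting the assignment and hotplug-planning advice from the thread together, a sketch (the host BDFs are invented, and whether assignment into root ports is the right default is exactly what is under discussion above):

```shell
# Assign one host PCIe device into a root port, and keep a second,
# empty root port reserved for later hotplug; the ports themselves
# cannot be hot-added.
qemu-system-x86_64 -M q35 \
  -device ioh3420,id=rp1,bus=pcie.0,chassis=1,slot=1 \
  -device vfio-pci,host=02:00.0,bus=rp1 \
  -device ioh3420,id=rp2,bus=pcie.0,chassis=2,slot=2
```

Later, a second host device could be hot-added from the monitor with something like: device_add vfio-pci,host=03:00.0,bus=rp2,id=hostdev1.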
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 18:14 ` Marcel Apfelbaum @ 2016-09-06 18:32 ` Alex Williamson 2016-09-06 18:59 ` Marcel Apfelbaum 2016-09-07 7:44 ` Laszlo Ersek 0 siblings, 2 replies; 52+ messages in thread From: Alex Williamson @ 2016-09-06 18:32 UTC (permalink / raw) To: Marcel Apfelbaum; +Cc: qemu-devel, lersek, mst On Tue, 6 Sep 2016 21:14:11 +0300 Marcel Apfelbaum <marcel@redhat.com> wrote: > On 09/06/2016 06:38 PM, Alex Williamson wrote: > > On Thu, 1 Sep 2016 16:22:07 +0300 > > Marcel Apfelbaum <marcel@redhat.com> wrote: > > > >> Proposes best practices on how to use PCIe/PCI device > >> in PCIe based machines and explain the reasoning behind them. > >> > >> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> > >> --- > >> > >> Hi, > >> > >> Please add your comments on what to add/remove/edit to make this doc usable. > >> > >> Thanks, > >> Marcel > >> > >> docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > >> 1 file changed, 145 insertions(+) > >> create mode 100644 docs/pcie.txt > >> > >> diff --git a/docs/pcie.txt b/docs/pcie.txt > >> new file mode 100644 > >> index 0000000..52a8830 > >> --- /dev/null > >> +++ b/docs/pcie.txt > >> @@ -0,0 +1,145 @@ > >> +PCI EXPRESS GUIDELINES > >> +====================== > >> + > >> +1. Introduction > >> +================ > >> +The doc proposes best practices on how to use PCIe/PCI device > >> +in PCIe based machines and explains the reasoning behind them. > >> + > >> + > >> +2. Device placement strategy > >> +============================ > >> +QEMU does not have a clear socket-device matching mechanism > >> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot. > >> +Plugging a PCI device into a PCIe device might not always work and > >> +is weird anyway since it cannot be done for "bare metal". > >> +Plugging a PCIe device into a PCI slot will hide the Extended > >> +Configuration Space thus is also not recommended. 
> >> + > >> +The recommendation is to separate the PCIe and PCI hierarchies. > >> +PCIe devices should be plugged only into PCIe Root Ports and > >> +PCIe Downstream ports (let's call them PCIe ports). > >> + > >> +2.1 Root Bus (pcie.0) > >> +===================== > >> +Plug only legacy PCI devices as Root Complex Integrated Devices > >> +even if the PCIe spec does not forbid PCIe devices. The existing > > > > Hi Alex, > Thanks for the review. > > > > Surely we can have PCIe device on the root complex?? > > > > Yes, we can, is not forbidden. Even so, my understanding is > the main use for Integrated Devices is for legacy devices > like sound cards or nics that come with the motherboard. > Because of that my concern is we might be missing some support > for that in QEMU or even in linux kernel. > > One example I got from Jason about an issue with Integrated Points in kernel: > > commit d14053b3c714178525f22660e6aaf41263d00056 > Author: David Woodhouse <David.Woodhouse@intel.com> > Date: Thu Oct 15 09:28:06 2015 +0100 > > iommu/vt-d: Fix ATSR handling for Root-Complex integrated endpoints > > The VT-d specification says that "Software must enable ATS on endpoint > devices behind a Root Port only if the Root Port is reported as > supporting ATS transactions." > .... > > We can say is a bug and is solved, what's the problem? > But my point it, why do it in the first place? > We are the hardware "vendors" and we can decide not to add PCIe > devices as Integrated Devices. > > > > >> +hardware uses mostly PCI devices as Integrated Endpoints. In this > >> +way we may avoid some strange Guest OS-es behaviour. > >> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports) > >> +or DMI-PCI bridges to start legacy PCI hierarchies. 
> >> + > >> + > >> + pcie.0 bus > >> + -------------------------------------------------------------------------- > >> + | | | | > >> + ----------- ------------------ ------------------ ------------------ > >> + | PCI Dev | | PCIe Root Port | | Upstream Port | | DMI-PCI bridge | > >> + ----------- ------------------ ------------------ ------------------ > > > > Do you have a spec reference for plugging an upstream port directly > > into the root complex? IMHO this is invalid, an upstream port can only > > be attached behind a downstream port, ie. a root port or downstream > > switch port. > > > > Yes, it is a bug; both Laszlo and I spotted it, and the 2.2 figure shows it right. > Thanks for finding it. > > >> + > >> +2.2 PCIe only hierarchy > >> +======================= > >> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream > >> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches > >> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports. > > > > This seems to contradict 2.1, > > Yes, please forgive the bug, it will not appear in v2 > > but I agree more with this statement to > > only start a PCIe sub-hierarchy with a root port, not an upstream port > > connected to the root complex. The 2nd sentence is confusing, I don't > > know if you're referring to fan-out via PCIe switch downstream of a > > root port or again suggesting to use upstream switch ports directly on > > the root complex. > > > > The PCIe hierarchy always starts with PCI Express Root Ports, the switches > are to be plugged into the PCI Express ports. I will try to re-phrase to be > clearer.
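The layout agreed on here (Root Ports on pcie.0, a switch only behind a Root Port, PCIe devices only in Root/Downstream Ports) can be sketched as a QEMU command-line fragment. `ioh3420`, `x3130-upstream` and `xio3130-downstream` are the QEMU device models for Root Ports and switch ports; the ids, chassis and slot numbers below are illustrative, and `virtio-net-pci` is just a stand-in endpoint:

```sh
# Sketch only: the recommended PCIe-only hierarchy on a Q35 machine.
# Root Ports sit on pcie.0; the switch (upstream + downstream ports)
# sits behind a Root Port, never directly on pcie.0.
qemu-system-x86_64 -M q35 \
    -device ioh3420,id=root_port1,bus=pcie.0,chassis=1,slot=1 \
    -device ioh3420,id=root_port2,bus=pcie.0,chassis=2,slot=2 \
    -device x3130-upstream,id=upstream1,bus=root_port2 \
    -device xio3130-downstream,id=downstream1,bus=upstream1,chassis=3,slot=0 \
    -device xio3130-downstream,id=downstream2,bus=upstream1,chassis=4,slot=1 \
    -device virtio-net-pci,bus=root_port1 \
    -device virtio-net-pci,bus=downstream1
```

Each `chassis` value must be unique so the guest's hotplug controllers can be told apart.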
> > > >> + > >> + > >> + pcie.0 bus > >> + ---------------------------------------------------- > >> + | | | > >> + ------------- ------------- ------------- > >> + | Root Port | | Root Port | | Root Port | > >> + ------------ -------------- ------------- > >> + | | > >> + ------------ ----------------- > >> + | PCIe Dev | | Upstream Port | > >> + ------------ ----------------- > >> + | | > >> + ------------------- ------------------- > >> + | Downstream Port | | Downstream Port | > >> + ------------------- ------------------- > >> + | > >> + ------------ > >> + | PCIe Dev | > >> + ------------ > >> + > >> +2.3 PCI only hierarchy > >> +====================== > >> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or > >> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges > >> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged > >> +only into pcie.0 bus. > >> + > >> + pcie.0 bus > >> + ---------------------------------------------- > >> + | | > >> + ----------- ------------------ > >> + | PCI Dev | | DMI-PCI BRIDGE | > >> + ---------- ------------------ > >> + | | > >> + ----------- ------------------ > >> + | PCI Dev | | PCI-PCI Bridge | > >> + ----------- ------------------ > >> + | | > >> + ----------- ----------- > >> + | PCI Dev | | PCI Dev | > >> + ----------- ----------- > >> + > > > > I really wish we had generic PCIe-to-PCI bridges rather than this DMI > > bridge thing... > > > > Thank you, that's a very good idea and I intend to implement it. > > >> + > >> + > >> +3. IO space issues > >> +=================== > >> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and > > > > Yeah, I've lost the meaning of Ports here, this statement is true for > > upstream ports as well. > > > > Laszlo asked me to enumerate all the controllers instead of "PCIe", > I am starting to see why... > > >> +as required by PCI spec will reserve a 4K IO range for each.
> >> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize > >> +it by allocation the IO space only if there is at least a device > >> +with IO BARs plugged into the bridge. > >> +Behind a PCIe PORT only one device may be plugged, resulting in > > > > Here I think you're trying to specify root/downstream ports, but > > upstream ports have the same i/o port allocation problems and do not > > have this one device limitation. > > > > I'll be more specific, sure. > > >> +the allocation of a whole 4K range for each device. > >> +The IO space is limited resulting in ~10 PCIe ports per system > >> +if devices with IO BARs are plugged into IO ports. > >> + > >> +Using the proposed device placing strategy solves this issue > >> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires > >> +PCIe devices to work without IO BARs. > >> +The PCI hierarchy has no such limitations. > > > > Actually it does, but it's mostly not an issue since we have 32 slots > > available (minus QEMU/libvirt excluding 1 for no good reason) > > downstream of each bridge. > > > > This is what I meant, I'll make it clearer. > > >> + > >> + > >> +4. Hot Plug > >> +============ > >> +The root bus pcie.0 does not support hot-plug, so Integrated Devices, > >> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged. > >> + > >> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug > >> +in QEMU preventing it to work, but it would be solved soon). > > > > Probably want to give some sort of date/commit references to these > > current state of affairs facts, a reader is not likely to lookup the > > git commit for this verbiage and extrapolate it to a QEMU version. > > > > I'll delete it (as Laszlo proposed) since it is not a "status" doc. > It will be taken care of eventually. > > >> +The PCI hotplug is ACPI based and can work side by side with the PCIe > >> +native hotplug.
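The "~10 PCIe ports per system" figure discussed above follows from simple arithmetic; a sketch (the exact reservations vary by chipset and firmware):

```python
# Back-of-the-envelope for the legacy IO-space limit.
IO_SPACE = 64 * 1024          # total legacy PCI IO space: 64 KiB
BRIDGE_IO_WINDOW = 4 * 1024   # minimum IO window a PCI bridge decodes
                              # (4 KiB, 4 KiB-aligned, per the bridge spec)

# Absolute upper bound on bridges/ports that can each get an IO window:
max_windows = IO_SPACE // BRIDGE_IO_WINDOW
print(max_windows)  # 16

# In practice the chipset, integrated endpoints and alignment constraints
# consume part of the space, which is how the document arrives at roughly
# ten usable PCIe ports for devices with IO BARs.
```

Since every Root/Downstream Port carries at most one device, a single IO-BAR device behind a port still costs the full 4 KiB window, which is why PCIe devices (which must work without IO BARs) are cheaper.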
> >> + > >> +PCIe devices can be natively hot-plugged/hot-unplugged into/from > >> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable. > > > > Why? This seems like a QEMU bug. Clearly we need the downstream ports > > in place when the upstream switch is hot-added, but this should be > > feasible. > > > > I don't quite understand the question. I do think switches can be hot-plugged, > but I am not sure if QEMU allows it. If not, this is something we should solve. Sorry, I read too quickly and inserted a <not> in there, I thought the statement was that switches are not hot-pluggable. I think the issue will be the ordering of hot-adding the downstream switch ports prior to the upstream switch port, or we're going to need to invent a switch with hot-pluggable downstream ports. I expect the guest is only going to scan for downstream ports once after the upstream port is discovered. > >> +Keep in mind you always need to have at least one PCIe Port available > >> +for hotplug, the PCIe Ports themselves are not hot-pluggable. > > > > If a user cares about hotplug... > > > > ...he should reserve enough empty PCI Express Root Ports / PCI Express Downstream ports. > Laszlo had some numbers and ideas on how this user can plan in advance for hotplug, > maybe we should bring them together :)
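Reserving empty ports in advance, as suggested here, can be sketched like this (ids and numbers are illustrative; `e1000e` is used as an example of a PCIe NIC model, and the monitor commands are shown as they would be typed at the HMP prompt):

```sh
# Sketch: reserve empty Root Ports at boot so there is somewhere
# to hot-plug PCIe devices into later.
qemu-system-x86_64 -M q35 \
    -device ioh3420,id=rp_spare1,bus=pcie.0,chassis=1,slot=1 \
    -device ioh3420,id=rp_spare2,bus=pcie.0,chassis=2,slot=2

# Later, from the QEMU monitor, hot-plug into a reserved port
# (native PCIe hotplug), and hot-unplug by id:
#   (qemu) device_add e1000e,bus=rp_spare1,id=hotnic
#   (qemu) device_del hotnic
```

The Root Ports themselves cannot be hot-added, so the number reserved at boot is the hotplug budget for the lifetime of the guest.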
Thanks, > > > > I had to look up the "navel gazing"... > While I do agree with your statements, I prefer a cleaner PCI Express machine > with as little legacy PCI as possible. I use this document as an opportunity > to start gaining experience with device assignment into PCI Express Root Ports > and Downstream Ports and solve the issues along the way. That's exactly what I mean, there's an ulterior, personal motivation in this suggestion that's not really backed by facts. You'd like to make the recommendation to place PCIe assigned devices into PCIe slots, but that's not necessarily the configuration with the best track record right now. In fact there's really no advantage to a user to do this unless they have a device that needs PCIe (radeon and tg3 potentially come to mind here). So while I agree with you from an ideological standpoint, I don't think that's sufficient to make the recommendation you're proposing here. Thanks, Alex ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 18:32 ` Alex Williamson @ 2016-09-06 18:59 ` Marcel Apfelbaum 2016-09-07 7:44 ` Laszlo Ersek 1 sibling, 0 replies; 52+ messages in thread From: Marcel Apfelbaum @ 2016-09-06 18:59 UTC (permalink / raw) To: Alex Williamson; +Cc: qemu-devel, lersek, mst On 09/06/2016 09:32 PM, Alex Williamson wrote: > On Tue, 6 Sep 2016 21:14:11 +0300 > Marcel Apfelbaum <marcel@redhat.com> wrote: > >> On 09/06/2016 06:38 PM, Alex Williamson wrote: >>> On Thu, 1 Sep 2016 16:22:07 +0300 >>> Marcel Apfelbaum <marcel@redhat.com> wrote: >>> >>>> Proposes best practices on how to use PCIe/PCI device >>>> in PCIe based machines and explain the reasoning behind them. >>>> >>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com> >>>> --- >>>> >>>> Hi, >>>> >>>> Please add your comments on what to add/remove/edit to make this doc usable. >>>> >>>> Thanks, >>>> Marcel >>>> [...] >> >>>> +The PCI hotplug is ACPI based and can work side by side with the PCIe >>>> +native hotplug. >>>> + >>>> +PCIe devices can be natively hot-plugged/hot-unplugged into/from >>>> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable. >>> >>> Why? This seems like a QEMU bug. Clearly we need the downstream ports >>> in place when the upstream switch is hot-added, but this should be >>> feasible. >>> >> >> I don't quite understand the question. I do think switches can be hot-plugged, >> but I am not sure if QEMU allows it. If not, this is something we should solve. > > Sorry, I read too quickly and inserted a <not> in there, I thought the > statement was that switches are not hot-pluggable. I think the issue > will be the ordering of hot-adding the downstream switch ports prior to > the upstream switch port, or we're going to need to invent a > switch with hot-pluggable downstream ports. I expect the guest is only > going to scan for downstream ports once after the upstream port is > discovered.
> The problem I see is that I need to specify a bus to plug the Downstream Port, but this is the id of the upstream port I haven't added yet. I need to think a little bit more on how to do it, or I am missing something. [...] >>>> +5. Device assignment >>>> +==================== >>>> +Host devices are mostly PCIe and should be plugged only into PCIe ports. >>>> +PCI-PCI bridge slots can be used for legacy PCI host devices. >>> >>> I don't think we have any evidence to suggest this as a best practice. >>> We have a lot of experience placing PCIe host devices into a >>> conventional PCI topology on 440FX. We don't have nearly as much >>> experience placing them into downstream PCIe ports. This seems like >>> how we would like for things to behave to look like real hardware >>> platforms, but it's just navel gazing whether it's actually the right >>> thing to do. Thanks, >>> >> >> I had to look up the "navel gazing"... >> While I do agree with your statements, I prefer a cleaner PCI Express machine >> with as little legacy PCI as possible. I use this document as an opportunity >> to start gaining experience with device assignment into PCI Express Root Ports >> and Downstream Ports and solve the issues along the way. > > That's exactly what I mean, there's an ulterior, personal motivation in > this suggestion that's not really backed by facts. Ulterior yes, personal no. Several developers of both ARM and x86 PCI Express machines see the new machines as an opportunity to get rid of legacy and keep them as modern as possible. Funny thing, I *personally* prefer to see Q35 as a replacement for PC machines, no need to keep and support them both. You'd like to make > the recommendation to place PCIe assigned devices into PCIe slots, but > that's not necessarily the configuration with the best track record > right now. Since we haven't used Q35 at all until now (a speculation, but probably true) the track record is kind of clean...
In fact there's really no advantage to a user to do this > unless they have a device that needs PCIe (radeon and tg3 > potentially come to mind here). The advantage is to avoid making the PCI Express "purists" (where are you now??) start a legacy PCI hierarchy just to plug a modern device into a modern PCI Express machine. Another advantage is to avoid tainting the ACPI tables with ACPI hotplug support for the PCI-bridge devices and stuff like that. I agree it is safer to plug assigned devices into PCI slots from an "enterprise" point of view, but this is upstream, right? :) We look to the future... (and we don't have known issues yet anyway) So while I agree with you from an > ideological standpoint, I don't think that's sufficient to make the > recommendation you're proposing here. Thanks, > I'll find a way to rephrase it, maybe: Host devices are mostly PCIe and they can be plugged into PCI Express Root Ports/Downstream Ports; however, we have no experience doing that. As a fall-back the PCI hierarchy can be used to plug an assigned device into a PCI slot. Thanks, Marcel >>>> +PCI-PCI bridge slots can be used for legacy PCI host devices. > Alex > ^ permalink raw reply [flat|nested] 52+ messages in thread
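The two options being weighed here can be sketched as QEMU command-line fragments; `vfio-pci` is QEMU's device-assignment model, and the host PCI addresses and ids below are illustrative:

```sh
# Sketch: assigning a host PCIe device into a PCIe Root Port
# (the "modern" placement Marcel proposes).
qemu-system-x86_64 -M q35 \
    -device ioh3420,id=root_port1,bus=pcie.0,chassis=1,slot=1 \
    -device vfio-pci,host=02:00.0,bus=root_port1

# Fall-back for a legacy PCI host device: a PCI hierarchy built
# from a DMI-PCI bridge plus a PCI-PCI bridge.
#   -device i82801b11-bridge,id=dmi_pci_bridge,bus=pcie.0 \
#   -device pci-bridge,id=pci_bridge,bus=dmi_pci_bridge,chassis_nr=1 \
#   -device vfio-pci,host=03:00.0,bus=pci_bridge,addr=0x1
```

Which placement has the better track record is exactly what is under debate in this thread; the fragment only shows how each topology is expressed.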
* Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines 2016-09-06 18:32 ` Alex Williamson 2016-09-06 18:59 ` Marcel Apfelbaum @ 2016-09-07 7:44 ` Laszlo Ersek 1 sibling, 0 replies; 52+ messages in thread From: Laszlo Ersek @ 2016-09-07 7:44 UTC (permalink / raw) To: Alex Williamson; +Cc: Marcel Apfelbaum, qemu-devel, mst On 09/06/16 20:32, Alex Williamson wrote: > On Tue, 6 Sep 2016 21:14:11 +0300 > Marcel Apfelbaum <marcel@redhat.com> wrote: > >> On 09/06/2016 06:38 PM, Alex Williamson wrote: >>> On Thu, 1 Sep 2016 16:22:07 +0300 >>> Marcel Apfelbaum <marcel@redhat.com> wrote: >>>> +5. Device assignment >>>> +==================== >>>> +Host devices are mostly PCIe and should be plugged only into PCIe ports. >>>> +PCI-PCI bridge slots can be used for legacy PCI host devices. >>> >>> I don't think we have any evidence to suggest this as a best practice. >>> We have a lot of experience placing PCIe host devices into a >>> conventional PCI topology on 440FX. We don't have nearly as much >>> experience placing them into downstream PCIe ports. This seems like >>> how we would like for things to behave to look like real hardware >>> platforms, but it's just navel gazing whether it's actually the right >>> thing to do. Thanks, >>> >> >> I had to look up the "navel gazing"... >> While I do agree with your statements, I prefer a cleaner PCI Express machine >> with as little legacy PCI as possible. I use this document as an opportunity >> to start gaining experience with device assignment into PCI Express Root Ports >> and Downstream Ports and solve the issues along the way. > > That's exactly what I mean, there's an ulterior, personal motivation in > this suggestion that's not really backed by facts. You'd like to make > the recommendation to place PCIe assigned devices into PCIe slots, but > that's not necessarily the configuration with the best track record > right now.
In fact there's really no advantage to a user to do this > unless they have a device that needs PCIe (radeon and tg3 > potentially come to mind here). So while I agree with you from an > ideological standpoint, I don't think that's sufficient to make the > recommendation you're proposing here. Thanks, To reinforce what Marcel already replied, this document is all about ideology / policy, and not a status report. We should be looking forward, not backward. Permitting an exception for plugging a PCI Express device into a legacy PCI slot just because the PCI Express device is an assigned, physical one, dilutes the message, and will lead to all kinds of mess elsewhere. I'm acutely aware that conforming to the "PCI Express into PCI Express" recommendation might not *work* in practice, but that doesn't matter right now. This document should translate to a task list for QEMU and firmware developers alike. At least I need this document to exist primarily so I know what to do in OVMF, and what topologies in QE's BZs to reject out of hand. If the "PCI Express into PCI Express" guideline will require some VFIO work, and causes Q35 (not i440fx) users some pain, so be it, IMO. I'm saying this knowing that you know about ten billion times more about PCI / PCI Express than I do. Thanks Laszlo ^ permalink raw reply [flat|nested] 52+ messages in thread