[Qemu-devel] [PATCH V2 RESEND] docs: add PCIe devices placement guidelines

* [Qemu-devel] [PATCH V2 RESEND] docs: add PCIe devices placement guidelines
@ 2016-10-13 13:52 Marcel Apfelbaum
  2016-10-13 14:05 ` Marcel Apfelbaum
  0 siblings, 1 reply; 13+ messages in thread
From: Marcel Apfelbaum @ 2016-10-13 13:52 UTC (permalink / raw)
  To: qemu-devel

Proposes best practices on how to use PCI Express/PCI device
in PCI Express based machines and explain the reasoning behind them.

Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
---

Hi,

I am sending the doc  twice, it appears the first time didn't make it to qemu-devel list.

RFC->v2:
 - Addressed a lot of comments from the reviewers (many thanks to all, especially to Laszlo)

Since the RFC mail-thread was relatively long and already
has passed a lot of time from the RFC, I post this version
even if is very possible that I left some of the comments out,
my apologies if so.

I will go over the comments again, in the meantime please
feel free to comment on this version, even if on something
you've already pointed out.

It may take a day or two until I'll be able to respond, but I
will do my best to address all comments.

Thanks,
Marcel


 docs/pcie.txt | 273 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 273 insertions(+)
 create mode 100644 docs/pcie.txt

diff --git a/docs/pcie.txt b/docs/pcie.txt
new file mode 100644
index 0000000..7d852f1
--- /dev/null
+++ b/docs/pcie.txt
@@ -0,0 +1,273 @@
+PCI EXPRESS GUIDELINES
+======================
+
+1. Introduction
+================
+The doc proposes best practices on how to use PCI Express/PCI device
+in PCI Express based machines and explains the reasoning behind them.
+
+
+2. Device placement strategy
+============================
+QEMU does not have a clear socket-device matching mechanism
+and allows any PCI/PCI Express device to be plugged into any PCI/PCI Express slot.
+Plugging a PCI device into a PCI Express slot might not always work and
+is weird anyway since it cannot be done for "bare metal".
+Plugging a PCI Express device into a PCI slot will hide the Extended
+Configuration Space thus is also not recommended.
+
+The recommendation is to separate the PCI Express and PCI hierarchies.
+PCI Express devices should be plugged only into PCI Express Root Ports and
+PCI Express Downstream ports.
+
+2.1 Root Bus (pcie.0)
+=====================
+Place only the following kinds of devices directly on the Root Complex:
+    (1) Devices with dedicated, specific functionality (network card,
+        graphics card, IDE controller, etc); place only legacy PCI devices on
+        the Root Complex. These will be considered Integrated Endpoints.
+        Note: Integrated devices are not hot-pluggable.
+
+        Although the PCI Express spec does not forbid PCI Express devices as
+        Integrated Endpoints, existing hardware mostly integrates legacy PCI
+        devices with the Root Complex. Guest OSes are suspected to behave
+        strangely when PCI Express devices are integrated with the Root Complex.
+
+    (2) PCI Express Root Ports (ioh3420), for starting exclusively PCI Express
+        hierarchies.
+
+    (3) DMI-PCI bridges (i82801b11-bridge), for starting legacy PCI hierarchies.
+
+    (4) Extra Root Complexes (pxb-pcie), if multiple PCIe Root Buses are needed.
+
+   pcie.0 bus
+   -----------------------------------------------------------------------------
+        |                |                    |                  |
+   -----------   ------------------   ------------------   --------------
+   | PCI Dev |   | PCIe Root Port |   | DMI-PCI bridge |   |  pxb-pcie  |
+   -----------   ------------------   ------------------   --------------
+
+2.1.1 To plug a device into a pcie.0 as Root Complex Integrated Device use:
+          -device <dev>[,bus=pcie.0]
+2.1.2 To expose a new PCI Express Root Bus use:
+          -device pxb-pcie,id=pcie.1,bus_nr=x,[numa_node=y],[addr=z]
+      Only PCI Express Root Ports and DMI-PCI bridges can be connected to the pcie.1 bus:
+          -device ioh3420,id=root_port1[,bus=pcie.1][,chassis=x][,slot=y][,addr=z] \
+          -device i82801b11-bridge,id=dmi_pci_bridge1,bus=pcie.1
+
+
+2.2 PCI Express only hierarchy
+==============================
+Always use PCI Express Root Ports to start PCI Express hierarchies.
+
+A PCI Express Root bus supports up to 32 devices. Since each
+PCI Express Root Port is a function and a multi-function
+device may support up to 8 functions, the maximum possible
+PCI Express Root Ports per PCI Express Root Bus is 256.
+
+Prefer coupling PCI Express Root Ports into multi-function devices
+to keep a simple flat hierarchy that is enough for most scenarios.
+Only use PCI Express Switches (x3130-upstream, xio3130-downstream)
+if there is no more room for PCI Express Root Ports.
+Please see section 4. for further justifications.
+
+Plug only PCI Express devices into PCI Express Ports.
+
+
+   pcie.0 bus
+   ----------------------------------------------------------------------------------
+        |                 |                                    |
+   -------------    -------------                        -------------
+   | Root Port |    | Root Port |                        | Root Port |
+   ------------     -------------                        -------------
+         |                            -------------------------|------------------------
+    ------------                      |                 -----------------              |
+    | PCIe Dev |                      |    PCI Express  | Upstream Port |              |
+    ------------                      |      Switch     -----------------              |
+                                      |                  |            |                |
+                                      |    -------------------    -------------------  |
+                                      |    | Downstream Port |    | Downstream Port |  |
+                                      |    -------------------    -------------------  |
+                                      -------------|-----------------------|------------
+                                             ------------
+                                             | PCIe Dev |
+                                             ------------
+
+2.2.1 Plugging a PCI Express device into a PCI Express Root Port:
+          -device ioh3420,id=root_port1,chassis=x[,bus=pcie.0][,slot=y][,addr=z]  \
+          -device <dev>,bus=root_port1
+      Note that chassis parameter is compulsory, and must be unique
+      for each PCI Express Root Port.
+2.2.2 Using multi-function PCI Express Root Ports:
+      -device ioh3420,id=root_port1,multifunction=on,chassis=x[,bus=pcie.0][,slot=y][,addr=z.0] \
+      -device ioh3420,id=root_port2,,chassis=x1[,bus=pcie.0][,slot=y1][,addr=z.1] \
+      -device ioh3420,id=root_port3,,chassis=x2[,bus=pcie.0][,slot=y2][,addr=z.2] \
+2.2.2 Plugging a PCI Express device into a Switch:
+      -device ioh3420,id=root_port1,chassis=x[,bus=pcie.0][,slot=y][,addr=z]  \
+      -device x3130-upstream,id=upstream_port1,bus=root_port1[,addr=x]          \
+      -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=x1[,slot=y1][,addr=z1]] \
+      -device <dev>,bus=downstream_port1
+
+
+2.3 PCI only hierarchy
+======================
+Legacy PCI devices can be plugged into pcie.0 as Integrated Devices.
+Besides that use DMI-PCI bridges (i82801b11-bridge) to start PCI hierarchies.
+
+Prefer flat hierarchies. For most scenarios a single DMI-PCI bridge (having 32 slots)
+and several PCI-PCI bridges attached to it (each supporting also 32 slots) will support
+hundreds of legacy devices. The recommendation is to populate one PCI-PCI bridge
+under the DMI-PCI bridge until is full and then plug a new PCI-PCI bridge...
+
+   pcie.0 bus
+   ----------------------------------------------
+        |                            |
+   -----------               ------------------
+   | PCI Dev |               | DMI-PCI BRIDGE |
+   ----------                ------------------
+                               |            |
+                        -----------    ------------------
+                        | PCI Dev |    | PCI-PCI Bridge |
+                        -----------    ------------------
+                                         |           |
+                                  -----------     -----------
+                                  | PCI Dev |     | PCI Dev |
+                                  -----------     -----------
+
+2.3.1 To plug a PCI device into a pcie.0 as Integrated Device use:
+      -device <dev>[,bus=pcie.0]
+2.3.2 Plugging a PCI device into a DMI-PCI bridge:
+      -device i82801b11-bridge,id=dmi_pci_bridge1,[,bus=pcie.0]    \
+      -device <dev>,bus=dmi_pci_bridge1[,addr=x]
+2.3.3 Plugging a PCI device into a PCI-PCI bridge:
+      -device i82801b11-bridge,id=dmi_pci_bridge1,[,bus=pcie.0]                        \
+      -device pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge1[,chassis_nr=x][,addr=y]    \
+      -device <dev>,bus=pci_bridge1[,addr=x]
+
+
+3. IO space issues
+===================
+The PCI Express Root Ports and PCI Express Downstream ports are seen by
+Firmware/Guest OS as PCI-PCI bridges and, as required by PCI spec,
+should reserve a 4K IO range for each even if only one (multifunction)
+device can be plugged into them, resulting in poor IO space utilization.
+
+The firmware used by QEMU (SeaBIOS/OVMF) may try further optimizations
+by not allocating IO space if possible:
+    (1) - For empty PCI Express Root Ports/PCI Express Downstream ports.
+    (2) - If the device behind the PCI Express Root Ports/PCI Express
+          Downstream has no IO BARs.
+
+The IO space is very limited, 65536 byte-wide IO ports, but it's fragmented
+resulting in ~10 PCI Express Root Ports (or PCI Express Downstream/Upstream ports) 
+ports per system if devices with IO BARs are used in the PCI Express hierarchy.
+
+Using the proposed device placing strategy solves this issue
+by using only PCI Express devices within PCI Express hierarchy.
+
+The PCI Express spec requires the PCI Express devices to work without using IO.
+The PCI hierarchy has no such limitations.
+
+
+4. Bus numbers issues
+======================
+Each PCI domain can have up to only 256 buses and the QEMU PCI Express
+machines do not support multiple PCI domains even if extra Root
+Complexes (pxb-pcie) are used.
+
+Each element of the PCI Express hierarchy (Root Complexes,
+PCI Express Root Ports, PCI Express Downstream/Upstream ports)
+takes up bus numbers. Since only one (multifunction) device
+can be attached to a PCI Express Root Port or PCI Express Downstream
+Port it is advised to plan in advance for the expected number of
+devices to prevent bus numbers starvation.
+
+
+5. Hot Plug
+============
+The PCI Express root buses (pcie.0 and the buses exposed by pxb-pcie devices)
+do not support hot-plug, so any devices plugged into Root Complexes
+cannot be hot-plugged/hot-unplugged:
+    (1) PCI Express Integrated Devices
+    (2) PCI Express Root Ports
+    (3) DMI-PCI bridges
+    (4) pxb-pcie
+
+PCI devices can be hot-plugged into PCI-PCI bridges, however cannot
+be hot-plugged into DMI-PCI bridges.
+The PCI hotplug is ACPI based and can work side by side with the
+PCI Express native hotplug.
+
+PCI Express devices can be natively hot-plugged/hot-unplugged into/from
+PCI Express Root Ports (and PCI Express Downstream Ports).
+
+5.1 Planning for hotplug:
+    (1) PCI hierarchy
+        Leave enough PCI-PCI bridge slots empty or add one
+        or more empty PCI-PCI bridges to the DMI-PCI bridge.
+
+        For each such bridge the Guest Firmware is expected to reserve 4K IO
+        space and 2M MMIO range to be used for all devices behind it.
+
+        Because of the hard IO limit of around 10 PCI bridges (~ 40K space) per system
+        don't use more than 9 bridges, leaving 4K for the Integrated devices
+        and none for the PCI Express Hierarchy.
+
+    (2) PCI Express hierarchy:
+        Leave enough PCI Express Root Ports empty. Use multifunction
+        PCI Express Root Ports to prevent going out of PCI bus numbers.
+        Don't use PCI Express Switches if you don't have too, they use
+        an extra PCI bus that may handy to plug another device id it comes to it.
+
+5.3 Hot plug example:
+Using HMP: (add -monitor stdio to QEMU command line)
+  device_add <dev>,id=<id>,bus=<pcie.0/PCI Express Root Port Id/PCI-PCI bridge Id/pxb-pcie Id>
+
+
+6. Device assignment
+====================
+Host devices are mostly PCI Express and should be plugged only into
+PCI Express Root Ports or PCI Express Downstream Ports.
+PCI-PCI bridge slots can be used for legacy PCI host devices.
+
+6.1 How to detect if a device is PCI Express:
+  > lspci -s 03:00.0 -v (as root)
+
+    03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
+    Subsystem: Intel Corporation Dual Band Wireless-AC 7260
+    Flags: bus master, fast devsel, latency 0, IRQ 50
+    Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
+    Capabilities: [c8] Power Management version 3
+    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
+    Capabilities: [40] Express Endpoint, MSI 00
+
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    Capabilities: [100] Advanced Error Reporting
+    Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
+    Capabilities: [14c] Latency Tolerance Reporting
+    Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014 
+
+
+7. Virtio devices
+=================
+Virtio devices plugged into the PCI hierarchy or as Integrated Devices
+will remain PCI and have transitional behaviour as default.
+Transitional virtio devices work in both IO and MMIO modes depending on
+the guest support.
+
+Virtio devices plugged into PCI Express ports are PCI Express devices and
+have "1.0" behavior by default without IO support.
+In both case disable-* properties can be used to override the behaviour.
+
+Note that setting disable-legacy=off will enable legacy mode (enabling
+legacy behavior) for PCI Express virtio devices causing them to
+require IO space, which, given our PCI Express hierarchy, may quickly
+lead to resource exhaustion, and is therefore strongly discouraged.
+
+
+8. Conclusion
+==============
+The proposal offers a usage model that is easy to understand and follow
+and in the same time overcomes the PCI Express architecture limitations.
+
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread