From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:33965)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <marcel@redhat.com>) id 1bhKt5-0007vh-7O
	for qemu-devel@nongnu.org; Tue, 06 Sep 2016 14:14:30 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <marcel@redhat.com>) id 1bhKt0-0002CQ-QW
	for qemu-devel@nongnu.org; Tue, 06 Sep 2016 14:14:22 -0400
Received: from mx1.redhat.com ([209.132.183.28]:47906)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <marcel@redhat.com>) id 1bhKt0-0002Bc-H0
	for qemu-devel@nongnu.org; Tue, 06 Sep 2016 14:14:18 -0400
Received: from int-mx14.intmail.prod.int.phx2.redhat.com
	(int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mx1.redhat.com (Postfix) with ESMTPS id 5E43815558
	for <qemu-devel@nongnu.org>; Tue,  6 Sep 2016 18:14:17 +0000 (UTC)
References: <1472736127-18137-1-git-send-email-marcel@redhat.com>
	<20160906093809.27c93074@t450s.home>
From: Marcel Apfelbaum <marcel@redhat.com>
Message-ID: <678a7c06-b49c-6e0a-104b-ca339f79246b@redhat.com>
Date: Tue, 6 Sep 2016 21:14:11 +0300
MIME-Version: 1.0
In-Reply-To: <20160906093809.27c93074@t450s.home>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement
 guidelines
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: qemu-devel@nongnu.org, lersek@redhat.com, mst@redhat.com

On 09/06/2016 06:38 PM, Alex Williamson wrote:
> On Thu,  1 Sep 2016 16:22:07 +0300
> Marcel Apfelbaum <marcel@redhat.com> wrote:
>
>> Proposes best practices on how to use PCIe/PCI device
>> in PCIe based machines and explain the reasoning behind them.
>>
>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>> ---
>>
>> Hi,
>>
>> Please add your comments on what to add/remove/edit to make this doc usable.
>>
>> Thanks,
>> Marcel
>>
>>  docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 145 insertions(+)
>>  create mode 100644 docs/pcie.txt
>>
>> diff --git a/docs/pcie.txt b/docs/pcie.txt
>> new file mode 100644
>> index 0000000..52a8830
>> --- /dev/null
>> +++ b/docs/pcie.txt
>> @@ -0,0 +1,145 @@
>> +PCI EXPRESS GUIDELINES
>> +======================
>> +
>> +1. Introduction
>> +================
>> +The doc proposes best practices on how to use PCIe/PCI device
>> +in PCIe based machines and explains the reasoning behind them.
>> +
>> +
>> +2. Device placement strategy
>> +============================
>> +QEMU does not have a clear socket-device matching mechanism
>> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
>> +Plugging a PCI device into a PCIe device might not always work and
>> +is weird anyway since it cannot be done for "bare metal".
>> +Plugging a PCIe device into a PCI slot will hide the Extended
>> +Configuration Space thus is also not recommended.
>> +
>> +The recommendation is to separate the PCIe and PCI hierarchies.
>> +PCIe devices should be plugged only into PCIe Root Ports and
>> +PCIe Downstream ports (let's call them PCIe ports).
>> +
>> +2.1 Root Bus (pcie.0)
>> +=====================
>> +Plug only legacy PCI devices as Root Complex Integrated Devices
>> +even if the PCIe spec does not forbid PCIe devices. The existing
>

Hi Alex,
Thanks for the review.


> Surely we can have PCIe device on the root complex??
>

Yes, we can; it is not forbidden. Even so, my understanding is that
the main use for Integrated Devices is for legacy devices
like sound cards or NICs that come with the motherboard.
Because of that, my concern is that we might be missing some support
for them in QEMU or even in the Linux kernel.

One example I got from Jason about an issue with Integrated Endpoints in the kernel:

commit d14053b3c714178525f22660e6aaf41263d00056
Author: David Woodhouse <David.Woodhouse@intel.com>
Date:   Thu Oct 15 09:28:06 2015 +0100

     iommu/vt-d: Fix ATSR handling for Root-Complex integrated endpoints

     The VT-d specification says that "Software must enable ATS on endpoint
     devices behind a Root Port only if the Root Port is reported as
     supporting ATS transactions."
   ....

One could say it was a bug and it is solved, so what's the problem?
But my point is, why do it in the first place?
We are the hardware "vendors" and we can decide not to add PCIe
devices as Integrated Devices.


>> +hardware uses mostly PCI devices as Integrated Endpoints. In this
>> +way we may avoid some strange Guest OS behaviour.
>> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports)
>> +or DMI-PCI bridges to start legacy PCI hierarchies.
>> +
>> +
>> +   pcie.0 bus
>> +   --------------------------------------------------------------------------
>> +        |                |                    |                   |
>> +   -----------   ------------------   ------------------  ------------------
>> +   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |  | DMI-PCI bridge |
>> +   -----------   ------------------   ------------------  ------------------
>
> Do you have a spec reference for plugging an upstream port directly
> into the root complex?  IMHO this is invalid, an upstream port can only
> be attached behind a downstream port, ie. a root port or downstream
> switch port.
>

Yes, it is a bug; both Laszlo and I spotted it, and the figure in 2.2 shows it correctly.
Thanks for finding it.

>> +
>> +2.2 PCIe only hierarchy
>> +=======================
>> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream
>> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches
>> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports.
>
> This seems to contradict 2.1,

Yes, please forgive the bug; it will not appear in v2.

> but I agree more with this statement to
> only start a PCIe sub-hierarchy with a root port, not an upstream port
> connected to the root complex.  The 2nd sentence is confusing, I don't
> know if you're referring to fan-out via PCIe switch downstream of a
> root port or again suggesting to use upstream switch ports directly on
> the root complex.
>

The PCIe hierarchy always starts with PCI Express Root Ports; the switch
is to be plugged into the PCI Express ports. I will try to re-phrase this to be
clearer.
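
To make the intended rules unambiguous, here is a minimal Python sketch of the
placement guidelines from the quoted sections 2.1-2.3, with the upstream-port
bug in 2.1 already corrected. The names ALLOWED_PARENTS and check_placement
are hypothetical and exist nowhere in QEMU; this is only a way of writing the
rules down, not an implementation proposal.

```python
# Hypothetical model of the placement guidelines discussed in this thread.
# Keys are device kinds; values are the parent kinds they may be plugged into.
ALLOWED_PARENTS = {
    "pci-dev":         {"pcie.0", "dmi-pci-bridge", "pci-pci-bridge"},
    "pcie-dev":        {"root-port", "downstream-port"},
    "root-port":       {"pcie.0"},
    "upstream-port":   {"root-port", "downstream-port"},
    "downstream-port": {"upstream-port"},
    "dmi-pci-bridge":  {"pcie.0"},
    "pci-pci-bridge":  {"dmi-pci-bridge", "pci-pci-bridge"},
}


def check_placement(kind, parent):
    """Return True if plugging `kind` into `parent` follows the guidelines."""
    return parent in ALLOWED_PARENTS.get(kind, set())


# An upstream port directly on the root bus is rejected,
# while the same port behind a root port is accepted.
print(check_placement("upstream-port", "pcie.0"))
print(check_placement("upstream-port", "root-port"))
```

Note that the table encodes the recommendation, not the PCIe spec: a PCIe
device on pcie.0 is rejected here even though the spec does not forbid it.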


>> +
>> +
>> +   pcie.0 bus
>> +   ----------------------------------------------------
>> +        |                |               |
>> +   -------------   -------------   -------------
>> +   | Root Port |   | Root Port |   | Root Port |
>> +   ------------   --------------   -------------
>> +         |                               |
>> +    ------------                 -----------------
>> +    | PCIe Dev |                 | Upstream Port |
>> +    ------------                 -----------------
>> +                                  |            |
>> +                     -------------------    -------------------
>> +                     | Downstream Port |    | Downstream Port |
>> +                     -------------------    -------------------
>> +                             |
>> +                         ------------
>> +                         | PCIe Dev |
>> +                         ------------
>> +
>> +2.3 PCI only hierarchy
>> +======================
>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>> +only into pcie.0 bus.
>> +
>> +   pcie.0 bus
>> +   ----------------------------------------------
>> +        |                            |
>> +   -----------               ------------------
>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>> +   -----------               ------------------
>> +                               |            |
>> +                        -----------    ------------------
>> +                        | PCI Dev |    | PCI-PCI Bridge |
>> +                        -----------    ------------------
>> +                                         |           |
>> +                                  -----------     -----------
>> +                                  | PCI Dev |     | PCI Dev |
>> +                                  -----------     -----------
>> +
>
> I really wish we had generic PCIe-to-PCI bridges rather than this DMI
> bridge thing...
>

Thank you, that's a very good idea and I intend to implement it.

>> +
>> +
>> +3. IO space issues
>> +===================
>> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and
>
> Yeah, I've lost the meaning of Ports here, this statement is true for
> upstream ports as well.
>

Laszlo asked me to enumerate all the controllers instead of "PCIe",
I am starting to see why...

>> +as required by PCI spec will reserve a 4K IO range for each.
>> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
>> +it by allocating the IO space only if there is at least one device
>> +with IO BARs plugged into the bridge.
>> +Behind a PCIe PORT only one device may be plugged, resulting in
>
> Here I think you're trying to specify root/downstream ports, but
> upstream ports have the same i/o port allocation problems and do not
> have this one device limitation.
>

I'll be more specific, sure.

>> +the allocation of a whole 4K range for each device.
>> +The IO space is limited resulting in ~10 PCIe ports per system
>> +if devices with IO BARs are plugged into IO ports.
>> +
>> +Using the proposed device placing strategy solves this issue
>> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires
>> +PCIe devices to work without IO BARs.
>> +The PCI hierarchy has no such limitations.
>
> Actually it does, but it's mostly not an issue since we have 32 slots
> available (minus QEMU/libvirt excluding 1 for no good reason)
> downstream of each bridge.
>

This is what I meant, I'll make it clearer.
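
The arithmetic behind the "~10 PCIe ports" figure can be sketched in a few
lines of Python. The 64K total is the x86 I/O port space and 4K is the I/O
window granularity of a PCI bridge. The amount already reserved for platform
devices and devices on the root bus (reserved_kib) is an assumption chosen
only to illustrate the order of magnitude; the real value depends on the
machine type and the firmware.

```python
IO_SPACE_KIB = 64      # x86 I/O port space: 64K in total
BRIDGE_WINDOW_KIB = 4  # minimum I/O window a PCI bridge can claim


def max_ports_with_io(reserved_kib):
    """Number of bridges/ports that can still get a 4K I/O window.

    reserved_kib is a hypothetical amount of I/O space already taken by
    legacy/platform devices and by devices sitting on the root bus.
    """
    free_kib = max(IO_SPACE_KIB - reserved_kib, 0)
    return free_kib // BRIDGE_WINDOW_KIB


print(max_ports_with_io(0))   # 16: upper bound with nothing reserved
print(max_ports_with_io(24))  # 10: illustrative reservation only
```

Ports that hold only devices without IO BARs do not consume a window at all,
which is why restricting PCIe ports to PCIe devices avoids the limit.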

>> +
>> +
>> +4. Hot Plug
>> +============
>> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,
>> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.
>> +
>> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
>> +in QEMU preventing it to work, but it would be solved soon).
>
> Probably want to give some sort of date/commit references to these
> current state of affairs facts, a reader is not likely to lookup the
> git commit for this verbiage and extrapolate it to a QEMU version.
>

I'll delete it (as Laszlo proposed), since this is not a "status" doc.
It will be taken care of eventually.

>> +The PCI hotplug is ACPI based and can work side by side with the PCIe
>> +native hotplug.
>> +
>> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
>> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.
>
> Why?  This seems like a QEMU bug.  Clearly we need the downstream ports
> in place when the upstream switch is hot-added, but this should be
> feasible.
>

I don't understand the question. I do think switches can be hot-plugged,
but I am not sure if QEMU allows it. If not, this is something we should solve.

>> +Keep in mind you always need to have at least one PCIe Port available
>> +for hotplug, the PCIe Ports themselves are not hot-pluggable.
>
> If a user cares about hotplug...
>

...he should reserve enough empty PCI Express Root Ports/PCI Express Downstream Ports.
Laszlo had some numbers and ideas on how this user can plan in advance for hotplug,
maybe we should bring them together :)

>> +
>> +
>> +5. Device assignment
>> +====================
>> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
>> +PCI-PCI bridge slots can be used for legacy PCI host devices.
>
> I don't think we have any evidence to suggest this as a best practice.
> We have a lot of experience placing PCIe host devices into a
> conventional PCI topology on 440FX.  We don't have nearly as much
> experience placing them into downstream PCIe ports.  This seems like
> how we would like for things to behave to look like real hardware
> platforms, but it's just navel gazing whether it's actually the right
> thing to do.  Thanks,
>

I had to look up "navel gazing"...
While I do agree with your statements, I prefer a cleaner PCI Express machine
with as little legacy PCI as possible. I use this document as an opportunity
to start gaining experience with device assignment into PCI Express Root Ports
and Downstream Ports and to solve the issues along the way.


Your review really helped, thanks!
Marcel

> Alex
>
>> +
>> +
>> +6. Virtio devices
>> +=================
>> +Virtio devices plugged into the PCI hierarchy or as Integrated Devices
>> +will remain PCI and have transitional behaviour as default.
>> +Virtio devices plugged into PCIe ports are Express devices and have
>> +"1.0" behavior by default without IO support.
>> +In both cases the disable-* properties can be used to override the behaviour.
>> +
>> +
>> +7. Conclusion
>> +==============
>> +The proposal offers a usage model that is easy to understand and follow
>> +and at the same time overcomes some PCIe limitations.
>> +
>> +
>> +
>