Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines

From: Marcel Apfelbaum <marcel@redhat.com>
To: Laszlo Ersek <lersek@redhat.com>, qemu-devel@nongnu.org
Cc: mst@redhat.com, Peter Maydell <peter.maydell@linaro.org>,
	Drew Jones <drjones@redhat.com>, Laine Stump <laine@redhat.com>,
	Andrea Bolognani <abologna@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
Date: Tue, 6 Sep 2016 17:46:42 +0300	[thread overview]
Message-ID: <339e185b-648e-1c4a-081b-5fd52ceeef21@redhat.com> (raw)
In-Reply-To: <739af9d6-2382-c4bd-679d-87b153124491@redhat.com>

On 09/06/2016 04:31 PM, Laszlo Ersek wrote:
> On 09/05/16 22:02, Marcel Apfelbaum wrote:
>> On 09/05/2016 07:24 PM, Laszlo Ersek wrote:
>>> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>>>> Proposes best practices on how to use PCIe/PCI device
>>>> in PCIe based machines and explain the reasoning behind them.
>>>>
>>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>>>> ---
>>>>
>>>> Hi,
>>>>
>>>> Please add your comments on what to add/remove/edit to make this doc
>>>> usable.
>>>
>>

[...]

>>
>>> (But, certainly no IO reservation for PCI Express root port, upstream
>>> port, or downstream port! And i'll need your help for telling these
>>> apart in OVMF.)
>>>
>>
>> Just let me know how can I help.
>
> Well, in the EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
> implementation, I'll have to look at the PCI config space of the
> "bridge-like" PCI device that the generic PCI Bus driver of edk2 passes
> back to me, asking me about resource reservation.
>
> Based on the config space, I should be able to tell apart "PCI-PCI
> bridge" from "PCI Express downstream or root port". So what I'd need
> here is a semi-formal natural language description of these conditions.

You can use PCI Express Spec: 7.8.2. PCI Express Capabilities Register (Offset 02h)

Bit 7:4 Register Description:
Device/Port Type – Indicates the specific type of this PCI
Express Function. Note that different Functions in a multi-
Function device can generally be of different types.
Defined encodings are:
0000b PCI Express Endpoint
0001b Legacy PCI Express Endpoint
0100b Root Port of PCI Express Root Complex*
0101b Upstream Port of PCI Express Switch*
0110b Downstream Port of PCI Express Switch*
0111b PCI Express to PCI/PCI-X Bridge*
1000b PCI/PCI-X to PCI Express Bridge*
1001b Root Complex Integrated Endpoint

>
> Hmm, actually I think I've already written code, for another patch, that
> identifies the latter category. So everything where that check doesn't
> fire can be deemed "PCI-PCI bridge". (This hook gets called only for
> bridges.)
>
> Yet another alternative: if we go for the special PCI capability, for
> exposing reservation sizes from QEMU to the firmware, then I can simply
> search the capability list for just that capability. I think that could
> be the easiest for me.
>

That would be a "later" step.
BTW, following a offline chat with Michael S. Tsirkin
regarding virto 1.0 requiring 8M MMIO by default we arrived to a conclusion that
is not really needed and we came up with an alternative that will require less then 2M
MMIO space.
I put this here because the above solution will give us some time to deal with
the MMIO ranges reservation.

[...]

>>>> +
>>>> +
>>>> +4. Hot Plug
>>>> +============
>>>> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,
>>>
>>> s/root bus/root complex/? Also, any root complexes added with pxb-pcie
>>> don't support hotplug.
>>>
>>
>> Actually pxb-pcie should support PCI Express Native Hotplug.
>
> Huh, interesting.
>
>> If they don't is a bug and I'll take care of it.
>
> Hmm, a bit lower down you mention that PCI Express native hot plug is
> based on SHPCs. So, when you say that pxb-pcie should support PCI
> Express Native Hotplug, you mean that it should occur through SHPC, right?
>

Yes, but I was talking about the Integrated SHPCs of the PCI Express
Root Ports and PCI Express Downstream Ports. (devices plugged into them)

> However, for pxb-pci*, we had to disable SHPC: see QEMU commit
> d10dda2d60c8 ("hw/pci-bridge: disable SHPC in PXB"), in June 2015.
>

This is only for the pxb device (not pxb-pcie) and only for the internal pci-bridge that comes with it.
And... we don't use SHPC based hot-plug for PCI, only for PCI Express.
For PCI we are using only the ACPI hotplug. So disabling it is not so bad.

The pxb-pcie does not have the internal PCI bridge. You don't need it because:
1. You can't have Integrated Devices for pxb-pcie
2. The PCI Express Upstream Port is a type of PCI-Bridge anyway.

> For background, the series around it was
> <https://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05136.html>
> -- I think v7 was the last version.
>
> ... Actually, now I wonder if d10dda2d60c8 should be possible to revert
> at this point! Namely, in OVMF I may have unwittingly fixed this issue
> -- obviously much later than the QEMU commit: in March 2016. See
>
> https://github.com/tianocore/edk2/commit/8f35eb92c419
>
> If you look at the commit message of the QEMU patch, it says
>
>     [...]
>
>     Unfortunately, when this happens, the PCI_COMMAND_MEMORY bit is
>     clear in the root bus's command register [...]
>
> which I think should no longer be true, thanks to edk2 commit 8f35eb92c419.
>
> So maybe we should re-evaluate QEMU commit d10dda2d60c8. If pxb-pci and
> pxb-pcie work with current OVMF, due to edk2 commit 8f35eb92c419, then
> maybe we should revert QEMU commit d10dda2d60c8.
>
> Not urgent for me :), obviously, I'm just explaining so you can make a
> note for later, if you wish to (if hot-plugging directly into pxb-pcie
> should be necessary -- I think it's very low priority).
>

As stated above, since we don't use it anyway it doesn't matter.

[...]

>> Nope, you cannot hotplug a PCI Express Root Port or a PCI Express
>> Downstream Port.
>> The reason: The PCI Express Native Hotplug is based on SHPCs (Standard
>> HotPlug Controllers)
>> which are integrated only in the mentioned ports and not in Upstream
>> Ports or the Root Complex.
>> The "other" reason: When you buy a switch/server it has a number of
>> ports and that's it.
>> You cannot add "later".
>
> Makes sense, thank you. I think if you add the HMP example, it will make
> it clear. I only assumed that you needed several monitor commands for
> hotplugging a single switch (i.e., one command per one port) because on
> the QEMU command line you do need a separate -device option for the
> upstream port, and every single downstream port, of the same switch.
>
> If, using the monitor, it's just one device_add for the upstream port,
> and the downstream ports are added automatically, then I guess it'll be
> easy to understand.
>

No it doesn't work like that, you would need to add them one by one (upstream ports and then downstream ports)
as far as I understand it.
Actually I've never done it before, I'll try it first and update the doc on
how it should be done. (if it can be done...)

>>
>>>
>>> But, this question is actually irrelevant IMO, because here I would add
>>> another subsection about *planning* for hot-plug. (I think that's pretty
>>> important.) And those plans should make the hotplugging of switches
>>> unnecessary!
>>>
>>
>> I'll add a subsection for it. But when you are out of options you *can*
>> hotplug a switch if your sysadmin skills are limited...
>
> You probably can, but then we'll run into the resource allocation
> problem again:
>
> (1) The user will hotplug a switch (= S1) under a root port with, say,
> two downstream ports (= S1-DP1, S1-DP2).
>
> (2) They'll then plug a PCI Express device into one of those downstream
> ports (S1-DP1-Dev1).
>
> (3) Then they'll want to hot-plug *another* switch into the *other*
> downstream port (S1-DP2-S2).
>
>
>                          DP1 -- Dev1 (2)
>                         /
>      root port -- S1 (1)
>                         \
>                          DP2 -- S2 (3)
>
> However, concerning the resource needs of S2 (and especially the devices
> hot-plugged under S2!), S1 won't have enough left over, because Dev1
> (under DP1) will have eaten into them, and Dev1's BARs will have been
> programmed!
>

Theoretically the Guest OS should trigger PCI resources re-allocation
but I agree we should not count on them.

> We could never credibly explain our way out of this situation in a bug
> report. For that reason, I think we should discourage hotplug ideas that
> would change the topology, and require recursive resource allocation at
> higher levels and/or parallel branches of the topology.
>
> I know Linux can do that, and it even succeeds if there is enough room,
> but from the messages seen in the guest dmesg when it fails, how do you
> explain to the user that they should have plugged in S2 first, and Dev1
> second?
>
> So, we should recommend *not* to hotplug switches or PCI-PCI bridges.
> Instead,
> - keep a very flat hierarchy from the start;
> - for PCI Express, add as many root ports and downstream ports as you
> deem enough for future hotplug needs (keeping the flat formula I described);
> - for legacy PCI, add as many sibling PCI-PCI bridges directly under the
> one DMI-PCI bridge as you deem sufficient for future hotplug needs.
>
> In short, don't change the hierarchy at runtime by hotplugging internal
> nodes; hotplug *leaf nodes* only.
>

Agreed. I'll re-use some of your comments in the doc.

>>

[...]

>>
>> Gerd explicitly asked for the second idea (vendor specific capability)
>
> Nice, thank you for confirming it; let's do this then. It will also
> simplify my work in the
> EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() function: it should
> suffice to scan the config space of the bridge, regardless of the
> "PCI-PCI bridge / PCI Express root or downstream port" distinction.
>

Will do, but since we have a quick way to deal with the current issue
(virtio 1.0 requiring 8MB MMIO while firmware reserving 2MB for PCI-Bridges hotplug)

[...]

Thanks,
Marcel

>
> Cheers!
> Laszlo
>