From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:42878)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1daP5R-0004XG-E4 for qemu-devel@nongnu.org;
	Wed, 26 Jul 2017 12:23:03 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1daP5O-00030F-12 for qemu-devel@nongnu.org;
	Wed, 26 Jul 2017 12:23:01 -0400
Received: from mx1.redhat.com ([209.132.183.28]:59598) by eggs.gnu.org
	with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71)
	(envelope-from ) id 1daP5N-0002zN-O9 for qemu-devel@nongnu.org;
	Wed, 26 Jul 2017 12:22:57 -0400
References: <1500761510-1556-1-git-send-email-zuban32s@gmail.com>
From: Marcel Apfelbaum
Message-ID:
Date: Wed, 26 Jul 2017 19:22:42 +0300
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH v2 0/4] Allow RedHat PCI bridges reserve
 more buses than necessary during init
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Laszlo Ersek , Aleksandr Bezzubikov , kraxel@redhat.com
Cc: seabios@seabios.org, kevin@koconnor.net, qemu-devel@nongnu.org,
 konrad.wilk@oracle.com, mst@redhat.com

On 26/07/2017 18:20, Laszlo Ersek wrote:
> On 07/26/17 08:48, Marcel Apfelbaum wrote:
>> On 25/07/2017 18:46, Laszlo Ersek wrote:
>
> [snip]
>
>>> (2) Bus range reservation, and hotplugging bridges. What's the
>>> motivation? Our recommendations in "docs/pcie.txt" suggest flat
>>> hierarchies.
>>>
>>
>> It remains flat. You have one single PCIe-PCI bridge plugged
>> into a PCIe Root Port, no deep nesting.
>>
>> The reason is to be able to support legacy PCI devices without
>> "committing" to a DMI-PCI bridge in advance (keeping Q35 without
>> legacy hw).
>>
>> The only way to support PCI devices in Q35 is to have them cold-plugged
>> into the pcie.0 bus, which is good, but not enough for expanding the
>> Q35 usability in order to make it eventually the default
>> QEMU x86 machine (I know this is another discussion and I am in the
>> minority, at least for now).
>>
>> The plan is:
>> Start the Q35 machine as usual, but one of the PCIe Root Ports includes
>> hints for firmware needed to support legacy PCI devices (IO ports
>> range, extra bus, ...).
>>
>> Once a PCI device is needed you have 2 options:
>> 1. Cold-plug a PCIe-PCI bridge into a PCIe Root Port, with the PCI
>>    device in the bridge.
>> 2. Hotplug a PCIe-PCI bridge into a PCIe Root Port and then hotplug
>>    a PCI device into the bridge.

Hi Laszlo,

> Thank you for the explanation, it makes the intent a lot clearer.
>
> However, what does the hot-pluggability of the PCIe-PCI bridge buy us?
> In other words, what does it buy us when we do not add the PCIe-PCI
> bridge immediately at guest startup, as an integrated device?
>
> Why is it a problem to "commit" in advance? I understand that we might
> not like the DMI-PCI bridge (due to it being legacy), but what speaks
> against cold-plugging the PCIe-PCI bridge either as an integrated device
> in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI
> bridge in a similarly cold-plugged PCIe root port?
>

We want to keep Q35 clean; for most cases we don't want any legacy
PCI stuff unless it is especially required.

> I mean, in the cold-plugged case, you use up two bus numbers at the
> most, one for the root port, and another for the PCIe-PCI bridge. In the
> hot-plugged case, you have to start with the cold-plugged root port just
> the same (so that you can communicate the bus number reservation *at
> all*), and then reserve (= use up in advance) the bus number, the IO
> space, and the MMIO space(s).
> I don't see the difference; hot-plugging
> the PCIe-PCI bridge (= not committing in advance) doesn't seem to save
> any resources.
>

It's not about resources, it's more about the usage model.

> I guess I would see a difference if we reserved more than one bus number
> in the hotplug case, namely in order to support recursive hotplug under
> the PCIe-PCI bridge. But, you confirmed that we intend to keep the flat
> hierarchy (i.e. the exercise is only for enabling legacy PCI endpoints,
> not for recursive hotplug). The PCIe-PCI bridge isn't a device that
> does anything at all on its own, so why not just coldplug it? Its
> resources have to be reserved in advance anyway.
>

Even if we prefer flat hierarchies, we should allow a sane nested-bridge
configuration, so we will sometimes reserve more than one bus.

> So, thus far I would say "just cold-plug the PCIe-PCI bridge at startup,
> possibly even make it an integrated device, and then you don't need to
> reserve bus numbers (and other apertures)".
>
> Where am I wrong?
>

Nothing wrong, I am just looking for feature parity between Q35 and PC.
Users may want to continue using [nested] PCI bridges, and we want the
Q35 machine to be used by more users in order to make it reliable
faster, while keeping it clean by default.

We had a discussion on this matter at last year's KVM Forum, and the
hot-pluggable PCIe-PCI bridge was the general consensus.
As a bonus we get the PCI firmware hints capability.

> [snip]
>
>>> (4) Whether the reservation size should be absolute or relative (raised
>>> by Gerd). IIUC, Gerd suggests that the absolute aperture size should be
>>> specified (as a minimum), not the increment / reservation for hotplug
>>> purposes.
>>>
>>> The Platform Initialization Specification, v1.6, downloadable at
>>> , writes the following under
>>>
>>>   EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
>>>
>>> in whose implementation I will have to parse the values from the
>>> capability structure, and return the appropriate representation to the
>>> platform-independent PciBusDxe driver (i.e., the enumeration /
>>> allocation agent):
>>>
>>>> The padding is returned in the form of ACPI (2.0 & 3.0) resource
>>>> descriptors. The exact definition of each of the fields is the same
>>>> as in the
>>>> EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL.SubmitResources()
>>>> function. See the section 10.8 for the definition of this function.
>>>>
>>>> The PCI bus driver is responsible for adding this resource request to
>>>> the resource requests by the physical PCI devices. If Attributes is
>>>> EfiPaddingPciBus, the padding takes effect at the PCI bus level. If
>>>> Attributes is EfiPaddingPciRootBridge, the required padding takes
>>>> effect at the root bridge level. For details, see the definition of
>>>> EFI_HPC_PADDING_ATTRIBUTES in "Related Definitions" below.
>>>
>>> Emphasis on "*adding* this resource request to the resource requests
>>> by the physical PCI devices".
>>>
>>> However... After checking some OVMF logs, it seems that in practice,
>>> PciBusDxe does exactly what Gerd suggests.
>>>
>>> Currently OVMF returns a constant 2MB MMIO padding and a constant 512B
>>> IO padding for all hotplug-capable bridges (including PCI Express root
>>> ports), from GetResourcePadding().
>>> For example, for the following QEMU
>>> command line fragment:
>>>
>>>   -device pxb-pcie,bus_nr=64,id=rootbr1,numa_node=1,bus=pcie.0,addr=0x3 \
>>>   \
>>>   -device ioh3420,id=rootbr1-port1,bus=rootbr1,addr=0x1,slot=3 \
>>>   -device e1000e,bus=rootbr1-port1,netdev=net0 \
>>>   \
>>>   -device ioh3420,id=rootbr1-port2,bus=rootbr1,addr=0x2,slot=4 \
>>>   -device e1000e,bus=rootbr1-port2,netdev=net1 \
>>>
>>> PciBusDxe produces the following log (extract):
>>>
>>>> PciBus: Resource Map for Root Bridge PciRoot(0x40)
>>>> Type =  Io16; Base = 0x8000;     Length = 0x2000;   Alignment = 0xFFF
>>>>    Base = 0x8000;     Length = 0x1000;   Alignment = 0xFFF;     Owner = PPB [40|02|00:**]
>>>>    Base = 0x9000;     Length = 0x1000;   Alignment = 0xFFF;     Owner = PPB [40|01|00:**]
>>>> Type = Mem32; Base = 0x98600000; Length = 0x400000; Alignment = 0x1FFFFF
>>>>    Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF;  Owner = PPB [40|02|00:**]
>>>>    Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF;  Owner = PPB [40|01|00:**]
>>>>
>>>> PciBus: Resource Map for Bridge [40|01|00]
>>>> Type =  Io16; Base = 0x9000;     Length = 0x1000;   Alignment = 0xFFF
>>>>    Base = Padding;    Length = 0x200;    Alignment = 0x1FF
>>>>    Base = 0x9000;     Length = 0x20;     Alignment = 0x1F;      Owner = PCI [41|00|00:18]
>>>> Type = Mem32; Base = 0x98800000; Length = 0x200000; Alignment = 0x1FFFFF
>>>>    Base = Padding;    Length = 0x200000; Alignment = 0x1FFFFF
>>>>    Base = 0x98800000; Length = 0x20000;  Alignment = 0x1FFFF;   Owner = PCI [41|00|00:14]
>>>>    Base = 0x98820000; Length = 0x20000;  Alignment = 0x1FFFF;   Owner = PCI [41|00|00:10]
>>>>    Base = 0x98840000; Length = 0x4000;   Alignment = 0x3FFF;    Owner = PCI [41|00|00:1C]
>>>>
>>>> PciBus: Resource Map for Bridge [40|02|00]
>>>> Type =  Io16; Base = 0x8000;     Length = 0x1000;   Alignment = 0xFFF
>>>>    Base = Padding;    Length = 0x200;    Alignment = 0x1FF
>>>>    Base = 0x8000;     Length = 0x20;     Alignment = 0x1F;      Owner = PCI [42|00|00:18]
>>>> Type = Mem32; Base = 0x98600000; Length = 0x200000; Alignment = 0x1FFFFF
>>>>    Base = Padding;    Length = 0x200000; Alignment = 0x1FFFFF
>>>>    Base = 0x98600000; Length = 0x20000;  Alignment = 0x1FFFF;   Owner = PCI [42|00|00:14]
>>>>    Base = 0x98620000; Length = 0x20000;  Alignment = 0x1FFFF;   Owner = PCI [42|00|00:10]
>>>>    Base = 0x98640000; Length = 0x4000;   Alignment = 0x3FFF;    Owner = PCI [42|00|00:1C]
>>>
>>> This means that
>>> - both ioh3420 root ports get a padding of 512B for IO and 2MB of MMIO,
>>
>> So the 512B IO will be reserved even if IO handling is not enabled
>> in the bridge? (I am asking because I might have a similar issue with
>> other guest code.)
>
> I apologize for being unclear about the "distribution of work" between
> edk2's generic PciBusDxe driver and OVMF's platform-specific
> PciHotPlugInitDxe driver (which provides the
> EFI_PCI_HOT_PLUG_INIT_PROTOCOL, of which GetResourcePadding() is a
> member function).
>
> Basically, PciBusDxe is a platform-independent, universal driver that
> calls into "hooks", optionally provided by the platform, at specific
> points in the enumeration / allocation.
>
> EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() is such a
> platform-specific hook, and OVMF is "reasonably free" to provide
> PciBusDxe with reservation hints from it, as OVMF sees fit, for any
> specific hotplug controller. (Of course OVMF's freedom is limited by
> two factors: first by the information that QEMU exposes to firmware,
> and second by the "imperative" that in GetResourcePadding() we really
> shouldn't look at any *other* PCI(e) controllers than the exact hotplug
> controller that PciBusDxe is asking about.)
>

So far it sounds good :)

> With that in mind:
>
> *Currently*, OVMF advises PciBusDxe to "reserve 512B of IO space" for
> *any* hotplug controller. (From the above log, we can see that on the
> root bridge, this 512B is then rounded up to 4KB.)

OK

> *In the future*, OVMF should replace the constant 512B with the
> following logic: consult the IO base/limit registers of the subject
> hotplug controller, and if they are zero, then return "reserve no IO
> space for this hotplug controller" to PciBusDxe. If the base/limit
> registers are nonzero, OVMF would say "reserve 4KB IO space", or even
> propagate the value from the capability.
>

Exactly.

> [snip]
>
>>> The PI spec says,
>>>
>>>> [...] For all the root HPCs and the nonroot HPCs, call
>>>> EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() to obtain the
>>>> amount of overallocation and add that amount to the requests from the
>>>> physical devices. Reprogram the bus numbers by taking into account
>>>> the bus resource padding information. [...]
>>>
>>> However, according to my interpretation of the source code, PciBusDxe
>>> does not consider bus number padding for non-root HPCs (which are
>>> "all" HPCs on QEMU).
>>>
>>
>> Theoretically speaking, it is possible to change the behavior, right?
>
> Not just theoretically; in the past I have changed PciBusDxe -- it
> wouldn't identify QEMU's hotplug controllers (root port, downstream
> port, etc.) appropriately, and I managed to get some patches in. It's
> just that the less we understand the current code and the more
> intrusive/extensive the change is, the harder it is to sell the *idea*.
> PciBusDxe is platform-independent and shipped on many a physical
> system too.
>

Understood, but from your explanation it sounds like the existing
callback sites (hooks) are enough.

>>> So, I agree that we can add a bus number range field to the
>>> capability, but I'm fairly sure it won't work with edk2's PciBusDxe
>>> without major surgery. (Given that we haven't ever considered
>>> hot-plugging entire bridges until now, i.e., anything that would
>>> require bus number reservation, I think we can live with such a
>>> limitation when using OVMF, for an indefinite time to come.)
>>>
>>
>> Since OVMF will still support cold-plug, I think it's OK for now.
>> Once the feature gets into SeaBIOS I will open a BZ for tracking.
>
> Please do that (at that time); it will certainly need a dedicated
> discussion on edk2-devel.
>

Sure, I'll start a discussion on edk2-devel as soon as we finish the
feature on QEMU/SeaBIOS.

> (Also, your statement about cold-plug being viable for at least a while
> is consistent with my questions / implications near the top: what does
> the hot-pluggability of the PCIe-PCI bridge buy us ultimately?)
>

Please see above.

Thanks,
Marcel

> Thanks!
> Laszlo
>