From: Laszlo Ersek <lersek@redhat.com>
To: Marcel Apfelbaum <marcel@redhat.com>, qemu-devel@nongnu.org
Cc: mst@redhat.com, pbonzini@redhat.com, ehabkost@redhat.com,
	Igor Mammedov <imammedo@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [Qemu-devel] [PATCH] hw/pci-host: Fix x86 Host Bridges 64bit PCI hole
Date: Thu, 19 Oct 2017 13:52:14 +0200	[thread overview]
Message-ID: <6e21e32b-ce06-a6ff-8603-d864af129d7a@redhat.com> (raw)
In-Reply-To: <7b08ea3e-d380-ac0c-a716-d0ac703feaea@redhat.com>

On 10/19/17 13:29, Marcel Apfelbaum wrote:
> Hi Laszlo,
> 
> On 19/10/2017 13:41, Laszlo Ersek wrote:
>> On 10/19/17 11:30, Marcel Apfelbaum wrote:
>>>
>>> Hi Laszlo,
>>>
>>> On 18/10/2017 14:40, Laszlo Ersek wrote:
>>>> Hi Marcel,
>>>>
>>>> On 10/18/17 11:58, Marcel Apfelbaum wrote:
>>>>> Currently there is no MMIO range over 4G
>>>>> reserved for PCI hotplug. Since the 32bit PCI hole
>>>>> depends on the number of cold-plugged PCI devices
>>>>> and other factors, it is very possible that it is too
>>>>> small to hotplug PCI devices with large BARs.
>>>>>
>>>>> Fix it by reserving all the address space after
>>>>> RAM (and the reserved space for RAM hotplug),
>>>>> but no more than 40 bits. The latter seems to be
>>>>> QEMU's magic number for the guest CPU's physical
>>>>> addressable bits and is safe with respect to migration.
>>>>>
>>>>> Note this is a regression, since previous QEMU versions
>>>>> had some range reserved for 64bit PCI hotplug.
>>>>>
>>>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>>>>> ---
>>>>>    hw/i386/pc.c              | 22 ++++++++++++++++++++++
>>>>>    hw/pci-host/piix.c        | 10 ++++++++++
>>>>>    hw/pci-host/q35.c         |  9 +++++++++
>>>>>    include/hw/i386/pc.h      | 10 ++++++++++
>>>>>    include/hw/pci-host/q35.h |  1 +
>>>>>    5 files changed, 52 insertions(+)
>>>>
>>>> - What are the symptoms of the problem?
>>>>
>>>
>>> PCI devices with *relatively* large BARs can't be hot-plugged.
>>> The reason is that QEMU does not reserve an MMIO range over 4G
>>> and uses whatever remains of the 32bit PCI window.
>>>
>>> If you coldplug enough PCI devices you will end up with
>>> a 64bit PCI window that covers only the cold-plugged PCI
>>> devices, and still no room for hotplug.
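>>>
>>> (A typical way to hit this: cold-plug enough devices that their BARs
>>> nearly fill the space below 4G, then try to hot-plug -- e.g. with the
>>> monitor's "device_add" -- a device that has a multi-GB 64bit BAR; the
>>> guest has nowhere left to map it.)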
>>>
>>>> - What QEMU commit regressed the previous functionality?
>>>>
>>>
>>>   - commit 39848901818 pc: limit 64 bit hole to 2G by default
>>>   shows us QEMU had the 64bit PCI hole, so it is a regression.
>>
>> (Side remark: this commit was part of release v1.6.0, and at that time
>> we also had the "etc/pci-info" fw_cfg file.)
>>
>>>
>>>   - commit 2028fdf3791 piix: use 64 bit window programmed by guest
>>>   leaves no room for PCI hotplug
>>
>> (Side remark: part of v1.7.0, from 2013. So, we can call this a
>> regression, but then it's a very old one.
>>
>> Of course this is *not* a counter-argument to your patch, or commit
>> message wording; it's just that I'm surprised that the issue could
>> remain dormant this long -- usually users report regressions very
>> quickly.)
>>
> 
> I also thought I was proposing a new feature, and I was pretty amazed
> to find that I'm actually fixing something that worked before.
> 
>>>
>>>> - How exactly is the regression (and this fix) exposed to the guest?
>>>>
>>>
>>> The nuance is that both of the above commits computed the actual MMIO
>>> range used by cold-plugged devices, but as a side effect they
>>> influenced the remaining PCI window for hot-plugged devices.
>>>
>>> What the guest sees is the CRS MMIO range over 4G.
>>
>> OK, understood.
>>
>>>
>>>> I'm confused because semi-recently I've done multiple successful
>>>> hotplugs, with OVMF, Q35, PCIe root ports, and PCI Express devices
>>>> having large (64-bit) MMIO BARs. The examples include ivshmem / Linux
>>>> guest (based on your hints) and vfio-pci (XL710 NIC) / Windows Server
>>>> 2016 guest.
>>>
>>> Cool! This is actually great news!
>>> Back to the issue at hand, a few things:
>>> 1. It works with Q35 only, not with the I440FX machines.
>>
>> OK.
>>
>>> 2. The hints are used for the PCIe root ports, meaning for PCI bridges,
>>>     and they address a slightly different issue: leaving hotplug space
>>>     for secondary buses, while this patch tackles hotplug
>>>     space for root buses.
>>
>> OK.
>>
>>>
>>>>
>>>> By default, OVMF marks a 32GB area as the 64-bit MMIO aperture for
>>>> BAR allocation. This area is above the normal memory (plus the
>>>> reservation for hotplug memory, if any). It is also GB-aligned. The
>>>> aperture size can be changed (currently) with an experimental -fw_cfg
>>>> switch: the fw_cfg file is called "opt/ovmf/X-PciMmio64Mb", and it
>>>> states the size of the 64-bit MMIO aperture as a decimal number of
>>>> megabytes.
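>>>>
>>>> (For illustration only -- the number below is arbitrary -- such a
>>>> request would look like "-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536"
>>>> on the QEMU command line, i.e. a 64 GB aperture.)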
>>>
>>> I'll try to explain how I see the "picture":
>>> the CRS is created by QEMU and computed *after* the firmware finishes
>>> the PCI BAR allocations.
>>
>> Yes.
>>
>>> Then it is *QEMU* that decides how big the
>>> 64bit PCI hole will be, not the firmware - see the past commits.
>>
>> Yes, first the guest-side programming happens, then QEMU (re-)calculates
>> the ACPI payload once the right ACPI fw_cfg file is selected / read.
>> This re-calculation (including the CRS) observes the previously
>> programmed values.
>>
>>>
>>> So IMHO you don't need opt/ovmf/X-PciMmio64Mb, since OVMF is
>>> not deciding this value. It actually should not care at all.
>>
>> OVMF cannot *not* care. It must internally provide (to other,
>> platform-independent components in edk2) the 64-bit MMIO aperture that
>> the BARs can be allocated from.
>>
>>
>> But maybe I'm totally misunderstanding this patch! Are you saying that
>> the patch makes the following difference only: when the CRS is
>> recalculated (i.e., after the PCI enumeration has completed in the
>> guest, and the ACPI linker/loader fw_cfg files are opened), then you
>> only grow / round up the 64-bit range?
>>
> 
> Exactly.
> 
>> That would make sense, because a *superset* aperture exposed to the OS
>> via the _CRS does not invalidate any allocations done by the firmware
>> earlier. Furthermore, the firmware itself need not learn about the
>> details of said superset. In this sense I agree that OVMF should not
>> care at all.
>>
> 
> Exactly, the range is a superset.
> 
>> [snip]
>>
>>>> Also, how do you plan to integrate this feature with SeaBIOS?
>>
>> (Ever since I sent my previous response, I've been glad that I finished
>> my email with this question -- I really think OVMF and SeaBIOS should
>> behave uniformly in this regard. So I'm happy about your answer:)
>>
>>> Simple, no need :) . It just works as it should.
>>> SeaBIOS computes everything, *then* QEMU creates the CRS ranges
>>> based on firmware input and other properties.
>>> I think it works with OVMF out of the box, too.
>>
>> OK, so this explains it all, thank you very much -- my question is if
>> you could spell out those "other properties" in a bit more detail (I
>> apologize if these should be obvious to me already, from the commit
>> message and the patch):
>>
> 
> "Other properties" was a vague link to max RAM, and things like that.
> Nothing specific.
> 
>> (1) How exactly does the enlarged 64-bit hole remain a *superset* of the
>> firmware-programmed BARs? Can you point me to a location in the code?
>>
> 
> Sure, in the patch itself:
>  We first take the firmware-computed ranges:
>      pci_bus_get_w64_range(h->bus, &w64);
>  Look at *get_pci_hole64_start -> if the firmware set it, we don't
> touch it.
>  And at *get_pci_hole64_end -> we only update the value if it is
>                                smaller than what the firmware set.
> 
> 
> 
>> (2) What happens if we do have more than 1TB RAM?
>>
> 
> You get no 64bit "hot-pluggable" reserved range.
> The code dealing with that is:
> 
> +    if (hole64_start > PCI_HOST_PCI_HOLE64_END) {
> +        hole64_start = PCI_HOST_PCI_HOLE64_END;
> +    }
> 
> And since the end is hard-coded to PCI_HOST_PCI_HOLE64_END,
> we end up with a zero-size region, which gets discarded.
> 
> An interesting case: if we have more than 1T of RAM and the
> firmware assigns BARs above that, we end up increasing the
> 64bit window to include some of the RAM...
> 
> A better approach is to check against the actual hole64
> end and not the magic value.
> I'll send a v3, thanks!

Ah, right; in v1 -- and possibly v2, I haven't looked yet --
hole64_start will be set to PCI_HOST_PCI_HOLE64_END, in preparation for
the end to be set the exact same way (resulting in an empty range) --
however, the end won't be modified at all if it's already above
PCI_HOST_PCI_HOLE64_END, so we'll end up with a non-empty range that
intersects with RAM.
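
To make my reading concrete, here is a small compilable paraphrase of the
v1 start/end adjustments (the helper names and the hard-coded constant are
mine; this is not the literal patch):

#include <stdint.h>

#define PCI_HOST_PCI_HOLE64_END (1ULL << 40)    /* the 1T "magic" limit */

/* start of the hole: end of RAM plus the hotplug-RAM reservation */
static uint64_t hole64_start_v1(uint64_t above_ram)
{
    if (above_ram > PCI_HOST_PCI_HOLE64_END) {
        above_ram = PCI_HOST_PCI_HOLE64_END;    /* clamped *down* to 1T */
    }
    return above_ram;
}

/* end of the hole: end of the firmware-programmed 64bit BAR window */
static uint64_t hole64_end_v1(uint64_t fw_w64_end)
{
    if (fw_w64_end < PCI_HOST_PCI_HOLE64_END) {
        fw_w64_end = PCI_HOST_PCI_HOLE64_END;   /* only ever raised to 1T */
    }
    return fw_w64_end;
}

The asymmetry between the two clamps is exactly what produces the overlap
described above.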

Please CC me on v3 too; I feel that I'm now better equipped to comment :)

Thanks!
Laszlo

> 
>>> One last note: when pxb/pxb-pcie devices become hot-pluggable
>>> we need to leave some bits for them too. I wouldn't
>>> really want to add fw_cfg files for them too, or
>>> use a complicated fw_cfg, or rewrite
>>> the logic for all firmware if we don't have to.
>>
>> OK.
>>
>>>
>>> The 64bit PCI window starts at the end of memory (+reserved)
>>> and finishes at 40bit.
>>
>> OK, this seems to (partly) answer my question (1) -- can you please
>> confirm that the
>>
>>    finishes at 40bit
>>
>> part does not imply forcefully truncating the CRS at 1TB, in case the
>> firmware already allocated BARs higher than that?
>>
> 
> I confirm:
> 
> +    if (s->pci_hole64_fix && value < PCI_HOST_PCI_HOLE64_END) {
> +        value = PCI_HOST_PCI_HOLE64_END;
> +    }
> 
> I touch the value only if it is smaller than what the firmware set.
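> 
> (Concretely, with made-up numbers: if the firmware already placed the
> highest 64bit BAR end at, say, 1.5T, that value stays; if everything fit
> below the 1T mark, the end is lifted to 1T. Either way the window can
> only grow.)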
> 
>>> If you have 1T of RAM you
>>> get nothing for PCI hotplug, which seems fair.
>>
>> This is fine (as the answer to my question (2)) as long as you mean it
>> in the following sense: "QEMU will simply *not grow* the MMIO in the
>> _CRS past 1TB".
>>
>>> When guests with more than 1T of RAM become
>>> frequent, I suppose QEMU's default CPU-addressable
>>> bits will change and we will change this too.
>>
>> Thanks Marcel, I feel a lot safer now.
>>
>> Please confirm that the changes can only grow the area exposed in the
>> _CRS -- i.e., only lower or preserve the base, and only raise or
>> preserve the absolute limit. I'll shut up then :)
>>
> 
> Confirmed and thanks for the review!
> 
> Thanks,
> Marcel
> 
>> Thanks!
>> Laszlo
>>
> 
