From: Laszlo Ersek <lersek@redhat.com>
To: Marcel Apfelbaum <marcel@redhat.com>, qemu-devel@nongnu.org
Cc: mst@redhat.com, pbonzini@redhat.com, ehabkost@redhat.com,
	Igor Mammedov <imammedo@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [Qemu-devel] [PATCH] hw/pci-host: Fix x86 Host Bridges 64bit PCI hole
Date: Thu, 19 Oct 2017 13:52:14 +0200	[thread overview]
Message-ID: <6e21e32b-ce06-a6ff-8603-d864af129d7a@redhat.com> (raw)
In-Reply-To: <7b08ea3e-d380-ac0c-a716-d0ac703feaea@redhat.com>

On 10/19/17 13:29, Marcel Apfelbaum wrote:
> Hi Laszlo,
> 
> On 19/10/2017 13:41, Laszlo Ersek wrote:
>> On 10/19/17 11:30, Marcel Apfelbaum wrote:
>>>
>>> Hi Laszlo,
>>>
>>> On 18/10/2017 14:40, Laszlo Ersek wrote:
>>>> Hi Marcel,
>>>>
>>>> On 10/18/17 11:58, Marcel Apfelbaum wrote:
>>>>> Currently there is no MMIO range over 4G
>>>>> reserved for PCI hotplug. Since the 32bit PCI hole
>>>>> depends on the number of cold-plugged PCI devices
>>>>> and other factors, it is very possible that it is too
>>>>> small to hotplug PCI devices with large BARs.
>>>>>
>>>>> Fix it by reserving all the address space after
>>>>> RAM (and the reserved space for RAM hotplug),
>>>>> but no more than 40 bits. The latter seems to be
>>>>> QEMU's magic number for the guest CPU's physical
>>>>> addressable bits and is safe with respect to migration.
>>>>>
>>>>> Note this is a regression, since previous QEMU versions
>>>>> had some range reserved for 64bit PCI hotplug.
>>>>>
>>>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>>>>> ---
>>>>>    hw/i386/pc.c              | 22 ++++++++++++++++++++++
>>>>>    hw/pci-host/piix.c        | 10 ++++++++++
>>>>>    hw/pci-host/q35.c         |  9 +++++++++
>>>>>    include/hw/i386/pc.h      | 10 ++++++++++
>>>>>    include/hw/pci-host/q35.h |  1 +
>>>>>    5 files changed, 52 insertions(+)
>>>>
>>>> - What are the symptoms of the problem?
>>>>
>>>
>>> PCI devices with *relatively* large BARs can't be hot-plugged.
>>> The reason is that QEMU does not reserve an MMIO range over 4G
>>> and uses whatever remains of the 32bit PCI window.
>>>
>>> If you coldplug enough PCI devices you will end up with
>>> a 64bit PCI window that covers only the cold-plugged PCI
>>> devices, and still no room for hotplug.
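>>>
>>> (A typical way to hit this: cold-plug enough devices that their BARs
>>> nearly fill the space below 4G, then try to hot-plug -- e.g. with the
>>> monitor's "device_add" -- a device that has a multi-GB 64bit BAR; the
>>> guest has nowhere left to map it.)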
>>>
>>>> - What QEMU commit regressed the previous functionality?
>>>>
>>>
>>>   - commit 39848901818 pc: limit 64 bit hole to 2G by default
>>>   shows us QEMU had the 64bit PCI hole, so it is a regression.
>>
>> (Side remark: this commit was part of release v1.6.0, and at that time
>> we also had the "etc/pci-info" fw_cfg file.)
>>
>>>
>>>   - commit 2028fdf3791 piix: use 64 bit window programmed by guest
>>>   leaves no room for PCI hotplug
>>
>> (Side remark: part of v1.7.0, from 2013. So, we can call this a
>> regression, but then it's a very old one.
>>
>> Of course this is *not* a counter-argument to your patch, or commit
>> message wording; it's just that I'm surprised that the issue could
>> remain dormant this long -- usually users report regressions very
>> quickly.)
>>
> 
> I also thought I was proposing a new feature, and I was pretty amazed
> to find that I'm actually fixing something that worked before.
> 
>>>
>>>> - How exactly is the regression (and this fix) exposed to the guest?
>>>>
>>>
>>> The nuance is that both of the above commits computed the actual MMIO
>>> range used by cold-plugged devices, but as a side effect they
>>> influenced the remaining PCI window for hot-plugged devices.
>>>
>>> What the guest sees is the CRS MMIO range over 4G.
>>
>> OK, understood.
>>
>>>
>>>> I'm confused because semi-recently I've done multiple successful
>>>> hotplugs, with OVMF, Q35, PCIe root ports, and PCI Express devices
>>>> having large (64-bit) MMIO BARs. The examples include ivshmem / Linux
>>>> guest (based on your hints) and vfio-pci (XL710 NIC) / Windows Server
>>>> 2016 guest.
>>>
>>> Cool! This is actually great news!
>>> Back to the issue at hand, a few things:
>>> 1. It works with Q35 only, not with the I440FX machines.
>>
>> OK.
>>
>>> 2. The hints are used for the PCIe root ports, meaning for PCI bridges,
>>>     and they address a slightly different issue: leaving hotplug space
>>>     for secondary buses, while this patch tackles hotplug
>>>     space for root buses.
>>
>> OK.
>>
>>>
>>>>
>>>> By default, OVMF marks a 32GB area as the 64-bit MMIO aperture for
>>>> BAR allocation. This area is above the normal memory (plus the
>>>> reservation for hotplug memory, if any). It is also GB-aligned. The
>>>> aperture size can be changed (currently) with an experimental -fw_cfg
>>>> switch: the fw_cfg file is called "opt/ovmf/X-PciMmio64Mb", and it
>>>> states the size of the 64-bit MMIO aperture as a decimal number of
>>>> megabytes.
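>>>>
>>>> (For illustration only -- the number below is arbitrary -- such a
>>>> request would look like "-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536"
>>>> on the QEMU command line, i.e. a 64 GB aperture.)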
>>>
>>> I'll try to explain how I see the "picture":
>>> the CRS is created by QEMU and computed *after* the firmware finishes
>>> the PCI BAR allocations.
>>
>> Yes.
>>
>>> Then it is *QEMU* that decides how big the
>>> 64bit PCI hole will be, not the firmware - see the past commits.
>>
>> Yes, first the guest-side programming happens, then QEMU (re-)calculates
>> the ACPI payload once the right ACPI fw_cfg file is selected / read.
>> This re-calculation (including the CRS) observes the previously
>> programmed values.
>>
>>>
>>> So IMHO you don't need opt/ovmf/X-PciMmio64Mb, since OVMF is
>>> not deciding this value. It actually should not care at all.
>>
>> OVMF cannot *not* care. It must internally provide (to other,
>> platform-independent components in edk2) the 64-bit MMIO aperture that
>> the BARs can be allocated from.
>>
>>
>> But maybe I'm totally misunderstanding this patch! Are you saying that
>> the patch makes the following difference only: when the CRS is
>> recalculated (i.e., after the PCI enumeration has completed in the
>> guest, and the ACPI linker/loader fw_cfg files are opened), then you
>> only grow / round up the 64-bit range?
>>
> 
> Exactly.
> 
>> That would make sense, because a *superset* aperture exposed to the OS
>> via the _CRS does not invalidate any allocations done by the firmware
>> earlier. Furthermore, the firmware itself need not learn about the
>> details of said superset. In this sense I agree that OVMF should not
>> care at all.
>>
> 
> Exactly, the range is a superset.
> 
>> [snip]
>>
>>>> Also, how do you plan to integrate this feature with SeaBIOS?
>>
>> (Ever since I sent my previous response, I've been glad that I finished
>> my email with this question -- I really think OVMF and SeaBIOS should
>> behave uniformly in this regard. So I'm happy about your answer:)
>>
>>> Simple, no need :) . It just works as it should.
>>> SeaBIOS computes everything, *then* QEMU creates the CRS ranges
>>> based on firmware input and other properties.
>>> I think it works with OVMF out of the box, too.
>>
>> OK, so this explains it all, thank you very much -- my question is if
>> you could spell out those "other properties" in a bit more detail (I
>> apologize if these should be obvious to me already, from the commit
>> message and the patch):
>>
> 
> "Other properties" was a vague link to max RAM, and things like that.
> Nothing specific.
> 
>> (1) How exactly does the enlarged 64-bit hole remain a *superset* of the
>> firmware-programmed BARs? Can you point me to a location in the code?
>>
> 
> Sure, in the patch itself:
>  We first take the firmware-computed ranges:
>      pci_bus_get_w64_range(h->bus, &w64);
>  Look at *get_pci_hole64_start -> if the firmware set it, we don't
> touch it.
>  And at *get_pci_hole64_end -> we only update the value if it is
>                                smaller than what the firmware set.
> 
> 
> 
>> (2) What happens if we do have more than 1TB RAM?
>>
> 
> You get no 64bit "hot-pluggable" reserved range.
> The code dealing with that is:
> 
> +    if (hole64_start > PCI_HOST_PCI_HOLE64_END) {
> +        hole64_start = PCI_HOST_PCI_HOLE64_END;
> +    }
> 
> And since the end is hard-coded to PCI_HOST_PCI_HOLE64_END,
> we end up with a zero-size region, which gets discarded.
> 
> An interesting case: if we have more than 1T of RAM and the
> firmware assigns BARs above that, we end up increasing the
> 64bit window to include some of the RAM...
> 
> A better approach is to check against the actual hole64
> end and not the magic value.
> I'll send a v3, thanks!

Ah, right; in v1 -- and possibly v2, I haven't looked yet --
hole64_start will be set to PCI_HOST_PCI_HOLE64_END, in preparation for
the end to be set the exact same way (resulting in an empty range) --
however, the end won't be modified at all if it's already above
PCI_HOST_PCI_HOLE64_END, so we'll end up with a non-empty range that
intersects with RAM.
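
To make my reading concrete, here is a small compilable paraphrase of the
v1 start/end adjustments (the helper names and the hard-coded constant are
mine; this is not the literal patch):

#include <stdint.h>

#define PCI_HOST_PCI_HOLE64_END (1ULL << 40)    /* the 1T "magic" limit */

/* start of the hole: end of RAM plus the hotplug-RAM reservation */
static uint64_t hole64_start_v1(uint64_t above_ram)
{
    if (above_ram > PCI_HOST_PCI_HOLE64_END) {
        above_ram = PCI_HOST_PCI_HOLE64_END;    /* clamped *down* to 1T */
    }
    return above_ram;
}

/* end of the hole: end of the firmware-programmed 64bit BAR window */
static uint64_t hole64_end_v1(uint64_t fw_w64_end)
{
    if (fw_w64_end < PCI_HOST_PCI_HOLE64_END) {
        fw_w64_end = PCI_HOST_PCI_HOLE64_END;   /* only ever raised to 1T */
    }
    return fw_w64_end;
}

The asymmetry between the two clamps is exactly what produces the overlap
described above.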

Please CC me on v3 too; I feel that I'm now better equipped to comment :)

Thanks!
Laszlo

> 
>>> One last note: when pxb/pxb-pcie devices become hot-pluggable
>>> we need to leave some bits for them too. I wouldn't
>>> really want to add fw_cfg files for them too, or
>>> use a complicated fw_cfg, or rewrite
>>> the logic for all firmware if we don't have to.
>>
>> OK.
>>
>>>
>>> The 64bit PCI window starts at the end of memory (+reserved)
>>> and finishes at 40bit.
>>
>> OK, this seems to (partly) answer my question (1) -- can you please
>> confirm that the
>>
>>    finishes at 40bit
>>
>> part does not imply forcefully truncating the CRS at 1TB, in case the
>> firmware already allocated BARs higher than that?
>>
> 
> I confirm:
> 
> +    if (s->pci_hole64_fix && value < PCI_HOST_PCI_HOLE64_END) {
> +        value = PCI_HOST_PCI_HOLE64_END;
> +    }
> 
> I touch the value only if it is smaller than what the firmware set.
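> 
> (Concretely, with made-up numbers: if the firmware already placed the
> highest 64bit BAR end at, say, 1.5T, that value stays; if everything fit
> below the 1T mark, the end is lifted to 1T. Either way the window can
> only grow.)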
> 
>>> If you have 1T of RAM you
>>> get nothing for PCI hotplug, which seems fair.
>>
>> This is fine (as the answer to my question (2)) as long as you mean it
>> in the following sense: "QEMU will simply *not grow* the MMIO in the
>> _CRS past 1TB".
>>
>>> When guests with more than 1T of RAM become
>>> frequent, I suppose QEMU's default CPU-addressable
>>> bits will change and we will change this too.
>>
>> Thanks Marcel, I feel a lot safer now.
>>
>> Please confirm that the changes can only grow the area exposed in the
>> _CRS -- i.e., only lower or preserve the base, and only raise or
>> preserve the absolute limit. I'll shut up then :)
>>
> 
> Confirmed and thanks for the review!
> 
> Thanks,
> Marcel
> 
>> Thanks!
>> Laszlo
>>
> 
