From: Laszlo Ersek <lersek@redhat.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	Xiao Guangrong <guangrong.xiao@linux.intel.com>,
	pbonzini@redhat.com, gleb@kernel.org, mtosatti@redhat.com,
	stefanha@redhat.com, rth@twiddle.net, ehabkost@redhat.com,
	dan.j.williams@intel.com, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, Marcel Apfelbaum <marcel@redhat.com>
Subject: Re: How to reserve guest physical region for ACPI
Date: Thu, 7 Jan 2016 18:33:05 +0100	[thread overview]
Message-ID: <568EA151.5040702@redhat.com> (raw)
In-Reply-To: <20160107145113.7b459368@nial.brq.redhat.com>

On 01/07/16 14:51, Igor Mammedov wrote:
> On Mon, 4 Jan 2016 21:17:31 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> Michael CC'd me on the grandparent of the email below. I'll try to add
>> my thoughts in a single go, with regard to OVMF.
>>
>> On 12/30/15 20:52, Michael S. Tsirkin wrote:
>>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:  
>>>> On Mon, 28 Dec 2015 14:50:15 +0200
>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>  
>>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
>>>>>>
>>>>>> Hi Michael, Paolo,
>>>>>>
>>>>>> Now it is time to return to the challenge of how to reserve the
>>>>>> guest physical region used internally by ACPI.
>>>>>>
>>>>>> Igor suggested that:
>>>>>> | An alternative place to allocate reserve from could be high memory.
>>>>>> | For pc we have "reserved-memory-end" which currently makes sure
>>>>>> | that hotpluggable memory range isn't used by firmware
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)  
>>
>> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
>> reason is that nobody wrote that patch, nor asked for the patch to be
>> written. (Not implying that just requesting the patch would be
>> sufficient for the patch to be written.)
> Hijacking this part of the thread to check whether OVMF works with
> memory hotplug, and whether it needs "reserved-memory-end" support at all.
> 
> How does OVMF determine which GPA ranges to use for initializing PCI BARs
> at boot time,

I'm glad you asked this question. This is an utterly sordid area with quite a bit of history. We've discussed it several times in the past; for example, recall the "etc/pci-info" discussion...

Fact is, OVMF has no way to dynamically determine the PCI MMIO aperture to allocate BARs from. (Obviously, parsing AML is out of the question, especially in the early firmware stage where this information would be needed. Plus, that would be a chicken-and-egg problem anyway: QEMU composes the _CRS in the AML *based on* the enumeration that was completed by the guest.)

Search "OvmfPkg/PlatformPei/Platform.c" for the string "PciBase"; it all originates there. I can also quote it:

    UINT32  TopOfLowRam;
    UINT32  PciBase;

    TopOfLowRam = GetSystemMemorySizeBelow4gb ();
    if (mHostBridgeDevId == INTEL_Q35_MCH_DEVICE_ID) {
      //
      // A 3GB base will always fall into Q35's 32-bit PCI host aperture,
      // regardless of the Q35 MMCONFIG BAR. Correspondingly, QEMU never lets
      // the RAM below 4 GB exceed it.
      //
      PciBase = BASE_2GB + BASE_1GB;
      ASSERT (TopOfLowRam <= PciBase);
    } else {
      //
      // On i440fx, start the aperture at 2 GB, or at the top of low RAM if
      // that is higher. [comment added by me for this email]
      //
      PciBase = (TopOfLowRam < BASE_2GB) ? BASE_2GB : TopOfLowRam;
    }

    ...

    //
    // Advertise the MMIO aperture [PciBase, 0xFC000000) to the DXE phase.
    // [comment added by me for this email]
    //
    AddIoMemoryRangeHob (PciBase, 0xFC000000);

That's it.
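
(A worked example, with my own numbers, not taken from the code above: on i440fx with exactly 2 GB of guest RAM, GetSystemMemorySizeBelow4gb() returns 0x80000000, the else branch yields PciBase = TopOfLowRam, and OVMF ends up advertising the 32-bit MMIO aperture [0x80000000, 0xFC000000). All BARs are allocated from that range.)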

In the past, it repeatedly happened that OVMF's calculation didn't match QEMU's, and PCI MMIO BARs ended up allocated outside of QEMU's actual MMIO aperture. This caused two kinds of breakage:
- the video display didn't work (the framebuffer was accessed at a bogus address),
- Windows and Linux guests noticed that the BARs fell outside the ranges exposed in _CRS, and disabled the affected devices.

We kept duct-taping this, with patches in both OVMF and QEMU (see e.g. Gerd's QEMU commit ddaaefb4dd42).

It has been working fine for quite a long time now, but it is still not dynamic -- the calculations are duplicated between QEMU and OVMF.

To this day, I maintain that the "etc/pci-info" fw_cfg file would have been ideal for OVMF's purposes; and I still don't understand why it was ultimately removed.
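
Just to make that concrete, here's a minimal sketch (mine, not actual OVMF code) of how consuming such a file could look with the existing QemuFwCfgLib interface. The PCI_INFO layout below is my assumption -- four little-endian UINT64s, mirroring what I recall "etc/pci-info" carried -- so treat it as illustrative, not as the authoritative QEMU format:

    #include <Library/QemuFwCfgLib.h>

    //
    // Assumed layout: 32-bit and 64-bit PCI hole boundaries, each as a
    // [Begin, End) pair of little-endian UINT64s.
    //
    #pragma pack (1)
    typedef struct {
      UINT64 HoleBegin32, HoleEnd32;
      UINT64 HoleBegin64, HoleEnd64;
    } PCI_INFO;
    #pragma pack ()

    STATIC
    RETURN_STATUS
    GetPciInfo (
      OUT PCI_INFO *PciInfo
      )
    {
      RETURN_STATUS        Status;
      FIRMWARE_CONFIG_ITEM Item;
      UINTN                Size;

      Status = QemuFwCfgFindFile ("etc/pci-info", &Item, &Size);
      if (RETURN_ERROR (Status) || Size != sizeof *PciInfo) {
        //
        // File absent or malformed -- fall back to the hardcoded
        // calculation quoted above.
        //
        return RETURN_NOT_FOUND;
      }
      QemuFwCfgSelectItem (Item);
      QemuFwCfgReadBytes (sizeof *PciInfo, PciInfo);
      return RETURN_SUCCESS;
    }

With something like this, PlatformPei could replace the PciBase guesswork with PciInfo.HoleBegin32 / PciInfo.HoleEnd32, and the two calculations would stop being duplicated.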

> more specifically 64-bit BARs.

Ha. Haha. Hahahaha.

OVMF doesn't support 64-bit BARs *at all*. In order to implement that, I would have to (1) understand PCI about ten billion percent better than I do now, and (2) extend the mostly *impenetrable* PCI host bridge / root bridge driver in "OvmfPkg/PciHostBridgeDxe" to support this functionality.

Unfortunately, the parts of the UEFI & Platform Init specs that seem to talk about this functionality are super complex and obscure.

We have plans with Marcel and others to understand this better and perhaps do something about it.

Anyway, the basic premise bears repeating: even for the 32-bit case, OVMF has no way to dynamically retrieve the PCI hole's boundaries from QEMU.

Honestly, I'm confused. If "reserved-memory-end" is exposed over fw_cfg, and it -- apparently! -- partakes in communicating the 64-bit PCI hole to the guest, then why again was "etc/pci-info" removed in the first place?
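
For what it's worth, if OVMF were to consume "etc/reserved-memory-end", I'd expect something like the sketch below -- again mine, not existing code. I'm assuming the file holds a single little-endian UINT64 GPA, I'm ignoring RAM above 4 GB for simplicity, and the 1 GB alignment is my own illustrative choice:

    STATIC
    UINT64
    GetPci64Base (
      VOID
      )
    {
      FIRMWARE_CONFIG_ITEM Item;
      UINTN                Size;
      UINT64               ReservedMemoryEnd;

      //
      // Default: start the 64-bit aperture right above 4 GB.
      //
      ReservedMemoryEnd = BASE_4GB;

      if (!RETURN_ERROR (QemuFwCfgFindFile ("etc/reserved-memory-end",
                           &Item, &Size)) &&
          Size == sizeof ReservedMemoryEnd) {
        QemuFwCfgSelectItem (Item);
        QemuFwCfgReadBytes (Size, &ReservedMemoryEnd);
      }

      //
      // Round up so the 64-bit aperture starts above the hotpluggable
      // memory range and never overlaps it.
      //
      return ALIGN_VALUE (ReservedMemoryEnd, (UINT64)BASE_1GB);
    }

That would at least keep 64-bit BARs clear of hotplugged memory.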

Thanks
Laszlo
