qemu-devel.nongnu.org archive mirror
* ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
@ 2020-06-16 15:16 Guilherme G. Piccoli
  2020-06-16 16:50 ` Gerd Hoffmann
  0 siblings, 1 reply; 40+ messages in thread
From: Guilherme G. Piccoli @ 2020-06-16 15:16 UTC (permalink / raw)
  To: qemu-devel
  Cc: pedro.principeza, ehabkost, dann.frazier, dgilbert,
	christian.ehrhardt, kraxel, lersek, fw

Hello folks, I'd like to start a discussion (or bump it, in case it was
already discussed) about an "issue", or rather a limitation, that we've
been observing (and receiving reports about) in qemu/OVMF with regard to
PCI passthrough of large-BAR devices.

After OVMF commit 7e5b1b670c38 ("OvmfPkg: PlatformPei: determine the
64-bit PCI host aperture for X64 DXE"), the PCI 64-bit aperture is a
hardcoded value passed to the guest via the ACPI CRS which, in practical
terms, does not allow PCI devices with 32G+ BARs to be correctly passed
through to guests.

There was a very informative discussion on the edk2 groups [0], started
by my colleague Dann, to which several edk2 and qemu developers responded
with a good amount of information and rationale about this limitation and
about the problems that increasing it would bring. All the colleagues who
responded in that group discussion are hereby CC'ed.

The summary (in my understanding) is:

- The main reasoning for the current limitation is to keep things simple;
OVMF needs to take the 64-bit aperture into account when setting up its
memory mapping, and for common scenarios the current limit of 32G
accommodates the majority of use cases.

- On top of that, increasing the 64-bit aperture increases the memory
required for the OVMF-calculated PEI (Pre-EFI Initialization) page
tables.

- The current aperture also fits within the 36-bit CPU physical address
width (phys-bits) common in older processors and in some generic qemu
vcpu models; this "helps" with live migration, since 36 bits seems to be
the lowest common denominator (LCD) across processors (for 64-bit
architectures), hence the limited PCI64 aperture isn't yet another factor
that makes live migration difficult or impossible.

- Finally, there's an _experimental_ parameter that gives users some
flexibility over the PCI64 aperture size: "X-PciMmio64Mb" (an example
invocation is sketched right below).
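
For reference, OVMF consumes that knob through fw_cfg, and the value is
in MB (at least that is my reading of the code). A sketch of an
invocation requesting a 64G aperture, assuming the usual OVMF pflash
setup (firmware paths and the passed-through device address are just
placeholders):

    qemu-system-x86_64 -machine q35,accel=kvm -m 16G \
      -drive if=pflash,format=raw,readonly=on,file=OVMF_CODE.fd \
      -drive if=pflash,format=raw,file=OVMF_VARS.fd \
      -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536 \
      -device vfio-pci,host=0000:41:00.0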

The point is that we have more and more devices out there with bigger
BARs (mostly GPUs) that either exceed 32G by themselves or are getting
close (16G), and if users want to pass such devices through, OVMF doesn't
allow it. Relying on "X-PciMmio64Mb" is problematic due to the
experimental/unstable nature of that parameter.

The Linux kernel allows bypassing the ACPI CRS with "pci=nocrs"; there is
some discussion about that in [1]. But other OSes may not have such an
option, effectively preventing the passthrough of such large devices from
succeeding, or forcing users to rely on the experimental parameter.
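
For the Linux case, that workaround is just a guest kernel command line
change, roughly (for a grub-based guest):

    # inside the guest, /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="... pci=nocrs"
    # then regenerate the grub config (update-grub / grub2-mkconfig)
    # and reboot the guest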

I'd like to discuss a definitive solution here; I started this discussion
on the Tianocore bugzilla [2], but Laszlo wisely suggested we move it
here to gather input from the qemu community.
Currently I see 2 options, (a) being my preferred one:

(a) We could rely on the guest physbits to calculate the PCI64 aperture.
If users are passing through the host physbits (or setting the physbits
manually through "phys-bits"), they are already risking a live migration
failure. Also, if users are not setting the physbits for the guest, there
must be a default (40 bits, according to my experiments), and it seems
like a good idea to rely on that. If the guest physbits is 40, why have
OVMF limit it to 36, right? (Example command lines are sketched below,
after option (b).)

(b) Making the experimental "X-PciMmio64Mb" parameter non-experimental is
also an option, allowing users to rely on it without the risk of it being
dropped.
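
To make option (a) concrete, the physbits OVMF would be looking at are
just whatever the CPU model exposes, e.g. (illustrative command lines,
not a recommendation):

    # expose the host's real physical address width to the guest
    -cpu host,host-phys-bits=on

    # or pin an explicit value chosen to match the whole migration pool
    -cpu Skylake-Server,phys-bits=40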

Please let me know your thoughts on this limitation and how we could
improve things. Other ideas are welcome too, of course. Thanks for your
attention,


Guilherme


[0] edk2.groups.io/g/discuss/topic/ovmf_resource_assignment/59340711
[1] bugs.launchpad.net/bugs/1849563
[2] bugzilla.tianocore.org/show_bug.cgi?id=2796


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 15:16 ovmf / PCI passthrough impaired due to very limiting PCI64 aperture Guilherme G. Piccoli
@ 2020-06-16 16:50 ` Gerd Hoffmann
  2020-06-16 16:57   ` Dr. David Alan Gilbert
  2020-06-17  8:16   ` Christophe de Dinechin
  0 siblings, 2 replies; 40+ messages in thread
From: Gerd Hoffmann @ 2020-06-16 16:50 UTC (permalink / raw)
  To: Guilherme G. Piccoli
  Cc: pedro.principeza, ehabkost, dann.frazier, qemu-devel,
	christian.ehrhardt, dgilbert, lersek, fw

  Hi,

> (a) We could rely in the guest physbits to calculate the PCI64 aperture.

I'd love to do that.  Move the 64-bit I/O window as high as possible and
use -- say -- 25% of the physical address space for it.

Problem is we can't.

> failure. Also, if the users are not setting the physbits in the guest,
> there must be a default (seems to be 40bit according to my experiments),
> seems to be a good idea to rely on that.

Yes, 40 is the default, and it is used *even if the host supports less
than that*.  Typical values I've seen for intel hardware are 36 and 39.
39 is used even by recent hardware (not the xeons, but check out a
laptop or a nuc).

> If guest physbits is 40, why to have OVMF limiting it to 36, right?

Things will explode in case OVMF uses more physbits than the host
supports (host physbits limit applies to ept too).  In other words: OVMF
can't trust the guest physbits, so it is conservative to be on the safe
side.

If we can somehow make a *trustable* physbits value available to the
guest, then yes, we can go that route.  But the guest physbits we have
today unfortunately don't cut it.
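
Just so we're all talking about the same thing: the "guest physbits" in
question here is simply CPUID leaf 0x80000008, EAX[7:0]. A minimal
standalone check (gcc/clang on x86):

    /* Print the CPU-reported physical and virtual address widths.
     * Inside a guest, this is the value OVMF would have to decide
     * whether to trust. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Extended leaf 0x80000008: EAX[7:0] = MAXPHYADDR,
         * EAX[15:8] = linear address bits. */
        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 0x80000008 not available\n");
            return 1;
        }
        printf("phys-bits: %u, virt-bits: %u\n",
               eax & 0xff, (eax >> 8) & 0xff);
        return 0;
    }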

take care,
  Gerd



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 16:50 ` Gerd Hoffmann
@ 2020-06-16 16:57   ` Dr. David Alan Gilbert
  2020-06-16 17:10     ` Eduardo Habkost
                       ` (2 more replies)
  2020-06-17  8:16   ` Christophe de Dinechin
  1 sibling, 3 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-16 16:57 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: pedro.principeza, ehabkost, dann.frazier, Guilherme G. Piccoli,
	qemu-devel, christian.ehrhardt, lersek, fw

* Gerd Hoffmann (kraxel@redhat.com) wrote:
>   Hi,
> 
> > (a) We could rely in the guest physbits to calculate the PCI64 aperture.
> 
> I'd love to do that.  Move the 64-bit I/O window as high as possible and
> use -- say -- 25% of the physical address space for it.
> 
> Problem is we can't.
> 
> > failure. Also, if the users are not setting the physbits in the guest,
> > there must be a default (seems to be 40bit according to my experiments),
> > seems to be a good idea to rely on that.
> 
> Yes, 40 is the default, and it is used *even if the host supports less
> than that*.  Typical values I've seen for intel hardware are 36 and 39.
> 39 is used even by recent hardware (not the xeons, but check out a
> laptop or a nuc).
> 
> > If guest physbits is 40, why to have OVMF limiting it to 36, right?
> 
> Things will explode in case OVMF uses more physbits than the host
> supports (host physbits limit applies to ept too).  In other words: OVMF
> can't trust the guest physbits, so it is conservative to be on the safe
> side.
> 
> If we can somehow make a *trustable* physbits value available to the
> guest, then yes, we can go that route.  But the guest physbits we have
> today unfortunately don't cut it.

In downstream RH qemu, we run with host-physbits as default; so it's reasonably
trustworthy; of course that doesn't help you across a migration between
hosts with different sizes (e.g. an E5 Xeon to an E3).
Changing upstream to do the same would seem sensible to me, but it's not
a foolproof config.

Dave

> take care,
>   Gerd
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 16:57   ` Dr. David Alan Gilbert
@ 2020-06-16 17:10     ` Eduardo Habkost
  2020-06-17  8:17       ` Christophe de Dinechin
  2020-06-17  8:50       ` Daniel P. Berrangé
  2020-06-16 17:10     ` Gerd Hoffmann
  2020-06-16 17:14     ` Guilherme Piccoli
  2 siblings, 2 replies; 40+ messages in thread
From: Eduardo Habkost @ 2020-06-16 17:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: pedro.principeza, dann.frazier, Guilherme G. Piccoli, qemu-devel,
	christian.ehrhardt, Gerd Hoffmann, lersek, fw

On Tue, Jun 16, 2020 at 05:57:46PM +0100, Dr. David Alan Gilbert wrote:
> * Gerd Hoffmann (kraxel@redhat.com) wrote:
> >   Hi,
> > 
> > > (a) We could rely in the guest physbits to calculate the PCI64 aperture.
> > 
> > I'd love to do that.  Move the 64-bit I/O window as high as possible and
> > use -- say -- 25% of the physical address space for it.
> > 
> > Problem is we can't.
> > 
> > > failure. Also, if the users are not setting the physbits in the guest,
> > > there must be a default (seems to be 40bit according to my experiments),
> > > seems to be a good idea to rely on that.
> > 
> > Yes, 40 is the default, and it is used *even if the host supports less
> > than that*.  Typical values I've seen for intel hardware are 36 and 39.
> > 39 is used even by recent hardware (not the xeons, but check out a
> > laptop or a nuc).
> > 
> > > If guest physbits is 40, why to have OVMF limiting it to 36, right?
> > 
> > Things will explode in case OVMF uses more physbits than the host
> > supports (host physbits limit applies to ept too).  In other words: OVMF
> > can't trust the guest physbits, so it is conservative to be on the safe
> > side.
> > 
> > If we can somehow make a *trustable* physbits value available to the
> > guest, then yes, we can go that route.  But the guest physbits we have
> > today unfortunately don't cut it.
> 
> In downstream RH qemu, we run with host-physbits as default; so it's reasonably
> trustworthy; of course that doesn't help you across a migration between
> hosts with different sizes (e.g. an E5 Xeon to an E3).
> Changing upstream to do the same would seem sensible to me, but it's not
> a foolproof config.

Yeah, to make it really trustworthy we would need to prevent
migration to hosts with mismatching phys sizes.  We would need to
communicate that to the guest somehow (with new hypervisor CPUID
flags, maybe).

-- 
Eduardo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 16:57   ` Dr. David Alan Gilbert
  2020-06-16 17:10     ` Eduardo Habkost
@ 2020-06-16 17:10     ` Gerd Hoffmann
  2020-06-16 17:16       ` Dr. David Alan Gilbert
  2020-06-16 17:14     ` Guilherme Piccoli
  2 siblings, 1 reply; 40+ messages in thread
From: Gerd Hoffmann @ 2020-06-16 17:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: pedro.principeza, ehabkost, dann.frazier, Guilherme G. Piccoli,
	qemu-devel, christian.ehrhardt, lersek, fw

  Hi,

> > If we can somehow make a *trustable* physbits value available to the
> > guest, then yes, we can go that route.  But the guest physbits we have
> > today unfortunately don't cut it.
> 
> In downstream RH qemu, we run with host-physbits as default; so it's reasonably
> trustworthy;

Can the guest somehow figure out whether it is trustworthy or not?

> of course that doesn't help you across a migration between
> hosts with different sizes (e.g. an E5 Xeon to an E3).

Making physbits configurable for migration compatibility is fine if qemu
outlaws the problematic guest physbits > host physbits case and throws
an error in that case.

take care,
  Gerd



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 16:57   ` Dr. David Alan Gilbert
  2020-06-16 17:10     ` Eduardo Habkost
  2020-06-16 17:10     ` Gerd Hoffmann
@ 2020-06-16 17:14     ` Guilherme Piccoli
  2020-06-17  6:40       ` Gerd Hoffmann
  2020-06-17 13:22       ` Laszlo Ersek
  2 siblings, 2 replies; 40+ messages in thread
From: Guilherme Piccoli @ 2020-06-16 17:14 UTC (permalink / raw)
  To: Gerd Hoffmann, Dr. David Alan Gilbert
  Cc: Pedro Principeza, ehabkost, Dann Frazier, qemu-devel,
	Christian Ehrhardt, lersek, fw

Thanks Gerd, Dave and Eduardo for the prompt responses!

So, I understand that when we use "-host-physical-bits", we are
passing the *real* number to the guest, correct? So, in this case we
can trust that the guest physbits matches the true host physbits.

What if we then have OVMF relying on the physbits *iff*
"-host-phys-bits" is used (which is the default in RH and a possible
machine configuration in the libvirt XML on Ubuntu), and have OVMF
fall back to 36-bit otherwise?

Now, regarding the problem of whether or not to trust the guest's
physbits, I think it's an orthogonal discussion to some extent. It'd be
nice to have that check and, as Eduardo said, prevent migration in such
cases. But it doesn't really prevent a big PCI64 aperture in OVMF if we
only increase the aperture _when "-host-physical-bits" is used_.

Thanks,


Guilherme


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 17:10     ` Gerd Hoffmann
@ 2020-06-16 17:16       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-16 17:16 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: pedro.principeza, ehabkost, dann.frazier, Guilherme G. Piccoli,
	qemu-devel, christian.ehrhardt, lersek, fw

* Gerd Hoffmann (kraxel@redhat.com) wrote:
>   Hi,
> 
> > > If we can somehow make a *trustable* physbits value available to the
> > > guest, then yes, we can go that route.  But the guest physbits we have
> > > today unfortunately don't cut it.
> > 
> > In downstream RH qemu, we run with host-physbits as default; so it's reasonably
> > trustworthy;
> 
> Can the guest figure somehow whenever it is trustworthy or not?

At any one point in time there may be things it can try to see how the
CPU responds, but I'm not 100% sure.
I know there are some bodges in place to make some MSR values 1-padded by
the right amount when crossing sizes, which generally work.
(Were those PAM registers or something? Vague memories of an old bug.)

> > of course that doesn't help you across a migration between
> > hosts with different sizes (e.g. an E5 Xeon to an E3).
> 
> Making physbits configurable for migration compatibility is fine if qemu
> outlaws the problematic guest physbits > host physbits case and throws
> an error in that case.

I'm not sure that guest < host is entirely safe either though; although
it seems to work.

Dave

> take care,
>   Gerd
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 17:14     ` Guilherme Piccoli
@ 2020-06-17  6:40       ` Gerd Hoffmann
  2020-06-17 13:25         ` Laszlo Ersek
  2020-06-17 13:26         ` Laszlo Ersek
  2020-06-17 13:22       ` Laszlo Ersek
  1 sibling, 2 replies; 40+ messages in thread
From: Gerd Hoffmann @ 2020-06-17  6:40 UTC (permalink / raw)
  To: Guilherme Piccoli
  Cc: Pedro Principeza, ehabkost, Dann Frazier, Dr. David Alan Gilbert,
	Christian Ehrhardt, qemu-devel, lersek, fw

  Hi,

> What if then we have OVMF relying in the physbits *iff*
> "-host-phys-bits" is used

How can the guest know?

Adding "you can trust physbits" bits somewhere (as suggested by Eduardo)
would work for sure, but would depend on a qemu update.

Maybe a "don't trust physbits in case it is 40" heuristic works well
enough in practice.

> Now, regarding the problem "to trust or not" in the guests' physbits,
> I think it's an orthogonal discussion to some extent.

It isn't.  OVMF can't ignore the problem; you risk breaking guests if
you do.

take care,
  Gerd



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 16:50 ` Gerd Hoffmann
  2020-06-16 16:57   ` Dr. David Alan Gilbert
@ 2020-06-17  8:16   ` Christophe de Dinechin
  2020-06-17 10:12     ` Gerd Hoffmann
  1 sibling, 1 reply; 40+ messages in thread
From: Christophe de Dinechin @ 2020-06-17  8:16 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: pedro.principeza, ehabkost, dann.frazier, Guilherme G. Piccoli,
	qemu-devel, christian.ehrhardt, dgilbert, lersek, fw



> Le 16 Jun 2020 à 18:50, Gerd Hoffmann <kraxel@redhat.com> a écrit :
> 
>  Hi,
> 
>> (a) We could rely in the guest physbits to calculate the PCI64 aperture.
> 
> I'd love to do that.  Move the 64-bit I/O window as high as possible and
> use -- say -- 25% of the physical address space for it.
> 
> Problem is we can't.

Is the only reason unreliable guest physbits?

> 
>> failure. Also, if the users are not setting the physbits in the guest,
>> there must be a default (seems to be 40bit according to my experiments),
>> seems to be a good idea to rely on that.
> 
> Yes, 40 is the default, and it is used *even if the host supports less
> than that*.  Typical values I've seen for intel hardware are 36 and 39.
> 39 is used even by recent hardware (not the xeons, but check out a
> laptop or a nuc).
> 
>> If guest physbits is 40, why to have OVMF limiting it to 36, right?
> 
> Things will explode in case OVMF uses more physbits than the host
> supports (host physbits limit applies to ept too).  In other words: OVMF
> can't trust the guest physbits, so it is conservative to be on the safe
> side.
> 
> If we can somehow make a *trustable* physbits value available to the
> guest, then yes, we can go that route.  But the guest physbits we have
> today unfortunately don't cut it.

What is the rationale for ever allowing guest physbits > host physbits?

> 
> take care,
>  Gerd
> 
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 17:10     ` Eduardo Habkost
@ 2020-06-17  8:17       ` Christophe de Dinechin
  2020-06-17 16:25         ` Eduardo Habkost
  2020-06-17  8:50       ` Daniel P. Berrangé
  1 sibling, 1 reply; 40+ messages in thread
From: Christophe de Dinechin @ 2020-06-17  8:17 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: pedro.principeza, dann.frazier, Guilherme G. Piccoli,
	Dr. David Alan Gilbert, christian.ehrhardt, qemu-devel,
	Gerd Hoffmann, lersek, fw




> Le 16 Jun 2020 à 19:10, Eduardo Habkost <ehabkost@redhat.com> a écrit :
> 
> On Tue, Jun 16, 2020 at 05:57:46PM +0100, Dr. David Alan Gilbert wrote:
>> * Gerd Hoffmann (kraxel@redhat.com) wrote:
>>>  Hi,
>>> 
>>>> (a) We could rely in the guest physbits to calculate the PCI64 aperture.
>>> 
>>> I'd love to do that.  Move the 64-bit I/O window as high as possible and
>>> use -- say -- 25% of the physical address space for it.
>>> 
>>> Problem is we can't.
>>> 
>>>> failure. Also, if the users are not setting the physbits in the guest,
>>>> there must be a default (seems to be 40bit according to my experiments),
>>>> seems to be a good idea to rely on that.
>>> 
>>> Yes, 40 is the default, and it is used *even if the host supports less
>>> than that*.  Typical values I've seen for intel hardware are 36 and 39.
>>> 39 is used even by recent hardware (not the xeons, but check out a
>>> laptop or a nuc).
>>> 
>>>> If guest physbits is 40, why to have OVMF limiting it to 36, right?
>>> 
>>> Things will explode in case OVMF uses more physbits than the host
>>> supports (host physbits limit applies to ept too).  In other words: OVMF
>>> can't trust the guest physbits, so it is conservative to be on the safe
>>> side.
>>> 
>>> If we can somehow make a *trustable* physbits value available to the
>>> guest, then yes, we can go that route.  But the guest physbits we have
>>> today unfortunately don't cut it.
>> 
>> In downstream RH qemu, we run with host-physbits as default; so it's reasonably
>> trustworthy; of course that doesn't help you across a migration between
>> hosts with different sizes (e.g. an E5 Xeon to an E3).
>> Changing upstream to do the same would seem sensible to me, but it's not
>> a foolproof config.
> 
> Yeah, to make it really trustworthy we would need to prevent
> migration to hosts with mismatching phys sizes.

Wouldn't it be sufficient to prevent guestphysbits > hostphysbits?

>  We would need to
> communicate that to the guest somehow (with new hypervisor CPUID
> flags, maybe).
> 
> -- 
> Eduardo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 17:10     ` Eduardo Habkost
  2020-06-17  8:17       ` Christophe de Dinechin
@ 2020-06-17  8:50       ` Daniel P. Berrangé
  2020-06-17 10:28         ` Dr. David Alan Gilbert
  2020-06-17 14:11         ` Eduardo Habkost
  1 sibling, 2 replies; 40+ messages in thread
From: Daniel P. Berrangé @ 2020-06-17  8:50 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: pedro.principeza, dann.frazier, Guilherme G. Piccoli,
	Dr. David Alan Gilbert, christian.ehrhardt, qemu-devel,
	Gerd Hoffmann, lersek, fw

On Tue, Jun 16, 2020 at 01:10:21PM -0400, Eduardo Habkost wrote:
> On Tue, Jun 16, 2020 at 05:57:46PM +0100, Dr. David Alan Gilbert wrote:
> > * Gerd Hoffmann (kraxel@redhat.com) wrote:
> > >   Hi,
> > > 
> > > > (a) We could rely in the guest physbits to calculate the PCI64 aperture.
> > > 
> > > I'd love to do that.  Move the 64-bit I/O window as high as possible and
> > > use -- say -- 25% of the physical address space for it.
> > > 
> > > Problem is we can't.
> > > 
> > > > failure. Also, if the users are not setting the physbits in the guest,
> > > > there must be a default (seems to be 40bit according to my experiments),
> > > > seems to be a good idea to rely on that.
> > > 
> > > Yes, 40 is the default, and it is used *even if the host supports less
> > > than that*.  Typical values I've seen for intel hardware are 36 and 39.
> > > 39 is used even by recent hardware (not the xeons, but check out a
> > > laptop or a nuc).
> > > 
> > > > If guest physbits is 40, why to have OVMF limiting it to 36, right?
> > > 
> > > Things will explode in case OVMF uses more physbits than the host
> > > supports (host physbits limit applies to ept too).  In other words: OVMF
> > > can't trust the guest physbits, so it is conservative to be on the safe
> > > side.
> > > 
> > > If we can somehow make a *trustable* physbits value available to the
> > > guest, then yes, we can go that route.  But the guest physbits we have
> > > today unfortunately don't cut it.
> > 
> > In downstream RH qemu, we run with host-physbits as default; so it's reasonably
> > trustworthy; of course that doesn't help you across a migration between
> > hosts with different sizes (e.g. an E5 Xeon to an E3).
> > Changing upstream to do the same would seem sensible to me, but it's not
> > a foolproof config.
> 
> Yeah, to make it really trustworthy we would need to prevent
> migration to hosts with mismatching phys sizes.  We would need to
> communicate that to the guest somehow (with new hypervisor CPUID
> flags, maybe).

QEMU should be able to validate the hostphysbits >= guestphysbits when
accepting incoming migration, and abort it.
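
Something along these lines on the destination side, I guess (a
hypothetical standalone sketch, not actual QEMU code; the real check
would live wherever the incoming CPU configuration gets validated):

    /* Hypothetical destination-side check: refuse an incoming guest
     * whose configured physical address width exceeds what this host
     * can actually back. */
    #include <cpuid.h>
    #include <stdbool.h>
    #include <stdio.h>

    static unsigned int host_phys_bits(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            return 36;   /* conservative fallback for very old CPUs */
        }
        return eax & 0xff;   /* MAXPHYADDR */
    }

    static bool phys_bits_ok(unsigned int guest_phys_bits)
    {
        unsigned int host = host_phys_bits();

        if (guest_phys_bits > host) {
            fprintf(stderr, "guest phys-bits %u > host phys-bits %u, "
                    "rejecting incoming migration\n",
                    guest_phys_bits, host);
            return false;
        }
        return true;
    }

    int main(void)
    {
        /* e.g. the 40-bit default discussed earlier in the thread */
        return phys_bits_ok(40) ? 0 : 1;
    }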

Meanwhile libvirt should be enhanced to report hostphysbits, so that
management apps can determine that they shouldn't even pick bad hosts
in the first place.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17  8:16   ` Christophe de Dinechin
@ 2020-06-17 10:12     ` Gerd Hoffmann
  0 siblings, 0 replies; 40+ messages in thread
From: Gerd Hoffmann @ 2020-06-17 10:12 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: pedro.principeza, ehabkost, dann.frazier, Guilherme G. Piccoli,
	qemu-devel, christian.ehrhardt, dgilbert, lersek, fw

On Wed, Jun 17, 2020 at 10:16:55AM +0200, Christophe de Dinechin wrote:
> 
> 
> > Le 16 Jun 2020 à 18:50, Gerd Hoffmann <kraxel@redhat.com> a écrit :
> > 
> >  Hi,
> > 
> >> (a) We could rely in the guest physbits to calculate the PCI64 aperture.
> > 
> > I'd love to do that.  Move the 64-bit I/O window as high as possible and
> > use -- say -- 25% of the physical address space for it.
> > 
> > Problem is we can't.
> 
> Is the only reason unreliable guest physbits?

Yes.

> > If we can somehow make a *trustable* physbits value available to the
> > guest, then yes, we can go that route.  But the guest physbits we have
> > today unfortunately don't cut it.
> 
> What is the rationale for ever allowing guest physbits > host physbits?

I can't think of a good reason.  So probably simply historical reasons
and the fact that this isn't a problem with tcg.

take care,
  Gerd Hoffmann



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17  8:50       ` Daniel P. Berrangé
@ 2020-06-17 10:28         ` Dr. David Alan Gilbert
  2020-06-17 14:11         ` Eduardo Habkost
  1 sibling, 0 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-17 10:28 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: pedro.principeza, Eduardo Habkost, dann.frazier,
	Guilherme G. Piccoli, qemu-devel, christian.ehrhardt,
	Gerd Hoffmann, lersek, fw

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Jun 16, 2020 at 01:10:21PM -0400, Eduardo Habkost wrote:
> > On Tue, Jun 16, 2020 at 05:57:46PM +0100, Dr. David Alan Gilbert wrote:
> > > * Gerd Hoffmann (kraxel@redhat.com) wrote:
> > > >   Hi,
> > > > 
> > > > > (a) We could rely in the guest physbits to calculate the PCI64 aperture.
> > > > 
> > > > I'd love to do that.  Move the 64-bit I/O window as high as possible and
> > > > use -- say -- 25% of the physical address space for it.
> > > > 
> > > > Problem is we can't.
> > > > 
> > > > > failure. Also, if the users are not setting the physbits in the guest,
> > > > > there must be a default (seems to be 40bit according to my experiments),
> > > > > seems to be a good idea to rely on that.
> > > > 
> > > > Yes, 40 is the default, and it is used *even if the host supports less
> > > > than that*.  Typical values I've seen for intel hardware are 36 and 39.
> > > > 39 is used even by recent hardware (not the xeons, but check out a
> > > > laptop or a nuc).
> > > > 
> > > > > If guest physbits is 40, why to have OVMF limiting it to 36, right?
> > > > 
> > > > Things will explode in case OVMF uses more physbits than the host
> > > > supports (host physbits limit applies to ept too).  In other words: OVMF
> > > > can't trust the guest physbits, so it is conservative to be on the safe
> > > > side.
> > > > 
> > > > If we can somehow make a *trustable* physbits value available to the
> > > > guest, then yes, we can go that route.  But the guest physbits we have
> > > > today unfortunately don't cut it.
> > > 
> > > In downstream RH qemu, we run with host-physbits as default; so it's reasonably
> > > trustworthy; of course that doesn't help you across a migration between
> > > hosts with different sizes (e.g. an E5 Xeon to an E3).
> > > Changing upstream to do the same would seem sensible to me, but it's not
> > > a foolproof config.
> > 
> > Yeah, to make it really trustworthy we would need to prevent
> > migration to hosts with mismatching phys sizes.  We would need to
> > communicate that to the guest somehow (with new hypervisor CPUID
> > flags, maybe).
> 
> QEMU should be able to validate the hostphysbits >= guestphysbits when
> accepting incoming migration, and abort it.

Yeh, there's an outstanding request to validate other CPU flags as well.

> Meanwhile libvirt should be enhanced to report hostphysbits, so that
> management apps can determine that they shouldn't even pick bad hosts
> in the first place.

Sounds reasonable.
Note there are a couple of other considerations when choosing the
physbits as reported to the guest:

  a) TCG's view - I think it had a fixed size of 40 bits, but I haven't
     dug into it.
  b) We recently gained 'host-phys-bits-limit' which, when used with
     host-phys-bits, lets you take the host value but then limit it.
     Eduardo seems to have done that to keep the guest from flipping into
     5-level page tables.  Hmm, I've not tried with chips that do 5-level,
     but maybe we also need this if you expect to migrate to hosts that
     don't have it.  (An example command line follows below.)

(I've also got a vague memory that there's a limit in some IOMMUs
address sizes, but I can't remember what the details were).
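
For the archives, the combination from (b) looks like this on the
command line (the limit value is just an arbitrary example):

    # take the host's real width, but never advertise more than 46 bits
    -cpu host,host-phys-bits=on,host-phys-bits-limit=46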

Dave


> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-16 17:14     ` Guilherme Piccoli
  2020-06-17  6:40       ` Gerd Hoffmann
@ 2020-06-17 13:22       ` Laszlo Ersek
  2020-06-17 13:43         ` Guilherme Piccoli
  2020-06-17 13:46         ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 40+ messages in thread
From: Laszlo Ersek @ 2020-06-17 13:22 UTC (permalink / raw)
  To: Guilherme Piccoli, Gerd Hoffmann, Dr. David Alan Gilbert
  Cc: Pedro Principeza, ehabkost, Dann Frazier, qemu-devel,
	Christian Ehrhardt, fw

On 06/16/20 19:14, Guilherme Piccoli wrote:
> Thanks Gerd, Dave and Eduardo for the prompt responses!
> 
> So, I understand that when we use "-host-physical-bits", we are
> passing the *real* number for the guest, correct? So, in this case we
> can trust that the guest physbits matches the true host physbits.
> 
> What if then we have OVMF relying in the physbits *iff*
> "-host-phys-bits" is used (which is the default in RH and a possible
> machine configuration on libvirt XML in Ubuntu), and we have OVMF
> fallbacks to 36-bit otherwise?

I've now read the commit message on QEMU commit 258fe08bd341d, and the
complexity is simply stunning.

Right now, OVMF calculates the guest physical address space size from
various range sizes (such as hotplug memory area end, default or
user-configured PCI64 MMIO aperture), and derives the minimum suitable
guest-phys address width from that address space size. This width is
then exposed to the rest of the firmware with the CPU HOB (hand-off
block), which in turn controls how the GCD (global coherency domain)
memory space map is sized. Etc.

If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
or even fw_cfg), then the above calculation could be reversed in OVMF.
We could take the width as a given (-> produce the CPU HOB directly),
plus calculate the *remaining* address space between the GPA space size
given by the width, and the end of the memory hotplug area. If the
"remaining size" were negative, then obviously QEMU would have been
misconfigured, so we'd halt the boot. Otherwise, the remaining area
could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
tables be darned).
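
In rough pseudo-C, the reversed calculation would be something like the
following (a sketch only, with made-up numbers; the real code would also
have to respect alignment, the 32-bit hole below 4G, and so on):

    /* Sketch of the "reversed" calculation described above; names and
     * values are illustrative, this is not actual OVMF code. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t phys_bits        = 40;              /* trusted GPA width from QEMU */
        uint64_t hotplug_area_end = 0x8000000000ULL; /* example: ends at 512 GiB    */

        uint64_t gpa_space_size = 1ULL << phys_bits; /* 1 TiB for 40 bits */

        if (hotplug_area_end > gpa_space_size) {
            /* QEMU misconfigured: the firmware would halt the boot here */
            fprintf(stderr, "hotplug area ends beyond the GPA space\n");
            return 1;
        }

        /* Everything between the hotplug area and the top of the GPA
         * space becomes the 64-bit PCI MMIO aperture. */
        uint64_t pci64_base = hotplug_area_end;
        uint64_t pci64_size = gpa_space_size - hotplug_area_end;

        printf("PCI64 aperture: base=0x%llx, size=%llu GiB\n",
               (unsigned long long)pci64_base,
               (unsigned long long)(pci64_size >> 30));
        return 0;
    }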

> Now, regarding the problem "to trust or not" in the guests' physbits,
> I think it's an orthogonal discussion to some extent. It'd be nice to
> have that check, and as Eduardo said, prevent migration in such cases.
> But it's not really preventing OVMF big PCI64 aperture if we only
> increase the aperture _when  "-host-physical-bits" is used_.

I don't know what exactly those flags do, but I doubt they are clearly
visible to OVMF in any particular way.

Thanks
Laszlo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17  6:40       ` Gerd Hoffmann
@ 2020-06-17 13:25         ` Laszlo Ersek
  2020-06-17 13:26         ` Laszlo Ersek
  1 sibling, 0 replies; 40+ messages in thread
From: Laszlo Ersek @ 2020-06-17 13:25 UTC (permalink / raw)
  To: Gerd Hoffmann, Guilherme Piccoli
  Cc: Pedro Principeza, ehabkost, Dann Frazier, Dr. David Alan Gilbert,
	Christian Ehrhardt, qemu-devel, fw

On 06/17/20 08:40, Gerd Hoffmann wrote:
>   Hi,
> 
>> What if then we have OVMF relying in the physbits *iff*
>> "-host-phys-bits" is used
> 
> How can the guest know?

Exactly!

> Adding "you can trust physbits" bits somewhere (as suggested by Eduardo)
> would work for sure, but would depend on a qemu update.
> 
> Maybe a "don't trust physbits in case it is 40" heuristic works well
> enough in practice.

... for how many years? ;)

No heuristics please.

(I absolutely prefer X-PciMmio64Mb to guesswork. With X-PciMmio64Mb,
users at least know they have to be careful. I know they don't like
that, they just want a promise "it will work forever", but it's not
randomly called "experimental", we're not there just yet.)

Thank you,
Laszlo

>> Now, regarding the problem "to trust or not" in the guests' physbits,
>> I think it's an orthogonal discussion to some extent.
> 
> It isn't.  OVMF can't ignore the problem, you risk to break guests if
> you do.
> 
> take care,
>   Gerd
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17  6:40       ` Gerd Hoffmann
  2020-06-17 13:25         ` Laszlo Ersek
@ 2020-06-17 13:26         ` Laszlo Ersek
  1 sibling, 0 replies; 40+ messages in thread
From: Laszlo Ersek @ 2020-06-17 13:26 UTC (permalink / raw)
  To: Gerd Hoffmann, Guilherme Piccoli
  Cc: Pedro Principeza, ehabkost, Dann Frazier, Dr. David Alan Gilbert,
	Christian Ehrhardt, qemu-devel, fw

On 06/17/20 08:40, Gerd Hoffmann wrote:
>   Hi,
> 
>> What if then we have OVMF relying in the physbits *iff*
>> "-host-phys-bits" is used
> 
> How can the guest know?

Exactly!

> Adding "you can trust physbits" bits somewhere (as suggested by Eduardo)
> would work for sure, but would depend on a qemu update.
> 
> Maybe a "don't trust physbits in case it is 40" heuristic works well
> enough in practice.

... for how many years? ;)

No heuristics please.

(I absolutely prefer X-PciMmio64Mb to guesswork. With X-PciMmio64Mb,
users at least know they have to be careful. I know they don't like
that, they just want a promise "it will work forever", but it's not
randomly called "experimental", we're not there just yet.)

Thank you,
Laszlo

>> Now, regarding the problem "to trust or not" in the guests' physbits,
>> I think it's an orthogonal discussion to some extent.
> 
> It isn't.  OVMF can't ignore the problem, you risk to break guests if
> you do.
> 
> take care,
>   Gerd
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 13:22       ` Laszlo Ersek
@ 2020-06-17 13:43         ` Guilherme Piccoli
  2020-06-17 15:57           ` Laszlo Ersek
  2020-06-17 13:46         ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 40+ messages in thread
From: Guilherme Piccoli @ 2020-06-17 13:43 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Pedro Principeza, ehabkost, Dann Frazier, qemu-devel,
	Christian Ehrhardt, Dr. David Alan Gilbert, Gerd Hoffmann, fw

Thanks a lot for all the responses here! Very constructive discussion =)
Comments inline!

On Wed, Jun 17, 2020 at 10:22 AM Laszlo Ersek <lersek@redhat.com> wrote:
> [...]
> If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> or even fw_cfg), then the above calculation could be reversed in OVMF.
> We could take the width as a given (-> produce the CPU HOB directly),
> plus calculate the *remaining* address space between the GPA space size
> given by the width, and the end of the memory hotplug area end. If the
> "remaining size" were negative, then obviously QEMU would have been
> misconfigured, so we'd halt the boot. Otherwise, the remaining area
> could be used as PCI64 MMIO aperture...

That was *exactly* the approach I was considering, but without the right
terminology due to my lack of experience in this topic heheh
Thanks for the great summary of the idea! I was considering fw_cfg, but
it could be CPUID too; let me know what the "trendy" way to do that is.
So, the only problem with the refactor you're proposing is backward
compatibility across qemu versions, as I can anticipate cases in which a
newer OVMF runs on an older qemu that does not provide such trustworthy
physbits info. So the code may be a bit complex; it'll need to take this
case into account (probably we could just fall back to the physbits
"detected" by OVMF in that case, limiting the PCI64 aperture to the
current 36 bits, right?).


> (PEI memory footprint of DXE page tables be darned).
LOL

Cheers,
Guilherme


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 13:22       ` Laszlo Ersek
  2020-06-17 13:43         ` Guilherme Piccoli
@ 2020-06-17 13:46         ` Dr. David Alan Gilbert
  2020-06-17 15:49           ` Eduardo Habkost
  2020-06-17 16:14           ` Laszlo Ersek
  1 sibling, 2 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-17 13:46 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Pedro Principeza, ehabkost, Dann Frazier, Guilherme Piccoli,
	qemu-devel, Christian Ehrhardt, Gerd Hoffmann, fw

* Laszlo Ersek (lersek@redhat.com) wrote:
> On 06/16/20 19:14, Guilherme Piccoli wrote:
> > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > 
> > So, I understand that when we use "-host-physical-bits", we are
> > passing the *real* number for the guest, correct? So, in this case we
> > can trust that the guest physbits matches the true host physbits.
> > 
> > What if then we have OVMF relying in the physbits *iff*
> > "-host-phys-bits" is used (which is the default in RH and a possible
> > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > fallbacks to 36-bit otherwise?
> 
> I've now read the commit message on QEMU commit 258fe08bd341d, and the
> complexity is simply stunning.
> 
> Right now, OVMF calculates the guest physical address space size from
> various range sizes (such as hotplug memory area end, default or
> user-configured PCI64 MMIO aperture), and derives the minimum suitable
> guest-phys address width from that address space size. This width is
> then exposed to the rest of the firmware with the CPU HOB (hand-off
> block), which in turn controls how the GCD (global coherency domain)
> memory space map is sized. Etc.
> 
> If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> or even fw_cfg), then the above calculation could be reversed in OVMF.
> We could take the width as a given (-> produce the CPU HOB directly),
> plus calculate the *remaining* address space between the GPA space size
> given by the width, and the end of the memory hotplug area end. If the
> "remaining size" were negative, then obviously QEMU would have been
> misconfigured, so we'd halt the boot. Otherwise, the remaining area
> could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> tables be darned).
> 
> > Now, regarding the problem "to trust or not" in the guests' physbits,
> > I think it's an orthogonal discussion to some extent. It'd be nice to
> > have that check, and as Eduardo said, prevent migration in such cases.
> > But it's not really preventing OVMF big PCI64 aperture if we only
> > increase the aperture _when  "-host-physical-bits" is used_.
> 
> I don't know what exactly those flags do, but I doubt they are clearly
> visible to OVMF in any particular way.

The firmware should trust whatever it reads from the cpuid and thus gets
told from qemu; if qemu is doing the wrong thing there then that's our
problem and we need to fix it in qemu.

Dave

> Thanks
> Laszlo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17  8:50       ` Daniel P. Berrangé
  2020-06-17 10:28         ` Dr. David Alan Gilbert
@ 2020-06-17 14:11         ` Eduardo Habkost
  1 sibling, 0 replies; 40+ messages in thread
From: Eduardo Habkost @ 2020-06-17 14:11 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: pedro.principeza, dann.frazier, Guilherme G. Piccoli,
	Dr. David Alan Gilbert, christian.ehrhardt, qemu-devel,
	Gerd Hoffmann, lersek, fw

On Wed, Jun 17, 2020 at 09:50:33AM +0100, Daniel P. Berrangé wrote:
> On Tue, Jun 16, 2020 at 01:10:21PM -0400, Eduardo Habkost wrote:
> > On Tue, Jun 16, 2020 at 05:57:46PM +0100, Dr. David Alan Gilbert wrote:
> > > * Gerd Hoffmann (kraxel@redhat.com) wrote:
> > > >   Hi,
> > > > 
> > > > > (a) We could rely in the guest physbits to calculate the PCI64 aperture.
> > > > 
> > > > I'd love to do that.  Move the 64-bit I/O window as high as possible and
> > > > use -- say -- 25% of the physical address space for it.
> > > > 
> > > > Problem is we can't.
> > > > 
> > > > > failure. Also, if the users are not setting the physbits in the guest,
> > > > > there must be a default (seems to be 40bit according to my experiments),
> > > > > seems to be a good idea to rely on that.
> > > > 
> > > > Yes, 40 is the default, and it is used *even if the host supports less
> > > > than that*.  Typical values I've seen for intel hardware are 36 and 39.
> > > > 39 is used even by recent hardware (not the xeons, but check out a
> > > > laptop or a nuc).
> > > > 
> > > > > If guest physbits is 40, why to have OVMF limiting it to 36, right?
> > > > 
> > > > Things will explode in case OVMF uses more physbits than the host
> > > > supports (host physbits limit applies to ept too).  In other words: OVMF
> > > > can't trust the guest physbits, so it is conservative to be on the safe
> > > > side.
> > > > 
> > > > If we can somehow make a *trustable* physbits value available to the
> > > > guest, then yes, we can go that route.  But the guest physbits we have
> > > > today unfortunately don't cut it.
> > > 
> > > In downstream RH qemu, we run with host-physbits as default; so it's reasonably
> > > trustworthy; of course that doesn't help you across a migration between
> > > hosts with different sizes (e.g. an E5 Xeon to an E3).
> > > Changing upstream to do the same would seem sensible to me, but it's not
> > > a foolproof config.
> > 
> > Yeah, to make it really trustworthy we would need to prevent
> > migration to hosts with mismatching phys sizes.  We would need to
> > communicate that to the guest somehow (with new hypervisor CPUID
> > flags, maybe).
> 
> QEMU should be able to validate the hostphysbits >= guestphysbits when
> accepting incoming migration, and abort it.
> 
> Meanwhile libvirt should be enhanced to report hostphysbits, so that
> management apps can determine that they shouldn't even pick bad hosts
> in the first place.
> 

Whatever policy we choose to implement on the host side, it would
be nice to inform the guest that we are making additional
guarantees.  Especially considering that
guestphysbits > hostphysbits is currently allowed and works (so
changing the requirements unconditionally would be a regression).

-- 
Eduardo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 13:46         ` Dr. David Alan Gilbert
@ 2020-06-17 15:49           ` Eduardo Habkost
  2020-06-17 15:57             ` Guilherme Piccoli
  2020-06-17 16:04             ` Dr. David Alan Gilbert
  2020-06-17 16:14           ` Laszlo Ersek
  1 sibling, 2 replies; 40+ messages in thread
From: Eduardo Habkost @ 2020-06-17 15:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pedro Principeza, Dann Frazier, Guilherme Piccoli, qemu-devel,
	Christian Ehrhardt, Gerd Hoffmann, Laszlo Ersek, fw

On Wed, Jun 17, 2020 at 02:46:52PM +0100, Dr. David Alan Gilbert wrote:
> * Laszlo Ersek (lersek@redhat.com) wrote:
> > On 06/16/20 19:14, Guilherme Piccoli wrote:
> > > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > > 
> > > So, I understand that when we use "-host-physical-bits", we are
> > > passing the *real* number for the guest, correct? So, in this case we
> > > can trust that the guest physbits matches the true host physbits.
> > > 
> > > What if then we have OVMF relying in the physbits *iff*
> > > "-host-phys-bits" is used (which is the default in RH and a possible
> > > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > > fallbacks to 36-bit otherwise?
> > 
> > I've now read the commit message on QEMU commit 258fe08bd341d, and the
> > complexity is simply stunning.
> > 
> > Right now, OVMF calculates the guest physical address space size from
> > various range sizes (such as hotplug memory area end, default or
> > user-configured PCI64 MMIO aperture), and derives the minimum suitable
> > guest-phys address width from that address space size. This width is
> > then exposed to the rest of the firmware with the CPU HOB (hand-off
> > block), which in turn controls how the GCD (global coherency domain)
> > memory space map is sized. Etc.
> > 
> > If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> > or even fw_cfg), then the above calculation could be reversed in OVMF.
> > We could take the width as a given (-> produce the CPU HOB directly),
> > plus calculate the *remaining* address space between the GPA space size
> > given by the width, and the end of the memory hotplug area end. If the
> > "remaining size" were negative, then obviously QEMU would have been
> > misconfigured, so we'd halt the boot. Otherwise, the remaining area
> > could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> > tables be darned).
> > 
> > > Now, regarding the problem "to trust or not" in the guests' physbits,
> > > I think it's an orthogonal discussion to some extent. It'd be nice to
> > > have that check, and as Eduardo said, prevent migration in such cases.
> > > But it's not really preventing OVMF big PCI64 aperture if we only
> > > increase the aperture _when  "-host-physical-bits" is used_.
> > 
> > I don't know what exactly those flags do, but I doubt they are clearly
> > visible to OVMF in any particular way.
> 
> The firmware should trust whatever it reads from the cpuid and thus gets
> told from qemu; if qemu is doing the wrong thing there then that's our
> problem and we need to fix it in qemu.

It is impossible to provide a MAXPHYADDR that the guest can trust
unconditionally and allow live migration to hosts with different
sizes at the same time.

Unless we want to drop support for live migration to hosts with
different sizes entirely, we need additional bits to tell the
guest how much it can trust MAXPHYADDR.

-- 
Eduardo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 13:43         ` Guilherme Piccoli
@ 2020-06-17 15:57           ` Laszlo Ersek
  2020-06-17 16:01             ` Guilherme Piccoli
  0 siblings, 1 reply; 40+ messages in thread
From: Laszlo Ersek @ 2020-06-17 15:57 UTC (permalink / raw)
  To: Guilherme Piccoli
  Cc: Pedro Principeza, ehabkost, Dann Frazier, qemu-devel,
	Christian Ehrhardt, Dr. David Alan Gilbert, Gerd Hoffmann, fw

On 06/17/20 15:43, Guilherme Piccoli wrote:

> So, the only problem with that refactor you're proposing is the
> retrocompatibility with qemu versions, as I can anticipate cases in
> which newer OVMF runs with older qemu, which does not provide such
> trustworth physbits info.

I don't necessarily see compat issues -- large-BAR PCI device assignment
might simply stop working under those circumstances, because you could
no longer use X-PciMmio64Mb, and the new way wouldn't be supported by
(old) QEMU.

This would indeed be a regression relative to "X-PciMmio64Mb", but
that's exactly why there's an "X" in "X-PciMmio64Mb".

> So, the code may be a bit complex, it'll
> need to take into account this case (probably we could just rely on
> the physbits "detected" by OVMF in such case, limiting PCI64 aperture
> to the current 36-bits, right?).

A "bit complex" is an understatement. The code is already much more
complex than it should be. (And I do think that's an inherent, not
accidental, complexity; it reflects the situation.) Once we have a new
design, we should do the bare minimum to keep such previous usage
functional that's *not* reliant on X-PciMmio64Mb.

Laszlo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 15:49           ` Eduardo Habkost
@ 2020-06-17 15:57             ` Guilherme Piccoli
  2020-06-17 16:33               ` Eduardo Habkost
  2020-06-17 16:04             ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 40+ messages in thread
From: Guilherme Piccoli @ 2020-06-17 15:57 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Pedro Principeza, Dann Frazier, qemu-devel, Christian Ehrhardt,
	Dr. David Alan Gilbert, Gerd Hoffmann, Laszlo Ersek, fw

Can't qemu read the host physical bits and pass them via fw_cfg as
"real_host_physbits" or something like that?
OVMF could rely on that - if such a property is available, we use it to
extend the PCI64 aperture.
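
On the OVMF side, I imagine something roughly like this (a hypothetical
sketch only; the fw_cfg file name is made up and error handling is
minimal):

    //
    // Hypothetical: fetch a QEMU-provided, trustworthy physical address
    // width from fw_cfg, falling back to today's conservative value
    // when the file is absent (i.e. an older QEMU).
    //
    #include <Base.h>
    #include <Library/QemuFwCfgLib.h>

    UINT8
    GetTrustedPhysBits (
      VOID
      )
    {
      FIRMWARE_CONFIG_ITEM  Item;
      UINTN                 Size;

      if (RETURN_ERROR (QemuFwCfgFindFile ("opt/real-host-phys-bits",
                                           &Item, &Size)) ||
          (Size != sizeof (UINT8))) {
        return 36;    // current conservative default
      }
      QemuFwCfgSelectItem (Item);
      return QemuFwCfgRead8 ();
    }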

On Wed, Jun 17, 2020 at 12:50 PM Eduardo Habkost <ehabkost@redhat.com> wrote:
>
> On Wed, Jun 17, 2020 at 02:46:52PM +0100, Dr. David Alan Gilbert wrote:
> > * Laszlo Ersek (lersek@redhat.com) wrote:
> > > On 06/16/20 19:14, Guilherme Piccoli wrote:
> > > > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > > >
> > > > So, I understand that when we use "-host-physical-bits", we are
> > > > passing the *real* number for the guest, correct? So, in this case we
> > > > can trust that the guest physbits matches the true host physbits.
> > > >
> > > > What if then we have OVMF relying in the physbits *iff*
> > > > "-host-phys-bits" is used (which is the default in RH and a possible
> > > > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > > > fallbacks to 36-bit otherwise?
> > >
> > > I've now read the commit message on QEMU commit 258fe08bd341d, and the
> > > complexity is simply stunning.
> > >
> > > Right now, OVMF calculates the guest physical address space size from
> > > various range sizes (such as hotplug memory area end, default or
> > > user-configured PCI64 MMIO aperture), and derives the minimum suitable
> > > guest-phys address width from that address space size. This width is
> > > then exposed to the rest of the firmware with the CPU HOB (hand-off
> > > block), which in turn controls how the GCD (global coherency domain)
> > > memory space map is sized. Etc.
> > >
> > > If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> > > or even fw_cfg), then the above calculation could be reversed in OVMF.
> > > We could take the width as a given (-> produce the CPU HOB directly),
> > > plus calculate the *remaining* address space between the GPA space size
> > > given by the width, and the end of the memory hotplug area end. If the
> > > "remaining size" were negative, then obviously QEMU would have been
> > > misconfigured, so we'd halt the boot. Otherwise, the remaining area
> > > could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> > > tables be darned).
> > >
> > > > Now, regarding the problem "to trust or not" in the guests' physbits,
> > > > I think it's an orthogonal discussion to some extent. It'd be nice to
> > > > have that check, and as Eduardo said, prevent migration in such cases.
> > > > But it's not really preventing OVMF big PCI64 aperture if we only
> > > > increase the aperture _when  "-host-physical-bits" is used_.
> > >
> > > I don't know what exactly those flags do, but I doubt they are clearly
> > > visible to OVMF in any particular way.
> >
> > The firmware should trust whatever it reads from the cpuid and thus gets
> > told from qemu; if qemu is doing the wrong thing there then that's our
> > problem and we need to fix it in qemu.
>
> It is impossible to provide a MAXPHYADDR that the guest can trust
> unconditionally and allow live migration to hosts with different
> sizes at the same time.
>
> Unless we want to drop support live migration to hosts with
> different sizes entirely, we need additional bits to tell the
> guest how much it can trust MAXPHYADDR.
>
> --
> Eduardo
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 15:57           ` Laszlo Ersek
@ 2020-06-17 16:01             ` Guilherme Piccoli
  2020-06-18  7:56               ` Laszlo Ersek
  0 siblings, 1 reply; 40+ messages in thread
From: Guilherme Piccoli @ 2020-06-17 16:01 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Pedro Principeza, Eduardo Habkost, Dann Frazier, qemu-devel,
	Christian Ehrhardt, Dr. David Alan Gilbert, Gerd Hoffmann, fw

On Wed, Jun 17, 2020 at 12:57 PM Laszlo Ersek <lersek@redhat.com> wrote:
> [...]
> I don't necessarily see compat issues -- large-BAR PCI device assignment
> might simply stop working under those circumstances, because you could
> no longer use X-PciMmio64Mb, and the new way wouldn't be supported by
> (old) QEMU.
>
> This would indeed be a regression relative to "X-PciMmio64Mb", but
> that's exactly why there's an "X" in "X-PciMmio64Mb".
>

Are you planning to remove that option with this improvement? I think
we could keep it for a while, as a way to override the automatic
mechanism we might implement. That would even be a "safety" measure, in
case there's some corner case with the auto-sized aperture that we
don't foresee upfront.

> > So, the code may be a bit complex, it'll
> > need to take into account this case (probably we could just rely on
> > the physbits "detected" by OVMF in such case, limiting PCI64 aperture
> > to the current 36-bits, right?).
>
> A "bit complex" is an understatement. The code is already much more
>[...]
heheh
Indeed, it's very complex code! It took me a while to even start to
figure out where things are done; I guess the UEFI spec is complex in
nature too, which adds to the code complexity!


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 15:49           ` Eduardo Habkost
  2020-06-17 15:57             ` Guilherme Piccoli
@ 2020-06-17 16:04             ` Dr. David Alan Gilbert
  2020-06-17 16:17               ` Daniel P. Berrangé
                                 ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-17 16:04 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Pedro Principeza, Dann Frazier, Guilherme Piccoli, qemu-devel,
	Christian Ehrhardt, Gerd Hoffmann, Laszlo Ersek, fw

* Eduardo Habkost (ehabkost@redhat.com) wrote:
> On Wed, Jun 17, 2020 at 02:46:52PM +0100, Dr. David Alan Gilbert wrote:
> > * Laszlo Ersek (lersek@redhat.com) wrote:
> > > On 06/16/20 19:14, Guilherme Piccoli wrote:
> > > > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > > > 
> > > > So, I understand that when we use "-host-physical-bits", we are
> > > > passing the *real* number for the guest, correct? So, in this case we
> > > > can trust that the guest physbits matches the true host physbits.
> > > > 
> > > > What if then we have OVMF relying in the physbits *iff*
> > > > "-host-phys-bits" is used (which is the default in RH and a possible
> > > > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > > > fallbacks to 36-bit otherwise?
> > > 
> > > I've now read the commit message on QEMU commit 258fe08bd341d, and the
> > > complexity is simply stunning.
> > > 
> > > Right now, OVMF calculates the guest physical address space size from
> > > various range sizes (such as hotplug memory area end, default or
> > > user-configured PCI64 MMIO aperture), and derives the minimum suitable
> > > guest-phys address width from that address space size. This width is
> > > then exposed to the rest of the firmware with the CPU HOB (hand-off
> > > block), which in turn controls how the GCD (global coherency domain)
> > > memory space map is sized. Etc.
> > > 
> > > If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> > > or even fw_cfg), then the above calculation could be reversed in OVMF.
> > > We could take the width as a given (-> produce the CPU HOB directly),
> > > plus calculate the *remaining* address space between the GPA space size
> > > given by the width, and the end of the memory hotplug area end. If the
> > > "remaining size" were negative, then obviously QEMU would have been
> > > misconfigured, so we'd halt the boot. Otherwise, the remaining area
> > > could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> > > tables be darned).
> > > 
> > > > Now, regarding the problem "to trust or not" in the guests' physbits,
> > > > I think it's an orthogonal discussion to some extent. It'd be nice to
> > > > have that check, and as Eduardo said, prevent migration in such cases.
> > > > But it's not really preventing OVMF big PCI64 aperture if we only
> > > > increase the aperture _when  "-host-physical-bits" is used_.
> > > 
> > > I don't know what exactly those flags do, but I doubt they are clearly
> > > visible to OVMF in any particular way.
> > 
> > The firmware should trust whatever it reads from the cpuid and thus gets
> > told from qemu; if qemu is doing the wrong thing there then that's our
> > problem and we need to fix it in qemu.
> 
> It is impossible to provide a MAXPHYADDR that the guest can trust
> unconditionally and allow live migration to hosts with different
> sizes at the same time.

It would be nice to get to a point where we could say that the reported
size is no bigger than the physical hardware.
The gotcha here is that (upstream) qemu is still reporting 40 by default
when even modern Intel desktop chips are 39.

> Unless we want to drop support live migration to hosts with
> different sizes entirely, we need additional bits to tell the
> guest how much it can trust MAXPHYADDR.

Could we go with host-phys-bits=true by default, that at least means the
normal behaviour is correct; if people want to migrate between different
hosts with different sizes they should set phys-bits (or
host-phys-limit) to the lowest in their set of hardware.
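
(For anyone picking that lowest value by hand: the width each host
advertises is CPUID leaf 0x80000008, EAX bits 7:0. A trivial standalone
check -- nothing QEMU-specific, just an illustration -- looks like this:)

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.80000008H:EAX[7:0] = physical address width (MAXPHYADDR),
     * EAX[15:8] = linear address width. */
    if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 0x80000008 not available\n");
        return 1;
    }
    printf("physical address bits: %u\n", eax & 0xff);
    printf("linear address bits:   %u\n", (eax >> 8) & 0xff);
    return 0;
}

Run it on each host of a migration pool (and inside a guest) and the
mismatch above -- 40 in the guest vs. 39 or 36 on the host -- is easy
to see.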

Dave
> -- 
> Eduardo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 13:46         ` Dr. David Alan Gilbert
  2020-06-17 15:49           ` Eduardo Habkost
@ 2020-06-17 16:14           ` Laszlo Ersek
  2020-06-17 16:43             ` Laszlo Ersek
  1 sibling, 1 reply; 40+ messages in thread
From: Laszlo Ersek @ 2020-06-17 16:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pedro Principeza, ehabkost, Dann Frazier, Guilherme Piccoli,
	qemu-devel, Christian Ehrhardt, Gerd Hoffmann, fw

On 06/17/20 15:46, Dr. David Alan Gilbert wrote:
> * Laszlo Ersek (lersek@redhat.com) wrote:
>> On 06/16/20 19:14, Guilherme Piccoli wrote:
>>> Thanks Gerd, Dave and Eduardo for the prompt responses!
>>>
>>> So, I understand that when we use "-host-physical-bits", we are
>>> passing the *real* number for the guest, correct? So, in this case we
>>> can trust that the guest physbits matches the true host physbits.
>>>
>>> What if then we have OVMF relying in the physbits *iff*
>>> "-host-phys-bits" is used (which is the default in RH and a possible
>>> machine configuration on libvirt XML in Ubuntu), and we have OVMF
>>> fallbacks to 36-bit otherwise?
>>
>> I've now read the commit message on QEMU commit 258fe08bd341d, and the
>> complexity is simply stunning.
>>
>> Right now, OVMF calculates the guest physical address space size from
>> various range sizes (such as hotplug memory area end, default or
>> user-configured PCI64 MMIO aperture), and derives the minimum suitable
>> guest-phys address width from that address space size. This width is
>> then exposed to the rest of the firmware with the CPU HOB (hand-off
>> block), which in turn controls how the GCD (global coherency domain)
>> memory space map is sized. Etc.
>>
>> If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
>> or even fw_cfg), then the above calculation could be reversed in OVMF.
>> We could take the width as a given (-> produce the CPU HOB directly),
>> plus calculate the *remaining* address space between the GPA space size
>> given by the width, and the end of the memory hotplug area end. If the
>> "remaining size" were negative, then obviously QEMU would have been
>> misconfigured, so we'd halt the boot. Otherwise, the remaining area
>> could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
>> tables be darned).
>>
>>> Now, regarding the problem "to trust or not" in the guests' physbits,
>>> I think it's an orthogonal discussion to some extent. It'd be nice to
>>> have that check, and as Eduardo said, prevent migration in such cases.
>>> But it's not really preventing OVMF big PCI64 aperture if we only
>>> increase the aperture _when  "-host-physical-bits" is used_.
>>
>> I don't know what exactly those flags do, but I doubt they are clearly
>> visible to OVMF in any particular way.
> 
> The firmware should trust whatever it reads from the cpuid and thus gets
> told from qemu; if qemu is doing the wrong thing there then that's our
> problem and we need to fix it in qemu.

This sounds good in principle, but -- as Gerd too has stated, to my
understanding -- it has potential to break existing usage.

Consider assigning a single device with a 32G BAR -- right now that's
supposed to work, without the X-PciMmio64Mb OVMF knob, on even the "most
basic" hardware (36-bit host phys address width, and EPT supported). If
OVMF suddenly starts trusting the CPUID from QEMU, and that results in a
GPA width of 40 bits (i.e. new OVMF is run on old QEMU), then the big
BAR (and other stuff too) could be allocated from GPA space that EPT is
actually able to deal with. --> regression for the user.

Sometimes I can tell users "hey given that you're building OVMF from
source, or taking it from a 3rd party origin anyway, can you just run
upstream QEMU too", but most of the time they just want everything to
continue working on a 3 year old Ubuntu LTS release or whatever. :/

And again, this is *without* "X-PciMmio64Mb".
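
(Putting rough numbers on that, purely as an illustration of the
magnitude of the mismatch -- this is not OVMF's actual aperture
placement algorithm:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Illustration only -- not OVMF's placement logic. */
    uint64_t host_limit  = 1ULL << 36;  /* what 36-bit host EPT can map: 64 GiB */
    uint64_t guest_limit = 1ULL << 40;  /* what a 40-bit CPUID claims:    1 TiB */

    printf("GPA space the host can actually back:  %llu GiB\n",
           (unsigned long long)(host_limit >> 30));
    printf("GPA space the firmware would plan for: %llu GiB\n",
           (unsigned long long)(guest_limit >> 30));
    printf("range that looks usable but is not:    %llu GiB\n",
           (unsigned long long)((guest_limit - host_limit) >> 30));
    return 0;
}

Any BAR (or anything else) that the firmware places in that top 960 GiB
simply cannot be backed on the 36-bit host.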

Laszlo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:04             ` Dr. David Alan Gilbert
@ 2020-06-17 16:17               ` Daniel P. Berrangé
  2020-06-17 16:22                 ` Eduardo Habkost
  2020-06-17 16:28               ` Eduardo Habkost
  2020-06-19 16:13               ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 40+ messages in thread
From: Daniel P. Berrangé @ 2020-06-17 16:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pedro Principeza, Eduardo Habkost, Dann Frazier,
	Guilherme Piccoli, qemu-devel, Christian Ehrhardt, Gerd Hoffmann,
	Laszlo Ersek, fw

On Wed, Jun 17, 2020 at 05:04:12PM +0100, Dr. David Alan Gilbert wrote:
> * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > On Wed, Jun 17, 2020 at 02:46:52PM +0100, Dr. David Alan Gilbert wrote:
> > > * Laszlo Ersek (lersek@redhat.com) wrote:
> > > > On 06/16/20 19:14, Guilherme Piccoli wrote:
> > > > > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > > > > 
> > > > > So, I understand that when we use "-host-physical-bits", we are
> > > > > passing the *real* number for the guest, correct? So, in this case we
> > > > > can trust that the guest physbits matches the true host physbits.
> > > > > 
> > > > > What if then we have OVMF relying in the physbits *iff*
> > > > > "-host-phys-bits" is used (which is the default in RH and a possible
> > > > > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > > > > fallbacks to 36-bit otherwise?
> > > > 
> > > > I've now read the commit message on QEMU commit 258fe08bd341d, and the
> > > > complexity is simply stunning.
> > > > 
> > > > Right now, OVMF calculates the guest physical address space size from
> > > > various range sizes (such as hotplug memory area end, default or
> > > > user-configured PCI64 MMIO aperture), and derives the minimum suitable
> > > > guest-phys address width from that address space size. This width is
> > > > then exposed to the rest of the firmware with the CPU HOB (hand-off
> > > > block), which in turn controls how the GCD (global coherency domain)
> > > > memory space map is sized. Etc.
> > > > 
> > > > If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> > > > or even fw_cfg), then the above calculation could be reversed in OVMF.
> > > > We could take the width as a given (-> produce the CPU HOB directly),
> > > > plus calculate the *remaining* address space between the GPA space size
> > > > given by the width, and the end of the memory hotplug area end. If the
> > > > "remaining size" were negative, then obviously QEMU would have been
> > > > misconfigured, so we'd halt the boot. Otherwise, the remaining area
> > > > could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> > > > tables be darned).
> > > > 
> > > > > Now, regarding the problem "to trust or not" in the guests' physbits,
> > > > > I think it's an orthogonal discussion to some extent. It'd be nice to
> > > > > have that check, and as Eduardo said, prevent migration in such cases.
> > > > > But it's not really preventing OVMF big PCI64 aperture if we only
> > > > > increase the aperture _when  "-host-physical-bits" is used_.
> > > > 
> > > > I don't know what exactly those flags do, but I doubt they are clearly
> > > > visible to OVMF in any particular way.
> > > 
> > > The firmware should trust whatever it reads from the cpuid and thus gets
> > > told from qemu; if qemu is doing the wrong thing there then that's our
> > > problem and we need to fix it in qemu.
> > 
> > It is impossible to provide a MAXPHYADDR that the guest can trust
> > unconditionally and allow live migration to hosts with different
> > sizes at the same time.
> 
> It would be nice to get to a point where we could say that the reported
> size is no bigger than the physical hardware.
> The gotcha here is that (upstream) qemu is still reporting 40 by default
> when even modern Intel desktop chips are 39.
> 
> > Unless we want to drop support live migration to hosts with
> > different sizes entirely, we need additional bits to tell the
> > guest how much it can trust MAXPHYADDR.
> 
> Could we go with host-phys-bits=true by default, that at least means the
> normal behaviour is correct; if people want to migrate between different
> hosts with different sizes they should set phys-bits (or
> host-phys-limit) to the lowest in their set of hardware.

Is there any sense in picking the default value based on -cpu selection ?

If user has asked for -cpu host, there's no downside to host-phys-bits=true,
as the user has intentionally traded off live migration portability already.

If the user asks for -cpu $MODEL, then could we set phys-bits=NNN for some
NNN that is the lowest value for CPUs that are capable of running $MODEL ?
Or will that get too complicated with the wide range of SKU variants, in
particular server vs desktop CPUs.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:17               ` Daniel P. Berrangé
@ 2020-06-17 16:22                 ` Eduardo Habkost
  2020-06-17 16:41                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 40+ messages in thread
From: Eduardo Habkost @ 2020-06-17 16:22 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Liu Yi L, Pedro Principeza, Like Xu, Dann Frazier,
	Guilherme Piccoli, Dr. David Alan Gilbert, Christian Ehrhardt,
	qemu-devel, Babu Moger, Gerd Hoffmann, Chenyi Qiang, Robert Hoo,
	Laszlo Ersek, fw

On Wed, Jun 17, 2020 at 05:17:17PM +0100, Daniel P. Berrangé wrote:
> On Wed, Jun 17, 2020 at 05:04:12PM +0100, Dr. David Alan Gilbert wrote:
> > * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > > On Wed, Jun 17, 2020 at 02:46:52PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Laszlo Ersek (lersek@redhat.com) wrote:
> > > > > On 06/16/20 19:14, Guilherme Piccoli wrote:
> > > > > > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > > > > > 
> > > > > > So, I understand that when we use "-host-physical-bits", we are
> > > > > > passing the *real* number for the guest, correct? So, in this case we
> > > > > > can trust that the guest physbits matches the true host physbits.
> > > > > > 
> > > > > > What if then we have OVMF relying in the physbits *iff*
> > > > > > "-host-phys-bits" is used (which is the default in RH and a possible
> > > > > > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > > > > > fallbacks to 36-bit otherwise?
> > > > > 
> > > > > I've now read the commit message on QEMU commit 258fe08bd341d, and the
> > > > > complexity is simply stunning.
> > > > > 
> > > > > Right now, OVMF calculates the guest physical address space size from
> > > > > various range sizes (such as hotplug memory area end, default or
> > > > > user-configured PCI64 MMIO aperture), and derives the minimum suitable
> > > > > guest-phys address width from that address space size. This width is
> > > > > then exposed to the rest of the firmware with the CPU HOB (hand-off
> > > > > block), which in turn controls how the GCD (global coherency domain)
> > > > > memory space map is sized. Etc.
> > > > > 
> > > > > If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> > > > > or even fw_cfg), then the above calculation could be reversed in OVMF.
> > > > > We could take the width as a given (-> produce the CPU HOB directly),
> > > > > plus calculate the *remaining* address space between the GPA space size
> > > > > given by the width, and the end of the memory hotplug area end. If the
> > > > > "remaining size" were negative, then obviously QEMU would have been
> > > > > misconfigured, so we'd halt the boot. Otherwise, the remaining area
> > > > > could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> > > > > tables be darned).
> > > > > 
> > > > > > Now, regarding the problem "to trust or not" in the guests' physbits,
> > > > > > I think it's an orthogonal discussion to some extent. It'd be nice to
> > > > > > have that check, and as Eduardo said, prevent migration in such cases.
> > > > > > But it's not really preventing OVMF big PCI64 aperture if we only
> > > > > > increase the aperture _when  "-host-physical-bits" is used_.
> > > > > 
> > > > > I don't know what exactly those flags do, but I doubt they are clearly
> > > > > visible to OVMF in any particular way.
> > > > 
> > > > The firmware should trust whatever it reads from the cpuid and thus gets
> > > > told from qemu; if qemu is doing the wrong thing there then that's our
> > > > problem and we need to fix it in qemu.
> > > 
> > > It is impossible to provide a MAXPHYADDR that the guest can trust
> > > unconditionally and allow live migration to hosts with different
> > > sizes at the same time.
> > 
> > It would be nice to get to a point where we could say that the reported
> > size is no bigger than the physical hardware.
> > The gotcha here is that (upstream) qemu is still reporting 40 by default
> > when even modern Intel desktop chips are 39.
> > 
> > > Unless we want to drop support live migration to hosts with
> > > different sizes entirely, we need additional bits to tell the
> > > guest how much it can trust MAXPHYADDR.
> > 
> > Could we go with host-phys-bits=true by default, that at least means the
> > normal behaviour is correct; if people want to migrate between different
> > hosts with different sizes they should set phys-bits (or
> > host-phys-limit) to the lowest in their set of hardware.
> 
> Is there any sense in picking the default value based on -cpu selection ?
> 
> If user has asked for -cpu host, there's no downside to host-phys-bits=true,
> as the user has intentionally traded off live migration portability already.

Setting host-phys-bits=true when using -cpu host makes a lot of
sense, and we could start doing that immediately.

> 
> If the user asks for -cpu $MODEL, then could we set phys-bits=NNN for some
> NNN that is the lowest value for CPUs that are capable of running $MODEL ?
> Or will that get too complicated with the wide range of SKU variants, in
> particular server vs desktop CPUs.

This makes sense too.  We need some help from CPU vendors to get
us this data added to our CPU model table.  I'm CCing some Intel
and AMD people that could help us.

-- 
Eduardo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17  8:17       ` Christophe de Dinechin
@ 2020-06-17 16:25         ` Eduardo Habkost
  0 siblings, 0 replies; 40+ messages in thread
From: Eduardo Habkost @ 2020-06-17 16:25 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: pedro.principeza, dann.frazier, Guilherme G. Piccoli,
	Dr. David Alan Gilbert, christian.ehrhardt, qemu-devel,
	Gerd Hoffmann, lersek, fw

On Wed, Jun 17, 2020 at 10:17:55AM +0200, Christophe de Dinechin wrote:
> 
> 
> > Le 16 Jun 2020 à 19:10, Eduardo Habkost <ehabkost@redhat.com> a écrit :
> > 
> > On Tue, Jun 16, 2020 at 05:57:46PM +0100, Dr. David Alan Gilbert wrote:
> >> * Gerd Hoffmann (kraxel@redhat.com) wrote:
> >>>  Hi,
> >>> 
> >>>> (a) We could rely in the guest physbits to calculate the PCI64 aperture.
> >>> 
> >>> I'd love to do that.  Move the 64-bit I/O window as high as possible and
> >>> use -- say -- 25% of the physical address space for it.
> >>> 
> >>> Problem is we can't.
> >>> 
> >>>> failure. Also, if the users are not setting the physbits in the guest,
> >>>> there must be a default (seems to be 40bit according to my experiments),
> >>>> seems to be a good idea to rely on that.
> >>> 
> >>> Yes, 40 is the default, and it is used *even if the host supports less
> >>> than that*.  Typical values I've seen for intel hardware are 36 and 39.
> >>> 39 is used even by recent hardware (not the xeons, but check out a
> >>> laptop or a nuc).
> >>> 
> >>>> If guest physbits is 40, why to have OVMF limiting it to 36, right?
> >>> 
> >>> Things will explode in case OVMF uses more physbits than the host
> >>> supports (host physbits limit applies to ept too).  In other words: OVMF
> >>> can't trust the guest physbits, so it is conservative to be on the safe
> >>> side.
> >>> 
> >>> If we can somehow make a *trustable* physbits value available to the
> >>> guest, then yes, we can go that route.  But the guest physbits we have
> >>> today unfortunately don't cut it.
> >> 
> >> In downstream RH qemu, we run with host-physbits as default; so it's reasonably
> >> trustworthy; of course that doesn't help you across a migration between
> >> hosts with different sizes (e.g. an E5 Xeon to an E3).
> >> Changing upstream to do the same would seem sensible to me, but it's not
> >> a foolproof config.
> > 
> > Yeah, to make it really trustworthy we would need to prevent
> > migration to hosts with mismatching phys sizes.
> 
> Wouldn't it be sufficient to prevent guestphysbits > hostphysbits?

It probably would, for the OVMF issue we're discussing here.
But it's not a guarantee we give today.

If we make additional guarantees to guest software, we need to
communicate them unambiguously to the guest.
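
(To make the shape of that check concrete -- a toy sketch only, not
actual QEMU code; in practice it would have to run both at startup and
on the migration destination:)

#include <stdbool.h>
#include <stdio.h>

/* Toy version of the invariant under discussion: refuse to run a guest
 * whose configured physical address width exceeds what the host CPU
 * really implements. */
static bool phys_bits_acceptable(unsigned int guest_bits,
                                 unsigned int host_bits)
{
    return guest_bits <= host_bits;
}

int main(void)
{
    unsigned int host_bits  = 39;  /* e.g. a desktop Kaby Lake */
    unsigned int guest_bits = 40;  /* QEMU's current x86 default */

    if (!phys_bits_acceptable(guest_bits, host_bits)) {
        fprintf(stderr, "guest phys-bits %u > host phys-bits %u: refusing\n",
                guest_bits, host_bits);
        return 1;
    }
    printf("ok: guest width fits within the host\n");
    return 0;
}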

-- 
Eduardo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:04             ` Dr. David Alan Gilbert
  2020-06-17 16:17               ` Daniel P. Berrangé
@ 2020-06-17 16:28               ` Eduardo Habkost
  2020-06-19 16:13               ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 40+ messages in thread
From: Eduardo Habkost @ 2020-06-17 16:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pedro Principeza, libvir-list, Dann Frazier, Guilherme Piccoli,
	qemu-devel, Christian Ehrhardt, Gerd Hoffmann, Laszlo Ersek, fw

+libvir-list

On Wed, Jun 17, 2020 at 05:04:12PM +0100, Dr. David Alan Gilbert wrote:
> * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > On Wed, Jun 17, 2020 at 02:46:52PM +0100, Dr. David Alan Gilbert wrote:
> > > * Laszlo Ersek (lersek@redhat.com) wrote:
> > > > On 06/16/20 19:14, Guilherme Piccoli wrote:
> > > > > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > > > > 
> > > > > So, I understand that when we use "-host-physical-bits", we are
> > > > > passing the *real* number for the guest, correct? So, in this case we
> > > > > can trust that the guest physbits matches the true host physbits.
> > > > > 
> > > > > What if then we have OVMF relying in the physbits *iff*
> > > > > "-host-phys-bits" is used (which is the default in RH and a possible
> > > > > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > > > > fallbacks to 36-bit otherwise?
> > > > 
> > > > I've now read the commit message on QEMU commit 258fe08bd341d, and the
> > > > complexity is simply stunning.
> > > > 
> > > > Right now, OVMF calculates the guest physical address space size from
> > > > various range sizes (such as hotplug memory area end, default or
> > > > user-configured PCI64 MMIO aperture), and derives the minimum suitable
> > > > guest-phys address width from that address space size. This width is
> > > > then exposed to the rest of the firmware with the CPU HOB (hand-off
> > > > block), which in turn controls how the GCD (global coherency domain)
> > > > memory space map is sized. Etc.
> > > > 
> > > > If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> > > > or even fw_cfg), then the above calculation could be reversed in OVMF.
> > > > We could take the width as a given (-> produce the CPU HOB directly),
> > > > plus calculate the *remaining* address space between the GPA space size
> > > > given by the width, and the end of the memory hotplug area end. If the
> > > > "remaining size" were negative, then obviously QEMU would have been
> > > > misconfigured, so we'd halt the boot. Otherwise, the remaining area
> > > > could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> > > > tables be darned).
> > > > 
> > > > > Now, regarding the problem "to trust or not" in the guests' physbits,
> > > > > I think it's an orthogonal discussion to some extent. It'd be nice to
> > > > > have that check, and as Eduardo said, prevent migration in such cases.
> > > > > But it's not really preventing OVMF big PCI64 aperture if we only
> > > > > increase the aperture _when  "-host-physical-bits" is used_.
> > > > 
> > > > I don't know what exactly those flags do, but I doubt they are clearly
> > > > visible to OVMF in any particular way.
> > > 
> > > The firmware should trust whatever it reads from the cpuid and thus gets
> > > told from qemu; if qemu is doing the wrong thing there then that's our
> > > problem and we need to fix it in qemu.
> > 
> > It is impossible to provide a MAXPHYADDR that the guest can trust
> > unconditionally and allow live migration to hosts with different
> > sizes at the same time.
> 
> It would be nice to get to a point where we could say that the reported
> size is no bigger than the physical hardware.
> The gotcha here is that (upstream) qemu is still reporting 40 by default
> when even modern Intel desktop chips are 39.

I agree it would be nice.  We just need a method to tell guest
software that we are really making this additional guarantee.

> 
> > Unless we want to drop support live migration to hosts with
> > different sizes entirely, we need additional bits to tell the
> > guest how much it can trust MAXPHYADDR.
> 
> Could we go with host-phys-bits=true by default, that at least means the
> normal behaviour is correct; if people want to migrate between different
> hosts with different sizes they should set phys-bits (or
> host-phys-limit) to the lowest in their set of hardware.

Host-dependent defaults may be convenient for end users running
QEMU directly, but not a good idea if a stable guest ABI is
expected from your VM configuration (which is the case for
software using the libvirt API).

-- 
Eduardo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 15:57             ` Guilherme Piccoli
@ 2020-06-17 16:33               ` Eduardo Habkost
  2020-06-17 16:40                 ` Guilherme Piccoli
  2020-06-18  8:00                 ` Laszlo Ersek
  0 siblings, 2 replies; 40+ messages in thread
From: Eduardo Habkost @ 2020-06-17 16:33 UTC (permalink / raw)
  To: Guilherme Piccoli
  Cc: Pedro Principeza, Dann Frazier, qemu-devel, Christian Ehrhardt,
	Dr. David Alan Gilbert, Gerd Hoffmann, Laszlo Ersek, fw

On Wed, Jun 17, 2020 at 12:57:52PM -0300, Guilherme Piccoli wrote:
> Can't qemu reads the host physical bits and pass that as fw_cfg as
> "real_host_physbits" or something like that?
> OVMF could rely on that - if such property is available, we use it to
> extend the PCI64 aperture.

Giving an exact "real host physbits" value would be too limiting
if we still want to allow migration to smaller hosts.  But we can
provide extra information on CPUID or fw_cfg, yes.

I will try to write down a proposal to extend the KVM CPUID leaf
to provide extra information about MAXPHYADDR guarantees.
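
(Consuming such a hint on the guest side could look roughly like the
sketch below. The hypervisor-present bit and the KVM signature/feature
leaves at 0x4000000x are real; the "reliable MAXPHYADDR" feature bit is
entirely made up here, pending the actual proposal:)

#include <cpuid.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical feature bit -- the real bit number (or whether this ends
 * up as a CPUID bit at all, rather than an fw_cfg item) would come from
 * the proposal above; nothing like this exists today. */
#define HYPOTHETICAL_KVM_FEAT_RELIABLE_MAXPHYADDR (1u << 31)

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char sig[13] = { 0 };

    /* CPUID.01H:ECX[31] -- "running under a hypervisor". */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 31))) {
        printf("bare metal: CPUID.80000008H MAXPHYADDR is authoritative\n");
        return 0;
    }

    /* Hypervisor leaves; plain __cpuid() is used because 0x40000000 is
     * above the basic-leaf maximum that __get_cpuid() checks against. */
    __cpuid(0x40000000, eax, ebx, ecx, edx);
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    if (memcmp(sig, "KVMKVMKVM", 9) != 0) {
        printf("not KVM -- stay conservative\n");
        return 0;
    }

    __cpuid(0x40000001, eax, ebx, ecx, edx);  /* KVM feature leaf */
    if (eax & HYPOTHETICAL_KVM_FEAT_RELIABLE_MAXPHYADDR) {
        __get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
        printf("MAXPHYADDR can be trusted: %u bits\n", eax & 0xff);
    } else {
        printf("no guarantee -- keep the conservative 36-bit assumption\n");
    }
    return 0;
}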

-- 
Eduardo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:33               ` Eduardo Habkost
@ 2020-06-17 16:40                 ` Guilherme Piccoli
  2020-06-18  8:00                 ` Laszlo Ersek
  1 sibling, 0 replies; 40+ messages in thread
From: Guilherme Piccoli @ 2020-06-17 16:40 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Pedro Principeza, Dann Frazier, qemu-devel, Christian Ehrhardt,
	Dr. David Alan Gilbert, Gerd Hoffmann, Laszlo Ersek, fw

Awesome, thanks a bunch Eduardo!

On Wed, Jun 17, 2020 at 1:33 PM Eduardo Habkost <ehabkost@redhat.com> wrote:
>
> On Wed, Jun 17, 2020 at 12:57:52PM -0300, Guilherme Piccoli wrote:
> > Can't qemu reads the host physical bits and pass that as fw_cfg as
> > "real_host_physbits" or something like that?
> > OVMF could rely on that - if such property is available, we use it to
> > extend the PCI64 aperture.
>
> Giving an exact "real host physbits" value would be too limiting
> if we still want to allow migration to smaller hosts.  But we can
> provide extra information on CPUID or fw_cfg, yes.
>
> I will try to write down a proposal to extend the KVM CPUID leaf
> to provide extra information about MAXPHYADDR guarantees.
>
> --
> Eduardo
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:22                 ` Eduardo Habkost
@ 2020-06-17 16:41                   ` Dr. David Alan Gilbert
  2020-06-17 17:17                     ` Daniel P. Berrangé
  0 siblings, 1 reply; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-17 16:41 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Liu Yi L, Pedro Principeza, Like Xu, Dann Frazier,
	Guilherme Piccoli, qemu-devel, Christian Ehrhardt, Robert Hoo,
	Babu Moger, Gerd Hoffmann, Chenyi Qiang, Daniel P. Berrangé,
	Laszlo Ersek, fw

* Eduardo Habkost (ehabkost@redhat.com) wrote:
> On Wed, Jun 17, 2020 at 05:17:17PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Jun 17, 2020 at 05:04:12PM +0100, Dr. David Alan Gilbert wrote:
> > > * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > > > On Wed, Jun 17, 2020 at 02:46:52PM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Laszlo Ersek (lersek@redhat.com) wrote:
> > > > > > On 06/16/20 19:14, Guilherme Piccoli wrote:
> > > > > > > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > > > > > > 
> > > > > > > So, I understand that when we use "-host-physical-bits", we are
> > > > > > > passing the *real* number for the guest, correct? So, in this case we
> > > > > > > can trust that the guest physbits matches the true host physbits.
> > > > > > > 
> > > > > > > What if then we have OVMF relying in the physbits *iff*
> > > > > > > "-host-phys-bits" is used (which is the default in RH and a possible
> > > > > > > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > > > > > > fallbacks to 36-bit otherwise?
> > > > > > 
> > > > > > I've now read the commit message on QEMU commit 258fe08bd341d, and the
> > > > > > complexity is simply stunning.
> > > > > > 
> > > > > > Right now, OVMF calculates the guest physical address space size from
> > > > > > various range sizes (such as hotplug memory area end, default or
> > > > > > user-configured PCI64 MMIO aperture), and derives the minimum suitable
> > > > > > guest-phys address width from that address space size. This width is
> > > > > > then exposed to the rest of the firmware with the CPU HOB (hand-off
> > > > > > block), which in turn controls how the GCD (global coherency domain)
> > > > > > memory space map is sized. Etc.
> > > > > > 
> > > > > > If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> > > > > > or even fw_cfg), then the above calculation could be reversed in OVMF.
> > > > > > We could take the width as a given (-> produce the CPU HOB directly),
> > > > > > plus calculate the *remaining* address space between the GPA space size
> > > > > > given by the width, and the end of the memory hotplug area end. If the
> > > > > > "remaining size" were negative, then obviously QEMU would have been
> > > > > > misconfigured, so we'd halt the boot. Otherwise, the remaining area
> > > > > > could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> > > > > > tables be darned).
> > > > > > 
> > > > > > > Now, regarding the problem "to trust or not" in the guests' physbits,
> > > > > > > I think it's an orthogonal discussion to some extent. It'd be nice to
> > > > > > > have that check, and as Eduardo said, prevent migration in such cases.
> > > > > > > But it's not really preventing OVMF big PCI64 aperture if we only
> > > > > > > increase the aperture _when  "-host-physical-bits" is used_.
> > > > > > 
> > > > > > I don't know what exactly those flags do, but I doubt they are clearly
> > > > > > visible to OVMF in any particular way.
> > > > > 
> > > > > The firmware should trust whatever it reads from the cpuid and thus gets
> > > > > told from qemu; if qemu is doing the wrong thing there then that's our
> > > > > problem and we need to fix it in qemu.
> > > > 
> > > > It is impossible to provide a MAXPHYADDR that the guest can trust
> > > > unconditionally and allow live migration to hosts with different
> > > > sizes at the same time.
> > > 
> > > It would be nice to get to a point where we could say that the reported
> > > size is no bigger than the physical hardware.
> > > The gotcha here is that (upstream) qemu is still reporting 40 by default
> > > when even modern Intel desktop chips are 39.
> > > 
> > > > Unless we want to drop support live migration to hosts with
> > > > different sizes entirely, we need additional bits to tell the
> > > > guest how much it can trust MAXPHYADDR.
> > > 
> > > Could we go with host-phys-bits=true by default, that at least means the
> > > normal behaviour is correct; if people want to migrate between different
> > > hosts with different sizes they should set phys-bits (or
> > > host-phys-limit) to the lowest in their set of hardware.
> > 
> > Is there any sense in picking the default value based on -cpu selection ?
> > 
> > If user has asked for -cpu host, there's no downside to host-phys-bits=true,
> > as the user has intentionally traded off live migration portability already.
> 
> Setting host-phys-bits=true when using -cpu host makes a lot of
> sense, and we could start doing that immediately.
> 
> > 
> > If the user asks for -cpu $MODEL, then could we set phys-bits=NNN for some
> > NNN that is the lowest value for CPUs that are capable of running $MODEL ?
> > Or will that get too complicated with the wide range of SKU variants, in
> > particular server vs desktop CPUs.
> 
> This makes sense too.  We need some help from CPU vendors to get
> us this data added to our CPU model table.  I'm CCing some Intel
> and AMD people that could help us.

That bit worries me because I think I agree it's SKU dependent and has
been for a long time (on Intel at least) and we don't even have CPU
models for all Intel devices. (My laptop for example is a Kaby Lake, 39
bits physical).  Maybe it works on the more modern ones where we have
'Icelake-Client' and 'Icelake-Server'.

Dave

> -- 
> Eduardo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:14           ` Laszlo Ersek
@ 2020-06-17 16:43             ` Laszlo Ersek
  2020-06-17 17:02               ` Eduardo Habkost
  0 siblings, 1 reply; 40+ messages in thread
From: Laszlo Ersek @ 2020-06-17 16:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pedro Principeza, ehabkost, Dann Frazier, Guilherme Piccoli,
	qemu-devel, Christian Ehrhardt, Gerd Hoffmann, fw

On 06/17/20 18:14, Laszlo Ersek wrote:
> On 06/17/20 15:46, Dr. David Alan Gilbert wrote:
>> * Laszlo Ersek (lersek@redhat.com) wrote:
>>> On 06/16/20 19:14, Guilherme Piccoli wrote:
>>>> Thanks Gerd, Dave and Eduardo for the prompt responses!
>>>>
>>>> So, I understand that when we use "-host-physical-bits", we are
>>>> passing the *real* number for the guest, correct? So, in this case we
>>>> can trust that the guest physbits matches the true host physbits.
>>>>
>>>> What if then we have OVMF relying in the physbits *iff*
>>>> "-host-phys-bits" is used (which is the default in RH and a possible
>>>> machine configuration on libvirt XML in Ubuntu), and we have OVMF
>>>> fallbacks to 36-bit otherwise?
>>>
>>> I've now read the commit message on QEMU commit 258fe08bd341d, and the
>>> complexity is simply stunning.
>>>
>>> Right now, OVMF calculates the guest physical address space size from
>>> various range sizes (such as hotplug memory area end, default or
>>> user-configured PCI64 MMIO aperture), and derives the minimum suitable
>>> guest-phys address width from that address space size. This width is
>>> then exposed to the rest of the firmware with the CPU HOB (hand-off
>>> block), which in turn controls how the GCD (global coherency domain)
>>> memory space map is sized. Etc.
>>>
>>> If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
>>> or even fw_cfg), then the above calculation could be reversed in OVMF.
>>> We could take the width as a given (-> produce the CPU HOB directly),
>>> plus calculate the *remaining* address space between the GPA space size
>>> given by the width, and the end of the memory hotplug area end. If the
>>> "remaining size" were negative, then obviously QEMU would have been
>>> misconfigured, so we'd halt the boot. Otherwise, the remaining area
>>> could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
>>> tables be darned).
>>>
>>>> Now, regarding the problem "to trust or not" in the guests' physbits,
>>>> I think it's an orthogonal discussion to some extent. It'd be nice to
>>>> have that check, and as Eduardo said, prevent migration in such cases.
>>>> But it's not really preventing OVMF big PCI64 aperture if we only
>>>> increase the aperture _when  "-host-physical-bits" is used_.
>>>
>>> I don't know what exactly those flags do, but I doubt they are clearly
>>> visible to OVMF in any particular way.
>>
>> The firmware should trust whatever it reads from the cpuid and thus gets
>> told from qemu; if qemu is doing the wrong thing there then that's our
>> problem and we need to fix it in qemu.
> 
> This sounds good in principle, but -- as Gerd too has stated, to my
> understanding -- it has potential to break existing usage.
> 
> Consider assigning a single device with a 32G BAR -- right now that's
> supposed to work, without the X-PciMmio64Mb OVMF knob, on even the "most
> basic" hardware (36-bit host phys address width, and EPT supported). If
> OVMF suddenly starts trusting the CPUID from QEMU, and that results in a
> GPA width of 40 bits (i.e. new OVMF is run on old QEMU), then the big
> BAR (and other stuff too) could be allocated from GPA space that EPT is
> actually able to deal with. --> regression for the user.

s/able/unable/, sigh. :/

> 
> Sometimes I can tell users "hey given that you're building OVMF from
> source, or taking it from a 3rd party origin anyway, can you just run
> upstream QEMU too", but most of the time they just want everything to
> continue working on a 3 year old Ubuntu LTS release or whatever. :/
> 
> And again, this is *without* "X-PciMmio64Mb".
> 
> Laszlo
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:43             ` Laszlo Ersek
@ 2020-06-17 17:02               ` Eduardo Habkost
  2020-06-18  8:29                 ` Laszlo Ersek
  0 siblings, 1 reply; 40+ messages in thread
From: Eduardo Habkost @ 2020-06-17 17:02 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Pedro Principeza, Dann Frazier, Guilherme Piccoli, qemu-devel,
	Christian Ehrhardt, Dr. David Alan Gilbert, Gerd Hoffmann, fw

On Wed, Jun 17, 2020 at 06:43:20PM +0200, Laszlo Ersek wrote:
> On 06/17/20 18:14, Laszlo Ersek wrote:
> > On 06/17/20 15:46, Dr. David Alan Gilbert wrote:
> >> * Laszlo Ersek (lersek@redhat.com) wrote:
> >>> On 06/16/20 19:14, Guilherme Piccoli wrote:
> >>>> Thanks Gerd, Dave and Eduardo for the prompt responses!
> >>>>
> >>>> So, I understand that when we use "-host-physical-bits", we are
> >>>> passing the *real* number for the guest, correct? So, in this case we
> >>>> can trust that the guest physbits matches the true host physbits.
> >>>>
> >>>> What if then we have OVMF relying in the physbits *iff*
> >>>> "-host-phys-bits" is used (which is the default in RH and a possible
> >>>> machine configuration on libvirt XML in Ubuntu), and we have OVMF
> >>>> fallbacks to 36-bit otherwise?
> >>>
> >>> I've now read the commit message on QEMU commit 258fe08bd341d, and the
> >>> complexity is simply stunning.
> >>>
> >>> Right now, OVMF calculates the guest physical address space size from
> >>> various range sizes (such as hotplug memory area end, default or
> >>> user-configured PCI64 MMIO aperture), and derives the minimum suitable
> >>> guest-phys address width from that address space size. This width is
> >>> then exposed to the rest of the firmware with the CPU HOB (hand-off
> >>> block), which in turn controls how the GCD (global coherency domain)
> >>> memory space map is sized. Etc.
> >>>
> >>> If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> >>> or even fw_cfg), then the above calculation could be reversed in OVMF.
> >>> We could take the width as a given (-> produce the CPU HOB directly),
> >>> plus calculate the *remaining* address space between the GPA space size
> >>> given by the width, and the end of the memory hotplug area end. If the
> >>> "remaining size" were negative, then obviously QEMU would have been
> >>> misconfigured, so we'd halt the boot. Otherwise, the remaining area
> >>> could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> >>> tables be darned).
> >>>
> >>>> Now, regarding the problem "to trust or not" in the guests' physbits,
> >>>> I think it's an orthogonal discussion to some extent. It'd be nice to
> >>>> have that check, and as Eduardo said, prevent migration in such cases.
> >>>> But it's not really preventing OVMF big PCI64 aperture if we only
> >>>> increase the aperture _when  "-host-physical-bits" is used_.
> >>>
> >>> I don't know what exactly those flags do, but I doubt they are clearly
> >>> visible to OVMF in any particular way.
> >>
> >> The firmware should trust whatever it reads from the cpuid and thus gets
> >> told from qemu; if qemu is doing the wrong thing there then that's our
> >> problem and we need to fix it in qemu.
> > 
> > This sounds good in principle, but -- as Gerd too has stated, to my
> > understanding -- it has potential to break existing usage.
> > 
> > Consider assigning a single device with a 32G BAR -- right now that's
> > supposed to work, without the X-PciMmio64Mb OVMF knob, on even the "most
> > basic" hardware (36-bit host phys address width, and EPT supported). If
> > OVMF suddenly starts trusting the CPUID from QEMU, and that results in a
> > GPA width of 40 bits (i.e. new OVMF is run on old QEMU), then the big
> > BAR (and other stuff too) could be allocated from GPA space that EPT is
> > actually able to deal with. --> regression for the user.
> 
> s/able/unable/, sigh. :/

I was confused for a while, thanks for the clarification.  :)

So, I'm trying to write down which additional guarantees we want
to give to guests, exactly.  I don't want the documentation to
reference "host physical address bits", but rather the actual
behavior we don't emulate.

What does "unable to deal with" mean in this specific case?  I
remember MAXPHYADDR mismatches make EPT treatment of reserved
bits not be what guests would expect from bare metal, but can
somebody point out the specific guest-visible VCPU behavior
that would cause a regression in OVMF?  Bonus points if anybody
can find the exact Intel SDM paragraph we fail to implement.

> 
> > 
> > Sometimes I can tell users "hey given that you're building OVMF from
> > source, or taking it from a 3rd party origin anyway, can you just run
> > upstream QEMU too", but most of the time they just want everything to
> > continue working on a 3 year old Ubuntu LTS release or whatever. :/
> > 

Agreed.  It wouldn't be reasonable to ask guest software to
unconditionally trust the data we provide to it after we provided
incorrect data to guests for [*checks git log*] 13 years.

-- 
Eduardo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:41                   ` Dr. David Alan Gilbert
@ 2020-06-17 17:17                     ` Daniel P. Berrangé
  2020-06-17 17:23                       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel P. Berrangé @ 2020-06-17 17:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Liu Yi L, Pedro Principeza, Eduardo Habkost, Like Xu,
	Dann Frazier, Guilherme Piccoli, qemu-devel, Christian Ehrhardt,
	Robert Hoo, Babu Moger, Gerd Hoffmann, Chenyi Qiang,
	Laszlo Ersek, fw

On Wed, Jun 17, 2020 at 05:41:41PM +0100, Dr. David Alan Gilbert wrote:
> * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > On Wed, Jun 17, 2020 at 05:17:17PM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Jun 17, 2020 at 05:04:12PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > > > > On Wed, Jun 17, 2020 at 02:46:52PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > * Laszlo Ersek (lersek@redhat.com) wrote:
> > > > > > > On 06/16/20 19:14, Guilherme Piccoli wrote:
> > > > > > > > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > > > > > > > 
> > > > > > > > So, I understand that when we use "-host-physical-bits", we are
> > > > > > > > passing the *real* number for the guest, correct? So, in this case we
> > > > > > > > can trust that the guest physbits matches the true host physbits.
> > > > > > > > 
> > > > > > > > What if then we have OVMF relying in the physbits *iff*
> > > > > > > > "-host-phys-bits" is used (which is the default in RH and a possible
> > > > > > > > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > > > > > > > fallbacks to 36-bit otherwise?
> > > > > > > 
> > > > > > > I've now read the commit message on QEMU commit 258fe08bd341d, and the
> > > > > > > complexity is simply stunning.
> > > > > > > 
> > > > > > > Right now, OVMF calculates the guest physical address space size from
> > > > > > > various range sizes (such as hotplug memory area end, default or
> > > > > > > user-configured PCI64 MMIO aperture), and derives the minimum suitable
> > > > > > > guest-phys address width from that address space size. This width is
> > > > > > > then exposed to the rest of the firmware with the CPU HOB (hand-off
> > > > > > > block), which in turn controls how the GCD (global coherency domain)
> > > > > > > memory space map is sized. Etc.
> > > > > > > 
> > > > > > > If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> > > > > > > or even fw_cfg), then the above calculation could be reversed in OVMF.
> > > > > > > We could take the width as a given (-> produce the CPU HOB directly),
> > > > > > > plus calculate the *remaining* address space between the GPA space size
> > > > > > > given by the width, and the end of the memory hotplug area end. If the
> > > > > > > "remaining size" were negative, then obviously QEMU would have been
> > > > > > > misconfigured, so we'd halt the boot. Otherwise, the remaining area
> > > > > > > could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> > > > > > > tables be darned).
> > > > > > > 
> > > > > > > > Now, regarding the problem "to trust or not" in the guests' physbits,
> > > > > > > > I think it's an orthogonal discussion to some extent. It'd be nice to
> > > > > > > > have that check, and as Eduardo said, prevent migration in such cases.
> > > > > > > > But it's not really preventing OVMF big PCI64 aperture if we only
> > > > > > > > increase the aperture _when  "-host-physical-bits" is used_.
> > > > > > > 
> > > > > > > I don't know what exactly those flags do, but I doubt they are clearly
> > > > > > > visible to OVMF in any particular way.
> > > > > > 
> > > > > > The firmware should trust whatever it reads from the cpuid and thus gets
> > > > > > told from qemu; if qemu is doing the wrong thing there then that's our
> > > > > > problem and we need to fix it in qemu.
> > > > > 
> > > > > It is impossible to provide a MAXPHYADDR that the guest can trust
> > > > > unconditionally and allow live migration to hosts with different
> > > > > sizes at the same time.
> > > > 
> > > > It would be nice to get to a point where we could say that the reported
> > > > size is no bigger than the physical hardware.
> > > > The gotcha here is that (upstream) qemu is still reporting 40 by default
> > > > when even modern Intel desktop chips are 39.
> > > > 
> > > > > Unless we want to drop support live migration to hosts with
> > > > > different sizes entirely, we need additional bits to tell the
> > > > > guest how much it can trust MAXPHYADDR.
> > > > 
> > > > Could we go with host-phys-bits=true by default, that at least means the
> > > > normal behaviour is correct; if people want to migrate between different
> > > > hosts with different sizes they should set phys-bits (or
> > > > host-phys-limit) to the lowest in their set of hardware.
> > > 
> > > Is there any sense in picking the default value based on -cpu selection ?
> > > 
> > > If user has asked for -cpu host, there's no downside to host-phys-bits=true,
> > > as the user has intentionally traded off live migration portability already.
> > 
> > Setting host-phys-bits=true when using -cpu host makes a lot of
> > sense, and we could start doing that immediately.
> > 
> > > 
> > > If the user asks for -cpu $MODEL, then could we set phys-bits=NNN for some
> > > NNN that is the lowest value for CPUs that are capable of running $MODEL ?
> > > Or will that get too complicated with the wide range of SKU variants, in
> > > particular server vs desktop CPUs.
> > 
> > This makes sense too.  We need some help from CPU vendors to get
> > us this data added to our CPU model table.  I'm CCing some Intel
> > and AMD people that could help us.
> 
> That bit worries me because I think I agree it's SKU dependent and has
> been for a long time (on Intel at least) and we don't even have CPU
> models for all Intel devices. (My laptop for example is a Kaby Lake, 39
> bits physical).  Maybe it works on the more modern ones where we have
> 'Icelake-Client' and 'Icelake-Server'.

Yeah, I think introducing the Client/Server variants was a good idea
and long overdue.

For older CPUs, we have a choice of picking a lowest common denominator
which would likely be a regression for people running on Xeon hosts,
or we could introduce Client/Server variants for old models which currently
lack them too.

I'd tend towards the latter - the Xeon vs non-Xeon SKUs for Intel have
always been different in various ways, and we've just been lazy using
one model for both.

With CPU versioning, that would let us "do the right thing".
eg Broadwell-Client (physbits=36) / Broadwell-Server (physbits=46)
Then "-cpu Broadwell" would automatically resolve to whichever of
Broadwell-Client/Server were supported on the current host taking into
account its phys-bits too. This would give us live migration safety and
not limit us to the lowest common denominator.
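
(Roughly the resolution rule I have in mind, as a toy sketch -- the
numbers and the way this would plug into QEMU's versioned CPU models
are illustrative only, not a patch:)

#include <stddef.h>
#include <stdio.h>

/* Toy model: each named CPU gets per-variant physical address widths,
 * and a bare "-cpu Broadwell" resolves to the widest variant the
 * current host can still satisfy.  Numbers are illustrative. */
struct cpu_variant {
    const char *name;
    unsigned int phys_bits;
};

static const struct cpu_variant broadwell_variants[] = {
    { "Broadwell-Server", 46 },
    { "Broadwell-Client", 36 },
};

static const struct cpu_variant *
resolve_variant(const struct cpu_variant *variants, size_t n,
                unsigned int host_phys_bits)
{
    for (size_t i = 0; i < n; i++) {
        if (variants[i].phys_bits <= host_phys_bits) {
            return &variants[i];
        }
    }
    return NULL;  /* host narrower than every variant: refuse to resolve */
}

int main(void)
{
    unsigned int hosts[] = { 46, 39, 36 };

    for (size_t i = 0; i < sizeof(hosts) / sizeof(hosts[0]); i++) {
        const struct cpu_variant *v =
            resolve_variant(broadwell_variants, 2, hosts[i]);
        printf("host with %u phys bits -> %s\n",
               hosts[i], v ? v->name : "no suitable variant");
    }
    return 0;
}

A 46-bit Xeon host gets Broadwell-Server, a 39-bit desktop host falls
back to Broadwell-Client, and the guest-visible width never exceeds
the host's.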

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 17:17                     ` Daniel P. Berrangé
@ 2020-06-17 17:23                       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-17 17:23 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Liu Yi L, Pedro Principeza, Eduardo Habkost, Like Xu,
	Dann Frazier, Guilherme Piccoli, qemu-devel, Christian Ehrhardt,
	Robert Hoo, Babu Moger, Gerd Hoffmann, Chenyi Qiang,
	Laszlo Ersek, fw

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Wed, Jun 17, 2020 at 05:41:41PM +0100, Dr. David Alan Gilbert wrote:
> > * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > > On Wed, Jun 17, 2020 at 05:17:17PM +0100, Daniel P. Berrangé wrote:
> > > > On Wed, Jun 17, 2020 at 05:04:12PM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > > > > > On Wed, Jun 17, 2020 at 02:46:52PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > > * Laszlo Ersek (lersek@redhat.com) wrote:
> > > > > > > > On 06/16/20 19:14, Guilherme Piccoli wrote:
> > > > > > > > > Thanks Gerd, Dave and Eduardo for the prompt responses!
> > > > > > > > > 
> > > > > > > > > So, I understand that when we use "-host-physical-bits", we are
> > > > > > > > > passing the *real* number for the guest, correct? So, in this case we
> > > > > > > > > can trust that the guest physbits matches the true host physbits.
> > > > > > > > > 
> > > > > > > > > What if then we have OVMF relying in the physbits *iff*
> > > > > > > > > "-host-phys-bits" is used (which is the default in RH and a possible
> > > > > > > > > machine configuration on libvirt XML in Ubuntu), and we have OVMF
> > > > > > > > > fallbacks to 36-bit otherwise?
> > > > > > > > 
> > > > > > > > I've now read the commit message on QEMU commit 258fe08bd341d, and the
> > > > > > > > complexity is simply stunning.
> > > > > > > > 
> > > > > > > > Right now, OVMF calculates the guest physical address space size from
> > > > > > > > various range sizes (such as hotplug memory area end, default or
> > > > > > > > user-configured PCI64 MMIO aperture), and derives the minimum suitable
> > > > > > > > guest-phys address width from that address space size. This width is
> > > > > > > > then exposed to the rest of the firmware with the CPU HOB (hand-off
> > > > > > > > block), which in turn controls how the GCD (global coherency domain)
> > > > > > > > memory space map is sized. Etc.
> > > > > > > > 
> > > > > > > > If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> > > > > > > > or even fw_cfg), then the above calculation could be reversed in OVMF.
> > > > > > > > We could take the width as a given (-> produce the CPU HOB directly),
> > > > > > > > plus calculate the *remaining* address space between the GPA space size
> > > > > > > > given by the width, and the end of the memory hotplug area end. If the
> > > > > > > > "remaining size" were negative, then obviously QEMU would have been
> > > > > > > > misconfigured, so we'd halt the boot. Otherwise, the remaining area
> > > > > > > > could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> > > > > > > > tables be darned).
> > > > > > > > 
> > > > > > > > > Now, regarding the problem "to trust or not" in the guests' physbits,
> > > > > > > > > I think it's an orthogonal discussion to some extent. It'd be nice to
> > > > > > > > > have that check, and as Eduardo said, prevent migration in such cases.
> > > > > > > > > But it's not really preventing OVMF big PCI64 aperture if we only
> > > > > > > > > increase the aperture _when  "-host-physical-bits" is used_.
> > > > > > > > 
> > > > > > > > I don't know what exactly those flags do, but I doubt they are clearly
> > > > > > > > visible to OVMF in any particular way.
> > > > > > > 
> > > > > > > The firmware should trust whatever it reads from the cpuid and thus gets
> > > > > > > told from qemu; if qemu is doing the wrong thing there then that's our
> > > > > > > problem and we need to fix it in qemu.
> > > > > > 
> > > > > > It is impossible to provide a MAXPHYADDR that the guest can trust
> > > > > > unconditionally and allow live migration to hosts with different
> > > > > > sizes at the same time.
> > > > > 
> > > > > It would be nice to get to a point where we could say that the reported
> > > > > size is no bigger than the physical hardware.
> > > > > The gotcha here is that (upstream) qemu is still reporting 40 by default
> > > > > when even modern Intel desktop chips are 39.
> > > > > 
> > > > > > Unless we want to drop support live migration to hosts with
> > > > > > different sizes entirely, we need additional bits to tell the
> > > > > > guest how much it can trust MAXPHYADDR.
> > > > > 
> > > > > Could we go with host-phys-bits=true by default, that at least means the
> > > > > normal behaviour is correct; if people want to migrate between different
> > > > > hosts with different sizes they should set phys-bits (or
> > > > > host-phys-limit) to the lowest in their set of hardware.
> > > > 
> > > > Is there any sense in picking the default value based on -cpu selection ?
> > > > 
> > > > If user has asked for -cpu host, there's no downside to host-phys-bits=true,
> > > > as the user has intentionally traded off live migration portability already.
> > > 
> > > Setting host-phys-bits=true when using -cpu host makes a lot of
> > > sense, and we could start doing that immediately.
> > > 
> > > > 
> > > > If the user asks for -cpu $MODEL, then could we set phys-bits=NNN for some
> > > > NNN that is the lowest value for CPUs that are capable of running $MODEL ?
> > > > Or will that get too complicated with the wide range of SKU variants, in
> > > > particular server vs desktop CPUs.
> > > 
> > > This makes sense too.  We need some help from CPU vendors to get
> > > us this data added to our CPU model table.  I'm CCing some Intel
> > > and AMD people that could help us.
> > 
> > That bit worries me because I think I agree it's SKU dependent and has
> > been for a long time (on Intel at least) and we don't even have CPU
> > models for all Intel devices. (My laptop for example is a Kaby Lake, 39
> > bits physical).  Maybe it works on the more modern ones where we have
> > 'Icelake-Client' and 'Icelake-Server'.
> 
> Yeah, I think introducing the Client/Server variants was a good idea
> and long overdue.
> 
> For older CPUs, we have a choice of picking a lowest common denominator
> which would likely be a regression for people running on Xeon hosts,
> or we could introduce Client/Server variants for old models which currently
> lack them too.
> 
> I'd tend towards the latter - the Xeon vs non-Xeon SKUs for Intel have
> always been different in various ways, and we've just been lazy using
> one model for both.
> 
> With CPU versioning, that would let us "do the right thing".
> eg Broadwell-Client (physbits=36) / Broadwell-Server (physbits=46)
> Then "-cpu Broadwell" would automatically resolve to whichever of
> Broadwell-Client/Server were supported on the current host taking into
> account its phys-bits too. This would give us live migration safety and
> not limit us to the lowest common denominator.

That does assume there's any consistency within the 'server' range.
Note also that the low end Xeons used to actually be closer to the
Clients.
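
For illustration, a sketch of how the two approaches discussed above are
spelled on the QEMU command line today (the CPU model and bit widths
below are example values only, not recommendations):

  # expose the host's MAXPHYADDR to the guest (ties the guest to this host):
  $ qemu-system-x86_64 -cpu host,host-phys-bits=on ...

  # or pin an explicit width no larger than any host in the migration pool:
  $ qemu-system-x86_64 -cpu Broadwell,phys-bits=36 ...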

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:01             ` Guilherme Piccoli
@ 2020-06-18  7:56               ` Laszlo Ersek
  0 siblings, 0 replies; 40+ messages in thread
From: Laszlo Ersek @ 2020-06-18  7:56 UTC (permalink / raw)
  To: Guilherme Piccoli
  Cc: Pedro Principeza, Eduardo Habkost, Dann Frazier, qemu-devel,
	Christian Ehrhardt, Dr. David Alan Gilbert, Gerd Hoffmann, fw

On 06/17/20 18:01, Guilherme Piccoli wrote:
> On Wed, Jun 17, 2020 at 12:57 PM Laszlo Ersek <lersek@redhat.com> wrote:
>> [...]
>> I don't necessarily see compat issues -- large-BAR PCI device assignment
>> might simply stop working under those circumstances, because you could
>> no longer use X-PciMmio64Mb, and the new way wouldn't be supported by
>> (old) QEMU.
>>
>> This would indeed be a regression relative to "X-PciMmio64Mb", but
>> that's exactly why there's an "X" in "X-PciMmio64Mb".
>>
> 
> Are you planning to remove that option, with this improvement?

Yes.

> I think
> we could keep it for a while, as a way to override the automatic
> mechanism we might implement. It would also serve as a safety net, in
> case there's some corner case with the auto-sized aperture that we
> overlook upfront.

I disagree. The knob is called "experimental" specifically so we don't
have to introduce even worse complexity for compatibility's sake than
what we have now. Graceful deprecation is for options that used to be
supported; "X-PciMmio64Mb" has been explicitly experimental from the start.

I'm not saying that I'll kill "X-PciMmio64Mb" for sure, just that I very
likely will.
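
(For context, a sketch of how the experimental knob is passed today:
opt/ovmf/X-PciMmio64Mb is the fw_cfg path that OVMF's PlatformPei looks
up, and the value is the desired 64-bit MMIO aperture size in MiB,
65536 being just an example value:

  $ qemu-system-x86_64 \
      -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536 ...

An automatic mechanism would size the aperture itself instead of taking
a user-supplied number like this.)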

Thanks
Laszlo




* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:33               ` Eduardo Habkost
  2020-06-17 16:40                 ` Guilherme Piccoli
@ 2020-06-18  8:00                 ` Laszlo Ersek
  1 sibling, 0 replies; 40+ messages in thread
From: Laszlo Ersek @ 2020-06-18  8:00 UTC (permalink / raw)
  To: Eduardo Habkost, Guilherme Piccoli
  Cc: Pedro Principeza, Dann Frazier, qemu-devel, Christian Ehrhardt,
	Dr. David Alan Gilbert, Gerd Hoffmann, fw

On 06/17/20 18:33, Eduardo Habkost wrote:
> On Wed, Jun 17, 2020 at 12:57:52PM -0300, Guilherme Piccoli wrote:
>> Can't qemu read the host physical bits and pass that via fw_cfg as
>> "real_host_physbits" or something like that?
>> OVMF could rely on that - if such property is available, we use it to
>> extend the PCI64 aperture.
> 
> Giving an exact "real host physbits" value would be too limiting
> if we still want to allow migration to smaller hosts.  But we can
> provide extra information on CPUID or fw_cfg, yes.
> 
> I will try to write down a proposal to extend the KVM CPUID leaf
> to provide extra information about MAXPHYADDR guarantees.

Thank you -- if this can be detected in the guest via CPUID, then I
prefer that to fw_cfg. We need to keep an eye on x-file-slots for fw_cfg.
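
For reference, a minimal user-space sketch (plain C, gcc/clang on x86)
of what a guest can already read today -- the raw MAXPHYADDR from CPUID
leaf 0x80000008 -- which is exactly the value whose trustworthiness is
in question; any new "guaranteed width" bits would come in addition to
this:

  #include <stdio.h>
  #include <cpuid.h>              /* GCC/Clang __get_cpuid() helper */

  int main(void)
  {
      unsigned int eax, ebx, ecx, edx;

      /* Leaf 0x80000008: EAX[7:0] = physical address bits (MAXPHYADDR),
       * EAX[15:8] = linear address bits. */
      if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
          fprintf(stderr, "CPUID leaf 0x80000008 not available\n");
          return 1;
      }
      printf("MAXPHYADDR: %u bits, linear: %u bits\n",
             eax & 0xff, (eax >> 8) & 0xff);
      return 0;
  }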

Thanks
Laszlo




* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 17:02               ` Eduardo Habkost
@ 2020-06-18  8:29                 ` Laszlo Ersek
  0 siblings, 0 replies; 40+ messages in thread
From: Laszlo Ersek @ 2020-06-18  8:29 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Pedro Principeza, Dann Frazier, Guilherme Piccoli, qemu-devel,
	Christian Ehrhardt, Dr. David Alan Gilbert, Gerd Hoffmann, fw

[-- Attachment #1: Type: text/plain, Size: 2375 bytes --]

On 06/17/20 19:02, Eduardo Habkost wrote:
> On Wed, Jun 17, 2020 at 06:43:20PM +0200, Laszlo Ersek wrote:
>> On 06/17/20 18:14, Laszlo Ersek wrote:

>>> Consider assigning a single device with a 32G BAR -- right now that's
>>> supposed to work, without the X-PciMmio64Mb OVMF knob, on even the "most
>>> basic" hardware (36-bit host phys address width, and EPT supported). If
>>> OVMF suddenly starts trusting the CPUID from QEMU, and that results in a
>>> GPA width of 40 bits (i.e. new OVMF is run on old QEMU), then the big
>>> BAR (and other stuff too) could be allocated from GPA space that EPT is
>>> actually able to deal with. --> regression for the user.
>>
>> s/able/unable/, sigh. :/
> 
> I was confused for a while, thanks for the clarification.  :)

Sorry again -- looks like yesterday was already Friday for me...

> So, I'm trying to write down which additional guarantees we want
> to give to guests, exactly.  I don't want the documentation to
> reference "host physical address bits", but actual behavior we
> don't emulate.
> 
> What does "unable to deal with" means in this specific case?  I
> remember MAXPHYADDR mismatches make EPT treatment of of reserved
> bits not be what guests would expect from bare metal, but can
> somebody point out to the specific guest-visible VCPU behavior
> that would cause a regression in OVMF?  Bonus points if anybody
> can find the exact Intel SDM paragraph we fail to implement.

When I first encountered the problem, it wasn't actually related to
64-bit PCI MMIO -- it was much earlier, namely when I worked on enabling
64GiB+ *RAM size* in OVMF. Things fell apart just from the guest memory
side of things. The 64-bit PCI MMIO concern is an extrapolation from
that, not a symptom encountered in practice.

Let me find my original message about the RAM symptom...

... Meh, I can find two discussions related to this, in my personal
archives. One is on one of our internal RH mailing lists, so I'm not
going to quote that here. The other is on the public edk2-devel mailing
list -- but from June 2015, when edk2-devel was still hosted on
sourceforge.net. The sourceforge.net archives are no longer accessible,
and GMANE (the secondary archive) is dead too.

So I'm going to have to attach the one message from the thread that I
feel contains the most information. Start reading at the string "BUG:".

Thanks,
Laszlo

[-- Attachment #2: Attached Message --]
[-- Type: message/rfc822, Size: 11698 bytes --]

From: Laszlo Ersek <lersek@redhat.com>
To: Maoming <maoming.maoming@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, edk2-devel@lists.sourceforge.net, "Huangpeng \(Peter\)" <peter.huangpeng@huawei.com>
Subject: Re: [edk2] [RFC 4/4] OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam()
Date: Wed, 10 Jun 2015 15:03:05 +0200
Message-ID: <55783589.2050904@redhat.com>

On 06/09/15 04:15, Laszlo Ersek wrote:
> On 06/08/15 23:46, Laszlo Ersek wrote:
>> At the moment we work with a UC default MTRR type, and set three memory
>> ranges to WB:
>> - [0, 640 KB),
>> - [1 MB, LowerMemorySize),
>> - [4 GB, 4 GB + UpperMemorySize).
>>
>> Unfortunately, coverage for the third range can fail with a high
>> likelihood. If the alignment of the base (ie. 4 GB) and the alignment of
>> the size (UpperMemorySize) differ, then MtrrLib creates a series of
>> variable MTRR entries, with power-of-two sized MTRR masks. And, it's
>> really easy to run out of variable MTRR entries, dependent on the
>> alignment difference.
>>
>> This is a problem because a Linux guest will loudly reject any high memory
>> that is not covered by MTRRs.
>>
>> So, let's follow the inverse pattern (loosely inspired by SeaBIOS):
>> - flip the MTRR default type to WB,
>> - set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default
>>   type and variable MTRRs, so we can't avoid this,
>> - set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs,
>> - set [LowerMemorySize, 4 GB) to UC -- more likely to succeed with variable
>>   MTRRs than the other scheme (due to less chaotic alignment
>>   differences).
>>
>> Effects of this patch can be observed by setting DEBUG_CACHE (0x00200000)
>> in PcdDebugPrintErrorLevel.
>>
>> BUG: Although the MTRRs look good to me in the OVMF debug log, I still
>> can't boot >= 64 GB guests with this. Instead of the complaints mentioned
>> above, the Linux guest apparently spirals into an infinite loop (on KVM),
>> or hangs with no CPU load (on TCG).
> 
> No, actually there is no bug in this patch (so s/RFC/PATCH/). I did more
> testing and these are the findings:
> - I can reproduce the same issue on KVM with SeaBIOS guests.
> - The exact symptoms are that as soon as the highest guest-phys address
>   is >= 64 GB, then the guest kernel doesn't boot. It gets stuck
>   somewhere after hitting Enter in grub.
> - Normally 3 GB of the guest RAM is mapped under 4 GB in guest-phys
>   address space, then there's a 1 GB PCI hole, and the rest is above
>   4 GB. This means that a 63 GB guest can be started (because 63 - 3 + 4
>   == 64), but if you add just 1 MB more, it won't boot.
> - (This was the big discovery:) I flipped the "ept" parameter of the
>   kvm_intel module on my host to N, and then things started to work. I
>   just booted a 128 GB Linux guest with this patchset. (I have 4 GB
>   RAM in my host, plus approx 250 GB swap.) The guest could see it all.
> - The TCG boot didn't hang either; I just couldn't wait earlier for
>   network initialization to complete.
> 
> I'm CC'ing Paolo for help with the EPT question. Other than that, this
> series is functional. (For QEMU/KVM at least; Xen will likely need more
> fixes from others.)

We have a root cause, it seems. The issue is that the processor in my
laptop, on which I tested, has only 36 bits for physical addresses:

  $ grep 'address sizes' /proc/cpuinfo
  address sizes   : 36 bits physical, 48 bits virtual
  ...

Which matches where the problem surfaces (64 GB guest-phys address
space) with hw-supported nested paging (EPT) enabled on the host.
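
(For the arithmetic: 2^36 bytes = 64 GiB, which is why the failures
start exactly at the 64 GiB guest-physical boundary when the host CPU
is limited to 36 physical address bits.)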

In order to confirm this, a colleague of mine gave me access to a server
with 96 GB of RAM, and:

  address sizes	: 46 bits physical, 48 bits virtual

On this host I booted a 72 GB OVMF guest on QEMU/KVM, with EPT enabled,
and according to the guest dmesg, the guest saw it all.

  Memory: 74160924K/75493820K available (7735K kernel code, 1149K
  rwdata, 3340K rodata, 1500K init, 1524K bss, 1332896K reserved, 0K
  cma-reserved)

Maoming: since you reported this issue, please confirm that the patch
series resolves it for you as well. In that case, I'll repost the series
with "PATCH" as subject-prefix instead of "RFC", and I'll drop the BUG
note from the last commit message.

Thanks
Laszlo

>> Cc: Maoming <maoming.maoming@huawei.com>
>> Cc: Huangpeng (Peter) <peter.huangpeng@huawei.com>
>> Cc: Wei Liu <wei.liu2@citrix.com>
>> Contributed-under: TianoCore Contribution Agreement 1.0
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> ---
>>  OvmfPkg/PlatformPei/MemDetect.c | 43 +++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 39 insertions(+), 4 deletions(-)
>>
>> diff --git a/OvmfPkg/PlatformPei/MemDetect.c b/OvmfPkg/PlatformPei/MemDetect.c
>> index 3ceb142..cceab22 100644
>> --- a/OvmfPkg/PlatformPei/MemDetect.c
>> +++ b/OvmfPkg/PlatformPei/MemDetect.c
>> @@ -194,6 +194,8 @@ QemuInitializeRam (
>>  {
>>    UINT64                      LowerMemorySize;
>>    UINT64                      UpperMemorySize;
>> +  MTRR_SETTINGS               MtrrSettings;
>> +  EFI_STATUS                  Status;
>>  
>>    DEBUG ((EFI_D_INFO, "%a called\n", __FUNCTION__));
>>  
>> @@ -214,12 +216,45 @@ QemuInitializeRam (
>>      }
>>    }
>>  
>> -  MtrrSetMemoryAttribute (BASE_1MB, LowerMemorySize - BASE_1MB, CacheWriteBack);
>> +  //
>> +  // We'd like to keep the following ranges uncached:
>> +  // - [640 KB, 1 MB)
>> +  // - [LowerMemorySize, 4 GB)
>> +  //
>> +  // Everything else should be WB. Unfortunately, programming the inverse (ie.
>> +  // keeping the default UC, and configuring the complement set of the above as
>> +  // WB) is not reliable in general, because the end of the upper RAM can have
>> +  // practically any alignment, and we may not have enough variable MTRRs to
>> +  // cover it exactly.
>> +  //
>> +  if (IsMtrrSupported ()) {
>> +    MtrrGetAllMtrrs (&MtrrSettings);
>>  
>> -  MtrrSetMemoryAttribute (0, BASE_512KB + BASE_128KB, CacheWriteBack);
>> +    //
>> +    // MTRRs disabled, fixed MTRRs disabled, default type is uncached
>> +    //
>> +    ASSERT ((MtrrSettings.MtrrDefType & BIT11) == 0);
>> +    ASSERT ((MtrrSettings.MtrrDefType & BIT10) == 0);
>> +    ASSERT ((MtrrSettings.MtrrDefType & 0xFF) == 0);
>>  
>> -  if (UpperMemorySize != 0) {
>> -    MtrrSetMemoryAttribute (BASE_4GB, UpperMemorySize, CacheWriteBack);
>> +    //
>> +    // flip default type to writeback
>> +    //
>> +    SetMem (&MtrrSettings.Fixed, sizeof MtrrSettings.Fixed, 0x06);
>> +    ZeroMem (&MtrrSettings.Variables, sizeof MtrrSettings.Variables);
>> +    MtrrSettings.MtrrDefType |= BIT11 | BIT10 | 6;
>> +    MtrrSetAllMtrrs (&MtrrSettings);
>> +
>> +    //
>> +    // punch holes
>> +    //
>> +    Status = MtrrSetMemoryAttribute (BASE_512KB + BASE_128KB,
>> +               SIZE_256KB + SIZE_128KB, CacheUncacheable);
>> +    ASSERT_EFI_ERROR (Status);
>> +
>> +    Status = MtrrSetMemoryAttribute (LowerMemorySize,
>> +               SIZE_4GB - LowerMemorySize, CacheUncacheable);
>> +    ASSERT_EFI_ERROR (Status);
>>    }
>>  }
>>  
>>
> 




* Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
  2020-06-17 16:04             ` Dr. David Alan Gilbert
  2020-06-17 16:17               ` Daniel P. Berrangé
  2020-06-17 16:28               ` Eduardo Habkost
@ 2020-06-19 16:13               ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-19 16:13 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Pedro Principeza, Dann Frazier, Guilherme Piccoli, qemu-devel,
	Christian Ehrhardt, Gerd Hoffmann, Laszlo Ersek, fw

I noticed that Mohammed Gamal just posted a series for better
KVM behaviour in cases where the guest's physical address width is smaller than the host's:

https://lore.kernel.org/lkml/20200619153925.79106-1-mgamal@redhat.com/

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




end of thread, other threads:[~2020-06-19 16:14 UTC | newest]

Thread overview: 40+ messages
2020-06-16 15:16 ovmf / PCI passthrough impaired due to very limiting PCI64 aperture Guilherme G. Piccoli
2020-06-16 16:50 ` Gerd Hoffmann
2020-06-16 16:57   ` Dr. David Alan Gilbert
2020-06-16 17:10     ` Eduardo Habkost
2020-06-17  8:17       ` Christophe de Dinechin
2020-06-17 16:25         ` Eduardo Habkost
2020-06-17  8:50       ` Daniel P. Berrangé
2020-06-17 10:28         ` Dr. David Alan Gilbert
2020-06-17 14:11         ` Eduardo Habkost
2020-06-16 17:10     ` Gerd Hoffmann
2020-06-16 17:16       ` Dr. David Alan Gilbert
2020-06-16 17:14     ` Guilherme Piccoli
2020-06-17  6:40       ` Gerd Hoffmann
2020-06-17 13:25         ` Laszlo Ersek
2020-06-17 13:26         ` Laszlo Ersek
2020-06-17 13:22       ` Laszlo Ersek
2020-06-17 13:43         ` Guilherme Piccoli
2020-06-17 15:57           ` Laszlo Ersek
2020-06-17 16:01             ` Guilherme Piccoli
2020-06-18  7:56               ` Laszlo Ersek
2020-06-17 13:46         ` Dr. David Alan Gilbert
2020-06-17 15:49           ` Eduardo Habkost
2020-06-17 15:57             ` Guilherme Piccoli
2020-06-17 16:33               ` Eduardo Habkost
2020-06-17 16:40                 ` Guilherme Piccoli
2020-06-18  8:00                 ` Laszlo Ersek
2020-06-17 16:04             ` Dr. David Alan Gilbert
2020-06-17 16:17               ` Daniel P. Berrangé
2020-06-17 16:22                 ` Eduardo Habkost
2020-06-17 16:41                   ` Dr. David Alan Gilbert
2020-06-17 17:17                     ` Daniel P. Berrangé
2020-06-17 17:23                       ` Dr. David Alan Gilbert
2020-06-17 16:28               ` Eduardo Habkost
2020-06-19 16:13               ` Dr. David Alan Gilbert
2020-06-17 16:14           ` Laszlo Ersek
2020-06-17 16:43             ` Laszlo Ersek
2020-06-17 17:02               ` Eduardo Habkost
2020-06-18  8:29                 ` Laszlo Ersek
2020-06-17  8:16   ` Christophe de Dinechin
2020-06-17 10:12     ` Gerd Hoffmann
