All of lore.kernel.org
 help / color / mirror / Atom feed
* PCI pass-through vs PoD
@ 2021-09-13  9:02 Jan Beulich
  2021-11-17  8:39 ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2021-09-13  9:02 UTC (permalink / raw)
  To: xen-devel
  Cc: Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu

Hello,

libxl__domain_config_setdefault() checks whether PoD is going to be
enabled and fails domain creation if at the same time devices would get
assigned. Nevertheless setting up of IOMMU page tables is allowed.
However, when later assigning a device to a domain which has IOMMU page
tables, libxl__device_pci_add() does not appear to be concerned of PoD:
- xc_test_assign_device() / XEN_DOMCTL_test_assign_device only check the
  device for being available to assign,
- libxl__device_pci_setdefault() is only concerned about the RDM policy,
- other functions called look to not be related to such checking at all.

IMO assignment should fail if pod.count != pod.entry_count, and all PoD
entries should be resolved otherwise (whether explicitly by the
hypervisor or through some suitable existing hypercalls - didn't check
yet whether there are any reasonable candidates - by the tool stack is
secondary).

Jan



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PCI pass-through vs PoD
  2021-09-13  9:02 PCI pass-through vs PoD Jan Beulich
@ 2021-11-17  8:39 ` Jan Beulich
  2021-11-17  8:55   ` Roger Pau Monné
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2021-11-17  8:39 UTC (permalink / raw)
  To: xen-devel
  Cc: Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu

On 13.09.2021 11:02, Jan Beulich wrote:
> libxl__domain_config_setdefault() checks whether PoD is going to be
> enabled and fails domain creation if at the same time devices would get
> assigned. Nevertheless setting up of IOMMU page tables is allowed.
> However, when later assigning a device to a domain which has IOMMU page
> tables, libxl__device_pci_add() does not appear to be concerned of PoD:
> - xc_test_assign_device() / XEN_DOMCTL_test_assign_device only check the
>   device for being available to assign,
> - libxl__device_pci_setdefault() is only concerned about the RDM policy,
> - other functions called look to not be related to such checking at all.

I've now verified this to be the case. In fact creating the guest and
assigning it a device while the guest still sits in the boot loader
allowed the (oldish) Linux guest I've been using to recognize the device
(and hence load its driver) even without any hotplug driver. Obviously
while still in the boot loader ...

> IMO assignment should fail if pod.count != pod.entry_count,

... this holds, and hence assignment should have failed.

IOW this approach currently is a simple "workaround" to avoid the "PCI
device assignment for HVM guest failed due to PoD enabled" error upon
domain creation.

I'll see if I can find a reasonable place to add the missing check; I'm
less certain about ...

> and all PoD
> entries should be resolved otherwise (whether explicitly by the
> hypervisor or through some suitable existing hypercalls - didn't check
> yet whether there are any reasonable candidates - by the tool stack is
> secondary).

... the approach to take here.

Jan



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PCI pass-through vs PoD
  2021-11-17  8:39 ` Jan Beulich
@ 2021-11-17  8:55   ` Roger Pau Monné
  2021-11-17 10:13     ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: Roger Pau Monné @ 2021-11-17  8:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant,
	George Dunlap, Wei Liu

On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote:
> On 13.09.2021 11:02, Jan Beulich wrote:
> > libxl__domain_config_setdefault() checks whether PoD is going to be
> > enabled and fails domain creation if at the same time devices would get
> > assigned. Nevertheless setting up of IOMMU page tables is allowed.

I'm unsure whether allowing enabling the IOMMU with PoD is the right
thing to do, at least for our toolstack.

> > However, when later assigning a device to a domain which has IOMMU page
> > tables, libxl__device_pci_add() does not appear to be concerned of PoD:
> > - xc_test_assign_device() / XEN_DOMCTL_test_assign_device only check the
> >   device for being available to assign,
> > - libxl__device_pci_setdefault() is only concerned about the RDM policy,
> > - other functions called look to not be related to such checking at all.
> 
> I've now verified this to be the case. In fact creating the guest and
> assigning it a device while the guest still sits in the boot loader
> allowed the (oldish) Linux guest I've been using to recognize the device
> (and hence load its driver) even without any hotplug driver. Obviously
> while still in the boot loader ...
> 
> > IMO assignment should fail if pod.count != pod.entry_count,
> 
> ... this holds, and hence assignment should have failed.
> 
> IOW this approach currently is a simple "workaround" to avoid the "PCI
> device assignment for HVM guest failed due to PoD enabled" error upon
> domain creation.
> 
> I'll see if I can find a reasonable place to add the missing check; I'm
> less certain about ...
> 
> > and all PoD
> > entries should be resolved otherwise (whether explicitly by the
> > hypervisor or through some suitable existing hypercalls - didn't check
> > yet whether there are any reasonable candidates - by the tool stack is
> > secondary).
> 
> ... the approach to take here.

I think forcing all entries to be resolved would be unexpected when
assigning a device.

I would rather print a message saying that either the guest must
balloon down to the requested amount of memory, or that all entries
should be resolved (ie: using mem-set to match the mem-max value).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PCI pass-through vs PoD
  2021-11-17  8:55   ` Roger Pau Monné
@ 2021-11-17 10:13     ` Jan Beulich
  2021-11-17 11:09       ` Andrew Cooper
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2021-11-17 10:13 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant,
	George Dunlap, Wei Liu

On 17.11.2021 09:55, Roger Pau Monné wrote:
> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote:
>> On 13.09.2021 11:02, Jan Beulich wrote:
>>> libxl__domain_config_setdefault() checks whether PoD is going to be
>>> enabled and fails domain creation if at the same time devices would get
>>> assigned. Nevertheless setting up of IOMMU page tables is allowed.
> 
> I'm unsure whether allowing enabling the IOMMU with PoD is the right
> thing to do, at least for our toolstack.

May I ask about the reasons of you being unsure?

>>> However, when later assigning a device to a domain which has IOMMU page
>>> tables, libxl__device_pci_add() does not appear to be concerned of PoD:
>>> - xc_test_assign_device() / XEN_DOMCTL_test_assign_device only check the
>>>   device for being available to assign,
>>> - libxl__device_pci_setdefault() is only concerned about the RDM policy,
>>> - other functions called look to not be related to such checking at all.
>>
>> I've now verified this to be the case. In fact creating the guest and
>> assigning it a device while the guest still sits in the boot loader
>> allowed the (oldish) Linux guest I've been using to recognize the device
>> (and hence load its driver) even without any hotplug driver. Obviously
>> while still in the boot loader ...
>>
>>> IMO assignment should fail if pod.count != pod.entry_count,
>>
>> ... this holds, and hence assignment should have failed.
>>
>> IOW this approach currently is a simple "workaround" to avoid the "PCI
>> device assignment for HVM guest failed due to PoD enabled" error upon
>> domain creation.
>>
>> I'll see if I can find a reasonable place to add the missing check; I'm
>> less certain about ...
>>
>>> and all PoD
>>> entries should be resolved otherwise (whether explicitly by the
>>> hypervisor or through some suitable existing hypercalls - didn't check
>>> yet whether there are any reasonable candidates - by the tool stack is
>>> secondary).
>>
>> ... the approach to take here.
> 
> I think forcing all entries to be resolved would be unexpected when
> assigning a device.
> 
> I would rather print a message saying that either the guest must
> balloon down to the requested amount of memory, or that all entries
> should be resolved (ie: using mem-set to match the mem-max value).

But ballooning down alone doesn't help. That only ensures there is
enough memory in the PoD cache to fill all PoD entries. The PoD
entries will get replaced (resolved) only as they get touched. That
touching is what I call "resolving them", and what would be needed
for assignment to be safe (for the guest). Expecting the guest to
do anything about this is imo not very reasonable; it can only be
tool stack or the hypervisor effecting this.

Jan



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PCI pass-through vs PoD
  2021-11-17 10:13     ` Jan Beulich
@ 2021-11-17 11:09       ` Andrew Cooper
  2021-11-17 11:23         ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2021-11-17 11:09 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant,
	George Dunlap, Wei Liu

On 17/11/2021 10:13, Jan Beulich wrote:
> On 17.11.2021 09:55, Roger Pau Monné wrote:
>> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote:
>>> On 13.09.2021 11:02, Jan Beulich wrote:
>>>> libxl__domain_config_setdefault() checks whether PoD is going to be
>>>> enabled and fails domain creation if at the same time devices would get
>>>> assigned. Nevertheless setting up of IOMMU page tables is allowed.
>> I'm unsure whether allowing enabling the IOMMU with PoD is the right
>> thing to do, at least for our toolstack.
> May I ask about the reasons of you being unsure?

PoD and passthrough is a total nonsense.  You cannot have IOMMU mappings 
to bits of the guest physical address space which don't exist.

It is now the case that IOMMU (or not) must be specified at domain 
creation time, which is ahead of creating PoD pages.  Certainly as far 
as Xen is concerned, the logic probably wants reversing to have 
add_to_physmap&friends reject PoD if an IOMMU was configured.

A toolstack could, in principle, defer the decision to first device 
assignment.

However, this is terrible behaviour all around, because one way or 
another we've got to force-populate all PoD pages (which is potentially 
minutes worth of work to do), and liable to suffer -ENOMEM, or we have 
to reject a control operation with -EBUSY for a task which is dependent 
on the guest kernel actions in a known-buggy area.


There is no point trying to make this work.  If a user wants a device, 
they don't get to have PoD.  Anything else is a waste of time and effort 
on our behalf for a usecase that doesn't exist in practice.

~Andrew


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PCI pass-through vs PoD
  2021-11-17 11:09       ` Andrew Cooper
@ 2021-11-17 11:23         ` Jan Beulich
  2021-11-17 13:07           ` Andrew Cooper
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2021-11-17 11:23 UTC (permalink / raw)
  To: Andrew Cooper, Roger Pau Monné
  Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant,
	George Dunlap, Wei Liu

On 17.11.2021 12:09, Andrew Cooper wrote:
> On 17/11/2021 10:13, Jan Beulich wrote:
>> On 17.11.2021 09:55, Roger Pau Monné wrote:
>>> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote:
>>>> On 13.09.2021 11:02, Jan Beulich wrote:
>>>>> libxl__domain_config_setdefault() checks whether PoD is going to be
>>>>> enabled and fails domain creation if at the same time devices would get
>>>>> assigned. Nevertheless setting up of IOMMU page tables is allowed.
>>> I'm unsure whether allowing enabling the IOMMU with PoD is the right
>>> thing to do, at least for our toolstack.
>> May I ask about the reasons of you being unsure?
> 
> PoD and passthrough is a total nonsense.  You cannot have IOMMU mappings 
> to bits of the guest physical address space which don't exist.
> 
> It is now the case that IOMMU (or not) must be specified at domain 
> creation time, which is ahead of creating PoD pages.  Certainly as far 
> as Xen is concerned, the logic probably wants reversing to have 
> add_to_physmap&friends reject PoD if an IOMMU was configured.
> 
> A toolstack could, in principle, defer the decision to first device 
> assignment.

Right, which is what I consider the preferred approach.

> However, this is terrible behaviour all around, because one way or 
> another we've got to force-populate all PoD pages (which is potentially 
> minutes worth of work to do),

Sure.

> and liable to suffer -ENOMEM,

Not if (as suggested) we first check that the PoD cache is large enough
to cover all PoD entries.

> or we have 
> to reject a control operation with -EBUSY for a task which is dependent 
> on the guest kernel actions in a known-buggy area.

Why reject anything?

> There is no point trying to make this work.  If a user wants a device, 
> they don't get to have PoD.  Anything else is a waste of time and effort 
> on our behalf for a usecase that doesn't exist in practice.

Not sure where you take the latter from. I suppose I'll submit the patch
as I have it now (once I have properly resolved dependencies on other
patches I have queued and/or pending), and if that's not deemed acceptable
plus if at the same time I don't really agree with proposed alternatives,
I'll leave fixing the bug to someone else. Of course the expectation then
is that such a bug fix come forward within a reasonable time frame ...

Jan



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PCI pass-through vs PoD
  2021-11-17 11:23         ` Jan Beulich
@ 2021-11-17 13:07           ` Andrew Cooper
  2021-11-18  8:08             ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2021-11-17 13:07 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant,
	George Dunlap, Wei Liu

On 17/11/2021 11:23, Jan Beulich wrote:
> On 17.11.2021 12:09, Andrew Cooper wrote:
>> On 17/11/2021 10:13, Jan Beulich wrote:
>>> On 17.11.2021 09:55, Roger Pau Monné wrote:
>>>> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote:
>>>>> On 13.09.2021 11:02, Jan Beulich wrote:
>>>>>> libxl__domain_config_setdefault() checks whether PoD is going to be
>>>>>> enabled and fails domain creation if at the same time devices would get
>>>>>> assigned. Nevertheless setting up of IOMMU page tables is allowed.
>>>> I'm unsure whether allowing enabling the IOMMU with PoD is the right
>>>> thing to do, at least for our toolstack.
>>> May I ask about the reasons of you being unsure?
>> PoD and passthrough is a total nonsense.  You cannot have IOMMU mappings
>> to bits of the guest physical address space which don't exist.
>>
>> It is now the case that IOMMU (or not) must be specified at domain
>> creation time, which is ahead of creating PoD pages.  Certainly as far
>> as Xen is concerned, the logic probably wants reversing to have
>> add_to_physmap&friends reject PoD if an IOMMU was configured.
>>
>> A toolstack could, in principle, defer the decision to first device
>> assignment.
> Right, which is what I consider the preferred approach.

Why?

Just because something is technically possible, does not mean it is an 
appropriate or clever thing to do.

In this case, we're talking about extra complexity in Xen and the 
toolstack, which in the very best case comes with unattractive user 
experience properties, to "fix" an issue which doesn't happen in practice.

>> and liable to suffer -ENOMEM,
> Not if (as suggested) we first check that the PoD cache is large enough
> to cover all PoD entries.

Just because at this instant we have enough free RAM to force-populate 
all PoD entries doesn't mean the same is true in 2 minutes time after 
we've been slowly force-populating a massive VM.

Yes, there are heuristics we can use to short-circuit the failure early, 
but that's still spelt -ENOMEM and reported to the user as such.

The only way to succeed here is to force populate the VM and to have not 
suffered -ENOMEM by the end of this task.

>> or we have
>> to reject a control operation with -EBUSY for a task which is dependent
>> on the guest kernel actions in a known-buggy area.
> Why reject anything?

Because the guest kernel has no knowledge of nor the ability to query 
the PoD status of a page, the only way to not have things malfunction is 
to enforce that there are no P2M entries of type PoD when devices are 
assigned.

If you don't want to / can't force-populate the entire VM prior to 
having device assigned, then the assign operation needs to fail.

>> There is no point trying to make this work.  If a user wants a device,
>> they don't get to have PoD.  Anything else is a waste of time and effort
>> on our behalf for a usecase that doesn't exist in practice.
> Not sure where you take the latter from. I suppose I'll submit the patch
> as I have it now (once I have properly resolved dependencies on other
> patches I have queued and/or pending), and if that's not deemed acceptable
> plus if at the same time I don't really agree with proposed alternatives,
> I'll leave fixing the bug to someone else. Of course the expectation then
> is that such a bug fix come forward within a reasonable time frame ...

What bug?  PoD and PCI Passthrough are mutually exclusive technologies.

We can (now) tell up front when a VM is configured with these mutually 
exclusive options.  Such a configuration should be rejected as early as 
possible.


What you're talking about is introducing extra complexity to explicitly 
support running the VM in a known-incompatible configuration, with the 
decision point for fixing said incompatibility deferred until runtime 
and now with a possibility of "genuinely can't this to become compatible".

Failing device assignment (potentially after a multi-minute wait) with 
"well you shouldn't have enabled PoD to begin with, you fool" is clearly 
worse behaviour than refusing to create such a VM in the first place, 
and you need a far far better reason than "because it's technically 
possible" to justify doing this.

~Andrew


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PCI pass-through vs PoD
  2021-11-17 13:07           ` Andrew Cooper
@ 2021-11-18  8:08             ` Jan Beulich
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Beulich @ 2021-11-18  8:08 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant,
	George Dunlap, Wei Liu, Roger Pau Monné

On 17.11.2021 14:07, Andrew Cooper wrote:
> On 17/11/2021 11:23, Jan Beulich wrote:
>> On 17.11.2021 12:09, Andrew Cooper wrote:
>>> On 17/11/2021 10:13, Jan Beulich wrote:
>>>> On 17.11.2021 09:55, Roger Pau Monné wrote:
>>>>> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote:
>>>>>> On 13.09.2021 11:02, Jan Beulich wrote:
>>>>>>> libxl__domain_config_setdefault() checks whether PoD is going to be
>>>>>>> enabled and fails domain creation if at the same time devices would get
>>>>>>> assigned. Nevertheless setting up of IOMMU page tables is allowed.
>>>>> I'm unsure whether allowing enabling the IOMMU with PoD is the right
>>>>> thing to do, at least for our toolstack.
>>>> May I ask about the reasons of you being unsure?
>>> PoD and passthrough is a total nonsense.  You cannot have IOMMU mappings
>>> to bits of the guest physical address space which don't exist.
>>>
>>> It is now the case that IOMMU (or not) must be specified at domain
>>> creation time, which is ahead of creating PoD pages.  Certainly as far
>>> as Xen is concerned, the logic probably wants reversing to have
>>> add_to_physmap&friends reject PoD if an IOMMU was configured.
>>>
>>> A toolstack could, in principle, defer the decision to first device
>>> assignment.
>> Right, which is what I consider the preferred approach.
> 
> Why?
> 
> Just because something is technically possible, does not mean it is an 
> appropriate or clever thing to do.
> 
> In this case, we're talking about extra complexity in Xen and the 
> toolstack, which in the very best case comes with unattractive user 
> experience properties, to "fix" an issue which doesn't happen in practice.

IOW you're suggesting to wait for the first report of this being a problem.

>>> and liable to suffer -ENOMEM,
>> Not if (as suggested) we first check that the PoD cache is large enough
>> to cover all PoD entries.
> 
> Just because at this instant we have enough free RAM to force-populate 
> all PoD entries doesn't mean the same is true in 2 minutes time after 
> we've been slowly force-populating a massive VM.
> 
> Yes, there are heuristics we can use to short-circuit the failure early, 
> but that's still spelt -ENOMEM and reported to the user as such.
> 
> The only way to succeed here is to force populate the VM and to have not 
> suffered -ENOMEM by the end of this task.

I'm afraid I can't follow you here at all. The PoD cache is memory already
owned by the guest. As long as no new PoD entries get made out of thin air
(i.e. other than taking the backing page and placing it in the PoD cache),
there's no -ENOMEM possible here. That's precisely why entry count wants
to be checked against count of "PoD cache" pages to be sure.

>>> or we have
>>> to reject a control operation with -EBUSY for a task which is dependent
>>> on the guest kernel actions in a known-buggy area.
>> Why reject anything?
> 
> Because the guest kernel has no knowledge of nor the ability to query 
> the PoD status of a page, the only way to not have things malfunction is 
> to enforce that there are no P2M entries of type PoD when devices are 
> assigned.
> 
> If you don't want to / can't force-populate the entire VM prior to 
> having device assigned, then the assign operation needs to fail.

Well, yes, that's what I have been saying form the beginning. All we
appear to disagree on is whether tool stack or hypervisor should
actually put effort in doing such a force-populate.

>>> There is no point trying to make this work.  If a user wants a device,
>>> they don't get to have PoD.  Anything else is a waste of time and effort
>>> on our behalf for a usecase that doesn't exist in practice.
>> Not sure where you take the latter from. I suppose I'll submit the patch
>> as I have it now (once I have properly resolved dependencies on other
>> patches I have queued and/or pending), and if that's not deemed acceptable
>> plus if at the same time I don't really agree with proposed alternatives,
>> I'll leave fixing the bug to someone else. Of course the expectation then
>> is that such a bug fix come forward within a reasonable time frame ...
> 
> What bug?  PoD and PCI Passthrough are mutually exclusive technologies.

I wonder in how far you've read my earlier mails properly. After initially
only suspecting this might be possible, I did _verify_ that I can assign a
device with the guest still in PoD mode, including before the balloon
driver has kicked in (in which case even force-populate wouldn't help, i.e.
assignment ought to fail no matter what). While initially I thought this
would have been an unintended side effect of f89f555827a6 ("remove late
(on-demand) construction of IOMMU page tables"), I now think this has been
an issue even before. There's no check in the hypervisor (in particular arch_iommu_use_permitted() hasn't been checking for PoD so far, which is
used during assignment only anyway), while the tool stack checks only
during domain construction afaics (in libxl__domain_config_setdefault()).

Jan



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-11-18  8:09 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-13  9:02 PCI pass-through vs PoD Jan Beulich
2021-11-17  8:39 ` Jan Beulich
2021-11-17  8:55   ` Roger Pau Monné
2021-11-17 10:13     ` Jan Beulich
2021-11-17 11:09       ` Andrew Cooper
2021-11-17 11:23         ` Jan Beulich
2021-11-17 13:07           ` Andrew Cooper
2021-11-18  8:08             ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.