* PCI pass-through vs PoD @ 2021-09-13 9:02 Jan Beulich 2021-11-17 8:39 ` Jan Beulich 0 siblings, 1 reply; 8+ messages in thread From: Jan Beulich @ 2021-09-13 9:02 UTC (permalink / raw) To: xen-devel Cc: Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu Hello, libxl__domain_config_setdefault() checks whether PoD is going to be enabled and fails domain creation if at the same time devices would get assigned. Nevertheless setting up of IOMMU page tables is allowed. However, when later assigning a device to a domain which has IOMMU page tables, libxl__device_pci_add() does not appear to be concerned of PoD: - xc_test_assign_device() / XEN_DOMCTL_test_assign_device only check the device for being available to assign, - libxl__device_pci_setdefault() is only concerned about the RDM policy, - other functions called look to not be related to such checking at all. IMO assignment should fail if pod.count != pod.entry_count, and all PoD entries should be resolved otherwise (whether explicitly by the hypervisor or through some suitable existing hypercalls - didn't check yet whether there are any reasonable candidates - by the tool stack is secondary). Jan ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: PCI pass-through vs PoD 2021-09-13 9:02 PCI pass-through vs PoD Jan Beulich @ 2021-11-17 8:39 ` Jan Beulich 2021-11-17 8:55 ` Roger Pau Monné 0 siblings, 1 reply; 8+ messages in thread From: Jan Beulich @ 2021-11-17 8:39 UTC (permalink / raw) To: xen-devel Cc: Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu On 13.09.2021 11:02, Jan Beulich wrote: > libxl__domain_config_setdefault() checks whether PoD is going to be > enabled and fails domain creation if at the same time devices would get > assigned. Nevertheless setting up of IOMMU page tables is allowed. > However, when later assigning a device to a domain which has IOMMU page > tables, libxl__device_pci_add() does not appear to be concerned of PoD: > - xc_test_assign_device() / XEN_DOMCTL_test_assign_device only check the > device for being available to assign, > - libxl__device_pci_setdefault() is only concerned about the RDM policy, > - other functions called look to not be related to such checking at all. I've now verified this to be the case. In fact creating the guest and assigning it a device while the guest still sits in the boot loader allowed the (oldish) Linux guest I've been using to recognize the device (and hence load its driver) even without any hotplug driver. Obviously while still in the boot loader ... > IMO assignment should fail if pod.count != pod.entry_count, ... this holds, and hence assignment should have failed. IOW this approach currently is a simple "workaround" to avoid the "PCI device assignment for HVM guest failed due to PoD enabled" error upon domain creation. I'll see if I can find a reasonable place to add the missing check; I'm less certain about ... > and all PoD > entries should be resolved otherwise (whether explicitly by the > hypervisor or through some suitable existing hypercalls - didn't check > yet whether there are any reasonable candidates - by the tool stack is > secondary). ... the approach to take here. Jan ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: PCI pass-through vs PoD 2021-11-17 8:39 ` Jan Beulich @ 2021-11-17 8:55 ` Roger Pau Monné 2021-11-17 10:13 ` Jan Beulich 0 siblings, 1 reply; 8+ messages in thread From: Roger Pau Monné @ 2021-11-17 8:55 UTC (permalink / raw) To: Jan Beulich Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote: > On 13.09.2021 11:02, Jan Beulich wrote: > > libxl__domain_config_setdefault() checks whether PoD is going to be > > enabled and fails domain creation if at the same time devices would get > > assigned. Nevertheless setting up of IOMMU page tables is allowed. I'm unsure whether allowing enabling the IOMMU with PoD is the right thing to do, at least for our toolstack. > > However, when later assigning a device to a domain which has IOMMU page > > tables, libxl__device_pci_add() does not appear to be concerned of PoD: > > - xc_test_assign_device() / XEN_DOMCTL_test_assign_device only check the > > device for being available to assign, > > - libxl__device_pci_setdefault() is only concerned about the RDM policy, > > - other functions called look to not be related to such checking at all. > > I've now verified this to be the case. In fact creating the guest and > assigning it a device while the guest still sits in the boot loader > allowed the (oldish) Linux guest I've been using to recognize the device > (and hence load its driver) even without any hotplug driver. Obviously > while still in the boot loader ... > > > IMO assignment should fail if pod.count != pod.entry_count, > > ... this holds, and hence assignment should have failed. > > IOW this approach currently is a simple "workaround" to avoid the "PCI > device assignment for HVM guest failed due to PoD enabled" error upon > domain creation. > > I'll see if I can find a reasonable place to add the missing check; I'm > less certain about ... > > > and all PoD > > entries should be resolved otherwise (whether explicitly by the > > hypervisor or through some suitable existing hypercalls - didn't check > > yet whether there are any reasonable candidates - by the tool stack is > > secondary). > > ... the approach to take here. I think forcing all entries to be resolved would be unexpected when assigning a device. I would rather print a message saying that either the guest must balloon down to the requested amount of memory, or that all entries should be resolved (ie: using mem-set to match the mem-max value). Thanks, Roger. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: PCI pass-through vs PoD 2021-11-17 8:55 ` Roger Pau Monné @ 2021-11-17 10:13 ` Jan Beulich 2021-11-17 11:09 ` Andrew Cooper 0 siblings, 1 reply; 8+ messages in thread From: Jan Beulich @ 2021-11-17 10:13 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu On 17.11.2021 09:55, Roger Pau Monné wrote: > On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote: >> On 13.09.2021 11:02, Jan Beulich wrote: >>> libxl__domain_config_setdefault() checks whether PoD is going to be >>> enabled and fails domain creation if at the same time devices would get >>> assigned. Nevertheless setting up of IOMMU page tables is allowed. > > I'm unsure whether allowing enabling the IOMMU with PoD is the right > thing to do, at least for our toolstack. May I ask about the reasons of you being unsure? >>> However, when later assigning a device to a domain which has IOMMU page >>> tables, libxl__device_pci_add() does not appear to be concerned of PoD: >>> - xc_test_assign_device() / XEN_DOMCTL_test_assign_device only check the >>> device for being available to assign, >>> - libxl__device_pci_setdefault() is only concerned about the RDM policy, >>> - other functions called look to not be related to such checking at all. >> >> I've now verified this to be the case. In fact creating the guest and >> assigning it a device while the guest still sits in the boot loader >> allowed the (oldish) Linux guest I've been using to recognize the device >> (and hence load its driver) even without any hotplug driver. Obviously >> while still in the boot loader ... >> >>> IMO assignment should fail if pod.count != pod.entry_count, >> >> ... this holds, and hence assignment should have failed. >> >> IOW this approach currently is a simple "workaround" to avoid the "PCI >> device assignment for HVM guest failed due to PoD enabled" error upon >> domain creation. >> >> I'll see if I can find a reasonable place to add the missing check; I'm >> less certain about ... >> >>> and all PoD >>> entries should be resolved otherwise (whether explicitly by the >>> hypervisor or through some suitable existing hypercalls - didn't check >>> yet whether there are any reasonable candidates - by the tool stack is >>> secondary). >> >> ... the approach to take here. > > I think forcing all entries to be resolved would be unexpected when > assigning a device. > > I would rather print a message saying that either the guest must > balloon down to the requested amount of memory, or that all entries > should be resolved (ie: using mem-set to match the mem-max value). But ballooning down alone doesn't help. That only ensures there is enough memory in the PoD cache to fill all PoD entries. The PoD entries will get replaced (resolved) only as they get touched. That touching is what I call "resolving them", and what would be needed for assignment to be safe (for the guest). Expecting the guest to do anything about this is imo not very reasonable; it can only be tool stack or the hypervisor effecting this. Jan ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: PCI pass-through vs PoD 2021-11-17 10:13 ` Jan Beulich @ 2021-11-17 11:09 ` Andrew Cooper 2021-11-17 11:23 ` Jan Beulich 0 siblings, 1 reply; 8+ messages in thread From: Andrew Cooper @ 2021-11-17 11:09 UTC (permalink / raw) To: Jan Beulich, Roger Pau Monné Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu On 17/11/2021 10:13, Jan Beulich wrote: > On 17.11.2021 09:55, Roger Pau Monné wrote: >> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote: >>> On 13.09.2021 11:02, Jan Beulich wrote: >>>> libxl__domain_config_setdefault() checks whether PoD is going to be >>>> enabled and fails domain creation if at the same time devices would get >>>> assigned. Nevertheless setting up of IOMMU page tables is allowed. >> I'm unsure whether allowing enabling the IOMMU with PoD is the right >> thing to do, at least for our toolstack. > May I ask about the reasons of you being unsure? PoD and passthrough is a total nonsense. You cannot have IOMMU mappings to bits of the guest physical address space which don't exist. It is now the case that IOMMU (or not) must be specified at domain creation time, which is ahead of creating PoD pages. Certainly as far as Xen is concerned, the logic probably wants reversing to have add_to_physmap&friends reject PoD if an IOMMU was configured. A toolstack could, in principle, defer the decision to first device assignment. However, this is terrible behaviour all around, because one way or another we've got to force-populate all PoD pages (which is potentially minutes worth of work to do), and liable to suffer -ENOMEM, or we have to reject a control operation with -EBUSY for a task which is dependent on the guest kernel actions in a known-buggy area. There is no point trying to make this work. If a user wants a device, they don't get to have PoD. Anything else is a waste of time and effort on our behalf for a usecase that doesn't exist in practice. ~Andrew ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: PCI pass-through vs PoD 2021-11-17 11:09 ` Andrew Cooper @ 2021-11-17 11:23 ` Jan Beulich 2021-11-17 13:07 ` Andrew Cooper 0 siblings, 1 reply; 8+ messages in thread From: Jan Beulich @ 2021-11-17 11:23 UTC (permalink / raw) To: Andrew Cooper, Roger Pau Monné Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu On 17.11.2021 12:09, Andrew Cooper wrote: > On 17/11/2021 10:13, Jan Beulich wrote: >> On 17.11.2021 09:55, Roger Pau Monné wrote: >>> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote: >>>> On 13.09.2021 11:02, Jan Beulich wrote: >>>>> libxl__domain_config_setdefault() checks whether PoD is going to be >>>>> enabled and fails domain creation if at the same time devices would get >>>>> assigned. Nevertheless setting up of IOMMU page tables is allowed. >>> I'm unsure whether allowing enabling the IOMMU with PoD is the right >>> thing to do, at least for our toolstack. >> May I ask about the reasons of you being unsure? > > PoD and passthrough is a total nonsense. You cannot have IOMMU mappings > to bits of the guest physical address space which don't exist. > > It is now the case that IOMMU (or not) must be specified at domain > creation time, which is ahead of creating PoD pages. Certainly as far > as Xen is concerned, the logic probably wants reversing to have > add_to_physmap&friends reject PoD if an IOMMU was configured. > > A toolstack could, in principle, defer the decision to first device > assignment. Right, which is what I consider the preferred approach. > However, this is terrible behaviour all around, because one way or > another we've got to force-populate all PoD pages (which is potentially > minutes worth of work to do), Sure. > and liable to suffer -ENOMEM, Not if (as suggested) we first check that the PoD cache is large enough to cover all PoD entries. > or we have > to reject a control operation with -EBUSY for a task which is dependent > on the guest kernel actions in a known-buggy area. Why reject anything? > There is no point trying to make this work. If a user wants a device, > they don't get to have PoD. Anything else is a waste of time and effort > on our behalf for a usecase that doesn't exist in practice. Not sure where you take the latter from. I suppose I'll submit the patch as I have it now (once I have properly resolved dependencies on other patches I have queued and/or pending), and if that's not deemed acceptable plus if at the same time I don't really agree with proposed alternatives, I'll leave fixing the bug to someone else. Of course the expectation then is that such a bug fix come forward within a reasonable time frame ... Jan ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: PCI pass-through vs PoD 2021-11-17 11:23 ` Jan Beulich @ 2021-11-17 13:07 ` Andrew Cooper 2021-11-18 8:08 ` Jan Beulich 0 siblings, 1 reply; 8+ messages in thread From: Andrew Cooper @ 2021-11-17 13:07 UTC (permalink / raw) To: Jan Beulich, Roger Pau Monné Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu On 17/11/2021 11:23, Jan Beulich wrote: > On 17.11.2021 12:09, Andrew Cooper wrote: >> On 17/11/2021 10:13, Jan Beulich wrote: >>> On 17.11.2021 09:55, Roger Pau Monné wrote: >>>> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote: >>>>> On 13.09.2021 11:02, Jan Beulich wrote: >>>>>> libxl__domain_config_setdefault() checks whether PoD is going to be >>>>>> enabled and fails domain creation if at the same time devices would get >>>>>> assigned. Nevertheless setting up of IOMMU page tables is allowed. >>>> I'm unsure whether allowing enabling the IOMMU with PoD is the right >>>> thing to do, at least for our toolstack. >>> May I ask about the reasons of you being unsure? >> PoD and passthrough is a total nonsense. You cannot have IOMMU mappings >> to bits of the guest physical address space which don't exist. >> >> It is now the case that IOMMU (or not) must be specified at domain >> creation time, which is ahead of creating PoD pages. Certainly as far >> as Xen is concerned, the logic probably wants reversing to have >> add_to_physmap&friends reject PoD if an IOMMU was configured. >> >> A toolstack could, in principle, defer the decision to first device >> assignment. > Right, which is what I consider the preferred approach. Why? Just because something is technically possible, does not mean it is an appropriate or clever thing to do. In this case, we're talking about extra complexity in Xen and the toolstack, which in the very best case comes with unattractive user experience properties, to "fix" an issue which doesn't happen in practice. >> and liable to suffer -ENOMEM, > Not if (as suggested) we first check that the PoD cache is large enough > to cover all PoD entries. Just because at this instant we have enough free RAM to force-populate all PoD entries doesn't mean the same is true in 2 minutes time after we've been slowly force-populating a massive VM. Yes, there are heuristics we can use to short-circuit the failure early, but that's still spelt -ENOMEM and reported to the user as such. The only way to succeed here is to force populate the VM and to have not suffered -ENOMEM by the end of this task. >> or we have >> to reject a control operation with -EBUSY for a task which is dependent >> on the guest kernel actions in a known-buggy area. > Why reject anything? Because the guest kernel has no knowledge of nor the ability to query the PoD status of a page, the only way to not have things malfunction is to enforce that there are no P2M entries of type PoD when devices are assigned. If you don't want to / can't force-populate the entire VM prior to having device assigned, then the assign operation needs to fail. >> There is no point trying to make this work. If a user wants a device, >> they don't get to have PoD. Anything else is a waste of time and effort >> on our behalf for a usecase that doesn't exist in practice. > Not sure where you take the latter from. I suppose I'll submit the patch > as I have it now (once I have properly resolved dependencies on other > patches I have queued and/or pending), and if that's not deemed acceptable > plus if at the same time I don't really agree with proposed alternatives, > I'll leave fixing the bug to someone else. Of course the expectation then > is that such a bug fix come forward within a reasonable time frame ... What bug? PoD and PCI Passthrough are mutually exclusive technologies. We can (now) tell up front when a VM is configured with these mutually exclusive options. Such a configuration should be rejected as early as possible. What you're talking about is introducing extra complexity to explicitly support running the VM in a known-incompatible configuration, with the decision point for fixing said incompatibility deferred until runtime and now with a possibility of "genuinely can't this to become compatible". Failing device assignment (potentially after a multi-minute wait) with "well you shouldn't have enabled PoD to begin with, you fool" is clearly worse behaviour than refusing to create such a VM in the first place, and you need a far far better reason than "because it's technically possible" to justify doing this. ~Andrew ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: PCI pass-through vs PoD 2021-11-17 13:07 ` Andrew Cooper @ 2021-11-18 8:08 ` Jan Beulich 0 siblings, 0 replies; 8+ messages in thread From: Jan Beulich @ 2021-11-18 8:08 UTC (permalink / raw) To: Andrew Cooper Cc: xen-devel, Anthony Perard, Ian Jackson, Paul Durrant, George Dunlap, Wei Liu, Roger Pau Monné On 17.11.2021 14:07, Andrew Cooper wrote: > On 17/11/2021 11:23, Jan Beulich wrote: >> On 17.11.2021 12:09, Andrew Cooper wrote: >>> On 17/11/2021 10:13, Jan Beulich wrote: >>>> On 17.11.2021 09:55, Roger Pau Monné wrote: >>>>> On Wed, Nov 17, 2021 at 09:39:17AM +0100, Jan Beulich wrote: >>>>>> On 13.09.2021 11:02, Jan Beulich wrote: >>>>>>> libxl__domain_config_setdefault() checks whether PoD is going to be >>>>>>> enabled and fails domain creation if at the same time devices would get >>>>>>> assigned. Nevertheless setting up of IOMMU page tables is allowed. >>>>> I'm unsure whether allowing enabling the IOMMU with PoD is the right >>>>> thing to do, at least for our toolstack. >>>> May I ask about the reasons of you being unsure? >>> PoD and passthrough is a total nonsense. You cannot have IOMMU mappings >>> to bits of the guest physical address space which don't exist. >>> >>> It is now the case that IOMMU (or not) must be specified at domain >>> creation time, which is ahead of creating PoD pages. Certainly as far >>> as Xen is concerned, the logic probably wants reversing to have >>> add_to_physmap&friends reject PoD if an IOMMU was configured. >>> >>> A toolstack could, in principle, defer the decision to first device >>> assignment. >> Right, which is what I consider the preferred approach. > > Why? > > Just because something is technically possible, does not mean it is an > appropriate or clever thing to do. > > In this case, we're talking about extra complexity in Xen and the > toolstack, which in the very best case comes with unattractive user > experience properties, to "fix" an issue which doesn't happen in practice. IOW you're suggesting to wait for the first report of this being a problem. >>> and liable to suffer -ENOMEM, >> Not if (as suggested) we first check that the PoD cache is large enough >> to cover all PoD entries. > > Just because at this instant we have enough free RAM to force-populate > all PoD entries doesn't mean the same is true in 2 minutes time after > we've been slowly force-populating a massive VM. > > Yes, there are heuristics we can use to short-circuit the failure early, > but that's still spelt -ENOMEM and reported to the user as such. > > The only way to succeed here is to force populate the VM and to have not > suffered -ENOMEM by the end of this task. I'm afraid I can't follow you here at all. The PoD cache is memory already owned by the guest. As long as no new PoD entries get made out of thin air (i.e. other than taking the backing page and placing it in the PoD cache), there's no -ENOMEM possible here. That's precisely why entry count wants to be checked against count of "PoD cache" pages to be sure. >>> or we have >>> to reject a control operation with -EBUSY for a task which is dependent >>> on the guest kernel actions in a known-buggy area. >> Why reject anything? > > Because the guest kernel has no knowledge of nor the ability to query > the PoD status of a page, the only way to not have things malfunction is > to enforce that there are no P2M entries of type PoD when devices are > assigned. > > If you don't want to / can't force-populate the entire VM prior to > having device assigned, then the assign operation needs to fail. Well, yes, that's what I have been saying form the beginning. All we appear to disagree on is whether tool stack or hypervisor should actually put effort in doing such a force-populate. >>> There is no point trying to make this work. If a user wants a device, >>> they don't get to have PoD. Anything else is a waste of time and effort >>> on our behalf for a usecase that doesn't exist in practice. >> Not sure where you take the latter from. I suppose I'll submit the patch >> as I have it now (once I have properly resolved dependencies on other >> patches I have queued and/or pending), and if that's not deemed acceptable >> plus if at the same time I don't really agree with proposed alternatives, >> I'll leave fixing the bug to someone else. Of course the expectation then >> is that such a bug fix come forward within a reasonable time frame ... > > What bug? PoD and PCI Passthrough are mutually exclusive technologies. I wonder in how far you've read my earlier mails properly. After initially only suspecting this might be possible, I did _verify_ that I can assign a device with the guest still in PoD mode, including before the balloon driver has kicked in (in which case even force-populate wouldn't help, i.e. assignment ought to fail no matter what). While initially I thought this would have been an unintended side effect of f89f555827a6 ("remove late (on-demand) construction of IOMMU page tables"), I now think this has been an issue even before. There's no check in the hypervisor (in particular arch_iommu_use_permitted() hasn't been checking for PoD so far, which is used during assignment only anyway), while the tool stack checks only during domain construction afaics (in libxl__domain_config_setdefault()). Jan ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2021-11-18 8:09 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-09-13 9:02 PCI pass-through vs PoD Jan Beulich 2021-11-17 8:39 ` Jan Beulich 2021-11-17 8:55 ` Roger Pau Monné 2021-11-17 10:13 ` Jan Beulich 2021-11-17 11:09 ` Andrew Cooper 2021-11-17 11:23 ` Jan Beulich 2021-11-17 13:07 ` Andrew Cooper 2021-11-18 8:08 ` Jan Beulich
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.