* IGD pass-through failures since 4.10.
@ 2022-02-14  6:00 Dr. Greg
  2022-02-14  8:56 ` Jan Beulich
  2022-02-14  9:21 ` Roger Pau Monné
  0 siblings, 2 replies; 9+ messages in thread
From: Dr. Greg @ 2022-02-14  6:00 UTC (permalink / raw)
  To: xen-devel

Good morning, I hope the week is starting well for everyone.

We've made extensive use of PCI based graphics pass through for many
years, since around Xen 4.2.  In fact, we maintained a set of patches
for ATI cards against qemu-traditional that have seen a lot of
downloads from our FTP site.

We ended up switching to IGD based graphics a couple of years ago and
built a stack on top of Xen 4.10 using qemu-traditional.  That
coincided with our transition from Windows 7 to Windows 10.

We've never enjoyed anywhere near the stability with IGD/Windows-10
that we had with the ATI/Windows-7 desktops, i.e. we see fairly
frequent crashes, lockups, reduced performance, etc.  The ATI/Windows-7
desktops were almost astonishingly reliable, i.e. hundreds of
consecutive Windows VM boot/passthrough cycles.

In order to try and address this issue we set out to upgrade our
workstation infrastructure.  Unfortunately we haven't found anything
that has worked post 4.10.

To be precise, 4.11 with qemu-traditional works, but upon exit from
the virtual machine, to which the graphics adapter and USB controller
are passed through, neither the USB controller nor the graphics
controller can be re-initialized and re-attached to the Dom0
instance.

It appears to be a problem with mapping interrupts back to dom0 given
that we see the following:

Feb 10 08:16:05 hostname kernel: xhci_hcd 0000:00:14.0: xen map irq failed -19 for 32752 domain

Feb 10 08:16:05 hostname kernel: i915 0000:00:02.0: xen map irq failed -19 for 32752 domain

Feb 10 08:16:12 hostname kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID

At that point the monitor shows green and black bars and the USB
controller doesn't function.

Upstream QEMU doesn't work at all: the qemu-system-i386 process
fails, xl catches the failure and tries to re-start the domain, which
remains dead to the world and has to be destroyed.

We revved up to the most current 4.14.x release, but that acts exactly
the same way that 4.11.x does.  We've built up the most recent 4.15.x
release, so that we would be testing the most current release that
still supports qemu-traditional, but haven't been able to get the
testing done yet.  Given our current experiences, I would be surprised
if it would work.

We've tentatively tracked the poor Windows 10 performance down to the
hypervisor emitting hundreds of thousands of IOMMU/DMA violations.  We
made those go away by disabling the IGD IOMMU but that doesn't fix the
problem with upstream QEMU being able to boot the Windows instance,
nor does it fix the problem with remapping the device interrupts back
to Dom0 on domain exit.
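
For anyone following along, a sketch of the usual knob for that, shown
only as an illustration rather than a record of our exact setup (check
docs/misc/xen-command-line in your tree for the precise spelling):

# Appended to the Xen (hypervisor) command line in the bootloader
# entry, not to the dom0 kernel line; disables DMA translation for
# the integrated graphics device only, leaving the IOMMU enabled
# for every other device.
iommu=no-igfx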

The 4.10 based stack had been running with 16 GB of memory in the
DomU Windows instances.  Based on some online comments, we tested
guests with 4 GB of RAM, but that doesn't impact the issues we are
seeing.

We've tested with the most recent 5.4 and 5.10 Linux kernels but the
Dom0 kernel version doesn't seem to have any impact on the issues we
are seeing.

We'd be interested in any comments/suggestions the group may have.  We
have the in-house skills to do fairly significant investigations and
would like to improve the performance of IGD pass-through for other
users of what is a fairly useful and ubiquitous (IGD) technology.

Have a good day.

Dr. Greg

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: dg@enjellic.com
------------------------------------------------------------------------------
"My thoughts on the composition and effectiveness of the advisory
 committee?

 I think they are destined to accomplish about the same thing as what
 you would get from locking 9 chimpanzees in a room with an armed
 thermonuclear weapon and a can opener with orders to disarm it."
                                -- Dr. Greg Wettstein
                                   Resurrection



* Re: IGD pass-through failures since 4.10.
  2022-02-14  6:00 IGD pass-through failures since 4.10 Dr. Greg
@ 2022-02-14  8:56 ` Jan Beulich
  2022-02-17 20:15   ` Dr. Greg
  2022-02-14  9:21 ` Roger Pau Monné
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2022-02-14  8:56 UTC (permalink / raw)
  To: Dr. Greg; +Cc: xen-devel

On 14.02.2022 07:00, Dr. Greg wrote:
> Good morning, I hope the week is starting well for everyone.
> 
> We've made extensive use of PCI based graphics pass through for many
> years, since around Xen 4.2.  In fact, we maintained a set of patches
> for ATI cards against qemu-traditional that have seen a lot of
> downloads from our FTP site.
> 
> We ended up switching to IGD based graphics a couple of years ago and
> built a stack on top of Xen 4.10 using qemu-traditional.  That
> coincided with our transition from Windows 7 to Windows 10.
> 
> We've never enjoyed anywhere near the stability with IGD/Windows-10
> that we had with the ATI/Windows-7 desktops, i.e. we see fairly
> frequent crashes, lockups, reduced performance, etc.  The ATI/Windows-7
> desktops were almost astonishingly reliable, i.e. hundreds of
> consecutive Windows VM boot/passthrough cycles.
> 
> In order to try and address this issue we set out to upgrade our
> workstation infrastructure.  Unfortunately we haven't found anything
> that has worked post 4.10.
> 
> To be precise, 4.11 with qemu-traditional works, but upon exit from
> the virtual machine, to which the graphics adapter and USB controller
> are passed through, neither the USB controller nor the graphics
> controller can be re-initialized and re-attached to the Dom0
> instance.
> 
> It appears to be a problem with mapping interrupts back to dom0 given
> that we see the following:
> 
> Feb 10 08:16:05 hostname kernel: xhci_hcd 0000:00:14.0: xen map irq failed -19 for 32752 domain
> 
> Feb 10 08:16:05 hostname kernel: i915 0000:00:02.0: xen map irq failed -19 for 32752 domain
> 
> Feb 10 08:16:12 hostname kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID

Just on this one aspect: It depends a lot on what precisely you've used
as 4.10 before. Was this the plain 4.10.4 release, or did you track the
stable branch, accumulating security fixes? In the former case I would
suspect device quarantining to be getting in your way. In which case
it would be relevant to know what exactly "re-attach to the Dom0" means
in your case.

Which brings me to this more general remark: What you describe sounds
like a number of possibly independent problems. I'm afraid it'll be
difficult for anyone to help without you drilling further down into
what lower level operations are actually causing trouble. It also feels
as if things may have ended up working for you on 4.10 just by chance.

I'm sorry that I'm not really of any help here,
Jan


* Re: IGD pass-through failures since 4.10.
  2022-02-14  6:00 IGD pass-through failures since 4.10 Dr. Greg
  2022-02-14  8:56 ` Jan Beulich
@ 2022-02-14  9:21 ` Roger Pau Monné
  2022-02-18 23:12   ` Dr. Greg
  1 sibling, 1 reply; 9+ messages in thread
From: Roger Pau Monné @ 2022-02-14  9:21 UTC (permalink / raw)
  To: Dr. Greg; +Cc: xen-devel

On Mon, Feb 14, 2022 at 12:00:11AM -0600, Dr. Greg wrote:
> Good morning, I hope the week is starting well for everyone.
> 
> We've made extensive use of PCI based graphics pass through for many
> years, since around Xen 4.2.  In fact, we maintained a set of patches
> for ATI cards against qemu-traditional that have seen a lot of
> downloads from our FTP site.
> 
> We ended up switching to IGD based graphics a couple of years ago and
> built a stack on top of Xen 4.10 using qemu-traditional.  That
> coincided with our transition from Windows 7 to Windows 10.
> 
> We've never enjoyed anywhere near the stability with IGD/Windows-10
> that we had with the ATI/Windows-7 desktops, i.e. we see fairly
> frequent crashes, lockups, reduced performance, etc.  The ATI/Windows-7
> desktops were almost astonishingly reliable, i.e. hundreds of
> consecutive Windows VM boot/passthrough cycles.
> 
> In order to try and address this issue we set out to upgrade our
> workstation infrastructure.  Unfortunately we haven't found anything
> that has worked post 4.10.
> 
> To be precise, 4.11 with qemu-traditional works, but upon exit from
> the virtual machine, to which the graphics adapter and USB controller
> are passed through, neither the USB controller nor the graphics
> controller can be re-initialized and re-attached to the Dom0
> instance.
> 
> It appears to be a problem with mapping interrupts back to dom0 given
> that we see the following:
> 
> Feb 10 08:16:05 hostname kernel: xhci_hcd 0000:00:14.0: xen map irq failed -19 for 32752 domain
> 
> Feb 10 08:16:05 hostname kernel: i915 0000:00:02.0: xen map irq failed -19 for 32752 domain
> 
> Feb 10 08:16:12 hostname kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID

Are you testing with a hypervisor with debug enabled? If not, please
build one and see whether the error also produces any messages in the
Xen dmesg (use `xl dmesg` if you don't have a serial console attached
to the box). Posting full Linux and Xen dmesgs (Xen built with
debug=y) could also help.

PHYSDEVOP_map_pirq is failing but without further information it's
impossible to limit the scope of the issue (and whether the issue is
with PHYSDEVOP_map_pirq or some previous operation).

Thanks, Roger.



* Re: IGD pass-through failures since 4.10.
  2022-02-14  8:56 ` Jan Beulich
@ 2022-02-17 20:15   ` Dr. Greg
  2022-02-18  7:04     ` Jan Beulich
  0 siblings, 1 reply; 9+ messages in thread
From: Dr. Greg @ 2022-02-17 20:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dr. Greg, xen-devel

On Mon, Feb 14, 2022 at 09:56:34AM +0100, Jan Beulich wrote:

Good morning, I hope the day is starting well for everyone, Jan thanks
for taking the time to reply.

> On 14.02.2022 07:00, Dr. Greg wrote:

> > It appears to be a problem with mapping interrupts back to dom0 given
> > that we see the following:
> > 
> > Feb 10 08:16:05 hostname kernel: xhci_hcd 0000:00:14.0: xen map irq failed -19 for 32752 domain
> > 
> > Feb 10 08:16:05 hostname kernel: i915 0000:00:02.0: xen map irq failed -19 for 32752 domain
> > 
> > Feb 10 08:16:12 hostname kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID

> Just on this one aspect: It depends a lot on what precisely you've used
> as 4.10 before. Was this the plain 4.10.4 release, or did you track
> the stable branch, accumulating security fixes?

It was based on the Xen GIT tree with a small number of modifications
that had been implemented by Intel to support their IGD
virtualization.

We did not end up using 'IGD virtualization', for a number of
technical reasons, instead we reverted back to using straight device
passthrough with qemu-traditional that we had previously been using.

If it would end up being useful, we could come up with a diff between
the stock 4.10.4 tag and the codebase we used.

One of the purposes of the infrastructure upgrade was to try and get
on a completely mainline Xen source tree.

> would suspect device quarantining to be getting in your way. In
> which case it would be relevant to know what exactly "re-attach to
> the Dom0" means in your case.

Re-attach to Dom0 means to unbind the device from the pciback driver
and then bind the device to its original driver.  In the logs noted
above, the xhci_hcd driver to the USB controller and the i915 driver
to the IGD hardware.

It is the same strategy, same script actually, that we have been using
for 8+ years.

In the case of the logs above, the following command sequence is being
executed upon termination of the domain:

# Unbind devices.
echo 0000:00:14.0 >| /sys/bus/pci/drivers/pciback/unbind
echo 0000:00:02.0 >| /sys/bus/pci/drivers/pciback/unbind

# Rebind devices.
echo 0000:00:14.0 >| /sys/bus/pci/drivers/xhci_hcd/bind
echo 0000:00:02.0 >| /sys/bus/pci/drivers/i915/bind

Starting with the stock 4.11.4 release, the Dom0 re-attachment fails
with the 'xen_map_irq' failures being logged.

> Which brings me to this more general remark: What you describe sounds
> like a number of possibly independent problems. I'm afraid it'll be
> difficult for anyone to help without you drilling further down into
> what lower level operations are actually causing trouble. It also feels
> as if things may have ended up working for you on 4.10 just by
> chance.

I think the issue comes down to something that the hypervisor does, on
behalf of the domain doing the passthrough, as part of whatever
qemu-traditional needs to do in order to facilitate the attachment of
the PCI devices to the domain.

Running the detach/re-attach operation works perfectly in absence of
qemu-traditional being started in the domain.  The failure to
re-attach only occurs after qemu-traditional has been run in the
domain.

> I'm sorry that I'm not really of any help here,

Actually your reflections have been helpful.

Perhaps the most important clarification that we could get, for posterity
in this thread, is whether or not IGD pass-through is actually
supported in the mind of the Xen team.

According to the Xen web-site, IGD PCI pass-through is documented as
working with the following combinations:

Xen 4.11.x: QEMU >= 3.1

Xen 4.14.x: QEMU >= 5.2

We are currently having IGD pass-through with qemu-dm (3.1/5.2) fail
completely in those combinations.

Pass through with qemu-traditional works with 4.11.x but the
re-attachment fails.  On 4.14.x, execution of qemu-traditional is
failing secondary to some type of complaint about the inability to
determine the CPU type, which is some other issue that we haven't been
able to run down yet.

Those tests were done with builds from stock tagged releases in the
Xen GIT tree.

So it may be helpful to verify whether or not any of this is expected
to work, and if not, the Xen web-site would seem to need correction.

> Jan

Hopefully the above is helpful; I will be replying to Roger's
comments later.

Have a good day.

Dr. Greg

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686            EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"If your doing something the same way you have been doing it for ten years,
 the chances are you are doing it wrong."
                                -- Charles Kettering



* Re: IGD pass-through failures since 4.10.
  2022-02-17 20:15   ` Dr. Greg
@ 2022-02-18  7:04     ` Jan Beulich
  2022-02-22 18:52       ` Dr. Greg
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2022-02-18  7:04 UTC (permalink / raw)
  To: Dr. Greg; +Cc: xen-devel

On 17.02.2022 21:15, Dr. Greg wrote:
> On Mon, Feb 14, 2022 at 09:56:34AM +0100, Jan Beulich wrote:
>> On 14.02.2022 07:00, Dr. Greg wrote:
>>> It appears to be a problem with mapping interrupts back to dom0 given
>>> that we see the following:
>>>
>>> Feb 10 08:16:05 hostname kernel: xhci_hcd 0000:00:14.0: xen map irq failed -19 for 32752 domain
>>>
>>> Feb 10 08:16:05 hostname kernel: i915 0000:00:02.0: xen map irq failed -19 for 32752 domain
>>>
>>> Feb 10 08:16:12 hostname kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID
> 
>> Just on this one aspect: It depends a lot on what precisely you've used
>> as 4.10 before. Was this the plain 4.10.4 release, or did you track
>> the stable branch, accumulating security fixes?
> 
> It was based on the Xen GIT tree with a small number of modifications
> that had been implemented by Intel to support their IGD
> virtualization.
> 
> We did not end up using 'IGD virtualization', for a number of
> technical reasons, instead we reverted back to using straight device
> passthrough with qemu-traditional that we had previously been using.
> 
> If it would end up being useful, we could come up with a diff between
> the stock 4.10.4 tag and the codebase we used.
> 
> One of the purposes of the infrastructure upgrade was to try and get
> on a completely mainline Xen source tree.

Depending on the size of the diff, this may or may not be helpful.
What you sadly didn't state is the precise base version.

>> would suspect device quarantining to be getting in your way. In
>> which case it would be relevant to know what exactly "re-attach to
>> the Dom0" means in your case.
> 
> Re-attach to Dom0 means to unbind the device from the pciback driver
> and then bind the device to its original driver.  In the logs noted
> above, the xhci_hcd driver to the USB controller and the i915 driver
> to the IGD hardware.
> 
> It is the same strategy, same script actually, that we have been using
> for 8+ years.

Right, but in the meantime quarantining has appeared. That wasn't
intended to break "traditional" usage, but ...

> In the case of the logs above, the following command sequence is being
> executed upon termination of the domain:
> 
> # Unbind devices.
> echo 0000:00:14.0 >| /sys/bus/pci/drivers/pciback/unbind
> echo 0000:00:02.0 >| /sys/bus/pci/drivers/pciback/unbind
> 
> # Rebind devices.
> echo 0000:00:14.0 >| /sys/bus/pci/drivers/xhci_hcd/bind
> echo 0000:00:02.0 >| /sys/bus/pci/drivers/i915/bind

... you may still want to try replacing these with
"xl pci-assignable-add ..." / "xl pci-assignable-remove ...".

> Starting with the stock 4.11.4 release, the Dom0 re-attachment fails
> with the 'xen_map_irq' failures being logged.
> 
>> Which brings me to this more general remark: What you describe sounds
>> like a number of possibly independent problems. I'm afraid it'll be
>> difficult for anyone to help without you drilling further down into
>> what lower level operations are actually causing trouble. It also feels
>> as if things may have ended up working for you on 4.10 just by
>> chance.
> 
> I think the issue comes down to something that the hypervisor does, on
> behalf of the domain doing the passthrough, as part of whatever
> qemu-traditional needs to do in order to facilitate the attachment of
> the PCI devices to the domain.
> 
> Running the detach/re-attach operation works perfectly in absence of
> qemu-traditional being started in the domain.  The failure to
> re-attach only occurs after qemu-traditional has been run in the
> domain.

Interesting. This suggests missing cleanup somewhere in the course of
tearing down assignment to the DomU. Without full (and full verbosity)
logs there's unlikely to be a way forward. Even then there's no promise
that the logs would have useful data.

Of course with qemu-trad now being neither security supported nor
recommended to use, you will want (need) to look into moving to
upstream qemu anyway, trying to deal with problems there instead.

>> I'm sorry that I'm not really of any help here,
> 
> Actually your reflections have been helpful.
> 
> Perhaps the most important clarification that we could get, for posterity
> in this thread, is whether or not IGD pass-through is actually
> supported in the mind of the Xen team.
> 
> According to the Xen web-site, IGD PCI pass-through is documented as
> working with the following combinations:
> 
> Xen 4.11.x: QEMU >= 3.1
> 
> Xen 4.14.x: QEMU >= 5.2
> 
> We are currently having IGD pass-through with qemu-dm (3.1/5.2) fail
> completely in those combinations.

I wonder on what basis these statements were added.

Jan




* Re: IGD pass-through failures since 4.10.
  2022-02-14  9:21 ` Roger Pau Monné
@ 2022-02-18 23:12   ` Dr. Greg
  0 siblings, 0 replies; 9+ messages in thread
From: Dr. Greg @ 2022-02-18 23:12 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel

On Mon, Feb 14, 2022 at 10:21:01AM +0100, Roger Pau Monné wrote:

Good afternoon, I hope the week has gone well for everyone.

> On Mon, Feb 14, 2022 at 12:00:11AM -0600, Dr. Greg wrote:

> >
> > [ Material removed ]
> >
> > It appears to be a problem with mapping interrupts back to dom0 given
> > that we see the following:
> > 
> > Feb 10 08:16:05 hostname kernel: xhci_hcd 0000:00:14.0: xen map irq failed -19 for 32752 domain
> > 
> > Feb 10 08:16:05 hostname kernel: i915 0000:00:02.0: xen map irq failed -19 for 32752 domain
> > 
> > Feb 10 08:16:12 hostname kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID

> Are you testing with a hypervisor with debug enabled? If not,
> please build one and see whether the error also produces any
> messages in the Xen dmesg (use `xl dmesg` if you don't have a
> serial console attached to the box). Posting full Linux and Xen
> dmesgs (Xen built with debug=y) could also help.

It was just a stock build out of the GIT tree.

We will get a debug hypervisor built and get traces out of the test
machine and post them to this thread.  I don't believe the dom0
kernel was saying very much about what was going on but we will
verify that.

> PHYSDEVOP_map_pirq is failing but without further information it's
> impossible to limit the scope of the issue (and whether the issue is
> with PHYSDEVOP_map_pirq or some previous operation).

Very useful piece of information to have.

From the log messages above, I assume the kernel is getting ENODEV
from the hypervisor call.  We will see if we can get some targeted
debug statements into the hypervisor to figure out what is going on.

> Thanks, Roger.

Thank you for the follow-up, have a good weekend.

Dr. Greg

As always,
Dr. Greg Wettstein, Ph.D    Worker / Principal Engineer
IDfusion, LLC
4206 19th Ave N.            Specialists in SGX secured infrastructure.
Fargo, ND  58102
PH: 701-281-1686            CELL: 701-361-2319
EMAIL: gw@idfusion.org
------------------------------------------------------------------------------
"Real Programmers consider "what you see is what you get" to be just as
 bad a concept in Text Editors as it is in women.  No, the Real
 Programmer wants a "you asked for it, you got it" text editor --
 complicated, cryptic, powerful, unforgiving, dangerous."
                                -- Matthias Schniedermeyer



* Re: IGD pass-through failures since 4.10.
  2022-02-18  7:04     ` Jan Beulich
@ 2022-02-22 18:52       ` Dr. Greg
  2022-02-23  8:59         ` Jan Beulich
  0 siblings, 1 reply; 9+ messages in thread
From: Dr. Greg @ 2022-02-22 18:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Fri, Feb 18, 2022 at 08:04:14AM +0100, Jan Beulich wrote:

Good morning, I hope the week is advancing well for everyone.

> On 17.02.2022 21:15, Dr. Greg wrote:
> > On Mon, Feb 14, 2022 at 09:56:34AM +0100, Jan Beulich wrote:
> >> On 14.02.2022 07:00, Dr. Greg wrote:
> >>> It appears to be a problem with mapping interrupts back to dom0 given
> >>> that we see the following:
> >>>
> >>> Feb 10 08:16:05 hostname kernel: xhci_hcd 0000:00:14.0: xen map irq failed -19 for 32752 domain
> >>>
> >>> Feb 10 08:16:05 hostname kernel: i915 0000:00:02.0: xen map irq failed -19 for 32752 domain
> >>>
> >>> Feb 10 08:16:12 hostname kernel: xhci_hcd 0000:00:14.0: Error while assigning device slot ID
> > 
> >> Just on this one aspect: It depends a lot on what precisely you've used
> >> as 4.10 before. Was this the plain 4.10.4 release, or did you track
> >> the stable branch, accumulating security fixes?
> > 
> > It was based on the Xen GIT tree with a small number of modifications
> > that had been implemented by Intel to support their IGD
> > virtualization.
> > 
> > We did not end up using 'IGD virtualization', for a number of
> > technical reasons, instead we reverted back to using straight device
> > passthrough with qemu-traditional that we had previously been using.
> > 
> > If it would end up being useful, we could come up with a diff between
> > the stock 4.10.4 tag and the codebase we used.
> > 
> > One of the purposes of the infrastructure upgrade was to try and get
> > on a completely mainline Xen source tree.

> Depending on the size of the diff, this may or may not be helpful.
> What you sadly didn't state is the precise base version.

The stack that is in use is 18 patches beyond what is tagged as the
Xen 4.10 release.

I can generate the diff but most of the patches appear to be
infrastructure changes to support the VGT virtual display devices.

> >> would suspect device quarantining to be getting in your way. In
> >> which case it would be relevant to know what exactly "re-attach to
> >> the Dom0" means in your case.
> > 
> > Re-attach to Dom0 means to unbind the device from the pciback driver
> > and then bind the device to its original driver.  In the logs noted
> > above, the xhci_hcd driver to the USB controller and the i915 driver
> > to the IGD hardware.
> > 
> > It is the same strategy, same script actually, that we have been using
> > for 8+ years.

> Right, but in the meantime quarantining has appeared. That wasn't
> intended to break "traditional" usage, but ...

We just finished testing the 4.15.2 release and we got one successful
execution of the Windows VM under qemu-traditional.  We still have not
gotten the VM to boot and run under upstream qemu.

Testing the upstream qemu version has resulted in the VM not wanting
to boot under anything.  We are now getting a VIDEO_TDR error out of
Windows and are trying to untangle that.

> > In the case of the logs above, the following command sequence is being
> > executed upon termination of the domain:
> > 
> > # Unbind devices.
> > echo 0000:00:14.0 >| /sys/bus/pci/drivers/pciback/unbind
> > echo 0000:00:02.0 >| /sys/bus/pci/drivers/pciback/unbind
> > 
> > # Rebind devices.
> > echo 0000:00:14.0 >| /sys/bus/pci/drivers/xhci_hcd/bind
> > echo 0000:00:02.0 >| /sys/bus/pci/drivers/i915/bind

> ... you may still want to try replacing these with
> "xl pci-assignable-add ..." / "xl pci-assignable-remove ...".

We tested using the 'xl pci-assignable-add/remove' sequences and we
believe this may have resulted in the proper return of the devices to
dom0 but haven't been able to verify that since the Windows VM is now
throwing the VIDEO_TDR error.

Unless we are misunderstanding something the 'xl
pci-assignable-remove' sequence requires the manual re-binding of the
devices to their dom0 drivers.

This seems a bit odd given that running 'xl pci-assignable-add' on a
device that does not have a driver bound warns that no driver was
found and that the device would not be re-bound to its driver.

This needs a bit more testing after we get the basic VM implementation
running once again.

> > Starting with the stock 4.11.4 release, the Dom0 re-attachment fails
> > with the 'xen_map_irq' failures being logged.
> > 
> >> Which brings me to this more general remark: What you describe sounds
> >> like a number of possibly independent problems. I'm afraid it'll be
> >> difficult for anyone to help without you drilling further down into
> >> what lower level operations are actually causing trouble. It also feels
> >> as if things may have ended up working for you on 4.10 just by
> >> chance.
> > 
> > I think the issue comes down to something that the hypervisor does, on
> > behalf of the domain doing the passthrough, as part of whatever
> > qemu-traditional needs to do in order to facilitate the attachment of
> > the PCI devices to the domain.
> > 
> > Running the detach/re-attach operation works perfectly in absence of
> > qemu-traditional being started in the domain.  The failure to
> > re-attach only occurs after qemu-traditional has been run in the
> > domain.

> Interesting. This suggests missing cleanup somewhere in the course
> of tearing down assignment to the DomU. Without full (and full
> verbosity) logs there's unlikely to be a way forward. Even then
> there's no promise that the logs would have useful data.

As soon as we have the VM running again we will deploy a debug-enabled
hypervisor.

> Of course with qemu-trad now being neither security supported nor
> recommended to use, you will want (need) to look into moving to
> upstream qemu anyway, trying to deal with problems there instead.

We have had virtually no luck whatsoever with upstream qemu, all the
way from 4.10 forward to 4.15 at this point.

There are a host of PCI options in the XL configuration file that may
be impacting this, but we have yet to find any good references on
them.
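
For reference, a hedged example of the sort of stanza involved; the
values below are purely illustrative (see xl.cfg(5) for the
authoritative list of per-device options), not our actual or a
known-good IGD configuration:

# Guest configuration fragment (xl.cfg syntax).
gfx_passthru = "igd"                 # IGD-specific handling with upstream QEMU
pci = [ '00:02.0,permissive=1',      # IGD; permissive relaxes config space filtering
        '00:14.0' ]                  # USB controller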

> >> I'm sorry that I'm not really of any help here,
> > 
> > Actually your reflections have been helpful.
> > 
> > Perhaps the most important clarification that we could get, for posterity
> > in this thread, is whether or not IGD pass-through is actually
> > supported in the mind of the Xen team.
> > 
> > According to the Xen web-site, IGD PCI pass-through is documented as
> > working with the following combinations:
> > 
> > Xen 4.11.x: QEMU >= 3.1
> > 
> > Xen 4.14.x: QEMU >= 5.2
> > 
> > We are currently having IGD pass-through with qemu-dm (3.1/5.2) fail
> > completely in those combinations.

> I wonder on what basis these statements were added.

I don't know but you can find them at the following URL:

https://wiki.xenproject.org/wiki/Xen_VGA_Passthrough

Under the following heading:

'Status of VGA graphics passthru in Xen'

The section is quite prescriptive with respect to what is supposed to
work.

That document gave us the impression that we would be on solid ground
moving to more recent versions of the stack including upstream QEMU
but that currently has not been our experience.

Unfortunately we are now in a position where testing seems to have
resulted in a virtual machine that no longer works on the 4.10.x
stack.... :-(

> Jan

Have a good day.

Dr. Greg

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686            EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"On the other hand, the Linux philosophy is 'laugh in the face of
 danger'.  Oops.  Wrong one.  'Do it yourself'.  Thats it."
                                -- Linus Torvalds



* Re: IGD pass-through failures since 4.10.
  2022-02-22 18:52       ` Dr. Greg
@ 2022-02-23  8:59         ` Jan Beulich
  2022-02-25  0:16           ` Dr. Greg
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2022-02-23  8:59 UTC (permalink / raw)
  To: Dr. Greg; +Cc: xen-devel

On 22.02.2022 19:52, Dr. Greg wrote:
> On Fri, Feb 18, 2022 at 08:04:14AM +0100, Jan Beulich wrote:
>> On 17.02.2022 21:15, Dr. Greg wrote:
>>> On Mon, Feb 14, 2022 at 09:56:34AM +0100, Jan Beulich wrote:
>>> In the case of the logs above, the following command sequence is being
>>> executed upon termination of the domain:
>>>
>>> # Unbind devices.
>>> echo 0000:00:14.0 >| /sys/bus/pci/drivers/pciback/unbind
>>> echo 0000:00:02.0 >| /sys/bus/pci/drivers/pciback/unbind
>>>
>>> # Rebind devices.
>>> echo 0000:00:14.0 >| /sys/bus/pci/drivers/xhci_hcd/bind
>>> echo 0000:00:02.0 >| /sys/bus/pci/drivers/i915/bind
> 
>> ... you may still want to try replacing these with
>> "xl pci-assignable-add ..." / "xl pci-assignable-remove ...".
> 
> We tested using the 'xl pci-assignable-add/remove' sequences and we
> believe this may have resulted in the proper return of the devices to
> dom0 but haven't been able to verify that since the Windows VM is now
> throwing the VIDEO_TDR error.
> 
> Unless we are misunderstanding something the 'xl
> pci-assignable-remove' sequence requires the manual re-binding of the
> devices to their dom0 drivers.

Hmm, I thought drivers would be rebound, but I'm not a tool stack person.
Looking at libxl__device_pci_assignable_remove() at least supports this
assumption of mine. You did use the command's -r option, didn't you?

Jan




* Re: IGD pass-through failures since 4.10.
  2022-02-23  8:59         ` Jan Beulich
@ 2022-02-25  0:16           ` Dr. Greg
  0 siblings, 0 replies; 9+ messages in thread
From: Dr. Greg @ 2022-02-25  0:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Wed, Feb 23, 2022 at 09:59:48AM +0100, Jan Beulich wrote:

Hi, I hope the end of the week is going well for everyone.

> On 22.02.2022 19:52, Dr. Greg wrote:
> > On Fri, Feb 18, 2022 at 08:04:14AM +0100, Jan Beulich wrote:
> >> On 17.02.2022 21:15, Dr. Greg wrote:
> >>> On Mon, Feb 14, 2022 at 09:56:34AM +0100, Jan Beulich wrote:
> >>> In the case of the logs above, the following command sequence is being
> >>> executed upon termination of the domain:
> >>>
> >>> # Unbind devices.
> >>> echo 0000:00:14.0 >| /sys/bus/pci/drivers/pciback/unbind
> >>> echo 0000:00:02.0 >| /sys/bus/pci/drivers/pciback/unbind
> >>>
> >>> # Rebind devices.
> >>> echo 0000:00:14.0 >| /sys/bus/pci/drivers/xhci_hcd/bind
> >>> echo 0000:00:02.0 >| /sys/bus/pci/drivers/i915/bind
> > 
> >> ... you may still want to try replacing these with
> >> "xl pci-assignable-add ..." / "xl pci-assignable-remove ...".
> > 
> > We tested using the 'xl pci-assignable-add/remove' sequences and we
> > believe this may have resulted in the proper return of the devices to
> > dom0 but haven't been able to verify that since the Windows VM is now
> > throwing the VIDEO_TDR error.
> > 
> > Unless we are misunderstanding something the 'xl
> > pci-assignable-remove' sequence requires the manual re-binding of the
> > devices to their dom0 drivers.

> Hmm, I thought drivers would be rebound, but I'm not a tool stack
> person.  Looking at libxl__device_pci_assignable_remove() at least
> supports this assumption of mine. You did use the command's -r
> option, didn't you?

No, we weren't, and I now see the -r option.

We have already re-worked our setup scripts to use pci-assignable-add
and will verify the -r option works as advertised, thanks for the tip.
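
For posterity, a sketch of the reworked teardown step, assuming -r
does the rebinding described above:

# Replaces the manual sysfs unbind/bind sequence quoted earlier;
# -r asks the toolstack to rebind the original dom0 drivers
# (xhci_hcd and i915 in our case).
xl pci-assignable-remove -r 0000:00:14.0
xl pci-assignable-remove -r 0000:00:02.0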

We had our lab machine broken for a couple of days, where it wouldn't
start an IGD pass-through session in any way, shape or form.  We got
that sorted out and will now go back to 4.15.2 and verify what works
and doesn't work and report back.

> Jan

Have a good weekend.

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686            EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"On the other hand, the Linux philosophy is 'laugh in the face of
 danger'.  Oops.  Wrong one.  'Do it yourself'.  Thats it."
                                -- Linus Torvalds


