From: "G.R." <firemeteor@users.sourceforge.net>
To: Jan Beulich <jbeulich@suse.com>
Cc: xen-devel <xen-devel@lists.xen.org>,
	"Roger Pau Monné" <roger.pau@citrix.com>
Subject: Re: PCI pass-through problem for SN570 NVME SSD
Date: Thu, 7 Jul 2022 23:24:21 +0800
Message-ID: <CAKhsbWZoeZeyysF+1O9xGvrVBrApHrSbk+GJavHUEHim5hudrA@mail.gmail.com>
In-Reply-To: <ca4e8b79-c831-8c09-6398-b76852dfde53@suse.com>

On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 06.07.2022 08:25, G.R. wrote:
> > On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@suse.com> wrote:
> >> Nothing useful in there. Yet independent of that I guess we need to
> >> separate the issues you're seeing. Otherwise it'll be impossible to
> >> know what piece of data belongs where.
> > Yep, I think I'm seeing several different issues here:
> > 1. The FLR related DPC / AER message seen on the 1st attempt only when
> > pciback tries to seize and release the SN570
> >     - Subsequent pciback operations appear just fine.
> > 2. MSI-X preparation failure message that shows up each time the SN570
> > is seized by pciback or when it's passed to domU.
> > 3. XEN tries to map BAR from two devices to the same page
> > 4. The "write-back to unknown field" message in QEMU log that goes
> > away with permissive=1 passthrough config.
> > 5. The "irq 16: nobody cared" message shows up *sometimes* in a
> > pattern that I haven't figured out  (See attached)
> > 6. The FreeBSD domU sees the device but fails to use it because
> > low-level commands sent to it are aborted.
> > 7. The device does not return to the pci-assignable-list when the domU
> > it was assigned to shuts down. (See attached)
> >
> > #3 appears to be a known issue that could be worked around with
> > patches from the list.
> > I suspect #1 may have something to do with the device itself. It's
> > still not clear if it's deadly or just annoying.
> > I was able to update the firmware to the latest version and confirmed
> > that the new firmware didn't make any noticeable difference.
> >
> > I suspect issues #2, #4, #5, #6 and #7 may be related, and that the
> > pass-through was not completely successful...
> >
> > Should I expect a debug build of the XEN hypervisor to give better
> > diagnostic messages, without the debug patch that Roger mentioned?
>
> Well, "expect" is perhaps too much to say, but with problems like
> yours (and even more so with multiple ones) using a debug
> hypervisor (or kernel, if such a build mode existed) is imo
> always a good idea. As is using as up-to-date a version as
> possible.

I built both a 4.14.3 debug version and a 4.16.1 release version for
testing purposes. Unfortunately they gave me absolutely zero
information, since neither of them is able to get past issue #1, the
FLR-related DPC / AER issue.
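
(In case it matters: as far as I know the kernel only attempts an FLR
when the device advertises the capability, so that bit can be
double-checked from dom0 with something like

  lspci -vv -s 05:00.0 | grep FLReset

which should show FLReset+ in the DevCap line for the SSD.)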
The 4.16.1 release build actually survives the 'xl pci-assignable-add'
that triggers the first AER failure, but the subsequent
'xl pci-assignable-remove' leads to an xl segmentation fault:
>[  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
>[  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
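
If it helps, the faulting offset within libxenlight works out to
ip - base = 0x7f2cccdaf71f - 0x7f2cccd92000 = 0x1d71f, so the crash
site could presumably be resolved with something like

  addr2line -f -e /usr/lib/libxenlight.so.4.16.0 0x1d71f

(the path being wherever the 4.16 libxenlight was installed, and
assuming it was built with debug info).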
Since I need a couple of pci-assignable-add && pci-assignable-remove
cycles to get back to a seemingly normal state, I cannot proceed from
here.
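
For reference, the exact sequence I'm cycling is simply (05:00.0 being
the SSD, per the logs below):

  xl pci-assignable-add 05:00.0
  xl pci-assignable-remove 05:00.0   <== segfaults here on 4.16.1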

With the 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
pci-assignable-add':

[  574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  574.623203] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
[  574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  574.623209] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[  574.623240] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00200000/00010000
[  574.623261] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
[  575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
[  576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
[  579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
[  583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
[  591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
[  609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
[  643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
// <======= The reboot happens somewhere here -- not immediately, but after a while...
// Maybe I can get something from xl dmesg if I'm quick enough and have connected from a second terminal...
[  644.773922] pciback 0000:05:00.0: xen_pciback: reset device
[  644.774050] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_detected(bus:5,devfn:0)
[  644.774051] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  644.923432] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_resume(bus:5,devfn:0)
[  644.923437] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  644.923616] pcieport 0000:00:1d.0: AER: device recovery successful
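
Next time I'll try to capture the tail of the hypervisor log from a
second machine before the box goes down, along the lines of

  ssh root@dom0 'while sleep 1; do xl dmesg | tail -n 40; done' > xl-dmesg.capture

(host name / credentials being whatever applies here), so that whatever
Xen printed right before the reset survives on the remote side.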



>
> Jan

