* PCI pass-through problem for SN570 NVME SSD
@ 2022-07-02 17:43 G.R.
  2022-07-04  6:37 ` G.R.
  2022-07-04  9:50 ` Roger Pau Monné
  0 siblings, 2 replies; 31+ messages in thread
From: G.R. @ 2022-07-02 17:43 UTC (permalink / raw)
  To: xen-users, xen-devel

Hi everybody,

I ran into problems passing through an SN570 NVMe SSD to an HVM guest.
So far I don't know whether the problem lies with this specific SSD,
with the CPU + motherboard combination, or with the SW stack.
I'm looking for suggestions on troubleshooting.

List of build info:
CPU+motherboard: E-2146G + Gigabyte C246N-WU2
XEN version: 4.14.3
Dom0: Linux Kernel 5.10 (built from Debian 11.2 kernel source package)
The SN570 SSD sits here in the PCI tree:
           +-1d.0-[05]----00.0  Sandisk Corp Device 501a

Symptoms observed:
With ASPM enabled, pciback has problems seizing the device.

Jul  2 00:36:54 gaia kernel: [    1.648270] pciback 0000:05:00.0:
xen_pciback: seizing device
...
Jul  2 00:36:54 gaia kernel: [    1.768646] pcieport 0000:00:1d.0:
AER: enabled with IRQ 150
Jul  2 00:36:54 gaia kernel: [    1.768716] pcieport 0000:00:1d.0:
DPC: enabled with IRQ 150
Jul  2 00:36:54 gaia kernel: [    1.768717] pcieport 0000:00:1d.0:
DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+
SwTrigger+ RP PIO Log 4, DL_ActiveErr+
...
Jul  2 00:36:54 gaia kernel: [    1.770039] xen: registering gsi 16
triggering 0 polarity 1
Jul  2 00:36:54 gaia kernel: [    1.770041] Already setup the GSI :16
Jul  2 00:36:54 gaia kernel: [    1.770314] pcieport 0000:00:1d.0:
DPC: containment event, status:0x1f11 source:0x0000
Jul  2 00:36:54 gaia kernel: [    1.770315] pcieport 0000:00:1d.0:
DPC: unmasked uncorrectable error detected
Jul  2 00:36:54 gaia kernel: [    1.770320] pcieport 0000:00:1d.0:
PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction
Layer, (Receiver ID)
Jul  2 00:36:54 gaia kernel: [    1.770371] pcieport 0000:00:1d.0:
device [8086:a330] error status/mask=00200000/00010000
Jul  2 00:36:54 gaia kernel: [    1.770413] pcieport 0000:00:1d.0:
[21] ACSViol                (First)
Jul  2 00:36:54 gaia kernel: [    1.770466] pciback 0000:05:00.0:
xen_pciback: device is not found/assigned
Jul  2 00:36:54 gaia kernel: [    1.920195] pciback 0000:05:00.0:
xen_pciback: device is not found/assigned
Jul  2 00:36:54 gaia kernel: [    1.920260] pcieport 0000:00:1d.0:
AER: device recovery successful
Jul  2 00:36:54 gaia kernel: [    1.920263] pcieport 0000:00:1d.0:
DPC: containment event, status:0x1f01 source:0x0000
Jul  2 00:36:54 gaia kernel: [    1.920264] pcieport 0000:00:1d.0:
DPC: unmasked uncorrectable error detected
Jul  2 00:36:54 gaia kernel: [    1.920267] pciback 0000:05:00.0:
xen_pciback: device is not found/assigned
Jul  2 00:36:54 gaia kernel: [    1.938406] xen: registering gsi 16
triggering 0 polarity 1
Jul  2 00:36:54 gaia kernel: [    1.938408] Already setup the GSI :16
Jul  2 00:36:54 gaia kernel: [    1.938666] xen_pciback: backend is vpci
...
Jul  2 00:43:48 gaia kernel: [  420.231955] pcieport 0000:00:1d.0:
DPC: containment event, status:0x1f01 source:0x0000
Jul  2 00:43:48 gaia kernel: [  420.231961] pcieport 0000:00:1d.0:
DPC: unmasked uncorrectable error detected
Jul  2 00:43:48 gaia kernel: [  420.231993] pcieport 0000:00:1d.0:
PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction
Layer, (Requester ID)
Jul  2 00:43:48 gaia kernel: [  420.235775] pcieport 0000:00:1d.0:
device [8086:a330] error status/mask=00100000/00010000
Jul  2 00:43:48 gaia kernel: [  420.235779] pcieport 0000:00:1d.0:
[20] UnsupReq               (First)
Jul  2 00:43:48 gaia kernel: [  420.235783] pcieport 0000:00:1d.0:
AER:   TLP Header: 34000000 05000010 00000000 88458845
Jul  2 00:43:48 gaia kernel: [  420.235819] pci 0000:05:00.0: AER:
can't recover (no error_detected callback)
Jul  2 00:43:48 gaia kernel: [  420.384349] pcieport 0000:00:1d.0:
AER: device recovery successful
... // The following might relate to an attempt to assign the device
to guest, not very sure...
Jul  2 00:46:06 gaia kernel: [  559.147333] pciback 0000:05:00.0:
xen_pciback: seizing device
Jul  2 00:46:06 gaia kernel: [  559.147435] pciback 0000:05:00.0:
enabling device (0000 -> 0002)
Jul  2 00:46:06 gaia kernel: [  559.147508] xen: registering gsi 16
triggering 0 polarity 1
Jul  2 00:46:06 gaia kernel: [  559.147511] Already setup the GSI :16
Jul  2 00:46:06 gaia kernel: [  559.147558] pciback 0000:05:00.0:
xen_pciback: MSI-X preparation failed (-6)


With pcie_aspm=off, the pciback-related error log goes away.
But I suspect some problems are still hidden: I no longer see any
'AER: enabled' messages, so errors may simply go unreported.
I have xen_pciback built directly into the kernel and assign the
SSD to it on the kernel command line.
However, the results of the pci-assignable-xxx commands are not very consistent:

root@gaia:~# xl pci-assignable-list
0000:00:17.0
0000:05:00.0
root@gaia:~# xl pci-assignable-remove 05:00.0
libxl: error: libxl_pci.c:853:libxl__device_pci_assignable_remove:
failed to de-quarantine 0000:05:00.0 <===== Here!!!
root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:794:libxl__device_pci_assignable_add:
0000:05:00.0 already assigned to pciback <==== Here!!!
root@gaia:~# xl pci-assignable-remove 05:00.0
root@gaia:~# xl pci-assignable-list
0000:00:17.0
root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add:
0000:05:00.0 not bound to a driver, will not be rebound.
root@gaia:~# xl pci-assignable-list
0000:00:17.0
0000:05:00.0
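For what it's worth, xl's state can be cross-checked against the
kernel's view of the device (a sketch assuming the standard sysfs
layout of the built-in xen-pciback driver; the BDF is the SSD from
above):

```shell
# Which driver currently owns the device, and which slots pciback
# believes it has seized? (The device was hidden at boot via the
# xen-pciback.hide=(0000:05:00.0) kernel parameter.)
readlink /sys/bus/pci/devices/0000:05:00.0/driver
cat /sys/bus/pci/drivers/pciback/slots
```

If these disagree with `xl pci-assignable-list`, the inconsistency is
in libxl's bookkeeping rather than in the kernel.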


After 'xl pci-assignable-list' appears to be self-consistent, creating
a VM with the SSD assigned still leads to a guest crash.
From the qemu log:
[00:06.0] xen_pt_region_update: Error: create new mem mapping failed! (err: 1)
qemu-system-i386: terminating on signal 1 from pid 1192 (xl)

From the 'xl dmesg' output:
(XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
(XEN) domain_crash called from p2m.c:1301
(XEN) Domain 1 reported crashed by domain 0 on cpu#4:
(XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1


Which of the three symptoms is the most fundamental?
1. The DPC / AER error log
2. The inconsistency in 'xl pci-assignable-list' state tracking
3. The GFN mapping failure on guest setup

Any suggestions for the next step?


Thanks,
G.R.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-02 17:43 PCI pass-through problem for SN570 NVME SSD G.R.
@ 2022-07-04  6:37 ` G.R.
  2022-07-04 10:31   ` Jan Beulich
  2022-07-04  9:50 ` Roger Pau Monné
  1 sibling, 1 reply; 31+ messages in thread
From: G.R. @ 2022-07-04  6:37 UTC (permalink / raw)
  To: xen-users, xen-devel

[-- Attachment #1: Type: text/plain, Size: 8193 bytes --]

Updating with some findings from extra triage effort...
Detailed logs can be found in the attachments.
1. Confirmed that the stock Debian 11.2 kernel (5.10) shows the same symptom.
2. With loglvl=all, the log reveals why the mapping failure happens;
it looks like it comes from a duplicated mapping:
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3077 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2616 nr=1 <===========Here
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2504 nr=1 <===========Here
(XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
(XEN) domain_crash called from p2m.c:1301
(XEN) Domain 1 reported crashed by domain 0 on cpu#2:
(XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
(XEN) memory_map:remove: dom1 gfn=f3078 mfn=a2504 nr=1
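As a cross-check of the duplicated-mapping theory, the 4 KiB guest
pages occupied by each BAR can be computed from the hvmloader log in
the attachment (BDFs, addresses and sizes are copied from that log;
treating the low nibble of the printed addresses as BAR flag bits is
my assumption). The exact GFN in the Xen log is one page higher, but
the arithmetic shows sub-page BARs of the emulated 05:0 and the
passed-through 06:0 (named in the qemu log) packed into one page:

```python
# Map each BAR from the hvmloader log to the guest page frames it spans
# and report pages claimed by more than one device.
PAGE_SHIFT = 12

bars = [  # (guest device, BAR offset, printed address, size)
    ("06:0", 0x10, 0x0F3070004, 0x4000),
    ("05:0", 0x10, 0x0F3074000, 0x2000),
    ("03:0", 0x14, 0x0F3076000, 0x1000),
    ("05:0", 0x24, 0x0F3077000, 0x800),
    ("05:0", 0x14, 0x0F3077800, 0x100),
    ("06:0", 0x20, 0x0F3077904, 0x100),
]

owners = {}  # gfn -> set of devices with a BAR in that page
for dev, _bar, addr, size in bars:
    base = addr & ~0xF  # strip BAR flag bits from the printed value
    first, last = base >> PAGE_SHIFT, (base + size - 1) >> PAGE_SHIFT
    for gfn in range(first, last + 1):
        owners.setdefault(gfn, set()).add(dev)

shared = {gfn: devs for gfn, devs in owners.items() if len(devs) > 1}
print({hex(g): sorted(d) for g, d in shared.items()})
# -> {'0xf3077': ['05:0', '06:0']}
```

Two devices with BARs in one page cannot both be mapped there with
their own MFNs, which would explain the "not permitted" rejection.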

3. Recompiled the kernel with DEBUG enabled for the xen_pciback driver
and played with the xl pci-assignable-XXX commands:
3.1 Confirmed that the DPC / AER errors happen only when xen_pciback
attempts to seize or release the device.
3.1.1 They only happen on the first add / remove operation each.
3.2 There is still an 'MSI-X preparation failed' message later on, but
otherwise adding / removing the device appears to succeed after the
first attempt.
3.3 Not necessarily related, but the DPC / AER log looks similar to
this report [1].


[1]: https://patchwork.kernel.org/project/linux-pci/patch/20220127025418.1989642-1-kai.heng.feng@canonical.com/#24713767
PS: Attempting to fix the line-wrapping issue in the quote below...
I have no idea what happened to the formatting.

On Sun, Jul 3, 2022 at 1:43 AM G.R. <firemeteor@users.sourceforge.net> wrote:
>
> Hi everybody,
>
> I run into problems passing through a SN570 NVME SSD to a HVM guest.
> So far I have no idea if the problem is with this specific SSD or with
> the CPU + motherboard combination or the SW stack.
> Looking for some suggestions on troubleshooting.
>
> List of build info:
> CPU+motherboard: E-2146G + Gigabyte C246N-WU2
> XEN version: 4.14.3
> Dom0: Linux Kernel 5.10 (built from Debian 11.2 kernel source package)
> The SN570 SSD sits here in the PCI tree:
>            +-1d.0-[05]----00.0  Sandisk Corp Device 501a
>
> Syndromes observed:
> With ASPM enabled, pciback has problem seizing the device.
>
> Jul  2 00:36:54 gaia kernel: [    1.648270] pciback 0000:05:00.0: xen_pciback: seizing device
> ...
> Jul  2 00:36:54 gaia kernel: [    1.768646] pcieport 0000:00:1d.0: AER: enabled with IRQ 150
> Jul  2 00:36:54 gaia kernel: [    1.768716] pcieport 0000:00:1d.0: DPC: enabled with IRQ 150
> Jul  2 00:36:54 gaia kernel: [    1.768717] pcieport 0000:00:1d.0: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
> ...
> Jul  2 00:36:54 gaia kernel: [    1.770039] xen: registering gsi 16 triggering 0 polarity 1
> Jul  2 00:36:54 gaia kernel: [    1.770041] Already setup the GSI :16
> Jul  2 00:36:54 gaia kernel: [    1.770314] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
> Jul  2 00:36:54 gaia kernel: [    1.770315] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
> Jul  2 00:36:54 gaia kernel: [    1.770320] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> Jul  2 00:36:54 gaia kernel: [    1.770371] pcieport 0000:00:1d.0: device [8086:a330] error status/mask=00200000/00010000
> Jul  2 00:36:54 gaia kernel: [    1.770413] pcieport 0000:00:1d.0: [21] ACSViol                (First)
> Jul  2 00:36:54 gaia kernel: [    1.770466] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
> Jul  2 00:36:54 gaia kernel: [    1.920195] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
> Jul  2 00:36:54 gaia kernel: [    1.920260] pcieport 0000:00:1d.0: AER: device recovery successful
> Jul  2 00:36:54 gaia kernel: [    1.920263] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f01 source:0x0000
> Jul  2 00:36:54 gaia kernel: [    1.920264] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
> Jul  2 00:36:54 gaia kernel: [    1.920267] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
> Jul  2 00:36:54 gaia kernel: [    1.938406] xen: registering gsi 16 triggering 0 polarity 1
> Jul  2 00:36:54 gaia kernel: [    1.938408] Already setup the GSI :16
> Jul  2 00:36:54 gaia kernel: [    1.938666] xen_pciback: backend is vpci
> ...
> Jul  2 00:43:48 gaia kernel: [  420.231955] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f01 source:0x0000
> Jul  2 00:43:48 gaia kernel: [  420.231961] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
> Jul  2 00:43:48 gaia kernel: [  420.231993] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> Jul  2 00:43:48 gaia kernel: [  420.235775] pcieport 0000:00:1d.0: device [8086:a330] error status/mask=00100000/00010000
> Jul  2 00:43:48 gaia kernel: [  420.235779] pcieport 0000:00:1d.0: [20] UnsupReq               (First)
> Jul  2 00:43:48 gaia kernel: [  420.235783] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 05000010 00000000 88458845
> Jul  2 00:43:48 gaia kernel: [  420.235819] pci 0000:05:00.0: AER: can't recover (no error_detected callback)
> Jul  2 00:43:48 gaia kernel: [  420.384349] pcieport 0000:00:1d.0: AER: device recovery successful
> ... // The following might relate to an attempt to assign the device to guest, not very sure...
> Jul  2 00:46:06 gaia kernel: [  559.147333] pciback 0000:05:00.0: xen_pciback: seizing device
> Jul  2 00:46:06 gaia kernel: [  559.147435] pciback 0000:05:00.0: enabling device (0000 -> 0002)
> Jul  2 00:46:06 gaia kernel: [  559.147508] xen: registering gsi 16 triggering 0 polarity 1
> Jul  2 00:46:06 gaia kernel: [  559.147511] Already setup the GSI :16
> Jul  2 00:46:06 gaia kernel: [  559.147558] pciback 0000:05:00.0: xen_pciback: MSI-X preparation failed (-6)
>
>
> With pcie_aspm=off, the error log related to pciback goes away.
> But I suspect there are still some problems hidden -- since I don't
> see any AER enabled messages so errors may be hidden.
> I have the xen_pciback built directly into the kernel and assigned the
> SSD to it in the kernel command-line.
> However, the result from pci-assignable-xxx commands are not very consistent:
>
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> 0000:05:00.0
> root@gaia:~# xl pci-assignable-remove 05:00.0
> libxl: error: libxl_pci.c:853:libxl__device_pci_assignable_remove: failed to de-quarantine 0000:05:00.0 <===== Here!!!
> root@gaia:~# xl pci-assignable-add 05:00.0
> libxl: warning: libxl_pci.c:794:libxl__device_pci_assignable_add: 0000:05:00.0 already assigned to pciback <==== Here!!!
> root@gaia:~# xl pci-assignable-remove 05:00.0
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> root@gaia:~# xl pci-assignable-add 05:00.0
> libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> 0000:05:00.0
>
>
> After the 'xl pci-assignable-list' appears to be self-consistent, creating VM with the SSD assigned still leads to a guest crash:
> From qemu log:
> [00:06.0] xen_pt_region_update: Error: create new mem mapping failed! (err: 1)
> qemu-system-i386: terminating on signal 1 from pid 1192 (xl)
>
> From the 'xl dmesg' output:
> (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
> (XEN) domain_crash called from p2m.c:1301
> (XEN) Domain 1 reported crashed by domain 0 on cpu#4:
> (XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
>
>
> Which of the three syndromes are more fundamental?
> 1. The DPC / AER error log
> 2. The inconsistency in 'xl pci-assignable-list' state tracking
> 3. The GFN mapping failure on guest setup
>
> Any suggestions for the next step?
>
>
> Thanks,
> G.R.

[-- Attachment #2: xldmesg_sn570_pt_fail.log --]
[-- Type: text/x-log, Size: 3016 bytes --]

(XEN) HVM d1v0 save: CPU
(XEN) HVM d1v1 save: CPU
(XEN) HVM d1 save: PIC
(XEN) HVM d1 save: IOAPIC
(XEN) HVM d1v0 save: LAPIC
(XEN) HVM d1v1 save: LAPIC
(XEN) HVM d1v0 save: LAPIC_REGS
(XEN) HVM d1v1 save: LAPIC_REGS
(XEN) HVM d1 save: PCI_IRQ
(XEN) HVM d1 save: ISA_IRQ
(XEN) HVM d1 save: PCI_LINK
(XEN) HVM d1 save: PIT
(XEN) HVM d1 save: RTC
(XEN) HVM d1 save: HPET
(XEN) HVM d1 save: PMTIMER
(XEN) HVM d1v0 save: MTRR
(XEN) HVM d1v1 save: MTRR
(XEN) HVM d1 save: VIRIDIAN_DOMAIN
(XEN) HVM d1v0 save: CPU_XSAVE
(XEN) HVM d1v1 save: CPU_XSAVE
(XEN) HVM d1v0 save: VIRIDIAN_VCPU
(XEN) HVM d1v1 save: VIRIDIAN_VCPU
(XEN) HVM d1v0 save: VMCE_VCPU
(XEN) HVM d1v1 save: VMCE_VCPU
(XEN) HVM d1v0 save: TSC_ADJUST
(XEN) HVM d1v1 save: TSC_ADJUST
(XEN) HVM d1v0 save: CPU_MSR
(XEN) HVM d1v1 save: CPU_MSR
(XEN) HVM1 restore: CPU 0
(d1) HVM Loader
(d1) Detected Xen v4.14.3
(d1) Xenbus rings @0xfeffc000, event channel 1
(d1) System requested SeaBIOS
(d1) CPU speed is 3505 MHz
(d1) Relocating guest memory for lowmem MMIO space disabled
(d1) PCI-ISA link 0 routed to IRQ5
(d1) PCI-ISA link 1 routed to IRQ10
(d1) PCI-ISA link 2 routed to IRQ11
(d1) PCI-ISA link 3 routed to IRQ5
(d1) pci dev 01:3 INTA->IRQ10
(d1) pci dev 02:0 INTA->IRQ11
(d1) pci dev 04:0 INTA->IRQ5
(d1) pci dev 05:0 INTA->IRQ10
(d1) pci dev 06:0 INTA->IRQ11
(d1) RAM in high memory; setting high_mem resource base to 40f800000
(d1) pci dev 03:0 bar 10 size 002000000: 0f0000008
(d1) pci dev 02:0 bar 14 size 001000000: 0f2000008
(d1) pci dev 04:0 bar 30 size 000040000: 0f3000000
(d1) pci dev 04:0 bar 10 size 000020000: 0f3040000
(d1) pci dev 03:0 bar 30 size 000010000: 0f3060000
(d1) pci dev 06:0 bar 10 size 000004000: 0f3070004
(d1) pci dev 05:0 bar 10 size 000002000: 0f3074000
(d1) pci dev 03:0 bar 14 size 000001000: 0f3076000
(d1) pci dev 05:0 bar 24 size 000000800: 0f3077000
(d1) pci dev 02:0 bar 10 size 000000100: 00000c001
(d1) pci dev 05:0 bar 14 size 000000100: 0f3077800
(d1) pci dev 06:0 bar 20 size 000000100: 0f3077904
(d1) pci dev 04:0 bar 14 size 000000040: 00000c101
(d1) pci dev 05:0 bar 20 size 000000020: 00000c141
(d1) pci dev 01:1 bar 20 size 000000010: 00000c161
(d1) pci dev 05:0 bar 18 size 000000008: 00000c171
(d1) pci dev 05:0 bar 1c size 000000004: 00000c179
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3077 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2616 nr=1
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2504 nr=1
(XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
(XEN) domain_crash called from p2m.c:1301
(XEN) Domain 1 reported crashed by domain 0 on cpu#2:
(XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
(XEN) memory_map:remove: dom1 gfn=f3078 mfn=a2504 nr=1


[-- Attachment #3: pciback_dbg_xl-pci_assignable_XXX.log --]
[-- Type: text/x-log, Size: 6680 bytes --]

root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.

[  323.448115] xen_pciback: wants to seize 0000:05:00.0
[  323.448136] pciback 0000:05:00.0: xen_pciback: probing...
[  323.448137] pciback 0000:05:00.0: xen_pciback: seizing device
[  323.448162] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  323.448162] pciback 0000:05:00.0: xen_pciback: initializing...
[  323.448163] pciback 0000:05:00.0: xen_pciback: initializing config
[  323.448344] pciback 0000:05:00.0: xen_pciback: enabling device
[  323.448425] xen: registering gsi 16 triggering 0 polarity 1
[  323.448428] Already setup the GSI :16
[  323.448497] pciback 0000:05:00.0: xen_pciback: save state of device
[  323.448642] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  323.448707] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
[  323.448730] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  323.448760] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[  323.448786] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00200000/00010000
[  323.448813] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
[  324.690979] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
[  325.730706] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
[  327.997638] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
[  332.264251] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
[  340.584320] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
[  357.010896] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
[  391.143951] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
[  392.249252] pciback 0000:05:00.0: xen_pciback: reset device
[  392.249392] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_detected(bus:5,devfn:0)
[  392.249393] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397074] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_resume(bus:5,devfn:0)
[  392.397080] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397284] pcieport 0000:00:1d.0: AER: device recovery successful

libxl: error: libxl_pci.c:835:libxl__device_pci_assignable_add: failed to quarantine 0000:05:00.0

root@gaia:~# xl pci-assignable-remove 05:00.0
libxl: error: libxl_pci.c:853:libxl__device_pci_assignable_remove: failed to de-quarantine 0000:05:00.0
root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:794:libxl__device_pci_assignable_add: 0000:05:00.0 already assigned to pciback
root@gaia:~# xl pci-assignable-remove 05:00.0
[  603.928039] pciback 0000:05:00.0: xen_pciback: removing
[  603.928041] pciback 0000:05:00.0: xen_pciback: found device to remove 
[  603.928042] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[  604.033372] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
[  604.033512] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  604.033631] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[  604.033758] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00100000/00010000
[  604.033856] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
[  604.033939] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 05000010 00000000 88458845
[  604.034059] pci 0000:05:00.0: AER: can't recover (no error_detected callback)
[  604.034421] xen_pciback: removed 0000:05:00.0 from seize list
[  604.182597] pcieport 0000:00:1d.0: AER: device recovery successful

root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.
[  667.582051] xen_pciback: wants to seize 0000:05:00.0
[  667.582130] pciback 0000:05:00.0: xen_pciback: probing...
[  667.582134] pciback 0000:05:00.0: xen_pciback: seizing device
[  667.582228] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  667.582231] pciback 0000:05:00.0: xen_pciback: initializing...
[  667.582235] pciback 0000:05:00.0: xen_pciback: initializing config
[  667.582548] pciback 0000:05:00.0: xen_pciback: enabling device
[  667.582599] pciback 0000:05:00.0: enabling device (0000 -> 0002)
[  667.582912] xen: registering gsi 16 triggering 0 polarity 1
[  667.582923] Already setup the GSI :16
[  667.583061] pciback 0000:05:00.0: xen_pciback: MSI-X preparation failed (-6)
[  667.583148] pciback 0000:05:00.0: xen_pciback: save state of device
[  667.583569] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  667.689656] pciback 0000:05:00.0: xen_pciback: reset device

root@gaia:~# xl pci-assignable-remove 05:00.0
[  720.957988] pciback 0000:05:00.0: xen_pciback: removing
[  720.957996] pciback 0000:05:00.0: xen_pciback: found device to remove 
[  720.957999] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[  721.065222] pciback 0000:05:00.0: xen_pciback: MSI-X release failed (-16)
[  721.065667] xen_pciback: removed 0000:05:00.0 from seize list

root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.

[  763.888631] xen_pciback: wants to seize 0000:05:00.0
[  763.888690] pciback 0000:05:00.0: xen_pciback: probing...
[  763.888691] pciback 0000:05:00.0: xen_pciback: seizing device
[  763.888716] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  763.888717] pciback 0000:05:00.0: xen_pciback: initializing...
[  763.888717] pciback 0000:05:00.0: xen_pciback: initializing config
[  763.888804] pciback 0000:05:00.0: xen_pciback: enabling device
[  763.888885] xen: registering gsi 16 triggering 0 polarity 1
[  763.888889] Already setup the GSI :16
[  763.888949] pciback 0000:05:00.0: xen_pciback: MSI-X preparation failed (-6)
[  763.888977] pciback 0000:05:00.0: xen_pciback: save state of device
[  763.889126] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  763.994206] pciback 0000:05:00.0: xen_pciback: reset device

root@gaia:~# xl pci-assignable-remove 05:00.0
[  819.491000] pciback 0000:05:00.0: xen_pciback: removing
[  819.491002] pciback 0000:05:00.0: xen_pciback: found device to remove 
[  819.491003] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[  819.596113] pciback 0000:05:00.0: xen_pciback: MSI-X release failed (-16)
[  819.596466] xen_pciback: removed 0000:05:00.0 from seize list



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-02 17:43 PCI pass-through problem for SN570 NVME SSD G.R.
  2022-07-04  6:37 ` G.R.
@ 2022-07-04  9:50 ` Roger Pau Monné
  2022-07-04 10:34   ` Jan Beulich
  2022-07-04 11:34   ` G.R.
  1 sibling, 2 replies; 31+ messages in thread
From: Roger Pau Monné @ 2022-07-04  9:50 UTC (permalink / raw)
  To: G.R.; +Cc: xen-users, xen-devel

On Sun, Jul 03, 2022 at 01:43:11AM +0800, G.R. wrote:
> Hi everybody,
> 
> I run into problems passing through a SN570 NVME SSD to a HVM guest.
> So far I have no idea if the problem is with this specific SSD or with
> the CPU + motherboard combination or the SW stack.
> Looking for some suggestions on troubleshooting.
> 
> List of build info:
> CPU+motherboard: E-2146G + Gigabyte C246N-WU2
> XEN version: 4.14.3

Are you using a debug build of Xen? (if not it would be helpful to do
so).

> Dom0: Linux Kernel 5.10 (built from Debian 11.2 kernel source package)
> The SN570 SSD sits here in the PCI tree:
>            +-1d.0-[05]----00.0  Sandisk Corp Device 501a

It could be helpful to post the `lspci -vvv` output so we can see the
capabilities of the device.
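A concrete form of that request (BDFs taken from the thread; the root
port is included because the DPC/AER messages come from it):

```shell
# Full capability dump for the SSD and for its upstream root port.
lspci -vvv -s 0000:05:00.0
lspci -vvv -s 0000:00:1d.0
```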

> Syndromes observed:
> With ASPM enabled, pciback has problem seizing the device.
> 
> Jul  2 00:36:54 gaia kernel: [    1.648270] pciback 0000:05:00.0:
> xen_pciback: seizing device
> ...
> Jul  2 00:36:54 gaia kernel: [    1.768646] pcieport 0000:00:1d.0:
> AER: enabled with IRQ 150
> Jul  2 00:36:54 gaia kernel: [    1.768716] pcieport 0000:00:1d.0:
> DPC: enabled with IRQ 150
> Jul  2 00:36:54 gaia kernel: [    1.768717] pcieport 0000:00:1d.0:
> DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+
> SwTrigger+ RP PIO Log 4, DL_ActiveErr+

Is there a device reset involved here?  It's possible the device
doesn't reset properly and hence the Uncorrectable Error Status
Register ends up with inconsistent bits set.
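One way to peek at that register on the root port between attempts is
setpci; this is a sketch assuming a pciutils recent enough to know the
ECAP_AER capability name (otherwise the extended-capability offset has
to be located manually via lspci -vvv):

```shell
# Uncorrectable Error Status register: AER extended capability + 0x04.
setpci -s 00:1d.0 ECAP_AER+0x04.l
# The bits are RW1C, so writing the value read back clears them, e.g.:
#   setpci -s 00:1d.0 ECAP_AER+0x04.l=00200000
```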

> ...
> Jul  2 00:36:54 gaia kernel: [    1.770039] xen: registering gsi 16
> triggering 0 polarity 1
> Jul  2 00:36:54 gaia kernel: [    1.770041] Already setup the GSI :16
> Jul  2 00:36:54 gaia kernel: [    1.770314] pcieport 0000:00:1d.0:
> DPC: containment event, status:0x1f11 source:0x0000
> Jul  2 00:36:54 gaia kernel: [    1.770315] pcieport 0000:00:1d.0:
> DPC: unmasked uncorrectable error detected
> Jul  2 00:36:54 gaia kernel: [    1.770320] pcieport 0000:00:1d.0:
> PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction
> Layer, (Receiver ID)
> Jul  2 00:36:54 gaia kernel: [    1.770371] pcieport 0000:00:1d.0:
> device [8086:a330] error status/mask=00200000/00010000
> Jul  2 00:36:54 gaia kernel: [    1.770413] pcieport 0000:00:1d.0:
> [21] ACSViol                (First)
> Jul  2 00:36:54 gaia kernel: [    1.770466] pciback 0000:05:00.0:
> xen_pciback: device is not found/assigned
> Jul  2 00:36:54 gaia kernel: [    1.920195] pciback 0000:05:00.0:
> xen_pciback: device is not found/assigned
> Jul  2 00:36:54 gaia kernel: [    1.920260] pcieport 0000:00:1d.0:
> AER: device recovery successful
> Jul  2 00:36:54 gaia kernel: [    1.920263] pcieport 0000:00:1d.0:
> DPC: containment event, status:0x1f01 source:0x0000
> Jul  2 00:36:54 gaia kernel: [    1.920264] pcieport 0000:00:1d.0:
> DPC: unmasked uncorrectable error detected
> Jul  2 00:36:54 gaia kernel: [    1.920267] pciback 0000:05:00.0:
> xen_pciback: device is not found/assigned

That's from a different device (05:00.0).

> Jul  2 00:36:54 gaia kernel: [    1.938406] xen: registering gsi 16
> triggering 0 polarity 1
> Jul  2 00:36:54 gaia kernel: [    1.938408] Already setup the GSI :16
> Jul  2 00:36:54 gaia kernel: [    1.938666] xen_pciback: backend is vpci
> ...
> Jul  2 00:43:48 gaia kernel: [  420.231955] pcieport 0000:00:1d.0:
> DPC: containment event, status:0x1f01 source:0x0000
> Jul  2 00:43:48 gaia kernel: [  420.231961] pcieport 0000:00:1d.0:
> DPC: unmasked uncorrectable error detected
> Jul  2 00:43:48 gaia kernel: [  420.231993] pcieport 0000:00:1d.0:
> PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction
> Layer, (Requester ID)
> Jul  2 00:43:48 gaia kernel: [  420.235775] pcieport 0000:00:1d.0:
> device [8086:a330] error status/mask=00100000/00010000
> Jul  2 00:43:48 gaia kernel: [  420.235779] pcieport 0000:00:1d.0:
> [20] UnsupReq               (First)
> Jul  2 00:43:48 gaia kernel: [  420.235783] pcieport 0000:00:1d.0:
> AER:   TLP Header: 34000000 05000010 00000000 88458845
> Jul  2 00:43:48 gaia kernel: [  420.235819] pci 0000:05:00.0: AER:
> can't recover (no error_detected callback)
> Jul  2 00:43:48 gaia kernel: [  420.384349] pcieport 0000:00:1d.0:
> AER: device recovery successful
> ... // The following might relate to an attempt to assign the device
> to guest, not very sure...
> Jul  2 00:46:06 gaia kernel: [  559.147333] pciback 0000:05:00.0:
> xen_pciback: seizing device
> Jul  2 00:46:06 gaia kernel: [  559.147435] pciback 0000:05:00.0:
> enabling device (0000 -> 0002)
> Jul  2 00:46:06 gaia kernel: [  559.147508] xen: registering gsi 16
> triggering 0 polarity 1
> Jul  2 00:46:06 gaia kernel: [  559.147511] Already setup the GSI :16
> Jul  2 00:46:06 gaia kernel: [  559.147558] pciback 0000:05:00.0:
> xen_pciback: MSI-X preparation failed (-6)
> 
> 
> With pcie_aspm=off, the error log related to pciback goes away.
> But I suspect there are still some problems hidden -- since I don't
> see any AER enabled messages so errors may be hidden.
> I have the xen_pciback built directly into the kernel and assigned the
> SSD to it in the kernel command-line.
> However, the result from pci-assignable-xxx commands are not very consistent:
> 
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> 0000:05:00.0
> root@gaia:~# xl pci-assignable-remove 05:00.0
> libxl: error: libxl_pci.c:853:libxl__device_pci_assignable_remove:
> failed to de-quarantine 0000:05:00.0 <===== Here!!!
> root@gaia:~# xl pci-assignable-add 05:00.0
> libxl: warning: libxl_pci.c:794:libxl__device_pci_assignable_add:
> 0000:05:00.0 already assigned to pciback <==== Here!!!
> root@gaia:~# xl pci-assignable-remove 05:00.0
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> root@gaia:~# xl pci-assignable-add 05:00.0
> libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add:
> 0000:05:00.0 not bound to a driver, will not be rebound.
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> 0000:05:00.0

I'm confused: the log above is mostly about the device at
0000:00:1d.0, while here you only have 0000:00:17.0 and 0000:05:00.0.
I assume 0000:00:1d.0 never appears in the output of `xl
pci-assignable-list`?

Also, you seem to be trying to assign 0000:05:00.0, which is not the
device that's giving the errors above. From the text above I had
assumed 0000:00:1d.0 was the NVMe that you wanted to assign to a
guest.

Could you attempt the same with only the problematic device marked as
assignable? (Having other devices in the list just makes the output
confusing.)

> 
> After the 'xl pci-assignable-list' appears to be self-consistent,
> creating a VM with the SSD assigned still leads to a guest crash:
> From qemu log:
> [00:06.0] xen_pt_region_update: Error: create new mem mapping failed! (err: 1)
> qemu-system-i386: terminating on signal 1 from pid 1192 (xl)
> 
> From the 'xl dmesg' output:
> (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted

Seems like QEMU is attempting to remap a p2m_mmio_direct region.

Can you paste the full output of `xl dmesg`? (as that will contain the
memory map).
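Roger's "remap" reading can be checked mechanically once a loglvl=all `xl dmesg` is available: collect the memory_map:add lines and flag any guest frame that was asked to map to two different host frames. Below is a minimal sketch (not an existing tool), fed with the memory_map lines that appear later in this thread:

```python
import re
from collections import defaultdict

# memory_map:add lines copied from the loglvl=all output quoted later
# in this thread.
LOG = """\
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3077 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2616 nr=1
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2504 nr=1
"""

def conflicting_gfns(log):
    """Map each guest frame (gfn) to the set of host frames (mfns) it
    was asked to map to; a gfn with more than one mfn is a remap."""
    seen = defaultdict(set)
    for m in re.finditer(r"gfn=([0-9a-f]+) mfn=([0-9a-f]+) nr=([0-9a-f]+)", log):
        gfn, mfn, nr = (int(x, 16) for x in m.groups())
        for i in range(nr):
            seen[gfn + i].add(mfn + i)
    return {hex(g): sorted(map(hex, s)) for g, s in seen.items() if len(s) > 1}

print(conflicting_gfns(LOG))  # -> {'0xf3078': ['0xa2504', '0xa2616']}
```

The single conflicting frame it reports is exactly the GFN 0xf3078 that Xen rejects with "not permitted".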

Would also be helpful if you could get the RMRR regions from that
box. Booting with `iommu=verbose` on the Xen command line should print
those.

> (XEN) domain_crash called from p2m.c:1301
> (XEN) Domain 1 reported crashed by domain 0 on cpu#4:
> (XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
> 
> 
> Which of the three syndromes is more fundamental?
> 1. The DPC / AER error log
> 2. The inconsistency in 'xl pci-assignable-list' state tracking
> 3. The GFN mapping failure on guest setup
> 
> Any suggestions for the next step?

I'm slightly confused by the fact that the DPC / AER errors seem to be
from a device that's different from the one you attempt to pass through
(0000:00:1d.0 vs 0000:05:00.0).

Might be helpful to start by only attempting to pass through the device
you are having issues with, and leaving any other device out.

Roger.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04  6:37 ` G.R.
@ 2022-07-04 10:31   ` Jan Beulich
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2022-07-04 10:31 UTC (permalink / raw)
  To: G.R.; +Cc: xen-devel

(Note: please don't cross-post; removing xen-users@.)

On 04.07.2022 08:37, G.R. wrote:
> Updating with some findings from extra triage effort...
> Detailed logs can be found in the attachments.
> 1. Confirmed the stock Debian 11.2 kernel (5.10) shows the same syndrome.
> 2. With loglvl=all, it reveals why the mapping failure happens; it looks
> like it comes from a duplicated mapping:
> (XEN) memory_map:add: dom1 gfn=f3074 mfn=a2610 nr=2
> (XEN) memory_map:add: dom1 gfn=f3077 mfn=a2615 nr=1
> (XEN) memory_map:add: dom1 gfn=f3078 mfn=a2616 nr=1 <===========Here
> (XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
> (XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
> (XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
> (XEN) memory_map:add: dom1 gfn=f3070 mfn=a2500 nr=2
> (XEN) memory_map:add: dom1 gfn=f3073 mfn=a2503 nr=1
> (XEN) memory_map:add: dom1 gfn=f3078 mfn=a2504 nr=1 <===========Here
> (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
> (XEN) domain_crash called from p2m.c:1301
> (XEN) Domain 1 reported crashed by domain 0 on cpu#2:
> (XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
> (XEN) memory_map:remove: dom1 gfn=f3078 mfn=a2504 nr=1

Neither here nor in your initial mail have I spotted information
on the BARs the device has. The above makes me wonder whether it
has two BARs each covering less than 4k and both sharing a page.
Or wait - the hvmloader output actually has some useful data:

(d1) pci dev 05:0 bar 24 size 000000800: 0f3077000
...
(d1) pci dev 05:0 bar 14 size 000000100: 0f3077800

The sharing is apparently introduced by hvmloader, and might not
have been deemed a problem there, even though it is generally
advisable (for security reasons) or even necessary (for
functionality) for the BARs of passed-through devices to all live
in distinct (4k) pages. However - while hvmloader has no knowledge
of the host addresses occupied by the BARs (so it has no indication
to place them in separate pages), it should still not put any two
BARs in the same (guest) page. Even then, as the P2M mapping occurs
at 4k granularity, it would further need to know the host's
offset-into-page value to correctly calculate the guest address.
IOW this would in addition require the host to put all BARs at the
beginning of 4k pages (which may well be the case for you already).

(d1) pci dev 06:0 bar 20 size 000000100: 0f3077904

would cause the same issue (afaict), unless the host BAR shared a
page with the earlier BAR of 05:0.
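The page-sharing suspicion can be checked arithmetically: two BARs collide exactly when their address ranges intersect the same 4k frame. Here is a minimal sketch (a hypothetical helper, not hvmloader code), fed with the guest BAR placements from the hvmloader log above:

```python
PAGE_SHIFT = 12  # 4 KiB pages, the granularity at which P2M maps

def pages_of(base, size):
    """Return the set of 4k page frame numbers a BAR occupies."""
    first = base >> PAGE_SHIFT
    last = (base + size - 1) >> PAGE_SHIFT
    return set(range(first, last + 1))

def shared_pages(bars):
    """bars: dict name -> (base, size). Yield pairs of BARs whose
    MMIO ranges touch at least one common 4k page."""
    names = sorted(bars)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = pages_of(*bars[a]) & pages_of(*bars[b])
            if common:
                yield a, b, sorted(common)

# Guest BAR placement taken from the hvmloader log above
bars = {
    "05:0 bar 14": (0x0F3077800, 0x100),
    "05:0 bar 24": (0x0F3077000, 0x800),
}
for a, b, pages in shared_pages(bars):
    print(f"{a} and {b} share page(s) {[hex(p << PAGE_SHIFT) for p in pages]}")
```

For the two BARs of 05:0 this reports a single shared page at 0xf3077000, matching Jan's reading of the hvmloader output.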

> 3. Recompiled the kernel with DEBUG enabled for the xen_pciback driver
> and played with the xl pci-assignable-XXX commands:
> 3.1 It's confirmed that the DPC / AER error log happens only when
> xen_pciback attempts to seize && release the device.
> 3.1.1 It only happens on each of the first add / remove operations.
> 3.2 There is still a 'MSI-X preparation failed' message later on, but
> otherwise adding / removing the device appears to succeed after
> the 1st attempt.
> 3.3 Not necessarily related, but the DPC / AER log looks similar to
> this report [1]

The only thing I can say here is that quite likely pciback needs
work to become up-to-date again with advanced feature handling
elsewhere in the kernel.

Jan

> [1]: https://patchwork.kernel.org/project/linux-pci/patch/20220127025418.1989642-1-kai.heng.feng@canonical.com/#24713767


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04  9:50 ` Roger Pau Monné
@ 2022-07-04 10:34   ` Jan Beulich
  2022-07-04 11:34   ` G.R.
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2022-07-04 10:34 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, G.R.

On 04.07.2022 11:50, Roger Pau Monné wrote:
> On Sun, Jul 03, 2022 at 01:43:11AM +0800, G.R. wrote:
>> Hi everybody,
>>
>> I run into problems passing through a SN570 NVME SSD to a HVM guest.
>> So far I have no idea if the problem is with this specific SSD or with
>> the CPU + motherboard combination or the SW stack.
>> Looking for some suggestions on troubleshooting.
>>
>> List of build info:
>> CPU+motherboard: E-2146G + Gigabyte C246N-WU2
>> XEN version: 4.14.3
> 
> Are you using a debug build of Xen? (if not it would be helpful to do
> so).
> 
>> Dom0: Linux Kernel 5.10 (built from Debian 11.2 kernel source package)
>> The SN570 SSD sits here in the PCI tree:
>>            +-1d.0-[05]----00.0  Sandisk Corp Device 501a

As per this I understand that ...

> I'm slightly confused by the fact that the DPC / AER errors seem to be
> from a device that's different from the one you attempt to pass through
> (0000:00:1d.0 vs 0000:05:00.0).

... 00:1d.0 is the upstream bridge for 05:00.0.

Jan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04  9:50 ` Roger Pau Monné
  2022-07-04 10:34   ` Jan Beulich
@ 2022-07-04 11:34   ` G.R.
  2022-07-04 11:44     ` G.R.
  2022-07-04 13:09     ` Roger Pau Monné
  1 sibling, 2 replies; 31+ messages in thread
From: G.R. @ 2022-07-04 11:34 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-users, xen-devel

[-- Attachment #1: Type: text/plain, Size: 7118 bytes --]

On Mon, Jul 4, 2022 at 5:53 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> On Sun, Jul 03, 2022 at 01:43:11AM +0800, G.R. wrote:
> > Hi everybody,
> >
> > I run into problems passing through a SN570 NVME SSD to a HVM guest.
> > So far I have no idea if the problem is with this specific SSD or with
> > the CPU + motherboard combination or the SW stack.
> > Looking for some suggestions on troubleshooting.
> >
> > List of build info:
> > CPU+motherboard: E-2146G + Gigabyte C246N-WU2
> > XEN version: 4.14.3
>
> Are you using a debug build of Xen? (if not it would be helpful to do
> so).
It's a release build at the moment. I can switch to a debug build
later when my hands are free.
BTW, I made a DEBUG build of the xen_pciback driver to see how it plays
with the 'xl pci-assignable-xxx' commands.
You can find the results in my 2nd email in the chain.

>
> > Dom0: Linux Kernel 5.10 (built from Debian 11.2 kernel source package)
> > The SN570 SSD sits here in the PCI tree:
> >            +-1d.0-[05]----00.0  Sandisk Corp Device 501a
>
> Could be helpful to post the output with -vvv so we can see the
> capabilities of the device.
Sure, please find the -vvv output in the attachment.
The snippet above was just meant to indicate the connection in the PCI
tree, i.e. 05:00.0 is attached under 00:1d.0.

>
> > Syndromes observed:
> > With ASPM enabled, pciback has problem seizing the device.
> >
> > Jul  2 00:36:54 gaia kernel: [    1.648270] pciback 0000:05:00.0:
> > xen_pciback: seizing device
> > ...
> > Jul  2 00:36:54 gaia kernel: [    1.768646] pcieport 0000:00:1d.0:
> > AER: enabled with IRQ 150
> > Jul  2 00:36:54 gaia kernel: [    1.768716] pcieport 0000:00:1d.0:
> > DPC: enabled with IRQ 150
> > Jul  2 00:36:54 gaia kernel: [    1.768717] pcieport 0000:00:1d.0:
> > DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+
> > SwTrigger+ RP PIO Log 4, DL_ActiveErr+
>
> Is there a device reset involved here?  It's possible the device
> doesn't reset properly and hence the Uncorrectable Error Status
> Register ends up with inconsistent bits set.

xen_pciback appears to force an FLR whenever it attempts to seize a
capable device, as shown in pciback_dbg_xl-pci_assignable_XXX.log
attached to my 2nd mail:
[  323.448115] xen_pciback: wants to seize 0000:05:00.0
[  323.448136] pciback 0000:05:00.0: xen_pciback: probing...
[  323.448137] pciback 0000:05:00.0: xen_pciback: seizing device
[  323.448162] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  323.448162] pciback 0000:05:00.0: xen_pciback: initializing...
[  323.448163] pciback 0000:05:00.0: xen_pciback: initializing config
[  323.448344] pciback 0000:05:00.0: xen_pciback: enabling device
[  323.448425] xen: registering gsi 16 triggering 0 polarity 1
[  323.448428] Already setup the GSI :16
[  323.448497] pciback 0000:05:00.0: xen_pciback: save state of device
[  323.448642] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
etc) the device
[  323.448707] pcieport 0000:00:1d.0: DPC: containment event,
status:0x1f11 source:0x0000
[  323.448730] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  323.448760] pcieport 0000:00:1d.0: PCIe Bus Error:
severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
ID)
[  323.448786] pcieport 0000:00:1d.0:   device [8086:a330] error
status/mask=00200000/00010000
[  323.448813] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
[  324.690979] pciback 0000:05:00.0: not ready 1023ms after FLR;
waiting  <============ HERE
[  325.730706] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
[  327.997638] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
[  332.264251] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
[  340.584320] pciback 0000:05:00.0: not ready 16383ms after FLR;
waiting
[  357.010896] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
[  391.143951] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
[  392.249252] pciback 0000:05:00.0: xen_pciback: reset device
[  392.249392] pciback 0000:05:00.0: xen_pciback:
xen_pcibk_error_detected(bus:5,devfn:0)
[  392.249393] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397074] pciback 0000:05:00.0: xen_pciback:
xen_pcibk_error_resume(bus:5,devfn:0)
[  392.397080] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397284] pcieport 0000:00:1d.0: AER: device recovery successful
Note, I only see the FLR failure on the 1st attempt.
And my SATA controller, which doesn't support FLR, appears to pass
through just fine...
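For reference, whether an endpoint advertises FLR at all shows up in the DevCap block of `lspci -vvv` as `FLReset+`, while the `FLReset-` in the DevCtl line is merely the control bit that initiates a reset. Below is a minimal sketch (a hypothetical helper, not an existing tool) that makes that distinction; the sample lines are copied from the lspci attachment in this mail:

```python
import re

# Sample copied from the lspci -vvv attachment for 05:00.0 in this
# mail (PCI Express capability, DevCap/DevCtl lines).
SAMPLE = """\
Capabilities: [c0] Express (v2) Endpoint, MSI 00
\tDevCap:\tMaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
\t\tExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
\tDevCtl:\tCorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
\t\tRlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
"""

def advertises_flr(lspci_text):
    """Return True if FLReset+ appears inside the DevCap block
    (capability advertisement), ignoring the DevCtl control bit."""
    in_devcap = False
    for line in lspci_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("DevCap:"):
            in_devcap = True
        elif re.match(r"Dev(Ctl|Sta):", stripped):
            in_devcap = False
        if in_devcap and "FLReset+" in line:
            return True
    return False

print(advertises_flr(SAMPLE))  # -> True: 05:00.0 claims FLR support
```

So the SN570 does claim FLR support, which makes the "not ready ... after FLR" timeouts look like a device-side reset problem rather than a missing capability.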

>
> > ...
> > Jul  2 00:36:54 gaia kernel: [    1.770039] xen: registering gsi 16
> > triggering 0 polarity 1
> > Jul  2 00:36:54 gaia kernel: [    1.770041] Already setup the GSI :16
> > Jul  2 00:36:54 gaia kernel: [    1.770314] pcieport 0000:00:1d.0:
> > DPC: containment event, status:0x1f11 source:0x0000
> > Jul  2 00:36:54 gaia kernel: [    1.770315] pcieport 0000:00:1d.0:
> > DPC: unmasked uncorrectable error detected
> > Jul  2 00:36:54 gaia kernel: [    1.770320] pcieport 0000:00:1d.0:
> > PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction
> > Layer, (Receiver ID)
> > Jul  2 00:36:54 gaia kernel: [    1.770371] pcieport 0000:00:1d.0:
> > device [8086:a330] error status/mask=00200000/00010000
> > Jul  2 00:36:54 gaia kernel: [    1.770413] pcieport 0000:00:1d.0:
> > [21] ACSViol                (First)
> > Jul  2 00:36:54 gaia kernel: [    1.770466] pciback 0000:05:00.0:
> > xen_pciback: device is not found/assigned
> > Jul  2 00:36:54 gaia kernel: [    1.920195] pciback 0000:05:00.0:
> > xen_pciback: device is not found/assigned
> > Jul  2 00:36:54 gaia kernel: [    1.920260] pcieport 0000:00:1d.0:
> > AER: device recovery successful
> > Jul  2 00:36:54 gaia kernel: [    1.920263] pcieport 0000:00:1d.0:
> > DPC: containment event, status:0x1f01 source:0x0000
> > Jul  2 00:36:54 gaia kernel: [    1.920264] pcieport 0000:00:1d.0:
> > DPC: unmasked uncorrectable error detected
> > Jul  2 00:36:54 gaia kernel: [    1.920267] pciback 0000:05:00.0:
> > xen_pciback: device is not found/assigned
>
> That's from a different device (05:00.0).
00:1d.0 is the bridge port that 05:00.0 attaches to.


> >
> > After the 'xl pci-assignable-list' appears to be self-consistent,
> > creating a VM with the SSD assigned still leads to a guest crash:
> > From qemu log:
> > [00:06.0] xen_pt_region_update: Error: create new mem mapping failed! (err: 1)
> > qemu-system-i386: terminating on signal 1 from pid 1192 (xl)
> >
> > From the 'xl dmesg' output:
> > (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
>
> Seems like QEMU is attempting to remap a p2m_mmio_direct region.
>
> Can you paste the full output of `xl dmesg`? (as that will contain the
> memory map).
Attached.

>
> Would also be helpful if you could get the RMRR regions from that
> box. Booting with `iommu=verbose` on the Xen command line should print
> those.
Coming in my next reply...

[-- Attachment #2: lspcivvv_cutdown.log --]
[-- Type: text/x-log, Size: 9299 bytes --]

00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 126
	IOMMU group: 10
	Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
	I/O behind bridge: 0000f000-00000fff [disabled]
	Memory behind bridge: a2600000-a26fffff [size=1M]
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #9, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #12, PowerLimit 25.000W; Interlock- NoCompl+
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet- LinkState+
		RootCap: CRSVisible-
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled, ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee002b8  Data: 0000
	Capabilities: [90] Subsystem: Gigabyte Technology Co., Ltd Cannon Lake PCH PCI Express Root Port
	Capabilities: [a0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
		RootCmd: CERptEn+ NFERptEn+ FERptEn+
		RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
			 FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
		ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
	Capabilities: [140 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [150 v1] Precision Time Measurement
		PTMCap: Requester:- Responder:+ Root:+
		PTMClockGranularity: 4ns
		PTMControl: Enabled:+ RootSelected:+
		PTMEffectiveGranularity: Unknown
	Capabilities: [200 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=40us PortTPowerOnTime=44us
		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
			   T_CommonMode=40us LTR1.2_Threshold=65536ns
		L1SubCtl2: T_PwrOn=44us
	Capabilities: [220 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [250 v1] Downstream Port Containment
		DpcCap:	INT Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
		DpcCtl:	Trigger:1 Cmpl- INT+ ErrCor- PoisonedTLP- SwTrigger- DL_ActiveErr-
		DpcSta:	Trigger- Reason:00 INT- RPBusy- TriggerExt:00 RP PIO ErrPtr:1f
		Source:	0000
	Kernel driver in use: pcieport

05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 501a
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	NUMA node: 0
	IOMMU group: 13
	Region 0: Memory at a2600000 (64-bit, non-prefetchable) [size=16K]
	Region 4: Memory at a2604000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=4 offset=00000000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [900 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
			  PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=65536ns
		L1SubCtl2: T_PwrOn=44us
	Kernel driver in use: nvme
	Kernel modules: nvme


[-- Attachment #3: xldmesg_full.log --]
[-- Type: text/x-log, Size: 12609 bytes --]

 Xen 4.14.3
(XEN) Xen version 4.14.3 (firemeteor@) (gcc (Debian 11.2.0-13) 11.2.0) debug=n  Fri Jan  7 21:28:52 HKT 2022
(XEN) Latest ChangeSet: Sat Jan 14 15:41:32 2017 +0800 git:ff792a893a-dirty
(XEN) build-id: c43fcc69fb5fc9bcd292b31ba618a7d2335ec69a
(XEN) Bootloader: GRUB 2.04-20
(XEN) Command line: placeholder dom0_mem=2G,max:3G,min:1G dom0_max_vcpus=4 loglvl=all guest_loglvl=all
(XEN) Xen image load base address: 0x87a00000
(XEN) Video information:
(XEN)  VGA is text mode 80x25, font 8x16
(XEN)  VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) Disc information:
(XEN)  Found 5 MBR signatures
(XEN)  Found 5 EDD information structures
(XEN) CPU Vendor: Intel, Family 6 (0x6), Model 158 (0x9e), Stepping 10 (raw 000906ea)
(XEN) Xen-e820 RAM map:
(XEN)  [0000000000000000, 000000000009d3ff] (usable)
(XEN)  [000000000009d400, 000000000009ffff] (reserved)
(XEN)  [00000000000e0000, 00000000000fffff] (reserved)
(XEN)  [0000000000100000, 00000000835bffff] (usable)
(XEN)  [00000000835c0000, 00000000835c0fff] (ACPI NVS)
(XEN)  [00000000835c1000, 00000000835c1fff] (reserved)
(XEN)  [00000000835c2000, 0000000088c0bfff] (usable)
(XEN)  [0000000088c0c000, 000000008907dfff] (reserved)
(XEN)  [000000008907e000, 00000000891f4fff] (usable)
(XEN)  [00000000891f5000, 00000000895dcfff] (ACPI NVS)
(XEN)  [00000000895dd000, 0000000089efefff] (reserved)
(XEN)  [0000000089eff000, 0000000089efffff] (usable)
(XEN)  [0000000089f00000, 000000008f7fffff] (reserved)
(XEN)  [00000000e0000000, 00000000efffffff] (reserved)
(XEN)  [00000000fe000000, 00000000fe010fff] (reserved)
(XEN)  [00000000fec00000, 00000000fec00fff] (reserved)
(XEN)  [00000000fee00000, 00000000fee00fff] (reserved)
(XEN)  [00000000ff000000, 00000000ffffffff] (reserved)
(XEN)  [0000000100000000, 000000086e7fffff] (usable)
(XEN) ACPI: RSDP 000F05B0, 0024 (r2 ALASKA)
(XEN) ACPI: XSDT 895120A8, 00D4 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FACP 895509C0, 0114 (r6 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: DSDT 89512218, 3E7A6 (r2 ALASKA    A M I  1072009 INTL 20160527)
(XEN) ACPI: FACS 895DC080, 0040
(XEN) ACPI: APIC 89550AD8, 00F4 (r4 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FPDT 89550BD0, 0044 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FIDT 89550C18, 009C (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: MCFG 89550CB8, 003C (r1 ALASKA    A M I  1072009 MSFT       97)
(XEN) ACPI: SSDT 89550CF8, 0204 (r1 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89550F00, 17D5 (r2 ALASKA    A M I     3000 INTL 20160527)
(XEN) ACPI: SSDT 895526D8, 933D (r1 ALASKA    A M I        1 INTL 20160527)
(XEN) ACPI: SSDT 8955BA18, 31C7 (r2 ALASKA    A M I     3000 INTL 20160527)
(XEN) ACPI: SSDT 8955EBE0, 2358 (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: HPET 89560F38, 0038 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: SSDT 89560F70, 1BE1 (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89562B58, 0F9E (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89563AF8, 2D1B (r2 ALASKA    A M I        0 INTL 20160527)
(XEN) ACPI: UEFI 89566818, 0042 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: LPIT 89566860, 005C (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: SSDT 895668C0, 27DE (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 895690A0, 0FFE (r2 ALASKA    A M I        0 INTL 20160527)
(XEN) ACPI: DBGP 8956A0A0, 0034 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: DBG2 8956A0D8, 0054 (r0 ALASKA    A M I        2       1000013)
(XEN) ACPI: DMAR 8956A130, 00A8 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: WSMT 8956A1D8, 0028 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) System RAM: 32629MB (33412220kB)
(XEN) No NUMA configuration found
(XEN) Faking a node at 0000000000000000-000000086e800000
(XEN) Domain heap initialised
(XEN) found SMP MP-table at 000fce10
(XEN) SMBIOS 3.1 present.
(XEN) Using APIC driver default
(XEN) ACPI: PM-Timer IO Port: 0x1808 (24 bits)
(XEN) ACPI: v5 SLEEP INFO: control[1:1804], status[1:1800]
(XEN) ACPI: Invalid sleep control/status register data: 0:0x8:0x3 0:0x8:0x3
(XEN) ACPI: SLEEP INFO: pm1x_cnt[1:1804,1:0], pm1x_evt[1:1800,1:0]
(XEN) ACPI: 32/64X FACS address mismatch in FADT - 895dc080/0000000000000000, using 32
(XEN) ACPI:             wakeup_vec[895dc08c], vec_size[20]
(XEN) ACPI: Local APIC address 0xfee00000
(XEN) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x03] lapic_id[0x04] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x04] lapic_id[0x06] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x05] lapic_id[0x08] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x06] lapic_id[0x0a] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x07] lapic_id[0x01] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x08] lapic_id[0x03] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x09] lapic_id[0x05] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x07] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x09] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0b] enabled)
(XEN) ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x09] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x0a] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x0b] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x0c] high edge lint[0x1])
(XEN) Overriding APIC driver with bigsmp
(XEN) ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
(XEN) IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-119
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
(XEN) ACPI: IRQ0 used by override.
(XEN) ACPI: IRQ2 used by override.
(XEN) ACPI: IRQ9 used by override.
(XEN) Enabling APIC mode:  Phys.  Using 1 I/O APICs
(XEN) ACPI: HPET id: 0x8086a201 base: 0xfed00000
(XEN) PCI: MCFG configuration 0: base e0000000 segment 0000 buses 00 - ff
(XEN) PCI: MCFG area at e0000000 reserved in E820
(XEN) PCI: Using MCFG for segment 0000 bus 00-ff
(XEN) Using ACPI (MADT) for SMP configuration information
(XEN) SMP: Allowing 12 CPUs (0 hotplug CPUs)
(XEN) IRQ limits: 120 GSI, 2376 MSI/MSI-X
(XEN) Switched to APIC driver x2apic_cluster
(XEN) CPU0: TSC: ratio: 292 / 2
(XEN) CPU0: bus: 100 MHz base: 3500 MHz max: 4500 MHz
(XEN) CPU0: 800 ... 3500 MHz
(XEN) xstate: size: 0x440 and states: 0x1f
(XEN) CPU0: Intel machine check reporting enabled
(XEN) Speculative mitigation facilities:
(XEN)   Hardware features: IBRS/IBPB STIBP L1D_FLUSH SSBD MD_CLEAR
(XEN)   Compiled-in support: INDIRECT_THUNK SHADOW_PAGING
(XEN)   Xen settings: BTI-Thunk JMP, SPEC_CTRL: IBRS+ SSBD-, Other: IBPB L1D_FLUSH VERW BRANCH_HARDEN
(XEN)   L1TF: believed vulnerable, maxphysaddr L1D 46, CPUID 39, Safe address 8000000000
(XEN)   Support for HVM VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN)   Support for PV VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled (with PCID)
(XEN)   PV L1TF shadowing: Dom0 disabled, DomU enabled
(XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Initializing Credit2 scheduler
(XEN)  load_precision_shift: 18
(XEN)  load_window_shift: 30
(XEN)  underload_balance_tolerance: 0
(XEN)  overload_balance_tolerance: -3
(XEN)  runqueues arrangement: socket
(XEN)  cap enforcement granularity: 10ms
(XEN) load tracking window length 1073741824 ns
(XEN) Platform timer is 24.000MHz HPET
(XEN) Detected 3504.767 MHz processor.
(XEN) alt table ffff82d040439290 -> ffff82d040444238
(XEN) Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d Snoop Control not enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping enabled.
(XEN) Intel VT-d Posted Interrupt not enabled.
(XEN) Intel VT-d Shared EPT tables enabled.
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) nr_sockets: 1
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using old ACK method
(XEN) ..TIMER: vector=0xF0 apic1=0 pin1=2 apic2=-1 pin2=-1
(XEN) TSC deadline timer enabled
(XEN) Allocated console ring of 128 KiB.
(XEN) mwait-idle: MWAIT substates: 0x11142120
(XEN) mwait-idle: v0.4.1 model 0x9e
(XEN) mwait-idle: lapic_timer_reliable_states 0xffffffff
(XEN) VMX: Supported advanced features:
(XEN)  - APIC MMIO access virtualisation
(XEN)  - APIC TPR shadow
(XEN)  - Extended Page Tables (EPT)
(XEN)  - Virtual-Processor Identifiers (VPID)
(XEN)  - Virtual NMI
(XEN)  - MSR direct-access bitmap
(XEN)  - Unrestricted Guest
(XEN)  - VMCS shadowing
(XEN)  - VM Functions
(XEN)  - Virtualisation Exceptions
(XEN)  - Page Modification Logging
(XEN) HVM: ASIDs enabled.
(XEN) VMX: Disabling executable EPT superpages due to CVE-2018-12207
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging (HAP) detected
(XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
(XEN) alt table ffff82d040439290 -> ffff82d040444238
(XEN) Brought up 12 CPUs
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Adding cpu 0 to runqueue 0
(XEN)  First cpu on runqueue, activating
(XEN) Adding cpu 1 to runqueue 0
(XEN) Adding cpu 2 to runqueue 0
(XEN) Adding cpu 3 to runqueue 0
(XEN) Adding cpu 4 to runqueue 0
(XEN) Adding cpu 5 to runqueue 0
(XEN) Adding cpu 6 to runqueue 0
(XEN) Adding cpu 7 to runqueue 0
(XEN) Adding cpu 8 to runqueue 0
(XEN) Adding cpu 9 to runqueue 0
(XEN) Adding cpu 10 to runqueue 0
(XEN) Adding cpu 11 to runqueue 0
(XEN) mcheck_poll: Machine check polling timer started.
(XEN) NX (Execute Disable) protection active
(XEN) Dom0 has maximum 952 PIRQs
(XEN) *** Building a PV Dom0 ***
(XEN)  Xen  kernel: 64-bit, lsb, compat32
(XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x2e2c000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000000850000000->0000000854000000 (504314 pages to be allocated)
(XEN)  Init. ramdisk: 000000086d9fa000->000000086e7ff3f6
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff81000000->ffffffff82e2c000
(XEN)  Init. ramdisk: 0000000000000000->0000000000000000
(XEN)  Phys-Mach map: 0000008000000000->0000008000400000
(XEN)  Start info:    ffffffff82e2c000->ffffffff82e2c4b8
(XEN)  Xenstore ring: 0000000000000000->0000000000000000
(XEN)  Console ring:  0000000000000000->0000000000000000
(XEN)  Page tables:   ffffffff82e2d000->ffffffff82e48000
(XEN)  Boot stack:    ffffffff82e48000->ffffffff82e49000
(XEN)  TOTAL:         ffffffff80000000->ffffffff83000000
(XEN)  ENTRY ADDRESS: ffffffff82ab6160
(XEN) Dom0 has maximum 4 VCPUs
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: All
(XEN) Guest Loglevel: All
(XEN) ***************************************************
(XEN) Booted on L1TF-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Please assess your configuration and choose an
(XEN) explicit 'smt=<bool>' setting.  See XSA-273.
(XEN) ***************************************************
(XEN) Booted on MLPDS/MFBDS-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Mitigations will not be fully effective.  Please
(XEN) choose an explicit smt=<bool> setting.  See XSA-297.
(XEN) ***************************************************
(XEN) 3... 2... 1... 
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 552kB init memory
(XEN) PCI add device 0000:00:00.0
(XEN) PCI add device 0000:00:01.0
(XEN) PCI add device 0000:00:02.0
(XEN) PCI add device 0000:00:12.0
(XEN) PCI add device 0000:00:14.0
(XEN) PCI add device 0000:00:14.2
(XEN) PCI add device 0000:00:16.0
(XEN) PCI add device 0000:00:16.3
(XEN) PCI add device 0000:00:17.0
(XEN) PCI add device 0000:00:1b.0
(XEN) PCI add device 0000:00:1c.0
(XEN) PCI add device 0000:00:1c.5
(XEN) PCI add device 0000:00:1d.0
(XEN) PCI add device 0000:00:1f.0
(XEN) PCI add device 0000:00:1f.4
(XEN) PCI add device 0000:00:1f.5
(XEN) PCI add device 0000:01:00.0
(XEN) PCI add device 0000:04:00.0
(XEN) PCI add device 0000:05:00.0


* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 11:34   ` G.R.
@ 2022-07-04 11:44     ` G.R.
  2022-07-04 13:09     ` Roger Pau Monné
  1 sibling, 0 replies; 31+ messages in thread
From: G.R. @ 2022-07-04 11:44 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-users, xen-devel

[-- Attachment #1: Type: text/plain, Size: 374 bytes --]

On Mon, Jul 4, 2022 at 7:34 PM G.R. <firemeteor@users.sourceforge.net> wrote:
>
> On Mon, Jul 4, 2022 at 5:53 PM Roger Pau Monné <roger.pau@citrix.com> wrote:

> >
> > Would also be helpful if you could get the RMRR regions from that
> > box. Booting with `iommu=verbose` on the Xen command line should print
> > those.
> Coming in my next reply...
See attached.
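The RMRR report requested above can be filtered out of a saved `xl dmesg` capture with a short grep. A minimal sketch follows; it is demonstrated on an inlined excerpt so it runs anywhere, but on a live dom0 you would point the same grep at the real log:

```shell
# Sketch: pull the VT-d RMRR report out of an `xl dmesg` capture.
# On a live system: xl dmesg > xldmesg.log && grep -A1 ACPI_DMAR_RMRR xldmesg.log
# An inlined excerpt stands in for the saved log so the snippet is self-contained.
grep -A1 'ACPI_DMAR_RMRR' <<'EOF'
(XEN) [VT-D]found ACPI_DMAR_RMRR:
(XEN) [VT-D] endpoint: 0000:00:14.0
(XEN) [VT-D]found ACPI_DMAR_RMRR:
(XEN) [VT-D] endpoint: 0000:00:02.0
EOF
```

`-A1` keeps the endpoint line that follows each RMRR marker, which is the part that identifies the affected device.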

[-- Attachment #2: xldmesg_full_rmrr.log --]
[-- Type: text/x-log, Size: 13454 bytes --]

 Xen 4.14.3
(XEN) Xen version 4.14.3 (firemeteor@) (gcc (Debian 11.2.0-13) 11.2.0) debug=n  Fri Jan  7 21:28:52 HKT 2022
(XEN) Latest ChangeSet: Sat Jan 14 15:41:32 2017 +0800 git:ff792a893a-dirty
(XEN) build-id: c43fcc69fb5fc9bcd292b31ba618a7d2335ec69a
(XEN) Bootloader: GRUB 2.04-20
(XEN) Command line: placeholder dom0_mem=2G,max:3G,min:1G dom0_max_vcpus=4 loglvl=all guest_loglvl=all iommu=verbose
(XEN) Xen image load base address: 0x87a00000
(XEN) Video information:
(XEN)  VGA is text mode 80x25, font 8x16
(XEN)  VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) Disc information:
(XEN)  Found 5 MBR signatures
(XEN)  Found 5 EDD information structures
(XEN) CPU Vendor: Intel, Family 6 (0x6), Model 158 (0x9e), Stepping 10 (raw 000906ea)
(XEN) Xen-e820 RAM map:
(XEN)  [0000000000000000, 000000000009d3ff] (usable)
(XEN)  [000000000009d400, 000000000009ffff] (reserved)
(XEN)  [00000000000e0000, 00000000000fffff] (reserved)
(XEN)  [0000000000100000, 00000000835bffff] (usable)
(XEN)  [00000000835c0000, 00000000835c0fff] (ACPI NVS)
(XEN)  [00000000835c1000, 00000000835c1fff] (reserved)
(XEN)  [00000000835c2000, 0000000088c0bfff] (usable)
(XEN)  [0000000088c0c000, 000000008907dfff] (reserved)
(XEN)  [000000008907e000, 00000000891f4fff] (usable)
(XEN)  [00000000891f5000, 00000000895dcfff] (ACPI NVS)
(XEN)  [00000000895dd000, 0000000089efefff] (reserved)
(XEN)  [0000000089eff000, 0000000089efffff] (usable)
(XEN)  [0000000089f00000, 000000008f7fffff] (reserved)
(XEN)  [00000000e0000000, 00000000efffffff] (reserved)
(XEN)  [00000000fe000000, 00000000fe010fff] (reserved)
(XEN)  [00000000fec00000, 00000000fec00fff] (reserved)
(XEN)  [00000000fee00000, 00000000fee00fff] (reserved)
(XEN)  [00000000ff000000, 00000000ffffffff] (reserved)
(XEN)  [0000000100000000, 000000086e7fffff] (usable)
(XEN) ACPI: RSDP 000F05B0, 0024 (r2 ALASKA)
(XEN) ACPI: XSDT 895120A8, 00D4 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FACP 895509C0, 0114 (r6 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: DSDT 89512218, 3E7A6 (r2 ALASKA    A M I  1072009 INTL 20160527)
(XEN) ACPI: FACS 895DC080, 0040
(XEN) ACPI: APIC 89550AD8, 00F4 (r4 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FPDT 89550BD0, 0044 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FIDT 89550C18, 009C (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: MCFG 89550CB8, 003C (r1 ALASKA    A M I  1072009 MSFT       97)
(XEN) ACPI: SSDT 89550CF8, 0204 (r1 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89550F00, 17D5 (r2 ALASKA    A M I     3000 INTL 20160527)
(XEN) ACPI: SSDT 895526D8, 933D (r1 ALASKA    A M I        1 INTL 20160527)
(XEN) ACPI: SSDT 8955BA18, 31C7 (r2 ALASKA    A M I     3000 INTL 20160527)
(XEN) ACPI: SSDT 8955EBE0, 2358 (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: HPET 89560F38, 0038 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: SSDT 89560F70, 1BE1 (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89562B58, 0F9E (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89563AF8, 2D1B (r2 ALASKA    A M I        0 INTL 20160527)
(XEN) ACPI: UEFI 89566818, 0042 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: LPIT 89566860, 005C (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: SSDT 895668C0, 27DE (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 895690A0, 0FFE (r2 ALASKA    A M I        0 INTL 20160527)
(XEN) ACPI: DBGP 8956A0A0, 0034 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: DBG2 8956A0D8, 0054 (r0 ALASKA    A M I        2       1000013)
(XEN) ACPI: DMAR 8956A130, 00A8 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: WSMT 8956A1D8, 0028 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) System RAM: 32629MB (33412220kB)
(XEN) No NUMA configuration found
(XEN) Faking a node at 0000000000000000-000000086e800000
(XEN) Domain heap initialised
(XEN) found SMP MP-table at 000fce10
(XEN) SMBIOS 3.1 present.
(XEN) Using APIC driver default
(XEN) ACPI: PM-Timer IO Port: 0x1808 (24 bits)
(XEN) ACPI: v5 SLEEP INFO: control[1:1804], status[1:1800]
(XEN) ACPI: Invalid sleep control/status register data: 0:0x8:0x3 0:0x8:0x3
(XEN) ACPI: SLEEP INFO: pm1x_cnt[1:1804,1:0], pm1x_evt[1:1800,1:0]
(XEN) ACPI: 32/64X FACS address mismatch in FADT - 895dc080/0000000000000000, using 32
(XEN) ACPI:             wakeup_vec[895dc08c], vec_size[20]
(XEN) ACPI: Local APIC address 0xfee00000
(XEN) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x03] lapic_id[0x04] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x04] lapic_id[0x06] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x05] lapic_id[0x08] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x06] lapic_id[0x0a] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x07] lapic_id[0x01] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x08] lapic_id[0x03] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x09] lapic_id[0x05] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x07] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x09] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0b] enabled)
(XEN) ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x09] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x0a] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x0b] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x0c] high edge lint[0x1])
(XEN) Overriding APIC driver with bigsmp
(XEN) ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
(XEN) IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-119
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
(XEN) ACPI: IRQ0 used by override.
(XEN) ACPI: IRQ2 used by override.
(XEN) ACPI: IRQ9 used by override.
(XEN) Enabling APIC mode:  Phys.  Using 1 I/O APICs
(XEN) ACPI: HPET id: 0x8086a201 base: 0xfed00000
(XEN) PCI: MCFG configuration 0: base e0000000 segment 0000 buses 00 - ff
(XEN) PCI: MCFG area at e0000000 reserved in E820
(XEN) PCI: Using MCFG for segment 0000 bus 00-ff
(XEN) [VT-D]Host address width 39
(XEN) [VT-D]found ACPI_DMAR_DRHD:
(XEN) [VT-D]  dmaru->address = fed90000
(XEN) [VT-D]drhd->address = fed90000 iommu->reg = ffff82c00021d000
(XEN) [VT-D]cap = 1c0000c40660462 ecap = 19e2ff0505e
(XEN) [VT-D] endpoint: 0000:00:02.0
(XEN) [VT-D]found ACPI_DMAR_DRHD:
(XEN) [VT-D]  dmaru->address = fed91000
(XEN) [VT-D]drhd->address = fed91000 iommu->reg = ffff82c00021f000
(XEN) [VT-D]cap = d2008c40660462 ecap = f050da
(XEN) [VT-D] IOAPIC: 0000:00:1e.7
(XEN) [VT-D] MSI HPET: 0000:00:1e.6
(XEN) [VT-D]  flags: INCLUDE_ALL
(XEN) [VT-D]found ACPI_DMAR_RMRR:
(XEN) [VT-D] endpoint: 0000:00:14.0
(XEN) [VT-D]found ACPI_DMAR_RMRR:
(XEN) [VT-D] endpoint: 0000:00:02.0
(XEN) Using ACPI (MADT) for SMP configuration information
(XEN) SMP: Allowing 12 CPUs (0 hotplug CPUs)
(XEN) IRQ limits: 120 GSI, 2376 MSI/MSI-X
(XEN) Switched to APIC driver x2apic_cluster
(XEN) CPU0: TSC: ratio: 292 / 2
(XEN) CPU0: bus: 100 MHz base: 3500 MHz max: 4500 MHz
(XEN) CPU0: 800 ... 3500 MHz
(XEN) xstate: size: 0x440 and states: 0x1f
(XEN) CPU0: Intel machine check reporting enabled
(XEN) Speculative mitigation facilities:
(XEN)   Hardware features: IBRS/IBPB STIBP L1D_FLUSH SSBD MD_CLEAR
(XEN)   Compiled-in support: INDIRECT_THUNK SHADOW_PAGING
(XEN)   Xen settings: BTI-Thunk JMP, SPEC_CTRL: IBRS+ SSBD-, Other: IBPB L1D_FLUSH VERW BRANCH_HARDEN
(XEN)   L1TF: believed vulnerable, maxphysaddr L1D 46, CPUID 39, Safe address 8000000000
(XEN)   Support for HVM VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN)   Support for PV VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled (with PCID)
(XEN)   PV L1TF shadowing: Dom0 disabled, DomU enabled
(XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Initializing Credit2 scheduler
(XEN)  load_precision_shift: 18
(XEN)  load_window_shift: 30
(XEN)  underload_balance_tolerance: 0
(XEN)  overload_balance_tolerance: -3
(XEN)  runqueues arrangement: socket
(XEN)  cap enforcement granularity: 10ms
(XEN) load tracking window length 1073741824 ns
(XEN) Platform timer is 24.000MHz HPET
(XEN) Detected 3504.520 MHz processor.
(XEN) alt table ffff82d040439290 -> ffff82d040444238
(XEN) Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d Snoop Control not enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping enabled.
(XEN) Intel VT-d Posted Interrupt not enabled.
(XEN) Intel VT-d Shared EPT tables enabled.
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) nr_sockets: 1
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using old ACK method
(XEN) ..TIMER: vector=0xF0 apic1=0 pin1=2 apic2=-1 pin2=-1
(XEN) TSC deadline timer enabled
(XEN) Allocated console ring of 128 KiB.
(XEN) mwait-idle: MWAIT substates: 0x11142120
(XEN) mwait-idle: v0.4.1 model 0x9e
(XEN) mwait-idle: lapic_timer_reliable_states 0xffffffff
(XEN) VMX: Supported advanced features:
(XEN)  - APIC MMIO access virtualisation
(XEN)  - APIC TPR shadow
(XEN)  - Extended Page Tables (EPT)
(XEN)  - Virtual-Processor Identifiers (VPID)
(XEN)  - Virtual NMI
(XEN)  - MSR direct-access bitmap
(XEN)  - Unrestricted Guest
(XEN)  - VMCS shadowing
(XEN)  - VM Functions
(XEN)  - Virtualisation Exceptions
(XEN)  - Page Modification Logging
(XEN) HVM: ASIDs enabled.
(XEN) VMX: Disabling executable EPT superpages due to CVE-2018-12207
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging (HAP) detected
(XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
(XEN) alt table ffff82d040439290 -> ffff82d040444238
(XEN) Brought up 12 CPUs
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Adding cpu 0 to runqueue 0
(XEN)  First cpu on runqueue, activating
(XEN) Adding cpu 1 to runqueue 0
(XEN) Adding cpu 2 to runqueue 0
(XEN) Adding cpu 3 to runqueue 0
(XEN) Adding cpu 4 to runqueue 0
(XEN) Adding cpu 5 to runqueue 0
(XEN) Adding cpu 6 to runqueue 0
(XEN) Adding cpu 7 to runqueue 0
(XEN) Adding cpu 8 to runqueue 0
(XEN) Adding cpu 9 to runqueue 0
(XEN) Adding cpu 10 to runqueue 0
(XEN) Adding cpu 11 to runqueue 0
(XEN) mcheck_poll: Machine check polling timer started.
(XEN) NX (Execute Disable) protection active
(XEN) Dom0 has maximum 952 PIRQs
(XEN) *** Building a PV Dom0 ***
(XEN)  Xen  kernel: 64-bit, lsb, compat32
(XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x2e2c000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000000850000000->0000000854000000 (504314 pages to be allocated)
(XEN)  Init. ramdisk: 000000086d9fa000->000000086e7ff3f6
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff81000000->ffffffff82e2c000
(XEN)  Init. ramdisk: 0000000000000000->0000000000000000
(XEN)  Phys-Mach map: 0000008000000000->0000008000400000
(XEN)  Start info:    ffffffff82e2c000->ffffffff82e2c4b8
(XEN)  Xenstore ring: 0000000000000000->0000000000000000
(XEN)  Console ring:  0000000000000000->0000000000000000
(XEN)  Page tables:   ffffffff82e2d000->ffffffff82e48000
(XEN)  Boot stack:    ffffffff82e48000->ffffffff82e49000
(XEN)  TOTAL:         ffffffff80000000->ffffffff83000000
(XEN)  ENTRY ADDRESS: ffffffff82ab6160
(XEN) Dom0 has maximum 4 VCPUs
(XEN) [VT-D]iommu_enable_translation: iommu->reg = ffff82c00021d000
(XEN) [VT-D]iommu_enable_translation: iommu->reg = ffff82c00021f000
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: All
(XEN) Guest Loglevel: All
(XEN) ***************************************************
(XEN) Booted on L1TF-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Please assess your configuration and choose an
(XEN) explicit 'smt=<bool>' setting.  See XSA-273.
(XEN) ***************************************************
(XEN) Booted on MLPDS/MFBDS-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Mitigations will not be fully effective.  Please
(XEN) choose an explicit smt=<bool> setting.  See XSA-297.
(XEN) ***************************************************
(XEN) 3... 2... 1... 
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 552kB init memory
(XEN) PCI add device 0000:00:00.0
(XEN) PCI add device 0000:00:01.0
(XEN) PCI add device 0000:00:02.0
(XEN) PCI add device 0000:00:12.0
(XEN) PCI add device 0000:00:14.0
(XEN) PCI add device 0000:00:14.2
(XEN) PCI add device 0000:00:16.0
(XEN) PCI add device 0000:00:16.3
(XEN) PCI add device 0000:00:17.0
(XEN) PCI add device 0000:00:1b.0
(XEN) PCI add device 0000:00:1c.0
(XEN) PCI add device 0000:00:1c.5
(XEN) PCI add device 0000:00:1d.0
(XEN) PCI add device 0000:00:1f.0
(XEN) PCI add device 0000:00:1f.4
(XEN) PCI add device 0000:00:1f.5
(XEN) PCI add device 0000:01:00.0
(XEN) PCI add device 0000:04:00.0
(XEN) PCI add device 0000:05:00.0


* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 11:34   ` G.R.
  2022-07-04 11:44     ` G.R.
@ 2022-07-04 13:09     ` Roger Pau Monné
  2022-07-04 14:51       ` G.R.
  1 sibling, 1 reply; 31+ messages in thread
From: Roger Pau Monné @ 2022-07-04 13:09 UTC (permalink / raw)
  To: G.R.; +Cc: xen-devel

On Mon, Jul 04, 2022 at 07:34:47PM +0800, G.R. wrote:
> On Mon, Jul 4, 2022 at 5:53 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> >
> > On Sun, Jul 03, 2022 at 01:43:11AM +0800, G.R. wrote:
> > > Hi everybody,
> > >
> > > I run into problems passing through a SN570 NVME SSD to a HVM guest.
> > > So far I have no idea if the problem is with this specific SSD or with
> > > the CPU + motherboard combination or the SW stack.
> > > Looking for some suggestions on troubleshooting.
> > >
> > > List of build info:
> > > CPU+motherboard: E-2146G + Gigabyte C246N-WU2
> > > XEN version: 4.14.3
> >
> > Are you using a debug build of Xen? (if not it would be helpful to do
> > so).
> It's a release build at the moment. I can switch to a debug build
> later when I have a chance.
> BTW, I got a DEBUG build of the xen_pciback driver to see how it plays
> with 'xl pci-assignable-xxx' commands.
> You can find this in my 2nd email in the chain.
> 
> >
> > > Dom0: Linux Kernel 5.10 (built from Debian 11.2 kernel source package)
> > > The SN570 SSD sits here in the PCI tree:
> > >            +-1d.0-[05]----00.0  Sandisk Corp Device 501a
> >
> > Could be helpful to post the output with -vvv so we can see the
> > capabilities of the device.
> Sure, please find the -vvv output in the attachment.
> This one is just to indicate the connection in the PCI tree,
> i.e. 05:00.0 is attached under 00:1d.0.
> 
> >
> > > Syndromes observed:
> > > With ASPM enabled, pciback has problem seizing the device.
> > >
> > > Jul  2 00:36:54 gaia kernel: [    1.648270] pciback 0000:05:00.0:
> > > xen_pciback: seizing device
> > > ...
> > > Jul  2 00:36:54 gaia kernel: [    1.768646] pcieport 0000:00:1d.0:
> > > AER: enabled with IRQ 150
> > > Jul  2 00:36:54 gaia kernel: [    1.768716] pcieport 0000:00:1d.0:
> > > DPC: enabled with IRQ 150
> > > Jul  2 00:36:54 gaia kernel: [    1.768717] pcieport 0000:00:1d.0:
> > > DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+
> > > SwTrigger+ RP PIO Log 4, DL_ActiveErr+
> >
> > Is there a device reset involved here?  It's possible the device
> > doesn't reset properly and hence the Uncorrectable Error Status
> > Register ends up with inconsistent bits set.
> 
> xen_pciback appears to force an FLR whenever it attempts to seize a
> capable device, as shown in pciback_dbg_xl-pci_assignable_XXX.log
> attached to my 2nd mail:
> [  323.448115] xen_pciback: wants to seize 0000:05:00.0
> [  323.448136] pciback 0000:05:00.0: xen_pciback: probing...
> [  323.448137] pciback 0000:05:00.0: xen_pciback: seizing device
> [  323.448162] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
> [  323.448162] pciback 0000:05:00.0: xen_pciback: initializing...
> [  323.448163] pciback 0000:05:00.0: xen_pciback: initializing config
> [  323.448344] pciback 0000:05:00.0: xen_pciback: enabling device
> [  323.448425] xen: registering gsi 16 triggering 0 polarity 1
> [  323.448428] Already setup the GSI :16
> [  323.448497] pciback 0000:05:00.0: xen_pciback: save state of device
> [  323.448642] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
> etc) the device
> [  323.448707] pcieport 0000:00:1d.0: DPC: containment event,
> status:0x1f11 source:0x0000
> [  323.448730] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
> [  323.448760] pcieport 0000:00:1d.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
> ID)
> [  323.448786] pcieport 0000:00:1d.0:   device [8086:a330] error
> status/mask=00200000/00010000
> [  323.448813] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
> [  324.690979] pciback 0000:05:00.0: not ready 1023ms after FLR;
> waiting  <============ HERE
> [  325.730706] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
> [  327.997638] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
> [  332.264251] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
> [  340.584320] pciback 0000:05:00.0: not ready 16383ms after FLR;
> waiting
> [  357.010896] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
> [  391.143951] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
> [  392.249252] pciback 0000:05:00.0: xen_pciback: reset device
> [  392.249392] pciback 0000:05:00.0: xen_pciback:
> xen_pcibk_error_detected(bus:5,devfn:0)
> [  392.249393] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
> [  392.397074] pciback 0000:05:00.0: xen_pciback:
> xen_pcibk_error_resume(bus:5,devfn:0)
> [  392.397080] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
> [  392.397284] pcieport 0000:00:1d.0: AER: device recovery successful
> Note, I only see this failure during the FLR on the 1st attempt.
> And my SATA controller, which doesn't support FLR, appears to pass
> through just fine...
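Whether a device advertises Function Level Reset can be confirmed by looking for `FLReset+` in the DevCap line of `lspci -vvv`. A minimal sketch, checking an inlined sample instead of live hardware (the BDF 05:00.0 is the SSD from this thread):

```shell
# Sketch: detect FLR support from lspci -vvv output.
# On real hardware: lspci -vvv -s 05:00.0 | grep -q 'FLReset+'
# An inlined DevCap sample stands in for the live output here.
sample='DevCap: MaxPayload 256 bytes, PhantFunc 0
                ExtTag- RBE+ FLReset+'
if printf '%s\n' "$sample" | grep -q 'FLReset+'; then
    echo "device advertises FLR"
else
    echo "no FLR capability"
fi
```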
> 
> >
> > > ...
> > > Jul  2 00:36:54 gaia kernel: [    1.770039] xen: registering gsi 16
> > > triggering 0 polarity 1
> > > Jul  2 00:36:54 gaia kernel: [    1.770041] Already setup the GSI :16
> > > Jul  2 00:36:54 gaia kernel: [    1.770314] pcieport 0000:00:1d.0:
> > > DPC: containment event, status:0x1f11 source:0x0000
> > > Jul  2 00:36:54 gaia kernel: [    1.770315] pcieport 0000:00:1d.0:
> > > DPC: unmasked uncorrectable error detected
> > > Jul  2 00:36:54 gaia kernel: [    1.770320] pcieport 0000:00:1d.0:
> > > PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction
> > > Layer, (Receiver ID)
> > > Jul  2 00:36:54 gaia kernel: [    1.770371] pcieport 0000:00:1d.0:
> > > device [8086:a330] error status/mask=00200000/00010000
> > > Jul  2 00:36:54 gaia kernel: [    1.770413] pcieport 0000:00:1d.0:
> > > [21] ACSViol                (First)
> > > Jul  2 00:36:54 gaia kernel: [    1.770466] pciback 0000:05:00.0:
> > > xen_pciback: device is not found/assigned
> > > Jul  2 00:36:54 gaia kernel: [    1.920195] pciback 0000:05:00.0:
> > > xen_pciback: device is not found/assigned
> > > Jul  2 00:36:54 gaia kernel: [    1.920260] pcieport 0000:00:1d.0:
> > > AER: device recovery successful
> > > Jul  2 00:36:54 gaia kernel: [    1.920263] pcieport 0000:00:1d.0:
> > > DPC: containment event, status:0x1f01 source:0x0000
> > > Jul  2 00:36:54 gaia kernel: [    1.920264] pcieport 0000:00:1d.0:
> > > DPC: unmasked uncorrectable error detected
> > > Jul  2 00:36:54 gaia kernel: [    1.920267] pciback 0000:05:00.0:
> > > xen_pciback: device is not found/assigned
> >
> > That's from a different device (05:00.0).
> 00:1d.0 is the bridge port that 05:00.0 attaches to.
> 
> 
> > >
> > > After the 'xl pci-assignable-list' appears to be self-consistent,
> > > creating VM with the SSD assigned still leads to a guest crash:
> > > From qemu log:
> > > [00:06.0] xen_pt_region_update: Error: create new mem mapping failed! (err: 1)
> > > qemu-system-i386: terminating on signal 1 from pid 1192 (xl)
> > >
> > > From the 'xl dmesg' output:
> > > (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
> >
> > Seems like QEMU is attempting to remap a p2m_mmio_direct region.
> >
> > Can you paste the full output of `xl dmesg`? (as that will contain the
> > memory map).
> Attached.
> 
> >
> > Would also be helpful if you could get the RMRR regions from that
> > box. Booting with `iommu=verbose` on the Xen command line should print
> > those.
> Coming in my next reply...

> 00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0) (prog-if 00 [Normal decode])
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Latency: 0, Cache Line Size: 64 bytes
> 	Interrupt: pin A routed to IRQ 126
> 	IOMMU group: 10
> 	Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
> 	I/O behind bridge: 0000f000-00000fff [disabled]
> 	Memory behind bridge: a2600000-a26fffff [size=1M]
> 	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled]
> 	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
> 	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
> 		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
> 	Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
> 		DevCap:	MaxPayload 256 bytes, PhantFunc 0
> 			ExtTag- RBE+
> 		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
> 			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> 			MaxPayload 256 bytes, MaxReadReq 128 bytes
> 		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> 		LnkCap:	Port #9, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us
> 			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
> 		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> 			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> 		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
> 			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
> 		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
> 			Slot #12, PowerLimit 25.000W; Interlock- NoCompl+
> 		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
> 			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
> 		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
> 			Changed: MRL- PresDet- LinkState+
> 		RootCap: CRSVisible-
> 		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
> 		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> 		DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR+
> 			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
> 			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
> 			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
> 			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
> 		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled, ARIFwd-
> 			 AtomicOpsCtl: ReqEn- EgressBlck-
> 		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
> 		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> 			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> 			 Compliance De-emphasis: -6dB
> 		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
> 			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
> 			 Retimer- 2Retimers- CrosslinkRes: unsupported
> 	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
> 		Address: fee002b8  Data: 0000
> 	Capabilities: [90] Subsystem: Gigabyte Technology Co., Ltd Cannon Lake PCH PCI Express Root Port
> 	Capabilities: [a0] Power Management version 3
> 		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> 		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> 	Capabilities: [100 v1] Advanced Error Reporting
> 		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> 		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> 		UESvrt:	DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> 		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> 		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> 		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
> 			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> 		HeaderLog: 00000000 00000000 00000000 00000000
> 		RootCmd: CERptEn+ NFERptEn+ FERptEn+
> 		RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
> 			 FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
> 		ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
> 	Capabilities: [140 v1] Access Control Services
> 		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
> 		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
> 	Capabilities: [150 v1] Precision Time Measurement
> 		PTMCap: Requester:- Responder:+ Root:+
> 		PTMClockGranularity: 4ns
> 		PTMControl: Enabled:+ RootSelected:+
> 		PTMEffectiveGranularity: Unknown
> 	Capabilities: [200 v1] L1 PM Substates
> 		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> 			  PortCommonModeRestoreTime=40us PortTPowerOnTime=44us
> 		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> 			   T_CommonMode=40us LTR1.2_Threshold=65536ns
> 		L1SubCtl2: T_PwrOn=44us
> 	Capabilities: [220 v1] Secondary PCI Express
> 		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
> 		LaneErrStat: 0
> 	Capabilities: [250 v1] Downstream Port Containment
> 		DpcCap:	INT Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
> 		DpcCtl:	Trigger:1 Cmpl- INT+ ErrCor- PoisonedTLP- SwTrigger- DL_ActiveErr-
> 		DpcSta:	Trigger- Reason:00 INT- RPBusy- TriggerExt:00 RP PIO ErrPtr:1f
> 		Source:	0000
> 	Kernel driver in use: pcieport
> 
> 05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a (prog-if 02 [NVM Express])
> 	Subsystem: Sandisk Corp Device 501a
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Latency: 0, Cache Line Size: 64 bytes
> 	Interrupt: pin A routed to IRQ 16
> 	NUMA node: 0
> 	IOMMU group: 13
> 	Region 0: Memory at a2600000 (64-bit, non-prefetchable) [size=16K]
> 	Region 4: Memory at a2604000 (64-bit, non-prefetchable) [size=256]

I think I'm slightly confused, the overlapping happens at:

(XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted

So it's MFNs 0xa2616 and 0xa2504, yet none of those are in the BAR
ranges of this device.

Can you paste the lspci -vvv output for any other device you are also
passing through to this guest?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 13:09     ` Roger Pau Monné
@ 2022-07-04 14:51       ` G.R.
  2022-07-04 15:15         ` G.R.
  2022-07-04 15:33         ` Roger Pau Monné
  0 siblings, 2 replies; 31+ messages in thread
From: G.R. @ 2022-07-04 14:51 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2923 bytes --]

On Mon, Jul 4, 2022 at 9:09 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> >
> > 05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a (prog-if 02 [NVM Express])
> >       Subsystem: Sandisk Corp Device 501a
> >       Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> >       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >       Latency: 0, Cache Line Size: 64 bytes
> >       Interrupt: pin A routed to IRQ 16
> >       NUMA node: 0
> >       IOMMU group: 13
> >       Region 0: Memory at a2600000 (64-bit, non-prefetchable) [size=16K]
> >       Region 4: Memory at a2604000 (64-bit, non-prefetchable) [size=256]
>
> I think I'm slightly confused, the overlapping happens at:
>
> (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
>
> So it's MFNs 0xa2616 and 0xa2504, yet none of those are in the BAR
> ranges of this device.
>
> Can you paste the lspci -vvv output for any other device you are also
> passing through to this guest?
>

I just realized that the addresses may change between environments.
In the previous emails I used a cached dump from a Linux environment
running outside the hypervisor.
Sorry for the confusion. Here is a refreshed dump taken from the Xen dom0.

The other device I passed through is a SATA controller. I think I can
see what you are looking for now:
both a2616 and a2504 are found!

00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI
Controller (rev 10) (prog-if 01 [AHCI 1.0])
        DeviceName: Onboard - SATA
        Subsystem: Gigabyte Technology Co., Ltd Cannon Lake PCH SATA
AHCI Controller
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at a2610000 (32-bit, non-prefetchable) [size=8K]
        Region 1: Memory at a2616000 (32-bit, non-prefetchable) [size=256]
        Region 2: I/O ports at 4090 [size=8]
        Region 3: I/O ports at 4080 [size=4]
        Region 4: I/O ports at 4060 [size=32]

05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a
(prog-if 02 [NVM Express])
        Subsystem: Sandisk Corp Device 501a
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 11
        Region 0: Memory at a2500000 (64-bit, non-prefetchable) [size=16K]
        Region 4: Memory at a2504000 (64-bit, non-prefetchable) [size=256]
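
For reference, the link between the BAR addresses above and the MFNs in
Xen's "not permitted" message is plain 4KiB page-frame arithmetic. A
minimal sketch (illustration only, not Xen code):

```python
PAGE_SHIFT = 12  # 4 KiB pages

def bar_frames(base, size):
    """Page frame numbers covered by a BAR (illustration only)."""
    first = base >> PAGE_SHIFT
    last = (base + size - 1) >> PAGE_SHIFT
    return list(range(first, last + 1))

# 00:17.0 Region 1 and 05:00.0 Region 4, from the dumps above
sata_bar1 = bar_frames(0xa2616000, 0x100)
nvme_bar4 = bar_frames(0xa2504000, 0x100)

print([hex(f) for f in sata_bar1])  # ['0xa2616']
print([hex(f) for f in nvme_bar4])  # ['0xa2504']
```

Each 256-byte BAR occupies only part of a 4KiB page, which is how two
BARs from different devices can end up sharing a single guest page.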

Thanks,
G.R.



> Thanks, Roger.

[-- Attachment #2: lspcivvv_cutdown_refreshed.txt --]
[-- Type: text/plain, Size: 10383 bytes --]

00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI Controller (rev 10) (prog-if 01 [AHCI 1.0])
	DeviceName: Onboard - SATA
	Subsystem: Gigabyte Technology Co., Ltd Cannon Lake PCH SATA AHCI Controller
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at a2610000 (32-bit, non-prefetchable) [size=8K]
	Region 1: Memory at a2616000 (32-bit, non-prefetchable) [size=256]
	Region 2: I/O ports at 4090 [size=8]
	Region 3: I/O ports at 4080 [size=4]
	Region 4: I/O ports at 4060 [size=32]
	Region 5: Memory at a2615000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
		Address: 00000000  Data: 0000
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Kernel driver in use: pciback
	Kernel modules: ahci

00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 150
	Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
	I/O behind bridge: 0000f000-00000fff [disabled]
	Memory behind bridge: a2500000-a25fffff [size=1M]
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #9, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #12, PowerLimit 25.000W; Interlock- NoCompl+
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet- LinkState+
		RootCap: CRSVisible-
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled, ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00f98  Data: 0000
	Capabilities: [90] Subsystem: Gigabyte Technology Co., Ltd Cannon Lake PCH PCI Express Root Port
	Capabilities: [a0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
		RootCmd: CERptEn+ NFERptEn+ FERptEn+
		RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
			 FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
		ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
	Capabilities: [140 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [150 v1] Precision Time Measurement
		PTMCap: Requester:- Responder:+ Root:+
		PTMClockGranularity: 4ns
		PTMControl: Enabled:+ RootSelected:+
		PTMEffectiveGranularity: Unknown
	Capabilities: [200 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=40us PortTPowerOnTime=44us
		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
			   T_CommonMode=40us LTR1.2_Threshold=65536ns
		L1SubCtl2: T_PwrOn=44us
	Capabilities: [220 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [250 v1] Downstream Port Containment
		DpcCap:	INT Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
		DpcCtl:	Trigger:1 Cmpl- INT+ ErrCor- PoisonedTLP- SwTrigger- DL_ActiveErr-
		DpcSta:	Trigger- Reason:00 INT- RPBusy- TriggerExt:00 RP PIO ErrPtr:1f
		Source:	0000
	Kernel driver in use: pcieport

05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 501a
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at a2500000 (64-bit, non-prefetchable) [size=16K]
	Region 4: Memory at a2504000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable- Count=17 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=4 offset=00000000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [900 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
			  PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=65536ns
		L1SubCtl2: T_PwrOn=44us
	Kernel modules: nvme



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 14:51       ` G.R.
@ 2022-07-04 15:15         ` G.R.
  2022-07-04 15:37           ` G.R.
  2022-07-04 15:33         ` Roger Pau Monné
  1 sibling, 1 reply; 31+ messages in thread
From: G.R. @ 2022-07-04 15:15 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel

On Mon, Jul 4, 2022 at 10:51 PM G.R. <firemeteor@users.sourceforge.net> wrote:
>
> On Mon, Jul 4, 2022 at 9:09 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> > >
> > > 05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a (prog-if 02 [NVM Express])
> > >       Subsystem: Sandisk Corp Device 501a
> > >       Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> > >       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > >       Latency: 0, Cache Line Size: 64 bytes
> > >       Interrupt: pin A routed to IRQ 16
> > >       NUMA node: 0
> > >       IOMMU group: 13
> > >       Region 0: Memory at a2600000 (64-bit, non-prefetchable) [size=16K]
> > >       Region 4: Memory at a2604000 (64-bit, non-prefetchable) [size=256]
> >
> > I think I'm slightly confused, the overlapping happens at:
> >
> > (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
> >
> > So it's MFNs 0xa2616 and 0xa2504, yet none of those are in the BAR
> > ranges of this device.
> >
> > Can you paste the lspci -vvv output for any other device you are also
> > passing through to this guest?
> >

Prompted by this request, I tried assigning this NVMe device to
another FreeBSD 12 domU.
This time it does not fail at the VM setup stage, but the device is
still not usable in the domU.
The nvmecontrol command is unable to talk to the device at all:
nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0

The QEMU log says the following:
[00:05.0] Write-back to unknown field 0x09 (partially) inhibited (0x00)
[00:05.0] If the device doesn't work, try enabling permissive mode
[00:05.0] (unsafe) and if it helps report the problem to xen-devel
[00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)

xl dmesg says the following:
(XEN) d[IO]: assign (0000:05:00.0) failed (-16)
(XEN) HVM d5v0 save: CPU
(XEN) HVM d5v1 save: CPU
(XEN) HVM d5v2 save: CPU
(XEN) HVM d5v3 save: CPU
(XEN) HVM d5 save: PIC
(XEN) HVM d5 save: IOAPIC
(XEN) HVM d5v0 save: LAPIC
(XEN) HVM d5v1 save: LAPIC
(XEN) HVM d5v2 save: LAPIC
(XEN) HVM d5v3 save: LAPIC
(XEN) HVM d5v0 save: LAPIC_REGS
(XEN) HVM d5v1 save: LAPIC_REGS
(XEN) HVM d5v2 save: LAPIC_REGS
(XEN) HVM d5v3 save: LAPIC_REGS
(XEN) HVM d5 save: PCI_IRQ
(XEN) HVM d5 save: ISA_IRQ
(XEN) HVM d5 save: PCI_LINK
(XEN) HVM d5 save: PIT
(XEN) HVM d5 save: RTC
(XEN) HVM d5 save: HPET
(XEN) HVM d5 save: PMTIMER
(XEN) HVM d5v0 save: MTRR
(XEN) HVM d5v1 save: MTRR
(XEN) HVM d5v2 save: MTRR
(XEN) HVM d5v3 save: MTRR
(XEN) HVM d5 save: VIRIDIAN_DOMAIN
(XEN) HVM d5v0 save: CPU_XSAVE
(XEN) HVM d5v1 save: CPU_XSAVE
(XEN) HVM d5v2 save: CPU_XSAVE
(XEN) HVM d5v3 save: CPU_XSAVE
(XEN) HVM d5v0 save: VIRIDIAN_VCPU
(XEN) HVM d5v1 save: VIRIDIAN_VCPU
(XEN) HVM d5v2 save: VIRIDIAN_VCPU
(XEN) HVM d5v3 save: VIRIDIAN_VCPU
(XEN) HVM d5v0 save: VMCE_VCPU
(XEN) HVM d5v1 save: VMCE_VCPU
(XEN) HVM d5v2 save: VMCE_VCPU
(XEN) HVM d5v3 save: VMCE_VCPU
(XEN) HVM d5v0 save: TSC_ADJUST
(XEN) HVM d5v1 save: TSC_ADJUST
(XEN) HVM d5v2 save: TSC_ADJUST
(XEN) HVM d5v3 save: TSC_ADJUST
(XEN) HVM d5v0 save: CPU_MSR
(XEN) HVM d5v1 save: CPU_MSR
(XEN) HVM d5v2 save: CPU_MSR
(XEN) HVM d5v3 save: CPU_MSR
(XEN) HVM5 restore: CPU 0
(XEN) d5: bind: m_gsi=16 g_gsi=36 dev=00.00.5 intx=0
(d5) HVM Loader
(d5) Detected Xen v4.14.3
(d5) Xenbus rings @0xfeffc000, event channel 1
(d5) System requested SeaBIOS
(d5) CPU speed is 3505 MHz
(d5) Relocating guest memory for lowmem MMIO space disabled
(d5) PCI-ISA link 0 routed to IRQ5
(d5) PCI-ISA link 1 routed to IRQ10
(d5) PCI-ISA link 2 routed to IRQ11
(d5) PCI-ISA link 3 routed to IRQ5
(d5) pci dev 01:3 INTA->IRQ10
(d5) pci dev 02:0 INTA->IRQ11
(d5) pci dev 04:0 INTA->IRQ5
(d5) pci dev 05:0 INTA->IRQ10
(d5) No RAM in high memory; setting high_mem resource base to 100000000
(d5) pci dev 03:0 bar 10 size 002000000: 0f0000008
(d5) pci dev 02:0 bar 14 size 001000000: 0f2000008
(d5) pci dev 04:0 bar 30 size 000040000: 0f3000000
(d5) pci dev 04:0 bar 10 size 000020000: 0f3040000
(d5) pci dev 03:0 bar 30 size 000010000: 0f3060000
(d5) pci dev 05:0 bar 10 size 000004000: 0f3070004
(d5) pci dev 03:0 bar 14 size 000001000: 0f3074000
(d5) pci dev 02:0 bar 10 size 000000100: 00000c001
(d5) pci dev 05:0 bar 20 size 000000100: 0f3075004
(d5) pci dev 04:0 bar 14 size 000000040: 00000c101
(d5) pci dev 01:1 bar 20 size 000000010: 00000c141
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1
(d5) Multiprocessor initialisation:
(d5)  - CPU0 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d5)  - CPU1 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d5)  - CPU2 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d5)  - CPU3 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d5) Writing SMBIOS tables ...
(d5) Loading SeaBIOS ...
(d5) Creating MP tables ...
(d5) Loading ACPI ...
(d5) vm86 TSS at fc100300
(d5) BIOS map:
(d5)  10000-100e3: Scratch space
(d5)  c0000-fffff: Main BIOS
(d5) E820 table:
(d5)  [00]: 00000000:00000000 - 00000000:000a0000: RAM
(d5)  HOLE: 00000000:000a0000 - 00000000:000c0000
(d5)  [01]: 00000000:000c0000 - 00000000:00100000: RESERVED
(d5)  [02]: 00000000:00100000 - 00000000:7f800000: RAM
(d5)  HOLE: 00000000:7f800000 - 00000000:fc000000
(d5)  [03]: 00000000:fc000000 - 00000000:fc00b000: NVS
(d5)  [04]: 00000000:fc00b000 - 00000001:00000000: RESERVED
(d5) Invoking SeaBIOS ...
(d5) SeaBIOS (version rel-1.13.0-1-gd542924-Xen)
(d5) BUILD: gcc: (Debian 11.2.0-13) 11.2.0 binutils: (GNU Binutils for
Debian) 2.37
(d5)
(d5) Found Xen hypervisor signature at 40000000
(XEN) memory_map:remove: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:remove: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:remove: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:remove: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 14:51       ` G.R.
  2022-07-04 15:15         ` G.R.
@ 2022-07-04 15:33         ` Roger Pau Monné
  2022-07-04 15:44           ` G.R.
  2022-07-04 16:05           ` Jan Beulich
  1 sibling, 2 replies; 31+ messages in thread
From: Roger Pau Monné @ 2022-07-04 15:33 UTC (permalink / raw)
  To: G.R.; +Cc: xen-devel

On Mon, Jul 04, 2022 at 10:51:53PM +0800, G.R. wrote:
> On Mon, Jul 4, 2022 at 9:09 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> > >
> > > 05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a (prog-if 02 [NVM Express])
> > >       Subsystem: Sandisk Corp Device 501a
> > >       Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> > >       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > >       Latency: 0, Cache Line Size: 64 bytes
> > >       Interrupt: pin A routed to IRQ 16
> > >       NUMA node: 0
> > >       IOMMU group: 13
> > >       Region 0: Memory at a2600000 (64-bit, non-prefetchable) [size=16K]
> > >       Region 4: Memory at a2604000 (64-bit, non-prefetchable) [size=256]
> >
> > I think I'm slightly confused, the overlapping happens at:
> >
> > (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
> >
> > So it's MFNs 0xa2616 and 0xa2504, yet none of those are in the BAR
> > ranges of this device.
> >
> > Can you paste the lspci -vvv output for any other device you are also
> > passing through to this guest?
> >
> 
> I just realized that the address may change in different environments.
> In previous email chains, I used a cached dump from a Linux
> environment running outside the hypervisor.
> Sorry for the confusion. Refreshing with a XEN dom0 dump.
> 
> The other device I used is a SATA controller. I think I can get what
> you are looking for now.
> Both a2616 and a2504 are found!
> 
> 00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI
> Controller (rev 10) (prog-if 01 [AHCI 1.0])
>         DeviceName: Onboard - SATA
>         Subsystem: Gigabyte Technology Co., Ltd Cannon Lake PCH SATA
> AHCI Controller
>         Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
> >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Interrupt: pin A routed to IRQ 16
>         Region 0: Memory at a2610000 (32-bit, non-prefetchable) [size=8K]
>         Region 1: Memory at a2616000 (32-bit, non-prefetchable) [size=256]
>         Region 2: I/O ports at 4090 [size=8]
>         Region 3: I/O ports at 4080 [size=4]
>         Region 4: I/O ports at 4060 [size=32]
> 
> 05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a
> (prog-if 02 [NVM Express])
>         Subsystem: Sandisk Corp Device 501a
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Interrupt: pin A routed to IRQ 11
>         Region 0: Memory at a2500000 (64-bit, non-prefetchable) [size=16K]
>         Region 4: Memory at a2504000 (64-bit, non-prefetchable) [size=256]

Right, so hvmloader attempts to place a BAR from 05:00.0 and a BAR
from 00:17.0 into the same page, which is not that good behavior.  It
might be sensible to attempt to share the page if both BARs belong to
the same device, but not if they belong to different devices.
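
The placement policy being suggested can be sketched like this (a toy
model of the idea with hypothetical names, not hvmloader's or the
patch's actual code):

```python
PAGE = 0x1000  # 4 KiB

def place_bars(bars, base=0xf3000000):
    """Assign MMIO addresses to (device, size) BARs in order.

    Sub-page BARs may share a 4 KiB page only when they belong to the
    same device; a device change forces alignment to a fresh page.
    Toy model only -- the base address is arbitrary.
    """
    placed = []
    addr = base
    prev_dev = None
    for dev, size in bars:
        if prev_dev is not None and dev != prev_dev and addr % PAGE:
            addr = (addr + PAGE - 1) & ~(PAGE - 1)  # round up to a new page
        placed.append((dev, addr, size))
        addr += size
        prev_dev = dev
    return placed

layout = place_bars([("05:00.0", 0x100),   # NVMe, sub-page BAR
                     ("05:00.0", 0x100),   # same device: may share the page
                     ("00:17.0", 0x100)])  # SATA: pushed to a fresh page
```

With this policy the two NVMe BARs land at 0xf3000000 and 0xf3000100
(sharing a page), while the SATA BAR is bumped to 0xf3001000 instead of
sharing the NVMe page.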

I think the following patch:

https://lore.kernel.org/xen-devel/20200117110811.43321-1-roger.pau@citrix.com/

Might help with this.

Thanks, Roger.



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 15:15         ` G.R.
@ 2022-07-04 15:37           ` G.R.
  2022-07-04 16:05             ` Roger Pau Monné
  0 siblings, 1 reply; 31+ messages in thread
From: G.R. @ 2022-07-04 15:37 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel

On Mon, Jul 4, 2022 at 11:15 PM G.R. <firemeteor@users.sourceforge.net> wrote:
>
> On Mon, Jul 4, 2022 at 10:51 PM G.R. <firemeteor@users.sourceforge.net> wrote:
> >
> > On Mon, Jul 4, 2022 at 9:09 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> > > Can you paste the lspci -vvv output for any other device you are also
> > > passing through to this guest?
> > >
>
> As reminded by this request, I tried to assign this nvme device to
> another FreeBSD12 domU.
Just to clarify: this time the NVMe SSD is the only device I passed to this VM.

> This time it does not fail at the VM setup stage, but the device is
> still not usable at the domU.
> The nvmecontrol command is not able to talk to the device at all:
> nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
> nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
>
> The QEMU log says the following:
> 00:05.0] Write-back to unknown field 0x09 (partially) inhibited (0x00)
> [00:05.0] If the device doesn't work, try enabling permissive mode
> [00:05.0] (unsafe) and if it helps report the problem to xen-devel
> [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)

I retried with the following:
pci=['05:00.0,permissive=1,msitranslate=1']
Those extra options suppressed some of the error logging, but still
didn't make the device usable in the domU.
The nvmecontrol command still gets an ABORTED result from the kernel...

The only message remaining in the QEMU log is this one:
[00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)

The xl dmesg output appears to be identical, except that this line is gone:
(XEN) d[IO]: assign (0000:05:00.0) failed (-16)
In both cases I see the following, which suggests that the MSI-X
failure has been worked around:
(XEN) d5: bind: m_gsi=16 g_gsi=36 dev=00.00.5 intx=0

So what's the situation now?



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 15:33         ` Roger Pau Monné
@ 2022-07-04 15:44           ` G.R.
  2022-07-04 15:57             ` Roger Pau Monné
  2022-07-04 16:05           ` Jan Beulich
  1 sibling, 1 reply; 31+ messages in thread
From: G.R. @ 2022-07-04 15:44 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel

On Mon, Jul 4, 2022 at 11:33 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> Right, so hvmloader attempts to place a BAR from 05:00.0 and a BAR
> from 00:17.0 into the same page, which is not that good behavior.  It
> might be sensible to attempt to share the page if both BARs belong to
> the same device, but not if they belong to different devices.
>
> I think the following patch:
>
> https://lore.kernel.org/xen-devel/20200117110811.43321-1-roger.pau@citrix.com/
>
> Might help with this.
>
> Thanks, Roger.
I suppose this patch has been released in a newer XEN version that I
can pick up if I decide to upgrade?
Which version would it be?

On the other hand, judging from the other experiment I did, this may
not be the only issue related to this device.
I'm still not sure whether the device or the SW stack is at fault this time...

Thanks,
G.R.



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 15:44           ` G.R.
@ 2022-07-04 15:57             ` Roger Pau Monné
  2022-07-05 18:06               ` Jason Andryuk
  0 siblings, 1 reply; 31+ messages in thread
From: Roger Pau Monné @ 2022-07-04 15:57 UTC (permalink / raw)
  To: G.R., jandryuk; +Cc: xen-devel

On Mon, Jul 04, 2022 at 11:44:14PM +0800, G.R. wrote:
> On Mon, Jul 4, 2022 at 11:33 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> >
> > Right, so hvmloader attempts to place a BAR from 05:00.0 and a BAR
> > from 00:17.0 into the same page, which is not that good behavior.  It
> > might be sensible to attempt to share the page if both BARs belong to
> > the same device, but not if they belong to different devices.
> >
> > I think the following patch:
> >
> > https://lore.kernel.org/xen-devel/20200117110811.43321-1-roger.pau@citrix.com/
> >
> > Might help with this.
> >
> > Thanks, Roger.
> I suppose this patch has been released in a newer XEN version that I
> can pick up if I decide to upgrade?
> Which version would it be?
> 
> On the other hand, according to the other experiment I did, this may
> not be the only issue related to this device.
> Still not sure if the device or the SW stack is faulty this time...

I don't think this patch has been applied to any release, adding Jason
who I think was also interested in the fix and might provide more
info.

Thanks, Roger.



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 15:37           ` G.R.
@ 2022-07-04 16:05             ` Roger Pau Monné
  2022-07-04 16:07               ` Jan Beulich
  2022-07-04 16:31               ` G.R.
  0 siblings, 2 replies; 31+ messages in thread
From: Roger Pau Monné @ 2022-07-04 16:05 UTC (permalink / raw)
  To: G.R.; +Cc: xen-devel, Jan Beulich

On Mon, Jul 04, 2022 at 11:37:13PM +0800, G.R. wrote:
> On Mon, Jul 4, 2022 at 11:15 PM G.R. <firemeteor@users.sourceforge.net> wrote:
> >
> > On Mon, Jul 4, 2022 at 10:51 PM G.R. <firemeteor@users.sourceforge.net> wrote:
> > >
> > > On Mon, Jul 4, 2022 at 9:09 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> > > > Can you paste the lspci -vvv output for any other device you are also
> > > > passing through to this guest?
> > > >
> >
> > As reminded by this request, I tried to assign this nvme device to
> > another FreeBSD12 domU.
> Just to clarify, this time this NVME SSD is the only device I passed to this VM.
> 
> > This time it does not fail at the VM setup stage, but the device is
> > still not usable at the domU.
> > The nvmecontrol command is not able to talk to the device at all:
> > nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> > nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
> > nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> > nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
> >
> > The QEMU log says the following:
> > [00:05.0] Write-back to unknown field 0x09 (partially) inhibited (0x00)
> > [00:05.0] If the device doesn't work, try enabling permissive mode
> > [00:05.0] (unsafe) and if it helps report the problem to xen-devel
> > [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)
> 
> I retried with the following:
> pci=['05:00.0,permissive=1,msitranslate=1']
> Those extra options suppressed some error logging, but still didn't
> make the device usable to the domU.
> The nvmecontrol command still gets an ABORTED result from the kernel...
> 
> The only thing remaining in the QEMU log file is this one:
> [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)

Hm it seems like Xen doesn't find the position of the MSI-X table
correctly, given there's only one error path from msi.c returning
-ENODATA (61).
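For context, finding the MSI-X table means walking the device's PCI capability list in config space and reading the Table Offset/BIR register from the MSI-X capability (ID 0x11). A minimal illustrative sketch of that lookup in Python, operating on a 256-byte config-space dump (not Xen's actual code, just the mechanism):

```python
def find_msix_table(cfg):
    """Walk the PCI capability list in a 256-byte config-space dump
    and return (BIR, table_offset) from the MSI-X capability (ID 0x11),
    or None if the device has no MSI-X capability."""
    status = cfg[0x06] | (cfg[0x07] << 8)
    if not (status & 0x10):            # capabilities list not present
        return None
    pos = cfg[0x34] & 0xFC             # capabilities pointer
    while pos:
        cap_id, nxt = cfg[pos], cfg[pos + 1]
        if cap_id == 0x11:             # MSI-X capability
            table = int.from_bytes(cfg[pos + 4:pos + 8], 'little')
            return table & 0x7, table & ~0x7   # BIR in low 3 bits
        pos = nxt & 0xFC
    return None
```

If this walk comes up empty or returns a bogus BIR/offset, the mapping step has nothing valid to work with, which would match the error being reported here.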

Are there errors from pciback when this happens?  I would expect the
call to pci_prepare_msix() from pciback to fail and thus also report
some error?

I think it's likely I will have to provide an additional debug patch
to Xen; maybe Jan has an idea of what could be going on.

Roger.



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 15:33         ` Roger Pau Monné
  2022-07-04 15:44           ` G.R.
@ 2022-07-04 16:05           ` Jan Beulich
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2022-07-04 16:05 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, G.R.

On 04.07.2022 17:33, Roger Pau Monné wrote:
> On Mon, Jul 04, 2022 at 10:51:53PM +0800, G.R. wrote:
>> On Mon, Jul 4, 2022 at 9:09 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
>>>>
>>>> 05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a (prog-if 02 [NVM Express])
>>>>       Subsystem: Sandisk Corp Device 501a
>>>>       Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>>>>       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>       Latency: 0, Cache Line Size: 64 bytes
>>>>       Interrupt: pin A routed to IRQ 16
>>>>       NUMA node: 0
>>>>       IOMMU group: 13
>>>>       Region 0: Memory at a2600000 (64-bit, non-prefetchable) [size=16K]
>>>>       Region 4: Memory at a2604000 (64-bit, non-prefetchable) [size=256]
>>>
>>> I think I'm slightly confused, the overlapping happens at:
>>>
>>> (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
>>>
>>> So it's MFNs 0xa2616 and 0xa2504, yet none of those are in the BAR
>>> ranges of this device.
>>>
>>> Can you paste the lspci -vvv output for any other device you are also
>>> passing through to this guest?
>>>
>>
>> I just realized that the address may change in different environments.
>> In previous email chains, I used a cached dump from a Linux
>> environment running outside the hypervisor.
>> Sorry for the confusion. Refreshing with a XEN dom0 dump.
>>
>> The other device I used is a SATA controller. I think I can get what
>> you are looking for now.
>> Both a2616 and a2504 are found!
>>
>> 00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI
>> Controller (rev 10) (prog-if 01 [AHCI 1.0])
>>         DeviceName: Onboard - SATA
>>         Subsystem: Gigabyte Technology Co., Ltd Cannon Lake PCH SATA
>> AHCI Controller
>>         Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop-
>> ParErr- Stepping- SERR- FastB2B- DisINTx-
>>         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>         Interrupt: pin A routed to IRQ 16
>>         Region 0: Memory at a2610000 (32-bit, non-prefetchable) [size=8K]
>>         Region 1: Memory at a2616000 (32-bit, non-prefetchable) [size=256]
>>         Region 2: I/O ports at 4090 [size=8]
>>         Region 3: I/O ports at 4080 [size=4]
>>         Region 4: I/O ports at 4060 [size=32]
>>
>> 05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a
>> (prog-if 02 [NVM Express])
>>         Subsystem: Sandisk Corp Device 501a
>>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>> ParErr- Stepping- SERR- FastB2B- DisINTx-
>>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>         Latency: 0, Cache Line Size: 64 bytes
>>         Interrupt: pin A routed to IRQ 11
>>         Region 0: Memory at a2500000 (64-bit, non-prefetchable) [size=16K]
>>         Region 4: Memory at a2504000 (64-bit, non-prefetchable) [size=256]
> 
> Right, so hvmloader attempts to place a BAR from 05:00.0 and a BAR
> from 00:17.0 into the same page, which is not that good behavior.  It
> might be sensible to attempt to share the page if both BARs belong to
> the same device, but not if they belong to different devices.
> 
> I think the following patch:
> 
> https://lore.kernel.org/xen-devel/20200117110811.43321-1-roger.pau@citrix.com/

Hmm, yes, we definitely want to revive that one. Having gone through
the discussion again, I think what is needed is suitable checking in
the tool stack and in Xen for proper alignment. Unless of course
non-page-aligned BARs could be adjusted "on the fly" by some
interaction with the kernel (perhaps at pci-assignable-add time), in
which case it would only be Xen where a (final) check would want
adding. Of course, if we can't adjust things "on the fly", then clear
direction needs to be provided to users as to what they need to do in
order to be able to assign a given device to a guest.
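For illustration, the page-sharing problem can be checked directly against the lspci dump quoted above; a small Python sketch (pfn_span is a helper name made up for this illustration):

```python
PAGE_SHIFT = 12   # 4 KiB pages

def pfn_span(base, size):
    """Page frame numbers touched by a BAR at `base` of `size` bytes."""
    return set(range(base >> PAGE_SHIFT,
                     ((base + size - 1) >> PAGE_SHIFT) + 1))

# Host addresses from the lspci dump above:
sata_region1 = pfn_span(0xa2616000, 256)   # 00:17.0 Region 1
nvme_region4 = pfn_span(0xa2504000, 256)   # 05:00.0 Region 4

# Each 256-byte BAR fills only a slice of a single 4 KiB page, so
# hvmloader may pack both into one guest page -- yet they are backed
# by different machine frames, the exact pair in Xen's error:
#   (XEN) d1: GFN 0xf3078 (0xa2616,...) -> (0xa2504,...) not permitted
assert sata_region1 == {0xa2616}
assert nvme_region4 == {0xa2504}
```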

Jan



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 16:05             ` Roger Pau Monné
@ 2022-07-04 16:07               ` Jan Beulich
  2022-07-04 16:31               ` G.R.
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2022-07-04 16:07 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, G.R.

On 04.07.2022 18:05, Roger Pau Monné wrote:
> On Mon, Jul 04, 2022 at 11:37:13PM +0800, G.R. wrote:
>> On Mon, Jul 4, 2022 at 11:15 PM G.R. <firemeteor@users.sourceforge.net> wrote:
>>>
>>> On Mon, Jul 4, 2022 at 10:51 PM G.R. <firemeteor@users.sourceforge.net> wrote:
>>>>
>>>> On Mon, Jul 4, 2022 at 9:09 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
>>>>> Can you paste the lspci -vvv output for any other device you are also
>>>>> passing through to this guest?
>>>>>
>>>
>>> As reminded by this request, I tried to assign this nvme device to
>>> another FreeBSD12 domU.
>> Just to clarify, this time this NVME SSD is the only device I passed to this VM.
>>
>>> This time it does not fail at the VM setup stage, but the device is
>>> still not usable at the domU.
>>> The nvmecontrol command is not able to talk to the device at all:
>>> nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
>>> nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
>>>
>>> The QEMU log says the following:
>>> [00:05.0] Write-back to unknown field 0x09 (partially) inhibited (0x00)
>>> [00:05.0] If the device doesn't work, try enabling permissive mode
>>> [00:05.0] (unsafe) and if it helps report the problem to xen-devel
>>> [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)
>>
>> I retried with the following:
>> pci=['05:00.0,permissive=1,msitranslate=1']
>> Those extra options suppressed some error logging, but still didn't
>> make the device usable to the domU.
>> The nvmecontrol command still gets an ABORTED result from the kernel...
>>
>> The only thing remaining in the QEMU log file is this one:
>> [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)
> 
> Hm it seems like Xen doesn't find the position of the MSI-X table
> correctly, given there's only one error path from msi.c returning
> -ENODATA (61).
> 
> Are there errors from pciback when this happens?  I would expect the
> call to pci_prepare_msix() from pciback to fail and thus also report
> some error?
> 
> I think it's likely I will have to provide an additional debug patch
> to Xen, maybe Jan has an idea of what could be going on.

No, sorry, not without - as you say - further debugging output added.

Jan



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 16:05             ` Roger Pau Monné
  2022-07-04 16:07               ` Jan Beulich
@ 2022-07-04 16:31               ` G.R.
  2022-07-05  7:29                 ` Jan Beulich
  1 sibling, 1 reply; 31+ messages in thread
From: G.R. @ 2022-07-04 16:31 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Jan Beulich

[-- Attachment #1: Type: text/plain, Size: 2628 bytes --]

On Tue, Jul 5, 2022 at 12:21 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> On Mon, Jul 04, 2022 at 11:37:13PM +0800, G.R. wrote:
> > On Mon, Jul 4, 2022 at 11:15 PM G.R. <firemeteor@users.sourceforge.net> wrote:
> > >
> > > On Mon, Jul 4, 2022 at 10:51 PM G.R. <firemeteor@users.sourceforge.net> wrote:
> > > >
> > > > On Mon, Jul 4, 2022 at 9:09 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> > > > > Can you paste the lspci -vvv output for any other device you are also
> > > > > passing through to this guest?
> > > > >
> > >
> > > As reminded by this request, I tried to assign this nvme device to
> > > another FreeBSD12 domU.
> > Just to clarify, this time this NVME SSD is the only device I passed to this VM.
> >
> > > This time it does not fail at the VM setup stage, but the device is
> > > still not usable at the domU.
> > > The nvmecontrol command is not able to talk to the device at all:
> > > nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> > > nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
> > > nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
> > > nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
> > >
> > > The QEMU log says the following:
> > > [00:05.0] Write-back to unknown field 0x09 (partially) inhibited (0x00)
> > > [00:05.0] If the device doesn't work, try enabling permissive mode
> > > [00:05.0] (unsafe) and if it helps report the problem to xen-devel
> > > [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)
> >
> > I retried with the following:
> > pci=['05:00.0,permissive=1,msitranslate=1']
> > Those extra options suppressed some error logging, but still didn't
> > make the device usable to the domU.
> > The nvmecontrol command still gets an ABORTED result from the kernel...
> > 
> > The only thing remaining in the QEMU log file is this one:
> > [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)
>
> Hm it seems like Xen doesn't find the position of the MSI-X table
> correctly, given there's only one error path from msi.c returning
> -ENODATA (61).
>
> Are there errors from pciback when this happens?  I would expect the
> call to pci_prepare_msix() from pciback to fail and thus also report
> some error?
>
> I think it's likely I will have to provide an additional debug patch
> to Xen, maybe Jan has an idea of what could be going on.
>
pciback reports the same MSI-X related error.
But even with DEBUG enabled, I didn't see any more context reported.
Please find the details in the attachment.

> Roger.

[-- Attachment #2: pciback_dbg_xl-pci_assignable_XXX.log --]
[-- Type: text/x-log, Size: 6680 bytes --]

root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.

[  323.448115] xen_pciback: wants to seize 0000:05:00.0
[  323.448136] pciback 0000:05:00.0: xen_pciback: probing...
[  323.448137] pciback 0000:05:00.0: xen_pciback: seizing device
[  323.448162] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  323.448162] pciback 0000:05:00.0: xen_pciback: initializing...
[  323.448163] pciback 0000:05:00.0: xen_pciback: initializing config
[  323.448344] pciback 0000:05:00.0: xen_pciback: enabling device
[  323.448425] xen: registering gsi 16 triggering 0 polarity 1
[  323.448428] Already setup the GSI :16
[  323.448497] pciback 0000:05:00.0: xen_pciback: save state of device
[  323.448642] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  323.448707] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
[  323.448730] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  323.448760] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[  323.448786] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00200000/00010000
[  323.448813] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
[  324.690979] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
[  325.730706] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
[  327.997638] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
[  332.264251] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
[  340.584320] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
[  357.010896] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
[  391.143951] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
[  392.249252] pciback 0000:05:00.0: xen_pciback: reset device
[  392.249392] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_detected(bus:5,devfn:0)
[  392.249393] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397074] pciback 0000:05:00.0: xen_pciback: xen_pcibk_error_resume(bus:5,devfn:0)
[  392.397080] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  392.397284] pcieport 0000:00:1d.0: AER: device recovery successful

libxl: error: libxl_pci.c:835:libxl__device_pci_assignable_add: failed to quarantine 0000:05:00.0

root@gaia:~# xl pci-assignable-remove 05:00.0
libxl: error: libxl_pci.c:853:libxl__device_pci_assignable_remove: failed to de-quarantine 0000:05:00.0
root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:794:libxl__device_pci_assignable_add: 0000:05:00.0 already assigned to pciback
root@gaia:~# xl pci-assignable-remove 05:00.0
[  603.928039] pciback 0000:05:00.0: xen_pciback: removing
[  603.928041] pciback 0000:05:00.0: xen_pciback: found device to remove 
[  603.928042] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[  604.033372] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
[  604.033512] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  604.033631] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[  604.033758] pcieport 0000:00:1d.0:   device [8086:a330] error status/mask=00100000/00010000
[  604.033856] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
[  604.033939] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 05000010 00000000 88458845
[  604.034059] pci 0000:05:00.0: AER: can't recover (no error_detected callback)
[  604.034421] xen_pciback: removed 0000:05:00.0 from seize list
[  604.182597] pcieport 0000:00:1d.0: AER: device recovery successful

root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.
[  667.582051] xen_pciback: wants to seize 0000:05:00.0
[  667.582130] pciback 0000:05:00.0: xen_pciback: probing...
[  667.582134] pciback 0000:05:00.0: xen_pciback: seizing device
[  667.582228] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  667.582231] pciback 0000:05:00.0: xen_pciback: initializing...
[  667.582235] pciback 0000:05:00.0: xen_pciback: initializing config
[  667.582548] pciback 0000:05:00.0: xen_pciback: enabling device
[  667.582599] pciback 0000:05:00.0: enabling device (0000 -> 0002)
[  667.582912] xen: registering gsi 16 triggering 0 polarity 1
[  667.582923] Already setup the GSI :16
[  667.583061] pciback 0000:05:00.0: xen_pciback: MSI-X preparation failed (-6)
[  667.583148] pciback 0000:05:00.0: xen_pciback: save state of device
[  667.583569] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  667.689656] pciback 0000:05:00.0: xen_pciback: reset device

root@gaia:~# xl pci-assignable-remove 05:00.0
[  720.957988] pciback 0000:05:00.0: xen_pciback: removing
[  720.957996] pciback 0000:05:00.0: xen_pciback: found device to remove 
[  720.957999] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[  721.065222] pciback 0000:05:00.0: xen_pciback: MSI-X release failed (-16)
[  721.065667] xen_pciback: removed 0000:05:00.0 from seize list

root@gaia:~# xl pci-assignable-add 05:00.0
libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.

[  763.888631] xen_pciback: wants to seize 0000:05:00.0
[  763.888690] pciback 0000:05:00.0: xen_pciback: probing...
[  763.888691] pciback 0000:05:00.0: xen_pciback: seizing device
[  763.888716] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[  763.888717] pciback 0000:05:00.0: xen_pciback: initializing...
[  763.888717] pciback 0000:05:00.0: xen_pciback: initializing config
[  763.888804] pciback 0000:05:00.0: xen_pciback: enabling device
[  763.888885] xen: registering gsi 16 triggering 0 polarity 1
[  763.888889] Already setup the GSI :16
[  763.888949] pciback 0000:05:00.0: xen_pciback: MSI-X preparation failed (-6)
[  763.888977] pciback 0000:05:00.0: xen_pciback: save state of device
[  763.889126] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  763.994206] pciback 0000:05:00.0: xen_pciback: reset device

root@gaia:~# xl pci-assignable-remove 05:00.0
[  819.491000] pciback 0000:05:00.0: xen_pciback: removing
[  819.491002] pciback 0000:05:00.0: xen_pciback: found device to remove 
[  819.491003] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[  819.596113] pciback 0000:05:00.0: xen_pciback: MSI-X release failed (-16)
[  819.596466] xen_pciback: removed 0000:05:00.0 from seize list
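The "not ready ...ms after FLR; waiting" lines earlier in this log back off exponentially before giving up at roughly 65 seconds. A minimal sketch of that wait pattern (is_ready and sleep_ms are stand-in names for this illustration, not kernel APIs):

```python
def wait_for_ready(is_ready, sleep_ms, timeout_ms=65535):
    """Poll `is_ready()` with exponentially growing delays, in the
    spirit of the post-FLR wait in the log above.  Returns the total
    milliseconds waited on success, or None after giving up, matching
    the "giving up" line in the log."""
    delay = 1000
    waited = 0
    while True:
        sleep_ms(delay)
        waited += delay
        if is_ready():
            return waited
        if waited >= timeout_ms:
            return None        # "not ready 65535ms after FLR; giving up"
        delay *= 2
```

A device that never comes back, as in the first attempt above, exhausts the whole backoff sequence and the caller then has to treat the reset as failed.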



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 16:31               ` G.R.
@ 2022-07-05  7:29                 ` Jan Beulich
  2022-07-05 11:31                   ` G.R.
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2022-07-05  7:29 UTC (permalink / raw)
  To: G.R.; +Cc: xen-devel, Roger Pau Monné

On 04.07.2022 18:31, G.R. wrote:
> On Tue, Jul 5, 2022 at 12:21 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>>
>> On Mon, Jul 04, 2022 at 11:37:13PM +0800, G.R. wrote:
>>> On Mon, Jul 4, 2022 at 11:15 PM G.R. <firemeteor@users.sourceforge.net> wrote:
>>>>
>>>> On Mon, Jul 4, 2022 at 10:51 PM G.R. <firemeteor@users.sourceforge.net> wrote:
>>>>>
>>>>> On Mon, Jul 4, 2022 at 9:09 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
>>>>>> Can you paste the lspci -vvv output for any other device you are also
>>>>>> passing through to this guest?
>>>>>>
>>>>
>>>> As reminded by this request, I tried to assign this nvme device to
>>>> another FreeBSD12 domU.
>>> Just to clarify, this time this NVME SSD is the only device I passed to this VM.
>>>
>>>> This time it does not fail at the VM setup stage, but the device is
>>>> still not usable at the domU.
>>>> The nvmecontrol command is not able to talk to the device at all:
>>>> nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
>>>> nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:0 cdw0:0
>>>>
>>>> The QEMU log says the following:
>>>> [00:05.0] Write-back to unknown field 0x09 (partially) inhibited (0x00)
>>>> [00:05.0] If the device doesn't work, try enabling permissive mode
>>>> [00:05.0] (unsafe) and if it helps report the problem to xen-devel
>>>> [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)
>>>
>>> I retried with the following:
>>> pci=['05:00.0,permissive=1,msitranslate=1']
>>> Those extra options suppressed some error logging, but still didn't
>>> make the device usable to the domU.
>>> The nvmecontrol command still get ABORTED result from the kernel...
>>>
>>> The only thing remained in the QEMU file is this one:
>>> [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)
>>
>> Hm it seems like Xen doesn't find the position of the MSI-X table
>> correctly, given there's only one error path from msi.c returning
>> -ENODATA (61).
>>
>> Are there errors from pciback when this happens?  I would expect the
>> call to pci_prepare_msix() from pciback to fail and thus also report
>> some error?
>>
>> I think it's likely I will have to provide an additional debug patch
>> to Xen, maybe Jan has an idea of what could be going on.
>>
> pciback reports the same MSI-x related error.
> But even with DEBUG enabled, I didn't see more context reported.
> Please find details from the attachment.

And nothing pertinent in "xl dmesg"? Looking back through the thread I
couldn't spot a complete hypervisor log (i.e. from boot to assignment
attempt). An issue with MSI-X table determination, as Roger suspects,
would typically be associated with a prominent warning emitted to the
log. But there are also further possible sources of -ENXIO, which
would go silently.
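The numeric codes scattered through this thread are ordinary Linux errno values, which makes the various logs easier to cross-reference; a quick sanity check in Python (these numeric values are Linux-specific):

```python
import errno

# err 61 in QEMU's msi_msix_setup message:
assert errno.errorcode[61] == 'ENODATA'
# pciback's "MSI-X preparation failed (-6)":
assert errno.errorcode[6] == 'ENXIO'
# pciback's "MSI-X release failed (-16)":
assert errno.errorcode[16] == 'EBUSY'
```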

Jan



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-05  7:29                 ` Jan Beulich
@ 2022-07-05 11:31                   ` G.R.
  2022-07-05 11:59                     ` Jan Beulich
  0 siblings, 1 reply; 31+ messages in thread
From: G.R. @ 2022-07-05 11:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 1826 bytes --]

On Tue, Jul 5, 2022 at 5:04 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 04.07.2022 18:31, G.R. wrote:
> > On Tue, Jul 5, 2022 at 12:21 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
> >>> I retried with the following:
> >>> pci=['05:00.0,permissive=1,msitranslate=1']
> >>> Those extra options suppressed some error logging, but still didn't
> >>> make the device usable to the domU.
> >>> The nvmecontrol command still gets an ABORTED result from the kernel...
> >>>
> >>> The only thing remaining in the QEMU log file is this one:
> >>> [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)
> >>
> >> Hm it seems like Xen doesn't find the position of the MSI-X table
> >> correctly, given there's only one error path from msi.c returning
> >> -ENODATA (61).
> >>
> >> Are there errors from pciback when this happens?  I would expect the
> >> call to pci_prepare_msix() from pciback to fail and thus also report
> >> some error?
> >>
> >> I think it's likely I will have to provide an additional debug patch
> >> to Xen, maybe Jan has an idea of what could be going on.
> >>
> > pciback reports the same MSI-x related error.
> > But even with DEBUG enabled, I didn't see more context reported.
> > Please find details from the attachment.
>
> And nothing pertinent in "xl dmesg"? Looking back through the thread I
> couldn't spot a complete hypervisor log (i.e. from boot to assignment
> attempt). An issue with MSI-X table determination, as Roger suspects,
> would typically be associated with a prominent warning emitted to the
> log. But there are also further possible sources of -ENXIO, which
> would go silently.
Please find the xl dmesg output in the attachment.
It covers the two FreeBSD domU attempts, so it should have captured
some culprits if there are any...

[-- Attachment #2: xldmesg_really_full.log --]
[-- Type: application/octet-stream, Size: 29261 bytes --]

 Xen 4.14.3
(XEN) Xen version 4.14.3 (firemeteor@) (gcc (Debian 11.2.0-13) 11.2.0) debug=n  Fri Jan  7 21:28:52 HKT 2022
(XEN) Latest ChangeSet: Sat Jan 14 15:41:32 2017 +0800 git:ff792a893a-dirty
(XEN) build-id: c43fcc69fb5fc9bcd292b31ba618a7d2335ec69a
(XEN) Bootloader: GRUB 2.04-20
(XEN) Command line: placeholder dom0_mem=2G,max:3G,min:1G dom0_max_vcpus=4 loglvl=all guest_loglvl=all iommu=verbose
(XEN) Xen image load base address: 0x87a00000
(XEN) Video information:
(XEN)  VGA is text mode 80x25, font 8x16
(XEN)  VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) Disc information:
(XEN)  Found 5 MBR signatures
(XEN)  Found 5 EDD information structures
(XEN) CPU Vendor: Intel, Family 6 (0x6), Model 158 (0x9e), Stepping 10 (raw 000906ea)
(XEN) Xen-e820 RAM map:
(XEN)  [0000000000000000, 000000000009d3ff] (usable)
(XEN)  [000000000009d400, 000000000009ffff] (reserved)
(XEN)  [00000000000e0000, 00000000000fffff] (reserved)
(XEN)  [0000000000100000, 00000000835bffff] (usable)
(XEN)  [00000000835c0000, 00000000835c0fff] (ACPI NVS)
(XEN)  [00000000835c1000, 00000000835c1fff] (reserved)
(XEN)  [00000000835c2000, 0000000088c0bfff] (usable)
(XEN)  [0000000088c0c000, 000000008907dfff] (reserved)
(XEN)  [000000008907e000, 00000000891f4fff] (usable)
(XEN)  [00000000891f5000, 00000000895dcfff] (ACPI NVS)
(XEN)  [00000000895dd000, 0000000089efefff] (reserved)
(XEN)  [0000000089eff000, 0000000089efffff] (usable)
(XEN)  [0000000089f00000, 000000008f7fffff] (reserved)
(XEN)  [00000000e0000000, 00000000efffffff] (reserved)
(XEN)  [00000000fe000000, 00000000fe010fff] (reserved)
(XEN)  [00000000fec00000, 00000000fec00fff] (reserved)
(XEN)  [00000000fee00000, 00000000fee00fff] (reserved)
(XEN)  [00000000ff000000, 00000000ffffffff] (reserved)
(XEN)  [0000000100000000, 000000086e7fffff] (usable)
(XEN) ACPI: RSDP 000F05B0, 0024 (r2 ALASKA)
(XEN) ACPI: XSDT 895120A8, 00D4 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FACP 895509C0, 0114 (r6 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: DSDT 89512218, 3E7A6 (r2 ALASKA    A M I  1072009 INTL 20160527)
(XEN) ACPI: FACS 895DC080, 0040
(XEN) ACPI: APIC 89550AD8, 00F4 (r4 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FPDT 89550BD0, 0044 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FIDT 89550C18, 009C (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: MCFG 89550CB8, 003C (r1 ALASKA    A M I  1072009 MSFT       97)
(XEN) ACPI: SSDT 89550CF8, 0204 (r1 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89550F00, 17D5 (r2 ALASKA    A M I     3000 INTL 20160527)
(XEN) ACPI: SSDT 895526D8, 933D (r1 ALASKA    A M I        1 INTL 20160527)
(XEN) ACPI: SSDT 8955BA18, 31C7 (r2 ALASKA    A M I     3000 INTL 20160527)
(XEN) ACPI: SSDT 8955EBE0, 2358 (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: HPET 89560F38, 0038 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: SSDT 89560F70, 1BE1 (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89562B58, 0F9E (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89563AF8, 2D1B (r2 ALASKA    A M I        0 INTL 20160527)
(XEN) ACPI: UEFI 89566818, 0042 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: LPIT 89566860, 005C (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: SSDT 895668C0, 27DE (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 895690A0, 0FFE (r2 ALASKA    A M I        0 INTL 20160527)
(XEN) ACPI: DBGP 8956A0A0, 0034 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: DBG2 8956A0D8, 0054 (r0 ALASKA    A M I        2       1000013)
(XEN) ACPI: DMAR 8956A130, 00A8 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: WSMT 8956A1D8, 0028 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) System RAM: 32629MB (33412220kB)
(XEN) No NUMA configuration found
(XEN) Faking a node at 0000000000000000-000000086e800000
(XEN) Domain heap initialised
(XEN) found SMP MP-table at 000fce10
(XEN) SMBIOS 3.1 present.
(XEN) Using APIC driver default
(XEN) ACPI: PM-Timer IO Port: 0x1808 (24 bits)
(XEN) ACPI: v5 SLEEP INFO: control[1:1804], status[1:1800]
(XEN) ACPI: Invalid sleep control/status register data: 0:0x8:0x3 0:0x8:0x3
(XEN) ACPI: SLEEP INFO: pm1x_cnt[1:1804,1:0], pm1x_evt[1:1800,1:0]
(XEN) ACPI: 32/64X FACS address mismatch in FADT - 895dc080/0000000000000000, using 32
(XEN) ACPI:             wakeup_vec[895dc08c], vec_size[20]
(XEN) ACPI: Local APIC address 0xfee00000
(XEN) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x03] lapic_id[0x04] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x04] lapic_id[0x06] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x05] lapic_id[0x08] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x06] lapic_id[0x0a] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x07] lapic_id[0x01] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x08] lapic_id[0x03] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x09] lapic_id[0x05] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x07] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x09] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0b] enabled)
(XEN) ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x09] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x0a] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x0b] high edge lint[0x1])
(XEN) ACPI: LAPIC_NMI (acpi_id[0x0c] high edge lint[0x1])
(XEN) Overriding APIC driver with bigsmp
(XEN) ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
(XEN) IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-119
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
(XEN) ACPI: IRQ0 used by override.
(XEN) ACPI: IRQ2 used by override.
(XEN) ACPI: IRQ9 used by override.
(XEN) Enabling APIC mode:  Phys.  Using 1 I/O APICs
(XEN) ACPI: HPET id: 0x8086a201 base: 0xfed00000
(XEN) PCI: MCFG configuration 0: base e0000000 segment 0000 buses 00 - ff
(XEN) PCI: MCFG area at e0000000 reserved in E820
(XEN) PCI: Using MCFG for segment 0000 bus 00-ff
(XEN) [VT-D]Host address width 39
(XEN) [VT-D]found ACPI_DMAR_DRHD:
(XEN) [VT-D]  dmaru->address = fed90000
(XEN) [VT-D]drhd->address = fed90000 iommu->reg = ffff82c00021d000
(XEN) [VT-D]cap = 1c0000c40660462 ecap = 19e2ff0505e
(XEN) [VT-D] endpoint: 0000:00:02.0
(XEN) [VT-D]found ACPI_DMAR_DRHD:
(XEN) [VT-D]  dmaru->address = fed91000
(XEN) [VT-D]drhd->address = fed91000 iommu->reg = ffff82c00021f000
(XEN) [VT-D]cap = d2008c40660462 ecap = f050da
(XEN) [VT-D] IOAPIC: 0000:00:1e.7
(XEN) [VT-D] MSI HPET: 0000:00:1e.6
(XEN) [VT-D]  flags: INCLUDE_ALL
(XEN) [VT-D]found ACPI_DMAR_RMRR:
(XEN) [VT-D] endpoint: 0000:00:14.0
(XEN) [VT-D]found ACPI_DMAR_RMRR:
(XEN) [VT-D] endpoint: 0000:00:02.0
(XEN) Using ACPI (MADT) for SMP configuration information
(XEN) SMP: Allowing 12 CPUs (0 hotplug CPUs)
(XEN) IRQ limits: 120 GSI, 2376 MSI/MSI-X
(XEN) Switched to APIC driver x2apic_cluster
(XEN) CPU0: TSC: ratio: 292 / 2
(XEN) CPU0: bus: 100 MHz base: 3500 MHz max: 4500 MHz
(XEN) CPU0: 800 ... 3500 MHz
(XEN) xstate: size: 0x440 and states: 0x1f
(XEN) CPU0: Intel machine check reporting enabled
(XEN) Speculative mitigation facilities:
(XEN)   Hardware features: IBRS/IBPB STIBP L1D_FLUSH SSBD MD_CLEAR
(XEN)   Compiled-in support: INDIRECT_THUNK SHADOW_PAGING
(XEN)   Xen settings: BTI-Thunk JMP, SPEC_CTRL: IBRS+ SSBD-, Other: IBPB L1D_FLUSH VERW BRANCH_HARDEN
(XEN)   L1TF: believed vulnerable, maxphysaddr L1D 46, CPUID 39, Safe address 8000000000
(XEN)   Support for HVM VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN)   Support for PV VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled (with PCID)
(XEN)   PV L1TF shadowing: Dom0 disabled, DomU enabled
(XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Initializing Credit2 scheduler
(XEN)  load_precision_shift: 18
(XEN)  load_window_shift: 30
(XEN)  underload_balance_tolerance: 0
(XEN)  overload_balance_tolerance: -3
(XEN)  runqueues arrangement: socket
(XEN)  cap enforcement granularity: 10ms
(XEN) load tracking window length 1073741824 ns
(XEN) Platform timer is 24.000MHz HPET
(XEN) Detected 3504.520 MHz processor.
(XEN) alt table ffff82d040439290 -> ffff82d040444238
(XEN) Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d Snoop Control not enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping enabled.
(XEN) Intel VT-d Posted Interrupt not enabled.
(XEN) Intel VT-d Shared EPT tables enabled.
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) nr_sockets: 1
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using old ACK method
(XEN) ..TIMER: vector=0xF0 apic1=0 pin1=2 apic2=-1 pin2=-1
(XEN) TSC deadline timer enabled
(XEN) Allocated console ring of 128 KiB.
(XEN) mwait-idle: MWAIT substates: 0x11142120
(XEN) mwait-idle: v0.4.1 model 0x9e
(XEN) mwait-idle: lapic_timer_reliable_states 0xffffffff
(XEN) VMX: Supported advanced features:
(XEN)  - APIC MMIO access virtualisation
(XEN)  - APIC TPR shadow
(XEN)  - Extended Page Tables (EPT)
(XEN)  - Virtual-Processor Identifiers (VPID)
(XEN)  - Virtual NMI
(XEN)  - MSR direct-access bitmap
(XEN)  - Unrestricted Guest
(XEN)  - VMCS shadowing
(XEN)  - VM Functions
(XEN)  - Virtualisation Exceptions
(XEN)  - Page Modification Logging
(XEN) HVM: ASIDs enabled.
(XEN) VMX: Disabling executable EPT superpages due to CVE-2018-12207
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging (HAP) detected
(XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
(XEN) alt table ffff82d040439290 -> ffff82d040444238
(XEN) Brought up 12 CPUs
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Adding cpu 0 to runqueue 0
(XEN)  First cpu on runqueue, activating
(XEN) Adding cpu 1 to runqueue 0
(XEN) Adding cpu 2 to runqueue 0
(XEN) Adding cpu 3 to runqueue 0
(XEN) Adding cpu 4 to runqueue 0
(XEN) Adding cpu 5 to runqueue 0
(XEN) Adding cpu 6 to runqueue 0
(XEN) Adding cpu 7 to runqueue 0
(XEN) Adding cpu 8 to runqueue 0
(XEN) Adding cpu 9 to runqueue 0
(XEN) Adding cpu 10 to runqueue 0
(XEN) Adding cpu 11 to runqueue 0
(XEN) mcheck_poll: Machine check polling timer started.
(XEN) NX (Execute Disable) protection active
(XEN) Dom0 has maximum 952 PIRQs
(XEN) *** Building a PV Dom0 ***
(XEN)  Xen  kernel: 64-bit, lsb, compat32
(XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x2e2c000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000000850000000->0000000854000000 (504314 pages to be allocated)
(XEN)  Init. ramdisk: 000000086d9fa000->000000086e7ff3f6
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff81000000->ffffffff82e2c000
(XEN)  Init. ramdisk: 0000000000000000->0000000000000000
(XEN)  Phys-Mach map: 0000008000000000->0000008000400000
(XEN)  Start info:    ffffffff82e2c000->ffffffff82e2c4b8
(XEN)  Xenstore ring: 0000000000000000->0000000000000000
(XEN)  Console ring:  0000000000000000->0000000000000000
(XEN)  Page tables:   ffffffff82e2d000->ffffffff82e48000
(XEN)  Boot stack:    ffffffff82e48000->ffffffff82e49000
(XEN)  TOTAL:         ffffffff80000000->ffffffff83000000
(XEN)  ENTRY ADDRESS: ffffffff82ab6160
(XEN) Dom0 has maximum 4 VCPUs
(XEN) [VT-D]iommu_enable_translation: iommu->reg = ffff82c00021d000
(XEN) [VT-D]iommu_enable_translation: iommu->reg = ffff82c00021f000
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: All
(XEN) Guest Loglevel: All
(XEN) ***************************************************
(XEN) Booted on L1TF-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Please assess your configuration and choose an
(XEN) explicit 'smt=<bool>' setting.  See XSA-273.
(XEN) ***************************************************
(XEN) Booted on MLPDS/MFBDS-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Mitigations will not be fully effective.  Please
(XEN) choose an explicit smt=<bool> setting.  See XSA-297.
(XEN) ***************************************************
(XEN) 3... 2... 1... 
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 552kB init memory
(XEN) PCI add device 0000:00:00.0
(XEN) PCI add device 0000:00:01.0
(XEN) PCI add device 0000:00:02.0
(XEN) PCI add device 0000:00:12.0
(XEN) PCI add device 0000:00:14.0
(XEN) PCI add device 0000:00:14.2
(XEN) PCI add device 0000:00:16.0
(XEN) PCI add device 0000:00:16.3
(XEN) PCI add device 0000:00:17.0
(XEN) PCI add device 0000:00:1b.0
(XEN) PCI add device 0000:00:1c.0
(XEN) PCI add device 0000:00:1c.5
(XEN) PCI add device 0000:00:1d.0
(XEN) PCI add device 0000:00:1f.0
(XEN) PCI add device 0000:00:1f.4
(XEN) PCI add device 0000:00:1f.5
(XEN) PCI add device 0000:01:00.0
(XEN) PCI add device 0000:04:00.0
(XEN) PCI add device 0000:05:00.0
(XEN) HVM d1v0 save: CPU
(XEN) HVM d1v1 save: CPU
(XEN) HVM d1 save: PIC
(XEN) HVM d1 save: IOAPIC
(XEN) HVM d1v0 save: LAPIC
(XEN) HVM d1v1 save: LAPIC
(XEN) HVM d1v0 save: LAPIC_REGS
(XEN) HVM d1v1 save: LAPIC_REGS
(XEN) HVM d1 save: PCI_IRQ
(XEN) HVM d1 save: ISA_IRQ
(XEN) HVM d1 save: PCI_LINK
(XEN) HVM d1 save: PIT
(XEN) HVM d1 save: RTC
(XEN) HVM d1 save: HPET
(XEN) HVM d1 save: PMTIMER
(XEN) HVM d1v0 save: MTRR
(XEN) HVM d1v1 save: MTRR
(XEN) HVM d1 save: VIRIDIAN_DOMAIN
(XEN) HVM d1v0 save: CPU_XSAVE
(XEN) HVM d1v1 save: CPU_XSAVE
(XEN) HVM d1v0 save: VIRIDIAN_VCPU
(XEN) HVM d1v1 save: VIRIDIAN_VCPU
(XEN) HVM d1v0 save: VMCE_VCPU
(XEN) HVM d1v1 save: VMCE_VCPU
(XEN) HVM d1v0 save: TSC_ADJUST
(XEN) HVM d1v1 save: TSC_ADJUST
(XEN) HVM d1v0 save: CPU_MSR
(XEN) HVM d1v1 save: CPU_MSR
(XEN) HVM1 restore: CPU 0
(XEN) d1: bind: m_gsi=16 g_gsi=36 dev=00.00.5 intx=0
(d1) HVM Loader
(d1) Detected Xen v4.14.3
(d1) Xenbus rings @0xfeffc000, event channel 1
(d1) System requested SeaBIOS
(d1) CPU speed is 3505 MHz
(d1) Relocating guest memory for lowmem MMIO space disabled
(d1) PCI-ISA link 0 routed to IRQ5
(d1) PCI-ISA link 1 routed to IRQ10
(d1) PCI-ISA link 2 routed to IRQ11
(d1) PCI-ISA link 3 routed to IRQ5
(d1) pci dev 01:3 INTA->IRQ10
(d1) pci dev 02:0 INTA->IRQ11
(d1) pci dev 04:0 INTA->IRQ5
(d1) pci dev 05:0 INTA->IRQ10
(d1) RAM in high memory; setting high_mem resource base to 40f800000
(d1) pci dev 03:0 bar 10 size 002000000: 0f0000008
(d1) pci dev 02:0 bar 14 size 001000000: 0f2000008
(d1) pci dev 04:0 bar 30 size 000040000: 0f3000000
(d1) pci dev 04:0 bar 10 size 000020000: 0f3040000
(d1) pci dev 03:0 bar 30 size 000010000: 0f3060000
(d1) pci dev 05:0 bar 10 size 000002000: 0f3070000
(d1) pci dev 03:0 bar 14 size 000001000: 0f3072000
(d1) pci dev 05:0 bar 24 size 000000800: 0f3073000
(d1) pci dev 02:0 bar 10 size 000000100: 00000c001
(d1) pci dev 05:0 bar 14 size 000000100: 0f3073800
(d1) pci dev 04:0 bar 14 size 000000040: 00000c101
(d1) pci dev 05:0 bar 20 size 000000020: 00000c141
(d1) pci dev 01:1 bar 20 size 000000010: 00000c161
(d1) pci dev 05:0 bar 18 size 000000008: 00000c171
(d1) pci dev 05:0 bar 1c size 000000004: 00000c179
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2616 nr=1
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
(d1) Multiprocessor initialisation:
(d1)  - CPU0 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d1)  - CPU1 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d1) Writing SMBIOS tables ...
(d1) Loading SeaBIOS ...
(d1) Creating MP tables ...
(d1) Loading ACPI ...
(d1) vm86 TSS at fc100280
(d1) BIOS map:
(d1)  10000-100e3: Scratch space
(d1)  c0000-fffff: Main BIOS
(d1) E820 table:
(d1)  [00]: 00000000:00000000 - 00000000:000a0000: RAM
(d1)  HOLE: 00000000:000a0000 - 00000000:000c0000
(d1)  [01]: 00000000:000c0000 - 00000000:00100000: RESERVED
(d1)  [02]: 00000000:00100000 - 00000000:f0000000: RAM
(d1)  HOLE: 00000000:f0000000 - 00000000:fc000000
(d1)  [03]: 00000000:fc000000 - 00000000:fc00b000: NVS
(d1)  [04]: 00000000:fc00b000 - 00000001:00000000: RESERVED
(d1)  [05]: 00000001:00000000 - 00000004:0f800000: RAM
(d1) Invoking SeaBIOS ...
(d1) SeaBIOS (version rel-1.13.0-1-gd542924-Xen)
(d1) BUILD: gcc: (Debian 11.2.0-13) 11.2.0 binutils: (GNU Binutils for Debian) 2.37
(d1) 
(d1) Found Xen hypervisor signature at 40000000
(XEN) memory_map:remove: dom1 gfn=f3070 mfn=a2610 nr=2
(XEN) memory_map:remove: dom1 gfn=f3073 mfn=a2615 nr=1
(XEN) memory_map:remove: dom1 gfn=f3074 mfn=a2616 nr=1
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2616 nr=1
(XEN) memory_map:remove: dom1 gfn=f3070 mfn=a2610 nr=2
(XEN) memory_map:remove: dom1 gfn=f3073 mfn=a2615 nr=1
(XEN) memory_map:remove: dom1 gfn=f3074 mfn=a2616 nr=1
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2616 nr=1
(XEN) ioport_map:remove: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:remove: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:remove: dom1 gport=c178 mport=4080 nr=4
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
(XEN) ioport_map:remove: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:remove: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:remove: dom1 gport=c178 mport=4080 nr=4
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
(XEN) ioport_map:remove: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:remove: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:remove: dom1 gport=c178 mport=4080 nr=4
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
(XEN) memory_map:remove: dom1 gfn=f3070 mfn=a2610 nr=2
(XEN) memory_map:remove: dom1 gfn=f3073 mfn=a2615 nr=1
(XEN) memory_map:remove: dom1 gfn=f3074 mfn=a2616 nr=1
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2616 nr=1
(XEN) d[IO]: assign (0000:05:00.0) failed (-16)
(XEN) HVM d5v0 save: CPU
(XEN) HVM d5v1 save: CPU
(XEN) HVM d5v2 save: CPU
(XEN) HVM d5v3 save: CPU
(XEN) HVM d5 save: PIC
(XEN) HVM d5 save: IOAPIC
(XEN) HVM d5v0 save: LAPIC
(XEN) HVM d5v1 save: LAPIC
(XEN) HVM d5v2 save: LAPIC
(XEN) HVM d5v3 save: LAPIC
(XEN) HVM d5v0 save: LAPIC_REGS
(XEN) HVM d5v1 save: LAPIC_REGS
(XEN) HVM d5v2 save: LAPIC_REGS
(XEN) HVM d5v3 save: LAPIC_REGS
(XEN) HVM d5 save: PCI_IRQ
(XEN) HVM d5 save: ISA_IRQ
(XEN) HVM d5 save: PCI_LINK
(XEN) HVM d5 save: PIT
(XEN) HVM d5 save: RTC
(XEN) HVM d5 save: HPET
(XEN) HVM d5 save: PMTIMER
(XEN) HVM d5v0 save: MTRR
(XEN) HVM d5v1 save: MTRR
(XEN) HVM d5v2 save: MTRR
(XEN) HVM d5v3 save: MTRR
(XEN) HVM d5 save: VIRIDIAN_DOMAIN
(XEN) HVM d5v0 save: CPU_XSAVE
(XEN) HVM d5v1 save: CPU_XSAVE
(XEN) HVM d5v2 save: CPU_XSAVE
(XEN) HVM d5v3 save: CPU_XSAVE
(XEN) HVM d5v0 save: VIRIDIAN_VCPU
(XEN) HVM d5v1 save: VIRIDIAN_VCPU
(XEN) HVM d5v2 save: VIRIDIAN_VCPU
(XEN) HVM d5v3 save: VIRIDIAN_VCPU
(XEN) HVM d5v0 save: VMCE_VCPU
(XEN) HVM d5v1 save: VMCE_VCPU
(XEN) HVM d5v2 save: VMCE_VCPU
(XEN) HVM d5v3 save: VMCE_VCPU
(XEN) HVM d5v0 save: TSC_ADJUST
(XEN) HVM d5v1 save: TSC_ADJUST
(XEN) HVM d5v2 save: TSC_ADJUST
(XEN) HVM d5v3 save: TSC_ADJUST
(XEN) HVM d5v0 save: CPU_MSR
(XEN) HVM d5v1 save: CPU_MSR
(XEN) HVM d5v2 save: CPU_MSR
(XEN) HVM d5v3 save: CPU_MSR
(XEN) HVM5 restore: CPU 0
(XEN) d5: bind: m_gsi=16 g_gsi=36 dev=00.00.5 intx=0
(d5) HVM Loader
(d5) Detected Xen v4.14.3
(d5) Xenbus rings @0xfeffc000, event channel 1
(d5) System requested SeaBIOS
(d5) CPU speed is 3505 MHz
(d5) Relocating guest memory for lowmem MMIO space disabled
(d5) PCI-ISA link 0 routed to IRQ5
(d5) PCI-ISA link 1 routed to IRQ10
(d5) PCI-ISA link 2 routed to IRQ11
(d5) PCI-ISA link 3 routed to IRQ5
(d5) pci dev 01:3 INTA->IRQ10
(d5) pci dev 02:0 INTA->IRQ11
(d5) pci dev 04:0 INTA->IRQ5
(d5) pci dev 05:0 INTA->IRQ10
(d5) No RAM in high memory; setting high_mem resource base to 100000000
(d5) pci dev 03:0 bar 10 size 002000000: 0f0000008
(d5) pci dev 02:0 bar 14 size 001000000: 0f2000008
(d5) pci dev 04:0 bar 30 size 000040000: 0f3000000
(d5) pci dev 04:0 bar 10 size 000020000: 0f3040000
(d5) pci dev 03:0 bar 30 size 000010000: 0f3060000
(d5) pci dev 05:0 bar 10 size 000004000: 0f3070004
(d5) pci dev 03:0 bar 14 size 000001000: 0f3074000
(d5) pci dev 02:0 bar 10 size 000000100: 00000c001
(d5) pci dev 05:0 bar 20 size 000000100: 0f3075004
(d5) pci dev 04:0 bar 14 size 000000040: 00000c101
(d5) pci dev 01:1 bar 20 size 000000010: 00000c141
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1
(d5) Multiprocessor initialisation:
(d5)  - CPU0 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d5)  - CPU1 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d5)  - CPU2 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d5)  - CPU3 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d5) Writing SMBIOS tables ...
(d5) Loading SeaBIOS ...
(d5) Creating MP tables ...
(d5) Loading ACPI ...
(d5) vm86 TSS at fc100300
(d5) BIOS map:
(d5)  10000-100e3: Scratch space
(d5)  c0000-fffff: Main BIOS
(d5) E820 table:
(d5)  [00]: 00000000:00000000 - 00000000:000a0000: RAM
(d5)  HOLE: 00000000:000a0000 - 00000000:000c0000
(d5)  [01]: 00000000:000c0000 - 00000000:00100000: RESERVED
(d5)  [02]: 00000000:00100000 - 00000000:7f800000: RAM
(d5)  HOLE: 00000000:7f800000 - 00000000:fc000000
(d5)  [03]: 00000000:fc000000 - 00000000:fc00b000: NVS
(d5)  [04]: 00000000:fc00b000 - 00000001:00000000: RESERVED
(d5) Invoking SeaBIOS ...
(d5) SeaBIOS (version rel-1.13.0-1-gd542924-Xen)
(d5) BUILD: gcc: (Debian 11.2.0-13) 11.2.0 binutils: (GNU Binutils for Debian) 2.37
(d5) 
(d5) Found Xen hypervisor signature at 40000000
(XEN) memory_map:remove: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:remove: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:remove: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:remove: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom5 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom5 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom5 gfn=f3075 mfn=a2504 nr=1
(XEN) HVM d6v0 save: CPU
(XEN) HVM d6v1 save: CPU
(XEN) HVM d6v2 save: CPU
(XEN) HVM d6v3 save: CPU
(XEN) HVM d6 save: PIC
(XEN) HVM d6 save: IOAPIC
(XEN) HVM d6v0 save: LAPIC
(XEN) HVM d6v1 save: LAPIC
(XEN) HVM d6v2 save: LAPIC
(XEN) HVM d6v3 save: LAPIC
(XEN) HVM d6v0 save: LAPIC_REGS
(XEN) HVM d6v1 save: LAPIC_REGS
(XEN) HVM d6v2 save: LAPIC_REGS
(XEN) HVM d6v3 save: LAPIC_REGS
(XEN) HVM d6 save: PCI_IRQ
(XEN) HVM d6 save: ISA_IRQ
(XEN) HVM d6 save: PCI_LINK
(XEN) HVM d6 save: PIT
(XEN) HVM d6 save: RTC
(XEN) HVM d6 save: HPET
(XEN) HVM d6 save: PMTIMER
(XEN) HVM d6v0 save: MTRR
(XEN) HVM d6v1 save: MTRR
(XEN) HVM d6v2 save: MTRR
(XEN) HVM d6v3 save: MTRR
(XEN) HVM d6 save: VIRIDIAN_DOMAIN
(XEN) HVM d6v0 save: CPU_XSAVE
(XEN) HVM d6v1 save: CPU_XSAVE
(XEN) HVM d6v2 save: CPU_XSAVE
(XEN) HVM d6v3 save: CPU_XSAVE
(XEN) HVM d6v0 save: VIRIDIAN_VCPU
(XEN) HVM d6v1 save: VIRIDIAN_VCPU
(XEN) HVM d6v2 save: VIRIDIAN_VCPU
(XEN) HVM d6v3 save: VIRIDIAN_VCPU
(XEN) HVM d6v0 save: VMCE_VCPU
(XEN) HVM d6v1 save: VMCE_VCPU
(XEN) HVM d6v2 save: VMCE_VCPU
(XEN) HVM d6v3 save: VMCE_VCPU
(XEN) HVM d6v0 save: TSC_ADJUST
(XEN) HVM d6v1 save: TSC_ADJUST
(XEN) HVM d6v2 save: TSC_ADJUST
(XEN) HVM d6v3 save: TSC_ADJUST
(XEN) HVM d6v0 save: CPU_MSR
(XEN) HVM d6v1 save: CPU_MSR
(XEN) HVM d6v2 save: CPU_MSR
(XEN) HVM d6v3 save: CPU_MSR
(XEN) HVM6 restore: CPU 0
(XEN) d6: bind: m_gsi=16 g_gsi=36 dev=00.00.5 intx=0
(d6) HVM Loader
(d6) Detected Xen v4.14.3
(d6) Xenbus rings @0xfeffc000, event channel 1
(d6) System requested SeaBIOS
(d6) CPU speed is 3505 MHz
(d6) Relocating guest memory for lowmem MMIO space disabled
(d6) PCI-ISA link 0 routed to IRQ5
(d6) PCI-ISA link 1 routed to IRQ10
(d6) PCI-ISA link 2 routed to IRQ11
(d6) PCI-ISA link 3 routed to IRQ5
(d6) pci dev 01:3 INTA->IRQ10
(d6) pci dev 02:0 INTA->IRQ11
(d6) pci dev 04:0 INTA->IRQ5
(d6) pci dev 05:0 INTA->IRQ10
(d6) No RAM in high memory; setting high_mem resource base to 100000000
(d6) pci dev 03:0 bar 10 size 002000000: 0f0000008
(d6) pci dev 02:0 bar 14 size 001000000: 0f2000008
(d6) pci dev 04:0 bar 30 size 000040000: 0f3000000
(d6) pci dev 04:0 bar 10 size 000020000: 0f3040000
(d6) pci dev 03:0 bar 30 size 000010000: 0f3060000
(d6) pci dev 05:0 bar 10 size 000004000: 0f3070004
(d6) pci dev 03:0 bar 14 size 000001000: 0f3074000
(d6) pci dev 02:0 bar 10 size 000000100: 00000c001
(d6) pci dev 05:0 bar 20 size 000000100: 0f3075004
(d6) pci dev 04:0 bar 14 size 000000040: 00000c101
(d6) pci dev 01:1 bar 20 size 000000010: 00000c141
(XEN) memory_map:add: dom6 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom6 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom6 gfn=f3075 mfn=a2504 nr=1
(d6) Multiprocessor initialisation:
(d6)  - CPU0 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d6)  - CPU1 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d6)  - CPU2 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d6)  - CPU3 ... 39-bit phys ... fixed MTRRs ... var MTRRs [1/8] ... done.
(d6) Writing SMBIOS tables ...
(d6) Loading SeaBIOS ...
(d6) Creating MP tables ...
(d6) Loading ACPI ...
(d6) vm86 TSS at fc100300
(d6) BIOS map:
(d6)  10000-100e3: Scratch space
(d6)  c0000-fffff: Main BIOS
(d6) E820 table:
(d6)  [00]: 00000000:00000000 - 00000000:000a0000: RAM
(d6)  HOLE: 00000000:000a0000 - 00000000:000c0000
(d6)  [01]: 00000000:000c0000 - 00000000:00100000: RESERVED
(d6)  [02]: 00000000:00100000 - 00000000:7f800000: RAM
(d6)  HOLE: 00000000:7f800000 - 00000000:fc000000
(d6)  [03]: 00000000:fc000000 - 00000000:fc00b000: NVS
(d6)  [04]: 00000000:fc00b000 - 00000001:00000000: RESERVED
(d6) Invoking SeaBIOS ...
(d6) SeaBIOS (version rel-1.13.0-1-gd542924-Xen)
(d6) BUILD: gcc: (Debian 11.2.0-13) 11.2.0 binutils: (GNU Binutils for Debian) 2.37
(d6) 
(d6) Found Xen hypervisor signature at 40000000
(XEN) memory_map:remove: dom6 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom6 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom6 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom6 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom6 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom6 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:remove: dom6 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom6 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom6 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom6 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom6 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom6 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:remove: dom6 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom6 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom6 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom6 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom6 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom6 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:remove: dom6 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:remove: dom6 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:remove: dom6 gfn=f3075 mfn=a2504 nr=1
(XEN) memory_map:add: dom6 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom6 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom6 gfn=f3075 mfn=a2504 nr=1

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-05 11:31                   ` G.R.
@ 2022-07-05 11:59                     ` Jan Beulich
  2022-07-06  6:25                       ` G.R.
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2022-07-05 11:59 UTC (permalink / raw)
  To: G.R.; +Cc: xen-devel, Roger Pau Monné

On 05.07.2022 13:31, G.R. wrote:
> On Tue, Jul 5, 2022 at 5:04 PM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 04.07.2022 18:31, G.R. wrote:
>>> On Tue, Jul 5, 2022 at 12:21 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>>>>> I retried with the following:
>>>>> pci=['05:00.0,permissive=1,msitranslate=1']
>>>>> Those extra options suppressed some error logging, but still didn't
>>>>> make the device usable to the domU.
>>>>> The nvmecontrol command still get ABORTED result from the kernel...
>>>>>
>>>>> The only thing remained in the QEMU file is this one:
>>>>> [00:05.0] msi_msix_setup: Error: Mapping of MSI-X (err: 61, vec: 0x30, entry 0)
>>>>
>>>> Hm it seems like Xen doesn't find the position of the MSI-X table
>>>> correctly, given there's only one error path from msi.c returning
>>>> -ENODATA (61).
>>>>
>>>> Are there errors from pciback when this happens?  I would expect the
>>>> call to pci_prepare_msix() from pciback to fail and thus also report
>>>> some error?
>>>>
>>>> I think it's likely I will have to provide an additional debug patch
>>>> to Xen, maybe Jan has an idea of what could be going on.
>>>>
>>> pciback reports the same MSI-x related error.
>>> But even with DEBUG enabled, I didn't see more context reported.
>>> Please find details from the attachment.
>>
>> And nothing pertinent in "xl dmesg"? Looking back through the thread I
>> couldn't spot a complete hypervisor log (i.e. from boot to assignment
>> attempt). An issue with MSI-X table determination, as Roger suspects,
>> would typically be associated with a prominent warning emitted to the
>> log. But there are also further possible sources of -ENXIO, which
>> would go silently.
> Please find the xl dmesg in the attachment.
> It's with the two FreeBSD domU attempts so it should have captured
> some culprits if there is any...

Nothing useful in there. Yet independent of that I guess we need to
separate the issues you're seeing. Otherwise it'll be impossible to
know what piece of data belongs where.

Jan



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-04 15:57             ` Roger Pau Monné
@ 2022-07-05 18:06               ` Jason Andryuk
  0 siblings, 0 replies; 31+ messages in thread
From: Jason Andryuk @ 2022-07-05 18:06 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: G.R., xen-devel

On Mon, Jul 4, 2022 at 11:57 AM Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> On Mon, Jul 04, 2022 at 11:44:14PM +0800, G.R. wrote:
> > On Mon, Jul 4, 2022 at 11:33 PM Roger Pau Monné <roger.pau@citrix.com> wrote:
> > >
> > > Right, so hvmloader attempts to place a BAR from 05:00.0 and a BAR
> > > from 00:17.0 into the same page, which is not that good behavior.  It
> > > might be sensible to attempt to share the page if both BARs belong to
> > > the same device, but not if they belong to different devices.
> > >
> > > I think the following patch:
> > >
> > > https://lore.kernel.org/xen-devel/20200117110811.43321-1-roger.pau@citrix.com/
> > >
> > > Might help with this.
> > >
> > > Thanks, Roger.
> > I suppose this patch has been released in a newer XEN version that I
> > can pick up if I decide to upgrade?
> > Which version would it be?
> >
> > On the other hand, according to the other experiment I did, this may
> > not be the only issue related to this device.
> > Still not sure if the device or the SW stack is faulty this time...
>
> I don't think this patch has been applied to any release, adding Jason
> who I think was also interested in the fix and might provide more
> info.

Roger wrote the above patch after I tried to upstream a Qubes QEMU
patch: https://lore.kernel.org/xen-devel/20190311180216.18811-7-jandryuk@gmail.com/.
The patch rounded up BAR sizes for passed-through devices, which
ensured they couldn't share a page.  But Roger rightfully pointed out
that changing the BAR size is incorrect, and that hvmloader could just
enforce a minimum alignment.  However, nothing prevents a guest from
relocating the BARs again.
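
To illustrate the difference (a rough sketch with made-up names, not
the actual hvmloader code): enforcing a minimum alignment leaves the
BAR size alone and only changes where the BAR is *placed*, e.g.:

```c
#include <stdint.h>

#define PAGE_SIZE 0x1000u

/* Hypothetical placement helper: BARs are naturally size-aligned, but
 * when starting on a different device we additionally round the
 * allocation cursor up to a page boundary, so that BARs of two
 * different devices can never end up in the same 4k page. */
static uint64_t place_bar(uint64_t cursor, uint64_t bar_size, int new_device)
{
    uint64_t align = bar_size;

    if (new_device && align < PAGE_SIZE)
        align = PAGE_SIZE;      /* force a fresh page for a new device */

    return (cursor + align - 1) & ~(align - 1);
}
```

With that, a small BAR of a new device (like the 0x800-byte BAR of
05:00.0 in the log above) lands on its own page instead of sharing one
with the previous device's BAR.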

I tested Roger's patch, but Qubes and OpenXT have kept using the QEMU
patch.  When I added the QEMU patch to OpenXT, I wrote in the commit
message that it fixed probing of an e1000e NIC.


Regards,
Jason



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-05 11:59                     ` Jan Beulich
@ 2022-07-06  6:25                       ` G.R.
  2022-07-06  6:33                         ` Jan Beulich
  0 siblings, 1 reply; 31+ messages in thread
From: G.R. @ 2022-07-06  6:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 1681 bytes --]

On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@suse.com> wrote:
> Nothing useful in there. Yet independent of that I guess we need to
> separate the issues you're seeing. Otherwise it'll be impossible to
> know what piece of data belongs where.
Yep, I think I'm seeing several different issues here:
1. The FLR-related DPC / AER messages, seen only on the first attempt
when pciback tries to seize and release the SN570
    - Later-on pciback operations appear just fine.
2. The MSI-X preparation failure message that shows up each time the
SN570 is seized by pciback or when it's passed to the domU.
3. Xen tries to map BARs from two devices into the same page.
4. The "write-back to unknown field" message in the QEMU log that goes
away with the permissive=1 passthrough config.
5. The "irq 16: nobody cared" message that shows up *sometimes*, in a
pattern I haven't figured out. (See attached)
6. The FreeBSD domU sees the device but fails to use it because
low-level commands sent to it are aborted.
7. The device does not return to the pci-assignable list when the domU
it was assigned to shuts down. (See attached)

#3 appears to be a known issue that could be worked around with
patches from the list.
I suspect #1 may have something to do with the device itself. It's
still not clear whether it's deadly or just annoying.
I was able to update the firmware to the latest version and confirmed
that the new firmware didn't make any noticeable difference.

I suspect issues #2, #4, #5, #6 and #7 may be related, and that the
pass-through was not completely successful...
Should I expect a debug build of the Xen hypervisor to give better
diagnostic messages, even without the debug patch that Roger mentioned?
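
BTW, in case it helps interpret the MSI-X failures above: as far as I
understand, the table location that pciback/Xen try to determine comes
from the capability's Table and PBA registers, where the low 3 bits
select the BAR and the remaining bits are the byte offset. A purely
illustrative decode (not the actual Xen code) would be:

```c
#include <stdint.h>

/* Split an MSI-X Table or PBA register (the dwords at capability
 * offsets 4 and 8, per the PCI spec) into the BAR Indicator Register
 * (BIR, low 3 bits) and the byte offset within that BAR. */
static void decode_msix_reg(uint32_t reg, unsigned int *bir, uint32_t *off)
{
    *bir = reg & 0x7;
    *off = reg & ~0x7u;
}
```

So if Xen can't locate the table, either reading those registers
failed or the referenced BAR wasn't mapped as expected.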

Thanks,
Rui

[-- Attachment #2: dom0_dmsg_for_domu_shutdown.log --]
[-- Type: text/x-log, Size: 1414 bytes --]

[59213.312849] xenbr0: port 3(vif3.0) entered disabled state  //domU shutdown sequence start from here
[59215.247393] pciback 0000:05:00.0: xen_pciback: removing
[59215.247395] pciback 0000:05:00.0: xen_pciback: found device to remove
[59215.247396] pciback 0000:05:00.0: xen_pciback: pcistub_device_release
[59215.352893] pciback 0000:05:00.0: xen_pciback: MSI-X release failed (-16)
[59215.353199] xen_pciback: removed 0000:05:00.0 from seize list
[59216.474139] pciback 0000:05:00.0: xen_pciback: probing...
[59728.150053] xen_pciback: wants to seize 0000:05:00.0      //manual xl pci-assignable-add 05:00.0
[59728.150074] pciback 0000:05:00.0: xen_pciback: probing...
[59728.150075] pciback 0000:05:00.0: xen_pciback: seizing device
[59728.150076] pciback 0000:05:00.0: xen_pciback: pcistub_device_alloc
[59728.150076] pciback 0000:05:00.0: xen_pciback: initializing...
[59728.150077] pciback 0000:05:00.0: xen_pciback: initializing config
[59728.150165] pciback 0000:05:00.0: xen_pciback: enabling device
[59728.150247] xen: registering gsi 16 triggering 0 polarity 1
[59728.150250] Already setup the GSI :16
[59728.150293] pciback 0000:05:00.0: xen_pciback: MSI-X preparation failed (-6)
[59728.150582] pciback 0000:05:00.0: xen_pciback: save state of device
[59728.150731] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[59728.257558] pciback 0000:05:00.0: xen_pciback: reset device

[-- Attachment #3: bad_irq.log --]
[-- Type: text/x-log, Size: 2151 bytes --]

[ 3742.440487] irq 16: nobody cared (try booting with the "irqpoll" option)
[ 3742.440516] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.120.gaia.78.xenpcibackdbg #4
[ 3742.440516] Hardware name: Gigabyte Technology Co., Ltd. C246N-WU2/C246N-WU2-CF, BIOS F1 10/02/2019
[ 3742.440517] Call Trace:
[ 3742.440518]  <IRQ>
[ 3742.440522]  dump_stack+0x6b/0x83
[ 3742.440524]  __report_bad_irq+0x30/0xa2
[ 3742.440525]  note_interrupt.cold+0xb/0x61
[ 3742.440527]  handle_irq_event+0x9f/0xb0
[ 3742.440528]  handle_fasteoi_irq+0x73/0x1c0
[ 3742.440529]  generic_handle_irq+0x42/0x50
[ 3742.440531]  __evtchn_fifo_handle_events+0x155/0x170
[ 3742.440533]  __xen_evtchn_do_upcall+0x61/0xa0
[ 3742.440535]  __xen_pv_evtchn_do_upcall+0x11/0x20
[ 3742.440536]  asm_call_irq_on_stack+0x12/0x20
[ 3742.440537]  </IRQ>
[ 3742.440538]  xen_pv_evtchn_do_upcall+0xa2/0xc0
[ 3742.440539]  exc_xen_hypervisor_callback+0x8/0x10
[ 3742.440540] RIP: e030:xen_hypercall_sched_op+0xa/0x20
[ 3742.440542] Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[ 3742.440542] RSP: e02b:ffffffff82403de0 EFLAGS: 00000246
[ 3742.440543] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810023aa
[ 3742.440544] RDX: 0000000002d1a31a RSI: 0000000000000000 RDI: 0000000000000001
[ 3742.440544] RBP: ffffffff82415940 R08: 00000066a173b5fc R09: 000003676ebf842f
[ 3742.440545] R10: 00000000000340ee R11: 0000000000000246 R12: 0000000000000000
[ 3742.440545] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3742.440546]  ? xen_hypercall_sched_op+0xa/0x20
[ 3742.440548]  ? xen_safe_halt+0xc/0x20
[ 3742.440549]  ? default_idle+0x5/0x10
[ 3742.440550]  ? default_idle_call+0x33/0xc0
[ 3742.440551]  ? do_idle+0x1e9/0x260
[ 3742.440553]  ? cpu_startup_entry+0x14/0x20
[ 3742.440555]  ? start_kernel+0x503/0x526
[ 3742.440556]  ? xen_start_kernel+0x60f/0x61b
[ 3742.440556]  ? startup_xen+0x3e/0x3e
[ 3742.440557] handlers:
[ 3742.440570] [<000000008e20908e>] i801_isr [i2c_i801]
[ 3742.440585] Disabling IRQ #16



* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-06  6:25                       ` G.R.
@ 2022-07-06  6:33                         ` Jan Beulich
  2022-07-07 15:24                           ` G.R.
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2022-07-06  6:33 UTC (permalink / raw)
  To: G.R.; +Cc: xen-devel, Roger Pau Monné

On 06.07.2022 08:25, G.R. wrote:
> On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@suse.com> wrote:
>> Nothing useful in there. Yet independent of that I guess we need to
>> separate the issues you're seeing. Otherwise it'll be impossible to
>> know what piece of data belongs where.
> Yep, I think I'm seeing several different issues here:
> 1. The FLR related DPC / AER message seen on the 1st attempt only when
> pciback tries to seize and release the SN570
>     - Later-on pciback operations appear just fine.
> 2. MSI-X preparation failure message that shows up each time the SN570
> is seized by pciback or when it's passed to domU.
> 3. XEN tries to map BAR from two devices to the same page
> 4. The "write-back to unknown field" message in QEMU log that goes
> away with permissive=1 passthrough config.
> 5. The "irq 16: nobody cared" message shows up *sometimes* in a
> pattern that I haven't figured out  (See attached)
> 6. The FreeBSD domU sees the device but fails to use it because low
> level commands sent to it are aborted.
> 7. The device does not return to the pci-assignable-list when the domU
it was assigned shuts down. (See attached)
> 
> #3 appears to be a known issue that could be worked around with
> patches from the list.
> I suspect #1 may have something to do with the device itself. It's
> still not clear if it's deadly or just annoying.
> I was able to update the firmware to the latest version and confirmed
> that the new firmware didn't make any noticeable difference.
> 
> I suspect issue #2, #4, #5, #6, #7 may be related, and the
> pass-through was not completely successful...
> 
> Should I expect a debug build of XEN hypervisor to give better
> diagnose messages, without the debug patch that Roger mentioned?

Well, "expect" is perhaps too much to say, but with problems like
yours (and even more so with multiple ones) using a debug
hypervisor (or kernel, if there such a build mode existed) is imo
always a good idea. As is using as up-to-date a version as
possible.

Jan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-06  6:33                         ` Jan Beulich
@ 2022-07-07 15:24                           ` G.R.
  2022-07-07 15:36                             ` G.R.
  2022-07-07 16:23                             ` Jan Beulich
  0 siblings, 2 replies; 31+ messages in thread
From: G.R. @ 2022-07-07 15:24 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Roger Pau Monné

On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 06.07.2022 08:25, G.R. wrote:
> > On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@suse.com> wrote:
> >> Nothing useful in there. Yet independent of that I guess we need to
> >> separate the issues you're seeing. Otherwise it'll be impossible to
> >> know what piece of data belongs where.
> > Yep, I think I'm seeing several different issues here:
> > 1. The FLR related DPC / AER message seen on the 1st attempt only when
> > pciback tries to seize and release the SN570
> >     - Later-on pciback operations appear just fine.
> > 2. MSI-X preparation failure message that shows up each time the SN570
> > is seized by pciback or when it's passed to domU.
> > 3. XEN tries to map BAR from two devices to the same page
> > 4. The "write-back to unknown field" message in QEMU log that goes
> > away with permissive=1 passthrough config.
> > 5. The "irq 16: nobody cared" message shows up *sometimes* in a
> > pattern that I haven't figured out  (See attached)
> > 6. The FreeBSD domU sees the device but fails to use it because low
> > level commands sent to it are aborted.
> > 7. The device does not return to the pci-assignable-list when the domU
> > it was assigned shuts down. (See attached)
> >
> > #3 appears to be a known issue that could be worked around with
> > patches from the list.
> > I suspect #1 may have something to do with the device itself. It's
> > still not clear if it's deadly or just annoying.
> > I was able to update the firmware to the latest version and confirmed
> > that the new firmware didn't make any noticeable difference.
> >
> > I suspect issue #2, #4, #5, #6, #7 may be related, and the
> > pass-through was not completely successful...
> >
> > Should I expect a debug build of XEN hypervisor to give better
> > diagnose messages, without the debug patch that Roger mentioned?
>
> Well, "expect" is perhaps too much to say, but with problems like
> yours (and even more so with multiple ones) using a debug
> hypervisor (or kernel, if there such a build mode existed) is imo
> always a good idea. As is using as up-to-date a version as
> possible.

I built both 4.14.3 debug version and 4.16.1 release version for
testing purposes.
Unfortunately they gave me absolutely zero information, since neither
of them is able to get past issue #1,
the FLR-related DPC / AER issue.
With 4.16.1 release, it actually can survive the 'xl
pci-assignable-add' which triggers the first AER failure.
But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
>[  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
>[  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
Since I'll need a couple of pci-assignable-add &&
pci-assignable-remove to get to a seemingly normal state, I cannot
proceed from here.

With 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
pci-assignable-add'.

[  574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
etc) the device
[  574.623203] pcieport 0000:00:1d.0: DPC: containment event,
status:0x1f11 source:0x0000
[  574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
[  574.623209] pcieport 0000:00:1d.0: PCIe Bus Error:
severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
ID)
[  574.623240] pcieport 0000:00:1d.0:   device [8086:a330] error
status/mask=00200000/00010000
[  574.623261] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
[  575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
[  576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
[  579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
[  583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
[  591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
[  609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
[  643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
//<=======The reboot happens somewhere here, not immediately, but
after a while...
//Maybe I can get something from xl dmesg if I was quick enough and
have connected from a second terminal...
[  644.773922] pciback 0000:05:00.0: xen_pciback: reset device
[  644.774050] pciback 0000:05:00.0: xen_pciback:
xen_pcibk_error_detected(bus:5,devfn:0)
[  644.774051] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  644.923432] pciback 0000:05:00.0: xen_pciback:
xen_pcibk_error_resume(bus:5,devfn:0)
[  644.923437] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
[  644.923616] pcieport 0000:00:1d.0: AER: device recovery successful



>
> Jan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-07 15:24                           ` G.R.
@ 2022-07-07 15:36                             ` G.R.
  2022-07-07 16:18                               ` Jan Beulich
  2022-07-07 16:23                             ` Jan Beulich
  1 sibling, 1 reply; 31+ messages in thread
From: G.R. @ 2022-07-07 15:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Roger Pau Monné

On Thu, Jul 7, 2022 at 11:24 PM G.R. <firemeteor@users.sourceforge.net> wrote:
>
> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@suse.com> wrote:
> >
> > > Should I expect a debug build of XEN hypervisor to give better
> > > diagnose messages, without the debug patch that Roger mentioned?
> >
> > Well, "expect" is perhaps too much to say, but with problems like
> > yours (and even more so with multiple ones) using a debug
> > hypervisor (or kernel, if there such a build mode existed) is imo
> > always a good idea. As is using as up-to-date a version as
> > possible.
>
> I built both 4.14.3 debug version and 4.16.1 release version for
> testing purposes.
> Unfortunately they gave me absolutely zero information, since neither
> of them is able to get past issue #1,
> the FLR-related DPC / AER issue.
> With 4.16.1 release, it actually can survive the 'xl
> pci-assignable-add' which triggers the first AER failure.
> But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
> >[  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
> >[  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
> Since I'll need a couple of pci-assignable-add &&
> pci-assignable-remove to get to a seemingly normal state, I cannot
> proceed from here.
>
> With 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
> pci-assignable-add'.
>
> [  574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
> etc) the device
> [  574.623203] pcieport 0000:00:1d.0: DPC: containment event,
> status:0x1f11 source:0x0000
> [  574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
> [  574.623209] pcieport 0000:00:1d.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
> ID)
> [  574.623240] pcieport 0000:00:1d.0:   device [8086:a330] error
> status/mask=00200000/00010000
> [  574.623261] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
> [  575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
> [  576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
> [  579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
> [  583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
> [  591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
> [  609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
> [  643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
> //<=======The reboot happens somewhere here, not immediately, but
> after a while...
> //Maybe I can get something from xl dmesg if I was quick enough and
> have connected from a second terminal...

Unfortunately I didn't see anything from xl dmesg...
I wish 'xl dmesg' supported the follow mode (dmesg -w) that the
Linux dmesg does.
Here I have to repeat the command manually. The machine suddenly
freezes after the 'giving up' message is printed.
I see nothing special in the log. Maybe I'm just not lucky enough to
catch the output, not sure.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-07 15:36                             ` G.R.
@ 2022-07-07 16:18                               ` Jan Beulich
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2022-07-07 16:18 UTC (permalink / raw)
  To: G.R.; +Cc: xen-devel, Roger Pau Monné

On 07.07.2022 17:36, G.R. wrote:
> On Thu, Jul 7, 2022 at 11:24 PM G.R. <firemeteor@users.sourceforge.net> wrote:
>>
>> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@suse.com> wrote:
>>>
>>>> Should I expect a debug build of XEN hypervisor to give better
>>>> diagnose messages, without the debug patch that Roger mentioned?
>>>
>>> Well, "expect" is perhaps too much to say, but with problems like
>>> yours (and even more so with multiple ones) using a debug
>>> hypervisor (or kernel, if there such a build mode existed) is imo
>>> always a good idea. As is using as up-to-date a version as
>>> possible.
>>
>> I built both 4.14.3 debug version and 4.16.1 release version for
>> testing purposes.
>> Unfortunately they gave me absolutely zero information, since neither
>> of them is able to get past issue #1,
>> the FLR-related DPC / AER issue.
>> With 4.16.1 release, it actually can survive the 'xl
>> pci-assignable-add' which triggers the first AER failure.
>> But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
>>> [  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
>>> [  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
>> Since I'll need a couple of pci-assignable-add &&
>> pci-assignable-remove to get to a seemingly normal state, I cannot
>> proceed from here.
>>
>> With 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
>> pci-assignable-add'.
>>
>> [  574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
>> etc) the device
>> [  574.623203] pcieport 0000:00:1d.0: DPC: containment event,
>> status:0x1f11 source:0x0000
>> [  574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
>> [  574.623209] pcieport 0000:00:1d.0: PCIe Bus Error:
>> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
>> ID)
>> [  574.623240] pcieport 0000:00:1d.0:   device [8086:a330] error
>> status/mask=00200000/00010000
>> [  574.623261] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
>> [  575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
>> [  576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
>> [  579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
>> [  583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
>> [  591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
>> [  609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
>> [  643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
>> //<=======The reboot happens somewhere here, not immediately, but
>> after a while...
>> //Maybe I can get something from xl dmesg if I was quick enough and
>> have connected from a second terminal...
> 
> Unfortunately I didn't see anything from xl dmesg...
> I wish 'xl dmesg' supported the follow mode (dmesg -w) that the
> Linux dmesg does.
> Here I have to repeat the command manually. The machine suddenly
> freezes after the 'giving up' message is printed.
> I see nothing special in the log. Maybe I'm just not lucky enough to
> catch the output, not sure.

If the box reboots in the middle, I guess you really want to hook up
a serial console.

Jan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-07 15:24                           ` G.R.
  2022-07-07 15:36                             ` G.R.
@ 2022-07-07 16:23                             ` Jan Beulich
  2022-07-08  2:28                               ` G.R.
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2022-07-07 16:23 UTC (permalink / raw)
  To: G.R.; +Cc: xen-devel, Roger Pau Monné, Anthony Perard

On 07.07.2022 17:24, G.R. wrote:
> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 06.07.2022 08:25, G.R. wrote:
>>> On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@suse.com> wrote:
>>>> Nothing useful in there. Yet independent of that I guess we need to
>>>> separate the issues you're seeing. Otherwise it'll be impossible to
>>>> know what piece of data belongs where.
>>> Yep, I think I'm seeing several different issues here:
>>> 1. The FLR related DPC / AER message seen on the 1st attempt only when
>>> pciback tries to seize and release the SN570
>>>     - Later-on pciback operations appear just fine.
>>> 2. MSI-X preparation failure message that shows up each time the SN570
>>> is seized by pciback or when it's passed to domU.
>>> 3. XEN tries to map BAR from two devices to the same page
>>> 4. The "write-back to unknown field" message in QEMU log that goes
>>> away with permissive=1 passthrough config.
>>> 5. The "irq 16: nobody cared" message shows up *sometimes* in a
>>> pattern that I haven't figured out  (See attached)
>>> 6. The FreeBSD domU sees the device but fails to use it because low
>>> level commands sent to it are aborted.
>>> 7. The device does not return to the pci-assignable-list when the domU
>>> it was assigned shuts down. (See attached)
>>>
>>> #3 appears to be a known issue that could be worked around with
>>> patches from the list.
>>> I suspect #1 may have something to do with the device itself. It's
>>> still not clear if it's deadly or just annoying.
>>> I was able to update the firmware to the latest version and confirmed
>>> that the new firmware didn't make any noticeable difference.
>>>
>>> I suspect issue #2, #4, #5, #6, #7 may be related, and the
>>> pass-through was not completely successful...
>>>
>>> Should I expect a debug build of XEN hypervisor to give better
>>> diagnose messages, without the debug patch that Roger mentioned?
>>
>> Well, "expect" is perhaps too much to say, but with problems like
>> yours (and even more so with multiple ones) using a debug
>> hypervisor (or kernel, if there such a build mode existed) is imo
>> always a good idea. As is using as up-to-date a version as
>> possible.
> 
> I built both 4.14.3 debug version and 4.16.1 release version for
> testing purposes.
> > Unfortunately they gave me absolutely zero information, since neither
> > of them is able to get past issue #1,
> > the FLR-related DPC / AER issue.
> With 4.16.1 release, it actually can survive the 'xl
> pci-assignable-add' which triggers the first AER failure.

Then that's what needs debugging first. Yet from all I've seen so
far I'm not sure what on the Xen side could be doing that, the more
so without being able to repro it myself - this seems more like a
Linux side issue (and even outside of the pciback driver).

> But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
>> [  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
>> [  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44

That'll need debugging. Cc-ing Anthony for awareness, but I'm sure
he'll need more data to actually stand a chance of doing something
about it.

Is there any chance you could be doing some debugging work yourself,
at the very least to figure out where this (apparent) NULL deref is
happening?

Jan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-07 16:23                             ` Jan Beulich
@ 2022-07-08  2:28                               ` G.R.
  2022-07-09  1:24                                 ` G.R.
  2022-07-09  4:27                                 ` G.R.
  0 siblings, 2 replies; 31+ messages in thread
From: G.R. @ 2022-07-08  2:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Roger Pau Monné, Anthony Perard

On Fri, Jul 8, 2022 at 12:38 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 07.07.2022 17:24, G.R. wrote:
> > On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@suse.com> wrote:
> >>
> >> On 06.07.2022 08:25, G.R. wrote:
> >>> On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@suse.com> wrote:
> >>>> Nothing useful in there. Yet independent of that I guess we need to
> >>>> separate the issues you're seeing. Otherwise it'll be impossible to
> >>>> know what piece of data belongs where.
> >>> Yep, I think I'm seeing several different issues here:
> >>> 1. The FLR related DPC / AER message seen on the 1st attempt only when
> >>> pciback tries to seize and release the SN570
> >>>     - Later-on pciback operations appear just fine.
> >>> 2. MSI-X preparation failure message that shows up each time the SN570
> >>> is seized by pciback or when it's passed to domU.
> >>> 3. XEN tries to map BAR from two devices to the same page
> >>> 4. The "write-back to unknown field" message in QEMU log that goes
> >>> away with permissive=1 passthrough config.
> >>> 5. The "irq 16: nobody cared" message shows up *sometimes* in a
> >>> pattern that I haven't figured out  (See attached)
> >>> 6. The FreeBSD domU sees the device but fails to use it because low
> >>> level commands sent to it are aborted.
> >>> 7. The device does not return to the pci-assignable-list when the domU
> >>> it was assigned shuts down. (See attached)
> >>>
> >>> #3 appears to be a known issue that could be worked around with
> >>> patches from the list.
> >>> I suspect #1 may have something to do with the device itself. It's
> >>> still not clear if it's deadly or just annoying.
> >>> I was able to update the firmware to the latest version and confirmed
> >>> that the new firmware didn't make any noticeable difference.
> >>>
> >>> I suspect issue #2, #4, #5, #6, #7 may be related, and the
> >>> pass-through was not completely successful...
> >>>
> >>> Should I expect a debug build of XEN hypervisor to give better
> >>> diagnose messages, without the debug patch that Roger mentioned?
> >>
> >> Well, "expect" is perhaps too much to say, but with problems like
> >> yours (and even more so with multiple ones) using a debug
> >> hypervisor (or kernel, if there such a build mode existed) is imo
> >> always a good idea. As is using as up-to-date a version as
> >> possible.
> >
> > I built both 4.14.3 debug version and 4.16.1 release version for
> > testing purposes.
> > Unfortunately they gave me absolutely zero information, since neither
> > of them is able to get past issue #1,
> > the FLR-related DPC / AER issue.
> > With 4.16.1 release, it actually can survive the 'xl
> > pci-assignable-add' which triggers the first AER failure.
>
> Then that's what needs debugging first. Yet from all I've seen so
> far I'm not sure what on the Xen side could be doing that, the more
> so without being able to repro it myself - this seems more like a
> Linux side issue (and even outside of the pciback driver).
>
Yep, this one is likely not XEN related, as I've seen some discussions
([1],[2]) on a similar syndrome (not necessarily the same root cause
though).
The question is why this only shows up during the FLR attempt, and
whether subsequent pci-assignable-adds that do not trigger the error
are actually reliable.
BTW, I'm under the impression that the device is still usable in dom0
afterwards, I'll have to double check though...

[1] https://patchwork.kernel.org/project/linux-pci/patch/20220408153159.106741-1-kai.heng.feng@canonical.com/
[2] https://patchwork.kernel.org/project/linux-pci/patch/20220127025418.1989642-1-kai.heng.feng@canonical.com/#24713767

> > But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
> >> [  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
> >> [  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
>
> That'll need debugging. Cc-ing Anthony for awareness, but I'm sure
> he'll need more data to actually stand a chance of doing something
> about it.
>
> Is there any chance you could be doing some debugging work yourself,
> at the very least to figure out where this (apparent) NULL deref is
> happening?
Yep, I can collect the call-stack for sure.

>
> Jan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-08  2:28                               ` G.R.
@ 2022-07-09  1:24                                 ` G.R.
  2022-07-09  4:27                                 ` G.R.
  1 sibling, 0 replies; 31+ messages in thread
From: G.R. @ 2022-07-09  1:24 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Roger Pau Monné, Anthony Perard

[-- Attachment #1: Type: text/plain, Size: 2217 bytes --]

On Fri, Jul 8, 2022 at 10:28 AM G.R. <firemeteor@users.sourceforge.net> wrote:
>
> On Fri, Jul 8, 2022 at 12:38 AM Jan Beulich <jbeulich@suse.com> wrote:
> > > But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
> > >> [  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
> > >> [  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
> >
> > That'll need debugging. Cc-ing Anthony for awareness, but I'm sure
> > he'll need more data to actually stand a chance of doing something
> > about it.
> >
> > Is there any chance you could be doing some debugging work yourself,
> > at the very least to figure out where this (apparent) NULL deref is
> > happening?
> Yep, I can collect the call-stack for sure.

The call-stack of the segfault is like this:
0x00007ffff7f0971f in name2bdf () from /usr/lib/libxenlight.so.4.16
(gdb) bt
#0  0x00007ffff7f0971f in name2bdf () from /usr/lib/libxenlight.so.4.16
#1  0x00007ffff7f0a75e in libxl_device_pci_assignable_remove () from
/usr/lib/libxenlight.so.4.16
#2  0x00005555555725bf in main_pciassignable_remove ()
#3  0x00005555555610ab in main ()
This is with a release build of libxenlight. Once I switch to a
debug build, the segmentation fault just goes away...
This allows me to move on and test the behavior on 4.16.1 --
unfortunately no change observed at all.
Once I get the SSD assigned to the FreeBSD 12 domU, the domU still
sees the device but fails to operate it.

This time I also built the debug version of 4.16.1 hypervisor.
But unfortunately it hits the same reboot on the first
pci-assignable-add.
I cannot follow the suggestion of attaching a serial console yet.
The motherboard does have a serial port connector, but I do not have a
cable at the moment.
Maybe I can grab one, but it takes some time...

What I was able to do is to dump the 'xl dmesg' output from the dom0
boot with a debug hypervisor (see attached).
It does give a few extra lines, and I hope they are helpful.

Thanks,
G.R.

[-- Attachment #2: xldmesg_4.16.1_dbgbuild.log --]
[-- Type: text/x-log, Size: 14831 bytes --]

 Xen 4.16.1
(XEN) Xen version 4.16.1 (firemeteor@) (gcc (Debian 11.2.0-13) 11.2.0) debug=y Fri Jul  8 21:09:41 HKT 2022
(XEN) Latest ChangeSet: Wed Jul 6 16:22:55 2022 +0800 git:514aba9623
(XEN) build-id: 3e07b621cf5201a82b867a44ef1ad58a233e4ec8
(XEN) Bootloader: GRUB 2.04-20
(XEN) Command line: placeholder dom0_mem=2G,max:3G,min:1G dom0_max_vcpus=4 loglvl=all guest_loglvl=all iommu=verbose
(XEN) Xen image load base address: 0x87a00000
(XEN) Video information:
(XEN)  VGA is text mode 80x25, font 8x16
(XEN)  VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) Disc information:
(XEN)  Found 5 MBR signatures
(XEN)  Found 5 EDD information structures
(XEN) CPU Vendor: Intel, Family 6 (0x6), Model 158 (0x9e), Stepping 10 (raw 000906ea)
(XEN) Xen-e820 RAM map:
(XEN)  [0000000000000000, 000000000009d3ff] (usable)
(XEN)  [000000000009d400, 000000000009ffff] (reserved)
(XEN)  [00000000000e0000, 00000000000fffff] (reserved)
(XEN)  [0000000000100000, 00000000835bffff] (usable)
(XEN)  [00000000835c0000, 00000000835c0fff] (ACPI NVS)
(XEN)  [00000000835c1000, 00000000835c1fff] (reserved)
(XEN)  [00000000835c2000, 0000000088c0bfff] (usable)
(XEN)  [0000000088c0c000, 000000008907dfff] (reserved)
(XEN)  [000000008907e000, 00000000891f4fff] (usable)
(XEN)  [00000000891f5000, 00000000895dcfff] (ACPI NVS)
(XEN)  [00000000895dd000, 0000000089efefff] (reserved)
(XEN)  [0000000089eff000, 0000000089efffff] (usable)
(XEN)  [0000000089f00000, 000000008f7fffff] (reserved)
(XEN)  [00000000e0000000, 00000000efffffff] (reserved)
(XEN)  [00000000fe000000, 00000000fe010fff] (reserved)
(XEN)  [00000000fec00000, 00000000fec00fff] (reserved)
(XEN)  [00000000fee00000, 00000000fee00fff] (reserved)
(XEN)  [00000000ff000000, 00000000ffffffff] (reserved)
(XEN)  [0000000100000000, 000000086c7fffff] (usable)
(XEN) ACPI: RSDP 000F05B0, 0024 (r2 ALASKA)
(XEN) ACPI: XSDT 895120A8, 00D4 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FACP 895509C0, 0114 (r6 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: DSDT 89512218, 3E7A6 (r2 ALASKA    A M I  1072009 INTL 20160527)
(XEN) ACPI: FACS 895DC080, 0040
(XEN) ACPI: APIC 89550AD8, 00F4 (r4 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FPDT 89550BD0, 0044 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: FIDT 89550C18, 009C (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) ACPI: MCFG 89550CB8, 003C (r1 ALASKA    A M I  1072009 MSFT       97)
(XEN) ACPI: SSDT 89550CF8, 0204 (r1 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89550F00, 17D5 (r2 ALASKA    A M I     3000 INTL 20160527)
(XEN) ACPI: SSDT 895526D8, 933D (r1 ALASKA    A M I        1 INTL 20160527)
(XEN) ACPI: SSDT 8955BA18, 31C7 (r2 ALASKA    A M I     3000 INTL 20160527)
(XEN) ACPI: SSDT 8955EBE0, 2358 (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: HPET 89560F38, 0038 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: SSDT 89560F70, 1BE1 (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89562B58, 0F9E (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 89563AF8, 2D1B (r2 ALASKA    A M I        0 INTL 20160527)
(XEN) ACPI: UEFI 89566818, 0042 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: LPIT 89566860, 005C (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: SSDT 895668C0, 27DE (r2 ALASKA    A M I     1000 INTL 20160527)
(XEN) ACPI: SSDT 895690A0, 0FFE (r2 ALASKA    A M I        0 INTL 20160527)
(XEN) ACPI: DBGP 8956A0A0, 0034 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: DBG2 8956A0D8, 0054 (r0 ALASKA    A M I        2       1000013)
(XEN) ACPI: DMAR 8956A130, 00A8 (r1 ALASKA    A M I        2       1000013)
(XEN) ACPI: WSMT 8956A1D8, 0028 (r1 ALASKA    A M I  1072009 AMI     10013)
(XEN) System RAM: 32597MB (33379452kB)
(XEN) No NUMA configuration found
(XEN) Faking a node at 0000000000000000-000000086c800000
(XEN) Domain heap initialised
(XEN) found SMP MP-table at 000fce30
(XEN) SMBIOS 3.1 present.
(XEN) Using APIC driver default
(XEN) ACPI: PM-Timer IO Port: 0x1808 (24 bits)
(XEN) ACPI: v5 SLEEP INFO: control[1:1804], status[1:1800]
(XEN) ACPI: Invalid sleep control/status register data: 0:0x8:0x3 0:0x8:0x3
(XEN) ACPI: SLEEP INFO: pm1x_cnt[1:1804,1:0], pm1x_evt[1:1800,1:0]
(XEN) ACPI: 32/64X FACS address mismatch in FADT - 895dc080/0000000000000000, using 32
(XEN) ACPI:             wakeup_vec[895dc08c], vec_size[20]
(XEN) ACPI: Local APIC address 0xfee00000
(XEN) Overriding APIC driver with bigsmp
(XEN) ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
(XEN) IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-119
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
(XEN) ACPI: IRQ0 used by override.
(XEN) ACPI: IRQ2 used by override.
(XEN) ACPI: IRQ9 used by override.
(XEN) Enabling APIC mode:  Phys.  Using 1 I/O APICs
(XEN) ACPI: HPET id: 0x8086a201 base: 0xfed00000
(XEN) PCI: MCFG configuration 0: base e0000000 segment 0000 buses 00 - ff
(XEN) PCI: MCFG area at e0000000 reserved in E820
(XEN) PCI: Using MCFG for segment 0000 bus 00-ff
(XEN) [VT-D]Host address width 39
(XEN) [VT-D]found ACPI_DMAR_DRHD:
(XEN) [VT-D]  dmaru->address = fed90000
(XEN) [VT-D]drhd->address = fed90000 iommu->reg = ffff82c00021d000
(XEN) [VT-D]cap = 1c0000c40660462 ecap = 19e2ff0505e
(XEN) [VT-D] endpoint: 0000:00:02.0
(XEN) [VT-D]found ACPI_DMAR_DRHD:
(XEN) [VT-D]  dmaru->address = fed91000
(XEN) [VT-D]drhd->address = fed91000 iommu->reg = ffff82c00021f000
(XEN) [VT-D]cap = d2008c40660462 ecap = f050da
(XEN) [VT-D] IOAPIC: 0000:00:1e.7
(XEN) [VT-D] MSI HPET: 0000:00:1e.6
(XEN) [VT-D]  flags: INCLUDE_ALL
(XEN) [VT-D]found ACPI_DMAR_RMRR:
(XEN) [VT-D] endpoint: 0000:00:14.0
(XEN) [VT-D]dmar.c:617:  RMRR: [899e0000,89c29fff]
(XEN) [VT-D]found ACPI_DMAR_RMRR:
(XEN) [VT-D] endpoint: 0000:00:02.0
(XEN) [VT-D]dmar.c:617:  RMRR: [8b000000,8f7fffff]
(XEN) Using ACPI (MADT) for SMP configuration information
(XEN) SMP: Allowing 12 CPUs (0 hotplug CPUs)
(XEN) IRQ limits: 120 GSI, 2376 MSI/MSI-X
(XEN) [VT-D]qinval.c:421: QI: using 256-entry ring(s)
(XEN) Switched to APIC driver x2apic_cluster
(XEN) CPU0: TSC: ratio: 292 / 2
(XEN) CPU0: bus: 100 MHz base: 3500 MHz max: 4500 MHz
(XEN) CPU0: 800 ... 3500 MHz
(XEN) xstate: size: 0x440 and states: 0x1f
(XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
(XEN) CPU0: Intel machine check reporting enabled
(XEN) Speculative mitigation facilities:
(XEN)   Hardware hints:
(XEN)   Hardware features: IBPB IBRS STIBP SSBD L1D_FLUSH MD_CLEAR
(XEN)   Compiled-in support: INDIRECT_THUNK SHADOW_PAGING
(XEN)   Xen settings: BTI-Thunk JMP, SPEC_CTRL: IBRS+ STIBP- SSBD-, Other: IBPB L1D_FLUSH VERW BRANCH_HARDEN
(XEN)   L1TF: believed vulnerable, maxphysaddr L1D 46, CPUID 39, Safe address 8000000000
(XEN)   Support for HVM VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN)   Support for PV VMs: MSR_SPEC_CTRL EAGER_FPU MD_CLEAR
(XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled (with PCID)
(XEN)   PV L1TF shadowing: Dom0 disabled, DomU enabled
(XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Initializing Credit2 scheduler
(XEN)  load_precision_shift: 18
(XEN)  load_window_shift: 30
(XEN)  underload_balance_tolerance: 0
(XEN)  overload_balance_tolerance: -3
(XEN)  runqueues arrangement: socket
(XEN)  cap enforcement granularity: 10ms
(XEN) load tracking window length 1073741824 ns
(XEN) Disabling HPET for being unreliable
(XEN) Platform timer is 3.580MHz ACPI PM Timer
(XEN) Detected 3504.012 MHz processor.
(XEN) Freed 1024kB unused BSS memory
(XEN) alt table ffff82d04048b5b0 -> ffff82d040497d06
(XEN) Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB
(XEN) Intel VT-d Snoop Control not enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping enabled.
(XEN) Intel VT-d Posted Interrupt not enabled.
(XEN) Intel VT-d Shared EPT tables enabled.
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) nr_sockets: 1
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using old ACK method
(XEN) ..TIMER: vector=0xF0 apic1=0 pin1=2 apic2=-1 pin2=-1
(XEN) TSC deadline timer enabled
(XEN) Allocated console ring of 128 KiB.
(XEN) mwait-idle: MWAIT substates: 0x11142120
(XEN) mwait-idle: v0.4.1 model 0x9e
(XEN) mwait-idle: lapic_timer_reliable_states 0xffffffff
(XEN) VMX: Supported advanced features:
(XEN)  - APIC MMIO access virtualisation
(XEN)  - APIC TPR shadow
(XEN)  - Extended Page Tables (EPT)
(XEN)  - Virtual-Processor Identifiers (VPID)
(XEN)  - Virtual NMI
(XEN)  - MSR direct-access bitmap
(XEN)  - Unrestricted Guest
(XEN)  - VMCS shadowing
(XEN)  - VM Functions
(XEN)  - Virtualisation Exceptions
(XEN)  - Page Modification Logging
(XEN) HVM: ASIDs enabled.
(XEN) VMX: Disabling executable EPT superpages due to CVE-2018-12207
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging (HAP) detected
(XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
(XEN) alt table ffff82d04048b5b0 -> ffff82d040497d06
(XEN) Brought up 12 CPUs
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Adding cpu 0 to runqueue 0
(XEN)  First cpu on runqueue, activating
(XEN) Adding cpu 1 to runqueue 0
(XEN) Adding cpu 2 to runqueue 0
(XEN) Adding cpu 3 to runqueue 0
(XEN) Adding cpu 4 to runqueue 0
(XEN) Adding cpu 5 to runqueue 0
(XEN) Adding cpu 6 to runqueue 0
(XEN) Adding cpu 7 to runqueue 0
(XEN) Adding cpu 8 to runqueue 0
(XEN) Adding cpu 9 to runqueue 0
(XEN) Adding cpu 10 to runqueue 0
(XEN) Adding cpu 11 to runqueue 0
(XEN) mcheck_poll: Machine check polling timer started.
(XEN) Running stub recovery selftests...
(XEN) Fixup #UD[0000]: ffff82d07fffe044 [ffff82d07fffe044] -> ffff82d040386809
(XEN) Fixup #GP[0000]: ffff82d07fffe045 [ffff82d07fffe045] -> ffff82d040386809
(XEN) Fixup #SS[0000]: ffff82d07fffe044 [ffff82d07fffe044] -> ffff82d040386809
(XEN) Fixup #BP[0000]: ffff82d07fffe045 [ffff82d07fffe045] -> ffff82d040386809
(XEN) NX (Execute Disable) protection active
(XEN) Dom0 has maximum 952 PIRQs
(XEN) *** Building a PV Dom0 ***
(XEN) ELF: phdr: paddr=0x1000000 memsz=0x1395d08
(XEN) ELF: phdr: paddr=0x2400000 memsz=0x685000
(XEN) ELF: phdr: paddr=0x2a85000 memsz=0x30d98
(XEN) ELF: phdr: paddr=0x2ab6000 memsz=0x376000
(XEN) ELF: memory: 0x1000000 -> 0x2e2c000
(XEN) ELF: note: GUEST_OS = "linux"
(XEN) ELF: note: GUEST_VERSION = "2.6"
(XEN) ELF: note: XEN_VERSION = "xen-3.0"
(XEN) ELF: note: VIRT_BASE = 0xffffffff80000000
(XEN) ELF: note: INIT_P2M = 0x8000000000
(XEN) ELF: note: ENTRY = 0xffffffff82ab6160
(XEN) ELF: note: HYPERCALL_PAGE = 0xffffffff81002000
(XEN) ELF: note: FEATURES = "!writable_page_tables|pae_pgdir_above_4gb"
(XEN) ELF: note: SUPPORTED_FEATURES = 0x8801
(XEN) ELF: note: PAE_MODE = "yes"
(XEN) ELF: note: LOADER = "generic"
(XEN) ELF: note: unknown (0xd)
(XEN) ELF: note: SUSPEND_CANCEL = 0x1
(XEN) ELF: note: MOD_START_PFN = 0x1
(XEN) ELF: note: HV_START_LOW = 0xffff800000000000
(XEN) ELF: note: PADDR_OFFSET = 0
(XEN) ELF: note: PHYS32_ENTRY = 0x10004b0
(XEN) ELF: addresses:
(XEN)     virt_base        = 0xffffffff80000000
(XEN)     elf_paddr_offset = 0x0
(XEN)     virt_offset      = 0xffffffff80000000
(XEN)     virt_kstart      = 0xffffffff81000000
(XEN)     virt_kend        = 0xffffffff82e2c000
(XEN)     virt_entry       = 0xffffffff82ab6160
(XEN)     p2m_base         = 0x8000000000
(XEN)  Xen  kernel: 64-bit, lsb
(XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x2e2c000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000000850000000->0000000854000000 (504314 pages to be allocated)
(XEN)  Init. ramdisk: 000000086b9fa000->000000086c7ff3f6
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff81000000->ffffffff82e2c000
(XEN)  Phys-Mach map: 0000008000000000->0000008000400000
(XEN)  Start info:    ffffffff82e2c000->ffffffff82e2c4b8
(XEN)  Page tables:   ffffffff82e2d000->ffffffff82e48000
(XEN)  Boot stack:    ffffffff82e48000->ffffffff82e49000
(XEN)  TOTAL:         ffffffff80000000->ffffffff83000000
(XEN)  ENTRY ADDRESS: ffffffff82ab6160
(XEN) Dom0 has maximum 4 VCPUs
(XEN) ELF: phdr 0 at 0xffffffff81000000 -> 0xffffffff82395d08
(XEN) ELF: phdr 1 at 0xffffffff82400000 -> 0xffffffff82a85000
(XEN) ELF: phdr 2 at 0xffffffff82a85000 -> 0xffffffff82ab5d98
(XEN) ELF: phdr 3 at 0xffffffff82ab6000 -> 0xffffffff82be0000
(XEN) [VT-D]iommu_enable_translation: iommu->reg = ffff82c00021d000
(XEN) [VT-D]iommu_enable_translation: iommu->reg = ffff82c00021f000
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: All
(XEN) Guest Loglevel: All
(XEN) ***************************************************
(XEN) Booted on L1TF-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Please assess your configuration and choose an
(XEN) explicit 'smt=<bool>' setting.  See XSA-273.
(XEN) ***************************************************
(XEN) Booted on MLPDS/MFBDS-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Mitigations will not be fully effective.  Please
(XEN) choose an explicit smt=<bool> setting.  See XSA-297.
(XEN) ***************************************************
(XEN) 3... 2... 1... 
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 620kB init memory
(XEN) d0: Forcing write emulation on MFNs e0000-effff
(XEN) PCI add device 0000:00:00.0
(XEN) PCI add device 0000:00:01.0
(XEN) PCI add device 0000:00:02.0
(XEN) PCI add device 0000:00:12.0
(XEN) PCI add device 0000:00:14.0
(XEN) PCI add device 0000:00:14.2
(XEN) PCI add device 0000:00:16.0
(XEN) PCI add device 0000:00:16.3
(XEN) PCI add device 0000:00:17.0
(XEN) PCI add device 0000:00:1b.0
(XEN) PCI add device 0000:00:1c.0
(XEN) PCI add device 0000:00:1c.5
(XEN) PCI add device 0000:00:1d.0
(XEN) PCI add device 0000:00:1f.0
(XEN) PCI add device 0000:00:1f.3
(XEN) PCI add device 0000:00:1f.4
(XEN) PCI add device 0000:00:1f.5
(XEN) PCI add device 0000:01:00.0
(XEN) PCI add device 0000:04:00.0
(XEN) PCI add device 0000:05:00.0
(XEN) emul-priv-op.c:1025:d0v2 RDMSR 0x00000639 unimplemented
(XEN) emul-priv-op.c:1025:d0v2 RDMSR 0x00000611 unimplemented
(XEN) emul-priv-op.c:1025:d0v2 RDMSR 0x00000619 unimplemented
(XEN) emul-priv-op.c:1025:d0v2 RDMSR 0x00000641 unimplemented
(XEN) emul-priv-op.c:1025:d0v2 RDMSR 0x0000064d unimplemented
(XEN) emul-priv-op.c:1025:d0v2 RDMSR 0x00000606 unimplemented
(XEN) emul-priv-op.c:1025:d0v2 RDMSR 0x0000064e unimplemented
(XEN) emul-priv-op.c:1025:d0v2 RDMSR 0x00000034 unimplemented
(XEN) d0: Forcing read-only access to MFN fed00
(XEN) emul-priv-op.c:1025:d0v3 RDMSR 0xc0011020 unimplemented


* Re: PCI pass-through problem for SN570 NVME SSD
  2022-07-08  2:28                               ` G.R.
  2022-07-09  1:24                                 ` G.R.
@ 2022-07-09  4:27                                 ` G.R.
  1 sibling, 0 replies; 31+ messages in thread
From: G.R. @ 2022-07-09  4:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Roger Pau Monné, Anthony Perard

On Fri, Jul 8, 2022 at 10:28 AM G.R. <firemeteor@users.sourceforge.net> wrote:
>
> On Fri, Jul 8, 2022 at 12:38 AM Jan Beulich <jbeulich@suse.com> wrote:
> >
> > > I built both a 4.14.3 debug build and a 4.16.1 release build for
> > > testing purposes.
> > > Unfortunately they gave me absolutely zero information, since neither
> > > of them is able to get past issue #1,
> > > the FLR-related DPC / AER issue.
> > > With the 4.16.1 release, it can actually survive the 'xl
> > > pci-assignable-add' that triggers the first AER failure.
> >
> > Then that's what needs debugging first. Yet from all I've seen so
> > far I'm not sure who on the Xen side could be doing that, the more
> > so without themselves being able to repro - this seems more like a
> > Linux-side issue (and even outside of the pciback driver).
> >
> Yep, this one is likely not Xen-related, as I've seen some discussions
> ([1],[2]) on similar syndromes (not necessarily the same root cause,
> though).
> The question is why this only shows up during the FLR attempt, and
> whether subsequent pci-assignable-add invocations that do not trigger
> the error are actually reliable.
> BTW, I'm under the impression that the device is still usable in dom0
> afterwards; I'll have to double-check, though...
I think I'm finally making progress here.
Today I verified that the SSD does not survive FLR in Linux as long as
AER / DPC is enabled.
The same syndrome can be observed regardless of which driver the
device attaches to, pciback or otherwise.
And after the unsuccessful FLR, I can't use the device at all, either
in dom0 or on bare-metal Linux.
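As a quick way to check whether a given boot hit the same failure, the DPC / AER events can be counted out of a dmesg capture. A minimal sketch — the embedded sample is the log excerpt quoted at the top of this thread, not live output; on a real system you would pipe `dmesg` into the same greps:

```shell
# Sample dmesg excerpt from the failing FLR attempt (copied from this thread).
log='pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)'

# Count DPC containment events and unmasked uncorrectable errors in the capture.
dpc=$(printf '%s\n' "$log" | grep -c 'DPC: containment event')
aer=$(printf '%s\n' "$log" | grep -c 'unmasked uncorrectable error')
echo "DPC containment events: $dpc, uncorrectable errors: $aer"
# prints: DPC containment events: 1, uncorrectable errors: 1
```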

Forcibly disabling ASPM does not fix the issue as long as AER / DPC
are left enabled.
However, once AER / DPC are disabled (via the pcie_aspm=off kernel
command-line option, which ironically doesn't actually turn off ASPM),
the SSD survives FLR just fine and I'm still able to use it on the
Linux host afterwards.
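For anyone who wants to try the same workaround under Xen, this is roughly where the option goes in a Debian-style GRUB2 entry — a sketch only; the menu-entry label, file names and the other options are assumptions, not copied from this system:

```
menuentry 'Debian GNU/Linux, with Xen hypervisor' {
        multiboot2  /boot/xen.gz placeholder dom0_mem=4096M,max:4096M
        # pcie_aspm=off goes on the dom0 *kernel* line, not the Xen line
        module2     /boot/vmlinuz-5.10.0-amd64 placeholder root=/dev/sda1 ro pcie_aspm=off
        module2     /boot/initrd.img-5.10.0-amd64
}
```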

Even better, the same pcie_aspm=off kernel command line works in the
Xen environment too.
The FreeBSD 12 domU is now able to access the SSD without issue!
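For completeness, the Xen side of the setup is unchanged by the workaround; roughly (the BDF is the one from this thread, the rest is a sketch):

```
# On dom0, hand the device to pciback (or hide it at boot with
# xen-pciback.hide=(0000:05:00.0) on the dom0 kernel line):
xl pci-assignable-add 05:00.0

# In the FreeBSD domU configuration file:
pci = [ '05:00.0' ]
```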

Thanks,
G.R.

>
> [1] https://patchwork.kernel.org/project/linux-pci/patch/20220408153159.106741-1-kai.heng.feng@canonical.com/
> [2] https://patchwork.kernel.org/project/linux-pci/patch/20220127025418.1989642-1-kai.heng.feng@canonical.com/#24713767
>



Thread overview: 31+ messages
2022-07-02 17:43 PCI pass-through problem for SN570 NVME SSD G.R.
2022-07-04  6:37 ` G.R.
2022-07-04 10:31   ` Jan Beulich
2022-07-04  9:50 ` Roger Pau Monné
2022-07-04 10:34   ` Jan Beulich
2022-07-04 11:34   ` G.R.
2022-07-04 11:44     ` G.R.
2022-07-04 13:09     ` Roger Pau Monné
2022-07-04 14:51       ` G.R.
2022-07-04 15:15         ` G.R.
2022-07-04 15:37           ` G.R.
2022-07-04 16:05             ` Roger Pau Monné
2022-07-04 16:07               ` Jan Beulich
2022-07-04 16:31               ` G.R.
2022-07-05  7:29                 ` Jan Beulich
2022-07-05 11:31                   ` G.R.
2022-07-05 11:59                     ` Jan Beulich
2022-07-06  6:25                       ` G.R.
2022-07-06  6:33                         ` Jan Beulich
2022-07-07 15:24                           ` G.R.
2022-07-07 15:36                             ` G.R.
2022-07-07 16:18                               ` Jan Beulich
2022-07-07 16:23                             ` Jan Beulich
2022-07-08  2:28                               ` G.R.
2022-07-09  1:24                                 ` G.R.
2022-07-09  4:27                                 ` G.R.
2022-07-04 15:33         ` Roger Pau Monné
2022-07-04 15:44           ` G.R.
2022-07-04 15:57             ` Roger Pau Monné
2022-07-05 18:06               ` Jason Andryuk
2022-07-04 16:05           ` Jan Beulich
