Running VMs with an eGPU and VFIO: from flaky (<= 5.12.x) to broken (5.13.x)

From: Andrej Podzimek via Virtualization @ 2021-07-11 12:15 UTC
  To: virtualization


Dear virtualization mailing list,

My question may well be misplaced, because it's Thunderbolt-, eGPU- and NVidia-related, but I'm out of ideas about where else to ask. (Should I ask on a qemu- or libvirt-specific list instead? If so, please give me a hint.)

First, here's the configuration of the physical (host) machine:

         Command line: pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=256M,hpmmioprefsize=16G mem_encrypt=on
            lspci -tv: https://pastebin.com/raw/usBudC1y
          Motherboard: ASRock x570 Creator with BIOS 3.50
                  CPU: AMD Ryzen 3950X
               System: ArchLinux with kernel 5.12.15 / 5.13.1
       eGPU enclosure: Razer Core X Chroma
             eGPU GPU: NVidia Quadro P5000
        UEFI settings: Above 64b decoding, IOMMU and SR-IOV all *enabled*
     Other PCIe slots:
                      GPU: AMD Radeon Pro W5700
                       M2: Two Seagate FireCuda 520 (ZP2000GM30002)
                     WiFi: Intel AX200 (factory-default)

The eGPU is configured like this in libvirt:

     <hostdev mode="subsystem" type="pci" managed="yes">
       <source><address domain="0x0000" bus="0x3d" slot="0x00" function="0x0"/></source>
       <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
     </hostdev>
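
For cross-referencing this XML with the QEMU errors further down: the `<source>` address corresponds to host device 0000:3d:00.0. A minimal sketch of that mapping, and of looking up the device's IOMMU group in sysfs, could look like this. The helper names `pci_address` and `iommu_group_of` and the optional sysfs-root argument are hypothetical and exist only for illustration; the sysfs paths are the standard ones.

```shell
# Format domain/bus/slot/function (as written in the libvirt XML) as the
# canonical PCI address used by lspci, sysfs and QEMU.
# Usage: pci_address DOMAIN BUS SLOT FUNCTION
pci_address() {
    printf '%04x:%02x:%02x.%x\n' "$1" "$2" "$3" "$4"
}

# Print the IOMMU group number of a PCI device.
# Usage: iommu_group_of ADDRESS [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys (the override is an illustrative assumption).
iommu_group_of() {
    link="${2:-/sys}/bus/pci/devices/$1/iommu_group"
    [ -e "$link" ] || return 1
    basename "$(readlink -f "$link")"
}
```

For example, `pci_address 0x0000 0x3d 0x00 0x0` prints `0000:3d:00.0`, and `iommu_group_of 0000:3d:00.0` prints the group number for the current boot (group numbers are not guaranteed to be stable across boots).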

Now the problem: Passing the NVidia card in the eGPU enclosure through to virtual machines was flaky up to 5.12.x (i.e., it sometimes worked and sometimes didn't) and stopped working entirely in 5.13:

     virsh # start FreeBSD
     error: Failed to start domain 'FreeBSD'
     error: internal error: qemu unexpectedly closed the monitor: 2021-07-11T10:34:09.102381Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:3d:00.0: error getting device from group 49: Invalid argument
     Verify all devices in group 49 are bound to vfio-<bus> or pci-stub and not already in use

     virsh # start Windows
     error: Failed to start domain 'Windows'
     error: internal error: qemu unexpectedly closed the monitor: qxl_send_events: spice-server bug: guest stopped, ignoring
     2021-07-11T10:34:36.163549Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.7,addr=0x0: vfio_listener_region_add received unaligned region
     2021-07-11T10:34:39.432499Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.7,addr=0x0: vfio_listener_region_del received unaligned region
     2021-07-11T10:34:39.567039Z qemu-system-x86_64: -device vfio-pci,host=0000:3d:00.0,id=hostdev0,bus=pci.7,addr=0x0: vfio 0000:3d:00.0: error getting device from group 49: Invalid argument
     Verify all devices in group 49 are bound to vfio-<bus> or pci-stub and not already in use
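
The hint in the error ("Verify all devices in group 49 are bound to vfio-<bus> or pci-stub") can be checked directly from sysfs. The following is a minimal sketch, assuming the standard /sys/kernel/iommu_groups layout; the function name `list_iommu_group` and its optional sysfs-root argument are hypothetical, introduced only for illustration.

```shell
# List every device in an IOMMU group together with the driver it is
# currently bound to, one "ADDRESS DRIVER" pair per line.
# Usage: list_iommu_group GROUP [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys (the override is an illustrative assumption).
list_iommu_group() {
    group=$1
    root=${2:-/sys}
    for dev in "$root/kernel/iommu_groups/$group/devices"/*; do
        # Skip the unexpanded glob if the group does not exist or is empty.
        [ -e "$dev" ] || continue
        if [ -e "$dev/driver" ]; then
            driver=$(basename "$(readlink -f "$dev/driver")")
        else
            driver="(none)"
        fi
        printf '%s %s\n' "$(basename "$dev")" "$driver"
    done
}
```

Running `list_iommu_group 49` while a VM start is failing would show whether anything in the group is still bound to a host driver such as nvidia rather than vfio-pci. Whether that is the cause of the "Invalid argument" error here is not established; the listing only rules one possibility in or out.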

============
With 5.12.x:

There were "lucky" and "unlucky" boots/uptimes. VMs could be started and restarted arbitrarily during the "lucky" uptimes and the NVidia eGPU worked flawlessly. During an "unlucky" uptime, the errors above popped up every single time and no VMs using the eGPU could be started. Restarts of the eGPU did not help. The likelihood of a "lucky" uptime was roughly 1:3, so it took quite a few reboots to get there. :-( /o\
============

============
With 5.13.x:

After boot, the eGPU on Thunderbolt initially doesn't work at all. It doesn't show up in lspci, the nvidia module isn't loaded, etc. Switching the eGPU off and on doesn't help. Surprisingly, the only way I've found so far to make it initialize is:
     modprobe -r thunderbolt
     modprobe thunderbolt

After that^^^ the eGPU and the NVidia GPU are detected, modules are loaded, nvidia-smi works and shows information etc., but attempts at VM startup _always_ produce the errors above. I have not seen a "lucky" uptime in >50 boots. :-( Also, before unloading and reloading thunderbolt, there is simply no device 3d:00.0 anywhere on PCI (and no trace of NVidia elsewhere), so that machine state is a (VM) non-starter.
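
The reload-and-check sequence described above can be scripted so that each boot is tested the same way. This is only a sketch: `wait_for_pci_device` and `reload_thunderbolt_and_wait` are hypothetical helper names, the 30-second default timeout is an arbitrary assumption, and the reload itself needs root.

```shell
# Poll sysfs once per second until a PCI device appears or the timeout expires.
# Usage: wait_for_pci_device ADDRESS [TIMEOUT_SECONDS] [SYSFS_ROOT]
# Returns 0 if the device is present, 1 on timeout.
wait_for_pci_device() {
    addr=$1
    timeout=${2:-30}
    root=${3:-/sys}
    elapsed=0
    while [ ! -e "$root/bus/pci/devices/$addr" ]; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Unload and reload the thunderbolt module, then wait for the given device.
# Usage: reload_thunderbolt_and_wait ADDRESS [TIMEOUT_SECONDS]
reload_thunderbolt_and_wait() {
    modprobe -r thunderbolt &&
        modprobe thunderbolt &&
        wait_for_pci_device "$1" "${2:-30}"
}
```

As root, `reload_thunderbolt_and_wait 0000:3d:00.0 && lspci -s 3d:00.0` would confirm that the GPU has reappeared before any VM start is attempted.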

What else I tried:
     * options thunderbolt start_icm=1  -- no change (plus admittedly I have no clue what the internal connection manager means/does)
     * options vfio_iommu_type1 disable_hugepages=1  -- "What if the 'unaligned region' is related to huge pages?" No change here either. /o\
     * a great many reboots, Thunderbolt disconnects/reconnects etc. Nope. It won't work.
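
Since both module options produced "no change", it may be worth confirming that they actually reached the running modules. Both parameters are exported under /sys/module/<module>/parameters/, so a small reader is enough. A minimal sketch follows; the function name `module_param` and its optional sysfs-root argument are hypothetical.

```shell
# Print the live value of a kernel module parameter.
# Fails (returns 1) if the module is not loaded or the parameter is not exported.
# Usage: module_param MODULE PARAM [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys (the override is an illustrative assumption).
module_param() {
    file="${3:-/sys}/module/$1/parameters/$2"
    [ -r "$file" ] || return 1
    cat "$file"
}
```

`module_param thunderbolt start_icm` and `module_param vfio_iommu_type1 disable_hugepages` print `Y` or `N` for these boolean parameters. One possible explanation for an unexpected `N`, stated here only as an assumption, is that the module was loaded from an initramfs that does not contain the modprobe.d file.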
============

Final note: Without the extra command line tokens, namely pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=256M,hpmmioprefsize=16G, the NVidia eGPU won't work at all, on either 5.12.x or 5.13.x. Way more details about that are here:
     https://egpu.io/forums/postid/90608/
     https://bbs.archlinux.org/viewtopic.php?id=261303
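
When comparing boots across kernel versions, it can help to confirm that the running kernel really received those tokens. The sketch below reports any expected token that is missing from a command-line file such as /proc/cmdline; the function name `missing_cmdline_tokens` is hypothetical, and tokens are matched as whole space-separated words.

```shell
# Print each expected token that does NOT appear in the given command-line file.
# Prints nothing if all tokens are present.
# Usage: missing_cmdline_tokens FILE TOKEN...
missing_cmdline_tokens() {
    file=$1
    shift
    # Pad with spaces so every token can be matched as a whole word.
    cmdline=" $(cat "$file") "
    for token in "$@"; do
        case "$cmdline" in
            *" $token "*) ;;
            *) printf '%s\n' "$token" ;;
        esac
    done
}
```

For this machine, a call such as `missing_cmdline_tokens /proc/cmdline pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=256M,hpmmioprefsize=16G` should print nothing on a correctly configured boot.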

What should I try next to debug the issue and, importantly, to keep my VMs working on 5.13.x and beyond? Any ideas, tips, magic kernel command line tokens etc. would be very helpful.

Cheers!
Andrej
