* PCI resource allocation mismatch with BIOS
@ 2022-11-28 11:14 Mika Westerberg
  2022-11-28 20:39 ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread

From: Mika Westerberg @ 2022-11-28 11:14 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: linux-pci, Mika Westerberg

Hi Bjorn,

There is another PCI resource allocation issue with some Intel GPUs, which
probably applies to other similar devices as well. It is encountered in
data centers, where a reset (secondary bus reset) is triggered on a GPU if
a hang or similar is detected. Basically they do something like:

1. Unbind the graphics driver(s) through sysfs.
2. Remove the PCIe devices under the root port or the PCIe switch
   upstream port through sysfs (echo 1 > ../remove).
3. Trigger the reset through config space or use the sysfs reset attribute.
4. Rescan the root bus (echo 1 > /sys/bus/pci/rescan).

The expectation is that the devices come back the same way as before the
reset, but what actually happens is that the Linux PCI resource allocation
fails to allocate space for some of the resources; in this case the IOV
BARs.

The BIOS allocates resources for all of these at boot time. After the
rescan, Linux tries to re-allocate them, but since its allocation
algorithm is more "consuming", some of the BARs no longer fit into the
available resource space.

Here is an example.
The devices involved are:

  53:00.0 GPU with IOV BARs
  52:01.0 PCIe switch downstream port

PF = Physical Function
VF = Virtual Function

BIOS allocation (dmesg)
-----------------------
pci 0000:52:01.0: scanning [bus 53-54] behind bridge, pass 0
pci 0000:53:00.0: [8086:56c0] type 00 class 0x038000
pci 0000:53:00.0: reg 0x10: [mem 0x205e1f000000-0x205e1fffffff 64bit pref]
pci 0000:53:00.0: reg 0x18: [mem 0x201c00000000-0x201fffffffff 64bit pref]
pci 0000:53:00.0: reg 0x30: [mem 0xffe00000-0xffffffff pref]
pci 0000:53:00.0: reg 0x344: [mem 0x205e00000000-0x205e00ffffff 64bit pref]
pci 0000:53:00.0: VF(n) BAR0 space: [mem 0x205e00000000-0x205e1effffff 64bit pref] (contains BAR0 for 31 VFs)
pci 0000:53:00.0: reg 0x34c: [mem 0x202000000000-0x2021ffffffff 64bit pref]
pci 0000:53:00.0: VF(n) BAR2 space: [mem 0x202000000000-0x205dffffffff 64bit pref] (contains BAR2 for 31 VFs)
pci 0000:52:01.0: PCI bridge to [bus 53-54]
pci 0000:52:01.0: bridge window [mem 0x201c00000000-0x205e1fffffff 64bit pref]

GPU
~~~
0x201c00000000-0x201fffffffff  PF BAR2  16384M
0x202000000000-0x205dffffffff  VF BAR2  253952M (31 * 8G)
0x205e00000000-0x205e1effffff  VF BAR0  496M (31 * 16M)
0x205e1f000000-0x205e1fffffff  PF BAR0  16M
                                        270848M

PCIe downstream port
~~~~~~~~~~~~~~~~~~~~
0x201c00000000-0x205e1fffffff  270848M

Linux allocation (dmesg)
------------------------
pci 0000:52:01.0: [8086:4fa4] type 01 class 0x060400
pci_bus 0000:52: fixups for bus
pci 0000:51:00.0: PCI bridge to [bus 52-54]
pci 0000:51:00.0: bridge window [io 0x0000-0x0fff]
pci 0000:51:00.0: bridge window [mem 0x00000000-0x000fffff]
pci 0000:51:00.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
pci 0000:52:01.0: scanning [bus 00-00] behind bridge, pass 0
pci 0000:52:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:52:01.0: scanning [bus 00-00] behind bridge, pass 1
pci_bus 0000:53: scanning bus
pci 0000:53:00.0: [8086:56c0] type 00 class 0x038000
pci 0000:53:00.0: reg 0x10: [mem 0x00000000-0x00ffffff 64bit pref]
pci 0000:53:00.0: reg 0x18: [mem 0x00000000-0x3ffffffff 64bit pref]
pci 0000:53:00.0: reg 0x30: [mem 0x00000000-0x001fffff pref]
pci 0000:53:00.0: reg 0x344: [mem 0x00000000-0x00ffffff 64bit pref]
pci 0000:53:00.0: VF(n) BAR0 space: [mem 0x00000000-0x1effffff 64bit pref] (contains BAR0 for 31 VFs)
pci 0000:53:00.0: reg 0x34c: [mem 0x00000000-0x1ffffffff 64bit pref]
pci 0000:53:00.0: VF(n) BAR2 space: [mem 0x00000000-0x3dffffffff 64bit pref] (contains BAR2 for 31 VFs)
pci_bus 0000:53: fixups for bus
pci 0000:52:01.0: PCI bridge to [bus 53-54]
pci 0000:52:01.0: bridge window [io 0x0000-0x0fff]
pci 0000:52:01.0: bridge window [mem 0x00000000-0x000fffff]
pci 0000:52:01.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
pci 0000:52:01.0: bridge window [mem 0x200000000-0x7ffffffff 64bit pref] to [bus 53] add_size 3e00000000 add_align 200000000
pci 0000:51:00.0: bridge window [mem 0x200000000-0x7ffffffff 64bit pref] to [bus 52-53] add_size 3e00000000 add_align 200000000
pcieport 0000:50:02.0: BAR 13: assigned [io 0x8000-0x8fff]
pci 0000:51:00.0: BAR 15: no space for [mem size 0x4400000000 64bit pref]
pci 0000:51:00.0: BAR 15: failed to assign [mem size 0x4400000000 64bit pref]
pci 0000:51:00.0: BAR 0: assigned [mem 0x201c00000000-0x201c007fffff 64bit pref]
pci 0000:51:00.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
pci 0000:51:00.0: BAR 13: assigned [io 0x8000-0x8fff]
pci 0000:51:00.0: BAR 15: assigned [mem 0x201c00000000-0x2021ffffffff 64bit pref]
pci 0000:51:00.0: BAR 0: assigned [mem 0x202200000000-0x2022007fffff 64bit pref]
pci 0000:51:00.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
pci 0000:51:00.0: BAR 15: [mem 0x201c00000000-0x2021ffffffff 64bit pref] (failed to expand by 0x3e00000000)
pci 0000:51:00.0: failed to add 3e00000000 res[15]=[mem 0x201c00000000-0x2021ffffffff 64bit pref]
pci 0000:52:01.0: BAR 15: no space for [mem size 0x4400000000 64bit pref]
pci 0000:52:01.0: BAR 15: failed to assign [mem size 0x4400000000 64bit pref]
pci 0000:52:01.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
pci 0000:52:01.0: BAR 13: assigned [io 0x8000-0x8fff]
pci 0000:52:01.0: BAR 15: assigned [mem 0x201c00000000-0x2021ffffffff 64bit pref]
pci 0000:52:01.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
pci 0000:52:01.0: BAR 15: [mem 0x201c00000000-0x2021ffffffff 64bit pref] (failed to expand by 0x3e00000000)
pci 0000:52:01.0: failed to add 3e00000000 res[15]=[mem 0x201c00000000-0x2021ffffffff 64bit pref]
pci 0000:53:00.0: BAR 2: assigned [mem 0x201c00000000-0x201fffffffff 64bit pref]
pci 0000:53:00.0: BAR 9: no space for [mem size 0x3e00000000 64bit pref]
pci 0000:53:00.0: BAR 9: failed to assign [mem size 0x3e00000000 64bit pref]
pci 0000:53:00.0: BAR 0: assigned [mem 0x202000000000-0x202000ffffff 64bit pref]
pci 0000:53:00.0: BAR 7: assigned [mem 0x202001000000-0x20201fffffff 64bit pref]
pci 0000:53:00.0: BAR 6: assigned [mem 0xbb800000-0xbb9fffff pref]
pci 0000:53:00.0: BAR 2: assigned [mem 0x201c00000000-0x201fffffffff 64bit pref]
pci 0000:53:00.0: BAR 0: assigned [mem 0x202000000000-0x202000ffffff 64bit pref]
pci 0000:53:00.0: BAR 6: assigned [mem 0xbb800000-0xbb9fffff pref]
pci 0000:53:00.0: BAR 9: no space for [mem size 0x3e00000000 64bit pref]
pci 0000:53:00.0: BAR 9: failed to assign [mem size 0x3e00000000 64bit pref]
pci 0000:53:00.0: BAR 7: assigned [mem 0x202001000000-0x20201fffffff 64bit pref]
pci 0000:52:01.0: PCI bridge to [bus 53]
pci 0000:52:01.0: bridge window [io 0x8000-0x8fff]
pci 0000:52:01.0: bridge window [mem 0xbb800000-0xbb9fffff]
pci 0000:52:01.0: bridge window [mem 0x201c00000000-0x2021ffffffff 64bit pref]

GPU
~~~
0x201c00000000-0x201fffffffff  PF BAR2  16384M
0x202000000000-0x202000ffffff  PF BAR0  16M
0x202001000000-0x20201fffffff  VF BAR0  496M (31 * 16M)
FAIL                           VF BAR2  253952M (31 * 8G)

PCIe downstream port
~~~~~~~~~~~~~~~~~~~~
0x201c00000000-0x2021ffffffff  24576M

Now, if I hack the allocation algorithm (in pbus_size_mem()) to "mimic"
the BIOS allocation, these fit fine.
However, if the BIOS allocation ever changes, we may end up with a similar
issue. Also, the Linux PCI resource allocation code has been like this for
aeons, so changing it would likely cause regressions.

Let me know if more information is needed. I have one of these cards
locally and have remote access to a similar system where the above example
was taken, so I can run additional testing.

Also let me know if you want me to file a bug in the kernel.org bugzilla.

Thanks in advance!

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: PCI resource allocation mismatch with BIOS
  2022-11-28 11:14 PCI resource allocation mismatch with BIOS Mika Westerberg
@ 2022-11-28 20:39 ` Bjorn Helgaas
  2022-11-28 22:06   ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread

From: Bjorn Helgaas @ 2022-11-28 20:39 UTC (permalink / raw)
To: Mika Westerberg; +Cc: Bjorn Helgaas, linux-pci, Alex Williamson

[+cc Alex]

Hi Mika,

On Mon, Nov 28, 2022 at 01:14:14PM +0200, Mika Westerberg wrote:
> Hi Bjorn,
>
> There is another PCI resource allocation issue with some Intel GPUs,
> which probably applies to other similar devices as well. It is
> encountered in data centers, where a reset (secondary bus reset) is
> triggered on a GPU if a hang or similar is detected. Basically they do
> something like:
>
> 1. Unbind the graphics driver(s) through sysfs.
> 2. Remove the PCIe devices under the root port or the PCIe switch
>    upstream port through sysfs (echo 1 > ../remove).
> 3. Trigger the reset through config space or use the sysfs reset attribute.
> 4. Rescan the root bus (echo 1 > /sys/bus/pci/rescan).
>
> The expectation is that the devices come back the same way as before
> the reset, but what actually happens is that the Linux PCI resource
> allocation fails to allocate space for some of the resources; in this
> case the IOV BARs.
>
> The BIOS allocates resources for all of these at boot time. After the
> rescan, Linux tries to re-allocate them, but since its allocation
> algorithm is more "consuming", some of the BARs no longer fit into the
> available resource space.

Thanks for the report!  Definitely sounds like an issue.  I doubt that
I'll have time to work on it myself in the near future.

Is the "remove" before the reset actually necessary?  If we could
avoid the removal, maybe the config space save/restore we already do
around reset would avoid the issue?

Bjorn
> [...]
* Re: PCI resource allocation mismatch with BIOS
  2022-11-28 20:39 ` Bjorn Helgaas
@ 2022-11-28 22:06   ` Alex Williamson
  2022-11-29  6:48     ` Lukas Wunner
  0 siblings, 1 reply; 15+ messages in thread

From: Alex Williamson @ 2022-11-28 22:06 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Mika Westerberg, Bjorn Helgaas, linux-pci

On Mon, 28 Nov 2022 14:39:32 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Alex]
>
> Hi Mika,
>
> On Mon, Nov 28, 2022 at 01:14:14PM +0200, Mika Westerberg wrote:
> > [...]
>
> Thanks for the report!  Definitely sounds like an issue.  I doubt that
> I'll have time to work on it myself in the near future.
>
> Is the "remove" before the reset actually necessary?  If we could
> avoid the removal, maybe the config space save/restore we already do
> around reset would avoid the issue?

Agreed.
Is this convoluted removal process being used to force an SBR, versus an
FLR or PM reset that might otherwise be used by twiddling the reset
attribute of the GPU directly?  If so, the reset_method attribute can be
used to force a bus reset and perform all the state save/restore handling
to avoid reallocating BARs.  A reset from the upstream switch port would
only be necessary if you have some reason to also reset the switch
downstream ports.  Thanks,

Alex
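For reference, the reset_method mechanism described above is driven through per-device sysfs attributes. A minimal, hedged sketch of what forcing a bus reset that way could look like; the device address is a hypothetical example, and the function simply skips devices that are absent or lack a writable reset_method:

```shell
# Hedged sketch: restrict a device to the secondary bus reset method via
# sysfs, then trigger the reset. Linux saves and restores config space
# around the reset, so BARs are not reallocated.
pci_bus_reset() {
    dev=/sys/bus/pci/devices/$1
    if [ -w "$dev/reset_method" ]; then
        echo bus > "$dev/reset_method"   # allow only the secondary bus reset
        echo 1 > "$dev/reset"            # perform the reset
    else
        echo "$1: no writable reset_method, skipping"
    fi
}

pci_bus_reset 0000:53:00.0   # hypothetical GPU address
```

Both attributes typically require root; listing the currently supported methods is just `cat reset_method`.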
* Re: PCI resource allocation mismatch with BIOS
  2022-11-28 22:06   ` Alex Williamson
@ 2022-11-29  6:48     ` Lukas Wunner
  2022-11-29 10:09       ` Mika Westerberg
  2022-11-29 13:52       ` Alex Williamson
  0 siblings, 2 replies; 15+ messages in thread

From: Lukas Wunner @ 2022-11-29  6:48 UTC (permalink / raw)
To: Alex Williamson; +Cc: Bjorn Helgaas, Mika Westerberg, Bjorn Helgaas, linux-pci

On Mon, Nov 28, 2022 at 03:06:17PM -0700, Alex Williamson wrote:
> Agreed.  Is this convoluted removal process being used to force an SBR,
> versus an FLR or PM reset that might otherwise be used by twiddling the
> reset attribute of the GPU directly?  If so, the reset_method attribute
> can be used to force a bus reset and perform all the state save/restore
> handling to avoid reallocating BARs.  A reset from the upstream switch
> port would only be necessary if you have some reason to also reset the
> switch downstream ports.  Thanks,

A Secondary Bus Reset is only offered as a reset_method if the device to
be reset is the *only* child of the upstream bridge.  I.e., if the device
to be reset has siblings or children, a Secondary Bus Reset is not
permitted.

Modern GPUs (including the one Mika is referring to) consist of a PCIe
switch with the GPU, HD audio and telemetry devices below Downstream
Bridges.  A Secondary Bus Reset of the Root Port is not allowed in this
case because the Switch Upstream Port has children.

See this code in pci_parent_bus_reset():

	if (pci_is_root_bus(dev->bus) || dev->subordinate ||
	    !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
		return -ENOTTY;

The dev->subordinate check disallows an SBR if there are children.
Note that the code should probably instead check for...

	(dev->subordinate && !list_empty(&dev->subordinate->devices))

...because the port may have a subordinate bus without children (they may
have been removed, for example).
The "no siblings" rule is enforced by:

	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
		if (pdev != dev)
			return -ENOTTY;

Note that the devices list is iterated without holding pci_bus_sem, which
looks fishy.

That said, it *is* possible that a Secondary Bus Reset is erroneously
offered despite these checks, because we perform them early during device
enumeration, when the subordinate bus hasn't been scanned yet.

So if the Root Port offers other reset methods besides SBR and the user
switches to one of them, then reinstates the defaults, suddenly SBR will
disappear, because the subordinate bus has since been scanned.  What's
missing here is re-checking the availability of the reset methods on
siblings and the parent when a device is added or removed.  This is also
necessary to make reset_method work properly with hotplug.  However, the
result may be that the reset_method attribute in sysfs becomes invisible
after adding a device (because there is no reset method available) and
reappears after removing a device.

So the reset_method logic is pretty broken right now, I'm afraid.

In any case, for Mika's use case it would be useful to have a
"reset_subordinate" attribute on ports capable of an SBR such that the
entire hierarchy below is reset.  The "reset" attribute is insufficient.

Thanks,

Lukas
* Re: PCI resource allocation mismatch with BIOS
  2022-11-29  6:48     ` Lukas Wunner
@ 2022-11-29 10:09       ` Mika Westerberg
  2022-11-29 13:52       ` Alex Williamson
  1 sibling, 0 replies; 15+ messages in thread

From: Mika Westerberg @ 2022-11-29 10:09 UTC (permalink / raw)
To: Lukas Wunner; +Cc: Alex Williamson, Bjorn Helgaas, Bjorn Helgaas, linux-pci

Hi,

On Tue, Nov 29, 2022 at 07:48:12AM +0100, Lukas Wunner wrote:
> On Mon, Nov 28, 2022 at 03:06:17PM -0700, Alex Williamson wrote:
> > [...]
>
> A Secondary Bus Reset is only offered as a reset_method if the
> device to be reset is the *only* child of the upstream bridge.
> I.e., if the device to be reset has siblings or children,
> a Secondary Bus Reset is not permitted.
>
> Modern GPUs (including the one Mika is referring to) consist of
> a PCIe switch with the GPU, HD audio and telemetry devices below
> Downstream Bridges.  A Secondary Bus Reset of the Root Port is
> not allowed in this case because the Switch Upstream Port has
> children.
>
> See this code in pci_parent_bus_reset():
>
> 	if (pci_is_root_bus(dev->bus) || dev->subordinate ||
> 	    !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
> 		return -ENOTTY;
>
> The dev->subordinate check disallows an SBR if there are children.
> Note that the code should probably instead check for...
>
> 	(dev->subordinate && !list_empty(&dev->subordinate->devices))
>
> ...because the port may have a subordinate bus without children
> (they may have been removed, for example).
> The "no siblings" rule is enforced by:
>
> 	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> 		if (pdev != dev)
> 			return -ENOTTY;
>
> Note that the devices list is iterated without holding pci_bus_sem,
> which looks fishy.
>
> That said, it *is* possible that a Secondary Bus Reset is erroneously
> offered despite these checks, because we perform them early during
> device enumeration, when the subordinate bus hasn't been scanned yet.

Thanks Lukas for the good explanation :)

I think because of the above, our GPU folks do the Secondary Bus Reset
directly by poking the config space through setpci or so, and I guess
this is why they need the "remove" phase as well.

> So if the Root Port offers other reset methods besides SBR and the
> user switches to one of them, then reinstates the defaults, suddenly
> SBR will disappear, because the subordinate bus has since been
> scanned.  What's missing here is re-checking the availability of the
> reset methods on siblings and the parent when a device is added or
> removed.  This is also necessary to make reset_method work properly
> with hotplug.  However, the result may be that the reset_method
> attribute in sysfs becomes invisible after adding a device (because
> there is no reset method available) and reappears after removing a
> device.
>
> So the reset_method logic is pretty broken right now, I'm afraid.
>
> In any case, for Mika's use case it would be useful to have a
> "reset_subordinate" attribute on ports capable of an SBR such that
> the entire hierarchy below is reset.  The "reset" attribute is
> insufficient.

Yes, that would be useful indeed.
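For the record, the setpci approach mentioned above works by toggling the Secondary Bus Reset bit (bit 6, mask 0x40) of the bridge's Bridge Control register. A hedged sketch, written as a dry run that only prints the commands it would execute so it is safe to run anywhere; the bridge address and register value are hypothetical, and real use must respect the PCIe spec's reset hold and settle times:

```shell
# Dry-run sketch of a secondary bus reset via setpci. Bit 6 (0x40) of
# BRIDGE_CONTROL is Secondary Bus Reset. The current register value would
# normally come from: setpci -s "$br" BRIDGE_CONTROL
sbr_commands() {
    br="$1"    # bridge above the device(s) to reset (hypothetical address)
    val="$2"   # current BRIDGE_CONTROL value, hex without 0x prefix
    new=$(printf '%04x' $(( 0x$val | 0x40 )))   # value with the SBR bit set
    echo "setpci -s $br BRIDGE_CONTROL=$new"    # assert the reset
    echo "sleep 0.002"                          # hold the reset briefly
    echo "setpci -s $br BRIDGE_CONTROL=$val"    # deassert; then wait before
                                                # touching config space again
}

sbr_commands 0000:52:01.0 0003
```

Note this raw recipe bypasses the kernel's state save/restore entirely, which is exactly why the subsequent remove/rescan dance becomes necessary.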
* Re: PCI resource allocation mismatch with BIOS
  2022-11-29  6:48     ` Lukas Wunner
  2022-11-29 10:09       ` Mika Westerberg
@ 2022-11-29 13:52       ` Alex Williamson
  2022-11-29 15:07         ` Mika Westerberg
  1 sibling, 1 reply; 15+ messages in thread

From: Alex Williamson @ 2022-11-29 13:52 UTC (permalink / raw)
To: Lukas Wunner; +Cc: Bjorn Helgaas, Mika Westerberg, Bjorn Helgaas, linux-pci

On Tue, 29 Nov 2022 07:48:12 +0100
Lukas Wunner <lukas@wunner.de> wrote:

> On Mon, Nov 28, 2022 at 03:06:17PM -0700, Alex Williamson wrote:
> > [...]
>
> A Secondary Bus Reset is only offered as a reset_method if the
> device to be reset is the *only* child of the upstream bridge.
> I.e., if the device to be reset has siblings or children,
> a Secondary Bus Reset is not permitted.
>
> Modern GPUs (including the one Mika is referring to) consist of
> a PCIe switch with the GPU, HD audio and telemetry devices below
> Downstream Bridges.  A Secondary Bus Reset of the Root Port is
> not allowed in this case because the Switch Upstream Port has
> children.

I didn't see such functions in the log provided; the GPU in question
seems to be a single-function device at 53:00.0.  This matches what I've
seen on an Arc A380 GPU, where the GPU and HD audio are each
single-function devices under separate downstream ports of a PCIe switch.
> See this code in pci_parent_bus_reset():
>
> 	if (pci_is_root_bus(dev->bus) || dev->subordinate ||
> 	    !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
> 		return -ENOTTY;
>
> [...]
>
> So the reset_method logic is pretty broken right now, I'm afraid.

I haven't checked for a while, but I thought we exposed SBR regardless of
siblings, though it can't be accessed via the reset attribute if there
are siblings.  That allows that the sibling devices could be
soft-removed, a reset performed, and the bus re-scanned.
If there are in fact sibling devices, it would make more sense to remove
only those to effect a bus reset, avoiding the resource issues with
rescanning SR-IOV on the GPU.

> In any case, for Mika's use case it would be useful to have a
> "reset_subordinate" attribute on ports capable of an SBR such that
> the entire hierarchy below is reset.  The "reset" attribute is
> insufficient.

I'll toss out that a pretty simple vfio tool can be written to bind all
the siblings on a bus, enabling the hot-reset ioctl in vfio.  Thanks,

Alex
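For what it's worth, the "bind all the siblings" step of such a tool is usually done through the driver_override sysfs interface before userspace issues VFIO's hot-reset ioctl (VFIO_DEVICE_PCI_HOT_RESET). A hedged sketch of just the rebinding part; the device addresses are hypothetical examples, and devices that are not present are skipped:

```shell
# Hedged sketch: rebind sibling devices to vfio-pci so that a VFIO-based
# tool can later issue the hot-reset ioctl against the whole bus.
bind_to_vfio() {
    for bdf in "$@"; do
        sys=/sys/bus/pci/devices/$bdf
        if [ ! -e "$sys" ]; then
            echo "$bdf: not present, skipping"
            continue
        fi
        echo vfio-pci > "$sys/driver_override"    # make vfio-pci claim it
        if [ -e "$sys/driver" ]; then
            echo "$bdf" > "$sys/driver/unbind"    # detach current driver
        fi
        echo "$bdf" > /sys/bus/pci/drivers_probe  # rebind, now to vfio-pci
    done
}

# Hypothetical sibling devices under the same switch:
bind_to_vfio 0000:53:00.0 0000:54:00.0
```

The actual hot-reset ioctl still needs a small C program against the VFIO API; this only covers the binding prerequisite.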
* Re: PCI resource allocation mismatch with BIOS
  2022-11-29 13:52       ` Alex Williamson
@ 2022-11-29 15:07         ` Mika Westerberg
  2022-11-29 15:46           ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread

From: Mika Westerberg @ 2022-11-29 15:07 UTC (permalink / raw)
To: Alex Williamson; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci

Hi,

On Tue, Nov 29, 2022 at 06:52:42AM -0700, Alex Williamson wrote:
> On Tue, 29 Nov 2022 07:48:12 +0100
> Lukas Wunner <lukas@wunner.de> wrote:
>
> > [...]
> >
> > Modern GPUs (including the one Mika is referring to) consist of
> > a PCIe switch with the GPU, HD audio and telemetry devices below
> > Downstream Bridges.  A Secondary Bus Reset of the Root Port is
> > not allowed in this case because the Switch Upstream Port has
> > children.
>
> I didn't see such functions in the log provided; the GPU in question
> seems to be a single-function device at 53:00.0.  This matches what
> I've seen on an Arc A380 GPU, where the GPU and HD audio are each
> single-function devices under separate downstream ports of a PCIe
> switch.

Yes, this one is similar. There is a PCIe switch and then a bunch of
devices connected to the downstream ports.
One of them being the GPU. Sorry if I missed that part in the report. There are typically multiple of these cards and they want to perform the reset separately for each. > > See this code in pci_parent_bus_reset(): > > > > if (pci_is_root_bus(dev->bus) || dev->subordinate || > > !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET) > > return -ENOTTY; > > > > The dev->subordinate check disallows a SBR if there are children. > > Note that the code should probably instead check for... > > (dev->subordinate && !list_empty(dev->subordinate->devices)) > > ...because the port may have a subordinate bus without children > > (may have been removed for example). > > > > The "no siblings" rule is enforced by: > > > > list_for_each_entry(pdev, &dev->bus->devices, bus_list) > > if (pdev != dev) > > return -ENOTTY; > > > > Note that the devices list is iterated without holding pci_bus_sem, > > which looks fishy. > > > > That said, it *is* possible that a Secondary Bus Reset is erroneously > > offered despite these checks because we perform them early on device > > enumeration when the subordinate bus hasn't been scanned yet. > > > > So if the Root Port offers other reset methods besides SBR and the > > user switches to one of them, then reinstates the defaults, > > suddenly SBR will disappear because the subordinate bus has since > > been scanned. What's missing here is that we re-check availability > > of the reset methods on siblings and the parent when a device is > > added or removed. This is also necessary to make reset_method > > work properly with hotplug. However, the result may be that the > > reset_method attribute in sysfs may become invisible after adding > > a device (because there is no reset method available) and reappear > > after removing a device. > > > > So the reset_method logic is pretty broken right now I'm afraid.
> > I haven't checked for a while, but I thought we exposed SBR regardless > of siblings, though it can't be accessed via the reset attribute if > there are siblings. That allows that the sibling devices could be soft > removed, a reset performed, and the bus re-scanned. If there are in > fact sibling devices, it would make more sense to remove only those to > effect a bus reset to avoid the resource issues with rescanning SR-IOV > on the GPU. If I understand correctly they perform the reset just above the upstream port of the PCIe switch so that it resets the whole "card". > > > In any case, for Mika's use case it would be useful to have a > > "reset_subordinate" attribute on ports capable of a SBR such that > > the entire hierarchy below is reset. The "reset" attribute is > > insufficient. > > I'll toss out that a pretty simple vfio tool can be written to bind all > the siblings on a bus enabling the hot reset ioctl in vfio. Thanks, Sounds good but unfortunately I'm not fluent in vfio so I have no idea how this simple tool could be done :( Do you have any examples or pointers that we could use to try this out? Thanks! ^ permalink raw reply [flat|nested] 15+ messages in thread
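The soft-remove / reset / rescan sequence Alex describes can be sketched in shell. This is an illustrative sketch only (requires root and real hardware): 0000:53:00.0 is the GPU from the log, but the sibling BDF is an assumption, and the existence guards are there because the paths are hypothetical.

```shell
#!/bin/sh
# Sketch: soft-remove the sibling so a bus reset of the GPU becomes
# permitted, reset it, then rescan to bring the siblings back.
GPU=0000:53:00.0
SIBLING=0000:53:00.1      # assumed sibling, not taken from the log
SYSFS=/sys/bus/pci

if [ -e "$SYSFS/devices/$SIBLING" ]; then
    echo 1 > "$SYSFS/devices/$SIBLING/remove"   # soft-remove the sibling
fi
if [ -e "$SYSFS/devices/$GPU/reset" ]; then
    echo 1 > "$SYSFS/devices/$GPU/reset"        # SBR now offered (no siblings left)
fi
echo 1 > "$SYSFS/rescan" 2>/dev/null || true    # re-enumerate removed devices
```

Note this still relies on the rescan finding enough resource space, which is exactly where the thread's allocation mismatch bites.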
* Re: PCI resource allocation mismatch with BIOS 2022-11-29 15:07 ` Mika Westerberg @ 2022-11-29 15:46 ` Alex Williamson 2022-11-29 16:06 ` Lukas Wunner 0 siblings, 1 reply; 15+ messages in thread From: Alex Williamson @ 2022-11-29 15:46 UTC (permalink / raw) To: Mika Westerberg; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci On Tue, 29 Nov 2022 17:07:21 +0200 Mika Westerberg <mika.westerberg@linux.intel.com> wrote: > Hi, > > On Tue, Nov 29, 2022 at 06:52:42AM -0700, Alex Williamson wrote: > > On Tue, 29 Nov 2022 07:48:12 +0100 > > Lukas Wunner <lukas@wunner.de> wrote: > > > > > On Mon, Nov 28, 2022 at 03:06:17PM -0700, Alex Williamson wrote: > > > > Agreed. Is this convoluted removal process being used to force a SBR, > > > > versus a FLR or PM reset that might otherwise be used by twiddling the > > > > reset attribute of the GPU directly? If so, the reset_method attribute > > > > can be used to force a bus reset and perform all the state save/restore > > > > handling to avoid reallocating BARs. A reset from the upstream switch > > > > port would only be necessary if you have some reason to also reset the > > > > switch downstream ports. Thanks, > > > > > > A Secondary Bus Reset is only offered as a reset_method if the > > > device to be reset is the *only* child of the upstream bridge. > > > I.e. if the device to be reset has siblings or children, > > > a Secondary Bus Reset is not permitted. > > > > > > Modern GPUs (including the one Mika is referring to) consist of > > > a PCIe switch with the GPU, HD audio and telemetry devices below > > > Downstream Bridges. A Secondary Bus Reset of the Root Port is > > > not allowed in this case because the Switch Upstream Port has > > > children. > > > > I didn't see such functions in the log provided, the GPU in question > > seems to be a single function device at 53:00.0. 
This matches what > > I've seen on an ARC A380 GPU where the GPU and HD audio are each single > > function devices under separate downstream ports of a PCIe switch. > > Yes, this one is similar. There is a PCIe switch and then bunch of > devices connected to the downstream ports. One of them being the GPU. > > Sorry if I missed that part in the report. > > There are typically multiple of these cards and they want to perform the > reset seperately for each. Maybe the elephant in the room is why it's apparently such common practice to need to perform a hard reset of these devices outside of virtualization scenarios... > > > See this code in pci_parent_bus_reset(): > > > > > > if (pci_is_root_bus(dev->bus) || dev->subordinate || > > > !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET) > > > return -ENOTTY; > > > > > > The dev->subordinate check disallows a SBR if there are children. > > > Note that the code should probably instead check for... > > > (dev->subordinate && !list_empty(dev->subordinate->devices)) > > > ...because the port may have a subordinate bus without children > > > (may have been removed for example). > > > > > > The "no siblings" rule is enforced by: > > > > > > list_for_each_entry(pdev, &dev->bus->devices, bus_list) > > > if (pdev != dev) > > > return -ENOTTY; > > > > > > Note that the devices list is iterated without holding pci_bus_sem, > > > which looks fishy. > > > > > > That said, it *is* possible that a Secondary Bus Reset is erroneously > > > offered despite these checks because we perform them early on device > > > enumeration when the subordinate bus hasn't been scanned yet. > > > > > > So if the Root Port offers other reset methods besides SBR and the > > > user switches to one of them, then reinstates the defaults, > > > suddenly SBR will disappear because the subordinate bus has since > > > been scanned.
What's missing here is that we re-check availability > > > of the reset methods on siblings and the parent when a device is > > > added or removed. This is also necessary to make reset_method > > > work properly with hotplug. However, the result may be that the > > > reset_method attribute in sysfs may become invisible after adding > > > a device (because there is no reset method available) and reappear > > > after removing a device. > > > > > > So the reset_method logic is pretty broken right now I'm afraid. > > > > I haven't checked for a while, but I thought we exposed SBR regardless > > of siblings, though it can't be accessed via the reset attribute if > > there are siblings. That allows that the sibling devices could be soft > > removed, a reset performed, and the bus re-scanned. If there are in > > fact sibling devices, it would make more sense to remove only those to > > effect a bus reset to avoid the resource issues with rescanning SR-IOV > > on the GPU. > > If I understand correctly they perform the reset just above the upstream > port of the PCIe switch so that it resets the whole "card". ... and even to the extent that an entire PCIe hierarchy is being reset rather than individual devices, let alone leaf buses. > > > In any case, for Mika's use case it would be useful to have a > > > "reset_subordinate" attribute on ports capable of a SBR such that > > > the entire hierarchy below is reset. The "reset" attribute is > > > insufficient. > > > > I'll toss out that a pretty simple vfio tool can be written to bind all > > the siblings on a bus enabling the hot reset ioctl in vfio. Thanks, > > Sounds good but unfortunately I'm not fluent in vfio so I have no idea > how this simple tool could be done :( Do you have any examples or > pointers that we could use to try this out? Unfortunately, if your goal is really to reset the whole switch, vfio isn't going to help with that. 
The vfio-pci driver only binds to endpoints and a hot reset can be effected for a given endpoint, in the presence of sibling devices which are also owned by the same user. There's a rudimentary unit test below which implements this.

I think the reason we don't have a "reset_subordinate" attribute is that there are a number of difficult issues in locking down an entire hierarchy that go beyond what vfio does in validating the user's proof of ownership for affected endpoints on a leaf bus. Thanks,

Alex

#include <errno.h>
#include <libgen.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/param.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/vfs.h>
#include <linux/ioctl.h>
#include <linux/vfio.h>

void usage(char *name)
{
	printf("usage: %s <iommu group id> <ssss:bb:dd.f>\n", name);
}

#define false 0
#define true 1

int main(int argc, char **argv)
{
	int i, j, ret, container, *pfd;
	char path[PATH_MAX];
	struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
	struct vfio_pci_hot_reset_info *reset_info;
	struct vfio_pci_dependent_device *devices;
	struct vfio_pci_hot_reset *reset;
	struct reset_dev {
		int groupid;
		int seg;
		int bus;
		int dev;
		int func;
		int fd;
		int group;
	} *reset_devs;

	if (argc < 3) {
		usage(argv[0]);
		return -1;
	}

	printf("Expect %d group/device pairs\n", (argc - 1)/2);

	reset_devs = calloc((argc - 1)/2, sizeof(struct reset_dev));
	if (!reset_devs)
		return -1;

	for (i = 0; i < (argc - 1)/2; i++) {
		ret = sscanf(argv[i*2 + 1], "%d", &reset_devs[i].groupid);
		if (ret != 1) {
			usage(argv[0]);
			return -1;
		}

		ret = sscanf(argv[i*2 + 2], "%04x:%02x:%02x.%d",
			     &reset_devs[i].seg, &reset_devs[i].bus,
			     &reset_devs[i].dev, &reset_devs[i].func);
		if (ret != 4) {
			usage(argv[0]);
			return -1;
		}

		printf("Using PCI device %04x:%02x:%02x.%d in group %d "
		       "for hot reset test\n", reset_devs[i].seg,
		       reset_devs[i].bus, reset_devs[i].dev,
		       reset_devs[i].func, reset_devs[i].groupid);
	}

	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0) {
		printf("Failed to open /dev/vfio/vfio, %d (%s)\n",
		       container, strerror(errno));
		return container;
	}

	for (i = 0; i < (argc - 1)/2; i++) {
		snprintf(path, sizeof(path), "/dev/vfio/%d", reset_devs[i].groupid);

		reset_devs[i].group = open(path, O_RDWR);
		if (reset_devs[i].group < 0) {
			printf("Failed to open %s, %d (%s)\n",
			       path, reset_devs[i].group, strerror(errno));
			return reset_devs[i].group;
		}

		ret = ioctl(reset_devs[i].group, VFIO_GROUP_GET_STATUS,
			    &group_status);
		if (ret) {
			printf("ioctl(VFIO_GROUP_GET_STATUS) failed\n");
			return ret;
		}

		if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
			printf("Group not viable, are all devices attached to vfio?\n");
			return -1;
		}

		ret = ioctl(reset_devs[i].group, VFIO_GROUP_SET_CONTAINER,
			    &container);
		if (ret) {
			printf("Failed to set group container\n");
			return ret;
		}

		if (i == 0) {
			ret = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
			if (ret) {
				printf("Failed to set IOMMU\n");
				return ret;
			}
		}

		snprintf(path, sizeof(path), "%04x:%02x:%02x.%d",
			 reset_devs[i].seg, reset_devs[i].bus,
			 reset_devs[i].dev, reset_devs[i].func);

		reset_devs[i].fd = ioctl(reset_devs[i].group,
					 VFIO_GROUP_GET_DEVICE_FD, path);
		if (reset_devs[i].fd < 0) {
			printf("Failed to get device %s\n", path);
			return -1;
		}
	}

	getchar();

	reset_info = malloc(sizeof(*reset_info));
	if (!reset_info) {
		printf("Failed to alloc info struct\n");
		return -ENOMEM;
	}

	reset_info->argsz = sizeof(*reset_info);

	ret = ioctl(reset_devs[0].fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO,
		    reset_info);
	if (ret && errno == ENODEV) {
		printf("Device does not support hot reset\n");
		return 0;
	}
	if (!ret || errno != ENOSPC) {
		printf("Expected fail/-ENOSPC, got %d/%d\n", ret, -errno);
		return -1;
	}

	printf("Dependent device count: %d\n", reset_info->count);

	reset_info = realloc(reset_info, sizeof(*reset_info) +
			     (reset_info->count * sizeof(*devices)));
	if (!reset_info) {
		printf("Failed to re-alloc info struct\n");
		return -ENOMEM;
	}

	reset_info->argsz = sizeof(*reset_info) +
			    (reset_info->count * sizeof(*devices));

	ret = ioctl(reset_devs[0].fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO,
		    reset_info);
	if (ret) {
		printf("Reset Info error\n");
		return ret;
	}

	devices = &reset_info->devices[0];

	for (i = 0; i < reset_info->count; i++)
		printf("%d: %04x:%02x:%02x.%d group %d\n", i,
		       devices[i].segment, devices[i].bus,
		       devices[i].devfn >> 3, devices[i].devfn & 7,
		       devices[i].group_id);

	printf("Press any key to perform reset\n");
	getchar();

	printf("Attempting reset: ");
	fflush(stdout);

	reset = malloc(sizeof(*reset) + (sizeof(*pfd) * reset_info->count));
	pfd = &reset->group_fds[0];

	for (i = 0; i < reset_info->count; i++) {
		for (j = 0; j < (argc - 1)/2; j++) {
			if (devices[i].group_id == reset_devs[j].groupid) {
				*pfd++ = reset_devs[j].group;
				break;
			}
		}
		if (j == (argc - 1)/2) {
			printf("Group %d not found\n", devices[i].group_id);
			return -1;
		}
	}

	reset->argsz = sizeof(*reset) + (sizeof(*pfd) * reset_info->count);
	reset->count = reset_info->count;
	reset->flags = 0;

	ret = ioctl(reset_devs[0].fd, VFIO_DEVICE_PCI_HOT_RESET, reset);

	printf("%s\n", ret ? "Failed" : "Pass");

	return ret;
}

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: PCI resource allocation mismatch with BIOS 2022-11-29 15:46 ` Alex Williamson @ 2022-11-29 16:06 ` Lukas Wunner 2022-11-29 16:12 ` Alex Williamson 0 siblings, 1 reply; 15+ messages in thread From: Lukas Wunner @ 2022-11-29 16:06 UTC (permalink / raw) To: Alex Williamson; +Cc: Mika Westerberg, Bjorn Helgaas, Bjorn Helgaas, linux-pci On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote: > Maybe the elephant in the room is why it's apparently such common > practice to need to perform a hard reset these devices outside of > virtualization scenarios... These GPUs are used as accelerators in cloud environments. They're reset to a pristine state when handed out to another tenant to avoid info leaks from the previous tenant. That should be a legitimate usage of PCIe reset, no? Thanks, Lukas ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: PCI resource allocation mismatch with BIOS 2022-11-29 16:06 ` Lukas Wunner @ 2022-11-29 16:12 ` Alex Williamson 2022-11-30 7:43 ` Lukas Wunner 0 siblings, 1 reply; 15+ messages in thread From: Alex Williamson @ 2022-11-29 16:12 UTC (permalink / raw) To: Lukas Wunner; +Cc: Mika Westerberg, Bjorn Helgaas, Bjorn Helgaas, linux-pci On Tue, 29 Nov 2022 17:06:26 +0100 Lukas Wunner <lukas@wunner.de> wrote: > On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote: > > Maybe the elephant in the room is why it's apparently such common > > practice to need to perform a hard reset these devices outside of > > virtualization scenarios... > > These GPUs are used as accelerators in cloud environments. > > They're reset to a pristine state when handed out to another tenant > to avoid info leaks from the previous tenant. > > That should be a legitimate usage of PCIe reset, no? Absolutely, but why the whole switch? Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: PCI resource allocation mismatch with BIOS 2022-11-29 16:12 ` Alex Williamson @ 2022-11-30 7:43 ` Lukas Wunner 2022-11-30 7:57 ` Mika Westerberg 0 siblings, 1 reply; 15+ messages in thread From: Lukas Wunner @ 2022-11-30 7:43 UTC (permalink / raw) To: Alex Williamson; +Cc: Mika Westerberg, Bjorn Helgaas, Bjorn Helgaas, linux-pci On Tue, Nov 29, 2022 at 09:12:49AM -0700, Alex Williamson wrote: > On Tue, 29 Nov 2022 17:06:26 +0100 Lukas Wunner <lukas@wunner.de> wrote: > > On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote: > > > Maybe the elephant in the room is why it's apparently such common > > > practice to need to perform a hard reset these devices outside of > > > virtualization scenarios... > > > > These GPUs are used as accelerators in cloud environments. > > > > They're reset to a pristine state when handed out to another tenant > > to avoid info leaks from the previous tenant. > > > > That should be a legitimate usage of PCIe reset, no? > > Absolutely, but why the whole switch? Thanks, The reset is propagated down the hierarchy, so by resetting the Switch Upstream Port, it is guaranteed that all endpoints are reset with just a single operation. Per PCIe r6.0.1 sec 6.6.1: "For a Switch, the following must cause a hot reset to be sent on all Downstream Ports: [...] Receiving a hot reset on the Upstream Port" Thanks, Lukas ^ permalink raw reply [flat|nested] 15+ messages in thread
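For illustration, the hot reset propagation Lukas quotes is what a manual Secondary Bus Reset relies on: setting bit 6 (0x0040) of the Bridge Control register on the port above the switch sends a hot reset down the whole hierarchy. A hedged setpci sketch follows; the bridge address is an assumption, the kernel does not save/restore config space around a reset done behind its back this way, and the REALLY_RESET latch is added here purely as a safety guard for the sketch.

```shell
#!/bin/sh
# Sketch: hot-reset everything below a bridge by toggling Secondary Bus
# Reset in Bridge Control (PCI-to-PCI bridge header, bit 6 = 0x0040).
BRIDGE=0000:51:00.0   # assumed: the port leading to the switch upstream port

if [ "${REALLY_RESET:-no}" = yes ] && [ -e "/sys/bus/pci/devices/$BRIDGE" ]; then
    setpci -s "$BRIDGE" BRIDGE_CONTROL=0040:0040   # assert hot reset downstream
    sleep 0.01                                     # hold reset (spec minimum 1 ms)
    setpci -s "$BRIDGE" BRIDGE_CONTROL=0000:0040   # deassert
    sleep 1                                        # let devices re-initialize
fi
```

This is effectively what the data-center flow does before the remove/rescan dance, which is why Linux then has to re-allocate the resources.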
* Re: PCI resource allocation mismatch with BIOS 2022-11-30 7:43 ` Lukas Wunner @ 2022-11-30 7:57 ` Mika Westerberg 2022-11-30 15:47 ` Alex Williamson 0 siblings, 1 reply; 15+ messages in thread From: Mika Westerberg @ 2022-11-30 7:57 UTC (permalink / raw) To: Lukas Wunner; +Cc: Alex Williamson, Bjorn Helgaas, Bjorn Helgaas, linux-pci Hi, On Wed, Nov 30, 2022 at 08:43:47AM +0100, Lukas Wunner wrote: > On Tue, Nov 29, 2022 at 09:12:49AM -0700, Alex Williamson wrote: > > On Tue, 29 Nov 2022 17:06:26 +0100 Lukas Wunner <lukas@wunner.de> wrote: > > > On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote: > > > > Maybe the elephant in the room is why it's apparently such common > > > > practice to need to perform a hard reset these devices outside of > > > > virtualization scenarios... > > > > > > These GPUs are used as accelerators in cloud environments. > > > > > > They're reset to a pristine state when handed out to another tenant > > > to avoid info leaks from the previous tenant. > > > > > > That should be a legitimate usage of PCIe reset, no? > > > > Absolutely, but why the whole switch? Thanks, > > The reset is propagated down the hierarchy, so by resetting the > Switch Upstream Port, it is guaranteed that all endpoints are > reset with just a single operation. Per PCIe r6.0.1 sec 6.6.1: > > "For a Switch, the following must cause a hot reset to be sent > on all Downstream Ports: > [...] > Receiving a hot reset on the Upstream Port" Adding here the reason I got from the GPU folks: In addition to the use case when the GPU is reset when switched to another tenant, this is used for recovery. The "first level" recovery is handled by the graphics driver that does Function Level Reset but if that does not work data centers may trigger reset at higher level (root port or switch upstream port) to reset the whole card. So called "big hammer". There is another use case too - firmware upgrade. 
This allows data centers to upgrade firmware on those cards without needing to reboot - they just reset the whole thing to make it run the new firmware. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: PCI resource allocation mismatch with BIOS 2022-11-30 7:57 ` Mika Westerberg @ 2022-11-30 15:47 ` Alex Williamson 2022-12-01 9:41 ` Mika Westerberg 0 siblings, 1 reply; 15+ messages in thread From: Alex Williamson @ 2022-11-30 15:47 UTC (permalink / raw) To: Mika Westerberg; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci On Wed, 30 Nov 2022 09:57:18 +0200 Mika Westerberg <mika.westerberg@linux.intel.com> wrote: > Hi, > > On Wed, Nov 30, 2022 at 08:43:47AM +0100, Lukas Wunner wrote: > > On Tue, Nov 29, 2022 at 09:12:49AM -0700, Alex Williamson wrote: > > > On Tue, 29 Nov 2022 17:06:26 +0100 Lukas Wunner <lukas@wunner.de> wrote: > > > > On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote: > > > > > Maybe the elephant in the room is why it's apparently such common > > > > > practice to need to perform a hard reset these devices outside of > > > > > virtualization scenarios... > > > > > > > > These GPUs are used as accelerators in cloud environments. > > > > > > > > They're reset to a pristine state when handed out to another tenant > > > > to avoid info leaks from the previous tenant. > > > > > > > > That should be a legitimate usage of PCIe reset, no? > > > > > > Absolutely, but why the whole switch? Thanks, > > > > The reset is propagated down the hierarchy, so by resetting the > > Switch Upstream Port, it is guaranteed that all endpoints are > > reset with just a single operation. Per PCIe r6.0.1 sec 6.6.1: > > > > "For a Switch, the following must cause a hot reset to be sent > > on all Downstream Ports: > > [...] > > Receiving a hot reset on the Upstream Port" This was never in doubt. > Adding here the reason I got from the GPU folks: > > In addition to the use case when the GPU is reset when switched to > another tenant, this is used for recovery. 
The "first level" recovery is > handled by the graphics driver that does Function Level Reset but if > that does not work data centers may trigger reset at higher level (root > port or switch upstream port) to reset the whole card. So called "big > hammer". Ok, so the first level issue can be solved by the reset_method attribute, allowing userspace to trigger a SBR for a signle function endpoint rather than an FLR (though this might suggest some hardware issues in the FLR implementation). This will save and restore the device config space, so there should be no resource reallocation issues. > There is another use case too - firmware upgrade. This allows data > centers to upgrade firmware on those cards without need to reboot - they > just reset the whole thing to make it run the new firmware. The above can also be used for this, so long as the firmware update only targets the GPU and device config space layout and features is not modified by the update. This touches on a long standing issue Bjorn has had with performing bus resets where we assume we get back essentially the same devices after the reset. That's not guaranteed with a pending firmware update. From the logs in this case, I notice that the BIOS didn't assigned the ROM BAR for the GPU while Linux did when re-scanning. That's a different address space from the SR-IOV BARs, so doesn't explain this failure, but I wonder if there's another resource in the hierarchy that the BIOS didn't program, ex. BAR0 on the upstream switch maybe? Possibly if the host were booted with pci=realloc, Linux might expand the resources to something it's happy with, allowing future re-scans. Otherwise I think we'd need a log of all BIOS vs Linux resource allocations from the root port down to see what might be the issue with the rescan. Thanks, Alex ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: PCI resource allocation mismatch with BIOS 2022-11-30 15:47 ` Alex Williamson @ 2022-12-01 9:41 ` Mika Westerberg 2022-12-09 11:08 ` Mika Westerberg 0 siblings, 1 reply; 15+ messages in thread From: Mika Westerberg @ 2022-12-01 9:41 UTC (permalink / raw) To: Alex Williamson; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci Hi, On Wed, Nov 30, 2022 at 08:47:38AM -0700, Alex Williamson wrote: > On Wed, 30 Nov 2022 09:57:18 +0200 > Mika Westerberg <mika.westerberg@linux.intel.com> wrote: > > > Hi, > > > > On Wed, Nov 30, 2022 at 08:43:47AM +0100, Lukas Wunner wrote: > > > On Tue, Nov 29, 2022 at 09:12:49AM -0700, Alex Williamson wrote: > > > > On Tue, 29 Nov 2022 17:06:26 +0100 Lukas Wunner <lukas@wunner.de> wrote: > > > > > On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote: > > > > > > Maybe the elephant in the room is why it's apparently such common > > > > > > practice to need to perform a hard reset these devices outside of > > > > > > virtualization scenarios... > > > > > > > > > > These GPUs are used as accelerators in cloud environments. > > > > > > > > > > They're reset to a pristine state when handed out to another tenant > > > > > to avoid info leaks from the previous tenant. > > > > > > > > > > That should be a legitimate usage of PCIe reset, no? > > > > > > > > Absolutely, but why the whole switch? Thanks, > > > > > > The reset is propagated down the hierarchy, so by resetting the > > > Switch Upstream Port, it is guaranteed that all endpoints are > > > reset with just a single operation. Per PCIe r6.0.1 sec 6.6.1: > > > > > > "For a Switch, the following must cause a hot reset to be sent > > > on all Downstream Ports: > > > [...] > > > Receiving a hot reset on the Upstream Port" > > This was never in doubt. > > > Adding here the reason I got from the GPU folks: > > > > In addition to the use case when the GPU is reset when switched to > > another tenant, this is used for recovery. 
The "first level" recovery is > > handled by the graphics driver that does Function Level Reset but if > > that does not work data centers may trigger reset at higher level (root > > port or switch upstream port) to reset the whole card. So called "big > > hammer". > > Ok, so the first level issue can be solved by the reset_method > attribute, allowing userspace to trigger a SBR for a signle function > endpoint rather than an FLR (though this might suggest some hardware > issues in the FLR implementation). This will save and restore the > device config space, so there should be no resource reallocation issues. Right, I think that already works. > > There is another use case too - firmware upgrade. This allows data > > centers to upgrade firmware on those cards without need to reboot - they > > just reset the whole thing to make it run the new firmware. > > The above can also be used for this, so long as the firmware update > only targets the GPU and device config space layout and features is not > modified by the update. This touches on a long standing issue Bjorn > has had with performing bus resets where we assume we get back > essentially the same devices after the reset. That's not guaranteed > with a pending firmware update. Fair enough but if the firmware does not change the resource consumption then we would get back the same devices. I guess this is the use-case with these cards. > >From the logs in this case, I notice that the BIOS didn't assigned the > ROM BAR for the GPU while Linux did when re-scanning. That's a > different address space from the SR-IOV BARs, so doesn't explain this > failure, but I wonder if there's another resource in the hierarchy that > the BIOS didn't program, ex. BAR0 on the upstream switch maybe? I don't think there is. > Possibly if the host were booted with pci=realloc, Linux might expand > the resources to something it's happy with, allowing future re-scans. 
This was tried (sorry I did not mention that) and it did not work either because of the way Linux allocates the resource space. BIOS allocates a "minimal" window that fulfills the alignment requirement so if we had just 16G+16M BARs then we would get 16G+16M bridge window aligned for that 16G. However, Linux tries to align the size of the window too so instead we get 16G aligned to 8G so 24G memory window. This does not fit into the resources the BIOS allocated. With pci=realloc Linux tries to release the resources from the upper level bridges and re-allocates more but the IOV resources are put into the "additional" list so it is fine if those don't get allocated in the end. > Otherwise I think we'd need a log of all BIOS vs Linux resource > allocations from the root port down to see what might be the issue with > the rescan. Thanks, Sure, I will share dmesg from that system showing the initial allocation and the re-scan as soon as I get confirmation that there is nothing under embargo in there. ^ permalink raw reply [flat|nested] 15+ messages in thread
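Mika's window arithmetic can be made concrete with shell arithmetic. This is purely illustrative of the sizing mismatch, not the kernel's actual bridge-sizing code; the 8G size-alignment granule is an assumption taken from his description.

```shell
#!/bin/sh
# 16G PF BAR + 16M BAR: the BIOS packs them into a minimal window,
# while rounding the window *size* up to an 8G granule needs 24G.
M=$((1 << 20))
G=$((1 << 30))

total=$((16 * G + 16 * M))                                # the BARs themselves
bios_window=$total                                        # minimal packing
align=$((8 * G))
linux_window=$(( (total + align - 1) / align * align ))   # size rounded up

echo "BIOS window: $((bios_window / M))M, size-aligned window: $((linux_window / M))M"
# prints: BIOS window: 16400M, size-aligned window: 24576M
```

The size-aligned window (24G) no longer fits in the 16400M gap the BIOS left, which is exactly the rescan failure seen in the log.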
* Re: PCI resource allocation mismatch with BIOS 2022-12-01 9:41 ` Mika Westerberg @ 2022-12-09 11:08 ` Mika Westerberg 0 siblings, 0 replies; 15+ messages in thread From: Mika Westerberg @ 2022-12-09 11:08 UTC (permalink / raw) To: Alex Williamson; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci Hi, On Thu, Dec 01, 2022 at 11:41:18AM +0200, Mika Westerberg wrote: > > Otherwise I think we'd need a log of all BIOS vs Linux resource > > allocations from the root port down to see what might be the issue with > > the rescan. Thanks, > > Sure, I will share dmesg from that system showing the initial allocation > and the re-scan as soon as I get confirmation that there is nothing > under embargo in there. Sorry for the delay. Took some time to get all the confirmations. I created a bug here: https://bugzilla.kernel.org/show_bug.cgi?id=216795 I also attached relevant dmesg and lspci outputs. I have now access to the system so I can do additional testing as needed. Thanks! ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2022-12-09 11:08 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-11-28 11:14 PCI resource allocation mismatch with BIOS Mika Westerberg 2022-11-28 20:39 ` Bjorn Helgaas 2022-11-28 22:06 ` Alex Williamson 2022-11-29 6:48 ` Lukas Wunner 2022-11-29 10:09 ` Mika Westerberg 2022-11-29 13:52 ` Alex Williamson 2022-11-29 15:07 ` Mika Westerberg 2022-11-29 15:46 ` Alex Williamson 2022-11-29 16:06 ` Lukas Wunner 2022-11-29 16:12 ` Alex Williamson 2022-11-30 7:43 ` Lukas Wunner 2022-11-30 7:57 ` Mika Westerberg 2022-11-30 15:47 ` Alex Williamson 2022-12-01 9:41 ` Mika Westerberg 2022-12-09 11:08 ` Mika Westerberg