* PCI resource allocation mismatch with BIOS
@ 2022-11-28 11:14 Mika Westerberg
  2022-11-28 20:39 ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Mika Westerberg @ 2022-11-28 11:14 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Mika Westerberg

Hi Bjorn,

There is another PCI resource allocation issue with some Intel GPUs that
probably applies to other similar devices as well. This is something
encountered in data centers where they trigger a reset (a secondary bus
reset) of the GPUs if a hang or similar is detected. Basically they do
something like:

  1. Unbind the graphics driver(s) through sysfs.
  2. Remove the PCIe devices under the root port or the PCIe switch
     upstream port through sysfs (echo 1 > ../remove).
  3. Trigger reset through config space or use the sysfs reset attribute.
  4. Run rescan on the root bus (echo 1 > /sys/bus/pci/rescan).

The expectation is that the devices come back the same way as prior to
the reset, but what actually happens is that the Linux PCI resource
allocation fails to allocate space for some of the resources. In this
case it is the IOV BARs.

The BIOS allocates resources for all of these at boot time. After the
rescan, Linux tries to re-allocate them, but since its allocation
algorithm is more "consuming", some of the BARs do not fit into the
available resource space.

Here is an example. The devices involved are:

53:00.0		GPU with IOV BARs
52:01.0		PCIe switch downstream port

PF = Physical Function
VF = Virtual Function

BIOS allocation (dmesg)
-----------------------
pci 0000:52:01.0: scanning [bus 53-54] behind bridge, pass 0
pci 0000:53:00.0: [8086:56c0] type 00 class 0x038000
pci 0000:53:00.0: reg 0x10: [mem 0x205e1f000000-0x205e1fffffff 64bit pref]
pci 0000:53:00.0: reg 0x18: [mem 0x201c00000000-0x201fffffffff 64bit pref]
pci 0000:53:00.0: reg 0x30: [mem 0xffe00000-0xffffffff pref]
pci 0000:53:00.0: reg 0x344: [mem 0x205e00000000-0x205e00ffffff 64bit pref]
pci 0000:53:00.0: VF(n) BAR0 space: [mem 0x205e00000000-0x205e1effffff 64bit pref] (contains BAR0 for 31 VFs)
pci 0000:53:00.0: reg 0x34c: [mem 0x202000000000-0x2021ffffffff 64bit pref]
pci 0000:53:00.0: VF(n) BAR2 space: [mem 0x202000000000-0x205dffffffff 64bit pref] (contains BAR2 for 31 VFs)
pci 0000:52:01.0: PCI bridge to [bus 53-54]
pci 0000:52:01.0:   bridge window [mem 0x201c00000000-0x205e1fffffff 64bit pref]

GPU
~~~
0x201c00000000-0x201fffffffff	PF BAR2 16384M
0x202000000000-0x205dffffffff	VF BAR2	253952M (31 * 8G)
0x205e00000000-0x205e1effffff	VF BAR0 496M (31 * 16M)
0x205e1f000000-0x205e1fffffff 	PF BAR0 16M
					270848M

PCIe downstream port
~~~~~~~~~~~~~~~~~~~~
0x201c00000000-0x205e1fffffff		270848M

Linux allocation (dmesg)
------------------------
pci 0000:52:01.0: [8086:4fa4] type 01 class 0x060400
pci_bus 0000:52: fixups for bus
pci 0000:51:00.0: PCI bridge to [bus 52-54]
pci 0000:51:00.0:   bridge window [io  0x0000-0x0fff]
pci 0000:51:00.0:   bridge window [mem 0x00000000-0x000fffff]
pci 0000:51:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
pci 0000:52:01.0: scanning [bus 00-00] behind bridge, pass 0
pci 0000:52:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:52:01.0: scanning [bus 00-00] behind bridge, pass 1
pci_bus 0000:53: scanning bus
pci 0000:53:00.0: [8086:56c0] type 00 class 0x038000
pci 0000:53:00.0: reg 0x10: [mem 0x00000000-0x00ffffff 64bit pref]
pci 0000:53:00.0: reg 0x18: [mem 0x00000000-0x3ffffffff 64bit pref]
pci 0000:53:00.0: reg 0x30: [mem 0x00000000-0x001fffff pref]
pci 0000:53:00.0: reg 0x344: [mem 0x00000000-0x00ffffff 64bit pref]
pci 0000:53:00.0: VF(n) BAR0 space: [mem 0x00000000-0x1effffff 64bit pref] (contains BAR0 for 31 VFs)
pci 0000:53:00.0: reg 0x34c: [mem 0x00000000-0x1ffffffff 64bit pref]
pci 0000:53:00.0: VF(n) BAR2 space: [mem 0x00000000-0x3dffffffff 64bit pref] (contains BAR2 for 31 VFs)
pci_bus 0000:53: fixups for bus
pci 0000:52:01.0: PCI bridge to [bus 53-54]
pci 0000:52:01.0:   bridge window [io  0x0000-0x0fff]
pci 0000:52:01.0:   bridge window [mem 0x00000000-0x000fffff]
pci 0000:52:01.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
pci 0000:52:01.0: bridge window [mem 0x200000000-0x7ffffffff 64bit pref] to [bus 53] add_size 3e00000000 add_align 200000000
pci 0000:51:00.0: bridge window [mem 0x200000000-0x7ffffffff 64bit pref] to [bus 52-53] add_size 3e00000000 add_align 200000000
pcieport 0000:50:02.0: BAR 13: assigned [io  0x8000-0x8fff]
pci 0000:51:00.0: BAR 15: no space for [mem size 0x4400000000 64bit pref]
pci 0000:51:00.0: BAR 15: failed to assign [mem size 0x4400000000 64bit pref]
pci 0000:51:00.0: BAR 0: assigned [mem 0x201c00000000-0x201c007fffff 64bit pref]
pci 0000:51:00.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
pci 0000:51:00.0: BAR 13: assigned [io  0x8000-0x8fff]
pci 0000:51:00.0: BAR 15: assigned [mem 0x201c00000000-0x2021ffffffff 64bit pref]
pci 0000:51:00.0: BAR 0: assigned [mem 0x202200000000-0x2022007fffff 64bit pref]
pci 0000:51:00.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
pci 0000:51:00.0: BAR 15: [mem 0x201c00000000-0x2021ffffffff 64bit pref] (failed to expand by 0x3e00000000)
pci 0000:51:00.0: failed to add 3e00000000 res[15]=[mem 0x201c00000000-0x2021ffffffff 64bit pref]
pci 0000:52:01.0: BAR 15: no space for [mem size 0x4400000000 64bit pref]
pci 0000:52:01.0: BAR 15: failed to assign [mem size 0x4400000000 64bit pref]
pci 0000:52:01.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
pci 0000:52:01.0: BAR 13: assigned [io  0x8000-0x8fff]
pci 0000:52:01.0: BAR 15: assigned [mem 0x201c00000000-0x2021ffffffff 64bit pref]
pci 0000:52:01.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
pci 0000:52:01.0: BAR 15: [mem 0x201c00000000-0x2021ffffffff 64bit pref] (failed to expand by 0x3e00000000)
pci 0000:52:01.0: failed to add 3e00000000 res[15]=[mem 0x201c00000000-0x2021ffffffff 64bit pref]
pci 0000:53:00.0: BAR 2: assigned [mem 0x201c00000000-0x201fffffffff 64bit pref]
pci 0000:53:00.0: BAR 9: no space for [mem size 0x3e00000000 64bit pref]
pci 0000:53:00.0: BAR 9: failed to assign [mem size 0x3e00000000 64bit pref]
pci 0000:53:00.0: BAR 0: assigned [mem 0x202000000000-0x202000ffffff 64bit pref]
pci 0000:53:00.0: BAR 7: assigned [mem 0x202001000000-0x20201fffffff 64bit pref]
pci 0000:53:00.0: BAR 6: assigned [mem 0xbb800000-0xbb9fffff pref]
pci 0000:53:00.0: BAR 2: assigned [mem 0x201c00000000-0x201fffffffff 64bit pref]
pci 0000:53:00.0: BAR 0: assigned [mem 0x202000000000-0x202000ffffff 64bit pref]
pci 0000:53:00.0: BAR 6: assigned [mem 0xbb800000-0xbb9fffff pref]
pci 0000:53:00.0: BAR 9: no space for [mem size 0x3e00000000 64bit pref]
pci 0000:53:00.0: BAR 9: failed to assign [mem size 0x3e00000000 64bit pref]
pci 0000:53:00.0: BAR 7: assigned [mem 0x202001000000-0x20201fffffff 64bit pref]
pci 0000:52:01.0: PCI bridge to [bus 53]
pci 0000:52:01.0:   bridge window [io  0x8000-0x8fff]
pci 0000:52:01.0:   bridge window [mem 0xbb800000-0xbb9fffff]
pci 0000:52:01.0:   bridge window [mem 0x201c00000000-0x2021ffffffff 64bit pref]

GPU
~~~
0x201c00000000-0x201fffffffff	PF BAR2 16384M
0x202000000000-0x202000ffffff	PF BAR0	16M
0x202001000000-0x20201fffffff 	VF BAR0	496M (31 * 16M)
FAIL				VF BAR2 253952M (31 * 8G)

PCIe downstream port
~~~~~~~~~~~~~~~~~~~~
0x201c00000000-0x2021ffffffff		24576M

Now, if I hack the allocation algorithm (in pbus_size_mem()) to "mimic"
the BIOS allocation, then these fit fine. However, if the BIOS
allocation ever changes we may end up with a similar issue. Also, the
Linux PCI resource allocation code has been like this for aeons, so
changing it would likely cause regressions.

Let me know if more information is needed. I have one of these cards
locally and have remote access to a similar system where the above
example was taken, so I can run additional testing.

Also let me know if you want me to file a bug in kernel.org bugzilla.

Thanks in advance!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: PCI resource allocation mismatch with BIOS
  2022-11-28 11:14 PCI resource allocation mismatch with BIOS Mika Westerberg
@ 2022-11-28 20:39 ` Bjorn Helgaas
  2022-11-28 22:06   ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2022-11-28 20:39 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: Bjorn Helgaas, linux-pci, Alex Williamson

[+cc Alex]

Hi Mika,

On Mon, Nov 28, 2022 at 01:14:14PM +0200, Mika Westerberg wrote:
> Hi Bjorn,
> 
> There is another PCI resource allocation issue with some Intel GPUs that
> probably applies to other similar devices as well. This is something
> encountered in data centers where they trigger a reset (a secondary bus
> reset) of the GPUs if a hang or similar is detected. Basically they do
> something like:
> 
>   1. Unbind the graphics driver(s) through sysfs.
>   2. Remove the PCIe devices under the root port or the PCIe switch
>      upstream port through sysfs (echo 1 > ../remove).
>   3. Trigger reset through config space or use the sysfs reset attribute.
>   4. Run rescan on the root bus (echo 1 > /sys/bus/pci/rescan).
> 
> The expectation is that the devices come back the same way as prior to
> the reset, but what actually happens is that the Linux PCI resource
> allocation fails to allocate space for some of the resources. In this
> case it is the IOV BARs.
> 
> The BIOS allocates resources for all of these at boot time. After the
> rescan, Linux tries to re-allocate them, but since its allocation
> algorithm is more "consuming", some of the BARs do not fit into the
> available resource space.

Thanks for the report!  Definitely sounds like an issue.  I doubt that
I'll have time to work on it myself in the near future.

Is the "remove" before the reset actually necessary?  If we could
avoid the removal, maybe the config space save/restore we already do
around reset would avoid the issue?

Bjorn

> Here is an example. The devices involved are:
> 
> 53:00.0		GPU with IOV BARs
> 52:01.0		PCIe switch downstream port
> 
> PF = Physical Function
> VF = Virtual Function
> 
> BIOS allocation (dmesg)
> -----------------------
> pci 0000:52:01.0: scanning [bus 53-54] behind bridge, pass 0
> pci 0000:53:00.0: [8086:56c0] type 00 class 0x038000
> pci 0000:53:00.0: reg 0x10: [mem 0x205e1f000000-0x205e1fffffff 64bit pref]
> pci 0000:53:00.0: reg 0x18: [mem 0x201c00000000-0x201fffffffff 64bit pref]
> pci 0000:53:00.0: reg 0x30: [mem 0xffe00000-0xffffffff pref]
> pci 0000:53:00.0: reg 0x344: [mem 0x205e00000000-0x205e00ffffff 64bit pref]
> pci 0000:53:00.0: VF(n) BAR0 space: [mem 0x205e00000000-0x205e1effffff 64bit pref] (contains BAR0 for 31 VFs)
> pci 0000:53:00.0: reg 0x34c: [mem 0x202000000000-0x2021ffffffff 64bit pref]
> pci 0000:53:00.0: VF(n) BAR2 space: [mem 0x202000000000-0x205dffffffff 64bit pref] (contains BAR2 for 31 VFs)
> pci 0000:52:01.0: PCI bridge to [bus 53-54]
> pci 0000:52:01.0:   bridge window [mem 0x201c00000000-0x205e1fffffff 64bit pref]
> 
> GPU
> ~~~
> 0x201c00000000-0x201fffffffff	PF BAR2 16384M
> 0x202000000000-0x205dffffffff	VF BAR2	253952M (31 * 8G)
> 0x205e00000000-0x205e1effffff	VF BAR0 496M (31 * 16M)
> 0x205e1f000000-0x205e1fffffff 	PF BAR0 16M
> 					270848M
> 
> PCIe downstream port
> ~~~~~~~~~~~~~~~~~~~~
> 0x201c00000000-0x205e1fffffff		270848M
> 
> Linux allocation (dmesg)
> ------------------------
> pci 0000:52:01.0: [8086:4fa4] type 01 class 0x060400
> pci_bus 0000:52: fixups for bus
> pci 0000:51:00.0: PCI bridge to [bus 52-54]
> pci 0000:51:00.0:   bridge window [io  0x0000-0x0fff]
> pci 0000:51:00.0:   bridge window [mem 0x00000000-0x000fffff]
> pci 0000:51:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
> pci 0000:52:01.0: scanning [bus 00-00] behind bridge, pass 0
> pci 0000:52:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:52:01.0: scanning [bus 00-00] behind bridge, pass 1
> pci_bus 0000:53: scanning bus
> pci 0000:53:00.0: [8086:56c0] type 00 class 0x038000
> pci 0000:53:00.0: reg 0x10: [mem 0x00000000-0x00ffffff 64bit pref]
> pci 0000:53:00.0: reg 0x18: [mem 0x00000000-0x3ffffffff 64bit pref]
> pci 0000:53:00.0: reg 0x30: [mem 0x00000000-0x001fffff pref]
> pci 0000:53:00.0: reg 0x344: [mem 0x00000000-0x00ffffff 64bit pref]
> pci 0000:53:00.0: VF(n) BAR0 space: [mem 0x00000000-0x1effffff 64bit pref] (contains BAR0 for 31 VFs)
> pci 0000:53:00.0: reg 0x34c: [mem 0x00000000-0x1ffffffff 64bit pref]
> pci 0000:53:00.0: VF(n) BAR2 space: [mem 0x00000000-0x3dffffffff 64bit pref] (contains BAR2 for 31 VFs)
> pci_bus 0000:53: fixups for bus
> pci 0000:52:01.0: PCI bridge to [bus 53-54]
> pci 0000:52:01.0:   bridge window [io  0x0000-0x0fff]
> pci 0000:52:01.0:   bridge window [mem 0x00000000-0x000fffff]
> pci 0000:52:01.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
> pci 0000:52:01.0: bridge window [mem 0x200000000-0x7ffffffff 64bit pref] to [bus 53] add_size 3e00000000 add_align 200000000
> pci 0000:51:00.0: bridge window [mem 0x200000000-0x7ffffffff 64bit pref] to [bus 52-53] add_size 3e00000000 add_align 200000000
> pcieport 0000:50:02.0: BAR 13: assigned [io  0x8000-0x8fff]
> pci 0000:51:00.0: BAR 15: no space for [mem size 0x4400000000 64bit pref]
> pci 0000:51:00.0: BAR 15: failed to assign [mem size 0x4400000000 64bit pref]
> pci 0000:51:00.0: BAR 0: assigned [mem 0x201c00000000-0x201c007fffff 64bit pref]
> pci 0000:51:00.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
> pci 0000:51:00.0: BAR 13: assigned [io  0x8000-0x8fff]
> pci 0000:51:00.0: BAR 15: assigned [mem 0x201c00000000-0x2021ffffffff 64bit pref]
> pci 0000:51:00.0: BAR 0: assigned [mem 0x202200000000-0x2022007fffff 64bit pref]
> pci 0000:51:00.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
> pci 0000:51:00.0: BAR 15: [mem 0x201c00000000-0x2021ffffffff 64bit pref] (failed to expand by 0x3e00000000)
> pci 0000:51:00.0: failed to add 3e00000000 res[15]=[mem 0x201c00000000-0x2021ffffffff 64bit pref]
> pci 0000:52:01.0: BAR 15: no space for [mem size 0x4400000000 64bit pref]
> pci 0000:52:01.0: BAR 15: failed to assign [mem size 0x4400000000 64bit pref]
> pci 0000:52:01.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
> pci 0000:52:01.0: BAR 13: assigned [io  0x8000-0x8fff]
> pci 0000:52:01.0: BAR 15: assigned [mem 0x201c00000000-0x2021ffffffff 64bit pref]
> pci 0000:52:01.0: BAR 14: assigned [mem 0xbb800000-0xbb9fffff]
> pci 0000:52:01.0: BAR 15: [mem 0x201c00000000-0x2021ffffffff 64bit pref] (failed to expand by 0x3e00000000)
> pci 0000:52:01.0: failed to add 3e00000000 res[15]=[mem 0x201c00000000-0x2021ffffffff 64bit pref]
> pci 0000:53:00.0: BAR 2: assigned [mem 0x201c00000000-0x201fffffffff 64bit pref]
> pci 0000:53:00.0: BAR 9: no space for [mem size 0x3e00000000 64bit pref]
> pci 0000:53:00.0: BAR 9: failed to assign [mem size 0x3e00000000 64bit pref]
> pci 0000:53:00.0: BAR 0: assigned [mem 0x202000000000-0x202000ffffff 64bit pref]
> pci 0000:53:00.0: BAR 7: assigned [mem 0x202001000000-0x20201fffffff 64bit pref]
> pci 0000:53:00.0: BAR 6: assigned [mem 0xbb800000-0xbb9fffff pref]
> pci 0000:53:00.0: BAR 2: assigned [mem 0x201c00000000-0x201fffffffff 64bit pref]
> pci 0000:53:00.0: BAR 0: assigned [mem 0x202000000000-0x202000ffffff 64bit pref]
> pci 0000:53:00.0: BAR 6: assigned [mem 0xbb800000-0xbb9fffff pref]
> pci 0000:53:00.0: BAR 9: no space for [mem size 0x3e00000000 64bit pref]
> pci 0000:53:00.0: BAR 9: failed to assign [mem size 0x3e00000000 64bit pref]
> pci 0000:53:00.0: BAR 7: assigned [mem 0x202001000000-0x20201fffffff 64bit pref]
> pci 0000:52:01.0: PCI bridge to [bus 53]
> pci 0000:52:01.0:   bridge window [io  0x8000-0x8fff]
> pci 0000:52:01.0:   bridge window [mem 0xbb800000-0xbb9fffff]
> pci 0000:52:01.0:   bridge window [mem 0x201c00000000-0x2021ffffffff 64bit pref]
> 
> GPU
> ~~~
> 0x201c00000000-0x201fffffffff	PF BAR2 16384M
> 0x202000000000-0x202000ffffff	PF BAR0	16M
> 0x202001000000-0x20201fffffff 	VF BAR0	496M (31 * 16M)
> FAIL				VF BAR2 253952M (31 * 8G)
> 
> PCIe downstream port
> ~~~~~~~~~~~~~~~~~~~~
> 0x201c00000000-0x2021ffffffff		24576M
> 
> Now, if I hack the allocation algorithm (in pbus_size_mem()) to "mimic"
> the BIOS allocation, then these fit fine. However, if the BIOS
> allocation ever changes we may end up with a similar issue. Also, the
> Linux PCI resource allocation code has been like this for aeons, so
> changing it would likely cause regressions.
> 
> Let me know if more information is needed. I have one of these cards
> locally and have remote access to a similar system where the above
> example was taken, so I can run additional testing.
> 
> Also let me know if you want me to file a bug in kernel.org bugzilla.
> 
> Thanks in advance!


* Re: PCI resource allocation mismatch with BIOS
  2022-11-28 20:39 ` Bjorn Helgaas
@ 2022-11-28 22:06   ` Alex Williamson
  2022-11-29  6:48     ` Lukas Wunner
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2022-11-28 22:06 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Mika Westerberg, Bjorn Helgaas, linux-pci

On Mon, 28 Nov 2022 14:39:32 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Alex]
> 
> Hi Mika,
> 
> On Mon, Nov 28, 2022 at 01:14:14PM +0200, Mika Westerberg wrote:
> > Hi Bjorn,
> > 
> > There is another PCI resource allocation issue with some Intel GPUs that
> > probably applies to other similar devices as well. This is something
> > encountered in data centers where they trigger a reset (a secondary bus
> > reset) of the GPUs if a hang or similar is detected. Basically they do
> > something like:
> > 
> >   1. Unbind the graphics driver(s) through sysfs.
> >   2. Remove the PCIe devices under the root port or the PCIe switch
> >      upstream port through sysfs (echo 1 > ../remove).
> >   3. Trigger reset through config space or use the sysfs reset attribute.
> >   4. Run rescan on the root bus (echo 1 > /sys/bus/pci/rescan).
> > 
> > The expectation is that the devices come back the same way as prior to
> > the reset, but what actually happens is that the Linux PCI resource
> > allocation fails to allocate space for some of the resources. In this
> > case it is the IOV BARs.
> > 
> > The BIOS allocates resources for all of these at boot time. After the
> > rescan, Linux tries to re-allocate them, but since its allocation
> > algorithm is more "consuming", some of the BARs do not fit into the
> > available resource space.
> 
> Thanks for the report!  Definitely sounds like an issue.  I doubt that
> I'll have time to work on it myself in the near future.
> 
> Is the "remove" before the reset actually necessary?  If we could
> avoid the removal, maybe the config space save/restore we already do
> around reset would avoid the issue?

Agreed.  Is this convoluted removal process being used to force a SBR,
versus a FLR or PM reset that might otherwise be used by twiddling the
reset attribute of the GPU directly?  If so, the reset_method attribute
can be used to force a bus reset and perform all the state save/restore
handling to avoid reallocating BARs.  A reset from the upstream switch
port would only be necessary if you have some reason to also reset the
switch downstream ports.  Thanks,

Alex



* Re: PCI resource allocation mismatch with BIOS
  2022-11-28 22:06   ` Alex Williamson
@ 2022-11-29  6:48     ` Lukas Wunner
  2022-11-29 10:09       ` Mika Westerberg
  2022-11-29 13:52       ` Alex Williamson
  0 siblings, 2 replies; 15+ messages in thread
From: Lukas Wunner @ 2022-11-29  6:48 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Bjorn Helgaas, Mika Westerberg, Bjorn Helgaas, linux-pci

On Mon, Nov 28, 2022 at 03:06:17PM -0700, Alex Williamson wrote:
> Agreed.  Is this convoluted removal process being used to force a SBR,
> versus a FLR or PM reset that might otherwise be used by twiddling the
> reset attribute of the GPU directly?  If so, the reset_method attribute
> can be used to force a bus reset and perform all the state save/restore
> handling to avoid reallocating BARs.  A reset from the upstream switch
> port would only be necessary if you have some reason to also reset the
> switch downstream ports.  Thanks,

A Secondary Bus Reset is only offered as a reset_method if the
device to be reset is the *only* child of the upstream bridge.
I.e. if the device to be reset has siblings or children,
a Secondary Bus Reset is not permitted.

Modern GPUs (including the one Mika is referring to) consist of
a PCIe switch with the GPU, HD audio and telemetry devices below
Downstream Bridges.  A Secondary Bus Reset of the Root Port is
not allowed in this case because the Switch Upstream Port has
children.

See this code in pci_parent_bus_reset():

	if (pci_is_root_bus(dev->bus) || dev->subordinate ||
	    !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
		return -ENOTTY;

The dev->subordinate check disallows a SBR if there are children.
Note that the code should probably instead check for...
(dev->subordinate && !list_empty(dev->subordinate->devices))
...because the port may have a subordinate bus without children
(may have been removed for example).

The "no siblings" rule is enforced by:

	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
		if (pdev != dev)
			return -ENOTTY;

Note that the devices list is iterated without holding pci_bus_sem,
which looks fishy.

That said, it *is* possible that a Secondary Bus Reset is erroneously
offered despite these checks because we perform them early on device
enumeration when the subordinate bus hasn't been scanned yet.

So if the Root Port offers other reset methods besides SBR and the
user switches to one of them, then reinstates the defaults,
suddenly SBR will disappear because the subordinate bus has since
been scanned.  What's missing here is that we re-check availability
of the reset methods on siblings and the parent when a device is
added or removed.  This is also necessary to make reset_method
work properly with hotplug.  However, the result may be that the
reset_method attribute in sysfs may become invisible after adding
a device (because there is no reset method available) and reappear
after removing a device.

So the reset_method logic is pretty broken right now I'm afraid.

In any case, for Mika's use case it would be useful to have a
"reset_subordinate" attribute on ports capable of a SBR such that
the entire hierarchy below is reset.  The "reset" attribute is
insufficient.

Thanks,

Lukas


* Re: PCI resource allocation mismatch with BIOS
  2022-11-29  6:48     ` Lukas Wunner
@ 2022-11-29 10:09       ` Mika Westerberg
  2022-11-29 13:52       ` Alex Williamson
  1 sibling, 0 replies; 15+ messages in thread
From: Mika Westerberg @ 2022-11-29 10:09 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Alex Williamson, Bjorn Helgaas, Bjorn Helgaas, linux-pci

Hi,

On Tue, Nov 29, 2022 at 07:48:12AM +0100, Lukas Wunner wrote:
> On Mon, Nov 28, 2022 at 03:06:17PM -0700, Alex Williamson wrote:
> > Agreed.  Is this convoluted removal process being used to force a SBR,
> > versus a FLR or PM reset that might otherwise be used by twiddling the
> > reset attribute of the GPU directly?  If so, the reset_method attribute
> > can be used to force a bus reset and perform all the state save/restore
> > handling to avoid reallocating BARs.  A reset from the upstream switch
> > port would only be necessary if you have some reason to also reset the
> > switch downstream ports.  Thanks,
> 
> A Secondary Bus Reset is only offered as a reset_method if the
> device to be reset is the *only* child of the upstream bridge.
> I.e. if the device to be reset has siblings or children,
> a Secondary Bus Reset is not permitted.
> 
> Modern GPUs (including the one Mika is referring to) consist of
> a PCIe switch with the GPU, HD audio and telemetry devices below
> Downstream Bridges.  A Secondary Bus Reset of the Root Port is
> not allowed in this case because the Switch Upstream Port has
> children.
> 
> See this code in pci_parent_bus_reset():
> 
> 	if (pci_is_root_bus(dev->bus) || dev->subordinate ||
> 	    !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
> 		return -ENOTTY;
> 
> The dev->subordinate check disallows a SBR if there are children.
> Note that the code should probably instead check for...
> (dev->subordinate && !list_empty(dev->subordinate->devices))
> ...because the port may have a subordinate bus without children
> (may have been removed for example).
> 
> The "no siblings" rule is enforced by:
> 
> 	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> 		if (pdev != dev)
> 			return -ENOTTY;
> 
> Note that the devices list is iterated without holding pci_bus_sem,
> which looks fishy.
> 
> That said, it *is* possible that a Secondary Bus Reset is erroneously
> offered despite these checks because we perform them early on device
> enumeration when the subordinate bus hasn't been scanned yet.

Thanks, Lukas, for the good explanation :) I think because of the above,
our GPU folks trigger the Secondary Bus Reset directly by poking the
config space through setpci or so, and I guess this is why they need the
"remove" phase as well.

> So if the Root Port offers other reset methods besides SBR and the
> user switches to one of them, then reinstates the defaults,
> suddenly SBR will disappear because the subordinate bus has since
> been scanned.  What's missing here is that we re-check availability
> of the reset methods on siblings and the parent when a device is
> added or removed.  This is also necessary to make reset_method
> work properly with hotplug.  However, the result may be that the
> reset_method attribute in sysfs may become invisible after adding
> a device (because there is no reset method available) and reappear
> after removing a device.
> 
> So the reset_method logic is pretty broken right now I'm afraid.
> 
> In any case, for Mika's use case it would be useful to have a
> "reset_subordinate" attribute on ports capable of a SBR such that
> the entire hierarchy below is reset.  The "reset" attribute is
> insufficient.

Yes, that would be useful indeed.


* Re: PCI resource allocation mismatch with BIOS
  2022-11-29  6:48     ` Lukas Wunner
  2022-11-29 10:09       ` Mika Westerberg
@ 2022-11-29 13:52       ` Alex Williamson
  2022-11-29 15:07         ` Mika Westerberg
  1 sibling, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2022-11-29 13:52 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Bjorn Helgaas, Mika Westerberg, Bjorn Helgaas, linux-pci

On Tue, 29 Nov 2022 07:48:12 +0100
Lukas Wunner <lukas@wunner.de> wrote:

> On Mon, Nov 28, 2022 at 03:06:17PM -0700, Alex Williamson wrote:
> > Agreed.  Is this convoluted removal process being used to force a SBR,
> > versus a FLR or PM reset that might otherwise be used by twiddling the
> > reset attribute of the GPU directly?  If so, the reset_method attribute
> > can be used to force a bus reset and perform all the state save/restore
> > handling to avoid reallocating BARs.  A reset from the upstream switch
> > port would only be necessary if you have some reason to also reset the
> > switch downstream ports.  Thanks,  
> 
> A Secondary Bus Reset is only offered as a reset_method if the
> device to be reset is the *only* child of the upstream bridge.
> I.e. if the device to be reset has siblings or children,
> a Secondary Bus Reset is not permitted.
> 
> Modern GPUs (including the one Mika is referring to) consist of
> a PCIe switch with the GPU, HD audio and telemetry devices below
> Downstream Bridges.  A Secondary Bus Reset of the Root Port is
> not allowed in this case because the Switch Upstream Port has
> children.

I didn't see such functions in the log provided, the GPU in question
seems to be a single function device at 53:00.0.  This matches what
I've seen on an ARC A380 GPU where the GPU and HD audio are each single
function devices under separate downstream ports of a PCIe switch.

> See this code in pci_parent_bus_reset():
> 
> 	if (pci_is_root_bus(dev->bus) || dev->subordinate ||
> 	    !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
> 		return -ENOTTY;
> 
> The dev->subordinate check disallows a SBR if there are children.
> Note that the code should probably instead check for...
> (dev->subordinate && !list_empty(dev->subordinate->devices))
> ...because the port may have a subordinate bus without children
> (may have been removed for example).
> 
> The "no siblings" rule is enforced by:
> 
> 	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> 		if (pdev != dev)
> 			return -ENOTTY;
> 
> Note that the devices list is iterated without holding pci_bus_sem,
> which looks fishy.
> 
> That said, it *is* possible that a Secondary Bus Reset is erroneously
> offered despite these checks because we perform them early on device
> enumeration when the subordinate bus hasn't been scanned yet.
> 
> So if the Root Port offers other reset methods besides SBR and the
> user switches to one of them, then reinstates the defaults,
> suddenly SBR will disappear because the subordinate bus has since
> been scanned.  What's missing here is that we re-check availability
> of the reset methods on siblings and the parent when a device is
> added or removed.  This is also necessary to make reset_method
> work properly with hotplug.  However, the result may be that the
> reset_method attribute in sysfs may become invisible after adding
> a device (because there is no reset method available) and reappear
> after removing a device.
> 
> So the reset_method logic is pretty broken right now I'm afraid.

I haven't checked for a while, but I thought we exposed SBR regardless
of siblings, though it can't be accessed via the reset attribute if
there are siblings.  That allows that the sibling devices could be soft
removed, a reset performed, and the bus re-scanned.  If there are in
fact sibling devices, it would make more sense to remove only those to
effect a bus reset to avoid the resource issues with rescanning SR-IOV
on the GPU.

> In any case, for Mika's use case it would be useful to have a
> "reset_subordinate" attribute on ports capable of a SBR such that
> the entire hierarchy below is reset.  The "reset" attribute is
> insufficient.

I'll toss out that a pretty simple vfio tool can be written to bind all
the siblings on a bus, enabling the hot reset ioctl in vfio. Thanks,

Alex



* Re: PCI resource allocation mismatch with BIOS
  2022-11-29 13:52       ` Alex Williamson
@ 2022-11-29 15:07         ` Mika Westerberg
  2022-11-29 15:46           ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: Mika Westerberg @ 2022-11-29 15:07 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci

Hi,

On Tue, Nov 29, 2022 at 06:52:42AM -0700, Alex Williamson wrote:
> On Tue, 29 Nov 2022 07:48:12 +0100
> Lukas Wunner <lukas@wunner.de> wrote:
> 
> > On Mon, Nov 28, 2022 at 03:06:17PM -0700, Alex Williamson wrote:
> > > Agreed.  Is this convoluted removal process being used to force a SBR,
> > > versus a FLR or PM reset that might otherwise be used by twiddling the
> > > reset attribute of the GPU directly?  If so, the reset_method attribute
> > > can be used to force a bus reset and perform all the state save/restore
> > > handling to avoid reallocating BARs.  A reset from the upstream switch
> > > port would only be necessary if you have some reason to also reset the
> > > switch downstream ports.  Thanks,  
> > 
> > A Secondary Bus Reset is only offered as a reset_method if the
> > device to be reset is the *only* child of the upstream bridge.
> > I.e. if the device to be reset has siblings or children,
> > a Secondary Bus Reset is not permitted.
> > 
> > Modern GPUs (including the one Mika is referring to) consist of
> > a PCIe switch with the GPU, HD audio and telemetry devices below
> > Downstream Bridges.  A Secondary Bus Reset of the Root Port is
> > not allowed in this case because the Switch Upstream Port has
> > children.
> 
> I didn't see such functions in the log provided, the GPU in question
> seems to be a single function device at 53:00.0.  This matches what
> I've seen on an ARC A380 GPU where the GPU and HD audio are each single
> function devices under separate downstream ports of a PCIe switch.

Yes, this one is similar. There is a PCIe switch and then a bunch of
devices connected to the downstream ports, one of them being the GPU.

Sorry if I missed that part in the report.

There are typically multiple of these cards and they want to perform the
reset separately for each.

> > See this code in pci_parent_bus_reset():
> > 
> > 	if (pci_is_root_bus(dev->bus) || dev->subordinate ||
> > 	    !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
> > 		return -ENOTTY;
> > 
> > The dev->subordinate check disallows a SBR if there are children.
> > Note that the code should probably instead check for...
> > (dev->subordinate && !list_empty(dev->subordinate->devices))
> > ...because the port may have a subordinate bus without children
> > (may have been removed for example).
> > 
> > The "no siblings" rule is enforced by:
> > 
> > 	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> > 		if (pdev != dev)
> > 			return -ENOTTY;
> > 
> > Note that the devices list is iterated without holding pci_bus_sem,
> > which looks fishy.
> > 
> > That said, it *is* possible that a Secondary Bus Reset is erroneously
> > offered despite these checks because we perform them early on device
> > enumeration when the subordinate bus hasn't been scanned yet.
> > 
> > So if the Root Port offers other reset methods besides SBR and the
> > user switches to one of them, then reinstates the defaults,
> > suddenly SBR will disappear because the subordinate bus has since
> > been scanned.  What's missing here is that we re-check availability
> > of the reset methods on siblings and the parent when a device is
> > added or removed.  This is also necessary to make reset_method
> > work properly with hotplug.  However, the result may be that the
> > reset_method attribute in sysfs may become invisible after adding
> > a device (because there is no reset method available) and reappear
> > after removing a device.
> > 
> > So the reset_method logic is pretty broken right now I'm afraid.
> 
> I haven't checked for a while, but I thought we exposed SBR regardless
> of siblings, though it can't be accessed via the reset attribute if
> there are siblings.  That allows that the sibling devices could be soft
> removed, a reset performed, and the bus re-scanned.  If there are in
> fact sibling devices, it would make more sense to remove only those to
> effect a bus reset to avoid the resource issues with rescanning SR-IOV
> on the GPU.

If I understand correctly they perform the reset just above the upstream
port of the PCIe switch so that it resets the whole "card".
> 
> > In any case, for Mika's use case it would be useful to have a
> > "reset_subordinate" attribute on ports capable of a SBR such that
> > the entire hierarchy below is reset.  The "reset" attribute is
> > insufficient.
> 
> I'll toss out that a pretty simple vfio tool can be written to bind all
> the siblings on a bus enabling the hot reset ioctl in vfio.  Thanks,

Sounds good but unfortunately I'm not fluent in vfio so I have no idea
how this simple tool could be done :( Do you have any examples or
pointers that we could use to try this out?

Thanks!
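
The sysfs remove / reset / rescan sequence discussed in this thread
(steps 1-4 at the top) can be sketched as a small script. This is only
an illustration: the BDFs are placeholders, and the function prints the
commands so they can be reviewed before being piped to a root shell.

```shell
#!/bin/sh
# Sketch of the remove / SBR / rescan sequence from the report.
# BDFs are illustrative placeholders; adjust to the actual topology.
PORT=0000:52:01.0                 # port above the device(s) to reset
ENDPOINTS="0000:53:00.0"          # device(s) to soft-remove before the reset

sbr_sequence() {
	# 1. Soft-remove the endpoints so no driver touches them during reset.
	for dev in $ENDPOINTS; do
		echo "echo 1 > /sys/bus/pci/devices/$dev/remove"
	done
	# 2. Trigger the reset via the port's sysfs reset attribute.
	echo "echo 1 > /sys/bus/pci/devices/$PORT/reset"
	# 3. Re-enumerate everything below the root bus.
	echo "echo 1 > /sys/bus/pci/rescan"
}

# Review the output, then run it as root:  sbr_sequence | sh
sbr_sequence
```

Per the report, it is the final rescan in this sequence that can trip
over the IOV BAR allocation mismatch.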


* Re: PCI resource allocation mismatch with BIOS
  2022-11-29 15:07         ` Mika Westerberg
@ 2022-11-29 15:46           ` Alex Williamson
  2022-11-29 16:06             ` Lukas Wunner
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2022-11-29 15:46 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci

On Tue, 29 Nov 2022 17:07:21 +0200
Mika Westerberg <mika.westerberg@linux.intel.com> wrote:

> Hi,
> 
> On Tue, Nov 29, 2022 at 06:52:42AM -0700, Alex Williamson wrote:
> > On Tue, 29 Nov 2022 07:48:12 +0100
> > Lukas Wunner <lukas@wunner.de> wrote:
> >   
> > > On Mon, Nov 28, 2022 at 03:06:17PM -0700, Alex Williamson wrote:  
> > > > Agreed.  Is this convoluted removal process being used to force a SBR,
> > > > versus a FLR or PM reset that might otherwise be used by twiddling the
> > > > reset attribute of the GPU directly?  If so, the reset_method attribute
> > > > can be used to force a bus reset and perform all the state save/restore
> > > > handling to avoid reallocating BARs.  A reset from the upstream switch
> > > > port would only be necessary if you have some reason to also reset the
> > > > switch downstream ports.  Thanks,    
> > > 
> > > A Secondary Bus Reset is only offered as a reset_method if the
> > > device to be reset is the *only* child of the upstream bridge.
> > > I.e. if the device to be reset has siblings or children,
> > > a Secondary Bus Reset is not permitted.
> > > 
> > > Modern GPUs (including the one Mika is referring to) consist of
> > > a PCIe switch with the GPU, HD audio and telemetry devices below
> > > Downstream Bridges.  A Secondary Bus Reset of the Root Port is
> > > not allowed in this case because the Switch Upstream Port has
> > > children.  
> > 
> > I didn't see such functions in the log provided, the GPU in question
> > seems to be a single function device at 53:00.0.  This matches what
> > I've seen on an ARC A380 GPU where the GPU and HD audio are each single
> > function devices under separate downstream ports of a PCIe switch.  
> 
> Yes, this one is similar. There is a PCIe switch and then a bunch of
> devices connected to the downstream ports, one of them being the GPU.
> 
> Sorry if I missed that part in the report.
> 
> There are typically multiple of these cards and they want to perform the
> reset separately for each.

Maybe the elephant in the room is why it's apparently such common
practice to need to perform a hard reset of these devices outside of
virtualization scenarios...
 
> > > See this code in pci_parent_bus_reset():
> > > 
> > > 	if (pci_is_root_bus(dev->bus) || dev->subordinate ||
> > > 	    !dev->bus->self || dev->dev_flags & PCI_DEV_FLAGS_NO_BUS_RESET)
> > > 		return -ENOTTY;
> > > 
> > > The dev->subordinate check disallows a SBR if there are children.
> > > Note that the code should probably instead check for...
> > > (dev->subordinate && !list_empty(dev->subordinate->devices))
> > > ...because the port may have a subordinate bus without children
> > > (may have been removed for example).
> > > 
> > > The "no siblings" rule is enforced by:
> > > 
> > > 	list_for_each_entry(pdev, &dev->bus->devices, bus_list)
> > > 		if (pdev != dev)
> > > 			return -ENOTTY;
> > > 
> > > Note that the devices list is iterated without holding pci_bus_sem,
> > > which looks fishy.
> > > 
> > > That said, it *is* possible that a Secondary Bus Reset is erroneously
> > > offered despite these checks because we perform them early on device
> > > enumeration when the subordinate bus hasn't been scanned yet.
> > > 
> > > So if the Root Port offers other reset methods besides SBR and the
> > > user switches to one of them, then reinstates the defaults,
> > > suddenly SBR will disappear because the subordinate bus has since
> > > been scanned.  What's missing here is that we re-check availability
> > > of the reset methods on siblings and the parent when a device is
> > > added or removed.  This is also necessary to make reset_method
> > > work properly with hotplug.  However, the result may be that the
> > > reset_method attribute in sysfs may become invisible after adding
> > > a device (because there is no reset method available) and reappear
> > > after removing a device.
> > > 
> > > So the reset_method logic is pretty broken right now I'm afraid.  
> > 
> > I haven't checked for a while, but I thought we exposed SBR regardless
> > of siblings, though it can't be accessed via the reset attribute if
> > there are siblings.  That allows that the sibling devices could be soft
> > removed, a reset performed, and the bus re-scanned.  If there are in
> > fact sibling devices, it would make more sense to remove only those to
> > effect a bus reset to avoid the resource issues with rescanning SR-IOV
> > on the GPU.  
> 
> If I understand correctly they perform the reset just above the upstream
> port of the PCIe switch so that it resets the whole "card".

... and even to the extent that an entire PCIe hierarchy is being reset
rather than individual devices, let alone leaf buses.
 
> > > In any case, for Mika's use case it would be useful to have a
> > > "reset_subordinate" attribute on ports capable of a SBR such that
> > > the entire hierarchy below is reset.  The "reset" attribute is
> > > insufficient.  
> > 
> > I'll toss out that a pretty simple vfio tool can be written to bind all
> > the siblings on a bus enabling the hot reset ioctl in vfio.  Thanks,  
> 
> Sounds good but unfortunately I'm not fluent in vfio so I have no idea
> how this simple tool could be done :( Do you have any examples or
> pointers that we could use to try this out?

Unfortunately, if your goal is really to reset the whole switch, vfio
isn't going to help with that.  The vfio-pci driver only binds to
endpoints, and a hot reset can be effected for a given endpoint provided
any sibling devices are also owned by the same user.  There's a
rudimentary unit test below which implements this.

I think the reason we don't have a "reset_subordinate" attribute is
that there are a number of difficult issues in locking down an entire
hierarchy that go beyond what vfio does in validating the user's proof
of ownership for affected endpoints on a leaf bus.  Thanks,

Alex

#include <errno.h>
#include <libgen.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/param.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/vfs.h>

#include <linux/ioctl.h>
#include <linux/vfio.h>

void usage(char *name)
{
	printf("usage: %s <iommu group id> <ssss:bb:dd.f>\n", name);
}

#define false 0
#define true 1

int main(int argc, char **argv)
{
	int i, j, ret, container, *pfd;
	char path[PATH_MAX];

	struct vfio_group_status group_status = {
		.argsz = sizeof(group_status)
	};

	struct vfio_pci_hot_reset_info *reset_info;
	struct vfio_pci_dependent_device *devices;
	struct vfio_pci_hot_reset *reset;

	struct reset_dev {
		int groupid;
		int seg;
		int bus;
		int dev;
		int func;
		int fd;
		int group;
	} *reset_devs;

	if (argc < 3) {
		usage(argv[0]);
		return -1;
	}

	printf("Expect %d group/device pairs\n", (argc - 1)/2);

	reset_devs = calloc((argc - 1)/2, sizeof(struct reset_dev));
	if (!reset_devs)
		return -1;

	for (i = 0; i < (argc - 1)/2; i++) {
		ret = sscanf(argv[i*2 + 1], "%d", &reset_devs[i].groupid);
		if (ret != 1) {
			usage(argv[0]);
			return -1;
		}

		ret = sscanf(argv[i*2 + 2], "%04x:%02x:%02x.%d",
			     &reset_devs[i].seg, &reset_devs[i].bus,
			     &reset_devs[i].dev, &reset_devs[i].func);
		if (ret != 4) {
			usage(argv[0]);
			return -1;
		}

		printf("Using PCI device %04x:%02x:%02x.%d in group %d "
		       "for hot reset test\n", reset_devs[i].seg,
		       reset_devs[i].bus, reset_devs[i].dev,
		       reset_devs[i].func, reset_devs[i].groupid);
	}

	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0) {
		printf("Failed to open /dev/vfio/vfio, %d (%s)\n",
		       container, strerror(errno));
		return container;
	}

	for (i = 0; i < (argc - 1)/2; i++) {
		snprintf(path, sizeof(path), "/dev/vfio/%d",
			 reset_devs[i].groupid);
		reset_devs[i].group = open(path, O_RDWR);
		if (reset_devs[i].group < 0) {
			printf("Failed to open %s, %d (%s)\n",
			path, reset_devs[i].group, strerror(errno));
			return reset_devs[i].group;
		}

		ret = ioctl(reset_devs[i].group, VFIO_GROUP_GET_STATUS,
			    &group_status);
		if (ret) {
			printf("ioctl(VFIO_GROUP_GET_STATUS) failed\n");
			return ret;
		}

		if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
			printf("Group not viable, are all devices attached to vfio?\n");
			return -1;
		}

		ret = ioctl(reset_devs[i].group, VFIO_GROUP_SET_CONTAINER,
			    &container);
		if (ret) {
			printf("Failed to set group container\n");
			return ret;
		}

		if (i == 0) {
			ret = ioctl(container, VFIO_SET_IOMMU,
				    VFIO_TYPE1_IOMMU);
			if (ret) {
				printf("Failed to set IOMMU\n");
				return ret;
			}
		}

		snprintf(path, sizeof(path), "%04x:%02x:%02x.%d",
			 reset_devs[i].seg, reset_devs[i].bus,
			 reset_devs[i].dev, reset_devs[i].func);

		reset_devs[i].fd = ioctl(reset_devs[i].group,
					 VFIO_GROUP_GET_DEVICE_FD, path);
		if (reset_devs[i].fd < 0) {
			printf("Failed to get device %s\n", path);
			return -1;
		}
	}
	getchar();

	reset_info = malloc(sizeof(*reset_info));
	if (!reset_info) {
		printf("Failed to alloc info struct\n");
		return -ENOMEM;
	}

	reset_info->argsz = sizeof(*reset_info);

	ret = ioctl(reset_devs[0].fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO,
		    reset_info);
	if (ret && errno == ENODEV) {
		printf("Device does not support hot reset\n");
		return 0;
	}
	if (!ret || errno != ENOSPC) {
		printf("Expected fail/-ENOSPC, got %d/%d\n", ret, -errno);
		return -1;
	}

	printf("Dependent device count: %d\n", reset_info->count);

	reset_info = realloc(reset_info, sizeof(*reset_info) +
			     (reset_info->count * sizeof(*devices)));
	if (!reset_info) {
		printf("Failed to re-alloc info struct\n");
		return -ENOMEM;
	}

	reset_info->argsz = sizeof(*reset_info) +
			    (reset_info->count * sizeof(*devices));
	ret = ioctl(reset_devs[0].fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO,
		    reset_info);
	if (ret) {
		printf("Reset Info error\n");
		return ret;
	}

	devices = &reset_info->devices[0];

	for (i = 0; i < reset_info->count; i++)
		printf("%d: %04x:%02x:%02x.%d group %d\n", i,
		       devices[i].segment, devices[i].bus,
		       devices[i].devfn >> 3, devices[i].devfn & 7,
		       devices[i].group_id);

	printf("Press any key to perform reset\n");
	getchar();
	printf("Attempting reset: ");
	fflush(stdout);

	reset = malloc(sizeof(*reset) + (sizeof(*pfd) * reset_info->count));
	if (!reset) {
		printf("Failed to alloc reset struct\n");
		return -ENOMEM;
	}
	pfd = &reset->group_fds[0];

	for (i = 0; i < reset_info->count; i++) {
		for (j = 0; j < (argc - 1)/2; j++) {
			if (devices[i].group_id == reset_devs[j].groupid) {
				*pfd++ = reset_devs[j].group;
				break;
			}
		}

		if (j == (argc - 1)/2) {
			printf("Group %d not found\n", devices[i].group_id);
			return -1;
		}
	}

	reset->argsz = sizeof(*reset) + (sizeof(*pfd) * reset_info->count);
	reset->count = reset_info->count;
	reset->flags = 0;

	ret = ioctl(reset_devs[0].fd, VFIO_DEVICE_PCI_HOT_RESET, reset);
	printf("%s\n", ret ? "Failed" : "Pass");

	return ret;
}
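
A hypothetical usage sketch for the tool above, assuming it is compiled
as `vfio-hot-reset` (the BDF below is a placeholder; as with any vfio
setup, every endpoint in the affected groups must be bound to vfio-pci,
repeating the unbind/override steps per device). The function prints the
steps rather than executing them:

```shell
#!/bin/sh
# Hypothetical setup steps for the unit test above; the BDF is a placeholder.
BDF=0000:53:00.0

vfio_bind_steps() {
	# Rebind the endpoint from its current driver to vfio-pci.
	echo "echo vfio-pci > /sys/bus/pci/devices/$BDF/driver_override"
	echo "echo $BDF > /sys/bus/pci/devices/$BDF/driver/unbind"
	echo "echo $BDF > /sys/bus/pci/drivers_probe"
	# Find the IOMMU group and run the tool with group/BDF pairs.
	echo "group=\$(basename \$(readlink /sys/bus/pci/devices/$BDF/iommu_group))"
	echo "./vfio-hot-reset \$group $BDF"
}

# Review the output, then run it as root:  vfio_bind_steps | sh
vfio_bind_steps
```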



* Re: PCI resource allocation mismatch with BIOS
  2022-11-29 15:46           ` Alex Williamson
@ 2022-11-29 16:06             ` Lukas Wunner
  2022-11-29 16:12               ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: Lukas Wunner @ 2022-11-29 16:06 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Mika Westerberg, Bjorn Helgaas, Bjorn Helgaas, linux-pci

On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote:
> Maybe the elephant in the room is why it's apparently such common
> practice to need to perform a hard reset of these devices outside of
> virtualization scenarios...

These GPUs are used as accelerators in cloud environments.

They're reset to a pristine state when handed out to another tenant
to avoid info leaks from the previous tenant.

That should be a legitimate usage of PCIe reset, no?

Thanks,

Lukas


* Re: PCI resource allocation mismatch with BIOS
  2022-11-29 16:06             ` Lukas Wunner
@ 2022-11-29 16:12               ` Alex Williamson
  2022-11-30  7:43                 ` Lukas Wunner
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2022-11-29 16:12 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Mika Westerberg, Bjorn Helgaas, Bjorn Helgaas, linux-pci

On Tue, 29 Nov 2022 17:06:26 +0100
Lukas Wunner <lukas@wunner.de> wrote:

> On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote:
> > Maybe the elephant in the room is why it's apparently such common
> > practice to need to perform a hard reset of these devices outside of
> > virtualization scenarios...  
> 
> These GPUs are used as accelerators in cloud environments.
> 
> They're reset to a pristine state when handed out to another tenant
> to avoid info leaks from the previous tenant.
> 
> That should be a legitimate usage of PCIe reset, no?

Absolutely, but why the whole switch?  Thanks,

Alex



* Re: PCI resource allocation mismatch with BIOS
  2022-11-29 16:12               ` Alex Williamson
@ 2022-11-30  7:43                 ` Lukas Wunner
  2022-11-30  7:57                   ` Mika Westerberg
  0 siblings, 1 reply; 15+ messages in thread
From: Lukas Wunner @ 2022-11-30  7:43 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Mika Westerberg, Bjorn Helgaas, Bjorn Helgaas, linux-pci

On Tue, Nov 29, 2022 at 09:12:49AM -0700, Alex Williamson wrote:
> On Tue, 29 Nov 2022 17:06:26 +0100 Lukas Wunner <lukas@wunner.de> wrote:
> > On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote:
> > > Maybe the elephant in the room is why it's apparently such common
> > > practice to need to perform a hard reset of these devices outside of
> > > virtualization scenarios...  
> > 
> > These GPUs are used as accelerators in cloud environments.
> > 
> > They're reset to a pristine state when handed out to another tenant
> > to avoid info leaks from the previous tenant.
> > 
> > That should be a legitimate usage of PCIe reset, no?
> 
> Absolutely, but why the whole switch?  Thanks,

The reset is propagated down the hierarchy, so by resetting the
Switch Upstream Port, it is guaranteed that all endpoints are
reset with just a single operation.  Per PCIe r6.0.1 sec 6.6.1:

   "For a Switch, the following must cause a hot reset to be sent
    on all Downstream Ports:
    [...]
    Receiving a hot reset on the Upstream Port"

Thanks,

Lukas
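
For illustration, the hot reset described here is triggered by the
Secondary Bus Reset bit (bit 6) of the Bridge Control register in the
bridge above the hierarchy to be reset. A hedged sketch using setpci:
the BDF is a placeholder, and the delays only approximate what the
kernel's pci_reset_secondary_bus() does. The function prints the
commands for review:

```shell
#!/bin/sh
# Sketch of a hot reset via the Bridge Control Secondary Bus Reset bit.
# The BDF is a placeholder (e.g. the root port above the switch so the
# reset propagates down); timings approximate the kernel's behavior.
PORT=0000:52:01.0

sbr_via_setpci() {
	echo "setpci -s $PORT BRIDGE_CONTROL=0x40:0x40"  # assert SBR (bit 6)
	echo "sleep 0.002"                               # keep asserted ~2 ms
	echo "setpci -s $PORT BRIDGE_CONTROL=0x00:0x40"  # deassert SBR
	echo "sleep 1"                                   # let devices return
}

# Review the output, then run it as root:  sbr_via_setpci | sh
sbr_via_setpci
```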


* Re: PCI resource allocation mismatch with BIOS
  2022-11-30  7:43                 ` Lukas Wunner
@ 2022-11-30  7:57                   ` Mika Westerberg
  2022-11-30 15:47                     ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: Mika Westerberg @ 2022-11-30  7:57 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Alex Williamson, Bjorn Helgaas, Bjorn Helgaas, linux-pci

Hi,

On Wed, Nov 30, 2022 at 08:43:47AM +0100, Lukas Wunner wrote:
> On Tue, Nov 29, 2022 at 09:12:49AM -0700, Alex Williamson wrote:
> > On Tue, 29 Nov 2022 17:06:26 +0100 Lukas Wunner <lukas@wunner.de> wrote:
> > > On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote:
> > > > Maybe the elephant in the room is why it's apparently such common
> > > > practice to need to perform a hard reset of these devices outside of
> > > > virtualization scenarios...  
> > > 
> > > These GPUs are used as accelerators in cloud environments.
> > > 
> > > They're reset to a pristine state when handed out to another tenant
> > > to avoid info leaks from the previous tenant.
> > > 
> > > That should be a legitimate usage of PCIe reset, no?
> > 
> > Absolutely, but why the whole switch?  Thanks,
> 
> The reset is propagated down the hierarchy, so by resetting the
> Switch Upstream Port, it is guaranteed that all endpoints are
> reset with just a single operation.  Per PCIe r6.0.1 sec 6.6.1:
> 
>    "For a Switch, the following must cause a hot reset to be sent
>     on all Downstream Ports:
>     [...]
>     Receiving a hot reset on the Upstream Port"

Adding here the reason I got from the GPU folks:

In addition to the use case where the GPU is reset when switched to
another tenant, this is used for recovery. The "first level" recovery is
handled by the graphics driver, which does a Function Level Reset, but
if that does not work data centers may trigger a reset at a higher level
(root port or switch upstream port) to reset the whole card. This is the
so-called "big hammer".

There is another use case too: firmware upgrade. This allows data
centers to upgrade firmware on those cards without needing to reboot;
they just reset the whole thing to make it run the new firmware.


* Re: PCI resource allocation mismatch with BIOS
  2022-11-30  7:57                   ` Mika Westerberg
@ 2022-11-30 15:47                     ` Alex Williamson
  2022-12-01  9:41                       ` Mika Westerberg
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2022-11-30 15:47 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci

On Wed, 30 Nov 2022 09:57:18 +0200
Mika Westerberg <mika.westerberg@linux.intel.com> wrote:

> Hi,
> 
> On Wed, Nov 30, 2022 at 08:43:47AM +0100, Lukas Wunner wrote:
> > On Tue, Nov 29, 2022 at 09:12:49AM -0700, Alex Williamson wrote:  
> > > On Tue, 29 Nov 2022 17:06:26 +0100 Lukas Wunner <lukas@wunner.de> wrote:  
> > > > On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote:  
> > > > > Maybe the elephant in the room is why it's apparently such common
> > > > > practice to need to perform a hard reset of these devices outside of
> > > > > virtualization scenarios...    
> > > > 
> > > > These GPUs are used as accelerators in cloud environments.
> > > > 
> > > > They're reset to a pristine state when handed out to another tenant
> > > > to avoid info leaks from the previous tenant.
> > > > 
> > > > That should be a legitimate usage of PCIe reset, no?  
> > > 
> > > Absolutely, but why the whole switch?  Thanks,  
> > 
> > The reset is propagated down the hierarchy, so by resetting the
> > Switch Upstream Port, it is guaranteed that all endpoints are
> > reset with just a single operation.  Per PCIe r6.0.1 sec 6.6.1:
> > 
> >    "For a Switch, the following must cause a hot reset to be sent
> >     on all Downstream Ports:
> >     [...]
> >     Receiving a hot reset on the Upstream Port"  

This was never in doubt.
 
> Adding here the reason I got from the GPU folks:
> 
> In addition to the use case when the GPU is reset when switched to
> another tenant, this is used for recovery. The "first level" recovery is
> handled by the graphics driver that does Function Level Reset but if
> that does not work data centers may trigger reset at higher level (root
> port or switch upstream port) to reset the whole card. So called "big
> hammer".

Ok, so the first level issue can be solved by the reset_method
attribute, allowing userspace to trigger a SBR for a single function
endpoint rather than an FLR (though this might suggest some hardware
issues in the FLR implementation).  This will save and restore the
device config space, so there should be no resource reallocation issues.
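
The reset_method flow described above can be sketched against the
per-device sysfs attributes (the BDF is a placeholder; the function
prints the commands for review rather than executing them):

```shell
#!/bin/sh
# Sketch of forcing a Secondary Bus Reset via the reset_method attribute.
# The BDF is a placeholder.
BDF=0000:53:00.0
SYSFS=/sys/bus/pci/devices/$BDF

reset_method_steps() {
	echo "cat $SYSFS/reset_method"          # list supported methods, e.g. "flr bus"
	echo "echo bus > $SYSFS/reset_method"   # restrict to Secondary Bus Reset
	echo "echo 1 > $SYSFS/reset"            # reset; config space is saved/restored
}

# Review the output, then run it as root:  reset_method_steps | sh
reset_method_steps
```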
 
> There is another use case too - firmware upgrade. This allows data
> centers to upgrade firmware on those cards without need to reboot - they
> just reset the whole thing to make it run the new firmware.

The above can also be used for this, so long as the firmware update
only targets the GPU and the device config space layout and features are
not modified by the update.  This touches on a long-standing issue Bjorn
has had with performing bus resets where we assume we get back
essentially the same devices after the reset.  That's not guaranteed
with a pending firmware update.

From the logs in this case, I notice that the BIOS didn't assign the
ROM BAR for the GPU while Linux did when re-scanning.  That's a
different address space from the SR-IOV BARs, so doesn't explain this
failure, but I wonder if there's another resource in the hierarchy that
the BIOS didn't program, ex. BAR0 on the upstream switch maybe?
Possibly if the host were booted with pci=realloc, Linux might expand
the resources to something it's happy with, allowing future re-scans.
Otherwise I think we'd need a log of all BIOS vs Linux resource
allocations from the root port down to see what might be the issue with
the rescan.  Thanks,

Alex



* Re: PCI resource allocation mismatch with BIOS
  2022-11-30 15:47                     ` Alex Williamson
@ 2022-12-01  9:41                       ` Mika Westerberg
  2022-12-09 11:08                         ` Mika Westerberg
  0 siblings, 1 reply; 15+ messages in thread
From: Mika Westerberg @ 2022-12-01  9:41 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci

Hi,

On Wed, Nov 30, 2022 at 08:47:38AM -0700, Alex Williamson wrote:
> On Wed, 30 Nov 2022 09:57:18 +0200
> Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> 
> > Hi,
> > 
> > On Wed, Nov 30, 2022 at 08:43:47AM +0100, Lukas Wunner wrote:
> > > On Tue, Nov 29, 2022 at 09:12:49AM -0700, Alex Williamson wrote:  
> > > > On Tue, 29 Nov 2022 17:06:26 +0100 Lukas Wunner <lukas@wunner.de> wrote:  
> > > > > On Tue, Nov 29, 2022 at 08:46:46AM -0700, Alex Williamson wrote:  
> > > > > > Maybe the elephant in the room is why it's apparently such common
> > > > > > practice to need to perform a hard reset of these devices outside of
> > > > > > virtualization scenarios...    
> > > > > 
> > > > > These GPUs are used as accelerators in cloud environments.
> > > > > 
> > > > > They're reset to a pristine state when handed out to another tenant
> > > > > to avoid info leaks from the previous tenant.
> > > > > 
> > > > > That should be a legitimate usage of PCIe reset, no?  
> > > > 
> > > > Absolutely, but why the whole switch?  Thanks,  
> > > 
> > > The reset is propagated down the hierarchy, so by resetting the
> > > Switch Upstream Port, it is guaranteed that all endpoints are
> > > reset with just a single operation.  Per PCIe r6.0.1 sec 6.6.1:
> > > 
> > >    "For a Switch, the following must cause a hot reset to be sent
> > >     on all Downstream Ports:
> > >     [...]
> > >     Receiving a hot reset on the Upstream Port"  
> 
> This was never in doubt.
>  
> > Adding here the reason I got from the GPU folks:
> > 
> > In addition to the use case when the GPU is reset when switched to
> > another tenant, this is used for recovery. The "first level" recovery is
> > handled by the graphics driver that does Function Level Reset but if
> > that does not work data centers may trigger reset at higher level (root
> > port or switch upstream port) to reset the whole card. So called "big
> > hammer".
> 
> Ok, so the first level issue can be solved by the reset_method
> attribute, allowing userspace to trigger a SBR for a single function
> endpoint rather than an FLR (though this might suggest some hardware
> issues in the FLR implementation).  This will save and restore the
> device config space, so there should be no resource reallocation issues.

Right, I think that already works.

> > There is another use case too - firmware upgrade. This allows data
> > centers to upgrade firmware on those cards without need to reboot - they
> > just reset the whole thing to make it run the new firmware.
> 
> The above can also be used for this, so long as the firmware update
> only targets the GPU and the device config space layout and features
> are not modified by the update.  This touches on a long-standing issue Bjorn
> has had with performing bus resets where we assume we get back
> essentially the same devices after the reset.  That's not guaranteed
> with a pending firmware update.

Fair enough, but if the firmware does not change the resource
consumption then we would get back the same devices. I guess that is the
use case with these cards.

> From the logs in this case, I notice that the BIOS didn't assign the
> ROM BAR for the GPU while Linux did when re-scanning.  That's a
> different address space from the SR-IOV BARs, so doesn't explain this
> failure, but I wonder if there's another resource in the hierarchy that
> the BIOS didn't program, ex. BAR0 on the upstream switch maybe?

I don't think there is.

> Possibly if the host were booted with pci=realloc, Linux might expand
> the resources to something it's happy with, allowing future re-scans.

This was tried (sorry, I did not mention that) and it did not work
either because of the way Linux sizes the bridge windows. The BIOS
allocates a "minimal" window that fulfills the alignment requirement, so
with just a 16G BAR and a 16M BAR we get a 16G+16M bridge window aligned
to the 16G boundary. Linux, however, also tries to align the size of the
window, so the 16G+16M gets rounded up to the next 8G boundary, yielding
a 24G memory window. This does not fit into the resources the BIOS
allocated.
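
The size difference described above can be checked with back-of-envelope
arithmetic. This is a simplified model of the rounding effect only, not
the kernel's actual bridge-window sizing algorithm:

```shell
#!/bin/sh
# Simplified model of the window sizing described above (sizes in MiB).
# Not the kernel's actual algorithm - just the rounding effect.
BAR_BIG=16384     # 16G SR-IOV BAR
BAR_SMALL=16      # 16M BAR
SIZE_ALIGN=8192   # 8G granularity applied to the window size

# Round $1 up to the next multiple of $2.
align_up() { echo $(( (($1 + $2 - 1) / $2) * $2 )); }

bios_window=$(( BAR_BIG + BAR_SMALL ))                 # minimal: 16400 MiB
linux_window=$(align_up "$bios_window" "$SIZE_ALIGN")  # rounded: 24576 MiB = 24G

echo "BIOS window:  $bios_window MiB"
echo "Linux window: $linux_window MiB"
```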

With pci=realloc Linux tries to release the resources from the
upper-level bridges and re-allocate more, but the IOV resources are put
into the "additional" list, so it is fine if those don't get allocated
in the end.

> Otherwise I think we'd need a log of all BIOS vs Linux resource
> allocations from the root port down to see what might be the issue with
> the rescan.  Thanks,

Sure, I will share dmesg from that system showing the initial allocation
and the re-scan as soon as I get confirmation that there is nothing
under embargo in there.


* Re: PCI resource allocation mismatch with BIOS
  2022-12-01  9:41                       ` Mika Westerberg
@ 2022-12-09 11:08                         ` Mika Westerberg
  0 siblings, 0 replies; 15+ messages in thread
From: Mika Westerberg @ 2022-12-09 11:08 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Lukas Wunner, Bjorn Helgaas, Bjorn Helgaas, linux-pci

Hi,

On Thu, Dec 01, 2022 at 11:41:18AM +0200, Mika Westerberg wrote:
> > Otherwise I think we'd need a log of all BIOS vs Linux resource
> > allocations from the root port down to see what might be the issue with
> > the rescan.  Thanks,
> 
> Sure, I will share dmesg from that system showing the initial allocation
> and the re-scan as soon as I get confirmation that there is nothing
> under embargo in there.

Sorry for the delay. Took some time to get all the confirmations.

I created a bug here:

https://bugzilla.kernel.org/show_bug.cgi?id=216795

I also attached relevant dmesg and lspci outputs. I have now access to
the system so I can do additional testing as needed.

Thanks!


end of thread, other threads:[~2022-12-09 11:08 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-28 11:14 PCI resource allocation mismatch with BIOS Mika Westerberg
2022-11-28 20:39 ` Bjorn Helgaas
2022-11-28 22:06   ` Alex Williamson
2022-11-29  6:48     ` Lukas Wunner
2022-11-29 10:09       ` Mika Westerberg
2022-11-29 13:52       ` Alex Williamson
2022-11-29 15:07         ` Mika Westerberg
2022-11-29 15:46           ` Alex Williamson
2022-11-29 16:06             ` Lukas Wunner
2022-11-29 16:12               ` Alex Williamson
2022-11-30  7:43                 ` Lukas Wunner
2022-11-30  7:57                   ` Mika Westerberg
2022-11-30 15:47                     ` Alex Williamson
2022-12-01  9:41                       ` Mika Westerberg
2022-12-09 11:08                         ` Mika Westerberg
