* [question]: BAR allocation failing
@ 2021-07-14 21:32 Ruben
  2021-07-14 22:03 ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Ruben @ 2021-07-14 21:32 UTC (permalink / raw)
  To: linux-pci; +Cc: bhelgaas, alex.williamson

I am experiencing an issue with virtualizing a machine that contains
8 NVIDIA A100 80GB cards.
As a bare-metal host the machine behaves as expected: the GPUs are
connected to the host through a PLX PEX88096 switch, which connects
2 GPUs to 16 lanes on the CPU (using the NVIDIA HGX Delta baseboard).
When all GPUs and NVLink bridges are passed through to a VM, the
problem is that the guest can only initialize 4-5 of the 8 GPUs.

The dmesg log shows failed attempts at assigning BAR space to the GPUs
that do not get initialized.

Things that were tried:
- Q35 and i440fx machine types, with and without UEFI
- QEMU 5.x and QEMU 6.0
- Ubuntu 20.04 host with QEMU/libvirt
- currently Proxmox 7 on Debian 11, host kernel 5.11.22-2, VM kernel 5.4.0-77
- VM kernel parameters pci=nocrs and pci=realloc=on/off
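
For completeness: whichever frontend is in use, the passthrough boils
down to QEMU getting one vfio-pci device per GPU and NVLink bridge,
roughly like this (the host addresses below are placeholders, not the
real topology):

  -device vfio-pci,host=0000:25:00.0 \
  -device vfio-pci,host=0000:26:00.0 \
  ...  (one entry per GPU and per bridge)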

------------------------------------

lspci -v:
01:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
        Memory at db000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 2000000000 (64-bit, prefetchable) [size=128G]
        Memory at 1000000000 (64-bit, prefetchable) [size=32M]

02:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
        Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=128G]
        Memory at 6000000000 (64-bit, prefetchable) [size=32M]

...

0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
        Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
        Memory at <ignored> (64-bit, prefetchable)
        Memory at <ignored> (64-bit, prefetchable)

...

------------------------------------

root@a100:~# dmesg | grep 01:00
[    0.674363] pci 0000:01:00.0: [10de:20b2] type 00 class 0x030200
[    0.674884] pci 0000:01:00.0: reg 0x10: [mem 0xff000000-0xffffffff]
[    0.675010] pci 0000:01:00.0: reg 0x14: [mem
0xffffffe000000000-0xffffffffffffffff 64bit pref]
[    0.675129] pci 0000:01:00.0: reg 0x1c: [mem
0xfffffffffe000000-0xffffffffffffffff 64bit pref]
[    0.675416] pci 0000:01:00.0: Max Payload Size set to 128 (was 256, max 256)
[    0.675567] pci 0000:01:00.0: Enabling HDA controller
[    0.676324] pci 0000:01:00.0: PME# supported from D0 D3hot
[    1.377980] pci 0000:01:00.0: can't claim BAR 0 [mem
0xff000000-0xffffffff]: no compatible bridge window
[    1.377983] pci 0000:01:00.0: can't claim BAR 1 [mem
0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible
bridge window
[    1.377986] pci 0000:01:00.0: can't claim BAR 3 [mem
0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible
bridge window
[    1.403889] pci 0000:01:00.0: BAR 1: assigned [mem
0x2000000000-0x3fffffffff 64bit pref]
[    1.404120] pci 0000:01:00.0: BAR 3: assigned [mem
0x1000000000-0x1001ffffff 64bit pref]
[    1.404335] pci 0000:01:00.0: BAR 0: assigned [mem 0xcf000000-0xcfffffff]
[    4.214191] nvidia 0000:01:00.0: enabling device (0000 -> 0002)
[   15.185625] [drm] Initialized nvidia-drm 0.0.0 20160202 for
0000:01:00.0 on minor 1

root@a100:~# dmesg | grep 06:00
[    0.724589] pci 0000:06:00.0: [10de:20b2] type 00 class 0x030200
[    0.724975] pci 0000:06:00.0: reg 0x10: [mem 0xff000000-0xffffffff]
[    0.725069] pci 0000:06:00.0: reg 0x14: [mem
0xffffffe000000000-0xffffffffffffffff 64bit pref]
[    0.725146] pci 0000:06:00.0: reg 0x1c: [mem
0xfffffffffe000000-0xffffffffffffffff 64bit pref]
[    0.725343] pci 0000:06:00.0: Max Payload Size set to 128 (was 256, max 256)
[    0.725471] pci 0000:06:00.0: Enabling HDA controller
[    0.726051] pci 0000:06:00.0: PME# supported from D0 D3hot
[    1.378149] pci 0000:06:00.0: can't claim BAR 0 [mem
0xff000000-0xffffffff]: no compatible bridge window
[    1.378151] pci 0000:06:00.0: can't claim BAR 1 [mem
0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible
bridge window
[    1.378154] pci 0000:06:00.0: can't claim BAR 3 [mem
0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible
bridge window
[    1.421549] pci 0000:06:00.0: BAR 1: no space for [mem size
0x2000000000 64bit pref]
[    1.421553] pci 0000:06:00.0: BAR 1: trying firmware assignment
[mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
[    1.421556] pci 0000:06:00.0: BAR 1: [mem
0xffffffe000000000-0xffffffffffffffff 64bit pref] conflicts with PCI
mem [mem 0x00000000-0xffffffffff]
[    1.421559] pci 0000:06:00.0: BAR 1: failed to assign [mem size
0x2000000000 64bit pref]
[    1.421562] pci 0000:06:00.0: BAR 3: no space for [mem size
0x02000000 64bit pref]
[    1.421564] pci 0000:06:00.0: BAR 3: trying firmware assignment
[mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
[    1.421567] pci 0000:06:00.0: BAR 3: [mem
0xfffffffffe000000-0xffffffffffffffff 64bit pref] conflicts with PCI
mem [mem 0x00000000-0xffffffffff]
[    1.421570] pci 0000:06:00.0: BAR 3: failed to assign [mem size
0x02000000 64bit pref]
[    1.421573] pci 0000:06:00.0: BAR 0: assigned [mem 0xd4000000-0xd4ffffff]
[   15.013778] nvidia 0000:06:00.0: enabling device (0000 -> 0002)
[   15.191872] [drm] Initialized nvidia-drm 0.0.0 20160202 for
0000:06:00.0 on minor 6
[   26.946648] NVRM: GPU 0000:06:00.0: RmInitAdapter failed! (0x22:0xffff:662)
[   26.948225] NVRM: GPU 0000:06:00.0: rm_init_adapter failed, device
minor number 5
[   26.982183] NVRM: GPU 0000:06:00.0: RmInitAdapter failed! (0x22:0xffff:662)
[   26.983434] NVRM: GPU 0000:06:00.0: rm_init_adapter failed, device
minor number 5

------------------------------------

I have (blindly) played with parameters like pref64-reserve on the
pcie-root-port, but to be frank I have little clue what I'm doing, so
my question is really a request for suggestions on what to try.
This server will not be running an 8-GPU VM in production, but I have
a few days left to test before it goes to work, and I was hoping to
learn how to overcome this issue in the future.
Please be aware that my knowledge of virtualization and the Linux
kernel does not reach far.

Thanks in advance for your time!


* Re: [question]: BAR allocation failing
  2021-07-14 21:32 [question]: BAR allocation failing Ruben
@ 2021-07-14 22:03 ` Alex Williamson
  2021-07-14 22:43   ` Ruben
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2021-07-14 22:03 UTC (permalink / raw)
  To: Ruben; +Cc: linux-pci, bhelgaas

On Thu, 15 Jul 2021 00:32:30 +0300
Ruben <rubenbryon@gmail.com> wrote:

> I am experiencing an issue with virtualizing a machine that contains
> 8 NVIDIA A100 80GB cards.
> As a bare-metal host the machine behaves as expected: the GPUs are
> connected to the host through a PLX PEX88096 switch, which connects
> 2 GPUs to 16 lanes on the CPU (using the NVIDIA HGX Delta baseboard).
> When all GPUs and NVLink bridges are passed through to a VM, the
> problem is that the guest can only initialize 4-5 of the 8 GPUs.
> 
> The dmesg log shows failed attempts at assigning BAR space to the GPUs
> that do not get initialized.
> 
> Things that were tried:
> - Q35 and i440fx machine types, with and without UEFI
> - QEMU 5.x and QEMU 6.0
> - Ubuntu 20.04 host with QEMU/libvirt
> - currently Proxmox 7 on Debian 11, host kernel 5.11.22-2, VM kernel 5.4.0-77
> - VM kernel parameters pci=nocrs and pci=realloc=on/off
> 
> ------------------------------------
> 
> lspci -v:
> 01:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
>         Memory at db000000 (32-bit, non-prefetchable) [size=16M]
>         Memory at 2000000000 (64-bit, prefetchable) [size=128G]
>         Memory at 1000000000 (64-bit, prefetchable) [size=32M]
> 
> 02:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
>         Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
>         Memory at 4000000000 (64-bit, prefetchable) [size=128G]
>         Memory at 6000000000 (64-bit, prefetchable) [size=32M]
> 
> ...
> 
> 0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
>         Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
>         Memory at <ignored> (64-bit, prefetchable)
>         Memory at <ignored> (64-bit, prefetchable)
> 
> ...
> 
...
> 
> ------------------------------------
> 
> I have (blindly) played with parameters like pref64-reserve on the
> pcie-root-port, but to be frank I have little clue what I'm doing, so
> my question is really a request for suggestions on what to try.
> This server will not be running an 8-GPU VM in production, but I have
> a few days left to test before it goes to work, and I was hoping to
> learn how to overcome this issue in the future.
> Please be aware that my knowledge of virtualization and the Linux
> kernel does not reach far.

Try playing with the QEMU "-global q35-host.pci-hole64-size=" option for
the VM rather than pci=nocrs.  The default 64-bit MMIO hole for
QEMU/q35 is only 32GB.  You might be looking at a value like 2048G to
support this setup, but could maybe get away with 1024G if there's room
in 32-bit space for the 3rd BAR.

Note that assigning bridges usually doesn't make a lot of sense and
NVLink is a proprietary black box, so we don't know how to virtualize
it or what the guest drivers will do with it, you're on your own there.
We generally recommend to use vGPUs for such cases so the host driver
can handle all the NVLink aspects for GPU peer-to-peer.  Thanks,

Alex



* Re: [question]: BAR allocation failing
  2021-07-14 22:03 ` Alex Williamson
@ 2021-07-14 22:43   ` Ruben
  2021-07-15 14:49     ` Bjorn Helgaas
  0 siblings, 1 reply; 9+ messages in thread
From: Ruben @ 2021-07-14 22:43 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linux-pci, Bjorn Helgaas

No luck so far with "-global q35-pcihost.pci-hole64-size=2048G"
("-global q35-host.pci-hole64-size=" gave an error "warning: global
q35-host.pci-hole64-size has invalid class name").
The result stays the same.
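
For what it's worth, the option is injected through the Proxmox VM
config (the path below uses a placeholder VM id):

  # /etc/pve/qemu-server/<vmid>.conf
  args: -global q35-pcihost.pci-hole64-size=2048G

Under libvirt the equivalent would be a <qemu:commandline> block, if I
understand the docs correctly.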

When we pass through the NVLink bridges, the (5 working) GPUs can talk
at full P2P bandwidth; this is described in the NVIDIA docs as a valid
option (i.e. passing through all GPUs and NVLink bridges).
In production we pass the bridges through to a service VM which
controls the traffic, which is also described in their docs.

Op do 15 jul. 2021 om 01:03 schreef Alex Williamson
<alex.williamson@redhat.com>:
>
> On Thu, 15 Jul 2021 00:32:30 +0300
> Ruben <rubenbryon@gmail.com> wrote:
>
> > I am experiencing an issue with virtualizing a machine that contains
> > 8 NVIDIA A100 80GB cards.
> > As a bare-metal host the machine behaves as expected: the GPUs are
> > connected to the host through a PLX PEX88096 switch, which connects
> > 2 GPUs to 16 lanes on the CPU (using the NVIDIA HGX Delta baseboard).
> > When all GPUs and NVLink bridges are passed through to a VM, the
> > problem is that the guest can only initialize 4-5 of the 8 GPUs.
> >
> > The dmesg log shows failed attempts at assigning BAR space to the GPUs
> > that do not get initialized.
> >
> > Things that were tried:
> > - Q35 and i440fx machine types, with and without UEFI
> > - QEMU 5.x and QEMU 6.0
> > - Ubuntu 20.04 host with QEMU/libvirt
> > - currently Proxmox 7 on Debian 11, host kernel 5.11.22-2, VM kernel 5.4.0-77
> > - VM kernel parameters pci=nocrs and pci=realloc=on/off
> >
> > ------------------------------------
> >
> > lspci -v:
> > 01:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> >         Memory at db000000 (32-bit, non-prefetchable) [size=16M]
> >         Memory at 2000000000 (64-bit, prefetchable) [size=128G]
> >         Memory at 1000000000 (64-bit, prefetchable) [size=32M]
> >
> > 02:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> >         Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
> >         Memory at 4000000000 (64-bit, prefetchable) [size=128G]
> >         Memory at 6000000000 (64-bit, prefetchable) [size=32M]
> >
> > ...
> >
> > 0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> >         Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
> >         Memory at <ignored> (64-bit, prefetchable)
> >         Memory at <ignored> (64-bit, prefetchable)
> >
> > ...
> >
> ...
> >
> > ------------------------------------
> >
> > I have (blindly) played with parameters like pref64-reserve on the
> > pcie-root-port, but to be frank I have little clue what I'm doing, so
> > my question is really a request for suggestions on what to try.
> > This server will not be running an 8-GPU VM in production, but I have
> > a few days left to test before it goes to work, and I was hoping to
> > learn how to overcome this issue in the future.
> > Please be aware that my knowledge of virtualization and the Linux
> > kernel does not reach far.
>
> Try playing with the QEMU "-global q35-host.pci-hole64-size=" option for
> the VM rather than pci=nocrs.  The default 64-bit MMIO hole for
> QEMU/q35 is only 32GB.  You might be looking at a value like 2048G to
> support this setup, but could maybe get away with 1024G if there's room
> in 32-bit space for the 3rd BAR.
>
> Note that assigning bridges usually doesn't make a lot of sense and
> NVLink is a proprietary black box, so we don't know how to virtualize
> it or what the guest drivers will do with it, you're on your own there.
> We generally recommend to use vGPUs for such cases so the host driver
> can handle all the NVLink aspects for GPU peer-to-peer.  Thanks,
>
> Alex
>


* Re: [question]: BAR allocation failing
  2021-07-14 22:43   ` Ruben
@ 2021-07-15 14:49     ` Bjorn Helgaas
  2021-07-15 20:39       ` Ruben
  0 siblings, 1 reply; 9+ messages in thread
From: Bjorn Helgaas @ 2021-07-15 14:49 UTC (permalink / raw)
  To: Ruben; +Cc: Alex Williamson, linux-pci, Bjorn Helgaas

On Thu, Jul 15, 2021 at 01:43:17AM +0300, Ruben wrote:
> No luck so far with "-global q35-pcihost.pci-hole64-size=2048G"
> ("-global q35-host.pci-hole64-size=" gave an error "warning: global
> q35-host.pci-hole64-size has invalid class name").
> The result stays the same.

Alex will have to chime in about the qemu option problem.

Your dmesg excerpts don't include the host bridge window info, e.g.,
"root bus resource [mem 0x7f800000-0xefffffff window]".  That tells
you what PCI thinks is available for devices.  This info comes from
ACPI, and I don't know whether the BIOS on qemu is smart enough to
compute it based on "q35-host.pci-hole64-size=".  But dmesg will tell
you.
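
For example, something like

  dmesg | grep -E "host bridge window|root bus resource"

should pull out the relevant lines (the exact message text varies a
bit between kernel versions).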

"pci=nocrs" tells the kernel to ignore those windows from ACPI and
pretend everything that's not RAM is available for devices.  Of
course, that's not true in general, so it's not really safe.

PCI resources are hierarchical: an endpoint BAR must be contained
in the Root Port's window, which must in turn be contained in the host
bridge window.  You trimmed most of that information out of your
dmesg log, so we can't see exactly what's wrong.

> When we pass through the NVLink bridges, the (5 working) GPUs can talk
> at full P2P bandwidth; this is described in the NVIDIA docs as a valid
> option (i.e. passing through all GPUs and NVLink bridges).
> In production we pass the bridges through to a service VM which
> controls the traffic, which is also described in their docs.
> 
> Op do 15 jul. 2021 om 01:03 schreef Alex Williamson
> <alex.williamson@redhat.com>:
> >
> > On Thu, 15 Jul 2021 00:32:30 +0300
> > Ruben <rubenbryon@gmail.com> wrote:
> >
> > > I am experiencing an issue with virtualizing a machine that contains
> > > 8 NVIDIA A100 80GB cards.
> > > As a bare-metal host the machine behaves as expected: the GPUs are
> > > connected to the host through a PLX PEX88096 switch, which connects
> > > 2 GPUs to 16 lanes on the CPU (using the NVIDIA HGX Delta baseboard).
> > > When all GPUs and NVLink bridges are passed through to a VM, the
> > > problem is that the guest can only initialize 4-5 of the 8 GPUs.
> > >
> > > The dmesg log shows failed attempts at assigning BAR space to the GPUs
> > > that do not get initialized.
> > >
> > > Things that were tried:
> > > - Q35 and i440fx machine types, with and without UEFI
> > > - QEMU 5.x and QEMU 6.0
> > > - Ubuntu 20.04 host with QEMU/libvirt
> > > - currently Proxmox 7 on Debian 11, host kernel 5.11.22-2, VM kernel 5.4.0-77
> > > - VM kernel parameters pci=nocrs and pci=realloc=on/off
> > >
> > > ------------------------------------
> > >
> > > lspci -v:
> > > 01:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> > >         Memory at db000000 (32-bit, non-prefetchable) [size=16M]
> > >         Memory at 2000000000 (64-bit, prefetchable) [size=128G]
> > >         Memory at 1000000000 (64-bit, prefetchable) [size=32M]
> > >
> > > 02:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> > >         Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
> > >         Memory at 4000000000 (64-bit, prefetchable) [size=128G]
> > >         Memory at 6000000000 (64-bit, prefetchable) [size=32M]
> > >
> > > ...
> > >
> > > 0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> > >         Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
> > >         Memory at <ignored> (64-bit, prefetchable)
> > >         Memory at <ignored> (64-bit, prefetchable)
> > >
> > > ...
> > >
> > ...
> > >
> > > ------------------------------------
> > >
> > > I have (blindly) played with parameters like pref64-reserve on the
> > > pcie-root-port, but to be frank I have little clue what I'm doing, so
> > > my question is really a request for suggestions on what to try.
> > > This server will not be running an 8-GPU VM in production, but I have
> > > a few days left to test before it goes to work, and I was hoping to
> > > learn how to overcome this issue in the future.
> > > Please be aware that my knowledge of virtualization and the Linux
> > > kernel does not reach far.
> >
> > Try playing with the QEMU "-global q35-host.pci-hole64-size=" option for
> > the VM rather than pci=nocrs.  The default 64-bit MMIO hole for
> > QEMU/q35 is only 32GB.  You might be looking at a value like 2048G to
> > support this setup, but could maybe get away with 1024G if there's room
> > in 32-bit space for the 3rd BAR.
> >
> > Note that assigning bridges usually doesn't make a lot of sense and
> > NVLink is a proprietary black box, so we don't know how to virtualize
> > it or what the guest drivers will do with it, you're on your own there.
> > We generally recommend to use vGPUs for such cases so the host driver
> > can handle all the NVLink aspects for GPU peer-to-peer.  Thanks,
> >
> > Alex
> >


* Re: [question]: BAR allocation failing
  2021-07-15 14:49     ` Bjorn Helgaas
@ 2021-07-15 20:39       ` Ruben
  2021-07-15 21:50         ` Keith Busch
  2021-07-15 23:05         ` Bjorn Helgaas
  0 siblings, 2 replies; 9+ messages in thread
From: Ruben @ 2021-07-15 20:39 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Alex Williamson, linux-pci, Bjorn Helgaas

Thanks for the response, here's a link to the entire dmesg log:
https://drive.google.com/file/d/1Uau0cgd2ymYGDXNr1mA9X_UdLoMH_Azn/view

Some entries that might be of interest:

[    0.378712] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
[    0.378712] pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffffff]
[    0.378712] pci_bus 0000:00: root bus resource [bus 00-ff]
...
For GPU 1 on bus 01:00.0 the process goes like this:
[    0.676903] pci 0000:01:00.0: [10de:20b2] type 00 class 0x030200
[    0.677433] pci 0000:01:00.0: reg 0x10: [mem 0xff000000-0xffffffff]
[    0.677551] pci 0000:01:00.0: reg 0x14: [mem
0xffffffe000000000-0xffffffffffffffff 64bit pref]
[    0.677668] pci 0000:01:00.0: reg 0x1c: [mem
0xfffffffffe000000-0xffffffffffffffff 64bit pref]
...
[    1.416980] pci 0000:01:00.0: can't claim BAR 0 [mem
0xff000000-0xffffffff]: no compatible bridge window
[    1.416983] pci 0000:01:00.0: can't claim BAR 1 [mem
0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible
bridge window
[    1.416986] pci 0000:01:00.0: can't claim BAR 3 [mem
0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible
bridge window
....
[    1.445156] pci 0000:01:00.0: BAR 1: assigned [mem
0x2000000000-0x3fffffffff 64bit pref]
[    1.445380] pci 0000:01:00.0: BAR 3: assigned [mem
0x1000000000-0x1001ffffff 64bit pref]
[    1.445589] pci 0000:01:00.0: BAR 0: assigned [mem 0xdb000000-0xdbffffff]

GPU 5 on bus 05:00.0 seems to have taken the last available window for
BAR 1 and 3:
[    1.461179] pci 0000:05:00.0: BAR 1: assigned [mem
0xe000000000-0xffffffffff 64bit pref]
[    1.461361] pci 0000:05:00.0: BAR 3: assigned [mem
0xd000000000-0xd001ffffff 64bit pref]
[    1.461533] pci 0000:05:00.0: BAR 0: assigned [mem 0xdf000000-0xdfffffff]

The last step fails for the GPU on bus 06:00.0:
[    1.463503] pci 0000:06:00.0: BAR 1: no space for [mem size
0x2000000000 64bit pref]
[    1.463508] pci 0000:06:00.0: BAR 1: trying firmware assignment
[mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]
[    1.463511] pci 0000:06:00.0: BAR 1: [mem
0xffffffe000000000-0xffffffffffffffff 64bit pref] conflicts with PCI
mem [mem 0x00000000-0xffffffffff]
[    1.463514] pci 0000:06:00.0: BAR 1: failed to assign [mem size
0x2000000000 64bit pref]
[    1.463517] pci 0000:06:00.0: BAR 3: no space for [mem size
0x02000000 64bit pref]
[    1.463519] pci 0000:06:00.0: BAR 3: trying firmware assignment
[mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
[    1.463522] pci 0000:06:00.0: BAR 3: [mem
0xfffffffffe000000-0xffffffffffffffff 64bit pref] conflicts with PCI
mem [mem 0x00000000-0xffffffffff]
[    1.463525] pci 0000:06:00.0: BAR 3: failed to assign [mem size
0x02000000 64bit pref]
[    1.463527] pci 0000:06:00.0: BAR 0: assigned [mem 0xe0000000-0xe0ffffff]

If I understand correctly, it looks like the root bus window is 40 bits,
i.e. 1024GB?
BAR 3 only takes a small slice, but each 128GB BAR 1 has to be aligned
to a 128GB boundary; BAR 3 of GPU 1 starts at 0x1000000000, so by the
time we get to GPU 6 the 1024GB seems to be used up.
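
Rough arithmetic, if I'm reading the sizes right:

  8 x 128GB (BAR 1)  = 1024GB
  8 x  32MB (BAR 3)  =  256MB

so the BAR 1 regions alone already fill an entire 40-bit (1024GB)
window before the BAR 3s, the 32-bit BARs, RAM and everything else
are even counted.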

It seems that increasing the window size would solve the issue at
hand; however, I haven't got a clue where to start.

Thanks for your input so far, greatly appreciated!

Op do 15 jul. 2021 om 17:49 schreef Bjorn Helgaas <helgaas@kernel.org>:
>
> On Thu, Jul 15, 2021 at 01:43:17AM +0300, Ruben wrote:
> > No luck so far with "-global q35-pcihost.pci-hole64-size=2048G"
> > ("-global q35-host.pci-hole64-size=" gave an error "warning: global
> > q35-host.pci-hole64-size has invalid class name").
> > The result stays the same.
>
> Alex will have to chime in about the qemu option problem.
>
> Your dmesg excerpts don't include the host bridge window info, e.g.,
> "root bus resource [mem 0x7f800000-0xefffffff window]".  That tells
> you what PCI thinks is available for devices.  This info comes from
> ACPI, and I don't know whether the BIOS on qemu is smart enough to
> compute it based on "q35-host.pci-hole64-size=".  But dmesg will tell
> you.
>
> "pci=nocrs" tells the kernel to ignore those windows from ACPI and
> pretend everything that's not RAM is available for devices.  Of
> course, that's not true in general, so it's not really safe.
>
> PCI resources are hierarchical: an endpoint BAR must be contained
> in the Root Port's window, which must in turn be contained in the host
> bridge window.  You trimmed most of that information out of your
> dmesg log, so we can't see exactly what's wrong.
>
> > When we pass through the NVLink bridges, the (5 working) GPUs can talk
> > at full P2P bandwidth; this is described in the NVIDIA docs as a valid
> > option (i.e. passing through all GPUs and NVLink bridges).
> > In production we pass the bridges through to a service VM which
> > controls the traffic, which is also described in their docs.
> >
> > Op do 15 jul. 2021 om 01:03 schreef Alex Williamson
> > <alex.williamson@redhat.com>:
> > >
> > > On Thu, 15 Jul 2021 00:32:30 +0300
> > > Ruben <rubenbryon@gmail.com> wrote:
> > >
> > > > I am experiencing an issue with virtualizing a machine that contains
> > > > 8 NVIDIA A100 80GB cards.
> > > > As a bare-metal host the machine behaves as expected: the GPUs are
> > > > connected to the host through a PLX PEX88096 switch, which connects
> > > > 2 GPUs to 16 lanes on the CPU (using the NVIDIA HGX Delta baseboard).
> > > > When all GPUs and NVLink bridges are passed through to a VM, the
> > > > problem is that the guest can only initialize 4-5 of the 8 GPUs.
> > > >
> > > > The dmesg log shows failed attempts at assigning BAR space to the GPUs
> > > > that do not get initialized.
> > > >
> > > > Things that were tried:
> > > > - Q35 and i440fx machine types, with and without UEFI
> > > > - QEMU 5.x and QEMU 6.0
> > > > - Ubuntu 20.04 host with QEMU/libvirt
> > > > - currently Proxmox 7 on Debian 11, host kernel 5.11.22-2, VM kernel 5.4.0-77
> > > > - VM kernel parameters pci=nocrs and pci=realloc=on/off
> > > >
> > > > ------------------------------------
> > > >
> > > > lspci -v:
> > > > 01:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> > > >         Memory at db000000 (32-bit, non-prefetchable) [size=16M]
> > > >         Memory at 2000000000 (64-bit, prefetchable) [size=128G]
> > > >         Memory at 1000000000 (64-bit, prefetchable) [size=32M]
> > > >
> > > > 02:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> > > >         Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
> > > >         Memory at 4000000000 (64-bit, prefetchable) [size=128G]
> > > >         Memory at 6000000000 (64-bit, prefetchable) [size=32M]
> > > >
> > > > ...
> > > >
> > > > 0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> > > >         Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
> > > >         Memory at <ignored> (64-bit, prefetchable)
> > > >         Memory at <ignored> (64-bit, prefetchable)
> > > >
> > > > ...
> > > >
> > > ...
> > > >
> > > > ------------------------------------
> > > >
> > > > I have (blindly) played with parameters like pref64-reserve on the
> > > > pcie-root-port, but to be frank I have little clue what I'm doing, so
> > > > my question is really a request for suggestions on what to try.
> > > > This server will not be running an 8-GPU VM in production, but I have
> > > > a few days left to test before it goes to work, and I was hoping to
> > > > learn how to overcome this issue in the future.
> > > > Please be aware that my knowledge of virtualization and the Linux
> > > > kernel does not reach far.
> > >
> > > Try playing with the QEMU "-global q35-host.pci-hole64-size=" option for
> > > the VM rather than pci=nocrs.  The default 64-bit MMIO hole for
> > > QEMU/q35 is only 32GB.  You might be looking at a value like 2048G to
> > > support this setup, but could maybe get away with 1024G if there's room
> > > in 32-bit space for the 3rd BAR.
> > >
> > > Note that assigning bridges usually doesn't make a lot of sense and
> > > NVLink is a proprietary black box, so we don't know how to virtualize
> > > it or what the guest drivers will do with it, you're on your own there.
> > > We generally recommend to use vGPUs for such cases so the host driver
> > > can handle all the NVLink aspects for GPU peer-to-peer.  Thanks,
> > >
> > > Alex
> > >


* Re: [question]: BAR allocation failing
  2021-07-15 20:39       ` Ruben
@ 2021-07-15 21:50         ` Keith Busch
  2021-07-15 23:05         ` Bjorn Helgaas
  1 sibling, 0 replies; 9+ messages in thread
From: Keith Busch @ 2021-07-15 21:50 UTC (permalink / raw)
  To: Ruben; +Cc: Bjorn Helgaas, Alex Williamson, linux-pci, Bjorn Helgaas

On Thu, Jul 15, 2021 at 11:39:54PM +0300, Ruben wrote:
> Thanks for the response, here's a link to the entire dmesg log:
> https://drive.google.com/file/d/1Uau0cgd2ymYGDXNr1mA9X_UdLoMH_Azn/view
> 
> Some entries that might be of interest:
> 
> [    0.378712] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> [    0.378712] pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffffff]
> [    0.378712] pci_bus 0000:00: root bus resource [bus 00-ff]

I have not seen anything like that before. Usually you get memory
windows at non-zero offsets, separating the 32-bit non-prefetchable
space from the prefetchable space.

Assuming what you're showing is fine, this says you have 1TB of
addressable memory space available to the root bus. Each of your
devices requires 128GB and 32MB of memory, both in 64-bit prefetchable
BARs. Due to alignment requirements, your GPUs would need at least
2TB of IO memory space to satisfy their memory requests.

That's probably not very helpful information, though. I'll look into
whether something is creating a 1TB limit in qemu, but I'm honestly
not familiar enough with that particular area.


* Re: [question]: BAR allocation failing
  2021-07-15 20:39       ` Ruben
  2021-07-15 21:50         ` Keith Busch
@ 2021-07-15 23:05         ` Bjorn Helgaas
  2021-07-15 23:08           ` Alex Williamson
  1 sibling, 1 reply; 9+ messages in thread
From: Bjorn Helgaas @ 2021-07-15 23:05 UTC (permalink / raw)
  To: Ruben; +Cc: Alex Williamson, linux-pci, Bjorn Helgaas, Keith Busch

On Thu, Jul 15, 2021 at 11:39:54PM +0300, Ruben wrote:
> Thanks for the response, here's a link to the entire dmesg log:
> https://drive.google.com/file/d/1Uau0cgd2ymYGDXNr1mA9X_UdLoMH_Azn/view
> 
> Some entries that might be of interest:

ACPI tells us the host bridge windows are:

  acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored)
  acpi PNP0A08:00: host bridge window [mem 0x80000000-0xafffffff window] (ignored)
  acpi PNP0A08:00: host bridge window [mem 0xc0000000-0xfebfffff window] (ignored)
  acpi PNP0A08:00: host bridge window [mem 0x800000000-0xfffffffff window] (ignored)

The 0xc0000000 window is about 1GB and is below 4GB.
The 0x800000000 window looks like 32GB.
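(Spelling that second one out: 0xfffffffff - 0x800000000 + 1 =
0x800000000 bytes = 32GB, which matches the 32GB default 64-bit hole
Alex mentioned.)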

But "pci=nocrs" means we ignore these windows ...

> pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffffff]

and instead use this 1TB of address space, from which DRAM is
excluded.  I think this is basically everything the CPU can address,
and I *think* it comes from this in setup_arch():

  iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;

But you have 8 GPUs, each of which needs 128GB + 32MB + 16MB, so you
need 1TB + 384MB to map them all, and the CPU can't address that much.

Since you're running this on qemu, I assume x86_phys_bits is telling
us about the capabilities of the CPU qemu is emulating.  Maybe there's
a way to tell qemu to emulate a CPU with more address bits?
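
Inside the guest, something like

  grep "address sizes" /proc/cpuinfo

shows what the emulated CPU advertises; I'd expect it to say 40 bits
physical here, matching the 0xffffffffff above.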

Bjorn


* Re: [question]: BAR allocation failing
  2021-07-15 23:05         ` Bjorn Helgaas
@ 2021-07-15 23:08           ` Alex Williamson
  2021-07-16  6:14             ` Ruben
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2021-07-15 23:08 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Ruben, linux-pci, Bjorn Helgaas, Keith Busch

On Thu, 15 Jul 2021 18:05:06 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Thu, Jul 15, 2021 at 11:39:54PM +0300, Ruben wrote:
> > Thanks for the response, here's a link to the entire dmesg log:
> > https://drive.google.com/file/d/1Uau0cgd2ymYGDXNr1mA9X_UdLoMH_Azn/view
> > 
> > Some entries that might be of interest:  
> 
> ACPI tells us the host bridge windows are:
> 
>   acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored)
>   acpi PNP0A08:00: host bridge window [mem 0x80000000-0xafffffff window] (ignored)
>   acpi PNP0A08:00: host bridge window [mem 0xc0000000-0xfebfffff window] (ignored)
>   acpi PNP0A08:00: host bridge window [mem 0x800000000-0xfffffffff window] (ignored)
> 
> The 0xc0000000 window is about 1GB and is below 4GB.
> The 0x800000000 window looks like 32GB.
> 
> But "pci=nocrs" means we ignore these windows ...
> 
> > pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffffff]  
> 
> and instead use this 1TB of address space, from which DRAM is
> excluded.  I think this is basically everything the CPU can address,
> and I *think* it comes from this in setup_arch():
> 
>   iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
> 
> But you have 8 GPUs, each of which needs 128GB + 32MB + 16MB, so you
> need 1TB + 384MB to map them all, and the CPU can't address that much.
> 
> Since you're running this on qemu, I assume x86_phys_bits is telling
> us about the capabilities of the CPU qemu is emulating.  Maybe there's
> a way to tell qemu to emulate a CPU with more address bits?

"-cpu host" perhaps



* Re: [question]: BAR allocation failing
  2021-07-15 23:08           ` Alex Williamson
@ 2021-07-16  6:14             ` Ruben
  0 siblings, 0 replies; 9+ messages in thread
From: Ruben @ 2021-07-16  6:14 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Bjorn Helgaas, linux-pci, Bjorn Helgaas, Keith Busch

My lord, I can't believe that was the answer: all GPUs are successfully
passed through now, thanks Alex!
In case it is of interest, here's the resulting dmesg:
https://drive.google.com/file/d/1b0f4bWPhJXifC8j9_NtB34R7GBA8Dky3/view

For the gal/chap who finds this thread in 12 years while trying to make
an old machine work:
In the qemu command I added "-cpu host"; the kernel parameters at that
point were "rcutree.rcu_idle_gp_delay=1 mem_encrypt=off iommu=off
amd_iommu=off pci=realloc=off,check_enable_amd_mmconf".
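
In Proxmox terms that boils down to roughly this in the VM config
(next to the usual hostpci entries); the kernel parameters above live
on the guest's own command line, not here:

  cpu: host
  # or, passed straight through to qemu:
  args: -cpu host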

On to the next chapter for this machine, which is to enable MMIO
filtering by the hypervisor, required for building VMs with fewer than
all of the GPUs/resources. Luckily NVIDIA provides zero detailed
documentation on the matter.

Op vr 16 jul. 2021 om 02:08 schreef Alex Williamson
<alex.williamson@redhat.com>:
>
> On Thu, 15 Jul 2021 18:05:06 -0500
> Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> > On Thu, Jul 15, 2021 at 11:39:54PM +0300, Ruben wrote:
> > > Thanks for the response, here's a link to the entire dmesg log:
> > > https://drive.google.com/file/d/1Uau0cgd2ymYGDXNr1mA9X_UdLoMH_Azn/view
> > >
> > > Some entries that might be of interest:
> >
> > ACPI tells us the host bridge windows are:
> >
> >   acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored)
> >   acpi PNP0A08:00: host bridge window [mem 0x80000000-0xafffffff window] (ignored)
> >   acpi PNP0A08:00: host bridge window [mem 0xc0000000-0xfebfffff window] (ignored)
> >   acpi PNP0A08:00: host bridge window [mem 0x800000000-0xfffffffff window] (ignored)
> >
> > The 0xc0000000 window is about 1GB and is below 4GB.
> > The 0x800000000 window looks like 32GB.
> >
> > But "pci=nocrs" means we ignore these windows ...
> >
> > > pci_bus 0000:00: root bus resource [mem 0x00000000-0xffffffffff]
> >
> > and instead use this 1TB of address space, from which DRAM is
> > excluded.  I think this is basically everything the CPU can address,
> > and I *think* it comes from this in setup_arch():
> >
> >   iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
> >
> > But you have 8 GPUs, each of which needs 128GB + 32MB + 16MB, so you
> > need 1TB + 384MB to map them all, and the CPU can't address that much.
> >
> > Since you're running this on qemu, I assume x86_phys_bits is telling
> > us about the capabilities of the CPU qemu is emulating.  Maybe there's
> > a way to tell qemu to emulate a CPU with more address bits?
>
> "-cpu host" perhaps
>


end of thread

Thread overview: 9 messages
2021-07-14 21:32 [question]: BAR allocation failing Ruben
2021-07-14 22:03 ` Alex Williamson
2021-07-14 22:43   ` Ruben
2021-07-15 14:49     ` Bjorn Helgaas
2021-07-15 20:39       ` Ruben
2021-07-15 21:50         ` Keith Busch
2021-07-15 23:05         ` Bjorn Helgaas
2021-07-15 23:08           ` Alex Williamson
2021-07-16  6:14             ` Ruben