From: Ruben <rubenbryon@gmail.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: linux-pci@vger.kernel.org, Bjorn Helgaas <bhelgaas@google.com>
Subject: Re: [question]: BAR allocation failing
Date: Thu, 15 Jul 2021 01:43:17 +0300
Message-ID: <CALdZjm6TsfsaQZRxJvr5YDh9VRn28vQjFY+JfZv-daU=gQu_Uw@mail.gmail.com>
In-Reply-To: <20210714160350.1bef2778.alex.williamson@redhat.com>

No luck so far with "-global q35-pcihost.pci-hole64-size=2048G"; the result
stays the same.  ("-global q35-host.pci-hole64-size=" gave the error
"warning: global q35-host.pci-hole64-size has invalid class name".)
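
For reference, here is roughly how the option sits on the QEMU command line,
plus one way to check in the guest whether the bigger 64-bit window actually
arrived (just a sketch; the other arguments and the addresses are illustrative):

  qemu-system-x86_64 -machine q35 ... \
    -global q35-pcihost.pci-hole64-size=2048G ...

  # inside the guest, the enlarged window should show up in the host
  # bridge ranges reported at boot
  dmesg | grep -i 'root bus resource'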

When we pass through the NVLink bridges, the (5 working) GPUs can talk at
full P2P bandwidth, and the NVIDIA docs describe this as a valid option
(i.e. passing through all GPUs and NVLink bridges).
In production we pass the bridges through to a service VM that controls
traffic, which is also described in their docs.
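
As a quick sanity check on the P2P paths, something like this inside the
guest shows which NVLink connections the driver actually sees between the
working GPUs (the exact matrix of course depends on the layout):

  nvidia-smi topo -m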

On Thu, 15 Jul 2021 at 01:03, Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Thu, 15 Jul 2021 00:32:30 +0300
> Ruben <rubenbryon@gmail.com> wrote:
>
> > I am experiencing an issue with virtualizing a machine which contains
> > 8 NVIDIA A100 80GB cards.
> > On bare metal the machine behaves as expected; the GPUs are connected
> > to the host through a PLX PEX88096 chip, which connects 2 GPUs to 16
> > lanes on the CPU (all on the same NVIDIA HGX Delta baseboard).
> > When all GPUs and NVLink bridges are passed through to a VM, the system
> > can only initialize 4-5 of the 8 GPUs.
> >
> > The dmesg log shows failed attempts to assign BAR space to the GPUs
> > that are not getting initialized.
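> > A rough filter to pull those failures out of the guest log (the exact
> > message text varies by kernel version):
> >
> >   dmesg | grep -iE 'no space|failed to assign'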
> >
> > Things that were tried:
> > Q35 and i440fx machine types, with and without UEFI
> > QEMU 5.x, QEMU 6.0
> > Ubuntu 20.04 host with QEMU/libvirt
> > Now running Proxmox 7 on Debian 11, host kernel 5.11.22-2, VM kernel 5.4.0-77
> > VM kernel parameters pci=nocrs and pci=realloc=on/off
> >
> > ------------------------------------
> >
> > lspci -v:
> > 01:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> >         Memory at db000000 (32-bit, non-prefetchable) [size=16M]
> >         Memory at 2000000000 (64-bit, prefetchable) [size=128G]
> >         Memory at 1000000000 (64-bit, prefetchable) [size=32M]
> >
> > 02:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> >         Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
> >         Memory at 4000000000 (64-bit, prefetchable) [size=128G]
> >         Memory at 6000000000 (64-bit, prefetchable) [size=32M]
> >
> > ...
> >
> > 0c:00.0 3D controller: NVIDIA Corporation Device 20b2 (rev a1)
> >         Memory at e0000000 (32-bit, non-prefetchable) [size=16M]
> >         Memory at <ignored> (64-bit, prefetchable)
> >         Memory at <ignored> (64-bit, prefetchable)
> >
> > ...
> >
> ...
> >
> > ------------------------------------
> >
> > I have (blindly) messed with parameters like pref64-reserve on the
> > pcie-root-port, but to be frank I have little clue what I am doing, so
> > my question is really a request for suggestions on what to try.
> > This server will not be running an 8-GPU VM in production, but I have a
> > few days left to test before it goes to work, and I was hoping to learn
> > how to overcome this issue in the future.
> > Please be aware that my knowledge of virtualization and the Linux
> > kernel does not reach far.
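> >
> > For what it's worth, the pref64-reserve attempt looked roughly like this
> > (only a sketch; the ids and the vfio host address here are made up):
> >
> >   -device pcie-root-port,id=rp1,bus=pcie.0,chassis=1,slot=1,pref64-reserve=256G \
> >   -device vfio-pci,host=0000:01:00.0,bus=rp1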
>
> Try playing with the QEMU "-global q35-host.pci-hole64-size=" option for
> the VM rather than pci=nocrs.  The default 64-bit MMIO hole for
> QEMU/q35 is only 32GB.  You might be looking at a value like 2048G to
> support this setup, but could maybe get away with 1024G if there's room
> in 32-bit space for the 3rd BAR.
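> (Rough arithmetic from the lspci output above: 8 x 128G = 1024G for the
> large prefetchable BARs alone; the 8 x 32M BARs plus window alignment then
> push past 1024G unless those smaller BARs can be placed below 4G, hence
> the 2048G suggestion.)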
>
> Note that assigning bridges usually doesn't make a lot of sense, and
> NVLink is a proprietary black box, so we don't know how to virtualize
> it or what the guest drivers will do with it; you're on your own there.
> We generally recommend using vGPUs for such cases so the host driver
> can handle all the NVLink aspects of GPU peer-to-peer.  Thanks,
>
> Alex
>

Thread overview: 9+ messages
2021-07-14 21:32 [question]: BAR allocation failing Ruben
2021-07-14 22:03 ` Alex Williamson
2021-07-14 22:43   ` Ruben [this message]
2021-07-15 14:49     ` Bjorn Helgaas
2021-07-15 20:39       ` Ruben
2021-07-15 21:50         ` Keith Busch
2021-07-15 23:05         ` Bjorn Helgaas
2021-07-15 23:08           ` Alex Williamson
2021-07-16  6:14             ` Ruben
