* [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07  8:44 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

Here is an RFC of some patches adding pass-through support
for the NVIDIA V100 GPUs found in some POWER9 boxes.

The example P9 system has 6 GPUs, each accompanied by 2 bridges
representing the hardware links (aka NVLink2):

 4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
 4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
 5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
 5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
 6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
 6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)

^^ the leading number is the IOMMU group ID.

Each bridge represents an additional hardware interface called "NVLink2";
it is not a PCI link but a separate bus. The design inherits from the
original NVLink on POWER8.

The new feature of the V100 is 16GB of cache-coherent memory on the GPU
board. This memory is presented to the host via the device tree and remains
offline until the NVIDIA driver loads, trains NVLink2 (via the config space
of the bridges above) and the nvidia-persistenced daemon then onlines it.
The memory remains online as long as nvidia-persistenced is running; when
it stops, the memory is offlined again.

The number of GPUs suggests passing them through to a guest. However,
in order to do so we cannot use the NVIDIA driver on the host, so the host
ends up with a 128GB window per GPU (bigger than or equal to the actual
GPU RAM size) in system memory with no page structs backing this window,
and we cannot touch this memory before the NVIDIA driver configures it
in a host or a guest, as an HMI (Hypervisor Maintenance Interrupt) occurs
otherwise.

On the example system the GPU RAM windows are located at:
0x0400 0000 0000
0x0420 0000 0000
0x0440 0000 0000
0x2400 0000 0000
0x2420 0000 0000
0x2440 0000 0000

So the complications are:

1. we cannot touch the GPU memory until it is trained, i.e. we cannot add
PTEs to VFIO-to-userspace or guest-to-host-physical translations until
the driver trains the links (i.e. nvidia-persistenced has started),
otherwise prefetching happens and an HMI occurs; I am trying to get this
changed somehow;

2. since it appears as normal cache-coherent memory, it will be used
for DMA, which means it has to be pinned and mapped in the host. Having
no page structs makes it different from the usual case - we only need to
translate user addresses to host physical addresses and map GPU RAM, but
pinning is not required.
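
For illustration, complication 2 boils down to a plain offset translation;
here is a minimal sketch with made-up names (the real code is
mm_iommu_ua_to_hpa() in patch 3/5):

struct mem_region {
	unsigned long ua;	/* userspace base address */
	unsigned long dev_hpa;	/* device memory base (host physical) */
	unsigned long entries;	/* region size in PAGE_SIZE units */
};

static long ua_to_hpa(struct mem_region *mem, unsigned long ua,
		unsigned long *hpa)
{
	if ((ua - mem->ua) >> PAGE_SHIFT >= mem->entries)
		return -EFAULT;

	/* No page struct to pin or dirty, just an offset within the window */
	*hpa = mem->dev_hpa + (ua - mem->ua);
	return 0;
}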

This series maps GPU RAM via the GPU's vfio-pci device so QEMU can then
register this memory as a KVM memory slot and present memory nodes to
the guest. Unless NVIDIA provides a userspace driver, this is of no use
for things like DPDK.
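
To give an idea of the QEMU side, below is an illustrative userspace sketch
(not QEMU's actual code): mmap the extra vfio-pci region exposing GPU RAM
and register it as a KVM memory slot. The region index of the GPU RAM
region comes from patch 5/5 and is simply passed in here as a parameter;
the guest physical address and slot number are placeholders too.

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>
#include <linux/vfio.h>

static void *map_gpu_ram_into_guest(int device_fd, int vm_fd,
		uint32_t ram_region_index, uint64_t guest_phys_addr,
		uint32_t slot)
{
	struct vfio_region_info info = { .argsz = sizeof(info),
					 .index = ram_region_index };
	struct kvm_userspace_memory_region mr;
	void *ram;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return NULL;

	/* The region offset is the mmap cookie for this vfio-pci region */
	ram = mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
			device_fd, info.offset);
	if (ram == MAP_FAILED)
		return NULL;

	mr.slot = slot;
	mr.flags = 0;
	mr.guest_phys_addr = guest_phys_addr;
	mr.memory_size = info.size;
	mr.userspace_addr = (uint64_t)(uintptr_t)ram;

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mr))
		return NULL;

	return ram;
}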


There is another problem which the series does not address but which is
worth mentioning: it is not strictly necessary to map GPU RAM into the guest
exactly where it is in the host (I tested this to some extent), but we still
might want to represent the memory at the same offset as on the host,
which increases the size of a TCE table needed to cover such a huge
window: (((0x244000000000 + 0x2000000000) >> 16) * 8) >> 20 = 4656MB.
I am addressing this in a separate patchset by allocating indirect TCE
levels on demand and using 16MB IOMMU pages in the guest, as we can now
back emulated pages with the smaller hardware ones.
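
For reference, the arithmetic above is one 8-byte TCE per 64K IOMMU page,
covering everything up to the end of the last GPU RAM window; a standalone
snippet (illustrative only) reproduces the figure:

#include <stdio.h>

int main(void)
{
	unsigned long long window_end = 0x244000000000ULL + 0x2000000000ULL;
	unsigned long long tces = window_end >> 16;	/* one TCE per 64K page */
	unsigned long long bytes = tces * 8;		/* 8 bytes per TCE */

	printf("%lluMB\n", bytes >> 20);		/* prints 4656 */
	return 0;
}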


This is an RFC. Please comment. Thanks.



Alexey Kardashevskiy (5):
  vfio/spapr_tce: Simplify page contained test
  powerpc/iommu_context: Change referencing in API
  powerpc/iommu: Do not pin memory of a memory device
  vfio_pci: Allow mapping extra regions
  vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

 drivers/vfio/pci/Makefile              |   1 +
 arch/powerpc/include/asm/mmu_context.h |   5 +-
 drivers/vfio/pci/vfio_pci_private.h    |  11 ++
 include/uapi/linux/vfio.h              |   3 +
 arch/powerpc/kernel/iommu.c            |   8 +-
 arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
 drivers/vfio/pci/vfio_pci.c            |  19 +++-
 drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
 drivers/vfio/pci/Kconfig               |   4 +
 10 files changed, 319 insertions(+), 34 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

-- 
2.11.0

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 1/5] vfio/spapr_tce: Simplify page contained test
  2018-06-07  8:44 ` Alexey Kardashevskiy
  (?)
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

The test function takes a page struct pointer which is not used by
either of its two callers in any other way, so make it simpler and just
pass a physical address there.

This should cause no behavioural change now but later we may start
supporting host addresses for memory devices which are not backed
by page structs.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 759a5bd..2c4a048 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -249,8 +249,9 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
 	decrement_locked_vm(mm, cb >> PAGE_SHIFT);
 }
 
-static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
 {
+	struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
 	/*
 	 * Check that the TCE table granularity is not bigger than the size of
 	 * a page we just found. Otherwise the hardware can get access to
@@ -549,7 +550,6 @@ static long tce_iommu_build(struct tce_container *container,
 		enum dma_data_direction direction)
 {
 	long i, ret = 0;
-	struct page *page;
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
@@ -560,8 +560,7 @@ static long tce_iommu_build(struct tce_container *container,
 		if (ret)
 			break;
 
-		page = pfn_to_page(hpa >> PAGE_SHIFT);
-		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+		if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
 			ret = -EPERM;
 			break;
 		}
@@ -595,7 +594,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		enum dma_data_direction direction)
 {
 	long i, ret = 0;
-	struct page *page;
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
@@ -615,8 +613,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		if (ret)
 			break;
 
-		page = pfn_to_page(hpa >> PAGE_SHIFT);
-		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+		if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
 			ret = -EPERM;
 			break;
 		}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 2/5] powerpc/iommu_context: Change referencing in API
  2018-06-07  8:44 ` Alexey Kardashevskiy
  (?)
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

At the moment a single function - mm_iommu_get() - allocates a new region
or just references it if it is already registered with the current MM
context.

We are going to allow the API to be used for memory devices, and a
different variant of mm_iommu_get() will be needed, so let's move the
referencing part to where it belongs - mm_iommu_find().

This turns mm_iommu_get() into a wrapper, as the actual function will
be extended later, and renames it to mm_iommu_new() to illustrate
the change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/mm/mmu_context_iommu.c    | 19 +++++++++++++++----
 drivers/vfio/vfio_iommu_spapr_tce.c    | 21 +++++++++++----------
 3 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 1835ca1..b598ec4 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -21,7 +21,7 @@ struct mm_iommu_table_group_mem_t;
 
 extern int isolate_lru_page(struct page *page);	/* from internal.h */
 extern bool mm_iommu_preregistered(struct mm_struct *mm);
-extern long mm_iommu_get(struct mm_struct *mm,
+extern long mm_iommu_new(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 4c615fc..6b471d2 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -120,7 +120,8 @@ static int mm_iommu_move_page_from_cma(struct page *page)
 	return 0;
 }
 
-long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
+static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
@@ -132,8 +133,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
 			next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
-			++mem->used;
-			*pmem = mem;
+			ret = -EBUSY;
 			goto unlock_exit;
 		}
 
@@ -218,7 +218,13 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(mm_iommu_get);
+
+long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, pmem);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_new);
 
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
@@ -337,13 +343,18 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 {
 	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
 
+	mutex_lock(&mem_list_mutex);
+
 	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
 			ret = mem;
+			++mem->used;
 			break;
 		}
 	}
 
+	mutex_unlock(&mem_list_mutex);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(mm_iommu_find);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 2c4a048..7f1effd 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -149,9 +149,9 @@ static long tce_iommu_prereg_free(struct tce_container *container,
 static long tce_iommu_unregister_pages(struct tce_container *container,
 		__u64 vaddr, __u64 size)
 {
+	long ret = -ENOENT;
 	struct mm_iommu_table_group_mem_t *mem;
 	struct tce_iommu_prereg *tcemem;
-	bool found = false;
 
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
 		return -EINVAL;
@@ -162,15 +162,14 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 
 	list_for_each_entry(tcemem, &container->prereg_list, next) {
 		if (tcemem->mem == mem) {
-			found = true;
+			ret = tce_iommu_prereg_free(container, tcemem);
 			break;
 		}
 	}
 
-	if (!found)
-		return -ENOENT;
+	mm_iommu_put(container->mm, mem);
 
-	return tce_iommu_prereg_free(container, tcemem);
+	return ret;
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -188,15 +187,17 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	mem = mm_iommu_find(container->mm, vaddr, entries);
 	if (mem) {
 		list_for_each_entry(tcemem, &container->prereg_list, next) {
-			if (tcemem->mem == mem)
+			if (tcemem->mem == mem) {
+				mm_iommu_put(container->mm, mem);
 				return -EBUSY;
+			}
 		}
+	} else {
+		ret = mm_iommu_new(container->mm, vaddr, entries, &mem);
+		if (ret)
+			return ret;
 	}
 
-	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
-	if (ret)
-		return ret;
-
 	tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL);
 	if (!tcemem) {
 		mm_iommu_put(container->mm, mem);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 3/5] powerpc/iommu: Do not pin memory of a memory device
  2018-06-07  8:44 ` Alexey Kardashevskiy
  (?)
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

This new memory does not have page structs as it is not hotplugged to
the host, so gup() would fail on it anyway.

This registers a new mapping in the memory context so the user of this
API does not have to worry about the nature of this memory.

Also, since host addresses may not be backed with page structs, this
adds a workaround to iommu_tce_xchg() to avoid touching absent page structs.
realmode_pfn_to_page() is used there as, unlike its virtmode counterpart,
it actually walks through the list of vmemmap_backing.

The same approach is used in tce_page_is_contained() to skip the check for now.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

# Conflicts:
#	arch/powerpc/mm/mmu_context_iommu.c
---
 arch/powerpc/include/asm/mmu_context.h |  3 ++
 arch/powerpc/kernel/iommu.c            |  8 +++--
 arch/powerpc/mm/mmu_context_iommu.c    | 55 +++++++++++++++++++++++++++-------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 12 +++++++-
 4 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b598ec4..0c14495 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -24,6 +24,9 @@ extern bool mm_iommu_preregistered(struct mm_struct *mm);
 extern long mm_iommu_new(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
+extern long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
 		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index af7a20d..fc985a5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1001,8 +1001,12 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 	ret = tbl->it_ops->exchange(tbl, entry, hpa, direction);
 
 	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
-			(*direction == DMA_BIDIRECTIONAL)))
-		SetPageDirty(pfn_to_page(*hpa >> PAGE_SHIFT));
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = __va(realmode_pfn_to_page(*hpa >> PAGE_SHIFT));
+
+		if (pg)
+			SetPageDirty(pg);
+	}
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 6b471d2..b132924 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -30,6 +30,8 @@ struct mm_iommu_table_group_mem_t {
 	u64 ua;			/* userspace address */
 	u64 entries;		/* number of entries in hpas[] */
 	u64 *hpas;		/* vmalloc'ed */
+#define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
+	u64 dev_hpa;		/* Device memory base address */
 };
 
 static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
@@ -121,7 +123,7 @@ static int mm_iommu_move_page_from_cma(struct page *page)
 }
 
 static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
-		unsigned long entries,
+		unsigned long entries, unsigned long dev_hpa,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
@@ -147,11 +149,13 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 
 	}
 
-	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
-	if (ret)
-		goto unlock_exit;
+	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
+		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
+		if (ret)
+			goto unlock_exit;
 
-	locked_entries = entries;
+		locked_entries = entries;
+	}
 
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem) {
@@ -159,6 +163,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		goto unlock_exit;
 	}
 
+	if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
+		mem->dev_hpa = dev_hpa;
+		goto good_exit;
+	}
+
 	mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
 	if (!mem->hpas) {
 		kfree(mem);
@@ -202,6 +211,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
 	}
 
+good_exit:
 	atomic64_set(&mem->mapped, 1);
 	mem->used = 1;
 	mem->ua = ua;
@@ -222,15 +232,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
-	return mm_iommu_do_alloc(mm, ua, entries, pmem);
+	return mm_iommu_do_alloc(mm, ua, entries, MM_IOMMU_TABLE_INVALID_HPA,
+			pmem);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_new);
 
+long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, dev_hpa, pmem);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_newdev);
+
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
 	long i;
 	struct page *page = NULL;
 
+	if (!mem->hpas)
+		return;
+
 	for (i = 0; i < mem->entries; ++i) {
 		if (!mem->hpas[i])
 			continue;
@@ -269,6 +291,7 @@ static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
 long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 {
 	long ret = 0;
+	unsigned long entries;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -290,9 +313,11 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 	}
 
 	/* @mapped became 0 so now mappings are disabled, release the region */
+	entries = mem->entries;
 	mm_iommu_release(mem);
 
-	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
+	if (mem->dev_hpa != MM_IOMMU_TABLE_INVALID_HPA)
+		mm_iommu_adjust_locked_vm(mm, entries, false);
 
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
@@ -363,11 +388,17 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	u64 *va = &mem->hpas[entry];
+	u64 *va;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	va = &mem->hpas[entry];
 	*hpa = *va | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -378,13 +409,17 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	void *va = &mem->hpas[entry];
 	unsigned long *pa;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
 
-	pa = (void *) vmalloc_to_phys(va);
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	pa = (void *) vmalloc_to_phys(&mem->hpas[entry]);
 	if (!pa)
 		return -EFAULT;
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 7f1effd..47071f3 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -252,7 +252,17 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
 
 static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
 {
-	struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
+	struct page *page = __va(realmode_pfn_to_page(hpa >> PAGE_SHIFT));
+
+	/*
+	 * If there is no page, we assume it is device memory and therefore
+	 * it is contiguous and always pinned.
+	 *
+	 * TODO: test device boundaries?
+	 */
+	if (!page)
+		return true;
+
 	/*
 	 * Check that the TCE table granularity is not bigger than the size of
 	 * a page we just found. Otherwise the hardware can get access to
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 3/5] powerpc/iommu: Do not pin memory of a memory device
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson,
	Benjamin Herrenschmidt, Ram Pai, kvm, Alistair Popple

This new memory does not have page structs as it is not hotplugged to
the host so gup() will fail anyway.

This registers a new mapping in memory context so the user of this
API does not have to worry about the nature of this memory.

Also, since host addresses may not be backed with page structs, this
adds a workaround to iommu_tce_xchg() to avoid putting absent page structs.
realmode_pfn_to_page() is used there as, unline its virtmode counterpart,
it actually walks through the list of vmemmap_backing.

The same is used in tce_page_is_contained() to drop the check for now.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

# Conflicts:
#	arch/powerpc/mm/mmu_context_iommu.c
---
 arch/powerpc/include/asm/mmu_context.h |  3 ++
 arch/powerpc/kernel/iommu.c            |  8 +++--
 arch/powerpc/mm/mmu_context_iommu.c    | 55 +++++++++++++++++++++++++++-------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 12 +++++++-
 4 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b598ec4..0c14495 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -24,6 +24,9 @@ extern bool mm_iommu_preregistered(struct mm_struct *mm);
 extern long mm_iommu_new(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
+extern long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
 		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index af7a20d..fc985a5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1001,8 +1001,12 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 	ret = tbl->it_ops->exchange(tbl, entry, hpa, direction);
 
 	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
-			(*direction == DMA_BIDIRECTIONAL)))
-		SetPageDirty(pfn_to_page(*hpa >> PAGE_SHIFT));
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = __va(realmode_pfn_to_page(*hpa >> PAGE_SHIFT));
+
+		if (pg)
+			SetPageDirty(pg);
+	}
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 6b471d2..b132924 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -30,6 +30,8 @@ struct mm_iommu_table_group_mem_t {
 	u64 ua;			/* userspace address */
 	u64 entries;		/* number of entries in hpas[] */
 	u64 *hpas;		/* vmalloc'ed */
+#define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
+	u64 dev_hpa;		/* Device memory base address */
 };
 
 static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
@@ -121,7 +123,7 @@ static int mm_iommu_move_page_from_cma(struct page *page)
 }
 
 static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
-		unsigned long entries,
+		unsigned long entries, unsigned long dev_hpa,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
@@ -147,11 +149,13 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 
 	}
 
-	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
-	if (ret)
-		goto unlock_exit;
+	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
+		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
+		if (ret)
+			goto unlock_exit;
 
-	locked_entries = entries;
+		locked_entries = entries;
+	}
 
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem) {
@@ -159,6 +163,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		goto unlock_exit;
 	}
 
+	if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
+		mem->dev_hpa = dev_hpa;
+		goto good_exit;
+	}
+
 	mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
 	if (!mem->hpas) {
 		kfree(mem);
@@ -202,6 +211,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
 	}
 
+good_exit:
 	atomic64_set(&mem->mapped, 1);
 	mem->used = 1;
 	mem->ua = ua;
@@ -222,15 +232,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
-	return mm_iommu_do_alloc(mm, ua, entries, pmem);
+	return mm_iommu_do_alloc(mm, ua, entries, MM_IOMMU_TABLE_INVALID_HPA,
+			pmem);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_new);
 
+long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, dev_hpa, pmem);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_newdev);
+
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
 	long i;
 	struct page *page = NULL;
 
+	if (!mem->hpas)
+		return;
+
 	for (i = 0; i < mem->entries; ++i) {
 		if (!mem->hpas[i])
 			continue;
@@ -269,6 +291,7 @@ static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
 long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 {
 	long ret = 0;
+	unsigned long entries;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -290,9 +313,11 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 	}
 
 	/* @mapped became 0 so now mappings are disabled, release the region */
+	entries = mem->entries;
 	mm_iommu_release(mem);
 
-	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
+	if (mem->dev_hpa != MM_IOMMU_TABLE_INVALID_HPA)
+		mm_iommu_adjust_locked_vm(mm, entries, false);
 
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
@@ -363,11 +388,17 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	u64 *va = &mem->hpas[entry];
+	u64 *va;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	va = &mem->hpas[entry];
 	*hpa = *va | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -378,13 +409,17 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	void *va = &mem->hpas[entry];
 	unsigned long *pa;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
 
-	pa = (void *) vmalloc_to_phys(va);
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	pa = (void *) vmalloc_to_phys(&mem->hpas[entry]);
 	if (!pa)
 		return -EFAULT;
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 7f1effd..47071f3 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -252,7 +252,17 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
 
 static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
 {
-	struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
+	struct page *page = __va(realmode_pfn_to_page(hpa >> PAGE_SHIFT));
+
+	/*
+	 * If there not page, we assume it is a device memory and therefore
+	 * it is contigous and always pinned.
+	 *
+	 * TODO: test device boundaries?
+	 */
+	if (!page)
+		return true;
+
 	/*
 	 * Check that the TCE table granularity is not bigger than the size of
 	 * a page we just found. Otherwise the hardware can get access to
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/pci/vfio_pci_private.h |  3 +++
 drivers/vfio/pci/vfio_pci.c         | 10 ++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index cde3b5d..86aab05 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -59,6 +59,9 @@ struct vfio_pci_regops {
 		      size_t count, loff_t *ppos, bool iswrite);
 	void	(*release)(struct vfio_pci_device *vdev,
 			   struct vfio_pci_region *region);
+	int	(*mmap)(struct vfio_pci_device *vdev,
+			struct vfio_pci_region *region,
+			struct vm_area_struct *vma);
 };
 
 struct vfio_pci_region {
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 3729937..7bddf1e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 		return -EINVAL;
 	if ((vma->vm_flags & VM_SHARED) == 0)
 		return -EINVAL;
+	if (index >= VFIO_PCI_NUM_REGIONS) {
+		int regnum = index - VFIO_PCI_NUM_REGIONS;
+		struct vfio_pci_region *region = vdev->region + regnum;
+
+		if (region && region->ops && region->ops->mmap)
+			return region->ops->mmap(vdev, region, vma);
+		return -EINVAL;
+	}
 	if (index >= VFIO_PCI_ROM_REGION_INDEX)
 		return -EINVAL;
-	if (!vdev->bar_mmap_supported[index])
-		return -EINVAL;
 
 	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
 	req_len = vma->vm_end - vma->vm_start;
-- 
2.11.0
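
This patch only adds the optional hook and the dispatch for device-specific
region indices. As a rough sketch (not from this series; all my_region_*
names are hypothetical), a region provider could opt in to mmap as below;
patch 5/5 adds the real NVLink2 user of the hook.

/* Hypothetical example only; see patch 5/5 for the actual NVLink2 user. */
static size_t my_region_rw(struct vfio_pci_device *vdev, char __user *buf,
		size_t count, loff_t *ppos, bool iswrite)
{
	return -EINVAL;			/* this example region is mmap-only */
}

static void my_region_release(struct vfio_pci_device *vdev,
		struct vfio_pci_region *region)
{
	kfree(region->data);
}

static int my_region_mmap(struct vfio_pci_device *vdev,
		struct vfio_pci_region *region, struct vm_area_struct *vma)
{
	if (vma->vm_end - vma->vm_start > region->size)
		return -EINVAL;

	/* set vma->vm_ops and/or insert PFNs for the backing memory here */
	return 0;
}

static const struct vfio_pci_regops my_regops = {
	.rw		= my_region_rw,
	.release	= my_region_release,
	.mmap		= my_region_mmap,	/* optional; NULL keeps mmap rejected */
};

A region registered with such ops via vfio_pci_register_dev_region() then
becomes mmap'able through the vfio_pci_mmap() path changed above.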

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

Some POWER9 chips come with special NVLink2 links which provide
cacheable memory access to the RAM physically located on NVIDIA GPU.
This memory is presented to the host via the device tree but remains
offline until the NVIDIA driver onlines it.

This patch exports this RAM to userspace as a new VFIO region so that
the NVIDIA driver in the guest can train these links and online the GPU RAM.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/pci/Makefile           |   1 +
 drivers/vfio/pci/vfio_pci_private.h |   8 ++
 include/uapi/linux/vfio.h           |   3 +
 drivers/vfio/pci/vfio_pci.c         |   9 ++
 drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/Kconfig            |   4 +
 6 files changed, 215 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 76d8ec0..9662c06 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,5 +1,6 @@
 
 vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 86aab05..7115b9b 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
 	return -ENODEV;
 }
 #endif
+#ifdef CONFIG_VFIO_PCI_NVLINK2
+extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
+#else
+static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+#endif
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 1aa7b82..2fe8227 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
 
+/* NVIDIA GPU NV2 */
+#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 7bddf1e..38c9475 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    pdev->device == 0x1db1 &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_nvlink2_init(vdev);
+		if (ret)
+			dev_warn(&vdev->pdev->dev,
+				 "Failed to setup NVIDIA NV2 RAM region\n");
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
new file mode 100644
index 0000000..451c5cb
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Register an on-GPU RAM region for cacheable access.
+ *
+ * Derived from original vfio_pci_igd.c:
+ * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
+ *	Author: Alex Williamson <alex.williamson@redhat.com>
+ */
+
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/mmu_context.h>
+
+#include "vfio_pci_private.h"
+
+struct vfio_pci_nvlink2_data {
+	unsigned long gpu_hpa;
+	unsigned long useraddr;
+	unsigned long size;
+	struct mm_struct *mm;
+	struct mm_iommu_table_group_mem_t *mem;
+};
+
+static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
+		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	void *base = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (iswrite) {
+		if (copy_from_user(base + pos, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, base + pos, count))
+			return -EFAULT;
+	}
+	*ppos += count;
+
+	return count;
+}
+
+static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region)
+{
+	struct vfio_pci_nvlink2_data *data = region->data;
+	long ret;
+
+	ret = mm_iommu_put(data->mm, data->mem);
+	WARN_ON(ret);
+
+	mmdrop(data->mm);
+	kfree(data);
+}
+
+static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_region *region = vma->vm_private_data;
+	struct vfio_pci_nvlink2_data *data = region->data;
+	int ret;
+	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
+	unsigned long vm_pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
+
+	ret = vm_insert_pfn(vma, vmf->address, pfn);
+	/* TODO: make it a tracepoint */
+	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
+		 vmf->address, pfn << PAGE_SHIFT, ret);
+	if (ret)
+		return VM_FAULT_SIGSEGV;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
+	.fault = vfio_pci_nvlink2_mmap_fault,
+};
+
+static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vm_area_struct *vma)
+{
+	long ret;
+	struct vfio_pci_nvlink2_data *data = region->data;
+
+	if (data->useraddr)
+		return -EPERM;
+
+	if (vma->vm_end - vma->vm_start > data->size)
+		return -EINVAL;
+
+	vma->vm_private_data = region;
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
+
+	/*
+	 * Call mm_iommu_newdev() here, only once, as the region is not
+	 * registered yet and the actual initialization therefore happens now.
+	 * Later users will look it up via mm_iommu_find(), which returns
+	 * the already registered @mem and does not call gup() again.
+	 */
+	data->useraddr = vma->vm_start;
+	data->mm = current->mm;
+	atomic_inc(&data->mm->mm_count);
+	ret = mm_iommu_newdev(data->mm, data->useraddr,
+			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
+			data->gpu_hpa, &data->mem);
+
+	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
+			data->useraddr, data->gpu_hpa,
+			vma->vm_end - vma->vm_start, ret);
+
+	return ret;
+}
+
+static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
+	.rw = vfio_pci_nvlink2_rw,
+	.release = vfio_pci_nvlink2_release,
+	.mmap = vfio_pci_nvlink2_mmap,
+};
+
+int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
+{
+	int len = 0, ret;
+	struct device_node *npu_node, *mem_node;
+	struct pci_dev *npu_dev;
+	uint32_t *mem_phandle, *val;
+	struct vfio_pci_nvlink2_data *data;
+
+	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
+	if (!npu_dev)
+		return -EINVAL;
+
+	npu_node = pci_device_to_OF_node(npu_dev);
+	if (!npu_node)
+		return -EINVAL;
+
+	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
+	if (!mem_phandle)
+		return -EINVAL;
+
+	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
+	if (!mem_node)
+		return -EINVAL;
+
+	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
+	if (!val || len != 2 * sizeof(uint64_t))
+		return -EINVAL;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
+			be32_to_cpu(val[1]);
+	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
+			be32_to_cpu(val[3]);
+
+	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
+			data->gpu_hpa + data->size - 1);
+
+	ret = vfio_pci_register_dev_region(vdev,
+			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
+			&vfio_pci_nvlink2_regops, data->size,
+			VFIO_REGION_INFO_FLAG_READ, data);
+	if (ret)
+		kfree(data);
+
+	return ret;
+}
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 24ee260..2725bc8 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -30,3 +30,7 @@ config VFIO_PCI_INTX
 config VFIO_PCI_IGD
 	depends on VFIO_PCI
 	def_bool y if X86
+
+config VFIO_PCI_NVLINK2
+	depends on VFIO_PCI
+	def_bool y if PPC_POWERNV
-- 
2.11.0
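
For completeness, below is a rough userspace sketch (not part of the series)
of how a consumer such as QEMU might locate and map the region this subdriver
exposes, using the standard VFIO region-info capability chain. The device fd
and region count are assumed to come from the usual VFIO
container/group/device setup, the NVLink2 subtype is defined locally because
it only exists with this patch applied, the helper name is made up, and error
handling is omitted.

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

#define NVIDIA_VENDOR_ID		0x10de
#define REGION_SUBTYPE_NVIDIA_NVLINK2	4	/* from this patch */

/* @device: VFIO device fd; @num_regions: from VFIO_DEVICE_GET_INFO */
static void *map_nvlink2_ram(int device, unsigned int num_regions, size_t *size)
{
	unsigned int i;

	for (i = VFIO_PCI_NUM_REGIONS; i < num_regions; i++) {
		struct vfio_region_info *info;
		struct vfio_info_cap_header *hdr;
		struct vfio_region_info_cap_type *cap;
		uint32_t off;
		void *ram;

		info = calloc(1, sizeof(*info));
		info->argsz = sizeof(*info);
		info->index = i;
		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, info);

		if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) {
			free(info);
			continue;
		}

		/* re-read with room for the capability chain */
		info = realloc(info, info->argsz);
		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, info);

		for (off = info->cap_offset; off; off = hdr->next) {
			hdr = (struct vfio_info_cap_header *)((char *)info + off);
			if (hdr->id != VFIO_REGION_INFO_CAP_TYPE)
				continue;

			cap = (struct vfio_region_info_cap_type *)hdr;
			if (cap->type != (VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
					  NVIDIA_VENDOR_ID) ||
			    cap->subtype != REGION_SUBTYPE_NVIDIA_NVLINK2)
				continue;

			*size = info->size;
			ram = mmap(NULL, info->size, PROT_READ | PROT_WRITE,
				   MAP_SHARED, device, info->offset);
			free(info);
			return ram == MAP_FAILED ? NULL : ram;
		}
		free(info);
	}

	return NULL;
}

The returned mapping could then be registered with KVM as a memory slot,
which is how the series intends the GPU RAM to appear in the guest.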

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 17:04   ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Here is an rfc of some patches adding psaa-through support
> for NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied with 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2",
> it is not a PCI link but separate but. The design inherits from original
> NVLink from POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running, when
> it stops, it offlines the memory.
> 
> The amount of GPUs suggest passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver so we have a host with
> a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> with no page structs backing this window and we cannot touch this memory
> before the NVIDIA driver configures it in a host or a guest as
> HMI (hardware management interrupt?) occurs.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400 0000 0000
> 0x0420 0000 0000
> 0x0440 0000 0000
> 0x2400 0000 0000
> 0x2420 0000 0000
> 0x2440 0000 0000
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need
> translate user addresses to host physical and map GPU RAM memory but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides an userspace driver, this is no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but worth
> mentioning - it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change referencing in API
>   powerpc/iommu: Do not pin memory of a memory device
>   vfio_pci: Allow mapping extra regions
>   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> 
>  drivers/vfio/pci/Makefile              |   1 +
>  arch/powerpc/include/asm/mmu_context.h |   5 +-
>  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
>  include/uapi/linux/vfio.h              |   3 +
>  arch/powerpc/kernel/iommu.c            |   8 +-
>  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
>  drivers/vfio/pci/vfio_pci.c            |  19 +++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
>  drivers/vfio/pci/Kconfig               |   4 +
>  10 files changed, 319 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 17:04   ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Here is an rfc of some patches adding psaa-through support
> for NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied with 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2",
> it is not a PCI link but separate but. The design inherits from original
> NVLink from POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running, when
> it stops, it offlines the memory.
> 
> The amount of GPUs suggest passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver so we have a host with
> a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> with no page structs backing this window and we cannot touch this memory
> before the NVIDIA driver configures it in a host or a guest as
> HMI (hardware management interrupt?) occurs.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400 0000 0000
> 0x0420 0000 0000
> 0x0440 0000 0000
> 0x2400 0000 0000
> 0x2420 0000 0000
> 0x2440 0000 0000
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need
> translate user addresses to host physical and map GPU RAM memory but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides an userspace driver, this is no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but worth
> mentioning - it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change referencing in API
>   powerpc/iommu: Do not pin memory of a memory device
>   vfio_pci: Allow mapping extra regions
>   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> 
>  drivers/vfio/pci/Makefile              |   1 +
>  arch/powerpc/include/asm/mmu_context.h |   5 +-
>  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
>  include/uapi/linux/vfio.h              |   3 +
>  arch/powerpc/kernel/iommu.c            |   8 +-
>  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
>  drivers/vfio/pci/vfio_pci.c            |  19 +++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
>  drivers/vfio/pci/Kconfig               |   4 +
>  10 files changed, 319 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 17:04   ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Here is an rfc of some patches adding psaa-through support
> for NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied with 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2",
> it is not a PCI link but separate but. The design inherits from original
> NVLink from POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running, when
> it stops, it offlines the memory.
> 
> The amount of GPUs suggest passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver so we have a host with
> a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> with no page structs backing this window and we cannot touch this memory
> before the NVIDIA driver configures it in a host or a guest as
> HMI (hardware management interrupt?) occurs.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400 0000 0000
> 0x0420 0000 0000
> 0x0440 0000 0000
> 0x2400 0000 0000
> 0x2420 0000 0000
> 0x2440 0000 0000
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need
> translate user addresses to host physical and map GPU RAM memory but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides a userspace driver, this is of no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but is worth
> mentioning - it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
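
For what it's worth, spelling that arithmetic out (only a sanity check,
using the 64K IOMMU page size and 8-byte TCE entries implied above):

        unsigned long long end = 0x244000000000ULL + 0x2000000000ULL;
        unsigned long long entries = end >> 16; /* 64K IOMMU pages to cover it */
        unsigned long long bytes = entries * 8; /* one 8-byte TCE per page */
        /* bytes >> 20 == 4656, i.e. roughly 4.5GB for a flat table */

so on-demand indirect levels do look necessary for a window that size.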
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change referencing in API
>   powerpc/iommu: Do not pin memory of a memory device
>   vfio_pci: Allow mapping extra regions
>   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> 
>  drivers/vfio/pci/Makefile              |   1 +
>  arch/powerpc/include/asm/mmu_context.h |   5 +-
>  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
>  include/uapi/linux/vfio.h              |   3 +
>  arch/powerpc/kernel/iommu.c            |   8 +-
>  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
>  drivers/vfio/pci/vfio_pci.c            |  19 +++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
>  drivers/vfio/pci/Kconfig               |   4 +
>  10 files changed, 319 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-07  8:44   ` Alexey Kardashevskiy
  (?)
@ 2018-06-07 17:04     ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on NVIDIA GPU.
> This memory is presented to a host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This exports this RAM to the userspace as a new region so
> the NVIDIA driver in the guest can train these links and online GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h           |   3 +
>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/Kconfig            |   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>  	return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.
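
I.e. something roughly like this (a sketch only - the exact value is
arbitrary as long as it starts a fresh 10de namespace rather than
continuing the 8086 one):

        /* 8086 vendor PCI sub-types */
        #define VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION	(1)
        #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
        #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)

        /* 10de vendor PCI sub-types */
        #define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(1)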

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    pdev->device == 0x1db1 &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.
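
For instance, a sketch keyed off the device class instead (assuming the
subdriver, or its !CONFIG_VFIO_PCI_NVLINK2 stub, returns -ENODEV when the
GPU has no NPU/NVLink2 linkage in the device tree):

        if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
            (pdev->class >> 8) == PCI_CLASS_DISPLAY_3D) {
                /* the subdriver looks up the NPU peer itself */
                ret = vfio_pci_nvlink2_init(vdev);
                if (ret && ret != -ENODEV)
                        dev_warn(&vdev->pdev->dev,
                                 "Failed to setup NVIDIA NV2 RAM region\n");
        }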

> +		ret = vfio_pci_nvlink2_init(vdev);
> +		if (ret)
> +			dev_warn(&vdev->pdev->dev,
> +				 "Failed to setup NVIDIA NV2 RAM region\n");
> +	}
> +
>  	vfio_pci_probe_mmaps(vdev);
>  
>  	return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 0000000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *	Author: Alex Williamson <alex.williamson@redhat.com>
> + */
> +
> +#include <linux/io.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
> +
> +#include "vfio_pci_private.h"
> +
> +struct vfio_pci_nvlink2_data {
> +	unsigned long gpu_hpa;
> +	unsigned long useraddr;
> +	unsigned long size;
> +	struct mm_struct *mm;
> +	struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	void *base = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (iswrite) {
> +		if (copy_from_user(base + pos, buf, count))
> +			return -EFAULT;
> +	} else {
> +		if (copy_to_user(buf, base + pos, count))
> +			return -EFAULT;
> +	}
> +	*ppos += count;
> +
> +	return count;
> +}
> +
> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region)
> +{
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	long ret;
> +
> +	ret = mm_iommu_put(data->mm, data->mem);
> +	WARN_ON(ret);
> +
> +	mmdrop(data->mm);
> +	kfree(data);
> +}
> +
> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_region *region = vma->vm_private_data;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	int ret;
> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
> +	unsigned long vm_pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
> +
> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
> +	/* TODO: make it a tracepoint */
> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
> +		 vmf->address, pfn << PAGE_SHIFT, ret);
> +	if (ret)
> +		return VM_FAULT_SIGSEGV;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
> +	.fault = vfio_pci_nvlink2_mmap_fault,
> +};
> +
> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
> +{
> +	long ret;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +
> +	if (data->useraddr)
> +		return -EPERM;
> +
> +	if (vma->vm_end - vma->vm_start > data->size)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = region;
> +	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
> +
> +	/*
> +	 * Calling mm_iommu_newdev() here once as the region is not
> +	 * registered yet and therefore right initialization will happen now.
> +	 * Other places will use mm_iommu_find() which returns
> +	 * registered @mem and does not go gup().
> +	 */
> +	data->useraddr = vma->vm_start;
> +	data->mm = current->mm;
> +	atomic_inc(&data->mm->mm_count);
> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
> +			data->gpu_hpa, &data->mem);
> +
> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
> +			data->useraddr, data->gpu_hpa,
> +			vma->vm_end - vma->vm_start, ret);
> +
> +	return ret;
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
> +	.rw = vfio_pci_nvlink2_rw,
> +	.release = vfio_pci_nvlink2_release,
> +	.mmap = vfio_pci_nvlink2_mmap,
> +};
> +
> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	int len = 0, ret;
> +	struct device_node *npu_node, *mem_node;
> +	struct pci_dev *npu_dev;
> +	uint32_t *mem_phandle, *val;
> +	struct vfio_pci_nvlink2_data *data;
> +
> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
> +	if (!npu_dev)
> +		return -EINVAL;
> +
> +	npu_node = pci_device_to_OF_node(npu_dev);
> +	if (!npu_node)
> +		return -EINVAL;
> +
> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
> +	if (!mem_phandle)
> +		return -EINVAL;
> +
> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
> +	if (!mem_node)
> +		return -EINVAL;
> +
> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
> +	if (!val || len != 2 * sizeof(uint64_t))
> +		return -EINVAL;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
> +			be32_to_cpu(val[1]);
> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
> +			be32_to_cpu(val[3]);
> +
> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
> +			data->gpu_hpa + data->size - 1);
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
> +			&vfio_pci_nvlink2_regops, data->size,
> +			VFIO_REGION_INFO_FLAG_READ, data);
> +	if (ret)
> +		kfree(data);
> +
> +	return ret;
> +}
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 24ee260..2725bc8 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>  config VFIO_PCI_IGD
>  	depends on VFIO_PCI
>  	def_bool y if X86
> +
> +config VFIO_PCI_NVLINK2
> +	depends on VFIO_PCI
> +	def_bool y if PPC_POWERNV

As written, this also depends on PPC_POWERNV (or at least TCE), it's not
a portable implementation that we could re-use on X86 or ARM or any
other platform if hardware appeared for it.  Can we improve that as
well to make this less POWER specific?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-07 17:04     ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on NVIDIA GPU.
> This memory is presented to a host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This exports this RAM to the userspace as a new region so
> the NVIDIA driver in the guest can train these links and online GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h           |   3 +
>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/Kconfig            |   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>  	return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    pdev->device == 0x1db1 &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.

> +		ret = vfio_pci_nvlink2_init(vdev);
> +		if (ret)
> +			dev_warn(&vdev->pdev->dev,
> +				 "Failed to setup NVIDIA NV2 RAM region\n");
> +	}
> +
>  	vfio_pci_probe_mmaps(vdev);
>  
>  	return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 0000000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *	Author: Alex Williamson <alex.williamson@redhat.com>
> + */
> +
> +#include <linux/io.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
> +
> +#include "vfio_pci_private.h"
> +
> +struct vfio_pci_nvlink2_data {
> +	unsigned long gpu_hpa;
> +	unsigned long useraddr;
> +	unsigned long size;
> +	struct mm_struct *mm;
> +	struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	void *base = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (iswrite) {
> +		if (copy_from_user(base + pos, buf, count))
> +			return -EFAULT;
> +	} else {
> +		if (copy_to_user(buf, base + pos, count))
> +			return -EFAULT;
> +	}
> +	*ppos += count;
> +
> +	return count;
> +}
> +
> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region)
> +{
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	long ret;
> +
> +	ret = mm_iommu_put(data->mm, data->mem);
> +	WARN_ON(ret);
> +
> +	mmdrop(data->mm);
> +	kfree(data);
> +}
> +
> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_region *region = vma->vm_private_data;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	int ret;
> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
> +	unsigned long vm_pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
> +
> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
> +	/* TODO: make it a tracepoint */
> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
> +		 vmf->address, pfn << PAGE_SHIFT, ret);
> +	if (ret)
> +		return VM_FAULT_SIGSEGV;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
> +	.fault = vfio_pci_nvlink2_mmap_fault,
> +};
> +
> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
> +{
> +	long ret;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +
> +	if (data->useraddr)
> +		return -EPERM;
> +
> +	if (vma->vm_end - vma->vm_start > data->size)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = region;
> +	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
> +
> +	/*
> +	 * Calling mm_iommu_newdev() here once as the region is not
> +	 * registered yet and therefore right initialization will happen now.
> +	 * Other places will use mm_iommu_find() which returns
> +	 * registered @mem and does not go gup().
> +	 */
> +	data->useraddr = vma->vm_start;
> +	data->mm = current->mm;
> +	atomic_inc(&data->mm->mm_count);
> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
> +			data->gpu_hpa, &data->mem);
> +
> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
> +			data->useraddr, data->gpu_hpa,
> +			vma->vm_end - vma->vm_start, ret);
> +
> +	return ret;
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
> +	.rw = vfio_pci_nvlink2_rw,
> +	.release = vfio_pci_nvlink2_release,
> +	.mmap = vfio_pci_nvlink2_mmap,
> +};
> +
> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	int len = 0, ret;
> +	struct device_node *npu_node, *mem_node;
> +	struct pci_dev *npu_dev;
> +	uint32_t *mem_phandle, *val;
> +	struct vfio_pci_nvlink2_data *data;
> +
> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
> +	if (!npu_dev)
> +		return -EINVAL;
> +
> +	npu_node = pci_device_to_OF_node(npu_dev);
> +	if (!npu_node)
> +		return -EINVAL;
> +
> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
> +	if (!mem_phandle)
> +		return -EINVAL;
> +
> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
> +	if (!mem_node)
> +		return -EINVAL;
> +
> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
> +	if (!val || len != 2 * sizeof(uint64_t))
> +		return -EINVAL;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
> +			be32_to_cpu(val[1]);
> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
> +			be32_to_cpu(val[3]);
> +
> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
> +			data->gpu_hpa + data->size - 1);
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
> +			&vfio_pci_nvlink2_regops, data->size,
> +			VFIO_REGION_INFO_FLAG_READ, data);
> +	if (ret)
> +		kfree(data);
> +
> +	return ret;
> +}
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 24ee260..2725bc8 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>  config VFIO_PCI_IGD
>  	depends on VFIO_PCI
>  	def_bool y if X86
> +
> +config VFIO_PCI_NVLINK2
> +	depends on VFIO_PCI
> +	def_bool y if PPC_POWERNV

As written, this also depends on PPC_POWERNV (or at least TCE), it's not
a portable implementation that we could re-use on X86 or ARM or any
other platform if hardware appeared for it.  Can we improve that as
well to make this less POWER specific?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-07 17:04     ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on NVIDIA GPU.
> This memory is presented to a host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This exports this RAM to the userspace as a new region so
> the NVIDIA driver in the guest can train these links and online GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h           |   3 +
>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/Kconfig            |   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>  	return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    pdev->device == 0x1db1 &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.

> +		ret = vfio_pci_nvlink2_init(vdev);
> +		if (ret)
> +			dev_warn(&vdev->pdev->dev,
> +				 "Failed to setup NVIDIA NV2 RAM region\n");
> +	}
> +
>  	vfio_pci_probe_mmaps(vdev);
>  
>  	return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 0000000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *	Author: Alex Williamson <alex.williamson@redhat.com>
> + */
> +
> +#include <linux/io.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
> +
> +#include "vfio_pci_private.h"
> +
> +struct vfio_pci_nvlink2_data {
> +	unsigned long gpu_hpa;
> +	unsigned long useraddr;
> +	unsigned long size;
> +	struct mm_struct *mm;
> +	struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	void *base = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (iswrite) {
> +		if (copy_from_user(base + pos, buf, count))
> +			return -EFAULT;
> +	} else {
> +		if (copy_to_user(buf, base + pos, count))
> +			return -EFAULT;
> +	}
> +	*ppos += count;
> +
> +	return count;
> +}
> +
> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region)
> +{
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	long ret;
> +
> +	ret = mm_iommu_put(data->mm, data->mem);
> +	WARN_ON(ret);
> +
> +	mmdrop(data->mm);
> +	kfree(data);
> +}
> +
> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_region *region = vma->vm_private_data;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	int ret;
> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
> +	unsigned long vm_pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
> +
> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
> +	/* TODO: make it a tracepoint */
> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
> +		 vmf->address, pfn << PAGE_SHIFT, ret);
> +	if (ret)
> +		return VM_FAULT_SIGSEGV;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
> +	.fault = vfio_pci_nvlink2_mmap_fault,
> +};
> +
> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
> +{
> +	long ret;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +
> +	if (data->useraddr)
> +		return -EPERM;
> +
> +	if (vma->vm_end - vma->vm_start > data->size)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = region;
> +	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
> +
> +	/*
> +	 * Calling mm_iommu_newdev() here once as the region is not
> +	 * registered yet and therefore right initialization will happen now.
> +	 * Other places will use mm_iommu_find() which returns
> +	 * registered @mem and does not go gup().
> +	 */
> +	data->useraddr = vma->vm_start;
> +	data->mm = current->mm;
> +	atomic_inc(&data->mm->mm_count);
> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
> +			data->gpu_hpa, &data->mem);
> +
> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
> +			data->useraddr, data->gpu_hpa,
> +			vma->vm_end - vma->vm_start, ret);
> +
> +	return ret;
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
> +	.rw = vfio_pci_nvlink2_rw,
> +	.release = vfio_pci_nvlink2_release,
> +	.mmap = vfio_pci_nvlink2_mmap,
> +};
> +
> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	int len = 0, ret;
> +	struct device_node *npu_node, *mem_node;
> +	struct pci_dev *npu_dev;
> +	uint32_t *mem_phandle, *val;
> +	struct vfio_pci_nvlink2_data *data;
> +
> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
> +	if (!npu_dev)
> +		return -EINVAL;
> +
> +	npu_node = pci_device_to_OF_node(npu_dev);
> +	if (!npu_node)
> +		return -EINVAL;
> +
> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
> +	if (!mem_phandle)
> +		return -EINVAL;
> +
> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
> +	if (!mem_node)
> +		return -EINVAL;
> +
> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
> +	if (!val || len != 2 * sizeof(uint64_t))
> +		return -EINVAL;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
> +			be32_to_cpu(val[1]);
> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
> +			be32_to_cpu(val[3]);
> +
> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
> +			data->gpu_hpa + data->size - 1);
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
> +			&vfio_pci_nvlink2_regops, data->size,
> +			VFIO_REGION_INFO_FLAG_READ, data);
> +	if (ret)
> +		kfree(data);
> +
> +	return ret;
> +}
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 24ee260..2725bc8 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>  config VFIO_PCI_IGD
>  	depends on VFIO_PCI
>  	def_bool y if X86
> +
> +config VFIO_PCI_NVLINK2
> +	depends on VFIO_PCI
> +	def_bool y if PPC_POWERNV

As written, this also depends on PPC_POWERNV (or at least TCE), it's not
a portable implementation that we could re-use on X86 or ARM or any
other platform if hardware appeared for it.  Can we improve that as
well to make this less POWER specific?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions
  2018-06-07  8:44   ` Alexey Kardashevskiy
  (?)
@ 2018-06-07 17:04     ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:19 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

What's an "extra region", -ENOCOMMITLOG

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/vfio_pci_private.h |  3 +++
>  drivers/vfio/pci/vfio_pci.c         | 10 ++++++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
>  		      size_t count, loff_t *ppos, bool iswrite);
>  	void	(*release)(struct vfio_pci_device *vdev,
>  			   struct vfio_pci_region *region);
> +	int	(*mmap)(struct vfio_pci_device *vdev,
> +			struct vfio_pci_region *region,
> +			struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3729937..7bddf1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  		return -EINVAL;
>  	if ((vma->vm_flags & VM_SHARED) == 0)
>  		return -EINVAL;
> +	if (index >= VFIO_PCI_NUM_REGIONS) {
> +		int regnum = index - VFIO_PCI_NUM_REGIONS;
> +		struct vfio_pci_region *region = vdev->region + regnum;
> +
> +		if (region && region->ops && region->ops->mmap)
> +			return region->ops->mmap(vdev, region, vma);
> +		return -EINVAL;
> +	}
>  	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>  		return -EINVAL;
> -	if (!vdev->bar_mmap_supported[index])
> -		return -EINVAL;

This seems unrelated.  Thanks,

Alex
  
>  	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>  	req_len = vma->vm_end - vma->vm_start;
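
For context, a device specific ("extra") region is consumed from userspace
like any other vfio region; roughly like this, where the index past
VFIO_PCI_NUM_REGIONS and the prot flags are illustrative only:

        struct vfio_region_info info = {
                .argsz = sizeof(info),
                .index = VFIO_PCI_NUM_REGIONS,  /* first device specific region */
        };

        ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);
        /* the new regops->mmap() hook is what lets this mmap() succeed */
        void *gpuram = mmap(NULL, info.size, PROT_READ, MAP_SHARED,
                            device_fd, info.offset);

A commit log spelling that out would help.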

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions
@ 2018-06-07 17:04     ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On Thu,  7 Jun 2018 18:44:19 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

What's an "extra region", -ENOCOMMITLOG

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/vfio_pci_private.h |  3 +++
>  drivers/vfio/pci/vfio_pci.c         | 10 ++++++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
>  		      size_t count, loff_t *ppos, bool iswrite);
>  	void	(*release)(struct vfio_pci_device *vdev,
>  			   struct vfio_pci_region *region);
> +	int	(*mmap)(struct vfio_pci_device *vdev,
> +			struct vfio_pci_region *region,
> +			struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3729937..7bddf1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  		return -EINVAL;
>  	if ((vma->vm_flags & VM_SHARED) == 0)
>  		return -EINVAL;
> +	if (index >= VFIO_PCI_NUM_REGIONS) {
> +		int regnum = index - VFIO_PCI_NUM_REGIONS;
> +		struct vfio_pci_region *region = vdev->region + regnum;
> +
> +		if (region && region->ops && region->ops->mmap)
> +			return region->ops->mmap(vdev, region, vma);
> +		return -EINVAL;
> +	}
>  	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>  		return -EINVAL;
> -	if (!vdev->bar_mmap_supported[index])
> -		return -EINVAL;

This seems unrelated.  Thanks,

Alex
  
>  	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>  	req_len = vma->vm_end - vma->vm_start;

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions
@ 2018-06-07 17:04     ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:19 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

What's an "extra region", -ENOCOMMITLOG

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/vfio_pci_private.h |  3 +++
>  drivers/vfio/pci/vfio_pci.c         | 10 ++++++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
>  		      size_t count, loff_t *ppos, bool iswrite);
>  	void	(*release)(struct vfio_pci_device *vdev,
>  			   struct vfio_pci_region *region);
> +	int	(*mmap)(struct vfio_pci_device *vdev,
> +			struct vfio_pci_region *region,
> +			struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3729937..7bddf1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  		return -EINVAL;
> 	if ((vma->vm_flags & VM_SHARED) == 0)
>  		return -EINVAL;
> +	if (index >= VFIO_PCI_NUM_REGIONS) {
> +		int regnum = index - VFIO_PCI_NUM_REGIONS;
> +		struct vfio_pci_region *region = vdev->region + regnum;
> +
> +		if (region && region->ops && region->ops->mmap)
> +			return region->ops->mmap(vdev, region, vma);
> +		return -EINVAL;
> +	}
>  	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>  		return -EINVAL;
> -	if (!vdev->bar_mmap_supported[index])
> -		return -EINVAL;

This seems unrelated.  Thanks,

Alex
  
>  	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>  	req_len = vma->vm_end - vma->vm_start;



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 17:04   ` Alex Williamson
  (?)
@ 2018-06-07 21:54     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-07 21:54 UTC (permalink / raw)
  To: Alex Williamson, Alexey Kardashevskiy
  Cc: kvm, Alistair Popple, Ram Pai, kvm-ppc, linuxppc-dev, David Gibson

On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> 
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

Not entirely. On POWER chips, we also have an nvlink between the device
and the CPU which is running significantly faster than PCIe.

But yes, there are cross-links and those should probably be accounted
for in the grouping.

> > Each bridge represents an additional hardware interface called "NVLink2",
> > it is not a PCI link but separate but. The design inherits from original
> > NVLink from POWER8.
> > 
> > The new feature of V100 is 16GB of cache coherent memory on GPU board.
> > This memory is presented to the host via the device tree and remains offline
> > until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> > bridges above) and the nvidia-persistenced daemon then onlines it.
> > The memory remains online as long as nvidia-persistenced is running, when
> > it stops, it offlines the memory.
> > 
> > The amount of GPUs suggest passing them through to a guest. However,
> > in order to do so we cannot use the NVIDIA driver so we have a host with
> > a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> > with no page structs backing this window and we cannot touch this memory
> > before the NVIDIA driver configures it in a host or a guest as
> > HMI (hardware management interrupt?) occurs.
> 
> Having a lot of GPUs only suggests assignment to a guest if there's
> actually isolation provided between those GPUs.  Otherwise we'd need to
> assign them as one big group, which gets a lot less useful.  Thanks,
> 
> Alex
> 
> > On the example system the GPU RAM windows are located at:
> > 0x0400 0000 0000
> > 0x0420 0000 0000
> > 0x0440 0000 0000
> > 0x2400 0000 0000
> > 0x2420 0000 0000
> > 0x2440 0000 0000
> > 
> > So the complications are:
> > 
> > 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> > to VFIO-to-userspace or guest-to-host-physical translations till
> > the driver trains it (i.e. nvidia-persistenced has started), otherwise
> > prefetching happens and HMI occurs; I am trying to get this changed
> > somehow;
> > 
> > 2. since it appears as normal cache coherent memory, it will be used
> > for DMA which means it has to be pinned and mapped in the host. Having
> > no page structs makes it different from the usual case - we only need
> > translate user addresses to host physical and map GPU RAM memory but
> > pinning is not required.
> > 
> > This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> > register this memory as a KVM memory slot and present memory nodes to
> > the guest. Unless NVIDIA provides a userspace driver, this is of no use
> > for things like DPDK.
> > 
> > 
> > There is another problem which the series does not address but is worth
> > mentioning - it is not strictly necessary to map GPU RAM to the guest
> > exactly where it is in the host (I tested this to some extent), we still
> > might want to represent the memory at the same offset as on the host
> > which increases the size of a TCE table needed to cover such a huge
> > window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
> > I am addressing this in a separate patchset by allocating indirect TCE
> > levels on demand and using 16MB IOMMU pages in the guest as we can now
> > back emulated pages with the smaller hardware ones.
> > 
> > 
> > This is an RFC. Please comment. Thanks.
> > 
> > 
> > 
> > Alexey Kardashevskiy (5):
> >   vfio/spapr_tce: Simplify page contained test
> >   powerpc/iommu_context: Change referencing in API
> >   powerpc/iommu: Do not pin memory of a memory device
> >   vfio_pci: Allow mapping extra regions
> >   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> > 
> >  drivers/vfio/pci/Makefile              |   1 +
> >  arch/powerpc/include/asm/mmu_context.h |   5 +-
> >  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
> >  include/uapi/linux/vfio.h              |   3 +
> >  arch/powerpc/kernel/iommu.c            |   8 +-
> >  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
> >  drivers/vfio/pci/vfio_pci.c            |  19 +++-
> >  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
> >  drivers/vfio/pci/Kconfig               |   4 +
> >  10 files changed, 319 insertions(+), 34 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> > 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 21:54     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-07 21:54 UTC (permalink / raw)
  To: Alex Williamson, Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Ram Pai, kvm, Alistair Popple

On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> 
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

Not entirely. On POWER chips, we also have an nvlink between the device
and the CPU which is running significantly faster than PCIe.

But yes, there are cross-links and those should probably be accounted
for in the grouping.

> > Each bridge represents an additional hardware interface called "NVLink2",
> > it is not a PCI link but separate but. The design inherits from original
> > NVLink from POWER8.
> > 
> > The new feature of V100 is 16GB of cache coherent memory on GPU board.
> > This memory is presented to the host via the device tree and remains offline
> > until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> > bridges above) and the nvidia-persistenced daemon then onlines it.
> > The memory remains online as long as nvidia-persistenced is running, when
> > it stops, it offlines the memory.
> > 
> > The amount of GPUs suggest passing them through to a guest. However,
> > in order to do so we cannot use the NVIDIA driver so we have a host with
> > a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> > with no page structs backing this window and we cannot touch this memory
> > before the NVIDIA driver configures it in a host or a guest as
> > HMI (hardware management interrupt?) occurs.
> 
> Having a lot of GPUs only suggests assignment to a guest if there's
> actually isolation provided between those GPUs.  Otherwise we'd need to
> assign them as one big group, which gets a lot less useful.  Thanks,
> 
> Alex
> 
> > On the example system the GPU RAM windows are located at:
> > 0x0400 0000 0000
> > 0x0420 0000 0000
> > 0x0440 0000 0000
> > 0x2400 0000 0000
> > 0x2420 0000 0000
> > 0x2440 0000 0000
> > 
> > So the complications are:
> > 
> > 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> > to VFIO-to-userspace or guest-to-host-physical translations till
> > the driver trains it (i.e. nvidia-persistenced has started), otherwise
> > prefetching happens and HMI occurs; I am trying to get this changed
> > somehow;
> > 
> > 2. since it appears as normal cache coherent memory, it will be used
> > for DMA which means it has to be pinned and mapped in the host. Having
> > no page structs makes it different from the usual case - we only need
> > translate user addresses to host physical and map GPU RAM memory but
> > pinning is not required.
> > 
> > This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> > register this memory as a KVM memory slot and present memory nodes to
> > the guest. Unless NVIDIA provides a userspace driver, this is of no use
> > for things like DPDK.
> > 
> > 
> > There is another problem which the series does not address but is worth
> > mentioning - it is not strictly necessary to map GPU RAM to the guest
> > exactly where it is in the host (I tested this to some extent), we still
> > might want to represent the memory at the same offset as on the host
> > which increases the size of a TCE table needed to cover such a huge
> > window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
> > I am addressing this in a separate patchset by allocating indirect TCE
> > levels on demand and using 16MB IOMMU pages in the guest as we can now
> > back emulated pages with the smaller hardware ones.
> > 
> > 
> > This is an RFC. Please comment. Thanks.
> > 
> > 
> > 
> > Alexey Kardashevskiy (5):
> >   vfio/spapr_tce: Simplify page contained test
> >   powerpc/iommu_context: Change referencing in API
> >   powerpc/iommu: Do not pin memory of a memory device
> >   vfio_pci: Allow mapping extra regions
> >   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> > 
> >  drivers/vfio/pci/Makefile              |   1 +
> >  arch/powerpc/include/asm/mmu_context.h |   5 +-
> >  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
> >  include/uapi/linux/vfio.h              |   3 +
> >  arch/powerpc/kernel/iommu.c            |   8 +-
> >  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
> >  drivers/vfio/pci/vfio_pci.c            |  19 +++-
> >  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
> >  drivers/vfio/pci/Kconfig               |   4 +
> >  10 files changed, 319 insertions(+), 34 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> > 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 21:54     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-07 21:54 UTC (permalink / raw)
  To: Alex Williamson, Alexey Kardashevskiy
  Cc: kvm, Alistair Popple, Ram Pai, kvm-ppc, linuxppc-dev, David Gibson

On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> 
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

Not entirely. On POWER chips, we also have an nvlink between the device
and the CPU which is running significantly faster than PCIe.

But yes, there are cross-links and those should probably be accounted
for in the grouping.

> > Each bridge represents an additional hardware interface called "NVLink2",
> > it is not a PCI link but separate but. The design inherits from original
> > NVLink from POWER8.
> > 
> > The new feature of V100 is 16GB of cache coherent memory on GPU board.
> > This memory is presented to the host via the device tree and remains offline
> > until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> > bridges above) and the nvidia-persistenced daemon then onlines it.
> > The memory remains online as long as nvidia-persistenced is running, when
> > it stops, it offlines the memory.
> > 
> > The amount of GPUs suggest passing them through to a guest. However,
> > in order to do so we cannot use the NVIDIA driver so we have a host with
> > a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> > with no page structs backing this window and we cannot touch this memory
> > before the NVIDIA driver configures it in a host or a guest as
> > HMI (hardware management interrupt?) occurs.
> 
> Having a lot of GPUs only suggests assignment to a guest if there's
> actually isolation provided between those GPUs.  Otherwise we'd need to
> assign them as one big group, which gets a lot less useful.  Thanks,
> 
> Alex
> 
> > On the example system the GPU RAM windows are located at:
> > 0x0400 0000 0000
> > 0x0420 0000 0000
> > 0x0440 0000 0000
> > 0x2400 0000 0000
> > 0x2420 0000 0000
> > 0x2440 0000 0000
> > 
> > So the complications are:
> > 
> > 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> > to VFIO-to-userspace or guest-to-host-physical translations till
> > the driver trains it (i.e. nvidia-persistenced has started), otherwise
> > prefetching happens and HMI occurs; I am trying to get this changed
> > somehow;
> > 
> > 2. since it appears as normal cache coherent memory, it will be used
> > for DMA which means it has to be pinned and mapped in the host. Having
> > no page structs makes it different from the usual case - we only need
> > translate user addresses to host physical and map GPU RAM memory but
> > pinning is not required.
> > 
> > This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> > register this memory as a KVM memory slot and present memory nodes to
> > the guest. Unless NVIDIA provides a userspace driver, this is of no use
> > for things like DPDK.
> > 
> > 
> > There is another problem which the series does not address but is worth
> > mentioning - it is not strictly necessary to map GPU RAM to the guest
> > exactly where it is in the host (I tested this to some extent), we still
> > might want to represent the memory at the same offset as on the host
> > which increases the size of a TCE table needed to cover such a huge
> > window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
> > I am addressing this in a separate patchset by allocating indirect TCE
> > levels on demand and using 16MB IOMMU pages in the guest as we can now
> > back emulated pages with the smaller hardware ones.
> > 
> > 
> > This is an RFC. Please comment. Thanks.
> > 
> > 
> > 
> > Alexey Kardashevskiy (5):
> >   vfio/spapr_tce: Simplify page contained test
> >   powerpc/iommu_context: Change referencing in API
> >   powerpc/iommu: Do not pin memory of a memory device
> >   vfio_pci: Allow mapping extra regions
> >   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> > 
> >  drivers/vfio/pci/Makefile              |   1 +
> >  arch/powerpc/include/asm/mmu_context.h |   5 +-
> >  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
> >  include/uapi/linux/vfio.h              |   3 +
> >  arch/powerpc/kernel/iommu.c            |   8 +-
> >  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
> >  drivers/vfio/pci/vfio_pci.c            |  19 +++-
> >  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
> >  drivers/vfio/pci/Kconfig               |   4 +
> >  10 files changed, 319 insertions(+), 34 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> > 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 21:54     ` Benjamin Herrenschmidt
  (?)
@ 2018-06-07 22:15       ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 22:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Fri, 08 Jun 2018 07:54:02 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > 
> > Can we back up and discuss whether the IOMMU grouping of NVLink
> > connected devices makes sense?  AIUI we have a PCI view of these
> > devices and from that perspective they're isolated.  That's the view of
> > the device used to generate the grouping.  However, not visible to us,
> > these devices are interconnected via NVLink.  What isolation properties
> > does NVLink provide given that its entire purpose for existing seems to
> > be to provide a high performance link for p2p between devices?  
> 
> Not entire. On POWER chips, we also have an nvlink between the device
> and the CPU which is running significantly faster than PCIe.
> 
> But yes, there are cross-links and those should probably be accounted
> for in the grouping.

Then after we fix the grouping, can we just let the host driver manage
this coherent memory range and expose vGPUs to guests?  The use case of
assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
convince NVIDIA to support more than a single vGPU per VM though)
Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 22:15       ` Alex Williamson
  (?)
@ 2018-06-07 23:20         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-07 23:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> On Fri, 08 Jun 2018 07:54:02 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > > 
> > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > connected devices makes sense?  AIUI we have a PCI view of these
> > > devices and from that perspective they're isolated.  That's the view of
> > > the device used to generate the grouping.  However, not visible to us,
> > > these devices are interconnected via NVLink.  What isolation properties
> > > does NVLink provide given that its entire purpose for existing seems to
> > > be to provide a high performance link for p2p between devices?  
> > 
> > Not entire. On POWER chips, we also have an nvlink between the device
> > and the CPU which is running significantly faster than PCIe.
> > 
> > But yes, there are cross-links and those should probably be accounted
> > for in the grouping.
> 
> Then after we fix the grouping, can we just let the host driver manage
> this coherent memory range and expose vGPUs to guests?  The use case of
> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> convince NVIDIA to support more than a single vGPU per VM though)
> Thanks,

I don't know about "vGPUs" and what nVidia may be cooking in that area.

The patches from Alexey allow for passing through the full thing, but
they aren't trivial (there are additional issues, I'm not sure how well
covered they are, as we need to play with the mapping attributes of
portions of the GPU memory on the host side...).

Note: The cross-links are only per-socket so that would be 2 groups of
3.

We *can* allow individual GPUs to be passed through, either if somebody
designs a system without cross links, or if the user is ok with the
security risk as the guest driver will not enable them if it doesn't
"find" both sides of them.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 23:20         ` Benjamin Herrenschmidt
  (?)
@ 2018-06-08  0:34           ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  0:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Fri, 08 Jun 2018 09:20:30 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >   
> > > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> > > > 
> > > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > connected devices makes sense?  AIUI we have a PCI view of these
> > > > devices and from that perspective they're isolated.  That's the view of
> > > > the device used to generate the grouping.  However, not visible to us,
> > > > these devices are interconnected via NVLink.  What isolation properties
> > > > does NVLink provide given that its entire purpose for existing seems to
> > > > be to provide a high performance link for p2p between devices?    
> > > 
> > > Not entire. On POWER chips, we also have an nvlink between the device
> > > and the CPU which is running significantly faster than PCIe.
> > > 
> > > But yes, there are cross-links and those should probably be accounted
> > > for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)
> > Thanks,  
> 
> I don't know about "vGPUs" and what nVidia may be cooking in that area.
> 
> The patches from Alexey allow for passing through the full thing, but
> they aren't trivial (there are additional issues, I'm not sure how well
> covered they are, as we need to play with the mapping attributes of
> portions of the GPU memory on the host side...).
> 
> Note: The cross-links are only per-socket so that would be 2 groups of
> 3.
> 
> We *can* allow individual GPUs to be passed through, either if somebody
> designs a system without cross links, or if the user is ok with the
> security risk as the guest driver will not enable them if it doesn't
> "find" both sides of them.

If GPUs are not isolated and we cannot prevent them from probing each
other via these links, then I think we have an obligation to configure
grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  0:34           ` Alex Williamson
  (?)
@ 2018-06-08  0:58             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-08  0:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > We *can* allow individual GPUs to be passed through, either if somebody
> > designs a system without cross links, or if the user is ok with the
> > security risk as the guest driver will not enable them if it doesn't
> > "find" both sides of them.
> 
> If GPUs are not isolated and we cannot prevent them from probing each
> other via these links, then I think we have an obligation to configure
> grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Well, it's a user decision, no ? Like how we used to let the user
decide whether to pass-through things that have LSIs shared out of
their domain.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  0:58             ` Benjamin Herrenschmidt
  (?)
@ 2018-06-08  1:18               ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  1:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Fri, 08 Jun 2018 10:58:54 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > > We *can* allow individual GPUs to be passed through, either if somebody
> > > designs a system without cross links, or if the user is ok with the
> > > security risk as the guest driver will not enable them if it doesn't
> > > "find" both sides of them.  
> > 
> > If GPUs are not isolated and we cannot prevent them from probing each
> > other via these links, then I think we have an obligation to configure
> > grouping in a way that doesn't rely on a benevolent userspace.  Thanks,  
> 
> Well, it's a user decision, no ? Like how we used to let the user
> decide whether to pass-through things that have LSIs shared out of
> their domain.

No, users don't get to pinky swear they'll be good.  The kernel creates
IOMMU groups assuming the worst case isolation and malicious users.
It's the kernel's job to protect itself from users and to protect users
from each other.  Anything else is unsupportable.  The only way to
bypass the default grouping is to modify the kernel.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 22:15       ` Alex Williamson
  (?)
@ 2018-06-08  3:08         ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  3:08 UTC (permalink / raw)
  To: Alex Williamson, Benjamin Herrenschmidt
  Cc: kvm, Alistair Popple, Ram Pai, kvm-ppc, linuxppc-dev, David Gibson

On 8/6/18 8:15 am, Alex Williamson wrote:
> On Fri, 08 Jun 2018 07:54:02 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
>>>
>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>> connected devices makes sense?  AIUI we have a PCI view of these
>>> devices and from that perspective they're isolated.  That's the view of
>>> the device used to generate the grouping.  However, not visible to us,
>>> these devices are interconnected via NVLink.  What isolation properties
>>> does NVLink provide given that its entire purpose for existing seems to
>>> be to provide a high performance link for p2p between devices?  
>>
>> Not entire. On POWER chips, we also have an nvlink between the device
>> and the CPU which is running significantly faster than PCIe.
>>
>> But yes, there are cross-links and those should probably be accounted
>> for in the grouping.
> 
> Then after we fix the grouping, can we just let the host driver manage
> this coherent memory range and expose vGPUs to guests?  The use case of
> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> convince NVIDIA to support more than a single vGPU per VM though)

These are physical GPUs, not the virtual SR-IOV-alike things they also
implement elsewhere.

My current understanding is that every P9 chip in that box has some NVLink2
logic on it so each P9 is directly connected to 3 GPUs via PCIe and
2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
as well.

From the small bits of information I have, it seems that a GPU can work
perfectly well alone: if the NVIDIA driver does not see these interconnects
(because we do not pass the rest of the big 3xGPU group to this guest), it
continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
which simply refuses to work until all 3 GPUs are passed, so there is some
distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
get confirmation from NVIDIA that it is ok to pass just a single GPU.

So we will either have 6 groups (one per GPU) or 2 groups (one per
interconnected group).
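
Either way the group stays the unit of ownership for userspace: with the
2-groups-of-3 layout, all three GPUs (and their bridges) have to be unbound
from their host drivers before any of them can be used, which is what the
standard viability check below trips over. A minimal userspace sketch (the
group number is just an example, error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
        int group = open("/dev/vfio/10", O_RDWR); /* example group number */
        struct vfio_group_status status = { .argsz = sizeof(status) };

        if (group < 0 || ioctl(group, VFIO_GROUP_GET_STATUS, &status))
                return 1;

        /* Stays clear until every device in the group is bound to vfio-pci */
        if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
                fprintf(stderr, "group not viable yet\n");
                return 1;
        }

        return 0;
}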


-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-07 17:04     ` Alex Williamson
  (?)
@ 2018-06-08  3:09       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  3:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On 8/6/18 3:04 am, Alex Williamson wrote:
> On Thu,  7 Jun 2018 18:44:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> Some POWER9 chips come with special NVLink2 links which provide
>> cacheable memory access to the RAM physically located on NVIDIA GPU.
>> This memory is presented to a host via the device tree but remains
>> offline until the NVIDIA driver onlines it.
>>
>> This exports this RAM to the userspace as a new region so
>> the NVIDIA driver in the guest can train these links and online GPU RAM.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  drivers/vfio/pci/Makefile           |   1 +
>>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>>  include/uapi/linux/vfio.h           |   3 +
>>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/Kconfig            |   4 +
>>  6 files changed, 215 insertions(+)
>>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
>>
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 76d8ec0..9662c06 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -1,5 +1,6 @@
>>  
>>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>>  
>>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
>> index 86aab05..7115b9b 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>>  	return -ENODEV;
>>  }
>>  #endif
>> +#ifdef CONFIG_VFIO_PCI_NVLINK2
>> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
>> +#else
>> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +	return -ENODEV;
>> +}
>> +#endif
>>  #endif /* VFIO_PCI_PRIVATE_H */
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 1aa7b82..2fe8227 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>  
>> +/* NVIDIA GPU NV2 */
>> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)
> 
> You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
> subtype.  Each vendor has their own address space of sub-types.


True, I'll update. I just like unique numbers better :)
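
Something like this then, restarting the numbering in the NVIDIA namespace
(the value is illustrative, to be fixed properly in the respin):

/* NVIDIA vendor sub-types get their own numbering */
#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2      (1)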

> 
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>   * which allows direct access to non-MSIX registers which happened to be within
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 7bddf1e..38c9475 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>  		}
>>  	}
>>  
>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>> +	    pdev->device == 0x1db1 &&
>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> 
> Can't we do better than check this based on device ID?  Perhaps PCIe
> capability hints at this?

A normal PCI pluggable device looks like this:

root@fstn3:~# sudo lspci -vs 0000:03:00.0
0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
	Flags: fast devsel, IRQ 497
	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19


This is a NVLink v1 machine:

aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
	Subsystem: NVIDIA Corporation Device 116b
	Flags: bus master, fast devsel, latency 0, IRQ 457
	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384


This is the one the patch is for:

[aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
(rev a1)
	Subsystem: NVIDIA Corporation Device 1212
	Flags: fast devsel, IRQ 82, NUMA node 8
	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Capabilities: [ac0] #23
	Kernel driver in use: vfio-pci


I can only see a new capability #23, and I have no idea what it actually
does - my latest PCIe spec is PCI_Express_Base_r3.1a_December7-2015.pdf,
which only knows capabilities up to #21. Do you have a newer spec? It does
not seem promising anyway...
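
If that [ac0] capability turned out to be a reliable marker on these NVLink2
GPUs, the check in vfio_pci_enable() could probe for it instead of hardcoding
the device ID - roughly like below (untested, and I have not confirmed that
plain PCIe V100s never expose 0x23):

        /* 0x23 == the unknown extended capability seen at [ac0] above */
        if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
            pci_find_ext_capability(pdev, 0x23) &&
            IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
                ret = vfio_pci_nvlink2_init(vdev);
                if (ret)
                        dev_warn(&vdev->pdev->dev,
                                 "Failed to setup NVIDIA NV2 RAM region\n");
        }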


> Is it worthwhile to continue with assigning the device in the !ENABLED
> case?  For instance, maybe it would be better to provide a weak
> definition of vfio_pci_nvlink2_init() that would cause us to fail here
> if we don't have this device specific support enabled.  I realize
> you're following the example set forth for IGD, but those regions are
> optional, for better or worse.


The device is supposed to work even without GPU RAM passed through; in
this case it should look like NVLink v1 (there used to be bugs in the
driver, maybe there still are - I have not checked for a while, but there
is a bug open at NVIDIA about this and they were going to fix it), which is
why I chose not to fail here.
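
For reference, the weak-definition variant Alex suggests would look roughly
like this - the strong definition in vfio_pci_nvlink2.o would override it
when CONFIG_VFIO_PCI_NVLINK2=y, and vfio_pci_enable() could then fail on
-ENODEV, which is exactly the behaviour I want to avoid here:

/* Sketch only: weak fallback when the NVLink2 subdriver is not built */
int __weak vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
{
        return -ENODEV;
}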



>> +		ret = vfio_pci_nvlink2_init(vdev);
>> +		if (ret)
>> +			dev_warn(&vdev->pdev->dev,
>> +				 "Failed to setup NVIDIA NV2 RAM region\n");
>> +	}
>> +
>>  	vfio_pci_probe_mmaps(vdev);
>>  
>>  	return 0;
>> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
>> new file mode 100644
>> index 0000000..451c5cb
>> --- /dev/null
>> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
>> @@ -0,0 +1,190 @@
>> +// SPDX-License-Identifier: GPL-2.0+
>> +/*
>> + * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
>> + *
>> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
>> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * Register an on-GPU RAM region for cacheable access.
>> + *
>> + * Derived from original vfio_pci_igd.c:
>> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
>> + *	Author: Alex Williamson <alex.williamson@redhat.com>
>> + */
>> +
>> +#include <linux/io.h>
>> +#include <linux/pci.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/vfio.h>
>> +#include <linux/sched/mm.h>
>> +#include <linux/mmu_context.h>
>> +
>> +#include "vfio_pci_private.h"
>> +
>> +struct vfio_pci_nvlink2_data {
>> +	unsigned long gpu_hpa;
>> +	unsigned long useraddr;
>> +	unsigned long size;
>> +	struct mm_struct *mm;
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +};
>> +
>> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
>> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>> +{
>> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>> +	void *base = vdev->region[i].data;
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +
>> +	if (pos >= vdev->region[i].size)
>> +		return -EINVAL;
>> +
>> +	count = min(count, (size_t)(vdev->region[i].size - pos));
>> +
>> +	if (iswrite) {
>> +		if (copy_from_user(base + pos, buf, count))
>> +			return -EFAULT;
>> +	} else {
>> +		if (copy_to_user(buf, base + pos, count))
>> +			return -EFAULT;
>> +	}
>> +	*ppos += count;
>> +
>> +	return count;
>> +}
>> +
>> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
>> +		struct vfio_pci_region *region)
>> +{
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +	long ret;
>> +
>> +	ret = mm_iommu_put(data->mm, data->mem);
>> +	WARN_ON(ret);
>> +
>> +	mmdrop(data->mm);
>> +	kfree(data);
>> +}
>> +
>> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	struct vfio_pci_region *region = vma->vm_private_data;
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +	int ret;
>> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
>> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
>> +	unsigned long vm_pgoff = vma->vm_pgoff &
>> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
>> +
>> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
>> +	/* TODO: make it a tracepoint */
>> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
>> +		 vmf->address, pfn << PAGE_SHIFT, ret);
>> +	if (ret)
>> +		return VM_FAULT_SIGSEGV;
>> +
>> +	return VM_FAULT_NOPAGE;
>> +}
>> +
>> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
>> +	.fault = vfio_pci_nvlink2_mmap_fault,
>> +};
>> +
>> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
>> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
>> +{
>> +	long ret;
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +
>> +	if (data->useraddr)
>> +		return -EPERM;
>> +
>> +	if (vma->vm_end - vma->vm_start > data->size)
>> +		return -EINVAL;
>> +
>> +	vma->vm_private_data = region;
>> +	vma->vm_flags |= VM_PFNMAP;
>> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
>> +
>> +	/*
>> +	 * Calling mm_iommu_newdev() here once as the region is not
>> +	 * registered yet and therefore right initialization will happen now.
>> +	 * Other places will use mm_iommu_find() which returns
>> +	 * registered @mem and does not go gup().
>> +	 */
>> +	data->useraddr = vma->vm_start;
>> +	data->mm = current->mm;
>> +	atomic_inc(&data->mm->mm_count);
>> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
>> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
>> +			data->gpu_hpa, &data->mem);
>> +
>> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
>> +			data->useraddr, data->gpu_hpa,
>> +			vma->vm_end - vma->vm_start, ret);
>> +
>> +	return ret;
>> +}
>> +
>> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
>> +	.rw = vfio_pci_nvlink2_rw,
>> +	.release = vfio_pci_nvlink2_release,
>> +	.mmap = vfio_pci_nvlink2_mmap,
>> +};
>> +
>> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +	int len = 0, ret;
>> +	struct device_node *npu_node, *mem_node;
>> +	struct pci_dev *npu_dev;
>> +	uint32_t *mem_phandle, *val;
>> +	struct vfio_pci_nvlink2_data *data;
>> +
>> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
>> +	if (!npu_dev)
>> +		return -EINVAL;
>> +
>> +	npu_node = pci_device_to_OF_node(npu_dev);
>> +	if (!npu_node)
>> +		return -EINVAL;
>> +
>> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
>> +	if (!mem_phandle)
>> +		return -EINVAL;
>> +
>> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
>> +	if (!mem_node)
>> +		return -EINVAL;
>> +
>> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
>> +	if (!val || len != 2 * sizeof(uint64_t))
>> +		return -EINVAL;
>> +
>> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
>> +	if (!data)
>> +		return -ENOMEM;
>> +
>> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
>> +			be32_to_cpu(val[1]);
>> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
>> +			be32_to_cpu(val[3]);
>> +
>> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
>> +			data->gpu_hpa + data->size - 1);
>> +
>> +	ret = vfio_pci_register_dev_region(vdev,
>> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
>> +			&vfio_pci_nvlink2_regops, data->size,
>> +			VFIO_REGION_INFO_FLAG_READ, data);
>> +	if (ret)
>> +		kfree(data);
>> +
>> +	return ret;
>> +}
>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>> index 24ee260..2725bc8 100644
>> --- a/drivers/vfio/pci/Kconfig
>> +++ b/drivers/vfio/pci/Kconfig
>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>>  config VFIO_PCI_IGD
>>  	depends on VFIO_PCI
>>  	def_bool y if X86
>> +
>> +config VFIO_PCI_NVLINK2
>> +	depends on VFIO_PCI
>> +	def_bool y if PPC_POWERNV
> 
> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> a portable implementation that we could re-use on X86 or ARM or any
> other platform if hardware appeared for it.  Can we improve that as
> well to make this less POWER specific?  Thanks,


As I said in another mail, every P9 chip in that box has some NVLink2 logic
on it, so it is not even common among P9s in general, and I am having a
hard time seeing these V100s used elsewhere in such a way.



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-08  3:09       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  3:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On 8/6/18 3:04 am, Alex Williamson wrote:
> On Thu,  7 Jun 2018 18:44:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> Some POWER9 chips come with special NVLink2 links which provide
>> cacheable memory access to the RAM physically located on NVIDIA GPU.
>> This memory is presented to a host via the device tree but remains
>> offline until the NVIDIA driver onlines it.
>>
>> This exports this RAM to the userspace as a new region so
>> the NVIDIA driver in the guest can train these links and online GPU RAM.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  drivers/vfio/pci/Makefile           |   1 +
>>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>>  include/uapi/linux/vfio.h           |   3 +
>>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/Kconfig            |   4 +
>>  6 files changed, 215 insertions(+)
>>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
>>
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 76d8ec0..9662c06 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -1,5 +1,6 @@
>>  
>>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>>  
>>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
>> index 86aab05..7115b9b 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>>  	return -ENODEV;
>>  }
>>  #endif
>> +#ifdef CONFIG_VFIO_PCI_NVLINK2
>> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
>> +#else
>> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +	return -ENODEV;
>> +}
>> +#endif
>>  #endif /* VFIO_PCI_PRIVATE_H */
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 1aa7b82..2fe8227 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>  
>> +/* NVIDIA GPU NV2 */
>> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)
> 
> You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
> subtype.  Each vendor has their own address space of sub-types.


True, I'll update. I just like unique numbers better :)

> 
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>   * which allows direct access to non-MSIX registers which happened to be within
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 7bddf1e..38c9475 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>  		}
>>  	}
>>  
>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>> +	    pdev->device == 0x1db1 &&
>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> 
> Can't we do better than check this based on device ID?  Perhaps PCIe
> capability hints at this?

A normal PCI pluggable device looks like this:

root@fstn3:~# sudo lspci -vs 0000:03:00.0
0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
	Flags: fast devsel, IRQ 497
	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19


This is an NVLink v1 machine:

aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
	Subsystem: NVIDIA Corporation Device 116b
	Flags: bus master, fast devsel, latency 0, IRQ 457
	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384


This is the one the patch is for:

[aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
(rev a1)
	Subsystem: NVIDIA Corporation Device 1212
	Flags: fast devsel, IRQ 82, NUMA node 8
	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Capabilities: [ac0] #23
	Kernel driver in use: vfio-pci


I can only see a new capability #23, and I have no idea what it actually
does - my latest PCIe spec is PCI_Express_Base_r3.1a_December7-2015.pdf
and it only knows capabilities up to #21. Do you have a better spec?
It does not seem promising anyway...


> Is it worthwhile to continue with assigning the device in the !ENABLED
> case?  For instance, maybe it would be better to provide a weak
> definition of vfio_pci_nvlink2_init() that would cause us to fail here
> if we don't have this device specific support enabled.  I realize
> you're following the example set forth for IGD, but those regions are
> optional, for better or worse.


The device is supposed to work even without GPU RAM passed through; in that
case it should look like NVLink v1. (There used to be bugs in the driver -
maybe there still are, I have not checked for a while - but there is a bug
open at NVIDIA about this and they were going to fix it.) This is why I
chose not to fail here.
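
For comparison, the stricter behaviour suggested above would look roughly
like the sketch below - purely illustrative, it assumes the usual
vfio_pci_disable() unwind at this point in vfio_pci_enable() and is not
what this series does:

	/* Sketch only: treat a failed NVLink2 init as fatal, so a kernel
	 * built without CONFIG_VFIO_PCI_NVLINK2 (where the stub returns
	 * -ENODEV) refuses to assign the GPU instead of just warning. */
	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA && pdev->device == 0x1db1) {
		ret = vfio_pci_nvlink2_init(vdev);
		if (ret) {
			dev_warn(&vdev->pdev->dev,
				 "Failed to setup NVIDIA NV2 RAM region\n");
			vfio_pci_disable(vdev);
			return ret;
		}
	}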



>> +		ret = vfio_pci_nvlink2_init(vdev);
>> +		if (ret)
>> +			dev_warn(&vdev->pdev->dev,
>> +				 "Failed to setup NVIDIA NV2 RAM region\n");
>> +	}
>> +
>>  	vfio_pci_probe_mmaps(vdev);
>>  
>>  	return 0;
>> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
>> new file mode 100644
>> index 0000000..451c5cb
>> --- /dev/null
>> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
>> @@ -0,0 +1,190 @@
>> +// SPDX-License-Identifier: GPL-2.0+
>> +/*
>> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
>> + *
>> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
>> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * Register an on-GPU RAM region for cacheable access.
>> + *
>> + * Derived from original vfio_pci_igd.c:
>> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
>> + *	Author: Alex Williamson <alex.williamson@redhat.com>
>> + */
>> +
>> +#include <linux/io.h>
>> +#include <linux/pci.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/vfio.h>
>> +#include <linux/sched/mm.h>
>> +#include <linux/mmu_context.h>
>> +
>> +#include "vfio_pci_private.h"
>> +
>> +struct vfio_pci_nvlink2_data {
>> +	unsigned long gpu_hpa;
>> +	unsigned long useraddr;
>> +	unsigned long size;
>> +	struct mm_struct *mm;
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +};
>> +
>> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
>> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>> +{
>> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>> +	void *base = vdev->region[i].data;
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +
>> +	if (pos >= vdev->region[i].size)
>> +		return -EINVAL;
>> +
>> +	count = min(count, (size_t)(vdev->region[i].size - pos));
>> +
>> +	if (iswrite) {
>> +		if (copy_from_user(base + pos, buf, count))
>> +			return -EFAULT;
>> +	} else {
>> +		if (copy_to_user(buf, base + pos, count))
>> +			return -EFAULT;
>> +	}
>> +	*ppos += count;
>> +
>> +	return count;
>> +}
>> +
>> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
>> +		struct vfio_pci_region *region)
>> +{
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +	long ret;
>> +
>> +	ret = mm_iommu_put(data->mm, data->mem);
>> +	WARN_ON(ret);
>> +
>> +	mmdrop(data->mm);
>> +	kfree(data);
>> +}
>> +
>> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	struct vfio_pci_region *region = vma->vm_private_data;
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +	int ret;
>> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
>> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
>> +	unsigned long vm_pgoff = vma->vm_pgoff &
>> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
>> +
>> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
>> +	/* TODO: make it a tracepoint */
>> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
>> +		 vmf->address, pfn << PAGE_SHIFT, ret);
>> +	if (ret)
>> +		return VM_FAULT_SIGSEGV;
>> +
>> +	return VM_FAULT_NOPAGE;
>> +}
>> +
>> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
>> +	.fault = vfio_pci_nvlink2_mmap_fault,
>> +};
>> +
>> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
>> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
>> +{
>> +	long ret;
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +
>> +	if (data->useraddr)
>> +		return -EPERM;
>> +
>> +	if (vma->vm_end - vma->vm_start > data->size)
>> +		return -EINVAL;
>> +
>> +	vma->vm_private_data = region;
>> +	vma->vm_flags |= VM_PFNMAP;
>> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
>> +
>> +	/*
>> +	 * Calling mm_iommu_newdev() here once as the region is not
>> +	 * registered yet and therefore right initialization will happen now.
>> +	 * Other places will use mm_iommu_find() which returns
>> +	 * registered @mem and does not go gup().
>> +	 */
>> +	data->useraddr = vma->vm_start;
>> +	data->mm = current->mm;
>> +	atomic_inc(&data->mm->mm_count);
>> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
>> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
>> +			data->gpu_hpa, &data->mem);
>> +
>> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
>> +			data->useraddr, data->gpu_hpa,
>> +			vma->vm_end - vma->vm_start, ret);
>> +
>> +	return ret;
>> +}
>> +
>> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
>> +	.rw = vfio_pci_nvlink2_rw,
>> +	.release = vfio_pci_nvlink2_release,
>> +	.mmap = vfio_pci_nvlink2_mmap,
>> +};
>> +
>> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +	int len = 0, ret;
>> +	struct device_node *npu_node, *mem_node;
>> +	struct pci_dev *npu_dev;
>> +	uint32_t *mem_phandle, *val;
>> +	struct vfio_pci_nvlink2_data *data;
>> +
>> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
>> +	if (!npu_dev)
>> +		return -EINVAL;
>> +
>> +	npu_node = pci_device_to_OF_node(npu_dev);
>> +	if (!npu_node)
>> +		return -EINVAL;
>> +
>> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
>> +	if (!mem_phandle)
>> +		return -EINVAL;
>> +
>> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
>> +	if (!mem_node)
>> +		return -EINVAL;
>> +
>> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
>> +	if (!val || len != 2 * sizeof(uint64_t))
>> +		return -EINVAL;
>> +
>> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
>> +	if (!data)
>> +		return -ENOMEM;
>> +
>> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
>> +			be32_to_cpu(val[1]);
>> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
>> +			be32_to_cpu(val[3]);
>> +
>> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
>> +			data->gpu_hpa + data->size - 1);
>> +
>> +	ret = vfio_pci_register_dev_region(vdev,
>> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
>> +			&vfio_pci_nvlink2_regops, data->size,
>> +			VFIO_REGION_INFO_FLAG_READ, data);
>> +	if (ret)
>> +		kfree(data);
>> +
>> +	return ret;
>> +}
>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>> index 24ee260..2725bc8 100644
>> --- a/drivers/vfio/pci/Kconfig
>> +++ b/drivers/vfio/pci/Kconfig
>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>>  config VFIO_PCI_IGD
>>  	depends on VFIO_PCI
>>  	def_bool y if X86
>> +
>> +config VFIO_PCI_NVLINK2
>> +	depends on VFIO_PCI
>> +	def_bool y if PPC_POWERNV
> 
> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> a portable implementation that we could re-use on X86 or ARM or any
> other platform if hardware appeared for it.  Can we improve that as
> well to make this less POWER specific?  Thanks,


As I said in another mail, every P9 chip in that box has some NVLink2 logic
on it, so it is not even common among P9's in general, and I am having a
hard time seeing these V100s used elsewhere in such a way.



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 1/5] vfio/spapr_tce: Simplify page contained test
  2018-06-07  8:44   ` Alexey Kardashevskiy
@ 2018-06-08  3:32     ` David Gibson
  -1 siblings, 0 replies; 108+ messages in thread
From: David Gibson @ 2018-06-08  3:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alex Williamson, Alistair Popple, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2728 bytes --]

On Thu, Jun 07, 2018 at 06:44:16PM +1000, Alexey Kardashevskiy wrote:
> The test function takes a page struct pointer which is not used by
> either of the two callers in any other way; make it simple and just
> pass a physical address there instead.
> 
> This should cause no behavioral change now, but later we may start
> supporting host addresses for memory devices which are not backed
> by page structs.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 759a5bd..2c4a048 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -249,8 +249,9 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
>  	decrement_locked_vm(mm, cb >> PAGE_SHIFT);
>  }
>  
> -static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> +static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
>  {
> +	struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
>  	/*
>  	 * Check that the TCE table granularity is not bigger than the size of
>  	 * a page we just found. Otherwise the hardware can get access to
> @@ -549,7 +550,6 @@ static long tce_iommu_build(struct tce_container *container,
>  		enum dma_data_direction direction)
>  {
>  	long i, ret = 0;
> -	struct page *page;
>  	unsigned long hpa;
>  	enum dma_data_direction dirtmp;
>  
> @@ -560,8 +560,7 @@ static long tce_iommu_build(struct tce_container *container,
>  		if (ret)
>  			break;
>  
> -		page = pfn_to_page(hpa >> PAGE_SHIFT);
> -		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> +		if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
>  			ret = -EPERM;
>  			break;
>  		}
> @@ -595,7 +594,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		enum dma_data_direction direction)
>  {
>  	long i, ret = 0;
> -	struct page *page;
>  	unsigned long hpa;
>  	enum dma_data_direction dirtmp;
>  
> @@ -615,8 +613,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		if (ret)
>  			break;
>  
> -		page = pfn_to_page(hpa >> PAGE_SHIFT);
> -		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> +		if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
>  			ret = -EPERM;
>  			break;
>  		}

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-08  3:09       ` Alexey Kardashevskiy
@ 2018-06-08  3:35         ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  3:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Fri, 8 Jun 2018 13:09:13 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> On 8/6/18 3:04 am, Alex Williamson wrote:
> > On Thu,  7 Jun 2018 18:44:20 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 7bddf1e..38c9475 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>  		}
> >>  	}
> >>  
> >> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >> +	    pdev->device == 0x1db1 &&
> >> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {  
> > 
> > Can't we do better than check this based on device ID?  Perhaps PCIe
> > capability hints at this?  
> 
> A normal PCI pluggable device looks like this:
> 
> root@fstn3:~# sudo lspci -vs 0000:03:00.0
> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> 	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> 	Flags: fast devsel, IRQ 497
> 	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> 	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
> 	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 
> 
> This is an NVLink v1 machine:
> 
> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> 	Subsystem: NVIDIA Corporation Device 116b
> 	Flags: bus master, fast devsel, latency 0, IRQ 457
> 	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
> 	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
> 	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [250] Latency Tolerance Reporting
> 	Capabilities: [258] L1 PM Substates
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 	Kernel driver in use: nvidia
> 	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> 
> 
> This is the one the patch is for:
> 
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
> 	Subsystem: NVIDIA Corporation Device 1212
> 	Flags: fast devsel, IRQ 82, NUMA node 8
> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [250] Latency Tolerance Reporting
> 	Capabilities: [258] L1 PM Substates
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 	Capabilities: [ac0] #23
> 	Kernel driver in use: vfio-pci
> 
> 
> I can only see a new capability #23, and I have no idea what it actually
> does - my latest PCIe spec is PCI_Express_Base_r3.1a_December7-2015.pdf
> and it only knows capabilities up to #21. Do you have a better spec?
> It does not seem promising anyway...

You could just look in include/uapi/linux/pci_regs.h and see that 23
(0x17) is a TPH Requester capability and google for that...  It's a TLP
processing hint related to cache processing for requests from system
specific interconnects.  Sounds rather promising.  Of course there's
also the vendor specific capability that might be probed if NVIDIA will
tell you what to look for and the init function you've implemented
looks for specific devicetree nodes, that I imagine you could test for
in a probe as well.
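
Concretely, such a probe might look like the sketch below
(vfio_pci_is_nvlink2_gpu() is a hypothetical helper; whether the TPH
capability is a reliable hint is exactly the open question, and the
devicetree checks just mirror what vfio_pci_nvlink2_init() already does):

/* Hypothetical helper: identify an NVLink2-attached GPU from its PCIe
 * capabilities and devicetree linkage rather than by PCI device ID. */
static bool vfio_pci_is_nvlink2_gpu(struct pci_dev *pdev)
{
	struct pci_dev *npu_dev;
	struct device_node *npu_node;

	if (pdev->vendor != PCI_VENDOR_ID_NVIDIA)
		return false;

	/* The "#23" capability at [ac0] above is TPH Requester (0x17) */
	if (!pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_TPH))
		return false;

	/* Same platform hooks the nvlink2 init code relies on */
	npu_dev = pnv_pci_get_npu_dev(pdev, 0);
	if (!npu_dev)
		return false;

	npu_node = pci_device_to_OF_node(npu_dev);
	return npu_node && of_get_property(npu_node, "memory-region", NULL);
}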

> > Is it worthwhile to continue with assigning the device in the !ENABLED
> > case?  For instance, maybe it would be better to provide a weak
> > definition of vfio_pci_nvlink2_init() that would cause us to fail here
> > if we don't have this device specific support enabled.  I realize
> > you're following the example set forth for IGD, but those regions are
> > optional, for better or worse.  
> 
> 
> The device is supposed to work even without GPU RAM passed through; in that
> case it should look like NVLink v1. (There used to be bugs in the driver -
> maybe there still are, I have not checked for a while - but there is a bug
> open at NVIDIA about this and they were going to fix it.) This is why I
> chose not to fail here.

Ok.

> >> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> >> index 24ee260..2725bc8 100644
> >> --- a/drivers/vfio/pci/Kconfig
> >> +++ b/drivers/vfio/pci/Kconfig
> >> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
> >>  config VFIO_PCI_IGD
> >>  	depends on VFIO_PCI
> >>  	def_bool y if X86
> >> +
> >> +config VFIO_PCI_NVLINK2
> >> +	depends on VFIO_PCI
> >> +	def_bool y if PPC_POWERNV  
> > 
> > As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> > a portable implementation that we could re-use on X86 or ARM or any
> > other platform if hardware appeared for it.  Can we improve that as
> > well to make this less POWER specific?  Thanks,  
> 
> 
> As I said in another mail, every P9 chip in that box has some NVLink2 logic
> on it, so it is not even common among P9's in general, and I am having a
> hard time seeing these V100s used elsewhere in such a way.

https://www.redhat.com/archives/vfio-users/2018-May/msg00000.html

Not much platform info, but based on the rpm mentioned, looks like an
x86_64 box.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  3:08         ` Alexey Kardashevskiy
  (?)
@ 2018-06-08  3:44           ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  3:44 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Fri, 8 Jun 2018 13:08:54 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 8/6/18 8:15 am, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >   
> >> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> >>>
> >>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>> connected devices makes sense?  AIUI we have a PCI view of these
> >>> devices and from that perspective they're isolated.  That's the view of
> >>> the device used to generate the grouping.  However, not visible to us,
> >>> these devices are interconnected via NVLink.  What isolation properties
> >>> does NVLink provide given that its entire purpose for existing seems to
> >>> be to provide a high performance link for p2p between devices?    
> >>
> >> Not entire. On POWER chips, we also have an nvlink between the device
> >> and the CPU which is running significantly faster than PCIe.
> >>
> >> But yes, there are cross-links and those should probably be accounted
> >> for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)  
> 
> These are physical GPUs, not virtual sriov-alike things they are
> implementing as well elsewhere.

vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
either.  That's why we have mdev devices now to implement software
defined devices.  I don't have first hand experience with V-series, but
I would absolutely expect a PCIe-based Tesla V100 to support vGPU.
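
To make the mdev reference concrete: a vendor driver registers its physical
device as an mdev parent with a set of callbacks, and mdev then exposes
software-defined instances to userspace via vfio-mdev. The sketch below is
only the registration skeleton against the mdev interface as it existed
around this time (mdev_parent_ops / mdev_register_device); the names and
empty callbacks are placeholders, not anything NVIDIA actually ships.

#include <linux/module.h>
#include <linux/mdev.h>

static int my_vgpu_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* allocate per-instance state, carve out a slice of the GPU */
	return 0;
}

static int my_vgpu_remove(struct mdev_device *mdev)
{
	/* tear down the instance */
	return 0;
}

static struct attribute_group *my_vgpu_type_groups[] = {
	NULL,	/* one attribute group per supported vGPU type goes here */
};

static const struct mdev_parent_ops my_vgpu_parent_ops = {
	.owner			= THIS_MODULE,
	.supported_type_groups	= my_vgpu_type_groups,
	.create			= my_vgpu_create,
	.remove			= my_vgpu_remove,
};

/* from the physical GPU driver's probe():
 *	mdev_register_device(&pdev->dev, &my_vgpu_parent_ops);
 */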

> My current understanding is that every P9 chip in that box has some NVLink2
> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> as well.
> 
> From small bits of information I have it seems that a GPU can perfectly
> work alone and if the NVIDIA driver does not see these interconnects
> (because we do not pass the rest of the big 3xGPU group to this guest), it
> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> which simply refuses to work until all 3 GPUs are passed so there is some
> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> 
> So we will either have 6 groups (one per GPU) or 2 groups (one per
> interconnected group).

I'm not gaining much confidence that we can rely on isolation between
NVLink connected GPUs, it sounds like you're simply expecting that
proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
is going to play nice and nobody will figure out how to do bad things
because... obfuscation?  Thanks,

Alex
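
Whichever way the grouping question above is resolved, the result is easy to
verify from userspace, since every PCI device's IOMMU group shows up as a
sysfs symlink. A minimal sketch (standard sysfs layout, nothing
NVLink-specific assumed):

#include <stdio.h>
#include <unistd.h>
#include <libgen.h>

/* Print the IOMMU group of a PCI device, e.g. "./a.out 0035:03:00.0".
 * /sys/bus/pci/devices/<addr>/iommu_group is a symlink to
 * /sys/kernel/iommu_groups/<id>.
 */
int main(int argc, char **argv)
{
	char path[256], target[256];
	ssize_t len;

	if (argc < 2)
		return 1;
	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu_group",
		 argv[1]);
	len = readlink(path, target, sizeof(target) - 1);
	if (len < 0) {
		perror(path);
		return 1;
	}
	target[len] = '\0';
	printf("%s -> group %s\n", argv[1], basename(target));
	return 0;
}

Running that over the GPUs and the IBM 04ea bridges before and after a
grouping change would show directly whether they land in six per-GPU groups
or two interconnect-wide ones.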

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-08  3:35         ` Alex Williamson
@ 2018-06-08  3:52           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  3:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On 8/6/18 1:35 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:09:13 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>> On 8/6/18 3:04 am, Alex Williamson wrote:
>>> On Thu,  7 Jun 2018 18:44:20 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>>> index 7bddf1e..38c9475 100644
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>>>  		}
>>>>  	}
>>>>  
>>>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>>>> +	    pdev->device == 0x1db1 &&
>>>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {  
>>>
>>> Can't we do better than check this based on device ID?  Perhaps PCIe
>>> capability hints at this?  
>>
>> A normal PCI pluggable device looks like this:
>>
>> root@fstn3:~# sudo lspci -vs 0000:03:00.0
>> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>> 	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
>> 	Flags: fast devsel, IRQ 497
>> 	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
>> 	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
>> 	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
>> 	Capabilities: [60] Power Management version 3
>> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>> 	Capabilities: [78] Express Endpoint, MSI 00
>> 	Capabilities: [100] Virtual Channel
>> 	Capabilities: [128] Power Budgeting <?>
>> 	Capabilities: [420] Advanced Error Reporting
>> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>> 	Capabilities: [900] #19
>>
>>
>> This is a NVLink v1 machine:
>>
>> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
>> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
>> 	Subsystem: NVIDIA Corporation Device 116b
>> 	Flags: bus master, fast devsel, latency 0, IRQ 457
>> 	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
>> 	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
>> 	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
>> 	Capabilities: [60] Power Management version 3
>> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>> 	Capabilities: [78] Express Endpoint, MSI 00
>> 	Capabilities: [100] Virtual Channel
>> 	Capabilities: [250] Latency Tolerance Reporting
>> 	Capabilities: [258] L1 PM Substates
>> 	Capabilities: [128] Power Budgeting <?>
>> 	Capabilities: [420] Advanced Error Reporting
>> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>> 	Capabilities: [900] #19
>> 	Kernel driver in use: nvidia
>> 	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
>>
>>
>> This is the one the patch is for:
>>
>> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
>> (rev a1)
>> 	Subsystem: NVIDIA Corporation Device 1212
>> 	Flags: fast devsel, IRQ 82, NUMA node 8
>> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
>> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
>> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
>> 	Capabilities: [60] Power Management version 3
>> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>> 	Capabilities: [78] Express Endpoint, MSI 00
>> 	Capabilities: [100] Virtual Channel
>> 	Capabilities: [250] Latency Tolerance Reporting
>> 	Capabilities: [258] L1 PM Substates
>> 	Capabilities: [128] Power Budgeting <?>
>> 	Capabilities: [420] Advanced Error Reporting
>> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>> 	Capabilities: [900] #19
>> 	Capabilities: [ac0] #23
>> 	Kernel driver in use: vfio-pci
>>
>>
>> I can only see a new capability #23 which I have no idea about what it
>> actually does - my latest PCIe spec is
>> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
>> till #21, do you have any better spec? Does not seem promising anyway...
> 
> You could just look in include/uapi/linux/pci_regs.h and see that 23
> (0x17) is a TPH Requester capability and google for that...  It's a TLP
> processing hint related to cache processing for requests from system
> specific interconnects.  Sounds rather promising.  Of course there's
> also the vendor specific capability that might be probed if NVIDIA will
> tell you what to look for and the init function you've implemented
> looks for specific devicetree nodes, that I imagine you could test for
> in a probe as well.


This 23 is in hex:

[aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
(rev a1)
	Subsystem: NVIDIA Corporation Device 1212
	Flags: fast devsel, IRQ 82, NUMA node 8
	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Capabilities: [ac0] #23
	Kernel driver in use: vfio-pci

[aik@yc02goos ~]$ sudo lspci -vvvxxxxs 0035:03:00.0 | grep ac0
	Capabilities: [ac0 v1] #23
ac0: 23 00 01 00 de 10 c1 00 01 00 10 00 00 00 00 00


Talking to NVIDIA is always an option :)
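
In case it helps, the three dwords dumped at 0xac0 above decode even without
NVIDIA: the first dword is the standard extended capability header, and the
next two turn out to match the Designated Vendor-Specific (DVSEC) layout
that comes up in the reply below. A sketch of reading them back (the helper
name is made up, "pos" standing for the 0xac0 offset):

#include <linux/pci.h>

/* 0xac0: 23 00 01 00 -> 0x00010023: cap ID 0x23, version 1, next ptr 0x000
 * 0xac4: de 10 c1 00 -> 0x00c110de: DVSEC vendor 0x10de (NVIDIA) in bits 15:0
 * 0xac8: 01 00 10 00 -> 0x00100001: DVSEC ID 0x0001 in bits 15:0
 */
static void dump_nvidia_dvsec(struct pci_dev *pdev, int pos)
{
	u32 hdr1, hdr2;

	pci_read_config_dword(pdev, pos + 0x4, &hdr1);
	pci_read_config_dword(pdev, pos + 0x8, &hdr2);
	dev_info(&pdev->dev, "DVSEC vendor %04x id %04x\n",
		 hdr1 & 0xffff, hdr2 & 0xffff);
}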


> 
>>> Is it worthwhile to continue with assigning the device in the !ENABLED
>>> case?  For instance, maybe it would be better to provide a weak
>>> definition of vfio_pci_nvlink2_init() that would cause us to fail here
>>> if we don't have this device specific support enabled.  I realize
>>> you're following the example set forth for IGD, but those regions are
>>> optional, for better or worse.  
>>
>>
>> The device is supposed to work even without GPU RAM passed through; in that
>> case it should look like NVLink v1 (there used to be bugs in the driver,
>> maybe there still are - I have not checked for a while - but there is a bug
>> open at NVIDIA about this and they were going to fix it). This is why I
>> chose not to fail here.
> 
> Ok.
> 
>>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>>>> index 24ee260..2725bc8 100644
>>>> --- a/drivers/vfio/pci/Kconfig
>>>> +++ b/drivers/vfio/pci/Kconfig
>>>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>>>>  config VFIO_PCI_IGD
>>>>  	depends on VFIO_PCI
>>>>  	def_bool y if X86
>>>> +
>>>> +config VFIO_PCI_NVLINK2
>>>> +	depends on VFIO_PCI
>>>> +	def_bool y if PPC_POWERNV  
>>>
>>> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
>>> a portable implementation that we could re-use on X86 or ARM or any
>>> other platform if hardware appeared for it.  Can we improve that as
>>> well to make this less POWER specific?  Thanks,  
>>
>>
>> As I said in another mail, every P9 chip in that box has some NVLink2 logic
>> on it, so it is not even common among P9s in general, and I am having a hard
>> time seeing these V100s used elsewhere in such a way.
> 
> https://www.redhat.com/archives/vfio-users/2018-May/msg00000.html
> 
> Not much platform info, but based on the rpm mentioned, looks like an
> x86_64 box.  Thanks,

Wow. Interesting. Thanks for the pointer. No advertising material actually
says that it is P9-only or even mentions P9, and the wiki does not say it is
P9-only either. Hmmm...



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  3:44           ` Alex Williamson
@ 2018-06-08  4:14             ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  4:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On 8/6/18 1:44 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:08:54 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 8/6/18 8:15 am, Alex Williamson wrote:
>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>   
>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
>>>>>
>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>> devices and from that perspective they're isolated.  That's the view of
>>>>> the device used to generate the grouping.  However, not visible to us,
>>>>> these devices are interconnected via NVLink.  What isolation properties
>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>> be to provide a high performance link for p2p between devices?    
>>>>
>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>> and the CPU which is running significantly faster than PCIe.
>>>>
>>>> But yes, there are cross-links and those should probably be accounted
>>>> for in the grouping.  
>>>
>>> Then after we fix the grouping, can we just let the host driver manage
>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>> convince NVIDIA to support more than a single vGPU per VM though)  
>>
>> These are physical GPUs, not the virtual SR-IOV-like things they are also
>> implementing elsewhere.
> 
> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> either.  That's why we have mdev devices now to implement software
> defined devices.  I don't have first hand experience with V-series, but
> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.

So assuming V100 can do vGPU, you are suggesting ditching this patchset and
using mediated vGPUs instead, correct?


>> My current understanding is that every P9 chip in that box has some NVLink2
>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>> as well.
>>
>> From small bits of information I have it seems that a GPU can perfectly
>> work alone and if the NVIDIA driver does not see these interconnects
>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>> which simply refuses to work until all 3 GPUs are passed so there is some
>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>
>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>> interconnected group).
> 
> I'm not gaining much confidence that we can rely on isolation between
> NVLink connected GPUs, it sounds like you're simply expecting that
> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> is going to play nice and nobody will figure out how to do bad things
> because... obfuscation?  Thanks,

Well, we already trust that the proprietary firmware of an SR-IOV-capable
adapter like Mellanox ConnectX is not doing bad things; how is this
different in principle?


ps. their obfuscation is funny indeed :)
-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-08  3:52           ` Alexey Kardashevskiy
@ 2018-06-08  4:34             ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  4:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Fri, 8 Jun 2018 13:52:05 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 8/6/18 1:35 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:09:13 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >> On 8/6/18 3:04 am, Alex Williamson wrote:  
> >>> On Thu,  7 Jun 2018 18:44:20 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >>>> index 7bddf1e..38c9475 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci.c
> >>>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>>>  		}
> >>>>  	}
> >>>>  
> >>>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >>>> +	    pdev->device == 0x1db1 &&
> >>>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {    
> >>>
> >>> Can't we do better than check this based on device ID?  Perhaps PCIe
> >>> capability hints at this?    
> >>
> >> A normal PCI pluggable device looks like this:
> >>
> >> root@fstn3:~# sudo lspci -vs 0000:03:00.0
> >> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> >> 	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> >> 	Flags: fast devsel, IRQ 497
> >> 	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> 	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> 	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >>
> >>
> >> This is a NVLink v1 machine:
> >>
> >> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> >> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> >> 	Subsystem: NVIDIA Corporation Device 116b
> >> 	Flags: bus master, fast devsel, latency 0, IRQ 457
> >> 	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
> >> 	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
> >> 	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [250] Latency Tolerance Reporting
> >> 	Capabilities: [258] L1 PM Substates
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >> 	Kernel driver in use: nvidia
> >> 	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> >>
> >>
> >> This is the one the patch is for:
> >>
> >> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> >> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >> (rev a1)
> >> 	Subsystem: NVIDIA Corporation Device 1212
> >> 	Flags: fast devsel, IRQ 82, NUMA node 8
> >> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [250] Latency Tolerance Reporting
> >> 	Capabilities: [258] L1 PM Substates
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >> 	Capabilities: [ac0] #23
> >> 	Kernel driver in use: vfio-pci
> >>
> >>
> >> I can only see a new capability #23 which I have no idea about what it
> >> actually does - my latest PCIe spec is
> >> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> >> till #21, do you have any better spec? Does not seem promising anyway...  
> > 
> > You could just look in include/uapi/linux/pci_regs.h and see that 23
> > (0x17) is a TPH Requester capability and google for that...  It's a TLP
> > processing hint related to cache processing for requests from system
> > specific interconnects.  Sounds rather promising.  Of course there's
> > also the vendor specific capability that might be probed if NVIDIA will
> > tell you what to look for and the init function you've implemented
> > looks for specific devicetree nodes, that I imagine you could test for
> > in a probe as well.  
> 
> 
> This 23 is in hex:
> 
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
> 	Subsystem: NVIDIA Corporation Device 1212
> 	Flags: fast devsel, IRQ 82, NUMA node 8
> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [250] Latency Tolerance Reporting
> 	Capabilities: [258] L1 PM Substates
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 	Capabilities: [ac0] #23
> 	Kernel driver in use: vfio-pci
> 
> [aik@yc02goos ~]$ sudo lspci -vvvxxxxs 0035:03:00.0 | grep ac0
> 	Capabilities: [ac0 v1] #23
> ac0: 23 00 01 00 de 10 c1 00 01 00 10 00 00 00 00 00

Oops, I was thinking lspci printed unknown capability IDs in decimal.
Strange, it's a shared, vendor-specific capability:

https://pcisig.com/sites/default/files/specification_documents/ECN_DVSEC-2015-08-04-clean_0.pdf

We see in your dump that the vendor of this capability is 0x10de
(NVIDIA) and the ID of the capability is 0x0001.  Note that NVIDIA
sponsored this ECN.

> Talking to NVIDIA is always an option :)

There's really no other choice if we want to figure out how to decode these
vendor-specific capabilities; this 0x23 capability at least seems to be
meant for sharing.
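
Put together, a probe along these lines could replace the bare 0x1db1 device
ID check from the patch, keyed on the NVIDIA DVSEC instead. This is only a
sketch: the 0x23 capability constant may not exist in older pci_regs.h
(hence the local define), and whether DVSEC ID 0x0001 really identifies an
NVLink2-capable device is exactly the question that needs NVIDIA's input.

#define PCI_EXT_CAP_ID_DVSEC_GUESS	0x23	/* not in older pci_regs.h */

static bool vfio_pci_looks_like_nvlink2(struct pci_dev *pdev)
{
	u32 hdr1, hdr2;
	int pos = 0;

	while ((pos = pci_find_next_ext_capability(pdev, pos,
					PCI_EXT_CAP_ID_DVSEC_GUESS))) {
		pci_read_config_dword(pdev, pos + 0x4, &hdr1);
		pci_read_config_dword(pdev, pos + 0x8, &hdr2);
		/* vendor 0x10de, DVSEC ID 0x0001, as seen in the dump above */
		if ((hdr1 & 0xffff) == PCI_VENDOR_ID_NVIDIA &&
		    (hdr2 & 0xffff) == 0x0001)
			return true;
	}
	return false;
}

Even then it would only say "this GPU has the NVLink2 DVSEC", not that the
links are actually wired up on this particular platform, which is presumably
where the devicetree nodes mentioned earlier still come in.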

> >>> Is it worthwhile to continue with assigning the device in the !ENABLED
> >>> case?  For instance, maybe it would be better to provide a weak
> >>> definition of vfio_pci_nvlink2_init() that would cause us to fail here
> >>> if we don't have this device specific support enabled.  I realize
> >>> you're following the example set forth for IGD, but those regions are
> >>> optional, for better or worse.    
> >>
> >>
> >> The device is supposed to work even without GPU RAM passed through; in that
> >> case it should look like NVLink v1 (there used to be bugs in the driver,
> >> maybe there still are - I have not checked for a while - but there is a bug
> >> open at NVIDIA about this and they were going to fix it). This is why I
> >> chose not to fail here.
> > 
> > Ok.
> >   
> >>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> >>>> index 24ee260..2725bc8 100644
> >>>> --- a/drivers/vfio/pci/Kconfig
> >>>> +++ b/drivers/vfio/pci/Kconfig
> >>>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
> >>>>  config VFIO_PCI_IGD
> >>>>  	depends on VFIO_PCI
> >>>>  	def_bool y if X86
> >>>> +
> >>>> +config VFIO_PCI_NVLINK2
> >>>> +	depends on VFIO_PCI
> >>>> +	def_bool y if PPC_POWERNV    
> >>>
> >>> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> >>> a portable implementation that we could re-use on X86 or ARM or any
> >>> other platform if hardware appeared for it.  Can we improve that as
> >>> well to make this less POWER specific?  Thanks,    
> >>
> >>
> >> As I said in another mail, every P9 chip in that box has some NVLink2 logic
> >> on it, so it is not even common among P9s in general, and I am having a hard
> >> time seeing these V100s used elsewhere in such a way.
> > 
> > https://www.redhat.com/archives/vfio-users/2018-May/msg00000.html
> > 
> > Not much platform info, but based on the rpm mentioned, looks like an
> > x86_64 box.  Thanks,  
> 
> Wow. Interesting. Thanks for the pointer. No advertising material actually
> says that it is P9-only or even mentions P9, and the wiki does not say it is
> P9-only either. Hmmm...

NVIDIA's own DGX systems are Xeon-based and seem to include NVLink.
The DGX-1 definitely makes use of the SXM2 modules, up to 8 of them.
The DGX Station might be the 4x V100 SXM2 box mentioned in the link.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-08  4:34             ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  4:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On Fri, 8 Jun 2018 13:52:05 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 8/6/18 1:35 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:09:13 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >> On 8/6/18 3:04 am, Alex Williamson wrote:  
> >>> On Thu,  7 Jun 2018 18:44:20 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >>>> index 7bddf1e..38c9475 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci.c
> >>>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>>>  		}
> >>>>  	}
> >>>>  
> >>>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >>>> +	    pdev->device == 0x1db1 &&
> >>>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {    
> >>>
> >>> Can't we do better than check this based on device ID?  Perhaps PCIe
> >>> capability hints at this?    
> >>
> >> A normal PCI pluggable device looks like this:
> >>
> >> root@fstn3:~# sudo lspci -vs 0000:03:00.0
> >> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> >> 	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> >> 	Flags: fast devsel, IRQ 497
> >> 	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> 	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> 	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >>
> >>
> >> This is a NVLink v1 machine:
> >>
> >> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> >> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> >> 	Subsystem: NVIDIA Corporation Device 116b
> >> 	Flags: bus master, fast devsel, latency 0, IRQ 457
> >> 	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
> >> 	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
> >> 	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [250] Latency Tolerance Reporting
> >> 	Capabilities: [258] L1 PM Substates
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >> 	Kernel driver in use: nvidia
> >> 	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> >>
> >>
> >> This is the one the patch is for:
> >>
> >> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> >> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >> (rev a1)
> >> 	Subsystem: NVIDIA Corporation Device 1212
> >> 	Flags: fast devsel, IRQ 82, NUMA node 8
> >> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [250] Latency Tolerance Reporting
> >> 	Capabilities: [258] L1 PM Substates
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >> 	Capabilities: [ac0] #23
> >> 	Kernel driver in use: vfio-pci
> >>
> >>
> >> I can only see a new capability #23 which I have no idea about what it
> >> actually does - my latest PCIe spec is
> >> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> >> till #21, do you have any better spec? Does not seem promising anyway...  
> > 
> > You could just look in include/uapi/linux/pci_regs.h and see that 23
> > (0x17) is a TPH Requester capability and google for that...  It's a TLP
> > processing hint related to cache processing for requests from system
> > specific interconnects.  Sounds rather promising.  Of course there's
> > also the vendor specific capability that might be probed if NVIDIA will
> > tell you what to look for and the init function you've implemented
> > looks for specific devicetree nodes, that I imagine you could test for
> > in a probe as well.  
> 
> 
> This 23 is in hex:
> 
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
> 	Subsystem: NVIDIA Corporation Device 1212
> 	Flags: fast devsel, IRQ 82, NUMA node 8
> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [250] Latency Tolerance Reporting
> 	Capabilities: [258] L1 PM Substates
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 	Capabilities: [ac0] #23
> 	Kernel driver in use: vfio-pci
> 
> [aik@yc02goos ~]$ sudo lspci -vvvxxxxs 0035:03:00.0 | grep ac0
> 	Capabilities: [ac0 v1] #23
> ac0: 23 00 01 00 de 10 c1 00 01 00 10 00 00 00 00 00

Oops, I was thinking lspci printed unknown capability IDs in decimal.
Strange, it's a shared, vendor-specific capability:

https://pcisig.com/sites/default/files/specification_documents/ECN_DVSEC-2015-08-04-clean_0.pdf

We see in your dump that the vendor of this capability is 0x10de
(NVIDIA) and the ID of the capability is 0x0001.  Note that NVIDIA
sponsored this ECN.

> Talking to NVIDIA is always an option :)

There's really no other choice if we want to figure out how to decode
these vendor-specific capabilities; this 0x23 capability at least seems
to be meant for sharing.
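
Just to illustrate, a probe for that DVSEC instead of a device ID match
could look roughly like this (a minimal, untested sketch; the 0x23
capability ID and the 0x10de vendor ID are from your dump and the ECN
above, the helper name is made up):

#include <linux/pci.h>

#define PCI_EXT_CAP_ID_DVSEC_NV 0x23    /* DVSEC, per the ECN above */

/* Walk the extended capability list looking for an NVIDIA DVSEC */
static bool vfio_pci_has_nvidia_dvsec(struct pci_dev *pdev)
{
        u16 pos = 0;
        u32 hdr;

        while ((pos = pci_find_next_ext_capability(pdev, pos,
                                                   PCI_EXT_CAP_ID_DVSEC_NV))) {
                /* DVSEC header 1 at +4: bits 15:0 are the vendor ID */
                pci_read_config_dword(pdev, pos + 0x4, &hdr);
                if ((hdr & 0xffff) == PCI_VENDOR_ID_NVIDIA)
                        return true;
        }

        return false;
}

That only tells us "an NVIDIA DVSEC is present" though; what the DVSEC
ID 0x0001 in header 2 actually means is still NVIDIA's to document.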

> >>> Is it worthwhile to continue with assigning the device in the !ENABLED
> >>> case?  For instance, maybe it would be better to provide a weak
> >>> definition of vfio_pci_nvlink2_init() that would cause us to fail here
> >>> if we don't have this device specific support enabled.  I realize
> >>> you're following the example set forth for IGD, but those regions are
> >>> optional, for better or worse.    
> >>
> >>
> >> The device is supposed to work even without GPU RAM passed through, this
> >> should look like NVLink v1 in this case (there used to be bugs in the
> >> driver, may be still are, have not checked for a while but there is a bug
> >> opened at NVIDIA about this and they were going to fix that), this is why I
> >> chose not to fail here.  
> > 
> > Ok.
> >   
> >>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> >>>> index 24ee260..2725bc8 100644
> >>>> --- a/drivers/vfio/pci/Kconfig
> >>>> +++ b/drivers/vfio/pci/Kconfig
> >>>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
> >>>>  config VFIO_PCI_IGD
> >>>>  	depends on VFIO_PCI
> >>>>  	def_bool y if X86
> >>>> +
> >>>> +config VFIO_PCI_NVLINK2
> >>>> +	depends on VFIO_PCI
> >>>> +	def_bool y if PPC_POWERNV    
> >>>
> >>> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> >>> a portable implementation that we could re-use on X86 or ARM or any
> >>> other platform if hardware appeared for it.  Can we improve that as
> >>> well to make this less POWER specific?  Thanks,    
> >>
> >>
> >> As I said in another mail, every P9 chip in that box has some NVLink2 logic
> >> on it so it is not even common among P9's in general and I am having hard
> >> time seeing these V100s used elsewhere in such way.  
> > 
> > https://www.redhat.com/archives/vfio-users/2018-May/msg00000.html
> > 
> > Not much platform info, but based on the rpm mentioned, looks like an
> > x86_64 box.  Thanks,  
> 
> Wow. Interesting. Thanks for the pointer. No advertising material actually
> says that it is P9-only or even mentions P9, and the wiki does not say it is
> P9-only either. Hmmm...

NVIDIA's own DGX systems are Xeon-based and seem to include NVLink.
The DGX-1 definitely makes use of the SXM2 modules, up to 8 of them.
The DGX Station might be the 4x V100 SXM2 box mentioned in the link.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  4:14             ` Alexey Kardashevskiy
  (?)
@ 2018-06-08  5:03               ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  5:03 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Fri, 8 Jun 2018 14:14:23 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 8/6/18 1:44 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:08:54 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 8/6/18 8:15 am, Alex Williamson wrote:  
> >>> On Fri, 08 Jun 2018 07:54:02 +1000
> >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >>>     
> >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:    
> >>>>>
> >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> >>>>> devices and from that perspective they're isolated.  That's the view of
> >>>>> the device used to generate the grouping.  However, not visible to us,
> >>>>> these devices are interconnected via NVLink.  What isolation properties
> >>>>> does NVLink provide given that its entire purpose for existing seems to
> >>>>> be to provide a high performance link for p2p between devices?      
> >>>>
> >>>> Not entire. On POWER chips, we also have an nvlink between the device
> >>>> and the CPU which is running significantly faster than PCIe.
> >>>>
> >>>> But yes, there are cross-links and those should probably be accounted
> >>>> for in the grouping.    
> >>>
> >>> Then after we fix the grouping, can we just let the host driver manage
> >>> this coherent memory range and expose vGPUs to guests?  The use case of
> >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> >>> convince NVIDIA to support more than a single vGPU per VM though)    
> >>
> >> These are physical GPUs, not virtual sriov-alike things they are
> >> implementing as well elsewhere.  
> > 
> > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > either.  That's why we have mdev devices now to implement software
> > defined devices.  I don't have first hand experience with V-series, but
> > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.  
> 
> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> using mediated vGPUs instead, correct?

If it turns out that our PCIe-only-based IOMMU grouping doesn't
account for lack of isolation on the NVLink side and we correct that,
limiting assignment to sets of 3 interconnected GPUs, is that still a
useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
whether they choose to support vGPU on these GPUs or whether they can
be convinced to support multiple vGPUs per VM.

> >> My current understanding is that every P9 chip in that box has some NVLink2
> >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> >> as well.
> >>
> >> From small bits of information I have it seems that a GPU can perfectly
> >> work alone and if the NVIDIA driver does not see these interconnects
> >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> >> which simply refuses to work until all 3 GPUs are passed so there is some
> >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> >>
> >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> >> interconnected group).  
> > 
> > I'm not gaining much confidence that we can rely on isolation between
> > NVLink connected GPUs, it sounds like you're simply expecting that
> > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > is going to play nice and nobody will figure out how to do bad things
> > because... obfuscation?  Thanks,  
> 
> Well, we already believe that the proprietary firmware of an SR-IOV-capable
> adapter like Mellanox ConnectX is not doing bad things; how is this
> different in principle?

It seems like the scope and hierarchy are different.  Here we're
talking about exposing big discrete devices, which are peers of one
another (and have a history of being reverse engineered), to userspace
drivers.  Once handed to userspace, each of those devices needs to be
considered untrusted.  In the case of SR-IOV, we typically have a
trusted host driver for the PF managing untrusted VFs.  We do rely on
some sanity in the hardware/firmware in isolating the VFs from each
other and from the PF, but we also often have source code for Linux
drivers for these devices and sometimes even datasheets.  Here we have
neither of those and perhaps we won't know the extent of the lack of
isolation between these devices until nouveau (best case) or some
exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
of isolation between devices unless the hardware provides some
indication that isolation exists, for example ACS on PCIe.  If NVIDIA
wants to expose isolation on NVLink, perhaps they need to document
enough of it that the host kernel can manipulate and test for isolation,
perhaps even enabling virtualization of the NVLink interconnect
interface such that the host can prevent GPUs from interfering with
each other.  Thanks,

Alex
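
PS: the kind of indication I mean is what the PCI core already consults
when it builds IOMMU groups, roughly along these lines (a simplified
sketch of the existing ACS check, not something new to add):

#include <linux/pci.h>

/* The ACS bits the IOMMU layer requires for p2p isolation */
#define REQ_ACS_FLAGS   (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CC | PCI_ACS_UF)

/*
 * Endpoints only land in separate IOMMU groups if ACS is enabled on
 * every bridge between each device and the root; otherwise p2p could
 * bypass the IOMMU and the devices get grouped together.
 */
static bool pcie_p2p_isolated(struct pci_dev *a, struct pci_dev *b)
{
        return pci_acs_path_enabled(a, NULL, REQ_ACS_FLAGS) &&
               pci_acs_path_enabled(b, NULL, REQ_ACS_FLAGS);
}

There is nothing equivalent the kernel can test on the NVLink side
today, which is exactly the gap.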

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  5:03               ` Alex Williamson
  (?)
@ 2018-07-10  4:10                 ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-10  4:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu, 7 Jun 2018 23:03:23 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 8 Jun 2018 14:14:23 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On 8/6/18 1:44 pm, Alex Williamson wrote:  
> > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >     
> > >> On 8/6/18 8:15 am, Alex Williamson wrote:    
> > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > >>>       
> > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:      
> > >>>>>
> > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > >>>>> devices and from that perspective they're isolated.  That's the view of
> > >>>>> the device used to generate the grouping.  However, not visible to us,
> > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > >>>>> be to provide a high performance link for p2p between devices?        
> > >>>>
> > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > >>>> and the CPU which is running significantly faster than PCIe.
> > >>>>
> > >>>> But yes, there are cross-links and those should probably be accounted
> > >>>> for in the grouping.      
> > >>>
> > >>> Then after we fix the grouping, can we just let the host driver manage
> > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > >>> convince NVIDIA to support more than a single vGPU per VM though)      
> > >>
> > >> These are physical GPUs, not virtual sriov-alike things they are
> > >> implementing as well elsewhere.    
> > > 
> > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > either.  That's why we have mdev devices now to implement software
> > > defined devices.  I don't have first hand experience with V-series, but
> > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.    
> > 
> > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > using mediated vGPUs instead, correct?  
> 
> If it turns out that our PCIe-only-based IOMMU grouping doesn't
> account for lack of isolation on the NVLink side and we correct that,
> limiting assignment to sets of 3 interconnected GPUs, is that still a
> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> whether they choose to support vGPU on these GPUs or whether they can
> be convinced to support multiple vGPUs per VM.
> 
> > >> My current understanding is that every P9 chip in that box has some NVLink2
> > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > >> as well.
> > >>
> > >> From small bits of information I have it seems that a GPU can perfectly
> > >> work alone and if the NVIDIA driver does not see these interconnects
> > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > >>
> > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > >> interconnected group).    
> > > 
> > > I'm not gaining much confidence that we can rely on isolation between
> > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > is going to play nice and nobody will figure out how to do bad things
> > > because... obfuscation?  Thanks,    
> > 
> > Well, we already believe that a proprietary firmware of a sriov-capable
> > adapter like Mellanox ConnextX is not doing bad things, how is this
> > different in principle?  
> 
> It seems like the scope and hierarchy are different.  Here we're
> talking about exposing big discrete devices, which are peers of one
> another (and have history of being reverse engineered), to userspace
> drivers.  Once handed to userspace, each of those devices needs to be
> considered untrusted.  In the case of SR-IOV, we typically have a
> trusted host driver for the PF managing untrusted VFs.  We do rely on
> some sanity in the hardware/firmware in isolating the VFs from each
> other and from the PF, but we also often have source code for Linux
> drivers for these devices and sometimes even datasheets.  Here we have
> neither of those and perhaps we won't know the extent of the lack of
> isolation between these devices until nouveau (best case) or some
> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> of isolation between devices unless the hardware provides some
> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> wants to expose isolation on NVLink, perhaps they need to document
> enough of it that the host kernel can manipulate and test for isolation,
> perhaps even enabling virtualization of the NVLink interconnect
> interface such that the host can prevent GPUs from interfering with
> each other.  Thanks,


So far I got this from NVIDIA:

1. The NVLink2 link state can be controlled via MMIO registers; there is a
"NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is marked
"confidential" though) from NVIDIA with the MMIO addresses to block if
we want to disable certain links. In order for NVLink to work it needs to
be enabled on both sides, so by filtering certain MMIO ranges we can
isolate a GPU.

2. We can and should also prohibit GPU firmware updates; these are
done via MMIO as well. The protocol is not open, but at least the
register ranges might be made available so we can filter these
accesses, and there is no plan to change this.

3. DMA traffic over the NVLink2 link can be of 2 types: UT=1 for
PCI-style DMA via our usual TCE tables (one per NVLink2 link),
and UT=0 for direct host memory access. UT stands for "use
translation" and is part of the NVLink2 protocol. Only UT=1 is
possible over the PCIe link.
UT=0 traffic uses host physical addresses returned by a nest MMU (a
piece of NVIDIA logic on a POWER9 chip): it takes an LPID (guest id),
an mmu context id (guest userspace mm id) and a virtual address,
translates them to a host physical address, and that result is used for
UT=0 DMA. This is called "ATS" although it is not PCIe ATS afaict.
NVIDIA says that the hardware is designed in a way that it can only do
DMA UT=0 to addresses which ATS translated to, and there is no way to
override this behavior and this is what guarantees the isolation.

So isolation can be achieved if I do not miss something.

How do we want this to be documented in order to proceed? I assume that
if I just post patches filtering MMIOs, that won't be enough, right? If
only items 1..3 are documented, will we accept those terms, or do we
need a full GPU API spec (which is not going to happen anyway)?
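
For the record, the MMIO filtering from item 1 above would boil down to
something like this in the subdriver's BAR access path (a rough sketch
only; the table is a placeholder, the real offsets come from that
confidential spec):

#include <linux/kernel.h>
#include <linux/types.h>

/* Hypothetical denylist of NVLink control windows inside a BAR */
struct nvlink2_blocked_range {
        loff_t start;
        loff_t end;
};

static const struct nvlink2_blocked_range nvlink2_blocked[] = {
        { 0x0, 0x0 },   /* placeholder, the real offsets are under NDA */
};

static bool nvlink2_mmio_blocked(loff_t off, size_t count)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(nvlink2_blocked); i++)
                if (off < nvlink2_blocked[i].end &&
                    off + count > nvlink2_blocked[i].start)
                        return true;

        return false;
}

i.e. the region read/write handlers would return an error (or read back
zeroes) for anything overlapping those windows.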



--
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-10  4:10                 ` Alexey Kardashevskiy
  (?)
@ 2018-07-10 22:37                   ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-07-10 22:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Tue, 10 Jul 2018 14:10:20 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On Thu, 7 Jun 2018 23:03:23 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Fri, 8 Jun 2018 14:14:23 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > On 8/6/18 1:44 pm, Alex Williamson wrote:    
> > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > >       
> > > >> On 8/6/18 8:15 am, Alex Williamson wrote:      
> > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > >>>         
> > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:        
> > > >>>>>
> > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > >>>>> devices and from that perspective they're isolated.  That's the view of
> > > >>>>> the device used to generate the grouping.  However, not visible to us,
> > > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > > >>>>> be to provide a high performance link for p2p between devices?          
> > > >>>>
> > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > >>>> and the CPU which is running significantly faster than PCIe.
> > > >>>>
> > > >>>> But yes, there are cross-links and those should probably be accounted
> > > >>>> for in the grouping.        
> > > >>>
> > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > >>> convince NVIDIA to support more than a single vGPU per VM though)        
> > > >>
> > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > >> implementing as well elsewhere.      
> > > > 
> > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > either.  That's why we have mdev devices now to implement software
> > > > defined devices.  I don't have first hand experience with V-series, but
> > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.      
> > > 
> > > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > > using mediated vGPUs instead, correct?    
> > 
> > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > account for lack of isolation on the NVLink side and we correct that,
> > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > whether they choose to support vGPU on these GPUs or whether they can
> > be convinced to support multiple vGPUs per VM.
> >   
> > > >> My current understanding is that every P9 chip in that box has some NVLink2
> > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > > >> as well.
> > > >>
> > > >> From small bits of information I have it seems that a GPU can perfectly
> > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > >>
> > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > >> interconnected group).      
> > > > 
> > > > I'm not gaining much confidence that we can rely on isolation between
> > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > is going to play nice and nobody will figure out how to do bad things
> > > > because... obfuscation?  Thanks,      
> > > 
> > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > adapter like Mellanox ConnextX is not doing bad things, how is this
> > > different in principle?    
> > 
> > It seems like the scope and hierarchy are different.  Here we're
> > talking about exposing big discrete devices, which are peers of one
> > another (and have history of being reverse engineered), to userspace
> > drivers.  Once handed to userspace, each of those devices needs to be
> > considered untrusted.  In the case of SR-IOV, we typically have a
> > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > some sanity in the hardware/firmware in isolating the VFs from each
> > other and from the PF, but we also often have source code for Linux
> > drivers for these devices and sometimes even datasheets.  Here we have
> > neither of those and perhaps we won't know the extent of the lack of
> > isolation between these devices until nouveau (best case) or some
> > exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> > of isolation between devices unless the hardware provides some
> > indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> > wants to expose isolation on NVLink, perhaps they need to document
> > enough of it that the host kernel can manipulate and test for isolation,
> > perhaps even enabling virtualization of the NVLink interconnect
> > interface such that the host can prevent GPUs from interfering with
> > each other.  Thanks,  
> 
> 
> So far I got this from NVIDIA:
> 
> 1. An NVLink2 state can be controlled via MMIO registers, there is a
> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> "confidential" though) from NVIDIA with the MMIO addresses to block if
> we want to disable certain links. In order for NVLink to work, it needs
> to be enabled on both sides, so by filtering certain MMIO ranges we can
> isolate a GPU.

Where are these MMIO registers, on the bridge or on the endpoint device?
I'm wondering when you say block MMIO if these are ranges on the device
that we disallow mmap to and all the overlapping PAGE_SIZE issues that
come with that or if this should essentially be device specific
enable_acs and acs_enabled quirks, and maybe also potentially used by
Logan's disable acs series to allow GPUs to be linked and have grouping
to match.
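
(Purely to sketch the second option, not actual quirks.c code; the
names below are invented for illustration: a per-device hook which
reports whether the NVLink side has been isolated, which the grouping
code could consult much like the existing ACS checks.)

#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/pci_ids.h>

/*
 * Illustrative sketch only: neither the table nor the functions below
 * exist in drivers/pci/quirks.c; they just show the shape a
 * device-specific "NVLink isolated" quirk could take.
 */
static int nvlink_isolated_v100(struct pci_dev *pdev)
{
	/*
	 * A real implementation would verify (or program) the GPU's
	 * link-disable state here and only then report isolation.
	 */
	return 1;
}

static const struct {
	u16 vendor;
	u16 device;
	int (*isolated)(struct pci_dev *pdev);
} nvlink_isolation_quirks[] = {
	/* 0x1db1 used as an example V100 SXM2 device ID */
	{ PCI_VENDOR_ID_NVIDIA, 0x1db1, nvlink_isolated_v100 },
};

static int pci_dev_nvlink_isolated(struct pci_dev *pdev)	/* hypothetical */
{
	unsigned int i;

	for (i = 0; i < ARRAY_SIZE(nvlink_isolation_quirks); i++)
		if (pdev->vendor == nvlink_isolation_quirks[i].vendor &&
		    pdev->device == nvlink_isolation_quirks[i].device)
			return nvlink_isolation_quirks[i].isolated(pdev);
	return 0;	/* assume no isolation unless a quirk says otherwise */
}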

> 2. We can and should also prohibit GPU firmware updates, which are
> done via MMIO as well. The protocol is not open, but at least the
> register ranges might be, which would let us filter these accesses;
> there is no plan to change this.

I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
along with it.  Also, there are certainly use cases of updating
firmware for an assigned device, we don't want to impose a policy, but
we should figure out the right place for that policy to be specified by
the admin.

> 3. DMA traffic over the NVLink2 link can be of 2 types: UT=1 for
> PCI-style DMA via our usual TCE tables (one per NVLink2 link),
> and UT=0 for direct host memory access. UT stands for "use
> translation" and is part of the NVLink2 protocol. Only UT=1 is
> possible over the PCIe link.
> UT=0 traffic uses host physical addresses returned by a nest MMU (a
> piece of NVIDIA logic on a POWER9 chip): it takes an LPID (guest id),
> an mmu context id (guest userspace mm id) and a virtual address,
> translates them to a host physical address, and that result is used
> for UT=0 DMA. This is called "ATS" although it is not PCIe ATS afaict.
> NVIDIA says that the hardware is designed in such a way that it can
> only do UT=0 DMA to addresses which ATS translated to, there is no way
> to override this behavior, and this is what guarantees the isolation.

I'm kinda lost here, maybe we can compare it to PCIe ATS where an
endpoint requests a translation of an IOVA to physical address, the
IOMMU returns a lookup based on PCIe requester ID, and there's an
invalidation protocol to keep things coherent.  In the case above, who
provides a guest id and mmu context id?  Additional software
somewhere?  Is the virtual address an IOVA or a process virtual
address?  Do we assume some sort of invalidation protocol as well?
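
(Restating the two models with made-up types, just to pin down which
inputs I think are involved in each case:)

#include <stdint.h>

/* PCIe ATS as I described it: the IOMMU resolves an IOVA keyed on the
 * PCIe requester ID (plus PASID where used), and invalidations are
 * sent back to the endpoint to keep its translation cache coherent. */
struct pcie_ats_request {
	uint16_t requester_id;	/* bus:dev:fn */
	uint32_t pasid;		/* optional process address space id */
	uint64_t iova;
};

/* The NVLink2/nest MMU scheme from item 3: the lookup is keyed on the
 * LPID and mmu context id instead; whether the address is an IOVA or
 * a process virtual address is exactly the question above. */
struct nvlink2_ats_request {
	uint32_t lpid;		/* guest id */
	uint32_t context_id;	/* guest userspace mm id */
	uint64_t va;
};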

> So isolation can be achieved, unless I am missing something.
> 
> How do we want this to be documented in order to proceed? I assume that
> if I just post patches filtering MMIOs, this won't do it, right? If only
> items 1..3 are documented, will we take this t&c or do we need a GPU API
> spec (which is not going to happen anyway)?

"t&c"?  I think we need what we're actually interacting with to be well
documented, but that could be _thorough_ comments in the code, enough
to understand the theory of operation, as far as I'm concerned.  A pdf
lost on a corporate webserver isn't necessarily an improvement over
that, but there needs to be sufficient detail to understand what we're
touching such that we can maintain, adapt, and improve the code over
time.  Only item #3 above appears POWER specific, so I'd hope that #1
is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
vfio-pci, but I'm not sure that's necessary), and I don't know where #3
goes.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-10 22:37                   ` Alex Williamson
  (?)
@ 2018-07-11  9:26                     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-11  9:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Tue, 10 Jul 2018 16:37:15 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 10 Jul 2018 14:10:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On Thu, 7 Jun 2018 23:03:23 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri, 8 Jun 2018 14:14:23 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >     
> > > > On 8/6/18 1:44 pm, Alex Williamson wrote:      
> > > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >         
> > > > >> On 8/6/18 8:15 am, Alex Williamson wrote:        
> > > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > > >>>           
> > > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:          
> > > > >>>>>
> > > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > > >>>>> devices and from that perspective they're isolated.  That's the view of
> > > > >>>>> the device used to generate the grouping.  However, not visible to us,
> > > > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > > > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > > > >>>>> be to provide a high performance link for p2p between devices?            
> > > > >>>>
> > > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > > >>>> and the CPU which is running significantly faster than PCIe.
> > > > >>>>
> > > > >>>> But yes, there are cross-links and those should probably be accounted
> > > > >>>> for in the grouping.          
> > > > >>>
> > > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > > >>> convince NVIDIA to support more than a single vGPU per VM though)          
> > > > >>
> > > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > > >> implementing as well elsewhere.        
> > > > > 
> > > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > > either.  That's why we have mdev devices now to implement software
> > > > > defined devices.  I don't have first hand experience with V-series, but
> > > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.        
> > > > 
> > > > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > > > using mediated vGPUs instead, correct?      
> > > 
> > > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > > account for lack of isolation on the NVLink side and we correct that,
> > > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > > whether they choose to support vGPU on these GPUs or whether they can
> > > be convinced to support multiple vGPUs per VM.
> > >     
> > > > >> My current understanding is that every P9 chip in that box has some NVLink2
> > > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > > > >> as well.
> > > > >>
> > > > >> From small bits of information I have it seems that a GPU can perfectly
> > > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > > > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > > >>
> > > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > > >> interconnected group).        
> > > > > 
> > > > > I'm not gaining much confidence that we can rely on isolation between
> > > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > > is going to play nice and nobody will figure out how to do bad things
> > > > > because... obfuscation?  Thanks,        
> > > > 
> > > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > > adapter like Mellanox ConnextX is not doing bad things, how is this
> > > > different in principle?      
> > > 
> > > It seems like the scope and hierarchy are different.  Here we're
> > > talking about exposing big discrete devices, which are peers of one
> > > another (and have history of being reverse engineered), to userspace
> > > drivers.  Once handed to userspace, each of those devices needs to be
> > > considered untrusted.  In the case of SR-IOV, we typically have a
> > > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > > some sanity in the hardware/firmware in isolating the VFs from each
> > > other and from the PF, but we also often have source code for Linux
> > > drivers for these devices and sometimes even datasheets.  Here we have
> > > neither of those and perhaps we won't know the extent of the lack of
> > > isolation between these devices until nouveau (best case) or some
> > > exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> > > of isolation between devices unless the hardware provides some
> > > indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> > > wants to expose isolation on NVLink, perhaps they need to document
> > > enough of it that the host kernel can manipulate and test for isolation,
> > > perhaps even enabling virtualization of the NVLink interconnect
> > > interface such that the host can prevent GPUs from interfering with
> > > each other.  Thanks,    
> > 
> > 
> > So far I got this from NVIDIA:
> > 
> > 1. An NVLink2 state can be controlled via MMIO registers, there is a
> > "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> > "confidential" though) from NVIDIA with the MMIO addresses to block if
> > we want to disable certain links. In order for NVLink to work, it needs
> > to be enabled on both sides, so by filtering certain MMIO ranges we can
> > isolate a GPU.  
> 
> Where are these MMIO registers, on the bridge or on the endpoint device?

The endpoint GPU device.

> I'm wondering when you say block MMIO if these are ranges on the device
> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
> come with that or if this should essentially be device specific
> enable_acs and acs_enabled quirks, and maybe also potentially used by
> Logan's disable acs series to allow GPUs to be linked and have grouping
> to match.

An update: I confused P100 and V100. P100 would need filtering, but
ours is V100, which has a couple of registers we can use to disable
particular links; once disabled, a link cannot be re-enabled until the
next secondary bus reset.
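
Very roughly, something like the following on the host side before the
GPU is handed over (the register offsets and the bit below are
placeholders, not the real layout):

#include <linux/io.h>

/* Sketch only: NVLINK_DISABLE_REG()/NVLINK_DISABLE_BIT stand in for
 * the real (not publicly documented) V100 registers.  The point is a
 * single write per link which sticks until the next secondary bus
 * reset of the GPU. */
#define NVLINK_DISABLE_REG(link)	(0x00a00000 + (link) * 0x100)
#define NVLINK_DISABLE_BIT		0x1

static void v100_disable_links(void __iomem *bar0, unsigned long link_mask)
{
	unsigned int link;

	for (link = 0; link < 6; link++)	/* V100 has 6 NVLink2 bricks */
		if (link_mask & (1UL << link))
			writel(NVLINK_DISABLE_BIT,
			       bar0 + NVLINK_DISABLE_REG(link));
}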


> > 2. We can and should also prohibit GPU firmware updates, which are
> > done via MMIO as well. The protocol is not open, but at least the
> > register ranges might be, which would let us filter these accesses;
> > there is no plan to change this.  
> 
> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
> along with it.

Yes, however NVIDIA says there is nothing performance-critical in this
64K page.

> Also, there are certainly use cases of updating
> firmware for an assigned device, we don't want to impose a policy, but
> we should figure out the right place for that policy to be specified by
> the admin.

Maybe, but NVIDIA is talking about some "out-of-band" command to the GPU
to enable firmware updates, so firmware update is not really supported.


> > 3. DMA traffic over the NVLink2 link can be of 2 types: UT=1 for
> > PCI-style DMA via our usual TCE tables (one per NVLink2 link),
> > and UT=0 for direct host memory access. UT stands for "use
> > translation" and is part of the NVLink2 protocol. Only UT=1 is
> > possible over the PCIe link.
> > UT=0 traffic uses host physical addresses returned by a nest MMU (a
> > piece of NVIDIA logic on a POWER9 chip): it takes an LPID (guest id),
> > an mmu context id (guest userspace mm id) and a virtual address,
> > translates them to a host physical address, and that result is used
> > for UT=0 DMA. This is called "ATS" although it is not PCIe ATS afaict.
> > NVIDIA says that the hardware is designed in such a way that it can
> > only do UT=0 DMA to addresses which ATS translated to, there is no way
> > to override this behavior, and this is what guarantees the isolation.  
> 
> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
> endpoint requests a translation of an IOVA to physical address, the
> IOMMU returns a lookup based on PCIe requester ID, and there's an
> invalidation protocol to keep things coherent.

Yes, there is. The current approach is to have an MMU notifier in the
kernel which tells the NPU (an IBM piece of logic between the
GPU/NVLink2 and the NVIDIA nest MMU) to invalidate translations; that
in turn pokes the GPU until it confirms that it has invalidated its
TLBs and there is no ongoing DMA.
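
In heavily simplified form it is just an mmu_notifier hook; the real
NPU poking and the "wait for the GPU to ack" part are collapsed into
npu_mmio_invalidate() below (a sketch, not the actual powernv code):

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

struct npu_context_sketch {
	struct mmu_notifier mn;
	/* per-GPU / per-LPID state would live here */
};

static void npu_mmio_invalidate(struct npu_context_sketch *ctx,
				unsigned long start, unsigned long end)
{
	/* tell the NPU to shoot down [start, end), then wait until the
	 * GPU confirms its TLBs are clean and no DMA is in flight */
}

static void npu_sketch_invalidate_range(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start,
					unsigned long end)
{
	struct npu_context_sketch *ctx =
		container_of(mn, struct npu_context_sketch, mn);

	npu_mmio_invalidate(ctx, start, end);
}

static const struct mmu_notifier_ops npu_sketch_ops = {
	.invalidate_range = npu_sketch_invalidate_range,
};

/* registration: ctx->mn.ops = &npu_sketch_ops;
 *               mmu_notifier_register(&ctx->mn, current->mm); */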

> In the case above, who provides a guest id and mmu context id? 

We (the powerpc/powernv platform) configure the NPU to bind a specific
bus:dev:fn to an LPID (== guest id), and the MMU context id comes from
the guest. The nest MMU knows where the partition table is, and this
table contains all the pointers needed for the translation.
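
So conceptually the binding is just this tuple (a made-up struct, only
to show which side provides what):

#include <stdint.h>

/* Made-up representation of what the platform programs into the NPU so
 * that the nest MMU walks the right partition/process tables for UT=0
 * traffic. */
struct npu_binding_sketch {
	uint16_t bdfn;		/* GPU/NVLink2 bridge bus:dev:fn, host-chosen */
	uint64_t lpid;		/* partition (guest) id, set by the host */
	uint64_t context_id;	/* mm context id, supplied by the guest */
};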


> Additional software
> somewhere?  Is the virtual address an IOVA or a process virtual
> address? 

A guest kernel or a guest userspace virtual address.

> Do we assume some sort of invalidation protocol as well?

I am a little confused: is this question about the same invalidation
protocol as above, or a different one?


> > So isolation can be achieved, unless I am missing something.
> > 
> > How do we want this to be documented in order to proceed? I assume that
> > if I just post patches filtering MMIOs, this won't do it, right? If only
> > items 1..3 are documented, will we take this t&c or do we need a GPU API
> > spec (which is not going to happen anyway)?  
> 
> "t&c"? I think we need what we're actually interacting with to be well
> documented, but that could be _thorough_ comments in the code, enough
> to understand the theory of operation, as far as I'm concerned.  A pdf
> lost on a corporate webserver isn't necessarily an improvement over
> that, but there needs to be sufficient detail to understand what we're
> touching such that we can maintain, adapt, and improve the code over
> time.  Only item #3 above appears POWER specific, so I'd hope that #1
> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
> goes.  Thanks,

Ok, understood. Thanks!


--
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-07-11  9:26                     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-11  9:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, kvm-ppc,
	Ram Pai, kvm, Alistair Popple

On Tue, 10 Jul 2018 16:37:15 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 10 Jul 2018 14:10:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On Thu, 7 Jun 2018 23:03:23 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri, 8 Jun 2018 14:14:23 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >     
> > > > On 8/6/18 1:44 pm, Alex Williamson wrote:      
> > > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >         
> > > > >> On 8/6/18 8:15 am, Alex Williamson wrote:        
> > > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > > >>>           
> > > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:          
> > > > >>>>>
> > > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > > >>>>> devices and from that perspective they're isolated.  That's the view of
> > > > >>>>> the device used to generate the grouping.  However, not visible to us,
> > > > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > > > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > > > >>>>> be to provide a high performance link for p2p between devices?            
> > > > >>>>
> > > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > > >>>> and the CPU which is running significantly faster than PCIe.
> > > > >>>>
> > > > >>>> But yes, there are cross-links and those should probably be accounted
> > > > >>>> for in the grouping.          
> > > > >>>
> > > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > > >>> convince NVIDIA to support more than a single vGPU per VM though)          
> > > > >>
> > > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > > >> implementing as well elsewhere.        
> > > > > 
> > > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > > either.  That's why we have mdev devices now to implement software
> > > > > defined devices.  I don't have first hand experience with V-series, but
> > > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.        
> > > > 
> > > > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > > > using mediated vGPUs instead, correct?      
> > > 
> > > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > > account for lack of isolation on the NVLink side and we correct that,
> > > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > > whether they choose to support vGPU on these GPUs or whether they can
> > > be convinced to support multiple vGPUs per VM.
> > >     
> > > > >> My current understanding is that every P9 chip in that box has some NVLink2
> > > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > > > >> as well.
> > > > >>
> > > > >> From small bits of information I have it seems that a GPU can perfectly
> > > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > > > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > > >>
> > > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > > >> interconnected group).        
> > > > > 
> > > > > I'm not gaining much confidence that we can rely on isolation between
> > > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > > is going to play nice and nobody will figure out how to do bad things
> > > > > because... obfuscation?  Thanks,        
> > > > 
> > > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > > adapter like Mellanox ConnextX is not doing bad things, how is this
> > > > different in principle?      
> > > 
> > > It seems like the scope and hierarchy are different.  Here we're
> > > talking about exposing big discrete devices, which are peers of one
> > > another (and have history of being reverse engineered), to userspace
> > > drivers.  Once handed to userspace, each of those devices needs to be
> > > considered untrusted.  In the case of SR-IOV, we typically have a
> > > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > > some sanity in the hardware/firmware in isolating the VFs from each
> > > other and from the PF, but we also often have source code for Linux
> > > drivers for these devices and sometimes even datasheets.  Here we have
> > > neither of those and perhaps we won't know the extent of the lack of
> > > isolation between these devices until nouveau (best case) or some
> > > exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> > > of isolation between devices unless the hardware provides some
> > > indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> > > wants to expose isolation on NVLink, perhaps they need to document
> > > enough of it that the host kernel can manipulate and test for isolation,
> > > perhaps even enabling virtualization of the NVLink interconnect
> > > interface such that the host can prevent GPUs from interfering with
> > > each other.  Thanks,    
> > 
> > 
> > So far I got this from NVIDIA:
> > 
> > 1. An NVLink2 state can be controlled via MMIO registers, there is a
> > "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> > "confidential" though) from NVIDIA with the MMIO addresses to block if
> > we want to disable certain links. In order to NVLink to work it needs to
> > be enabled on both sides so by filtering certains MMIO ranges we can
> > isolate a GPU.  
> 
> Where are these MMIO registers, on the bridge or on the endpoint device?

The endpoint GPU device.

> I'm wondering when you say block MMIO if these are ranges on the device
> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
> come with that or if this should essentially be device specific
> enable_acs and acs_enabled quirks, and maybe also potentially used by
> Logan's disable acs series to allow GPUs to be linked and have grouping
> to match.

An update, I confused P100 and V100, P100 would need filtering but
ours is V100 and it has a couple of registers which we can use to
disable particular links and once disabled, the link cannot be
re-enabled till the next secondary bus reset.


> > 2. We can and should also prohibit the GPU firmware update, this is
> > done via MMIO as well. The protocol is not open but at least register
> > ranges might be in order to filter these accesses, and there is no
> > plan to change this.  
> 
> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
> along with it.

Yes, however NVIDIA says there is no performance critical stuff with
this 64K page.

> Also, there are certainly use cases of updating
> firmware for an assigned device, we don't want to impose a policy, but
> we should figure out the right place for that policy to be specified by
> the admin.

May be but NVIDIA is talking about some "out-of-band" command to the GPU
to enable firmware update so firmware update is not really supported.


> > 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
> > PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
> > and UT=0 for direct host memory access. UT stands for "use
> > translation" and this is a part of the NVLink2 protocol. Only UT=1 is
> > possible over the PCIe link.
> > This UT=0 trafic uses host physical addresses returned by a nest MMU (a
> > piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
> > mmu context id (guest userspace mm id), a virtual address and translates
> > to the host physical and that result is used for UT=0 DMA, this is
> > called "ATS" although it is not PCIe ATS afaict.
> > NVIDIA says that the hardware is designed in a way that it can only do
> > DMA UT=0 to addresses which ATS translated to, and there is no way to
> > override this behavior and this is what guarantees the isolation.  
> 
> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
> endpoint requests a translation of an IOVA to physical address, the
> IOMMU returns a lookup based on PCIe requester ID, and there's an
> invalidation protocol to keep things coherent.

Yes there is. The current approach is to have an MMU notifier in
the kernel which tells an NPU (IBM piece of logic between GPU/NVlink2
and NVIDIA nest MMU) to invalidate translations and that in turn pokes
the GPU till that confirms that it invalidated tlbs and there is no
ongoing DMA.

> In the case above, who provides a guest id and mmu context id? 

We (powerpc/powernv platform) configure NPU to bind specific bus:dev:fn to
an LPID (== guest id) and MMU context id comes from the guest. The nest
MMU knows where the partition table and this table contains all the
pointers needs for the translation.


> Additional software
> somewhere?  Is the virtual address an IOVA or a process virtual
> address? 

A guest kernel or a guest userspace virtual address.

> Do we assume some sort of invalidation protocol as well?

I am little confused, is this question about the same invalidation
protocol as above or different?


> > So isolation can be achieved if I do not miss something.
> > 
> > How do we want this to be documented to proceed? I assume if I post
> > patches filtering MMIOs, this won't do it, right? If just 1..3 are
> > documented, will we take this t&c or we need a GPU API spec (which is
> > not going to happen anyway)?  
> 
> "t&c"? I think we need what we're actually interacting with to be well
> documented, but that could be _thorough_ comments in the code, enough
> to understand the theory of operation, as far as I'm concerned.  A pdf
> lost on a corporate webserver isn't necessarily an improvement over
> that, but there needs to be sufficient detail to understand what we're
> touching such that we can maintain, adapt, and improve the code over
> time.  Only item #3 above appears POWER specific, so I'd hope that #1
> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
> goes.  Thanks,

Ok, understood. Thanks!


--
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-07-11  9:26                     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-11  9:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Tue, 10 Jul 2018 16:37:15 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 10 Jul 2018 14:10:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On Thu, 7 Jun 2018 23:03:23 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri, 8 Jun 2018 14:14:23 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >     
> > > > On 8/6/18 1:44 pm, Alex Williamson wrote:      
> > > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >         
> > > > >> On 8/6/18 8:15 am, Alex Williamson wrote:        
> > > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > > >>>           
> > > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:          
> > > > >>>>>
> > > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > > >>>>> devices and from that perspective they're isolated.  That's the view of
> > > > >>>>> the device used to generate the grouping.  However, not visible to us,
> > > > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > > > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > > > >>>>> be to provide a high performance link for p2p between devices?            
> > > > >>>>
> > > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > > >>>> and the CPU which is running significantly faster than PCIe.
> > > > >>>>
> > > > >>>> But yes, there are cross-links and those should probably be accounted
> > > > >>>> for in the grouping.          
> > > > >>>
> > > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > > >>> convince NVIDIA to support more than a single vGPU per VM though)          
> > > > >>
> > > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > > >> implementing as well elsewhere.        
> > > > > 
> > > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > > either.  That's why we have mdev devices now to implement software
> > > > > defined devices.  I don't have first hand experience with V-series, but
> > > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.        
> > > > 
> > > > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > > > using mediated vGPUs instead, correct?      
> > > 
> > > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > > account for lack of isolation on the NVLink side and we correct that,
> > > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > > whether they choose to support vGPU on these GPUs or whether they can
> > > be convinced to support multiple vGPUs per VM.
> > >     
> > > > >> My current understanding is that every P9 chip in that box has some NVLink2
> > > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > > > >> as well.
> > > > >>
> > > > >> From small bits of information I have it seems that a GPU can perfectly
> > > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > > > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > > >>
> > > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > > >> interconnected group).        
> > > > > 
> > > > > I'm not gaining much confidence that we can rely on isolation between
> > > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > > is going to play nice and nobody will figure out how to do bad things
> > > > > because... obfuscation?  Thanks,        
> > > > 
> > > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > > adapter like Mellanox ConnextX is not doing bad things, how is this
> > > > different in principle?      
> > > 
> > > It seems like the scope and hierarchy are different.  Here we're
> > > talking about exposing big discrete devices, which are peers of one
> > > another (and have history of being reverse engineered), to userspace
> > > drivers.  Once handed to userspace, each of those devices needs to be
> > > considered untrusted.  In the case of SR-IOV, we typically have a
> > > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > > some sanity in the hardware/firmware in isolating the VFs from each
> > > other and from the PF, but we also often have source code for Linux
> > > drivers for these devices and sometimes even datasheets.  Here we have
> > > neither of those and perhaps we won't know the extent of the lack of
> > > isolation between these devices until nouveau (best case) or some
> > > exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> > > of isolation between devices unless the hardware provides some
> > > indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> > > wants to expose isolation on NVLink, perhaps they need to document
> > > enough of it that the host kernel can manipulate and test for isolation,
> > > perhaps even enabling virtualization of the NVLink interconnect
> > > interface such that the host can prevent GPUs from interfering with
> > > each other.  Thanks,    
> > 
> > 
> > So far I got this from NVIDIA:
> > 
> > 1. An NVLink2 state can be controlled via MMIO registers, there is a
> > "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> > "confidential" though) from NVIDIA with the MMIO addresses to block if
> > we want to disable certain links. In order to NVLink to work it needs to
> > be enabled on both sides so by filtering certains MMIO ranges we can
> > isolate a GPU.  
> 
> Where are these MMIO registers, on the bridge or on the endpoint device?

The endpoint GPU device.

> I'm wondering when you say block MMIO if these are ranges on the device
> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
> come with that or if this should essentially be device specific
> enable_acs and acs_enabled quirks, and maybe also potentially used by
> Logan's disable acs series to allow GPUs to be linked and have grouping
> to match.

An update: I confused P100 and V100. P100 would need MMIO filtering,
but ours is V100, which has a couple of registers we can use to
disable particular links; once disabled, a link cannot be re-enabled
until the next secondary bus reset.
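
To make the mechanism concrete, here is a minimal sketch of what a
host-side quirk could look like; the register offset and bit below are
placeholders, the real values only exist in the confidential spec:

#include <linux/pci.h>
#include <linux/io.h>
#include <linux/bitops.h>

/* Hypothetical values standing in for the ones from the NVIDIA spec */
#define NVLINK_DISABLE_OFFSET	0x0
#define NVLINK_DISABLE_BIT	BIT(0)

static void v100_disable_nvlinks(struct pci_dev *pdev)
{
	void __iomem *bar0 = pci_iomap(pdev, 0, 0);
	u32 val;

	if (!bar0)
		return;

	/* Sticky until the next secondary bus reset */
	val = readl(bar0 + NVLINK_DISABLE_OFFSET);
	writel(val | NVLINK_DISABLE_BIT, bar0 + NVLINK_DISABLE_OFFSET);

	pci_iounmap(pdev, bar0);
}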


> > 2. We can and should also prohibit the GPU firmware update, this is
> > done via MMIO as well. The protocol is not open but at least register
> > ranges might be in order to filter these accesses, and there is no
> > plan to change this.  
> 
> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
> along with it.

Yes, however NVIDIA says there is nothing performance-critical on
this 64K page.
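
Since mmap of that page would simply be refused, accesses would go
through the read/write path where they can be filtered; a rough sketch
of the kind of check this implies, with made-up window offsets:

#include <linux/types.h>
#include <linux/errno.h>

/* Hypothetical window inside BAR0; the real offsets are not public */
#define FWUPD_WINDOW_START	0x200000ULL
#define FWUPD_WINDOW_END	0x210000ULL

/* Does the access overlap the blocked firmware-update window? */
static bool fwupd_window_blocked(u64 offset, size_t count)
{
	return offset < FWUPD_WINDOW_END &&
	       offset + count > FWUPD_WINDOW_START;
}

/* A vfio-pci style BAR access handler would then do something like: */
static ssize_t gpu_bar_rw_filtered(u64 offset, size_t count, bool iswrite)
{
	if (fwupd_window_blocked(offset, count))
		return -EPERM;	/* or return zeroes for reads */
	/* otherwise fall through to the normal BAR access path */
	return count;
}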

> Also, there are certainly use cases of updating
> firmware for an assigned device, we don't want to impose a policy, but
> we should figure out the right place for that policy to be specified by
> the admin.

Maybe, but NVIDIA is talking about some "out-of-band" command to the GPU
to enable firmware update, so firmware update is not really supported.


> > 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
> > PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
> > and UT=0 for direct host memory access. UT stands for "use
> > translation" and this is a part of the NVLink2 protocol. Only UT=1 is
> > possible over the PCIe link.
> > This UT=0 trafic uses host physical addresses returned by a nest MMU (a
> > piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
> > mmu context id (guest userspace mm id), a virtual address and translates
> > to the host physical and that result is used for UT=0 DMA, this is
> > called "ATS" although it is not PCIe ATS afaict.
> > NVIDIA says that the hardware is designed in a way that it can only do
> > DMA UT=0 to addresses which ATS translated to, and there is no way to
> > override this behavior and this is what guarantees the isolation.  
> 
> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
> endpoint requests a translation of an IOVA to physical address, the
> IOMMU returns a lookup based on PCIe requester ID, and there's an
> invalidation protocol to keep things coherent.

Yes, there is. The current approach is to have an MMU notifier in
the kernel which tells the NPU (an IBM piece of logic between the
GPU/NVLink2 and the NVIDIA nest MMU) to invalidate translations; that
in turn pokes the GPU until it confirms that it has invalidated its
TLBs and there is no ongoing DMA.
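
For illustration only, a minimal sketch of how such a notifier could be
hooked up; npu_context and npu_flush_and_wait() are made-up names
standing in for the platform code that actually pokes the NPU/GPU and
waits for completion:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct npu_context {
	struct mmu_notifier mn;
	/* NPU handle, LPID, etc. would live here */
};

/*
 * Hypothetical: issue the invalidation to the NPU and poll until the
 * GPU confirms its TLBs are clean and DMA has drained.
 */
static void npu_flush_and_wait(struct npu_context *ctx,
			       unsigned long start, unsigned long end)
{
}

static void npu_mn_invalidate_range(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long end)
{
	struct npu_context *ctx = container_of(mn, struct npu_context, mn);

	npu_flush_and_wait(ctx, start, end);
}

static const struct mmu_notifier_ops npu_mn_ops = {
	.invalidate_range = npu_mn_invalidate_range,
};

static int npu_register_mn(struct npu_context *ctx, struct mm_struct *mm)
{
	ctx->mn.ops = &npu_mn_ops;
	return mmu_notifier_register(&ctx->mn, mm);
}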

> In the case above, who provides a guest id and mmu context id? 

We (the powerpc/powernv platform) configure the NPU to bind a specific
bus:dev:fn to an LPID (= guest id), and the MMU context id comes from
the guest. The nest MMU knows where the partition table is, and this
table contains all the pointers needed for the translation.
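
Purely as an illustration of what that binding carries (the structure
and function below are invented names, not the real powernv interface):

#include <linux/types.h>

struct npu_gpu_binding {
	u16 bdfn;	/* bus:dev:fn of the device doing UT=0 DMA */
	u32 lpid;	/* partition (guest) id chosen by the hypervisor */
	u64 ptcr;	/* partition table base the nest MMU walks */
};

static int npu_bind_gpu(struct npu_gpu_binding *b)
{
	/*
	 * Hypothetical: program the NPU so that requests from b->bdfn are
	 * translated under b->lpid. The MMU context id (PID) is not part
	 * of this static binding; it comes from the guest per request and
	 * is resolved through the partition table.
	 */
	return 0;
}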


> Additional software
> somewhere?  Is the virtual address an IOVA or a process virtual
> address? 

A guest kernel or a guest userspace virtual address.

> Do we assume some sort of invalidation protocol as well?

I am a little confused: is this question about the same invalidation
protocol as above, or a different one?


> > So isolation can be achieved if I do not miss something.
> > 
> > How do we want this to be documented to proceed? I assume if I post
> > patches filtering MMIOs, this won't do it, right? If just 1..3 are
> > documented, will we take this t&c or we need a GPU API spec (which is
> > not going to happen anyway)?  
> 
> "t&c"? I think we need what we're actually interacting with to be well
> documented, but that could be _thorough_ comments in the code, enough
> to understand the theory of operation, as far as I'm concerned.  A pdf
> lost on a corporate webserver isn't necessarily an improvement over
> that, but there needs to be sufficient detail to understand what we're
> touching such that we can maintain, adapt, and improve the code over
> time.  Only item #3 above appears POWER specific, so I'd hope that #1
> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
> goes.  Thanks,

Ok, understood. Thanks!


--
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-11  9:26                     ` Alexey Kardashevskiy
  (?)
@ 2018-07-30  8:58                       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-30  8:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
> On Tue, 10 Jul 2018 16:37:15 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Tue, 10 Jul 2018 14:10:20 +1000
>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>> On Thu, 7 Jun 2018 23:03:23 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>   
>>>> On Fri, 8 Jun 2018 14:14:23 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>     
>>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:      
>>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>         
>>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:        
>>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>>>>>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>>>>>>           
>>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:          
>>>>>>>>>>
>>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>>>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>>>>>>> devices and from that perspective they're isolated.  That's the view of
>>>>>>>>>> the device used to generate the grouping.  However, not visible to us,
>>>>>>>>>> these devices are interconnected via NVLink.  What isolation properties
>>>>>>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>>>>>>> be to provide a high performance link for p2p between devices?            
>>>>>>>>>
>>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>>>>>>> and the CPU which is running significantly faster than PCIe.
>>>>>>>>>
>>>>>>>>> But yes, there are cross-links and those should probably be accounted
>>>>>>>>> for in the grouping.          
>>>>>>>>
>>>>>>>> Then after we fix the grouping, can we just let the host driver manage
>>>>>>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>>>>>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>>>>>>> convince NVIDIA to support more than a single vGPU per VM though)          
>>>>>>>
>>>>>>> These are physical GPUs, not virtual sriov-alike things they are
>>>>>>> implementing as well elsewhere.        
>>>>>>
>>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
>>>>>> either.  That's why we have mdev devices now to implement software
>>>>>> defined devices.  I don't have first hand experience with V-series, but
>>>>>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.        
>>>>>
>>>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
>>>>> using mediated vGPUs instead, correct?      
>>>>
>>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
>>>> account for lack of isolation on the NVLink side and we correct that,
>>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
>>>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
>>>> whether they choose to support vGPU on these GPUs or whether they can
>>>> be convinced to support multiple vGPUs per VM.
>>>>     
>>>>>>> My current understanding is that every P9 chip in that box has some NVLink2
>>>>>>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>>>>>>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>>>>>>> as well.
>>>>>>>
>>>>>>> From small bits of information I have it seems that a GPU can perfectly
>>>>>>> work alone and if the NVIDIA driver does not see these interconnects
>>>>>>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>>>>>>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>>>>>>> which simply refuses to work until all 3 GPUs are passed so there is some
>>>>>>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>>>>>>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>>>>>>
>>>>>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>>>>>>> interconnected group).        
>>>>>>
>>>>>> I'm not gaining much confidence that we can rely on isolation between
>>>>>> NVLink connected GPUs, it sounds like you're simply expecting that
>>>>>> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
>>>>>> is going to play nice and nobody will figure out how to do bad things
>>>>>> because... obfuscation?  Thanks,        
>>>>>
>>>>> Well, we already believe that a proprietary firmware of a sriov-capable
>>>>> adapter like Mellanox ConnextX is not doing bad things, how is this
>>>>> different in principle?      
>>>>
>>>> It seems like the scope and hierarchy are different.  Here we're
>>>> talking about exposing big discrete devices, which are peers of one
>>>> another (and have history of being reverse engineered), to userspace
>>>> drivers.  Once handed to userspace, each of those devices needs to be
>>>> considered untrusted.  In the case of SR-IOV, we typically have a
>>>> trusted host driver for the PF managing untrusted VFs.  We do rely on
>>>> some sanity in the hardware/firmware in isolating the VFs from each
>>>> other and from the PF, but we also often have source code for Linux
>>>> drivers for these devices and sometimes even datasheets.  Here we have
>>>> neither of those and perhaps we won't know the extent of the lack of
>>>> isolation between these devices until nouveau (best case) or some
>>>> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
>>>> of isolation between devices unless the hardware provides some
>>>> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
>>>> wants to expose isolation on NVLink, perhaps they need to document
>>>> enough of it that the host kernel can manipulate and test for isolation,
>>>> perhaps even enabling virtualization of the NVLink interconnect
>>>> interface such that the host can prevent GPUs from interfering with
>>>> each other.  Thanks,    
>>>
>>>
>>> So far I got this from NVIDIA:
>>>
>>> 1. An NVLink2 state can be controlled via MMIO registers, there is a
>>> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
>>> "confidential" though) from NVIDIA with the MMIO addresses to block if
>>> we want to disable certain links. In order to NVLink to work it needs to
>>> be enabled on both sides so by filtering certains MMIO ranges we can
>>> isolate a GPU.  
>>
>> Where are these MMIO registers, on the bridge or on the endpoint device?
> 
> The endpoint GPU device.
> 
>> I'm wondering when you say block MMIO if these are ranges on the device
>> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
>> come with that or if this should essentially be device specific
>> enable_acs and acs_enabled quirks, and maybe also potentially used by
>> Logan's disable acs series to allow GPUs to be linked and have grouping
>> to match.
> 
> An update: I confused P100 and V100. P100 would need MMIO filtering,
> but ours is V100, which has a couple of registers we can use to
> disable particular links; once disabled, a link cannot be re-enabled
> until the next secondary bus reset.
> 
> 
>>> 2. We can and should also prohibit the GPU firmware update, this is
>>> done via MMIO as well. The protocol is not open but at least register
>>> ranges might be in order to filter these accesses, and there is no
>>> plan to change this.  
>>
>> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
>> along with it.
> 
> Yes, however NVIDIA says there is nothing performance-critical on
> this 64K page.
> 
>> Also, there are certainly use cases of updating
>> firmware for an assigned device, we don't want to impose a policy, but
>> we should figure out the right place for that policy to be specified by
>> the admin.
> 
> Maybe, but NVIDIA is talking about some "out-of-band" command to the GPU
> to enable firmware update, so firmware update is not really supported.
> 
> 
>>> 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
>>> PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
>>> and UT=0 for direct host memory access. UT stands for "use
>>> translation" and this is a part of the NVLink2 protocol. Only UT=1 is
>>> possible over the PCIe link.
>>> This UT=0 trafic uses host physical addresses returned by a nest MMU (a
>>> piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
>>> mmu context id (guest userspace mm id), a virtual address and translates
>>> to the host physical and that result is used for UT=0 DMA, this is
>>> called "ATS" although it is not PCIe ATS afaict.
>>> NVIDIA says that the hardware is designed in a way that it can only do
>>> DMA UT=0 to addresses which ATS translated to, and there is no way to
>>> override this behavior and this is what guarantees the isolation.  
>>
>> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
>> endpoint requests a translation of an IOVA to physical address, the
>> IOMMU returns a lookup based on PCIe requester ID, and there's an
>> invalidation protocol to keep things coherent.
> 
> Yes, there is. The current approach is to have an MMU notifier in
> the kernel which tells the NPU (an IBM piece of logic between the
> GPU/NVLink2 and the NVIDIA nest MMU) to invalidate translations; that
> in turn pokes the GPU until it confirms that it has invalidated its
> TLBs and there is no ongoing DMA.
> 
>> In the case above, who provides a guest id and mmu context id? 
> 
> We (the powerpc/powernv platform) configure the NPU to bind a specific
> bus:dev:fn to an LPID (= guest id), and the MMU context id comes from
> the guest. The nest MMU knows where the partition table is, and this
> table contains all the pointers needed for the translation.
> 
> 
>> Additional software
>> somewhere?  Is the virtual address an IOVA or a process virtual
>> address? 
> 
> A guest kernel or a guest userspace virtual address.
> 
>> Do we assume some sort of invalidation protocol as well?
> 
> I am a little confused: is this question about the same invalidation
> protocol as above, or a different one?
> 
> 
>>> So isolation can be achieved if I do not miss something.
>>>
>>> How do we want this to be documented to proceed? I assume if I post
>>> patches filtering MMIOs, this won't do it, right? If just 1..3 are
>>> documented, will we take this t&c or we need a GPU API spec (which is
>>> not going to happen anyway)?  
>>
>> "t&c"? I think we need what we're actually interacting with to be well
>> documented, but that could be _thorough_ comments in the code, enough
>> to understand the theory of operation, as far as I'm concerned.  A pdf
>> lost on a corporate webserver isn't necessarily an improvement over
>> that, but there needs to be sufficient detail to understand what we're
>> touching such that we can maintain, adapt, and improve the code over
>> time.  Only item #3 above appears POWER specific, so I'd hope that #1
>> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
>> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
>> goes.  Thanks,
> 
> Ok, understood. Thanks!

After some local discussions, it was pointed out that force-disabling
the nvlinks won't buy us much: for an nvlink to work, both sides need
to enable it, so a malicious guest cannot penetrate a good one (or the
host) unless the good guest enabled the link itself, which won't happen
with a well-behaved guest. And if two guests both turn malicious, they
can still only harm each other, which they could also do via other
channels such as the network. This is different from PCIe: once a PCIe
link is (unavoidably) enabled, a well-behaved device cannot firewall
itself from its peers, as it is up to the upstream bridge(s) to decide
the routing; with nvlink2, a GPU still has the means to protect itself,
just like a guest can run "firewalld" for the network.

Although it would be a nice feature to have an extra barrier between
GPUs, is the inability to block the links from the hypervisor still a
blocker for V100 passthrough?


-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-30  8:58                       ` Alexey Kardashevskiy
  (?)
@ 2018-07-30 16:29                         ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-07-30 16:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Mon, 30 Jul 2018 18:58:49 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
> > On Tue, 10 Jul 2018 16:37:15 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> >> On Tue, 10 Jul 2018 14:10:20 +1000
> >> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>  
> >>> On Thu, 7 Jun 2018 23:03:23 -0600
> >>> Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>     
> >>>> On Fri, 8 Jun 2018 14:14:23 +1000
> >>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>       
> >>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:        
> >>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
> >>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>           
> >>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:          
> >>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
> >>>>>>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >>>>>>>>             
> >>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:            
> >>>>>>>>>>
> >>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>>>>>>>>> connected devices makes sense?  AIUI we have a PCI view of these
> >>>>>>>>>> devices and from that perspective they're isolated.  That's the view of
> >>>>>>>>>> the device used to generate the grouping.  However, not visible to us,
> >>>>>>>>>> these devices are interconnected via NVLink.  What isolation properties
> >>>>>>>>>> does NVLink provide given that its entire purpose for existing seems to
> >>>>>>>>>> be to provide a high performance link for p2p between devices?              
> >>>>>>>>>
> >>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the device
> >>>>>>>>> and the CPU which is running significantly faster than PCIe.
> >>>>>>>>>
> >>>>>>>>> But yes, there are cross-links and those should probably be accounted
> >>>>>>>>> for in the grouping.            
> >>>>>>>>
> >>>>>>>> Then after we fix the grouping, can we just let the host driver manage
> >>>>>>>> this coherent memory range and expose vGPUs to guests?  The use case of
> >>>>>>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> >>>>>>>> convince NVIDIA to support more than a single vGPU per VM though)            
> >>>>>>>
> >>>>>>> These are physical GPUs, not virtual sriov-alike things they are
> >>>>>>> implementing as well elsewhere.          
> >>>>>>
> >>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> >>>>>> either.  That's why we have mdev devices now to implement software
> >>>>>> defined devices.  I don't have first hand experience with V-series, but
> >>>>>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.          
> >>>>>
> >>>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> >>>>> using mediated vGPUs instead, correct?        
> >>>>
> >>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
> >>>> account for lack of isolation on the NVLink side and we correct that,
> >>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
> >>>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> >>>> whether they choose to support vGPU on these GPUs or whether they can
> >>>> be convinced to support multiple vGPUs per VM.
> >>>>       
> >>>>>>> My current understanding is that every P9 chip in that box has some NVLink2
> >>>>>>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> >>>>>>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> >>>>>>> as well.
> >>>>>>>
> >>>>>>> From small bits of information I have it seems that a GPU can perfectly
> >>>>>>> work alone and if the NVIDIA driver does not see these interconnects
> >>>>>>> (because we do not pass the rest of the big 3xGPU group to this guest), it
> >>>>>>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> >>>>>>> which simply refuses to work until all 3 GPUs are passed so there is some
> >>>>>>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> >>>>>>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> >>>>>>>
> >>>>>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
> >>>>>>> interconnected group).          
> >>>>>>
> >>>>>> I'm not gaining much confidence that we can rely on isolation between
> >>>>>> NVLink connected GPUs, it sounds like you're simply expecting that
> >>>>>> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> >>>>>> is going to play nice and nobody will figure out how to do bad things
> >>>>>> because... obfuscation?  Thanks,          
> >>>>>
> >>>>> Well, we already believe that a proprietary firmware of a sriov-capable
> >>>>> adapter like Mellanox ConnextX is not doing bad things, how is this
> >>>>> different in principle?        
> >>>>
> >>>> It seems like the scope and hierarchy are different.  Here we're
> >>>> talking about exposing big discrete devices, which are peers of one
> >>>> another (and have history of being reverse engineered), to userspace
> >>>> drivers.  Once handed to userspace, each of those devices needs to be
> >>>> considered untrusted.  In the case of SR-IOV, we typically have a
> >>>> trusted host driver for the PF managing untrusted VFs.  We do rely on
> >>>> some sanity in the hardware/firmware in isolating the VFs from each
> >>>> other and from the PF, but we also often have source code for Linux
> >>>> drivers for these devices and sometimes even datasheets.  Here we have
> >>>> neither of those and perhaps we won't know the extent of the lack of
> >>>> isolation between these devices until nouveau (best case) or some
> >>>> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> >>>> of isolation between devices unless the hardware provides some
> >>>> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> >>>> wants to expose isolation on NVLink, perhaps they need to document
> >>>> enough of it that the host kernel can manipulate and test for isolation,
> >>>> perhaps even enabling virtualization of the NVLink interconnect
> >>>> interface such that the host can prevent GPUs from interfering with
> >>>> each other.  Thanks,      
> >>>
> >>>
> >>> So far I got this from NVIDIA:
> >>>
> >>> 1. An NVLink2 state can be controlled via MMIO registers, there is a
> >>> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> >>> "confidential" though) from NVIDIA with the MMIO addresses to block if
> >>> we want to disable certain links. In order to NVLink to work it needs to
> >>> be enabled on both sides so by filtering certains MMIO ranges we can
> >>> isolate a GPU.    
> >>
> >> Where are these MMIO registers, on the bridge or on the endpoint device?  
> > 
> > The endpoint GPU device.
> >   
> >> I'm wondering when you say block MMIO if these are ranges on the device
> >> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
> >> come with that or if this should essentially be device specific
> >> enable_acs and acs_enabled quirks, and maybe also potentially used by
> >> Logan's disable acs series to allow GPUs to be linked and have grouping
> >> to match.  
> > 
> > An update: I confused P100 and V100. P100 would need MMIO filtering,
> > but ours is V100, which has a couple of registers we can use to
> > disable particular links; once disabled, a link cannot be re-enabled
> > until the next secondary bus reset.
> > 
> >   
> >>> 2. We can and should also prohibit the GPU firmware update, this is
> >>> done via MMIO as well. The protocol is not open but at least register
> >>> ranges might be in order to filter these accesses, and there is no
> >>> plan to change this.    
> >>
> >> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
> >> along with it.  
> > 
> > Yes, however NVIDIA says there is nothing performance-critical on
> > this 64K page.
> >   
> >> Also, there are certainly use cases of updating
> >> firmware for an assigned device, we don't want to impose a policy, but
> >> we should figure out the right place for that policy to be specified by
> >> the admin.  
> > 
> > Maybe, but NVIDIA is talking about some "out-of-band" command to the GPU
> > to enable firmware update, so firmware update is not really supported.
> > 
> >   
> >>> 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
> >>> PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
> >>> and UT=0 for direct host memory access. UT stands for "use
> >>> translation" and this is a part of the NVLink2 protocol. Only UT=1 is
> >>> possible over the PCIe link.
> >>> This UT=0 trafic uses host physical addresses returned by a nest MMU (a
> >>> piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
> >>> mmu context id (guest userspace mm id), a virtual address and translates
> >>> to the host physical and that result is used for UT=0 DMA, this is
> >>> called "ATS" although it is not PCIe ATS afaict.
> >>> NVIDIA says that the hardware is designed in a way that it can only do
> >>> DMA UT=0 to addresses which ATS translated to, and there is no way to
> >>> override this behavior and this is what guarantees the isolation.    
> >>
> >> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
> >> endpoint requests a translation of an IOVA to physical address, the
> >> IOMMU returns a lookup based on PCIe requester ID, and there's an
> >> invalidation protocol to keep things coherent.  
> > 
> > Yes, there is. The current approach is to have an MMU notifier in
> > the kernel which tells the NPU (an IBM piece of logic between the
> > GPU/NVLink2 and the NVIDIA nest MMU) to invalidate translations; that
> > in turn pokes the GPU until it confirms that it has invalidated its
> > TLBs and there is no ongoing DMA.
> >   
> >> In the case above, who provides a guest id and mmu context id?   
> > 
> > We (the powerpc/powernv platform) configure the NPU to bind a specific
> > bus:dev:fn to an LPID (= guest id), and the MMU context id comes from
> > the guest. The nest MMU knows where the partition table is, and this
> > table contains all the pointers needed for the translation.
> > 
> >   
> >> Additional software
> >> somewhere?  Is the virtual address an IOVA or a process virtual
> >> address?   
> > 
> > A guest kernel or a guest userspace virtual address.
> >   
> >> Do we assume some sort of invalidation protocol as well?  
> > 
> > I am a little confused: is this question about the same invalidation
> > protocol as above, or a different one?
> > 
> >   
> >>> So isolation can be achieved if I do not miss something.
> >>>
> >>> How do we want this to be documented to proceed? I assume if I post
> >>> patches filtering MMIOs, this won't do it, right? If just 1..3 are
> >>> documented, will we take this t&c or we need a GPU API spec (which is
> >>> not going to happen anyway)?    
> >>
> >> "t&c"? I think we need what we're actually interacting with to be well
> >> documented, but that could be _thorough_ comments in the code, enough
> >> to understand the theory of operation, as far as I'm concerned.  A pdf
> >> lost on a corporate webserver isn't necessarily an improvement over
> >> that, but there needs to be sufficient detail to understand what we're
> >> touching such that we can maintain, adapt, and improve the code over
> >> time.  Only item #3 above appears POWER specific, so I'd hope that #1
> >> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
> >> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
> >> goes.  Thanks,  
> > 
> > Ok, understood. Thanks!  
> 
> After some local discussions, it was pointed out that force-disabling
> the nvlinks won't buy us much: for an nvlink to work, both sides need
> to enable it, so a malicious guest cannot penetrate a good one (or the
> host) unless the good guest enabled the link itself, which won't happen
> with a well-behaved guest. And if two guests both turn malicious, they
> can still only harm each other, which they could also do via other
> channels such as the network. This is different from PCIe: once a PCIe
> link is (unavoidably) enabled, a well-behaved device cannot firewall
> itself from its peers, as it is up to the upstream bridge(s) to decide
> the routing; with nvlink2, a GPU still has the means to protect itself,
> just like a guest can run "firewalld" for the network.
> 
> Although it would be a nice feature to have an extra barrier between
> GPUs, is the inability to block the links from the hypervisor still a
> blocker for V100 passthrough?

How is the NVLink configured by the guest: is it 'on'/'off' or are
specific routes configured?  If the former, then isn't a non-malicious
guest still susceptible to a malicious guest?  If the latter, how is
routing configured by the guest given that the guest view of the
topology doesn't match physical hardware?  Are these routes
deconfigured by device reset?  Are they part of the save/restore
state?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-30 16:29                         ` Alex Williamson
  (?)
@ 2018-07-31  4:03                           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-31  4:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 31/07/2018 02:29, Alex Williamson wrote:
> On Mon, 30 Jul 2018 18:58:49 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
>>> On Tue, 10 Jul 2018 16:37:15 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>   
>>>> On Tue, 10 Jul 2018 14:10:20 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>  
>>>>> On Thu, 7 Jun 2018 23:03:23 -0600
>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>     
>>>>>> On Fri, 8 Jun 2018 14:14:23 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>       
>>>>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:        
>>>>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>           
>>>>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:          
>>>>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>>>>>>>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>>>>>>>>             
>>>>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:            
>>>>>>>>>>>>
>>>>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>>>>>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>>>>>>>>> devices and from that perspective they're isolated.  That's the view of
>>>>>>>>>>>> the device used to generate the grouping.  However, not visible to us,
>>>>>>>>>>>> these devices are interconnected via NVLink.  What isolation properties
>>>>>>>>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>>>>>>>>> be to provide a high performance link for p2p between devices?              
>>>>>>>>>>>
>>>>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>>>>>>>>> and the CPU which is running significantly faster than PCIe.
>>>>>>>>>>>
>>>>>>>>>>> But yes, there are cross-links and those should probably be accounted
>>>>>>>>>>> for in the grouping.            
>>>>>>>>>>
>>>>>>>>>> Then after we fix the grouping, can we just let the host driver manage
>>>>>>>>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>>>>>>>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>>>>>>>>> convince NVIDIA to support more than a single vGPU per VM though)            
>>>>>>>>>
>>>>>>>>> These are physical GPUs, not virtual sriov-alike things they are
>>>>>>>>> implementing as well elsewhere.          
>>>>>>>>
>>>>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
>>>>>>>> either.  That's why we have mdev devices now to implement software
>>>>>>>> defined devices.  I don't have first hand experience with V-series, but
>>>>>>>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.          
>>>>>>>
>>>>>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
>>>>>>> using mediated vGPUs instead, correct?        
>>>>>>
>>>>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
>>>>>> account for lack of isolation on the NVLink side and we correct that,
>>>>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
>>>>>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
>>>>>> whether they choose to support vGPU on these GPUs or whether they can
>>>>>> be convinced to support multiple vGPUs per VM.
>>>>>>       
>>>>>>>>> My current understanding is that every P9 chip in that box has some NVLink2
>>>>>>>>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>>>>>>>>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>>>>>>>>> as well.
>>>>>>>>>
>>>>>>>>> From small bits of information I have it seems that a GPU can perfectly
>>>>>>>>> work alone and if the NVIDIA driver does not see these interconnects
>>>>>>>>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>>>>>>>>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>>>>>>>>> which simply refuses to work until all 3 GPUs are passed so there is some
>>>>>>>>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>>>>>>>>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>>>>>>>>
>>>>>>>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>>>>>>>>> interconnected group).          
>>>>>>>>
>>>>>>>> I'm not gaining much confidence that we can rely on isolation between
>>>>>>>> NVLink connected GPUs, it sounds like you're simply expecting that
>>>>>>>> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
>>>>>>>> is going to play nice and nobody will figure out how to do bad things
>>>>>>>> because... obfuscation?  Thanks,          
>>>>>>>
>>>>>>> Well, we already believe that a proprietary firmware of a sriov-capable
>>>>>>> adapter like Mellanox ConnextX is not doing bad things, how is this
>>>>>>> different in principle?        
>>>>>>
>>>>>> It seems like the scope and hierarchy are different.  Here we're
>>>>>> talking about exposing big discrete devices, which are peers of one
>>>>>> another (and have history of being reverse engineered), to userspace
>>>>>> drivers.  Once handed to userspace, each of those devices needs to be
>>>>>> considered untrusted.  In the case of SR-IOV, we typically have a
>>>>>> trusted host driver for the PF managing untrusted VFs.  We do rely on
>>>>>> some sanity in the hardware/firmware in isolating the VFs from each
>>>>>> other and from the PF, but we also often have source code for Linux
>>>>>> drivers for these devices and sometimes even datasheets.  Here we have
>>>>>> neither of those and perhaps we won't know the extent of the lack of
>>>>>> isolation between these devices until nouveau (best case) or some
>>>>>> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
>>>>>> of isolation between devices unless the hardware provides some
>>>>>> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
>>>>>> wants to expose isolation on NVLink, perhaps they need to document
>>>>>> enough of it that the host kernel can manipulate and test for isolation,
>>>>>> perhaps even enabling virtualization of the NVLink interconnect
>>>>>> interface such that the host can prevent GPUs from interfering with
>>>>>> each other.  Thanks,      
>>>>>
>>>>>
>>>>> So far I got this from NVIDIA:
>>>>>
>>>>> 1. An NVLink2 state can be controlled via MMIO registers, there is a
>>>>> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
>>>>> "confidential" though) from NVIDIA with the MMIO addresses to block if
>>>>> we want to disable certain links. In order to NVLink to work it needs to
>>>>> be enabled on both sides so by filtering certains MMIO ranges we can
>>>>> isolate a GPU.    
>>>>
>>>> Where are these MMIO registers, on the bridge or on the endpoint device?  
>>>
>>> The endpoint GPU device.
>>>   
>>>> I'm wondering when you say block MMIO if these are ranges on the device
>>>> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
>>>> come with that or if this should essentially be device specific
>>>> enable_acs and acs_enabled quirks, and maybe also potentially used by
>>>> Logan's disable acs series to allow GPUs to be linked and have grouping
>>>> to match.  
>>>
>>> An update, I confused P100 and V100, P100 would need filtering but
>>> ours is V100 and it has a couple of registers which we can use to
>>> disable particular links and once disabled, the link cannot be
>>> re-enabled till the next secondary bus reset.
>>>
>>>   
>>>>> 2. We can and should also prohibit the GPU firmware update, this is
>>>>> done via MMIO as well. The protocol is not open but at least register
>>>>> ranges might be in order to filter these accesses, and there is no
>>>>> plan to change this.    
>>>>
>>>> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
>>>> along with it.  
>>>
>>> Yes, however NVIDIA says there is no performance critical stuff with
>>> this 64K page.
>>>   
>>>> Also, there are certainly use cases of updating
>>>> firmware for an assigned device, we don't want to impose a policy, but
>>>> we should figure out the right place for that policy to be specified by
>>>> the admin.  
>>>
>>> May be but NVIDIA is talking about some "out-of-band" command to the GPU
>>> to enable firmware update so firmware update is not really supported.
>>>
>>>   
>>>>> 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
>>>>> PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
>>>>> and UT=0 for direct host memory access. UT stands for "use
>>>>> translation" and this is a part of the NVLink2 protocol. Only UT=1 is
>>>>> possible over the PCIe link.
>>>>> This UT=0 trafic uses host physical addresses returned by a nest MMU (a
>>>>> piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
>>>>> mmu context id (guest userspace mm id), a virtual address and translates
>>>>> to the host physical and that result is used for UT=0 DMA, this is
>>>>> called "ATS" although it is not PCIe ATS afaict.
>>>>> NVIDIA says that the hardware is designed in a way that it can only do
>>>>> DMA UT=0 to addresses which ATS translated to, and there is no way to
>>>>> override this behavior and this is what guarantees the isolation.    
>>>>
>>>> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
>>>> endpoint requests a translation of an IOVA to physical address, the
>>>> IOMMU returns a lookup based on PCIe requester ID, and there's an
>>>> invalidation protocol to keep things coherent.  
>>>
>>> Yes there is. The current approach is to have an MMU notifier in
>>> the kernel which tells an NPU (IBM piece of logic between GPU/NVlink2
>>> and NVIDIA nest MMU) to invalidate translations and that in turn pokes
>>> the GPU till that confirms that it invalidated tlbs and there is no
>>> ongoing DMA.
>>>   
>>>> In the case above, who provides a guest id and mmu context id?   
>>>
>>> We (powerpc/powernv platform) configure NPU to bind specific bus:dev:fn to
>>> an LPID (== guest id) and MMU context id comes from the guest. The nest
>>> MMU knows where the partition table and this table contains all the
>>> pointers needs for the translation.
>>>
>>>   
>>>> Additional software
>>>> somewhere?  Is the virtual address an IOVA or a process virtual
>>>> address?   
>>>
>>> A guest kernel or a guest userspace virtual address.
>>>   
>>>> Do we assume some sort of invalidation protocol as well?  
>>>
>>> I am little confused, is this question about the same invalidation
>>> protocol as above or different?
>>>
>>>   
>>>>> So isolation can be achieved if I do not miss something.
>>>>>
>>>>> How do we want this to be documented to proceed? I assume if I post
>>>>> patches filtering MMIOs, this won't do it, right? If just 1..3 are
>>>>> documented, will we take this t&c or we need a GPU API spec (which is
>>>>> not going to happen anyway)?    
>>>>
>>>> "t&c"? I think we need what we're actually interacting with to be well
>>>> documented, but that could be _thorough_ comments in the code, enough
>>>> to understand the theory of operation, as far as I'm concerned.  A pdf
>>>> lost on a corporate webserver isn't necessarily an improvement over
>>>> that, but there needs to be sufficient detail to understand what we're
>>>> touching such that we can maintain, adapt, and improve the code over
>>>> time.  Only item #3 above appears POWER specific, so I'd hope that #1
>>>> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
>>>> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
>>>> goes.  Thanks,  
>>>
>>> Ok, understood. Thanks!  
>>
>> After some local discussions, it was pointed out that force-disabling
>> nvlinks would not buy us much: for an nvlink to work, both sides need
>> to enable it, so a malicious guest cannot penetrate a good guest (or
>> the host) unless that good guest enabled the link itself, which a well
>> behaving guest will not do. And if two guests both turn malicious, they
>> can still only harm each other, which they could do anyway via other
>> means such as the network. This is different from PCIe: once a PCIe
>> link is unavoidably enabled, a well behaving device cannot firewall
>> itself from its peers as the routing is decided by the upstream
>> bridge(s); with nvlink2, a GPU still has means to protect itself, just
>> like a guest can run "firewalld" for the network.
>>
>> Although it would be a nice feature to have an extra barrier between
>> GPUs, is the inability to block the links in the hypervisor still a
>> blocker for V100 pass through?
> 
> How is the NVLink configured by the guest, is it 'on'/'off' or are
> specific routes configured? 

The GPU-GPU links do not need to be blocked; they need to be enabled
(== trained) by a driver in the guest. There are no routes between GPUs
in the NVLink fabric - these are direct point-to-point links. There is
simply a switch on each side, and both switches need to be on for a
link to work.

For the GPU-CPU links, the GPU end is the same switch as described
above, while the CPU NVLink state is controlled via the emulated PCI
bridges which I pass through together with the GPU.
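
To make the two-sided model above concrete, here is a minimal sketch in
C (not part of the patchset; all structure and function names are
illustrative assumptions, not real driver code) of how a link's
usability follows from the state of its two endpoints, including the
"disabled until the next secondary bus reset" behaviour mentioned
earlier for V100:

#include <stdbool.h>

/* One side of an NVLink2 link: a GPU, or the CPU/NPU end. */
struct nvlink2_endpoint {
	bool enabled;	/* switch trained by this side's driver */
	bool blocked;	/* force-disabled; sticky until secondary bus reset */
};

/* A direct point-to-point link, GPU-GPU or GPU-CPU. */
struct nvlink2_link {
	struct nvlink2_endpoint side[2];
};

/* Traffic flows only when both sides are enabled and neither is blocked. */
static bool nvlink2_link_usable(const struct nvlink2_link *l)
{
	return l->side[0].enabled && !l->side[0].blocked &&
	       l->side[1].enabled && !l->side[1].blocked;
}

/*
 * A guest that never trains its side of a link to an untrusted peer
 * cannot be reached over that link, whatever the peer does.
 */
static void nvlink2_endpoint_block(struct nvlink2_endpoint *ep)
{
	ep->enabled = false;
	ep->blocked = true;	/* cleared only by a secondary bus reset */
}

static void nvlink2_secondary_bus_reset(struct nvlink2_link *l)
{
	l->side[0].blocked = false;
	l->side[1].blocked = false;
}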


> If the former, then isn't a non-malicious
> guest still susceptible to a malicious guest?

Only if the non-malicious guest itself turns its switch on for a link
to a GPU which belongs to the malicious guest; as long as it leaves
that switch off, the link stays down and the malicious guest cannot
reach it over NVLink.

> If the latter, how is
> routing configured by the guest given that the guest view of the
> topology doesn't match physical hardware?  Are these routes
> deconfigured by device reset?  Are they part of the save/restore
> state?  Thanks,





-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-07-31  4:03                           ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-31  4:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 31/07/2018 02:29, Alex Williamson wrote:
> On Mon, 30 Jul 2018 18:58:49 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
>>> On Tue, 10 Jul 2018 16:37:15 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>   
>>>> On Tue, 10 Jul 2018 14:10:20 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>  
>>>>> On Thu, 7 Jun 2018 23:03:23 -0600
>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>     
>>>>>> On Fri, 8 Jun 2018 14:14:23 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>       
>>>>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:        
>>>>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>           
>>>>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:          
>>>>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>>>>>>>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>>>>>>>>             
>>>>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:            
>>>>>>>>>>>>
>>>>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>>>>>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>>>>>>>>> devices and from that perspective they're isolated.  That's the view of
>>>>>>>>>>>> the device used to generate the grouping.  However, not visible to us,
>>>>>>>>>>>> these devices are interconnected via NVLink.  What isolation properties
>>>>>>>>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>>>>>>>>> be to provide a high performance link for p2p between devices?              
>>>>>>>>>>>
>>>>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>>>>>>>>> and the CPU which is running significantly faster than PCIe.
>>>>>>>>>>>
>>>>>>>>>>> But yes, there are cross-links and those should probably be accounted
>>>>>>>>>>> for in the grouping.            
>>>>>>>>>>
>>>>>>>>>> Then after we fix the grouping, can we just let the host driver manage
>>>>>>>>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>>>>>>>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>>>>>>>>> convince NVIDIA to support more than a single vGPU per VM though)            
>>>>>>>>>
>>>>>>>>> These are physical GPUs, not virtual sriov-alike things they are
>>>>>>>>> implementing as well elsewhere.          
>>>>>>>>
>>>>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
>>>>>>>> either.  That's why we have mdev devices now to implement software
>>>>>>>> defined devices.  I don't have first hand experience with V-series, but
>>>>>>>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.          
>>>>>>>
>>>>>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
>>>>>>> using mediated vGPUs instead, correct?        
>>>>>>
>>>>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
>>>>>> account for lack of isolation on the NVLink side and we correct that,
>>>>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
>>>>>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
>>>>>> whether they choose to support vGPU on these GPUs or whether they can
>>>>>> be convinced to support multiple vGPUs per VM.
>>>>>>       
>>>>>>>>> My current understanding is that every P9 chip in that box has some NVLink2
>>>>>>>>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>>>>>>>>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>>>>>>>>> as well.
>>>>>>>>>
>>>>>>>>> From small bits of information I have it seems that a GPU can perfectly
>>>>>>>>> work alone and if the NVIDIA driver does not see these interconnects
>>>>>>>>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>>>>>>>>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>>>>>>>>> which simply refuses to work until all 3 GPUs are passed so there is some
>>>>>>>>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>>>>>>>>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>>>>>>>>
>>>>>>>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>>>>>>>>> interconnected group).          
>>>>>>>>
>>>>>>>> I'm not gaining much confidence that we can rely on isolation between
>>>>>>>> NVLink connected GPUs, it sounds like you're simply expecting that
>>>>>>>> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
>>>>>>>> is going to play nice and nobody will figure out how to do bad things
>>>>>>>> because... obfuscation?  Thanks,          
>>>>>>>
>>>>>>> Well, we already believe that a proprietary firmware of a sriov-capable
>>>>>>> adapter like Mellanox ConnextX is not doing bad things, how is this
>>>>>>> different in principle?        
>>>>>>
>>>>>> It seems like the scope and hierarchy are different.  Here we're
>>>>>> talking about exposing big discrete devices, which are peers of one
>>>>>> another (and have history of being reverse engineered), to userspace
>>>>>> drivers.  Once handed to userspace, each of those devices needs to be
>>>>>> considered untrusted.  In the case of SR-IOV, we typically have a
>>>>>> trusted host driver for the PF managing untrusted VFs.  We do rely on
>>>>>> some sanity in the hardware/firmware in isolating the VFs from each
>>>>>> other and from the PF, but we also often have source code for Linux
>>>>>> drivers for these devices and sometimes even datasheets.  Here we have
>>>>>> neither of those and perhaps we won't know the extent of the lack of
>>>>>> isolation between these devices until nouveau (best case) or some
>>>>>> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
>>>>>> of isolation between devices unless the hardware provides some
>>>>>> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
>>>>>> wants to expose isolation on NVLink, perhaps they need to document
>>>>>> enough of it that the host kernel can manipulate and test for isolation,
>>>>>> perhaps even enabling virtualization of the NVLink interconnect
>>>>>> interface such that the host can prevent GPUs from interfering with
>>>>>> each other.  Thanks,      
>>>>>
>>>>>
>>>>> So far I got this from NVIDIA:
>>>>>
>>>>> 1. An NVLink2 state can be controlled via MMIO registers, there is a
>>>>> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
>>>>> "confidential" though) from NVIDIA with the MMIO addresses to block if
>>>>> we want to disable certain links. In order to NVLink to work it needs to
>>>>> be enabled on both sides so by filtering certains MMIO ranges we can
>>>>> isolate a GPU.    
>>>>
>>>> Where are these MMIO registers, on the bridge or on the endpoint device?  
>>>
>>> The endpoint GPU device.
>>>   
>>>> I'm wondering when you say block MMIO if these are ranges on the device
>>>> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
>>>> come with that or if this should essentially be device specific
>>>> enable_acs and acs_enabled quirks, and maybe also potentially used by
>>>> Logan's disable acs series to allow GPUs to be linked and have grouping
>>>> to match.  
>>>
>>> An update, I confused P100 and V100, P100 would need filtering but
>>> ours is V100 and it has a couple of registers which we can use to
>>> disable particular links and once disabled, the link cannot be
>>> re-enabled till the next secondary bus reset.
>>>
>>>   
>>>>> 2. We can and should also prohibit the GPU firmware update, this is
>>>>> done via MMIO as well. The protocol is not open but at least register
>>>>> ranges might be in order to filter these accesses, and there is no
>>>>> plan to change this.    
>>>>
>>>> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
>>>> along with it.  
>>>
>>> Yes, however NVIDIA says there is no performance critical stuff with
>>> this 64K page.
>>>   
>>>> Also, there are certainly use cases of updating
>>>> firmware for an assigned device, we don't want to impose a policy, but
>>>> we should figure out the right place for that policy to be specified by
>>>> the admin.  
>>>
>>> May be but NVIDIA is talking about some "out-of-band" command to the GPU
>>> to enable firmware update so firmware update is not really supported.
>>>
>>>   
>>>>> 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
>>>>> PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
>>>>> and UT=0 for direct host memory access. UT stands for "use
>>>>> translation" and this is a part of the NVLink2 protocol. Only UT=1 is
>>>>> possible over the PCIe link.
>>>>> This UT=0 trafic uses host physical addresses returned by a nest MMU (a
>>>>> piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
>>>>> mmu context id (guest userspace mm id), a virtual address and translates
>>>>> to the host physical and that result is used for UT=0 DMA, this is
>>>>> called "ATS" although it is not PCIe ATS afaict.
>>>>> NVIDIA says that the hardware is designed in a way that it can only do
>>>>> DMA UT=0 to addresses which ATS translated to, and there is no way to
>>>>> override this behavior and this is what guarantees the isolation.    
>>>>
>>>> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
>>>> endpoint requests a translation of an IOVA to physical address, the
>>>> IOMMU returns a lookup based on PCIe requester ID, and there's an
>>>> invalidation protocol to keep things coherent.  
>>>
>>> Yes there is. The current approach is to have an MMU notifier in
>>> the kernel which tells an NPU (IBM piece of logic between GPU/NVlink2
>>> and NVIDIA nest MMU) to invalidate translations and that in turn pokes
>>> the GPU till that confirms that it invalidated tlbs and there is no
>>> ongoing DMA.
>>>   
>>>> In the case above, who provides a guest id and mmu context id?   
>>>
>>> We (powerpc/powernv platform) configure NPU to bind specific bus:dev:fn to
>>> an LPID (= guest id) and MMU context id comes from the guest. The nest
>>> MMU knows where the partition table and this table contains all the
>>> pointers needs for the translation.
>>>
>>>   
>>>> Additional software
>>>> somewhere?  Is the virtual address an IOVA or a process virtual
>>>> address?   
>>>
>>> A guest kernel or a guest userspace virtual address.
>>>   
>>>> Do we assume some sort of invalidation protocol as well?  
>>>
>>> I am little confused, is this question about the same invalidation
>>> protocol as above or different?
>>>
>>>   
>>>>> So isolation can be achieved if I do not miss something.
>>>>>
>>>>> How do we want this to be documented to proceed? I assume if I post
>>>>> patches filtering MMIOs, this won't do it, right? If just 1..3 are
>>>>> documented, will we take this t&c or we need a GPU API spec (which is
>>>>> not going to happen anyway)?    
>>>>
>>>> "t&c"? I think we need what we're actually interacting with to be well
>>>> documented, but that could be _thorough_ comments in the code, enough
>>>> to understand the theory of operation, as far as I'm concerned.  A pdf
>>>> lost on a corporate webserver isn't necessarily an improvement over
>>>> that, but there needs to be sufficient detail to understand what we're
>>>> touching such that we can maintain, adapt, and improve the code over
>>>> time.  Only item #3 above appears POWER specific, so I'd hope that #1
>>>> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
>>>> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
>>>> goes.  Thanks,  
>>>
>>> Ok, understood. Thanks!  
>>
>> After some local discussions, it was pointed out that force disabling
>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>> enable it so malicious guests cannot penetrate good ones (or a host)
>> unless a good guest enabled the link but won't happen with a well
>> behaving guest. And if two guests became malicious, then can still only
>> harm each other, and so can they via other ways such network. This is
>> different from PCIe as once PCIe link is unavoidably enabled, a well
>> behaving device cannot firewall itself from peers as it is up to the
>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>> has means to protect itself, just like a guest can run "firewalld" for
>> network.
>>
>> Although it would be a nice feature to have an extra barrier between
>> GPUs, is inability to block the links in hypervisor still a blocker for
>> V100 pass through?
> 
> How is the NVLink configured by the guest, is it 'on'/'off' or are
> specific routes configured? 

The GPU-GPU links do not need to be blocked, but they do need to be
enabled (=trained) by a driver in the guest. There are no routes between
GPUs in the NVLink fabric - these are direct links with just a switch on
each side, and both switches need to be on for a link to work.

For the GPU-CPU links, the GPU end is the same kind of switch; the CPU
NVLink state is controlled via the emulated PCI bridges which I pass
through together with the GPU.
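
To spell out the both-switches rule, a throwaway model in C - purely
illustrative, nothing to do with the real registers:

#include <stdbool.h>
#include <stdio.h>

/* One NVLink2 link: each end (GPU or CPU/NPU side) has its own enable
 * switch, and the link can only train when both switches are on. */
struct link_end { bool enabled; };
struct nvlink   { struct link_end a, b; };

static bool link_can_train(const struct nvlink *l)
{
        return l->a.enabled && l->b.enabled;
}

int main(void)
{
        struct nvlink l = { .a = { true }, .b = { false } };

        printf("trainable: %d\n", link_can_train(&l)); /* 0: peer is off */
        l.b.enabled = true;
        printf("trainable: %d\n", link_can_train(&l)); /* 1: both ends on */
        return 0;
}

So a guest flipping only its own switch cannot bring a link up by itself.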


> If the former, then isn't a non-malicious
> guest still susceptible to a malicious guest?

A non-malicious guest needs to turn its switch on for a link to a GPU
which belongs to a malicious guest.

> If the latter, how is
> routing configured by the guest given that the guest view of the
> topology doesn't match physical hardware?  Are these routes
> deconfigured by device reset?  Are they part of the save/restore
> state?  Thanks,





-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-31  4:03                           ` Alexey Kardashevskiy
  (?)
@ 2018-07-31 14:29                             ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-07-31 14:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Tue, 31 Jul 2018 14:03:35 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 31/07/2018 02:29, Alex Williamson wrote:
> > On Mon, 30 Jul 2018 18:58:49 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >> After some local discussions, it was pointed out that force disabling
> >> nvlinks won't bring us much as for an nvlink to work, both sides need to
> >> enable it so malicious guests cannot penetrate good ones (or a host)
> >> unless a good guest enabled the link but won't happen with a well
> >> behaving guest. And if two guests became malicious, then can still only
> >> harm each other, and so can they via other ways such network. This is
> >> different from PCIe as once PCIe link is unavoidably enabled, a well
> >> behaving device cannot firewall itself from peers as it is up to the
> >> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
> >> has means to protect itself, just like a guest can run "firewalld" for
> >> network.
> >>
> >> Although it would be a nice feature to have an extra barrier between
> >> GPUs, is inability to block the links in hypervisor still a blocker for
> >> V100 pass through?  
> > 
> > How is the NVLink configured by the guest, is it 'on'/'off' or are
> > specific routes configured?   
> 
> The GPU-GPU links need not to be blocked and need to be enabled
> (==trained) by a driver in the guest. There are no routes between GPUs
> in NVLink fabric, these are direct links, it is just a switch on each
> side, both switches need to be on for a link to work.

Ok, but there is at least the possibility of multiple direct links per
GPU, the very first diagram I find of NVlink shows 8 interconnected
GPUs:

https://www.nvidia.com/en-us/data-center/nvlink/

So if each switch enables one direct, point to point link, how does the
guest know which links to open for which peer device?  And of course
since we can't see the spec, a security audit is at best hearsay :-\
 
> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> is controlled via the emulated PCI bridges which I pass through together
> with the GPU.

So there's a special emulated switch, is that how the guest knows which
GPUs it can enable NVLinks to?

> > If the former, then isn't a non-malicious
> > guest still susceptible to a malicious guest?  
> 
> A non-malicious guest needs to turn its switch on for a link to a GPU
> which belongs to a malicious guest.

Actual security, or obfuscation, will we ever know...

> > If the latter, how is
> > routing configured by the guest given that the guest view of the
> > topology doesn't match physical hardware?  Are these routes
> > deconfigured by device reset?  Are they part of the save/restore
> > state?  Thanks,  

Still curious what happens to these routes on reset.  Can a later user
of a GPU inherit a device where the links are already enabled?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-31 14:29                             ` Alex Williamson
  (?)
@ 2018-08-01  8:37                               ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-08-01  8:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 01/08/2018 00:29, Alex Williamson wrote:
> On Tue, 31 Jul 2018 14:03:35 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 31/07/2018 02:29, Alex Williamson wrote:
>>> On Mon, 30 Jul 2018 18:58:49 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>> After some local discussions, it was pointed out that force disabling
>>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>>>> enable it so malicious guests cannot penetrate good ones (or a host)
>>>> unless a good guest enabled the link but won't happen with a well
>>>> behaving guest. And if two guests became malicious, then can still only
>>>> harm each other, and so can they via other ways such network. This is
>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
>>>> behaving device cannot firewall itself from peers as it is up to the
>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>>>> has means to protect itself, just like a guest can run "firewalld" for
>>>> network.
>>>>
>>>> Although it would be a nice feature to have an extra barrier between
>>>> GPUs, is inability to block the links in hypervisor still a blocker for
>>>> V100 pass through?  
>>>
>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>>> specific routes configured?   
>>
>> The GPU-GPU links need not to be blocked and need to be enabled
>> (==trained) by a driver in the guest. There are no routes between GPUs
>> in NVLink fabric, these are direct links, it is just a switch on each
>> side, both switches need to be on for a link to work.
> 
> Ok, but there is at least the possibility of multiple direct links per
> GPU, the very first diagram I find of NVlink shows 8 interconnected
> GPUs:
> 
> https://www.nvidia.com/en-us/data-center/nvlink/

Our design is like the left part of the picture, but that is just a detail.

> So if each switch enables one direct, point to point link, how does the
> guest know which links to open for which peer device?

It uses PCI config space on GPUs to discover the topology.
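
For reference, the generic mechanism from userspace is just walking the
standard capability list exposed through sysfs; how the topology is
actually encoded there is not public, so this only shows where a driver
would look, nothing more (run as root to read past the standard header):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        uint8_t cfg[256];
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s /sys/bus/pci/devices/<BDF>/config\n",
                        argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || read(fd, cfg, sizeof(cfg)) != (ssize_t)sizeof(cfg)) {
                perror(argv[1]);
                return 1;
        }
        close(fd);

        if (!(cfg[0x06] & 0x10)) {      /* Status register: capability list bit */
                fprintf(stderr, "no capability list\n");
                return 1;
        }
        /* The list starts at offset 0x34; each entry is <id, next pointer> */
        for (int off = cfg[0x34] & ~3, n = 0; off && n < 48; n++) {
                printf("cap id 0x%02x at 0x%02x%s\n", cfg[off], off,
                       cfg[off] == 0x09 ? " (vendor specific)" : "");
                off = cfg[off + 1] & ~3;
        }
        return 0;
}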

> And of course
> since we can't see the spec, a security audit is at best hearsay :-\

Yup, the exact discovery protocol is hidden.


>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
>> is controlled via the emulated PCI bridges which I pass through together
>> with the GPU.
> 
> So there's a special emulated switch, is that how the guest knows which
> GPUs it can enable NVLinks to?

Since it only has PCI config space (there is nothing relevant in the
device tree at all), I assume (double checking with the NVIDIA folks
now) the guest driver enables them all, tests which pair works and
disables the ones which do not. This gives a malicious guest a tiny
window of opportunity to break into a good guest. Hm :-/
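
Purely as a strawman of that assumed sequence (none of these helpers
exist anywhere, the "trained" results are faked so it runs), the first
loop is exactly the window in question:

#include <stdbool.h>
#include <stdio.h>

#define NLINKS 6

static bool peer_responded[NLINKS] = { true, true, false, false, true, false };

static void enable_link(int i)  { printf("enable  link %d\n", i); }
static void disable_link(int i) { printf("disable link %d\n", i); }
static bool link_trained(int i) { return peer_responded[i]; }

int main(void)
{
        for (int i = 0; i < NLINKS; i++)
                enable_link(i);          /* everything on while probing */
        for (int i = 0; i < NLINKS; i++)
                if (!link_trained(i))
                        disable_link(i); /* drop links with no working peer */
        return 0;
}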


>>> If the former, then isn't a non-malicious
>>> guest still susceptible to a malicious guest?  
>>
>> A non-malicious guest needs to turn its switch on for a link to a GPU
>> which belongs to a malicious guest.
> 
> Actual security, or obfuscation, will we ever know...
>>>> If the latter, how is
>>> routing configured by the guest given that the guest view of the
>>> topology doesn't match physical hardware?  Are these routes
>>> deconfigured by device reset?  Are they part of the save/restore
>>> state?  Thanks,  
> 
> Still curious what happens to these routes on reset.  Can a later user
> of a GPU inherit a device where the links are already enabled?  Thanks,

I am told that the GPU reset disables links. As a side effect, we get an
HMI (a hardware fault which resets the host machine) when trying to
access the GPU RAM, which indicates that the link is down as the memory
is only accessible via the NVLink. We have special fencing code in our
host firmware (skiboot) to fence this memory on PCI reset so reading
from it returns zeroes instead of HMIs.



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-08-01  8:37                               ` Alexey Kardashevskiy
  (?)
@ 2018-08-01 16:16                                 ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-08-01 16:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Wed, 1 Aug 2018 18:37:35 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 01/08/2018 00:29, Alex Williamson wrote:
> > On Tue, 31 Jul 2018 14:03:35 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 31/07/2018 02:29, Alex Williamson wrote:  
> >>> On Mon, 30 Jul 2018 18:58:49 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >>>> After some local discussions, it was pointed out that force disabling
> >>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
> >>>> enable it so malicious guests cannot penetrate good ones (or a host)
> >>>> unless a good guest enabled the link but won't happen with a well
> >>>> behaving guest. And if two guests became malicious, then can still only
> >>>> harm each other, and so can they via other ways such network. This is
> >>>> different from PCIe as once PCIe link is unavoidably enabled, a well
> >>>> behaving device cannot firewall itself from peers as it is up to the
> >>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
> >>>> has means to protect itself, just like a guest can run "firewalld" for
> >>>> network.
> >>>>
> >>>> Although it would be a nice feature to have an extra barrier between
> >>>> GPUs, is inability to block the links in hypervisor still a blocker for
> >>>> V100 pass through?    
> >>>
> >>> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >>> specific routes configured?     
> >>
> >> The GPU-GPU links need not to be blocked and need to be enabled
> >> (==trained) by a driver in the guest. There are no routes between GPUs
> >> in NVLink fabric, these are direct links, it is just a switch on each
> >> side, both switches need to be on for a link to work.  
> > 
> > Ok, but there is at least the possibility of multiple direct links per
> > GPU, the very first diagram I find of NVlink shows 8 interconnected
> > GPUs:
> > 
> > https://www.nvidia.com/en-us/data-center/nvlink/  
> 
> Out design is like the left part of the picture but it is just a detail.

Unless we can specifically identify a direct link vs a mesh link, we
shouldn't be making assumptions about the degree of interconnect.
 
> > So if each switch enables one direct, point to point link, how does the
> > guest know which links to open for which peer device?  
> 
> It uses PCI config space on GPUs to discover the topology.

So do we need to virtualize this config space if we're going to
virtualize the topology?

> > And of course
> > since we can't see the spec, a security audit is at best hearsay :-\  
> 
> Yup, the exact discovery protocol is hidden.

It could be reverse engineered...

> >> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> >> is controlled via the emulated PCI bridges which I pass through together
> >> with the GPU.  
> > 
> > So there's a special emulated switch, is that how the guest knows which
> > GPUs it can enable NVLinks to?  
> 
> Since it only has PCI config space (there is nothing relevant in the
> device tree at all), I assume (double checking with the NVIDIA folks
> now) the guest driver enables them all, tests which pair works and
> disables the ones which do not. This gives a malicious guest a tiny
> window of opportunity to break into a good guest. Hm :-/

Let's not minimize that window, that seems like a prime candidate for
an exploit.

> >>> If the former, then isn't a non-malicious
> >>> guest still susceptible to a malicious guest?    
> >>
> >> A non-malicious guest needs to turn its switch on for a link to a GPU
> >> which belongs to a malicious guest.  
> > 
> > Actual security, or obfuscation, will we ever know...  
> >>>> If the latter, how is  
> >>> routing configured by the guest given that the guest view of the
> >>> topology doesn't match physical hardware?  Are these routes
> >>> deconfigured by device reset?  Are they part of the save/restore
> >>> state?  Thanks,    
> > 
> > Still curious what happens to these routes on reset.  Can a later user
> > of a GPU inherit a device where the links are already enabled?  Thanks,  
> 
> I am told that the GPU reset disables links. As a side effect, we get an
> HMI (a hardware fault which reset the host machine) when trying
> accessing the GPU RAM which indicates that the link is down as the
> memory is only accessible via the nvlink. We have special fencing code
> in our host firmware (skiboot) to fence this memory on PCI reset so
> reading from it returns zeroes instead of HMIs.

What sort of reset is required for this?  Typically we rely on
secondary bus reset for GPUs, but it would be a problem if GPUs were to
start implementing FLR and nobody had a spec to learn that FLR maybe
didn't disable the link.  The better approach to me still seems to be
virtualizing these NVLink config registers to an extent that the user
can only enable links where they have ownership of both ends of the
connection.
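
To be concrete, the check I have in mind is roughly the following; the
structures are entirely made up since the real NVLink register layout
isn't public:

#include <stdbool.h>
#include <stdio.h>

#define MAX_LINKS 6

/* Invented representation: each assigned GPU records which user owns it
 * and which device sits on the far end of each of its links. */
struct gpu_dev {
        int owner;                      /* the user/VM holding the device */
        struct gpu_dev *peer[MAX_LINKS];
};

/* Allow a (virtualized) write of a per-link enable bit only when the
 * same user owns both ends of that link. */
static bool nvlink_enable_allowed(const struct gpu_dev *gpu, unsigned int link)
{
        const struct gpu_dev *peer;

        if (link >= MAX_LINKS || !(peer = gpu->peer[link]))
                return false;
        return peer->owner == gpu->owner;
}

int main(void)
{
        struct gpu_dev a = { .owner = 1 }, b = { .owner = 2 };

        a.peer[0] = &b;
        b.peer[0] = &a;
        printf("allowed: %d\n", nvlink_enable_allowed(&a, 0)); /* 0 */
        b.owner = 1;                    /* same user now owns both ends */
        printf("allowed: %d\n", nvlink_enable_allowed(&a, 0)); /* 1 */
        return 0;
}

Thanks,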

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-08-01 16:16                                 ` Alex Williamson
  (?)
@ 2018-08-08  8:39                                   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-08-08  8:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 02/08/2018 02:16, Alex Williamson wrote:
> On Wed, 1 Aug 2018 18:37:35 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 01/08/2018 00:29, Alex Williamson wrote:
>>> On Tue, 31 Jul 2018 14:03:35 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 31/07/2018 02:29, Alex Williamson wrote:  
>>>>> On Mon, 30 Jul 2018 18:58:49 +1000
>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
>>>>>> After some local discussions, it was pointed out that force disabling
>>>>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>>>>>> enable it so malicious guests cannot penetrate good ones (or a host)
>>>>>> unless a good guest enabled the link but won't happen with a well
>>>>>> behaving guest. And if two guests became malicious, then can still only
>>>>>> harm each other, and so can they via other ways such network. This is
>>>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
>>>>>> behaving device cannot firewall itself from peers as it is up to the
>>>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>>>>>> has means to protect itself, just like a guest can run "firewalld" for
>>>>>> network.
>>>>>>
>>>>>> Although it would be a nice feature to have an extra barrier between
>>>>>> GPUs, is inability to block the links in hypervisor still a blocker for
>>>>>> V100 pass through?    
>>>>>
>>>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>>>>> specific routes configured?     
>>>>
>>>> The GPU-GPU links need not to be blocked and need to be enabled
>>>> (==trained) by a driver in the guest. There are no routes between GPUs
>>>> in NVLink fabric, these are direct links, it is just a switch on each
>>>> side, both switches need to be on for a link to work.  
>>>
>>> Ok, but there is at least the possibility of multiple direct links per
>>> GPU, the very first diagram I find of NVlink shows 8 interconnected
>>> GPUs:
>>>
>>> https://www.nvidia.com/en-us/data-center/nvlink/  
>>
>> Out design is like the left part of the picture but it is just a detail.
> 
> Unless we can specifically identify a direct link vs a mesh link, we
> shouldn't be making assumptions about the degree of interconnect.
>  
>>> So if each switch enables one direct, point to point link, how does the
>>> guest know which links to open for which peer device?  
>>
>> It uses PCI config space on GPUs to discover the topology.
> 
> So do we need to virtualize this config space if we're going to
> virtualize the topology?
> 
>>> And of course
>>> since we can't see the spec, a security audit is at best hearsay :-\  
>>
>> Yup, the exact discovery protocol is hidden.
> 
> It could be reverse engineered...
> 
>>>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
>>>> is controlled via the emulated PCI bridges which I pass through together
>>>> with the GPU.  
>>>
>>> So there's a special emulated switch, is that how the guest knows which
>>> GPUs it can enable NVLinks to?  
>>
>> Since it only has PCI config space (there is nothing relevant in the
>> device tree at all), I assume (double checking with the NVIDIA folks
>> now) the guest driver enables them all, tests which pair works and
>> disables the ones which do not. This gives a malicious guest a tiny
>> window of opportunity to break into a good guest. Hm :-/
> 
> Let's not minimize that window, that seems like a prime candidate for
> an exploit.
> 
>>>>> If the former, then isn't a non-malicious
>>>>> guest still susceptible to a malicious guest?    
>>>>
>>>> A non-malicious guest needs to turn its switch on for a link to a GPU
>>>> which belongs to a malicious guest.  
>>>
>>> Actual security, or obfuscation, will we ever know...  
>>>>>> If the latter, how is  
>>>>> routing configured by the guest given that the guest view of the
>>>>> topology doesn't match physical hardware?  Are these routes
>>>>> deconfigured by device reset?  Are they part of the save/restore
>>>>> state?  Thanks,    
>>>
>>> Still curious what happens to these routes on reset.  Can a later user
>>> of a GPU inherit a device where the links are already enabled?  Thanks,  
>>
>> I am told that the GPU reset disables links. As a side effect, we get an
>> HMI (a hardware fault which reset the host machine) when trying
>> accessing the GPU RAM which indicates that the link is down as the
>> memory is only accessible via the nvlink. We have special fencing code
>> in our host firmware (skiboot) to fence this memory on PCI reset so
>> reading from it returns zeroes instead of HMIs.
> 
> What sort of reset is required for this?  Typically we rely on
> secondary bus reset for GPUs, but it would be a problem if GPUs were to
> start implementing FLR and nobody had a spec to learn that FLR maybe
> didn't disable the link.  The better approach to me still seems to be
> virtualizing these NVLink config registers to an extent that the user
> can only enabling links where they have ownership of both ends of the
> connection.  Thanks,


I re-read what I wrote and I owe some explanation.

The link state can be:
- disabled (or masked),
- enabled (or not-disabled? unmasked?),
- trained (configured).

At the moment no reset disables links; on secondary bus reset they are
unconfigured and go back to the initial enabled-but-not-trained state,
which is the default config. The NVIDIA driver in the guest trains links
to do the topology discovery. We can disable links, and that disabled
status remains until the next secondary bus reset; there is no way to
re-enable a link other than a secondary bus reset. This is what I get
from NVIDIA. FLR should not be able to change a thing here.
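
As a toy model of exactly that behaviour (and nothing more - no real
hardware interface here):

#include <stdio.h>

enum link_state { LINK_DISABLED, LINK_ENABLED, LINK_TRAINED };

/* Secondary bus reset always returns a link to the default
 * enabled-but-not-trained state. */
static enum link_state on_sec_bus_reset(enum link_state s)
{
        (void)s;
        return LINK_ENABLED;
}

/* The guest driver can only train a link that is currently enabled;
 * a disabled link stays disabled. */
static enum link_state on_driver_train(enum link_state s)
{
        return s == LINK_ENABLED ? LINK_TRAINED : s;
}

/* Disabling is sticky until the next secondary bus reset. */
static enum link_state on_disable(enum link_state s)
{
        (void)s;
        return LINK_DISABLED;
}

int main(void)
{
        enum link_state s = LINK_ENABLED;

        s = on_driver_train(s);      /* guest trains the link */
        s = on_disable(s);           /* we disable it */
        s = on_driver_train(s);      /* no effect, still disabled... */
        printf("before SBR: %d\n", s);
        s = on_sec_bus_reset(s);     /* ...until the next SBR */
        printf("after SBR:  %d\n", s);
        return 0;
}

FLR is not modelled because, as above, it should not change link state.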



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-08-08  8:39                                   ` Alexey Kardashevskiy
  (?)
@ 2018-08-09  4:21                                     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-08-09  4:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 08/08/2018 18:39, Alexey Kardashevskiy wrote:
> 
> 
> On 02/08/2018 02:16, Alex Williamson wrote:
>> On Wed, 1 Aug 2018 18:37:35 +1000
>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>> On 01/08/2018 00:29, Alex Williamson wrote:
>>>> On Tue, 31 Jul 2018 14:03:35 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>   
>>>>> On 31/07/2018 02:29, Alex Williamson wrote:  
>>>>>> On Mon, 30 Jul 2018 18:58:49 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
>>>>>>> After some local discussions, it was pointed out that force disabling
>>>>>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>>>>>>> enable it so malicious guests cannot penetrate good ones (or a host)
>>>>>>> unless a good guest enabled the link but won't happen with a well
>>>>>>> behaving guest. And if two guests became malicious, then can still only
>>>>>>> harm each other, and so can they via other ways such network. This is
>>>>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
>>>>>>> behaving device cannot firewall itself from peers as it is up to the
>>>>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>>>>>>> has means to protect itself, just like a guest can run "firewalld" for
>>>>>>> network.
>>>>>>>
>>>>>>> Although it would be a nice feature to have an extra barrier between
>>>>>>> GPUs, is inability to block the links in hypervisor still a blocker for
>>>>>>> V100 pass through?    
>>>>>>
>>>>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>>>>>> specific routes configured?     
>>>>>
>>>>> The GPU-GPU links need not to be blocked and need to be enabled
>>>>> (==trained) by a driver in the guest. There are no routes between GPUs
>>>>> in NVLink fabric, these are direct links, it is just a switch on each
>>>>> side, both switches need to be on for a link to work.  
>>>>
>>>> Ok, but there is at least the possibility of multiple direct links per
>>>> GPU, the very first diagram I find of NVlink shows 8 interconnected
>>>> GPUs:
>>>>
>>>> https://www.nvidia.com/en-us/data-center/nvlink/  
>>>
>>> Out design is like the left part of the picture but it is just a detail.
>>
>> Unless we can specifically identify a direct link vs a mesh link, we
>> shouldn't be making assumptions about the degree of interconnect.
>>  
>>>> So if each switch enables one direct, point to point link, how does the
>>>> guest know which links to open for which peer device?  
>>>
>>> It uses PCI config space on GPUs to discover the topology.
>>
>> So do we need to virtualize this config space if we're going to
>> virtualize the topology?
>>
>>>> And of course
>>>> since we can't see the spec, a security audit is at best hearsay :-\  
>>>
>>> Yup, the exact discovery protocol is hidden.
>>
>> It could be reverse engineered...
>>
>>>>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
>>>>> is controlled via the emulated PCI bridges which I pass through together
>>>>> with the GPU.  
>>>>
>>>> So there's a special emulated switch, is that how the guest knows which
>>>> GPUs it can enable NVLinks to?  
>>>
>>> Since it only has PCI config space (there is nothing relevant in the
>>> device tree at all), I assume (double checking with the NVIDIA folks
>>> now) the guest driver enables them all, tests which pair works and
>>> disables the ones which do not. This gives a malicious guest a tiny
>>> window of opportunity to break into a good guest. Hm :-/
>>
>> Let's not minimize that window, that seems like a prime candidate for
>> an exploit.
>>
>>>>>> If the former, then isn't a non-malicious
>>>>>> guest still susceptible to a malicious guest?    
>>>>>
>>>>> A non-malicious guest needs to turn its switch on for a link to a GPU
>>>>> which belongs to a malicious guest.  
>>>>
>>>> Actual security, or obfuscation, will we ever know...  
>>>>>>> If the latter, how is  
>>>>>> routing configured by the guest given that the guest view of the
>>>>>> topology doesn't match physical hardware?  Are these routes
>>>>>> deconfigured by device reset?  Are they part of the save/restore
>>>>>> state?  Thanks,    
>>>>
>>>> Still curious what happens to these routes on reset.  Can a later user
>>>> of a GPU inherit a device where the links are already enabled?  Thanks,  
>>>
>>> I am told that the GPU reset disables links. As a side effect, we get an
>>> HMI (a hardware fault which reset the host machine) when trying
>>> accessing the GPU RAM which indicates that the link is down as the
>>> memory is only accessible via the nvlink. We have special fencing code
>>> in our host firmware (skiboot) to fence this memory on PCI reset so
>>> reading from it returns zeroes instead of HMIs.
>>
>> What sort of reset is required for this?  Typically we rely on
>> secondary bus reset for GPUs, but it would be a problem if GPUs were to
>> start implementing FLR and nobody had a spec to learn that FLR maybe
>> didn't disable the link.  The better approach to me still seems to be
>> virtualizing these NVLink config registers to an extent that the user
>> can only enabling links where they have ownership of both ends of the
>> connection.  Thanks,
> 
> 
> I re-read what I wrote and I owe some explanation.
> 
> The link state can be:
> - disabled (or masked),
> - enabled (or not-disabled? unmasked?),
> - trained (configured).
> 
> At the moment no reset disables links, on sec bus reset they are
> unconfigured and go to the initial enabled-and-not-trained state which
> is the default config. The NVIDIA driver in the guest trains links to do
> the topology discovery. We can disable links and this disabled status
> remains until sec bus reset and there is no way to re-enable links other
> than sec bus reset. This is what I get from NVIDIA. FLR should not be
> able to change a thing here.


By the way, using this masking mechanism does not involve any
virtualizing: these are MMIO registers which a powernv platform reset
hook writes in order to stay in sync with the already configured IOMMU
groups, and that's all. The guest will still be able to access them with
no filtering on the way; the accesses just won't do anything. Or is this
still called virtualizing?
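
To make that concrete, here is a rough C sketch of what such a reset
hook could do. The register name, offset and struct layout below are
purely hypothetical (the real NPU register map is not public in this
thread), so treat it as an assumption-laden illustration rather than the
actual powernv code.

#include <stdint.h>

/* Hypothetical register offset and layout -- for illustration only. */
#define NPU_BRICK_MASK_REG	0x80UL

struct npu_brick {
	volatile uint64_t *mmio;	/* mapped NPU control registers */
	unsigned int index;		/* link ("brick") number        */
};

/* Would be called from a platform PCI reset hook, after secondary bus
 * reset has returned the link to its enabled-but-not-trained state, to
 * keep the link mask in sync with the configured IOMMU groups. */
static void npu_brick_mask(struct npu_brick *brick)
{
	volatile uint64_t *reg = brick->mmio + NPU_BRICK_MASK_REG / sizeof(uint64_t);

	*reg |= 1ULL << brick->index;

	/* No trapping or filtering happens here: the guest can still read
	 * and write these registers, the writes simply have no effect
	 * while the brick is masked. */
}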




-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-08-09  4:21                                     ` Alexey Kardashevskiy
  (?)
@ 2018-08-09 14:06                                       ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-08-09 14:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu, 9 Aug 2018 14:21:29 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 08/08/2018 18:39, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 02/08/2018 02:16, Alex Williamson wrote:  
> >> On Wed, 1 Aug 2018 18:37:35 +1000
> >> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>  
> >>> On 01/08/2018 00:29, Alex Williamson wrote:  
> >>>> On Tue, 31 Jul 2018 14:03:35 +1000
> >>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>     
> >>>>> On 31/07/2018 02:29, Alex Williamson wrote:    
> >>>>>> On Mon, 30 Jul 2018 18:58:49 +1000
> >>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:    
> >>>>>>> After some local discussions, it was pointed out that force disabling
> >>>>>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
> >>>>>>> enable it so malicious guests cannot penetrate good ones (or a host)
> >>>>>>> unless a good guest enabled the link but won't happen with a well
> >>>>>>> behaving guest. And if two guests became malicious, then can still only
> >>>>>>> harm each other, and so can they via other ways such network. This is
> >>>>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
> >>>>>>> behaving device cannot firewall itself from peers as it is up to the
> >>>>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
> >>>>>>> has means to protect itself, just like a guest can run "firewalld" for
> >>>>>>> network.
> >>>>>>>
> >>>>>>> Although it would be a nice feature to have an extra barrier between
> >>>>>>> GPUs, is inability to block the links in hypervisor still a blocker for
> >>>>>>> V100 pass through?      
> >>>>>>
> >>>>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >>>>>> specific routes configured?       
> >>>>>
> >>>>> The GPU-GPU links need not to be blocked and need to be enabled
> >>>>> (==trained) by a driver in the guest. There are no routes between GPUs
> >>>>> in NVLink fabric, these are direct links, it is just a switch on each
> >>>>> side, both switches need to be on for a link to work.    
> >>>>
> >>>> Ok, but there is at least the possibility of multiple direct links per
> >>>> GPU, the very first diagram I find of NVlink shows 8 interconnected
> >>>> GPUs:
> >>>>
> >>>> https://www.nvidia.com/en-us/data-center/nvlink/    
> >>>
> >>> Out design is like the left part of the picture but it is just a detail.  
> >>
> >> Unless we can specifically identify a direct link vs a mesh link, we
> >> shouldn't be making assumptions about the degree of interconnect.
> >>    
> >>>> So if each switch enables one direct, point to point link, how does the
> >>>> guest know which links to open for which peer device?    
> >>>
> >>> It uses PCI config space on GPUs to discover the topology.  
> >>
> >> So do we need to virtualize this config space if we're going to
> >> virtualize the topology?
> >>  
> >>>> And of course
> >>>> since we can't see the spec, a security audit is at best hearsay :-\    
> >>>
> >>> Yup, the exact discovery protocol is hidden.  
> >>
> >> It could be reverse engineered...
> >>  
> >>>>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> >>>>> is controlled via the emulated PCI bridges which I pass through together
> >>>>> with the GPU.    
> >>>>
> >>>> So there's a special emulated switch, is that how the guest knows which
> >>>> GPUs it can enable NVLinks to?    
> >>>
> >>> Since it only has PCI config space (there is nothing relevant in the
> >>> device tree at all), I assume (double checking with the NVIDIA folks
> >>> now) the guest driver enables them all, tests which pair works and
> >>> disables the ones which do not. This gives a malicious guest a tiny
> >>> window of opportunity to break into a good guest. Hm :-/  
> >>
> >> Let's not minimize that window, that seems like a prime candidate for
> >> an exploit.
> >>  
> >>>>>> If the former, then isn't a non-malicious
> >>>>>> guest still susceptible to a malicious guest?      
> >>>>>
> >>>>> A non-malicious guest needs to turn its switch on for a link to a GPU
> >>>>> which belongs to a malicious guest.    
> >>>>
> >>>> Actual security, or obfuscation, will we ever know...    
> >>>>>>> If the latter, how is    
> >>>>>> routing configured by the guest given that the guest view of the
> >>>>>> topology doesn't match physical hardware?  Are these routes
> >>>>>> deconfigured by device reset?  Are they part of the save/restore
> >>>>>> state?  Thanks,      
> >>>>
> >>>> Still curious what happens to these routes on reset.  Can a later user
> >>>> of a GPU inherit a device where the links are already enabled?  Thanks,    
> >>>
> >>> I am told that the GPU reset disables links. As a side effect, we get an
> >>> HMI (a hardware fault which reset the host machine) when trying
> >>> accessing the GPU RAM which indicates that the link is down as the
> >>> memory is only accessible via the nvlink. We have special fencing code
> >>> in our host firmware (skiboot) to fence this memory on PCI reset so
> >>> reading from it returns zeroes instead of HMIs.  
> >>
> >> What sort of reset is required for this?  Typically we rely on
> >> secondary bus reset for GPUs, but it would be a problem if GPUs were to
> >> start implementing FLR and nobody had a spec to learn that FLR maybe
> >> didn't disable the link.  The better approach to me still seems to be
> >> virtualizing these NVLink config registers to an extent that the user
> >> can only enabling links where they have ownership of both ends of the
> >> connection.  Thanks,  
> > 
> > 
> > I re-read what I wrote and I owe some explanation.
> > 
> > The link state can be:
> > - disabled (or masked),
> > - enabled (or not-disabled? unmasked?),
> > - trained (configured).
> > 
> > At the moment no reset disables links, on sec bus reset they are
> > unconfigured and go to the initial enabled-and-not-trained state which
> > is the default config. The NVIDIA driver in the guest trains links to do
> > the topology discovery. We can disable links and this disabled status
> > remains until sec bus reset and there is no way to re-enable links other
> > than sec bus reset. This is what I get from NVIDIA. FLR should not be
> > able to change a thing here.  
> 
> 
> btw using this masking mechanism does not involve any virtualizing -
> these are MMIO registers which a powernv platform reset hook will write
> to in order to stay in sync with already configured IOMMU groups and
> that's all, the guest will still be able to access them with no
> filtering on the way, it just won't do anything. Or this is still called
> virtualizing?

The only thing POWER-specific here seems to be the NVLink interface to
the CPU, so why would a reset hook be implemented as a powernv platform
reset hook?  We know these GPUs also exist in x86 platforms, so
anything we do on the endpoint should be shared regardless of the
platform.  I'm envisioning that even if we simply disable the NVLink
via a device-specific reset, we'd probably still want to hide the
NVLink capability from the user; otherwise it seems likely that they
might try to interact with NVLink and we might induce problems because
it's not in an expected state.  So if we hide the capability or trap
access to the configuration registers, I'd call that virtualization.
Thanks,

Alex
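
For what it's worth, a small C sketch of the "hide the capability" half
of that suggestion: walking a 256-byte shadow copy of config space and
unlinking vendor-specific capabilities so the user never sees them. The
assumption that the NVLink controls live in a vendor-specific capability,
and the shadow-config layout, are mine; this is not the actual vfio-pci
code.

#include <stdint.h>

#define PCI_CAPABILITY_LIST	0x34	/* offset of the first capability pointer */
#define PCI_CAP_ID_VNDR		0x09	/* vendor-specific capability ID          */

/* Splice every vendor-specific capability out of a 256-byte shadow of
 * the device's config space, so a user reading the virtualized config
 * space never discovers the (undocumented) link control registers. */
static void hide_vendor_caps(uint8_t cfg[256])
{
	uint8_t *prev = &cfg[PCI_CAPABILITY_LIST];
	uint8_t pos = *prev;

	while (pos >= 0x40) {		/* capabilities live above the header */
		uint8_t id = cfg[pos];
		uint8_t next = cfg[pos + 1];

		if (id == PCI_CAP_ID_VNDR)
			*prev = next;		/* unlink this capability */
		else
			prev = &cfg[pos + 1];	/* keep it, move on       */

		pos = next;
	}
}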

^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2018-08-09 14:06 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-07  8:44 [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100 Alexey Kardashevskiy
2018-06-07  8:44 ` [RFC PATCH kernel 1/5] vfio/spapr_tce: Simplify page contained test Alexey Kardashevskiy
2018-06-08  3:32   ` David Gibson
2018-06-07  8:44 ` [RFC PATCH kernel 2/5] powerpc/iommu_context: Change referencing in API Alexey Kardashevskiy
2018-06-07  8:44 ` [RFC PATCH kernel 3/5] powerpc/iommu: Do not pin memory of a memory device Alexey Kardashevskiy
2018-06-07  8:44 ` [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions Alexey Kardashevskiy
2018-06-07 17:04   ` Alex Williamson
2018-06-07  8:44 ` [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver Alexey Kardashevskiy
2018-06-07 17:04   ` Alex Williamson
2018-06-08  3:09     ` Alexey Kardashevskiy
2018-06-08  3:35       ` Alex Williamson
2018-06-08  3:52         ` Alexey Kardashevskiy
2018-06-08  4:34           ` Alex Williamson
2018-06-07 17:04 ` [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100 Alex Williamson
2018-06-07 21:54   ` Benjamin Herrenschmidt
2018-06-07 22:15     ` Alex Williamson
2018-06-07 23:20       ` Benjamin Herrenschmidt
2018-06-08  0:34         ` Alex Williamson
2018-06-08  0:58           ` Benjamin Herrenschmidt
2018-06-08  1:18             ` Alex Williamson
2018-06-08  3:08       ` Alexey Kardashevskiy
2018-06-08  3:44         ` Alex Williamson
2018-06-08  4:14           ` Alexey Kardashevskiy
2018-06-08  5:03             ` Alex Williamson
2018-07-10  4:10               ` Alexey Kardashevskiy
2018-07-10 22:37                 ` Alex Williamson
2018-07-11  9:26                   ` Alexey Kardashevskiy
2018-07-30  8:58                     ` Alexey Kardashevskiy
2018-07-30 16:29                       ` Alex Williamson
2018-07-31  4:03                         ` Alexey Kardashevskiy
2018-07-31 14:29                           ` Alex Williamson
2018-08-01  8:37                             ` Alexey Kardashevskiy
2018-08-01 16:16                               ` Alex Williamson
2018-08-08  8:39                                 ` Alexey Kardashevskiy
2018-08-09  4:21                                   ` Alexey Kardashevskiy
2018-08-09 14:06                                     ` Alex Williamson
