* [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07  8:44 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

Here is an RFC of some patches adding pass-through support
for the NVIDIA V100 GPUs found in some POWER9 boxes.

The example P9 system has 6 GPUs, each accompanied by 2 bridges
representing the hardware links (aka NVLink2):

 4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
 4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
 4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
 5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
 5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
 6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
 6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)

^^ the leading number is the IOMMU group ID.

Each bridge represents an additional hardware interface called "NVLink2";
it is not a PCI link but a separate bus. The design inherits from the
original NVLink on POWER8.

The new feature of the V100 is 16GB of cache-coherent memory on the GPU
board. This memory is presented to the host via the device tree and remains
offline until the NVIDIA driver loads, trains NVLink2 (via the config space
of the bridges above) and the nvidia-persistenced daemon then onlines it.
The memory remains online as long as nvidia-persistenced is running; when
it stops, the memory is offlined again.

The number of GPUs suggests passing them through to a guest. However,
in order to do so we cannot use the NVIDIA driver on the host, so the host
ends up with a 128GB window per GPU (bigger than or equal to the actual
GPU RAM size) in system memory with no page structs backing this window,
and we cannot touch this memory before the NVIDIA driver configures it
in a host or a guest, as an HMI (Hypervisor Maintenance Interrupt) occurs
otherwise.

On the example system the GPU RAM windows are located at:
0x0400 0000 0000
0x0420 0000 0000
0x0440 0000 0000
0x2400 0000 0000
0x2420 0000 0000
0x2440 0000 0000

So the complications are:

1. we cannot touch the GPU memory until it is trained, i.e. we cannot add
PTEs to VFIO-to-userspace or guest-to-host-physical translations until
the driver trains the links (i.e. nvidia-persistenced has started),
otherwise prefetching happens and an HMI occurs; I am trying to get this
changed somehow;

2. since it appears as normal cache-coherent memory, it will be used
for DMA, which means it has to be pinned and mapped in the host. Having
no page structs makes it different from the usual case - we only need to
translate user addresses to host physical addresses and map GPU RAM, but
pinning is not required.
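
For illustration, complication 2 boils down to a plain offset translation;
here is a minimal sketch with made-up names (the real code is
mm_iommu_ua_to_hpa() in patch 3/5):

struct mem_region {
	unsigned long ua;	/* userspace base address */
	unsigned long dev_hpa;	/* device memory base (host physical) */
	unsigned long entries;	/* region size in PAGE_SIZE units */
};

static long ua_to_hpa(struct mem_region *mem, unsigned long ua,
		unsigned long *hpa)
{
	if ((ua - mem->ua) >> PAGE_SHIFT >= mem->entries)
		return -EFAULT;

	/* No page struct to pin or dirty, just an offset within the window */
	*hpa = mem->dev_hpa + (ua - mem->ua);
	return 0;
}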

This series maps GPU RAM via the GPU's vfio-pci device so QEMU can then
register this memory as a KVM memory slot and present memory nodes to
the guest. Unless NVIDIA provides a userspace driver, this is of no use
for things like DPDK.
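
To give an idea of the QEMU side, below is an illustrative userspace sketch
(not QEMU's actual code): mmap the extra vfio-pci region exposing GPU RAM
and register it as a KVM memory slot. The region index of the GPU RAM
region comes from patch 5/5 and is simply passed in here as a parameter;
the guest physical address and slot number are placeholders too.

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>
#include <linux/vfio.h>

static void *map_gpu_ram_into_guest(int device_fd, int vm_fd,
		uint32_t ram_region_index, uint64_t guest_phys_addr,
		uint32_t slot)
{
	struct vfio_region_info info = { .argsz = sizeof(info),
					 .index = ram_region_index };
	struct kvm_userspace_memory_region mr;
	void *ram;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return NULL;

	/* The region offset is the mmap cookie for this vfio-pci region */
	ram = mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
			device_fd, info.offset);
	if (ram == MAP_FAILED)
		return NULL;

	mr.slot = slot;
	mr.flags = 0;
	mr.guest_phys_addr = guest_phys_addr;
	mr.memory_size = info.size;
	mr.userspace_addr = (uint64_t)(uintptr_t)ram;

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mr))
		return NULL;

	return ram;
}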


There is another problem which the series does not address but which is
worth mentioning: it is not strictly necessary to map GPU RAM into the guest
exactly where it is in the host (I tested this to some extent), but we still
might want to represent the memory at the same offset as on the host,
which increases the size of a TCE table needed to cover such a huge
window: (((0x244000000000 + 0x2000000000) >> 16) * 8) >> 20 = 4656MB.
I am addressing this in a separate patchset by allocating indirect TCE
levels on demand and using 16MB IOMMU pages in the guest, as we can now
back emulated pages with the smaller hardware ones.
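
For reference, the arithmetic above is one 8-byte TCE per 64K IOMMU page,
covering everything up to the end of the last GPU RAM window; a standalone
snippet (illustrative only) reproduces the figure:

#include <stdio.h>

int main(void)
{
	unsigned long long window_end = 0x244000000000ULL + 0x2000000000ULL;
	unsigned long long tces = window_end >> 16;	/* one TCE per 64K page */
	unsigned long long bytes = tces * 8;		/* 8 bytes per TCE */

	printf("%lluMB\n", bytes >> 20);		/* prints 4656 */
	return 0;
}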


This is an RFC. Please comment. Thanks.



Alexey Kardashevskiy (5):
  vfio/spapr_tce: Simplify page contained test
  powerpc/iommu_context: Change referencing in API
  powerpc/iommu: Do not pin memory of a memory device
  vfio_pci: Allow mapping extra regions
  vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

 drivers/vfio/pci/Makefile              |   1 +
 arch/powerpc/include/asm/mmu_context.h |   5 +-
 drivers/vfio/pci/vfio_pci_private.h    |  11 ++
 include/uapi/linux/vfio.h              |   3 +
 arch/powerpc/kernel/iommu.c            |   8 +-
 arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
 drivers/vfio/pci/vfio_pci.c            |  19 +++-
 drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
 drivers/vfio/pci/Kconfig               |   4 +
 10 files changed, 319 insertions(+), 34 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

-- 
2.11.0

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 1/5] vfio/spapr_tce: Simplify page contained test
  2018-06-07  8:44 ` Alexey Kardashevskiy
  (?)
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

The test function takes a page struct pointer which is not used by
either of its two callers in any other way, so make it simpler and just
pass a physical address there.

This should cause no behavioural change now but later we may start
supporting host addresses for memory devices which are not backed
by page structs.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 759a5bd..2c4a048 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -249,8 +249,9 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
 	decrement_locked_vm(mm, cb >> PAGE_SHIFT);
 }
 
-static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
 {
+	struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
 	/*
 	 * Check that the TCE table granularity is not bigger than the size of
 	 * a page we just found. Otherwise the hardware can get access to
@@ -549,7 +550,6 @@ static long tce_iommu_build(struct tce_container *container,
 		enum dma_data_direction direction)
 {
 	long i, ret = 0;
-	struct page *page;
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
@@ -560,8 +560,7 @@ static long tce_iommu_build(struct tce_container *container,
 		if (ret)
 			break;
 
-		page = pfn_to_page(hpa >> PAGE_SHIFT);
-		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+		if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
 			ret = -EPERM;
 			break;
 		}
@@ -595,7 +594,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		enum dma_data_direction direction)
 {
 	long i, ret = 0;
-	struct page *page;
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
@@ -615,8 +613,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		if (ret)
 			break;
 
-		page = pfn_to_page(hpa >> PAGE_SHIFT);
-		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+		if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
 			ret = -EPERM;
 			break;
 		}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 2/5] powerpc/iommu_context: Change referencing in API
  2018-06-07  8:44 ` Alexey Kardashevskiy
  (?)
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

At the moment a single function - mm_iommu_get() - allocates a new region
or just references it if it is already registered with the current MM
context.

We are going to allow the API to be used for memory devices, and a
different variant of mm_iommu_get() will be needed, so let's move the
referencing part to where it belongs - mm_iommu_find().

This turns mm_iommu_get() into a wrapper, as the actual function will
be extended later, and renames it to mm_iommu_new() to illustrate
the change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/mm/mmu_context_iommu.c    | 19 +++++++++++++++----
 drivers/vfio/vfio_iommu_spapr_tce.c    | 21 +++++++++++----------
 3 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 1835ca1..b598ec4 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -21,7 +21,7 @@ struct mm_iommu_table_group_mem_t;
 
 extern int isolate_lru_page(struct page *page);	/* from internal.h */
 extern bool mm_iommu_preregistered(struct mm_struct *mm);
-extern long mm_iommu_get(struct mm_struct *mm,
+extern long mm_iommu_new(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 4c615fc..6b471d2 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -120,7 +120,8 @@ static int mm_iommu_move_page_from_cma(struct page *page)
 	return 0;
 }
 
-long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
+static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
@@ -132,8 +133,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
 			next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
-			++mem->used;
-			*pmem = mem;
+			ret = -EBUSY;
 			goto unlock_exit;
 		}
 
@@ -218,7 +218,13 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(mm_iommu_get);
+
+long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, pmem);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_new);
 
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
@@ -337,13 +343,18 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 {
 	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
 
+	mutex_lock(&mem_list_mutex);
+
 	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
 			ret = mem;
+			++mem->used;
 			break;
 		}
 	}
 
+	mutex_unlock(&mem_list_mutex);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(mm_iommu_find);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 2c4a048..7f1effd 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -149,9 +149,9 @@ static long tce_iommu_prereg_free(struct tce_container *container,
 static long tce_iommu_unregister_pages(struct tce_container *container,
 		__u64 vaddr, __u64 size)
 {
+	long ret = -ENOENT;
 	struct mm_iommu_table_group_mem_t *mem;
 	struct tce_iommu_prereg *tcemem;
-	bool found = false;
 
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
 		return -EINVAL;
@@ -162,15 +162,14 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 
 	list_for_each_entry(tcemem, &container->prereg_list, next) {
 		if (tcemem->mem == mem) {
-			found = true;
+			ret = tce_iommu_prereg_free(container, tcemem);
 			break;
 		}
 	}
 
-	if (!found)
-		return -ENOENT;
+	mm_iommu_put(container->mm, mem);
 
-	return tce_iommu_prereg_free(container, tcemem);
+	return ret;
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -188,15 +187,17 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	mem = mm_iommu_find(container->mm, vaddr, entries);
 	if (mem) {
 		list_for_each_entry(tcemem, &container->prereg_list, next) {
-			if (tcemem->mem == mem)
+			if (tcemem->mem == mem) {
+				mm_iommu_put(container->mm, mem);
 				return -EBUSY;
+			}
 		}
+	} else {
+		ret = mm_iommu_new(container->mm, vaddr, entries, &mem);
+		if (ret)
+			return ret;
 	}
 
-	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
-	if (ret)
-		return ret;
-
 	tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL);
 	if (!tcemem) {
 		mm_iommu_put(container->mm, mem);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 3/5] powerpc/iommu: Do not pin memory of a memory device
  2018-06-07  8:44 ` Alexey Kardashevskiy
  (?)
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

This new memory does not have page structs as it is not hotplugged to
the host, so gup() would fail on it anyway.

This registers a new mapping in the memory context so the user of this
API does not have to worry about the nature of this memory.

Also, since host addresses may not be backed with page structs, this
adds a workaround to iommu_tce_xchg() to avoid touching absent page structs.
realmode_pfn_to_page() is used there as, unlike its virtmode counterpart,
it actually walks through the list of vmemmap_backing.

The same approach is used in tce_page_is_contained() to skip the check for now.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

# Conflicts:
#	arch/powerpc/mm/mmu_context_iommu.c
---
 arch/powerpc/include/asm/mmu_context.h |  3 ++
 arch/powerpc/kernel/iommu.c            |  8 +++--
 arch/powerpc/mm/mmu_context_iommu.c    | 55 +++++++++++++++++++++++++++-------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 12 +++++++-
 4 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b598ec4..0c14495 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -24,6 +24,9 @@ extern bool mm_iommu_preregistered(struct mm_struct *mm);
 extern long mm_iommu_new(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
+extern long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
 		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index af7a20d..fc985a5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1001,8 +1001,12 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 	ret = tbl->it_ops->exchange(tbl, entry, hpa, direction);
 
 	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
-			(*direction == DMA_BIDIRECTIONAL)))
-		SetPageDirty(pfn_to_page(*hpa >> PAGE_SHIFT));
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = __va(realmode_pfn_to_page(*hpa >> PAGE_SHIFT));
+
+		if (pg)
+			SetPageDirty(pg);
+	}
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 6b471d2..b132924 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -30,6 +30,8 @@ struct mm_iommu_table_group_mem_t {
 	u64 ua;			/* userspace address */
 	u64 entries;		/* number of entries in hpas[] */
 	u64 *hpas;		/* vmalloc'ed */
+#define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
+	u64 dev_hpa;		/* Device memory base address */
 };
 
 static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
@@ -121,7 +123,7 @@ static int mm_iommu_move_page_from_cma(struct page *page)
 }
 
 static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
-		unsigned long entries,
+		unsigned long entries, unsigned long dev_hpa,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
@@ -147,11 +149,13 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 
 	}
 
-	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
-	if (ret)
-		goto unlock_exit;
+	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
+		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
+		if (ret)
+			goto unlock_exit;
 
-	locked_entries = entries;
+		locked_entries = entries;
+	}
 
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem) {
@@ -159,6 +163,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		goto unlock_exit;
 	}
 
+	if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
+		mem->dev_hpa = dev_hpa;
+		goto good_exit;
+	}
+
 	mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
 	if (!mem->hpas) {
 		kfree(mem);
@@ -202,6 +211,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
 	}
 
+good_exit:
 	atomic64_set(&mem->mapped, 1);
 	mem->used = 1;
 	mem->ua = ua;
@@ -222,15 +232,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
-	return mm_iommu_do_alloc(mm, ua, entries, pmem);
+	return mm_iommu_do_alloc(mm, ua, entries, MM_IOMMU_TABLE_INVALID_HPA,
+			pmem);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_new);
 
+long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, dev_hpa, pmem);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_newdev);
+
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
 	long i;
 	struct page *page = NULL;
 
+	if (!mem->hpas)
+		return;
+
 	for (i = 0; i < mem->entries; ++i) {
 		if (!mem->hpas[i])
 			continue;
@@ -269,6 +291,7 @@ static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
 long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 {
 	long ret = 0;
+	unsigned long entries;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -290,9 +313,11 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 	}
 
 	/* @mapped became 0 so now mappings are disabled, release the region */
+	entries = mem->entries;
 	mm_iommu_release(mem);
 
-	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
+	if (mem->dev_hpa != MM_IOMMU_TABLE_INVALID_HPA)
+		mm_iommu_adjust_locked_vm(mm, entries, false);
 
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
@@ -363,11 +388,17 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	u64 *va = &mem->hpas[entry];
+	u64 *va;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	va = &mem->hpas[entry];
 	*hpa = *va | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -378,13 +409,17 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	void *va = &mem->hpas[entry];
 	unsigned long *pa;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
 
-	pa = (void *) vmalloc_to_phys(va);
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	pa = (void *) vmalloc_to_phys(&mem->hpas[entry]);
 	if (!pa)
 		return -EFAULT;
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 7f1effd..47071f3 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -252,7 +252,17 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
 
 static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
 {
-	struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
+	struct page *page = __va(realmode_pfn_to_page(hpa >> PAGE_SHIFT));
+
+	/*
+	 * If there is no page, we assume it is device memory and therefore
+	 * it is contiguous and always pinned.
+	 *
+	 * TODO: test device boundaries?
+	 */
+	if (!page)
+		return true;
+
 	/*
 	 * Check that the TCE table granularity is not bigger than the size of
 	 * a page we just found. Otherwise the hardware can get access to
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 3/5] powerpc/iommu: Do not pin memory of a memory device
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson,
	Benjamin Herrenschmidt, Ram Pai, kvm, Alistair Popple

This new memory does not have page structs as it is not hotplugged to
the host so gup() will fail anyway.

This registers a new mapping in memory context so the user of this
API does not have to worry about the nature of this memory.

Also, since host addresses may not be backed with page structs, this
adds a workaround to iommu_tce_xchg() to avoid putting absent page structs.
realmode_pfn_to_page() is used there as, unline its virtmode counterpart,
it actually walks through the list of vmemmap_backing.

The same is used in tce_page_is_contained() to drop the check for now.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

# Conflicts:
#	arch/powerpc/mm/mmu_context_iommu.c
---
 arch/powerpc/include/asm/mmu_context.h |  3 ++
 arch/powerpc/kernel/iommu.c            |  8 +++--
 arch/powerpc/mm/mmu_context_iommu.c    | 55 +++++++++++++++++++++++++++-------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 12 +++++++-
 4 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index b598ec4..0c14495 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -24,6 +24,9 @@ extern bool mm_iommu_preregistered(struct mm_struct *mm);
 extern long mm_iommu_new(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
+extern long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
 		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index af7a20d..fc985a5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1001,8 +1001,12 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 	ret = tbl->it_ops->exchange(tbl, entry, hpa, direction);
 
 	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
-			(*direction == DMA_BIDIRECTIONAL)))
-		SetPageDirty(pfn_to_page(*hpa >> PAGE_SHIFT));
+			(*direction == DMA_BIDIRECTIONAL))) {
+		struct page *pg = __va(realmode_pfn_to_page(*hpa >> PAGE_SHIFT));
+
+		if (pg)
+			SetPageDirty(pg);
+	}
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 6b471d2..b132924 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -30,6 +30,8 @@ struct mm_iommu_table_group_mem_t {
 	u64 ua;			/* userspace address */
 	u64 entries;		/* number of entries in hpas[] */
 	u64 *hpas;		/* vmalloc'ed */
+#define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
+	u64 dev_hpa;		/* Device memory base address */
 };
 
 static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
@@ -121,7 +123,7 @@ static int mm_iommu_move_page_from_cma(struct page *page)
 }
 
 static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
-		unsigned long entries,
+		unsigned long entries, unsigned long dev_hpa,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
@@ -147,11 +149,13 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 
 	}
 
-	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
-	if (ret)
-		goto unlock_exit;
+	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
+		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
+		if (ret)
+			goto unlock_exit;
 
-	locked_entries = entries;
+		locked_entries = entries;
+	}
 
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem) {
@@ -159,6 +163,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		goto unlock_exit;
 	}
 
+	if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
+		mem->dev_hpa = dev_hpa;
+		goto good_exit;
+	}
+
 	mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
 	if (!mem->hpas) {
 		kfree(mem);
@@ -202,6 +211,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
 	}
 
+good_exit:
 	atomic64_set(&mem->mapped, 1);
 	mem->used = 1;
 	mem->ua = ua;
@@ -222,15 +232,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
-	return mm_iommu_do_alloc(mm, ua, entries, pmem);
+	return mm_iommu_do_alloc(mm, ua, entries, MM_IOMMU_TABLE_INVALID_HPA,
+			pmem);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_new);
 
+long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, dev_hpa, pmem);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_newdev);
+
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
 	long i;
 	struct page *page = NULL;
 
+	if (!mem->hpas)
+		return;
+
 	for (i = 0; i < mem->entries; ++i) {
 		if (!mem->hpas[i])
 			continue;
@@ -269,6 +291,7 @@ static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
 long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 {
 	long ret = 0;
+	unsigned long entries;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -290,9 +313,11 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 	}
 
 	/* @mapped became 0 so now mappings are disabled, release the region */
+	entries = mem->entries;
 	mm_iommu_release(mem);
 
-	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
+	if (mem->dev_hpa != MM_IOMMU_TABLE_INVALID_HPA)
+		mm_iommu_adjust_locked_vm(mm, entries, false);
 
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
@@ -363,11 +388,17 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	u64 *va = &mem->hpas[entry];
+	u64 *va;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	va = &mem->hpas[entry];
 	*hpa = *va | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -378,13 +409,17 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	void *va = &mem->hpas[entry];
 	unsigned long *pa;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
 
-	pa = (void *) vmalloc_to_phys(va);
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	pa = (void *) vmalloc_to_phys(&mem->hpas[entry]);
 	if (!pa)
 		return -EFAULT;
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 7f1effd..47071f3 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -252,7 +252,17 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
 
 static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
 {
-	struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
+	struct page *page = __va(realmode_pfn_to_page(hpa >> PAGE_SHIFT));
+
+	/*
+	 * If there not page, we assume it is a device memory and therefore
+	 * it is contigous and always pinned.
+	 *
+	 * TODO: test device boundaries?
+	 */
+	if (!page)
+		return true;
+
 	/*
 	 * Check that the TCE table granularity is not bigger than the size of
 	 * a page we just found. Otherwise the hardware can get access to
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/pci/vfio_pci_private.h |  3 +++
 drivers/vfio/pci/vfio_pci.c         | 10 ++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index cde3b5d..86aab05 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -59,6 +59,9 @@ struct vfio_pci_regops {
 		      size_t count, loff_t *ppos, bool iswrite);
 	void	(*release)(struct vfio_pci_device *vdev,
 			   struct vfio_pci_region *region);
+	int	(*mmap)(struct vfio_pci_device *vdev,
+			struct vfio_pci_region *region,
+			struct vm_area_struct *vma);
 };
 
 struct vfio_pci_region {
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 3729937..7bddf1e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 		return -EINVAL;
 	if ((vma->vm_flags & VM_SHARED) == 0)
 		return -EINVAL;
+	if (index >= VFIO_PCI_NUM_REGIONS) {
+		int regnum = index - VFIO_PCI_NUM_REGIONS;
+		struct vfio_pci_region *region = vdev->region + regnum;
+
+		if (region && region->ops && region->ops->mmap)
+			return region->ops->mmap(vdev, region, vma);
+		return -EINVAL;
+	}
 	if (index >= VFIO_PCI_ROM_REGION_INDEX)
 		return -EINVAL;
-	if (!vdev->bar_mmap_supported[index])
-		return -EINVAL;
 
 	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
 	req_len = vma->vm_end - vma->vm_start;
-- 
2.11.0
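
This patch only adds the optional hook and the dispatch for device-specific
region indices. As a rough sketch (not from this series; all my_region_*
names are hypothetical), a region provider could opt in to mmap as below;
patch 5/5 adds the real NVLink2 user of the hook.

/* Hypothetical example only; see patch 5/5 for the actual NVLink2 user. */
static size_t my_region_rw(struct vfio_pci_device *vdev, char __user *buf,
		size_t count, loff_t *ppos, bool iswrite)
{
	return -EINVAL;			/* this example region is mmap-only */
}

static void my_region_release(struct vfio_pci_device *vdev,
		struct vfio_pci_region *region)
{
	kfree(region->data);
}

static int my_region_mmap(struct vfio_pci_device *vdev,
		struct vfio_pci_region *region, struct vm_area_struct *vma)
{
	if (vma->vm_end - vma->vm_start > region->size)
		return -EINVAL;

	/* set vma->vm_ops and/or insert PFNs for the backing memory here */
	return 0;
}

static const struct vfio_pci_regops my_regops = {
	.rw		= my_region_rw,
	.release	= my_region_release,
	.mmap		= my_region_mmap,	/* optional; NULL keeps mmap rejected */
};

A region registered with such ops via vfio_pci_register_dev_region() then
becomes mmap'able through the vfio_pci_mmap() path changed above.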

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-07  8:44   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-07  8:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kvm, Alexey Kardashevskiy, Ram Pai, kvm-ppc, Alex Williamson,
	Alistair Popple, David Gibson

Some POWER9 chips come with special NVLink2 links which provide
cacheable memory access to the RAM physically located on NVIDIA GPU.
This memory is presented to the host via the device tree but remains
offline until the NVIDIA driver onlines it.

This patch exports this RAM to userspace as a new VFIO region so that
the NVIDIA driver in the guest can train these links and online the GPU RAM.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/pci/Makefile           |   1 +
 drivers/vfio/pci/vfio_pci_private.h |   8 ++
 include/uapi/linux/vfio.h           |   3 +
 drivers/vfio/pci/vfio_pci.c         |   9 ++
 drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/Kconfig            |   4 +
 6 files changed, 215 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 76d8ec0..9662c06 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,5 +1,6 @@
 
 vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 86aab05..7115b9b 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
 	return -ENODEV;
 }
 #endif
+#ifdef CONFIG_VFIO_PCI_NVLINK2
+extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
+#else
+static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+#endif
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 1aa7b82..2fe8227 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
 
+/* NVIDIA GPU NV2 */
+#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 7bddf1e..38c9475 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    pdev->device == 0x1db1 &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_nvlink2_init(vdev);
+		if (ret)
+			dev_warn(&vdev->pdev->dev,
+				 "Failed to setup NVIDIA NV2 RAM region\n");
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
new file mode 100644
index 0000000..451c5cb
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Register an on-GPU RAM region for cacheable access.
+ *
+ * Derived from original vfio_pci_igd.c:
+ * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
+ *	Author: Alex Williamson <alex.williamson@redhat.com>
+ */
+
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/mmu_context.h>
+
+#include "vfio_pci_private.h"
+
+struct vfio_pci_nvlink2_data {
+	unsigned long gpu_hpa;
+	unsigned long useraddr;
+	unsigned long size;
+	struct mm_struct *mm;
+	struct mm_iommu_table_group_mem_t *mem;
+};
+
+static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
+		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	void *base = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (iswrite) {
+		if (copy_from_user(base + pos, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, base + pos, count))
+			return -EFAULT;
+	}
+	*ppos += count;
+
+	return count;
+}
+
+static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region)
+{
+	struct vfio_pci_nvlink2_data *data = region->data;
+	long ret;
+
+	ret = mm_iommu_put(data->mm, data->mem);
+	WARN_ON(ret);
+
+	mmdrop(data->mm);
+	kfree(data);
+}
+
+static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_region *region = vma->vm_private_data;
+	struct vfio_pci_nvlink2_data *data = region->data;
+	int ret;
+	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
+	unsigned long vm_pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
+
+	ret = vm_insert_pfn(vma, vmf->address, pfn);
+	/* TODO: make it a tracepoint */
+	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
+		 vmf->address, pfn << PAGE_SHIFT, ret);
+	if (ret)
+		return VM_FAULT_SIGSEGV;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
+	.fault = vfio_pci_nvlink2_mmap_fault,
+};
+
+static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vm_area_struct *vma)
+{
+	long ret;
+	struct vfio_pci_nvlink2_data *data = region->data;
+
+	if (data->useraddr)
+		return -EPERM;
+
+	if (vma->vm_end - vma->vm_start > data->size)
+		return -EINVAL;
+
+	vma->vm_private_data = region;
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
+
+	/*
+	 * Call mm_iommu_newdev() here, only once, as the region is not
+	 * registered yet and the actual initialization therefore happens now.
+	 * Later users will look it up via mm_iommu_find(), which returns
+	 * the already registered @mem and does not call gup() again.
+	 */
+	data->useraddr = vma->vm_start;
+	data->mm = current->mm;
+	atomic_inc(&data->mm->mm_count);
+	ret = mm_iommu_newdev(data->mm, data->useraddr,
+			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
+			data->gpu_hpa, &data->mem);
+
+	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
+			data->useraddr, data->gpu_hpa,
+			vma->vm_end - vma->vm_start, ret);
+
+	return ret;
+}
+
+static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
+	.rw = vfio_pci_nvlink2_rw,
+	.release = vfio_pci_nvlink2_release,
+	.mmap = vfio_pci_nvlink2_mmap,
+};
+
+int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
+{
+	int len = 0, ret;
+	struct device_node *npu_node, *mem_node;
+	struct pci_dev *npu_dev;
+	uint32_t *mem_phandle, *val;
+	struct vfio_pci_nvlink2_data *data;
+
+	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
+	if (!npu_dev)
+		return -EINVAL;
+
+	npu_node = pci_device_to_OF_node(npu_dev);
+	if (!npu_node)
+		return -EINVAL;
+
+	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
+	if (!mem_phandle)
+		return -EINVAL;
+
+	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
+	if (!mem_node)
+		return -EINVAL;
+
+	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
+	if (!val || len != 2 * sizeof(uint64_t))
+		return -EINVAL;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
+			be32_to_cpu(val[1]);
+	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
+			be32_to_cpu(val[3]);
+
+	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
+			data->gpu_hpa + data->size - 1);
+
+	ret = vfio_pci_register_dev_region(vdev,
+			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
+			&vfio_pci_nvlink2_regops, data->size,
+			VFIO_REGION_INFO_FLAG_READ, data);
+	if (ret)
+		kfree(data);
+
+	return ret;
+}
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 24ee260..2725bc8 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -30,3 +30,7 @@ config VFIO_PCI_INTX
 config VFIO_PCI_IGD
 	depends on VFIO_PCI
 	def_bool y if X86
+
+config VFIO_PCI_NVLINK2
+	depends on VFIO_PCI
+	def_bool y if PPC_POWERNV
-- 
2.11.0
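
For completeness, below is a rough userspace sketch (not part of the series)
of how a consumer such as QEMU might locate and map the region this subdriver
exposes, using the standard VFIO region-info capability chain. The device fd
and region count are assumed to come from the usual VFIO
container/group/device setup, the NVLink2 subtype is defined locally because
it only exists with this patch applied, the helper name is made up, and error
handling is omitted.

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

#define NVIDIA_VENDOR_ID		0x10de
#define REGION_SUBTYPE_NVIDIA_NVLINK2	4	/* from this patch */

/* @device: VFIO device fd; @num_regions: from VFIO_DEVICE_GET_INFO */
static void *map_nvlink2_ram(int device, unsigned int num_regions, size_t *size)
{
	unsigned int i;

	for (i = VFIO_PCI_NUM_REGIONS; i < num_regions; i++) {
		struct vfio_region_info *info;
		struct vfio_info_cap_header *hdr;
		struct vfio_region_info_cap_type *cap;
		uint32_t off;
		void *ram;

		info = calloc(1, sizeof(*info));
		info->argsz = sizeof(*info);
		info->index = i;
		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, info);

		if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) {
			free(info);
			continue;
		}

		/* re-read with room for the capability chain */
		info = realloc(info, info->argsz);
		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, info);

		for (off = info->cap_offset; off; off = hdr->next) {
			hdr = (struct vfio_info_cap_header *)((char *)info + off);
			if (hdr->id != VFIO_REGION_INFO_CAP_TYPE)
				continue;

			cap = (struct vfio_region_info_cap_type *)hdr;
			if (cap->type != (VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
					  NVIDIA_VENDOR_ID) ||
			    cap->subtype != REGION_SUBTYPE_NVIDIA_NVLINK2)
				continue;

			*size = info->size;
			ram = mmap(NULL, info->size, PROT_READ | PROT_WRITE,
				   MAP_SHARED, device, info->offset);
			free(info);
			return ram == MAP_FAILED ? NULL : ram;
		}
		free(info);
	}

	return NULL;
}

The returned mapping could then be registered with KVM as a memory slot,
which is how the series intends the GPU RAM to appear in the guest.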

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 17:04   ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Here is an rfc of some patches adding psaa-through support
> for NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied with 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2",
> it is not a PCI link but separate but. The design inherits from original
> NVLink from POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running, when
> it stops, it offlines the memory.
> 
> The amount of GPUs suggest passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver so we have a host with
> a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> with no page structs backing this window and we cannot touch this memory
> before the NVIDIA driver configures it in a host or a guest as
> HMI (hardware management interrupt?) occurs.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400 0000 0000
> 0x0420 0000 0000
> 0x0440 0000 0000
> 0x2400 0000 0000
> 0x2420 0000 0000
> 0x2440 0000 0000
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need
> translate user addresses to host physical and map GPU RAM memory but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides an userspace driver, this is no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but worth
> mentioning - it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change referencing in API
>   powerpc/iommu: Do not pin memory of a memory device
>   vfio_pci: Allow mapping extra regions
>   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> 
>  drivers/vfio/pci/Makefile              |   1 +
>  arch/powerpc/include/asm/mmu_context.h |   5 +-
>  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
>  include/uapi/linux/vfio.h              |   3 +
>  arch/powerpc/kernel/iommu.c            |   8 +-
>  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
>  drivers/vfio/pci/vfio_pci.c            |  19 +++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
>  drivers/vfio/pci/Kconfig               |   4 +
>  10 files changed, 319 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 17:04   ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Here is an rfc of some patches adding psaa-through support
> for NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied with 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2",
> it is not a PCI link but separate but. The design inherits from original
> NVLink from POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running, when
> it stops, it offlines the memory.
> 
> The amount of GPUs suggest passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver so we have a host with
> a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> with no page structs backing this window and we cannot touch this memory
> before the NVIDIA driver configures it in a host or a guest as
> HMI (hardware management interrupt?) occurs.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400 0000 0000
> 0x0420 0000 0000
> 0x0440 0000 0000
> 0x2400 0000 0000
> 0x2420 0000 0000
> 0x2440 0000 0000
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need
> translate user addresses to host physical and map GPU RAM memory but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides an userspace driver, this is no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but worth
> mentioning - it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change referencing in API
>   powerpc/iommu: Do not pin memory of a memory device
>   vfio_pci: Allow mapping extra regions
>   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> 
>  drivers/vfio/pci/Makefile              |   1 +
>  arch/powerpc/include/asm/mmu_context.h |   5 +-
>  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
>  include/uapi/linux/vfio.h              |   3 +
>  arch/powerpc/kernel/iommu.c            |   8 +-
>  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
>  drivers/vfio/pci/vfio_pci.c            |  19 +++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
>  drivers/vfio/pci/Kconfig               |   4 +
>  10 files changed, 319 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 17:04   ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:15 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Here is an rfc of some patches adding psaa-through support
> for NVIDIA V100 GPU found in some POWER9 boxes.
> 
> The example P9 system has 6 GPUs, each accompanied with 2 bridges
> representing the hardware links (aka NVLink2):
> 
>  4  0004:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  5  0004:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  6  0004:06:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
>  4  0006:00:00.0 Bridge: IBM Device 04ea (rev 01)
>  4  0006:00:00.1 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.0 Bridge: IBM Device 04ea (rev 01)
>  5  0006:00:01.1 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.0 Bridge: IBM Device 04ea (rev 01)
>  6  0006:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.0 Bridge: IBM Device 04ea (rev 01)
> 10  0007:00:00.1 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.0 Bridge: IBM Device 04ea (rev 01)
> 11  0007:00:01.1 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.0 Bridge: IBM Device 04ea (rev 01)
> 12  0007:00:02.1 Bridge: IBM Device 04ea (rev 01)
> 10  0035:03:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 11  0035:04:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 12  0035:05:00.0 3D: NVIDIA Corporation GV100GL [Tesla V100 SXM2] (rev a1)
> 
> ^^ the number is an IOMMU group ID.

Can we back up and discuss whether the IOMMU grouping of NVLink
connected devices makes sense?  AIUI we have a PCI view of these
devices and from that perspective they're isolated.  That's the view of
the device used to generate the grouping.  However, not visible to us,
these devices are interconnected via NVLink.  What isolation properties
does NVLink provide given that its entire purpose for existing seems to
be to provide a high performance link for p2p between devices?
 
> Each bridge represents an additional hardware interface called "NVLink2",
> it is not a PCI link but separate but. The design inherits from original
> NVLink from POWER8.
> 
> The new feature of V100 is 16GB of cache coherent memory on GPU board.
> This memory is presented to the host via the device tree and remains offline
> until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> bridges above) and the nvidia-persistenced daemon then onlines it.
> The memory remains online as long as nvidia-persistenced is running, when
> it stops, it offlines the memory.
> 
> The amount of GPUs suggest passing them through to a guest. However,
> in order to do so we cannot use the NVIDIA driver so we have a host with
> a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> with no page structs backing this window and we cannot touch this memory
> before the NVIDIA driver configures it in a host or a guest as
> HMI (hardware management interrupt?) occurs.

Having a lot of GPUs only suggests assignment to a guest if there's
actually isolation provided between those GPUs.  Otherwise we'd need to
assign them as one big group, which gets a lot less useful.  Thanks,

Alex

> On the example system the GPU RAM windows are located at:
> 0x0400 0000 0000
> 0x0420 0000 0000
> 0x0440 0000 0000
> 0x2400 0000 0000
> 0x2420 0000 0000
> 0x2440 0000 0000
> 
> So the complications are:
> 
> 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> to VFIO-to-userspace or guest-to-host-physical translations till
> the driver trains it (i.e. nvidia-persistenced has started), otherwise
> prefetching happens and HMI occurs; I am trying to get this changed
> somehow;
> 
> 2. since it appears as normal cache coherent memory, it will be used
> for DMA which means it has to be pinned and mapped in the host. Having
> no page structs makes it different from the usual case - we only need
> translate user addresses to host physical and map GPU RAM memory but
> pinning is not required.
> 
> This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> register this memory as a KVM memory slot and present memory nodes to
> the guest. Unless NVIDIA provides a userspace driver, this is of no use
> for things like DPDK.
> 
> 
> There is another problem which the series does not address but is worth
> mentioning - it is not strictly necessary to map GPU RAM to the guest
> exactly where it is in the host (I tested this to some extent), we still
> might want to represent the memory at the same offset as on the host
> which increases the size of a TCE table needed to cover such a huge
> window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
> I am addressing this in a separate patchset by allocating indirect TCE
> levels on demand and using 16MB IOMMU pages in the guest as we can now
> back emulated pages with the smaller hardware ones.
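
For what it's worth, spelling that arithmetic out (only a sanity check,
using the 64K IOMMU page size and 8-byte TCE entries implied above):

        unsigned long long end = 0x244000000000ULL + 0x2000000000ULL;
        unsigned long long entries = end >> 16; /* 64K IOMMU pages to cover it */
        unsigned long long bytes = entries * 8; /* one 8-byte TCE per page */
        /* bytes >> 20 == 4656, i.e. roughly 4.5GB for a flat table */

so on-demand indirect levels do look necessary for a window that size.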
> 
> 
> This is an RFC. Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (5):
>   vfio/spapr_tce: Simplify page contained test
>   powerpc/iommu_context: Change referencing in API
>   powerpc/iommu: Do not pin memory of a memory device
>   vfio_pci: Allow mapping extra regions
>   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> 
>  drivers/vfio/pci/Makefile              |   1 +
>  arch/powerpc/include/asm/mmu_context.h |   5 +-
>  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
>  include/uapi/linux/vfio.h              |   3 +
>  arch/powerpc/kernel/iommu.c            |   8 +-
>  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
>  drivers/vfio/pci/vfio_pci.c            |  19 +++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
>  drivers/vfio/pci/Kconfig               |   4 +
>  10 files changed, 319 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-07  8:44   ` Alexey Kardashevskiy
  (?)
@ 2018-06-07 17:04     ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on NVIDIA GPU.
> This memory is presented to a host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This exports this RAM to the userspace as a new region so
> the NVIDIA driver in the guest can train these links and online GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h           |   3 +
>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/Kconfig            |   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>  	return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.
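
I.e. something roughly like this (a sketch only - the exact value is
arbitrary as long as it starts a fresh 10de namespace rather than
continuing the 8086 one):

        /* 8086 vendor PCI sub-types */
        #define VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION	(1)
        #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
        #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)

        /* 10de vendor PCI sub-types */
        #define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(1)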

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    pdev->device == 0x1db1 &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.
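
For instance, a sketch keyed off the device class instead (assuming the
subdriver, or its !CONFIG_VFIO_PCI_NVLINK2 stub, returns -ENODEV when the
GPU has no NPU/NVLink2 linkage in the device tree):

        if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
            (pdev->class >> 8) == PCI_CLASS_DISPLAY_3D) {
                /* the subdriver looks up the NPU peer itself */
                ret = vfio_pci_nvlink2_init(vdev);
                if (ret && ret != -ENODEV)
                        dev_warn(&vdev->pdev->dev,
                                 "Failed to setup NVIDIA NV2 RAM region\n");
        }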

> +		ret = vfio_pci_nvlink2_init(vdev);
> +		if (ret)
> +			dev_warn(&vdev->pdev->dev,
> +				 "Failed to setup NVIDIA NV2 RAM region\n");
> +	}
> +
>  	vfio_pci_probe_mmaps(vdev);
>  
>  	return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 0000000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *	Author: Alex Williamson <alex.williamson@redhat.com>
> + */
> +
> +#include <linux/io.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
> +
> +#include "vfio_pci_private.h"
> +
> +struct vfio_pci_nvlink2_data {
> +	unsigned long gpu_hpa;
> +	unsigned long useraddr;
> +	unsigned long size;
> +	struct mm_struct *mm;
> +	struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	void *base = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (iswrite) {
> +		if (copy_from_user(base + pos, buf, count))
> +			return -EFAULT;
> +	} else {
> +		if (copy_to_user(buf, base + pos, count))
> +			return -EFAULT;
> +	}
> +	*ppos += count;
> +
> +	return count;
> +}
> +
> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region)
> +{
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	long ret;
> +
> +	ret = mm_iommu_put(data->mm, data->mem);
> +	WARN_ON(ret);
> +
> +	mmdrop(data->mm);
> +	kfree(data);
> +}
> +
> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_region *region = vma->vm_private_data;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	int ret;
> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
> +	unsigned long vm_pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
> +
> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
> +	/* TODO: make it a tracepoint */
> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
> +		 vmf->address, pfn << PAGE_SHIFT, ret);
> +	if (ret)
> +		return VM_FAULT_SIGSEGV;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
> +	.fault = vfio_pci_nvlink2_mmap_fault,
> +};
> +
> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
> +{
> +	long ret;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +
> +	if (data->useraddr)
> +		return -EPERM;
> +
> +	if (vma->vm_end - vma->vm_start > data->size)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = region;
> +	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
> +
> +	/*
> +	 * Calling mm_iommu_newdev() here once as the region is not
> +	 * registered yet and therefore right initialization will happen now.
> +	 * Other places will use mm_iommu_find() which returns
> +	 * registered @mem and does not go gup().
> +	 */
> +	data->useraddr = vma->vm_start;
> +	data->mm = current->mm;
> +	atomic_inc(&data->mm->mm_count);
> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
> +			data->gpu_hpa, &data->mem);
> +
> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
> +			data->useraddr, data->gpu_hpa,
> +			vma->vm_end - vma->vm_start, ret);
> +
> +	return ret;
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
> +	.rw = vfio_pci_nvlink2_rw,
> +	.release = vfio_pci_nvlink2_release,
> +	.mmap = vfio_pci_nvlink2_mmap,
> +};
> +
> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	int len = 0, ret;
> +	struct device_node *npu_node, *mem_node;
> +	struct pci_dev *npu_dev;
> +	uint32_t *mem_phandle, *val;
> +	struct vfio_pci_nvlink2_data *data;
> +
> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
> +	if (!npu_dev)
> +		return -EINVAL;
> +
> +	npu_node = pci_device_to_OF_node(npu_dev);
> +	if (!npu_node)
> +		return -EINVAL;
> +
> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
> +	if (!mem_phandle)
> +		return -EINVAL;
> +
> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
> +	if (!mem_node)
> +		return -EINVAL;
> +
> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
> +	if (!val || len != 2 * sizeof(uint64_t))
> +		return -EINVAL;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
> +			be32_to_cpu(val[1]);
> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
> +			be32_to_cpu(val[3]);
> +
> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
> +			data->gpu_hpa + data->size - 1);
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
> +			&vfio_pci_nvlink2_regops, data->size,
> +			VFIO_REGION_INFO_FLAG_READ, data);
> +	if (ret)
> +		kfree(data);
> +
> +	return ret;
> +}
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 24ee260..2725bc8 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>  config VFIO_PCI_IGD
>  	depends on VFIO_PCI
>  	def_bool y if X86
> +
> +config VFIO_PCI_NVLINK2
> +	depends on VFIO_PCI
> +	def_bool y if PPC_POWERNV

As written, this also depends on PPC_POWERNV (or at least TCE), it's not
a portable implementation that we could re-use on X86 or ARM or any
other platform if hardware appeared for it.  Can we improve that as
well to make this less POWER specific?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-07 17:04     ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on NVIDIA GPU.
> This memory is presented to a host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This exports this RAM to the userspace as a new region so
> the NVIDIA driver in the guest can train these links and online GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h           |   3 +
>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/Kconfig            |   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>  	return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    pdev->device == 0x1db1 &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.

> +		ret = vfio_pci_nvlink2_init(vdev);
> +		if (ret)
> +			dev_warn(&vdev->pdev->dev,
> +				 "Failed to setup NVIDIA NV2 RAM region\n");
> +	}
> +
>  	vfio_pci_probe_mmaps(vdev);
>  
>  	return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 0000000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *	Author: Alex Williamson <alex.williamson@redhat.com>
> + */
> +
> +#include <linux/io.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
> +
> +#include "vfio_pci_private.h"
> +
> +struct vfio_pci_nvlink2_data {
> +	unsigned long gpu_hpa;
> +	unsigned long useraddr;
> +	unsigned long size;
> +	struct mm_struct *mm;
> +	struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	void *base = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (iswrite) {
> +		if (copy_from_user(base + pos, buf, count))
> +			return -EFAULT;
> +	} else {
> +		if (copy_to_user(buf, base + pos, count))
> +			return -EFAULT;
> +	}
> +	*ppos += count;
> +
> +	return count;
> +}
> +
> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region)
> +{
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	long ret;
> +
> +	ret = mm_iommu_put(data->mm, data->mem);
> +	WARN_ON(ret);
> +
> +	mmdrop(data->mm);
> +	kfree(data);
> +}
> +
> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_region *region = vma->vm_private_data;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	int ret;
> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
> +	unsigned long vm_pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
> +
> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
> +	/* TODO: make it a tracepoint */
> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
> +		 vmf->address, pfn << PAGE_SHIFT, ret);
> +	if (ret)
> +		return VM_FAULT_SIGSEGV;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
> +	.fault = vfio_pci_nvlink2_mmap_fault,
> +};
> +
> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
> +{
> +	long ret;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +
> +	if (data->useraddr)
> +		return -EPERM;
> +
> +	if (vma->vm_end - vma->vm_start > data->size)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = region;
> +	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
> +
> +	/*
> +	 * Calling mm_iommu_newdev() here once as the region is not
> +	 * registered yet and therefore right initialization will happen now.
> +	 * Other places will use mm_iommu_find() which returns
> +	 * registered @mem and does not go gup().
> +	 */
> +	data->useraddr = vma->vm_start;
> +	data->mm = current->mm;
> +	atomic_inc(&data->mm->mm_count);
> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
> +			data->gpu_hpa, &data->mem);
> +
> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
> +			data->useraddr, data->gpu_hpa,
> +			vma->vm_end - vma->vm_start, ret);
> +
> +	return ret;
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
> +	.rw = vfio_pci_nvlink2_rw,
> +	.release = vfio_pci_nvlink2_release,
> +	.mmap = vfio_pci_nvlink2_mmap,
> +};
> +
> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	int len = 0, ret;
> +	struct device_node *npu_node, *mem_node;
> +	struct pci_dev *npu_dev;
> +	uint32_t *mem_phandle, *val;
> +	struct vfio_pci_nvlink2_data *data;
> +
> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
> +	if (!npu_dev)
> +		return -EINVAL;
> +
> +	npu_node = pci_device_to_OF_node(npu_dev);
> +	if (!npu_node)
> +		return -EINVAL;
> +
> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
> +	if (!mem_phandle)
> +		return -EINVAL;
> +
> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
> +	if (!mem_node)
> +		return -EINVAL;
> +
> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
> +	if (!val || len != 2 * sizeof(uint64_t))
> +		return -EINVAL;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
> +			be32_to_cpu(val[1]);
> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
> +			be32_to_cpu(val[3]);
> +
> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
> +			data->gpu_hpa + data->size - 1);
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
> +			&vfio_pci_nvlink2_regops, data->size,
> +			VFIO_REGION_INFO_FLAG_READ, data);
> +	if (ret)
> +		kfree(data);
> +
> +	return ret;
> +}
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 24ee260..2725bc8 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>  config VFIO_PCI_IGD
>  	depends on VFIO_PCI
>  	def_bool y if X86
> +
> +config VFIO_PCI_NVLINK2
> +	depends on VFIO_PCI
> +	def_bool y if PPC_POWERNV

As written, this also depends on PPC_POWERNV (or at least TCE), it's not
a portable implementation that we could re-use on X86 or ARM or any
other platform if hardware appeared for it.  Can we improve that as
well to make this less POWER specific?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-07 17:04     ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:20 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> Some POWER9 chips come with special NVLink2 links which provide
> cacheable memory access to the RAM physically located on NVIDIA GPU.
> This memory is presented to a host via the device tree but remains
> offline until the NVIDIA driver onlines it.
> 
> This exports this RAM to the userspace as a new region so
> the NVIDIA driver in the guest can train these links and online GPU RAM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>  include/uapi/linux/vfio.h           |   3 +
>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/Kconfig            |   4 +
>  6 files changed, 215 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..7115b9b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>  	return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1aa7b82..2fe8227 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/* NVIDIA GPU NV2 */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)

You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
subtype.  Each vendor has their own address space of sub-types.

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 7bddf1e..38c9475 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    pdev->device == 0x1db1 &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {

Can't we do better than check this based on device ID?  Perhaps PCIe
capability hints at this?

Is it worthwhile to continue with assigning the device in the !ENABLED
case?  For instance, maybe it would be better to provide a weak
definition of vfio_pci_nvlink2_init() that would cause us to fail here
if we don't have this device specific support enabled.  I realize
you're following the example set forth for IGD, but those regions are
optional, for better or worse.

> +		ret = vfio_pci_nvlink2_init(vdev);
> +		if (ret)
> +			dev_warn(&vdev->pdev->dev,
> +				 "Failed to setup NVIDIA NV2 RAM region\n");
> +	}
> +
>  	vfio_pci_probe_mmaps(vdev);
>  
>  	return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 0000000..451c5cb
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
> + *	Author: Alex Williamson <alex.williamson@redhat.com>
> + */
> +
> +#include <linux/io.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
> +
> +#include "vfio_pci_private.h"
> +
> +struct vfio_pci_nvlink2_data {
> +	unsigned long gpu_hpa;
> +	unsigned long useraddr;
> +	unsigned long size;
> +	struct mm_struct *mm;
> +	struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	void *base = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (iswrite) {
> +		if (copy_from_user(base + pos, buf, count))
> +			return -EFAULT;
> +	} else {
> +		if (copy_to_user(buf, base + pos, count))
> +			return -EFAULT;
> +	}
> +	*ppos += count;
> +
> +	return count;
> +}
> +
> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region)
> +{
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	long ret;
> +
> +	ret = mm_iommu_put(data->mm, data->mem);
> +	WARN_ON(ret);
> +
> +	mmdrop(data->mm);
> +	kfree(data);
> +}
> +
> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_region *region = vma->vm_private_data;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +	int ret;
> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
> +	unsigned long vm_pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
> +
> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
> +	/* TODO: make it a tracepoint */
> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
> +		 vmf->address, pfn << PAGE_SHIFT, ret);
> +	if (ret)
> +		return VM_FAULT_SIGSEGV;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
> +	.fault = vfio_pci_nvlink2_mmap_fault,
> +};
> +
> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
> +{
> +	long ret;
> +	struct vfio_pci_nvlink2_data *data = region->data;
> +
> +	if (data->useraddr)
> +		return -EPERM;
> +
> +	if (vma->vm_end - vma->vm_start > data->size)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = region;
> +	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
> +
> +	/*
> +	 * Calling mm_iommu_newdev() here once as the region is not
> +	 * registered yet and therefore right initialization will happen now.
> +	 * Other places will use mm_iommu_find() which returns
> +	 * registered @mem and does not go gup().
> +	 */
> +	data->useraddr = vma->vm_start;
> +	data->mm = current->mm;
> +	atomic_inc(&data->mm->mm_count);
> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
> +			data->gpu_hpa, &data->mem);
> +
> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
> +			data->useraddr, data->gpu_hpa,
> +			vma->vm_end - vma->vm_start, ret);
> +
> +	return ret;
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
> +	.rw = vfio_pci_nvlink2_rw,
> +	.release = vfio_pci_nvlink2_release,
> +	.mmap = vfio_pci_nvlink2_mmap,
> +};
> +
> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	int len = 0, ret;
> +	struct device_node *npu_node, *mem_node;
> +	struct pci_dev *npu_dev;
> +	uint32_t *mem_phandle, *val;
> +	struct vfio_pci_nvlink2_data *data;
> +
> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
> +	if (!npu_dev)
> +		return -EINVAL;
> +
> +	npu_node = pci_device_to_OF_node(npu_dev);
> +	if (!npu_node)
> +		return -EINVAL;
> +
> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
> +	if (!mem_phandle)
> +		return -EINVAL;
> +
> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
> +	if (!mem_node)
> +		return -EINVAL;
> +
> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
> +	if (!val || len != 2 * sizeof(uint64_t))
> +		return -EINVAL;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
> +			be32_to_cpu(val[1]);
> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
> +			be32_to_cpu(val[3]);
> +
> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
> +			data->gpu_hpa + data->size - 1);
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
> +			&vfio_pci_nvlink2_regops, data->size,
> +			VFIO_REGION_INFO_FLAG_READ, data);
> +	if (ret)
> +		kfree(data);
> +
> +	return ret;
> +}
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 24ee260..2725bc8 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>  config VFIO_PCI_IGD
>  	depends on VFIO_PCI
>  	def_bool y if X86
> +
> +config VFIO_PCI_NVLINK2
> +	depends on VFIO_PCI
> +	def_bool y if PPC_POWERNV

As written, this also depends on PPC_POWERNV (or at least TCE), it's not
a portable implementation that we could re-use on X86 or ARM or any
other platform if hardware appeared for it.  Can we improve that as
well to make this less POWER specific?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions
  2018-06-07  8:44   ` Alexey Kardashevskiy
  (?)
@ 2018-06-07 17:04     ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:19 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

What's an "extra region", -ENOCOMMITLOG

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/vfio_pci_private.h |  3 +++
>  drivers/vfio/pci/vfio_pci.c         | 10 ++++++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
>  		      size_t count, loff_t *ppos, bool iswrite);
>  	void	(*release)(struct vfio_pci_device *vdev,
>  			   struct vfio_pci_region *region);
> +	int	(*mmap)(struct vfio_pci_device *vdev,
> +			struct vfio_pci_region *region,
> +			struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3729937..7bddf1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  		return -EINVAL;
>  	if ((vma->vm_flags & VM_SHARED) == 0)
>  		return -EINVAL;
> +	if (index >= VFIO_PCI_NUM_REGIONS) {
> +		int regnum = index - VFIO_PCI_NUM_REGIONS;
> +		struct vfio_pci_region *region = vdev->region + regnum;
> +
> +		if (region && region->ops && region->ops->mmap)
> +			return region->ops->mmap(vdev, region, vma);
> +		return -EINVAL;
> +	}
>  	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>  		return -EINVAL;
> -	if (!vdev->bar_mmap_supported[index])
> -		return -EINVAL;

This seems unrelated.  Thanks,

Alex
  
>  	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>  	req_len = vma->vm_end - vma->vm_start;
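
For context, a device specific ("extra") region is consumed from userspace
like any other vfio region; roughly like this, where the index past
VFIO_PCI_NUM_REGIONS and the prot flags are illustrative only:

        struct vfio_region_info info = {
                .argsz = sizeof(info),
                .index = VFIO_PCI_NUM_REGIONS,  /* first device specific region */
        };

        ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);
        /* the new regops->mmap() hook is what lets this mmap() succeed */
        void *gpuram = mmap(NULL, info.size, PROT_READ, MAP_SHARED,
                            device_fd, info.offset);

A commit log spelling that out would help.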

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions
@ 2018-06-07 17:04     ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On Thu,  7 Jun 2018 18:44:19 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

What's an "extra region", -ENOCOMMITLOG

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/vfio_pci_private.h |  3 +++
>  drivers/vfio/pci/vfio_pci.c         | 10 ++++++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
>  		      size_t count, loff_t *ppos, bool iswrite);
>  	void	(*release)(struct vfio_pci_device *vdev,
>  			   struct vfio_pci_region *region);
> +	int	(*mmap)(struct vfio_pci_device *vdev,
> +			struct vfio_pci_region *region,
> +			struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3729937..7bddf1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  		return -EINVAL;
>  	if ((vma->vm_flags & VM_SHARED) == 0)
>  		return -EINVAL;
> +	if (index >= VFIO_PCI_NUM_REGIONS) {
> +		int regnum = index - VFIO_PCI_NUM_REGIONS;
> +		struct vfio_pci_region *region = vdev->region + regnum;
> +
> +		if (region && region->ops && region->ops->mmap)
> +			return region->ops->mmap(vdev, region, vma);
> +		return -EINVAL;
> +	}
>  	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>  		return -EINVAL;
> -	if (!vdev->bar_mmap_supported[index])
> -		return -EINVAL;

This seems unrelated.  Thanks,

Alex
  
>  	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>  	req_len = vma->vm_end - vma->vm_start;

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions
@ 2018-06-07 17:04     ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 17:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu,  7 Jun 2018 18:44:19 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

What's an "extra region", -ENOCOMMITLOG

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/pci/vfio_pci_private.h |  3 +++
>  drivers/vfio/pci/vfio_pci.c         | 10 ++++++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
>  		      size_t count, loff_t *ppos, bool iswrite);
>  	void	(*release)(struct vfio_pci_device *vdev,
>  			   struct vfio_pci_region *region);
> +	int	(*mmap)(struct vfio_pci_device *vdev,
> +			struct vfio_pci_region *region,
> +			struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 3729937..7bddf1e 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1123,10 +1123,16 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  		return -EINVAL;
> 	if ((vma->vm_flags & VM_SHARED) == 0)
>  		return -EINVAL;
> +	if (index >= VFIO_PCI_NUM_REGIONS) {
> +		int regnum = index - VFIO_PCI_NUM_REGIONS;
> +		struct vfio_pci_region *region = vdev->region + regnum;
> +
> +		if (region && region->ops && region->ops->mmap)
> +			return region->ops->mmap(vdev, region, vma);
> +		return -EINVAL;
> +	}
>  	if (index >= VFIO_PCI_ROM_REGION_INDEX)
>  		return -EINVAL;
> -	if (!vdev->bar_mmap_supported[index])
> -		return -EINVAL;

This seems unrelated.  Thanks,

Alex
  
>  	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
>  	req_len = vma->vm_end - vma->vm_start;



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 17:04   ` Alex Williamson
  (?)
@ 2018-06-07 21:54     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-07 21:54 UTC (permalink / raw)
  To: Alex Williamson, Alexey Kardashevskiy
  Cc: kvm, Alistair Popple, Ram Pai, kvm-ppc, linuxppc-dev, David Gibson

On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> 
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

Not entirely. On POWER chips, we also have an nvlink between the device
and the CPU which is running significantly faster than PCIe.

But yes, there are cross-links and those should probably be accounted
for in the grouping.

> > Each bridge represents an additional hardware interface called "NVLink2",
> > it is not a PCI link but separate but. The design inherits from original
> > NVLink from POWER8.
> > 
> > The new feature of V100 is 16GB of cache coherent memory on GPU board.
> > This memory is presented to the host via the device tree and remains offline
> > until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> > bridges above) and the nvidia-persistenced daemon then onlines it.
> > The memory remains online as long as nvidia-persistenced is running, when
> > it stops, it offlines the memory.
> > 
> > The amount of GPUs suggest passing them through to a guest. However,
> > in order to do so we cannot use the NVIDIA driver so we have a host with
> > a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> > with no page structs backing this window and we cannot touch this memory
> > before the NVIDIA driver configures it in a host or a guest as
> > HMI (hardware management interrupt?) occurs.
> 
> Having a lot of GPUs only suggests assignment to a guest if there's
> actually isolation provided between those GPUs.  Otherwise we'd need to
> assign them as one big group, which gets a lot less useful.  Thanks,
> 
> Alex
> 
> > On the example system the GPU RAM windows are located at:
> > 0x0400 0000 0000
> > 0x0420 0000 0000
> > 0x0440 0000 0000
> > 0x2400 0000 0000
> > 0x2420 0000 0000
> > 0x2440 0000 0000
> > 
> > So the complications are:
> > 
> > 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> > to VFIO-to-userspace or guest-to-host-physical translations till
> > the driver trains it (i.e. nvidia-persistenced has started), otherwise
> > prefetching happens and HMI occurs; I am trying to get this changed
> > somehow;
> > 
> > 2. since it appears as normal cache coherent memory, it will be used
> > for DMA which means it has to be pinned and mapped in the host. Having
> > no page structs makes it different from the usual case - we only need
> > translate user addresses to host physical and map GPU RAM memory but
> > pinning is not required.
> > 
> > This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> > register this memory as a KVM memory slot and present memory nodes to
> > the guest. Unless NVIDIA provides a userspace driver, this is of no use
> > for things like DPDK.
> > 
> > 
> > There is another problem which the series does not address but is worth
> > mentioning - it is not strictly necessary to map GPU RAM to the guest
> > exactly where it is in the host (I tested this to some extent), we still
> > might want to represent the memory at the same offset as on the host
> > which increases the size of a TCE table needed to cover such a huge
> > window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
> > I am addressing this in a separate patchset by allocating indirect TCE
> > levels on demand and using 16MB IOMMU pages in the guest as we can now
> > back emulated pages with the smaller hardware ones.
> > 
> > 
> > This is an RFC. Please comment. Thanks.
> > 
> > 
> > 
> > Alexey Kardashevskiy (5):
> >   vfio/spapr_tce: Simplify page contained test
> >   powerpc/iommu_context: Change referencing in API
> >   powerpc/iommu: Do not pin memory of a memory device
> >   vfio_pci: Allow mapping extra regions
> >   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> > 
> >  drivers/vfio/pci/Makefile              |   1 +
> >  arch/powerpc/include/asm/mmu_context.h |   5 +-
> >  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
> >  include/uapi/linux/vfio.h              |   3 +
> >  arch/powerpc/kernel/iommu.c            |   8 +-
> >  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
> >  drivers/vfio/pci/vfio_pci.c            |  19 +++-
> >  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
> >  drivers/vfio/pci/Kconfig               |   4 +
> >  10 files changed, 319 insertions(+), 34 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> > 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 21:54     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-07 21:54 UTC (permalink / raw)
  To: Alex Williamson, Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Ram Pai, kvm, Alistair Popple

On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> 
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

Not entirely. On POWER chips, we also have an nvlink between the device
and the CPU which is running significantly faster than PCIe.

But yes, there are cross-links and those should probably be accounted
for in the grouping.

> > Each bridge represents an additional hardware interface called "NVLink2",
> > it is not a PCI link but separate but. The design inherits from original
> > NVLink from POWER8.
> > 
> > The new feature of V100 is 16GB of cache coherent memory on GPU board.
> > This memory is presented to the host via the device tree and remains offline
> > until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> > bridges above) and the nvidia-persistenced daemon then onlines it.
> > The memory remains online as long as nvidia-persistenced is running, when
> > it stops, it offlines the memory.
> > 
> > The amount of GPUs suggest passing them through to a guest. However,
> > in order to do so we cannot use the NVIDIA driver so we have a host with
> > a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> > with no page structs backing this window and we cannot touch this memory
> > before the NVIDIA driver configures it in a host or a guest as
> > HMI (hardware management interrupt?) occurs.
> 
> Having a lot of GPUs only suggests assignment to a guest if there's
> actually isolation provided between those GPUs.  Otherwise we'd need to
> assign them as one big group, which gets a lot less useful.  Thanks,
> 
> Alex
> 
> > On the example system the GPU RAM windows are located at:
> > 0x0400 0000 0000
> > 0x0420 0000 0000
> > 0x0440 0000 0000
> > 0x2400 0000 0000
> > 0x2420 0000 0000
> > 0x2440 0000 0000
> > 
> > So the complications are:
> > 
> > 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> > to VFIO-to-userspace or guest-to-host-physical translations till
> > the driver trains it (i.e. nvidia-persistenced has started), otherwise
> > prefetching happens and HMI occurs; I am trying to get this changed
> > somehow;
> > 
> > 2. since it appears as normal cache coherent memory, it will be used
> > for DMA which means it has to be pinned and mapped in the host. Having
> > no page structs makes it different from the usual case - we only need
> > translate user addresses to host physical and map GPU RAM memory but
> > pinning is not required.
> > 
> > This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> > register this memory as a KVM memory slot and present memory nodes to
> > the guest. Unless NVIDIA provides a userspace driver, this is of no use
> > for things like DPDK.
> > 
> > 
> > There is another problem which the series does not address but is worth
> > mentioning - it is not strictly necessary to map GPU RAM to the guest
> > exactly where it is in the host (I tested this to some extent), we still
> > might want to represent the memory at the same offset as on the host
> > which increases the size of a TCE table needed to cover such a huge
> > window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
> > I am addressing this in a separate patchset by allocating indirect TCE
> > levels on demand and using 16MB IOMMU pages in the guest as we can now
> > back emulated pages with the smaller hardware ones.
> > 
> > 
> > This is an RFC. Please comment. Thanks.
> > 
> > 
> > 
> > Alexey Kardashevskiy (5):
> >   vfio/spapr_tce: Simplify page contained test
> >   powerpc/iommu_context: Change referencing in API
> >   powerpc/iommu: Do not pin memory of a memory device
> >   vfio_pci: Allow mapping extra regions
> >   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> > 
> >  drivers/vfio/pci/Makefile              |   1 +
> >  arch/powerpc/include/asm/mmu_context.h |   5 +-
> >  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
> >  include/uapi/linux/vfio.h              |   3 +
> >  arch/powerpc/kernel/iommu.c            |   8 +-
> >  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
> >  drivers/vfio/pci/vfio_pci.c            |  19 +++-
> >  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
> >  drivers/vfio/pci/Kconfig               |   4 +
> >  10 files changed, 319 insertions(+), 34 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> > 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-06-07 21:54     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-07 21:54 UTC (permalink / raw)
  To: Alex Williamson, Alexey Kardashevskiy
  Cc: kvm, Alistair Popple, Ram Pai, kvm-ppc, linuxppc-dev, David Gibson

On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> 
> Can we back up and discuss whether the IOMMU grouping of NVLink
> connected devices makes sense?  AIUI we have a PCI view of these
> devices and from that perspective they're isolated.  That's the view of
> the device used to generate the grouping.  However, not visible to us,
> these devices are interconnected via NVLink.  What isolation properties
> does NVLink provide given that its entire purpose for existing seems to
> be to provide a high performance link for p2p between devices?

Not entirely. On POWER chips, we also have an nvlink between the device
and the CPU which is running significantly faster than PCIe.

But yes, there are cross-links and those should probably be accounted
for in the grouping.

> > Each bridge represents an additional hardware interface called "NVLink2",
> > it is not a PCI link but separate but. The design inherits from original
> > NVLink from POWER8.
> > 
> > The new feature of V100 is 16GB of cache coherent memory on GPU board.
> > This memory is presented to the host via the device tree and remains offline
> > until the NVIDIA driver loads, trains NVLink2 (via the config space of these
> > bridges above) and the nvidia-persistenced daemon then onlines it.
> > The memory remains online as long as nvidia-persistenced is running, when
> > it stops, it offlines the memory.
> > 
> > The amount of GPUs suggest passing them through to a guest. However,
> > in order to do so we cannot use the NVIDIA driver so we have a host with
> > a 128GB window (bigger or equal to actual GPU RAM size) in a system memory
> > with no page structs backing this window and we cannot touch this memory
> > before the NVIDIA driver configures it in a host or a guest as
> > HMI (hardware management interrupt?) occurs.
> 
> Having a lot of GPUs only suggests assignment to a guest if there's
> actually isolation provided between those GPUs.  Otherwise we'd need to
> assign them as one big group, which gets a lot less useful.  Thanks,
> 
> Alex
> 
> > On the example system the GPU RAM windows are located at:
> > 0x0400 0000 0000
> > 0x0420 0000 0000
> > 0x0440 0000 0000
> > 0x2400 0000 0000
> > 0x2420 0000 0000
> > 0x2440 0000 0000
> > 
> > So the complications are:
> > 
> > 1. cannot touch the GPU memory till it is trained, i.e. cannot add ptes
> > to VFIO-to-userspace or guest-to-host-physical translations till
> > the driver trains it (i.e. nvidia-persistenced has started), otherwise
> > prefetching happens and HMI occurs; I am trying to get this changed
> > somehow;
> > 
> > 2. since it appears as normal cache coherent memory, it will be used
> > for DMA which means it has to be pinned and mapped in the host. Having
> > no page structs makes it different from the usual case - we only need
> > translate user addresses to host physical and map GPU RAM memory but
> > pinning is not required.
> > 
> > This series maps GPU RAM via the GPU vfio-pci device so QEMU can then
> > register this memory as a KVM memory slot and present memory nodes to
> > the guest. Unless NVIDIA provides a userspace driver, this is of no use
> > for things like DPDK.
> > 
> > 
> > There is another problem which the series does not address but is worth
> > mentioning - it is not strictly necessary to map GPU RAM to the guest
> > exactly where it is in the host (I tested this to some extent), we still
> > might want to represent the memory at the same offset as on the host
> > which increases the size of a TCE table needed to cover such a huge
> > window: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
> > I am addressing this in a separate patchset by allocating indirect TCE
> > levels on demand and using 16MB IOMMU pages in the guest as we can now
> > back emulated pages with the smaller hardware ones.
> > 
> > 
> > This is an RFC. Please comment. Thanks.
> > 
> > 
> > 
> > Alexey Kardashevskiy (5):
> >   vfio/spapr_tce: Simplify page contained test
> >   powerpc/iommu_context: Change referencing in API
> >   powerpc/iommu: Do not pin memory of a memory device
> >   vfio_pci: Allow mapping extra regions
> >   vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
> > 
> >  drivers/vfio/pci/Makefile              |   1 +
> >  arch/powerpc/include/asm/mmu_context.h |   5 +-
> >  drivers/vfio/pci/vfio_pci_private.h    |  11 ++
> >  include/uapi/linux/vfio.h              |   3 +
> >  arch/powerpc/kernel/iommu.c            |   8 +-
> >  arch/powerpc/mm/mmu_context_iommu.c    |  70 +++++++++---
> >  drivers/vfio/pci/vfio_pci.c            |  19 +++-
> >  drivers/vfio/pci/vfio_pci_nvlink2.c    | 190 +++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_spapr_tce.c    |  42 +++++---
> >  drivers/vfio/pci/Kconfig               |   4 +
> >  10 files changed, 319 insertions(+), 34 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> > 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 21:54     ` Benjamin Herrenschmidt
  (?)
@ 2018-06-07 22:15       ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-07 22:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Fri, 08 Jun 2018 07:54:02 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > 
> > Can we back up and discuss whether the IOMMU grouping of NVLink
> > connected devices makes sense?  AIUI we have a PCI view of these
> > devices and from that perspective they're isolated.  That's the view of
> > the device used to generate the grouping.  However, not visible to us,
> > these devices are interconnected via NVLink.  What isolation properties
> > does NVLink provide given that its entire purpose for existing seems to
> > be to provide a high performance link for p2p between devices?  
> 
> Not entire. On POWER chips, we also have an nvlink between the device
> and the CPU which is running significantly faster than PCIe.
> 
> But yes, there are cross-links and those should probably be accounted
> for in the grouping.

Then after we fix the grouping, can we just let the host driver manage
this coherent memory range and expose vGPUs to guests?  The use case of
assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
convince NVIDIA to support more than a single vGPU per VM though)
Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 22:15       ` Alex Williamson
  (?)
@ 2018-06-07 23:20         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-07 23:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> On Fri, 08 Jun 2018 07:54:02 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
> > > 
> > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > connected devices makes sense?  AIUI we have a PCI view of these
> > > devices and from that perspective they're isolated.  That's the view of
> > > the device used to generate the grouping.  However, not visible to us,
> > > these devices are interconnected via NVLink.  What isolation properties
> > > does NVLink provide given that its entire purpose for existing seems to
> > > be to provide a high performance link for p2p between devices?  
> > 
> > Not entire. On POWER chips, we also have an nvlink between the device
> > and the CPU which is running significantly faster than PCIe.
> > 
> > But yes, there are cross-links and those should probably be accounted
> > for in the grouping.
> 
> Then after we fix the grouping, can we just let the host driver manage
> this coherent memory range and expose vGPUs to guests?  The use case of
> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> convince NVIDIA to support more than a single vGPU per VM though)
> Thanks,

I don't know about "vGPUs" and what nVidia may be cooking in that area.

The patches from Alexey allow for passing through the full thing, but
they aren't trivial (there are additional issues, I'm not sure how well
covered they are, as we need to play with the mapping attributes of
portions of the GPU memory on the host side...).

Note: The cross-links are only per-socket so that would be 2 groups of
3.

We *can* allow individual GPUs to be passed through, either if somebody
designs a system without cross links, or if the user is ok with the
security risk as the guest driver will not enable them if it doesn't
"find" both sides of them.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 23:20         ` Benjamin Herrenschmidt
  (?)
@ 2018-06-08  0:34           ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  0:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Fri, 08 Jun 2018 09:20:30 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-06-07 at 16:15 -0600, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >   
> > > On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> > > > 
> > > > Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > connected devices makes sense?  AIUI we have a PCI view of these
> > > > devices and from that perspective they're isolated.  That's the view of
> > > > the device used to generate the grouping.  However, not visible to us,
> > > > these devices are interconnected via NVLink.  What isolation properties
> > > > does NVLink provide given that its entire purpose for existing seems to
> > > > be to provide a high performance link for p2p between devices?    
> > > 
> > > Not entire. On POWER chips, we also have an nvlink between the device
> > > and the CPU which is running significantly faster than PCIe.
> > > 
> > > But yes, there are cross-links and those should probably be accounted
> > > for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)
> > Thanks,  
> 
> I don't know about "vGPUs" and what nVidia may be cooking in that area.
> 
> The patches from Alexey allow for passing through the full thing, but
> they aren't trivial (there are additional issues, I'm not sure how well
> covered they are, as we need to play with the mapping attributes of
> portions of the GPU memory on the host side...).
> 
> Note: The cross-links are only per-socket so that would be 2 groups of
> 3.
> 
> We *can* allow individual GPUs to be passed through, either if somebody
> designs a system without cross links, or if the user is ok with the
> security risk as the guest driver will not enable them if it doesn't
> "find" both sides of them.

If GPUs are not isolated and we cannot prevent them from probing each
other via these links, then I think we have an obligation to configure
grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  0:34           ` Alex Williamson
  (?)
@ 2018-06-08  0:58             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 108+ messages in thread
From: Benjamin Herrenschmidt @ 2018-06-08  0:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > We *can* allow individual GPUs to be passed through, either if somebody
> > designs a system without cross links, or if the user is ok with the
> > security risk as the guest driver will not enable them if it doesn't
> > "find" both sides of them.
> 
> If GPUs are not isolated and we cannot prevent them from probing each
> other via these links, then I think we have an obligation to configure
> grouping in a way that doesn't rely on a benevolent userspace.  Thanks,

Well, it's a user decision, no ? Like how we used to let the user
decide whether to pass-through things that have LSIs shared out of
their domain.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  0:58             ` Benjamin Herrenschmidt
  (?)
@ 2018-06-08  1:18               ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  1:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alistair Popple, Ram Pai, kvm-ppc,
	linuxppc-dev, David Gibson

On Fri, 08 Jun 2018 10:58:54 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-06-07 at 18:34 -0600, Alex Williamson wrote:
> > > We *can* allow individual GPUs to be passed through, either if somebody
> > > designs a system without cross links, or if the user is ok with the
> > > security risk as the guest driver will not enable them if it doesn't
> > > "find" both sides of them.  
> > 
> > If GPUs are not isolated and we cannot prevent them from probing each
> > other via these links, then I think we have an obligation to configure
> > grouping in a way that doesn't rely on a benevolent userspace.  Thanks,  
> 
> Well, it's a user decision, no ? Like how we used to let the user
> decide whether to pass-through things that have LSIs shared out of
> their domain.

No, users don't get to pinky swear they'll be good.  The kernel creates
IOMMU groups assuming the worst case isolation and malicious users.
It's the kernel's job to protect itself from users and to protect users
from each other.  Anything else is unsupportable.  The only way to
bypass the default grouping is to modify the kernel.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-07 22:15       ` Alex Williamson
  (?)
@ 2018-06-08  3:08         ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  3:08 UTC (permalink / raw)
  To: Alex Williamson, Benjamin Herrenschmidt
  Cc: kvm, Alistair Popple, Ram Pai, kvm-ppc, linuxppc-dev, David Gibson

On 8/6/18 8:15 am, Alex Williamson wrote:
> On Fri, 08 Jun 2018 07:54:02 +1000
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:
>>>
>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>> connected devices makes sense?  AIUI we have a PCI view of these
>>> devices and from that perspective they're isolated.  That's the view of
>>> the device used to generate the grouping.  However, not visible to us,
>>> these devices are interconnected via NVLink.  What isolation properties
>>> does NVLink provide given that its entire purpose for existing seems to
>>> be to provide a high performance link for p2p between devices?  
>>
>> Not entire. On POWER chips, we also have an nvlink between the device
>> and the CPU which is running significantly faster than PCIe.
>>
>> But yes, there are cross-links and those should probably be accounted
>> for in the grouping.
> 
> Then after we fix the grouping, can we just let the host driver manage
> this coherent memory range and expose vGPUs to guests?  The use case of
> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> convince NVIDIA to support more than a single vGPU per VM though)

These are physical GPUs, not the virtual SR-IOV-alike things they also
implement elsewhere.

My current understanding is that every P9 chip in that box has some NVLink2
logic on it so each P9 is directly connected to 3 GPUs via PCIe and
2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
as well.

From the small bits of information I have, it seems that a GPU can work
perfectly well alone: if the NVIDIA driver does not see these interconnects
(because we do not pass the rest of the big 3xGPU group to this guest), it
continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
which simply refuses to work until all 3 GPUs are passed, so there is some
distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
get confirmation from NVIDIA that it is ok to pass just a single GPU.

So we will either have 6 groups (one per GPU) or 2 groups (one per
interconnected group).
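
Either way the group stays the unit of ownership for userspace: with the
2-groups-of-3 layout, all three GPUs (and their bridges) have to be unbound
from their host drivers before any of them can be used, which is what the
standard viability check below trips over. A minimal userspace sketch (the
group number is just an example, error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
        int group = open("/dev/vfio/10", O_RDWR); /* example group number */
        struct vfio_group_status status = { .argsz = sizeof(status) };

        if (group < 0 || ioctl(group, VFIO_GROUP_GET_STATUS, &status))
                return 1;

        /* Stays clear until every device in the group is bound to vfio-pci */
        if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
                fprintf(stderr, "group not viable yet\n");
                return 1;
        }

        return 0;
}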


-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-07 17:04     ` Alex Williamson
  (?)
@ 2018-06-08  3:09       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  3:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On 8/6/18 3:04 am, Alex Williamson wrote:
> On Thu,  7 Jun 2018 18:44:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> Some POWER9 chips come with special NVLink2 links which provide
>> cacheable memory access to the RAM physically located on NVIDIA GPU.
>> This memory is presented to a host via the device tree but remains
>> offline until the NVIDIA driver onlines it.
>>
>> This exports this RAM to the userspace as a new region so
>> the NVIDIA driver in the guest can train these links and online GPU RAM.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  drivers/vfio/pci/Makefile           |   1 +
>>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>>  include/uapi/linux/vfio.h           |   3 +
>>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/Kconfig            |   4 +
>>  6 files changed, 215 insertions(+)
>>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
>>
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 76d8ec0..9662c06 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -1,5 +1,6 @@
>>  
>>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>>  
>>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
>> index 86aab05..7115b9b 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>>  	return -ENODEV;
>>  }
>>  #endif
>> +#ifdef CONFIG_VFIO_PCI_NVLINK2
>> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
>> +#else
>> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +	return -ENODEV;
>> +}
>> +#endif
>>  #endif /* VFIO_PCI_PRIVATE_H */
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 1aa7b82..2fe8227 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>  
>> +/* NVIDIA GPU NV2 */
>> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)
> 
> You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
> subtype.  Each vendor has their own address space of sub-types.


True, I'll update. I just like unique numbers better :)
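
Something like this then, restarting the numbering in the NVIDIA namespace
(the value is illustrative, to be fixed properly in the respin):

/* NVIDIA vendor sub-types get their own numbering */
#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2      (1)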

> 
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>   * which allows direct access to non-MSIX registers which happened to be within
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 7bddf1e..38c9475 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>  		}
>>  	}
>>  
>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>> +	    pdev->device == 0x1db1 &&
>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> 
> Can't we do better than check this based on device ID?  Perhaps PCIe
> capability hints at this?

A normal PCI pluggable device looks like this:

root@fstn3:~# sudo lspci -vs 0000:03:00.0
0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
	Flags: fast devsel, IRQ 497
	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19


This is a NVLink v1 machine:

aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
	Subsystem: NVIDIA Corporation Device 116b
	Flags: bus master, fast devsel, latency 0, IRQ 457
	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384


This is the one the patch is for:

[aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
(rev a1)
	Subsystem: NVIDIA Corporation Device 1212
	Flags: fast devsel, IRQ 82, NUMA node 8
	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Capabilities: [ac0] #23
	Kernel driver in use: vfio-pci


I can only see a new capability #23, and I have no idea what it actually
does - my latest PCIe spec is PCI_Express_Base_r3.1a_December7-2015.pdf,
which only knows capabilities up to #21. Do you have a newer spec? It does
not seem promising anyway...
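
If that [ac0] capability turned out to be a reliable marker on these NVLink2
GPUs, the check in vfio_pci_enable() could probe for it instead of hardcoding
the device ID - roughly like below (untested, and I have not confirmed that
plain PCIe V100s never expose 0x23):

        /* 0x23 == the unknown extended capability seen at [ac0] above */
        if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
            pci_find_ext_capability(pdev, 0x23) &&
            IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
                ret = vfio_pci_nvlink2_init(vdev);
                if (ret)
                        dev_warn(&vdev->pdev->dev,
                                 "Failed to setup NVIDIA NV2 RAM region\n");
        }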


> Is it worthwhile to continue with assigning the device in the !ENABLED
> case?  For instance, maybe it would be better to provide a weak
> definition of vfio_pci_nvlink2_init() that would cause us to fail here
> if we don't have this device specific support enabled.  I realize
> you're following the example set forth for IGD, but those regions are
> optional, for better or worse.


The device is supposed to work even without GPU RAM passed through; in
this case it should look like NVLink v1 (there used to be bugs in the
driver, maybe there still are - I have not checked for a while, but there
is a bug open at NVIDIA about this and they were going to fix it), which is
why I chose not to fail here.
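
For reference, the weak-definition variant Alex suggests would look roughly
like this - the strong definition in vfio_pci_nvlink2.o would override it
when CONFIG_VFIO_PCI_NVLINK2=y, and vfio_pci_enable() could then fail on
-ENODEV, which is exactly the behaviour I want to avoid here:

/* Sketch only: weak fallback when the NVLink2 subdriver is not built */
int __weak vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
{
        return -ENODEV;
}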



>> +		ret = vfio_pci_nvlink2_init(vdev);
>> +		if (ret)
>> +			dev_warn(&vdev->pdev->dev,
>> +				 "Failed to setup NVIDIA NV2 RAM region\n");
>> +	}
>> +
>>  	vfio_pci_probe_mmaps(vdev);
>>  
>>  	return 0;
>> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
>> new file mode 100644
>> index 0000000..451c5cb
>> --- /dev/null
>> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
>> @@ -0,0 +1,190 @@
>> +// SPDX-License-Identifier: GPL-2.0+
>> +/*
>> + * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
>> + *
>> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
>> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * Register an on-GPU RAM region for cacheable access.
>> + *
>> + * Derived from original vfio_pci_igd.c:
>> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
>> + *	Author: Alex Williamson <alex.williamson@redhat.com>
>> + */
>> +
>> +#include <linux/io.h>
>> +#include <linux/pci.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/vfio.h>
>> +#include <linux/sched/mm.h>
>> +#include <linux/mmu_context.h>
>> +
>> +#include "vfio_pci_private.h"
>> +
>> +struct vfio_pci_nvlink2_data {
>> +	unsigned long gpu_hpa;
>> +	unsigned long useraddr;
>> +	unsigned long size;
>> +	struct mm_struct *mm;
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +};
>> +
>> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
>> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>> +{
>> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>> +	void *base = vdev->region[i].data;
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +
>> +	if (pos >= vdev->region[i].size)
>> +		return -EINVAL;
>> +
>> +	count = min(count, (size_t)(vdev->region[i].size - pos));
>> +
>> +	if (iswrite) {
>> +		if (copy_from_user(base + pos, buf, count))
>> +			return -EFAULT;
>> +	} else {
>> +		if (copy_to_user(buf, base + pos, count))
>> +			return -EFAULT;
>> +	}
>> +	*ppos += count;
>> +
>> +	return count;
>> +}
>> +
>> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
>> +		struct vfio_pci_region *region)
>> +{
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +	long ret;
>> +
>> +	ret = mm_iommu_put(data->mm, data->mem);
>> +	WARN_ON(ret);
>> +
>> +	mmdrop(data->mm);
>> +	kfree(data);
>> +}
>> +
>> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	struct vfio_pci_region *region = vma->vm_private_data;
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +	int ret;
>> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
>> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
>> +	unsigned long vm_pgoff = vma->vm_pgoff &
>> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
>> +
>> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
>> +	/* TODO: make it a tracepoint */
>> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
>> +		 vmf->address, pfn << PAGE_SHIFT, ret);
>> +	if (ret)
>> +		return VM_FAULT_SIGSEGV;
>> +
>> +	return VM_FAULT_NOPAGE;
>> +}
>> +
>> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
>> +	.fault = vfio_pci_nvlink2_mmap_fault,
>> +};
>> +
>> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
>> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
>> +{
>> +	long ret;
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +
>> +	if (data->useraddr)
>> +		return -EPERM;
>> +
>> +	if (vma->vm_end - vma->vm_start > data->size)
>> +		return -EINVAL;
>> +
>> +	vma->vm_private_data = region;
>> +	vma->vm_flags |= VM_PFNMAP;
>> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
>> +
>> +	/*
>> +	 * Calling mm_iommu_newdev() here once as the region is not
>> +	 * registered yet and therefore right initialization will happen now.
>> +	 * Other places will use mm_iommu_find() which returns
>> +	 * registered @mem and does not go gup().
>> +	 */
>> +	data->useraddr = vma->vm_start;
>> +	data->mm = current->mm;
>> +	atomic_inc(&data->mm->mm_count);
>> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
>> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
>> +			data->gpu_hpa, &data->mem);
>> +
>> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
>> +			data->useraddr, data->gpu_hpa,
>> +			vma->vm_end - vma->vm_start, ret);
>> +
>> +	return ret;
>> +}
>> +
>> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
>> +	.rw = vfio_pci_nvlink2_rw,
>> +	.release = vfio_pci_nvlink2_release,
>> +	.mmap = vfio_pci_nvlink2_mmap,
>> +};
>> +
>> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +	int len = 0, ret;
>> +	struct device_node *npu_node, *mem_node;
>> +	struct pci_dev *npu_dev;
>> +	uint32_t *mem_phandle, *val;
>> +	struct vfio_pci_nvlink2_data *data;
>> +
>> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
>> +	if (!npu_dev)
>> +		return -EINVAL;
>> +
>> +	npu_node = pci_device_to_OF_node(npu_dev);
>> +	if (!npu_node)
>> +		return -EINVAL;
>> +
>> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
>> +	if (!mem_phandle)
>> +		return -EINVAL;
>> +
>> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
>> +	if (!mem_node)
>> +		return -EINVAL;
>> +
>> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
>> +	if (!val || len != 2 * sizeof(uint64_t))
>> +		return -EINVAL;
>> +
>> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
>> +	if (!data)
>> +		return -ENOMEM;
>> +
>> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
>> +			be32_to_cpu(val[1]);
>> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
>> +			be32_to_cpu(val[3]);
>> +
>> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
>> +			data->gpu_hpa + data->size - 1);
>> +
>> +	ret = vfio_pci_register_dev_region(vdev,
>> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
>> +			&vfio_pci_nvlink2_regops, data->size,
>> +			VFIO_REGION_INFO_FLAG_READ, data);
>> +	if (ret)
>> +		kfree(data);
>> +
>> +	return ret;
>> +}
>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>> index 24ee260..2725bc8 100644
>> --- a/drivers/vfio/pci/Kconfig
>> +++ b/drivers/vfio/pci/Kconfig
>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>>  config VFIO_PCI_IGD
>>  	depends on VFIO_PCI
>>  	def_bool y if X86
>> +
>> +config VFIO_PCI_NVLINK2
>> +	depends on VFIO_PCI
>> +	def_bool y if PPC_POWERNV
> 
> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> a portable implementation that we could re-use on X86 or ARM or any
> other platform if hardware appeared for it.  Can we improve that as
> well to make this less POWER specific?  Thanks,


As I said in another mail, every P9 chip in that box has some NVLink2 logic
on it, so it is not even common among P9s in general, and I am having a
hard time seeing these V100s used elsewhere in such a way.



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-08  3:09       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  3:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On 8/6/18 3:04 am, Alex Williamson wrote:
> On Thu,  7 Jun 2018 18:44:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> Some POWER9 chips come with special NVLink2 links which provide
>> cacheable memory access to the RAM physically located on NVIDIA GPU.
>> This memory is presented to a host via the device tree but remains
>> offline until the NVIDIA driver onlines it.
>>
>> This exports this RAM to the userspace as a new region so
>> the NVIDIA driver in the guest can train these links and online GPU RAM.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  drivers/vfio/pci/Makefile           |   1 +
>>  drivers/vfio/pci/vfio_pci_private.h |   8 ++
>>  include/uapi/linux/vfio.h           |   3 +
>>  drivers/vfio/pci/vfio_pci.c         |   9 ++
>>  drivers/vfio/pci/vfio_pci_nvlink2.c | 190 ++++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/Kconfig            |   4 +
>>  6 files changed, 215 insertions(+)
>>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
>>
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 76d8ec0..9662c06 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -1,5 +1,6 @@
>>  
>>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>>  
>>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
>> index 86aab05..7115b9b 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -160,4 +160,12 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>>  	return -ENODEV;
>>  }
>>  #endif
>> +#ifdef CONFIG_VFIO_PCI_NVLINK2
>> +extern int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev);
>> +#else
>> +static inline int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +	return -ENODEV;
>> +}
>> +#endif
>>  #endif /* VFIO_PCI_PRIVATE_H */
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 1aa7b82..2fe8227 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -301,6 +301,9 @@ struct vfio_region_info_cap_type {
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>  
>> +/* NVIDIA GPU NV2 */
>> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2	(4)
> 
> You're continuing the Intel vendor ID sub-types for an NVIDIA vendor ID
> subtype.  Each vendor has their own address space of sub-types.


True, I'll update. I just like unique numbers better :)

> 
>> +
>>  /*
>>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>   * which allows direct access to non-MSIX registers which happened to be within
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 7bddf1e..38c9475 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>  		}
>>  	}
>>  
>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>> +	    pdev->device == 0x1db1 &&
>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> 
> Can't we do better than check this based on device ID?  Perhaps PCIe
> capability hints at this?

A normal PCI pluggable device looks like this:

root@fstn3:~# sudo lspci -vs 0000:03:00.0
0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
	Flags: fast devsel, IRQ 497
	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19


This is an NVLink v1 machine:

aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
	Subsystem: NVIDIA Corporation Device 116b
	Flags: bus master, fast devsel, latency 0, IRQ 457
	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384


This is the one the patch is for:

[aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
(rev a1)
	Subsystem: NVIDIA Corporation Device 1212
	Flags: fast devsel, IRQ 82, NUMA node 8
	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Capabilities: [ac0] #23
	Kernel driver in use: vfio-pci


I can only see a new capability #23, and I have no idea what it actually
does - my latest PCIe spec is PCI_Express_Base_r3.1a_December7-2015.pdf
and it only knows capabilities up to #21. Do you have a better spec?
It does not seem promising anyway...


> Is it worthwhile to continue with assigning the device in the !ENABLED
> case?  For instance, maybe it would be better to provide a weak
> definition of vfio_pci_nvlink2_init() that would cause us to fail here
> if we don't have this device specific support enabled.  I realize
> you're following the example set forth for IGD, but those regions are
> optional, for better or worse.


The device is supposed to work even without GPU RAM passed through; in that
case it should look like NVLink v1. (There used to be bugs in the driver -
maybe there still are, I have not checked for a while - but there is a bug
open at NVIDIA about this and they were going to fix it.) This is why I
chose not to fail here.
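
For comparison, the stricter behaviour suggested above would look roughly
like the sketch below - purely illustrative, it assumes the usual
vfio_pci_disable() unwind at this point in vfio_pci_enable() and is not
what this series does:

	/* Sketch only: treat a failed NVLink2 init as fatal, so a kernel
	 * built without CONFIG_VFIO_PCI_NVLINK2 (where the stub returns
	 * -ENODEV) refuses to assign the GPU instead of just warning. */
	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA && pdev->device == 0x1db1) {
		ret = vfio_pci_nvlink2_init(vdev);
		if (ret) {
			dev_warn(&vdev->pdev->dev,
				 "Failed to setup NVIDIA NV2 RAM region\n");
			vfio_pci_disable(vdev);
			return ret;
		}
	}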



>> +		ret = vfio_pci_nvlink2_init(vdev);
>> +		if (ret)
>> +			dev_warn(&vdev->pdev->dev,
>> +				 "Failed to setup NVIDIA NV2 RAM region\n");
>> +	}
>> +
>>  	vfio_pci_probe_mmaps(vdev);
>>  
>>  	return 0;
>> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
>> new file mode 100644
>> index 0000000..451c5cb
>> --- /dev/null
>> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
>> @@ -0,0 +1,190 @@
>> +// SPDX-License-Identifier: GPL-2.0+
>> +/*
>> + * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
>> + *
>> + * Copyright (C) 2018 IBM Corp.  All rights reserved.
>> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * Register an on-GPU RAM region for cacheable access.
>> + *
>> + * Derived from original vfio_pci_igd.c:
>> + * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
>> + *	Author: Alex Williamson <alex.williamson@redhat.com>
>> + */
>> +
>> +#include <linux/io.h>
>> +#include <linux/pci.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/vfio.h>
>> +#include <linux/sched/mm.h>
>> +#include <linux/mmu_context.h>
>> +
>> +#include "vfio_pci_private.h"
>> +
>> +struct vfio_pci_nvlink2_data {
>> +	unsigned long gpu_hpa;
>> +	unsigned long useraddr;
>> +	unsigned long size;
>> +	struct mm_struct *mm;
>> +	struct mm_iommu_table_group_mem_t *mem;
>> +};
>> +
>> +static size_t vfio_pci_nvlink2_rw(struct vfio_pci_device *vdev,
>> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
>> +{
>> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>> +	void *base = vdev->region[i].data;
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +
>> +	if (pos >= vdev->region[i].size)
>> +		return -EINVAL;
>> +
>> +	count = min(count, (size_t)(vdev->region[i].size - pos));
>> +
>> +	if (iswrite) {
>> +		if (copy_from_user(base + pos, buf, count))
>> +			return -EFAULT;
>> +	} else {
>> +		if (copy_to_user(buf, base + pos, count))
>> +			return -EFAULT;
>> +	}
>> +	*ppos += count;
>> +
>> +	return count;
>> +}
>> +
>> +static void vfio_pci_nvlink2_release(struct vfio_pci_device *vdev,
>> +		struct vfio_pci_region *region)
>> +{
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +	long ret;
>> +
>> +	ret = mm_iommu_put(data->mm, data->mem);
>> +	WARN_ON(ret);
>> +
>> +	mmdrop(data->mm);
>> +	kfree(data);
>> +}
>> +
>> +static int vfio_pci_nvlink2_mmap_fault(struct vm_fault *vmf)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	struct vfio_pci_region *region = vma->vm_private_data;
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +	int ret;
>> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
>> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
>> +	unsigned long vm_pgoff = vma->vm_pgoff &
>> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
>> +
>> +	ret = vm_insert_pfn(vma, vmf->address, pfn);
>> +	/* TODO: make it a tracepoint */
>> +	pr_debug("NVLink2: vmf=%lx hpa=%lx ret=%d\n",
>> +		 vmf->address, pfn << PAGE_SHIFT, ret);
>> +	if (ret)
>> +		return VM_FAULT_SIGSEGV;
>> +
>> +	return VM_FAULT_NOPAGE;
>> +}
>> +
>> +static const struct vm_operations_struct vfio_pci_nvlink2_mmap_vmops = {
>> +	.fault = vfio_pci_nvlink2_mmap_fault,
>> +};
>> +
>> +static int vfio_pci_nvlink2_mmap(struct vfio_pci_device *vdev,
>> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
>> +{
>> +	long ret;
>> +	struct vfio_pci_nvlink2_data *data = region->data;
>> +
>> +	if (data->useraddr)
>> +		return -EPERM;
>> +
>> +	if (vma->vm_end - vma->vm_start > data->size)
>> +		return -EINVAL;
>> +
>> +	vma->vm_private_data = region;
>> +	vma->vm_flags |= VM_PFNMAP;
>> +	vma->vm_ops = &vfio_pci_nvlink2_mmap_vmops;
>> +
>> +	/*
>> +	 * Calling mm_iommu_newdev() here once as the region is not
>> +	 * registered yet and therefore right initialization will happen now.
>> +	 * Other places will use mm_iommu_find() which returns
>> +	 * registered @mem and does not go gup().
>> +	 */
>> +	data->useraddr = vma->vm_start;
>> +	data->mm = current->mm;
>> +	atomic_inc(&data->mm->mm_count);
>> +	ret = mm_iommu_newdev(data->mm, data->useraddr,
>> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
>> +			data->gpu_hpa, &data->mem);
>> +
>> +	pr_debug("VFIO NVLINK2 mmap: useraddr=%lx hpa=%lx size=%lx ret=%ld\n",
>> +			data->useraddr, data->gpu_hpa,
>> +			vma->vm_end - vma->vm_start, ret);
>> +
>> +	return ret;
>> +}
>> +
>> +static const struct vfio_pci_regops vfio_pci_nvlink2_regops = {
>> +	.rw = vfio_pci_nvlink2_rw,
>> +	.release = vfio_pci_nvlink2_release,
>> +	.mmap = vfio_pci_nvlink2_mmap,
>> +};
>> +
>> +int vfio_pci_nvlink2_init(struct vfio_pci_device *vdev)
>> +{
>> +	int len = 0, ret;
>> +	struct device_node *npu_node, *mem_node;
>> +	struct pci_dev *npu_dev;
>> +	uint32_t *mem_phandle, *val;
>> +	struct vfio_pci_nvlink2_data *data;
>> +
>> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
>> +	if (!npu_dev)
>> +		return -EINVAL;
>> +
>> +	npu_node = pci_device_to_OF_node(npu_dev);
>> +	if (!npu_node)
>> +		return -EINVAL;
>> +
>> +	mem_phandle = (void *) of_get_property(npu_node, "memory-region", NULL);
>> +	if (!mem_phandle)
>> +		return -EINVAL;
>> +
>> +	mem_node = of_find_node_by_phandle(be32_to_cpu(*mem_phandle));
>> +	if (!mem_node)
>> +		return -EINVAL;
>> +
>> +	val = (uint32_t *) of_get_property(mem_node, "reg", &len);
>> +	if (!val || len != 2 * sizeof(uint64_t))
>> +		return -EINVAL;
>> +
>> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
>> +	if (!data)
>> +		return -ENOMEM;
>> +
>> +	data->gpu_hpa = ((uint64_t)be32_to_cpu(val[0]) << 32) |
>> +			be32_to_cpu(val[1]);
>> +	data->size = ((uint64_t)be32_to_cpu(val[2]) << 32) |
>> +			be32_to_cpu(val[3]);
>> +
>> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
>> +			data->gpu_hpa + data->size - 1);
>> +
>> +	ret = vfio_pci_register_dev_region(vdev,
>> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2,
>> +			&vfio_pci_nvlink2_regops, data->size,
>> +			VFIO_REGION_INFO_FLAG_READ, data);
>> +	if (ret)
>> +		kfree(data);
>> +
>> +	return ret;
>> +}
>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>> index 24ee260..2725bc8 100644
>> --- a/drivers/vfio/pci/Kconfig
>> +++ b/drivers/vfio/pci/Kconfig
>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>>  config VFIO_PCI_IGD
>>  	depends on VFIO_PCI
>>  	def_bool y if X86
>> +
>> +config VFIO_PCI_NVLINK2
>> +	depends on VFIO_PCI
>> +	def_bool y if PPC_POWERNV
> 
> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> a portable implementation that we could re-use on X86 or ARM or any
> other platform if hardware appeared for it.  Can we improve that as
> well to make this less POWER specific?  Thanks,


As I said in another mail, every P9 chip in that box has some NVLink2 logic
on it, so it is not even common among P9's in general, and I am having a
hard time seeing these V100s used elsewhere in such a way.



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 1/5] vfio/spapr_tce: Simplify page contained test
  2018-06-07  8:44   ` Alexey Kardashevskiy
@ 2018-06-08  3:32     ` David Gibson
  -1 siblings, 0 replies; 108+ messages in thread
From: David Gibson @ 2018-06-08  3:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alex Williamson, Alistair Popple, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2728 bytes --]

On Thu, Jun 07, 2018 at 06:44:16PM +1000, Alexey Kardashevskiy wrote:
> The test function takes a page struct pointer which is not used by
> either of the two callers in any other way; make it simple and just
> pass a physical address there instead.
> 
> This should cause no behavioral change now, but later we may start
> supporting host addresses for memory devices which are not backed
> by page structs.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 759a5bd..2c4a048 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -249,8 +249,9 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
>  	decrement_locked_vm(mm, cb >> PAGE_SHIFT);
>  }
>  
> -static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> +static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
>  {
> +	struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);
>  	/*
>  	 * Check that the TCE table granularity is not bigger than the size of
>  	 * a page we just found. Otherwise the hardware can get access to
> @@ -549,7 +550,6 @@ static long tce_iommu_build(struct tce_container *container,
>  		enum dma_data_direction direction)
>  {
>  	long i, ret = 0;
> -	struct page *page;
>  	unsigned long hpa;
>  	enum dma_data_direction dirtmp;
>  
> @@ -560,8 +560,7 @@ static long tce_iommu_build(struct tce_container *container,
>  		if (ret)
>  			break;
>  
> -		page = pfn_to_page(hpa >> PAGE_SHIFT);
> -		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> +		if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
>  			ret = -EPERM;
>  			break;
>  		}
> @@ -595,7 +594,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		enum dma_data_direction direction)
>  {
>  	long i, ret = 0;
> -	struct page *page;
>  	unsigned long hpa;
>  	enum dma_data_direction dirtmp;
>  
> @@ -615,8 +613,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		if (ret)
>  			break;
>  
> -		page = pfn_to_page(hpa >> PAGE_SHIFT);
> -		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> +		if (!tce_page_is_contained(hpa, tbl->it_page_shift)) {
>  			ret = -EPERM;
>  			break;
>  		}

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-08  3:09       ` Alexey Kardashevskiy
@ 2018-06-08  3:35         ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  3:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Fri, 8 Jun 2018 13:09:13 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> On 8/6/18 3:04 am, Alex Williamson wrote:
> > On Thu,  7 Jun 2018 18:44:20 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index 7bddf1e..38c9475 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>  		}
> >>  	}
> >>  
> >> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >> +	    pdev->device == 0x1db1 &&
> >> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {  
> > 
> > Can't we do better than check this based on device ID?  Perhaps PCIe
> > capability hints at this?  
> 
> A normal PCI pluggable device looks like this:
> 
> root@fstn3:~# sudo lspci -vs 0000:03:00.0
> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> 	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> 	Flags: fast devsel, IRQ 497
> 	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> 	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
> 	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 
> 
> This is an NVLink v1 machine:
> 
> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> 	Subsystem: NVIDIA Corporation Device 116b
> 	Flags: bus master, fast devsel, latency 0, IRQ 457
> 	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
> 	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
> 	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [250] Latency Tolerance Reporting
> 	Capabilities: [258] L1 PM Substates
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 	Kernel driver in use: nvidia
> 	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> 
> 
> This is the one the patch is for:
> 
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
> 	Subsystem: NVIDIA Corporation Device 1212
> 	Flags: fast devsel, IRQ 82, NUMA node 8
> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [250] Latency Tolerance Reporting
> 	Capabilities: [258] L1 PM Substates
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 	Capabilities: [ac0] #23
> 	Kernel driver in use: vfio-pci
> 
> 
> I can only see a new capability #23, and I have no idea what it actually
> does - my latest PCIe spec is PCI_Express_Base_r3.1a_December7-2015.pdf
> and it only knows capabilities up to #21. Do you have a better spec?
> It does not seem promising anyway...

You could just look in include/uapi/linux/pci_regs.h and see that 23
(0x17) is a TPH Requester capability and google for that...  It's a TLP
processing hint related to cache processing for requests from system
specific interconnects.  Sounds rather promising.  Of course there's
also the vendor specific capability that might be probed if NVIDIA will
tell you what to look for and the init function you've implemented
looks for specific devicetree nodes, that I imagine you could test for
in a probe as well.
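
Concretely, such a probe might look like the sketch below
(vfio_pci_is_nvlink2_gpu() is a hypothetical helper; whether the TPH
capability is a reliable hint is exactly the open question, and the
devicetree checks just mirror what vfio_pci_nvlink2_init() already does):

/* Hypothetical helper: identify an NVLink2-attached GPU from its PCIe
 * capabilities and devicetree linkage rather than by PCI device ID. */
static bool vfio_pci_is_nvlink2_gpu(struct pci_dev *pdev)
{
	struct pci_dev *npu_dev;
	struct device_node *npu_node;

	if (pdev->vendor != PCI_VENDOR_ID_NVIDIA)
		return false;

	/* The "#23" capability at [ac0] above is TPH Requester (0x17) */
	if (!pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_TPH))
		return false;

	/* Same platform hooks the nvlink2 init code relies on */
	npu_dev = pnv_pci_get_npu_dev(pdev, 0);
	if (!npu_dev)
		return false;

	npu_node = pci_device_to_OF_node(npu_dev);
	return npu_node && of_get_property(npu_node, "memory-region", NULL);
}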

> > Is it worthwhile to continue with assigning the device in the !ENABLED
> > case?  For instance, maybe it would be better to provide a weak
> > definition of vfio_pci_nvlink2_init() that would cause us to fail here
> > if we don't have this device specific support enabled.  I realize
> > you're following the example set forth for IGD, but those regions are
> > optional, for better or worse.  
> 
> 
> The device is supposed to work even without GPU RAM passed through; in that
> case it should look like NVLink v1. (There used to be bugs in the driver -
> maybe there still are, I have not checked for a while - but there is a bug
> open at NVIDIA about this and they were going to fix it.) This is why I
> chose not to fail here.

Ok.

> >> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> >> index 24ee260..2725bc8 100644
> >> --- a/drivers/vfio/pci/Kconfig
> >> +++ b/drivers/vfio/pci/Kconfig
> >> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
> >>  config VFIO_PCI_IGD
> >>  	depends on VFIO_PCI
> >>  	def_bool y if X86
> >> +
> >> +config VFIO_PCI_NVLINK2
> >> +	depends on VFIO_PCI
> >> +	def_bool y if PPC_POWERNV  
> > 
> > As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> > a portable implementation that we could re-use on X86 or ARM or any
> > other platform if hardware appeared for it.  Can we improve that as
> > well to make this less POWER specific?  Thanks,  
> 
> 
> As I said in another mail, every P9 chip in that box has some NVLink2 logic
> on it, so it is not even common among P9's in general, and I am having a
> hard time seeing these V100s used elsewhere in such a way.

https://www.redhat.com/archives/vfio-users/2018-May/msg00000.html

Not much platform info, but based on the rpm mentioned, looks like an
x86_64 box.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  3:08         ` Alexey Kardashevskiy
  (?)
@ 2018-06-08  3:44           ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  3:44 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Fri, 8 Jun 2018 13:08:54 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 8/6/18 8:15 am, Alex Williamson wrote:
> > On Fri, 08 Jun 2018 07:54:02 +1000
> > Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >   
> >> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
> >>>
> >>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>> connected devices makes sense?  AIUI we have a PCI view of these
> >>> devices and from that perspective they're isolated.  That's the view of
> >>> the device used to generate the grouping.  However, not visible to us,
> >>> these devices are interconnected via NVLink.  What isolation properties
> >>> does NVLink provide given that its entire purpose for existing seems to
> >>> be to provide a high performance link for p2p between devices?    
> >>
> >> Not entire. On POWER chips, we also have an nvlink between the device
> >> and the CPU which is running significantly faster than PCIe.
> >>
> >> But yes, there are cross-links and those should probably be accounted
> >> for in the grouping.  
> > 
> > Then after we fix the grouping, can we just let the host driver manage
> > this coherent memory range and expose vGPUs to guests?  The use case of
> > assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > convince NVIDIA to support more than a single vGPU per VM though)  
> 
> These are physical GPUs, not virtual sriov-alike things they are
> implementing as well elsewhere.

vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
either.  That's why we have mdev devices now to implement software
defined devices.  I don't have first hand experience with V-series, but
I would absolutely expect a PCIe-based Tesla V100 to support vGPU.
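
To make the mdev reference concrete: a vendor driver registers its physical
device as an mdev parent with a set of callbacks, and mdev then exposes
software-defined instances to userspace via vfio-mdev. The sketch below is
only the registration skeleton against the mdev interface as it existed
around this time (mdev_parent_ops / mdev_register_device); the names and
empty callbacks are placeholders, not anything NVIDIA actually ships.

#include <linux/module.h>
#include <linux/mdev.h>

static int my_vgpu_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* allocate per-instance state, carve out a slice of the GPU */
	return 0;
}

static int my_vgpu_remove(struct mdev_device *mdev)
{
	/* tear down the instance */
	return 0;
}

static struct attribute_group *my_vgpu_type_groups[] = {
	NULL,	/* one attribute group per supported vGPU type goes here */
};

static const struct mdev_parent_ops my_vgpu_parent_ops = {
	.owner			= THIS_MODULE,
	.supported_type_groups	= my_vgpu_type_groups,
	.create			= my_vgpu_create,
	.remove			= my_vgpu_remove,
};

/* from the physical GPU driver's probe():
 *	mdev_register_device(&pdev->dev, &my_vgpu_parent_ops);
 */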

> My current understanding is that every P9 chip in that box has some NVLink2
> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> as well.
> 
> From small bits of information I have it seems that a GPU can perfectly
> work alone and if the NVIDIA driver does not see these interconnects
> (because we do not pass the rest of the big 3xGPU group to this guest), it
> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> which simply refuses to work until all 3 GPUs are passed so there is some
> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> 
> So we will either have 6 groups (one per GPU) or 2 groups (one per
> interconnected group).

I'm not gaining much confidence that we can rely on isolation between
NVLink connected GPUs, it sounds like you're simply expecting that
proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
is going to play nice and nobody will figure out how to do bad things
because... obfuscation?  Thanks,

Alex
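
Whichever way the grouping question above is resolved, the result is easy to
verify from userspace, since every PCI device's IOMMU group shows up as a
sysfs symlink. A minimal sketch (standard sysfs layout, nothing
NVLink-specific assumed):

#include <stdio.h>
#include <unistd.h>
#include <libgen.h>

/* Print the IOMMU group of a PCI device, e.g. "./a.out 0035:03:00.0".
 * /sys/bus/pci/devices/<addr>/iommu_group is a symlink to
 * /sys/kernel/iommu_groups/<id>.
 */
int main(int argc, char **argv)
{
	char path[256], target[256];
	ssize_t len;

	if (argc < 2)
		return 1;
	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/iommu_group",
		 argv[1]);
	len = readlink(path, target, sizeof(target) - 1);
	if (len < 0) {
		perror(path);
		return 1;
	}
	target[len] = '\0';
	printf("%s -> group %s\n", argv[1], basename(target));
	return 0;
}

Running that over the GPUs and the IBM 04ea bridges before and after a
grouping change would show directly whether they land in six per-GPU groups
or two interconnect-wide ones.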

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-08  3:35         ` Alex Williamson
@ 2018-06-08  3:52           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  3:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On 8/6/18 1:35 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:09:13 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>> On 8/6/18 3:04 am, Alex Williamson wrote:
>>> On Thu,  7 Jun 2018 18:44:20 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>>> index 7bddf1e..38c9475 100644
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>>>  		}
>>>>  	}
>>>>  
>>>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
>>>> +	    pdev->device == 0x1db1 &&
>>>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {  
>>>
>>> Can't we do better than check this based on device ID?  Perhaps PCIe
>>> capability hints at this?  
>>
>> A normal PCI pluggable device looks like this:
>>
>> root@fstn3:~# sudo lspci -vs 0000:03:00.0
>> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>> 	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
>> 	Flags: fast devsel, IRQ 497
>> 	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
>> 	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
>> 	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
>> 	Capabilities: [60] Power Management version 3
>> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>> 	Capabilities: [78] Express Endpoint, MSI 00
>> 	Capabilities: [100] Virtual Channel
>> 	Capabilities: [128] Power Budgeting <?>
>> 	Capabilities: [420] Advanced Error Reporting
>> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>> 	Capabilities: [900] #19
>>
>>
>> This is a NVLink v1 machine:
>>
>> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
>> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
>> 	Subsystem: NVIDIA Corporation Device 116b
>> 	Flags: bus master, fast devsel, latency 0, IRQ 457
>> 	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
>> 	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
>> 	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
>> 	Capabilities: [60] Power Management version 3
>> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>> 	Capabilities: [78] Express Endpoint, MSI 00
>> 	Capabilities: [100] Virtual Channel
>> 	Capabilities: [250] Latency Tolerance Reporting
>> 	Capabilities: [258] L1 PM Substates
>> 	Capabilities: [128] Power Budgeting <?>
>> 	Capabilities: [420] Advanced Error Reporting
>> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>> 	Capabilities: [900] #19
>> 	Kernel driver in use: nvidia
>> 	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
>>
>>
>> This is the one the patch is for:
>>
>> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
>> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
>> (rev a1)
>> 	Subsystem: NVIDIA Corporation Device 1212
>> 	Flags: fast devsel, IRQ 82, NUMA node 8
>> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
>> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
>> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
>> 	Capabilities: [60] Power Management version 3
>> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
>> 	Capabilities: [78] Express Endpoint, MSI 00
>> 	Capabilities: [100] Virtual Channel
>> 	Capabilities: [250] Latency Tolerance Reporting
>> 	Capabilities: [258] L1 PM Substates
>> 	Capabilities: [128] Power Budgeting <?>
>> 	Capabilities: [420] Advanced Error Reporting
>> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>> 	Capabilities: [900] #19
>> 	Capabilities: [ac0] #23
>> 	Kernel driver in use: vfio-pci
>>
>>
>> I can only see a new capability #23 which I have no idea about what it
>> actually does - my latest PCIe spec is
>> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
>> till #21, do you have any better spec? Does not seem promising anyway...
> 
> You could just look in include/uapi/linux/pci_regs.h and see that 23
> (0x17) is a TPH Requester capability and google for that...  It's a TLP
> processing hint related to cache processing for requests from system
> specific interconnects.  Sounds rather promising.  Of course there's
> also the vendor specific capability that might be probed if NVIDIA will
> tell you what to look for and the init function you've implemented
> looks for specific devicetree nodes, that I imagine you could test for
> in a probe as well.


This 23 is in hex:

[aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
(rev a1)
	Subsystem: NVIDIA Corporation Device 1212
	Flags: fast devsel, IRQ 82, NUMA node 8
	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Capabilities: [ac0] #23
	Kernel driver in use: vfio-pci

[aik@yc02goos ~]$ sudo lspci -vvvxxxxs 0035:03:00.0 | grep ac0
	Capabilities: [ac0 v1] #23
ac0: 23 00 01 00 de 10 c1 00 01 00 10 00 00 00 00 00


Talking to NVIDIA is always an option :)
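
In case it helps, the three dwords dumped at 0xac0 above decode even without
NVIDIA: the first dword is the standard extended capability header, and the
next two turn out to match the Designated Vendor-Specific (DVSEC) layout
that comes up in the reply below. A sketch of reading them back (the helper
name is made up, "pos" standing for the 0xac0 offset):

#include <linux/pci.h>

/* 0xac0: 23 00 01 00 -> 0x00010023: cap ID 0x23, version 1, next ptr 0x000
 * 0xac4: de 10 c1 00 -> 0x00c110de: DVSEC vendor 0x10de (NVIDIA) in bits 15:0
 * 0xac8: 01 00 10 00 -> 0x00100001: DVSEC ID 0x0001 in bits 15:0
 */
static void dump_nvidia_dvsec(struct pci_dev *pdev, int pos)
{
	u32 hdr1, hdr2;

	pci_read_config_dword(pdev, pos + 0x4, &hdr1);
	pci_read_config_dword(pdev, pos + 0x8, &hdr2);
	dev_info(&pdev->dev, "DVSEC vendor %04x id %04x\n",
		 hdr1 & 0xffff, hdr2 & 0xffff);
}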


> 
>>> Is it worthwhile to continue with assigning the device in the !ENABLED
>>> case?  For instance, maybe it would be better to provide a weak
>>> definition of vfio_pci_nvlink2_init() that would cause us to fail here
>>> if we don't have this device specific support enabled.  I realize
>>> you're following the example set forth for IGD, but those regions are
>>> optional, for better or worse.  
>>
>>
>> The device is supposed to work even without GPU RAM passed through; in that
>> case it should look like NVLink v1 (there used to be bugs in the driver,
>> maybe there still are - I have not checked for a while - but there is a bug
>> open at NVIDIA about this and they were going to fix it). This is why I
>> chose not to fail here.
> 
> Ok.
> 
>>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
>>>> index 24ee260..2725bc8 100644
>>>> --- a/drivers/vfio/pci/Kconfig
>>>> +++ b/drivers/vfio/pci/Kconfig
>>>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
>>>>  config VFIO_PCI_IGD
>>>>  	depends on VFIO_PCI
>>>>  	def_bool y if X86
>>>> +
>>>> +config VFIO_PCI_NVLINK2
>>>> +	depends on VFIO_PCI
>>>> +	def_bool y if PPC_POWERNV  
>>>
>>> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
>>> a portable implementation that we could re-use on X86 or ARM or any
>>> other platform if hardware appeared for it.  Can we improve that as
>>> well to make this less POWER specific?  Thanks,  
>>
>>
>> As I said in another mail, every P9 chip in that box has some NVLink2 logic
>> on it, so it is not even common among P9s in general, and I am having a hard
>> time seeing these V100s used elsewhere in such a way.
> 
> https://www.redhat.com/archives/vfio-users/2018-May/msg00000.html
> 
> Not much platform info, but based on the rpm mentioned, looks like an
> x86_64 box.  Thanks,

Wow. Interesting. Thanks for the pointer. No advertising material actually
says that it is P9-only or even mentions P9, and the wiki does not say it is
P9-only either. Hmmm...



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  3:44           ` Alex Williamson
@ 2018-06-08  4:14             ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-06-08  4:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On 8/6/18 1:44 pm, Alex Williamson wrote:
> On Fri, 8 Jun 2018 13:08:54 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 8/6/18 8:15 am, Alex Williamson wrote:
>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>   
>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:  
>>>>>
>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>> devices and from that perspective they're isolated.  That's the view of
>>>>> the device used to generate the grouping.  However, not visible to us,
>>>>> these devices are interconnected via NVLink.  What isolation properties
>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>> be to provide a high performance link for p2p between devices?    
>>>>
>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>> and the CPU which is running significantly faster than PCIe.
>>>>
>>>> But yes, there are cross-links and those should probably be accounted
>>>> for in the grouping.  
>>>
>>> Then after we fix the grouping, can we just let the host driver manage
>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>> convince NVIDIA to support more than a single vGPU per VM though)  
>>
>> These are physical GPUs, not the virtual SR-IOV-like things they are also
>> implementing elsewhere.
> 
> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> either.  That's why we have mdev devices now to implement software
> defined devices.  I don't have first hand experience with V-series, but
> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.

So assuming V100 can do vGPU, you are suggesting ditching this patchset and
using mediated vGPUs instead, correct?


>> My current understanding is that every P9 chip in that box has some NVLink2
>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>> as well.
>>
>> From small bits of information I have it seems that a GPU can perfectly
>> work alone and if the NVIDIA driver does not see these interconnects
>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>> which simply refuses to work until all 3 GPUs are passed so there is some
>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>
>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>> interconnected group).
> 
> I'm not gaining much confidence that we can rely on isolation between
> NVLink connected GPUs, it sounds like you're simply expecting that
> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> is going to play nice and nobody will figure out how to do bad things
> because... obfuscation?  Thanks,

Well, we already trust that the proprietary firmware of an SR-IOV-capable
adapter like Mellanox ConnectX is not doing bad things; how is this
different in principle?


ps. their obfuscation is funny indeed :)
-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-06-08  3:52           ` Alexey Kardashevskiy
@ 2018-06-08  4:34             ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  4:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Fri, 8 Jun 2018 13:52:05 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 8/6/18 1:35 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:09:13 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >> On 8/6/18 3:04 am, Alex Williamson wrote:  
> >>> On Thu,  7 Jun 2018 18:44:20 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >>>> index 7bddf1e..38c9475 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci.c
> >>>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>>>  		}
> >>>>  	}
> >>>>  
> >>>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >>>> +	    pdev->device == 0x1db1 &&
> >>>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {    
> >>>
> >>> Can't we do better than check this based on device ID?  Perhaps PCIe
> >>> capability hints at this?    
> >>
> >> A normal PCI pluggable device looks like this:
> >>
> >> root@fstn3:~# sudo lspci -vs 0000:03:00.0
> >> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> >> 	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> >> 	Flags: fast devsel, IRQ 497
> >> 	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> 	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> 	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >>
> >>
> >> This is a NVLink v1 machine:
> >>
> >> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> >> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> >> 	Subsystem: NVIDIA Corporation Device 116b
> >> 	Flags: bus master, fast devsel, latency 0, IRQ 457
> >> 	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
> >> 	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
> >> 	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [250] Latency Tolerance Reporting
> >> 	Capabilities: [258] L1 PM Substates
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >> 	Kernel driver in use: nvidia
> >> 	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> >>
> >>
> >> This is the one the patch is for:
> >>
> >> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> >> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >> (rev a1)
> >> 	Subsystem: NVIDIA Corporation Device 1212
> >> 	Flags: fast devsel, IRQ 82, NUMA node 8
> >> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [250] Latency Tolerance Reporting
> >> 	Capabilities: [258] L1 PM Substates
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >> 	Capabilities: [ac0] #23
> >> 	Kernel driver in use: vfio-pci
> >>
> >>
> >> I can only see a new capability #23 which I have no idea about what it
> >> actually does - my latest PCIe spec is
> >> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> >> till #21, do you have any better spec? Does not seem promising anyway...  
> > 
> > You could just look in include/uapi/linux/pci_regs.h and see that 23
> > (0x17) is a TPH Requester capability and google for that...  It's a TLP
> > processing hint related to cache processing for requests from system
> > specific interconnects.  Sounds rather promising.  Of course there's
> > also the vendor specific capability that might be probed if NVIDIA will
> > tell you what to look for and the init function you've implemented
> > looks for specific devicetree nodes, that I imagine you could test for
> > in a probe as well.  
> 
> 
> This 23 is in hex:
> 
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
> 	Subsystem: NVIDIA Corporation Device 1212
> 	Flags: fast devsel, IRQ 82, NUMA node 8
> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [250] Latency Tolerance Reporting
> 	Capabilities: [258] L1 PM Substates
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 	Capabilities: [ac0] #23
> 	Kernel driver in use: vfio-pci
> 
> [aik@yc02goos ~]$ sudo lspci -vvvxxxxs 0035:03:00.0 | grep ac0
> 	Capabilities: [ac0 v1] #23
> ac0: 23 00 01 00 de 10 c1 00 01 00 10 00 00 00 00 00

Oops, I was thinking lspci printed unknown capability IDs in decimal.
Strange, it's a shared, vendor-specific capability:

https://pcisig.com/sites/default/files/specification_documents/ECN_DVSEC-2015-08-04-clean_0.pdf

We see in your dump that the vendor of this capability is 0x10de
(NVIDIA) and the ID of the capability is 0x0001.  Note that NVIDIA
sponsored this ECN.

> Talking to NVIDIA is always an option :)

There's really no other choice if we want to figure out how to decode these
vendor-specific capabilities; this 0x23 capability at least seems to be
meant for sharing.
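
Put together, a probe along these lines could replace the bare 0x1db1 device
ID check from the patch, keyed on the NVIDIA DVSEC instead. This is only a
sketch: the 0x23 capability constant may not exist in older pci_regs.h
(hence the local define), and whether DVSEC ID 0x0001 really identifies an
NVLink2-capable device is exactly the question that needs NVIDIA's input.

#define PCI_EXT_CAP_ID_DVSEC_GUESS	0x23	/* not in older pci_regs.h */

static bool vfio_pci_looks_like_nvlink2(struct pci_dev *pdev)
{
	u32 hdr1, hdr2;
	int pos = 0;

	while ((pos = pci_find_next_ext_capability(pdev, pos,
					PCI_EXT_CAP_ID_DVSEC_GUESS))) {
		pci_read_config_dword(pdev, pos + 0x4, &hdr1);
		pci_read_config_dword(pdev, pos + 0x8, &hdr2);
		/* vendor 0x10de, DVSEC ID 0x0001, as seen in the dump above */
		if ((hdr1 & 0xffff) == PCI_VENDOR_ID_NVIDIA &&
		    (hdr2 & 0xffff) == 0x0001)
			return true;
	}
	return false;
}

Even then it would only say "this GPU has the NVLink2 DVSEC", not that the
links are actually wired up on this particular platform, which is presumably
where the devicetree nodes mentioned earlier still come in.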

> >>> Is it worthwhile to continue with assigning the device in the !ENABLED
> >>> case?  For instance, maybe it would be better to provide a weak
> >>> definition of vfio_pci_nvlink2_init() that would cause us to fail here
> >>> if we don't have this device specific support enabled.  I realize
> >>> you're following the example set forth for IGD, but those regions are
> >>> optional, for better or worse.    
> >>
> >>
> >> The device is supposed to work even without GPU RAM passed through; in that
> >> case it should look like NVLink v1 (there used to be bugs in the driver,
> >> maybe there still are - I have not checked for a while - but there is a bug
> >> open at NVIDIA about this and they were going to fix it). This is why I
> >> chose not to fail here.
> > 
> > Ok.
> >   
> >>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> >>>> index 24ee260..2725bc8 100644
> >>>> --- a/drivers/vfio/pci/Kconfig
> >>>> +++ b/drivers/vfio/pci/Kconfig
> >>>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
> >>>>  config VFIO_PCI_IGD
> >>>>  	depends on VFIO_PCI
> >>>>  	def_bool y if X86
> >>>> +
> >>>> +config VFIO_PCI_NVLINK2
> >>>> +	depends on VFIO_PCI
> >>>> +	def_bool y if PPC_POWERNV    
> >>>
> >>> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> >>> a portable implementation that we could re-use on X86 or ARM or any
> >>> other platform if hardware appeared for it.  Can we improve that as
> >>> well to make this less POWER specific?  Thanks,    
> >>
> >>
> >> As I said in another mail, every P9 chip in that box has some NVLink2 logic
> >> on it, so it is not even common among P9s in general, and I am having a hard
> >> time seeing these V100s used elsewhere in such a way.
> > 
> > https://www.redhat.com/archives/vfio-users/2018-May/msg00000.html
> > 
> > Not much platform info, but based on the rpm mentioned, looks like an
> > x86_64 box.  Thanks,  
> 
> Wow. Interesting. Thanks for the pointer. No advertising material actually
> says that it is P9-only or even mentions P9, and the wiki does not say it is
> P9-only either. Hmmm...

NVIDIA's own DGX systems are Xeon-based and seem to include NVLink.
The DGX-1 definitely makes use of the SXM2 modules, up to 8 of them.
The DGX Station might be the 4x V100 SXM2 box mentioned in the link.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-06-08  4:34             ` Alex Williamson
  0 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  4:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Benjamin Herrenschmidt,
	Ram Pai, kvm, Alistair Popple

On Fri, 8 Jun 2018 13:52:05 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 8/6/18 1:35 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:09:13 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >> On 8/6/18 3:04 am, Alex Williamson wrote:  
> >>> On Thu,  7 Jun 2018 18:44:20 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >>>> index 7bddf1e..38c9475 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci.c
> >>>> @@ -306,6 +306,15 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>>>  		}
> >>>>  	}
> >>>>  
> >>>> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> >>>> +	    pdev->device == 0x1db1 &&
> >>>> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {    
> >>>
> >>> Can't we do better than check this based on device ID?  Perhaps PCIe
> >>> capability hints at this?    
> >>
> >> A normal PCI pluggable device looks like this:
> >>
> >> root@fstn3:~# sudo lspci -vs 0000:03:00.0
> >> 0000:03:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
> >> 	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
> >> 	Flags: fast devsel, IRQ 497
> >> 	Memory at 3fe000000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> 	Memory at 200000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> 	Memory at 200400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >>
> >>
> >> This is a NVLink v1 machine:
> >>
> >> aik@garrison1:~$ sudo lspci -vs 000a:01:00.0
> >> 000a:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
> >> 	Subsystem: NVIDIA Corporation Device 116b
> >> 	Flags: bus master, fast devsel, latency 0, IRQ 457
> >> 	Memory at 3fe300000000 (32-bit, non-prefetchable) [size=16M]
> >> 	Memory at 260000000000 (64-bit, prefetchable) [size=16G]
> >> 	Memory at 260400000000 (64-bit, prefetchable) [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [250] Latency Tolerance Reporting
> >> 	Capabilities: [258] L1 PM Substates
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >> 	Kernel driver in use: nvidia
> >> 	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
> >>
> >>
> >> This is the one the patch is for:
> >>
> >> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> >> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> >> (rev a1)
> >> 	Subsystem: NVIDIA Corporation Device 1212
> >> 	Flags: fast devsel, IRQ 82, NUMA node 8
> >> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> >> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> >> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> >> 	Capabilities: [60] Power Management version 3
> >> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> >> 	Capabilities: [78] Express Endpoint, MSI 00
> >> 	Capabilities: [100] Virtual Channel
> >> 	Capabilities: [250] Latency Tolerance Reporting
> >> 	Capabilities: [258] L1 PM Substates
> >> 	Capabilities: [128] Power Budgeting <?>
> >> 	Capabilities: [420] Advanced Error Reporting
> >> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> >> 	Capabilities: [900] #19
> >> 	Capabilities: [ac0] #23
> >> 	Kernel driver in use: vfio-pci
> >>
> >>
> >> I can only see a new capability #23 which I have no idea about what it
> >> actually does - my latest PCIe spec is
> >> PCI_Express_Base_r3.1a_December7-2015.pdf and that only knows capabilities
> >> till #21, do you have any better spec? Does not seem promising anyway...  
> > 
> > You could just look in include/uapi/linux/pci_regs.h and see that 23
> > (0x17) is a TPH Requester capability and google for that...  It's a TLP
> > processing hint related to cache processing for requests from system
> > specific interconnects.  Sounds rather promising.  Of course there's
> > also the vendor specific capability that might be probed if NVIDIA will
> > tell you what to look for and the init function you've implemented
> > looks for specific devicetree nodes, that I imagine you could test for
> > in a probe as well.  
> 
> 
> This 23 is in hex:
> 
> [aik@yc02goos ~]$ sudo lspci -vs 0035:03:00.0
> 0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2]
> (rev a1)
> 	Subsystem: NVIDIA Corporation Device 1212
> 	Flags: fast devsel, IRQ 82, NUMA node 8
> 	Memory at 620c280000000 (32-bit, non-prefetchable) [disabled] [size=16M]
> 	Memory at 6228000000000 (64-bit, prefetchable) [disabled] [size=16G]
> 	Memory at 6228400000000 (64-bit, prefetchable) [disabled] [size=32M]
> 	Capabilities: [60] Power Management version 3
> 	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
> 	Capabilities: [78] Express Endpoint, MSI 00
> 	Capabilities: [100] Virtual Channel
> 	Capabilities: [250] Latency Tolerance Reporting
> 	Capabilities: [258] L1 PM Substates
> 	Capabilities: [128] Power Budgeting <?>
> 	Capabilities: [420] Advanced Error Reporting
> 	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
> 	Capabilities: [900] #19
> 	Capabilities: [ac0] #23
> 	Kernel driver in use: vfio-pci
> 
> [aik@yc02goos ~]$ sudo lspci -vvvxxxxs 0035:03:00.0 | grep ac0
> 	Capabilities: [ac0 v1] #23
> ac0: 23 00 01 00 de 10 c1 00 01 00 10 00 00 00 00 00

Oops, I was thinking lspci printed unknown capability IDs in decimal.
Strange, it's a shared, vendor-specific capability:

https://pcisig.com/sites/default/files/specification_documents/ECN_DVSEC-2015-08-04-clean_0.pdf

We see in your dump that the vendor of this capability is 0x10de
(NVIDIA) and the ID of the capability is 0x0001.  Note that NVIDIA
sponsored this ECN.

> Talking to NVIDIA is always an option :)

There's really no other choice if we want to figure out how to decode
these vendor-specific capabilities; this 0x23 capability at least seems
to be meant for sharing.
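
Just to illustrate, a probe for that DVSEC instead of a device ID match
could look roughly like this (a minimal, untested sketch; the 0x23
capability ID and the 0x10de vendor ID are from your dump and the ECN
above, the helper name is made up):

#include <linux/pci.h>

#define PCI_EXT_CAP_ID_DVSEC_NV 0x23    /* DVSEC, per the ECN above */

/* Walk the extended capability list looking for an NVIDIA DVSEC */
static bool vfio_pci_has_nvidia_dvsec(struct pci_dev *pdev)
{
        u16 pos = 0;
        u32 hdr;

        while ((pos = pci_find_next_ext_capability(pdev, pos,
                                                   PCI_EXT_CAP_ID_DVSEC_NV))) {
                /* DVSEC header 1 at +4: bits 15:0 are the vendor ID */
                pci_read_config_dword(pdev, pos + 0x4, &hdr);
                if ((hdr & 0xffff) == PCI_VENDOR_ID_NVIDIA)
                        return true;
        }

        return false;
}

That only tells us "an NVIDIA DVSEC is present" though; what the DVSEC
ID 0x0001 in header 2 actually means is still NVIDIA's to document.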

> >>> Is it worthwhile to continue with assigning the device in the !ENABLED
> >>> case?  For instance, maybe it would be better to provide a weak
> >>> definition of vfio_pci_nvlink2_init() that would cause us to fail here
> >>> if we don't have this device specific support enabled.  I realize
> >>> you're following the example set forth for IGD, but those regions are
> >>> optional, for better or worse.    
> >>
> >>
> >> The device is supposed to work even without GPU RAM passed through, this
> >> should look like NVLink v1 in this case (there used to be bugs in the
> >> driver, may be still are, have not checked for a while but there is a bug
> >> opened at NVIDIA about this and they were going to fix that), this is why I
> >> chose not to fail here.  
> > 
> > Ok.
> >   
> >>>> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> >>>> index 24ee260..2725bc8 100644
> >>>> --- a/drivers/vfio/pci/Kconfig
> >>>> +++ b/drivers/vfio/pci/Kconfig
> >>>> @@ -30,3 +30,7 @@ config VFIO_PCI_INTX
> >>>>  config VFIO_PCI_IGD
> >>>>  	depends on VFIO_PCI
> >>>>  	def_bool y if X86
> >>>> +
> >>>> +config VFIO_PCI_NVLINK2
> >>>> +	depends on VFIO_PCI
> >>>> +	def_bool y if PPC_POWERNV    
> >>>
> >>> As written, this also depends on PPC_POWERNV (or at least TCE), it's not
> >>> a portable implementation that we could re-use on X86 or ARM or any
> >>> other platform if hardware appeared for it.  Can we improve that as
> >>> well to make this less POWER specific?  Thanks,    
> >>
> >>
> >> As I said in another mail, every P9 chip in that box has some NVLink2 logic
> >> on it so it is not even common among P9's in general and I am having hard
> >> time seeing these V100s used elsewhere in such way.  
> > 
> > https://www.redhat.com/archives/vfio-users/2018-May/msg00000.html
> > 
> > Not much platform info, but based on the rpm mentioned, looks like an
> > x86_64 box.  Thanks,  
> 
> Wow. Interesting. Thanks for the pointer. No advertising material actually
> says that it is P9-only or even mentions P9, and the wiki does not say it is
> P9-only either. Hmmm...

NVIDIA's own DGX systems are Xeon-based and seem to include NVLink.
The DGX-1 definitely makes use of the SXM2 modules, up to 8 of them.
The DGX Station might be the 4x V100 SXM2 box mentioned in the link.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  4:14             ` Alexey Kardashevskiy
  (?)
@ 2018-06-08  5:03               ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-06-08  5:03 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Fri, 8 Jun 2018 14:14:23 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 8/6/18 1:44 pm, Alex Williamson wrote:
> > On Fri, 8 Jun 2018 13:08:54 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 8/6/18 8:15 am, Alex Williamson wrote:  
> >>> On Fri, 08 Jun 2018 07:54:02 +1000
> >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >>>     
> >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:    
> >>>>>
> >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> >>>>> devices and from that perspective they're isolated.  That's the view of
> >>>>> the device used to generate the grouping.  However, not visible to us,
> >>>>> these devices are interconnected via NVLink.  What isolation properties
> >>>>> does NVLink provide given that its entire purpose for existing seems to
> >>>>> be to provide a high performance link for p2p between devices?      
> >>>>
> >>>> Not entire. On POWER chips, we also have an nvlink between the device
> >>>> and the CPU which is running significantly faster than PCIe.
> >>>>
> >>>> But yes, there are cross-links and those should probably be accounted
> >>>> for in the grouping.    
> >>>
> >>> Then after we fix the grouping, can we just let the host driver manage
> >>> this coherent memory range and expose vGPUs to guests?  The use case of
> >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> >>> convince NVIDIA to support more than a single vGPU per VM though)    
> >>
> >> These are physical GPUs, not virtual sriov-alike things they are
> >> implementing as well elsewhere.  
> > 
> > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > either.  That's why we have mdev devices now to implement software
> > defined devices.  I don't have first hand experience with V-series, but
> > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.  
> 
> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> using mediated vGPUs instead, correct?

If it turns out that our PCIe-only-based IOMMU grouping doesn't
account for lack of isolation on the NVLink side and we correct that,
limiting assignment to sets of 3 interconnected GPUs, is that still a
useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
whether they choose to support vGPU on these GPUs or whether they can
be convinced to support multiple vGPUs per VM.

> >> My current understanding is that every P9 chip in that box has some NVLink2
> >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> >> as well.
> >>
> >> From small bits of information I have it seems that a GPU can perfectly
> >> work alone and if the NVIDIA driver does not see these interconnects
> >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> >> which simply refuses to work until all 3 GPUs are passed so there is some
> >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> >>
> >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> >> interconnected group).  
> > 
> > I'm not gaining much confidence that we can rely on isolation between
> > NVLink connected GPUs, it sounds like you're simply expecting that
> > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > is going to play nice and nobody will figure out how to do bad things
> > because... obfuscation?  Thanks,  
> 
> Well, we already believe that the proprietary firmware of an SR-IOV-capable
> adapter like Mellanox ConnectX is not doing bad things; how is this
> different in principle?

It seems like the scope and hierarchy are different.  Here we're
talking about exposing big discrete devices, which are peers of one
another (and have a history of being reverse engineered), to userspace
drivers.  Once handed to userspace, each of those devices needs to be
considered untrusted.  In the case of SR-IOV, we typically have a
trusted host driver for the PF managing untrusted VFs.  We do rely on
some sanity in the hardware/firmware in isolating the VFs from each
other and from the PF, but we also often have source code for Linux
drivers for these devices and sometimes even datasheets.  Here we have
neither of those and perhaps we won't know the extent of the lack of
isolation between these devices until nouveau (best case) or some
exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
of isolation between devices unless the hardware provides some
indication that isolation exists, for example ACS on PCIe.  If NVIDIA
wants to expose isolation on NVLink, perhaps they need to document
enough of it that the host kernel can manipulate and test for isolation,
perhaps even enabling virtualization of the NVLink interconnect
interface such that the host can prevent GPUs from interfering with
each other.  Thanks,

Alex
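
PS: the kind of indication I mean is what the PCI core already consults
when it builds IOMMU groups, roughly along these lines (a simplified
sketch of the existing ACS check, not something new to add):

#include <linux/pci.h>

/* The ACS bits the IOMMU layer requires for p2p isolation */
#define REQ_ACS_FLAGS   (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CC | PCI_ACS_UF)

/*
 * Endpoints only land in separate IOMMU groups if ACS is enabled on
 * every bridge between each device and the root; otherwise p2p could
 * bypass the IOMMU and the devices get grouped together.
 */
static bool pcie_p2p_isolated(struct pci_dev *a, struct pci_dev *b)
{
        return pci_acs_path_enabled(a, NULL, REQ_ACS_FLAGS) &&
               pci_acs_path_enabled(b, NULL, REQ_ACS_FLAGS);
}

There is nothing equivalent the kernel can test on the NVLink side
today, which is exactly the gap.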

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-06-08  5:03               ` Alex Williamson
  (?)
@ 2018-07-10  4:10                 ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-10  4:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu, 7 Jun 2018 23:03:23 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 8 Jun 2018 14:14:23 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On 8/6/18 1:44 pm, Alex Williamson wrote:  
> > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >     
> > >> On 8/6/18 8:15 am, Alex Williamson wrote:    
> > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > >>>       
> > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:      
> > >>>>>
> > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > >>>>> devices and from that perspective they're isolated.  That's the view of
> > >>>>> the device used to generate the grouping.  However, not visible to us,
> > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > >>>>> be to provide a high performance link for p2p between devices?        
> > >>>>
> > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > >>>> and the CPU which is running significantly faster than PCIe.
> > >>>>
> > >>>> But yes, there are cross-links and those should probably be accounted
> > >>>> for in the grouping.      
> > >>>
> > >>> Then after we fix the grouping, can we just let the host driver manage
> > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > >>> convince NVIDIA to support more than a single vGPU per VM though)      
> > >>
> > >> These are physical GPUs, not virtual sriov-alike things they are
> > >> implementing as well elsewhere.    
> > > 
> > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > either.  That's why we have mdev devices now to implement software
> > > defined devices.  I don't have first hand experience with V-series, but
> > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.    
> > 
> > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > using mediated vGPUs instead, correct?  
> 
> If it turns out that our PCIe-only-based IOMMU grouping doesn't
> account for lack of isolation on the NVLink side and we correct that,
> limiting assignment to sets of 3 interconnected GPUs, is that still a
> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> whether they choose to support vGPU on these GPUs or whether they can
> be convinced to support multiple vGPUs per VM.
> 
> > >> My current understanding is that every P9 chip in that box has some NVLink2
> > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > >> as well.
> > >>
> > >> From small bits of information I have it seems that a GPU can perfectly
> > >> work alone and if the NVIDIA driver does not see these interconnects
> > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > >>
> > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > >> interconnected group).    
> > > 
> > > I'm not gaining much confidence that we can rely on isolation between
> > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > is going to play nice and nobody will figure out how to do bad things
> > > because... obfuscation?  Thanks,    
> > 
> > Well, we already believe that a proprietary firmware of a sriov-capable
> > adapter like Mellanox ConnextX is not doing bad things, how is this
> > different in principle?  
> 
> It seems like the scope and hierarchy are different.  Here we're
> talking about exposing big discrete devices, which are peers of one
> another (and have history of being reverse engineered), to userspace
> drivers.  Once handed to userspace, each of those devices needs to be
> considered untrusted.  In the case of SR-IOV, we typically have a
> trusted host driver for the PF managing untrusted VFs.  We do rely on
> some sanity in the hardware/firmware in isolating the VFs from each
> other and from the PF, but we also often have source code for Linux
> drivers for these devices and sometimes even datasheets.  Here we have
> neither of those and perhaps we won't know the extent of the lack of
> isolation between these devices until nouveau (best case) or some
> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> of isolation between devices unless the hardware provides some
> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> wants to expose isolation on NVLink, perhaps they need to document
> enough of it that the host kernel can manipulate and test for isolation,
> perhaps even enabling virtualization of the NVLink interconnect
> interface such that the host can prevent GPUs from interfering with
> each other.  Thanks,


So far I got this from NVIDIA:

1. The NVLink2 link state can be controlled via MMIO registers; there is a
"NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is marked
"confidential" though) from NVIDIA with the MMIO addresses to block if
we want to disable certain links. In order for NVLink to work it needs to
be enabled on both sides, so by filtering certain MMIO ranges we can
isolate a GPU.

2. We can and should also prohibit GPU firmware updates; these are
done via MMIO as well. The protocol is not open, but at least the
register ranges might be made available so we can filter these
accesses, and there is no plan to change this.

3. DMA traffic over the NVLink2 link can be of 2 types: UT=1 for
PCI-style DMA via our usual TCE tables (one per NVLink2 link),
and UT=0 for direct host memory access. UT stands for "use
translation" and is part of the NVLink2 protocol. Only UT=1 is
possible over the PCIe link.
UT=0 traffic uses host physical addresses returned by a nest MMU (a
piece of NVIDIA logic on a POWER9 chip): it takes an LPID (guest id),
an mmu context id (guest userspace mm id) and a virtual address,
translates them to a host physical address, and that result is used for
UT=0 DMA. This is called "ATS" although it is not PCIe ATS afaict.
NVIDIA says that the hardware is designed in a way that it can only do
DMA UT=0 to addresses which ATS translated to, and there is no way to
override this behavior and this is what guarantees the isolation.

So isolation can be achieved if I do not miss something.

How do we want this to be documented in order to proceed? I assume that
if I just post patches filtering MMIOs, that won't be enough, right? If
only items 1..3 are documented, will we accept those terms, or do we
need a full GPU API spec (which is not going to happen anyway)?
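
For the record, the MMIO filtering from item 1 above would boil down to
something like this in the subdriver's BAR access path (a rough sketch
only; the table is a placeholder, the real offsets come from that
confidential spec):

#include <linux/kernel.h>
#include <linux/types.h>

/* Hypothetical denylist of NVLink control windows inside a BAR */
struct nvlink2_blocked_range {
        loff_t start;
        loff_t end;
};

static const struct nvlink2_blocked_range nvlink2_blocked[] = {
        { 0x0, 0x0 },   /* placeholder, the real offsets are under NDA */
};

static bool nvlink2_mmio_blocked(loff_t off, size_t count)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(nvlink2_blocked); i++)
                if (off < nvlink2_blocked[i].end &&
                    off + count > nvlink2_blocked[i].start)
                        return true;

        return false;
}

i.e. the region read/write handlers would return an error (or read back
zeroes) for anything overlapping those windows.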



--
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-10  4:10                 ` Alexey Kardashevskiy
  (?)
@ 2018-07-10 22:37                   ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-07-10 22:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Tue, 10 Jul 2018 14:10:20 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On Thu, 7 Jun 2018 23:03:23 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Fri, 8 Jun 2018 14:14:23 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > On 8/6/18 1:44 pm, Alex Williamson wrote:    
> > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > >       
> > > >> On 8/6/18 8:15 am, Alex Williamson wrote:      
> > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > >>>         
> > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:        
> > > >>>>>
> > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > >>>>> devices and from that perspective they're isolated.  That's the view of
> > > >>>>> the device used to generate the grouping.  However, not visible to us,
> > > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > > >>>>> be to provide a high performance link for p2p between devices?          
> > > >>>>
> > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > >>>> and the CPU which is running significantly faster than PCIe.
> > > >>>>
> > > >>>> But yes, there are cross-links and those should probably be accounted
> > > >>>> for in the grouping.        
> > > >>>
> > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > >>> convince NVIDIA to support more than a single vGPU per VM though)        
> > > >>
> > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > >> implementing as well elsewhere.      
> > > > 
> > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > either.  That's why we have mdev devices now to implement software
> > > > defined devices.  I don't have first hand experience with V-series, but
> > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.      
> > > 
> > > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > > using mediated vGPUs instead, correct?    
> > 
> > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > account for lack of isolation on the NVLink side and we correct that,
> > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > whether they choose to support vGPU on these GPUs or whether they can
> > be convinced to support multiple vGPUs per VM.
> >   
> > > >> My current understanding is that every P9 chip in that box has some NVLink2
> > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > > >> as well.
> > > >>
> > > >> From small bits of information I have it seems that a GPU can perfectly
> > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > >>
> > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > >> interconnected group).      
> > > > 
> > > > I'm not gaining much confidence that we can rely on isolation between
> > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > is going to play nice and nobody will figure out how to do bad things
> > > > because... obfuscation?  Thanks,      
> > > 
> > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > adapter like Mellanox ConnextX is not doing bad things, how is this
> > > different in principle?    
> > 
> > It seems like the scope and hierarchy are different.  Here we're
> > talking about exposing big discrete devices, which are peers of one
> > another (and have history of being reverse engineered), to userspace
> > drivers.  Once handed to userspace, each of those devices needs to be
> > considered untrusted.  In the case of SR-IOV, we typically have a
> > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > some sanity in the hardware/firmware in isolating the VFs from each
> > other and from the PF, but we also often have source code for Linux
> > drivers for these devices and sometimes even datasheets.  Here we have
> > neither of those and perhaps we won't know the extent of the lack of
> > isolation between these devices until nouveau (best case) or some
> > exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> > of isolation between devices unless the hardware provides some
> > indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> > wants to expose isolation on NVLink, perhaps they need to document
> > enough of it that the host kernel can manipulate and test for isolation,
> > perhaps even enabling virtualization of the NVLink interconnect
> > interface such that the host can prevent GPUs from interfering with
> > each other.  Thanks,  
> 
> 
> So far I got this from NVIDIA:
> 
> 1. An NVLink2 state can be controlled via MMIO registers, there is a
> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> "confidential" though) from NVIDIA with the MMIO addresses to block if
> we want to disable certain links. In order for NVLink to work, it needs
> to be enabled on both sides, so by filtering certain MMIO ranges we can
> isolate a GPU.

Where are these MMIO registers, on the bridge or on the endpoint device?
I'm wondering when you say block MMIO if these are ranges on the device
that we disallow mmap to and all the overlapping PAGE_SIZE issues that
come with that or if this should essentially be device specific
enable_acs and acs_enabled quirks, and maybe also potentially used by
Logan's disable acs series to allow GPUs to be linked and have grouping
to match.
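
(Purely to sketch the second option, not actual quirks.c code; the
names below are invented for illustration: a per-device hook which
reports whether the NVLink side has been isolated, which the grouping
code could consult much like the existing ACS checks.)

#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/pci_ids.h>

/*
 * Illustrative sketch only: neither the table nor the functions below
 * exist in drivers/pci/quirks.c; they just show the shape a
 * device-specific "NVLink isolated" quirk could take.
 */
static int nvlink_isolated_v100(struct pci_dev *pdev)
{
	/*
	 * A real implementation would verify (or program) the GPU's
	 * link-disable state here and only then report isolation.
	 */
	return 1;
}

static const struct {
	u16 vendor;
	u16 device;
	int (*isolated)(struct pci_dev *pdev);
} nvlink_isolation_quirks[] = {
	/* 0x1db1 used as an example V100 SXM2 device ID */
	{ PCI_VENDOR_ID_NVIDIA, 0x1db1, nvlink_isolated_v100 },
};

static int pci_dev_nvlink_isolated(struct pci_dev *pdev)	/* hypothetical */
{
	unsigned int i;

	for (i = 0; i < ARRAY_SIZE(nvlink_isolation_quirks); i++)
		if (pdev->vendor == nvlink_isolation_quirks[i].vendor &&
		    pdev->device == nvlink_isolation_quirks[i].device)
			return nvlink_isolation_quirks[i].isolated(pdev);
	return 0;	/* assume no isolation unless a quirk says otherwise */
}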

> 2. We can and should also prohibit GPU firmware updates, which are
> done via MMIO as well. The protocol is not open, but at least the
> register ranges might be, which would let us filter these accesses;
> there is no plan to change this.

I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
along with it.  Also, there are certainly use cases of updating
firmware for an assigned device, we don't want to impose a policy, but
we should figure out the right place for that policy to be specified by
the admin.

> 3. DMA traffic over the NVLink2 link can be of 2 types: UT=1 for
> PCI-style DMA via our usual TCE tables (one per NVLink2 link),
> and UT=0 for direct host memory access. UT stands for "use
> translation" and is part of the NVLink2 protocol. Only UT=1 is
> possible over the PCIe link.
> UT=0 traffic uses host physical addresses returned by a nest MMU (a
> piece of NVIDIA logic on a POWER9 chip): it takes an LPID (guest id),
> an mmu context id (guest userspace mm id) and a virtual address,
> translates them to a host physical address, and that result is used
> for UT=0 DMA. This is called "ATS" although it is not PCIe ATS afaict.
> NVIDIA says that the hardware is designed in such a way that it can
> only do UT=0 DMA to addresses which ATS translated to, there is no way
> to override this behavior, and this is what guarantees the isolation.

I'm kinda lost here, maybe we can compare it to PCIe ATS where an
endpoint requests a translation of an IOVA to physical address, the
IOMMU returns a lookup based on PCIe requester ID, and there's an
invalidation protocol to keep things coherent.  In the case above, who
provides a guest id and mmu context id?  Additional software
somewhere?  Is the virtual address an IOVA or a process virtual
address?  Do we assume some sort of invalidation protocol as well?
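
(Restating the two models with made-up types, just to pin down which
inputs I think are involved in each case:)

#include <stdint.h>

/* PCIe ATS as I described it: the IOMMU resolves an IOVA keyed on the
 * PCIe requester ID (plus PASID where used), and invalidations are
 * sent back to the endpoint to keep its translation cache coherent. */
struct pcie_ats_request {
	uint16_t requester_id;	/* bus:dev:fn */
	uint32_t pasid;		/* optional process address space id */
	uint64_t iova;
};

/* The NVLink2/nest MMU scheme from item 3: the lookup is keyed on the
 * LPID and mmu context id instead; whether the address is an IOVA or
 * a process virtual address is exactly the question above. */
struct nvlink2_ats_request {
	uint32_t lpid;		/* guest id */
	uint32_t context_id;	/* guest userspace mm id */
	uint64_t va;
};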

> So isolation can be achieved, unless I am missing something.
> 
> How do we want this to be documented in order to proceed? I assume that
> if I just post patches filtering MMIOs, this won't do it, right? If only
> items 1..3 are documented, will we take this t&c or do we need a GPU API
> spec (which is not going to happen anyway)?

"t&c"?  I think we need what we're actually interacting with to be well
documented, but that could be _thorough_ comments in the code, enough
to understand the theory of operation, as far as I'm concerned.  A pdf
lost on a corporate webserver isn't necessarily an improvement over
that, but there needs to be sufficient detail to understand what we're
touching such that we can maintain, adapt, and improve the code over
time.  Only item #3 above appears POWER specific, so I'd hope that #1
is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
vfio-pci, but I'm not sure that's necessary), and I don't know where #3
goes.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-10 22:37                   ` Alex Williamson
  (?)
@ 2018-07-11  9:26                     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-11  9:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Tue, 10 Jul 2018 16:37:15 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 10 Jul 2018 14:10:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On Thu, 7 Jun 2018 23:03:23 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri, 8 Jun 2018 14:14:23 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >     
> > > > On 8/6/18 1:44 pm, Alex Williamson wrote:      
> > > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >         
> > > > >> On 8/6/18 8:15 am, Alex Williamson wrote:        
> > > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > > >>>           
> > > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:          
> > > > >>>>>
> > > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > > >>>>> devices and from that perspective they're isolated.  That's the view of
> > > > >>>>> the device used to generate the grouping.  However, not visible to us,
> > > > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > > > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > > > >>>>> be to provide a high performance link for p2p between devices?            
> > > > >>>>
> > > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > > >>>> and the CPU which is running significantly faster than PCIe.
> > > > >>>>
> > > > >>>> But yes, there are cross-links and those should probably be accounted
> > > > >>>> for in the grouping.          
> > > > >>>
> > > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > > >>> convince NVIDIA to support more than a single vGPU per VM though)          
> > > > >>
> > > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > > >> implementing as well elsewhere.        
> > > > > 
> > > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > > either.  That's why we have mdev devices now to implement software
> > > > > defined devices.  I don't have first hand experience with V-series, but
> > > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.        
> > > > 
> > > > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > > > using mediated vGPUs instead, correct?      
> > > 
> > > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > > account for lack of isolation on the NVLink side and we correct that,
> > > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > > whether they choose to support vGPU on these GPUs or whether they can
> > > be convinced to support multiple vGPUs per VM.
> > >     
> > > > >> My current understanding is that every P9 chip in that box has some NVLink2
> > > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > > > >> as well.
> > > > >>
> > > > >> From small bits of information I have it seems that a GPU can perfectly
> > > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > > > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > > >>
> > > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > > >> interconnected group).        
> > > > > 
> > > > > I'm not gaining much confidence that we can rely on isolation between
> > > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > > is going to play nice and nobody will figure out how to do bad things
> > > > > because... obfuscation?  Thanks,        
> > > > 
> > > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > > adapter like Mellanox ConnextX is not doing bad things, how is this
> > > > different in principle?      
> > > 
> > > It seems like the scope and hierarchy are different.  Here we're
> > > talking about exposing big discrete devices, which are peers of one
> > > another (and have history of being reverse engineered), to userspace
> > > drivers.  Once handed to userspace, each of those devices needs to be
> > > considered untrusted.  In the case of SR-IOV, we typically have a
> > > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > > some sanity in the hardware/firmware in isolating the VFs from each
> > > other and from the PF, but we also often have source code for Linux
> > > drivers for these devices and sometimes even datasheets.  Here we have
> > > neither of those and perhaps we won't know the extent of the lack of
> > > isolation between these devices until nouveau (best case) or some
> > > exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> > > of isolation between devices unless the hardware provides some
> > > indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> > > wants to expose isolation on NVLink, perhaps they need to document
> > > enough of it that the host kernel can manipulate and test for isolation,
> > > perhaps even enabling virtualization of the NVLink interconnect
> > > interface such that the host can prevent GPUs from interfering with
> > > each other.  Thanks,    
> > 
> > 
> > So far I got this from NVIDIA:
> > 
> > 1. An NVLink2 state can be controlled via MMIO registers, there is a
> > "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> > "confidential" though) from NVIDIA with the MMIO addresses to block if
> > we want to disable certain links. In order for NVLink to work, it needs
> > to be enabled on both sides, so by filtering certain MMIO ranges we can
> > isolate a GPU.  
> 
> Where are these MMIO registers, on the bridge or on the endpoint device?

The endpoint GPU device.

> I'm wondering when you say block MMIO if these are ranges on the device
> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
> come with that or if this should essentially be device specific
> enable_acs and acs_enabled quirks, and maybe also potentially used by
> Logan's disable acs series to allow GPUs to be linked and have grouping
> to match.

An update: I confused P100 and V100. P100 would need filtering, but
ours is V100, which has a couple of registers we can use to disable
particular links; once disabled, a link cannot be re-enabled until the
next secondary bus reset.
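
Very roughly, something like the following on the host side before the
GPU is handed over (the register offsets and the bit below are
placeholders, not the real layout):

#include <linux/io.h>

/* Sketch only: NVLINK_DISABLE_REG()/NVLINK_DISABLE_BIT stand in for
 * the real (not publicly documented) V100 registers.  The point is a
 * single write per link which sticks until the next secondary bus
 * reset of the GPU. */
#define NVLINK_DISABLE_REG(link)	(0x00a00000 + (link) * 0x100)
#define NVLINK_DISABLE_BIT		0x1

static void v100_disable_links(void __iomem *bar0, unsigned long link_mask)
{
	unsigned int link;

	for (link = 0; link < 6; link++)	/* V100 has 6 NVLink2 bricks */
		if (link_mask & (1UL << link))
			writel(NVLINK_DISABLE_BIT,
			       bar0 + NVLINK_DISABLE_REG(link));
}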


> > 2. We can and should also prohibit GPU firmware updates, which are
> > done via MMIO as well. The protocol is not open, but at least the
> > register ranges might be, which would let us filter these accesses;
> > there is no plan to change this.  
> 
> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
> along with it.

Yes, however NVIDIA says there is nothing performance-critical in this
64K page.

> Also, there are certainly use cases of updating
> firmware for an assigned device, we don't want to impose a policy, but
> we should figure out the right place for that policy to be specified by
> the admin.

Maybe, but NVIDIA is talking about some "out-of-band" command to the GPU
to enable firmware updates, so firmware update is not really supported.


> > 3. DMA traffic over the NVLink2 link can be of 2 types: UT=1 for
> > PCI-style DMA via our usual TCE tables (one per NVLink2 link),
> > and UT=0 for direct host memory access. UT stands for "use
> > translation" and is part of the NVLink2 protocol. Only UT=1 is
> > possible over the PCIe link.
> > UT=0 traffic uses host physical addresses returned by a nest MMU (a
> > piece of NVIDIA logic on a POWER9 chip): it takes an LPID (guest id),
> > an mmu context id (guest userspace mm id) and a virtual address,
> > translates them to a host physical address, and that result is used
> > for UT=0 DMA. This is called "ATS" although it is not PCIe ATS afaict.
> > NVIDIA says that the hardware is designed in such a way that it can
> > only do UT=0 DMA to addresses which ATS translated to, there is no way
> > to override this behavior, and this is what guarantees the isolation.  
> 
> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
> endpoint requests a translation of an IOVA to physical address, the
> IOMMU returns a lookup based on PCIe requester ID, and there's an
> invalidation protocol to keep things coherent.

Yes, there is. The current approach is to have an MMU notifier in the
kernel which tells the NPU (an IBM piece of logic between the
GPU/NVLink2 and the NVIDIA nest MMU) to invalidate translations; that
in turn pokes the GPU until it confirms that it has invalidated its
TLBs and there is no ongoing DMA.
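
In heavily simplified form it is just an mmu_notifier hook; the real
NPU poking and the "wait for the GPU to ack" part are collapsed into
npu_mmio_invalidate() below (a sketch, not the actual powernv code):

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

struct npu_context_sketch {
	struct mmu_notifier mn;
	/* per-GPU / per-LPID state would live here */
};

static void npu_mmio_invalidate(struct npu_context_sketch *ctx,
				unsigned long start, unsigned long end)
{
	/* tell the NPU to shoot down [start, end), then wait until the
	 * GPU confirms its TLBs are clean and no DMA is in flight */
}

static void npu_sketch_invalidate_range(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start,
					unsigned long end)
{
	struct npu_context_sketch *ctx =
		container_of(mn, struct npu_context_sketch, mn);

	npu_mmio_invalidate(ctx, start, end);
}

static const struct mmu_notifier_ops npu_sketch_ops = {
	.invalidate_range = npu_sketch_invalidate_range,
};

/* registration: ctx->mn.ops = &npu_sketch_ops;
 *               mmu_notifier_register(&ctx->mn, current->mm); */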

> In the case above, who provides a guest id and mmu context id? 

We (the powerpc/powernv platform) configure the NPU to bind a specific
bus:dev:fn to an LPID (== guest id), and the MMU context id comes from
the guest. The nest MMU knows where the partition table is, and this
table contains all the pointers needed for the translation.
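
So conceptually the binding is just this tuple (a made-up struct, only
to show which side provides what):

#include <stdint.h>

/* Made-up representation of what the platform programs into the NPU so
 * that the nest MMU walks the right partition/process tables for UT=0
 * traffic. */
struct npu_binding_sketch {
	uint16_t bdfn;		/* GPU/NVLink2 bridge bus:dev:fn, host-chosen */
	uint64_t lpid;		/* partition (guest) id, set by the host */
	uint64_t context_id;	/* mm context id, supplied by the guest */
};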


> Additional software
> somewhere?  Is the virtual address an IOVA or a process virtual
> address? 

A guest kernel or a guest userspace virtual address.

> Do we assume some sort of invalidation protocol as well?

I am a little confused: is this question about the same invalidation
protocol as above, or a different one?


> > So isolation can be achieved, unless I am missing something.
> > 
> > How do we want this to be documented in order to proceed? I assume that
> > if I just post patches filtering MMIOs, this won't do it, right? If only
> > items 1..3 are documented, will we take this t&c or do we need a GPU API
> > spec (which is not going to happen anyway)?  
> 
> "t&c"? I think we need what we're actually interacting with to be well
> documented, but that could be _thorough_ comments in the code, enough
> to understand the theory of operation, as far as I'm concerned.  A pdf
> lost on a corporate webserver isn't necessarily an improvement over
> that, but there needs to be sufficient detail to understand what we're
> touching such that we can maintain, adapt, and improve the code over
> time.  Only item #3 above appears POWER specific, so I'd hope that #1
> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
> goes.  Thanks,

Ok, understood. Thanks!


--
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-07-11  9:26                     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-11  9:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, kvm-ppc,
	Ram Pai, kvm, Alistair Popple

On Tue, 10 Jul 2018 16:37:15 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 10 Jul 2018 14:10:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On Thu, 7 Jun 2018 23:03:23 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri, 8 Jun 2018 14:14:23 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >     
> > > > On 8/6/18 1:44 pm, Alex Williamson wrote:      
> > > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >         
> > > > >> On 8/6/18 8:15 am, Alex Williamson wrote:        
> > > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > > >>>           
> > > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:          
> > > > >>>>>
> > > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > > >>>>> devices and from that perspective they're isolated.  That's the view of
> > > > >>>>> the device used to generate the grouping.  However, not visible to us,
> > > > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > > > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > > > >>>>> be to provide a high performance link for p2p between devices?            
> > > > >>>>
> > > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > > >>>> and the CPU which is running significantly faster than PCIe.
> > > > >>>>
> > > > >>>> But yes, there are cross-links and those should probably be accounted
> > > > >>>> for in the grouping.          
> > > > >>>
> > > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > > >>> convince NVIDIA to support more than a single vGPU per VM though)          
> > > > >>
> > > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > > >> implementing as well elsewhere.        
> > > > > 
> > > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > > either.  That's why we have mdev devices now to implement software
> > > > > defined devices.  I don't have first hand experience with V-series, but
> > > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.        
> > > > 
> > > > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > > > using mediated vGPUs instead, correct?      
> > > 
> > > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > > account for lack of isolation on the NVLink side and we correct that,
> > > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > > whether they choose to support vGPU on these GPUs or whether they can
> > > be convinced to support multiple vGPUs per VM.
> > >     
> > > > >> My current understanding is that every P9 chip in that box has some NVLink2
> > > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > > > >> as well.
> > > > >>
> > > > >> From small bits of information I have it seems that a GPU can perfectly
> > > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > > > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > > >>
> > > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > > >> interconnected group).        
> > > > > 
> > > > > I'm not gaining much confidence that we can rely on isolation between
> > > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > > is going to play nice and nobody will figure out how to do bad things
> > > > > because... obfuscation?  Thanks,        
> > > > 
> > > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > > adapter like Mellanox ConnextX is not doing bad things, how is this
> > > > different in principle?      
> > > 
> > > It seems like the scope and hierarchy are different.  Here we're
> > > talking about exposing big discrete devices, which are peers of one
> > > another (and have history of being reverse engineered), to userspace
> > > drivers.  Once handed to userspace, each of those devices needs to be
> > > considered untrusted.  In the case of SR-IOV, we typically have a
> > > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > > some sanity in the hardware/firmware in isolating the VFs from each
> > > other and from the PF, but we also often have source code for Linux
> > > drivers for these devices and sometimes even datasheets.  Here we have
> > > neither of those and perhaps we won't know the extent of the lack of
> > > isolation between these devices until nouveau (best case) or some
> > > exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> > > of isolation between devices unless the hardware provides some
> > > indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> > > wants to expose isolation on NVLink, perhaps they need to document
> > > enough of it that the host kernel can manipulate and test for isolation,
> > > perhaps even enabling virtualization of the NVLink interconnect
> > > interface such that the host can prevent GPUs from interfering with
> > > each other.  Thanks,    
> > 
> > 
> > So far I got this from NVIDIA:
> > 
> > 1. An NVLink2 state can be controlled via MMIO registers, there is a
> > "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> > "confidential" though) from NVIDIA with the MMIO addresses to block if
> > we want to disable certain links. In order to NVLink to work it needs to
> > be enabled on both sides so by filtering certains MMIO ranges we can
> > isolate a GPU.  
> 
> Where are these MMIO registers, on the bridge or on the endpoint device?

The endpoint GPU device.

> I'm wondering when you say block MMIO if these are ranges on the device
> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
> come with that or if this should essentially be device specific
> enable_acs and acs_enabled quirks, and maybe also potentially used by
> Logan's disable acs series to allow GPUs to be linked and have grouping
> to match.

An update, I confused P100 and V100, P100 would need filtering but
ours is V100 and it has a couple of registers which we can use to
disable particular links and once disabled, the link cannot be
re-enabled till the next secondary bus reset.


> > 2. We can and should also prohibit the GPU firmware update, this is
> > done via MMIO as well. The protocol is not open but at least register
> > ranges might be in order to filter these accesses, and there is no
> > plan to change this.  
> 
> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
> along with it.

Yes, however NVIDIA says there is no performance critical stuff with
this 64K page.

> Also, there are certainly use cases of updating
> firmware for an assigned device, we don't want to impose a policy, but
> we should figure out the right place for that policy to be specified by
> the admin.

May be but NVIDIA is talking about some "out-of-band" command to the GPU
to enable firmware update so firmware update is not really supported.


> > 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
> > PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
> > and UT=0 for direct host memory access. UT stands for "use
> > translation" and this is a part of the NVLink2 protocol. Only UT=1 is
> > possible over the PCIe link.
> > This UT=0 trafic uses host physical addresses returned by a nest MMU (a
> > piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
> > mmu context id (guest userspace mm id), a virtual address and translates
> > to the host physical and that result is used for UT=0 DMA, this is
> > called "ATS" although it is not PCIe ATS afaict.
> > NVIDIA says that the hardware is designed in a way that it can only do
> > DMA UT=0 to addresses which ATS translated to, and there is no way to
> > override this behavior and this is what guarantees the isolation.  
> 
> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
> endpoint requests a translation of an IOVA to physical address, the
> IOMMU returns a lookup based on PCIe requester ID, and there's an
> invalidation protocol to keep things coherent.

Yes there is. The current approach is to have an MMU notifier in
the kernel which tells an NPU (IBM piece of logic between GPU/NVlink2
and NVIDIA nest MMU) to invalidate translations and that in turn pokes
the GPU till that confirms that it invalidated tlbs and there is no
ongoing DMA.

> In the case above, who provides a guest id and mmu context id? 

We (powerpc/powernv platform) configure NPU to bind specific bus:dev:fn to
an LPID (== guest id) and MMU context id comes from the guest. The nest
MMU knows where the partition table and this table contains all the
pointers needs for the translation.


> Additional software
> somewhere?  Is the virtual address an IOVA or a process virtual
> address? 

A guest kernel or a guest userspace virtual address.

> Do we assume some sort of invalidation protocol as well?

I am little confused, is this question about the same invalidation
protocol as above or different?


> > So isolation can be achieved if I do not miss something.
> > 
> > How do we want this to be documented to proceed? I assume if I post
> > patches filtering MMIOs, this won't do it, right? If just 1..3 are
> > documented, will we take this t&c or we need a GPU API spec (which is
> > not going to happen anyway)?  
> 
> "t&c"? I think we need what we're actually interacting with to be well
> documented, but that could be _thorough_ comments in the code, enough
> to understand the theory of operation, as far as I'm concerned.  A pdf
> lost on a corporate webserver isn't necessarily an improvement over
> that, but there needs to be sufficient detail to understand what we're
> touching such that we can maintain, adapt, and improve the code over
> time.  Only item #3 above appears POWER specific, so I'd hope that #1
> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
> goes.  Thanks,

Ok, understood. Thanks!


--
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-07-11  9:26                     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-11  9:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Tue, 10 Jul 2018 16:37:15 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 10 Jul 2018 14:10:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On Thu, 7 Jun 2018 23:03:23 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri, 8 Jun 2018 14:14:23 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >     
> > > > On 8/6/18 1:44 pm, Alex Williamson wrote:      
> > > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >         
> > > > >> On 8/6/18 8:15 am, Alex Williamson wrote:        
> > > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > > >>>           
> > > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:          
> > > > >>>>>
> > > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > > >>>>> devices and from that perspective they're isolated.  That's the view of
> > > > >>>>> the device used to generate the grouping.  However, not visible to us,
> > > > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > > > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > > > >>>>> be to provide a high performance link for p2p between devices?            
> > > > >>>>
> > > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > > >>>> and the CPU which is running significantly faster than PCIe.
> > > > >>>>
> > > > >>>> But yes, there are cross-links and those should probably be accounted
> > > > >>>> for in the grouping.          
> > > > >>>
> > > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > > >>> convince NVIDIA to support more than a single vGPU per VM though)          
> > > > >>
> > > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > > >> implementing as well elsewhere.        
> > > > > 
> > > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > > either.  That's why we have mdev devices now to implement software
> > > > > defined devices.  I don't have first hand experience with V-series, but
> > > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.        
> > > > 
> > > > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > > > using mediated vGPUs instead, correct?      
> > > 
> > > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > > account for lack of isolation on the NVLink side and we correct that,
> > > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > > whether they choose to support vGPU on these GPUs or whether they can
> > > be convinced to support multiple vGPUs per VM.
> > >     
> > > > >> My current understanding is that every P9 chip in that box has some NVLink2
> > > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > > > >> as well.
> > > > >>
> > > > >> From small bits of information I have it seems that a GPU can perfectly
> > > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > > > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > > >>
> > > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > > >> interconnected group).        
> > > > > 
> > > > > I'm not gaining much confidence that we can rely on isolation between
> > > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > > is going to play nice and nobody will figure out how to do bad things
> > > > > because... obfuscation?  Thanks,        
> > > > 
> > > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > > adapter like Mellanox ConnextX is not doing bad things, how is this
> > > > different in principle?      
> > > 
> > > It seems like the scope and hierarchy are different.  Here we're
> > > talking about exposing big discrete devices, which are peers of one
> > > another (and have history of being reverse engineered), to userspace
> > > drivers.  Once handed to userspace, each of those devices needs to be
> > > considered untrusted.  In the case of SR-IOV, we typically have a
> > > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > > some sanity in the hardware/firmware in isolating the VFs from each
> > > other and from the PF, but we also often have source code for Linux
> > > drivers for these devices and sometimes even datasheets.  Here we have
> > > neither of those and perhaps we won't know the extent of the lack of
> > > isolation between these devices until nouveau (best case) or some
> > > exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> > > of isolation between devices unless the hardware provides some
> > > indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> > > wants to expose isolation on NVLink, perhaps they need to document
> > > enough of it that the host kernel can manipulate and test for isolation,
> > > perhaps even enabling virtualization of the NVLink interconnect
> > > interface such that the host can prevent GPUs from interfering with
> > > each other.  Thanks,    
> > 
> > 
> > So far I got this from NVIDIA:
> > 
> > 1. An NVLink2 state can be controlled via MMIO registers, there is a
> > "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> > "confidential" though) from NVIDIA with the MMIO addresses to block if
> > we want to disable certain links. In order to NVLink to work it needs to
> > be enabled on both sides so by filtering certains MMIO ranges we can
> > isolate a GPU.  
> 
> Where are these MMIO registers, on the bridge or on the endpoint device?

The endpoint GPU device.

> I'm wondering when you say block MMIO if these are ranges on the device
> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
> come with that or if this should essentially be device specific
> enable_acs and acs_enabled quirks, and maybe also potentially used by
> Logan's disable acs series to allow GPUs to be linked and have grouping
> to match.

An update: I confused P100 and V100. P100 would need MMIO filtering,
but ours is V100, which has a couple of registers we can use to
disable particular links; once disabled, a link cannot be re-enabled
until the next secondary bus reset.
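
To make the mechanism concrete, here is a minimal sketch of what a
host-side quirk could look like; the register offset and bit below are
placeholders, the real values only exist in the confidential spec:

#include <linux/pci.h>
#include <linux/io.h>
#include <linux/bitops.h>

/* Hypothetical values standing in for the ones from the NVIDIA spec */
#define NVLINK_DISABLE_OFFSET	0x0
#define NVLINK_DISABLE_BIT	BIT(0)

static void v100_disable_nvlinks(struct pci_dev *pdev)
{
	void __iomem *bar0 = pci_iomap(pdev, 0, 0);
	u32 val;

	if (!bar0)
		return;

	/* Sticky until the next secondary bus reset */
	val = readl(bar0 + NVLINK_DISABLE_OFFSET);
	writel(val | NVLINK_DISABLE_BIT, bar0 + NVLINK_DISABLE_OFFSET);

	pci_iounmap(pdev, bar0);
}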


> > 2. We can and should also prohibit the GPU firmware update, this is
> > done via MMIO as well. The protocol is not open but at least register
> > ranges might be in order to filter these accesses, and there is no
> > plan to change this.  
> 
> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
> along with it.

Yes, however NVIDIA says there is nothing performance-critical on
this 64K page.
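
Since mmap of that page would simply be refused, accesses would go
through the read/write path where they can be filtered; a rough sketch
of the kind of check this implies, with made-up window offsets:

#include <linux/types.h>
#include <linux/errno.h>

/* Hypothetical window inside BAR0; the real offsets are not public */
#define FWUPD_WINDOW_START	0x200000ULL
#define FWUPD_WINDOW_END	0x210000ULL

/* Does the access overlap the blocked firmware-update window? */
static bool fwupd_window_blocked(u64 offset, size_t count)
{
	return offset < FWUPD_WINDOW_END &&
	       offset + count > FWUPD_WINDOW_START;
}

/* A vfio-pci style BAR access handler would then do something like: */
static ssize_t gpu_bar_rw_filtered(u64 offset, size_t count, bool iswrite)
{
	if (fwupd_window_blocked(offset, count))
		return -EPERM;	/* or return zeroes for reads */
	/* otherwise fall through to the normal BAR access path */
	return count;
}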

> Also, there are certainly use cases of updating
> firmware for an assigned device, we don't want to impose a policy, but
> we should figure out the right place for that policy to be specified by
> the admin.

Maybe, but NVIDIA is talking about some "out-of-band" command to the GPU
to enable firmware update, so firmware update is not really supported.


> > 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
> > PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
> > and UT=0 for direct host memory access. UT stands for "use
> > translation" and this is a part of the NVLink2 protocol. Only UT=1 is
> > possible over the PCIe link.
> > This UT=0 trafic uses host physical addresses returned by a nest MMU (a
> > piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
> > mmu context id (guest userspace mm id), a virtual address and translates
> > to the host physical and that result is used for UT=0 DMA, this is
> > called "ATS" although it is not PCIe ATS afaict.
> > NVIDIA says that the hardware is designed in a way that it can only do
> > DMA UT=0 to addresses which ATS translated to, and there is no way to
> > override this behavior and this is what guarantees the isolation.  
> 
> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
> endpoint requests a translation of an IOVA to physical address, the
> IOMMU returns a lookup based on PCIe requester ID, and there's an
> invalidation protocol to keep things coherent.

Yes, there is. The current approach is to have an MMU notifier in
the kernel which tells the NPU (an IBM piece of logic between the
GPU/NVLink2 and the NVIDIA nest MMU) to invalidate translations; that
in turn pokes the GPU until it confirms that it has invalidated its
TLBs and there is no ongoing DMA.
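
For illustration only, a minimal sketch of how such a notifier could be
hooked up; npu_context and npu_flush_and_wait() are made-up names
standing in for the platform code that actually pokes the NPU/GPU and
waits for completion:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct npu_context {
	struct mmu_notifier mn;
	/* NPU handle, LPID, etc. would live here */
};

/*
 * Hypothetical: issue the invalidation to the NPU and poll until the
 * GPU confirms its TLBs are clean and DMA has drained.
 */
static void npu_flush_and_wait(struct npu_context *ctx,
			       unsigned long start, unsigned long end)
{
}

static void npu_mn_invalidate_range(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long end)
{
	struct npu_context *ctx = container_of(mn, struct npu_context, mn);

	npu_flush_and_wait(ctx, start, end);
}

static const struct mmu_notifier_ops npu_mn_ops = {
	.invalidate_range = npu_mn_invalidate_range,
};

static int npu_register_mn(struct npu_context *ctx, struct mm_struct *mm)
{
	ctx->mn.ops = &npu_mn_ops;
	return mmu_notifier_register(&ctx->mn, mm);
}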

> In the case above, who provides a guest id and mmu context id? 

We (the powerpc/powernv platform) configure the NPU to bind a specific
bus:dev:fn to an LPID (= guest id), and the MMU context id comes from
the guest. The nest MMU knows where the partition table is, and this
table contains all the pointers needed for the translation.
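
Purely as an illustration of what that binding carries (the structure
and function below are invented names, not the real powernv interface):

#include <linux/types.h>

struct npu_gpu_binding {
	u16 bdfn;	/* bus:dev:fn of the device doing UT=0 DMA */
	u32 lpid;	/* partition (guest) id chosen by the hypervisor */
	u64 ptcr;	/* partition table base the nest MMU walks */
};

static int npu_bind_gpu(struct npu_gpu_binding *b)
{
	/*
	 * Hypothetical: program the NPU so that requests from b->bdfn are
	 * translated under b->lpid. The MMU context id (PID) is not part
	 * of this static binding; it comes from the guest per request and
	 * is resolved through the partition table.
	 */
	return 0;
}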


> Additional software
> somewhere?  Is the virtual address an IOVA or a process virtual
> address? 

A guest kernel or a guest userspace virtual address.

> Do we assume some sort of invalidation protocol as well?

I am a little confused: is this question about the same invalidation
protocol as above, or a different one?


> > So isolation can be achieved if I do not miss something.
> > 
> > How do we want this to be documented to proceed? I assume if I post
> > patches filtering MMIOs, this won't do it, right? If just 1..3 are
> > documented, will we take this t&c or we need a GPU API spec (which is
> > not going to happen anyway)?  
> 
> "t&c"? I think we need what we're actually interacting with to be well
> documented, but that could be _thorough_ comments in the code, enough
> to understand the theory of operation, as far as I'm concerned.  A pdf
> lost on a corporate webserver isn't necessarily an improvement over
> that, but there needs to be sufficient detail to understand what we're
> touching such that we can maintain, adapt, and improve the code over
> time.  Only item #3 above appears POWER specific, so I'd hope that #1
> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
> goes.  Thanks,

Ok, understood. Thanks!


--
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-11  9:26                     ` Alexey Kardashevskiy
  (?)
@ 2018-07-30  8:58                       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-30  8:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
> On Tue, 10 Jul 2018 16:37:15 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Tue, 10 Jul 2018 14:10:20 +1000
>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>> On Thu, 7 Jun 2018 23:03:23 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>   
>>>> On Fri, 8 Jun 2018 14:14:23 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>     
>>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:      
>>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>         
>>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:        
>>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>>>>>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>>>>>>           
>>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:          
>>>>>>>>>>
>>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>>>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>>>>>>> devices and from that perspective they're isolated.  That's the view of
>>>>>>>>>> the device used to generate the grouping.  However, not visible to us,
>>>>>>>>>> these devices are interconnected via NVLink.  What isolation properties
>>>>>>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>>>>>>> be to provide a high performance link for p2p between devices?            
>>>>>>>>>
>>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>>>>>>> and the CPU which is running significantly faster than PCIe.
>>>>>>>>>
>>>>>>>>> But yes, there are cross-links and those should probably be accounted
>>>>>>>>> for in the grouping.          
>>>>>>>>
>>>>>>>> Then after we fix the grouping, can we just let the host driver manage
>>>>>>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>>>>>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>>>>>>> convince NVIDIA to support more than a single vGPU per VM though)          
>>>>>>>
>>>>>>> These are physical GPUs, not virtual sriov-alike things they are
>>>>>>> implementing as well elsewhere.        
>>>>>>
>>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
>>>>>> either.  That's why we have mdev devices now to implement software
>>>>>> defined devices.  I don't have first hand experience with V-series, but
>>>>>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.        
>>>>>
>>>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
>>>>> using mediated vGPUs instead, correct?      
>>>>
>>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
>>>> account for lack of isolation on the NVLink side and we correct that,
>>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
>>>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
>>>> whether they choose to support vGPU on these GPUs or whether they can
>>>> be convinced to support multiple vGPUs per VM.
>>>>     
>>>>>>> My current understanding is that every P9 chip in that box has some NVLink2
>>>>>>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>>>>>>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>>>>>>> as well.
>>>>>>>
>>>>>>> From small bits of information I have it seems that a GPU can perfectly
>>>>>>> work alone and if the NVIDIA driver does not see these interconnects
>>>>>>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>>>>>>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>>>>>>> which simply refuses to work until all 3 GPUs are passed so there is some
>>>>>>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>>>>>>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>>>>>>
>>>>>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>>>>>>> interconnected group).        
>>>>>>
>>>>>> I'm not gaining much confidence that we can rely on isolation between
>>>>>> NVLink connected GPUs, it sounds like you're simply expecting that
>>>>>> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
>>>>>> is going to play nice and nobody will figure out how to do bad things
>>>>>> because... obfuscation?  Thanks,        
>>>>>
>>>>> Well, we already believe that a proprietary firmware of a sriov-capable
>>>>> adapter like Mellanox ConnextX is not doing bad things, how is this
>>>>> different in principle?      
>>>>
>>>> It seems like the scope and hierarchy are different.  Here we're
>>>> talking about exposing big discrete devices, which are peers of one
>>>> another (and have history of being reverse engineered), to userspace
>>>> drivers.  Once handed to userspace, each of those devices needs to be
>>>> considered untrusted.  In the case of SR-IOV, we typically have a
>>>> trusted host driver for the PF managing untrusted VFs.  We do rely on
>>>> some sanity in the hardware/firmware in isolating the VFs from each
>>>> other and from the PF, but we also often have source code for Linux
>>>> drivers for these devices and sometimes even datasheets.  Here we have
>>>> neither of those and perhaps we won't know the extent of the lack of
>>>> isolation between these devices until nouveau (best case) or some
>>>> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
>>>> of isolation between devices unless the hardware provides some
>>>> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
>>>> wants to expose isolation on NVLink, perhaps they need to document
>>>> enough of it that the host kernel can manipulate and test for isolation,
>>>> perhaps even enabling virtualization of the NVLink interconnect
>>>> interface such that the host can prevent GPUs from interfering with
>>>> each other.  Thanks,    
>>>
>>>
>>> So far I got this from NVIDIA:
>>>
>>> 1. An NVLink2 state can be controlled via MMIO registers, there is a
>>> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
>>> "confidential" though) from NVIDIA with the MMIO addresses to block if
>>> we want to disable certain links. In order to NVLink to work it needs to
>>> be enabled on both sides so by filtering certains MMIO ranges we can
>>> isolate a GPU.  
>>
>> Where are these MMIO registers, on the bridge or on the endpoint device?
> 
> The endpoint GPU device.
> 
>> I'm wondering when you say block MMIO if these are ranges on the device
>> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
>> come with that or if this should essentially be device specific
>> enable_acs and acs_enabled quirks, and maybe also potentially used by
>> Logan's disable acs series to allow GPUs to be linked and have grouping
>> to match.
> 
> An update: I confused P100 and V100. P100 would need MMIO filtering,
> but ours is V100, which has a couple of registers we can use to
> disable particular links; once disabled, a link cannot be re-enabled
> until the next secondary bus reset.
> 
> 
>>> 2. We can and should also prohibit the GPU firmware update, this is
>>> done via MMIO as well. The protocol is not open but at least register
>>> ranges might be in order to filter these accesses, and there is no
>>> plan to change this.  
>>
>> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
>> along with it.
> 
> Yes, however NVIDIA says there is nothing performance-critical on
> this 64K page.
> 
>> Also, there are certainly use cases of updating
>> firmware for an assigned device, we don't want to impose a policy, but
>> we should figure out the right place for that policy to be specified by
>> the admin.
> 
> Maybe, but NVIDIA is talking about some "out-of-band" command to the GPU
> to enable firmware update, so firmware update is not really supported.
> 
> 
>>> 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
>>> PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
>>> and UT=0 for direct host memory access. UT stands for "use
>>> translation" and this is a part of the NVLink2 protocol. Only UT=1 is
>>> possible over the PCIe link.
>>> This UT=0 trafic uses host physical addresses returned by a nest MMU (a
>>> piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
>>> mmu context id (guest userspace mm id), a virtual address and translates
>>> to the host physical and that result is used for UT=0 DMA, this is
>>> called "ATS" although it is not PCIe ATS afaict.
>>> NVIDIA says that the hardware is designed in a way that it can only do
>>> DMA UT=0 to addresses which ATS translated to, and there is no way to
>>> override this behavior and this is what guarantees the isolation.  
>>
>> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
>> endpoint requests a translation of an IOVA to physical address, the
>> IOMMU returns a lookup based on PCIe requester ID, and there's an
>> invalidation protocol to keep things coherent.
> 
> Yes, there is. The current approach is to have an MMU notifier in
> the kernel which tells the NPU (an IBM piece of logic between the
> GPU/NVLink2 and the NVIDIA nest MMU) to invalidate translations; that
> in turn pokes the GPU until it confirms that it has invalidated its
> TLBs and there is no ongoing DMA.
> 
>> In the case above, who provides a guest id and mmu context id? 
> 
> We (the powerpc/powernv platform) configure the NPU to bind a specific
> bus:dev:fn to an LPID (= guest id), and the MMU context id comes from
> the guest. The nest MMU knows where the partition table is, and this
> table contains all the pointers needed for the translation.
> 
> 
>> Additional software
>> somewhere?  Is the virtual address an IOVA or a process virtual
>> address? 
> 
> A guest kernel or a guest userspace virtual address.
> 
>> Do we assume some sort of invalidation protocol as well?
> 
> I am a little confused: is this question about the same invalidation
> protocol as above, or a different one?
> 
> 
>>> So isolation can be achieved if I do not miss something.
>>>
>>> How do we want this to be documented to proceed? I assume if I post
>>> patches filtering MMIOs, this won't do it, right? If just 1..3 are
>>> documented, will we take this t&c or we need a GPU API spec (which is
>>> not going to happen anyway)?  
>>
>> "t&c"? I think we need what we're actually interacting with to be well
>> documented, but that could be _thorough_ comments in the code, enough
>> to understand the theory of operation, as far as I'm concerned.  A pdf
>> lost on a corporate webserver isn't necessarily an improvement over
>> that, but there needs to be sufficient detail to understand what we're
>> touching such that we can maintain, adapt, and improve the code over
>> time.  Only item #3 above appears POWER specific, so I'd hope that #1
>> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
>> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
>> goes.  Thanks,
> 
> Ok, understood. Thanks!

After some local discussions, it was pointed out that force-disabling
the nvlinks won't buy us much: for an nvlink to work, both sides need
to enable it, so a malicious guest cannot penetrate a good one (or the
host) unless the good guest enabled the link itself, which won't happen
with a well-behaved guest. And if two guests both turn malicious, they
can still only harm each other, which they could also do via other
channels such as the network. This is different from PCIe: once a PCIe
link is (unavoidably) enabled, a well-behaved device cannot firewall
itself from its peers, as it is up to the upstream bridge(s) to decide
the routing; with nvlink2, a GPU still has the means to protect itself,
just like a guest can run "firewalld" for the network.

Although it would be a nice feature to have an extra barrier between
GPUs, is the inability to block the links from the hypervisor still a
blocker for V100 passthrough?


-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-30  8:58                       ` Alexey Kardashevskiy
  (?)
@ 2018-07-30 16:29                         ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-07-30 16:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Mon, 30 Jul 2018 18:58:49 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
> > On Tue, 10 Jul 2018 16:37:15 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> >> On Tue, 10 Jul 2018 14:10:20 +1000
> >> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>  
> >>> On Thu, 7 Jun 2018 23:03:23 -0600
> >>> Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>     
> >>>> On Fri, 8 Jun 2018 14:14:23 +1000
> >>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>       
> >>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:        
> >>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
> >>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>>>           
> >>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:          
> >>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
> >>>>>>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >>>>>>>>             
> >>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:            
> >>>>>>>>>>
> >>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> >>>>>>>>>> connected devices makes sense?  AIUI we have a PCI view of these
> >>>>>>>>>> devices and from that perspective they're isolated.  That's the view of
> >>>>>>>>>> the device used to generate the grouping.  However, not visible to us,
> >>>>>>>>>> these devices are interconnected via NVLink.  What isolation properties
> >>>>>>>>>> does NVLink provide given that its entire purpose for existing seems to
> >>>>>>>>>> be to provide a high performance link for p2p between devices?              
> >>>>>>>>>
> >>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the device
> >>>>>>>>> and the CPU which is running significantly faster than PCIe.
> >>>>>>>>>
> >>>>>>>>> But yes, there are cross-links and those should probably be accounted
> >>>>>>>>> for in the grouping.            
> >>>>>>>>
> >>>>>>>> Then after we fix the grouping, can we just let the host driver manage
> >>>>>>>> this coherent memory range and expose vGPUs to guests?  The use case of
> >>>>>>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> >>>>>>>> convince NVIDIA to support more than a single vGPU per VM though)            
> >>>>>>>
> >>>>>>> These are physical GPUs, not virtual sriov-alike things they are
> >>>>>>> implementing as well elsewhere.          
> >>>>>>
> >>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> >>>>>> either.  That's why we have mdev devices now to implement software
> >>>>>> defined devices.  I don't have first hand experience with V-series, but
> >>>>>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.          
> >>>>>
> >>>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> >>>>> using mediated vGPUs instead, correct?        
> >>>>
> >>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
> >>>> account for lack of isolation on the NVLink side and we correct that,
> >>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
> >>>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> >>>> whether they choose to support vGPU on these GPUs or whether they can
> >>>> be convinced to support multiple vGPUs per VM.
> >>>>       
> >>>>>>> My current understanding is that every P9 chip in that box has some NVLink2
> >>>>>>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> >>>>>>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> >>>>>>> as well.
> >>>>>>>
> >>>>>>> From small bits of information I have it seems that a GPU can perfectly
> >>>>>>> work alone and if the NVIDIA driver does not see these interconnects
> >>>>>>> (because we do not pass the rest of the big 3xGPU group to this guest), it
> >>>>>>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> >>>>>>> which simply refuses to work until all 3 GPUs are passed so there is some
> >>>>>>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> >>>>>>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> >>>>>>>
> >>>>>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
> >>>>>>> interconnected group).          
> >>>>>>
> >>>>>> I'm not gaining much confidence that we can rely on isolation between
> >>>>>> NVLink connected GPUs, it sounds like you're simply expecting that
> >>>>>> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> >>>>>> is going to play nice and nobody will figure out how to do bad things
> >>>>>> because... obfuscation?  Thanks,          
> >>>>>
> >>>>> Well, we already believe that a proprietary firmware of a sriov-capable
> >>>>> adapter like Mellanox ConnextX is not doing bad things, how is this
> >>>>> different in principle?        
> >>>>
> >>>> It seems like the scope and hierarchy are different.  Here we're
> >>>> talking about exposing big discrete devices, which are peers of one
> >>>> another (and have history of being reverse engineered), to userspace
> >>>> drivers.  Once handed to userspace, each of those devices needs to be
> >>>> considered untrusted.  In the case of SR-IOV, we typically have a
> >>>> trusted host driver for the PF managing untrusted VFs.  We do rely on
> >>>> some sanity in the hardware/firmware in isolating the VFs from each
> >>>> other and from the PF, but we also often have source code for Linux
> >>>> drivers for these devices and sometimes even datasheets.  Here we have
> >>>> neither of those and perhaps we won't know the extent of the lack of
> >>>> isolation between these devices until nouveau (best case) or some
> >>>> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> >>>> of isolation between devices unless the hardware provides some
> >>>> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> >>>> wants to expose isolation on NVLink, perhaps they need to document
> >>>> enough of it that the host kernel can manipulate and test for isolation,
> >>>> perhaps even enabling virtualization of the NVLink interconnect
> >>>> interface such that the host can prevent GPUs from interfering with
> >>>> each other.  Thanks,      
> >>>
> >>>
> >>> So far I got this from NVIDIA:
> >>>
> >>> 1. An NVLink2 state can be controlled via MMIO registers, there is a
> >>> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> >>> "confidential" though) from NVIDIA with the MMIO addresses to block if
> >>> we want to disable certain links. In order to NVLink to work it needs to
> >>> be enabled on both sides so by filtering certains MMIO ranges we can
> >>> isolate a GPU.    
> >>
> >> Where are these MMIO registers, on the bridge or on the endpoint device?  
> > 
> > The endpoint GPU device.
> >   
> >> I'm wondering when you say block MMIO if these are ranges on the device
> >> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
> >> come with that or if this should essentially be device specific
> >> enable_acs and acs_enabled quirks, and maybe also potentially used by
> >> Logan's disable acs series to allow GPUs to be linked and have grouping
> >> to match.  
> > 
> > An update: I confused P100 and V100. P100 would need MMIO filtering,
> > but ours is V100, which has a couple of registers we can use to
> > disable particular links; once disabled, a link cannot be re-enabled
> > until the next secondary bus reset.
> > 
> >   
> >>> 2. We can and should also prohibit the GPU firmware update, this is
> >>> done via MMIO as well. The protocol is not open but at least register
> >>> ranges might be in order to filter these accesses, and there is no
> >>> plan to change this.    
> >>
> >> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
> >> along with it.  
> > 
> > Yes, however NVIDIA says there is nothing performance-critical on
> > this 64K page.
> >   
> >> Also, there are certainly use cases of updating
> >> firmware for an assigned device, we don't want to impose a policy, but
> >> we should figure out the right place for that policy to be specified by
> >> the admin.  
> > 
> > Maybe, but NVIDIA is talking about some "out-of-band" command to the GPU
> > to enable firmware update, so firmware update is not really supported.
> > 
> >   
> >>> 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
> >>> PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
> >>> and UT=0 for direct host memory access. UT stands for "use
> >>> translation" and this is a part of the NVLink2 protocol. Only UT=1 is
> >>> possible over the PCIe link.
> >>> This UT=0 trafic uses host physical addresses returned by a nest MMU (a
> >>> piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
> >>> mmu context id (guest userspace mm id), a virtual address and translates
> >>> to the host physical and that result is used for UT=0 DMA, this is
> >>> called "ATS" although it is not PCIe ATS afaict.
> >>> NVIDIA says that the hardware is designed in a way that it can only do
> >>> DMA UT=0 to addresses which ATS translated to, and there is no way to
> >>> override this behavior and this is what guarantees the isolation.    
> >>
> >> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
> >> endpoint requests a translation of an IOVA to physical address, the
> >> IOMMU returns a lookup based on PCIe requester ID, and there's an
> >> invalidation protocol to keep things coherent.  
> > 
> > Yes, there is. The current approach is to have an MMU notifier in
> > the kernel which tells the NPU (an IBM piece of logic between the
> > GPU/NVLink2 and the NVIDIA nest MMU) to invalidate translations; that
> > in turn pokes the GPU until it confirms that it has invalidated its
> > TLBs and there is no ongoing DMA.
> >   
> >> In the case above, who provides a guest id and mmu context id?   
> > 
> > We (the powerpc/powernv platform) configure the NPU to bind a specific
> > bus:dev:fn to an LPID (= guest id), and the MMU context id comes from
> > the guest. The nest MMU knows where the partition table is, and this
> > table contains all the pointers needed for the translation.
> > 
> >   
> >> Additional software
> >> somewhere?  Is the virtual address an IOVA or a process virtual
> >> address?   
> > 
> > A guest kernel or a guest userspace virtual address.
> >   
> >> Do we assume some sort of invalidation protocol as well?  
> > 
> > I am a little confused: is this question about the same invalidation
> > protocol as above, or a different one?
> > 
> >   
> >>> So isolation can be achieved if I do not miss something.
> >>>
> >>> How do we want this to be documented to proceed? I assume if I post
> >>> patches filtering MMIOs, this won't do it, right? If just 1..3 are
> >>> documented, will we take this t&c or we need a GPU API spec (which is
> >>> not going to happen anyway)?    
> >>
> >> "t&c"? I think we need what we're actually interacting with to be well
> >> documented, but that could be _thorough_ comments in the code, enough
> >> to understand the theory of operation, as far as I'm concerned.  A pdf
> >> lost on a corporate webserver isn't necessarily an improvement over
> >> that, but there needs to be sufficient detail to understand what we're
> >> touching such that we can maintain, adapt, and improve the code over
> >> time.  Only item #3 above appears POWER specific, so I'd hope that #1
> >> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
> >> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
> >> goes.  Thanks,  
> > 
> > Ok, understood. Thanks!  
> 
> After some local discussions, it was pointed out that force-disabling
> the nvlinks won't buy us much: for an nvlink to work, both sides need
> to enable it, so a malicious guest cannot penetrate a good one (or the
> host) unless the good guest enabled the link itself, which won't happen
> with a well-behaved guest. And if two guests both turn malicious, they
> can still only harm each other, which they could also do via other
> channels such as the network. This is different from PCIe: once a PCIe
> link is (unavoidably) enabled, a well-behaved device cannot firewall
> itself from its peers, as it is up to the upstream bridge(s) to decide
> the routing; with nvlink2, a GPU still has the means to protect itself,
> just like a guest can run "firewalld" for the network.
> 
> Although it would be a nice feature to have an extra barrier between
> GPUs, is the inability to block the links from the hypervisor still a
> blocker for V100 passthrough?

How is the NVLink configured by the guest: is it 'on'/'off' or are
specific routes configured?  If the former, then isn't a non-malicious
guest still susceptible to a malicious guest?  If the latter, how is
routing configured by the guest given that the guest view of the
topology doesn't match physical hardware?  Are these routes
deconfigured by device reset?  Are they part of the save/restore
state?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-30 16:29                         ` Alex Williamson
  (?)
@ 2018-07-31  4:03                           ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-31  4:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 31/07/2018 02:29, Alex Williamson wrote:
> On Mon, 30 Jul 2018 18:58:49 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
>>> On Tue, 10 Jul 2018 16:37:15 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>   
>>>> On Tue, 10 Jul 2018 14:10:20 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>  
>>>>> On Thu, 7 Jun 2018 23:03:23 -0600
>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>     
>>>>>> On Fri, 8 Jun 2018 14:14:23 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>       
>>>>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:        
>>>>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>           
>>>>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:          
>>>>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>>>>>>>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>>>>>>>>             
>>>>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:            
>>>>>>>>>>>>
>>>>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>>>>>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>>>>>>>>> devices and from that perspective they're isolated.  That's the view of
>>>>>>>>>>>> the device used to generate the grouping.  However, not visible to us,
>>>>>>>>>>>> these devices are interconnected via NVLink.  What isolation properties
>>>>>>>>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>>>>>>>>> be to provide a high performance link for p2p between devices?              
>>>>>>>>>>>
>>>>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>>>>>>>>> and the CPU which is running significantly faster than PCIe.
>>>>>>>>>>>
>>>>>>>>>>> But yes, there are cross-links and those should probably be accounted
>>>>>>>>>>> for in the grouping.            
>>>>>>>>>>
>>>>>>>>>> Then after we fix the grouping, can we just let the host driver manage
>>>>>>>>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>>>>>>>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>>>>>>>>> convince NVIDIA to support more than a single vGPU per VM though)            
>>>>>>>>>
>>>>>>>>> These are physical GPUs, not virtual sriov-alike things they are
>>>>>>>>> implementing as well elsewhere.          
>>>>>>>>
>>>>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
>>>>>>>> either.  That's why we have mdev devices now to implement software
>>>>>>>> defined devices.  I don't have first hand experience with V-series, but
>>>>>>>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.          
>>>>>>>
>>>>>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
>>>>>>> using mediated vGPUs instead, correct?        
>>>>>>
>>>>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
>>>>>> account for lack of isolation on the NVLink side and we correct that,
>>>>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
>>>>>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
>>>>>> whether they choose to support vGPU on these GPUs or whether they can
>>>>>> be convinced to support multiple vGPUs per VM.
>>>>>>       
>>>>>>>>> My current understanding is that every P9 chip in that box has some NVLink2
>>>>>>>>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>>>>>>>>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>>>>>>>>> as well.
>>>>>>>>>
>>>>>>>>> From small bits of information I have it seems that a GPU can perfectly
>>>>>>>>> work alone and if the NVIDIA driver does not see these interconnects
>>>>>>>>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>>>>>>>>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>>>>>>>>> which simply refuses to work until all 3 GPUs are passed so there is some
>>>>>>>>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>>>>>>>>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>>>>>>>>
>>>>>>>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>>>>>>>>> interconnected group).          
>>>>>>>>
>>>>>>>> I'm not gaining much confidence that we can rely on isolation between
>>>>>>>> NVLink connected GPUs, it sounds like you're simply expecting that
>>>>>>>> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
>>>>>>>> is going to play nice and nobody will figure out how to do bad things
>>>>>>>> because... obfuscation?  Thanks,          
>>>>>>>
>>>>>>> Well, we already believe that a proprietary firmware of a sriov-capable
>>>>>>> adapter like Mellanox ConnextX is not doing bad things, how is this
>>>>>>> different in principle?        
>>>>>>
>>>>>> It seems like the scope and hierarchy are different.  Here we're
>>>>>> talking about exposing big discrete devices, which are peers of one
>>>>>> another (and have history of being reverse engineered), to userspace
>>>>>> drivers.  Once handed to userspace, each of those devices needs to be
>>>>>> considered untrusted.  In the case of SR-IOV, we typically have a
>>>>>> trusted host driver for the PF managing untrusted VFs.  We do rely on
>>>>>> some sanity in the hardware/firmware in isolating the VFs from each
>>>>>> other and from the PF, but we also often have source code for Linux
>>>>>> drivers for these devices and sometimes even datasheets.  Here we have
>>>>>> neither of those and perhaps we won't know the extent of the lack of
>>>>>> isolation between these devices until nouveau (best case) or some
>>>>>> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
>>>>>> of isolation between devices unless the hardware provides some
>>>>>> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
>>>>>> wants to expose isolation on NVLink, perhaps they need to document
>>>>>> enough of it that the host kernel can manipulate and test for isolation,
>>>>>> perhaps even enabling virtualization of the NVLink interconnect
>>>>>> interface such that the host can prevent GPUs from interfering with
>>>>>> each other.  Thanks,      
>>>>>
>>>>>
>>>>> So far I got this from NVIDIA:
>>>>>
>>>>> 1. An NVLink2 state can be controlled via MMIO registers, there is a
>>>>> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
>>>>> "confidential" though) from NVIDIA with the MMIO addresses to block if
>>>>> we want to disable certain links. In order to NVLink to work it needs to
>>>>> be enabled on both sides so by filtering certains MMIO ranges we can
>>>>> isolate a GPU.    
>>>>
>>>> Where are these MMIO registers, on the bridge or on the endpoint device?  
>>>
>>> The endpoint GPU device.
>>>   
>>>> I'm wondering when you say block MMIO if these are ranges on the device
>>>> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
>>>> come with that or if this should essentially be device specific
>>>> enable_acs and acs_enabled quirks, and maybe also potentially used by
>>>> Logan's disable acs series to allow GPUs to be linked and have grouping
>>>> to match.  
>>>
>>> An update, I confused P100 and V100, P100 would need filtering but
>>> ours is V100 and it has a couple of registers which we can use to
>>> disable particular links and once disabled, the link cannot be
>>> re-enabled till the next secondary bus reset.
>>>
>>>   
>>>>> 2. We can and should also prohibit the GPU firmware update, this is
>>>>> done via MMIO as well. The protocol is not open but at least register
>>>>> ranges might be in order to filter these accesses, and there is no
>>>>> plan to change this.    
>>>>
>>>> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
>>>> along with it.  
>>>
>>> Yes, however NVIDIA says there is no performance critical stuff with
>>> this 64K page.
>>>   
>>>> Also, there are certainly use cases of updating
>>>> firmware for an assigned device, we don't want to impose a policy, but
>>>> we should figure out the right place for that policy to be specified by
>>>> the admin.  
>>>
>>> May be but NVIDIA is talking about some "out-of-band" command to the GPU
>>> to enable firmware update so firmware update is not really supported.
>>>
>>>   
>>>>> 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
>>>>> PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
>>>>> and UT=0 for direct host memory access. UT stands for "use
>>>>> translation" and this is a part of the NVLink2 protocol. Only UT=1 is
>>>>> possible over the PCIe link.
>>>>> This UT=0 trafic uses host physical addresses returned by a nest MMU (a
>>>>> piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
>>>>> mmu context id (guest userspace mm id), a virtual address and translates
>>>>> to the host physical and that result is used for UT=0 DMA, this is
>>>>> called "ATS" although it is not PCIe ATS afaict.
>>>>> NVIDIA says that the hardware is designed in a way that it can only do
>>>>> DMA UT=0 to addresses which ATS translated to, and there is no way to
>>>>> override this behavior and this is what guarantees the isolation.    
>>>>
>>>> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
>>>> endpoint requests a translation of an IOVA to physical address, the
>>>> IOMMU returns a lookup based on PCIe requester ID, and there's an
>>>> invalidation protocol to keep things coherent.  
>>>
>>> Yes there is. The current approach is to have an MMU notifier in
>>> the kernel which tells an NPU (IBM piece of logic between GPU/NVlink2
>>> and NVIDIA nest MMU) to invalidate translations and that in turn pokes
>>> the GPU till that confirms that it invalidated tlbs and there is no
>>> ongoing DMA.
>>>   
>>>> In the case above, who provides a guest id and mmu context id?   
>>>
>>> We (powerpc/powernv platform) configure NPU to bind specific bus:dev:fn to
>>> an LPID (== guest id) and MMU context id comes from the guest. The nest
>>> MMU knows where the partition table and this table contains all the
>>> pointers needs for the translation.
>>>
>>>   
>>>> Additional software
>>>> somewhere?  Is the virtual address an IOVA or a process virtual
>>>> address?   
>>>
>>> A guest kernel or a guest userspace virtual address.
>>>   
>>>> Do we assume some sort of invalidation protocol as well?  
>>>
>>> I am little confused, is this question about the same invalidation
>>> protocol as above or different?
>>>
>>>   
>>>>> So isolation can be achieved if I do not miss something.
>>>>>
>>>>> How do we want this to be documented to proceed? I assume if I post
>>>>> patches filtering MMIOs, this won't do it, right? If just 1..3 are
>>>>> documented, will we take this t&c or we need a GPU API spec (which is
>>>>> not going to happen anyway)?    
>>>>
>>>> "t&c"? I think we need what we're actually interacting with to be well
>>>> documented, but that could be _thorough_ comments in the code, enough
>>>> to understand the theory of operation, as far as I'm concerned.  A pdf
>>>> lost on a corporate webserver isn't necessarily an improvement over
>>>> that, but there needs to be sufficient detail to understand what we're
>>>> touching such that we can maintain, adapt, and improve the code over
>>>> time.  Only item #3 above appears POWER specific, so I'd hope that #1
>>>> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
>>>> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
>>>> goes.  Thanks,  
>>>
>>> Ok, understood. Thanks!  
>>
>> After some local discussions, it was pointed out that force-disabling
>> nvlinks would not buy us much: for an nvlink to work, both sides need
>> to enable it, so a malicious guest cannot penetrate a good guest (or
>> the host) unless that good guest enabled the link itself, which a well
>> behaving guest will not do. And if two guests both turn malicious, they
>> can still only harm each other, which they could do anyway via other
>> means such as the network. This is different from PCIe: once a PCIe
>> link is unavoidably enabled, a well behaving device cannot firewall
>> itself from its peers as the routing is decided by the upstream
>> bridge(s); with nvlink2, a GPU still has means to protect itself, just
>> like a guest can run "firewalld" for the network.
>>
>> Although it would be a nice feature to have an extra barrier between
>> GPUs, is the inability to block the links in the hypervisor still a
>> blocker for V100 pass through?
> 
> How is the NVLink configured by the guest, is it 'on'/'off' or are
> specific routes configured? 

The GPU-GPU links do not need to be blocked; they need to be enabled
(== trained) by a driver in the guest. There are no routes between GPUs
in the NVLink fabric - these are direct point-to-point links. There is
simply a switch on each side, and both switches need to be on for a
link to work.

For the GPU-CPU links, the GPU end is the same switch as described
above, while the CPU NVLink state is controlled via the emulated PCI
bridges which I pass through together with the GPU.
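
To make the two-sided model above concrete, here is a minimal sketch in
C (not part of the patchset; all structure and function names are
illustrative assumptions, not real driver code) of how a link's
usability follows from the state of its two endpoints, including the
"disabled until the next secondary bus reset" behaviour mentioned
earlier for V100:

#include <stdbool.h>

/* One side of an NVLink2 link: a GPU, or the CPU/NPU end. */
struct nvlink2_endpoint {
	bool enabled;	/* switch trained by this side's driver */
	bool blocked;	/* force-disabled; sticky until secondary bus reset */
};

/* A direct point-to-point link, GPU-GPU or GPU-CPU. */
struct nvlink2_link {
	struct nvlink2_endpoint side[2];
};

/* Traffic flows only when both sides are enabled and neither is blocked. */
static bool nvlink2_link_usable(const struct nvlink2_link *l)
{
	return l->side[0].enabled && !l->side[0].blocked &&
	       l->side[1].enabled && !l->side[1].blocked;
}

/*
 * A guest that never trains its side of a link to an untrusted peer
 * cannot be reached over that link, whatever the peer does.
 */
static void nvlink2_endpoint_block(struct nvlink2_endpoint *ep)
{
	ep->enabled = false;
	ep->blocked = true;	/* cleared only by a secondary bus reset */
}

static void nvlink2_secondary_bus_reset(struct nvlink2_link *l)
{
	l->side[0].blocked = false;
	l->side[1].blocked = false;
}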


> If the former, then isn't a non-malicious
> guest still susceptible to a malicious guest?

Only if the non-malicious guest itself turns its switch on for a link
to a GPU which belongs to the malicious guest; as long as it leaves
that switch off, the link stays down and the malicious guest cannot
reach it over NVLink.

> If the latter, how is
> routing configured by the guest given that the guest view of the
> topology doesn't match physical hardware?  Are these routes
> deconfigured by device reset?  Are they part of the save/restore
> state?  Thanks,





-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
@ 2018-07-31  4:03                           ` Alexey Kardashevskiy
  0 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-07-31  4:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 31/07/2018 02:29, Alex Williamson wrote:
> On Mon, 30 Jul 2018 18:58:49 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 11/07/2018 19:26, Alexey Kardashevskiy wrote:
>>> On Tue, 10 Jul 2018 16:37:15 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>   
>>>> On Tue, 10 Jul 2018 14:10:20 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>  
>>>>> On Thu, 7 Jun 2018 23:03:23 -0600
>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>     
>>>>>> On Fri, 8 Jun 2018 14:14:23 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>       
>>>>>>> On 8/6/18 1:44 pm, Alex Williamson wrote:        
>>>>>>>> On Fri, 8 Jun 2018 13:08:54 +1000
>>>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>>>>           
>>>>>>>>> On 8/6/18 8:15 am, Alex Williamson wrote:          
>>>>>>>>>> On Fri, 08 Jun 2018 07:54:02 +1000
>>>>>>>>>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>>>>>>>>>>             
>>>>>>>>>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:            
>>>>>>>>>>>>
>>>>>>>>>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
>>>>>>>>>>>> connected devices makes sense?  AIUI we have a PCI view of these
>>>>>>>>>>>> devices and from that perspective they're isolated.  That's the view of
>>>>>>>>>>>> the device used to generate the grouping.  However, not visible to us,
>>>>>>>>>>>> these devices are interconnected via NVLink.  What isolation properties
>>>>>>>>>>>> does NVLink provide given that its entire purpose for existing seems to
>>>>>>>>>>>> be to provide a high performance link for p2p between devices?              
>>>>>>>>>>>
>>>>>>>>>>> Not entire. On POWER chips, we also have an nvlink between the device
>>>>>>>>>>> and the CPU which is running significantly faster than PCIe.
>>>>>>>>>>>
>>>>>>>>>>> But yes, there are cross-links and those should probably be accounted
>>>>>>>>>>> for in the grouping.            
>>>>>>>>>>
>>>>>>>>>> Then after we fix the grouping, can we just let the host driver manage
>>>>>>>>>> this coherent memory range and expose vGPUs to guests?  The use case of
>>>>>>>>>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
>>>>>>>>>> convince NVIDIA to support more than a single vGPU per VM though)            
>>>>>>>>>
>>>>>>>>> These are physical GPUs, not virtual sriov-alike things they are
>>>>>>>>> implementing as well elsewhere.          
>>>>>>>>
>>>>>>>> vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
>>>>>>>> either.  That's why we have mdev devices now to implement software
>>>>>>>> defined devices.  I don't have first hand experience with V-series, but
>>>>>>>> I would absolutely expect a PCIe-based Tesla V100 to support vGPU.          
>>>>>>>
>>>>>>> So assuming V100 can do vGPU, you are suggesting ditching this patchset and
>>>>>>> using mediated vGPUs instead, correct?        
>>>>>>
>>>>>> If it turns out that our PCIe-only-based IOMMU grouping doesn't
>>>>>> account for lack of isolation on the NVLink side and we correct that,
>>>>>> limiting assignment to sets of 3 interconnected GPUs, is that still a
>>>>>> useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
>>>>>> whether they choose to support vGPU on these GPUs or whether they can
>>>>>> be convinced to support multiple vGPUs per VM.
>>>>>>       
>>>>>>>>> My current understanding is that every P9 chip in that box has some NVLink2
>>>>>>>>> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
>>>>>>>>> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
>>>>>>>>> as well.
>>>>>>>>>
>>>>>>>>> From small bits of information I have it seems that a GPU can perfectly
>>>>>>>>> work alone and if the NVIDIA driver does not see these interconnects
>>>>>>>>> (because we do not pass the rest of the big 3xGPU group to this guest), it
>>>>>>>>> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
>>>>>>>>> which simply refuses to work until all 3 GPUs are passed so there is some
>>>>>>>>> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
>>>>>>>>> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
>>>>>>>>>
>>>>>>>>> So we will either have 6 groups (one per GPU) or 2 groups (one per
>>>>>>>>> interconnected group).          
>>>>>>>>
>>>>>>>> I'm not gaining much confidence that we can rely on isolation between
>>>>>>>> NVLink connected GPUs, it sounds like you're simply expecting that
>>>>>>>> proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
>>>>>>>> is going to play nice and nobody will figure out how to do bad things
>>>>>>>> because... obfuscation?  Thanks,          
>>>>>>>
>>>>>>> Well, we already believe that a proprietary firmware of a sriov-capable
>>>>>>> adapter like Mellanox ConnextX is not doing bad things, how is this
>>>>>>> different in principle?        
>>>>>>
>>>>>> It seems like the scope and hierarchy are different.  Here we're
>>>>>> talking about exposing big discrete devices, which are peers of one
>>>>>> another (and have history of being reverse engineered), to userspace
>>>>>> drivers.  Once handed to userspace, each of those devices needs to be
>>>>>> considered untrusted.  In the case of SR-IOV, we typically have a
>>>>>> trusted host driver for the PF managing untrusted VFs.  We do rely on
>>>>>> some sanity in the hardware/firmware in isolating the VFs from each
>>>>>> other and from the PF, but we also often have source code for Linux
>>>>>> drivers for these devices and sometimes even datasheets.  Here we have
>>>>>> neither of those and perhaps we won't know the extent of the lack of
>>>>>> isolation between these devices until nouveau (best case) or some
>>>>>> exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
>>>>>> of isolation between devices unless the hardware provides some
>>>>>> indication that isolation exists, for example ACS on PCIe.  If NVIDIA
>>>>>> wants to expose isolation on NVLink, perhaps they need to document
>>>>>> enough of it that the host kernel can manipulate and test for isolation,
>>>>>> perhaps even enabling virtualization of the NVLink interconnect
>>>>>> interface such that the host can prevent GPUs from interfering with
>>>>>> each other.  Thanks,      
>>>>>
>>>>>
>>>>> So far I got this from NVIDIA:
>>>>>
>>>>> 1. An NVLink2 state can be controlled via MMIO registers, there is a
>>>>> "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
>>>>> "confidential" though) from NVIDIA with the MMIO addresses to block if
>>>>> we want to disable certain links. In order to NVLink to work it needs to
>>>>> be enabled on both sides so by filtering certains MMIO ranges we can
>>>>> isolate a GPU.    
>>>>
>>>> Where are these MMIO registers, on the bridge or on the endpoint device?  
>>>
>>> The endpoint GPU device.
>>>   
>>>> I'm wondering when you say block MMIO if these are ranges on the device
>>>> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
>>>> come with that or if this should essentially be device specific
>>>> enable_acs and acs_enabled quirks, and maybe also potentially used by
>>>> Logan's disable acs series to allow GPUs to be linked and have grouping
>>>> to match.  
>>>
>>> An update, I confused P100 and V100, P100 would need filtering but
>>> ours is V100 and it has a couple of registers which we can use to
>>> disable particular links and once disabled, the link cannot be
>>> re-enabled till the next secondary bus reset.
>>>
>>>   
>>>>> 2. We can and should also prohibit the GPU firmware update, this is
>>>>> done via MMIO as well. The protocol is not open but at least register
>>>>> ranges might be in order to filter these accesses, and there is no
>>>>> plan to change this.    
>>>>
>>>> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
>>>> along with it.  
>>>
>>> Yes, however NVIDIA says there is no performance critical stuff with
>>> this 64K page.
>>>   
>>>> Also, there are certainly use cases of updating
>>>> firmware for an assigned device, we don't want to impose a policy, but
>>>> we should figure out the right place for that policy to be specified by
>>>> the admin.  
>>>
>>> May be but NVIDIA is talking about some "out-of-band" command to the GPU
>>> to enable firmware update so firmware update is not really supported.
>>>
>>>   
>>>>> 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
>>>>> PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
>>>>> and UT=0 for direct host memory access. UT stands for "use
>>>>> translation" and this is a part of the NVLink2 protocol. Only UT=1 is
>>>>> possible over the PCIe link.
>>>>> This UT=0 trafic uses host physical addresses returned by a nest MMU (a
>>>>> piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
>>>>> mmu context id (guest userspace mm id), a virtual address and translates
>>>>> to the host physical and that result is used for UT=0 DMA, this is
>>>>> called "ATS" although it is not PCIe ATS afaict.
>>>>> NVIDIA says that the hardware is designed in a way that it can only do
>>>>> DMA UT=0 to addresses which ATS translated to, and there is no way to
>>>>> override this behavior and this is what guarantees the isolation.    
>>>>
>>>> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
>>>> endpoint requests a translation of an IOVA to physical address, the
>>>> IOMMU returns a lookup based on PCIe requester ID, and there's an
>>>> invalidation protocol to keep things coherent.  
>>>
>>> Yes there is. The current approach is to have an MMU notifier in
>>> the kernel which tells an NPU (IBM piece of logic between GPU/NVlink2
>>> and NVIDIA nest MMU) to invalidate translations and that in turn pokes
>>> the GPU till that confirms that it invalidated tlbs and there is no
>>> ongoing DMA.
>>>   
>>>> In the case above, who provides a guest id and mmu context id?   
>>>
>>> We (powerpc/powernv platform) configure NPU to bind specific bus:dev:fn to
>>> an LPID (= guest id) and MMU context id comes from the guest. The nest
>>> MMU knows where the partition table and this table contains all the
>>> pointers needs for the translation.
>>>
>>>   
>>>> Additional software
>>>> somewhere?  Is the virtual address an IOVA or a process virtual
>>>> address?   
>>>
>>> A guest kernel or a guest userspace virtual address.
>>>   
>>>> Do we assume some sort of invalidation protocol as well?  
>>>
>>> I am little confused, is this question about the same invalidation
>>> protocol as above or different?
>>>
>>>   
>>>>> So isolation can be achieved if I do not miss something.
>>>>>
>>>>> How do we want this to be documented to proceed? I assume if I post
>>>>> patches filtering MMIOs, this won't do it, right? If just 1..3 are
>>>>> documented, will we take this t&c or we need a GPU API spec (which is
>>>>> not going to happen anyway)?    
>>>>
>>>> "t&c"? I think we need what we're actually interacting with to be well
>>>> documented, but that could be _thorough_ comments in the code, enough
>>>> to understand the theory of operation, as far as I'm concerned.  A pdf
>>>> lost on a corporate webserver isn't necessarily an improvement over
>>>> that, but there needs to be sufficient detail to understand what we're
>>>> touching such that we can maintain, adapt, and improve the code over
>>>> time.  Only item #3 above appears POWER specific, so I'd hope that #1
>>>> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
>>>> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
>>>> goes.  Thanks,  
>>>
>>> Ok, understood. Thanks!  
>>
>> After some local discussions, it was pointed out that force disabling
>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>> enable it so malicious guests cannot penetrate good ones (or a host)
>> unless a good guest enabled the link but won't happen with a well
>> behaving guest. And if two guests became malicious, then can still only
>> harm each other, and so can they via other ways such network. This is
>> different from PCIe as once PCIe link is unavoidably enabled, a well
>> behaving device cannot firewall itself from peers as it is up to the
>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>> has means to protect itself, just like a guest can run "firewalld" for
>> network.
>>
>> Although it would be a nice feature to have an extra barrier between
>> GPUs, is inability to block the links in hypervisor still a blocker for
>> V100 pass through?
> 
> How is the NVLink configured by the guest, is it 'on'/'off' or are
> specific routes configured? 

The GPU-GPU links do not need to be blocked, but they do need to be
enabled (=trained) by a driver in the guest. There are no routes between
GPUs in the NVLink fabric - these are direct links with just a switch on
each side, and both switches need to be on for a link to work.

For the GPU-CPU links, the GPU end is the same kind of switch; the CPU
NVLink state is controlled via the emulated PCI bridges which I pass
through together with the GPU.
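
To spell out the both-switches rule, a throwaway model in C - purely
illustrative, nothing to do with the real registers:

#include <stdbool.h>
#include <stdio.h>

/* One NVLink2 link: each end (GPU or CPU/NPU side) has its own enable
 * switch, and the link can only train when both switches are on. */
struct link_end { bool enabled; };
struct nvlink   { struct link_end a, b; };

static bool link_can_train(const struct nvlink *l)
{
        return l->a.enabled && l->b.enabled;
}

int main(void)
{
        struct nvlink l = { .a = { true }, .b = { false } };

        printf("trainable: %d\n", link_can_train(&l)); /* 0: peer is off */
        l.b.enabled = true;
        printf("trainable: %d\n", link_can_train(&l)); /* 1: both ends on */
        return 0;
}

So a guest flipping only its own switch cannot bring a link up by itself.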


> If the former, then isn't a non-malicious
> guest still susceptible to a malicious guest?

A non-malicious guest needs to turn its switch on for a link to a GPU
which belongs to a malicious guest.

> If the latter, how is
> routing configured by the guest given that the guest view of the
> topology doesn't match physical hardware?  Are these routes
> deconfigured by device reset?  Are they part of the save/restore
> state?  Thanks,





-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-31  4:03                           ` Alexey Kardashevskiy
  (?)
@ 2018-07-31 14:29                             ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-07-31 14:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Tue, 31 Jul 2018 14:03:35 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 31/07/2018 02:29, Alex Williamson wrote:
> > On Mon, 30 Jul 2018 18:58:49 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >> After some local discussions, it was pointed out that force disabling
> >> nvlinks won't bring us much as for an nvlink to work, both sides need to
> >> enable it so malicious guests cannot penetrate good ones (or a host)
> >> unless a good guest enabled the link but won't happen with a well
> >> behaving guest. And if two guests became malicious, then can still only
> >> harm each other, and so can they via other ways such network. This is
> >> different from PCIe as once PCIe link is unavoidably enabled, a well
> >> behaving device cannot firewall itself from peers as it is up to the
> >> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
> >> has means to protect itself, just like a guest can run "firewalld" for
> >> network.
> >>
> >> Although it would be a nice feature to have an extra barrier between
> >> GPUs, is inability to block the links in hypervisor still a blocker for
> >> V100 pass through?  
> > 
> > How is the NVLink configured by the guest, is it 'on'/'off' or are
> > specific routes configured?   
> 
> The GPU-GPU links need not to be blocked and need to be enabled
> (==trained) by a driver in the guest. There are no routes between GPUs
> in NVLink fabric, these are direct links, it is just a switch on each
> side, both switches need to be on for a link to work.

Ok, but there is at least the possibility of multiple direct links per
GPU, the very first diagram I find of NVlink shows 8 interconnected
GPUs:

https://www.nvidia.com/en-us/data-center/nvlink/

So if each switch enables one direct, point to point link, how does the
guest know which links to open for which peer device?  And of course
since we can't see the spec, a security audit is at best hearsay :-\
 
> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> is controlled via the emulated PCI bridges which I pass through together
> with the GPU.

So there's a special emulated switch, is that how the guest knows which
GPUs it can enable NVLinks to?

> > If the former, then isn't a non-malicious
> > guest still susceptible to a malicious guest?  
> 
> A non-malicious guest needs to turn its switch on for a link to a GPU
> which belongs to a malicious guest.

Actual security, or obfuscation, will we ever know...

> > If the latter, how is
> > routing configured by the guest given that the guest view of the
> > topology doesn't match physical hardware?  Are these routes
> > deconfigured by device reset?  Are they part of the save/restore
> > state?  Thanks,  

Still curious what happens to these routes on reset.  Can a later user
of a GPU inherit a device where the links are already enabled?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-07-31 14:29                             ` Alex Williamson
  (?)
@ 2018-08-01  8:37                               ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-08-01  8:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 01/08/2018 00:29, Alex Williamson wrote:
> On Tue, 31 Jul 2018 14:03:35 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 31/07/2018 02:29, Alex Williamson wrote:
>>> On Mon, 30 Jul 2018 18:58:49 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>> After some local discussions, it was pointed out that force disabling
>>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>>>> enable it so malicious guests cannot penetrate good ones (or a host)
>>>> unless a good guest enabled the link but won't happen with a well
>>>> behaving guest. And if two guests became malicious, then can still only
>>>> harm each other, and so can they via other ways such network. This is
>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
>>>> behaving device cannot firewall itself from peers as it is up to the
>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>>>> has means to protect itself, just like a guest can run "firewalld" for
>>>> network.
>>>>
>>>> Although it would be a nice feature to have an extra barrier between
>>>> GPUs, is inability to block the links in hypervisor still a blocker for
>>>> V100 pass through?  
>>>
>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>>> specific routes configured?   
>>
>> The GPU-GPU links need not to be blocked and need to be enabled
>> (==trained) by a driver in the guest. There are no routes between GPUs
>> in NVLink fabric, these are direct links, it is just a switch on each
>> side, both switches need to be on for a link to work.
> 
> Ok, but there is at least the possibility of multiple direct links per
> GPU, the very first diagram I find of NVlink shows 8 interconnected
> GPUs:
> 
> https://www.nvidia.com/en-us/data-center/nvlink/

Our design is like the left part of the picture, but that is just a detail.

> So if each switch enables one direct, point to point link, how does the
> guest know which links to open for which peer device?

It uses PCI config space on GPUs to discover the topology.
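
For reference, the generic mechanism from userspace is just walking the
standard capability list exposed through sysfs; how the topology is
actually encoded there is not public, so this only shows where a driver
would look, nothing more (run as root to read past the standard header):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        uint8_t cfg[256];
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s /sys/bus/pci/devices/<BDF>/config\n",
                        argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || read(fd, cfg, sizeof(cfg)) != (ssize_t)sizeof(cfg)) {
                perror(argv[1]);
                return 1;
        }
        close(fd);

        if (!(cfg[0x06] & 0x10)) {      /* Status register: capability list bit */
                fprintf(stderr, "no capability list\n");
                return 1;
        }
        /* The list starts at offset 0x34; each entry is <id, next pointer> */
        for (int off = cfg[0x34] & ~3, n = 0; off && n < 48; n++) {
                printf("cap id 0x%02x at 0x%02x%s\n", cfg[off], off,
                       cfg[off] == 0x09 ? " (vendor specific)" : "");
                off = cfg[off + 1] & ~3;
        }
        return 0;
}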

> And of course
> since we can't see the spec, a security audit is at best hearsay :-\

Yup, the exact discovery protocol is hidden.


>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
>> is controlled via the emulated PCI bridges which I pass through together
>> with the GPU.
> 
> So there's a special emulated switch, is that how the guest knows which
> GPUs it can enable NVLinks to?

Since it only has PCI config space (there is nothing relevant in the
device tree at all), I assume (double checking with the NVIDIA folks
now) the guest driver enables them all, tests which pair works and
disables the ones which do not. This gives a malicious guest a tiny
window of opportunity to break into a good guest. Hm :-/
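
Purely as a strawman of that assumed sequence (none of these helpers
exist anywhere, the "trained" results are faked so it runs), the first
loop is exactly the window in question:

#include <stdbool.h>
#include <stdio.h>

#define NLINKS 6

static bool peer_responded[NLINKS] = { true, true, false, false, true, false };

static void enable_link(int i)  { printf("enable  link %d\n", i); }
static void disable_link(int i) { printf("disable link %d\n", i); }
static bool link_trained(int i) { return peer_responded[i]; }

int main(void)
{
        for (int i = 0; i < NLINKS; i++)
                enable_link(i);          /* everything on while probing */
        for (int i = 0; i < NLINKS; i++)
                if (!link_trained(i))
                        disable_link(i); /* drop links with no working peer */
        return 0;
}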


>>> If the former, then isn't a non-malicious
>>> guest still susceptible to a malicious guest?  
>>
>> A non-malicious guest needs to turn its switch on for a link to a GPU
>> which belongs to a malicious guest.
> 
> Actual security, or obfuscation, will we ever know...
>>>> If the latter, how is
>>> routing configured by the guest given that the guest view of the
>>> topology doesn't match physical hardware?  Are these routes
>>> deconfigured by device reset?  Are they part of the save/restore
>>> state?  Thanks,  
> 
> Still curious what happens to these routes on reset.  Can a later user
> of a GPU inherit a device where the links are already enabled?  Thanks,

I am told that the GPU reset disables links. As a side effect, we get an
HMI (a hardware fault which resets the host machine) when trying to
access the GPU RAM, which indicates that the link is down as the memory
is only accessible via the NVLink. We have special fencing code in our
host firmware (skiboot) to fence this memory on PCI reset so reading
from it returns zeroes instead of HMIs.



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-08-01  8:37                               ` Alexey Kardashevskiy
  (?)
@ 2018-08-01 16:16                                 ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-08-01 16:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Wed, 1 Aug 2018 18:37:35 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 01/08/2018 00:29, Alex Williamson wrote:
> > On Tue, 31 Jul 2018 14:03:35 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 31/07/2018 02:29, Alex Williamson wrote:  
> >>> On Mon, 30 Jul 2018 18:58:49 +1000
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
> >>>> After some local discussions, it was pointed out that force disabling
> >>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
> >>>> enable it so malicious guests cannot penetrate good ones (or a host)
> >>>> unless a good guest enabled the link but won't happen with a well
> >>>> behaving guest. And if two guests became malicious, then can still only
> >>>> harm each other, and so can they via other ways such network. This is
> >>>> different from PCIe as once PCIe link is unavoidably enabled, a well
> >>>> behaving device cannot firewall itself from peers as it is up to the
> >>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
> >>>> has means to protect itself, just like a guest can run "firewalld" for
> >>>> network.
> >>>>
> >>>> Although it would be a nice feature to have an extra barrier between
> >>>> GPUs, is inability to block the links in hypervisor still a blocker for
> >>>> V100 pass through?    
> >>>
> >>> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >>> specific routes configured?     
> >>
> >> The GPU-GPU links need not to be blocked and need to be enabled
> >> (==trained) by a driver in the guest. There are no routes between GPUs
> >> in NVLink fabric, these are direct links, it is just a switch on each
> >> side, both switches need to be on for a link to work.  
> > 
> > Ok, but there is at least the possibility of multiple direct links per
> > GPU, the very first diagram I find of NVlink shows 8 interconnected
> > GPUs:
> > 
> > https://www.nvidia.com/en-us/data-center/nvlink/  
> 
> Out design is like the left part of the picture but it is just a detail.

Unless we can specifically identify a direct link vs a mesh link, we
shouldn't be making assumptions about the degree of interconnect.
 
> > So if each switch enables one direct, point to point link, how does the
> > guest know which links to open for which peer device?  
> 
> It uses PCI config space on GPUs to discover the topology.

So do we need to virtualize this config space if we're going to
virtualize the topology?

> > And of course
> > since we can't see the spec, a security audit is at best hearsay :-\  
> 
> Yup, the exact discovery protocol is hidden.

It could be reverse engineered...

> >> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> >> is controlled via the emulated PCI bridges which I pass through together
> >> with the GPU.  
> > 
> > So there's a special emulated switch, is that how the guest knows which
> > GPUs it can enable NVLinks to?  
> 
> Since it only has PCI config space (there is nothing relevant in the
> device tree at all), I assume (double checking with the NVIDIA folks
> now) the guest driver enables them all, tests which pair works and
> disables the ones which do not. This gives a malicious guest a tiny
> window of opportunity to break into a good guest. Hm :-/

Let's not minimize that window, that seems like a prime candidate for
an exploit.

> >>> If the former, then isn't a non-malicious
> >>> guest still susceptible to a malicious guest?    
> >>
> >> A non-malicious guest needs to turn its switch on for a link to a GPU
> >> which belongs to a malicious guest.  
> > 
> > Actual security, or obfuscation, will we ever know...  
> >>>> If the latter, how is  
> >>> routing configured by the guest given that the guest view of the
> >>> topology doesn't match physical hardware?  Are these routes
> >>> deconfigured by device reset?  Are they part of the save/restore
> >>> state?  Thanks,    
> > 
> > Still curious what happens to these routes on reset.  Can a later user
> > of a GPU inherit a device where the links are already enabled?  Thanks,  
> 
> I am told that the GPU reset disables links. As a side effect, we get an
> HMI (a hardware fault which reset the host machine) when trying
> accessing the GPU RAM which indicates that the link is down as the
> memory is only accessible via the nvlink. We have special fencing code
> in our host firmware (skiboot) to fence this memory on PCI reset so
> reading from it returns zeroes instead of HMIs.

What sort of reset is required for this?  Typically we rely on
secondary bus reset for GPUs, but it would be a problem if GPUs were to
start implementing FLR and nobody had a spec to learn that FLR maybe
didn't disable the link.  The better approach to me still seems to be
virtualizing these NVLink config registers to an extent that the user
can only enable links where they have ownership of both ends of the
connection.
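
To be concrete, the check I have in mind is roughly the following; the
structures are entirely made up since the real NVLink register layout
isn't public:

#include <stdbool.h>
#include <stdio.h>

#define MAX_LINKS 6

/* Invented representation: each assigned GPU records which user owns it
 * and which device sits on the far end of each of its links. */
struct gpu_dev {
        int owner;                      /* the user/VM holding the device */
        struct gpu_dev *peer[MAX_LINKS];
};

/* Allow a (virtualized) write of a per-link enable bit only when the
 * same user owns both ends of that link. */
static bool nvlink_enable_allowed(const struct gpu_dev *gpu, unsigned int link)
{
        const struct gpu_dev *peer;

        if (link >= MAX_LINKS || !(peer = gpu->peer[link]))
                return false;
        return peer->owner == gpu->owner;
}

int main(void)
{
        struct gpu_dev a = { .owner = 1 }, b = { .owner = 2 };

        a.peer[0] = &b;
        b.peer[0] = &a;
        printf("allowed: %d\n", nvlink_enable_allowed(&a, 0)); /* 0 */
        b.owner = 1;                    /* same user now owns both ends */
        printf("allowed: %d\n", nvlink_enable_allowed(&a, 0)); /* 1 */
        return 0;
}

Thanks,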

Alex

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-08-01 16:16                                 ` Alex Williamson
  (?)
@ 2018-08-08  8:39                                   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-08-08  8:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 02/08/2018 02:16, Alex Williamson wrote:
> On Wed, 1 Aug 2018 18:37:35 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 01/08/2018 00:29, Alex Williamson wrote:
>>> On Tue, 31 Jul 2018 14:03:35 +1000
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 31/07/2018 02:29, Alex Williamson wrote:  
>>>>> On Mon, 30 Jul 2018 18:58:49 +1000
>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
>>>>>> After some local discussions, it was pointed out that force disabling
>>>>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>>>>>> enable it so malicious guests cannot penetrate good ones (or a host)
>>>>>> unless a good guest enabled the link but won't happen with a well
>>>>>> behaving guest. And if two guests became malicious, then can still only
>>>>>> harm each other, and so can they via other ways such network. This is
>>>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
>>>>>> behaving device cannot firewall itself from peers as it is up to the
>>>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>>>>>> has means to protect itself, just like a guest can run "firewalld" for
>>>>>> network.
>>>>>>
>>>>>> Although it would be a nice feature to have an extra barrier between
>>>>>> GPUs, is inability to block the links in hypervisor still a blocker for
>>>>>> V100 pass through?    
>>>>>
>>>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>>>>> specific routes configured?     
>>>>
>>>> The GPU-GPU links need not to be blocked and need to be enabled
>>>> (==trained) by a driver in the guest. There are no routes between GPUs
>>>> in NVLink fabric, these are direct links, it is just a switch on each
>>>> side, both switches need to be on for a link to work.  
>>>
>>> Ok, but there is at least the possibility of multiple direct links per
>>> GPU, the very first diagram I find of NVlink shows 8 interconnected
>>> GPUs:
>>>
>>> https://www.nvidia.com/en-us/data-center/nvlink/  
>>
>> Out design is like the left part of the picture but it is just a detail.
> 
> Unless we can specifically identify a direct link vs a mesh link, we
> shouldn't be making assumptions about the degree of interconnect.
>  
>>> So if each switch enables one direct, point to point link, how does the
>>> guest know which links to open for which peer device?  
>>
>> It uses PCI config space on GPUs to discover the topology.
> 
> So do we need to virtualize this config space if we're going to
> virtualize the topology?
> 
>>> And of course
>>> since we can't see the spec, a security audit is at best hearsay :-\  
>>
>> Yup, the exact discovery protocol is hidden.
> 
> It could be reverse engineered...
> 
>>>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
>>>> is controlled via the emulated PCI bridges which I pass through together
>>>> with the GPU.  
>>>
>>> So there's a special emulated switch, is that how the guest knows which
>>> GPUs it can enable NVLinks to?  
>>
>> Since it only has PCI config space (there is nothing relevant in the
>> device tree at all), I assume (double checking with the NVIDIA folks
>> now) the guest driver enables them all, tests which pair works and
>> disables the ones which do not. This gives a malicious guest a tiny
>> window of opportunity to break into a good guest. Hm :-/
> 
> Let's not minimize that window, that seems like a prime candidate for
> an exploit.
> 
>>>>> If the former, then isn't a non-malicious
>>>>> guest still susceptible to a malicious guest?    
>>>>
>>>> A non-malicious guest needs to turn its switch on for a link to a GPU
>>>> which belongs to a malicious guest.  
>>>
>>> Actual security, or obfuscation, will we ever know...  
>>>>>> If the latter, how is  
>>>>> routing configured by the guest given that the guest view of the
>>>>> topology doesn't match physical hardware?  Are these routes
>>>>> deconfigured by device reset?  Are they part of the save/restore
>>>>> state?  Thanks,    
>>>
>>> Still curious what happens to these routes on reset.  Can a later user
>>> of a GPU inherit a device where the links are already enabled?  Thanks,  
>>
>> I am told that the GPU reset disables links. As a side effect, we get an
>> HMI (a hardware fault which reset the host machine) when trying
>> accessing the GPU RAM which indicates that the link is down as the
>> memory is only accessible via the nvlink. We have special fencing code
>> in our host firmware (skiboot) to fence this memory on PCI reset so
>> reading from it returns zeroes instead of HMIs.
> 
> What sort of reset is required for this?  Typically we rely on
> secondary bus reset for GPUs, but it would be a problem if GPUs were to
> start implementing FLR and nobody had a spec to learn that FLR maybe
> didn't disable the link.  The better approach to me still seems to be
> virtualizing these NVLink config registers to an extent that the user
> can only enabling links where they have ownership of both ends of the
> connection.  Thanks,


I re-read what I wrote and I owe some explanation.

The link state can be:
- disabled (or masked),
- enabled (or not-disabled? unmasked?),
- trained (configured).

At the moment no reset disables links; on secondary bus reset they are
unconfigured and go back to the initial enabled-but-not-trained state,
which is the default config. The NVIDIA driver in the guest trains links
to do the topology discovery. We can disable links, and that disabled
status remains until the next secondary bus reset; there is no way to
re-enable a link other than a secondary bus reset. This is what I get
from NVIDIA. FLR should not be able to change a thing here.
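
As a toy model of exactly that behaviour (and nothing more - no real
hardware interface here):

#include <stdio.h>

enum link_state { LINK_DISABLED, LINK_ENABLED, LINK_TRAINED };

/* Secondary bus reset always returns a link to the default
 * enabled-but-not-trained state. */
static enum link_state on_sec_bus_reset(enum link_state s)
{
        (void)s;
        return LINK_ENABLED;
}

/* The guest driver can only train a link that is currently enabled;
 * a disabled link stays disabled. */
static enum link_state on_driver_train(enum link_state s)
{
        return s == LINK_ENABLED ? LINK_TRAINED : s;
}

/* Disabling is sticky until the next secondary bus reset. */
static enum link_state on_disable(enum link_state s)
{
        (void)s;
        return LINK_DISABLED;
}

int main(void)
{
        enum link_state s = LINK_ENABLED;

        s = on_driver_train(s);      /* guest trains the link */
        s = on_disable(s);           /* we disable it */
        s = on_driver_train(s);      /* no effect, still disabled... */
        printf("before SBR: %d\n", s);
        s = on_sec_bus_reset(s);     /* ...until the next SBR */
        printf("after SBR:  %d\n", s);
        return 0;
}

FLR is not modelled because, as above, it should not change link state.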



-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-08-08  8:39                                   ` Alexey Kardashevskiy
  (?)
@ 2018-08-09  4:21                                     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 108+ messages in thread
From: Alexey Kardashevskiy @ 2018-08-09  4:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson



On 08/08/2018 18:39, Alexey Kardashevskiy wrote:
> 
> 
> On 02/08/2018 02:16, Alex Williamson wrote:
>> On Wed, 1 Aug 2018 18:37:35 +1000
>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>> On 01/08/2018 00:29, Alex Williamson wrote:
>>>> On Tue, 31 Jul 2018 14:03:35 +1000
>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>   
>>>>> On 31/07/2018 02:29, Alex Williamson wrote:  
>>>>>> On Mon, 30 Jul 2018 18:58:49 +1000
>>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:  
>>>>>>> After some local discussions, it was pointed out that force disabling
>>>>>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>>>>>>> enable it so malicious guests cannot penetrate good ones (or a host)
>>>>>>> unless a good guest enabled the link but won't happen with a well
>>>>>>> behaving guest. And if two guests became malicious, then can still only
>>>>>>> harm each other, and so can they via other ways such network. This is
>>>>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
>>>>>>> behaving device cannot firewall itself from peers as it is up to the
>>>>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>>>>>>> has means to protect itself, just like a guest can run "firewalld" for
>>>>>>> network.
>>>>>>>
>>>>>>> Although it would be a nice feature to have an extra barrier between
>>>>>>> GPUs, is inability to block the links in hypervisor still a blocker for
>>>>>>> V100 pass through?    
>>>>>>
>>>>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>>>>>> specific routes configured?     
>>>>>
>>>>> The GPU-GPU links need not to be blocked and need to be enabled
>>>>> (==trained) by a driver in the guest. There are no routes between GPUs
>>>>> in NVLink fabric, these are direct links, it is just a switch on each
>>>>> side, both switches need to be on for a link to work.  
>>>>
>>>> Ok, but there is at least the possibility of multiple direct links per
>>>> GPU, the very first diagram I find of NVlink shows 8 interconnected
>>>> GPUs:
>>>>
>>>> https://www.nvidia.com/en-us/data-center/nvlink/  
>>>
>>> Out design is like the left part of the picture but it is just a detail.
>>
>> Unless we can specifically identify a direct link vs a mesh link, we
>> shouldn't be making assumptions about the degree of interconnect.
>>  
>>>> So if each switch enables one direct, point to point link, how does the
>>>> guest know which links to open for which peer device?  
>>>
>>> It uses PCI config space on GPUs to discover the topology.
>>
>> So do we need to virtualize this config space if we're going to
>> virtualize the topology?
>>
>>>> And of course
>>>> since we can't see the spec, a security audit is at best hearsay :-\  
>>>
>>> Yup, the exact discovery protocol is hidden.
>>
>> It could be reverse engineered...
>>
>>>>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
>>>>> is controlled via the emulated PCI bridges which I pass through together
>>>>> with the GPU.  
>>>>
>>>> So there's a special emulated switch, is that how the guest knows which
>>>> GPUs it can enable NVLinks to?  
>>>
>>> Since it only has PCI config space (there is nothing relevant in the
>>> device tree at all), I assume (double checking with the NVIDIA folks
>>> now) the guest driver enables them all, tests which pair works and
>>> disables the ones which do not. This gives a malicious guest a tiny
>>> window of opportunity to break into a good guest. Hm :-/
>>
>> Let's not minimize that window, that seems like a prime candidate for
>> an exploit.
>>
>>>>>> If the former, then isn't a non-malicious
>>>>>> guest still susceptible to a malicious guest?    
>>>>>
>>>>> A non-malicious guest needs to turn its switch on for a link to a GPU
>>>>> which belongs to a malicious guest.  
>>>>
>>>> Actual security, or obfuscation, will we ever know...  
>>>>>>> If the latter, how is  
>>>>>> routing configured by the guest given that the guest view of the
>>>>>> topology doesn't match physical hardware?  Are these routes
>>>>>> deconfigured by device reset?  Are they part of the save/restore
>>>>>> state?  Thanks,    
>>>>
>>>> Still curious what happens to these routes on reset.  Can a later user
>>>> of a GPU inherit a device where the links are already enabled?  Thanks,  
>>>
>>> I am told that the GPU reset disables links. As a side effect, we get an
>>> HMI (a hardware fault which reset the host machine) when trying
>>> accessing the GPU RAM which indicates that the link is down as the
>>> memory is only accessible via the nvlink. We have special fencing code
>>> in our host firmware (skiboot) to fence this memory on PCI reset so
>>> reading from it returns zeroes instead of HMIs.
>>
>> What sort of reset is required for this?  Typically we rely on
>> secondary bus reset for GPUs, but it would be a problem if GPUs were to
>> start implementing FLR and nobody had a spec to learn that FLR maybe
>> didn't disable the link.  The better approach to me still seems to be
>> virtualizing these NVLink config registers to an extent that the user
>> can only enabling links where they have ownership of both ends of the
>> connection.  Thanks,
> 
> 
> I re-read what I wrote and I owe some explanation.
> 
> The link state can be:
> - disabled (or masked),
> - enabled (or not-disabled? unmasked?),
> - trained (configured).
> 
> At the moment no reset disables links, on sec bus reset they are
> unconfigured and go to the initial enabled-and-not-trained state which
> is the default config. The NVIDIA driver in the guest trains links to do
> the topology discovery. We can disable links and this disabled status
> remains until sec bus reset and there is no way to re-enable links other
> than sec bus reset. This is what I get from NVIDIA. FLR should not be
> able to change a thing here.


By the way, using this masking mechanism does not involve any
virtualizing: these are MMIO registers which a powernv platform reset
hook writes in order to stay in sync with the already configured IOMMU
groups, and that's all. The guest will still be able to access them with
no filtering on the way; the accesses just won't do anything. Or is this
still called virtualizing?
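
To make that concrete, here is a rough C sketch of what such a reset
hook could do. The register name, offset and struct layout below are
purely hypothetical (the real NPU register map is not public in this
thread), so treat it as an assumption-laden illustration rather than the
actual powernv code.

#include <stdint.h>

/* Hypothetical register offset and layout -- for illustration only. */
#define NPU_BRICK_MASK_REG	0x80UL

struct npu_brick {
	volatile uint64_t *mmio;	/* mapped NPU control registers */
	unsigned int index;		/* link ("brick") number        */
};

/* Would be called from a platform PCI reset hook, after secondary bus
 * reset has returned the link to its enabled-but-not-trained state, to
 * keep the link mask in sync with the configured IOMMU groups. */
static void npu_brick_mask(struct npu_brick *brick)
{
	volatile uint64_t *reg = brick->mmio + NPU_BRICK_MASK_REG / sizeof(uint64_t);

	*reg |= 1ULL << brick->index;

	/* No trapping or filtering happens here: the guest can still read
	 * and write these registers, the writes simply have no effect
	 * while the brick is masked. */
}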




-- 
Alexey

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
  2018-08-09  4:21                                     ` Alexey Kardashevskiy
  (?)
@ 2018-08-09 14:06                                       ` Alex Williamson
  -1 siblings, 0 replies; 108+ messages in thread
From: Alex Williamson @ 2018-08-09 14:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: kvm, Ram Pai, kvm-ppc, Alistair Popple, linuxppc-dev, David Gibson

On Thu, 9 Aug 2018 14:21:29 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 08/08/2018 18:39, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 02/08/2018 02:16, Alex Williamson wrote:  
> >> On Wed, 1 Aug 2018 18:37:35 +1000
> >> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>  
> >>> On 01/08/2018 00:29, Alex Williamson wrote:  
> >>>> On Tue, 31 Jul 2018 14:03:35 +1000
> >>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>>     
> >>>>> On 31/07/2018 02:29, Alex Williamson wrote:    
> >>>>>> On Mon, 30 Jul 2018 18:58:49 +1000
> >>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:    
> >>>>>>> After some local discussions, it was pointed out that force disabling
> >>>>>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
> >>>>>>> enable it so malicious guests cannot penetrate good ones (or a host)
> >>>>>>> unless a good guest enabled the link but won't happen with a well
> >>>>>>> behaving guest. And if two guests became malicious, then can still only
> >>>>>>> harm each other, and so can they via other ways such network. This is
> >>>>>>> different from PCIe as once PCIe link is unavoidably enabled, a well
> >>>>>>> behaving device cannot firewall itself from peers as it is up to the
> >>>>>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
> >>>>>>> has means to protect itself, just like a guest can run "firewalld" for
> >>>>>>> network.
> >>>>>>>
> >>>>>>> Although it would be a nice feature to have an extra barrier between
> >>>>>>> GPUs, is inability to block the links in hypervisor still a blocker for
> >>>>>>> V100 pass through?      
> >>>>>>
> >>>>>> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >>>>>> specific routes configured?       
> >>>>>
> >>>>> The GPU-GPU links need not to be blocked and need to be enabled
> >>>>> (==trained) by a driver in the guest. There are no routes between GPUs
> >>>>> in NVLink fabric, these are direct links, it is just a switch on each
> >>>>> side, both switches need to be on for a link to work.    
> >>>>
> >>>> Ok, but there is at least the possibility of multiple direct links per
> >>>> GPU, the very first diagram I find of NVlink shows 8 interconnected
> >>>> GPUs:
> >>>>
> >>>> https://www.nvidia.com/en-us/data-center/nvlink/    
> >>>
> >>> Out design is like the left part of the picture but it is just a detail.  
> >>
> >> Unless we can specifically identify a direct link vs a mesh link, we
> >> shouldn't be making assumptions about the degree of interconnect.
> >>    
> >>>> So if each switch enables one direct, point to point link, how does the
> >>>> guest know which links to open for which peer device?    
> >>>
> >>> It uses PCI config space on GPUs to discover the topology.  
> >>
> >> So do we need to virtualize this config space if we're going to
> >> virtualize the topology?
> >>  
> >>>> And of course
> >>>> since we can't see the spec, a security audit is at best hearsay :-\    
> >>>
> >>> Yup, the exact discovery protocol is hidden.  
> >>
> >> It could be reverse engineered...
> >>  
> >>>>> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> >>>>> is controlled via the emulated PCI bridges which I pass through together
> >>>>> with the GPU.    
> >>>>
> >>>> So there's a special emulated switch, is that how the guest knows which
> >>>> GPUs it can enable NVLinks to?    
> >>>
> >>> Since it only has PCI config space (there is nothing relevant in the
> >>> device tree at all), I assume (double checking with the NVIDIA folks
> >>> now) the guest driver enables them all, tests which pair works and
> >>> disables the ones which do not. This gives a malicious guest a tiny
> >>> window of opportunity to break into a good guest. Hm :-/  
> >>
> >> Let's not minimize that window, that seems like a prime candidate for
> >> an exploit.
> >>  
> >>>>>> If the former, then isn't a non-malicious
> >>>>>> guest still susceptible to a malicious guest?      
> >>>>>
> >>>>> A non-malicious guest needs to turn its switch on for a link to a GPU
> >>>>> which belongs to a malicious guest.    
> >>>>
> >>>> Actual security, or obfuscation, will we ever know...    
> >>>>>>> If the latter, how is    
> >>>>>> routing configured by the guest given that the guest view of the
> >>>>>> topology doesn't match physical hardware?  Are these routes
> >>>>>> deconfigured by device reset?  Are they part of the save/restore
> >>>>>> state?  Thanks,      
> >>>>
> >>>> Still curious what happens to these routes on reset.  Can a later user
> >>>> of a GPU inherit a device where the links are already enabled?  Thanks,    
> >>>
> >>> I am told that the GPU reset disables links. As a side effect, we get an
> >>> HMI (a hardware fault which reset the host machine) when trying
> >>> accessing the GPU RAM which indicates that the link is down as the
> >>> memory is only accessible via the nvlink. We have special fencing code
> >>> in our host firmware (skiboot) to fence this memory on PCI reset so
> >>> reading from it returns zeroes instead of HMIs.  
> >>
> >> What sort of reset is required for this?  Typically we rely on
> >> secondary bus reset for GPUs, but it would be a problem if GPUs were to
> >> start implementing FLR and nobody had a spec to learn that FLR maybe
> >> didn't disable the link.  The better approach to me still seems to be
> >> virtualizing these NVLink config registers to an extent that the user
> >> can only enabling links where they have ownership of both ends of the
> >> connection.  Thanks,  
> > 
> > 
> > I re-read what I wrote and I owe some explanation.
> > 
> > The link state can be:
> > - disabled (or masked),
> > - enabled (or not-disabled? unmasked?),
> > - trained (configured).
> > 
> > At the moment no reset disables links, on sec bus reset they are
> > unconfigured and go to the initial enabled-and-not-trained state which
> > is the default config. The NVIDIA driver in the guest trains links to do
> > the topology discovery. We can disable links and this disabled status
> > remains until sec bus reset and there is no way to re-enable links other
> > than sec bus reset. This is what I get from NVIDIA. FLR should not be
> > able to change a thing here.  
> 
> 
> btw using this masking mechanism does not involve any virtualizing -
> these are MMIO registers which a powernv platform reset hook will write
> to in order to stay in sync with already configured IOMMU groups and
> that's all, the guest will still be able to access them with no
> filtering on the way, it just won't do anything. Or this is still called
> virtualizing?

The only thing POWER-specific here seems to be the NVLink interface to
the CPU, so why would a reset hook be implemented as a powernv platform
reset hook?  We know these GPUs also exist in x86 platforms, so
anything we do on the endpoint should be shared regardless of the
platform.  I'm envisioning that even if we simply disable the NVLink
via a device-specific reset, we'd probably still want to hide the
NVLink capability from the user; otherwise it seems likely that they
might try to interact with NVLink and we might induce problems because
it's not in an expected state.  So if we hide the capability or trap
access to the configuration registers, I'd call that virtualization.
Thanks,

Alex
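
For what it's worth, a small C sketch of the "hide the capability" half
of that suggestion: walking a 256-byte shadow copy of config space and
unlinking vendor-specific capabilities so the user never sees them. The
assumption that the NVLink controls live in a vendor-specific capability,
and the shadow-config layout, are mine; this is not the actual vfio-pci
code.

#include <stdint.h>

#define PCI_CAPABILITY_LIST	0x34	/* offset of the first capability pointer */
#define PCI_CAP_ID_VNDR		0x09	/* vendor-specific capability ID          */

/* Splice every vendor-specific capability out of a 256-byte shadow of
 * the device's config space, so a user reading the virtualized config
 * space never discovers the (undocumented) link control registers. */
static void hide_vendor_caps(uint8_t cfg[256])
{
	uint8_t *prev = &cfg[PCI_CAPABILITY_LIST];
	uint8_t pos = *prev;

	while (pos >= 0x40) {		/* capabilities live above the header */
		uint8_t id = cfg[pos];
		uint8_t next = cfg[pos + 1];

		if (id == PCI_CAP_ID_VNDR)
			*prev = next;		/* unlink this capability */
		else
			prev = &cfg[pos + 1];	/* keep it, move on       */

		pos = next;
	}
}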

^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2018-08-09 14:06 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-07  8:44 [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100 Alexey Kardashevskiy
2018-06-07  8:44 ` [RFC PATCH kernel 1/5] vfio/spapr_tce: Simplify page contained test Alexey Kardashevskiy
2018-06-08  3:32   ` David Gibson
2018-06-07  8:44 ` [RFC PATCH kernel 2/5] powerpc/iommu_context: Change referencing in API Alexey Kardashevskiy
2018-06-07  8:44 ` [RFC PATCH kernel 3/5] powerpc/iommu: Do not pin memory of a memory device Alexey Kardashevskiy
2018-06-07  8:44 ` [RFC PATCH kernel 4/5] vfio_pci: Allow mapping extra regions Alexey Kardashevskiy
2018-06-07 17:04   ` Alex Williamson
2018-06-07  8:44 ` [RFC PATCH kernel 5/5] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver Alexey Kardashevskiy
2018-06-07 17:04   ` Alex Williamson
2018-06-08  3:09     ` Alexey Kardashevskiy
2018-06-08  3:35       ` Alex Williamson
2018-06-08  3:52         ` Alexey Kardashevskiy
2018-06-08  4:34           ` Alex Williamson
2018-06-07 17:04 ` [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100 Alex Williamson
2018-06-07 21:54   ` Benjamin Herrenschmidt
2018-06-07 22:15     ` Alex Williamson
2018-06-07 23:20       ` Benjamin Herrenschmidt
2018-06-08  0:34         ` Alex Williamson
2018-06-08  0:58           ` Benjamin Herrenschmidt
2018-06-08  1:18             ` Alex Williamson
2018-06-08  3:08       ` Alexey Kardashevskiy
2018-06-08  3:44         ` Alex Williamson
2018-06-08  4:14           ` Alexey Kardashevskiy
2018-06-08  5:03             ` Alex Williamson
2018-07-10  4:10               ` Alexey Kardashevskiy
2018-07-10 22:37                 ` Alex Williamson
2018-07-11  9:26                   ` Alexey Kardashevskiy
2018-07-30  8:58                     ` Alexey Kardashevskiy
2018-07-30 16:29                       ` Alex Williamson
2018-07-31  4:03                         ` Alexey Kardashevskiy
2018-07-31 14:29                           ` Alex Williamson
2018-08-01  8:37                             ` Alexey Kardashevskiy
2018-08-01 16:16                               ` Alex Williamson
2018-08-08  8:39                                 ` Alexey Kardashevskiy
2018-08-09  4:21                                   ` Alexey Kardashevskiy
2018-08-09 14:06                                     ` Alex Williamson
