* [Qemu-devel] [PATCH qemu v3 0/6] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough
@ 2019-02-27  8:51 Alexey Kardashevskiy
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 1/6] pci: Move NVIDIA vendor id to the rest of ids Alexey Kardashevskiy
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-27  8:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Sam Bobroff,
	Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
	Jose Ricardo Ziviani, Daniel Henrique Barboza, Alex Williamson


This is for passing through NVIDIA V100 GPUs on POWER9 systems.

This implements a subdriver for NVIDIA V100 GPU with coherent memory and
NPU/ATS support available in the POWER9 CPU.

Patch 1/6 went via the PCI tree and is included here for reference only.

Since 6/6 moves GPU RAM to much higher addresses than before,
I added 4/6 to mitigate RCU stall warnings in the guest.

Here is the kernel driver:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/vfio/pci?h=v5.0-rc6&id=7f92891778dff62303c070ac81de7b7d80de331a

SLOF changes already went in.

This is based on dwg/ppc-for-4.0 sha1
a12da2e Murilo Opsfelder Araujo "ppc/pnv: use IEC binary prefixes to represent sizes".

Please comment. Thanks.



Alexey Kardashevskiy (6):
  pci: Move NVIDIA vendor id to the rest of ids
  vfio/spapr: Fix indirect levels calculation
  vfio/spapr: Rename local systempagesize variable
  spapr_iommu: Do not replay mappings from just created DMA window
  vfio: Make vfio_get_region_info_cap public
  spapr: Support NVIDIA V100 GPU with NVLink2

 hw/ppc/Makefile.objs          |   2 +-
 hw/vfio/pci.h                 |   2 +
 include/hw/pci-host/spapr.h   |  41 ++++
 include/hw/pci/pci_ids.h      |   2 +
 include/hw/ppc/spapr.h        |   4 +-
 include/hw/vfio/vfio-common.h |   2 +
 hw/ppc/spapr.c                |  29 ++-
 hw/ppc/spapr_iommu.c          |  31 +++
 hw/ppc/spapr_pci.c            |   8 +
 hw/ppc/spapr_pci_nvlink2.c    | 419 ++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_rtas_ddw.c       |   7 +
 hw/vfio/common.c              |   2 +-
 hw/vfio/pci-quirks.c          | 122 +++++++++-
 hw/vfio/pci.c                 |  14 ++
 hw/vfio/spapr.c               |  49 ++--
 hw/vfio/trace-events          |   6 +-
 16 files changed, 718 insertions(+), 22 deletions(-)
 create mode 100644 hw/ppc/spapr_pci_nvlink2.c

-- 
2.17.1


* [Qemu-devel] [PATCH qemu v3 1/6] pci: Move NVIDIA vendor id to the rest of ids
  2019-02-27  8:51 [Qemu-devel] [PATCH qemu v3 0/6] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough Alexey Kardashevskiy
@ 2019-02-27  8:51 ` Alexey Kardashevskiy
  2019-02-28  0:56   ` David Gibson
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 2/6] vfio/spapr: Fix indirect levels calculation Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-27  8:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Sam Bobroff,
	Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
	Jose Ricardo Ziviani, Daniel Henrique Barboza, Alex Williamson

The sPAPR code will use it too, so move it from VFIO to the common code.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
---
 include/hw/pci/pci_ids.h | 2 ++
 hw/vfio/pci-quirks.c     | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
index eeb3301..0abe27a 100644
--- a/include/hw/pci/pci_ids.h
+++ b/include/hw/pci/pci_ids.h
@@ -271,4 +271,6 @@
 
 #define PCI_VENDOR_ID_SYNOPSYS           0x16C3
 
+#define PCI_VENDOR_ID_NVIDIA             0x10de
+
 #endif
diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index eae31c7..40a1200 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -526,8 +526,6 @@ static void vfio_probe_ati_bar2_quirk(VFIOPCIDevice *vdev, int nr)
  * note it for future reference.
  */
 
-#define PCI_VENDOR_ID_NVIDIA                    0x10de
-
 /*
  * Nvidia has several different methods to get to config space, the
  * nouveu project has several of these documented here:
-- 
2.17.1


* [Qemu-devel] [PATCH qemu v3 2/6] vfio/spapr: Fix indirect levels calculation
  2019-02-27  8:51 [Qemu-devel] [PATCH qemu v3 0/6] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough Alexey Kardashevskiy
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 1/6] pci: Move NVIDIA vendor id to the rest of ids Alexey Kardashevskiy
@ 2019-02-27  8:51 ` Alexey Kardashevskiy
  2019-02-28  2:24   ` David Gibson
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 3/6] vfio/spapr: Rename local systempagesize variable Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-27  8:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Sam Bobroff,
	Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
	Jose Ricardo Ziviani, Daniel Henrique Barboza, Alex Williamson

The current code assumes that we can address more bits on a PCI bus
for DMA than we really can, but there is no way of knowing the actual limit.

This makes a better guess for the number of levels and, if the kernel
fails to allocate that, increases the number of levels until the allocation
succeeds or the 64bit limit is reached.

This adds levels to the trace point.

This may cause the kernel to warn about a failed allocation:
   [65122.837458] Failed to allocate a TCE memory, level shift=28
which might happen if MAX_ORDER is not large enough, as it can vary:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727
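
For illustration only, here is a minimal standalone sketch of the levels
guess described above (this is not the QEMU code; the 128 TiB window and
the 64 KiB IOMMU/system page sizes are assumed example values):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t window_size = 128ULL << 40;  /* assumed 128 TiB DMA window */
        unsigned page_shift = 16;             /* assumed 64 KiB IOMMU pages */
        unsigned sys_page_shift = 16;         /* assumed 64 KiB system pages */

        uint64_t entries = window_size >> page_shift;        /* 2^31 TCEs */
        /* bits needed to index the flat 8-bytes-per-TCE table */
        unsigned bits_total = __builtin_ctzll(entries * sizeof(uint64_t));
        /* safe per-level guess: system page shift + 8 (minimal MAX_ORDER) */
        unsigned bits_per_level = sys_page_shift + 8;
        unsigned levels = (bits_total + bits_per_level - 1) / bits_per_level;
        unsigned max_levels = (64 - page_shift) / sys_page_shift;

        printf("entries=%llu bits_total=%u levels=%u (max %u)\n",
               (unsigned long long) entries, bits_total, levels, max_levels);
        return 0;
    }

This prints "entries=2147483648 bits_total=34 levels=2 (max 3)", i.e. the
first attempt asks the kernel for a 2-level table and, should that
allocation fail, the loop in this patch retries with 3 levels before
giving up.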

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* replace systempagesize with getpagesize() when calculating bits_per_level/max_levels
---
 hw/vfio/spapr.c      | 43 +++++++++++++++++++++++++++++++++----------
 hw/vfio/trace-events |  2 +-
 2 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index becf71a..88437a7 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -143,10 +143,10 @@ int vfio_spapr_create_window(VFIOContainer *container,
                              MemoryRegionSection *section,
                              hwaddr *pgsize)
 {
-    int ret;
+    int ret = 0;
     IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
     uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr);
-    unsigned entries, pages;
+    unsigned entries, bits_total, bits_per_level, max_levels;
     struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
     long systempagesize = qemu_getrampagesize();
 
@@ -176,16 +176,38 @@ int vfio_spapr_create_window(VFIOContainer *container,
     create.window_size = int128_get64(section->size);
     create.page_shift = ctz64(pagesize);
     /*
-     * SPAPR host supports multilevel TCE tables, there is some
-     * heuristic to decide how many levels we want for our table:
-     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+     * SPAPR host supports multilevel TCE tables. We try to guess optimal
+     * levels number and if this fails (for example due to the host memory
+     * fragmentation), we increase levels. The DMA address structure is:
+     * rrrrrrrr rxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xxxxxxxx xxxxxxxx iiiiiiii
+     * where:
+     *   r = reserved (bits >= 55 are reserved in the existing hardware)
+     *   i = IOMMU page offset (64K in this example)
+     *   x = bits to index a TCE which can be split into equal chunks to index
+     *      within the level.
+     * The aim is to split "x" into the smallest possible number of levels.
      */
     entries = create.window_size >> create.page_shift;
-    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
-    pages = MAX(pow2ceil(pages), 1); /* Round up */
-    create.levels = ctz64(pages) / 6 + 1;
-
-    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+    /* bits_total is number of "x" needed */
+    bits_total = ctz64(entries * sizeof(uint64_t));
+    /*
+     * bits_per_level is a safe guess of how much we can allocate per level:
+     * 8 is the current minimum for CONFIG_FORCE_MAX_ZONEORDER and MAX_ORDER
+     * is usually bigger than that.
+     * Below we look at getpagesize() as TCEs are allocated from system pages.
+     */
+    bits_per_level = ctz64(getpagesize()) + 8;
+    create.levels = bits_total / bits_per_level;
+    if (bits_total % bits_per_level) {
+        ++create.levels;
+    }
+    max_levels = (64 - create.page_shift) / ctz64(getpagesize());
+    for ( ; create.levels <= max_levels; ++create.levels) {
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+        if (!ret) {
+            break;
+        }
+    }
     if (ret) {
         error_report("Failed to create a window, ret = %d (%m)", ret);
         return -errno;
@@ -200,6 +222,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
         return -EINVAL;
     }
     trace_vfio_spapr_create_window(create.page_shift,
+                                   create.levels,
                                    create.window_size,
                                    create.start_addr);
     *pgsize = pagesize;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ed2f333..cf1e886 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -129,6 +129,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
 vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
 vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
 vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
-vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
 vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
 vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"
-- 
2.17.1


* [Qemu-devel] [PATCH qemu v3 3/6] vfio/spapr: Rename local systempagesize variable
  2019-02-27  8:51 [Qemu-devel] [PATCH qemu v3 0/6] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough Alexey Kardashevskiy
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 1/6] pci: Move NVIDIA vendor id to the rest of ids Alexey Kardashevskiy
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 2/6] vfio/spapr: Fix indirect levels calculation Alexey Kardashevskiy
@ 2019-02-27  8:51 ` Alexey Kardashevskiy
  2019-02-28  2:26   ` David Gibson
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-27  8:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Sam Bobroff,
	Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
	Jose Ricardo Ziviani, Daniel Henrique Barboza, Alex Williamson

The "systempagesize" name suggests that it is the host system page size
while it is the smallest page size of memory backing the guest RAM so
let's rename it to stop confusion. This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/spapr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 88437a7..57fe758 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -148,14 +148,14 @@ int vfio_spapr_create_window(VFIOContainer *container,
     uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr);
     unsigned entries, bits_total, bits_per_level, max_levels;
     struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
-    long systempagesize = qemu_getrampagesize();
+    long rampagesize = qemu_getrampagesize();
 
     /*
      * The host might not support the guest supported IOMMU page size,
      * so we will use smaller physical IOMMU pages to back them.
      */
-    if (pagesize > systempagesize) {
-        pagesize = systempagesize;
+    if (pagesize > rampagesize) {
+        pagesize = rampagesize;
     }
     pagesize = 1ULL << (63 - clz64(container->pgsizes &
                                    (pagesize | (pagesize - 1))));
-- 
2.17.1


* [Qemu-devel] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window
  2019-02-27  8:51 [Qemu-devel] [PATCH qemu v3 0/6] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 3/6] vfio/spapr: Rename local systempagesize variable Alexey Kardashevskiy
@ 2019-02-27  8:51 ` Alexey Kardashevskiy
  2019-02-27 14:33   ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 5/6] vfio: Make vfio_get_region_info_cap public Alexey Kardashevskiy
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2 Alexey Kardashevskiy
  5 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-27  8:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Sam Bobroff,
	Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
	Jose Ricardo Ziviani, Daniel Henrique Barboza, Alex Williamson

On sPAPR vfio_listener_region_add() is called in 2 situations:
1. a new listener is registered from vfio_connect_container();
2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().

In both cases vfio_listener_region_add() calls
memory_region_iommu_replay() to notify newly registered IOMMU notifiers
about existing mappings which is totally desirable for case 1.

However for case 2 it is nothing but a no-op as the window has just been
created and has no valid mappings, so replaying them does not do anything.
It is barely noticeable with usual guests, but if the window happens to be
really big, such a no-op replay might take minutes and trigger RCU stall
warnings in the guest.

For example, an upcoming GPU RAM memory region mapped at 64 TiB (right
after SPAPR_PCI_LIMIT) causes the 64bit DMA window to be at least 128 TiB,
which is (128 << 40) / 0x10000 = 2,147,483,648 TCEs to replay.
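
As a quick sanity check of that number, here is a tiny standalone snippet
(the 128 TiB window and 64 KiB IOMMU page size are the assumed values from
the example above; this is not part of the patch):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t window = 128ULL << 40;   /* 128 TiB 64bit DMA window */
        uint64_t tce_page = 0x10000;      /* 64 KiB IOMMU page */

        /* prints 2147483648, i.e. 2^31 TCEs */
        printf("TCEs to replay: %llu\n",
               (unsigned long long) (window / tce_page));
        return 0;
    }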

This mitigates the problem by adding a "skipping_replay" flag to
sPAPRTCETable and defining sPAPR's own IOMMU MR replay() hook which does
exactly the same thing as the generic one except that it returns early if
@skipping_replay==true.

When "ibm,create-pe-dma-window" is complete, the guest will map only
required regions of the huge DMA window.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/hw/ppc/spapr.h  |  1 +
 hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
 hw/ppc/spapr_rtas_ddw.c |  7 +++++++
 3 files changed, 39 insertions(+)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 86b0488..358bb38 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -727,6 +727,7 @@ struct sPAPRTCETable {
     uint64_t *mig_table;
     bool bypass;
     bool need_vfio;
+    bool skipping_replay;
     int fd;
     MemoryRegion root;
     IOMMUMemoryRegion iommu;
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 37e98f9..8f23179 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
     return ret;
 }
 
+static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
+{
+    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
+    hwaddr addr, granularity;
+    IOMMUTLBEntry iotlb;
+    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
+
+    if (tcet->skipping_replay) {
+        return;
+    }
+
+    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
+
+    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
+        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
+        if (iotlb.perm != IOMMU_NONE) {
+            n->notify(n, &iotlb);
+        }
+
+        /*
+         * if (2^64 - MR size) < granularity, it's possible to get an
+         * infinite loop here.  This should catch such a wraparound.
+         */
+        if ((addr + granularity) < addr) {
+            break;
+        }
+    }
+}
+
 static int spapr_tce_table_pre_save(void *opaque)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -659,6 +689,7 @@ static void spapr_iommu_memory_region_class_init(ObjectClass *klass, void *data)
     IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
 
     imrc->translate = spapr_tce_translate_iommu;
+    imrc->replay = spapr_tce_replay;
     imrc->get_min_page_size = spapr_tce_get_min_page_size;
     imrc->notify_flag_changed = spapr_tce_notify_flag_changed;
     imrc->get_attr = spapr_tce_get_attr;
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
index cb8a410..9cc020d 100644
--- a/hw/ppc/spapr_rtas_ddw.c
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -171,8 +171,15 @@ static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
     }
 
     win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
+    /*
+     * We have just created a window, we know for a fact that it is empty,
+     * use a hack to avoid iterating over the table as it is quite possible
+     * to have billions of TCEs, all empty.
+     */
+    tcet->skipping_replay = true;
     spapr_tce_table_enable(tcet, page_shift, win_addr,
                            1ULL << (window_shift - page_shift));
+    tcet->skipping_replay = false;
     if (!tcet->nb_table) {
         goto hw_error_exit;
     }
-- 
2.17.1


* [Qemu-devel] [PATCH qemu v3 5/6] vfio: Make vfio_get_region_info_cap public
  2019-02-27  8:51 [Qemu-devel] [PATCH qemu v3 0/6] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window Alexey Kardashevskiy
@ 2019-02-27  8:51 ` Alexey Kardashevskiy
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2 Alexey Kardashevskiy
  5 siblings, 0 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-27  8:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Sam Bobroff,
	Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
	Jose Ricardo Ziviani, Daniel Henrique Barboza, Alex Williamson

This exposes vfio_get_region_info_cap() so that it can be used in quirks.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/hw/vfio/vfio-common.h | 2 ++
 hw/vfio/common.c              | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 7624c9f..fbf0966 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -189,6 +189,8 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
 int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
                              uint32_t subtype, struct vfio_region_info **info);
 bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type);
+struct vfio_info_cap_header *
+vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index df2b472..4374cc6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -729,7 +729,7 @@ static void vfio_listener_release(VFIOContainer *container)
     }
 }
 
-static struct vfio_info_cap_header *
+struct vfio_info_cap_header *
 vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
 {
     struct vfio_info_cap_header *hdr;
-- 
2.17.1


* [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2
  2019-02-27  8:51 [Qemu-devel] [PATCH qemu v3 0/6] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 5/6] vfio: Make vfio_get_region_info_cap public Alexey Kardashevskiy
@ 2019-02-27  8:51 ` Alexey Kardashevskiy
  2019-02-28  3:31   ` David Gibson
  5 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-27  8:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Sam Bobroff,
	Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
	Jose Ricardo Ziviani, Daniel Henrique Barboza, Alex Williamson

NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which include an ATS shootdown (ATSD) register exported via the NVLink
bridge device.

This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in the PCI device as a QOM link. The sPAPR PCI code
uses this link to retrieve the MR and map it into the system address space.
Another quirk does the same for ATSD.

This adds additional steps to sPAPR PHB setup:

1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;

2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-region", "ibm,nvlink-speed" to advertise the NVLink2 function to
the guest;

3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;

4. Add new memory blocks (with an extra "linux,usable-memory" property to
prevent the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB, as the GPU driver
uses them for link discovery.

This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.
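
As a rough sketch of the resulting layout, the following standalone snippet
mirrors the placement arithmetic from this patch (the constants are copied
from the hunks below; the 64 TiB value of SPAPR_PCI_LIMIT is assumed from
the description in patch 4/6, so treat the printed addresses as
illustrative only):

    #include <stdio.h>
    #include <stdint.h>

    #define SPAPR_PCI_LIMIT              0x400000000000ULL /* assumed 64 TiB */
    #define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
    #define SPAPR_PCI_NV2RAM64_WIN_SIZE  0x10000000000ULL  /* 1 TiB */
    #define NVGPU_MAX_NUM                6
    #define NVGPU_MAX_ATSD               6
    #define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
                                          SPAPR_PCI_NV2RAM64_WIN_SIZE * \
                                          NVGPU_MAX_NUM)
    #define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_ATSD * 0x10000)

    int main(void)
    {
        for (uint32_t index = 0; index < 2; ++index) {
            /* same formulas as spapr_phb_placement() in this patch */
            uint64_t nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE +
                              index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
            uint64_t nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE +
                               index * SPAPR_PCI_NV2ATSD_WIN_SIZE;

            printf("vPHB %u: GPU RAM window 0x%llx, ATSD window 0x%llx\n",
                   index, (unsigned long long) nv2gpa,
                   (unsigned long long) nv2atsd);
        }
        return 0;
    }

Under that assumption the default vPHB (index 0) gets its GPU RAM placed at
64 TiB and its ATSD registers at 70 TiB, with each further vPHB shifted by
1 TiB and 0x60000 bytes respectively.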

This puts new memory nodes in a separate NUMA node to replicate the host
system setup as the GPU driver relies on this.

This adds a requirement similar to EEH: one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
This is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPUs in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* moved GPU RAM above PCI MMIO limit
* renamed QOM property to nvlink2-tgt
* moved nvlink2 code to its own file

---

An example command line for a redbud system:

pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
-nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
-enable-kvm -m 384G \
-chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
-mon chardev=SOCKET0,mode=control \
-smp 80,sockets=1,threads=4 \
-netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
-device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
img/vdisk0.img \
-device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
-device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
-device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
-device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
-device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
-device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
-device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
-device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
-device spapr-pci-host-bridge,id=phb1,index=1 \
-device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
-device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
-device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
-device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
-device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
-device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
-device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
-device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
-machine pseries \
-L /home/aik/t/qemu-ppc64-bios/ -d guest_errors

Note that QEMU attaches PCI devices to the most recently added vPHB, so the
first 8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and
35:03:00.0..7:00:01.2 go to the vPHB with id=phb1.
---
 hw/ppc/Makefile.objs        |   2 +-
 hw/vfio/pci.h               |   2 +
 include/hw/pci-host/spapr.h |  41 ++++
 include/hw/ppc/spapr.h      |   3 +-
 hw/ppc/spapr.c              |  29 ++-
 hw/ppc/spapr_pci.c          |   8 +
 hw/ppc/spapr_pci_nvlink2.c  | 419 ++++++++++++++++++++++++++++++++++++
 hw/vfio/pci-quirks.c        | 120 +++++++++++
 hw/vfio/pci.c               |  14 ++
 hw/vfio/trace-events        |   4 +
 10 files changed, 637 insertions(+), 5 deletions(-)
 create mode 100644 hw/ppc/spapr_pci_nvlink2.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index 1111b21..636e717 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
 # IBM PowerNV
 obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
-obj-y += spapr_pci_vfio.o
+obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
 endif
 obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index b1ae4c0..706c304 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -194,6 +194,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
 int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
                                struct vfio_region_info *info,
                                Error **errp);
+int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
+int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
 
 void vfio_display_reset(VFIOPCIDevice *vdev);
 int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index ab0e3a0..e791dd4 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -87,6 +87,9 @@ struct sPAPRPHBState {
     uint32_t mig_liobn;
     hwaddr mig_mem_win_addr, mig_mem_win_size;
     hwaddr mig_io_win_addr, mig_io_win_size;
+    hwaddr nv2_gpa_win_addr;
+    hwaddr nv2_atsd_win_addr;
+    struct spapr_phb_pci_nvgpu_config *nvgpus;
 };
 
 #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
@@ -105,6 +108,23 @@ struct sPAPRPHBState {
 
 #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
 
+#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
+#define SPAPR_PCI_NV2RAM64_WIN_SIZE  0x10000000000ULL /* 1 TiB for all 6xGPUs */
+
+/* Max number of these GPUs per physical box */
+#define NVGPU_MAX_NUM                6
+/*
+ * One NVLink bridge provides one ATSD register so it should be 18.
+ * In practice though, we allow only one IOMMU group per vPHB, which equals
+ * one NPU2, and an NPU2 has a maximum of 6 NVLink bridges.
+ */
+#define NVGPU_MAX_ATSD               6
+
+#define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
+                                      SPAPR_PCI_NV2RAM64_WIN_SIZE * \
+                                      NVGPU_MAX_NUM)
+#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_ATSD * 0x10000)
+
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
@@ -135,6 +155,11 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
 int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
 int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
 void spapr_phb_vfio_reset(DeviceState *qdev);
+void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb);
+void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off);
+void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);
+void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
+                                        sPAPRPHBState *sphb);
 #else
 static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)
 {
@@ -161,6 +186,22 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
 static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 {
 }
+static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
+{
+}
+static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,
+                                               int bus_off)
+{
+}
+static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,
+                                                   void *fdt)
+{
+}
+static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
+                                                      int offset,
+                                                      sPAPRPHBState *sphb)
+{
+}
 #endif
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 358bb38..9acf867 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -113,7 +113,8 @@ struct sPAPRMachineClass {
     void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio, 
                           hwaddr *mmio32, hwaddr *mmio64,
-                          unsigned n_dma, uint32_t *liobns, Error **errp);
+                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
+                          hwaddr *nv2atsd, Error **errp);
     sPAPRResizeHPT resize_hpt_default;
     sPAPRCapabilities default_caps;
     sPAPRIrq *irq;
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 74c9b07..fda6e7e 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3929,7 +3929,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
     smc->phb_placement(spapr, sphb->index,
                        &sphb->buid, &sphb->io_win_addr,
                        &sphb->mem_win_addr, &sphb->mem64_win_addr,
-                       windows_supported, sphb->dma_liobn, errp);
+                       windows_supported, sphb->dma_liobn,
+                       &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
+                       errp);
 }
 
 static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
@@ -4129,7 +4131,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
 static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
                                 uint64_t *buid, hwaddr *pio,
                                 hwaddr *mmio32, hwaddr *mmio64,
-                                unsigned n_dma, uint32_t *liobns, Error **errp)
+                                unsigned n_dma, uint32_t *liobns,
+                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
 {
     /*
      * New-style PHB window placement.
@@ -4174,6 +4177,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
     *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
     *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
     *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
+
+    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
+    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
 }
 
 static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
@@ -4376,6 +4382,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
 /*
  * pseries-3.1
  */
+static void phb_placement_3_1(sPAPRMachineState *spapr, uint32_t index,
+                              uint64_t *buid, hwaddr *pio,
+                              hwaddr *mmio32, hwaddr *mmio64,
+                              unsigned n_dma, uint32_t *liobns,
+                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
+{
+    spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
+                        nv2gpa, nv2atsd, errp);
+    *nv2gpa = 0;
+    *nv2atsd = 0;
+}
+
 static void spapr_machine_3_1_class_options(MachineClass *mc)
 {
     sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
@@ -4391,6 +4409,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
     mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power8_v2.0");
     smc->update_dt_enabled = false;
     smc->dr_phb_enabled = false;
+    smc->phb_placement = phb_placement_3_1;
 }
 
 DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
@@ -4522,7 +4541,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
 static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
                               uint64_t *buid, hwaddr *pio,
                               hwaddr *mmio32, hwaddr *mmio64,
-                              unsigned n_dma, uint32_t *liobns, Error **errp)
+                              unsigned n_dma, uint32_t *liobns,
+                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
 {
     /* Legacy PHB placement for pseries-2.7 and earlier machine types */
     const uint64_t base_buid = 0x800000020000000ULL;
@@ -4566,6 +4586,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
      * fallback behaviour of automatically splitting a large "32-bit"
      * window into contiguous 32-bit and 64-bit windows
      */
+
+    *nv2gpa = 0;
+    *nv2atsd = 0;
 }
 
 static void spapr_machine_2_7_class_options(MachineClass *mc)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 06a5ffd..f076462 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1355,6 +1355,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
     if (sphb->pcie_ecs && pci_is_express(dev)) {
         _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
     }
+
+    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
 }
 
 /* create OF node for pci device and required OF DT properties */
@@ -1878,6 +1880,7 @@ static void spapr_phb_reset(DeviceState *qdev)
     sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
 
     spapr_phb_dma_reset(sphb);
+    spapr_phb_nvgpu_setup(sphb);
 
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
@@ -1910,6 +1913,8 @@ static Property spapr_phb_properties[] = {
                      pre_2_8_migration, false),
     DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,
                      pcie_ecs, true),
+    DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),
+    DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -2282,6 +2287,9 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb, uint32_t intc_phandle, void *fdt,
         return ret;
     }
 
+    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off);
+    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
+
     return 0;
 }
 
diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
new file mode 100644
index 0000000..965a6be
--- /dev/null
+++ b/hw/ppc/spapr_pci_nvlink2.c
@@ -0,0 +1,419 @@
+/*
+ * QEMU sPAPR PCI for NVLink2 pass through
+ *
+ * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "hw/pci/pci.h"
+#include "hw/pci-host/spapr.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/fdt.h"
+#include "hw/pci/pci_bridge.h"
+
+#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
+                                     (((phb)->index) << 16) | ((pdev)->devfn))
+#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
+                                     (((phb)->index) << 16))
+/* NVLink2 wants a separate NUMA node for its RAM */
+#define GPURAM_ASSOCIATIVITY(phb, n) (255 - ((phb)->index * 3 + (n)))
+#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
+                                     ((gn) << 4) | (nn))
+
+/* Max number of NVLinks per GPU in any physical box */
+#define NVGPU_MAX_LINKS              3
+
+struct spapr_phb_pci_nvgpu_config {
+    uint64_t nv2_ram_current;
+    uint64_t nv2_atsd_current;
+    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
+    struct spapr_phb_pci_nvgpu_slot {
+        uint64_t tgt;
+        uint64_t gpa;
+        PCIDevice *gpdev;
+        int linknum;
+        struct {
+            uint64_t atsd_gpa;
+            PCIDevice *npdev;
+            uint32_t link_speed;
+        } links[NVGPU_MAX_LINKS];
+    } slots[NVGPU_MAX_NUM];
+};
+
+static int spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
+                                    uint64_t tgt)
+{
+    int i;
+
+    /* Search for partially collected "slot" */
+    for (i = 0; i < nvgpus->num; ++i) {
+        if (nvgpus->slots[i].tgt == tgt) {
+            return i;
+        }
+    }
+
+    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
+        warn_report("Found too many NVLink bridges per GPU");
+        return -1;
+    }
+
+    i = nvgpus->num;
+    nvgpus->slots[i].tgt = tgt;
+    ++nvgpus->num;
+
+    return i;
+}
+
+static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
+                                    PCIDevice *pdev, uint64_t tgt,
+                                    MemoryRegion *mr)
+{
+    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt);
+
+    if (i < 0) {
+        return;
+    }
+    g_assert(!nvgpus->slots[i].gpdev);
+    nvgpus->slots[i].gpdev = pdev;
+
+    nvgpus->slots[i].gpa = nvgpus->nv2_ram_current;
+    nvgpus->nv2_ram_current += memory_region_size(mr);
+}
+
+static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
+                                    PCIDevice *pdev, uint64_t tgt,
+                                    MemoryRegion *mr)
+{
+    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt), j;
+    struct spapr_phb_pci_nvgpu_slot *nvslot;
+
+    if (i < 0) {
+        return;
+    }
+
+    nvslot = &nvgpus->slots[i];
+    j = nvslot->linknum;
+    if (j == ARRAY_SIZE(nvslot->links)) {
+        warn_report("Found too many NVLink2 bridges");
+        return;
+    }
+    ++nvslot->linknum;
+
+    g_assert(!nvslot->links[j].npdev);
+    nvslot->links[j].npdev = pdev;
+    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
+    nvgpus->nv2_atsd_current += memory_region_size(mr);
+    nvslot->links[j].link_speed =
+        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
+}
+
+static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
+                                        void *opaque)
+{
+    PCIBus *sec_bus;
+    Object *po = OBJECT(pdev);
+    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
+
+    if (tgt) {
+        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
+        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
+                                                  NULL);
+
+        if (mr_gpu) {
+            spapr_pci_collect_nvgpu(opaque, pdev, tgt, MEMORY_REGION(mr_gpu));
+        } else if (mr_npu) {
+            spapr_pci_collect_nvnpu(opaque, pdev, tgt, MEMORY_REGION(mr_npu));
+        } else {
+            warn_report("Unexpected device with \"nvlink2-tgt\"");
+        }
+    }
+    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
+         PCI_HEADER_TYPE_BRIDGE)) {
+        return;
+    }
+
+    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
+    if (!sec_bus) {
+        return;
+    }
+
+    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
+                        spapr_phb_pci_collect_nvgpu, opaque);
+}
+
+void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
+{
+    int i, j, valid_gpu_num;
+
+    /* If there are existing NVLink2 MRs, unmap those before recreating */
+    if (sphb->nvgpus) {
+        for (i = 0; i < sphb->nvgpus->num; ++i) {
+            struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+            Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
+                                                        "nvlink2-mr[0]", NULL);
+
+            if (nv_mrobj) {
+                memory_region_del_subregion(get_system_memory(),
+                                            MEMORY_REGION(nv_mrobj));
+            }
+            for (j = 0; j < nvslot->linknum; ++j) {
+                PCIDevice *npdev = nvslot->links[j].npdev;
+                Object *atsd_mrobj;
+                atsd_mrobj = object_property_get_link(OBJECT(npdev),
+                                                      "nvlink2-atsd-mr[0]",
+                                                      NULL);
+                if (atsd_mrobj) {
+                    memory_region_del_subregion(get_system_memory(),
+                                                MEMORY_REGION(atsd_mrobj));
+                }
+            }
+        }
+        g_free(sphb->nvgpus);
+        sphb->nvgpus = NULL;
+    }
+
+    /* Search for GPUs and NPUs */
+    if (sphb->nv2_gpa_win_addr && sphb->nv2_atsd_win_addr) {
+        PCIBus *bus = PCI_HOST_BRIDGE(sphb)->bus;
+
+        sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
+        sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
+        sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
+
+        pci_for_each_device(bus, pci_bus_num(bus),
+                            spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
+    }
+
+    /* Add found GPU RAM and ATSD MRs if found */
+    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
+        Object *nvmrobj;
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+
+        if (!nvslot->gpdev) {
+            continue;
+        }
+        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
+                                           "nvlink2-mr[0]", NULL);
+        /* ATSD is pointless without GPU RAM MR so skip those */
+        if (!nvmrobj) {
+            continue;
+        }
+
+        ++valid_gpu_num;
+        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
+                                    MEMORY_REGION(nvmrobj));
+
+        for (j = 0; j < nvslot->linknum; ++j) {
+            Object *atsdmrobj;
+
+            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
+                                                 "nvlink2-atsd-mr[0]",
+                                                 NULL);
+            if (!atsdmrobj) {
+                continue;
+            }
+            memory_region_add_subregion(get_system_memory(),
+                                        nvslot->links[j].atsd_gpa,
+                                        MEMORY_REGION(atsdmrobj));
+        }
+    }
+
+    if (!valid_gpu_num) {
+        /* We did not find any interesting GPU */
+        g_free(sphb->nvgpus);
+        sphb->nvgpus = NULL;
+    }
+}
+
+void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off)
+{
+    int i, j, atsdnum = 0;
+    uint64_t atsd[8]; /* The existing limitation of known guests */
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+
+        if (!nvslot->gpdev) {
+            continue;
+        }
+        for (j = 0; j < nvslot->linknum; ++j) {
+            if (!nvslot->links[j].atsd_gpa) {
+                continue;
+            }
+
+            if (atsdnum == ARRAY_SIZE(atsd)) {
+                warn_report("Only %ld ATSD registers allowed",
+                            ARRAY_SIZE(atsd));
+                break;
+            }
+            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
+            ++atsdnum;
+        }
+    }
+
+    if (!atsdnum) {
+        warn_report("No ATSD registers found");
+    } else if (!spapr_phb_eeh_available(sphb)) {
+        /*
+         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
+         * which we do not emulate as a separate device. Instead we put
+         * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not
+         * put GPUs from different IOMMU groups to the same vPHB to ensure
+         * that the guest will use ATSDs from the corresponding NPU.
+         */
+        warn_report("ATSD requires separate vPHB per GPU IOMMU group");
+    } else {
+        _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd",
+                          atsd, atsdnum * sizeof(atsd[0]))));
+    }
+}
+
+void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)
+{
+    int i, j, linkidx, npuoff;
+    char *npuname;
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    npuname = g_strdup_printf("npuphb%d", sphb->index);
+    npuoff = fdt_add_subnode(fdt, 0, npuname);
+    _FDT(npuoff);
+    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
+    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
+    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
+    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
+    g_free(npuname);
+
+    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
+        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
+            char *linkname = g_strdup_printf("link@%d", linkidx);
+            int off = fdt_add_subnode(fdt, npuoff, linkname);
+
+            _FDT(off);
+            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx))); */
+            _FDT((fdt_setprop_string(fdt, off, "compatible",
+                                     "ibm,npu-link")));
+            _FDT((fdt_setprop_cell(fdt, off, "phandle",
+                                   PHANDLE_NVLINK(sphb, i, j))));
+            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));
+            g_free(linkname);
+            ++linkidx;
+        }
+    }
+
+    /* Add memory nodes for GPU RAM and mark them unusable */
+    for (i = 0; i < sphb->nvgpus->num; ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
+                                                    "nvlink2-mr[0]", NULL);
+        uint32_t at = cpu_to_be32(GPURAM_ASSOCIATIVITY(sphb, i));
+        uint32_t associativity[] = { cpu_to_be32(0x4), at, at, at, at };
+        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
+        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
+        char *mem_name = g_strdup_printf("memory@%lx", nvslot->gpa);
+        int off = fdt_add_subnode(fdt, 0, mem_name);
+
+        _FDT(off);
+        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
+        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
+        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
+                          sizeof(associativity))));
+
+        _FDT((fdt_setprop_string(fdt, off, "compatible",
+                                 "ibm,coherent-device-memory")));
+
+        mem_reg[1] = cpu_to_be64(0);
+        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
+                          sizeof(mem_reg))));
+        _FDT((fdt_setprop_cell(fdt, off, "phandle",
+                               PHANDLE_GPURAM(sphb, i))));
+        g_free(mem_name);
+    }
+
+}
+
+void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
+                                        sPAPRPHBState *sphb)
+{
+    int i, j;
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    for (i = 0; i < sphb->nvgpus->num; ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+
+        /* Skip "slot" without attached GPU */
+        if (!nvslot->gpdev) {
+            continue;
+        }
+        if (dev == nvslot->gpdev) {
+            uint32_t npus[nvslot->linknum];
+
+            for (j = 0; j < nvslot->linknum; ++j) {
+                PCIDevice *npdev = nvslot->links[j].npdev;
+
+                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
+            }
+            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
+                             j * sizeof(npus[0])));
+            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
+                                   PHANDLE_PCIDEV(sphb, dev))));
+            continue;
+        }
+
+        for (j = 0; j < nvslot->linknum; ++j) {
+            if (dev != nvslot->links[j].npdev) {
+                continue;
+            }
+
+            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
+                                   PHANDLE_PCIDEV(sphb, dev))));
+            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
+                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
+            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
+                                   PHANDLE_NVLINK(sphb, i, j))));
+            /*
+             * If we ever want to emulate GPU RAM at the same location as on
+             * the host - here is the encoding GPA->TGT:
+             *
+             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
+             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
+             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
+             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
+             */
+            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
+                                  PHANDLE_GPURAM(sphb, i)));
+            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
+                                 nvslot->tgt));
+            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
+                                  nvslot->links[j].link_speed));
+        }
+    }
+}
diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index 40a1200..15ec0b4 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -2180,3 +2180,123 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
 
     return 0;
 }
+
+static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
+                                     const char *name,
+                                     void *opaque, Error **errp)
+{
+    uint64_t tgt = (uint64_t) opaque;
+    visit_type_uint64(v, name, &tgt, errp);
+}
+
+static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
+                                                 const char *name,
+                                                 void *opaque, Error **errp)
+{
+    uint32_t link_speed = (uint32_t)(uint64_t) opaque;
+    visit_type_uint32(v, name, &link_speed, errp);
+}
+
+int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
+{
+    int ret;
+    void *p;
+    struct vfio_region_info *nv2region = NULL;
+    struct vfio_info_cap_header *hdr;
+    MemoryRegion *nv2mr = g_malloc0(sizeof(*nv2mr));
+
+    ret = vfio_get_dev_region_info(&vdev->vbasedev,
+                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
+                                   PCI_VENDOR_ID_NVIDIA,
+                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
+                                   &nv2region);
+    if (ret) {
+        return ret;
+    }
+
+    p = mmap(NULL, nv2region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
+             MAP_SHARED, vdev->vbasedev.fd, nv2region->offset);
+
+    if (!p) {
+        return -errno;
+    }
+
+    memory_region_init_ram_ptr(nv2mr, OBJECT(vdev), "nvlink2-mr",
+                               nv2region->size, p);
+
+    hdr = vfio_get_region_info_cap(nv2region,
+                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
+    if (hdr) {
+        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
+
+        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
+                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
+                            (void *) cap->tgt, NULL);
+        trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
+                                              nv2region->size);
+    }
+    g_free(nv2region);
+
+    return 0;
+}
+
+int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
+{
+    int ret;
+    void *p;
+    struct vfio_region_info *atsd_region = NULL;
+    struct vfio_info_cap_header *hdr;
+
+    ret = vfio_get_dev_region_info(&vdev->vbasedev,
+                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
+                                   PCI_VENDOR_ID_IBM,
+                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
+                                   &atsd_region);
+    if (ret) {
+        return ret;
+    }
+
+    /* Some NVLink bridges come without assigned ATSD, skip MR part */
+    if (atsd_region->size) {
+        MemoryRegion *atsd_mr = g_malloc0(sizeof(*atsd_mr));
+
+        p = mmap(NULL, atsd_region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
+                 MAP_SHARED, vdev->vbasedev.fd, atsd_region->offset);
+
+        if (!p) {
+            return -errno;
+        }
+
+        memory_region_init_ram_device_ptr(atsd_mr, OBJECT(vdev),
+                                          "nvlink2-atsd-mr",
+                                          atsd_region->size,
+                                          p);
+    }
+
+    hdr = vfio_get_region_info_cap(atsd_region,
+                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
+    if (hdr) {
+        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
+
+        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
+                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
+                            (void *) cap->tgt, NULL);
+        trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, cap->tgt,
+                                                  atsd_region->size);
+    }
+
+    hdr = vfio_get_region_info_cap(atsd_region,
+                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
+    if (hdr) {
+        struct vfio_region_info_cap_nvlink2_lnkspd *cap = (void *) hdr;
+
+        object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
+                            vfio_pci_nvlink2_get_link_speed, NULL, NULL,
+                            (void *) (uint64_t) cap->link_speed, NULL);
+        trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
+                                                  cap->link_speed);
+    }
+    g_free(atsd_region);
+
+    return 0;
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index dd12f36..07aa141 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3069,6 +3069,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto out_teardown;
     }
 
+    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
+        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
+        if (ret && ret != -ENODEV) {
+            error_report("Failed to setup NVIDIA V100 GPU RAM");
+        }
+    }
+
+    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
+        ret = vfio_pci_nvlink2_init(vdev, errp);
+        if (ret && ret != -ENODEV) {
+            error_report("Failed to setup NVlink2 bridge");
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index cf1e886..88841e9 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -87,6 +87,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
 vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
 vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
 
+vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
+vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
+vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
+
 # hw/vfio/common.c
 vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
 vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window Alexey Kardashevskiy
@ 2019-02-27 14:33   ` Greg Kurz
  2019-02-27 23:59     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Greg Kurz @ 2019-02-27 14:33 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, Jose Ricardo Ziviani, Daniel Henrique Barboza,
	Alex Williamson, Sam Bobroff, Piotr Jaroszynski, qemu-ppc,
	Leonardo Augusto Guimarães Garcia, David Gibson

On Wed, 27 Feb 2019 19:51:47 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On sPAPR vfio_listener_region_add() is called in 2 situations:
> 1. a new listener is registered from vfio_connect_container();
> 2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().
> 
> In both cases vfio_listener_region_add() calls
> memory_region_iommu_replay() to notify newly registered IOMMU notifiers
> about existing mappings which is totally desirable for case 1.
> 
> However for case 2 it is nothing but a no-op as the window has just been
> created and has no valid mappings, so replaying them does not do anything.
> It is barely noticeable with usual guests but if the window happens to be
> really big, such no-op replay might take minutes and trigger RCU stall
> warnings in the guest.
> 
> For example, an upcoming GPU RAM memory region mapped at 64TiB (right
> after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB,
> which is (128<<40)/0x10000 = 2,147,483,648 TCEs to replay.
> 
> This mitigates the problem by adding an "skipping_replay" flag to
> sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does
> exactly the same thing as the generic one except it returns early if
> @skipping_replay==true.
> 
> When "ibm,create-pe-dma-window" is complete, the guest will map only
> required regions of the huge DMA window.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  include/hw/ppc/spapr.h  |  1 +
>  hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
>  hw/ppc/spapr_rtas_ddw.c |  7 +++++++
>  3 files changed, 39 insertions(+)
> 
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 86b0488..358bb38 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -727,6 +727,7 @@ struct sPAPRTCETable {
>      uint64_t *mig_table;
>      bool bypass;
>      bool need_vfio;
> +    bool skipping_replay;
>      int fd;
>      MemoryRegion root;
>      IOMMUMemoryRegion iommu;
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 37e98f9..8f23179 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
>      return ret;
>  }
>  
> +static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> +{
> +    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
> +    hwaddr addr, granularity;
> +    IOMMUTLBEntry iotlb;
> +    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
> +
> +    if (tcet->skipping_replay) {
> +        return;
> +    }
> +
> +    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
> +
> +    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> +        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
> +        if (iotlb.perm != IOMMU_NONE) {
> +            n->notify(n, &iotlb);
> +        }
> +
> +        /*
> +         * if (2^64 - MR size) < granularity, it's possible to get an
> +         * infinite loop here.  This should catch such a wraparound.
> +         */
> +        if ((addr + granularity) < addr) {
> +            break;
> +        }
> +    }
> +}

It is a bit unfortunate to duplicate all that code. What about making
a memory_region_iommu_replay_generic() helper out of it and calling it
from spapr_tce_replay() and memory_region_iommu_replay()?
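
Something like this maybe (untested, just to illustrate lifting the loop
above into a shared helper; the helper name is only a suggestion):

    /* memory.c */
    void memory_region_iommu_replay_generic(IOMMUMemoryRegion *iommu_mr,
                                            IOMMUNotifier *n)
    {
        MemoryRegion *mr = MEMORY_REGION(iommu_mr);
        IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
        hwaddr addr, granularity;
        IOMMUTLBEntry iotlb;

        granularity = memory_region_iommu_get_min_page_size(iommu_mr);

        for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
            iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
            if (iotlb.perm != IOMMU_NONE) {
                n->notify(n, &iotlb);
            }
            /* Guard against wraparound when (2^64 - MR size) < granularity */
            if ((addr + granularity) < addr) {
                break;
            }
        }
    }

    /* hw/ppc/spapr_iommu.c */
    static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
    {
        sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);

        if (tcet->skipping_replay) {
            return;
        }
        memory_region_iommu_replay_generic(iommu_mr, n);
    }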

Apart from that, LGTM.

> +
>  static int spapr_tce_table_pre_save(void *opaque)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -659,6 +689,7 @@ static void spapr_iommu_memory_region_class_init(ObjectClass *klass, void *data)
>      IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
>  
>      imrc->translate = spapr_tce_translate_iommu;
> +    imrc->replay = spapr_tce_replay;
>      imrc->get_min_page_size = spapr_tce_get_min_page_size;
>      imrc->notify_flag_changed = spapr_tce_notify_flag_changed;
>      imrc->get_attr = spapr_tce_get_attr;
> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
> index cb8a410..9cc020d 100644
> --- a/hw/ppc/spapr_rtas_ddw.c
> +++ b/hw/ppc/spapr_rtas_ddw.c
> @@ -171,8 +171,15 @@ static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>      }
>  
>      win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
> +    /*
> +     * We have just created a window, we know for the fact that it is empty,
> +     * use a hack to avoid iterating over the table as it is quite possible
> +     * to have billions of TCEs, all empty.
> +     */
> +    tcet->skipping_replay = true;
>      spapr_tce_table_enable(tcet, page_shift, win_addr,
>                             1ULL << (window_shift - page_shift));
> +    tcet->skipping_replay = false;
>      if (!tcet->nb_table) {
>          goto hw_error_exit;
>      }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window
  2019-02-27 14:33   ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
@ 2019-02-27 23:59     ` Alexey Kardashevskiy
  2019-02-28  3:49       ` David Gibson
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-27 23:59 UTC (permalink / raw)
  To: Greg Kurz
  Cc: qemu-devel, Jose Ricardo Ziviani, Daniel Henrique Barboza,
	Alex Williamson, Sam Bobroff, Piotr Jaroszynski, qemu-ppc,
	Leonardo Augusto Guimarães Garcia, David Gibson



On 28/02/2019 01:33, Greg Kurz wrote:
> On Wed, 27 Feb 2019 19:51:47 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On sPAPR vfio_listener_region_add() is called in 2 situations:
>> 1. a new listener is registered from vfio_connect_container();
>> 2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().
>>
>> In both cases vfio_listener_region_add() calls
>> memory_region_iommu_replay() to notify newly registered IOMMU notifiers
>> about existing mappings which is totally desirable for case 1.
>>
>> However for case 2 it is nothing but a no-op as the window has just been
>> created and has no valid mappings, so replaying them does not do anything.
>> It is barely noticeable with usual guests but if the window happens to be
>> really big, such no-op replay might take minutes and trigger RCU stall
>> warnings in the guest.
>>
>> For example, an upcoming GPU RAM memory region mapped at 64TiB (right
>> after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB,
>> which is (128<<40)/0x10000 = 2,147,483,648 TCEs to replay.
>>
>> This mitigates the problem by adding an "skipping_replay" flag to
>> sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does
>> exactly the same thing as the generic one except it returns early if
>> @skipping_replay==true.
>>
>> When "ibm,create-pe-dma-window" is complete, the guest will map only
>> required regions of the huge DMA window.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  include/hw/ppc/spapr.h  |  1 +
>>  hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
>>  hw/ppc/spapr_rtas_ddw.c |  7 +++++++
>>  3 files changed, 39 insertions(+)
>>
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 86b0488..358bb38 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -727,6 +727,7 @@ struct sPAPRTCETable {
>>      uint64_t *mig_table;
>>      bool bypass;
>>      bool need_vfio;
>> +    bool skipping_replay;
>>      int fd;
>>      MemoryRegion root;
>>      IOMMUMemoryRegion iommu;
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 37e98f9..8f23179 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
>>      return ret;
>>  }
>>  
>> +static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>> +{
>> +    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
>> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
>> +    hwaddr addr, granularity;
>> +    IOMMUTLBEntry iotlb;
>> +    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
>> +
>> +    if (tcet->skipping_replay) {
>> +        return;
>> +    }
>> +
>> +    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
>> +
>> +    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>> +        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
>> +        if (iotlb.perm != IOMMU_NONE) {
>> +            n->notify(n, &iotlb);
>> +        }
>> +
>> +        /*
>> +         * if (2^64 - MR size) < granularity, it's possible to get an
>> +         * infinite loop here.  This should catch such a wraparound.
>> +         */
>> +        if ((addr + granularity) < addr) {
>> +            break;
>> +        }
>> +    }
>> +}
> 
> It is a bit unfortunate to duplicate all that code. What about making
> a memory_region_iommu_replay_generic() helper out of it and calling it
> from spapr_tce_replay() and memory_region_iommu_replay()?


I really do not want to mess with generic code to solve our local sPAPR
problem, especially when there is a way not to do so.

And as a next step, I was thinking of making this replay a no-op in QEMU
later and doing the replay in KVM instead, when an IOMMU group is attached
to KVM, as this is the only case when we need a replay; KVM has a much
better idea of which TCEs are actually valid and can skip most of them.
This is a somewhat bigger change as it requires a new KVM capability
("KVM replays mappings") but once we have it, spapr_tce_replay() will
become a no-op.


> Apart from that, LGTM.

Well, it is a hack; I just do not have the taste to tell how nasty it is :)


> 
>> +
>>  static int spapr_tce_table_pre_save(void *opaque)
>>  {
>>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> @@ -659,6 +689,7 @@ static void spapr_iommu_memory_region_class_init(ObjectClass *klass, void *data)
>>      IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
>>  
>>      imrc->translate = spapr_tce_translate_iommu;
>> +    imrc->replay = spapr_tce_replay;
>>      imrc->get_min_page_size = spapr_tce_get_min_page_size;
>>      imrc->notify_flag_changed = spapr_tce_notify_flag_changed;
>>      imrc->get_attr = spapr_tce_get_attr;
>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>> index cb8a410..9cc020d 100644
>> --- a/hw/ppc/spapr_rtas_ddw.c
>> +++ b/hw/ppc/spapr_rtas_ddw.c
>> @@ -171,8 +171,15 @@ static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>      }
>>  
>>      win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
>> +    /*
>> +     * We have just created a window, we know for the fact that it is empty,
>> +     * use a hack to avoid iterating over the table as it is quite possible
>> +     * to have billions of TCEs, all empty.
>> +     */
>> +    tcet->skipping_replay = true;
>>      spapr_tce_table_enable(tcet, page_shift, win_addr,
>>                             1ULL << (window_shift - page_shift));
>> +    tcet->skipping_replay = false;
>>      if (!tcet->nb_table) {
>>          goto hw_error_exit;
>>      }
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v3 1/6] pci: Move NVIDIA vendor id to the rest of ids
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 1/6] pci: Move NVIDIA vendor id to the rest of ids Alexey Kardashevskiy
@ 2019-02-28  0:56   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2019-02-28  0:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Sam Bobroff, Piotr Jaroszynski,
	Leonardo Augusto Guimarães Garcia, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 1586 bytes --]

On Wed, Feb 27, 2019 at 07:51:44PM +1100, Alexey Kardashevskiy wrote:
> sPAPR code will use it too so move it from VFIO to the common code.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: Alistair Francis <alistair.francis@wdc.com>

Not technically my purview, but since it's to enable the rest of this
series, I've applied it to ppc-for-4.0.

> ---
>  include/hw/pci/pci_ids.h | 2 ++
>  hw/vfio/pci-quirks.c     | 2 --
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
> index eeb3301..0abe27a 100644
> --- a/include/hw/pci/pci_ids.h
> +++ b/include/hw/pci/pci_ids.h
> @@ -271,4 +271,6 @@
>  
>  #define PCI_VENDOR_ID_SYNOPSYS           0x16C3
>  
> +#define PCI_VENDOR_ID_NVIDIA             0x10de
> +
>  #endif
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index eae31c7..40a1200 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -526,8 +526,6 @@ static void vfio_probe_ati_bar2_quirk(VFIOPCIDevice *vdev, int nr)
>   * note it for future reference.
>   */
>  
> -#define PCI_VENDOR_ID_NVIDIA                    0x10de
> -
>  /*
>   * Nvidia has several different methods to get to config space, the
>   * nouveu project has several of these documented here:

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v3 2/6] vfio/spapr: Fix indirect levels calculation
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 2/6] vfio/spapr: Fix indirect levels calculation Alexey Kardashevskiy
@ 2019-02-28  2:24   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2019-02-28  2:24 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Sam Bobroff, Piotr Jaroszynski,
	Leonardo Augusto Guimarães Garcia, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 6013 bytes --]

On Wed, Feb 27, 2019 at 07:51:45PM +1100, Alexey Kardashevskiy wrote:
> The current code assumes that we can address more bits on a PCI bus
> for DMA than we really can, but there is no way of knowing the actual limit.
> 
> This makes a better guess for the number of levels and, if the kernel
> fails to allocate that, increases the number of levels until the allocation
> succeeds or the 64bit limit is reached.
> 
> This adds levels to the trace point.
> 
> This may cause the kernel to warn about failed allocation:
>    [65122.837458] Failed to allocate a TCE memory, level shift=28
> which might happen if MAX_ORDER is not large enough as it can vary:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v2:
> * replace systempagesize with getpagesize() when calculating
> bits_per_level/max_levels

As noted previously, guessing how the kernel will do this is pretty
gross, but the existing KVM interface kind of forces us to.  Plus it's
only non-optimal, not incorrect if we guess wrong.  So, applied.
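
Just to spell out the arithmetic for the common case as I read the code
below (64 KiB system pages, 64 KiB IOMMU pages, the default 2 GiB 32-bit
window) - my numbers, not from the patch:

    entries        = 2 GiB >> 16          = 32768 TCEs
    bits_total     = ctz64(32768 * 8)     = 18
    bits_per_level = ctz64(64 KiB) + 8    = 24
    create.levels  = 18 / 24, rounded up  = 1
    max_levels     = (64 - 16) / 16       = 3

For a 128 TiB 64-bit window the same calculation gives bits_total = 34 and
an initial guess of 2 levels, which the retry loop can bump up to
max_levels = 3 if the kernel fails to allocate the table.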

> ---
>  hw/vfio/spapr.c      | 43 +++++++++++++++++++++++++++++++++----------
>  hw/vfio/trace-events |  2 +-
>  2 files changed, 34 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index becf71a..88437a7 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -143,10 +143,10 @@ int vfio_spapr_create_window(VFIOContainer *container,
>                               MemoryRegionSection *section,
>                               hwaddr *pgsize)
>  {
> -    int ret;
> +    int ret = 0;
>      IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
>      uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr);
> -    unsigned entries, pages;
> +    unsigned entries, bits_total, bits_per_level, max_levels;
>      struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>      long systempagesize = qemu_getrampagesize();
>  
> @@ -176,16 +176,38 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      create.window_size = int128_get64(section->size);
>      create.page_shift = ctz64(pagesize);
>      /*
> -     * SPAPR host supports multilevel TCE tables, there is some
> -     * heuristic to decide how many levels we want for our table:
> -     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +     * SPAPR host supports multilevel TCE tables. We try to guess optimal
> +     * levels number and if this fails (for example due to the host memory
> +     * fragmentation), we increase levels. The DMA address structure is:
> +     * rrrrrrrr rxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xxxxxxxx xxxxxxxx iiiiiiii
> +     * where:
> +     *   r = reserved (bits >= 55 are reserved in the existing hardware)
> +     *   i = IOMMU page offset (64K in this example)
> +     *   x = bits to index a TCE which can be split to equal chunks to index
> +     *      within the level.
> +     * The aim is to split "x" to smaller possible number of levels.
>       */
>      entries = create.window_size >> create.page_shift;
> -    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> -    pages = MAX(pow2ceil(pages), 1); /* Round up */
> -    create.levels = ctz64(pages) / 6 + 1;
> -
> -    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    /* bits_total is number of "x" needed */
> +    bits_total = ctz64(entries * sizeof(uint64_t));
> +    /*
> +     * bits_per_level is a safe guess of how much we can allocate per level:
> +     * 8 is the current minimum for CONFIG_FORCE_MAX_ZONEORDER and MAX_ORDER
> +     * is usually bigger than that.
> +     * Below we look at getpagesize() as TCEs are allocated from system pages.
> +     */
> +    bits_per_level = ctz64(getpagesize()) + 8;
> +    create.levels = bits_total / bits_per_level;
> +    if (bits_total % bits_per_level) {
> +        ++create.levels;
> +    }
> +    max_levels = (64 - create.page_shift) / ctz64(getpagesize());
> +    for ( ; create.levels <= max_levels; ++create.levels) {
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +        if (!ret) {
> +            break;
> +        }
> +    }
>      if (ret) {
>          error_report("Failed to create a window, ret = %d (%m)", ret);
>          return -errno;
> @@ -200,6 +222,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>          return -EINVAL;
>      }
>      trace_vfio_spapr_create_window(create.page_shift,
> +                                   create.levels,
>                                     create.window_size,
>                                     create.start_addr);
>      *pgsize = pagesize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ed2f333..cf1e886 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -129,6 +129,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
>  vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
>  vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
>  vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
> -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
>  vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v3 3/6] vfio/spapr: Rename local systempagesize variable
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 3/6] vfio/spapr: Rename local systempagesize variable Alexey Kardashevskiy
@ 2019-02-28  2:26   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2019-02-28  2:26 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Sam Bobroff, Piotr Jaroszynski,
	Leonardo Augusto Guimarães Garcia, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 1690 bytes --]

On Wed, Feb 27, 2019 at 07:51:46PM +1100, Alexey Kardashevskiy wrote:
> The "systempagesize" name suggests that it is the host system page size
> while it is the smallest page size of memory backing the guest RAM so
> let's rename it to stop confusion. This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Definitely a good idea, applied.

> ---
>  hw/vfio/spapr.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index 88437a7..57fe758 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -148,14 +148,14 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr);
>      unsigned entries, bits_total, bits_per_level, max_levels;
>      struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
> -    long systempagesize = qemu_getrampagesize();
> +    long rampagesize = qemu_getrampagesize();
>  
>      /*
>       * The host might not support the guest supported IOMMU page size,
>       * so we will use smaller physical IOMMU pages to back them.
>       */
> -    if (pagesize > systempagesize) {
> -        pagesize = systempagesize;
> +    if (pagesize > rampagesize) {
> +        pagesize = rampagesize;
>      }
>      pagesize = 1ULL << (63 - clz64(container->pgsizes &
>                                     (pagesize | (pagesize - 1))));

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2
  2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2 Alexey Kardashevskiy
@ 2019-02-28  3:31   ` David Gibson
  2019-02-28  6:11     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: David Gibson @ 2019-02-28  3:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Sam Bobroff, Piotr Jaroszynski,
	Leonardo Augusto Guimarães Garcia, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 41795 bytes --]

On Wed, Feb 27, 2019 at 07:51:49PM +1100, Alexey Kardashevskiy wrote:
> NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
> space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
> implements special regions for such GPUs and emulates an NVLink bridge.
> NVLink2-enabled POWER9 CPUs also provide address translation services
> which includes an ATS shootdown (ATSD) register exported via the NVLink
> bridge device.
> 
> This adds a quirk to VFIO to map the GPU memory and create an MR;
> the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
> this to get the MR and map it to the system address space.
> Another quirk does the same for ATSD.
> 
> This adds additional steps to sPAPR PHB setup:
> 
> 1. Search for specific GPUs and NPUs, collect findings in
> sPAPRPHBState::nvgpus, manage system address space mappings;
> 
> 2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
> "memory-block", "link-speed" to advertise the NVLink2 function to
> the guest;
> 
> 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
> 
> 4. Add new memory blocks (with extra "linux,memory-usable" to prevent
> the guest OS from accessing the new memory until it is onlined) and
> npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
> uses it for link discovery.
> 
> This allocates space for GPU RAM and ATSD like we do for MMIOs by
> adding 2 new parameters to the phb_placement() hook. Older machine types
> set these to zero.
> 
> This puts new memory nodes in a separate NUMA node to replicate the host
> system setup as the GPU driver relies on this.
> 
> This adds a requirement similar to EEH - one IOMMU group per vPHB.
> The reason for this is that ATSD registers belong to a physical NPU
> so they cannot invalidate translations on GPUs attached to another NPU.
> It is guaranteed by the host platform as it does not mix NVLink bridges
> or GPUs from different NPU in the same IOMMU group. If more than one
> IOMMU group is detected on a vPHB, this disables ATSD support for that
> vPHB and prints a warning.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v3:
> * moved GPU RAM above PCI MMIO limit
> * renamed QOM property to nvlink2-tgt
> * moved nvlink2 code to its own file
> 
> ---
> 
> The example command line for redbud system:
> 
> pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
> -nodefaults \
> -chardev stdio,id=STDIO0,signal=off,mux=on \
> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> -mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
> -enable-kvm -m 384G \
> -chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
> -mon chardev=SOCKET0,mode=control \
> -smp 80,sockets=1,threads=4 \
> -netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
> -device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
> img/vdisk0.img \
> -device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
> -device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
> -device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
> -device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
> -device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
> -device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
> -device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
> -device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
> -device spapr-pci-host-bridge,id=phb1,index=1 \
> -device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
> -device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
> -device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
> -device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
> -device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
> -device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
> -device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
> -device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
> -machine pseries \
> -L /home/aik/t/qemu-ppc64-bios/ -d guest_errors
> 
> Note that QEMU attaches PCI devices to the last added vPHB so first
> 8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and
> 35:03:00.0..7:00:01.2 to the vPHB with id=phb1.
> ---
>  hw/ppc/Makefile.objs        |   2 +-
>  hw/vfio/pci.h               |   2 +
>  include/hw/pci-host/spapr.h |  41 ++++
>  include/hw/ppc/spapr.h      |   3 +-
>  hw/ppc/spapr.c              |  29 ++-
>  hw/ppc/spapr_pci.c          |   8 +
>  hw/ppc/spapr_pci_nvlink2.c  | 419 ++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci-quirks.c        | 120 +++++++++++
>  hw/vfio/pci.c               |  14 ++
>  hw/vfio/trace-events        |   4 +
>  10 files changed, 637 insertions(+), 5 deletions(-)
>  create mode 100644 hw/ppc/spapr_pci_nvlink2.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index 1111b21..636e717 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>  # IBM PowerNV
>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> -obj-y += spapr_pci_vfio.o
> +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
>  endif
>  obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index b1ae4c0..706c304 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -194,6 +194,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
>  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>                                 struct vfio_region_info *info,
>                                 Error **errp);
> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
>  
>  void vfio_display_reset(VFIOPCIDevice *vdev);
>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index ab0e3a0..e791dd4 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -87,6 +87,9 @@ struct sPAPRPHBState {
>      uint32_t mig_liobn;
>      hwaddr mig_mem_win_addr, mig_mem_win_size;
>      hwaddr mig_io_win_addr, mig_io_win_size;
> +    hwaddr nv2_gpa_win_addr;
> +    hwaddr nv2_atsd_win_addr;
> +    struct spapr_phb_pci_nvgpu_config *nvgpus;
>  };
>  
>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> @@ -105,6 +108,23 @@ struct sPAPRPHBState {
>  
>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>  
> +#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
> +#define SPAPR_PCI_NV2RAM64_WIN_SIZE  0x10000000000ULL /* 1 TiB for all 6xGPUs */

The comments and values below suggest that it is 1TiB for each GPU,
rather than 1TiB shared by all 6.  Which is it?

> +
> +/* Max number of these GPUs per a physical box */
> +#define NVGPU_MAX_NUM                6

Is there any possibility later hardware revisions could increase this?
If so we should probably leave some extra room in the address space.

> +/*
> + * One NVLink bridge provides one ATSD register so it should be 18.
> + * In practice though, since we allow only one IOMMU group per vPHB and one
> + * group equals one NPU2, which has at most 6 NVLink bridges, 6 is enough.
> + */
> +#define NVGPU_MAX_ATSD               6
> +
> +#define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
> +                                      SPAPR_PCI_NV2RAM64_WIN_SIZE * \
> +                                      NVGPU_MAX_NUM)
> +#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_ATSD * 0x10000)

What's the significance of the 64 kiB constant here?  Should it be a
symbolic name, or speleed "64 * kiB".

> +
>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> @@ -135,6 +155,11 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
>  int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
>  int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
>  void spapr_phb_vfio_reset(DeviceState *qdev);
> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb);
> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off);
> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);
> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
> +                                        sPAPRPHBState *sphb);
>  #else
>  static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)
>  {
> @@ -161,6 +186,22 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
>  static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>  {
>  }
> +static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
> +{
> +}
> +static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,
> +                                               int bus_off)
> +{
> +}
> +static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,
> +                                                   void *fdt)
> +{
> +}
> +static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
> +                                                      int offset,
> +                                                      sPAPRPHBState *sphb)
> +{
> +}

I'm guessing some of these should never get called on systems without
NVLink2, in which case they should probably have a
g_assert_not_reached() in there.

>  #endif
>  
>  void spapr_phb_dma_reset(sPAPRPHBState *sphb);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 358bb38..9acf867 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -113,7 +113,8 @@ struct sPAPRMachineClass {
>      void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
>                            uint64_t *buid, hwaddr *pio, 
>                            hwaddr *mmio32, hwaddr *mmio64,
> -                          unsigned n_dma, uint32_t *liobns, Error **errp);
> +                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
> +                          hwaddr *nv2atsd, Error **errp);
>      sPAPRResizeHPT resize_hpt_default;
>      sPAPRCapabilities default_caps;
>      sPAPRIrq *irq;
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 74c9b07..fda6e7e 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3929,7 +3929,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>      smc->phb_placement(spapr, sphb->index,
>                         &sphb->buid, &sphb->io_win_addr,
>                         &sphb->mem_win_addr, &sphb->mem64_win_addr,
> -                       windows_supported, sphb->dma_liobn, errp);
> +                       windows_supported, sphb->dma_liobn,
> +                       &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
> +                       errp);
>  }
>  
>  static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> @@ -4129,7 +4131,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
>  static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>                                  uint64_t *buid, hwaddr *pio,
>                                  hwaddr *mmio32, hwaddr *mmio64,
> -                                unsigned n_dma, uint32_t *liobns, Error **errp)
> +                                unsigned n_dma, uint32_t *liobns,
> +                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>  {
>      /*
>       * New-style PHB window placement.
> @@ -4174,6 +4177,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>      *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
>      *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
>      *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
> +

This doesn't look right.  SPAPR_PCI_NV2ATSD_WIN_BASE appears to be
defined such that there slots for NVGPU_MAX_NUM gpa "slots" of size
SPAPR_PCI_NV2RAM64_WIN_SIZE before we get to the ATSD base.

> +    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;

But this implies you need a "slot" for every possible PHB index, which
is rather more than NVGPU_MAX_NUM.
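
To spell out the layout as I read it (my arithmetic, not from the patch,
and assuming SPAPR_PCI_LIMIT is the 64 TiB point mentioned in 4/6):

    nv2gpa  = 64 TiB + index * 1 TiB      (room reserved for 6 "slots")
    nv2atsd = 70 TiB + index * 384 KiB    (again room for 6 "slots")

so a PHB with index 6 already puts its GPU RAM window on top of the ATSD
area of index 0.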

> +    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
>  }
>  
>  static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
> @@ -4376,6 +4382,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
>  /*
>   * pseries-3.1
>   */
> +static void phb_placement_3_1(sPAPRMachineState *spapr, uint32_t index,
> +                              uint64_t *buid, hwaddr *pio,
> +                              hwaddr *mmio32, hwaddr *mmio64,
> +                              unsigned n_dma, uint32_t *liobns,
> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> +{
> +    spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
> +                        nv2gpa, nv2atsd, errp);
> +    *nv2gpa = 0;
> +    *nv2atsd = 0;
> +}
> +
>  static void spapr_machine_3_1_class_options(MachineClass *mc)
>  {
>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> @@ -4391,6 +4409,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
>      mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power8_v2.0");
>      smc->update_dt_enabled = false;
>      smc->dr_phb_enabled = false;
> +    smc->phb_placement = phb_placement_3_1;
>  }
>  
>  DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
> @@ -4522,7 +4541,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
>  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>                                uint64_t *buid, hwaddr *pio,
>                                hwaddr *mmio32, hwaddr *mmio64,
> -                              unsigned n_dma, uint32_t *liobns, Error **errp)
> +                              unsigned n_dma, uint32_t *liobns,
> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>  {
>      /* Legacy PHB placement for pseries-2.7 and earlier machine types */
>      const uint64_t base_buid = 0x800000020000000ULL;
> @@ -4566,6 +4586,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>       * fallback behaviour of automatically splitting a large "32-bit"
>       * window into contiguous 32-bit and 64-bit windows
>       */
> +
> +    *nv2gpa = 0;
> +    *nv2atsd = 0;
>  }
>  
>  static void spapr_machine_2_7_class_options(MachineClass *mc)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 06a5ffd..f076462 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1355,6 +1355,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
>      if (sphb->pcie_ecs && pci_is_express(dev)) {
>          _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
>      }
> +
> +    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
>  }
>  
>  /* create OF node for pci device and required OF DT properties */
> @@ -1878,6 +1880,7 @@ static void spapr_phb_reset(DeviceState *qdev)
>      sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
>  
>      spapr_phb_dma_reset(sphb);
> +    spapr_phb_nvgpu_setup(sphb);
>  
>      /* Reset the IOMMU state */
>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
> @@ -1910,6 +1913,8 @@ static Property spapr_phb_properties[] = {
>                       pre_2_8_migration, false),
>      DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,
>                       pcie_ecs, true),
> +    DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),
> +    DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -2282,6 +2287,9 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb, uint32_t intc_phandle, void *fdt,
>          return ret;
>      }
>  
> +    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off);
> +    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
> +
>      return 0;
>  }
>  
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> new file mode 100644
> index 0000000..965a6be
> --- /dev/null
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -0,0 +1,419 @@
> +/*
> + * QEMU sPAPR PCI for NVLink2 pass through
> + *
> + * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu-common.h"
> +#include "hw/pci/pci.h"
> +#include "hw/pci-host/spapr.h"
> +#include "qemu/error-report.h"
> +#include "hw/ppc/fdt.h"
> +#include "hw/pci/pci_bridge.h"
> +
> +#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
> +                                     (((phb)->index) << 16) | ((pdev)->devfn))
> +#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
> +                                     (((phb)->index) << 16))
> +/* NVLink2 wants a separate NUMA node for its RAM */
> +#define GPURAM_ASSOCIATIVITY(phb, n) (255 - ((phb)->index * 3 + (n)))
> +#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
> +                                     ((gn) << 4) | (nn))
> +
> +/* Max number of NVLinks per GPU in any physical box */
> +#define NVGPU_MAX_LINKS              3
> +
> +struct spapr_phb_pci_nvgpu_config {
> +    uint64_t nv2_ram_current;
> +    uint64_t nv2_atsd_current;
> +    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
> +    struct spapr_phb_pci_nvgpu_slot {
> +        uint64_t tgt;
> +        uint64_t gpa;
> +        PCIDevice *gpdev;
> +        int linknum;
> +        struct {
> +            uint64_t atsd_gpa;
> +            PCIDevice *npdev;
> +            uint32_t link_speed;
> +        } links[NVGPU_MAX_LINKS];
> +    } slots[NVGPU_MAX_NUM];
> +};
> +
> +static int spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
> +                                    uint64_t tgt)
> +{
> +    int i;
> +
> +    /* Search for partially collected "slot" */
> +    for (i = 0; i < nvgpus->num; ++i) {
> +        if (nvgpus->slots[i].tgt == tgt) {
> +            return i;
> +        }
> +    }
> +
> +    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
> +        warn_report("Found too many NVLink bridges per GPU");
> +        return -1;

This is within qemu so it would be better to use the qemu error API
than returning an error code.

> +    }
> +
> +    i = nvgpus->num;
> +    nvgpus->slots[i].tgt = tgt;
> +    ++nvgpus->num;
> +
> +    return i;

Might be nicer to return a pointer to the slot structure.
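
Something like this perhaps (untested sketch combining both suggestions -
the error API plus returning a slot pointer; names and details to taste):

    static struct spapr_phb_pci_nvgpu_slot *
    spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
                             uint64_t tgt, Error **errp)
    {
        int i;

        /* Search for a partially collected "slot" first */
        for (i = 0; i < nvgpus->num; ++i) {
            if (nvgpus->slots[i].tgt == tgt) {
                return &nvgpus->slots[i];
            }
        }

        if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
            error_setg(errp, "Found too many NVLink bridges per GPU");
            return NULL;
        }

        nvgpus->slots[nvgpus->num].tgt = tgt;
        return &nvgpus->slots[nvgpus->num++];
    }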

> +}
> +
> +static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
> +                                    PCIDevice *pdev, uint64_t tgt,
> +                                    MemoryRegion *mr)
> +{
> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt);
> +
> +    if (i < 0) {
> +        return;
> +    }
> +    g_assert(!nvgpus->slots[i].gpdev);
> +    nvgpus->slots[i].gpdev = pdev;
> +
> +    nvgpus->slots[i].gpa = nvgpus->nv2_ram_current;
> +    nvgpus->nv2_ram_current += memory_region_size(mr);
> +}
> +
> +static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
> +                                    PCIDevice *pdev, uint64_t tgt,
> +                                    MemoryRegion *mr)
> +{
> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt), j;
> +    struct spapr_phb_pci_nvgpu_slot *nvslot;
> +
> +    if (i < 0) {
> +        return;
> +    }
> +
> +    nvslot = &nvgpus->slots[i];
> +    j = nvslot->linknum;
> +    if (j == ARRAY_SIZE(nvslot->links)) {
> +        warn_report("Found too many NVLink2 bridges");
> +        return;
> +    }
> +    ++nvslot->linknum;
> +
> +    g_assert(!nvslot->links[j].npdev);
> +    nvslot->links[j].npdev = pdev;
> +    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
> +    nvgpus->nv2_atsd_current += memory_region_size(mr);
> +    nvslot->links[j].link_speed =
> +        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
> +}
> +
> +static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
> +                                        void *opaque)
> +{
> +    PCIBus *sec_bus;
> +    Object *po = OBJECT(pdev);
> +    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
> +
> +    if (tgt) {
> +        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
> +        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
> +                                                  NULL);
> +
> +        if (mr_gpu) {
> +            spapr_pci_collect_nvgpu(opaque, pdev, tgt, MEMORY_REGION(mr_gpu));
> +        } else if (mr_npu) {
> +            spapr_pci_collect_nvnpu(opaque, pdev, tgt, MEMORY_REGION(mr_npu));
> +        } else {
> +            warn_report("Unexpected device with \"nvlink2-tgt\"");

IIUC this would have to be a code error, so should be an assert() not
a warning.

> +        }
> +    }
> +    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
> +         PCI_HEADER_TYPE_BRIDGE)) {
> +        return;
> +    }
> +
> +    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
> +    if (!sec_bus) {
> +        return;
> +    }
> +
> +    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
> +                        spapr_phb_pci_collect_nvgpu, opaque);
> +}
> +
> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
> +{
> +    int i, j, valid_gpu_num;
> +
> +    /* If there are existing NVLink2 MRs, unmap those before recreating */
> +    if (sphb->nvgpus) {
> +        for (i = 0; i < sphb->nvgpus->num; ++i) {
> +            struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +            Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> +                                                        "nvlink2-mr[0]", NULL);
> +
> +            if (nv_mrobj) {
> +                memory_region_del_subregion(get_system_memory(),
> +                                            MEMORY_REGION(nv_mrobj));
> +            }
> +            for (j = 0; j < nvslot->linknum; ++j) {
> +                PCIDevice *npdev = nvslot->links[j].npdev;
> +                Object *atsd_mrobj;
> +                atsd_mrobj = object_property_get_link(OBJECT(npdev),
> +                                                      "nvlink2-atsd-mr[0]",
> +                                                      NULL);
> +                if (atsd_mrobj) {
> +                    memory_region_del_subregion(get_system_memory(),
> +                                                MEMORY_REGION(atsd_mrobj));
> +                }
> +            }
> +        }
> +        g_free(sphb->nvgpus);

Probably worth collecting the above into a nvgpu_free() helper -
chances are you'll want it on cleanup paths as well.

> +        sphb->nvgpus = NULL;
> +    }
> +
> +    /* Search for GPUs and NPUs */
> +    if (sphb->nv2_gpa_win_addr && sphb->nv2_atsd_win_addr) {
> +        PCIBus *bus = PCI_HOST_BRIDGE(sphb)->bus;
> +
> +        sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
> +        sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
> +        sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
> +
> +        pci_for_each_device(bus, pci_bus_num(bus),
> +                            spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
> +    }
> +
> +    /* Add found GPU RAM and ATSD MRs if found */
> +    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
> +        Object *nvmrobj;
> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +
> +        if (!nvslot->gpdev) {
> +            continue;
> +        }
> +        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> +                                           "nvlink2-mr[0]", NULL);
> +        /* ATSD is pointless without GPU RAM MR so skip those */
> +        if (!nvmrobj) {
> +            continue;
> +        }
> +
> +        ++valid_gpu_num;
> +        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
> +                                    MEMORY_REGION(nvmrobj));
> +
> +        for (j = 0; j < nvslot->linknum; ++j) {
> +            Object *atsdmrobj;
> +
> +            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
> +                                                 "nvlink2-atsd-mr[0]",
> +                                                 NULL);
> +            if (!atsdmrobj) {
> +                continue;
> +            }
> +            memory_region_add_subregion(get_system_memory(),
> +                                        nvslot->links[j].atsd_gpa,
> +                                        MEMORY_REGION(atsdmrobj));
> +        }
> +    }
> +
> +    if (!valid_gpu_num) {
> +        /* We did not find any interesting GPU */
> +        g_free(sphb->nvgpus);
> +        sphb->nvgpus = NULL;
> +    }
> +}
> +
> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off)
> +{
> +    int i, j, atsdnum = 0;
> +    uint64_t atsd[8]; /* The existing limitation of known guests */
> +
> +    if (!sphb->nvgpus) {
> +        return;
> +    }
> +
> +    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +
> +        if (!nvslot->gpdev) {
> +            continue;
> +        }
> +        for (j = 0; j < nvslot->linknum; ++j) {
> +            if (!nvslot->links[j].atsd_gpa) {
> +                continue;
> +            }
> +
> +            if (atsdnum == ARRAY_SIZE(atsd)) {
> +                warn_report("Only %ld ATSD registers allowed",
> +                            ARRAY_SIZE(atsd));

Probably should be an error not a warning.

> +                break;
> +            }
> +            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
> +            ++atsdnum;
> +        }
> +    }
> +
> +    if (!atsdnum) {
> +        warn_report("No ATSD registers found");
> +    } else if (!spapr_phb_eeh_available(sphb)) {
> +        /*
> +         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
> +         * which we do not emulate as a separate device. Instead we put
> +         * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not
> +         * put GPUs from different IOMMU groups to the same vPHB to ensure
> +         * that the guest will use ATSDs from the corresponding NPU.
> +         */
> +        warn_report("ATSD requires separate vPHB per GPU IOMMU group");
> +    } else {
> +        _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd",
> +                          atsd, atsdnum * sizeof(atsd[0]))));
> +    }
> +}
> +
> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)
> +{
> +    int i, j, linkidx, npuoff;
> +    char *npuname;
> +
> +    if (!sphb->nvgpus) {
> +        return;
> +    }
> +
> +    npuname = g_strdup_printf("npuphb%d", sphb->index);
> +    npuoff = fdt_add_subnode(fdt, 0, npuname);
> +    _FDT(npuoff);
> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
> +    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
> +    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
> +    g_free(npuname);
> +
> +    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
> +        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
> +            char *linkname = g_strdup_printf("link@%d", linkidx);
> +            int off = fdt_add_subnode(fdt, npuoff, linkname);
> +
> +            _FDT(off);
> +            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx)));
> */

Are the indices you're using for 'reg' and the unit name arbitrary?
If so it's generally best to base them on some static property of the
device, rather than just allocating sequentially.


> +            _FDT((fdt_setprop_string(fdt, off, "compatible",
> +                                     "ibm,npu-link")));
> +            _FDT((fdt_setprop_cell(fdt, off, "phandle",
> +                                   PHANDLE_NVLINK(sphb, i, j))));
> +            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));

Why do you need the index here as well as in reg?

> +            g_free(linkname);
> +            ++linkidx;
> +        }
> +    }
> +
> +    /* Add memory nodes for GPU RAM and mark them unusable */
> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> +                                                    "nvlink2-mr[0]", NULL);
> +        uint32_t at = cpu_to_be32(GPURAM_ASSOCIATIVITY(sphb, i));
> +        uint32_t associativity[] = { cpu_to_be32(0x4), at, at, at, at };
> +        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
> +        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
> +        char *mem_name = g_strdup_printf("memory@%lx", nvslot->gpa);
> +        int off = fdt_add_subnode(fdt, 0, mem_name);
> +
> +        _FDT(off);
> +        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
> +        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
> +        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
> +                          sizeof(associativity))));
> +
> +        _FDT((fdt_setprop_string(fdt, off, "compatible",
> +                                 "ibm,coherent-device-memory")));
> +
> +        mem_reg[1] = cpu_to_be64(0);
> +        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
> +                          sizeof(mem_reg))));
> +        _FDT((fdt_setprop_cell(fdt, off, "phandle",
> +                               PHANDLE_GPURAM(sphb, i))));
> +        g_free(mem_name);
> +    }
> +
> +}
> +
> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
> +                                        sPAPRPHBState *sphb)
> +{
> +    int i, j;
> +
> +    if (!sphb->nvgpus) {
> +        return;
> +    }
> +
> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +
> +        /* Skip "slot" without attached GPU */

IIUC a "slot" should always have at least one GPU.  You need to handle
the case of an uninitialized GPU in the "collect" functions because you
don't know if you'll discover the GPU or an NPU first.  But here not
having a GPU should be an error, shouldn't it?

> +        if (!nvslot->gpdev) {
> +            continue;
> +        }
> +        if (dev == nvslot->gpdev) {
> +            uint32_t npus[nvslot->linknum];
> +
> +            for (j = 0; j < nvslot->linknum; ++j) {
> +                PCIDevice *npdev = nvslot->links[j].npdev;
> +
> +                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
> +            }
> +            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
> +                             j * sizeof(npus[0])));
> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
> +                                   PHANDLE_PCIDEV(sphb, dev))));
> +            continue;
> +        }
> +
> +        for (j = 0; j < nvslot->linknum; ++j) {
> +            if (dev != nvslot->links[j].npdev) {
> +                continue;
> +            }
> +
> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
> +                                   PHANDLE_PCIDEV(sphb, dev))));
> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
> +                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
> +            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
> +                                   PHANDLE_NVLINK(sphb, i, j))));
> +            /*
> +             * If we ever want to emulate GPU RAM at the same location as on
> +             * the host - here is the encoding GPA->TGT:
> +             *
> +             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
> +             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
> +             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
> +             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
> +             */
> +            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
> +                                  PHANDLE_GPURAM(sphb, i)));
> +            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
> +                                 nvslot->tgt));
> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
> +                                  nvslot->links[j].link_speed));
> +        }
> +    }
> +}
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index 40a1200..15ec0b4 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -2180,3 +2180,123 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
>  
>      return 0;
>  }
> +
> +static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
> +                                     const char *name,
> +                                     void *opaque, Error **errp)
> +{
> +    uint64_t tgt = (uint64_t) opaque;
> +    visit_type_uint64(v, name, &tgt, errp);
> +}
> +
> +static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
> +                                                 const char *name,
> +                                                 void *opaque, Error **errp)
> +{
> +    uint32_t link_speed = (uint32_t)(uint64_t) opaque;
> +    visit_type_uint32(v, name, &link_speed, errp);
> +}
> +
> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
> +{
> +    int ret;
> +    void *p;
> +    struct vfio_region_info *nv2region = NULL;
> +    struct vfio_info_cap_header *hdr;
> +    MemoryRegion *nv2mr = g_malloc0(sizeof(*nv2mr));
> +
> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
> +                                   PCI_VENDOR_ID_NVIDIA,
> +                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
> +                                   &nv2region);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    p = mmap(NULL, nv2region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
> +             MAP_SHARED, vdev->vbasedev.fd, nv2region->offset);
> +
> +    if (!p) {
> +        return -errno;
> +    }
> +
> +    memory_region_init_ram_ptr(nv2mr, OBJECT(vdev), "nvlink2-mr",
> +                               nv2region->size, p);
> +
> +    hdr = vfio_get_region_info_cap(nv2region,
> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
> +    if (hdr) {
> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
> +
> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
> +                            (void *) cap->tgt, NULL);
> +        trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
> +                                              nv2region->size);
> +    }
> +    g_free(nv2region);
> +
> +    return 0;
> +}
> +
> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
> +{
> +    int ret;
> +    void *p;
> +    struct vfio_region_info *atsd_region = NULL;
> +    struct vfio_info_cap_header *hdr;
> +
> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
> +                                   PCI_VENDOR_ID_IBM,
> +                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
> +                                   &atsd_region);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    /* Some NVLink bridges come without assigned ATSD, skip MR part */
> +    if (atsd_region->size) {
> +        MemoryRegion *atsd_mr = g_malloc0(sizeof(*atsd_mr));
> +
> +        p = mmap(NULL, atsd_region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
> +                 MAP_SHARED, vdev->vbasedev.fd, atsd_region->offset);
> +
> +        if (!p) {
> +            return -errno;
> +        }
> +
> +        memory_region_init_ram_device_ptr(atsd_mr, OBJECT(vdev),
> +                                          "nvlink2-atsd-mr",
> +                                          atsd_region->size,
> +                                          p);
> +    }
> +
> +    hdr = vfio_get_region_info_cap(atsd_region,
> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
> +    if (hdr) {
> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
> +
> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
> +                            (void *) cap->tgt, NULL);
> +        trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, cap->tgt,
> +                                                  atsd_region->size);
> +    }
> +
> +    hdr = vfio_get_region_info_cap(atsd_region,
> +                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
> +    if (hdr) {
> +        struct vfio_region_info_cap_nvlink2_lnkspd *cap = (void *) hdr;
> +
> +        object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
> +                            vfio_pci_nvlink2_get_link_speed, NULL, NULL,
> +                            (void *) (uint64_t) cap->link_speed, NULL);
> +        trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
> +                                                  cap->link_speed);
> +    }
> +    g_free(atsd_region);
> +
> +    return 0;
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index dd12f36..07aa141 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3069,6 +3069,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          goto out_teardown;
>      }
>  
> +    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
> +        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
> +        if (ret && ret != -ENODEV) {
> +            error_report("Failed to setup NVIDIA V100 GPU RAM");
> +        }
> +    }
> +
> +    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
> +        ret = vfio_pci_nvlink2_init(vdev, errp);
> +        if (ret && ret != -ENODEV) {
> +            error_report("Failed to setup NVlink2 bridge");
> +        }
> +    }
> +
>      vfio_register_err_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index cf1e886..88841e9 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -87,6 +87,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
>  vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
>  vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
>  
> +vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> +vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> +vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
> +
>  # hw/vfio/common.c
>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window
  2019-02-27 23:59     ` Alexey Kardashevskiy
@ 2019-02-28  3:49       ` David Gibson
  2019-02-28  5:37         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: David Gibson @ 2019-02-28  3:49 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Greg Kurz, qemu-devel, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson, Sam Bobroff,
	Piotr Jaroszynski, qemu-ppc,
	Leonardo Augusto Guimarães Garcia


On Thu, Feb 28, 2019 at 10:59:56AM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 28/02/2019 01:33, Greg Kurz wrote:
> > On Wed, 27 Feb 2019 19:51:47 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > 
> >> On sPAPR vfio_listener_region_add() is called in 2 situations:
> >> 1. a new listener is registered from vfio_connect_container();
> >> 2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().
> >>
> >> In both cases vfio_listener_region_add() calls
> >> memory_region_iommu_replay() to notify newly registered IOMMU notifiers
> >> about existing mappings which is totally desirable for case 1.
> >>
> >> However for case 2 it is nothing but noop as the window has just been
> >> created and has no valid mappings so replaying those does not do anything.
> >> It is barely noticeable with usual guests but if the window happens to be
> >> really big, such no-op replay might take minutes and trigger RCU stall
> >> warnings in the guest.
> >>
> >> For example, a upcoming GPU RAM memory region mapped at 64TiB (right
> >> after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB
> >> which is (128<<40)/0x10000=2.147.483.648 TCEs to replay.
> >>
> >> This mitigates the problem by adding an "skipping_replay" flag to
> >> sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does
> >> exactly the same thing as the generic one except it returns early if
> >> @skipping_replay==true.
> >>
> >> When "ibm,create-pe-dma-window" is complete, the guest will map only
> >> required regions of the huge DMA window.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  include/hw/ppc/spapr.h  |  1 +
> >>  hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
> >>  hw/ppc/spapr_rtas_ddw.c |  7 +++++++
> >>  3 files changed, 39 insertions(+)
> >>
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index 86b0488..358bb38 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -727,6 +727,7 @@ struct sPAPRTCETable {
> >>      uint64_t *mig_table;
> >>      bool bypass;
> >>      bool need_vfio;
> >> +    bool skipping_replay;
> >>      int fd;
> >>      MemoryRegion root;
> >>      IOMMUMemoryRegion iommu;
> >> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >> index 37e98f9..8f23179 100644
> >> --- a/hw/ppc/spapr_iommu.c
> >> +++ b/hw/ppc/spapr_iommu.c
> >> @@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
> >>      return ret;
> >>  }
> >>  
> >> +static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >> +{
> >> +    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
> >> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
> >> +    hwaddr addr, granularity;
> >> +    IOMMUTLBEntry iotlb;
> >> +    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
> >> +
> >> +    if (tcet->skipping_replay) {
> >> +        return;
> >> +    }
> >> +
> >> +    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
> >> +
> >> +    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> >> +        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
> >> +        if (iotlb.perm != IOMMU_NONE) {
> >> +            n->notify(n, &iotlb);
> >> +        }
> >> +
> >> +        /*
> >> +         * if (2^64 - MR size) < granularity, it's possible to get an
> >> +         * infinite loop here.  This should catch such a wraparound.
> >> +         */
> >> +        if ((addr + granularity) < addr) {
> >> +            break;
> >> +        }
> >> +    }
> >> +}
> > 
> > It is a bit unfortunate to duplicate all that code. What about making
> > a memory_region_iommu_replay_generic() helper out of it and call it
> > from spapr_tce_replay() and memory_region_iommu_replay() ?
> 
> 
> I really do not want to mess with generic code to solve our local sPAPR
> problem, especially when there is a way not to do so.

Well, the thing is, I think we're actually the only user of the
current generic replay - everything else has more efficient structure
aware replay hooks AFAIK.  Which makes this hack even hackier.

> And as a next step, I was thinking of removing (i.e. making this replay
> a no-op) from QEMU later and do replay in KVM instead when an IOMMU
> group is attaching to KVM as this is the only case when we need replay
> and KVM has a lot better idea what TCEs are actually valid and can skip
> most of them. This is a bit bigger thing as it requires a KVM capability
> "KVM replays mappings" but when we get it, spapr_tce_replay() will
> become no-op.

That's a good idea longer term.

> > Apart from that, LGTM.
> 
> Well. It is a hack, I just do not have taste to tell how nasty it is
> :)

As an interim step until the kernel change, I think we can do a bit
better than this.  First, as Greg suggests we should have the
"generic" replay be a helper and have the spapr one call that with a
little in the way of extra checking.

Second, rather than having an explicit "skip_replay" flag, what we
really want here is to have the replay be a fast no-op if there are no
existing mappings rather than a slow no-op.  So instead I think we
should have a flag which records if any mappings have been made in the
region yet, initialized to false.  The new replay would do nothing if
it's still false.
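
To put that in code (a rough sketch only; the "any_mapping" field and the
memory_region_iommu_replay_generic() helper below are assumptions, not
something in the posted patch):

    /*
     * Assumes sPAPRTCETable grows a "bool any_mapping" flag which is set
     * the first time a valid TCE is written, and that the generic per-TCE
     * walk is factored into memory_region_iommu_replay_generic().
     */
    static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
    {
        sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);

        if (!tcet->any_mapping) {
            return; /* fast no-op: nothing has ever been mapped here */
        }
        memory_region_iommu_replay_generic(iommu_mr, n);
    }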

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window
  2019-02-28  3:49       ` David Gibson
@ 2019-02-28  5:37         ` Alexey Kardashevskiy
  2019-03-05  3:28           ` David Gibson
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-28  5:37 UTC (permalink / raw)
  To: David Gibson
  Cc: Greg Kurz, qemu-devel, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson, Sam Bobroff,
	Piotr Jaroszynski, qemu-ppc,
	Leonardo Augusto Guimarães Garcia



On 28/02/2019 14:49, David Gibson wrote:
> On Thu, Feb 28, 2019 at 10:59:56AM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 28/02/2019 01:33, Greg Kurz wrote:
>>> On Wed, 27 Feb 2019 19:51:47 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>
>>>> On sPAPR vfio_listener_region_add() is called in 2 situations:
>>>> 1. a new listener is registered from vfio_connect_container();
>>>> 2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().
>>>>
>>>> In both cases vfio_listener_region_add() calls
>>>> memory_region_iommu_replay() to notify newly registered IOMMU notifiers
>>>> about existing mappings which is totally desirable for case 1.
>>>>
>>>> However for case 2 it is nothing but noop as the window has just been
>>>> created and has no valid mappings so replaying those does not do anything.
>>>> It is barely noticeable with usual guests but if the window happens to be
>>>> really big, such no-op replay might take minutes and trigger RCU stall
>>>> warnings in the guest.
>>>>
>>>> For example, a upcoming GPU RAM memory region mapped at 64TiB (right
>>>> after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB
>>>> which is (128<<40)/0x10000=2.147.483.648 TCEs to replay.
>>>>
>>>> This mitigates the problem by adding an "skipping_replay" flag to
>>>> sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does
>>>> exactly the same thing as the generic one except it returns early if
>>>> @skipping_replay==true.
>>>>
>>>> When "ibm,create-pe-dma-window" is complete, the guest will map only
>>>> required regions of the huge DMA window.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>  include/hw/ppc/spapr.h  |  1 +
>>>>  hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
>>>>  hw/ppc/spapr_rtas_ddw.c |  7 +++++++
>>>>  3 files changed, 39 insertions(+)
>>>>
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index 86b0488..358bb38 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -727,6 +727,7 @@ struct sPAPRTCETable {
>>>>      uint64_t *mig_table;
>>>>      bool bypass;
>>>>      bool need_vfio;
>>>> +    bool skipping_replay;
>>>>      int fd;
>>>>      MemoryRegion root;
>>>>      IOMMUMemoryRegion iommu;
>>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>>>> index 37e98f9..8f23179 100644
>>>> --- a/hw/ppc/spapr_iommu.c
>>>> +++ b/hw/ppc/spapr_iommu.c
>>>> @@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
>>>>      return ret;
>>>>  }
>>>>  
>>>> +static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>>>> +{
>>>> +    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
>>>> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
>>>> +    hwaddr addr, granularity;
>>>> +    IOMMUTLBEntry iotlb;
>>>> +    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
>>>> +
>>>> +    if (tcet->skipping_replay) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
>>>> +
>>>> +    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>>>> +        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
>>>> +        if (iotlb.perm != IOMMU_NONE) {
>>>> +            n->notify(n, &iotlb);
>>>> +        }
>>>> +
>>>> +        /*
>>>> +         * if (2^64 - MR size) < granularity, it's possible to get an
>>>> +         * infinite loop here.  This should catch such a wraparound.
>>>> +         */
>>>> +        if ((addr + granularity) < addr) {
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +}
>>>
>>> It is a bit unfortunate to duplicate all that code. What about making
>>> a memory_region_iommu_replay_generic() helper out of it and call it
>>> from spapr_tce_replay() and memory_region_iommu_replay() ?
>>
>>
>> I really do not want to mess with generic code to solve our local sPAPR
>> problem, especially when there is a way not to do so.
> 
> Well, the thing is, I think we're actually the only user of the
> current generic replay - everything else has more efficient structure
> aware replay hooks AFAIK.  Which makes this hack even hackier.


If that is so, then we are better off removing that loop from
memory_region_iommu_replay() entirely rather than keeping it generic.


>> And as a next step, I was thinking of removing (i.e. making this replay
>> a no-op) from QEMU later and do replay in KVM instead when an IOMMU
>> group is attaching to KVM as this is the only case when we need replay
>> and KVM has a lot better idea what TCEs are actually valid and can skip
>> most of them. This is a bit bigger thing as it requires a KVM capability
>> "KVM replays mappings" but when we get it, spapr_tce_replay() will
>> become no-op.
> 
> That's a good idea longer term.
> 
>>> Apart from that, LGTM.
>>
>> Well. It is a hack, I just do not have taste to tell how nasty it is
>> :)
> 
> As an interim step until the kernel change, I think we can do a bit
> better than this.  First, as Greg suggests we should have the
> "generic" replay be a helper and have the spapr one call that with a
> little in the way of extra checking.
> 
> Second, rather than having an explicit "skip_replay" flag, what we
> really want here is to have the replay be a fast no-op if there are no
> existing mappings rather than a slow no-op.  So instead I think we
> should have a flag which records if any mappings have been made in the
> region yet, initialized to false.
> The new replay would do nothing if
> it's still false.

If QEMU controlled the mappings - sure. But it does not - KVM does it
instead via that fast path. So QEMU does not know if there are mappings
until it reads all TCEs from the mmap'ed KVM TCE table, which will fault in
all these pages.

We could implement some tricks, such as allowing reads (or an ioctl) on
that KVM TCE fd which could tell what is mapped and what is not in a
very condensed format (for example a bit per every 256MB of the guest
address space), or implement different behavior for RW vs readonly
mappings - the latter would fail if there is no backing page
allocated yet - and then QEMU could skip these regions when replaying.
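
Purely to illustrate the condensed-format idea (no such KVM interface
exists today; the bitmap layout and the replay_range() helper are made
up for the example):

    /*
     * Hypothetical: the kernel exposes a bitmap with one bit per 256 MiB
     * chunk of the DMA window; QEMU walks TCEs only in chunks marked mapped.
     */
    #define NV2_CHUNK_SIZE (256 * MiB)

    static void replay_mapped_chunks(IOMMUMemoryRegion *iommu_mr,
                                     IOMMUNotifier *n,
                                     const unsigned long *mapped_bitmap)
    {
        hwaddr size = memory_region_size(MEMORY_REGION(iommu_mr));
        hwaddr addr;

        for (addr = 0; addr < size; addr += NV2_CHUNK_SIZE) {
            if (!test_bit(addr / NV2_CHUNK_SIZE, mapped_bitmap)) {
                continue; /* nothing mapped in this chunk, skip it */
            }
            /* slow per-TCE walk, limited to this chunk */
            replay_range(iommu_mr, n, addr, NV2_CHUNK_SIZE);
        }
    }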



-- 
Alexey

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2
  2019-02-28  3:31   ` David Gibson
@ 2019-02-28  6:11     ` Alexey Kardashevskiy
  2019-03-05  1:47       ` David Gibson
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-02-28  6:11 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-ppc, Sam Bobroff, Piotr Jaroszynski,
	Leonardo Augusto Guimarães Garcia, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson



On 28/02/2019 14:31, David Gibson wrote:
> On Wed, Feb 27, 2019 at 07:51:49PM +1100, Alexey Kardashevskiy wrote:
>> NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
>> space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
>> implements special regions for such GPUs and emulates an NVLink bridge.
>> NVLink2-enabled POWER9 CPUs also provide address translation services
>> which includes an ATS shootdown (ATSD) register exported via the NVLink
>> bridge device.
>>
>> This adds a quirk to VFIO to map the GPU memory and create an MR;
>> the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
>> this to get the MR and map it to the system address space.
>> Another quirk does the same for ATSD.
>>
>> This adds additional steps to sPAPR PHB setup:
>>
>> 1. Search for specific GPUs and NPUs, collect findings in
>> sPAPRPHBState::nvgpus, manage system address space mappings;
>>
>> 2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
>> "memory-block", "link-speed" to advertise the NVLink2 function to
>> the guest;
>>
>> 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
>>
>> 4. Add new memory blocks (with extra "linux,memory-usable" to prevent
>> the guest OS from accessing the new memory until it is onlined) and
>> npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
>> uses it for link discovery.
>>
>> This allocates space for GPU RAM and ATSD like we do for MMIOs by
>> adding 2 new parameters to the phb_placement() hook. Older machine types
>> set these to zero.
>>
>> This puts new memory nodes in a separate NUMA node to replicate the host
>> system setup as the GPU driver relies on this.
>>
>> This adds requirement similar to EEH - one IOMMU group per vPHB.
>> The reason for this is that ATSD registers belong to a physical NPU
>> so they cannot invalidate translations on GPUs attached to another NPU.
>> It is guaranteed by the host platform as it does not mix NVLink bridges
>> or GPUs from different NPU in the same IOMMU group. If more than one
>> IOMMU group is detected on a vPHB, this disables ATSD support for that
>> vPHB and prints a warning.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v3:
>> * moved GPU RAM above PCI MMIO limit
>> * renamed QOM property to nvlink2-tgt
>> * moved nvlink2 code to its own file
>>
>> ---
>>
>> The example command line for redbud system:
>>
>> pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
>> -nodefaults \
>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>> -mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
>> -enable-kvm -m 384G \
>> -chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
>> -mon chardev=SOCKET0,mode=control \
>> -smp 80,sockets=1,threads=4 \
>> -netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
>> -device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
>> img/vdisk0.img \
>> -device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
>> -device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
>> -device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
>> -device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
>> -device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
>> -device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
>> -device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
>> -device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
>> -device spapr-pci-host-bridge,id=phb1,index=1 \
>> -device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
>> -device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
>> -device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
>> -device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
>> -device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
>> -device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
>> -device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
>> -device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
>> -machine pseries \
>> -L /home/aik/t/qemu-ppc64-bios/ -d guest_errors
>>
>> Note that QEMU attaches PCI devices to the last added vPHB so first
>> 8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and
>> 35:03:00.0..7:00:01.2 to the vPHB with id=phb1.
>> ---
>>  hw/ppc/Makefile.objs        |   2 +-
>>  hw/vfio/pci.h               |   2 +
>>  include/hw/pci-host/spapr.h |  41 ++++
>>  include/hw/ppc/spapr.h      |   3 +-
>>  hw/ppc/spapr.c              |  29 ++-
>>  hw/ppc/spapr_pci.c          |   8 +
>>  hw/ppc/spapr_pci_nvlink2.c  | 419 ++++++++++++++++++++++++++++++++++++
>>  hw/vfio/pci-quirks.c        | 120 +++++++++++
>>  hw/vfio/pci.c               |  14 ++
>>  hw/vfio/trace-events        |   4 +
>>  10 files changed, 637 insertions(+), 5 deletions(-)
>>  create mode 100644 hw/ppc/spapr_pci_nvlink2.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index 1111b21..636e717 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>>  # IBM PowerNV
>>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>> -obj-y += spapr_pci_vfio.o
>> +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
>>  endif
>>  obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>  # PowerPC 4xx boards
>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>> index b1ae4c0..706c304 100644
>> --- a/hw/vfio/pci.h
>> +++ b/hw/vfio/pci.h
>> @@ -194,6 +194,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
>>  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>>                                 struct vfio_region_info *info,
>>                                 Error **errp);
>> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
>> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
>>  
>>  void vfio_display_reset(VFIOPCIDevice *vdev);
>>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>> index ab0e3a0..e791dd4 100644
>> --- a/include/hw/pci-host/spapr.h
>> +++ b/include/hw/pci-host/spapr.h
>> @@ -87,6 +87,9 @@ struct sPAPRPHBState {
>>      uint32_t mig_liobn;
>>      hwaddr mig_mem_win_addr, mig_mem_win_size;
>>      hwaddr mig_io_win_addr, mig_io_win_size;
>> +    hwaddr nv2_gpa_win_addr;
>> +    hwaddr nv2_atsd_win_addr;
>> +    struct spapr_phb_pci_nvgpu_config *nvgpus;
>>  };
>>  
>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>> @@ -105,6 +108,23 @@ struct sPAPRPHBState {
>>  
>>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>>  
>> +#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
>> +#define SPAPR_PCI_NV2RAM64_WIN_SIZE  0x10000000000ULL /* 1 TiB for all 6xGPUs */
> 
> The comments and values below suggest that it is 1TiB for each GPU,
> rather than 1TiB shared by all 6.  Which is it?


1TiB for all of them within 1 vPHB. Not sure where it suggests 1TiB for
each GPU.

>> +
>> +/* Max number of these GPUs per a physical box */
>> +#define NVGPU_MAX_NUM                6
> 
> Is there any possibility later hardware revisions could increase this?
> If so we should probably leave some extra room in the address space.

A GPU RAM window is 256GiB (and only 32GiB is used), and 3 is the
maximum in one group so far. So 1TiB should be enough for quite some
time. Having more GPUs in a box is probably possible, but for now 6xGPU
requires water cooling while 4xGPU does not, so unless a new
generation of these GPUs comes out, the numbers won't change much.

I'll double SPAPR_PCI_NV2RAM64_WIN_SIZE.
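
(For the arithmetic: with a 256GiB window per GPU, the theoretical
maximum of NVGPU_MAX_NUM=6 GPUs behind one vPHB needs 1.5TiB, which does
not fit the current 1TiB window, so roughly

    #define SPAPR_PCI_NV2RAM64_WIN_SIZE  (2 * TiB) /* room for 6 x 256 GiB */

- assuming that is what the doubling is meant to cover.)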


>> +/*
>> + * One NVLink bridge provides one ATSD register so it should be 18.
>> + * In practice though since we allow only one group per vPHB which equals
>> + * to an NPU2 which has maximum 6 NVLink bridges.
>> + */
>> +#define NVGPU_MAX_ATSD               6
>> +
>> +#define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
>> +                                      SPAPR_PCI_NV2RAM64_WIN_SIZE * \
>> +                                      NVGPU_MAX_NUM)
>> +#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_ATSD * 0x10000)
> 
> What's the significance of the 64 kiB constant here?  Should it be a
> symbolic name, or speleed "64 * kiB".

Ok.


> 
>> +
>>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>>  {
>>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> @@ -135,6 +155,11 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
>>  int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
>>  int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
>>  void spapr_phb_vfio_reset(DeviceState *qdev);
>> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb);
>> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off);
>> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);
>> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
>> +                                        sPAPRPHBState *sphb);
>>  #else
>>  static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)
>>  {
>> @@ -161,6 +186,22 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
>>  static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>>  {
>>  }
>> +static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
>> +{
>> +}
>> +static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,
>> +                                               int bus_off)
>> +{
>> +}
>> +static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,
>> +                                                   void *fdt)
>> +{
>> +}
>> +static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
>> +                                                      int offset,
>> +                                                      sPAPRPHBState *sphb)
>> +{
>> +}
> 
> I'm guessing some of these should never get called on systems without
> NVLink2, in which case they should probably have a
> g_assert_not_reached() in there.

I guess if you compile QEMU for --target-list=ppc64-softmmu on Windows
(i.e. tcg + pseries + pci but no vfio), these will be called and then
crash, no?


> 
>>  #endif
>>  
>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb);
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 358bb38..9acf867 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -113,7 +113,8 @@ struct sPAPRMachineClass {
>>      void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
>>                            uint64_t *buid, hwaddr *pio, 
>>                            hwaddr *mmio32, hwaddr *mmio64,
>> -                          unsigned n_dma, uint32_t *liobns, Error **errp);
>> +                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
>> +                          hwaddr *nv2atsd, Error **errp);
>>      sPAPRResizeHPT resize_hpt_default;
>>      sPAPRCapabilities default_caps;
>>      sPAPRIrq *irq;
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 74c9b07..fda6e7e 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -3929,7 +3929,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>>      smc->phb_placement(spapr, sphb->index,
>>                         &sphb->buid, &sphb->io_win_addr,
>>                         &sphb->mem_win_addr, &sphb->mem64_win_addr,
>> -                       windows_supported, sphb->dma_liobn, errp);
>> +                       windows_supported, sphb->dma_liobn,
>> +                       &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
>> +                       errp);
>>  }
>>  
>>  static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>> @@ -4129,7 +4131,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
>>  static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>>                                  uint64_t *buid, hwaddr *pio,
>>                                  hwaddr *mmio32, hwaddr *mmio64,
>> -                                unsigned n_dma, uint32_t *liobns, Error **errp)
>> +                                unsigned n_dma, uint32_t *liobns,
>> +                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>>  {
>>      /*
>>       * New-style PHB window placement.
>> @@ -4174,6 +4177,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>>      *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
>>      *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
>>      *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
>> +
> 
> This doesn't look right.  SPAPR_PCI_NV2ATSD_WIN_BASE appears to be
> defined such that there slots for NVGPU_MAX_NUM gpa "slots" of size
> SPAPR_PCI_NV2RAM64_WIN_SIZE before we get to the ATSD base.
> 
>> +    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
> 
> But this implies you need a "slot" for every possible PHB index, which
> is rather more than NVGPU_MAX_NUM.
> 
>> +    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;


Ah right :( These should then go above 128TiB, I guess, as I do not really
want them to appear inside a huge DMA window.



>>  }
>>  
>>  static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
>> @@ -4376,6 +4382,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
>>  /*
>>   * pseries-3.1
>>   */
>> +static void phb_placement_3_1(sPAPRMachineState *spapr, uint32_t index,
>> +                              uint64_t *buid, hwaddr *pio,
>> +                              hwaddr *mmio32, hwaddr *mmio64,
>> +                              unsigned n_dma, uint32_t *liobns,
>> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>> +{
>> +    spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
>> +                        nv2gpa, nv2atsd, errp);
>> +    *nv2gpa = 0;
>> +    *nv2atsd = 0;
>> +}
>> +
>>  static void spapr_machine_3_1_class_options(MachineClass *mc)
>>  {
>>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
>> @@ -4391,6 +4409,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
>>      mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power8_v2.0");
>>      smc->update_dt_enabled = false;
>>      smc->dr_phb_enabled = false;
>> +    smc->phb_placement = phb_placement_3_1;
>>  }
>>  
>>  DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
>> @@ -4522,7 +4541,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
>>  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>>                                uint64_t *buid, hwaddr *pio,
>>                                hwaddr *mmio32, hwaddr *mmio64,
>> -                              unsigned n_dma, uint32_t *liobns, Error **errp)
>> +                              unsigned n_dma, uint32_t *liobns,
>> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>>  {
>>      /* Legacy PHB placement for pseries-2.7 and earlier machine types */
>>      const uint64_t base_buid = 0x800000020000000ULL;
>> @@ -4566,6 +4586,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>>       * fallback behaviour of automatically splitting a large "32-bit"
>>       * window into contiguous 32-bit and 64-bit windows
>>       */
>> +
>> +    *nv2gpa = 0;
>> +    *nv2atsd = 0;
>>  }
>>  
>>  static void spapr_machine_2_7_class_options(MachineClass *mc)
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 06a5ffd..f076462 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -1355,6 +1355,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
>>      if (sphb->pcie_ecs && pci_is_express(dev)) {
>>          _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
>>      }
>> +
>> +    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
>>  }
>>  
>>  /* create OF node for pci device and required OF DT properties */
>> @@ -1878,6 +1880,7 @@ static void spapr_phb_reset(DeviceState *qdev)
>>      sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
>>  
>>      spapr_phb_dma_reset(sphb);
>> +    spapr_phb_nvgpu_setup(sphb);
>>  
>>      /* Reset the IOMMU state */
>>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
>> @@ -1910,6 +1913,8 @@ static Property spapr_phb_properties[] = {
>>                       pre_2_8_migration, false),
>>      DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,
>>                       pcie_ecs, true),
>> +    DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),
>> +    DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> @@ -2282,6 +2287,9 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb, uint32_t intc_phandle, void *fdt,
>>          return ret;
>>      }
>>  
>> +    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off);
>> +    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
>> +
>>      return 0;
>>  }
>>  
>> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
>> new file mode 100644
>> index 0000000..965a6be
>> --- /dev/null
>> +++ b/hw/ppc/spapr_pci_nvlink2.c
>> @@ -0,0 +1,419 @@
>> +/*
>> + * QEMU sPAPR PCI for NVLink2 pass through
>> + *
>> + * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> + * THE SOFTWARE.
>> + */
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "qemu-common.h"
>> +#include "hw/pci/pci.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "qemu/error-report.h"
>> +#include "hw/ppc/fdt.h"
>> +#include "hw/pci/pci_bridge.h"
>> +
>> +#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
>> +                                     (((phb)->index) << 16) | ((pdev)->devfn))
>> +#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
>> +                                     (((phb)->index) << 16))
>> +/* NVLink2 wants a separate NUMA node for its RAM */
>> +#define GPURAM_ASSOCIATIVITY(phb, n) (255 - ((phb)->index * 3 + (n)))
>> +#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
>> +                                     ((gn) << 4) | (nn))
>> +
>> +/* Max number of NVLinks per GPU in any physical box */
>> +#define NVGPU_MAX_LINKS              3
>> +
>> +struct spapr_phb_pci_nvgpu_config {
>> +    uint64_t nv2_ram_current;
>> +    uint64_t nv2_atsd_current;
>> +    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
>> +    struct spapr_phb_pci_nvgpu_slot {
>> +        uint64_t tgt;
>> +        uint64_t gpa;
>> +        PCIDevice *gpdev;
>> +        int linknum;
>> +        struct {
>> +            uint64_t atsd_gpa;
>> +            PCIDevice *npdev;
>> +            uint32_t link_speed;
>> +        } links[NVGPU_MAX_LINKS];
>> +    } slots[NVGPU_MAX_NUM];
>> +};
>> +
>> +static int spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
>> +                                    uint64_t tgt)
>> +{
>> +    int i;
>> +
>> +    /* Search for partially collected "slot" */
>> +    for (i = 0; i < nvgpus->num; ++i) {
>> +        if (nvgpus->slots[i].tgt == tgt) {
>> +            return i;
>> +        }
>> +    }
>> +
>> +    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
>> +        warn_report("Found too many NVLink bridges per GPU");
>> +        return -1;
> 
> This is within qemu so it would be better to use the qemu error API
> than returning an error code.

You mean returning Error**? Oh. Ok.

> 
>> +    }
>> +
>> +    i = nvgpus->num;
>> +    nvgpus->slots[i].tgt = tgt;
>> +    ++nvgpus->num;
>> +
>> +    return i;
> 
> Might be nicer to return a pointer to the slot structure.


This can work.
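
Something along these lines, I suppose (same logic, just returning the
slot, with NULL on overflow):

    static struct spapr_phb_pci_nvgpu_slot *
    spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
                             uint64_t tgt)
    {
        int i;

        /* Search for a partially collected "slot" */
        for (i = 0; i < nvgpus->num; ++i) {
            if (nvgpus->slots[i].tgt == tgt) {
                return &nvgpus->slots[i];
            }
        }

        if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
            warn_report("Found too many NVLink bridges per GPU");
            return NULL;
        }

        i = nvgpus->num++;
        nvgpus->slots[i].tgt = tgt;
        return &nvgpus->slots[i];
    }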


> 
>> +}
>> +
>> +static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
>> +                                    PCIDevice *pdev, uint64_t tgt,
>> +                                    MemoryRegion *mr)
>> +{
>> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt);
>> +
>> +    if (i < 0) {
>> +        return;
>> +    }
>> +    g_assert(!nvgpus->slots[i].gpdev);
>> +    nvgpus->slots[i].gpdev = pdev;
>> +
>> +    nvgpus->slots[i].gpa = nvgpus->nv2_ram_current;
>> +    nvgpus->nv2_ram_current += memory_region_size(mr);
>> +}
>> +
>> +static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
>> +                                    PCIDevice *pdev, uint64_t tgt,
>> +                                    MemoryRegion *mr)
>> +{
>> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt), j;
>> +    struct spapr_phb_pci_nvgpu_slot *nvslot;
>> +
>> +    if (i < 0) {
>> +        return;
>> +    }
>> +
>> +    nvslot = &nvgpus->slots[i];
>> +    j = nvslot->linknum;
>> +    if (j == ARRAY_SIZE(nvslot->links)) {
>> +        warn_report("Found too many NVLink2 bridges");
>> +        return;
>> +    }
>> +    ++nvslot->linknum;
>> +
>> +    g_assert(!nvslot->links[j].npdev);
>> +    nvslot->links[j].npdev = pdev;
>> +    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
>> +    nvgpus->nv2_atsd_current += memory_region_size(mr);
>> +    nvslot->links[j].link_speed =
>> +        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
>> +}
>> +
>> +static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
>> +                                        void *opaque)
>> +{
>> +    PCIBus *sec_bus;
>> +    Object *po = OBJECT(pdev);
>> +    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
>> +
>> +    if (tgt) {
>> +        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
>> +        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
>> +                                                  NULL);
>> +
>> +        if (mr_gpu) {
>> +            spapr_pci_collect_nvgpu(opaque, pdev, tgt, MEMORY_REGION(mr_gpu));
>> +        } else if (mr_npu) {
>> +            spapr_pci_collect_nvnpu(opaque, pdev, tgt, MEMORY_REGION(mr_npu));
>> +        } else {
>> +            warn_report("Unexpected device with \"nvlink2-tgt\"");
> 
> IIUC this would have to be a code error, so should be an assert() not
> a warning.


Ok.

> 
>> +        }
>> +    }
>> +    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
>> +         PCI_HEADER_TYPE_BRIDGE)) {
>> +        return;
>> +    }
>> +
>> +    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
>> +    if (!sec_bus) {
>> +        return;
>> +    }
>> +
>> +    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
>> +                        spapr_phb_pci_collect_nvgpu, opaque);
>> +}
>> +
>> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
>> +{
>> +    int i, j, valid_gpu_num;
>> +
>> +    /* If there are existing NVLink2 MRs, unmap those before recreating */
>> +    if (sphb->nvgpus) {
>> +        for (i = 0; i < sphb->nvgpus->num; ++i) {
>> +            struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>> +            Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
>> +                                                        "nvlink2-mr[0]", NULL);
>> +
>> +            if (nv_mrobj) {
>> +                memory_region_del_subregion(get_system_memory(),
>> +                                            MEMORY_REGION(nv_mrobj));
>> +            }
>> +            for (j = 0; j < nvslot->linknum; ++j) {
>> +                PCIDevice *npdev = nvslot->links[j].npdev;
>> +                Object *atsd_mrobj;
>> +                atsd_mrobj = object_property_get_link(OBJECT(npdev),
>> +                                                      "nvlink2-atsd-mr[0]",
>> +                                                      NULL);
>> +                if (atsd_mrobj) {
>> +                    memory_region_del_subregion(get_system_memory(),
>> +                                                MEMORY_REGION(atsd_mrobj));
>> +                }
>> +            }
>> +        }
>> +        g_free(sphb->nvgpus);
> 
> Probably worth collecting the above into a nvgpu_free() helper -
> chances are you'll want it on cleanup paths as well.

The only other cleanup path is below, and it only executes if no MR was
added, so for now a helper does not seem useful.


>> +        sphb->nvgpus = NULL;
>> +    }
>> +
>> +    /* Search for GPUs and NPUs */
>> +    if (sphb->nv2_gpa_win_addr && sphb->nv2_atsd_win_addr) {
>> +        PCIBus *bus = PCI_HOST_BRIDGE(sphb)->bus;
>> +
>> +        sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
>> +        sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
>> +        sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
>> +
>> +        pci_for_each_device(bus, pci_bus_num(bus),
>> +                            spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
>> +    }
>> +
>> +    /* Add found GPU RAM and ATSD MRs if found */
>> +    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
>> +        Object *nvmrobj;
>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>> +
>> +        if (!nvslot->gpdev) {
>> +            continue;
>> +        }
>> +        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
>> +                                           "nvlink2-mr[0]", NULL);
>> +        /* ATSD is pointless without GPU RAM MR so skip those */
>> +        if (!nvmrobj) {
>> +            continue;
>> +        }
>> +
>> +        ++valid_gpu_num;
>> +        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
>> +                                    MEMORY_REGION(nvmrobj));
>> +
>> +        for (j = 0; j < nvslot->linknum; ++j) {
>> +            Object *atsdmrobj;
>> +
>> +            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
>> +                                                 "nvlink2-atsd-mr[0]",
>> +                                                 NULL);
>> +            if (!atsdmrobj) {
>> +                continue;
>> +            }
>> +            memory_region_add_subregion(get_system_memory(),
>> +                                        nvslot->links[j].atsd_gpa,
>> +                                        MEMORY_REGION(atsdmrobj));
>> +        }
>> +    }
>> +
>> +    if (!valid_gpu_num) {
>> +        /* We did not find any interesting GPU */
>> +        g_free(sphb->nvgpus);
>> +        sphb->nvgpus = NULL;
>> +    }
>> +}
>> +
>> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off)
>> +{
>> +    int i, j, atsdnum = 0;
>> +    uint64_t atsd[8]; /* The existing limitation of known guests */
>> +
>> +    if (!sphb->nvgpus) {
>> +        return;
>> +    }
>> +
>> +    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>> +
>> +        if (!nvslot->gpdev) {
>> +            continue;
>> +        }
>> +        for (j = 0; j < nvslot->linknum; ++j) {
>> +            if (!nvslot->links[j].atsd_gpa) {
>> +                continue;
>> +            }
>> +
>> +            if (atsdnum == ARRAY_SIZE(atsd)) {
>> +                warn_report("Only %ld ATSD registers allowed",
>> +                            ARRAY_SIZE(atsd));
> 
> Probably should be an error not a warning.

We can still continue though, it is not fatal. These things come from
skiboot (which we control), but skiboot could either compose the
properties itself or use whatever hostboot provided (this does not happen
now though), and I would not like to be blocked by hostboot if/when that happens.



>> +                break;
>> +            }
>> +            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
>> +            ++atsdnum;
>> +        }
>> +    }
>> +
>> +    if (!atsdnum) {
>> +        warn_report("No ATSD registers found");
>> +    } else if (!spapr_phb_eeh_available(sphb)) {
>> +        /*
>> +         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
>> +         * which we do not emulate as a separate device. Instead we put
>> +         * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not
>> +         * put GPUs from different IOMMU groups to the same vPHB to ensure
>> +         * that the guest will use ATSDs from the corresponding NPU.
>> +         */
>> +        warn_report("ATSD requires separate vPHB per GPU IOMMU group");
>> +    } else {
>> +        _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd",
>> +                          atsd, atsdnum * sizeof(atsd[0]))));
>> +    }
>> +}
>> +
>> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)
>> +{
>> +    int i, j, linkidx, npuoff;
>> +    char *npuname;
>> +
>> +    if (!sphb->nvgpus) {
>> +        return;
>> +    }
>> +
>> +    npuname = g_strdup_printf("npuphb%d", sphb->index);
>> +    npuoff = fdt_add_subnode(fdt, 0, npuname);
>> +    _FDT(npuoff);
>> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
>> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
>> +    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
>> +    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
>> +    g_free(npuname);
>> +
>> +    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
>> +        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
>> +            char *linkname = g_strdup_printf("link@%d", linkidx);
>> +            int off = fdt_add_subnode(fdt, npuoff, linkname);
>> +
>> +            _FDT(off);
>> +            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx)));
>> */
> 
> Are the indices you're using for 'reg' and the unit name arbitrary?
> If so it's generally best to base them on some static property of the
> device, rather than just allocating sequentially.

On the host, "reg" is the link index. Here it is actually commented out
as a reminder for the future.

> 
>> +            _FDT((fdt_setprop_string(fdt, off, "compatible",
>> +                                     "ibm,npu-link")));
>> +            _FDT((fdt_setprop_cell(fdt, off, "phandle",
>> +                                   PHANDLE_NVLINK(sphb, i, j))));
>> +            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));
> 
> Why do you need the index here as well as in reg?

I do not really need "reg", but I do need the index for this:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/npu-dma.c?h=v4.20#n692
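
(Paraphrasing what that code does with the property - not a verbatim
quote of the file - the driver reads it per link and uses it to index
per-link state, so it has to be stable and unique:

    u32 nvlink_index;

    if (of_property_read_u32(nvlink_dn, "ibm,npu-link-index", &nvlink_index))
        return -ENODEV;
    /* nvlink_index then selects this link's slot in the NPU context */

)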



>> +            g_free(linkname);
>> +            ++linkidx;
>> +        }
>> +    }
>> +
>> +    /* Add memory nodes for GPU RAM and mark them unusable */
>> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>> +        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
>> +                                                    "nvlink2-mr[0]", NULL);
>> +        uint32_t at = cpu_to_be32(GPURAM_ASSOCIATIVITY(sphb, i));
>> +        uint32_t associativity[] = { cpu_to_be32(0x4), at, at, at, at };
>> +        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
>> +        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
>> +        char *mem_name = g_strdup_printf("memory@%lx", nvslot->gpa);
>> +        int off = fdt_add_subnode(fdt, 0, mem_name);
>> +
>> +        _FDT(off);
>> +        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
>> +        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
>> +        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
>> +                          sizeof(associativity))));
>> +
>> +        _FDT((fdt_setprop_string(fdt, off, "compatible",
>> +                                 "ibm,coherent-device-memory")));
>> +
>> +        mem_reg[1] = cpu_to_be64(0);
>> +        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
>> +                          sizeof(mem_reg))));
>> +        _FDT((fdt_setprop_cell(fdt, off, "phandle",
>> +                               PHANDLE_GPURAM(sphb, i))));
>> +        g_free(mem_name);
>> +    }
>> +
>> +}
>> +
>> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
>> +                                        sPAPRPHBState *sphb)
>> +{
>> +    int i, j;
>> +
>> +    if (!sphb->nvgpus) {
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>> +
>> +        /* Skip "slot" without attached GPU */
> 
> IIUC a "slot" should always have at least one GPU.  You need to handle
> the case of an unitialized GPU in the "collect" functions because you
> don't know if you'll discover the GPU or an NPU first.  But here not
> having a GPU should be an error, shouldn't it?


If someone decides to pass through one GPU with all of its related
nvlinks, plus nvlinks from another GPU without that GPU itself, for
whatever reason, should we really stop them? Things won't work exactly
at their best but this still might be useful for weird debugging.




>> +        if (!nvslot->gpdev) {
>> +            continue;
>> +        }
>> +        if (dev == nvslot->gpdev) {
>> +            uint32_t npus[nvslot->linknum];
>> +
>> +            for (j = 0; j < nvslot->linknum; ++j) {
>> +                PCIDevice *npdev = nvslot->links[j].npdev;
>> +
>> +                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
>> +            }
>> +            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
>> +                             j * sizeof(npus[0])));
>> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
>> +                                   PHANDLE_PCIDEV(sphb, dev))));
>> +            continue;
>> +        }
>> +
>> +        for (j = 0; j < nvslot->linknum; ++j) {
>> +            if (dev != nvslot->links[j].npdev) {
>> +                continue;
>> +            }
>> +
>> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
>> +                                   PHANDLE_PCIDEV(sphb, dev))));
>> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
>> +                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
>> +            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
>> +                                   PHANDLE_NVLINK(sphb, i, j))));
>> +            /*
>> +             * If we ever want to emulate GPU RAM at the same location as on
>> +             * the host - here is the encoding GPA->TGT:
>> +             *
>> +             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
>> +             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
>> +             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
>> +             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
>> +             */
>> +            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
>> +                                  PHANDLE_GPURAM(sphb, i)));
>> +            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
>> +                                 nvslot->tgt));
>> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
>> +                                  nvslot->links[j].link_speed));
>> +        }
>> +    }
>> +}
>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>> index 40a1200..15ec0b4 100644
>> --- a/hw/vfio/pci-quirks.c
>> +++ b/hw/vfio/pci-quirks.c
>> @@ -2180,3 +2180,123 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
>>  
>>      return 0;
>>  }
>> +
>> +static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
>> +                                     const char *name,
>> +                                     void *opaque, Error **errp)
>> +{
>> +    uint64_t tgt = (uint64_t) opaque;
>> +    visit_type_uint64(v, name, &tgt, errp);
>> +}
>> +
>> +static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
>> +                                                 const char *name,
>> +                                                 void *opaque, Error **errp)
>> +{
>> +    uint32_t link_speed = (uint32_t)(uint64_t) opaque;
>> +    visit_type_uint32(v, name, &link_speed, errp);
>> +}
>> +
>> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
>> +{
>> +    int ret;
>> +    void *p;
>> +    struct vfio_region_info *nv2region = NULL;
>> +    struct vfio_info_cap_header *hdr;
>> +    MemoryRegion *nv2mr = g_malloc0(sizeof(*nv2mr));
>> +
>> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
>> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
>> +                                   PCI_VENDOR_ID_NVIDIA,
>> +                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
>> +                                   &nv2region);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    p = mmap(NULL, nv2region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
>> +             MAP_SHARED, vdev->vbasedev.fd, nv2region->offset);
>> +
>> +    if (!p) {
>> +        return -errno;
>> +    }
>> +
>> +    memory_region_init_ram_ptr(nv2mr, OBJECT(vdev), "nvlink2-mr",
>> +                               nv2region->size, p);
>> +
>> +    hdr = vfio_get_region_info_cap(nv2region,
>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
>> +    if (hdr) {
>> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
>> +
>> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
>> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
>> +                            (void *) cap->tgt, NULL);
>> +        trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
>> +                                              nv2region->size);
>> +    }
>> +    g_free(nv2region);
>> +
>> +    return 0;
>> +}
>> +
>> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
>> +{
>> +    int ret;
>> +    void *p;
>> +    struct vfio_region_info *atsd_region = NULL;
>> +    struct vfio_info_cap_header *hdr;
>> +
>> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
>> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
>> +                                   PCI_VENDOR_ID_IBM,
>> +                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
>> +                                   &atsd_region);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    /* Some NVLink bridges come without assigned ATSD, skip MR part */
>> +    if (atsd_region->size) {
>> +        MemoryRegion *atsd_mr = g_malloc0(sizeof(*atsd_mr));
>> +
>> +        p = mmap(NULL, atsd_region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
>> +                 MAP_SHARED, vdev->vbasedev.fd, atsd_region->offset);
>> +
>> +        if (!p) {
>> +            return -errno;
>> +        }
>> +
>> +        memory_region_init_ram_device_ptr(atsd_mr, OBJECT(vdev),
>> +                                          "nvlink2-atsd-mr",
>> +                                          atsd_region->size,
>> +                                          p);
>> +    }
>> +
>> +    hdr = vfio_get_region_info_cap(atsd_region,
>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
>> +    if (hdr) {
>> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
>> +
>> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
>> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
>> +                            (void *) cap->tgt, NULL);
>> +        trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, cap->tgt,
>> +                                                  atsd_region->size);
>> +    }
>> +
>> +    hdr = vfio_get_region_info_cap(atsd_region,
>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
>> +    if (hdr) {
>> +        struct vfio_region_info_cap_nvlink2_lnkspd *cap = (void *) hdr;
>> +
>> +        object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
>> +                            vfio_pci_nvlink2_get_link_speed, NULL, NULL,
>> +                            (void *) (uint64_t) cap->link_speed, NULL);
>> +        trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
>> +                                                  cap->link_speed);
>> +    }
>> +    g_free(atsd_region);
>> +
>> +    return 0;
>> +}
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index dd12f36..07aa141 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3069,6 +3069,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>          goto out_teardown;
>>      }
>>  
>> +    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
>> +        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
>> +        if (ret && ret != -ENODEV) {
>> +            error_report("Failed to setup NVIDIA V100 GPU RAM");
>> +        }
>> +    }
>> +
>> +    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
>> +        ret = vfio_pci_nvlink2_init(vdev, errp);
>> +        if (ret && ret != -ENODEV) {
>> +            error_report("Failed to setup NVlink2 bridge");
>> +        }
>> +    }
>> +
>>      vfio_register_err_notifier(vdev);
>>      vfio_register_req_notifier(vdev);
>>      vfio_setup_resetfn_quirk(vdev);
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index cf1e886..88841e9 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -87,6 +87,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
>>  vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
>>  vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
>>  
>> +vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
>> +vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
>> +vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
>> +
>>  # hw/vfio/common.c
>>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
>>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2
  2019-02-28  6:11     ` Alexey Kardashevskiy
@ 2019-03-05  1:47       ` David Gibson
  2019-03-07  2:40         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: David Gibson @ 2019-03-05  1:47 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Sam Bobroff, Piotr Jaroszynski,
	Leonardo Augusto Guimarães Garcia, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 48390 bytes --]

On Thu, Feb 28, 2019 at 05:11:32PM +1100, Alexey Kardashevskiy wrote:
> On 28/02/2019 14:31, David Gibson wrote:
> > On Wed, Feb 27, 2019 at 07:51:49PM +1100, Alexey Kardashevskiy wrote:
> >> NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
> >> space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
> >> implements special regions for such GPUs and emulates an NVLink bridge.
> >> NVLink2-enabled POWER9 CPUs also provide address translation services
> >> which includes an ATS shootdown (ATSD) register exported via the NVLink
> >> bridge device.
> >>
> >> This adds a quirk to VFIO to map the GPU memory and create an MR;
> >> the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
> >> this to get the MR and map it to the system address space.
> >> Another quirk does the same for ATSD.
> >>
> >> This adds additional steps to sPAPR PHB setup:
> >>
> >> 1. Search for specific GPUs and NPUs, collect findings in
> >> sPAPRPHBState::nvgpus, manage system address space mappings;
> >>
> >> 2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
> >> "memory-block", "link-speed" to advertise the NVLink2 function to
> >> the guest;
> >>
> >> 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
> >>
> >> 4. Add new memory blocks (with extra "linux,memory-usable" to prevent
> >> the guest OS from accessing the new memory until it is onlined) and
> >> npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
> >> uses it for link discovery.
> >>
> >> This allocates space for GPU RAM and ATSD like we do for MMIOs by
> >> adding 2 new parameters to the phb_placement() hook. Older machine types
> >> set these to zero.
> >>
> >> This puts new memory nodes in a separate NUMA node to replicate the host
> >> system setup as the GPU driver relies on this.
> >>
> >> This adds requirement similar to EEH - one IOMMU group per vPHB.
> >> The reason for this is that ATSD registers belong to a physical NPU
> >> so they cannot invalidate translations on GPUs attached to another NPU.
> >> It is guaranteed by the host platform as it does not mix NVLink bridges
> >> or GPUs from different NPU in the same IOMMU group. If more than one
> >> IOMMU group is detected on a vPHB, this disables ATSD support for that
> >> vPHB and prints a warning.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v3:
> >> * moved GPU RAM above PCI MMIO limit
> >> * renamed QOM property to nvlink2-tgt
> >> * moved nvlink2 code to its own file
> >>
> >> ---
> >>
> >> The example command line for redbud system:
> >>
> >> pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
> >> -nodefaults \
> >> -chardev stdio,id=STDIO0,signal=off,mux=on \
> >> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> >> -mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
> >> -enable-kvm -m 384G \
> >> -chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
> >> -mon chardev=SOCKET0,mode=control \
> >> -smp 80,sockets=1,threads=4 \
> >> -netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
> >> -device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
> >> img/vdisk0.img \
> >> -device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
> >> -device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
> >> -device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
> >> -device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
> >> -device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
> >> -device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
> >> -device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
> >> -device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
> >> -device spapr-pci-host-bridge,id=phb1,index=1 \
> >> -device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
> >> -device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
> >> -device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
> >> -device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
> >> -device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
> >> -device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
> >> -device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
> >> -device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
> >> -machine pseries \
> >> -L /home/aik/t/qemu-ppc64-bios/ -d guest_errors
> >>
> >> Note that QEMU attaches PCI devices to the last added vPHB so first
> >> 8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and
> >> 35:03:00.0..7:00:01.2 to the vPHB with id=phb1.
> >> ---
> >>  hw/ppc/Makefile.objs        |   2 +-
> >>  hw/vfio/pci.h               |   2 +
> >>  include/hw/pci-host/spapr.h |  41 ++++
> >>  include/hw/ppc/spapr.h      |   3 +-
> >>  hw/ppc/spapr.c              |  29 ++-
> >>  hw/ppc/spapr_pci.c          |   8 +
> >>  hw/ppc/spapr_pci_nvlink2.c  | 419 ++++++++++++++++++++++++++++++++++++
> >>  hw/vfio/pci-quirks.c        | 120 +++++++++++
> >>  hw/vfio/pci.c               |  14 ++
> >>  hw/vfio/trace-events        |   4 +
> >>  10 files changed, 637 insertions(+), 5 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_pci_nvlink2.c
> >>
> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >> index 1111b21..636e717 100644
> >> --- a/hw/ppc/Makefile.objs
> >> +++ b/hw/ppc/Makefile.objs
> >> @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
> >>  # IBM PowerNV
> >>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
> >>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >> -obj-y += spapr_pci_vfio.o
> >> +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
> >>  endif
> >>  obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>  # PowerPC 4xx boards
> >> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> >> index b1ae4c0..706c304 100644
> >> --- a/hw/vfio/pci.h
> >> +++ b/hw/vfio/pci.h
> >> @@ -194,6 +194,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
> >>  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> >>                                 struct vfio_region_info *info,
> >>                                 Error **errp);
> >> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
> >> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
> >>  
> >>  void vfio_display_reset(VFIOPCIDevice *vdev);
> >>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> >> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >> index ab0e3a0..e791dd4 100644
> >> --- a/include/hw/pci-host/spapr.h
> >> +++ b/include/hw/pci-host/spapr.h
> >> @@ -87,6 +87,9 @@ struct sPAPRPHBState {
> >>      uint32_t mig_liobn;
> >>      hwaddr mig_mem_win_addr, mig_mem_win_size;
> >>      hwaddr mig_io_win_addr, mig_io_win_size;
> >> +    hwaddr nv2_gpa_win_addr;
> >> +    hwaddr nv2_atsd_win_addr;
> >> +    struct spapr_phb_pci_nvgpu_config *nvgpus;
> >>  };
> >>  
> >>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> >> @@ -105,6 +108,23 @@ struct sPAPRPHBState {
> >>  
> >>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
> >>  
> >> +#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
> >> +#define SPAPR_PCI_NV2RAM64_WIN_SIZE  0x10000000000ULL /* 1 TiB for all 6xGPUs */
> > 
> > The comments and values below suggest that it is 1TiB for each GPU,
> > rather than 1TiB shared by all 6.  Which is it?
> 
> 
> 1TiB for all of them within 1 vPHB. Not sure where it suggests 1TiB for
> each GPU.

The fact that NV2ATSD_WIN_BASE is set at 6TiB above NV2RAM64_WIN_BASE
is what suggested to me that there was one 1TiB window for each of the
6 possible GPUs.

> >> +
> >> +/* Max number of these GPUs per a physical box */
> >> +#define NVGPU_MAX_NUM                6
> > 
> > Is there any possibility later hardware revisions could increase this?
> > If so we should probably leave some extra room in the address space.
> 
> A GPU RAM window is 256GiB (and only 32GiB is used), and 3 is the
> maximum in one group so far. So 1TiB should be enough for quite some
> time. Having more GPUs in a box is probably possible but for now a 6xGPU
> box requires water cooling while a 4xGPU box does not, so unless a new
> generation of these GPUs comes out, the numbers won't change much.

Hm, ok.

> I'll double SPAPR_PCI_NV2RAM64_WIN_SIZE.

Um.. I'm not sure how that follows from the above.

> 
> 
> >> +/*
> >> + * One NVLink bridge provides one ATSD register so it should be 18.
> >> + * In practice though since we allow only one group per vPHB which equals
> >> + * to an NPU2 which has maximum 6 NVLink bridges.
> >> + */
> >> +#define NVGPU_MAX_ATSD               6
> >> +
> >> +#define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
> >> +                                      SPAPR_PCI_NV2RAM64_WIN_SIZE * \
> >> +                                      NVGPU_MAX_NUM)
> >> +#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_ATSD * 0x10000)
> > 
> > What's the significance of the 64 kiB constant here?  Should it be a
> > symbolic name, or spelled "64 * kiB"?
> 
> Ok.
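
(For reference, with qemu/units.h that could be spelled along these lines;
sketch only:

#include "qemu/units.h"     /* for KiB */

/* one 64KiB ATSD MMIO range per NVLink bridge, NVGPU_MAX_ATSD per vPHB */
#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_ATSD * 64 * KiB)
)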


Hmm.  Am I right in thinking that each 64-bit RAM slot and each ATSD
slot is per-vPHB?  Would it make more sense to directly index into
the array of slots with the phb index, rather than having a separate
GPU index?

> 
> 
> > 
> >> +
> >>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
> >>  {
> >>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> >> @@ -135,6 +155,11 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
> >>  int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
> >>  int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
> >>  void spapr_phb_vfio_reset(DeviceState *qdev);
> >> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb);
> >> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off);
> >> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);
> >> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
> >> +                                        sPAPRPHBState *sphb);
> >>  #else
> >>  static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)
> >>  {
> >> @@ -161,6 +186,22 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
> >>  static inline void spapr_phb_vfio_reset(DeviceState *qdev)
> >>  {
> >>  }
> >> +static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
> >> +{
> >> +}
> >> +static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,
> >> +                                               int bus_off)
> >> +{
> >> +}
> >> +static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,
> >> +                                                   void *fdt)
> >> +{
> >> +}
> >> +static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
> >> +                                                      int offset,
> >> +                                                      sPAPRPHBState *sphb)
> >> +{
> >> +}
> > 
> > I'm guessing some of these should never get called on systems without
> > NVLink2, in which case they should probably have a
> > g_assert_not_reached() in there.
> 
> I guess if you compile QEMU for --target-list=ppc64-softmmu on Windows
> (i.e. tcg + pseries + pci but no vfio), these will be called and crash
> then, no?

Well, if they can be called in that situation then, yes, they need to
be no-ops like they are now.  But is that true for all of them?
Hmm.. yes it might be, never mind.

> > 
> >>  #endif
> >>  
> >>  void spapr_phb_dma_reset(sPAPRPHBState *sphb);
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index 358bb38..9acf867 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -113,7 +113,8 @@ struct sPAPRMachineClass {
> >>      void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> >>                            uint64_t *buid, hwaddr *pio, 
> >>                            hwaddr *mmio32, hwaddr *mmio64,
> >> -                          unsigned n_dma, uint32_t *liobns, Error **errp);
> >> +                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
> >> +                          hwaddr *nv2atsd, Error **errp);
> >>      sPAPRResizeHPT resize_hpt_default;
> >>      sPAPRCapabilities default_caps;
> >>      sPAPRIrq *irq;
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index 74c9b07..fda6e7e 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -3929,7 +3929,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> >>      smc->phb_placement(spapr, sphb->index,
> >>                         &sphb->buid, &sphb->io_win_addr,
> >>                         &sphb->mem_win_addr, &sphb->mem64_win_addr,
> >> -                       windows_supported, sphb->dma_liobn, errp);
> >> +                       windows_supported, sphb->dma_liobn,
> >> +                       &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
> >> +                       errp);
> >>  }
> >>  
> >>  static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> >> @@ -4129,7 +4131,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
> >>  static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> >>                                  uint64_t *buid, hwaddr *pio,
> >>                                  hwaddr *mmio32, hwaddr *mmio64,
> >> -                                unsigned n_dma, uint32_t *liobns, Error **errp)
> >> +                                unsigned n_dma, uint32_t *liobns,
> >> +                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> >>  {
> >>      /*
> >>       * New-style PHB window placement.
> >> @@ -4174,6 +4177,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> >>      *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
> >>      *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
> >>      *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
> >> +
> > 
> > This doesn't look right.  SPAPR_PCI_NV2ATSD_WIN_BASE appears to be
> > defined such that there slots for NVGPU_MAX_NUM gpa "slots" of size
> > SPAPR_PCI_NV2RAM64_WIN_SIZE before we get to the ATSD base.
> > 
> >> +    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
> > 
> > But this implies you need a "slot" for every possible PHB index, which
> > is rather more than NVGPU_MAX_NUM.
> > 
> >> +    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
> 
> 
> Ah right :( These should then go above 128TiB, I guess, as I do not really
> want them to appear inside a huge DMA window.

Right.  So actually it looks like you are already indexing the window
slots by phb index, in which case you need to allow for 32 slots even
though only 6 can be populated at the moment.
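
I.e. something along these lines, where SPAPR_MAX_PHBS stands in for
whatever the real limit on sphb->index is (sketch only, not the final
code):

/* Reserve a GPU RAM slot for every possible PHB index, not just
 * NVGPU_MAX_NUM, so the nv2gpa/nv2atsd assignments below can never
 * collide (SPAPR_MAX_PHBS is a placeholder name for the actual limit). */
#define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
                                      SPAPR_PCI_NV2RAM64_WIN_SIZE * \
                                      SPAPR_MAX_PHBS)

    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;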

> >>  }
> >>  
> >>  static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
> >> @@ -4376,6 +4382,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
> >>  /*
> >>   * pseries-3.1
> >>   */
> >> +static void phb_placement_3_1(sPAPRMachineState *spapr, uint32_t index,
> >> +                              uint64_t *buid, hwaddr *pio,
> >> +                              hwaddr *mmio32, hwaddr *mmio64,
> >> +                              unsigned n_dma, uint32_t *liobns,
> >> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> >> +{
> >> +    spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
> >> +                        nv2gpa, nv2atsd, errp);
> >> +    *nv2gpa = 0;
> >> +    *nv2atsd = 0;
> >> +}
> >> +
> >>  static void spapr_machine_3_1_class_options(MachineClass *mc)
> >>  {
> >>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> >> @@ -4391,6 +4409,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
> >>      mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power8_v2.0");
> >>      smc->update_dt_enabled = false;
> >>      smc->dr_phb_enabled = false;
> >> +    smc->phb_placement = phb_placement_3_1;
> >>  }
> >>  
> >>  DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
> >> @@ -4522,7 +4541,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
> >>  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
> >>                                uint64_t *buid, hwaddr *pio,
> >>                                hwaddr *mmio32, hwaddr *mmio64,
> >> -                              unsigned n_dma, uint32_t *liobns, Error **errp)
> >> +                              unsigned n_dma, uint32_t *liobns,
> >> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> >>  {
> >>      /* Legacy PHB placement for pseries-2.7 and earlier machine types */
> >>      const uint64_t base_buid = 0x800000020000000ULL;
> >> @@ -4566,6 +4586,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
> >>       * fallback behaviour of automatically splitting a large "32-bit"
> >>       * window into contiguous 32-bit and 64-bit windows
> >>       */
> >> +
> >> +    *nv2gpa = 0;
> >> +    *nv2atsd = 0;
> >>  }
> >>  
> >>  static void spapr_machine_2_7_class_options(MachineClass *mc)
> >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >> index 06a5ffd..f076462 100644
> >> --- a/hw/ppc/spapr_pci.c
> >> +++ b/hw/ppc/spapr_pci.c
> >> @@ -1355,6 +1355,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
> >>      if (sphb->pcie_ecs && pci_is_express(dev)) {
> >>          _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
> >>      }
> >> +
> >> +    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
> >>  }
> >>  
> >>  /* create OF node for pci device and required OF DT properties */
> >> @@ -1878,6 +1880,7 @@ static void spapr_phb_reset(DeviceState *qdev)
> >>      sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
> >>  
> >>      spapr_phb_dma_reset(sphb);
> >> +    spapr_phb_nvgpu_setup(sphb);
> >>  
> >>      /* Reset the IOMMU state */
> >>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
> >> @@ -1910,6 +1913,8 @@ static Property spapr_phb_properties[] = {
> >>                       pre_2_8_migration, false),
> >>      DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,
> >>                       pcie_ecs, true),
> >> +    DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),
> >> +    DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>  
> >> @@ -2282,6 +2287,9 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb, uint32_t intc_phandle, void *fdt,
> >>          return ret;
> >>      }
> >>  
> >> +    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off);
> >> +    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
> >> +
> >>      return 0;
> >>  }
> >>  
> >> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> >> new file mode 100644
> >> index 0000000..965a6be
> >> --- /dev/null
> >> +++ b/hw/ppc/spapr_pci_nvlink2.c
> >> @@ -0,0 +1,419 @@
> >> +/*
> >> + * QEMU sPAPR PCI for NVLink2 pass through
> >> + *
> >> + * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
> >> + *
> >> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> >> + * of this software and associated documentation files (the "Software"), to deal
> >> + * in the Software without restriction, including without limitation the rights
> >> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> >> + * copies of the Software, and to permit persons to whom the Software is
> >> + * furnished to do so, subject to the following conditions:
> >> + *
> >> + * The above copyright notice and this permission notice shall be included in
> >> + * all copies or substantial portions of the Software.
> >> + *
> >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> >> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> >> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> >> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> >> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> >> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> >> + * THE SOFTWARE.
> >> + */
> >> +#include "qemu/osdep.h"
> >> +#include "qapi/error.h"
> >> +#include "qemu-common.h"
> >> +#include "hw/pci/pci.h"
> >> +#include "hw/pci-host/spapr.h"
> >> +#include "qemu/error-report.h"
> >> +#include "hw/ppc/fdt.h"
> >> +#include "hw/pci/pci_bridge.h"
> >> +
> >> +#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
> >> +                                     (((phb)->index) << 16) | ((pdev)->devfn))
> >> +#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
> >> +                                     (((phb)->index) << 16))
> >> +/* NVLink2 wants a separate NUMA node for its RAM */
> >> +#define GPURAM_ASSOCIATIVITY(phb, n) (255 - ((phb)->index * 3 + (n)))
> >> +#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
> >> +                                     ((gn) << 4) | (nn))
> >> +
> >> +/* Max number of NVLinks per GPU in any physical box */
> >> +#define NVGPU_MAX_LINKS              3
> >> +
> >> +struct spapr_phb_pci_nvgpu_config {
> >> +    uint64_t nv2_ram_current;
> >> +    uint64_t nv2_atsd_current;
> >> +    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
> >> +    struct spapr_phb_pci_nvgpu_slot {
> >> +        uint64_t tgt;
> >> +        uint64_t gpa;
> >> +        PCIDevice *gpdev;
> >> +        int linknum;
> >> +        struct {
> >> +            uint64_t atsd_gpa;
> >> +            PCIDevice *npdev;
> >> +            uint32_t link_speed;
> >> +        } links[NVGPU_MAX_LINKS];
> >> +    } slots[NVGPU_MAX_NUM];
> >> +};
> >> +
> >> +static int spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
> >> +                                    uint64_t tgt)
> >> +{
> >> +    int i;
> >> +
> >> +    /* Search for partially collected "slot" */
> >> +    for (i = 0; i < nvgpus->num; ++i) {
> >> +        if (nvgpus->slots[i].tgt == tgt) {
> >> +            return i;
> >> +        }
> >> +    }
> >> +
> >> +    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
> >> +        warn_report("Found too many NVLink bridges per GPU");
> >> +        return -1;
> > 
> > This is within qemu so it would be better to use the qemu error API
> > than returning an error code.
> 
> You mean returning Error**? Oh. Ok.

Well, not returning, technically, but taking an Error ** parameter
which is checked by the caller to detect errors.
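
A minimal sketch of that pattern, also folding in the "return a pointer
to the slot" idea below (not meant as the final code):

static struct spapr_phb_pci_nvgpu_slot *
spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
                         uint64_t tgt, Error **errp)
{
    int i;

    /* Search for a partially collected "slot" */
    for (i = 0; i < nvgpus->num; ++i) {
        if (nvgpus->slots[i].tgt == tgt) {
            return &nvgpus->slots[i];
        }
    }

    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
        error_setg(errp, "Found too many NVLink bridges per GPU");
        return NULL;
    }

    nvgpus->slots[nvgpus->num].tgt = tgt;
    return &nvgpus->slots[nvgpus->num++];
}

The collect callbacks would then just check for NULL and propagate the
error instead of silently returning.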

> > 
> >> +    }
> >> +
> >> +    i = nvgpus->num;
> >> +    nvgpus->slots[i].tgt = tgt;
> >> +    ++nvgpus->num;
> >> +
> >> +    return i;
> > 
> > Might be nicer to return a pointer to the slot structure.
> 
> 
> This can work.
> 
> 
> > 
> >> +}
> >> +
> >> +static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
> >> +                                    PCIDevice *pdev, uint64_t tgt,
> >> +                                    MemoryRegion *mr)
> >> +{
> >> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt);
> >> +
> >> +    if (i < 0) {
> >> +        return;
> >> +    }
> >> +    g_assert(!nvgpus->slots[i].gpdev);
> >> +    nvgpus->slots[i].gpdev = pdev;
> >> +
> >> +    nvgpus->slots[i].gpa = nvgpus->nv2_ram_current;
> >> +    nvgpus->nv2_ram_current += memory_region_size(mr);
> >> +}
> >> +
> >> +static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
> >> +                                    PCIDevice *pdev, uint64_t tgt,
> >> +                                    MemoryRegion *mr)
> >> +{
> >> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt), j;
> >> +    struct spapr_phb_pci_nvgpu_slot *nvslot;
> >> +
> >> +    if (i < 0) {
> >> +        return;
> >> +    }
> >> +
> >> +    nvslot = &nvgpus->slots[i];
> >> +    j = nvslot->linknum;
> >> +    if (j == ARRAY_SIZE(nvslot->links)) {
> >> +        warn_report("Found too many NVLink2 bridges");
> >> +        return;
> >> +    }
> >> +    ++nvslot->linknum;
> >> +
> >> +    g_assert(!nvslot->links[j].npdev);
> >> +    nvslot->links[j].npdev = pdev;
> >> +    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
> >> +    nvgpus->nv2_atsd_current += memory_region_size(mr);
> >> +    nvslot->links[j].link_speed =
> >> +        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
> >> +}
> >> +
> >> +static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
> >> +                                        void *opaque)
> >> +{
> >> +    PCIBus *sec_bus;
> >> +    Object *po = OBJECT(pdev);
> >> +    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
> >> +
> >> +    if (tgt) {
> >> +        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
> >> +        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
> >> +                                                  NULL);
> >> +
> >> +        if (mr_gpu) {
> >> +            spapr_pci_collect_nvgpu(opaque, pdev, tgt, MEMORY_REGION(mr_gpu));
> >> +        } else if (mr_npu) {
> >> +            spapr_pci_collect_nvnpu(opaque, pdev, tgt, MEMORY_REGION(mr_npu));
> >> +        } else {
> >> +            warn_report("Unexpected device with \"nvlink2-tgt\"");
> > 
> > IIUC this would have to be a code error, so should be an assert() not
> > a warning.
> 
> 
> Ok.
> 
> > 
> >> +        }
> >> +    }
> >> +    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
> >> +         PCI_HEADER_TYPE_BRIDGE)) {
> >> +        return;
> >> +    }
> >> +
> >> +    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
> >> +    if (!sec_bus) {
> >> +        return;
> >> +    }
> >> +
> >> +    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
> >> +                        spapr_phb_pci_collect_nvgpu, opaque);
> >> +}
> >> +
> >> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
> >> +{
> >> +    int i, j, valid_gpu_num;
> >> +
> >> +    /* If there are existing NVLink2 MRs, unmap those before recreating */
> >> +    if (sphb->nvgpus) {
> >> +        for (i = 0; i < sphb->nvgpus->num; ++i) {
> >> +            struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >> +            Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> >> +                                                        "nvlink2-mr[0]", NULL);
> >> +
> >> +            if (nv_mrobj) {
> >> +                memory_region_del_subregion(get_system_memory(),
> >> +                                            MEMORY_REGION(nv_mrobj));
> >> +            }
> >> +            for (j = 0; j < nvslot->linknum; ++j) {
> >> +                PCIDevice *npdev = nvslot->links[j].npdev;
> >> +                Object *atsd_mrobj;
> >> +                atsd_mrobj = object_property_get_link(OBJECT(npdev),
> >> +                                                      "nvlink2-atsd-mr[0]",
> >> +                                                      NULL);
> >> +                if (atsd_mrobj) {
> >> +                    memory_region_del_subregion(get_system_memory(),
> >> +                                                MEMORY_REGION(atsd_mrobj));
> >> +                }
> >> +            }
> >> +        }
> >> +        g_free(sphb->nvgpus);
> > 
> > Probably worth collecting the above into a nvgpu_free() helper -
> > chances are you'll want it on cleanup paths as well.
> 
> The only other cleanup path is below, and it only executes if no MR was
> added, so for now it does not seem useful.

Hrm... I've merged PHB hotplug recently, so there should be a cleanup
path for unplug as well.
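
For the record, such a helper would pretty much just collect the loop
above, roughly (sketch only):

/* Rough sketch: the teardown from spapr_phb_nvgpu_setup() collected into
 * a helper that a PHB unplug path could call as well. */
static void spapr_phb_nvgpu_free(sPAPRPHBState *sphb)
{
    int i, j;

    if (!sphb->nvgpus) {
        return;
    }

    for (i = 0; i < sphb->nvgpus->num; ++i) {
        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
        Object *nv_mrobj = nvslot->gpdev ?
            object_property_get_link(OBJECT(nvslot->gpdev), "nvlink2-mr[0]",
                                     NULL) : NULL;

        if (nv_mrobj) {
            memory_region_del_subregion(get_system_memory(),
                                        MEMORY_REGION(nv_mrobj));
        }
        for (j = 0; j < nvslot->linknum; ++j) {
            Object *atsd_mrobj =
                object_property_get_link(OBJECT(nvslot->links[j].npdev),
                                         "nvlink2-atsd-mr[0]", NULL);

            if (atsd_mrobj) {
                memory_region_del_subregion(get_system_memory(),
                                            MEMORY_REGION(atsd_mrobj));
            }
        }
    }
    g_free(sphb->nvgpus);
    sphb->nvgpus = NULL;
}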

> 
> 
> >> +        sphb->nvgpus = NULL;
> >> +    }
> >> +
> >> +    /* Search for GPUs and NPUs */
> >> +    if (sphb->nv2_gpa_win_addr && sphb->nv2_atsd_win_addr) {
> >> +        PCIBus *bus = PCI_HOST_BRIDGE(sphb)->bus;
> >> +
> >> +        sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
> >> +        sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
> >> +        sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
> >> +
> >> +        pci_for_each_device(bus, pci_bus_num(bus),
> >> +                            spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
> >> +    }
> >> +
> >> +    /* Add found GPU RAM and ATSD MRs if found */
> >> +    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
> >> +        Object *nvmrobj;
> >> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >> +
> >> +        if (!nvslot->gpdev) {
> >> +            continue;
> >> +        }
> >> +        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> >> +                                           "nvlink2-mr[0]", NULL);
> >> +        /* ATSD is pointless without GPU RAM MR so skip those */
> >> +        if (!nvmrobj) {
> >> +            continue;
> >> +        }
> >> +
> >> +        ++valid_gpu_num;
> >> +        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
> >> +                                    MEMORY_REGION(nvmrobj));
> >> +
> >> +        for (j = 0; j < nvslot->linknum; ++j) {
> >> +            Object *atsdmrobj;
> >> +
> >> +            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
> >> +                                                 "nvlink2-atsd-mr[0]",
> >> +                                                 NULL);
> >> +            if (!atsdmrobj) {
> >> +                continue;
> >> +            }
> >> +            memory_region_add_subregion(get_system_memory(),
> >> +                                        nvslot->links[j].atsd_gpa,
> >> +                                        MEMORY_REGION(atsdmrobj));
> >> +        }
> >> +    }
> >> +
> >> +    if (!valid_gpu_num) {
> >> +        /* We did not find any interesting GPU */
> >> +        g_free(sphb->nvgpus);
> >> +        sphb->nvgpus = NULL;
> >> +    }
> >> +}
> >> +
> >> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off)
> >> +{
> >> +    int i, j, atsdnum = 0;
> >> +    uint64_t atsd[8]; /* The existing limitation of known guests */
> >> +
> >> +    if (!sphb->nvgpus) {
> >> +        return;
> >> +    }
> >> +
> >> +    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
> >> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >> +
> >> +        if (!nvslot->gpdev) {
> >> +            continue;
> >> +        }
> >> +        for (j = 0; j < nvslot->linknum; ++j) {
> >> +            if (!nvslot->links[j].atsd_gpa) {
> >> +                continue;
> >> +            }
> >> +
> >> +            if (atsdnum == ARRAY_SIZE(atsd)) {
> >> +                warn_report("Only %ld ATSD registers allowed",
> >> +                            ARRAY_SIZE(atsd));
> > 
> > Probably should be an error not a warning.
> 
> We can still continue though, it is not fatal. These things come from
> skiboot (which we control), but skiboot could either compose the
> properties itself or use whatever hostboot provided (that does not happen
> now though), and I would not like to be blocked by hostboot if/when this happens.

Um.. what?  atsdnum is just a counter incremented below, it doesn't
come from skiboot or any other host-significant value.  The situation
here is that we have more nvlinks assigned to a guest than qemu can
support.  Yes, you could technically run the guest with some of the
links unavailable, but that seems pretty clearly not what the user
wanted.  Hence, an error is appropriate.
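
I.e. roughly this, assuming spapr_phb_nvgpu_populate_dt() grows an
Error ** parameter that its caller checks (sketch only):

            /* Sketch: fail the DT build instead of silently dropping links */
            if (atsdnum == ARRAY_SIZE(atsd)) {
                error_setg(errp, "Only %zu ATSD registers supported",
                           ARRAY_SIZE(atsd));
                return;
            }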

> 
> >> +                break;
> >> +            }
> >> +            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
> >> +            ++atsdnum;
> >> +        }
> >> +    }
> >> +
> >> +    if (!atsdnum) {
> >> +        warn_report("No ATSD registers found");
> >> +    } else if (!spapr_phb_eeh_available(sphb)) {
> >> +        /*
> >> +         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
> >> +         * which we do not emulate as a separate device. Instead we put
> >> +         * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not
> >> +         * put GPUs from different IOMMU groups to the same vPHB to ensure
> >> +         * that the guest will use ATSDs from the corresponding NPU.
> >> +         */
> >> +        warn_report("ATSD requires separate vPHB per GPU IOMMU group");
> >> +    } else {
> >> +        _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd",
> >> +                          atsd, atsdnum * sizeof(atsd[0]))));
> >> +    }
> >> +}
> >> +
> >> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)
> >> +{
> >> +    int i, j, linkidx, npuoff;
> >> +    char *npuname;
> >> +
> >> +    if (!sphb->nvgpus) {
> >> +        return;
> >> +    }
> >> +
> >> +    npuname = g_strdup_printf("npuphb%d", sphb->index);
> >> +    npuoff = fdt_add_subnode(fdt, 0, npuname);
> >> +    _FDT(npuoff);
> >> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
> >> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
> >> +    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
> >> +    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
> >> +    g_free(npuname);
> >> +
> >> +    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
> >> +        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
> >> +            char *linkname = g_strdup_printf("link@%d", linkidx);
> >> +            int off = fdt_add_subnode(fdt, npuoff, linkname);
> >> +
> >> +            _FDT(off);
> >> +            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx))); */
> > 
> > Are the indices you're using for 'reg' and the unit name arbitrary?
> > If so it's generally best to base them on some static property of the
> > device, rather than just allocating sequentially.
> 
> On the host, "reg" is the link index. Here it is actually commented out as
> a reminder for the future.
> 
> > 
> >> +            _FDT((fdt_setprop_string(fdt, off, "compatible",
> >> +                                     "ibm,npu-link")));
> >> +            _FDT((fdt_setprop_cell(fdt, off, "phandle",
> >> +                                   PHANDLE_NVLINK(sphb, i, j))));
> >> +            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));
> > 
> > Why do you need the index here as well as in reg?
> 
> I do not really need "reg", but I do need the index for this:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/npu-dma.c?h=v4.20#n692


Ok, because of a silly binding.  That's a good enough reason.

> >> +            g_free(linkname);
> >> +            ++linkidx;
> >> +        }
> >> +    }
> >> +
> >> +    /* Add memory nodes for GPU RAM and mark them unusable */
> >> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
> >> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >> +        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> >> +                                                    "nvlink2-mr[0]", NULL);
> >> +        uint32_t at = cpu_to_be32(GPURAM_ASSOCIATIVITY(sphb, i));
> >> +        uint32_t associativity[] = { cpu_to_be32(0x4), at, at, at, at };
> >> +        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
> >> +        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
> >> +        char *mem_name = g_strdup_printf("memory@%lx", nvslot->gpa);
> >> +        int off = fdt_add_subnode(fdt, 0, mem_name);
> >> +
> >> +        _FDT(off);
> >> +        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
> >> +        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
> >> +        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
> >> +                          sizeof(associativity))));
> >> +
> >> +        _FDT((fdt_setprop_string(fdt, off, "compatible",
> >> +                                 "ibm,coherent-device-memory")));
> >> +
> >> +        mem_reg[1] = cpu_to_be64(0);
> >> +        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
> >> +                          sizeof(mem_reg))));
> >> +        _FDT((fdt_setprop_cell(fdt, off, "phandle",
> >> +                               PHANDLE_GPURAM(sphb, i))));
> >> +        g_free(mem_name);
> >> +    }
> >> +
> >> +}
> >> +
> >> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
> >> +                                        sPAPRPHBState *sphb)
> >> +{
> >> +    int i, j;
> >> +
> >> +    if (!sphb->nvgpus) {
> >> +        return;
> >> +    }
> >> +
> >> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
> >> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >> +
> >> +        /* Skip "slot" without attached GPU */
> > 
> > IIUC a "slot" should always have at least one GPU.  You need to handle
> > the case of an unitialized GPU in the "collect" functions because you
> > don't know if you'll discover the GPU or an NPU first.  But here not
> > having a GPU should be an error, shouldn't it?
> 
> 
> If someone decides to pass through one GPU with all of its related
> nvlinks, plus nvlinks from another GPU without that GPU itself, for
> whatever reason, should we really stop them? Things won't work exactly
> at their best but this still might be useful for weird debugging.

Hm, ok, I guess so.

> >> +        if (!nvslot->gpdev) {
> >> +            continue;
> >> +        }
> >> +        if (dev == nvslot->gpdev) {
> >> +            uint32_t npus[nvslot->linknum];
> >> +
> >> +            for (j = 0; j < nvslot->linknum; ++j) {
> >> +                PCIDevice *npdev = nvslot->links[j].npdev;
> >> +
> >> +                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
> >> +            }
> >> +            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
> >> +                             j * sizeof(npus[0])));
> >> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
> >> +                                   PHANDLE_PCIDEV(sphb, dev))));
> >> +            continue;
> >> +        }
> >> +
> >> +        for (j = 0; j < nvslot->linknum; ++j) {
> >> +            if (dev != nvslot->links[j].npdev) {
> >> +                continue;
> >> +            }
> >> +
> >> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
> >> +                                   PHANDLE_PCIDEV(sphb, dev))));
> >> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
> >> +                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
> >> +            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
> >> +                                   PHANDLE_NVLINK(sphb, i, j))));
> >> +            /*
> >> +             * If we ever want to emulate GPU RAM at the same location as on
> >> +             * the host - here is the encoding GPA->TGT:
> >> +             *
> >> +             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
> >> +             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
> >> +             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
> >> +             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
> >> +             */
> >> +            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
> >> +                                  PHANDLE_GPURAM(sphb, i)));
> >> +            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
> >> +                                 nvslot->tgt));
> >> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
> >> +                                  nvslot->links[j].link_speed));
> >> +        }
> >> +    }
> >> +}
> >> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> >> index 40a1200..15ec0b4 100644
> >> --- a/hw/vfio/pci-quirks.c
> >> +++ b/hw/vfio/pci-quirks.c
> >> @@ -2180,3 +2180,123 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
> >>  
> >>      return 0;
> >>  }
> >> +
> >> +static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
> >> +                                     const char *name,
> >> +                                     void *opaque, Error **errp)
> >> +{
> >> +    uint64_t tgt = (uint64_t) opaque;
> >> +    visit_type_uint64(v, name, &tgt, errp);
> >> +}
> >> +
> >> +static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
> >> +                                                 const char *name,
> >> +                                                 void *opaque, Error **errp)
> >> +{
> >> +    uint32_t link_speed = (uint32_t)(uint64_t) opaque;
> >> +    visit_type_uint32(v, name, &link_speed, errp);
> >> +}
> >> +
> >> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
> >> +{
> >> +    int ret;
> >> +    void *p;
> >> +    struct vfio_region_info *nv2region = NULL;
> >> +    struct vfio_info_cap_header *hdr;
> >> +    MemoryRegion *nv2mr = g_malloc0(sizeof(*nv2mr));
> >> +
> >> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
> >> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
> >> +                                   PCI_VENDOR_ID_NVIDIA,
> >> +                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
> >> +                                   &nv2region);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    p = mmap(NULL, nv2region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
> >> +             MAP_SHARED, vdev->vbasedev.fd, nv2region->offset);
> >> +
> >> +    if (!p) {
> >> +        return -errno;
> >> +    }
> >> +
> >> +    memory_region_init_ram_ptr(nv2mr, OBJECT(vdev), "nvlink2-mr",
> >> +                               nv2region->size, p);
> >> +
> >> +    hdr = vfio_get_region_info_cap(nv2region,
> >> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
> >> +    if (hdr) {
> >> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
> >> +
> >> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
> >> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
> >> +                            (void *) cap->tgt, NULL);
> >> +        trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
> >> +                                              nv2region->size);
> >> +    }
> >> +    g_free(nv2region);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
> >> +{
> >> +    int ret;
> >> +    void *p;
> >> +    struct vfio_region_info *atsd_region = NULL;
> >> +    struct vfio_info_cap_header *hdr;
> >> +
> >> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
> >> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
> >> +                                   PCI_VENDOR_ID_IBM,
> >> +                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
> >> +                                   &atsd_region);
> >> +    if (ret) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    /* Some NVLink bridges come without assigned ATSD, skip MR part */
> >> +    if (atsd_region->size) {
> >> +        MemoryRegion *atsd_mr = g_malloc0(sizeof(*atsd_mr));
> >> +
> >> +        p = mmap(NULL, atsd_region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
> >> +                 MAP_SHARED, vdev->vbasedev.fd, atsd_region->offset);
> >> +
> >> +        if (!p) {
> >> +            return -errno;
> >> +        }
> >> +
> >> +        memory_region_init_ram_device_ptr(atsd_mr, OBJECT(vdev),
> >> +                                          "nvlink2-atsd-mr",
> >> +                                          atsd_region->size,
> >> +                                          p);
> >> +    }
> >> +
> >> +    hdr = vfio_get_region_info_cap(atsd_region,
> >> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
> >> +    if (hdr) {
> >> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
> >> +
> >> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
> >> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
> >> +                            (void *) cap->tgt, NULL);
> >> +        trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, cap->tgt,
> >> +                                                  atsd_region->size);
> >> +    }
> >> +
> >> +    hdr = vfio_get_region_info_cap(atsd_region,
> >> +                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
> >> +    if (hdr) {
> >> +        struct vfio_region_info_cap_nvlink2_lnkspd *cap = (void *) hdr;
> >> +
> >> +        object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
> >> +                            vfio_pci_nvlink2_get_link_speed, NULL, NULL,
> >> +                            (void *) (uint64_t) cap->link_speed, NULL);
> >> +        trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
> >> +                                                  cap->link_speed);
> >> +    }
> >> +    g_free(atsd_region);
> >> +
> >> +    return 0;
> >> +}
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index dd12f36..07aa141 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -3069,6 +3069,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >>          goto out_teardown;
> >>      }
> >>  
> >> +    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
> >> +        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
> >> +        if (ret && ret != -ENODEV) {
> >> +            error_report("Failed to setup NVIDIA V100 GPU RAM");
> >> +        }
> >> +    }
> >> +
> >> +    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
> >> +        ret = vfio_pci_nvlink2_init(vdev, errp);
> >> +        if (ret && ret != -ENODEV) {
> >> +            error_report("Failed to setup NVlink2 bridge");
> >> +        }
> >> +    }
> >> +
> >>      vfio_register_err_notifier(vdev);
> >>      vfio_register_req_notifier(vdev);
> >>      vfio_setup_resetfn_quirk(vdev);
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index cf1e886..88841e9 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -87,6 +87,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
> >>  vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
> >>  vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
> >>  
> >> +vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> >> +vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> >> +vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
> >> +
> >>  # hw/vfio/common.c
> >>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
> >>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window
  2019-02-28  5:37         ` Alexey Kardashevskiy
@ 2019-03-05  3:28           ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2019-03-05  3:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Greg Kurz, qemu-devel, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson, Sam Bobroff,
	Piotr Jaroszynski, qemu-ppc,
	Leonardo Augusto Guimarães Garcia

[-- Attachment #1: Type: text/plain, Size: 7607 bytes --]

On Thu, Feb 28, 2019 at 04:37:25PM +1100, Alexey Kardashevskiy wrote:
> On 28/02/2019 14:49, David Gibson wrote:
> > On Thu, Feb 28, 2019 at 10:59:56AM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 28/02/2019 01:33, Greg Kurz wrote:
> >>> On Wed, 27 Feb 2019 19:51:47 +1100
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>
> >>>> On sPAPR vfio_listener_region_add() is called in 2 situations:
> >>>> 1. a new listener is registered from vfio_connect_container();
> >>>> 2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().
> >>>>
> >>>> In both cases vfio_listener_region_add() calls
> >>>> memory_region_iommu_replay() to notify newly registered IOMMU notifiers
> >>>> about existing mappings which is totally desirable for case 1.
> >>>>
> >>>> However for case 2 it is nothing but noop as the window has just been
> >>>> created and has no valid mappings so replaying those does not do anything.
> >>>> It is barely noticeable with usual guests but if the window happens to be
> >>>> really big, such no-op replay might take minutes and trigger RCU stall
> >>>> warnings in the guest.
> >>>>
> >>>> For example, a upcoming GPU RAM memory region mapped at 64TiB (right
> >>>> after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB
> >>>> which is (128<<40)/0x10000=2.147.483.648 TCEs to replay.
> >>>>
> >>>> This mitigates the problem by adding an "skipping_replay" flag to
> >>>> sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does
> >>>> exactly the same thing as the generic one except it returns early if
> >>>> @skipping_replay==true.
> >>>>
> >>>> When "ibm,create-pe-dma-window" is complete, the guest will map only
> >>>> required regions of the huge DMA window.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  include/hw/ppc/spapr.h  |  1 +
> >>>>  hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
> >>>>  hw/ppc/spapr_rtas_ddw.c |  7 +++++++
> >>>>  3 files changed, 39 insertions(+)
> >>>>
> >>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>> index 86b0488..358bb38 100644
> >>>> --- a/include/hw/ppc/spapr.h
> >>>> +++ b/include/hw/ppc/spapr.h
> >>>> @@ -727,6 +727,7 @@ struct sPAPRTCETable {
> >>>>      uint64_t *mig_table;
> >>>>      bool bypass;
> >>>>      bool need_vfio;
> >>>> +    bool skipping_replay;
> >>>>      int fd;
> >>>>      MemoryRegion root;
> >>>>      IOMMUMemoryRegion iommu;
> >>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>>> index 37e98f9..8f23179 100644
> >>>> --- a/hw/ppc/spapr_iommu.c
> >>>> +++ b/hw/ppc/spapr_iommu.c
> >>>> @@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
> >>>>      return ret;
> >>>>  }
> >>>>  
> >>>> +static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >>>> +{
> >>>> +    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
> >>>> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
> >>>> +    hwaddr addr, granularity;
> >>>> +    IOMMUTLBEntry iotlb;
> >>>> +    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
> >>>> +
> >>>> +    if (tcet->skipping_replay) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
> >>>> +
> >>>> +    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> >>>> +        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
> >>>> +        if (iotlb.perm != IOMMU_NONE) {
> >>>> +            n->notify(n, &iotlb);
> >>>> +        }
> >>>> +
> >>>> +        /*
> >>>> +         * if (2^64 - MR size) < granularity, it's possible to get an
> >>>> +         * infinite loop here.  This should catch such a wraparound.
> >>>> +         */
> >>>> +        if ((addr + granularity) < addr) {
> >>>> +            break;
> >>>> +        }
> >>>> +    }
> >>>> +}
> >>>
> >>> It is a bit unfortunate to duplicate all that code. What about making
> >>> a memory_region_iommu_replay_generic() helper out of it and call it
> >>> from spapr_tce_replay() and memory_region_iommu_replay() ?
> >>
> >>
> >> I really do not want to mess with generic code to solve our local sPAPR
> >> problem, especially when there is a way not to do so.
> > 
> > Well, the thing is, I think we're actually the only user of the
> > current generic replay - everything else has more efficient structure
> > aware replay hooks AFAIK.  Which makes this hack even hackier.
> 
> If that so, then we are better off removing that loop from
> memory_region_iommu_replay() at all rather than keeping it generic.

Well.. maybe.  In theory this logic should work, albeit slowly, for
any IOMMU implementation, it's just that everyone else has more
efficient implementations at the moment.

> >> And as a next step, I was thinking of removing (i.e. making this replay
> >> a no-op) from QEMU later and do replay in KVM instead when an IOMMU
> >> group is attaching to KVM as this is the only case when we need replay
> >> and KVM has a lot better idea what TCEs are actually valid and can skip
> >> most of them. This is a bit bigger thing as it requires a KVM capability
> >> "KVM replays mappings" but when we get it, spapr_tce_replay() will
> >> become no-op.
> > 
> > That's a good idea longer term.
> > 
> >>> Apart from that, LGTM.
> >>
> >> Well. It is a hack, I just do not have taste to tell how nasty it is
> >> :)
> > 
> > As an interim step until the kernel change, I think we can do a bit
> > better than this.  First, as Greg suggests we should have the
> > "generic" replay be a helper and have the spapr one call that with a
> > little in the way of extra checking.
> > 
> > Second, rather than having an explicit "skip_replay" flag, what we
> > really want here is to have the replay be a fast no-op if there are no
> > existing mappings rather than a slow no-op.  So instead I think we
> > should have a flag which records if any mappings have been made in the
> > region yet, initialized to false.
> > The new replay would do nothing if
> > it's still false.
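
A minimal sketch of the "fast no-op" idea above, using a hypothetical
any_mapping flag set from QEMU's H_PUT_TCE path (illustration only, not
part of the posted patch; the reply below explains why it breaks down
once KVM handles H_PUT_TCE directly):

    static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
    {
        sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);

        /* Hypothetical flag, set whenever QEMU itself installs a TCE */
        if (!tcet->any_mapping) {
            return;                 /* fast no-op: nothing to replay */
        }
        /*
         * The memory_region_iommu_replay_generic() helper proposed earlier
         * in the thread; falls back to the slow per-page walk.
         */
        memory_region_iommu_replay_generic(iommu_mr, n);
    }
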
> 
> If QEMU controlled the mappings - sure. But it does not - KVM does it
> instead via that fast path. So QEMU does not know if there are mappings
> until it reads all TCEs from mmap'ed KVM TCE table which will fault in
> all these pages.

Oops, I forgot that; good point.

> We could implement some tricks such as allowing reads (or an ioctl) from
> that KVM TCE fd and it could tell what is mapped and what is not in a
> very condensed format (for example a bit per every 256MB of the guest
> address space)  ooooor  implement different behavior for mapping with RW
> or readonly - the latter would fail if there is no backing page
> allocated yet - and then QEMU could skip these regions when replaying.

Urgh, yeah, this is trickier than I initially thought.  About the
cleanest approach I can see is to delay allocation of the IOMMU table
until the first H_PUT_TCE, and only then actually bind the table to
KVM and allow for H_PUT_TCE acceleration.

Seems pretty awkward, though.  Ok, your hack seems the best way
forward in the short to medium term, just make sure there are some
comments there explaining why the hack is valuable.
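
For reference, the comment being asked for could read roughly like this
(wording illustrative only, all facts taken from this thread):

    /*
     * Replaying a just-created DMA window is pointless: the window has no
     * valid mappings yet, QEMU cannot cheaply ask KVM which TCEs are set
     * (reading the mmap'ed TCE table would fault in every backing page),
     * and walking a 128TiB window in 64KiB steps takes minutes and
     * triggers RCU stall warnings in the guest.  The guest maps whatever
     * it needs right after ibm,create-pe-dma-window anyway.
     */
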


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2
  2019-03-05  1:47       ` David Gibson
@ 2019-03-07  2:40         ` Alexey Kardashevskiy
  2019-03-07  3:57           ` David Gibson
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-03-07  2:40 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-ppc, Sam Bobroff, Piotr Jaroszynski,
	Leonardo Augusto Guimarães Garcia, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson



On 05/03/2019 12:47, David Gibson wrote:
> On Thu, Feb 28, 2019 at 05:11:32PM +1100, Alexey Kardashevskiy wrote:
>> On 28/02/2019 14:31, David Gibson wrote:
>>> On Wed, Feb 27, 2019 at 07:51:49PM +1100, Alexey Kardashevskiy wrote:
>>>> NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
>>>> space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
>>>> implements special regions for such GPUs and emulates an NVLink bridge.
>>>> NVLink2-enabled POWER9 CPUs also provide address translation services
>>>> which includes an ATS shootdown (ATSD) register exported via the NVLink
>>>> bridge device.
>>>>
>>>> This adds a quirk to VFIO to map the GPU memory and create an MR;
>>>> the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
>>>> this to get the MR and map it to the system address space.
>>>> Another quirk does the same for ATSD.
>>>>
>>>> This adds additional steps to sPAPR PHB setup:
>>>>
>>>> 1. Search for specific GPUs and NPUs, collect findings in
>>>> sPAPRPHBState::nvgpus, manage system address space mappings;
>>>>
>>>> 2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
>>>> "memory-block", "link-speed" to advertise the NVLink2 function to
>>>> the guest;
>>>>
>>>> 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
>>>>
>>>> 4. Add new memory blocks (with extra "linux,memory-usable" to prevent
>>>> the guest OS from accessing the new memory until it is onlined) and
>>>> npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
>>>> uses it for link discovery.
>>>>
>>>> This allocates space for GPU RAM and ATSD like we do for MMIOs by
>>>> adding 2 new parameters to the phb_placement() hook. Older machine types
>>>> set these to zero.
>>>>
>>>> This puts new memory nodes in a separate NUMA node to replicate the host
>>>> system setup as the GPU driver relies on this.
>>>>
>>>> This adds requirement similar to EEH - one IOMMU group per vPHB.
>>>> The reason for this is that ATSD registers belong to a physical NPU
>>>> so they cannot invalidate translations on GPUs attached to another NPU.
>>>> It is guaranteed by the host platform as it does not mix NVLink bridges
>>>> or GPUs from different NPU in the same IOMMU group. If more than one
>>>> IOMMU group is detected on a vPHB, this disables ATSD support for that
>>>> vPHB and prints a warning.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v3:
>>>> * moved GPU RAM above PCI MMIO limit
>>>> * renamed QOM property to nvlink2-tgt
>>>> * moved nvlink2 code to its own file
>>>>
>>>> ---
>>>>
>>>> The example command line for redbud system:
>>>>
>>>> pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
>>>> -nodefaults \
>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>> -mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
>>>> -enable-kvm -m 384G \
>>>> -chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
>>>> -mon chardev=SOCKET0,mode=control \
>>>> -smp 80,sockets=1,threads=4 \
>>>> -netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
>>>> -device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
>>>> img/vdisk0.img \
>>>> -device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
>>>> -device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
>>>> -device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
>>>> -device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
>>>> -device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
>>>> -device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
>>>> -device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
>>>> -device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
>>>> -device spapr-pci-host-bridge,id=phb1,index=1 \
>>>> -device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
>>>> -device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
>>>> -device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
>>>> -device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
>>>> -device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
>>>> -device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
>>>> -device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
>>>> -device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
>>>> -machine pseries \
>>>> -L /home/aik/t/qemu-ppc64-bios/ -d guest_errors
>>>>
>>>> Note that QEMU attaches PCI devices to the last added vPHB so first
>>>> 8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and
>>>> 35:03:00.0..7:00:01.2 to the vPHB with id=phb1.
>>>> ---
>>>>  hw/ppc/Makefile.objs        |   2 +-
>>>>  hw/vfio/pci.h               |   2 +
>>>>  include/hw/pci-host/spapr.h |  41 ++++
>>>>  include/hw/ppc/spapr.h      |   3 +-
>>>>  hw/ppc/spapr.c              |  29 ++-
>>>>  hw/ppc/spapr_pci.c          |   8 +
>>>>  hw/ppc/spapr_pci_nvlink2.c  | 419 ++++++++++++++++++++++++++++++++++++
>>>>  hw/vfio/pci-quirks.c        | 120 +++++++++++
>>>>  hw/vfio/pci.c               |  14 ++
>>>>  hw/vfio/trace-events        |   4 +
>>>>  10 files changed, 637 insertions(+), 5 deletions(-)
>>>>  create mode 100644 hw/ppc/spapr_pci_nvlink2.c
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index 1111b21..636e717 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>>>>  # IBM PowerNV
>>>>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
>>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>> -obj-y += spapr_pci_vfio.o
>>>> +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
>>>>  endif
>>>>  obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>  # PowerPC 4xx boards
>>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>>> index b1ae4c0..706c304 100644
>>>> --- a/hw/vfio/pci.h
>>>> +++ b/hw/vfio/pci.h
>>>> @@ -194,6 +194,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
>>>>  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>>>>                                 struct vfio_region_info *info,
>>>>                                 Error **errp);
>>>> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
>>>> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
>>>>  
>>>>  void vfio_display_reset(VFIOPCIDevice *vdev);
>>>>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>>>> index ab0e3a0..e791dd4 100644
>>>> --- a/include/hw/pci-host/spapr.h
>>>> +++ b/include/hw/pci-host/spapr.h
>>>> @@ -87,6 +87,9 @@ struct sPAPRPHBState {
>>>>      uint32_t mig_liobn;
>>>>      hwaddr mig_mem_win_addr, mig_mem_win_size;
>>>>      hwaddr mig_io_win_addr, mig_io_win_size;
>>>> +    hwaddr nv2_gpa_win_addr;
>>>> +    hwaddr nv2_atsd_win_addr;
>>>> +    struct spapr_phb_pci_nvgpu_config *nvgpus;
>>>>  };
>>>>  
>>>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>>>> @@ -105,6 +108,23 @@ struct sPAPRPHBState {
>>>>  
>>>>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>>>>  
>>>> +#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
>>>> +#define SPAPR_PCI_NV2RAM64_WIN_SIZE  0x10000000000ULL /* 1 TiB for all 6xGPUs */
>>>
>>> The comments and values below suggest that it is 1TiB for each GPU,
>>> rather than 1TiB shared by all 6.  Which is it?
>>
>>
>> 1TiB for all of them within 1 vPHB. Not sure where it suggests 1TiB for
>> each GPU.
> 
> The fact that NV2ATSD_WIN_BASE is set at 6TiB above NV2RAM64_WIN_BASE
> is what suggested to me that there was one 1TiB window for each of the
> 6 possible GPUs.
> 
>>>> +
>>>> +/* Max number of these GPUs per a physical box */
>>>> +#define NVGPU_MAX_NUM                6
>>>
>>> Is there any possibility later hardware revisions could increase this?
>>> If so we should probably leave some extra room in the address space.
>>
>> A GPU RAM window is 256GiB (and only 32GiB is used), and 3 is the
>> maximum in one group so far. So 1TiB should be enough for quite some
>> time. Having more GPUs in a box is probably possible but for now 6xGPU
>> require water cooling while 4xGPU does not so unless there is a new
>> generation of these GPUs comes out, the numbers won't change much.
> 
> Hm, ok.
> 
>> I'll double SPAPR_PCI_NV2RAM64_WIN_SIZE.
> 
> Um.. I'm not sure how that follows from the above.

1TiB is enough now but 2TiB is more future proof. That was it.
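
Back-of-the-envelope from the figures in this exchange (assumptions, not
the posted code): a GPU RAM window is 256GiB and at most 3 GPUs share one
IOMMU group, i.e. one vPHB, so a vPHB needs at most 3 * 256GiB = 768GiB
today and 1TiB already fits.  Doubling it, e.g.

    #define SPAPR_PCI_NV2RAM64_WIN_SIZE  (2 * TiB) /* hypothetical new value */

leaves headroom for larger per-GPU windows or more GPUs per group later.
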



> 
>>
>>
>>>> +/*
>>>> + * One NVLink bridge provides one ATSD register so it should be 18.
>>>> + * In practice though since we allow only one group per vPHB which equals
>>>> + * to an NPU2 which has maximum 6 NVLink bridges.
>>>> + */
>>>> +#define NVGPU_MAX_ATSD               6
>>>> +
>>>> +#define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
>>>> +                                      SPAPR_PCI_NV2RAM64_WIN_SIZE * \
>>>> +                                      NVGPU_MAX_NUM)
>>>> +#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_ATSD * 0x10000)
>>>
>>> What's the significance of the 64 kiB constant here?  Should it be a
>>> symbolic name, or spelled "64 * kiB".
>>
>> Ok.
> 
> 
> Hmm.  Am I right in thinking that both each 64-bit RAM and each ATSD
> RAM slot is per-vPHB? 

These are the windows from which I allocate the RAM base and the ATSD
address for each GPU/NPU.


> Would it make more sense to directly index into
> the array of slots with the phb index, rather than having a separate
> GPU index?

There can be one or many "slots" per PHB ("many" is not really encouraged
as the extra ones will miss out on ATSD, but it is possible), and "slots"
are not kept in any global list.


>>
>>>
>>>> +
>>>>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>>>>  {
>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>>>> @@ -135,6 +155,11 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
>>>>  int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
>>>>  int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
>>>>  void spapr_phb_vfio_reset(DeviceState *qdev);
>>>> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb);
>>>> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off);
>>>> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);
>>>> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
>>>> +                                        sPAPRPHBState *sphb);
>>>>  #else
>>>>  static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)
>>>>  {
>>>> @@ -161,6 +186,22 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
>>>>  static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>>>>  {
>>>>  }
>>>> +static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
>>>> +{
>>>> +}
>>>> +static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,
>>>> +                                               int bus_off)
>>>> +{
>>>> +}
>>>> +static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,
>>>> +                                                   void *fdt)
>>>> +{
>>>> +}
>>>> +static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
>>>> +                                                      int offset,
>>>> +                                                      sPAPRPHBState *sphb)
>>>> +{
>>>> +}
>>>
>>> I'm guessing some of these should never get called on systems without
>>> NVLink2, in which case they should probably have a
>>> g_assert_not_reached() in there.
>>
>> I guess if you compile QEMU for --target-list=ppc64-softmmu in Windows
>> (i.e. tcg + pseries + pci but no vfio), these will be called and crash
>> then, no?
> 
> Well, if they can be called in that situation then, yes, they need to
> be no-ops like they are now.  But is that true for all of them?
> Hmm.. yes it might be, never mind.
> 
>>>
>>>>  #endif
>>>>  
>>>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb);
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index 358bb38..9acf867 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -113,7 +113,8 @@ struct sPAPRMachineClass {
>>>>      void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
>>>>                            uint64_t *buid, hwaddr *pio, 
>>>>                            hwaddr *mmio32, hwaddr *mmio64,
>>>> -                          unsigned n_dma, uint32_t *liobns, Error **errp);
>>>> +                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
>>>> +                          hwaddr *nv2atsd, Error **errp);
>>>>      sPAPRResizeHPT resize_hpt_default;
>>>>      sPAPRCapabilities default_caps;
>>>>      sPAPRIrq *irq;
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index 74c9b07..fda6e7e 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -3929,7 +3929,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>>>>      smc->phb_placement(spapr, sphb->index,
>>>>                         &sphb->buid, &sphb->io_win_addr,
>>>>                         &sphb->mem_win_addr, &sphb->mem64_win_addr,
>>>> -                       windows_supported, sphb->dma_liobn, errp);
>>>> +                       windows_supported, sphb->dma_liobn,
>>>> +                       &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
>>>> +                       errp);
>>>>  }
>>>>  
>>>>  static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>>>> @@ -4129,7 +4131,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
>>>>  static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>>>>                                  uint64_t *buid, hwaddr *pio,
>>>>                                  hwaddr *mmio32, hwaddr *mmio64,
>>>> -                                unsigned n_dma, uint32_t *liobns, Error **errp)
>>>> +                                unsigned n_dma, uint32_t *liobns,
>>>> +                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>>>>  {
>>>>      /*
>>>>       * New-style PHB window placement.
>>>> @@ -4174,6 +4177,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>>>>      *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
>>>>      *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
>>>>      *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
>>>> +
>>>
>>> This doesn't look right.  SPAPR_PCI_NV2ATSD_WIN_BASE appears to be
>>> defined such that there slots for NVGPU_MAX_NUM gpa "slots" of size
>>> SPAPR_PCI_NV2RAM64_WIN_SIZE before we get to the ATSD base.
>>>
>>>> +    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
>>>
>>> But this implies you need a "slot" for every possible PHB index, which
>>> is rather more than NVGPU_MAX_NUM.
>>>
>>>> +    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
>>
>>
>> Ah right :( These should go then above 128TiB I guess as I do not really
>> want them to appear inside a huge dma window.
> 
> Right.  So actually looks like you are already indexing the window
> slots by phb index, in which case you need to allow for 32 slots even
> though only 6 can be populated at the moment.


Why precisely 32? Round up of 18?
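
For reference, sizing for one RAM slot per possible PHB index (the 32
quoted in the review comment above) instead of per GPU would look roughly
like this; macro names and values are illustrative, not the posted ones:

    #define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
    #define SPAPR_PCI_NV2RAM64_WIN_SIZE  (2 * TiB)
    /* ATSD space sits above a RAM slot for every index, not just 6 GPUs */
    #define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
                                          32 * SPAPR_PCI_NV2RAM64_WIN_SIZE)
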



>>>>  }
>>>>  
>>>>  static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
>>>> @@ -4376,6 +4382,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
>>>>  /*
>>>>   * pseries-3.1
>>>>   */
>>>> +static void phb_placement_3_1(sPAPRMachineState *spapr, uint32_t index,
>>>> +                              uint64_t *buid, hwaddr *pio,
>>>> +                              hwaddr *mmio32, hwaddr *mmio64,
>>>> +                              unsigned n_dma, uint32_t *liobns,
>>>> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>>>> +{
>>>> +    spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
>>>> +                        nv2gpa, nv2atsd, errp);
>>>> +    *nv2gpa = 0;
>>>> +    *nv2atsd = 0;
>>>> +}
>>>> +
>>>>  static void spapr_machine_3_1_class_options(MachineClass *mc)
>>>>  {
>>>>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
>>>> @@ -4391,6 +4409,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
>>>>      mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power8_v2.0");
>>>>      smc->update_dt_enabled = false;
>>>>      smc->dr_phb_enabled = false;
>>>> +    smc->phb_placement = phb_placement_3_1;
>>>>  }
>>>>  
>>>>  DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
>>>> @@ -4522,7 +4541,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
>>>>  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>>>>                                uint64_t *buid, hwaddr *pio,
>>>>                                hwaddr *mmio32, hwaddr *mmio64,
>>>> -                              unsigned n_dma, uint32_t *liobns, Error **errp)
>>>> +                              unsigned n_dma, uint32_t *liobns,
>>>> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>>>>  {
>>>>      /* Legacy PHB placement for pseries-2.7 and earlier machine types */
>>>>      const uint64_t base_buid = 0x800000020000000ULL;
>>>> @@ -4566,6 +4586,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>>>>       * fallback behaviour of automatically splitting a large "32-bit"
>>>>       * window into contiguous 32-bit and 64-bit windows
>>>>       */
>>>> +
>>>> +    *nv2gpa = 0;
>>>> +    *nv2atsd = 0;
>>>>  }
>>>>  
>>>>  static void spapr_machine_2_7_class_options(MachineClass *mc)
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 06a5ffd..f076462 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -1355,6 +1355,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
>>>>      if (sphb->pcie_ecs && pci_is_express(dev)) {
>>>>          _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
>>>>      }
>>>> +
>>>> +    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
>>>>  }
>>>>  
>>>>  /* create OF node for pci device and required OF DT properties */
>>>> @@ -1878,6 +1880,7 @@ static void spapr_phb_reset(DeviceState *qdev)
>>>>      sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
>>>>  
>>>>      spapr_phb_dma_reset(sphb);
>>>> +    spapr_phb_nvgpu_setup(sphb);
>>>>  
>>>>      /* Reset the IOMMU state */
>>>>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
>>>> @@ -1910,6 +1913,8 @@ static Property spapr_phb_properties[] = {
>>>>                       pre_2_8_migration, false),
>>>>      DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,
>>>>                       pcie_ecs, true),
>>>> +    DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),
>>>> +    DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> @@ -2282,6 +2287,9 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb, uint32_t intc_phandle, void *fdt,
>>>>          return ret;
>>>>      }
>>>>  
>>>> +    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off);
>>>> +    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
>>>> +
>>>>      return 0;
>>>>  }
>>>>  
>>>> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
>>>> new file mode 100644
>>>> index 0000000..965a6be
>>>> --- /dev/null
>>>> +++ b/hw/ppc/spapr_pci_nvlink2.c
>>>> @@ -0,0 +1,419 @@
>>>> +/*
>>>> + * QEMU sPAPR PCI for NVLink2 pass through
>>>> + *
>>>> + * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
>>>> + *
>>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>>> + * of this software and associated documentation files (the "Software"), to deal
>>>> + * in the Software without restriction, including without limitation the rights
>>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>>> + * copies of the Software, and to permit persons to whom the Software is
>>>> + * furnished to do so, subject to the following conditions:
>>>> + *
>>>> + * The above copyright notice and this permission notice shall be included in
>>>> + * all copies or substantial portions of the Software.
>>>> + *
>>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>>> + * THE SOFTWARE.
>>>> + */
>>>> +#include "qemu/osdep.h"
>>>> +#include "qapi/error.h"
>>>> +#include "qemu-common.h"
>>>> +#include "hw/pci/pci.h"
>>>> +#include "hw/pci-host/spapr.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "hw/ppc/fdt.h"
>>>> +#include "hw/pci/pci_bridge.h"
>>>> +
>>>> +#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
>>>> +                                     (((phb)->index) << 16) | ((pdev)->devfn))
>>>> +#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
>>>> +                                     (((phb)->index) << 16))
>>>> +/* NVLink2 wants a separate NUMA node for its RAM */
>>>> +#define GPURAM_ASSOCIATIVITY(phb, n) (255 - ((phb)->index * 3 + (n)))
>>>> +#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
>>>> +                                     ((gn) << 4) | (nn))
>>>> +
>>>> +/* Max number of NVLinks per GPU in any physical box */
>>>> +#define NVGPU_MAX_LINKS              3
>>>> +
>>>> +struct spapr_phb_pci_nvgpu_config {
>>>> +    uint64_t nv2_ram_current;
>>>> +    uint64_t nv2_atsd_current;
>>>> +    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
>>>> +    struct spapr_phb_pci_nvgpu_slot {
>>>> +        uint64_t tgt;
>>>> +        uint64_t gpa;
>>>> +        PCIDevice *gpdev;
>>>> +        int linknum;
>>>> +        struct {
>>>> +            uint64_t atsd_gpa;
>>>> +            PCIDevice *npdev;
>>>> +            uint32_t link_speed;
>>>> +        } links[NVGPU_MAX_LINKS];
>>>> +    } slots[NVGPU_MAX_NUM];
>>>> +};
>>>> +
>>>> +static int spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
>>>> +                                    uint64_t tgt)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    /* Search for partially collected "slot" */
>>>> +    for (i = 0; i < nvgpus->num; ++i) {
>>>> +        if (nvgpus->slots[i].tgt == tgt) {
>>>> +            return i;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
>>>> +        warn_report("Found too many NVLink bridges per GPU");
>>>> +        return -1;
>>>
>>> This is within qemu so it would be better to use the qemu error API
>>> than returning an error code.
>>
>> You mean returning Error**? Oh. Ok.
> 
> Well, not returning, technically, but taking an Error ** parameter
> which is checked by the caller to detect errors.


None of these is actually propagated to the upper level as none of them
is fatal (well, except the one which I am turning into an assert).


> 
>>>
>>>> +    }
>>>> +
>>>> +    i = nvgpus->num;
>>>> +    nvgpus->slots[i].tgt = tgt;
>>>> +    ++nvgpus->num;
>>>> +
>>>> +    return i;
>>>
>>> Might be nicer to return a pointer to the slot structure.
>>
>>
>> This can work.
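
A sketch of the pointer-returning variant agreed on here (illustrative
only, derived from the hunk quoted above; the respin may differ).  Callers
would then test for NULL instead of a negative index:

    static struct spapr_phb_pci_nvgpu_slot *
    spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
                             uint64_t tgt)
    {
        int i;

        /* Search for a partially collected "slot" */
        for (i = 0; i < nvgpus->num; ++i) {
            if (nvgpus->slots[i].tgt == tgt) {
                return &nvgpus->slots[i];
            }
        }

        if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
            warn_report("Found too many NVLink bridges per GPU");
            return NULL;
        }

        /* Start a new slot for this target address */
        nvgpus->slots[nvgpus->num].tgt = tgt;
        return &nvgpus->slots[nvgpus->num++];
    }
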
>>
>>
>>>
>>>> +}
>>>> +
>>>> +static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
>>>> +                                    PCIDevice *pdev, uint64_t tgt,
>>>> +                                    MemoryRegion *mr)
>>>> +{
>>>> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt);
>>>> +
>>>> +    if (i < 0) {
>>>> +        return;
>>>> +    }
>>>> +    g_assert(!nvgpus->slots[i].gpdev);
>>>> +    nvgpus->slots[i].gpdev = pdev;
>>>> +
>>>> +    nvgpus->slots[i].gpa = nvgpus->nv2_ram_current;
>>>> +    nvgpus->nv2_ram_current += memory_region_size(mr);
>>>> +}
>>>> +
>>>> +static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
>>>> +                                    PCIDevice *pdev, uint64_t tgt,
>>>> +                                    MemoryRegion *mr)
>>>> +{
>>>> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt), j;
>>>> +    struct spapr_phb_pci_nvgpu_slot *nvslot;
>>>> +
>>>> +    if (i < 0) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    nvslot = &nvgpus->slots[i];
>>>> +    j = nvslot->linknum;
>>>> +    if (j == ARRAY_SIZE(nvslot->links)) {
>>>> +        warn_report("Found too many NVLink2 bridges");
>>>> +        return;
>>>> +    }
>>>> +    ++nvslot->linknum;
>>>> +
>>>> +    g_assert(!nvslot->links[j].npdev);
>>>> +    nvslot->links[j].npdev = pdev;
>>>> +    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
>>>> +    nvgpus->nv2_atsd_current += memory_region_size(mr);
>>>> +    nvslot->links[j].link_speed =
>>>> +        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
>>>> +}
>>>> +
>>>> +static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
>>>> +                                        void *opaque)
>>>> +{
>>>> +    PCIBus *sec_bus;
>>>> +    Object *po = OBJECT(pdev);
>>>> +    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
>>>> +
>>>> +    if (tgt) {
>>>> +        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
>>>> +        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
>>>> +                                                  NULL);
>>>> +
>>>> +        if (mr_gpu) {
>>>> +            spapr_pci_collect_nvgpu(opaque, pdev, tgt, MEMORY_REGION(mr_gpu));
>>>> +        } else if (mr_npu) {
>>>> +            spapr_pci_collect_nvnpu(opaque, pdev, tgt, MEMORY_REGION(mr_npu));
>>>> +        } else {
>>>> +            warn_report("Unexpected device with \"nvlink2-tgt\"");
>>>
>>> IIUC this would have to be a code error, so should be an assert() not
>>> a warning.
>>
>>
>> Ok.
>>
>>>
>>>> +        }
>>>> +    }
>>>> +    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
>>>> +         PCI_HEADER_TYPE_BRIDGE)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
>>>> +    if (!sec_bus) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
>>>> +                        spapr_phb_pci_collect_nvgpu, opaque);
>>>> +}
>>>> +
>>>> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
>>>> +{
>>>> +    int i, j, valid_gpu_num;
>>>> +
>>>> +    /* If there are existing NVLink2 MRs, unmap those before recreating */
>>>> +    if (sphb->nvgpus) {
>>>> +        for (i = 0; i < sphb->nvgpus->num; ++i) {
>>>> +            struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>> +            Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
>>>> +                                                        "nvlink2-mr[0]", NULL);
>>>> +
>>>> +            if (nv_mrobj) {
>>>> +                memory_region_del_subregion(get_system_memory(),
>>>> +                                            MEMORY_REGION(nv_mrobj));
>>>> +            }
>>>> +            for (j = 0; j < nvslot->linknum; ++j) {
>>>> +                PCIDevice *npdev = nvslot->links[j].npdev;
>>>> +                Object *atsd_mrobj;
>>>> +                atsd_mrobj = object_property_get_link(OBJECT(npdev),
>>>> +                                                      "nvlink2-atsd-mr[0]",
>>>> +                                                      NULL);
>>>> +                if (atsd_mrobj) {
>>>> +                    memory_region_del_subregion(get_system_memory(),
>>>> +                                                MEMORY_REGION(atsd_mrobj));
>>>> +                }
>>>> +            }
>>>> +        }
>>>> +        g_free(sphb->nvgpus);
>>>
>>> Probably worth collecting the above into a nvgpu_free() helper -
>>> chances are you'll want it on cleanup paths as well.
>>
>> The only other cleanup path is below and it only executes if there is no
>> MR added so for now it does not seem useful.
> 
> Hrm... I've merged PHB hotplug recently.. so there should be a cleanup
> path for unplug as well.


ah right. Wooohooo :) btw with phb hotplug we can try supporting EEH on
hotplugged VFIO devices.
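
For reference, the cleanup helper suggested above could fold the teardown
loop quoted here into something like this (sketch only, not the respin),
so the same code can serve spapr_phb_nvgpu_setup() and the PHB unplug
path:

    static void spapr_phb_nvgpu_free(sPAPRPHBState *sphb)
    {
        int i, j;

        if (!sphb->nvgpus) {
            return;
        }

        for (i = 0; i < sphb->nvgpus->num; ++i) {
            struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];

            /* Slots collected from NPUs only may have no GPU attached */
            if (nvslot->gpdev) {
                Object *nv_mrobj =
                    object_property_get_link(OBJECT(nvslot->gpdev),
                                             "nvlink2-mr[0]", NULL);

                if (nv_mrobj) {
                    memory_region_del_subregion(get_system_memory(),
                                                MEMORY_REGION(nv_mrobj));
                }
            }
            for (j = 0; j < nvslot->linknum; ++j) {
                PCIDevice *npdev = nvslot->links[j].npdev;
                Object *atsd_mrobj =
                    object_property_get_link(OBJECT(npdev),
                                             "nvlink2-atsd-mr[0]", NULL);

                if (atsd_mrobj) {
                    memory_region_del_subregion(get_system_memory(),
                                                MEMORY_REGION(atsd_mrobj));
                }
            }
        }
        g_free(sphb->nvgpus);
        sphb->nvgpus = NULL;
    }
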


>>
>>
>>>> +        sphb->nvgpus = NULL;
>>>> +    }
>>>> +
>>>> +    /* Search for GPUs and NPUs */
>>>> +    if (sphb->nv2_gpa_win_addr && sphb->nv2_atsd_win_addr) {
>>>> +        PCIBus *bus = PCI_HOST_BRIDGE(sphb)->bus;
>>>> +
>>>> +        sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
>>>> +        sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
>>>> +        sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
>>>> +
>>>> +        pci_for_each_device(bus, pci_bus_num(bus),
>>>> +                            spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
>>>> +    }
>>>> +
>>>> +    /* Add found GPU RAM and ATSD MRs if found */
>>>> +    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
>>>> +        Object *nvmrobj;
>>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>> +
>>>> +        if (!nvslot->gpdev) {
>>>> +            continue;
>>>> +        }
>>>> +        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
>>>> +                                           "nvlink2-mr[0]", NULL);
>>>> +        /* ATSD is pointless without GPU RAM MR so skip those */
>>>> +        if (!nvmrobj) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        ++valid_gpu_num;
>>>> +        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
>>>> +                                    MEMORY_REGION(nvmrobj));
>>>> +
>>>> +        for (j = 0; j < nvslot->linknum; ++j) {
>>>> +            Object *atsdmrobj;
>>>> +
>>>> +            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
>>>> +                                                 "nvlink2-atsd-mr[0]",
>>>> +                                                 NULL);
>>>> +            if (!atsdmrobj) {
>>>> +                continue;
>>>> +            }
>>>> +            memory_region_add_subregion(get_system_memory(),
>>>> +                                        nvslot->links[j].atsd_gpa,
>>>> +                                        MEMORY_REGION(atsdmrobj));
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (!valid_gpu_num) {
>>>> +        /* We did not find any interesting GPU */
>>>> +        g_free(sphb->nvgpus);
>>>> +        sphb->nvgpus = NULL;
>>>> +    }
>>>> +}
>>>> +
>>>> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off)
>>>> +{
>>>> +    int i, j, atsdnum = 0;
>>>> +    uint64_t atsd[8]; /* The existing limitation of known guests */
>>>> +
>>>> +    if (!sphb->nvgpus) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
>>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>> +
>>>> +        if (!nvslot->gpdev) {
>>>> +            continue;
>>>> +        }
>>>> +        for (j = 0; j < nvslot->linknum; ++j) {
>>>> +            if (!nvslot->links[j].atsd_gpa) {
>>>> +                continue;
>>>> +            }
>>>> +
>>>> +            if (atsdnum == ARRAY_SIZE(atsd)) {
>>>> +                warn_report("Only %ld ATSD registers allowed",
>>>> +                            ARRAY_SIZE(atsd));
>>>
>>> Probably should be an error not a warning.
>>
>> We can still continue though, it is not fatal. These things come from
>> skiboot (which we control) but skiboot itself could compose the
>> properties itself or use whatever hostboot provided (does not happen now
>> though) and I would not like to be blocked by hostboot if/when this happens.
> 
> Um.. what?  atsdnum is just a counter incremented below, it doesn't
> come from skiboot or any other host-significant value.  The situation
> here is that we have more nvlinks assigned to a guest than qemu can
> support.  Yes, you could technically run the guest with some of the
> links unavailable, but that seems pretty clearly not what the user
> wanted.  Hence, an error is appropriate.


Not exactly. NVLinks are available whether they come with an ATSD VFIO
region or not; it was my choice to accompany ATSD with an NVLink2 bridge.
So it is quite possible to pass way too many links, and yes, QEMU won't
expose all the accompanying ATSDs to the guest, but 1) the guest might not
need this many ATSDs anyway (right now the NVIDIA driver always uses just
one and nobody has complained about performance) and 2) an nvlink is
functional as long as the guest can access its config space.



> 
>>
>>>> +                break;
>>>> +            }
>>>> +            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
>>>> +            ++atsdnum;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (!atsdnum) {
>>>> +        warn_report("No ATSD registers found");
>>>> +    } else if (!spapr_phb_eeh_available(sphb)) {
>>>> +        /*
>>>> +         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
>>>> +         * which we do not emulate as a separate device. Instead we put
>>>> +         * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not
>>>> +         * put GPUs from different IOMMU groups to the same vPHB to ensure
>>>> +         * that the guest will use ATSDs from the corresponding NPU.
>>>> +         */
>>>> +        warn_report("ATSD requires separate vPHB per GPU IOMMU group");
>>>> +    } else {
>>>> +        _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd",
>>>> +                          atsd, atsdnum * sizeof(atsd[0]))));
>>>> +    }
>>>> +}
>>>> +
>>>> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)
>>>> +{
>>>> +    int i, j, linkidx, npuoff;
>>>> +    char *npuname;
>>>> +
>>>> +    if (!sphb->nvgpus) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    npuname = g_strdup_printf("npuphb%d", sphb->index);
>>>> +    npuoff = fdt_add_subnode(fdt, 0, npuname);
>>>> +    _FDT(npuoff);
>>>> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
>>>> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
>>>> +    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
>>>> +    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
>>>> +    g_free(npuname);
>>>> +
>>>> +    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
>>>> +        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
>>>> +            char *linkname = g_strdup_printf("link@%d", linkidx);
>>>> +            int off = fdt_add_subnode(fdt, npuoff, linkname);
>>>> +
>>>> +            _FDT(off);
>>>> +            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx)));
>>>> */
>>>
>>> Are the indices you're using for 'reg' and the unit name arbitrary?
>>> If so it's generally best to base them on some static property of the
>>> device, rather than just allocating sequentially.
>>
>> On the host reg is the link index. Here it is actually commented out as
>> a reminder for the future.
>>
>>>
>>>> +            _FDT((fdt_setprop_string(fdt, off, "compatible",
>>>> +                                     "ibm,npu-link")));
>>>> +            _FDT((fdt_setprop_cell(fdt, off, "phandle",
>>>> +                                   PHANDLE_NVLINK(sphb, i, j))));
>>>> +            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));
>>>
>>> Why do you need the index here as well as in reg?
>>
>> I do not need "reg" really and I need index for this:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/npu-dma.c?h=v4.20#n692
> 
> 
> Ok, because of a silly binding.  That's a good enough reason.
> 
>>>> +            g_free(linkname);
>>>> +            ++linkidx;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Add memory nodes for GPU RAM and mark them unusable */
>>>> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
>>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>> +        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
>>>> +                                                    "nvlink2-mr[0]", NULL);
>>>> +        uint32_t at = cpu_to_be32(GPURAM_ASSOCIATIVITY(sphb, i));
>>>> +        uint32_t associativity[] = { cpu_to_be32(0x4), at, at, at, at };
>>>> +        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
>>>> +        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
>>>> +        char *mem_name = g_strdup_printf("memory@%lx", nvslot->gpa);
>>>> +        int off = fdt_add_subnode(fdt, 0, mem_name);
>>>> +
>>>> +        _FDT(off);
>>>> +        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
>>>> +        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
>>>> +        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
>>>> +                          sizeof(associativity))));
>>>> +
>>>> +        _FDT((fdt_setprop_string(fdt, off, "compatible",
>>>> +                                 "ibm,coherent-device-memory")));
>>>> +
>>>> +        mem_reg[1] = cpu_to_be64(0);
>>>> +        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
>>>> +                          sizeof(mem_reg))));
>>>> +        _FDT((fdt_setprop_cell(fdt, off, "phandle",
>>>> +                               PHANDLE_GPURAM(sphb, i))));
>>>> +        g_free(mem_name);
>>>> +    }
>>>> +
>>>> +}
>>>> +
>>>> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
>>>> +                                        sPAPRPHBState *sphb)
>>>> +{
>>>> +    int i, j;
>>>> +
>>>> +    if (!sphb->nvgpus) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
>>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>> +
>>>> +        /* Skip "slot" without attached GPU */
>>>
>>> IIUC a "slot" should always have at least one GPU.  You need to handle
>>> the case of an unitialized GPU in the "collect" functions because you
>>> don't know if you'll discover the GPU or an NPU first.  But here not
>>> having a GPU should be an error, shouldn't it?
>>
>>
>> If someone decides to pass 1 GPU with all related nvlinks and just
>> nvlinks from another GPU but without related GPU for whatever reason,
>> should we really stop him/her? Things won't work exactly at their best
>> but this still might be useful for weird debugging.
> 
> Hm, ok, I guess so.
> 
>>>> +        if (!nvslot->gpdev) {
>>>> +            continue;
>>>> +        }
>>>> +        if (dev == nvslot->gpdev) {
>>>> +            uint32_t npus[nvslot->linknum];
>>>> +
>>>> +            for (j = 0; j < nvslot->linknum; ++j) {
>>>> +                PCIDevice *npdev = nvslot->links[j].npdev;
>>>> +
>>>> +                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
>>>> +            }
>>>> +            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
>>>> +                             j * sizeof(npus[0])));
>>>> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
>>>> +                                   PHANDLE_PCIDEV(sphb, dev))));
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        for (j = 0; j < nvslot->linknum; ++j) {
>>>> +            if (dev != nvslot->links[j].npdev) {
>>>> +                continue;
>>>> +            }
>>>> +
>>>> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
>>>> +                                   PHANDLE_PCIDEV(sphb, dev))));
>>>> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
>>>> +                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
>>>> +            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
>>>> +                                   PHANDLE_NVLINK(sphb, i, j))));
>>>> +            /*
>>>> +             * If we ever want to emulate GPU RAM at the same location as on
>>>> +             * the host - here is the encoding GPA->TGT:
>>>> +             *
>>>> +             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
>>>> +             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
>>>> +             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
>>>> +             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
>>>> +             */
>>>> +            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
>>>> +                                  PHANDLE_GPURAM(sphb, i)));
>>>> +            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
>>>> +                                 nvslot->tgt));
>>>> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
>>>> +                                  nvslot->links[j].link_speed));
>>>> +        }
>>>> +    }
>>>> +}
>>>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>>>> index 40a1200..15ec0b4 100644
>>>> --- a/hw/vfio/pci-quirks.c
>>>> +++ b/hw/vfio/pci-quirks.c
>>>> @@ -2180,3 +2180,123 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
>>>>  
>>>>      return 0;
>>>>  }
>>>> +
>>>> +static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
>>>> +                                     const char *name,
>>>> +                                     void *opaque, Error **errp)
>>>> +{
>>>> +    uint64_t tgt = (uint64_t) opaque;
>>>> +    visit_type_uint64(v, name, &tgt, errp);
>>>> +}
>>>> +
>>>> +static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
>>>> +                                                 const char *name,
>>>> +                                                 void *opaque, Error **errp)
>>>> +{
>>>> +    uint32_t link_speed = (uint32_t)(uint64_t) opaque;
>>>> +    visit_type_uint32(v, name, &link_speed, errp);
>>>> +}
>>>> +
>>>> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
>>>> +{
>>>> +    int ret;
>>>> +    void *p;
>>>> +    struct vfio_region_info *nv2region = NULL;
>>>> +    struct vfio_info_cap_header *hdr;
>>>> +    MemoryRegion *nv2mr = g_malloc0(sizeof(*nv2mr));
>>>> +
>>>> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
>>>> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
>>>> +                                   PCI_VENDOR_ID_NVIDIA,
>>>> +                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
>>>> +                                   &nv2region);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    p = mmap(NULL, nv2region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
>>>> +             MAP_SHARED, vdev->vbasedev.fd, nv2region->offset);
>>>> +
>>>> +    if (!p) {
>>>> +        return -errno;
>>>> +    }
>>>> +
>>>> +    memory_region_init_ram_ptr(nv2mr, OBJECT(vdev), "nvlink2-mr",
>>>> +                               nv2region->size, p);
>>>> +
>>>> +    hdr = vfio_get_region_info_cap(nv2region,
>>>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
>>>> +    if (hdr) {
>>>> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
>>>> +
>>>> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
>>>> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
>>>> +                            (void *) cap->tgt, NULL);
>>>> +        trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
>>>> +                                              nv2region->size);
>>>> +    }
>>>> +    g_free(nv2region);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
>>>> +{
>>>> +    int ret;
>>>> +    void *p;
>>>> +    struct vfio_region_info *atsd_region = NULL;
>>>> +    struct vfio_info_cap_header *hdr;
>>>> +
>>>> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
>>>> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
>>>> +                                   PCI_VENDOR_ID_IBM,
>>>> +                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
>>>> +                                   &atsd_region);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    /* Some NVLink bridges come without assigned ATSD, skip MR part */
>>>> +    if (atsd_region->size) {
>>>> +        MemoryRegion *atsd_mr = g_malloc0(sizeof(*atsd_mr));
>>>> +
>>>> +        p = mmap(NULL, atsd_region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
>>>> +                 MAP_SHARED, vdev->vbasedev.fd, atsd_region->offset);
>>>> +
>>>> +        if (!p) {
>>>> +            return -errno;
>>>> +        }
>>>> +
>>>> +        memory_region_init_ram_device_ptr(atsd_mr, OBJECT(vdev),
>>>> +                                          "nvlink2-atsd-mr",
>>>> +                                          atsd_region->size,
>>>> +                                          p);
>>>> +    }
>>>> +
>>>> +    hdr = vfio_get_region_info_cap(atsd_region,
>>>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
>>>> +    if (hdr) {
>>>> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
>>>> +
>>>> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
>>>> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
>>>> +                            (void *) cap->tgt, NULL);
>>>> +        trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, cap->tgt,
>>>> +                                                  atsd_region->size);
>>>> +    }
>>>> +
>>>> +    hdr = vfio_get_region_info_cap(atsd_region,
>>>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
>>>> +    if (hdr) {
>>>> +        struct vfio_region_info_cap_nvlink2_lnkspd *cap = (void *) hdr;
>>>> +
>>>> +        object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
>>>> +                            vfio_pci_nvlink2_get_link_speed, NULL, NULL,
>>>> +                            (void *) (uint64_t) cap->link_speed, NULL);
>>>> +        trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
>>>> +                                                  cap->link_speed);
>>>> +    }
>>>> +    g_free(atsd_region);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index dd12f36..07aa141 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3069,6 +3069,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>          goto out_teardown;
>>>>      }
>>>>  
>>>> +    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
>>>> +        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
>>>> +        if (ret && ret != -ENODEV) {
>>>> +            error_report("Failed to setup NVIDIA V100 GPU RAM");
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
>>>> +        ret = vfio_pci_nvlink2_init(vdev, errp);
>>>> +        if (ret && ret != -ENODEV) {
>>>> +            error_report("Failed to setup NVlink2 bridge");
>>>> +        }
>>>> +    }
>>>> +
>>>>      vfio_register_err_notifier(vdev);
>>>>      vfio_register_req_notifier(vdev);
>>>>      vfio_setup_resetfn_quirk(vdev);
>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>> index cf1e886..88841e9 100644
>>>> --- a/hw/vfio/trace-events
>>>> +++ b/hw/vfio/trace-events
>>>> @@ -87,6 +87,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
>>>>  vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
>>>>  vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
>>>>  
>>>> +vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
>>>> +vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
>>>> +vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
>>>> +
>>>>  # hw/vfio/common.c
>>>>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
>>>>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
>>>
>>
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2
  2019-03-07  2:40         ` Alexey Kardashevskiy
@ 2019-03-07  3:57           ` David Gibson
  2019-03-07  4:32             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: David Gibson @ 2019-03-07  3:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: qemu-devel, qemu-ppc, Sam Bobroff, Piotr Jaroszynski,
	Leonardo Augusto Guimarães Garcia, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 52641 bytes --]

On Thu, Mar 07, 2019 at 01:40:33PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 05/03/2019 12:47, David Gibson wrote:
> > On Thu, Feb 28, 2019 at 05:11:32PM +1100, Alexey Kardashevskiy wrote:
> >> On 28/02/2019 14:31, David Gibson wrote:
> >>> On Wed, Feb 27, 2019 at 07:51:49PM +1100, Alexey Kardashevskiy wrote:
> >>>> NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
> >>>> space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
> >>>> implements special regions for such GPUs and emulates an NVLink bridge.
> >>>> NVLink2-enabled POWER9 CPUs also provide address translation services
> >>>> which includes an ATS shootdown (ATSD) register exported via the NVLink
> >>>> bridge device.
> >>>>
> >>>> This adds a quirk to VFIO to map the GPU memory and create an MR;
> >>>> the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
> >>>> this to get the MR and map it to the system address space.
> >>>> Another quirk does the same for ATSD.
> >>>>
> >>>> This adds additional steps to sPAPR PHB setup:
> >>>>
> >>>> 1. Search for specific GPUs and NPUs, collect findings in
> >>>> sPAPRPHBState::nvgpus, manage system address space mappings;
> >>>>
> >>>> 2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
> >>>> "memory-block", "link-speed" to advertise the NVLink2 function to
> >>>> the guest;
> >>>>
> >>>> 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
> >>>>
> >>>> 4. Add new memory blocks (with extra "linux,memory-usable" to prevent
> >>>> the guest OS from accessing the new memory until it is onlined) and
> >>>> npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
> >>>> uses it for link discovery.
> >>>>
> >>>> This allocates space for GPU RAM and ATSD like we do for MMIOs by
> >>>> adding 2 new parameters to the phb_placement() hook. Older machine types
> >>>> set these to zero.
> >>>>
> >>>> This puts new memory nodes in a separate NUMA node to replicate the host
> >>>> system setup as the GPU driver relies on this.
> >>>>
> >>>> This adds requirement similar to EEH - one IOMMU group per vPHB.
> >>>> The reason for this is that ATSD registers belong to a physical NPU
> >>>> so they cannot invalidate translations on GPUs attached to another NPU.
> >>>> It is guaranteed by the host platform as it does not mix NVLink bridges
> >>>> or GPUs from different NPU in the same IOMMU group. If more than one
> >>>> IOMMU group is detected on a vPHB, this disables ATSD support for that
> >>>> vPHB and prints a warning.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>> Changes:
> >>>> v3:
> >>>> * moved GPU RAM above PCI MMIO limit
> >>>> * renamed QOM property to nvlink2-tgt
> >>>> * moved nvlink2 code to its own file
> >>>>
> >>>> ---
> >>>>
> >>>> The example command line for redbud system:
> >>>>
> >>>> pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
> >>>> -nodefaults \
> >>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
> >>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> >>>> -mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
> >>>> -enable-kvm -m 384G \
> >>>> -chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
> >>>> -mon chardev=SOCKET0,mode=control \
> >>>> -smp 80,sockets=1,threads=4 \
> >>>> -netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
> >>>> -device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
> >>>> img/vdisk0.img \
> >>>> -device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
> >>>> -device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
> >>>> -device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
> >>>> -device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
> >>>> -device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
> >>>> -device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
> >>>> -device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
> >>>> -device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
> >>>> -device spapr-pci-host-bridge,id=phb1,index=1 \
> >>>> -device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
> >>>> -device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
> >>>> -device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
> >>>> -device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
> >>>> -device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
> >>>> -device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
> >>>> -device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
> >>>> -device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
> >>>> -machine pseries \
> >>>> -L /home/aik/t/qemu-ppc64-bios/ -d guest_errors
> >>>>
> >>>> Note that QEMU attaches PCI devices to the last added vPHB so first
> >>>> 8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and
> >>>> 35:03:00.0..7:00:01.2 to the vPHB with id=phb1.
> >>>> ---
> >>>>  hw/ppc/Makefile.objs        |   2 +-
> >>>>  hw/vfio/pci.h               |   2 +
> >>>>  include/hw/pci-host/spapr.h |  41 ++++
> >>>>  include/hw/ppc/spapr.h      |   3 +-
> >>>>  hw/ppc/spapr.c              |  29 ++-
> >>>>  hw/ppc/spapr_pci.c          |   8 +
> >>>>  hw/ppc/spapr_pci_nvlink2.c  | 419 ++++++++++++++++++++++++++++++++++++
> >>>>  hw/vfio/pci-quirks.c        | 120 +++++++++++
> >>>>  hw/vfio/pci.c               |  14 ++
> >>>>  hw/vfio/trace-events        |   4 +
> >>>>  10 files changed, 637 insertions(+), 5 deletions(-)
> >>>>  create mode 100644 hw/ppc/spapr_pci_nvlink2.c
> >>>>
> >>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>>> index 1111b21..636e717 100644
> >>>> --- a/hw/ppc/Makefile.objs
> >>>> +++ b/hw/ppc/Makefile.objs
> >>>> @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
> >>>>  # IBM PowerNV
> >>>>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
> >>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>>> -obj-y += spapr_pci_vfio.o
> >>>> +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
> >>>>  endif
> >>>>  obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>>>  # PowerPC 4xx boards
> >>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> >>>> index b1ae4c0..706c304 100644
> >>>> --- a/hw/vfio/pci.h
> >>>> +++ b/hw/vfio/pci.h
> >>>> @@ -194,6 +194,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
> >>>>  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> >>>>                                 struct vfio_region_info *info,
> >>>>                                 Error **errp);
> >>>> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
> >>>> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
> >>>>  
> >>>>  void vfio_display_reset(VFIOPCIDevice *vdev);
> >>>>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> >>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >>>> index ab0e3a0..e791dd4 100644
> >>>> --- a/include/hw/pci-host/spapr.h
> >>>> +++ b/include/hw/pci-host/spapr.h
> >>>> @@ -87,6 +87,9 @@ struct sPAPRPHBState {
> >>>>      uint32_t mig_liobn;
> >>>>      hwaddr mig_mem_win_addr, mig_mem_win_size;
> >>>>      hwaddr mig_io_win_addr, mig_io_win_size;
> >>>> +    hwaddr nv2_gpa_win_addr;
> >>>> +    hwaddr nv2_atsd_win_addr;
> >>>> +    struct spapr_phb_pci_nvgpu_config *nvgpus;
> >>>>  };
> >>>>  
> >>>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> >>>> @@ -105,6 +108,23 @@ struct sPAPRPHBState {
> >>>>  
> >>>>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
> >>>>  
> >>>> +#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
> >>>> +#define SPAPR_PCI_NV2RAM64_WIN_SIZE  0x10000000000ULL /* 1 TiB for all 6xGPUs */
> >>>
> >>> The comments and values below suggest that it is 1TiB for each GPU,
> >>> rather than 1TiB shared by all 6.  Which is it?
> >>
> >>
> >> 1TiB for all of them within 1 vPHB. Not sure where it suggests 1TiB for
> >> each GPU.
> > 
> > The fact that NV2ATSD_WIN_BASE is set at 6TiB above NV2RAM64_WIN_BASE
> > is what suggested to me that there was one 1TiB window for each of the
> > 6 possible GPUs.
> > 
> >>>> +
> >>>> +/* Max number of these GPUs per a physical box */
> >>>> +#define NVGPU_MAX_NUM                6
> >>>
> >>> Is there any possibility later hardware revisions could increase this?
> >>> If so we should probably leave some extra room in the address space.
> >>
> >> A GPU RAM window is 256GiB (and only 32GiB is used), and 3 is the
> >> maximum in one group so far. So 1TiB should be enough for quite some
> >> time. Having more GPUs in a box is probably possible but for now 6xGPU
> >> require water cooling while 4xGPU does not, so unless a new
> >> generation of these GPUs comes out, the numbers won't change much.
> > 
> > Hm, ok.
> > 
> >> I'll double SPAPR_PCI_NV2RAM64_WIN_SIZE.
> > 
> > Um.. I'm not sure how that follows from the above.
> 
> 1TiB is enough now but 2TiB is more future proof. That was it.

Ok.

> >>>> +/*
> >>>> + * One NVLink bridge provides one ATSD register so it should be 18.
> >>>> + * In practice though since we allow only one group per vPHB which equals
> >>>> + * to an NPU2 which has maximum 6 NVLink bridges.
> >>>> + */
> >>>> +#define NVGPU_MAX_ATSD               6
> >>>> +
> >>>> +#define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
> >>>> +                                      SPAPR_PCI_NV2RAM64_WIN_SIZE * \
> >>>> +                                      NVGPU_MAX_NUM)
> >>>> +#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_ATSD * 0x10000)
> >>>
> >>> What's the significance of the 64 kiB constant here?  Should it be a
> >>> symbolic name, or spelled "64 * kiB"?
> >>
> >> Ok.
> > 
> > 
> > Hmm.  Am I right in thinking that both each 64-bit RAM and each ATSD
> > RAM slot is per-vPHB? 
> 
> These are the windows from which I allocated the RAM base and ATSD per GPU/NPU.

Ok, I guess that per-vPHB set of windows is what I'm meaning by "slot"
then.

> > Would it make more sense to directly index into
> > the array of slots with the phb index, rather than having a separate
> > GPU index?
> 
> There can be 1 or many "slots" per PHB ("many" is not really encouraged
> as the extra ones will miss ATSD, but nevertheless), and "slots" are
> not in a global list of any kind.

Ok, I think we're using different meanings of "slot" here.  By "slot"
I meant one 64-bit and one ATSD window with a common index
(i.e. a slot in the array indices, rather than a physical slot on the
system).  IIUC all the GPUs and NPUs on a vPHB will sit in a single
"slot" in that sense.

> >>>> +
> >>>>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
> >>>>  {
> >>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> >>>> @@ -135,6 +155,11 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
> >>>>  int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
> >>>>  int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
> >>>>  void spapr_phb_vfio_reset(DeviceState *qdev);
> >>>> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb);
> >>>> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off);
> >>>> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);
> >>>> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
> >>>> +                                        sPAPRPHBState *sphb);
> >>>>  #else
> >>>>  static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)
> >>>>  {
> >>>> @@ -161,6 +186,22 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
> >>>>  static inline void spapr_phb_vfio_reset(DeviceState *qdev)
> >>>>  {
> >>>>  }
> >>>> +static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
> >>>> +{
> >>>> +}
> >>>> +static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,
> >>>> +                                               int bus_off)
> >>>> +{
> >>>> +}
> >>>> +static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,
> >>>> +                                                   void *fdt)
> >>>> +{
> >>>> +}
> >>>> +static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
> >>>> +                                                      int offset,
> >>>> +                                                      sPAPRPHBState *sphb)
> >>>> +{
> >>>> +}
> >>>
> >>> I'm guessing some of these should never get called on systems without
> >>> NVLink2, in which case they should probably have a
> >>> g_assert_not_reached() in there.
> >>
> >> I guess if you compile QEMU for --target-list=ppc64-softmmu on Windows
> >> (i.e. tcg + pseries + pci but no vfio), these will be called and then
> >> crash, no?
> > 
> > Well, if they can be called in that situation then, yes, they need to
> > be no-ops like they are now.  But is that true for all of them?
> > Hmm.. yes it might be, never mind.
> > 
> >>>
> >>>>  #endif
> >>>>  
> >>>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb);
> >>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>> index 358bb38..9acf867 100644
> >>>> --- a/include/hw/ppc/spapr.h
> >>>> +++ b/include/hw/ppc/spapr.h
> >>>> @@ -113,7 +113,8 @@ struct sPAPRMachineClass {
> >>>>      void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> >>>>                            uint64_t *buid, hwaddr *pio, 
> >>>>                            hwaddr *mmio32, hwaddr *mmio64,
> >>>> -                          unsigned n_dma, uint32_t *liobns, Error **errp);
> >>>> +                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
> >>>> +                          hwaddr *nv2atsd, Error **errp);
> >>>>      sPAPRResizeHPT resize_hpt_default;
> >>>>      sPAPRCapabilities default_caps;
> >>>>      sPAPRIrq *irq;
> >>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>> index 74c9b07..fda6e7e 100644
> >>>> --- a/hw/ppc/spapr.c
> >>>> +++ b/hw/ppc/spapr.c
> >>>> @@ -3929,7 +3929,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> >>>>      smc->phb_placement(spapr, sphb->index,
> >>>>                         &sphb->buid, &sphb->io_win_addr,
> >>>>                         &sphb->mem_win_addr, &sphb->mem64_win_addr,
> >>>> -                       windows_supported, sphb->dma_liobn, errp);
> >>>> +                       windows_supported, sphb->dma_liobn,
> >>>> +                       &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
> >>>> +                       errp);
> >>>>  }
> >>>>  
> >>>>  static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> >>>> @@ -4129,7 +4131,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
> >>>>  static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> >>>>                                  uint64_t *buid, hwaddr *pio,
> >>>>                                  hwaddr *mmio32, hwaddr *mmio64,
> >>>> -                                unsigned n_dma, uint32_t *liobns, Error **errp)
> >>>> +                                unsigned n_dma, uint32_t *liobns,
> >>>> +                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> >>>>  {
> >>>>      /*
> >>>>       * New-style PHB window placement.
> >>>> @@ -4174,6 +4177,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> >>>>      *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
> >>>>      *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
> >>>>      *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
> >>>> +
> >>>
> >>> This doesn't look right.  SPAPR_PCI_NV2ATSD_WIN_BASE appears to be
> >>> defined such that there are slots for NVGPU_MAX_NUM gpa "slots" of size
> >>> SPAPR_PCI_NV2RAM64_WIN_SIZE before we get to the ATSD base.
> >>>
> >>>> +    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
> >>>
> >>> But this implies you need a "slot" for every possible PHB index, which
> >>> is rather more than NVGPU_MAX_NUM.
> >>>
> >>>> +    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
> >>
> >>
> >> Ah right :( These should then go above 128TiB, I guess, as I do not really
> >> want them to appear inside a huge DMA window.
> > 
> > Right.  So actually looks like you are already indexing the window
> > slots by phb index, in which case you need to allow for 32 slots even
> > though only 6 can be populated at the moment.
> 
> 
> Why precisely 32? Round up of 18?

Because 32 is the allowed number of vPHBs.
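
To make the arithmetic concrete, here is a small self-contained sketch
of a placement scheme where each of the 32 possible vPHB indices gets
its own 2TiB RAM window and the ATSD windows sit above all of them.
The base address and sizes are illustrative assumptions only (64TiB
stands in for SPAPR_PCI_LIMIT), not the values posted in v3:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define TIB                   (1ULL << 40)
#define NV2RAM64_WIN_BASE     (64 * TIB)  /* placeholder for SPAPR_PCI_LIMIT */
#define NV2RAM64_WIN_SIZE     (2 * TIB)   /* one window per vPHB index */
#define MAX_PHBS              32
#define NV2ATSD_WIN_BASE      (NV2RAM64_WIN_BASE + \
                               MAX_PHBS * NV2RAM64_WIN_SIZE)
#define NV2ATSD_WIN_SIZE      (6 * 0x10000ULL) /* 6 ATSDs, 64KiB each */

int main(void)
{
    for (unsigned index = 0; index < MAX_PHBS; ++index) {
        uint64_t nv2gpa  = NV2RAM64_WIN_BASE + index * NV2RAM64_WIN_SIZE;
        uint64_t nv2atsd = NV2ATSD_WIN_BASE + index * NV2ATSD_WIN_SIZE;

        /* Print per-index bases; no RAM window can reach the ATSD area */
        printf("phb index %2u: nv2gpa=0x%" PRIx64 " nv2atsd=0x%" PRIx64 "\n",
               index, nv2gpa, nv2atsd);
    }
    return 0;
}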

> >>>>  }
> >>>>  
> >>>>  static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
> >>>> @@ -4376,6 +4382,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
> >>>>  /*
> >>>>   * pseries-3.1
> >>>>   */
> >>>> +static void phb_placement_3_1(sPAPRMachineState *spapr, uint32_t index,
> >>>> +                              uint64_t *buid, hwaddr *pio,
> >>>> +                              hwaddr *mmio32, hwaddr *mmio64,
> >>>> +                              unsigned n_dma, uint32_t *liobns,
> >>>> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> >>>> +{
> >>>> +    spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
> >>>> +                        nv2gpa, nv2atsd, errp);
> >>>> +    *nv2gpa = 0;
> >>>> +    *nv2atsd = 0;
> >>>> +}
> >>>> +
> >>>>  static void spapr_machine_3_1_class_options(MachineClass *mc)
> >>>>  {
> >>>>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> >>>> @@ -4391,6 +4409,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
> >>>>      mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power8_v2.0");
> >>>>      smc->update_dt_enabled = false;
> >>>>      smc->dr_phb_enabled = false;
> >>>> +    smc->phb_placement = phb_placement_3_1;
> >>>>  }
> >>>>  
> >>>>  DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
> >>>> @@ -4522,7 +4541,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
> >>>>  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
> >>>>                                uint64_t *buid, hwaddr *pio,
> >>>>                                hwaddr *mmio32, hwaddr *mmio64,
> >>>> -                              unsigned n_dma, uint32_t *liobns, Error **errp)
> >>>> +                              unsigned n_dma, uint32_t *liobns,
> >>>> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> >>>>  {
> >>>>      /* Legacy PHB placement for pseries-2.7 and earlier machine types */
> >>>>      const uint64_t base_buid = 0x800000020000000ULL;
> >>>> @@ -4566,6 +4586,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
> >>>>       * fallback behaviour of automatically splitting a large "32-bit"
> >>>>       * window into contiguous 32-bit and 64-bit windows
> >>>>       */
> >>>> +
> >>>> +    *nv2gpa = 0;
> >>>> +    *nv2atsd = 0;
> >>>>  }
> >>>>  
> >>>>  static void spapr_machine_2_7_class_options(MachineClass *mc)
> >>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>>> index 06a5ffd..f076462 100644
> >>>> --- a/hw/ppc/spapr_pci.c
> >>>> +++ b/hw/ppc/spapr_pci.c
> >>>> @@ -1355,6 +1355,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
> >>>>      if (sphb->pcie_ecs && pci_is_express(dev)) {
> >>>>          _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
> >>>>      }
> >>>> +
> >>>> +    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
> >>>>  }
> >>>>  
> >>>>  /* create OF node for pci device and required OF DT properties */
> >>>> @@ -1878,6 +1880,7 @@ static void spapr_phb_reset(DeviceState *qdev)
> >>>>      sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
> >>>>  
> >>>>      spapr_phb_dma_reset(sphb);
> >>>> +    spapr_phb_nvgpu_setup(sphb);
> >>>>  
> >>>>      /* Reset the IOMMU state */
> >>>>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
> >>>> @@ -1910,6 +1913,8 @@ static Property spapr_phb_properties[] = {
> >>>>                       pre_2_8_migration, false),
> >>>>      DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,
> >>>>                       pcie_ecs, true),
> >>>> +    DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),
> >>>> +    DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),
> >>>>      DEFINE_PROP_END_OF_LIST(),
> >>>>  };
> >>>>  
> >>>> @@ -2282,6 +2287,9 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb, uint32_t intc_phandle, void *fdt,
> >>>>          return ret;
> >>>>      }
> >>>>  
> >>>> +    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off);
> >>>> +    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
> >>>> +
> >>>>      return 0;
> >>>>  }
> >>>>  
> >>>> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> >>>> new file mode 100644
> >>>> index 0000000..965a6be
> >>>> --- /dev/null
> >>>> +++ b/hw/ppc/spapr_pci_nvlink2.c
> >>>> @@ -0,0 +1,419 @@
> >>>> +/*
> >>>> + * QEMU sPAPR PCI for NVLink2 pass through
> >>>> + *
> >>>> + * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
> >>>> + *
> >>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> >>>> + * of this software and associated documentation files (the "Software"), to deal
> >>>> + * in the Software without restriction, including without limitation the rights
> >>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> >>>> + * copies of the Software, and to permit persons to whom the Software is
> >>>> + * furnished to do so, subject to the following conditions:
> >>>> + *
> >>>> + * The above copyright notice and this permission notice shall be included in
> >>>> + * all copies or substantial portions of the Software.
> >>>> + *
> >>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> >>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> >>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> >>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> >>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> >>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> >>>> + * THE SOFTWARE.
> >>>> + */
> >>>> +#include "qemu/osdep.h"
> >>>> +#include "qapi/error.h"
> >>>> +#include "qemu-common.h"
> >>>> +#include "hw/pci/pci.h"
> >>>> +#include "hw/pci-host/spapr.h"
> >>>> +#include "qemu/error-report.h"
> >>>> +#include "hw/ppc/fdt.h"
> >>>> +#include "hw/pci/pci_bridge.h"
> >>>> +
> >>>> +#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
> >>>> +                                     (((phb)->index) << 16) | ((pdev)->devfn))
> >>>> +#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
> >>>> +                                     (((phb)->index) << 16))
> >>>> +/* NVLink2 wants a separate NUMA node for its RAM */
> >>>> +#define GPURAM_ASSOCIATIVITY(phb, n) (255 - ((phb)->index * 3 + (n)))
> >>>> +#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
> >>>> +                                     ((gn) << 4) | (nn))
> >>>> +
> >>>> +/* Max number of NVLinks per GPU in any physical box */
> >>>> +#define NVGPU_MAX_LINKS              3
> >>>> +
> >>>> +struct spapr_phb_pci_nvgpu_config {
> >>>> +    uint64_t nv2_ram_current;
> >>>> +    uint64_t nv2_atsd_current;
> >>>> +    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
> >>>> +    struct spapr_phb_pci_nvgpu_slot {
> >>>> +        uint64_t tgt;
> >>>> +        uint64_t gpa;
> >>>> +        PCIDevice *gpdev;
> >>>> +        int linknum;
> >>>> +        struct {
> >>>> +            uint64_t atsd_gpa;
> >>>> +            PCIDevice *npdev;
> >>>> +            uint32_t link_speed;
> >>>> +        } links[NVGPU_MAX_LINKS];
> >>>> +    } slots[NVGPU_MAX_NUM];
> >>>> +};
> >>>> +
> >>>> +static int spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
> >>>> +                                    uint64_t tgt)
> >>>> +{
> >>>> +    int i;
> >>>> +
> >>>> +    /* Search for partially collected "slot" */
> >>>> +    for (i = 0; i < nvgpus->num; ++i) {
> >>>> +        if (nvgpus->slots[i].tgt == tgt) {
> >>>> +            return i;
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
> >>>> +        warn_report("Found too many NVLink bridges per GPU");
> >>>> +        return -1;
> >>>
> >>> This is within qemu so it would be better to use the qemu error API
> >>> than returning an error code.
> >>
> >> You mean returning Error**? Oh. Ok.
> > 
> > Well, not returning, technically, but taking an Error ** parameter
> > which is checked by the caller to detect errors.
> 
> 
> None of these is actually propagated to the upper level as none of
> them is fatal (well, except one which I am turning into an assert).

Oh, ok.  In that case you don't need an Error **.
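
For completeness, a generic sketch of the Error ** idiom being referred
to here, with illustrative function names only: the callee fills *errp
and the caller decides whether the failure is fatal or merely worth a
warning:

#include "qemu/osdep.h"
#include "qapi/error.h"
#include "qemu/error-report.h"

/* Callee: reports problems through *errp and lets the caller decide */
static void frob_something(int value, Error **errp)
{
    if (value < 0) {
        error_setg(errp, "value must be non-negative, got %d", value);
        return;
    }
    /* ... do the actual work ... */
}

/* Caller: here the failure is considered non-fatal, so it is downgraded
 * to a warning; a fatal caller would error_propagate() instead */
static void caller(void)
{
    Error *local_err = NULL;

    frob_something(-1, &local_err);
    if (local_err) {
        warn_report_err(local_err); /* prints and frees the error */
    }
}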

> >>>> +    }
> >>>> +
> >>>> +    i = nvgpus->num;
> >>>> +    nvgpus->slots[i].tgt = tgt;
> >>>> +    ++nvgpus->num;
> >>>> +
> >>>> +    return i;
> >>>
> >>> Might be nicer to return a pointer to the slot structure.
> >>
> >>
> >> This can work.
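
For illustration, a pointer-returning variant sketched from the v3
code, with NULL meaning "no free slot"; the exact shape is an
assumption rather than what was posted:

static struct spapr_phb_pci_nvgpu_slot *
spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
                         uint64_t tgt)
{
    int i;

    /* Search for a partially collected "slot" */
    for (i = 0; i < nvgpus->num; ++i) {
        if (nvgpus->slots[i].tgt == tgt) {
            return &nvgpus->slots[i];
        }
    }

    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
        warn_report("Found too many NVLink bridges per GPU");
        return NULL;
    }

    /* Start a new slot for this target address */
    nvgpus->slots[nvgpus->num].tgt = tgt;

    return &nvgpus->slots[nvgpus->num++];
}
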
> >>
> >>
> >>>
> >>>> +}
> >>>> +
> >>>> +static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
> >>>> +                                    PCIDevice *pdev, uint64_t tgt,
> >>>> +                                    MemoryRegion *mr)
> >>>> +{
> >>>> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt);
> >>>> +
> >>>> +    if (i < 0) {
> >>>> +        return;
> >>>> +    }
> >>>> +    g_assert(!nvgpus->slots[i].gpdev);
> >>>> +    nvgpus->slots[i].gpdev = pdev;
> >>>> +
> >>>> +    nvgpus->slots[i].gpa = nvgpus->nv2_ram_current;
> >>>> +    nvgpus->nv2_ram_current += memory_region_size(mr);
> >>>> +}
> >>>> +
> >>>> +static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
> >>>> +                                    PCIDevice *pdev, uint64_t tgt,
> >>>> +                                    MemoryRegion *mr)
> >>>> +{
> >>>> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt), j;
> >>>> +    struct spapr_phb_pci_nvgpu_slot *nvslot;
> >>>> +
> >>>> +    if (i < 0) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    nvslot = &nvgpus->slots[i];
> >>>> +    j = nvslot->linknum;
> >>>> +    if (j == ARRAY_SIZE(nvslot->links)) {
> >>>> +        warn_report("Found too many NVLink2 bridges");
> >>>> +        return;
> >>>> +    }
> >>>> +    ++nvslot->linknum;
> >>>> +
> >>>> +    g_assert(!nvslot->links[j].npdev);
> >>>> +    nvslot->links[j].npdev = pdev;
> >>>> +    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
> >>>> +    nvgpus->nv2_atsd_current += memory_region_size(mr);
> >>>> +    nvslot->links[j].link_speed =
> >>>> +        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
> >>>> +}
> >>>> +
> >>>> +static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
> >>>> +                                        void *opaque)
> >>>> +{
> >>>> +    PCIBus *sec_bus;
> >>>> +    Object *po = OBJECT(pdev);
> >>>> +    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
> >>>> +
> >>>> +    if (tgt) {
> >>>> +        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
> >>>> +        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
> >>>> +                                                  NULL);
> >>>> +
> >>>> +        if (mr_gpu) {
> >>>> +            spapr_pci_collect_nvgpu(opaque, pdev, tgt, MEMORY_REGION(mr_gpu));
> >>>> +        } else if (mr_npu) {
> >>>> +            spapr_pci_collect_nvnpu(opaque, pdev, tgt, MEMORY_REGION(mr_npu));
> >>>> +        } else {
> >>>> +            warn_report("Unexpected device with \"nvlink2-tgt\"");
> >>>
> >>> IIUC this would have to be a code error, so should be an assert() not
> >>> a warning.
> >>
> >>
> >> Ok.
> >>
> >>>
> >>>> +        }
> >>>> +    }
> >>>> +    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
> >>>> +         PCI_HEADER_TYPE_BRIDGE)) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
> >>>> +    if (!sec_bus) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
> >>>> +                        spapr_phb_pci_collect_nvgpu, opaque);
> >>>> +}
> >>>> +
> >>>> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
> >>>> +{
> >>>> +    int i, j, valid_gpu_num;
> >>>> +
> >>>> +    /* If there are existing NVLink2 MRs, unmap those before recreating */
> >>>> +    if (sphb->nvgpus) {
> >>>> +        for (i = 0; i < sphb->nvgpus->num; ++i) {
> >>>> +            struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >>>> +            Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> >>>> +                                                        "nvlink2-mr[0]", NULL);
> >>>> +
> >>>> +            if (nv_mrobj) {
> >>>> +                memory_region_del_subregion(get_system_memory(),
> >>>> +                                            MEMORY_REGION(nv_mrobj));
> >>>> +            }
> >>>> +            for (j = 0; j < nvslot->linknum; ++j) {
> >>>> +                PCIDevice *npdev = nvslot->links[j].npdev;
> >>>> +                Object *atsd_mrobj;
> >>>> +                atsd_mrobj = object_property_get_link(OBJECT(npdev),
> >>>> +                                                      "nvlink2-atsd-mr[0]",
> >>>> +                                                      NULL);
> >>>> +                if (atsd_mrobj) {
> >>>> +                    memory_region_del_subregion(get_system_memory(),
> >>>> +                                                MEMORY_REGION(atsd_mrobj));
> >>>> +                }
> >>>> +            }
> >>>> +        }
> >>>> +        g_free(sphb->nvgpus);
> >>>
> >>> Probably worth collecting the above into a nvgpu_free() helper -
> >>> chances are you'll want it on cleanup paths as well.
> >>
> >> The only other cleanup path is below and it only executes if there is no
> >> MR added so for now it does not seem useful.
> > 
> > Hrm... I've merged PHB hotplug recently.. so there should be a cleanup
> > path for unplug as well.
> 
> 
> ah right. Wooohooo :) btw with phb hotplug we can try supporting EEH on
> hotplugged VFIO devices.

Yeah, Sam is looking into it.
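
A rough sketch of the nvgpu_free() helper suggested above, factored out
of the unmap loop in the v3 spapr_phb_nvgpu_setup(); the function name
and the NULL-gpdev guard are assumptions, while the calls themselves
mirror the posted patch:

static void spapr_phb_nvgpu_free(sPAPRPHBState *sphb)
{
    int i, j;

    if (!sphb->nvgpus) {
        return;
    }

    for (i = 0; i < sphb->nvgpus->num; ++i) {
        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
        Object *nv_mrobj;

        /* Guard against "slots" which never got their GPU collected */
        if (nvslot->gpdev) {
            nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
                                                "nvlink2-mr[0]", NULL);
            if (nv_mrobj) {
                memory_region_del_subregion(get_system_memory(),
                                            MEMORY_REGION(nv_mrobj));
            }
        }
        for (j = 0; j < nvslot->linknum; ++j) {
            PCIDevice *npdev = nvslot->links[j].npdev;
            Object *atsd_mrobj = object_property_get_link(OBJECT(npdev),
                                                          "nvlink2-atsd-mr[0]",
                                                          NULL);

            if (atsd_mrobj) {
                memory_region_del_subregion(get_system_memory(),
                                            MEMORY_REGION(atsd_mrobj));
            }
        }
    }
    g_free(sphb->nvgpus);
    sphb->nvgpus = NULL;
}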

> >>>> +        sphb->nvgpus = NULL;
> >>>> +    }
> >>>> +
> >>>> +    /* Search for GPUs and NPUs */
> >>>> +    if (sphb->nv2_gpa_win_addr && sphb->nv2_atsd_win_addr) {
> >>>> +        PCIBus *bus = PCI_HOST_BRIDGE(sphb)->bus;
> >>>> +
> >>>> +        sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
> >>>> +        sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
> >>>> +        sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
> >>>> +
> >>>> +        pci_for_each_device(bus, pci_bus_num(bus),
> >>>> +                            spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
> >>>> +    }
> >>>> +
> >>>> +    /* Add found GPU RAM and ATSD MRs if found */
> >>>> +    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
> >>>> +        Object *nvmrobj;
> >>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >>>> +
> >>>> +        if (!nvslot->gpdev) {
> >>>> +            continue;
> >>>> +        }
> >>>> +        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> >>>> +                                           "nvlink2-mr[0]", NULL);
> >>>> +        /* ATSD is pointless without GPU RAM MR so skip those */
> >>>> +        if (!nvmrobj) {
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        ++valid_gpu_num;
> >>>> +        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
> >>>> +                                    MEMORY_REGION(nvmrobj));
> >>>> +
> >>>> +        for (j = 0; j < nvslot->linknum; ++j) {
> >>>> +            Object *atsdmrobj;
> >>>> +
> >>>> +            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
> >>>> +                                                 "nvlink2-atsd-mr[0]",
> >>>> +                                                 NULL);
> >>>> +            if (!atsdmrobj) {
> >>>> +                continue;
> >>>> +            }
> >>>> +            memory_region_add_subregion(get_system_memory(),
> >>>> +                                        nvslot->links[j].atsd_gpa,
> >>>> +                                        MEMORY_REGION(atsdmrobj));
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    if (!valid_gpu_num) {
> >>>> +        /* We did not find any interesting GPU */
> >>>> +        g_free(sphb->nvgpus);
> >>>> +        sphb->nvgpus = NULL;
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off)
> >>>> +{
> >>>> +    int i, j, atsdnum = 0;
> >>>> +    uint64_t atsd[8]; /* The existing limitation of known guests */
> >>>> +
> >>>> +    if (!sphb->nvgpus) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
> >>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >>>> +
> >>>> +        if (!nvslot->gpdev) {
> >>>> +            continue;
> >>>> +        }
> >>>> +        for (j = 0; j < nvslot->linknum; ++j) {
> >>>> +            if (!nvslot->links[j].atsd_gpa) {
> >>>> +                continue;
> >>>> +            }
> >>>> +
> >>>> +            if (atsdnum == ARRAY_SIZE(atsd)) {
> >>>> +                warn_report("Only %ld ATSD registers allowed",
> >>>> +                            ARRAY_SIZE(atsd));
> >>>
> >>> Probably should be an error not a warning.
> >>
> >> We can still continue though, it is not fatal. These things come from
> >> skiboot (which we control) but skiboot could either compose the
> >> properties itself or use whatever hostboot provided (this does not happen
> >> now) and I would not like to be blocked by hostboot if/when it does.
> > 
> > Um.. what?  atsdnum is just a counter incremented below, it doesn't
> > come from skiboot or any other host-significant value.  The situation
> > here is that we have more nvlinks assigned to a guest than qemu can
> > support.  Yes, you could technically run the guest with some of the
> > links unavailable, but that seems pretty clearly not what the user
> > wanted.  Hence, an error is appropriate.
> 
> 
> Not exactly. NVLinks are available whether they come with an ATSD VFIO
> region or not; it was my choice to accompany ATSD with an NVLink2 bridge.
> So it is quite possible to pass way too many links and yes, QEMU won't
> expose all accompanying ATSDs to the guest but 1) the guest might not need
> this many ATSDs anyway (right now the NVIDIA driver always uses just one
> and nobody complained about performance) 2) nvlink is functional as long
> as the guest can access its config space.

Sure, it can work.  But remember the qemu user is setting up this
configuration.  I think it makes sense to error if it's a stupid and
pointless configuration, even if the guest could technically work with
it.
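
For illustration, a sketch of the harder-failure variant discussed
here: if more ATSD registers are found than the interface can expose,
fail the device tree build instead of silently dropping some.  It
assumes spapr_phb_nvgpu_populate_dt() gains an Error **errp parameter,
which the v3 patch does not have:

void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off,
                                 Error **errp)
{
    int i, j, atsdnum = 0;
    uint64_t atsd[8]; /* The existing limitation of known guests */

    if (!sphb->nvgpus) {
        return;
    }

    for (i = 0; i < sphb->nvgpus->num; ++i) {
        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];

        if (!nvslot->gpdev) {
            continue;
        }
        for (j = 0; j < nvslot->linknum; ++j) {
            if (!nvslot->links[j].atsd_gpa) {
                continue;
            }
            /* Too many ATSDs for the guest interface: hard error */
            if (atsdnum == ARRAY_SIZE(atsd)) {
                error_setg(errp, "Only %zu ATSD registers supported",
                           ARRAY_SIZE(atsd));
                return;
            }
            atsd[atsdnum++] = cpu_to_be64(nvslot->links[j].atsd_gpa);
        }
    }
    /* ... the ibm,mmio-atsd property is then emitted exactly as in v3 ... */
}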

> >>>> +                break;
> >>>> +            }
> >>>> +            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
> >>>> +            ++atsdnum;
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    if (!atsdnum) {
> >>>> +        warn_report("No ATSD registers found");
> >>>> +    } else if (!spapr_phb_eeh_available(sphb)) {
> >>>> +        /*
> >>>> +         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
> >>>> +         * which we do not emulate as a separate device. Instead we put
> >>>> +         * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not
> >>>> +         * put GPUs from different IOMMU groups to the same vPHB to ensure
> >>>> +         * that the guest will use ATSDs from the corresponding NPU.
> >>>> +         */
> >>>> +        warn_report("ATSD requires separate vPHB per GPU IOMMU group");
> >>>> +    } else {
> >>>> +        _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd",
> >>>> +                          atsd, atsdnum * sizeof(atsd[0]))));
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)
> >>>> +{
> >>>> +    int i, j, linkidx, npuoff;
> >>>> +    char *npuname;
> >>>> +
> >>>> +    if (!sphb->nvgpus) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    npuname = g_strdup_printf("npuphb%d", sphb->index);
> >>>> +    npuoff = fdt_add_subnode(fdt, 0, npuname);
> >>>> +    _FDT(npuoff);
> >>>> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
> >>>> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
> >>>> +    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
> >>>> +    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
> >>>> +    g_free(npuname);
> >>>> +
> >>>> +    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
> >>>> +        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
> >>>> +            char *linkname = g_strdup_printf("link@%d", linkidx);
> >>>> +            int off = fdt_add_subnode(fdt, npuoff, linkname);
> >>>> +
> >>>> +            _FDT(off);
> >>>> +            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx)));
> >>>> */
> >>>
> >>> Are the indices you're using for 'reg' and the unit name arbitrary?
> >>> If so it's generally best to base them on some static property of the
> >>> device, rather than just allocating sequentially.
> >>
> >> On the host, "reg" is the link index. Here it is actually commented out as
> >> a reminder for the future.
> >>
> >>>
> >>>> +            _FDT((fdt_setprop_string(fdt, off, "compatible",
> >>>> +                                     "ibm,npu-link")));
> >>>> +            _FDT((fdt_setprop_cell(fdt, off, "phandle",
> >>>> +                                   PHANDLE_NVLINK(sphb, i, j))));
> >>>> +            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));
> >>>
> >>> Why do you need the index here as well as in reg?
> >>
> >> I do not really need "reg", and I need the index for this:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/npu-dma.c?h=v4.20#n692
> > 
> > 
> > Ok, because of a silly binding.  That's a good enough reason.
> > 
> >>>> +            g_free(linkname);
> >>>> +            ++linkidx;
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    /* Add memory nodes for GPU RAM and mark them unusable */
> >>>> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
> >>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >>>> +        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> >>>> +                                                    "nvlink2-mr[0]", NULL);
> >>>> +        uint32_t at = cpu_to_be32(GPURAM_ASSOCIATIVITY(sphb, i));
> >>>> +        uint32_t associativity[] = { cpu_to_be32(0x4), at, at, at, at };
> >>>> +        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
> >>>> +        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
> >>>> +        char *mem_name = g_strdup_printf("memory@%lx", nvslot->gpa);
> >>>> +        int off = fdt_add_subnode(fdt, 0, mem_name);
> >>>> +
> >>>> +        _FDT(off);
> >>>> +        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
> >>>> +        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
> >>>> +        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
> >>>> +                          sizeof(associativity))));
> >>>> +
> >>>> +        _FDT((fdt_setprop_string(fdt, off, "compatible",
> >>>> +                                 "ibm,coherent-device-memory")));
> >>>> +
> >>>> +        mem_reg[1] = cpu_to_be64(0);
> >>>> +        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
> >>>> +                          sizeof(mem_reg))));
> >>>> +        _FDT((fdt_setprop_cell(fdt, off, "phandle",
> >>>> +                               PHANDLE_GPURAM(sphb, i))));
> >>>> +        g_free(mem_name);
> >>>> +    }
> >>>> +
> >>>> +}
> >>>> +
> >>>> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
> >>>> +                                        sPAPRPHBState *sphb)
> >>>> +{
> >>>> +    int i, j;
> >>>> +
> >>>> +    if (!sphb->nvgpus) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
> >>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> >>>> +
> >>>> +        /* Skip "slot" without attached GPU */
> >>>
> >>> IIUC a "slot" should always have at least one GPU.  You need to handle
> >>> the case of an unitialized GPU in the "collect" functions because you
> >>> don't know if you'll discover the GPU or an NPU first.  But here not
> >>> having a GPU should be an error, shouldn't it?
> >>
> >>
> >> If someone decides to pass 1 GPU with all of its related nvlinks, plus
> >> nvlinks from another GPU without that GPU itself, for whatever reason,
> >> should we really stop them? Things won't work at their best but this
> >> still might be useful for weird debugging.
> > 
> > Hm, ok, I guess so.
> > 
> >>>> +        if (!nvslot->gpdev) {
> >>>> +            continue;
> >>>> +        }
> >>>> +        if (dev == nvslot->gpdev) {
> >>>> +            uint32_t npus[nvslot->linknum];
> >>>> +
> >>>> +            for (j = 0; j < nvslot->linknum; ++j) {
> >>>> +                PCIDevice *npdev = nvslot->links[j].npdev;
> >>>> +
> >>>> +                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
> >>>> +            }
> >>>> +            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
> >>>> +                             j * sizeof(npus[0])));
> >>>> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
> >>>> +                                   PHANDLE_PCIDEV(sphb, dev))));
> >>>> +            continue;
> >>>> +        }
> >>>> +
> >>>> +        for (j = 0; j < nvslot->linknum; ++j) {
> >>>> +            if (dev != nvslot->links[j].npdev) {
> >>>> +                continue;
> >>>> +            }
> >>>> +
> >>>> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
> >>>> +                                   PHANDLE_PCIDEV(sphb, dev))));
> >>>> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
> >>>> +                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
> >>>> +            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
> >>>> +                                   PHANDLE_NVLINK(sphb, i, j))));
> >>>> +            /*
> >>>> +             * If we ever want to emulate GPU RAM at the same location as on
> >>>> +             * the host - here is the encoding GPA->TGT:
> >>>> +             *
> >>>> +             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
> >>>> +             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
> >>>> +             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
> >>>> +             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
> >>>> +             */
> >>>> +            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
> >>>> +                                  PHANDLE_GPURAM(sphb, i)));
> >>>> +            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
> >>>> +                                 nvslot->tgt));
> >>>> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
> >>>> +                                  nvslot->links[j].link_speed));
> >>>> +        }
> >>>> +    }
> >>>> +}
> >>>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> >>>> index 40a1200..15ec0b4 100644
> >>>> --- a/hw/vfio/pci-quirks.c
> >>>> +++ b/hw/vfio/pci-quirks.c
> >>>> @@ -2180,3 +2180,123 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
> >>>>  
> >>>>      return 0;
> >>>>  }
> >>>> +
> >>>> +static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
> >>>> +                                     const char *name,
> >>>> +                                     void *opaque, Error **errp)
> >>>> +{
> >>>> +    uint64_t tgt = (uint64_t) opaque;
> >>>> +    visit_type_uint64(v, name, &tgt, errp);
> >>>> +}
> >>>> +
> >>>> +static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
> >>>> +                                                 const char *name,
> >>>> +                                                 void *opaque, Error **errp)
> >>>> +{
> >>>> +    uint32_t link_speed = (uint32_t)(uint64_t) opaque;
> >>>> +    visit_type_uint32(v, name, &link_speed, errp);
> >>>> +}
> >>>> +
> >>>> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
> >>>> +{
> >>>> +    int ret;
> >>>> +    void *p;
> >>>> +    struct vfio_region_info *nv2region = NULL;
> >>>> +    struct vfio_info_cap_header *hdr;
> >>>> +    MemoryRegion *nv2mr = g_malloc0(sizeof(*nv2mr));
> >>>> +
> >>>> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
> >>>> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
> >>>> +                                   PCI_VENDOR_ID_NVIDIA,
> >>>> +                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
> >>>> +                                   &nv2region);
> >>>> +    if (ret) {
> >>>> +        return ret;
> >>>> +    }
> >>>> +
> >>>> +    p = mmap(NULL, nv2region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
> >>>> +             MAP_SHARED, vdev->vbasedev.fd, nv2region->offset);
> >>>> +
> >>>> +    if (!p) {
> >>>> +        return -errno;
> >>>> +    }
> >>>> +
> >>>> +    memory_region_init_ram_ptr(nv2mr, OBJECT(vdev), "nvlink2-mr",
> >>>> +                               nv2region->size, p);
> >>>> +
> >>>> +    hdr = vfio_get_region_info_cap(nv2region,
> >>>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
> >>>> +    if (hdr) {
> >>>> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
> >>>> +
> >>>> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
> >>>> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
> >>>> +                            (void *) cap->tgt, NULL);
> >>>> +        trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
> >>>> +                                              nv2region->size);
> >>>> +    }
> >>>> +    g_free(nv2region);
> >>>> +
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
> >>>> +{
> >>>> +    int ret;
> >>>> +    void *p;
> >>>> +    struct vfio_region_info *atsd_region = NULL;
> >>>> +    struct vfio_info_cap_header *hdr;
> >>>> +
> >>>> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
> >>>> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
> >>>> +                                   PCI_VENDOR_ID_IBM,
> >>>> +                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
> >>>> +                                   &atsd_region);
> >>>> +    if (ret) {
> >>>> +        return ret;
> >>>> +    }
> >>>> +
> >>>> +    /* Some NVLink bridges come without assigned ATSD, skip MR part */
> >>>> +    if (atsd_region->size) {
> >>>> +        MemoryRegion *atsd_mr = g_malloc0(sizeof(*atsd_mr));
> >>>> +
> >>>> +        p = mmap(NULL, atsd_region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
> >>>> +                 MAP_SHARED, vdev->vbasedev.fd, atsd_region->offset);
> >>>> +
> >>>> +        if (!p) {
> >>>> +            return -errno;
> >>>> +        }
> >>>> +
> >>>> +        memory_region_init_ram_device_ptr(atsd_mr, OBJECT(vdev),
> >>>> +                                          "nvlink2-atsd-mr",
> >>>> +                                          atsd_region->size,
> >>>> +                                          p);
> >>>> +    }
> >>>> +
> >>>> +    hdr = vfio_get_region_info_cap(atsd_region,
> >>>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
> >>>> +    if (hdr) {
> >>>> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
> >>>> +
> >>>> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
> >>>> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
> >>>> +                            (void *) cap->tgt, NULL);
> >>>> +        trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, cap->tgt,
> >>>> +                                                  atsd_region->size);
> >>>> +    }
> >>>> +
> >>>> +    hdr = vfio_get_region_info_cap(atsd_region,
> >>>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
> >>>> +    if (hdr) {
> >>>> +        struct vfio_region_info_cap_nvlink2_lnkspd *cap = (void *) hdr;
> >>>> +
> >>>> +        object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
> >>>> +                            vfio_pci_nvlink2_get_link_speed, NULL, NULL,
> >>>> +                            (void *) (uint64_t) cap->link_speed, NULL);
> >>>> +        trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
> >>>> +                                                  cap->link_speed);
> >>>> +    }
> >>>> +    g_free(atsd_region);
> >>>> +
> >>>> +    return 0;
> >>>> +}
> >>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>> index dd12f36..07aa141 100644
> >>>> --- a/hw/vfio/pci.c
> >>>> +++ b/hw/vfio/pci.c
> >>>> @@ -3069,6 +3069,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >>>>          goto out_teardown;
> >>>>      }
> >>>>  
> >>>> +    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
> >>>> +        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
> >>>> +        if (ret && ret != -ENODEV) {
> >>>> +            error_report("Failed to setup NVIDIA V100 GPU RAM");
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
> >>>> +        ret = vfio_pci_nvlink2_init(vdev, errp);
> >>>> +        if (ret && ret != -ENODEV) {
> >>>> +            error_report("Failed to setup NVlink2 bridge");
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>>      vfio_register_err_notifier(vdev);
> >>>>      vfio_register_req_notifier(vdev);
> >>>>      vfio_setup_resetfn_quirk(vdev);
> >>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >>>> index cf1e886..88841e9 100644
> >>>> --- a/hw/vfio/trace-events
> >>>> +++ b/hw/vfio/trace-events
> >>>> @@ -87,6 +87,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
> >>>>  vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
> >>>>  vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
> >>>>  
> >>>> +vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> >>>> +vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> >>>> +vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
> >>>> +
> >>>>  # hw/vfio/common.c
> >>>>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
> >>>>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
> >>>
> >>
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2
  2019-03-07  3:57           ` David Gibson
@ 2019-03-07  4:32             ` Alexey Kardashevskiy
  0 siblings, 0 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2019-03-07  4:32 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-devel, qemu-ppc, Sam Bobroff, Piotr Jaroszynski,
	Leonardo Augusto Guimarães Garcia, Jose Ricardo Ziviani,
	Daniel Henrique Barboza, Alex Williamson



On 07/03/2019 14:57, David Gibson wrote:
> On Thu, Mar 07, 2019 at 01:40:33PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 05/03/2019 12:47, David Gibson wrote:
>>> On Thu, Feb 28, 2019 at 05:11:32PM +1100, Alexey Kardashevskiy wrote:
>>>> On 28/02/2019 14:31, David Gibson wrote:
>>>>> On Wed, Feb 27, 2019 at 07:51:49PM +1100, Alexey Kardashevskiy wrote:
>>>>>> NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
>>>>>> space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
>>>>>> implements special regions for such GPUs and emulates an NVLink bridge.
>>>>>> NVLink2-enabled POWER9 CPUs also provide address translation services
>>>>>> which includes an ATS shootdown (ATSD) register exported via the NVLink
>>>>>> bridge device.
>>>>>>
>>>>>> This adds a quirk to VFIO to map the GPU memory and create an MR;
>>>>>> the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
>>>>>> this to get the MR and map it to the system address space.
>>>>>> Another quirk does the same for ATSD.
>>>>>>
>>>>>> This adds additional steps to sPAPR PHB setup:
>>>>>>
>>>>>> 1. Search for specific GPUs and NPUs, collect findings in
>>>>>> sPAPRPHBState::nvgpus, manage system address space mappings;
>>>>>>
>>>>>> 2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
>>>>>> "memory-block", "link-speed" to advertise the NVLink2 function to
>>>>>> the guest;
>>>>>>
>>>>>> 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
>>>>>>
>>>>>> 4. Add new memory blocks (with extra "linux,memory-usable" to prevent
>>>>>> the guest OS from accessing the new memory until it is onlined) and
>>>>>> npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
>>>>>> uses it for link discovery.
>>>>>>
>>>>>> This allocates space for GPU RAM and ATSD like we do for MMIOs by
>>>>>> adding 2 new parameters to the phb_placement() hook. Older machine types
>>>>>> set these to zero.
>>>>>>
>>>>>> This puts new memory nodes in a separate NUMA node to replicate the host
>>>>>> system setup as the GPU driver relies on this.
>>>>>>
>>>>>> This adds requirement similar to EEH - one IOMMU group per vPHB.
>>>>>> The reason for this is that ATSD registers belong to a physical NPU
>>>>>> so they cannot invalidate translations on GPUs attached to another NPU.
>>>>>> It is guaranteed by the host platform as it does not mix NVLink bridges
>>>>>> or GPUs from different NPU in the same IOMMU group. If more than one
>>>>>> IOMMU group is detected on a vPHB, this disables ATSD support for that
>>>>>> vPHB and prints a warning.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>> Changes:
>>>>>> v3:
>>>>>> * moved GPU RAM above PCI MMIO limit
>>>>>> * renamed QOM property to nvlink2-tgt
>>>>>> * moved nvlink2 code to its own file
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> The example command line for redbud system:
>>>>>>
>>>>>> pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
>>>>>> -nodefaults \
>>>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>>>> -mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
>>>>>> -enable-kvm -m 384G \
>>>>>> -chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
>>>>>> -mon chardev=SOCKET0,mode=control \
>>>>>> -smp 80,sockets=1,threads=4 \
>>>>>> -netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
>>>>>> -device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
>>>>>> img/vdisk0.img \
>>>>>> -device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
>>>>>> -device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
>>>>>> -device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
>>>>>> -device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
>>>>>> -device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
>>>>>> -device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
>>>>>> -device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
>>>>>> -device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
>>>>>> -device spapr-pci-host-bridge,id=phb1,index=1 \
>>>>>> -device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
>>>>>> -device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
>>>>>> -device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
>>>>>> -device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
>>>>>> -device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
>>>>>> -device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
>>>>>> -device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
>>>>>> -device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
>>>>>> -machine pseries \
>>>>>> -L /home/aik/t/qemu-ppc64-bios/ -d guest_errors
>>>>>>
>>>>>> Note that QEMU attaches PCI devices to the last added vPHB so first
>>>>>> 8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and
>>>>>> 35:03:00.0..7:00:01.2 to the vPHB with id=phb1.
>>>>>> ---
>>>>>>  hw/ppc/Makefile.objs        |   2 +-
>>>>>>  hw/vfio/pci.h               |   2 +
>>>>>>  include/hw/pci-host/spapr.h |  41 ++++
>>>>>>  include/hw/ppc/spapr.h      |   3 +-
>>>>>>  hw/ppc/spapr.c              |  29 ++-
>>>>>>  hw/ppc/spapr_pci.c          |   8 +
>>>>>>  hw/ppc/spapr_pci_nvlink2.c  | 419 ++++++++++++++++++++++++++++++++++++
>>>>>>  hw/vfio/pci-quirks.c        | 120 +++++++++++
>>>>>>  hw/vfio/pci.c               |  14 ++
>>>>>>  hw/vfio/trace-events        |   4 +
>>>>>>  10 files changed, 637 insertions(+), 5 deletions(-)
>>>>>>  create mode 100644 hw/ppc/spapr_pci_nvlink2.c
>>>>>>
>>>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>>>> index 1111b21..636e717 100644
>>>>>> --- a/hw/ppc/Makefile.objs
>>>>>> +++ b/hw/ppc/Makefile.objs
>>>>>> @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>>>>>>  # IBM PowerNV
>>>>>>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
>>>>>>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>>>> -obj-y += spapr_pci_vfio.o
>>>>>> +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
>>>>>>  endif
>>>>>>  obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>>>  # PowerPC 4xx boards
>>>>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>>>>> index b1ae4c0..706c304 100644
>>>>>> --- a/hw/vfio/pci.h
>>>>>> +++ b/hw/vfio/pci.h
>>>>>> @@ -194,6 +194,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
>>>>>>  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>>>>>>                                 struct vfio_region_info *info,
>>>>>>                                 Error **errp);
>>>>>> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
>>>>>> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
>>>>>>  
>>>>>>  void vfio_display_reset(VFIOPCIDevice *vdev);
>>>>>>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
>>>>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>>>>>> index ab0e3a0..e791dd4 100644
>>>>>> --- a/include/hw/pci-host/spapr.h
>>>>>> +++ b/include/hw/pci-host/spapr.h
>>>>>> @@ -87,6 +87,9 @@ struct sPAPRPHBState {
>>>>>>      uint32_t mig_liobn;
>>>>>>      hwaddr mig_mem_win_addr, mig_mem_win_size;
>>>>>>      hwaddr mig_io_win_addr, mig_io_win_size;
>>>>>> +    hwaddr nv2_gpa_win_addr;
>>>>>> +    hwaddr nv2_atsd_win_addr;
>>>>>> +    struct spapr_phb_pci_nvgpu_config *nvgpus;
>>>>>>  };
>>>>>>  
>>>>>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
>>>>>> @@ -105,6 +108,23 @@ struct sPAPRPHBState {
>>>>>>  
>>>>>>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>>>>>>  
>>>>>> +#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
>>>>>> +#define SPAPR_PCI_NV2RAM64_WIN_SIZE  0x10000000000ULL /* 1 TiB for all 6xGPUs */
>>>>>
>>>>> The comments and values below suggest that it is 1TiB for each GPU,
>>>>> rather than 1TiB shared by all 6.  Which is it?
>>>>
>>>>
>>>> 1TiB for all of them within 1 vPHB. Not sure where it suggests 1TiB for
>>>> each GPU.
>>>
>>> The fact that NV2ATSD_WIN_BASE is set at 6TiB above NV2RAM64_WIN_BASE
>>> is what suggested to me that there was one 1TiB window for each of the
>>> 6 possible GPUs.
>>>
>>>>>> +
>>>>>> +/* Max number of these GPUs per physical box */
>>>>>> +#define NVGPU_MAX_NUM                6
>>>>>
>>>>> Is there any possibility later hardware revisions could increase this?
>>>>> If so we should probably leave some extra room in the address space.
>>>>
>>>> A GPU RAM window is 256GiB (and only 32GiB is used), and 3 is the
>>>> maximum in one group so far. So 1TiB should be enough for quite some
>>>> time. Having more GPUs in a box is probably possible but for now 6xGPU
>>>> require water cooling while 4xGPU does not, so unless a new
>>>> generation of these GPUs comes out, the numbers won't change much.
>>>
>>> Hm, ok.
>>>
>>>> I'll double SPAPR_PCI_NV2RAM64_WIN_SIZE.
>>>
>>> Um.. I'm not sure how that follows from the above.
>>
>> 1TiB is enough now but 2TiB is more future proof. That was it.
> 
> Ok.
> 
>>>>>> +/*
>>>>>> + * One NVLink bridge provides one ATSD register so it could be 18.
>>>>>> + * In practice though we allow only one group per vPHB, which equals
>>>>>> + * one NPU2, which has a maximum of 6 NVLink bridges.
>>>>>> + */
>>>>>> +#define NVGPU_MAX_ATSD               6
>>>>>> +
>>>>>> +#define SPAPR_PCI_NV2ATSD_WIN_BASE   (SPAPR_PCI_NV2RAM64_WIN_BASE + \
>>>>>> +                                      SPAPR_PCI_NV2RAM64_WIN_SIZE * \
>>>>>> +                                      NVGPU_MAX_NUM)
>>>>>> +#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_ATSD * 0x10000)
>>>>>
>>>>> What's the significance of the 64 kiB constant here?  Should it be a
>>>>> symbolic name, or spelled "64 * kiB".
>>>>
>>>> Ok.
>>>
>>>
>>> Hmm.  Am I right in thinking that both the 64-bit RAM slot and the ATSD
>>> slot are per-vPHB?
>>
>> These are windows from which I allocated RAM base and ATSD per GPU/NPU.
> 
> Ok, I guess that per-vPHB set of windows is what I'm meaning by "slot"
> then.
> 
>>> Would it make more sense to directly index into
>>> the array of slots with the phb index, rather than having a separate
>>> GPU index?
>>
>> There can be 1 or many "slots" per PHB ("many" is not really encouraged
>> as they will miss ATSD but nevertheless), and "slots" are not in a
>> global list of any kind.
> 
> Ok, I think we're using different meanings of "slot" here.  By "slot"
> I was meaning one 64-bit and one ATS window with a common index
> (i.e. a slot in the array indices, rather than a physical slot on the
> system).  IIUC all the GPUs and NPUs on a vPHB will sit in a single
> "slot" by that sense.


This is not what I meant though. A slot is a physical SXM2 (or whatever
that acronym is) slot with all these links, and one vPHB has 2 or 3 of
these. I could use vPHB index + slot index for ATSD and RAM, but I chose
to do more or less the same thing as is done for BARs, which are allocated
within the selected windows.
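
For the archives, here is a minimal sketch of the two addressing schemes
being discussed. It assumes the constants and the nvgpu config struct from
the patch above; the helper names are made up for illustration only:

/* Scheme used in the patch: bump-allocate GPU RAM within the per-vPHB
 * window, much like BAR assignment, via a "current" pointer. */
static uint64_t nvgpu_alloc_gpa(struct spapr_phb_pci_nvgpu_config *nvgpus,
                                uint64_t size)
{
    uint64_t gpa = nvgpus->nv2_ram_current;

    nvgpus->nv2_ram_current += size;
    return gpa;
}

/* Alternative raised in the review: a fixed layout indexed by vPHB index
 * and slot number; no allocator state, but it reserves a window for every
 * possible index. */
static uint64_t nvgpu_fixed_gpa(uint32_t phb_index, int slot,
                                uint64_t per_gpu_size)
{
    return SPAPR_PCI_NV2RAM64_WIN_BASE +
           phb_index * SPAPR_PCI_NV2RAM64_WIN_SIZE +
           slot * per_gpu_size;
}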


> 
>>>>>> +
>>>>>>  static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
>>>>>>  {
>>>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>>>>>> @@ -135,6 +155,11 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
>>>>>>  int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
>>>>>>  int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
>>>>>>  void spapr_phb_vfio_reset(DeviceState *qdev);
>>>>>> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb);
>>>>>> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off);
>>>>>> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);
>>>>>> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
>>>>>> +                                        sPAPRPHBState *sphb);
>>>>>>  #else
>>>>>>  static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)
>>>>>>  {
>>>>>> @@ -161,6 +186,22 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
>>>>>>  static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>>>>>>  {
>>>>>>  }
>>>>>> +static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> +}
>>>>>> +static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,
>>>>>> +                                               int bus_off)
>>>>>> +{
>>>>>> +}
>>>>>> +static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,
>>>>>> +                                                   void *fdt)
>>>>>> +{
>>>>>> +}
>>>>>> +static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
>>>>>> +                                                      int offset,
>>>>>> +                                                      sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> +}
>>>>>
>>>>> I'm guessing some of these should never get called on systems without
>>>>> NVLink2, in which case they should probably have a
>>>>> g_assert_not_reached() in there.
>>>>
>>>> I guess if you compile QEMU for --target-list=ppc64-softmmu in Windows
>>>> (i.e. tcg + pseries + pci but no vfio), these will be called and crash
>>>> then, no?
>>>
>>> Well, if they can be called in that situation then, yes, they need to
>>> be no-ops like they are now.  But is that true for all of them?
>>> Hmm.. yes it might be, never mind.
>>>
>>>>>
>>>>>>  #endif
>>>>>>  
>>>>>>  void spapr_phb_dma_reset(sPAPRPHBState *sphb);
>>>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>>>> index 358bb38..9acf867 100644
>>>>>> --- a/include/hw/ppc/spapr.h
>>>>>> +++ b/include/hw/ppc/spapr.h
>>>>>> @@ -113,7 +113,8 @@ struct sPAPRMachineClass {
>>>>>>      void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
>>>>>>                            uint64_t *buid, hwaddr *pio, 
>>>>>>                            hwaddr *mmio32, hwaddr *mmio64,
>>>>>> -                          unsigned n_dma, uint32_t *liobns, Error **errp);
>>>>>> +                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
>>>>>> +                          hwaddr *nv2atsd, Error **errp);
>>>>>>      sPAPRResizeHPT resize_hpt_default;
>>>>>>      sPAPRCapabilities default_caps;
>>>>>>      sPAPRIrq *irq;
>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>> index 74c9b07..fda6e7e 100644
>>>>>> --- a/hw/ppc/spapr.c
>>>>>> +++ b/hw/ppc/spapr.c
>>>>>> @@ -3929,7 +3929,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>>>>>>      smc->phb_placement(spapr, sphb->index,
>>>>>>                         &sphb->buid, &sphb->io_win_addr,
>>>>>>                         &sphb->mem_win_addr, &sphb->mem64_win_addr,
>>>>>> -                       windows_supported, sphb->dma_liobn, errp);
>>>>>> +                       windows_supported, sphb->dma_liobn,
>>>>>> +                       &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
>>>>>> +                       errp);
>>>>>>  }
>>>>>>  
>>>>>>  static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>>>>>> @@ -4129,7 +4131,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
>>>>>>  static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>>>>>>                                  uint64_t *buid, hwaddr *pio,
>>>>>>                                  hwaddr *mmio32, hwaddr *mmio64,
>>>>>> -                                unsigned n_dma, uint32_t *liobns, Error **errp)
>>>>>> +                                unsigned n_dma, uint32_t *liobns,
>>>>>> +                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>>>>>>  {
>>>>>>      /*
>>>>>>       * New-style PHB window placement.
>>>>>> @@ -4174,6 +4177,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
>>>>>>      *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
>>>>>>      *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
>>>>>>      *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
>>>>>> +
>>>>>
>>>>> This doesn't look right.  SPAPR_PCI_NV2ATSD_WIN_BASE appears to be
>>>>> defined such that there slots for NVGPU_MAX_NUM gpa "slots" of size
>>>>> SPAPR_PCI_NV2RAM64_WIN_SIZE before we get to the ATSD base.
>>>>>
>>>>>> +    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
>>>>>
>>>>> But this implies you need a "slot" for every possible PHB index, which
>>>>> is rather more than NVGPU_MAX_NUM.
>>>>>
>>>>>> +    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
>>>>
>>>>
>>>> Ah right :( These should then go above 128TiB, I guess, as I do not really
>>>> want them to appear inside a huge dma window.
>>>
>>> Right.  So actually looks like you are already indexing the window
>>> slots by phb index, in which case you need to allow for 32 slots even
>>> though only 6 can be populated at the moment.
>>
>>
>> Why precisely 32? Round up of 18?
> 
> Because 32 is the allowed number of vPHBs.
> 
>>>>>>  }
>>>>>>  
>>>>>>  static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
>>>>>> @@ -4376,6 +4382,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
>>>>>>  /*
>>>>>>   * pseries-3.1
>>>>>>   */
>>>>>> +static void phb_placement_3_1(sPAPRMachineState *spapr, uint32_t index,
>>>>>> +                              uint64_t *buid, hwaddr *pio,
>>>>>> +                              hwaddr *mmio32, hwaddr *mmio64,
>>>>>> +                              unsigned n_dma, uint32_t *liobns,
>>>>>> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>>>>>> +{
>>>>>> +    spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
>>>>>> +                        nv2gpa, nv2atsd, errp);
>>>>>> +    *nv2gpa = 0;
>>>>>> +    *nv2atsd = 0;
>>>>>> +}
>>>>>> +
>>>>>>  static void spapr_machine_3_1_class_options(MachineClass *mc)
>>>>>>  {
>>>>>>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
>>>>>> @@ -4391,6 +4409,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
>>>>>>      mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power8_v2.0");
>>>>>>      smc->update_dt_enabled = false;
>>>>>>      smc->dr_phb_enabled = false;
>>>>>> +    smc->phb_placement = phb_placement_3_1;
>>>>>>  }
>>>>>>  
>>>>>>  DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
>>>>>> @@ -4522,7 +4541,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
>>>>>>  static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>>>>>>                                uint64_t *buid, hwaddr *pio,
>>>>>>                                hwaddr *mmio32, hwaddr *mmio64,
>>>>>> -                              unsigned n_dma, uint32_t *liobns, Error **errp)
>>>>>> +                              unsigned n_dma, uint32_t *liobns,
>>>>>> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>>>>>>  {
>>>>>>      /* Legacy PHB placement for pseries-2.7 and earlier machine types */
>>>>>>      const uint64_t base_buid = 0x800000020000000ULL;
>>>>>> @@ -4566,6 +4586,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>>>>>>       * fallback behaviour of automatically splitting a large "32-bit"
>>>>>>       * window into contiguous 32-bit and 64-bit windows
>>>>>>       */
>>>>>> +
>>>>>> +    *nv2gpa = 0;
>>>>>> +    *nv2atsd = 0;
>>>>>>  }
>>>>>>  
>>>>>>  static void spapr_machine_2_7_class_options(MachineClass *mc)
>>>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>>>> index 06a5ffd..f076462 100644
>>>>>> --- a/hw/ppc/spapr_pci.c
>>>>>> +++ b/hw/ppc/spapr_pci.c
>>>>>> @@ -1355,6 +1355,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
>>>>>>      if (sphb->pcie_ecs && pci_is_express(dev)) {
>>>>>>          _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
>>>>>>      }
>>>>>> +
>>>>>> +    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
>>>>>>  }
>>>>>>  
>>>>>>  /* create OF node for pci device and required OF DT properties */
>>>>>> @@ -1878,6 +1880,7 @@ static void spapr_phb_reset(DeviceState *qdev)
>>>>>>      sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
>>>>>>  
>>>>>>      spapr_phb_dma_reset(sphb);
>>>>>> +    spapr_phb_nvgpu_setup(sphb);
>>>>>>  
>>>>>>      /* Reset the IOMMU state */
>>>>>>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
>>>>>> @@ -1910,6 +1913,8 @@ static Property spapr_phb_properties[] = {
>>>>>>                       pre_2_8_migration, false),
>>>>>>      DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,
>>>>>>                       pcie_ecs, true),
>>>>>> +    DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),
>>>>>> +    DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),
>>>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>>>  };
>>>>>>  
>>>>>> @@ -2282,6 +2287,9 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb, uint32_t intc_phandle, void *fdt,
>>>>>>          return ret;
>>>>>>      }
>>>>>>  
>>>>>> +    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off);
>>>>>> +    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
>>>>>> +
>>>>>>      return 0;
>>>>>>  }
>>>>>>  
>>>>>> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
>>>>>> new file mode 100644
>>>>>> index 0000000..965a6be
>>>>>> --- /dev/null
>>>>>> +++ b/hw/ppc/spapr_pci_nvlink2.c
>>>>>> @@ -0,0 +1,419 @@
>>>>>> +/*
>>>>>> + * QEMU sPAPR PCI for NVLink2 pass through
>>>>>> + *
>>>>>> + * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
>>>>>> + *
>>>>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>>>>> + * of this software and associated documentation files (the "Software"), to deal
>>>>>> + * in the Software without restriction, including without limitation the rights
>>>>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>>>>> + * copies of the Software, and to permit persons to whom the Software is
>>>>>> + * furnished to do so, subject to the following conditions:
>>>>>> + *
>>>>>> + * The above copyright notice and this permission notice shall be included in
>>>>>> + * all copies or substantial portions of the Software.
>>>>>> + *
>>>>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>>>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>>>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>>>>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>>>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>>>>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>>>>> + * THE SOFTWARE.
>>>>>> + */
>>>>>> +#include "qemu/osdep.h"
>>>>>> +#include "qapi/error.h"
>>>>>> +#include "qemu-common.h"
>>>>>> +#include "hw/pci/pci.h"
>>>>>> +#include "hw/pci-host/spapr.h"
>>>>>> +#include "qemu/error-report.h"
>>>>>> +#include "hw/ppc/fdt.h"
>>>>>> +#include "hw/pci/pci_bridge.h"
>>>>>> +
>>>>>> +#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
>>>>>> +                                     (((phb)->index) << 16) | ((pdev)->devfn))
>>>>>> +#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
>>>>>> +                                     (((phb)->index) << 16))
>>>>>> +/* NVLink2 wants a separate NUMA node for its RAM */
>>>>>> +#define GPURAM_ASSOCIATIVITY(phb, n) (255 - ((phb)->index * 3 + (n)))
>>>>>> +#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
>>>>>> +                                     ((gn) << 4) | (nn))
>>>>>> +
>>>>>> +/* Max number of NVLinks per GPU in any physical box */
>>>>>> +#define NVGPU_MAX_LINKS              3
>>>>>> +
>>>>>> +struct spapr_phb_pci_nvgpu_config {
>>>>>> +    uint64_t nv2_ram_current;
>>>>>> +    uint64_t nv2_atsd_current;
>>>>>> +    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
>>>>>> +    struct spapr_phb_pci_nvgpu_slot {
>>>>>> +        uint64_t tgt;
>>>>>> +        uint64_t gpa;
>>>>>> +        PCIDevice *gpdev;
>>>>>> +        int linknum;
>>>>>> +        struct {
>>>>>> +            uint64_t atsd_gpa;
>>>>>> +            PCIDevice *npdev;
>>>>>> +            uint32_t link_speed;
>>>>>> +        } links[NVGPU_MAX_LINKS];
>>>>>> +    } slots[NVGPU_MAX_NUM];
>>>>>> +};
>>>>>> +
>>>>>> +static int spapr_pci_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus,
>>>>>> +                                    uint64_t tgt)
>>>>>> +{
>>>>>> +    int i;
>>>>>> +
>>>>>> +    /* Search for partially collected "slot" */
>>>>>> +    for (i = 0; i < nvgpus->num; ++i) {
>>>>>> +        if (nvgpus->slots[i].tgt == tgt) {
>>>>>> +            return i;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
>>>>>> +        warn_report("Found too many NVLink bridges per GPU");
>>>>>> +        return -1;
>>>>>
>>>>> This is within qemu so it would be better to use the qemu error API
>>>>> than returning an error code.
>>>>
>>>> You mean returning Error**? Oh. Ok.
>>>
>>> Well, not returning, technically, but taking an Error ** parameter
>>> which is checked by the caller to detect errors.
>>
>>
>> None of these is actually propagated to the upper level as neither of
>> these is fatal (well, except one which I am turning into assert).
> 
> Oh, ok.  In that case you don't need an Error **.

Well, I have already added them in my local tree and I print the faults
in spapr_pci.c.
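
Roughly this shape (a sketch only, with illustrative names, not the exact
code in my tree):

/* The nvlink2 helper fills an Error instead of printing directly. */
static void spapr_phb_nvgpu_populate_dt_err(sPAPRPHBState *sphb, void *fdt,
                                            int bus_off, Error **errp)
{
    if (!sphb->nvgpus) {
        return;
    }
    if (sphb->nvgpus->num > NVGPU_MAX_NUM) {
        error_setg(errp, "nvlink2: more GPU slots found than supported (%d)",
                   NVGPU_MAX_NUM);
        return;
    }
    /* ... emit ibm,mmio-atsd and friends as in the patch above ... */
}

/* The caller in spapr_pci.c reports the fault and carries on, since none
 * of these conditions is fatal for the guest. */
static void caller_in_spapr_pci(sPAPRPHBState *phb, void *fdt, int bus_off)
{
    Error *local_err = NULL;

    spapr_phb_nvgpu_populate_dt_err(phb, fdt, bus_off, &local_err);
    if (local_err) {
        error_report_err(local_err);
    }
}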


>>>>>> +    }
>>>>>> +
>>>>>> +    i = nvgpus->num;
>>>>>> +    nvgpus->slots[i].tgt = tgt;
>>>>>> +    ++nvgpus->num;
>>>>>> +
>>>>>> +    return i;
>>>>>
>>>>> Might be nicer to return a pointer to the slot structure.
>>>>
>>>>
>>>> This can work.
>>>>
>>>>
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
>>>>>> +                                    PCIDevice *pdev, uint64_t tgt,
>>>>>> +                                    MemoryRegion *mr)
>>>>>> +{
>>>>>> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt);
>>>>>> +
>>>>>> +    if (i < 0) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +    g_assert(!nvgpus->slots[i].gpdev);
>>>>>> +    nvgpus->slots[i].gpdev = pdev;
>>>>>> +
>>>>>> +    nvgpus->slots[i].gpa = nvgpus->nv2_ram_current;
>>>>>> +    nvgpus->nv2_ram_current += memory_region_size(mr);
>>>>>> +}
>>>>>> +
>>>>>> +static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
>>>>>> +                                    PCIDevice *pdev, uint64_t tgt,
>>>>>> +                                    MemoryRegion *mr)
>>>>>> +{
>>>>>> +    int i = spapr_pci_nvgpu_get_slot(nvgpus, tgt), j;
>>>>>> +    struct spapr_phb_pci_nvgpu_slot *nvslot;
>>>>>> +
>>>>>> +    if (i < 0) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    nvslot = &nvgpus->slots[i];
>>>>>> +    j = nvslot->linknum;
>>>>>> +    if (j == ARRAY_SIZE(nvslot->links)) {
>>>>>> +        warn_report("Found too many NVLink2 bridges");
>>>>>> +        return;
>>>>>> +    }
>>>>>> +    ++nvslot->linknum;
>>>>>> +
>>>>>> +    g_assert(!nvslot->links[j].npdev);
>>>>>> +    nvslot->links[j].npdev = pdev;
>>>>>> +    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
>>>>>> +    nvgpus->nv2_atsd_current += memory_region_size(mr);
>>>>>> +    nvslot->links[j].link_speed =
>>>>>> +        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
>>>>>> +}
>>>>>> +
>>>>>> +static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
>>>>>> +                                        void *opaque)
>>>>>> +{
>>>>>> +    PCIBus *sec_bus;
>>>>>> +    Object *po = OBJECT(pdev);
>>>>>> +    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
>>>>>> +
>>>>>> +    if (tgt) {
>>>>>> +        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
>>>>>> +        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
>>>>>> +                                                  NULL);
>>>>>> +
>>>>>> +        if (mr_gpu) {
>>>>>> +            spapr_pci_collect_nvgpu(opaque, pdev, tgt, MEMORY_REGION(mr_gpu));
>>>>>> +        } else if (mr_npu) {
>>>>>> +            spapr_pci_collect_nvnpu(opaque, pdev, tgt, MEMORY_REGION(mr_npu));
>>>>>> +        } else {
>>>>>> +            warn_report("Unexpected device with \"nvlink2-tgt\"");
>>>>>
>>>>> IIUC this would have to be a code error, so should be an assert() not
>>>>> a warning.
>>>>
>>>>
>>>> Ok.
>>>>
>>>>>
>>>>>> +        }
>>>>>> +    }
>>>>>> +    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
>>>>>> +         PCI_HEADER_TYPE_BRIDGE)) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
>>>>>> +    if (!sec_bus) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
>>>>>> +                        spapr_phb_pci_collect_nvgpu, opaque);
>>>>>> +}
>>>>>> +
>>>>>> +void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> +    int i, j, valid_gpu_num;
>>>>>> +
>>>>>> +    /* If there are existing NVLink2 MRs, unmap those before recreating */
>>>>>> +    if (sphb->nvgpus) {
>>>>>> +        for (i = 0; i < sphb->nvgpus->num; ++i) {
>>>>>> +            struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>>>> +            Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
>>>>>> +                                                        "nvlink2-mr[0]", NULL);
>>>>>> +
>>>>>> +            if (nv_mrobj) {
>>>>>> +                memory_region_del_subregion(get_system_memory(),
>>>>>> +                                            MEMORY_REGION(nv_mrobj));
>>>>>> +            }
>>>>>> +            for (j = 0; j < nvslot->linknum; ++j) {
>>>>>> +                PCIDevice *npdev = nvslot->links[j].npdev;
>>>>>> +                Object *atsd_mrobj;
>>>>>> +                atsd_mrobj = object_property_get_link(OBJECT(npdev),
>>>>>> +                                                      "nvlink2-atsd-mr[0]",
>>>>>> +                                                      NULL);
>>>>>> +                if (atsd_mrobj) {
>>>>>> +                    memory_region_del_subregion(get_system_memory(),
>>>>>> +                                                MEMORY_REGION(atsd_mrobj));
>>>>>> +                }
>>>>>> +            }
>>>>>> +        }
>>>>>> +        g_free(sphb->nvgpus);
>>>>>
>>>>> Probably worth collecting the above into a nvgpu_free() helper -
>>>>> chances are you'll want it on cleanup paths as well.
>>>>
>>>> The only other cleanup path is below and it only executes if there is no
>>>> MR added so for now it does not seem useful.
>>>
>>> Hrm... I've merged PHB hotplug recently.. so there should be a cleanup
>>> path for unplug as well.
>>
>>
>> ah right. Wooohooo :) btw with phb hotplug we can try supporting EEH on
>> hotplugged VFIO devices.
> 
> Yeah, Sam is looking into it.
> 
>>>>>> +        sphb->nvgpus = NULL;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Search for GPUs and NPUs */
>>>>>> +    if (sphb->nv2_gpa_win_addr && sphb->nv2_atsd_win_addr) {
>>>>>> +        PCIBus *bus = PCI_HOST_BRIDGE(sphb)->bus;
>>>>>> +
>>>>>> +        sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
>>>>>> +        sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
>>>>>> +        sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
>>>>>> +
>>>>>> +        pci_for_each_device(bus, pci_bus_num(bus),
>>>>>> +                            spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Add found GPU RAM and ATSD MRs if found */
>>>>>> +    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
>>>>>> +        Object *nvmrobj;
>>>>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>>>> +
>>>>>> +        if (!nvslot->gpdev) {
>>>>>> +            continue;
>>>>>> +        }
>>>>>> +        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
>>>>>> +                                           "nvlink2-mr[0]", NULL);
>>>>>> +        /* ATSD is pointless without GPU RAM MR so skip those */
>>>>>> +        if (!nvmrobj) {
>>>>>> +            continue;
>>>>>> +        }
>>>>>> +
>>>>>> +        ++valid_gpu_num;
>>>>>> +        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
>>>>>> +                                    MEMORY_REGION(nvmrobj));
>>>>>> +
>>>>>> +        for (j = 0; j < nvslot->linknum; ++j) {
>>>>>> +            Object *atsdmrobj;
>>>>>> +
>>>>>> +            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
>>>>>> +                                                 "nvlink2-atsd-mr[0]",
>>>>>> +                                                 NULL);
>>>>>> +            if (!atsdmrobj) {
>>>>>> +                continue;
>>>>>> +            }
>>>>>> +            memory_region_add_subregion(get_system_memory(),
>>>>>> +                                        nvslot->links[j].atsd_gpa,
>>>>>> +                                        MEMORY_REGION(atsdmrobj));
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    if (!valid_gpu_num) {
>>>>>> +        /* We did not find any interesting GPU */
>>>>>> +        g_free(sphb->nvgpus);
>>>>>> +        sphb->nvgpus = NULL;
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off)
>>>>>> +{
>>>>>> +    int i, j, atsdnum = 0;
>>>>>> +    uint64_t atsd[8]; /* The existing limitation of known guests */
>>>>>> +
>>>>>> +    if (!sphb->nvgpus) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
>>>>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>>>> +
>>>>>> +        if (!nvslot->gpdev) {
>>>>>> +            continue;
>>>>>> +        }
>>>>>> +        for (j = 0; j < nvslot->linknum; ++j) {
>>>>>> +            if (!nvslot->links[j].atsd_gpa) {
>>>>>> +                continue;
>>>>>> +            }
>>>>>> +
>>>>>> +            if (atsdnum == ARRAY_SIZE(atsd)) {
>>>>>> +                warn_report("Only %ld ATSD registers allowed",
>>>>>> +                            ARRAY_SIZE(atsd));
>>>>>
>>>>> Probably should be an error not a warning.
>>>>
>>>> We can still continue though, it is not fatal. These things come from
>>>> skiboot (which we control) but skiboot itself could compose the
>>>> properties itself or use whatever hostboot provided (does not happen now
>>>> though) and I would not like to be blocked by hostboot if/when this happens.
>>>
>>> Um.. what?  atsdnum is just a counter incremented below, it doesn't
>>> come from skiboot or any other host-significant value.  The situation
>>> here is that we have more nvlinks assigned to a guest that qemu can
>>> support.  Yes, you could technically run the guest with some of the
>>> links unavailable, but that seems pretty clearly not what the user
>>> wanted.  Hence, an error is appropriate.
>>
>>
>> Not exactly. NVlinks are available whether they come with an ATSD VFIO
>> region or not; it was my choice to accompany ATSD with an NVLink2 bridge.
>> So it is quite possible to pass way too many links and yes, QEMU won't
>> expose all accompanying ATSDs to the guest, but 1) the guest might not need
>> this many ATSDs anyway (right now the NVIDIA driver always uses just one
>> and nobody complained about performance) and 2) an nvlink is functional as
>> long as the guest can access its config space.
> 
> Sure, it can work.  But remember the qemu user is setting up this
> configuration.  I think it makes sense to error if it's a stupid and
> pointless configuration, even if the guest could technically work with
> it.

The ability of the guest to work kinda tells us that it is not that
pointless. By the way, what exactly do you mean by reporting an error
instead of a warning? A fatal hw_error() or error_report() (which is just
a message)?
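
To spell out the options as I see them (all existing QEMU helpers, the
message is just for illustration):

    warn_report("Only %zu ATSD registers allowed", ARRAY_SIZE(atsd));
        /* warning, the device tree is still built */
    error_report("Only %zu ATSD registers allowed", ARRAY_SIZE(atsd));
        /* error message, QEMU still continues */
    error_setg(errp, "Only %zu ATSD registers allowed", ARRAY_SIZE(atsd));
        /* fills an Error and lets the caller decide what is fatal */
    hw_error("Only %zu ATSD registers allowed", ARRAY_SIZE(atsd));
        /* fatal, aborts QEMU */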



>>>>>> +                break;
>>>>>> +            }
>>>>>> +            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
>>>>>> +            ++atsdnum;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    if (!atsdnum) {
>>>>>> +        warn_report("No ATSD registers found");
>>>>>> +    } else if (!spapr_phb_eeh_available(sphb)) {
>>>>>> +        /*
>>>>>> +         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
>>>>>> +         * which we do not emulate as a separate device. Instead we put
>>>>>> +         * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not
>>>>>> +         * put GPUs from different IOMMU groups to the same vPHB to ensure
>>>>>> +         * that the guest will use ATSDs from the corresponding NPU.
>>>>>> +         */
>>>>>> +        warn_report("ATSD requires separate vPHB per GPU IOMMU group");
>>>>>> +    } else {
>>>>>> +        _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd",
>>>>>> +                          atsd, atsdnum * sizeof(atsd[0]))));
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)
>>>>>> +{
>>>>>> +    int i, j, linkidx, npuoff;
>>>>>> +    char *npuname;
>>>>>> +
>>>>>> +    if (!sphb->nvgpus) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    npuname = g_strdup_printf("npuphb%d", sphb->index);
>>>>>> +    npuoff = fdt_add_subnode(fdt, 0, npuname);
>>>>>> +    _FDT(npuoff);
>>>>>> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
>>>>>> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
>>>>>> +    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
>>>>>> +    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
>>>>>> +    g_free(npuname);
>>>>>> +
>>>>>> +    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
>>>>>> +        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
>>>>>> +            char *linkname = g_strdup_printf("link@%d", linkidx);
>>>>>> +            int off = fdt_add_subnode(fdt, npuoff, linkname);
>>>>>> +
>>>>>> +            _FDT(off);
>>>>>> +            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx)));
>>>>>> */
>>>>>
>>>>> Are the indices you're using for 'reg' and the unit name arbitrary?
>>>>> If so it's generally best to base them on some static property of the
>>>>> device, rather than just allocating sequentially.
>>>>
>>>> On the host reg is the link index. Here it is actually commented out as
>>>> a reminder for the future.
>>>>
>>>>>
>>>>>> +            _FDT((fdt_setprop_string(fdt, off, "compatible",
>>>>>> +                                     "ibm,npu-link")));
>>>>>> +            _FDT((fdt_setprop_cell(fdt, off, "phandle",
>>>>>> +                                   PHANDLE_NVLINK(sphb, i, j))));
>>>>>> +            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));
>>>>>
>>>>> Why do you need the index here as well as in reg?
>>>>
>>>> I do not need "reg" really and I need index for this:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/npu-dma.c?h=v4.20#n692
>>>
>>>
>>> Ok, because of a silly binding.  That's a good enough reason.
>>>
>>>>>> +            g_free(linkname);
>>>>>> +            ++linkidx;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Add memory nodes for GPU RAM and mark them unusable */
>>>>>> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
>>>>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>>>> +        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
>>>>>> +                                                    "nvlink2-mr[0]", NULL);
>>>>>> +        uint32_t at = cpu_to_be32(GPURAM_ASSOCIATIVITY(sphb, i));
>>>>>> +        uint32_t associativity[] = { cpu_to_be32(0x4), at, at, at, at };
>>>>>> +        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
>>>>>> +        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
>>>>>> +        char *mem_name = g_strdup_printf("memory@%lx", nvslot->gpa);
>>>>>> +        int off = fdt_add_subnode(fdt, 0, mem_name);
>>>>>> +
>>>>>> +        _FDT(off);
>>>>>> +        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
>>>>>> +        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
>>>>>> +        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
>>>>>> +                          sizeof(associativity))));
>>>>>> +
>>>>>> +        _FDT((fdt_setprop_string(fdt, off, "compatible",
>>>>>> +                                 "ibm,coherent-device-memory")));
>>>>>> +
>>>>>> +        mem_reg[1] = cpu_to_be64(0);
>>>>>> +        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
>>>>>> +                          sizeof(mem_reg))));
>>>>>> +        _FDT((fdt_setprop_cell(fdt, off, "phandle",
>>>>>> +                               PHANDLE_GPURAM(sphb, i))));
>>>>>> +        g_free(mem_name);
>>>>>> +    }
>>>>>> +
>>>>>> +}
>>>>>> +
>>>>>> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
>>>>>> +                                        sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> +    int i, j;
>>>>>> +
>>>>>> +    if (!sphb->nvgpus) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
>>>>>> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
>>>>>> +
>>>>>> +        /* Skip "slot" without attached GPU */
>>>>>
>>>>> IIUC a "slot" should always have at least one GPU.  You need to handle
>>>>> the case of an uninitialized GPU in the "collect" functions because you
>>>>> don't know if you'll discover the GPU or an NPU first.  But here not
>>>>> having a GPU should be an error, shouldn't it?
>>>>
>>>>
>>>> If someone decides to pass 1 GPU with all related nvlinks and just
>>>> nvlinks from another GPU but without related GPU for whatever reason,
>>>> should we really stop him/her? Things won't work exactly at their best
>>>> but this still might be useful for weird debugging.
>>>
>>> Hm, ok, I guess so.
>>>
>>>>>> +        if (!nvslot->gpdev) {
>>>>>> +            continue;
>>>>>> +        }
>>>>>> +        if (dev == nvslot->gpdev) {
>>>>>> +            uint32_t npus[nvslot->linknum];
>>>>>> +
>>>>>> +            for (j = 0; j < nvslot->linknum; ++j) {
>>>>>> +                PCIDevice *npdev = nvslot->links[j].npdev;
>>>>>> +
>>>>>> +                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
>>>>>> +            }
>>>>>> +            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
>>>>>> +                             j * sizeof(npus[0])));
>>>>>> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
>>>>>> +                                   PHANDLE_PCIDEV(sphb, dev))));
>>>>>> +            continue;
>>>>>> +        }
>>>>>> +
>>>>>> +        for (j = 0; j < nvslot->linknum; ++j) {
>>>>>> +            if (dev != nvslot->links[j].npdev) {
>>>>>> +                continue;
>>>>>> +            }
>>>>>> +
>>>>>> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
>>>>>> +                                   PHANDLE_PCIDEV(sphb, dev))));
>>>>>> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
>>>>>> +                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
>>>>>> +            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
>>>>>> +                                   PHANDLE_NVLINK(sphb, i, j))));
>>>>>> +            /*
>>>>>> +             * If we ever want to emulate GPU RAM at the same location as on
>>>>>> +             * the host - here is the encoding GPA->TGT:
>>>>>> +             *
>>>>>> +             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
>>>>>> +             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
>>>>>> +             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
>>>>>> +             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
>>>>>> +             */
>>>>>> +            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
>>>>>> +                                  PHANDLE_GPURAM(sphb, i)));
>>>>>> +            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
>>>>>> +                                 nvslot->tgt));
>>>>>> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
>>>>>> +                                  nvslot->links[j].link_speed));
>>>>>> +        }
>>>>>> +    }
>>>>>> +}
>>>>>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>>>>>> index 40a1200..15ec0b4 100644
>>>>>> --- a/hw/vfio/pci-quirks.c
>>>>>> +++ b/hw/vfio/pci-quirks.c
>>>>>> @@ -2180,3 +2180,123 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
>>>>>>  
>>>>>>      return 0;
>>>>>>  }
>>>>>> +
>>>>>> +static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
>>>>>> +                                     const char *name,
>>>>>> +                                     void *opaque, Error **errp)
>>>>>> +{
>>>>>> +    uint64_t tgt = (uint64_t) opaque;
>>>>>> +    visit_type_uint64(v, name, &tgt, errp);
>>>>>> +}
>>>>>> +
>>>>>> +static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
>>>>>> +                                                 const char *name,
>>>>>> +                                                 void *opaque, Error **errp)
>>>>>> +{
>>>>>> +    uint32_t link_speed = (uint32_t)(uint64_t) opaque;
>>>>>> +    visit_type_uint32(v, name, &link_speed, errp);
>>>>>> +}
>>>>>> +
>>>>>> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
>>>>>> +{
>>>>>> +    int ret;
>>>>>> +    void *p;
>>>>>> +    struct vfio_region_info *nv2region = NULL;
>>>>>> +    struct vfio_info_cap_header *hdr;
>>>>>> +    MemoryRegion *nv2mr = g_malloc0(sizeof(*nv2mr));
>>>>>> +
>>>>>> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
>>>>>> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
>>>>>> +                                   PCI_VENDOR_ID_NVIDIA,
>>>>>> +                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
>>>>>> +                                   &nv2region);
>>>>>> +    if (ret) {
>>>>>> +        return ret;
>>>>>> +    }
>>>>>> +
>>>>>> +    p = mmap(NULL, nv2region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
>>>>>> +             MAP_SHARED, vdev->vbasedev.fd, nv2region->offset);
>>>>>> +
>>>>>> +    if (!p) {
>>>>>> +        return -errno;
>>>>>> +    }
>>>>>> +
>>>>>> +    memory_region_init_ram_ptr(nv2mr, OBJECT(vdev), "nvlink2-mr",
>>>>>> +                               nv2region->size, p);
>>>>>> +
>>>>>> +    hdr = vfio_get_region_info_cap(nv2region,
>>>>>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
>>>>>> +    if (hdr) {
>>>>>> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
>>>>>> +
>>>>>> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
>>>>>> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
>>>>>> +                            (void *) cap->tgt, NULL);
>>>>>> +        trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
>>>>>> +                                              nv2region->size);
>>>>>> +    }
>>>>>> +    g_free(nv2region);
>>>>>> +
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
>>>>>> +{
>>>>>> +    int ret;
>>>>>> +    void *p;
>>>>>> +    struct vfio_region_info *atsd_region = NULL;
>>>>>> +    struct vfio_info_cap_header *hdr;
>>>>>> +
>>>>>> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
>>>>>> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
>>>>>> +                                   PCI_VENDOR_ID_IBM,
>>>>>> +                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
>>>>>> +                                   &atsd_region);
>>>>>> +    if (ret) {
>>>>>> +        return ret;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Some NVLink bridges come without assigned ATSD, skip MR part */
>>>>>> +    if (atsd_region->size) {
>>>>>> +        MemoryRegion *atsd_mr = g_malloc0(sizeof(*atsd_mr));
>>>>>> +
>>>>>> +        p = mmap(NULL, atsd_region->size, PROT_READ | PROT_WRITE | PROT_EXEC,
>>>>>> +                 MAP_SHARED, vdev->vbasedev.fd, atsd_region->offset);
>>>>>> +
>>>>>> +        if (!p) {
>>>>>> +            return -errno;
>>>>>> +        }
>>>>>> +
>>>>>> +        memory_region_init_ram_device_ptr(atsd_mr, OBJECT(vdev),
>>>>>> +                                          "nvlink2-atsd-mr",
>>>>>> +                                          atsd_region->size,
>>>>>> +                                          p);
>>>>>> +    }
>>>>>> +
>>>>>> +    hdr = vfio_get_region_info_cap(atsd_region,
>>>>>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
>>>>>> +    if (hdr) {
>>>>>> +        struct vfio_region_info_cap_nvlink2_ssatgt *cap = (void *) hdr;
>>>>>> +
>>>>>> +        object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
>>>>>> +                            vfio_pci_nvlink2_get_tgt, NULL, NULL,
>>>>>> +                            (void *) cap->tgt, NULL);
>>>>>> +        trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, cap->tgt,
>>>>>> +                                                  atsd_region->size);
>>>>>> +    }
>>>>>> +
>>>>>> +    hdr = vfio_get_region_info_cap(atsd_region,
>>>>>> +                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
>>>>>> +    if (hdr) {
>>>>>> +        struct vfio_region_info_cap_nvlink2_lnkspd *cap = (void *) hdr;
>>>>>> +
>>>>>> +        object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
>>>>>> +                            vfio_pci_nvlink2_get_link_speed, NULL, NULL,
>>>>>> +                            (void *) (uint64_t) cap->link_speed, NULL);
>>>>>> +        trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
>>>>>> +                                                  cap->link_speed);
>>>>>> +    }
>>>>>> +    g_free(atsd_region);
>>>>>> +
>>>>>> +    return 0;
>>>>>> +}
>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>> index dd12f36..07aa141 100644
>>>>>> --- a/hw/vfio/pci.c
>>>>>> +++ b/hw/vfio/pci.c
>>>>>> @@ -3069,6 +3069,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>>>          goto out_teardown;
>>>>>>      }
>>>>>>  
>>>>>> +    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
>>>>>> +        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
>>>>>> +        if (ret && ret != -ENODEV) {
>>>>>> +            error_report("Failed to setup NVIDIA V100 GPU RAM");
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
>>>>>> +        ret = vfio_pci_nvlink2_init(vdev, errp);
>>>>>> +        if (ret && ret != -ENODEV) {
>>>>>> +            error_report("Failed to setup NVlink2 bridge");
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>>      vfio_register_err_notifier(vdev);
>>>>>>      vfio_register_req_notifier(vdev);
>>>>>>      vfio_setup_resetfn_quirk(vdev);
>>>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>>>> index cf1e886..88841e9 100644
>>>>>> --- a/hw/vfio/trace-events
>>>>>> +++ b/hw/vfio/trace-events
>>>>>> @@ -87,6 +87,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
>>>>>>  vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
>>>>>>  vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
>>>>>>  
>>>>>> +vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
>>>>>> +vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
>>>>>> +vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
>>>>>> +
>>>>>>  # hw/vfio/common.c
>>>>>>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
>>>>>>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
>>>>>
>>>>
>>>
>>
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 21+ messages in thread
