* [PATCH kernel v3 00/22] powerpc/powernv/npu, vfio: NVIDIA V100 + P9 passthrough
@ 2018-11-13  8:28 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson


This is for passing through NVIDIA V100 GPUs on POWER9 systems.
Patch 7/7 and https://github.com/aik/linux/commit/f41f5666d27b31c1
have the details of the hardware setup.

This implements support for the NVIDIA V100 GPU with coherent memory and
the NPU/ATS support available in the POWER9 CPU. The aim is to support
an unmodified vendor driver in the guest.

The QEMU tree is pushed to github:
https://github.com/aik/qemu/tree/nv2-stage4

The host and guest kernel tree is pushed to github as well:
https://github.com/aik/linux/tree/nv2-stage4

Skiboot bits are here:
https://github.com/aik/skiboot/tree/nv2-stage4

Please comment. Thanks.



Alexey Kardashevskiy (22):
  powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2
  powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a
    region
  powerpc/mm/iommu: Make mm_iommu_new() fail on existing regions
  powerpc/vfio/iommu/kvm: Do not pin device memory
  powerpc/powernv/npu: Add helper to access struct npu for NPU device
  powerpc/powernv: Detach npu struct from pnv_phb
  powerpc/powernv/npu: Move OPAL calls away from context manipulation
  powerpc/pseries/iommu: Allow dynamic window to start from zero
  powerpc/pseries/iommu: Force default DMA window removal
  powerpc/pseries/iommu: Use memory@ nodes in max RAM address
    calculation
  powerpc/pseries/npu: Enable platform support
  powerpc/pseries: Remove IOMMU API support for non-LPAR systems
  powerpc/powernv/pseries: Rework device adding to IOMMU groups
  powerpc/iommu_api: Move IOMMU groups setup to a single place
  powerpc/powernv: Reference iommu_table while it is linked to a group
  powerpc/powernv: Add purge cache OPAL call
  powerpc/powernv/npu: Convert NPU IOMMU helpers to
    iommu_table_group_ops
  powerpc/powernv/npu: Add compound IOMMU groups
  powerpc/powernv/npu: Add release_ownership hook
  vfio_pci: Allow mapping extra regions
  vfio_pci: Allow regions to add own capabilities
  vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver

 drivers/vfio/pci/Makefile                     |   1 +
 arch/powerpc/include/asm/iommu.h              |  17 +-
 arch/powerpc/include/asm/mmu_context.h        |   9 +-
 arch/powerpc/include/asm/opal-api.h           |   3 +-
 arch/powerpc/include/asm/opal.h               |   1 +
 arch/powerpc/include/asm/pci.h                |   4 +
 arch/powerpc/platforms/powernv/pci.h          |  30 +-
 drivers/vfio/pci/trace.h                      | 102 ++++
 drivers/vfio/pci/vfio_pci_private.h           |   8 +
 include/uapi/linux/vfio.h                     |  26 +
 arch/powerpc/kernel/iommu.c                   |  67 +--
 arch/powerpc/kvm/book3s_64_vio.c              |  18 +-
 arch/powerpc/mm/mmu_context_iommu.c           | 100 +++-
 arch/powerpc/platforms/powernv/npu-dma.c      | 531 +++++++++++++++---
 arch/powerpc/platforms/powernv/opal.c         |   1 +
 arch/powerpc/platforms/powernv/pci-ioda-tce.c |   3 +-
 arch/powerpc/platforms/powernv/pci-ioda.c     | 229 ++++----
 arch/powerpc/platforms/powernv/pci.c          |  43 +-
 arch/powerpc/platforms/pseries/iommu.c        | 134 +++--
 arch/powerpc/platforms/pseries/pci.c          |   6 +
 drivers/vfio/pci/vfio_pci.c                   |  54 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c           | 433 ++++++++++++++
 drivers/vfio/vfio_iommu_spapr_tce.c           |  65 ++-
 .../powerpc/platforms/powernv/opal-wrappers.S |   1 +
 drivers/vfio/pci/Kconfig                      |   6 +
 25 files changed, 1497 insertions(+), 395 deletions(-)
 create mode 100644 drivers/vfio/pci/trace.h
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

-- 
2.17.1



* [PATCH kernel v3 01/22] powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

The skiboot firmware has a hot reset handler which fences the NVIDIA V100
GPU RAM on Witherspoons and makes accesses a no-op instead of throwing HMIs:
https://github.com/open-power/skiboot/commit/fca2b2b839a67

Now we are going to pass the V100 through via VFIO, which most certainly
involves KVM guests, and these are often terminated without getting
a chance to offline GPU RAM, so we end up with a running machine with
misconfigured memory. Accessing this memory produces hypervisor
maintenance interrupts (HMIs) which bring the host down.

This wires the hot reset hook up to vfio_pci_disable() via
pci_disable_device(), which switches the NPU2 into a safe mode and
thereby suppresses the HMIs.
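
For reference, a minimal sketch of how the hook is expected to be
reached, assuming pci_disable_device() ends up in powerpc's
pcibios_disable_device() and that pci_controller_ops has grown
a disable_device hook (the dispatch below is an illustration, not part
of this patch):

	/* Assumed dispatch path: pci_disable_device() ->
	 * pcibios_disable_device() -> controller hook */
	void pcibios_disable_device(struct pci_dev *pdev)
	{
		struct pci_controller *phb = pci_bus_to_host(pdev->bus);

		if (phb->controller_ops.disable_device)
			phb->controller_ops.disable_device(pdev);
	}

With pnv_npu_disable_device() plugged into pnv_npu_ioda_controller_ops
below, tearing down a VFIO device ends up calling
eeh_ops->reset(pe, EEH_RESET_HOT), which skiboot turns into the NPU2
fence.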

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alistair Popple <alistair@popple.id.au>
---
Changes:
v2:
* updated the commit log
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 3d2d8fa..c78c204 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3676,6 +3676,15 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
 		pnv_ioda_release_pe(pe);
 }
 
+static void pnv_npu_disable_device(struct pci_dev *pdev)
+{
+	struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
+	struct eeh_pe *eehpe = edev ? edev->pe : NULL;
+
+	if (eehpe && eeh_ops && eeh_ops->reset)
+		eeh_ops->reset(eehpe, EEH_RESET_HOT);
+}
+
 static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
 {
 	struct pnv_phb *phb = hose->private_data;
@@ -3720,6 +3729,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
 	.reset_secondary_bus	= pnv_pci_reset_secondary_bus,
 	.dma_set_mask		= pnv_npu_dma_set_mask,
 	.shutdown		= pnv_pci_ioda_shutdown,
+	.disable_device		= pnv_npu_disable_device,
 };
 
 static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
-- 
2.17.1



* [PATCH kernel v3 02/22] powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

Normally mm_iommu_get() is supposed to add a reference and
mm_iommu_put() to remove it. Historically, however, mm_iommu_find() has
only done the lookup while mm_iommu_get() has done both the allocation
and the referencing.

We are going to add another helper to preregister device memory, so
instead of a single helper which both pre-registers the normal memory
and references the region, we need separate helpers for pre-registering
and for referencing.

This renames:
- mm_iommu_get to mm_iommu_new;
- mm_iommu_find to mm_iommu_get.

To make the mm_iommu_get name reflect what it is supposed to do, this
changes mm_iommu_get() to reference the region, so from now on every
mm_iommu_get() needs a matching mm_iommu_put().
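
A minimal caller sketch of the resulting contract (assuming a region
for @ua/@entries may or may not be preregistered yet; error handling
trimmed):

	struct mm_iommu_table_group_mem_t *mem;
	long ret;

	mem = mm_iommu_get(mm, ua, entries);	/* lookup + reference */
	if (!mem) {
		ret = mm_iommu_new(mm, ua, entries, &mem); /* alloc + ref */
		if (ret)
			return ret;
	}
	/* ... use the region ... */
	mm_iommu_put(mm, mem);			/* drop the reference */

This is the pattern tce_iommu_register_pages() below follows, except it
keeps the reference on success and only drops it on the error paths.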

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* merged 2 patches into one
---
 arch/powerpc/include/asm/mmu_context.h |  4 +--
 arch/powerpc/mm/mmu_context_iommu.c    | 13 ++++++---
 drivers/vfio/vfio_iommu_spapr_tce.c    | 37 +++++++++++++++++---------
 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 0381394..2d6b00d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -21,7 +21,7 @@ struct mm_iommu_table_group_mem_t;
 
 extern int isolate_lru_page(struct page *page);	/* from internal.h */
 extern bool mm_iommu_preregistered(struct mm_struct *mm);
-extern long mm_iommu_get(struct mm_struct *mm,
+extern long mm_iommu_new(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
@@ -32,7 +32,7 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
 		unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
 		struct mm_struct *mm, unsigned long ua, unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+extern struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 1d5161f..babc6ad 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -89,7 +89,7 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
+long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
@@ -202,7 +202,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(mm_iommu_get);
+EXPORT_SYMBOL_GPL(mm_iommu_new);
 
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
@@ -318,21 +318,26 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
 	return ret;
 }
 
-struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries)
 {
 	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
 
+	mutex_lock(&mem_list_mutex);
+
 	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
 			ret = mem;
+			++mem->used;
 			break;
 		}
 	}
 
+	mutex_unlock(&mem_list_mutex);
+
 	return ret;
 }
-EXPORT_SYMBOL_GPL(mm_iommu_find);
+EXPORT_SYMBOL_GPL(mm_iommu_get);
 
 long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index ad63725..56db071 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -151,12 +151,13 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 {
 	struct mm_iommu_table_group_mem_t *mem;
 	struct tce_iommu_prereg *tcemem;
-	bool found = false;
+	bool found;
+	long ret;
 
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
 		return -EINVAL;
 
-	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
+	mem = mm_iommu_get(container->mm, vaddr, size >> PAGE_SHIFT);
 	if (!mem)
 		return -ENOENT;
 
@@ -168,9 +169,13 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 	}
 
 	if (!found)
-		return -ENOENT;
+		ret = -ENOENT;
+	else
+		ret = tce_iommu_prereg_free(container, tcemem);
 
-	return tce_iommu_prereg_free(container, tcemem);
+	mm_iommu_put(container->mm, mem);
+
+	return ret;
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -185,22 +190,24 @@ static long tce_iommu_register_pages(struct tce_container *container,
 			((vaddr + size) < vaddr))
 		return -EINVAL;
 
-	mem = mm_iommu_find(container->mm, vaddr, entries);
+	mem = mm_iommu_get(container->mm, vaddr, entries);
 	if (mem) {
 		list_for_each_entry(tcemem, &container->prereg_list, next) {
-			if (tcemem->mem == mem)
-				return -EBUSY;
+			if (tcemem->mem == mem) {
+				ret = -EBUSY;
+				goto put_exit;
+			}
 		}
+	} else {
+		ret = mm_iommu_new(container->mm, vaddr, entries, &mem);
+		if (ret)
+			return ret;
 	}
 
-	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
-	if (ret)
-		return ret;
-
 	tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL);
 	if (!tcemem) {
-		mm_iommu_put(container->mm, mem);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto put_exit;
 	}
 
 	tcemem->mem = mem;
@@ -209,6 +216,10 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	container->enabled = true;
 
 	return 0;
+
+put_exit:
+	mm_iommu_put(container->mm, mem);
+	return ret;
 }
 
 static bool tce_page_is_contained(struct page *page, unsigned page_shift)
-- 
2.17.1



* [PATCH kernel v3 03/22] powerpc/mm/iommu: Make mm_iommu_new() fail on existing regions
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

Since we are going to have two different preregistering helpers, let's
make it clear that mm_iommu_new() is only for normal memory (i.e. not
device memory) and that mm_iommu_get() should be used instead for
existing areas.

This removes the check for an exact match as the overlap check is
enough now.
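
The overlap check is sufficient because an exact match is just
a special case of overlap; as a sketch (the helper name is illustrative,
the actual test stays open-coded in mm_iommu_new()):

	/* Regions cover [ua, ua + (entries << PAGE_SHIFT)) */
	static bool mm_iommu_overlaps(struct mm_iommu_table_group_mem_t *mem,
			unsigned long ua, unsigned long entries)
	{
		return (mem->ua < (ua + (entries << PAGE_SHIFT))) &&
			(ua < (mem->ua + (mem->entries << PAGE_SHIFT)));
	}

A region with mem->ua == ua and mem->entries == entries satisfies both
conditions, so mm_iommu_new() now fails on it like on any other
overlapping region.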

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* remove the exact match check
---
 arch/powerpc/mm/mmu_context_iommu.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index babc6ad..580d89e 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -102,12 +102,6 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
 			next) {
-		if ((mem->ua == ua) && (mem->entries == entries)) {
-			++mem->used;
-			*pmem = mem;
-			goto unlock_exit;
-		}
-
 		/* Overlap? */
 		if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
 				(ua < (mem->ua +
-- 
2.17.1



* [PATCH kernel v3 04/22] powerpc/vfio/iommu/kvm: Do not pin device memory
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

This new memory does not have page structs as it is not plugged into
the host, so gup() will fail on it anyway.

This adds 2 helpers:
- mm_iommu_newdev() to preregister the "memory device" memory so
the rest of the API can still be used;
- mm_iommu_is_devmem() to tell whether a physical address belongs to one
of these new regions, which must not be unpinned.

This also adds @mm to tce_page_is_contained() and iommu_tce_xchg() so
they can test whether the memory is device memory and avoid
pfn_to_page() on it.
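
A hypothetical caller sketch showing how the two helpers fit together
(@ua, @size, @dev_hpa and @hpa are illustrative values, not taken from
this patch):

	struct mm_iommu_table_group_mem_t *mem;
	long ret;

	/* Preregister device memory: no gup(), no locked_vm accounting */
	ret = mm_iommu_newdev(mm, ua, size >> PAGE_SHIFT, dev_hpa, &mem);
	if (ret)
		return ret;

	/* Later, before touching a struct page for a host address */
	if (!mm_iommu_is_devmem(mm, hpa, pageshift))
		SetPageDirty(pfn_to_page(hpa >> PAGE_SHIFT));

The second hunk is effectively what the iommu_tce_xchg() change below
does: skip SetPageDirty() for addresses backed by device memory.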

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h       |  5 +-
 arch/powerpc/include/asm/mmu_context.h |  5 ++
 arch/powerpc/kernel/iommu.c            |  9 ++-
 arch/powerpc/kvm/book3s_64_vio.c       | 18 +++---
 arch/powerpc/mm/mmu_context_iommu.c    | 83 +++++++++++++++++++++++---
 drivers/vfio/vfio_iommu_spapr_tce.c    | 28 +++++----
 6 files changed, 116 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 35db0cb..a8aeac0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -218,8 +218,9 @@ extern void iommu_register_group(struct iommu_table_group *table_group,
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
-extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
-		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long *hpa,
+		enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 2d6b00d..f0f9f3d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -24,6 +24,9 @@ extern bool mm_iommu_preregistered(struct mm_struct *mm);
 extern long mm_iommu_new(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
+extern long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
 		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
@@ -39,6 +42,8 @@ extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua);
+extern bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
+		unsigned int pageshift);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index f0dc680..8ccfdd9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -47,6 +47,7 @@
 #include <asm/fadump.h>
 #include <asm/vio.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
 
 #define DBG(...)
 
@@ -993,15 +994,17 @@ int iommu_tce_check_gpa(unsigned long page_shift, unsigned long gpa)
 }
 EXPORT_SYMBOL_GPL(iommu_tce_check_gpa);
 
-long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
-		unsigned long *hpa, enum dma_data_direction *direction)
+long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long *hpa,
+		enum dma_data_direction *direction)
 {
 	long ret;
 
 	ret = tbl->it_ops->exchange(tbl, entry, hpa, direction);
 
 	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
-			(*direction == DMA_BIDIRECTIONAL)))
+			(*direction == DMA_BIDIRECTIONAL)) &&
+			!mm_iommu_is_devmem(mm, *hpa, tbl->it_page_shift))
 		SetPageDirty(pfn_to_page(*hpa >> PAGE_SHIFT));
 
 	/* if (unlikely(ret))
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 62a8d03..532ab797 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -397,12 +397,13 @@ static long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *stt,
 	return H_SUCCESS;
 }
 
-static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
+static void kvmppc_clear_tce(struct mm_struct *mm, struct iommu_table *tbl,
+		unsigned long entry)
 {
 	unsigned long hpa = 0;
 	enum dma_data_direction dir = DMA_NONE;
 
-	iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	iommu_tce_xchg(mm, tbl, entry, &hpa, &dir);
 }
 
 static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
@@ -433,7 +434,7 @@ static long kvmppc_tce_iommu_do_unmap(struct kvm *kvm,
 	unsigned long hpa = 0;
 	long ret;
 
-	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
+	if (WARN_ON_ONCE(iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir)))
 		return H_TOO_HARD;
 
 	if (dir == DMA_NONE)
@@ -441,7 +442,7 @@ static long kvmppc_tce_iommu_do_unmap(struct kvm *kvm,
 
 	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
 	if (ret != H_SUCCESS)
-		iommu_tce_xchg(tbl, entry, &hpa, &dir);
+		iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir);
 
 	return ret;
 }
@@ -487,7 +488,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 	if (mm_iommu_mapped_inc(mem))
 		return H_TOO_HARD;
 
-	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	ret = iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir);
 	if (WARN_ON_ONCE(ret)) {
 		mm_iommu_mapped_dec(mem);
 		return H_TOO_HARD;
@@ -566,7 +567,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 					entry, ua, dir);
 
 		if (ret != H_SUCCESS) {
-			kvmppc_clear_tce(stit->tbl, entry);
+			kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry);
 			goto unlock_exit;
 		}
 	}
@@ -655,7 +656,8 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 					iommu_tce_direction(tce));
 
 			if (ret != H_SUCCESS) {
-				kvmppc_clear_tce(stit->tbl, entry);
+				kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl,
+						entry);
 				goto unlock_exit;
 			}
 		}
@@ -704,7 +706,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 				return ret;
 
 			WARN_ON_ONCE(1);
-			kvmppc_clear_tce(stit->tbl, entry);
+			kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry);
 		}
 	}
 
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 580d89e..62fe5fe 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -47,6 +47,8 @@ struct mm_iommu_table_group_mem_t {
 		struct page **hpages;	/* vmalloc'ed */
 		phys_addr_t *hpas;
 	};
+#define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
+	u64 dev_hpa;		/* Device memory base address */
 };
 
 static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
@@ -89,7 +91,8 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
+static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
@@ -112,11 +115,13 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	}
 
-	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
-	if (ret)
-		goto unlock_exit;
+	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
+		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
+		if (ret)
+			goto unlock_exit;
 
-	locked_entries = entries;
+		locked_entries = entries;
+	}
 
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem) {
@@ -124,6 +129,13 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		goto unlock_exit;
 	}
 
+	if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
+		mem->pageshift = __ffs(dev_hpa | (entries << PAGE_SHIFT));
+		mem->dev_hpa = dev_hpa;
+		goto good_exit;
+	}
+	mem->dev_hpa = MM_IOMMU_TABLE_INVALID_HPA;
+
 	/*
 	 * For a starting point for a maximum page size calculation
 	 * we use @ua and @entries natural alignment to allow IOMMU pages
@@ -180,6 +192,7 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	}
 
+good_exit:
 	atomic64_set(&mem->mapped, 1);
 	mem->used = 1;
 	mem->ua = ua;
@@ -196,13 +209,31 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	return ret;
 }
+
+long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, MM_IOMMU_TABLE_INVALID_HPA,
+			pmem);
+}
 EXPORT_SYMBOL_GPL(mm_iommu_new);
 
+long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, dev_hpa, pmem);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_newdev);
+
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
 	long i;
 	struct page *page = NULL;
 
+	if (!mem->hpas)
+		return;
+
 	for (i = 0; i < mem->entries; ++i) {
 		if (!mem->hpas[i])
 			continue;
@@ -244,6 +275,7 @@ static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
 long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 {
 	long ret = 0;
+	unsigned long entries, dev_hpa;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -265,9 +297,12 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 	}
 
 	/* @mapped became 0 so now mappings are disabled, release the region */
+	entries = mem->entries;
+	dev_hpa = mem->dev_hpa;
 	mm_iommu_release(mem);
 
-	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
+	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
+		mm_iommu_adjust_locked_vm(mm, entries, false);
 
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
@@ -337,7 +372,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	u64 *va = &mem->hpas[entry];
+	u64 *va;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
@@ -345,6 +380,12 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 	if (pageshift > mem->pageshift)
 		return -EFAULT;
 
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	va = &mem->hpas[entry];
 	*hpa = (*va & MM_IOMMU_TABLE_GROUP_PAGE_MASK) | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -355,7 +396,6 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	void *va = &mem->hpas[entry];
 	unsigned long *pa;
 
 	if (entry >= mem->entries)
@@ -364,7 +404,12 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 	if (pageshift > mem->pageshift)
 		return -EFAULT;
 
-	pa = (void *) vmalloc_to_phys(va);
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	pa = (void *) vmalloc_to_phys(&mem->hpas[entry]);
 	if (!pa)
 		return -EFAULT;
 
@@ -394,6 +439,26 @@ extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua)
 	*pa |= MM_IOMMU_TABLE_GROUP_PAGE_DIRTY;
 }
 
+extern bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
+		unsigned int pageshift)
+{
+	struct mm_iommu_table_group_mem_t *mem;
+	const unsigned long pagesize = 1UL << pageshift;
+
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
+		if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
+			continue;
+
+		if ((mem->dev_hpa <= hpa) &&
+				(hpa + pagesize <= mem->dev_hpa +
+				 (mem->entries << PAGE_SHIFT)))
+			return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_is_devmem);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
 	if (atomic64_inc_not_zero(&mem->mapped))
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 56db071..ed89137 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -222,8 +222,15 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	return ret;
 }
 
-static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+static bool tce_page_is_contained(struct mm_struct *mm, unsigned long hpa,
+		unsigned int page_shift)
 {
+	struct page *page;
+
+	if (mm_iommu_is_devmem(mm, hpa, page_shift))
+		return true;
+
+	page = pfn_to_page(hpa >> PAGE_SHIFT);
 	/*
 	 * Check that the TCE table granularity is not bigger than the size of
 	 * a page we just found. Otherwise the hardware can get access to
@@ -499,7 +506,8 @@ static int tce_iommu_clear(struct tce_container *container,
 
 		direction = DMA_NONE;
 		oldhpa = 0;
-		ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);
+		ret = iommu_tce_xchg(container->mm, tbl, entry, &oldhpa,
+				&direction);
 		if (ret)
 			continue;
 
@@ -537,7 +545,6 @@ static long tce_iommu_build(struct tce_container *container,
 		enum dma_data_direction direction)
 {
 	long i, ret = 0;
-	struct page *page;
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
@@ -548,15 +555,16 @@ static long tce_iommu_build(struct tce_container *container,
 		if (ret)
 			break;
 
-		page = pfn_to_page(hpa >> PAGE_SHIFT);
-		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+		if (!tce_page_is_contained(container->mm, hpa,
+				tbl->it_page_shift)) {
 			ret = -EPERM;
 			break;
 		}
 
 		hpa |= offset;
 		dirtmp = direction;
-		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
+		ret = iommu_tce_xchg(container->mm, tbl, entry + i, &hpa,
+				&dirtmp);
 		if (ret) {
 			tce_iommu_unuse_page(container, hpa);
 			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
@@ -583,7 +591,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		enum dma_data_direction direction)
 {
 	long i, ret = 0;
-	struct page *page;
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
@@ -596,8 +603,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		if (ret)
 			break;
 
-		page = pfn_to_page(hpa >> PAGE_SHIFT);
-		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+		if (!tce_page_is_contained(container->mm, hpa,
+				tbl->it_page_shift)) {
 			ret = -EPERM;
 			break;
 		}
@@ -610,7 +617,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		if (mm_iommu_mapped_inc(mem))
 			break;
 
-		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
+		ret = iommu_tce_xchg(container->mm, tbl, entry + i, &hpa,
+				&dirtmp);
 		if (ret) {
 			/* dirtmp cannot be DMA_NONE here */
 			tce_iommu_unuse_page_v2(container, tbl, entry + i);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 04/22] powerpc/vfio/iommu/kvm: Do not pin device memory
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

This new memory does not have page structs as it is not plugged to
the host so gup() will fail anyway.

This adds 2 helpers:
- mm_iommu_newdev() to preregister the "memory device" memory so
the rest of API can still be used;
- mm_iommu_is_devmem() to know if the physical address is one of thise
new regions which we must avoid unpinning of.

This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
if the memory is device memory to avoid pfn_to_page().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h       |  5 +-
 arch/powerpc/include/asm/mmu_context.h |  5 ++
 arch/powerpc/kernel/iommu.c            |  9 ++-
 arch/powerpc/kvm/book3s_64_vio.c       | 18 +++---
 arch/powerpc/mm/mmu_context_iommu.c    | 83 +++++++++++++++++++++++---
 drivers/vfio/vfio_iommu_spapr_tce.c    | 28 +++++----
 6 files changed, 116 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 35db0cb..a8aeac0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -218,8 +218,9 @@ extern void iommu_register_group(struct iommu_table_group *table_group,
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
-extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
-		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long *hpa,
+		enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 2d6b00d..f0f9f3d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -24,6 +24,9 @@ extern bool mm_iommu_preregistered(struct mm_struct *mm);
 extern long mm_iommu_new(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
+extern long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
 		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
@@ -39,6 +42,8 @@ extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua);
+extern bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
+		unsigned int pageshift);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index f0dc680..8ccfdd9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -47,6 +47,7 @@
 #include <asm/fadump.h>
 #include <asm/vio.h>
 #include <asm/tce.h>
+#include <asm/mmu_context.h>
 
 #define DBG(...)
 
@@ -993,15 +994,17 @@ int iommu_tce_check_gpa(unsigned long page_shift, unsigned long gpa)
 }
 EXPORT_SYMBOL_GPL(iommu_tce_check_gpa);
 
-long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
-		unsigned long *hpa, enum dma_data_direction *direction)
+long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
+		unsigned long entry, unsigned long *hpa,
+		enum dma_data_direction *direction)
 {
 	long ret;
 
 	ret = tbl->it_ops->exchange(tbl, entry, hpa, direction);
 
 	if (!ret && ((*direction = DMA_FROM_DEVICE) ||
-			(*direction = DMA_BIDIRECTIONAL)))
+			(*direction = DMA_BIDIRECTIONAL)) &&
+			!mm_iommu_is_devmem(mm, *hpa, tbl->it_page_shift))
 		SetPageDirty(pfn_to_page(*hpa >> PAGE_SHIFT));
 
 	/* if (unlikely(ret))
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 62a8d03..532ab797 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -397,12 +397,13 @@ static long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *stt,
 	return H_SUCCESS;
 }
 
-static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
+static void kvmppc_clear_tce(struct mm_struct *mm, struct iommu_table *tbl,
+		unsigned long entry)
 {
 	unsigned long hpa = 0;
 	enum dma_data_direction dir = DMA_NONE;
 
-	iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	iommu_tce_xchg(mm, tbl, entry, &hpa, &dir);
 }
 
 static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
@@ -433,7 +434,7 @@ static long kvmppc_tce_iommu_do_unmap(struct kvm *kvm,
 	unsigned long hpa = 0;
 	long ret;
 
-	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
+	if (WARN_ON_ONCE(iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir)))
 		return H_TOO_HARD;
 
 	if (dir = DMA_NONE)
@@ -441,7 +442,7 @@ static long kvmppc_tce_iommu_do_unmap(struct kvm *kvm,
 
 	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
 	if (ret != H_SUCCESS)
-		iommu_tce_xchg(tbl, entry, &hpa, &dir);
+		iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir);
 
 	return ret;
 }
@@ -487,7 +488,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 	if (mm_iommu_mapped_inc(mem))
 		return H_TOO_HARD;
 
-	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
+	ret = iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir);
 	if (WARN_ON_ONCE(ret)) {
 		mm_iommu_mapped_dec(mem);
 		return H_TOO_HARD;
@@ -566,7 +567,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 					entry, ua, dir);
 
 		if (ret != H_SUCCESS) {
-			kvmppc_clear_tce(stit->tbl, entry);
+			kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry);
 			goto unlock_exit;
 		}
 	}
@@ -655,7 +656,8 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 					iommu_tce_direction(tce));
 
 			if (ret != H_SUCCESS) {
-				kvmppc_clear_tce(stit->tbl, entry);
+				kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl,
+						entry);
 				goto unlock_exit;
 			}
 		}
@@ -704,7 +706,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 				return ret;
 
 			WARN_ON_ONCE(1);
-			kvmppc_clear_tce(stit->tbl, entry);
+			kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry);
 		}
 	}
 
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 580d89e..62fe5fe 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -47,6 +47,8 @@ struct mm_iommu_table_group_mem_t {
 		struct page **hpages;	/* vmalloc'ed */
 		phys_addr_t *hpas;
 	};
+#define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
+	u64 dev_hpa;		/* Device memory base address */
 };
 
 static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
@@ -89,7 +91,8 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
+static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
@@ -112,11 +115,13 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	}
 
-	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
-	if (ret)
-		goto unlock_exit;
+	if (dev_hpa = MM_IOMMU_TABLE_INVALID_HPA) {
+		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
+		if (ret)
+			goto unlock_exit;
 
-	locked_entries = entries;
+		locked_entries = entries;
+	}
 
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem) {
@@ -124,6 +129,13 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		goto unlock_exit;
 	}
 
+	if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
+		mem->pageshift = __ffs(dev_hpa | (entries << PAGE_SHIFT));
+		mem->dev_hpa = dev_hpa;
+		goto good_exit;
+	}
+	mem->dev_hpa = MM_IOMMU_TABLE_INVALID_HPA;
+
 	/*
 	 * For a starting point for a maximum page size calculation
 	 * we use @ua and @entries natural alignment to allow IOMMU pages
@@ -180,6 +192,7 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	}
 
+good_exit:
 	atomic64_set(&mem->mapped, 1);
 	mem->used = 1;
 	mem->ua = ua;
@@ -196,13 +209,31 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 
 	return ret;
 }
+
+long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, MM_IOMMU_TABLE_INVALID_HPA,
+			pmem);
+}
 EXPORT_SYMBOL_GPL(mm_iommu_new);
 
+long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+		unsigned long entries, unsigned long dev_hpa,
+		struct mm_iommu_table_group_mem_t **pmem)
+{
+	return mm_iommu_do_alloc(mm, ua, entries, dev_hpa, pmem);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_newdev);
+
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
 	long i;
 	struct page *page = NULL;
 
+	if (!mem->hpas)
+		return;
+
 	for (i = 0; i < mem->entries; ++i) {
 		if (!mem->hpas[i])
 			continue;
@@ -244,6 +275,7 @@ static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
 long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 {
 	long ret = 0;
+	unsigned long entries, dev_hpa;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -265,9 +297,12 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 	}
 
 	/* @mapped became 0 so now mappings are disabled, release the region */
+	entries = mem->entries;
+	dev_hpa = mem->dev_hpa;
 	mm_iommu_release(mem);
 
-	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
+	if (dev_hpa = MM_IOMMU_TABLE_INVALID_HPA)
+		mm_iommu_adjust_locked_vm(mm, entries, false);
 
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
@@ -337,7 +372,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	u64 *va = &mem->hpas[entry];
+	u64 *va;
 
 	if (entry >= mem->entries)
 		return -EFAULT;
@@ -345,6 +380,12 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 	if (pageshift > mem->pageshift)
 		return -EFAULT;
 
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	va = &mem->hpas[entry];
 	*hpa = (*va & MM_IOMMU_TABLE_GROUP_PAGE_MASK) | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -355,7 +396,6 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
-	void *va = &mem->hpas[entry];
 	unsigned long *pa;
 
 	if (entry >= mem->entries)
@@ -364,7 +404,12 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 	if (pageshift > mem->pageshift)
 		return -EFAULT;
 
-	pa = (void *) vmalloc_to_phys(va);
+	if (!mem->hpas) {
+		*hpa = mem->dev_hpa + (ua - mem->ua);
+		return 0;
+	}
+
+	pa = (void *) vmalloc_to_phys(&mem->hpas[entry]);
 	if (!pa)
 		return -EFAULT;
 
@@ -394,6 +439,26 @@ extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua)
 	*pa |= MM_IOMMU_TABLE_GROUP_PAGE_DIRTY;
 }
 
+extern bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
+		unsigned int pageshift)
+{
+	struct mm_iommu_table_group_mem_t *mem;
+	const unsigned long pagesize = 1UL << pageshift;
+
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
+		if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
+			continue;
+
+		if ((mem->dev_hpa <= hpa) &&
+				(hpa + pagesize <= mem->dev_hpa +
+				 (mem->entries << PAGE_SHIFT)))
+			return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_is_devmem);
+
 long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
 {
 	if (atomic64_inc_not_zero(&mem->mapped))
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 56db071..ed89137 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -222,8 +222,15 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	return ret;
 }
 
-static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+static bool tce_page_is_contained(struct mm_struct *mm, unsigned long hpa,
+		unsigned int page_shift)
 {
+	struct page *page;
+
+	if (mm_iommu_is_devmem(mm, hpa, page_shift))
+		return true;
+
+	page = pfn_to_page(hpa >> PAGE_SHIFT);
 	/*
 	 * Check that the TCE table granularity is not bigger than the size of
 	 * a page we just found. Otherwise the hardware can get access to
@@ -499,7 +506,8 @@ static int tce_iommu_clear(struct tce_container *container,
 
 		direction = DMA_NONE;
 		oldhpa = 0;
-		ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);
+		ret = iommu_tce_xchg(container->mm, tbl, entry, &oldhpa,
+				&direction);
 		if (ret)
 			continue;
 
@@ -537,7 +545,6 @@ static long tce_iommu_build(struct tce_container *container,
 		enum dma_data_direction direction)
 {
 	long i, ret = 0;
-	struct page *page;
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
@@ -548,15 +555,16 @@ static long tce_iommu_build(struct tce_container *container,
 		if (ret)
 			break;
 
-		page = pfn_to_page(hpa >> PAGE_SHIFT);
-		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+		if (!tce_page_is_contained(container->mm, hpa,
+				tbl->it_page_shift)) {
 			ret = -EPERM;
 			break;
 		}
 
 		hpa |= offset;
 		dirtmp = direction;
-		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
+		ret = iommu_tce_xchg(container->mm, tbl, entry + i, &hpa,
+				&dirtmp);
 		if (ret) {
 			tce_iommu_unuse_page(container, hpa);
 			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
@@ -583,7 +591,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		enum dma_data_direction direction)
 {
 	long i, ret = 0;
-	struct page *page;
 	unsigned long hpa;
 	enum dma_data_direction dirtmp;
 
@@ -596,8 +603,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		if (ret)
 			break;
 
-		page = pfn_to_page(hpa >> PAGE_SHIFT);
-		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+		if (!tce_page_is_contained(container->mm, hpa,
+				tbl->it_page_shift)) {
 			ret = -EPERM;
 			break;
 		}
@@ -610,7 +617,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		if (mm_iommu_mapped_inc(mem))
 			break;
 
-		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
+		ret = iommu_tce_xchg(container->mm, tbl, entry + i, &hpa,
+				&dirtmp);
 		if (ret) {
 			/* dirtmp cannot be DMA_NONE here */
 			tce_iommu_unuse_page_v2(container, tbl, entry + i);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 84+ messages in thread
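
The hunks above leave two kinds of regions behind the same API: pinned
system memory (mem->hpas is populated at pin time) and device memory
(mem->hpas stays NULL and mem->dev_hpa holds the base host physical
address). A minimal sketch of the resulting translation, with the helper
name invented purely for illustration:

  /* Illustrative only; mirrors mm_iommu_ua_to_hpa() after this patch */
  static unsigned long example_ua_to_hpa(
		struct mm_iommu_table_group_mem_t *mem, unsigned long ua)
  {
	if (!mem->hpas)
		/* device memory: plain linear offset from dev_hpa */
		return mem->dev_hpa + (ua - mem->ua);

	/* pinned RAM: per-page array filled when the pages were pinned */
	return (mem->hpas[(ua - mem->ua) >> PAGE_SHIFT] &
			MM_IOMMU_TABLE_GROUP_PAGE_MASK) | (ua & ~PAGE_MASK);
  }

Note also that for device memory mem->pageshift is derived as
__ffs(dev_hpa | (entries << PAGE_SHIFT)), i.e. the largest page size
that both the base address and the region size are aligned to, and that
tce_page_is_contained() short-circuits via mm_iommu_is_devmem() since
device memory has no struct page to inspect.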

* [PATCH kernel v3 05/22] powerpc/powernv/npu: Add helper to access struct npu for NPU device
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

This step helps with removing the npu struct from pnv_phb so it
can be used by pseries as well.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 91d488f..9f48831 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -327,6 +327,18 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
 	return gpe;
 }
 
+/*
+ * NPU2 ATS
+ */
+static struct npu *npdev_to_npu(struct pci_dev *npdev)
+{
+	struct pnv_phb *nphb;
+
+	nphb = pci_bus_to_host(npdev->bus)->private_data;
+
+	return &nphb->npu;
+}
+
 /* Maximum number of nvlinks per npu */
 #define NV_MAX_LINKS 6
 
@@ -478,7 +490,6 @@ static void acquire_atsd_reg(struct npu_context *npu_context,
 	int i, j;
 	struct npu *npu;
 	struct pci_dev *npdev;
-	struct pnv_phb *nphb;
 
 	for (i = 0; i <= max_npu2_index; i++) {
 		mmio_atsd_reg[i].reg = -1;
@@ -493,8 +504,7 @@ static void acquire_atsd_reg(struct npu_context *npu_context,
 			if (!npdev)
 				continue;
 
-			nphb = pci_bus_to_host(npdev->bus)->private_data;
-			npu = &nphb->npu;
+			npu = npdev_to_npu(npdev);
 			mmio_atsd_reg[i].npu = npu;
 			mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu);
 			while (mmio_atsd_reg[i].reg < 0) {
@@ -690,7 +700,7 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
 	}
 
 	nphb = pci_bus_to_host(npdev->bus)->private_data;
-	npu = &nphb->npu;
+	npu = npdev_to_npu(npdev);
 
 	/*
 	 * Setup the NPU context table for a particular GPU. These need to be
@@ -764,7 +774,7 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
 	 */
 	WRITE_ONCE(npu_context->npdev[npu->index][nvlink_index], npdev);
 
-	if (!nphb->npu.nmmu_flush) {
+	if (!npu->nmmu_flush) {
 		/*
 		 * If we're not explicitly flushing ourselves we need to mark
 		 * the thread for global flushes
@@ -810,7 +820,7 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context,
 		return;
 
 	nphb = pci_bus_to_host(npdev->bus)->private_data;
-	npu = &nphb->npu;
+	npu = npdev_to_npu(npdev);
 	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
 	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
 							&nvlink_index)))
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread
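
The new accessor keeps every call site ignorant of where struct npu
actually lives, which is what allows the next patch to change the
storage without touching the users again. A hedged sketch of a call
site after this patch (the function name is made up for illustration):

  static bool example_needs_global_flush(struct pci_dev *npdev)
  {
	struct npu *npu = npdev_to_npu(npdev);

	/* identical to the old pci_bus_to_host()->private_data chain */
	return !npu->nmmu_flush;
  }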

* [PATCH kernel v3 06/22] powerpc/powernv: Detach npu struct from pnv_phb
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

The powernv PCI code stores NPU data in the pnv_phb struct. The latter
is referenced by pci_controller::private_data. We are going to have NPU2
support in the pseries platform as well but it does not store any
private_data in the pci_controller struct; and even if it did,
it would be a different data structure.

This adds a global list of NPUs so each platform can register and use
these in the same fashion.

As npdev_to_npu() may now fail, this checks the returned pointer.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci.h     | 16 -----
 arch/powerpc/platforms/powernv/npu-dma.c | 78 ++++++++++++++++++++----
 2 files changed, 65 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 2131373..f2d50974 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -8,9 +8,6 @@
 
 struct pci_dn;
 
-/* Maximum possible number of ATSD MMIO registers per NPU */
-#define NV_NMMU_ATSD_REGS 8
-
 enum pnv_phb_type {
 	PNV_PHB_IODA1		= 0,
 	PNV_PHB_IODA2		= 1,
@@ -176,19 +173,6 @@ struct pnv_phb {
 	unsigned int		diag_data_size;
 	u8			*diag_data;
 
-	/* Nvlink2 data */
-	struct npu {
-		int index;
-		__be64 *mmio_atsd_regs[NV_NMMU_ATSD_REGS];
-		unsigned int mmio_atsd_count;
-
-		/* Bitmask for MMIO register usage */
-		unsigned long mmio_atsd_usage;
-
-		/* Do we need to explicitly flush the nest mmu? */
-		bool nmmu_flush;
-	} npu;
-
 	int p2p_target_count;
 };
 
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 9f48831..9fc4e4e 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -330,13 +330,39 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
 /*
  * NPU2 ATS
  */
+/* Maximum possible number of ATSD MMIO registers per NPU */
+#define NV_NMMU_ATSD_REGS 8
+
+/* An NPU descriptor, valid for POWER9 only */
+struct npu {
+	int index;
+	__be64 *mmio_atsd_regs[NV_NMMU_ATSD_REGS];
+	unsigned int mmio_atsd_count;
+
+	/* Bitmask for MMIO register usage */
+	unsigned long mmio_atsd_usage;
+
+	/* Do we need to explicitly flush the nest mmu? */
+	bool nmmu_flush;
+
+	struct list_head next;
+
+	struct pci_controller *hose;
+};
+
+static LIST_HEAD(npu2_devices);
+
 static struct npu *npdev_to_npu(struct pci_dev *npdev)
 {
-	struct pnv_phb *nphb;
+	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
+	struct npu *npu;
 
-	nphb = pci_bus_to_host(npdev->bus)->private_data;
+	list_for_each_entry(npu, &npu2_devices, next)
+		if (hose == npu->hose)
+			return npu;
 
-	return &nphb->npu;
+	WARN_ON_ONCE(1);
+	return NULL;
 }
 
 /* Maximum number of nvlinks per npu */
@@ -505,6 +531,9 @@ static void acquire_atsd_reg(struct npu_context *npu_context,
 				continue;
 
 			npu = npdev_to_npu(npdev);
+			if (!npu)
+				continue;
+
 			mmio_atsd_reg[i].npu = npu;
 			mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu);
 			while (mmio_atsd_reg[i].reg < 0) {
@@ -701,6 +730,8 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
 
 	nphb = pci_bus_to_host(npdev->bus)->private_data;
 	npu = npdev_to_npu(npdev);
+	if (!npu)
+		return ERR_PTR(-ENODEV);
 
 	/*
 	 * Setup the NPU context table for a particular GPU. These need to be
@@ -821,6 +852,8 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context,
 
 	nphb = pci_bus_to_host(npdev->bus)->private_data;
 	npu = npdev_to_npu(npdev);
+	if (!npu)
+		return;
 	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
 	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
 							&nvlink_index)))
@@ -898,9 +931,15 @@ int pnv_npu2_init(struct pnv_phb *phb)
 	struct pci_dev *gpdev;
 	static int npu_index;
 	uint64_t rc = 0;
+	struct pci_controller *hose = phb->hose;
+	struct npu *npu;
+	int ret;
 
-	phb->npu.nmmu_flush =
-		of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
+	npu = kzalloc(sizeof(*npu), GFP_KERNEL);
+	if (!npu)
+		return -ENOMEM;
+
+	npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush");
 	for_each_child_of_node(phb->hose->dn, dn) {
 		gpdev = pnv_pci_get_gpu_dev(get_pci_dev(dn));
 		if (gpdev) {
@@ -914,18 +953,31 @@ int pnv_npu2_init(struct pnv_phb *phb)
 		}
 	}
 
-	for (i = 0; !of_property_read_u64_index(phb->hose->dn, "ibm,mmio-atsd",
+	for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd",
 							i, &mmio_atsd); i++)
-		phb->npu.mmio_atsd_regs[i] = ioremap(mmio_atsd, 32);
+		npu->mmio_atsd_regs[i] = ioremap(mmio_atsd, 32);
 
-	pr_info("NPU%lld: Found %d MMIO ATSD registers", phb->opal_id, i);
-	phb->npu.mmio_atsd_count = i;
-	phb->npu.mmio_atsd_usage = 0;
+	pr_info("NPU%d: Found %d MMIO ATSD registers", hose->global_number, i);
+	npu->mmio_atsd_count = i;
+	npu->mmio_atsd_usage = 0;
 	npu_index++;
-	if (WARN_ON(npu_index >= NV_MAX_NPUS))
-		return -ENOSPC;
+	if (WARN_ON(npu_index >= NV_MAX_NPUS)) {
+		ret = -ENOSPC;
+		goto fail_exit;
+	}
 	max_npu2_index = npu_index;
-	phb->npu.index = npu_index;
+	npu->index = npu_index;
+	npu->hose = hose;
+
+	list_add(&npu->next, &npu2_devices);
 
 	return 0;
+
+fail_exit:
+	for (i = 0; i < npu->mmio_atsd_count; ++i)
+		iounmap(npu->mmio_atsd_regs[i]);
+
+	kfree(npu);
+
+	return ret;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread
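
With the npu descriptors on a global list, lookup becomes a linear scan
keyed by the PCI controller, and a miss is now a real possibility that
callers must handle. A condensed sketch of the pattern this patch
introduces (names shortened for illustration):

  static LIST_HEAD(npus);		/* npu2_devices in the patch */

  static struct npu *example_lookup(struct pci_controller *hose)
  {
	struct npu *npu;

	list_for_each_entry(npu, &npus, next)
		if (npu->hose == hose)
			return npu;

	return NULL;	/* callers bail out, e.g. with ERR_PTR(-ENODEV) */
  }

pseries can later allocate its own struct npu and list_add() it in the
same way, which is the point of detaching the struct from pnv_phb.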

* [PATCH kernel v3 07/22] powerpc/powernv/npu: Move OPAL calls away from context manipulation
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

When introduced, the NPU context init/destroy helpers called OPAL which
enabled/disabled PID (a userspace memory context ID) filtering in an NPU
per GPU; this was a requirement for P9 DD1.0. However, newer chip
revisions added PID wildcard support so there is no longer a need to
call OPAL every time a new context is initialized. Also, since the PID
wildcard support was added, skiboot does not clear wildcard entries
in the NPU so these remain in the hardware until the system is rebooted.

This moves LPID and wildcard programming to the PE setup code which
executes once during the booting process so NPU2 context init/destroy
won't need to do additional configuration.

This removes the check for FW_FEATURE_OPAL as pnv_npu2_init_context/
pnv_npu2_release_context/pnv_npu2_init do not call OPAL anymore.

This moves the pnv_npu2_init() declaration as pseries should be able to
use it. This keeps pnv_npu2_map_lpar() in powernv as pseries is not
allowed to call that. This exports pnv_npu2_map_lpar_dev() as the
following patches will use it from the VFIO driver.

While at it, replace redundant list_for_each_entry_safe() with
a simpler list_for_each_entry().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/pci.h            |   3 +
 arch/powerpc/platforms/powernv/pci.h      |   2 +-
 arch/powerpc/platforms/powernv/npu-dma.c  | 105 +++++++++++-----------
 arch/powerpc/platforms/powernv/pci-ioda.c |  15 +++-
 4 files changed, 71 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index 2af9ded..baf2886 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -129,5 +129,8 @@ extern void pcibios_scan_phb(struct pci_controller *hose);
 
 extern struct pci_dev *pnv_pci_get_gpu_dev(struct pci_dev *npdev);
 extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index);
+extern int pnv_npu2_init(struct pci_controller *hose);
+extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
+		unsigned long msr);
 
 #endif /* __ASM_POWERPC_PCI_H */
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index f2d50974..ddb4f02 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -190,6 +190,7 @@ extern void pnv_pci_init_ioda_hub(struct device_node *np);
 extern void pnv_pci_init_ioda2_phb(struct device_node *np);
 extern void pnv_pci_init_npu_phb(struct device_node *np);
 extern void pnv_pci_init_npu2_opencapi_phb(struct device_node *np);
+extern void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned long msr);
 extern void pnv_pci_reset_secondary_bus(struct pci_dev *dev);
 extern int pnv_eeh_phb_reset(struct pci_controller *hose, int option);
 
@@ -220,7 +221,6 @@ extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
 extern long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num);
 extern void pnv_npu_take_ownership(struct pnv_ioda_pe *npe);
 extern void pnv_npu_release_ownership(struct pnv_ioda_pe *npe);
-extern int pnv_npu2_init(struct pnv_phb *phb);
 
 /* pci-ioda-tce.c */
 #define POWERNV_IOMMU_DEFAULT_LEVELS	1
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 9fc4e4e..4b60f43 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -698,7 +698,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
 	u32 nvlink_index;
 	struct device_node *nvlink_dn;
 	struct mm_struct *mm = current->mm;
-	struct pnv_phb *nphb;
 	struct npu *npu;
 	struct npu_context *npu_context;
 
@@ -708,9 +707,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
 	 */
 	struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);
 
-	if (!firmware_has_feature(FW_FEATURE_OPAL))
-		return ERR_PTR(-ENODEV);
-
 	if (!npdev)
 		/* No nvlink associated with this GPU device */
 		return ERR_PTR(-ENODEV);
@@ -728,23 +724,10 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
 		return ERR_PTR(-EINVAL);
 	}
 
-	nphb = pci_bus_to_host(npdev->bus)->private_data;
 	npu = npdev_to_npu(npdev);
 	if (!npu)
 		return ERR_PTR(-ENODEV);
 
-	/*
-	 * Setup the NPU context table for a particular GPU. These need to be
-	 * per-GPU as we need the tables to filter ATSDs when there are no
-	 * active contexts on a particular GPU. It is safe for these to be
-	 * called concurrently with destroy as the OPAL call takes appropriate
-	 * locks and refcounts on init/destroy.
-	 */
-	rc = opal_npu_init_context(nphb->opal_id, mm->context.id, flags,
-				PCI_DEVID(gpdev->bus->number, gpdev->devfn));
-	if (rc < 0)
-		return ERR_PTR(-ENOSPC);
-
 	/*
 	 * We store the npu pci device so we can more easily get at the
 	 * associated npus.
@@ -755,9 +738,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
 		if (npu_context->release_cb != cb ||
 			npu_context->priv != priv) {
 			spin_unlock(&npu_context_lock);
-			opal_npu_destroy_context(nphb->opal_id, mm->context.id,
-						PCI_DEVID(gpdev->bus->number,
-							gpdev->devfn));
 			return ERR_PTR(-EINVAL);
 		}
 
@@ -783,9 +763,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
 
 		if (rc) {
 			kfree(npu_context);
-			opal_npu_destroy_context(nphb->opal_id, mm->context.id,
-					PCI_DEVID(gpdev->bus->number,
-						gpdev->devfn));
 			return ERR_PTR(rc);
 		}
 
@@ -838,7 +815,6 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context,
 			struct pci_dev *gpdev)
 {
 	int removed;
-	struct pnv_phb *nphb;
 	struct npu *npu;
 	struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);
 	struct device_node *nvlink_dn;
@@ -847,10 +823,6 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context,
 	if (WARN_ON(!npdev))
 		return;
 
-	if (!firmware_has_feature(FW_FEATURE_OPAL))
-		return;
-
-	nphb = pci_bus_to_host(npdev->bus)->private_data;
 	npu = npdev_to_npu(npdev);
 	if (!npu)
 		return;
@@ -859,8 +831,6 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context,
 							&nvlink_index)))
 		return;
 	WRITE_ONCE(npu_context->npdev[npu->index][nvlink_index], NULL);
-	opal_npu_destroy_context(nphb->opal_id, npu_context->mm->context.id,
-				PCI_DEVID(gpdev->bus->number, gpdev->devfn));
 	spin_lock(&npu_context_lock);
 	removed = kref_put(&npu_context->kref, pnv_npu2_release_context);
 	spin_unlock(&npu_context_lock);
@@ -892,9 +862,6 @@ int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea,
 	/* mmap_sem should be held so the struct_mm must be present */
 	struct mm_struct *mm = context->mm;
 
-	if (!firmware_has_feature(FW_FEATURE_OPAL))
-		return -ENODEV;
-
 	WARN_ON(!rwsem_is_locked(&mm->mmap_sem));
 
 	for (i = 0; i < count; i++) {
@@ -923,15 +890,11 @@ int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea,
 }
 EXPORT_SYMBOL(pnv_npu2_handle_fault);
 
-int pnv_npu2_init(struct pnv_phb *phb)
+int pnv_npu2_init(struct pci_controller *hose)
 {
 	unsigned int i;
 	u64 mmio_atsd;
-	struct device_node *dn;
-	struct pci_dev *gpdev;
 	static int npu_index;
-	uint64_t rc = 0;
-	struct pci_controller *hose = phb->hose;
 	struct npu *npu;
 	int ret;
 
@@ -940,18 +903,6 @@ int pnv_npu2_init(struct pnv_phb *phb)
 		return -ENOMEM;
 
 	npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush");
-	for_each_child_of_node(phb->hose->dn, dn) {
-		gpdev = pnv_pci_get_gpu_dev(get_pci_dev(dn));
-		if (gpdev) {
-			rc = opal_npu_map_lpar(phb->opal_id,
-				PCI_DEVID(gpdev->bus->number, gpdev->devfn),
-				0, 0);
-			if (rc)
-				dev_err(&gpdev->dev,
-					"Error %lld mapping device to LPAR\n",
-					rc);
-		}
-	}
 
 	for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd",
 							i, &mmio_atsd); i++)
@@ -981,3 +932,57 @@ int pnv_npu2_init(struct pnv_phb *phb)
 
 	return ret;
 }
+
+int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
+		unsigned long msr)
+{
+	int ret;
+	struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);
+	struct pci_controller *hose;
+	struct pnv_phb *nphb;
+
+	if (!npdev)
+		return -ENODEV;
+
+	hose = pci_bus_to_host(npdev->bus);
+	nphb = hose->private_data;
+
+	dev_dbg(&gpdev->dev, "Map LPAR opalid=%llu lparid=%u\n",
+			nphb->opal_id, lparid);
+	/*
+	 * Currently we only support radix and non-zero LPCR only makes sense
+	 * for hash tables so skiboot expects the LPCR parameter to be a zero.
+	 */
+	ret = opal_npu_map_lpar(nphb->opal_id,
+			PCI_DEVID(gpdev->bus->number, gpdev->devfn), lparid,
+			0 /* LPCR bits */);
+	if (ret) {
+		dev_err(&gpdev->dev, "Error %d mapping device to LPAR\n", ret);
+		return ret;
+	}
+
+	dev_dbg(&gpdev->dev, "init context opalid=%llu msr=%lx\n",
+			nphb->opal_id, msr);
+	ret = opal_npu_init_context(nphb->opal_id, 0/*__unused*/, msr,
+			PCI_DEVID(gpdev->bus->number, gpdev->devfn));
+	if (ret < 0)
+		dev_err(&gpdev->dev, "Failed to init context: %d\n", ret);
+	else
+		ret = 0;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pnv_npu2_map_lpar_dev);
+
+void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned long msr)
+{
+	int ret;
+	struct pci_dev *gpdev;
+
+	list_for_each_entry(gpdev, &gpe->pbus->devices, bus_list) {
+		ret = pnv_npu2_map_lpar_dev(gpdev, 0, msr);
+		if (ret < 0)
+			dev_err(&gpdev->dev, "Failed to init context: %d\n",
+					ret);
+	}
+}
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c78c204..ec235ca 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1271,19 +1271,20 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus)
 
 static void pnv_pci_ioda_setup_PEs(void)
 {
-	struct pci_controller *hose, *tmp;
+	struct pci_controller *hose;
 	struct pnv_phb *phb;
 	struct pci_bus *bus;
 	struct pci_dev *pdev;
+	struct pnv_ioda_pe *pe;
 
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
+	list_for_each_entry(hose, &hose_list, list_node) {
 		phb = hose->private_data;
 		if (phb->type == PNV_PHB_NPU_NVLINK) {
 			/* PE#0 is needed for error reporting */
 			pnv_ioda_reserve_pe(phb, 0);
 			pnv_ioda_setup_npu_PEs(hose->bus);
 			if (phb->model == PNV_PHB_MODEL_NPU2)
-				pnv_npu2_init(phb);
+				pnv_npu2_init(hose);
 		}
 		if (phb->type == PNV_PHB_NPU_OCAPI) {
 			bus = hose->bus;
@@ -1291,6 +1292,14 @@ static void pnv_pci_ioda_setup_PEs(void)
 				pnv_ioda_setup_dev_PE(pdev);
 		}
 	}
+	list_for_each_entry(hose, &hose_list, list_node) {
+		phb = hose->private_data;
+		if (phb->type != PNV_PHB_IODA2)
+			continue;
+
+		list_for_each_entry(pe, &phb->ioda.pe_list, list)
+			pnv_npu2_map_lpar(pe, MSR_DR | MSR_PR | MSR_HV);
+	}
 }
 
 #ifdef CONFIG_PCI_IOV
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread
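
After this patch the OPAL programming happens once per device via
pnv_npu2_map_lpar_dev(), and the boot-time path simply maps every GPU
PE to LPAR 0 (the host) with the hypervisor MSR bits. A sketch of how
an external user such as the VFIO subdriver could call the newly
exported helper (the wrapper name is hypothetical):

  /* hand a GPU's NVLinks to the host LPAR; later VFIO patches would
   * pass a real guest LPAR id instead of 0 */
  static int example_map_to_host(struct pci_dev *gpdev)
  {
	return pnv_npu2_map_lpar_dev(gpdev, 0 /* lparid */,
			MSR_DR | MSR_PR | MSR_HV);
  }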

* [PATCH kernel v3 08/22] powerpc/pseries/iommu: Allow dynamic window to start from zero
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

At the moment the kernel does not expect dynamic windows to ever start
at zero on a PCI bus as PAPR requires the hypervisor to create a 32bit
default window which starts from zero and the pseries kernel only
creates additional windows.

However PAPR permits removing the default window and creating another
one instead, starting from zero as well. In fact, the kernel used to
remove the default window after sha1 25ebc45b934 but this was later
reverted.

Since there are devices capable of more than 32 bits for DMA but less than
50, and currently available hardware allows the second window only
at 1<<59, we will need to be able to create bigger windows starting from
zero. This does the initial preparation and should not cause any
behavioral changes.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/platforms/pseries/iommu.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 06f0296..9ece42f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,8 @@
 
 #include "pseries.h"
 
+#define DDW_INVALID_OFFSET	((uint64_t)-1)
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
 	struct iommu_table_group *table_group;
@@ -844,7 +846,7 @@ static u64 find_existing_ddw(struct device_node *pdn)
 {
 	struct direct_window *window;
 	const struct dynamic_dma_window_prop *direct64;
-	u64 dma_addr = 0;
+	u64 dma_addr = DDW_INVALID_OFFSET;
 
 	spin_lock(&direct_window_list_lock);
 	/* check if we already created a window and dupe that config if so */
@@ -992,7 +994,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	mutex_lock(&direct_window_init_mutex);
 
 	dma_addr = find_existing_ddw(pdn);
-	if (dma_addr != 0)
+	if (dma_addr != DDW_INVALID_OFFSET)
 		goto out_unlock;
 
 	/*
@@ -1228,7 +1230,7 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
 		}
 		if (pdn && PCI_DN(pdn)) {
 			dma_offset = enable_ddw(pdev, pdn);
-			if (dma_offset != 0) {
+			if (dma_offset != DDW_INVALID_OFFSET) {
 				dev_info(dev, "Using 64-bit direct DMA at offset %llx\n", dma_offset);
 				set_dma_offset(dev, dma_offset);
 				set_dma_ops(dev, &dma_nommu_ops);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread
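
The subtlety here is that once a dynamic window may legitimately start
at bus address 0, the old convention of returning 0 for "no window"
becomes ambiguous, hence the out-of-band sentinel. A minimal sketch of
the changed contract (the caller shape is illustrative only):

  u64 dma_offset = enable_ddw(pdev, pdn);

  /* 0 is now a valid window start; only the sentinel means failure */
  if (dma_offset != DDW_INVALID_OFFSET) {
	set_dma_offset(dev, dma_offset);
	set_dma_ops(dev, &dma_nommu_ops);
  }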

* [PATCH kernel v3 09/22] powerpc/pseries/iommu: Force default DMA window removal
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

It is quite common for a device to support more than 32bit but less than
64bit for DMA, for example, GPUs often support 42..50bits. However
the pseries platform only allows huge DMA window (the one which allows
the use of more than 2GB of DMA space) for 64bit-capable devices mostly
because:

1. we may have 32bit and >32bit devices on the same IOMMU domain and
we cannot place the new big window where the 32bit one is located;

2. the existing hardware only supports the second window at very high
offset of 1<<59 == 0x0800.0000.0000.0000.

So in order to allow 33..59bit DMA, we have to remove the default DMA
window and place a huge one there instead.

The PAPR spec says that the platform may decide not to use the default
window and remove it using DDW RTAS calls. There are a few possible ways
for the platform to decide:

1. look at the device IDs and decide in advance that such and such
devices are capable of more than 32bit DMA (powernv's sketchy bypass
does something like this - it drops the default window if all devices
on the PE are from the same vendor) - this is not great as involves
guessing because, unlike sketchy bypass, the GPU case involves 2 vendor
ids and does not scale;

2. advertise 1 available DMA window in the hypervisor via
ibm,query-pe-dma-window so the pseries platform could take it as a clue
that if more bits for DMA are needed, it has to remove the default
window - this is not great as it is an implicit clue rather than a
direct instruction;

3. removing the default DMA window altogether is not really an option as
PAPR mandates its presence at guest boot time;

4. make the hypervisor explicitly tell the guest that the default window
should be removed so the guest does not have to guess and can simply do
what is requested; this is what this patch does.

This makes use of the last approach and exploits a new
"qemu,dma-force-remove-default" flag in a vPHB.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 28 +++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 9ece42f..78473ac 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -54,6 +54,7 @@
 #include "pseries.h"
 
 #define DDW_INVALID_OFFSET	((uint64_t)-1)
+#define DDW_INVALID_LIOBN	((uint32_t)-1)
 
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
@@ -977,7 +978,8 @@ static LIST_HEAD(failed_ddw_pdn_list);
  *
  * returns the dma offset for use by dma_set_mask
  */
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
+		u32 default_liobn)
 {
 	int len, ret;
 	struct ddw_query_response query;
@@ -1022,6 +1024,16 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	if (ret)
 		goto out_failed;
 
+	/*
+	 * The device tree has a request to force remove the default window,
+	 * do this.
+	 */
+	if (default_liobn != DDW_INVALID_LIOBN && (!ddw_avail[2] ||
+			rtas_call(ddw_avail[2], 1, 1, NULL, default_liobn))) {
+		dev_dbg(&dev->dev, "Could not remove window");
+		goto out_failed;
+	}
+
        /*
 	 * Query if there is a second window of size to map the
 	 * whole partition.  Query returns number of windows, largest
@@ -1212,7 +1224,7 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
 	pdev = to_pci_dev(dev);
 
 	/* only attempt to use a new window if 64-bit DMA is requested */
-	if (!disable_ddw && dma_mask == DMA_BIT_MASK(64)) {
+	if (!disable_ddw && dma_mask > DMA_BIT_MASK(32)) {
 		dn = pci_device_to_OF_node(pdev);
 		dev_dbg(dev, "node is %pOF\n", dn);
 
@@ -1229,7 +1241,17 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
 				break;
 		}
 		if (pdn && PCI_DN(pdn)) {
-			dma_offset = enable_ddw(pdev, pdn);
+			u32 liobn = DDW_INVALID_LIOBN;
+			int ret = of_device_is_compatible(pdn, "IBM,npu-vphb");
+
+			if (ret) {
+				dma_window = of_get_property(pdn,
+						"ibm,dma-window", NULL);
+				if (dma_window)
+					liobn = be32_to_cpu(dma_window[0]);
+			}
+
+			dma_offset = enable_ddw(pdev, pdn, liobn);
 			if (dma_offset != DDW_INVALID_OFFSET) {
 				dev_info(dev, "Using 64-bit direct DMA at offset %llx\n", dma_offset);
 				set_dma_offset(dev, dma_offset);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread
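
For reference, the first cell of the "ibm,dma-window" property is the
LIOBN of the default window, which is exactly what the new code feeds
to the remove-window RTAS call (ddw_avail[2] holds the
"ibm,remove-pe-dma-window" token). A condensed sketch of the removal
path added above, assuming the property is present:

  const __be32 *dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
  u32 liobn = dma_window ? be32_to_cpu(dma_window[0]) : DDW_INVALID_LIOBN;

  if (liobn != DDW_INVALID_LIOBN &&
		rtas_call(ddw_avail[2], 1, 1, NULL, liobn))
	/* could not remove the default window, fall back to 32bit DMA */
	goto out_failed;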

* [PATCH kernel v3 09/22] powerpc/pseries/iommu: Force default DMA window removal
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

It is quite common for a device to support more than 32bit but less than
64bit for DMA, for example, GPUs often support 42..50bits. However
the pseries platform only allows huge DMA window (the one which allows
the use of more than 2GB of DMA space) for 64bit-capable devices mostly
because:

1. we may have 32bit and >32bit devices on the same IOMMU domain and
we cannot place the new big window where the 32bit one is located;

2. the existing hardware only supports the second window at very high
offset of 1<<59 = 0x0800.0000.0000.0000.

So in order to allow 33..59bit DMA, we have to remove the default DMA
window and place a huge one there instead.

The PAPR spec says that the platform may decide not to use the default
window and remove it using DDW RTAS calls. There are few possible ways
for the platform to decide:

1. look at the device IDs and decide in advance that such and such
devices are capable of more than 32bit DMA (powernv's sketchy bypass
does something like this - it drops the default window if all devices
on the PE are from the same vendor) - this is not great as involves
guessing because, unlike sketchy bypass, the GPU case involves 2 vendor
ids and does not scale;

2. advertise 1 available DMA window in the hypervisor via
ibm,query-pe-dma-window so the pseries platform could take it as a clue
that if more bits for DMA are needed, it has to remove the default
window - this is not great either as it is an implicit clue rather than
a direct instruction;

3. removing the default DMA window altogether is not really an option as
PAPR mandates its presence at guest boot time;

4. make the hypervisor explicitly tell the guest that the default window
had better be removed so the guest does not have to think hard and can
simply do what is requested, and this is what this patch does.

This makes use of the last approach and exploits a new
"qemu,dma-force-remove-default" flag in a vPHB.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 28 +++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 9ece42f..78473ac 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -54,6 +54,7 @@
 #include "pseries.h"
 
 #define DDW_INVALID_OFFSET	((uint64_t)-1)
+#define DDW_INVALID_LIOBN	((uint32_t)-1)
 
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
@@ -977,7 +978,8 @@ static LIST_HEAD(failed_ddw_pdn_list);
  *
  * returns the dma offset for use by dma_set_mask
  */
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
+		u32 default_liobn)
 {
 	int len, ret;
 	struct ddw_query_response query;
@@ -1022,6 +1024,16 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	if (ret)
 		goto out_failed;
 
+	/*
+	 * The device tree has a request to force remove the default window,
+	 * do this.
+	 */
+	if (default_liobn != DDW_INVALID_LIOBN && (!ddw_avail[2] ||
+			rtas_call(ddw_avail[2], 1, 1, NULL, default_liobn))) {
+		dev_dbg(&dev->dev, "Could not remove window");
+		goto out_failed;
+	}
+
        /*
 	 * Query if there is a second window of size to map the
 	 * whole partition.  Query returns number of windows, largest
@@ -1212,7 +1224,7 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
 	pdev = to_pci_dev(dev);
 
 	/* only attempt to use a new window if 64-bit DMA is requested */
-	if (!disable_ddw && dma_mask == DMA_BIT_MASK(64)) {
+	if (!disable_ddw && dma_mask > DMA_BIT_MASK(32)) {
 		dn = pci_device_to_OF_node(pdev);
 		dev_dbg(dev, "node is %pOF\n", dn);
 
@@ -1229,7 +1241,17 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
 				break;
 		}
 		if (pdn && PCI_DN(pdn)) {
-			dma_offset = enable_ddw(pdev, pdn);
+			u32 liobn = DDW_INVALID_LIOBN;
+			int ret = of_device_is_compatible(pdn, "IBM,npu-vphb");
+
+			if (ret) {
+				dma_window = of_get_property(pdn,
+						"ibm,dma-window", NULL);
+				if (dma_window)
+					liobn = be32_to_cpu(dma_window[0]);
+			}
+
+			dma_offset = enable_ddw(pdev, pdn, liobn);
 			if (dma_offset != DDW_INVALID_OFFSET) {
 				dev_info(dev, "Using 64-bit direct DMA at offset %llx\n", dma_offset);
 				set_dma_offset(dev, dma_offset);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 10/22] powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

We might have memory@ nodes with "linux,usable-memory" set to zero
(for example, to replicate powernv's behaviour for GPU coherent memory).
This means the memory needs extra initialization before use but can be
used afterwards, so the pseries platform will try mapping it for DMA
and the DMA window needs to cover those memory regions too.

This walks through the memory@ nodes to find the highest RAM address so
that a huge DMA window can cover it too in case this memory gets onlined
later.
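
For illustration, a hypothetical memory@ node of this kind (the unit
address, reg values and 2+2 cell layout are made up; the point is the
zero-sized "linux,usable-memory", which mirrors the layout of "reg"):

	memory@200000000000 {
		device_type = "memory";
		/* 128GB of device memory at 32TB */
		reg = <0x2000 0x00000000 0x20 0x00000000>;
		/* zero usable size until the memory is onlined */
		linux,usable-memory = <0x2000 0x00000000 0x0 0x0>;
	};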

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 43 +++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 78473ac..f818737 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -967,6 +967,47 @@ struct failed_ddw_pdn {
 
 static LIST_HEAD(failed_ddw_pdn_list);
 
+static unsigned long read_n_cells(int n, const __be32 **buf)
+{
+	unsigned long result = 0;
+
+	while (n--) {
+		result = (result << 32) | of_read_number(*buf, 1);
+		(*buf)++;
+	}
+	return result;
+}
+
+static phys_addr_t ddw_memory_hotplug_max(void)
+{
+	phys_addr_t max_addr = memory_hotplug_max();
+	struct device_node *memory;
+
+	for_each_node_by_type(memory, "memory") {
+		unsigned long start, size;
+		int ranges, n_mem_addr_cells, n_mem_size_cells, len;
+		const __be32 *memcell_buf;
+
+		memcell_buf = of_get_property(memory, "reg", &len);
+		if (!memcell_buf || len <= 0)
+			continue;
+
+		n_mem_addr_cells = of_n_addr_cells(memory);
+		n_mem_size_cells = of_n_size_cells(memory);
+
+		/* ranges in cell */
+		ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
+
+		/* these are order-sensitive, and modify the buffer pointer */
+		start = read_n_cells(n_mem_addr_cells, &memcell_buf);
+		size = read_n_cells(n_mem_size_cells, &memcell_buf);
+
+		max_addr = max_t(phys_addr_t, max_addr, start + size);
+	}
+
+	return max_addr;
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1067,7 +1108,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
 	}
 	/* verify the window * number of ptes will map the partition */
 	/* check largest block * page size > max memory hotplug addr */
-	max_addr = memory_hotplug_max();
+	max_addr = ddw_memory_hotplug_max();
 	if (query.largest_available_block < (max_addr >> page_shift)) {
 		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
 			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 10/22] powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

We might have memory@ nodes with "linux,usable-memory" set to zero
(for example, to replicate powernv's behaviour for GPU coherent memory).
This means the memory needs extra initialization before use but can be
used afterwards, so the pseries platform will try mapping it for DMA
and the DMA window needs to cover those memory regions too.

This walks through the memory@ nodes to find the highest RAM address so
that a huge DMA window can cover it too in case this memory gets onlined
later.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 43 +++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 78473ac..f818737 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -967,6 +967,47 @@ struct failed_ddw_pdn {
 
 static LIST_HEAD(failed_ddw_pdn_list);
 
+static unsigned long read_n_cells(int n, const __be32 **buf)
+{
+	unsigned long result = 0;
+
+	while (n--) {
+		result = (result << 32) | of_read_number(*buf, 1);
+		(*buf)++;
+	}
+	return result;
+}
+
+static phys_addr_t ddw_memory_hotplug_max(void)
+{
+	phys_addr_t max_addr = memory_hotplug_max();
+	struct device_node *memory;
+
+	for_each_node_by_type(memory, "memory") {
+		unsigned long start, size;
+		int ranges, n_mem_addr_cells, n_mem_size_cells, len;
+		const __be32 *memcell_buf;
+
+		memcell_buf = of_get_property(memory, "reg", &len);
+		if (!memcell_buf || len <= 0)
+			continue;
+
+		n_mem_addr_cells = of_n_addr_cells(memory);
+		n_mem_size_cells = of_n_size_cells(memory);
+
+		/* ranges in cell */
+		ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
+
+		/* these are order-sensitive, and modify the buffer pointer */
+		start = read_n_cells(n_mem_addr_cells, &memcell_buf);
+		size = read_n_cells(n_mem_size_cells, &memcell_buf);
+
+		max_addr = max_t(phys_addr_t, max_addr, start + size);
+	}
+
+	return max_addr;
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1067,7 +1108,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
 	}
 	/* verify the window * number of ptes will map the partition */
 	/* check largest block * page size > max memory hotplug addr */
-	max_addr = memory_hotplug_max();
+	max_addr = ddw_memory_hotplug_max();
 	if (query.largest_available_block < (max_addr >> page_shift)) {
 		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
 			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 11/22] powerpc/pseries/npu: Enable platform support
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

We already changed the NPU API for GPUs not to call OPAL and the
remaining bit is initializing the NPU structures.

This uses a new QEMU capability which marks NPU-enabled vPHBs as
"IBM,npu-vphb" and initializes an NPU structure per vPHB.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/pci.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/pci.c b/arch/powerpc/platforms/pseries/pci.c
index 41d8a4d..a50d5e4 100644
--- a/arch/powerpc/platforms/pseries/pci.c
+++ b/arch/powerpc/platforms/pseries/pci.c
@@ -29,6 +29,7 @@
 #include <asm/pci-bridge.h>
 #include <asm/prom.h>
 #include <asm/ppc-pci.h>
+#include <asm/pci.h>
 #include "pseries.h"
 
 #if 0
@@ -237,6 +238,8 @@ static void __init pSeries_request_regions(void)
 
 void __init pSeries_final_fixup(void)
 {
+	struct pci_controller *hose;
+
 	pSeries_request_regions();
 
 	eeh_probe_devices();
@@ -246,6 +249,9 @@ void __init pSeries_final_fixup(void)
 	ppc_md.pcibios_sriov_enable = pseries_pcibios_sriov_enable;
 	ppc_md.pcibios_sriov_disable = pseries_pcibios_sriov_disable;
 #endif
+	list_for_each_entry(hose, &hose_list, list_node)
+		if (of_device_is_compatible(hose->dn, "IBM,npu-vphb"))
+			pnv_npu2_init(hose);
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 11/22] powerpc/pseries/npu: Enable platform support
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

We already changed the NPU API for GPUs not to call OPAL and the
remaining bit is initializing the NPU structures.

This uses a new QEMU capability which marks NPU-enabled vPHBs as
"IBM,npu-vphb" and initializes an NPU structure per vPHB.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/pci.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/pci.c b/arch/powerpc/platforms/pseries/pci.c
index 41d8a4d..a50d5e4 100644
--- a/arch/powerpc/platforms/pseries/pci.c
+++ b/arch/powerpc/platforms/pseries/pci.c
@@ -29,6 +29,7 @@
 #include <asm/pci-bridge.h>
 #include <asm/prom.h>
 #include <asm/ppc-pci.h>
+#include <asm/pci.h>
 #include "pseries.h"
 
 #if 0
@@ -237,6 +238,8 @@ static void __init pSeries_request_regions(void)
 
 void __init pSeries_final_fixup(void)
 {
+	struct pci_controller *hose;
+
 	pSeries_request_regions();
 
 	eeh_probe_devices();
@@ -246,6 +249,9 @@ void __init pSeries_final_fixup(void)
 	ppc_md.pcibios_sriov_enable = pseries_pcibios_sriov_enable;
 	ppc_md.pcibios_sriov_disable = pseries_pcibios_sriov_disable;
 #endif
+	list_for_each_entry(hose, &hose_list, list_node)
+		if (of_device_is_compatible(hose->dn, "IBM,npu-vphb"))
+			pnv_npu2_init(hose);
 }
 
 /*
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 12/22] powerpc/pseries: Remove IOMMU API support for non-LPAR systems
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

The pci_dma_bus_setup_pSeries and pci_dma_dev_setup_pSeries hooks are
registered for the pseries platform when it does not have FW_FEATURE_LPAR;
these would be pre-powernv platforms for which we never supported PCI
passthrough anyway, so remove the IOMMU API support there.
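
For context, a paraphrased sketch of how these hooks are selected at
init time, modelled on iommu_init_early_pSeries() (exact names may
differ slightly in the tree this applies to):

	if (firmware_has_feature(FW_FEATURE_LPAR)) {
		pseries_pci_controller_ops.dma_bus_setup =
				pci_dma_bus_setup_pSeriesLP;
		pseries_pci_controller_ops.dma_dev_setup =
				pci_dma_dev_setup_pSeriesLP;
	} else {
		/* the pre-powernv bare-metal path this patch stops
		 * supporting in the IOMMU API */
		pseries_pci_controller_ops.dma_bus_setup =
				pci_dma_bus_setup_pSeries;
		pseries_pci_controller_ops.dma_dev_setup =
				pci_dma_dev_setup_pSeries;
	}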

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---

Probably should remove all pseries-but-not-lpar code.
---
 arch/powerpc/platforms/pseries/iommu.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index f818737..b045f28 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -648,7 +648,6 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 	iommu_table_setparms(pci->phb, dn, tbl);
 	tbl->it_ops = &iommu_table_pseries_ops;
 	iommu_init_table(tbl, pci->phb->node);
-	iommu_register_group(pci->table_group, pci_domain_nr(bus), 0);
 
 	/* Divide the rest (1.75GB) among the children */
 	pci->phb->dma_window_size = 0x80000000ul;
@@ -759,10 +758,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		iommu_table_setparms(phb, dn, tbl);
 		tbl->it_ops = &iommu_table_pseries_ops;
 		iommu_init_table(tbl, phb->node);
-		iommu_register_group(PCI_DN(dn)->table_group,
-				pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base(&dev->dev, tbl);
-		iommu_add_device(&dev->dev);
 		return;
 	}
 
@@ -773,11 +769,10 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 	while (dn && PCI_DN(dn) && PCI_DN(dn)->table_group == NULL)
 		dn = dn->parent;
 
-	if (dn && PCI_DN(dn)) {
+	if (dn && PCI_DN(dn))
 		set_iommu_table_base(&dev->dev,
 				PCI_DN(dn)->table_group->tables[0]);
-		iommu_add_device(&dev->dev);
-	} else
+	else
 		printk(KERN_WARNING "iommu: Device %s has no iommu table\n",
 		       pci_name(dev));
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 12/22] powerpc/pseries: Remove IOMMU API support for non-LPAR systems
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

The pci_dma_bus_setup_pSeries and pci_dma_dev_setup_pSeries hooks are
registered for the pseries platform when it does not have FW_FEATURE_LPAR;
these would be pre-powernv platforms for which we never supported PCI
passthrough anyway, so remove the IOMMU API support there.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---

Probably should remove all pseries-but-not-lpar code.
---
 arch/powerpc/platforms/pseries/iommu.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index f818737..b045f28 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -648,7 +648,6 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 	iommu_table_setparms(pci->phb, dn, tbl);
 	tbl->it_ops = &iommu_table_pseries_ops;
 	iommu_init_table(tbl, pci->phb->node);
-	iommu_register_group(pci->table_group, pci_domain_nr(bus), 0);
 
 	/* Divide the rest (1.75GB) among the children */
 	pci->phb->dma_window_size = 0x80000000ul;
@@ -759,10 +758,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		iommu_table_setparms(phb, dn, tbl);
 		tbl->it_ops = &iommu_table_pseries_ops;
 		iommu_init_table(tbl, phb->node);
-		iommu_register_group(PCI_DN(dn)->table_group,
-				pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base(&dev->dev, tbl);
-		iommu_add_device(&dev->dev);
 		return;
 	}
 
@@ -773,11 +769,10 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 	while (dn && PCI_DN(dn) && PCI_DN(dn)->table_group == NULL)
 		dn = dn->parent;
 
-	if (dn && PCI_DN(dn)) {
+	if (dn && PCI_DN(dn))
 		set_iommu_table_base(&dev->dev,
 				PCI_DN(dn)->table_group->tables[0]);
-		iommu_add_device(&dev->dev);
-	} else
+	else
 		printk(KERN_WARNING "iommu: Device %s has no iommu table\n",
 		       pci_name(dev));
 }
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 13/22] powerpc/powernv/pseries: Rework device adding to IOMMU groups
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

The powernv platform registers IOMMU groups and adds devices to them
from the pci_controller_ops::setup_bridge() hook except one case when
virtual functions (SRIOV VFs) are added from a bus notifier.

The pseries platform registers IOMMU groups from
the pci_controller_ops::dma_bus_setup() hook and adds devices from
the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
used for powernv does not add devices for pseries though as
__of_scan_bus() adds devices first, then it does the bus/dev DMA setup.

Both platforms use iommu_add_device() which takes a device and expects
it to have a valid IOMMU table struct with an iommu_table_group pointer
which in turn points to the iommu_group struct (which represents
an IOMMU group). Although the helper seems easy to use, it relies on
some pre-existing device configuration and associated data structures
which it does not really need.

This simplifies iommu_add_device() to take the table_group pointer
directly. Pseries already has a table_group pointer handy and the bus
notifier is not used there anyway. For powernv, this copies the existing
bus notifier and makes it work for powernv only, which gives an easy way
of getting to the table_group pointer. This was tested on VFs but should
also support physical PCI hotplug.

Since iommu_add_device() now receives the table_group pointer directly,
and pseries neither does TCE cache invalidation (the hypervisor does)
nor allows multiple groups per VFIO container (in other words, sharing
an IOMMU table between partitionable endpoints), this removes
iommu_table_group_link from pseries.
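
For illustration, the calling convention after this change, as used by
the pseries dma_dev_setup path in this very patch (error handling
elided):

	/* the caller now supplies the group explicitly */
	set_iommu_table_base(&dev->dev, pci->table_group->tables[0]);
	iommu_add_device(pci->table_group, &dev->dev);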

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          | 12 ++---
 arch/powerpc/kernel/iommu.c               | 58 ++---------------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 +---
 arch/powerpc/platforms/powernv/pci.c      | 43 ++++++++++++++++-
 arch/powerpc/platforms/pseries/iommu.c    | 46 +++++++++---------
 5 files changed, 74 insertions(+), 95 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index a8aeac0..e847ff6 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -215,9 +215,9 @@ struct iommu_table_group {
 
 extern void iommu_register_group(struct iommu_table_group *table_group,
 				 int pci_domain_number, unsigned long pe_num);
-extern int iommu_add_device(struct device *dev);
+extern int iommu_add_device(struct iommu_table_group *table_group,
+		struct device *dev);
 extern void iommu_del_device(struct device *dev);
-extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
 		unsigned long entry, unsigned long *hpa,
 		enum dma_data_direction *direction);
@@ -228,7 +228,8 @@ static inline void iommu_register_group(struct iommu_table_group *table_group,
 {
 }
 
-static inline int iommu_add_device(struct device *dev)
+static inline int iommu_add_device(struct iommu_table_group *table_group,
+		struct device *dev)
 {
 	return 0;
 }
@@ -236,11 +237,6 @@ static inline int iommu_add_device(struct device *dev)
 static inline void iommu_del_device(struct device *dev)
 {
 }
-
-static inline int __init tce_iommu_bus_notifier_init(void)
-{
-        return 0;
-}
 #endif /* !CONFIG_IOMMU_API */
 
 int dma_iommu_mapping_error(struct device *dev, dma_addr_t dma_addr);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 8ccfdd9..1e85168 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1076,11 +1076,8 @@ void iommu_release_ownership(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-int iommu_add_device(struct device *dev)
+int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
-	struct iommu_table *tbl;
-	struct iommu_table_group_link *tgl;
-
 	/*
 	 * The sysfs entries should be populated before
 	 * binding IOMMU group. If sysfs entries isn't
@@ -1096,32 +1093,10 @@ int iommu_add_device(struct device *dev)
 		return -EBUSY;
 	}
 
-	tbl = get_iommu_table_base(dev);
-	if (!tbl) {
-		pr_debug("%s: Skipping device %s with no tbl\n",
-			 __func__, dev_name(dev));
-		return 0;
-	}
-
-	tgl = list_first_entry_or_null(&tbl->it_group_list,
-			struct iommu_table_group_link, next);
-	if (!tgl) {
-		pr_debug("%s: Skipping device %s with no group\n",
-			 __func__, dev_name(dev));
-		return 0;
-	}
 	pr_debug("%s: Adding %s to iommu group %d\n",
-		 __func__, dev_name(dev),
-		 iommu_group_id(tgl->table_group->group));
+		 __func__, dev_name(dev),  iommu_group_id(table_group->group));
 
-	if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
-		pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
-		       __func__, IOMMU_PAGE_SIZE(tbl),
-		       PAGE_SIZE, dev_name(dev));
-		return -EINVAL;
-	}
-
-	return iommu_group_add_device(tgl->table_group->group, dev);
+	return iommu_group_add_device(table_group->group, dev);
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
@@ -1141,31 +1116,4 @@ void iommu_del_device(struct device *dev)
 	iommu_group_remove_device(dev);
 }
 EXPORT_SYMBOL_GPL(iommu_del_device);
-
-static int tce_iommu_bus_notifier(struct notifier_block *nb,
-                unsigned long action, void *data)
-{
-        struct device *dev = data;
-
-        switch (action) {
-        case BUS_NOTIFY_ADD_DEVICE:
-                return iommu_add_device(dev);
-        case BUS_NOTIFY_DEL_DEVICE:
-                if (dev->iommu_group)
-                        iommu_del_device(dev);
-                return 0;
-        default:
-                return 0;
-        }
-}
-
-static struct notifier_block tce_iommu_bus_nb = {
-        .notifier_call = tce_iommu_bus_notifier,
-};
-
-int __init tce_iommu_bus_notifier_init(void)
-{
-        bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
-        return 0;
-}
 #endif /* CONFIG_IOMMU_API */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ec235ca..f36a802 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1940,7 +1940,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
 		set_dma_offset(&dev->dev, pe->tce_bypass_base);
 		if (add_to_group)
-			iommu_add_device(&dev->dev);
+			iommu_add_device(&pe->table_group, &dev->dev);
 
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -2526,14 +2526,6 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (!pnv_iommu_bypass_disabled)
 		pnv_pci_ioda2_set_bypass(pe, true);
 
-	/*
-	 * Setting table base here only for carrying iommu_group
-	 * further down to let iommu_add_device() do the job.
-	 * pnv_pci_ioda_dma_dev_setup will override it later anyway.
-	 */
-	if (pe->flags & PNV_IODA_PE_DEV)
-		set_iommu_table_base(&pe->pdev->dev, tbl);
-
 	return 0;
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 13aef23..98e02c1 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -1127,4 +1127,45 @@ void __init pnv_pci_init(void)
 	set_pci_dma_ops(&dma_iommu_ops);
 }
 
-machine_subsys_initcall_sync(powernv, tce_iommu_bus_notifier_init);
+static int pnv_tce_iommu_bus_notifier(struct notifier_block *nb,
+		unsigned long action, void *data)
+{
+	struct device *dev = data;
+	struct pci_dev *pdev;
+	struct pci_dn *pdn;
+	struct pnv_ioda_pe *pe;
+	struct pci_controller *hose;
+	struct pnv_phb *phb;
+
+	switch (action) {
+	case BUS_NOTIFY_ADD_DEVICE:
+		pdev = to_pci_dev(dev);
+		pdn = pci_get_pdn(pdev);
+		hose = pci_bus_to_host(pdev->bus);
+		phb = hose->private_data;
+
+		WARN_ON_ONCE(!phb);
+		if (!pdn || pdn->pe_number == IODA_INVALID_PE || !phb)
+			return 0;
+
+		pe = &phb->ioda.pe_array[pdn->pe_number];
+		iommu_add_device(&pe->table_group, dev);
+		return 0;
+	case BUS_NOTIFY_DEL_DEVICE:
+		iommu_del_device(dev);
+		return 0;
+	default:
+		return 0;
+	}
+}
+
+static struct notifier_block pnv_tce_iommu_bus_nb = {
+	.notifier_call = pnv_tce_iommu_bus_notifier,
+};
+
+static int __init pnv_tce_iommu_bus_notifier_init(void)
+{
+	bus_register_notifier(&pci_bus_type, &pnv_tce_iommu_bus_nb);
+	return 0;
+}
+machine_subsys_initcall_sync(powernv, pnv_tce_iommu_bus_notifier_init);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index b045f28..762f551 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -60,7 +60,6 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
 	struct iommu_table_group *table_group;
 	struct iommu_table *tbl;
-	struct iommu_table_group_link *tgl;
 
 	table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
 			   node);
@@ -71,22 +70,13 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 	if (!tbl)
 		goto free_group;
 
-	tgl = kzalloc_node(sizeof(struct iommu_table_group_link), GFP_KERNEL,
-			node);
-	if (!tgl)
-		goto free_table;
-
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
 	kref_init(&tbl->it_kref);
-	tgl->table_group = table_group;
-	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
 	table_group->tables[0] = tbl;
 
 	return table_group;
 
-free_table:
-	kfree(tbl);
 free_group:
 	kfree(table_group);
 	return NULL;
@@ -96,23 +86,12 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 		const char *node_name)
 {
 	struct iommu_table *tbl;
-#ifdef CONFIG_IOMMU_API
-	struct iommu_table_group_link *tgl;
-#endif
 
 	if (!table_group)
 		return;
 
 	tbl = table_group->tables[0];
 #ifdef CONFIG_IOMMU_API
-	tgl = list_first_entry_or_null(&tbl->it_group_list,
-			struct iommu_table_group_link, next);
-
-	WARN_ON_ONCE(!tgl);
-	if (tgl) {
-		list_del_rcu(&tgl->next);
-		kfree(tgl);
-	}
 	if (table_group->group) {
 		iommu_group_put(table_group->group);
 		BUG_ON(table_group->group);
@@ -1240,7 +1219,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 	}
 
 	set_iommu_table_base(&dev->dev, pci->table_group->tables[0]);
-	iommu_add_device(&dev->dev);
+	iommu_add_device(pci->table_group, &dev->dev);
 }
 
 static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
@@ -1455,4 +1434,27 @@ static int __init disable_multitce(char *str)
 
 __setup("multitce=", disable_multitce);
 
+static int tce_iommu_bus_notifier(struct notifier_block *nb,
+		unsigned long action, void *data)
+{
+	struct device *dev = data;
+
+	switch (action) {
+	case BUS_NOTIFY_DEL_DEVICE:
+		iommu_del_device(dev);
+		return 0;
+	default:
+		return 0;
+	}
+}
+
+static struct notifier_block tce_iommu_bus_nb = {
+	.notifier_call = tce_iommu_bus_notifier,
+};
+
+static int __init tce_iommu_bus_notifier_init(void)
+{
+	bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
+	return 0;
+}
 machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 13/22] powerpc/powernv/pseries: Rework device adding to IOMMU groups
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

The powernv platform registers IOMMU groups and adds devices to them
from the pci_controller_ops::setup_bridge() hook except one case when
virtual functions (SRIOV VFs) are added from a bus notifier.

The pseries platform registers IOMMU groups from
the pci_controller_ops::dma_bus_setup() hook and adds devices from
the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
used for powernv does not add devices for pseries though as
__of_scan_bus() adds devices first, then it does the bus/dev DMA setup.

Both platforms use iommu_add_device() which takes a device and expects
it to have a valid IOMMU table struct with an iommu_table_group pointer
which in turn points to the iommu_group struct (which represents
an IOMMU group). Although the helper seems easy to use, it relies on
some pre-existing device configuration and associated data structures
which it does not really need.

This simplifies iommu_add_device() to take the table_group pointer
directly. Pseries already has a table_group pointer handy and the bus
notifier is not used there anyway. For powernv, this copies the existing
bus notifier and makes it work for powernv only, which gives an easy way
of getting to the table_group pointer. This was tested on VFs but should
also support physical PCI hotplug.

Since iommu_add_device() now receives the table_group pointer directly,
and pseries neither does TCE cache invalidation (the hypervisor does)
nor allows multiple groups per VFIO container (in other words, sharing
an IOMMU table between partitionable endpoints), this removes
iommu_table_group_link from pseries.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/iommu.h          | 12 ++---
 arch/powerpc/kernel/iommu.c               | 58 ++---------------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 +---
 arch/powerpc/platforms/powernv/pci.c      | 43 ++++++++++++++++-
 arch/powerpc/platforms/pseries/iommu.c    | 46 +++++++++---------
 5 files changed, 74 insertions(+), 95 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index a8aeac0..e847ff6 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -215,9 +215,9 @@ struct iommu_table_group {
 
 extern void iommu_register_group(struct iommu_table_group *table_group,
 				 int pci_domain_number, unsigned long pe_num);
-extern int iommu_add_device(struct device *dev);
+extern int iommu_add_device(struct iommu_table_group *table_group,
+		struct device *dev);
 extern void iommu_del_device(struct device *dev);
-extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
 		unsigned long entry, unsigned long *hpa,
 		enum dma_data_direction *direction);
@@ -228,7 +228,8 @@ static inline void iommu_register_group(struct iommu_table_group *table_group,
 {
 }
 
-static inline int iommu_add_device(struct device *dev)
+static inline int iommu_add_device(struct iommu_table_group *table_group,
+		struct device *dev)
 {
 	return 0;
 }
@@ -236,11 +237,6 @@ static inline int iommu_add_device(struct device *dev)
 static inline void iommu_del_device(struct device *dev)
 {
 }
-
-static inline int __init tce_iommu_bus_notifier_init(void)
-{
-        return 0;
-}
 #endif /* !CONFIG_IOMMU_API */
 
 int dma_iommu_mapping_error(struct device *dev, dma_addr_t dma_addr);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 8ccfdd9..1e85168 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1076,11 +1076,8 @@ void iommu_release_ownership(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-int iommu_add_device(struct device *dev)
+int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
-	struct iommu_table *tbl;
-	struct iommu_table_group_link *tgl;
-
 	/*
 	 * The sysfs entries should be populated before
 	 * binding IOMMU group. If sysfs entries isn't
@@ -1096,32 +1093,10 @@ int iommu_add_device(struct device *dev)
 		return -EBUSY;
 	}
 
-	tbl = get_iommu_table_base(dev);
-	if (!tbl) {
-		pr_debug("%s: Skipping device %s with no tbl\n",
-			 __func__, dev_name(dev));
-		return 0;
-	}
-
-	tgl = list_first_entry_or_null(&tbl->it_group_list,
-			struct iommu_table_group_link, next);
-	if (!tgl) {
-		pr_debug("%s: Skipping device %s with no group\n",
-			 __func__, dev_name(dev));
-		return 0;
-	}
 	pr_debug("%s: Adding %s to iommu group %d\n",
-		 __func__, dev_name(dev),
-		 iommu_group_id(tgl->table_group->group));
+		 __func__, dev_name(dev),  iommu_group_id(table_group->group));
 
-	if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
-		pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
-		       __func__, IOMMU_PAGE_SIZE(tbl),
-		       PAGE_SIZE, dev_name(dev));
-		return -EINVAL;
-	}
-
-	return iommu_group_add_device(tgl->table_group->group, dev);
+	return iommu_group_add_device(table_group->group, dev);
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
@@ -1141,31 +1116,4 @@ void iommu_del_device(struct device *dev)
 	iommu_group_remove_device(dev);
 }
 EXPORT_SYMBOL_GPL(iommu_del_device);
-
-static int tce_iommu_bus_notifier(struct notifier_block *nb,
-                unsigned long action, void *data)
-{
-        struct device *dev = data;
-
-        switch (action) {
-        case BUS_NOTIFY_ADD_DEVICE:
-                return iommu_add_device(dev);
-        case BUS_NOTIFY_DEL_DEVICE:
-                if (dev->iommu_group)
-                        iommu_del_device(dev);
-                return 0;
-        default:
-                return 0;
-        }
-}
-
-static struct notifier_block tce_iommu_bus_nb = {
-        .notifier_call = tce_iommu_bus_notifier,
-};
-
-int __init tce_iommu_bus_notifier_init(void)
-{
-        bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
-        return 0;
-}
 #endif /* CONFIG_IOMMU_API */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ec235ca..f36a802 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1940,7 +1940,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
 		set_dma_offset(&dev->dev, pe->tce_bypass_base);
 		if (add_to_group)
-			iommu_add_device(&dev->dev);
+			iommu_add_device(&pe->table_group, &dev->dev);
 
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -2526,14 +2526,6 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	if (!pnv_iommu_bypass_disabled)
 		pnv_pci_ioda2_set_bypass(pe, true);
 
-	/*
-	 * Setting table base here only for carrying iommu_group
-	 * further down to let iommu_add_device() do the job.
-	 * pnv_pci_ioda_dma_dev_setup will override it later anyway.
-	 */
-	if (pe->flags & PNV_IODA_PE_DEV)
-		set_iommu_table_base(&pe->pdev->dev, tbl);
-
 	return 0;
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 13aef23..98e02c1 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -1127,4 +1127,45 @@ void __init pnv_pci_init(void)
 	set_pci_dma_ops(&dma_iommu_ops);
 }
 
-machine_subsys_initcall_sync(powernv, tce_iommu_bus_notifier_init);
+static int pnv_tce_iommu_bus_notifier(struct notifier_block *nb,
+		unsigned long action, void *data)
+{
+	struct device *dev = data;
+	struct pci_dev *pdev;
+	struct pci_dn *pdn;
+	struct pnv_ioda_pe *pe;
+	struct pci_controller *hose;
+	struct pnv_phb *phb;
+
+	switch (action) {
+	case BUS_NOTIFY_ADD_DEVICE:
+		pdev = to_pci_dev(dev);
+		pdn = pci_get_pdn(pdev);
+		hose = pci_bus_to_host(pdev->bus);
+		phb = hose->private_data;
+
+		WARN_ON_ONCE(!phb);
+		if (!pdn || pdn->pe_number == IODA_INVALID_PE || !phb)
+			return 0;
+
+		pe = &phb->ioda.pe_array[pdn->pe_number];
+		iommu_add_device(&pe->table_group, dev);
+		return 0;
+	case BUS_NOTIFY_DEL_DEVICE:
+		iommu_del_device(dev);
+		return 0;
+	default:
+		return 0;
+	}
+}
+
+static struct notifier_block pnv_tce_iommu_bus_nb = {
+	.notifier_call = pnv_tce_iommu_bus_notifier,
+};
+
+static int __init pnv_tce_iommu_bus_notifier_init(void)
+{
+	bus_register_notifier(&pci_bus_type, &pnv_tce_iommu_bus_nb);
+	return 0;
+}
+machine_subsys_initcall_sync(powernv, pnv_tce_iommu_bus_notifier_init);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index b045f28..762f551 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -60,7 +60,6 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
 	struct iommu_table_group *table_group;
 	struct iommu_table *tbl;
-	struct iommu_table_group_link *tgl;
 
 	table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
 			   node);
@@ -71,22 +70,13 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 	if (!tbl)
 		goto free_group;
 
-	tgl = kzalloc_node(sizeof(struct iommu_table_group_link), GFP_KERNEL,
-			node);
-	if (!tgl)
-		goto free_table;
-
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
 	kref_init(&tbl->it_kref);
-	tgl->table_group = table_group;
-	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
 	table_group->tables[0] = tbl;
 
 	return table_group;
 
-free_table:
-	kfree(tbl);
 free_group:
 	kfree(table_group);
 	return NULL;
@@ -96,23 +86,12 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 		const char *node_name)
 {
 	struct iommu_table *tbl;
-#ifdef CONFIG_IOMMU_API
-	struct iommu_table_group_link *tgl;
-#endif
 
 	if (!table_group)
 		return;
 
 	tbl = table_group->tables[0];
 #ifdef CONFIG_IOMMU_API
-	tgl = list_first_entry_or_null(&tbl->it_group_list,
-			struct iommu_table_group_link, next);
-
-	WARN_ON_ONCE(!tgl);
-	if (tgl) {
-		list_del_rcu(&tgl->next);
-		kfree(tgl);
-	}
 	if (table_group->group) {
 		iommu_group_put(table_group->group);
 		BUG_ON(table_group->group);
@@ -1240,7 +1219,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 	}
 
 	set_iommu_table_base(&dev->dev, pci->table_group->tables[0]);
-	iommu_add_device(&dev->dev);
+	iommu_add_device(pci->table_group, &dev->dev);
 }
 
 static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
@@ -1455,4 +1434,27 @@ static int __init disable_multitce(char *str)
 
 __setup("multitce=", disable_multitce);
 
+static int tce_iommu_bus_notifier(struct notifier_block *nb,
+		unsigned long action, void *data)
+{
+	struct device *dev = data;
+
+	switch (action) {
+	case BUS_NOTIFY_DEL_DEVICE:
+		iommu_del_device(dev);
+		return 0;
+	default:
+		return 0;
+	}
+}
+
+static struct notifier_block tce_iommu_bus_nb = {
+	.notifier_call = tce_iommu_bus_notifier,
+};
+
+static int __init tce_iommu_bus_notifier_init(void)
+{
+	bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
+	return 0;
+}
 machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 14/22] powerpc/iommu_api: Move IOMMU groups setup to a single place
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

Registering new IOMMU groups and adding devices to them are separated in
the code, and the latter is buried in the DMA setup code, which it does
not really belong to.

This moves IOMMU group setup to a separate helper which registers a group
and adds devices as before. This does not make a difference as IOMMU
groups are not used anyway; the only dependency here is that
iommu_add_device() requires a valid pointer to an iommu_table
(set by set_iommu_table_base()).

To keep the old behaviour, this does not add new IOMMU groups for PEs
with no DMA weight and also skips NVLINK bridges which do not have
pci_controller_ops::setup_bridge (the normal way of adding PEs).
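
For illustration, the resulting call pairing for a VF PE (taken from the
SRIOV path in this patch; the same helper serves all PE types, and the
DMA setup runs first so set_iommu_table_base() has been called by the
time devices are added to the group):

		pnv_pci_ioda2_setup_dma_pe(phb, pe);
		pnv_ioda_setup_bus_iommu_group(pe);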

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 80 +++++++++++++++++++----
 1 file changed, 66 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f36a802..7f4904a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1269,6 +1269,8 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus)
 		pnv_ioda_setup_npu_PE(pdev);
 }
 
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe);
+
 static void pnv_pci_ioda_setup_PEs(void)
 {
 	struct pci_controller *hose;
@@ -1591,6 +1593,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+		pnv_ioda_setup_bus_iommu_group(pe);
 	}
 }
 
@@ -1930,21 +1933,16 @@ static u64 pnv_pci_ioda_dma_get_required_mask(struct pci_dev *pdev)
 	return mask;
 }
 
-static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
-				   struct pci_bus *bus,
-				   bool add_to_group)
+static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
 		set_dma_offset(&dev->dev, pe->tce_bypass_base);
-		if (add_to_group)
-			iommu_add_device(&pe->table_group, &dev->dev);
 
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
-					add_to_group);
+			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
 	}
 }
 
@@ -2374,7 +2372,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 	iommu_init_table(tbl, phb->hose->node);
 
 	if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
-		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 
 	return;
  fail:
@@ -2607,7 +2605,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
 	if (pe->pbus)
-		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
+		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 	iommu_tce_table_put(tbl);
 }
 
@@ -2618,7 +2616,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_setup_default_config(pe);
 	if (pe->pbus)
-		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
+		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 }
 
 static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
@@ -2735,12 +2733,68 @@ static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = {
 	.release_ownership = pnv_ioda2_release_ownership,
 };
 
+static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe *pe,
+		struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		iommu_add_device(&pe->table_group, &dev->dev);
+
+		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
+			pnv_ioda_setup_bus_iommu_group_add_devices(pe,
+					dev->subordinate);
+	}
+}
+
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
+{
+	if (!pnv_pci_ioda_pe_dma_weight(pe))
+		return;
+
+	iommu_register_group(&pe->table_group, pe->phb->hose->global_number,
+			pe->pe_number);
+
+	/*
+	 * set_iommu_table_base(&pe->pdev->dev, tbl) should have been called
+	 * by now
+	 */
+	if (pe->flags & PNV_IODA_PE_DEV)
+		iommu_add_device(&pe->table_group, &pe->pdev->dev);
+	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
+		pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
+}
+
 static void pnv_pci_ioda_setup_iommu_api(void)
 {
 	struct pci_controller *hose, *tmp;
 	struct pnv_phb *phb;
 	struct pnv_ioda_pe *pe, *gpe;
 
+	/*
+	 * There are 4 types of PEs:
+	 * - PNV_IODA_PE_BUS: a downstream port with an adapter,
+	 *   created from pnv_pci_setup_bridge();
+	 * - PNV_IODA_PE_BUS_ALL: a PCI-PCIX bridge with devices behind it,
+	 *   created from pnv_pci_setup_bridge();
+	 * - PNV_IODA_PE_VF: a SRIOV virtual function,
+	 *   created from pnv_pcibios_sriov_enable();
+	 * - PNV_IODA_PE_DEV: an NPU or OCAPI device,
+	 *   created from pnv_pci_ioda_fixup().
+	 *
+	 * Normally a PE is represented by an IOMMU group, however for
+	 * devices with side channels the groups need to be more strict.
+	 */
+	list_for_each_entry(hose, &hose_list, list_node) {
+		phb = hose->private_data;
+
+		if (phb->type == PNV_PHB_NPU_NVLINK)
+			continue;
+
+		list_for_each_entry(pe, &phb->ioda.pe_list, list)
+			pnv_ioda_setup_bus_iommu_group(pe);
+	}
+
 	/*
 	 * Now we have all PHBs discovered, time to add NPU devices to
 	 * the corresponding IOMMU groups.
@@ -2759,6 +2813,7 @@ static void pnv_pci_ioda_setup_iommu_api(void)
 	}
 }
 #else /* !CONFIG_IOMMU_API */
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe) { }
 static void pnv_pci_ioda_setup_iommu_api(void) { };
 #endif
 
@@ -2801,9 +2856,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	/* TVE #1 is selected by PCI address bit 59 */
 	pe->tce_bypass_base = 1ull << 59;
 
-	iommu_register_group(&pe->table_group, phb->hose->global_number,
-			pe->pe_number);
-
 	/* The PE will reserve all possible 32-bits space */
 	pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
 		phb->ioda.m32_pci_base);
@@ -2824,7 +2876,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		return;
 
 	if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
-		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 }
 
 #ifdef CONFIG_PCI_MSI
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 14/22] powerpc/iommu_api: Move IOMMU groups setup to a single place
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

Registering new IOMMU groups and adding devices to them are separated in
the code, and the latter is buried in the DMA setup code, which it does
not really belong to.

This moves IOMMU group setup to a separate helper which registers a group
and adds devices as before. This does not make a difference as IOMMU
groups are not used anyway; the only dependency here is that
iommu_add_device() requires a valid pointer to an iommu_table
(set by set_iommu_table_base()).

To keep the old behaviour, this does not add new IOMMU groups for PEs
with no DMA weight and also skips NVLINK bridges which do not have
pci_controller_ops::setup_bridge (the normal way of adding PEs).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 80 +++++++++++++++++++----
 1 file changed, 66 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f36a802..7f4904a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1269,6 +1269,8 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus)
 		pnv_ioda_setup_npu_PE(pdev);
 }
 
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe);
+
 static void pnv_pci_ioda_setup_PEs(void)
 {
 	struct pci_controller *hose;
@@ -1591,6 +1593,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+		pnv_ioda_setup_bus_iommu_group(pe);
 	}
 }
 
@@ -1930,21 +1933,16 @@ static u64 pnv_pci_ioda_dma_get_required_mask(struct pci_dev *pdev)
 	return mask;
 }
 
-static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
-				   struct pci_bus *bus,
-				   bool add_to_group)
+static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
 		set_dma_offset(&dev->dev, pe->tce_bypass_base);
-		if (add_to_group)
-			iommu_add_device(&pe->table_group, &dev->dev);
 
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
-					add_to_group);
+			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
 	}
 }
 
@@ -2374,7 +2372,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 	iommu_init_table(tbl, phb->hose->node);
 
 	if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
-		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 
 	return;
  fail:
@@ -2607,7 +2605,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
 	pnv_pci_ioda2_set_bypass(pe, false);
 	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
 	if (pe->pbus)
-		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
+		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 	iommu_tce_table_put(tbl);
 }
 
@@ -2618,7 +2616,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
 
 	pnv_pci_ioda2_setup_default_config(pe);
 	if (pe->pbus)
-		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
+		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 }
 
 static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
@@ -2735,12 +2733,68 @@ static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = {
 	.release_ownership = pnv_ioda2_release_ownership,
 };
 
+static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe *pe,
+		struct pci_bus *bus)
+{
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		iommu_add_device(&pe->table_group, &dev->dev);
+
+		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
+			pnv_ioda_setup_bus_iommu_group_add_devices(pe,
+					dev->subordinate);
+	}
+}
+
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
+{
+	if (!pnv_pci_ioda_pe_dma_weight(pe))
+		return;
+
+	iommu_register_group(&pe->table_group, pe->phb->hose->global_number,
+			pe->pe_number);
+
+	/*
+	 * set_iommu_table_base(&pe->pdev->dev, tbl) should have been called
+	 * by now
+	 */
+	if (pe->flags & PNV_IODA_PE_DEV)
+		iommu_add_device(&pe->table_group, &pe->pdev->dev);
+	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
+		pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
+}
+
 static void pnv_pci_ioda_setup_iommu_api(void)
 {
 	struct pci_controller *hose, *tmp;
 	struct pnv_phb *phb;
 	struct pnv_ioda_pe *pe, *gpe;
 
+	/*
+	 * There are 4 types of PEs:
+	 * - PNV_IODA_PE_BUS: a downstream port with an adapter,
+	 *   created from pnv_pci_setup_bridge();
+	 * - PNV_IODA_PE_BUS_ALL: a PCI-PCIX bridge with devices behind it,
+	 *   created from pnv_pci_setup_bridge();
+	 * - PNV_IODA_PE_VF: a SRIOV virtual function,
+	 *   created from pnv_pcibios_sriov_enable();
+	 * - PNV_IODA_PE_DEV: an NPU or OCAPI device,
+	 *   created from pnv_pci_ioda_fixup().
+	 *
+	 * Normally a PE is represented by an IOMMU group, however for
+	 * devices with side channels the groups need to be more strict.
+	 */
+	list_for_each_entry(hose, &hose_list, list_node) {
+		phb = hose->private_data;
+
+		if (phb->type == PNV_PHB_NPU_NVLINK)
+			continue;
+
+		list_for_each_entry(pe, &phb->ioda.pe_list, list)
+			pnv_ioda_setup_bus_iommu_group(pe);
+	}
+
 	/*
 	 * Now we have all PHBs discovered, time to add NPU devices to
 	 * the corresponding IOMMU groups.
@@ -2759,6 +2813,7 @@ static void pnv_pci_ioda_setup_iommu_api(void)
 	}
 }
 #else /* !CONFIG_IOMMU_API */
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe) { }
 static void pnv_pci_ioda_setup_iommu_api(void) { };
 #endif
 
@@ -2801,9 +2856,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	/* TVE #1 is selected by PCI address bit 59 */
 	pe->tce_bypass_base = 1ull << 59;
 
-	iommu_register_group(&pe->table_group, phb->hose->global_number,
-			pe->pe_number);
-
 	/* The PE will reserve all possible 32-bits space */
 	pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
 		phb->ioda.m32_pci_base);
@@ -2824,7 +2876,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		return;
 
 	if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
-		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 }
 
 #ifdef CONFIG_PCI_MSI
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 15/22] powerpc/powernv: Reference iommu_table while it is linked to a group
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

The iommu_table pointer stored in iommu_table_group may get stale
by accident; this adds reference counting while the table is linked to
a group and removes a now-redundant comment.
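
For illustration, the get/put pairing this patch establishes, taken from
pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group():

	/* linking a table to a group now takes a reference */
	table_group->tables[num] = iommu_tce_table_get(tbl);

	/* unlinking drops it before clearing the slot */
	iommu_tce_table_put(tbl);
	table_group->tables[i] = NULL;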

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 3 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c     | 4 ----
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index 7639b21..697449a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -368,6 +368,7 @@ void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
 	found = false;
 	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
 		if (table_group->tables[i] == tbl) {
+			iommu_tce_table_put(tbl);
 			table_group->tables[i] = NULL;
 			found = true;
 			break;
@@ -393,7 +394,7 @@ long pnv_pci_link_table_and_group(int node, int num,
 	tgl->table_group = table_group;
 	list_add_rcu(&tgl->next, &tbl->it_group_list);
 
-	table_group->tables[num] = tbl;
+	table_group->tables[num] = iommu_tce_table_get(tbl);
 
 	return 0;
 }
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7f4904a..7caf373 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2716,10 +2716,6 @@ static long pnv_pci_ioda2_npu_unset_window(
 
 static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
 {
-	/*
-	 * Detach NPU first as pnv_ioda2_take_ownership() will destroy
-	 * the iommu_table if 32bit DMA is enabled.
-	 */
 	pnv_npu_take_ownership(gpe_table_group_to_npe(table_group));
 	pnv_ioda2_take_ownership(table_group);
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 16/22] powerpc/powernv: Add purge cache OPAL call
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

Flushing caches with the dcbf instruction takes quite some time when
gigabytes need flushing (16GB takes more than 15s); OPAL has just
gained a big-hammer call to flush all caches.

This adds opal_purge_cache() which will be used later to flush caches
for coherent GPU memory, as that memory may suddenly become unavailable
if a GPU is reset and its NVLink is not (re)trained.
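
For scale, a hypothetical sketch of the per-line flush this call
replaces (the function is made up; 128-byte cache lines assumed, as on
POWER9):

	static void flush_region_dcbf(void *start, unsigned long size)
	{
		unsigned long addr = (unsigned long)start;
		unsigned long end = addr + size;

		/*
		 * One dcbf per cache line - over a hundred million
		 * iterations for 16GB, hence the 15s figure above.
		 */
		for (; addr < end; addr += 128)
			asm volatile("dcbf 0,%0" : : "r" (addr) : "memory");
		asm volatile("sync" : : : "memory");
	}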

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/opal-api.h            | 3 ++-
 arch/powerpc/include/asm/opal.h                | 1 +
 arch/powerpc/platforms/powernv/opal.c          | 1 +
 arch/powerpc/platforms/powernv/opal-wrappers.S | 1 +
 4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 870fb7b..55bc640 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -210,7 +210,8 @@
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR		164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR		165
 #define	OPAL_NX_COPROC_INIT			167
-#define OPAL_LAST				167
+#define OPAL_CLEAR_CACHE			170
+#define OPAL_LAST				170
 
 #define QUIESCE_HOLD			1 /* Spin all calls at entry */
 #define QUIESCE_REJECT			2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index ff38664..7db576e 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -294,6 +294,7 @@ int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
 int opal_sensor_group_clear(u32 group_hndl, int token);
 int opal_sensor_group_enable(u32 group_hndl, int token, bool enable);
 int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
+int opal_purge_cache(void);
 
 s64 opal_signal_system_reset(s32 cpu);
 s64 opal_quiesce(u64 shutdown_type, s32 cpu);
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index beed86f..44ce824 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -1113,3 +1113,4 @@ EXPORT_SYMBOL_GPL(opal_int_eoi);
 EXPORT_SYMBOL_GPL(opal_error_code);
 /* Export the below symbol for NX compression */
 EXPORT_SYMBOL(opal_nx_coproc_init);
+EXPORT_SYMBOL(opal_purge_cache);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 2515282..5b886a6 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -331,3 +331,4 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,		OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_sensor_read_u64,			OPAL_SENSOR_READ_U64);
 OPAL_CALL(opal_sensor_group_enable,		OPAL_SENSOR_GROUP_ENABLE);
 OPAL_CALL(opal_nx_coproc_init,			OPAL_NX_COPROC_INIT);
+OPAL_CALL(opal_purge_cache,			OPAL_CLEAR_CACHE);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 17/22] powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

At the moment the NPU IOMMU is manipulated directly from the IODA2 PCI
PE code; the PCI PE acts as a master to the NPU PE. Soon we will have
compound IOMMU groups with several PEs from several different PHBs
(such as interconnected GPUs and NPUs), so there will be no single
master but one big IOMMU group.

As a first step, this converts an NPU PE to a table group.

This should cause no behavioral change. Note that
pnv_npu_release_ownership() has never been implemented.
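
The conversion works because a PE embeds its table group, so an ops
callback can recover the PE with container_of(). A hedged sketch of the
pattern (simplified from the patch):

	static long pnv_npu_set_window(struct iommu_table_group *table_group,
			int num, struct iommu_table *tbl)
	{
		struct pnv_ioda_pe *npe = container_of(table_group,
				struct pnv_ioda_pe, table_group);

		/* ... program the NPU TCE window of npe as before ... */
		return 0;
	}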

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci.h      |  5 ----
 arch/powerpc/platforms/powernv/npu-dma.c  | 29 ++++++++++++++++++-----
 arch/powerpc/platforms/powernv/pci-ioda.c | 17 +++++++------
 3 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index ddb4f02..cf9f748 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -216,11 +216,6 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
 extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
-extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
-		struct iommu_table *tbl);
-extern long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num);
-extern void pnv_npu_take_ownership(struct pnv_ioda_pe *npe);
-extern void pnv_npu_release_ownership(struct pnv_ioda_pe *npe);
 
 /* pci-ioda-tce.c */
 #define POWERNV_IOMMU_DEFAULT_LEVELS	1
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 4b60f43..1792c7e 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -121,9 +121,11 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
 	return pe;
 }
 
-long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
+static long pnv_npu_set_window(struct iommu_table_group *table_group, int num,
 		struct iommu_table *tbl)
 {
+	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+			table_group);
 	struct pnv_phb *phb = npe->phb;
 	int64_t rc;
 	const unsigned long size = tbl->it_indirect_levels ?
@@ -155,8 +157,10 @@ long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
 	return 0;
 }
 
-long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num)
+static long pnv_npu_unset_window(struct iommu_table_group *table_group, int num)
 {
+	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+			table_group);
 	struct pnv_phb *phb = npe->phb;
 	int64_t rc;
 
@@ -198,7 +202,8 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
 	if (!gpe)
 		return;
 
-	rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]);
+	rc = pnv_npu_set_window(&npe->table_group, 0,
+			gpe->table_group.tables[0]);
 
 	/*
 	 * NVLink devices use the same TCE table configuration as
@@ -223,7 +228,7 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
 	if (phb->type != PNV_PHB_NPU_NVLINK || !npe->pdev)
 		return -EINVAL;
 
-	rc = pnv_npu_unset_window(npe, 0);
+	rc = pnv_npu_unset_window(&npe->table_group, 0);
 	if (rc != OPAL_SUCCESS)
 		return rc;
 
@@ -276,9 +281,12 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
 	}
 }
 
+#ifdef CONFIG_IOMMU_API
 /* Switch ownership from platform code to external user (e.g. VFIO) */
-void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
+static void pnv_npu_take_ownership(struct iommu_table_group *table_group)
 {
+	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+			table_group);
 	struct pnv_phb *phb = npe->phb;
 	int64_t rc;
 
@@ -289,7 +297,7 @@ void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
 	 * if it was enabled at the moment of ownership change.
 	 */
 	if (npe->table_group.tables[0]) {
-		pnv_npu_unset_window(npe, 0);
+		pnv_npu_unset_window(&npe->table_group, 0);
 		return;
 	}
 
@@ -304,6 +312,12 @@ void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
 	pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
 }
 
+static struct iommu_table_group_ops pnv_pci_npu_ops = {
+	.set_window = pnv_npu_set_window,
+	.unset_window = pnv_npu_unset_window,
+	.take_ownership = pnv_npu_take_ownership,
+};
+
 struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
 {
 	struct pnv_phb *phb = npe->phb;
@@ -314,6 +328,8 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
 	if (!gpe || !gpdev)
 		return NULL;
 
+	npe->table_group.ops = &pnv_pci_npu_ops;
+
 	list_for_each_entry(npdev, &pbus->devices, bus_list) {
 		gptmp = pnv_pci_get_gpu_dev(npdev);
 
@@ -326,6 +342,7 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
 
 	return gpe;
 }
+#endif /* !CONFIG_IOMMU_API */
 
 /*
  * NPU2 ATS
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7caf373..04639ae 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2677,14 +2677,14 @@ static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group,
 		return ret;
 
 	if (table_group->tables[num2])
-		pnv_npu_unset_window(npe, num2);
+		npe->table_group.ops->unset_window(&npe->table_group, num2);
 
-	ret = pnv_npu_set_window(npe, num, tbl);
+	ret = npe->table_group.ops->set_window(&npe->table_group, num, tbl);
 	if (ret) {
 		pnv_pci_ioda2_unset_window(table_group, num);
 		if (table_group->tables[num2])
-			pnv_npu_set_window(npe, num2,
-					table_group->tables[num2]);
+			npe->table_group.ops->set_window(&npe->table_group,
+					num2, table_group->tables[num2]);
 	}
 
 	return ret;
@@ -2704,19 +2704,22 @@ static long pnv_pci_ioda2_npu_unset_window(
 	if (!npe->table_group.tables[num])
 		return 0;
 
-	ret = pnv_npu_unset_window(npe, num);
+	ret = npe->table_group.ops->unset_window(&npe->table_group, num);
 	if (ret)
 		return ret;
 
 	if (table_group->tables[num2])
-		ret = pnv_npu_set_window(npe, num2, table_group->tables[num2]);
+		ret = npe->table_group.ops->set_window(&npe->table_group, num2,
+				table_group->tables[num2]);
 
 	return ret;
 }
 
 static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
 {
-	pnv_npu_take_ownership(gpe_table_group_to_npe(table_group));
+	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
+
+	npe->table_group.ops->take_ownership(&npe->table_group);
 	pnv_ioda2_take_ownership(table_group);
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 18/22] powerpc/powernv/npu: Add compound IOMMU groups
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

At the moment powernv registers an IOMMU group for each PE. There is
an exception though - an NPU (an emulated PCI bridge representing
an NVLink); powernv attaches these bridges to the GPU IOMMU group,
which becomes the master.

Now we have POWER9 systems with GPUs connected to each other directly,
bypassing PCI. At the moment powernv does not control these links, so
it has to put such interconnected GPUs into the same IOMMU group, which
means that the old scheme with a GPU as the master won't work - there
will be up to 3 GPUs in such a group.

This introduces an npu_comp struct which represents a compound IOMMU
group made of multiple PEs. This converts the existing NVLink1 code to
use the new scheme. From now on, each PE must have a valid
iommu_table_group_ops which will either be called directly (a single-PE
group) or indirectly from a compound group.

This moves IOMMU group registration for NPU-connected GPUs to npu-dma.c.
For POWER8, this stores the new compound group pointer in the PE (so
a GPU is still the master); for POWER9 the new group pointer is stored
in the NPU.
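
A hedged sketch of the fan-out idea (condensed from
pnv_npu_peers_set_window in the diff below; error rollback shortened):
a compound group op is applied to every member PE's own table_group
ops, undoing the already-programmed PEs on failure:

	static long npu_comp_set_window(struct npu_comp *npucomp, int num,
			struct iommu_table *tbl)
	{
		long ret = 0;
		int i;

		for (i = 0; i < npucomp->pe_num; ++i) {
			struct pnv_ioda_pe *pe = npucomp->pe[i];

			ret = pe->table_group.ops->set_window(
					&pe->table_group, num, tbl);
			if (ret)
				break;
		}

		if (ret)
			while (i--)	/* roll back the PEs already set */
				npucomp->pe[i]->table_group.ops->unset_window(
						&npucomp->pe[i]->table_group,
						num);

		return ret;
	}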

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/pci.h            |   1 +
 arch/powerpc/platforms/powernv/pci.h      |   7 +
 arch/powerpc/platforms/powernv/npu-dma.c  | 286 ++++++++++++++++++++--
 arch/powerpc/platforms/powernv/pci-ioda.c | 173 +++----------
 4 files changed, 308 insertions(+), 159 deletions(-)

diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index baf2886..0c72f18 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -132,5 +132,6 @@ extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index);
 extern int pnv_npu2_init(struct pci_controller *hose);
 extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
 		unsigned long msr);
+extern int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev);
 
 #endif /* __ASM_POWERPC_PCI_H */
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index cf9f748..aef4bb5 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -62,6 +62,7 @@ struct pnv_ioda_pe {
 
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
 	struct iommu_table_group table_group;
+	struct npu_comp		*npucomp;
 
 	/* 64-bit TCE bypass region */
 	bool			tce_bypass_enabled;
@@ -201,6 +202,8 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
 extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
 extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
+extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+		__u64 window_size, __u32 levels);
 extern int pnv_eeh_post_init(void);
 
 extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
@@ -216,6 +219,10 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
 extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
+extern struct iommu_table_group *pnv_try_setup_npu_table_group(
+		struct pnv_ioda_pe *pe);
+extern struct iommu_table_group *pnv_npu_compound_attach(
+		struct pnv_ioda_pe *pe);
 
 /* pci-ioda-tce.c */
 #define POWERNV_IOMMU_DEFAULT_LEVELS	1
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 1792c7e..2231f4c 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -317,31 +317,6 @@ static struct iommu_table_group_ops pnv_pci_npu_ops = {
 	.unset_window = pnv_npu_unset_window,
 	.take_ownership = pnv_npu_take_ownership,
 };
-
-struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
-{
-	struct pnv_phb *phb = npe->phb;
-	struct pci_bus *pbus = phb->hose->bus;
-	struct pci_dev *npdev, *gpdev = NULL, *gptmp;
-	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
-
-	if (!gpe || !gpdev)
-		return NULL;
-
-	npe->table_group.ops = &pnv_pci_npu_ops;
-
-	list_for_each_entry(npdev, &pbus->devices, bus_list) {
-		gptmp = pnv_pci_get_gpu_dev(npdev);
-
-		if (gptmp != gpdev)
-			continue;
-
-		pe_info(gpe, "Attached NPU %s\n", dev_name(&npdev->dev));
-		iommu_group_add_device(gpe->table_group.group, &npdev->dev);
-	}
-
-	return gpe;
-}
 #endif /* !CONFIG_IOMMU_API */
 
 /*
@@ -349,6 +324,17 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
  */
 /* Maximum possible number of ATSD MMIO registers per NPU */
 #define NV_NMMU_ATSD_REGS 8
+#define NV_NPU_MAX_PE_NUM	16
+
+/*
+ * A compound NPU IOMMU group which might consist of 1 GPU + 2xNPUs (POWER8) or
+ * up to 3 x (GPU + 2xNPUs) (POWER9).
+ */
+struct npu_comp {
+	struct iommu_table_group table_group;
+	int pe_num;
+	struct pnv_ioda_pe *pe[NV_NPU_MAX_PE_NUM];
+};
 
 /* An NPU descriptor, valid for POWER9 only */
 struct npu {
@@ -365,6 +351,8 @@ struct npu {
 	struct list_head next;
 
 	struct pci_controller *hose;
+
+	struct npu_comp npucomp;
 };
 
 static LIST_HEAD(npu2_devices);
@@ -382,6 +370,254 @@ static struct npu *npdev_to_npu(struct pci_dev *npdev)
 	return NULL;
 }
 
+#ifdef CONFIG_IOMMU_API
+static long pnv_npu_peers_create_table_userspace(
+		struct iommu_table_group *table_group,
+		int num, __u32 page_shift, __u64 window_size, __u32 levels,
+		struct iommu_table **ptbl)
+{
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	if (!npucomp->pe_num || !npucomp->pe[0] ||
+			!npucomp->pe[0]->table_group.ops ||
+			!npucomp->pe[0]->table_group.ops->create_table)
+		return -EFAULT;
+
+	return npucomp->pe[0]->table_group.ops->create_table(
+			&npucomp->pe[0]->table_group, num, page_shift,
+			window_size, levels, ptbl);
+}
+
+static long pnv_npu_peers_set_window(struct iommu_table_group *table_group,
+		int num, struct iommu_table *tbl)
+{
+	int i, j;
+	long ret = 0;
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	for (i = 0; i < npucomp->pe_num; ++i) {
+		struct pnv_ioda_pe *pe = npucomp->pe[i];
+
+		if (!pe->table_group.ops->set_window)
+			continue;
+
+		ret = pe->table_group.ops->set_window(&pe->table_group,
+				num, tbl);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		for (j = 0; j < i; ++j) {
+			struct pnv_ioda_pe *pe = npucomp->pe[j];
+
+			if (!pe->table_group.ops->unset_window)
+				continue;
+
+			ret = pe->table_group.ops->unset_window(
+					&pe->table_group, num);
+			if (ret)
+				break;
+		}
+	} else {
+		table_group->tables[num] = iommu_tce_table_get(tbl);
+	}
+
+	return ret;
+}
+
+static long pnv_npu_peers_unset_window(struct iommu_table_group *table_group,
+		int num)
+{
+	int i, j;
+	long ret = 0;
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	for (i = 0; i < npucomp->pe_num; ++i) {
+		struct pnv_ioda_pe *pe = npucomp->pe[i];
+
+		WARN_ON(npucomp->table_group.tables[num] !=
+				table_group->tables[num]);
+		if (!npucomp->table_group.tables[num])
+			continue;
+
+		if (!pe->table_group.ops->unset_window)
+			continue;
+
+		ret = pe->table_group.ops->unset_window(&pe->table_group, num);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		for (j = 0; j < i; ++j) {
+			struct pnv_ioda_pe *pe = npucomp->pe[j];
+
+			if (!npucomp->table_group.tables[num])
+				continue;
+
+			if (!pe->table_group.ops->set_window)
+				continue;
+
+			ret = pe->table_group.ops->set_window(&pe->table_group,
+					num, table_group->tables[num]);
+			if (ret)
+				break;
+		}
+	} else if (table_group->tables[num]) {
+		iommu_tce_table_put(table_group->tables[num]);
+		table_group->tables[num] = NULL;
+	}
+
+	return ret;
+}
+
+static void pnv_npu_peers_take_ownership(struct iommu_table_group *table_group)
+{
+	int i;
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	for (i = 0; i < npucomp->pe_num; ++i) {
+		struct pnv_ioda_pe *pe = npucomp->pe[i];
+
+		if (!pe->table_group.ops->take_ownership)
+			continue;
+		pe->table_group.ops->take_ownership(&pe->table_group);
+	}
+}
+
+static void pnv_npu_peers_release_ownership(
+		struct iommu_table_group *table_group)
+{
+	int i;
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	for (i = 0; i < npucomp->pe_num; ++i) {
+		struct pnv_ioda_pe *pe = npucomp->pe[i];
+
+		if (!pe->table_group.ops->release_ownership)
+			continue;
+		pe->table_group.ops->release_ownership(&pe->table_group);
+	}
+}
+
+static struct iommu_table_group_ops pnv_npu_peers_ops = {
+	.get_table_size = pnv_pci_ioda2_get_table_size,
+	.create_table = pnv_npu_peers_create_table_userspace,
+	.set_window = pnv_npu_peers_set_window,
+	.unset_window = pnv_npu_peers_unset_window,
+	.take_ownership = pnv_npu_peers_take_ownership,
+	.release_ownership = pnv_npu_peers_release_ownership,
+};
+
+static void pnv_comp_attach_table_group(struct npu_comp *npucomp,
+		struct pnv_ioda_pe *pe)
+{
+	if (WARN_ON(npucomp->pe_num == NV_NPU_MAX_PE_NUM))
+		return;
+
+	npucomp->pe[npucomp->pe_num] = pe;
+	++npucomp->pe_num;
+}
+
+struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
+{
+	struct iommu_table_group *table_group;
+	struct npu *npu;
+	struct npu_comp *npucomp;
+	struct pci_dev *gpdev = NULL;
+	struct pci_controller *hose;
+	struct pci_dev *npdev;
+
+	list_for_each_entry(gpdev, &pe->pbus->devices, bus_list) {
+		npdev = pnv_pci_get_npu_dev(gpdev, 0);
+		if (npdev)
+			break;
+	}
+
+	if (!npdev)
+		/* It is not an NPU attached device, skip */
+		return NULL;
+
+	hose = pci_bus_to_host(gpdev->bus);
+	npu = npdev_to_npu(npdev);
+	if (npu) {
+		table_group = &npu->npucomp.table_group;
+
+		if (!table_group->group) {
+			table_group->ops = &pnv_npu_peers_ops;
+			iommu_register_group(table_group,
+					hose->global_number,
+					pe->pe_number);
+		}
+	} else {
+		/* Create a group for 1 GPU and attached NPUs */
+		pe->npucomp = kzalloc(sizeof(*pe->npucomp), GFP_KERNEL);
+		table_group = &pe->npucomp->table_group;
+		table_group->ops = &pnv_npu_peers_ops;
+		iommu_register_group(table_group, hose->global_number,
+				pe->pe_number);
+	}
+
+	/* Steal capabilities from a GPU PE */
+	table_group->max_dynamic_windows_supported =
+		pe->table_group.max_dynamic_windows_supported;
+	table_group->tce32_start = pe->table_group.tce32_start;
+	table_group->tce32_size = pe->table_group.tce32_size;
+	table_group->max_levels = pe->table_group.max_levels;
+	table_group->pgsizes = pe->table_group.pgsizes;
+
+	npucomp = container_of(table_group, struct npu_comp, table_group);
+	pnv_comp_attach_table_group(npucomp, pe);
+
+	return table_group;
+}
+
+struct iommu_table_group *pnv_npu_compound_attach(struct pnv_ioda_pe *pe)
+{
+	struct iommu_table_group *table_group;
+	struct npu_comp *npucomp;
+	struct pci_dev *gpdev = NULL;
+	struct pci_dev *npdev;
+	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(pe, &gpdev);
+
+	WARN_ON(!(pe->flags & PNV_IODA_PE_DEV));
+	if (!gpe)
+		return NULL;
+
+	/*
+	 * IODA2 bridges get this set up from
+	 * pci_controller_ops::setup_bridge but NPU bridges do not
+	 * have this hook defined so we do it here.
+	 */
+	pe->table_group.max_dynamic_windows_supported =
+		IOMMU_TABLE_GROUP_MAX_TABLES;
+	pe->table_group.ops = &pnv_pci_npu_ops;
+
+	table_group = iommu_group_get_iommudata(
+			iommu_group_get(&gpdev->dev));
+
+	npucomp = container_of(table_group, struct npu_comp, table_group);
+	pnv_comp_attach_table_group(npucomp, pe);
+
+	list_for_each_entry(npdev, &pe->phb->hose->bus->devices, bus_list) {
+		struct pci_dev *gpdevtmp = pnv_pci_get_gpu_dev(npdev);
+
+		if (gpdevtmp != gpdev)
+			continue;
+
+		iommu_add_device(table_group, &npdev->dev);
+	}
+
+	return table_group;
+}
+#endif /* CONFIG_IOMMU_API */
+
 /* Maximum number of nvlinks per npu */
 #define NV_MAX_LINKS 6
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 04639ae..0e8ada5 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -190,7 +190,8 @@ static void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
 	unsigned int pe_num = pe->pe_number;
 
 	WARN_ON(pe->pdev);
-
+	WARN_ON(pe->npucomp);
+	kfree(pe->npucomp);
 	memset(pe, 0, sizeof(struct pnv_ioda_pe));
 	clear_bit(pe_num, phb->ioda.pe_alloc);
 }
@@ -1269,7 +1270,8 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus)
 		pnv_ioda_setup_npu_PE(pdev);
 }
 
-static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe);
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
+		struct iommu_table_group *table_group, struct pci_bus *bus);
 
 static void pnv_pci_ioda_setup_PEs(void)
 {
@@ -1593,7 +1595,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
-		pnv_ioda_setup_bus_iommu_group(pe);
+		pnv_ioda_setup_bus_iommu_group(pe, &pe->table_group, NULL);
 	}
 }
 
@@ -2554,7 +2556,7 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
 #endif
 
 #ifdef CONFIG_IOMMU_API
-static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
 		__u64 window_size, __u32 levels)
 {
 	unsigned long bytes = 0;
@@ -2628,147 +2630,38 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
 	.release_ownership = pnv_ioda2_release_ownership,
 };
 
-static int gpe_table_group_to_npe_cb(struct device *dev, void *opaque)
-{
-	struct pci_controller *hose;
-	struct pnv_phb *phb;
-	struct pnv_ioda_pe **ptmppe = opaque;
-	struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
-	struct pci_dn *pdn = pci_get_pdn(pdev);
-
-	if (!pdn || pdn->pe_number == IODA_INVALID_PE)
-		return 0;
-
-	hose = pci_bus_to_host(pdev->bus);
-	phb = hose->private_data;
-	if (phb->type != PNV_PHB_NPU_NVLINK)
-		return 0;
-
-	*ptmppe = &phb->ioda.pe_array[pdn->pe_number];
-
-	return 1;
-}
-
-/*
- * This returns PE of associated NPU.
- * This assumes that NPU is in the same IOMMU group with GPU and there is
- * no other PEs.
- */
-static struct pnv_ioda_pe *gpe_table_group_to_npe(
-		struct iommu_table_group *table_group)
-{
-	struct pnv_ioda_pe *npe = NULL;
-	int ret = iommu_group_for_each_dev(table_group->group, &npe,
-			gpe_table_group_to_npe_cb);
-
-	BUG_ON(!ret || !npe);
-
-	return npe;
-}
-
-static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group,
-		int num, struct iommu_table *tbl)
-{
-	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
-	int num2 = (num == 0) ? 1 : 0;
-	long ret = pnv_pci_ioda2_set_window(table_group, num, tbl);
-
-	if (ret)
-		return ret;
-
-	if (table_group->tables[num2])
-		npe->table_group.ops->unset_window(&npe->table_group, num2);
-
-	ret = npe->table_group.ops->set_window(&npe->table_group, num, tbl);
-	if (ret) {
-		pnv_pci_ioda2_unset_window(table_group, num);
-		if (table_group->tables[num2])
-			npe->table_group.ops->set_window(&npe->table_group,
-					num2, table_group->tables[num2]);
-	}
-
-	return ret;
-}
-
-static long pnv_pci_ioda2_npu_unset_window(
-		struct iommu_table_group *table_group,
-		int num)
-{
-	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
-	int num2 = (num == 0) ? 1 : 0;
-	long ret = pnv_pci_ioda2_unset_window(table_group, num);
-
-	if (ret)
-		return ret;
-
-	if (!npe->table_group.tables[num])
-		return 0;
-
-	ret = npe->table_group.ops->unset_window(&npe->table_group, num);
-	if (ret)
-		return ret;
-
-	if (table_group->tables[num2])
-		ret = npe->table_group.ops->set_window(&npe->table_group, num2,
-				table_group->tables[num2]);
-
-	return ret;
-}
-
-static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
-{
-	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
-
-	npe->table_group.ops->take_ownership(&npe->table_group);
-	pnv_ioda2_take_ownership(table_group);
-}
-
-static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = {
-	.get_table_size = pnv_pci_ioda2_get_table_size,
-	.create_table = pnv_pci_ioda2_create_table_userspace,
-	.set_window = pnv_pci_ioda2_npu_set_window,
-	.unset_window = pnv_pci_ioda2_npu_unset_window,
-	.take_ownership = pnv_ioda2_npu_take_ownership,
-	.release_ownership = pnv_ioda2_release_ownership,
-};
-
 static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe *pe,
+		struct iommu_table_group *table_group,
 		struct pci_bus *bus)
 {
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
-		iommu_add_device(&pe->table_group, &dev->dev);
+		iommu_add_device(table_group, &dev->dev);
 
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
 			pnv_ioda_setup_bus_iommu_group_add_devices(pe,
-					dev->subordinate);
+					table_group, dev->subordinate);
 	}
 }
 
-static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
+		struct iommu_table_group *table_group, struct pci_bus *bus)
 {
-	if (!pnv_pci_ioda_pe_dma_weight(pe))
-		return;
 
-	iommu_register_group(&pe->table_group, pe->phb->hose->global_number,
-			pe->pe_number);
-
-	/*
-	 * set_iommu_table_base(&pe->pdev->dev, tbl) should have been called
-	 * by now
-	 */
 	if (pe->flags & PNV_IODA_PE_DEV)
-		iommu_add_device(&pe->table_group, &pe->pdev->dev);
-	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
-		pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
+		iommu_add_device(table_group, &pe->pdev->dev);
+
+	if ((pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) || bus)
+		pnv_ioda_setup_bus_iommu_group_add_devices(pe, table_group,
+				bus);
 }
 
 static void pnv_pci_ioda_setup_iommu_api(void)
 {
-	struct pci_controller *hose, *tmp;
+	struct pci_controller *hose;
 	struct pnv_phb *phb;
-	struct pnv_ioda_pe *pe, *gpe;
+	struct pnv_ioda_pe *pe;
 
 	/*
 	 * There are 4 types of PEs:
@@ -2790,29 +2683,41 @@ static void pnv_pci_ioda_setup_iommu_api(void)
 		if (phb->type == PNV_PHB_NPU_NVLINK)
 			continue;
 
-		list_for_each_entry(pe, &phb->ioda.pe_list, list)
-			pnv_ioda_setup_bus_iommu_group(pe);
+		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
+			struct iommu_table_group *table_group;
+
+			table_group = pnv_try_setup_npu_table_group(pe);
+			if (!table_group) {
+				if (!pnv_pci_ioda_pe_dma_weight(pe))
+					continue;
+
+				table_group = &pe->table_group;
+				iommu_register_group(&pe->table_group,
+						pe->phb->hose->global_number,
+						pe->pe_number);
+			}
+			pnv_ioda_setup_bus_iommu_group(pe, table_group,
+					pe->pbus);
+		}
 	}
 
 	/*
 	 * Now we have all PHBs discovered, time to add NPU devices to
 	 * the corresponding IOMMU groups.
 	 */
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
+	list_for_each_entry(hose, &hose_list, list_node) {
 		phb = hose->private_data;
 
 		if (phb->type != PNV_PHB_NPU_NVLINK)
 			continue;
 
-		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
-			gpe = pnv_pci_npu_setup_iommu(pe);
-			if (gpe)
-				gpe->table_group.ops = &pnv_pci_ioda2_npu_ops;
-		}
+		list_for_each_entry(pe, &phb->ioda.pe_list, list)
+			pnv_npu_compound_attach(pe);
 	}
 }
 #else /* !CONFIG_IOMMU_API */
-static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe) { }
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
+		struct iommu_table_group *table_group, struct pci_bus *bus){}
 static void pnv_pci_ioda_setup_iommu_api(void) { };
 #endif
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 18/22] powerpc/powernv/npu: Add compound IOMMU groups
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

At the moment powernv registers an IOMMU group for each PE. There is
an exception though - an NPU (an emulated PCI bridge representing
an NVLink); powernv attaches these bridges to the GPU IOMMU group,
which becomes the master.

Now we have POWER9 systems with GPUs connected to each other directly,
bypassing PCI. At the moment powernv does not control these links, so
it has to put such interconnected GPUs into the same IOMMU group, which
means that the old scheme with a GPU as the master won't work - there
will be up to 3 GPUs in such a group.

This introduces an npu_comp struct which represents a compound IOMMU
group made of multiple PEs. This converts the existing NVLink1 code to
use the new scheme. From now on, each PE must have a valid
iommu_table_group_ops which will either be called directly (a single-PE
group) or indirectly from a compound group.

This moves IOMMU group registration for NPU-connected GPUs to npu-dma.c.
For POWER8, this stores the new compound group pointer in the PE (so
a GPU is still the master); for POWER9 the new group pointer is stored
in the NPU.
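
A hedged sketch of the fan-out idea (condensed from
pnv_npu_peers_set_window in the diff below; error rollback shortened):
a compound group op is applied to every member PE's own table_group
ops, undoing the already-programmed PEs on failure:

	static long npu_comp_set_window(struct npu_comp *npucomp, int num,
			struct iommu_table *tbl)
	{
		long ret = 0;
		int i;

		for (i = 0; i < npucomp->pe_num; ++i) {
			struct pnv_ioda_pe *pe = npucomp->pe[i];

			ret = pe->table_group.ops->set_window(
					&pe->table_group, num, tbl);
			if (ret)
				break;
		}

		if (ret)
			while (i--)	/* roll back the PEs already set */
				npucomp->pe[i]->table_group.ops->unset_window(
						&npucomp->pe[i]->table_group,
						num);

		return ret;
	}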

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/pci.h            |   1 +
 arch/powerpc/platforms/powernv/pci.h      |   7 +
 arch/powerpc/platforms/powernv/npu-dma.c  | 286 ++++++++++++++++++++--
 arch/powerpc/platforms/powernv/pci-ioda.c | 173 +++----------
 4 files changed, 308 insertions(+), 159 deletions(-)

diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index baf2886..0c72f18 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -132,5 +132,6 @@ extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index);
 extern int pnv_npu2_init(struct pci_controller *hose);
 extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
 		unsigned long msr);
+extern int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev);
 
 #endif /* __ASM_POWERPC_PCI_H */
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index cf9f748..aef4bb5 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -62,6 +62,7 @@ struct pnv_ioda_pe {
 
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
 	struct iommu_table_group table_group;
+	struct npu_comp		*npucomp;
 
 	/* 64-bit TCE bypass region */
 	bool			tce_bypass_enabled;
@@ -201,6 +202,8 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
 extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
 extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
+extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+		__u64 window_size, __u32 levels);
 extern int pnv_eeh_post_init(void);
 
 extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
@@ -216,6 +219,10 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
 extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
+extern struct iommu_table_group *pnv_try_setup_npu_table_group(
+		struct pnv_ioda_pe *pe);
+extern struct iommu_table_group *pnv_npu_compound_attach(
+		struct pnv_ioda_pe *pe);
 
 /* pci-ioda-tce.c */
 #define POWERNV_IOMMU_DEFAULT_LEVELS	1
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 1792c7e..2231f4c 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -317,31 +317,6 @@ static struct iommu_table_group_ops pnv_pci_npu_ops = {
 	.unset_window = pnv_npu_unset_window,
 	.take_ownership = pnv_npu_take_ownership,
 };
-
-struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
-{
-	struct pnv_phb *phb = npe->phb;
-	struct pci_bus *pbus = phb->hose->bus;
-	struct pci_dev *npdev, *gpdev = NULL, *gptmp;
-	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
-
-	if (!gpe || !gpdev)
-		return NULL;
-
-	npe->table_group.ops = &pnv_pci_npu_ops;
-
-	list_for_each_entry(npdev, &pbus->devices, bus_list) {
-		gptmp = pnv_pci_get_gpu_dev(npdev);
-
-		if (gptmp != gpdev)
-			continue;
-
-		pe_info(gpe, "Attached NPU %s\n", dev_name(&npdev->dev));
-		iommu_group_add_device(gpe->table_group.group, &npdev->dev);
-	}
-
-	return gpe;
-}
 #endif /* !CONFIG_IOMMU_API */
 
 /*
@@ -349,6 +324,17 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
  */
 /* Maximum possible number of ATSD MMIO registers per NPU */
 #define NV_NMMU_ATSD_REGS 8
+#define NV_NPU_MAX_PE_NUM	16
+
+/*
+ * A compound NPU IOMMU group which might consist of 1 GPU + 2xNPUs (POWER8) or
+ * up to 3 x (GPU + 2xNPUs) (POWER9).
+ */
+struct npu_comp {
+	struct iommu_table_group table_group;
+	int pe_num;
+	struct pnv_ioda_pe *pe[NV_NPU_MAX_PE_NUM];
+};
 
 /* An NPU descriptor, valid for POWER9 only */
 struct npu {
@@ -365,6 +351,8 @@ struct npu {
 	struct list_head next;
 
 	struct pci_controller *hose;
+
+	struct npu_comp npucomp;
 };
 
 static LIST_HEAD(npu2_devices);
@@ -382,6 +370,254 @@ static struct npu *npdev_to_npu(struct pci_dev *npdev)
 	return NULL;
 }
 
+#ifdef CONFIG_IOMMU_API
+static long pnv_npu_peers_create_table_userspace(
+		struct iommu_table_group *table_group,
+		int num, __u32 page_shift, __u64 window_size, __u32 levels,
+		struct iommu_table **ptbl)
+{
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	if (!npucomp->pe_num || !npucomp->pe[0] ||
+			!npucomp->pe[0]->table_group.ops ||
+			!npucomp->pe[0]->table_group.ops->create_table)
+		return -EFAULT;
+
+	return npucomp->pe[0]->table_group.ops->create_table(
+			&npucomp->pe[0]->table_group, num, page_shift,
+			window_size, levels, ptbl);
+}
+
+static long pnv_npu_peers_set_window(struct iommu_table_group *table_group,
+		int num, struct iommu_table *tbl)
+{
+	int i, j;
+	long ret = 0;
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	for (i = 0; i < npucomp->pe_num; ++i) {
+		struct pnv_ioda_pe *pe = npucomp->pe[i];
+
+		if (!pe->table_group.ops->set_window)
+			continue;
+
+		ret = pe->table_group.ops->set_window(&pe->table_group,
+				num, tbl);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		for (j = 0; j < i; ++j) {
+			struct pnv_ioda_pe *pe = npucomp->pe[j];
+
+			if (!pe->table_group.ops->unset_window)
+				continue;
+
+			ret = pe->table_group.ops->unset_window(
+					&pe->table_group, num);
+			if (ret)
+				break;
+		}
+	} else {
+		table_group->tables[num] = iommu_tce_table_get(tbl);
+	}
+
+	return ret;
+}
+
+static long pnv_npu_peers_unset_window(struct iommu_table_group *table_group,
+		int num)
+{
+	int i, j;
+	long ret = 0;
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	for (i = 0; i < npucomp->pe_num; ++i) {
+		struct pnv_ioda_pe *pe = npucomp->pe[i];
+
+		WARN_ON(npucomp->table_group.tables[num] !=
+				table_group->tables[num]);
+		if (!npucomp->table_group.tables[num])
+			continue;
+
+		if (!pe->table_group.ops->unset_window)
+			continue;
+
+		ret = pe->table_group.ops->unset_window(&pe->table_group, num);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		for (j = 0; j < i; ++j) {
+			struct pnv_ioda_pe *pe = npucomp->pe[j];
+
+			if (!npucomp->table_group.tables[num])
+				continue;
+
+			if (!pe->table_group.ops->set_window)
+				continue;
+
+			ret = pe->table_group.ops->set_window(&pe->table_group,
+					num, table_group->tables[num]);
+			if (ret)
+				break;
+		}
+	} else if (table_group->tables[num]) {
+		iommu_tce_table_put(table_group->tables[num]);
+		table_group->tables[num] = NULL;
+	}
+
+	return ret;
+}
+
+static void pnv_npu_peers_take_ownership(struct iommu_table_group *table_group)
+{
+	int i;
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	for (i = 0; i < npucomp->pe_num; ++i) {
+		struct pnv_ioda_pe *pe = npucomp->pe[i];
+
+		if (!pe->table_group.ops->take_ownership)
+			continue;
+		pe->table_group.ops->take_ownership(&pe->table_group);
+	}
+}
+
+static void pnv_npu_peers_release_ownership(
+		struct iommu_table_group *table_group)
+{
+	int i;
+	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
+			table_group);
+
+	for (i = 0; i < npucomp->pe_num; ++i) {
+		struct pnv_ioda_pe *pe = npucomp->pe[i];
+
+		if (!pe->table_group.ops->release_ownership)
+			continue;
+		pe->table_group.ops->release_ownership(&pe->table_group);
+	}
+}
+
+static struct iommu_table_group_ops pnv_npu_peers_ops = {
+	.get_table_size = pnv_pci_ioda2_get_table_size,
+	.create_table = pnv_npu_peers_create_table_userspace,
+	.set_window = pnv_npu_peers_set_window,
+	.unset_window = pnv_npu_peers_unset_window,
+	.take_ownership = pnv_npu_peers_take_ownership,
+	.release_ownership = pnv_npu_peers_release_ownership,
+};
+
+static void pnv_comp_attach_table_group(struct npu_comp *npucomp,
+		struct pnv_ioda_pe *pe)
+{
+	if (WARN_ON(npucomp->pe_num == NV_NPU_MAX_PE_NUM))
+		return;
+
+	npucomp->pe[npucomp->pe_num] = pe;
+	++npucomp->pe_num;
+}
+
+struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
+{
+	struct iommu_table_group *table_group;
+	struct npu *npu;
+	struct npu_comp *npucomp;
+	struct pci_dev *gpdev = NULL;
+	struct pci_controller *hose;
+	struct pci_dev *npdev;
+
+	list_for_each_entry(gpdev, &pe->pbus->devices, bus_list) {
+		npdev = pnv_pci_get_npu_dev(gpdev, 0);
+		if (npdev)
+			break;
+	}
+
+	if (!npdev)
+		/* It is not an NPU attached device, skip */
+		return NULL;
+
+	hose = pci_bus_to_host(gpdev->bus);
+	npu = npdev_to_npu(npdev);
+	if (npu) {
+		table_group = &npu->npucomp.table_group;
+
+		if (!table_group->group) {
+			table_group->ops = &pnv_npu_peers_ops;
+			iommu_register_group(table_group,
+					hose->global_number,
+					pe->pe_number);
+		}
+	} else {
+		/* Create a group for 1 GPU and attached NPUs */
+		pe->npucomp = kzalloc(sizeof(*pe->npucomp), GFP_KERNEL);
+		table_group = &pe->npucomp->table_group;
+		table_group->ops = &pnv_npu_peers_ops;
+		iommu_register_group(table_group, hose->global_number,
+				pe->pe_number);
+	}
+
+	/* Steal capabilities from a GPU PE */
+	table_group->max_dynamic_windows_supported =
+		pe->table_group.max_dynamic_windows_supported;
+	table_group->tce32_start = pe->table_group.tce32_start;
+	table_group->tce32_size = pe->table_group.tce32_size;
+	table_group->max_levels = pe->table_group.max_levels;
+	table_group->pgsizes = pe->table_group.pgsizes;
+
+	npucomp = container_of(table_group, struct npu_comp, table_group);
+	pnv_comp_attach_table_group(npucomp, pe);
+
+	return table_group;
+}
+
+struct iommu_table_group *pnv_npu_compound_attach(struct pnv_ioda_pe *pe)
+{
+	struct iommu_table_group *table_group;
+	struct npu_comp *npucomp;
+	struct pci_dev *gpdev = NULL;
+	struct pci_dev *npdev;
+	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(pe, &gpdev);
+
+	WARN_ON(!(pe->flags & PNV_IODA_PE_DEV));
+	if (!gpe)
+		return NULL;
+
+	/*
+	 * IODA2 bridges get this set up from
+	 * pci_controller_ops::setup_bridge but NPU bridges do not
+	 * have this hook defined so we do it here.
+	 */
+	pe->table_group.max_dynamic_windows_supported =
+		IOMMU_TABLE_GROUP_MAX_TABLES;
+	pe->table_group.ops = &pnv_pci_npu_ops;
+
+	table_group = iommu_group_get_iommudata(
+			iommu_group_get(&gpdev->dev));
+
+	npucomp = container_of(table_group, struct npu_comp, table_group);
+	pnv_comp_attach_table_group(npucomp, pe);
+
+	list_for_each_entry(npdev, &pe->phb->hose->bus->devices, bus_list) {
+		struct pci_dev *gpdevtmp = pnv_pci_get_gpu_dev(npdev);
+
+		if (gpdevtmp != gpdev)
+			continue;
+
+		iommu_add_device(table_group, &npdev->dev);
+	}
+
+	return table_group;
+}
+#endif /* CONFIG_IOMMU_API */
+
 /* Maximum number of nvlinks per npu */
 #define NV_MAX_LINKS 6
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 04639ae..0e8ada5 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -190,7 +190,8 @@ static void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
 	unsigned int pe_num = pe->pe_number;
 
 	WARN_ON(pe->pdev);
-
+	WARN_ON(pe->npucomp);
+	kfree(pe->npucomp);
 	memset(pe, 0, sizeof(struct pnv_ioda_pe));
 	clear_bit(pe_num, phb->ioda.pe_alloc);
 }
@@ -1269,7 +1270,8 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus)
 		pnv_ioda_setup_npu_PE(pdev);
 }
 
-static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe);
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
+		struct iommu_table_group *table_group, struct pci_bus *bus);
 
 static void pnv_pci_ioda_setup_PEs(void)
 {
@@ -1593,7 +1595,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
-		pnv_ioda_setup_bus_iommu_group(pe);
+		pnv_ioda_setup_bus_iommu_group(pe, &pe->table_group, NULL);
 	}
 }
 
@@ -2554,7 +2556,7 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
 #endif
 
 #ifdef CONFIG_IOMMU_API
-static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
 		__u64 window_size, __u32 levels)
 {
 	unsigned long bytes = 0;
@@ -2628,147 +2630,38 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
 	.release_ownership = pnv_ioda2_release_ownership,
 };
 
-static int gpe_table_group_to_npe_cb(struct device *dev, void *opaque)
-{
-	struct pci_controller *hose;
-	struct pnv_phb *phb;
-	struct pnv_ioda_pe **ptmppe = opaque;
-	struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
-	struct pci_dn *pdn = pci_get_pdn(pdev);
-
-	if (!pdn || pdn->pe_number == IODA_INVALID_PE)
-		return 0;
-
-	hose = pci_bus_to_host(pdev->bus);
-	phb = hose->private_data;
-	if (phb->type != PNV_PHB_NPU_NVLINK)
-		return 0;
-
-	*ptmppe = &phb->ioda.pe_array[pdn->pe_number];
-
-	return 1;
-}
-
-/*
- * This returns PE of associated NPU.
- * This assumes that NPU is in the same IOMMU group with GPU and there is
- * no other PEs.
- */
-static struct pnv_ioda_pe *gpe_table_group_to_npe(
-		struct iommu_table_group *table_group)
-{
-	struct pnv_ioda_pe *npe = NULL;
-	int ret = iommu_group_for_each_dev(table_group->group, &npe,
-			gpe_table_group_to_npe_cb);
-
-	BUG_ON(!ret || !npe);
-
-	return npe;
-}
-
-static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group,
-		int num, struct iommu_table *tbl)
-{
-	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
-	int num2 = (num == 0) ? 1 : 0;
-	long ret = pnv_pci_ioda2_set_window(table_group, num, tbl);
-
-	if (ret)
-		return ret;
-
-	if (table_group->tables[num2])
-		npe->table_group.ops->unset_window(&npe->table_group, num2);
-
-	ret = npe->table_group.ops->set_window(&npe->table_group, num, tbl);
-	if (ret) {
-		pnv_pci_ioda2_unset_window(table_group, num);
-		if (table_group->tables[num2])
-			npe->table_group.ops->set_window(&npe->table_group,
-					num2, table_group->tables[num2]);
-	}
-
-	return ret;
-}
-
-static long pnv_pci_ioda2_npu_unset_window(
-		struct iommu_table_group *table_group,
-		int num)
-{
-	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
-	int num2 = (num == 0) ? 1 : 0;
-	long ret = pnv_pci_ioda2_unset_window(table_group, num);
-
-	if (ret)
-		return ret;
-
-	if (!npe->table_group.tables[num])
-		return 0;
-
-	ret = npe->table_group.ops->unset_window(&npe->table_group, num);
-	if (ret)
-		return ret;
-
-	if (table_group->tables[num2])
-		ret = npe->table_group.ops->set_window(&npe->table_group, num2,
-				table_group->tables[num2]);
-
-	return ret;
-}
-
-static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
-{
-	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
-
-	npe->table_group.ops->take_ownership(&npe->table_group);
-	pnv_ioda2_take_ownership(table_group);
-}
-
-static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = {
-	.get_table_size = pnv_pci_ioda2_get_table_size,
-	.create_table = pnv_pci_ioda2_create_table_userspace,
-	.set_window = pnv_pci_ioda2_npu_set_window,
-	.unset_window = pnv_pci_ioda2_npu_unset_window,
-	.take_ownership = pnv_ioda2_npu_take_ownership,
-	.release_ownership = pnv_ioda2_release_ownership,
-};
-
 static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe *pe,
+		struct iommu_table_group *table_group,
 		struct pci_bus *bus)
 {
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
-		iommu_add_device(&pe->table_group, &dev->dev);
+		iommu_add_device(table_group, &dev->dev);
 
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
 			pnv_ioda_setup_bus_iommu_group_add_devices(pe,
-					dev->subordinate);
+					table_group, dev->subordinate);
 	}
 }
 
-static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
+		struct iommu_table_group *table_group, struct pci_bus *bus)
 {
-	if (!pnv_pci_ioda_pe_dma_weight(pe))
-		return;
 
-	iommu_register_group(&pe->table_group, pe->phb->hose->global_number,
-			pe->pe_number);
-
-	/*
-	 * set_iommu_table_base(&pe->pdev->dev, tbl) should have been called
-	 * by now
-	 */
 	if (pe->flags & PNV_IODA_PE_DEV)
-		iommu_add_device(&pe->table_group, &pe->pdev->dev);
-	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
-		pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
+		iommu_add_device(table_group, &pe->pdev->dev);
+
+	if ((pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) || bus)
+		pnv_ioda_setup_bus_iommu_group_add_devices(pe, table_group,
+				bus);
 }
 
 static void pnv_pci_ioda_setup_iommu_api(void)
 {
-	struct pci_controller *hose, *tmp;
+	struct pci_controller *hose;
 	struct pnv_phb *phb;
-	struct pnv_ioda_pe *pe, *gpe;
+	struct pnv_ioda_pe *pe;
 
 	/*
 	 * There are 4 types of PEs:
@@ -2790,29 +2683,41 @@ static void pnv_pci_ioda_setup_iommu_api(void)
 		if (phb->type == PNV_PHB_NPU_NVLINK)
 			continue;
 
-		list_for_each_entry(pe, &phb->ioda.pe_list, list)
-			pnv_ioda_setup_bus_iommu_group(pe);
+		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
+			struct iommu_table_group *table_group;
+
+			table_group = pnv_try_setup_npu_table_group(pe);
+			if (!table_group) {
+				if (!pnv_pci_ioda_pe_dma_weight(pe))
+					continue;
+
+				table_group = &pe->table_group;
+				iommu_register_group(&pe->table_group,
+						pe->phb->hose->global_number,
+						pe->pe_number);
+			}
+			pnv_ioda_setup_bus_iommu_group(pe, table_group,
+					pe->pbus);
+		}
 	}
 
 	/*
 	 * Now we have all PHBs discovered, time to add NPU devices to
 	 * the corresponding IOMMU groups.
 	 */
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
+	list_for_each_entry(hose, &hose_list, list_node) {
 		phb = hose->private_data;
 
 		if (phb->type != PNV_PHB_NPU_NVLINK)
 			continue;
 
-		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
-			gpe = pnv_pci_npu_setup_iommu(pe);
-			if (gpe)
-				gpe->table_group.ops = &pnv_pci_ioda2_npu_ops;
-		}
+		list_for_each_entry(pe, &phb->ioda.pe_list, list)
+			pnv_npu_compound_attach(pe);
 	}
 }
 #else /* !CONFIG_IOMMU_API */
-static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe) { }
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
+		struct iommu_table_group *table_group, struct pci_bus *bus){}
 static void pnv_pci_ioda_setup_iommu_api(void) { };
 #endif
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 19/22] powerpc/powernv/npu: Add release_ownership hook
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

In order to make ATS work and translate addresses for an arbitrary
LPID and PID, we need to program the NPU with the LPID and allow PID
wildcard matching with a specific MSR mask.

This implements a helper to assign a GPU to an LPAR and program the NPU
with a PID wildcard, and a helper to undo that. The first helper takes
an MSR value (only the DR/HV/PR/SF bits are allowed) and programs it
into the NPU2 to support ATS checkout requests.

This exports pnv_npu2_unmap_lpar_dev() as the following patches will
use it from the VFIO driver.
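
As a sketch of the intended caller (condensed from the VFIO NVLink2
subdriver in patch 22/22; notifier registration and error handling are
trimmed), the VFIO side maps the GPU to the guest LPID when KVM is
attached to the IOMMU group and hands the GPU back to the host on
release:

static int nvgpu_group_notifier(struct notifier_block *nb,
		unsigned long action, void *opaque)
{
	struct kvm *kvm = opaque;
	struct vfio_pci_nvgpu_data *data = container_of(nb,
			struct vfio_pci_nvgpu_data, group_notifier);

	/* Wire the GPU to the guest's partition when KVM is attached */
	if (action == VFIO_GROUP_NOTIFY_SET_KVM && kvm &&
			pnv_npu2_map_lpar_dev(data->gpdev,
				kvm->arch.lpid, MSR_DR | MSR_PR))
		return NOTIFY_BAD;

	return NOTIFY_OK;
}

/* ... and on region release the GPU is handed back to the host: */
pnv_npu2_unmap_lpar_dev(data->gpdev);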

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 47 ++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 2231f4c..48adaa5 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -289,6 +289,7 @@ static void pnv_npu_take_ownership(struct iommu_table_group *table_group)
 			table_group);
 	struct pnv_phb *phb = npe->phb;
 	int64_t rc;
+	struct pci_dev *gpdev = NULL;
 
 	/*
 	 * Note: NPU has just a single TVE in the hardware which means that
@@ -310,12 +311,28 @@ static void pnv_npu_take_ownership(struct iommu_table_group *table_group)
 		return;
 	}
 	pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
+
+	get_gpu_pci_dev_and_pe(npe, &gpdev);
+	if (gpdev)
+		pnv_npu2_unmap_lpar_dev(gpdev);
+}
+
+static void pnv_npu_release_ownership(struct iommu_table_group *table_group)
+{
+	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+			table_group);
+	struct pci_dev *gpdev = NULL;
+
+	get_gpu_pci_dev_and_pe(npe, &gpdev);
+	if (gpdev)
+		pnv_npu2_map_lpar_dev(gpdev, 0, MSR_DR | MSR_PR | MSR_HV);
 }
 
 static struct iommu_table_group_ops pnv_pci_npu_ops = {
 	.set_window = pnv_npu_set_window,
 	.unset_window = pnv_npu_unset_window,
 	.take_ownership = pnv_npu_take_ownership,
+	.release_ownership = pnv_npu_release_ownership,
 };
 #endif /* !CONFIG_IOMMU_API */
 
@@ -1239,3 +1256,33 @@ void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned long msr)
 					ret);
 	}
 }
+
+int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev)
+{
+	int ret;
+	struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);
+	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
+	struct pnv_phb *nphb = hose->private_data;
+
+	dev_dbg(&gpdev->dev, "destroy context opalid=%llu\n",
+			nphb->opal_id);
+	ret = opal_npu_destroy_context(nphb->opal_id, 0/*__unused*/,
+			PCI_DEVID(gpdev->bus->number, gpdev->devfn));
+	if (ret < 0) {
+		dev_err(&gpdev->dev, "Failed to destroy context: %d\n", ret);
+		return ret;
+	}
+
+	/* Set LPID to 0 anyway, just to be safe */
+	dev_dbg(&gpdev->dev, "Map LPAR opalid=%llu lparid=0\n", nphb->opal_id);
+	ret = opal_npu_map_lpar(nphb->opal_id,
+			PCI_DEVID(gpdev->bus->number, gpdev->devfn), 0 /*LPID*/,
+			0 /* LPCR bits */);
+	if (ret)
+		dev_err(&gpdev->dev, "Error %d mapping device to LPAR\n", ret);
+
+	opal_purge_cache();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pnv_npu2_unmap_lpar_dev);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 20/22] vfio_pci: Allow mapping extra regions
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

So far we have only allowed mapping of MMIO BARs to userspace. However
there are GPUs with on-board coherent RAM accessible via side
channels which we also want to map to userspace. The first client
for this is the NVIDIA V100 GPU with NVLink2 direct links to a POWER9
NPU-enabled CPU; such GPUs have 16GB of RAM which is coherently mapped
into the system address space, and we are going to export it as an
extra PCI region.

We already support extra PCI regions; this adds support for mapping
them to userspace.
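
For illustration, a subdriver's mmap hook ends up looking roughly like
this (a condensed sketch modelled on the NPU2 ATSD region added later
in this series; "my_data" with its "phys_base" field and the
"my_region_rw" handler are placeholders for the subdriver's own state
and rw callback):

static int my_region_mmap(struct vfio_pci_device *vdev,
		struct vfio_pci_region *region, struct vm_area_struct *vma)
{
	struct my_data *data = region->data;
	unsigned long req_len = vma->vm_end - vma->vm_start;

	if (req_len != PAGE_SIZE)
		return -EINVAL;

	vma->vm_flags |= VM_PFNMAP;
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	/* Back the VMA with the region's physical window */
	return remap_pfn_range(vma, vma->vm_start,
			data->phys_base >> PAGE_SHIFT,
			req_len, vma->vm_page_prot);
}

static const struct vfio_pci_regops my_regops = {
	.rw	= my_region_rw,
	.mmap	= my_region_mmap,
};

Note the hook is only called for regions registered with
VFIO_REGION_INFO_FLAG_MMAP set.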

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
Changes:
v2:
* reverted one of mistakenly removed error checks
---
 drivers/vfio/pci/vfio_pci_private.h | 3 +++
 drivers/vfio/pci/vfio_pci.c         | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index cde3b5d..86aab05 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -59,6 +59,9 @@ struct vfio_pci_regops {
 		      size_t count, loff_t *ppos, bool iswrite);
 	void	(*release)(struct vfio_pci_device *vdev,
 			   struct vfio_pci_region *region);
+	int	(*mmap)(struct vfio_pci_device *vdev,
+			struct vfio_pci_region *region,
+			struct vm_area_struct *vma);
 };
 
 struct vfio_pci_region {
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index fef5002..4a6f7c0 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1130,6 +1130,15 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 		return -EINVAL;
 	if ((vma->vm_flags & VM_SHARED) == 0)
 		return -EINVAL;
+	if (index >= VFIO_PCI_NUM_REGIONS) {
+		int regnum = index - VFIO_PCI_NUM_REGIONS;
+		struct vfio_pci_region *region = vdev->region + regnum;
+
+		if (region && region->ops && region->ops->mmap &&
+		    (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
+			return region->ops->mmap(vdev, region, vma);
+		return -EINVAL;
+	}
 	if (index >= VFIO_PCI_ROM_REGION_INDEX)
 		return -EINVAL;
 	if (!vdev->bar_mmap_supported[index])
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 21/22] vfio_pci: Allow regions to add own capabilities
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

VFIO regions already support region capabilities with a limited set of
fields. However a subdriver might have to report additional bits to
userspace.

This adds an add_capability() hook to vfio_pci_regops.
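
A minimal implementation of the hook (condensed from the last patch in
this series, which uses it to expose an NVLink2 "tgt" address; the
"my_tgt" value stands for whatever payload the subdriver reports) just
fills a capability header and chains it with vfio_info_add_capability():

static int my_region_add_capability(struct vfio_pci_device *vdev,
		struct vfio_pci_region *region, struct vfio_info_cap *caps)
{
	struct vfio_region_info_cap_npu2 cap;

	cap.header.id = VFIO_REGION_INFO_CAP_NPU2;
	cap.header.version = 1;
	cap.tgt = my_tgt;	/* subdriver-specific payload */

	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
}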

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* removed confusing rationale for the patch, the next patch makes
use of it anyway
---
 drivers/vfio/pci/vfio_pci_private.h | 3 +++
 drivers/vfio/pci/vfio_pci.c         | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 86aab05..93c1738 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -62,6 +62,9 @@ struct vfio_pci_regops {
 	int	(*mmap)(struct vfio_pci_device *vdev,
 			struct vfio_pci_region *region,
 			struct vm_area_struct *vma);
+	int	(*add_capability)(struct vfio_pci_device *vdev,
+				  struct vfio_pci_region *region,
+				  struct vfio_info_cap *caps);
 };
 
 struct vfio_pci_region {
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 4a6f7c0..6cb70cf 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -763,6 +763,12 @@ static long vfio_pci_ioctl(void *device_data,
 			if (ret)
 				return ret;
 
+			if (vdev->region[i].ops->add_capability) {
+				ret = vdev->region[i].ops->add_capability(vdev,
+						&vdev->region[i], &caps);
+				if (ret)
+					return ret;
+			}
 		}
 		}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 22/22] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
  2018-11-13  8:28 ` Alexey Kardashevskiy
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
pluggable PCIe devices but implement PCIe links for config space and MMIO.
In addition to that, the GPUs are interconnected to each other and also
have direct links to the P9 CPU. The links are NVLink2 and provide direct
access to the system RAM for GPUs via the NPU (an NVLink2 "proxy" on the
P9 chip). These systems also support ATS (address translation services)
which is a part of the NVLink2 protocol. Such GPUs also share their
on-board RAM (16GB in the tested config) with the system via the same
NVLink2, so the CPU has cache-coherent access to the GPU RAM.

This exports GPU RAM to userspace as a new PCI region. It
preregisters the new memory as device memory as it might be used for
DMA. The pfns are inserted from the fault handler because the GPU
memory is not onlined until the NVIDIA driver is loaded and has
trained the links; doing this earlier produces low-level errors which
we fence in the firmware so they do not hurt the host system, but it
is still better to avoid them.

This exports the ATSD (Address Translation Shootdown) register of the
NPU, which allows the guest to invalidate the TLB. The register
conveniently occupies a single 64k page. Since the NPU maps the GPU
memory, it has a "tgt" property (an abbreviated host system bus
address) which tells the GPU its own system address. This exports
"tgt" as a capability to let the guest driver assemble the routing
information so each GPU knows how to get directly to the other GPUs.
This also adds the "tgt" capability to a GPU to allow userspace to
find out which NVLinks correspond to a specific GPU.

For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
know the LPID (a logical partition ID, or in other words a KVM guest
hardware ID) and the PID (a memory context ID of a userspace process,
not to be confused with a Linux pid). This assigns a GPU to an LPID in
the NPU, and this is why this adds a listener for KVM on an IOMMU
group. A PID comes via NVLink from a GPU, and the NPU uses a PID
wildcard to pass it through.

This requires coherent memory and ATSD to be available on the host as
the GPU vendor only supports configurations with both features enabled
and other configurations are known not to work. Because of this, and
because of the way the features are advertised to the host system
(a device tree with very platform specific properties), this requires
the POWERNV platform to be enabled.

This hardcodes the NVLink2 support for specific vendor and device IDs
as there is no reliable way of knowing about coherent memory and ATS
support. The GPU has a unique vendor PCIe capability 0x23 but it was
confirmed that it does not provide the required information (and it is
still undisclosed what it actually does).
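
To sketch how userspace is expected to consume the capability (a
hypothetical snippet, not part of this series; device_fd, idx and
use_tgt() are stand-ins and error handling is omitted), the region
info is queried twice to size the capability chain and the chain is
then walked looking for VFIO_REGION_INFO_CAP_NPU2:

struct vfio_region_info tmp = { .argsz = sizeof(tmp), .index = idx };
struct vfio_region_info *info;
struct vfio_info_cap_header *hdr;

ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &tmp); /* learn real argsz */
info = calloc(1, tmp.argsz);
info->argsz = tmp.argsz;
info->index = idx;
ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info);

if (info->flags & VFIO_REGION_INFO_FLAG_CAPS) {
	/* cap_offset and hdr->next are offsets from the start of info */
	for (hdr = (void *)info + info->cap_offset; ;
			hdr = (void *)info + hdr->next) {
		if (hdr->id == VFIO_REGION_INFO_CAP_NPU2) {
			struct vfio_region_info_cap_npu2 *npu2 = (void *)hdr;

			use_tgt(npu2->tgt); /* GPU RAM system bus address */
			break;
		}
		if (!hdr->next)
			break;
	}
}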

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* reworded the commit log about tgt
* added tracepoints (do we want them enabled for entire vfio-pci?)
* added code comments
* added write|mmap flags to the new regions
* auto enabled VFIO_PCI_NVLINK2 config option
* added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
references; these are required by the NVIDIA driver
* keep notifier registered only for short time
---
 drivers/vfio/pci/Makefile           |   1 +
 drivers/vfio/pci/trace.h            | 102 +++++++
 drivers/vfio/pci/vfio_pci_private.h |   2 +
 include/uapi/linux/vfio.h           |  26 ++
 drivers/vfio/pci/vfio_pci.c         |  39 ++-
 drivers/vfio/pci/vfio_pci_nvlink2.c | 433 ++++++++++++++++++++++++++++
 drivers/vfio/pci/Kconfig            |   6 +
 7 files changed, 607 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/pci/trace.h
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 76d8ec0..9662c06 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,5 +1,6 @@
 
 vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/trace.h
new file mode 100644
index 0000000..b80d2d3
--- /dev/null
+++ b/drivers/vfio/pci/trace.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * VFIO PCI mmap/mmap_fault tracepoints
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfio_pci
+
+#if !defined(_TRACE_VFIO_PCI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFIO_PCI_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(vfio_pci_nvgpu_mmap_fault,
+	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
+			vm_fault_t ret),
+	TP_ARGS(pdev, hpa, ua, ret),
+
+	TP_STRUCT__entry(
+		__field(const char *, name)
+		__field(unsigned long, hpa)
+		__field(unsigned long, ua)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->name = dev_name(&pdev->dev),
+		__entry->hpa = hpa;
+		__entry->ua = ua;
+		__entry->ret = ret;
+	),
+
+	TP_printk("%s: %lx -> %lx ret=%d", __entry->name, __entry->hpa,
+			__entry->ua, __entry->ret)
+);
+
+TRACE_EVENT(vfio_pci_nvgpu_mmap,
+	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
+			unsigned long size, int ret),
+	TP_ARGS(pdev, hpa, ua, size, ret),
+
+	TP_STRUCT__entry(
+		__field(const char *, name)
+		__field(unsigned long, hpa)
+		__field(unsigned long, ua)
+		__field(unsigned long, size)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->name = dev_name(&pdev->dev),
+		__entry->hpa = hpa;
+		__entry->ua = ua;
+		__entry->size = size;
+		__entry->ret = ret;
+	),
+
+	TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
+			__entry->ua, __entry->size, __entry->ret)
+);
+
+TRACE_EVENT(vfio_pci_npu2_mmap,
+	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
+			unsigned long size, int ret),
+	TP_ARGS(pdev, hpa, ua, size, ret),
+
+	TP_STRUCT__entry(
+		__field(const char *, name)
+		__field(unsigned long, hpa)
+		__field(unsigned long, ua)
+		__field(unsigned long, size)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->name = dev_name(&pdev->dev),
+		__entry->hpa = hpa;
+		__entry->ua = ua;
+		__entry->size = size;
+		__entry->ret = ret;
+	),
+
+	TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
+			__entry->ua, __entry->size, __entry->ret)
+);
+
+#endif /* _TRACE_VFIO_PCI_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 93c1738..7639241 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -163,4 +163,6 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
 	return -ENODEV;
 }
 #endif
+extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
+extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8131028..53a4061 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -353,6 +353,20 @@ struct vfio_region_gfx_edid {
 #define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
 };
 
+/* 10de vendor sub-type
+ *
+ * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
+ */
+#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM	(1)
+
+/*
+ * 1014 vendor sub-type
+ *
+ * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
+ * to do TLB invalidation on a GPU.
+ */
+#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
@@ -363,6 +377,18 @@ struct vfio_region_gfx_edid {
  */
 #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
 
+/*
+ * Capability with compressed real address (aka SSA - small system address)
+ * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing.
+ */
+#define VFIO_REGION_INFO_CAP_NPU2		4
+
+struct vfio_region_info_cap_npu2 {
+	struct vfio_info_cap_header header;
+	__u64 tgt;
+	/* size is defined in VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM */
+};
+
 /**
  * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
  *				    struct vfio_irq_info)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6cb70cf..f072d8e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -224,6 +224,16 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
+int __weak vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+
+int __weak vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+
 static int vfio_pci_enable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -302,14 +312,39 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		if (ret) {
 			dev_warn(&vdev->pdev->dev,
 				 "Failed to setup Intel IGD regions\n");
-			vfio_pci_disable(vdev);
-			return ret;
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    pdev->device == 0x1db1 &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
+		if (ret) {
+			dev_warn(&vdev->pdev->dev,
+				 "Failed to setup NVIDIA NV2 RAM region\n");
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
+	    pdev->device == 0x04ea &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_ibm_npu2_init(vdev);
+		if (ret) {
+			dev_warn(&vdev->pdev->dev,
+					"Failed to setup NVIDIA NV2 ATSD region\n");
+			goto disable_exit;
 		}
 	}
 
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
+
+disable_exit:
+	vfio_pci_disable(vdev);
+	return ret;
 }
 
 static void vfio_pci_disable(struct vfio_pci_device *vdev)
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
new file mode 100644
index 0000000..300159b
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -0,0 +1,433 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Register an on-GPU RAM region for cacheable access.
+ *
+ * Derived from original vfio_pci_igd.c:
+ * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
+ *	Author: Alex Williamson <alex.williamson@redhat.com>
+ */
+
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/mmu_context.h>
+#include <asm/kvm_ppc.h>
+#include "vfio_pci_private.h"
+
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
+EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault);
+EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap);
+EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
+
+struct vfio_pci_nvgpu_data {
+	unsigned long gpu_hpa; /* GPU RAM physical address */
+	unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */
+	unsigned long useraddr; /* GPU RAM userspace address */
+	unsigned long size; /* Size of the GPU RAM window (usually 128GB) */
+	void *base; /* GPU RAM virtual address, for emulated access */
+	struct mm_struct *mm;
+	struct mm_iommu_table_group_mem_t *mem; /* Pre-registered RAM descr. */
+	struct pci_dev *gpdev;
+	struct notifier_block group_notifier;
+};
+
+static size_t vfio_pci_nvgpu_rw(struct vfio_pci_device *vdev,
+		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_pci_nvgpu_data *data = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (iswrite) {
+		if (copy_from_user(data->base + pos, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, data->base + pos, count))
+			return -EFAULT;
+	}
+	*ppos += count;
+
+	return count;
+}
+
+static void vfio_pci_nvgpu_release(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region)
+{
+	struct vfio_pci_nvgpu_data *data = region->data;
+	long ret;
+	struct pci_controller *hose;
+	struct pci_dev *npdev;
+
+	/* If there were any mappings at all... */
+	if (data->mm) {
+		ret = mm_iommu_put(data->mm, data->mem);
+		WARN_ON(ret);
+
+		mmdrop(data->mm);
+	}
+
+	vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
+			&data->group_notifier);
+
+	npdev = pnv_pci_get_npu_dev(data->gpdev, 0);
+	hose = pci_bus_to_host(npdev->bus);
+
+	pnv_npu2_unmap_lpar_dev(data->gpdev);
+
+	memunmap(data->base);
+	kfree(data);
+}
+
+static vm_fault_t vfio_pci_nvgpu_mmap_fault(struct vm_fault *vmf)
+{
+	vm_fault_t ret;
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_region *region = vma->vm_private_data;
+	struct vfio_pci_nvgpu_data *data = region->data;
+	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
+	unsigned long vm_pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
+
+	ret = vmf_insert_pfn(vma, vmf->address, pfn);
+	trace_vfio_pci_nvgpu_mmap_fault(data->gpdev, pfn << PAGE_SHIFT,
+			vmf->address, ret);
+
+	return ret;
+}
+
+static const struct vm_operations_struct vfio_pci_nvgpu_mmap_vmops = {
+	.fault = vfio_pci_nvgpu_mmap_fault,
+};
+
+static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vm_area_struct *vma)
+{
+	long ret;
+	struct vfio_pci_nvgpu_data *data = region->data;
+
+	if (data->useraddr)
+		return -EPERM;
+
+	if (vma->vm_end - vma->vm_start > data->size)
+		return -EINVAL;
+
+	vma->vm_private_data = region;
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_ops = &vfio_pci_nvgpu_mmap_vmops;
+
+	/*
+	 * Calling mm_iommu_newdev() here once as the region is not
+	 * registered yet and therefore right initialization will happen now.
+	 * Other places will use mm_iommu_find() which returns
+	 * registered @mem and does not go gup().
+	 */
+	data->useraddr = vma->vm_start;
+	data->mm = current->mm;
+
+	atomic_inc(&data->mm->mm_count);
+	ret = mm_iommu_newdev(data->mm, data->useraddr,
+			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
+			data->gpu_hpa, &data->mem);
+
+	trace_vfio_pci_nvgpu_mmap(vdev->pdev, data->gpu_hpa, data->useraddr,
+			vma->vm_end - vma->vm_start, ret);
+
+	return ret;
+}
+
+static int vfio_pci_nvgpu_add_capability(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vfio_info_cap *caps)
+{
+	struct vfio_pci_nvgpu_data *data = region->data;
+	struct vfio_region_info_cap_npu2 cap;
+
+	cap.header.id = VFIO_REGION_INFO_CAP_NPU2;
+	cap.header.version = 1;
+	cap.tgt = data->gpu_tgt;
+
+	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
+}
+
+static const struct vfio_pci_regops vfio_pci_nvgpu_regops = {
+	.rw = vfio_pci_nvgpu_rw,
+	.release = vfio_pci_nvgpu_release,
+	.mmap = vfio_pci_nvgpu_mmap,
+	.add_capability = vfio_pci_nvgpu_add_capability,
+};
+
+static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
+		unsigned long action, void *opaque)
+{
+	struct kvm *kvm = opaque;
+	struct vfio_pci_nvgpu_data *data = container_of(nb,
+			struct vfio_pci_nvgpu_data,
+			group_notifier);
+
+	if (action == VFIO_GROUP_NOTIFY_SET_KVM && kvm &&
+			pnv_npu2_map_lpar_dev(data->gpdev,
+				kvm->arch.lpid, MSR_DR | MSR_PR))
+		return NOTIFY_BAD;
+
+	return NOTIFY_OK;
+}
+
+int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
+{
+	int ret;
+	u64 reg[2];
+	u64 tgt = 0;
+	struct device_node *npu_node, *mem_node;
+	struct pci_dev *npu_dev;
+	struct vfio_pci_nvgpu_data *data;
+	uint32_t mem_phandle = 0;
+	unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;
+
+	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
+	if (!npu_dev)
+		return -EINVAL;
+
+	npu_node = pci_device_to_OF_node(npu_dev);
+	if (!npu_node)
+		return -EINVAL;
+
+	if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
+		return -EINVAL;
+
+	mem_node = of_find_node_by_phandle(mem_phandle);
+	if (!mem_node)
+		return -EINVAL;
+
+	if (of_property_read_variable_u64_array(mem_node, "reg", reg,
+				ARRAY_SIZE(reg), ARRAY_SIZE(reg)) !=
+			ARRAY_SIZE(reg))
+		return -EINVAL;
+
+	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
+		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
+		return -EFAULT;
+	}
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->gpu_hpa = reg[0];
+	data->gpu_tgt = tgt;
+	data->size = reg[1];
+	data->base = memremap(data->gpu_hpa, data->size, MEMREMAP_WB);
+	if (!data->base) {
+		ret = -ENOMEM;
+		goto free_exit;
+	}
+
+	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
+			data->gpu_hpa + data->size - 1);
+
+	data->gpdev = vdev->pdev;
+	data->group_notifier.notifier_call = vfio_pci_nvgpu_group_notifier;
+
+	ret = vfio_register_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
+			&events, &data->group_notifier);
+	if (ret)
+		goto free_exit;
+
+	/*
+	 * We have just set KVM, we do not need the listener anymore.
+	 * Also, keeping it registered means that if more than one GPU is
+	 * assigned, we will get several similar notifiers notifying about
+	 * the same device again which does not help with anything.
+	 */
+	vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
+			&data->group_notifier);
+
+	ret = vfio_pci_register_dev_region(vdev,
+			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
+			&vfio_pci_nvgpu_regops,
+			data->size,
+			VFIO_REGION_INFO_FLAG_READ |
+			VFIO_REGION_INFO_FLAG_WRITE |
+			VFIO_REGION_INFO_FLAG_MMAP,
+			data);
+	if (ret)
+		goto free_exit;
+
+	return 0;
+free_exit:
+	kfree(data);
+
+	return ret;
+}
+
+/*
+ * IBM NPU2 bridge
+ */
+struct vfio_pci_npu2_data {
+	void *base; /* ATSD register virtual address, for emulated access */
+	unsigned long mmio_atsd; /* ATSD physical address */
+	unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */
+};
+
+static size_t vfio_pci_npu2_rw(struct vfio_pci_device *vdev,
+		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_pci_npu2_data *data = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (iswrite) {
+		if (copy_from_user(data->base + pos, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, data->base + pos, count))
+			return -EFAULT;
+	}
+	*ppos += count;
+
+	return count;
+}
+
+static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vm_area_struct *vma)
+{
+	int ret;
+	struct vfio_pci_npu2_data *data = region->data;
+	unsigned long req_len = vma->vm_end - vma->vm_start;
+
+	if (req_len != PAGE_SIZE)
+		return -EINVAL;
+
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	ret = remap_pfn_range(vma, vma->vm_start, data->mmio_atsd >> PAGE_SHIFT,
+			req_len, vma->vm_page_prot);
+	trace_vfio_pci_npu2_mmap(vdev->pdev, data->mmio_atsd, vma->vm_start,
+			vma->vm_end - vma->vm_start, ret);
+
+	return ret;
+}
+
+static void vfio_pci_npu2_release(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region)
+{
+	struct vfio_pci_npu2_data *data = region->data;
+
+	memunmap(data->base);
+	kfree(data);
+}
+
+static int vfio_pci_npu2_add_capability(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vfio_info_cap *caps)
+{
+	struct vfio_pci_npu2_data *data = region->data;
+	struct vfio_region_info_cap_npu2 cap;
+
+	cap.header.id = VFIO_REGION_INFO_CAP_NPU2;
+	cap.header.version = 1;
+	cap.tgt = data->gpu_tgt;
+
+	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
+}
+
+static const struct vfio_pci_regops vfio_pci_npu2_regops = {
+	.rw = vfio_pci_npu2_rw,
+	.mmap = vfio_pci_npu2_mmap,
+	.release = vfio_pci_npu2_release,
+	.add_capability = vfio_pci_npu2_add_capability,
+};
+
+int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
+{
+	int ret;
+	struct vfio_pci_npu2_data *data;
+	struct device_node *nvlink_dn;
+	u32 nvlink_index = 0;
+	struct pci_dev *npdev = vdev->pdev;
+	struct device_node *npu_node = pci_device_to_OF_node(npdev);
+	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
+	u64 mmio_atsd = 0;
+	u64 tgt = 0;
+
+	/*
+	 * NPU2 normally has 8 ATSD registers (for concurrency) and 6 links
+	 * so we can allocate one register per link.
+	 * Since skiboot only exposes one (a bug), use this as a fallback
+	 * which is safe as we do not split GPUs attached to the same NPU.
+	 */
+	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
+	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
+			&nvlink_index)))
+		return -ENODEV;
+
+	if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", nvlink_index,
+			&mmio_atsd)) {
+		if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", 0,
+					&mmio_atsd)) {
+			dev_warn(&vdev->pdev->dev, "No ATSD found\n");
+			return -EFAULT;
+		}
+		dev_warn(&vdev->pdev->dev, "Fallback to ATSD#0\n");
+	}
+
+	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
+		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
+		return -EFAULT;
+	}
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->mmio_atsd = mmio_atsd;
+	data->gpu_tgt = tgt;
+	data->base = memremap(data->mmio_atsd, SZ_64K, MEMREMAP_WT);
+	if (!data->base) {
+		ret = -ENOMEM;
+		goto free_exit;
+	}
+
+	ret = vfio_pci_register_dev_region(vdev,
+			PCI_VENDOR_ID_IBM | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+			VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
+			&vfio_pci_npu2_regops,
+			PAGE_SIZE,
+			VFIO_REGION_INFO_FLAG_READ |
+			VFIO_REGION_INFO_FLAG_WRITE |
+			VFIO_REGION_INFO_FLAG_MMAP,
+			data);
+	if (ret)
+		goto free_exit;
+
+	return 0;
+
+free_exit:
+	kfree(data);
+
+	return ret;
+}
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 42dc1d3..d0f8e4f 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -38,3 +38,9 @@ config VFIO_PCI_IGD
 	  and LPC bridge config space.
 
 	  To enable Intel IGD assignment through vfio-pci, say Y.
+
+config VFIO_PCI_NVLINK2
+	def_bool y
+	depends on VFIO_PCI && PPC_POWERNV
+	help
+	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH kernel v3 22/22] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver
@ 2018-11-13  8:28   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-13  8:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Alistair Popple,
	Alex Williamson, kvm-ppc, Sam Bobroff, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab, David Gibson

POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
pluggable PCIe devices but implement PCIe links for config space and MMIO.
In addition to that the GPUs are interconnected to each other and also
have direct links to the P9 CPU. The links are NVLink2 and provide direct
access to the system RAM for GPUs via NPU (an NVLink2 "proxy" on P9 chip).
These systems also support ATS (address translation services) which is
a part of the NVLink2 protocol. Such GPUs also share on-board RAM
(16GB in tested config) to the system via the same NVLink2 so a CPU has
cache-coherent access to a GPU RAM.

This exports GPU RAM to the userspace as a new PCI region. This
preregisters the new memory as device memory as it might be used for DMA.
This inserts pfns from the fault handler as the GPU memory is not onlined
until the NVIDIA driver is loaded and trained the links so doing this
earlier produces low level errors which we fence in the firmware so
it does not hurt the host system but still better to avoid.

This exports ATSD (Address Translation Shootdown) register of NPU which
allows the guest to invalidate TLB. The register conveniently occupies
a single 64k page. Since NPU maps the GPU memory, it has a "tgt" property
(which is an abbreviated host system bus address) and tells the GPU its
own system address. This exports the "tgt" as a capability to let
the guest driver conglomerate the routing information so each GPU knows
how to get directly to the other GPUs. This also adds the "tgt" capability
to a GPU to allow the userspace to find out the NVLinks corresponding
to a specific GPU.

For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
know LPID (a logical partition ID or a KVM guest hardware ID in other
words) and PID (a memory context ID of a userspace process, not to be
confused with a linux pid). This assigns a GPU to LPID in the NPU and
this is why this adds a listener for KVM on an IOMMU group. A PID comes
via NVLink from a GPU and NPU uses a PID wildcard to pass it through.

This requires coherent memory and ATSD to be available on the host as
the GPU vendor only supports configurations with both features enabled
and other configurations are known not to work. Because of this and
because of the ways the features are advertised to the host system
(which is a device tree with very platform specific properties),
this requires enabled POWERNV platform.

This hardcodes the NVLink2 support for specific vendor and device IDs
as there is no reliable way of knowing about coherent memory and ATS
support. The GPU has an unique vendor PCIe capability 0x23 but it was
confirmed that it does not provide required information (and it is still
undisclosed what it actually does).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v3:
* reworded the commit log about tgt
* added tracepoints (do we want them enabled for entire vfio-pci?)
* added code comments
* added write|mmap flags to the new regions
* auto enabled VFIO_PCI_NVLINK2 config option
* added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
references; there are required by the NVIDIA driver
* keep notifier registered only for short time
---
 drivers/vfio/pci/Makefile           |   1 +
 drivers/vfio/pci/trace.h            | 102 +++++++
 drivers/vfio/pci/vfio_pci_private.h |   2 +
 include/uapi/linux/vfio.h           |  26 ++
 drivers/vfio/pci/vfio_pci.c         |  39 ++-
 drivers/vfio/pci/vfio_pci_nvlink2.c | 433 ++++++++++++++++++++++++++++
 drivers/vfio/pci/Kconfig            |   6 +
 7 files changed, 607 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/pci/trace.h
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 76d8ec0..9662c06 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,5 +1,6 @@
 
 vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
+vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/trace.h
new file mode 100644
index 0000000..b80d2d3
--- /dev/null
+++ b/drivers/vfio/pci/trace.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * VFIO PCI mmap/mmap_fault tracepoints
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfio_pci
+
+#if !defined(_TRACE_VFIO_PCI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFIO_PCI_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(vfio_pci_nvgpu_mmap_fault,
+	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
+			vm_fault_t ret),
+	TP_ARGS(pdev, hpa, ua, ret),
+
+	TP_STRUCT__entry(
+		__field(const char *, name)
+		__field(unsigned long, hpa)
+		__field(unsigned long, ua)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->name = dev_name(&pdev->dev),
+		__entry->hpa = hpa;
+		__entry->ua = ua;
+		__entry->ret = ret;
+	),
+
+	TP_printk("%s: %lx -> %lx ret=%d", __entry->name, __entry->hpa,
+			__entry->ua, __entry->ret)
+);
+
+TRACE_EVENT(vfio_pci_nvgpu_mmap,
+	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
+			unsigned long size, int ret),
+	TP_ARGS(pdev, hpa, ua, size, ret),
+
+	TP_STRUCT__entry(
+		__field(const char *, name)
+		__field(unsigned long, hpa)
+		__field(unsigned long, ua)
+		__field(unsigned long, size)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->name = dev_name(&pdev->dev),
+		__entry->hpa = hpa;
+		__entry->ua = ua;
+		__entry->size = size;
+		__entry->ret = ret;
+	),
+
+	TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
+			__entry->ua, __entry->size, __entry->ret)
+);
+
+TRACE_EVENT(vfio_pci_npu2_mmap,
+	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
+			unsigned long size, int ret),
+	TP_ARGS(pdev, hpa, ua, size, ret),
+
+	TP_STRUCT__entry(
+		__field(const char *, name)
+		__field(unsigned long, hpa)
+		__field(unsigned long, ua)
+		__field(unsigned long, size)
+		__field(int, ret)
+	),
+
+	TP_fast_assign(
+		__entry->name = dev_name(&pdev->dev),
+		__entry->hpa = hpa;
+		__entry->ua = ua;
+		__entry->size = size;
+		__entry->ret = ret;
+	),
+
+	TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
+			__entry->ua, __entry->size, __entry->ret)
+);
+
+#endif /* _TRACE_SUBSYS_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 93c1738..7639241 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -163,4 +163,6 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
 	return -ENODEV;
 }
 #endif
+extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
+extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8131028..53a4061 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -353,6 +353,20 @@ struct vfio_region_gfx_edid {
 #define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
 };
 
+/*
+ * 10de vendor sub-type
+ * NVIDIA GPU NVLink2 RAM is coherent RAM mapped onto the host address space.
+ */
+#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM	(1)
+
+/*
+ * 1014 vendor sub-type
+ *
+ * IBM NPU NVLink2 ATSD (Address Translation Shootdown) register of the NPU,
+ * used to invalidate TLB entries on a GPU.
+ */
+#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within
@@ -363,6 +377,18 @@ struct vfio_region_gfx_edid {
  */
 #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
 
+/*
+ * Capability with a compressed real address (a.k.a. SSA, small system address)
+ * where GPU RAM is mapped on the system bus. Used by the GPU for DMA routing.
+ */
+#define VFIO_REGION_INFO_CAP_NPU2		4
+
+struct vfio_region_info_cap_npu2 {
+	struct vfio_info_cap_header header;
+	__u64 tgt;
+	/* size is defined in VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM */
+};
+
 /**
  * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
  *				    struct vfio_irq_info)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 6cb70cf..f072d8e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -224,6 +224,16 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
+int __weak vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+
+int __weak vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+
 static int vfio_pci_enable(struct vfio_pci_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -302,14 +312,39 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		if (ret) {
 			dev_warn(&vdev->pdev->dev,
 				 "Failed to setup Intel IGD regions\n");
-			vfio_pci_disable(vdev);
-			return ret;
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    pdev->device == 0x1db1 &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
+		if (ret) {
+			dev_warn(&vdev->pdev->dev,
+				 "Failed to setup NVIDIA NV2 RAM region\n");
+			goto disable_exit;
+		}
+	}
+
+	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
+	    pdev->device == 0x04ea &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
+		ret = vfio_pci_ibm_npu2_init(vdev);
+		if (ret) {
+			dev_warn(&vdev->pdev->dev,
+					"Failed to setup NVIDIA NV2 ATSD region\n");
+			goto disable_exit;
 		}
 	}
 
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
+
+disable_exit:
+	vfio_pci_disable(vdev);
+	return ret;
 }
 
 static void vfio_pci_disable(struct vfio_pci_device *vdev)
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
new file mode 100644
index 0000000..300159b
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -0,0 +1,433 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * VFIO PCI NVIDIA Witherspoon GPU support a.k.a. NVLink2.
+ *
+ * Copyright (C) 2018 IBM Corp.  All rights reserved.
+ *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Register an on-GPU RAM region for cacheable access.
+ *
+ * Derived from original vfio_pci_igd.c:
+ * Copyright (C) 2016 Red Hat, Inc.  All rights reserved.
+ *	Author: Alex Williamson <alex.williamson@redhat.com>
+ */
+
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/sched/mm.h>
+#include <linux/mmu_context.h>
+#include <asm/kvm_ppc.h>
+#include "vfio_pci_private.h"
+
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
+EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault);
+EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap);
+EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
+
+struct vfio_pci_nvgpu_data {
+	unsigned long gpu_hpa; /* GPU RAM physical address */
+	unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */
+	unsigned long useraddr; /* GPU RAM userspace address */
+	unsigned long size; /* Size of the GPU RAM window (usually 128GB) */
+	void *base; /* GPU RAM virtual address, for emulated access */
+	struct mm_struct *mm;
+	struct mm_iommu_table_group_mem_t *mem; /* Pre-registered RAM descr. */
+	struct pci_dev *gpdev;
+	struct notifier_block group_notifier;
+};
+
+static size_t vfio_pci_nvgpu_rw(struct vfio_pci_device *vdev,
+		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_pci_nvgpu_data *data = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (iswrite) {
+		if (copy_from_user(data->base + pos, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, data->base + pos, count))
+			return -EFAULT;
+	}
+	*ppos += count;
+
+	return count;
+}
+
+static void vfio_pci_nvgpu_release(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region)
+{
+	struct vfio_pci_nvgpu_data *data = region->data;
+	long ret;
+	struct pci_controller *hose;
+	struct pci_dev *npdev;
+
+	/* If there were any mappings at all... */
+	if (data->mm) {
+		ret = mm_iommu_put(data->mm, data->mem);
+		WARN_ON(ret);
+
+		mmdrop(data->mm);
+	}
+
+	vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
+			&data->group_notifier);
+
+	npdev = pnv_pci_get_npu_dev(data->gpdev, 0);
+	hose = pci_bus_to_host(npdev->bus);
+
+	pnv_npu2_unmap_lpar_dev(data->gpdev);
+
+	memunmap(data->base);
+	kfree(data);
+}
+
+static vm_fault_t vfio_pci_nvgpu_mmap_fault(struct vm_fault *vmf)
+{
+	vm_fault_t ret;
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_region *region = vma->vm_private_data;
+	struct vfio_pci_nvgpu_data *data = region->data;
+	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
+	unsigned long vm_pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
+
+	ret = vmf_insert_pfn(vma, vmf->address, pfn);
+	trace_vfio_pci_nvgpu_mmap_fault(data->gpdev, pfn << PAGE_SHIFT,
+			vmf->address, ret);
+
+	return ret;
+}
+
+static const struct vm_operations_struct vfio_pci_nvgpu_mmap_vmops = {
+	.fault = vfio_pci_nvgpu_mmap_fault,
+};
+
+static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vm_area_struct *vma)
+{
+	long ret;
+	struct vfio_pci_nvgpu_data *data = region->data;
+
+	if (data->useraddr)
+		return -EPERM;
+
+	if (vma->vm_end - vma->vm_start > data->size)
+		return -EINVAL;
+
+	vma->vm_private_data = region;
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_ops = &vfio_pci_nvgpu_mmap_vmops;
+
+	/*
+	 * Call mm_iommu_newdev() here once as the region is not registered
+	 * yet, so the proper initialization happens now. Other places use
+	 * mm_iommu_find() which returns the already registered @mem and
+	 * does not call gup().
+	 */
+	data->useraddr = vma->vm_start;
+	data->mm = current->mm;
+
+	atomic_inc(&data->mm->mm_count);
+	ret = mm_iommu_newdev(data->mm, data->useraddr,
+			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
+			data->gpu_hpa, &data->mem);
+
+	trace_vfio_pci_nvgpu_mmap(vdev->pdev, data->gpu_hpa, data->useraddr,
+			vma->vm_end - vma->vm_start, ret);
+
+	return ret;
+}
+
+static int vfio_pci_nvgpu_add_capability(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vfio_info_cap *caps)
+{
+	struct vfio_pci_nvgpu_data *data = region->data;
+	struct vfio_region_info_cap_npu2 cap;
+
+	cap.header.id = VFIO_REGION_INFO_CAP_NPU2;
+	cap.header.version = 1;
+	cap.tgt = data->gpu_tgt;
+
+	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
+}
+
+static const struct vfio_pci_regops vfio_pci_nvgpu_regops = {
+	.rw = vfio_pci_nvgpu_rw,
+	.release = vfio_pci_nvgpu_release,
+	.mmap = vfio_pci_nvgpu_mmap,
+	.add_capability = vfio_pci_nvgpu_add_capability,
+};
+
+static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
+		unsigned long action, void *opaque)
+{
+	struct kvm *kvm = opaque;
+	struct vfio_pci_nvgpu_data *data = container_of(nb,
+			struct vfio_pci_nvgpu_data,
+			group_notifier);
+
+	if (action == VFIO_GROUP_NOTIFY_SET_KVM && kvm &&
+			pnv_npu2_map_lpar_dev(data->gpdev,
+				kvm->arch.lpid, MSR_DR | MSR_PR))
+		return NOTIFY_BAD;
+
+	return NOTIFY_OK;
+}
+
+int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
+{
+	int ret;
+	u64 reg[2];
+	u64 tgt = 0;
+	struct device_node *npu_node, *mem_node;
+	struct pci_dev *npu_dev;
+	struct vfio_pci_nvgpu_data *data;
+	uint32_t mem_phandle = 0;
+	unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;
+
+	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
+	if (!npu_dev)
+		return -EINVAL;
+
+	npu_node = pci_device_to_OF_node(npu_dev);
+	if (!npu_node)
+		return -EINVAL;
+
+	if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
+		return -EINVAL;
+
+	mem_node = of_find_node_by_phandle(mem_phandle);
+	if (!mem_node)
+		return -EINVAL;
+
+	if (of_property_read_variable_u64_array(mem_node, "reg", reg,
+				ARRAY_SIZE(reg), ARRAY_SIZE(reg)) !=
+			ARRAY_SIZE(reg))
+		return -EINVAL;
+
+	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
+		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
+		return -EFAULT;
+	}
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->gpu_hpa = reg[0];
+	data->gpu_tgt = tgt;
+	data->size = reg[1];
+	data->base = memremap(data->gpu_hpa, data->size, MEMREMAP_WB);
+	if (!data->base) {
+		ret = -ENOMEM;
+		goto free_exit;
+	}
+
+	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
+			data->gpu_hpa + data->size - 1);
+
+	data->gpdev = vdev->pdev;
+	data->group_notifier.notifier_call = vfio_pci_nvgpu_group_notifier;
+
+	ret = vfio_register_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
+			&events, &data->group_notifier);
+	if (ret)
+		goto free_exit;
+
+	/*
+	 * We have just set KVM so we do not need the listener anymore.
+	 * Also, keeping it registered means that if more than one GPU is
+	 * assigned, we will get several similar notifiers notifying about
+	 * the same device again, which does not help with anything.
+	 */
+	vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
+			&data->group_notifier);
+
+	ret = vfio_pci_register_dev_region(vdev,
+			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
+			&vfio_pci_nvgpu_regops,
+			data->size,
+			VFIO_REGION_INFO_FLAG_READ |
+			VFIO_REGION_INFO_FLAG_WRITE |
+			VFIO_REGION_INFO_FLAG_MMAP,
+			data);
+	if (ret)
+		goto free_exit;
+
+	return 0;
+free_exit:
+	kfree(data);
+
+	return ret;
+}
+
+/*
+ * IBM NPU2 bridge
+ */
+struct vfio_pci_npu2_data {
+	void *base; /* ATSD register virtual address, for emulated access */
+	unsigned long mmio_atsd; /* ATSD physical address */
+	unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */
+};
+
+static size_t vfio_pci_npu2_rw(struct vfio_pci_device *vdev,
+		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_pci_npu2_data *data = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (iswrite) {
+		if (copy_from_user(data->base + pos, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, data->base + pos, count))
+			return -EFAULT;
+	}
+	*ppos += count;
+
+	return count;
+}
+
+static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vm_area_struct *vma)
+{
+	int ret;
+	struct vfio_pci_npu2_data *data = region->data;
+	unsigned long req_len = vma->vm_end - vma->vm_start;
+
+	if (req_len != PAGE_SIZE)
+		return -EINVAL;
+
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	ret = remap_pfn_range(vma, vma->vm_start, data->mmio_atsd >> PAGE_SHIFT,
+			req_len, vma->vm_page_prot);
+	trace_vfio_pci_npu2_mmap(vdev->pdev, data->mmio_atsd, vma->vm_start,
+			vma->vm_end - vma->vm_start, ret);
+
+	return ret;
+}
+
+static void vfio_pci_npu2_release(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region)
+{
+	struct vfio_pci_npu2_data *data = region->data;
+
+	memunmap(data->base);
+	kfree(data);
+}
+
+static int vfio_pci_npu2_add_capability(struct vfio_pci_device *vdev,
+		struct vfio_pci_region *region, struct vfio_info_cap *caps)
+{
+	struct vfio_pci_npu2_data *data = region->data;
+	struct vfio_region_info_cap_npu2 cap;
+
+	cap.header.id = VFIO_REGION_INFO_CAP_NPU2;
+	cap.header.version = 1;
+	cap.tgt = data->gpu_tgt;
+
+	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
+}
+
+static const struct vfio_pci_regops vfio_pci_npu2_regops = {
+	.rw = vfio_pci_npu2_rw,
+	.mmap = vfio_pci_npu2_mmap,
+	.release = vfio_pci_npu2_release,
+	.add_capability = vfio_pci_npu2_add_capability,
+};
+
+int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
+{
+	int ret;
+	struct vfio_pci_npu2_data *data;
+	struct device_node *nvlink_dn;
+	u32 nvlink_index = 0;
+	struct pci_dev *npdev = vdev->pdev;
+	struct device_node *npu_node = pci_device_to_OF_node(npdev);
+	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
+	u64 mmio_atsd = 0;
+	u64 tgt = 0;
+
+	/*
+	 * NPU2 normally has 8 ATSD registers (for concurrency) and 6 links,
+	 * so we can allocate one register per link. However skiboot only
+	 * exposes one (a bug); fall back to ATSD#0 in that case, which is
+	 * safe as we do not split GPUs attached to the same NPU.
+	 */
+	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
+	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
+			&nvlink_index)))
+		return -ENODEV;
+
+	if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", nvlink_index,
+			&mmio_atsd)) {
+		if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", 0,
+					&mmio_atsd)) {
+			dev_warn(&vdev->pdev->dev, "No ATSD found\n");
+			return -EFAULT;
+		}
+		dev_warn(&vdev->pdev->dev, "Fallback to ATSD#0\n");
+	}
+
+	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
+		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
+		return -EFAULT;
+	}
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	data->mmio_atsd = mmio_atsd;
+	data->gpu_tgt = tgt;
+	data->base = memremap(data->mmio_atsd, SZ_64K, MEMREMAP_WT);
+	if (!data->base) {
+		ret = -ENOMEM;
+		goto free_exit;
+	}
+
+	ret = vfio_pci_register_dev_region(vdev,
+			PCI_VENDOR_ID_IBM | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+			VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
+			&vfio_pci_npu2_regops,
+			PAGE_SIZE,
+			VFIO_REGION_INFO_FLAG_READ |
+			VFIO_REGION_INFO_FLAG_WRITE |
+			VFIO_REGION_INFO_FLAG_MMAP,
+			data);
+	if (ret)
+		goto free_exit;
+
+	return 0;
+
+free_exit:
+	kfree(data);
+
+	return ret;
+}
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 42dc1d3..d0f8e4f 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -38,3 +38,9 @@ config VFIO_PCI_IGD
 	  and LPC bridge config space.
 
 	  To enable Intel IGD assignment through vfio-pci, say Y.
+
+config VFIO_PCI_NVLINK2
+	def_bool y
+	depends on VFIO_PCI && PPC_POWERNV
+	help
+	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
-- 
2.17.1
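
For completeness, a minimal sketch of how userspace could read the NPU2
capability added above (hypothetical example, not part of this series; it
assumes the caller already knows the region index and only relies on the
VFIO_DEVICE_GET_REGION_INFO ioctl):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Fetch region info for @index and return the NPU2 "tgt" value, if any */
static int read_npu2_tgt(int device, __u32 index, __u64 *tgt)
{
	struct vfio_region_info *info;
	__u32 off;

	info = calloc(1, sizeof(*info));
	if (!info)
		return -1;
	info->argsz = sizeof(*info);
	info->index = index;

	if (ioctl(device, VFIO_DEVICE_GET_REGION_INFO, info))
		goto err;

	if (info->argsz > sizeof(*info)) {
		/* The kernel wants more room for the capability chain */
		struct vfio_region_info *bigger = realloc(info, info->argsz);

		if (!bigger)
			goto err;
		info = bigger;
		if (ioctl(device, VFIO_DEVICE_GET_REGION_INFO, info))
			goto err;
	}

	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
		goto err;

	/* Walk the capability chain looking for VFIO_REGION_INFO_CAP_NPU2 */
	for (off = info->cap_offset; off; ) {
		struct vfio_info_cap_header *hdr =
			(void *)((char *)info + off);

		if (hdr->id == VFIO_REGION_INFO_CAP_NPU2) {
			*tgt = ((struct vfio_region_info_cap_npu2 *)hdr)->tgt;
			free(info);
			return 0;
		}
		off = hdr->next;
	}
err:
	free(info);
	return -1;
}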

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 05/22] powerpc/powernv/npu: Add helper to access struct npu for NPU device
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-14  3:42     ` Alistair Popple
  -1 siblings, 0 replies; 84+ messages in thread
From: Alistair Popple @ 2018-11-14  3:42 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Jose Ricardo Ziviani, Sam Bobroff, linuxppc-dev, Alex Williamson,
	kvm-ppc, Piotr Jaroszynski, Oliver O'Halloran,
	Andrew Donnellan, Leonardo Augusto Guimarães Garcia,
	Reza Arbab, David Gibson

Reviewed-by: Alistair Popple <alistair@popple.id.au>

On Tuesday, 13 November 2018 7:28:06 PM AEDT Alexey Kardashevskiy wrote:
> This step is to help removing the npu struct from pnv_phb so it
> can be used by pseries as well.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  arch/powerpc/platforms/powernv/npu-dma.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c
> b/arch/powerpc/platforms/powernv/npu-dma.c index 91d488f..9f48831 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -327,6 +327,18 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct
> pnv_ioda_pe *npe) return gpe;
>  }
> 
> +/*
> + * NPU2 ATS
> + */
> +static struct npu *npdev_to_npu(struct pci_dev *npdev)
> +{
> +	struct pnv_phb *nphb;
> +
> +	nphb = pci_bus_to_host(npdev->bus)->private_data;
> +
> +	return &nphb->npu;
> +}
> +
>  /* Maximum number of nvlinks per npu */
>  #define NV_MAX_LINKS 6
> 
> @@ -478,7 +490,6 @@ static void acquire_atsd_reg(struct npu_context
> *npu_context, int i, j;
>  	struct npu *npu;
>  	struct pci_dev *npdev;
> -	struct pnv_phb *nphb;
> 
>  	for (i = 0; i <= max_npu2_index; i++) {
>  		mmio_atsd_reg[i].reg = -1;
> @@ -493,8 +504,7 @@ static void acquire_atsd_reg(struct npu_context
> *npu_context, if (!npdev)
>  				continue;
> 
> -			nphb = pci_bus_to_host(npdev->bus)->private_data;
> -			npu = &nphb->npu;
> +			npu = npdev_to_npu(npdev);
>  			mmio_atsd_reg[i].npu = npu;
>  			mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu);
>  			while (mmio_atsd_reg[i].reg < 0) {
> @@ -690,7 +700,7 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev
> *gpdev, }
> 
>  	nphb = pci_bus_to_host(npdev->bus)->private_data;
> -	npu = &nphb->npu;
> +	npu = npdev_to_npu(npdev);
> 
>  	/*
>  	 * Setup the NPU context table for a particular GPU. These need to be
> @@ -764,7 +774,7 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev
> *gpdev, */
>  	WRITE_ONCE(npu_context->npdev[npu->index][nvlink_index], npdev);
> 
> -	if (!nphb->npu.nmmu_flush) {
> +	if (!npu->nmmu_flush) {
>  		/*
>  		 * If we're not explicitly flushing ourselves we need to mark
>  		 * the thread for global flushes
> @@ -810,7 +820,7 @@ void pnv_npu2_destroy_context(struct npu_context
> *npu_context, return;
> 
>  	nphb = pci_bus_to_host(npdev->bus)->private_data;
> -	npu = &nphb->npu;
> +	npu = npdev_to_npu(npdev);
>  	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
>  	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
>  							&nvlink_index)))



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 06/22] powerpc/powernv: Detach npu struct from pnv_phb
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-14  4:28     ` Alistair Popple
  -1 siblings, 0 replies; 84+ messages in thread
From: Alistair Popple @ 2018-11-14  4:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Jose Ricardo Ziviani, Sam Bobroff, linuxppc-dev, Alex Williamson,
	kvm-ppc, Piotr Jaroszynski, Oliver O'Halloran,
	Andrew Donnellan, Leonardo Augusto Guimarães Garcia,
	Reza Arbab, David Gibson

Hi Alexey,

On Tuesday, 13 November 2018 7:28:07 PM AEDT Alexey Kardashevskiy wrote:
>  static struct npu *npdev_to_npu(struct pci_dev *npdev)
>  {
> -	struct pnv_phb *nphb;
> +	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
> +	struct npu *npu;
> 
> -	nphb = pci_bus_to_host(npdev->bus)->private_data;
> +	list_for_each_entry(npu, &npu2_devices, next)

This is called from the ATSD path, which is (or at least has been) quite a
performance-critical path, so searching through all the NPUs in a list may be
problematic.

I guess currently it won't make any practical difference as we only ever have
2 NPUs, but in future they may get divided into more logical NPUs. Would it be
possible to store a back-pointer somewhere so we can avoid the lookup? A rough
sketch of the idea follows.
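
For example (an untested sketch; the npu back-pointer field in
struct pci_controller is hypothetical and not part of this series):

	/* in struct pci_controller: set once in pnv_npu2_init(), else NULL */
	struct npu *npu;

static struct npu *npdev_to_npu(struct pci_dev *npdev)
{
	struct pci_controller *hose = pci_bus_to_host(npdev->bus);

	return hose ? hose->npu : NULL;
}

pnv_npu2_init() would then set "hose->npu = npu;" next to the list_add(),
keeping the npu2_devices list for iteration only.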

> +		if (hose == npu->hose)
> +			return npu;
> 
> -	return &nphb->npu;
> +	WARN_ON_ONCE(1);
> +	return NULL;
>  }
> 
>  /* Maximum number of nvlinks per npu */
> @@ -505,6 +531,9 @@ static void acquire_atsd_reg(struct npu_context
> *npu_context, continue;
> 
>  			npu = npdev_to_npu(npdev);
> +			if (!npu)
> +				continue;
> +
>  			mmio_atsd_reg[i].npu = npu;
>  			mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu);
>  			while (mmio_atsd_reg[i].reg < 0) {
> @@ -701,6 +730,8 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev
> *gpdev,
> 
>  	nphb = pci_bus_to_host(npdev->bus)->private_data;
>  	npu = npdev_to_npu(npdev);
> +	if (!npu)
> +		return ERR_PTR(-ENODEV);
> 
>  	/*
>  	 * Setup the NPU context table for a particular GPU. These need to be
> @@ -821,6 +852,8 @@ void pnv_npu2_destroy_context(struct npu_context
> *npu_context,
> 
>  	nphb = pci_bus_to_host(npdev->bus)->private_data;
>  	npu = npdev_to_npu(npdev);
> +	if (!npu)
> +		return;
>  	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
>  	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
>  							&nvlink_index)))
> @@ -898,9 +931,15 @@ int pnv_npu2_init(struct pnv_phb *phb)
>  	struct pci_dev *gpdev;
>  	static int npu_index;
>  	uint64_t rc = 0;
> +	struct pci_controller *hose = phb->hose;
> +	struct npu *npu;
> +	int ret;
> 
> -	phb->npu.nmmu_flush =
> -		of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
> +	npu = kzalloc(sizeof(*npu), GFP_KERNEL);
> +	if (!npu)
> +		return -ENOMEM;
> +
> +	npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush");
>  	for_each_child_of_node(phb->hose->dn, dn) {
>  		gpdev = pnv_pci_get_gpu_dev(get_pci_dev(dn));
>  		if (gpdev) {
> @@ -914,18 +953,31 @@ int pnv_npu2_init(struct pnv_phb *phb)
>  		}
>  	}
> 
> -	for (i = 0; !of_property_read_u64_index(phb->hose->dn, "ibm,mmio-atsd",
> +	for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd",
>  							i, &mmio_atsd); i++)
> -		phb->npu.mmio_atsd_regs[i] = ioremap(mmio_atsd, 32);
> +		npu->mmio_atsd_regs[i] = ioremap(mmio_atsd, 32);
> 
> -	pr_info("NPU%lld: Found %d MMIO ATSD registers", phb->opal_id, i);
> -	phb->npu.mmio_atsd_count = i;
> -	phb->npu.mmio_atsd_usage = 0;
> +	pr_info("NPU%d: Found %d MMIO ATSD registers", hose->global_number, i);
> +	npu->mmio_atsd_count = i;
> +	npu->mmio_atsd_usage = 0;
>  	npu_index++;
> -	if (WARN_ON(npu_index >= NV_MAX_NPUS))
> -		return -ENOSPC;
> +	if (WARN_ON(npu_index >= NV_MAX_NPUS)) {
> +		ret = -ENOSPC;
> +		goto fail_exit;
> +	}
>  	max_npu2_index = npu_index;
> -	phb->npu.index = npu_index;
> +	npu->index = npu_index;
> +	npu->hose = hose;
> +
> +	list_add(&npu->next, &npu2_devices);

Guess we don't need any locking here as the list gets set up once during
boot, long before the driver is loaded, and is never modified, right?

- Alistair

>  	return 0;
> +
> +fail_exit:
> +	for (i = 0; i < npu->mmio_atsd_count; ++i)
> +		iounmap(npu->mmio_atsd_regs[i]);
> +
> +	kfree(npu);
> +
> +	return ret;
>  }



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 07/22] powerpc/powernv/npu: Move OPAL calls away from context manipulation
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-14  4:57     ` Alistair Popple
  -1 siblings, 0 replies; 84+ messages in thread
From: Alistair Popple @ 2018-11-14  4:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Jose Ricardo Ziviani, Sam Bobroff, linuxppc-dev, Alex Williamson,
	kvm-ppc, Piotr Jaroszynski, Oliver O'Halloran,
	Andrew Donnellan, Leonardo Augusto Guimarães Garcia,
	Reza Arbab, David Gibson

> -	/*
> -	 * Setup the NPU context table for a particular GPU. These need to be
> -	 * per-GPU as we need the tables to filter ATSDs when there are no
> -	 * active contexts on a particular GPU. It is safe for these to be
> -	 * called concurrently with destroy as the OPAL call takes appropriate
> -	 * locks and refcounts on init/destroy.
> -	 */
> -	rc = opal_npu_init_context(nphb->opal_id, mm->context.id, flags,
> -				PCI_DEVID(gpdev->bus->number, gpdev->devfn));
> -	if (rc < 0)
> -		return ERR_PTR(-ENOSPC);
> -

This will prevent any drivers from setting up contexts with MSR values (which
is what the flags argument is for) different from a standard userspace
context (MSR_DR | MSR_PR | MSR_HV). In practice this currently never happens,
and I'm unsure whether that is ever likely to change.

We should at least return an error if flags != (MSR_DR | MSR_PR | MSR_HV).
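
For instance (sketch only, at the top of pnv_npu2_init_context()):

	/* Only the standard userspace MSR setup is supported so far */
	if (flags != (MSR_DR | MSR_PR | MSR_HV))
		return ERR_PTR(-EINVAL);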

>  	/*
>  	 * We store the npu pci device so we can more easily get at the
>  	 * associated npus.
> @@ -755,9 +738,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev
> *gpdev, if (npu_context->release_cb != cb ||
>  			npu_context->priv != priv) {
>  			spin_unlock(&npu_context_lock);
> -			opal_npu_destroy_context(nphb->opal_id, mm->context.id,
> -						PCI_DEVID(gpdev->bus->number,
> -							gpdev->devfn));
>  			return ERR_PTR(-EINVAL);
>  		}
> 
> @@ -783,9 +763,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev
> *gpdev,
> 
>  		if (rc) {
>  			kfree(npu_context);
> -			opal_npu_destroy_context(nphb->opal_id, mm->context.id,
> -					PCI_DEVID(gpdev->bus->number,
> -						gpdev->devfn));
>  			return ERR_PTR(rc);
>  		}
> 
> @@ -838,7 +815,6 @@ void pnv_npu2_destroy_context(struct npu_context
> *npu_context, struct pci_dev *gpdev)
>  {
>  	int removed;
> -	struct pnv_phb *nphb;
>  	struct npu *npu;
>  	struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);
>  	struct device_node *nvlink_dn;
> @@ -847,10 +823,6 @@ void pnv_npu2_destroy_context(struct npu_context
> *npu_context, if (WARN_ON(!npdev))
>  		return;
> 
> -	if (!firmware_has_feature(FW_FEATURE_OPAL))
> -		return;
> -
> -	nphb = pci_bus_to_host(npdev->bus)->private_data;
>  	npu = npdev_to_npu(npdev);
>  	if (!npu)
>  		return;
> @@ -859,8 +831,6 @@ void pnv_npu2_destroy_context(struct npu_context
> *npu_context, &nvlink_index)))
>  		return;
>  	WRITE_ONCE(npu_context->npdev[npu->index][nvlink_index], NULL);
> -	opal_npu_destroy_context(nphb->opal_id, npu_context->mm->context.id,
> -				PCI_DEVID(gpdev->bus->number, gpdev->devfn));
>  	spin_lock(&npu_context_lock);
>  	removed = kref_put(&npu_context->kref, pnv_npu2_release_context);
>  	spin_unlock(&npu_context_lock);
> @@ -892,9 +862,6 @@ int pnv_npu2_handle_fault(struct npu_context *context,
> uintptr_t *ea, /* mmap_sem should be held so the struct_mm must be present
> */
>  	struct mm_struct *mm = context->mm;
> 
> -	if (!firmware_has_feature(FW_FEATURE_OPAL))
> -		return -ENODEV;
> -
>  	WARN_ON(!rwsem_is_locked(&mm->mmap_sem));
> 
>  	for (i = 0; i < count; i++) {
> @@ -923,15 +890,11 @@ int pnv_npu2_handle_fault(struct npu_context *context,
> uintptr_t *ea, }
>  EXPORT_SYMBOL(pnv_npu2_handle_fault);
> 
> -int pnv_npu2_init(struct pnv_phb *phb)
> +int pnv_npu2_init(struct pci_controller *hose)
>  {
>  	unsigned int i;
>  	u64 mmio_atsd;
> -	struct device_node *dn;
> -	struct pci_dev *gpdev;
>  	static int npu_index;
> -	uint64_t rc = 0;
> -	struct pci_controller *hose = phb->hose;
>  	struct npu *npu;
>  	int ret;
> 
> @@ -940,18 +903,6 @@ int pnv_npu2_init(struct pnv_phb *phb)
>  		return -ENOMEM;
> 
>  	npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush");
> -	for_each_child_of_node(phb->hose->dn, dn) {
> -		gpdev = pnv_pci_get_gpu_dev(get_pci_dev(dn));
> -		if (gpdev) {
> -			rc = opal_npu_map_lpar(phb->opal_id,
> -				PCI_DEVID(gpdev->bus->number, gpdev->devfn),
> -				0, 0);
> -			if (rc)
> -				dev_err(&gpdev->dev,
> -					"Error %lld mapping device to LPAR\n",
> -					rc);
> -		}
> -	}
> 
>  	for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd",
>  							i, &mmio_atsd); i++)
> @@ -981,3 +932,57 @@ int pnv_npu2_init(struct pnv_phb *phb)
> 
>  	return ret;
>  }
> +
> +int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
> +		unsigned long msr)
> +{
> +	int ret;
> +	struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);
> +	struct pci_controller *hose;
> +	struct pnv_phb *nphb;
> +
> +	if (!npdev)
> +		return -ENODEV;
> +
> +	hose = pci_bus_to_host(npdev->bus);
> +	nphb = hose->private_data;
> +
> +	dev_dbg(&gpdev->dev, "Map LPAR opalid=%llu lparid=%u\n",
> +			nphb->opal_id, lparid);
> +	/*
> +	 * Currently we only support radix and non-zero LPCR only makes sense
> +	 * for hash tables so skiboot expects the LPCR parameter to be a zero.
> +	 */
> +	ret = opal_npu_map_lpar(nphb->opal_id,
> +			PCI_DEVID(gpdev->bus->number, gpdev->devfn), lparid,
> +			0 /* LPCR bits */);
> +	if (ret) {
> +		dev_err(&gpdev->dev, "Error %d mapping device to LPAR\n", ret);
> +		return ret;
> +	}
> +
> +	dev_dbg(&gpdev->dev, "init context opalid=%llu msr=%lx\n",
> +			nphb->opal_id, msr);
> +	ret = opal_npu_init_context(nphb->opal_id, 0/*__unused*/, msr,
> +			PCI_DEVID(gpdev->bus->number, gpdev->devfn));
> +	if (ret < 0)
> +		dev_err(&gpdev->dev, "Failed to init context: %d\n", ret);
> +	else
> +		ret = 0;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pnv_npu2_map_lpar_dev);
> +
> +void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned long msr)
> +{
> +	int ret;
> +	struct pci_dev *gpdev;
> +
> +	list_for_each_entry(gpdev, &gpe->pbus->devices, bus_list) {
> +		ret = pnv_npu2_map_lpar_dev(gpdev, 0, msr);
> +		if (ret < 0)
> +			dev_err(&gpdev->dev, "Failed to init context: %d\n",
> +					ret);
> +	}
> +}
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c
> b/arch/powerpc/platforms/powernv/pci-ioda.c index c78c204..ec235ca 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1271,19 +1271,20 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus
> *bus)
> 
>  static void pnv_pci_ioda_setup_PEs(void)
>  {
> -	struct pci_controller *hose, *tmp;
> +	struct pci_controller *hose;
>  	struct pnv_phb *phb;
>  	struct pci_bus *bus;
>  	struct pci_dev *pdev;
> +	struct pnv_ioda_pe *pe;
> 
> -	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
> +	list_for_each_entry(hose, &hose_list, list_node) {
>  		phb = hose->private_data;
>  		if (phb->type == PNV_PHB_NPU_NVLINK) {
>  			/* PE#0 is needed for error reporting */
>  			pnv_ioda_reserve_pe(phb, 0);
>  			pnv_ioda_setup_npu_PEs(hose->bus);
>  			if (phb->model == PNV_PHB_MODEL_NPU2)
> -				pnv_npu2_init(phb);
> +				pnv_npu2_init(hose);
>  		}
>  		if (phb->type == PNV_PHB_NPU_OCAPI) {
>  			bus = hose->bus;
> @@ -1291,6 +1292,14 @@ static void pnv_pci_ioda_setup_PEs(void)
>  				pnv_ioda_setup_dev_PE(pdev);
>  		}
>  	}
> +	list_for_each_entry(hose, &hose_list, list_node) {
> +		phb = hose->private_data;
> +		if (phb->type != PNV_PHB_IODA2)
> +			continue;
> +
> +		list_for_each_entry(pe, &phb->ioda.pe_list, list)
> +			pnv_npu2_map_lpar(pe, MSR_DR | MSR_PR | MSR_HV);
> +	}
>  }
> 
>  #ifdef CONFIG_PCI_IOV



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 02/22] powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-15  5:32     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-15  5:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

On Tue, Nov 13, 2018 at 07:28:03PM +1100, Alexey Kardashevskiy wrote:
> Normally mm_iommu_get() is supposed to add a reference and
> mm_iommu_put() to remove it. However historically mm_iommu_find() does
> the referencing and mm_iommu_get() is doing allocation and referencing.
> 
> We are going to add another helper to preregister device memory so
> instead of having mm_iommu_new() which pre-registers the normal memory
> and references the region, we need separate helpers for pre-registering
> and referencing.
> 
> This renames:
> - mm_iommu_get to mm_iommu_new;
> - mm_iommu_find to mm_iommu_get.
> 
> To make the mm_iommu_get name reflect what it is supposed to do, this
> changes mm_iommu_get() to reference the region so from now on for every
> mm_iommu_get() we need a matching mm_iommu_put().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
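
For reference, the pairing this rename establishes: every successful
mm_iommu_get() lookup now needs a matching mm_iommu_put(). A sketch distilled
from the diff below:

	mem = mm_iommu_get(container->mm, vaddr, entries);
	if (!mem)
		return -ENOENT;
	/* ... use the pre-registered region ... */
	mm_iommu_put(container->mm, mem);	/* drop the reference from _get() */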

> ---
> Changes:
> v2:
> * merged 2 patches into one
> ---
>  arch/powerpc/include/asm/mmu_context.h |  4 +--
>  arch/powerpc/mm/mmu_context_iommu.c    | 13 ++++++---
>  drivers/vfio/vfio_iommu_spapr_tce.c    | 37 +++++++++++++++++---------
>  3 files changed, 35 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 0381394..2d6b00d 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -21,7 +21,7 @@ struct mm_iommu_table_group_mem_t;
>  
>  extern int isolate_lru_page(struct page *page);	/* from internal.h */
>  extern bool mm_iommu_preregistered(struct mm_struct *mm);
> -extern long mm_iommu_get(struct mm_struct *mm,
> +extern long mm_iommu_new(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries,
>  		struct mm_iommu_table_group_mem_t **pmem);
>  extern long mm_iommu_put(struct mm_struct *mm,
> @@ -32,7 +32,7 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
>  		unsigned long ua, unsigned long size);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
>  		struct mm_struct *mm, unsigned long ua, unsigned long size);
> -extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index 1d5161f..babc6ad 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -89,7 +89,7 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>  
> -long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> +long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		struct mm_iommu_table_group_mem_t **pmem)
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
> @@ -202,7 +202,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  
>  	return ret;
>  }
> -EXPORT_SYMBOL_GPL(mm_iommu_get);
> +EXPORT_SYMBOL_GPL(mm_iommu_new);
>  
>  static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
>  {
> @@ -318,21 +318,26 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
>  	return ret;
>  }
>  
> -struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> +struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries)
>  {
>  	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
>  
> +	mutex_lock(&mem_list_mutex);
> +
>  	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
>  		if ((mem->ua == ua) && (mem->entries == entries)) {
>  			ret = mem;
> +			++mem->used;
>  			break;
>  		}
>  	}
>  
> +	mutex_unlock(&mem_list_mutex);
> +
>  	return ret;
>  }
> -EXPORT_SYMBOL_GPL(mm_iommu_find);
> +EXPORT_SYMBOL_GPL(mm_iommu_get);
>  
>  long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index ad63725..56db071 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -151,12 +151,13 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
>  	struct tce_iommu_prereg *tcemem;
> -	bool found = false;
> +	bool found;
> +	long ret;
>  
>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>  		return -EINVAL;
>  
> -	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
> +	mem = mm_iommu_get(container->mm, vaddr, size >> PAGE_SHIFT);
>  	if (!mem)
>  		return -ENOENT;
>  
> @@ -168,9 +169,13 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
>  	}
>  
>  	if (!found)
> -		return -ENOENT;
> +		ret = -ENOENT;
> +	else
> +		ret = tce_iommu_prereg_free(container, tcemem);
>  
> -	return tce_iommu_prereg_free(container, tcemem);
> +	mm_iommu_put(container->mm, mem);
> +
> +	return ret;
>  }
>  
>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -185,22 +190,24 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  			((vaddr + size) < vaddr))
>  		return -EINVAL;
>  
> -	mem = mm_iommu_find(container->mm, vaddr, entries);
> +	mem = mm_iommu_get(container->mm, vaddr, entries);
>  	if (mem) {
>  		list_for_each_entry(tcemem, &container->prereg_list, next) {
> -			if (tcemem->mem == mem)
> -				return -EBUSY;
> +			if (tcemem->mem == mem) {
> +				ret = -EBUSY;
> +				goto put_exit;
> +			}
>  		}
> +	} else {
> +		ret = mm_iommu_new(container->mm, vaddr, entries, &mem);
> +		if (ret)
> +			return ret;
>  	}
>  
> -	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
> -	if (ret)
> -		return ret;
> -
>  	tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL);
>  	if (!tcemem) {
> -		mm_iommu_put(container->mm, mem);
> -		return -ENOMEM;
> +		ret = -ENOMEM;
> +		goto put_exit;
>  	}
>  
>  	tcemem->mem = mem;
> @@ -209,6 +216,10 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  	container->enabled = true;
>  
>  	return 0;
> +
> +put_exit:
> +	mm_iommu_put(container->mm, mem);
> +	return ret;
>  }
>  
>  static bool tce_page_is_contained(struct page *page, unsigned page_shift)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 03/22] powerpc/mm/iommu: Make mm_iommu_new() fail on existing regions
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-15  5:38     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-15  5:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 1629 bytes --]

On Tue, Nov 13, 2018 at 07:28:04PM +1100, Alexey Kardashevskiy wrote:
> Since we are going to have 2 different preregistering helpers, let's
> make it clear that mm_iommu_new() is only for normal memory
> (i.e. not device memory) and that for existing areas mm_iommu_get() should be
> used instead.
> 
> This removes the check for exact match as the check for overlap is
> enough now.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

It would make sense to fold this into the previous patch, I think, but
it's not worth doing a respin just for that.

> ---
> Changes:
> v2:
> * remove the exact match check
> ---
>  arch/powerpc/mm/mmu_context_iommu.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index babc6ad..580d89e 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -102,12 +102,6 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  
>  	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
>  			next) {
> -		if ((mem->ua == ua) && (mem->entries == entries)) {
> -			++mem->used;
> -			*pmem = mem;
> -			goto unlock_exit;
> -		}
> -
>  		/* Overlap? */
>  		if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
>  				(ua < (mem->ua +
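
For reference, the overlap test being kept is the standard half-open
interval check; a standalone sketch (names hypothetical):

	/*
	 * [a, a + alen) and [b, b + blen) overlap iff each range starts
	 * before the other one ends.
	 */
	static bool ranges_overlap(unsigned long a, unsigned long alen,
			unsigned long b, unsigned long blen)
	{
		return (a < b + blen) && (b < a + alen);
	}

An exact match is just a special case of an overlap, which is why the
dropped check is redundant once mm_iommu_new() is only ever expected
to create brand new regions.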

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 04/22] powerpc/vfio/iommu/kvm: Do not pin device memory
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-16  3:11     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-16  3:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 16705 bytes --]

On Tue, Nov 13, 2018 at 07:28:05PM +1100, Alexey Kardashevskiy wrote:
> This new memory does not have page structs as it is not plugged into
> the host, so gup() will fail anyway.
> 
> This adds 2 helpers:
> - mm_iommu_newdev() to preregister the "memory device" memory so
> the rest of the API can still be used;
> - mm_iommu_is_devmem() to know if a physical address is in one of these
> new regions, which we must avoid unpinning.
> 
> This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
> if the memory is device memory to avoid pfn_to_page().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/iommu.h       |  5 +-
>  arch/powerpc/include/asm/mmu_context.h |  5 ++
>  arch/powerpc/kernel/iommu.c            |  9 ++-
>  arch/powerpc/kvm/book3s_64_vio.c       | 18 +++---
>  arch/powerpc/mm/mmu_context_iommu.c    | 83 +++++++++++++++++++++++---
>  drivers/vfio/vfio_iommu_spapr_tce.c    | 28 +++++----
>  6 files changed, 116 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 35db0cb..a8aeac0 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -218,8 +218,9 @@ extern void iommu_register_group(struct iommu_table_group *table_group,
>  extern int iommu_add_device(struct device *dev);
>  extern void iommu_del_device(struct device *dev);
>  extern int __init tce_iommu_bus_notifier_init(void);
> -extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> -		unsigned long *hpa, enum dma_data_direction *direction);
> +extern long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long *hpa,
> +		enum dma_data_direction *direction);
>  #else
>  static inline void iommu_register_group(struct iommu_table_group *table_group,
>  					int pci_domain_number,
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 2d6b00d..f0f9f3d 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -24,6 +24,9 @@ extern bool mm_iommu_preregistered(struct mm_struct *mm);
>  extern long mm_iommu_new(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries,
>  		struct mm_iommu_table_group_mem_t **pmem);
> +extern long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
> +		unsigned long entries, unsigned long dev_hpa,
> +		struct mm_iommu_table_group_mem_t **pmem);
>  extern long mm_iommu_put(struct mm_struct *mm,
>  		struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_init(struct mm_struct *mm);
> @@ -39,6 +42,8 @@ extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
>  extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua);
> +extern bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
> +		unsigned int pageshift);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
>  #endif
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index f0dc680..8ccfdd9 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -47,6 +47,7 @@
>  #include <asm/fadump.h>
>  #include <asm/vio.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
>  
>  #define DBG(...)
>  
> @@ -993,15 +994,17 @@ int iommu_tce_check_gpa(unsigned long page_shift, unsigned long gpa)
>  }
>  EXPORT_SYMBOL_GPL(iommu_tce_check_gpa);
>  
> -long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> -		unsigned long *hpa, enum dma_data_direction *direction)
> +long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long *hpa,
> +		enum dma_data_direction *direction)
>  {
>  	long ret;
>  
>  	ret = tbl->it_ops->exchange(tbl, entry, hpa, direction);
>  
>  	if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> -			(*direction == DMA_BIDIRECTIONAL)))
> +			(*direction == DMA_BIDIRECTIONAL)) &&
> +			!mm_iommu_is_devmem(mm, *hpa, tbl->it_page_shift))
>  		SetPageDirty(pfn_to_page(*hpa >> PAGE_SHIFT));

What about the equivalent real mode paths?  I guess they won't ever be
called for this case, since they're only used on POWER8.  However, some
checks or WARN_ON() or something to make that clear would be nice.
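
Something along these lines, say (purely illustrative; assumes the
real mode exchange path has an *hpa in hand and that pfn_valid() is a
fair stand-in for "backed by page structs"):

	/* Device memory has no page structs; real mode cannot handle it */
	if (WARN_ON_ONCE(!pfn_valid(*hpa >> PAGE_SHIFT)))
		return H_TOO_HARD;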

>  	/* if (unlikely(ret))
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 62a8d03..532ab797 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -397,12 +397,13 @@ static long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *stt,
>  	return H_SUCCESS;
>  }
>  
> -static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +static void kvmppc_clear_tce(struct mm_struct *mm, struct iommu_table *tbl,
> +		unsigned long entry)
>  {
>  	unsigned long hpa = 0;
>  	enum dma_data_direction dir = DMA_NONE;
>  
> -	iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	iommu_tce_xchg(mm, tbl, entry, &hpa, &dir);
>  }
>  
>  static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> @@ -433,7 +434,7 @@ static long kvmppc_tce_iommu_do_unmap(struct kvm *kvm,
>  	unsigned long hpa = 0;
>  	long ret;
>  
> -	if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
> +	if (WARN_ON_ONCE(iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir)))
>  		return H_TOO_HARD;
>  
>  	if (dir == DMA_NONE)
> @@ -441,7 +442,7 @@ static long kvmppc_tce_iommu_do_unmap(struct kvm *kvm,
>  
>  	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
>  	if (ret != H_SUCCESS)
> -		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +		iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir);
>  
>  	return ret;
>  }
> @@ -487,7 +488,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
>  	if (mm_iommu_mapped_inc(mem))
>  		return H_TOO_HARD;
>  
> -	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	ret = iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir);
>  	if (WARN_ON_ONCE(ret)) {
>  		mm_iommu_mapped_dec(mem);
>  		return H_TOO_HARD;
> @@ -566,7 +567,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  					entry, ua, dir);
>  
>  		if (ret != H_SUCCESS) {
> -			kvmppc_clear_tce(stit->tbl, entry);
> +			kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry);
>  			goto unlock_exit;
>  		}
>  	}
> @@ -655,7 +656,8 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  					iommu_tce_direction(tce));
>  
>  			if (ret != H_SUCCESS) {
> -				kvmppc_clear_tce(stit->tbl, entry);
> +				kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl,
> +						entry);
>  				goto unlock_exit;
>  			}
>  		}
> @@ -704,7 +706,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  				return ret;
>  
>  			WARN_ON_ONCE(1);
> -			kvmppc_clear_tce(stit->tbl, entry);
> +			kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry);
>  		}
>  	}
>  
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index 580d89e..62fe5fe 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -47,6 +47,8 @@ struct mm_iommu_table_group_mem_t {
>  		struct page **hpages;	/* vmalloc'ed */
>  		phys_addr_t *hpas;
>  	};
> +#define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
> +	u64 dev_hpa;		/* Device memory base address */
>  };
>  
>  static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
> @@ -89,7 +91,8 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>  
> -long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> +static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
> +		unsigned long entries, unsigned long dev_hpa,
>  		struct mm_iommu_table_group_mem_t **pmem)
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
> @@ -112,11 +115,13 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  
>  	}
>  
> -	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
> -	if (ret)
> -		goto unlock_exit;
> +	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
> +		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
> +		if (ret)
> +			goto unlock_exit;
>  
> -	locked_entries = entries;
> +		locked_entries = entries;
> +	}
>  
>  	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
>  	if (!mem) {
> @@ -124,6 +129,13 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		goto unlock_exit;
>  	}
>  
> +	if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
> +		mem->pageshift = __ffs(dev_hpa | (entries << PAGE_SHIFT));
> +		mem->dev_hpa = dev_hpa;
> +		goto good_exit;
> +	}
> +	mem->dev_hpa = MM_IOMMU_TABLE_INVALID_HPA;
> +
>  	/*
>  	 * For a starting point for a maximum page size calculation
>  	 * we use @ua and @entries natural alignment to allow IOMMU pages
> @@ -180,6 +192,7 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  
>  	}
>  
> +good_exit:
>  	atomic64_set(&mem->mapped, 1);
>  	mem->used = 1;
>  	mem->ua = ua;
> @@ -196,13 +209,31 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  
>  	return ret;
>  }
> +
> +long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> +		struct mm_iommu_table_group_mem_t **pmem)
> +{
> +	return mm_iommu_do_alloc(mm, ua, entries, MM_IOMMU_TABLE_INVALID_HPA,
> +			pmem);
> +}
>  EXPORT_SYMBOL_GPL(mm_iommu_new);
>  
> +long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
> +		unsigned long entries, unsigned long dev_hpa,
> +		struct mm_iommu_table_group_mem_t **pmem)
> +{
> +	return mm_iommu_do_alloc(mm, ua, entries, dev_hpa, pmem);
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_newdev);
> +
>  static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
>  {
>  	long i;
>  	struct page *page = NULL;
>  
> +	if (!mem->hpas)
> +		return;
> +
>  	for (i = 0; i < mem->entries; ++i) {
>  		if (!mem->hpas[i])
>  			continue;
> @@ -244,6 +275,7 @@ static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
>  long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
>  {
>  	long ret = 0;
> +	unsigned long entries, dev_hpa;
>  
>  	mutex_lock(&mem_list_mutex);
>  
> @@ -265,9 +297,12 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
>  	}
>  
>  	/* @mapped became 0 so now mappings are disabled, release the region */
> +	entries = mem->entries;
> +	dev_hpa = mem->dev_hpa;
>  	mm_iommu_release(mem);
>  
> -	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
> +	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
> +		mm_iommu_adjust_locked_vm(mm, entries, false);
>  
>  unlock_exit:
>  	mutex_unlock(&mem_list_mutex);
> @@ -337,7 +372,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
>  {
>  	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
> -	u64 *va = &mem->hpas[entry];
> +	u64 *va;
>  
>  	if (entry >= mem->entries)
>  		return -EFAULT;
> @@ -345,6 +380,12 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>  	if (pageshift > mem->pageshift)
>  		return -EFAULT;
>  
> +	if (!mem->hpas) {
> +		*hpa = mem->dev_hpa + (ua - mem->ua);
> +		return 0;
> +	}
> +
> +	va = &mem->hpas[entry];
>  	*hpa = (*va & MM_IOMMU_TABLE_GROUP_PAGE_MASK) | (ua & ~PAGE_MASK);
>  
>  	return 0;
> @@ -355,7 +396,6 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
>  		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
>  {
>  	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
> -	void *va = &mem->hpas[entry];
>  	unsigned long *pa;
>  
>  	if (entry >= mem->entries)
> @@ -364,7 +404,12 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
>  	if (pageshift > mem->pageshift)
>  		return -EFAULT;
>  
> -	pa = (void *) vmalloc_to_phys(va);
> +	if (!mem->hpas) {
> +		*hpa = mem->dev_hpa + (ua - mem->ua);
> +		return 0;
> +	}
> +
> +	pa = (void *) vmalloc_to_phys(&mem->hpas[entry]);
>  	if (!pa)
>  		return -EFAULT;
>  
> @@ -394,6 +439,26 @@ extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua)
>  	*pa |= MM_IOMMU_TABLE_GROUP_PAGE_DIRTY;
>  }
>  
> +extern bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
> +		unsigned int pageshift)
> +{
> +	struct mm_iommu_table_group_mem_t *mem;
> +	const unsigned long pagesize = 1UL << pageshift;
> +
> +	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
> +		if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
> +			continue;
> +
> +		if ((mem->dev_hpa <= hpa) &&
> +				(hpa + pagesize <= mem->dev_hpa +
> +				 (mem->entries << PAGE_SHIFT)))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_is_devmem);
> +
>  long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
>  {
>  	if (atomic64_inc_not_zero(&mem->mapped))
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 56db071..ed89137 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -222,8 +222,15 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  	return ret;
>  }
>  
> -static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> +static bool tce_page_is_contained(struct mm_struct *mm, unsigned long hpa,
> +		unsigned int page_shift)
>  {
> +	struct page *page;
> +
> +	if (mm_iommu_is_devmem(mm, hpa, page_shift))
> +		return true;
> +
> +	page = pfn_to_page(hpa >> PAGE_SHIFT);
>  	/*
>  	 * Check that the TCE table granularity is not bigger than the size of
>  	 * a page we just found. Otherwise the hardware can get access to
> @@ -499,7 +506,8 @@ static int tce_iommu_clear(struct tce_container *container,
>  
>  		direction = DMA_NONE;
>  		oldhpa = 0;
> -		ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);
> +		ret = iommu_tce_xchg(container->mm, tbl, entry, &oldhpa,
> +				&direction);
>  		if (ret)
>  			continue;
>  
> @@ -537,7 +545,6 @@ static long tce_iommu_build(struct tce_container *container,
>  		enum dma_data_direction direction)
>  {
>  	long i, ret = 0;
> -	struct page *page;
>  	unsigned long hpa;
>  	enum dma_data_direction dirtmp;
>  
> @@ -548,15 +555,16 @@ static long tce_iommu_build(struct tce_container *container,
>  		if (ret)
>  			break;
>  
> -		page = pfn_to_page(hpa >> PAGE_SHIFT);
> -		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> +		if (!tce_page_is_contained(container->mm, hpa,
> +				tbl->it_page_shift)) {
>  			ret = -EPERM;
>  			break;
>  		}
>  
>  		hpa |= offset;
>  		dirtmp = direction;
> -		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
> +		ret = iommu_tce_xchg(container->mm, tbl, entry + i, &hpa,
> +				&dirtmp);
>  		if (ret) {
>  			tce_iommu_unuse_page(container, hpa);
>  			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
> @@ -583,7 +591,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		enum dma_data_direction direction)
>  {
>  	long i, ret = 0;
> -	struct page *page;
>  	unsigned long hpa;
>  	enum dma_data_direction dirtmp;
>  
> @@ -596,8 +603,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		if (ret)
>  			break;
>  
> -		page = pfn_to_page(hpa >> PAGE_SHIFT);
> -		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> +		if (!tce_page_is_contained(container->mm, hpa,
> +				tbl->it_page_shift)) {
>  			ret = -EPERM;
>  			break;
>  		}
> @@ -610,7 +617,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
>  		if (mm_iommu_mapped_inc(mem))
>  			break;
>  
> -		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
> +		ret = iommu_tce_xchg(container->mm, tbl, entry + i, &hpa,
> +				&dirtmp);
>  		if (ret) {
>  			/* dirtmp cannot be DMA_NONE here */
>  			tce_iommu_unuse_page_v2(container, tbl, entry + i);
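
For illustration, preregistering a chunk of GPU memory with the new
helper might look like this (gpu_ua, gpu_hpa and entries are
hypothetical; the real caller shows up later in the series):

	struct mm_iommu_table_group_mem_t *mem;
	unsigned long hpa = 0;
	long ret;

	ret = mm_iommu_newdev(current->mm, gpu_ua, entries, gpu_hpa, &mem);
	if (!ret) {
		/* resolves within [gpu_hpa, gpu_hpa + (entries << PAGE_SHIFT)) */
		ret = mm_iommu_ua_to_hpa(mem, gpu_ua, PAGE_SHIFT, &hpa);

		/* ... use the mapping ... */

		mm_iommu_put(current->mm, mem);
	}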

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 09/22] powerpc/pseries/iommu: Force default DMA window removal
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-16  4:54     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-16  4:54 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 5290 bytes --]

On Tue, Nov 13, 2018 at 07:28:10PM +1100, Alexey Kardashevskiy wrote:
> It is quite common for a device to support more than 32 bits but fewer
> than 64 bits of DMA address space; for example, GPUs often support
> 42..50 bits. However, the pseries platform only allows a huge DMA window
> (the one which allows the use of more than 2GB of DMA space) for
> 64bit-capable devices, mostly because:
> 
> 1. we may have 32bit and >32bit devices on the same IOMMU domain and
> we cannot place the new big window where the 32bit one is located;
> 
> 2. the existing hardware only supports the second window at a very high
> offset of 1<<59 == 0x0800.0000.0000.0000.
> 
> So in order to allow 33..59bit DMA, we have to remove the default DMA
> window and place a huge one there instead.
> 
> The PAPR spec says that the platform may decide not to use the default
> window and remove it using DDW RTAS calls. There are a few possible ways
> for the platform to decide:
> 
> 1. look at the device IDs and decide in advance that such and such
> devices are capable of more than 32bit DMA (powernv's sketchy bypass
> does something like this - it drops the default window if all devices
> on the PE are from the same vendor) - this is not great as it involves
> guessing because, unlike sketchy bypass, the GPU case involves 2 vendor
> ids and does not scale;
> 
> 2. advertise 1 available DMA window in the hypervisor via
> ibm,query-pe-dma-window so the pseries platform could take it as a clue
> that if more bits for DMA are needed, it has to remove the default
> window - this is not great as it is an implicit clue rather than a direct
> instruction;
> 
> 3. removing the default DMA window altogether is not really an option as
> PAPR mandates its presence at guest boot time;
> 
> 4. make the hypervisor explicitly tell the guest that the default window
> should be removed, so the guest does not have to think hard and can
> simply do what is requested; this is what this patch does.

This approach only makes sense if the hypervisor has better
information as to what to do than the guest does.  It's not clear to
me why that would be the case.  Aren't the DMA capabilities of the
device something the driver should know, in which case it can decide
based on that?

> 
> This makes use of the last approach and exploits a new
> "qemu,dma-force-remove-default" flag in a vPHB.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 28 +++++++++++++++++++++++---
>  1 file changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 9ece42f..78473ac 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -54,6 +54,7 @@
>  #include "pseries.h"
>  
>  #define DDW_INVALID_OFFSET	((uint64_t)-1)
> +#define DDW_INVALID_LIOBN	((uint32_t)-1)
>  
>  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
>  {
> @@ -977,7 +978,8 @@ static LIST_HEAD(failed_ddw_pdn_list);
>   *
>   * returns the dma offset for use by dma_set_mask
>   */
> -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> +static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
> +		u32 default_liobn)
>  {
>  	int len, ret;
>  	struct ddw_query_response query;
> @@ -1022,6 +1024,16 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  	if (ret)
>  		goto out_failed;
>  
> +	/*
> +	 * The device tree has a request to force remove the default window,
> +	 * do this.
> +	 */
> +	if (default_liobn != DDW_INVALID_LIOBN && (!ddw_avail[2] ||
> +			rtas_call(ddw_avail[2], 1, 1, NULL, default_liobn))) {
> +		dev_dbg(&dev->dev, "Could not remove window");
> +		goto out_failed;
> +	}
> +
>         /*
>  	 * Query if there is a second window of size to map the
>  	 * whole partition.  Query returns number of windows, largest
> @@ -1212,7 +1224,7 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
>  	pdev = to_pci_dev(dev);
>  
>  	/* only attempt to use a new window if 64-bit DMA is requested */
> -	if (!disable_ddw && dma_mask == DMA_BIT_MASK(64)) {
> +	if (!disable_ddw && dma_mask > DMA_BIT_MASK(32)) {
>  		dn = pci_device_to_OF_node(pdev);
>  		dev_dbg(dev, "node is %pOF\n", dn);
>  
> @@ -1229,7 +1241,17 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
>  				break;
>  		}
>  		if (pdn && PCI_DN(pdn)) {
> -			dma_offset = enable_ddw(pdev, pdn);
> +			u32 liobn = DDW_INVALID_LIOBN;
> +			int ret = of_device_is_compatible(pdn, "IBM,npu-vphb");
> +
> +			if (ret) {
> +				dma_window = of_get_property(pdn,
> +						"ibm,dma-window", NULL);
> +				if (dma_window)
> +					liobn = be32_to_cpu(dma_window[0]);
> +			}
> +
> +			dma_offset = enable_ddw(pdev, pdn, liobn);
>  			if (dma_offset != DDW_INVALID_OFFSET) {
>  				dev_info(dev, "Using 64-bit direct DMA at offset %llx\n", dma_offset);
>  				set_dma_offset(dev, dma_offset);
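
As a side note, the commit log talks about a "qemu,dma-force-remove-default"
flag while the code above keys off the "IBM,npu-vphb" compatible string.
A guest-side check for the flag itself could be as simple as this sketch
(hypothetical, not what the posted patch does):

	if (of_property_read_bool(pdn, "qemu,dma-force-remove-default")) {
		dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
		if (dma_window)
			liobn = be32_to_cpu(dma_window[0]);
	}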

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 10/22] powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-16  5:23     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-16  5:23 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 3356 bytes --]

On Tue, Nov 13, 2018 at 07:28:11PM +1100, Alexey Kardashevskiy wrote:
> We might have memory@ nodes with "linux,usable-memory" set to zero
> (for example, to replicate powernv's behaviour for GPU coherent memory),
> which means that the memory needs extra initialization. Since it can be
> used afterwards, the pseries platform will try mapping it for DMA, so
> the DMA window needs to cover those memory regions too.
> 
> This walks through the memory nodes to find the highest RAM address to
> let a huge DMA window cover that too in case this memory gets onlined
> later.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 43 +++++++++++++++++++++++++-
>  1 file changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 78473ac..f818737 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -967,6 +967,47 @@ struct failed_ddw_pdn {
>  
>  static LIST_HEAD(failed_ddw_pdn_list);
>  
> +static unsigned long read_n_cells(int n, const __be32 **buf)
> +{
> +	unsigned long result = 0;
> +
> +	while (n--) {
> +		result = (result << 32) | of_read_number(*buf, 1);
> +		(*buf)++;
> +	}
> +	return result;
> +}

Um.. this appears to be re-implementing of_read_number() in terms of
of_read_number().   Wat!?
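
For comparison, of_read_number() already takes a cell count, so all
that is genuinely new here is the pointer advance; a minimal sketch:

	static unsigned long read_n_cells(int n, const __be32 **buf)
	{
		unsigned long val = of_read_number(*buf, n);

		*buf += n;
		return val;
	}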

> +static phys_addr_t ddw_memory_hotplug_max(void)
> +{
> +	phys_addr_t max_addr = memory_hotplug_max();
> +	struct device_node *memory;
> +
> +	for_each_node_by_type(memory, "memory") {
> +		unsigned long start, size;
> +		int ranges, n_mem_addr_cells, n_mem_size_cells, len;
> +		const __be32 *memcell_buf;
> +
> +		memcell_buf = of_get_property(memory, "reg", &len);
> +		if (!memcell_buf || len <= 0)
> +			continue;
> +
> +		n_mem_addr_cells = of_n_addr_cells(memory);
> +		n_mem_size_cells = of_n_size_cells(memory);
> +
> +		/* ranges in cell */
> +		ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
> +
> +		/* these are order-sensitive, and modify the buffer pointer */
> +		start = read_n_cells(n_mem_addr_cells, &memcell_buf);
> +		size = read_n_cells(n_mem_size_cells, &memcell_buf);
> +
> +		max_addr = max_t(phys_addr_t, max_addr, start + size);
> +	}
> +
> +	return max_addr;
> +}

Is there really no existing place where we keep track of the maximum
possible memory address?
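
For concreteness, a worked example of the cell arithmetic in the hunk
above (the property size here is assumed, not taken from the patch):

/*
 * With #address-cells = 2 and #size-cells = 2 (the usual 64-bit layout),
 * a 16-byte "reg" property is len >> 2 = 4 cells, so
 * ranges = 4 / (2 + 2) = 1; read_n_cells() then consumes two cells for
 * the start and two for the size of that single range.
 */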

>  /*
>   * If the PE supports dynamic dma windows, and there is space for a table
>   * that can map all pages in a linear offset, then setup such a table,
> @@ -1067,7 +1108,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
>  	}
>  	/* verify the window * number of ptes will map the partition */
>  	/* check largest block * page size > max memory hotplug addr */
> -	max_addr = memory_hotplug_max();
> +	max_addr = ddw_memory_hotplug_max();
>  	if (query.largest_available_block < (max_addr >> page_shift)) {
>  		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
>  			  "%llu-sized pages\n", max_addr,  query.largest_available_block,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 11/22] powerpc/pseries/npu: Enable platform support
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-16  5:25     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-16  5:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 1894 bytes --]

On Tue, Nov 13, 2018 at 07:28:12PM +1100, Alexey Kardashevskiy wrote:
> We already changed the NPU API for GPUs not to call OPAL; the remaining
> bit is initializing the NPU structures.
> 
> This uses a new QEMU capability which marks NPU-enabled vPHBs as
> "IBM,npu-vphb" and initializes an NPU structure per vPHB.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/platforms/pseries/pci.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/pci.c b/arch/powerpc/platforms/pseries/pci.c
> index 41d8a4d..a50d5e4 100644
> --- a/arch/powerpc/platforms/pseries/pci.c
> +++ b/arch/powerpc/platforms/pseries/pci.c
> @@ -29,6 +29,7 @@
>  #include <asm/pci-bridge.h>
>  #include <asm/prom.h>
>  #include <asm/ppc-pci.h>
> +#include <asm/pci.h>
>  #include "pseries.h"
>  
>  #if 0
> @@ -237,6 +238,8 @@ static void __init pSeries_request_regions(void)
>  
>  void __init pSeries_final_fixup(void)
>  {
> +	struct pci_controller *hose;
> +
>  	pSeries_request_regions();
>  
>  	eeh_probe_devices();
> @@ -246,6 +249,9 @@ void __init pSeries_final_fixup(void)
>  	ppc_md.pcibios_sriov_enable = pseries_pcibios_sriov_enable;
>  	ppc_md.pcibios_sriov_disable = pseries_pcibios_sriov_disable;
>  #endif
> +	list_for_each_entry(hose, &hose_list, list_node)
> +		if (of_device_is_compatible(hose->dn, "IBM,npu-vphb"))
> +			pnv_npu2_init(hose);

I take it from this that the NPUs are showing up with a compatible
property that lists the normal PHB value as well as IBM,npu-vphb.
Since, AIUI, the NPUs act quite differently from other (real) PHBs,
this seems bogus.  Shouldn't they be probed separately?
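
To illustrate the concern (the property contents below are hypothetical;
only "IBM,npu-vphb" comes from the patch):

/*
 * A vPHB node of the kind described would carry both the generic PHB
 * compatible and the NPU marker, e.g.
 *
 *	compatible = "IBM,Logical_PHB", "IBM,npu-vphb";
 *
 * so it is probed as an ordinary PHB first, and the NPU-specific setup
 * is bolted on in pSeries_final_fixup() rather than the node being
 * probed on its own compatible.
 */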

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 14/22] powerpc/iommu_api: Move IOMMU groups setup to a single place
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-19  0:15     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-19  0:15 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 7189 bytes --]

On Tue, Nov 13, 2018 at 07:28:15PM +1100, Alexey Kardashevskiy wrote:
> Registering new IOMMU groups and adding devices to them are separated
> in the code, and the latter is buried in the DMA setup code, where it
> does not really belong.
> 
> This moves IOMMU group setup to a separate helper which registers a group
> and adds devices as before. This does not make a difference as IOMMU
> groups are not used anyway; the only dependency here is that
> iommu_add_device() requires a valid pointer to an iommu_table
> (set by set_iommu_table_base()).
> 
> To keep the old behaviour, this does not add new IOMMU groups for PEs
> with no DMA weight and also skips NVLink bridges which do not have
> pci_controller_ops::setup_bridge (the normal way of adding PEs).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
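
A compressed sketch of the dependency called out above (both calls appear
in the diff below; the ordering is the point):

	/* the device must be given its TCE table first ... */
	set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
	/* ... only then does adding it to the IOMMU group see a valid table */
	iommu_add_device(&pe->table_group, &dev->dev);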

> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 80 +++++++++++++++++++----
>  1 file changed, 66 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index f36a802..7f4904a 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1269,6 +1269,8 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus)
>  		pnv_ioda_setup_npu_PE(pdev);
>  }
>  
> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe);
> +
>  static void pnv_pci_ioda_setup_PEs(void)
>  {
>  	struct pci_controller *hose;
> @@ -1591,6 +1593,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>  		mutex_unlock(&phb->ioda.pe_list_mutex);
>  
>  		pnv_pci_ioda2_setup_dma_pe(phb, pe);
> +		pnv_ioda_setup_bus_iommu_group(pe);
>  	}
>  }
>  
> @@ -1930,21 +1933,16 @@ static u64 pnv_pci_ioda_dma_get_required_mask(struct pci_dev *pdev)
>  	return mask;
>  }
>  
> -static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
> -				   struct pci_bus *bus,
> -				   bool add_to_group)
> +static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
>  {
>  	struct pci_dev *dev;
>  
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
>  		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
>  		set_dma_offset(&dev->dev, pe->tce_bypass_base);
> -		if (add_to_group)
> -			iommu_add_device(&pe->table_group, &dev->dev);
>  
>  		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
> -			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
> -					add_to_group);
> +			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
>  	}
>  }
>  
> @@ -2374,7 +2372,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
>  	iommu_init_table(tbl, phb->hose->node);
>  
>  	if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
> -		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
> +		pnv_ioda_setup_bus_dma(pe, pe->pbus);
>  
>  	return;
>   fail:
> @@ -2607,7 +2605,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>  	pnv_pci_ioda2_set_bypass(pe, false);
>  	pnv_pci_ioda2_unset_window(&pe->table_group, 0);
>  	if (pe->pbus)
> -		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> +		pnv_ioda_setup_bus_dma(pe, pe->pbus);
>  	iommu_tce_table_put(tbl);
>  }
>  
> @@ -2618,7 +2616,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
>  
>  	pnv_pci_ioda2_setup_default_config(pe);
>  	if (pe->pbus)
> -		pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> +		pnv_ioda_setup_bus_dma(pe, pe->pbus);
>  }
>  
>  static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> @@ -2735,12 +2733,68 @@ static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = {
>  	.release_ownership = pnv_ioda2_release_ownership,
>  };
>  
> +static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe *pe,
> +		struct pci_bus *bus)
> +{
> +	struct pci_dev *dev;
> +
> +	list_for_each_entry(dev, &bus->devices, bus_list) {
> +		iommu_add_device(&pe->table_group, &dev->dev);
> +
> +		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
> +			pnv_ioda_setup_bus_iommu_group_add_devices(pe,
> +					dev->subordinate);
> +	}
> +}
> +
> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
> +{
> +	if (!pnv_pci_ioda_pe_dma_weight(pe))
> +		return;
> +
> +	iommu_register_group(&pe->table_group, pe->phb->hose->global_number,
> +			pe->pe_number);
> +
> +	/*
> +	 * set_iommu_table_base(&pe->pdev->dev, tbl) should have been called
> +	 * by now
> +	 */
> +	if (pe->flags & PNV_IODA_PE_DEV)
> +		iommu_add_device(&pe->table_group, &pe->pdev->dev);
> +	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
> +		pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
> +}
> +
>  static void pnv_pci_ioda_setup_iommu_api(void)
>  {
>  	struct pci_controller *hose, *tmp;
>  	struct pnv_phb *phb;
>  	struct pnv_ioda_pe *pe, *gpe;
>  
> +	/*
> +	 * There are 4 types of PEs:
> +	 * - PNV_IODA_PE_BUS: a downstream port with an adapter,
> +	 *   created from pnv_pci_setup_bridge();
> +	 * - PNV_IODA_PE_BUS_ALL: a PCI-PCIX bridge with devices behind it,
> +	 *   created from pnv_pci_setup_bridge();
> +	 * - PNV_IODA_PE_VF: a SRIOV virtual function,
> +	 *   created from pnv_pcibios_sriov_enable();
> +	 * - PNV_IODA_PE_DEV: an NPU or OCAPI device,
> +	 *   created from pnv_pci_ioda_fixup().
> +	 *
> +	 * Normally a PE is represented by an IOMMU group, however for
> +	 * devices with side channels the groups need to be more strict.
> +	 */
> +	list_for_each_entry(hose, &hose_list, list_node) {
> +		phb = hose->private_data;
> +
> +		if (phb->type == PNV_PHB_NPU_NVLINK)
> +			continue;
> +
> +		list_for_each_entry(pe, &phb->ioda.pe_list, list)
> +			pnv_ioda_setup_bus_iommu_group(pe);
> +	}
> +
>  	/*
>  	 * Now we have all PHBs discovered, time to add NPU devices to
>  	 * the corresponding IOMMU groups.
> @@ -2759,6 +2813,7 @@ static void pnv_pci_ioda_setup_iommu_api(void)
>  	}
>  }
>  #else /* !CONFIG_IOMMU_API */
> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe) { }
>  static void pnv_pci_ioda_setup_iommu_api(void) { };
>  #endif
>  
> @@ -2801,9 +2856,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>  	/* TVE #1 is selected by PCI address bit 59 */
>  	pe->tce_bypass_base = 1ull << 59;
>  
> -	iommu_register_group(&pe->table_group, phb->hose->global_number,
> -			pe->pe_number);
> -
>  	/* The PE will reserve all possible 32-bits space */
>  	pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
>  		phb->ioda.m32_pci_base);
> @@ -2824,7 +2876,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>  		return;
>  
>  	if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
> -		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
> +		pnv_ioda_setup_bus_dma(pe, pe->pbus);
>  }
>  
>  #ifdef CONFIG_PCI_MSI

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 15/22] powerpc/powernv: Reference iommu_table while it is linked to a group
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-19  0:20     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-19  0:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 2229 bytes --]

On Tue, Nov 13, 2018 at 07:28:16PM +1100, Alexey Kardashevskiy wrote:
> The iommu_table pointer stored in iommu_table_group may get stale
> by accident; this adds reference counting and removes a comment that
> is now redundant.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
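
In short, the pairing the patch introduces (calls as in the diff below):

	/* link (pnv_pci_link_table_and_group): the group takes a reference */
	table_group->tables[num] = iommu_tce_table_get(tbl);

	/* unlink (pnv_pci_unlink_table_and_group): the reference is dropped,
	 * which may free the table if it was the last one */
	iommu_tce_table_put(tbl);
	table_group->tables[i] = NULL;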

> ---
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 3 ++-
>  arch/powerpc/platforms/powernv/pci-ioda.c     | 4 ----
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> index 7639b21..697449a 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> @@ -368,6 +368,7 @@ void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
>  	found = false;
>  	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>  		if (table_group->tables[i] == tbl) {
> +			iommu_tce_table_put(tbl);
>  			table_group->tables[i] = NULL;
>  			found = true;
>  			break;
> @@ -393,7 +394,7 @@ long pnv_pci_link_table_and_group(int node, int num,
>  	tgl->table_group = table_group;
>  	list_add_rcu(&tgl->next, &tbl->it_group_list);
>  
> -	table_group->tables[num] = tbl;
> +	table_group->tables[num] = iommu_tce_table_get(tbl);
>  
>  	return 0;
>  }
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 7f4904a..7caf373 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2716,10 +2716,6 @@ static long pnv_pci_ioda2_npu_unset_window(
>  
>  static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
>  {
> -	/*
> -	 * Detach NPU first as pnv_ioda2_take_ownership() will destroy
> -	 * the iommu_table if 32bit DMA is enabled.
> -	 */
>  	pnv_npu_take_ownership(gpe_table_group_to_npe(table_group));
>  	pnv_ioda2_take_ownership(table_group);
>  }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 16/22] powerpc/powernv: Add purge cache OPAL call
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-19  0:21     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-19  0:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 3298 bytes --]

On Tue, Nov 13, 2018 at 07:28:17PM +1100, Alexey Kardashevskiy wrote:
> Flushing caches using the dcbf instruction takes quite some time if we
> need to flush gigabytes (16GB takes more than 15s); OPAL just added
> a big hammer to flush all caches.
> 
> This adds opal_purge_cache(), which will be used later to flush caches
> for coherent GPU memory that might suddenly become unavailable if a GPU
> is reset and its NVLink is not (re)trained.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
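
A hedged usage sketch (the call itself is from the diff below; the error
handling is illustrative):

	int rc = opal_purge_cache();

	if (rc != OPAL_SUCCESS)
		pr_warn("opal_purge_cache() failed: %d\n", rc);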

> ---
>  arch/powerpc/include/asm/opal-api.h            | 3 ++-
>  arch/powerpc/include/asm/opal.h                | 1 +
>  arch/powerpc/platforms/powernv/opal.c          | 1 +
>  arch/powerpc/platforms/powernv/opal-wrappers.S | 1 +
>  4 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
> index 870fb7b..55bc640 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -210,7 +210,8 @@
>  #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR		164
>  #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR		165
>  #define	OPAL_NX_COPROC_INIT			167
> -#define OPAL_LAST				167
> +#define OPAL_CLEAR_CACHE			170
> +#define OPAL_LAST				170
>  
>  #define QUIESCE_HOLD			1 /* Spin all calls at entry */
>  #define QUIESCE_REJECT			2 /* Fail all calls with OPAL_BUSY */
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index ff38664..7db576e 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -294,6 +294,7 @@ int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
>  int opal_sensor_group_clear(u32 group_hndl, int token);
>  int opal_sensor_group_enable(u32 group_hndl, int token, bool enable);
>  int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
> +int opal_purge_cache(void);
>  
>  s64 opal_signal_system_reset(s32 cpu);
>  s64 opal_quiesce(u64 shutdown_type, s32 cpu);
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index beed86f..44ce824 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -1113,3 +1113,4 @@ EXPORT_SYMBOL_GPL(opal_int_eoi);
>  EXPORT_SYMBOL_GPL(opal_error_code);
>  /* Export the below symbol for NX compression */
>  EXPORT_SYMBOL(opal_nx_coproc_init);
> +EXPORT_SYMBOL(opal_purge_cache);
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index 2515282..5b886a6 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -331,3 +331,4 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,		OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
>  OPAL_CALL(opal_sensor_read_u64,			OPAL_SENSOR_READ_U64);
>  OPAL_CALL(opal_sensor_group_enable,		OPAL_SENSOR_GROUP_ENABLE);
>  OPAL_CALL(opal_nx_coproc_init,			OPAL_NX_COPROC_INIT);
> +OPAL_CALL(opal_purge_cache,			OPAL_CLEAR_CACHE);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 17/22] powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-19  0:24     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-19  0:24 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 7483 bytes --]

On Tue, Nov 13, 2018 at 07:28:18PM +1100, Alexey Kardashevskiy wrote:
> At the moment the NPU IOMMU is manipulated directly from the IODA2 PCI
> PE code; the PCI PE acts as a master to the NPU PE. Soon we will have
> compound IOMMU groups with several PEs from several different PHBs
> (such as interconnected GPUs and NPUs), so there will be no single
> master but one big IOMMU group.
> 
> This makes a first step and converts an NPU PE to a table group.
> 
> This should cause no behavioral change. Note that
> pnv_npu_release_ownership() has never been implemented.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
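
For orientation, the call-path change in compressed form (all names
appear in the diff below):

	/* before: the IODA2 code called the exported NPU helper directly */
	pnv_npu_set_window(npe, 0, tbl);

	/* after: dispatch via the PE's table_group ops, which a compound
	 * group can later invoke for each member PE */
	npe->table_group.ops->set_window(&npe->table_group, 0, tbl);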

> ---
>  arch/powerpc/platforms/powernv/pci.h      |  5 ----
>  arch/powerpc/platforms/powernv/npu-dma.c  | 29 ++++++++++++++++++-----
>  arch/powerpc/platforms/powernv/pci-ioda.c | 17 +++++++------
>  3 files changed, 33 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index ddb4f02..cf9f748 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -216,11 +216,6 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
>  extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
> -extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
> -		struct iommu_table *tbl);
> -extern long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num);
> -extern void pnv_npu_take_ownership(struct pnv_ioda_pe *npe);
> -extern void pnv_npu_release_ownership(struct pnv_ioda_pe *npe);
>  
>  /* pci-ioda-tce.c */
>  #define POWERNV_IOMMU_DEFAULT_LEVELS	1
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 4b60f43..1792c7e 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -121,9 +121,11 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
>  	return pe;
>  }
>  
> -long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
> +static long pnv_npu_set_window(struct iommu_table_group *table_group, int num,
>  		struct iommu_table *tbl)
>  {
> +	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
> +			table_group);
>  	struct pnv_phb *phb = npe->phb;
>  	int64_t rc;
>  	const unsigned long size = tbl->it_indirect_levels ?
> @@ -155,8 +157,10 @@ long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
>  	return 0;
>  }
>  
> -long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num)
> +static long pnv_npu_unset_window(struct iommu_table_group *table_group, int num)
>  {
> +	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
> +			table_group);
>  	struct pnv_phb *phb = npe->phb;
>  	int64_t rc;
>  
> @@ -198,7 +202,8 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
>  	if (!gpe)
>  		return;
>  
> -	rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]);
> +	rc = pnv_npu_set_window(&npe->table_group, 0,
> +			gpe->table_group.tables[0]);
>  
>  	/*
>  	 * NVLink devices use the same TCE table configuration as
> @@ -223,7 +228,7 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
>  	if (phb->type != PNV_PHB_NPU_NVLINK || !npe->pdev)
>  		return -EINVAL;
>  
> -	rc = pnv_npu_unset_window(npe, 0);
> +	rc = pnv_npu_unset_window(&npe->table_group, 0);
>  	if (rc != OPAL_SUCCESS)
>  		return rc;
>  
> @@ -276,9 +281,12 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
>  	}
>  }
>  
> +#ifdef CONFIG_IOMMU_API
>  /* Switch ownership from platform code to external user (e.g. VFIO) */
> -void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
> +static void pnv_npu_take_ownership(struct iommu_table_group *table_group)
>  {
> +	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
> +			table_group);
>  	struct pnv_phb *phb = npe->phb;
>  	int64_t rc;
>  
> @@ -289,7 +297,7 @@ void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
>  	 * if it was enabled at the moment of ownership change.
>  	 */
>  	if (npe->table_group.tables[0]) {
> -		pnv_npu_unset_window(npe, 0);
> +		pnv_npu_unset_window(&npe->table_group, 0);
>  		return;
>  	}
>  
> @@ -304,6 +312,12 @@ void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
>  	pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
>  }
>  
> +static struct iommu_table_group_ops pnv_pci_npu_ops = {
> +	.set_window = pnv_npu_set_window,
> +	.unset_window = pnv_npu_unset_window,
> +	.take_ownership = pnv_npu_take_ownership,
> +};
> +
>  struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
>  {
>  	struct pnv_phb *phb = npe->phb;
> @@ -314,6 +328,8 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
>  	if (!gpe || !gpdev)
>  		return NULL;
>  
> +	npe->table_group.ops = &pnv_pci_npu_ops;
> +
>  	list_for_each_entry(npdev, &pbus->devices, bus_list) {
>  		gptmp = pnv_pci_get_gpu_dev(npdev);
>  
> @@ -326,6 +342,7 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
>  
>  	return gpe;
>  }
> +#endif /* !CONFIG_IOMMU_API */
>  
>  /*
>   * NPU2 ATS
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 7caf373..04639ae 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2677,14 +2677,14 @@ static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group,
>  		return ret;
>  
>  	if (table_group->tables[num2])
> -		pnv_npu_unset_window(npe, num2);
> +		npe->table_group.ops->unset_window(&npe->table_group, num2);
>  
> -	ret = pnv_npu_set_window(npe, num, tbl);
> +	ret = npe->table_group.ops->set_window(&npe->table_group, num, tbl);
>  	if (ret) {
>  		pnv_pci_ioda2_unset_window(table_group, num);
>  		if (table_group->tables[num2])
> -			pnv_npu_set_window(npe, num2,
> -					table_group->tables[num2]);
> +			npe->table_group.ops->set_window(&npe->table_group,
> +					num2, table_group->tables[num2]);
>  	}
>  
>  	return ret;
> @@ -2704,19 +2704,22 @@ static long pnv_pci_ioda2_npu_unset_window(
>  	if (!npe->table_group.tables[num])
>  		return 0;
>  
> -	ret = pnv_npu_unset_window(npe, num);
> +	ret = npe->table_group.ops->unset_window(&npe->table_group, num);
>  	if (ret)
>  		return ret;
>  
>  	if (table_group->tables[num2])
> -		ret = pnv_npu_set_window(npe, num2, table_group->tables[num2]);
> +		ret = npe->table_group.ops->set_window(&npe->table_group, num2,
> +				table_group->tables[num2]);
>  
>  	return ret;
>  }
>  
>  static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
>  {
> -	pnv_npu_take_ownership(gpe_table_group_to_npe(table_group));
> +	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
> +
> +	npe->table_group.ops->take_ownership(&npe->table_group);
>  	pnv_ioda2_take_ownership(table_group);
>  }
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 18/22] powerpc/powernv/npu: Add compound IOMMU groups
  2018-11-13  8:28   ` Alexey Kardashevskiy
@ 2018-11-19  1:12     ` David Gibson
  -1 siblings, 0 replies; 84+ messages in thread
From: David Gibson @ 2018-11-19  1:12 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab

[-- Attachment #1: Type: text/plain, Size: 22012 bytes --]

On Tue, Nov 13, 2018 at 07:28:19PM +1100, Alexey Kardashevskiy wrote:
> At the moment powernv registers an IOMMU group for each PE. There is
> an exception though - the NPU (an emulated PCI bridge representing an
> NVLink); powernv attaches these bridges to the GPU's IOMMU group, which
> becomes the master.
> 
> Now we have POWER9 systems with GPUs connected to each other directly,
> bypassing PCI. At the moment powernv does not control these links, so
> it has to put such interconnected GPUs into the same IOMMU group, which
> means that the old scheme with a GPU as the master won't work - there
> will be up to 3 GPUs in such a group.
> 
> This introduces an npu_comp struct which represents a compound IOMMU
> group made of multiple PEs. This converts the existing NVLink1 code to
> use the new scheme. From now on, each PE must have a valid
> iommu_table_group_ops which will either be called directly (a single PE
> group) or indirectly from a compound group.
> 
> This moves IOMMU group registration for NPU-connected GPUs to npu-dma.c.
> For POWER8, this stores a new compound group pointer in a PE (so a GPU
> is still a master); for POWER9 the new group pointer is stored in an NPU.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/pci.h            |   1 +
>  arch/powerpc/platforms/powernv/pci.h      |   7 +
>  arch/powerpc/platforms/powernv/npu-dma.c  | 286 ++++++++++++++++++++--
>  arch/powerpc/platforms/powernv/pci-ioda.c | 173 +++----------
>  4 files changed, 308 insertions(+), 159 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
> index baf2886..0c72f18 100644
> --- a/arch/powerpc/include/asm/pci.h
> +++ b/arch/powerpc/include/asm/pci.h
> @@ -132,5 +132,6 @@ extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index);
>  extern int pnv_npu2_init(struct pci_controller *hose);
>  extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
>  		unsigned long msr);
> +extern int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev);
>  
>  #endif /* __ASM_POWERPC_PCI_H */
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index cf9f748..aef4bb5 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -62,6 +62,7 @@ struct pnv_ioda_pe {
>  
>  	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
>  	struct iommu_table_group table_group;
> +	struct npu_comp		*npucomp;
>  
>  	/* 64-bit TCE bypass region */
>  	bool			tce_bypass_enabled;
> @@ -201,6 +202,8 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>  extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
>  extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
>  extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
> +extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> +		__u64 window_size, __u32 levels);
>  extern int pnv_eeh_post_init(void);
>  
>  extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
> @@ -216,6 +219,10 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
>  extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
> +extern struct iommu_table_group *pnv_try_setup_npu_table_group(
> +		struct pnv_ioda_pe *pe);
> +extern struct iommu_table_group *pnv_npu_compound_attach(
> +		struct pnv_ioda_pe *pe);
>  
>  /* pci-ioda-tce.c */
>  #define POWERNV_IOMMU_DEFAULT_LEVELS	1
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 1792c7e..2231f4c 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -317,31 +317,6 @@ static struct iommu_table_group_ops pnv_pci_npu_ops = {
>  	.unset_window = pnv_npu_unset_window,
>  	.take_ownership = pnv_npu_take_ownership,
>  };
> -
> -struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
> -{
> -	struct pnv_phb *phb = npe->phb;
> -	struct pci_bus *pbus = phb->hose->bus;
> -	struct pci_dev *npdev, *gpdev = NULL, *gptmp;
> -	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
> -
> -	if (!gpe || !gpdev)
> -		return NULL;
> -
> -	npe->table_group.ops = &pnv_pci_npu_ops;
> -
> -	list_for_each_entry(npdev, &pbus->devices, bus_list) {
> -		gptmp = pnv_pci_get_gpu_dev(npdev);
> -
> -		if (gptmp != gpdev)
> -			continue;
> -
> -		pe_info(gpe, "Attached NPU %s\n", dev_name(&npdev->dev));
> -		iommu_group_add_device(gpe->table_group.group, &npdev->dev);
> -	}
> -
> -	return gpe;
> -}
>  #endif /* !CONFIG_IOMMU_API */
>  
>  /*
> @@ -349,6 +324,17 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
>   */
>  /* Maximum possible number of ATSD MMIO registers per NPU */
>  #define NV_NMMU_ATSD_REGS 8
> +#define NV_NPU_MAX_PE_NUM	16
> +
> +/*
> + * A compound NPU IOMMU group which might consist of 1 GPU + 2xNPUs (POWER8) or
> + * up to 3 x (GPU + 2xNPUs) (POWER9).
> + */
> +struct npu_comp {
> +	struct iommu_table_group table_group;
> +	int pe_num;
> +	struct pnv_ioda_pe *pe[NV_NPU_MAX_PE_NUM];
> +};
>  
>  /* An NPU descriptor, valid for POWER9 only */
>  struct npu {
> @@ -365,6 +351,8 @@ struct npu {
>  	struct list_head next;
>  
>  	struct pci_controller *hose;
> +
> +	struct npu_comp npucomp;
>  };

I'm confused by this.  The comment says there are multiple NPUs in a
single compound group, but the npu_comp structure is embedded in the
npu structure, implying there's a copy per NPU.
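
(For illustration, a minimal sketch of the group-selection logic the
changelog describes, using names from the quoted patch.  The helper
name pick_comp_group is made up here; error handling and group
registration are omitted, and the POWER8/POWER9 split is taken from
the changelog above rather than verified independently.)

/*
 * POWER9: npu_comp is embedded in the per-chip NPU descriptor, so all
 * GPU PEs behind one NPU share a single compound group.  POWER8: there
 * is no struct npu, so a compound group is allocated per GPU PE and
 * the GPU stays the group master.
 */
static struct iommu_table_group *pick_comp_group(struct pnv_ioda_pe *pe,
		struct pci_dev *npdev)
{
	struct npu *npu = npdev_to_npu(npdev);

	if (npu)	/* POWER9: shared, per-NPU compound group */
		return &npu->npucomp.table_group;

	/* POWER8: one compound group per GPU PE */
	pe->npucomp = kzalloc(sizeof(*pe->npucomp), GFP_KERNEL);
	if (!pe->npucomp)
		return NULL;

	return &pe->npucomp->table_group;
}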


>  static LIST_HEAD(npu2_devices);
> @@ -382,6 +370,254 @@ static struct npu *npdev_to_npu(struct pci_dev *npdev)
>  	return NULL;
>  }
>  
> +#ifdef CONFIG_IOMMU_API
> +static long pnv_npu_peers_create_table_userspace(
> +		struct iommu_table_group *table_group,
> +		int num, __u32 page_shift, __u64 window_size, __u32 levels,
> +		struct iommu_table **ptbl)
> +{
> +	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
> +			table_group);
> +
> +	if (!npucomp->pe_num || !npucomp->pe[0] ||
> +			!npucomp->pe[0]->table_group.ops ||
> +			!npucomp->pe[0]->table_group.ops->create_table)
> +		return -EFAULT;
> +
> +	return npucomp->pe[0]->table_group.ops->create_table(
> +			&npucomp->pe[0]->table_group, num, page_shift,
> +			window_size, levels, ptbl);
> +}
> +
> +static long pnv_npu_peers_set_window(struct iommu_table_group *table_group,
> +		int num, struct iommu_table *tbl)
> +{
> +	int i, j;
> +	long ret = 0;
> +	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
> +			table_group);
> +
> +	for (i = 0; i < npucomp->pe_num; ++i) {
> +		struct pnv_ioda_pe *pe = npucomp->pe[i];
> +
> +		if (!pe->table_group.ops->set_window)
> +			continue;
> +
> +		ret = pe->table_group.ops->set_window(&pe->table_group,
> +				num, tbl);
> +		if (ret)
> +			break;
> +	}
> +
> +	if (ret) {
> +		for (j = 0; j < i; ++j) {
> +			struct pnv_ioda_pe *pe = npucomp->pe[j];
> +
> +			if (!pe->table_group.ops->unset_window)
> +				continue;
> +
> +			ret = pe->table_group.ops->unset_window(
> +					&pe->table_group, num);
> +			if (ret)
> +				break;
> +		}
> +	} else {
> +		table_group->tables[num] = iommu_tce_table_get(tbl);
> +	}
> +
> +	return ret;
> +}
> +
> +static long pnv_npu_peers_unset_window(struct iommu_table_group *table_group,
> +		int num)
> +{
> +	int i, j;
> +	long ret = 0;
> +	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
> +			table_group);
> +
> +	for (i = 0; i < npucomp->pe_num; ++i) {
> +		struct pnv_ioda_pe *pe = npucomp->pe[i];
> +
> +		WARN_ON(npucomp->table_group.tables[num] !=
> +				table_group->tables[num]);
> +		if (!npucomp->table_group.tables[num])
> +			continue;
> +
> +		if (!pe->table_group.ops->unset_window)
> +			continue;
> +
> +		ret = pe->table_group.ops->unset_window(&pe->table_group, num);
> +		if (ret)
> +			break;
> +	}
> +
> +	if (ret) {
> +		for (j = 0; j < i; ++j) {
> +			struct pnv_ioda_pe *pe = npucomp->pe[j];
> +
> +			if (!npucomp->table_group.tables[num])
> +				continue;
> +
> +			if (!pe->table_group.ops->set_window)
> +				continue;
> +
> +			ret = pe->table_group.ops->set_window(&pe->table_group,
> +					num, table_group->tables[num]);
> +			if (ret)
> +				break;
> +		}
> +	} else if (table_group->tables[num]) {
> +		iommu_tce_table_put(table_group->tables[num]);
> +		table_group->tables[num] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void pnv_npu_peers_take_ownership(struct iommu_table_group *table_group)
> +{
> +	int i;
> +	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
> +			table_group);
> +
> +	for (i = 0; i < npucomp->pe_num; ++i) {
> +		struct pnv_ioda_pe *pe = npucomp->pe[i];
> +
> +		if (!pe->table_group.ops->take_ownership)
> +			continue;
> +		pe->table_group.ops->take_ownership(&pe->table_group);
> +	}
> +}
> +
> +static void pnv_npu_peers_release_ownership(
> +		struct iommu_table_group *table_group)
> +{
> +	int i;
> +	struct npu_comp *npucomp = container_of(table_group, struct npu_comp,
> +			table_group);
> +
> +	for (i = 0; i < npucomp->pe_num; ++i) {
> +		struct pnv_ioda_pe *pe = npucomp->pe[i];
> +
> +		if (!pe->table_group.ops->release_ownership)
> +			continue;
> +		pe->table_group.ops->release_ownership(&pe->table_group);
> +	}
> +}
> +
> +static struct iommu_table_group_ops pnv_npu_peers_ops = {
> +	.get_table_size = pnv_pci_ioda2_get_table_size,
> +	.create_table = pnv_npu_peers_create_table_userspace,
> +	.set_window = pnv_npu_peers_set_window,
> +	.unset_window = pnv_npu_peers_unset_window,
> +	.take_ownership = pnv_npu_peers_take_ownership,
> +	.release_ownership = pnv_npu_peers_release_ownership,
> +};
> +
> +static void pnv_comp_attach_table_group(struct npu_comp *npucomp,
> +		struct pnv_ioda_pe *pe)
> +{
> +	if (WARN_ON(npucomp->pe_num == NV_NPU_MAX_PE_NUM))
> +		return;
> +
> +	npucomp->pe[npucomp->pe_num] = pe;
> +	++npucomp->pe_num;
> +}
> +
> +struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
> +{
> +	struct iommu_table_group *table_group;
> +	struct npu *npu;
> +	struct npu_comp *npucomp;
> +	struct pci_dev *gpdev = NULL;
> +	struct pci_controller *hose;
> +	struct pci_dev *npdev;
> +
> +	list_for_each_entry(gpdev, &pe->pbus->devices, bus_list) {
> +		npdev = pnv_pci_get_npu_dev(gpdev, 0);
> +		if (npdev)
> +			break;
> +	}
> +
> +	if (!npdev)
> +		/* It is not an NPU attached device, skip */
> +		return NULL;
> +
> +	hose = pci_bus_to_host(gpdev->bus);
> +	npu = npdev_to_npu(npdev);
> +	if (npu) {
> +		table_group = &npu->npucomp.table_group;
> +
> +		if (!table_group->group) {
> +			table_group->ops = &pnv_npu_peers_ops;
> +			iommu_register_group(table_group,
> +					hose->global_number,
> +					pe->pe_number);
> +		}
> +	} else {
> +		/* Create a group for 1 GPU and attached NPUs */
> +		pe->npucomp = kzalloc(sizeof(*pe->npucomp), GFP_KERNEL);
> +		table_group = &pe->npucomp->table_group;
> +		table_group->ops = &pnv_npu_peers_ops;
> +		iommu_register_group(table_group, hose->global_number,
> +				pe->pe_number);
> +	}
> +
> +	/* Steal capabilities from a GPU PE */
> +	table_group->max_dynamic_windows_supported =
> +		pe->table_group.max_dynamic_windows_supported;
> +	table_group->tce32_start = pe->table_group.tce32_start;
> +	table_group->tce32_size = pe->table_group.tce32_size;
> +	table_group->max_levels = pe->table_group.max_levels;
> +	table_group->pgsizes = pe->table_group.pgsizes;
> +
> +	npucomp = container_of(table_group, struct npu_comp, table_group);
> +	pnv_comp_attach_table_group(npucomp, pe);
> +
> +	return table_group;
> +}
> +
> +struct iommu_table_group *pnv_npu_compound_attach(struct pnv_ioda_pe *pe)
> +{
> +	struct iommu_table_group *table_group;
> +	struct npu_comp *npucomp;
> +	struct pci_dev *gpdev = NULL;
> +	struct pci_dev *npdev;
> +	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(pe, &gpdev);
> +
> +	WARN_ON(!(pe->flags & PNV_IODA_PE_DEV));
> +	if (!gpe)
> +		return NULL;
> +
> +	/*
> +	 * IODA2 bridges get this set up from
> +	 * pci_controller_ops::setup_bridge but NPU bridges do not
> +	 * have this hook defined so we do it here.
> +	 */
> +	pe->table_group.max_dynamic_windows_supported =
> +		IOMMU_TABLE_GROUP_MAX_TABLES;
> +	pe->table_group.ops = &pnv_pci_npu_ops;
> +
> +	table_group = iommu_group_get_iommudata(
> +			iommu_group_get(&gpdev->dev));
> +
> +	npucomp = container_of(table_group, struct npu_comp, table_group);
> +	pnv_comp_attach_table_group(npucomp, pe);
> +
> +	list_for_each_entry(npdev, &pe->phb->hose->bus->devices, bus_list) {
> +		struct pci_dev *gpdevtmp = pnv_pci_get_gpu_dev(npdev);
> +
> +		if (gpdevtmp != gpdev)
> +			continue;
> +
> +		iommu_add_device(table_group, &npdev->dev);
> +	}
> +
> +	return table_group;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
>  /* Maximum number of nvlinks per npu */
>  #define NV_MAX_LINKS 6
>  
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 04639ae..0e8ada5 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -190,7 +190,8 @@ static void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
>  	unsigned int pe_num = pe->pe_number;
>  
>  	WARN_ON(pe->pdev);
> -
> +	WARN_ON(pe->npucomp);
> +	kfree(pe->npucomp);
>  	memset(pe, 0, sizeof(struct pnv_ioda_pe));
>  	clear_bit(pe_num, phb->ioda.pe_alloc);
>  }
> @@ -1269,7 +1270,8 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus)
>  		pnv_ioda_setup_npu_PE(pdev);
>  }
>  
> -static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe);
> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
> +		struct iommu_table_group *table_group, struct pci_bus *bus);
>  
>  static void pnv_pci_ioda_setup_PEs(void)
>  {
> @@ -1593,7 +1595,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>  		mutex_unlock(&phb->ioda.pe_list_mutex);
>  
>  		pnv_pci_ioda2_setup_dma_pe(phb, pe);
> -		pnv_ioda_setup_bus_iommu_group(pe);
> +		pnv_ioda_setup_bus_iommu_group(pe, &pe->table_group, NULL);
>  	}
>  }
>  
> @@ -2554,7 +2556,7 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
>  #endif
>  
>  #ifdef CONFIG_IOMMU_API
> -static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> +unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>  		__u64 window_size, __u32 levels)
>  {
>  	unsigned long bytes = 0;
> @@ -2628,147 +2630,38 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>  	.release_ownership = pnv_ioda2_release_ownership,
>  };
>  
> -static int gpe_table_group_to_npe_cb(struct device *dev, void *opaque)
> -{
> -	struct pci_controller *hose;
> -	struct pnv_phb *phb;
> -	struct pnv_ioda_pe **ptmppe = opaque;
> -	struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
> -	struct pci_dn *pdn = pci_get_pdn(pdev);
> -
> -	if (!pdn || pdn->pe_number == IODA_INVALID_PE)
> -		return 0;
> -
> -	hose = pci_bus_to_host(pdev->bus);
> -	phb = hose->private_data;
> -	if (phb->type != PNV_PHB_NPU_NVLINK)
> -		return 0;
> -
> -	*ptmppe = &phb->ioda.pe_array[pdn->pe_number];
> -
> -	return 1;
> -}
> -
> -/*
> - * This returns PE of associated NPU.
> - * This assumes that NPU is in the same IOMMU group with GPU and there is
> - * no other PEs.
> - */
> -static struct pnv_ioda_pe *gpe_table_group_to_npe(
> -		struct iommu_table_group *table_group)
> -{
> -	struct pnv_ioda_pe *npe = NULL;
> -	int ret = iommu_group_for_each_dev(table_group->group, &npe,
> -			gpe_table_group_to_npe_cb);
> -
> -	BUG_ON(!ret || !npe);
> -
> -	return npe;
> -}
> -
> -static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group,
> -		int num, struct iommu_table *tbl)
> -{
> -	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
> -	int num2 = (num == 0) ? 1 : 0;
> -	long ret = pnv_pci_ioda2_set_window(table_group, num, tbl);
> -
> -	if (ret)
> -		return ret;
> -
> -	if (table_group->tables[num2])
> -		npe->table_group.ops->unset_window(&npe->table_group, num2);
> -
> -	ret = npe->table_group.ops->set_window(&npe->table_group, num, tbl);
> -	if (ret) {
> -		pnv_pci_ioda2_unset_window(table_group, num);
> -		if (table_group->tables[num2])
> -			npe->table_group.ops->set_window(&npe->table_group,
> -					num2, table_group->tables[num2]);
> -	}
> -
> -	return ret;
> -}
> -
> -static long pnv_pci_ioda2_npu_unset_window(
> -		struct iommu_table_group *table_group,
> -		int num)
> -{
> -	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
> -	int num2 = (num == 0) ? 1 : 0;
> -	long ret = pnv_pci_ioda2_unset_window(table_group, num);
> -
> -	if (ret)
> -		return ret;
> -
> -	if (!npe->table_group.tables[num])
> -		return 0;
> -
> -	ret = npe->table_group.ops->unset_window(&npe->table_group, num);
> -	if (ret)
> -		return ret;
> -
> -	if (table_group->tables[num2])
> -		ret = npe->table_group.ops->set_window(&npe->table_group, num2,
> -				table_group->tables[num2]);
> -
> -	return ret;
> -}
> -
> -static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
> -{
> -	struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
> -
> -	npe->table_group.ops->take_ownership(&npe->table_group);
> -	pnv_ioda2_take_ownership(table_group);
> -}
> -
> -static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = {
> -	.get_table_size = pnv_pci_ioda2_get_table_size,
> -	.create_table = pnv_pci_ioda2_create_table_userspace,
> -	.set_window = pnv_pci_ioda2_npu_set_window,
> -	.unset_window = pnv_pci_ioda2_npu_unset_window,
> -	.take_ownership = pnv_ioda2_npu_take_ownership,
> -	.release_ownership = pnv_ioda2_release_ownership,
> -};
> -
>  static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe *pe,
> +		struct iommu_table_group *table_group,
>  		struct pci_bus *bus)
>  {
>  	struct pci_dev *dev;
>  
>  	list_for_each_entry(dev, &bus->devices, bus_list) {
> -		iommu_add_device(&pe->table_group, &dev->dev);
> +		iommu_add_device(table_group, &dev->dev);
>  
>  		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
>  			pnv_ioda_setup_bus_iommu_group_add_devices(pe,
> -					dev->subordinate);
> +					table_group, dev->subordinate);
>  	}
>  }
>  
> -static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
> +		struct iommu_table_group *table_group, struct pci_bus *bus)
>  {
> -	if (!pnv_pci_ioda_pe_dma_weight(pe))
> -		return;
>  
> -	iommu_register_group(&pe->table_group, pe->phb->hose->global_number,
> -			pe->pe_number);
> -
> -	/*
> -	 * set_iommu_table_base(&pe->pdev->dev, tbl) should have been called
> -	 * by now
> -	 */
>  	if (pe->flags & PNV_IODA_PE_DEV)
> -		iommu_add_device(&pe->table_group, &pe->pdev->dev);
> -	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
> -		pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
> +		iommu_add_device(table_group, &pe->pdev->dev);
> +
> +	if ((pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) || bus)
> +		pnv_ioda_setup_bus_iommu_group_add_devices(pe, table_group,
> +				bus);
>  }
>  
>  static void pnv_pci_ioda_setup_iommu_api(void)
>  {
> -	struct pci_controller *hose, *tmp;
> +	struct pci_controller *hose;
>  	struct pnv_phb *phb;
> -	struct pnv_ioda_pe *pe, *gpe;
> +	struct pnv_ioda_pe *pe;
>  
>  	/*
>  	 * There are 4 types of PEs:
> @@ -2790,29 +2683,41 @@ static void pnv_pci_ioda_setup_iommu_api(void)
>  		if (phb->type == PNV_PHB_NPU_NVLINK)
>  			continue;
>  
> -		list_for_each_entry(pe, &phb->ioda.pe_list, list)
> -			pnv_ioda_setup_bus_iommu_group(pe);
> +		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
> +			struct iommu_table_group *table_group;
> +
> +			table_group = pnv_try_setup_npu_table_group(pe);
> +			if (!table_group) {
> +				if (!pnv_pci_ioda_pe_dma_weight(pe))
> +					continue;
> +
> +				table_group = &pe->table_group;
> +				iommu_register_group(&pe->table_group,
> +						pe->phb->hose->global_number,
> +						pe->pe_number);
> +			}
> +			pnv_ioda_setup_bus_iommu_group(pe, table_group,
> +					pe->pbus);
> +		}
>  	}
>  
>  	/*
>  	 * Now we have all PHBs discovered, time to add NPU devices to
>  	 * the corresponding IOMMU groups.
>  	 */
> -	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
> +	list_for_each_entry(hose, &hose_list, list_node) {
>  		phb = hose->private_data;
>  
>  		if (phb->type != PNV_PHB_NPU_NVLINK)
>  			continue;
>  
> -		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
> -			gpe = pnv_pci_npu_setup_iommu(pe);
> -			if (gpe)
> -				gpe->table_group.ops = &pnv_pci_ioda2_npu_ops;
> -		}
> +		list_for_each_entry(pe, &phb->ioda.pe_list, list)
> +			pnv_npu_compound_attach(pe);
>  	}
>  }
>  #else /* !CONFIG_IOMMU_API */
> -static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe) { }
> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe,
> +		struct iommu_table_group *table_group, struct pci_bus *bus){}
>  static void pnv_pci_ioda_setup_iommu_api(void) { };
>  #endif
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 18/22] powerpc/powernv/npu: Add compound IOMMU groups
  2018-11-19  1:12     ` David Gibson
@ 2018-11-19  2:29       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-19  2:29 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab



On 19/11/2018 12:12, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:28:19PM +1100, Alexey Kardashevskiy wrote:
>> At the moment powernv registers an IOMMU group for each PE. There is
>> an exception though - NPU (an emulated PCI bridge representing an NVLink);
>> powernv attaches these bridges to the GPU IOMMU group which becomes
>> a master.
>>
>> Now we have POWER9 systems with GPUs connected to each other directly,
>> bypassing PCI. At the moment powernv does not control these links so
>> it has to put such interconnected GPUs to the same IOMMU group which
>> means that the old scheme with a GPU as a master won't work - there will
>> be up to 3 GPUs in such group.
>>
>> This introduces a npu_comp struct which represents a compound IOMMU
>> group made of multiple PEs. This converts the existing NVLink1 code to
>> use the new scheme. From now on, each PE must have a valid
>> iommu_table_group_ops which will either be called directly (a single PE
>> group) or indirectly from a compound group.
>>
>> This moves IOMMU group registration for NPU-connected GPUs to npu-dma.c.
>> For POWER8, this stores a new compound group pointer in a PE (so a GPU
>> is still a master); for POWER9 the new group pointer is stored in an NPU.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/include/asm/pci.h            |   1 +
>>  arch/powerpc/platforms/powernv/pci.h      |   7 +
>>  arch/powerpc/platforms/powernv/npu-dma.c  | 286 ++++++++++++++++++++--
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 173 +++----------
>>  4 files changed, 308 insertions(+), 159 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
>> index baf2886..0c72f18 100644
>> --- a/arch/powerpc/include/asm/pci.h
>> +++ b/arch/powerpc/include/asm/pci.h
>> @@ -132,5 +132,6 @@ extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index);
>>  extern int pnv_npu2_init(struct pci_controller *hose);
>>  extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
>>  		unsigned long msr);
>> +extern int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev);
>>  
>>  #endif /* __ASM_POWERPC_PCI_H */
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index cf9f748..aef4bb5 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -62,6 +62,7 @@ struct pnv_ioda_pe {
>>  
>>  	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
>>  	struct iommu_table_group table_group;
>> +	struct npu_comp		*npucomp;
>>  
>>  	/* 64-bit TCE bypass region */
>>  	bool			tce_bypass_enabled;
>> @@ -201,6 +202,8 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>>  extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
>>  extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
>>  extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
>> +extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>> +		__u64 window_size, __u32 levels);
>>  extern int pnv_eeh_post_init(void);
>>  
>>  extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>> @@ -216,6 +219,10 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
>>  extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
>> +extern struct iommu_table_group *pnv_try_setup_npu_table_group(
>> +		struct pnv_ioda_pe *pe);
>> +extern struct iommu_table_group *pnv_npu_compound_attach(
>> +		struct pnv_ioda_pe *pe);
>>  
>>  /* pci-ioda-tce.c */
>>  #define POWERNV_IOMMU_DEFAULT_LEVELS	1
>> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
>> index 1792c7e..2231f4c 100644
>> --- a/arch/powerpc/platforms/powernv/npu-dma.c
>> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
>> @@ -317,31 +317,6 @@ static struct iommu_table_group_ops pnv_pci_npu_ops = {
>>  	.unset_window = pnv_npu_unset_window,
>>  	.take_ownership = pnv_npu_take_ownership,
>>  };
>> -
>> -struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
>> -{
>> -	struct pnv_phb *phb = npe->phb;
>> -	struct pci_bus *pbus = phb->hose->bus;
>> -	struct pci_dev *npdev, *gpdev = NULL, *gptmp;
>> -	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
>> -
>> -	if (!gpe || !gpdev)
>> -		return NULL;
>> -
>> -	npe->table_group.ops = &pnv_pci_npu_ops;
>> -
>> -	list_for_each_entry(npdev, &pbus->devices, bus_list) {
>> -		gptmp = pnv_pci_get_gpu_dev(npdev);
>> -
>> -		if (gptmp != gpdev)
>> -			continue;
>> -
>> -		pe_info(gpe, "Attached NPU %s\n", dev_name(&npdev->dev));
>> -		iommu_group_add_device(gpe->table_group.group, &npdev->dev);
>> -	}
>> -
>> -	return gpe;
>> -}
>>  #endif /* !CONFIG_IOMMU_API */
>>  
>>  /*
>> @@ -349,6 +324,17 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
>>   */
>>  /* Maximum possible number of ATSD MMIO registers per NPU */
>>  #define NV_NMMU_ATSD_REGS 8
>> +#define NV_NPU_MAX_PE_NUM	16
>> +
>> +/*
>> + * A compound NPU IOMMU group which might consist of 1 GPU + 2xNPUs (POWER8) or
>> + * up to 3 x (GPU + 2xNPUs) (POWER9).
>> + */
>> +struct npu_comp {
>> +	struct iommu_table_group table_group;
>> +	int pe_num;
>> +	struct pnv_ioda_pe *pe[NV_NPU_MAX_PE_NUM];
>> +};
>>  
>>  /* An NPU descriptor, valid for POWER9 only */
>>  struct npu {
>> @@ -365,6 +351,8 @@ struct npu {
>>  	struct list_head next;
>>  
>>  	struct pci_controller *hose;
>> +
>> +	struct npu_comp npucomp;
>>  };
> 
> I'm confused by this.  The comment says there are multiple NPUs in a
> single compound group, but the npu_comp structure is embedded in the
> npu structure, implying there's a copy per NPU.


Yeah, there is a naming confusion. The NPU is a big unit in the CPU with 6
links, and this is what the "struct npu" above describes.

And there are 6 emulated NPU bridge devices which you can see in lspci
with the "ibmnpu" driver bound to them.

I guess from now on I will refer to the big NPU as "NPU" and to the
emulated bridge device as "NVLink2" or "NVLink2 emulated device" unless
you have a better suggestion (Alistair does not have one, though).
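
(A small illustrative helper for the terminology above; gpu_to_npu is
a made-up name, while pnv_pci_get_npu_dev() and npdev_to_npu() are the
helpers quoted elsewhere in this thread.  Several NVLink2 emulated
bridges map back to one "big" NPU:)

static struct npu *gpu_to_npu(struct pci_dev *gpdev)
{
	/* index 0: the first NVLink2 emulated bridge of this GPU */
	struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);

	/* npdev_to_npu() resolves the bridge to its per-chip NPU */
	return npdev ? npdev_to_npu(npdev) : NULL;
}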



-- 
Alexey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 06/22] powerpc/powernv: Detach npu struct from pnv_phb
  2018-11-14  4:28     ` Alistair Popple
@ 2018-11-19  7:18       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-19  7:18 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Jose Ricardo Ziviani, Sam Bobroff, linuxppc-dev, Alex Williamson,
	kvm-ppc, Piotr Jaroszynski, Oliver O'Halloran,
	Andrew Donnellan, Leonardo Augusto Guimarães Garcia,
	Reza Arbab, David Gibson



On 14/11/2018 15:28, Alistair Popple wrote:
> Hi Alexey,
> 
> On Tuesday, 13 November 2018 7:28:07 PM AEDT Alexey Kardashevskiy wrote:
>>  static struct npu *npdev_to_npu(struct pci_dev *npdev)
>>  {
>> -	struct pnv_phb *nphb;
>> +	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
>> +	struct npu *npu;
>>
>> -	nphb = pci_bus_to_host(npdev->bus)->private_data;
>> +	list_for_each_entry(npu, &npu2_devices, next)
> 
> This is called from the ATSD path which is (or at least has been) quite a 
> performance-critical path, so searching through all the NPUs in a list may be 
> problematic.
> 
> I guess currently it won't make any practical difference as we only ever have 2 
> NPUs, but in future they may get divided into more logical NPUs. Would it be 
> possible to store a back-pointer somewhere so we can avoid the lookup?


It is quite possible even now with iommu_group_get() + container_of() +
iommu_group_put(); I'll try that in the respin.
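
Something like this completely untested sketch (it assumes the NVLink2
device's IOMMU group data points at the table_group embedded in the
NPU's compound group added later in this series):

	/* Sketch: resolve npdev -> npu without walking npu2_devices */
	static struct npu *npdev_to_npu(struct pci_dev *npdev)
	{
		struct iommu_group *group = iommu_group_get(&npdev->dev);
		struct iommu_table_group *tg;
		struct npu *npu;

		if (!group)
			return NULL;

		/* assumes iommu data is set to the compound table_group */
		tg = iommu_group_get_iommudata(group);
		npu = container_of(tg, struct npu, npucomp.table_group);
		iommu_group_put(group);

		return npu;
	}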


> 
>> +		if (hose == npu->hose)
>> +			return npu;
>>
>> -	return &nphb->npu;
>> +	WARN_ON_ONCE(1);
>> +	return NULL;
>>  }
>>
>>  /* Maximum number of nvlinks per npu */
>> @@ -505,6 +531,9 @@ static void acquire_atsd_reg(struct npu_context
>> *npu_context, continue;
>>
>>  			npu = npdev_to_npu(npdev);
>> +			if (!npu)
>> +				continue;
>> +
>>  			mmio_atsd_reg[i].npu = npu;
>>  			mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu);
>>  			while (mmio_atsd_reg[i].reg < 0) {
>> @@ -701,6 +730,8 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev
>> *gpdev,
>>
>>  	nphb = pci_bus_to_host(npdev->bus)->private_data;
>>  	npu = npdev_to_npu(npdev);
>> +	if (!npu)
>> +		return ERR_PTR(-ENODEV);
>>
>>  	/*
>>  	 * Setup the NPU context table for a particular GPU. These need to be
>> @@ -821,6 +852,8 @@ void pnv_npu2_destroy_context(struct npu_context
>> *npu_context,
>>
>>  	nphb = pci_bus_to_host(npdev->bus)->private_data;
>>  	npu = npdev_to_npu(npdev);
>> +	if (!npu)
>> +		return;
>>  	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
>>  	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
>>  							&nvlink_index)))
>> @@ -898,9 +931,15 @@ int pnv_npu2_init(struct pnv_phb *phb)
>>  	struct pci_dev *gpdev;
>>  	static int npu_index;
>>  	uint64_t rc = 0;
>> +	struct pci_controller *hose = phb->hose;
>> +	struct npu *npu;
>> +	int ret;
>>
>> -	phb->npu.nmmu_flush =
>> -		of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
>> +	npu = kzalloc(sizeof(*npu), GFP_KERNEL);
>> +	if (!npu)
>> +		return -ENOMEM;
>> +
>> +	npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush");
>>  	for_each_child_of_node(phb->hose->dn, dn) {
>>  		gpdev = pnv_pci_get_gpu_dev(get_pci_dev(dn));
>>  		if (gpdev) {
>> @@ -914,18 +953,31 @@ int pnv_npu2_init(struct pnv_phb *phb)
>>  		}
>>  	}
>>
>> -	for (i = 0; !of_property_read_u64_index(phb->hose->dn, "ibm,mmio-atsd",
>> +	for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd",
>>  							i, &mmio_atsd); i++)
>> -		phb->npu.mmio_atsd_regs[i] = ioremap(mmio_atsd, 32);
>> +		npu->mmio_atsd_regs[i] = ioremap(mmio_atsd, 32);
>>
>> -	pr_info("NPU%lld: Found %d MMIO ATSD registers", phb->opal_id, i);
>> -	phb->npu.mmio_atsd_count = i;
>> -	phb->npu.mmio_atsd_usage = 0;
>> +	pr_info("NPU%d: Found %d MMIO ATSD registers", hose->global_number, i);
>> +	npu->mmio_atsd_count = i;
>> +	npu->mmio_atsd_usage = 0;
>>  	npu_index++;
>> -	if (WARN_ON(npu_index >= NV_MAX_NPUS))
>> -		return -ENOSPC;
>> +	if (WARN_ON(npu_index >= NV_MAX_NPUS)) {
>> +		ret = -ENOSPC;
>> +		goto fail_exit;
>> +	}
>>  	max_npu2_index = npu_index;
>> -	phb->npu.index = npu_index;
>> +	npu->index = npu_index;
>> +	npu->hose = hose;
>> +
>> +	list_add(&npu->next, &npu2_devices);
> 
> Guess we don't need any locking here as the list gets setup once during boot 
> long before loading the driver and is never modified right?


Correct.


> 
> - Alistair
> 
>>  	return 0;
>> +
>> +fail_exit:
>> +	for (i = 0; i < npu->mmio_atsd_count; ++i)
>> +		iounmap(npu->mmio_atsd_regs[i]);
>> +
>> +	kfree(npu);
>> +
>> +	return ret;
>>  }
> 
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 09/22] powerpc/pseries/iommu: Force default DMA window removal
  2018-11-16  4:54     ` David Gibson
@ 2018-11-19  7:28       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-19  7:28 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab



On 16/11/2018 15:54, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:28:10PM +1100, Alexey Kardashevskiy wrote:
>> It is quite common for a device to support more than 32bit but less than
>> 64bit DMA; for example, GPUs often support 42..50 bits. However
>> the pseries platform only allows a huge DMA window (the one which allows
>> the use of more than 2GB of DMA space) for 64bit-capable devices, mostly
>> because:
>>
>> 1. we may have 32bit and >32bit devices on the same IOMMU domain and
>> we cannot place the new big window where the 32bit one is located;
>>
>> 2. the existing hardware only supports the second window at very high
>> offset of 1<<59 == 0x0800.0000.0000.0000.
>>
>> So in order to allow 33..59bit DMA, we have to remove the default DMA
>> window and place a huge one there instead.
>>
>> The PAPR spec says that the platform may decide not to use the default
>> window and remove it using DDW RTAS calls. There are a few possible ways
>> for the platform to decide:
>>
>> 1. look at the device IDs and decide in advance that such and such
>> devices are capable of more than 32bit DMA (powernv's sketchy bypass
>> does something like this - it drops the default window if all devices
>> on the PE are from the same vendor) - this is not great as it involves
>> guessing because, unlike the sketchy bypass case, the GPU case involves
>> 2 vendor ids and does not scale;
>>
>> 2. advertise 1 available DMA window in the hypervisor via
>> ibm,query-pe-dma-window so the pseries platform could take it as a clue
>> that if more bits for DMA are needed, it has to remove the default
>> window - this is not great as it is an implicit clue rather than a
>> direct instruction;
>>
>> 3. removing the default DMA window altogether is not really an option
>> as PAPR mandates its presence at guest boot time;
>>
>> 4. make the hypervisor explicitly tell the guest that the default window
>> had better be removed, so the guest does not have to think hard and can
>> simply do what is requested - and this is what this patch does.
> 
> This approach only makes sense if the hypervisor has better
> information as to what to do than the guest does.  It's not clear to
> me why that would be the case.  Aren't the DMA capabilities of the
> device something the driver should know, in which case it can decide
> based on that?

The device knows it can do 42 bits, so it will request a 42bit DMA mask
and then the platform has to deal with it; the device has no control
over DMA windows.

Then the platform tries to make everything work, which sadly includes
32bit-DMA devices, so the default DMA window stays where it is. For
42bit devices there is then no other way than to go via that smaller
window, as the only other window we can create is beyond the reach of
the GPU.
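
For reference, the driver side of this is just the standard DMA API;
a minimal sketch (42 being the GPU example above, "pdev" being whatever
device the driver probed):

	/* Ask for what the device can address; the platform then has to
	 * find or build a DMA window that covers the requested range. */
	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(42)))
		dev_warn(&pdev->dev, "falling back to 32bit DMA\n");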

We have the so-called "sketchy bypass" hack for some other GPUs (which
Christoph is trying to get rid of) at
https://github.com/aik/linux/blob/nv2/arch/powerpc/platforms/powernv/pci-ioda.c#L1885

It is powernv-only; it seemed to be a solution there and it is what I am
trying to reimplement here.


> 
>>
>> This makes use of the latter approach and exploits a new
>> "qemu,dma-force-remove-default" flag in a vPHB.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/platforms/pseries/iommu.c | 28 +++++++++++++++++++++++---
>>  1 file changed, 25 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 9ece42f..78473ac 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -54,6 +54,7 @@
>>  #include "pseries.h"
>>  
>>  #define DDW_INVALID_OFFSET	((uint64_t)-1)
>> +#define DDW_INVALID_LIOBN	((uint32_t)-1)
>>  
>>  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
>>  {
>> @@ -977,7 +978,8 @@ static LIST_HEAD(failed_ddw_pdn_list);
>>   *
>>   * returns the dma offset for use by dma_set_mask
>>   */
>> -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>> +static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
>> +		u32 default_liobn)
>>  {
>>  	int len, ret;
>>  	struct ddw_query_response query;
>> @@ -1022,6 +1024,16 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>  	if (ret)
>>  		goto out_failed;
>>  
>> +	/*
>> +	 * The device tree has a request to force remove the default window,
>> +	 * do this.
>> +	 */
>> +	if (default_liobn != DDW_INVALID_LIOBN && (!ddw_avail[2] ||
>> +			rtas_call(ddw_avail[2], 1, 1, NULL, default_liobn))) {
>> +		dev_dbg(&dev->dev, "Could not remove window");
>> +		goto out_failed;
>> +	}
>> +
>>         /*
>>  	 * Query if there is a second window of size to map the
>>  	 * whole partition.  Query returns number of windows, largest
>> @@ -1212,7 +1224,7 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
>>  	pdev = to_pci_dev(dev);
>>  
>>  	/* only attempt to use a new window if 64-bit DMA is requested */
>> -	if (!disable_ddw && dma_mask == DMA_BIT_MASK(64)) {
>> +	if (!disable_ddw && dma_mask > DMA_BIT_MASK(32)) {
>>  		dn = pci_device_to_OF_node(pdev);
>>  		dev_dbg(dev, "node is %pOF\n", dn);
>>  
>> @@ -1229,7 +1241,17 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
>>  				break;
>>  		}
>>  		if (pdn && PCI_DN(pdn)) {
>> -			dma_offset = enable_ddw(pdev, pdn);
>> +			u32 liobn = DDW_INVALID_LIOBN;
>> +			int ret = of_device_is_compatible(pdn, "IBM,npu-vphb");
>> +
>> +			if (ret) {
>> +				dma_window = of_get_property(pdn,
>> +						"ibm,dma-window", NULL);
>> +				if (dma_window)
>> +					liobn = be32_to_cpu(dma_window[0]);
>> +			}
>> +
>> +			dma_offset = enable_ddw(pdev, pdn, liobn);
>>  			if (dma_offset != DDW_INVALID_OFFSET) {
>>  				dev_info(dev, "Using 64-bit direct DMA at offset %llx\n", dma_offset);
>>  				set_dma_offset(dev, dma_offset);
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 10/22] powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation
  2018-11-16  5:23     ` David Gibson
@ 2018-11-19  7:43       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-19  7:43 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab



On 16/11/2018 16:23, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:28:11PM +1100, Alexey Kardashevskiy wrote:
>> We might have memory@ nodes with "linux,usable-memory" set to zero
>> (for example, to replicate powernv's behaviour for GPU coherent memory),
>> which means that the memory needs extra initialization; but since
>> it can be used afterwards, the pseries platform will try mapping it
>> for DMA, so the DMA window needs to cover those memory regions too.
>>
>> This walks through the memory@ nodes to find the highest RAM address to
>> let a huge DMA window cover that too, in case this memory gets onlined
>> later.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/platforms/pseries/iommu.c | 43 +++++++++++++++++++++++++-
>>  1 file changed, 42 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 78473ac..f818737 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -967,6 +967,47 @@ struct failed_ddw_pdn {
>>  
>>  static LIST_HEAD(failed_ddw_pdn_list);
>>  
>> +static unsigned long read_n_cells(int n, const __be32 **buf)
>> +{
>> +	unsigned long result = 0;
>> +
>> +	while (n--) {
>> +		result = (result << 32) | of_read_number(*buf, 1);
>> +		(*buf)++;
>> +	}
>> +	return result;
>> +}
> 
> Um.. this appears to be re-implementing of_read_number() in terms of
> of_read_number().   Wat!?


This is a cut-n-paste from arch/powerpc/mm/numa.c :) My bad, I did not
think much when I did this.
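
For the record, the whole loop collapses into a single call, since
of_read_number() already reads n cells; an untested sketch:

	static unsigned long read_n_cells(int n, const __be32 **buf)
	{
		unsigned long result = of_read_number(*buf, n);

		*buf += n;	/* keep the buffer-advancing behaviour */
		return result;
	}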


> 
>> +static phys_addr_t ddw_memory_hotplug_max(void)
>> +{
>> +	phys_addr_t max_addr = memory_hotplug_max();
>> +	struct device_node *memory;
>> +
>> +	for_each_node_by_type(memory, "memory") {
>> +		unsigned long start, size;
>> +		int ranges, n_mem_addr_cells, n_mem_size_cells, len;
>> +		const __be32 *memcell_buf;
>> +
>> +		memcell_buf = of_get_property(memory, "reg", &len);
>> +		if (!memcell_buf || len <= 0)
>> +			continue;
>> +
>> +		n_mem_addr_cells = of_n_addr_cells(memory);
>> +		n_mem_size_cells = of_n_size_cells(memory);
>> +
>> +		/* ranges in cell */
>> +		ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
>> +
>> +		/* these are order-sensitive, and modify the buffer pointer */
>> +		start = read_n_cells(n_mem_addr_cells, &memcell_buf);
>> +		size = read_n_cells(n_mem_size_cells, &memcell_buf);
>> +
>> +		max_addr = max_t(phys_addr_t, max_addr, start + size);
>> +	}
>> +
>> +	return max_addr;
>> +}
> 
> Is there really no existing place where we keep track of the maximum
> possible memory address?

There are:

1. memblocks from mm/memblock.c - populated at boot time from
"usable" memory@ nodes, and mine are not "usable";

2. drmem from mm/drmem.c - populated from ibm,dynamic-memory-v2. These
do not support sparse regions, so when I tried them with a GPU
RAM region mapped at 0x244000000000, the device tree quickly grew
over 1 MB and then qemu crashed. I did not debug any further as this
memory is not hotpluggable anyway from the rtas/qemu perspective; in
other words, it is not something the user can hotplug or unplug.

And that is it afaict.


> 
>>  /*
>>   * If the PE supports dynamic dma windows, and there is space for a table
>>   * that can map all pages in a linear offset, then setup such a table,
>> @@ -1067,7 +1108,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
>>  	}
>>  	/* verify the window * number of ptes will map the partition */
>>  	/* check largest block * page size > max memory hotplug addr */
>> -	max_addr = memory_hotplug_max();
>> +	max_addr = ddw_memory_hotplug_max();
>>  	if (query.largest_available_block < (max_addr >> page_shift)) {
>>  		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
>>  			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
> 

-- 
Alexey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH kernel v3 11/22] powerpc/pseries/npu: Enable platform support
  2018-11-16  5:25     ` David Gibson
@ 2018-11-19  7:50       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 84+ messages in thread
From: Alexey Kardashevskiy @ 2018-11-19  7:50 UTC (permalink / raw)
  To: David Gibson
  Cc: Alex Williamson, Jose Ricardo Ziviani, Sam Bobroff,
	Alistair Popple, linuxppc-dev, kvm-ppc, Piotr Jaroszynski,
	Oliver O'Halloran, Andrew Donnellan,
	Leonardo Augusto Guimarães Garcia, Reza Arbab



On 16/11/2018 16:25, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:28:12PM +1100, Alexey Kardashevskiy wrote:
>> We have already changed the NPU API for GPUs not to call OPAL, and the
>> remaining bit is initializing the NPU structures.
>>
>> This uses a new QEMU capability which marks NPU-enabled vPHBs as
>> "IBM,npu-vphb" and initializes an NPU structure per vPHB.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/platforms/pseries/pci.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/pci.c b/arch/powerpc/platforms/pseries/pci.c
>> index 41d8a4d..a50d5e4 100644
>> --- a/arch/powerpc/platforms/pseries/pci.c
>> +++ b/arch/powerpc/platforms/pseries/pci.c
>> @@ -29,6 +29,7 @@
>>  #include <asm/pci-bridge.h>
>>  #include <asm/prom.h>
>>  #include <asm/ppc-pci.h>
>> +#include <asm/pci.h>
>>  #include "pseries.h"
>>  
>>  #if 0
>> @@ -237,6 +238,8 @@ static void __init pSeries_request_regions(void)
>>  
>>  void __init pSeries_final_fixup(void)
>>  {
>> +	struct pci_controller *hose;
>> +
>>  	pSeries_request_regions();
>>  
>>  	eeh_probe_devices();
>> @@ -246,6 +249,9 @@ void __init pSeries_final_fixup(void)
>>  	ppc_md.pcibios_sriov_enable = pseries_pcibios_sriov_enable;
>>  	ppc_md.pcibios_sriov_disable = pseries_pcibios_sriov_disable;
>>  #endif
>> +	list_for_each_entry(hose, &hose_list, list_node)
>> +		if (of_device_is_compatible(hose->dn, "IBM,npu-vphb"))
>> +			pnv_npu2_init(hose);
> 
> I take it from this that the NPUs are showing up with a compatible property
> that lists the normal PHB value as well as IBM,npu-vphb.  Since AIUI
> the NPUs act quite differently from other (real) PHBs, this seems
> bogus.  Shouldn't they be probed separately?

First, bad naming, I will think of a better one.

"IBM,npu-vphb" is an extra compatible type for an otherwise usual
pseries PHB. The differences are:

1. Initialize an "NPU" (not an NVLink2 bridge but a proper NPU) per PHB
for context manipulation for a GPU;

2. Kill the default DMA window.

When a GPU is passed to a guest, it looks like:

aik@u1804kvm:~$ lspci
00:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host
Controller (rev 03)
00:01.0 Ethernet controller: Red Hat, Inc Virtio network device
00:02.0 3D controller: NVIDIA Corporation GV100 [Tesla V100 SXM2] (rev a1)
00:03.0 Bridge: IBM Device 04ea (rev 01)
00:04.0 Bridge: IBM Device 04ea (rev 01)

So there are:
- one "struct npu" associated with the pseries PHB (not presented in
lspci but there /proc/device-tree/npuphb0/ with link@0 and link@1)

- 2 NVLink2 bridges, presented in lspci.
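
For completeness, on the device tree side the vPHB node then simply
carries one extra compatible entry; a sketch only (the base compatible
string and node name here are assumptions about what the vPHB already
advertises, not taken from the QEMU patch):

	pci@800000020000000 {
		compatible = "IBM,Logical_PHB", "IBM,npu-vphb";
		...
	};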


-- 
Alexey

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2018-11-19  7:53 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-13  8:28 [PATCH kernel v3 00/22] powerpc/powernv/npu, vfio: NVIDIA V100 + P9 passthrough Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 01/22] powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2 Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 02/22] powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region Alexey Kardashevskiy
2018-11-15  5:32   ` David Gibson
2018-11-13  8:28 ` [PATCH kernel v3 03/22] powerpc/mm/iommu: Make mm_iommu_new() fail on existing regions Alexey Kardashevskiy
2018-11-15  5:38   ` David Gibson
2018-11-13  8:28 ` [PATCH kernel v3 04/22] powerpc/vfio/iommu/kvm: Do not pin device memory Alexey Kardashevskiy
2018-11-16  3:11   ` David Gibson
2018-11-13  8:28 ` [PATCH kernel v3 05/22] powerpc/powernv/npu: Add helper to access struct npu for NPU device Alexey Kardashevskiy
2018-11-14  3:42   ` Alistair Popple
2018-11-13  8:28 ` [PATCH kernel v3 06/22] powerpc/powernv: Detach npu struct from pnv_phb Alexey Kardashevskiy
2018-11-14  4:28   ` Alistair Popple
2018-11-19  7:18     ` Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 07/22] powerpc/powernv/npu: Move OPAL calls away from context manipulation Alexey Kardashevskiy
2018-11-14  4:57   ` Alistair Popple
2018-11-13  8:28 ` [PATCH kernel v3 08/22] powerpc/pseries/iommu: Allow dynamic window to start from zero Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 09/22] powerpc/pseries/iommu: Force default DMA window removal Alexey Kardashevskiy
2018-11-16  4:54   ` David Gibson
2018-11-19  7:28     ` Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 10/22] powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation Alexey Kardashevskiy
2018-11-16  5:23   ` David Gibson
2018-11-19  7:43     ` Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 11/22] powerpc/pseries/npu: Enable platform support Alexey Kardashevskiy
2018-11-16  5:25   ` David Gibson
2018-11-19  7:50     ` Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 12/22] powerpc/pseries: Remove IOMMU API support for non-LPAR systems Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 13/22] powerpc/powernv/pseries: Rework device adding to IOMMU groups Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 14/22] powerpc/iommu_api: Move IOMMU groups setup to a single place Alexey Kardashevskiy
2018-11-19  0:15   ` David Gibson
2018-11-13  8:28 ` [PATCH kernel v3 15/22] powerpc/powernv: Reference iommu_table while it is linked to a group Alexey Kardashevskiy
2018-11-19  0:20   ` David Gibson
2018-11-13  8:28 ` [PATCH kernel v3 16/22] powerpc/powernv: Add purge cache OPAL call Alexey Kardashevskiy
2018-11-19  0:21   ` David Gibson
2018-11-13  8:28 ` [PATCH kernel v3 17/22] powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops Alexey Kardashevskiy
2018-11-19  0:24   ` David Gibson
2018-11-13  8:28 ` [PATCH kernel v3 18/22] powerpc/powernv/npu: Add compound IOMMU groups Alexey Kardashevskiy
2018-11-19  1:12   ` David Gibson
2018-11-19  2:29     ` Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 19/22] powerpc/powernv/npu: Add release_ownership hook Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 20/22] vfio_pci: Allow mapping extra regions Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 21/22] vfio_pci: Allow regions to add own capabilities Alexey Kardashevskiy
2018-11-13  8:28 ` [PATCH kernel v3 22/22] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] [10de:1db1] subdriver Alexey Kardashevskiy
