[PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink
@ 2016-03-09  6:28 Alexey Kardashevskiy
  2016-03-09  6:28 ` [PATCH kernel 01/10] vfio/spapr: Relax the IOMMU compatibility check Alexey Kardashevskiy
                   ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

IBM POWER8 NVlink systems contain usual Tesla K40-ish GPUs but also
contain a couple of really fast links between GPU and CPU. These links
are exposed to the userspace by the OPAL firmware as bridges.
In order to make these links work when GPU is passed to the guest,
these bridges need to be passed as well; otherwise performance will
degrade. More details are in 10/10.

This reworks the existing NPU support in the powernv platform and adds
VFIO support on top of that.

This was tested on POWER8NVL platform. pvr=0x004c0100.

Please comment. Thanks.


Alexey Kardashevskiy (10):
  vfio/spapr: Relax the IOMMU compatibility check
  powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire
  powerpc/powernv: Define TCE Kill flags
  powerpc/powernv/npu: TCE Kill helpers cleanup
  powerpc/powernv/npu: Use the correct IOMMU page size
  powerpc/powernv/npu: Simplify DMA setup
  powerpc/powernv/npu: Rework TCE Kill handling
  powerpc/powernv/npu: Add NPU devices to IOMMU group
  powerpc/powernv/ioda2: Export some helpers
  powerpc/powernv/npu: Enable passing through via VFIO

 arch/powerpc/platforms/powernv/npu-dma.c  | 387 ++++++++++++++++++------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 134 +++++------
 arch/powerpc/platforms/powernv/pci.h      |  32 +--
 drivers/vfio/vfio_iommu_spapr_tce.c       |   3 +-
 4 files changed, 309 insertions(+), 247 deletions(-)

-- 
2.5.0.rc3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH kernel 01/10] vfio/spapr: Relax the IOMMU compatibility check
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
@ 2016-03-09  6:28 ` Alexey Kardashevskiy
  2016-03-10  5:35   ` David Gibson
  2016-03-09  6:28 ` [PATCH kernel 02/10] powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

We are going to have multiple different types of PHB on the same system
with POWER8 + NVLink and PHBs will have different IOMMU ops. However
we only really care about one callback - create_table - so we can
relax the compatibility check here.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 0582b72..3054e3f 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -1188,7 +1188,8 @@ static int tce_iommu_attach_group(void *iommu_data,
 			goto unlock_exit;
 		}
 		table_group_tmp = iommu_group_get_iommudata(tcegrp->grp);
-		if (table_group_tmp->ops != table_group->ops) {
+		if (table_group_tmp->ops->create_table !=
+				table_group->ops->create_table) {
 			pr_warn("tce_vfio: Group %d is incompatible with group %d\n",
 					iommu_group_id(iommu_group),
 					iommu_group_id(tcegrp->grp));
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH kernel 02/10] powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
  2016-03-09  6:28 ` [PATCH kernel 01/10] vfio/spapr: Relax the IOMMU compatibility check Alexey Kardashevskiy
@ 2016-03-09  6:28 ` Alexey Kardashevskiy
  2016-03-10  5:35   ` David Gibson
  2016-03-09  6:28 ` [PATCH kernel 03/10] powerpc/powernv: Define TCE Kill flags Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

As in fact pnv_pci_ioda2_tce_invalidate_entire() invalidates TCEs for
the specific PE rather than the entire cache, rename it to
pnv_pci_ioda2_tce_invalidate_pe(). In later patches we will add
a proper pnv_pci_ioda2_tce_invalidate_entire().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 57dc2dc..889eca3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1824,7 +1824,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.get = pnv_tce_get,
 };
 
-static inline void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_ioda_pe *pe)
+static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 {
 	/* 01xb - invalidate TCEs that match the specified PE# */
 	unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF);
@@ -2101,7 +2101,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
 
 	pnv_pci_link_table_and_group(phb->hose->node, num,
 			tbl, &pe->table_group);
-	pnv_pci_ioda2_tce_invalidate_entire(pe);
+	pnv_pci_ioda2_tce_invalidate_pe(pe);
 
 	return 0;
 }
@@ -2245,7 +2245,7 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
 	if (ret)
 		pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
 	else
-		pnv_pci_ioda2_tce_invalidate_entire(pe);
+		pnv_pci_ioda2_tce_invalidate_pe(pe);
 
 	pnv_pci_unlink_table_and_group(table_group->tables[num], table_group);
 
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH kernel 03/10] powerpc/powernv: Define TCE Kill flags
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
  2016-03-09  6:28 ` [PATCH kernel 01/10] vfio/spapr: Relax the IOMMU compatibility check Alexey Kardashevskiy
  2016-03-09  6:28 ` [PATCH kernel 02/10] powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire Alexey Kardashevskiy
@ 2016-03-09  6:28 ` Alexey Kardashevskiy
  2016-03-10  5:36   ` David Gibson
  2016-03-09  6:29 ` [PATCH kernel 04/10] powerpc/powernv/npu: TCE Kill helpers cleanup Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:28 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

This replaces magic constants for TCE Kill IODA2 register with macros.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 889eca3..33e9489 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1824,10 +1824,13 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.get = pnv_tce_get,
 };
 
+#define TCE_KILL_INVAL_PE   PPC_BIT(1)
+#define TCE_KILL_INVAL_TCE  PPC_BIT(2)
+
 static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 {
 	/* 01xb - invalidate TCEs that match the specified PE# */
-	unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF);
+	unsigned long val = TCE_KILL_INVAL_PE | (pe->pe_number & 0xFF);
 	struct pnv_phb *phb = pe->phb;
 	struct pnv_ioda_pe *npe;
 	int i;
@@ -1855,7 +1858,7 @@ static void pnv_pci_ioda2_do_tce_invalidate(unsigned pe_number, bool rm,
 	unsigned long start, end, inc;
 
 	/* We'll invalidate DMA address in PE scope */
-	start = 0x2ull << 60;
+	start = TCE_KILL_INVAL_TCE;
 	start |= (pe_number & 0xFF);
 	end = start;
 
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH kernel 04/10] powerpc/powernv/npu: TCE Kill helpers cleanup
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2016-03-09  6:28 ` [PATCH kernel 03/10] powerpc/powernv: Define TCE Kill flags Alexey Kardashevskiy
@ 2016-03-09  6:29 ` Alexey Kardashevskiy
  2016-03-10  5:42   ` David Gibson
  2016-03-21  2:51   ` Alistair Popple
  2016-03-09  6:29 ` [PATCH kernel 05/10] powerpc/powernv/npu: Use the correct IOMMU page size Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

NPU PHB TCE Kill register is exactly the same as in the rest of POWER8
so let's reuse the existing code for NPU. The only bit missing is
a helper to reset the entire TCE cache so this moves such a helper
from NPU code and renames it.

Since pnv_npu_tce_invalidate() does really invalidate the entire cache,
this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU.
This adds an explicit comment for workaround for invalidating NPU TCE
cache.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c  | 41 -------------------------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 29 ++++++++++++++++++----
 arch/powerpc/platforms/powernv/pci.h      |  7 +-----
 3 files changed, 25 insertions(+), 52 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 7229acd..778570c 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -25,8 +25,6 @@
  * Other types of TCE cache invalidation are not functional in the
  * hardware.
  */
-#define TCE_KILL_INVAL_ALL PPC_BIT(0)
-
 static struct pci_dev *get_pci_dev(struct device_node *dn)
 {
 	return PCI_DN(dn)->pcidev;
@@ -161,45 +159,6 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
 	return pe;
 }
 
-void pnv_npu_tce_invalidate_entire(struct pnv_ioda_pe *npe)
-{
-	struct pnv_phb *phb = npe->phb;
-
-	if (WARN_ON(phb->type != PNV_PHB_NPU ||
-		    !phb->ioda.tce_inval_reg ||
-		    !(npe->flags & PNV_IODA_PE_DEV)))
-		return;
-
-	mb(); /* Ensure previous TCE table stores are visible */
-	__raw_writeq(cpu_to_be64(TCE_KILL_INVAL_ALL),
-		phb->ioda.tce_inval_reg);
-}
-
-void pnv_npu_tce_invalidate(struct pnv_ioda_pe *npe,
-				struct iommu_table *tbl,
-				unsigned long index,
-				unsigned long npages,
-				bool rm)
-{
-	struct pnv_phb *phb = npe->phb;
-
-	/* We can only invalidate the whole cache on NPU */
-	unsigned long val = TCE_KILL_INVAL_ALL;
-
-	if (WARN_ON(phb->type != PNV_PHB_NPU ||
-		    !phb->ioda.tce_inval_reg ||
-		    !(npe->flags & PNV_IODA_PE_DEV)))
-		return;
-
-	mb(); /* Ensure previous TCE table stores are visible */
-	if (rm)
-		__raw_rm_writeq(cpu_to_be64(val),
-		  (__be64 __iomem *) phb->ioda.tce_inval_reg_phys);
-	else
-		__raw_writeq(cpu_to_be64(val),
-			phb->ioda.tce_inval_reg);
-}
-
 void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
 {
 	struct pnv_ioda_pe *gpe;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 33e9489..90cdf49 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1824,9 +1824,23 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
 	.get = pnv_tce_get,
 };
 
+#define TCE_KILL_INVAL_ALL  PPC_BIT(0)
 #define TCE_KILL_INVAL_PE   PPC_BIT(1)
 #define TCE_KILL_INVAL_TCE  PPC_BIT(2)
 
+void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm)
+{
+	const unsigned long val = TCE_KILL_INVAL_ALL;
+
+	mb(); /* Ensure previous TCE table stores are visible */
+	if (rm)
+		__raw_rm_writeq(cpu_to_be64(val),
+				(__be64 __iomem *)
+				phb->ioda.tce_inval_reg_phys);
+	else
+		__raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg);
+}
+
 static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 {
 	/* 01xb - invalidate TCEs that match the specified PE# */
@@ -1847,7 +1861,7 @@ static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 			if (!npe || npe->phb->type != PNV_PHB_NPU)
 				continue;
 
-			pnv_npu_tce_invalidate_entire(npe);
+			pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
 		}
 }
 
@@ -1896,14 +1910,19 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 			index, npages);
 
 		if (pe->flags & PNV_IODA_PE_PEER)
-			/* Invalidate PEs using the same TCE table */
+			/*
+			 * The NVLink hardware does not support TCE kill
+			 * per TCE entry so we have to invalidate
+			 * the entire cache for it.
+			 */
 			for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
 				npe = pe->peers[i];
-				if (!npe || npe->phb->type != PNV_PHB_NPU)
+				if (!npe || npe->phb->type != PNV_PHB_NPU ||
+						!npe->phb->ioda.tce_inval_reg)
 					continue;
 
-				pnv_npu_tce_invalidate(npe, tbl, index,
-							npages, rm);
+				pnv_pci_ioda2_tce_invalidate_entire(npe->phb,
+						rm);
 			}
 	}
 }
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 3f814f3..0b89a4c 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -237,15 +237,10 @@ extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
 extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 
 /* Nvlink functions */
-extern void pnv_npu_tce_invalidate_entire(struct pnv_ioda_pe *npe);
-extern void pnv_npu_tce_invalidate(struct pnv_ioda_pe *npe,
-				       struct iommu_table *tbl,
-				       unsigned long index,
-				       unsigned long npages,
-				       bool rm);
 extern void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe);
 extern void pnv_npu_setup_dma_pe(struct pnv_ioda_pe *npe);
 extern int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe, bool enabled);
 extern int pnv_npu_dma_set_mask(struct pci_dev *npdev, u64 dma_mask);
+extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
 
 #endif /* __POWERNV_PCI_H */
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH kernel 05/10] powerpc/powernv/npu: Use the correct IOMMU page size
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2016-03-09  6:29 ` [PATCH kernel 04/10] powerpc/powernv/npu: TCE Kill helpers cleanup Alexey Kardashevskiy
@ 2016-03-09  6:29 ` Alexey Kardashevskiy
  2016-03-10  5:43   ` David Gibson
  2016-03-21  2:57   ` Alistair Popple
  2016-03-09  6:29 ` [PATCH kernel 06/10] powerpc/powernv/npu: Simplify DMA setup Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

This uses the page size from iommu_table instead of hard-coded 4K.
This should cause no change in behavior.

While we are here, move bits around to prepare for further rework
which will define and use iommu_table_group_ops.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 778570c..5bd5fee 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -204,8 +204,7 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
 	struct pnv_phb *phb = npe->phb;
 	struct pci_dev *gpdev;
 	struct pnv_ioda_pe *gpe;
-	void *addr;
-	unsigned int size;
+	struct iommu_table *tbl;
 	int64_t rc;
 
 	/*
@@ -219,11 +218,11 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
 	if (!gpe)
 		return;
 
-	addr = (void *)gpe->table_group.tables[0]->it_base;
-	size = gpe->table_group.tables[0]->it_size << 3;
+	tbl = gpe->table_group.tables[0];
 	rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
-					npe->pe_number, 1, __pa(addr),
-					size, 0x1000);
+					npe->pe_number, 1, __pa(tbl->it_base),
+					tbl->it_size << 3,
+					IOMMU_PAGE_SIZE(tbl));
 	if (rc != OPAL_SUCCESS)
 		pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n",
 			__func__, rc, phb->hose->global_number, npe->pe_number);
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH kernel 06/10] powerpc/powernv/npu: Simplify DMA setup
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2016-03-09  6:29 ` [PATCH kernel 05/10] powerpc/powernv/npu: Use the correct IOMMU page size Alexey Kardashevskiy
@ 2016-03-09  6:29 ` Alexey Kardashevskiy
  2016-03-16  5:55   ` David Gibson
  2016-03-09  6:29 ` [PATCH kernel 07/10] powerpc/powernv/npu: Rework TCE Kill handling Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

NPU devices are quite specific, in fact they represent side DMA channel
of a GPU device. The GPU/NPU driver never actually configures DMA
for NPU devices, instead it relies on the platform code to propagate
DMA setup to NPU devices when a main GPU device is being configured.
When GPU is being set up, the same configuration - bypass or 32bit DMA -
is used for NPU. This makes DMA setup explicit.

pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
made static and prints warning as dma_set_mask() should never be called
on this function as in any case it will not configure GPU; so we make
this explicit.

Instead of using PNV_IODA_PE_PEER and peers[] (which next patch will
remove), we test every PCI device if there are corresponding NVLink
devices. If there are any, we propagate bypass mode to just found NPU
devices by calling the setup helper directly (which takes @bypass) and
avoid guessing (i.e. calculating from DMA mask) whether we need bypass
or not on NPU devices. Since DMA setup happens in very rare occasion,
this will not slow down booting or VFIO start/stop much.

This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
more clear what the function really does which is programming 32bit
table address to the TVT ("disabling bypass" means writing zeroes to
the TVT).

This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
the DMA configuration on NPU does not matter until dma_set_mask() is
called on GPU and that will do the NPU DMA configuration.

This removes phb->dma_dev_setup initialization for NPU as
pnv_pci_ioda_dma_dev_setup is no-op for it anyway.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c  | 90 ++++++++++++++-----------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++++------
 arch/powerpc/platforms/powernv/pci.h      |  3 +-
 3 files changed, 54 insertions(+), 69 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 5bd5fee..8960e46 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -196,10 +196,9 @@ void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
 }
 
 /*
- * For the NPU we want to point the TCE table at the same table as the
- * real PCI device.
+ * Enables 32 bit DMA on NPU.
  */
-static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
+static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
 {
 	struct pnv_phb *phb = npe->phb;
 	struct pci_dev *gpdev;
@@ -235,72 +234,63 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
 }
 
 /*
- * Enable/disable bypass mode on the NPU. The NPU only supports one
+ * Enables bypass mode on the NPU. The NPU only supports one
  * window per link, so bypass needs to be explicitly enabled or
  * disabled. Unlike for a PHB3 bypass and non-bypass modes can't be
  * active at the same time.
  */
-int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe, bool enable)
+static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
 {
 	struct pnv_phb *phb = npe->phb;
 	int64_t rc = 0;
+	phys_addr_t top = memblock_end_of_DRAM();
 
 	if (phb->type != PNV_PHB_NPU || !npe->pdev)
 		return -EINVAL;
 
-	if (enable) {
-		/* Enable the bypass window */
-		phys_addr_t top = memblock_end_of_DRAM();
+	/* Enable the bypass window */
 
-		npe->tce_bypass_base = 0;
-		top = roundup_pow_of_two(top);
-		dev_info(&npe->pdev->dev, "Enabling bypass for PE %d\n",
-			 npe->pe_number);
-		rc = opal_pci_map_pe_dma_window_real(phb->opal_id,
-					npe->pe_number, npe->pe_number,
-					npe->tce_bypass_base, top);
-	} else {
-		/*
-		 * Disable the bypass window by replacing it with the
-		 * TCE32 window.
-		 */
-		pnv_npu_disable_bypass(npe);
-	}
+	npe->tce_bypass_base = 0;
+	top = roundup_pow_of_two(top);
+	dev_info(&npe->pdev->dev, "Enabling bypass for PE %d\n",
+			npe->pe_number);
+	rc = opal_pci_map_pe_dma_window_real(phb->opal_id,
+			npe->pe_number, npe->pe_number,
+			npe->tce_bypass_base, top);
 
 	return rc;
 }
 
-int pnv_npu_dma_set_mask(struct pci_dev *npdev, u64 dma_mask)
+void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
 {
-	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
-	struct pnv_phb *phb = hose->private_data;
-	struct pci_dn *pdn = pci_get_pdn(npdev);
-	struct pnv_ioda_pe *npe, *gpe;
-	struct pci_dev *gpdev;
-	uint64_t top;
-	bool bypass = false;
+	int i;
+	struct pnv_phb *phb;
+	struct pci_dn *pdn;
+	struct pnv_ioda_pe *npe;
+	struct pci_dev *npdev;
 
-	if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
-		return -ENXIO;
+	for (i = 0; ; ++i) {
+		npdev = pnv_pci_get_npu_dev(gpdev, i);
 
-	/* We only do bypass if it's enabled on the linked device */
-	npe = &phb->ioda.pe_array[pdn->pe_number];
-	gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
-	if (!gpe)
-		return -ENODEV;
+		if (!npdev)
+			break;
 
-	if (gpe->tce_bypass_enabled) {
-		top = gpe->tce_bypass_base + memblock_end_of_DRAM() - 1;
-		bypass = (dma_mask >= top);
+		pdn = pci_get_pdn(npdev);
+		if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+			return;
+
+		phb = pci_bus_to_host(npdev->bus)->private_data;
+
+		/* We only do bypass if it's enabled on the linked device */
+		npe = &phb->ioda.pe_array[pdn->pe_number];
+
+		if (bypass) {
+			dev_info(&npdev->dev,
+					"Using 64-bit DMA iommu bypass\n");
+			pnv_npu_dma_set_bypass(npe);
+		} else {
+			dev_info(&npdev->dev, "Using 32-bit DMA via iommu\n");
+			pnv_npu_dma_set_32(npe);
+		}
 	}
-
-	if (bypass)
-		dev_info(&npdev->dev, "Using 64-bit DMA iommu bypass\n");
-	else
-		dev_info(&npdev->dev, "Using 32-bit DMA via iommu\n");
-
-	pnv_npu_dma_set_bypass(npe, bypass);
-	*npdev->dev.dma_mask = dma_mask;
-
-	return 0;
 }
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 90cdf49..47085b7 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1640,8 +1640,6 @@ static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
 	struct pnv_ioda_pe *pe;
 	uint64_t top;
 	bool bypass = false;
-	struct pci_dev *linked_npu_dev;
-	int i;
 
 	if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
 		return -ENODEV;;
@@ -1662,15 +1660,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
 	*pdev->dev.dma_mask = dma_mask;
 
 	/* Update peer npu devices */
-	if (pe->flags & PNV_IODA_PE_PEER)
-		for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
-			if (!pe->peers[i])
-				continue;
-
-			linked_npu_dev = pe->peers[i]->pdev;
-			if (dma_get_mask(&linked_npu_dev->dev) != dma_mask)
-				dma_set_mask(&linked_npu_dev->dev, dma_mask);
-		}
+	pnv_npu_try_dma_set_bypass(pdev, bypass);
 
 	return 0;
 }
@@ -3120,7 +3110,6 @@ static void pnv_npu_ioda_fixup(void)
 			enable_bypass = dma_get_mask(&pe->pdev->dev) ==
 				DMA_BIT_MASK(64);
 			pnv_npu_init_dma_pe(pe);
-			pnv_npu_dma_set_bypass(pe, enable_bypass);
 		}
 	}
 }
@@ -3272,6 +3261,14 @@ static const struct pci_controller_ops pnv_pci_ioda_controller_ops = {
        .shutdown = pnv_pci_ioda_shutdown,
 };
 
+static int pnv_npu_dma_set_mask(struct pci_dev *npdev, u64 dma_mask)
+{
+	dev_err_once(&npdev->dev,
+			"%s operation unsupported for NVLink devices\n",
+			__func__);
+	return -EPERM;
+}
+
 static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
 	.dma_dev_setup = pnv_pci_dma_dev_setup,
 #ifdef CONFIG_PCI_MSI
@@ -3428,9 +3425,6 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	/* Setup RID -> PE mapping function */
 	phb->bdfn_to_pe = pnv_ioda_bdfn_to_pe;
 
-	/* Setup TCEs */
-	phb->dma_dev_setup = pnv_pci_ioda_dma_dev_setup;
-
 	/* Setup MSI support */
 	pnv_pci_init_ioda_msis(phb);
 
@@ -3443,10 +3437,12 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	 */
 	ppc_md.pcibios_fixup = pnv_pci_ioda_fixup;
 
-	if (phb->type == PNV_PHB_NPU)
+	if (phb->type == PNV_PHB_NPU) {
 		hose->controller_ops = pnv_npu_ioda_controller_ops;
-	else
+	} else {
+		phb->dma_dev_setup = pnv_pci_ioda_dma_dev_setup;
 		hose->controller_ops = pnv_pci_ioda_controller_ops;
+	}
 
 #ifdef CONFIG_PCI_IOV
 	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_iov_resources;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 0b89a4c..d574a9d 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -239,8 +239,7 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 /* Nvlink functions */
 extern void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe);
 extern void pnv_npu_setup_dma_pe(struct pnv_ioda_pe *npe);
-extern int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe, bool enabled);
-extern int pnv_npu_dma_set_mask(struct pci_dev *npdev, u64 dma_mask);
+extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
 
 #endif /* __POWERNV_PCI_H */
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH kernel 07/10] powerpc/powernv/npu: Rework TCE Kill handling
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2016-03-09  6:29 ` [PATCH kernel 06/10] powerpc/powernv/npu: Simplify DMA setup Alexey Kardashevskiy
@ 2016-03-09  6:29 ` Alexey Kardashevskiy
  2016-03-21  6:50   ` Alistair Popple
  2016-03-09  6:29 ` [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
used to link GPU and NPU for 2 purposes:

1. Access NPU _quickly_ when configuring DMA for GPU - this was addressed
in the previos patch by removing use of it as DMA setup is not what
the kernel would constantly do.

2. Invalidate TCE cache for NPU when it is invalidated for GPU.
GPU and NPU are in different PE. There is already a mechanism to
attach multiple iommu_table_group to the same iommu_table (used for VFIO),
we can reuse it here so does this patch.

This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
not needed anymore.

While we are here, add TCE cache invalidation after changing TVT.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c  | 75 +++++++++----------------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 57 +++--------------------
 arch/powerpc/platforms/powernv/pci.h      |  6 ---
 3 files changed, 29 insertions(+), 109 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 8960e46..866d3d3 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -136,22 +136,17 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
 	struct pnv_ioda_pe *pe;
 	struct pci_dn *pdn;
 
-	if (npe->flags & PNV_IODA_PE_PEER) {
-		pe = npe->peers[0];
-		pdev = pe->pdev;
-	} else {
-		pdev = pnv_pci_get_gpu_dev(npe->pdev);
-		if (!pdev)
-			return NULL;
+	pdev = pnv_pci_get_gpu_dev(npe->pdev);
+	if (!pdev)
+		return NULL;
 
-		pdn = pci_get_pdn(pdev);
-		if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
-			return NULL;
+	pdn = pci_get_pdn(pdev);
+	if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+		return NULL;
 
-		hose = pci_bus_to_host(pdev->bus);
-		phb = hose->private_data;
-		pe = &phb->ioda.pe_array[pdn->pe_number];
-	}
+	hose = pci_bus_to_host(pdev->bus);
+	phb = hose->private_data;
+	pe = &phb->ioda.pe_array[pdn->pe_number];
 
 	if (gpdev)
 		*gpdev = pdev;
@@ -159,42 +154,6 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
 	return pe;
 }
 
-void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
-{
-	struct pnv_ioda_pe *gpe;
-	struct pci_dev *gpdev;
-	int i, avail = -1;
-
-	if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV))
-		return;
-
-	gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
-	if (!gpe)
-		return;
-
-	for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
-		/* Nothing to do if the PE is already connected. */
-		if (gpe->peers[i] == npe)
-			return;
-
-		if (!gpe->peers[i])
-			avail = i;
-	}
-
-	if (WARN_ON(avail < 0))
-		return;
-
-	gpe->peers[avail] = npe;
-	gpe->flags |= PNV_IODA_PE_PEER;
-
-	/*
-	 * We assume that the NPU devices only have a single peer PE
-	 * (the GPU PCIe device PE).
-	 */
-	npe->peers[0] = gpe;
-	npe->flags |= PNV_IODA_PE_PEER;
-}
-
 /*
  * Enables 32 bit DMA on NPU.
  */
@@ -225,6 +184,13 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
 	if (rc != OPAL_SUCCESS)
 		pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n",
 			__func__, rc, phb->hose->global_number, npe->pe_number);
+	else
+		pnv_pci_ioda2_tce_invalidate_entire(phb, false);
+
+	/* Add the table to the list so its TCE cache will get invalidated */
+	npe->table_group.tables[0] = tbl;
+	pnv_pci_link_table_and_group(phb->hose->node, 0,
+			tbl, &npe->table_group);
 
 	/*
 	 * We don't initialise npu_pe->tce32_table as we always use
@@ -245,10 +211,10 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
 	int64_t rc = 0;
 	phys_addr_t top = memblock_end_of_DRAM();
 
-	if (phb->type != PNV_PHB_NPU || !npe->pdev)
-		return -EINVAL;
-
 	/* Enable the bypass window */
+	pnv_pci_unlink_table_and_group(npe->table_group.tables[0],
+			&npe->table_group);
+	npe->table_group.tables[0] = NULL;
 
 	npe->tce_bypass_base = 0;
 	top = roundup_pow_of_two(top);
@@ -258,6 +224,9 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
 			npe->pe_number, npe->pe_number,
 			npe->tce_bypass_base, top);
 
+	if (rc == OPAL_SUCCESS)
+		pnv_pci_ioda2_tce_invalidate_entire(phb, false);
+
 	return rc;
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 47085b7..5a6cf2e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1836,23 +1836,12 @@ static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
 	/* 01xb - invalidate TCEs that match the specified PE# */
 	unsigned long val = TCE_KILL_INVAL_PE | (pe->pe_number & 0xFF);
 	struct pnv_phb *phb = pe->phb;
-	struct pnv_ioda_pe *npe;
-	int i;
 
 	if (!phb->ioda.tce_inval_reg)
 		return;
 
 	mb(); /* Ensure above stores are visible */
 	__raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg);
-
-	if (pe->flags & PNV_IODA_PE_PEER)
-		for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
-			npe = pe->peers[i];
-			if (!npe || npe->phb->type != PNV_PHB_NPU)
-				continue;
-
-			pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
-		}
 }
 
 static void pnv_pci_ioda2_do_tce_invalidate(unsigned pe_number, bool rm,
@@ -1887,33 +1876,24 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 	struct iommu_table_group_link *tgl;
 
 	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
-		struct pnv_ioda_pe *npe;
 		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
 				struct pnv_ioda_pe, table_group);
 		__be64 __iomem *invalidate = rm ?
 			(__be64 __iomem *)pe->phb->ioda.tce_inval_reg_phys :
 			pe->phb->ioda.tce_inval_reg;
-		int i;
 
-		pnv_pci_ioda2_do_tce_invalidate(pe->pe_number, rm,
-			invalidate, tbl->it_page_shift,
-			index, npages);
-
-		if (pe->flags & PNV_IODA_PE_PEER)
+		if (pe->phb->type == PNV_PHB_NPU) {
 			/*
 			 * The NVLink hardware does not support TCE kill
 			 * per TCE entry so we have to invalidate
 			 * the entire cache for it.
 			 */
-			for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
-				npe = pe->peers[i];
-				if (!npe || npe->phb->type != PNV_PHB_NPU ||
-						!npe->phb->ioda.tce_inval_reg)
-					continue;
-
-				pnv_pci_ioda2_tce_invalidate_entire(npe->phb,
-						rm);
-			}
+			pnv_pci_ioda2_tce_invalidate_entire(pe->phb, rm);
+			continue;
+		}
+		pnv_pci_ioda2_do_tce_invalidate(pe->pe_number, rm,
+			invalidate, tbl->it_page_shift,
+			index, npages);
 	}
 }
 
@@ -3094,26 +3074,6 @@ static void pnv_pci_ioda_create_dbgfs(void)
 #endif /* CONFIG_DEBUG_FS */
 }
 
-static void pnv_npu_ioda_fixup(void)
-{
-	bool enable_bypass;
-	struct pci_controller *hose, *tmp;
-	struct pnv_phb *phb;
-	struct pnv_ioda_pe *pe;
-
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
-		phb = hose->private_data;
-		if (phb->type != PNV_PHB_NPU)
-			continue;
-
-		list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link) {
-			enable_bypass = dma_get_mask(&pe->pdev->dev) ==
-				DMA_BIT_MASK(64);
-			pnv_npu_init_dma_pe(pe);
-		}
-	}
-}
-
 static void pnv_pci_ioda_fixup(void)
 {
 	pnv_pci_ioda_setup_PEs();
@@ -3126,9 +3086,6 @@ static void pnv_pci_ioda_fixup(void)
 	eeh_init();
 	eeh_addr_cache_build();
 #endif
-
-	/* Link NPU IODA tables to their PCI devices. */
-	pnv_npu_ioda_fixup();
 }
 
 /*
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index d574a9d..06405fd 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -24,7 +24,6 @@ enum pnv_phb_model {
 #define PNV_IODA_PE_MASTER	(1 << 3)	/* Master PE in compound case	*/
 #define PNV_IODA_PE_SLAVE	(1 << 4)	/* Slave PE in compound case	*/
 #define PNV_IODA_PE_VF		(1 << 5)	/* PE for one VF 		*/
-#define PNV_IODA_PE_PEER	(1 << 6)	/* PE has peers			*/
 
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
@@ -32,9 +31,6 @@ struct pnv_ioda_pe {
 	unsigned long		flags;
 	struct pnv_phb		*phb;
 
-#define PNV_IODA_MAX_PEER_PES	8
-	struct pnv_ioda_pe	*peers[PNV_IODA_MAX_PEER_PES];
-
 	/* A PE can be associated with a single device or an
 	 * entire bus (& children). In the former case, pdev
 	 * is populated, in the later case, pbus is.
@@ -237,8 +233,6 @@ extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
 extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 
 /* Nvlink functions */
-extern void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe);
-extern void pnv_npu_setup_dma_pe(struct pnv_ioda_pe *npe);
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
 
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2016-03-09  6:29 ` [PATCH kernel 07/10] powerpc/powernv/npu: Rework TCE Kill handling Alexey Kardashevskiy
@ 2016-03-09  6:29 ` Alexey Kardashevskiy
  2016-03-21  4:48   ` David Gibson
  2016-03-09  6:29 ` [PATCH kernel 09/10] powerpc/powernv/ioda2: Export some helpers Alexey Kardashevskiy
  2016-03-09  6:29 ` [PATCH kernel 10/10] powerpc/powernv/npu: Enable passing through via VFIO Alexey Kardashevskiy
  9 siblings, 1 reply; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

NPU devices have their own TVT which means they are isolated and can be
passed to the userspace via VFIO. The first step is to create an IOMMU
group and attach devices there so does the patch.

This adds a helper to npu-dma.c which gets GPU from the NPU's pdev and
then walks through all devices on the same bus to determine which NPUs
belong to the same GPU.

This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
loop skips NPU PEs as they do not have 32bit DMA segments.

This uses get_gpu_pci_dev_and_pe() to get @gpdev rather than
pnv_pci_get_gpu_dev() as the following patch will use @gpe as well.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c  | 40 +++++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c |  8 +++++++
 arch/powerpc/platforms/powernv/pci.h      |  1 +
 3 files changed, 49 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 866d3d3..e5a5feb 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -263,3 +263,43 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
 		}
 	}
 }
+
+void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
+{
+	struct iommu_table *tbl;
+	struct pnv_phb *phb = npe->phb;
+	struct pci_bus *pbus = phb->hose->bus;
+	struct pci_dev *npdev, *gpdev = NULL, *gptmp;
+	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
+
+	if (!gpe || !gpdev)
+		return;
+
+	iommu_register_group(&npe->table_group, phb->hose->global_number,
+			npe->pe_number);
+
+	tbl = pnv_pci_table_alloc(phb->hose->node);
+
+	list_for_each_entry(npdev, &pbus->devices, bus_list) {
+		gptmp = pnv_pci_get_gpu_dev(npdev);
+
+		if (gptmp != gpdev)
+			continue;
+
+		/*
+		 * The iommu_add_device() picks an IOMMU group from
+		 * the first IOMMU group attached to the iommu_table
+		 * so we need to pretend that there is a table so
+		 * iommu_add_device() can complete the job.
+		 * We unlink the tempopary table from the group afterwards.
+		 */
+		pnv_pci_link_table_and_group(phb->hose->node, 0,
+				tbl, &npe->table_group);
+		set_iommu_table_base(&npdev->dev, tbl);
+		iommu_add_device(&npdev->dev);
+		set_iommu_table_base(&npdev->dev, NULL);
+		pnv_pci_unlink_table_and_group(tbl, &npe->table_group);
+	}
+
+	iommu_free_table(tbl, "");
+}
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5a6cf2e..becd168 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2570,6 +2570,14 @@ static void pnv_ioda_setup_dma(struct pnv_phb *phb)
 		remaining -= segs;
 		base += segs;
 	}
+	/*
+	 * Create an IOMMU group and add devices to it.
+	 * DMA setup is to be done via GPU's dma_set_mask().
+	 */
+	if (phb->type == PNV_PHB_NPU) {
+		list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
+			pnv_pci_npu_setup_iommu(pe);
+	}
 }
 
 #ifdef CONFIG_PCI_MSI
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 06405fd..0c0083a 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -235,5 +235,6 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 /* Nvlink functions */
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
+extern void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
 
 #endif /* __POWERNV_PCI_H */
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH kernel 09/10] powerpc/powernv/ioda2: Export some helpers
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2016-03-09  6:29 ` [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group Alexey Kardashevskiy
@ 2016-03-09  6:29 ` Alexey Kardashevskiy
  2016-03-09  6:29 ` [PATCH kernel 10/10] powerpc/powernv/npu: Enable passing through via VFIO Alexey Kardashevskiy
  9 siblings, 0 replies; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

We are going to support VFIO on NPU PHB type which will share some code
with the normal IODA2 PHB.

This exports pnv_pci_ioda2_create_table and pnv_pci_ioda2_get_table_size.

This exports debugging helper pe_level_printk() as well.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 13 +++----------
 arch/powerpc/platforms/powernv/pci.h      | 15 +++++++++++++++
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index becd168..1a96498 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -56,7 +56,7 @@
 
 static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
 
-static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
+void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
 {
 	struct va_format vaf;
@@ -87,13 +87,6 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 	va_end(args);
 }
 
-#define pe_err(pe, fmt, ...)					\
-	pe_level_printk(pe, KERN_ERR, fmt, ##__VA_ARGS__)
-#define pe_warn(pe, fmt, ...)					\
-	pe_level_printk(pe, KERN_WARNING, fmt, ##__VA_ARGS__)
-#define pe_info(pe, fmt, ...)					\
-	pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__)
-
 static bool pnv_iommu_bypass_disabled __read_mostly;
 
 static int __init iommu_setup(char *str)
@@ -2130,7 +2123,7 @@ static long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
 		__u32 page_shift, __u64 window_size, __u32 levels,
 		struct iommu_table *tbl);
 
-static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
+long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 		int num, __u32 page_shift, __u64 window_size, __u32 levels,
 		struct iommu_table **ptbl)
 {
@@ -2246,7 +2239,7 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
 #endif
 
 #ifdef CONFIG_IOMMU_API
-static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
 		__u64 window_size, __u32 levels)
 {
 	unsigned long bytes = 0;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 0c0083a..3017717d 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -232,6 +232,21 @@ extern void pnv_pci_dma_bus_setup(struct pci_bus *bus);
 extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
 extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 
+extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+		__u64 window_size, __u32 levels);
+extern long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
+		int num, __u32 page_shift, __u64 window_size, __u32 levels,
+		struct iommu_table **ptbl);
+
+extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
+			    const char *fmt, ...);
+#define pe_err(pe, fmt, ...)					\
+	pe_level_printk(pe, KERN_ERR, fmt, ##__VA_ARGS__)
+#define pe_warn(pe, fmt, ...)					\
+	pe_level_printk(pe, KERN_WARNING, fmt, ##__VA_ARGS__)
+#define pe_info(pe, fmt, ...)					\
+	pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__)
+
 /* Nvlink functions */
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH kernel 10/10] powerpc/powernv/npu: Enable passing through via VFIO
  2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2016-03-09  6:29 ` [PATCH kernel 09/10] powerpc/powernv/ioda2: Export some helpers Alexey Kardashevskiy
@ 2016-03-09  6:29 ` Alexey Kardashevskiy
  9 siblings, 0 replies; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-09  6:29 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, David Gibson, Gavin Shan, Paul Mackerras,
	Russell Currey, Alex Williamson

IBM POWER8 NVlink systems contain usual Tesla K40-ish GPUs but also
contain a couple of really fast links between GPU and CPU. These links
are exposed to the userspace by the OPAL firmware as bridges.
The device tree has references from GPU to NPU and vice versa via
"ibm,npu" and "ibm,gpu" properties which are "linux,phandle" of
the counterparts. The typical GPU looks like:

0003:01:00.0 3D controller: NVIDIA Corporation Device 15ff (rev a1)
0008:00:00.0 Bridge: IBM Device 04ea (prog-if 01)
0008:00:00.1 Bridge: IBM Device 04ea (prog-if 01)

In the host kernel, couple of links of the same GPU make a new PE.
A PHB with these links has a different type - PNV_PHB_NPU (the standard
IODA2 bridge type is PNV_PHB_IODA2). The previos patch added these links
to a new IOMMU group.

In order to make these links work when GPU is passed to the guest,
these bridges need to be passed as well; otherwise performance will
degrade. The previous patch adds these bridges to a new IOMMU group,
this patch adds the bits required by VFIO SPAPR TCE driver to pass it
to the userspace.

This defines pnv_pci_npu_ops and initializes it. It reuses
pnv_pci_ioda2_get_table_size() and pnv_pci_ioda2_create_table() as
the table will be programmed to both NPU and IODA2 bridge types so
it needs to be compatible with both bridge types.

As it is not known in what order the userspace will be adding IOMMU
groups to a VFIO container, we need to maintain PHB type compatibility.
This sets up table_group properties from the linked GPU. This initializes
@tce_bypass_base from GPU as well as this is used by
pnv_pci_ioda2_create_table(). This is a bit ugly but the only place
it is actually used in the NPU PHB is enabling bypass mode and there
we can safely use plain zero.

NPU PHB has just a single TVE per NVLink so it can have either 32bit or
64bit window but never both. Nevertheless the NPU table_group is added
to both iommu_table in the VFIO container for simpler design.

Note that the userspace should pass GPU with corresponding NPUs,
otherwise isolation is not guaranteed.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 128 ++++++++++++++++++++++++++++++-
 1 file changed, 126 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index e5a5feb..283cd73 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -216,13 +216,12 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
 			&npe->table_group);
 	npe->table_group.tables[0] = NULL;
 
-	npe->tce_bypass_base = 0;
 	top = roundup_pow_of_two(top);
 	dev_info(&npe->pdev->dev, "Enabling bypass for PE %d\n",
 			npe->pe_number);
 	rc = opal_pci_map_pe_dma_window_real(phb->opal_id,
 			npe->pe_number, npe->pe_number,
-			npe->tce_bypass_base, top);
+			0 /* bypass base */, top);
 
 	if (rc == OPAL_SUCCESS)
 		pnv_pci_ioda2_tce_invalidate_entire(phb, false);
@@ -264,6 +263,120 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
 	}
 }
 
+static long pnv_pci_npu_set_window(struct iommu_table_group *table_group,
+		int num, struct iommu_table *tbl)
+{
+	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+			table_group);
+	struct pnv_phb *phb = npe->phb;
+	int64_t rc;
+	const unsigned long size = tbl->it_indirect_levels ?
+		tbl->it_level_size : tbl->it_size;
+	const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
+	const __u64 win_size = tbl->it_size << tbl->it_page_shift;
+
+	pe_info(npe, "Setting up window#%d %llx..%llx pg=%lx\n", num,
+			start_addr, start_addr + win_size - 1,
+			IOMMU_PAGE_SIZE(tbl));
+
+	rc = opal_pci_map_pe_dma_window(phb->opal_id,
+			npe->pe_number,
+			npe->pe_number,
+			tbl->it_indirect_levels + 1,
+			__pa(tbl->it_base),
+			size << 3,
+			IOMMU_PAGE_SIZE(tbl));
+	if (rc) {
+		pe_err(npe, "Failed to configure TCE table, err %lld\n", rc);
+		return rc;
+	}
+
+	pnv_pci_link_table_and_group(phb->hose->node, num,
+			tbl, &npe->table_group);
+	pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
+
+	return rc;
+}
+
+static long pnv_pci_npu_unset_window(struct iommu_table_group *table_group,
+		int num)
+{
+	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+			table_group);
+	struct pnv_phb *phb = npe->phb;
+	long ret;
+
+	pe_info(npe, "Removing DMA window #%d\n", num);
+
+	ret = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
+			npe->pe_number,
+			0/* levels */, 0/* table address */,
+			0/* table size */, 0/* page size */);
+	if (ret)
+		pe_warn(npe, "Unmapping failed, ret = %ld\n", ret);
+	else
+		pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
+
+	pnv_pci_unlink_table_and_group(table_group->tables[num], table_group);
+
+	return ret;
+}
+
+/* Switch ownership from platform code to external user (e.g. VFIO) */
+static void pnv_pci_npu_take_ownership(struct iommu_table_group *table_group)
+{
+	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+			table_group);
+	struct pnv_phb *phb = npe->phb;
+	int64_t ret;
+
+	if (npe->table_group.tables[0]) {
+		pnv_pci_unlink_table_and_group(npe->table_group.tables[0],
+				&npe->table_group);
+		npe->table_group.tables[0] = NULL;
+		ret = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
+				npe->pe_number,
+				0/* levels */, 0/* table address */,
+				0/* table size */, 0/* page size */);
+	} else {
+		ret = opal_pci_map_pe_dma_window_real(phb->opal_id,
+				npe->pe_number, npe->pe_number,
+				0 /* bypass base */, 0);
+	}
+
+	if (ret != OPAL_SUCCESS)
+		pe_err(npe, "Failed to remove DMA window");
+	else
+		pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
+}
+
+/* Switch ownership from external user (e.g. VFIO) back to core */
+static void pnv_pci_npu_release_ownership(struct iommu_table_group *table_group)
+{
+	struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+			table_group);
+	struct pnv_phb *phb = npe->phb;
+	int64_t ret;
+
+	ret = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
+			npe->pe_number,
+			0/* levels */, 0/* table address */,
+			0/* table size */, 0/* page size */);
+	if (ret != OPAL_SUCCESS)
+		pe_err(npe, "Failed to remove DMA window");
+	else
+		pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
+}
+
+static struct iommu_table_group_ops pnv_pci_npu_ops = {
+	.get_table_size = pnv_pci_ioda2_get_table_size,
+	.create_table = pnv_pci_ioda2_create_table,
+	.set_window = pnv_pci_npu_set_window,
+	.unset_window = pnv_pci_npu_unset_window,
+	.take_ownership = pnv_pci_npu_take_ownership,
+	.release_ownership = pnv_pci_npu_release_ownership,
+};
+
 void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
 {
 	struct iommu_table *tbl;
@@ -275,6 +388,17 @@ void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
 	if (!gpe || !gpdev)
 		return;
 
+	npe->table_group.tce32_start = gpe->table_group.tce32_start;
+	npe->table_group.tce32_size = gpe->table_group.tce32_size;
+	npe->table_group.max_dynamic_windows_supported =
+			gpe->table_group.max_dynamic_windows_supported;
+	npe->table_group.max_levels = gpe->table_group.max_levels;
+	npe->table_group.pgsizes = gpe->table_group.pgsizes;
+	npe->tce_bypass_base = gpe->tce_bypass_base;
+#ifdef CONFIG_IOMMU_API
+	npe->table_group.ops = &pnv_pci_npu_ops;
+#endif
+
 	iommu_register_group(&npe->table_group, phb->hose->global_number,
 			npe->pe_number);
 
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 01/10] vfio/spapr: Relax the IOMMU compatibility check
  2016-03-09  6:28 ` [PATCH kernel 01/10] vfio/spapr: Relax the IOMMU compatibility check Alexey Kardashevskiy
@ 2016-03-10  5:35   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2016-03-10  5:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 1429 bytes --]

On Wed, Mar 09, 2016 at 05:28:57PM +1100, Alexey Kardashevskiy wrote:
> We are going to have multiple different types of PHB on the same system
> with POWER8 + NVLink and PHBs will have different IOMMU ops. However
> we only really care about one callback - create_table - so we can
> relax the compatibility check here.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 0582b72..3054e3f 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -1188,7 +1188,8 @@ static int tce_iommu_attach_group(void *iommu_data,
>  			goto unlock_exit;
>  		}
>  		table_group_tmp = iommu_group_get_iommudata(tcegrp->grp);
> -		if (table_group_tmp->ops != table_group->ops) {
> +		if (table_group_tmp->ops->create_table !=
> +				table_group->ops->create_table) {
>  			pr_warn("tce_vfio: Group %d is incompatible with group %d\n",
>  					iommu_group_id(iommu_group),
>  					iommu_group_id(tcegrp->grp));

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 02/10] powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire
  2016-03-09  6:28 ` [PATCH kernel 02/10] powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire Alexey Kardashevskiy
@ 2016-03-10  5:35   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2016-03-10  5:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 2077 bytes --]

On Wed, Mar 09, 2016 at 05:28:58PM +1100, Alexey Kardashevskiy wrote:
> As in fact pnv_pci_ioda2_tce_invalidate_entire() invalidates TCEs for
> the specific PE rather than the entire cache, rename it to
> pnv_pci_ioda2_tce_invalidate_pe(). In later patches we will add
> a proper pnv_pci_ioda2_tce_invalidate_entire().
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 57dc2dc..889eca3 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1824,7 +1824,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>  	.get = pnv_tce_get,
>  };
>  
> -static inline void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_ioda_pe *pe)
> +static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>  {
>  	/* 01xb - invalidate TCEs that match the specified PE# */
>  	unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF);
> @@ -2101,7 +2101,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>  
>  	pnv_pci_link_table_and_group(phb->hose->node, num,
>  			tbl, &pe->table_group);
> -	pnv_pci_ioda2_tce_invalidate_entire(pe);
> +	pnv_pci_ioda2_tce_invalidate_pe(pe);
>  
>  	return 0;
>  }
> @@ -2245,7 +2245,7 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
>  	if (ret)
>  		pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
>  	else
> -		pnv_pci_ioda2_tce_invalidate_entire(pe);
> +		pnv_pci_ioda2_tce_invalidate_pe(pe);
>  
>  	pnv_pci_unlink_table_and_group(table_group->tables[num], table_group);
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 03/10] powerpc/powernv: Define TCE Kill flags
  2016-03-09  6:28 ` [PATCH kernel 03/10] powerpc/powernv: Define TCE Kill flags Alexey Kardashevskiy
@ 2016-03-10  5:36   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2016-03-10  5:36 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 1697 bytes --]

On Wed, Mar 09, 2016 at 05:28:59PM +1100, Alexey Kardashevskiy wrote:
> This replaces magic constants for TCE Kill IODA2 register with macros.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 889eca3..33e9489 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1824,10 +1824,13 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>  	.get = pnv_tce_get,
>  };
>  
> +#define TCE_KILL_INVAL_PE   PPC_BIT(1)
> +#define TCE_KILL_INVAL_TCE  PPC_BIT(2)
> +
>  static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>  {
>  	/* 01xb - invalidate TCEs that match the specified PE# */
> -	unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF);
> +	unsigned long val = TCE_KILL_INVAL_PE | (pe->pe_number & 0xFF);
>  	struct pnv_phb *phb = pe->phb;
>  	struct pnv_ioda_pe *npe;
>  	int i;
> @@ -1855,7 +1858,7 @@ static void pnv_pci_ioda2_do_tce_invalidate(unsigned pe_number, bool rm,
>  	unsigned long start, end, inc;
>  
>  	/* We'll invalidate DMA address in PE scope */
> -	start = 0x2ull << 60;
> +	start = TCE_KILL_INVAL_TCE;
>  	start |= (pe_number & 0xFF);
>  	end = start;
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 04/10] powerpc/powernv/npu: TCE Kill helpers cleanup
  2016-03-09  6:29 ` [PATCH kernel 04/10] powerpc/powernv/npu: TCE Kill helpers cleanup Alexey Kardashevskiy
@ 2016-03-10  5:42   ` David Gibson
  2016-03-21  2:51   ` Alistair Popple
  1 sibling, 0 replies; 26+ messages in thread
From: David Gibson @ 2016-03-10  5:42 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 6206 bytes --]

On Wed, Mar 09, 2016 at 05:29:00PM +1100, Alexey Kardashevskiy wrote:
> NPU PHB TCE Kill register is exactly the same as in the rest of POWER8
> so let's reuse the existing code for NPU. The only bit missing is
> a helper to reset the entire TCE cache so this moves such a helper
> from NPU code and renames it.
> 
> Since pnv_npu_tce_invalidate() does really invalidate the entire cache,
> this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU.
> This adds an explicit comment for workaround for invalidating NPU TCE
> cache.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c  | 41 -------------------------------
>  arch/powerpc/platforms/powernv/pci-ioda.c | 29 ++++++++++++++++++----
>  arch/powerpc/platforms/powernv/pci.h      |  7 +-----
>  3 files changed, 25 insertions(+), 52 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 7229acd..778570c 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -25,8 +25,6 @@
>   * Other types of TCE cache invalidation are not functional in the
>   * hardware.
>   */
> -#define TCE_KILL_INVAL_ALL PPC_BIT(0)
> -
>  static struct pci_dev *get_pci_dev(struct device_node *dn)
>  {
>  	return PCI_DN(dn)->pcidev;
> @@ -161,45 +159,6 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
>  	return pe;
>  }
>  
> -void pnv_npu_tce_invalidate_entire(struct pnv_ioda_pe *npe)
> -{
> -	struct pnv_phb *phb = npe->phb;
> -
> -	if (WARN_ON(phb->type != PNV_PHB_NPU ||
> -		    !phb->ioda.tce_inval_reg ||
> -		    !(npe->flags & PNV_IODA_PE_DEV)))
> -		return;
> -
> -	mb(); /* Ensure previous TCE table stores are visible */
> -	__raw_writeq(cpu_to_be64(TCE_KILL_INVAL_ALL),
> -		phb->ioda.tce_inval_reg);
> -}
> -
> -void pnv_npu_tce_invalidate(struct pnv_ioda_pe *npe,
> -				struct iommu_table *tbl,
> -				unsigned long index,
> -				unsigned long npages,
> -				bool rm)
> -{
> -	struct pnv_phb *phb = npe->phb;
> -
> -	/* We can only invalidate the whole cache on NPU */
> -	unsigned long val = TCE_KILL_INVAL_ALL;
> -
> -	if (WARN_ON(phb->type != PNV_PHB_NPU ||
> -		    !phb->ioda.tce_inval_reg ||
> -		    !(npe->flags & PNV_IODA_PE_DEV)))
> -		return;
> -
> -	mb(); /* Ensure previous TCE table stores are visible */
> -	if (rm)
> -		__raw_rm_writeq(cpu_to_be64(val),
> -		  (__be64 __iomem *) phb->ioda.tce_inval_reg_phys);
> -	else
> -		__raw_writeq(cpu_to_be64(val),
> -			phb->ioda.tce_inval_reg);
> -}
> -
>  void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
>  {
>  	struct pnv_ioda_pe *gpe;
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 33e9489..90cdf49 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1824,9 +1824,23 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>  	.get = pnv_tce_get,
>  };
>  
> +#define TCE_KILL_INVAL_ALL  PPC_BIT(0)
>  #define TCE_KILL_INVAL_PE   PPC_BIT(1)
>  #define TCE_KILL_INVAL_TCE  PPC_BIT(2)
>  
> +void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm)
> +{
> +	const unsigned long val = TCE_KILL_INVAL_ALL;
> +
> +	mb(); /* Ensure previous TCE table stores are visible */
> +	if (rm)
> +		__raw_rm_writeq(cpu_to_be64(val),
> +				(__be64 __iomem *)
> +				phb->ioda.tce_inval_reg_phys);
> +	else
> +		__raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg);
> +}
> +
>  static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>  {
>  	/* 01xb - invalidate TCEs that match the specified PE# */
> @@ -1847,7 +1861,7 @@ static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>  			if (!npe || npe->phb->type != PNV_PHB_NPU)
>  				continue;
>  
> -			pnv_npu_tce_invalidate_entire(npe);
> +			pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
>  		}
>  }
>  
> @@ -1896,14 +1910,19 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>  			index, npages);
>  
>  		if (pe->flags & PNV_IODA_PE_PEER)
> -			/* Invalidate PEs using the same TCE table */
> +			/*
> +			 * The NVLink hardware does not support TCE kill
> +			 * per TCE entry so we have to invalidate
> +			 * the entire cache for it.
> +			 */
>  			for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
>  				npe = pe->peers[i];
> -				if (!npe || npe->phb->type != PNV_PHB_NPU)
> +				if (!npe || npe->phb->type != PNV_PHB_NPU ||
> +						!npe->phb->ioda.tce_inval_reg)
>  					continue;
>  
> -				pnv_npu_tce_invalidate(npe, tbl, index,
> -							npages, rm);
> +				pnv_pci_ioda2_tce_invalidate_entire(npe->phb,
> +						rm);
>  			}
>  	}
>  }
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 3f814f3..0b89a4c 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -237,15 +237,10 @@ extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
>  extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>  
>  /* Nvlink functions */
> -extern void pnv_npu_tce_invalidate_entire(struct pnv_ioda_pe *npe);
> -extern void pnv_npu_tce_invalidate(struct pnv_ioda_pe *npe,
> -				       struct iommu_table *tbl,
> -				       unsigned long index,
> -				       unsigned long npages,
> -				       bool rm);
>  extern void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe);
>  extern void pnv_npu_setup_dma_pe(struct pnv_ioda_pe *npe);
>  extern int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe, bool enabled);
>  extern int pnv_npu_dma_set_mask(struct pci_dev *npdev, u64 dma_mask);
> +extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
>  
>  #endif /* __POWERNV_PCI_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 05/10] powerpc/powernv/npu: Use the correct IOMMU page size
  2016-03-09  6:29 ` [PATCH kernel 05/10] powerpc/powernv/npu: Use the correct IOMMU page size Alexey Kardashevskiy
@ 2016-03-10  5:43   ` David Gibson
  2016-03-21  2:57   ` Alistair Popple
  1 sibling, 0 replies; 26+ messages in thread
From: David Gibson @ 2016-03-10  5:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 1933 bytes --]

On Wed, Mar 09, 2016 at 05:29:01PM +1100, Alexey Kardashevskiy wrote:
> This uses the page size from iommu_table instead of hard-coded 4K.
> This should cause no change in behavior.
> 
> While we are here, move bits around to prepare for further rework
> which will define and use iommu_table_group_ops.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 778570c..5bd5fee 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -204,8 +204,7 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
>  	struct pnv_phb *phb = npe->phb;
>  	struct pci_dev *gpdev;
>  	struct pnv_ioda_pe *gpe;
> -	void *addr;
> -	unsigned int size;
> +	struct iommu_table *tbl;
>  	int64_t rc;
>  
>  	/*
> @@ -219,11 +218,11 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
>  	if (!gpe)
>  		return;
>  
> -	addr = (void *)gpe->table_group.tables[0]->it_base;
> -	size = gpe->table_group.tables[0]->it_size << 3;
> +	tbl = gpe->table_group.tables[0];
>  	rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
> -					npe->pe_number, 1, __pa(addr),
> -					size, 0x1000);
> +					npe->pe_number, 1, __pa(tbl->it_base),
> +					tbl->it_size << 3,
> +					IOMMU_PAGE_SIZE(tbl));
>  	if (rc != OPAL_SUCCESS)
>  		pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n",
>  			__func__, rc, phb->hose->global_number, npe->pe_number);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 06/10] powerpc/powernv/npu: Simplify DMA setup
  2016-03-09  6:29 ` [PATCH kernel 06/10] powerpc/powernv/npu: Simplify DMA setup Alexey Kardashevskiy
@ 2016-03-16  5:55   ` David Gibson
  2016-03-21  3:59     ` Alistair Popple
  0 siblings, 1 reply; 26+ messages in thread
From: David Gibson @ 2016-03-16  5:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 2216 bytes --]

On Wed, Mar 09, 2016 at 05:29:02PM +1100, Alexey Kardashevskiy wrote:
> NPU devices are quite specific, in fact they represent side DMA channel
> of a GPU device. The GPU/NPU driver never actually configures DMA
> for NPU devices, instead it relies on the platform code to propagate
> DMA setup to NPU devices when a main GPU device is being configured.
> When GPU is being set up, the same configuration - bypass or 32bit DMA -
> is used for NPU. This makes DMA setup explicit.
> 
> pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
> made static and prints warning as dma_set_mask() should never be called
> on this function as in any case it will not configure GPU; so we make
> this explicit.
> 
> Instead of using PNV_IODA_PE_PEER and peers[] (which next patch will
> remove), we test every PCI device if there are corresponding NVLink
> devices. If there are any, we propagate bypass mode to just found NPU
> devices by calling the setup helper directly (which takes @bypass) and
> avoid guessing (i.e. calculating from DMA mask) whether we need bypass
> or not on NPU devices. Since DMA setup happens in very rare occasion,
> this will not slow down booting or VFIO start/stop much.
> 
> This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
> more clear what the function really does which is programming 32bit
> table address to the TVT ("disabling bypass" means writing zeroes to
> the TVT).
> 
> This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
> the DMA configuration on NPU does not matter until dma_set_mask() is
> called on GPU and that will do the NPU DMA configuration.
> 
> This removes phb->dma_dev_setup initialization for NPU as
> pnv_pci_ioda_dma_dev_setup is no-op for it anyway.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

I'm having trouble making sense of the commit message, but the actual
changes look fine as best I can tell.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 04/10] powerpc/powernv/npu: TCE Kill helpers cleanup
  2016-03-09  6:29 ` [PATCH kernel 04/10] powerpc/powernv/npu: TCE Kill helpers cleanup Alexey Kardashevskiy
  2016-03-10  5:42   ` David Gibson
@ 2016-03-21  2:51   ` Alistair Popple
  1 sibling, 0 replies; 26+ messages in thread
From: Alistair Popple @ 2016-03-21  2:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Daniel Axtens,
	David Gibson, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

On Wed, 9 Mar 2016 17:29:00 Alexey Kardashevskiy wrote:
> NPU PHB TCE Kill register is exactly the same as in the rest of POWER8
> so let's reuse the existing code for NPU. The only bit missing is
> a helper to reset the entire TCE cache so this moves such a helper
> from NPU code and renames it.
> 
> Since pnv_npu_tce_invalidate() does really invalidate the entire cache,
> this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU.
> This adds an explicit comment for workaround for invalidating NPU TCE
> cache.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: Alistair Popple <alistair@popple.id.au>

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c  | 41 -------------------------------
>  arch/powerpc/platforms/powernv/pci-ioda.c | 29 ++++++++++++++++++----
>  arch/powerpc/platforms/powernv/pci.h      |  7 +-----
>  3 files changed, 25 insertions(+), 52 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 7229acd..778570c 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -25,8 +25,6 @@
>   * Other types of TCE cache invalidation are not functional in the
>   * hardware.
>   */
> -#define TCE_KILL_INVAL_ALL PPC_BIT(0)
> -
>  static struct pci_dev *get_pci_dev(struct device_node *dn)
>  {
>  	return PCI_DN(dn)->pcidev;
> @@ -161,45 +159,6 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
>  	return pe;
>  }
>  
> -void pnv_npu_tce_invalidate_entire(struct pnv_ioda_pe *npe)
> -{
> -	struct pnv_phb *phb = npe->phb;
> -
> -	if (WARN_ON(phb->type != PNV_PHB_NPU ||
> -		    !phb->ioda.tce_inval_reg ||
> -		    !(npe->flags & PNV_IODA_PE_DEV)))
> -		return;
> -
> -	mb(); /* Ensure previous TCE table stores are visible */
> -	__raw_writeq(cpu_to_be64(TCE_KILL_INVAL_ALL),
> -		phb->ioda.tce_inval_reg);
> -}
> -
> -void pnv_npu_tce_invalidate(struct pnv_ioda_pe *npe,
> -				struct iommu_table *tbl,
> -				unsigned long index,
> -				unsigned long npages,
> -				bool rm)
> -{
> -	struct pnv_phb *phb = npe->phb;
> -
> -	/* We can only invalidate the whole cache on NPU */
> -	unsigned long val = TCE_KILL_INVAL_ALL;
> -
> -	if (WARN_ON(phb->type != PNV_PHB_NPU ||
> -		    !phb->ioda.tce_inval_reg ||
> -		    !(npe->flags & PNV_IODA_PE_DEV)))
> -		return;
> -
> -	mb(); /* Ensure previous TCE table stores are visible */
> -	if (rm)
> -		__raw_rm_writeq(cpu_to_be64(val),
> -		  (__be64 __iomem *) phb->ioda.tce_inval_reg_phys);
> -	else
> -		__raw_writeq(cpu_to_be64(val),
> -			phb->ioda.tce_inval_reg);
> -}
> -
>  void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
>  {
>  	struct pnv_ioda_pe *gpe;
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 33e9489..90cdf49 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1824,9 +1824,23 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>  	.get = pnv_tce_get,
>  };
>  
> +#define TCE_KILL_INVAL_ALL  PPC_BIT(0)
>  #define TCE_KILL_INVAL_PE   PPC_BIT(1)
>  #define TCE_KILL_INVAL_TCE  PPC_BIT(2)
>  
> +void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm)
> +{
> +	const unsigned long val = TCE_KILL_INVAL_ALL;
> +
> +	mb(); /* Ensure previous TCE table stores are visible */
> +	if (rm)
> +		__raw_rm_writeq(cpu_to_be64(val),
> +				(__be64 __iomem *)
> +				phb->ioda.tce_inval_reg_phys);
> +	else
> +		__raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg);
> +}
> +
>  static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>  {
>  	/* 01xb - invalidate TCEs that match the specified PE# */
> @@ -1847,7 +1861,7 @@ static inline void pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>  			if (!npe || npe->phb->type != PNV_PHB_NPU)
>  				continue;
>  
> -			pnv_npu_tce_invalidate_entire(npe);
> +			pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
>  		}
>  }
>  
> @@ -1896,14 +1910,19 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>  			index, npages);
>  
>  		if (pe->flags & PNV_IODA_PE_PEER)
> -			/* Invalidate PEs using the same TCE table */
> +			/*
> +			 * The NVLink hardware does not support TCE kill
> +			 * per TCE entry so we have to invalidate
> +			 * the entire cache for it.
> +			 */
>  			for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
>  				npe = pe->peers[i];
> -				if (!npe || npe->phb->type != PNV_PHB_NPU)
> +				if (!npe || npe->phb->type != PNV_PHB_NPU ||
> +						!npe->phb->ioda.tce_inval_reg)
>  					continue;
>  
> -				pnv_npu_tce_invalidate(npe, tbl, index,
> -							npages, rm);
> +				pnv_pci_ioda2_tce_invalidate_entire(npe->phb,
> +						rm);
>  			}
>  	}
>  }
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 3f814f3..0b89a4c 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -237,15 +237,10 @@ extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type);
>  extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>  
>  /* Nvlink functions */
> -extern void pnv_npu_tce_invalidate_entire(struct pnv_ioda_pe *npe);
> -extern void pnv_npu_tce_invalidate(struct pnv_ioda_pe *npe,
> -				       struct iommu_table *tbl,
> -				       unsigned long index,
> -				       unsigned long npages,
> -				       bool rm);
>  extern void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe);
>  extern void pnv_npu_setup_dma_pe(struct pnv_ioda_pe *npe);
>  extern int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe, bool enabled);
>  extern int pnv_npu_dma_set_mask(struct pci_dev *npdev, u64 dma_mask);
> +extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
>  
>  #endif /* __POWERNV_PCI_H */
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 05/10] powerpc/powernv/npu: Use the correct IOMMU page size
  2016-03-09  6:29 ` [PATCH kernel 05/10] powerpc/powernv/npu: Use the correct IOMMU page size Alexey Kardashevskiy
  2016-03-10  5:43   ` David Gibson
@ 2016-03-21  2:57   ` Alistair Popple
  1 sibling, 0 replies; 26+ messages in thread
From: Alistair Popple @ 2016-03-21  2:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Daniel Axtens,
	David Gibson, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

Thanks for fixing Alexey!

On Wed, 9 Mar 2016 17:29:01 Alexey Kardashevskiy wrote:
> This uses the page size from iommu_table instead of hard-coded 4K.
> This should cause no change in behavior.
> 
> While we are here, move bits around to prepare for further rework
> which will define and use iommu_table_group_ops.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: Alistair Popple <alistair@popple.id.au>

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 778570c..5bd5fee 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -204,8 +204,7 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
>  	struct pnv_phb *phb = npe->phb;
>  	struct pci_dev *gpdev;
>  	struct pnv_ioda_pe *gpe;
> -	void *addr;
> -	unsigned int size;
> +	struct iommu_table *tbl;
>  	int64_t rc;
>  
>  	/*
> @@ -219,11 +218,11 @@ static void pnv_npu_disable_bypass(struct pnv_ioda_pe *npe)
>  	if (!gpe)
>  		return;
>  
> -	addr = (void *)gpe->table_group.tables[0]->it_base;
> -	size = gpe->table_group.tables[0]->it_size << 3;
> +	tbl = gpe->table_group.tables[0];
>  	rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
> -					npe->pe_number, 1, __pa(addr),
> -					size, 0x1000);
> +					npe->pe_number, 1, __pa(tbl->it_base),
> +					tbl->it_size << 3,
> +					IOMMU_PAGE_SIZE(tbl));
>  	if (rc != OPAL_SUCCESS)
>  		pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n",
>  			__func__, rc, phb->hose->global_number, npe->pe_number);
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 06/10] powerpc/powernv/npu: Simplify DMA setup
  2016-03-16  5:55   ` David Gibson
@ 2016-03-21  3:59     ` Alistair Popple
  0 siblings, 0 replies; 26+ messages in thread
From: Alistair Popple @ 2016-03-21  3:59 UTC (permalink / raw)
  To: David Gibson
  Cc: Alexey Kardashevskiy, linuxppc-dev, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

On Wed, 16 Mar 2016 16:55:50 David Gibson wrote:
> On Wed, Mar 09, 2016 at 05:29:02PM +1100, Alexey Kardashevskiy wrote:
> > NPU devices are quite specific, in fact they represent side DMA channel
> > of a GPU device. The GPU/NPU driver never actually configures DMA
> > for NPU devices, instead it relies on the platform code to propagate
> > DMA setup to NPU devices when a main GPU device is being configured.
> > When GPU is being set up, the same configuration - bypass or 32bit DMA -
> > is used for NPU. This makes DMA setup explicit.
> > 
> > pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
> > made static and prints warning as dma_set_mask() should never be called
> > on this function as in any case it will not configure GPU; so we make
> > this explicit.
> > 
> > Instead of using PNV_IODA_PE_PEER and peers[] (which next patch will
> > remove), we test every PCI device if there are corresponding NVLink
> > devices. If there are any, we propagate bypass mode to just found NPU
> > devices by calling the setup helper directly (which takes @bypass) and
> > avoid guessing (i.e. calculating from DMA mask) whether we need bypass
> > or not on NPU devices. Since DMA setup happens in very rare occasion,
> > this will not slow down booting or VFIO start/stop much.
> > 
> > This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
> > more clear what the function really does which is programming 32bit
> > table address to the TVT ("disabling bypass" means writing zeroes to
> > the TVT).
> > 
> > This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
> > the DMA configuration on NPU does not matter until dma_set_mask() is
> > called on GPU and that will do the NPU DMA configuration.
> > 
> > This removes phb->dma_dev_setup initialization for NPU as
> > pnv_pci_ioda_dma_dev_setup is no-op for it anyway.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> I'm having trouble making sense of the commit message, but the actual
> changes look fine as best I can tell.

For background the NPU NVLink PCI "devices" are actually emulated in firmware 
and are mainly used for link training. Their DMA/TCE setup must match the GPU 
which is connected via PCIe and NVLink so any changes to the DMA/TCE setup on 
the GPU PCIe device need to be propagated to the NVLink device as this is what 
device drivers expect and it doesn't make much sense to do anything else.

Originally we were going to propagate DMA/TCE changes the other way (NVLink 
device to PCI device) as well, but it proved unnecessary and unused. This 
patch cleans up the last bit of that behaviour and looks good to me as well.

> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Reviewed-by: Alistair Popple <alistair@popple.id.au>

> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group
  2016-03-09  6:29 ` [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group Alexey Kardashevskiy
@ 2016-03-21  4:48   ` David Gibson
  2016-03-21  8:25     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 26+ messages in thread
From: David Gibson @ 2016-03-21  4:48 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 4363 bytes --]

On Wed, Mar 09, 2016 at 05:29:04PM +1100, Alexey Kardashevskiy wrote:
> NPU devices have their own TVT which means they are isolated and can be
> passed to the userspace via VFIO. The first step is to create an IOMMU
> group and attach devices there so does the patch.
> 
> This adds a helper to npu-dma.c which gets GPU from the NPU's pdev and
> then walks through all devices on the same bus to determine which NPUs
> belong to the same GPU.
> 
> This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
> loop skips NPU PEs as they do not have 32bit DMA segments.
> 
> This uses get_gpu_pci_dev_and_pe() to get @gpdev rather than
> pnv_pci_get_gpu_dev() as the following patch will use @gpe as well.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

I'm not entirely clear on how these devices are assigned to groups.
Do they each get their own groups, or is the NPU device in the same
group as its corresponding GPU (I would have thought the latter makes
sense).

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c  | 40 +++++++++++++++++++++++++++++++
>  arch/powerpc/platforms/powernv/pci-ioda.c |  8 +++++++
>  arch/powerpc/platforms/powernv/pci.h      |  1 +
>  3 files changed, 49 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 866d3d3..e5a5feb 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -263,3 +263,43 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
>  		}
>  	}
>  }
> +
> +void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
> +{
> +	struct iommu_table *tbl;
> +	struct pnv_phb *phb = npe->phb;
> +	struct pci_bus *pbus = phb->hose->bus;
> +	struct pci_dev *npdev, *gpdev = NULL, *gptmp;
> +	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
> +
> +	if (!gpe || !gpdev)
> +		return;
> +
> +	iommu_register_group(&npe->table_group, phb->hose->global_number,
> +			npe->pe_number);
> +
> +	tbl = pnv_pci_table_alloc(phb->hose->node);
> +
> +	list_for_each_entry(npdev, &pbus->devices, bus_list) {
> +		gptmp = pnv_pci_get_gpu_dev(npdev);
> +
> +		if (gptmp != gpdev)
> +			continue;
> +
> +		/*
> +		 * The iommu_add_device() picks an IOMMU group from
> +		 * the first IOMMU group attached to the iommu_table
> +		 * so we need to pretend that there is a table so
> +		 * iommu_add_device() can complete the job.
> +		 * We unlink the tempopary table from the group afterwards.
> +		 */
> +		pnv_pci_link_table_and_group(phb->hose->node, 0,
> +				tbl, &npe->table_group);
> +		set_iommu_table_base(&npdev->dev, tbl);
> +		iommu_add_device(&npdev->dev);
> +		set_iommu_table_base(&npdev->dev, NULL);
> +		pnv_pci_unlink_table_and_group(tbl, &npe->table_group);
> +	}
> +
> +	iommu_free_table(tbl, "");
> +}
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 5a6cf2e..becd168 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2570,6 +2570,14 @@ static void pnv_ioda_setup_dma(struct pnv_phb *phb)
>  		remaining -= segs;
>  		base += segs;
>  	}
> +	/*
> +	 * Create an IOMMU group and add devices to it.
> +	 * DMA setup is to be done via GPU's dma_set_mask().
> +	 */
> +	if (phb->type == PNV_PHB_NPU) {
> +		list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
> +			pnv_pci_npu_setup_iommu(pe);
> +	}
>  }
>  
>  #ifdef CONFIG_PCI_MSI
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 06405fd..0c0083a 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -235,5 +235,6 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>  /* Nvlink functions */
>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
> +extern void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
>  
>  #endif /* __POWERNV_PCI_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 07/10] powerpc/powernv/npu: Rework TCE Kill handling
  2016-03-09  6:29 ` [PATCH kernel 07/10] powerpc/powernv/npu: Rework TCE Kill handling Alexey Kardashevskiy
@ 2016-03-21  6:50   ` Alistair Popple
  0 siblings, 0 replies; 26+ messages in thread
From: Alistair Popple @ 2016-03-21  6:50 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Daniel Axtens,
	David Gibson, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

On Wed, 9 Mar 2016 17:29:03 Alexey Kardashevskiy wrote:
> The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
> used to link GPU and NPU for 2 purposes:
> 
> 1. Access NPU _quickly_ when configuring DMA for GPU - this was addressed
> in the previos patch by removing use of it as DMA setup is not what
> the kernel would constantly do.

This was implemented using peers[] because we had peers[] anyway to deal with 
TCE cache invalidation. I agree there's no reason to keep it around solely for 
speed.

> 2. Invalidate TCE cache for NPU when it is invalidated for GPU.
> GPU and NPU are in different PE. There is already a mechanism to
> attach multiple iommu_table_group to the same iommu_table (used for VFIO),
> we can reuse it here so does this patch.

Ok, this makes sense. I wasn't aware of iommu_table_groups but it looks like a 
more elegant way of solving the problem. I'm not familiar with the way iommu 
groups work but the changes make sense to me as far as I can tell.

> This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
> not needed anymore.
>
> While we are here, add TCE cache invalidation after changing TVT.

Good idea, even though I guess we're unlikely to hit a problem in practice as 
I'm pretty sure on a normal system the links would get retrained between runs 
with different TVTs which implies the NPU gets reset too.

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-By: Alistair Popple <alistair@popple.id.au>

> ---
>  arch/powerpc/platforms/powernv/npu-dma.c  | 75 
+++++++++----------------------
>  arch/powerpc/platforms/powernv/pci-ioda.c | 57 +++--------------------
>  arch/powerpc/platforms/powernv/pci.h      |  6 ---
>  3 files changed, 29 insertions(+), 109 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
> index 8960e46..866d3d3 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -136,22 +136,17 @@ static struct pnv_ioda_pe 
*get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
>  	struct pnv_ioda_pe *pe;
>  	struct pci_dn *pdn;
>  
> -	if (npe->flags & PNV_IODA_PE_PEER) {
> -		pe = npe->peers[0];
> -		pdev = pe->pdev;
> -	} else {
> -		pdev = pnv_pci_get_gpu_dev(npe->pdev);
> -		if (!pdev)
> -			return NULL;
> +	pdev = pnv_pci_get_gpu_dev(npe->pdev);
> +	if (!pdev)
> +		return NULL;
>  
> -		pdn = pci_get_pdn(pdev);
> -		if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
> -			return NULL;
> +	pdn = pci_get_pdn(pdev);
> +	if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
> +		return NULL;
>  
> -		hose = pci_bus_to_host(pdev->bus);
> -		phb = hose->private_data;
> -		pe = &phb->ioda.pe_array[pdn->pe_number];
> -	}
> +	hose = pci_bus_to_host(pdev->bus);
> +	phb = hose->private_data;
> +	pe = &phb->ioda.pe_array[pdn->pe_number];
>  
>  	if (gpdev)
>  		*gpdev = pdev;
> @@ -159,42 +154,6 @@ static struct pnv_ioda_pe 
*get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
>  	return pe;
>  }
>  
> -void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe)
> -{
> -	struct pnv_ioda_pe *gpe;
> -	struct pci_dev *gpdev;
> -	int i, avail = -1;
> -
> -	if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV))
> -		return;
> -
> -	gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
> -	if (!gpe)
> -		return;
> -
> -	for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
> -		/* Nothing to do if the PE is already connected. */
> -		if (gpe->peers[i] == npe)
> -			return;
> -
> -		if (!gpe->peers[i])
> -			avail = i;
> -	}
> -
> -	if (WARN_ON(avail < 0))
> -		return;
> -
> -	gpe->peers[avail] = npe;
> -	gpe->flags |= PNV_IODA_PE_PEER;
> -
> -	/*
> -	 * We assume that the NPU devices only have a single peer PE
> -	 * (the GPU PCIe device PE).
> -	 */
> -	npe->peers[0] = gpe;
> -	npe->flags |= PNV_IODA_PE_PEER;
> -}
> -
>  /*
>   * Enables 32 bit DMA on NPU.
>   */
> @@ -225,6 +184,13 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
>  	if (rc != OPAL_SUCCESS)
>  		pr_warn("%s: Error %lld setting DMA window on PHB#%d-PE#%d\n",
>  			__func__, rc, phb->hose->global_number, npe-
>pe_number);
> +	else
> +		pnv_pci_ioda2_tce_invalidate_entire(phb, false);
> +
> +	/* Add the table to the list so its TCE cache will get invalidated */
> +	npe->table_group.tables[0] = tbl;
> +	pnv_pci_link_table_and_group(phb->hose->node, 0,
> +			tbl, &npe->table_group);
>  
>  	/*
>  	 * We don't initialise npu_pe->tce32_table as we always use
> @@ -245,10 +211,10 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe 
*npe)
>  	int64_t rc = 0;
>  	phys_addr_t top = memblock_end_of_DRAM();
>  
> -	if (phb->type != PNV_PHB_NPU || !npe->pdev)
> -		return -EINVAL;
> -
>  	/* Enable the bypass window */
> +	pnv_pci_unlink_table_and_group(npe->table_group.tables[0],
> +			&npe->table_group);
> +	npe->table_group.tables[0] = NULL;
>  
>  	npe->tce_bypass_base = 0;
>  	top = roundup_pow_of_two(top);
> @@ -258,6 +224,9 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe 
*npe)
>  			npe->pe_number, npe->pe_number,
>  			npe->tce_bypass_base, top);
>  
> +	if (rc == OPAL_SUCCESS)
> +		pnv_pci_ioda2_tce_invalidate_entire(phb, false);
> +
>  	return rc;
>  }
>  
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 47085b7..5a6cf2e 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1836,23 +1836,12 @@ static inline void 
pnv_pci_ioda2_tce_invalidate_pe(struct pnv_ioda_pe *pe)
>  	/* 01xb - invalidate TCEs that match the specified PE# */
>  	unsigned long val = TCE_KILL_INVAL_PE | (pe->pe_number & 0xFF);
>  	struct pnv_phb *phb = pe->phb;
> -	struct pnv_ioda_pe *npe;
> -	int i;
>  
>  	if (!phb->ioda.tce_inval_reg)
>  		return;
>  
>  	mb(); /* Ensure above stores are visible */
>  	__raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg);
> -
> -	if (pe->flags & PNV_IODA_PE_PEER)
> -		for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
> -			npe = pe->peers[i];
> -			if (!npe || npe->phb->type != PNV_PHB_NPU)
> -				continue;
> -
> -			pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
> -		}
>  }
>  
>  static void pnv_pci_ioda2_do_tce_invalidate(unsigned pe_number, bool rm,
> @@ -1887,33 +1876,24 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
iommu_table *tbl,
>  	struct iommu_table_group_link *tgl;
>  
>  	list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) {
> -		struct pnv_ioda_pe *npe;
>  		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
>  				struct pnv_ioda_pe, table_group);
>  		__be64 __iomem *invalidate = rm ?
>  			(__be64 __iomem *)pe->phb->ioda.tce_inval_reg_phys :
>  			pe->phb->ioda.tce_inval_reg;
> -		int i;
>  
> -		pnv_pci_ioda2_do_tce_invalidate(pe->pe_number, rm,
> -			invalidate, tbl->it_page_shift,
> -			index, npages);
> -
> -		if (pe->flags & PNV_IODA_PE_PEER)
> +		if (pe->phb->type == PNV_PHB_NPU) {
>  			/*
>  			 * The NVLink hardware does not support TCE kill
>  			 * per TCE entry so we have to invalidate
>  			 * the entire cache for it.
>  			 */
> -			for (i = 0; i < PNV_IODA_MAX_PEER_PES; i++) {
> -				npe = pe->peers[i];
> -				if (!npe || npe->phb->type != PNV_PHB_NPU ||
> -						!npe->phb->ioda.tce_inval_reg)
> -					continue;
> -
> -				pnv_pci_ioda2_tce_invalidate_entire(npe->phb,
> -						rm);
> -			}
> +			pnv_pci_ioda2_tce_invalidate_entire(pe->phb, rm);
> +			continue;
> +		}
> +		pnv_pci_ioda2_do_tce_invalidate(pe->pe_number, rm,
> +			invalidate, tbl->it_page_shift,
> +			index, npages);
>  	}
>  }
>  
> @@ -3094,26 +3074,6 @@ static void pnv_pci_ioda_create_dbgfs(void)
>  #endif /* CONFIG_DEBUG_FS */
>  }
>  
> -static void pnv_npu_ioda_fixup(void)
> -{
> -	bool enable_bypass;
> -	struct pci_controller *hose, *tmp;
> -	struct pnv_phb *phb;
> -	struct pnv_ioda_pe *pe;
> -
> -	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
> -		phb = hose->private_data;
> -		if (phb->type != PNV_PHB_NPU)
> -			continue;
> -
> -		list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link) {
> -			enable_bypass = dma_get_mask(&pe->pdev->dev) ==
> -				DMA_BIT_MASK(64);
> -			pnv_npu_init_dma_pe(pe);
> -		}
> -	}
> -}
> -
>  static void pnv_pci_ioda_fixup(void)
>  {
>  	pnv_pci_ioda_setup_PEs();
> @@ -3126,9 +3086,6 @@ static void pnv_pci_ioda_fixup(void)
>  	eeh_init();
>  	eeh_addr_cache_build();
>  #endif
> -
> -	/* Link NPU IODA tables to their PCI devices. */
> -	pnv_npu_ioda_fixup();
>  }
>  
>  /*
> diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
> index d574a9d..06405fd 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -24,7 +24,6 @@ enum pnv_phb_model {
>  #define PNV_IODA_PE_MASTER	(1 << 3)	/* Master PE in compound case	
*/
>  #define PNV_IODA_PE_SLAVE	(1 << 4)	/* Slave PE in compound case	
*/
>  #define PNV_IODA_PE_VF		(1 << 5)	/* PE for one VF 	
	*/
> -#define PNV_IODA_PE_PEER	(1 << 6)	/* PE has peers			
*/
>  
>  /* Data associated with a PE, including IOMMU tracking etc.. */
>  struct pnv_phb;
> @@ -32,9 +31,6 @@ struct pnv_ioda_pe {
>  	unsigned long		flags;
>  	struct pnv_phb		*phb;
>  
> -#define PNV_IODA_MAX_PEER_PES	8
> -	struct pnv_ioda_pe	*peers[PNV_IODA_MAX_PEER_PES];
> -
>  	/* A PE can be associated with a single device or an
>  	 * entire bus (& children). In the former case, pdev
>  	 * is populated, in the later case, pbus is.
> @@ -237,8 +233,6 @@ extern int pnv_setup_msi_irqs(struct pci_dev *pdev, int 
nvec, int type);
>  extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>  
>  /* Nvlink functions */
> -extern void pnv_npu_init_dma_pe(struct pnv_ioda_pe *npe);
> -extern void pnv_npu_setup_dma_pe(struct pnv_ioda_pe *npe);
>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool 
rm);
>  
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group
  2016-03-21  4:48   ` David Gibson
@ 2016-03-21  8:25     ` Alexey Kardashevskiy
  2016-03-22  0:25       ` David Gibson
  0 siblings, 1 reply; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-21  8:25 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

On 03/21/2016 03:48 PM, David Gibson wrote:
> On Wed, Mar 09, 2016 at 05:29:04PM +1100, Alexey Kardashevskiy wrote:
>> NPU devices have their own TVT which means they are isolated and can be
>> passed to the userspace via VFIO. The first step is to create an IOMMU
>> group and attach devices there so does the patch.
>>
>> This adds a helper to npu-dma.c which gets GPU from the NPU's pdev and
>> then walks through all devices on the same bus to determine which NPUs
>> belong to the same GPU.
>>
>> This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
>> loop skips NPU PEs as they do not have 32bit DMA segments.
>>
>> This uses get_gpu_pci_dev_and_pe() to get @gpdev rather than
>> pnv_pci_get_gpu_dev() as the following patch will use @gpe as well.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> I'm not entirely clear on how these devices are assigned to groups.
> Do they each get their own groups, or is the NPU device in the same
> group as its corresponding GPU (I would have thought the latter makes
> sense).


I am putting them to a separate group as they have their own TCE table 
pointer even though they are expected to share it with GPU.

If I put them to the same group as GPUs, I would have to have 
IODA2-linked-to-NPU bridge type with different iommu_table_group_ops  or 
have multiple hacks everywhere in IODA2 to enable/disable bypass, etc.



>
>> ---
>>   arch/powerpc/platforms/powernv/npu-dma.c  | 40 +++++++++++++++++++++++++++++++
>>   arch/powerpc/platforms/powernv/pci-ioda.c |  8 +++++++
>>   arch/powerpc/platforms/powernv/pci.h      |  1 +
>>   3 files changed, 49 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
>> index 866d3d3..e5a5feb 100644
>> --- a/arch/powerpc/platforms/powernv/npu-dma.c
>> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
>> @@ -263,3 +263,43 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
>>   		}
>>   	}
>>   }
>> +
>> +void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
>> +{
>> +	struct iommu_table *tbl;
>> +	struct pnv_phb *phb = npe->phb;
>> +	struct pci_bus *pbus = phb->hose->bus;
>> +	struct pci_dev *npdev, *gpdev = NULL, *gptmp;
>> +	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
>> +
>> +	if (!gpe || !gpdev)
>> +		return;
>> +
>> +	iommu_register_group(&npe->table_group, phb->hose->global_number,
>> +			npe->pe_number);
>> +
>> +	tbl = pnv_pci_table_alloc(phb->hose->node);
>> +
>> +	list_for_each_entry(npdev, &pbus->devices, bus_list) {
>> +		gptmp = pnv_pci_get_gpu_dev(npdev);
>> +
>> +		if (gptmp != gpdev)
>> +			continue;
>> +
>> +		/*
>> +		 * The iommu_add_device() picks an IOMMU group from
>> +		 * the first IOMMU group attached to the iommu_table
>> +		 * so we need to pretend that there is a table so
>> +		 * iommu_add_device() can complete the job.
>> +		 * We unlink the tempopary table from the group afterwards.
>> +		 */
>> +		pnv_pci_link_table_and_group(phb->hose->node, 0,
>> +				tbl, &npe->table_group);
>> +		set_iommu_table_base(&npdev->dev, tbl);
>> +		iommu_add_device(&npdev->dev);
>> +		set_iommu_table_base(&npdev->dev, NULL);
>> +		pnv_pci_unlink_table_and_group(tbl, &npe->table_group);
>> +	}
>> +
>> +	iommu_free_table(tbl, "");
>> +}
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 5a6cf2e..becd168 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -2570,6 +2570,14 @@ static void pnv_ioda_setup_dma(struct pnv_phb *phb)
>>   		remaining -= segs;
>>   		base += segs;
>>   	}
>> +	/*
>> +	 * Create an IOMMU group and add devices to it.
>> +	 * DMA setup is to be done via GPU's dma_set_mask().
>> +	 */
>> +	if (phb->type == PNV_PHB_NPU) {
>> +		list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
>> +			pnv_pci_npu_setup_iommu(pe);
>> +	}
>>   }
>>
>>   #ifdef CONFIG_PCI_MSI
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index 06405fd..0c0083a 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -235,5 +235,6 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>>   /* Nvlink functions */
>>   extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>>   extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
>> +extern void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
>>
>>   #endif /* __POWERNV_PCI_H */
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group
  2016-03-21  8:25     ` Alexey Kardashevskiy
@ 2016-03-22  0:25       ` David Gibson
  2016-03-22  1:48         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 26+ messages in thread
From: David Gibson @ 2016-03-22  0:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 5361 bytes --]

On Mon, Mar 21, 2016 at 07:25:23PM +1100, Alexey Kardashevskiy wrote:
> On 03/21/2016 03:48 PM, David Gibson wrote:
> >On Wed, Mar 09, 2016 at 05:29:04PM +1100, Alexey Kardashevskiy wrote:
> >>NPU devices have their own TVT which means they are isolated and can be
> >>passed to the userspace via VFIO. The first step is to create an IOMMU
> >>group and attach devices there so does the patch.
> >>
> >>This adds a helper to npu-dma.c which gets GPU from the NPU's pdev and
> >>then walks through all devices on the same bus to determine which NPUs
> >>belong to the same GPU.
> >>
> >>This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
> >>loop skips NPU PEs as they do not have 32bit DMA segments.
> >>
> >>This uses get_gpu_pci_dev_and_pe() to get @gpdev rather than
> >>pnv_pci_get_gpu_dev() as the following patch will use @gpe as well.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> >I'm not entirely clear on how these devices are assigned to groups.
> >Do they each get their own groups, or is the NPU device in the same
> >group as its corresponding GPU (I would have thought the latter makes
> >sense).
> 
> 
> I am putting them to a separate group as they have their own TCE table
> pointer even though they are expected to share it with GPU.

Hmm.. is this safe?  If the GPU and NPU got assigned to different
owners, what would happen?  Could the interfere with each other?

> If I put them to the same group as GPUs, I would have to have
> IODA2-linked-to-NPU bridge type with different iommu_table_group_ops  or
> have multiple hacks everywhere in IODA2 to enable/disable bypass,
> etc.

Well.. I suspect it would mean no longer having a 1:1 correspondance
between user-visible IOMMU groups and the internal iommu_table.

> >>---
> >>  arch/powerpc/platforms/powernv/npu-dma.c  | 40 +++++++++++++++++++++++++++++++
> >>  arch/powerpc/platforms/powernv/pci-ioda.c |  8 +++++++
> >>  arch/powerpc/platforms/powernv/pci.h      |  1 +
> >>  3 files changed, 49 insertions(+)
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> >>index 866d3d3..e5a5feb 100644
> >>--- a/arch/powerpc/platforms/powernv/npu-dma.c
> >>+++ b/arch/powerpc/platforms/powernv/npu-dma.c
> >>@@ -263,3 +263,43 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
> >>  		}
> >>  	}
> >>  }
> >>+
> >>+void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
> >>+{
> >>+	struct iommu_table *tbl;
> >>+	struct pnv_phb *phb = npe->phb;
> >>+	struct pci_bus *pbus = phb->hose->bus;
> >>+	struct pci_dev *npdev, *gpdev = NULL, *gptmp;
> >>+	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
> >>+
> >>+	if (!gpe || !gpdev)
> >>+		return;
> >>+
> >>+	iommu_register_group(&npe->table_group, phb->hose->global_number,
> >>+			npe->pe_number);
> >>+
> >>+	tbl = pnv_pci_table_alloc(phb->hose->node);
> >>+
> >>+	list_for_each_entry(npdev, &pbus->devices, bus_list) {
> >>+		gptmp = pnv_pci_get_gpu_dev(npdev);
> >>+
> >>+		if (gptmp != gpdev)
> >>+			continue;
> >>+
> >>+		/*
> >>+		 * The iommu_add_device() picks an IOMMU group from
> >>+		 * the first IOMMU group attached to the iommu_table
> >>+		 * so we need to pretend that there is a table so
> >>+		 * iommu_add_device() can complete the job.
> >>+		 * We unlink the tempopary table from the group afterwards.
> >>+		 */
> >>+		pnv_pci_link_table_and_group(phb->hose->node, 0,
> >>+				tbl, &npe->table_group);
> >>+		set_iommu_table_base(&npdev->dev, tbl);
> >>+		iommu_add_device(&npdev->dev);
> >>+		set_iommu_table_base(&npdev->dev, NULL);
> >>+		pnv_pci_unlink_table_and_group(tbl, &npe->table_group);
> >>+	}
> >>+
> >>+	iommu_free_table(tbl, "");
> >>+}
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index 5a6cf2e..becd168 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -2570,6 +2570,14 @@ static void pnv_ioda_setup_dma(struct pnv_phb *phb)
> >>  		remaining -= segs;
> >>  		base += segs;
> >>  	}
> >>+	/*
> >>+	 * Create an IOMMU group and add devices to it.
> >>+	 * DMA setup is to be done via GPU's dma_set_mask().
> >>+	 */
> >>+	if (phb->type == PNV_PHB_NPU) {
> >>+		list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
> >>+			pnv_pci_npu_setup_iommu(pe);
> >>+	}
> >>  }
> >>
> >>  #ifdef CONFIG_PCI_MSI
> >>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> >>index 06405fd..0c0083a 100644
> >>--- a/arch/powerpc/platforms/powernv/pci.h
> >>+++ b/arch/powerpc/platforms/powernv/pci.h
> >>@@ -235,5 +235,6 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
> >>  /* Nvlink functions */
> >>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
> >>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
> >>+extern void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
> >>
> >>  #endif /* __POWERNV_PCI_H */
> >
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group
  2016-03-22  0:25       ` David Gibson
@ 2016-03-22  1:48         ` Alexey Kardashevskiy
  2016-03-22 12:41           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 26+ messages in thread
From: Alexey Kardashevskiy @ 2016-03-22  1:48 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, Alistair Popple, Benjamin Herrenschmidt,
	Daniel Axtens, Gavin Shan, Paul Mackerras, Russell Currey,
	Alex Williamson

On 03/22/2016 11:25 AM, David Gibson wrote:
> On Mon, Mar 21, 2016 at 07:25:23PM +1100, Alexey Kardashevskiy wrote:
>> On 03/21/2016 03:48 PM, David Gibson wrote:
>>> On Wed, Mar 09, 2016 at 05:29:04PM +1100, Alexey Kardashevskiy wrote:
>>>> NPU devices have their own TVT which means they are isolated and can be
>>>> passed to the userspace via VFIO. The first step is to create an IOMMU
>>>> group and attach devices there so does the patch.
>>>>
>>>> This adds a helper to npu-dma.c which gets GPU from the NPU's pdev and
>>>> then walks through all devices on the same bus to determine which NPUs
>>>> belong to the same GPU.
>>>>
>>>> This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
>>>> loop skips NPU PEs as they do not have 32bit DMA segments.
>>>>
>>>> This uses get_gpu_pci_dev_and_pe() to get @gpdev rather than
>>>> pnv_pci_get_gpu_dev() as the following patch will use @gpe as well.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>
>>> I'm not entirely clear on how these devices are assigned to groups.
>>> Do they each get their own groups, or is the NPU device in the same
>>> group as its corresponding GPU (I would have thought the latter makes
>>> sense).
>>
>>
>> I am putting them to a separate group as they have their own TCE table
>> pointer even though they are expected to share it with GPU.
>
> Hmm.. is this safe?  If the GPU and NPU got assigned to different
> owners, what would happen?  Could the interfere with each other?


I suppose GPU from guest1 could trigger DMA from NPU to guest2 memory. 
Which puts a constrain to management tools not to pass NPU without their 
GPU counterparts.

The host can be affected as bypass is not disabled on NPU when GPU is taken 
by VFIO, I'll fix this.


>> If I put them to the same group as GPUs, I would have to have
>> IODA2-linked-to-NPU bridge type with different iommu_table_group_ops  or
>> have multiple hacks everywhere in IODA2 to enable/disable bypass,
>> etc.
>
> Well.. I suspect it would mean no longer having a 1:1 correspondance
> between user-visible IOMMU groups and the internal iommu_table.

Right.

Right now each GPU is sitting on a separate PHB and has its own PE. And all 
NPUs sit on a separate PHB and each couple of NPUs (2 links of the same 
GPU) gets a PE.

So we have separate PEs (struct pnv_ioda_pe) already, each has its own 
iommu_table_group_ops with all these VFIO IOMMU callbacks. So to make this 
all appear as one IOMMU group in sysfs, I will need to stop embedding 
iommu_table_group into pnv_ioda_pe but make it a pointer with reference 
counting, etc. Quite a massive change...




>>>> ---
>>>>   arch/powerpc/platforms/powernv/npu-dma.c  | 40 +++++++++++++++++++++++++++++++
>>>>   arch/powerpc/platforms/powernv/pci-ioda.c |  8 +++++++
>>>>   arch/powerpc/platforms/powernv/pci.h      |  1 +
>>>>   3 files changed, 49 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
>>>> index 866d3d3..e5a5feb 100644
>>>> --- a/arch/powerpc/platforms/powernv/npu-dma.c
>>>> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
>>>> @@ -263,3 +263,43 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
>>>>   		}
>>>>   	}
>>>>   }
>>>> +
>>>> +void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
>>>> +{
>>>> +	struct iommu_table *tbl;
>>>> +	struct pnv_phb *phb = npe->phb;
>>>> +	struct pci_bus *pbus = phb->hose->bus;
>>>> +	struct pci_dev *npdev, *gpdev = NULL, *gptmp;
>>>> +	struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
>>>> +
>>>> +	if (!gpe || !gpdev)
>>>> +		return;
>>>> +
>>>> +	iommu_register_group(&npe->table_group, phb->hose->global_number,
>>>> +			npe->pe_number);
>>>> +
>>>> +	tbl = pnv_pci_table_alloc(phb->hose->node);
>>>> +
>>>> +	list_for_each_entry(npdev, &pbus->devices, bus_list) {
>>>> +		gptmp = pnv_pci_get_gpu_dev(npdev);
>>>> +
>>>> +		if (gptmp != gpdev)
>>>> +			continue;
>>>> +
>>>> +		/*
>>>> +		 * The iommu_add_device() picks an IOMMU group from
>>>> +		 * the first IOMMU group attached to the iommu_table
>>>> +		 * so we need to pretend that there is a table so
>>>> +		 * iommu_add_device() can complete the job.
>>>> +		 * We unlink the tempopary table from the group afterwards.
>>>> +		 */
>>>> +		pnv_pci_link_table_and_group(phb->hose->node, 0,
>>>> +				tbl, &npe->table_group);
>>>> +		set_iommu_table_base(&npdev->dev, tbl);
>>>> +		iommu_add_device(&npdev->dev);
>>>> +		set_iommu_table_base(&npdev->dev, NULL);
>>>> +		pnv_pci_unlink_table_and_group(tbl, &npe->table_group);
>>>> +	}
>>>> +
>>>> +	iommu_free_table(tbl, "");
>>>> +}
>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> index 5a6cf2e..becd168 100644
>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> @@ -2570,6 +2570,14 @@ static void pnv_ioda_setup_dma(struct pnv_phb *phb)
>>>>   		remaining -= segs;
>>>>   		base += segs;
>>>>   	}
>>>> +	/*
>>>> +	 * Create an IOMMU group and add devices to it.
>>>> +	 * DMA setup is to be done via GPU's dma_set_mask().
>>>> +	 */
>>>> +	if (phb->type == PNV_PHB_NPU) {
>>>> +		list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
>>>> +			pnv_pci_npu_setup_iommu(pe);
>>>> +	}
>>>>   }
>>>>
>>>>   #ifdef CONFIG_PCI_MSI
>>>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>>> index 06405fd..0c0083a 100644
>>>> --- a/arch/powerpc/platforms/powernv/pci.h
>>>> +++ b/arch/powerpc/platforms/powernv/pci.h
>>>> @@ -235,5 +235,6 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>>>>   /* Nvlink functions */
>>>>   extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>>>>   extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
>>>> +extern void pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
>>>>
>>>>   #endif /* __POWERNV_PCI_H */
>>>
>>
>>
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group
  2016-03-22  1:48         ` Alexey Kardashevskiy
@ 2016-03-22 12:41           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 26+ messages in thread
From: Benjamin Herrenschmidt @ 2016-03-22 12:41 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Gibson
  Cc: linuxppc-dev, Alistair Popple, Daniel Axtens, Gavin Shan,
	Paul Mackerras, Russell Currey, Alex Williamson

On Tue, 2016-03-22 at 12:48 +1100, Alexey Kardashevskiy wrote:
> 
> I suppose GPU from guest1 could trigger DMA from NPU to guest2 memory. 
> Which puts a constrain to management tools not to pass NPU without their 
> GPU counterparts.

Management tools will not be taught such constraints. The plan always
was to make sure they are in the same group. So they should be.

> The host can be affected as bypass is not disabled on NPU when GPU is taken 
> by VFIO, I'll fix this.
>
> >> If I put them to the same group as GPUs, I would have to have
> >> IODA2-linked-to-NPU bridge type with different iommu_table_group_ops  or
> >> have multiple hacks everywhere in IODA2 to enable/disable bypass,
> >> etc.
> >
> > Well.. I suspect it would mean no longer having a 1:1 correspondance
> > between user-visible IOMMU groups and the internal iommu_table.
> 
> Right.

They can share the table too ...

> Right now each GPU is sitting on a separate PHB and has its own PE. And all 
> NPUs sit on a separate PHB and each couple of NPUs (2 links of the same 
> GPU) gets a PE.
> 
> So we have separate PEs (struct pnv_ioda_pe) already, each has its own 
> iommu_table_group_ops with all these VFIO IOMMU callbacks. So to make this 
> all appear as one IOMMU group in sysfs, I will need to stop embedding 
> iommu_table_group into pnv_ioda_pe but make it a pointer with reference 
> counting, etc. Quite a massive change...

Or you just put a quirk flag of some sort and a pointer to the "linked"
PE... sometimes that's a lot easier than lifting up the whole
infrastructure.
> 
> 
> 
> >>>> ---
> >>>>   arch/powerpc/platf

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-03-22 12:42 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-09  6:28 [PATCH kernel 00/10] powerpc/powernv/npu: Enable PCI pass through for NVLink Alexey Kardashevskiy
2016-03-09  6:28 ` [PATCH kernel 01/10] vfio/spapr: Relax the IOMMU compatibility check Alexey Kardashevskiy
2016-03-10  5:35   ` David Gibson
2016-03-09  6:28 ` [PATCH kernel 02/10] powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire Alexey Kardashevskiy
2016-03-10  5:35   ` David Gibson
2016-03-09  6:28 ` [PATCH kernel 03/10] powerpc/powernv: Define TCE Kill flags Alexey Kardashevskiy
2016-03-10  5:36   ` David Gibson
2016-03-09  6:29 ` [PATCH kernel 04/10] powerpc/powernv/npu: TCE Kill helpers cleanup Alexey Kardashevskiy
2016-03-10  5:42   ` David Gibson
2016-03-21  2:51   ` Alistair Popple
2016-03-09  6:29 ` [PATCH kernel 05/10] powerpc/powernv/npu: Use the correct IOMMU page size Alexey Kardashevskiy
2016-03-10  5:43   ` David Gibson
2016-03-21  2:57   ` Alistair Popple
2016-03-09  6:29 ` [PATCH kernel 06/10] powerpc/powernv/npu: Simplify DMA setup Alexey Kardashevskiy
2016-03-16  5:55   ` David Gibson
2016-03-21  3:59     ` Alistair Popple
2016-03-09  6:29 ` [PATCH kernel 07/10] powerpc/powernv/npu: Rework TCE Kill handling Alexey Kardashevskiy
2016-03-21  6:50   ` Alistair Popple
2016-03-09  6:29 ` [PATCH kernel 08/10] powerpc/powernv/npu: Add NPU devices to IOMMU group Alexey Kardashevskiy
2016-03-21  4:48   ` David Gibson
2016-03-21  8:25     ` Alexey Kardashevskiy
2016-03-22  0:25       ` David Gibson
2016-03-22  1:48         ` Alexey Kardashevskiy
2016-03-22 12:41           ` Benjamin Herrenschmidt
2016-03-09  6:29 ` [PATCH kernel 09/10] powerpc/powernv/ioda2: Export some helpers Alexey Kardashevskiy
2016-03-09  6:29 ` [PATCH kernel 10/10] powerpc/powernv/npu: Enable passing through via VFIO Alexey Kardashevskiy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.