* [PATCH v2 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead
@ 2018-10-03 18:51 Mark Hairgrove
  2018-10-03 18:51 ` [PATCH v2 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Mark Hairgrove @ 2018-10-03 18:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple, Mark Hairgrove, Reza Arbab

When ATS is used in a process, all CPU TLB invalidates in that process
also trigger ATSD invalidates via mmu_notifiers. This additional overhead
is noticeable in applications which do heavy memory allocation or page
migration among nodes, particularly to and from GPUs.
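
For reference, the mmu_notifier hookup that drives this looks roughly as
follows once the series is applied (condensed from npu-dma.c; see patch 2
below for the real pnv_npu2_mn_invalidate_range and nv_nmmu_notifier_ops):

    /* Called by the core MM for every TLB invalidation in the process */
    static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end)
    {
            struct npu_context *npu_context = mn_to_npu_context(mn);

            /* Mirror the CPU-side invalidate to the NPUs with an ATSD */
            mmio_invalidate(npu_context, start, end - start);
    }

The callback is registered through the .invalidate_range member of
nv_nmmu_notifier_ops, so every CPU TLB shootdown in the registered mm also
launches ATSDs.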

This patch set reduces that overhead by rearranging how the ATSDs are
issued and by using size-based ATSD invalidates.

Mark Hairgrove (3):
  powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
  powerpc/powernv/npu: Use size-based ATSD invalidates
  powerpc/powernv/npu: Remove atsd_threshold debugfs setting

 arch/powerpc/platforms/powernv/npu-dma.c |  198 ++++++++++++++----------------
 1 files changed, 94 insertions(+), 104 deletions(-)

-- 
1.7.2.5



* [PATCH v2 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
  2018-10-03 18:51 [PATCH v2 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead Mark Hairgrove
@ 2018-10-03 18:51 ` Mark Hairgrove
  2018-10-04  5:20   ` Alistair Popple
  2018-10-11  8:35   ` [v2, " Michael Ellerman
  2018-10-03 18:51 ` [PATCH v2 2/3] powerpc/powernv/npu: Use size-based " Mark Hairgrove
  2018-10-03 18:51 ` [PATCH v2 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting Mark Hairgrove
  2 siblings, 2 replies; 8+ messages in thread
From: Mark Hairgrove @ 2018-10-03 18:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple, Mark Hairgrove, Reza Arbab

There are two types of ATSDs issued to the NPU: invalidates targeting a
specific virtual address and invalidates targeting the whole address
space. Prior to this change, the sequence in both cases was:

    for each NPU
        - Write the target address to the XTS_ATSD_AVA register
        - EIEIO
        - Write the launch value to issue the ATSD

First, a target address is not required when invalidating the whole
address space, so that write and the EIEIO have been removed. The AP
(size) field in the launch is not needed either.

Second, for per-address invalidates the above sequence is inefficient in
the common case of multiple NPUs because an EIEIO is issued per NPU. This
unnecessarily forces the launches of later ATSDs to be ordered with the
launches of earlier ones. The new sequence only issues a single EIEIO:

    for each NPU
        - Write the target address to the XTS_ATSD_AVA register
    EIEIO
    for each NPU
        - Write the launch value to issue the ATSD

Performance results were gathered using a microbenchmark that creates a
1G allocation and then uses mprotect with PROT_NONE to trigger invalidates
in strides across the allocation.
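
A minimal userspace sketch of that kind of microbenchmark follows
(illustrative only: the exact benchmark used for these numbers is not
included here, and on its own this sketch only exercises the CPU side,
since the buffer would also need to be in use by a GPU/ATS context for
ATSDs to actually be issued):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    #define SIZE (1UL << 30)        /* 1G allocation */

    int main(int argc, char **argv)
    {
            size_t stride = argc > 1 ? strtoul(argv[1], NULL, 0) : 64 * 1024;
            struct timespec t0, t1;
            double secs;
            size_t off;
            char *buf;

            buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED)
                    return 1;
            memset(buf, 0, SIZE);   /* fault every page in */

            clock_gettime(CLOCK_MONOTONIC, &t0);
            /* Each mprotect() triggers CPU TLB and thus ATSD invalidates */
            for (off = 0; off < SIZE; off += stride)
                    mprotect(buf + off, stride, PROT_NONE);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%zu byte stride: %.1f GB/s\n", stride, SIZE / secs / 1e9);
            return 0;
    }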

With only a single NPU active (one GPU) the difference is in the noise for
both types of invalidates (+/-1%).

With two NPUs active (on a 6-GPU system) the effect is more noticeable:

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         5.9        6.5          10%
1M         31.2       33.4           7%
2M         36.3       38.7           7%
4M        322.6      356.7          11%

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
---
 arch/powerpc/platforms/powernv/npu-dma.c |   99 ++++++++++++++---------------
 1 files changed, 48 insertions(+), 51 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 8006c54..c8f438a 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -454,79 +454,76 @@ static void put_mmio_atsd_reg(struct npu *npu, int reg)
 }
 
 /* MMIO ATSD register offsets */
-#define XTS_ATSD_AVA  1
-#define XTS_ATSD_STAT 2
+#define XTS_ATSD_LAUNCH 0
+#define XTS_ATSD_AVA    1
+#define XTS_ATSD_STAT   2
 
-static void mmio_launch_invalidate(struct mmio_atsd_reg *mmio_atsd_reg,
-				unsigned long launch, unsigned long va)
+static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long psize,
+					bool flush)
 {
-	struct npu *npu = mmio_atsd_reg->npu;
-	int reg = mmio_atsd_reg->reg;
+	unsigned long launch = 0;
 
-	__raw_writeq_be(va, npu->mmio_atsd_regs[reg] + XTS_ATSD_AVA);
-	eieio();
-	__raw_writeq_be(launch, npu->mmio_atsd_regs[reg]);
+	if (psize == MMU_PAGE_COUNT) {
+		/* IS set to invalidate entire matching PID */
+		launch |= PPC_BIT(12);
+	} else {
+		/* AP set to invalidate region of psize */
+		launch |= (u64)mmu_get_ap(psize) << PPC_BITLSHIFT(17);
+	}
+
+	/* PRS set to process-scoped */
+	launch |= PPC_BIT(13);
+
+	/* PID */
+	launch |= pid << PPC_BITLSHIFT(38);
+
+	/* No flush */
+	launch |= !flush << PPC_BITLSHIFT(39);
+
+	return launch;
 }
 
-static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
-				unsigned long pid, bool flush)
+static void mmio_atsd_regs_write(struct mmio_atsd_reg
+			mmio_atsd_reg[NV_MAX_NPUS], unsigned long offset,
+			unsigned long val)
 {
-	int i;
-	unsigned long launch;
+	struct npu *npu;
+	int i, reg;
 
 	for (i = 0; i <= max_npu2_index; i++) {
-		if (mmio_atsd_reg[i].reg < 0)
+		reg = mmio_atsd_reg[i].reg;
+		if (reg < 0)
 			continue;
 
-		/* IS set to invalidate matching PID */
-		launch = PPC_BIT(12);
-
-		/* PRS set to process-scoped */
-		launch |= PPC_BIT(13);
-
-		/* AP */
-		launch |= (u64)
-			mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
-
-		/* PID */
-		launch |= pid << PPC_BITLSHIFT(38);
+		npu = mmio_atsd_reg[i].npu;
+		__raw_writeq_be(val, npu->mmio_atsd_regs[reg] + offset);
+	}
+}
 
-		/* No flush */
-		launch |= !flush << PPC_BITLSHIFT(39);
+static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
+				unsigned long pid, bool flush)
+{
+	unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT, flush);
 
-		/* Invalidating the entire process doesn't use a va */
-		mmio_launch_invalidate(&mmio_atsd_reg[i], launch, 0);
-	}
+	/* Invalidating the entire process doesn't use a va */
+	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
 }
 
 static void mmio_invalidate_va(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
 			unsigned long va, unsigned long pid, bool flush)
 {
-	int i;
 	unsigned long launch;
 
-	for (i = 0; i <= max_npu2_index; i++) {
-		if (mmio_atsd_reg[i].reg < 0)
-			continue;
-
-		/* IS set to invalidate target VA */
-		launch = 0;
+	launch = get_atsd_launch_val(pid, mmu_virtual_psize, flush);
 
-		/* PRS set to process scoped */
-		launch |= PPC_BIT(13);
+	/* Write all VAs first */
+	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, va);
 
-		/* AP */
-		launch |= (u64)
-			mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
-
-		/* PID */
-		launch |= pid << PPC_BITLSHIFT(38);
-
-		/* No flush */
-		launch |= !flush << PPC_BITLSHIFT(39);
+	/* Issue one barrier for all address writes */
+	eieio();
 
-		mmio_launch_invalidate(&mmio_atsd_reg[i], launch, va);
-	}
+	/* Launch */
+	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
 }
 
 #define mn_to_npu_context(x) container_of(x, struct npu_context, mn)
-- 
1.7.2.5



* [PATCH v2 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates
  2018-10-03 18:51 [PATCH v2 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead Mark Hairgrove
  2018-10-03 18:51 ` [PATCH v2 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
@ 2018-10-03 18:51 ` Mark Hairgrove
  2018-10-04  5:20   ` Alistair Popple
  2018-10-03 18:51 ` [PATCH v2 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting Mark Hairgrove
  2 siblings, 1 reply; 8+ messages in thread
From: Mark Hairgrove @ 2018-10-03 18:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple, Mark Hairgrove, Reza Arbab

Prior to this change, only two types of ATSDs were issued to the NPU:
invalidates targeting a single page and invalidates targeting the whole
address space. The crossover point was the configurable atsd_threshold,
which defaulted to 2M: invalidates of that size or smaller were issued as
per-page invalidates across the whole range.

The NPU supports more invalidation sizes, however: 64K, 2M, 1G, and the
entire address space. These invalidates target addresses aligned to their
size. 2M is a common invalidation size for GPU-enabled applications because
it is a GPU page size, so reducing the number of invalidates by 32x in that
case is a clear improvement.

ATSD latency is high in general, so a single invalidate is now always
issued rather than multiple smaller ones. This will over-invalidate in some
cases, but for any invalidation size over 2M it matches or improves on the
prior behavior. There is also an improvement for single-page invalidates,
since the prior version issued two invalidates for that case instead of one.
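
As an illustration of how a range is mapped onto one of the supported
sizes (this mirrors the selection logic added to mmio_invalidate() in the
diff below; pick_atsd_size() is just a name for this sketch):

    static int pick_atsd_size(unsigned long start, unsigned long size,
                              unsigned long *atsd_start)
    {
            unsigned long end = start + size - 1;

            if (size == SZ_64K) {
                    *atsd_start = start;
                    return MMU_PAGE_64K;
            }
            if (ALIGN_DOWN(start, SZ_2M) == ALIGN_DOWN(end, SZ_2M)) {
                    *atsd_start = ALIGN_DOWN(start, SZ_2M);
                    return MMU_PAGE_2M;
            }
            if (ALIGN_DOWN(start, SZ_1G) == ALIGN_DOWN(end, SZ_1G)) {
                    *atsd_start = ALIGN_DOWN(start, SZ_1G);
                    return MMU_PAGE_1G;
            }
            /* Range spans 1G boundaries: invalidate the whole PID instead */
            return MMU_PAGE_COUNT;
    }

For example, a 192K invalidate starting at 0x10020000 does not fit a single
64K entry but does sit within one 2M region, so a single 2M ATSD rooted at
0x10000000 is issued instead of three 64K ones.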

With this change all issued ATSDs now perform a flush, so the flush
parameter has been removed from all the helpers.

To show the benefit, here are some performance numbers from a
microbenchmark that creates a 1G allocation and then uses mprotect with
PROT_NONE to trigger invalidates in strides across the allocation.

One NPU (1 GPU):

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         5.3        5.6           5%
1M         39.3       57.4          46%
2M         49.7       82.6          66%
4M        286.6      285.7           0%

Two NPUs (6 GPUs):

         mprotect rate (GB/s)
Stride   Before      After      Speedup
64K         6.5        7.4          13%
1M         33.4       67.9         103%
2M         38.7       93.1         141%
4M        356.7      354.6          -1%

Anything over 2M is roughly the same as before since both cases issue a
single ATSD.

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
---
 arch/powerpc/platforms/powernv/npu-dma.c |  103 ++++++++++++++++--------------
 1 files changed, 55 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index c8f438a..e4c0fab 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -18,6 +18,7 @@
 #include <linux/memblock.h>
 #include <linux/iommu.h>
 #include <linux/debugfs.h>
+#include <linux/sizes.h>
 
 #include <asm/debugfs.h>
 #include <asm/tlb.h>
@@ -458,8 +459,7 @@ static void put_mmio_atsd_reg(struct npu *npu, int reg)
 #define XTS_ATSD_AVA    1
 #define XTS_ATSD_STAT   2
 
-static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long psize,
-					bool flush)
+static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long psize)
 {
 	unsigned long launch = 0;
 
@@ -477,8 +477,7 @@ static unsigned long get_atsd_launch_val(unsigned long pid, unsigned long psize,
 	/* PID */
 	launch |= pid << PPC_BITLSHIFT(38);
 
-	/* No flush */
-	launch |= !flush << PPC_BITLSHIFT(39);
+	/* Leave "No flush" (bit 39) 0 so every ATSD performs a flush */
 
 	return launch;
 }
@@ -501,23 +500,22 @@ static void mmio_atsd_regs_write(struct mmio_atsd_reg
 }
 
 static void mmio_invalidate_pid(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
-				unsigned long pid, bool flush)
+				unsigned long pid)
 {
-	unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT, flush);
+	unsigned long launch = get_atsd_launch_val(pid, MMU_PAGE_COUNT);
 
 	/* Invalidating the entire process doesn't use a va */
 	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_LAUNCH, launch);
 }
 
-static void mmio_invalidate_va(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS],
-			unsigned long va, unsigned long pid, bool flush)
+static void mmio_invalidate_range(struct mmio_atsd_reg
+			mmio_atsd_reg[NV_MAX_NPUS], unsigned long pid,
+			unsigned long start, unsigned long psize)
 {
-	unsigned long launch;
-
-	launch = get_atsd_launch_val(pid, mmu_virtual_psize, flush);
+	unsigned long launch = get_atsd_launch_val(pid, psize);
 
 	/* Write all VAs first */
-	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, va);
+	mmio_atsd_regs_write(mmio_atsd_reg, XTS_ATSD_AVA, start);
 
 	/* Issue one barrier for all address writes */
 	eieio();
@@ -609,14 +607,36 @@ static void release_atsd_reg(struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS])
 }
 
 /*
- * Invalidate either a single address or an entire PID depending on
- * the value of va.
+ * Invalidate a virtual address range
  */
-static void mmio_invalidate(struct npu_context *npu_context, int va,
-			unsigned long address, bool flush)
+static void mmio_invalidate(struct npu_context *npu_context,
+			unsigned long start, unsigned long size)
 {
 	struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS];
 	unsigned long pid = npu_context->mm->context.id;
+	unsigned long atsd_start = 0;
+	unsigned long end = start + size - 1;
+	int atsd_psize = MMU_PAGE_COUNT;
+
+	/*
+	 * Convert the input range into one of the supported sizes. If the range
+	 * doesn't fit, use the next larger supported size. Invalidation latency
+	 * is high, so over-invalidation is preferred to issuing multiple
+	 * invalidates.
+	 *
+	 * A 4K page size isn't supported by NPU/GPU ATS, so that case is
+	 * ignored.
+	 */
+	if (size == SZ_64K) {
+		atsd_start = start;
+		atsd_psize = MMU_PAGE_64K;
+	} else if (ALIGN_DOWN(start, SZ_2M) == ALIGN_DOWN(end, SZ_2M)) {
+		atsd_start = ALIGN_DOWN(start, SZ_2M);
+		atsd_psize = MMU_PAGE_2M;
+	} else if (ALIGN_DOWN(start, SZ_1G) == ALIGN_DOWN(end, SZ_1G)) {
+		atsd_start = ALIGN_DOWN(start, SZ_1G);
+		atsd_psize = MMU_PAGE_1G;
+	}
 
 	if (npu_context->nmmu_flush)
 		/*
@@ -631,23 +651,25 @@ static void mmio_invalidate(struct npu_context *npu_context, int va,
 	 * an invalidate.
 	 */
 	acquire_atsd_reg(npu_context, mmio_atsd_reg);
-	if (va)
-		mmio_invalidate_va(mmio_atsd_reg, address, pid, flush);
+
+	if (atsd_psize == MMU_PAGE_COUNT)
+		mmio_invalidate_pid(mmio_atsd_reg, pid);
 	else
-		mmio_invalidate_pid(mmio_atsd_reg, pid, flush);
+		mmio_invalidate_range(mmio_atsd_reg, pid, atsd_start,
+					atsd_psize);
 
 	mmio_invalidate_wait(mmio_atsd_reg);
-	if (flush) {
-		/*
-		 * The GPU requires two flush ATSDs to ensure all entries have
-		 * been flushed. We use PID 0 as it will never be used for a
-		 * process on the GPU.
-		 */
-		mmio_invalidate_pid(mmio_atsd_reg, 0, true);
-		mmio_invalidate_wait(mmio_atsd_reg);
-		mmio_invalidate_pid(mmio_atsd_reg, 0, true);
-		mmio_invalidate_wait(mmio_atsd_reg);
-	}
+
+	/*
+	 * The GPU requires two flush ATSDs to ensure all entries have been
+	 * flushed. We use PID 0 as it will never be used for a process on the
+	 * GPU.
+	 */
+	mmio_invalidate_pid(mmio_atsd_reg, 0);
+	mmio_invalidate_wait(mmio_atsd_reg);
+	mmio_invalidate_pid(mmio_atsd_reg, 0);
+	mmio_invalidate_wait(mmio_atsd_reg);
+
 	release_atsd_reg(mmio_atsd_reg);
 }
 
@@ -664,7 +686,7 @@ static void pnv_npu2_mn_release(struct mmu_notifier *mn,
 	 * There should be no more translation requests for this PID, but we
 	 * need to ensure any entries for it are removed from the TLB.
 	 */
-	mmio_invalidate(npu_context, 0, 0, true);
+	mmio_invalidate(npu_context, 0, ~0UL);
 }
 
 static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
@@ -673,8 +695,7 @@ static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
 				pte_t pte)
 {
 	struct npu_context *npu_context = mn_to_npu_context(mn);
-
-	mmio_invalidate(npu_context, 1, address, true);
+	mmio_invalidate(npu_context, address, PAGE_SIZE);
 }
 
 static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
@@ -682,21 +703,7 @@ static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
 					unsigned long start, unsigned long end)
 {
 	struct npu_context *npu_context = mn_to_npu_context(mn);
-	unsigned long address;
-
-	if (end - start > atsd_threshold) {
-		/*
-		 * Just invalidate the entire PID if the address range is too
-		 * large.
-		 */
-		mmio_invalidate(npu_context, 0, 0, true);
-	} else {
-		for (address = start; address < end; address += PAGE_SIZE)
-			mmio_invalidate(npu_context, 1, address, false);
-
-		/* Do the flush only on the final addess == end */
-		mmio_invalidate(npu_context, 1, address, true);
-	}
+	mmio_invalidate(npu_context, start, end - start);
 }
 
 static const struct mmu_notifier_ops nv_nmmu_notifier_ops = {
-- 
1.7.2.5



* [PATCH v2 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting
  2018-10-03 18:51 [PATCH v2 0/3] powerpc/powernv/npu: Improve ATSD invalidation overhead Mark Hairgrove
  2018-10-03 18:51 ` [PATCH v2 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
  2018-10-03 18:51 ` [PATCH v2 2/3] powerpc/powernv/npu: Use size-based " Mark Hairgrove
@ 2018-10-03 18:51 ` Mark Hairgrove
  2018-10-04  5:21   ` Alistair Popple
  2 siblings, 1 reply; 8+ messages in thread
From: Mark Hairgrove @ 2018-10-03 18:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple, Mark Hairgrove, Reza Arbab

This threshold is no longer used now that all invalidates issue a single
ATSD to each active NPU.

Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
---
 arch/powerpc/platforms/powernv/npu-dma.c |   14 --------------
 1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index e4c0fab..6f60e09 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -17,7 +17,6 @@
 #include <linux/pci.h>
 #include <linux/memblock.h>
 #include <linux/iommu.h>
-#include <linux/debugfs.h>
 #include <linux/sizes.h>
 
 #include <asm/debugfs.h>
@@ -43,14 +42,6 @@
 static DEFINE_SPINLOCK(npu_context_lock);
 
 /*
- * When an address shootdown range exceeds this threshold we invalidate the
- * entire TLB on the GPU for the given PID rather than each specific address in
- * the range.
- */
-static uint64_t atsd_threshold = 2 * 1024 * 1024;
-static struct dentry *atsd_threshold_dentry;
-
-/*
  * Other types of TCE cache invalidation are not functional in the
  * hardware.
  */
@@ -966,11 +957,6 @@ int pnv_npu2_init(struct pnv_phb *phb)
 	static int npu_index;
 	uint64_t rc = 0;
 
-	if (!atsd_threshold_dentry) {
-		atsd_threshold_dentry = debugfs_create_x64("atsd_threshold",
-				   0600, powerpc_debugfs_root, &atsd_threshold);
-	}
-
 	phb->npu.nmmu_flush =
 		of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
 	for_each_child_of_node(phb->hose->dn, dn) {
-- 
1.7.2.5



* Re: [PATCH v2 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
  2018-10-03 18:51 ` [PATCH v2 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
@ 2018-10-04  5:20   ` Alistair Popple
  2018-10-11  8:35   ` [v2, " Michael Ellerman
  1 sibling, 0 replies; 8+ messages in thread
From: Alistair Popple @ 2018-10-04  5:20 UTC (permalink / raw)
  To: Mark Hairgrove; +Cc: linuxppc-dev, Reza Arbab

Reviewed-by: Alistair Popple <alistair@popple.id.au>


* Re: [PATCH v2 2/3] powerpc/powernv/npu: Use size-based ATSD invalidates
  2018-10-03 18:51 ` [PATCH v2 2/3] powerpc/powernv/npu: Use size-based " Mark Hairgrove
@ 2018-10-04  5:20   ` Alistair Popple
  0 siblings, 0 replies; 8+ messages in thread
From: Alistair Popple @ 2018-10-04  5:20 UTC (permalink / raw)
  To: Mark Hairgrove; +Cc: linuxppc-dev, Reza Arbab

Reviewed-by: Alistair Popple <alistair@popple.id.au>


* Re: [PATCH v2 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting
  2018-10-03 18:51 ` [PATCH v2 3/3] powerpc/powernv/npu: Remove atsd_threshold debugfs setting Mark Hairgrove
@ 2018-10-04  5:21   ` Alistair Popple
  0 siblings, 0 replies; 8+ messages in thread
From: Alistair Popple @ 2018-10-04  5:21 UTC (permalink / raw)
  To: Mark Hairgrove; +Cc: linuxppc-dev, Reza Arbab

Reviewed-by: Alistair Popple <alistair@popple.id.au>


* Re: [v2, 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates
  2018-10-03 18:51 ` [PATCH v2 1/3] powerpc/powernv/npu: Reduce eieio usage when issuing ATSD invalidates Mark Hairgrove
  2018-10-04  5:20   ` Alistair Popple
@ 2018-10-11  8:35   ` Michael Ellerman
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Ellerman @ 2018-10-11  8:35 UTC (permalink / raw)
  To: Mark Hairgrove, linuxppc-dev; +Cc: Alistair Popple, Mark Hairgrove, Reza Arbab

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/7ead15a1442b25e12a6f0791a7c7a5

cheers


