* [PATCH 0/4] copy_page improvements
@ 2016-02-02 12:46 Will Deacon
  2016-02-02 12:46 ` [PATCH 1/4] arm64: prefetch: don't provide spin_lock_prefetch with LSE Will Deacon
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
  To: linux-arm-kernel

Hi all,

This patch series aims to improve our copy_page implementation on CPUs
both with and without hardware prefetchers. It is inspired by a previous
patch series from Andrew, and one of the patches is lifted directly from
there:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/399132.html

Changes from that series:

  * Applies on top of mainline
  * Inverted the capability to the more generic ARM64_HAS_NO_HW_PREFETCH
  * Re-jigged the MIDR detection to try to keep it simple

I've left copy_template alone for now, since the template really deals
with 64 bytes per iteration, which would need changing.

Andrew -- it would be helpful if you could take this for a spin on
ThunderX to make sure it didn't screw anything up.

Cheers,

Will

--->8

Andrew Pinski (1):
  arm64: lib: patch in prfm for copy_page if requested

Will Deacon (3):
  arm64: prefetch: don't provide spin_lock_prefetch with LSE
  arm64: prefetch: add alternative pattern for CPUs without a prefetcher
  arm64: lib: improve copy_page to deal with 128 bytes at a time

 arch/arm64/include/asm/cpufeature.h |  3 +-
 arch/arm64/include/asm/cputype.h    | 17 +++++++++-
 arch/arm64/include/asm/processor.h  |  7 +++--
 arch/arm64/kernel/cpu_errata.c      | 18 ++---------
 arch/arm64/kernel/cpufeature.c      | 17 ++++++++++
 arch/arm64/lib/copy_page.S          | 63 ++++++++++++++++++++++++++++++++-----
 6 files changed, 98 insertions(+), 27 deletions(-)

-- 
2.1.4

* [PATCH 1/4] arm64: prefetch: don't provide spin_lock_prefetch with LSE
  2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
@ 2016-02-02 12:46 ` Will Deacon
  2016-02-02 12:46 ` [PATCH 2/4] arm64: prefetch: add alternative pattern for CPUs without a prefetcher Will Deacon
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
  To: linux-arm-kernel

The LSE atomics rely on us not dirtying data at L1 if we can avoid it;
otherwise, many of the potential scalability benefits are lost.

This patch replaces spin_lock_prefetch with a nop when the LSE atomics
are in use, so that users don't shoot themselves in the foot by causing
needless coherence traffic at L1.
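
Conceptually, the change selects between two behaviours at boot time via
the alternatives framework. The compile-time sketch below is only an
approximation of that run-time choice; HAVE_LSE_ATOMICS is a hypothetical
stand-in for the ARM64_HAS_LSE_ATOMICS capability, not a real kernel symbol:

/*
 * Sketch only: approximates the boot-time choice made by the
 * ARM64_LSE_ATOMIC_INSN alternative in the hunk below with a
 * compile-time switch.  HAVE_LSE_ATOMICS is a hypothetical macro
 * standing in for the ARM64_HAS_LSE_ATOMICS CPU capability.
 */
static inline void spin_lock_prefetch_sketch(const void *ptr)
{
#ifdef HAVE_LSE_ATOMICS
	/* LSE in use: do nothing rather than pull the lock line into L1 */
	asm volatile("nop");
#else
	/* LL/SC in use: prefetch the line for store, streaming */
	asm volatile("prfm pstl1strm, %a0" : : "p" (ptr));
#endif
}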

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/processor.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 4acb7ca94fcd..31b76fce4477 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -29,6 +29,7 @@
 
 #include <linux/string.h>
 
+#include <asm/alternative.h>
 #include <asm/fpsimd.h>
 #include <asm/hw_breakpoint.h>
 #include <asm/pgtable-hwdef.h>
@@ -177,9 +178,11 @@ static inline void prefetchw(const void *ptr)
 }
 
 #define ARCH_HAS_SPINLOCK_PREFETCH
-static inline void spin_lock_prefetch(const void *x)
+static inline void spin_lock_prefetch(const void *ptr)
 {
-	prefetchw(x);
+	asm volatile(ARM64_LSE_ATOMIC_INSN(
+		     "prfm pstl1strm, %a0",
+		     "nop") : : "p" (ptr));
 }
 
 #define HAVE_ARCH_PICK_MMAP_LAYOUT
-- 
2.1.4

* [PATCH 2/4] arm64: prefetch: add alternative pattern for CPUs without a prefetcher
  2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
  2016-02-02 12:46 ` [PATCH 1/4] arm64: prefetch: don't provide spin_lock_prefetch with LSE Will Deacon
@ 2016-02-02 12:46 ` Will Deacon
  2016-02-02 12:46 ` [PATCH 3/4] arm64: lib: improve copy_page to deal with 128 bytes at a time Will Deacon
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
  To: linux-arm-kernel

Most CPUs have a hardware prefetcher which generally performs better
without explicit prefetch instructions issued by software. However,
some CPUs (e.g. Cavium ThunderX) rely solely on explicit prefetch
instructions.

This patch adds an alternative pattern (ARM64_HAS_NO_HW_PREFETCH) to
allow our library code to make use of explicit prefetch instructions
during things like copy routines only when the CPU does not have the
capability to perform the prefetching itself.
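
To make the MIDR matching below easier to follow, here is a rough
standalone sketch of what the new ThunderX check boils down to. The
constants are derived from the cputype.h and cpufeature.c hunks in this
patch; the code is illustrative only, not kernel code:

/*
 * Illustrative sketch (not kernel code) of the ThunderX detection
 * added below.  MIDR_EL1 field layout: [31:24] implementer,
 * [23:20] variant, [19:16] architecture, [15:4] part number,
 * [3:0] revision.
 */
#include <stdbool.h>
#include <stdint.h>

#define MIDR_MODEL_MASK   0xff0ffff0u /* implementer | architecture | partnum */
#define MIDR_REV_VAR_MASK 0x00f0000fu /* variant | revision */
#define MIDR_THUNDERX     0x430f0a10u /* Cavium (0x43), part 0x0a1 */

static bool thunderx_has_no_hw_prefetch(uint32_t midr)
{
	uint32_t rv = midr & MIDR_REV_VAR_MASK;

	/* Variant 0 (pass 1.x) or variant 1 (pass 2.x), any revision */
	return (midr & MIDR_MODEL_MASK) == MIDR_THUNDERX &&
	       rv <= ((1u << 20) | 0xfu);
}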

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/cpufeature.h |  3 ++-
 arch/arm64/include/asm/cputype.h    | 17 ++++++++++++++++-
 arch/arm64/kernel/cpu_errata.c      | 18 +++---------------
 arch/arm64/kernel/cpufeature.c      | 17 +++++++++++++++++
 4 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 8f271b83f910..8d56bd8550dc 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -30,8 +30,9 @@
 #define ARM64_HAS_LSE_ATOMICS			5
 #define ARM64_WORKAROUND_CAVIUM_23154		6
 #define ARM64_WORKAROUND_834220			7
+#define ARM64_HAS_NO_HW_PREFETCH		8
 
-#define ARM64_NCAPS				8
+#define ARM64_NCAPS				9
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 1a5949364ed0..7540284a17fe 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -57,11 +57,22 @@
 #define MIDR_IMPLEMENTOR(midr)	\
 	(((midr) & MIDR_IMPLEMENTOR_MASK) >> MIDR_IMPLEMENTOR_SHIFT)
 
-#define MIDR_CPU_PART(imp, partnum) \
+#define MIDR_CPU_MODEL(imp, partnum) \
 	(((imp)			<< MIDR_IMPLEMENTOR_SHIFT) | \
 	(0xf			<< MIDR_ARCHITECTURE_SHIFT) | \
 	((partnum)		<< MIDR_PARTNUM_SHIFT))
 
+#define MIDR_CPU_MODEL_MASK (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | \
+			     MIDR_ARCHITECTURE_MASK)
+
+#define MIDR_IS_CPU_MODEL_RANGE(midr, model, rv_min, rv_max)		\
+({									\
+	u32 _model = (midr) & MIDR_CPU_MODEL_MASK;			\
+	u32 rv = (midr) & (MIDR_REVISION_MASK | MIDR_VARIANT_MASK);	\
+									\
+	_model == (model) && rv >= (rv_min) && rv <= (rv_max);		\
+ })
+
 #define ARM_CPU_IMP_ARM			0x41
 #define ARM_CPU_IMP_APM			0x50
 #define ARM_CPU_IMP_CAVIUM		0x43
@@ -75,6 +86,10 @@
 
 #define CAVIUM_CPU_PART_THUNDERX	0x0A1
 
+#define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
+#define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
+#define MIDR_THUNDERX	MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
+
 #ifndef __ASSEMBLY__
 
 /*
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index feb6b4efa641..e6bc988e8dbf 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -21,24 +21,12 @@
 #include <asm/cputype.h>
 #include <asm/cpufeature.h>
 
-#define MIDR_CORTEX_A53 MIDR_CPU_PART(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
-#define MIDR_CORTEX_A57 MIDR_CPU_PART(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
-#define MIDR_THUNDERX	MIDR_CPU_PART(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
-
-#define CPU_MODEL_MASK (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | \
-			MIDR_ARCHITECTURE_MASK)
-
 static bool __maybe_unused
 is_affected_midr_range(const struct arm64_cpu_capabilities *entry)
 {
-	u32 midr = read_cpuid_id();
-
-	if ((midr & CPU_MODEL_MASK) != entry->midr_model)
-		return false;
-
-	midr &= MIDR_REVISION_MASK | MIDR_VARIANT_MASK;
-
-	return (midr >= entry->midr_range_min && midr <= entry->midr_range_max);
+	return MIDR_IS_CPU_MODEL_RANGE(read_cpuid_id(), entry->midr_model,
+				       entry->midr_range_min,
+				       entry->midr_range_max);
 }
 
 #define MIDR_RANGE(model, min, max) \
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 5c90aa490a2b..3615d7d7c9af 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -621,6 +621,18 @@ static bool has_useable_gicv3_cpuif(const struct arm64_cpu_capabilities *entry)
 	return has_sre;
 }
 
+static bool has_no_hw_prefetch(const struct arm64_cpu_capabilities *entry)
+{
+	u32 midr = read_cpuid_id();
+	u32 rv_min, rv_max;
+
+	/* Cavium ThunderX pass 1.x and 2.x */
+	rv_min = 0;
+	rv_max = (1 << MIDR_VARIANT_SHIFT) | MIDR_REVISION_MASK;
+
+	return MIDR_IS_CPU_MODEL_RANGE(midr, MIDR_THUNDERX, rv_min, rv_max);
+}
+
 static const struct arm64_cpu_capabilities arm64_features[] = {
 	{
 		.desc = "GIC system register CPU interface",
@@ -651,6 +663,11 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.min_field_value = 2,
 	},
 #endif /* CONFIG_AS_LSE && CONFIG_ARM64_LSE_ATOMICS */
+	{
+		.desc = "Software prefetching using PRFM",
+		.capability = ARM64_HAS_NO_HW_PREFETCH,
+		.matches = has_no_hw_prefetch,
+	},
 	{},
 };
 
-- 
2.1.4

* [PATCH 3/4] arm64: lib: improve copy_page to deal with 128 bytes at a time
  2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
  2016-02-02 12:46 ` [PATCH 1/4] arm64: prefetch: don't provide spin_lock_prefetch with LSE Will Deacon
  2016-02-02 12:46 ` [PATCH 2/4] arm64: prefetch: add alternative pattern for CPUs without a prefetcher Will Deacon
@ 2016-02-02 12:46 ` Will Deacon
  2016-02-02 12:46 ` [PATCH 4/4] arm64: lib: patch in prfm for copy_page if requested Will Deacon
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
  To: linux-arm-kernel

We want to avoid lots of different copy_page implementations, settling
for something that is "good enough" everywhere and hopefully easy to
understand and maintain whilst we're at it.

This patch reworks our copy_page implementation based on discussions
with Cavium on the list and benchmarking on Cortex-A processors so that:

  - The loop is unrolled to copy 128 bytes per iteration

  - The reads are offset so that we read from the next 128-byte block
    in the same iteration that we store the previous block

  - Explicit prefetch instructions are removed for now, since they hurt
    performance on CPUs with hardware prefetching

  - The loop exit condition is calculated at the start of the loop
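
As a rough illustration of the resulting structure (the real
implementation is the assembly below), a C model of the software
pipeline looks something like this, assuming a 4 KiB page for
concreteness:

/*
 * Rough C model of the software-pipelined copy below (illustrative
 * only; the real implementation is the assembly diff).  Assumes a
 * 4 KiB page.  Each iteration stores the 128-byte block loaded on
 * the previous iteration and loads the next one.
 */
#include <string.h>

#define MODEL_PAGE_SIZE 4096

static void copy_page_model(unsigned char *dst, const unsigned char *src)
{
	unsigned char block[128];
	long remaining = MODEL_PAGE_SIZE - 128;

	memcpy(block, src, sizeof(block));		/* prologue load */
	src += 128;

	do {
		remaining -= 128;
		memcpy(dst, block, sizeof(block));	/* store previous block */
		memcpy(block, src, sizeof(block));	/* load next block */
		dst += 128;
		src += 128;
	} while (remaining > 0);

	memcpy(dst, block, sizeof(block));		/* epilogue store */
}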

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/lib/copy_page.S | 46 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 38 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 512b9a7b980e..2534533ceb1d 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -27,20 +27,50 @@
  *	x1 - src
  */
 ENTRY(copy_page)
-	/* Assume cache line size is 64 bytes. */
-	prfm	pldl1strm, [x1, #64]
-1:	ldp	x2, x3, [x1]
+	ldp	x2, x3, [x1]
 	ldp	x4, x5, [x1, #16]
 	ldp	x6, x7, [x1, #32]
 	ldp	x8, x9, [x1, #48]
-	add	x1, x1, #64
-	prfm	pldl1strm, [x1, #64]
+	ldp	x10, x11, [x1, #64]
+	ldp	x12, x13, [x1, #80]
+	ldp	x14, x15, [x1, #96]
+	ldp	x16, x17, [x1, #112]
+
+	mov	x18, #(PAGE_SIZE - 128)
+	add	x1, x1, #128
+1:
+	subs	x18, x18, #128
+
 	stnp	x2, x3, [x0]
+	ldp	x2, x3, [x1]
 	stnp	x4, x5, [x0, #16]
+	ldp	x4, x5, [x1, #16]
 	stnp	x6, x7, [x0, #32]
+	ldp	x6, x7, [x1, #32]
 	stnp	x8, x9, [x0, #48]
-	add	x0, x0, #64
-	tst	x1, #(PAGE_SIZE - 1)
-	b.ne	1b
+	ldp	x8, x9, [x1, #48]
+	stnp	x10, x11, [x0, #64]
+	ldp	x10, x11, [x1, #64]
+	stnp	x12, x13, [x0, #80]
+	ldp	x12, x13, [x1, #80]
+	stnp	x14, x15, [x0, #96]
+	ldp	x14, x15, [x1, #96]
+	stnp	x16, x17, [x0, #112]
+	ldp	x16, x17, [x1, #112]
+
+	add	x0, x0, #128
+	add	x1, x1, #128
+
+	b.gt	1b
+
+	stnp	x2, x3, [x0]
+	stnp	x4, x5, [x0, #16]
+	stnp	x6, x7, [x0, #32]
+	stnp	x8, x9, [x0, #48]
+	stnp	x10, x11, [x0, #64]
+	stnp	x12, x13, [x0, #80]
+	stnp	x14, x15, [x0, #96]
+	stnp	x16, x17, [x0, #112]
+
 	ret
 ENDPROC(copy_page)
-- 
2.1.4

* [PATCH 4/4] arm64: lib: patch in prfm for copy_page if requested
  2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
                   ` (2 preceding siblings ...)
  2016-02-02 12:46 ` [PATCH 3/4] arm64: lib: improve copy_page to deal with 128 bytes at a time Will Deacon
@ 2016-02-02 12:46 ` Will Deacon
  2016-02-05 17:53 ` [PATCH 0/4] copy_page improvements Andrew Pinski
  2016-02-09 11:55 ` Catalin Marinas
  5 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
  To: linux-arm-kernel

From: Andrew Pinski <apinski@cavium.com>

On ThunderX T88 pass 1 and pass 2, there is no hardware prefetching,
so we need to patch in explicit software prefetch instructions.
On the affected hardware, prefetching improves this code by 60% over
the original code and by 2x over the code without prefetching, as
measured with the benchmark code at
https://github.com/apinski-cavium/copy_page_benchmark
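
The schedule being patched in is: two prefetches ahead of the stream
before the copy starts, plus one per loop iteration that keeps the
stream roughly three 128-byte blocks ahead of the loads. A loose C
sketch of just that schedule (using __builtin_prefetch purely for
illustration; the actual patch uses prfm pldl1strm selected by
boot-time alternatives, and a 4 KiB page is assumed):

/*
 * Loose sketch of the prefetch schedule patched in below (illustrative
 * only).  In __builtin_prefetch, rw=0 (read) and locality=0 (streaming)
 * roughly correspond to the pldl1strm hint.
 */
static void prefetch_schedule_sketch(const char *src)
{
	/* Before the copy starts: request the next two 128-byte blocks */
	__builtin_prefetch(src + 128, 0, 0);
	__builtin_prefetch(src + 256, 0, 0);

	/*
	 * In the real loop each iteration copies the block at 'off' and
	 * issues one prefetch three blocks further on, so the stream
	 * stays ahead of the loads.
	 */
	for (long off = 128; off < 4096; off += 128)
		__builtin_prefetch(src + off + 384, 0, 0);
}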

Signed-off-by: Andrew Pinski <apinski@cavium.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/lib/copy_page.S | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 2534533ceb1d..4c1e700840b6 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -18,6 +18,8 @@
 #include <linux/const.h>
 #include <asm/assembler.h>
 #include <asm/page.h>
+#include <asm/cpufeature.h>
+#include <asm/alternative.h>
 
 /*
  * Copy a page from src to dest (both are page aligned)
@@ -27,6 +29,15 @@
  *	x1 - src
  */
 ENTRY(copy_page)
+alternative_if_not ARM64_HAS_NO_HW_PREFETCH
+	nop
+	nop
+alternative_else
+	# Prefetch two cache lines ahead.
+	prfm    pldl1strm, [x1, #128]
+	prfm    pldl1strm, [x1, #256]
+alternative_endif
+
 	ldp	x2, x3, [x1]
 	ldp	x4, x5, [x1, #16]
 	ldp	x6, x7, [x1, #32]
@@ -41,6 +52,12 @@ ENTRY(copy_page)
 1:
 	subs	x18, x18, #128
 
+alternative_if_not ARM64_HAS_NO_HW_PREFETCH
+	nop
+alternative_else
+	prfm    pldl1strm, [x1, #384]
+alternative_endif
+
 	stnp	x2, x3, [x0]
 	ldp	x2, x3, [x1]
 	stnp	x4, x5, [x0, #16]
-- 
2.1.4

* [PATCH 0/4] copy_page improvements
  2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
                   ` (3 preceding siblings ...)
  2016-02-02 12:46 ` [PATCH 4/4] arm64: lib: patch in prfm for copy_page if requested Will Deacon
@ 2016-02-05 17:53 ` Andrew Pinski
  2016-02-09 11:55 ` Catalin Marinas
  5 siblings, 0 replies; 7+ messages in thread
From: Andrew Pinski @ 2016-02-05 17:53 UTC (permalink / raw)
  To: linux-arm-kernel

On 2/2/2016 4:46 AM, Will Deacon wrote:
> Hi all,
>
> This patch series aims to improve our copy_page implementation on CPUs
> both with and without hardware prefetchers. It is inspired by a previous
> patch series from Andrew, and one of the patches is lifted directly from
> there:
>
>    http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/399132.html
>
> Changes from that series:
>
>    * Applies on top of mainline
>    * Inverted the capability to the more generic ARM64_HAS_NO_HW_PREFETCH
>    * Re-jigged the MIDR detection to try to keep it simple
>
> I've left copy_template alone for now, since the template really deals
> with 64 bytes per iteration, which would need changing.

Right, I had also looked at doing that for newlib/glibc, but the code is
too complex.
>
> Andrew -- it would be helpful if you could take this for a spin on
> ThunderX to make sure it didn't screw anything up.

Yes, it works and I get the needed prefetches in copy_page.

Tested-by: Andrew Pinski <apinski@cavium.com>

Thanks,
Andrew

>
> Cheers,
>
> Will
>
> --->8
>
> Andrew Pinski (1):
>    arm64: lib: patch in prfm for copy_page if requested
>
> Will Deacon (3):
>    arm64: prefetch: don't provide spin_lock_prefetch with LSE
>    arm64: prefetch: add alternative pattern for CPUs without a prefetcher
>    arm64: lib: improve copy_page to deal with 128 bytes at a time
>
>   arch/arm64/include/asm/cpufeature.h |  3 +-
>   arch/arm64/include/asm/cputype.h    | 17 +++++++++-
>   arch/arm64/include/asm/processor.h  |  7 +++--
>   arch/arm64/kernel/cpu_errata.c      | 18 ++---------
>   arch/arm64/kernel/cpufeature.c      | 17 ++++++++++
>   arch/arm64/lib/copy_page.S          | 63 ++++++++++++++++++++++++++++++++-----
>   6 files changed, 98 insertions(+), 27 deletions(-)
>

* [PATCH 0/4] copy_page improvements
  2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
                   ` (4 preceding siblings ...)
  2016-02-05 17:53 ` [PATCH 0/4] copy_page improvements Andrew Pinski
@ 2016-02-09 11:55 ` Catalin Marinas
  5 siblings, 0 replies; 7+ messages in thread
From: Catalin Marinas @ 2016-02-09 11:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 02, 2016 at 12:46:22PM +0000, Will Deacon wrote:
> Andrew Pinski (1):
>   arm64: lib: patch in prfm for copy_page if requested
> 
> Will Deacon (3):
>   arm64: prefetch: don't provide spin_lock_prefetch with LSE
>   arm64: prefetch: add alternative pattern for CPUs without a prefetcher
>   arm64: lib: improve copy_page to deal with 128 bytes at a time

Patches queued for 4.6. Thanks.

-- 
Catalin
