* [PATCH 0/4] copy_page improvements
@ 2016-02-02 12:46 Will Deacon
2016-02-02 12:46 ` [PATCH 1/4] arm64: prefetch: don't provide spin_lock_prefetch with LSE Will Deacon
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
To: linux-arm-kernel
Hi all,
This patch series aims to improve our copy_page implementation on both
CPUs with and without hardware prefetchers. It is inspired by a previous
patch series from Andrew, and one of the patches is lifted directly from
there:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/399132.html
Changes from that series:
* Applies on top of mainline
* Inverted the capability to the more generic ARM64_HAS_NO_HW_PREFETCH
* Re-jigged the MIDR detection to try to keep it simple
I've left copy_template alone for now, since the template really deals
with 64 bytes per iteration, which would need changing.
Andrew -- it would be helpful if you could take this for a spin on
ThunderX to make sure it didn't screw anything up.
Cheers,
Will
--->8
Andrew Pinski (1):
arm64: lib: patch in prfm for copy_page if requested
Will Deacon (3):
arm64: prefetch: don't provide spin_lock_prefetch with LSE
arm64: prefetch: add alternative pattern for CPUs without a prefetcher
arm64: lib: improve copy_page to deal with 128 bytes at a time
arch/arm64/include/asm/cpufeature.h | 3 +-
arch/arm64/include/asm/cputype.h | 17 +++++++++-
arch/arm64/include/asm/processor.h | 7 +++--
arch/arm64/kernel/cpu_errata.c | 18 ++---------
arch/arm64/kernel/cpufeature.c | 17 ++++++++++
arch/arm64/lib/copy_page.S | 63 ++++++++++++++++++++++++++++++++-----
6 files changed, 98 insertions(+), 27 deletions(-)
--
2.1.4
* [PATCH 1/4] arm64: prefetch: don't provide spin_lock_prefetch with LSE
2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
@ 2016-02-02 12:46 ` Will Deacon
2016-02-02 12:46 ` [PATCH 2/4] arm64: prefetch: add alternative pattern for CPUs without a prefetcher Will Deacon
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
To: linux-arm-kernel
The LSE atomics rely on us not dirtying data at L1 if we can avoid it,
otherwise many of the potential scalability benefits are lost.
This patch replaces spin_lock_prefetch with a nop when the LSE atomics
are in use, so that users don't shoot themselves in the foot by causing
needless coherence traffic at L1.
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/processor.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 4acb7ca94fcd..31b76fce4477 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -29,6 +29,7 @@
#include <linux/string.h>
+#include <asm/alternative.h>
#include <asm/fpsimd.h>
#include <asm/hw_breakpoint.h>
#include <asm/pgtable-hwdef.h>
@@ -177,9 +178,11 @@ static inline void prefetchw(const void *ptr)
}
#define ARCH_HAS_SPINLOCK_PREFETCH
-static inline void spin_lock_prefetch(const void *x)
+static inline void spin_lock_prefetch(const void *ptr)
{
- prefetchw(x);
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ "prfm pstl1strm, %a0",
+ "nop") : : "p" (ptr));
}
#define HAVE_ARCH_PICK_MMAP_LAYOUT
--
2.1.4
* [PATCH 2/4] arm64: prefetch: add alternative pattern for CPUs without a prefetcher
2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
2016-02-02 12:46 ` [PATCH 1/4] arm64: prefetch: don't provide spin_lock_prefetch with LSE Will Deacon
@ 2016-02-02 12:46 ` Will Deacon
2016-02-02 12:46 ` [PATCH 3/4] arm64: lib: improve copy_page to deal with 128 bytes at a time Will Deacon
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
To: linux-arm-kernel
Most CPUs have a hardware prefetcher which generally performs better
without explicit prefetch instructions issued by software; however,
some CPUs (e.g. Cavium ThunderX) rely solely on explicit prefetch
instructions.
This patch adds an alternative pattern (ARM64_HAS_NO_HW_PREFETCH) to
allow our library code to make use of explicit prefetch instructions
during things like copy routines only when the CPU does not have the
capability to perform the prefetching itself.
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/cpufeature.h | 3 ++-
arch/arm64/include/asm/cputype.h | 17 ++++++++++++++++-
arch/arm64/kernel/cpu_errata.c | 18 +++---------------
arch/arm64/kernel/cpufeature.c | 17 +++++++++++++++++
4 files changed, 38 insertions(+), 17 deletions(-)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 8f271b83f910..8d56bd8550dc 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -30,8 +30,9 @@
#define ARM64_HAS_LSE_ATOMICS 5
#define ARM64_WORKAROUND_CAVIUM_23154 6
#define ARM64_WORKAROUND_834220 7
+#define ARM64_HAS_NO_HW_PREFETCH 8
-#define ARM64_NCAPS 8
+#define ARM64_NCAPS 9
#ifndef __ASSEMBLY__
diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 1a5949364ed0..7540284a17fe 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -57,11 +57,22 @@
#define MIDR_IMPLEMENTOR(midr) \
(((midr) & MIDR_IMPLEMENTOR_MASK) >> MIDR_IMPLEMENTOR_SHIFT)
-#define MIDR_CPU_PART(imp, partnum) \
+#define MIDR_CPU_MODEL(imp, partnum) \
(((imp) << MIDR_IMPLEMENTOR_SHIFT) | \
(0xf << MIDR_ARCHITECTURE_SHIFT) | \
((partnum) << MIDR_PARTNUM_SHIFT))
+#define MIDR_CPU_MODEL_MASK (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | \
+ MIDR_ARCHITECTURE_MASK)
+
+#define MIDR_IS_CPU_MODEL_RANGE(midr, model, rv_min, rv_max) \
+({ \
+ u32 _model = (midr) & MIDR_CPU_MODEL_MASK; \
+ u32 rv = (midr) & (MIDR_REVISION_MASK | MIDR_VARIANT_MASK); \
+ \
+ _model == (model) && rv >= (rv_min) && rv <= (rv_max); \
+ })
+
#define ARM_CPU_IMP_ARM 0x41
#define ARM_CPU_IMP_APM 0x50
#define ARM_CPU_IMP_CAVIUM 0x43
@@ -75,6 +86,10 @@
#define CAVIUM_CPU_PART_THUNDERX 0x0A1
+#define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
+#define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
+#define MIDR_THUNDERX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
+
#ifndef __ASSEMBLY__
/*
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index feb6b4efa641..e6bc988e8dbf 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -21,24 +21,12 @@
#include <asm/cputype.h>
#include <asm/cpufeature.h>
-#define MIDR_CORTEX_A53 MIDR_CPU_PART(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
-#define MIDR_CORTEX_A57 MIDR_CPU_PART(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
-#define MIDR_THUNDERX MIDR_CPU_PART(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
-
-#define CPU_MODEL_MASK (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | \
- MIDR_ARCHITECTURE_MASK)
-
static bool __maybe_unused
is_affected_midr_range(const struct arm64_cpu_capabilities *entry)
{
- u32 midr = read_cpuid_id();
-
- if ((midr & CPU_MODEL_MASK) != entry->midr_model)
- return false;
-
- midr &= MIDR_REVISION_MASK | MIDR_VARIANT_MASK;
-
- return (midr >= entry->midr_range_min && midr <= entry->midr_range_max);
+ return MIDR_IS_CPU_MODEL_RANGE(read_cpuid_id(), entry->midr_model,
+ entry->midr_range_min,
+ entry->midr_range_max);
}
#define MIDR_RANGE(model, min, max) \
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 5c90aa490a2b..3615d7d7c9af 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -621,6 +621,18 @@ static bool has_useable_gicv3_cpuif(const struct arm64_cpu_capabilities *entry)
return has_sre;
}
+static bool has_no_hw_prefetch(const struct arm64_cpu_capabilities *entry)
+{
+ u32 midr = read_cpuid_id();
+ u32 rv_min, rv_max;
+
+ /* Cavium ThunderX pass 1.x and 2.x */
+ rv_min = 0;
+ rv_max = (1 << MIDR_VARIANT_SHIFT) | MIDR_REVISION_MASK;
+
+ return MIDR_IS_CPU_MODEL_RANGE(midr, MIDR_THUNDERX, rv_min, rv_max);
+}
+
static const struct arm64_cpu_capabilities arm64_features[] = {
{
.desc = "GIC system register CPU interface",
@@ -651,6 +663,11 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
.min_field_value = 2,
},
#endif /* CONFIG_AS_LSE && CONFIG_ARM64_LSE_ATOMICS */
+ {
+ .desc = "Software prefetching using PRFM",
+ .capability = ARM64_HAS_NO_HW_PREFETCH,
+ .matches = has_no_hw_prefetch,
+ },
{},
};
--
2.1.4
* [PATCH 3/4] arm64: lib: improve copy_page to deal with 128 bytes at a time
2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
2016-02-02 12:46 ` [PATCH 1/4] arm64: prefetch: don't provide spin_lock_prefetch with LSE Will Deacon
2016-02-02 12:46 ` [PATCH 2/4] arm64: prefetch: add alternative pattern for CPUs without a prefetcher Will Deacon
@ 2016-02-02 12:46 ` Will Deacon
2016-02-02 12:46 ` [PATCH 4/4] arm64: lib: patch in prfm for copy_page if requested Will Deacon
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
To: linux-arm-kernel
We want to avoid lots of different copy_page implementations, settling
for something that is "good enough" everywhere and hopefully easy to
understand and maintain whilst we're at it.
This patch reworks our copy_page implementation based on discussions
with Cavium on the list and benchmarking on Cortex-A processors so that:
- The loop is unrolled to copy 128 bytes per iteration
- The reads are offset so that we read from the next 128-byte block
in the same iteration that we store the previous block
- Explicit prefetch instructions are removed for now, since they hurt
performance on CPUs with hardware prefetching
- The loop exit condition is calculated at the start of the loop
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/lib/copy_page.S | 46 ++++++++++++++++++++++++++++++++++++++--------
1 file changed, 38 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 512b9a7b980e..2534533ceb1d 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -27,20 +27,50 @@
* x1 - src
*/
ENTRY(copy_page)
- /* Assume cache line size is 64 bytes. */
- prfm pldl1strm, [x1, #64]
-1: ldp x2, x3, [x1]
+ ldp x2, x3, [x1]
ldp x4, x5, [x1, #16]
ldp x6, x7, [x1, #32]
ldp x8, x9, [x1, #48]
- add x1, x1, #64
- prfm pldl1strm, [x1, #64]
+ ldp x10, x11, [x1, #64]
+ ldp x12, x13, [x1, #80]
+ ldp x14, x15, [x1, #96]
+ ldp x16, x17, [x1, #112]
+
+ mov x18, #(PAGE_SIZE - 128)
+ add x1, x1, #128
+1:
+ subs x18, x18, #128
+
stnp x2, x3, [x0]
+ ldp x2, x3, [x1]
stnp x4, x5, [x0, #16]
+ ldp x4, x5, [x1, #16]
stnp x6, x7, [x0, #32]
+ ldp x6, x7, [x1, #32]
stnp x8, x9, [x0, #48]
- add x0, x0, #64
- tst x1, #(PAGE_SIZE - 1)
- b.ne 1b
+ ldp x8, x9, [x1, #48]
+ stnp x10, x11, [x0, #64]
+ ldp x10, x11, [x1, #64]
+ stnp x12, x13, [x0, #80]
+ ldp x12, x13, [x1, #80]
+ stnp x14, x15, [x0, #96]
+ ldp x14, x15, [x1, #96]
+ stnp x16, x17, [x0, #112]
+ ldp x16, x17, [x1, #112]
+
+ add x0, x0, #128
+ add x1, x1, #128
+
+ b.gt 1b
+
+ stnp x2, x3, [x0]
+ stnp x4, x5, [x0, #16]
+ stnp x6, x7, [x0, #32]
+ stnp x8, x9, [x0, #48]
+ stnp x10, x11, [x0, #64]
+ stnp x12, x13, [x0, #80]
+ stnp x14, x15, [x0, #96]
+ stnp x16, x17, [x0, #112]
+
ret
ENDPROC(copy_page)
--
2.1.4
* [PATCH 4/4] arm64: lib: patch in prfm for copy_page if requested
2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
` (2 preceding siblings ...)
2016-02-02 12:46 ` [PATCH 3/4] arm64: lib: improve copy_page to deal with 128 bytes at a time Will Deacon
@ 2016-02-02 12:46 ` Will Deacon
2016-02-05 17:53 ` [PATCH 0/4] copy_page improvements Andrew Pinski
2016-02-09 11:55 ` Catalin Marinas
5 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2016-02-02 12:46 UTC (permalink / raw)
To: linux-arm-kernel
From: Andrew Pinski <apinski@cavium.com>
On ThunderX T88 pass 1 and pass 2 there is no hardware prefetching, so
we need to patch in explicit software prefetch instructions.
Prefetching improves this code by 60% over the original code, and by 2x
over the code without prefetching, on the affected hardware, as measured
with the benchmark code at https://github.com/apinski-cavium/copy_page_benchmark
Signed-off-by: Andrew Pinski <apinski@cavium.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/lib/copy_page.S | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 2534533ceb1d..4c1e700840b6 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -18,6 +18,8 @@
#include <linux/const.h>
#include <asm/assembler.h>
#include <asm/page.h>
+#include <asm/cpufeature.h>
+#include <asm/alternative.h>
/*
* Copy a page from src to dest (both are page aligned)
@@ -27,6 +29,15 @@
* x1 - src
*/
ENTRY(copy_page)
+alternative_if_not ARM64_HAS_NO_HW_PREFETCH
+ nop
+ nop
+alternative_else
+ # Prefetch two cache lines ahead.
+ prfm pldl1strm, [x1, #128]
+ prfm pldl1strm, [x1, #256]
+alternative_endif
+
ldp x2, x3, [x1]
ldp x4, x5, [x1, #16]
ldp x6, x7, [x1, #32]
@@ -41,6 +52,12 @@ ENTRY(copy_page)
1:
subs x18, x18, #128
+alternative_if_not ARM64_HAS_NO_HW_PREFETCH
+ nop
+alternative_else
+ prfm pldl1strm, [x1, #384]
+alternative_endif
+
stnp x2, x3, [x0]
ldp x2, x3, [x1]
stnp x4, x5, [x0, #16]
--
2.1.4
* [PATCH 0/4] copy_page improvements
2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
` (3 preceding siblings ...)
2016-02-02 12:46 ` [PATCH 4/4] arm64: lib: patch in prfm for copy_page if requested Will Deacon
@ 2016-02-05 17:53 ` Andrew Pinski
2016-02-09 11:55 ` Catalin Marinas
5 siblings, 0 replies; 7+ messages in thread
From: Andrew Pinski @ 2016-02-05 17:53 UTC (permalink / raw)
To: linux-arm-kernel
On 2/2/2016 4:46 AM, Will Deacon wrote:
> Hi all,
>
> This patch series aims to improve our copy_page implementation on both
> CPUs with and without hardware prefetchers. It is inspired by a previous
> patch series from Andrew, and one of the patches is lifted directly from
> there:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-January/399132.html
>
> Changes from that series:
>
> * Applies on top of mainline
> * Inverted the capability to the more generic ARM64_HAS_NO_HW_PREFETCH
> * Re-jigged the MIDR detection to try to keep it simple
>
> I've left copy_template alone for now, since the template really deals
> with 64 bytes per iteration, which would need changing.
Right, I had looked at doing that for newlib/glibc as well, but the
code is too complex.
>
> Andrew -- it would be helpful if you could take this for a spin on
> ThunderX to make sure it didn't screw anything up.
Yes, it works and I get the needed prefetch in copy_page.
Tested-by: Andrew Pinski <apinski@cavium.com>
Thanks,
Andrew
>
> Cheers,
>
> Will
>
> --->8
>
> Andrew Pinski (1):
> arm64: lib: patch in prfm for copy_page if requested
>
> Will Deacon (3):
> arm64: prefetch: don't provide spin_lock_prefetch with LSE
> arm64: prefetch: add alternative pattern for CPUs without a prefetcher
> arm64: lib: improve copy_page to deal with 128 bytes at a time
>
> arch/arm64/include/asm/cpufeature.h | 3 +-
> arch/arm64/include/asm/cputype.h | 17 +++++++++-
> arch/arm64/include/asm/processor.h | 7 +++--
> arch/arm64/kernel/cpu_errata.c | 18 ++---------
> arch/arm64/kernel/cpufeature.c | 17 ++++++++++
> arch/arm64/lib/copy_page.S | 63 ++++++++++++++++++++++++++++++++-----
> 6 files changed, 98 insertions(+), 27 deletions(-)
>
* [PATCH 0/4] copy_page improvements
2016-02-02 12:46 [PATCH 0/4] copy_page improvements Will Deacon
` (4 preceding siblings ...)
2016-02-05 17:53 ` [PATCH 0/4] copy_page improvements Andrew Pinski
@ 2016-02-09 11:55 ` Catalin Marinas
5 siblings, 0 replies; 7+ messages in thread
From: Catalin Marinas @ 2016-02-09 11:55 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Feb 02, 2016 at 12:46:22PM +0000, Will Deacon wrote:
> Andrew Pinski (1):
> arm64: lib: patch in prfm for copy_page if requested
>
> Will Deacon (3):
> arm64: prefetch: don't provide spin_lock_prefetch with LSE
> arm64: prefetch: add alternative pattern for CPUs without a prefetcher
> arm64: lib: improve copy_page to deal with 128 bytes at a time
Patches queued for 4.6. Thanks.
--
Catalin