* [PATCH 0/6] Add support for pldw instruction on v7 MP cores
@ 2013-07-23 11:36 Will Deacon
  2013-07-23 11:36 ` [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber Will Deacon
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Will Deacon @ 2013-07-23 11:36 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

This patch series adds support for the pldw instruction, which is
implemented by ARMv7 SMP processors. The instruction is used to implement
the prefetchw macro, which is then used around barrier-less ldrex/strex
sections (i.e. where no memory barrier sits between the prefetch and the
exclusive accesses) to increase the chances of the strex succeeding on its
first attempt.
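
To illustrate the shape of the change, atomic_add() ends up looking roughly
like this after the series (sketch only; the actual hunks are in patches
4-6):

	static inline void atomic_add(int i, atomic_t *v)
	{
		unsigned long tmp;
		int result;

		prefetchw((const void *)&v->counter);	/* pldw on v7 MP */
		__asm__ __volatile__("@ atomic_add\n"
	"1:	ldrex	%0, [%3]\n"
	"	add	%0, %0, %4\n"
	"	strex	%1, %0, [%3]\n"
	"	teq	%1, #0\n"
	"	bne	1b"
		: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
		: "r" (&v->counter), "Ir" (i)
		: "cc");
	}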

On my TC2 board (A7s + A15s), I see a ~1.3% boost in hackbench scores
on top of my barriers patch series.

All feedback welcome,

Will


Will Deacon (6):
  ARM: prefetch: remove redundant "cc" clobber
  ARM: smp_on_up: move inline asm ALT_SMP patching macro out of
    spinlock.h
  ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs
  ARM: locks: prefetch the destination word for write prior to strex
  ARM: atomics: prefetch the destination word for write prior to strex
  ARM: bitops: prefetch the destination word for write prior to strex

 arch/arm/include/asm/atomic.h    |  7 +++++++
 arch/arm/include/asm/processor.h | 33 +++++++++++++++++++++++++--------
 arch/arm/include/asm/spinlock.h  | 24 ++++++++++++------------
 arch/arm/include/asm/unified.h   |  4 ++++
 arch/arm/lib/bitops.h            |  5 +++++
 5 files changed, 53 insertions(+), 20 deletions(-)

-- 
1.8.2.2


* [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber
  2013-07-23 11:36 [PATCH 0/6] Add support for pldw instruction on v7 MP cores Will Deacon
@ 2013-07-23 11:36 ` Will Deacon
  2013-07-23 19:48   ` Nicolas Pitre
  2013-07-23 11:36 ` [PATCH 2/6] ARM: smp_on_up: move inline asm ALT_SMP patching macro out of spinlock.h Will Deacon
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-23 11:36 UTC (permalink / raw)
  To: linux-arm-kernel

The pld instruction does not affect the condition flags, so don't bother
clobbering them.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/processor.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index 06e7d50..91cfe08 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -101,9 +101,7 @@ static inline void prefetch(const void *ptr)
 {
 	__asm__ __volatile__(
 		"pld\t%a0"
-		:
-		: "p" (ptr)
-		: "cc");
+		:: "p" (ptr));
 }
 
 #define ARCH_HAS_PREFETCHW
-- 
1.8.2.2


* [PATCH 2/6] ARM: smp_on_up: move inline asm ALT_SMP patching macro out of spinlock.h
  2013-07-23 11:36 [PATCH 0/6] Add support for pldw instruction on v7 MP cores Will Deacon
  2013-07-23 11:36 ` [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber Will Deacon
@ 2013-07-23 11:36 ` Will Deacon
  2013-07-23 19:55   ` Nicolas Pitre
  2013-07-23 11:36 ` [PATCH 3/6] ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs Will Deacon
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-23 11:36 UTC (permalink / raw)
  To: linux-arm-kernel

Patching UP/SMP alternatives inside inline assembly blocks is useful
outside of the spinlock implementation, where it is used for sev and wfe.

This patch lifts the macro into processor.h and gives it a scarier name
to (a) avoid conflicts in the global namespace and (b) try to deter its
usage unless you "know what you're doing". The W macro for generating
wide instructions when targeting Thumb-2 is also made available under
the name WASM, to reduce the potential for conflicts with other headers.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/processor.h | 12 ++++++++++++
 arch/arm/include/asm/spinlock.h  | 15 ++++-----------
 arch/arm/include/asm/unified.h   |  4 ++++
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index 91cfe08..cbdb130 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -22,6 +22,7 @@
 #include <asm/hw_breakpoint.h>
 #include <asm/ptrace.h>
 #include <asm/types.h>
+#include <asm/unified.h>
 
 #ifdef __KERNEL__
 #define STACK_TOP	((current->personality & ADDR_LIMIT_32BIT) ? \
@@ -91,6 +92,17 @@ unsigned long get_wchan(struct task_struct *p);
 #define KSTK_EIP(tsk)	task_pt_regs(tsk)->ARM_pc
 #define KSTK_ESP(tsk)	task_pt_regs(tsk)->ARM_sp
 
+#ifdef CONFIG_SMP
+#define __ALT_SMP_ASM(smp, up)						\
+	"9998:	" smp "\n"						\
+	"	.pushsection \".alt.smp.init\", \"a\"\n"		\
+	"	.long	9998b\n"					\
+	"	" up "\n"						\
+	"	.popsection\n"
+#else
+#define __ALT_SMP_ASM(smp, up)	up
+#endif
+
 /*
  * Prefetching support - only ARMv5.
  */
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index f8b8965..0de7bec 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -11,15 +11,7 @@
  * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
  * extensions, so when running on UP, we have to patch these instructions away.
  */
-#define ALT_SMP(smp, up)					\
-	"9998:	" smp "\n"					\
-	"	.pushsection \".alt.smp.init\", \"a\"\n"	\
-	"	.long	9998b\n"				\
-	"	" up "\n"					\
-	"	.popsection\n"
-
 #ifdef CONFIG_THUMB2_KERNEL
-#define SEV		ALT_SMP("sev.w", "nop.w")
 /*
  * For Thumb-2, special care is needed to ensure that the conditional WFE
  * instruction really does assemble to exactly 4 bytes (as required by
@@ -31,17 +23,18 @@
  * the assembler won't change IT instructions which are explicitly present
  * in the input.
  */
-#define WFE(cond)	ALT_SMP(		\
+#define WFE(cond)	__ALT_SMP_ASM(		\
 	"it " cond "\n\t"			\
 	"wfe" cond ".n",			\
 						\
 	"nop.w"					\
 )
 #else
-#define SEV		ALT_SMP("sev", "nop")
-#define WFE(cond)	ALT_SMP("wfe" cond, "nop")
+#define WFE(cond)	__ALT_SMP_ASM("wfe" cond, "nop")
 #endif
 
+#define SEV		__ALT_SMP_ASM(WASM(sev), WASM(nop))
+
 static inline void dsb_sev(void)
 {
 #if __LINUX_ARM_ARCH__ >= 7
diff --git a/arch/arm/include/asm/unified.h b/arch/arm/include/asm/unified.h
index f5989f4..b88beab 100644
--- a/arch/arm/include/asm/unified.h
+++ b/arch/arm/include/asm/unified.h
@@ -38,6 +38,8 @@
 #ifdef __ASSEMBLY__
 #define W(instr)	instr.w
 #define BSYM(sym)	sym + 1
+#else
+#define WASM(instr)	#instr ".w"
 #endif
 
 #else	/* !CONFIG_THUMB2_KERNEL */
@@ -50,6 +52,8 @@
 #ifdef __ASSEMBLY__
 #define W(instr)	instr
 #define BSYM(sym)	sym
+#else
+#define WASM(instr)	#instr
 #endif
 
 #endif	/* CONFIG_THUMB2_KERNEL */
-- 
1.8.2.2


* [PATCH 3/6] ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs
  2013-07-23 11:36 [PATCH 0/6] Add support for pldw instruction on v7 MP cores Will Deacon
  2013-07-23 11:36 ` [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber Will Deacon
  2013-07-23 11:36 ` [PATCH 2/6] ARM: smp_on_up: move inline asm ALT_SMP patching macro out of spinlock.h Will Deacon
@ 2013-07-23 11:36 ` Will Deacon
  2013-07-23 20:05   ` Nicolas Pitre
  2013-07-23 11:36 ` [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex Will Deacon
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-23 11:36 UTC (permalink / raw)
  To: linux-arm-kernel

SMP ARMv7 CPUs implement the pldw instruction, which allows them to
prefetch data cachelines in an exclusive state.

This patch defines the prefetchw macro using pldw for CPUs that support
it.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/processor.h | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index cbdb130..d571697 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -116,12 +116,19 @@ static inline void prefetch(const void *ptr)
 		:: "p" (ptr));
 }
 
+#if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
 #define ARCH_HAS_PREFETCHW
-#define prefetchw(ptr)	prefetch(ptr)
-
-#define ARCH_HAS_SPINLOCK_PREFETCH
-#define spin_lock_prefetch(x) do { } while (0)
-
+static inline void prefetchw(const void *ptr)
+{
+	__asm__ __volatile__(
+		".arch_extension	mp\n"
+	__ALT_SMP_ASM(
+		WASM(pldw)		"\t%a0",
+		WASM(pld)		"\t%a0"
+	)
+		:: "p" (ptr));
+}
+#endif
 #endif
 
 #define HAVE_ARCH_PICK_MMAP_LAYOUT
-- 
1.8.2.2


* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-23 11:36 [PATCH 0/6] Add support for pldw instruction on v7 MP cores Will Deacon
                   ` (2 preceding siblings ...)
  2013-07-23 11:36 ` [PATCH 3/6] ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs Will Deacon
@ 2013-07-23 11:36 ` Will Deacon
  2013-07-23 20:10   ` Nicolas Pitre
  2013-07-23 11:36 ` [PATCH 5/6] ARM: atomics: " Will Deacon
  2013-07-23 11:36 ` [PATCH 6/6] ARM: bitops: " Will Deacon
  5 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-23 11:36 UTC (permalink / raw)
  To: linux-arm-kernel

The cost of changing a cacheline from shared to exclusive state can be
significant, especially when this is triggered by an exclusive store,
since it may result in having to retry the transaction.

This patch prefixes our {spin,read,write}_[try]lock implementations with
pldw instructions (on CPUs which support them) to try and grab the line
in exclusive state from the start.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/spinlock.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 0de7bec..3e1cc9d 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -5,7 +5,7 @@
 #error SMP not supported on pre-ARMv6 CPUs
 #endif
 
-#include <asm/processor.h>
+#include <linux/prefetch.h>
 
 /*
  * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
@@ -70,6 +70,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 	u32 newval;
 	arch_spinlock_t lockval;
 
+	prefetchw((const void *)&lock->slock);
 	__asm__ __volatile__(
 "1:	ldrex	%0, [%3]\n"
 "	add	%1, %0, %4\n"
@@ -93,6 +94,7 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 	unsigned long contended, res;
 	u32 slock;
 
+	prefetchw((const void *)&lock->slock);
 	do {
 		__asm__ __volatile__(
 		"	ldrex	%0, [%3]\n"
@@ -145,6 +147,7 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
 {
 	unsigned long tmp;
 
+	prefetchw((const void *)&rw->lock);
 	__asm__ __volatile__(
 "1:	ldrex	%0, [%1]\n"
 "	teq	%0, #0\n"
@@ -163,6 +166,7 @@ static inline int arch_write_trylock(arch_rwlock_t *rw)
 {
 	unsigned long tmp;
 
+	prefetchw((const void *)&rw->lock);
 	__asm__ __volatile__(
 "	ldrex	%0, [%1]\n"
 "	teq	%0, #0\n"
@@ -211,6 +215,7 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
 {
 	unsigned long tmp, tmp2;
 
+	prefetchw((const void *)&rw->lock);
 	__asm__ __volatile__(
 "1:	ldrex	%0, [%2]\n"
 "	adds	%0, %0, #1\n"
@@ -231,6 +236,7 @@ static inline void arch_read_unlock(arch_rwlock_t *rw)
 
 	smp_mb();
 
+	prefetchw((const void *)&rw->lock);
 	__asm__ __volatile__(
 "1:	ldrex	%0, [%2]\n"
 "	sub	%0, %0, #1\n"
@@ -249,6 +255,7 @@ static inline int arch_read_trylock(arch_rwlock_t *rw)
 {
 	unsigned long tmp, tmp2 = 1;
 
+	prefetchw((const void *)&rw->lock);
 	__asm__ __volatile__(
 "	ldrex	%0, [%2]\n"
 "	adds	%0, %0, #1\n"
-- 
1.8.2.2


* [PATCH 5/6] ARM: atomics: prefetch the destination word for write prior to strex
  2013-07-23 11:36 [PATCH 0/6] Add support for pldw instruction on v7 MP cores Will Deacon
                   ` (3 preceding siblings ...)
  2013-07-23 11:36 ` [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex Will Deacon
@ 2013-07-23 11:36 ` Will Deacon
  2013-07-23 20:12   ` Nicolas Pitre
  2013-07-23 11:36 ` [PATCH 6/6] ARM: bitops: " Will Deacon
  5 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-23 11:36 UTC (permalink / raw)
  To: linux-arm-kernel

The cost of changing a cacheline from shared to exclusive state can be
significant, especially when this is triggered by an exclusive store,
since it may result in having to retry the transaction.

This patch prefixes our atomic access implementations with pldw
instructions (on CPUs which support them) to try and grab the line in
exclusive state from the start. Only the barrier-less functions are
updated, since memory barriers can limit the usefulness of prefetching
data.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/atomic.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h
index da1c77d..e3f052a 100644
--- a/arch/arm/include/asm/atomic.h
+++ b/arch/arm/include/asm/atomic.h
@@ -12,6 +12,7 @@
 #define __ASM_ARM_ATOMIC_H
 
 #include <linux/compiler.h>
+#include <linux/prefetch.h>
 #include <linux/types.h>
 #include <linux/irqflags.h>
 #include <asm/barrier.h>
@@ -41,6 +42,7 @@ static inline void atomic_add(int i, atomic_t *v)
 	unsigned long tmp;
 	int result;
 
+	prefetchw((const void *)&v->counter);
 	__asm__ __volatile__("@ atomic_add\n"
 "1:	ldrex	%0, [%3]\n"
 "	add	%0, %0, %4\n"
@@ -79,6 +81,7 @@ static inline void atomic_sub(int i, atomic_t *v)
 	unsigned long tmp;
 	int result;
 
+	prefetchw((const void *)&v->counter);
 	__asm__ __volatile__("@ atomic_sub\n"
 "1:	ldrex	%0, [%3]\n"
 "	sub	%0, %0, %4\n"
@@ -138,6 +141,7 @@ static inline void atomic_clear_mask(unsigned long mask, unsigned long *addr)
 {
 	unsigned long tmp, tmp2;
 
+	prefetchw((const void *)addr);
 	__asm__ __volatile__("@ atomic_clear_mask\n"
 "1:	ldrex	%0, [%3]\n"
 "	bic	%0, %0, %4\n"
@@ -283,6 +287,7 @@ static inline void atomic64_set(atomic64_t *v, u64 i)
 {
 	u64 tmp;
 
+	prefetchw((const void *)&v->counter);
 	__asm__ __volatile__("@ atomic64_set\n"
 "1:	ldrexd	%0, %H0, [%2]\n"
 "	strexd	%0, %3, %H3, [%2]\n"
@@ -299,6 +304,7 @@ static inline void atomic64_add(u64 i, atomic64_t *v)
 	u64 result;
 	unsigned long tmp;
 
+	prefetchw((const void *)&v->counter);
 	__asm__ __volatile__("@ atomic64_add\n"
 "1:	ldrexd	%0, %H0, [%3]\n"
 "	adds	%0, %0, %4\n"
@@ -339,6 +345,7 @@ static inline void atomic64_sub(u64 i, atomic64_t *v)
 	u64 result;
 	unsigned long tmp;
 
+	prefetchw((const void *)&v->counter);
 	__asm__ __volatile__("@ atomic64_sub\n"
 "1:	ldrexd	%0, %H0, [%3]\n"
 "	subs	%0, %0, %4\n"
-- 
1.8.2.2


* [PATCH 6/6] ARM: bitops: prefetch the destination word for write prior to strex
  2013-07-23 11:36 [PATCH 0/6] Add support for pldw instruction on v7 MP cores Will Deacon
                   ` (4 preceding siblings ...)
  2013-07-23 11:36 ` [PATCH 5/6] ARM: atomics: " Will Deacon
@ 2013-07-23 11:36 ` Will Deacon
  2013-07-23 20:15   ` Nicolas Pitre
  5 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-23 11:36 UTC (permalink / raw)
  To: linux-arm-kernel

The cost of changing a cacheline from shared to exclusive state can be
significant, especially when this is triggered by an exclusive store,
since it may result in having to retry the transaction.

This patch prefixes our atomic bitops implementation with prefetchw,
to try and grab the line in exclusive state from the start. The testop
macro is left alone, since the barrier semantics limit the usefulness
of prefetching data.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/lib/bitops.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm/lib/bitops.h b/arch/arm/lib/bitops.h
index d6408d1..e0c68d5 100644
--- a/arch/arm/lib/bitops.h
+++ b/arch/arm/lib/bitops.h
@@ -10,6 +10,11 @@ UNWIND(	.fnstart	)
 	and	r3, r0, #31		@ Get bit offset
 	mov	r0, r0, lsr #5
 	add	r1, r1, r0, lsl #2	@ Get word offset
+#if __LINUX_ARM_ARCH__ >= 7
+	.arch_extension	mp
+	ALT_SMP(W(pldw)	[r1])
+	ALT_UP(W(nop))
+#endif
 	mov	r3, r2, lsl r3
 1:	ldrex	r2, [r1]
 	\instr	r2, r2, r3
-- 
1.8.2.2


* [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber
  2013-07-23 11:36 ` [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber Will Deacon
@ 2013-07-23 19:48   ` Nicolas Pitre
  2013-07-24 10:19     ` Will Deacon
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-23 19:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 23 Jul 2013, Will Deacon wrote:

> The pld instruction does not affect the condition flags, so don't bother
> clobbering them.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Acked-by: Nicolas Pitre <nico@linaro.org>

You know the legacy reason why the cc clobber was there, right?
It certainly doesn't apply anymore.



> ---
>  arch/arm/include/asm/processor.h | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
> index 06e7d50..91cfe08 100644
> --- a/arch/arm/include/asm/processor.h
> +++ b/arch/arm/include/asm/processor.h
> @@ -101,9 +101,7 @@ static inline void prefetch(const void *ptr)
>  {
>  	__asm__ __volatile__(
>  		"pld\t%a0"
> -		:
> -		: "p" (ptr)
> -		: "cc");
> +		:: "p" (ptr));
>  }
>  
>  #define ARCH_HAS_PREFETCHW
> -- 
> 1.8.2.2
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


* [PATCH 2/6] ARM: smp_on_up: move inline asm ALT_SMP patching macro out of spinlock.h
  2013-07-23 11:36 ` [PATCH 2/6] ARM: smp_on_up: move inline asm ALT_SMP patching macro out of spinlock.h Will Deacon
@ 2013-07-23 19:55   ` Nicolas Pitre
  0 siblings, 0 replies; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-23 19:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 23 Jul 2013, Will Deacon wrote:

> Patching UP/SMP alternatives inside inline assembly blocks is useful
> outside of the spinlock implementation, where it is used for sev and wfe.
> 
> This patch lifts the macro into processor.h and gives it a scarier name
> to (a) avoid conflicts in the global namespace and (b) try to deter its
> usage unless you "know what you're doing". The W macro for generating
> wide instructions when targeting Thumb-2 is also made available under
> the name WASM, to reduce the potential for conflicts with other headers.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Acked-by: Nicolas Pitre <nico@linaro.org>

> ---
>  arch/arm/include/asm/processor.h | 12 ++++++++++++
>  arch/arm/include/asm/spinlock.h  | 15 ++++-----------
>  arch/arm/include/asm/unified.h   |  4 ++++
>  3 files changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
> index 91cfe08..cbdb130 100644
> --- a/arch/arm/include/asm/processor.h
> +++ b/arch/arm/include/asm/processor.h
> @@ -22,6 +22,7 @@
>  #include <asm/hw_breakpoint.h>
>  #include <asm/ptrace.h>
>  #include <asm/types.h>
> +#include <asm/unified.h>
>  
>  #ifdef __KERNEL__
>  #define STACK_TOP	((current->personality & ADDR_LIMIT_32BIT) ? \
> @@ -91,6 +92,17 @@ unsigned long get_wchan(struct task_struct *p);
>  #define KSTK_EIP(tsk)	task_pt_regs(tsk)->ARM_pc
>  #define KSTK_ESP(tsk)	task_pt_regs(tsk)->ARM_sp
>  
> +#ifdef CONFIG_SMP
> +#define __ALT_SMP_ASM(smp, up)						\
> +	"9998:	" smp "\n"						\
> +	"	.pushsection \".alt.smp.init\", \"a\"\n"		\
> +	"	.long	9998b\n"					\
> +	"	" up "\n"						\
> +	"	.popsection\n"
> +#else
> +#define __ALT_SMP_ASM(smp, up)	up
> +#endif
> +
>  /*
>   * Prefetching support - only ARMv5.
>   */
> diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
> index f8b8965..0de7bec 100644
> --- a/arch/arm/include/asm/spinlock.h
> +++ b/arch/arm/include/asm/spinlock.h
> @@ -11,15 +11,7 @@
>   * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
>   * extensions, so when running on UP, we have to patch these instructions away.
>   */
> -#define ALT_SMP(smp, up)					\
> -	"9998:	" smp "\n"					\
> -	"	.pushsection \".alt.smp.init\", \"a\"\n"	\
> -	"	.long	9998b\n"				\
> -	"	" up "\n"					\
> -	"	.popsection\n"
> -
>  #ifdef CONFIG_THUMB2_KERNEL
> -#define SEV		ALT_SMP("sev.w", "nop.w")
>  /*
>   * For Thumb-2, special care is needed to ensure that the conditional WFE
>   * instruction really does assemble to exactly 4 bytes (as required by
> @@ -31,17 +23,18 @@
>   * the assembler won't change IT instructions which are explicitly present
>   * in the input.
>   */
> -#define WFE(cond)	ALT_SMP(		\
> +#define WFE(cond)	__ALT_SMP_ASM(		\
>  	"it " cond "\n\t"			\
>  	"wfe" cond ".n",			\
>  						\
>  	"nop.w"					\
>  )
>  #else
> -#define SEV		ALT_SMP("sev", "nop")
> -#define WFE(cond)	ALT_SMP("wfe" cond, "nop")
> +#define WFE(cond)	__ALT_SMP_ASM("wfe" cond, "nop")
>  #endif
>  
> +#define SEV		__ALT_SMP_ASM(WASM(sev), WASM(nop))
> +
>  static inline void dsb_sev(void)
>  {
>  #if __LINUX_ARM_ARCH__ >= 7
> diff --git a/arch/arm/include/asm/unified.h b/arch/arm/include/asm/unified.h
> index f5989f4..b88beab 100644
> --- a/arch/arm/include/asm/unified.h
> +++ b/arch/arm/include/asm/unified.h
> @@ -38,6 +38,8 @@
>  #ifdef __ASSEMBLY__
>  #define W(instr)	instr.w
>  #define BSYM(sym)	sym + 1
> +#else
> +#define WASM(instr)	#instr ".w"
>  #endif
>  
>  #else	/* !CONFIG_THUMB2_KERNEL */
> @@ -50,6 +52,8 @@
>  #ifdef __ASSEMBLY__
>  #define W(instr)	instr
>  #define BSYM(sym)	sym
> +#else
> +#define WASM(instr)	#instr
>  #endif
>  
>  #endif	/* CONFIG_THUMB2_KERNEL */
> -- 
> 1.8.2.2
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


* [PATCH 3/6] ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs
  2013-07-23 11:36 ` [PATCH 3/6] ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs Will Deacon
@ 2013-07-23 20:05   ` Nicolas Pitre
  2013-07-24 10:42     ` Will Deacon
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-23 20:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 23 Jul 2013, Will Deacon wrote:

> SMP ARMv7 CPUs implement the pldw instruction, which allows them to
> prefetch data cachelines in an exclusive state.
> 
> This patch defines the prefetchw macro using pldw for CPUs that support
> it.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Minor nit below.  Otherwise:

Acked-by: Nicolas Pitre <nico@linaro.org>

> ---
>  arch/arm/include/asm/processor.h | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
> index cbdb130..d571697 100644
> --- a/arch/arm/include/asm/processor.h
> +++ b/arch/arm/include/asm/processor.h
> @@ -116,12 +116,19 @@ static inline void prefetch(const void *ptr)
>  		:: "p" (ptr));
>  }
>  
> +#if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
>  #define ARCH_HAS_PREFETCHW
> -#define prefetchw(ptr)	prefetch(ptr)
> -
> -#define ARCH_HAS_SPINLOCK_PREFETCH
> -#define spin_lock_prefetch(x) do { } while (0)
> -

What about keeping the above definitions when __LINUX_ARM_ARCH__ >= 7
&& defined(CONFIG_SMP) is not true?

> +static inline void prefetchw(const void *ptr)
> +{
> +	__asm__ __volatile__(
> +		".arch_extension	mp\n"
> +	__ALT_SMP_ASM(
> +		WASM(pldw)		"\t%a0",
> +		WASM(pld)		"\t%a0"
> +	)

The paren indentation looks odd.  If you shift the whole __ALT_SMP_ASM 
block right then it is also less likely to look like a sequential 
execution of pldw followed by pld to the casual reader.

> +		:: "p" (ptr));
> +}
> +#endif
>  #endif
>  
>  #define HAVE_ARCH_PICK_MMAP_LAYOUT
> -- 
> 1.8.2.2
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-23 11:36 ` [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex Will Deacon
@ 2013-07-23 20:10   ` Nicolas Pitre
  2013-07-24 11:18     ` Will Deacon
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-23 20:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 23 Jul 2013, Will Deacon wrote:

> The cost of changing a cacheline from shared to exclusive state can be
> significant, especially when this is triggered by an exclusive store,
> since it may result in having to retry the transaction.
> 
> This patch prefixes our {spin,read,write}_[try]lock implementations with
> pldw instructions (on CPUs which support them) to try and grab the line
> in exclusive state from the start.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm/include/asm/spinlock.h | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
> index 0de7bec..3e1cc9d 100644
> --- a/arch/arm/include/asm/spinlock.h
> +++ b/arch/arm/include/asm/spinlock.h
> @@ -5,7 +5,7 @@
>  #error SMP not supported on pre-ARMv6 CPUs
>  #endif
>  
> -#include <asm/processor.h>
> +#include <linux/prefetch.h>
>  
>  /*
>   * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
> @@ -70,6 +70,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  	u32 newval;
>  	arch_spinlock_t lockval;
>  
> +	prefetchw((const void *)&lock->slock);

Couldn't that cast be carried in the definition of prefetchw() instead?


Other than that:

Acked-by: Nicolas Pitre <nico@linaro.org>





>  	__asm__ __volatile__(
>  "1:	ldrex	%0, [%3]\n"
>  "	add	%1, %0, %4\n"
> @@ -93,6 +94,7 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
>  	unsigned long contended, res;
>  	u32 slock;
>  
> +	prefetchw((const void *)&lock->slock);
>  	do {
>  		__asm__ __volatile__(
>  		"	ldrex	%0, [%3]\n"
> @@ -145,6 +147,7 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
>  {
>  	unsigned long tmp;
>  
> +	prefetchw((const void *)&rw->lock);
>  	__asm__ __volatile__(
>  "1:	ldrex	%0, [%1]\n"
>  "	teq	%0, #0\n"
> @@ -163,6 +166,7 @@ static inline int arch_write_trylock(arch_rwlock_t *rw)
>  {
>  	unsigned long tmp;
>  
> +	prefetchw((const void *)&rw->lock);
>  	__asm__ __volatile__(
>  "	ldrex	%0, [%1]\n"
>  "	teq	%0, #0\n"
> @@ -211,6 +215,7 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
>  {
>  	unsigned long tmp, tmp2;
>  
> +	prefetchw((const void *)&rw->lock);
>  	__asm__ __volatile__(
>  "1:	ldrex	%0, [%2]\n"
>  "	adds	%0, %0, #1\n"
> @@ -231,6 +236,7 @@ static inline void arch_read_unlock(arch_rwlock_t *rw)
>  
>  	smp_mb();
>  
> +	prefetchw((const void *)&rw->lock);
>  	__asm__ __volatile__(
>  "1:	ldrex	%0, [%2]\n"
>  "	sub	%0, %0, #1\n"
> @@ -249,6 +255,7 @@ static inline int arch_read_trylock(arch_rwlock_t *rw)
>  {
>  	unsigned long tmp, tmp2 = 1;
>  
> +	prefetchw((const void *)&rw->lock);
>  	__asm__ __volatile__(
>  "	ldrex	%0, [%2]\n"
>  "	adds	%0, %0, #1\n"
> -- 
> 1.8.2.2
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


* [PATCH 5/6] ARM: atomics: prefetch the destination word for write prior to strex
  2013-07-23 11:36 ` [PATCH 5/6] ARM: atomics: " Will Deacon
@ 2013-07-23 20:12   ` Nicolas Pitre
  0 siblings, 0 replies; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-23 20:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 23 Jul 2013, Will Deacon wrote:

> The cost of changing a cacheline from shared to exclusive state can be
> significant, especially when this is triggered by an exclusive store,
> since it may result in having to retry the transaction.
> 
> This patch prefixes our atomic access implementations with pldw
> instructions (on CPUs which support them) to try and grab the line in
> exclusive state from the start. Only the barrier-less functions are
> updated, since memory barriers can limit the usefulness of prefetching
> data.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm/include/asm/atomic.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h
> index da1c77d..e3f052a 100644
> --- a/arch/arm/include/asm/atomic.h
> +++ b/arch/arm/include/asm/atomic.h
> @@ -12,6 +12,7 @@
>  #define __ASM_ARM_ATOMIC_H
>  
>  #include <linux/compiler.h>
> +#include <linux/prefetch.h>
>  #include <linux/types.h>
>  #include <linux/irqflags.h>
>  #include <asm/barrier.h>
> @@ -41,6 +42,7 @@ static inline void atomic_add(int i, atomic_t *v)
>  	unsigned long tmp;
>  	int result;
>  
> +	prefetchw((const void *)&v->counter);

Same comment as for the previous patch wrt the cast.  Even more so given 
it is spreading all over the place now.






>  	__asm__ __volatile__("@ atomic_add\n"
>  "1:	ldrex	%0, [%3]\n"
>  "	add	%0, %0, %4\n"
> @@ -79,6 +81,7 @@ static inline void atomic_sub(int i, atomic_t *v)
>  	unsigned long tmp;
>  	int result;
>  
> +	prefetchw((const void *)&v->counter);
>  	__asm__ __volatile__("@ atomic_sub\n"
>  "1:	ldrex	%0, [%3]\n"
>  "	sub	%0, %0, %4\n"
> @@ -138,6 +141,7 @@ static inline void atomic_clear_mask(unsigned long mask, unsigned long *addr)
>  {
>  	unsigned long tmp, tmp2;
>  
> +	prefetchw((const void *)addr);
>  	__asm__ __volatile__("@ atomic_clear_mask\n"
>  "1:	ldrex	%0, [%3]\n"
>  "	bic	%0, %0, %4\n"
> @@ -283,6 +287,7 @@ static inline void atomic64_set(atomic64_t *v, u64 i)
>  {
>  	u64 tmp;
>  
> +	prefetchw((const void *)&v->counter);
>  	__asm__ __volatile__("@ atomic64_set\n"
>  "1:	ldrexd	%0, %H0, [%2]\n"
>  "	strexd	%0, %3, %H3, [%2]\n"
> @@ -299,6 +304,7 @@ static inline void atomic64_add(u64 i, atomic64_t *v)
>  	u64 result;
>  	unsigned long tmp;
>  
> +	prefetchw((const void *)&v->counter);
>  	__asm__ __volatile__("@ atomic64_add\n"
>  "1:	ldrexd	%0, %H0, [%3]\n"
>  "	adds	%0, %0, %4\n"
> @@ -339,6 +345,7 @@ static inline void atomic64_sub(u64 i, atomic64_t *v)
>  	u64 result;
>  	unsigned long tmp;
>  
> +	prefetchw((const void *)&v->counter);
>  	__asm__ __volatile__("@ atomic64_sub\n"
>  "1:	ldrexd	%0, %H0, [%3]\n"
>  "	subs	%0, %0, %4\n"
> -- 
> 1.8.2.2
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


* [PATCH 6/6] ARM: bitops: prefetch the destination word for write prior to strex
  2013-07-23 11:36 ` [PATCH 6/6] ARM: bitops: " Will Deacon
@ 2013-07-23 20:15   ` Nicolas Pitre
  0 siblings, 0 replies; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-23 20:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 23 Jul 2013, Will Deacon wrote:

> The cost of changing a cacheline from shared to exclusive state can be
> significant, especially when this is triggered by an exclusive store,
> since it may result in having to retry the transaction.
> 
> This patch prefixes our atomic bitops implementation with prefetchw,
> to try and grab the line in exclusive state from the start. The testop
> macro is left alone, since the barrier semantics limit the usefulness
> of prefetching data.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Acked-by: Nicolas Pitre <nico@linaro.org>

> ---
>  arch/arm/lib/bitops.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/arm/lib/bitops.h b/arch/arm/lib/bitops.h
> index d6408d1..e0c68d5 100644
> --- a/arch/arm/lib/bitops.h
> +++ b/arch/arm/lib/bitops.h
> @@ -10,6 +10,11 @@ UNWIND(	.fnstart	)
>  	and	r3, r0, #31		@ Get bit offset
>  	mov	r0, r0, lsr #5
>  	add	r1, r1, r0, lsl #2	@ Get word offset
> +#if __LINUX_ARM_ARCH__ >= 7
> +	.arch_extension	mp
> +	ALT_SMP(W(pldw)	[r1])
> +	ALT_UP(W(nop))
> +#endif
>  	mov	r3, r2, lsl r3
>  1:	ldrex	r2, [r1]
>  	\instr	r2, r2, r3
> -- 
> 1.8.2.2
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


* [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber
  2013-07-23 19:48   ` Nicolas Pitre
@ 2013-07-24 10:19     ` Will Deacon
  2013-07-24 16:16       ` Nicolas Pitre
  0 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-24 10:19 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Nicolas,

On Tue, Jul 23, 2013 at 08:48:11PM +0100, Nicolas Pitre wrote:
> On Tue, 23 Jul 2013, Will Deacon wrote:
> 
> > The pld instruction does not affect the condition flags, so don't bother
> > clobbering them.
> > 
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> 
> Acked-by: Nicolas Pitre <nico@linaro.org>

Thanks for reviewing the series, much appreciated.

> You know the legacy reason why the cc clobber was there, right?
> It certainly doesn't apply anymore.

I was under the impression that it was due to ancient GCC behaviour and the "cc"
was there to prevent re-ordering. For interest, what are the specifics?

Cheers,

Will


* [PATCH 3/6] ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs
  2013-07-23 20:05   ` Nicolas Pitre
@ 2013-07-24 10:42     ` Will Deacon
  0 siblings, 0 replies; 30+ messages in thread
From: Will Deacon @ 2013-07-24 10:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jul 23, 2013 at 09:05:58PM +0100, Nicolas Pitre wrote:
> On Tue, 23 Jul 2013, Will Deacon wrote:
> 
> > SMP ARMv7 CPUs implement the pldw instruction, which allows them to
> > prefetch data cachelines in an exclusive state.
> > 
> > This patch defines the prefetchw macro using pldw for CPUs that support
> > it.
> > 
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> 
> Minor nit below.  Otherwise:
> 
> Acked-by: Nicolas Pitre <nico@linaro.org>
> 
> > ---
> >  arch/arm/include/asm/processor.h | 17 ++++++++++++-----
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
> > index cbdb130..d571697 100644
> > --- a/arch/arm/include/asm/processor.h
> > +++ b/arch/arm/include/asm/processor.h
> > @@ -116,12 +116,19 @@ static inline void prefetch(const void *ptr)
> >  		:: "p" (ptr));
> >  }
> >  
> > +#if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
> >  #define ARCH_HAS_PREFETCHW
> > -#define prefetchw(ptr)	prefetch(ptr)
> > -
> > -#define ARCH_HAS_SPINLOCK_PREFETCH
> > -#define spin_lock_prefetch(x) do { } while (0)
> > -
> 
> What about keeping the above definitions when __LINUX_ARM_ARCH__ >= 7
> && defined(CONFIG_SMP) is not true?

If we don't have prefetchw, then spin_lock_prefetch ends up being defined
as __builtin_prefetch(x,1) by linux/prefetch.h. GCC will
almost certainly spit out `pld' for that (it currently ignores the second
argument...), which should be better than doing nothing. The cacheline may not
be in the ideal state (shared rather than exclusive), but at least it's likely
to be in the cache rather than sitting out in memory.
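
For reference, the relevant fallbacks in linux/prefetch.h look roughly like
this (quoting from memory):

	#ifndef ARCH_HAS_PREFETCHW
	#define prefetchw(x)		__builtin_prefetch(x, 1)
	#endif

	#ifndef ARCH_HAS_SPINLOCK_PREFETCH
	#define spin_lock_prefetch(x)	prefetchw(x)
	#endif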

Unfortunately, spin_lock_prefetch is only used in one place, so changes to
its implementation don't have any measurable difference on kernel
performance.

> > +static inline void prefetchw(const void *ptr)
> > +{
> > +	__asm__ __volatile__(
> > +		".arch_extension	mp\n"
> > +	__ALT_SMP_ASM(
> > +		WASM(pldw)		"\t%a0",
> > +		WASM(pld)		"\t%a0"
> > +	)
> 
> The paren indentation looks odd.  If you shift the whole __ALT_SMP_ASM 
> block right then it is also less likely to look like a sequential 
> execution of pldw followed by pld to the casual reader.

Sure, I'll fix that up.

Thanks again for your feedback,

Will


* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-23 20:10   ` Nicolas Pitre
@ 2013-07-24 11:18     ` Will Deacon
  2013-07-25 17:31       ` Stephen Boyd
  0 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-24 11:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jul 23, 2013 at 09:10:33PM +0100, Nicolas Pitre wrote:
> On Tue, 23 Jul 2013, Will Deacon wrote:
> 
> > The cost of changing a cacheline from shared to exclusive state can be
> > significant, especially when this is triggered by an exclusive store,
> > since it may result in having to retry the transaction.
> > 
> > This patch prefixes our {spin,read,write}_[try]lock implementations with
> > pldw instructions (on CPUs which support them) to try and grab the line
> > in exclusive state from the start.
> > 
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  arch/arm/include/asm/spinlock.h | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
> > index 0de7bec..3e1cc9d 100644
> > --- a/arch/arm/include/asm/spinlock.h
> > +++ b/arch/arm/include/asm/spinlock.h
> > @@ -5,7 +5,7 @@
> >  #error SMP not supported on pre-ARMv6 CPUs
> >  #endif
> >  
> > -#include <asm/processor.h>
> > +#include <linux/prefetch.h>
> >  
> >  /*
> >   * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
> > @@ -70,6 +70,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> >  	u32 newval;
> >  	arch_spinlock_t lockval;
> >  
> > +	prefetchw((const void *)&lock->slock);
> 
> Couldn't that cast be carried in the definition of prefetchw() instead?

I think that would mean implementing prefetchw as a macro rather than an
inline function, since the core code expects to pass a const pointer and GCC
gets angry if the type signatures don't match.

I'll have a go at doing that for v2.
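
Something along these lines, I imagine (untested, and __prefetchw is just a
made-up name for the helper):

	static inline void __prefetchw(const void *ptr)
	{
		__asm__ __volatile__(
			".arch_extension	mp\n"
			__ALT_SMP_ASM(WASM(pldw) "\t%a0", WASM(pld) "\t%a0")
			:: "p" (ptr));
	}

	#define prefetchw(p)	__prefetchw((const void *)(p))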

Cheers,

Will


* [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber
  2013-07-24 10:19     ` Will Deacon
@ 2013-07-24 16:16       ` Nicolas Pitre
  2013-07-24 16:35         ` Will Deacon
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-24 16:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 24 Jul 2013, Will Deacon wrote:

> Hi Nicolas,
> 
> On Tue, Jul 23, 2013 at 08:48:11PM +0100, Nicolas Pitre wrote:
> > On Tue, 23 Jul 2013, Will Deacon wrote:
> > 
> > > The pld instruction does not affect the condition flags, so don't bother
> > > clobbering them.
> > > 
> > > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > 
> > Acked-by: Nicolas Pitre <nico@linaro.org>
> 
> Thanks for reviewing the series, much appreciated.
> 
> > You know the legacy reason why the cc clobber was there, right?
> > It certainly doesn't apply anymore.
> 
> I was under the impression that it was due to ancient GCC behaviour and the "cc"
> was there to prevent re-ordering. For interest, what are the specifics?

In the old days, you had to take care of the predicate for conditional 
instructions in your inline asm code.  So for example, if you had:

	if (foo)
		inline asm ("mov %0, #1" : "=r" (bar));

That wouldn't work most of the time because even if the if condition was 
false, gcc would not branch over the inline asm code but expect you to 
take care of the condition code.  That means the above needed to be 
written as:

	if (foo)
		inline asm ("mov%? %0, #1" : "=r" (bar));

Because having %? added all over the place was cumbersome, the 
alternative was to add "cc" to the clobber list so gcc would then branch 
across the inline asm when the condition is false.

But of course many people forgot about those subtle details quite often, 
and then gcc was changed so inline asm was made non predicated.


Nicolas


* [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber
  2013-07-24 16:16       ` Nicolas Pitre
@ 2013-07-24 16:35         ` Will Deacon
  2013-07-24 16:58           ` Russell King - ARM Linux
  0 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-24 16:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 24, 2013 at 05:16:59PM +0100, Nicolas Pitre wrote:
> On Wed, 24 Jul 2013, Will Deacon wrote:
> > On Tue, Jul 23, 2013 at 08:48:11PM +0100, Nicolas Pitre wrote:
> > > You know the legacy reason why the cc clobber was there, right?
> > > It certainly doesn't apply anymore.
> > 
> > I was under the impression that it was due to ancient GCC behaviour and the "cc"
> > was there to prevent re-ordering. For interest, what are the specifics?
> 
> In the old days, you had to take care of the predicate for conditional 
> instructions in your inline asm code.  So for example, if you had:
> 
> 	if (foo)
> 		inline asm ("mov %0, #1" : "=r" (bar));
> 
> That wouldn't work most of the time because even if the if condition was 
> false, gcc would not branch over the inline asm code but expect you to 
> take care of the condition code.  That means the above needed to be 
> written as:
> 
> 	if (foo)
> 		inline asm ("mov%? %0, #1" : "=r" (bar));
> 
> Because having %? added all over the place was cumbersome, the 
> alternative was to add "cc" to the clobber list so gcc would then branch 
> across the inline asm when the condition is false.
> 
> But of course many people forgot about those subtle details quite often, 
> and then gcc was changed so inline asm was made non predicated.

Holy sh*t....

Next time I complain to the tools guys about inline asm, I'll count my
blessings so they don't try and resurrect anything as horrendous as what
you've just described.

Thanks for the insight!

Will


* [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber
  2013-07-24 16:35         ` Will Deacon
@ 2013-07-24 16:58           ` Russell King - ARM Linux
  2013-07-24 17:22             ` Nicolas Pitre
  0 siblings, 1 reply; 30+ messages in thread
From: Russell King - ARM Linux @ 2013-07-24 16:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 24, 2013 at 05:35:31PM +0100, Will Deacon wrote:
> On Wed, Jul 24, 2013 at 05:16:59PM +0100, Nicolas Pitre wrote:
> > On Wed, 24 Jul 2013, Will Deacon wrote:
> > > On Tue, Jul 23, 2013 at 08:48:11PM +0100, Nicolas Pitre wrote:
> > > > You know the legacy reason why the cc clobber was there, right?
> > > > It certainly doesn't apply anymore.
> > > 
> > > I was under the impression that it was due to ancient GCC behaviour and the "cc"
> > > was there to prevent re-ordering. For interest, what are the specifics?
> > 
> > In the old days, you had to take care of the predicate for conditional 
> > instructions in your inline asm code.  So for example, if you had:
> > 
> > 	if (foo)
> > 		inline asm ("mov %0, #1" : "=r" (bar));
> > 
> > That wouldn't work most of the time because even if the if condition was 
> > false, gcc would not branch over the inline asm code but expect you to 
> > take care of the condition code.  That means the above needed to be 
> > written as:
> > 
> > 	if (foo)
> > 		inline asm ("mov%? %0, #1" : "=r" (bar));
> > 
> > Because having %? added all over the place was cumbersome, the 
> > alternative was to add "cc" to the clobber list so gcc would then branch 
> > across the inline asm when the condition is false.
> > 
> > But of course many people forgot about those subtle details quite often, 
> > and then gcc was changed so inline asm was made non predicated.
> 
> Holy sh*t....
> 
> Next time I complain to the tools guys about inline asm, I'll count my
> blessings so they don't try and resurrect anything as horrendous as what
> you've just described.

And the result of its removal is all the horrid macro crap in asm/tlbflush.h
to work around the lack of this feature - otherwise we end up with seven
instances of:

	tst
	bne	1f
	mcr
1:

for every call to a TLB function.  GCC _could_ have detected the lack of
%? and the lack of "cc" clobber and either warned or implicitly branched
around the block.  There was no need to completely remove the facility.

So now, instead, we have to do lots of preprocessor junk to work out what
instructions are always executed, which are never executed, and which
might be executed.  All that additional complexity thanks to the removal
of this brilliant feature.

That was a truly sad day when that happened.


* [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber
  2013-07-24 16:58           ` Russell King - ARM Linux
@ 2013-07-24 17:22             ` Nicolas Pitre
  0 siblings, 0 replies; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-24 17:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 24 Jul 2013, Russell King - ARM Linux wrote:

> On Wed, Jul 24, 2013 at 05:35:31PM +0100, Will Deacon wrote:
> > On Wed, Jul 24, 2013 at 05:16:59PM +0100, Nicolas Pitre wrote:
> > > On Wed, 24 Jul 2013, Will Deacon wrote:
> > > > On Tue, Jul 23, 2013 at 08:48:11PM +0100, Nicolas Pitre wrote:
> > > > > You know the legacy reason why the cc clobber was there, right?
> > > > > It certainly doesn't apply anymore.
> > > > 
> > > > I was under the impression that it was due to ancient GCC behaviour and the "cc"
> > > > was there to prevent re-ordering. For interest, what are the specifics?
> > > 
> > > In the old days, you had to take care of the predicate for conditional 
> > > instructions in your inline asm code.  So for example, if you had:
> > > 
> > > 	if (foo)
> > > 		inline asm ("mov %0, #1" : "=r" (bar));
> > > 
> > > That wouldn't work most of the time because even if the if condition was 
> > > false, gcc would not branch over the inline asm code but expect you to 
> > > take care of the condition code.  That means the above needed to be 
> > > written as:
> > > 
> > > 	if (foo)
> > > 		inline asm ("mov%? %0, #1" : "=r" (bar));
> > > 
> > > Because having %? added all over the place was cumbersome, the 
> > > alternative was to add "cc" to the clobber list so gcc would then branch 
> > > across the inline asm when the condition is false.
> > > 
> > > But of course many people forgot about those subtle details quite often, 
> > > and then gcc was changed so inline asm was made non predicated.
> > 
> > Holy sh*t....
> > 
> > Next time I complain to the tools guys about inline asm, I'll count my
> > blessings so they don't try and resurrect anything as horrendous as what
> > you've just described.
> 
> And the result of its removal is all the horrid macro crap in asm/tlbflush.h
> to work around the lack of this feature - otherwise we end up with seven
> instances of:
> 
> 	tst
> 	bne	1f
> 	mcr
> 1:
> 
> for every call to a TLB function.  GCC _could_ have detected the lack of
> %? and the lack of "cc" clobber and either warned or implicitly branched
> around the block.  There was no need to completely remove the facility.

I agree.  gcc could have considered an inline asm block predicate safe 
whenever it contains a %?.  I even suggested this at the time.


Nicolas


* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-24 11:18     ` Will Deacon
@ 2013-07-25 17:31       ` Stephen Boyd
  2013-07-25 17:37         ` Stephen Boyd
  0 siblings, 1 reply; 30+ messages in thread
From: Stephen Boyd @ 2013-07-25 17:31 UTC (permalink / raw)
  To: linux-arm-kernel

On 07/24/13 04:18, Will Deacon wrote:
> On Tue, Jul 23, 2013 at 09:10:33PM +0100, Nicolas Pitre wrote:
>> On Tue, 23 Jul 2013, Will Deacon wrote:
>>
>>> The cost of changing a cacheline from shared to exclusive state can be
>>> significant, especially when this is triggered by an exclusive store,
>>> since it may result in having to retry the transaction.
>>>
>>> This patch prefixes our {spin,read,write}_[try]lock implementations with
>>> pldw instructions (on CPUs which support them) to try and grab the line
>>> in exclusive state from the start.
>>>
>>> Signed-off-by: Will Deacon <will.deacon@arm.com>
>>> ---
>>>  arch/arm/include/asm/spinlock.h | 9 ++++++++-
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
>>> index 0de7bec..3e1cc9d 100644
>>> --- a/arch/arm/include/asm/spinlock.h
>>> +++ b/arch/arm/include/asm/spinlock.h
>>> @@ -5,7 +5,7 @@
>>>  #error SMP not supported on pre-ARMv6 CPUs
>>>  #endif
>>>  
>>> -#include <asm/processor.h>
>>> +#include <linux/prefetch.h>
>>>  
>>>  /*
>>>   * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
>>> @@ -70,6 +70,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>>>  	u32 newval;
>>>  	arch_spinlock_t lockval;
>>>  
>>> +	prefetchw((const void *)&lock->slock);
>> Couldn't that cast be carried in the definition of prefetchw() instead?
> I think that would mean implementing prefetchw as a macro rather than an
> inline function, since the core code expects to pass a const pointer and GCC
> gets angry if the type signatures don't match.

Maybe I'm wrong, but can't you just remove the casts and leave the
function as static inline? const void * is pretty much telling the
compiler to turn off type checking.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation


* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-25 17:31       ` Stephen Boyd
@ 2013-07-25 17:37         ` Stephen Boyd
  2013-07-25 17:45           ` Will Deacon
  0 siblings, 1 reply; 30+ messages in thread
From: Stephen Boyd @ 2013-07-25 17:37 UTC (permalink / raw)
  To: linux-arm-kernel

On 07/25/13 10:31, Stephen Boyd wrote:
> On 07/24/13 04:18, Will Deacon wrote:
>> On Tue, Jul 23, 2013 at 09:10:33PM +0100, Nicolas Pitre wrote:
>>> On Tue, 23 Jul 2013, Will Deacon wrote:
>>>
>>>> The cost of changing a cacheline from shared to exclusive state can be
>>>> significant, especially when this is triggered by an exclusive store,
>>>> since it may result in having to retry the transaction.
>>>>
>>>> This patch prefixes our {spin,read,write}_[try]lock implementations with
>>>> pldw instructions (on CPUs which support them) to try and grab the line
>>>> in exclusive state from the start.
>>>>
>>>> Signed-off-by: Will Deacon <will.deacon@arm.com>
>>>> ---
>>>>  arch/arm/include/asm/spinlock.h | 9 ++++++++-
>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
>>>> index 0de7bec..3e1cc9d 100644
>>>> --- a/arch/arm/include/asm/spinlock.h
>>>> +++ b/arch/arm/include/asm/spinlock.h
>>>> @@ -5,7 +5,7 @@
>>>>  #error SMP not supported on pre-ARMv6 CPUs
>>>>  #endif
>>>>  
>>>> -#include <asm/processor.h>
>>>> +#include <linux/prefetch.h>
>>>>  
>>>>  /*
>>>>   * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
>>>> @@ -70,6 +70,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>>>>  	u32 newval;
>>>>  	arch_spinlock_t lockval;
>>>>  
>>>> +	prefetchw((const void *)&lock->slock);
>>> Couldn't that cast be carried in the definition of prefetchw() instead?
>> I think that would mean implementing prefetchw as a macro rather than an
>> inline function, since the core code expects to pass a const pointer and GCC
>> gets angry if the type signatures don't match.
> Maybe I'm wrong, but can't you just remove the casts and leave the
> function as static inline? const void * is pretty much telling the
> compiler to turn off type checking.
>

Oh joy. Why is rwlock's lock member marked volatile?
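
(For reference, arch/arm/include/asm/spinlock_types.h currently has, unless
I'm misreading it:

	typedef struct {
		volatile unsigned int lock;
	} arch_rwlock_t;

which is what forces the (const void *) casts when &rw->lock is passed to
prefetchw().)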

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation


* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-25 17:37         ` Stephen Boyd
@ 2013-07-25 17:45           ` Will Deacon
  2013-07-25 17:48             ` Will Deacon
                               ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Will Deacon @ 2013-07-25 17:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 25, 2013 at 06:37:48PM +0100, Stephen Boyd wrote:
> On 07/25/13 10:31, Stephen Boyd wrote:
> > On 07/24/13 04:18, Will Deacon wrote:
> >> On Tue, Jul 23, 2013 at 09:10:33PM +0100, Nicolas Pitre wrote:
> >>> On Tue, 23 Jul 2013, Will Deacon wrote:
> >>>
> >>>> The cost of changing a cacheline from shared to exclusive state can be
> >>>> significant, especially when this is triggered by an exclusive store,
> >>>> since it may result in having to retry the transaction.
> >>>>
> >>>> This patch prefixes our {spin,read,write}_[try]lock implementations with
> >>>> pldw instructions (on CPUs which support them) to try and grab the line
> >>>> in exclusive state from the start.
> >>>>
> >>>> Signed-off-by: Will Deacon <will.deacon@arm.com>
> >>>> ---
> >>>>  arch/arm/include/asm/spinlock.h | 9 ++++++++-
> >>>>  1 file changed, 8 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
> >>>> index 0de7bec..3e1cc9d 100644
> >>>> --- a/arch/arm/include/asm/spinlock.h
> >>>> +++ b/arch/arm/include/asm/spinlock.h
> >>>> @@ -5,7 +5,7 @@
> >>>>  #error SMP not supported on pre-ARMv6 CPUs
> >>>>  #endif
> >>>>  
> >>>> -#include <asm/processor.h>
> >>>> +#include <linux/prefetch.h>
> >>>>  
> >>>>  /*
> >>>>   * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
> >>>> @@ -70,6 +70,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> >>>>  	u32 newval;
> >>>>  	arch_spinlock_t lockval;
> >>>>  
> >>>> +	prefetchw((const void *)&lock->slock);
> >>> Couldn't that cast be carried in the definition of prefetchw() instead?
> >> I think that would mean implementing prefetchw as a macro rather than an
> >> inline function, since the core code expects to pass a const pointer and GCC
> >> gets angry if the type signatures don't match.
> > Maybe I'm wrong, but can't you just remove the casts and leave the
> > function as static inline? const void * is pretty much telling the
> > compiler to turn off type checking.
> >
> 
> Oh joy. Why is rwlock's lock member marked volatile?

Yeah, that was the problematic guy. However, I had to fix that anyway in
this patch because otherwise the definition for prefetchw when
!ARCH_HAS_PREFETCHW (which expands to __builtin_prefetch(x,1)) will explode.

So, given that I've fixed the rwlocks, I think I could put prefetch and
prefetchw back to static inline functions. What do you reckon?

Will


* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-25 17:45           ` Will Deacon
@ 2013-07-25 17:48             ` Will Deacon
  2013-07-25 17:55             ` Stephen Boyd
  2013-07-25 20:24             ` Nicolas Pitre
  2 siblings, 0 replies; 30+ messages in thread
From: Will Deacon @ 2013-07-25 17:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 25, 2013 at 06:45:55PM +0100, Will Deacon wrote:
> On Thu, Jul 25, 2013 at 06:37:48PM +0100, Stephen Boyd wrote:
> > Oh joy. Why is rwlock's lock member marked volatile?
> 
> Yeah, that was the problematic guy. However, I had to fix that anyway in
> this patch because otherwise the definition for prefetchw when
> !ARCH_HAS_PREFETCHW (which expands to __builtin_prefetch(x,1)) will explode.

Just realised you're still looking at v1, which doesn't have any changes to
the rwlocks. I posted v2 here:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2013-July/186411.html

Enjoy.

Will


* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-25 17:45           ` Will Deacon
  2013-07-25 17:48             ` Will Deacon
@ 2013-07-25 17:55             ` Stephen Boyd
  2013-07-25 18:05               ` Will Deacon
  2013-07-25 20:24             ` Nicolas Pitre
  2 siblings, 1 reply; 30+ messages in thread
From: Stephen Boyd @ 2013-07-25 17:55 UTC (permalink / raw)
  To: linux-arm-kernel

On 07/25/13 10:45, Will Deacon wrote:
> On Thu, Jul 25, 2013 at 06:37:48PM +0100, Stephen Boyd wrote:
>> On 07/25/13 10:31, Stephen Boyd wrote:
>>> On 07/24/13 04:18, Will Deacon wrote:
>>>> On Tue, Jul 23, 2013 at 09:10:33PM +0100, Nicolas Pitre wrote:
>>>>> On Tue, 23 Jul 2013, Will Deacon wrote:
>>>>>
>>>>>> The cost of changing a cacheline from shared to exclusive state can be
>>>>>> significant, especially when this is triggered by an exclusive store,
>>>>>> since it may result in having to retry the transaction.
>>>>>>
>>>>>> This patch prefixes our {spin,read,write}_[try]lock implementations with
>>>>>> pldw instructions (on CPUs which support them) to try and grab the line
>>>>>> in exclusive state from the start.
>>>>>>
>>>>>> Signed-off-by: Will Deacon <will.deacon@arm.com>
>>>>>> ---
>>>>>>  arch/arm/include/asm/spinlock.h | 9 ++++++++-
>>>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
>>>>>> index 0de7bec..3e1cc9d 100644
>>>>>> --- a/arch/arm/include/asm/spinlock.h
>>>>>> +++ b/arch/arm/include/asm/spinlock.h
>>>>>> @@ -5,7 +5,7 @@
>>>>>>  #error SMP not supported on pre-ARMv6 CPUs
>>>>>>  #endif
>>>>>>  
>>>>>> -#include <asm/processor.h>
>>>>>> +#include <linux/prefetch.h>
>>>>>>  
>>>>>>  /*
>>>>>>   * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
>>>>>> @@ -70,6 +70,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>>>>>>  	u32 newval;
>>>>>>  	arch_spinlock_t lockval;
>>>>>>  
>>>>>> +	prefetchw((const void *)&lock->slock);
>>>>> Couldn't that cast be carried in the definition of prefetchw() instead?
>>>> I think that would mean implementing prefetchw as a macro rather than an
>>>> inline function, since the core code expects to pass a const pointer and GCC
>>>> gets angry if the type signatures don't match.
>>> Maybe I'm wrong, but can't you just remove the casts and leave the
>>> function as static inline? const void * is pretty much telling the
>>> compiler to turn off type checking.
>>>
>> Oh joy. Why is rwlock's lock member marked volatile?
> Yeah, that was the problematic guy. However, I had to fix that anyway in
> this patch because otherwise the definition for prefetchw when
> !ARCH_HAS_PREFETCHW (which expands to __builtin_prefetch(x,1)) will explode.
>
> So, given that I've fixed the rwlocks, I think I could put prefetch and
> prefetchw back to static inline functions. What do you reckon?

It would be good to match the builtin function's signature so that we
don't explode in the future on ARCH_HAS_PREFETCHW configs.
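
For reference, the builtin being matched is documented by GCC as

	void __builtin_prefetch (const void *addr, ...);

with optional compile-time-constant arguments rw (0 for read, 1 for
write) and locality (0-3), so an arch prefetchw() taking const void *
lines up with what the generic fallback already accepts.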

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-25 17:55             ` Stephen Boyd
@ 2013-07-25 18:05               ` Will Deacon
  2013-07-25 18:22                 ` Stephen Boyd
  0 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-25 18:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 25, 2013 at 06:55:44PM +0100, Stephen Boyd wrote:
> On 07/25/13 10:45, Will Deacon wrote:
> > On Thu, Jul 25, 2013 at 06:37:48PM +0100, Stephen Boyd wrote:
> >> On 07/25/13 10:31, Stephen Boyd wrote:
> >>> Maybe I'm wrong, but can't you just remove the casts and leave the
> >>> function as static inline? const void * is pretty much telling the
> >>> compiler to turn off type checking.
> >>>
> >> Oh joy. Why is rwlock's lock member marked volatile?
> > Yeah, that was the problematic guy. However, I had to fix that anyway in
> > this patch because otherwise the definition for prefetchw when
> > !ARCH_HAS_PREFETCHW (which expands to __builtin_prefetch(x,1)) will explode.
> >
> > So, given that I've fixed the rwlocks, I think I could put prefetch and
> > prefetchw back to static inline functions. What do you reckon?
> 
> It would be good to match the builtin function's signature so that we
> don't explode in the future on ARCH_HAS_PREFETCHW configs.

Ok, so that's basically just undoing the macroisation on top of v2 (fixup
below).

Will

--->8

diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index dde7ecc..dac9429 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -109,25 +109,25 @@ unsigned long get_wchan(struct task_struct *p);
 #if __LINUX_ARM_ARCH__ >= 5
 
 #define ARCH_HAS_PREFETCH
-#define prefetch(p)                                                    \
-({                                                                     \
-       __asm__ __volatile__(                                           \
-               "pld\t%a0"                                              \
-               :: "p" (p));                                            \
-})
+static inline void prefetch(const void *ptr)
+{
+       __asm__ __volatile__(
+               "pld\t%a0"
+               :: "p" (ptr));
+}
 
 #if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
 #define ARCH_HAS_PREFETCHW
-#define prefetchw(p)                                                   \
-({                                                                     \
-       __asm__ __volatile__(                                           \
-               ".arch_extension        mp\n"                           \
-               __ALT_SMP_ASM(                                          \
-                       WASM(pldw)              "\t%a0",                \
-                       WASM(pld)               "\t%a0"                 \
-               )                                                       \
-               :: "p" (p));                                            \
-})
+static inline void prefetchw(const void *ptr)
+{
+       __asm__ __volatile__(
+               ".arch_extension        mp\n"
+               __ALT_SMP_ASM(
+                       WASM(pldw)              "\t%a0",
+                       WASM(pld)               "\t%a0"
+               )
+               :: "p" (ptr));
+}
 #endif
 #endif
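
As a call-site sketch on top of that fixup (hypothetical names only;
the prefetchw() stand-in below mimics just the signature of the inline
above), no cast is needed once the argument converts implicitly:

/* Stand-in for the static inline prefetchw() in the fixup above. */
static inline void prefetchw(const void *ptr)
{
	__builtin_prefetch(ptr, 1);
}

typedef struct { unsigned int slock; } sketch_spinlock_t;

static inline void sketch_spin_lock_prologue(sketch_spinlock_t *lock)
{
	/* unsigned int * converts implicitly to const void *, so the
	 * explicit (const void *) cast from v1 can be dropped. */
	prefetchw(&lock->slock);
}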

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-25 18:05               ` Will Deacon
@ 2013-07-25 18:22                 ` Stephen Boyd
  0 siblings, 0 replies; 30+ messages in thread
From: Stephen Boyd @ 2013-07-25 18:22 UTC (permalink / raw)
  To: linux-arm-kernel

On 07/25/13 11:05, Will Deacon wrote:
> On Thu, Jul 25, 2013 at 06:55:44PM +0100, Stephen Boyd wrote:
>> On 07/25/13 10:45, Will Deacon wrote:
>>> On Thu, Jul 25, 2013 at 06:37:48PM +0100, Stephen Boyd wrote:
>>>> On 07/25/13 10:31, Stephen Boyd wrote:
>>>>> Maybe I'm wrong, but can't you just remove the casts and leave the
>>>>> function as static inline? const void * is pretty much telling the
>>>>> compiler to turn off type checking.
>>>>>
>>>> Oh joy. Why is rwlock's lock member marked volatile?
>>> Yeah, that was the problematic guy. However, I had to fix that anyway in
>>> this patch because otherwise the definition for prefetchw when
>>> !ARCH_HAS_PREFETCHW (which expands to __builtin_prefetch(x,1)) will explode.
>>>
>>> So, given that I've fixed the rwlocks, I think I could put prefetch and
>>> prefetchw back to static inline functions. What do you reckon?
>> It would be good to match the builtin function's signature so that we
>> don't explode in the future on ARCH_HAS_PREFETCHW configs.
> Ok, so that's basically just undoing the macroisation on top of v2 (fixup
> below).
>
> Will

Looks good to me. Thanks.

>
> --->8
>
> diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
> index dde7ecc..dac9429 100644
> --- a/arch/arm/include/asm/processor.h
> +++ b/arch/arm/include/asm/processor.h
> @@ -109,25 +109,25 @@ unsigned long get_wchan(struct task_struct *p);
>  #if __LINUX_ARM_ARCH__ >= 5
>  
>  #define ARCH_HAS_PREFETCH
> -#define prefetch(p)                                                    \
> -({                                                                     \
> -       __asm__ __volatile__(                                           \
> -               "pld\t%a0"                                              \
> -               :: "p" (p));                                            \
> -})
> +static inline void prefetch(const void *ptr)
> +{
> +       __asm__ __volatile__(
> +               "pld\t%a0"
> +               :: "p" (ptr));
> +}
>  
>  #if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
>  #define ARCH_HAS_PREFETCHW
> -#define prefetchw(p)                                                   \
> -({                                                                     \
> -       __asm__ __volatile__(                                           \
> -               ".arch_extension        mp\n"                           \
> -               __ALT_SMP_ASM(                                          \
> -                       WASM(pldw)              "\t%a0",                \
> -                       WASM(pld)               "\t%a0"                 \
> -               )                                                       \
> -               :: "p" (p));                                            \
> -})
> +static inline void prefetchw(const void *ptr)
> +{
> +       __asm__ __volatile__(
> +               ".arch_extension        mp\n"
> +               __ALT_SMP_ASM(
> +                       WASM(pldw)              "\t%a0",
> +                       WASM(pld)               "\t%a0"
> +               )
> +               :: "p" (ptr));
> +}
>  #endif
>  #endif
>


-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-25 17:45           ` Will Deacon
  2013-07-25 17:48             ` Will Deacon
  2013-07-25 17:55             ` Stephen Boyd
@ 2013-07-25 20:24             ` Nicolas Pitre
  2013-07-26 10:34               ` Will Deacon
  2 siblings, 1 reply; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-25 20:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 25 Jul 2013, Will Deacon wrote:

> On Thu, Jul 25, 2013 at 06:37:48PM +0100, Stephen Boyd wrote:
> > On 07/25/13 10:31, Stephen Boyd wrote:
> > > On 07/24/13 04:18, Will Deacon wrote:
> > >> On Tue, Jul 23, 2013 at 09:10:33PM +0100, Nicolas Pitre wrote:
> > >>> On Tue, 23 Jul 2013, Will Deacon wrote:
> > >>>
> > >>>> The cost of changing a cacheline from shared to exclusive state can be
> > >>>> significant, especially when this is triggered by an exclusive store,
> > >>>> since it may result in having to retry the transaction.
> > >>>>
> > >>>> This patch prefixes our {spin,read,write}_[try]lock implementations with
> > >>>> pldw instructions (on CPUs which support them) to try and grab the line
> > >>>> in exclusive state from the start.
> > >>>>
> > >>>> Signed-off-by: Will Deacon <will.deacon@arm.com>
> > >>>> ---
> > >>>>  arch/arm/include/asm/spinlock.h | 9 ++++++++-
> > >>>>  1 file changed, 8 insertions(+), 1 deletion(-)
> > >>>>
> > >>>> diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
> > >>>> index 0de7bec..3e1cc9d 100644
> > >>>> --- a/arch/arm/include/asm/spinlock.h
> > >>>> +++ b/arch/arm/include/asm/spinlock.h
> > >>>> @@ -5,7 +5,7 @@
> > >>>>  #error SMP not supported on pre-ARMv6 CPUs
> > >>>>  #endif
> > >>>>  
> > >>>> -#include <asm/processor.h>
> > >>>> +#include <linux/prefetch.h>
> > >>>>  
> > >>>>  /*
> > >>>>   * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
> > >>>> @@ -70,6 +70,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> > >>>>  	u32 newval;
> > >>>>  	arch_spinlock_t lockval;
> > >>>>  
> > >>>> +	prefetchw((const void *)&lock->slock);
> > >>> Couldn't that cast be carried in the definition of prefetchw() instead?
> > >> I think that would mean implementing prefetchw as a macro rather than an
> > >> inline function, since the core code expects to pass a const pointer and GCC
> > >> gets angry if the type signatures don't match.
> > > Maybe I'm wrong, but can't you just remove the casts and leave the
> > > function as static inline? const void * is pretty much telling the
> > > compiler to turn off type checking.
> > >
> > 
> > Oh joy. Why is rwlock's lock member marked volatile?
> 
> Yeah, that was the problematic guy. However, I had to fix that anyway in
> this patch because otherwise the definition for prefetchw when
> !ARCH_HAS_PREFETCHW (which expands to __builtin_prefetch(x,1)) will explode.

Why not always keep ARCH_HAS_PREFETCHW defined then, and have 
__builtin_prefetch((const volatile void *)x, 1) when we can't do better?

Or why not add this cast to the generic definition?
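
(One possible shape of that second option, sketched rather than taken
from linux/prefetch.h: an explicit cast inside the generic fallback
silences the implicit-conversion warning at volatile call sites.)

#ifndef ARCH_HAS_PREFETCHW
#define prefetchw(x)	__builtin_prefetch((const void *)(x), 1)
#endif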


Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-25 20:24             ` Nicolas Pitre
@ 2013-07-26 10:34               ` Will Deacon
  2013-07-26 14:46                 ` Nicolas Pitre
  0 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2013-07-26 10:34 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Nicolas,

On Thu, Jul 25, 2013 at 09:24:48PM +0100, Nicolas Pitre wrote:
> On Thu, 25 Jul 2013, Will Deacon wrote:
> > > 
> > > Oh joy. Why is rwlock's lock member marked volatile?
> > 
> > Yeah, that was the problematic guy. However, I had to fix that anyway in
> > this patch because otherwise the definition for prefetchw when
> > !ARCH_HAS_PREFETCHW (which expands to __builtin_prefetch(x,1)) will explode.
> 
> Why not always keep ARCH_HAS_PREFETCHW defined then, and have 
> __builtin_prefetch([const volatile void *)x,1)) when we can't do better?
> 
> Or why not adding this cast to the generic definition?

Yes, I'd prefer to fix the generic definition as a separate patch. However,
I think the only case where this generates warnings is when a volatile * is
passed. If volatile is actually required, I question whether prefetching
from that location even makes sense in the first place since, in the kernel,
the keyword seems to be generally associated with IO rather than CPUs accessing
shared memory via caches.

What do you think?

Will

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex
  2013-07-26 10:34               ` Will Deacon
@ 2013-07-26 14:46                 ` Nicolas Pitre
  0 siblings, 0 replies; 30+ messages in thread
From: Nicolas Pitre @ 2013-07-26 14:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 26 Jul 2013, Will Deacon wrote:

> Hi Nicolas,
> 
> On Thu, Jul 25, 2013 at 09:24:48PM +0100, Nicolas Pitre wrote:
> > On Thu, 25 Jul 2013, Will Deacon wrote:
> > > > 
> > > > Oh joy. Why is rwlock's lock member marked volatile?
> > > 
> > > Yeah, that was the problematic guy. However, I had to fix that anyway in
> > > this patch because otherwise the definition for prefetchw when
> > > !ARCH_HAS_PREFETCHW (which expands to __builtin_prefetch(x,1)) will explode.
> > 
> > Why not always keep ARCH_HAS_PREFETCHW defined then, and have 
> > __builtin_prefetch((const volatile void *)x, 1) when we can't do better?
> > 
> > Or why not add this cast to the generic definition?
> 
> Yes, I'd prefer to fix the generic definition as a separate patch. However,
> I think the only case that this generates warnings is when a volatile * is
> passed. If volatile is actually required, I question whether prefetching
> from that location even makes sense in the first place since, in the kernel,
> the keyword seems to be generally associated with IO rather than CPUs accessing
> shared memory via caches.
> 
> What do you think?

The only thing volatile is good for is to tell the compiler not to 
attempt any kind of optimization on the access.  Obviously this is 
needed for IO accessors, but we don't want the compiler to think it can 
cache a previous reading of the lock value either.  So as long as all 
accesses are done through some accessors (you did add ACCESS_ONCE() in 
your patch removing the volatile for that reason), we don't need
that keyword in the lock definition.
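
A minimal sketch of that arrangement, with illustrative names
(plain_rwlock_t and plain_read_can_lock() are not the kernel's;
EXAMPLE_ACCESS_ONCE mirrors the linux/compiler.h ACCESS_ONCE() of the
time):

#define EXAMPLE_ACCESS_ONCE(x)	(*(volatile typeof(x) *)&(x))

typedef struct {
	unsigned int lock;	/* no volatile on the field itself */
} plain_rwlock_t;

static inline int plain_read_can_lock(plain_rwlock_t *rw)
{
	/* The volatile lives in the accessor, so each call re-reads
	 * memory, and prefetchw(&rw->lock) stays warning-free. */
	return EXAMPLE_ACCESS_ONCE(rw->lock) < 0x80000000;
}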


Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2013-07-26 14:46 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-23 11:36 [PATCH 0/6] Add support for pldw instruction on v7 MP cores Will Deacon
2013-07-23 11:36 ` [PATCH 1/6] ARM: prefetch: remove redundant "cc" clobber Will Deacon
2013-07-23 19:48   ` Nicolas Pitre
2013-07-24 10:19     ` Will Deacon
2013-07-24 16:16       ` Nicolas Pitre
2013-07-24 16:35         ` Will Deacon
2013-07-24 16:58           ` Russell King - ARM Linux
2013-07-24 17:22             ` Nicolas Pitre
2013-07-23 11:36 ` [PATCH 2/6] ARM: smp_on_up: move inline asm ALT_SMP patching macro out of spinlock.h Will Deacon
2013-07-23 19:55   ` Nicolas Pitre
2013-07-23 11:36 ` [PATCH 3/6] ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs Will Deacon
2013-07-23 20:05   ` Nicolas Pitre
2013-07-24 10:42     ` Will Deacon
2013-07-23 11:36 ` [PATCH 4/6] ARM: locks: prefetch the destination word for write prior to strex Will Deacon
2013-07-23 20:10   ` Nicolas Pitre
2013-07-24 11:18     ` Will Deacon
2013-07-25 17:31       ` Stephen Boyd
2013-07-25 17:37         ` Stephen Boyd
2013-07-25 17:45           ` Will Deacon
2013-07-25 17:48             ` Will Deacon
2013-07-25 17:55             ` Stephen Boyd
2013-07-25 18:05               ` Will Deacon
2013-07-25 18:22                 ` Stephen Boyd
2013-07-25 20:24             ` Nicolas Pitre
2013-07-26 10:34               ` Will Deacon
2013-07-26 14:46                 ` Nicolas Pitre
2013-07-23 11:36 ` [PATCH 5/6] ARM: atomics: " Will Deacon
2013-07-23 20:12   ` Nicolas Pitre
2013-07-23 11:36 ` [PATCH 6/6] ARM: bitops: " Will Deacon
2013-07-23 20:15   ` Nicolas Pitre
