* [PATCH 1/6] powerpc: Use lwarx hint in spinlocks
@ 2010-02-10 10:57 Anton Blanchard
  2010-02-10 11:02 ` [PATCH 2/6] powerpc: Use lwarx/ldarx hint in bit locks Anton Blanchard
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Anton Blanchard @ 2010-02-10 10:57 UTC (permalink / raw)
  To: benh; +Cc: npiggin, linuxppc-dev


Recent versions of the PowerPC architecture added a hint bit to the larx
instructions to differentiate between an atomic operation and a lock operation:

> 0 Other programs might attempt to modify the word in storage addressed by EA
> even if the subsequent Store Conditional succeeds.
> 
> 1 Other programs will not attempt to modify the word in storage addressed by
> EA until the program that has acquired the lock performs a subsequent store
> releasing the lock.

To avoid a binutils dependency, this patch creates a macro for the extended lwarx
format and uses it in the spinlock code. To test this change I used a simple
test case that acquires and releases a global pthread mutex:

	pthread_mutex_lock(&mutex);
	pthread_mutex_unlock(&mutex);
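
(A minimal sketch of such a microbenchmark, for reference only; the thread
count, iteration count and timing below are assumptions, not details of the
actual test program:)

	#include <pthread.h>
	#include <stdio.h>

	#define NR_THREADS	32
	#define ITERATIONS	1000000

	static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

	static void *worker(void *arg)
	{
		int i;

		(void)arg;
		/* Hammer one global mutex so all threads contend on it. */
		for (i = 0; i < ITERATIONS; i++) {
			pthread_mutex_lock(&mutex);
			pthread_mutex_unlock(&mutex);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t threads[NR_THREADS];
		int i;

		for (i = 0; i < NR_THREADS; i++)
			pthread_create(&threads[i], NULL, worker, NULL);
		for (i = 0; i < NR_THREADS; i++)
			pthread_join(threads[i], NULL);

		/* ops/sec = NR_THREADS * ITERATIONS / elapsed wall clock time */
		printf("done: %d lock/unlock operations\n", NR_THREADS * ITERATIONS);
		return 0;
	}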

On a 32 core POWER6 running 32 test threads, we spend almost all our time in
the futex spinlock code:

    94.37%     perf  [kernel]                     [k] ._raw_spin_lock
               |          
               |--99.95%-- ._raw_spin_lock
               |          |          
               |          |--63.29%-- .futex_wake
               |          |          
               |          |--36.64%-- .futex_wait_setup

This makes it a good test for this patch. The results (in lock/unlock operations per
second) are:

before: 1538203 ops/sec
after:  2189219 ops/sec

An improvement of 42%

A 32 core POWER7 improves even more:

before: 1279529 ops/sec
after:  2282076 ops/sec

An improvement of 78%

Signed-off-by: Anton Blanchard <anton@samba.org>
---

v2: We do this only for 64bit until we can verify all 32bit CPUs.

Tested so far: 970 (thanks Ben), POWER5, POWER6, POWER7
Still to test: RS64, POWER3, POWER4

Index: powerpc.git/arch/powerpc/include/asm/ppc-opcode.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/ppc-opcode.h	2010-02-10 15:28:58.453072362 +1100
+++ powerpc.git/arch/powerpc/include/asm/ppc-opcode.h	2010-02-10 15:33:08.963071793 +1100
@@ -24,6 +24,7 @@
 #define PPC_INST_ISEL_MASK		0xfc00003e
 #define PPC_INST_LSWI			0x7c0004aa
 #define PPC_INST_LSWX			0x7c00042a
+#define PPC_INST_LWARX			0x7c000028
 #define PPC_INST_LWSYNC			0x7c2004ac
 #define PPC_INST_LXVD2X			0x7c000698
 #define PPC_INST_MCRXR			0x7c000400
@@ -55,15 +56,28 @@
 #define __PPC_RA(a)	(((a) & 0x1f) << 16)
 #define __PPC_RB(b)	(((b) & 0x1f) << 11)
 #define __PPC_RS(s)	(((s) & 0x1f) << 21)
+#define __PPC_RT(s)	__PPC_RS(s)
 #define __PPC_XS(s)	((((s) & 0x1f) << 21) | (((s) & 0x20) >> 5))
 #define __PPC_T_TLB(t)	(((t) & 0x3) << 21)
 #define __PPC_WC(w)	(((w) & 0x3) << 21)
+/*
+ * Only use the larx hint bit on 64bit CPUs. Once we verify it doesn't have
+ * any side effects on all 32bit processors, we can do this all the time.
+ */
+#ifdef CONFIG_PPC64
+#define __PPC_EH(eh)	(((eh) & 0x1) << 0)
+#else
+#define __PPC_EH(eh)	0
+#endif
 
 /* Deal with instructions that older assemblers aren't aware of */
 #define	PPC_DCBAL(a, b)		stringify_in_c(.long PPC_INST_DCBAL | \
 					__PPC_RA(a) | __PPC_RB(b))
 #define	PPC_DCBZL(a, b)		stringify_in_c(.long PPC_INST_DCBZL | \
 					__PPC_RA(a) | __PPC_RB(b))
+#define PPC_LWARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LWARX | \
+					__PPC_RT(t) | __PPC_RA(a) | \
+					__PPC_RB(b) | __PPC_EH(eh))
 #define PPC_MSGSND(b)		stringify_in_c(.long PPC_INST_MSGSND | \
 					__PPC_RB(b))
 #define PPC_RFCI		stringify_in_c(.long PPC_INST_RFCI)
Index: powerpc.git/arch/powerpc/include/asm/spinlock.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/spinlock.h	2010-02-10 15:28:58.473072327 +1100
+++ powerpc.git/arch/powerpc/include/asm/spinlock.h	2010-02-10 15:29:29.454322618 +1100
@@ -27,6 +27,7 @@
 #endif
 #include <asm/asm-compat.h>
 #include <asm/synch.h>
+#include <asm/ppc-opcode.h>
 
 #define arch_spin_is_locked(x)		((x)->slock != 0)
 
@@ -60,7 +61,7 @@ static inline unsigned long __arch_spin_
 
 	token = LOCK_TOKEN;
 	__asm__ __volatile__(
-"1:	lwarx		%0,0,%2\n\
+"1:	" PPC_LWARX(%0,0,%2,1) "\n\
 	cmpwi		0,%0,0\n\
 	bne-		2f\n\
 	stwcx.		%1,0,%2\n\
@@ -186,7 +187,7 @@ static inline long __arch_read_trylock(a
 	long tmp;
 
 	__asm__ __volatile__(
-"1:	lwarx		%0,0,%1\n"
+"1:	" PPC_LWARX(%0,0,%1,1) "\n"
 	__DO_SIGN_EXTEND
 "	addic.		%0,%0,1\n\
 	ble-		2f\n"
@@ -211,7 +212,7 @@ static inline long __arch_write_trylock(
 
 	token = WRLOCK_TOKEN;
 	__asm__ __volatile__(
-"1:	lwarx		%0,0,%2\n\
+"1:	" PPC_LWARX(%0,0,%2,1) "\n\
 	cmpwi		0,%0,0\n\
 	bne-		2f\n"
 	PPC405_ERR77(0,%1)


* [PATCH 2/6] powerpc: Use lwarx/ldarx hint in bit locks
  2010-02-10 10:57 [PATCH 1/6] powerpc: Use lwarx hint in spinlocks Anton Blanchard
@ 2010-02-10 11:02 ` Anton Blanchard
  2010-02-10 11:03   ` [PATCH 3/6] powerpc: Convert open coded native hashtable bit lock Anton Blanchard
  2010-02-11  6:56 ` [PATCH 1/6] powerpc: Use lwarx hint in spinlocks Nick Piggin
  2010-02-16  4:16 ` Olof Johansson
  2 siblings, 1 reply; 19+ messages in thread
From: Anton Blanchard @ 2010-02-10 11:02 UTC (permalink / raw)
  To: benh; +Cc: npiggin, linuxppc-dev


This patch uses the lwarx/ldarx hint bit for bit locks.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: powerpc.git/arch/powerpc/include/asm/asm-compat.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/asm-compat.h	2010-02-10 17:16:07.003090026 +1100
+++ powerpc.git/arch/powerpc/include/asm/asm-compat.h	2010-02-10 17:16:39.144321846 +1100
@@ -2,6 +2,7 @@
 #define _ASM_POWERPC_ASM_COMPAT_H
 
 #include <asm/types.h>
+#include <asm/ppc-opcode.h>
 
 #ifdef __ASSEMBLY__
 #  define stringify_in_c(...)	__VA_ARGS__
@@ -24,7 +25,7 @@
 #define PPC_LONG	stringify_in_c(.llong)
 #define PPC_LONG_ALIGN	stringify_in_c(.balign 8)
 #define PPC_TLNEI	stringify_in_c(tdnei)
-#define PPC_LLARX	stringify_in_c(ldarx)
+#define PPC_LLARX(t, a, b, eh)	PPC_LDARX(t, a, b, eh)
 #define PPC_STLCX	stringify_in_c(stdcx.)
 #define PPC_CNTLZL	stringify_in_c(cntlzd)
 
@@ -46,7 +47,7 @@
 #define PPC_LONG	stringify_in_c(.long)
 #define PPC_LONG_ALIGN	stringify_in_c(.balign 4)
 #define PPC_TLNEI	stringify_in_c(twnei)
-#define PPC_LLARX	stringify_in_c(lwarx)
+#define PPC_LLARX(t, a, b, eh)	PPC_LWARX(t, a, b, eh)
 #define PPC_STLCX	stringify_in_c(stwcx.)
 #define PPC_CNTLZL	stringify_in_c(cntlzw)
 #define PPC_MTOCRF	stringify_in_c(mtcrf)
Index: powerpc.git/arch/powerpc/include/asm/bitops.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/bitops.h	2010-02-10 17:16:07.013074408 +1100
+++ powerpc.git/arch/powerpc/include/asm/bitops.h	2010-02-10 17:16:14.253072938 +1100
@@ -65,7 +65,7 @@ static __inline__ void fn(unsigned long 
 	unsigned long *p = (unsigned long *)_p;	\
 	__asm__ __volatile__ (			\
 	prefix					\
-"1:"	PPC_LLARX "%0,0,%3\n"			\
+"1:"	PPC_LLARX(%0,0,%3,0) "\n"		\
 	stringify_in_c(op) "%0,%0,%2\n"		\
 	PPC405_ERR77(0,%3)			\
 	PPC_STLCX "%0,0,%3\n"			\
@@ -103,31 +103,31 @@ static __inline__ void change_bit(int nr
 
 /* Like DEFINE_BITOP(), with changes to the arguments to 'op' and the output
  * operands. */
-#define DEFINE_TESTOP(fn, op, prefix, postfix)	\
-static __inline__ unsigned long fn(		\
-		unsigned long mask,		\
-		volatile unsigned long *_p)	\
-{						\
-	unsigned long old, t;			\
-	unsigned long *p = (unsigned long *)_p;	\
-	__asm__ __volatile__ (			\
-	prefix					\
-"1:"	PPC_LLARX "%0,0,%3\n"			\
-	stringify_in_c(op) "%1,%0,%2\n"		\
-	PPC405_ERR77(0,%3)			\
-	PPC_STLCX "%1,0,%3\n"			\
-	"bne- 1b\n"				\
-	postfix					\
-	: "=&r" (old), "=&r" (t)		\
-	: "r" (mask), "r" (p)			\
-	: "cc", "memory");			\
-	return (old & mask);			\
-}
-
-DEFINE_TESTOP(test_and_set_bits, or, LWSYNC_ON_SMP, ISYNC_ON_SMP)
-DEFINE_TESTOP(test_and_set_bits_lock, or, "", ISYNC_ON_SMP)
-DEFINE_TESTOP(test_and_clear_bits, andc, LWSYNC_ON_SMP, ISYNC_ON_SMP)
-DEFINE_TESTOP(test_and_change_bits, xor, LWSYNC_ON_SMP, ISYNC_ON_SMP)
+#define DEFINE_TESTOP(fn, op, prefix, postfix, eh)	\
+static __inline__ unsigned long fn(			\
+		unsigned long mask,			\
+		volatile unsigned long *_p)		\
+{							\
+	unsigned long old, t;				\
+	unsigned long *p = (unsigned long *)_p;		\
+	__asm__ __volatile__ (				\
+	prefix						\
+"1:"	PPC_LLARX(%0,0,%3,eh) "\n"			\
+	stringify_in_c(op) "%1,%0,%2\n"			\
+	PPC405_ERR77(0,%3)				\
+	PPC_STLCX "%1,0,%3\n"				\
+	"bne- 1b\n"					\
+	postfix						\
+	: "=&r" (old), "=&r" (t)			\
+	: "r" (mask), "r" (p)				\
+	: "cc", "memory");				\
+	return (old & mask);				\
+}
+
+DEFINE_TESTOP(test_and_set_bits, or, LWSYNC_ON_SMP, ISYNC_ON_SMP, 0)
+DEFINE_TESTOP(test_and_set_bits_lock, or, "", ISYNC_ON_SMP, 1)
+DEFINE_TESTOP(test_and_clear_bits, andc, LWSYNC_ON_SMP, ISYNC_ON_SMP, 0)
+DEFINE_TESTOP(test_and_change_bits, xor, LWSYNC_ON_SMP, ISYNC_ON_SMP, 0)
 
 static __inline__ int test_and_set_bit(unsigned long nr,
 				       volatile unsigned long *addr)
Index: powerpc.git/arch/powerpc/include/asm/local.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/local.h	2010-02-10 17:16:07.023072625 +1100
+++ powerpc.git/arch/powerpc/include/asm/local.h	2010-02-10 17:16:14.253072938 +1100
@@ -24,7 +24,7 @@ static __inline__ long local_add_return(
 	long t;
 
 	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%2		# local_add_return\n\
+"1:"	PPC_LLARX(%0,0,%2,0) "			# local_add_return\n\
 	add	%0,%1,%0\n"
 	PPC405_ERR77(0,%2)
 	PPC_STLCX	"%0,0,%2 \n\
@@ -43,7 +43,7 @@ static __inline__ long local_sub_return(
 	long t;
 
 	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%2		# local_sub_return\n\
+"1:"	PPC_LLARX(%0,0,%2,0) "			# local_sub_return\n\
 	subf	%0,%1,%0\n"
 	PPC405_ERR77(0,%2)
 	PPC_STLCX	"%0,0,%2 \n\
@@ -60,7 +60,7 @@ static __inline__ long local_inc_return(
 	long t;
 
 	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%1		# local_inc_return\n\
+"1:"	PPC_LLARX(%0,0,%1,0) "			# local_inc_return\n\
 	addic	%0,%0,1\n"
 	PPC405_ERR77(0,%1)
 	PPC_STLCX	"%0,0,%1 \n\
@@ -87,7 +87,7 @@ static __inline__ long local_dec_return(
 	long t;
 
 	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%1		# local_dec_return\n\
+"1:"	PPC_LLARX(%0,0,%1,0) "			# local_dec_return\n\
 	addic	%0,%0,-1\n"
 	PPC405_ERR77(0,%1)
 	PPC_STLCX	"%0,0,%1\n\
@@ -117,7 +117,7 @@ static __inline__ int local_add_unless(l
 	long t;
 
 	__asm__ __volatile__ (
-"1:"	PPC_LLARX	"%0,0,%1		# local_add_unless\n\
+"1:"	PPC_LLARX(%0,0,%1,0) "			# local_add_unless\n\
 	cmpw	0,%0,%3 \n\
 	beq-	2f \n\
 	add	%0,%2,%0 \n"
@@ -147,7 +147,7 @@ static __inline__ long local_dec_if_posi
 	long t;
 
 	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%1		# local_dec_if_positive\n\
+"1:"	PPC_LLARX(%0,0,%1,0) "			# local_dec_if_positive\n\
 	cmpwi	%0,1\n\
 	addi	%0,%0,-1\n\
 	blt-	2f\n"
Index: powerpc.git/arch/powerpc/include/asm/ppc-opcode.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/ppc-opcode.h	2010-02-10 17:16:07.013074408 +1100
+++ powerpc.git/arch/powerpc/include/asm/ppc-opcode.h	2010-02-10 17:16:14.263072384 +1100
@@ -22,6 +22,7 @@
 #define PPC_INST_DCBZL			0x7c2007ec
 #define PPC_INST_ISEL			0x7c00001e
 #define PPC_INST_ISEL_MASK		0xfc00003e
+#define PPC_INST_LDARX			0x7c0000a8
 #define PPC_INST_LSWI			0x7c0004aa
 #define PPC_INST_LSWX			0x7c00042a
 #define PPC_INST_LWARX			0x7c000028
@@ -75,6 +76,9 @@
 					__PPC_RA(a) | __PPC_RB(b))
 #define	PPC_DCBZL(a, b)		stringify_in_c(.long PPC_INST_DCBZL | \
 					__PPC_RA(a) | __PPC_RB(b))
+#define PPC_LDARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LDARX | \
+					__PPC_RT(t) | __PPC_RA(a) | \
+					__PPC_RB(b) | __PPC_EH(eh))
 #define PPC_LWARX(t, a, b, eh)	stringify_in_c(.long PPC_INST_LWARX | \
 					__PPC_RT(t) | __PPC_RA(a) | \
 					__PPC_RB(b) | __PPC_EH(eh))


* [PATCH 3/6] powerpc: Convert open coded native hashtable bit lock
  2010-02-10 11:02 ` [PATCH 2/6] powerpc: Use lwarx/ldarx hint in bit locks Anton Blanchard
@ 2010-02-10 11:03   ` Anton Blanchard
  2010-02-10 11:04     ` [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER Anton Blanchard
  0 siblings, 1 reply; 19+ messages in thread
From: Anton Blanchard @ 2010-02-10 11:03 UTC (permalink / raw)
  To: benh; +Cc: npiggin, linuxppc-dev


Now that we have real bit locks, use them instead of open coding it.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 056d23a..9e1aa4f 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -122,7 +122,7 @@ static inline void native_lock_hpte(struct hash_pte *hptep)
 	unsigned long *word = &hptep->v;
 
 	while (1) {
-		if (!test_and_set_bit(HPTE_LOCK_BIT, word))
+		if (!test_and_set_bit_lock(HPTE_LOCK_BIT, word))
 			break;
 		while(test_bit(HPTE_LOCK_BIT, word))
 			cpu_relax();
@@ -133,8 +133,7 @@ static inline void native_unlock_hpte(struct hash_pte *hptep)
 {
 	unsigned long *word = &hptep->v;
 
-	asm volatile("lwsync":::"memory");
-	clear_bit(HPTE_LOCK_BIT, word);
+	clear_bit_unlock(HPTE_LOCK_BIT, word);
 }
 
 static long native_hpte_insert(unsigned long hpte_group, unsigned long va,


* [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER
  2010-02-10 11:03   ` [PATCH 3/6] powerpc: Convert open coded native hashtable bit lock Anton Blanchard
@ 2010-02-10 11:04     ` Anton Blanchard
  2010-02-10 11:07       ` [PATCH 5/6] powerpc: Fix lwsync patching code on 64bit Anton Blanchard
  2010-03-19  1:08       ` [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER Nick Piggin
  0 siblings, 2 replies; 19+ messages in thread
From: Anton Blanchard @ 2010-02-10 11:04 UTC (permalink / raw)
  To: benh; +Cc: npiggin, linuxppc-dev


For performance reasons we are about to change ISYNC_ON_SMP to sometimes be
lwsync. Since the macro name will then no longer make sense, rename it and LWSYNC_ON_SMP
to better explain what the barriers are doing.
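
(For anyone unfamiliar with the acquire/release terminology, a minimal
userspace sketch of the pattern the new names describe, using C11 atomics;
an illustration only, not kernel code:)

	#include <stdatomic.h>

	static atomic_flag lock = ATOMIC_FLAG_INIT;

	static void lock_acquire(void)
	{
		/* Acquire: no access after the lock may be reordered before it.
		 * On powerpc this corresponds to PPC_ACQUIRE_BARRIER (isync,
		 * and later in this series optionally lwsync). */
		while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
			;	/* spin */
	}

	static void lock_release(void)
	{
		/* Release: no access before the unlock may be reordered after it.
		 * On powerpc this is lwsync, i.e. PPC_RELEASE_BARRIER. */
		atomic_flag_clear_explicit(&lock, memory_order_release);
	}

	int main(void)
	{
		lock_acquire();
		lock_release();
		return 0;
	}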

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: powerpc.git/arch/powerpc/include/asm/atomic.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/atomic.h	2010-02-10 17:12:30.264322204 +1100
+++ powerpc.git/arch/powerpc/include/asm/atomic.h	2010-02-10 17:13:05.355571902 +1100
@@ -49,13 +49,13 @@ static __inline__ int atomic_add_return(
 	int t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx	%0,0,%2		# atomic_add_return\n\
 	add	%0,%1,%0\n"
 	PPC405_ERR77(0,%2)
 "	stwcx.	%0,0,%2 \n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (t)
 	: "r" (a), "r" (&v->counter)
 	: "cc", "memory");
@@ -85,13 +85,13 @@ static __inline__ int atomic_sub_return(
 	int t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx	%0,0,%2		# atomic_sub_return\n\
 	subf	%0,%1,%0\n"
 	PPC405_ERR77(0,%2)
 "	stwcx.	%0,0,%2 \n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (t)
 	: "r" (a), "r" (&v->counter)
 	: "cc", "memory");
@@ -119,13 +119,13 @@ static __inline__ int atomic_inc_return(
 	int t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx	%0,0,%1		# atomic_inc_return\n\
 	addic	%0,%0,1\n"
 	PPC405_ERR77(0,%1)
 "	stwcx.	%0,0,%1 \n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (t)
 	: "r" (&v->counter)
 	: "cc", "xer", "memory");
@@ -163,13 +163,13 @@ static __inline__ int atomic_dec_return(
 	int t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx	%0,0,%1		# atomic_dec_return\n\
 	addic	%0,%0,-1\n"
 	PPC405_ERR77(0,%1)
 "	stwcx.	%0,0,%1\n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (t)
 	: "r" (&v->counter)
 	: "cc", "xer", "memory");
@@ -194,7 +194,7 @@ static __inline__ int atomic_add_unless(
 	int t;
 
 	__asm__ __volatile__ (
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx	%0,0,%1		# atomic_add_unless\n\
 	cmpw	0,%0,%3 \n\
 	beq-	2f \n\
@@ -202,7 +202,7 @@ static __inline__ int atomic_add_unless(
 	PPC405_ERR77(0,%2)
 "	stwcx.	%0,0,%1 \n\
 	bne-	1b \n"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 "	subf	%0,%2,%0 \n\
 2:"
 	: "=&r" (t)
@@ -227,7 +227,7 @@ static __inline__ int atomic_dec_if_posi
 	int t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx	%0,0,%1		# atomic_dec_if_positive\n\
 	cmpwi	%0,1\n\
 	addi	%0,%0,-1\n\
@@ -235,7 +235,7 @@ static __inline__ int atomic_dec_if_posi
 	PPC405_ERR77(0,%1)
 "	stwcx.	%0,0,%1\n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	"\n\
 2:"	: "=&b" (t)
 	: "r" (&v->counter)
@@ -286,12 +286,12 @@ static __inline__ long atomic64_add_retu
 	long t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	ldarx	%0,0,%2		# atomic64_add_return\n\
 	add	%0,%1,%0\n\
 	stdcx.	%0,0,%2 \n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (t)
 	: "r" (a), "r" (&v->counter)
 	: "cc", "memory");
@@ -320,12 +320,12 @@ static __inline__ long atomic64_sub_retu
 	long t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	ldarx	%0,0,%2		# atomic64_sub_return\n\
 	subf	%0,%1,%0\n\
 	stdcx.	%0,0,%2 \n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (t)
 	: "r" (a), "r" (&v->counter)
 	: "cc", "memory");
@@ -352,12 +352,12 @@ static __inline__ long atomic64_inc_retu
 	long t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	ldarx	%0,0,%1		# atomic64_inc_return\n\
 	addic	%0,%0,1\n\
 	stdcx.	%0,0,%1 \n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (t)
 	: "r" (&v->counter)
 	: "cc", "xer", "memory");
@@ -394,12 +394,12 @@ static __inline__ long atomic64_dec_retu
 	long t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	ldarx	%0,0,%1		# atomic64_dec_return\n\
 	addic	%0,%0,-1\n\
 	stdcx.	%0,0,%1\n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (t)
 	: "r" (&v->counter)
 	: "cc", "xer", "memory");
@@ -419,13 +419,13 @@ static __inline__ long atomic64_dec_if_p
 	long t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	ldarx	%0,0,%1		# atomic64_dec_if_positive\n\
 	addic.	%0,%0,-1\n\
 	blt-	2f\n\
 	stdcx.	%0,0,%1\n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	"\n\
 2:"	: "=&r" (t)
 	: "r" (&v->counter)
@@ -451,14 +451,14 @@ static __inline__ int atomic64_add_unles
 	long t;
 
 	__asm__ __volatile__ (
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	ldarx	%0,0,%1		# atomic_add_unless\n\
 	cmpd	0,%0,%3 \n\
 	beq-	2f \n\
 	add	%0,%2,%0 \n"
 "	stdcx.	%0,0,%1 \n\
 	bne-	1b \n"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 "	subf	%0,%2,%0 \n\
 2:"
 	: "=&r" (t)
Index: powerpc.git/arch/powerpc/include/asm/futex.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/futex.h	2010-02-10 17:12:30.274322058 +1100
+++ powerpc.git/arch/powerpc/include/asm/futex.h	2010-02-10 17:13:05.355571902 +1100
@@ -11,7 +11,7 @@
 
 #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \
   __asm__ __volatile ( \
-	LWSYNC_ON_SMP \
+	PPC_RELEASE_BARRIER \
 "1:	lwarx	%0,0,%2\n" \
 	insn \
 	PPC405_ERR77(0, %2) \
@@ -90,14 +90,14 @@ futex_atomic_cmpxchg_inatomic(int __user
 		return -EFAULT;
 
         __asm__ __volatile__ (
-        LWSYNC_ON_SMP
+        PPC_RELEASE_BARRIER
 "1:     lwarx   %0,0,%2         # futex_atomic_cmpxchg_inatomic\n\
         cmpw    0,%0,%3\n\
         bne-    3f\n"
         PPC405_ERR77(0,%2)
 "2:     stwcx.  %4,0,%2\n\
         bne-    1b\n"
-        ISYNC_ON_SMP
+        PPC_ACQUIRE_BARRIER
 "3:	.section .fixup,\"ax\"\n\
 4:	li	%0,%5\n\
 	b	3b\n\
Index: powerpc.git/arch/powerpc/include/asm/mutex.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/mutex.h	2010-02-10 17:12:30.244322286 +1100
+++ powerpc.git/arch/powerpc/include/asm/mutex.h	2010-02-10 17:13:05.355571902 +1100
@@ -15,7 +15,7 @@ static inline int __mutex_cmpxchg_lock(a
 	PPC405_ERR77(0,%1)
 "	stwcx.	%3,0,%1\n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	"\n\
 2:"
 	: "=&r" (t)
@@ -35,7 +35,7 @@ static inline int __mutex_dec_return_loc
 	PPC405_ERR77(0,%1)
 "	stwcx.	%0,0,%1\n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (t)
 	: "r" (&v->counter)
 	: "cc", "memory");
@@ -48,7 +48,7 @@ static inline int __mutex_inc_return_unl
 	int t;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx	%0,0,%1		# mutex unlock\n\
 	addic	%0,%0,1\n"
 	PPC405_ERR77(0,%1)
Index: powerpc.git/arch/powerpc/include/asm/synch.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/synch.h	2010-02-10 17:12:30.274322058 +1100
+++ powerpc.git/arch/powerpc/include/asm/synch.h	2010-02-10 17:13:05.355571902 +1100
@@ -37,11 +37,11 @@ static inline void isync(void)
 #endif
 
 #ifdef CONFIG_SMP
-#define ISYNC_ON_SMP	"\n\tisync\n"
-#define LWSYNC_ON_SMP	stringify_in_c(LWSYNC) "\n"
+#define PPC_ACQUIRE_BARRIER	"\n\tisync\n"
+#define PPC_RELEASE_BARRIER	stringify_in_c(LWSYNC) "\n"
 #else
-#define ISYNC_ON_SMP
-#define LWSYNC_ON_SMP
+#define PPC_ACQUIRE_BARRIER
+#define PPC_RELEASE_BARRIER
 #endif
 
 #endif /* __KERNEL__ */
Index: powerpc.git/arch/powerpc/include/asm/bitops.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/bitops.h	2010-02-10 17:12:50.094322149 +1100
+++ powerpc.git/arch/powerpc/include/asm/bitops.h	2010-02-10 17:13:15.125572158 +1100
@@ -78,7 +78,7 @@ static __inline__ void fn(unsigned long 
 
 DEFINE_BITOP(set_bits, or, "", "")
 DEFINE_BITOP(clear_bits, andc, "", "")
-DEFINE_BITOP(clear_bits_unlock, andc, LWSYNC_ON_SMP, "")
+DEFINE_BITOP(clear_bits_unlock, andc, PPC_RELEASE_BARRIER, "")
 DEFINE_BITOP(change_bits, xor, "", "")
 
 static __inline__ void set_bit(int nr, volatile unsigned long *addr)
@@ -124,10 +124,14 @@ static __inline__ unsigned long fn(			\
 	return (old & mask);				\
 }
 
-DEFINE_TESTOP(test_and_set_bits, or, LWSYNC_ON_SMP, ISYNC_ON_SMP, 0)
-DEFINE_TESTOP(test_and_set_bits_lock, or, "", ISYNC_ON_SMP, 1)
-DEFINE_TESTOP(test_and_clear_bits, andc, LWSYNC_ON_SMP, ISYNC_ON_SMP, 0)
-DEFINE_TESTOP(test_and_change_bits, xor, LWSYNC_ON_SMP, ISYNC_ON_SMP, 0)
+DEFINE_TESTOP(test_and_set_bits, or, PPC_RELEASE_BARRIER,
+	      PPC_ACQUIRE_BARRIER, 0)
+DEFINE_TESTOP(test_and_set_bits_lock, or, "",
+	      PPC_ACQUIRE_BARRIER, 1)
+DEFINE_TESTOP(test_and_clear_bits, andc, PPC_RELEASE_BARRIER,
+	      PPC_ACQUIRE_BARRIER, 0)
+DEFINE_TESTOP(test_and_change_bits, xor, PPC_RELEASE_BARRIER,
+	      PPC_ACQUIRE_BARRIER, 0)
 
 static __inline__ int test_and_set_bit(unsigned long nr,
 				       volatile unsigned long *addr)
@@ -158,7 +162,7 @@ static __inline__ int test_and_change_bi
 
 static __inline__ void __clear_bit_unlock(int nr, volatile unsigned long *addr)
 {
-	__asm__ __volatile__(LWSYNC_ON_SMP "" ::: "memory");
+	__asm__ __volatile__(PPC_RELEASE_BARRIER "" ::: "memory");
 	__clear_bit(nr, addr);
 }
 
Index: powerpc.git/arch/powerpc/include/asm/system.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/system.h	2010-02-10 17:12:30.284321702 +1100
+++ powerpc.git/arch/powerpc/include/asm/system.h	2010-02-10 17:13:05.355571902 +1100
@@ -232,12 +232,12 @@ __xchg_u32(volatile void *p, unsigned lo
 	unsigned long prev;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx	%0,0,%2 \n"
 	PPC405_ERR77(0,%2)
 "	stwcx.	%3,0,%2 \n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
 	: "r" (p), "r" (val)
 	: "cc", "memory");
@@ -275,12 +275,12 @@ __xchg_u64(volatile void *p, unsigned lo
 	unsigned long prev;
 
 	__asm__ __volatile__(
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	ldarx	%0,0,%2 \n"
 	PPC405_ERR77(0,%2)
 "	stdcx.	%3,0,%2 \n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	: "=&r" (prev), "+m" (*(volatile unsigned long *)p)
 	: "r" (p), "r" (val)
 	: "cc", "memory");
@@ -366,14 +366,14 @@ __cmpxchg_u32(volatile unsigned int *p, 
 	unsigned int prev;
 
 	__asm__ __volatile__ (
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx	%0,0,%2		# __cmpxchg_u32\n\
 	cmpw	0,%0,%3\n\
 	bne-	2f\n"
 	PPC405_ERR77(0,%2)
 "	stwcx.	%4,0,%2\n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	"\n\
 2:"
 	: "=&r" (prev), "+m" (*p)
@@ -412,13 +412,13 @@ __cmpxchg_u64(volatile unsigned long *p,
 	unsigned long prev;
 
 	__asm__ __volatile__ (
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	ldarx	%0,0,%2		# __cmpxchg_u64\n\
 	cmpd	0,%0,%3\n\
 	bne-	2f\n\
 	stdcx.	%4,0,%2\n\
 	bne-	1b"
-	ISYNC_ON_SMP
+	PPC_ACQUIRE_BARRIER
 	"\n\
 2:"
 	: "=&r" (prev), "+m" (*p)
Index: powerpc.git/arch/powerpc/include/asm/spinlock.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/spinlock.h	2010-02-10 17:12:48.464321650 +1100
+++ powerpc.git/arch/powerpc/include/asm/spinlock.h	2010-02-10 17:13:05.355571902 +1100
@@ -65,9 +65,10 @@ static inline unsigned long __arch_spin_
 	cmpwi		0,%0,0\n\
 	bne-		2f\n\
 	stwcx.		%1,0,%2\n\
-	bne-		1b\n\
-	isync\n\
-2:"	: "=&r" (tmp)
+	bne-		1b\n"
+	PPC_ACQUIRE_BARRIER
+"2:"
+	: "=&r" (tmp)
 	: "r" (token), "r" (&lock->slock)
 	: "cr0", "memory");
 
@@ -145,7 +146,7 @@ static inline void arch_spin_unlock(arch
 {
 	SYNC_IO;
 	__asm__ __volatile__("# arch_spin_unlock\n\t"
-				LWSYNC_ON_SMP: : :"memory");
+				PPC_RELEASE_BARRIER: : :"memory");
 	lock->slock = 0;
 }
 
@@ -193,9 +194,9 @@ static inline long __arch_read_trylock(a
 	ble-		2f\n"
 	PPC405_ERR77(0,%1)
 "	stwcx.		%0,0,%1\n\
-	bne-		1b\n\
-	isync\n\
-2:"	: "=&r" (tmp)
+	bne-		1b\n"
+	PPC_ACQUIRE_BARRIER
+"2:"	: "=&r" (tmp)
 	: "r" (&rw->lock)
 	: "cr0", "xer", "memory");
 
@@ -217,9 +218,9 @@ static inline long __arch_write_trylock(
 	bne-		2f\n"
 	PPC405_ERR77(0,%1)
 "	stwcx.		%1,0,%2\n\
-	bne-		1b\n\
-	isync\n\
-2:"	: "=&r" (tmp)
+	bne-		1b\n"
+	PPC_ACQUIRE_BARRIER
+"2:"	: "=&r" (tmp)
 	: "r" (token), "r" (&rw->lock)
 	: "cr0", "memory");
 
@@ -270,7 +271,7 @@ static inline void arch_read_unlock(arch
 
 	__asm__ __volatile__(
 	"# read_unlock\n\t"
-	LWSYNC_ON_SMP
+	PPC_RELEASE_BARRIER
 "1:	lwarx		%0,0,%1\n\
 	addic		%0,%0,-1\n"
 	PPC405_ERR77(0,%1)
@@ -284,7 +285,7 @@ static inline void arch_read_unlock(arch
 static inline void arch_write_unlock(arch_rwlock_t *rw)
 {
 	__asm__ __volatile__("# write_unlock\n\t"
-				LWSYNC_ON_SMP: : :"memory");
+				PPC_RELEASE_BARRIER: : :"memory");
 	rw->lock = 0;
 }
 


* [PATCH 5/6] powerpc: Fix lwsync patching code on 64bit
  2010-02-10 11:04     ` [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER Anton Blanchard
@ 2010-02-10 11:07       ` Anton Blanchard
  2010-02-10 11:10         ` [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it Anton Blanchard
  2010-03-19  1:08       ` [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER Nick Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Anton Blanchard @ 2010-02-10 11:07 UTC (permalink / raw)
  To: benh; +Cc: npiggin, linuxppc-dev


do_lwsync_fixups doesn't work on 64bit; we end up writing lwsyncs to the
wrong addresses:

0:mon> di c0000001000bfacc
c0000001000bfacc  7c2004ac      lwsync

Since the lwsync section has negative offsets, we need to use a signed int
pointer so that the value is sign extended.
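
(A standalone illustration of the underlying C issue; this is not the kernel's
fixup loop, just the pointer arithmetic that goes wrong with an unsigned type:)

	#include <stdio.h>

	int main(void)
	{
		char buf[16];
		char *base = &buf[8];
		int soff = -4;			/* offset read via a signed pointer */
		unsigned int uoff = -4;		/* same bits read as unsigned */

		/* Signed: stays -4 when widened, lands back inside buf. */
		printf("signed:   %p\n", (void *)(base + soff));

		/* Unsigned: becomes 0xfffffffc, so on a 64bit machine the
		 * target is nearly 4GB beyond buf instead of 4 bytes before
		 * base. On 32bit the arithmetic happened to wrap correctly. */
		printf("unsigned: %p\n", (void *)(base + uoff));

		return 0;
	}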

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: powerpc.git/arch/powerpc/lib/feature-fixups.c
===================================================================
--- powerpc.git.orig/arch/powerpc/lib/feature-fixups.c	2010-02-10 18:34:32.645573355 +1100
+++ powerpc.git/arch/powerpc/lib/feature-fixups.c	2010-02-10 18:35:18.525571521 +1100
@@ -112,7 +112,7 @@ void do_feature_fixups(unsigned long val
 
 void do_lwsync_fixups(unsigned long value, void *fixup_start, void *fixup_end)
 {
-	unsigned int *start, *end, *dest;
+	int *start, *end, *dest;
 
 	if (!(value & CPU_FTR_LWSYNC))
 		return ;


* [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it
  2010-02-10 11:07       ` [PATCH 5/6] powerpc: Fix lwsync patching code on 64bit Anton Blanchard
@ 2010-02-10 11:10         ` Anton Blanchard
  2010-02-11  7:09           ` Nick Piggin
  2010-02-16  4:22           ` Olof Johansson
  0 siblings, 2 replies; 19+ messages in thread
From: Anton Blanchard @ 2010-02-10 11:10 UTC (permalink / raw)
  To: benh; +Cc: npiggin, linuxppc-dev


Nick Piggin discovered that lwsync barriers around locks were faster than isync
on 970. That was a long time ago and I completely dropped the ball in testing
his patches across other ppc64 processors.

Turns out the idea helps on other chips. Using a microbenchmark that
uses a lot of threads to contend on a global pthread mutex (and therefore a
global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
and while I couldn't measure an improvement, there was no regression.

This patch uses the lwsync patching code to replace the isyncs with lwsyncs
on CPUs that support the instruction. We were marking POWER3 and RS64 as lwsync
capable but in reality they treat it as a full sync (ie slow). Remove the
CPU_FTR_LWSYNC bit from these CPUs so they continue to use the faster isync
method.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: powerpc.git/arch/powerpc/include/asm/synch.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/synch.h	2010-02-10 17:16:44.666823141 +1100
+++ powerpc.git/arch/powerpc/include/asm/synch.h	2010-02-10 17:16:45.123073249 +1100
@@ -37,7 +37,11 @@ static inline void isync(void)
 #endif
 
 #ifdef CONFIG_SMP
-#define PPC_ACQUIRE_BARRIER	"\n\tisync\n"
+#define __PPC_ACQUIRE_BARRIER				\
+	START_LWSYNC_SECTION(97);			\
+	isync;						\
+	MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
+#define PPC_ACQUIRE_BARRIER	"\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
 #define PPC_RELEASE_BARRIER	stringify_in_c(LWSYNC) "\n"
 #else
 #define PPC_ACQUIRE_BARRIER
Index: powerpc.git/arch/powerpc/include/asm/cputable.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/cputable.h	2010-02-10 17:15:59.816823326 +1100
+++ powerpc.git/arch/powerpc/include/asm/cputable.h	2010-02-10 17:16:45.123073249 +1100
@@ -381,9 +381,9 @@ extern const char *powerpc_base_platform
 #define CPU_FTRS_GENERIC_32	(CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN)
 
 /* 64-bit CPUs */
-#define CPU_FTRS_POWER3	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
+#define CPU_FTRS_POWER3	(CPU_FTR_USE_TB | \
 	    CPU_FTR_IABR | CPU_FTR_PPC_LE)
-#define CPU_FTRS_RS64	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
+#define CPU_FTRS_RS64	(CPU_FTR_USE_TB | \
 	    CPU_FTR_IABR | \
 	    CPU_FTR_MMCRA | CPU_FTR_CTRL)
 #define CPU_FTRS_POWER4	(CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \


* Re: [PATCH 1/6] powerpc: Use lwarx hint in spinlocks
  2010-02-10 10:57 [PATCH 1/6] powerpc: Use lwarx hint in spinlocks Anton Blanchard
  2010-02-10 11:02 ` [PATCH 2/6] powerpc: Use lwarx/ldarx hint in bit locks Anton Blanchard
@ 2010-02-11  6:56 ` Nick Piggin
  2010-02-17  9:37   ` Anton Blanchard
  2010-02-16  4:16 ` Olof Johansson
  2 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2010-02-11  6:56 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linuxppc-dev

On Wed, Feb 10, 2010 at 09:57:28PM +1100, Anton Blanchard wrote:
> 
> Recent versions of the PowerPC architecture added a hint bit to the larx
> instructions to differentiate between an atomic operation and a lock operation:
> 
> > 0 Other programs might attempt to modify the word in storage addressed by EA
> > even if the subsequent Store Conditional succeeds.
> > 
> > 1 Other programs will not attempt to modify the word in storage addressed by
> > EA until the program that has acquired the lock performs a subsequent store
> > releasing the lock.
> 
> To avoid a binutils dependency, this patch creates a macro for the extended lwarx
> format and uses it in the spinlock code. To test this change I used a simple
> test case that acquires and releases a global pthread mutex:
> 
> 	pthread_mutex_lock(&mutex);
> 	pthread_mutex_unlock(&mutex);
> 
> On a 32 core POWER6 running 32 test threads, we spend almost all our time in
> the futex spinlock code:
> 
>     94.37%     perf  [kernel]                     [k] ._raw_spin_lock
>                |          
>                |--99.95%-- ._raw_spin_lock
>                |          |          
>                |          |--63.29%-- .futex_wake
>                |          |          
>                |          |--36.64%-- .futex_wait_setup
> 
> This makes it a good test for this patch. The results (in lock/unlock operations per
> second) are:
> 
> before: 1538203 ops/sec
> after:  2189219 ops/sec
> 
> An improvement of 42%
> 
> A 32 core POWER7 improves even more:
> 
> before: 1279529 ops/sec
> after:  2282076 ops/sec
> 
> An improvement of 78%

Cool. How does it go when there is a significant number of instructions
between the lock and the unlock? A real(ish) workload, like dbench on
ramdisk (which should hit the dcache lock).


* Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it
  2010-02-10 11:10         ` [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it Anton Blanchard
@ 2010-02-11  7:09           ` Nick Piggin
  2010-02-17  9:43             ` Anton Blanchard
  2010-02-16  4:22           ` Olof Johansson
  1 sibling, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2010-02-11  7:09 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linuxppc-dev

On Wed, Feb 10, 2010 at 10:10:25PM +1100, Anton Blanchard wrote:
> 
> Nick Piggin discovered that lwsync barriers around locks were faster than isync
> on 970. That was a long time ago and I completely dropped the ball in testing
> his patches across other ppc64 processors.
> 
> Turns out the idea helps on other chips. Using a microbenchmark that
> uses a lot of threads to contend on a global pthread mutex (and therefore a
> global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
> and while I couldn't measure an improvement, there was no regression.

Ah, good to see this one come back. I also tested tbench over localhost
btw, which actually did show some speedup on the G5.

BTW. this was the last thing left:
http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html

Don't know if you took a look at that again, but maybe it's worth
looking at. Hmm, we do actually seem to be growing the number of smp_mb*
calls in the core kernel too.


* Re: [PATCH 1/6] powerpc: Use lwarx hint in spinlocks
  2010-02-10 10:57 [PATCH 1/6] powerpc: Use lwarx hint in spinlocks Anton Blanchard
  2010-02-10 11:02 ` [PATCH 2/6] powerpc: Use lwarx/ldarx hint in bit locks Anton Blanchard
  2010-02-11  6:56 ` [PATCH 1/6] powerpc: Use lwarx hint in spinlocks Nick Piggin
@ 2010-02-16  4:16 ` Olof Johansson
  2 siblings, 0 replies; 19+ messages in thread
From: Olof Johansson @ 2010-02-16  4:16 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: npiggin, linuxppc-dev

On Wed, Feb 10, 2010 at 09:57:28PM +1100, Anton Blanchard wrote:

> v2: We do this only for 64bit until we can verify all 32bit CPUs.
> 
> Tested so far: 970 (thanks Ben), POWER5, POWER6, POWER7
> Still to test: RS64, POWER3, POWER4

Tested on PA6T as well, no performance impact (as expected).


-Olof


* Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it
  2010-02-16  4:22           ` Olof Johansson
@ 2010-02-16  4:19             ` Benjamin Herrenschmidt
  2010-02-16  6:07               ` Olof Johansson
  0 siblings, 1 reply; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-16  4:19 UTC (permalink / raw)
  To: Olof Johansson; +Cc: npiggin, linuxppc-dev, Anton Blanchard

On Mon, 2010-02-15 at 22:22 -0600, Olof Johansson wrote:
> 
> Turns out this one hurts PA6T performance quite a bit, lwsync seems to be
> significantly more expensive there. I see a 25% drop in the microbenchmark
> doing pthread_lock/unlock loops on two cpus.
> 
> Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate since
> the sync->lwsync changes definitely still can, and should, be done. 

So we should use a different feature bit. No biggie. If needed we can
split them more anyways.

Cheers,
Ben.


* Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it
  2010-02-10 11:10         ` [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it Anton Blanchard
  2010-02-11  7:09           ` Nick Piggin
@ 2010-02-16  4:22           ` Olof Johansson
  2010-02-16  4:19             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 19+ messages in thread
From: Olof Johansson @ 2010-02-16  4:22 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: npiggin, linuxppc-dev

On Wed, Feb 10, 2010 at 10:10:25PM +1100, Anton Blanchard wrote:
> 
> Nick Piggin discovered that lwsync barriers around locks were faster than isync
> on 970. That was a long time ago and I completely dropped the ball in testing
> his patches across other ppc64 processors.
> 
> Turns out the idea helps on other chips. Using a microbenchmark that
> uses a lot of threads to contend on a global pthread mutex (and therefore a
> global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
> and while I couldn't measure an improvement, there was no regression.
> 
> This patch uses the lwsync patching code to replace the isyncs with lwsyncs
> on CPUs that support the instruction. We were marking POWER3 and RS64 as lwsync
> capable but in reality they treat it as a full sync (ie slow). Remove the
> CPU_FTR_LWSYNC bit from these CPUs so they continue to use the faster isync
> method.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>

Turns out this one hurts PA6T performance quite a bit, lwsync seems to be
significantly more expensive there. I see a 25% drop in the microbenchmark
doing pthread_lock/unlock loops on two cpus.

Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate since
the sync->lwsync changes definitely still can, and should, be done.


-Olof


* Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it
  2010-02-16  4:19             ` Benjamin Herrenschmidt
@ 2010-02-16  6:07               ` Olof Johansson
  0 siblings, 0 replies; 19+ messages in thread
From: Olof Johansson @ 2010-02-16  6:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: npiggin, linuxppc-dev, Anton Blanchard

On Tue, Feb 16, 2010 at 03:19:03PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2010-02-15 at 22:22 -0600, Olof Johansson wrote:
> > 
> > Turns out this one hurts PA6T performance quite a bit, lwsync seems to be
> > significantly more expensive there. I see a 25% drop in the microbenchmark
> > doing pthread_lock/unlock loops on two cpus.
> > 
> > Taking out the CPU_FTR_LWSYNC will solve it, it's a bit unfortunate since
> > the sync->lwsync changes definitely still can, and should, be done. 
> 
> So we should use a different feature bit. No biggie. If needed we can
> split them more anyways.

Yeah, fine with me.


-Olof


* Re: [PATCH 1/6] powerpc: Use lwarx hint in spinlocks
  2010-02-11  6:56 ` [PATCH 1/6] powerpc: Use lwarx hint in spinlocks Nick Piggin
@ 2010-02-17  9:37   ` Anton Blanchard
  2010-02-17 10:22     ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Anton Blanchard @ 2010-02-17  9:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linuxppc-dev

 
Hi Nick,

> Cool. How does it go when there are significant amount of instructions
> between the lock and the unlock? A real(ish) workload, like dbench on
> ramdisk (which should hit the dcache lock).

Good question, I'll see if we can see a difference on dbench.

Anton


* Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it
  2010-02-11  7:09           ` Nick Piggin
@ 2010-02-17  9:43             ` Anton Blanchard
  2010-02-17 10:41               ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Anton Blanchard @ 2010-02-17  9:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linuxppc-dev

 
Hi Nick,

> Ah, good to see this one come back. I also tested tbench over localhost
> btw which actually did show some speedup on the G5. 
> 
> BTW. this was the last thing left:
> http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html
> 
> Don't know if you took a look at that again, but maybe it's worth
> looking at. Hmm, we do actually seem to be growing number of smp_mb*
> calls in core kernel too.

Interesting idea! I do worry we will get a late night visit from the
architecture police. From memory, they want the complete larx, stwcx., bne,
isync sequence to guarantee an acquire barrier.
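
(For illustration, the kind of sequence being referred to, written with the
four operand lwarx form that needs a recent binutils; in the kernel the
PPC_LWARX() macro hand-encodes it instead:)

	static inline void spin_lock_example(unsigned int *lock)
	{
		unsigned int tmp;

		__asm__ __volatile__(
	"1:	lwarx	%0,0,%2,1\n"	/* load-reserve, EH=1 lock hint */
	"	cmpwi	0,%0,0\n"	/* already held? */
	"	bne-	1b\n"		/* yes: keep spinning (simplified) */
	"	stwcx.	%1,0,%2\n"	/* try to store the lock token */
	"	bne-	1b\n"		/* lost the reservation: retry */
	"	isync\n"		/* the acquire barrier in question */
		: "=&r" (tmp)
		: "r" (1), "r" (lock)
		: "cr0", "memory");
	}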

Anton


* Re: [PATCH 1/6] powerpc: Use lwarx hint in spinlocks
  2010-02-17  9:37   ` Anton Blanchard
@ 2010-02-17 10:22     ` Nick Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2010-02-17 10:22 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linuxppc-dev

On Wed, Feb 17, 2010 at 08:37:14PM +1100, Anton Blanchard wrote:
>  
> Hi Nick,
> 
> > Cool. How does it go when there are significant amount of instructions
> > between the lock and the unlock? A real(ish) workload, like dbench on
> > ramdisk (which should hit the dcache lock).
> 
> Good question, I'll see if we can see a difference on dbench.

Well, I misread your benchmark. Your test is using pthread mutexes in
order to basically exercise the kernel's syscall, context switching and
futex code, whereas I thought it was just a trivial lock/unlock
sequence being tested.

So I'm much more impressed with your numbers :) dbench would still
be interesting though.


* Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it
  2010-02-17  9:43             ` Anton Blanchard
@ 2010-02-17 10:41               ` Nick Piggin
  2010-02-17 12:12                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2010-02-17 10:41 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linuxppc-dev

On Wed, Feb 17, 2010 at 08:43:14PM +1100, Anton Blanchard wrote:
>  
> Hi Nick,
> 
> > Ah, good to see this one come back. I also tested tbench over localhost
> > btw which actually did show some speedup on the G5. 
> > 
> > BTW. this was the last thing left:
> > http://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg29738.html
> > 
> > Don't know if you took a look at that again, but maybe it's worth
> > looking at. Hmm, we do actually seem to be growing number of smp_mb*
> > calls in core kernel too.
> 
> Interesting idea! I do worry we will get a late night visit from the
> architecture police. From memory they want the complete larx, stcwx, bne,
> isync sequence to guarantee an acquire barrier.

Yes, I suppose the branch can be executed "non speculatively" before the
lwsync is completed, whereas the larx/stwcx. will have to complete before
the branch outcome can be known. I suppose it's probably not worthwhile
avoiding the full IO sync by adding yet more crap to make this work.

Thanks for putting my mind to rest though :)

I'd still be interested to know how expensive the full sync is when you
have a lot of IOs in flight.

Question: are you going to do the hint and isync->lwsync thing for
userspace as well? Too bad the kernel doesn't export synchronisation
primitives to userspace...


* Re: [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it
  2010-02-17 10:41               ` Nick Piggin
@ 2010-02-17 12:12                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 12:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linuxppc-dev, Anton Blanchard


> Yes I suppose the branch can be executed "non speculatively" before the 
> lwsync is completed. Wheras the larx/stcwx will have to complete before
> the branch outcome can be known. I suppose probably not worthwhile
> avoiding the full IO sync by adding yet more crap to make this work.
> 
> Thanks for putting my mind to rest though :)
> 
> I'd still be interested to know how expensive the full sync is when you
> have a lot of IOs in flight.
> 
> Question, are you going to do the hint and isync->lwsync thing for
> userspace as well? Too bad the kernel doesn't export synchronisation
> primitives to userspace...

We'd love to.... the vdso inherits all the dynamic patching
features of the kernel and more...

It's really a matter of getting glibc to grok it, and doing so
efficiently (ie, not just having a function pointer in glibc; it might be a
good use of the IFUNC stuff).
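
(A rough sketch of what that could look like with GNU IFUNC; the function
names and the feature probe below are made up for illustration:)

	/* Two lock implementations; bodies omitted. */
	static void lock_isync(unsigned int *l)  { (void)l; /* isync based  */ }
	static void lock_lwsync(unsigned int *l) { (void)l; /* lwsync based */ }

	static int cpu_prefers_lwsync(void)	/* hypothetical feature probe */
	{
		return 1;	/* e.g. look at AT_HWCAP / the CPU type here */
	}

	/* The resolver runs once, when the symbol is first bound, and the
	 * returned function is what callers of my_lock() end up using. */
	static void (*resolve_lock(void))(unsigned int *)
	{
		return cpu_prefers_lwsync() ? lock_lwsync : lock_isync;
	}

	void my_lock(unsigned int *l) __attribute__((ifunc("resolve_lock")));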

Cheers,
Ben.


* Re: [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER
  2010-02-10 11:04     ` [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER Anton Blanchard
  2010-02-10 11:07       ` [PATCH 5/6] powerpc: Fix lwsync patching code on 64bit Anton Blanchard
@ 2010-03-19  1:08       ` Nick Piggin
  2010-03-19  1:36         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2010-03-19  1:08 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linuxppc-dev

On Wed, Feb 10, 2010 at 10:04:06PM +1100, Anton Blanchard wrote:
> 
> For performance reasons we are about to change ISYNC_ON_SMP to sometimes be
> lwsync. Since the macro name will then no longer make sense, rename it and LWSYNC_ON_SMP
> to better explain what the barriers are doing.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
> 
> Index: powerpc.git/arch/powerpc/include/asm/atomic.h
> ===================================================================
> --- powerpc.git.orig/arch/powerpc/include/asm/atomic.h	2010-02-10 17:12:30.264322204 +1100
> +++ powerpc.git/arch/powerpc/include/asm/atomic.h	2010-02-10 17:13:05.355571902 +1100
> @@ -49,13 +49,13 @@ static __inline__ int atomic_add_return(
>  	int t;
>  
>  	__asm__ __volatile__(
> -	LWSYNC_ON_SMP
> +	PPC_RELEASE_BARRIER
>  "1:	lwarx	%0,0,%2		# atomic_add_return\n\
>  	add	%0,%1,%0\n"
>  	PPC405_ERR77(0,%2)
>  "	stwcx.	%0,0,%2 \n\
>  	bne-	1b"
> -	ISYNC_ON_SMP
> +	PPC_ACQUIRE_BARRIER

I wonder if this shouldn't be called PPC_ISYNC_ACQUIRE_BARRIER?

Unlike PPC_RELEASE_BARRIER, it is not an acquire barrier unless it
is used like an isync.


* Re: [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER
  2010-03-19  1:08       ` [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER Nick Piggin
@ 2010-03-19  1:36         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-19  1:36 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linuxppc-dev, Anton Blanchard

On Fri, 2010-03-19 at 12:08 +1100, Nick Piggin wrote:
> > -     ISYNC_ON_SMP
> > +     PPC_ACQUIRE_BARRIER
> 
> I wonder if this shouldn't be called PPC_ISYNC_ACQUIRE_BARRIER ?
> 
> Unlike PPC_RELEASE_BARRIER, it is not an acquire barrier unless it
> is used like an isync.

Right. The semantics of isync would be closer to something like
PPC_IFETCH_BARRIER or similar :-)

Cheers,
Ben.


Thread overview: 19+ messages
2010-02-10 10:57 [PATCH 1/6] powerpc: Use lwarx hint in spinlocks Anton Blanchard
2010-02-10 11:02 ` [PATCH 2/6] powerpc: Use lwarx/ldarx hint in bit locks Anton Blanchard
2010-02-10 11:03   ` [PATCH 3/6] powerpc: Convert open coded native hashtable bit lock Anton Blanchard
2010-02-10 11:04     ` [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER Anton Blanchard
2010-02-10 11:07       ` [PATCH 5/6] powerpc: Fix lwsync patching code on 64bit Anton Blanchard
2010-02-10 11:10         ` [PATCH 6/6] powerpc: Use lwsync for acquire barrier if CPU supports it Anton Blanchard
2010-02-11  7:09           ` Nick Piggin
2010-02-17  9:43             ` Anton Blanchard
2010-02-17 10:41               ` Nick Piggin
2010-02-17 12:12                 ` Benjamin Herrenschmidt
2010-02-16  4:22           ` Olof Johansson
2010-02-16  4:19             ` Benjamin Herrenschmidt
2010-02-16  6:07               ` Olof Johansson
2010-03-19  1:08       ` [PATCH 4/6] powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER Nick Piggin
2010-03-19  1:36         ` Benjamin Herrenschmidt
2010-02-11  6:56 ` [PATCH 1/6] powerpc: Use lwarx hint in spinlocks Nick Piggin
2010-02-17  9:37   ` Anton Blanchard
2010-02-17 10:22     ` Nick Piggin
2010-02-16  4:16 ` Olof Johansson
