linux-kernel.vger.kernel.org archive mirror
* [RFC v2 0/7] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics
@ 2015-09-16 15:49 Boqun Feng
  2015-09-16 15:49 ` [RFC v2 1/7] atomics: Add test for atomic operations with _relaxed variants Boqun Feng
                   ` (6 more replies)
  0 siblings, 7 replies; 42+ messages in thread
From: Boqun Feng @ 2015-09-16 15:49 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Boqun Feng


Link for v1: https://lkml.org/lkml/2015/8/27/798

Changes since v1:

*	avoid introducing the arch_atomic_op_*() macros

*	also fix the problem that cmpxchg, xchg and their atomic_
	versions are not full barriers in the current PPC implementation.

*	rebase on v4.3-rc1


Relaxed/acquire/release variants of atomic operations {add,sub}_return
and {cmp,}xchg are introduced by commit:

"atomics: add acquire/release/relaxed variants of some atomic operations"

By default, the generic code implements a relaxed variant as the fully
ordered atomic operation, and the release/acquire variants as the relaxed
variant with the necessary general memory barrier before or after it.
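
As an illustration, the default helpers in include/linux/atomic.h look
roughly like this (a simplified sketch, not the exact source):

	#define __atomic_op_acquire(op, args...)			\
	({								\
		typeof(op##_relaxed(args)) __ret = op##_relaxed(args);	\
		smp_mb__after_atomic();					\
		__ret;							\
	})

	#define __atomic_op_release(op, args...)			\
	({								\
		smp_mb__before_atomic();				\
		op##_relaxed(args);					\
	})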

On powerpc, which has a weak memory model, a relaxed variant can be
implemented more cheaply than a fully ordered one. Furthermore, the
release and acquire variants can be implemented with arch-specific
lightweight barriers.

Besides, cmpxchg, xchg and their atomic_ versions are only RELEASE+ACQUIRE
rather than full barriers in the current PPC implementation, which is
incorrect according to memory-barriers.txt.

Therefore this patchset implements the relaxed/acquire/release variants
based on the powerpc memory model and arch-specific barriers, and fixes
cmpxchg, xchg and their atomic_ versions. A trivial test for these new
variants is also included in this series; because some of these variants
are not yet used anywhere in the kernel, it is a good idea to at least
generate their code somewhere.

The patchset consists of 7 parts:

1.	Add trivial tests for the new variants in lib/atomic64_test.c

2.	Allow architectures to define their own __atomic_op_*() helpers
	to build other variants based on relaxed.

3.	Implement atomic{,64}_{add,sub}_return_* variants

4.	Implement xchg_* and atomic{,64}_xchg_* variants

5.	Implement cmpxchg_* atomic{,64}_cmpxchg_* variants

6.	Make atomic{,64}_xchg and xchg full barriers

7.	Make atomic{,64}_cmpxchg and cmpxchg full barriers

This patchset is based on v4.3-rc1. All patches are build- and boot-tested
for little-endian powernv and pseries, and also tested by 0day.


Looking forward to any suggestions, questions and comments ;-)

Regards,
Boqun


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v2 1/7] atomics: Add test for atomic operations with _relaxed variants
  2015-09-16 15:49 [RFC v2 0/7] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
@ 2015-09-16 15:49 ` Boqun Feng
  2015-10-12  9:30   ` Will Deacon
  2015-09-16 15:49 ` [RFC v2 2/7] atomics: Allow architectures to define their own __atomic_op_* helpers Boqun Feng
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-16 15:49 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Boqun Feng

Some atomic operations now have _{relaxed,acquire,release} variants. This
patch adds some trivial tests for two purposes:

1.	test the behavior of these new operations in a single-CPU
	environment.
2.	generate their code before we actually use them anywhere, so that
	we can examine their assembly code.
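
For instance, TEST_RETURN_FAMILY(, add_return, +=, onestwos) below
expands to four checks, one per ordering variant; the fully ordered one
is roughly equivalent to (a sketch with the macro arguments substituted
by hand):

	atomic_set(&v, v0);
	r = v0;
	r += onestwos;
	BUG_ON(atomic_add_return(onestwos, &v) != r);
	BUG_ON(atomic_read(&v) != r);

and the other three do the same with atomic_add_return_acquire(),
_release() and _relaxed().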

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 lib/atomic64_test.c | 91 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 59 insertions(+), 32 deletions(-)

diff --git a/lib/atomic64_test.c b/lib/atomic64_test.c
index 83c33a5b..0484437 100644
--- a/lib/atomic64_test.c
+++ b/lib/atomic64_test.c
@@ -27,6 +27,50 @@ do {								\
 		(unsigned long long)r);				\
 } while (0)
 
+#define TEST_RETURN(bit, op, c_op, val)				\
+do {								\
+	atomic##bit##_set(&v, v0);				\
+	r = v0;							\
+	r c_op val;						\
+	BUG_ON(atomic##bit##_##op(val, &v) != r);		\
+	BUG_ON(atomic##bit##_read(&v) != r);			\
+} while (0)
+
+#define TEST_RETURN_FAMILY(bit, op, c_op, val)			\
+do {								\
+	TEST_RETURN(bit, op, c_op, val);			\
+	TEST_RETURN(bit, op##_acquire, c_op, val);		\
+	TEST_RETURN(bit, op##_release, c_op, val);		\
+	TEST_RETURN(bit, op##_relaxed, c_op, val);		\
+} while (0)
+
+#define TEST_ARGS(bit, op, init, ret, expect, args...)		\
+do {								\
+	atomic##bit##_set(&v, init);				\
+	BUG_ON(atomic##bit##_##op(&v, args) != ret);		\
+	BUG_ON(atomic##bit##_read(&v) != expect);		\
+} while (0)
+
+#define TEST_XCHG_FAMILY(bit, init, new)			\
+do {								\
+	TEST_ARGS(bit, xchg, init, init, new, new);		\
+	TEST_ARGS(bit, xchg_acquire, init, init, new, new);	\
+	TEST_ARGS(bit, xchg_release, init, init, new, new);	\
+	TEST_ARGS(bit, xchg_relaxed, init, init, new, new);	\
+} while (0)
+
+#define TEST_CMPXCHG_FAMILY(bit, init, new, wrong)			\
+do {									\
+	TEST_ARGS(bit, cmpxchg, init, init, new, init, new);		\
+	TEST_ARGS(bit, cmpxchg, init, init, init, wrong, new);		\
+	TEST_ARGS(bit, cmpxchg_acquire, init, init, new, init, new);	\
+	TEST_ARGS(bit, cmpxchg_acquire, init, init, init, wrong, new);	\
+	TEST_ARGS(bit, cmpxchg_release, init, init, new, init, new);	\
+	TEST_ARGS(bit, cmpxchg_release, init, init, init, wrong, new);	\
+	TEST_ARGS(bit, cmpxchg_relaxed, init, init, new, init, new);	\
+	TEST_ARGS(bit, cmpxchg_relaxed, init, init, init, wrong, new);	\
+} while (0)
+
 static __init void test_atomic(void)
 {
 	int v0 = 0xaaa31337;
@@ -45,6 +89,15 @@ static __init void test_atomic(void)
 	TEST(, and, &=, v1);
 	TEST(, xor, ^=, v1);
 	TEST(, andnot, &= ~, v1);
+
+	TEST_RETURN_FAMILY(, add_return, +=, onestwos);
+	TEST_RETURN_FAMILY(, add_return, +=, -one);
+	TEST_RETURN_FAMILY(, sub_return, -=, onestwos);
+	TEST_RETURN_FAMILY(, sub_return, -=, -one);
+
+	TEST_XCHG_FAMILY(, v0, v1);
+	TEST_CMPXCHG_FAMILY(, v0, v1, onestwos);
+
 }
 
 #define INIT(c) do { atomic64_set(&v, c); r = c; } while (0)
@@ -74,25 +127,10 @@ static __init void test_atomic64(void)
 	TEST(64, xor, ^=, v1);
 	TEST(64, andnot, &= ~, v1);
 
-	INIT(v0);
-	r += onestwos;
-	BUG_ON(atomic64_add_return(onestwos, &v) != r);
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	r += -one;
-	BUG_ON(atomic64_add_return(-one, &v) != r);
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	r -= onestwos;
-	BUG_ON(atomic64_sub_return(onestwos, &v) != r);
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	r -= -one;
-	BUG_ON(atomic64_sub_return(-one, &v) != r);
-	BUG_ON(v.counter != r);
+	TEST_RETURN_FAMILY(64, add_return, +=, onestwos);
+	TEST_RETURN_FAMILY(64, add_return, +=, -one);
+	TEST_RETURN_FAMILY(64, sub_return, -=, onestwos);
+	TEST_RETURN_FAMILY(64, sub_return, -=, -one);
 
 	INIT(v0);
 	atomic64_inc(&v);
@@ -114,19 +152,8 @@ static __init void test_atomic64(void)
 	BUG_ON(atomic64_dec_return(&v) != r);
 	BUG_ON(v.counter != r);
 
-	INIT(v0);
-	BUG_ON(atomic64_xchg(&v, v1) != v0);
-	r = v1;
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	BUG_ON(atomic64_cmpxchg(&v, v0, v1) != v0);
-	r = v1;
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	BUG_ON(atomic64_cmpxchg(&v, v2, v1) != v0);
-	BUG_ON(v.counter != r);
+	TEST_XCHG_FAMILY(64, v0, v1);
+	TEST_CMPXCHG_FAMILY(64, v0, v1, v2);
 
 	INIT(v0);
 	BUG_ON(atomic64_add_unless(&v, one, v0));
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC v2 2/7] atomics: Allow architectures to define their own __atomic_op_* helpers
  2015-09-16 15:49 [RFC v2 0/7] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
  2015-09-16 15:49 ` [RFC v2 1/7] atomics: Add test for atomic operations with _relaxed variants Boqun Feng
@ 2015-09-16 15:49 ` Boqun Feng
  2015-09-16 15:49 ` [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants Boqun Feng
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Boqun Feng @ 2015-09-16 15:49 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Boqun Feng

Some architectures have special barriers for acquire, release and fence
semantics, for which the general memory barriers (smp_mb__*_atomic()) in
the default __atomic_op_*() helpers may be too strong, so allow
architectures to define their own helpers which override the defaults.
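
For example, an architecture with a dedicated acquire barrier could
provide something along these lines in its asm/atomic.h (a sketch;
arch_specific_acquire_barrier() is just a placeholder, the powerpc
implementation later in this series is the real user):

	#define __atomic_op_acquire(op, args...)			\
	({								\
		typeof(op##_relaxed(args)) __ret = op##_relaxed(args);	\
		arch_specific_acquire_barrier();			\
		__ret;							\
	})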

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/atomic.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 00a5763..590c023 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -34,20 +34,29 @@
  * The idea here is to build acquire/release variants by adding explicit
  * barriers on top of the relaxed variant. In the case where the relaxed
  * variant is already fully ordered, no additional barriers are needed.
+ *
+ * Besides, if an arch has a special barrier for acquire/release, it can
+ * implement its own __atomic_op_* helpers and reuse the same framework for
+ * building the variants.
  */
+#ifndef __atomic_op_acquire
 #define __atomic_op_acquire(op, args...)				\
 ({									\
 	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
 	smp_mb__after_atomic();						\
 	__ret;								\
 })
+#endif
 
+#ifndef __atomic_op_release
 #define __atomic_op_release(op, args...)				\
 ({									\
 	smp_mb__before_atomic();					\
 	op##_relaxed(args);						\
 })
+#endif
 
+#ifndef __atomic_op_fence
 #define __atomic_op_fence(op, args...)					\
 ({									\
 	typeof(op##_relaxed(args)) __ret;				\
@@ -56,6 +65,7 @@
 	smp_mb__after_atomic();						\
 	__ret;								\
 })
+#endif
 
 /* atomic_add_return_relaxed */
 #ifndef atomic_add_return_relaxed
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-16 15:49 [RFC v2 0/7] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
  2015-09-16 15:49 ` [RFC v2 1/7] atomics: Add test for atomic operations with _relaxed variants Boqun Feng
  2015-09-16 15:49 ` [RFC v2 2/7] atomics: Allow architectures to define their own __atomic_op_* helpers Boqun Feng
@ 2015-09-16 15:49 ` Boqun Feng
  2015-09-18 16:59   ` Will Deacon
  2015-09-16 15:49 ` [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants Boqun Feng
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-16 15:49 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Boqun Feng

On powerpc, we don't need a general memory barrier to achieve acquire and
release semantics, so __atomic_op_{acquire,release} can be implemented
using "lwsync" and "isync".

For release semantics, we only need to ensure that all memory accesses
issued before the atomic take effect before the -store- part of the
atomic, so "lwsync" is all we need. On platforms without "lwsync",
"sync" should be used. Therefore, smp_lwsync() is used here.

For acquire semantics, "lwsync" is likewise all we need. However, on
platforms without "lwsync" we can use "isync" rather than "sync" as the
acquire barrier. So a new kind of barrier,
smp_acquire_barrier__after_atomic(), is introduced; it is barrier() on
UP, "lwsync" if available and "isync" otherwise.

__atomic_op_fence is defined as smp_lwsync() + _relaxed +
smp_mb__after_atomic() to guarantee full ordering.

Implement atomic{,64}_{add,sub}_return_relaxed, and build other variants
with these helpers.
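
With these, a fully ordered atomic_add_return() ends up roughly
equivalent to the following (a sketch of what __atomic_op_fence()
expands to here, ignoring the UP case):

	smp_lwsync();				/* "lwsync", or "sync" where unavailable */
	t = atomic_add_return_relaxed(a, v);	/* lwarx/add/stwcx./bne- loop, no barriers */
	smp_mb__after_atomic();			/* "sync" */
	return t;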

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/atomic.h | 88 ++++++++++++++++++++++++++++-----------
 1 file changed, 64 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index 55f106e..d1dcdcf 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -12,6 +12,39 @@
 
 #define ATOMIC_INIT(i)		{ (i) }
 
+/*
+ * Since {add,sub}_return_relaxed and xchg_relaxed are implemented with
+ * a "bne-" instruction at the end, so an isync is enough as a acquire barrier
+ * on the platform without lwsync.
+ */
+#ifdef CONFIG_SMP
+#define smp_acquire_barrier__after_atomic() \
+	__asm__ __volatile__(PPC_ACQUIRE_BARRIER : : : "memory")
+#else
+#define smp_acquire_barrier__after_atomic() barrier()
+#endif
+#define __atomic_op_acquire(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
+	smp_acquire_barrier__after_atomic();				\
+	__ret;								\
+})
+
+#define __atomic_op_release(op, args...)				\
+({									\
+	smp_lwsync();							\
+	op##_relaxed(args);						\
+})
+
+#define __atomic_op_fence(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret;				\
+	smp_lwsync();							\
+	__ret = op##_relaxed(args);					\
+	smp_mb__after_atomic();						\
+	__ret;								\
+})
+
 static __inline__ int atomic_read(const atomic_t *v)
 {
 	int t;
@@ -42,27 +75,27 @@ static __inline__ void atomic_##op(int a, atomic_t *v)			\
 	: "cc");							\
 }									\
 
-#define ATOMIC_OP_RETURN(op, asm_op)					\
-static __inline__ int atomic_##op##_return(int a, atomic_t *v)		\
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)				\
+static inline int atomic_##op##_return_relaxed(int a, atomic_t *v)	\
 {									\
 	int t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	lwarx	%0,0,%2		# atomic_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-	PPC405_ERR77(0,%2)						\
-"	stwcx.	%0,0,%2 \n"						\
+"1:	lwarx	%0,0,%3		# atomic_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+	PPC405_ERR77(0, %3)						\
+"	stwcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC_OPS(op, asm_op) ATOMIC_OP(op, asm_op) ATOMIC_OP_RETURN(op, asm_op)
+#define ATOMIC_OPS(op, asm_op)						\
+	ATOMIC_OP(op, asm_op)						\
+	ATOMIC_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC_OPS(add, add)
 ATOMIC_OPS(sub, subf)
@@ -71,8 +104,11 @@ ATOMIC_OP(and, and)
 ATOMIC_OP(or, or)
 ATOMIC_OP(xor, xor)
 
+#define atomic_add_return_relaxed atomic_add_return_relaxed
+#define atomic_sub_return_relaxed atomic_sub_return_relaxed
+
 #undef ATOMIC_OPS
-#undef ATOMIC_OP_RETURN
+#undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
@@ -285,26 +321,27 @@ static __inline__ void atomic64_##op(long a, atomic64_t *v)		\
 	: "cc");							\
 }
 
-#define ATOMIC64_OP_RETURN(op, asm_op)					\
-static __inline__ long atomic64_##op##_return(long a, atomic64_t *v)	\
+#define ATOMIC64_OP_RETURN_RELAXED(op, asm_op)				\
+static inline long							\
+atomic64_##op##_return_relaxed(long a, atomic64_t *v)			\
 {									\
 	long t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	ldarx	%0,0,%2		# atomic64_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-"	stdcx.	%0,0,%2 \n"						\
+"1:	ldarx	%0,0,%3		# atomic64_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+"	stdcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC64_OPS(op, asm_op) ATOMIC64_OP(op, asm_op) ATOMIC64_OP_RETURN(op, asm_op)
+#define ATOMIC64_OPS(op, asm_op)					\
+	ATOMIC64_OP(op, asm_op)						\
+	ATOMIC64_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC64_OPS(add, add)
 ATOMIC64_OPS(sub, subf)
@@ -312,8 +349,11 @@ ATOMIC64_OP(and, and)
 ATOMIC64_OP(or, or)
 ATOMIC64_OP(xor, xor)
 
-#undef ATOMIC64_OPS
-#undef ATOMIC64_OP_RETURN
+#define atomic64_add_return_relaxed atomic64_add_return_relaxed
+#define atomic64_sub_return_relaxed atomic64_sub_return_relaxed
+
+#undef ATOMIC64_OPS
+#undef ATOMIC64_OP_RETURN_RELAXED
 #undef ATOMIC64_OP
 
 #define atomic64_add_negative(a, v)	(atomic64_add_return((a), (v)) < 0)
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-09-16 15:49 [RFC v2 0/7] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
                   ` (2 preceding siblings ...)
  2015-09-16 15:49 ` [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants Boqun Feng
@ 2015-09-16 15:49 ` Boqun Feng
  2015-10-01 12:24   ` Peter Zijlstra
  2015-09-16 15:49 ` [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants Boqun Feng
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-16 15:49 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Boqun Feng

Implement xchg_relaxed and define atomic{,64}_xchg_relaxed in terms of it;
based on these _relaxed variants, the release/acquire variants can be
built.

Note that xchg_relaxed and atomic{,64}_xchg_relaxed are not compiler
barriers.
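
As a rough sketch of how the release/acquire variants are then built on
top of what this patch provides (the generic definitions below are
paraphrased from include/linux/atomic.h, not copied verbatim):

	/* provided by this patch: */
	#define atomic_xchg_relaxed(v, new)	xchg_relaxed(&((v)->counter), (new))

	/* built by the generic layer: */
	#define atomic_xchg_acquire(...)	__atomic_op_acquire(atomic_xchg, __VA_ARGS__)
	#define atomic_xchg_release(...)	__atomic_op_release(atomic_xchg, __VA_ARGS__)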

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  2 ++
 arch/powerpc/include/asm/cmpxchg.h | 64 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index d1dcdcf..d9f570b 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -193,6 +193,7 @@ static __inline__ int atomic_dec_return(atomic_t *v)
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * __atomic_add_unless - add unless the number is a given value
@@ -461,6 +462,7 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t *v)
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * atomic64_add_unless - add unless the number is a given value
diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index ad6263c..66374f4 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -54,6 +54,32 @@ __xchg_u32_local(volatile void *p, unsigned long val)
 	return prev;
 }
 
+/*
+ * Atomic exchange relaxed
+ *
+ * Changes the memory location '*p' to be val and returns
+ * the previous value stored there.
+ *
+ * Note that this is not a compiler barrier, there is no order
+ * guarantee around.
+ */
+static __always_inline unsigned long
+__xchg_u32_relaxed(u32 *p, unsigned long val)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__(
+"1:	lwarx	%0,0,%2\n"
+	PPC405_ERR77(0, %2)
+"	stwcx.	%3,0,%2\n"
+"	bne-	1b"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (val)
+	: "cc");
+
+	return prev;
+}
+
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
 __xchg_u64(volatile void *p, unsigned long val)
@@ -90,6 +116,23 @@ __xchg_u64_local(volatile void *p, unsigned long val)
 
 	return prev;
 }
+
+static __always_inline unsigned long
+__xchg_u64_relaxed(u64 *p, unsigned long val)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__(
+"1:	ldarx	%0,0,%2\n"
+	PPC405_ERR77(0, %2)
+"	stdcx.	%3,0,%2\n"
+"	bne-	1b"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (val)
+	: "cc");
+
+	return prev;
+}
 #endif
 
 /*
@@ -127,6 +170,21 @@ __xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
 	__xchg_called_with_bad_pointer();
 	return x;
 }
+
+static __always_inline unsigned long
+__xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
+{
+	switch (size) {
+	case 4:
+		return __xchg_u32_relaxed(ptr, x);
+#ifdef CONFIG_PPC64
+	case 8:
+		return __xchg_u64_relaxed(ptr, x);
+#endif
+	}
+	__xchg_called_with_bad_pointer();
+	return x;
+}
 #define xchg(ptr,x)							     \
   ({									     \
      __typeof__(*(ptr)) _x_ = (x);					     \
@@ -140,6 +198,12 @@ __xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
      		(unsigned long)_x_, sizeof(*(ptr))); 			     \
   })
 
+#define xchg_relaxed(ptr, x)						\
+({									\
+	__typeof__(*(ptr)) _x_ = (x);					\
+	(__typeof__(*(ptr))) __xchg_relaxed((ptr),			\
+			(unsigned long)_x_, sizeof(*(ptr)));		\
+})
 /*
  * Compare and exchange - if *p == old, set it to new,
  * and return the old value of *p.
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-09-16 15:49 [RFC v2 0/7] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
                   ` (3 preceding siblings ...)
  2015-09-16 15:49 ` [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants Boqun Feng
@ 2015-09-16 15:49 ` Boqun Feng
  2015-10-01 12:27   ` Peter Zijlstra
  2015-09-16 15:49 ` [RFC v2 6/7] powerpc: atomic: Make atomic{,64}_xchg and xchg a full barrier Boqun Feng
  2015-09-16 15:49 ` [RFC v2 7/7] powerpc: atomic: Make atomic{,64}_cmpxchg and cmpxchg " Boqun Feng
  6 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-16 15:49 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Boqun Feng

Unlike other atomic operation variants, cmpxchg{,64}_acquire and
atomic{,64}_cmpxchg_acquire don't provide acquire semantics if the cmp
part fails, so we need to implement these using assembly rather than
building them on top of the _relaxed variants.

Note that cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed are not
compiler barriers.
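
A minimal sketch of the kind of caller the _acquire variants are aimed
at (a hypothetical trylock, purely illustrative): acquire ordering is
only useful when the compare succeeds, so a barrier on the failure path
would be wasted work.

	static inline bool my_trylock(atomic_t *lock)
	{
		/* ACQUIRE semantics matter only if we actually took the lock */
		return atomic_cmpxchg_acquire(lock, 0, 1) == 0;
	}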

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  10 +++
 arch/powerpc/include/asm/cmpxchg.h | 141 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index d9f570b..0608e39 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -192,6 +192,11 @@ static __inline__ int atomic_dec_return(atomic_t *v)
 }
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic_cmpxchg_relaxed(v, o, n) \
+	cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic_cmpxchg_acquire(v, o, n) \
+	cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
@@ -461,6 +466,11 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t *v)
 }
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic64_cmpxchg_relaxed(v, o, n) \
+	cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic64_cmpxchg_acquire(v, o, n) \
+	cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index 66374f4..f40f295 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -254,6 +254,48 @@ __cmpxchg_u32_local(volatile unsigned int *p, unsigned long old,
 	return prev;
 }
 
+static __always_inline unsigned long
+__cmpxchg_u32_relaxed(u32 *p, unsigned long old, unsigned long new)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__ (
+"1:	lwarx	%0,0,%2		# __cmpxchg_u32_relaxed\n"
+"	cmpw	0,%0,%3\n"
+"	bne-	2f\n"
+	PPC405_ERR77(0, %2)
+"	stwcx.	%4,0,%2\n"
+"	bne-	1b\n"
+"2:"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (old), "r" (new)
+	: "cc");
+
+	return prev;
+}
+
+static __always_inline unsigned long
+__cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__ (
+"1:	lwarx	%0,0,%2		# __cmpxchg_u32_acquire\n"
+"	cmpw	0,%0,%3\n"
+"	bne-	2f\n"
+	PPC405_ERR77(0, %2)
+"	stwcx.	%4,0,%2\n"
+"	bne-	1b\n"
+	PPC_ACQUIRE_BARRIER
+	"\n"
+"2:"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (old), "r" (new)
+	: "cc", "memory");
+
+	return prev;
+}
+
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
 __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
@@ -297,6 +339,46 @@ __cmpxchg_u64_local(volatile unsigned long *p, unsigned long old,
 
 	return prev;
 }
+
+static __always_inline unsigned long
+__cmpxchg_u64_relaxed(u64 *p, unsigned long old, unsigned long new)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__ (
+"1:	ldarx	%0,0,%2		# __cmpxchg_u64_relaxed\n"
+"	cmpd	0,%0,%3\n"
+"	bne-	2f\n"
+"	stdcx.	%4,0,%2\n"
+"	bne-	1b\n"
+"2:"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (old), "r" (new)
+	: "cc");
+
+	return prev;
+}
+
+static __always_inline unsigned long
+__cmpxchg_u64_acquire(u64 *p, unsigned long old, unsigned long new)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__ (
+"1:	ldarx	%0,0,%2		# __cmpxchg_u64_acquire\n"
+"	cmpd	0,%0,%3\n"
+"	bne-	2f\n"
+"	stdcx.	%4,0,%2\n"
+"	bne-	1b\n"
+	PPC_ACQUIRE_BARRIER
+	"\n"
+"2:"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (old), "r" (new)
+	: "cc", "memory");
+
+	return prev;
+}
 #endif
 
 /* This function doesn't exist, so you'll get a linker error
@@ -335,6 +417,37 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
 	return old;
 }
 
+static __always_inline unsigned long
+__cmpxchg_relaxed(void *ptr, unsigned long old, unsigned long new,
+		  unsigned int size)
+{
+	switch (size) {
+	case 4:
+		return __cmpxchg_u32_relaxed(ptr, old, new);
+#ifdef CONFIG_PPC64
+	case 8:
+		return __cmpxchg_u64_relaxed(ptr, old, new);
+#endif
+	}
+	__cmpxchg_called_with_bad_pointer();
+	return old;
+}
+
+static __always_inline unsigned long
+__cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
+		  unsigned int size)
+{
+	switch (size) {
+	case 4:
+		return __cmpxchg_u32_acquire(ptr, old, new);
+#ifdef CONFIG_PPC64
+	case 8:
+		return __cmpxchg_u64_acquire(ptr, old, new);
+#endif
+	}
+	__cmpxchg_called_with_bad_pointer();
+	return old;
+}
 #define cmpxchg(ptr, o, n)						 \
   ({									 \
      __typeof__(*(ptr)) _o_ = (o);					 \
@@ -352,6 +465,23 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
 				    (unsigned long)_n_, sizeof(*(ptr))); \
   })
 
+#define cmpxchg_relaxed(ptr, o, n)					\
+({									\
+	__typeof__(*(ptr)) _o_ = (o);					\
+	__typeof__(*(ptr)) _n_ = (n);					\
+	(__typeof__(*(ptr))) __cmpxchg_relaxed((ptr),			\
+			(unsigned long)_o_, (unsigned long)_n_,		\
+			sizeof(*(ptr)));				\
+})
+
+#define cmpxchg_acquire(ptr, o, n)					\
+({									\
+	__typeof__(*(ptr)) _o_ = (o);					\
+	__typeof__(*(ptr)) _n_ = (n);					\
+	(__typeof__(*(ptr))) __cmpxchg_acquire((ptr),			\
+			(unsigned long)_o_, (unsigned long)_n_,		\
+			sizeof(*(ptr)));				\
+})
 #ifdef CONFIG_PPC64
 #define cmpxchg64(ptr, o, n)						\
   ({									\
@@ -363,7 +493,16 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
 	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
 	cmpxchg_local((ptr), (o), (n));					\
   })
-#define cmpxchg64_relaxed	cmpxchg64_local
+#define cmpxchg64_relaxed(ptr, o, n)					\
+({									\
+	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
+	cmpxchg_relaxed((ptr), (o), (n));				\
+})
+#define cmpxchg64_acquire(ptr, o, n)					\
+({									\
+	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
+	cmpxchg_acquire((ptr), (o), (n));				\
+})
 #else
 #include <asm-generic/cmpxchg-local.h>
 #define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n))
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC v2 6/7] powerpc: atomic: Make atomic{,64}_xchg and xchg a full barrier
  2015-09-16 15:49 [RFC v2 0/7] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
                   ` (4 preceding siblings ...)
  2015-09-16 15:49 ` [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants Boqun Feng
@ 2015-09-16 15:49 ` Boqun Feng
  2015-10-01 12:28   ` Peter Zijlstra
  2015-09-16 15:49 ` [RFC v2 7/7] powerpc: atomic: Make atomic{,64}_cmpxchg and cmpxchg " Boqun Feng
  6 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-16 15:49 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Boqun Feng

According to memory-barriers.txt, xchg and its atomic{,64}_ versions
need to imply a full barrier; however, they are currently only
RELEASE+ACQUIRE, which is not a full barrier.

So remove the definition of xchg(), and let __atomic_op_fence() build
the full-barrier versions of these operations.
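
With the arch definition gone, the generic layer picks xchg() up again,
roughly like this (paraphrased from include/linux/atomic.h, not copied
verbatim):

	#ifndef xchg
	#define xchg(...)	__atomic_op_fence(xchg, __VA_ARGS__)
	#endif

which on powerpc expands to smp_lwsync(), the xchg_relaxed() ll/sc loop,
and then smp_mb__after_atomic().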

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/cmpxchg.h | 64 --------------------------------------
 1 file changed, 64 deletions(-)

diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index f40f295..9f0379a 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -12,31 +12,7 @@
  * Changes the memory location '*ptr' to be val and returns
  * the previous value stored there.
  */
-static __always_inline unsigned long
-__xchg_u32(volatile void *p, unsigned long val)
-{
-	unsigned long prev;
 
-	__asm__ __volatile__(
-	PPC_RELEASE_BARRIER
-"1:	lwarx	%0,0,%2 \n"
-	PPC405_ERR77(0,%2)
-"	stwcx.	%3,0,%2 \n\
-	bne-	1b"
-	PPC_ACQUIRE_BARRIER
-	: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
-	: "r" (p), "r" (val)
-	: "cc", "memory");
-
-	return prev;
-}
-
-/*
- * Atomic exchange
- *
- * Changes the memory location '*ptr' to be val and returns
- * the previous value stored there.
- */
 static __always_inline unsigned long
 __xchg_u32_local(volatile void *p, unsigned long val)
 {
@@ -82,25 +58,6 @@ __xchg_u32_relaxed(u32 *p, unsigned long val)
 
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
-__xchg_u64(volatile void *p, unsigned long val)
-{
-	unsigned long prev;
-
-	__asm__ __volatile__(
-	PPC_RELEASE_BARRIER
-"1:	ldarx	%0,0,%2 \n"
-	PPC405_ERR77(0,%2)
-"	stdcx.	%3,0,%2 \n\
-	bne-	1b"
-	PPC_ACQUIRE_BARRIER
-	: "=&r" (prev), "+m" (*(volatile unsigned long *)p)
-	: "r" (p), "r" (val)
-	: "cc", "memory");
-
-	return prev;
-}
-
-static __always_inline unsigned long
 __xchg_u64_local(volatile void *p, unsigned long val)
 {
 	unsigned long prev;
@@ -142,21 +99,6 @@ __xchg_u64_relaxed(u64 *p, unsigned long val)
 extern void __xchg_called_with_bad_pointer(void);
 
 static __always_inline unsigned long
-__xchg(volatile void *ptr, unsigned long x, unsigned int size)
-{
-	switch (size) {
-	case 4:
-		return __xchg_u32(ptr, x);
-#ifdef CONFIG_PPC64
-	case 8:
-		return __xchg_u64(ptr, x);
-#endif
-	}
-	__xchg_called_with_bad_pointer();
-	return x;
-}
-
-static __always_inline unsigned long
 __xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
 {
 	switch (size) {
@@ -185,12 +127,6 @@ __xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
 	__xchg_called_with_bad_pointer();
 	return x;
 }
-#define xchg(ptr,x)							     \
-  ({									     \
-     __typeof__(*(ptr)) _x_ = (x);					     \
-     (__typeof__(*(ptr))) __xchg((ptr), (unsigned long)_x_, sizeof(*(ptr))); \
-  })
-
 #define xchg_local(ptr,x)						     \
   ({									     \
      __typeof__(*(ptr)) _x_ = (x);					     \
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC v2 7/7] powerpc: atomic: Make atomic{,64}_cmpxchg and cmpxchg a full barrier
  2015-09-16 15:49 [RFC v2 0/7] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
                   ` (5 preceding siblings ...)
  2015-09-16 15:49 ` [RFC v2 6/7] powerpc: atomic: Make atomic{,64}_xchg and xchg a full barrier Boqun Feng
@ 2015-09-16 15:49 ` Boqun Feng
  6 siblings, 0 replies; 42+ messages in thread
From: Boqun Feng @ 2015-09-16 15:49 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Boqun Feng

According to memory-barriers.txt, cmpxchg and its atomic{,64}_ versions
need to imply a full barrier; however, they are currently only
RELEASE+ACQUIRE, which is not a full barrier.

So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
__cmpxchg_{u32,u64} respectively to guarantee full-barrier semantics
of atomic{,64}_cmpxchg() and cmpxchg().

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/cmpxchg.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index 9f0379a..5c58743 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long old, unsigned long new)
 	unsigned int prev;
 
 	__asm__ __volatile__ (
-	PPC_RELEASE_BARRIER
+	PPC_ATOMIC_ENTRY_BARRIER
 "1:	lwarx	%0,0,%2		# __cmpxchg_u32\n\
 	cmpw	0,%0,%3\n\
 	bne-	2f\n"
 	PPC405_ERR77(0,%2)
 "	stwcx.	%4,0,%2\n\
 	bne-	1b"
-	PPC_ACQUIRE_BARRIER
+	PPC_ATOMIC_EXIT_BARRIER
 	"\n\
 2:"
 	: "=&r" (prev), "+m" (*p)
@@ -239,13 +239,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
 	unsigned long prev;
 
 	__asm__ __volatile__ (
-	PPC_RELEASE_BARRIER
+	PPC_ATOMIC_ENTRY_BARRIER
 "1:	ldarx	%0,0,%2		# __cmpxchg_u64\n\
 	cmpd	0,%0,%3\n\
 	bne-	2f\n\
 	stdcx.	%4,0,%2\n\
 	bne-	1b"
-	PPC_ACQUIRE_BARRIER
+	PPC_ATOMIC_EXIT_BARRIER
 	"\n\
 2:"
 	: "=&r" (prev), "+m" (*p)
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-16 15:49 ` [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants Boqun Feng
@ 2015-09-18 16:59   ` Will Deacon
  2015-09-19 15:33     ` Boqun Feng
  0 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2015-09-18 16:59 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Paul E. McKenney, Waiman Long

On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> On powerpc, we don't need a general memory barrier to achieve acquire and
> release semantics, so __atomic_op_{acquire,release} can be implemented
> using "lwsync" and "isync".

I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
of the s390 thread you linked me to before we start spreading this
further:

  https://lkml.org/lkml/2015/9/15/836

Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-18 16:59   ` Will Deacon
@ 2015-09-19 15:33     ` Boqun Feng
  2015-09-20  8:23       ` Boqun Feng
  0 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-19 15:33 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Paul E. McKenney, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 3759 bytes --]

Hi Will,

On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > On powerpc, we don't need a general memory barrier to achieve acquire and
> > release semantics, so __atomic_op_{acquire,release} can be implemented
> > using "lwsync" and "isync".
> 
> I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom

Actually the transitivity is still guaranteed here, I think ;-)

(Before I put my reasoning, I have to admit I just learned about the
cumulativity recently, so my reasoning may be wrong. But the good thing
is that we have our POWER experts in the CCed. In case I'm wrong, they
could correct me.)

The thing is, on POWER, transitivity is implemented by a similar but
slightly different concept, cumulativity, and as said in the link:

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html

"""
The ordering done by a memory barrier is said to be “cumulative” if it
also orders storage accesses that are performed by processors and
mechanisms other than P1, as follows.

*	A includes all applicable storage accesses by any such processor
	or mechanism that have been performed with respect to P1 before
	the memory barrier is created.

*	B includes all applicable storage accesses by any such processor
	or mechanism that are performed after a Load instruction
	executed by that processor or mechanism has returned the value
	stored by a store that is in B.
"""

Please note that the set B can be extended indefinitely without any
other cumulative barrier.

So for a RELEASE+ACQUIRE pair to the same variable, as long as the barrier
in the RELEASE operation is cumulative, the transitivity is guaranteed.
And lwsync is cumulative, so we are fine here.


I also wrote a herd litmus to test this. Due to the tool's limitation, I
use the xchg_release and xchg_acquire to test. And since herd doesn't
support backward branching, some tricks are used here to work around:


PPC lwsync+isync-transitivity
""
{
0:r1=1; 0:r2=x; 0:r3=1; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=9; 1:r2=x; 1:r3=1; 1:r10=0 ; 1:r11=0; 1:r12=a;
2:r1=9; 2:r2=x; 2:r3=2; 2:r10=0 ; 2:r11=0; 2:r12=a;
}
 P0           | P1                  | P2                  ;
 stw r1,0(r2) | lwz r1,0(r2)        |                     ;
              | lwsync              | lwarx r11, r10, r12 ;
              | lwarx  r11,r10,r12  | stwcx. r3, r10, r12 ;
              | stwcx. r3,r10,r12   | bne Fail            ;
              |                     | isync               ;
              |                     | lwz r1, 0(r2)       ;
              |                     | Fail:               ;

exists
(1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1 = 0)


And the result of this litmus is that:

Test lwsync+isync-transitivity Allowed
States 15
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=1;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=0; 1:r11=0; 2:r1=9; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=9; 2:r11=1;
1:r1=0; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=2; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=1; 1:r11=0; 2:r1=9; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=9; 2:r11=1;
1:r1=1; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=2; 2:r1=1; 2:r11=0;
No
Witnesses
Positive: 0 Negative: 29
Condition exists (1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1=0)
Observation lwsync+isync-transitivity Never 0 29

, which means transitivity is guaranteed.

Regards,
Boqun

> of the s390 thread you linked me to before we start spreading this
> further:
> 
>   https://lkml.org/lkml/2015/9/15/836
> 
> Will

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-19 15:33     ` Boqun Feng
@ 2015-09-20  8:23       ` Boqun Feng
  2015-09-21 22:24         ` Will Deacon
  0 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-20  8:23 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Paul E. McKenney, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 5694 bytes --]

On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> Hi Will,
> 
> On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > using "lwsync" and "isync".
> > 
> > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> 
> Actually the transitivity is still guaranteed here, I think ;-)
> 
> (Before I put my reasoning, I have to admit I just learned about the
> cumulativity recently, so my reasoning may be wrong. But the good thing
> is that we have our POWER experts in the CCed. In case I'm wrong, they
> could correct me.)
> 
> The thing is, on POWER, transitivity is implemented by a similar but
> slightly different concept, cumulativity, and as said in the link:
> 
> http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> 
> """
> The ordering done by a memory barrier is said to be “cumulative” if it
> also orders storage accesses that are performed by processors and
> mechanisms other than P1, as follows.
> 
> *	A includes all applicable storage accesses by any such processor
> 	or mechanism that have been performed with respect to P1 before
> 	the memory barrier is created.
> 
> *	B includes all applicable storage accesses by any such processor
> 	or mechanism that are performed after a Load instruction
> 	executed by that processor or mechanism has returned the value
> 	stored by a store that is in B.
> """
> 
> Please note that the set B can be extended indefinitely without any
> other cumulative barrier.
> 
> So for a RELEASE+ACQUIRE pair to the same variable, as long as the barrier
> in the RELEASE operation is cumulative, the transitivity is guaranteed.
> And lwsync is cumulative, so we are fine here.
> 
> 
> I also wrote a herd litmus to test this. Due to the tool's limitation, I
> use the xchg_release and xchg_acquire to test. And since herd doesn't

Hmm.. I think I wanted to say atomic_xchg_release and
atomic_xchg_acquire here, sorry about that inaccuracy..

> support backward branching, some tricks are used here to work around:
> 

And I checked again, herd does support backward branching; the problem is
just that with backward branching there are a lot more states for the tool
to check. But it seems there are not too many in this case, so I modified
the litmus a little bit as follows:

PPC lwsync+isync-transitivity
""
{
0:r1=1; 0:r2=x; 0:r3=1; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=9; 1:r2=x; 1:r3=1; 1:r10=0 ; 1:r11=0; 1:r12=a;
2:r1=9; 2:r2=x; 2:r3=2; 2:r10=0 ; 2:r11=0; 2:r12=a;
}
 P0           | P1                  | P2                  ;
 stw r1,0(r2) | lwz r1,0(r2)        | Fail2:              ;
              | lwsync              | lwarx r11, r10, r12 ;
              | Fail1:              | stwcx. r3, r10, r12 ;
              | lwarx  r11,r10,r12  | bne Fail2           ;
              | stwcx. r3,r10,r12   | isync               ;
              | bne Fail1           | lwz r1, 0(r2)       ; 

exists
(1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1 = 0)

which is actually:

CPU 0			CPU 1				CPU 2
==============		=====================		=======================
{int x = 0, atomic_t a = ATOMIC_INIT(0)}
WRITE_ONCE(x,1);	t1 = READ_ONCE(x);		t2 = atomic_xchg_acquire(&a, 2);
			atomic_xchg_release(&a, 1);	t3 = READ_ONCE(x);

exists
(t1 == 1 && t2 == 1 && t3 == 0)


The result is still (it may take a while to get the result):

Test lwsync+isync-transitivity Allowed
States 11
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=1;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=0; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=2; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=1; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=2; 2:r1=1; 2:r11=0;
Loop No
Witnesses
Positive: 0 Negative: 198
Condition exists (1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1=0)
Observation lwsync+isync-transitivity Never 0 198

, which means transitivity is guaranteed.


And I think it deserves more analysis based on cumulativity:

Initially, for the lwsync on P1 (CPU 1), we have the sets A and B of the
storage accesses on the same processor which lwsync orders:

A includes:
	on CPU 1:
		lwz r1, 0(r2) // t1 = READ_ONCE(x); 

B includes:
	on CPU 1:
		lwarx  r11,r10,r12 // atomic_xchg_release();
		stwcx. r3,r10,r12

and as t1 == 1, which means that before the lwsync P1 perceives the STORE
of x on CPU 0, another storage access gets included in A:

A now includes:
	on CPU 0:
		stw r1, 0(r2) // WRITE_ONCE(x,1);
	on CPU 1:
		lwz r1, 0(r2) // t1 = READ_ONCE(x); 

B now includes:
	on CPU 1:
		lwarx  r11,r10,r12 // atomic_xchg_release();
		stwcx. r3,r10,r12

and as t2 == 1, which means that on CPU 2 the "lwarx  r11,r10,r12" in
atomic_xchg_acquire() reads the value stored by "stwcx. r3,r10,r12" in
atomic_xchg_release() on CPU 1, all storage accesses performed after
atomic_xchg_acquire() get included in set B:

A now includes:
	on CPU 0:
		stw r1, 0(r2) // WRITE_ONCE(x,1);
	on CPU 1:
		lwz r1, 0(r2) // t1 = READ_ONCE(x); 

B now includes:
	on CPU 1:
		lwarx  r11,r10,r12 // atomic_xchg_release();
		stwcx. r3,r10,r12
	on CPU 2:
		lwz r1, 0(r2) // t3 = READ_ONCE(x);

Therefore the STORE of x on CPU 0 and the LOAD of x on CPU 2 cannot be
reordered in this case, which means transitivity is guaranteed.

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-20  8:23       ` Boqun Feng
@ 2015-09-21 22:24         ` Will Deacon
  2015-09-21 23:26           ` Boqun Feng
  0 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2015-09-21 22:24 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Paul E. McKenney, Waiman Long

Hi Boqun,

On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > using "lwsync" and "isync".
> > > 
> > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > 
> > Actually the transitivity is still guaranteed here, I think ;-)

The litmus test I'm thinking of is:


{
0:r2=x;
1:r2=x; 1:r5=z;
2:r2=z; 2:r4=x;
}
 P0           | P1            | P2           ;
 li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
 stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
              | beq LC00      | beq  LC01    ;
              | LC00:         | LC01:        ;
              | isync         | isync        ;
              | li r4,1       | lwz r3,0(r4) ;
              | stw r4,0(r5)  |              ;
exists
(1:r1=1 /\ 2:r1=1 /\ 2:r3=0)


Which appears to be allowed. I don't think you need to worry about backwards
branches for the ctrl+isync construction (none of the current example do,
afaict).

Anyway, all the problematic cases seem to arise when we start mixing
ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
one group reads from an access in the other group). It would be simplest
to say that this doesn't provide any transitivity guarantees, and that
an ACQUIRE must always read from a RELEASE if transitivity is required.
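
That is (a sketch, not one of the litmus tests above), a chain in which
the ACQUIRE does read from a RELEASE, such as:

CPU 0			CPU 1				CPU 2
==============		=====================		=======================
{int x = 0, int y = 0}
WRITE_ONCE(x, 1);	r1 = READ_ONCE(x);		r2 = smp_load_acquire(&y);
			smp_store_release(&y, 1);	r3 = READ_ONCE(x);

would still be transitive, i.e. (r1 == 1 && r2 == 1 && r3 == 0) would be
forbidden.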

Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-21 22:24         ` Will Deacon
@ 2015-09-21 23:26           ` Boqun Feng
  2015-09-21 23:37             ` Boqun Feng
  0 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-21 23:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Paul E. McKenney, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 2325 bytes --]

On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> Hi Boqun,
> 
> On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > using "lwsync" and "isync".
> > > > 
> > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > 
> > > Actually the transitivity is still guaranteed here, I think ;-)
> 
> The litmus test I'm thinking of is:
> 
> 
> {
> 0:r2=x;
> 1:r2=x; 1:r5=z;
> 2:r2=z; 2:r4=x;
> }
>  P0           | P1            | P2           ;
>  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
>  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
>               | beq LC00      | beq  LC01    ;
>               | LC00:         | LC01:        ;
>               | isync         | isync        ;
>               | li r4,1       | lwz r3,0(r4) ;
>               | stw r4,0(r5)  |              ;
> exists
> (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> 
> 
> Which appears to be allowed. I don't think you need to worry about backwards
> branches for the ctrl+isync construction (none of the current example do,
> afaict).
> 

Yes.. my concern about backwards branches is not really related to the
topic; I worried about that mostly because my test is using atomic
operations, and I just wanted to test the exact asm code.

> Anyway, all the problematic cases seem to arise when we start mixing
> ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> one group reads from an access in the other group). It would be simplest
> to say that this doesn't provide any transitivity guarantees, and that
> an ACQUIRE must always read from a RELEASE if transitivity is required.
> 

Agreed. RELEASE alone doesn't provide transitivity and transitivity is
guaranteed only if an ACQUIRE read from a RELEASE. That's exactly the
direction which the link (https://lkml.org/lkml/2015/9/15/836) is
heading to. So I think we are fine here to use ctrl+isync here, right?

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-21 23:26           ` Boqun Feng
@ 2015-09-21 23:37             ` Boqun Feng
  2015-09-22 15:25               ` Paul E. McKenney
  0 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-21 23:37 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Paul E. McKenney, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 2550 bytes --]

On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > Hi Boqun,
> > 
> > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > using "lwsync" and "isync".
> > > > > 
> > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > 
> > > > Actually the transitivity is still guaranteed here, I think ;-)
> > 
> > The litmus test I'm thinking of is:
> > 
> > 
> > {
> > 0:r2=x;
> > 1:r2=x; 1:r5=z;
> > 2:r2=z; 2:r4=x;
> > }
> >  P0           | P1            | P2           ;
> >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> >               | beq LC00      | beq  LC01    ;
> >               | LC00:         | LC01:        ;
> >               | isync         | isync        ;
> >               | li r4,1       | lwz r3,0(r4) ;
> >               | stw r4,0(r5)  |              ;
> > exists
> > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > 
> > 
> > Which appears to be allowed. I don't think you need to worry about backwards
> > branches for the ctrl+isync construction (none of the current example do,
> > afaict).
> > 
> 
> Yes.. my concern about backwards branches is not really related to the
> topic; I worried about that mostly because my test is using atomic
> operations, and I just wanted to test the exact asm code.
> 
> > Anyway, all the problematic cases seem to arise when we start mixing
> > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > one group reads from an access in the other group). It would be simplest
> > to say that this doesn't provide any transitivity guarantees, and that
> > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > 
> 
> Agreed. RELEASE alone doesn't provide transitivity and transitivity is
          ^^^^^^^
This should be ACQUIRE...

> guaranteed only if an ACQUIRE read from a RELEASE. That's exactly the
> direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> heading to. So I think we are fine here to use ctrl+isync here, right?
> 
> Regards,
> Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-21 23:37             ` Boqun Feng
@ 2015-09-22 15:25               ` Paul E. McKenney
  2015-09-23  0:07                 ` Boqun Feng
  0 siblings, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2015-09-22 15:25 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Will Deacon, linux-kernel, linuxppc-dev, Peter Zijlstra,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long

On Tue, Sep 22, 2015 at 07:37:04AM +0800, Boqun Feng wrote:
> On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> > On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > > Hi Boqun,
> > > 
> > > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > > using "lwsync" and "isync".
> > > > > > 
> > > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > > 
> > > > > Actually the transitivity is still guaranteed here, I think ;-)
> > > 
> > > The litmus test I'm thinking of is:
> > > 
> > > 
> > > {
> > > 0:r2=x;
> > > 1:r2=x; 1:r5=z;
> > > 2:r2=z; 2:r4=x;
> > > }
> > >  P0           | P1            | P2           ;
> > >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> > >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> > >               | beq LC00      | beq  LC01    ;
> > >               | LC00:         | LC01:        ;
> > >               | isync         | isync        ;
> > >               | li r4,1       | lwz r3,0(r4) ;
> > >               | stw r4,0(r5)  |              ;
> > > exists
> > > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > > 
> > > 
> > > Which appears to be allowed. I don't think you need to worry about backwards
> > > branches for the ctrl+isync construction (none of the current example do,
> > > afaict).
> > > 
> > 
> > Yes.. my care of backwards branches is not quite related to the topic, I
> > concerned that mostly because my test is using atomic operation, and I
> > just want to test the exact asm code.
> > 
> > > Anyway, all the problematic cases seem to arise when we start mixing
> > > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > > one group reads from an access in the other group). It would be simplest
> > > to say that this doesn't provide any transitivity guarantees, and that
> > > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > > 
> > 
> > Agreed. RELEASE alone doesn't provide transitivity and transitivity is
>           ^^^^^^^
> This should be ACQUIRE...
> 
> > guaranteed only if an ACQUIRE read from a RELEASE. That's exactly the
> > direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> > heading to. So I think we are fine here to use ctrl+isync here, right?

We are going to have to err on the side of strictness, that is, having
the documentation place more requirements on the developer than the union
of the hardware does.  Besides, I haven't heard any recent complaints
that memory-barriers.txt is too simple.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-22 15:25               ` Paul E. McKenney
@ 2015-09-23  0:07                 ` Boqun Feng
  2015-09-25 21:29                   ` Paul E. McKenney
  0 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-09-23  0:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Will Deacon, linux-kernel, linuxppc-dev, Peter Zijlstra,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 3510 bytes --]

On Tue, Sep 22, 2015 at 08:25:40AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 22, 2015 at 07:37:04AM +0800, Boqun Feng wrote:
> > On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> > > On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > > > Hi Boqun,
> > > > 
> > > > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > > > using "lwsync" and "isync".
> > > > > > > 
> > > > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > > > 
> > > > > > Actually the transitivity is still guaranteed here, I think ;-)
> > > > 
> > > > The litmus test I'm thinking of is:
> > > > 
> > > > 
> > > > {
> > > > 0:r2=x;
> > > > 1:r2=x; 1:r5=z;
> > > > 2:r2=z; 2:r4=x;
> > > > }
> > > >  P0           | P1            | P2           ;
> > > >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> > > >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> > > >               | beq LC00      | beq  LC01    ;
> > > >               | LC00:         | LC01:        ;
> > > >               | isync         | isync        ;
> > > >               | li r4,1       | lwz r3,0(r4) ;
> > > >               | stw r4,0(r5)  |              ;
> > > > exists
> > > > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > > > 
> > > > 
> > > > Which appears to be allowed. I don't think you need to worry about backwards
> > > > branches for the ctrl+isync construction (none of the current example do,
> > > > afaict).
> > > > 
> > > 
> > > Yes.. my care of backwards branches is not quite related to the topic, I
> > > concerned that mostly because my test is using atomic operation, and I
> > > just want to test the exact asm code.
> > > 
> > > > Anyway, all the problematic cases seem to arise when we start mixing
> > > > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > > > one group reads from an access in the other group). It would be simplest
> > > > to say that this doesn't provide any transitivity guarantees, and that
> > > > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > > > 
> > > 
> > > Agreed. RELEASE alone doesn't provide transitivity and transitivity is
> >           ^^^^^^^
> > This should be ACQUIRE...
> > 
> > > guaranteed only if an ACQUIRE read from a RELEASE. That's exactly the
> > > direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> > > heading to. So I think we are fine here to use ctrl+isync here, right?
> 
> We are going to have to err on the side of strictness, that is, having
> the documentation place more requirements on the developer than the union
> of the hardware does.  Besides, I haven't heard any recent complaints
> that memory-barriers.txt is too simple.  ;-)
> 

Agreed ;-)

For atomic operations, using isync in ACQUIRE operations does guarantee
that a pure RELEASE/ACQUIRE chain provides transitivity. So, again, I
think we are fine here to use isync in ACQUIRE atomic operations,
unless you think we need to be more strict, i.e., making ACQUIRE itself
provide transitivity?
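
To make the shape concrete, here is a variant of Will's test above with
P1's store turned into a RELEASE (lwsync before the store). I haven't run
this through herd yet, so please take it as a sketch of the chain I have
in mind rather than a verified result, but I'd expect the exists clause
to be forbidden:

{
0:r2=x;
1:r2=x; 1:r5=z;
2:r2=z; 2:r4=x;
}
 P0           | P1            | P2           ;
 li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
 stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
              | beq LC00      | beq  LC01    ;
              | LC00:         | LC01:        ;
              | isync         | isync        ;
              | li r4,1       | lwz r3,0(r4) ;
              | lwsync        |              ;
              | stw r4,0(r5)  |              ;
exists
(1:r1=1 /\ 2:r1=1 /\ 2:r3=0)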

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-23  0:07                 ` Boqun Feng
@ 2015-09-25 21:29                   ` Paul E. McKenney
  2015-09-26  2:18                     ` Boqun Feng
  0 siblings, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2015-09-25 21:29 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Will Deacon, linux-kernel, linuxppc-dev, Peter Zijlstra,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long

On Wed, Sep 23, 2015 at 08:07:55AM +0800, Boqun Feng wrote:
> On Tue, Sep 22, 2015 at 08:25:40AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 22, 2015 at 07:37:04AM +0800, Boqun Feng wrote:
> > > On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> > > > On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > > > > Hi Boqun,
> > > > > 
> > > > > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > > > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > > > > using "lwsync" and "isync".
> > > > > > > > 
> > > > > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > > > > 
> > > > > > > Actually the transitivity is still guaranteed here, I think ;-)
> > > > > 
> > > > > The litmus test I'm thinking of is:
> > > > > 
> > > > > 
> > > > > {
> > > > > 0:r2=x;
> > > > > 1:r2=x; 1:r5=z;
> > > > > 2:r2=z; 2:r4=x;
> > > > > }
> > > > >  P0           | P1            | P2           ;
> > > > >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> > > > >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> > > > >               | beq LC00      | beq  LC01    ;
> > > > >               | LC00:         | LC01:        ;
> > > > >               | isync         | isync        ;
> > > > >               | li r4,1       | lwz r3,0(r4) ;
> > > > >               | stw r4,0(r5)  |              ;
> > > > > exists
> > > > > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > > > > 
> > > > > 
> > > > > Which appears to be allowed. I don't think you need to worry about backwards
> > > > > branches for the ctrl+isync construction (none of the current example do,
> > > > > afaict).
> > > > > 
> > > > 
> > > > Yes.. my care of backwards branches is not quite related to the topic, I
> > > > concerned that mostly because my test is using atomic operation, and I
> > > > just want to test the exact asm code.
> > > > 
> > > > > Anyway, all the problematic cases seem to arise when we start mixing
> > > > > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > > > > one group reads from an access in the other group). It would be simplest
> > > > > to say that this doesn't provide any transitivity guarantees, and that
> > > > > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > > > > 
> > > > 
> > > > Agreed. RELEASE alone doesn't provide transitivity and transitivity is
> > >           ^^^^^^^
> > > This should be ACQUIRE...
> > > 
> > > > guaranteed only if an ACQUIRE read from a RELEASE. That's exactly the
> > > > direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> > > > heading to. So I think we are fine here to use ctrl+isync here, right?
> > 
> > We are going to have to err on the side of strictness, that is, having
> > the documentation place more requirements on the developer than the union
> > of the hardware does.  Besides, I haven't heard any recent complaints
> > that memory-barriers.txt is too simple.  ;-)
> 
> Agreed ;-)
> 
> For atomic operations, using isync in ACQUIRE operations does gaurantee
> that a pure RELEASE/ACQUIRE chain provides transitivity. So, again, I
> think we are fine here to use isync in ACQUIRE atomic operations,
> unless you think we need to be more strict, i.e, making ACQUIRE itself
> provide transitivy?

As I understand it, either isync or lwsync suffices, with the choice
depending on the hardware.  The kernel will rewrite itself at boot time
if you use the appropriate macro.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants
  2015-09-25 21:29                   ` Paul E. McKenney
@ 2015-09-26  2:18                     ` Boqun Feng
  0 siblings, 0 replies; 42+ messages in thread
From: Boqun Feng @ 2015-09-26  2:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Will Deacon, linux-kernel, linuxppc-dev, Peter Zijlstra,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 4451 bytes --]

On Fri, Sep 25, 2015 at 02:29:04PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 23, 2015 at 08:07:55AM +0800, Boqun Feng wrote:
> > On Tue, Sep 22, 2015 at 08:25:40AM -0700, Paul E. McKenney wrote:
> > > On Tue, Sep 22, 2015 at 07:37:04AM +0800, Boqun Feng wrote:
> > > > On Tue, Sep 22, 2015 at 07:26:56AM +0800, Boqun Feng wrote:
> > > > > On Mon, Sep 21, 2015 at 11:24:27PM +0100, Will Deacon wrote:
> > > > > > Hi Boqun,
> > > > > > 
> > > > > > On Sun, Sep 20, 2015 at 09:23:03AM +0100, Boqun Feng wrote:
> > > > > > > On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> > > > > > > > On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > > > > > > > > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > > > > > > > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > > > > > > > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > > > > > > > > using "lwsync" and "isync".
> > > > > > > > > 
> > > > > > > > > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> > > > > > > > 
> > > > > > > > Actually the transitivity is still guaranteed here, I think ;-)
> > > > > > 
> > > > > > The litmus test I'm thinking of is:
> > > > > > 
> > > > > > 
> > > > > > {
> > > > > > 0:r2=x;
> > > > > > 1:r2=x; 1:r5=z;
> > > > > > 2:r2=z; 2:r4=x;
> > > > > > }
> > > > > >  P0           | P1            | P2           ;
> > > > > >  li r1,1      | lwz r1,0(r2)  | lwz r1,0(r2) ;
> > > > > >  stw r1,0(r2) | cmpw r1,r1    | cmpw r1,r1   ;
> > > > > >               | beq LC00      | beq  LC01    ;
> > > > > >               | LC00:         | LC01:        ;
> > > > > >               | isync         | isync        ;
> > > > > >               | li r4,1       | lwz r3,0(r4) ;
> > > > > >               | stw r4,0(r5)  |              ;
> > > > > > exists
> > > > > > (1:r1=1 /\ 2:r1=1 /\ 2:r3=0)
> > > > > > 
> > > > > > 
> > > > > > Which appears to be allowed. I don't think you need to worry about backwards
> > > > > > branches for the ctrl+isync construction (none of the current example do,
> > > > > > afaict).
> > > > > > 
> > > > > 
> > > > > Yes.. my care of backwards branches is not quite related to the topic, I
> > > > > concerned that mostly because my test is using atomic operation, and I
> > > > > just want to test the exact asm code.
> > > > > 
> > > > > > Anyway, all the problematic cases seem to arise when we start mixing
> > > > > > ACQUIRE/RELEASE accesses with relaxed accesses (i.e. where an access from
> > > > > > one group reads from an access in the other group). It would be simplest
> > > > > > to say that this doesn't provide any transitivity guarantees, and that
> > > > > > an ACQUIRE must always read from a RELEASE if transitivity is required.
> > > > > > 
> > > > > 
> > > > > Agreed. RELEASE alone doesn't provide transitivity and transitivity is
> > > >           ^^^^^^^
> > > > This should be ACQUIRE...
> > > > 
> > > > > guaranteed only if an ACQUIRE read from a RELEASE. That's exactly the
> > > > > direction which the link (https://lkml.org/lkml/2015/9/15/836) is
> > > > > heading to. So I think we are fine here to use ctrl+isync here, right?
> > > 
> > > We are going to have to err on the side of strictness, that is, having
> > > the documentation place more requirements on the developer than the union
> > > of the hardware does.  Besides, I haven't heard any recent complaints
> > > that memory-barriers.txt is too simple.  ;-)
> > 
> > Agreed ;-)
> > 
> > For atomic operations, using isync in ACQUIRE operations does gaurantee
> > that a pure RELEASE/ACQUIRE chain provides transitivity. So, again, I
> > think we are fine here to use isync in ACQUIRE atomic operations,
> > unless you think we need to be more strict, i.e, making ACQUIRE itself
> > provide transitivy?
> 
> As I understand it, either isync or lwsync suffices, with the choice
> depending on the hardware.  The kernel will rewrite itself at boot time
> if you use the appropriate macro.  ;-)
> 

Yep ;-)

Thank you and Will both for your comments. To be honest, I only began
learning about these transitivity and cumulativity things recently. It's
been a lot of fun to discuss them with you, Peter and Will, and the
herdtools you suggested and the N2745 document you wrote really helped a
lot. Thank you again!

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-09-16 15:49 ` [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants Boqun Feng
@ 2015-10-01 12:24   ` Peter Zijlstra
  2015-10-01 15:09     ` Paul E. McKenney
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-01 12:24 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long

On Wed, Sep 16, 2015 at 11:49:32PM +0800, Boqun Feng wrote:
> Implement xchg_relaxed and define atomic{,64}_xchg_* as xchg_relaxed,
> based on these _relaxed variants, release/acquire variants can be built.
> 
> Note that xchg_relaxed and atomic_{,64}_xchg_relaxed are not compiler
> barriers.

Hmm, and I note your previous patch creating the regular _relaxed
thingies also removes the memory clobber.

And looking at the ARM _relaxed patch from Will, I see their _relaxed
ops are also not a compiler barrier.

I must say I'm somewhat surprised by this level of relaxation; I had
expected to only lose SMP barriers, not the program order ones.

Is there a good argument for this?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-09-16 15:49 ` [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants Boqun Feng
@ 2015-10-01 12:27   ` Peter Zijlstra
  2015-10-01 12:36     ` Peter Zijlstra
  2015-10-10  1:58     ` Boqun Feng
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-01 12:27 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long

On Wed, Sep 16, 2015 at 11:49:33PM +0800, Boqun Feng wrote:
> Unlike other atomic operation variants, cmpxchg{,64}_acquire and
> atomic{,64}_cmpxchg_acquire don't have acquire semantics if the cmp part
> fails, so we need to implement these using assembly.

I think that is actually expected and documented. That is, a cmpxchg
only implies barriers on success. See:

  ed2de9f74ecb ("locking/Documentation: Clarify failed cmpxchg() memory ordering semantics")

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 6/7] powerpc: atomic: Make atomic{,64}_xchg and xchg a full barrier
  2015-09-16 15:49 ` [RFC v2 6/7] powerpc: atomic: Make atomic{,64}_xchg and xchg a full barrier Boqun Feng
@ 2015-10-01 12:28   ` Peter Zijlstra
       [not found]     ` <CAL0jBu_=w-GDtg6B+YXLmEn=Qac0mnWaWzS6NvLC9KMcNSj8=w@mail.gmail.com>
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-01 12:28 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long

On Wed, Sep 16, 2015 at 11:49:34PM +0800, Boqun Feng wrote:
> According to memory-barriers.txt, xchg and its atomic{,64}_ versions
> need to imply a full barrier, however they are now just RELEASE+ACQUIRE,
> which is not a full barrier.
> 
> So remove the definition of xchg(), and let __atomic_op_fence() build
> the full-barrier versions of these operations.

Do you want to do a patch for -stable fixing the current implementation?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-10-01 12:27   ` Peter Zijlstra
@ 2015-10-01 12:36     ` Peter Zijlstra
  2015-10-01 15:12       ` Paul E. McKenney
  2015-10-01 15:13       ` Paul E. McKenney
  2015-10-10  1:58     ` Boqun Feng
  1 sibling, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-01 12:36 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long

On Thu, Oct 01, 2015 at 02:27:15PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 16, 2015 at 11:49:33PM +0800, Boqun Feng wrote:
> > Unlike other atomic operation variants, cmpxchg{,64}_acquire and
> > atomic{,64}_cmpxchg_acquire don't have acquire semantics if the cmp part
> > fails, so we need to implement these using assembly.
> 
> I think that is actually expected and documented. That is, a cmpxchg
> only implies barriers on success. See:
> 
>   ed2de9f74ecb ("locking/Documentation: Clarify failed cmpxchg() memory ordering semantics")

Also:

654672d4ba1a6 (Will Deacon     2015-08-06 17:54:37 +0100  28)  * store portion of the operation. Note that a failed cmpxchg_acquire
654672d4ba1a6 (Will Deacon     2015-08-06 17:54:37 +0100  29)  * does -not- imply any memory ordering constraints.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-01 12:24   ` Peter Zijlstra
@ 2015-10-01 15:09     ` Paul E. McKenney
  2015-10-01 17:13       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2015-10-01 15:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 16, 2015 at 11:49:32PM +0800, Boqun Feng wrote:
> > Implement xchg_relaxed and define atomic{,64}_xchg_* as xchg_relaxed,
> > based on these _relaxed variants, release/acquire variants can be built.
> > 
> > Note that xchg_relaxed and atomic_{,64}_xchg_relaxed are not compiler
> > barriers.
> 
> Hmm, and I note your previous patch creating the regular _relaxed
> thingies also removes the memory clobber.
> 
> And looking at the ARM _relaxed patch from Will, I see their _relaxed
> ops are also not a compiler barrier.
> 
> I must say I'm somewhat surprised by this level of relaxation, I had
> expected to only loose SMP barriers, not the program order ones.
> 
> Is there a good argument for this?

Yes, when we say "relaxed", we really mean relaxed.  ;-)

Both the CPU and the compiler are allowed to reorder around relaxed
operations.
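
In C11 terms, for example (a minimal sketch using <stdatomic.h> rather
than the kernel API, just to show what "relaxed" permits):

#include <stdatomic.h>

atomic_int flag;
int data;

void writer(void)
{
	data = 42;	/* plain store */

	/*
	 * The store to flag is atomic and cannot be optimized away,
	 * but with memory_order_relaxed neither the compiler nor the
	 * CPU is obliged to keep the plain store to data ordered
	 * before it.
	 */
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
}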

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-10-01 12:36     ` Peter Zijlstra
@ 2015-10-01 15:12       ` Paul E. McKenney
  2015-10-01 17:11         ` Peter Zijlstra
  2015-10-01 15:13       ` Paul E. McKenney
  1 sibling, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2015-10-01 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

On Thu, Oct 01, 2015 at 02:36:26PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 01, 2015 at 02:27:15PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 16, 2015 at 11:49:33PM +0800, Boqun Feng wrote:
> > > Unlike other atomic operation variants, cmpxchg{,64}_acquire and
> > > atomic{,64}_cmpxchg_acquire don't have acquire semantics if the cmp part
> > > fails, so we need to implement these using assembly.
> > 
> > I think that is actually expected and documented. That is, a cmpxchg
> > only implies barriers on success. See:
> > 
> >   ed2de9f74ecb ("locking/Documentation: Clarify failed cmpxchg() memory ordering semantics")
> 
> Also:
> 
> 654672d4ba1a6 (Will Deacon     2015-08-06 17:54:37 +0100  28)  * store portion of the operation. Note that a failed cmpxchg_acquire
> 654672d4ba1a6 (Will Deacon     2015-08-06 17:54:37 +0100  29)  * does -not- imply any memory ordering constraints.

What C11 does is to allow the developer to specify different orderings
on success and failure.  But it is no harder to supply a barrier (if
needed) on the failure path, right?
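
For reference, the C11 shape is roughly the following (a sketch only;
the lock_owner/try_acquire names are made up for illustration):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int lock_owner;

bool try_acquire(int me)
{
	int expected = 0;

	/*
	 * Separate orderings for the success and failure cases:
	 * acquire on success, relaxed on failure.  A caller that
	 * needed ordering on the failure path would add an explicit
	 * fence itself.
	 */
	return atomic_compare_exchange_strong_explicit(&lock_owner,
				&expected, me,
				memory_order_acquire,
				memory_order_relaxed);
}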

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-10-01 12:36     ` Peter Zijlstra
  2015-10-01 15:12       ` Paul E. McKenney
@ 2015-10-01 15:13       ` Paul E. McKenney
  1 sibling, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2015-10-01 15:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

On Thu, Oct 01, 2015 at 02:36:26PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 01, 2015 at 02:27:15PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 16, 2015 at 11:49:33PM +0800, Boqun Feng wrote:
> > > Unlike other atomic operation variants, cmpxchg{,64}_acquire and
> > > atomic{,64}_cmpxchg_acquire don't have acquire semantics if the cmp part
> > > fails, so we need to implement these using assembly.
> > 
> > I think that is actually expected and documented. That is, a cmpxchg
> > only implies barriers on success. See:
> > 
> >   ed2de9f74ecb ("locking/Documentation: Clarify failed cmpxchg() memory ordering semantics")
> 
> Also:
> 
> 654672d4ba1a6 (Will Deacon     2015-08-06 17:54:37 +0100  28)  * store portion of the operation. Note that a failed cmpxchg_acquire
> 654672d4ba1a6 (Will Deacon     2015-08-06 17:54:37 +0100  29)  * does -not- imply any memory ordering constraints.

Agreed, no need for ordering on failed cmpxchg.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-10-01 15:12       ` Paul E. McKenney
@ 2015-10-01 17:11         ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-01 17:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

On Thu, Oct 01, 2015 at 08:12:19AM -0700, Paul E. McKenney wrote:

> What C11 does is to allow the developer to specify different orderings
> on success and failure.  But it is no harder to supply a barrier (if
> needed) on the failure path, right?

Quite right.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-01 15:09     ` Paul E. McKenney
@ 2015-10-01 17:13       ` Peter Zijlstra
  2015-10-01 18:03         ` Paul E. McKenney
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-01 17:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

On Thu, Oct 01, 2015 at 08:09:09AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:

> > I must say I'm somewhat surprised by this level of relaxation, I had
> > expected to only loose SMP barriers, not the program order ones.
> > 
> > Is there a good argument for this?
> 
> Yes, when we say "relaxed", we really mean relaxed.  ;-)
> 
> Both the CPU and the compiler are allowed to reorder around relaxed
> operations.

Is this documented somewhere, because I completely missed this part.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-01 17:13       ` Peter Zijlstra
@ 2015-10-01 18:03         ` Paul E. McKenney
  2015-10-01 18:23           ` Peter Zijlstra
                             ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Paul E. McKenney @ 2015-10-01 18:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

On Thu, Oct 01, 2015 at 07:13:04PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 01, 2015 at 08:09:09AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:
> 
> > > I must say I'm somewhat surprised by this level of relaxation, I had
> > > expected to only loose SMP barriers, not the program order ones.
> > > 
> > > Is there a good argument for this?
> > 
> > Yes, when we say "relaxed", we really mean relaxed.  ;-)
> > 
> > Both the CPU and the compiler are allowed to reorder around relaxed
> > operations.
> 
> Is this documented somewhere, because I completely missed this part.

Well, yes, these need to be added to the documentation.  I am assuming
that Will is looking to have the same effect as C11 memory_order_relaxed,
which is relaxed in this sense.  If he has something else in mind,
he needs to tell us what it is and why.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-01 18:03         ` Paul E. McKenney
@ 2015-10-01 18:23           ` Peter Zijlstra
  2015-10-01 19:41             ` Paul E. McKenney
  2015-10-05 14:44           ` Will Deacon
  2015-10-12  1:17           ` Boqun Feng
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-01 18:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

On Thu, Oct 01, 2015 at 11:03:01AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 01, 2015 at 07:13:04PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 01, 2015 at 08:09:09AM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:
> > 
> > > > I must say I'm somewhat surprised by this level of relaxation, I had
> > > > expected to only loose SMP barriers, not the program order ones.
> > > > 
> > > > Is there a good argument for this?
> > > 
> > > Yes, when we say "relaxed", we really mean relaxed.  ;-)
> > > 
> > > Both the CPU and the compiler are allowed to reorder around relaxed
> > > operations.
> > 
> > Is this documented somewhere, because I completely missed this part.
> 
> Well, yes, these need to be added to the documentation.  I am assuming
> that Will is looking to have the same effect as C11 memory_order_relaxed,
> which is relaxed in this sense.  If he has something else in mind,
> he needs to tell us what it is and why.  ;-)

I suspect he is; but I'm not _that_ up to date on the whole C11 stuff.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-01 18:23           ` Peter Zijlstra
@ 2015-10-01 19:41             ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2015-10-01 19:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

On Thu, Oct 01, 2015 at 08:23:09PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 01, 2015 at 11:03:01AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 01, 2015 at 07:13:04PM +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 01, 2015 at 08:09:09AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:
> > > 
> > > > > I must say I'm somewhat surprised by this level of relaxation, I had
> > > > > expected to only loose SMP barriers, not the program order ones.
> > > > > 
> > > > > Is there a good argument for this?
> > > > 
> > > > Yes, when we say "relaxed", we really mean relaxed.  ;-)
> > > > 
> > > > Both the CPU and the compiler are allowed to reorder around relaxed
> > > > operations.
> > > 
> > > Is this documented somewhere, because I completely missed this part.
> > 
> > Well, yes, these need to be added to the documentation.  I am assuming
> > that Will is looking to have the same effect as C11 memory_order_relaxed,
> > which is relaxed in this sense.  If he has something else in mind,
> > he needs to tell us what it is and why.  ;-)
> 
> I suspect he is; but I'm not _that_ up to date on the whole C11 stuff.

Lucky you!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 6/7] powerpc: atomic: Make atomic{,64}_xchg and xchg a full barrier
       [not found]     ` <CAL0jBu_=w-GDtg6B+YXLmEn=Qac0mnWaWzS6NvLC9KMcNSj8=w@mail.gmail.com>
@ 2015-10-02  5:25       ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-02  5:25 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Benjamin Herrenschmidt, Paul E. McKenney, Thomas Gleixner,
	Paul Mackerras, linuxppc-dev, linux-kernel, Waiman Long,
	Ingo Molnar, Will Deacon, Michael Ellerman

On Fri, Oct 02, 2015 at 07:19:04AM +0800, Boqun Feng wrote:
> Hi Peter,
> 
> Please forgive me for the format of my reply. I'm travelling,
> and replying from my phone.
> 
> On Oct 1, 2015 at 7:28 PM, "Peter Zijlstra" <peterz@infradead.org> wrote:
> >
> > On Wed, Sep 16, 2015 at 11:49:34PM +0800, Boqun Feng wrote:
> > > According to memory-barriers.txt, xchg and its atomic{,64}_ versions
> > > need to imply a full barrier, however they are now just RELEASE+ACQUIRE,
> > > which is not a full barrier.
> > >
> > > So remove the definition of xchg(), and let __atomic_op_fence() build
> > > the full-barrier versions of these operations.
> >
> > Do you want to do a patch for -stable fixing the current implementation?
> 
> Good idea! I didn't think of this before, and I'd love to do the patch,
> but thing is that I'm not able to use my laptop until Oct 10th.
> I will send the patch once I'm back.
> Does that work for you?

Sure, no hurry.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-01 18:03         ` Paul E. McKenney
  2015-10-01 18:23           ` Peter Zijlstra
@ 2015-10-05 14:44           ` Will Deacon
  2015-10-05 16:57             ` Paul E. McKenney
  2015-10-12  1:17           ` Boqun Feng
  2 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2015-10-05 14:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Boqun Feng, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long

On Thu, Oct 01, 2015 at 07:03:01PM +0100, Paul E. McKenney wrote:
> On Thu, Oct 01, 2015 at 07:13:04PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 01, 2015 at 08:09:09AM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:
> > 
> > > > I must say I'm somewhat surprised by this level of relaxation, I had
> > > > expected to only loose SMP barriers, not the program order ones.
> > > > 
> > > > Is there a good argument for this?
> > > 
> > > Yes, when we say "relaxed", we really mean relaxed.  ;-)
> > > 
> > > Both the CPU and the compiler are allowed to reorder around relaxed
> > > operations.
> > 
> > Is this documented somewhere, because I completely missed this part.
> 
> Well, yes, these need to be added to the documentation.  I am assuming
> that Will is looking to have the same effect as C11 memory_order_relaxed,
> which is relaxed in this sense.  If he has something else in mind,
> he needs to tell us what it is and why.  ;-)

I was treating them purely as being single-copy atomic and not providing
any memory ordering guarantees (much like the non *_return atomic operations
that we already have). I think this lines up with C11, minus the bits
about data races which we don't call out anyway.

Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-05 14:44           ` Will Deacon
@ 2015-10-05 16:57             ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2015-10-05 16:57 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Boqun Feng, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long

On Mon, Oct 05, 2015 at 03:44:07PM +0100, Will Deacon wrote:
> On Thu, Oct 01, 2015 at 07:03:01PM +0100, Paul E. McKenney wrote:
> > On Thu, Oct 01, 2015 at 07:13:04PM +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 01, 2015 at 08:09:09AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:
> > > 
> > > > > I must say I'm somewhat surprised by this level of relaxation, I had
> > > > > expected to only loose SMP barriers, not the program order ones.
> > > > > 
> > > > > Is there a good argument for this?
> > > > 
> > > > Yes, when we say "relaxed", we really mean relaxed.  ;-)
> > > > 
> > > > Both the CPU and the compiler are allowed to reorder around relaxed
> > > > operations.
> > > 
> > > Is this documented somewhere, because I completely missed this part.
> > 
> > Well, yes, these need to be added to the documentation.  I am assuming
> > that Will is looking to have the same effect as C11 memory_order_relaxed,
> > which is relaxed in this sense.  If he has something else in mind,
> > he needs to tell us what it is and why.  ;-)
> 
> I was treating them purely as being single-copy atomic and not providing
> any memory ordering guarantees (much like the non *_return atomic operations
> that we already have). I think this lines up with C11, minus the bits
> about data races which we don't call out anyway.

As long as it is single-copy atomic and not multi-copy atomic, I believe
we are on the same page.  We have slowly been outlawing some sorts of
data races over the past few years, and I would guess that this will
continue, especially if good tooling emerges (and KTSAN is showing some
promise from what I can see).

							Thanx, Paul


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-10-01 12:27   ` Peter Zijlstra
  2015-10-01 12:36     ` Peter Zijlstra
@ 2015-10-10  1:58     ` Boqun Feng
  2015-10-11 10:25       ` Boqun Feng
  1 sibling, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-10-10  1:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linuxppc-dev, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 967 bytes --]

Hi Peter,

Sorry for replying late.

On Thu, Oct 01, 2015 at 02:27:16PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 16, 2015 at 11:49:33PM +0800, Boqun Feng wrote:
> > Unlike other atomic operation variants, cmpxchg{,64}_acquire and
> > atomic{,64}_cmpxchg_acquire don't have acquire semantics if the cmp part
> > fails, so we need to implement these using assembly.
> 
> I think that is actually expected and documented. That is, a cmpxchg
> only implies barriers on success. See:
> 
>   ed2de9f74ecb ("locking/Documentation: Clarify failed cmpxchg() memory ordering semantics")

I probably didn't make myself clear here; my point is that if we use
__atomic_op_acquire() to build *_cmpxchg_acquire() (for ARM and PowerPC),
the barrier will be implied _unconditionally_, meaning that no matter
whether the cmp fails or not, there will be a barrier after the cmpxchg
operation. Therefore we have to use assembly to implement the operations
right now.

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-10-10  1:58     ` Boqun Feng
@ 2015-10-11 10:25       ` Boqun Feng
  2015-10-12  6:46         ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Boqun Feng @ 2015-10-11 10:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linuxppc-dev, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 2244 bytes --]

On Sat, Oct 10, 2015 at 09:58:05AM +0800, Boqun Feng wrote:
> Hi Peter,
> 
> Sorry for replying late.
> 
> On Thu, Oct 01, 2015 at 02:27:16PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 16, 2015 at 11:49:33PM +0800, Boqun Feng wrote:
> > > Unlike other atomic operation variants, cmpxchg{,64}_acquire and
> > > atomic{,64}_cmpxchg_acquire don't have acquire semantics if the cmp part
> > > fails, so we need to implement these using assembly.
> > 
> > I think that is actually expected and documented. That is, a cmpxchg
> > only implies barriers on success. See:
> > 
> >   ed2de9f74ecb ("locking/Documentation: Clarify failed cmpxchg() memory ordering semantics")
> 
> I probably didn't make myself clear here, my point is that if we use
> __atomic_op_acquire() to built *_cmpchg_acquire(For ARM and PowerPC),
> the barrier will be implied _unconditionally_, meaning no matter cmp
> fails or not, there will be a barrier after the cmpxchg operation.
> Therefore we have to use assembly to implement the operations right now.
> 

Or let me try another way to explain this. What I wanted to say here is
that unlike the implementation of the xchg family, which only needs to
implement the _relaxed version and *remove* the fully ordered version,
the implementation of the cmpxchg family needs to *retain* the fully
ordered version and implement the _acquire version in assembly. That is
because, if we use __atomic_op_*(), the barriers in the cmpxchg family
will be implied *unconditionally*, for example:

cmpxchg() on PPC will be (built by my version of __atomic_op_fence()):

	smp_lwsync();
	cmpxchg_relaxed(...);
	smp_mb__after_atomic(); // a full barrier regardless of success
				// or failure.

In order to have a conditional barrier, we need a way to jump out of an
ll/sc loop, which could only(?) be done by assembly code.
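
Something along these lines is what I have in mind (a sketch only, not
the final patch; PPC_ACQUIRE_BARRIER stands here for whatever acquire
barrier ends up being used, isync or lwsync per the earlier discussion,
and details such as erratum workarounds are omitted):

/*
 * Sketch only: on a failed compare we branch past the acquire barrier,
 * so the barrier is conditional on success.
 */
static __always_inline unsigned long
__cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
{
	unsigned long prev;

	__asm__ __volatile__ (
"1:	lwarx	%0,0,%2		# __cmpxchg_u32_acquire\n"
"	cmpw	0,%0,%3\n"
"	bne-	2f\n"			/* cmp failed: skip the barrier */
"	stwcx.	%4,0,%2\n"
"	bne-	1b\n"
	PPC_ACQUIRE_BARRIER		/* only on the success path */
	"\n"
"2:"
	: "=&r" (prev), "+m" (*p)
	: "r" (p), "r" (old), "r" (new)
	: "cc", "memory");

	return prev;
}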

My commit log surely failed to explain this clearly; I will modify that
in the next series. In the meantime, I'm looking forward to suggestions
on the implementation of the cmpxchg family ;-)

BTW, Will, could you please check whether the barriers in the cmpxchg
family are unconditional or not in the current ARM implementation? IIUC,
they are currently unconditional, right?

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-01 18:03         ` Paul E. McKenney
  2015-10-01 18:23           ` Peter Zijlstra
  2015-10-05 14:44           ` Will Deacon
@ 2015-10-12  1:17           ` Boqun Feng
  2015-10-12  9:28             ` Will Deacon
  2015-10-12 23:24             ` Paul E. McKenney
  2 siblings, 2 replies; 42+ messages in thread
From: Boqun Feng @ 2015-10-12  1:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 2031 bytes --]

Hi Paul,

On Thu, Oct 01, 2015 at 11:03:01AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 01, 2015 at 07:13:04PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 01, 2015 at 08:09:09AM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:
> > 
> > > > I must say I'm somewhat surprised by this level of relaxation, I had
> > > > expected to only loose SMP barriers, not the program order ones.
> > > > 
> > > > Is there a good argument for this?
> > > 
> > > Yes, when we say "relaxed", we really mean relaxed.  ;-)
> > > 
> > > Both the CPU and the compiler are allowed to reorder around relaxed
> > > operations.
> > 
> > Is this documented somewhere, because I completely missed this part.
> 
> Well, yes, these need to be added to the documentation.  I am assuming

Maybe it's a good time for us to call out which operations should be
compiler barriers or CPU barriers?

I had something in mind while I was working on this series; I'm not
really sure whether it's correct, but it's probably a starting point:

All global and local atomic operations are at least atomic (no one can
observe an intermediate state) and volatile (compilers can't optimize
out the memory access). Based on this, there are four strictness levels
one can rely on:

RELAXED: neither a compiler barrier nor a CPU barrier
LOCAL: a compiler barrier
PARTIAL: both a compiler barrier and a CPU barrier, but not transitive
FULL: both a compiler barrier and a CPU barrier, and transitive

RELAXED includes all _relaxed variants and non-return atomics, LOCAL
includes all local atomics (local_* and {cmp}xchg_local), PARTIAL
includes _acquire and _release operations, and FULL includes all fully
ordered global atomic operations.
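
As an example of where PARTIAL would be all one needs (a sketch,
assuming the atomic_xchg_{acquire,release}() variants built by this
series; producer/consumer are illustrative names):

int data;
atomic_t ready = ATOMIC_INIT(0);

void producer(void)
{
	data = 42;
	/* PARTIAL: release pairs with the acquire below */
	(void)atomic_xchg_release(&ready, 1);
}

void consumer(void)
{
	/* PARTIAL: if we see the release, data must be visible too */
	if (atomic_xchg_acquire(&ready, 0) == 1)
		BUG_ON(data != 42);
}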

Thoughts?

Regards,
Boqun

> that Will is looking to have the same effect as C11 memory_order_relaxed,
> which is relaxed in this sense.  If he has something else in mind,
> he needs to tell us what it is and why.  ;-)
> 
> 							Thanx, Paul
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-10-11 10:25       ` Boqun Feng
@ 2015-10-12  6:46         ` Peter Zijlstra
  2015-10-12  7:03           ` Boqun Feng
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2015-10-12  6:46 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long

On Sun, Oct 11, 2015 at 06:25:20PM +0800, Boqun Feng wrote:
> On Sat, Oct 10, 2015 at 09:58:05AM +0800, Boqun Feng wrote:
> > Hi Peter,
> > 
> > Sorry for replying late.
> > 
> > On Thu, Oct 01, 2015 at 02:27:16PM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 16, 2015 at 11:49:33PM +0800, Boqun Feng wrote:
> > > > Unlike other atomic operation variants, cmpxchg{,64}_acquire and
> > > > atomic{,64}_cmpxchg_acquire don't have acquire semantics if the cmp part
> > > > fails, so we need to implement these using assembly.
> > > 
> > > I think that is actually expected and documented. That is, a cmpxchg
> > > only implies barriers on success. See:
> > > 
> > >   ed2de9f74ecb ("locking/Documentation: Clarify failed cmpxchg() memory ordering semantics")
> > 
> > I probably didn't make myself clear here, my point is that if we use
> > __atomic_op_acquire() to built *_cmpchg_acquire(For ARM and PowerPC),
> > the barrier will be implied _unconditionally_, meaning no matter cmp
> > fails or not, there will be a barrier after the cmpxchg operation.
> > Therefore we have to use assembly to implement the operations right now.

See later, but no, you don't _have_ to.

> Or let me try another way to explain this. What I wanted to say here is
> that unlike the implementation of xchg family, which needs only to
> implement _relaxed version and *remove* the fully ordered version, the
> implementation of cmpxchg family needs to *remain* the fully ordered
> version and implement the _acquire version in assembly. Because if we
> use __atomic_op_*(), the barriers in the cmpxchg family will be implied
> *unconditionally*, for example:

So the point that confused me, and which is still valid for the above,
is your use of 'need'.

You don't need to omit the barrier at all. It's perfectly valid to issue
too many barriers (pointless and a waste of time, yes; incorrect, no).

So what you want to say is: "Optimize cmpxchg_acquire() to avoid
superfluous barrier".

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-10-12  6:46         ` Peter Zijlstra
@ 2015-10-12  7:03           ` Boqun Feng
  0 siblings, 0 replies; 42+ messages in thread
From: Boqun Feng @ 2015-10-12  7:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linuxppc-dev, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 2514 bytes --]

On Mon, Oct 12, 2015 at 08:46:21AM +0200, Peter Zijlstra wrote:
> On Sun, Oct 11, 2015 at 06:25:20PM +0800, Boqun Feng wrote:
> > On Sat, Oct 10, 2015 at 09:58:05AM +0800, Boqun Feng wrote:
> > > Hi Peter,
> > > 
> > > Sorry for replying late.
> > > 
> > > On Thu, Oct 01, 2015 at 02:27:16PM +0200, Peter Zijlstra wrote:
> > > > On Wed, Sep 16, 2015 at 11:49:33PM +0800, Boqun Feng wrote:
> > > > > Unlike other atomic operation variants, cmpxchg{,64}_acquire and
> > > > > atomic{,64}_cmpxchg_acquire don't have acquire semantics if the cmp part
> > > > > fails, so we need to implement these using assembly.
> > > > 
> > > > I think that is actually expected and documented. That is, a cmpxchg
> > > > only implies barriers on success. See:
> > > > 
> > > >   ed2de9f74ecb ("locking/Documentation: Clarify failed cmpxchg() memory ordering semantics")
> > > 
> > > I probably didn't make myself clear here, my point is that if we use
> > > __atomic_op_acquire() to built *_cmpchg_acquire(For ARM and PowerPC),
> > > the barrier will be implied _unconditionally_, meaning no matter cmp
> > > fails or not, there will be a barrier after the cmpxchg operation.
> > > Therefore we have to use assembly to implement the operations right now.
> 
> See later, but no, you don't _have_ to.
> 
> > Or let me try another way to explain this. What I wanted to say here is
> > that unlike the implementation of xchg family, which needs only to
> > implement _relaxed version and *remove* the fully ordered version, the
> > implementation of cmpxchg family needs to *remain* the fully ordered
> > version and implement the _acquire version in assembly. Because if we
> > use __atomic_op_*(), the barriers in the cmpxchg family will be implied
> > *unconditionally*, for example:
> 
> So the point that confused me, and which is still valid for the above,
> is your use of 'need'.
> 

Indeed, my apologies for confusing things further...

> You don't need to omit the barrier at all. Its perfectly valid to issue
> too many barriers (pointless and a waste of time, yes; incorrect, no).
> 

Agreed.

> So what you want to say is: "Optimize cmpxchg_acquire() to avoid
> superfluous barrier".

Yes. I would like to implement a cmpxchg_acquire that really has no
barrier if the cmp fails, which is where my 'need' for the implementation
came from. But you are right, we don't have to omit the barrier. I will
modify the commit log to explain this clearly in the next version. Thank you!

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-12  1:17           ` Boqun Feng
@ 2015-10-12  9:28             ` Will Deacon
  2015-10-12 23:24             ` Paul E. McKenney
  1 sibling, 0 replies; 42+ messages in thread
From: Will Deacon @ 2015-10-12  9:28 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long

On Mon, Oct 12, 2015 at 09:17:50AM +0800, Boqun Feng wrote:
> On Thu, Oct 01, 2015 at 11:03:01AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 01, 2015 at 07:13:04PM +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 01, 2015 at 08:09:09AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:
> > > 
> > > > > I must say I'm somewhat surprised by this level of relaxation, I had
> > > > > expected to only loose SMP barriers, not the program order ones.
> > > > > 
> > > > > Is there a good argument for this?
> > > > 
> > > > Yes, when we say "relaxed", we really mean relaxed.  ;-)
> > > > 
> > > > Both the CPU and the compiler are allowed to reorder around relaxed
> > > > operations.
> > > 
> > > Is this documented somewhere, because I completely missed this part.
> > 
> > Well, yes, these need to be added to the documentation.  I am assuming
> 
> Maybe it's good time for us to call it out which operation should be
> a compiler barrier or a CPU barrier?
> 
> I had something in my mind while I was working on this series, not
> really sure whether it's correct, but probably a start point:
> 
> All global and local atomic operations are at least atomic(no one can
> observe the middle state) and volatile(compilers can't optimize out the
> memory access). Based on this, there are four strictness levels, one
> can rely on them:
> 
> RELAXED: neither a compiler barrier or a CPU barrier
> LOCAL: a compiler barrier
> PARTIAL: both a compiler barrier and a CPU barrier but not transitive
> FULL: both compiler barrier and a CPU barrier, and transitive.
> 
> RELAXED includes all _relaxed variants and non-return atomics, LOCAL
> includes all local atomics(local_* and {cmp}xchg_local), PARTIAL
> includes _acquire and _release operations and FULL includes all fully
> ordered global atomic operations.
> 
> Thoughts?

I think that's where we currently are already, apart from defining
transitivity (see the other thread), which makes things a whole lot more
muddy.

That said, on Friday we seemed to be in broad agreement on the semantics
-- the difficult part is getting the language right (which is why we
started to discuss including litmus tests alongside the documentation).

Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/7] atomics: Add test for atomic operations with _relaxed variants
  2015-09-16 15:49 ` [RFC v2 1/7] atomics: Add test for atomic operations with _relaxed variants Boqun Feng
@ 2015-10-12  9:30   ` Will Deacon
  2015-10-12  9:38     ` Boqun Feng
  0 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2015-10-12  9:30 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Paul E. McKenney, Waiman Long

On Wed, Sep 16, 2015 at 04:49:29PM +0100, Boqun Feng wrote:
> Some atomic operations now have _{relaxed, acquire, release} variants,
> this patch then adds some trivial tests for two purpose:
> 
> 1.	test the behavior of these new operations in single-CPU
> 	environment.
> 2.	make their code generated before we actually use them somewhere,
> 	so that we can examine their assembly code.
> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  lib/atomic64_test.c | 91 ++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 59 insertions(+), 32 deletions(-)

FWIW: this was useful to me whilst developing the arm64 relaxed atomic
implementation. Could you extend it to include the recently queued inc/dec
variants too?

Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 1/7] atomics: Add test for atomic operations with _relaxed variants
  2015-10-12  9:30   ` Will Deacon
@ 2015-10-12  9:38     ` Boqun Feng
  0 siblings, 0 replies; 42+ messages in thread
From: Boqun Feng @ 2015-10-12  9:38 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Paul E. McKenney, Waiman Long

[-- Attachment #1: Type: text/plain, Size: 943 bytes --]

On Mon, Oct 12, 2015 at 10:30:34AM +0100, Will Deacon wrote:
> On Wed, Sep 16, 2015 at 04:49:29PM +0100, Boqun Feng wrote:
> > Some atomic operations now have _{relaxed, acquire, release} variants,
> > this patch then adds some trivial tests for two purpose:
> > 
> > 1.	test the behavior of these new operations in single-CPU
> > 	environment.
> > 2.	make their code generated before we actually use them somewhere,
> > 	so that we can examine their assembly code.
> > 
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > ---
> >  lib/atomic64_test.c | 91 ++++++++++++++++++++++++++++++++++-------------------
> >  1 file changed, 59 insertions(+), 32 deletions(-)
> 
> FWIW: this was useful to me whilst developing the arm64 relaxed atomic

Glad to know.

> implementation. Could you extend it to include the recently queued inc/dec
> variants too?
> 

They're already included in the upcoming V3 ;-)

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-12  1:17           ` Boqun Feng
  2015-10-12  9:28             ` Will Deacon
@ 2015-10-12 23:24             ` Paul E. McKenney
  1 sibling, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2015-10-12 23:24 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long

On Mon, Oct 12, 2015 at 09:17:50AM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Thu, Oct 01, 2015 at 11:03:01AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 01, 2015 at 07:13:04PM +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 01, 2015 at 08:09:09AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Oct 01, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote:
> > > 
> > > > > I must say I'm somewhat surprised by this level of relaxation; I had
> > > > > expected to only lose SMP barriers, not the program-order ones.
> > > > > 
> > > > > Is there a good argument for this?
> > > > 
> > > > Yes, when we say "relaxed", we really mean relaxed.  ;-)
> > > > 
> > > > Both the CPU and the compiler are allowed to reorder around relaxed
> > > > operations.
> > > 
> > > Is this documented somewhere, because I completely missed this part.
> > 
> > Well, yes, these need to be added to the documentation.  I am assuming
> 
> Maybe it's a good time for us to spell out which operations should act as
> a compiler barrier and which as a CPU barrier?
> 
> I had something in mind while I was working on this series; I'm not
> really sure whether it's correct, but it's probably a starting point:
> 
> All global and local atomic operations are at least atomic (no one can
> observe an intermediate state) and volatile (compilers can't optimize out
> the memory access). Based on this, there are four strictness levels one
> can rely on:
> 
> RELAXED: neither a compiler barrier nor a CPU barrier
> LOCAL: a compiler barrier
> PARTIAL: both a compiler barrier and a CPU barrier but not transitive
> FULL: both a compiler barrier and a CPU barrier, and transitive.

As Will noted, we have two types of transitivity.  The first type is that
of release-acquire chains, where the transitivity is only observable
within the chain.  The second type is that of smp_mb(), where the
transitivity is observable globally.

							Thanx, Paul

> RELAXED includes all _relaxed variants and non-value-returning atomics,
> LOCAL includes all local atomics (local_* and {cmp}xchg_local), PARTIAL
> includes the _acquire and _release operations, and FULL includes all
> fully ordered global atomic operations.
> 
> Thoughts?
> 
> Regards,
> Boqun
> 
> > that Will is looking to have the same effect as C11 memory_order_relaxed,
> > which is relaxed in this sense.  If he has something else in mind,
> > he needs to tell us what it is and why.  ;-)
> > 
> > 							Thanx, Paul
> > 



^ permalink raw reply	[flat|nested] 42+ messages in thread
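
For reference, the generic helpers that build the stronger variants out of
the _relaxed ones plus barriers look roughly like the following (a
simplified sketch of the include/linux/atomic.h approach, not an exact
copy):

#define __atomic_op_acquire(op, args...)				\
({									\
	typeof(op##_relaxed(args)) __ret = op##_relaxed(args);		\
	smp_mb__after_atomic();						\
	__ret;								\
})

#define __atomic_op_release(op, args...)				\
({									\
	smp_mb__before_atomic();					\
	op##_relaxed(args);						\
})

#define __atomic_op_fence(op, args...)					\
({									\
	typeof(op##_relaxed(args)) __ret;				\
	smp_mb__before_atomic();					\
	__ret = op##_relaxed(args);					\
	smp_mb__after_atomic();						\
	__ret;								\
})

In the classification above, a bare _relaxed operation is RELAXED and the
fence form is FULL; the point of letting architectures override these
helpers is that powerpc can use lighter-weight barriers than the full
barriers implied here for its acquire and release variants.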

end of thread, other threads:[~2015-10-12 23:24 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-16 15:49 [RFC v2 0/7] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
2015-09-16 15:49 ` [RFC v2 1/7] atomics: Add test for atomic operations with _relaxed variants Boqun Feng
2015-10-12  9:30   ` Will Deacon
2015-10-12  9:38     ` Boqun Feng
2015-09-16 15:49 ` [RFC v2 2/7] atomics: Allow architectures to define their own __atomic_op_* helpers Boqun Feng
2015-09-16 15:49 ` [RFC v2 3/7] powerpc: atomic: Implement atomic{,64}_{add,sub}_return_* variants Boqun Feng
2015-09-18 16:59   ` Will Deacon
2015-09-19 15:33     ` Boqun Feng
2015-09-20  8:23       ` Boqun Feng
2015-09-21 22:24         ` Will Deacon
2015-09-21 23:26           ` Boqun Feng
2015-09-21 23:37             ` Boqun Feng
2015-09-22 15:25               ` Paul E. McKenney
2015-09-23  0:07                 ` Boqun Feng
2015-09-25 21:29                   ` Paul E. McKenney
2015-09-26  2:18                     ` Boqun Feng
2015-09-16 15:49 ` [RFC v2 4/7] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants Boqun Feng
2015-10-01 12:24   ` Peter Zijlstra
2015-10-01 15:09     ` Paul E. McKenney
2015-10-01 17:13       ` Peter Zijlstra
2015-10-01 18:03         ` Paul E. McKenney
2015-10-01 18:23           ` Peter Zijlstra
2015-10-01 19:41             ` Paul E. McKenney
2015-10-05 14:44           ` Will Deacon
2015-10-05 16:57             ` Paul E. McKenney
2015-10-12  1:17           ` Boqun Feng
2015-10-12  9:28             ` Will Deacon
2015-10-12 23:24             ` Paul E. McKenney
2015-09-16 15:49 ` [RFC v2 5/7] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants Boqun Feng
2015-10-01 12:27   ` Peter Zijlstra
2015-10-01 12:36     ` Peter Zijlstra
2015-10-01 15:12       ` Paul E. McKenney
2015-10-01 17:11         ` Peter Zijlstra
2015-10-01 15:13       ` Paul E. McKenney
2015-10-10  1:58     ` Boqun Feng
2015-10-11 10:25       ` Boqun Feng
2015-10-12  6:46         ` Peter Zijlstra
2015-10-12  7:03           ` Boqun Feng
2015-09-16 15:49 ` [RFC v2 6/7] powerpc: atomic: Make atomic{,64}_xchg and xchg a full barrier Boqun Feng
2015-10-01 12:28   ` Peter Zijlstra
     [not found]     ` <CAL0jBu_=w-GDtg6B+YXLmEn=Qac0mnWaWzS6NvLC9KMcNSj8=w@mail.gmail.com>
2015-10-02  5:25       ` Peter Zijlstra
2015-09-16 15:49 ` [RFC v2 7/7] powerpc: atomic: Make atomic{,64}_cmpxchg and cmpxchg " Boqun Feng
