linux-kernel.vger.kernel.org archive mirror
* [PATCH tip/locking/core v4 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics
@ 2015-10-14 15:55 Boqun Feng
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier Boqun Feng
                   ` (5 more replies)
  0 siblings, 6 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-14 15:55 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Davidlohr Bueso, Boqun Feng

Hi all,

This is v4 of the series.

Link for v1: https://lkml.org/lkml/2015/8/27/798
Link for v2: https://lkml.org/lkml/2015/9/16/527
Link for v3: https://lkml.org/lkml/2015/10/12/368

Changes since v3:

*	avoid introducing smp_acquire_barrier__after_atomic()
	(Will Deacon)

*	explain a little bit why we don't implement cmpxchg_release
	with assembly code (Will Deacon)


Relaxed/acquire/release variants of the atomic operations {add,sub}_return
and {cmp,}xchg were introduced by commit:

"atomics: add acquire/release/relaxed variants of some atomic operations"

and those of {inc,dec}_return were introduced by commit:

"locking/asm-generic: Add _{relaxed|acquire|release}() variants for
inc/dec atomics"

Both of these are in the current locking/core branch of the tip tree.

By default, the generic code implements a relaxed variant as the fully
ordered atomic operation, and the acquire/release variants as the
relaxed variant with the necessary general barrier after or before it.
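
Concretely, the generic helpers in include/linux/atomic.h (the ones
patch 3 of this series makes overridable) look roughly like this; this
is only a condensed restatement of the code visible in that patch:

	#define __atomic_op_acquire(op, args...)			\
	({								\
		typeof(op##_relaxed(args)) __ret = op##_relaxed(args);	\
		smp_mb__after_atomic();	/* general barrier after */	\
		__ret;							\
	})

	#define __atomic_op_release(op, args...)			\
	({								\
		smp_mb__before_atomic();	/* general barrier before */ \
		op##_relaxed(args);					\
	})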

On powerpc, which has a weak memory model, a relaxed variant can be
implemented with less overhead than a fully ordered one. Furthermore,
the release and acquire variants can be implemented with arch-specific
lightweight barriers.
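
For comparison, the powerpc-specific helpers added by patch 4 replace
those general barriers with the lightweight ones (condensed from the
diff in that patch):

	#define __atomic_op_acquire(op, args...)			\
	({								\
		typeof(op##_relaxed(args)) __ret = op##_relaxed(args);	\
		__asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory"); \
		__ret;							\
	})

	#define __atomic_op_release(op, args...)			\
	({								\
		smp_lwsync();	/* lwsync, or sync if lwsync is unavailable */ \
		op##_relaxed(args);					\
	})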

Besides, cmpxchg, xchg and their atomic_ versions are currently only
RELEASE+ACQUIRE rather than full barriers in the PPC implementation,
which is incorrect according to memory-barriers.txt.

Therefore this patchset fixes the ordering guarantee of cmpxchg, xchg
and their atomic_ versions, and implements the relaxed/acquire/release
variants based on the powerpc memory model and arch-specific barriers.
Some trivial tests for these new variants are also included in this
series: some of these variants are not used in the kernel yet, and I
think it is a good idea to at least generate their code somewhere.

The patchset consists of 6 parts:

1.	Make xchg, cmpxchg and their atomic_ versions a full barrier

2.	Add trivial tests for the new variants in lib/atomic64_test.c

3.	Allow architectures to define their own __atomic_op_*() helpers
	to build other variants based on relaxed.

4.	Implement atomic{,64}_{add,sub,inc,dec}_return_* variants

5.	Implement xchg_* and atomic{,64}_xchg_* variants

6.	Implement cmpxchg_* and atomic{,64}_cmpxchg_* variants


This patchset is based on the current locking/core branch of the tip
tree; all patches are build- and boot-tested for little-endian pseries,
and also tested by 0day.


Looking forward to any suggestions, questions and comments ;-)

Regards,
Boqun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-14 15:55 [PATCH tip/locking/core v4 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
@ 2015-10-14 15:55 ` Boqun Feng
  2015-10-14 20:19   ` Paul E. McKenney
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 2/6] atomics: Add test for atomic operations with _relaxed variants Boqun Feng
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 43+ messages in thread
From: Boqun Feng @ 2015-10-14 15:55 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Davidlohr Bueso, Boqun Feng,
	stable

According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
versions all need to imply a full barrier; however, they are currently
only RELEASE+ACQUIRE, which is not a full barrier.

So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
__{cmp,}xchg_{u32,u64} respectively to guarantee full-barrier semantics
for atomic{,64}_{cmp,}xchg() and {cmp,}xchg().

This patch is a complement of commit b97021f85517 ("powerpc: Fix
atomic_xxx_return barrier semantics").
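
As an illustration of the required semantics (a similar example is
analysed later in this thread), with "a" and "x" both initially zero:

	CPU 0				CPU 1
	-----				-----
	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
	r0 = xchg(&a, 1);		smp_mb();
					r1 = READ_ONCE(x);

If xchg() implies a full barrier, the outcome (r0 == 0 && r1 == 0) must
be forbidden; RELEASE+ACQUIRE ordering alone does not exclude it.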

Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org> # 3.4+
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/cmpxchg.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index ad6263c..d1a8d93 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
 	unsigned long prev;
 
 	__asm__ __volatile__(
-	PPC_RELEASE_BARRIER
+	PPC_ATOMIC_ENTRY_BARRIER
 "1:	lwarx	%0,0,%2 \n"
 	PPC405_ERR77(0,%2)
 "	stwcx.	%3,0,%2 \n\
 	bne-	1b"
-	PPC_ACQUIRE_BARRIER
+	PPC_ATOMIC_EXIT_BARRIER
 	: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
 	: "r" (p), "r" (val)
 	: "cc", "memory");
@@ -61,12 +61,12 @@ __xchg_u64(volatile void *p, unsigned long val)
 	unsigned long prev;
 
 	__asm__ __volatile__(
-	PPC_RELEASE_BARRIER
+	PPC_ATOMIC_ENTRY_BARRIER
 "1:	ldarx	%0,0,%2 \n"
 	PPC405_ERR77(0,%2)
 "	stdcx.	%3,0,%2 \n\
 	bne-	1b"
-	PPC_ACQUIRE_BARRIER
+	PPC_ATOMIC_EXIT_BARRIER
 	: "=&r" (prev), "+m" (*(volatile unsigned long *)p)
 	: "r" (p), "r" (val)
 	: "cc", "memory");
@@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long old, unsigned long new)
 	unsigned int prev;
 
 	__asm__ __volatile__ (
-	PPC_RELEASE_BARRIER
+	PPC_ATOMIC_ENTRY_BARRIER
 "1:	lwarx	%0,0,%2		# __cmpxchg_u32\n\
 	cmpw	0,%0,%3\n\
 	bne-	2f\n"
 	PPC405_ERR77(0,%2)
 "	stwcx.	%4,0,%2\n\
 	bne-	1b"
-	PPC_ACQUIRE_BARRIER
+	PPC_ATOMIC_EXIT_BARRIER
 	"\n\
 2:"
 	: "=&r" (prev), "+m" (*p)
@@ -197,13 +197,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
 	unsigned long prev;
 
 	__asm__ __volatile__ (
-	PPC_RELEASE_BARRIER
+	PPC_ATOMIC_ENTRY_BARRIER
 "1:	ldarx	%0,0,%2		# __cmpxchg_u64\n\
 	cmpd	0,%0,%3\n\
 	bne-	2f\n\
 	stdcx.	%4,0,%2\n\
 	bne-	1b"
-	PPC_ACQUIRE_BARRIER
+	PPC_ATOMIC_EXIT_BARRIER
 	"\n\
 2:"
 	: "=&r" (prev), "+m" (*p)
-- 
2.5.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH tip/locking/core v4 2/6] atomics: Add test for atomic operations with _relaxed variants
  2015-10-14 15:55 [PATCH tip/locking/core v4 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier Boqun Feng
@ 2015-10-14 15:55 ` Boqun Feng
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 3/6] atomics: Allow architectures to define their own __atomic_op_* helpers Boqun Feng
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-14 15:55 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Davidlohr Bueso, Boqun Feng

Some atomic operations now have _relaxed/acquire/release variants. This
patch adds some trivial tests for two purposes:

1.	test the behavior of these new operations in a single-CPU
	environment.

2.	generate the code for these operations before we actually use
	them anywhere, so that we can examine their assembly (a sample
	expansion is sketched below).
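
For example, RETURN_FAMILY_TEST(, add_return, +=, onestwos) expands to
one TEST_RETURN() per variant -- atomic_add_return() and its _acquire,
_release and _relaxed forms -- each of which boils down to roughly:

	atomic_set(&v, v0);
	r = v0;
	r += onestwos;
	BUG_ON(atomic_add_return(onestwos, &v) != r);	/* or the variant */
	BUG_ON(atomic_read(&v) != r);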

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 lib/atomic64_test.c | 120 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 79 insertions(+), 41 deletions(-)

diff --git a/lib/atomic64_test.c b/lib/atomic64_test.c
index 83c33a5b..18e422b 100644
--- a/lib/atomic64_test.c
+++ b/lib/atomic64_test.c
@@ -27,6 +27,65 @@ do {								\
 		(unsigned long long)r);				\
 } while (0)
 
+/*
+ * Test for a atomic operation family,
+ * @test should be a macro accepting parameters (bit, op, ...)
+ */
+
+#define FAMILY_TEST(test, bit, op, args...)	\
+do {						\
+	test(bit, op, ##args);		\
+	test(bit, op##_acquire, ##args);	\
+	test(bit, op##_release, ##args);	\
+	test(bit, op##_relaxed, ##args);	\
+} while (0)
+
+#define TEST_RETURN(bit, op, c_op, val)				\
+do {								\
+	atomic##bit##_set(&v, v0);				\
+	r = v0;							\
+	r c_op val;						\
+	BUG_ON(atomic##bit##_##op(val, &v) != r);		\
+	BUG_ON(atomic##bit##_read(&v) != r);			\
+} while (0)
+
+#define RETURN_FAMILY_TEST(bit, op, c_op, val)			\
+do {								\
+	FAMILY_TEST(TEST_RETURN, bit, op, c_op, val);		\
+} while (0)
+
+#define TEST_ARGS(bit, op, init, ret, expect, args...)		\
+do {								\
+	atomic##bit##_set(&v, init);				\
+	BUG_ON(atomic##bit##_##op(&v, ##args) != ret);		\
+	BUG_ON(atomic##bit##_read(&v) != expect);		\
+} while (0)
+
+#define XCHG_FAMILY_TEST(bit, init, new)				\
+do {									\
+	FAMILY_TEST(TEST_ARGS, bit, xchg, init, init, new, new);	\
+} while (0)
+
+#define CMPXCHG_FAMILY_TEST(bit, init, new, wrong)			\
+do {									\
+	FAMILY_TEST(TEST_ARGS, bit, cmpxchg, 				\
+			init, init, new, init, new);			\
+	FAMILY_TEST(TEST_ARGS, bit, cmpxchg,				\
+			init, init, init, wrong, new);			\
+} while (0)
+
+#define INC_RETURN_FAMILY_TEST(bit, i)			\
+do {							\
+	FAMILY_TEST(TEST_ARGS, bit, inc_return,		\
+			i, (i) + one, (i) + one);	\
+} while (0)
+
+#define DEC_RETURN_FAMILY_TEST(bit, i)			\
+do {							\
+	FAMILY_TEST(TEST_ARGS, bit, dec_return,		\
+			i, (i) - one, (i) - one);	\
+} while (0)
+
 static __init void test_atomic(void)
 {
 	int v0 = 0xaaa31337;
@@ -45,6 +104,18 @@ static __init void test_atomic(void)
 	TEST(, and, &=, v1);
 	TEST(, xor, ^=, v1);
 	TEST(, andnot, &= ~, v1);
+
+	RETURN_FAMILY_TEST(, add_return, +=, onestwos);
+	RETURN_FAMILY_TEST(, add_return, +=, -one);
+	RETURN_FAMILY_TEST(, sub_return, -=, onestwos);
+	RETURN_FAMILY_TEST(, sub_return, -=, -one);
+
+	INC_RETURN_FAMILY_TEST(, v0);
+	DEC_RETURN_FAMILY_TEST(, v0);
+
+	XCHG_FAMILY_TEST(, v0, v1);
+	CMPXCHG_FAMILY_TEST(, v0, v1, onestwos);
+
 }
 
 #define INIT(c) do { atomic64_set(&v, c); r = c; } while (0)
@@ -74,25 +145,10 @@ static __init void test_atomic64(void)
 	TEST(64, xor, ^=, v1);
 	TEST(64, andnot, &= ~, v1);
 
-	INIT(v0);
-	r += onestwos;
-	BUG_ON(atomic64_add_return(onestwos, &v) != r);
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	r += -one;
-	BUG_ON(atomic64_add_return(-one, &v) != r);
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	r -= onestwos;
-	BUG_ON(atomic64_sub_return(onestwos, &v) != r);
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	r -= -one;
-	BUG_ON(atomic64_sub_return(-one, &v) != r);
-	BUG_ON(v.counter != r);
+	RETURN_FAMILY_TEST(64, add_return, +=, onestwos);
+	RETURN_FAMILY_TEST(64, add_return, +=, -one);
+	RETURN_FAMILY_TEST(64, sub_return, -=, onestwos);
+	RETURN_FAMILY_TEST(64, sub_return, -=, -one);
 
 	INIT(v0);
 	atomic64_inc(&v);
@@ -100,33 +156,15 @@ static __init void test_atomic64(void)
 	BUG_ON(v.counter != r);
 
 	INIT(v0);
-	r += one;
-	BUG_ON(atomic64_inc_return(&v) != r);
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
 	atomic64_dec(&v);
 	r -= one;
 	BUG_ON(v.counter != r);
 
-	INIT(v0);
-	r -= one;
-	BUG_ON(atomic64_dec_return(&v) != r);
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	BUG_ON(atomic64_xchg(&v, v1) != v0);
-	r = v1;
-	BUG_ON(v.counter != r);
-
-	INIT(v0);
-	BUG_ON(atomic64_cmpxchg(&v, v0, v1) != v0);
-	r = v1;
-	BUG_ON(v.counter != r);
+	INC_RETURN_FAMILY_TEST(64, v0);
+	DEC_RETURN_FAMILY_TEST(64, v0);
 
-	INIT(v0);
-	BUG_ON(atomic64_cmpxchg(&v, v2, v1) != v0);
-	BUG_ON(v.counter != r);
+	XCHG_FAMILY_TEST(64, v0, v1);
+	CMPXCHG_FAMILY_TEST(64, v0, v1, v2);
 
 	INIT(v0);
 	BUG_ON(atomic64_add_unless(&v, one, v0));
-- 
2.5.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH tip/locking/core v4 3/6] atomics: Allow architectures to define their own __atomic_op_* helpers
  2015-10-14 15:55 [PATCH tip/locking/core v4 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier Boqun Feng
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 2/6] atomics: Add test for atomic operations with _relaxed variants Boqun Feng
@ 2015-10-14 15:55 ` Boqun Feng
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 4/6] powerpc: atomic: Implement atomic{,64}_*_return_* variants Boqun Feng
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-14 15:55 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Davidlohr Bueso, Boqun Feng

Some architectures may have special barriers for acquire, release and
fence semantics, so the general memory barriers (smp_mb__*_atomic())
used in the default __atomic_op_*() helpers may be too strong for them.
Allow architectures to define their own helpers, which override the
default ones.
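
For example (this is what patch 4 of this series does for powerpc), an
architecture only needs to provide its own definition in its
asm/atomic.h and the new #ifndef guards then skip the generic version:

	/* arch/powerpc/include/asm/atomic.h (from patch 4, condensed) */
	#define __atomic_op_release(op, args...)			\
	({								\
		smp_lwsync();	/* instead of smp_mb__before_atomic() */ \
		op##_relaxed(args);					\
	})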

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/atomic.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 27e580d..947c1dc 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -43,20 +43,29 @@ static inline int atomic_read_ctrl(const atomic_t *v)
  * The idea here is to build acquire/release variants by adding explicit
  * barriers on top of the relaxed variant. In the case where the relaxed
  * variant is already fully ordered, no additional barriers are needed.
+ *
+ * Besides, if an arch has a special barrier for acquire/release, it could
+ * implement its own __atomic_op_* and use the same framework for building
+ * variants
  */
+#ifndef __atomic_op_acquire
 #define __atomic_op_acquire(op, args...)				\
 ({									\
 	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
 	smp_mb__after_atomic();						\
 	__ret;								\
 })
+#endif
 
+#ifndef __atomic_op_release
 #define __atomic_op_release(op, args...)				\
 ({									\
 	smp_mb__before_atomic();					\
 	op##_relaxed(args);						\
 })
+#endif
 
+#ifndef __atomic_op_fence
 #define __atomic_op_fence(op, args...)					\
 ({									\
 	typeof(op##_relaxed(args)) __ret;				\
@@ -65,6 +74,7 @@ static inline int atomic_read_ctrl(const atomic_t *v)
 	smp_mb__after_atomic();						\
 	__ret;								\
 })
+#endif
 
 /* atomic_add_return_relaxed */
 #ifndef atomic_add_return_relaxed
-- 
2.5.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH tip/locking/core v4 4/6] powerpc: atomic: Implement atomic{,64}_*_return_* variants
  2015-10-14 15:55 [PATCH tip/locking/core v4 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
                   ` (2 preceding siblings ...)
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 3/6] atomics: Allow architectures to define their own __atomic_op_* helpers Boqun Feng
@ 2015-10-14 15:55 ` Boqun Feng
  2015-10-14 15:56 ` [PATCH tip/locking/core v4 5/6] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants Boqun Feng
  2015-10-14 15:56 ` [PATCH tip/locking/core v4 6/6] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants Boqun Feng
  5 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-14 15:55 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Davidlohr Bueso, Boqun Feng

On powerpc, acquire and release semantics can be achieved with
lightweight barriers ("lwsync" and "ctrl+isync"), which can be used to
implement __atomic_op_{acquire,release}.

For release semantics, we only need to ensure that all memory accesses
issued before the atomic take effect before the -store- part of the
atomic, so "lwsync" is all we need. On platforms without "lwsync",
"sync" should be used instead; therefore smp_lwsync() is used here.

For acquire semantics, "lwsync" is likewise all we need, for a similar
reason. However, on platforms without "lwsync" we can use "isync"
rather than "sync" as the acquire barrier. Therefore, in
__atomic_op_acquire() we use PPC_ACQUIRE_BARRIER, which is barrier() on
UP, "lwsync" if available and "isync" otherwise.

__atomic_op_fence is defined as smp_lwsync() + _relaxed +
smp_mb__after_atomic() to guarantee a full barrier.

Implement atomic{,64}_{add,sub,inc,dec}_return_relaxed, and build other
variants with these helpers.
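
Putting the pieces together, a hand expansion (not literal kernel code)
of the fully ordered atomic_add_return() after this patch looks like:

	smp_lwsync();				/* __atomic_op_fence(): before */
	t = atomic_add_return_relaxed(a, v);	/* barrier-free ll/sc loop */
	smp_mb__after_atomic();			/* __atomic_op_fence(): after */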

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/atomic.h | 116 ++++++++++++++++++++++++--------------
 1 file changed, 74 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index 55f106e..ab76461 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -12,6 +12,33 @@
 
 #define ATOMIC_INIT(i)		{ (i) }
 
+/*
+ * Since *_return_relaxed and {cmp}xchg_relaxed are implemented with
+ * a "bne-" instruction at the end, so an isync is enough as a acquire barrier
+ * on the platform without lwsync.
+ */
+#define __atomic_op_acquire(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
+	__asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory");	\
+	__ret;								\
+})
+
+#define __atomic_op_release(op, args...)				\
+({									\
+	smp_lwsync();							\
+	op##_relaxed(args);						\
+})
+
+#define __atomic_op_fence(op, args...)				\
+({									\
+	typeof(op##_relaxed(args)) __ret;				\
+	smp_lwsync();							\
+	__ret = op##_relaxed(args);					\
+	smp_mb__after_atomic();						\
+	__ret;								\
+})
+
 static __inline__ int atomic_read(const atomic_t *v)
 {
 	int t;
@@ -42,27 +69,27 @@ static __inline__ void atomic_##op(int a, atomic_t *v)			\
 	: "cc");							\
 }									\
 
-#define ATOMIC_OP_RETURN(op, asm_op)					\
-static __inline__ int atomic_##op##_return(int a, atomic_t *v)		\
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)				\
+static inline int atomic_##op##_return_relaxed(int a, atomic_t *v)	\
 {									\
 	int t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	lwarx	%0,0,%2		# atomic_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-	PPC405_ERR77(0,%2)						\
-"	stwcx.	%0,0,%2 \n"						\
+"1:	lwarx	%0,0,%3		# atomic_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+	PPC405_ERR77(0, %3)						\
+"	stwcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC_OPS(op, asm_op) ATOMIC_OP(op, asm_op) ATOMIC_OP_RETURN(op, asm_op)
+#define ATOMIC_OPS(op, asm_op)						\
+	ATOMIC_OP(op, asm_op)						\
+	ATOMIC_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC_OPS(add, add)
 ATOMIC_OPS(sub, subf)
@@ -71,8 +98,11 @@ ATOMIC_OP(and, and)
 ATOMIC_OP(or, or)
 ATOMIC_OP(xor, xor)
 
+#define atomic_add_return_relaxed atomic_add_return_relaxed
+#define atomic_sub_return_relaxed atomic_sub_return_relaxed
+
 #undef ATOMIC_OPS
-#undef ATOMIC_OP_RETURN
+#undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
 #define atomic_add_negative(a, v)	(atomic_add_return((a), (v)) < 0)
@@ -92,21 +122,19 @@ static __inline__ void atomic_inc(atomic_t *v)
 	: "cc", "xer");
 }
 
-static __inline__ int atomic_inc_return(atomic_t *v)
+static __inline__ int atomic_inc_return_relaxed(atomic_t *v)
 {
 	int t;
 
 	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
-"1:	lwarx	%0,0,%1		# atomic_inc_return\n\
+"1:	lwarx	%0,0,%1		# atomic_inc_return_relaxed\n\
 	addic	%0,%0,1\n"
 	PPC405_ERR77(0,%1)
 "	stwcx.	%0,0,%1 \n\
 	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
 	: "=&r" (t)
 	: "r" (&v->counter)
-	: "cc", "xer", "memory");
+	: "cc", "xer");
 
 	return t;
 }
@@ -136,25 +164,26 @@ static __inline__ void atomic_dec(atomic_t *v)
 	: "cc", "xer");
 }
 
-static __inline__ int atomic_dec_return(atomic_t *v)
+static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
 {
 	int t;
 
 	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
-"1:	lwarx	%0,0,%1		# atomic_dec_return\n\
+"1:	lwarx	%0,0,%1		# atomic_dec_return_relaxed\n\
 	addic	%0,%0,-1\n"
 	PPC405_ERR77(0,%1)
 "	stwcx.	%0,0,%1\n\
 	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
 	: "=&r" (t)
 	: "r" (&v->counter)
-	: "cc", "xer", "memory");
+	: "cc", "xer");
 
 	return t;
 }
 
+#define atomic_inc_return_relaxed atomic_inc_return_relaxed
+#define atomic_dec_return_relaxed atomic_dec_return_relaxed
+
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 
@@ -285,26 +314,27 @@ static __inline__ void atomic64_##op(long a, atomic64_t *v)		\
 	: "cc");							\
 }
 
-#define ATOMIC64_OP_RETURN(op, asm_op)					\
-static __inline__ long atomic64_##op##_return(long a, atomic64_t *v)	\
+#define ATOMIC64_OP_RETURN_RELAXED(op, asm_op)				\
+static inline long							\
+atomic64_##op##_return_relaxed(long a, atomic64_t *v)			\
 {									\
 	long t;								\
 									\
 	__asm__ __volatile__(						\
-	PPC_ATOMIC_ENTRY_BARRIER					\
-"1:	ldarx	%0,0,%2		# atomic64_" #op "_return\n"		\
-	#asm_op " %0,%1,%0\n"						\
-"	stdcx.	%0,0,%2 \n"						\
+"1:	ldarx	%0,0,%3		# atomic64_" #op "_return_relaxed\n"	\
+	#asm_op " %0,%2,%0\n"						\
+"	stdcx.	%0,0,%3\n"						\
 "	bne-	1b\n"							\
-	PPC_ATOMIC_EXIT_BARRIER						\
-	: "=&r" (t)							\
+	: "=&r" (t), "+m" (v->counter)					\
 	: "r" (a), "r" (&v->counter)					\
-	: "cc", "memory");						\
+	: "cc");							\
 									\
 	return t;							\
 }
 
-#define ATOMIC64_OPS(op, asm_op) ATOMIC64_OP(op, asm_op) ATOMIC64_OP_RETURN(op, asm_op)
+#define ATOMIC64_OPS(op, asm_op)					\
+	ATOMIC64_OP(op, asm_op)						\
+	ATOMIC64_OP_RETURN_RELAXED(op, asm_op)
 
 ATOMIC64_OPS(add, add)
 ATOMIC64_OPS(sub, subf)
@@ -312,8 +342,11 @@ ATOMIC64_OP(and, and)
 ATOMIC64_OP(or, or)
 ATOMIC64_OP(xor, xor)
 
-#undef ATOMIC64_OPS
-#undef ATOMIC64_OP_RETURN
+#define atomic64_add_return_relaxed atomic64_add_return_relaxed
+#define atomic64_sub_return_relaxed atomic64_sub_return_relaxed
+
 +#undef ATOMIC64_OPS
+#undef ATOMIC64_OP_RETURN_RELAXED
 #undef ATOMIC64_OP
 
 #define atomic64_add_negative(a, v)	(atomic64_add_return((a), (v)) < 0)
@@ -332,20 +365,18 @@ static __inline__ void atomic64_inc(atomic64_t *v)
 	: "cc", "xer");
 }
 
-static __inline__ long atomic64_inc_return(atomic64_t *v)
+static __inline__ long atomic64_inc_return_relaxed(atomic64_t *v)
 {
 	long t;
 
 	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
 "1:	ldarx	%0,0,%1		# atomic64_inc_return\n\
 	addic	%0,%0,1\n\
 	stdcx.	%0,0,%1 \n\
 	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
 	: "=&r" (t)
 	: "r" (&v->counter)
-	: "cc", "xer", "memory");
+	: "cc", "xer");
 
 	return t;
 }
@@ -374,24 +405,25 @@ static __inline__ void atomic64_dec(atomic64_t *v)
 	: "cc", "xer");
 }
 
-static __inline__ long atomic64_dec_return(atomic64_t *v)
+static __inline__ long atomic64_dec_return_relaxed(atomic64_t *v)
 {
 	long t;
 
 	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
 "1:	ldarx	%0,0,%1		# atomic64_dec_return\n\
 	addic	%0,%0,-1\n\
 	stdcx.	%0,0,%1\n\
 	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
 	: "=&r" (t)
 	: "r" (&v->counter)
-	: "cc", "xer", "memory");
+	: "cc", "xer");
 
 	return t;
 }
 
+#define atomic64_inc_return_relaxed atomic64_inc_return_relaxed
+#define atomic64_dec_return_relaxed atomic64_dec_return_relaxed
+
 #define atomic64_sub_and_test(a, v)	(atomic64_sub_return((a), (v)) == 0)
 #define atomic64_dec_and_test(v)	(atomic64_dec_return((v)) == 0)
 
-- 
2.5.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH tip/locking/core v4 5/6] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants
  2015-10-14 15:55 [PATCH tip/locking/core v4 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
                   ` (3 preceding siblings ...)
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 4/6] powerpc: atomic: Implement atomic{,64}_*_return_* variants Boqun Feng
@ 2015-10-14 15:56 ` Boqun Feng
  2015-10-14 15:56 ` [PATCH tip/locking/core v4 6/6] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants Boqun Feng
  5 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-14 15:56 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Davidlohr Bueso, Boqun Feng

Implement xchg_relaxed and atomic{,64}_xchg_relaxed; based on these
_relaxed variants, the release/acquire variants and the fully ordered
versions can be built.

Note that xchg_relaxed and atomic{,64}_xchg_relaxed are not compiler
barriers.
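
With xchg_relaxed() in place, the remaining variants come from the
generic builders in include/linux/atomic.h (paraphrased here; see that
header for the exact form):

	#define xchg_acquire(...)	__atomic_op_acquire(xchg, __VA_ARGS__)
	#define xchg_release(...)	__atomic_op_release(xchg, __VA_ARGS__)
	#define xchg(...)		__atomic_op_fence(xchg, __VA_ARGS__)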

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  2 ++
 arch/powerpc/include/asm/cmpxchg.h | 69 +++++++++++++++++---------------------
 2 files changed, 32 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index ab76461..1e9d526 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -186,6 +186,7 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * __atomic_add_unless - add unless the number is a given value
@@ -453,6 +454,7 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t *v)
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
+#define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
 /**
  * atomic64_add_unless - add unless the number is a given value
diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index d1a8d93..17c7e14 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -9,21 +9,20 @@
 /*
  * Atomic exchange
  *
- * Changes the memory location '*ptr' to be val and returns
+ * Changes the memory location '*p' to be val and returns
  * the previous value stored there.
  */
+
 static __always_inline unsigned long
-__xchg_u32(volatile void *p, unsigned long val)
+__xchg_u32_local(volatile void *p, unsigned long val)
 {
 	unsigned long prev;
 
 	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
 "1:	lwarx	%0,0,%2 \n"
 	PPC405_ERR77(0,%2)
 "	stwcx.	%3,0,%2 \n\
 	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
 	: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
 	: "r" (p), "r" (val)
 	: "cc", "memory");
@@ -31,42 +30,34 @@ __xchg_u32(volatile void *p, unsigned long val)
 	return prev;
 }
 
-/*
- * Atomic exchange
- *
- * Changes the memory location '*ptr' to be val and returns
- * the previous value stored there.
- */
 static __always_inline unsigned long
-__xchg_u32_local(volatile void *p, unsigned long val)
+__xchg_u32_relaxed(u32 *p, unsigned long val)
 {
 	unsigned long prev;
 
 	__asm__ __volatile__(
-"1:	lwarx	%0,0,%2 \n"
-	PPC405_ERR77(0,%2)
-"	stwcx.	%3,0,%2 \n\
-	bne-	1b"
-	: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
+"1:	lwarx	%0,0,%2\n"
+	PPC405_ERR77(0, %2)
+"	stwcx.	%3,0,%2\n"
+"	bne-	1b"
+	: "=&r" (prev), "+m" (*p)
 	: "r" (p), "r" (val)
-	: "cc", "memory");
+	: "cc");
 
 	return prev;
 }
 
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
-__xchg_u64(volatile void *p, unsigned long val)
+__xchg_u64_local(volatile void *p, unsigned long val)
 {
 	unsigned long prev;
 
 	__asm__ __volatile__(
-	PPC_ATOMIC_ENTRY_BARRIER
 "1:	ldarx	%0,0,%2 \n"
 	PPC405_ERR77(0,%2)
 "	stdcx.	%3,0,%2 \n\
 	bne-	1b"
-	PPC_ATOMIC_EXIT_BARRIER
 	: "=&r" (prev), "+m" (*(volatile unsigned long *)p)
 	: "r" (p), "r" (val)
 	: "cc", "memory");
@@ -75,18 +66,18 @@ __xchg_u64(volatile void *p, unsigned long val)
 }
 
 static __always_inline unsigned long
-__xchg_u64_local(volatile void *p, unsigned long val)
+__xchg_u64_relaxed(u64 *p, unsigned long val)
 {
 	unsigned long prev;
 
 	__asm__ __volatile__(
-"1:	ldarx	%0,0,%2 \n"
-	PPC405_ERR77(0,%2)
-"	stdcx.	%3,0,%2 \n\
-	bne-	1b"
-	: "=&r" (prev), "+m" (*(volatile unsigned long *)p)
+"1:	ldarx	%0,0,%2\n"
+	PPC405_ERR77(0, %2)
+"	stdcx.	%3,0,%2\n"
+"	bne-	1b"
+	: "=&r" (prev), "+m" (*p)
 	: "r" (p), "r" (val)
-	: "cc", "memory");
+	: "cc");
 
 	return prev;
 }
@@ -99,14 +90,14 @@ __xchg_u64_local(volatile void *p, unsigned long val)
 extern void __xchg_called_with_bad_pointer(void);
 
 static __always_inline unsigned long
-__xchg(volatile void *ptr, unsigned long x, unsigned int size)
+__xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
 {
 	switch (size) {
 	case 4:
-		return __xchg_u32(ptr, x);
+		return __xchg_u32_local(ptr, x);
 #ifdef CONFIG_PPC64
 	case 8:
-		return __xchg_u64(ptr, x);
+		return __xchg_u64_local(ptr, x);
 #endif
 	}
 	__xchg_called_with_bad_pointer();
@@ -114,25 +105,19 @@ __xchg(volatile void *ptr, unsigned long x, unsigned int size)
 }
 
 static __always_inline unsigned long
-__xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
+__xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
 {
 	switch (size) {
 	case 4:
-		return __xchg_u32_local(ptr, x);
+		return __xchg_u32_relaxed(ptr, x);
 #ifdef CONFIG_PPC64
 	case 8:
-		return __xchg_u64_local(ptr, x);
+		return __xchg_u64_relaxed(ptr, x);
 #endif
 	}
 	__xchg_called_with_bad_pointer();
 	return x;
 }
-#define xchg(ptr,x)							     \
-  ({									     \
-     __typeof__(*(ptr)) _x_ = (x);					     \
-     (__typeof__(*(ptr))) __xchg((ptr), (unsigned long)_x_, sizeof(*(ptr))); \
-  })
-
 #define xchg_local(ptr,x)						     \
   ({									     \
      __typeof__(*(ptr)) _x_ = (x);					     \
@@ -140,6 +125,12 @@ __xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
      		(unsigned long)_x_, sizeof(*(ptr))); 			     \
   })
 
+#define xchg_relaxed(ptr, x)						\
+({									\
+	__typeof__(*(ptr)) _x_ = (x);					\
+	(__typeof__(*(ptr))) __xchg_relaxed((ptr),			\
+			(unsigned long)_x_, sizeof(*(ptr)));		\
+})
 /*
  * Compare and exchange - if *p == old, set it to new,
  * and return the old value of *p.
-- 
2.5.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH tip/locking/core v4 6/6] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants
  2015-10-14 15:55 [PATCH tip/locking/core v4 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
                   ` (4 preceding siblings ...)
  2015-10-14 15:56 ` [PATCH tip/locking/core v4 5/6] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants Boqun Feng
@ 2015-10-14 15:56 ` Boqun Feng
  5 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-14 15:56 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: Peter Zijlstra, Ingo Molnar, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Gleixner, Will Deacon,
	Paul E. McKenney, Waiman Long, Davidlohr Bueso, Boqun Feng

Implement cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed, based on
which _release variants can be built.

To avoid superfluous barriers in the _acquire variants, we implement
these operations with assembly code rather than using
__atomic_op_acquire() to build them automatically.

For the same reason, we keep the assembly implementation of fully
ordered cmpxchg operations.

However, we don't do the same for _release, because that would require
putting barriers in the middle of ll/sc loops, which is probably a bad
idea.

Note cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed are not
compiler barriers.
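
In other words, after this patch cmpxchg(), cmpxchg_relaxed() and
cmpxchg_acquire() are all assembly implementations, while the _release
variant is built by the generic code (paraphrased):

	#define cmpxchg_release(...)	__atomic_op_release(cmpxchg, __VA_ARGS__)

which, with powerpc's __atomic_op_release(), is effectively smp_lwsync()
followed by the barrier-free cmpxchg_relaxed() loop.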

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/include/asm/atomic.h  |  10 +++
 arch/powerpc/include/asm/cmpxchg.h | 149 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index 1e9d526..e58188d 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -185,6 +185,11 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
 #define atomic_dec_return_relaxed atomic_dec_return_relaxed
 
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic_cmpxchg_relaxed(v, o, n) \
+	cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic_cmpxchg_acquire(v, o, n) \
+	cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
@@ -453,6 +458,11 @@ static __inline__ long atomic64_dec_if_positive(atomic64_t *v)
 }
 
 #define atomic64_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
+#define atomic64_cmpxchg_relaxed(v, o, n) \
+	cmpxchg_relaxed(&((v)->counter), (o), (n))
+#define atomic64_cmpxchg_acquire(v, o, n) \
+	cmpxchg_acquire(&((v)->counter), (o), (n))
+
 #define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
 #define atomic64_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new))
 
diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index 17c7e14..cae4fa8 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -181,6 +181,56 @@ __cmpxchg_u32_local(volatile unsigned int *p, unsigned long old,
 	return prev;
 }
 
+static __always_inline unsigned long
+__cmpxchg_u32_relaxed(u32 *p, unsigned long old, unsigned long new)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__ (
+"1:	lwarx	%0,0,%2		# __cmpxchg_u32_relaxed\n"
+"	cmpw	0,%0,%3\n"
+"	bne-	2f\n"
+	PPC405_ERR77(0, %2)
+"	stwcx.	%4,0,%2\n"
+"	bne-	1b\n"
+"2:"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (old), "r" (new)
+	: "cc");
+
+	return prev;
+}
+
+/*
+ * cmpxchg family don't have order guarantee if cmp part fails, therefore we
+ * can avoid superfluous barriers if we use assembly code to implement
+ * cmpxchg() and cmpxchg_acquire(), however we don't do the similar for
+ * cmpxchg_release() because that will result in putting a barrier in the
+ * middle of a ll/sc loop, which is probably a bad idea. For example, this
+ * might cause the conditional store more likely to fail.
+ */
+static __always_inline unsigned long
+__cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__ (
+"1:	lwarx	%0,0,%2		# __cmpxchg_u32_acquire\n"
+"	cmpw	0,%0,%3\n"
+"	bne-	2f\n"
+	PPC405_ERR77(0, %2)
+"	stwcx.	%4,0,%2\n"
+"	bne-	1b\n"
+	PPC_ACQUIRE_BARRIER
+	"\n"
+"2:"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (old), "r" (new)
+	: "cc", "memory");
+
+	return prev;
+}
+
 #ifdef CONFIG_PPC64
 static __always_inline unsigned long
 __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
@@ -224,6 +274,46 @@ __cmpxchg_u64_local(volatile unsigned long *p, unsigned long old,
 
 	return prev;
 }
+
+static __always_inline unsigned long
+__cmpxchg_u64_relaxed(u64 *p, unsigned long old, unsigned long new)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__ (
+"1:	ldarx	%0,0,%2		# __cmpxchg_u64_relaxed\n"
+"	cmpd	0,%0,%3\n"
+"	bne-	2f\n"
+"	stdcx.	%4,0,%2\n"
+"	bne-	1b\n"
+"2:"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (old), "r" (new)
+	: "cc");
+
+	return prev;
+}
+
+static __always_inline unsigned long
+__cmpxchg_u64_acquire(u64 *p, unsigned long old, unsigned long new)
+{
+	unsigned long prev;
+
+	__asm__ __volatile__ (
+"1:	ldarx	%0,0,%2		# __cmpxchg_u64_acquire\n"
+"	cmpd	0,%0,%3\n"
+"	bne-	2f\n"
+"	stdcx.	%4,0,%2\n"
+"	bne-	1b\n"
+	PPC_ACQUIRE_BARRIER
+	"\n"
+"2:"
+	: "=&r" (prev), "+m" (*p)
+	: "r" (p), "r" (old), "r" (new)
+	: "cc", "memory");
+
+	return prev;
+}
 #endif
 
 /* This function doesn't exist, so you'll get a linker error
@@ -262,6 +352,37 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
 	return old;
 }
 
+static __always_inline unsigned long
+__cmpxchg_relaxed(void *ptr, unsigned long old, unsigned long new,
+		  unsigned int size)
+{
+	switch (size) {
+	case 4:
+		return __cmpxchg_u32_relaxed(ptr, old, new);
+#ifdef CONFIG_PPC64
+	case 8:
+		return __cmpxchg_u64_relaxed(ptr, old, new);
+#endif
+	}
+	__cmpxchg_called_with_bad_pointer();
+	return old;
+}
+
+static __always_inline unsigned long
+__cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
+		  unsigned int size)
+{
+	switch (size) {
+	case 4:
+		return __cmpxchg_u32_acquire(ptr, old, new);
+#ifdef CONFIG_PPC64
+	case 8:
+		return __cmpxchg_u64_acquire(ptr, old, new);
+#endif
+	}
+	__cmpxchg_called_with_bad_pointer();
+	return old;
+}
 #define cmpxchg(ptr, o, n)						 \
   ({									 \
      __typeof__(*(ptr)) _o_ = (o);					 \
@@ -279,6 +400,23 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
 				    (unsigned long)_n_, sizeof(*(ptr))); \
   })
 
+#define cmpxchg_relaxed(ptr, o, n)					\
+({									\
+	__typeof__(*(ptr)) _o_ = (o);					\
+	__typeof__(*(ptr)) _n_ = (n);					\
+	(__typeof__(*(ptr))) __cmpxchg_relaxed((ptr),			\
+			(unsigned long)_o_, (unsigned long)_n_,		\
+			sizeof(*(ptr)));				\
+})
+
+#define cmpxchg_acquire(ptr, o, n)					\
+({									\
+	__typeof__(*(ptr)) _o_ = (o);					\
+	__typeof__(*(ptr)) _n_ = (n);					\
+	(__typeof__(*(ptr))) __cmpxchg_acquire((ptr),			\
+			(unsigned long)_o_, (unsigned long)_n_,		\
+			sizeof(*(ptr)));				\
+})
 #ifdef CONFIG_PPC64
 #define cmpxchg64(ptr, o, n)						\
   ({									\
@@ -290,7 +428,16 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
 	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
 	cmpxchg_local((ptr), (o), (n));					\
   })
-#define cmpxchg64_relaxed	cmpxchg64_local
+#define cmpxchg64_relaxed(ptr, o, n)					\
+({									\
+	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
+	cmpxchg_relaxed((ptr), (o), (n));				\
+})
+#define cmpxchg64_acquire(ptr, o, n)					\
+({									\
+	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
+	cmpxchg_acquire((ptr), (o), (n));				\
+})
 #else
 #include <asm-generic/cmpxchg-local.h>
 #define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n))
-- 
2.5.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-14 15:55 ` [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier Boqun Feng
@ 2015-10-14 20:19   ` Paul E. McKenney
  2015-10-14 21:04     ` Peter Zijlstra
                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-14 20:19 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Wed, Oct 14, 2015 at 11:55:56PM +0800, Boqun Feng wrote:
> According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
> versions all need to imply a full barrier, however they are now just
> RELEASE+ACQUIRE, which is not a full barrier.
> 
> So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> __{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
> semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().
> 
> This patch is a complement of commit b97021f85517 ("powerpc: Fix
> atomic_xxx_return barrier semantics").
> 
> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
> Cc: <stable@vger.kernel.org> # 3.4+
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  arch/powerpc/include/asm/cmpxchg.h | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
> index ad6263c..d1a8d93 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
>  	unsigned long prev;
> 
>  	__asm__ __volatile__(
> -	PPC_RELEASE_BARRIER
> +	PPC_ATOMIC_ENTRY_BARRIER

This looks to be the lwsync instruction.

>  "1:	lwarx	%0,0,%2 \n"
>  	PPC405_ERR77(0,%2)
>  "	stwcx.	%3,0,%2 \n\
>  	bne-	1b"
> -	PPC_ACQUIRE_BARRIER
> +	PPC_ATOMIC_EXIT_BARRIER

And this looks to be the sync instruction.

>  	: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
>  	: "r" (p), "r" (val)
>  	: "cc", "memory");

Hmmm...

Suppose we have something like the following, where "a" and "x" are both
initially zero:

	CPU 0				CPU 1
	-----				-----

	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
	r3 = xchg(&a, 1);		smp_mb();
					r3 = READ_ONCE(x);

If xchg() is fully ordered, we should never observe both CPUs'
r3 values being zero, correct?

And wouldn't this be represented by the following litmus test?

	PPC SB+lwsync-RMW2-lwsync+st-sync-leading
	""
	{
	0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
	1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
	}
	 P0                 | P1                 ;
	 stw r1,0(r2)       | stw r1,0(r12)      ;
	 lwsync             | sync               ;
	 lwarx  r11,r10,r12 | lwz r3,0(r2)       ;
	 stwcx. r1,r10,r12  | ;
	 bne Fail0          | ;
	 mr r3,r11          | ;
	 Fail0:             | ;
	exists
	(0:r3=0 /\ a=2 /\ 1:r3=0)

I left off P0's trailing sync because there is nothing for it to order
against in this particular litmus test.  I tried adding it and verified
that it has no effect.

Am I missing something here?  If not, it seems to me that you need
the leading lwsync to instead be a sync.

Of course, if I am not missing something, then this applies also to the
value-returning RMW atomic operations that you pulled this pattern from.
If so, it would seem that I didn't think through all the possibilities
back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
that I worried about the RMW atomic operation acting as a barrier,
but not as the load/store itself.  :-/

							Thanx, Paul

> @@ -61,12 +61,12 @@ __xchg_u64(volatile void *p, unsigned long val)
>  	unsigned long prev;
> 
>  	__asm__ __volatile__(
> -	PPC_RELEASE_BARRIER
> +	PPC_ATOMIC_ENTRY_BARRIER
>  "1:	ldarx	%0,0,%2 \n"
>  	PPC405_ERR77(0,%2)
>  "	stdcx.	%3,0,%2 \n\
>  	bne-	1b"
> -	PPC_ACQUIRE_BARRIER
> +	PPC_ATOMIC_EXIT_BARRIER
>  	: "=&r" (prev), "+m" (*(volatile unsigned long *)p)
>  	: "r" (p), "r" (val)
>  	: "cc", "memory");
> @@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long old, unsigned long new)
>  	unsigned int prev;
> 
>  	__asm__ __volatile__ (
> -	PPC_RELEASE_BARRIER
> +	PPC_ATOMIC_ENTRY_BARRIER
>  "1:	lwarx	%0,0,%2		# __cmpxchg_u32\n\
>  	cmpw	0,%0,%3\n\
>  	bne-	2f\n"
>  	PPC405_ERR77(0,%2)
>  "	stwcx.	%4,0,%2\n\
>  	bne-	1b"
> -	PPC_ACQUIRE_BARRIER
> +	PPC_ATOMIC_EXIT_BARRIER
>  	"\n\
>  2:"
>  	: "=&r" (prev), "+m" (*p)
> @@ -197,13 +197,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
>  	unsigned long prev;
> 
>  	__asm__ __volatile__ (
> -	PPC_RELEASE_BARRIER
> +	PPC_ATOMIC_ENTRY_BARRIER
>  "1:	ldarx	%0,0,%2		# __cmpxchg_u64\n\
>  	cmpd	0,%0,%3\n\
>  	bne-	2f\n\
>  	stdcx.	%4,0,%2\n\
>  	bne-	1b"
> -	PPC_ACQUIRE_BARRIER
> +	PPC_ATOMIC_EXIT_BARRIER
>  	"\n\
>  2:"
>  	: "=&r" (prev), "+m" (*p)
> -- 
> 2.5.3
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-14 20:19   ` Paul E. McKenney
@ 2015-10-14 21:04     ` Peter Zijlstra
  2015-10-14 21:44       ` Paul E. McKenney
  2015-10-15 14:49     ` Boqun Feng
  2015-10-20  7:15     ` Boqun Feng
  2 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2015-10-14 21:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> Suppose we have something like the following, where "a" and "x" are both
> initially zero:
> 
> 	CPU 0				CPU 1
> 	-----				-----
> 
> 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> 	r3 = xchg(&a, 1);		smp_mb();
> 					r3 = READ_ONCE(x);
> 
> If xchg() is fully ordered, we should never observe both CPUs'
> r3 values being zero, correct?
> 
> And wouldn't this be represented by the following litmus test?
> 
> 	PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> 	""
> 	{
> 	0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> 	1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> 	}
> 	 P0                 | P1                 ;
> 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> 	 lwsync             | sync               ;
> 	 lwarx  r11,r10,r12 | lwz r3,0(r2)       ;
> 	 stwcx. r1,r10,r12  | ;
> 	 bne Fail0          | ;
> 	 mr r3,r11          | ;
> 	 Fail0:             | ;
> 	exists
> 	(0:r3=0 /\ a=2 /\ 1:r3=0)
> 
> I left off P0's trailing sync because there is nothing for it to order
> against in this particular litmus test.  I tried adding it and verified
> that it has no effect.
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.

So the scenario that would fail would be this one, right?

a = x = 0

	CPU0				CPU1

	r3 = load_locked (&a);
					a = 2;
					sync();
					r3 = x;
	x = 1;
	lwsync();
	if (!store_cond(&a, 1))
		goto again


Where we hoist the load way up because lwsync allows this.

I always thought this would fail because CPU1's store to @a would fail
the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
load and now seeing the new value (2).

> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.
> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/

AARGH64 does something very similar; it does something like:

	ll
	...
	sc-release

	mb

Which I assumed worked for the same reason, any change to the variable
would fail the sc, and we go for round 2, now observing the new value.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-14 21:04     ` Peter Zijlstra
@ 2015-10-14 21:44       ` Paul E. McKenney
  2015-10-15  0:53         ` Boqun Feng
  2015-10-15 10:35         ` Will Deacon
  0 siblings, 2 replies; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-14 21:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

[-- Attachment #1: Type: text/plain, Size: 4474 bytes --]

On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > Suppose we have something like the following, where "a" and "x" are both
> > initially zero:
> > 
> > 	CPU 0				CPU 1
> > 	-----				-----
> > 
> > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > 	r3 = xchg(&a, 1);		smp_mb();
> > 					r3 = READ_ONCE(x);
> > 
> > If xchg() is fully ordered, we should never observe both CPUs'
> > r3 values being zero, correct?
> > 
> > And wouldn't this be represented by the following litmus test?
> > 
> > 	PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > 	""
> > 	{
> > 	0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > 	1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > 	}
> > 	 P0                 | P1                 ;
> > 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> > 	 lwsync             | sync               ;
> > 	 lwarx  r11,r10,r12 | lwz r3,0(r2)       ;
> > 	 stwcx. r1,r10,r12  | ;
> > 	 bne Fail0          | ;
> > 	 mr r3,r11          | ;
> > 	 Fail0:             | ;
> > 	exists
> > 	(0:r3=0 /\ a=2 /\ 1:r3=0)
> > 
> > I left off P0's trailing sync because there is nothing for it to order
> > against in this particular litmus test.  I tried adding it and verified
> > that it has no effect.
> > 
> > Am I missing something here?  If not, it seems to me that you need
> > the leading lwsync to instead be a sync.
> 
> So the scenario that would fail would be this one, right?
> 
> a = x = 0
> 
> 	CPU0				CPU1
> 
> 	r3 = load_locked (&a);
> 					a = 2;
> 					sync();
> 					r3 = x;
> 	x = 1;
> 	lwsync();
> 	if (!store_cond(&a, 1))
> 		goto again
> 
> 
> Where we hoist the load way up because lwsync allows this.

That scenario would end up with a==1 rather than a==2.

> I always thought this would fail because CPU1's store to @a would fail
> the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> load and now seeing the new value (2).

The stwcx. failure was one thing that prevented a number of other
misordering cases.  The problem is that we have to let go of the notion
of an implicit global clock.

To that end, the herd tool can make a diagram of what it thought
happened, and I have attached it.  I used this diagram to try and force
this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
and succeeded.  Here is the sequence of events:

o	Commit P0's write.  The model offers to propagate this write
	to the coherence point and to P1, but don't do so yet.

o	Commit P1's write.  Similar offers, but don't take them up yet.

o	Commit P0's lwsync.

o	Execute P0's lwarx, which reads a=0.  Then commit it.

o	Commit P0's stwcx. as successful.  This stores a=1.

o	Commit P0's branch (not taken).

o	Commit P0's final register-to-register move.

o	Commit P1's sync instruction.

o	There is now nothing that can happen in either processor.
	P0 is done, and P1 is waiting for its sync.  Therefore,
	propagate P1's a=2 write to the coherence point and to
	the other thread.

o	There is still nothing that can happen in either processor.
	So pick the barrier propagate, then the acknowledge sync.

o	P1 can now execute its read from x.  Because P0's write to
	x is still waiting to propagate to P1, this still reads
	x=0.  Execute and commit, and we now have both r3 registers
	equal to zero and the final value a=2.

o	Clean up by propagating the write to x everywhere, and
	propagating the lwsync.

And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;

I am still not 100% confident of my litmus test.  It is quite possible
that I lost something in translation, but that is looking less likely.

> > Of course, if I am not missing something, then this applies also to the
> > value-returning RMW atomic operations that you pulled this pattern from.
> > If so, it would seem that I didn't think through all the possibilities
> > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > that I worried about the RMW atomic operation acting as a barrier,
> > but not as the load/store itself.  :-/
> 
> AARGH64 does something very similar; it does something like:
> 
> 	ll
> 	...
> 	sc-release
> 
> 	mb
> 
> Which I assumed worked for the same reason, any change to the variable
> would fail the sc, and we go for round 2, now observing the new value.

I have to defer to Will on this one.  You are right that ARM and PowerPC
do have similar memory models, but there are some differences.

							Thanx, Paul

[-- Attachment #2: SB+lwsync-RMW2-lwsync+st-sync-leading.pdf --]
[-- Type: application/pdf, Size: 21963 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-14 21:44       ` Paul E. McKenney
@ 2015-10-15  0:53         ` Boqun Feng
  2015-10-15  1:22           ` Boqun Feng
                             ` (2 more replies)
  2015-10-15 10:35         ` Will Deacon
  1 sibling, 3 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-15  0:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

[-- Attachment #1: Type: text/plain, Size: 6332 bytes --]

On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > Suppose we have something like the following, where "a" and "x" are both
> > > initially zero:
> > > 
> > > 	CPU 0				CPU 1
> > > 	-----				-----
> > > 
> > > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > > 	r3 = xchg(&a, 1);		smp_mb();
> > > 					r3 = READ_ONCE(x);
> > > 
> > > If xchg() is fully ordered, we should never observe both CPUs'
> > > r3 values being zero, correct?
> > > 
> > > And wouldn't this be represented by the following litmus test?
> > > 
> > > 	PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > 	""
> > > 	{
> > > 	0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > 	1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > 	}
> > > 	 P0                 | P1                 ;
> > > 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> > > 	 lwsync             | sync               ;
> > > 	 lwarx  r11,r10,r12 | lwz r3,0(r2)       ;
> > > 	 stwcx. r1,r10,r12  | ;
> > > 	 bne Fail0          | ;
> > > 	 mr r3,r11          | ;
> > > 	 Fail0:             | ;
> > > 	exists
> > > 	(0:r3=0 /\ a=2 /\ 1:r3=0)
> > > 
> > > I left off P0's trailing sync because there is nothing for it to order
> > > against in this particular litmus test.  I tried adding it and verified
> > > that it has no effect.
> > > 
> > > Am I missing something here?  If not, it seems to me that you need
> > > the leading lwsync to instead be a sync.

I'm afraid it's more than that; the above litmus test also shows that

	CPU 0				CPU 1
	-----				-----

	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
	r3 = xchg_release(&a, 1);	smp_mb();
					r3 = READ_ONCE(x);

	(0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibited

in the implementation of this patchset, which should be disallowed by
the semantics of RELEASE, right?

And even:

	CPU 0				CPU 1
	-----				-----

	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
	smp_store_release(&a, 1);	smp_mb();
					r3 = READ_ONCE(x);

	(1:r3 == 0 && a == 2) is not prohibited

as shown by:

	PPC weird-lwsync
	""
	{
	0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
	1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
	}
	 P0                 | P1                 ;
	 stw r1,0(r2)       | stw r1,0(r12)      ;
	 lwsync             | sync               ;
	 stw  r1,0(r12)	    | lwz r3,0(r2)       ;
	exists
	(a=2 /\ 1:r3=0)


Please point out anything I (or the tool) may be missing; maybe we
can't use (a == 2) as an indication that the STORE on CPU 1 happens
after the STORE on CPU 0?

And there is really something I find strange, see below.

> > 
> > So the scenario that would fail would be this one, right?
> > 
> > a = x = 0
> > 
> > 	CPU0				CPU1
> > 
> > 	r3 = load_locked (&a);
> > 					a = 2;
> > 					sync();
> > 					r3 = x;
> > 	x = 1;
> > 	lwsync();
> > 	if (!store_cond(&a, 1))
> > 		goto again
> > 
> > 
> > Where we hoist the load way up because lwsync allows this.
> 
> That scenario would end up with a==1 rather than a==2.
> 
> > I always thought this would fail because CPU1's store to @a would fail
> > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > load and now seeing the new value (2).
> 
> The stwcx. failure was one thing that prevented a number of other
> misordering cases.  The problem is that we have to let go of the notion
> of an implicit global clock.
> 
> To that end, the herd tool can make a diagram of what it thought
> happened, and I have attached it.  I used this diagram to try and force
> this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> and succeeded.  Here is the sequence of events:
> 
> o	Commit P0's write.  The model offers to propagate this write
> 	to the coherence point and to P1, but don't do so yet.
> 
> o	Commit P1's write.  Similar offers, but don't take them up yet.
> 
> o	Commit P0's lwsync.
> 
> o	Execute P0's lwarx, which reads a=0.  Then commit it.
> 
> o	Commit P0's stwcx. as successful.  This stores a=1.
> 
> o	Commit P0's branch (not taken).
> 

So at this point, P0's write to 'a' has propagated to P1, right? But
P0's write to 'x' hasn't, even there is a lwsync between them, right?
Doesn't the lwsync prevent this from happening?

If at this point P0's write to 'a' hasn't propagated then when?

Regards,
Boqun

> o	Commit P0's final register-to-register move.
> 
> o	Commit P1's sync instruction.
> 
> o	There is now nothing that can happen in either processor.
> 	P0 is done, and P1 is waiting for its sync.  Therefore,
> 	propagate P1's a=2 write to the coherence point and to
> 	the other thread.
> 
> o	There is still nothing that can happen in either processor.
> 	So pick the barrier propagate, then the acknowledge sync.
> 
> o	P1 can now execute its read from x.  Because P0's write to
> 	x is still waiting to propagate to P1, this still reads
> 	x=0.  Execute and commit, and we now have both r3 registers
> 	equal to zero and the final value a=2.
> 
> o	Clean up by propagating the write to x everywhere, and
> 	propagating the lwsync.
> 
> And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> 
> I am still not 100% confident of my litmus test.  It is quite possible
> that I lost something in translation, but that is looking less likely.
> 
> > > Of course, if I am not missing something, then this applies also to the
> > > value-returning RMW atomic operations that you pulled this pattern from.
> > > If so, it would seem that I didn't think through all the possibilities
> > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > that I worried about the RMW atomic operation acting as a barrier,
> > > but not as the load/store itself.  :-/
> > 
> > AARGH64 does something very similar; it does something like:
> > 
> > 	ll
> > 	...
> > 	sc-release
> > 
> > 	mb
> > 
> > Which I assumed worked for the same reason, any change to the variable
> > would fail the sc, and we go for round 2, now observing the new value.
> 
> I have to defer to Will on this one.  You are right that ARM and PowerPC
> do have similar memory models, but there are some differences.
> 
> 							Thanx, Paul




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15  0:53         ` Boqun Feng
@ 2015-10-15  1:22           ` Boqun Feng
  2015-10-15  3:07             ` Paul E. McKenney
  2015-10-15  3:07           ` Paul E. McKenney
  2015-10-15  3:11           ` Boqun Feng
  2 siblings, 1 reply; 43+ messages in thread
From: Boqun Feng @ 2015-10-15  1:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable


On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > Suppose we have something like the following, where "a" and "x" are both
> > > > initially zero:
> > > > 
> > > > 	CPU 0				CPU 1
> > > > 	-----				-----
> > > > 
> > > > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > > > 	r3 = xchg(&a, 1);		smp_mb();
> > > > 					r3 = READ_ONCE(x);
> > > > 
> > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > r3 values being zero, correct?
> > > > 
> > > > And wouldn't this be represented by the following litmus test?
> > > > 
> > > > 	PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > 	""
> > > > 	{
> > > > 	0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > 	1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > 	}
> > > > 	 P0                 | P1                 ;
> > > > 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> > > > 	 lwsync             | sync               ;
> > > > 	 lwarx  r11,r10,r12 | lwz r3,0(r2)       ;
> > > > 	 stwcx. r1,r10,r12  | ;
> > > > 	 bne Fail0          | ;
> > > > 	 mr r3,r11          | ;
> > > > 	 Fail0:             | ;
> > > > 	exists
> > > > 	(0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > 
> > > > I left off P0's trailing sync because there is nothing for it to order
> > > > against in this particular litmus test.  I tried adding it and verified
> > > > that it has no effect.
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> 
> I'm afraid more than that, the above litmus also shows that
> 

I mean there will be more things we need to fix; perhaps even smp_wmb()
would need to be sync then..

Regards,
Boqun

> 	CPU 0				CPU 1
> 	-----				-----
> 
> 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> 	r3 = xchg_release(&a, 1);	smp_mb();
> 					r3 = READ_ONCE(x);
> 
> 	(0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted
> 
> in the implementation of this patchset, which should be disallowed by
> the semantics of RELEASE, right?
> 
> And even:
> 
> 	CPU 0				CPU 1
> 	-----				-----
> 
> 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> 	smp_store_release(&a, 1);	smp_mb();
> 					r3 = READ_ONCE(x);
> 
> 	(1:r3 == 0 && a == 2) is not prohibitted
> 
> shows by:
> 
> 	PPC weird-lwsync
> 	""
> 	{
> 	0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> 	1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> 	}
> 	 P0                 | P1                 ;
> 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> 	 lwsync             | sync               ;
> 	 stw  r1,0(r12)	    | lwz r3,0(r2)       ;
> 	exists
> 	(a=2 /\ 1:r3=0)
> 
> 
> Please find something I'm (or the tool is) missing, maybe we can't use
> (a == 2) as a indication that STORE on CPU 1 happens after STORE on CPU
> 0?
> 
> And there is really something I find strange, see below.
> 
> > > 
> > > So the scenario that would fail would be this one, right?
> > > 
> > > a = x = 0
> > > 
> > > 	CPU0				CPU1
> > > 
> > > 	r3 = load_locked (&a);
> > > 					a = 2;
> > > 					sync();
> > > 					r3 = x;
> > > 	x = 1;
> > > 	lwsync();
> > > 	if (!store_cond(&a, 1))
> > > 		goto again
> > > 
> > > 
> > > Where we hoist the load way up because lwsync allows this.
> > 
> > That scenario would end up with a==1 rather than a==2.
> > 
> > > I always thought this would fail because CPU1's store to @a would fail
> > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > load and now seeing the new value (2).
> > 
> > The stwcx. failure was one thing that prevented a number of other
> > misordering cases.  The problem is that we have to let go of the notion
> > of an implicit global clock.
> > 
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o	Commit P0's write.  The model offers to propagate this write
> > 	to the coherence point and to P1, but don't do so yet.
> > 
> > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o	Commit P0's lwsync.
> > 
> > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o	Commit P0's stwcx. as successful.  This stores a=1.
> > 
> > o	Commit P0's branch (not taken).
> > 
> 
> So at this point, P0's write to 'a' has propagated to P1, right? But
> P0's write to 'x' hasn't, even there is a lwsync between them, right?
> Doesn't the lwsync prevent this from happening?
> 
> If at this point P0's write to 'a' hasn't propagated then when?
> 
> Regards,
> Boqun
> 
> > o	Commit P0's final register-to-register move.
> > 
> > o	Commit P1's sync instruction.
> > 
> > o	There is now nothing that can happen in either processor.
> > 	P0 is done, and P1 is waiting for its sync.  Therefore,
> > 	propagate P1's a=2 write to the coherence point and to
> > 	the other thread.
> > 
> > o	There is still nothing that can happen in either processor.
> > 	So pick the barrier propagate, then the acknowledge sync.
> > 
> > o	P1 can now execute its read from x.  Because P0's write to
> > 	x is still waiting to propagate to P1, this still reads
> > 	x=0.  Execute and commit, and we now have both r3 registers
> > 	equal to zero and the final value a=2.
> > 
> > o	Clean up by propagating the write to x everywhere, and
> > 	propagating the lwsync.
> > 
> > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > 
> > I am still not 100% confident of my litmus test.  It is quite possible
> > that I lost something in translation, but that is looking less likely.
> > 
> > > > Of course, if I am not missing something, then this applies also to the
> > > > value-returning RMW atomic operations that you pulled this pattern from.
> > > > If so, it would seem that I didn't think through all the possibilities
> > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > but not as the load/store itself.  :-/
> > > 
> > > AARGH64 does something very similar; it does something like:
> > > 
> > > 	ll
> > > 	...
> > > 	sc-release
> > > 
> > > 	mb
> > > 
> > > Which I assumed worked for the same reason, any change to the variable
> > > would fail the sc, and we go for round 2, now observing the new value.
> > 
> > I have to defer to Will on this one.  You are right that ARM and PowerPC
> > do have similar memory models, but there are some differences.
> > 
> > 							Thanx, Paul
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15  0:53         ` Boqun Feng
  2015-10-15  1:22           ` Boqun Feng
@ 2015-10-15  3:07           ` Paul E. McKenney
  2015-10-15  4:48             ` Boqun Feng
  2015-10-15  3:11           ` Boqun Feng
  2 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-15  3:07 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > Suppose we have something like the following, where "a" and "x" are both
> > > > initially zero:
> > > > 
> > > > 	CPU 0				CPU 1
> > > > 	-----				-----
> > > > 
> > > > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > > > 	r3 = xchg(&a, 1);		smp_mb();
> > > > 					r3 = READ_ONCE(x);
> > > > 
> > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > r3 values being zero, correct?
> > > > 
> > > > And wouldn't this be represented by the following litmus test?
> > > > 
> > > > 	PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > 	""
> > > > 	{
> > > > 	0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > 	1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > 	}
> > > > 	 P0                 | P1                 ;
> > > > 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> > > > 	 lwsync             | sync               ;
> > > > 	 lwarx  r11,r10,r12 | lwz r3,0(r2)       ;
> > > > 	 stwcx. r1,r10,r12  | ;
> > > > 	 bne Fail0          | ;
> > > > 	 mr r3,r11          | ;
> > > > 	 Fail0:             | ;
> > > > 	exists
> > > > 	(0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > 
> > > > I left off P0's trailing sync because there is nothing for it to order
> > > > against in this particular litmus test.  I tried adding it and verified
> > > > that it has no effect.
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> 
> I'm afraid more than that, the above litmus also shows that
> 
> 	CPU 0				CPU 1
> 	-----				-----
> 
> 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> 	r3 = xchg_release(&a, 1);	smp_mb();
> 					r3 = READ_ONCE(x);
> 
> 	(0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted
> 
> in the implementation of this patchset, which should be disallowed by
> the semantics of RELEASE, right?

Not necessarily.  If you had the read first on CPU 1, and you had a
similar problem, I would be more worried.

> And even:
> 
> 	CPU 0				CPU 1
> 	-----				-----
> 
> 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> 	smp_store_release(&a, 1);	smp_mb();
> 					r3 = READ_ONCE(x);
> 
> 	(1:r3 == 0 && a == 2) is not prohibitted
> 
> shows by:
> 
> 	PPC weird-lwsync
> 	""
> 	{
> 	0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> 	1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> 	}
> 	 P0                 | P1                 ;
> 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> 	 lwsync             | sync               ;
> 	 stw  r1,0(r12)	    | lwz r3,0(r2)       ;
> 	exists
> 	(a=2 /\ 1:r3=0)
> 
> Please find something I'm (or the tool is) missing, maybe we can't use
> (a == 2) as a indication that STORE on CPU 1 happens after STORE on CPU
> 0?

Again, if you were pairing the smp_store_release() with an smp_load_acquire()
or even a READ_ONCE() followed by a barrier, I would be quite concerned.
I am not at all worried about the above two litmus tests.
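
(For contrast, a minimal sketch of the kind of pairing being referred to
here, where the RELEASE/ACQUIRE guarantee is expected to apply: if CPU
1's acquire load observes a == 1, it must also observe x == 1.)

	CPU 0				CPU 1
	-----				-----

	WRITE_ONCE(x, 1);		r1 = smp_load_acquire(&a);
	smp_store_release(&a, 1);	r3 = READ_ONCE(x);

	(1:r1 == 1 && 1:r3 == 0) is forbidden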

> And there is really something I find strange, see below.
> 
> > > 
> > > So the scenario that would fail would be this one, right?
> > > 
> > > a = x = 0
> > > 
> > > 	CPU0				CPU1
> > > 
> > > 	r3 = load_locked (&a);
> > > 					a = 2;
> > > 					sync();
> > > 					r3 = x;
> > > 	x = 1;
> > > 	lwsync();
> > > 	if (!store_cond(&a, 1))
> > > 		goto again
> > > 
> > > 
> > > Where we hoist the load way up because lwsync allows this.
> > 
> > That scenario would end up with a==1 rather than a==2.
> > 
> > > I always thought this would fail because CPU1's store to @a would fail
> > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > load and now seeing the new value (2).
> > 
> > The stwcx. failure was one thing that prevented a number of other
> > misordering cases.  The problem is that we have to let go of the notion
> > of an implicit global clock.
> > 
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o	Commit P0's write.  The model offers to propagate this write
> > 	to the coherence point and to P1, but don't do so yet.
> > 
> > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o	Commit P0's lwsync.
> > 
> > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o	Commit P0's stwcx. as successful.  This stores a=1.
> > 
> > o	Commit P0's branch (not taken).
> 
> So at this point, P0's write to 'a' has propagated to P1, right? But
> P0's write to 'x' hasn't, even there is a lwsync between them, right?
> Doesn't the lwsync prevent this from happening?

No, because lwsync is quite a bit weaker than sync aside from just
the store-load ordering.

> If at this point P0's write to 'a' hasn't propagated then when?

Later.  At the very end of the test, in this case.  ;-)

Why not try creating a longer litmus test that requires P0's write to
"a" to propagate to P1 before both processes complete?

							Thanx, Paul

> Regards,
> Boqun
> 
> > o	Commit P0's final register-to-register move.
> > 
> > o	Commit P1's sync instruction.
> > 
> > o	There is now nothing that can happen in either processor.
> > 	P0 is done, and P1 is waiting for its sync.  Therefore,
> > 	propagate P1's a=2 write to the coherence point and to
> > 	the other thread.
> > 
> > o	There is still nothing that can happen in either processor.
> > 	So pick the barrier propagate, then the acknowledge sync.
> > 
> > o	P1 can now execute its read from x.  Because P0's write to
> > 	x is still waiting to propagate to P1, this still reads
> > 	x=0.  Execute and commit, and we now have both r3 registers
> > 	equal to zero and the final value a=2.
> > 
> > o	Clean up by propagating the write to x everywhere, and
> > 	propagating the lwsync.
> > 
> > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > 
> > I am still not 100% confident of my litmus test.  It is quite possible
> > that I lost something in translation, but that is looking less likely.
> > 
> > > > Of course, if I am not missing something, then this applies also to the
> > > > value-returning RMW atomic operations that you pulled this pattern from.
> > > > If so, it would seem that I didn't think through all the possibilities
> > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > but not as the load/store itself.  :-/
> > > 
> > > AARGH64 does something very similar; it does something like:
> > > 
> > > 	ll
> > > 	...
> > > 	sc-release
> > > 
> > > 	mb
> > > 
> > > Which I assumed worked for the same reason, any change to the variable
> > > would fail the sc, and we go for round 2, now observing the new value.
> > 
> > I have to defer to Will on this one.  You are right that ARM and PowerPC
> > do have similar memory models, but there are some differences.
> > 
> > 							Thanx, Paul
> 
> 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15  1:22           ` Boqun Feng
@ 2015-10-15  3:07             ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-15  3:07 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Thu, Oct 15, 2015 at 09:22:26AM +0800, Boqun Feng wrote:
> On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > > Suppose we have something like the following, where "a" and "x" are both
> > > > > initially zero:
> > > > > 
> > > > > 	CPU 0				CPU 1
> > > > > 	-----				-----
> > > > > 
> > > > > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > > > > 	r3 = xchg(&a, 1);		smp_mb();
> > > > > 					r3 = READ_ONCE(x);
> > > > > 
> > > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > > r3 values being zero, correct?
> > > > > 
> > > > > And wouldn't this be represented by the following litmus test?
> > > > > 
> > > > > 	PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > > 	""
> > > > > 	{
> > > > > 	0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > > 	1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > > 	}
> > > > > 	 P0                 | P1                 ;
> > > > > 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> > > > > 	 lwsync             | sync               ;
> > > > > 	 lwarx  r11,r10,r12 | lwz r3,0(r2)       ;
> > > > > 	 stwcx. r1,r10,r12  | ;
> > > > > 	 bne Fail0          | ;
> > > > > 	 mr r3,r11          | ;
> > > > > 	 Fail0:             | ;
> > > > > 	exists
> > > > > 	(0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > > 
> > > > > I left off P0's trailing sync because there is nothing for it to order
> > > > > against in this particular litmus test.  I tried adding it and verified
> > > > > that it has no effect.
> > > > > 
> > > > > Am I missing something here?  If not, it seems to me that you need
> > > > > the leading lwsync to instead be a sync.
> > 
> > I'm afraid more than that, the above litmus also shows that
> 
> I mean there will be more things we need to fix, perhaps even smp_wmb()
> need to be sync then..

That should not be necessary.  For smp_wmb(), lwsync should be just fine.

								Thanx, Paul

> Regards,
> Boqun
> 
> > 	CPU 0				CPU 1
> > 	-----				-----
> > 
> > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > 	r3 = xchg_release(&a, 1);	smp_mb();
> > 					r3 = READ_ONCE(x);
> > 
> > 	(0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted
> > 
> > in the implementation of this patchset, which should be disallowed by
> > the semantics of RELEASE, right?
> > 
> > And even:
> > 
> > 	CPU 0				CPU 1
> > 	-----				-----
> > 
> > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > 	smp_store_release(&a, 1);	smp_mb();
> > 					r3 = READ_ONCE(x);
> > 
> > 	(1:r3 == 0 && a == 2) is not prohibitted
> > 
> > shows by:
> > 
> > 	PPC weird-lwsync
> > 	""
> > 	{
> > 	0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> > 	1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> > 	}
> > 	 P0                 | P1                 ;
> > 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> > 	 lwsync             | sync               ;
> > 	 stw  r1,0(r12)	    | lwz r3,0(r2)       ;
> > 	exists
> > 	(a=2 /\ 1:r3=0)
> > 
> > 
> > Please find something I'm (or the tool is) missing, maybe we can't use
> > (a == 2) as a indication that STORE on CPU 1 happens after STORE on CPU
> > 0?
> > 
> > And there is really something I find strange, see below.
> > 
> > > > 
> > > > So the scenario that would fail would be this one, right?
> > > > 
> > > > a = x = 0
> > > > 
> > > > 	CPU0				CPU1
> > > > 
> > > > 	r3 = load_locked (&a);
> > > > 					a = 2;
> > > > 					sync();
> > > > 					r3 = x;
> > > > 	x = 1;
> > > > 	lwsync();
> > > > 	if (!store_cond(&a, 1))
> > > > 		goto again
> > > > 
> > > > 
> > > > Where we hoist the load way up because lwsync allows this.
> > > 
> > > That scenario would end up with a==1 rather than a==2.
> > > 
> > > > I always thought this would fail because CPU1's store to @a would fail
> > > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > > load and now seeing the new value (2).
> > > 
> > > The stwcx. failure was one thing that prevented a number of other
> > > misordering cases.  The problem is that we have to let go of the notion
> > > of an implicit global clock.
> > > 
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o	Commit P0's write.  The model offers to propagate this write
> > > 	to the coherence point and to P1, but don't do so yet.
> > > 
> > > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o	Commit P0's lwsync.
> > > 
> > > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o	Commit P0's stwcx. as successful.  This stores a=1.
> > > 
> > > o	Commit P0's branch (not taken).
> > > 
> > 
> > So at this point, P0's write to 'a' has propagated to P1, right? But
> > P0's write to 'x' hasn't, even there is a lwsync between them, right?
> > Doesn't the lwsync prevent this from happening?
> > 
> > If at this point P0's write to 'a' hasn't propagated then when?
> > 
> > Regards,
> > Boqun
> > 
> > > o	Commit P0's final register-to-register move.
> > > 
> > > o	Commit P1's sync instruction.
> > > 
> > > o	There is now nothing that can happen in either processor.
> > > 	P0 is done, and P1 is waiting for its sync.  Therefore,
> > > 	propagate P1's a=2 write to the coherence point and to
> > > 	the other thread.
> > > 
> > > o	There is still nothing that can happen in either processor.
> > > 	So pick the barrier propagate, then the acknowledge sync.
> > > 
> > > o	P1 can now execute its read from x.  Because P0's write to
> > > 	x is still waiting to propagate to P1, this still reads
> > > 	x=0.  Execute and commit, and we now have both r3 registers
> > > 	equal to zero and the final value a=2.
> > > 
> > > o	Clean up by propagating the write to x everywhere, and
> > > 	propagating the lwsync.
> > > 
> > > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > > 
> > > I am still not 100% confident of my litmus test.  It is quite possible
> > > that I lost something in translation, but that is looking less likely.
> > > 
> > > > > Of course, if I am not missing something, then this applies also to the
> > > > > value-returning RMW atomic operations that you pulled this pattern from.
> > > > > If so, it would seem that I didn't think through all the possibilities
> > > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > > but not as the load/store itself.  :-/
> > > > 
> > > > AARGH64 does something very similar; it does something like:
> > > > 
> > > > 	ll
> > > > 	...
> > > > 	sc-release
> > > > 
> > > > 	mb
> > > > 
> > > > Which I assumed worked for the same reason, any change to the variable
> > > > would fail the sc, and we go for round 2, now observing the new value.
> > > 
> > > I have to defer to Will on this one.  You are right that ARM and PowerPC
> > > do have similar memory models, but there are some differences.
> > > 
> > > 							Thanx, Paul
> > 
> > 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15  0:53         ` Boqun Feng
  2015-10-15  1:22           ` Boqun Feng
  2015-10-15  3:07           ` Paul E. McKenney
@ 2015-10-15  3:11           ` Boqun Feng
  2015-10-15  3:33             ` Paul E. McKenney
  2 siblings, 1 reply; 43+ messages in thread
From: Boqun Feng @ 2015-10-15  3:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable


Hi Paul,

On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
[snip]
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o	Commit P0's write.  The model offers to propagate this write
> > 	to the coherence point and to P1, but don't do so yet.
> > 
> > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o	Commit P0's lwsync.
> > 
> > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o	Commit P0's stwcx. as successful.  This stores a=1.
> > 
> > o	Commit P0's branch (not taken).
> > 
> 
> So at this point, P0's write to 'a' has propagated to P1, right? But
> P0's write to 'x' hasn't, even there is a lwsync between them, right?
> Doesn't the lwsync prevent this from happening?
> 
> If at this point P0's write to 'a' hasn't propagated then when?
> 

Hmm.. I played around with ppcmem, and figured out what happens to
the propagation of P0's write to 'a':

At this point, or at some point after the store of 1 to 'a' and before
the sync on P1 finishes, the writes to 'a' reach a coherence point at
which 'a' is 2, so P0's write to 'a' "fails" and will not propagate.


I probably misunderstood the word "propagate"; it actually means that an
already-coherent write gets seen by another CPU, right?

So my question should be:

As lwsync orders P0's write to 'a' after P0's write to 'x', why isn't
P0's write to 'x' seen by P1 after P1's write to 'a' overrides P0's?

But ppcmem gave me the answer ;-) lwsync won't wait until P0's write to
'x' gets propagated. If P0's write to 'a' "wins" in the write coherence
order, lwsync will guarantee that the propagation of 'x' happens before
that of 'a', but if P0's write to 'a' "fails", there will be no
propagation of 'a' from P0 at all, so lwsync can't do anything here.

Regards,
Boqun

> 
> > o	Commit P0's final register-to-register move.
> > 
> > o	Commit P1's sync instruction.
> > 
> > o	There is now nothing that can happen in either processor.
> > 	P0 is done, and P1 is waiting for its sync.  Therefore,
> > 	propagate P1's a=2 write to the coherence point and to
> > 	the other thread.
> > 
> > o	There is still nothing that can happen in either processor.
> > 	So pick the barrier propagate, then the acknowledge sync.
> > 
> > o	P1 can now execute its read from x.  Because P0's write to
> > 	x is still waiting to propagate to P1, this still reads
> > 	x=0.  Execute and commit, and we now have both r3 registers
> > 	equal to zero and the final value a=2.
> > 
> > o	Clean up by propagating the write to x everywhere, and
> > 	propagating the lwsync.
> > 
> > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > 
> > I am still not 100% confident of my litmus test.  It is quite possible
> > that I lost something in translation, but that is looking less likely.
> > 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15  3:11           ` Boqun Feng
@ 2015-10-15  3:33             ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-15  3:33 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Thu, Oct 15, 2015 at 11:11:01AM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> [snip]
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o	Commit P0's write.  The model offers to propagate this write
> > > 	to the coherence point and to P1, but don't do so yet.
> > > 
> > > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o	Commit P0's lwsync.
> > > 
> > > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o	Commit P0's stwcx. as successful.  This stores a=1.
> > > 
> > > o	Commit P0's branch (not taken).
> > > 
> > 
> > So at this point, P0's write to 'a' has propagated to P1, right? But
> > P0's write to 'x' hasn't, even there is a lwsync between them, right?
> > Doesn't the lwsync prevent this from happening?
> > 
> > If at this point P0's write to 'a' hasn't propagated then when?
> 
> Hmm.. I played around ppcmem, and figured out what happens to
> propagation of P0's write to 'a':
> 
> At this point, or some point after store 'a' to 1 and before sync on
> P1 finish, writes to 'a' reachs a coherence point which 'a' is 2, so
> P0's write to 'a' "fails" and will not propagate.
> 
> I probably misunderstood the word "propagate", which actually means an
> already coherent write gets seen by another CPU, right?

It is quite possible for a given write to take a position in the coherence
order that guarantees that no one will see it, as is the case here.
But yes, all readers will see an order of values for a given memory
location that is consistent with the coherence order.

> So my question should be:
> 
> As lwsync can order P0's write to 'a' happens after P0's write to 'x',
> why P0's write to 'x' isn't seen by P1 after P1's write to 'a' overrides
> P0's?

There is no global clock for PPC's memory model.

> But ppcmem gave me the answer ;-) lwsync won't wait under P0's write to
> 'x' gets propagated, and if P0's write to 'a' "wins" in write coherence,
> lwsync will guarantee propagation of 'x' happens before that of 'a', but
> if P0's write to 'a' "fails", there will be no propagation of 'a' from
> P0. So that lwsync can't do anything here.

I believe that this is consistent, but the corners can get tricky.

							Thanx, Paul

> Regards,
> Boqun
> 
> > 
> > > o	Commit P0's final register-to-register move.
> > > 
> > > o	Commit P1's sync instruction.
> > > 
> > > o	There is now nothing that can happen in either processor.
> > > 	P0 is done, and P1 is waiting for its sync.  Therefore,
> > > 	propagate P1's a=2 write to the coherence point and to
> > > 	the other thread.
> > > 
> > > o	There is still nothing that can happen in either processor.
> > > 	So pick the barrier propagate, then the acknowledge sync.
> > > 
> > > o	P1 can now execute its read from x.  Because P0's write to
> > > 	x is still waiting to propagate to P1, this still reads
> > > 	x=0.  Execute and commit, and we now have both r3 registers
> > > 	equal to zero and the final value a=2.
> > > 
> > > o	Clean up by propagating the write to x everywhere, and
> > > 	propagating the lwsync.
> > > 
> > > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > > 
> > > I am still not 100% confident of my litmus test.  It is quite possible
> > > that I lost something in translation, but that is looking less likely.
> > > 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15  3:07           ` Paul E. McKenney
@ 2015-10-15  4:48             ` Boqun Feng
  2015-10-15 16:30               ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Boqun Feng @ 2015-10-15  4:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable


On Wed, Oct 14, 2015 at 08:07:05PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
[snip]
> > 
> > I'm afraid more than that, the above litmus also shows that
> > 
> > 	CPU 0				CPU 1
> > 	-----				-----
> > 
> > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > 	r3 = xchg_release(&a, 1);	smp_mb();
> > 					r3 = READ_ONCE(x);
> > 
> > 	(0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted
> > 
> > in the implementation of this patchset, which should be disallowed by
> > the semantics of RELEASE, right?
> 
> Not necessarily.  If you had the read first on CPU 1, and you had a
> similar problem, I would be more worried.
> 

Sometimes I think maybe we should say that a single unpaired ACQUIRE or
RELEASE doesn't provide any ordering guarantee, because of the above case.

But it seems that's not a normal, or even an existing, case; my bad ;-(

> > And even:
> > 
> > 	CPU 0				CPU 1
> > 	-----				-----
> > 
> > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > 	smp_store_release(&a, 1);	smp_mb();
> > 					r3 = READ_ONCE(x);
> > 
> > 	(1:r3 == 0 && a == 2) is not prohibitted
> > 
> > shows by:
> > 
> > 	PPC weird-lwsync
> > 	""
> > 	{
> > 	0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> > 	1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> > 	}
> > 	 P0                 | P1                 ;
> > 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> > 	 lwsync             | sync               ;
> > 	 stw  r1,0(r12)	    | lwz r3,0(r2)       ;
> > 	exists
> > 	(a=2 /\ 1:r3=0)
> > 
> > Please find something I'm (or the tool is) missing, maybe we can't use
> > (a == 2) as a indication that STORE on CPU 1 happens after STORE on CPU
> > 0?
> 
> Again, if you were pairing the smp_store_release() with an smp_load_acquire()
> or even a READ_ONCE() followed by a barrier, I would be quite concerned.
> I am not at all worried about the above two litmus tests.
> 

Understood, thank you for thinking that through ;-)

> > And there is really something I find strange, see below.
> > 
> > > > 
> > > > So the scenario that would fail would be this one, right?
> > > > 
> > > > a = x = 0
> > > > 
> > > > 	CPU0				CPU1
> > > > 
> > > > 	r3 = load_locked (&a);
> > > > 					a = 2;
> > > > 					sync();
> > > > 					r3 = x;
> > > > 	x = 1;
> > > > 	lwsync();
> > > > 	if (!store_cond(&a, 1))
> > > > 		goto again
> > > > 
> > > > 
> > > > Where we hoist the load way up because lwsync allows this.
> > > 
> > > That scenario would end up with a==1 rather than a==2.
> > > 
> > > > I always thought this would fail because CPU1's store to @a would fail
> > > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > > load and now seeing the new value (2).
> > > 
> > > The stwcx. failure was one thing that prevented a number of other
> > > misordering cases.  The problem is that we have to let go of the notion
> > > of an implicit global clock.
> > > 
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o	Commit P0's write.  The model offers to propagate this write
> > > 	to the coherence point and to P1, but don't do so yet.
> > > 
> > > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o	Commit P0's lwsync.
> > > 
> > > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o	Commit P0's stwcx. as successful.  This stores a=1.
> > > 
> > > o	Commit P0's branch (not taken).
> > 
> > So at this point, P0's write to 'a' has propagated to P1, right? But
> > P0's write to 'x' hasn't, even there is a lwsync between them, right?
> > Doesn't the lwsync prevent this from happening?
> 
> No, because lwsync is quite a bit weaker than sync aside from just
> the store-load ordering.
> 

Understood, I've tried ppcmem; things are much clearer now ;-)

> > If at this point P0's write to 'a' hasn't propagated then when?
> 
> Later.  At the very end of the test, in this case.  ;-)
> 

Hmm.. I tried exactly this sequence in ppcmem, and it seems propagation
of P0's write to 'a' is never offered as an option...

> Why not try creating a longer litmus test that requires P0's write to
> "a" to propagate to P1 before both processes complete?
> 

I will try to write one, but to be clear, you mean we still observe 

0:r3 == 0 && a == 2 && 1:r3 == 0 

at the end, right? Because I understand that if P1's write to 'a'
doesn't override P0's, P0's write to 'a' will propagate.

Regards,
Boqun


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-14 21:44       ` Paul E. McKenney
  2015-10-15  0:53         ` Boqun Feng
@ 2015-10-15 10:35         ` Will Deacon
  2015-10-15 14:40           ` Boqun Feng
                             ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: Will Deacon @ 2015-10-15 10:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Boqun Feng, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long, Davidlohr Bueso,
	stable

Dammit guys, it's never simple is it?

On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> To that end, the herd tool can make a diagram of what it thought
> happened, and I have attached it.  I used this diagram to try and force
> this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> and succeeded.  Here is the sequence of events:
> 
> o	Commit P0's write.  The model offers to propagate this write
> 	to the coherence point and to P1, but don't do so yet.
> 
> o	Commit P1's write.  Similar offers, but don't take them up yet.
> 
> o	Commit P0's lwsync.
> 
> o	Execute P0's lwarx, which reads a=0.  Then commit it.
> 
> o	Commit P0's stwcx. as successful.  This stores a=1.

On arm64, this is a conditional-store-*release* and therefore cannot be
observed before the initial write to x...

> o	Commit P0's branch (not taken).
> 
> o	Commit P0's final register-to-register move.
> 
> o	Commit P1's sync instruction.
> 
> o	There is now nothing that can happen in either processor.
> 	P0 is done, and P1 is waiting for its sync.  Therefore,
> 	propagate P1's a=2 write to the coherence point and to
> 	the other thread.

... therefore this is illegal, because you haven't yet propagated that
prior write...

> 
> o	There is still nothing that can happen in either processor.
> 	So pick the barrier propagate, then the acknowledge sync.
> 
> o	P1 can now execute its read from x.  Because P0's write to
> 	x is still waiting to propagate to P1, this still reads
> 	x=0.  Execute and commit, and we now have both r3 registers
> 	equal to zero and the final value a=2.

... and P1 would have to read x == 1.

So arm64 is ok. Doesn't lwsync order store->store observability for PPC?

Will

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15 10:35         ` Will Deacon
@ 2015-10-15 14:40           ` Boqun Feng
  2015-10-15 14:50           ` Will Deacon
  2015-10-15 15:42           ` Paul E. McKenney
  2 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-15 14:40 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long, Davidlohr Bueso,
	stable


On Thu, Oct 15, 2015 at 11:35:44AM +0100, Will Deacon wrote:
> 
> So arm64 is ok. Doesn't lwsync order store->store observability for PPC?
> 

I ran some litmus tests and put the results here. My understanding
might be wrong, and I think Paul can explain lwsync and store->store
ordering better ;-)


When a store->lwsync->store pairs with load->lwsync->load, according to
herd, YES.

    PPC W+lwsync+W-R+lwsync+R
    "
      2015-10-15
      herd said (1:r1=0 /\ 1:r2=2) doesn't exist,
      so if P1 observes the write to 'b', it must also observe P0's write
      to 'a'
    "
    {
    0:r1=1; 0:r2=2; 0:r12=a; 0:r13=b;
    1:r1=0; 1:r2=0; 1:r12=a; 1:r13=b;
    }
    
     P0              | P1             ;
     stw r1, 0(r12)  | lwz r2, 0(r13) ;
     lwsync          | lwsync         ;
     stw r2, 0(r13)  | lwz r1, 0(r12) ;
    
    exists
    (1:r1=0 /\ 1:r2=2)
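
(At the C level, this first test is just the usual write-barrier/
read-barrier message-passing pattern -- a sketch, assuming smp_wmb()
and smp_rmb() both map to lwsync here, as discussed elsewhere in the
thread for smp_wmb():)

	CPU 0				CPU 1
	-----				-----

	WRITE_ONCE(a, 1);		r2 = READ_ONCE(b);
	smp_wmb();			smp_rmb();
	WRITE_ONCE(b, 2);		r1 = READ_ONCE(a);

	(1:r2 == 2 && 1:r1 == 0) is forbidden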


If "observation" also includes a write on one CPU overriding another
write on another CPU, then:

when a store->lwsync->store pairs(?) with a store->sync->load, according
to herd, NO(?).

    PPC W+lwsync+W-W+sync+R
    "
      2015-10-15
      herd said (1:r1=0 /\ b=3) exists sometimes,
      so even if P1 observes P0's write to 'b' (by 'overriding' that
      write to 'b'), it may not observe P0's write to 'a'.
    "
    {
    0:r1=1; 0:r2=2; 0:r12=a; 0:r13=b;
    1:r1=0; 1:r2=3; 1:r12=a; 1:r13=b;
    }
    
     P0              | P1             ;
     stw r1, 0(r12)  | stw r2, 0(r13) ;
     lwsync          | sync         ;
     stw r2, 0(r13)  | lwz r1, 0(r12) ;
    
    exists
    (1:r1=0 /\ b=3)


Regards,
Boqun


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-14 20:19   ` Paul E. McKenney
  2015-10-14 21:04     ` Peter Zijlstra
@ 2015-10-15 14:49     ` Boqun Feng
  2015-10-15 16:16       ` Paul E. McKenney
  2015-10-20  7:15     ` Boqun Feng
  2 siblings, 1 reply; 43+ messages in thread
From: Boqun Feng @ 2015-10-15 14:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable


On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 14, 2015 at 11:55:56PM +0800, Boqun Feng wrote:
> > According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
> > versions all need to imply a full barrier, however they are now just
> > RELEASE+ACQUIRE, which is not a full barrier.
> > 
> > So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> > PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> > __{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
> > semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().
> > 
> > This patch is a complement of commit b97021f85517 ("powerpc: Fix
> > atomic_xxx_return barrier semantics").
> > 
> > Acked-by: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: <stable@vger.kernel.org> # 3.4+
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > ---
> >  arch/powerpc/include/asm/cmpxchg.h | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
> > index ad6263c..d1a8d93 100644
> > --- a/arch/powerpc/include/asm/cmpxchg.h
> > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
> >  	unsigned long prev;
> > 
> >  	__asm__ __volatile__(
> > -	PPC_RELEASE_BARRIER
> > +	PPC_ATOMIC_ENTRY_BARRIER
> 
> This looks to be the lwsync instruction.
> 
> >  "1:	lwarx	%0,0,%2 \n"
> >  	PPC405_ERR77(0,%2)
> >  "	stwcx.	%3,0,%2 \n\
> >  	bne-	1b"
> > -	PPC_ACQUIRE_BARRIER
> > +	PPC_ATOMIC_EXIT_BARRIER
> 
> And this looks to be the sync instruction.
> 
> >  	: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
> >  	: "r" (p), "r" (val)
> >  	: "cc", "memory");
> 
> Hmmm...
> 
> Suppose we have something like the following, where "a" and "x" are both
> initially zero:
> 
> 	CPU 0				CPU 1
> 	-----				-----
> 
> 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> 	r3 = xchg(&a, 1);		smp_mb();
> 					r3 = READ_ONCE(x);
> 
> If xchg() is fully ordered, we should never observe both CPUs'
> r3 values being zero, correct?
> 
> And wouldn't this be represented by the following litmus test?
> 
> 	PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> 	""
> 	{
> 	0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> 	1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> 	}
> 	 P0                 | P1                 ;
> 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> 	 lwsync             | sync               ;
> 	 lwarx  r11,r10,r12 | lwz r3,0(r2)       ;
> 	 stwcx. r1,r10,r12  | ;
> 	 bne Fail0          | ;
> 	 mr r3,r11          | ;
> 	 Fail0:             | ;
> 	exists
> 	(0:r3=0 /\ a=2 /\ 1:r3=0)
> 
> I left off P0's trailing sync because there is nothing for it to order
> against in this particular litmus test.  I tried adding it and verified
> that it has no effect.
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.
> 

If so, I will define PPC_ATOMIC_ENTRY_BARRIER as "sync" in the next
version of this patch; any concerns?

Of course, I will wait to do that until we all understand this is
necessary and agree to make the change.
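
(Concretely, relative to the hunk quoted above, __xchg_u32 would then
emit something like the following -- a sketch with the barrier macros
expanded and PPC405_ERR77 omitted:)

	static inline unsigned long
	__xchg_u32_sketch(volatile void *p, unsigned long val)
	{
		unsigned long prev;

		__asm__ __volatile__(
	"	sync			\n"	/* proposed PPC_ATOMIC_ENTRY_BARRIER: sync, not lwsync */
	"1:	lwarx	%0,0,%2		\n"
	"	stwcx.	%3,0,%2		\n"
	"	bne-	1b		\n"
	"	sync			\n"	/* PPC_ATOMIC_EXIT_BARRIER is already sync */
		: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
		: "r" (p), "r" (val)
		: "cc", "memory");

		return prev;
	}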

> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.

For the value-returning RMW atomics, if the leading barrier really has
to be "sync", I will just remove my __atomic_op_fence() in patch 4, but
I will keep patch 3 unchanged for the consistency of the
__atomic_op_*() macros' definitions. Peter and Will, does that work for
you both?

Regards,
Boqun

> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/
> 
> 							Thanx, Paul
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15 10:35         ` Will Deacon
  2015-10-15 14:40           ` Boqun Feng
@ 2015-10-15 14:50           ` Will Deacon
  2015-10-15 16:29             ` Paul E. McKenney
  2015-10-15 15:42           ` Paul E. McKenney
  2 siblings, 1 reply; 43+ messages in thread
From: Will Deacon @ 2015-10-15 14:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Boqun Feng, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long, Davidlohr Bueso,
	stable

On Thu, Oct 15, 2015 at 11:35:10AM +0100, Will Deacon wrote:
> Dammit guys, it's never simple is it?

I re-read this and it's even more confusing than I first thought.

> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o	Commit P0's write.  The model offers to propagate this write
> > 	to the coherence point and to P1, but don't do so yet.
> > 
> > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o	Commit P0's lwsync.
> > 
> > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o	Commit P0's stwcx. as successful.  This stores a=1.
> 
> On arm64, this is a conditional-store-*release* and therefore cannot be
> observed before the initial write to x...
> 
> > o	Commit P0's branch (not taken).
> > 
> > o	Commit P0's final register-to-register move.
> > 
> > o	Commit P1's sync instruction.
> > 
> > o	There is now nothing that can happen in either processor.
> > 	P0 is done, and P1 is waiting for its sync.  Therefore,
> > 	propagate P1's a=2 write to the coherence point and to
> > 	the other thread.
> 
> ... therefore this is illegal, because you haven't yet propagated that
> prior write...

I misread this as a propagation of P0's conditional store. What actually
happens on arm64 is that the early conditional store can only succeed
once it is placed into the coherence order of the location which it is
updating (but note that this is subtly different from multi-copy
atomicity!).

So, given that the previous conditional store succeeded, the coherence
order on A must be either {0, 1, 2} or {0, 2, 1}.

If it's {0, 1, 2} (as required by your complete example), that means
P1's a=2 write "observes" the conditional store by P0, and therefore
(because the conditional store has release semantics), also observes
P0's x=1 write.

On the other hand, if P1's a=2 write propagates first and we have a
coherence order of {0, 2, 1}, then P0 must have r3=2, because an
exclusive load returning zero would have led to a failed conditional
store thanks to the intervening write by P1.

I find it pretty weird if PPC allows the conditional store to succeed in
this way, as I think that would break simple cases like two threads
incrementing a shared variable in parallel:


""
{
0:r1=1; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=1; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
}
P0                 | P1                 ;
lwarx  r11,r10,r12 | lwarx  r11,r10,r12 ;
add r11,r1,r11     | add r11,r1,r11     ;
stwcx. r11,r10,r12 | stwcx. r11,r10,r12 ;
bne Fail0          | bne Fail1          ;
mr r3,r1           | mr r3,r1           ;
Fail0:             | Fail1:             ;
exists
(0:r3=1 /\ a=1 /\ 1:r3=1)


This is also allowed by herd but forbidden by ppcmem, for example.
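
(In kernel terms, the breakage that litmus would imply is as simple as
two CPUs doing an atomic increment in parallel and losing one of the
updates -- a sketch:)

	CPU 0				CPU 1
	-----				-----

	atomic_inc(&a);			atomic_inc(&a);

	(a == 1 after both complete) would be a lost update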

Will

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15 10:35         ` Will Deacon
  2015-10-15 14:40           ` Boqun Feng
  2015-10-15 14:50           ` Will Deacon
@ 2015-10-15 15:42           ` Paul E. McKenney
  2 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-15 15:42 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Boqun Feng, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long, Davidlohr Bueso,
	stable

On Thu, Oct 15, 2015 at 11:35:44AM +0100, Will Deacon wrote:
> Dammit guys, it's never simple is it?
> 
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o	Commit P0's write.  The model offers to propagate this write
> > 	to the coherence point and to P1, but don't do so yet.
> > 
> > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o	Commit P0's lwsync.
> > 
> > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o	Commit P0's stwcx. as successful.  This stores a=1.
> 
> On arm64, this is a conditional-store-*release* and therefore cannot be
> observed before the initial write to x...
> 
> > o	Commit P0's branch (not taken).
> > 
> > o	Commit P0's final register-to-register move.
> > 
> > o	Commit P1's sync instruction.
> > 
> > o	There is now nothing that can happen in either processor.
> > 	P0 is done, and P1 is waiting for its sync.  Therefore,
> > 	propagate P1's a=2 write to the coherence point and to
> > 	the other thread.
> 
> ... therefore this is illegal, because you haven't yet propagated that
> prior write...

OK.  Power distinguishes between propagating to the coherence point
and to each of the other CPUs.

> > o	There is still nothing that can happen in either processor.
> > 	So pick the barrier propagate, then the acknowledge sync.
> > 
> > o	P1 can now execute its read from x.  Because P0's write to
> > 	x is still waiting to propagate to P1, this still reads
> > 	x=0.  Execute and commit, and we now have both r3 registers
> > 	equal to zero and the final value a=2.
> 
> ... and P1 would have to read x == 1.

Good!  Do ARMMEM and herd agree with you?

> So arm64 is ok. Doesn't lwsync order store->store observability for PPC?

Yes.  But this is not store->store observability, but rather store->load
visibility.  Furthermore, as I understand it, lwsync controls the
visibility to other CPUs, but not necessarily the coherence order.

Let's look at the example C code again:

	CPU 0				CPU 1
	-----				-----

	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
	r3 = xchg(&a, 1);		smp_mb();
					r3 = READ_ONCE(x);

The problem is that we are applying intuitions obtained from a
release-acquire chain, which hands off from stores to loads.  In contrast,
this example is quite weird in that we have a store handing off to another
store, but with reads also involved.  Making that work on Power requires
full memory barriers on both sides.  Intuitively, the coherence order
can be established after the fact as long as all readers see a consistent
set of values based on the subset of the sequence that each reader sees.

Anyway, it looks like Power does need a sync before and after for
value-returning atomics.  That certainly simplifies the analysis.
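
(Spelled out, that means the value-returning atomics end up looking
roughly like this sketch, with both barrier macros expanding to sync
and the PPC405 workaround omitted:)

	static __inline__ int atomic_add_return_sketch(int a, atomic_t *v)
	{
		int t;

		__asm__ __volatile__(
	"	sync			\n"	/* full barrier before the ll/sc sequence */
	"1:	lwarx	%0,0,%2		\n"
	"	add	%0,%1,%0	\n"
	"	stwcx.	%0,0,%2		\n"
	"	bne-	1b		\n"
	"	sync			\n"	/* full barrier after */
		: "=&r" (t)
		: "r" (a), "r" (&v->counter)
		: "cc", "memory");

		return t;
	}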

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15 14:49     ` Boqun Feng
@ 2015-10-15 16:16       ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-15 16:16 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Thu, Oct 15, 2015 at 10:49:23PM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:55:56PM +0800, Boqun Feng wrote:
> > > According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
> > > versions all need to imply a full barrier, however they are now just
> > > RELEASE+ACQUIRE, which is not a full barrier.
> > > 
> > > So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> > > PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> > > __{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
> > > semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().
> > > 
> > > This patch is a complement of commit b97021f85517 ("powerpc: Fix
> > > atomic_xxx_return barrier semantics").
> > > 
> > > Acked-by: Michael Ellerman <mpe@ellerman.id.au>
> > > Cc: <stable@vger.kernel.org> # 3.4+
> > > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > > ---
> > >  arch/powerpc/include/asm/cmpxchg.h | 16 ++++++++--------
> > >  1 file changed, 8 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
> > > index ad6263c..d1a8d93 100644
> > > --- a/arch/powerpc/include/asm/cmpxchg.h
> > > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > > @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
> > >  	unsigned long prev;
> > > 
> > >  	__asm__ __volatile__(
> > > -	PPC_RELEASE_BARRIER
> > > +	PPC_ATOMIC_ENTRY_BARRIER
> > 
> > This looks to be the lwsync instruction.
> > 
> > >  "1:	lwarx	%0,0,%2 \n"
> > >  	PPC405_ERR77(0,%2)
> > >  "	stwcx.	%3,0,%2 \n\
> > >  	bne-	1b"
> > > -	PPC_ACQUIRE_BARRIER
> > > +	PPC_ATOMIC_EXIT_BARRIER
> > 
> > And this looks to be the sync instruction.
> > 
> > >  	: "=&r" (prev), "+m" (*(volatile unsigned int *)p)
> > >  	: "r" (p), "r" (val)
> > >  	: "cc", "memory");
> > 
> > Hmmm...
> > 
> > Suppose we have something like the following, where "a" and "x" are both
> > initially zero:
> > 
> > 	CPU 0				CPU 1
> > 	-----				-----
> > 
> > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > 	r3 = xchg(&a, 1);		smp_mb();
> > 					r3 = READ_ONCE(x);
> > 
> > If xchg() is fully ordered, we should never observe both CPUs'
> > r3 values being zero, correct?
> > 
> > And wouldn't this be represented by the following litmus test?
> > 
> > 	PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > 	""
> > 	{
> > 	0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > 	1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > 	}
> > 	 P0                 | P1                 ;
> > 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> > 	 lwsync             | sync               ;
> > 	 lwarx  r11,r10,r12 | lwz r3,0(r2)       ;
> > 	 stwcx. r1,r10,r12  | ;
> > 	 bne Fail0          | ;
> > 	 mr r3,r11          | ;
> > 	 Fail0:             | ;
> > 	exists
> > 	(0:r3=0 /\ a=2 /\ 1:r3=0)
> > 
> > I left off P0's trailing sync because there is nothing for it to order
> > against in this particular litmus test.  I tried adding it and verified
> > that it has no effect.
> > 
> > Am I missing something here?  If not, it seems to me that you need
> > the leading lwsync to instead be a sync.
> > 
> 
> If so, I will define PPC_ATOMIC_ENTRY_BARRIER as "sync" in the next
> version of this patch, any concern?
> 
> Of course, I will wait to do that until we all understand this is
> necessary and agree to make the change.

I am in favor, but I am not the maintainer.  ;-)

							Thanx, Paul

> > Of course, if I am not missing something, then this applies also to the
> > value-returning RMW atomic operations that you pulled this pattern from.
> 
> For the value-returning RMW atomics, if the leading barrier necessarily
> has to be "sync", I will just remove my __atomic_op_fence() in
> patch 4, but I will keep patch 3 unchanged for the consistency of
> __atomic_op_*() macros' definitions. Peter and Will, does that work for
> you both?
> 
> Regards,
> Boqun
> 
> > If so, it would seem that I didn't think through all the possibilities
> > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > that I worried about the RMW atomic operation acting as a barrier,
> > but not as the load/store itself.  :-/
> > 
> > 							Thanx, Paul
> > 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15 14:50           ` Will Deacon
@ 2015-10-15 16:29             ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-15 16:29 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Boqun Feng, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Gleixner, Waiman Long, Davidlohr Bueso,
	stable

On Thu, Oct 15, 2015 at 03:50:44PM +0100, Will Deacon wrote:
> On Thu, Oct 15, 2015 at 11:35:10AM +0100, Will Deacon wrote:
> > Dammit guys, it's never simple is it?
> 
> I re-read this and it's even more confusing than I first thought.
> 
> > On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o	Commit P0's write.  The model offers to propagate this write
> > > 	to the coherence point and to P1, but don't do so yet.
> > > 
> > > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o	Commit P0's lwsync.
> > > 
> > > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o	Commit P0's stwcx. as successful.  This stores a=1.
> > 
> > On arm64, this is a conditional-store-*release* and therefore cannot be
> > observed before the initial write to x...
> > 
> > > o	Commit P0's branch (not taken).
> > > 
> > > o	Commit P0's final register-to-register move.
> > > 
> > > o	Commit P1's sync instruction.
> > > 
> > > o	There is now nothing that can happen in either processor.
> > > 	P0 is done, and P1 is waiting for its sync.  Therefore,
> > > 	propagate P1's a=2 write to the coherence point and to
> > > 	the other thread.
> > 
> > ... therefore this is illegal, because you haven't yet propagated that
> > prior write...
> 
> I misread this as a propagation of P0's conditional store. What actually
> happens on arm64 is that the early conditional store can only succeed
> once it is placed into the coherence order of the location which it is
> updating (but note that this is subtly different from multi-copy
> atomicity!).
> 
> So, given that the previous conditional store succeeded, the coherence
> order on A must be either {0, 1, 2} or {0, 2, 1}.
> 
> If it's {0, 1, 2} (as required by your complete example), that means
> P1's a=2 write "observes" the conditional store by P0, and therefore
> (because the conditional store has release semantics), also observes
> P0's x=1 write.
> 
> On the other hand, if P1's a=2 write propagates first and we have a
> coherence order of {0, 2, 1}, then P0 must have r3=2, because an
> exclusive load returning zero would have led to a failed conditional
> store thanks to the intervening write by P1.
> 
> I find it pretty weird if PPC allows the conditional store to succeed in
> this way, as I think that would break simple cases like two threads
> incrementing a shared variable in parallel:
> 
> 
> ""
> {
> 0:r1=1; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> 1:r1=1; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> }
> P0                 | P1                 ;
> lwarx  r11,r10,r12 | lwarx  r11,r10,r12 ;
> add r11,r1,r11     | add r11,r1,r11     ;
> stwcx. r11,r10,r12 | stwcx. r11,r10,r12 ;
> bne Fail0          | bne Fail1          ;
> mr r3,r1           | mr r3,r1           ;
> Fail0:             | Fail1:             ;
> exists
> (0:r3=1 /\ a=1 /\ 1:r3=1)
> 
> 
> This is also allowed by herd but forbidden by ppcmem, for example.

I had no idea that either herd or ppcmem knew about the add instruction.
It looks like they at least try to understand.  Needless to say, in this
case I agree with ppcmem.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15  4:48             ` Boqun Feng
@ 2015-10-15 16:30               ` Paul E. McKenney
  2015-10-19  0:19                 ` Boqun Feng
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-15 16:30 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Thu, Oct 15, 2015 at 12:48:03PM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 08:07:05PM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> [snip]
> > > 
> > > I'm afraid more than that, the above litmus also shows that
> > > 
> > > 	CPU 0				CPU 1
> > > 	-----				-----
> > > 
> > > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > > 	r3 = xchg_release(&a, 1);	smp_mb();
> > > 					r3 = READ_ONCE(x);
> > > 
> > > 	(0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibited
> > > 
> > > in the implementation of this patchset, which should be disallowed by
> > > the semantics of RELEASE, right?
> > 
> > Not necessarily.  If you had the read first on CPU 1, and you had a
> > similar problem, I would be more worried.
> > 
> 
> Sometimes I think maybe we should say that a single unpaired ACQUIRE or
> RELEASE doesn't have any ordering guarantee because of the above case.
> 
> But it seems that's not a normal or even an existing case, my bad ;-(
> 
> > > And even:
> > > 
> > > 	CPU 0				CPU 1
> > > 	-----				-----
> > > 
> > > 	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
> > > 	smp_store_release(&a, 1);	smp_mb();
> > > 					r3 = READ_ONCE(x);
> > > 
> > > 	(1:r3 == 0 && a == 2) is not prohibited
> > > 
> > > shows by:
> > > 
> > > 	PPC weird-lwsync
> > > 	""
> > > 	{
> > > 	0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> > > 	1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> > > 	}
> > > 	 P0                 | P1                 ;
> > > 	 stw r1,0(r2)       | stw r1,0(r12)      ;
> > > 	 lwsync             | sync               ;
> > > 	 stw  r1,0(r12)	    | lwz r3,0(r2)       ;
> > > 	exists
> > > 	(a=2 /\ 1:r3=0)
> > > 
> > > Please point out anything I (or the tool) might be missing; maybe we can't use
> > > (a == 2) as an indication that the STORE on CPU 1 happens after the STORE on CPU
> > > 0?
> > 
> > Again, if you were pairing the smp_store_release() with an smp_load_acquire()
> > or even a READ_ONCE() followed by a barrier, I would be quite concerned.
> > I am not at all worried about the above two litmus tests.
> > 
> 
> Understood, thank you for thinking that through ;-)
> 
> > > And there is really something I find strange, see below.
> > > 
> > > > > 
> > > > > So the scenario that would fail would be this one, right?
> > > > > 
> > > > > a = x = 0
> > > > > 
> > > > > 	CPU0				CPU1
> > > > > 
> > > > > 	r3 = load_locked (&a);
> > > > > 					a = 2;
> > > > > 					sync();
> > > > > 					r3 = x;
> > > > > 	x = 1;
> > > > > 	lwsync();
> > > > > 	if (!store_cond(&a, 1))
> > > > > 		goto again
> > > > > 
> > > > > 
> > > > > Where we hoist the load way up because lwsync allows this.
> > > > 
> > > > That scenario would end up with a==1 rather than a==2.
> > > > 
> > > > > I always thought this would fail because CPU1's store to @a would fail
> > > > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > > > load and now seeing the new value (2).
> > > > 
> > > > The stwcx. failure was one thing that prevented a number of other
> > > > misordering cases.  The problem is that we have to let go of the notion
> > > > of an implicit global clock.
> > > > 
> > > > To that end, the herd tool can make a diagram of what it thought
> > > > happened, and I have attached it.  I used this diagram to try and force
> > > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > > and succeeded.  Here is the sequence of events:
> > > > 
> > > > o	Commit P0's write.  The model offers to propagate this write
> > > > 	to the coherence point and to P1, but don't do so yet.
> > > > 
> > > > o	Commit P1's write.  Similar offers, but don't take them up yet.
> > > > 
> > > > o	Commit P0's lwsync.
> > > > 
> > > > o	Execute P0's lwarx, which reads a=0.  Then commit it.
> > > > 
> > > > o	Commit P0's stwcx. as successful.  This stores a=1.
> > > > 
> > > > o	Commit P0's branch (not taken).
> > > 
> > > So at this point, P0's write to 'a' has propagated to P1, right? But
> > > P0's write to 'x' hasn't, even though there is an lwsync between them, right?
> > > Doesn't the lwsync prevent this from happening?
> > 
> > No, because lwsync is quite a bit weaker than sync aside from just
> > the store-load ordering.
> > 
> 
> Understood, I've tried the ppcmem, much clear now ;-)
> 
> > > If at this point P0's write to 'a' hasn't propagated then when?
> > 
> > Later.  At the very end of the test, in this case.  ;-)
> > 
> 
> Hmm.. I tried exactly this sequence in ppcmem, and it seems the propagation
> of P0's write to 'a' is never an option...
> 
> > Why not try creating a longer litmus test that requires P0's write to
> > "a" to propagate to P1 before both processes complete?
> > 
> 
> I will try to write one, but to be clear, you mean we still observe 
> 
> 0:r3 == 0 && a == 2 && 1:r3 == 0 
> 
> at the end, right? Because I understand that if P1's write to 'a'
> doesn't override P0's, P0's write to 'a' will propagate.

Your choice.  My question is whether you can come up with a similar
litmus test where lwsync is allowing the behavior here, but clearly
is affecting some other aspect of ordering.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-15 16:30               ` Paul E. McKenney
@ 2015-10-19  0:19                 ` Boqun Feng
  0 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-19  0:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

[-- Attachment #1: Type: text/plain, Size: 1383 bytes --]

On Thu, Oct 15, 2015 at 09:30:40AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 15, 2015 at 12:48:03PM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 08:07:05PM -0700, Paul E. McKenney wrote:
[snip]
> > 
> > > Why not try creating a longer litmus test that requires P0's write to
> > > "a" to propagate to P1 before both processes complete?
> > > 
> > 
> > I will try to write one, but to be clear, you mean we still observe 
> > 
> > 0:r3 == 0 && a == 2 && 1:r3 == 0 
> > 
> > at the end, right? Because I understand that if P1's write to 'a'
> > doesn't override P0's, P0's write to 'a' will propagate.
> 
> Your choice.  My question is whether you can come up with a similar
> litmus test where lwsync is allowing the behavior here, but clearly
> is affecting some other aspect of ordering.
> 

Got it, though my question about the propagation of P0's write to 'a'
was originally aimed at understanding the hardware behavior (or model) in
your sequence of events ;-)

To be clear, by "some other aspect of ordering", you mean something like
a paired RELEASE+ACQUIRE scenario (i.e. P1 observes P0's write to 'a' via
a load, which means P0's write to 'a' propagates at some point), right?

If so, I haven't yet come up with one, and I think there's probably none,
so my worry about "lwsync" in other places is likely unnecessary.

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-14 20:19   ` Paul E. McKenney
  2015-10-14 21:04     ` Peter Zijlstra
  2015-10-15 14:49     ` Boqun Feng
@ 2015-10-20  7:15     ` Boqun Feng
  2015-10-20  9:21       ` Peter Zijlstra
  2 siblings, 1 reply; 43+ messages in thread
From: Boqun Feng @ 2015-10-20  7:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, linuxppc-dev, Peter Zijlstra, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

[-- Attachment #1: Type: text/plain, Size: 1508 bytes --]

On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.
> 
> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.
> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/
> 

Paul, I know this may be difficult, but could you recall why the
__futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
involved in the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?

I did some searching, but couldn't find the discussion of that patch.

I ask this because I recall Peter once brought up a discussion:

https://lkml.org/lkml/2015/8/26/596

Peter's conclusion seems to be that we could (though didn't want to) live
with futex atomics not being full barriers.


Peter, just to be clear, I'm not in favor of relaxing futex atomics. But if
I make PPC_ATOMIC_ENTRY_BARRIER "sync", it will also strengthen
the futex atomics; I just wonder whether such strengthening is a -fix- or
not, considering that I want this patch to go to the -stable tree.


Of course, while waiting for your answer, I will try to figure this out
by myself ;-)

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-20  7:15     ` Boqun Feng
@ 2015-10-20  9:21       ` Peter Zijlstra
  2015-10-20 21:28         ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2015-10-20  9:21 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > 
> > Am I missing something here?  If not, it seems to me that you need
> > the leading lwsync to instead be a sync.
> > 
> > Of course, if I am not missing something, then this applies also to the
> > value-returning RMW atomic operations that you pulled this pattern from.
> > If so, it would seem that I didn't think through all the possibilities
> > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > that I worried about the RMW atomic operation acting as a barrier,
> > but not as the load/store itself.  :-/
> > 
> 
> Paul, I know this may be difficult, but could you recall why the
> __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> involved into the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> 
> I did some search, but couldn't find the discussion of that patch.
> 
> I ask this because I recall Peter once bought up a discussion:
> 
> https://lkml.org/lkml/2015/8/26/596
> 
> Peter's conclusion seems to be that we could(though didn't want to) live
> with futex atomics not being full barriers.
> 
> 
> Peter, just be clear, I'm not in favor of relaxing futex atomics. But if
> I make PPC_ATOMIC_ENTRY_BARRIER being "sync", it will also strengthen
> the futex atomics, just wonder whether such strengthen is a -fix- or
> not, considering that I want this patch to go to -stable tree.

So Linus argued that since we only need to order against user accesses
(true) and priv changes typically imply strong barriers (open), we might
want to allow archs to rely on those instead of mandating they have
explicit barriers in the futex primitives.

And I indeed forgot to follow up on that discussion.

So; does PPC imply full barriers on user<->kernel boundaries? If so, it's
not critical to the futex atomic implementations what extra barriers are
added.

If not; then strengthening the futex ops is indeed (probably) a good
thing :-)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-20  9:21       ` Peter Zijlstra
@ 2015-10-20 21:28         ` Paul E. McKenney
  2015-10-21  8:18           ` Peter Zijlstra
  2015-10-21  8:45           ` Boqun Feng
  0 siblings, 2 replies; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-20 21:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Tue, Oct 20, 2015 at 11:21:47AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > 
> > > Am I missing something here?  If not, it seems to me that you need
> > > the leading lwsync to instead be a sync.
> > > 
> > > Of course, if I am not missing something, then this applies also to the
> > > value-returning RMW atomic operations that you pulled this pattern from.
> > > If so, it would seem that I didn't think through all the possibilities
> > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > that I worried about the RMW atomic operation acting as a barrier,
> > > but not as the load/store itself.  :-/
> > > 
> > 
> > Paul, I know this may be difficult, but could you recall why the
> > __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> > involved into the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> > 
> > I did some search, but couldn't find the discussion of that patch.
> > 
> > I ask this because I recall Peter once bought up a discussion:
> > 
> > https://lkml.org/lkml/2015/8/26/596
> > 
> > Peter's conclusion seems to be that we could(though didn't want to) live
> > with futex atomics not being full barriers.

I have heard of user-level applications relying on unlock-lock being a
full barrier.  So paranoia would argue for the full barrier.

> > Peter, just be clear, I'm not in favor of relaxing futex atomics. But if
> > I make PPC_ATOMIC_ENTRY_BARRIER being "sync", it will also strengthen
> > the futex atomics, just wonder whether such strengthen is a -fix- or
> > not, considering that I want this patch to go to -stable tree.
> 
> So Linus' argued that since we only need to order against user accesses
> (true) and priv changes typically imply strong barriers (open) we might
> want to allow archs to rely on those instead of mandating they have
> explicit barriers in the futex primitives.
> 
> And I indeed forgot to follow up on that discussion.
> 
> So; does PPC imply full barriers on user<->kernel boundaries? If so, its
> not critical to the futex atomic implementations what extra barriers are
> added.
> 
> If not; then strengthening the futex ops is indeed (probably) a good
> thing :-)

I am not seeing a sync there, but I really have to defer to the
maintainers on this one.  I could easily have missed one.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-20 21:28         ` Paul E. McKenney
@ 2015-10-21  8:18           ` Peter Zijlstra
  2015-10-21 19:36             ` Paul E. McKenney
  2015-10-26  3:20             ` Paul Mackerras
  2015-10-21  8:45           ` Boqun Feng
  1 sibling, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2015-10-21  8:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> I am not seeing a sync there, but I really have to defer to the
> maintainers on this one.  I could easily have missed one.

So x86 implies a full barrier for everything that changes the CPL; and
some form of implied ordering seems a must if you change the privilege
level unless you tag every single load/store with the priv level at that
time, which seems the more expensive option.

So I suspect the typical implementation will flush all load/stores,
change the effective priv level and continue.

This can of course be implemented as a pure per-CPU ordering (RCpc),
which would be in line with the rest of Power, in which case you do
indeed need an explicit sync to make it visible to other CPUs.

But yes, if Michael or Ben could clarify this it would be good.

Back then I talked to Ralf about what MIPS says on this, and MIPS arch
spec is entirely quiet on this, it allows implementations full freedom
IIRC.

</ramble>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-20 21:28         ` Paul E. McKenney
  2015-10-21  8:18           ` Peter Zijlstra
@ 2015-10-21  8:45           ` Boqun Feng
  2015-10-21 19:35             ` Paul E. McKenney
  1 sibling, 1 reply; 43+ messages in thread
From: Boqun Feng @ 2015-10-21  8:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

[-- Attachment #1: Type: text/plain, Size: 2988 bytes --]

On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 20, 2015 at 11:21:47AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> > > > 
> > > > Of course, if I am not missing something, then this applies also to the
> > > > value-returning RMW atomic operations that you pulled this pattern from.
> > > > If so, it would seem that I didn't think through all the possibilities
> > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > but not as the load/store itself.  :-/
> > > > 
> > > 
> > > Paul, I know this may be difficult, but could you recall why the
> > > __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> > > involved into the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> > > 
> > > I did some search, but couldn't find the discussion of that patch.
> > > 
> > > I ask this because I recall Peter once bought up a discussion:
> > > 
> > > https://lkml.org/lkml/2015/8/26/596
> > > 
> > > Peter's conclusion seems to be that we could(though didn't want to) live
> > > with futex atomics not being full barriers.
> 
> I have heard of user-level applications relying on unlock-lock being a
> full barrier.  So paranoia would argue for the full barrier.
> 

Understood.

So a full barrier on one side of these operations is enough, I think.
IOW, there is no need to strengthen these operations.

> > > Peter, just be clear, I'm not in favor of relaxing futex atomics. But if
> > > I make PPC_ATOMIC_ENTRY_BARRIER being "sync", it will also strengthen
> > > the futex atomics, just wonder whether such strengthen is a -fix- or
> > > not, considering that I want this patch to go to -stable tree.
> > 
> > So Linus' argued that since we only need to order against user accesses
> > (true) and priv changes typically imply strong barriers (open) we might
> > want to allow archs to rely on those instead of mandating they have
> > explicit barriers in the futex primitives.
> > 
> > And I indeed forgot to follow up on that discussion.
> > 
> > So; does PPC imply full barriers on user<->kernel boundaries? If so, its
> > not critical to the futex atomic implementations what extra barriers are
> > added.
> > 
> > If not; then strengthening the futex ops is indeed (probably) a good
> > thing :-)

Peter, that's probably a good thing, but I'm not that familiar with
futex right now, so I won't touch that part in this series unless it's
necessary.

Regards,
Boqun

> 
> I am not seeing a sync there, but I really have to defer to the
> maintainers on this one.  I could easily have missed one.
> 
> 							Thanx, Paul
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-21  8:45           ` Boqun Feng
@ 2015-10-21 19:35             ` Paul E. McKenney
  2015-10-21 19:48               ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-21 19:35 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Wed, Oct 21, 2015 at 04:45:03PM +0800, Boqun Feng wrote:
> On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 20, 2015 at 11:21:47AM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> > > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > Am I missing something here?  If not, it seems to me that you need
> > > > > the leading lwsync to instead be a sync.
> > > > > 
> > > > > Of course, if I am not missing something, then this applies also to the
> > > > > value-returning RMW atomic operations that you pulled this pattern from.
> > > > > If so, it would seem that I didn't think through all the possibilities
> > > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > > but not as the load/store itself.  :-/
> > > > > 
> > > > 
> > > > Paul, I know this may be difficult, but could you recall why the
> > > > __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> > > > involved into the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> > > > 
> > > > I did some search, but couldn't find the discussion of that patch.
> > > > 
> > > > I ask this because I recall Peter once bought up a discussion:
> > > > 
> > > > https://lkml.org/lkml/2015/8/26/596
> > > > 
> > > > Peter's conclusion seems to be that we could(though didn't want to) live
> > > > with futex atomics not being full barriers.
> > 
> > I have heard of user-level applications relying on unlock-lock being a
> > full barrier.  So paranoia would argue for the full barrier.
> 
> Understood.
> 
> So a full barrier on one side of these operations is enough, I think.
> IOW, there is no need to strengthen these operations.

Do we need to also worry about other futex use cases?

							Thanx, Paul

> > > > Peter, just be clear, I'm not in favor of relaxing futex atomics. But if
> > > > I make PPC_ATOMIC_ENTRY_BARRIER being "sync", it will also strengthen
> > > > the futex atomics, just wonder whether such strengthen is a -fix- or
> > > > not, considering that I want this patch to go to -stable tree.
> > > 
> > > So Linus' argued that since we only need to order against user accesses
> > > (true) and priv changes typically imply strong barriers (open) we might
> > > want to allow archs to rely on those instead of mandating they have
> > > explicit barriers in the futex primitives.
> > > 
> > > And I indeed forgot to follow up on that discussion.
> > > 
> > > So; does PPC imply full barriers on user<->kernel boundaries? If so, its
> > > not critical to the futex atomic implementations what extra barriers are
> > > added.
> > > 
> > > If not; then strengthening the futex ops is indeed (probably) a good
> > > thing :-)
> 
> Peter, that's probably a good thing, but I'm not that familiar with
> futex right now, so I won't touch that part if unnecessary in this
> series.
> 
> Regards,
> Boqun
> 
> > 
> > I am not seeing a sync there, but I really have to defer to the
> > maintainers on this one.  I could easily have missed one.
> > 
> > 							Thanx, Paul
> > 



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-21  8:18           ` Peter Zijlstra
@ 2015-10-21 19:36             ` Paul E. McKenney
  2015-10-26  2:06               ` Boqun Feng
  2015-10-26  2:20               ` Michael Ellerman
  2015-10-26  3:20             ` Paul Mackerras
  1 sibling, 2 replies; 43+ messages in thread
From: Paul E. McKenney @ 2015-10-21 19:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > I am not seeing a sync there, but I really have to defer to the
> > maintainers on this one.  I could easily have missed one.
> 
> So x86 implies a full barrier for everything that changes the CPL; and
> some form of implied ordering seems a must if you change the privilege
> level unless you tag every single load/store with the priv level at that
> time, which seems the more expensive option.

And it is entirely possible that there is some similar operation
somewhere in the powerpc entry/exit code.  I would not trust myself
to recognize it, though.

> So I suspect the typical implementation will flush all load/stores,
> change the effective priv level and continue.
> 
> This can of course be implemented at a pure per CPU ordering (RCpc),
> which would be in line with the rest of Power, in which case you do
> indeed need an explicit sync to make it visible to other CPUs.
> 
> But yes, if Michael or Ben could clarify this it would be good.
> 
> Back then I talked to Ralf about what MIPS says on this, and MIPS arch
> spec is entirely quiet on this, it allows implementations full freedom
> IIRC.

:-) ;-) ;-)

> </ramble>

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-21 19:35             ` Paul E. McKenney
@ 2015-10-21 19:48               ` Peter Zijlstra
  2015-10-22 12:07                 ` Boqun Feng
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2015-10-21 19:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > 
> > > > > https://lkml.org/lkml/2015/8/26/596

> > So a full barrier on one side of these operations is enough, I think.
> > IOW, there is no need to strengthen these operations.
> 
> Do we need to also worry about other futex use cases?

Worry, always!

But yes, there is one more specific use case, which is that of a
condition variable.

When we go sleep on a futex, we might want to assume visibility of the
stores done by the thread that woke us by the time we wake up.
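A minimal userspace sketch of that use case (illustrative only; "flag"
and "data" are made-up names, and the accesses are deliberately relaxed
so that any ordering has to come from the futex calls themselves):

	#include <assert.h>
	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int flag;	/* the futex word */
	static int data;	/* payload the waiter expects to observe */

	static void waker(void)
	{
		data = 42;					/* store the payload */
		__atomic_store_n(&flag, 1, __ATOMIC_RELAXED);	/* no barrier on purpose */
		syscall(SYS_futex, &flag, FUTEX_WAKE, 1, NULL, NULL, 0);
	}

	static void waiter(void)
	{
		while (__atomic_load_n(&flag, __ATOMIC_RELAXED) == 0)
			syscall(SYS_futex, &flag, FUTEX_WAIT, 0, NULL, NULL, 0);
		/* Interesting case: we actually slept and were woken by
		 * FUTEX_WAKE.  May we then assume the payload is visible? */
		assert(data == 42);
	}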



And.. aside from the thoughts I outlined in the email referenced above,
there is always the chance people accidentally rely on the strong
ordering on their x86 CPU and find things come apart when ran on their
ARM/MIPS/etc..

There are a fair number of people who use the raw futex call and we have
0 visibility into many of them. The assumed and accidental ordering
guarantees will forever remain a mystery.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-21 19:48               ` Peter Zijlstra
@ 2015-10-22 12:07                 ` Boqun Feng
  2015-10-24 10:26                   ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Boqun Feng @ 2015-10-22 12:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

[-- Attachment #1: Type: text/plain, Size: 2627 bytes --]

On Wed, Oct 21, 2015 at 09:48:25PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > > 
> > > > > > https://lkml.org/lkml/2015/8/26/596
> 
> > > So a full barrier on one side of these operations is enough, I think.
> > > IOW, there is no need to strengthen these operations.
> > 
> > Do we need to also worry about other futex use cases?
> 
> Worry, always!
> 
> But yes, there is one more specific usecase, which is that of a
> condition variable.
> 
> When we go sleep on a futex, we might want to assume visibility of the
> stores done by the thread that woke us by the time we wake up.
> 

But the thing is that futex atomics on PPC are already RELEASE(pc)+ACQUIRE
and imply a full barrier; are RELEASE(sc) semantics really needed
here?

Furthermore, is this condition variable visibility guaranteed by some other
part of futex? Because in futex_wake_op():

	futex_wake_op()
	  ...
	  double_unlock_hb(hb1, hb2);  <- RELEASE(pc) barrier here.
	  wake_up_q(&wake_q);

and in futex_wait():

	futex_wait()
	  ...
	  futex_wait_queue_me(hb, &q, to); <- schedule() here
	  ...
	  unqueue_me(&q)
	    drop_futex_key_refs(&q->key);
	    	iput()/mmdrop(); <- a full barrier
	  

The RELEASE(pc) barrier pairs with the full barrier, so the
userspace wakee can observe the condition variable modification.

> 
> 
> And.. aside from the thoughts I outlined in the email referenced above,
> there is always the chance people accidentally rely on the strong
> ordering on their x86 CPU and find things come apart when ran on their
> ARM/MIPS/etc..
> 
> There are a fair number of people who use the raw futex call and we have
> 0 visibility into many of them. The assumed and accidental ordering
> guarantees will forever remain a mystery.
> 

Understood. That's truly a potential problem. Considering that not all
architectures imply a full barrier at user<->kernel boundaries, maybe we
can use one bit in the opcode of the futex system call to indicate
whether userspace treats futex as fully ordered. Like:

#define FUTEX_ORDER_SEQ_CST  0
#define FUTEX_ORDER_RELAXED  64 (bit 7 and bit 8 are already used)

Therefore all existing code will run with a strongly ordered version of
futex() (of course, we need to check and modify kernel code first to
guarantee that), and if userspace code uses FUTEX_ORDER_RELAXED, it must
not rely on futex() for the strong ordering, and should add memory
barriers itself if necessary.
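To illustrate (purely hypothetical: FUTEX_ORDER_RELAXED exists only in
this proposal, not in the real ABI), a caller that provides its own
barriers could opt out of the implied full ordering like this:

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define FUTEX_ORDER_RELAXED	64	/* hypothetical bit, this proposal only */

	static void wake_relaxed(int *uaddr)
	{
		/* userspace promises to supply its own memory barriers */
		syscall(SYS_futex, uaddr, FUTEX_WAKE | FUTEX_ORDER_RELAXED, 1,
			NULL, NULL, 0);
	}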

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-22 12:07                 ` Boqun Feng
@ 2015-10-24 10:26                   ` Peter Zijlstra
  2015-10-24 11:53                     ` Boqun Feng
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2015-10-24 10:26 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Thu, Oct 22, 2015 at 08:07:16PM +0800, Boqun Feng wrote:
> On Wed, Oct 21, 2015 at 09:48:25PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > > > 
> > > > > > > https://lkml.org/lkml/2015/8/26/596
> > 
> > > > So a full barrier on one side of these operations is enough, I think.
> > > > IOW, there is no need to strengthen these operations.
> > > 
> > > Do we need to also worry about other futex use cases?
> > 
> > Worry, always!
> > 
> > But yes, there is one more specific usecase, which is that of a
> > condition variable.
> > 
> > When we go sleep on a futex, we might want to assume visibility of the
> > stores done by the thread that woke us by the time we wake up.
> > 
> 
> But the thing is futex atomics in PPC are already RELEASE(pc)+ACQUIRE
> and imply a full barrier, is an RELEASE(sc) semantics really needed
> here?

For this, no, the current code should be fine I think.

> Further more, is this condition variable visibility guaranteed by other
> part of futex? Because in futex_wake_op:
> 
> 	futex_wake_op()
> 	  ...
> 	  double_unlock_hb(hb1, hb2);  <- RELEASE(pc) barrier here.
> 	  wake_up_q(&wake_q);
> 
> and in futex_wait():
> 
> 	futex_wait()
> 	  ...
> 	  futex_wait_queue_me(hb, &q, to); <- schedule() here
> 	  ...
> 	  unqueue_me(&q)
> 	    drop_futex_key_refs(&q->key);
> 	    	iput()/mmdrop(); <- a full barrier
> 	  
> 
> The RELEASE(pc) barrier pairs with the full barrier, therefore the
> userspace wakee can observe the condition variable modification.

Right, futexes are a pain; and I think we all agreed we didn't want to
go rely on implementation details unless we absolutely _have_ to.

> > And.. aside from the thoughts I outlined in the email referenced above,
> > there is always the chance people accidentally rely on the strong
> > ordering on their x86 CPU and find things come apart when ran on their
> > ARM/MIPS/etc..
> > 
> > There are a fair number of people who use the raw futex call and we have
> > 0 visibility into many of them. The assumed and accidental ordering
> > guarantees will forever remain a mystery.
> > 
> 
> Understood. That's truely a potential problem. Considering not all the
> architectures imply a full barrier at user<->kernel boundries, maybe we
> can use one bit in the opcode of the futex system call to indicate
> whether userspace treats futex as fully ordered. Like:
> 
> #define FUTEX_ORDER_SEQ_CST  0
> #define FUTEX_ORDER_RELAXED  64 (bit 7 and bit 8 are already used)

Not unless there's an actual performance problem with any of this.
Futexes are painful enough as is.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-24 10:26                   ` Peter Zijlstra
@ 2015-10-24 11:53                     ` Boqun Feng
  2015-10-25 13:14                       ` Boqun Feng
  0 siblings, 1 reply; 43+ messages in thread
From: Boqun Feng @ 2015-10-24 11:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

[-- Attachment #1: Type: text/plain, Size: 3726 bytes --]

On Sat, Oct 24, 2015 at 12:26:27PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2015 at 08:07:16PM +0800, Boqun Feng wrote:
> > On Wed, Oct 21, 2015 at 09:48:25PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > > > > 
> > > > > > > > https://lkml.org/lkml/2015/8/26/596
> > > 
> > > > > So a full barrier on one side of these operations is enough, I think.
> > > > > IOW, there is no need to strengthen these operations.
> > > > 
> > > > Do we need to also worry about other futex use cases?
> > > 
> > > Worry, always!
> > > 
> > > But yes, there is one more specific usecase, which is that of a
> > > condition variable.
> > > 
> > > When we go sleep on a futex, we might want to assume visibility of the
> > > stores done by the thread that woke us by the time we wake up.
> > > 
> > 
> > But the thing is futex atomics in PPC are already RELEASE(pc)+ACQUIRE
> > and imply a full barrier, is an RELEASE(sc) semantics really needed
> > here?
> 
> For this, no, the current code should be fine I think.
> 
> > Further more, is this condition variable visibility guaranteed by other
> > part of futex? Because in futex_wake_op:
> > 
> > 	futex_wake_op()
> > 	  ...
> > 	  double_unlock_hb(hb1, hb2);  <- RELEASE(pc) barrier here.
> > 	  wake_up_q(&wake_q);
> > 
> > and in futex_wait():
> > 
> > 	futex_wait()
> > 	  ...
> > 	  futex_wait_queue_me(hb, &q, to); <- schedule() here
> > 	  ...
> > 	  unqueue_me(&q)
> > 	    drop_futex_key_refs(&q->key);
> > 	    	iput()/mmdrop(); <- a full barrier
> > 	  
> > 
> > The RELEASE(pc) barrier pairs with the full barrier, therefore the
> > userspace wakee can observe the condition variable modification.
> 
> Right, futexes are a pain; and I think we all agreed we didn't want to
> go rely on implementation details unless we absolutely _have_ to.
> 

Agreed.

Besides, after reading why futex_wake_op() (the caller of
futex_atomic_op_inuser()) was introduced, I think your worries are quite
reasonable. I thought futex_atomic_op_inuser() only operated on
futex-related variables, but it turns out it can actually operate on any
userspace variable if userspace code likes, so we don't have
control over all the memory ordering guarantees of that variable. So if PPC
doesn't provide a full barrier at user<->kernel boundaries, we should
make futex_atomic_op_inuser() fully ordered.
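For example (a rough userspace sketch, not taken from this thread), the
uaddr2 argument of FUTEX_WAKE_OP is exactly such an arbitrary user word
that the kernel modifies on the caller's behalf:

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static void wake_op_example(int *futex_word, int *other_word)
	{
		/*
		 * The kernel atomically does: old = *other_word; *other_word = 1;
		 * it wakes one waiter on futex_word and, if old == 0, also one
		 * waiter on other_word.  other_word can be any user word, so the
		 * kernel cannot rely on anything else ordering accesses to it.
		 */
		syscall(SYS_futex, futex_word, FUTEX_WAKE_OP, 1,
			(void *)1 /* nr_wake2 rides in the timeout slot */,
			other_word, FUTEX_OP(FUTEX_OP_SET, 1, FUTEX_OP_CMP_EQ, 0));
	}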


Still looking into futex_atomic_cmpxchg_inatomic() ...

> > > And.. aside from the thoughts I outlined in the email referenced above,
> > > there is always the chance people accidentally rely on the strong
> > > ordering on their x86 CPU and find things come apart when ran on their
> > > ARM/MIPS/etc..
> > > 
> > > There are a fair number of people who use the raw futex call and we have
> > > 0 visibility into many of them. The assumed and accidental ordering
> > > guarantees will forever remain a mystery.
> > > 
> > 
> > Understood. That's truely a potential problem. Considering not all the
> > architectures imply a full barrier at user<->kernel boundries, maybe we
> > can use one bit in the opcode of the futex system call to indicate
> > whether userspace treats futex as fully ordered. Like:
> > 
> > #define FUTEX_ORDER_SEQ_CST  0
> > #define FUTEX_ORDER_RELAXED  64 (bit 7 and bit 8 are already used)
> 
> Not unless there's an actual performance problem with any of this.
> Futexes are painful enough as is.

Makes sense, and we still have options like modifying the userspace code
if there is actually a performance problem ;-)

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-24 11:53                     ` Boqun Feng
@ 2015-10-25 13:14                       ` Boqun Feng
  0 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-25 13:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

[-- Attachment #1: Type: text/plain, Size: 1392 bytes --]

On Sat, Oct 24, 2015 at 07:53:56PM +0800, Boqun Feng wrote:
> On Sat, Oct 24, 2015 at 12:26:27PM +0200, Peter Zijlstra wrote:
> > 
> > Right, futexes are a pain; and I think we all agreed we didn't want to
> > go rely on implementation details unless we absolutely _have_ to.
> > 
> 
> Agreed.
> 
> Besides, after reading why futex_wake_op() (the caller of
> futex_atomic_op_inuser()) was introduced, I think your worries are quite
> reasonable. I thought futex_atomic_op_inuser() only operated on
> futex-related variables, but it turns out it can actually operate on any
> userspace variable if userspace code likes, so we don't have
> control over all the memory ordering guarantees of that variable. So if PPC
> doesn't provide a full barrier at user<->kernel boundaries, we should
> make futex_atomic_op_inuser() fully ordered.
> 
> 
> Still looking into futex_atomic_cmpxchg_inatomic() ...
> 

I thought that the futex-related variables (the userspace variables that the
first parameter of the futex system call points to) are only accessed via
the futex system call in userspace, but it turns out that is not the case. So
the memory ordering guarantees of these variables are also out of the
control of the kernel. Therefore we should make
futex_atomic_cmpxchg_inatomic() fully ordered as well, of course, only if PPC
doesn't provide a full barrier at user<->kernel boundaries.

Regards,
Boqun


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-21 19:36             ` Paul E. McKenney
@ 2015-10-26  2:06               ` Boqun Feng
  2015-10-26  2:20               ` Michael Ellerman
  1 sibling, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-26  2:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

[-- Attachment #1: Type: text/plain, Size: 1529 bytes --]

On Wed, Oct 21, 2015 at 12:36:38PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > > I am not seeing a sync there, but I really have to defer to the
> > > maintainers on this one.  I could easily have missed one.
> > 
> > So x86 implies a full barrier for everything that changes the CPL; and
> > some form of implied ordering seems a must if you change the privilege
> > level unless you tag every single load/store with the priv level at that
> > time, which seems the more expensive option.
> 
> And it is entirely possible that there is some similar operation
> somewhere in the powerpc entry/exit code.  I would not trust myself
> to recognize it, though.
> 
> > So I suspect the typical implementation will flush all load/stores,
> > change the effective priv level and continue.
> > 
> > This can of course be implemented at a pure per CPU ordering (RCpc),
> > which would be in line with the rest of Power, in which case you do
> > indeed need an explicit sync to make it visible to other CPUs.
> > 
> > But yes, if Michael or Ben could clarify this it would be good.
> > 

Michael and Ben, ping for this, thank you ;-)

Regards,
Boqun

> > Back then I talked to Ralf about what MIPS says on this, and MIPS arch
> > spec is entirely quiet on this, it allows implementations full freedom
> > IIRC.
> 
> :-) ;-) ;-)
> 
> > </ramble>
> 
> 							Thanx, Paul
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-21 19:36             ` Paul E. McKenney
  2015-10-26  2:06               ` Boqun Feng
@ 2015-10-26  2:20               ` Michael Ellerman
  2015-10-26  8:55                 ` Boqun Feng
  1 sibling, 1 reply; 43+ messages in thread
From: Michael Ellerman @ 2015-10-26  2:20 UTC (permalink / raw)
  To: paulmck, Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Thomas Gleixner,
	Will Deacon, Waiman Long, Davidlohr Bueso, stable

On Wed, 2015-10-21 at 12:36 -0700, Paul E. McKenney wrote:

> On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:

> > On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:

> > > I am not seeing a sync there, but I really have to defer to the
> > > maintainers on this one.  I could easily have missed one.
> > 
> > So x86 implies a full barrier for everything that changes the CPL; and
> > some form of implied ordering seems a must if you change the privilege
> > level unless you tag every single load/store with the priv level at that
> > time, which seems the more expensive option.
> 
> And it is entirely possible that there is some similar operation
> somewhere in the powerpc entry/exit code.  I would not trust myself
> to recognize it, though.

> > So I suspect the typical implementation will flush all load/stores,
> > change the effective priv level and continue.
> > 
> > This can of course be implemented at a pure per CPU ordering (RCpc),
> > which would be in line with the rest of Power, in which case you do
> > indeed need an explicit sync to make it visible to other CPUs.
> > 
> > But yes, if Michael or Ben could clarify this it would be good.
> 
> :-) ;-) ;-)

Sorry guys, these threads are so long I tend not to read them very actively :}

Looking at the system call path, the straight line path does not include any
barriers. I can't see any hidden in macros either.

We also have an explicit sync in the switch_to() path, which suggests that we
know the system call path is not a full barrier.

Also, looking at the architecture, section 1.5, which talks about the
synchronisation that occurs on system calls, defines nothing in terms of
memory ordering, and includes a programming note which says "Unlike the
Synchronize instruction, a context synchronizing operation does not affect the
order in which storage accesses are performed.".

Whether that's actually how it's implemented I don't know, I'll see if I can
find out.

cheers


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-21  8:18           ` Peter Zijlstra
  2015-10-21 19:36             ` Paul E. McKenney
@ 2015-10-26  3:20             ` Paul Mackerras
  2015-10-26  8:58               ` Boqun Feng
  1 sibling, 1 reply; 43+ messages in thread
From: Paul Mackerras @ 2015-10-26  3:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Boqun Feng, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > I am not seeing a sync there, but I really have to defer to the
> > maintainers on this one.  I could easily have missed one.
> 
> So x86 implies a full barrier for everything that changes the CPL; and
> some form of implied ordering seems a must if you change the privilege
> level unless you tag every single load/store with the priv level at that
> time, which seems the more expensive option.
> 
> So I suspect the typical implementation will flush all load/stores,
> change the effective priv level and continue.
> 
> This can of course be implemented at a pure per CPU ordering (RCpc),
> which would be in line with the rest of Power, in which case you do
> indeed need an explicit sync to make it visible to other CPUs.

Right - interrupts and returns from interrupt are context
synchronizing operations, which means they wait until all outstanding
instructions have got to the point where they have reported any
exceptions they're going to report, which means in turn that loads and
stores have completed address translation.  But all of that doesn't
imply anything about the visibility of the loads and stores.

There is a full barrier in the context switch path, but not in the
system call entry/exit path.

Paul.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-26  2:20               ` Michael Ellerman
@ 2015-10-26  8:55                 ` Boqun Feng
  0 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-26  8:55 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: paulmck, Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Benjamin Herrenschmidt, Paul Mackerras, Thomas Gleixner,
	Will Deacon, Waiman Long, Davidlohr Bueso, stable

On Mon, Oct 26, 2015 at 11:20:01AM +0900, Michael Ellerman wrote:
> 
> Sorry guys, these threads are so long I tend not to read them very actively :}
> 
> Looking at the system call path, the straight line path does not include any
> barriers. I can't see any hidden in macros either.
> 
> We also have an explicit sync in the switch_to() path, which suggests that we
> know a system call is not a full barrier.
> 
> Also, looking at the architecture, section 1.5, which talks about the
> synchronisation that occurs on system calls, defines nothing in terms of
> memory ordering, and includes a programming note which says "Unlike the
> Synchronize instruction, a context synchronizing operation does not affect the
> order in which storage accesses are performed.".
> 

Thank you, Michael. So IIUC, "sc" and "rfid" only imply an execution
barrier like "isync" rather than a memory barrier, so explicit memory
barriers are needed if a system call needs a memory ordering guarantee.
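
As a hedged userspace sketch of that conclusion (the function and flag
names here are made up; getpid() just stands in for "some system call"):
if stores on either side of a system call must be observed in order by
other CPUs, the caller has to supply the barriers itself, e.g. with C11
atomic_thread_fence(), whose seq_cst form typically compiles to hwsync
("sync") on powerpc:

#include <stdatomic.h>
#include <unistd.h>

static _Atomic int before_flag;
static _Atomic int after_flag;

void stores_ordered_around_syscall(void)
{
	atomic_store_explicit(&before_flag, 1, memory_order_relaxed);

	/* Explicit full fence; the system call itself provides none. */
	atomic_thread_fence(memory_order_seq_cst);

	getpid();		/* any system call */

	atomic_thread_fence(memory_order_seq_cst);

	atomic_store_explicit(&after_flag, 1, memory_order_relaxed);
}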

Regards,
Boqun

> Whether that's actually how it's implemented I don't know, I'll see if I can
> find out.
> 
> cheers
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier
  2015-10-26  3:20             ` Paul Mackerras
@ 2015-10-26  8:58               ` Boqun Feng
  0 siblings, 0 replies; 43+ messages in thread
From: Boqun Feng @ 2015-10-26  8:58 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, Paul E. McKenney, linux-kernel, linuxppc-dev,
	Ingo Molnar, Benjamin Herrenschmidt, Michael Ellerman,
	Thomas Gleixner, Will Deacon, Waiman Long, Davidlohr Bueso,
	stable

On Mon, Oct 26, 2015 at 02:20:21PM +1100, Paul Mackerras wrote:
> On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > > I am not seeing a sync there, but I really have to defer to the
> > > maintainers on this one.  I could easily have missed one.
> > 
> > So x86 implies a full barrier for everything that changes the CPL; and
> > some form of implied ordering seems a must if you change the privilege
> > level unless you tag every single load/store with the priv level at that
> > time, which seems the more expensive option.
> > 
> > So I suspect the typical implementation will flush all load/stores,
> > change the effective priv level and continue.
> > 
> > This can of course be implemented at a pure per CPU ordering (RCpc),
> > which would be in line with the rest of Power, in which case you do
> > indeed need an explicit sync to make it visible to other CPUs.
> 
> Right - interrupts and returns from interrupt are context
> synchronizing operations, which means they wait until all outstanding
> instructions have got to the point where they have reported any
> exceptions they're going to report, which means in turn that loads and
> stores have completed address translation.  But all of that doesn't
> imply anything about the visibility of the loads and stores.
> 
> There is a full barrier in the context switch path, but not in the
> system call entry/exit path.
> 

Thank you, Paul. That's much clearer now ;-)

Regards,
Boqun

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2015-10-26  8:58 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-14 15:55 [PATCH tip/locking/core v4 0/6] atomics: powerpc: Implement relaxed/acquire/release variants of some atomics Boqun Feng
2015-10-14 15:55 ` [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier Boqun Feng
2015-10-14 20:19   ` Paul E. McKenney
2015-10-14 21:04     ` Peter Zijlstra
2015-10-14 21:44       ` Paul E. McKenney
2015-10-15  0:53         ` Boqun Feng
2015-10-15  1:22           ` Boqun Feng
2015-10-15  3:07             ` Paul E. McKenney
2015-10-15  3:07           ` Paul E. McKenney
2015-10-15  4:48             ` Boqun Feng
2015-10-15 16:30               ` Paul E. McKenney
2015-10-19  0:19                 ` Boqun Feng
2015-10-15  3:11           ` Boqun Feng
2015-10-15  3:33             ` Paul E. McKenney
2015-10-15 10:35         ` Will Deacon
2015-10-15 14:40           ` Boqun Feng
2015-10-15 14:50           ` Will Deacon
2015-10-15 16:29             ` Paul E. McKenney
2015-10-15 15:42           ` Paul E. McKenney
2015-10-15 14:49     ` Boqun Feng
2015-10-15 16:16       ` Paul E. McKenney
2015-10-20  7:15     ` Boqun Feng
2015-10-20  9:21       ` Peter Zijlstra
2015-10-20 21:28         ` Paul E. McKenney
2015-10-21  8:18           ` Peter Zijlstra
2015-10-21 19:36             ` Paul E. McKenney
2015-10-26  2:06               ` Boqun Feng
2015-10-26  2:20               ` Michael Ellerman
2015-10-26  8:55                 ` Boqun Feng
2015-10-26  3:20             ` Paul Mackerras
2015-10-26  8:58               ` Boqun Feng
2015-10-21  8:45           ` Boqun Feng
2015-10-21 19:35             ` Paul E. McKenney
2015-10-21 19:48               ` Peter Zijlstra
2015-10-22 12:07                 ` Boqun Feng
2015-10-24 10:26                   ` Peter Zijlstra
2015-10-24 11:53                     ` Boqun Feng
2015-10-25 13:14                       ` Boqun Feng
2015-10-14 15:55 ` [PATCH tip/locking/core v4 2/6] atomics: Add test for atomic operations with _relaxed variants Boqun Feng
2015-10-14 15:55 ` [PATCH tip/locking/core v4 3/6] atomics: Allow architectures to define their own __atomic_op_* helpers Boqun Feng
2015-10-14 15:55 ` [PATCH tip/locking/core v4 4/6] powerpc: atomic: Implement atomic{,64}_*_return_* variants Boqun Feng
2015-10-14 15:56 ` [PATCH tip/locking/core v4 5/6] powerpc: atomic: Implement xchg_* and atomic{,64}_xchg_* variants Boqun Feng
2015-10-14 15:56 ` [PATCH tip/locking/core v4 6/6] powerpc: atomic: Implement cmpxchg{,64}_* and atomic{,64}_cmpxchg_* variants Boqun Feng

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).