linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] locking/qspinlock, x86: Improve determinism for x86
@ 2018-10-03 13:02 Peter Zijlstra
  2018-10-03 13:02 ` [PATCH v2 1/4] locking/qspinlock: Re-order code Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Peter Zijlstra @ 2018-10-03 13:02 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, bigeasy, Peter Zijlstra

Back when Will did his qspinlock determinism patches, we were left with one
cmpxchg loop on x86 due to the use of atomic_fetch_or(). Will proposed a nifty
trick:

  http://lkml.kernel.org/r/20180409145409.GA9661@arm.com

While that didn't quite work, this series implements that basic idea.

Changes since v1:

 - Adjusted comments in #2, wildea01
 - 'Simplified' GEN_*_RMWcc, peterz
 - Rewrote _the_ patch to use BTS, wildea01




* [PATCH v2 1/4] locking/qspinlock: Re-order code
  2018-10-03 13:02 [PATCH v2 0/4] locking/qspinlock, x86: Improve determinism for x86 Peter Zijlstra
@ 2018-10-03 13:02 ` Peter Zijlstra
  2018-10-03 13:02 ` [PATCH v2 2/4] locking/qspinlock: Rework some comments Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2018-10-03 13:02 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, bigeasy, Peter Zijlstra

Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL)
such that we lose the indent. This also results in a more natural code
flow, IMO.

Cc: mingo@kernel.org
Cc: tglx@linutronix.de
Cc: longman@redhat.com
Cc: andrea.parri@amarulasolutions.com
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/qspinlock.c |   56 +++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 29 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -330,39 +330,37 @@ void queued_spin_lock_slowpath(struct qs
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
 	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
-	if (!(val & ~_Q_LOCKED_MASK)) {
-		/*
-		 * We're pending, wait for the owner to go away.
-		 *
-		 * *,1,1 -> *,1,0
-		 *
-		 * this wait loop must be a load-acquire such that we match the
-		 * store-release that clears the locked bit and create lock
-		 * sequentiality; this is because not all
-		 * clear_pending_set_locked() implementations imply full
-		 * barriers.
-		 */
-		if (val & _Q_LOCKED_MASK) {
-			atomic_cond_read_acquire(&lock->val,
-						 !(VAL & _Q_LOCKED_MASK));
-		}
-
-		/*
-		 * take ownership and clear the pending bit.
-		 *
-		 * *,1,0 -> *,0,1
-		 */
-		clear_pending_set_locked(lock);
-		qstat_inc(qstat_lock_pending, true);
-		return;
+	/*
+	 * If we observe any contention; undo and queue.
+	 */
+	if (unlikely(val & ~_Q_LOCKED_MASK)) {
+		if (!(val & _Q_PENDING_MASK))
+			clear_pending(lock);
+		goto queue;
 	}
 
 	/*
-	 * If pending was clear but there are waiters in the queue, then
-	 * we need to undo our setting of pending before we queue ourselves.
+	 * We're pending, wait for the owner to go away.
+	 *
+	 * 0,1,1 -> 0,1,0
+	 *
+	 * this wait loop must be a load-acquire such that we match the
+	 * store-release that clears the locked bit and create lock
+	 * sequentiality; this is because not all
+	 * clear_pending_set_locked() implementations imply full
+	 * barriers.
+	 */
+	if (val & _Q_LOCKED_MASK)
+		atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
+
+	/*
+	 * take ownership and clear the pending bit.
+	 *
+	 * 0,1,0 -> 0,0,1
 	 */
-	if (!(val & _Q_PENDING_MASK))
-		clear_pending(lock);
+	clear_pending_set_locked(lock);
+	qstat_inc(qstat_lock_pending, true);
+	return;
 
 	/*
 	 * End of pending bit optimistic spinning and beginning of MCS




* [PATCH v2 2/4] locking/qspinlock: Rework some comments
  2018-10-03 13:02 [PATCH v2 0/4] locking/qspinlock, x86: Improve determinism for x86 Peter Zijlstra
  2018-10-03 13:02 ` [PATCH v2 1/4] locking/qspinlock: Re-order code Peter Zijlstra
@ 2018-10-03 13:02 ` Peter Zijlstra
  2018-10-10 16:13   ` Will Deacon
  2018-10-03 13:03 ` [PATCH v2 3/4] x86/asm: Simplify GEN_*_RMWcc() macros Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2018-10-03 13:02 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, bigeasy, Peter Zijlstra

While working my way through the code again, I felt the comments could
use some help.

Cc: mingo@kernel.org
Cc: will.deacon@arm.com
Cc: tglx@linutronix.de
Cc: longman@redhat.com
Cc: andrea.parri@amarulasolutions.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/qspinlock.c |   38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -326,16 +326,23 @@ void queued_spin_lock_slowpath(struct qs
 	/*
 	 * trylock || pending
 	 *
-	 * 0,0,0 -> 0,0,1 ; trylock
-	 * 0,0,1 -> 0,1,1 ; pending
+	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
 	 */
 	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+
 	/*
-	 * If we observe any contention; undo and queue.
+	 * If we observe contention, there is a concurrent locker.
+	 *
+	 * Undo and queue; our setting of PENDING might have made the
+	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
+	 * on @next to become !NULL.
 	 */
 	if (unlikely(val & ~_Q_LOCKED_MASK)) {
+
+		/* Undo PENDING if we set it. */
 		if (!(val & _Q_PENDING_MASK))
 			clear_pending(lock);
+
 		goto queue;
 	}
 
@@ -474,16 +481,25 @@ void queued_spin_lock_slowpath(struct qs
 	 */
 
 	/*
-	 * In the PV case we might already have _Q_LOCKED_VAL set.
+	 * In the PV case we might already have _Q_LOCKED_VAL set, because
+	 * of lock stealing; therefore we must also allow:
 	 *
-	 * The atomic_cond_read_acquire() call above has provided the
-	 * necessary acquire semantics required for locking.
-	 */
-	if (((val & _Q_TAIL_MASK) == tail) &&
-	    atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
-		goto release; /* No contention */
+	 * n,0,1 -> 0,0,1
+	 *
+	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
+	 *       above wait condition, therefore any concurrent setting of
+	 *       PENDING will make the uncontended transition fail.
+	 */
+	if ((val & _Q_TAIL_MASK) == tail) {
+		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+			goto release; /* No contention */
+	}
 
-	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
+	/*
+	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
+	 * which will then detect the remaining tail and queue behind us
+	 * ensuring we'll see a @next.
+	 */
 	set_locked(lock);
 
 	/*




* [PATCH v2 3/4] x86/asm: Simplify GEN_*_RMWcc() macros
  2018-10-03 13:02 [PATCH v2 0/4] locking/qspinlock, x86: Improve determinism for x86 Peter Zijlstra
  2018-10-03 13:02 ` [PATCH v2 1/4] locking/qspinlock: Re-order code Peter Zijlstra
  2018-10-03 13:02 ` [PATCH v2 2/4] locking/qspinlock: Rework some comments Peter Zijlstra
@ 2018-10-03 13:03 ` Peter Zijlstra
  2018-10-04  9:18   ` Peter Zijlstra
  2018-10-16 16:05   ` [tip:locking/core] x86/asm: 'Simplify' " tip-bot for Peter Zijlstra
  2018-10-03 13:03 ` [PATCH v2 4/4] locking/qspinlock, x86: Provide liveness guarantee Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2018-10-03 13:03 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, bigeasy,
	Peter Zijlstra, hpa, JBeulich, bp

Currently the GEN_*_RMWcc() macros include a return statement, which
pretty much mandates we directly wrap them in an (inline) function.

Macros with return statements are tricky and, as per the above, limit
use, so remove the return statement and make them
statement-expressions. This allows them to be used more widely.

Also, shuffle the arguments a bit. Place the @cc argument 3rd; this
makes it consistent between UNARY and BINARY, but more importantly, it
makes the @arg0 argument last.

Since the @arg0 argument is now last, we can do CPP trickery and make
it an optional argument, simplifying the users; 17 out of 18
occurrences do not need this argument.

Finally, change to asm symbolic names, instead of the numeric ordering
of operands, which allows us to get rid of __BINARY_RMWcc_ARG and get
cleaner code overall.
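
For readers unfamiliar with the pattern, below is a minimal standalone
sketch (GNU C; COUNT_ARGS, CONCAT and FOO are made-up illustrative names,
not the kernel's) of the argument-counting dispatch that the new
RMWcc_ARGS()/RMWcc_CONCAT() helpers in this patch use to make @arg0
optional:

#include <stdio.h>

/* Counts up to 3 arguments; mirrors RMWcc_ARGS(), just smaller. */
#define __COUNT_ARGS(_0, _1, _2, _3, _n, X...) _n
#define COUNT_ARGS(X...) __COUNT_ARGS(, ##X, 3, 2, 1, 0)

#define __CONCAT_(a, b) a ## b
#define CONCAT(a, b) __CONCAT_(a, b)

#define FOO_2(a, b)    printf("two args: %d %d\n", a, b)
#define FOO_3(a, b, c) printf("three args: %d %d %d\n", a, b, c)

/* FOO() expands to FOO_2() or FOO_3() based on its argument count. */
#define FOO(X...) CONCAT(FOO_, COUNT_ARGS(X))(X)

int main(void)
{
	FOO(1, 2);	/* -> FOO_2(1, 2)    */
	FOO(1, 2, 3);	/* -> FOO_3(1, 2, 3) */
	return 0;
}

The same shape lets GEN_UNARY_RMWcc() resolve to GEN_UNARY_RMWcc_3() or
GEN_UNARY_RMWcc_4() below, depending on whether @arg0 was supplied.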

Cc: hpa@linux.intel.com
Cc: JBeulich@suse.com
Cc: bp@alien8.de
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/atomic.h      |    8 ++--
 arch/x86/include/asm/atomic64_64.h |    8 ++--
 arch/x86/include/asm/bitops.h      |    9 +----
 arch/x86/include/asm/local.h       |    8 ++--
 arch/x86/include/asm/preempt.h     |    2 -
 arch/x86/include/asm/refcount.h    |   18 +++++-----
 arch/x86/include/asm/rmwcc.h       |   66 +++++++++++++++++++++----------------
 7 files changed, 64 insertions(+), 55 deletions(-)

--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -82,7 +82,7 @@ static __always_inline void arch_atomic_
  */
 static __always_inline bool arch_atomic_sub_and_test(int i, atomic_t *v)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX "subl", v->counter, "er", i, "%0", e);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX "subl", v->counter, e, "er", i);
 }
 #define arch_atomic_sub_and_test arch_atomic_sub_and_test
 
@@ -122,7 +122,7 @@ static __always_inline void arch_atomic_
  */
 static __always_inline bool arch_atomic_dec_and_test(atomic_t *v)
 {
-	GEN_UNARY_RMWcc(LOCK_PREFIX "decl", v->counter, "%0", e);
+	return GEN_UNARY_RMWcc(LOCK_PREFIX "decl", v->counter, e);
 }
 #define arch_atomic_dec_and_test arch_atomic_dec_and_test
 
@@ -136,7 +136,7 @@ static __always_inline bool arch_atomic_
  */
 static __always_inline bool arch_atomic_inc_and_test(atomic_t *v)
 {
-	GEN_UNARY_RMWcc(LOCK_PREFIX "incl", v->counter, "%0", e);
+	return GEN_UNARY_RMWcc(LOCK_PREFIX "incl", v->counter, e);
 }
 #define arch_atomic_inc_and_test arch_atomic_inc_and_test
 
@@ -151,7 +151,7 @@ static __always_inline bool arch_atomic_
  */
 static __always_inline bool arch_atomic_add_negative(int i, atomic_t *v)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, "er", i, "%0", s);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, s, "er", i);
 }
 #define arch_atomic_add_negative arch_atomic_add_negative
 
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -73,7 +73,7 @@ static inline void arch_atomic64_sub(lon
  */
 static inline bool arch_atomic64_sub_and_test(long i, atomic64_t *v)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX "subq", v->counter, "er", i, "%0", e);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX "subq", v->counter, e, "er", i);
 }
 #define arch_atomic64_sub_and_test arch_atomic64_sub_and_test
 
@@ -115,7 +115,7 @@ static __always_inline void arch_atomic6
  */
 static inline bool arch_atomic64_dec_and_test(atomic64_t *v)
 {
-	GEN_UNARY_RMWcc(LOCK_PREFIX "decq", v->counter, "%0", e);
+	return GEN_UNARY_RMWcc(LOCK_PREFIX "decq", v->counter, e);
 }
 #define arch_atomic64_dec_and_test arch_atomic64_dec_and_test
 
@@ -129,7 +129,7 @@ static inline bool arch_atomic64_dec_and
  */
 static inline bool arch_atomic64_inc_and_test(atomic64_t *v)
 {
-	GEN_UNARY_RMWcc(LOCK_PREFIX "incq", v->counter, "%0", e);
+	return GEN_UNARY_RMWcc(LOCK_PREFIX "incq", v->counter, e);
 }
 #define arch_atomic64_inc_and_test arch_atomic64_inc_and_test
 
@@ -144,7 +144,7 @@ static inline bool arch_atomic64_inc_and
  */
 static inline bool arch_atomic64_add_negative(long i, atomic64_t *v)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX "addq", v->counter, "er", i, "%0", s);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX "addq", v->counter, s, "er", i);
 }
 #define arch_atomic64_add_negative arch_atomic64_add_negative
 
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -217,8 +217,7 @@ static __always_inline void change_bit(l
  */
 static __always_inline bool test_and_set_bit(long nr, volatile unsigned long *addr)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts),
-	                 *addr, "Ir", nr, "%0", c);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts), *addr, c, "Ir", nr);
 }
 
 /**
@@ -264,8 +263,7 @@ static __always_inline bool __test_and_s
  */
 static __always_inline bool test_and_clear_bit(long nr, volatile unsigned long *addr)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btr),
-	                 *addr, "Ir", nr, "%0", c);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btr), *addr, c, "Ir", nr);
 }
 
 /**
@@ -318,8 +316,7 @@ static __always_inline bool __test_and_c
  */
 static __always_inline bool test_and_change_bit(long nr, volatile unsigned long *addr)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btc),
-	                 *addr, "Ir", nr, "%0", c);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btc), *addr, c, "Ir", nr);
 }
 
 static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -53,7 +53,7 @@ static inline void local_sub(long i, loc
  */
 static inline bool local_sub_and_test(long i, local_t *l)
 {
-	GEN_BINARY_RMWcc(_ASM_SUB, l->a.counter, "er", i, "%0", e);
+	return GEN_BINARY_RMWcc(_ASM_SUB, l->a.counter, e, "er", i);
 }
 
 /**
@@ -66,7 +66,7 @@ static inline bool local_sub_and_test(lo
  */
 static inline bool local_dec_and_test(local_t *l)
 {
-	GEN_UNARY_RMWcc(_ASM_DEC, l->a.counter, "%0", e);
+	return GEN_UNARY_RMWcc(_ASM_DEC, l->a.counter, e);
 }
 
 /**
@@ -79,7 +79,7 @@ static inline bool local_dec_and_test(lo
  */
 static inline bool local_inc_and_test(local_t *l)
 {
-	GEN_UNARY_RMWcc(_ASM_INC, l->a.counter, "%0", e);
+	return GEN_UNARY_RMWcc(_ASM_INC, l->a.counter, e);
 }
 
 /**
@@ -93,7 +93,7 @@ static inline bool local_inc_and_test(lo
  */
 static inline bool local_add_negative(long i, local_t *l)
 {
-	GEN_BINARY_RMWcc(_ASM_ADD, l->a.counter, "er", i, "%0", s);
+	return GEN_BINARY_RMWcc(_ASM_ADD, l->a.counter, s, "er", i);
 }
 
 /**
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -88,7 +88,7 @@ static __always_inline void __preempt_co
  */
 static __always_inline bool __preempt_count_dec_and_test(void)
 {
-	GEN_UNARY_RMWcc("decl", __preempt_count, __percpu_arg(0), e);
+	return GEN_UNARY_RMWcc("decl", __preempt_count, e, __percpu_arg([var]));
 }
 
 /*
--- a/arch/x86/include/asm/refcount.h
+++ b/arch/x86/include/asm/refcount.h
@@ -17,7 +17,7 @@
  */
 #define _REFCOUNT_EXCEPTION				\
 	".pushsection .text..refcount\n"		\
-	"111:\tlea %[counter], %%" _ASM_CX "\n"		\
+	"111:\tlea %[var], %%" _ASM_CX "\n"		\
 	"112:\t" ASM_UD2 "\n"				\
 	ASM_UNREACHABLE					\
 	".popsection\n"					\
@@ -43,7 +43,7 @@ static __always_inline void refcount_add
 {
 	asm volatile(LOCK_PREFIX "addl %1,%0\n\t"
 		REFCOUNT_CHECK_LT_ZERO
-		: [counter] "+m" (r->refs.counter)
+		: [var] "+m" (r->refs.counter)
 		: "ir" (i)
 		: "cc", "cx");
 }
@@ -52,7 +52,7 @@ static __always_inline void refcount_inc
 {
 	asm volatile(LOCK_PREFIX "incl %0\n\t"
 		REFCOUNT_CHECK_LT_ZERO
-		: [counter] "+m" (r->refs.counter)
+		: [var] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
 
@@ -60,21 +60,21 @@ static __always_inline void refcount_dec
 {
 	asm volatile(LOCK_PREFIX "decl %0\n\t"
 		REFCOUNT_CHECK_LE_ZERO
-		: [counter] "+m" (r->refs.counter)
+		: [var] "+m" (r->refs.counter)
 		: : "cc", "cx");
 }
 
 static __always_inline __must_check
 bool refcount_sub_and_test(unsigned int i, refcount_t *r)
 {
-	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl", REFCOUNT_CHECK_LT_ZERO,
-				  r->refs.counter, "er", i, "%0", e, "cx");
+	return GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
+		REFCOUNT_CHECK_LT_ZERO, r->refs.counter, e, "er", i, "cx");
 }
 
 static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
 {
-	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl", REFCOUNT_CHECK_LT_ZERO,
-				 r->refs.counter, "%0", e, "cx");
+	return GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl",
+		REFCOUNT_CHECK_LT_ZERO, r->refs.counter, e, "cx");
 }
 
 static __always_inline __must_check
@@ -92,7 +92,7 @@ bool refcount_add_not_zero(unsigned int
 		/* Did we try to increment from/to an undesirable state? */
 		if (unlikely(c < 0 || c == INT_MAX || result < c)) {
 			asm volatile(REFCOUNT_ERROR
-				     : : [counter] "m" (r->refs.counter)
+				     : : [var] "m" (r->refs.counter)
 				     : "cc", "cx");
 			break;
 		}
--- a/arch/x86/include/asm/rmwcc.h
+++ b/arch/x86/include/asm/rmwcc.h
@@ -2,56 +2,68 @@
 #ifndef _ASM_X86_RMWcc
 #define _ASM_X86_RMWcc
 
+/* This counts to 12. Any more, it will return 13th argument. */
+#define __RMWcc_ARGS(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _n, X...) _n
+#define RMWcc_ARGS(X...) __RMWcc_ARGS(, ##X, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
+
+#define __RMWcc_CONCAT(a, b) a ## b
+#define RMWcc_CONCAT(a, b) __RMWcc_CONCAT(a, b)
+
 #define __CLOBBERS_MEM(clb...)	"memory", ## clb
 
 #if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CC_HAVE_ASM_GOTO)
 
 /* Use asm goto */
 
-#define __GEN_RMWcc(fullop, var, cc, clobbers, ...)			\
-do {									\
+#define __GEN_RMWcc(fullop, _var, cc, clobbers, ...)			\
+({									\
+	__label__ cc_label;						\
 	asm_volatile_goto (fullop "; j" #cc " %l[cc_label]"		\
-			: : [counter] "m" (var), ## __VA_ARGS__		\
+			: : [var] "m" (_var), ## __VA_ARGS__		\
 			: clobbers : cc_label);				\
-	return 0;							\
+	0;								\
 cc_label:								\
-	return 1;							\
-} while (0)
-
-#define __BINARY_RMWcc_ARG	" %1, "
-
+	1;								\
+})
 
 #else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
 
 /* Use flags output or a set instruction */
 
-#define __GEN_RMWcc(fullop, var, cc, clobbers, ...)			\
-do {									\
+#define __GEN_RMWcc(fullop, _var, cc, clobbers, ...)			\
+({									\
 	bool c;								\
 	asm volatile (fullop CC_SET(cc)					\
-			: [counter] "+m" (var), CC_OUT(cc) (c)		\
+			: [var] "+m" (_var), CC_OUT(cc) (c)		\
 			: __VA_ARGS__ : clobbers);			\
-	return c;							\
-} while (0)
-
-#define __BINARY_RMWcc_ARG	" %2, "
+	c;								\
+})
 
 #endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
 
-#define GEN_UNARY_RMWcc(op, var, arg0, cc)				\
+#define GEN_UNARY_RMWcc_4(op, var, cc, arg0)				\
 	__GEN_RMWcc(op " " arg0, var, cc, __CLOBBERS_MEM())
 
-#define GEN_UNARY_SUFFIXED_RMWcc(op, suffix, var, arg0, cc, clobbers...)\
-	__GEN_RMWcc(op " " arg0 "\n\t" suffix, var, cc,			\
+#define GEN_UNARY_RMWcc_3(op, var, cc)					\
+	GEN_UNARY_RMWcc_4(op, var, cc, "%[var]")
+
+#define GEN_UNARY_RMWcc(X...) RMWcc_CONCAT(GEN_UNARY_RMWcc_, RMWcc_ARGS(X))(X)
+
+#define GEN_BINARY_RMWcc_6(op, var, cc, vcon, _val, arg0)		\
+	__GEN_RMWcc(op " %[val], " arg0, var, cc,			\
+		    __CLOBBERS_MEM(), [val] vcon (_val))
+
+#define GEN_BINARY_RMWcc_5(op, var, cc, vcon, val)			\
+	GEN_BINARY_RMWcc_6(op, var, cc, vcon, val, "%[var]")
+
+#define GEN_BINARY_RMWcc(X...) RMWcc_CONCAT(GEN_BINARY_RMWcc_, RMWcc_ARGS(X))(X)
+
+#define GEN_UNARY_SUFFIXED_RMWcc(op, suffix, var, cc, clobbers...)	\
+	__GEN_RMWcc(op " %[var]\n\t" suffix, var, cc,			\
 		    __CLOBBERS_MEM(clobbers))
 
-#define GEN_BINARY_RMWcc(op, var, vcon, val, arg0, cc)			\
-	__GEN_RMWcc(op __BINARY_RMWcc_ARG arg0, var, cc,		\
-		    __CLOBBERS_MEM(), vcon (val))
-
-#define GEN_BINARY_SUFFIXED_RMWcc(op, suffix, var, vcon, val, arg0, cc,	\
-				  clobbers...)				\
-	__GEN_RMWcc(op __BINARY_RMWcc_ARG arg0 "\n\t" suffix, var, cc,	\
-		    __CLOBBERS_MEM(clobbers), vcon (val))
+#define GEN_BINARY_SUFFIXED_RMWcc(op, suffix, var, cc, vcon, _val, clobbers...)\
+	__GEN_RMWcc(op " %[val], %[var]\n\t" suffix, var, cc,		\
+		    __CLOBBERS_MEM(clobbers), [val] vcon (_val))
 
 #endif /* _ASM_X86_RMWcc */




* [PATCH v2 4/4] locking/qspinlock, x86: Provide liveness guarantee
  2018-10-03 13:02 [PATCH v2 0/4] locking/qspinlock, x86: Improve determinism for x86 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2018-10-03 13:03 ` [PATCH v2 3/4] x86/asm: Simplify GEN_*_RMWcc() macros Peter Zijlstra
@ 2018-10-03 13:03 ` Peter Zijlstra
  2018-10-10 16:12   ` Will Deacon
  2018-10-16 16:06   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2018-10-16 16:04 ` [tip:locking/core] locking/qspinlock: Re-order code tip-bot for Peter Zijlstra
  2018-10-16 16:04 ` [tip:locking/core] locking/qspinlock: Rework some comments tip-bot for Peter Zijlstra
  5 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2018-10-03 13:03 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, bigeasy, Peter Zijlstra

On x86 we cannot do fetch_or with a single instruction and thus end up
using a cmpxchg loop, which reduces determinism. Replace the fetch_or
with a composite operation: tas-pending + load.

Using two instructions of course opens a window we previously did not
have. Consider the scenario:


	CPU0		CPU1		CPU2

 1)	lock
	  trylock -> (0,0,1)

 2)			lock
			  trylock /* fail */

 3)	unlock -> (0,0,0)

 4)					lock
					  trylock -> (0,0,1)

 5)			  tas-pending -> (0,1,1)
			  load-val <- (0,1,0) from 3

 6)			  clear-pending-set-locked -> (0,0,1)

			  FAIL: _2_ owners

where 5) is our new composite operation. When we consider each part of
the qspinlock state as a separate variable (as we can when
_Q_PENDING_BITS == 8) then the above is entirely possible, because
tas-pending will only RmW the pending byte, so the later load is able
to observe prior tail and lock state (but not earlier than its own
trylock, which operates on the whole word, due to coherence).

To avoid this we need 2 things:

 - the load must come after the tas-pending (obviously, otherwise it
   can trivially observe prior state).

 - the tas-pending must be a full-word RmW; it cannot be an xchg8, for
   example, so that we cannot observe other state prior to setting
   pending.

On x86 we can realize this by using "LOCK BTS m32, r32" for
tas-pending followed by a regular load.

Note that observing later state is not a problem:

 - if we fail to observe a later unlock, we'll simply spin-wait for
   that store to become visible.

 - if we observe a later xchg_tail, there is no difference from that
   xchg_tail having taken place before the tas-pending.
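
Purely as an illustration (this is not the patch's code; the real
implementation in the diff below goes through GEN_BINARY_RMWcc() and
atomic_read(), and all names and constants here are local stand-ins), a
sketch of the instruction sequence the above boils down to, assuming
GNU C inline asm on x86 and _Q_PENDING_BITS == 8:

#include <stdbool.h>
#include <stdint.h>

#define _Q_PENDING_OFFSET	8
#define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)
#define _Q_PENDING_MASK		(0xffU << _Q_PENDING_OFFSET)

static inline uint32_t fetch_set_pending_acquire(volatile uint32_t *lockval)
{
	uint32_t val = 0;
	bool was_pending;

	/* Full-word RmW: atomically set bit 8; its old value lands in CF. */
	asm volatile("lock btsl %[bit], %[var]\n\t"
		     "setc %[old]"
		     : [var] "+m" (*lockval), [old] "=q" (was_pending)
		     : [bit] "Ir" (_Q_PENDING_OFFSET)
		     : "memory", "cc");
	if (was_pending)
		val |= _Q_PENDING_VAL;

	/* The load comes after the locked BTS; it picks up tail and locked. */
	val |= *lockval & ~_Q_PENDING_MASK;

	return val;
}

Because the BTS is a locked full-word RmW and the load is ordered after
it, the load cannot observe tail/locked state older than the pending-bit
update, which is the property argued for above.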

Cc: mingo@kernel.org
Cc: tglx@linutronix.de
Cc: longman@redhat.com
Cc: andrea.parri@amarulasolutions.com
Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/qspinlock.h |   15 +++++++++++++++
 kernel/locking/qspinlock.c       |   16 +++++++++++++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -6,9 +6,24 @@
 #include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
 #include <asm/paravirt.h>
+#include <asm/rmwcc.h>
 
 #define _Q_PENDING_LOOPS	(1 << 9)
 
+#define queued_fetch_set_pending_acquire queued_fetch_set_pending_acquire
+static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
+{
+	u32 val = 0;
+
+	if (GEN_BINARY_RMWcc(LOCK_PREFIX "btsl", lock->val.counter, c,
+			     "I", _Q_PENDING_OFFSET))
+		val |= _Q_PENDING_VAL;
+
+	val |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;
+
+	return val;
+}
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -232,6 +232,20 @@ static __always_inline u32 xchg_tail(str
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
+ * @lock : Pointer to queued spinlock structure
+ * Return: The previous lock value
+ *
+ * *,*,* -> *,1,*
+ */
+#ifndef queued_fetch_set_pending_acquire
+static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
+{
+	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+#endif
+
+/**
  * set_locked - Set the lock bit and own the lock
  * @lock: Pointer to queued spinlock structure
  *
@@ -328,7 +342,7 @@ void queued_spin_lock_slowpath(struct qs
 	 *
 	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
 	 */
-	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	val = queued_fetch_set_pending_acquire(lock);
 
 	/*
 	 * If we observe contention, there is a concurrent locker.




* Re: [PATCH v2 3/4] x86/asm: Simplify GEN_*_RMWcc() macros
  2018-10-03 13:03 ` [PATCH v2 3/4] x86/asm: Simplify GEN_*_RMWcc() macros Peter Zijlstra
@ 2018-10-04  9:18   ` Peter Zijlstra
  2018-10-16 16:05   ` [tip:locking/core] x86/asm: 'Simplify' " tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2018-10-04  9:18 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, bigeasy, hpa, JBeulich, bp

On Wed, Oct 03, 2018 at 03:03:00PM +0200, Peter Zijlstra wrote:
> +#define __GEN_RMWcc(fullop, _var, cc, clobbers, ...)			\
> +({									\
> +	__label__ cc_label;						\
>  	asm_volatile_goto (fullop "; j" #cc " %l[cc_label]"		\
> +			: : [var] "m" (_var), ## __VA_ARGS__		\
>  			: clobbers : cc_label);				\
> +	0;								\
>  cc_label:								\
> +	1;								\
> +})

That's obviously crap...

This one seems to actually compile and generate identical code:

#define __GEN_RMWcc(fullop, _var, cc, clobbers, ...)			\
({									\
	bool c = false;							\
	asm_volatile_goto (fullop "; j" #cc " %l[cc_label]"		\
			: : [var] "m" (_var), ## __VA_ARGS__		\
			: clobbers : cc_label);				\
	if (0) {							\
cc_label:	c = true;						\
	}								\
	c;								\
})


* Re: [PATCH v2 4/4] locking/qspinlock, x86: Provide liveness guarantee
  2018-10-03 13:03 ` [PATCH v2 4/4] locking/qspinlock, x86: Provide liveness guarantee Peter Zijlstra
@ 2018-10-10 16:12   ` Will Deacon
  2018-10-12  9:22     ` Will Deacon
  2018-10-16 16:06   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Will Deacon @ 2018-10-10 16:12 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx, bigeasy

On Wed, Oct 03, 2018 at 03:03:01PM +0200, Peter Zijlstra wrote:
> On x86 we cannot do fetch_or with a single instruction and thus end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a composite operation: tas-pending + load.
> 
> Using two instructions of course opens a window we previously did not
> have. Consider the scenario:
> 
> 
> 	CPU0		CPU1		CPU2
> 
>  1)	lock
> 	  trylock -> (0,0,1)
> 
>  2)			lock
> 			  trylock /* fail */
> 
>  3)	unlock -> (0,0,0)
> 
>  4)					lock
> 					  trylock -> (0,0,1)
> 
>  5)			  tas-pending -> (0,1,1)
> 			  load-val <- (0,1,0) from 3
> 
>  6)			  clear-pending-set-locked -> (0,0,1)
> 
> 			  FAIL: _2_ owners
> 
> where 5) is our new composite operation. When we consider each part of
> the qspinlock state as a separate variable (as we can when
> _Q_PENDING_BITS == 8) then the above is entirely possible, because
> tas-pending will only RmW the pending byte, so the later load is able
> to observe prior tail and lock state (but not earlier than its own
> trylock, which operates on the whole word, due to coherence).
> 
> To avoid this we need 2 things:
> 
>  - the load must come after the tas-pending (obviously, otherwise it
>    can trivially observe prior state).
> 
>  - the tas-pending must be a full word RmW, it cannot be an xchg8 for
>    example, such that we cannot observe other state prior to setting
>    pending.
> 
> On x86 we can realize this by using "LOCK BTS m32, r32" for
> tas-pending followed by a regular load.
> 
> Note that observing later state is not a problem:
> 
>  - if we fail to observe a later unlock, we'll simply spin-wait for
>    that store to become visible.
> 
>  - if we observe a later xchg_tail, there is no difference from that
>    xchg_tail having taken place before the tas-pending.
> 
> Cc: mingo@kernel.org
> Cc: tglx@linutronix.de
> Cc: longman@redhat.com
> Cc: andrea.parri@amarulasolutions.com
> Suggested-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/qspinlock.h |   15 +++++++++++++++
>  kernel/locking/qspinlock.c       |   16 +++++++++++++++-
>  2 files changed, 30 insertions(+), 1 deletion(-)

I've failed to break this by thinking really hard, so I've updated Catalin's
TLA model to see if the tools are still happy. I'll get back to you once
they've finished chewing on it.

Will


* Re: [PATCH v2 2/4] locking/qspinlock: Rework some comments
  2018-10-03 13:02 ` [PATCH v2 2/4] locking/qspinlock: Rework some comments Peter Zijlstra
@ 2018-10-10 16:13   ` Will Deacon
  0 siblings, 0 replies; 13+ messages in thread
From: Will Deacon @ 2018-10-10 16:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx, bigeasy

On Wed, Oct 03, 2018 at 03:02:59PM +0200, Peter Zijlstra wrote:
> While working my way through the code again; I felt the comments could
> use help.
> 
> Cc: mingo@kernel.org
> Cc: will.deacon@arm.com
> Cc: tglx@linutronix.de
> Cc: longman@redhat.com
> Cc: andrea.parri@amarulasolutions.com
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/locking/qspinlock.c |   38 +++++++++++++++++++++++++++-----------
>  1 file changed, 27 insertions(+), 11 deletions(-)

Acked-by: Will Deacon <will.deacon@arm.com>

Will


* Re: [PATCH v2 4/4] locking/qspinlock, x86: Provide liveness guarantee
  2018-10-10 16:12   ` Will Deacon
@ 2018-10-12  9:22     ` Will Deacon
  0 siblings, 0 replies; 13+ messages in thread
From: Will Deacon @ 2018-10-12  9:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx, bigeasy

On Wed, Oct 10, 2018 at 05:12:57PM +0100, Will Deacon wrote:
> On Wed, Oct 03, 2018 at 03:03:01PM +0200, Peter Zijlstra wrote:
> >  arch/x86/include/asm/qspinlock.h |   15 +++++++++++++++
> >  kernel/locking/qspinlock.c       |   16 +++++++++++++++-
> >  2 files changed, 30 insertions(+), 1 deletion(-)
> 
> I've failed to break this by thinking really hard, so I've updated Catalin's
> TLA model to see if the tools are still happy. I'll get back to you once
> they've finished chewing on it.

The model is happy, so this doesn't introduce any "bad" SC behaviours.

Reviewed-by: Will Deacon <will.deacon@arm.com>

Will


* [tip:locking/core] locking/qspinlock: Re-order code
  2018-10-03 13:02 [PATCH v2 0/4] locking/qspinlock, x86: Improve determinism for x86 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2018-10-03 13:03 ` [PATCH v2 4/4] locking/qspinlock, x86: Provide liveness guarantee Peter Zijlstra
@ 2018-10-16 16:04 ` tip-bot for Peter Zijlstra
  2018-10-16 16:04 ` [tip:locking/core] locking/qspinlock: Rework some comments tip-bot for Peter Zijlstra
  5 siblings, 0 replies; 13+ messages in thread
From: tip-bot for Peter Zijlstra @ 2018-10-16 16:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, peterz, torvalds, hpa, tglx, linux-kernel, will.deacon

Commit-ID:  53bf57fab7321fb42b703056a4c80fc9d986d170
Gitweb:     https://git.kernel.org/tip/53bf57fab7321fb42b703056a4c80fc9d986d170
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 26 Sep 2018 13:01:18 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Oct 2018 17:33:53 +0200

locking/qspinlock: Re-order code

Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL)
such that we lose the indent. This also results in a more natural code
flow, IMO.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c | 56 ++++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index bfaeb05123ff..ec343276f975 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -330,39 +330,37 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
 	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
-	if (!(val & ~_Q_LOCKED_MASK)) {
-		/*
-		 * We're pending, wait for the owner to go away.
-		 *
-		 * *,1,1 -> *,1,0
-		 *
-		 * this wait loop must be a load-acquire such that we match the
-		 * store-release that clears the locked bit and create lock
-		 * sequentiality; this is because not all
-		 * clear_pending_set_locked() implementations imply full
-		 * barriers.
-		 */
-		if (val & _Q_LOCKED_MASK) {
-			atomic_cond_read_acquire(&lock->val,
-						 !(VAL & _Q_LOCKED_MASK));
-		}
-
-		/*
-		 * take ownership and clear the pending bit.
-		 *
-		 * *,1,0 -> *,0,1
-		 */
-		clear_pending_set_locked(lock);
-		qstat_inc(qstat_lock_pending, true);
-		return;
+	/*
+	 * If we observe any contention; undo and queue.
+	 */
+	if (unlikely(val & ~_Q_LOCKED_MASK)) {
+		if (!(val & _Q_PENDING_MASK))
+			clear_pending(lock);
+		goto queue;
 	}
 
 	/*
-	 * If pending was clear but there are waiters in the queue, then
-	 * we need to undo our setting of pending before we queue ourselves.
+	 * We're pending, wait for the owner to go away.
+	 *
+	 * 0,1,1 -> 0,1,0
+	 *
+	 * this wait loop must be a load-acquire such that we match the
+	 * store-release that clears the locked bit and create lock
+	 * sequentiality; this is because not all
+	 * clear_pending_set_locked() implementations imply full
+	 * barriers.
+	 */
+	if (val & _Q_LOCKED_MASK)
+		atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
+
+	/*
+	 * take ownership and clear the pending bit.
+	 *
+	 * 0,1,0 -> 0,0,1
 	 */
-	if (!(val & _Q_PENDING_MASK))
-		clear_pending(lock);
+	clear_pending_set_locked(lock);
+	qstat_inc(qstat_lock_pending, true);
+	return;
 
 	/*
 	 * End of pending bit optimistic spinning and beginning of MCS


* [tip:locking/core] locking/qspinlock: Rework some comments
  2018-10-03 13:02 [PATCH v2 0/4] locking/qspinlock, x86: Improve determinism for x86 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2018-10-16 16:04 ` [tip:locking/core] locking/qspinlock: Re-order code tip-bot for Peter Zijlstra
@ 2018-10-16 16:04 ` tip-bot for Peter Zijlstra
  5 siblings, 0 replies; 13+ messages in thread
From: tip-bot for Peter Zijlstra @ 2018-10-16 16:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, hpa, tglx, linux-kernel, torvalds, will.deacon, mingo

Commit-ID:  756b1df4c2c82a1cdffeafa9d2aa76c92e7fb405
Gitweb:     https://git.kernel.org/tip/756b1df4c2c82a1cdffeafa9d2aa76c92e7fb405
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 26 Sep 2018 13:01:19 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Oct 2018 17:33:54 +0200

locking/qspinlock: Rework some comments

While working my way through the code again, I felt the comments could
use some help.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ec343276f975..47cb99787e4d 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -326,16 +326,23 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/*
 	 * trylock || pending
 	 *
-	 * 0,0,0 -> 0,0,1 ; trylock
-	 * 0,0,1 -> 0,1,1 ; pending
+	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
 	 */
 	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+
 	/*
-	 * If we observe any contention; undo and queue.
+	 * If we observe contention, there is a concurrent locker.
+	 *
+	 * Undo and queue; our setting of PENDING might have made the
+	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
+	 * on @next to become !NULL.
 	 */
 	if (unlikely(val & ~_Q_LOCKED_MASK)) {
+
+		/* Undo PENDING if we set it. */
 		if (!(val & _Q_PENDING_MASK))
 			clear_pending(lock);
+
 		goto queue;
 	}
 
@@ -474,16 +481,25 @@ locked:
 	 */
 
 	/*
-	 * In the PV case we might already have _Q_LOCKED_VAL set.
+	 * In the PV case we might already have _Q_LOCKED_VAL set, because
+	 * of lock stealing; therefore we must also allow:
 	 *
-	 * The atomic_cond_read_acquire() call above has provided the
-	 * necessary acquire semantics required for locking.
+	 * n,0,1 -> 0,0,1
+	 *
+	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
+	 *       above wait condition, therefore any concurrent setting of
+	 *       PENDING will make the uncontended transition fail.
 	 */
-	if (((val & _Q_TAIL_MASK) == tail) &&
-	    atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
-		goto release; /* No contention */
+	if ((val & _Q_TAIL_MASK) == tail) {
+		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+			goto release; /* No contention */
+	}
 
-	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
+	/*
+	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
+	 * which will then detect the remaining tail and queue behind us
+	 * ensuring we'll see a @next.
+	 */
 	set_locked(lock);
 
 	/*


* [tip:locking/core] x86/asm: 'Simplify' GEN_*_RMWcc() macros
  2018-10-03 13:03 ` [PATCH v2 3/4] x86/asm: Simplify GEN_*_RMWcc() macros Peter Zijlstra
  2018-10-04  9:18   ` Peter Zijlstra
@ 2018-10-16 16:05   ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: tip-bot for Peter Zijlstra @ 2018-10-16 16:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: brgerst, dvlasenk, torvalds, bp, luto, hpa, tglx, mingo,
	linux-kernel, peterz

Commit-ID:  288e4521f0f6717909933116563e66bb894ae2af
Gitweb:     https://git.kernel.org/tip/288e4521f0f6717909933116563e66bb894ae2af
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 3 Oct 2018 12:34:10 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Oct 2018 17:33:54 +0200

x86/asm: 'Simplify' GEN_*_RMWcc() macros

Currently the GEN_*_RMWcc() macros include a return statement, which
pretty much mandates we directly wrap them in an (inline) function.

Macros with return statements are tricky and, as per the above, limit
use, so remove the return statement and make them
statement-expressions. This allows them to be used more widely.

Also, shuffle the arguments a bit. Place the @cc argument 3rd; this
makes it consistent between UNARY and BINARY, but more importantly, it
makes the @arg0 argument last.

Since the @arg0 argument is now last, we can do CPP trickery and make
it an optional argument, simplifying the users; 17 out of 18
occurrences do not need this argument.

Finally, change to asm symbolic names, instead of the numeric ordering
of operands, which allows us to get rid of __BINARY_RMWcc_ARG and get
cleaner code overall.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: JBeulich@suse.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: hpa@linux.intel.com
Link: https://lkml.kernel.org/r/20181003130957.108960094@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/atomic.h      |  8 ++---
 arch/x86/include/asm/atomic64_64.h |  8 ++---
 arch/x86/include/asm/bitops.h      |  9 ++---
 arch/x86/include/asm/local.h       |  8 ++---
 arch/x86/include/asm/preempt.h     |  2 +-
 arch/x86/include/asm/refcount.h    | 13 +++----
 arch/x86/include/asm/rmwcc.h       | 69 ++++++++++++++++++++++----------------
 7 files changed, 64 insertions(+), 53 deletions(-)

diff --git a/arch/x86/include/asm/atomic.h b/arch/x86/include/asm/atomic.h
index ce84388e540c..ea3d95275b43 100644
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -82,7 +82,7 @@ static __always_inline void arch_atomic_sub(int i, atomic_t *v)
  */
 static __always_inline bool arch_atomic_sub_and_test(int i, atomic_t *v)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX "subl", v->counter, "er", i, "%0", e);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX "subl", v->counter, e, "er", i);
 }
 #define arch_atomic_sub_and_test arch_atomic_sub_and_test
 
@@ -122,7 +122,7 @@ static __always_inline void arch_atomic_dec(atomic_t *v)
  */
 static __always_inline bool arch_atomic_dec_and_test(atomic_t *v)
 {
-	GEN_UNARY_RMWcc(LOCK_PREFIX "decl", v->counter, "%0", e);
+	return GEN_UNARY_RMWcc(LOCK_PREFIX "decl", v->counter, e);
 }
 #define arch_atomic_dec_and_test arch_atomic_dec_and_test
 
@@ -136,7 +136,7 @@ static __always_inline bool arch_atomic_dec_and_test(atomic_t *v)
  */
 static __always_inline bool arch_atomic_inc_and_test(atomic_t *v)
 {
-	GEN_UNARY_RMWcc(LOCK_PREFIX "incl", v->counter, "%0", e);
+	return GEN_UNARY_RMWcc(LOCK_PREFIX "incl", v->counter, e);
 }
 #define arch_atomic_inc_and_test arch_atomic_inc_and_test
 
@@ -151,7 +151,7 @@ static __always_inline bool arch_atomic_inc_and_test(atomic_t *v)
  */
 static __always_inline bool arch_atomic_add_negative(int i, atomic_t *v)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, "er", i, "%0", s);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, s, "er", i);
 }
 #define arch_atomic_add_negative arch_atomic_add_negative
 
diff --git a/arch/x86/include/asm/atomic64_64.h b/arch/x86/include/asm/atomic64_64.h
index 5f851d92eecd..dadc20adba21 100644
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -73,7 +73,7 @@ static inline void arch_atomic64_sub(long i, atomic64_t *v)
  */
 static inline bool arch_atomic64_sub_and_test(long i, atomic64_t *v)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX "subq", v->counter, "er", i, "%0", e);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX "subq", v->counter, e, "er", i);
 }
 #define arch_atomic64_sub_and_test arch_atomic64_sub_and_test
 
@@ -115,7 +115,7 @@ static __always_inline void arch_atomic64_dec(atomic64_t *v)
  */
 static inline bool arch_atomic64_dec_and_test(atomic64_t *v)
 {
-	GEN_UNARY_RMWcc(LOCK_PREFIX "decq", v->counter, "%0", e);
+	return GEN_UNARY_RMWcc(LOCK_PREFIX "decq", v->counter, e);
 }
 #define arch_atomic64_dec_and_test arch_atomic64_dec_and_test
 
@@ -129,7 +129,7 @@ static inline bool arch_atomic64_dec_and_test(atomic64_t *v)
  */
 static inline bool arch_atomic64_inc_and_test(atomic64_t *v)
 {
-	GEN_UNARY_RMWcc(LOCK_PREFIX "incq", v->counter, "%0", e);
+	return GEN_UNARY_RMWcc(LOCK_PREFIX "incq", v->counter, e);
 }
 #define arch_atomic64_inc_and_test arch_atomic64_inc_and_test
 
@@ -144,7 +144,7 @@ static inline bool arch_atomic64_inc_and_test(atomic64_t *v)
  */
 static inline bool arch_atomic64_add_negative(long i, atomic64_t *v)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX "addq", v->counter, "er", i, "%0", s);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX "addq", v->counter, s, "er", i);
 }
 #define arch_atomic64_add_negative arch_atomic64_add_negative
 
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 9f645ba57dbb..124f9195eb3e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -217,8 +217,7 @@ static __always_inline void change_bit(long nr, volatile unsigned long *addr)
  */
 static __always_inline bool test_and_set_bit(long nr, volatile unsigned long *addr)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts),
-	                 *addr, "Ir", nr, "%0", c);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts), *addr, c, "Ir", nr);
 }
 
 /**
@@ -264,8 +263,7 @@ static __always_inline bool __test_and_set_bit(long nr, volatile unsigned long *
  */
 static __always_inline bool test_and_clear_bit(long nr, volatile unsigned long *addr)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btr),
-	                 *addr, "Ir", nr, "%0", c);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btr), *addr, c, "Ir", nr);
 }
 
 /**
@@ -318,8 +316,7 @@ static __always_inline bool __test_and_change_bit(long nr, volatile unsigned lon
  */
 static __always_inline bool test_and_change_bit(long nr, volatile unsigned long *addr)
 {
-	GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btc),
-	                 *addr, "Ir", nr, "%0", c);
+	return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btc), *addr, c, "Ir", nr);
 }
 
 static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
diff --git a/arch/x86/include/asm/local.h b/arch/x86/include/asm/local.h
index c91083c59845..349a47acaa4a 100644
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -53,7 +53,7 @@ static inline void local_sub(long i, local_t *l)
  */
 static inline bool local_sub_and_test(long i, local_t *l)
 {
-	GEN_BINARY_RMWcc(_ASM_SUB, l->a.counter, "er", i, "%0", e);
+	return GEN_BINARY_RMWcc(_ASM_SUB, l->a.counter, e, "er", i);
 }
 
 /**
@@ -66,7 +66,7 @@ static inline bool local_sub_and_test(long i, local_t *l)
  */
 static inline bool local_dec_and_test(local_t *l)
 {
-	GEN_UNARY_RMWcc(_ASM_DEC, l->a.counter, "%0", e);
+	return GEN_UNARY_RMWcc(_ASM_DEC, l->a.counter, e);
 }
 
 /**
@@ -79,7 +79,7 @@ static inline bool local_dec_and_test(local_t *l)
  */
 static inline bool local_inc_and_test(local_t *l)
 {
-	GEN_UNARY_RMWcc(_ASM_INC, l->a.counter, "%0", e);
+	return GEN_UNARY_RMWcc(_ASM_INC, l->a.counter, e);
 }
 
 /**
@@ -93,7 +93,7 @@ static inline bool local_inc_and_test(local_t *l)
  */
 static inline bool local_add_negative(long i, local_t *l)
 {
-	GEN_BINARY_RMWcc(_ASM_ADD, l->a.counter, "er", i, "%0", s);
+	return GEN_BINARY_RMWcc(_ASM_ADD, l->a.counter, s, "er", i);
 }
 
 /**
diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 7f2dbd91fc74..90cb2f36c042 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -88,7 +88,7 @@ static __always_inline void __preempt_count_sub(int val)
  */
 static __always_inline bool __preempt_count_dec_and_test(void)
 {
-	GEN_UNARY_RMWcc("decl", __preempt_count, __percpu_arg(0), e);
+	return GEN_UNARY_RMWcc("decl", __preempt_count, e, __percpu_arg([var]));
 }
 
 /*
diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
index c92909da0686..a8b5e1e13319 100644
--- a/arch/x86/include/asm/refcount.h
+++ b/arch/x86/include/asm/refcount.h
@@ -79,16 +79,17 @@ static __always_inline void refcount_dec(refcount_t *r)
 static __always_inline __must_check
 bool refcount_sub_and_test(unsigned int i, refcount_t *r)
 {
-	GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
-				  "REFCOUNT_CHECK_LT_ZERO counter=\"%0\"",
-				  r->refs.counter, "er", i, "%0", e, "cx");
+
+	return GEN_BINARY_SUFFIXED_RMWcc(LOCK_PREFIX "subl",
+					 "REFCOUNT_CHECK_LT_ZERO counter=\"%[var]\"",
+					 r->refs.counter, e, "er", i, "cx");
 }
 
 static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
 {
-	GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl",
-				 "REFCOUNT_CHECK_LT_ZERO counter=\"%0\"",
-				 r->refs.counter, "%0", e, "cx");
+	return GEN_UNARY_SUFFIXED_RMWcc(LOCK_PREFIX "decl",
+					"REFCOUNT_CHECK_LT_ZERO counter=\"%[var]\"",
+					r->refs.counter, e, "cx");
 }
 
 static __always_inline __must_check
diff --git a/arch/x86/include/asm/rmwcc.h b/arch/x86/include/asm/rmwcc.h
index 4914a3e7c803..46ac84b506f5 100644
--- a/arch/x86/include/asm/rmwcc.h
+++ b/arch/x86/include/asm/rmwcc.h
@@ -2,56 +2,69 @@
 #ifndef _ASM_X86_RMWcc
 #define _ASM_X86_RMWcc
 
+/* This counts to 12. Any more, it will return 13th argument. */
+#define __RMWcc_ARGS(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _n, X...) _n
+#define RMWcc_ARGS(X...) __RMWcc_ARGS(, ##X, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
+
+#define __RMWcc_CONCAT(a, b) a ## b
+#define RMWcc_CONCAT(a, b) __RMWcc_CONCAT(a, b)
+
 #define __CLOBBERS_MEM(clb...)	"memory", ## clb
 
 #if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CC_HAVE_ASM_GOTO)
 
 /* Use asm goto */
 
-#define __GEN_RMWcc(fullop, var, cc, clobbers, ...)			\
-do {									\
+#define __GEN_RMWcc(fullop, _var, cc, clobbers, ...)			\
+({									\
+	bool c = false;							\
 	asm_volatile_goto (fullop "; j" #cc " %l[cc_label]"		\
-			: : [counter] "m" (var), ## __VA_ARGS__		\
+			: : [var] "m" (_var), ## __VA_ARGS__		\
 			: clobbers : cc_label);				\
-	return 0;							\
-cc_label:								\
-	return 1;							\
-} while (0)
-
-#define __BINARY_RMWcc_ARG	" %1, "
-
+	if (0) {							\
+cc_label:	c = true;						\
+	}								\
+	c;								\
+})
 
 #else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
 
 /* Use flags output or a set instruction */
 
-#define __GEN_RMWcc(fullop, var, cc, clobbers, ...)			\
-do {									\
+#define __GEN_RMWcc(fullop, _var, cc, clobbers, ...)			\
+({									\
 	bool c;								\
 	asm volatile (fullop CC_SET(cc)					\
-			: [counter] "+m" (var), CC_OUT(cc) (c)		\
+			: [var] "+m" (_var), CC_OUT(cc) (c)		\
 			: __VA_ARGS__ : clobbers);			\
-	return c;							\
-} while (0)
-
-#define __BINARY_RMWcc_ARG	" %2, "
+	c;								\
+})
 
 #endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
 
-#define GEN_UNARY_RMWcc(op, var, arg0, cc)				\
+#define GEN_UNARY_RMWcc_4(op, var, cc, arg0)				\
 	__GEN_RMWcc(op " " arg0, var, cc, __CLOBBERS_MEM())
 
-#define GEN_UNARY_SUFFIXED_RMWcc(op, suffix, var, arg0, cc, clobbers...)\
-	__GEN_RMWcc(op " " arg0 "\n\t" suffix, var, cc,			\
-		    __CLOBBERS_MEM(clobbers))
+#define GEN_UNARY_RMWcc_3(op, var, cc)					\
+	GEN_UNARY_RMWcc_4(op, var, cc, "%[var]")
 
-#define GEN_BINARY_RMWcc(op, var, vcon, val, arg0, cc)			\
-	__GEN_RMWcc(op __BINARY_RMWcc_ARG arg0, var, cc,		\
-		    __CLOBBERS_MEM(), vcon (val))
+#define GEN_UNARY_RMWcc(X...) RMWcc_CONCAT(GEN_UNARY_RMWcc_, RMWcc_ARGS(X))(X)
+
+#define GEN_BINARY_RMWcc_6(op, var, cc, vcon, _val, arg0)		\
+	__GEN_RMWcc(op " %[val], " arg0, var, cc,			\
+		    __CLOBBERS_MEM(), [val] vcon (_val))
+
+#define GEN_BINARY_RMWcc_5(op, var, cc, vcon, val)			\
+	GEN_BINARY_RMWcc_6(op, var, cc, vcon, val, "%[var]")
+
+#define GEN_BINARY_RMWcc(X...) RMWcc_CONCAT(GEN_BINARY_RMWcc_, RMWcc_ARGS(X))(X)
+
+#define GEN_UNARY_SUFFIXED_RMWcc(op, suffix, var, cc, clobbers...)	\
+	__GEN_RMWcc(op " %[var]\n\t" suffix, var, cc,			\
+		    __CLOBBERS_MEM(clobbers))
 
-#define GEN_BINARY_SUFFIXED_RMWcc(op, suffix, var, vcon, val, arg0, cc,	\
-				  clobbers...)				\
-	__GEN_RMWcc(op __BINARY_RMWcc_ARG arg0 "\n\t" suffix, var, cc,	\
-		    __CLOBBERS_MEM(clobbers), vcon (val))
+#define GEN_BINARY_SUFFIXED_RMWcc(op, suffix, var, cc, vcon, _val, clobbers...)\
+	__GEN_RMWcc(op " %[val], %[var]\n\t" suffix, var, cc,		\
+		    __CLOBBERS_MEM(clobbers), [val] vcon (_val))
 
 #endif /* _ASM_X86_RMWcc */


* [tip:locking/core] locking/qspinlock, x86: Provide liveness guarantee
  2018-10-03 13:03 ` [PATCH v2 4/4] locking/qspinlock, x86: Provide liveness guarantee Peter Zijlstra
  2018-10-10 16:12   ` Will Deacon
@ 2018-10-16 16:06   ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: tip-bot for Peter Zijlstra @ 2018-10-16 16:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: will.deacon, mingo, hpa, linux-kernel, torvalds, tglx, peterz

Commit-ID:  7aa54be2976550f17c11a1c3e3630002dea39303
Gitweb:     https://git.kernel.org/tip/7aa54be2976550f17c11a1c3e3630002dea39303
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 26 Sep 2018 13:01:20 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Oct 2018 17:33:54 +0200

locking/qspinlock, x86: Provide liveness guarantee

On x86 we cannot do fetch_or() with a single instruction and thus end up
using a cmpxchg loop, which reduces determinism. Replace the fetch_or()
with a composite operation: tas-pending + load.

Using two instructions of course opens a window we previously did not
have. Consider the scenario:

	CPU0		CPU1		CPU2

 1)	lock
	  trylock -> (0,0,1)

 2)			lock
			  trylock /* fail */

 3)	unlock -> (0,0,0)

 4)					lock
					  trylock -> (0,0,1)

 5)			  tas-pending -> (0,1,1)
			  load-val <- (0,1,0) from 3

 6)			  clear-pending-set-locked -> (0,0,1)

			  FAIL: _2_ owners

where 5) is our new composite operation. When we consider each part of
the qspinlock state as a separate variable (as we can when
_Q_PENDING_BITS == 8) then the above is entirely possible, because
tas-pending will only RmW the pending byte, so the later load is able
to observe prior tail and lock state (but not earlier than its own
trylock, which operates on the whole word, due to coherence).

To avoid this we need 2 things:

 - the load must come after the tas-pending (obviously, otherwise it
   can trivially observe prior state).

 - the tas-pending must be a full word RmW instruction; it cannot be an XCHGB,
    for example, so that we cannot observe other state prior to setting
    pending.

On x86 we can realize this by using "LOCK BTS m32, r32" for
tas-pending followed by a regular load.

Note that observing later state is not a problem:

 - if we fail to observe a later unlock, we'll simply spin-wait for
   that store to become visible.

 - if we observe a later xchg_tail(), there is no difference from that
   xchg_tail() having taken place before the tas-pending.

Suggested-by: Will Deacon <will.deacon@arm.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/qspinlock.h | 15 +++++++++++++++
 kernel/locking/qspinlock.c       | 16 +++++++++++++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 3e70bed8a978..87623c6b13db 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -6,9 +6,24 @@
 #include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
 #include <asm/paravirt.h>
+#include <asm/rmwcc.h>
 
 #define _Q_PENDING_LOOPS	(1 << 9)
 
+#define queued_fetch_set_pending_acquire queued_fetch_set_pending_acquire
+static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
+{
+	u32 val = 0;
+
+	if (GEN_BINARY_RMWcc(LOCK_PREFIX "btsl", lock->val.counter, c,
+			     "I", _Q_PENDING_OFFSET))
+		val |= _Q_PENDING_VAL;
+
+	val |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;
+
+	return val;
+}
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 47cb99787e4d..341ca666bc60 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -231,6 +231,20 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 }
 #endif /* _Q_PENDING_BITS == 8 */
 
+/**
+ * queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
+ * @lock : Pointer to queued spinlock structure
+ * Return: The previous lock value
+ *
+ * *,*,* -> *,1,*
+ */
+#ifndef queued_fetch_set_pending_acquire
+static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
+{
+	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+#endif
+
 /**
  * set_locked - Set the lock bit and own the lock
  * @lock: Pointer to queued spinlock structure
@@ -328,7 +342,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *
 	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
 	 */
-	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	val = queued_fetch_set_pending_acquire(lock);
 
 	/*
 	 * If we observe contention, there is a concurrent locker.


