linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 00/14] kernel/locking: qspinlock improvements
@ 2018-04-26 10:34 Will Deacon
  2018-04-26 10:34 ` [PATCH v3 01/14] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
                   ` (15 more replies)
  0 siblings, 16 replies; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

Hi all,

This is version three of the qspinlock patches I posted previously:

  v1: https://lkml.org/lkml/2018/4/5/496
  v2: https://lkml.org/lkml/2018/4/11/618

Changes since v2 include:
  * Fixed bisection issues
  * Fixed x86 PV build
  * Added patch proposing me as a co-maintainer
  * Rebased onto -rc2

All feedback welcome,

Will

--->8

Jason Low (1):
  locking/mcs: Use smp_cond_load_acquire() in mcs spin loop

Waiman Long (1):
  locking/qspinlock: Add stat tracking for pending vs slowpath

Will Deacon (12):
  barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
  locking/qspinlock: Merge struct __qspinlock into struct qspinlock
  locking/qspinlock: Bound spinning on pending->locked transition in
    slowpath
  locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound
  locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  locking/qspinlock: Kill cmpxchg loop when claiming lock from head of
    queue
  locking/qspinlock: Use atomic_cond_read_acquire
  locking/qspinlock: Use smp_cond_load_relaxed to wait for next node
  locking/qspinlock: Make queued_spin_unlock use smp_store_release
  locking/qspinlock: Elide back-to-back RELEASE operations with
    smp_wmb()
  locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking
  MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES

 MAINTAINERS                               |   1 +
 arch/x86/include/asm/qspinlock.h          |  21 ++-
 arch/x86/include/asm/qspinlock_paravirt.h |   3 +-
 include/asm-generic/atomic-long.h         |   2 +
 include/asm-generic/barrier.h             |  27 +++-
 include/asm-generic/qspinlock.h           |   2 +-
 include/asm-generic/qspinlock_types.h     |  32 +++-
 include/linux/atomic.h                    |   2 +
 kernel/locking/mcs_spinlock.h             |  10 +-
 kernel/locking/qspinlock.c                | 247 ++++++++++++++----------------
 kernel/locking/qspinlock_paravirt.h       |  44 ++----
 kernel/locking/qspinlock_stat.h           |   9 +-
 12 files changed, 209 insertions(+), 191 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3 01/14] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:36   ` [tip:locking/core] locking/barriers: Introduce smp_cond_load_relaxed() and atomic_cond_read_relaxed() tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 02/14] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

Whilst we currently provide smp_cond_load_acquire and
atomic_cond_read_acquire, there are cases where the ACQUIRE semantics are
not required because of a subsequent fence or release operation once the
conditional loop has exited.

This patch adds relaxed versions of the conditional spinning primitives
to avoid unnecessary barrier overhead on architectures such as arm64.
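
For example (an illustrative sketch, not code from this series; BUSY and
MINE are hypothetical values), a waiter whose subsequent atomic already
provides ACQUIRE ordering does not need it again on the spin:

	/* Spin without imposing any ordering... */
	val = atomic_cond_read_relaxed(&lock->val, VAL != BUSY);

	/*
	 * ...because the ACQUIRE semantics of this RMW (or an explicit
	 * smp_acquire__after_ctrl_dep()) order the critical section.
	 */
	if (atomic_cmpxchg_acquire(&lock->val, val, MINE) == val)
		return;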

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/atomic-long.h |  2 ++
 include/asm-generic/barrier.h     | 27 +++++++++++++++++++++------
 include/linux/atomic.h            |  2 ++
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/atomic-long.h b/include/asm-generic/atomic-long.h
index 34a028a7bcc5..5b2b0b5ea06d 100644
--- a/include/asm-generic/atomic-long.h
+++ b/include/asm-generic/atomic-long.h
@@ -244,6 +244,8 @@ static inline long atomic_long_add_unless(atomic_long_t *l, long a, long u)
 #define atomic_long_inc_not_zero(l) \
 	ATOMIC_LONG_PFX(_inc_not_zero)((ATOMIC_LONG_PFX(_t) *)(l))
 
+#define atomic_long_cond_read_relaxed(v, c) \
+	ATOMIC_LONG_PFX(_cond_read_relaxed)((ATOMIC_LONG_PFX(_t) *)(v), (c))
 #define atomic_long_cond_read_acquire(v, c) \
 	ATOMIC_LONG_PFX(_cond_read_acquire)((ATOMIC_LONG_PFX(_t) *)(v), (c))
 
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 29458bbb2fa0..2cafdbb9ae4c 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -221,18 +221,17 @@ do {									\
 #endif
 
 /**
- * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * smp_cond_load_relaxed() - (Spin) wait for cond with no ordering guarantees
  * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
  *
- * Equivalent to using smp_load_acquire() on the condition variable but employs
- * the control dependency of the wait to reduce the barrier on many platforms.
+ * Equivalent to using READ_ONCE() on the condition variable.
  *
  * Due to C lacking lambda expressions we load the value of *ptr into a
  * pre-named variable @VAL to be used in @cond.
  */
-#ifndef smp_cond_load_acquire
-#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+#ifndef smp_cond_load_relaxed
+#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
 	typeof(ptr) __PTR = (ptr);				\
 	typeof(*ptr) VAL;					\
 	for (;;) {						\
@@ -241,10 +240,26 @@ do {									\
 			break;					\
 		cpu_relax();					\
 	}							\
-	smp_acquire__after_ctrl_dep();				\
 	VAL;							\
 })
 #endif
 
+/**
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ */
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+	typeof(*ptr) _val;					\
+	_val = smp_cond_load_relaxed(ptr, cond_expr);		\
+	smp_acquire__after_ctrl_dep();				\
+	_val;							\
+})
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 8b276fd9a127..01ce3997cb42 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -654,6 +654,7 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 }
 #endif
 
+#define atomic_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
 #define atomic_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
 
 #ifdef CONFIG_GENERIC_ATOMIC64
@@ -1075,6 +1076,7 @@ static inline long long atomic64_fetch_andnot_release(long long i, atomic64_t *v
 }
 #endif
 
+#define atomic64_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
 #define atomic64_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
 
 #include <asm-generic/atomic-long.h>
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 02/14] locking/qspinlock: Merge struct __qspinlock into struct qspinlock
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
  2018-04-26 10:34 ` [PATCH v3 01/14] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:37   ` [tip:locking/core] locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock' tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 03/14] locking/qspinlock: Bound spinning on pending->locked transition in slowpath Will Deacon
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

struct __qspinlock provides a handy union of fields so that
subcomponents of the lockword can be accessed by name, without having to
manage shifts and masks explicitly and take endianness into account.

This is useful in qspinlock.h and also potentially in arch headers, so
merge struct __qspinlock into struct qspinlock and kill the extra
definition.
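
For example, the set_locked() helper (shown in the diff below) goes from
casting to the private type to simply naming the field:

	/* before */
	struct __qspinlock *l = (void *)lock;
	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);

	/* after */
	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);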

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/x86/include/asm/qspinlock.h          |  2 +-
 arch/x86/include/asm/qspinlock_paravirt.h |  3 +-
 include/asm-generic/qspinlock_types.h     | 32 +++++++++++++++++++--
 kernel/locking/qspinlock.c                | 46 ++-----------------------------
 kernel/locking/qspinlock_paravirt.h       | 34 ++++++++---------------
 5 files changed, 46 insertions(+), 71 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 5e16b5d40d32..90b0b0ed8161 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -16,7 +16,7 @@
  */
 static inline void native_queued_spin_unlock(struct qspinlock *lock)
 {
-	smp_store_release((u8 *)lock, 0);
+	smp_store_release(&lock->locked, 0);
 }
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index 923307ea11c7..9ef5ee03d2d7 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -22,8 +22,7 @@ PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
  *
  * void __pv_queued_spin_unlock(struct qspinlock *lock)
  * {
- *	struct __qspinlock *l = (void *)lock;
- *	u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *	u8 lockval = cmpxchg(&lock->locked, _Q_LOCKED_VAL, 0);
  *
  *	if (likely(lockval == _Q_LOCKED_VAL))
  *		return;
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 034acd0c4956..0763f065b975 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -29,13 +29,41 @@
 #endif
 
 typedef struct qspinlock {
-	atomic_t	val;
+	union {
+		atomic_t val;
+
+		/*
+		 * By using the whole 2nd least significant byte for the
+		 * pending bit, we can allow better optimization of the lock
+		 * acquisition for the pending bit holder.
+		 */
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u8	locked;
+			u8	pending;
+		};
+		struct {
+			u16	locked_pending;
+			u16	tail;
+		};
+#else
+		struct {
+			u16	tail;
+			u16	locked_pending;
+		};
+		struct {
+			u8	reserved[2];
+			u8	pending;
+			u8	locked;
+		};
+#endif
+	};
 } arch_spinlock_t;
 
 /*
  * Initializier
  */
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
 
 /*
  * Bitfields in the atomic value:
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d880296245c5..f5b0e59f6d14 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -114,40 +114,6 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
-/*
- * By using the whole 2nd least significant byte for the pending bit, we
- * can allow better optimization of the lock acquisition for the pending
- * bit holder.
- *
- * This internal structure is also used by the set_locked function which
- * is not restricted to _Q_PENDING_BITS == 8.
- */
-struct __qspinlock {
-	union {
-		atomic_t val;
-#ifdef __LITTLE_ENDIAN
-		struct {
-			u8	locked;
-			u8	pending;
-		};
-		struct {
-			u16	locked_pending;
-			u16	tail;
-		};
-#else
-		struct {
-			u16	tail;
-			u16	locked_pending;
-		};
-		struct {
-			u8	reserved[2];
-			u8	pending;
-			u8	locked;
-		};
-#endif
-	};
-};
-
 #if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
@@ -159,9 +125,7 @@ struct __qspinlock {
  */
 static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
+	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
 /*
@@ -176,13 +140,11 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
  */
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
-	struct __qspinlock *l = (void *)lock;
-
 	/*
 	 * Use release semantics to make sure that the MCS node is properly
 	 * initialized before changing the tail code.
 	 */
-	return (u32)xchg_release(&l->tail,
+	return (u32)xchg_release(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -237,9 +199,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
  */
 static __always_inline void set_locked(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
 }
 
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 6ee477765e6c..2711940429f5 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -87,8 +87,6 @@ struct pv_node {
 #define queued_spin_trylock(l)	pv_hybrid_queued_unfair_trylock(l)
 static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
 	/*
 	 * Stay in unfair lock mode as long as queued mode waiters are
 	 * present in the MCS wait queue but the pending bit isn't set.
@@ -97,7 +95,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 		int val = atomic_read(&lock->val);
 
 		if (!(val & _Q_LOCKED_PENDING_MASK) &&
-		   (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+		   (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
 			qstat_inc(qstat_pv_lock_stealing, true);
 			return true;
 		}
@@ -117,16 +115,12 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 #if _Q_PENDING_BITS == 8
 static __always_inline void set_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->pending, 1);
+	WRITE_ONCE(lock->pending, 1);
 }
 
 static __always_inline void clear_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->pending, 0);
+	WRITE_ONCE(lock->pending, 0);
 }
 
 /*
@@ -136,10 +130,8 @@ static __always_inline void clear_pending(struct qspinlock *lock)
  */
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	return !READ_ONCE(l->locked) &&
-	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
+	return !READ_ONCE(lock->locked) &&
+	       (cmpxchg_acquire(&lock->locked_pending, _Q_PENDING_VAL,
 				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
 #else /* _Q_PENDING_BITS == 8 */
@@ -384,7 +376,6 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
-	struct __qspinlock *l = (void *)lock;
 
 	/*
 	 * If the vCPU is indeed halted, advance its state to match that of
@@ -413,7 +404,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 * the hash table later on at unlock time, no atomic instruction is
 	 * needed.
 	 */
-	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+	WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
 	(void)pv_hash(lock, pn);
 }
 
@@ -428,7 +419,6 @@ static u32
 pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
-	struct __qspinlock *l = (void *)lock;
 	struct qspinlock **lp = NULL;
 	int waitcnt = 0;
 	int loop;
@@ -479,13 +469,13 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
+			if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
 				/*
 				 * The lock was free and now we own the lock.
 				 * Change the lock value back to _Q_LOCKED_VAL
 				 * and unhash the table.
 				 */
-				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+				WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
 				goto gotlock;
 			}
@@ -493,7 +483,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 		WRITE_ONCE(pn->state, vcpu_hashed);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
-		pv_wait(&l->locked, _Q_SLOW_VAL);
+		pv_wait(&lock->locked, _Q_SLOW_VAL);
 
 		/*
 		 * Because of lock stealing, the queue head vCPU may not be
@@ -518,7 +508,6 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 __visible void
 __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 {
-	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node;
 
 	if (unlikely(locked != _Q_SLOW_VAL)) {
@@ -547,7 +536,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * Now that we have a reference to the (likely) blocked pv_node,
 	 * release the lock.
 	 */
-	smp_store_release(&l->locked, 0);
+	smp_store_release(&lock->locked, 0);
 
 	/*
 	 * At this point the memory pointed at by lock can be freed/reused,
@@ -573,7 +562,6 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 #ifndef __pv_queued_spin_unlock
 __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
 	u8 locked;
 
 	/*
@@ -581,7 +569,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 * unhash. Otherwise it would be possible to have multiple @lock
 	 * entries, which would be BAD.
 	 */
-	locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0);
+	locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);
 	if (likely(locked == _Q_LOCKED_VAL))
 		return;
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 03/14] locking/qspinlock: Bound spinning on pending->locked transition in slowpath
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
  2018-04-26 10:34 ` [PATCH v3 01/14] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
  2018-04-26 10:34 ` [PATCH v3 02/14] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:37   ` [tip:locking/core] " tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 04/14] locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound Will Deacon
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

If a locker taking the qspinlock slowpath reads a lock value indicating
that only the pending bit is set, then it will spin whilst the
concurrent pending->locked transition takes effect.

Unfortunately, there is no guarantee that such a transition will ever be
observed since concurrent lockers could continuously set pending and
hand over the lock amongst themselves, leading to starvation. Whilst
this would probably resolve itself in practice, it means that it is not
possible to prove liveness properties about the lock, and that lock
acquisition time is unbounded.

Rather than removing the pending->locked spinning from the slowpath
altogether (which has been shown to heavily penalise a 2-threaded
locking stress test on x86), this patch replaces the explicit spinning
with a call to atomic_cond_read_relaxed and allows the architecture to
provide a bound on the number of spins. For architectures that can
respond to changes in cacheline state in their smp_cond_load implementation,
it should be sufficient to use the default bound of 1.
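
For reference, the bounded wait introduced below behaves roughly like this
open-coded loop (a simplified sketch of what the atomic_cond_read_relaxed()
call expands to):

	int cnt = _Q_PENDING_LOOPS;

	for (;;) {
		val = atomic_read(&lock->val);
		if (val != _Q_PENDING_VAL || !cnt--)
			break;	/* lockword moved on, or the bound was hit */
		cpu_relax();
	}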

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index f5b0e59f6d14..a0f7976348f8 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -77,6 +77,18 @@
 #endif
 
 /*
+ * The pending bit spinning loop count.
+ * This heuristic is used to limit the number of lockword accesses
+ * made by atomic_cond_read_relaxed when waiting for the lock to
+ * transition out of the "== _Q_PENDING_VAL" state. We don't spin
+ * indefinitely because there's no guarantee that we'll make forward
+ * progress.
+ */
+#ifndef _Q_PENDING_LOOPS
+#define _Q_PENDING_LOOPS	1
+#endif
+
+/*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
@@ -266,13 +278,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
-	 * wait for in-progress pending->locked hand-overs
+	 * Wait for in-progress pending->locked hand-overs with a bounded
+	 * number of spins so that we guarantee forward progress.
 	 *
 	 * 0,1,0 -> 0,0,1
 	 */
 	if (val == _Q_PENDING_VAL) {
-		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
-			cpu_relax();
+		int cnt = _Q_PENDING_LOOPS;
+		val = atomic_cond_read_relaxed(&lock->val,
+					       (VAL != _Q_PENDING_VAL) || !cnt--);
 	}
 
 	/*
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 04/14] locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (2 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 03/14] locking/qspinlock: Bound spinning on pending->locked transition in slowpath Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:38   ` [tip:locking/core] " tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

On x86, atomic_cond_read_relaxed will busy-wait with a cpu_relax() loop,
so it is desirable to increase the number of times we spin on the qspinlock
lockword when it is found to be transitioning from pending to locked.

According to Waiman Long:

| Ideally, the spinning times should be at least a few times the typical
| cacheline load time from memory which I think can be down to 100ns or
| so for each cacheline load with the newest systems or up to several
| hundreds ns for older systems.

which in his benchmarking corresponded to 512 iterations.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/x86/include/asm/qspinlock.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 90b0b0ed8161..da1370ad206d 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,6 +7,8 @@
 #include <asm-generic/qspinlock_types.h>
 #include <asm/paravirt.h>
 
+#define _Q_PENDING_LOOPS	(1 << 9)
+
 #define	queued_spin_unlock queued_spin_unlock
 /**
  * queued_spin_unlock - release a queued spinlock
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (3 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 04/14] locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-26 15:53   ` Peter Zijlstra
                     ` (2 more replies)
  2018-04-26 10:34 ` [PATCH v3 06/14] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
                   ` (10 subsequent siblings)
  15 siblings, 3 replies; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

The qspinlock locking slowpath utilises a "pending" bit as a simple form
of an embedded test-and-set lock that can avoid the overhead of explicit
queuing in cases where the lock is held but uncontended. This bit is
managed using a cmpxchg loop which tries to transition the uncontended
lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).

Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
indefinitely if the lock word is seen to oscillate between unlocked
(0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
able to take the lock in the cmpxchg loop without queuing and pass it
around amongst themselves.

This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
using atomic_fetch_or, and then inspecting the old value to see whether
we need to spin on the current lock owner, or whether we now effectively
hold the lock. The tricky scenario is when concurrent lockers end up
queuing on the lock and the lock becomes available, causing us to see
a lockword of (n,0,0). With pending now set, simply queuing could lead
to deadlock as the head of the queue may not have observed the pending
flag being cleared. Conversely, if the head of the queue did observe
pending being cleared, then it could transition the lock from (n,0,0) ->
(0,0,1) meaning that any attempt to "undo" our setting of the pending
bit could race with a concurrent locker trying to set it.

We handle this race by preserving the pending bit when taking the lock
after reaching the head of the queue and leaving the tail entry intact
if we saw pending set, because we know that the tail is going to be
updated shortly.
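
Condensed, the resulting pending-bit fast path looks like this (a sketch of
the diff below; the real code carries the full comments):

	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
	if (!(val & ~_Q_LOCKED_MASK)) {
		/* We own pending: wait for the owner, then take the lock */
		if (val & _Q_LOCKED_MASK)
			smp_cond_load_acquire(&lock->val.counter,
					      !(VAL & _Q_LOCKED_MASK));
		clear_pending_set_locked(lock);	/* *,1,0 -> *,0,1 */
		return;
	}

	/* Waiters are queued: undo pending only if we were the one to set it */
	if (!(val & _Q_PENDING_MASK))
		clear_pending(lock);
	/* ...fall through to the MCS queue */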

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c          | 102 ++++++++++++++++++++----------------
 kernel/locking/qspinlock_paravirt.h |   5 --
 2 files changed, 58 insertions(+), 49 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index a0f7976348f8..ad94b7def005 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -128,6 +128,17 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
 
 #if _Q_PENDING_BITS == 8
 /**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	WRITE_ONCE(lock->pending, 0);
+}
+
+/**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queued spinlock structure
  *
@@ -163,6 +174,17 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 #else /* _Q_PENDING_BITS == 8 */
 
 /**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	atomic_andnot(_Q_PENDING_VAL, &lock->val);
+}
+
+/**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queued spinlock structure
  *
@@ -266,7 +288,7 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
 	struct mcs_spinlock *prev, *next, *node;
-	u32 new, old, tail;
+	u32 old, tail;
 	int idx;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -290,58 +312,50 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	}
 
 	/*
+	 * If we observe any contention; queue.
+	 */
+	if (val & ~_Q_LOCKED_MASK)
+		goto queue;
+
+	/*
 	 * trylock || pending
 	 *
 	 * 0,0,0 -> 0,0,1 ; trylock
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
-	for (;;) {
+	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	if (!(val & ~_Q_LOCKED_MASK)) {
 		/*
-		 * If we observe any contention; queue.
+		 * we're pending, wait for the owner to go away.
+		 *
+		 * *,1,1 -> *,1,0
+		 *
+		 * this wait loop must be a load-acquire such that we match the
+		 * store-release that clears the locked bit and create lock
+		 * sequentiality; this is because not all
+		 * clear_pending_set_locked() implementations imply full
+		 * barriers.
 		 */
-		if (val & ~_Q_LOCKED_MASK)
-			goto queue;
-
-		new = _Q_LOCKED_VAL;
-		if (val == new)
-			new |= _Q_PENDING_VAL;
+		if (val & _Q_LOCKED_MASK) {
+			smp_cond_load_acquire(&lock->val.counter,
+					      !(VAL & _Q_LOCKED_MASK));
+		}
 
 		/*
-		 * Acquire semantic is required here as the function may
-		 * return immediately if the lock was free.
+		 * take ownership and clear the pending bit.
+		 *
+		 * *,1,0 -> *,0,1
 		 */
-		old = atomic_cmpxchg_acquire(&lock->val, val, new);
-		if (old == val)
-			break;
-
-		val = old;
-	}
-
-	/*
-	 * we won the trylock
-	 */
-	if (new == _Q_LOCKED_VAL)
+		clear_pending_set_locked(lock);
 		return;
+	}
 
 	/*
-	 * we're pending, wait for the owner to go away.
-	 *
-	 * *,1,1 -> *,1,0
-	 *
-	 * this wait loop must be a load-acquire such that we match the
-	 * store-release that clears the locked bit and create lock
-	 * sequentiality; this is because not all clear_pending_set_locked()
-	 * implementations imply full barriers.
-	 */
-	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
-
-	/*
-	 * take ownership and clear the pending bit.
-	 *
-	 * *,1,0 -> *,0,1
+	 * If pending was clear but there are waiters in the queue, then
+	 * we need to undo our setting of pending before we queue ourselves.
 	 */
-	clear_pending_set_locked(lock);
-	return;
+	if (!(val & _Q_PENDING_MASK))
+		clear_pending(lock);
 
 	/*
 	 * End of pending bit optimistic spinning and beginning of MCS
@@ -445,15 +459,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * claim the lock:
 	 *
 	 * n,0,0 -> 0,0,1 : lock, uncontended
-	 * *,0,0 -> *,0,1 : lock, contended
+	 * *,*,0 -> *,*,1 : lock, contended
 	 *
-	 * If the queue head is the only one in the queue (lock value == tail),
-	 * clear the tail code and grab the lock. Otherwise, we only need
-	 * to grab the lock.
+	 * If the queue head is the only one in the queue (lock value == tail)
+	 * and nobody is pending, clear the tail code and grab the lock.
+	 * Otherwise, we only need to grab the lock.
 	 */
 	for (;;) {
 		/* In the PV case we might already have _Q_LOCKED_VAL set */
-		if ((val & _Q_TAIL_MASK) != tail) {
+		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
 			set_locked(lock);
 			break;
 		}
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 2711940429f5..2dbad2f25480 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -118,11 +118,6 @@ static __always_inline void set_pending(struct qspinlock *lock)
 	WRITE_ONCE(lock->pending, 1);
 }
 
-static __always_inline void clear_pending(struct qspinlock *lock)
-{
-	WRITE_ONCE(lock->pending, 0);
-}
-
 /*
  * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
  * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 06/14] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (4 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:39   ` [tip:locking/core] locking/qspinlock: Kill cmpxchg() " tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 07/14] locking/qspinlock: Use atomic_cond_read_acquire Will Deacon
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

When a queued locker reaches the head of the queue, it claims the lock
by setting _Q_LOCKED_VAL in the lockword. If there isn't contention, it
must also clear the tail as part of this operation so that subsequent
lockers can avoid taking the slowpath altogether.

Currently this is expressed as a cmpxchg loop that, in practice, runs at
most two iterations. This is confusing to the reader and unhelpful to the
compiler. Rewrite the cmpxchg loop without the loop, so that a failed
cmpxchg implies that there is contention and we then only need to set
_Q_LOCKED_VAL without considering the rest of the lockword.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ad94b7def005..e42f50d69ed6 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -465,24 +465,21 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * and nobody is pending, clear the tail code and grab the lock.
 	 * Otherwise, we only need to grab the lock.
 	 */
-	for (;;) {
-		/* In the PV case we might already have _Q_LOCKED_VAL set */
-		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
-			set_locked(lock);
-			break;
-		}
+
+	/* In the PV case we might already have _Q_LOCKED_VAL set */
+	if ((val & _Q_TAIL_MASK) == tail) {
 		/*
 		 * The smp_cond_load_acquire() call above has provided the
-		 * necessary acquire semantics required for locking. At most
-		 * two iterations of this loop may be ran.
+		 * necessary acquire semantics required for locking.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
-			goto release;	/* No contention */
-
-		val = old;
+			goto release; /* No contention */
 	}
 
+	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
+	set_locked(lock);
+
 	/*
 	 * contended path; wait for next if not observed yet, release.
 	 */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 07/14] locking/qspinlock: Use atomic_cond_read_acquire
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (5 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 06/14] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:40   ` [tip:locking/core] locking/qspinlock: Use atomic_cond_read_acquire() tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 08/14] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop Will Deacon
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

Rather than dig into the counter field of the atomic_t inside the
qspinlock structure so that we can call smp_cond_load_acquire, use
atomic_cond_read_acquire instead, which operates on the atomic_t
directly.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index e42f50d69ed6..5e383cfe4cce 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -337,8 +337,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		 * barriers.
 		 */
 		if (val & _Q_LOCKED_MASK) {
-			smp_cond_load_acquire(&lock->val.counter,
-					      !(VAL & _Q_LOCKED_MASK));
+			atomic_cond_read_acquire(&lock->val,
+						 !(VAL & _Q_LOCKED_MASK));
 		}
 
 		/*
@@ -441,8 +441,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *
 	 * The PV pv_wait_head_or_lock function, if active, will acquire
 	 * the lock and return a non-zero value. So we have to skip the
-	 * smp_cond_load_acquire() call. As the next PV queue head hasn't been
-	 * designated yet, there is no way for the locked value to become
+	 * atomic_cond_read_acquire() call. As the next PV queue head hasn't
+	 * been designated yet, there is no way for the locked value to become
 	 * _Q_SLOW_VAL. So both the set_locked() and the
 	 * atomic_cmpxchg_relaxed() calls will be safe.
 	 *
@@ -452,7 +452,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
-	val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));
+	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
 
 locked:
 	/*
@@ -469,7 +469,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/* In the PV case we might already have _Q_LOCKED_VAL set */
 	if ((val & _Q_TAIL_MASK) == tail) {
 		/*
-		 * The smp_cond_load_acquire() call above has provided the
+		 * The atomic_cond_read_acquire() call above has provided the
 		 * necessary acquire semantics required for locking.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 08/14] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (6 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 07/14] locking/qspinlock: Use atomic_cond_read_acquire Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:40   ` [tip:locking/core] locking/mcs: Use smp_cond_load_acquire() in MCS " tip-bot for Jason Low
  2018-04-26 10:34 ` [PATCH v3 09/14] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node Will Deacon
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

From: Jason Low <jason.low2@hp.com>

For qspinlocks on ARM64, we would like to use WFE instead
of purely spinning. Qspinlocks internally have lock
contenders spin on an MCS lock.

Update arch_mcs_spin_lock_contended() such that it uses
the new smp_cond_load_acquire() so that ARM64 can also
override this spin loop with its own implementation using WFE.

On x86, this can also be cheaper than spinning on
smp_load_acquire().
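
For illustration, arm64's override (simplified from memory; the real
definition lives in arch/arm64/include/asm/barrier.h) parks the CPU in
WFE via __cmpwait_relaxed() between loads instead of calling cpu_relax():

	#define smp_cond_load_acquire(ptr, cond_expr)			\
	({								\
		typeof(ptr) __PTR = (ptr);				\
		typeof(*ptr) VAL;					\
		for (;;) {						\
			VAL = smp_load_acquire(__PTR);			\
			if (cond_expr)					\
				break;					\
			__cmpwait_relaxed(__PTR, VAL);			\
		}							\
		VAL;							\
	})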

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/mcs_spinlock.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index f046b7ce9dd6..5e10153b4d3c 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -23,13 +23,15 @@ struct mcs_spinlock {
 
 #ifndef arch_mcs_spin_lock_contended
 /*
- * Using smp_load_acquire() provides a memory barrier that ensures
- * subsequent operations happen after the lock is acquired.
+ * Using smp_cond_load_acquire() provides the acquire semantics
+ * required so that subsequent operations happen after the
+ * lock is acquired. Additionally, some architectures such as
+ * ARM64 would like to do spin-waiting instead of purely
+ * spinning, and smp_cond_load_acquire() provides that behavior.
  */
 #define arch_mcs_spin_lock_contended(l)					\
 do {									\
-	while (!(smp_load_acquire(l)))					\
-		cpu_relax();						\
+	smp_cond_load_acquire(l, VAL);					\
 } while (0)
 #endif
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 09/14] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (7 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 08/14] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:41   ` [tip:locking/core] locking/qspinlock: Use smp_cond_load_relaxed() " tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 10/14] locking/qspinlock: Make queued_spin_unlock use smp_store_release Will Deacon
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

When a locker reaches the head of the queue and takes the lock, a
concurrent locker may enqueue and force the lock holder to spin
whilst its node->next field is initialised. Rather than open-code
a READ_ONCE/cpu_relax() loop, this can be implemented using
smp_cond_load_relaxed instead.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 5e383cfe4cce..7b8c81ebb15e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -483,10 +483,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/*
 	 * contended path; wait for next if not observed yet, release.
 	 */
-	if (!next) {
-		while (!(next = READ_ONCE(node->next)))
-			cpu_relax();
-	}
+	if (!next)
+		next = smp_cond_load_relaxed(&node->next, (VAL));
 
 	arch_mcs_spin_unlock_contended(&next->locked);
 	pv_kick_node(lock, next);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 10/14] locking/qspinlock: Make queued_spin_unlock use smp_store_release
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (8 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 09/14] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:42   ` [tip:locking/core] locking/qspinlock: Use smp_store_release() in queued_spin_unlock() tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 11/14] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

A qspinlock can be unlocked simply by writing zero to the locked byte.
This can be implemented in the generic code, so do that and remove the
arch-specific override for x86 in the !PV case.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/x86/include/asm/qspinlock.h | 17 ++++++-----------
 include/asm-generic/qspinlock.h  |  2 +-
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index da1370ad206d..3e70bed8a978 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -9,6 +9,12 @@
 
 #define _Q_PENDING_LOOPS	(1 << 9)
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_init_lock_hash(void);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
+
 #define	queued_spin_unlock queued_spin_unlock
 /**
  * queued_spin_unlock - release a queued spinlock
@@ -21,12 +27,6 @@ static inline void native_queued_spin_unlock(struct qspinlock *lock)
 	smp_store_release(&lock->locked, 0);
 }
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_init_lock_hash(void);
-extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
-
 static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
 	pv_queued_spin_lock_slowpath(lock, val);
@@ -42,11 +42,6 @@ static inline bool vcpu_is_preempted(long cpu)
 {
 	return pv_vcpu_is_preempted(cpu);
 }
-#else
-static inline void queued_spin_unlock(struct qspinlock *lock)
-{
-	native_queued_spin_unlock(lock);
-}
 #endif
 
 #ifdef CONFIG_PARAVIRT
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index b37b4ad7eb94..a8ed0a352d75 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -100,7 +100,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 	/*
 	 * unlock() needs release semantics:
 	 */
-	(void)atomic_sub_return_release(_Q_LOCKED_VAL, &lock->val);
+	smp_store_release(&lock->locked, 0);
 }
 #endif
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 11/14] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (9 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 10/14] locking/qspinlock: Make queued_spin_unlock use smp_store_release Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:42   ` [tip:locking/core] " tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 12/14] locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking Will Deacon
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

The qspinlock slowpath must ensure that the MCS node is fully initialised
before it can be reached by another CPU. This is currently enforced
by using a RELEASE operation when updating the tail and also when linking
the node into the waitqueue (since the control dependency off xchg_tail
is insufficient to enforce sufficient ordering -- see 95bcade33a8a
("locking/qspinlock: Ensure node is initialised before updating prev->next")).

Back-to-back RELEASE operations may be expensive on some architectures,
particularly those that implement them using fences under the hood. We
can replace the two RELEASE operations with a single smp_wmb() fence and
use RELAXED operations for the subsequent publishing of the node.
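
Taken together (a sketch of the combined effect of the hunks below), the
publish sequence becomes:

	/* @node fully initialised above */
	smp_wmb();			/* order node init before publication */

	old = xchg_tail(lock, tail);	/* now RELAXED */
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);
		WRITE_ONCE(prev->next, node);	/* plain store, ordered by the wmb */
	}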

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7b8c81ebb15e..fa5d2ab369f9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -164,10 +164,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
 	/*
-	 * Use release semantics to make sure that the MCS node is properly
-	 * initialized before changing the tail code.
+	 * We can use relaxed semantics since the caller ensures that the
+	 * MCS node is properly initialized before updating the tail.
 	 */
-	return (u32)xchg_release(&lock->tail,
+	return (u32)xchg_relaxed(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -212,10 +212,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
 		/*
-		 * Use release semantics to make sure that the MCS node is
-		 * properly initialized before changing the tail code.
+		 * We can use relaxed semantics since the caller ensures that
+		 * the MCS node is properly initialized before updating the
+		 * tail.
 		 */
-		old = atomic_cmpxchg_release(&lock->val, val, new);
+		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -388,12 +389,18 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		goto release;
 
 	/*
+	 * Ensure that the initialisation of @node is complete before we
+	 * publish the updated tail via xchg_tail() and potentially link
+	 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
+	 */
+	smp_wmb();
+
+	/*
+	 * Publish the updated tail.
 	 * We have already touched the queueing cacheline; don't bother with
 	 * pending stuff.
 	 *
 	 * p,*,* -> n,*,*
-	 *
-	 * RELEASE, such that the stores to @node must be complete.
 	 */
 	old = xchg_tail(lock, tail);
 	next = NULL;
@@ -405,14 +412,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (old & _Q_TAIL_MASK) {
 		prev = decode_tail(old);
 
-		/*
-		 * We must ensure that the stores to @node are observed before
-		 * the write to prev->next. The address dependency from
-		 * xchg_tail is not sufficient to ensure this because the read
-		 * component of xchg_tail is unordered with respect to the
-		 * initialisation of @node.
-		 */
-		smp_store_release(&prev->next, node);
+		/* Link @node into the waitqueue. */
+		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 12/14] locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (10 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 11/14] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:43   ` [tip:locking/core] locking/qspinlock: Use try_cmpxchg() instead of cmpxchg() " tip-bot for Will Deacon
  2018-04-26 10:34 ` [PATCH v3 13/14] locking/qspinlock: Add stat tracking for pending vs slowpath Will Deacon
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

When reaching the head of an uncontended queue on the qspinlock slow-path,
using a try_cmpxchg instead of a cmpxchg operation to transition the
lock word to _Q_LOCKED_VAL generates slightly better code for x86 and
pretty much identical code for arm64.
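
The difference is only in the idiom: atomic_try_cmpxchg_relaxed() returns a
bool and, on failure, writes the observed value back into @val, so the
old == val comparison disappears (sketch, mirroring the diff below):

	/* cmpxchg() idiom */
	old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
	if (old == val)
		goto release;

	/* try_cmpxchg() idiom */
	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
		goto release;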

Cc: Ingo Molnar <mingo@kernel.org>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fa5d2ab369f9..1e3ddc42135e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -467,16 +467,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * Otherwise, we only need to grab the lock.
 	 */
 
-	/* In the PV case we might already have _Q_LOCKED_VAL set */
-	if ((val & _Q_TAIL_MASK) == tail) {
-		/*
-		 * The atomic_cond_read_acquire() call above has provided the
-		 * necessary acquire semantics required for locking.
-		 */
-		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
-		if (old == val)
-			goto release; /* No contention */
-	}
+	/*
+	 * In the PV case we might already have _Q_LOCKED_VAL set.
+	 *
+	 * The atomic_cond_read_acquire() call above has provided the
+	 * necessary acquire semantics required for locking.
+	 */
+	if (((val & _Q_TAIL_MASK) == tail) &&
+	    atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		goto release; /* No contention */
 
 	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
 	set_locked(lock);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 13/14] locking/qspinlock: Add stat tracking for pending vs slowpath
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (11 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 12/14] locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-27  9:43   ` [tip:locking/core] locking/qspinlock: Add stat tracking for pending vs. slowpath tip-bot for Waiman Long
  2018-04-26 10:34 ` [PATCH v3 14/14] MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES Will Deacon
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

From: Waiman Long <longman@redhat.com>

Currently, the qspinlock_stat code tracks only statistical counts in the
PV qspinlock code. However, it may also be useful to track the number
of locking operations done via the pending code vs. the MCS lock queue
slowpath for the non-PV case.

The qspinlock stat code is modified to do that. The stat counter
pv_lock_slowpath is renamed to lock_slowpath so that it can be used by
both the PV and non-PV cases.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c          | 14 +++++++++++---
 kernel/locking/qspinlock_paravirt.h |  7 +------
 kernel/locking/qspinlock_stat.h     |  9 ++++++---
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 1e3ddc42135e..b39f341d831b 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -12,11 +12,11 @@
  * GNU General Public License for more details.
  *
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
- * (C) Copyright 2013-2014 Red Hat, Inc.
+ * (C) Copyright 2013-2014,2018 Red Hat, Inc.
  * (C) Copyright 2015 Intel Corp.
  * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hpe.com>
+ * Authors: Waiman Long <longman@redhat.com>
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
@@ -33,6 +33,11 @@
 #include <asm/qspinlock.h>
 
 /*
+ * Include queued spinlock statistics code
+ */
+#include "qspinlock_stat.h"
+
+/*
  * The basic principle of a queue-based spinlock can best be understood
  * by studying a classic queue-based spinlock implementation called the
  * MCS lock. The paper below provides a good description for this kind
@@ -295,7 +300,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
 	if (pv_enabled())
-		goto queue;
+		goto pv_queue;
 
 	if (virt_spin_lock(lock))
 		return;
@@ -348,6 +353,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		 * *,1,0 -> *,0,1
 		 */
 		clear_pending_set_locked(lock);
+		qstat_inc(qstat_lock_pending, true);
 		return;
 	}
 
@@ -363,6 +369,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * queuing.
 	 */
 queue:
+	qstat_inc(qstat_lock_slowpath, true);
+pv_queue:
 	node = this_cpu_ptr(&mcs_nodes[0]);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 2dbad2f25480..25730b2ac022 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -56,11 +56,6 @@ struct pv_node {
 };
 
 /*
- * Include queued spinlock statistics code
- */
-#include "qspinlock_stat.h"
-
-/*
  * Hybrid PV queued/unfair lock
  *
  * By replacing the regular queued_spin_trylock() with the function below,
@@ -428,7 +423,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 	/*
 	 * Tracking # of slowpath locking operations
 	 */
-	qstat_inc(qstat_pv_lock_slowpath, true);
+	qstat_inc(qstat_lock_slowpath, true);
 
 	for (;; waitcnt++) {
 		/*
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 4a30ef63c607..6bd78c0740fc 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -22,13 +22,14 @@
  *   pv_kick_wake	- # of vCPU kicks used for computing pv_latency_wake
  *   pv_latency_kick	- average latency (ns) of vCPU kick operation
  *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
- *   pv_lock_slowpath	- # of locking operations via the slowpath
  *   pv_lock_stealing	- # of lock stealing operations
  *   pv_spurious_wakeup	- # of spurious wakeups in non-head vCPUs
  *   pv_wait_again	- # of wait's after a queue head vCPU kick
  *   pv_wait_early	- # of early vCPU wait's
  *   pv_wait_head	- # of vCPU wait's at the queue head
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
+ *   lock_pending	- # of locking operations via pending code
+ *   lock_slowpath	- # of locking operations via MCS lock queue
  *
  * Writing to the "reset_counters" file will reset all the above counter
  * values.
@@ -46,13 +47,14 @@ enum qlock_stats {
 	qstat_pv_kick_wake,
 	qstat_pv_latency_kick,
 	qstat_pv_latency_wake,
-	qstat_pv_lock_slowpath,
 	qstat_pv_lock_stealing,
 	qstat_pv_spurious_wakeup,
 	qstat_pv_wait_again,
 	qstat_pv_wait_early,
 	qstat_pv_wait_head,
 	qstat_pv_wait_node,
+	qstat_lock_pending,
+	qstat_lock_slowpath,
 	qstat_num,	/* Total number of statistical counters */
 	qstat_reset_cnts = qstat_num,
 };
@@ -73,12 +75,13 @@ static const char * const qstat_names[qstat_num + 1] = {
 	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
 	[qstat_pv_latency_kick]	   = "pv_latency_kick",
 	[qstat_pv_latency_wake]    = "pv_latency_wake",
-	[qstat_pv_lock_slowpath]   = "pv_lock_slowpath",
 	[qstat_pv_lock_stealing]   = "pv_lock_stealing",
 	[qstat_pv_wait_again]      = "pv_wait_again",
 	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
+	[qstat_lock_pending]       = "lock_pending",
+	[qstat_lock_slowpath]      = "lock_slowpath",
 	[qstat_reset_cnts]         = "reset_counters",
 };
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 14/14] MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (12 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 13/14] locking/qspinlock: Add stat tracking for pending vs slowpath Will Deacon
@ 2018-04-26 10:34 ` Will Deacon
  2018-04-26 15:55   ` Peter Zijlstra
  2018-04-27  9:44   ` [tip:locking/core] MAINTAINERS: Add myself as a co-maintainer for the locking subsystem tip-bot for Will Deacon
  2018-04-26 15:54 ` [PATCH v3 00/14] kernel/locking: qspinlock improvements Peter Zijlstra
  2018-04-26 20:18 ` Waiman Long
  15 siblings, 2 replies; 41+ messages in thread
From: Will Deacon @ 2018-04-26 10:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	will.deacon

I've been heavily involved with concurrency and memory ordering stuff
(see ATOMIC INFRASTRUCTURE and LINUX KERNEL MEMORY CONSISTENCY MODEL)
and with arm64 now using qrwlock with a view to using qspinlock in the
near future, I'm going to continue being involved with the core locking
primitives. Reflect this by adding myself as a co-maintainer alongside
Ingo and Peter.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 92be777d060a..2509c20a979e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8328,6 +8328,7 @@ F:	Documentation/admin-guide/LSM/LoadPin.rst
 LOCKING PRIMITIVES
 M:	Peter Zijlstra <peterz@infradead.org>
 M:	Ingo Molnar <mingo@redhat.com>
+M:	Will Deacon <will.deacon@arm.com>
 L:	linux-kernel@vger.kernel.org
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
 S:	Maintained
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-26 10:34 ` [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
@ 2018-04-26 15:53   ` Peter Zijlstra
  2018-04-26 16:55     ` Will Deacon
  2018-04-26 20:16   ` Waiman Long
  2018-04-27  9:39   ` [tip:locking/core] locking/qspinlock: Remove unbounded cmpxchg() " tip-bot for Will Deacon
  2 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2018-04-26 15:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck, longman

On Thu, Apr 26, 2018 at 11:34:19AM +0100, Will Deacon wrote:
> @@ -290,58 +312,50 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	}
>  
>  	/*
> +	 * If we observe any contention; queue.
> +	 */
> +	if (val & ~_Q_LOCKED_MASK)
> +		goto queue;
> +
> +	/*
>  	 * trylock || pending
>  	 *
>  	 * 0,0,0 -> 0,0,1 ; trylock
>  	 * 0,0,1 -> 0,1,1 ; pending
>  	 */
> +	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +	if (!(val & ~_Q_LOCKED_MASK)) {
>  		/*
> +		 * we're pending, wait for the owner to go away.
> +		 *
> +		 * *,1,1 -> *,1,0

Tail must be 0 here, right?

> +		 *
> +		 * this wait loop must be a load-acquire such that we match the
> +		 * store-release that clears the locked bit and create lock
> +		 * sequentiality; this is because not all
> +		 * clear_pending_set_locked() implementations imply full
> +		 * barriers.
>  		 */
> +		if (val & _Q_LOCKED_MASK) {
> +			smp_cond_load_acquire(&lock->val.counter,
> +					      !(VAL & _Q_LOCKED_MASK));
> +		}
>  
>  		/*
> +		 * take ownership and clear the pending bit.
> +		 *
> +		 * *,1,0 -> *,0,1
>  		 */

Idem.

> +		clear_pending_set_locked(lock);
>  		return;
> +	}
>  
>  	/*
> +	 * If pending was clear but there are waiters in the queue, then
> +	 * we need to undo our setting of pending before we queue ourselves.
>  	 */
> +	if (!(val & _Q_PENDING_MASK))
> +		clear_pending(lock);

This is the branch for when we have !0 tail.

>  
>  	/*
>  	 * End of pending bit optimistic spinning and beginning of MCS

> @@ -445,15 +459,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 * claim the lock:
>  	 *
>  	 * n,0,0 -> 0,0,1 : lock, uncontended
> +	 * *,*,0 -> *,*,1 : lock, contended
>  	 *
> +	 * If the queue head is the only one in the queue (lock value == tail)
> +	 * and nobody is pending, clear the tail code and grab the lock.
> +	 * Otherwise, we only need to grab the lock.
>  	 */
>  	for (;;) {
>  		/* In the PV case we might already have _Q_LOCKED_VAL set */
> +		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
>  			set_locked(lock);
>  			break;
>  		}

This one hunk is terrible on the brain. I'm fairly sure I get it, but I
feel that comment can use help. Or at least, I need help reading it.

I'll try and cook up something when my brain starts working again.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 00/14] kernel/locking: qspinlock improvements
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (13 preceding siblings ...)
  2018-04-26 10:34 ` [PATCH v3 14/14] MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES Will Deacon
@ 2018-04-26 15:54 ` Peter Zijlstra
  2018-04-27  9:33   ` Ingo Molnar
  2018-04-26 20:18 ` Waiman Long
  15 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2018-04-26 15:54 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck, longman



Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Ingo, please queue.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 14/14] MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES
  2018-04-26 10:34 ` [PATCH v3 14/14] MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES Will Deacon
@ 2018-04-26 15:55   ` Peter Zijlstra
  2018-04-27  9:44   ` [tip:locking/core] MAINTAINERS: Add myself as a co-maintainer for the locking subsystem tip-bot for Will Deacon
  1 sibling, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2018-04-26 15:55 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck, longman

On Thu, Apr 26, 2018 at 11:34:28AM +0100, Will Deacon wrote:
> I've been heavily involved with concurrency and memory ordering stuff
> (see ATOMIC INFRASTRUCTURE and LINUX KERNEL MEMORY CONSISTENCY MODEL)
> and with arm64 now using qrwlock with a view to using qspinlock in the
> near future, I'm going to continue being involved with the core locking
> primitives. Reflect this by adding myself as a co-maintainer alongside
> Ingo and Peter.

I much value your help on this, thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-26 15:53   ` Peter Zijlstra
@ 2018-04-26 16:55     ` Will Deacon
  2018-04-28 12:45       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Will Deacon @ 2018-04-26 16:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck, longman

Hi Peter,

On Thu, Apr 26, 2018 at 05:53:35PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 26, 2018 at 11:34:19AM +0100, Will Deacon wrote:
> > @@ -290,58 +312,50 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  	}
> >  
> >  	/*
> > +	 * If we observe any contention; queue.
> > +	 */
> > +	if (val & ~_Q_LOCKED_MASK)
> > +		goto queue;
> > +
> > +	/*
> >  	 * trylock || pending
> >  	 *
> >  	 * 0,0,0 -> 0,0,1 ; trylock
> >  	 * 0,0,1 -> 0,1,1 ; pending
> >  	 */
> > +	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> > +	if (!(val & ~_Q_LOCKED_MASK)) {
> >  		/*
> > +		 * we're pending, wait for the owner to go away.
> > +		 *
> > +		 * *,1,1 -> *,1,0
> 
> Tail must be 0 here, right?

Not necessarily. If we're concurrently setting pending with another slowpath
locker, they could queue in the tail behind us, so we can't mess with those
upper bits.
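
For example, using the usual (tail, pending, locked) notation:

	(0,0,1)	lock held; we set pending with the fetch_or	-> (0,1,1)
	(0,1,1)	a second locker sees contention and queues	-> (n,1,1)
	(n,1,1)	the owner drops the lock			-> (n,1,0)

so by the time we take over as the pending waiter, the tail bits can
already be live and we have to leave them alone.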

> > +		 *
> > +		 * this wait loop must be a load-acquire such that we match the
> > +		 * store-release that clears the locked bit and create lock
> > +		 * sequentiality; this is because not all
> > +		 * clear_pending_set_locked() implementations imply full
> > +		 * barriers.
> >  		 */
> > +		if (val & _Q_LOCKED_MASK) {
> > +			smp_cond_load_acquire(&lock->val.counter,
> > +					      !(VAL & _Q_LOCKED_MASK));
> > +		}
> >  
> >  		/*
> > +		 * take ownership and clear the pending bit.
> > +		 *
> > +		 * *,1,0 -> *,0,1
> >  		 */
> 
> Idem.

Same here, hence why clear_pending_set_locked is either a 16-bit store or an
RmW (we can't just clobber the tail with 0).
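
For reference, the two variants are (quoting from memory; the 8-bit one
is in qspinlock.c after the struct merge in patch 2):

	/* _Q_PENDING_BITS == 8: halfword store, leaves the tail untouched */
	static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
	{
		WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
	}

	/* _Q_PENDING_BITS != 8: RmW on the whole word, tail preserved */
	static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
	{
		atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
	}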

> > +		clear_pending_set_locked(lock);
> >  		return;
> > +	}
> >  
> >  	/*
> > +	 * If pending was clear but there are waiters in the queue, then
> > +	 * we need to undo our setting of pending before we queue ourselves.
> >  	 */
> > +	if (!(val & _Q_PENDING_MASK))
> > +		clear_pending(lock);
> 
> This is the branch for when we have !0 tail.

That's the case where "val" has a !0 tail, but I think the comments are
trying to talk about the status of the lockword in memory, no?

> >  	/*
> >  	 * End of pending bit optimistic spinning and beginning of MCS
> 
> > @@ -445,15 +459,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  	 * claim the lock:
> >  	 *
> >  	 * n,0,0 -> 0,0,1 : lock, uncontended
> > +	 * *,*,0 -> *,*,1 : lock, contended
> >  	 *
> > +	 * If the queue head is the only one in the queue (lock value == tail)
> > +	 * and nobody is pending, clear the tail code and grab the lock.
> > +	 * Otherwise, we only need to grab the lock.
> >  	 */
> >  	for (;;) {
> >  		/* In the PV case we might already have _Q_LOCKED_VAL set */
> > +		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
> >  			set_locked(lock);
> >  			break;
> >  		}
> 
> This one hunk is terrible on the brain. I'm fairly sure I get it, but I
> feel that comment can use help. Or at least, I need help reading it.
> 
> I'll try and cook up something when my brain starts working again.

Cheers. I think the code is a bit easier to read if you look at it after the
whole series is applied, but the comments could probably still be improved.
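
FWIW, after the full series the head-of-queue claim reads roughly like
this (modulo the later try_cmpxchg conversion):

	/* In the PV case we might already have _Q_LOCKED_VAL set */
	if ((val & _Q_TAIL_MASK) == tail) {
		/*
		 * The acquire wait above provided the ordering needed for
		 * locking; a successful cmpxchg also clears the tail.
		 */
		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
		if (old == val)
			goto release;	/* No contention */
	}

	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
	set_locked(lock);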

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-26 10:34 ` [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
  2018-04-26 15:53   ` Peter Zijlstra
@ 2018-04-26 20:16   ` Waiman Long
  2018-04-27 10:16     ` Will Deacon
  2018-04-27  9:39   ` [tip:locking/core] locking/qspinlock: Remove unbounded cmpxchg() " tip-bot for Will Deacon
  2 siblings, 1 reply; 41+ messages in thread
From: Waiman Long @ 2018-04-26 20:16 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On 04/26/2018 06:34 AM, Will Deacon wrote:
> The qspinlock locking slowpath utilises a "pending" bit as a simple form
> of an embedded test-and-set lock that can avoid the overhead of explicit
> queuing in cases where the lock is held but uncontended. This bit is
> managed using a cmpxchg loop which tries to transition the uncontended
> lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
>
> Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
> indefinitely if the lock word is seen to oscillate between unlocked
> (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
> able to take the lock in the cmpxchg loop without queuing and pass it
> around amongst themselves.
>
> This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
> using atomic_fetch_or, and then inspecting the old value to see whether
> we need to spin on the current lock owner, or whether we now effectively
> hold the lock. The tricky scenario is when concurrent lockers end up
> queuing on the lock and the lock becomes available, causing us to see
> a lockword of (n,0,0). With pending now set, simply queuing could lead
> to deadlock as the head of the queue may not have observed the pending
> flag being cleared. Conversely, if the head of the queue did observe
> pending being cleared, then it could transition the lock from (n,0,0) ->
> (0,0,1) meaning that any attempt to "undo" our setting of the pending
> bit could race with a concurrent locker trying to set it.
>
> We handle this race by preserving the pending bit when taking the lock
> after reaching the head of the queue and leaving the tail entry intact
> if we saw pending set, because we know that the tail is going to be
> updated shortly.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  kernel/locking/qspinlock.c          | 102 ++++++++++++++++++++----------------
>  kernel/locking/qspinlock_paravirt.h |   5 --
>  2 files changed, 58 insertions(+), 49 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index a0f7976348f8..ad94b7def005 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -128,6 +128,17 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
>  
>  #if _Q_PENDING_BITS == 8
>  /**
> + * clear_pending - clear the pending bit.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,1,* -> *,0,*
> + */
> +static __always_inline void clear_pending(struct qspinlock *lock)
> +{
> +	WRITE_ONCE(lock->pending, 0);
> +}
> +
> +/**
>   * clear_pending_set_locked - take ownership and clear the pending bit.
>   * @lock: Pointer to queued spinlock structure
>   *
> @@ -163,6 +174,17 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>  #else /* _Q_PENDING_BITS == 8 */
>  
>  /**
> + * clear_pending - clear the pending bit.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,1,* -> *,0,*
> + */
> +static __always_inline void clear_pending(struct qspinlock *lock)
> +{
> +	atomic_andnot(_Q_PENDING_VAL, &lock->val);
> +}
> +
> +/**
>   * clear_pending_set_locked - take ownership and clear the pending bit.
>   * @lock: Pointer to queued spinlock structure
>   *
> @@ -266,7 +288,7 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
>  void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  {
>  	struct mcs_spinlock *prev, *next, *node;
> -	u32 new, old, tail;
> +	u32 old, tail;
>  	int idx;
>  
>  	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> @@ -290,58 +312,50 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	}
>  
>  	/*
> +	 * If we observe any contention; queue.
> +	 */
> +	if (val & ~_Q_LOCKED_MASK)
> +		goto queue;
> +
> +	/*
>  	 * trylock || pending
>  	 *
>  	 * 0,0,0 -> 0,0,1 ; trylock
>  	 * 0,0,1 -> 0,1,1 ; pending
>  	 */
> -	for (;;) {
> +	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +	if (!(val & ~_Q_LOCKED_MASK)) {
>  		/*
> -		 * If we observe any contention; queue.
> +		 * we're pending, wait for the owner to go away.
> +		 *
> +		 * *,1,1 -> *,1,0
> +		 *
> +		 * this wait loop must be a load-acquire such that we match the
> +		 * store-release that clears the locked bit and create lock
> +		 * sequentiality; this is because not all
> +		 * clear_pending_set_locked() implementations imply full
> +		 * barriers.
>  		 */
> -		if (val & ~_Q_LOCKED_MASK)
> -			goto queue;
> -
> -		new = _Q_LOCKED_VAL;
> -		if (val == new)
> -			new |= _Q_PENDING_VAL;
> +		if (val & _Q_LOCKED_MASK) {
> +			smp_cond_load_acquire(&lock->val.counter,
> +					      !(VAL & _Q_LOCKED_MASK));
> +		}
>  
>  		/*
> -		 * Acquire semantic is required here as the function may
> -		 * return immediately if the lock was free.
> +		 * take ownership and clear the pending bit.
> +		 *
> +		 * *,1,0 -> *,0,1
>  		 */
> -		old = atomic_cmpxchg_acquire(&lock->val, val, new);
> -		if (old == val)
> -			break;
> -
> -		val = old;
> -	}
> -
> -	/*
> -	 * we won the trylock
> -	 */
> -	if (new == _Q_LOCKED_VAL)
> +		clear_pending_set_locked(lock);
>  		return;
> +	}
>  
>  	/*
> -	 * we're pending, wait for the owner to go away.
> -	 *
> -	 * *,1,1 -> *,1,0
> -	 *
> -	 * this wait loop must be a load-acquire such that we match the
> -	 * store-release that clears the locked bit and create lock
> -	 * sequentiality; this is because not all clear_pending_set_locked()
> -	 * implementations imply full barriers.
> -	 */
> -	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
> -
> -	/*
> -	 * take ownership and clear the pending bit.
> -	 *
> -	 * *,1,0 -> *,0,1
> +	 * If pending was clear but there are waiters in the queue, then
> +	 * we need to undo our setting of pending before we queue ourselves.
>  	 */
> -	clear_pending_set_locked(lock);
> -	return;
> +	if (!(val & _Q_PENDING_MASK))
> +		clear_pending(lock);
>  
>  	/*
>  	 * End of pending bit optimistic spinning and beginning of MCS
> @@ -445,15 +459,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 * claim the lock:
>  	 *
>  	 * n,0,0 -> 0,0,1 : lock, uncontended
> -	 * *,0,0 -> *,0,1 : lock, contended
> +	 * *,*,0 -> *,*,1 : lock, contended
>  	 *
> -	 * If the queue head is the only one in the queue (lock value == tail),
> -	 * clear the tail code and grab the lock. Otherwise, we only need
> -	 * to grab the lock.
> +	 * If the queue head is the only one in the queue (lock value == tail)
> +	 * and nobody is pending, clear the tail code and grab the lock.
> +	 * Otherwise, we only need to grab the lock.
>  	 */
>  	for (;;) {
>  		/* In the PV case we might already have _Q_LOCKED_VAL set */
> -		if ((val & _Q_TAIL_MASK) != tail) {
> +		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
>  			set_locked(lock);
>  			break;
>  		}
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index 2711940429f5..2dbad2f25480 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -118,11 +118,6 @@ static __always_inline void set_pending(struct qspinlock *lock)
>  	WRITE_ONCE(lock->pending, 1);
>  }
>  
> -static __always_inline void clear_pending(struct qspinlock *lock)
> -{
> -	WRITE_ONCE(lock->pending, 0);
> -}
> -
>  /*
>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
>   * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the

There is another clear_pending() function after the "#else /*
_Q_PENDING_BITS == 8 */" line that needs to be removed as well:
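
	/* i.e. this one (from memory), which would now clash with the
	 * copy your patch adds to qspinlock.c: */
	static __always_inline void clear_pending(struct qspinlock *lock)
	{
		atomic_andnot(_Q_PENDING_VAL, &lock->val);
	}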

-Longman

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 00/14] kernel/locking: qspinlock improvements
  2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
                   ` (14 preceding siblings ...)
  2018-04-26 15:54 ` [PATCH v3 00/14] kernel/locking: qspinlock improvements Peter Zijlstra
@ 2018-04-26 20:18 ` Waiman Long
  15 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2018-04-26 20:18 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On 04/26/2018 06:34 AM, Will Deacon wrote:
> Hi all,
>
> This is version three of the qspinlock patches I posted previously:
>
>   v1: https://lkml.org/lkml/2018/4/5/496
>   v2: https://lkml.org/lkml/2018/4/11/618
>
> Changes since v2 include:
>   * Fixed bisection issues
>   * Fixed x86 PV build
>   * Added patch proposing me as a co-maintainer
>   * Rebased onto -rc2
>
> All feedback welcome,
>
> Will
>
> --->8
>
> Jason Low (1):
>   locking/mcs: Use smp_cond_load_acquire() in mcs spin loop
>
> Waiman Long (1):
>   locking/qspinlock: Add stat tracking for pending vs slowpath
>
> Will Deacon (12):
>   barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
>   locking/qspinlock: Merge struct __qspinlock into struct qspinlock
>   locking/qspinlock: Bound spinning on pending->locked transition in
>     slowpath
>   locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound
>   locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
>   locking/qspinlock: Kill cmpxchg loop when claiming lock from head of
>     queue
>   locking/qspinlock: Use atomic_cond_read_acquire
>   locking/qspinlock: Use smp_cond_load_relaxed to wait for next node
>   locking/qspinlock: Make queued_spin_unlock use smp_store_release
>   locking/qspinlock: Elide back-to-back RELEASE operations with
>     smp_wmb()
>   locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking
>   MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES
>
>  MAINTAINERS                               |   1 +
>  arch/x86/include/asm/qspinlock.h          |  21 ++-
>  arch/x86/include/asm/qspinlock_paravirt.h |   3 +-
>  include/asm-generic/atomic-long.h         |   2 +
>  include/asm-generic/barrier.h             |  27 +++-
>  include/asm-generic/qspinlock.h           |   2 +-
>  include/asm-generic/qspinlock_types.h     |  32 +++-
>  include/linux/atomic.h                    |   2 +
>  kernel/locking/mcs_spinlock.h             |  10 +-
>  kernel/locking/qspinlock.c                | 247 ++++++++++++++----------------
>  kernel/locking/qspinlock_paravirt.h       |  44 ++----
>  kernel/locking/qspinlock_stat.h           |   9 +-
>  12 files changed, 209 insertions(+), 191 deletions(-)
>
Other than my comment on patch 5 (which can wait, as the code path is
unlikely to be used soon), I have no further issues with this patchset.

Acked-by: Waiman Long <longman@redhat.com>

Cheers,
Longman

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 00/14] kernel/locking: qspinlock improvements
  2018-04-26 15:54 ` [PATCH v3 00/14] kernel/locking: qspinlock improvements Peter Zijlstra
@ 2018-04-27  9:33   ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2018-04-27  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, linux-kernel, linux-arm-kernel, boqun.feng, paulmck,
	longman


* Peter Zijlstra <peterz@infradead.org> wrote:

> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Ingo, please queue.

Applied to tip:locking/core, thanks guys!

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/barriers: Introduce smp_cond_load_relaxed() and atomic_cond_read_relaxed()
  2018-04-26 10:34 ` [PATCH v3 01/14] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
@ 2018-04-27  9:36   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, peterz, hpa, tglx, torvalds, will.deacon, longman

Commit-ID:  fcfdfe30e324725007e9dc5814b62a4c430ea909
Gitweb:     https://git.kernel.org/tip/fcfdfe30e324725007e9dc5814b62a4c430ea909
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:15 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:44 +0200

locking/barriers: Introduce smp_cond_load_relaxed() and atomic_cond_read_relaxed()

Whilst we currently provide smp_cond_load_acquire() and
atomic_cond_read_acquire(), there are cases where the ACQUIRE semantics are
not required because of a subsequent fence or release operation once the
conditional loop has exited.

This patch adds relaxed versions of the conditional spinning primitives
to avoid unnecessary barrier overhead on architectures such as arm64.
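
For illustration, a hypothetical user (modelled on the MCS hand-over later
in the series) would spin with the relaxed variant and rely on a subsequent
RELEASE for ordering:

	/* wait for our successor to show up; no ordering needed here */
	next = smp_cond_load_relaxed(&node->next, (VAL));

	/* the hand-over store is a RELEASE and provides the ordering */
	smp_store_release(&next->locked, 1);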

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/asm-generic/atomic-long.h |  2 ++
 include/asm-generic/barrier.h     | 27 +++++++++++++++++++++------
 include/linux/atomic.h            |  2 ++
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/atomic-long.h b/include/asm-generic/atomic-long.h
index 34a028a7bcc5..5b2b0b5ea06d 100644
--- a/include/asm-generic/atomic-long.h
+++ b/include/asm-generic/atomic-long.h
@@ -244,6 +244,8 @@ static inline long atomic_long_add_unless(atomic_long_t *l, long a, long u)
 #define atomic_long_inc_not_zero(l) \
 	ATOMIC_LONG_PFX(_inc_not_zero)((ATOMIC_LONG_PFX(_t) *)(l))
 
+#define atomic_long_cond_read_relaxed(v, c) \
+	ATOMIC_LONG_PFX(_cond_read_relaxed)((ATOMIC_LONG_PFX(_t) *)(v), (c))
 #define atomic_long_cond_read_acquire(v, c) \
 	ATOMIC_LONG_PFX(_cond_read_acquire)((ATOMIC_LONG_PFX(_t) *)(v), (c))
 
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 29458bbb2fa0..2cafdbb9ae4c 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -221,18 +221,17 @@ do {									\
 #endif
 
 /**
- * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * smp_cond_load_relaxed() - (Spin) wait for cond with no ordering guarantees
  * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
  *
- * Equivalent to using smp_load_acquire() on the condition variable but employs
- * the control dependency of the wait to reduce the barrier on many platforms.
+ * Equivalent to using READ_ONCE() on the condition variable.
  *
  * Due to C lacking lambda expressions we load the value of *ptr into a
  * pre-named variable @VAL to be used in @cond.
  */
-#ifndef smp_cond_load_acquire
-#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+#ifndef smp_cond_load_relaxed
+#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
 	typeof(ptr) __PTR = (ptr);				\
 	typeof(*ptr) VAL;					\
 	for (;;) {						\
@@ -241,10 +240,26 @@ do {									\
 			break;					\
 		cpu_relax();					\
 	}							\
-	smp_acquire__after_ctrl_dep();				\
 	VAL;							\
 })
 #endif
 
+/**
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ */
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+	typeof(*ptr) _val;					\
+	_val = smp_cond_load_relaxed(ptr, cond_expr);		\
+	smp_acquire__after_ctrl_dep();				\
+	_val;							\
+})
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 8b276fd9a127..01ce3997cb42 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -654,6 +654,7 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 }
 #endif
 
+#define atomic_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
 #define atomic_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
 
 #ifdef CONFIG_GENERIC_ATOMIC64
@@ -1075,6 +1076,7 @@ static inline long long atomic64_fetch_andnot_release(long long i, atomic64_t *v
 }
 #endif
 
+#define atomic64_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
 #define atomic64_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
 
 #include <asm-generic/atomic-long.h>

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'
  2018-04-26 10:34 ` [PATCH v3 02/14] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
@ 2018-04-27  9:37   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, will.deacon, mingo, longman, torvalds, boqun.feng,
	tglx, peterz, hpa

Commit-ID:  625e88be1f41b53cec55827c984e4a89ea8ee9f9
Gitweb:     https://git.kernel.org/tip/625e88be1f41b53cec55827c984e4a89ea8ee9f9
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:16 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:45 +0200

locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'

'struct __qspinlock' provides a handy union of fields so that
subcomponents of the lockword can be accessed by name, without having to
manage shifts and masks explicitly and take endianness into account.

This is useful in qspinlock.h and also potentially in arch headers, so
move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra
definition.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/qspinlock.h          |  2 +-
 arch/x86/include/asm/qspinlock_paravirt.h |  3 +-
 include/asm-generic/qspinlock_types.h     | 32 +++++++++++++++++++--
 kernel/locking/qspinlock.c                | 46 ++-----------------------------
 kernel/locking/qspinlock_paravirt.h       | 34 ++++++++---------------
 5 files changed, 46 insertions(+), 71 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 5e16b5d40d32..90b0b0ed8161 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -16,7 +16,7 @@
  */
 static inline void native_queued_spin_unlock(struct qspinlock *lock)
 {
-	smp_store_release((u8 *)lock, 0);
+	smp_store_release(&lock->locked, 0);
 }
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index 923307ea11c7..9ef5ee03d2d7 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -22,8 +22,7 @@ PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
  *
  * void __pv_queued_spin_unlock(struct qspinlock *lock)
  * {
- *	struct __qspinlock *l = (void *)lock;
- *	u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *	u8 lockval = cmpxchg(&lock->locked, _Q_LOCKED_VAL, 0);
  *
  *	if (likely(lockval == _Q_LOCKED_VAL))
  *		return;
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 034acd0c4956..0763f065b975 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -29,13 +29,41 @@
 #endif
 
 typedef struct qspinlock {
-	atomic_t	val;
+	union {
+		atomic_t val;
+
+		/*
+		 * By using the whole 2nd least significant byte for the
+		 * pending bit, we can allow better optimization of the lock
+		 * acquisition for the pending bit holder.
+		 */
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u8	locked;
+			u8	pending;
+		};
+		struct {
+			u16	locked_pending;
+			u16	tail;
+		};
+#else
+		struct {
+			u16	tail;
+			u16	locked_pending;
+		};
+		struct {
+			u8	reserved[2];
+			u8	pending;
+			u8	locked;
+		};
+#endif
+	};
 } arch_spinlock_t;
 
 /*
  * Initializier
  */
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
 
 /*
  * Bitfields in the atomic value:
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d880296245c5..f5b0e59f6d14 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -114,40 +114,6 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
-/*
- * By using the whole 2nd least significant byte for the pending bit, we
- * can allow better optimization of the lock acquisition for the pending
- * bit holder.
- *
- * This internal structure is also used by the set_locked function which
- * is not restricted to _Q_PENDING_BITS == 8.
- */
-struct __qspinlock {
-	union {
-		atomic_t val;
-#ifdef __LITTLE_ENDIAN
-		struct {
-			u8	locked;
-			u8	pending;
-		};
-		struct {
-			u16	locked_pending;
-			u16	tail;
-		};
-#else
-		struct {
-			u16	tail;
-			u16	locked_pending;
-		};
-		struct {
-			u8	reserved[2];
-			u8	pending;
-			u8	locked;
-		};
-#endif
-	};
-};
-
 #if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
@@ -159,9 +125,7 @@ struct __qspinlock {
  */
 static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
+	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
 /*
@@ -176,13 +140,11 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
  */
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
-	struct __qspinlock *l = (void *)lock;
-
 	/*
 	 * Use release semantics to make sure that the MCS node is properly
 	 * initialized before changing the tail code.
 	 */
-	return (u32)xchg_release(&l->tail,
+	return (u32)xchg_release(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -237,9 +199,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
  */
 static __always_inline void set_locked(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
 }
 
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 6ee477765e6c..2711940429f5 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -87,8 +87,6 @@ struct pv_node {
 #define queued_spin_trylock(l)	pv_hybrid_queued_unfair_trylock(l)
 static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
 	/*
 	 * Stay in unfair lock mode as long as queued mode waiters are
 	 * present in the MCS wait queue but the pending bit isn't set.
@@ -97,7 +95,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 		int val = atomic_read(&lock->val);
 
 		if (!(val & _Q_LOCKED_PENDING_MASK) &&
-		   (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+		   (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
 			qstat_inc(qstat_pv_lock_stealing, true);
 			return true;
 		}
@@ -117,16 +115,12 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 #if _Q_PENDING_BITS == 8
 static __always_inline void set_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->pending, 1);
+	WRITE_ONCE(lock->pending, 1);
 }
 
 static __always_inline void clear_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->pending, 0);
+	WRITE_ONCE(lock->pending, 0);
 }
 
 /*
@@ -136,10 +130,8 @@ static __always_inline void clear_pending(struct qspinlock *lock)
  */
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	return !READ_ONCE(l->locked) &&
-	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
+	return !READ_ONCE(lock->locked) &&
+	       (cmpxchg_acquire(&lock->locked_pending, _Q_PENDING_VAL,
 				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
 #else /* _Q_PENDING_BITS == 8 */
@@ -384,7 +376,6 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
-	struct __qspinlock *l = (void *)lock;
 
 	/*
 	 * If the vCPU is indeed halted, advance its state to match that of
@@ -413,7 +404,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 * the hash table later on at unlock time, no atomic instruction is
 	 * needed.
 	 */
-	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+	WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
 	(void)pv_hash(lock, pn);
 }
 
@@ -428,7 +419,6 @@ static u32
 pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
-	struct __qspinlock *l = (void *)lock;
 	struct qspinlock **lp = NULL;
 	int waitcnt = 0;
 	int loop;
@@ -479,13 +469,13 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
+			if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
 				/*
 				 * The lock was free and now we own the lock.
 				 * Change the lock value back to _Q_LOCKED_VAL
 				 * and unhash the table.
 				 */
-				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+				WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
 				goto gotlock;
 			}
@@ -493,7 +483,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 		WRITE_ONCE(pn->state, vcpu_hashed);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
-		pv_wait(&l->locked, _Q_SLOW_VAL);
+		pv_wait(&lock->locked, _Q_SLOW_VAL);
 
 		/*
 		 * Because of lock stealing, the queue head vCPU may not be
@@ -518,7 +508,6 @@ gotlock:
 __visible void
 __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 {
-	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node;
 
 	if (unlikely(locked != _Q_SLOW_VAL)) {
@@ -547,7 +536,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * Now that we have a reference to the (likely) blocked pv_node,
 	 * release the lock.
 	 */
-	smp_store_release(&l->locked, 0);
+	smp_store_release(&lock->locked, 0);
 
 	/*
 	 * At this point the memory pointed at by lock can be freed/reused,
@@ -573,7 +562,6 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 #ifndef __pv_queued_spin_unlock
 __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
 	u8 locked;
 
 	/*
@@ -581,7 +569,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 * unhash. Otherwise it would be possible to have multiple @lock
 	 * entries, which would be BAD.
 	 */
-	locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0);
+	locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);
 	if (likely(locked == _Q_LOCKED_VAL))
 		return;
 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Bound spinning on pending->locked transition in slowpath
  2018-04-26 10:34 ` [PATCH v3 03/14] locking/qspinlock: Bound spinning on pending->locked transition in slowpath Will Deacon
@ 2018-04-27  9:37   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, peterz, will.deacon, torvalds, linux-kernel, tglx, hpa, longman

Commit-ID:  6512276d97b160d90b53285bd06f7f201459a7e3
Gitweb:     https://git.kernel.org/tip/6512276d97b160d90b53285bd06f7f201459a7e3
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:17 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:46 +0200

locking/qspinlock: Bound spinning on pending->locked transition in slowpath

If a locker taking the qspinlock slowpath reads a lock value indicating
that only the pending bit is set, then it will spin whilst the
concurrent pending->locked transition takes effect.

Unfortunately, there is no guarantee that such a transition will ever be
observed since concurrent lockers could continuously set pending and
hand over the lock amongst themselves, leading to starvation. Whilst
this would probably resolve in practice, it means that it is not
possible to prove liveness properties about the lock and means that lock
acquisition time is unbounded.

Rather than removing the pending->locked spinning from the slowpath
altogether (which has been shown to heavily penalise a 2-threaded
locking stress test on x86), this patch replaces the explicit spinning
with a call to atomic_cond_read_relaxed and allows the architecture to
provide a bound on the number of spins. For architectures that can
respond to changes in cacheline state in their smp_cond_load implementation,
it should be sufficient to use the default bound of 1.
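
An architecture opts in to a larger bound by defining the constant in its
asm/qspinlock.h, e.g. (this is exactly what x86 does in the next patch):

	#define _Q_PENDING_LOOPS	(1 << 9)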

Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index f5b0e59f6d14..a0f7976348f8 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -76,6 +76,18 @@
 #define MAX_NODES	4
 #endif
 
+/*
+ * The pending bit spinning loop count.
+ * This heuristic is used to limit the number of lockword accesses
+ * made by atomic_cond_read_relaxed when waiting for the lock to
+ * transition out of the "== _Q_PENDING_VAL" state. We don't spin
+ * indefinitely because there's no guarantee that we'll make forward
+ * progress.
+ */
+#ifndef _Q_PENDING_LOOPS
+#define _Q_PENDING_LOOPS	1
+#endif
+
 /*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
@@ -266,13 +278,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
-	 * wait for in-progress pending->locked hand-overs
+	 * Wait for in-progress pending->locked hand-overs with a bounded
+	 * number of spins so that we guarantee forward progress.
 	 *
 	 * 0,1,0 -> 0,0,1
 	 */
 	if (val == _Q_PENDING_VAL) {
-		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
-			cpu_relax();
+		int cnt = _Q_PENDING_LOOPS;
+		val = atomic_cond_read_relaxed(&lock->val,
+					       (VAL != _Q_PENDING_VAL) || !cnt--);
 	}
 
 	/*

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound
  2018-04-26 10:34 ` [PATCH v3 04/14] locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound Will Deacon
@ 2018-04-27  9:38   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:38 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, peterz, longman, linux-kernel, will.deacon, mingo, tglx, hpa

Commit-ID:  b247be3fe89b6aba928bf80f4453d1c4ba8d2063
Gitweb:     https://git.kernel.org/tip/b247be3fe89b6aba928bf80f4453d1c4ba8d2063
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:18 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:47 +0200

locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound

On x86, atomic_cond_read_relaxed will busy-wait with a cpu_relax() loop,
so it is desirable to increase the number of times we spin on the qspinlock
lockword when it is found to be transitioning from pending to locked.

According to Waiman Long:

 | Ideally, the spinning times should be at least a few times the typical
 | cacheline load time from memory which I think can be down to 100ns or
 | so for each cacheline load with the newest systems or up to several
 | hundreds ns for older systems.

which in his benchmarking corresponded to 512 iterations.

Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-5-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/qspinlock.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 90b0b0ed8161..da1370ad206d 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,6 +7,8 @@
 #include <asm-generic/qspinlock_types.h>
 #include <asm/paravirt.h>
 
+#define _Q_PENDING_LOOPS	(1 << 9)
+
 #define	queued_spin_unlock queued_spin_unlock
 /**
  * queued_spin_unlock - release a queued spinlock

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath
  2018-04-26 10:34 ` [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
  2018-04-26 15:53   ` Peter Zijlstra
  2018-04-26 20:16   ` Waiman Long
@ 2018-04-27  9:39   ` tip-bot for Will Deacon
  2 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, tglx, peterz, hpa, will.deacon, torvalds, longman

Commit-ID:  59fb586b4a07b4e1a0ee577140ab4842ba451acd
Gitweb:     https://git.kernel.org/tip/59fb586b4a07b4e1a0ee577140ab4842ba451acd
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:19 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:47 +0200

locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath

The qspinlock locking slowpath utilises a "pending" bit as a simple form
of an embedded test-and-set lock that can avoid the overhead of explicit
queuing in cases where the lock is held but uncontended. This bit is
managed using a cmpxchg() loop which tries to transition the uncontended
lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).

Unfortunately, the cmpxchg() loop is unbounded and lockers can be starved
indefinitely if the lock word is seen to oscillate between unlocked
(0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
able to take the lock in the cmpxchg() loop without queuing and pass it
around amongst themselves.

This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
using atomic_fetch_or, and then inspecting the old value to see whether
we need to spin on the current lock owner, or whether we now effectively
hold the lock. The tricky scenario is when concurrent lockers end up
queuing on the lock and the lock becomes available, causing us to see
a lockword of (n,0,0). With pending now set, simply queuing could lead
to deadlock as the head of the queue may not have observed the pending
flag being cleared. Conversely, if the head of the queue did observe
pending being cleared, then it could transition the lock from (n,0,0) ->
(0,0,1) meaning that any attempt to "undo" our setting of the pending
bit could race with a concurrent locker trying to set it.

We handle this race by preserving the pending bit when taking the lock
after reaching the head of the queue and leaving the tail entry intact
if we saw pending set, because we know that the tail is going to be
updated shortly.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-6-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c          | 102 ++++++++++++++++++++----------------
 kernel/locking/qspinlock_paravirt.h |   5 --
 2 files changed, 58 insertions(+), 49 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index a0f7976348f8..e06f67e021d9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -127,6 +127,17 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 #if _Q_PENDING_BITS == 8
+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	WRITE_ONCE(lock->pending, 0);
+}
+
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queued spinlock structure
@@ -162,6 +173,17 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 
 #else /* _Q_PENDING_BITS == 8 */
 
+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	atomic_andnot(_Q_PENDING_VAL, &lock->val);
+}
+
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queued spinlock structure
@@ -266,7 +288,7 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
 	struct mcs_spinlock *prev, *next, *node;
-	u32 new, old, tail;
+	u32 old, tail;
 	int idx;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -289,59 +311,51 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 					       (VAL != _Q_PENDING_VAL) || !cnt--);
 	}
 
+	/*
+	 * If we observe any contention; queue.
+	 */
+	if (val & ~_Q_LOCKED_MASK)
+		goto queue;
+
 	/*
 	 * trylock || pending
 	 *
 	 * 0,0,0 -> 0,0,1 ; trylock
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
-	for (;;) {
+	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	if (!(val & ~_Q_LOCKED_MASK)) {
 		/*
-		 * If we observe any contention; queue.
+		 * We're pending, wait for the owner to go away.
+		 *
+		 * *,1,1 -> *,1,0
+		 *
+		 * this wait loop must be a load-acquire such that we match the
+		 * store-release that clears the locked bit and create lock
+		 * sequentiality; this is because not all
+		 * clear_pending_set_locked() implementations imply full
+		 * barriers.
 		 */
-		if (val & ~_Q_LOCKED_MASK)
-			goto queue;
-
-		new = _Q_LOCKED_VAL;
-		if (val == new)
-			new |= _Q_PENDING_VAL;
+		if (val & _Q_LOCKED_MASK) {
+			smp_cond_load_acquire(&lock->val.counter,
+					      !(VAL & _Q_LOCKED_MASK));
+		}
 
 		/*
-		 * Acquire semantic is required here as the function may
-		 * return immediately if the lock was free.
+		 * take ownership and clear the pending bit.
+		 *
+		 * *,1,0 -> *,0,1
 		 */
-		old = atomic_cmpxchg_acquire(&lock->val, val, new);
-		if (old == val)
-			break;
-
-		val = old;
-	}
-
-	/*
-	 * we won the trylock
-	 */
-	if (new == _Q_LOCKED_VAL)
+		clear_pending_set_locked(lock);
 		return;
+	}
 
 	/*
-	 * we're pending, wait for the owner to go away.
-	 *
-	 * *,1,1 -> *,1,0
-	 *
-	 * this wait loop must be a load-acquire such that we match the
-	 * store-release that clears the locked bit and create lock
-	 * sequentiality; this is because not all clear_pending_set_locked()
-	 * implementations imply full barriers.
-	 */
-	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
-
-	/*
-	 * take ownership and clear the pending bit.
-	 *
-	 * *,1,0 -> *,0,1
+	 * If pending was clear but there are waiters in the queue, then
+	 * we need to undo our setting of pending before we queue ourselves.
 	 */
-	clear_pending_set_locked(lock);
-	return;
+	if (!(val & _Q_PENDING_MASK))
+		clear_pending(lock);
 
 	/*
 	 * End of pending bit optimistic spinning and beginning of MCS
@@ -445,15 +459,15 @@ locked:
 	 * claim the lock:
 	 *
 	 * n,0,0 -> 0,0,1 : lock, uncontended
-	 * *,0,0 -> *,0,1 : lock, contended
+	 * *,*,0 -> *,*,1 : lock, contended
 	 *
-	 * If the queue head is the only one in the queue (lock value == tail),
-	 * clear the tail code and grab the lock. Otherwise, we only need
-	 * to grab the lock.
+	 * If the queue head is the only one in the queue (lock value == tail)
+	 * and nobody is pending, clear the tail code and grab the lock.
+	 * Otherwise, we only need to grab the lock.
 	 */
 	for (;;) {
 		/* In the PV case we might already have _Q_LOCKED_VAL set */
-		if ((val & _Q_TAIL_MASK) != tail) {
+		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
 			set_locked(lock);
 			break;
 		}
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 2711940429f5..2dbad2f25480 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -118,11 +118,6 @@ static __always_inline void set_pending(struct qspinlock *lock)
 	WRITE_ONCE(lock->pending, 1);
 }
 
-static __always_inline void clear_pending(struct qspinlock *lock)
-{
-	WRITE_ONCE(lock->pending, 0);
-}
-
 /*
  * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
  * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Kill cmpxchg() loop when claiming lock from head of queue
  2018-04-26 10:34 ` [PATCH v3 06/14] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
@ 2018-04-27  9:39   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: longman, peterz, mingo, linux-kernel, will.deacon, torvalds, tglx, hpa

Commit-ID:  c61da58d8a9ba9238250a548f00826eaf44af0f7
Gitweb:     https://git.kernel.org/tip/c61da58d8a9ba9238250a548f00826eaf44af0f7
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:20 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:48 +0200

locking/qspinlock: Kill cmpxchg() loop when claiming lock from head of queue

When a queued locker reaches the head of the queue, it claims the lock
by setting _Q_LOCKED_VAL in the lockword. If there isn't contention, it
must also clear the tail as part of this operation so that subsequent
lockers can avoid taking the slowpath altogether.

Currently this is expressed as a cmpxchg() loop that practically only
runs up to two iterations. This is confusing to the reader and unhelpful
to the compiler. Rewrite the cmpxchg() loop without the loop, so that a
failed cmpxchg() implies that there is contention and we just need to
write to _Q_LOCKED_VAL without considering the rest of the lockword.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-7-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index e06f67e021d9..b51494a50b1e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -465,24 +465,21 @@ locked:
 	 * and nobody is pending, clear the tail code and grab the lock.
 	 * Otherwise, we only need to grab the lock.
 	 */
-	for (;;) {
-		/* In the PV case we might already have _Q_LOCKED_VAL set */
-		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
-			set_locked(lock);
-			break;
-		}
+
+	/* In the PV case we might already have _Q_LOCKED_VAL set */
+	if ((val & _Q_TAIL_MASK) == tail) {
 		/*
 		 * The smp_cond_load_acquire() call above has provided the
-		 * necessary acquire semantics required for locking. At most
-		 * two iterations of this loop may be ran.
+		 * necessary acquire semantics required for locking.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
-			goto release;	/* No contention */
-
-		val = old;
+			goto release; /* No contention */
 	}
 
+	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
+	set_locked(lock);
+
 	/*
 	 * contended path; wait for next if not observed yet, release.
 	 */

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Use atomic_cond_read_acquire()
  2018-04-26 10:34 ` [PATCH v3 07/14] locking/qspinlock: Use atomic_cond_read_acquire Will Deacon
@ 2018-04-27  9:40   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: longman, tglx, peterz, linux-kernel, mingo, torvalds, will.deacon, hpa

Commit-ID:  f9c811fac48cfbbfb452b08d1042386947868d07
Gitweb:     https://git.kernel.org/tip/f9c811fac48cfbbfb452b08d1042386947868d07
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:21 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:49 +0200

locking/qspinlock: Use atomic_cond_read_acquire()

Rather than dig into the counter field of the atomic_t inside the
qspinlock structure so that we can call smp_cond_load_acquire(), use
atomic_cond_read_acquire() instead, which operates on the atomic_t
directly.
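
For reference, a minimal sketch of what these wrappers are expected to
look like (assumed shape, not necessarily the verbatim definitions in
include/linux/atomic.h): the atomic_t variants simply forward to the
smp_cond_load_*() helpers on the embedded counter, so callers no longer
have to reach into ->counter themselves.

	#define atomic_cond_read_relaxed(v, c) \
		smp_cond_load_relaxed(&(v)->counter, (c))
	#define atomic_cond_read_acquire(v, c) \
		smp_cond_load_acquire(&(v)->counter, (c))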

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-8-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b51494a50b1e..56af1fa9874d 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -337,8 +337,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		 * barriers.
 		 */
 		if (val & _Q_LOCKED_MASK) {
-			smp_cond_load_acquire(&lock->val.counter,
-					      !(VAL & _Q_LOCKED_MASK));
+			atomic_cond_read_acquire(&lock->val,
+						 !(VAL & _Q_LOCKED_MASK));
 		}
 
 		/*
@@ -441,8 +441,8 @@ queue:
 	 *
 	 * The PV pv_wait_head_or_lock function, if active, will acquire
 	 * the lock and return a non-zero value. So we have to skip the
-	 * smp_cond_load_acquire() call. As the next PV queue head hasn't been
-	 * designated yet, there is no way for the locked value to become
+	 * atomic_cond_read_acquire() call. As the next PV queue head hasn't
+	 * been designated yet, there is no way for the locked value to become
 	 * _Q_SLOW_VAL. So both the set_locked() and the
 	 * atomic_cmpxchg_relaxed() calls will be safe.
 	 *
@@ -452,7 +452,7 @@ queue:
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
-	val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));
+	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
 
 locked:
 	/*
@@ -469,7 +469,7 @@ locked:
 	/* In the PV case we might already have _Q_LOCKED_VAL set */
 	if ((val & _Q_TAIL_MASK) == tail) {
 		/*
-		 * The smp_cond_load_acquire() call above has provided the
+		 * The atomic_cond_read_acquire() call above has provided the
 		 * necessary acquire semantics required for locking.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/mcs: Use smp_cond_load_acquire() in MCS spin loop
  2018-04-26 10:34 ` [PATCH v3 08/14] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop Will Deacon
@ 2018-04-27  9:40   ` tip-bot for Jason Low
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Jason Low @ 2018-04-27  9:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, mingo, tglx, hpa, jason.low2, will.deacon, longman,
	torvalds, linux-kernel

Commit-ID:  7f56b58a92aaf2cab049f32a19af7cc57a3972f2
Gitweb:     https://git.kernel.org/tip/7f56b58a92aaf2cab049f32a19af7cc57a3972f2
Author:     Jason Low <jason.low2@hp.com>
AuthorDate: Thu, 26 Apr 2018 11:34:22 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:49 +0200

locking/mcs: Use smp_cond_load_acquire() in MCS spin loop

For qspinlocks on ARM64, we would like to use WFE instead
of purely spinning. Qspinlocks internally have lock
contenders spin on an MCS lock.

Update arch_mcs_spin_lock_contended() such that it uses
the new smp_cond_load_acquire() so that ARM64 can also
override this spin loop with its own implementation using WFE.

On x86, this can also be cheaper than spinning on
smp_load_acquire().
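
As a rough sketch of how the pieces compose (assumed shape of the
generic fallback, not the verbatim asm-generic definition): the acquire
variant is a relaxed spin-wait followed by an acquire fence, and it is
the relaxed spin-wait that architectures such as ARM64 can override to
wait for an event (WFE) rather than spin.

	#ifndef smp_cond_load_acquire
	#define smp_cond_load_acquire(ptr, cond_expr) ({		\
		typeof(*(ptr)) _val;					\
		_val = smp_cond_load_relaxed(ptr, cond_expr);		\
		smp_acquire__after_ctrl_dep();				\
		_val;							\
	})
	#endif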

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-9-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/mcs_spinlock.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index f046b7ce9dd6..5e10153b4d3c 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -23,13 +23,15 @@ struct mcs_spinlock {
 
 #ifndef arch_mcs_spin_lock_contended
 /*
- * Using smp_load_acquire() provides a memory barrier that ensures
- * subsequent operations happen after the lock is acquired.
+ * Using smp_cond_load_acquire() provides the acquire semantics
+ * required so that subsequent operations happen after the
+ * lock is acquired. Additionally, some architectures such as
+ * ARM64 would like to do spin-waiting instead of purely
+ * spinning, and smp_cond_load_acquire() provides that behavior.
  */
 #define arch_mcs_spin_lock_contended(l)					\
 do {									\
-	while (!(smp_load_acquire(l)))					\
-		cpu_relax();						\
+	smp_cond_load_acquire(l, VAL);					\
 } while (0)
 #endif
 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Use smp_cond_load_relaxed() to wait for next node
  2018-04-26 10:34 ` [PATCH v3 09/14] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node Will Deacon
@ 2018-04-27  9:41   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, hpa, mingo, longman, will.deacon, linux-kernel, peterz, torvalds

Commit-ID:  c131a198c497db436b558ac5e9a140cdcb91b304
Gitweb:     https://git.kernel.org/tip/c131a198c497db436b558ac5e9a140cdcb91b304
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:23 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:50 +0200

locking/qspinlock: Use smp_cond_load_relaxed() to wait for next node

When a locker reaches the head of the queue and takes the lock, a
concurrent locker may enqueue and force the lock holder to spin
whilst its node->next field is initialised. Rather than open-code
a READ_ONCE/cpu_relax() loop, this can be implemented using
smp_cond_load_relaxed() instead.
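
For completeness, the generic fallback for smp_cond_load_relaxed() is
essentially the same READ_ONCE()/cpu_relax() loop, just centralised so
that architectures can substitute a more efficient wait. A sketch of
the assumed shape:

	#ifndef smp_cond_load_relaxed
	#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
		typeof(ptr) __PTR = (ptr);				\
		typeof(*(ptr)) VAL;					\
		for (;;) {						\
			VAL = READ_ONCE(*__PTR);			\
			if (cond_expr)					\
				break;					\
			cpu_relax();					\
		}							\
		VAL;							\
	})
	#endif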

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-10-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 56af1fa9874d..d6c3b029bd93 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -483,10 +483,8 @@ locked:
 	/*
 	 * contended path; wait for next if not observed yet, release.
 	 */
-	if (!next) {
-		while (!(next = READ_ONCE(node->next)))
-			cpu_relax();
-	}
+	if (!next)
+		next = smp_cond_load_relaxed(&node->next, (VAL));
 
 	arch_mcs_spin_unlock_contended(&next->locked);
 	pv_kick_node(lock, next);

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Use smp_store_release() in queued_spin_unlock()
  2018-04-26 10:34 ` [PATCH v3 10/14] locking/qspinlock: Make queued_spin_unlock use smp_store_release Will Deacon
@ 2018-04-27  9:42   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: will.deacon, torvalds, tglx, peterz, longman, hpa, mingo, linux-kernel

Commit-ID:  626e5fbc14358901ddaa90ce510e0fbeab310432
Gitweb:     https://git.kernel.org/tip/626e5fbc14358901ddaa90ce510e0fbeab310432
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:24 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:51 +0200

locking/qspinlock: Use smp_store_release() in queued_spin_unlock()

A qspinlock can be unlocked simply by writing zero to the locked byte.
This can be implemented in the generic code, so do that and remove the
arch-specific override for x86 in the !PV case.
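
The reason a plain byte store suffices: the locked byte aliases the
bottom bits of the lockword and is owned exclusively by the lock
holder, so clearing it cannot clobber the pending or tail fields used
by concurrent slowpath lockers. A sketch of the assumed little-endian
layout (big-endian kernels order the bytes the other way round):

	typedef struct qspinlock {
		union {
			atomic_t val;
			struct {			/* little-endian */
				u8	locked;
				u8	pending;
			};
			struct {
				u16	locked_pending;
				u16	tail;
			};
		};
	} arch_spinlock_t;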

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-11-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/qspinlock.h | 17 ++++++-----------
 include/asm-generic/qspinlock.h  |  2 +-
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index da1370ad206d..3e70bed8a978 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -9,6 +9,12 @@
 
 #define _Q_PENDING_LOOPS	(1 << 9)
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_init_lock_hash(void);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
+
 #define	queued_spin_unlock queued_spin_unlock
 /**
  * queued_spin_unlock - release a queued spinlock
@@ -21,12 +27,6 @@ static inline void native_queued_spin_unlock(struct qspinlock *lock)
 	smp_store_release(&lock->locked, 0);
 }
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_init_lock_hash(void);
-extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
-
 static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
 	pv_queued_spin_lock_slowpath(lock, val);
@@ -42,11 +42,6 @@ static inline bool vcpu_is_preempted(long cpu)
 {
 	return pv_vcpu_is_preempted(cpu);
 }
-#else
-static inline void queued_spin_unlock(struct qspinlock *lock)
-{
-	native_queued_spin_unlock(lock);
-}
 #endif
 
 #ifdef CONFIG_PARAVIRT
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index b37b4ad7eb94..a8ed0a352d75 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -100,7 +100,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 	/*
 	 * unlock() needs release semantics:
 	 */
-	(void)atomic_sub_return_release(_Q_LOCKED_VAL, &lock->val);
+	smp_store_release(&lock->locked, 0);
 }
 #endif
 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-26 10:34 ` [PATCH v3 11/14] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
@ 2018-04-27  9:42   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: will.deacon, torvalds, linux-kernel, peterz, mingo, tglx, hpa, longman

Commit-ID:  9d4646d14d51d62b967a12452c30ea7edf8dd8fa
Gitweb:     https://git.kernel.org/tip/9d4646d14d51d62b967a12452c30ea7edf8dd8fa
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:25 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:52 +0200

locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()

The qspinlock slowpath must ensure that the MCS node is fully initialised
before it can be reached by another CPU. This is currently enforced
by using a RELEASE operation when updating the tail and also when linking
the node into the waitqueue, since the control dependency off xchg_tail
is insufficient to enforce the required ordering, see:

  95bcade33a8a ("locking/qspinlock: Ensure node is initialised before updating prev->next")

Back-to-back RELEASE operations may be expensive on some architectures,
particularly those that implement them using fences under the hood. We
can replace the two RELEASE operations with a single smp_wmb() fence and
use RELAXED operations for the subsequent publishing of the node.
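
The idiom being relied upon is the classic message-passing pattern:
order all of the initialising stores before the single store that
publishes the node, and let the consumer's address dependency (via
READ_ONCE()) order its dereference after the pointer load. A minimal
sketch with hypothetical names, not the slowpath code itself:

	/* Writer: initialise the object, then publish it. */
	obj->a = 1;
	obj->b = 2;
	smp_wmb();			/* order init before publication */
	WRITE_ONCE(shared_ptr, obj);	/* relaxed publish is now enough */

	/*
	 * Reader: once the pointer is observed, the address dependency
	 * from READ_ONCE() ensures the initialised values are visible.
	 */
	p = READ_ONCE(shared_ptr);
	if (p)
		r = p->a;		/* observes 1 */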

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-12-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d6c3b029bd93..956a12983bd0 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -164,10 +164,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
 	/*
-	 * Use release semantics to make sure that the MCS node is properly
-	 * initialized before changing the tail code.
+	 * We can use relaxed semantics since the caller ensures that the
+	 * MCS node is properly initialized before updating the tail.
 	 */
-	return (u32)xchg_release(&lock->tail,
+	return (u32)xchg_relaxed(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -212,10 +212,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
 		/*
-		 * Use release semantics to make sure that the MCS node is
-		 * properly initialized before changing the tail code.
+		 * We can use relaxed semantics since the caller ensures that
+		 * the MCS node is properly initialized before updating the
+		 * tail.
 		 */
-		old = atomic_cmpxchg_release(&lock->val, val, new);
+		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -388,12 +389,18 @@ queue:
 		goto release;
 
 	/*
+	 * Ensure that the initialisation of @node is complete before we
+	 * publish the updated tail via xchg_tail() and potentially link
+	 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
+	 */
+	smp_wmb();
+
+	/*
+	 * Publish the updated tail.
 	 * We have already touched the queueing cacheline; don't bother with
 	 * pending stuff.
 	 *
 	 * p,*,* -> n,*,*
-	 *
-	 * RELEASE, such that the stores to @node must be complete.
 	 */
 	old = xchg_tail(lock, tail);
 	next = NULL;
@@ -405,14 +412,8 @@ queue:
 	if (old & _Q_TAIL_MASK) {
 		prev = decode_tail(old);
 
-		/*
-		 * We must ensure that the stores to @node are observed before
-		 * the write to prev->next. The address dependency from
-		 * xchg_tail is not sufficient to ensure this because the read
-		 * component of xchg_tail is unordered with respect to the
-		 * initialisation of @node.
-		 */
-		smp_store_release(&prev->next, node);
+		/* Link @node into the waitqueue. */
+		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Use try_cmpxchg() instead of cmpxchg() when locking
  2018-04-26 10:34 ` [PATCH v3 12/14] locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking Will Deacon
@ 2018-04-27  9:43   ` tip-bot for Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, will.deacon, tglx, mingo, peterz, longman, hpa, linux-kernel

Commit-ID:  ae75d9089ff7095d1d1a12c3cd86b21d3eaf3b15
Gitweb:     https://git.kernel.org/tip/ae75d9089ff7095d1d1a12c3cd86b21d3eaf3b15
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:26 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:52 +0200

locking/qspinlock: Use try_cmpxchg() instead of cmpxchg() when locking

When reaching the head of an uncontended queue on the qspinlock slow-path,
using a try_cmpxchg() instead of a cmpxchg() operation to transition the
lockword to _Q_LOCKED_VAL generates slightly better code for x86 and
pretty much identical code for arm64.
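
The semantic difference that makes this possible: atomic_try_cmpxchg()
returns a boolean and, on failure, writes the value it actually found
back through its 'old' pointer, so the separate compare-and-reload
disappears (on x86 the boolean maps directly onto the ZF output of
CMPXCHG). Side by side, using the names from the patch below:

	/* cmpxchg() idiom: compare the returned old value by hand. */
	old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
	if (old == val)
		goto release;	/* uncontended; we claimed the lock */

	/* try_cmpxchg() idiom: boolean result; 'val' is refreshed on failure. */
	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
		goto release;	/* uncontended; we claimed the lock */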

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-13-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 956a12983bd0..46813185957b 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -467,16 +467,15 @@ locked:
 	 * Otherwise, we only need to grab the lock.
 	 */
 
-	/* In the PV case we might already have _Q_LOCKED_VAL set */
-	if ((val & _Q_TAIL_MASK) == tail) {
-		/*
-		 * The atomic_cond_read_acquire() call above has provided the
-		 * necessary acquire semantics required for locking.
-		 */
-		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
-		if (old == val)
-			goto release; /* No contention */
-	}
+	/*
+	 * In the PV case we might already have _Q_LOCKED_VAL set.
+	 *
+	 * The atomic_cond_read_acquire() call above has provided the
+	 * necessary acquire semantics required for locking.
+	 */
+	if (((val & _Q_TAIL_MASK) == tail) &&
+	    atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		goto release; /* No contention */
 
 	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
 	set_locked(lock);

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Add stat tracking for pending vs. slowpath
  2018-04-26 10:34 ` [PATCH v3 13/14] locking/qspinlock: Add stat tracking for pending vs slowpath Will Deacon
@ 2018-04-27  9:43   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Waiman Long @ 2018-04-27  9:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, longman, torvalds, peterz, hpa, tglx, mingo

Commit-ID:  81d3dc9a349b1e61d77106bbb05a6e6dd29b9d5e
Gitweb:     https://git.kernel.org/tip/81d3dc9a349b1e61d77106bbb05a6e6dd29b9d5e
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Thu, 26 Apr 2018 11:34:27 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:53 +0200

locking/qspinlock: Add stat tracking for pending vs. slowpath

Currently, the qspinlock_stat code tracks only statistical counts in the
PV qspinlock code. However, it may also be useful to track the number
of locking operations done via the pending code vs. the MCS lock queue
slowpath for the non-PV case.

The qspinlock stat code is modified to do that. The stat counter
pv_lock_slowpath is renamed to lock_slowpath so that it can be used by
both the PV and non-PV cases.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1524738868-31318-14-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c          | 14 +++++++++++---
 kernel/locking/qspinlock_paravirt.h |  7 +------
 kernel/locking/qspinlock_stat.h     |  9 ++++++---
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 46813185957b..bfaeb05123ff 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -12,11 +12,11 @@
  * GNU General Public License for more details.
  *
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
- * (C) Copyright 2013-2014 Red Hat, Inc.
+ * (C) Copyright 2013-2014,2018 Red Hat, Inc.
  * (C) Copyright 2015 Intel Corp.
  * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hpe.com>
+ * Authors: Waiman Long <longman@redhat.com>
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
@@ -32,6 +32,11 @@
 #include <asm/byteorder.h>
 #include <asm/qspinlock.h>
 
+/*
+ * Include queued spinlock statistics code
+ */
+#include "qspinlock_stat.h"
+
 /*
  * The basic principle of a queue-based spinlock can best be understood
  * by studying a classic queue-based spinlock implementation called the
@@ -295,7 +300,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
 	if (pv_enabled())
-		goto queue;
+		goto pv_queue;
 
 	if (virt_spin_lock(lock))
 		return;
@@ -348,6 +353,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		 * *,1,0 -> *,0,1
 		 */
 		clear_pending_set_locked(lock);
+		qstat_inc(qstat_lock_pending, true);
 		return;
 	}
 
@@ -363,6 +369,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * queuing.
 	 */
 queue:
+	qstat_inc(qstat_lock_slowpath, true);
+pv_queue:
 	node = this_cpu_ptr(&mcs_nodes[0]);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 2dbad2f25480..25730b2ac022 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -55,11 +55,6 @@ struct pv_node {
 	u8			state;
 };
 
-/*
- * Include queued spinlock statistics code
- */
-#include "qspinlock_stat.h"
-
 /*
  * Hybrid PV queued/unfair lock
  *
@@ -428,7 +423,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 	/*
 	 * Tracking # of slowpath locking operations
 	 */
-	qstat_inc(qstat_pv_lock_slowpath, true);
+	qstat_inc(qstat_lock_slowpath, true);
 
 	for (;; waitcnt++) {
 		/*
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 4a30ef63c607..6bd78c0740fc 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -22,13 +22,14 @@
  *   pv_kick_wake	- # of vCPU kicks used for computing pv_latency_wake
  *   pv_latency_kick	- average latency (ns) of vCPU kick operation
  *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
- *   pv_lock_slowpath	- # of locking operations via the slowpath
  *   pv_lock_stealing	- # of lock stealing operations
  *   pv_spurious_wakeup	- # of spurious wakeups in non-head vCPUs
  *   pv_wait_again	- # of wait's after a queue head vCPU kick
  *   pv_wait_early	- # of early vCPU wait's
  *   pv_wait_head	- # of vCPU wait's at the queue head
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
+ *   lock_pending	- # of locking operations via pending code
+ *   lock_slowpath	- # of locking operations via MCS lock queue
  *
  * Writing to the "reset_counters" file will reset all the above counter
  * values.
@@ -46,13 +47,14 @@ enum qlock_stats {
 	qstat_pv_kick_wake,
 	qstat_pv_latency_kick,
 	qstat_pv_latency_wake,
-	qstat_pv_lock_slowpath,
 	qstat_pv_lock_stealing,
 	qstat_pv_spurious_wakeup,
 	qstat_pv_wait_again,
 	qstat_pv_wait_early,
 	qstat_pv_wait_head,
 	qstat_pv_wait_node,
+	qstat_lock_pending,
+	qstat_lock_slowpath,
 	qstat_num,	/* Total number of statistical counters */
 	qstat_reset_cnts = qstat_num,
 };
@@ -73,12 +75,13 @@ static const char * const qstat_names[qstat_num + 1] = {
 	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
 	[qstat_pv_latency_kick]	   = "pv_latency_kick",
 	[qstat_pv_latency_wake]    = "pv_latency_wake",
-	[qstat_pv_lock_slowpath]   = "pv_lock_slowpath",
 	[qstat_pv_lock_stealing]   = "pv_lock_stealing",
 	[qstat_pv_wait_again]      = "pv_wait_again",
 	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
+	[qstat_lock_pending]       = "lock_pending",
+	[qstat_lock_slowpath]      = "lock_slowpath",
 	[qstat_reset_cnts]         = "reset_counters",
 };
 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] MAINTAINERS: Add myself as a co-maintainer for the locking subsystem
  2018-04-26 10:34 ` [PATCH v3 14/14] MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES Will Deacon
  2018-04-26 15:55   ` Peter Zijlstra
@ 2018-04-27  9:44   ` tip-bot for Will Deacon
  1 sibling, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27  9:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: will.deacon, longman, hpa, linux-kernel, mingo, torvalds, peterz, tglx

Commit-ID:  baa8c6ddf7be33f2b0ddeb68906d668caf646baa
Gitweb:     https://git.kernel.org/tip/baa8c6ddf7be33f2b0ddeb68906d668caf646baa
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Thu, 26 Apr 2018 11:34:28 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 09:48:54 +0200

MAINTAINERS: Add myself as a co-maintainer for the locking subsystem

I've been heavily involved with concurrency and memory ordering stuff
(see ATOMIC INFRASTRUCTURE and LINUX KERNEL MEMORY CONSISTENCY MODEL)
and with arm64 now using qrwlock with a view to using qspinlock in the
near future, I'm going to continue being involved with the core locking
primitives. Reflect this by adding myself as a co-maintainer alongside
Ingo and Peter.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-15-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index dd66ae9a847e..e4585e33862c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8328,6 +8328,7 @@ F:	Documentation/admin-guide/LSM/LoadPin.rst
 LOCKING PRIMITIVES
 M:	Peter Zijlstra <peterz@infradead.org>
 M:	Ingo Molnar <mingo@redhat.com>
+M:	Will Deacon <will.deacon@arm.com>
 L:	linux-kernel@vger.kernel.org
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
 S:	Maintained

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-26 20:16   ` Waiman Long
@ 2018-04-27 10:16     ` Will Deacon
  2018-04-27 11:01       ` [tip:locking/core] locking/qspinlock: Remove duplicate clear_pending() function from PV code tip-bot for Will Deacon
  2018-04-27 13:09       ` [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Waiman Long
  0 siblings, 2 replies; 41+ messages in thread
From: Will Deacon @ 2018-04-27 10:16 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

Hi Waiman,

On Thu, Apr 26, 2018 at 04:16:30PM -0400, Waiman Long wrote:
> On 04/26/2018 06:34 AM, Will Deacon wrote:
> > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > index 2711940429f5..2dbad2f25480 100644
> > --- a/kernel/locking/qspinlock_paravirt.h
> > +++ b/kernel/locking/qspinlock_paravirt.h
> > @@ -118,11 +118,6 @@ static __always_inline void set_pending(struct qspinlock *lock)
> >  	WRITE_ONCE(lock->pending, 1);
> >  }
> >  
> > -static __always_inline void clear_pending(struct qspinlock *lock)
> > -{
> > -	WRITE_ONCE(lock->pending, 0);
> > -}
> > -
> >  /*
> >   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> >   * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> 
> There is another clear_pending() function after the "#else /*
> _Q_PENDING_BITS == 8 */" line that need to be removed as well.

Bugger, sorry I missed that one. Is the >= 16K CPUs case supported elsewhere
in Linux? The x86 Kconfig appears to clamp NR_CPUS to 8192 iiuc.
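
For context: with NR_CPUS >= 16K the tail no longer fits in the top 16
bits of the lockword, so _Q_PENDING_BITS drops from 8 to 1 and pending
shares its byte with tail bits; clearing it then needs an atomic RMW
rather than a plain byte store. The two native variants are expected
to end up looking roughly like this:

	#if _Q_PENDING_BITS == 8
	static __always_inline void clear_pending(struct qspinlock *lock)
	{
		WRITE_ONCE(lock->pending, 0);
	}
	#else /* _Q_PENDING_BITS == 1 */
	static __always_inline void clear_pending(struct qspinlock *lock)
	{
		atomic_andnot(_Q_PENDING_VAL, &lock->val);
	}
	#endif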

Anyway, additional patch below. Ingo -- please can you apply this on top?

Thanks,

Will

--->8

From ef6aa51e47047fe1a57dfdbe2f45caf63fa95be4 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Fri, 27 Apr 2018 10:40:13 +0100
Subject: [PATCH] locking/qspinlock: Remove duplicate clear_pending function
 from PV code

The native clear_pending function is identical to the PV version, so the
latter can simply be removed. This fixes the build for systems with >=
16K CPUs using the PV lock implementation.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock_paravirt.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 25730b2ac022..5a0cf5f9008c 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -130,11 +130,6 @@ static __always_inline void set_pending(struct qspinlock *lock)
 	atomic_or(_Q_PENDING_VAL, &lock->val);
 }
 
-static __always_inline void clear_pending(struct qspinlock *lock)
-{
-	atomic_andnot(_Q_PENDING_VAL, &lock->val);
-}
-
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
 	int val = atomic_read(&lock->val);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/qspinlock: Remove duplicate clear_pending() function from PV code
  2018-04-27 10:16     ` Will Deacon
@ 2018-04-27 11:01       ` tip-bot for Will Deacon
  2018-04-27 13:09       ` [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Waiman Long
  1 sibling, 0 replies; 41+ messages in thread
From: tip-bot for Will Deacon @ 2018-04-27 11:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, mingo, linux-kernel, peterz, will.deacon, longman, tglx, torvalds

Commit-ID:  3bea9adc96842b8a7345c7fb202c16ae9c8d5b25
Gitweb:     https://git.kernel.org/tip/3bea9adc96842b8a7345c7fb202c16ae9c8d5b25
Author:     Will Deacon <will.deacon@arm.com>
AuthorDate: Fri, 27 Apr 2018 10:40:13 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Apr 2018 12:55:22 +0200

locking/qspinlock: Remove duplicate clear_pending() function from PV code

The native clear_pending() function is identical to the PV version, so the
latter can simply be removed.

This fixes the build for systems with >= 16K CPUs using the PV lock implementation.

Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20180427101619.GB21705@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock_paravirt.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 25730b2ac022..5a0cf5f9008c 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -130,11 +130,6 @@ static __always_inline void set_pending(struct qspinlock *lock)
 	atomic_or(_Q_PENDING_VAL, &lock->val);
 }
 
-static __always_inline void clear_pending(struct qspinlock *lock)
-{
-	atomic_andnot(_Q_PENDING_VAL, &lock->val);
-}
-
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
 	int val = atomic_read(&lock->val);

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-27 10:16     ` Will Deacon
  2018-04-27 11:01       ` [tip:locking/core] locking/qspinlock: Remove duplicate clear_pending() function from PV code tip-bot for Will Deacon
@ 2018-04-27 13:09       ` Waiman Long
  1 sibling, 0 replies; 41+ messages in thread
From: Waiman Long @ 2018-04-27 13:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On 04/27/2018 06:16 AM, Will Deacon wrote:
> Hi Waiman,
>
> On Thu, Apr 26, 2018 at 04:16:30PM -0400, Waiman Long wrote:
>> On 04/26/2018 06:34 AM, Will Deacon wrote:
>>> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
>>> index 2711940429f5..2dbad2f25480 100644
>>> --- a/kernel/locking/qspinlock_paravirt.h
>>> +++ b/kernel/locking/qspinlock_paravirt.h
>>> @@ -118,11 +118,6 @@ static __always_inline void set_pending(struct qspinlock *lock)
>>>  	WRITE_ONCE(lock->pending, 1);
>>>  }
>>>  
>>> -static __always_inline void clear_pending(struct qspinlock *lock)
>>> -{
>>> -	WRITE_ONCE(lock->pending, 0);
>>> -}
>>> -
>>>  /*
>>>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
>>>   * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
>> There is another clear_pending() function after the "#else /*
>> _Q_PENDING_BITS == 8 */" line that need to be removed as well.
> Bugger, sorry I missed that one. Is the >= 16K CPUs case supported elsewhere
> in Linux? The x86 Kconfig appears to clamp NR_CPUS to 8192 iiuc.
>
> Anyway, additional patch below. Ingo -- please can you apply this on top?
>
I don't think we support >= 16k in any of the distros. However, this
will be a limit that we will reach eventually. That is why I said we can
wait.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-26 16:55     ` Will Deacon
@ 2018-04-28 12:45       ` Peter Zijlstra
  2018-04-30  8:53         ` Will Deacon
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2018-04-28 12:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck, longman

On Thu, Apr 26, 2018 at 05:55:19PM +0100, Will Deacon wrote:
> Hi Peter,
> 
> On Thu, Apr 26, 2018 at 05:53:35PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 26, 2018 at 11:34:19AM +0100, Will Deacon wrote:
> > > @@ -290,58 +312,50 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  	}
> > >  
> > >  	/*
> > > +	 * If we observe any contention; queue.
> > > +	 */
> > > +	if (val & ~_Q_LOCKED_MASK)
> > > +		goto queue;
> > > +
> > > +	/*
> > >  	 * trylock || pending
> > >  	 *
> > >  	 * 0,0,0 -> 0,0,1 ; trylock
> > >  	 * 0,0,1 -> 0,1,1 ; pending
> > >  	 */
> > > +	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> > > +	if (!(val & ~_Q_LOCKED_MASK)) {
> > >  		/*
> > > +		 * we're pending, wait for the owner to go away.
> > > +		 *
> > > +		 * *,1,1 -> *,1,0
> > 
> > Tail must be 0 here, right?
> 
> Not necessarily. If we're concurrently setting pending with another slowpath
> locker, they could queue in the tail behind us, so we can't mess with those
> upper bits.

Could be my brain just entirely stopped working; but I read that as:

	!(val & ~0xFF) := !(val & 0xFFFFFF00)

which then pretty much mandates the top bits are empty, no?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-28 12:45       ` Peter Zijlstra
@ 2018-04-30  8:53         ` Will Deacon
  0 siblings, 0 replies; 41+ messages in thread
From: Will Deacon @ 2018-04-30  8:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck, longman

On Sat, Apr 28, 2018 at 02:45:37PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 26, 2018 at 05:55:19PM +0100, Will Deacon wrote:
> > On Thu, Apr 26, 2018 at 05:53:35PM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 26, 2018 at 11:34:19AM +0100, Will Deacon wrote:
> > > > @@ -290,58 +312,50 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > > >  	}
> > > >  
> > > >  	/*
> > > > +	 * If we observe any contention; queue.
> > > > +	 */
> > > > +	if (val & ~_Q_LOCKED_MASK)
> > > > +		goto queue;
> > > > +
> > > > +	/*
> > > >  	 * trylock || pending
> > > >  	 *
> > > >  	 * 0,0,0 -> 0,0,1 ; trylock
> > > >  	 * 0,0,1 -> 0,1,1 ; pending
> > > >  	 */
> > > > +	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> > > > +	if (!(val & ~_Q_LOCKED_MASK)) {
> > > >  		/*
> > > > +		 * we're pending, wait for the owner to go away.
> > > > +		 *
> > > > +		 * *,1,1 -> *,1,0
> > > 
> > > Tail must be 0 here, right?
> > 
> > Not necessarily. If we're concurrently setting pending with another slowpath
> > locker, they could queue in the tail behind us, so we can't mess with those
> > upper bits.
> 
> Could be my brain just entirely stopped working; but I read that as:
> 
> 	!(val & ~0xFF) := !(val & 0xFFFFFF00)
> 
> which then pretty much mandates the top bits are empty, no?

Only if there isn't a concurrent locker. For example:


T0:
// fastpath fails to acquire the lock, returns val == _Q_LOCKED_VAL

if (val & ~_Q_LOCKED_MASK)
	goto queue;

// Fallthrough

T1:
// fastpath fails to acquire the lock, returns val == _Q_LOCKED_VAL

if (val & ~_Q_LOCKED_MASK)
	goto queue;

// Fallthrough

T0:
val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
// val == _Q_LOCKED_VAL

T1:
val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
// val == _Q_PENDING_VAL | _Q_LOCKED_VAL
// Queue into tail

T0:
// Spins for _Q_LOCKED_MASK to go to zero, but tail is *non-zero*


So it's really down to whether the state transitions in the comments refer
to the lockword in memory, or the local "val" variable. I think the former
is more instructive, because the whole thing is concurrent.

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2018-04-30  8:52 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-26 10:34 [PATCH v3 00/14] kernel/locking: qspinlock improvements Will Deacon
2018-04-26 10:34 ` [PATCH v3 01/14] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
2018-04-27  9:36   ` [tip:locking/core] locking/barriers: Introduce smp_cond_load_relaxed() and atomic_cond_read_relaxed() tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 02/14] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
2018-04-27  9:37   ` [tip:locking/core] locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock' tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 03/14] locking/qspinlock: Bound spinning on pending->locked transition in slowpath Will Deacon
2018-04-27  9:37   ` [tip:locking/core] " tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 04/14] locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound Will Deacon
2018-04-27  9:38   ` [tip:locking/core] " tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
2018-04-26 15:53   ` Peter Zijlstra
2018-04-26 16:55     ` Will Deacon
2018-04-28 12:45       ` Peter Zijlstra
2018-04-30  8:53         ` Will Deacon
2018-04-26 20:16   ` Waiman Long
2018-04-27 10:16     ` Will Deacon
2018-04-27 11:01       ` [tip:locking/core] locking/qspinlock: Remove duplicate clear_pending() function from PV code tip-bot for Will Deacon
2018-04-27 13:09       ` [PATCH v3 05/14] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Waiman Long
2018-04-27  9:39   ` [tip:locking/core] locking/qspinlock: Remove unbounded cmpxchg() " tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 06/14] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
2018-04-27  9:39   ` [tip:locking/core] locking/qspinlock: Kill cmpxchg() " tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 07/14] locking/qspinlock: Use atomic_cond_read_acquire Will Deacon
2018-04-27  9:40   ` [tip:locking/core] locking/qspinlock: Use atomic_cond_read_acquire() tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 08/14] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop Will Deacon
2018-04-27  9:40   ` [tip:locking/core] locking/mcs: Use smp_cond_load_acquire() in MCS " tip-bot for Jason Low
2018-04-26 10:34 ` [PATCH v3 09/14] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node Will Deacon
2018-04-27  9:41   ` [tip:locking/core] locking/qspinlock: Use smp_cond_load_relaxed() " tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 10/14] locking/qspinlock: Make queued_spin_unlock use smp_store_release Will Deacon
2018-04-27  9:42   ` [tip:locking/core] locking/qspinlock: Use smp_store_release() in queued_spin_unlock() tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 11/14] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
2018-04-27  9:42   ` [tip:locking/core] " tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 12/14] locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking Will Deacon
2018-04-27  9:43   ` [tip:locking/core] locking/qspinlock: Use try_cmpxchg() instead of cmpxchg() " tip-bot for Will Deacon
2018-04-26 10:34 ` [PATCH v3 13/14] locking/qspinlock: Add stat tracking for pending vs slowpath Will Deacon
2018-04-27  9:43   ` [tip:locking/core] locking/qspinlock: Add stat tracking for pending vs. slowpath tip-bot for Waiman Long
2018-04-26 10:34 ` [PATCH v3 14/14] MAINTAINERS: Add myself as a co-maintainer for LOCKING PRIMITIVES Will Deacon
2018-04-26 15:55   ` Peter Zijlstra
2018-04-27  9:44   ` [tip:locking/core] MAINTAINERS: Add myself as a co-maintainer for the locking subsystem tip-bot for Will Deacon
2018-04-26 15:54 ` [PATCH v3 00/14] kernel/locking: qspinlock improvements Peter Zijlstra
2018-04-27  9:33   ` Ingo Molnar
2018-04-26 20:18 ` Waiman Long
