* [PATCH v2 00/13] kernel/locking: qspinlock improvements
@ 2018-04-11 18:01 Will Deacon
  2018-04-11 18:01 ` [PATCH v2 01/13] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
                   ` (13 more replies)
  0 siblings, 14 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

Hi all,

Here's v2 of the qspinlock patches I posted last week:

  https://lkml.org/lkml/2018/4/5/496

Changes since v1 include:
  * Use WRITE_ONCE to clear the pending bit if we set it erroneously
  * Report pending and slowpath acquisitions via the qspinlock stat
    mechanism [Waiman Long]
  * Spin for a bounded duration while lock is observed in the
    pending->locked transition
  * Use try_cmpxchg to get better codegen on x86
  * Reword comments

All comments welcome,

Will

--->8

Jason Low (1):
  locking/mcs: Use smp_cond_load_acquire() in mcs spin loop

Waiman Long (1):
  locking/qspinlock: Add stat tracking for pending vs slowpath

Will Deacon (11):
  barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
  locking/qspinlock: Bound spinning on pending->locked transition in
    slowpath
  locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound
  locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  locking/qspinlock: Kill cmpxchg loop when claiming lock from head of
    queue
  locking/qspinlock: Use atomic_cond_read_acquire
  locking/qspinlock: Use smp_cond_load_relaxed to wait for next node
  locking/qspinlock: Merge struct __qspinlock into struct qspinlock
  locking/qspinlock: Make queued_spin_unlock use smp_store_release
  locking/qspinlock: Elide back-to-back RELEASE operations with
    smp_wmb()
  locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking

 arch/x86/include/asm/qspinlock.h          |  21 ++-
 arch/x86/include/asm/qspinlock_paravirt.h |   3 +-
 include/asm-generic/atomic-long.h         |   2 +
 include/asm-generic/barrier.h             |  27 +++-
 include/asm-generic/qspinlock.h           |   2 +-
 include/asm-generic/qspinlock_types.h     |  32 +++-
 include/linux/atomic.h                    |   2 +
 kernel/locking/mcs_spinlock.h             |  10 +-
 kernel/locking/qspinlock.c                | 247 ++++++++++++++----------------
 kernel/locking/qspinlock_paravirt.h       |  41 ++---
 kernel/locking/qspinlock_stat.h           |   9 +-
 11 files changed, 209 insertions(+), 187 deletions(-)

-- 
2.1.4

* [PATCH v2 01/13] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 02/13] locking/qspinlock: Bound spinning on pending->locked transition in slowpath Will Deacon
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

Whilst we currently provide smp_cond_load_acquire and
atomic_cond_read_acquire, there are cases where the ACQUIRE semantics are
not required because of a subsequent fence or release operation once the
conditional loop has exited.

This patch adds relaxed versions of the conditional spinning primitives
to avoid unnecessary barrier overhead on architectures such as arm64.
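
As a usage sketch (hypothetical waiter; the flag variable and the
explicit barrier below are illustrative, not taken from this series),
the relaxed variant suffices whenever the caller provides its own
ordering once the loop has exited:

	/* Spin with no ordering guarantees... */
	smp_cond_load_relaxed(&flag, VAL != 0);

	/*
	 * ...because this full barrier already orders the critical
	 * work against the loop exit.
	 */
	smp_mb();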

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/atomic-long.h |  2 ++
 include/asm-generic/barrier.h     | 27 +++++++++++++++++++++------
 include/linux/atomic.h            |  2 ++
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/atomic-long.h b/include/asm-generic/atomic-long.h
index 34a028a7bcc5..5b2b0b5ea06d 100644
--- a/include/asm-generic/atomic-long.h
+++ b/include/asm-generic/atomic-long.h
@@ -244,6 +244,8 @@ static inline long atomic_long_add_unless(atomic_long_t *l, long a, long u)
 #define atomic_long_inc_not_zero(l) \
 	ATOMIC_LONG_PFX(_inc_not_zero)((ATOMIC_LONG_PFX(_t) *)(l))
 
+#define atomic_long_cond_read_relaxed(v, c) \
+	ATOMIC_LONG_PFX(_cond_read_relaxed)((ATOMIC_LONG_PFX(_t) *)(v), (c))
 #define atomic_long_cond_read_acquire(v, c) \
 	ATOMIC_LONG_PFX(_cond_read_acquire)((ATOMIC_LONG_PFX(_t) *)(v), (c))
 
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index fe297b599b0a..305e03b19a26 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -221,18 +221,17 @@ do {									\
 #endif
 
 /**
- * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * smp_cond_load_relaxed() - (Spin) wait for cond with no ordering guarantees
  * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
  *
- * Equivalent to using smp_load_acquire() on the condition variable but employs
- * the control dependency of the wait to reduce the barrier on many platforms.
+ * Equivalent to using READ_ONCE() on the condition variable.
  *
  * Due to C lacking lambda expressions we load the value of *ptr into a
  * pre-named variable @VAL to be used in @cond.
  */
-#ifndef smp_cond_load_acquire
-#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+#ifndef smp_cond_load_relaxed
+#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
 	typeof(ptr) __PTR = (ptr);				\
 	typeof(*ptr) VAL;					\
 	for (;;) {						\
@@ -241,10 +240,26 @@ do {									\
 			break;					\
 		cpu_relax();					\
 	}							\
-	smp_acquire__after_ctrl_dep();				\
 	VAL;							\
 })
 #endif
 
+/**
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ */
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+	typeof(*ptr) _val;					\
+	_val = smp_cond_load_relaxed(ptr, cond_expr);		\
+	smp_acquire__after_ctrl_dep();				\
+	_val;							\
+})
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 8b276fd9a127..01ce3997cb42 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -654,6 +654,7 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 }
 #endif
 
+#define atomic_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
 #define atomic_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
 
 #ifdef CONFIG_GENERIC_ATOMIC64
@@ -1075,6 +1076,7 @@ static inline long long atomic64_fetch_andnot_release(long long i, atomic64_t *v
 }
 #endif
 
+#define atomic64_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
 #define atomic64_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
 
 #include <asm-generic/atomic-long.h>
-- 
2.1.4

* [PATCH v2 02/13] locking/qspinlock: Bound spinning on pending->locked transition in slowpath
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
  2018-04-11 18:01 ` [PATCH v2 01/13] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 03/13] locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound Will Deacon
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

If a locker taking the qspinlock slowpath reads a lock value indicating
that only the pending bit is set, then it will spin whilst the
concurrent pending->locked transition takes effect.

Unfortunately, there is no guarantee that such a transition will ever be
observed since concurrent lockers could continuously set pending and
hand over the lock amongst themselves, leading to starvation. Whilst
this would probably resolve itself in practice, it makes it impossible
to prove liveness properties about the lock and leaves lock acquisition
time unbounded.

Rather than removing the pending->locked spinning from the slowpath
altogether (which has been shown to heavily penalise a 2-threaded
locking stress test on x86), this patch replaces the explicit spinning
with a call to atomic_cond_read_relaxed and allows the architecture to
provide a bound on the number of spins. For architectures that can
respond to changes in cacheline state in their smp_cond_load implementation,
it should be sufficient to use the default bound of 1.
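
An architecture whose implementation simply busy-waits with cpu_relax()
can override the bound from its asm/qspinlock.h; the next patch in this
series does exactly that for x86:

	#define _Q_PENDING_LOOPS	(1 << 9)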

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d880296245c5..396701e8c62d 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -77,6 +77,18 @@
 #endif
 
 /*
+ * The pending bit spinning loop count.
+ * This heuristic is used to limit the number of lockword accesses
+ * made by atomic_cond_read_relaxed when waiting for the lock to
+ * transition out of the "== _Q_PENDING_VAL" state. We don't spin
+ * indefinitely because there's no guarantee that we'll make forward
+ * progress.
+ */
+#ifndef _Q_PENDING_LOOPS
+#define _Q_PENDING_LOOPS	1
+#endif
+
+/*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
@@ -306,13 +318,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
-	 * wait for in-progress pending->locked hand-overs
+	 * Wait for in-progress pending->locked hand-overs with a bounded
+	 * number of spins so that we guarantee forward progress.
 	 *
 	 * 0,1,0 -> 0,0,1
 	 */
 	if (val == _Q_PENDING_VAL) {
-		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
-			cpu_relax();
+		int cnt = _Q_PENDING_LOOPS;
+		val = atomic_cond_read_relaxed(&lock->val,
+					       (VAL != _Q_PENDING_VAL) || !cnt--);
 	}
 
 	/*
-- 
2.1.4

* [PATCH v2 03/13] locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
  2018-04-11 18:01 ` [PATCH v2 01/13] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
  2018-04-11 18:01 ` [PATCH v2 02/13] locking/qspinlock: Bound spinning on pending->locked transition in slowpath Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

On x86, atomic_cond_read_relaxed will busy-wait with a cpu_relax() loop,
so it is desirable to increase the number of times we spin on the qspinlock
lockword when it is found to be transitioning from pending to locked.

According to Waiman Long:

| Ideally, the spinning times should be at least a few times the typical
| cacheline load time from memory which I think can be down to 100ns or
| so for each cacheline load with the newest systems or up to several
| hundreds ns for older systems.

which in his benchmarking corresponded to 512 (i.e. 1 << 9) iterations.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/x86/include/asm/qspinlock.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 5e16b5d40d32..2f09915f4aa4 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,6 +7,8 @@
 #include <asm-generic/qspinlock_types.h>
 #include <asm/paravirt.h>
 
+#define _Q_PENDING_LOOPS	(1 << 9)
+
 #define	queued_spin_unlock queued_spin_unlock
 /**
  * queued_spin_unlock - release a queued spinlock
-- 
2.1.4

* [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (2 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 03/13] locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 19:34   ` Waiman Long
  2018-04-11 19:53   ` Waiman Long
  2018-04-11 18:01 ` [PATCH v2 05/13] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

The qspinlock locking slowpath utilises a "pending" bit as a simple form
of an embedded test-and-set lock that can avoid the overhead of explicit
queuing in cases where the lock is held but uncontended. This bit is
managed using a cmpxchg loop which tries to transition the uncontended
lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).

Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
indefinitely if the lock word is seen to oscillate between unlocked
(0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
able to take the lock in the cmpxchg loop without queuing and pass it
around amongst themselves.

This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
using atomic_fetch_or, and then inspecting the old value to see whether
we need to spin on the current lock owner, or whether we now effectively
hold the lock. The tricky scenario is when concurrent lockers end up
queuing on the lock and the lock becomes available, causing us to see
a lockword of (n,0,0). With pending now set, simply queuing could lead
to deadlock as the head of the queue may not have observed the pending
flag being cleared. Conversely, if the head of the queue did observe
pending being cleared, then it could transition the lock from (n,0,0) ->
(0,0,1) meaning that any attempt to "undo" our setting of the pending
bit could race with a concurrent locker trying to set it.

We handle this race by preserving the pending bit when taking the lock
after reaching the head of the queue and leaving the tail entry intact
if we saw pending set, because we know that the tail is going to be
updated shortly.
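
To make the deadlock hazard concrete, here is an illustrative
interleaving (not from the patch; lock word shown as (tail, pending,
locked)):

	locker L                        queue head H
	reads (n,0,0)                   spins while locked or pending is set
	fetch_or: (n,0,0) -> (n,1,0)
	queues behind H                 pending never clears: deadlock

Hence the new *,*,0 -> *,*,1 transition below: the head may now take
the lock with the pending bit still set.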

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 102 ++++++++++++++++++++++++++-------------------
 1 file changed, 58 insertions(+), 44 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 396701e8c62d..a8fc402b3f3a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -162,6 +162,17 @@ struct __qspinlock {
 
 #if _Q_PENDING_BITS == 8
 /**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	WRITE_ONCE(lock->pending, 0);
+}
+
+/**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queued spinlock structure
  *
@@ -201,6 +212,17 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 #else /* _Q_PENDING_BITS == 8 */
 
 /**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+	atomic_andnot(_Q_PENDING_VAL, &lock->val);
+}
+
+/**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queued spinlock structure
  *
@@ -306,7 +328,7 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
 	struct mcs_spinlock *prev, *next, *node;
-	u32 new, old, tail;
+	u32 old, tail;
 	int idx;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -330,58 +352,50 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	}
 
 	/*
+	 * If we observe any contention; queue.
+	 */
+	if (val & ~_Q_LOCKED_MASK)
+		goto queue;
+
+	/*
 	 * trylock || pending
 	 *
 	 * 0,0,0 -> 0,0,1 ; trylock
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
-	for (;;) {
+	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	if (!(val & ~_Q_LOCKED_MASK)) {
 		/*
-		 * If we observe any contention; queue.
+		 * we're pending, wait for the owner to go away.
+		 *
+		 * *,1,1 -> *,1,0
+		 *
+		 * this wait loop must be a load-acquire such that we match the
+		 * store-release that clears the locked bit and create lock
+		 * sequentiality; this is because not all
+		 * clear_pending_set_locked() implementations imply full
+		 * barriers.
 		 */
-		if (val & ~_Q_LOCKED_MASK)
-			goto queue;
-
-		new = _Q_LOCKED_VAL;
-		if (val == new)
-			new |= _Q_PENDING_VAL;
+		if (val & _Q_LOCKED_MASK) {
+			smp_cond_load_acquire(&lock->val.counter,
+					      !(VAL & _Q_LOCKED_MASK));
+		}
 
 		/*
-		 * Acquire semantic is required here as the function may
-		 * return immediately if the lock was free.
+		 * take ownership and clear the pending bit.
+		 *
+		 * *,1,0 -> *,0,1
 		 */
-		old = atomic_cmpxchg_acquire(&lock->val, val, new);
-		if (old == val)
-			break;
-
-		val = old;
-	}
-
-	/*
-	 * we won the trylock
-	 */
-	if (new == _Q_LOCKED_VAL)
+		clear_pending_set_locked(lock);
 		return;
+	}
 
 	/*
-	 * we're pending, wait for the owner to go away.
-	 *
-	 * *,1,1 -> *,1,0
-	 *
-	 * this wait loop must be a load-acquire such that we match the
-	 * store-release that clears the locked bit and create lock
-	 * sequentiality; this is because not all clear_pending_set_locked()
-	 * implementations imply full barriers.
-	 */
-	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
-
-	/*
-	 * take ownership and clear the pending bit.
-	 *
-	 * *,1,0 -> *,0,1
+	 * If pending was clear but there are waiters in the queue, then
+	 * we need to undo our setting of pending before we queue ourselves.
 	 */
-	clear_pending_set_locked(lock);
-	return;
+	if (!(val & _Q_PENDING_MASK))
+		clear_pending(lock);
 
 	/*
 	 * End of pending bit optimistic spinning and beginning of MCS
@@ -485,15 +499,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * claim the lock:
 	 *
 	 * n,0,0 -> 0,0,1 : lock, uncontended
-	 * *,0,0 -> *,0,1 : lock, contended
+	 * *,*,0 -> *,*,1 : lock, contended
 	 *
-	 * If the queue head is the only one in the queue (lock value == tail),
-	 * clear the tail code and grab the lock. Otherwise, we only need
-	 * to grab the lock.
+	 * If the queue head is the only one in the queue (lock value == tail)
+	 * and nobody is pending, clear the tail code and grab the lock.
+	 * Otherwise, we only need to grab the lock.
 	 */
 	for (;;) {
 		/* In the PV case we might already have _Q_LOCKED_VAL set */
-		if ((val & _Q_TAIL_MASK) != tail) {
+		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
 			set_locked(lock);
 			break;
 		}
-- 
2.1.4

* [PATCH v2 05/13] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (3 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 06/13] locking/qspinlock: Use atomic_cond_read_acquire Will Deacon
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

When a queued locker reaches the head of the queue, it claims the lock
by setting _Q_LOCKED_VAL in the lockword. If there isn't contention, it
must also clear the tail as part of this operation so that subsequent
lockers can avoid taking the slowpath altogether.

Currently this is expressed as a cmpxchg loop that practically only
runs up to two iterations. This is confusing to the reader and unhelpful
to the compiler. Rewrite the cmpxchg loop without the loop, so that a
failed cmpxchg implies that there is contention and we just need to
write _Q_LOCKED_VAL to the locked byte without considering the rest of
the lockword.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index a8fc402b3f3a..01b660442d87 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -505,24 +505,21 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * and nobody is pending, clear the tail code and grab the lock.
 	 * Otherwise, we only need to grab the lock.
 	 */
-	for (;;) {
-		/* In the PV case we might already have _Q_LOCKED_VAL set */
-		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
-			set_locked(lock);
-			break;
-		}
+
+	/* In the PV case we might already have _Q_LOCKED_VAL set */
+	if ((val & _Q_TAIL_MASK) == tail) {
 		/*
 		 * The smp_cond_load_acquire() call above has provided the
-		 * necessary acquire semantics required for locking. At most
-		 * two iterations of this loop may be ran.
+		 * necessary acquire semantics required for locking.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
-			goto release;	/* No contention */
-
-		val = old;
+			goto release; /* No contention */
 	}
 
+	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
+	set_locked(lock);
+
 	/*
 	 * contended path; wait for next if not observed yet, release.
 	 */
-- 
2.1.4

* [PATCH v2 06/13] locking/qspinlock: Use atomic_cond_read_acquire
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (4 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 05/13] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 07/13] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop Will Deacon
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

Rather than dig into the counter field of the atomic_t inside the
qspinlock structure so that we can call smp_cond_load_acquire, use
atomic_cond_read_acquire instead, which operates on the atomic_t
directly.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 01b660442d87..648a16a2cd23 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -377,8 +377,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		 * barriers.
 		 */
 		if (val & _Q_LOCKED_MASK) {
-			smp_cond_load_acquire(&lock->val.counter,
-					      !(VAL & _Q_LOCKED_MASK));
+			atomic_cond_read_acquire(&lock->val,
+						 !(VAL & _Q_LOCKED_MASK));
 		}
 
 		/*
@@ -481,8 +481,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *
 	 * The PV pv_wait_head_or_lock function, if active, will acquire
 	 * the lock and return a non-zero value. So we have to skip the
-	 * smp_cond_load_acquire() call. As the next PV queue head hasn't been
-	 * designated yet, there is no way for the locked value to become
+	 * atomic_cond_read_acquire() call. As the next PV queue head hasn't
+	 * been designated yet, there is no way for the locked value to become
 	 * _Q_SLOW_VAL. So both the set_locked() and the
 	 * atomic_cmpxchg_relaxed() calls will be safe.
 	 *
@@ -492,7 +492,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
-	val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));
+	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
 
 locked:
 	/*
@@ -509,7 +509,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/* In the PV case we might already have _Q_LOCKED_VAL set */
 	if ((val & _Q_TAIL_MASK) == tail) {
 		/*
-		 * The smp_cond_load_acquire() call above has provided the
+		 * The atomic_cond_read_acquire() call above has provided the
 		 * necessary acquire semantics required for locking.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
-- 
2.1.4

* [PATCH v2 07/13] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (5 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 06/13] locking/qspinlock: Use atomic_cond_read_acquire Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 08/13] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node Will Deacon
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Jason Low, Will Deacon

From: Jason Low <jason.low2@hp.com>

For qspinlocks on ARM64, we would like to use WFE instead
of purely spinning. Qspinlocks internally have lock
contenders spin on an MCS lock.

Update arch_mcs_spin_lock_contended() such that it uses
the new smp_cond_load_acquire() so that ARM64 can also
override this spin loop with its own implementation using WFE.

On x86, this can also be cheaper than spinning on
smp_load_acquire().
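
For reference, this is how the contended MCS acquisition now reads at
its call site in kernel/locking/qspinlock.c (the bare VAL condition
means "spin until the loaded value is non-zero"):

	/* Wait until our predecessor writes node->locked = 1. */
	arch_mcs_spin_lock_contended(&node->locked);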

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/mcs_spinlock.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index f046b7ce9dd6..5e10153b4d3c 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -23,13 +23,15 @@ struct mcs_spinlock {
 
 #ifndef arch_mcs_spin_lock_contended
 /*
- * Using smp_load_acquire() provides a memory barrier that ensures
- * subsequent operations happen after the lock is acquired.
+ * Using smp_cond_load_acquire() provides the acquire semantics
+ * required so that subsequent operations happen after the
+ * lock is acquired. Additionally, some architectures such as
+ * ARM64 would like to do spin-waiting instead of purely
+ * spinning, and smp_cond_load_acquire() provides that behavior.
  */
 #define arch_mcs_spin_lock_contended(l)					\
 do {									\
-	while (!(smp_load_acquire(l)))					\
-		cpu_relax();						\
+	smp_cond_load_acquire(l, VAL);					\
 } while (0)
 #endif
 
-- 
2.1.4

* [PATCH v2 08/13] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (6 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 07/13] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 09/13] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

When a locker reaches the head of the queue and takes the lock, a
concurrent locker may enqueue and force the lock holder to spin
whilst its node->next field is initialised. Rather than open-coding
a READ_ONCE()/cpu_relax() loop, implement this wait using
smp_cond_load_relaxed().

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 648a16a2cd23..c781ddbe59a6 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -523,10 +523,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/*
 	 * contended path; wait for next if not observed yet, release.
 	 */
-	if (!next) {
-		while (!(next = READ_ONCE(node->next)))
-			cpu_relax();
-	}
+	if (!next)
+		next = smp_cond_load_relaxed(&node->next, (VAL));
 
 	arch_mcs_spin_unlock_contended(&next->locked);
 	pv_kick_node(lock, next);
-- 
2.1.4

* [PATCH v2 09/13] locking/qspinlock: Merge struct __qspinlock into struct qspinlock
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (7 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 08/13] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 19:13   ` Waiman Long
  2018-04-11 18:01 ` [PATCH v2 10/13] locking/qspinlock: Make queued_spin_unlock use smp_store_release Will Deacon
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

struct __qspinlock provides a handy union of fields so that
subcomponents of the lockword can be accessed by name, without having to
manage shifts and masks explicitly and take endianness into account.

This is useful in qspinlock.h and also potentially in arch headers, so
move the struct __qspinlock into struct qspinlock and kill the extra
definition.
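
The effect on the helpers is a mechanical simplification; set_locked()
(taken from the diff below) is typical:

	/* before: cast through the shadow structure */
	struct __qspinlock *l = (void *)lock;
	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);

	/* after: access the byte by name */
	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);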

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/x86/include/asm/qspinlock.h          |  2 +-
 arch/x86/include/asm/qspinlock_paravirt.h |  3 +-
 include/asm-generic/qspinlock_types.h     | 32 +++++++++++++++++++--
 kernel/locking/qspinlock.c                | 46 ++-----------------------------
 kernel/locking/qspinlock_paravirt.h       | 34 ++++++++---------------
 5 files changed, 46 insertions(+), 71 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 2f09915f4aa4..da1370ad206d 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -18,7 +18,7 @@
  */
 static inline void native_queued_spin_unlock(struct qspinlock *lock)
 {
-	smp_store_release((u8 *)lock, 0);
+	smp_store_release(&lock->locked, 0);
 }
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index 923307ea11c7..9ef5ee03d2d7 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -22,8 +22,7 @@ PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
  *
  * void __pv_queued_spin_unlock(struct qspinlock *lock)
  * {
- *	struct __qspinlock *l = (void *)lock;
- *	u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *	u8 lockval = cmpxchg(&lock->locked, _Q_LOCKED_VAL, 0);
  *
  *	if (likely(lockval == _Q_LOCKED_VAL))
  *		return;
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 034acd0c4956..0763f065b975 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -29,13 +29,41 @@
 #endif
 
 typedef struct qspinlock {
-	atomic_t	val;
+	union {
+		atomic_t val;
+
+		/*
+		 * By using the whole 2nd least significant byte for the
+		 * pending bit, we can allow better optimization of the lock
+		 * acquisition for the pending bit holder.
+		 */
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u8	locked;
+			u8	pending;
+		};
+		struct {
+			u16	locked_pending;
+			u16	tail;
+		};
+#else
+		struct {
+			u16	tail;
+			u16	locked_pending;
+		};
+		struct {
+			u8	reserved[2];
+			u8	pending;
+			u8	locked;
+		};
+#endif
+	};
 } arch_spinlock_t;
 
 /*
  * Initializier
  */
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
 
 /*
  * Bitfields in the atomic value:
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c781ddbe59a6..7b8c81ebb15e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -126,40 +126,6 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
-/*
- * By using the whole 2nd least significant byte for the pending bit, we
- * can allow better optimization of the lock acquisition for the pending
- * bit holder.
- *
- * This internal structure is also used by the set_locked function which
- * is not restricted to _Q_PENDING_BITS == 8.
- */
-struct __qspinlock {
-	union {
-		atomic_t val;
-#ifdef __LITTLE_ENDIAN
-		struct {
-			u8	locked;
-			u8	pending;
-		};
-		struct {
-			u16	locked_pending;
-			u16	tail;
-		};
-#else
-		struct {
-			u16	tail;
-			u16	locked_pending;
-		};
-		struct {
-			u8	reserved[2];
-			u8	pending;
-			u8	locked;
-		};
-#endif
-	};
-};
-
 #if _Q_PENDING_BITS == 8
 /**
  * clear_pending - clear the pending bit.
@@ -182,9 +148,7 @@ static __always_inline void clear_pending(struct qspinlock *lock)
  */
 static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
+	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
 /*
@@ -199,13 +163,11 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
  */
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
-	struct __qspinlock *l = (void *)lock;
-
 	/*
 	 * Use release semantics to make sure that the MCS node is properly
 	 * initialized before changing the tail code.
 	 */
-	return (u32)xchg_release(&l->tail,
+	return (u32)xchg_release(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -271,9 +233,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
  */
 static __always_inline void set_locked(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
 }
 
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 6ee477765e6c..2711940429f5 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -87,8 +87,6 @@ struct pv_node {
 #define queued_spin_trylock(l)	pv_hybrid_queued_unfair_trylock(l)
 static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
 	/*
 	 * Stay in unfair lock mode as long as queued mode waiters are
 	 * present in the MCS wait queue but the pending bit isn't set.
@@ -97,7 +95,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 		int val = atomic_read(&lock->val);
 
 		if (!(val & _Q_LOCKED_PENDING_MASK) &&
-		   (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+		   (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
 			qstat_inc(qstat_pv_lock_stealing, true);
 			return true;
 		}
@@ -117,16 +115,12 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 #if _Q_PENDING_BITS == 8
 static __always_inline void set_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->pending, 1);
+	WRITE_ONCE(lock->pending, 1);
 }
 
 static __always_inline void clear_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->pending, 0);
+	WRITE_ONCE(lock->pending, 0);
 }
 
 /*
@@ -136,10 +130,8 @@ static __always_inline void clear_pending(struct qspinlock *lock)
  */
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	return !READ_ONCE(l->locked) &&
-	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
+	return !READ_ONCE(lock->locked) &&
+	       (cmpxchg_acquire(&lock->locked_pending, _Q_PENDING_VAL,
 				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
 #else /* _Q_PENDING_BITS == 8 */
@@ -384,7 +376,6 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
-	struct __qspinlock *l = (void *)lock;
 
 	/*
 	 * If the vCPU is indeed halted, advance its state to match that of
@@ -413,7 +404,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 * the hash table later on at unlock time, no atomic instruction is
 	 * needed.
 	 */
-	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+	WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
 	(void)pv_hash(lock, pn);
 }
 
@@ -428,7 +419,6 @@ static u32
 pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
-	struct __qspinlock *l = (void *)lock;
 	struct qspinlock **lp = NULL;
 	int waitcnt = 0;
 	int loop;
@@ -479,13 +469,13 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
+			if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
 				/*
 				 * The lock was free and now we own the lock.
 				 * Change the lock value back to _Q_LOCKED_VAL
 				 * and unhash the table.
 				 */
-				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+				WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
 				goto gotlock;
 			}
@@ -493,7 +483,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 		WRITE_ONCE(pn->state, vcpu_hashed);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
-		pv_wait(&l->locked, _Q_SLOW_VAL);
+		pv_wait(&lock->locked, _Q_SLOW_VAL);
 
 		/*
 		 * Because of lock stealing, the queue head vCPU may not be
@@ -518,7 +508,6 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 __visible void
 __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 {
-	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node;
 
 	if (unlikely(locked != _Q_SLOW_VAL)) {
@@ -547,7 +536,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * Now that we have a reference to the (likely) blocked pv_node,
 	 * release the lock.
 	 */
-	smp_store_release(&l->locked, 0);
+	smp_store_release(&lock->locked, 0);
 
 	/*
 	 * At this point the memory pointed at by lock can be freed/reused,
@@ -573,7 +562,6 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 #ifndef __pv_queued_spin_unlock
 __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
 	u8 locked;
 
 	/*
@@ -581,7 +569,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 * unhash. Otherwise it would be possible to have multiple @lock
 	 * entries, which would be BAD.
 	 */
-	locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0);
+	locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);
 	if (likely(locked == _Q_LOCKED_VAL))
 		return;
 
-- 
2.1.4

* [PATCH v2 10/13] locking/qspinlock: Make queued_spin_unlock use smp_store_release
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (8 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 09/13] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 11/13] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

A qspinlock can be unlocked simply by writing zero to the locked byte.
This can be implemented in the generic code, so do that and remove the
arch-specific override for x86 in the !PV case.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/x86/include/asm/qspinlock.h | 17 ++++++-----------
 include/asm-generic/qspinlock.h  |  2 +-
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index da1370ad206d..3e70bed8a978 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -9,6 +9,12 @@
 
 #define _Q_PENDING_LOOPS	(1 << 9)
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_init_lock_hash(void);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
+
 #define	queued_spin_unlock queued_spin_unlock
 /**
  * queued_spin_unlock - release a queued spinlock
@@ -21,12 +27,6 @@ static inline void native_queued_spin_unlock(struct qspinlock *lock)
 	smp_store_release(&lock->locked, 0);
 }
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_init_lock_hash(void);
-extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
-
 static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
 	pv_queued_spin_lock_slowpath(lock, val);
@@ -42,11 +42,6 @@ static inline bool vcpu_is_preempted(long cpu)
 {
 	return pv_vcpu_is_preempted(cpu);
 }
-#else
-static inline void queued_spin_unlock(struct qspinlock *lock)
-{
-	native_queued_spin_unlock(lock);
-}
 #endif
 
 #ifdef CONFIG_PARAVIRT
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index b37b4ad7eb94..a8ed0a352d75 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -100,7 +100,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 	/*
 	 * unlock() needs release semantics:
 	 */
-	(void)atomic_sub_return_release(_Q_LOCKED_VAL, &lock->val);
+	smp_store_release(&lock->locked, 0);
 }
 #endif
 
-- 
2.1.4

* [PATCH v2 11/13] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (9 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 10/13] locking/qspinlock: Make queued_spin_unlock use smp_store_release Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 12/13] locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking Will Deacon
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

The qspinlock slowpath must ensure that the MCS node is fully initialised
before it can be reached by another CPU. This is currently enforced
by using a RELEASE operation when updating the tail and also when linking
the node into the waitqueue (since the control dependency off xchg_tail
is insufficient to enforce the required ordering -- see 95bcade33a8a
("locking/qspinlock: Ensure node is initialised before updating prev->next")).

Back-to-back RELEASE operations may be expensive on some architectures,
particularly those that implement them using fences under the hood. We
can replace the two RELEASE operations with a single smp_wmb() fence and
use RELAXED operations for the subsequent publishing of the node.
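
A condensed sketch of the resulting publish sequence (the node
initialisation is paraphrased from the existing slowpath for context;
the rest is taken from the diff below):

	node->locked = 0;
	node->next = NULL;

	/*
	 * One fence orders the node initialisation before both of the
	 * publishing stores that follow.
	 */
	smp_wmb();

	old = xchg_tail(lock, tail);	/* now uses xchg_relaxed() inside */
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);
		/* plain store, was smp_store_release() */
		WRITE_ONCE(prev->next, node);
	}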

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7b8c81ebb15e..fa5d2ab369f9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -164,10 +164,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
 	/*
-	 * Use release semantics to make sure that the MCS node is properly
-	 * initialized before changing the tail code.
+	 * We can use relaxed semantics since the caller ensures that the
+	 * MCS node is properly initialized before updating the tail.
 	 */
-	return (u32)xchg_release(&lock->tail,
+	return (u32)xchg_relaxed(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -212,10 +212,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
 		/*
-		 * Use release semantics to make sure that the MCS node is
-		 * properly initialized before changing the tail code.
+		 * We can use relaxed semantics since the caller ensures that
+		 * the MCS node is properly initialized before updating the
+		 * tail.
 		 */
-		old = atomic_cmpxchg_release(&lock->val, val, new);
+		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -388,12 +389,18 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		goto release;
 
 	/*
+	 * Ensure that the initialisation of @node is complete before we
+	 * publish the updated tail via xchg_tail() and potentially link
+	 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
+	 */
+	smp_wmb();
+
+	/*
+	 * Publish the updated tail.
 	 * We have already touched the queueing cacheline; don't bother with
 	 * pending stuff.
 	 *
 	 * p,*,* -> n,*,*
-	 *
-	 * RELEASE, such that the stores to @node must be complete.
 	 */
 	old = xchg_tail(lock, tail);
 	next = NULL;
@@ -405,14 +412,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if (old & _Q_TAIL_MASK) {
 		prev = decode_tail(old);
 
-		/*
-		 * We must ensure that the stores to @node are observed before
-		 * the write to prev->next. The address dependency from
-		 * xchg_tail is not sufficient to ensure this because the read
-		 * component of xchg_tail is unordered with respect to the
-		 * initialisation of @node.
-		 */
-		smp_store_release(&prev->next, node);
+		/* Link @node into the waitqueue. */
+		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
-- 
2.1.4

* [PATCH v2 12/13] locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (10 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 11/13] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-11 18:01 ` [PATCH v2 13/13] locking/qspinlock: Add stat tracking for pending vs slowpath Will Deacon
  2018-04-13  9:24 ` [PATCH v2 00/13] kernel/locking: qspinlock improvements Catalin Marinas
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman,
	Will Deacon

When reaching the head of an uncontended queue on the qspinlock slow-path,
using a try_cmpxchg instead of a cmpxchg operation to transition the
lock word to _Q_LOCKED_VAL generates slightly better code for x86 and
pretty much identical code for arm64.
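
For readers unfamiliar with the API: atomic_try_cmpxchg_relaxed()
returns a boolean and, on failure, updates the expected value through
its second argument, which lets x86 branch directly on the flags set by
the CMPXCHG instruction:

	/* before: compare the returned old value by hand */
	old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
	if (old == val)
		goto release;

	/* after: val is refreshed with the current lock value on failure */
	if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
		goto release;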

Cc: Ingo Molnar <mingo@kernel.org>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fa5d2ab369f9..1e3ddc42135e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -467,16 +467,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * Otherwise, we only need to grab the lock.
 	 */
 
-	/* In the PV case we might already have _Q_LOCKED_VAL set */
-	if ((val & _Q_TAIL_MASK) == tail) {
-		/*
-		 * The atomic_cond_read_acquire() call above has provided the
-		 * necessary acquire semantics required for locking.
-		 */
-		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
-		if (old == val)
-			goto release; /* No contention */
-	}
+	/*
+	 * In the PV case we might already have _Q_LOCKED_VAL set.
+	 *
+	 * The atomic_cond_read_acquire() call above has provided the
+	 * necessary acquire semantics required for locking.
+	 */
+	if (((val & _Q_TAIL_MASK) == tail) &&
+	    atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+		goto release; /* No contention */
 
 	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
 	set_locked(lock);
-- 
2.1.4

* [PATCH v2 13/13] locking/qspinlock: Add stat tracking for pending vs slowpath
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (11 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 12/13] locking/qspinlock: Use try_cmpxchg instead of cmpxchg when locking Will Deacon
@ 2018-04-11 18:01 ` Will Deacon
  2018-04-13  9:24 ` [PATCH v2 00/13] kernel/locking: qspinlock improvements Catalin Marinas
  13 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-11 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, longman

From: Waiman Long <longman@redhat.com>

Currently, the qspinlock_stat code tracks only statistical counts in the
PV qspinlock code. However, it may also be useful to track the number
of locking operations done via the pending code vs. the MCS lock queue
slowpath for the non-PV case.

The qspinlock stat code is modified to do that. The stat counter
pv_lock_slowpath is renamed to lock_slowpath so that it can be used by
both the PV and non-PV cases.
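
With CONFIG_QUEUED_LOCK_STAT=y, the new counters then show up alongside
the PV ones in the qlockstat debugfs directory (path assumed from the
existing stat code), e.g.:

	/sys/kernel/debug/qlockstat/lock_pending
	/sys/kernel/debug/qlockstat/lock_slowpath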

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c          | 14 +++++++++++---
 kernel/locking/qspinlock_paravirt.h |  7 +------
 kernel/locking/qspinlock_stat.h     |  9 ++++++---
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 1e3ddc42135e..b39f341d831b 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -12,11 +12,11 @@
  * GNU General Public License for more details.
  *
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
- * (C) Copyright 2013-2014 Red Hat, Inc.
+ * (C) Copyright 2013-2014,2018 Red Hat, Inc.
  * (C) Copyright 2015 Intel Corp.
  * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hpe.com>
+ * Authors: Waiman Long <longman@redhat.com>
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
@@ -33,6 +33,11 @@
 #include <asm/qspinlock.h>
 
 /*
+ * Include queued spinlock statistics code
+ */
+#include "qspinlock_stat.h"
+
+/*
  * The basic principle of a queue-based spinlock can best be understood
  * by studying a classic queue-based spinlock implementation called the
  * MCS lock. The paper below provides a good description for this kind
@@ -295,7 +300,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
 	if (pv_enabled())
-		goto queue;
+		goto pv_queue;
 
 	if (virt_spin_lock(lock))
 		return;
@@ -348,6 +353,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		 * *,1,0 -> *,0,1
 		 */
 		clear_pending_set_locked(lock);
+		qstat_inc(qstat_lock_pending, true);
 		return;
 	}
 
@@ -363,6 +369,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * queuing.
 	 */
 queue:
+	qstat_inc(qstat_lock_slowpath, true);
+pv_queue:
 	node = this_cpu_ptr(&mcs_nodes[0]);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 2711940429f5..970f47632b6f 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -56,11 +56,6 @@ struct pv_node {
 };
 
 /*
- * Include queued spinlock statistics code
- */
-#include "qspinlock_stat.h"
-
-/*
  * Hybrid PV queued/unfair lock
  *
  * By replacing the regular queued_spin_trylock() with the function below,
@@ -433,7 +428,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 	/*
 	 * Tracking # of slowpath locking operations
 	 */
-	qstat_inc(qstat_pv_lock_slowpath, true);
+	qstat_inc(qstat_lock_slowpath, true);
 
 	for (;; waitcnt++) {
 		/*
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 4a30ef63c607..6bd78c0740fc 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -22,13 +22,14 @@
  *   pv_kick_wake	- # of vCPU kicks used for computing pv_latency_wake
  *   pv_latency_kick	- average latency (ns) of vCPU kick operation
  *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
- *   pv_lock_slowpath	- # of locking operations via the slowpath
  *   pv_lock_stealing	- # of lock stealing operations
  *   pv_spurious_wakeup	- # of spurious wakeups in non-head vCPUs
  *   pv_wait_again	- # of wait's after a queue head vCPU kick
  *   pv_wait_early	- # of early vCPU wait's
  *   pv_wait_head	- # of vCPU wait's at the queue head
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
+ *   lock_pending	- # of locking operations via pending code
+ *   lock_slowpath	- # of locking operations via MCS lock queue
  *
  * Writing to the "reset_counters" file will reset all the above counter
  * values.
@@ -46,13 +47,14 @@ enum qlock_stats {
 	qstat_pv_kick_wake,
 	qstat_pv_latency_kick,
 	qstat_pv_latency_wake,
-	qstat_pv_lock_slowpath,
 	qstat_pv_lock_stealing,
 	qstat_pv_spurious_wakeup,
 	qstat_pv_wait_again,
 	qstat_pv_wait_early,
 	qstat_pv_wait_head,
 	qstat_pv_wait_node,
+	qstat_lock_pending,
+	qstat_lock_slowpath,
 	qstat_num,	/* Total number of statistical counters */
 	qstat_reset_cnts = qstat_num,
 };
@@ -73,12 +75,13 @@ static const char * const qstat_names[qstat_num + 1] = {
 	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
 	[qstat_pv_latency_kick]	   = "pv_latency_kick",
 	[qstat_pv_latency_wake]    = "pv_latency_wake",
-	[qstat_pv_lock_slowpath]   = "pv_lock_slowpath",
 	[qstat_pv_lock_stealing]   = "pv_lock_stealing",
 	[qstat_pv_wait_again]      = "pv_wait_again",
 	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
+	[qstat_lock_pending]       = "lock_pending",
+	[qstat_lock_slowpath]      = "lock_slowpath",
 	[qstat_reset_cnts]         = "reset_counters",
 };
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread
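For readers following the diff above: qstat_inc() is the existing stat
hook that the new lock_pending/lock_slowpath counters plug into. A
minimal sketch of its behaviour, assuming CONFIG_QUEUED_LOCK_STAT=y
(reconstructed for illustration, not the exact source; the real
definition lives in qspinlock_stat.h and compiles to a no-op when the
option is off):

static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);

/* Bump a per-cpu event counter when @cond holds; sketch only. */
static inline void qstat_inc(enum qlock_stats stat, bool cond)
{
	if (cond)
		this_cpu_inc(qstats[stat]);
}

The per-cpu counts are summed and exposed read-only via debugfs, which
is how the pending vs slowpath split can be observed at runtime.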

* Re: [PATCH v2 09/13] locking/qspinlock: Merge struct __qspinlock into struct qspinlock
  2018-04-11 18:01 ` [PATCH v2 09/13] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
@ 2018-04-11 19:13   ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2018-04-11 19:13 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On 04/11/2018 02:01 PM, Will Deacon wrote:
> struct __qspinlock provides a handy union of fields so that
> subcomponents of the lockword can be accessed by name, without having to
> manage shifts and masks explicitly and take endianness into account.
>
> This is useful in qspinlock.h and also potentially in arch headers, so
> move the struct __qspinlock into struct qspinlock and kill the extra
> definition.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Acked-by: Boqun Feng <boqun.feng@gmail.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

I think you are using the new qspinlock structure in some of your
earlier patches, like patch 4, so you may want to move it earlier in
the series to avoid a bisection problem.

Cheers,
Longman
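For context, the merged layout this patch introduces looks roughly as
follows on a little-endian machine; this is a sketch reconstructed from
the series, and the order of the locked/pending bytes is reversed for
big-endian:

typedef struct qspinlock {
	union {
		atomic_t val;

		/*
		 * Little-endian sketch: locked byte lowest, then
		 * pending, with the tail code in the upper 16 bits.
		 */
		struct {
			u8	locked;
			u8	pending;
		};
		struct {
			u16	locked_pending;
			u16	tail;
		};
	};
} arch_spinlock_t;

This is what lets callers write lock->pending or lock->locked directly
instead of open-coding shifts and masks against the atomic_t.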

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-11 18:01 ` [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
@ 2018-04-11 19:34   ` Waiman Long
  2018-04-11 20:35     ` Waiman Long
  2018-04-11 19:53   ` Waiman Long
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2018-04-11 19:34 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On 04/11/2018 02:01 PM, Will Deacon wrote:
> The qspinlock locking slowpath utilises a "pending" bit as a simple form
> of an embedded test-and-set lock that can avoid the overhead of explicit
> queuing in cases where the lock is held but uncontended. This bit is
> managed using a cmpxchg loop which tries to transition the uncontended
> lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
>
> Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
> indefinitely if the lock word is seen to oscillate between unlocked
> (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
> able to take the lock in the cmpxchg loop without queuing and pass it
> around amongst themselves.
>
> This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
> using atomic_fetch_or, and then inspecting the old value to see whether
> we need to spin on the current lock owner, or whether we now effectively
> hold the lock. The tricky scenario is when concurrent lockers end up
> queuing on the lock and the lock becomes available, causing us to see
> a lockword of (n,0,0). With pending now set, simply queuing could lead
> to deadlock as the head of the queue may not have observed the pending
> flag being cleared. Conversely, if the head of the queue did observe
> pending being cleared, then it could transition the lock from (n,0,0) ->
> (0,0,1) meaning that any attempt to "undo" our setting of the pending
> bit could race with a concurrent locker trying to set it.
>
> We handle this race by preserving the pending bit when taking the lock
> after reaching the head of the queue and leaving the tail entry intact
> if we saw pending set, because we know that the tail is going to be
> updated shortly.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  kernel/locking/qspinlock.c | 102 ++++++++++++++++++++++++++-------------------
>  1 file changed, 58 insertions(+), 44 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 396701e8c62d..a8fc402b3f3a 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -162,6 +162,17 @@ struct __qspinlock {
>  
>  #if _Q_PENDING_BITS == 8
>  /**
> + * clear_pending - clear the pending bit.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,1,* -> *,0,*
> + */
> +static __always_inline void clear_pending(struct qspinlock *lock)
> +{
> +	WRITE_ONCE(lock->pending, 0);
> +}
> +
> +/**
>   * clear_pending_set_locked - take ownership and clear the pending bit.
>   * @lock: Pointer to queued spinlock structure
>   *
> @@ -201,6 +212,17 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>  #else /* _Q_PENDING_BITS == 8 */
>  
>  /**
> + * clear_pending - clear the pending bit.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,1,* -> *,0,*
> + */
> +static __always_inline void clear_pending(struct qspinlock *lock)
> +{
> +	atomic_andnot(_Q_PENDING_VAL, &lock->val);
> +}
> +
> +/**
>   * clear_pending_set_locked - take ownership and clear the pending bit.
>   * @lock: Pointer to queued spinlock structure
>   *
> @@ -306,7 +328,7 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
>  void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  {
>  	struct mcs_spinlock *prev, *next, *node;
> -	u32 new, old, tail;
> +	u32 old, tail;
>  	int idx;
>  
>  	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> @@ -330,58 +352,50 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	}
>  
>  	/*
> +	 * If we observe any contention; queue.
> +	 */
> +	if (val & ~_Q_LOCKED_MASK)
> +		goto queue;
> +
> +	/*
>  	 * trylock || pending
>  	 *
>  	 * 0,0,0 -> 0,0,1 ; trylock
>  	 * 0,0,1 -> 0,1,1 ; pending
>  	 */
> -	for (;;) {
> +	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +	if (!(val & ~_Q_LOCKED_MASK)) {
>  		/*
> -		 * If we observe any contention; queue.
> +		 * we're pending, wait for the owner to go away.
> +		 *
> +		 * *,1,1 -> *,1,0
> +		 *
> +		 * this wait loop must be a load-acquire such that we match the
> +		 * store-release that clears the locked bit and create lock
> +		 * sequentiality; this is because not all
> +		 * clear_pending_set_locked() implementations imply full
> +		 * barriers.
>  		 */
> -		if (val & ~_Q_LOCKED_MASK)
> -			goto queue;
> -
> -		new = _Q_LOCKED_VAL;
> -		if (val == new)
> -			new |= _Q_PENDING_VAL;
> +		if (val & _Q_LOCKED_MASK) {
> +			smp_cond_load_acquire(&lock->val.counter,
> +					      !(VAL & _Q_LOCKED_MASK));
> +		}
>  
>  		/*
> -		 * Acquire semantic is required here as the function may
> -		 * return immediately if the lock was free.
> +		 * take ownership and clear the pending bit.
> +		 *
> +		 * *,1,0 -> *,0,1
>  		 */
> -		old = atomic_cmpxchg_acquire(&lock->val, val, new);
> -		if (old == val)
> -			break;
> -
> -		val = old;
> -	}
> -
> -	/*
> -	 * we won the trylock
> -	 */
> -	if (new == _Q_LOCKED_VAL)
> +		clear_pending_set_locked(lock);
>  		return;
> +	}
>  
>  	/*
> -	 * we're pending, wait for the owner to go away.
> -	 *
> -	 * *,1,1 -> *,1,0
> -	 *
> -	 * this wait loop must be a load-acquire such that we match the
> -	 * store-release that clears the locked bit and create lock
> -	 * sequentiality; this is because not all clear_pending_set_locked()
> -	 * implementations imply full barriers.
> -	 */
> -	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
> -
> -	/*
> -	 * take ownership and clear the pending bit.
> -	 *
> -	 * *,1,0 -> *,0,1
> +	 * If pending was clear but there are waiters in the queue, then
> +	 * we need to undo our setting of pending before we queue ourselves.
>  	 */
> -	clear_pending_set_locked(lock);
> -	return;
> +	if (!(val & _Q_PENDING_MASK))
> +		clear_pending(lock);
>  
>  	/*
>  	 * End of pending bit optimistic spinning and beginning of MCS
> @@ -485,15 +499,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 * claim the lock:
>  	 *
>  	 * n,0,0 -> 0,0,1 : lock, uncontended
> -	 * *,0,0 -> *,0,1 : lock, contended
> +	 * *,*,0 -> *,*,1 : lock, contended
>  	 *
> -	 * If the queue head is the only one in the queue (lock value == tail),
> -	 * clear the tail code and grab the lock. Otherwise, we only need
> -	 * to grab the lock.
> +	 * If the queue head is the only one in the queue (lock value == tail)
> +	 * and nobody is pending, clear the tail code and grab the lock.
> +	 * Otherwise, we only need to grab the lock.
>  	 */
>  	for (;;) {
>  		/* In the PV case we might already have _Q_LOCKED_VAL set */
> -		if ((val & _Q_TAIL_MASK) != tail) {
> +		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
>  			set_locked(lock);
>  			break;
>  		}

I don't think it is right to just grab the lock when the pending bit is
set. I believe it will cause problems.

Preserving the pending bit should be just:

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 35367cc..76d9124 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -511,7 +511,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
                 * necessary acquire semantics required for locking. At most
                 * two iterations of this loop may be ran.
                 */
-               old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
+               old = atomic_cmpxchg_relaxed(&lock->val, val,
+                       _Q_LOCKED_VAL | (val & _Q_PENDING_MASK));
                if (old == val)
                        goto release;   /* No contention */

Cheers,
Longman

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-11 18:01 ` [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
  2018-04-11 19:34   ` Waiman Long
@ 2018-04-11 19:53   ` Waiman Long
  2018-04-12 14:06     ` Will Deacon
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2018-04-11 19:53 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On 04/11/2018 02:01 PM, Will Deacon wrote:
> The qspinlock locking slowpath utilises a "pending" bit as a simple form
> of an embedded test-and-set lock that can avoid the overhead of explicit
> queuing in cases where the lock is held but uncontended. This bit is
> managed using a cmpxchg loop which tries to transition the uncontended
> lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
>
> Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
> indefinitely if the lock word is seen to oscillate between unlocked
> (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
> able to take the lock in the cmpxchg loop without queuing and pass it
> around amongst themselves.
>
> This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
> using atomic_fetch_or, and then inspecting the old value to see whether
> we need to spin on the current lock owner, or whether we now effectively
> hold the lock. The tricky scenario is when concurrent lockers end up
> queuing on the lock and the lock becomes available, causing us to see
> a lockword of (n,0,0). With pending now set, simply queuing could lead
> to deadlock as the head of the queue may not have observed the pending
> flag being cleared. Conversely, if the head of the queue did observe
> pending being cleared, then it could transition the lock from (n,0,0) ->
> (0,0,1) meaning that any attempt to "undo" our setting of the pending
> bit could race with a concurrent locker trying to set it.
>
> We handle this race by preserving the pending bit when taking the lock
> after reaching the head of the queue and leaving the tail entry intact
> if we saw pending set, because we know that the tail is going to be
> updated shortly.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  kernel/locking/qspinlock.c | 102 ++++++++++++++++++++++++++-------------------
>  1 file changed, 58 insertions(+), 44 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 396701e8c62d..a8fc402b3f3a 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -162,6 +162,17 @@ struct __qspinlock {
>  
>  #if _Q_PENDING_BITS == 8
>  /**
> + * clear_pending - clear the pending bit.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,1,* -> *,0,*
> + */
> +static __always_inline void clear_pending(struct qspinlock *lock)
> +{
> +	WRITE_ONCE(lock->pending, 0);
> +}
> +
> +/**
>   * clear_pending_set_locked - take ownership and clear the pending bit.
>   * @lock: Pointer to queued spinlock structure
>   *
> @@ -201,6 +212,17 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>  #else /* _Q_PENDING_BITS == 8 */
>  
>  /**
> + * clear_pending - clear the pending bit.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,1,* -> *,0,*
> + */
> +static __always_inline void clear_pending(struct qspinlock *lock)
> +{
> +	atomic_andnot(_Q_PENDING_VAL, &lock->val);
> +}
> +
> +/**
>   * clear_pending_set_locked - take ownership and clear the pending bit.
>   * @lock: Pointer to queued spinlock structure
>   *

BTW, there is a similar clear_pending() function in
qspinlock_paravirt.h. I think you need to remove that with this patch.
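
For reference, the paravirt copy in question looks roughly like this (a
sketch of qspinlock_paravirt.h as it stood before this series; treat
the exact body as an assumption):

/* PV-only helper, made redundant by the new generic clear_pending(). */
static __always_inline void clear_pending(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	WRITE_ONCE(l->pending, 0);
}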

Cheers,
Longman

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-11 19:34   ` Waiman Long
@ 2018-04-11 20:35     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2018-04-11 20:35 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On 04/11/2018 03:34 PM, Waiman Long wrote:
> On 04/11/2018 02:01 PM, Will Deacon wrote:
>> @@ -485,15 +499,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>  	 * claim the lock:
>>  	 *
>>  	 * n,0,0 -> 0,0,1 : lock, uncontended
>> -	 * *,0,0 -> *,0,1 : lock, contended
>> +	 * *,*,0 -> *,*,1 : lock, contended
>>  	 *
>> -	 * If the queue head is the only one in the queue (lock value == tail),
>> -	 * clear the tail code and grab the lock. Otherwise, we only need
>> -	 * to grab the lock.
>> +	 * If the queue head is the only one in the queue (lock value == tail)
>> +	 * and nobody is pending, clear the tail code and grab the lock.
>> +	 * Otherwise, we only need to grab the lock.
>>  	 */
>>  	for (;;) {
>>  		/* In the PV case we might already have _Q_LOCKED_VAL set */
>> -		if ((val & _Q_TAIL_MASK) != tail) {
>> +		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
>>  			set_locked(lock);
>>  			break;
>>  		}
> I don't think it is right to just grab the lock when the pending bit is
> set. I believe it will cause problems.
>
> Preserving the pending bit should be just:
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 35367cc..76d9124 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -511,7 +511,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>                  * necessary acquire semantics required for locking. At most
>                  * two iterations of this loop may be ran.
>                  */
> -               old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
> +               old = atomic_cmpxchg_relaxed(&lock->val, val,
> +                       _Q_LOCKED_VAL | (val & _Q_PENDING_MASK));
>                 if (old == val)
>                         goto release;   /* No contention */

After some more thought and reviewing the rest of the patchset, I now
think your change here is OK. Sorry for the noise.
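
To spell out why the change is OK, here is the interleaving the
changelog defends against, reconstructed as an assumed trace (states
are (tail,pending,locked)):

/*
 * (n,0,0)  lock released; our node sits at the head of the queue
 * (n,1,0)  a late locker sets pending via fetch_or, then sees the
 *          tail bits in the old value, so it is committed to
 *          clearing pending and queuing instead
 * (n,1,1)  the head claims with set_locked(), leaving tail and
 *          pending untouched rather than cmpxchg'ing to (0,0,1)
 * (n',0,1) the late locker clears pending and xchg_tail()s its own
 *          node in, so the stale tail is overwritten shortly after,
 *          as the changelog notes
 */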

Cheers,
Longman

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-11 19:53   ` Waiman Long
@ 2018-04-12 14:06     ` Will Deacon
  2018-04-12 14:16       ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Will Deacon @ 2018-04-12 14:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On Wed, Apr 11, 2018 at 03:53:16PM -0400, Waiman Long wrote:
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index 396701e8c62d..a8fc402b3f3a 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -162,6 +162,17 @@ struct __qspinlock {
> >  
> >  #if _Q_PENDING_BITS == 8
> >  /**
> > + * clear_pending - clear the pending bit.
> > + * @lock: Pointer to queued spinlock structure
> > + *
> > + * *,1,* -> *,0,*
> > + */
> > +static __always_inline void clear_pending(struct qspinlock *lock)
> > +{
> > +	WRITE_ONCE(lock->pending, 0);
> > +}
> > +
> > +/**
> >   * clear_pending_set_locked - take ownership and clear the pending bit.
> >   * @lock: Pointer to queued spinlock structure
> >   *
> > @@ -201,6 +212,17 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> >  #else /* _Q_PENDING_BITS == 8 */
> >  
> >  /**
> > + * clear_pending - clear the pending bit.
> > + * @lock: Pointer to queued spinlock structure
> > + *
> > + * *,1,* -> *,0,*
> > + */
> > +static __always_inline void clear_pending(struct qspinlock *lock)
> > +{
> > +	atomic_andnot(_Q_PENDING_VAL, &lock->val);
> > +}
> > +
> > +/**
> >   * clear_pending_set_locked - take ownership and clear the pending bit.
> >   * @lock: Pointer to queued spinlock structure
> >   *
> 
>> BTW, there is a similar clear_pending() function in
>> qspinlock_paravirt.h. I think you need to remove that with this patch.

Thanks, I'll do that. I did build and bisect this series... for arm64, which
is completely useless as it doesn't get compiled until the final patch which
I haven't posted here.

Will

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-12 14:06     ` Will Deacon
@ 2018-04-12 14:16       ` Waiman Long
  2018-04-12 14:18         ` Will Deacon
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2018-04-12 14:16 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On 04/12/2018 10:06 AM, Will Deacon wrote:
> On Wed, Apr 11, 2018 at 03:53:16PM -0400, Waiman Long wrote:
>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>>> index 396701e8c62d..a8fc402b3f3a 100644
>>> --- a/kernel/locking/qspinlock.c
>>> +++ b/kernel/locking/qspinlock.c
>>> @@ -162,6 +162,17 @@ struct __qspinlock {
>>>  
>>>  #if _Q_PENDING_BITS == 8
>>>  /**
>>> + * clear_pending - clear the pending bit.
>>> + * @lock: Pointer to queued spinlock structure
>>> + *
>>> + * *,1,* -> *,0,*
>>> + */
>>> +static __always_inline void clear_pending(struct qspinlock *lock)
>>> +{
>>> +	WRITE_ONCE(lock->pending, 0);
>>> +}
>>> +
>>> +/**
>>>   * clear_pending_set_locked - take ownership and clear the pending bit.
>>>   * @lock: Pointer to queued spinlock structure
>>>   *
>>> @@ -201,6 +212,17 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>>>  #else /* _Q_PENDING_BITS == 8 */
>>>  
>>>  /**
>>> + * clear_pending - clear the pending bit.
>>> + * @lock: Pointer to queued spinlock structure
>>> + *
>>> + * *,1,* -> *,0,*
>>> + */
>>> +static __always_inline void clear_pending(struct qspinlock *lock)
>>> +{
>>> +	atomic_andnot(_Q_PENDING_VAL, &lock->val);
>>> +}
>>> +
>>> +/**
>>>   * clear_pending_set_locked - take ownership and clear the pending bit.
>>>   * @lock: Pointer to queued spinlock structure
>>>   *
>> BTW, there is a similar clear_pending() function in
>> qspinlock_paravirt.c. I think you need to remove that with this patch.
> Thanks, I'll do that. I did build and bisect this series... for arm64, which
> is completely useless as it doesn't get compiled until the final patch which
> I haven't posted here.
>
> Will

You certainly need to get an x86 system and compile with the patches
applied one by one with CONFIG_PARAVIRT_SPINLOCKS enabled.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 04/13] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-12 14:16       ` Waiman Long
@ 2018-04-12 14:18         ` Will Deacon
  0 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2018-04-12 14:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng, paulmck

On Thu, Apr 12, 2018 at 10:16:55AM -0400, Waiman Long wrote:
> On 04/12/2018 10:06 AM, Will Deacon wrote:
> > On Wed, Apr 11, 2018 at 03:53:16PM -0400, Waiman Long wrote:
> >>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> >>> index 396701e8c62d..a8fc402b3f3a 100644
> >>> --- a/kernel/locking/qspinlock.c
> >>> +++ b/kernel/locking/qspinlock.c
> >>> @@ -162,6 +162,17 @@ struct __qspinlock {
> >>>  
> >>>  #if _Q_PENDING_BITS == 8
> >>>  /**
> >>> + * clear_pending - clear the pending bit.
> >>> + * @lock: Pointer to queued spinlock structure
> >>> + *
> >>> + * *,1,* -> *,0,*
> >>> + */
> >>> +static __always_inline void clear_pending(struct qspinlock *lock)
> >>> +{
> >>> +	WRITE_ONCE(lock->pending, 0);
> >>> +}
> >>> +
> >>> +/**
> >>>   * clear_pending_set_locked - take ownership and clear the pending bit.
> >>>   * @lock: Pointer to queued spinlock structure
> >>>   *
> >>> @@ -201,6 +212,17 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> >>>  #else /* _Q_PENDING_BITS == 8 */
> >>>  
> >>>  /**
> >>> + * clear_pending - clear the pending bit.
> >>> + * @lock: Pointer to queued spinlock structure
> >>> + *
> >>> + * *,1,* -> *,0,*
> >>> + */
> >>> +static __always_inline void clear_pending(struct qspinlock *lock)
> >>> +{
> >>> +	atomic_andnot(_Q_PENDING_VAL, &lock->val);
> >>> +}
> >>> +
> >>> +/**
> >>>   * clear_pending_set_locked - take ownership and clear the pending bit.
> >>>   * @lock: Pointer to queued spinlock structure
> >>>   *
> >> BTW, there is a similar clear_pending() function in
> >> qspinlock_paravirt.h. I think you need to remove that with this patch.
> > Thanks, I'll do that. I did build and bisect this series... for arm64, which
> > is completely useless as it doesn't get compiled until the final patch which
> > I haven't posted here.
> >
> > Will
> 
> You certainly need to get a x86 system and compiles with the patches
> applied one-by-one with CONFIG_PARAVIRT_SPINLOCKS enabled.

Yup, doing that now. I just botched my test scripts before, which default to
arm64.

Will

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 00/13] kernel/locking: qspinlock improvements
  2018-04-11 18:01 [PATCH v2 00/13] kernel/locking: qspinlock improvements Will Deacon
                   ` (12 preceding siblings ...)
  2018-04-11 18:01 ` [PATCH v2 13/13] locking/qspinlock: Add stat tracking for pending vs slowpath Will Deacon
@ 2018-04-13  9:24 ` Catalin Marinas
  13 siblings, 0 replies; 22+ messages in thread
From: Catalin Marinas @ 2018-04-13  9:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, peterz, boqun.feng, longman, paulmck, mingo,
	linux-arm-kernel

On Wed, Apr 11, 2018 at 07:01:07PM +0100, Will Deacon wrote:
>   * Spin for a bounded duration while lock is observed in the
>     pending->locked transition

FWIW, I updated my model [1] to include the bounded handover loop and,
as expected, it passes the liveness check (well, assuming fairness of
other operations like fetch_or, cmpxchg etc).

[1] https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/qspinlock.tla
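
The bounded handover loop being modelled is, in sketch form, the
following fragment from earlier in the series (reconstructed;
_Q_PENDING_LOOPS defaults to 1 and is raised for x86 by patch 03):

	/*
	 * Wait for in-progress pending->locked hand-overs with a
	 * bounded number of spins so that we guarantee forward
	 * progress.
	 *
	 * 0,1,0 -> 0,0,1
	 */
	if (val == _Q_PENDING_VAL) {
		int cnt = _Q_PENDING_LOOPS;
		val = atomic_cond_read_relaxed(&lock->val,
					      (VAL != _Q_PENDING_VAL) || !cnt--);
	}

Once the bound expires, the locker falls through to the normal
pending/queue logic instead of spinning indefinitely on a transient
state.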

-- 
Catalin

^ permalink raw reply	[flat|nested] 22+ messages in thread
