* [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86
@ 2018-09-26 11:01 Peter Zijlstra
  2018-09-26 11:01 ` [RFC][PATCH 1/3] locking/qspinlock: Re-order code Peter Zijlstra
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-26 11:01 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, Peter Zijlstra

Back when Will did his qspinlock determinism patches, we were left with one
cmpxchg loop on x86 due to the use of atomic_fetch_or(). Will proposed a nifty
trick:

  http://lkml.kernel.org/r/20180409145409.GA9661@arm.com

But at the time we didn't pursue it. This series implements that and argues for
its correctness. In particular it places an smp_mb__after_atomic() in
between the two operations, which forces the load to come after the
store (which is free on x86 anyway).

In particular this ordering ensures a concurrent unlock cannot trigger
the uncontended handoff. Also it ensures that if the xchg() happens
after a (successful) trylock, we must observe that LOCKED bit.
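
Concretely, the composite that replaces the fetch_or looks roughly like
this (the full version, with comments, is in patch 3):

	val  = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
	smp_mb__after_atomic();	/* order the word load after the byte xchg */
	val |= atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK;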


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC][PATCH 1/3] locking/qspinlock: Re-order code
  2018-09-26 11:01 [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86 Peter Zijlstra
@ 2018-09-26 11:01 ` Peter Zijlstra
  2018-10-01 17:17   ` Will Deacon
  2018-09-26 11:01 ` [RFC][PATCH 2/3] locking/qspinlock: Rework some comments Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-26 11:01 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, Peter Zijlstra

[-- Attachment #1: peterz-qspinlock-opt-2.patch --]
[-- Type: text/plain, Size: 2298 bytes --]

Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL)
such that we lose the indent. This also results in a more natural code
flow IMO.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/qspinlock.c |   56 +++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 29 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -330,39 +330,37 @@ void queued_spin_lock_slowpath(struct qs
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
 	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
-	if (!(val & ~_Q_LOCKED_MASK)) {
-		/*
-		 * We're pending, wait for the owner to go away.
-		 *
-		 * *,1,1 -> *,1,0
-		 *
-		 * this wait loop must be a load-acquire such that we match the
-		 * store-release that clears the locked bit and create lock
-		 * sequentiality; this is because not all
-		 * clear_pending_set_locked() implementations imply full
-		 * barriers.
-		 */
-		if (val & _Q_LOCKED_MASK) {
-			atomic_cond_read_acquire(&lock->val,
-						 !(VAL & _Q_LOCKED_MASK));
-		}
-
-		/*
-		 * take ownership and clear the pending bit.
-		 *
-		 * *,1,0 -> *,0,1
-		 */
-		clear_pending_set_locked(lock);
-		qstat_inc(qstat_lock_pending, true);
-		return;
+	/*
+	 * If we observe any contention; undo and queue.
+	 */
+	if (unlikely(val & ~_Q_LOCKED_MASK)) {
+		if (!(val & _Q_PENDING_MASK))
+			clear_pending(lock);
+		goto queue;
 	}
 
 	/*
-	 * If pending was clear but there are waiters in the queue, then
-	 * we need to undo our setting of pending before we queue ourselves.
+	 * We're pending, wait for the owner to go away.
+	 *
+	 * 0,1,1 -> 0,1,0
+	 *
+	 * this wait loop must be a load-acquire such that we match the
+	 * store-release that clears the locked bit and create lock
+	 * sequentiality; this is because not all
+	 * clear_pending_set_locked() implementations imply full
+	 * barriers.
+	 */
+	if (val & _Q_LOCKED_MASK)
+		atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
+
+	/*
+	 * take ownership and clear the pending bit.
+	 *
+	 * 0,1,0 -> 0,0,1
 	 */
-	if (!(val & _Q_PENDING_MASK))
-		clear_pending(lock);
+	clear_pending_set_locked(lock);
+	qstat_inc(qstat_lock_pending, true);
+	return;
 
 	/*
 	 * End of pending bit optimistic spinning and beginning of MCS



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC][PATCH 2/3] locking/qspinlock: Rework some comments
  2018-09-26 11:01 [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86 Peter Zijlstra
  2018-09-26 11:01 ` [RFC][PATCH 1/3] locking/qspinlock: Re-order code Peter Zijlstra
@ 2018-09-26 11:01 ` Peter Zijlstra
  2018-10-01 17:17   ` Will Deacon
  2018-09-26 11:01 ` [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86 Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-26 11:01 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, Peter Zijlstra

[-- Attachment #1: peterz-qspinlock-opt-1.patch --]
[-- Type: text/plain, Size: 2598 bytes --]

While working my way through the code again, I felt the comments could
use help.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/qspinlock.c |   40 ++++++++++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 12 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -326,16 +326,23 @@ void queued_spin_lock_slowpath(struct qs
 	/*
 	 * trylock || pending
 	 *
-	 * 0,0,0 -> 0,0,1 ; trylock
-	 * 0,0,1 -> 0,1,1 ; pending
+	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
 	 */
 	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+
 	/*
-	 * If we observe any contention; undo and queue.
+	 * If we observe contention, there was a concurrent lock.
+	 *
+	 * Undo and queue; our setting of PENDING might have made the
+	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
+	 * on @next to become !NULL.
 	 */
 	if (unlikely(val & ~_Q_LOCKED_MASK)) {
+
+		/* Undo PENDING if we set it. */
 		if (!(val & _Q_PENDING_MASK))
 			clear_pending(lock);
+
 		goto queue;
 	}
 
@@ -466,7 +473,7 @@ void queued_spin_lock_slowpath(struct qs
 	 * claim the lock:
 	 *
 	 * n,0,0 -> 0,0,1 : lock, uncontended
-	 * *,*,0 -> *,*,1 : lock, contended
+	 * *,0,0 -> *,0,1 : lock, contended
 	 *
 	 * If the queue head is the only one in the queue (lock value == tail)
 	 * and nobody is pending, clear the tail code and grab the lock.
@@ -474,16 +481,25 @@ void queued_spin_lock_slowpath(struct qs
 	 */
 
 	/*
-	 * In the PV case we might already have _Q_LOCKED_VAL set.
+	 * In the PV case we might already have _Q_LOCKED_VAL set, because
+	 * of lock stealing; therefore we must also allow:
 	 *
-	 * The atomic_cond_read_acquire() call above has provided the
-	 * necessary acquire semantics required for locking.
-	 */
-	if (((val & _Q_TAIL_MASK) == tail) &&
-	    atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
-		goto release; /* No contention */
+	 * n,0,1 -> 0,0,1
+	 *
+	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
+	 *       above wait condition, therefore any concurrent setting of
+	 *       PENDING will make the uncontended transition fail.
+	 */
+	if ((val & _Q_TAIL_MASK) == tail) {
+		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+			goto release; /* No contention */
+	}
 
-	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
+	/*
+	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
+	 * which will then detect the remaining tail and queue behind us
+	 * ensuring we'll see a @next.
+	 */
 	set_locked(lock);
 
 	/*



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-26 11:01 [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86 Peter Zijlstra
  2018-09-26 11:01 ` [RFC][PATCH 1/3] locking/qspinlock: Re-order code Peter Zijlstra
  2018-09-26 11:01 ` [RFC][PATCH 2/3] locking/qspinlock: Rework some comments Peter Zijlstra
@ 2018-09-26 11:01 ` Peter Zijlstra
  2018-09-26 16:30   ` Waiman Long
                     ` (3 more replies)
  2018-09-26 15:01 ` [RFC][PATCH 0/3] locking/qspinlock: Improve determinism " Sebastian Andrzej Siewior
  2018-09-26 16:20 ` Waiman Long
  4 siblings, 4 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-26 11:01 UTC (permalink / raw)
  To: will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx, Peter Zijlstra

[-- Attachment #1: peterz-qspinlock-opt-5.patch --]
[-- Type: text/plain, Size: 4755 bytes --]

On x86 we cannot do fetch_or with a single instruction and end up
using a cmpxchg loop, this reduces determinism. Replace the fetch_or
with a very tricky composite xchg8 + load.

The basic idea is that we use xchg8 to test-and-set the pending bit
(when it is a byte) and then a load to fetch the whole word. Using
two instructions of course opens a window we previously did not have.
In particular the ordering between pending and tail is of interest,
because that is where the split happens.

The claim is that if we order them, it all works out just fine. There
are two specific cases where the pending,tail state changes:

 - when the 3rd lock(er) comes in and finds pending set, it'll queue
   and set tail; since we set tail while pending is set, the ordering
   of the split is not important (and not fundamentally different from
   fetch_or). [*]

 - when the last queued lock holder acquires the lock (uncontended),
   we clear the tail and set the lock byte. By first setting the
   pending bit this cmpxchg will fail and the later load must then
   see the remaining tail.

Another interesting scenario is where there are only 2 threads:

	lock := (0,0,0)

	CPU 0			CPU 1

	lock()			lock()
	  trylock(-> 0,0,1)       trylock() /* fail */
	    return;               xchg_relaxed(pending, 1) (-> 0,1,1)
				  mb()
				  val = smp_load_acquire(*lock);

Where, without the mb() the load would've been allowed to return 0 for
the locked byte.

[*] there is a fun scenario where the 3rd locker observes the pending
bit, but before it sets the tail, the 1st lock (owner) unlocks and the
2nd locker acquires the lock, leaving us with a temporary 0,0,1 state.
But this is not different between fetch_or or this new method.
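
For illustration only (not part of the patch): a stand-alone user-space
sketch of the same composite, using GCC __atomic builtins and assuming the
x86 little-endian layout (locked byte 0, pending byte 1, tail in bytes 2-3)
with _Q_PENDING_BITS == 8; the names below are made up for the example:

#include <stdint.h>
#include <stdio.h>

/* Mirrors the assumed layout: locked, pending, tail. */
union qlock {
	uint32_t val;
	struct {
		uint8_t  locked;
		uint8_t  pending;
		uint16_t tail;
	};
};

#define PENDING_OFFSET	8
#define PENDING_MASK	(0xffu << PENDING_OFFSET)

/* xchg the pending byte, then an ordered load of the whole word. */
static uint32_t set_pending_fetch_acquire(union qlock *lock)
{
	uint32_t val;

	val = (uint32_t)__atomic_exchange_n(&lock->pending, 1,
					    __ATOMIC_RELAXED) << PENDING_OFFSET;
	/* stands in for smp_mb__after_atomic() */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	val |= __atomic_load_n(&lock->val, __ATOMIC_ACQUIRE) & ~PENDING_MASK;

	return val;
}

int main(void)
{
	union qlock lock = { .val = 0 };

	lock.locked = 1;	/* pretend a first locker holds the lock */

	/* old shows LOCKED but not PENDING; lock.val now has both set */
	printf("old = %#010x\n", (unsigned int)set_pending_fetch_acquire(&lock));
	printf("new = %#010x\n", (unsigned int)lock.val);
	return 0;
}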

Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/qspinlock.h |    2 +
 kernel/locking/qspinlock.c       |   56 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 57 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -9,6 +9,8 @@
 
 #define _Q_PENDING_LOOPS	(1 << 9)
 
+#define _Q_NO_FETCH_OR
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -232,6 +232,60 @@ static __always_inline u32 xchg_tail(str
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_pending_fetch_acquire - fetch the whole lock value and set pending and locked
+ * @lock : Pointer to queued spinlock structure
+ * Return: The previous lock value
+ *
+ * *,*,* -> *,1,*
+ */
+static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
+{
+#if defined(_Q_NO_FETCH_OR) && _Q_PENDING_BITS == 8
+	u32 val;
+
+	/*
+	 * For the !LL/SC archs that do not have a native atomic_fetch_or
+	 * but do have a native xchg (x86) we can do trickery and avoid the
+	 * cmpxchg-loop based fetch-or to improve determinism.
+	 *
+	 * We first xchg the pending byte to set PENDING, and then issue a load
+	 * for the rest of the word and mask out the pending byte to compute a
+	 * 'full' value.
+	 */
+	val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
+	/*
+	 * Ensures the tail load happens after the xchg().
+	 *
+	 *	   lock  unlock    (a)
+	 *   xchg ---------------.
+	 *    (b)  lock  unlock  +----- fetch_or
+	 *   load ---------------'
+	 *	   lock  unlock    (c)
+	 *
+	 * For both lock and unlock, (a) and (c) are the same as fetch_or(),
+	 * since either are fully before or after. The only new case is (b),
+	 * where a concurrent lock or unlock can interleave with the composite
+	 * operation.
+	 *
+	 * In either case, it is the queueing case that is of interest, otherwise
+	 * the whole operation is covered by the xchg() and the tail will be 0.
+	 *
+	 * For lock-(b); we only care if we set the PENDING bit or not. If we lost
+	 * the PENDING race, we queue. Otherwise we'd observe the tail anyway.
+	 *
+	 * For unlock-(b); since we'll have set PENDING, the uncontended claim
+	 * will fail and we'll observe a !0 tail.
+	 */
+	smp_mb__after_atomic();
+	val |= atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK;
+
+	return val;
+#else
+	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+#endif
+}
+
+/**
  * set_locked - Set the lock bit and own the lock
  * @lock: Pointer to queued spinlock structure
  *
@@ -328,7 +382,7 @@ void queued_spin_lock_slowpath(struct qs
 	 *
 	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
 	 */
-	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	val = set_pending_fetch_acquire(lock);
 
 	/*
 	 * If we observe contention, there was a concurrent lock.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86
  2018-09-26 11:01 [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2018-09-26 11:01 ` [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86 Peter Zijlstra
@ 2018-09-26 15:01 ` Sebastian Andrzej Siewior
  2018-09-26 15:08   ` Thomas Gleixner
  2018-09-26 16:20 ` Waiman Long
  4 siblings, 1 reply; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-09-26 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: will.deacon, mingo, linux-kernel, longman, andrea.parri, tglx

On 2018-09-26 13:01:17 [+0200], Peter Zijlstra wrote:
> In particular this ordering ensures a concurrent unlock cannot trigger
> the uncontended handoff. Also it ensures that if the xchg() happens
> after a (successful) trylock, we must observe that LOCKED bit.

so I backported a bunch of atomic files and qspinlock back to v4.9-RT in
order to get these patches applied. The backported kernel booted and
showed the same symptoms within seconds (those I reported last FR).
With those three patches on top, the Core2Duo has now been testing for
10 minutes without any issues…

Sebastian

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86
  2018-09-26 15:01 ` [RFC][PATCH 0/3] locking/qspinlock: Improve determinism " Sebastian Andrzej Siewior
@ 2018-09-26 15:08   ` Thomas Gleixner
  2018-09-26 15:38     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2018-09-26 15:08 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, will.deacon, mingo, linux-kernel, longman, andrea.parri

[-- Attachment #1: Type: text/plain, Size: 788 bytes --]

On Wed, 26 Sep 2018, Sebastian Andrzej Siewior wrote:

> On 2018-09-26 13:01:17 [+0200], Peter Zijlstra wrote:
> > In particular this ordering ensures a concurrent unlock cannot trigger
> > the uncontended handoff. Also it ensures that if the xchg() happens
> > after a (successful) trylock, we must observe that LOCKED bit.
> 
> so I backported a bunch of atomic files and qspinlock back to v4.9-RT in
> order to get these patches applied. The backported kernel booted and
> showed the same symptoms within seconds (those I reported last FR).
> With those three patches on top, the Core2Duo has now been testing for
> 10 minutes without any issues…

Remember that 4.14 took almost 8 hours to fail, so I'd recommend not to
cheer too much. Let's see what it tells us in 48 hours.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86
  2018-09-26 15:08   ` Thomas Gleixner
@ 2018-09-26 15:38     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-09-26 15:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, will.deacon, mingo, linux-kernel, longman, andrea.parri

On 2018-09-26 17:08:42 [+0200], Thomas Gleixner wrote:
> Remember that 4.14 took almost 8 hours to fail, so I'd recommend not to
> cheer too much. Let's see what it tells us in 48 hours.

I don't cheer. I merely point out that shuffling the whole atomic code
around did not make a change while applying the three patches did.
The v4.4 kernel did survive for 5 days. Here we might start looking for
the cheer outfit.

> Thanks,
> 
> 	tglx
Sebastian

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86
  2018-09-26 11:01 [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2018-09-26 15:01 ` [RFC][PATCH 0/3] locking/qspinlock: Improve determinism " Sebastian Andrzej Siewior
@ 2018-09-26 16:20 ` Waiman Long
  2018-09-26 17:51   ` Peter Zijlstra
  4 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2018-09-26 16:20 UTC (permalink / raw)
  To: Peter Zijlstra, will.deacon, mingo; +Cc: linux-kernel, andrea.parri, tglx

On 09/26/2018 07:01 AM, Peter Zijlstra wrote:
> Back when Will did his qspinlock determinism patches, we were left with one
> cmpxchg loop on x86 due to the use of atomic_fetch_or(). Will proposed a nifty
> trick:
>
>   http://lkml.kernel.org/r/20180409145409.GA9661@arm.com
>
> But at the time we didn't pursue it. This series implements that and argues for
> its correctness. In particular it places an smp_mb__after_atomic() in
> between the two operations, which forces the load to come after the
> store (which is free on x86 anyway).
>
> In particular this ordering ensures a concurrent unlock cannot trigger
> the uncontended handoff. Also it ensures that if the xchg() happens
> after a (successful) trylock, we must observe that LOCKED bit.

When you said "concurrent unlock cannot trigger the uncontended
handoff", are you saying the current code has this uncontended handoff
problem or just when comparing to doing a load first followed by xchg()?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-26 11:01 ` [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86 Peter Zijlstra
@ 2018-09-26 16:30   ` Waiman Long
  2018-09-26 17:54     ` Peter Zijlstra
  2018-09-26 20:52   ` Andrea Parri
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2018-09-26 16:30 UTC (permalink / raw)
  To: Peter Zijlstra, will.deacon, mingo; +Cc: linux-kernel, andrea.parri, tglx

On 09/26/2018 07:01 AM, Peter Zijlstra wrote:
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
>
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
> In particular the ordering between pending and tail is of interrest,
> because that is where the split happens.
>
> The claim is that if we order them, it all works out just fine. There
> are two specific cases where the pending,tail state changes:
>
>  - when the 3rd lock(er) comes in and finds pending set, it'll queue
>    and set tail; since we set tail while pending is set, the ordering
>    of the split is not important (and not fundamentally different from
>    fetch_or). [*]

The split can cause some changes in behavior. The 3rd locker observes
the pending bit and sets tail. The split load of the 2nd locker may make
it observe the tail and back out of the pending loop. As a result, the
2nd locker will acquire the lock after the third locker in this case.
That won't happen with the original code.

I am not saying this is a problem. It is just something we should take
note on.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86
  2018-09-26 16:20 ` Waiman Long
@ 2018-09-26 17:51   ` Peter Zijlstra
  2018-09-26 23:21     ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-26 17:51 UTC (permalink / raw)
  To: Waiman Long; +Cc: will.deacon, mingo, linux-kernel, andrea.parri, tglx

On Wed, Sep 26, 2018 at 12:20:09PM -0400, Waiman Long wrote:
> On 09/26/2018 07:01 AM, Peter Zijlstra wrote:
> > Back when Will did his qspinlock determinism patches, we were left with one
> > cmpxchg loop on x86 due to the use of atomic_fetch_or(). Will proposed a nifty
> > trick:
> >
> >   http://lkml.kernel.org/r/20180409145409.GA9661@arm.com
> >
> > But at the time we didn't pursue it. This series implements that and argues for
> > its correctness. In particular it places an smp_mb__after_atomic() in
> > between the two operations, which forces the load to come after the
> > store (which is free on x86 anyway).
> >
> > In particular this ordering ensures a concurrent unlock cannot trigger
> > the uncontended handoff. Also it ensures that if the xchg() happens
> > after a (successful) trylock, we must observe that LOCKED bit.
> 
> When you said "concurrent unlock cannot trigger the uncontended
> handoff", are you saying the current code has this uncontended handoff
> problem or just when comparing to doing a load first followed by xchg()?

In particular I'm talking about this:

locked:
	/*
	 * claim the lock:
	 *
	 * n,0,0 -> 0,0,1 : lock, uncontended
	 * *,0,0 -> *,0,1 : lock, contended
	 *
	 * If the queue head is the only one in the queue (lock value == tail)
	 * and nobody is pending, clear the tail code and grab the lock.
	 * Otherwise, we only need to grab the lock.
	 */
	/*
	 * In the PV case we might already have _Q_LOCKED_VAL set, because
	 * of lock stealing; therefore we must also allow:
	 *
	 * n,0,1 -> 0,0,1
	 *
	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
	 *       above wait condition, therefore any concurrent setting of
	 *       PENDING will make the uncontended transition fail.
	 */
	if ((val & _Q_TAIL_MASK) == tail) {
		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
			goto release; /* No contention */
	}


Here queue head acquires the lock and we take the "uncontended" path. If
we set PENDING prior to this, the cmpxchg above will fail, and the later
load will observe the tail.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-26 16:30   ` Waiman Long
@ 2018-09-26 17:54     ` Peter Zijlstra
  2018-09-27  7:29       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-26 17:54 UTC (permalink / raw)
  To: Waiman Long; +Cc: will.deacon, mingo, linux-kernel, andrea.parri, tglx

On Wed, Sep 26, 2018 at 12:30:36PM -0400, Waiman Long wrote:
> On 09/26/2018 07:01 AM, Peter Zijlstra wrote:
> > On x86 we cannot do fetch_or with a single instruction and end up
> > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > with a very tricky composite xchg8 + load.
> >
> > The basic idea is that we use xchg8 to test-and-set the pending bit
> > (when it is a byte) and then a load to fetch the whole word. Using
> > two instructions of course opens a window we previously did not have.
> > In particular the ordering between pending and tail is of interest,
> > because that is where the split happens.
> >
> > The claim is that if we order them, it all works out just fine. There
> > are two specific cases where the pending,tail state changes:
> >
> >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> >    and set tail; since we set tail while pending is set, the ordering
> >    of the split is not important (and not fundamentally different from
> >    fetch_or). [*]
> 
> The split can cause some changes in behavior. The 3rd locker observes
> the pending bit and sets tail. The split load of the 2nd locker may make
> it observe the tail and back out of the pending loop. As a result, the
> 2nd locker will acquire the lock after the third locker in this case.
> That won't happen with the original code.
> 
> I am not saying this is a problem. It is just something we should take
> note on.

Right, good one. Yes that can happen.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-26 11:01 ` [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86 Peter Zijlstra
  2018-09-26 16:30   ` Waiman Long
@ 2018-09-26 20:52   ` Andrea Parri
  2018-09-27  7:17     ` Peter Zijlstra
  2018-09-27 12:16   ` David Laight
  2018-10-01 17:17   ` Will Deacon
  3 siblings, 1 reply; 32+ messages in thread
From: Andrea Parri @ 2018-09-26 20:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: will.deacon, mingo, linux-kernel, longman, tglx

On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
> 
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
> In particular the ordering between pending and tail is of interest,
> because that is where the split happens.
> 
> The claim is that if we order them, it all works out just fine. There
> are two specific cases where the pending,tail state changes:
> 
>  - when the 3rd lock(er) comes in and finds pending set, it'll queue
>    and set tail; since we set tail while pending is set, the ordering
>    of the split is not important (and not fundamentally different from
>    fetch_or). [*]
> 
>  - when the last queued lock holder acquires the lock (uncontended),
>    we clear the tail and set the lock byte. By first setting the
>    pending bit this cmpxchg will fail and the later load must then
>    see the remaining tail.
> 
> Another interesting scenario is where there are only 2 threads:
> 
> 	lock := (0,0,0)
> 
> 	CPU 0			CPU 1
> 
> 	lock()			lock()
> 	  trylock(-> 0,0,1)       trylock() /* fail */
> 	    return;               xchg_relaxed(pending, 1) (-> 0,1,1)
> 				  mb()
> 				  val = smp_load_acquire(*lock);
> 
> Where, without the mb() the load would've been allowed to return 0 for
> the locked byte.

If this were true, we would have a violation of "coherence":

C peterz

{}

P0(atomic_t *val)
{
	int r0;

	/* in queued_spin_trylock() */
	r0 = atomic_cmpxchg_acquire(val, 0, 1);
}

P1(atomic_t *val)
{
	int r0;
	int r1;

	/* in queued_spin_trylock() */
	r0 = atomic_cmpxchg_acquire(val, 0, 1); /* or atomic_read(val) */

	/* in queued_spin_lock_slowpath)() -- set_pending_fetch_acquire() */
	r1 = atomic_read_acquire(val);
}

exists (0:r0=0 /\ ~1:r0=0 /\ 1:r1=0)
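
(For reference: saved as, say, peterz.litmus, a test like this can be run
through the LKMM with something along the lines of
"herd7 -conf linux-kernel.cfg peterz.litmus" from tools/memory-model/.)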

  Andrea


> 
> [*] there is a fun scenario where the 3rd locker observes the pending
> bit, but before it sets the tail, the 1st lock (owner) unlocks and the
> 2nd locker acquires the lock, leaving us with a temporary 0,0,1 state.
> But this is not different between fetch_or or this new method.
> 
> Suggested-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/qspinlock.h |    2 +
>  kernel/locking/qspinlock.c       |   56 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 57 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -9,6 +9,8 @@
>  
>  #define _Q_PENDING_LOOPS	(1 << 9)
>  
> +#define _Q_NO_FETCH_OR
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>  extern void __pv_init_lock_hash(void);
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -232,6 +232,60 @@ static __always_inline u32 xchg_tail(str
>  #endif /* _Q_PENDING_BITS == 8 */
>  
>  /**
> + * set_pending_fetch_acquire - fetch the whole lock value and set pending and locked
> + * @lock : Pointer to queued spinlock structure
> + * Return: The previous lock value
> + *
> + * *,*,* -> *,1,*
> + */
> +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> +{
> +#if defined(_Q_NO_FETCH_OR) && _Q_PENDING_BITS == 8
> +	u32 val;
> +
> +	/*
> +	 * For the !LL/SC archs that do not have a native atomic_fetch_or
> +	 * but do have a native xchg (x86) we can do trickery and avoid the
> +	 * cmpxchg-loop based fetch-or to improve determinism.
> +	 *
> +	 * We first xchg the pending byte to set PENDING, and then issue a load
> +	 * for the rest of the word and mask out the pending byte to compute a
> +	 * 'full' value.
> +	 */
> +	val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
> +	/*
> +	 * Ensures the tail load happens after the xchg().
> +	 *
> +	 *	   lock  unlock    (a)
> +	 *   xchg ---------------.
> +	 *    (b)  lock  unlock  +----- fetch_or
> +	 *   load ---------------'
> +	 *	   lock  unlock    (c)
> +	 *
> +	 * For both lock and unlock, (a) and (c) are the same as fetch_or(),
> +	 * since either are fully before or after. The only new case is (b),
> +	 * where a concurrent lock or unlock can interleave with the composite
> +	 * operation.
> +	 *
> +	 * In either case, it is the queueing case that is of interest, otherwise
> +	 * the whole operation is covered by the xchg() and the tail will be 0.
> +	 *
> +	 * For lock-(b); we only care if we set the PENDING bit or not. If we lost
> +	 * the PENDING race, we queue. Otherwise we'd observe the tail anyway.
> +	 *
> +	 * For unlock-(b); since we'll have set PENDING, the uncontended claim
> +	 * will fail and we'll observe a !0 tail.
> +	 */
> +	smp_mb__after_atomic();
> +	val |= atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK;
> +
> +	return val;
> +#else
> +	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +#endif
> +}
> +
> +/**
>   * set_locked - Set the lock bit and own the lock
>   * @lock: Pointer to queued spinlock structure
>   *
> @@ -328,7 +382,7 @@ void queued_spin_lock_slowpath(struct qs
>  	 *
>  	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
>  	 */
> -	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +	val = set_pending_fetch_acquire(lock);
>  
>  	/*
>  	 * If we observe contention, there was a concurrent lock.
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86
  2018-09-26 17:51   ` Peter Zijlstra
@ 2018-09-26 23:21     ` Waiman Long
  0 siblings, 0 replies; 32+ messages in thread
From: Waiman Long @ 2018-09-26 23:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: will.deacon, mingo, linux-kernel, andrea.parri, tglx

On 09/26/2018 01:51 PM, Peter Zijlstra wrote:
> On Wed, Sep 26, 2018 at 12:20:09PM -0400, Waiman Long wrote:
>> On 09/26/2018 07:01 AM, Peter Zijlstra wrote:
>>> Back when Will did his qspinlock determinism patches, we were left with one
>>> cmpxchg loop on x86 due to the use of atomic_fetch_or(). Will proposed a nifty
>>> trick:
>>>
>>>   http://lkml.kernel.org/r/20180409145409.GA9661@arm.com
>>>
>>> But at the time we didn't pursue it. This series implements that and argues for
>>> its correctness. In particular it places an smp_mb__after_atomic() in
>>> between the two operations, which forces the load to come after the
>>> store (which is free on x86 anyway).
>>>
>>> In particular this ordering ensures a concurrent unlock cannot trigger
>>> the uncontended handoff. Also it ensures that if the xchg() happens
>>> after a (successful) trylock, we must observe that LOCKED bit.
>> When you said "concurrent unlock cannot trigger the uncontended
>> handoff", are you saying the current code has this uncontended handoff
>> problem or just when comparing to doing a load first followed by xchg()?
> In particular I'm talking about this:
>
> locked:
> 	/*
> 	 * claim the lock:
> 	 *
> 	 * n,0,0 -> 0,0,1 : lock, uncontended
> 	 * *,0,0 -> *,0,1 : lock, contended
> 	 *
> 	 * If the queue head is the only one in the queue (lock value == tail)
> 	 * and nobody is pending, clear the tail code and grab the lock.
> 	 * Otherwise, we only need to grab the lock.
> 	 */
> 	/*
> 	 * In the PV case we might already have _Q_LOCKED_VAL set, because
> 	 * of lock stealing; therefore we must also allow:
> 	 *
> 	 * n,0,1 -> 0,0,1
> 	 *
> 	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
> 	 *       above wait condition, therefore any concurrent setting of
> 	 *       PENDING will make the uncontended transition fail.
> 	 */
> 	if ((val & _Q_TAIL_MASK) == tail) {
> 		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
> 			goto release; /* No contention */
> 	}
>
>
> Here queue head acquires the lock and we take the "uncontended" path. If
> we set PENDING prior to this, the cmpxchg above will fail, and the later
> load will observe the tail.

Thanks for the clarification. I was wondering if you are talking about a
latent bug in the code. So this is not the case here.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-26 20:52   ` Andrea Parri
@ 2018-09-27  7:17     ` Peter Zijlstra
  2018-09-27  7:47       ` Andrea Parri
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-27  7:17 UTC (permalink / raw)
  To: Andrea Parri; +Cc: will.deacon, mingo, linux-kernel, longman, tglx

On Wed, Sep 26, 2018 at 10:52:08PM +0200, Andrea Parri wrote:
> On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> > On x86 we cannot do fetch_or with a single instruction and end up
> > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > with a very tricky composite xchg8 + load.
> > 
> > The basic idea is that we use xchg8 to test-and-set the pending bit
> > (when it is a byte) and then a load to fetch the whole word. Using
> > two instructions of course opens a window we previously did not have.
> > In particular the ordering between pending and tail is of interest,
> > because that is where the split happens.
> > 
> > The claim is that if we order them, it all works out just fine. There
> > are two specific cases where the pending,tail state changes:
> > 
> >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> >    and set tail; since we set tail while pending is set, the ordering
> >    of the split is not important (and not fundamentally different from
> >    fetch_or). [*]
> > 
> >  - when the last queued lock holder acquires the lock (uncontended),
> >    we clear the tail and set the lock byte. By first setting the
> >    pending bit this cmpxchg will fail and the later load must then
> >    see the remaining tail.
> > 
> > Another interesting scenario is where there are only 2 threads:
> > 
> > 	lock := (0,0,0)
> > 
> > 	CPU 0			CPU 1
> > 
> > 	lock()			lock()
> > 	  trylock(-> 0,0,1)       trylock() /* fail */
> > 	    return;               xchg_relaxed(pending, 1) (-> 0,1,1)
> > 				  mb()
> > 				  val = smp_load_acquire(*lock);
> > 
> > Where, without the mb() the load would've been allowed to return 0 for
> > the locked byte.
> 
> If this were true, we would have a violation of "coherence":

The thing is, this is mixed size, see:

  https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf

If I remember things correctly (I've not reread that paper recently) it
is allowed for:

	old = xchg(pending,1);
	val = smp_load_acquire(*lock);

to be re-ordered like:

	val = smp_load_acquire(*lock);
	old = xchg(pending, 1);

with the exception that it will fwd the pending byte into the later
load, so we get:

	val = (val & _Q_PENDING_MASK) | (old << _Q_PENDING_OFFSET);

for 'free'.

LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.

With the addition of smp_mb__after_atomic(), we disallow the load to be
done prior to the xchg(). It might still fwd the more recent pending
byte from its store buffer, but at least the other bytes must not be
earlier.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-26 17:54     ` Peter Zijlstra
@ 2018-09-27  7:29       ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-27  7:29 UTC (permalink / raw)
  To: Waiman Long; +Cc: will.deacon, mingo, linux-kernel, andrea.parri, tglx

On Wed, Sep 26, 2018 at 07:54:18PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 26, 2018 at 12:30:36PM -0400, Waiman Long wrote:
> > On 09/26/2018 07:01 AM, Peter Zijlstra wrote:
> > > On x86 we cannot do fetch_or with a single instruction and end up
> > > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > > with a very tricky composite xchg8 + load.
> > >
> > > The basic idea is that we use xchg8 to test-and-set the pending bit
> > > (when it is a byte) and then a load to fetch the whole word. Using
> > > two instructions of course opens a window we previously did not have.
> > > In particular the ordering between pending and tail is of interest,
> > > because that is where the split happens.
> > >
> > > The claim is that if we order them, it all works out just fine. There
> > > are two specific cases where the pending,tail state changes:
> > >
> > >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> > >    and set tail; since we set tail while pending is set, the ordering
> > >    of the split is not important (and not fundamentally different from
> > >    fetch_or). [*]
> > 
> > The split can cause some changes in behavior. The 3rd locker observes
> > the pending bit and sets tail. The split load of the 2nd locker may make
> > it observe the tail and back out of the pending loop. As a result, the
> > 2nd locker will acquire the lock after the third locker in this case.
> > That won't happen with the original code.
> > 
> > I am not saying this is a problem. It is just something we should take
> > note on.
> 
> Right, good one. Yes that can happen.

I think I remember why I 'discounted' that scenario; the same can happen
with the fetch_or() but in a timing scenario.

Consider:

  CPU 0		CPU 1		CPU 2		CPU 3

  lock (->0,0,1)
		lock
		  trylock (fail)
		  tas-pending (->0,1,1)
		  wait-!locked
				lock
				  trylock (fail)
				  tas-pending (fail)
  unlock (->0,1,0)
		  acquire (->0,0,1)
						lock
						  trylock (fail)
				  (A) xchg-tail
						  tas-pending -> (x,1,1)
				  (B) xchg-tail

And then, depending on A/B x is either 0 or !0.

So the ordering of the concurrent lock attempts is always (obviously)
subject to timing, the split xchg-load thing just makes it 'worse'.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-27  7:17     ` Peter Zijlstra
@ 2018-09-27  7:47       ` Andrea Parri
  2018-09-27  7:59         ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Parri @ 2018-09-27  7:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: will.deacon, mingo, linux-kernel, longman, tglx

On Thu, Sep 27, 2018 at 09:17:47AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 26, 2018 at 10:52:08PM +0200, Andrea Parri wrote:
> > On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> > > On x86 we cannot do fetch_or with a single instruction and end up
> > > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > > with a very tricky composite xchg8 + load.
> > > 
> > > The basic idea is that we use xchg8 to test-and-set the pending bit
> > > (when it is a byte) and then a load to fetch the whole word. Using
> > > two instructions of course opens a window we previously did not have.
> > > In particular the ordering between pending and tail is of interest,
> > > because that is where the split happens.
> > > 
> > > The claim is that if we order them, it all works out just fine. There
> > > are two specific cases where the pending,tail state changes:
> > > 
> > >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> > >    and set tail; since we set tail while pending is set, the ordering
> > >    of the split is not important (and not fundamentally different from
> > >    fetch_or). [*]
> > > 
> > >  - when the last queued lock holder acquires the lock (uncontended),
> > >    we clear the tail and set the lock byte. By first setting the
> > >    pending bit this cmpxchg will fail and the later load must then
> > >    see the remaining tail.
> > > 
> > > Another interesting scenario is where there are only 2 threads:
> > > 
> > > 	lock := (0,0,0)
> > > 
> > > 	CPU 0			CPU 1
> > > 
> > > 	lock()			lock()
> > > 	  trylock(-> 0,0,1)       trylock() /* fail */
> > > 	    return;               xchg_relaxed(pending, 1) (-> 0,1,1)
> > > 				  mb()
> > > 				  val = smp_load_acquire(*lock);
> > > 
> > > Where, without the mb() the load would've been allowed to return 0 for
> > > the locked byte.
> > 
> > If this were true, we would have a violation of "coherence":
> 
> The thing is, this is mixed size, see:

The accesses to ->val are not, and those certainly have to meet the
"coherence" constraint (no matter the store to ->pending).


> 
>   https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
> 
> If I remember things correctly (I've not reread that paper recently) it
> is allowed for:
> 
> 	old = xchg(pending,1);
> 	val = smp_load_acquire(*lock);
> 
> to be re-ordered like:
> 
> 	val = smp_load_acquire(*lock);
> 	old = xchg(pending, 1);
> 
> with the exception that it will fwd the pending byte into the later
> load, so we get:
> 
> 	val = (val & _Q_PENDING_MASK) | (old << _Q_PENDING_OFFSET);
> 
> for 'free'.
> 
> LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.

True, but it is nothing conceptually new to deal with: there're Cat
models that handle mixed-size accesses, just give it time.

  Andrea


> 
> With the addition of smp_mb__after_atomic(), we disallow the load to be
> done prior to the xchg(). It might still fwd the more recent pending
> byte from its store buffer, but at least the other bytes must not be
> earlier.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-27  7:47       ` Andrea Parri
@ 2018-09-27  7:59         ` Peter Zijlstra
  2018-09-27  8:13           ` Andrea Parri
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-27  7:59 UTC (permalink / raw)
  To: Andrea Parri; +Cc: will.deacon, mingo, linux-kernel, longman, tglx

On Thu, Sep 27, 2018 at 09:47:48AM +0200, Andrea Parri wrote:
> > LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.
> 
> True, but it is nothing conceptually new to deal with: there're Cat
> models that handle mixed-size accesses, just give it time.

Sure, but until that time I must not rely on (and thus not use) LKMM for
qspinlock things.

So while your argument about coherence might be true -- I'll have to
think about it; litmus tests are out the window.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-27  7:59         ` Peter Zijlstra
@ 2018-09-27  8:13           ` Andrea Parri
  2018-09-27  8:57             ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Parri @ 2018-09-27  8:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: will.deacon, mingo, linux-kernel, longman, tglx

On Thu, Sep 27, 2018 at 09:59:35AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 27, 2018 at 09:47:48AM +0200, Andrea Parri wrote:
> > > LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.
> > 
> > True, but it is nothing conceptually new to deal with: there're Cat
> > models that handle mixed-size accesses, just give it time.
> 
> Sure, but until that time I must not rely on (and thus not use) LKMM for
> qspinlock things.

This is way too generic to be agreed ;D


> 
> So while your argument about coherence might be true -- I'll have to
> think about it; litmus tests are out the window.

You trimmed the litmus test I gave you.

  Andrea

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-27  8:13           ` Andrea Parri
@ 2018-09-27  8:57             ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-09-27  8:57 UTC (permalink / raw)
  To: Andrea Parri; +Cc: will.deacon, mingo, linux-kernel, longman, tglx

On Thu, Sep 27, 2018 at 10:13:15AM +0200, Andrea Parri wrote:
> On Thu, Sep 27, 2018 at 09:59:35AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 27, 2018 at 09:47:48AM +0200, Andrea Parri wrote:
> > > > LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.
> > > 
> > > True, but it is nothing conceptually new to deal with: there're Cat
> > > models that handle mixed-size accesses, just give it time.
> > 
> > Sure, but until that time I must not rely on (and thus not use) LKMM for
> > qspinlock things.
> 
> This is way too generic to be agreed ;D

Only if you know that thing well enough to know why it gives a certain
answer, and thus don't already need it.

If you need it to help you with something, it can't because it doesn't
do the mixed size thing.

> > So while your argument about coherence might be true -- I'll have to
> > think about it; litmus tests are out the window.
> 
> You trimmed the litmus test I gave you.

Because of not wanting to reverse engineer the argument from the litmus
test. But yes, I think I see your point, the earlier trylock will load
the new value and our later load cannot be earlier than that.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-26 11:01 ` [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86 Peter Zijlstra
  2018-09-26 16:30   ` Waiman Long
  2018-09-26 20:52   ` Andrea Parri
@ 2018-09-27 12:16   ` David Laight
  2018-10-01 17:17   ` Will Deacon
  3 siblings, 0 replies; 32+ messages in thread
From: David Laight @ 2018-09-27 12:16 UTC (permalink / raw)
  To: 'Peter Zijlstra', will.deacon, mingo
  Cc: linux-kernel, longman, andrea.parri, tglx

From: Peter Zijlstra
> Sent: 26 September 2018 12:01
> 
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
> 
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
...

IIRC the load will be 'slow' because it will have to wait for the
earlier store to actually complete - rather than being satisfied
by data from the store buffer (because the widths are different).

This may not matter for xchg ?

	David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-09-26 11:01 ` [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86 Peter Zijlstra
                     ` (2 preceding siblings ...)
  2018-09-27 12:16   ` David Laight
@ 2018-10-01 17:17   ` Will Deacon
  2018-10-01 20:00     ` Peter Zijlstra
  2018-10-02 12:31     ` Andrea Parri
  3 siblings, 2 replies; 32+ messages in thread
From: Will Deacon @ 2018-10-01 17:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx

Hi Peter,

Thanks for chewing up my afternoon ;)

On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
> 
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
> In particular the ordering between pending and tail is of interest,
> because that is where the split happens.
> 
> The claim is that if we order them, it all works out just fine. There
> are two specific cases where the pending,tail state changes:
> 
>  - when the 3rd lock(er) comes in and finds pending set, it'll queue
>    and set tail; since we set tail while pending is set, the ordering
>    of the split is not important (and not fundamentally different from
>    fetch_or). [*]
> 
>  - when the last queued lock holder acquires the lock (uncontended),
>    we clear the tail and set the lock byte. By first setting the
>    pending bit this cmpxchg will fail and the later load must then
>    see the remaining tail.
> 
> Another interesting scenario is where there are only 2 threads:
> 
> 	lock := (0,0,0)
> 
> 	CPU 0			CPU 1
> 
> 	lock()			lock()
> 	  trylock(-> 0,0,1)       trylock() /* fail */
> 	    return;               xchg_relaxed(pending, 1) (-> 0,1,1)
> 				  mb()
> 				  val = smp_load_acquire(*lock);
> 
> Where, without the mb() the load would've been allowed to return 0 for
> the locked byte.
> 
> [*] there is a fun scenario where the 3rd locker observes the pending
> bit, but before it sets the tail, the 1st lock (owner) unlocks and the
> 2nd locker acquires the lock, leaving us with a temporary 0,0,1 state.
> But this is not different between fetch_or or this new method.
> 
> Suggested-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/qspinlock.h |    2 +
>  kernel/locking/qspinlock.c       |   56 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 57 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -9,6 +9,8 @@
>  
>  #define _Q_PENDING_LOOPS	(1 << 9)
>  
> +#define _Q_NO_FETCH_OR
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>  extern void __pv_init_lock_hash(void);
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -232,6 +232,60 @@ static __always_inline u32 xchg_tail(str
>  #endif /* _Q_PENDING_BITS == 8 */
>  
>  /**
> + * set_pending_fetch_acquire - fetch the whole lock value and set pending and locked
> + * @lock : Pointer to queued spinlock structure
> + * Return: The previous lock value
> + *
> + * *,*,* -> *,1,*
> + */
> +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> +{
> +#if defined(_Q_NO_FETCH_OR) && _Q_PENDING_BITS == 8
> +	u32 val;
> +
> +	/*
> +	 * For the !LL/SC archs that do not have a native atomic_fetch_or
> +	 * but do have a native xchg (x86) we can do trickery and avoid the
> +	 * cmpxchg-loop based fetch-or to improve determinism.
> +	 *
> +	 * We first xchg the pending byte to set PENDING, and then issue a load
> +	 * for the rest of the word and mask out the pending byte to compute a
> +	 * 'full' value.
> +	 */
> +	val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;

If we make this an xchg_acquire(), can we drop the smp_mb__after_atomic()
and use a plain atomic_read() to get the rest of the word? But actually,
consider this scenario with your patch:

1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
   pending.

2. CPU1 comes in and sets pending, spins on locked

3. CPU2 sees a pending and locked val, and is about to enter the head of
   the waitqueue (i.e. it's right before xchg_tail()).

4. The lock holder unlock()s, CPU1 takes the lock() and then unlock()s
   it, so pending and locked are now 0.

5. CPU0 sets pending and reads back zeroes for the other fields

6. CPU0 clears pending and sets locked -- it now has the lock

7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
   for locked and pending to go clear. However, it reads a stale value
   from step (4) and attempts the atomic_try_cmpxchg() to take the lock.

8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
   point we're hosed, because both CPU2 and CPU0 have the lock.

Is there something I'm missing that means this can't happen? I suppose
cacheline granularity ends up giving serialisation between (4) and (7),
but I'd *much* prefer not to rely on that because it feels horribly
fragile.

Another idea I was playing with was adding test_and_set_bit_acquire()
for this, because x86 has an instruction for that, right?

> +	/*
> +	 * Ensures the tail load happens after the xchg().
> +	 *
> +	 *	   lock  unlock    (a)
> +	 *   xchg ---------------.
> +	 *    (b)  lock  unlock  +----- fetch_or
> +	 *   load ---------------'
> +	 *	   lock  unlock    (c)
> +	 *
> +	 * For both lock and unlock, (a) and (c) are the same as fetch_or(),
> +	 * since either are fully before or after. The only new case is (b),
> +	 * where a concurrent lock or unlock can interleave with the composite
> +	 * operation.
> +	 *
> +	 * In either case, it is the queueing case that is of interest, otherwise
> +	 * the whole operation is covered by the xchg() and the tail will be 0.
> +	 *
> +	 * For lock-(b); we only care if we set the PENDING bit or not. If we lost
> +	 * the PENDING race, we queue. Otherwise we'd observe the tail anyway.
> +	 *
> +	 * For unlock-(b); since we'll have set PENDING, the uncontended claim
> +	 * will fail and we'll observe a !0 tail.
> +	 */

I failed miserably at parsing this comment :(

I think the main issue is that I don't understand how to read the little
diagram you've got.

Will

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 1/3] locking/qspinlock: Re-order code
  2018-09-26 11:01 ` [RFC][PATCH 1/3] locking/qspinlock: Re-order code Peter Zijlstra
@ 2018-10-01 17:17   ` Will Deacon
  0 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2018-10-01 17:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx

On Wed, Sep 26, 2018 at 01:01:18PM +0200, Peter Zijlstra wrote:
> Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL)
> such that we lose the indent. This also results in a more natural code
> flow IMO.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/locking/qspinlock.c |   56 +++++++++++++++++++++------------------------
>  1 file changed, 27 insertions(+), 29 deletions(-)

I think I actually prefer the current code flow, but that's probably just
because I'm used to it and I don't have a strong opinion about this, so:

Acked-by: Will Deacon <will.deacon@arm.com>

given that this looks correct to me.

Will

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 2/3] locking/qspinlock: Rework some comments
  2018-09-26 11:01 ` [RFC][PATCH 2/3] locking/qspinlock: Rework some comments Peter Zijlstra
@ 2018-10-01 17:17   ` Will Deacon
  2018-10-01 19:10     ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2018-10-01 17:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx

On Wed, Sep 26, 2018 at 01:01:19PM +0200, Peter Zijlstra wrote:
> While working my way through the code again, I felt the comments could
> use help.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/locking/qspinlock.c |   40 ++++++++++++++++++++++++++++------------
>  1 file changed, 28 insertions(+), 12 deletions(-)
> 
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -326,16 +326,23 @@ void queued_spin_lock_slowpath(struct qs
>  	/*
>  	 * trylock || pending
>  	 *
> -	 * 0,0,0 -> 0,0,1 ; trylock
> -	 * 0,0,1 -> 0,1,1 ; pending
> +	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
>  	 */
>  	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +
>  	/*
> -	 * If we observe any contention; undo and queue.
> +	 * If we observe contention, there was a concurrent lock.

Nit: I think "concurrent lock" is confusing here, because that implies to
me that the lock was actually taken behind our back, which isn't necessarily
the case. How about "there is a concurrent locker"?

> +	 *
> +	 * Undo and queue; our setting of PENDING might have made the
> +	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
> +	 * on @next to become !NULL.
>  	 */

Hmm, but it could also fail another concurrent set of PENDING (and the lock
could just be held the entire time).

>  	if (unlikely(val & ~_Q_LOCKED_MASK)) {
> +
> +		/* Undo PENDING if we set it. */
>  		if (!(val & _Q_PENDING_MASK))
>  			clear_pending(lock);
> +
>  		goto queue;
>  	}
>  
> @@ -466,7 +473,7 @@ void queued_spin_lock_slowpath(struct qs
>  	 * claim the lock:
>  	 *
>  	 * n,0,0 -> 0,0,1 : lock, uncontended
> -	 * *,*,0 -> *,*,1 : lock, contended
> +	 * *,0,0 -> *,0,1 : lock, contended

Pending can be set behind our back in the contended case, in which case
we take the lock with a single byte store and don't clear pending. You
mention this in the updated comment below, but I think we should leave this
comment alone.
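
(For reference, the single byte store in question is set_locked(), which
as far as I remember is just:

	static __always_inline void set_locked(struct qspinlock *lock)
	{
		/* Grab the lock; the pending byte and tail are left alone. */
		WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
	}

so pending and tail are untouched by it.)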

Will

>  	 *
>  	 * If the queue head is the only one in the queue (lock value == tail)
>  	 * and nobody is pending, clear the tail code and grab the lock.
> @@ -474,16 +481,25 @@ void queued_spin_lock_slowpath(struct qs
>  	 */
>  
>  	/*
> -	 * In the PV case we might already have _Q_LOCKED_VAL set.
> +	 * In the PV case we might already have _Q_LOCKED_VAL set, because
> +	 * of lock stealing; therefore we must also allow:
>  	 *
> -	 * The atomic_cond_read_acquire() call above has provided the
> -	 * necessary acquire semantics required for locking.
> -	 */
> -	if (((val & _Q_TAIL_MASK) == tail) &&
> -	    atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
> -		goto release; /* No contention */
> +	 * n,0,1 -> 0,0,1
> +	 *
> +	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
> +	 *       above wait condition, therefore any concurrent setting of
> +	 *       PENDING will make the uncontended transition fail.
> +	 */
> +	if ((val & _Q_TAIL_MASK) == tail) {
> +		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
> +			goto release; /* No contention */
> +	}
>  
> -	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
> +	/*
> +	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
> +	 * which will then detect the remaining tail and queue behind us
> +	 * ensuring we'll see a @next.
> +	 */
>  	set_locked(lock);
>  
>  	/*
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 2/3] locking/qspinlock: Rework some comments
  2018-10-01 17:17   ` Will Deacon
@ 2018-10-01 19:10     ` Peter Zijlstra
  2018-10-02 13:20       ` Will Deacon
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-10-01 19:10 UTC (permalink / raw)
  To: Will Deacon; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx

On Mon, Oct 01, 2018 at 06:17:08PM +0100, Will Deacon wrote:
> On Wed, Sep 26, 2018 at 01:01:19PM +0200, Peter Zijlstra wrote:
> > +
> >  	/*
> > -	 * If we observe any contention; undo and queue.
> > +	 * If we observe contention, there was a concurrent lock.
> 
> Nit: I think "concurrent lock" is confusing here, because that implies to
> me that the lock was actually taken behind our back, which isn't necessarily
> the case. How about "there is a concurrent locker"?

Yes, that's better.

> > +	 *
> > +	 * Undo and queue; our setting of PENDING might have made the
> > +	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
> > +	 * on @next to become !NULL.
> >  	 */
> 
> Hmm, but it could also fail another concurrent set of PENDING (and the lock
> could just be held the entire time).

Right. What I wanted to convey was that if we observe _any_ contention,
we must abort and queue, because of that above condition failing and
waiting on @next.

The other cases weren't as critical, but that one really does require us
to queue in order to make forward progress.

Or did I misunderstand your concern?

> >  	if (unlikely(val & ~_Q_LOCKED_MASK)) {
> > +
> > +		/* Undo PENDING if we set it. */
> >  		if (!(val & _Q_PENDING_MASK))
> >  			clear_pending(lock);
> > +
> >  		goto queue;
> >  	}
> >  
> > @@ -466,7 +473,7 @@ void queued_spin_lock_slowpath(struct qs
> >  	 * claim the lock:
> >  	 *
> >  	 * n,0,0 -> 0,0,1 : lock, uncontended
> > -	 * *,*,0 -> *,*,1 : lock, contended
> > +	 * *,0,0 -> *,0,1 : lock, contended
> 
> Pending can be set behind our back in the contended case, in which case
> we take the lock with a single byte store and don't clear pending. You
> mention this in the updated comment below, but I think we should leave this
> comment alone.

Ah, so the reason I write it like so is because when we get here,
val.locked_pending == 0, per the atomic_cond_read_acquire() condition.
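
That is, at the head of the queue we wait with (going from memory, the
exact line may differ slightly):

	val = atomic_cond_read_acquire(&lock->val,
				       !(VAL & _Q_LOCKED_PENDING_MASK));

so both the locked and pending bytes are 0 in the value we're looking at
once that returns.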



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-10-01 17:17   ` Will Deacon
@ 2018-10-01 20:00     ` Peter Zijlstra
  2018-10-02 13:19       ` Will Deacon
  2018-10-02 12:31     ` Andrea Parri
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2018-10-01 20:00 UTC (permalink / raw)
  To: Will Deacon; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx

On Mon, Oct 01, 2018 at 06:17:00PM +0100, Will Deacon wrote:
> Hi Peter,
> 
> Thanks for chewing up my afternoon ;)

I'll get you a beer in EDI ;-)

> On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:

> >  /**
> > + * set_pending_fetch_acquire - fetch the whole lock value and set pending and locked
> > + * @lock : Pointer to queued spinlock structure
> > + * Return: The previous lock value
> > + *
> > + * *,*,* -> *,1,*
> > + */
> > +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> > +{
> > +#if defined(_Q_NO_FETCH_OR) && _Q_PENDING_BITS == 8
> > +	u32 val;
> > +
> > +	/*
> > +	 * For the !LL/SC archs that do not have a native atomic_fetch_or
> > +	 * but do have a native xchg (x86) we can do trickery and avoid the
> > +	 * cmpxchg-loop based fetch-or to improve determinism.
> > +	 *
> > +	 * We first xchg the pending byte to set PENDING, and then issue a load
> > +	 * for the rest of the word and mask out the pending byte to compute a
> > +	 * 'full' value.
> > +	 */
> > +	val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
> 
> If we make this an xchg_acquire(), can we drop the smp_mb__after_atomic()
> and use a plain atomic_read() to get the rest of the word? 

I did consider that, but I confused myself so many times I stuck with
this one. Primarily because on x86 it doesn't matter one way or the
other, and smp_mb() is 'easier' to reason about.

> But actually,
> consider this scenario with your patch:
> 
> 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
>    pending.
> 
> 2. CPU1 comes in and sets pending, spins on locked
> 
> 3. CPU2 sees a pending and locked val, and is about to enter the head of
>    the waitqueue (i.e. it's right before xchg_tail()).
> 
> 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
>    it, so pending and locked are now 0.
> 
> 5. CPU0 sets pending and reads back zeroes for the other fields
> 
> 6. CPU0 clears pending and sets locked -- it now has the lock
> 
> 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
>    for locked and pending to go clear. However, it reads a stale value
>    from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> 
> 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
>    point we're hosed, because both CPU2 and CPU0 have the lock.

Let me draw a picture of that..


  CPU0		CPU1		CPU2		CPU3

0)						lock
						  trylock -> (0,0,1)
1)lock
    trylock /* fail */

2)		lock
		  trylock /* fail */
		  tas-pending -> (0,1,1)
		  wait-locked

3)				lock
				  trylock /* fail */
				  tas-pending /* fail */

4)						unlock -> (0,1,0)
		  clr_pnd_set_lck -> (0,0,1)
		  unlock -> (0,0,0)

5)  tas-pending -> (0,1,0)
    read-val -> (0,1,0)
6)  clr_pnd_set_lck -> (0,0,1)
7)				  xchg_tail -> (n,0,1)
				  load_acquire <- (n,0,0) (from-4)
8)				  cmpxchg /* fail */
				  set_locked()

> Is there something I'm missing that means this can't happen? I suppose
> cacheline granularity ends up giving serialisation between (4) and (7),
> but I'd *much* prefer not to rely on that because it feels horribly
> fragile.

Well, on x86 atomics are fully ordered, so the xchg_tail() does in
fact have smp_mb() in it, and that should order things sufficiently
for that not to happen, I think.

But in general, yes ick. Alternatively, making xchg_tail an ACQUIRE
doesn't seem too far out..
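
Something like the below for the _Q_PENDING_BITS == 8 case, I suppose
(untested sketch, just trading the relaxed xchg for an acquire one):

	static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
	{
		/*
		 * ACQUIRE: order subsequent loads of the lock word after
		 * the tail update.
		 */
		return (u32)xchg_acquire(&lock->tail,
					 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
	}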

> Another idea I was playing with was adding test_and_set_bit_acquire()
> for this, because x86 has an instruction for that, right?

LOCK BTS, yes. So it can do a full 32bit RmW, but it cannot return the
old value of the word, just the old bit (in CF).

I suppose you get rid of the whole mixed size thing, but you still have
the whole two instruction thing.

> > +	/*
> > +	 * Ensures the tail load happens after the xchg().
> > +	 *
> > +	 *	   lock  unlock    (a)
> > +	 *   xchg ---------------.
> > +	 *    (b)  lock  unlock  +----- fetch_or
> > +	 *   load ---------------'
> > +	 *	   lock  unlock    (c)
> > +	 *
> 
> I failed miserably at parsing this comment :(
> 
> I think the main issue is that I don't understand how to read the little
> diagram you've got.

Where fetch_or() is indivisible and has happens-before (a) and
happens-after (c), the new thing is in fact divisible and has
happens-in-between (b).

In the happens-in-between window (b), we can either get a new
concurrent locker, or an extant concurrent locker can make progress
because an unlock happened.
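
In code, the thing the diagram tries to capture is roughly (modulo the
exact masking; see the patch):

	/* fetch_or(): indivisible, everything is either (a) before or (c) after */
	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);

vs.

	val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
	/* (b): a concurrent lock or unlock can hit the lock word right here */
	smp_mb__after_atomic();
	val |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;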

But the rest of the text might indeed be very confused. I think I wrote
the bulk of that when I was in fact doing a xchg16 on locked_pending,
but that's fundamentally broken. I did edit it afterwards, but that
might have just made it worse.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-10-01 17:17   ` Will Deacon
  2018-10-01 20:00     ` Peter Zijlstra
@ 2018-10-02 12:31     ` Andrea Parri
  2018-10-02 13:22       ` Will Deacon
  1 sibling, 1 reply; 32+ messages in thread
From: Andrea Parri @ 2018-10-02 12:31 UTC (permalink / raw)
  To: Will Deacon; +Cc: Peter Zijlstra, mingo, linux-kernel, longman, tglx

> consider this scenario with your patch:
> 
> 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
>    pending.
> 
> 2. CPU1 comes in and sets pending, spins on locked
> 
> 3. CPU2 sees a pending and locked val, and is about to enter the head of
>    the waitqueue (i.e. it's right before xchg_tail()).
> 
> 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
>    it, so pending and locked are now 0.
> 
> 5. CPU0 sets pending and reads back zeroes for the other fields
> 
> 6. CPU0 clears pending and sets locked -- it now has the lock
> 
> 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
>    for locked and pending to go clear. However, it reads a stale value
>    from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> 
> 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
>    point we're hosed, because both CPU2 and CPU0 have the lock.

Thanks for pointing this out. I am wondering: can't we have a similar
scenario with the current code (i.e., w/o these patches)? What prevents
the scenario reported below, following Peter's diagram, from happening?

  Andrea


  CPU0		CPU1		CPU2		CPU3

0)						lock
						  trylock -> (0,0,1)
1)lock
    trylock /* fail */

2)		lock
		  trylock /* fail */
		  fetch_or_acquire -> (0,1,1)
		  wait-locked

3)				lock
				  trylock /* fail */
				  goto queue

4)						unlock -> (0,1,0)
		  clr_pnd_set_lck -> (0,0,1)
		  unlock -> (0,0,0)

5)  fetch_or_acquire -> (0,1,0)
6)  clr_pnd_set_lck -> (0,0,1)
7)				  xchg_tail -> (n,0,1)
				  load_acquire <- (n,0,0) (from-4)
8)				  cmpxchg /* fail */
				  set_locked()

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-10-01 20:00     ` Peter Zijlstra
@ 2018-10-02 13:19       ` Will Deacon
  2018-10-02 14:14         ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2018-10-02 13:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx

On Mon, Oct 01, 2018 at 10:00:28PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 01, 2018 at 06:17:00PM +0100, Will Deacon wrote:
> > Thanks for chewing up my afternoon ;)
> 
> I'll get you a beer in EDI ;-)

Just one?!

> > But actually,
> > consider this scenario with your patch:
> > 
> > 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
> >    pending.
> > 
> > 2. CPU1 comes in and sets pending, spins on locked
> > 
> > 3. CPU2 sees a pending and locked val, and is about to enter the head of
> >    the waitqueue (i.e. it's right before xchg_tail()).
> > 
> > 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
> >    it, so pending and locked are now 0.
> > 
> > 5. CPU0 sets pending and reads back zeroes for the other fields
> > 
> > 6. CPU0 clears pending and sets locked -- it now has the lock
> > 
> > 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
> >    for locked and pending to go clear. However, it reads a stale value
> >    from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> > 
> > 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
> >    point we're hosed, because both CPU2 and CPU0 have the lock.
> 
> Let me draw a picture of that..
> 
> 
>   CPU0		CPU1		CPU2		CPU3
> 
> 0)						lock
> 						  trylock -> (0,0,1)
> 1)lock
>     trylock /* fail */
> 
> 2)		lock
> 		  trylock /* fail */
> 		  tas-pending -> (0,1,1)
> 		  wait-locked
> 
> 3)				lock
> 				  trylock /* fail */
> 				  tas-pending /* fail */
> 
> 4)						unlock -> (0,1,0)
> 		  clr_pnd_set_lck -> (0,0,1)
> 		  unlock -> (0,0,0)
> 
> 5)  tas-pending -> (0,1,0)
>     read-val -> (0,1,0)
> 6)  clr_pnd_set_lck -> (0,0,1)
> 7)				  xchg_tail -> (n,0,1)
> 				  load_acquire <- (n,0,0) (from-4)
> 8)				  cmpxchg /* fail */
> 				  set_locked()
> 
> > Is there something I'm missing that means this can't happen? I suppose
> > cacheline granularity ends up giving serialisation between (4) and (7),
> > but I'd *much* prefer not to rely on that because it feels horribly
> > fragile.
> 
> Well, on x86 atomics are fully ordered, so the xchg_tail() does in
> fact have smp_mb() in it, and that should order things sufficiently
> for that not to happen, I think.

Hmm, does that actually help, though? I still think you're relying on the
cache-coherence protocol to serialise the xchg() on pending before the
xchg_tail(), which I think is fragile because they don't actually overlap.

> But in general, yes ick. Alternatively, making xchg_tail an ACQUIRE
> doesn't seem too far out..
> 
> > Another idea I was playing with was adding test_and_set_bit_acquire()
> > for this, because x86 has an instruction for that, right?
> 
> LOCK BTS, yes. So it can do a full 32bit RmW, but it cannot return the
> old value of the word, just the old bit (in CF).
> 
> I suppose you get rid of the whole mixed size thing, but you still have
> the whole two instruction thing.

I really think we need that set of pending to operate on the whole lock
word.

> > > +	/*
> > > +	 * Ensures the tail load happens after the xchg().
> > > +	 *
> > > +	 *	   lock  unlock    (a)
> > > +	 *   xchg ---------------.
> > > +	 *    (b)  lock  unlock  +----- fetch_or
> > > +	 *   load ---------------'
> > > +	 *	   lock  unlock    (c)
> > > +	 *
> > 
> > I failed miserably at parsing this comment :(
> > 
> > I think the main issue is that I don't understand how to read the little
> > diagram you've got.
> 
> Where fetch_or() is indivisible and has happens-before (a) and
> happens-after (c), the new thing is in fact divisible and has
> happens-in-between (b).
> 
> In the happens-in-between window (b), we can either get a new
> concurrent locker, or an extant concurrent locker can make progress
> because an unlock happened.
> 
> But the rest of the text might indeed be very confused. I think I wrote
> the bulk of that when I was in fact doing a xchg16 on locked_pending,
> but that's fundamentally broken. I did edit it afterwards, but that
> might have just made it worse.

Ok, maybe just remove it :)

Will

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 2/3] locking/qspinlock: Rework some comments
  2018-10-01 19:10     ` Peter Zijlstra
@ 2018-10-02 13:20       ` Will Deacon
  2018-10-02 13:43         ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2018-10-02 13:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx

Hi Peter,

On Mon, Oct 01, 2018 at 09:10:22PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 01, 2018 at 06:17:08PM +0100, Will Deacon wrote:
> > On Wed, Sep 26, 2018 at 01:01:19PM +0200, Peter Zijlstra wrote:
> > > +
> > >  	/*
> > > -	 * If we observe any contention; undo and queue.
> > > +	 * If we observe contention, there was a concurrent lock.
> > 
> > Nit: I think "concurrent lock" is confusing here, because that implies to
> > me that the lock was actually taken behind our back, which isn't necessarily
> > the case. How about "there is a concurrent locker"?
> 
> Yes, that's better.

Thanks.

> > > +	 *
> > > +	 * Undo and queue; our setting of PENDING might have made the
> > > +	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
> > > +	 * on @next to become !NULL.
> > >  	 */
> > 
> > Hmm, but it could also fail another concurrent set of PENDING (and the lock
> > could just be held the entire time).
> 
> Right. What I wanted to convey was that if we observe _any_ contention,
> we must abort and queue, because of that above condition failing and
> waiting on @next.
> 
> The other cases weren't as critical, but that one really does require us
> to queue in order to make forward progress.
> 
> Or did I misunderstand your concern?

See below, since I think my comments are related.

> > >  	if (unlikely(val & ~_Q_LOCKED_MASK)) {
> > > +
> > > +		/* Undo PENDING if we set it. */
> > >  		if (!(val & _Q_PENDING_MASK))
> > >  			clear_pending(lock);
> > > +
> > >  		goto queue;
> > >  	}
> > >  
> > > @@ -466,7 +473,7 @@ void queued_spin_lock_slowpath(struct qs
> > >  	 * claim the lock:
> > >  	 *
> > >  	 * n,0,0 -> 0,0,1 : lock, uncontended
> > > -	 * *,*,0 -> *,*,1 : lock, contended
> > > +	 * *,0,0 -> *,0,1 : lock, contended
> > 
> > Pending can be set behind our back in the contended case, in which case
> > we take the lock with a single byte store and don't clear pending. You
> > mention this in the updated comment below, but I think we should leave this
> > comment alone.
> 
> Ah, so the reason I write it like so is because when we get here,
> val.locked_pending == 0, per the atomic_cond_read_acquire() condition.

Ah, and I vaguely remember discussing this before. The way I read these
transition diagrams, I find them most useful when they correspond to the
lock word in memory. That way, it is clear exactly which fields are
stable and which can be concurrently modified. So in the comment above,
saying:

	 *,*,0 -> *,*,1 : lock, contended

is really helpful, because it clearly says "we're taking the lock, but the
rest of the lock word could be modified by others at the same time", whereas
saying:

	 *,0,0 -> *,0,1 : lock, contended

implies to me that pending is stable and cannot be set concurrently.

Will

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-10-02 12:31     ` Andrea Parri
@ 2018-10-02 13:22       ` Will Deacon
  2018-10-02 13:44         ` Andrea Parri
  0 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2018-10-02 13:22 UTC (permalink / raw)
  To: Andrea Parri; +Cc: Peter Zijlstra, mingo, linux-kernel, longman, tglx

On Tue, Oct 02, 2018 at 02:31:52PM +0200, Andrea Parri wrote:
> > consider this scenario with your patch:
> > 
> > 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
> >    pending.
> > 
> > 2. CPU1 comes in and sets pending, spins on locked
> > 
> > 3. CPU2 sees a pending and locked val, and is about to enter the head of
> >    the waitqueue (i.e. it's right before xchg_tail()).
> > 
> > 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
> >    it, so pending and locked are now 0.
> > 
> > 5. CPU0 sets pending and reads back zeroes for the other fields
> > 
> > 6. CPU0 clears pending and sets locked -- it now has the lock
> > 
> > 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
> >    for locked and pending to go clear. However, it reads a stale value
> >    from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> > 
> > 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
> >    point we're hosed, because both CPU2 and CPU0 have the lock.
> 
> Thanks for pointing this out. I am wondering: can't we have a similar
> scenario with the current code (i.e., w/o these patches)? What prevents
> the scenario reported below, following Peter's diagram, from happening?

The xchg_tail() in step (7) reads from the fetch_or_acquire() in step (5),
so I don't think we can see a stale value in the subsequent (overlapping)
acquire load.

Will

>   CPU0		CPU1		CPU2		CPU3
> 
> 0)						lock
> 						  trylock -> (0,0,1)
> 1)lock
>     trylock /* fail */
> 
> 2)		lock
> 		  trylock /* fail */
> 		  fetch_or_acquire -> (0,1,1)
> 		  wait-locked
> 
> 3)				lock
> 				  trylock /* fail */
> 				  goto queue
> 
> 4)						unlock -> (0,1,0)
> 		  clr_pnd_set_lck -> (0,0,1)
> 		  unlock -> (0,0,0)
> 
> 5)  fetch_or_acquire -> (0,1,0)
> 6)  clr_pnd_set_lck -> (0,0,1)
> 7)				  xchg_tail -> (n,0,1)
> 				  load_acquire <- (n,0,0) (from-4)
> 8)				  cmpxchg /* fail */
> 				  set_locked()

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 2/3] locking/qspinlock: Rework some comments
  2018-10-02 13:20       ` Will Deacon
@ 2018-10-02 13:43         ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-10-02 13:43 UTC (permalink / raw)
  To: Will Deacon; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx

On Tue, Oct 02, 2018 at 02:20:05PM +0100, Will Deacon wrote:
> > Ah, so the reason I write it like so is because when we get here,
> > val.locked_pending == 0, per the atomic_cond_read_acquire() condition.
> 
> Ah, and I vaguely remember discussing this before. The way I read these
> transition diagrams, I find it most useful if they correspond to the lock
> word in memory. That way, it makes it clear about exactly which fields are
> stable, and which can be concurrently modified. So in the comment above,
> saying:
> 
> 	 *,*,0 -> *,*,1 : lock, contended
> 
> is really helpful, because it clearly says "we're taking the lock, but the
> rest of the lock word could be modified by others at the same time", whereas
> saying:
> 
> 	 *,0,0 -> *,0,1 : lock, contended
> 
> implies to me that pending is stable and cannot be set concurrently.

Fair enough, will restore.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-10-02 13:22       ` Will Deacon
@ 2018-10-02 13:44         ` Andrea Parri
  0 siblings, 0 replies; 32+ messages in thread
From: Andrea Parri @ 2018-10-02 13:44 UTC (permalink / raw)
  To: Will Deacon; +Cc: Peter Zijlstra, mingo, linux-kernel, longman, tglx

On Tue, Oct 02, 2018 at 02:22:09PM +0100, Will Deacon wrote:
> On Tue, Oct 02, 2018 at 02:31:52PM +0200, Andrea Parri wrote:
> > > consider this scenario with your patch:
> > > 
> > > 1. CPU0 sees a locked val, and is about to do your xchg_relaxed() to set
> > >    pending.
> > > 
> > > 2. CPU1 comes in and sets pending, spins on locked
> > > 
> > > 3. CPU2 sees a pending and locked val, and is about to enter the head of
> > >    the waitqueue (i.e. it's right before xchg_tail()).
> > > 
> > > 4. The locked holder unlock()s, CPU1 takes the lock() and then unlock()s
> > >    it, so pending and locked are now 0.
> > > 
> > > 5. CPU0 sets pending and reads back zeroes for the other fields
> > > 
> > > 6. CPU0 clears pending and sets locked -- it now has the lock
> > > 
> > > 7. CPU2 updates tail, sees it's at the head of the waitqueue and spins
> > >    for locked and pending to go clear. However, it reads a stale value
> > >    from step (4) and attempts the atomic_try_cmpxchg() to take the lock.
> > > 
> > > 8. CPU2 will fail the cmpxchg(), but then go ahead and set locked. At this
> > >    point we're hosed, because both CPU2 and CPU0 have the lock.
> > 
> > Thanks for pointing this out. I am wondering: can't we have a similar
> > scenario with the current code (i.e., w/o these patches)? What prevents
> > the scenario reported below, following Peter's diagram, from happening?
> 
> The xchg_tail() in step (7) reads from the fetch_or_acquire() in step (5),
> so I don't think we can see a stale value in the subsequent (overlapping)
> acquire load.

I see, thanks for the clarification.

  Andrea


> 
> Will
> 
> >   CPU0		CPU1		CPU2		CPU3
> > 
> > 0)						lock
> > 						  trylock -> (0,0,1)
> > 1)lock
> >     trylock /* fail */
> > 
> > 2)		lock
> > 		  trylock /* fail */
> > 		  fetch_or_acquire -> (0,1,1)
> > 		  wait-locked
> > 
> > 3)				lock
> > 				  trylock /* fail */
> > 				  goto queue
> > 
> > 4)						unlock -> (0,1,0)
> > 		  clr_pnd_set_lck -> (0,0,1)
> > 		  unlock -> (0,0,0)
> > 
> > 5)  fetch_or_acquire -> (0,1,0)
> > 6)  clr_pnd_set_lck -> (0,0,1)
> > 7)				  xchg_tail -> (n,0,1)
> > 				  load_acquire <- (n,0,0) (from-4)
> > 8)				  cmpxchg /* fail */
> > 				  set_locked()

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
  2018-10-02 13:19       ` Will Deacon
@ 2018-10-02 14:14         ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2018-10-02 14:14 UTC (permalink / raw)
  To: Will Deacon; +Cc: mingo, linux-kernel, longman, andrea.parri, tglx

On Tue, Oct 02, 2018 at 02:19:53PM +0100, Will Deacon wrote:
> On Mon, Oct 01, 2018 at 10:00:28PM +0200, Peter Zijlstra wrote:

> > Let me draw a picture of that..
> > 
> > 
> >   CPU0		CPU1		CPU2		CPU3
> > 
> > 0)						lock
> > 						  trylock -> (0,0,1)
> > 1)lock
> >     trylock /* fail */
> > 
> > 2)		lock
> > 		  trylock /* fail */
> > 		  tas-pending -> (0,1,1)
> > 		  wait-locked
> > 
> > 3)				lock
> > 				  trylock /* fail */
> > 				  tas-pending /* fail */
> > 
> > 4)						unlock -> (0,1,0)
> > 		  clr_pnd_set_lck -> (0,0,1)
> > 		  unlock -> (0,0,0)
> > 
> > 5)  tas-pending -> (0,1,0)
> >     read-val -> (0,1,0)
> > 6)  clr_pnd_set_lck -> (0,0,1)
> > 7)				  xchg_tail -> (n,0,1)
> > 				  load_acquire <- (n,0,0) (from-4)
> > 8)				  cmpxchg /* fail */
> > 				  set_locked()
> > 
> > > Is there something I'm missing that means this can't happen? I suppose
> > > cacheline granularity ends up giving serialisation between (4) and (7),
> > > but I'd *much* prefer not to rely on that because it feels horribly
> > > fragile.
> > 
> > Well, on x86 atomics are fully ordered, so the xchg_tail() does in
> > fact have smp_mb() in it, and that should order things sufficiently
> > for that not to happen, I think.
> 
> Hmm, does that actually help, though? I still think you're relying on the
> cache-coherence protocol to serialise the xchg() on pending before the
> xchg_tail(), which I think is fragile because they don't actually overlap.

Maybe, I suspect TSO makes it work, but see the below alternative.

---
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -6,9 +6,29 @@
 #include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
 #include <asm/paravirt.h>
+#include <asm/rmwcc.h>
 
 #define _Q_PENDING_LOOPS	(1 << 9)
 
+static __always_inline bool __test_and_set_pending(struct qspinlock *lock)
+{
+	GEN_BINARY_RMWcc(LOCK_PREFIX "btsl",
+			 lock->val.counter, "Ir", _Q_PENDING_OFFSET, "%0", c);
+}
+
+#define queued_set_pending_fetch_acquire queued_set_pending_fetch_acquire
+static inline u32 queued_set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	u32 val = 0;
+
+	if (__test_and_set_pending(lock))
+		val |= _Q_PENDING_VAL;
+
+	val |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;
+
+	return val;
+}
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -232,6 +232,20 @@ static __always_inline u32 xchg_tail(str
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * queued_set_pending_fetch_acquire - fetch the whole lock value and set pending
+ * @lock : Pointer to queued spinlock structure
+ * Return: The previous lock value
+ *
+ * *,*,* -> *,1,*
+ */
+#ifndef queued_set_pending_fetch_acquire
+static __always_inline u32 queued_set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+#endif
+
+/**
  * set_locked - Set the lock bit and own the lock
  * @lock: Pointer to queued spinlock structure
  *
@@ -328,7 +342,7 @@ void queued_spin_lock_slowpath(struct qs
 	 *
 	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
 	 */
-	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	val = queued_set_pending_fetch_acquire(lock);
 
 	/*
 	 * If we observe contention, there is a concurrent locker.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-10-02 14:14 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-26 11:01 [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86 Peter Zijlstra
2018-09-26 11:01 ` [RFC][PATCH 1/3] locking/qspinlock: Re-order code Peter Zijlstra
2018-10-01 17:17   ` Will Deacon
2018-09-26 11:01 ` [RFC][PATCH 2/3] locking/qspinlock: Rework some comments Peter Zijlstra
2018-10-01 17:17   ` Will Deacon
2018-10-01 19:10     ` Peter Zijlstra
2018-10-02 13:20       ` Will Deacon
2018-10-02 13:43         ` Peter Zijlstra
2018-09-26 11:01 ` [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86 Peter Zijlstra
2018-09-26 16:30   ` Waiman Long
2018-09-26 17:54     ` Peter Zijlstra
2018-09-27  7:29       ` Peter Zijlstra
2018-09-26 20:52   ` Andrea Parri
2018-09-27  7:17     ` Peter Zijlstra
2018-09-27  7:47       ` Andrea Parri
2018-09-27  7:59         ` Peter Zijlstra
2018-09-27  8:13           ` Andrea Parri
2018-09-27  8:57             ` Peter Zijlstra
2018-09-27 12:16   ` David Laight
2018-10-01 17:17   ` Will Deacon
2018-10-01 20:00     ` Peter Zijlstra
2018-10-02 13:19       ` Will Deacon
2018-10-02 14:14         ` Peter Zijlstra
2018-10-02 12:31     ` Andrea Parri
2018-10-02 13:22       ` Will Deacon
2018-10-02 13:44         ` Andrea Parri
2018-09-26 15:01 ` [RFC][PATCH 0/3] locking/qspinlock: Improve determinism " Sebastian Andrzej Siewior
2018-09-26 15:08   ` Thomas Gleixner
2018-09-26 15:38     ` Sebastian Andrzej Siewior
2018-09-26 16:20 ` Waiman Long
2018-09-26 17:51   ` Peter Zijlstra
2018-09-26 23:21     ` Waiman Long
