linux-kernel.vger.kernel.org archive mirror
* [PATCH -v4 0/7] spin_unlock_wait borkage and assorted bits
@ 2016-06-02 11:51 Peter Zijlstra
  2016-06-02 11:51 ` [PATCH -v4 1/7] locking: Replace smp_cond_acquire with smp_cond_load_acquire Peter Zijlstra
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 11:51 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

Similar to -v3 in that it rewrites spin_unlock_wait() for all architectures.

The new spin_unlock_wait() provides ACQUIRE semantics to match the RELEASE of
the spin_unlock() we waited for, and thereby ensures we can fully observe its
critical section.
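
For illustration (an editor's sketch, not part of the series; the lock and
shared_data names are made up): if CPU1's spin_unlock_wait() observes CPU0's
critical section in progress and thus waits for its spin_unlock(), then once
the wait returns CPU1 is also guaranteed to observe the stores made inside
that critical section:

	CPU0				CPU1

	spin_lock(&lock);
	WRITE_ONCE(shared_data, 1);
	spin_unlock(&lock);		spin_unlock_wait(&lock);
					r1 = READ_ONCE(shared_data); /* r1 == 1 */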

This fixes a number of (pretty much all) spin_unlock_wait() users.

This series pulls in the smp_cond_acquire() rewrite because it introduces a lot
of new users of it: all simple spin_unlock_wait() implementations end up being
one such user.

New in this series is the removal of cmpwait() and a reorder of patches.

I'm planning on queuing these patches for 4.8.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH -v4 1/7] locking: Replace smp_cond_acquire with smp_cond_load_acquire
  2016-06-02 11:51 [PATCH -v4 0/7] spin_unlock_wait borkage and assorted bits Peter Zijlstra
@ 2016-06-02 11:51 ` Peter Zijlstra
  2016-06-02 11:51 ` [PATCH -v4 2/7] locking: Introduce smp_acquire__after_ctrl_dep Peter Zijlstra
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 11:51 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-smp_cond_load_acquire.patch --]
[-- Type: text/plain, Size: 5988 bytes --]

This new form allows the use of hardware-assisted waiting.

Some hardware (ARM64 and x86) can monitor an address for changes, so by
providing a pointer we can use that facility to replace the cpu_relax().
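
As a usage sketch (editor's illustration; the real conversions are in the
hunks below), a spin-wait on a flag bit changes from:

	smp_cond_acquire(!(csd->flags & CSD_FLAG_LOCK));

to:

	/*
	 * &csd->flags gives the hardware an address to monitor; VAL is the
	 * freshly loaded value, usable in the condition expression.
	 */
	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));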

Requested-by: Will Deacon <will.deacon@arm.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/compiler.h   |   25 +++++++++++++++++++------
 kernel/locking/qspinlock.c |   12 ++++++------
 kernel/sched/core.c        |    8 ++++----
 kernel/sched/sched.h       |    2 +-
 kernel/smp.c               |    2 +-
 5 files changed, 31 insertions(+), 18 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -305,21 +305,34 @@ static __always_inline void __write_once
 })
 
 /**
- * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
  *
  * Equivalent to using smp_load_acquire() on the condition variable but employs
  * the control dependency of the wait to reduce the barrier on many platforms.
  *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ *
  * The control dependency provides a LOAD->STORE order, the additional RMB
  * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
  * aka. ACQUIRE.
  */
-#define smp_cond_acquire(cond)	do {		\
-	while (!(cond))				\
-		cpu_relax();			\
-	smp_rmb(); /* ctrl + rmb := acquire */	\
-} while (0)
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+	typeof(ptr) __PTR = (ptr);				\
+	typeof(*ptr) VAL;					\
+	for (;;) {						\
+		VAL = READ_ONCE(*__PTR);			\
+		if (cond_expr)					\
+			break;					\
+		cpu_relax();					\
+	}							\
+	smp_rmb(); /* ctrl + rmb := acquire */			\
+	VAL;							\
+})
+#endif
 
 #endif /* __KERNEL__ */
 
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -358,7 +358,7 @@ void queued_spin_lock_slowpath(struct qs
 	 * sequentiality; this is because not all clear_pending_set_locked()
 	 * implementations imply full barriers.
 	 */
-	smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
+	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
 
 	/*
 	 * take ownership and clear the pending bit.
@@ -434,7 +434,7 @@ void queued_spin_lock_slowpath(struct qs
 	 *
 	 * The PV pv_wait_head_or_lock function, if active, will acquire
 	 * the lock and return a non-zero value. So we have to skip the
-	 * smp_cond_acquire() call. As the next PV queue head hasn't been
+	 * smp_cond_load_acquire() call. As the next PV queue head hasn't been
 	 * designated yet, there is no way for the locked value to become
 	 * _Q_SLOW_VAL. So both the set_locked() and the
 	 * atomic_cmpxchg_relaxed() calls will be safe.
@@ -445,7 +445,7 @@ void queued_spin_lock_slowpath(struct qs
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
-	smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
+	val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));
 
 locked:
 	/*
@@ -465,9 +465,9 @@ void queued_spin_lock_slowpath(struct qs
 			break;
 		}
 		/*
-		 * The smp_cond_acquire() call above has provided the necessary
-		 * acquire semantics required for locking. At most two
-		 * iterations of this loop may be ran.
+		 * The smp_cond_load_acquire() call above has provided the
+		 * necessary acquire semantics required for locking. At most
+		 * two iterations of this loop may be run.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1843,7 +1843,7 @@ static void ttwu_queue(struct task_struc
  * chain to provide order. Instead we do:
  *
  *   1) smp_store_release(X->on_cpu, 0)
- *   2) smp_cond_acquire(!X->on_cpu)
+ *   2) smp_cond_load_acquire(!X->on_cpu)
  *
  * Example:
  *
@@ -1854,7 +1854,7 @@ static void ttwu_queue(struct task_struc
  *   sched-out X
  *   smp_store_release(X->on_cpu, 0);
  *
- *                    smp_cond_acquire(!X->on_cpu);
+ *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
  *                    X->state = WAKING
  *                    set_task_cpu(X,2)
  *
@@ -1880,7 +1880,7 @@ static void ttwu_queue(struct task_struc
  * This means that any means of doing remote wakeups must order the CPU doing
  * the wakeup against the CPU the task is going to end up running on. This,
  * however, is already required for the regular Program-Order guarantee above,
- * since the waking CPU is the one issueing the ACQUIRE (smp_cond_acquire).
+ * since the waking CPU is the one issuing the ACQUIRE (smp_cond_load_acquire).
  *
  */
 
@@ -1953,7 +1953,7 @@ try_to_wake_up(struct task_struct *p, un
 	 * This ensures that tasks getting woken will be fully ordered against
 	 * their previous state and preserve Program Order.
 	 */
-	smp_cond_acquire(!p->on_cpu);
+	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1104,7 +1104,7 @@ static inline void finish_lock_switch(st
 	 * In particular, the load of prev->state in finish_task_switch() must
 	 * happen before this.
 	 *
-	 * Pairs with the smp_cond_acquire() in try_to_wake_up().
+	 * Pairs with the smp_cond_load_acquire() in try_to_wake_up().
 	 */
 	smp_store_release(&prev->on_cpu, 0);
 #endif
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -107,7 +107,7 @@ void __init call_function_init(void)
  */
 static __always_inline void csd_lock_wait(struct call_single_data *csd)
 {
-	smp_cond_acquire(!(csd->flags & CSD_FLAG_LOCK));
+	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
 }
 
 static __always_inline void csd_lock(struct call_single_data *csd)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH -v4 2/7] locking: Introduce smp_acquire__after_ctrl_dep
  2016-06-02 11:51 [PATCH -v4 0/7] spin_unlock_wait borkage and assorted bits Peter Zijlstra
  2016-06-02 11:51 ` [PATCH -v4 1/7] locking: Replace smp_cond_acquire with smp_cond_load_acquire Peter Zijlstra
@ 2016-06-02 11:51 ` Peter Zijlstra
  2016-06-02 11:52 ` [PATCH -v4 3/7] locking: Move smp_cond_load_acquire() to asm-generic/barrier.h Peter Zijlstra
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 11:51 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-smp_acquire__after_ctrl_dep.patch --]
[-- Type: text/plain, Size: 2841 bytes --]

Introduce smp_acquire__after_ctrl_dep(); the construct it names is not
uncommon, but until now it lacked a dedicated barrier.
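
The shape of the construct, roughly (an editor's sketch modelled on the
ipc/sem.c hunk below):

	/* Wait until all currently ongoing simple ops have completed. */
	for (i = 0; i < sma->sem_nsems; i++)
		spin_unlock_wait(&sma->sem_base[i].lock);

	/*
	 * The loads in the wait loop only give a control dependency, i.e.
	 * LOAD->STORE order; this read barrier upgrades that to a full
	 * load-ACQUIRE before the data those locks protect is touched.
	 */
	smp_acquire__after_ctrl_dep();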

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/compiler.h |   17 ++++++++++++-----
 ipc/sem.c                |   14 ++------------
 2 files changed, 14 insertions(+), 17 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -305,6 +305,17 @@ static __always_inline void __write_once
 })
 
 /**
+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
+ *
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. (load)-ACQUIRE.
+ *
+ * Architectures that do not do load speculation can have this be barrier().
+ */
+#define smp_acquire__after_ctrl_dep()		smp_rmb()
+
+/**
  * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
  * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
@@ -314,10 +325,6 @@ static __always_inline void __write_once
  *
  * Due to C lacking lambda expressions we load the value of *ptr into a
  * pre-named variable @VAL to be used in @cond.
- *
- * The control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. ACQUIRE.
  */
 #ifndef smp_cond_load_acquire
 #define smp_cond_load_acquire(ptr, cond_expr) ({		\
@@ -329,7 +336,7 @@ static __always_inline void __write_once
 			break;					\
 		cpu_relax();					\
 	}							\
-	smp_rmb(); /* ctrl + rmb := acquire */			\
+	smp_acquire__after_ctrl_dep();				\
 	VAL;							\
 })
 #endif
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -260,16 +260,6 @@ static void sem_rcu_free(struct rcu_head
 }
 
 /*
- * spin_unlock_wait() and !spin_is_locked() are not memory barriers, they
- * are only control barriers.
- * The code must pair with spin_unlock(&sem->lock) or
- * spin_unlock(&sem_perm.lock), thus just the control barrier is insufficient.
- *
- * smp_rmb() is sufficient, as writes cannot pass the control barrier.
- */
-#define ipc_smp_acquire__after_spin_is_unlocked()	smp_rmb()
-
-/*
  * Wait until all currently ongoing simple ops have completed.
  * Caller must own sem_perm.lock.
  * New simple ops cannot start, because simple ops first check
@@ -292,7 +282,7 @@ static void sem_wait_array(struct sem_ar
 		sem = sma->sem_base + i;
 		spin_unlock_wait(&sem->lock);
 	}
-	ipc_smp_acquire__after_spin_is_unlocked();
+	smp_acquire__after_ctrl_dep();
 }
 
 /*
@@ -350,7 +340,7 @@ static inline int sem_lock(struct sem_ar
 			 *	complex_count++;
 			 *	spin_unlock(sem_perm.lock);
 			 */
-			ipc_smp_acquire__after_spin_is_unlocked();
+			smp_acquire__after_ctrl_dep();
 
 			/*
 			 * Now repeat the test of complex_count:

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH -v4 3/7] locking: Move smp_cond_load_acquire() to asm-generic/barrier.h
  2016-06-02 11:51 [PATCH -v4 0/7] spin_unlock_wait borkage and assorted bits Peter Zijlstra
  2016-06-02 11:51 ` [PATCH -v4 1/7] locking: Replace smp_cond_acquire with smp_cond_load_acquire Peter Zijlstra
  2016-06-02 11:51 ` [PATCH -v4 2/7] locking: Introduce smp_acquire__after_ctrl_dep Peter Zijlstra
@ 2016-06-02 11:52 ` Peter Zijlstra
  2016-06-02 11:52 ` [PATCH -v4 4/7] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep Peter Zijlstra
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 11:52 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-move-asm-generic.patch --]
[-- Type: text/plain, Size: 3441 bytes --]

Since all asm/barrier.h should/must include asm-generic/barrier.h, the
latter is a good place for generic infrastructure like this.

This also allows archs to override the new
smp_acquire__after_ctrl_dep().
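
An architecture override then follows the pattern of the TILE patch later in
this series; roughly (editor's sketch, arch "foo" is hypothetical):

	/* arch/foo/include/asm/barrier.h */

	/*
	 * foo does no load speculation, so a control dependency already
	 * orders subsequent loads and the extra RMB is not needed.
	 */
	#define smp_acquire__after_ctrl_dep()	barrier()

	/* pick up the generic fallbacks for everything not overridden */
	#include <asm-generic/barrier.h>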

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/asm-generic/barrier.h |   39 +++++++++++++++++++++++++++++++++++++++
 include/linux/compiler.h      |   37 -------------------------------------
 2 files changed, 39 insertions(+), 37 deletions(-)

--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -207,5 +207,44 @@ do {									\
 #define virt_store_release(p, v) __smp_store_release(p, v)
 #define virt_load_acquire(p) __smp_load_acquire(p)
 
+/**
+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
+ *
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. (load)-ACQUIRE.
+ *
+ * Architectures that do not do load speculation can have this be barrier().
+ */
+#ifndef smp_acquire__after_ctrl_dep
+#define smp_acquire__after_ctrl_dep()		smp_rmb()
+#endif
+
+/**
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ */
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+	typeof(ptr) __PTR = (ptr);				\
+	typeof(*ptr) VAL;					\
+	for (;;) {						\
+		VAL = READ_ONCE(*__PTR);			\
+		if (cond_expr)					\
+			break;					\
+		cpu_relax();					\
+	}							\
+	smp_acquire__after_ctrl_dep();				\
+	VAL;							\
+})
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -304,43 +304,6 @@ static __always_inline void __write_once
 	__u.__val;					\
 })
 
-/**
- * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
- *
- * A control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. (load)-ACQUIRE.
- *
- * Architectures that do not do load speculation can have this be barrier().
- */
-#define smp_acquire__after_ctrl_dep()		smp_rmb()
-
-/**
- * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
- * @ptr: pointer to the variable to wait on
- * @cond: boolean expression to wait for
- *
- * Equivalent to using smp_load_acquire() on the condition variable but employs
- * the control dependency of the wait to reduce the barrier on many platforms.
- *
- * Due to C lacking lambda expressions we load the value of *ptr into a
- * pre-named variable @VAL to be used in @cond.
- */
-#ifndef smp_cond_load_acquire
-#define smp_cond_load_acquire(ptr, cond_expr) ({		\
-	typeof(ptr) __PTR = (ptr);				\
-	typeof(*ptr) VAL;					\
-	for (;;) {						\
-		VAL = READ_ONCE(*__PTR);			\
-		if (cond_expr)					\
-			break;					\
-		cpu_relax();					\
-	}							\
-	smp_acquire__after_ctrl_dep();				\
-	VAL;							\
-})
-#endif
-
 #endif /* __KERNEL__ */
 
 #endif /* __ASSEMBLY__ */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH -v4 4/7] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep
  2016-06-02 11:51 [PATCH -v4 0/7] spin_unlock_wait borkage and assorted bits Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-06-02 11:52 ` [PATCH -v4 3/7] locking: Move smp_cond_load_acquire() to asm-generic/barrier.h Peter Zijlstra
@ 2016-06-02 11:52 ` Peter Zijlstra
  2016-06-02 11:52 ` [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait() Peter Zijlstra
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 11:52 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz, Chris Metcalf

[-- Attachment #1: peterz-tile-ctrl-dep.patch --]
[-- Type: text/plain, Size: 885 bytes --]

Since TILE doesn't do read speculation, its control dependencies also
guarantee LOAD->LOAD order and we don't need the additional RMB
otherwise required to provide ACQUIRE semantics.

Acked-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/tile/include/asm/barrier.h |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/arch/tile/include/asm/barrier.h
+++ b/arch/tile/include/asm/barrier.h
@@ -87,6 +87,13 @@ mb_incoherent(void)
 #define __smp_mb__after_atomic()	__smp_mb()
 #endif
 
+/*
+ * The TILE architecture does not do speculative reads; this ensures
+ * that a control dependency also orders against loads and already provides
+ * a LOAD->{LOAD,STORE} order and can forgo the additional RMB.
+ */
+#define smp_acquire__after_ctrl_dep()	barrier()
+
 #include <asm-generic/barrier.h>
 
 #endif /* !__ASSEMBLY__ */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 11:51 [PATCH -v4 0/7] spin_unlock_wait borkage and assorted bits Peter Zijlstra
                   ` (3 preceding siblings ...)
  2016-06-02 11:52 ` [PATCH -v4 4/7] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep Peter Zijlstra
@ 2016-06-02 11:52 ` Peter Zijlstra
  2016-06-02 14:24   ` Boqun Feng
  2016-06-02 11:52 ` [PATCH -v4 6/7] locking: Update spin_unlock_wait users Peter Zijlstra
  2016-06-02 11:52 ` [PATCH -v4 7/7] locking,netfilter: Fix nf_conntrack_lock() Peter Zijlstra
  6 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 11:52 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz, jejb, chris, rth,
	dhowells, schwidefsky, mpe, ralf, linux, rkuo, vgupta,
	james.hogan, realmz6, ysato, tony.luck, cmetcalf

[-- Attachment #1: peterz-locking-spin_unlock_wait.patch --]
[-- Type: text/plain, Size: 16631 bytes --]

This patch updates/fixes all spin_unlock_wait() implementations.

The update is in semantics; where it previously was only a control
dependency, we now upgrade to a full load-acquire to match the
store-release from the spin_unlock() we waited on. This ensures that
when spin_unlock_wait() returns, we're guaranteed to observe the full
critical section we waited on.

This fixes a number of spin_unlock_wait() users that (not
unreasonably) rely on this.

I also fixed a number of the ticket-lock versions to wait only on the
current lock holder, instead of on a full unlock, as this is
sufficient.

Furthermore, again for the ticket locks, I added an smp_rmb() between
the initial ticket load and the spin loop testing the current value,
because I could not convince myself the address dependency is
sufficient, especially if the loads are of different sizes.

I'm more than happy to remove this smp_rmb() again if people are
certain the address dependency does indeed work as expected.
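
For reference, the resulting ticket-lock wait pattern, annotated (editor's
comments on the MIPS hunk below):

	u16 owner = READ_ONCE(lock->h.serving_now); /* snapshot current holder */
	smp_rmb();			/* the barrier discussed above */
	for (;;) {
		arch_spinlock_t tmp = READ_ONCE(*lock);

		if (tmp.h.serving_now == tmp.h.ticket ||  /* lock is free, or */
		    tmp.h.serving_now != owner)	/* the snapshotted holder released it */
			break;

		cpu_relax();
	}
	smp_acquire__after_ctrl_dep();	/* upgrade the ctrl dependency to ACQUIRE */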

Cc: jejb@parisc-linux.org
Cc: davem@davemloft.net
Cc: chris@zankel.net
Cc: rth@twiddle.net
Cc: dhowells@redhat.com
Cc: schwidefsky@de.ibm.com
Cc: mpe@ellerman.id.au
Cc: ralf@linux-mips.org
Cc: linux@armlinux.org.uk
Cc: rkuo@codeaurora.org
Cc: vgupta@synopsys.com
Cc: james.hogan@imgtec.com
Cc: realmz6@gmail.com
Cc: ysato@users.sourceforge.jp
Cc: tony.luck@intel.com
Cc: cmetcalf@mellanox.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/alpha/include/asm/spinlock.h    |    9 +++++++--
 arch/arc/include/asm/spinlock.h      |    7 +++++--
 arch/arm/include/asm/spinlock.h      |   19 +++++++++++++++++--
 arch/blackfin/include/asm/spinlock.h |    5 +++--
 arch/hexagon/include/asm/spinlock.h  |   10 ++++++++--
 arch/ia64/include/asm/spinlock.h     |    4 ++++
 arch/m32r/include/asm/spinlock.h     |    9 +++++++--
 arch/metag/include/asm/spinlock.h    |   14 ++++++++++++--
 arch/mips/include/asm/spinlock.h     |   19 +++++++++++++++++--
 arch/mn10300/include/asm/spinlock.h  |    8 +++++++-
 arch/parisc/include/asm/spinlock.h   |    9 +++++++--
 arch/powerpc/include/asm/spinlock.h  |    8 ++++++--
 arch/s390/include/asm/spinlock.h     |    3 +++
 arch/sh/include/asm/spinlock.h       |   10 ++++++++--
 arch/sparc/include/asm/spinlock_32.h |    7 +++++--
 arch/sparc/include/asm/spinlock_64.h |   10 +++++++---
 arch/tile/lib/spinlock_32.c          |    6 ++++++
 arch/tile/lib/spinlock_64.c          |    6 ++++++
 arch/xtensa/include/asm/spinlock.h   |   10 ++++++++--
 include/asm-generic/barrier.h        |    2 +-
 include/asm-generic/qspinlock.h      |    5 +++--
 include/linux/spinlock_up.h          |   10 +++++++---
 22 files changed, 154 insertions(+), 36 deletions(-)

--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -3,6 +3,8 @@
 
 #include <linux/kernel.h>
 #include <asm/current.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
@@ -13,8 +15,11 @@
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 #define arch_spin_is_locked(x)	((x)->lock != 0)
-#define arch_spin_unlock_wait(x) \
-		do { cpu_relax(); } while ((x)->lock)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
 
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -15,8 +15,11 @@
 
 #define arch_spin_is_locked(x)	((x)->slock != __ARCH_SPIN_LOCK_UNLOCKED__)
 #define arch_spin_lock_flags(lock, flags)	arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-	do { while (arch_spin_is_locked(x)) cpu_relax(); } while (0)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, !VAL);
+}
 
 #ifdef CONFIG_ARC_HAS_LLSC
 
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -6,6 +6,8 @@
 #endif
 
 #include <linux/prefetch.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /*
  * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
@@ -50,8 +52,21 @@ static inline void dsb_sev(void)
  * memory.
  */
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	u16 owner = READ_ONCE(lock->tickets.owner);
+
+	for (;;) {
+		arch_spinlock_t tmp = READ_ONCE(*lock);
+
+		if (tmp.tickets.owner == tmp.tickets.next ||
+		    tmp.tickets.owner != owner)
+			break;
+
+		wfe();
+	}
+	smp_acquire__after_ctrl_dep();
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
--- a/arch/blackfin/include/asm/spinlock.h
+++ b/arch/blackfin/include/asm/spinlock.h
@@ -12,6 +12,8 @@
 #else
 
 #include <linux/atomic.h>
+#include <asm/processor.h>
+#include <asm/barrier.h>
 
 asmlinkage int __raw_spin_is_locked_asm(volatile int *ptr);
 asmlinkage void __raw_spin_lock_asm(volatile int *ptr);
@@ -48,8 +50,7 @@ static inline void arch_spin_unlock(arch
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
-	while (arch_spin_is_locked(lock))
-		cpu_relax();
+	smp_cond_load_acquire(&lock->lock, !VAL);
 }
 
 static inline int arch_read_can_lock(arch_rwlock_t *rw)
--- a/arch/hexagon/include/asm/spinlock.h
+++ b/arch/hexagon/include/asm/spinlock.h
@@ -23,6 +23,8 @@
 #define _ASM_SPINLOCK_H
 
 #include <asm/irqflags.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /*
  * This file is pulled in for SMP builds.
@@ -176,8 +178,12 @@ static inline unsigned int arch_spin_try
  * SMP spinlocks are intended to allow only a single CPU at the lock
  */
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(lock) \
-	do {while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
+
 #define arch_spin_is_locked(x) ((x)->lock != 0)
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
--- a/arch/ia64/include/asm/spinlock.h
+++ b/arch/ia64/include/asm/spinlock.h
@@ -15,6 +15,8 @@
 
 #include <linux/atomic.h>
 #include <asm/intrinsics.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 #define arch_spin_lock_init(x)			((x)->lock = 0)
 
@@ -86,6 +88,8 @@ static __always_inline void __ticket_spi
 			return;
 		cpu_relax();
 	}
+
+	smp_acquire__after_ctrl_dep();
 }
 
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
--- a/arch/m32r/include/asm/spinlock.h
+++ b/arch/m32r/include/asm/spinlock.h
@@ -13,6 +13,8 @@
 #include <linux/atomic.h>
 #include <asm/dcache_clear.h>
 #include <asm/page.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
@@ -27,8 +29,11 @@
 
 #define arch_spin_is_locked(x)		(*(volatile int *)(&(x)->slock) <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-		do { cpu_relax(); } while (arch_spin_is_locked(x))
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, VAL > 0);
+}
 
 /**
  * arch_spin_trylock - Try spin lock and return a result
--- a/arch/metag/include/asm/spinlock.h
+++ b/arch/metag/include/asm/spinlock.h
@@ -1,14 +1,24 @@
 #ifndef __ASM_SPINLOCK_H
 #define __ASM_SPINLOCK_H
 
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
 #ifdef CONFIG_METAG_ATOMICITY_LOCK1
 #include <asm/spinlock_lock1.h>
 #else
 #include <asm/spinlock_lnkget.h>
 #endif
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+/*
+ * both lock1 and lnkget are test-and-set spinlocks with 0 unlocked and 1
+ * locked.
+ */
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -12,6 +12,7 @@
 #include <linux/compiler.h>
 
 #include <asm/barrier.h>
+#include <asm/processor.h>
 #include <asm/compiler.h>
 #include <asm/war.h>
 
@@ -48,8 +49,22 @@ static inline int arch_spin_value_unlock
 }
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-	while (arch_spin_is_locked(x)) { cpu_relax(); }
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	u16 owner = READ_ONCE(lock->h.serving_now);
+	smp_rmb();
+	for (;;) {
+		arch_spinlock_t tmp = READ_ONCE(*lock);
+
+		if (tmp.h.serving_now == tmp.h.ticket ||
+		    tmp.h.serving_now != owner)
+			break;
+
+		cpu_relax();
+	}
+	smp_acquire__after_ctrl_dep();
+}
 
 static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
--- a/arch/mn10300/include/asm/spinlock.h
+++ b/arch/mn10300/include/asm/spinlock.h
@@ -12,6 +12,8 @@
 #define _ASM_SPINLOCK_H
 
 #include <linux/atomic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 #include <asm/rwlock.h>
 #include <asm/page.h>
 
@@ -23,7 +25,11 @@
  */
 
 #define arch_spin_is_locked(x)	(*(volatile signed char *)(&(x)->slock) != 0)
-#define arch_spin_unlock_wait(x) do { barrier(); } while (arch_spin_is_locked(x))
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, !VAL);
+}
 
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
--- a/arch/parisc/include/asm/spinlock.h
+++ b/arch/parisc/include/asm/spinlock.h
@@ -13,8 +13,13 @@ static inline int arch_spin_is_locked(ar
 }
 
 #define arch_spin_lock(lock) arch_spin_lock_flags(lock, 0)
-#define arch_spin_unlock_wait(x) \
-		do { cpu_relax(); } while (arch_spin_is_locked(x))
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *x)
+{
+	volatile unsigned int *a = __ldcw_align(x);
+
+	smp_cond_load_acquire(a, VAL);
+}
 
 static inline void arch_spin_lock_flags(arch_spinlock_t *x,
 					 unsigned long flags)
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -27,6 +27,8 @@
 #include <asm/asm-compat.h>
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
@@ -165,8 +167,10 @@ static inline void arch_spin_unlock(arch
 #ifdef CONFIG_PPC64
 extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
 #else
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, !VAL);
+}
 #endif
 
 /*
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -10,6 +10,8 @@
 #define __ASM_SPINLOCK_H
 
 #include <linux/smp.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 #define SPINLOCK_LOCKVAL (S390_lowcore.spinlock_lockval)
 
@@ -97,6 +99,7 @@ static inline void arch_spin_unlock_wait
 {
 	while (arch_spin_is_locked(lock))
 		arch_spin_relax(lock);
+	smp_acquire__after_ctrl_dep();
 }
 
 /*
--- a/arch/sh/include/asm/spinlock.h
+++ b/arch/sh/include/asm/spinlock.h
@@ -19,14 +19,20 @@
 #error "Need movli.l/movco.l for spinlocks"
 #endif
 
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  */
 
 #define arch_spin_is_locked(x)		((x)->lock <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-	do { while (arch_spin_is_locked(x)) cpu_relax(); } while (0)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, VAL > 0);
+}
 
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
--- a/arch/sparc/include/asm/spinlock_32.h
+++ b/arch/sparc/include/asm/spinlock_32.h
@@ -9,12 +9,15 @@
 #ifndef __ASSEMBLY__
 
 #include <asm/psr.h>
+#include <asm/barrier.h>
 #include <asm/processor.h> /* for cpu_relax */
 
 #define arch_spin_is_locked(lock) (*((volatile unsigned char *)(lock)) != 0)
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
--- a/arch/sparc/include/asm/spinlock_64.h
+++ b/arch/sparc/include/asm/spinlock_64.h
@@ -8,6 +8,9 @@
 
 #ifndef __ASSEMBLY__
 
+#include <asm/processor.h>
+#include <asm/barrier.h>
+
 /* To get debugging spinlocks which detect and catch
  * deadlock situations, set CONFIG_DEBUG_SPINLOCK
  * and rebuild your kernel.
@@ -23,9 +26,10 @@
 
 #define arch_spin_is_locked(lp)	((lp)->lock != 0)
 
-#define arch_spin_unlock_wait(lp)	\
-	do {	rmb();			\
-	} while((lp)->lock)
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -76,6 +76,12 @@ void arch_spin_unlock_wait(arch_spinlock
 	do {
 		delay_backoff(iterations++);
 	} while (READ_ONCE(lock->current_ticket) == curr);
+
+	/*
+	 * The TILE architecture doesn't do read speculation; therefore
+	 * a control dependency guarantees a LOAD->{LOAD,STORE} order.
+	 */
+	barrier();
 }
 EXPORT_SYMBOL(arch_spin_unlock_wait);
 
--- a/arch/tile/lib/spinlock_64.c
+++ b/arch/tile/lib/spinlock_64.c
@@ -76,6 +76,12 @@ void arch_spin_unlock_wait(arch_spinlock
 	do {
 		delay_backoff(iterations++);
 	} while (arch_spin_current(READ_ONCE(lock->lock)) == curr);
+
+	/*
+	 * The TILE architecture doesn't do read speculation; therefore
+	 * a control dependency guarantees a LOAD->{LOAD,STORE} order.
+	 */
+	barrier();
 }
 EXPORT_SYMBOL(arch_spin_unlock_wait);
 
--- a/arch/xtensa/include/asm/spinlock.h
+++ b/arch/xtensa/include/asm/spinlock.h
@@ -11,6 +11,9 @@
 #ifndef _XTENSA_SPINLOCK_H
 #define _XTENSA_SPINLOCK_H
 
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
 /*
  * spinlock
  *
@@ -29,8 +32,11 @@
  */
 
 #define arch_spin_is_locked(x) ((x)->slock != 0)
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, !VAL);
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -194,7 +194,7 @@ do {									\
 })
 #endif
 
-#endif
+#endif	/* CONFIG_SMP */
 
 /* Barriers for virtual machine guests when talking to an SMP host */
 #define virt_mb() __smp_mb()
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -20,6 +20,8 @@
 #define __ASM_GENERIC_QSPINLOCK_H
 
 #include <asm-generic/qspinlock_types.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /**
  * queued_spin_is_locked - is the spinlock locked?
@@ -133,8 +135,7 @@ static inline void queued_spin_unlock_wa
 {
 	/* See queued_spin_is_locked() */
 	smp_mb();
-	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
-		cpu_relax();
+	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
 }
 
 #ifndef virt_spin_lock
--- a/include/linux/spinlock_up.h
+++ b/include/linux/spinlock_up.h
@@ -6,6 +6,7 @@
 #endif
 
 #include <asm/processor.h>	/* for cpu_relax() */
+#include <asm/barrier.h>
 
 /*
  * include/linux/spinlock_up.h - UP-debug version of spinlocks.
@@ -25,6 +26,11 @@
 #ifdef CONFIG_DEBUG_SPINLOCK
 #define arch_spin_is_locked(x)		((x)->slock == 0)
 
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, VAL);
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
 	lock->slock = 0;
@@ -67,6 +73,7 @@ static inline void arch_spin_unlock(arch
 
 #else /* DEBUG_SPINLOCK */
 #define arch_spin_is_locked(lock)	((void)(lock), 0)
+#define arch_spin_unlock_wait(lock)	do { barrier(); (void)(lock); } while (0)
 /* for sched/core.c and kernel_lock.c: */
 # define arch_spin_lock(lock)		do { barrier(); (void)(lock); } while (0)
 # define arch_spin_lock_flags(lock, flags)	do { barrier(); (void)(lock); } while (0)
@@ -79,7 +86,4 @@ static inline void arch_spin_unlock(arch
 #define arch_read_can_lock(lock)	(((void)(lock), 1))
 #define arch_write_can_lock(lock)	(((void)(lock), 1))
 
-#define arch_spin_unlock_wait(lock) \
-		do { cpu_relax(); } while (arch_spin_is_locked(lock))
-
 #endif /* __LINUX_SPINLOCK_UP_H */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH -v4 6/7] locking: Update spin_unlock_wait users
  2016-06-02 11:51 [PATCH -v4 0/7] spin_unlock_wait borkage and assorted bits Peter Zijlstra
                   ` (4 preceding siblings ...)
  2016-06-02 11:52 ` [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait() Peter Zijlstra
@ 2016-06-02 11:52 ` Peter Zijlstra
  2016-06-02 11:52 ` [PATCH -v4 7/7] locking,netfilter: Fix nf_conntrack_lock() Peter Zijlstra
  6 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 11:52 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-fix-spin_unlock_wait.patch --]
[-- Type: text/plain, Size: 1434 bytes --]

With the modified semantics of spin_unlock_wait() a number of
explicit barriers can be removed. Also update the comment for the
do_exit() use case, as it was somewhat stale/obscure.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 ipc/sem.c          |    1 -
 kernel/exit.c      |    8 ++++++--
 kernel/task_work.c |    1 -
 3 files changed, 6 insertions(+), 4 deletions(-)

--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -282,7 +282,6 @@ static void sem_wait_array(struct sem_ar
 		sem = sma->sem_base + i;
 		spin_unlock_wait(&sem->lock);
 	}
-	smp_acquire__after_ctrl_dep();
 }
 
 /*
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -776,10 +776,14 @@ void do_exit(long code)
 
 	exit_signals(tsk);  /* sets PF_EXITING */
 	/*
-	 * tsk->flags are checked in the futex code to protect against
-	 * an exiting task cleaning up the robust pi futexes.
+	 * Ensure that all new tsk->pi_lock acquisitions must observe
+	 * PF_EXITING. Serializes against futex.c:attach_to_pi_owner().
 	 */
 	smp_mb();
+	/*
+	 * Ensure that we must observe the pi_state in exit_mm() ->
+	 * mm_release() -> exit_pi_state_list().
+	 */
 	raw_spin_unlock_wait(&tsk->pi_lock);
 
 	if (unlikely(in_atomic())) {
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -108,7 +108,6 @@ void task_work_run(void)
 		 * fail, but it can play with *work and other entries.
 		 */
 		raw_spin_unlock_wait(&task->pi_lock);
-		smp_mb();
 
 		do {
 			next = work->next;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH -v4 7/7] locking,netfilter: Fix nf_conntrack_lock()
  2016-06-02 11:51 [PATCH -v4 0/7] spin_unlock_wait borkage and assorted bits Peter Zijlstra
                   ` (5 preceding siblings ...)
  2016-06-02 11:52 ` [PATCH -v4 6/7] locking: Update spin_unlock_wait users Peter Zijlstra
@ 2016-06-02 11:52 ` Peter Zijlstra
  6 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 11:52 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-netfilter.patch --]
[-- Type: text/plain, Size: 1723 bytes --]

Even with spin_unlock_wait() fixed, nf_conntrack_lock{,_all}() is
borken as it misses a bunch of memory barriers to order the whole
global vs local locks scheme.

Even x86 (and other TSO archs) are affected.
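
The barrier placement on the global side, sketched by the editor from the
hunks below:

	/* nf_conntrack_all_lock() */
	spin_lock(&nf_conntrack_locks_all_lock);
	nf_conntrack_locks_all = true;
	smp_mb();	/* order the store above against the unlock_wait loads below */
	for (i = 0; i < CONNTRACK_LOCKS; i++)
		spin_unlock_wait(&nf_conntrack_locks[i]);

This pairs with the smp_rmb() added on the nf_conntrack_lock() side.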

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 net/netfilter/nf_conntrack_core.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -83,6 +83,12 @@ void nf_conntrack_lock(spinlock_t *lock)
 	spin_lock(lock);
 	while (unlikely(nf_conntrack_locks_all)) {
 		spin_unlock(lock);
+
+	/* Order the nf_conntrack_locks_all load vs the
+		 * spin_unlock_wait() loads below, to ensure locks_all is
+		 * indeed held.
+		 */
+		smp_rmb(); /* spin_lock(locks_all) */
 		spin_unlock_wait(&nf_conntrack_locks_all_lock);
 		spin_lock(lock);
 	}
@@ -128,6 +134,12 @@ static void nf_conntrack_all_lock(void)
 	spin_lock(&nf_conntrack_locks_all_lock);
 	nf_conntrack_locks_all = true;
 
+	/* Order the above store against the spin_unlock_wait() loads
+	 * below, such that if nf_conntrack_lock() observes lock_all
+	 * we must observe lock[] held.
+	 */
+	smp_mb(); /* spin_lock(locks_all) */
+
 	for (i = 0; i < CONNTRACK_LOCKS; i++) {
 		spin_unlock_wait(&nf_conntrack_locks[i]);
 	}
@@ -135,7 +147,11 @@ static void nf_conntrack_all_lock(void)
 
 static void nf_conntrack_all_unlock(void)
 {
-	nf_conntrack_locks_all = false;
+	/* All prior stores must be complete before we clear locks_all.
+	 * Otherwise nf_conntrack_lock() might observe the false but not the
+	 * entire critical section.
+	 */
+	smp_store_release(&nf_conntrack_locks_all, false);
 	spin_unlock(&nf_conntrack_locks_all_lock);
 }
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 11:52 ` [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait() Peter Zijlstra
@ 2016-06-02 14:24   ` Boqun Feng
  2016-06-02 14:44     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Boqun Feng @ 2016-06-02 14:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

[-- Attachment #1: Type: text/plain, Size: 1812 bytes --]

On Thu, Jun 02, 2016 at 01:52:02PM +0200, Peter Zijlstra wrote:
[snip]
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -27,6 +27,8 @@
>  #include <asm/asm-compat.h>
>  #include <asm/synch.h>
>  #include <asm/ppc-opcode.h>
> +#include <asm/barrier.h>
> +#include <asm/processor.h>
>  
>  #ifdef CONFIG_PPC64
>  /* use 0x800000yy when locked, where yy == CPU number */
> @@ -165,8 +167,10 @@ static inline void arch_spin_unlock(arch
>  #ifdef CONFIG_PPC64
>  extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
>  #else
> -#define arch_spin_unlock_wait(lock) \
> -	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
> +static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> +{
> +	smp_cond_load_acquire(&lock->slock, !VAL);
> +}
>  #endif
>  

About spin_unlock_wait() on ppc, I actually have a fix pending review:

http://lkml.kernel.org/r/1461130033-70898-1-git-send-email-boqun.feng@gmail.com

that patch fixed a different problem when people want to pair a
spin_unlock_wait() with a spin_lock().

I think we still need that fix, and there are two conflicts with this
series:

1.	arch_spin_unlock_wait() code for PPC32 was deleted, and
	consolidated into one.

2.	I actually downgraded spin_unlock_wait() to !ACQUIRE ;-)


I can think of two ways to solve those conflicts:

1.	Modify my patch to make spin_unlock_wait() an ACQUIRE, and it
	can be merged in the powerpc tree, and possibly go into mainline
	before 4.7. Then there is no need for this series to have code
	for ppc, therefore no conflict.

or

2.	I can rebase my patch on this series, and it can be added in
	this series, and will go into mainline at 4.8.


Michael and Peter, any thought?

Regards,
Boqun

>  /*

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 14:24   ` Boqun Feng
@ 2016-06-02 14:44     ` Peter Zijlstra
  2016-06-02 15:11       ` Boqun Feng
  2016-06-02 16:34       ` Peter Zijlstra
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 14:44 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

On Thu, Jun 02, 2016 at 10:24:40PM +0800, Boqun Feng wrote:
> On Thu, Jun 02, 2016 at 01:52:02PM +0200, Peter Zijlstra wrote:
> About spin_unlock_wait() on ppc, I actually have a fix pending review:
> 
> http://lkml.kernel.org/r/1461130033-70898-1-git-send-email-boqun.feng@gmail.com

Please use the normal commit quoting style:

  d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")

> that patch fixed a different problem when people want to pair a
> spin_unlock_wait() with a spin_lock().

Argh, indeed, and I think qspinlock is still broken there :/ But my poor
brain is about to give in for the day.

Let me go ponder that some :/

> I think we still need that fix, and there are two conflicts with this
> series:
> 
> 1.	arch_spin_unlock_wait() code for PPC32 was deleted, and
> 	consolidated into one.

Nice.

> 2.	I actually downgraded spin_unlock_wait() to !ACQUIRE ;-)

Fail ;-)

> I can think of two ways to solve those conflicts:
> 
> 1.	Modify my patch to make spin_unlock_wait() an ACQUIRE, and it
> 	can be merged in the powerpc tree, and possibly go into mainline
> 	before 4.7. Then there is no need for this series to have code
> 	for ppc, therefore no conflict.

Hardly any other unlock_wait is an acquire, everyone is 'broken' :-/

> or
> 
> 2.	I can rebase my patch on this series, and it can be added in
> 	this series, and will go into mainline at 4.8.
> 
> 
> Michael and Peter, any thought?

I'm fine with it going in early, I can rebase, no problem.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 14:44     ` Peter Zijlstra
@ 2016-06-02 15:11       ` Boqun Feng
  2016-06-02 15:57         ` Boqun Feng
  2016-06-02 16:04         ` Peter Zijlstra
  2016-06-02 16:34       ` Peter Zijlstra
  1 sibling, 2 replies; 26+ messages in thread
From: Boqun Feng @ 2016-06-02 15:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

[-- Attachment #1: Type: text/plain, Size: 2314 bytes --]

On Thu, Jun 02, 2016 at 04:44:24PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 02, 2016 at 10:24:40PM +0800, Boqun Feng wrote:
> > On Thu, Jun 02, 2016 at 01:52:02PM +0200, Peter Zijlstra wrote:
> > About spin_unlock_wait() on ppc, I actually have a fix pending review:
> > 
> > http://lkml.kernel.org/r/1461130033-70898-1-git-send-email-boqun.feng@gmail.com
> 
> Please use the normal commit quoting style:
> 
>   d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")
> 

Good point ;-)

> > that patch fixed a different problem when people want to pair a
> > spin_unlock_wait() with a spin_lock().
> 
> Argh, indeed, and I think qspinlock is still broken there :/ But my poor
> brain is about to give in for the day.
> 
> Let me go ponder that some :/
> 

An initial thought of the fix is making queued_spin_unlock_wait() an
atomic-nop too:

static inline void queued_spin_unlock_wait(struct qspinlock *lock)
{
	struct __qspinlock *l = (struct __qspinlock *)lock;
	
	while (cmpxchg(&l->locked, 0, 0))
		cpu_relax();
}

This could make queued_spin_unlock_wait() a WRITE; with an smp_mb()
preceding it, it would act like a RELEASE, which can be paired with
spin_lock().

Just food for thought. ;-)

> > I think we still need that fix, and there are two conflicts with this
> > series:
> > 
> > 1.	arch_spin_unlock_wait() code for PPC32 was deleted, and
> > 	consolidated into one.
> 
> Nice.
> 
> > 2.	I actually downgraded spin_unlock_wait() to !ACQUIRE ;-)
> 
> Fail ;-)
> 
> > I can think of two ways to solve those conflicts:
> > 
> > 1.	Modify my patch to make spin_unlock_wait() an ACQUIRE, and it
> > 	can be merged in the powerpc tree, and possibly go into mainline
> > 	before 4.7. Then there is no need for this series to have code
> > 	for ppc, therefore no conflict.
> 
> Hardly any other unlock_wait is an acquire, everyone is 'broken' :-/
> 
> > or
> > 
> > 2.	I can rebase my patch on this series, and it can be added in
> > 	this series, and will go into mainline at 4.8.
> > 
> > 
> > Michael and Peter, any thought?
> 
> I'm fine with it going in early, I can rebase, no problem.

OK, I will resend a new patch making spin_unlock_wait() align with the
semantics in your series.

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 15:11       ` Boqun Feng
@ 2016-06-02 15:57         ` Boqun Feng
  2016-06-02 16:04         ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Boqun Feng @ 2016-06-02 15:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

[-- Attachment #1: Type: text/plain, Size: 432 bytes --]

On Thu, Jun 02, 2016 at 11:11:07PM +0800, Boqun Feng wrote:
[snip]
> 
> OK, I will resend a new patch making spin_unlock_wait() align with the
> semantics in your series.
> 

I realize that if my patch goes first then it's safer and more convenient
to keep the two smp_mb()s in the ppc arch_spin_unlock_wait(). I will only
do the fix and cleanup in my patch and leave the semantics-changing part
to you ;-)

> Regards,
> Boqun



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 15:11       ` Boqun Feng
  2016-06-02 15:57         ` Boqun Feng
@ 2016-06-02 16:04         ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 16:04 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

On Thu, Jun 02, 2016 at 11:11:07PM +0800, Boqun Feng wrote:
> On Thu, Jun 02, 2016 at 04:44:24PM +0200, Peter Zijlstra wrote:
> > Let me go ponder that some :/
> > 
> 
> An initial thought of the fix is making queued_spin_unlock_wait() an
> atomic-nop too:
> 
> static inline void queued_spin_unlock_wait(struct qspinlock *lock)
> {
> 	struct __qspinlock *l = (struct __qspinlock *)lock;
> 	
> 	while (cmpxchg(&l->locked, 0, 0))
> 		cpu_relax();
> }
> 
> This could make queued_spin_unlock_wait() a WRITE; with an smp_mb()
> preceding it, it would act like a RELEASE, which can be paired with
> spin_lock().
> 
> Just food for thought. ;-)

Not sure that'll actually work. The qspinlock store is completely
unordered and not part of an ll/sc or anything like that.

Doing competing stores might even result in losing it entirely.

But I think I got something.. Lemme go test it :-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 14:44     ` Peter Zijlstra
  2016-06-02 15:11       ` Boqun Feng
@ 2016-06-02 16:34       ` Peter Zijlstra
  2016-06-02 17:57         ` Will Deacon
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 16:34 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

On Thu, Jun 02, 2016 at 04:44:24PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 02, 2016 at 10:24:40PM +0800, Boqun Feng wrote:
> > On Thu, Jun 02, 2016 at 01:52:02PM +0200, Peter Zijlstra wrote:
> > About spin_unlock_wait() on ppc, I actually have a fix pending review:
> > 
> > http://lkml.kernel.org/r/1461130033-70898-1-git-send-email-boqun.feng@gmail.com
> 
> > that patch fixed a different problem when people want to pair a
> > spin_unlock_wait() with a spin_lock().
> 
> Argh, indeed, and I think qspinlock is still broken there :/ But my poor
> brain is about to give in for the day.

This 'replaces' commit:

  54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")

and seems to still work with the test case from that thread while
getting rid of the extra barriers.

---
 include/asm-generic/qspinlock.h | 37 +++++++----------------------------
 kernel/locking/qspinlock.c      | 43 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 30 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 6bd05700d8c9..9e3dff16d5dc 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -28,30 +28,13 @@
  */
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	/*
-	 * queued_spin_lock_slowpath() can ACQUIRE the lock before
-	 * issuing the unordered store that sets _Q_LOCKED_VAL.
-	 *
-	 * See both smp_cond_acquire() sites for more detail.
-	 *
-	 * This however means that in code like:
-	 *
-	 *   spin_lock(A)		spin_lock(B)
-	 *   spin_unlock_wait(B)	spin_is_locked(A)
-	 *   do_something()		do_something()
+	/* 
+	 * See queued_spin_unlock_wait().
 	 *
-	 * Both CPUs can end up running do_something() because the store
-	 * setting _Q_LOCKED_VAL will pass through the loads in
-	 * spin_unlock_wait() and/or spin_is_locked().
-	 *
-	 * Avoid this by issuing a full memory barrier between the spin_lock()
-	 * and the loads in spin_unlock_wait() and spin_is_locked().
-	 *
-	 * Note that regular mutual exclusion doesn't care about this
-	 * delayed store.
+	 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
+	 * isn't immediately observable.
 	 */
-	smp_mb();
-	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
+	return !!atomic_read(&lock->val);
 }
 
 /**
@@ -123,19 +106,13 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 #endif
 
 /**
- * queued_spin_unlock_wait - wait until current lock holder releases the lock
+ * queued_spin_unlock_wait - wait until the _current_ lock holder releases the lock
  * @lock : Pointer to queued spinlock structure
  *
  * There is a very slight possibility of live-lock if the lockers keep coming
  * and the waiter is just unfortunate enough to not see any unlock state.
  */
-static inline void queued_spin_unlock_wait(struct qspinlock *lock)
-{
-	/* See queued_spin_is_locked() */
-	smp_mb();
-	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
-		cpu_relax();
-}
+void queued_spin_unlock_wait(struct qspinlock *lock);
 
 #ifndef virt_spin_lock
 static __always_inline bool virt_spin_lock(struct qspinlock *lock)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ce2f75e32ae1..3a4e4b34584e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -496,6 +496,49 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
 /*
+ * queued_spin_lock_slowpath() can ACQUIRE the lock before
+ * issuing an _unordered_ store to set _Q_LOCKED_VAL.
+ *
+ * This means that the store can be delayed, but no later than the
+ * store-release from the unlock. This means that simply observing
+ * _Q_LOCKED_VAL is not sufficient to determine if the lock is acquired.
+ *
+ * There are two sites that can issue the unordered store:
+ *
+ *  - clear_pending_set_locked(): *,1,0 -> *,0,1
+ *  - set_locked(): t,0,0 -> t,0,1 ; t != 0
+ *
+ * In both cases we have other !0 state that is observable before the
+ * lock store comes through. This means we can use that to wait for
+ * the lock store, and then wait for an unlock.
+ */
+void queued_spin_unlock_wait(struct qspinlock *lock)
+{
+	u32 val;
+
+	for (;;) {
+		val = atomic_read(&lock->val);
+
+		if (!val) /* not locked, we're done */
+			goto done;
+
+		if (val & _Q_LOCKED_MASK) /* locked, go wait for unlock */
+			break;
+
+		/* not locked, but pending, wait until we observe the lock */
+		cpu_relax();
+	}
+
+	/* any unlock is good */
+	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
+		cpu_relax();
+
+done:
+	smp_rmb();
+}
+EXPORT_SYMBOL(queued_spin_unlock_wait);
+
+/*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
 #if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 16:34       ` Peter Zijlstra
@ 2016-06-02 17:57         ` Will Deacon
  2016-06-02 21:51           ` Peter Zijlstra
  2016-06-06 16:08           ` Peter Zijlstra
  0 siblings, 2 replies; 26+ messages in thread
From: Will Deacon @ 2016-06-02 17:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

On Thu, Jun 02, 2016 at 06:34:25PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 02, 2016 at 04:44:24PM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 02, 2016 at 10:24:40PM +0800, Boqun Feng wrote:
> > > On Thu, Jun 02, 2016 at 01:52:02PM +0200, Peter Zijlstra wrote:
> > > About spin_unlock_wait() on ppc, I actually have a fix pending review:
> > > 
> > > http://lkml.kernel.org/r/1461130033-70898-1-git-send-email-boqun.feng@gmail.com
> > 
> > > that patch fixed a different problem when people want to pair a
> > > spin_unlock_wait() with a spin_lock().
> > 
> > Argh, indeed, and I think qspinlock is still broken there :/ But my poor
> > brain is about to give in for the day.
> 
> This 'replaces' commit:
> 
>   54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
> 
> and seems to still work with the test case from that thread while
> getting rid of the extra barriers.
> 
> ---
>  include/asm-generic/qspinlock.h | 37 +++++++----------------------------
>  kernel/locking/qspinlock.c      | 43 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+), 30 deletions(-)
> 
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index 6bd05700d8c9..9e3dff16d5dc 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -28,30 +28,13 @@
>   */
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
> -	/*
> -	 * queued_spin_lock_slowpath() can ACQUIRE the lock before
> -	 * issuing the unordered store that sets _Q_LOCKED_VAL.
> -	 *
> -	 * See both smp_cond_acquire() sites for more detail.
> -	 *
> -	 * This however means that in code like:
> -	 *
> -	 *   spin_lock(A)		spin_lock(B)
> -	 *   spin_unlock_wait(B)	spin_is_locked(A)
> -	 *   do_something()		do_something()
> +	/* 
> +	 * See queued_spin_unlock_wait().
>  	 *
> -	 * Both CPUs can end up running do_something() because the store
> -	 * setting _Q_LOCKED_VAL will pass through the loads in
> -	 * spin_unlock_wait() and/or spin_is_locked().
> -	 *
> -	 * Avoid this by issuing a full memory barrier between the spin_lock()
> -	 * and the loads in spin_unlock_wait() and spin_is_locked().
> -	 *
> -	 * Note that regular mutual exclusion doesn't care about this
> -	 * delayed store.
> +	 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
> +	 * isn't immediately observable.
>  	 */
> -	smp_mb();
> -	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> +	return !!atomic_read(&lock->val);
>  }

I'm failing to keep up here :(

The fast-path code in queued_spin_lock is just an atomic_cmpxchg_acquire.
If that's built out of LL/SC instructions, then why don't we need a barrier
here in queued_spin_is_locked?

Or is the decision now that only spin_unlock_wait is required to enforce
this ordering?

Will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 17:57         ` Will Deacon
@ 2016-06-02 21:51           ` Peter Zijlstra
  2016-06-03 12:47             ` Will Deacon
  2016-06-06 16:08           ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-02 21:51 UTC (permalink / raw)
  To: Will Deacon
  Cc: Boqun Feng, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

On Thu, Jun 02, 2016 at 06:57:00PM +0100, Will Deacon wrote:
> > +++ b/include/asm-generic/qspinlock.h
> > @@ -28,30 +28,13 @@
> >   */
> >  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
> >  {
> > +	/* 
> > +	 * See queued_spin_unlock_wait().
> >  	 *
> > +	 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
> > +	 * isn't immediately observable.
> >  	 */
> > -	smp_mb();
> > +	return !!atomic_read(&lock->val);
> >  }
> 
> I'm failing to keep up here :(
> 
> The fast-path code in queued_spin_lock is just an atomic_cmpxchg_acquire.
> If that's built out of LL/SC instructions, then why don't we need a barrier
> here in queued_spin_is_locked?
> 
> Or is the decision now that only spin_unlock_wait is required to enforce
> this ordering?

(warning: long, somewhat rambling, email)

You're talking about the smp_mb() that went missing?

So that wasn't the reason that smp_mb() existed..... but that makes the
atomic_foo_acquire() things somewhat hard to use, because I don't think
we want to unconditionally put the smp_mb() in there just in case.

Now, the normal atomic_foo_acquire() stuff uses smp_mb() as per
smp_mb__after_atomic(); it's just ARM64 and PPC that go all 'funny' and
need this extra barrier. Blergh. So let's shelve this issue for a bit.

Let me write some text to hopefully explain where it did come from and
why I now removed it.


So the spin_is_locked() correctness issue comes from something like:

CPU0					CPU1

global_lock();				local_lock(i)
  spin_lock(&G)				  spin_lock(&L[i])
  for (i)				  if (!spin_is_locked(&G)) {
    spin_unlock_wait(&L[i]);		    smp_acquire__after_ctrl_dep();
					    return;
					  }
					  /* deal with fail */

Where it is important CPU1 sees G locked or CPU0 sees L[i] locked such
that there is exclusion between the two critical sections.

The load from spin_is_locked(&G) is constrained by the ACQUIRE from
spin_lock(&L[i]), and similarly the load(s) from spin_unlock_wait(&L[i])
are constrained by the ACQUIRE from spin_lock(&G).

Similarly, later stuff is constrained by the ACQUIRE from CTRL+RMB.
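
To make that diagram concrete, here is a minimal sketch of the
global/local pattern it describes; the names (NR_LOCKS, global_lock(),
local_trylock()) are made up for illustration, this is not the actual
ipc/sem or nf_conntrack code:

	static DEFINE_SPINLOCK(G);
	static spinlock_t L[NR_LOCKS];

	static void global_lock(void)
	{
		int i;

		spin_lock(&G);
		/* wait for all current local holders to go away */
		for (i = 0; i < NR_LOCKS; i++)
			spin_unlock_wait(&L[i]);
	}

	static bool local_trylock(int i)
	{
		spin_lock(&L[i]);
		if (!spin_is_locked(&G)) {
			/* order the above load against our critical section */
			smp_acquire__after_ctrl_dep();
			return true;
		}
		/* a global locker is active; deal with the failure */
		spin_unlock(&L[i]);
		return false;
	}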


Given a simple (SC) test-and-set spinlock the above is fairly
straightforward and 'correct', right?


Now, the 'problem' with qspinlock is that one possible acquire path goes
like (there are more, but this is the easiest):

	smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
	clear_pending_set_locked(lock);

And one possible implementation of clear_pending_set_locked() is:

	WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);

IOW, we load-acquire the locked byte until it's cleared, at which point
we know the pending byte to be 1. Then we consider the lock owned by us
and issue a regular unordered store to flip the pending and locked
bytes.

Normal mutual exclusion is fine with this, no new pending can happen
until this store becomes visible at which time the locked byte is
visibly taken.

This unordered store however, can be delayed (store buffer) such that
the loads from spin_unlock_wait/spin_is_locked can pass up before it
(even on TSO arches).

_IF_ spin_unlock_wait/spin_is_locked only look at the locked byte, this
is a problem because at that point the crossed store-load pattern
becomes uncrossed and we lose our guarantee. That is, what used to be:

	[S] G.locked = 1	[S] L[i].locked = 1
	[MB]			[MB]
	[L] L[i].locked		[L] G.locked

becomes:

	[S] G.locked = 1	[S] L[i].locked = 1
	[L] L[i].locked		[L] G.locked

Which we can reorder at will and bad things follow.

The previous fix for this was to include an smp_mb() in both
spin_is_locked() and spin_unlock_wait() to restore that ordering.


So at this point spin_is_locked() looks like:

	smp_mb();
	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
		cpu_relax();


But for spin_unlock_wait() there is a second correctness issue, namely:

CPU0				CPU1

flag = set;
smp_mb();			spin_lock(&l)
spin_unlock_wait(&l);		if (flag)
				  /* bail */

				/* add to lockless list */
				spin_unlock(&l);

/* iterate lockless list */


Which ensures that CPU1 will stop adding bits to the list and CPU0 will
observe the last entry on the list (if spin_unlock_wait() had ACQUIRE
semantics etc..)
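
In kernel terms the pattern is something like the sketch below; the
names are made up for illustration, it is not any particular in-tree
user:

	static DEFINE_SPINLOCK(l);
	static bool flag;
	static LLIST_HEAD(list);

	/* CPU1 */
	void add_entry(struct llist_node *n)
	{
		spin_lock(&l);
		if (!flag)
			llist_add(n, &list);
		spin_unlock(&l);
	}

	/* CPU0 */
	void drain_entries(void)
	{
		struct llist_node *n;

		flag = true;
		smp_mb();
		spin_unlock_wait(&l);	/* wants ACQUIRE semantics */

		/* no new entries can show up after this point */
		for (n = llist_del_all(&list); n; n = n->next)
			do_entry(n);	/* do_entry() is illustrative */
	}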

This however, is still broken.. nothing ensures CPU0 sees l.locked
before CPU1 tests flag.

So while we fixed the first correctness case (global / local locking as
employed by ipc/sem and nf_conntrack) this is still very much broken.

My patch today rewrites spin_unlock_wait() and spin_is_locked() to rely
on more information to (hopefully -- I really need sleep) fix both.

The primary observation is that even though the l.locked store is
delayed, there has been a prior atomic operation on the lock word to
register the contending lock (in the above scenario, set the pending
byte, in the other paths, queue onto the tail word).

This means that any load passing the .locked byte store, must at least
observe that state.

Therefore, spin_is_locked() looks for any !0 state and is done. Even if
the locked byte is cleared, if any of the other fields are !0 it is in
fact locked.

spin_unlock_wait() is slightly more complex, it becomes:

 1) if the entire word is 0 -- we're unlocked, done
 2) if the locked byte is 0 (there must be other state, otherwise the
    previous case), wait until we see the locked byte.
 3) wait for the locked byte to be cleared

Which relies on the same observation, that even though we might not
observe the locked store, we must observe a store by the contending
lock.

This means the scenario above now becomes:

	[S] G.pending = 1		[S] L[i].pending = 1
	[MB]				[MB]
	[S] G.locked_pending = 1	[S] L[i].locked_pending = 1
	[L] L[i]			[L] G

And things are good again, we simply do not care if the unordered store
is observed or not.

Make sense?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 21:51           ` Peter Zijlstra
@ 2016-06-03 12:47             ` Will Deacon
  2016-06-03 13:42               ` Peter Zijlstra
  2016-06-03 13:48               ` Peter Zijlstra
  0 siblings, 2 replies; 26+ messages in thread
From: Will Deacon @ 2016-06-03 12:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

Hi Peter,

On Thu, Jun 02, 2016 at 11:51:19PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 02, 2016 at 06:57:00PM +0100, Will Deacon wrote:
> > > +++ b/include/asm-generic/qspinlock.h
> > > @@ -28,30 +28,13 @@
> > >   */
> > >  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
> > >  {
> > > +	/* 
> > > +	 * See queued_spin_unlock_wait().
> > >  	 *
> > > +	 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
> > > +	 * isn't immediately observable.
> > >  	 */
> > > -	smp_mb();
> > > +	return !!atomic_read(&lock->val);
> > >  }
> > 
> > I'm failing to keep up here :(
> > 
> > The fast-path code in queued_spin_lock is just an atomic_cmpxchg_acquire.
> > If that's built out of LL/SC instructions, then why don't we need a barrier
> > here in queued_spin_is_locked?
> > 
> > Or is the decision now that only spin_unlock_wait is required to enforce
> > this ordering?
> 
> (warning: long, somewhat rambling, email)

Thanks for taking the time to write this up. Comments inline.

> You're talking about the smp_mb() that went missing?

Right -- I think you still need it.

> So that wasn't the reason that smp_mb() existed..... but that makes the
> atomic_foo_acquire() things somewhat hard to use, because I don't think
> we want to unconditionally put the smp_mb() in there just in case.
> 
> Now, the normal atomic_foo_acquire() stuff uses smp_mb() as per
> smp_mb__after_atomic(); it's just ARM64 and PPC that go all 'funny' and
> need this extra barrier. Blergh. So let's shelve this issue for a bit.

Hmm... I certainly plan to get qspinlock up and running for arm64 in the
near future, so I'm not keen on shelving it for very long.

> Let me write some text to hopefully explain where it did come from and
> why I now removed it.
> 
> 
> So the spin_is_locked() correctness issue comes from something like:
> 
> CPU0					CPU1
> 
> global_lock();				local_lock(i)
>   spin_lock(&G)				  spin_lock(&L[i])
>   for (i)				  if (!spin_is_locked(&G)) {
>     spin_unlock_wait(&L[i]);		    smp_acquire__after_ctrl_dep();
> 					    return;
> 					  }
> 					  /* deal with fail */
> 
> Where it is important CPU1 sees G locked or CPU0 sees L[i] locked such
> that there is exclusion between the two critical sections.

Yes, and there's also a version of this where CPU0 is using spin_is_locked
(see 51d7d5205d338 "powerpc: Add smp_mb() to arch_spin_is_locked()").
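
For reference, that commit boils down to roughly this (going from
memory, so take it as a sketch rather than the exact code):

	static inline int arch_spin_is_locked(arch_spinlock_t *lock)
	{
		smp_mb();
		return !arch_spin_value_unlocked(*lock);
	}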

> The load from spin_is_locked(&G) is constrained by the ACQUIRE from
> spin_lock(&L[i]), and similarly the load(s) from spin_unlock_wait(&L[i])
> are constrained by the ACQUIRE from spin_lock(&G).
> 
> Similarly, later stuff is constrained by the ACQUIRE from CTRL+RMB.
> 
> Given a simple (SC) test-and-set spinlock the above is fairly
> straightforward and 'correct', right?

Well, the same issue that you want to shelve can manifest here with
test-and-set locks too, allowing the spin_is_locked(&G) to be speculated
before the spin_lock(&L[i]) has finished taking the lock on CPU1.

Even ignoring that, I'm not convinced this would work for test-and-set
locks without barriers in unlock_wait and is_locked. For example, a
cut-down version of your test looks like:

CPU0:		CPU1:
LOCK x		LOCK y
Read y		Read x

and you don't want the reads to both return "unlocked".

Even on x86, I think you need a fence here:

X86 lock
{
}
 P0                | P1                ;
 MOV EAX,$1        | MOV EAX,$1        ;
 LOCK XCHG [x],EAX | LOCK XCHG [y],EAX ;
 MOV EBX,[y]       | MOV EBX,[x]       ;
exists
(0:EAX=0 /\ 0:EBX=0 /\ 1:EAX=0 /\ 1:EBX=0)

is permitted by herd.
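
The same shape in user-space C11, for anyone who wants to poke at it on
real hardware outside of herd (a quick sketch, not a proper harness --
a single run proves nothing, you'd want to loop it):

	#include <stdatomic.h>
	#include <pthread.h>
	#include <stdio.h>

	static atomic_int x, y;
	static int r0, r1;

	static void *p0(void *arg)
	{
		atomic_exchange(&x, 1);		/* LOCK XCHG [x],EAX */
		r0 = atomic_load_explicit(&y, memory_order_relaxed);
		return NULL;
	}

	static void *p1(void *arg)
	{
		atomic_exchange(&y, 1);		/* LOCK XCHG [y],EAX */
		r1 = atomic_load_explicit(&x, memory_order_relaxed);
		return NULL;
	}

	int main(void)
	{
		pthread_t t0, t1;

		pthread_create(&t0, NULL, p0, NULL);
		pthread_create(&t1, NULL, p1, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);

		/* the outcome under discussion: both loads seeing 0 */
		printf("r0=%d r1=%d%s\n", r0, r1,
		       (r0 == 0 && r1 == 0) ? " <- reordered!" : "");
		return 0;
	}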

> Now, the 'problem' with qspinlock is that one possible acquire path goes
> like (there are more, but this is the easiest):
> 
> 	smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
> 	clear_pending_set_locked(lock);
> 
> And one possible implementation of clear_pending_set_locked() is:
> 
> 	WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
> 
> IOW, we load-acquire the locked byte until it's cleared, at which point
> we know the pending byte to be 1. Then we consider the lock owned by us
> and issue a regular unordered store to flip the pending and locked
> bytes.
> 
> Normal mutual exclusion is fine with this, no new pending can happen
> until this store becomes visible at which time the locked byte is
> visibly taken.
> 
> This unordered store however, can be delayed (store buffer) such that
> the loads from spin_unlock_wait/spin_is_locked can pass up before it
> (even on TSO arches).

Right, and this is surprisingly similar to the LL/SC problem imo.

> _IF_ spin_unlock_wait/spin_is_locked only look at the locked byte, this
> is a problem because at that point the crossed store-load pattern
> becomes uncrossed and we lose our guarantee. That is, what used to be:
> 
> 	[S] G.locked = 1	[S] L[i].locked = 1
> 	[MB]			[MB]
> 	[L] L[i].locked		[L] G.locked
> 
> becomes:
> 
> 	[S] G.locked = 1	[S] L[i].locked = 1
> 	[L] L[i].locked		[L] G.locked
> 
> Which we can reorder at will and bad things follow.
> 
> The previous fix for this was to include an smp_mb() in both
> spin_is_locked() and spin_unlock_wait() to restore that ordering.
> 
> 
> So at this point spin_is_locked() looks like:
> 
> 	smp_mb();
> 	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
> 		cpu_relax();

I have something similar queued for arm64's ticket locks.

> But for spin_unlock_wait() there is a second correctness issue, namely:
> 
> CPU0				CPU1
> 
> flag = set;
> smp_mb();			spin_lock(&l)
> spin_unlock_wait(&l);		if (flag)
> 				  /* bail */
> 
> 				/* add to lockless list */
> 				spin_unlock(&l);
> 
> /* iterate lockless list */
> 
> 
> Which ensures that CPU1 will stop adding bits to the list and CPU0 will
> observe the last entry on the list (if spin_unlock_wait() had ACQUIRE
> semantics etc..)
> 
> This however, is still broken.. nothing ensures CPU0 sees l.locked
> before CPU1 tests flag.

Yup.

> So while we fixed the first correctness case (global / local locking as
> employed by ipc/sem and nf_conntrack) this is still very much broken.
> 
> My patch today rewrites spin_unlock_wait() and spin_is_locked() to rely
> on more information to (hopefully -- I really need sleep) fix both.
> 
> The primary observation is that even though the l.locked store is
> delayed, there has been a prior atomic operation on the lock word to
> register the contending lock (in the above scenario, set the pending
> byte, in the other paths, queue onto the tail word).
> 
> This means that any load passing the .locked byte store, must at least
> observe that state.

That's what I'm not sure about. Just because there was an atomic operation
writing that state, I don't think it means that it's visible to a normal
load.

Will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-03 12:47             ` Will Deacon
@ 2016-06-03 13:42               ` Peter Zijlstra
  2016-06-03 17:35                 ` Will Deacon
  2016-06-03 13:48               ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-03 13:42 UTC (permalink / raw)
  To: Will Deacon
  Cc: Boqun Feng, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf, H. Peter Anvin

On Fri, Jun 03, 2016 at 01:47:34PM +0100, Will Deacon wrote:
> Even on x86, I think you need a fence here:
> 
> X86 lock
> {
> }
>  P0                | P1                ;
>  MOV EAX,$1        | MOV EAX,$1        ;
>  LOCK XCHG [x],EAX | LOCK XCHG [y],EAX ;
>  MOV EBX,[y]       | MOV EBX,[x]       ;
> exists
> (0:EAX=0 /\ 0:EBX=0 /\ 1:EAX=0 /\ 1:EBX=0)
> 
> is permitted by herd.

I am puzzled.. this should not be. You say adding MFENCE after LOCK XCHG
makes it 'work', but we assume LOCK <op> is a full fence all over the
place.

I'm thinking herd is busted.

Anybody? hpa, Linus?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-03 12:47             ` Will Deacon
  2016-06-03 13:42               ` Peter Zijlstra
@ 2016-06-03 13:48               ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-03 13:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Boqun Feng, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

On Fri, Jun 03, 2016 at 01:47:34PM +0100, Will Deacon wrote:
> > Now, the normal atomic_foo_acquire() stuff uses smp_mb() as per
> > smp_mb__after_atomic(); it's just ARM64 and PPC that go all 'funny' and
> > need this extra barrier. Blergh. So let's shelve this issue for a bit.
> 
> Hmm... I certainly plan to get qspinlock up and running for arm64 in the
> near future, so I'm not keen on shelving it for very long.

Sure; so short term we could always have arm64/ppc specific versions of
these functions where the difference matters.

Alternatively we need to introduce yet another barrier like:

	smp_mb__after_acquire()

Or something like that, which is a no-op by default except for arm64 and
ppc.

But I'm thinking nobody really wants more barrier primitives :/ (says he
who just added one).
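
Purely as a hypothetical sketch -- the primitive doesn't exist, and the
names and placement here are made up -- it would amount to something
like:

	/* asm-generic/barrier.h */
	#ifndef smp_mb__after_acquire
	#define smp_mb__after_acquire()		do { } while (0)
	#endif

	/* and arm64/ppc would then override it with a real barrier: */
	#define smp_mb__after_acquire()		smp_mb()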

> > This unordered store however, can be delayed (store buffer) such that
> > the loads from spin_unlock_wait/spin_is_locked can pass up before it
> > (even on TSO arches).
> 
> Right, and this is surprisingly similar to the LL/SC problem imo.

Yes and no. Yes, because it's an unordered store; no, because a competing
ll/sc cannot make the store fail and retry, as is done in your and
Boqun's fancy solution.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-03 13:42               ` Peter Zijlstra
@ 2016-06-03 17:35                 ` Will Deacon
  2016-06-03 19:13                   ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Will Deacon @ 2016-06-03 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf, H. Peter Anvin

On Fri, Jun 03, 2016 at 03:42:49PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 03, 2016 at 01:47:34PM +0100, Will Deacon wrote:
> > Even on x86, I think you need a fence here:
> > 
> > X86 lock
> > {
> > }
> >  P0                | P1                ;
> >  MOV EAX,$1        | MOV EAX,$1        ;
> >  LOCK XCHG [x],EAX | LOCK XCHG [y],EAX ;
> >  MOV EBX,[y]       | MOV EBX,[x]       ;
> > exists
> > (0:EAX=0 /\ 0:EBX=0 /\ 1:EAX=0 /\ 1:EBX=0)
> > 
> > is permitted by herd.
> 
> I am puzzled.. this should not be. You say adding MFENCE after LOCK XCHG
> makes it 'work', but we assume LOCK <op> is a full fence all over the
> place.
> 
> I'm thinking herd is busted.

FWIW -- I spoke to the Herd authors and they confirmed that this is a
regression in herd (fixed in the bleeding edge version).

Will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-03 17:35                 ` Will Deacon
@ 2016-06-03 19:13                   ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-03 19:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: Boqun Feng, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf, H. Peter Anvin

On Fri, Jun 03, 2016 at 06:35:37PM +0100, Will Deacon wrote:
> On Fri, Jun 03, 2016 at 03:42:49PM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 03, 2016 at 01:47:34PM +0100, Will Deacon wrote:
> > > Even on x86, I think you need a fence here:
> > > 
> > > X86 lock
> > > {
> > > }
> > >  P0                | P1                ;
> > >  MOV EAX,$1        | MOV EAX,$1        ;
> > >  LOCK XCHG [x],EAX | LOCK XCHG [y],EAX ;
> > >  MOV EBX,[y]       | MOV EBX,[x]       ;
> > > exists
> > > (0:EAX=0 /\ 0:EBX=0 /\ 1:EAX=0 /\ 1:EBX=0)
> > > 
> > > is permitted by herd.
> > 
> > I am puzzled.. this should not be. You say adding MFENCE after LOCK XCHG
> > makes it 'work', but we assume LOCK <op> is a full fence all over the
> > place.
> > 
> > I'm thinking herd is busted.
> 
> FWIW -- I spoke to the Herd authors and they confirmed that this is a
> regression in herd (fixed in the bleeding edge version).

Good, means I'm not slowly going crazeh -- or at least, this isn't one
of the signs :-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-02 17:57         ` Will Deacon
  2016-06-02 21:51           ` Peter Zijlstra
@ 2016-06-06 16:08           ` Peter Zijlstra
  2016-06-07 11:43             ` Boqun Feng
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-06 16:08 UTC (permalink / raw)
  To: Will Deacon
  Cc: Boqun Feng, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

On Thu, Jun 02, 2016 at 06:57:00PM +0100, Will Deacon wrote:
> > This 'replaces' commit:
> > 
> >   54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
> > 
> > and seems to still work with the test case from that thread while
> > getting rid of the extra barriers.

New version :-)

This one has moar comments; and also tries to address an issue with
xchg_tail(), which is its own consumer. Paul, Will said you'd love the
address dependency through union members :-)

(I should split this in at least 3 patches I suppose)

ARM64 and PPC should provide private versions of is_locked and
unlock_wait; because while this one deals with the unordered store as
per qspinlock construction, it still relies on cmpxchg_acquire()'s store
being visible.
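
The #ifndef hooks below are what such a private version would key off
of; as a purely hypothetical illustration (not part of this patch), an
arch override could look like:

	/* hypothetical arch/xxx/include/asm/qspinlock.h */
	#define queued_spin_is_locked queued_spin_is_locked
	static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
	{
		smp_mb();	/* order against the lock's store, per the above */
		return !!atomic_read(&lock->val);
	}

	#include <asm-generic/qspinlock.h>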

---
 include/asm-generic/qspinlock.h |  40 ++++---------
 kernel/locking/qspinlock.c      | 126 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 137 insertions(+), 29 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 6bd05700d8c9..3020bdb7a43c 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -26,33 +26,18 @@
  * @lock: Pointer to queued spinlock structure
  * Return: 1 if it is locked, 0 otherwise
  */
+#ifndef queued_spin_is_locked
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
 	/*
-	 * queued_spin_lock_slowpath() can ACQUIRE the lock before
-	 * issuing the unordered store that sets _Q_LOCKED_VAL.
+	 * See queued_spin_unlock_wait().
 	 *
-	 * See both smp_cond_acquire() sites for more detail.
-	 *
-	 * This however means that in code like:
-	 *
-	 *   spin_lock(A)		spin_lock(B)
-	 *   spin_unlock_wait(B)	spin_is_locked(A)
-	 *   do_something()		do_something()
-	 *
-	 * Both CPUs can end up running do_something() because the store
-	 * setting _Q_LOCKED_VAL will pass through the loads in
-	 * spin_unlock_wait() and/or spin_is_locked().
-	 *
-	 * Avoid this by issuing a full memory barrier between the spin_lock()
-	 * and the loads in spin_unlock_wait() and spin_is_locked().
-	 *
-	 * Note that regular mutual exclusion doesn't care about this
-	 * delayed store.
+	 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
+	 * isn't immediately observable.
 	 */
-	smp_mb();
-	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
+	return !!atomic_read(&lock->val);
 }
+#endif
 
 /**
  * queued_spin_value_unlocked - is the spinlock structure unlocked?
@@ -78,6 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
 {
 	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
 }
+
 /**
  * queued_spin_trylock - try to acquire the queued spinlock
  * @lock : Pointer to queued spinlock structure
@@ -123,19 +109,15 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 #endif
 
 /**
- * queued_spin_unlock_wait - wait until current lock holder releases the lock
+ * queued_spin_unlock_wait - wait until the _current_ lock holder releases the lock
  * @lock : Pointer to queued spinlock structure
  *
  * There is a very slight possibility of live-lock if the lockers keep coming
  * and the waiter is just unfortunate enough to not see any unlock state.
  */
-static inline void queued_spin_unlock_wait(struct qspinlock *lock)
-{
-	/* See queued_spin_is_locked() */
-	smp_mb();
-	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
-		cpu_relax();
-}
+#ifndef queued_spin_unlock_wait
+void queued_spin_unlock_wait(struct qspinlock *lock);
+#endif
 
 #ifndef virt_spin_lock
 static __always_inline bool virt_spin_lock(struct qspinlock *lock)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ce2f75e32ae1..e1c29d352e0e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -395,6 +395,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * pending stuff.
 	 *
 	 * p,*,* -> n,*,*
+	 *
+	 * RELEASE, such that the stores to @node must be complete.
 	 */
 	old = xchg_tail(lock, tail);
 	next = NULL;
@@ -405,6 +407,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 */
 	if (old & _Q_TAIL_MASK) {
 		prev = decode_tail(old);
+		/*
+		 * The above xchg_tail() is also load of @lock which generates,
+		 * through decode_tail(), a pointer.
+		 *
+		 * The address dependency matches the RELEASE of xchg_tail()
+		 * such that the access to @prev must happen after.
+		 */
+		smp_read_barrier_depends();
+
 		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
@@ -496,6 +507,121 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
 /*
+ * PROBLEM: some architectures have an interesting issue with atomic ACQUIRE
+ * operations in that the ACQUIRE applies to the LOAD _not_ the STORE (ARM64,
+ * PPC). Also qspinlock has a similar issue per construction: the setting of
+ * the locked byte can be unordered relative to acquiring the lock proper.
+ *
+ * This gets to be 'interesting' in the following cases, where the /should/s
+ * end up false because of this issue.
+ *
+ *
+ * CASE 1:
+ *
+ * So the spin_is_locked() correctness issue comes from something like:
+ *
+ *   CPU0				CPU1
+ *
+ *   global_lock();			local_lock(i)
+ *     spin_lock(&G)			  spin_lock(&L[i])
+ *     for (i)				  if (!spin_is_locked(&G)) {
+ *       spin_unlock_wait(&L[i]);	    smp_acquire__after_ctrl_dep();
+ * 					    return;
+ * 					  }
+ * 					  // deal with fail
+ *
+ * Where it is important CPU1 sees G locked or CPU0 sees L[i] locked such
+ * that there is exclusion between the two critical sections.
+ *
+ * The load from spin_is_locked(&G) /should/ be constrained by the ACQUIRE from
+ * spin_lock(&L[i]), and similarly the load(s) from spin_unlock_wait(&L[i])
+ * /should/ be constrained by the ACQUIRE from spin_lock(&G).
+ *
+ * Similarly, later stuff is constrained by the ACQUIRE from CTRL+RMB.
+ *
+ *
+ * CASE 2:
+ *
+ * For spin_unlock_wait() there is a second correctness issue, namely:
+ *
+ *   CPU0				CPU1
+ *
+ *   flag = set;
+ *   smp_mb();				spin_lock(&l)
+ *   spin_unlock_wait(&l);		if (!flag)
+ *					  // add to lockless list
+ * 					spin_unlock(&l);
+ *   // iterate lockless list
+ *
+ * Which wants to ensure that CPU1 will stop adding bits to the list and CPU0
+ * will observe the last entry on the list (if spin_unlock_wait() had ACQUIRE
+ * semantics etc..)
+ *
+ * Where flag /should/ be ordered against the locked store of l.
+ */
+
+
+/*
+ * queued_spin_lock_slowpath() can (load-)ACQUIRE the lock before
+ * issuing an _unordered_ store to set _Q_LOCKED_VAL.
+ *
+ * This means that the store can be delayed, but no later than the
+ * store-release from the unlock. This means that simply observing
+ * _Q_LOCKED_VAL is not sufficient to determine if the lock is acquired.
+ *
+ * There are two paths that can issue the unordered store:
+ *
+ *  (1) clear_pending_set_locked():	*,1,0 -> *,0,1
+ *
+ *  (2) set_locked():			t,0,0 -> t,0,1 ; t != 0
+ *      atomic_cmpxchg_relaxed():	t,0,0 -> 0,0,1
+ *
+ * However, in both cases we have other !0 state we've set before to queue
+ * ourselves:
+ *
+ * For (1) we have the atomic_cmpxchg_acquire() that set _Q_PENDING_VAL, our
+ * load is constrained by that ACQUIRE to not pass before that, and thus must
+ * observe the store.
+ *
+ * For (2) we have a more interesting scenario. We enqueue ourselves using
+ * xchg_tail(), which ends up being a RELEASE. This in itself is not
+ * sufficient; however, it is followed by an smp_cond_acquire() on the same
+ * word, giving a RELEASE->ACQUIRE ordering. This again constrains our load and
+ * guarantees we must observe that store.
+ *
+ * Therefore both cases have other !0 state that is observable before the
+ * unordered locked byte store comes through. This means we can use that to
+ * wait for the lock store, and then wait for an unlock.
+ */
+#ifndef queued_spin_unlock_wait
+void queued_spin_unlock_wait(struct qspinlock *lock)
+{
+	u32 val;
+
+	for (;;) {
+		val = atomic_read(&lock->val);
+
+		if (!val) /* not locked, we're done */
+			goto done;
+
+		if (val & _Q_LOCKED_MASK) /* locked, go wait for unlock */
+			break;
+
+		/* not locked, but pending, wait until we observe the lock */
+		cpu_relax();
+	}
+
+	/* any unlock is good */
+	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
+		cpu_relax();
+
+done:
+	smp_rmb();
+}
+EXPORT_SYMBOL(queued_spin_unlock_wait);
+#endif
+
+/*
  * Generate the paravirt code for queued_spin_unlock_slowpath().
  */
 #if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-06 16:08           ` Peter Zijlstra
@ 2016-06-07 11:43             ` Boqun Feng
  2016-06-07 12:00               ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Boqun Feng @ 2016-06-07 11:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

[-- Attachment #1: Type: text/plain, Size: 2065 bytes --]

On Mon, Jun 06, 2016 at 06:08:36PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 02, 2016 at 06:57:00PM +0100, Will Deacon wrote:
> > > This 'replaces' commit:
> > > 
> > >   54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
> > > 
> > > and seems to still work with the test case from that thread while
> > > getting rid of the extra barriers.
> 
> New version :-)
> 
> This one has moar comments; and also tries to address an issue with
> xchg_tail(), which is its own consumer. Paul, Will said you'd love the
> address dependency through union members :-)
> 
> (I should split this in at least 3 patches I suppose)
> 
> ARM64 and PPC should provide private versions of is_locked and
> unlock_wait; because while this one deals with the unordered store as
> per qspinlock construction, it still relies on cmpxchg_acquire()'s store
> being visible.
> 

[snip]

> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index ce2f75e32ae1..e1c29d352e0e 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -395,6 +395,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 * pending stuff.
>  	 *
>  	 * p,*,* -> n,*,*
> +	 *
> +	 * RELEASE, such that the stores to @node must be complete.
>  	 */
>  	old = xchg_tail(lock, tail);
>  	next = NULL;
> @@ -405,6 +407,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 */
>  	if (old & _Q_TAIL_MASK) {
>  		prev = decode_tail(old);
> +		/*
> +		 * The above xchg_tail() is also load of @lock which generates,
> +		 * through decode_tail(), a pointer.
> +		 *
> +		 * The address dependency matches the RELEASE of xchg_tail()
> +		 * such that the access to @prev must happen after.
> +		 */
> +		smp_read_barrier_depends();

Should this barrier be put before decode_tail()? Because it's the
dependency old -> prev that we want to protect here.

Regards,
Boqun

> +
>  		WRITE_ONCE(prev->next, node);
>  
>  		pv_wait_node(node, prev);
[snip]

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-07 11:43             ` Boqun Feng
@ 2016-06-07 12:00               ` Peter Zijlstra
  2016-06-07 12:45                 ` Boqun Feng
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-07 12:00 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Will Deacon, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

On Tue, Jun 07, 2016 at 07:43:15PM +0800, Boqun Feng wrote:
> On Mon, Jun 06, 2016 at 06:08:36PM +0200, Peter Zijlstra wrote:
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index ce2f75e32ae1..e1c29d352e0e 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -395,6 +395,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  	 * pending stuff.
> >  	 *
> >  	 * p,*,* -> n,*,*
> > +	 *
> > +	 * RELEASE, such that the stores to @node must be complete.
> >  	 */
> >  	old = xchg_tail(lock, tail);
> >  	next = NULL;
> > @@ -405,6 +407,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  	 */
> >  	if (old & _Q_TAIL_MASK) {
> >  		prev = decode_tail(old);
> > +		/*
> > +		 * The above xchg_tail() is also load of @lock which generates,
> > +		 * through decode_tail(), a pointer.
> > +		 *
> > +		 * The address dependency matches the RELEASE of xchg_tail()
> > +		 * such that the access to @prev must happen after.
> > +		 */
> > +		smp_read_barrier_depends();
> 
> Should this barrier be put before decode_tail()? Because it's the
> dependency old -> prev that we want to protect here.

I don't think it matters one way or the other. The old->prev
transformation is pure; it doesn't depend on any state other than old.

I put it between prev and dereferences of prev, because that's what made
most sense to me; but really anywhere between the load of @old and the
first dereference of @prev is fine I suspect.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-07 12:00               ` Peter Zijlstra
@ 2016-06-07 12:45                 ` Boqun Feng
  2016-06-07 17:36                   ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Boqun Feng @ 2016-06-07 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

[-- Attachment #1: Type: text/plain, Size: 2338 bytes --]

On Tue, Jun 07, 2016 at 02:00:16PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 07, 2016 at 07:43:15PM +0800, Boqun Feng wrote:
> > On Mon, Jun 06, 2016 at 06:08:36PM +0200, Peter Zijlstra wrote:
> > > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > > index ce2f75e32ae1..e1c29d352e0e 100644
> > > --- a/kernel/locking/qspinlock.c
> > > +++ b/kernel/locking/qspinlock.c
> > > @@ -395,6 +395,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  	 * pending stuff.
> > >  	 *
> > >  	 * p,*,* -> n,*,*
> > > +	 *
> > > +	 * RELEASE, such that the stores to @node must be complete.
> > >  	 */
> > >  	old = xchg_tail(lock, tail);
> > >  	next = NULL;
> > > @@ -405,6 +407,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  	 */
> > >  	if (old & _Q_TAIL_MASK) {
> > >  		prev = decode_tail(old);
> > > +		/*
> > > +		 * The above xchg_tail() is also load of @lock which generates,
> > > +		 * through decode_tail(), a pointer.
> > > +		 *
> > > +		 * The address dependency matches the RELEASE of xchg_tail()
> > > +		 * such that the access to @prev must happen after.
> > > +		 */
> > > +		smp_read_barrier_depends();
> > 
> > Should this barrier be put before decode_tail()? Because it's the
> > dependency old -> prev that we want to protect here.
> 
> I don't think it matters one way or the other. The old->prev
> transformation is pure; it doesn't depend on any state other than old.
> 

But wouldn't the old -> prev transformation get broken on Alpha
semantically in theory? Because Alpha could reorder the LOAD part of the
xchg_tail() and decode_tail(), which results in prev pointing to another
cpu's mcs_node.

Though, this is fine in current code, because xchg_release() is actually
xchg() on Alpha, and Alpha doesn't use qspinlock.

> I put it between prev and dereferences of prev, because that's what made
> most sense to me; but really anywhere between the load of @old and the
> first dereference of @prev is fine I suspect.

I understand the barrier here serves more of a documentation purpose,
and I don't want to be a paranoid ;-) I'm fine with the current place, just
thought we could put it somewhere more in keeping with our memory
model.

/me feel myself a paranoid anyway...

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait()
  2016-06-07 12:45                 ` Boqun Feng
@ 2016-06-07 17:36                   ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2016-06-07 17:36 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Will Deacon, linux-kernel, torvalds, manfred, dave, paulmck,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, chris, rth, dhowells, schwidefsky,
	mpe, ralf, linux, rkuo, vgupta, james.hogan, realmz6, ysato,
	tony.luck, cmetcalf

On Tue, Jun 07, 2016 at 08:45:53PM +0800, Boqun Feng wrote:
> On Tue, Jun 07, 2016 at 02:00:16PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 07, 2016 at 07:43:15PM +0800, Boqun Feng wrote:
> > > On Mon, Jun 06, 2016 at 06:08:36PM +0200, Peter Zijlstra wrote:
> > > > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > > > index ce2f75e32ae1..e1c29d352e0e 100644
> > > > --- a/kernel/locking/qspinlock.c
> > > > +++ b/kernel/locking/qspinlock.c
> > > > @@ -395,6 +395,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > > >  	 * pending stuff.
> > > >  	 *
> > > >  	 * p,*,* -> n,*,*
> > > > +	 *
> > > > +	 * RELEASE, such that the stores to @node must be complete.
> > > >  	 */
> > > >  	old = xchg_tail(lock, tail);
> > > >  	next = NULL;
> > > > @@ -405,6 +407,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > > >  	 */
> > > >  	if (old & _Q_TAIL_MASK) {
> > > >  		prev = decode_tail(old);
> > > > +		/*
> > > > +		 * The above xchg_tail() is also load of @lock which generates,
> > > > +		 * through decode_tail(), a pointer.
> > > > +		 *
> > > > +		 * The address dependency matches the RELEASE of xchg_tail()
> > > > +		 * such that the access to @prev must happen after.
> > > > +		 */
> > > > +		smp_read_barrier_depends();
> > > 
> > > Should this barrier be put before decode_tail()? Because it's the
> > > dependency old -> prev that we want to protect here.
> > 
> > I don't think it matters one way or the other. The old->prev
> > transformation is pure; it doesn't depend on any state other than old.
> > 
> 
> But wouldn't the old -> prev transformation get broken on Alpha
> semantically in theory? Because Alpha could reorder the LOAD part of the
> xchg_tail() and decode_tail(), which results in prev pointing to another
> cpu's mcs_node.

No; I don't think this is possible. The thing Alpha needs the barrier
for is to force its two cache halves into sync, such that dependent
loads are guaranteed ordered.

There is only a single load of @old, there is no second load; the
transform into @prev is a 'pure' function:

  https://en.wikipedia.org/wiki/Pure_function

(even if the transformation needs to load data; that data is static, it
never changes, and therefore it is impossible to observe a stale value).
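
For reference, decode_tail() is (roughly, going from memory) nothing
but a pure computation from the tail value to a per-cpu MCS node
pointer:

	static inline struct mcs_spinlock *decode_tail(u32 tail)
	{
		int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
		int idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;

		return per_cpu_ptr(&mcs_nodes[idx], cpu);
	}

There is no second load of the lock word in there; @tail is its only
input.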

> Though, this is fine in current code, because xchg_release() is actually
> xchg() on Alpha, and Alpha doesn't use qspinlock.

I actually have a patch that converts Alpha to use relaxed atomics :-)
But yeah, the point is entirely moot for Alpha not using qspinlock.

> > I put it between prev and dereferences of prev, because that's what made
> > most sense to me; but really anywhere between the load of @old and the
> > first dereference of @prev is fine I suspect.
> 
> I understand the barrier here serves more for a documentation purpose,
> and I don't want to a paranoid ;-) I'm fine with the current place, just
> thought we could put it at somewhere more conforming to our memory
> model.

I think it is in accordance, there is the load of @old and there are the
loads of dereferenced @prev, between those a barrier needs to be placed.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-06-07 17:37 UTC | newest]

Thread overview: 26+ messages
2016-06-02 11:51 [PATCH -v4 0/7] spin_unlock_wait borkage and assorted bits Peter Zijlstra
2016-06-02 11:51 ` [PATCH -v4 1/7] locking: Replace smp_cond_acquire with smp_cond_load_acquire Peter Zijlstra
2016-06-02 11:51 ` [PATCH -v4 2/7] locking: Introduce smp_acquire__after_ctrl_dep Peter Zijlstra
2016-06-02 11:52 ` [PATCH -v4 3/7] locking: Move smp_cond_load_acquire() to asm-generic/barrier.h Peter Zijlstra
2016-06-02 11:52 ` [PATCH -v4 4/7] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep Peter Zijlstra
2016-06-02 11:52 ` [PATCH -v4 5/7] locking, arch: Update spin_unlock_wait() Peter Zijlstra
2016-06-02 14:24   ` Boqun Feng
2016-06-02 14:44     ` Peter Zijlstra
2016-06-02 15:11       ` Boqun Feng
2016-06-02 15:57         ` Boqun Feng
2016-06-02 16:04         ` Peter Zijlstra
2016-06-02 16:34       ` Peter Zijlstra
2016-06-02 17:57         ` Will Deacon
2016-06-02 21:51           ` Peter Zijlstra
2016-06-03 12:47             ` Will Deacon
2016-06-03 13:42               ` Peter Zijlstra
2016-06-03 17:35                 ` Will Deacon
2016-06-03 19:13                   ` Peter Zijlstra
2016-06-03 13:48               ` Peter Zijlstra
2016-06-06 16:08           ` Peter Zijlstra
2016-06-07 11:43             ` Boqun Feng
2016-06-07 12:00               ` Peter Zijlstra
2016-06-07 12:45                 ` Boqun Feng
2016-06-07 17:36                   ` Peter Zijlstra
2016-06-02 11:52 ` [PATCH -v4 6/7] locking: Update spin_unlock_wait users Peter Zijlstra
2016-06-02 11:52 ` [PATCH -v4 7/7] locking,netfilter: Fix nf_conntrack_lock() Peter Zijlstra
