* [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits
@ 2016-05-31  9:41 Peter Zijlstra
  2016-05-31  9:41 ` [PATCH -v3 1/8] locking: Replace smp_cond_acquire with smp_cond_load_acquire Peter Zijlstra
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Peter Zijlstra @ 2016-05-31  9:41 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

Similar to -v2 in that it rewrites spin_unlock_wait() for all architectures.

The new spin_unlock_wait() provides ACQUIRE semantics to match the RELEASE of
the spin_unlock() we waited for and thereby ensure we can fully observe its
critical section.
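
As a purely illustrative sketch (not part of the series; 'lock' and 'X' are
made-up names), the guarantee can be pictured like this:

	CPU0				CPU1

	spin_lock(&lock);
	X = 1;
	spin_unlock(&lock);		spin_unlock_wait(&lock);
					r1 = X;

If CPU1's spin_unlock_wait() had to wait for CPU0's unlock, the new ACQUIRE
pairs with that RELEASE and r1 == 1 is guaranteed on return.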

This fixes a number of (pretty much all) spin_unlock_wait() users.

This series pulls in the smp_cond_acquire() rewrite because it introduces a lot
of new users of it; all simple spin_unlock_wait() implementations end up being
one such user.

New in this series, apart from the fixes from the last posting (s390, tile),
is that it moves smp_cond_load_acquire() and friends into asm-generic/barrier.h
so that architectures that explicitly do not do load speculation (tile)
can override smp_acquire__after_ctrl_dep().

I'm planning on queuing these patches for 4.8.


* [PATCH -v3 1/8] locking: Replace smp_cond_acquire with smp_cond_load_acquire
  2016-05-31  9:41 [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits Peter Zijlstra
@ 2016-05-31  9:41 ` Peter Zijlstra
  2016-05-31  9:41 ` [PATCH -v3 2/8] locking: Introduce cmpwait() Peter Zijlstra
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2016-05-31  9:41 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-smp_cond_load_acquire.patch --]
[-- Type: text/plain, Size: 5846 bytes --]

This new form allows using hardware-assisted waiting.
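
For reference, the conversion pattern, taken from the kernel/smp.c and
kernel/locking/qspinlock.c hunks below:

	/* old: condition expression only */
	smp_cond_acquire(!(csd->flags & CSD_FLAG_LOCK));

	/* new: pointer plus an expression over the pre-named VAL; the last
	 * value loaded is also returned, so callers can capture it:
	 */
	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
	val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));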

Requested-by: Will Deacon <will.deacon@arm.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/compiler.h   |   25 +++++++++++++++++++------
 kernel/locking/qspinlock.c |   12 ++++++------
 kernel/sched/core.c        |    8 ++++----
 kernel/sched/sched.h       |    2 +-
 kernel/smp.c               |    2 +-
 5 files changed, 31 insertions(+), 18 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -305,21 +305,34 @@ static __always_inline void __write_once
 })
 
 /**
- * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
  *
  * Equivalent to using smp_load_acquire() on the condition variable but employs
  * the control dependency of the wait to reduce the barrier on many platforms.
  *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ *
  * The control dependency provides a LOAD->STORE order, the additional RMB
  * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
  * aka. ACQUIRE.
  */
-#define smp_cond_acquire(cond)	do {		\
-	while (!(cond))				\
-		cpu_relax();			\
-	smp_rmb(); /* ctrl + rmb := acquire */	\
-} while (0)
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+	typeof(ptr) __PTR = (ptr);				\
+	typeof(*ptr) VAL;					\
+	for (;;) {						\
+		VAL = READ_ONCE(*__PTR);			\
+		if (cond_expr)					\
+			break;					\
+		cpu_relax();					\
+	}							\
+	smp_rmb(); /* ctrl + rmb := acquire */			\
+	VAL;							\
+})
+#endif
 
 #endif /* __KERNEL__ */
 
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -358,7 +358,7 @@ void queued_spin_lock_slowpath(struct qs
 	 * sequentiality; this is because not all clear_pending_set_locked()
 	 * implementations imply full barriers.
 	 */
-	smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
+	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
 
 	/*
 	 * take ownership and clear the pending bit.
@@ -434,7 +434,7 @@ void queued_spin_lock_slowpath(struct qs
 	 *
 	 * The PV pv_wait_head_or_lock function, if active, will acquire
 	 * the lock and return a non-zero value. So we have to skip the
-	 * smp_cond_acquire() call. As the next PV queue head hasn't been
+	 * smp_cond_load_acquire() call. As the next PV queue head hasn't been
 	 * designated yet, there is no way for the locked value to become
 	 * _Q_SLOW_VAL. So both the set_locked() and the
 	 * atomic_cmpxchg_relaxed() calls will be safe.
@@ -445,7 +445,7 @@ void queued_spin_lock_slowpath(struct qs
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
-	smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
+	val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));
 
 locked:
 	/*
@@ -465,9 +465,9 @@ void queued_spin_lock_slowpath(struct qs
 			break;
 		}
 		/*
-		 * The smp_cond_acquire() call above has provided the necessary
-		 * acquire semantics required for locking. At most two
-		 * iterations of this loop may be ran.
+		 * The smp_cond_load_acquire() call above has provided the
+		 * necessary acquire semantics required for locking. At most
+		 * two iterations of this loop may be ran.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1843,7 +1843,7 @@ static void ttwu_queue(struct task_struc
  * chain to provide order. Instead we do:
  *
  *   1) smp_store_release(X->on_cpu, 0)
- *   2) smp_cond_acquire(!X->on_cpu)
+ *   2) smp_cond_load_acquire(!X->on_cpu)
  *
  * Example:
  *
@@ -1854,7 +1854,7 @@ static void ttwu_queue(struct task_struc
  *   sched-out X
  *   smp_store_release(X->on_cpu, 0);
  *
- *                    smp_cond_acquire(!X->on_cpu);
+ *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
  *                    X->state = WAKING
  *                    set_task_cpu(X,2)
  *
@@ -1880,7 +1880,7 @@ static void ttwu_queue(struct task_struc
  * This means that any means of doing remote wakeups must order the CPU doing
  * the wakeup against the CPU the task is going to end up running on. This,
  * however, is already required for the regular Program-Order guarantee above,
- * since the waking CPU is the one issueing the ACQUIRE (smp_cond_acquire).
+ * since the waking CPU is the one issueing the ACQUIRE (smp_cond_load_acquire).
  *
  */
 
@@ -1953,7 +1953,7 @@ try_to_wake_up(struct task_struct *p, un
 	 * This ensures that tasks getting woken will be fully ordered against
 	 * their previous state and preserve Program Order.
 	 */
-	smp_cond_acquire(!p->on_cpu);
+	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1104,7 +1104,7 @@ static inline void finish_lock_switch(st
 	 * In particular, the load of prev->state in finish_task_switch() must
 	 * happen before this.
 	 *
-	 * Pairs with the smp_cond_acquire() in try_to_wake_up().
+	 * Pairs with the smp_cond_load_acquire() in try_to_wake_up().
 	 */
 	smp_store_release(&prev->on_cpu, 0);
 #endif
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -107,7 +107,7 @@ void __init call_function_init(void)
  */
 static __always_inline void csd_lock_wait(struct call_single_data *csd)
 {
-	smp_cond_acquire(!(csd->flags & CSD_FLAG_LOCK));
+	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
 }
 
 static __always_inline void csd_lock(struct call_single_data *csd)


* [PATCH -v3 2/8] locking: Introduce cmpwait()
  2016-05-31  9:41 [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits Peter Zijlstra
  2016-05-31  9:41 ` [PATCH -v3 1/8] locking: Replace smp_cond_acquire with smp_cond_load_acquire Peter Zijlstra
@ 2016-05-31  9:41 ` Peter Zijlstra
  2016-05-31  9:41 ` [PATCH -v3 3/8] locking: Introduce smp_acquire__after_ctrl_dep Peter Zijlstra
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2016-05-31  9:41 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-cmp_and_wait.patch --]
[-- Type: text/plain, Size: 1481 bytes --]

Provide the cmpwait() primitive, which 'spin' waits for a variable
to change, and use it to implement smp_cond_load_acquire().

This primitive can be implemented with hardware assist on some
platforms (ARM64, x86).
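
A minimal usage sketch (illustrative only; 'ptr' and done() are made-up
names, not anything in this patch):

	u32 *ptr = ...;				/* hypothetical variable */
	u32 val = READ_ONCE(*ptr);

	while (!done(val)) {			/* done() is hypothetical too */
		cmpwait(ptr, val);		/* wait while *ptr == val */
		val = READ_ONCE(*ptr);
	}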

Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/compiler.h |   19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -305,6 +305,23 @@ static __always_inline void __write_once
 })
 
 /**
+ * cmpwait - compare and wait for a variable to change
+ * @ptr: pointer to the variable to wait on
+ * @val: the value it should change from
+ *
+ * A simple constuct that waits for a variable to change from a known
+ * value; some architectures can do this in hardware.
+ */
+#ifndef cmpwait
+#define cmpwait(ptr, val) do {					\
+	typeof (ptr) __ptr = (ptr);				\
+	typeof (val) __val = (val);				\
+	while (READ_ONCE(*__ptr) == __val)			\
+		cpu_relax();					\
+} while (0)
+#endif
+
+/**
  * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
  * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
@@ -327,7 +344,7 @@ static __always_inline void __write_once
 		VAL = READ_ONCE(*__PTR);			\
 		if (cond_expr)					\
 			break;					\
-		cpu_relax();					\
+		cmpwait(__PTR, VAL);				\
 	}							\
 	smp_rmb(); /* ctrl + rmb := acquire */			\
 	VAL;							\


* [PATCH -v3 3/8] locking: Introduce smp_acquire__after_ctrl_dep
  2016-05-31  9:41 [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits Peter Zijlstra
  2016-05-31  9:41 ` [PATCH -v3 1/8] locking: Replace smp_cond_acquire with smp_cond_load_acquire Peter Zijlstra
  2016-05-31  9:41 ` [PATCH -v3 2/8] locking: Introduce cmpwait() Peter Zijlstra
@ 2016-05-31  9:41 ` Peter Zijlstra
  2016-06-01 13:52   ` Boqun Feng
  2016-05-31  9:41 ` [PATCH -v3 4/8] locking, arch: Update spin_unlock_wait() Peter Zijlstra
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2016-05-31  9:41 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-smp_acquire__after_ctrl_dep.patch --]
[-- Type: text/plain, Size: 2864 bytes --]

Introduce smp_acquire__after_ctrl_dep(); this construct is not
uncommon, but the lack of this barrier is.
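
The construct it names is, schematically (a sketch only; 'flag' is a made-up
variable):

	/* wait using only a control dependency ... */
	while (!READ_ONCE(*flag))
		cpu_relax();

	/* ... then upgrade the LOAD->STORE order to LOAD->{LOAD,STORE} */
	smp_acquire__after_ctrl_dep();	/* ctrl + rmb := acquire */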

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/compiler.h |   18 +++++++++++++-----
 ipc/sem.c                |   14 ++------------
 2 files changed, 15 insertions(+), 17 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -305,6 +305,18 @@ static __always_inline void __write_once
 })
 
 /**
+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
+ *
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. (load)-ACQUIRE.
+ *
+ * Architectures that do not do load speculation can have this be barrier().
+ * XXX move into asm/barrier.h
+ */
+#define smp_acquire__after_ctrl_dep()		smp_rmb()
+
+/**
  * cmpwait - compare and wait for a variable to change
  * @ptr: pointer to the variable to wait on
  * @val: the value it should change from
@@ -331,10 +343,6 @@ static __always_inline void __write_once
  *
  * Due to C lacking lambda expressions we load the value of *ptr into a
  * pre-named variable @VAL to be used in @cond.
- *
- * The control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. ACQUIRE.
  */
 #ifndef smp_cond_load_acquire
 #define smp_cond_load_acquire(ptr, cond_expr) ({		\
@@ -346,7 +354,7 @@ static __always_inline void __write_once
 			break;					\
 		cmpwait(__PTR, VAL);				\
 	}							\
-	smp_rmb(); /* ctrl + rmb := acquire */			\
+	smp_acquire__after_ctrl_dep();				\
 	VAL;							\
 })
 #endif
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -260,16 +260,6 @@ static void sem_rcu_free(struct rcu_head
 }
 
 /*
- * spin_unlock_wait() and !spin_is_locked() are not memory barriers, they
- * are only control barriers.
- * The code must pair with spin_unlock(&sem->lock) or
- * spin_unlock(&sem_perm.lock), thus just the control barrier is insufficient.
- *
- * smp_rmb() is sufficient, as writes cannot pass the control barrier.
- */
-#define ipc_smp_acquire__after_spin_is_unlocked()	smp_rmb()
-
-/*
  * Wait until all currently ongoing simple ops have completed.
  * Caller must own sem_perm.lock.
  * New simple ops cannot start, because simple ops first check
@@ -292,7 +282,7 @@ static void sem_wait_array(struct sem_ar
 		sem = sma->sem_base + i;
 		spin_unlock_wait(&sem->lock);
 	}
-	ipc_smp_acquire__after_spin_is_unlocked();
+	smp_acquire__after_ctrl_dep();
 }
 
 /*
@@ -350,7 +340,7 @@ static inline int sem_lock(struct sem_ar
 			 *	complex_count++;
 			 *	spin_unlock(sem_perm.lock);
 			 */
-			ipc_smp_acquire__after_spin_is_unlocked();
+			smp_acquire__after_ctrl_dep();
 
 			/*
 			 * Now repeat the test of complex_count:


* [PATCH -v3 4/8] locking, arch: Update spin_unlock_wait()
  2016-05-31  9:41 [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-05-31  9:41 ` [PATCH -v3 3/8] locking: Introduce smp_acquire__after_ctrl_dep Peter Zijlstra
@ 2016-05-31  9:41 ` Peter Zijlstra
  2016-06-01 11:24   ` Will Deacon
  2016-05-31  9:41 ` [PATCH -v3 5/8] locking: Update spin_unlock_wait users Peter Zijlstra
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2016-05-31  9:41 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz, jejb, rth, chris,
	dhowells, schwidefsky, linux, ralf, mpe, vgupta, rkuo,
	james.hogan, realmz6, tony.luck, ysato, cmetcalf

[-- Attachment #1: peterz-locking-spin_unlock_wait.patch --]
[-- Type: text/plain, Size: 13289 bytes --]

This patch updates/fixes all spin_unlock_wait() implementations.

The update is in semantics; where it previously was only a control
dependency, we now upgrade to a full load-acquire to match the
store-release from the spin_unlock() we waited on. This ensures that
when spin_unlock_wait() returns, we're guaranteed to observe the full
critical section we waited on.

This fixes a number of spin_unlock_wait() users that (not
unreasonably) rely on this.

I also fixed a number of ticket lock versions to only wait on the
current lock holder, instead of for a full unlock, as this is
sufficient.

Furthermore, again for ticket locks, I added an smp_rmb() between
the initial ticket load and the spin loop testing the current value,
because I could not convince myself the address dependency is
sufficient, especially if the loads are of different sizes.

I'm more than happy to remove this smp_rmb() again if people are
certain the address dependency does indeed work as expected.
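
Schematically, the updated ticket-lock wait then looks like this (field names
follow the arm hunk below; mips is analogous):

	u16 owner = READ_ONCE(lock->tickets.owner);	/* snapshot the holder */

	smp_rmb();		/* the barrier discussed above */
	for (;;) {
		arch_spinlock_t tmp = READ_ONCE(*lock);

		if (tmp.tickets.owner == tmp.tickets.next ||	/* unlocked */
		    tmp.tickets.owner != owner)			/* new holder */
			break;

		cpu_relax();
	}
	smp_acquire__after_ctrl_dep();	/* upgrade the ctrl dep to ACQUIRE */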

Cc: jejb@parisc-linux.org
Cc: davem@davemloft.net
Cc: rth@twiddle.net
Cc: chris@zankel.net
Cc: dhowells@redhat.com
Cc: schwidefsky@de.ibm.com
Cc: linux@armlinux.org.uk
Cc: ralf@linux-mips.org
Cc: mpe@ellerman.id.au
Cc: vgupta@synopsys.com
Cc: rkuo@codeaurora.org
Cc: james.hogan@imgtec.com
Cc: realmz6@gmail.com
Cc: tony.luck@intel.com
Cc: ysato@users.sourceforge.jp
Cc: cmetcalf@mellanox.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/alpha/include/asm/spinlock.h    |    7 +++++--
 arch/arc/include/asm/spinlock.h      |    7 +++++--
 arch/arm/include/asm/spinlock.h      |   18 ++++++++++++++++--
 arch/blackfin/include/asm/spinlock.h |    3 +--
 arch/hexagon/include/asm/spinlock.h  |    8 ++++++--
 arch/ia64/include/asm/spinlock.h     |    2 ++
 arch/m32r/include/asm/spinlock.h     |    7 +++++--
 arch/metag/include/asm/spinlock.h    |   11 +++++++++--
 arch/mips/include/asm/spinlock.h     |   18 ++++++++++++++++--
 arch/mn10300/include/asm/spinlock.h  |    6 +++++-
 arch/parisc/include/asm/spinlock.h   |    9 +++++++--
 arch/powerpc/include/asm/spinlock.h  |    6 ++++--
 arch/s390/include/asm/spinlock.h     |    1 +
 arch/sh/include/asm/spinlock.h       |    7 +++++--
 arch/sparc/include/asm/spinlock_32.h |    6 ++++--
 arch/sparc/include/asm/spinlock_64.h |    9 ++++++---
 arch/tile/lib/spinlock_32.c          |    6 ++++++
 arch/tile/lib/spinlock_64.c          |    6 ++++++
 arch/xtensa/include/asm/spinlock.h   |    7 +++++--
 include/asm-generic/qspinlock.h      |    3 +--
 include/linux/spinlock_up.h          |    9 ++++++---
 21 files changed, 121 insertions(+), 35 deletions(-)

--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -13,8 +13,11 @@
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 #define arch_spin_is_locked(x)	((x)->lock != 0)
-#define arch_spin_unlock_wait(x) \
-		do { cpu_relax(); } while ((x)->lock)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
 
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -15,8 +15,11 @@
 
 #define arch_spin_is_locked(x)	((x)->slock != __ARCH_SPIN_LOCK_UNLOCKED__)
 #define arch_spin_lock_flags(lock, flags)	arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-	do { while (arch_spin_is_locked(x)) cpu_relax(); } while (0)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, !VAL);
+}
 
 #ifdef CONFIG_ARC_HAS_LLSC
 
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -50,8 +50,22 @@ static inline void dsb_sev(void)
  * memory.
  */
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	u16 owner = READ_ONCE(lock->tickets.owner);
+
+	smp_rmb();
+	for (;;) {
+		arch_spinlock_t tmp = READ_ONCE(*lock);
+
+		if (tmp.tickets.owner == tmp.tickets.next ||
+		    tmp.tickets.owner != owner)
+			break;
+
+		wfe();
+	}
+	smp_acquire__after_ctrl_dep();
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
--- a/arch/blackfin/include/asm/spinlock.h
+++ b/arch/blackfin/include/asm/spinlock.h
@@ -48,8 +48,7 @@ static inline void arch_spin_unlock(arch
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
-	while (arch_spin_is_locked(lock))
-		cpu_relax();
+	smp_cond_load_acquire(&lock->lock, !VAL);
 }
 
 static inline int arch_read_can_lock(arch_rwlock_t *rw)
--- a/arch/hexagon/include/asm/spinlock.h
+++ b/arch/hexagon/include/asm/spinlock.h
@@ -176,8 +176,12 @@ static inline unsigned int arch_spin_try
  * SMP spinlocks are intended to allow only a single CPU at the lock
  */
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(lock) \
-	do {while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
+
 #define arch_spin_is_locked(x) ((x)->lock != 0)
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
--- a/arch/ia64/include/asm/spinlock.h
+++ b/arch/ia64/include/asm/spinlock.h
@@ -86,6 +86,8 @@ static __always_inline void __ticket_spi
 			return;
 		cpu_relax();
 	}
+
+	smp_acquire__after_ctrl_dep();
 }
 
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
--- a/arch/m32r/include/asm/spinlock.h
+++ b/arch/m32r/include/asm/spinlock.h
@@ -27,8 +27,11 @@
 
 #define arch_spin_is_locked(x)		(*(volatile int *)(&(x)->slock) <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-		do { cpu_relax(); } while (arch_spin_is_locked(x))
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, VAL > 0);
+}
 
 /**
  * arch_spin_trylock - Try spin lock and return a result
--- a/arch/metag/include/asm/spinlock.h
+++ b/arch/metag/include/asm/spinlock.h
@@ -7,8 +7,15 @@
 #include <asm/spinlock_lnkget.h>
 #endif
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+/*
+ * both lock1 and lnkget are test-and-set spinlocks with 0 unlocked and 1
+ * locked.
+ */
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -48,8 +48,22 @@ static inline int arch_spin_value_unlock
 }
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-	while (arch_spin_is_locked(x)) { cpu_relax(); }
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	u16 owner = READ_ONCE(lock->h.serving_now);
+	smp_rmb();
+	for (;;) {
+		arch_spinlock_t tmp = READ_ONCE(*lock);
+
+		if (tmp.h.serving_now == tmp.h.ticket ||
+		    tmp.h.serving_now != owner)
+			break;
+
+		cpu_relax();
+	}
+	smp_acquire__after_ctrl_dep();
+}
 
 static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
--- a/arch/mn10300/include/asm/spinlock.h
+++ b/arch/mn10300/include/asm/spinlock.h
@@ -23,7 +23,11 @@
  */
 
 #define arch_spin_is_locked(x)	(*(volatile signed char *)(&(x)->slock) != 0)
-#define arch_spin_unlock_wait(x) do { barrier(); } while (arch_spin_is_locked(x))
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, !VAL);
+}
 
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
--- a/arch/parisc/include/asm/spinlock.h
+++ b/arch/parisc/include/asm/spinlock.h
@@ -13,8 +13,13 @@ static inline int arch_spin_is_locked(ar
 }
 
 #define arch_spin_lock(lock) arch_spin_lock_flags(lock, 0)
-#define arch_spin_unlock_wait(x) \
-		do { cpu_relax(); } while (arch_spin_is_locked(x))
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *x)
+{
+	volatile unsigned int *a = __ldcw_align(x);
+
+	smp_cond_load_acquire(a, VAL);
+}
 
 static inline void arch_spin_lock_flags(arch_spinlock_t *x,
 					 unsigned long flags)
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -165,8 +165,10 @@ static inline void arch_spin_unlock(arch
 #ifdef CONFIG_PPC64
 extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
 #else
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, !VAL);
+}
 #endif
 
 /*
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -97,6 +97,7 @@ static inline void arch_spin_unlock_wait
 {
 	while (arch_spin_is_locked(lock))
 		arch_spin_relax(lock);
+	smp_acquire__after_ctrl_dep();
 }
 
 /*
--- a/arch/sh/include/asm/spinlock.h
+++ b/arch/sh/include/asm/spinlock.h
@@ -25,8 +25,11 @@
 
 #define arch_spin_is_locked(x)		((x)->lock <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-	do { while (arch_spin_is_locked(x)) cpu_relax(); } while (0)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, VAL > 0);
+}
 
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
--- a/arch/sparc/include/asm/spinlock_32.h
+++ b/arch/sparc/include/asm/spinlock_32.h
@@ -13,8 +13,10 @@
 
 #define arch_spin_is_locked(lock) (*((volatile unsigned char *)(lock)) != 0)
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
--- a/arch/sparc/include/asm/spinlock_64.h
+++ b/arch/sparc/include/asm/spinlock_64.h
@@ -8,6 +8,8 @@
 
 #ifndef __ASSEMBLY__
 
+#include <asm/processor.h>
+
 /* To get debugging spinlocks which detect and catch
  * deadlock situations, set CONFIG_DEBUG_SPINLOCK
  * and rebuild your kernel.
@@ -23,9 +25,10 @@
 
 #define arch_spin_is_locked(lp)	((lp)->lock != 0)
 
-#define arch_spin_unlock_wait(lp)	\
-	do {	rmb();			\
-	} while((lp)->lock)
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->lock, !VAL);
+}
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -76,6 +76,12 @@ void arch_spin_unlock_wait(arch_spinlock
 	do {
 		delay_backoff(iterations++);
 	} while (READ_ONCE(lock->current_ticket) == curr);
+
+	/*
+	 * The TILE architecture doesn't do read speculation; therefore
+	 * a control dependency guarantees a LOAD->{LOAD,STORE} order.
+	 */
+	barrier();
 }
 EXPORT_SYMBOL(arch_spin_unlock_wait);
 
--- a/arch/tile/lib/spinlock_64.c
+++ b/arch/tile/lib/spinlock_64.c
@@ -76,6 +76,12 @@ void arch_spin_unlock_wait(arch_spinlock
 	do {
 		delay_backoff(iterations++);
 	} while (arch_spin_current(READ_ONCE(lock->lock)) == curr);
+
+	/*
+	 * The TILE architecture doesn't do read speculation; therefore
+	 * a control dependency guarantees a LOAD->{LOAD,STORE} order.
+	 */
+	barrier();
 }
 EXPORT_SYMBOL(arch_spin_unlock_wait);
 
--- a/arch/xtensa/include/asm/spinlock.h
+++ b/arch/xtensa/include/asm/spinlock.h
@@ -29,8 +29,11 @@
  */
 
 #define arch_spin_is_locked(x) ((x)->slock != 0)
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, !VAL);
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -133,8 +133,7 @@ static inline void queued_spin_unlock_wa
 {
 	/* See queued_spin_is_locked() */
 	smp_mb();
-	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
-		cpu_relax();
+	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
 }
 
 #ifndef virt_spin_lock
--- a/include/linux/spinlock_up.h
+++ b/include/linux/spinlock_up.h
@@ -25,6 +25,11 @@
 #ifdef CONFIG_DEBUG_SPINLOCK
 #define arch_spin_is_locked(x)		((x)->slock == 0)
 
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	smp_cond_load_acquire(&lock->slock, VAL);
+}
+
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
 	lock->slock = 0;
@@ -67,6 +72,7 @@ static inline void arch_spin_unlock(arch
 
 #else /* DEBUG_SPINLOCK */
 #define arch_spin_is_locked(lock)	((void)(lock), 0)
+#define arch_spin_unlock_wait(lock)	do { barrier(); (void)(lock); } while (0)
 /* for sched/core.c and kernel_lock.c: */
 # define arch_spin_lock(lock)		do { barrier(); (void)(lock); } while (0)
 # define arch_spin_lock_flags(lock, flags)	do { barrier(); (void)(lock); } while (0)
@@ -79,7 +85,4 @@ static inline void arch_spin_unlock(arch
 #define arch_read_can_lock(lock)	(((void)(lock), 1))
 #define arch_write_can_lock(lock)	(((void)(lock), 1))
 
-#define arch_spin_unlock_wait(lock) \
-		do { cpu_relax(); } while (arch_spin_is_locked(lock))
-
 #endif /* __LINUX_SPINLOCK_UP_H */


* [PATCH -v3 5/8] locking: Update spin_unlock_wait users
  2016-05-31  9:41 [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits Peter Zijlstra
                   ` (3 preceding siblings ...)
  2016-05-31  9:41 ` [PATCH -v3 4/8] locking, arch: Update spin_unlock_wait() Peter Zijlstra
@ 2016-05-31  9:41 ` Peter Zijlstra
  2016-05-31  9:41 ` [PATCH -v3 6/8] locking,netfilter: Fix nf_conntrack_lock() Peter Zijlstra
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2016-05-31  9:41 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-fix-spin_unlock_wait.patch --]
[-- Type: text/plain, Size: 1434 bytes --]

With the modified semantics of spin_unlock_wait(), a number of
explicit barriers can be removed. Also update the comment for the
do_exit() use case, as it was somewhat stale/obscure.
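
The resulting simplification pattern, taken from the kernel/task_work.c hunk
below:

	/* before: the wait was only a control dependency, so callers added
	 * their own barrier:
	 */
	raw_spin_unlock_wait(&task->pi_lock);
	smp_mb();

	/* after: the wait itself now provides the ACQUIRE */
	raw_spin_unlock_wait(&task->pi_lock);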

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 ipc/sem.c          |    1 -
 kernel/exit.c      |    8 ++++++--
 kernel/task_work.c |    1 -
 3 files changed, 6 insertions(+), 4 deletions(-)

--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -282,7 +282,6 @@ static void sem_wait_array(struct sem_ar
 		sem = sma->sem_base + i;
 		spin_unlock_wait(&sem->lock);
 	}
-	smp_acquire__after_ctrl_dep();
 }
 
 /*
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -776,10 +776,14 @@ void do_exit(long code)
 
 	exit_signals(tsk);  /* sets PF_EXITING */
 	/*
-	 * tsk->flags are checked in the futex code to protect against
-	 * an exiting task cleaning up the robust pi futexes.
+	 * Ensure that all new tsk->pi_lock acquisitions must observe
+	 * PF_EXITING. Serializes against futex.c:attach_to_pi_owner().
 	 */
 	smp_mb();
+	/*
+	 * Ensure that we must observe the pi_state in exit_mm() ->
+	 * mm_release() -> exit_pi_state_list().
+	 */
 	raw_spin_unlock_wait(&tsk->pi_lock);
 
 	if (unlikely(in_atomic())) {
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -108,7 +108,6 @@ void task_work_run(void)
 		 * fail, but it can play with *work and other entries.
 		 */
 		raw_spin_unlock_wait(&task->pi_lock);
-		smp_mb();
 
 		do {
 			next = work->next;


* [PATCH -v3 6/8] locking,netfilter: Fix nf_conntrack_lock()
  2016-05-31  9:41 [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits Peter Zijlstra
                   ` (4 preceding siblings ...)
  2016-05-31  9:41 ` [PATCH -v3 5/8] locking: Update spin_unlock_wait users Peter Zijlstra
@ 2016-05-31  9:41 ` Peter Zijlstra
  2016-05-31  9:41 ` [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h Peter Zijlstra
  2016-05-31  9:41 ` [PATCH -v3 8/8] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep Peter Zijlstra
  7 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2016-05-31  9:41 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-locking-netfilter.patch --]
[-- Type: text/plain, Size: 1723 bytes --]

Even with spin_unlock_wait() fixed, nf_conntrack_lock{,_all}() is
broken, as it misses a number of memory barriers needed to order the
whole global vs. local locks scheme.

Even x86 (and other TSO archs) are affected.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 net/netfilter/nf_conntrack_core.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -83,6 +83,12 @@ void nf_conntrack_lock(spinlock_t *lock)
 	spin_lock(lock);
 	while (unlikely(nf_conntrack_locks_all)) {
 		spin_unlock(lock);
+
+		/* Order the nf_contrack_locks_all load vs the
+		 * spin_unlock_wait() loads below, to ensure locks_all is
+		 * indeed held.
+		 */
+		smp_rmb(); /* spin_lock(locks_all) */
 		spin_unlock_wait(&nf_conntrack_locks_all_lock);
 		spin_lock(lock);
 	}
@@ -128,6 +134,12 @@ static void nf_conntrack_all_lock(void)
 	spin_lock(&nf_conntrack_locks_all_lock);
 	nf_conntrack_locks_all = true;
 
+	/* Order the above store against the spin_unlock_wait() loads
+	 * below, such that if nf_conntrack_lock() observes lock_all
+	 * we must observe lock[] held.
+	 */
+	smp_mb(); /* spin_lock(locks_all) */
+
 	for (i = 0; i < CONNTRACK_LOCKS; i++) {
 		spin_unlock_wait(&nf_conntrack_locks[i]);
 	}
@@ -135,7 +147,11 @@ static void nf_conntrack_all_lock(void)
 
 static void nf_conntrack_all_unlock(void)
 {
-	nf_conntrack_locks_all = false;
+	/* All prior stores must be complete before we clear locks_all.
+	 * Otherwise nf_conntrack_lock() might observe the false but not the
+	 * entire critical section.
+	 */
+	smp_store_release(&nf_conntrack_locks_all, false);
 	spin_unlock(&nf_conntrack_locks_all_lock);
 }
 


* [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-05-31  9:41 [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits Peter Zijlstra
                   ` (5 preceding siblings ...)
  2016-05-31  9:41 ` [PATCH -v3 6/8] locking,netfilter: Fix nf_conntrack_lock() Peter Zijlstra
@ 2016-05-31  9:41 ` Peter Zijlstra
  2016-05-31 20:01   ` Waiman Long
  2016-05-31  9:41 ` [PATCH -v3 8/8] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep Peter Zijlstra
  7 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2016-05-31  9:41 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz

[-- Attachment #1: peterz-asm-generic-barrier.patch --]
[-- Type: text/plain, Size: 10142 bytes --]

Since all asm/barrier.h should/must include asm-generic/barrier.h, the
latter is a good place for generic infrastructure like this.

This also allows archs to override the new
smp_acquire__after_ctrl_dep().
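
An override then amounts to defining the macro in asm/barrier.h before pulling
in the generic header, as the tile patch later in this series does:

	/* arch/<arch>/include/asm/barrier.h */
	#define smp_acquire__after_ctrl_dep()	barrier()

	#include <asm-generic/barrier.h>	/* supplies the remaining fallbacks */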

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/alpha/include/asm/spinlock.h    |    2 +
 arch/arm/include/asm/spinlock.h      |    2 +
 arch/blackfin/include/asm/spinlock.h |    2 +
 arch/hexagon/include/asm/spinlock.h  |    2 +
 arch/ia64/include/asm/spinlock.h     |    2 +
 arch/m32r/include/asm/spinlock.h     |    2 +
 arch/metag/include/asm/spinlock.h    |    3 +
 arch/mips/include/asm/spinlock.h     |    1 
 arch/mn10300/include/asm/spinlock.h  |    2 +
 arch/powerpc/include/asm/spinlock.h  |    2 +
 arch/s390/include/asm/spinlock.h     |    2 +
 arch/sh/include/asm/spinlock.h       |    3 +
 arch/sparc/include/asm/spinlock_32.h |    1 
 arch/sparc/include/asm/spinlock_64.h |    1 
 arch/xtensa/include/asm/spinlock.h   |    3 +
 include/asm-generic/barrier.h        |   58 ++++++++++++++++++++++++++++++++++-
 include/asm-generic/qspinlock.h      |    2 +
 include/linux/compiler.h             |   55 ---------------------------------
 include/linux/spinlock_up.h          |    1 
 19 files changed, 90 insertions(+), 56 deletions(-)

--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -3,6 +3,8 @@
 
 #include <linux/kernel.h>
 #include <asm/current.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -6,6 +6,8 @@
 #endif
 
 #include <linux/prefetch.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /*
  * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
--- a/arch/blackfin/include/asm/spinlock.h
+++ b/arch/blackfin/include/asm/spinlock.h
@@ -12,6 +12,8 @@
 #else
 
 #include <linux/atomic.h>
+#include <asm/processor.h>
+#include <asm/barrier.h>
 
 asmlinkage int __raw_spin_is_locked_asm(volatile int *ptr);
 asmlinkage void __raw_spin_lock_asm(volatile int *ptr);
--- a/arch/hexagon/include/asm/spinlock.h
+++ b/arch/hexagon/include/asm/spinlock.h
@@ -23,6 +23,8 @@
 #define _ASM_SPINLOCK_H
 
 #include <asm/irqflags.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /*
  * This file is pulled in for SMP builds.
--- a/arch/ia64/include/asm/spinlock.h
+++ b/arch/ia64/include/asm/spinlock.h
@@ -15,6 +15,8 @@
 
 #include <linux/atomic.h>
 #include <asm/intrinsics.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 #define arch_spin_lock_init(x)			((x)->lock = 0)
 
--- a/arch/m32r/include/asm/spinlock.h
+++ b/arch/m32r/include/asm/spinlock.h
@@ -13,6 +13,8 @@
 #include <linux/atomic.h>
 #include <asm/dcache_clear.h>
 #include <asm/page.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
--- a/arch/metag/include/asm/spinlock.h
+++ b/arch/metag/include/asm/spinlock.h
@@ -1,6 +1,9 @@
 #ifndef __ASM_SPINLOCK_H
 #define __ASM_SPINLOCK_H
 
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
 #ifdef CONFIG_METAG_ATOMICITY_LOCK1
 #include <asm/spinlock_lock1.h>
 #else
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -12,6 +12,7 @@
 #include <linux/compiler.h>
 
 #include <asm/barrier.h>
+#include <asm/processor.h>
 #include <asm/compiler.h>
 #include <asm/war.h>
 
--- a/arch/mn10300/include/asm/spinlock.h
+++ b/arch/mn10300/include/asm/spinlock.h
@@ -12,6 +12,8 @@
 #define _ASM_SPINLOCK_H
 
 #include <linux/atomic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 #include <asm/rwlock.h>
 #include <asm/page.h>
 
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -27,6 +27,8 @@
 #include <asm/asm-compat.h>
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -10,6 +10,8 @@
 #define __ASM_SPINLOCK_H
 
 #include <linux/smp.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 #define SPINLOCK_LOCKVAL (S390_lowcore.spinlock_lockval)
 
--- a/arch/sh/include/asm/spinlock.h
+++ b/arch/sh/include/asm/spinlock.h
@@ -19,6 +19,9 @@
 #error "Need movli.l/movco.l for spinlocks"
 #endif
 
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  */
--- a/arch/sparc/include/asm/spinlock_32.h
+++ b/arch/sparc/include/asm/spinlock_32.h
@@ -9,6 +9,7 @@
 #ifndef __ASSEMBLY__
 
 #include <asm/psr.h>
+#include <asm/barrier.h>
 #include <asm/processor.h> /* for cpu_relax */
 
 #define arch_spin_is_locked(lock) (*((volatile unsigned char *)(lock)) != 0)
--- a/arch/sparc/include/asm/spinlock_64.h
+++ b/arch/sparc/include/asm/spinlock_64.h
@@ -9,6 +9,7 @@
 #ifndef __ASSEMBLY__
 
 #include <asm/processor.h>
+#include <asm/barrier.h>
 
 /* To get debugging spinlocks which detect and catch
  * deadlock situations, set CONFIG_DEBUG_SPINLOCK
--- a/arch/xtensa/include/asm/spinlock.h
+++ b/arch/xtensa/include/asm/spinlock.h
@@ -11,6 +11,9 @@
 #ifndef _XTENSA_SPINLOCK_H
 #define _XTENSA_SPINLOCK_H
 
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
 /*
  * spinlock
  *
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -194,7 +194,7 @@ do {									\
 })
 #endif
 
-#endif
+#endif	/* CONFIG_SMP */
 
 /* Barriers for virtual machine guests when talking to an SMP host */
 #define virt_mb() __smp_mb()
@@ -207,5 +207,61 @@ do {									\
 #define virt_store_release(p, v) __smp_store_release(p, v)
 #define virt_load_acquire(p) __smp_load_acquire(p)
 
+/**
+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
+ *
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. (load)-ACQUIRE.
+ *
+ * Architectures that do not do load speculation can have this be barrier().
+ */
+#ifndef smp_acquire__after_ctrl_dep
+#define smp_acquire__after_ctrl_dep()		smp_rmb()
+#endif
+
+/**
+ * cmpwait - compare and wait for a variable to change
+ * @ptr: pointer to the variable to wait on
+ * @val: the value it should change from
+ *
+ * A simple constuct that waits for a variable to change from a known
+ * value; some architectures can do this in hardware.
+ */
+#ifndef cmpwait
+#define cmpwait(ptr, val) do {					\
+	typeof (ptr) __ptr = (ptr);				\
+	typeof (val) __val = (val);				\
+	while (READ_ONCE(*__ptr) == __val)			\
+		cpu_relax();					\
+} while (0)
+#endif
+
+/**
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ */
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+	typeof(ptr) __PTR = (ptr);				\
+	typeof(*ptr) VAL;					\
+	for (;;) {						\
+		VAL = READ_ONCE(*__PTR);			\
+		if (cond_expr)					\
+			break;					\
+		cmpwait(__PTR, VAL);				\
+	}							\
+	smp_acquire__after_ctrl_dep();				\
+	VAL;							\
+})
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -20,6 +20,8 @@
 #define __ASM_GENERIC_QSPINLOCK_H
 
 #include <asm-generic/qspinlock_types.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
 
 /**
  * queued_spin_is_locked - is the spinlock locked?
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -304,61 +304,6 @@ static __always_inline void __write_once
 	__u.__val;					\
 })
 
-/**
- * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
- *
- * A control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. (load)-ACQUIRE.
- *
- * Architectures that do not do load speculation can have this be barrier().
- * XXX move into asm/barrier.h
- */
-#define smp_acquire__after_ctrl_dep()		smp_rmb()
-
-/**
- * cmpwait - compare and wait for a variable to change
- * @ptr: pointer to the variable to wait on
- * @val: the value it should change from
- *
- * A simple constuct that waits for a variable to change from a known
- * value; some architectures can do this in hardware.
- */
-#ifndef cmpwait
-#define cmpwait(ptr, val) do {					\
-	typeof (ptr) __ptr = (ptr);				\
-	typeof (val) __val = (val);				\
-	while (READ_ONCE(*__ptr) == __val)			\
-		cpu_relax();					\
-} while (0)
-#endif
-
-/**
- * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
- * @ptr: pointer to the variable to wait on
- * @cond: boolean expression to wait for
- *
- * Equivalent to using smp_load_acquire() on the condition variable but employs
- * the control dependency of the wait to reduce the barrier on many platforms.
- *
- * Due to C lacking lambda expressions we load the value of *ptr into a
- * pre-named variable @VAL to be used in @cond.
- */
-#ifndef smp_cond_load_acquire
-#define smp_cond_load_acquire(ptr, cond_expr) ({		\
-	typeof(ptr) __PTR = (ptr);				\
-	typeof(*ptr) VAL;					\
-	for (;;) {						\
-		VAL = READ_ONCE(*__PTR);			\
-		if (cond_expr)					\
-			break;					\
-		cmpwait(__PTR, VAL);				\
-	}							\
-	smp_acquire__after_ctrl_dep();				\
-	VAL;							\
-})
-#endif
-
 #endif /* __KERNEL__ */
 
 #endif /* __ASSEMBLY__ */
--- a/include/linux/spinlock_up.h
+++ b/include/linux/spinlock_up.h
@@ -6,6 +6,7 @@
 #endif
 
 #include <asm/processor.h>	/* for cpu_relax() */
+#include <asm/barrier.h>
 
 /*
  * include/linux/spinlock_up.h - UP-debug version of spinlocks.


* [PATCH -v3 8/8] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep
  2016-05-31  9:41 [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits Peter Zijlstra
                   ` (6 preceding siblings ...)
  2016-05-31  9:41 ` [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h Peter Zijlstra
@ 2016-05-31  9:41 ` Peter Zijlstra
  2016-05-31 15:32   ` Chris Metcalf
  7 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2016-05-31  9:41 UTC (permalink / raw)
  To: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat, peterz, Chris Metcalf

[-- Attachment #1: peterz-tile-ctrl-dep.patch --]
[-- Type: text/plain, Size: 879 bytes --]

Since TILE doesn't do read speculation, its control dependencies also
guarantee LOAD->LOAD order and we don't need the additional RMB
otherwise required to provide ACQUIRE semantics.

Cc: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/tile/include/asm/barrier.h |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/arch/tile/include/asm/barrier.h
+++ b/arch/tile/include/asm/barrier.h
@@ -87,6 +87,13 @@ mb_incoherent(void)
 #define __smp_mb__after_atomic()	__smp_mb()
 #endif
 
+/*
+ * The TILE architecture does not do speculative reads; this ensures
+ * that a control dependency also orders against loads and already provides
+ * a LOAD->{LOAD,STORE} order and can forgo the additional RMB.
+ */
+#define smp_acquire__after_ctrl_dep()	barrier()
+
 #include <asm-generic/barrier.h>
 
 #endif /* !__ASSEMBLY__ */


* Re: [PATCH -v3 8/8] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep
  2016-05-31  9:41 ` [PATCH -v3 8/8] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep Peter Zijlstra
@ 2016-05-31 15:32   ` Chris Metcalf
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Metcalf @ 2016-05-31 15:32 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel, torvalds, manfred, dave, paulmck,
	will.deacon
  Cc: boqun.feng, Waiman.Long, tj, pablo, kaber, davem, oleg,
	netfilter-devel, sasha.levin, hofrat

On 5/31/2016 5:41 AM, Peter Zijlstra wrote:
> Since TILE doesn't do read speculation, its control dependencies also
> guarantee LOAD->LOAD order and we don't need the additional RMB
> otherwise required to provide ACQUIRE semantics.
>
> Cc: Chris Metcalf<cmetcalf@mellanox.com>
> Signed-off-by: Peter Zijlstra (Intel)<peterz@infradead.org>
> ---
>   arch/tile/include/asm/barrier.h |    7 +++++++
>   1 file changed, 7 insertions(+)

Looks good.

Acked-by: Chris Metcalf <cmetcalf@mellanox.com>

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-05-31  9:41 ` [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h Peter Zijlstra
@ 2016-05-31 20:01   ` Waiman Long
  2016-06-01  9:31     ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2016-05-31 20:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	boqun.feng, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On 05/31/2016 05:41 AM, Peter Zijlstra wrote:
> Since all asm/barrier.h should/must include asm-generic/barrier.h the
> latter is a good place for generic infrastructure like this.
>
> This also allows archs to override the new
> smp_acquire__after_ctrl_dep().
>
> Signed-off-by: Peter Zijlstra (Intel)<peterz@infradead.org>
>
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -194,7 +194,7 @@ do {									\
>   })
>   #endif
>
> -#endif
> +#endif	/* CONFIG_SMP */
>
>   /* Barriers for virtual machine guests when talking to an SMP host */
>   #define virt_mb() __smp_mb()
> @@ -207,5 +207,61 @@ do {									\
>   #define virt_store_release(p, v) __smp_store_release(p, v)
>   #define virt_load_acquire(p) __smp_load_acquire(p)
>
> +/**
> + * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
> + *
> + * A control dependency provides a LOAD->STORE order, the additional RMB
> + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> + * aka. (load)-ACQUIRE.
> + *
> + * Architectures that do not do load speculation can have this be barrier().
> + */
> +#ifndef smp_acquire__after_ctrl_dep
> +#define smp_acquire__after_ctrl_dep()		smp_rmb()
> +#endif
> +
> +/**
> + * cmpwait - compare and wait for a variable to change
> + * @ptr: pointer to the variable to wait on
> + * @val: the value it should change from
> + *
> + * A simple constuct that waits for a variable to change from a known
> + * value; some architectures can do this in hardware.
> + */
> +#ifndef cmpwait
> +#define cmpwait(ptr, val) do {					\
> +	typeof (ptr) __ptr = (ptr);				\
> +	typeof (val) __val = (val);				\
> +	while (READ_ONCE(*__ptr) == __val)			\
> +		cpu_relax();					\
> +} while (0)
> +#endif
> +
> +/**
> + * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
> + * @ptr: pointer to the variable to wait on
> + * @cond: boolean expression to wait for
> + *
> + * Equivalent to using smp_load_acquire() on the condition variable but employs
> + * the control dependency of the wait to reduce the barrier on many platforms.
> + *
> + * Due to C lacking lambda expressions we load the value of *ptr into a
> + * pre-named variable @VAL to be used in @cond.
> + */
> +#ifndef smp_cond_load_acquire
> +#define smp_cond_load_acquire(ptr, cond_expr) ({		\
> +	typeof(ptr) __PTR = (ptr);				\
> +	typeof(*ptr) VAL;					\
> +	for (;;) {						\
> +		VAL = READ_ONCE(*__PTR);			\
> +		if (cond_expr)					\
> +			break;					\
> +		cmpwait(__PTR, VAL);				\
> +	}							\
> +	smp_acquire__after_ctrl_dep();				\
> +	VAL;							\
> +})
> +#endif
>

You are doing two READ_ONCE's in the smp_cond_load_acquire loop. Can we 
change it to do just one READ_ONCE, like

--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -229,12 +229,18 @@ do {
   * value; some architectures can do this in hardware.
   */
  #ifndef cmpwait
-#define cmpwait(ptr, val) do {                                 \
+#define cmpwait(ptr, val) ({                                   \
         typeof (ptr) __ptr = (ptr);                             \
-       typeof (val) __val = (val);                             \
-       while (READ_ONCE(*__ptr) == __val)                      \
+       typeof (val) __old = (val);                             \
+       typeof (val) __new;                                     \
+       for (;;) {                                              \
+               __new = READ_ONCE(*__ptr);                      \
+               if (__new != __old)                             \
+                       break;                                  \
                 cpu_relax();                                    \
-} while (0)
+       }                                                       \
+       __new;                                                  \
+})
  #endif

  /**
@@ -251,12 +257,11 @@ do {
  #ifndef smp_cond_load_acquire
  #define smp_cond_load_acquire(ptr, cond_expr) ({               \
         typeof(ptr) __PTR = (ptr);                              \
-       typeof(*ptr) VAL;                                       \
+       typeof(*ptr) VAL = READ_ONCE(*__PTR);                   \
         for (;;) {                                              \
-               VAL = READ_ONCE(*__PTR);                        \
                 if (cond_expr)                                  \
                         break;                                  \
-               cmpwait(__PTR, VAL);                            \
+               VAL = cmpwait(__PTR, VAL);                      \
         }                                                       \
         smp_acquire__after_ctrl_dep();                          \
         VAL;                                                    \

Cheers,
Longman


* Re: [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-05-31 20:01   ` Waiman Long
@ 2016-06-01  9:31     ` Peter Zijlstra
  2016-06-01 12:00       ` Will Deacon
  2016-06-01 16:53       ` Waiman Long
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2016-06-01  9:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	boqun.feng, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On Tue, May 31, 2016 at 04:01:06PM -0400, Waiman Long wrote:
> You are doing two READ_ONCE's in the smp_cond_load_acquire loop. Can we
> change it to do just one READ_ONCE, like
> 
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -229,12 +229,18 @@ do {
>   * value; some architectures can do this in hardware.
>   */
>  #ifndef cmpwait
> +#define cmpwait(ptr, val) ({                                   \
>         typeof (ptr) __ptr = (ptr);                             \
> +       typeof (val) __old = (val);                             \
> +       typeof (val) __new;                                     \
> +       for (;;) {                                              \
> +               __new = READ_ONCE(*__ptr);                      \
> +               if (__new != __old)                             \
> +                       break;                                  \
>                 cpu_relax();                                    \
> +       }                                                       \
> +       __new;                                                  \
> +})
>  #endif
> 
>  /**
> @@ -251,12 +257,11 @@ do {
>  #ifndef smp_cond_load_acquire
>  #define smp_cond_load_acquire(ptr, cond_expr) ({               \
>         typeof(ptr) __PTR = (ptr);                              \
> +       typeof(*ptr) VAL = READ_ONCE(*__PTR);                   \
>         for (;;) {                                              \
>                 if (cond_expr)                                  \
>                         break;                                  \
> +               VAL = cmpwait(__PTR, VAL);                      \
>         }                                                       \
>         smp_acquire__after_ctrl_dep();                          \
>         VAL;                                                    \

Yes, that generates slightly better code, but now that you made me look
at it, I think we need to kill the cmpwait() in the generic version and
only keep it for arch versions.

/me ponders...

So cmpwait() as implemented here has strict semantics; but arch
implementations as previously proposed have less strict semantics; and
the use here follows that less strict variant.

The difference being that the arch implementations of cmpwait() can have
false positives (ie. return early, without a changed value);
smp_cond_load_acquire() can deal with these false positives, seeing how
it's in a loop and does its own (more specific) comparison.

Exposing cmpwait(), with the documented semantics, means that arch
versions need an additional loop inside to match these strict semantics,
or we need to weaken the cmpwait() semantics, at which point I'm not
entirely sure it's worth keeping as a generic primitive...

Hmm, so if we can find a use for the weaker cmpwait() outside of
smp_cond_load_acquire(), I think we can make a case for keeping it, and
looking at qspinlock.h there are two sites where we can replace
cpu_relax() with it.
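
That is, with the weaker semantics the caller always re-checks; e.g. the first
qspinlock site in the patch below becomes:

	while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
		cmpwait(&lock->val.counter, _Q_PENDING_VAL);	/* may return early */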

Will, since ARM64 seems to want to use this, does the below make sense
to you?

---
 include/asm-generic/barrier.h | 15 ++++++---------
 kernel/locking/qspinlock.c    |  4 ++--
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index be9222b10d17..05feda5c22e6 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -221,20 +221,17 @@ do {									\
 #endif
 
 /**
- * cmpwait - compare and wait for a variable to change
+ * cmpwait - compare and wait for a variable to 'change'
  * @ptr: pointer to the variable to wait on
  * @val: the value it should change from
  *
- * A simple constuct that waits for a variable to change from a known
- * value; some architectures can do this in hardware.
+ * A 'better' cpu_relax(), some architectures can avoid polling and have event
+ * based wakeups on variables. Such constructs allow false positives on the
+ * 'change' and can return early. Therefore this reduces to cpu_relax()
+ * without hardware assist.
  */
 #ifndef cmpwait
-#define cmpwait(ptr, val) do {					\
-	typeof (ptr) __ptr = (ptr);				\
-	typeof (val) __val = (val);				\
-	while (READ_ONCE(*__ptr) == __val)			\
-		cpu_relax();					\
-} while (0)
+#define cmpwait(ptr, val)	cpu_relax()
 #endif
 
 /**
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index e98e5bf679e9..60a811d56406 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -311,7 +311,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 */
 	if (val == _Q_PENDING_VAL) {
 		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
-			cpu_relax();
+			cmpwait(&lock->val.counter, _Q_PENDING_VAL);
 	}
 
 	/*
@@ -481,7 +481,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 */
 	if (!next) {
 		while (!(next = READ_ONCE(node->next)))
-			cpu_relax();
+			cmpwait(&node->next, NULL);
 	}
 
 	arch_mcs_spin_unlock_contended(&next->locked);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 4/8] locking, arch: Update spin_unlock_wait()
  2016-05-31  9:41 ` [PATCH -v3 4/8] locking, arch: Update spin_unlock_wait() Peter Zijlstra
@ 2016-06-01 11:24   ` Will Deacon
  2016-06-01 11:37     ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2016-06-01 11:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, boqun.feng,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, rth, chris, dhowells, schwidefsky,
	linux, ralf, mpe, vgupta, rkuo, james.hogan, realmz6, tony.luck,
	ysato, cmetcalf

Hi Peter,

On Tue, May 31, 2016 at 11:41:38AM +0200, Peter Zijlstra wrote:
> This patch updates/fixes all spin_unlock_wait() implementations.
> 
> The update is in semantics; where it previously was only a control
> dependency, we now upgrade to a full load-acquire to match the
> store-release from the spin_unlock() we waited on. This ensures that
> when spin_unlock_wait() returns, we're guaranteed to observe the full
> critical section we waited on.
> 
> This fixes a number of spin_unlock_wait() users that (not
> unreasonably) rely on this.
> 
> I also fixed a number of ticket lock versions to only wait on the
> current lock holder, instead of for a full unlock, as this is
> sufficient.
> 
> Furthermore; again for ticket locks; I added an smp_rmb() in between
> the initial ticket load and the spin loop testing the current value
> because I could not convince myself the address dependency is
> sufficient, esp. if the loads are of different sizes.
> 
> I'm more than happy to remove this smp_rmb() again if people are
> certain the address dependency does indeed work as expected.

You can remove it for arm, since both the accesses are single-copy
atomic so the read-after-read rules apply.

> --- a/arch/arm/include/asm/spinlock.h
> +++ b/arch/arm/include/asm/spinlock.h
> @@ -50,8 +50,22 @@ static inline void dsb_sev(void)
>   * memory.
>   */
>  
> -#define arch_spin_unlock_wait(lock) \
> -	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
> +static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> +{
> +	u16 owner = READ_ONCE(lock->tickets.owner);
> +
> +	smp_rmb();

(so you can remove this barrier)

> +	for (;;) {
> +		arch_spinlock_t tmp = READ_ONCE(*lock);
> +
> +		if (tmp.tickets.owner == tmp.tickets.next ||
> +		    tmp.tickets.owner != owner)

This is interesting... on arm64, I actually wait until I observe the
lock being free, but here you also break if the owner has changed, on
the assumption that an unlock happened and we just didn't explicitly
see the lock in a free state. Now, what stops the initial read of
owner being speculated by the CPU at the dawn of time, and this loop
consequently returning early because at some point (before we called
arch_spin_unlock_wait) the lock was unlocked?

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 4/8] locking, arch: Update spin_unlock_wait()
  2016-06-01 11:24   ` Will Deacon
@ 2016-06-01 11:37     ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2016-06-01 11:37 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, boqun.feng,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat, jejb, rth, chris, dhowells, schwidefsky,
	linux, ralf, mpe, vgupta, rkuo, james.hogan, realmz6, tony.luck,
	ysato, cmetcalf

On Wed, Jun 01, 2016 at 12:24:32PM +0100, Will Deacon wrote:
> > --- a/arch/arm/include/asm/spinlock.h
> > +++ b/arch/arm/include/asm/spinlock.h
> > @@ -50,8 +50,22 @@ static inline void dsb_sev(void)
> >   * memory.
> >   */
> >  
> > -#define arch_spin_unlock_wait(lock) \
> > -	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
> > +static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> > +{
> > +	u16 owner = READ_ONCE(lock->tickets.owner);
> > +
> > +	smp_rmb();
> 
> (so you can remove this barrier)

*poof* in a cloud of bit smoke it goes...

> > +	for (;;) {
> > +		arch_spinlock_t tmp = READ_ONCE(*lock);
> > +
> > +		if (tmp.tickets.owner == tmp.tickets.next ||
> > +		    tmp.tickets.owner != owner)
> 
> This is interesting... on arm64, I actually wait until I observe the
> lock being free, but here you also break if the owner has changed, on
> the assumption that an unlock happened and we just didn't explicitly
> see the lock in a free state. Now, what stops the initial read of
> owner being speculated by the CPU at the dawn of time, and this loop
> consequently returning early because at some point (before we called
> arch_spin_unlock_wait) the lock was unlocked?

The user needs to be aware; take for instance the scenario explained
here:

  lkml.kernel.org/r/20160526135406.GK3192@twins.programming.kicks-ass.net

or the PF_EXITING spin_unlock_wait in do_exit.

In both cases we only need to wait for any in-flight critical section
that might not have observed our recent change. Any further critical
sections are guaranteed to have observed our change and will behave
accordingly.

Note that other architectures (notably x86) already had the
wait-for-one-ticket-completion semantics.
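
(Sketch of the pattern being referred to, with made-up names, in the
litmus style of memory-barriers.txt:

	CPU0				CPU1
	WRITE_ONCE(flag, 1);		spin_lock(&lock);
	smp_mb();			if (!READ_ONCE(flag))
	spin_unlock_wait(&lock);		do_work();
					spin_unlock(&lock);

The intent is that once spin_unlock_wait() returns, CPU0 either knows
CPU1's critical section observed flag == 1, or that the critical
section has completed and, thanks to the new ACQUIRE semantics,
do_work()'s effects are visible to CPU0.)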

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-06-01  9:31     ` Peter Zijlstra
@ 2016-06-01 12:00       ` Will Deacon
  2016-06-01 12:06         ` Peter Zijlstra
  2016-06-01 16:53       ` Waiman Long
  1 sibling, 1 reply; 24+ messages in thread
From: Will Deacon @ 2016-06-01 12:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, linux-kernel, torvalds, manfred, dave, paulmck,
	boqun.feng, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On Wed, Jun 01, 2016 at 11:31:58AM +0200, Peter Zijlstra wrote:
> On Tue, May 31, 2016 at 04:01:06PM -0400, Waiman Long wrote:
> > You are doing two READ_ONCE's in the smp_cond_load_acquire loop. Can we
> > change it to do just one READ_ONCE, like
> > 
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -229,12 +229,18 @@ do {
> >   * value; some architectures can do this in hardware.
> >   */
> >  #ifndef cmpwait
> > +#define cmpwait(ptr, val) ({                                   \
> >         typeof (ptr) __ptr = (ptr);                             \
> > +       typeof (val) __old = (val);                             \
> > +       typeof (val) __new;                                     \
> > +       for (;;) {                                              \
> > +               __new = READ_ONCE(*__ptr);                      \
> > +               if (__new != __old)                             \
> > +                       break;                                  \
> >                 cpu_relax();                                    \
> > +       }                                                       \
> > +       __new;                                                  \
> > +})
> >  #endif
> > 
> >  /**
> > @@ -251,12 +257,11 @@ do {
> >  #ifndef smp_cond_load_acquire
> >  #define smp_cond_load_acquire(ptr, cond_expr) ({               \
> >         typeof(ptr) __PTR = (ptr);                              \
> > +       typeof(*ptr) VAL = READ_ONCE(*__PTR);                   \
> >         for (;;) {                                              \
> >                 if (cond_expr)                                  \
> >                         break;                                  \
> > +               VAL = cmpwait(__PTR, VAL);                      \
> >         }                                                       \
> >         smp_acquire__after_ctrl_dep();                          \
> >         VAL;                                                    \
> 
> Yes, that generates slightly better code, but now that you made me look
> at it, I think we need to kill the cmpwait() in the generic version and
> only keep it for arch versions.
> 
> /me ponders...
> 
> So cmpwait() as implemented here has strict semantics; but arch
> implementations as previously proposed have less strict semantics; and
> the use here follows that less strict variant.
> 
> The difference being that the arch implementations of cmpwait() can have
> false positives (i.e. return early, without a changed value);
> smp_cond_load_acquire() can deal with these false positives because
> it is in a loop and does its own (more specific) comparison.
> 
> Exposing cmpwait(), with the documented semantics, means that arch
> versions need an additional loop inside to match these strict semantics,
> or we need to weaken the cmpwait() semantics, at which point I'm not
> entirely sure it's worth keeping as a generic primitive...
> 
> Hmm, so if we can find a use for the weaker cmpwait() outside of
> smp_cond_load_acquire() I think we can make a case for keeping it, and
> looking at qspinlock.c there are two sites where we can replace cpu_relax() with
> it.
> 
> Will, since ARM64 seems to want to use this, does the below make sense
> to you?

Not especially -- I was going to override smp_cond_load_acquire anyway
because I want to build it using cmpwait_acquire and get rid of the
smp_acquire__after_ctrl_dep trick, which is likely slower on arm64.

So I'd be happier nuking cmpwait from the generic interfaces and using
smp_cond_load_acquire everywhere, if that's possible.
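
(For illustration, a rough sketch of the kind of arch override being
suggested, assuming a cmpwait_acquire(ptr, val) that waits for *ptr to
differ from val and returns the new value with ACQUIRE ordering; that
name is not defined by this series:)

#define smp_cond_load_acquire(ptr, cond_expr)			\
({								\
	typeof(ptr) __PTR = (ptr);				\
	typeof(*ptr) VAL = smp_load_acquire(__PTR);		\
	while (!(cond_expr))					\
		VAL = cmpwait_acquire(__PTR, VAL);		\
	VAL;							\
})

(Every load of VAL then carries acquire ordering itself, so the
trailing smp_acquire__after_ctrl_dep() is no longer needed.)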

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-06-01 12:00       ` Will Deacon
@ 2016-06-01 12:06         ` Peter Zijlstra
  2016-06-01 12:13           ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2016-06-01 12:06 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, linux-kernel, torvalds, manfred, dave, paulmck,
	boqun.feng, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On Wed, Jun 01, 2016 at 01:00:10PM +0100, Will Deacon wrote:
> On Wed, Jun 01, 2016 at 11:31:58AM +0200, Peter Zijlstra wrote:
> > Will, since ARM64 seems to want to use this, does the below make sense
> > to you?
> 
> Not especially -- I was going to override smp_cond_load_acquire anyway
> because I want to build it using cmpwait_acquire and get rid of the
> smp_acquire__after_ctrl_dep trick, which is likely slower on arm64.
> 
> So I'd be happier nuking cmpwait from the generic interfaces and using
> smp_cond_load_acquire everywhere, if that's possible.

Works for me; but that would lose the use of cmpwait() for
!smp_cond_load_acquire() spins, you fine with that?

The two conversions in the patch were both !acquire spins.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-06-01 12:06         ` Peter Zijlstra
@ 2016-06-01 12:13           ` Will Deacon
  2016-06-01 12:45             ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2016-06-01 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, linux-kernel, torvalds, manfred, dave, paulmck,
	boqun.feng, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On Wed, Jun 01, 2016 at 02:06:54PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 01, 2016 at 01:00:10PM +0100, Will Deacon wrote:
> > On Wed, Jun 01, 2016 at 11:31:58AM +0200, Peter Zijlstra wrote:
> > > Will, since ARM64 seems to want to use this, does the below make sense
> > > to you?
> > 
> > Not especially -- I was going to override smp_cond_load_acquire anyway
> > because I want to build it using cmpwait_acquire and get rid of the
> > smp_acquire__after_ctrl_dep trick, which is likely slower on arm64.
> > 
> > So I'd be happier nuking cmpwait from the generic interfaces and using
> > smp_cond_load_acquire everywhere, if that's possible.
> 
> Works for me; but that would lose the use of cmpwait() for
> !smp_cond_load_acquire() spins, you fine with that?
> 
> The two conversions in the patch were both !acquire spins.

Maybe we could go the whole hog and add smp_cond_load_relaxed?
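
(For reference, a minimal sketch of what such a relaxed variant could
look like; presumably just the generic smp_cond_load_acquire() minus
the trailing acquire barrier:)

#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
	typeof(ptr) __PTR = (ptr);				\
	typeof(*ptr) VAL;					\
	for (;;) {						\
		VAL = READ_ONCE(*__PTR);			\
		if (cond_expr)					\
			break;					\
		cpu_relax();					\
	}							\
	VAL;							\
})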

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-06-01 12:13           ` Will Deacon
@ 2016-06-01 12:45             ` Peter Zijlstra
  2016-06-01 14:07               ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2016-06-01 12:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, linux-kernel, torvalds, manfred, dave, paulmck,
	boqun.feng, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On Wed, Jun 01, 2016 at 01:13:33PM +0100, Will Deacon wrote:
> On Wed, Jun 01, 2016 at 02:06:54PM +0200, Peter Zijlstra wrote:

> > Works for me; but that would lose the use of cmpwait() for
> > !smp_cond_load_acquire() spins, you fine with that?
> > 
> > The two conversions in the patch were both !acquire spins.
> 
> Maybe we could go the whole hog and add smp_cond_load_relaxed?

What about, say, the cmpxchg loops in queued_write_lock_slowpath()?
Would that be something you'd like to use wfe for?

Writing those in smp_cond_load_{acquire,relaxed}() is somewhat possible
but quite ugly.
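
(Sketch only, to illustrate what such a rewrite could look like for the
second loop in queued_write_lock_slowpath(), using the qrwlock field
and constant names from kernel/locking/qrwlock.c; not proposed code:)

	/* When no more readers, set the locked flag */
	do {
		smp_cond_load_acquire(&lock->cnts.counter,
				      VAL == _QW_WAITING);
	} while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
					_QW_LOCKED) != _QW_WAITING);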

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 3/8] locking: Introduce smp_acquire__after_ctrl_dep
  2016-05-31  9:41 ` [PATCH -v3 3/8] locking: Introduce smp_acquire__after_ctrl_dep Peter Zijlstra
@ 2016-06-01 13:52   ` Boqun Feng
  2016-06-01 16:22     ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Boqun Feng @ 2016-06-01 13:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

[-- Attachment #1: Type: text/plain, Size: 1100 bytes --]

On Tue, May 31, 2016 at 11:41:37AM +0200, Peter Zijlstra wrote:
[snip]
> @@ -260,16 +260,6 @@ static void sem_rcu_free(struct rcu_head
>  }
>  
>  /*
> - * spin_unlock_wait() and !spin_is_locked() are not memory barriers, they
> - * are only control barriers.
> - * The code must pair with spin_unlock(&sem->lock) or
> - * spin_unlock(&sem_perm.lock), thus just the control barrier is insufficient.
> - *
> - * smp_rmb() is sufficient, as writes cannot pass the control barrier.
> - */
> -#define ipc_smp_acquire__after_spin_is_unlocked()	smp_rmb()
> -
> -/*
>   * Wait until all currently ongoing simple ops have completed.
>   * Caller must own sem_perm.lock.
>   * New simple ops cannot start, because simple ops first check
> @@ -292,7 +282,7 @@ static void sem_wait_array(struct sem_ar
>  		sem = sma->sem_base + i;
>  		spin_unlock_wait(&sem->lock);
>  	}
> -	ipc_smp_acquire__after_spin_is_unlocked();
> +	smp_acquire__after_ctrl_dep();

I wonder whether we can kill this barrier after updating
spin_unlock_wait() to ACQUIRE?

Regards,
Boqun

>  }
>  
>  /*

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-06-01 12:45             ` Peter Zijlstra
@ 2016-06-01 14:07               ` Will Deacon
  2016-06-01 17:13                 ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2016-06-01 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, linux-kernel, torvalds, manfred, dave, paulmck,
	boqun.feng, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On Wed, Jun 01, 2016 at 02:45:41PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 01, 2016 at 01:13:33PM +0100, Will Deacon wrote:
> > On Wed, Jun 01, 2016 at 02:06:54PM +0200, Peter Zijlstra wrote:
> 
> > > Works for me; but that would lose the use of cmpwait() for
> > > !smp_cond_load_acquire() spins, you fine with that?
> > > 
> > > The two conversions in the patch were both !acquire spins.
> > 
> > Maybe we could go the whole hog and add smp_cond_load_relaxed?
> 
> What about, say, the cmpxchg loops in queued_write_lock_slowpath()?
> Would that be something you'd like to use wfe for?

Without actually running the code on real hardware, it's hard to say
for sure. I notice that those loops are using cpu_relax_lowlatency
at present and we *know* that we're next in the queue (i.e. we're just
waiting for existing readers to drain), so the benefit of wfe is somewhat
questionable here and I don't think we'd want to add that initially.

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 3/8] locking: Introduce smp_acquire__after_ctrl_dep
  2016-06-01 13:52   ` Boqun Feng
@ 2016-06-01 16:22     ` Peter Zijlstra
  2016-06-01 23:19       ` Boqun Feng
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2016-06-01 16:22 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On Wed, Jun 01, 2016 at 09:52:14PM +0800, Boqun Feng wrote:
> On Tue, May 31, 2016 at 11:41:37AM +0200, Peter Zijlstra wrote:

> > @@ -292,7 +282,7 @@ static void sem_wait_array(struct sem_ar
> >  		sem = sma->sem_base + i;
> >  		spin_unlock_wait(&sem->lock);
> >  	}
> > -	ipc_smp_acquire__after_spin_is_unlocked();
> > +	smp_acquire__after_ctrl_dep();
> 
> I wonder whether we can kill this barrier after updating
> spin_unlock_wait() to ACQUIRE?

See patch 5 doing that :-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-06-01  9:31     ` Peter Zijlstra
  2016-06-01 12:00       ` Will Deacon
@ 2016-06-01 16:53       ` Waiman Long
  1 sibling, 0 replies; 24+ messages in thread
From: Waiman Long @ 2016-06-01 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	boqun.feng, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On 06/01/2016 05:31 AM, Peter Zijlstra wrote:
> On Tue, May 31, 2016 at 04:01:06PM -0400, Waiman Long wrote:
>> You are doing two READ_ONCE's in the smp_cond_load_acquire loop. Can we
>> change it to do just one READ_ONCE, like
>>
>> --- a/include/asm-generic/barrier.h
>> +++ b/include/asm-generic/barrier.h
>> @@ -229,12 +229,18 @@ do {
>>    * value; some architectures can do this in hardware.
>>    */
>>   #ifndef cmpwait
>> +#define cmpwait(ptr, val) ({                                   \
>>          typeof (ptr) __ptr = (ptr);                             \
>> +       typeof (val) __old = (val);                             \
>> +       typeof (val) __new;                                     \
>> +       for (;;) {                                              \
>> +               __new = READ_ONCE(*__ptr);                      \
>> +               if (__new != __old)                             \
>> +                       break;                                  \
>>                  cpu_relax();                                    \
>> +       }                                                       \
>> +       __new;                                                  \
>> +})
>>   #endif
>>
>>   /**
>> @@ -251,12 +257,11 @@ do {
>>   #ifndef smp_cond_load_acquire
>>   #define smp_cond_load_acquire(ptr, cond_expr) ({               \
>>          typeof(ptr) __PTR = (ptr);                              \
>> +       typeof(*ptr) VAL = READ_ONCE(*__PTR);                   \
>>          for (;;) {                                              \
>>                  if (cond_expr)                                  \
>>                          break;                                  \
>> +               VAL = cmpwait(__PTR, VAL);                      \
>>          }                                                       \
>>          smp_acquire__after_ctrl_dep();                          \
>>          VAL;                                                    \
> Yes, that generates slightly better code, but now that you made me look
> at it, I think we need to kill the cmpwait() in the generic version and
> only keep it for arch versions.
>
> /me ponders...
>
> So cmpwait() as implemented here has strict semantics; but arch
> implementations as previously proposed have less strict semantics; and
> the use here follows that less strict variant.
>
> The difference being that the arch implementations of cmpwait() can have
> false positives (i.e. return early, without a changed value);
> smp_cond_load_acquire() can deal with these false positives because
> it is in a loop and does its own (more specific) comparison.
>
> Exposing cmpwait(), with the documented semantics, means that arch
> versions need an additional loop inside to match these strict semantics,
> or we need to weaken the cmpwait() semantics, at which point I'm not
> entirely sure it's worth keeping as a generic primitive...
>
> Hmm, so if we can find a use for the weaker cmpwait() outside of
> smp_cond_load_acquire() I think we can make a case for keeping it, and
> looking at qspinlock.c there are two sites where we can replace cpu_relax() with
> it.
>
> Will, since ARM64 seems to want to use this, does the below make sense
> to you?
>
> ---
>   include/asm-generic/barrier.h | 15 ++++++---------
>   kernel/locking/qspinlock.c    |  4 ++--
>   2 files changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index be9222b10d17..05feda5c22e6 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -221,20 +221,17 @@ do {									\
>   #endif
>
>   /**
> - * cmpwait - compare and wait for a variable to change
> + * cmpwait - compare and wait for a variable to 'change'
>    * @ptr: pointer to the variable to wait on
>    * @val: the value it should change from
>    *
> - * A simple constuct that waits for a variable to change from a known
> - * value; some architectures can do this in hardware.
> + * A 'better' cpu_relax(), some architectures can avoid polling and have event
> + * based wakeups on variables. Such constructs allow false positives on the
> + * 'change' and can return early. Therefore this reduces to cpu_relax()
> + * without hardware assist.
>    */
>   #ifndef cmpwait
> -#define cmpwait(ptr, val) do {					\
> -	typeof (ptr) __ptr = (ptr);				\
> -	typeof (val) __val = (val);				\
> -	while (READ_ONCE(*__ptr) == __val)			\
> -		cpu_relax();					\
> -} while (0)
> +#define cmpwait(ptr, val)	cpu_relax()
>   #endif
>
>   /**
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index e98e5bf679e9..60a811d56406 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -311,7 +311,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   	 */
>   	if (val == _Q_PENDING_VAL) {
>   		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
> -			cpu_relax();
> +			cmpwait(&lock->val.counter, _Q_PENDING_VAL);
>   	}
>
>   	/*
> @@ -481,7 +481,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   	 */
>   	if (!next) {
>   		while (!(next = READ_ONCE(node->next)))
> -			cpu_relax();
> +			cmpwait(&node->next, NULL);
>   	}
>
>   	arch_mcs_spin_unlock_contended(&next->locked);

I think it is a good idea to consider cmpwait as a fancier version of 
cpu_relax(). It can certainly get used in a lot more places.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h
  2016-06-01 14:07               ` Will Deacon
@ 2016-06-01 17:13                 ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2016-06-01 17:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, linux-kernel, torvalds, manfred, dave, paulmck,
	boqun.feng, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

On Wed, Jun 01, 2016 at 03:07:14PM +0100, Will Deacon wrote:
> On Wed, Jun 01, 2016 at 02:45:41PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 01, 2016 at 01:13:33PM +0100, Will Deacon wrote:
> > > On Wed, Jun 01, 2016 at 02:06:54PM +0200, Peter Zijlstra wrote:
> > 
> > > > Works for me; but that would lose the use of cmpwait() for
> > > > !smp_cond_load_acquire() spins, you fine with that?
> > > > 
> > > > The two conversions in the patch were both !acquire spins.
> > > 
> > > Maybe we could go the whole hog and add smp_cond_load_relaxed?
> > 
> > What about, say, the cmpxchg loops in queued_write_lock_slowpath()?
> > Would that be something you'd like to use wfe for?
> 
> Without actually running the code on real hardware, it's hard to say
> for sure. I notice that those loops are using cpu_relax_lowlatency
> at present and we *know* that we're next in the queue (i.e. we're just
> waiting for existing readers to drain), so the benefit of wfe is somewhat
> questionable here and I don't think we'd want to add that initially.

OK, we can always change our minds anyway. I'll respin/fold/massage
the series to make it go away.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -v3 3/8] locking: Introduce smp_acquire__after_ctrl_dep
  2016-06-01 16:22     ` Peter Zijlstra
@ 2016-06-01 23:19       ` Boqun Feng
  0 siblings, 0 replies; 24+ messages in thread
From: Boqun Feng @ 2016-06-01 23:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, torvalds, manfred, dave, paulmck, will.deacon,
	Waiman.Long, tj, pablo, kaber, davem, oleg, netfilter-devel,
	sasha.levin, hofrat

[-- Attachment #1: Type: text/plain, Size: 623 bytes --]

On Wed, Jun 01, 2016 at 06:22:55PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 01, 2016 at 09:52:14PM +0800, Boqun Feng wrote:
> > On Tue, May 31, 2016 at 11:41:37AM +0200, Peter Zijlstra wrote:
> 
> > > @@ -292,7 +282,7 @@ static void sem_wait_array(struct sem_ar
> > >  		sem = sma->sem_base + i;
> > >  		spin_unlock_wait(&sem->lock);
> > >  	}
> > > -	ipc_smp_acquire__after_spin_is_unlocked();
> > > +	smp_acquire__after_ctrl_dep();
> > 
> > I wonder whether we can kill this barrier after updating
> > spin_unlock_wait() to ACQUIRE?
> 
> See patch 5 doing that :-)

Oops, right ;-)

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2016-06-01 23:16 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-31  9:41 [PATCH -v3 0/8] spin_unlock_wait borkage and assorted bits Peter Zijlstra
2016-05-31  9:41 ` [PATCH -v3 1/8] locking: Replace smp_cond_acquire with smp_cond_load_acquire Peter Zijlstra
2016-05-31  9:41 ` [PATCH -v3 2/8] locking: Introduce cmpwait() Peter Zijlstra
2016-05-31  9:41 ` [PATCH -v3 3/8] locking: Introduce smp_acquire__after_ctrl_dep Peter Zijlstra
2016-06-01 13:52   ` Boqun Feng
2016-06-01 16:22     ` Peter Zijlstra
2016-06-01 23:19       ` Boqun Feng
2016-05-31  9:41 ` [PATCH -v3 4/8] locking, arch: Update spin_unlock_wait() Peter Zijlstra
2016-06-01 11:24   ` Will Deacon
2016-06-01 11:37     ` Peter Zijlstra
2016-05-31  9:41 ` [PATCH -v3 5/8] locking: Update spin_unlock_wait users Peter Zijlstra
2016-05-31  9:41 ` [PATCH -v3 6/8] locking,netfilter: Fix nf_conntrack_lock() Peter Zijlstra
2016-05-31  9:41 ` [PATCH -v3 7/8] locking: Move smp_cond_load_acquire() and friends into asm-generic/barrier.h Peter Zijlstra
2016-05-31 20:01   ` Waiman Long
2016-06-01  9:31     ` Peter Zijlstra
2016-06-01 12:00       ` Will Deacon
2016-06-01 12:06         ` Peter Zijlstra
2016-06-01 12:13           ` Will Deacon
2016-06-01 12:45             ` Peter Zijlstra
2016-06-01 14:07               ` Will Deacon
2016-06-01 17:13                 ` Peter Zijlstra
2016-06-01 16:53       ` Waiman Long
2016-05-31  9:41 ` [PATCH -v3 8/8] locking, tile: Provide TILE specific smp_acquire__after_ctrl_dep Peter Zijlstra
2016-05-31 15:32   ` Chris Metcalf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).