From: Peter Zijlstra <peterz@infradead.org>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Will Deacon <will.deacon@arm.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Michael Ellerman <mpe@ellerman.id.au>
Cc: linux-kernel@vger.kernel.org, Nicholas Piggin <npiggin@gmail.com>,
	Ingo Molnar <mingo@kernel.org>,
	Alan Stern <stern@rowland.harvard.edu>
Subject: Question on smp_mb__before_spinlock
Date: Mon, 5 Sep 2016 11:37:53 +0200
Message-ID: <20160905093753.GN10138@twins.programming.kicks-ass.net>

Hi all,

So recently I've had two separate issues that touched upon
smp_mb__before_spinlock().


Since the inception of smp_mb__before_spinlock(), our understanding of
ACQUIRE, especially as applied to spinlocks, has changed somewhat. I also
wonder if, with a simple change, we cannot make it provide more.

The problem with its comment (above smp_mb__before_spinlock() in
include/linux/spinlock.h) is that the STORE done by spin_lock() isn't
itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
it and cross with any prior STORE, rendering the default smp_wmb()
insufficient (as pointed out by Alan).

Now, this is only really a problem on PowerPC and ARM64; the former
already defines smp_mb__before_spinlock() as smp_mb(), the latter does
not. Will, is that right?

The second issue I wondered about is spinlock transitivity. All
architectures except powerpc have RCsc locks, and since Power already
does a full smp_mb(), would it not make sense to put it _after_ the
spin_lock()? That would provide the same guarantee, but also upgrade
the critical section to RCsc.

In turn, that would make all schedule() calls fully transitive against
one another.
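
Roughly, on Power that amounts to moving the full barrier to the inside
of the critical section (a sketch only, using the rq lock as in the
patch below):

	today:					proposed:

	smp_mb();				raw_spin_lock(&rq->lock);
	raw_spin_lock(&rq->lock);		smp_mb__after_spinlock();
	/* critical section */			/* critical section */

The STORE->LOAD guarantee is kept, because a full barrier after the
lock still orders any STORE issued before spin_lock() against LOADs
inside the section; and because every lock holder now executes an
smp_mb() inside its critical section, consecutive critical sections
get chained into a single global order, i.e. RCsc.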


That is, would something like the below make sense?

(This does not yet deal with mm_types.h and overlayfs, which also use
smp_mb__before_spinlock().)

---
 arch/arm64/include/asm/barrier.h   |  2 ++
 arch/powerpc/include/asm/barrier.h |  2 +-
 include/linux/spinlock.h           | 41 +++++++++++++++++++++++++++++---------
 kernel/sched/core.c                |  5 +++--
 4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 4eea7f618dce..d5cc8b58f942 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -104,6 +104,8 @@ do {									\
 	VAL;								\
 })
 
+#define smp_mb__after_spinlock()	smp_mb()
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index c0deafc212b8..23d64d7196b7 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -74,7 +74,7 @@ do {									\
 	___p1;								\
 })
 
-#define smp_mb__before_spinlock()   smp_mb()
+#define smp_mb__after_spinlock()   smp_mb()
 
 #include <asm-generic/barrier.h>
 
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 47dd0cebd204..284616dad607 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -118,16 +118,39 @@ do {								\
 #endif
 
 /*
- * Despite its name it doesn't necessarily has to be a full barrier.
- * It should only guarantee that a STORE before the critical section
- * can not be reordered with LOADs and STOREs inside this section.
- * spin_lock() is the one-way barrier, this LOAD can not escape out
- * of the region. So the default implementation simply ensures that
- * a STORE can not move into the critical section, smp_wmb() should
- * serialize it with another STORE done by spin_lock().
+ * This barrier must provide two things:
+ *
+ *   - it must guarantee a STORE before the spin_lock() is ordered against a
+ *     LOAD after it, see the comments at its two usage sites.
+ *
+ *   - it must ensure the critical section is RCsc.
+ *
+ * The latter is important for cases where we observe values written by other
+ * CPUs in spin-loops, without barriers, while being subject to scheduling.
+ *
+ * CPU0			CPU1			CPU2
+ *
+ * 			for (;;) {
+ * 			  if (READ_ONCE(X))
+ * 			  	break;
+ * 			}
+ * X=1
+ * 			<sched-out>
+ * 						<sched-in>
+ * 						r = X;
+ *
+ * Without transitivity it could be that CPU1 observes X!=0 and breaks the
+ * loop, we get migrated, and CPU2 still sees X==0.
+ *
+ * Since most load-store architectures implement ACQUIRE with an smp_mb() after
+ * the LL/SC loop, they need no further barriers. Similarly all our TSO
+ * architectures imply an smp_mb() for each atomic instruction and equally don't
+ * need more.
+ *
+ * Architectures that can implement ACQUIRE better need to take care.
  */
-#ifndef smp_mb__before_spinlock
-#define smp_mb__before_spinlock()	smp_wmb()
+#ifndef smp_mb__after_spinlock
+#define smp_mb__after_spinlock()	do { } while (0)
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 556cb07ab1cf..b151a33d393b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2006,8 +2006,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * reordered with p->state check below. This pairs with mb() in
 	 * set_current_state() the waiting thread does.
 	 */
-	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	smp_mb__after_spinlock();
 	if (!(p->state & state))
 		goto out;
 
@@ -3332,8 +3332,9 @@ static void __sched notrace __schedule(bool preempt)
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
 	 * done by the caller to avoid the race with signal_wake_up().
 	 */
-	smp_mb__before_spinlock();
 	raw_spin_lock(&rq->lock);
+	smp_mb__after_spinlock();
+
 	cookie = lockdep_pin_lock(&rq->lock);
 
 	rq->clock_skip_update <<= 1; /* promote REQ to ACT */

Thread overview: 18+ messages
2016-09-05  9:37 Peter Zijlstra [this message]
2016-09-05  9:56 ` Question on smp_mb__before_spinlock kbuild test robot
2016-09-05 10:19   ` Peter Zijlstra
2016-09-05 11:26     ` Fengguang Wu
2016-09-05 10:10 ` Will Deacon
2016-09-06 11:17   ` Peter Zijlstra
2016-09-06 17:42     ` Will Deacon
2016-09-05 10:37 ` Paul E. McKenney
2016-09-05 11:34   ` Peter Zijlstra
2016-09-05 13:57     ` Paul E. McKenney
2016-09-05 10:51 ` kbuild test robot
2016-09-07 12:17 ` Nicholas Piggin
2016-09-07 13:23   ` Peter Zijlstra
2016-09-07 13:51     ` Will Deacon
2016-09-12  2:35       ` Nicholas Piggin
2016-09-12  2:27     ` Nicholas Piggin
2016-09-12 12:54       ` Peter Zijlstra
2016-09-13  2:05         ` Nicholas Piggin
