From: Peter Zijlstra <peterz@infradead.org>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Will Deacon <will.deacon@arm.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Michael Ellerman <mpe@ellerman.id.au>
Cc: linux-kernel@vger.kernel.org, Nicholas Piggin <npiggin@gmail.com>,
	Ingo Molnar <mingo@kernel.org>,
	Alan Stern <stern@rowland.harvard.edu>
Subject: Question on smp_mb__before_spinlock
Date: Mon, 5 Sep 2016 11:37:53 +0200
Message-ID: <20160905093753.GN10138@twins.programming.kicks-ass.net>

Hi all,

So recently I've had two separate issues that touched upon
smp_mb__before_spinlock().


Since the inception of smp_mb__before_spinlock(), our understanding of
ACQUIRE, especially as applied to spinlocks, has changed somewhat. I also
wonder if, with a simple change, we cannot make it provide more.

The problem with its comment (above smp_mb__before_spinlock() in
include/linux/spinlock.h) is that the STORE done by spin_lock() isn't
itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
it and cross with any prior STORE, rendering the default smp_wmb()
insufficient (as pointed out by Alan).

Now, this is only really a problem on PowerPC and ARM64; the former
already defines smp_mb__before_spinlock() as smp_mb(), the latter does
not. Will, is that right?

The second issue I wondered about is spinlock transitivity. All
architectures except powerpc have RCsc locks, and since Power already
does a full smp_mb(), would it not make sense to put it _after_ the
spin_lock()? That would provide the same guarantee, but also upgrade
the critical section to RCsc.

In turn, that would make all schedule() calls fully transitive against
one another.
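
Roughly, on Power that amounts to moving the full barrier to the inside
of the critical section (a sketch only, using the rq lock as in the
patch below):

	today:					proposed:

	smp_mb();				raw_spin_lock(&rq->lock);
	raw_spin_lock(&rq->lock);		smp_mb__after_spinlock();
	/* critical section */			/* critical section */

The STORE->LOAD guarantee is kept, because a full barrier after the
lock still orders any STORE issued before spin_lock() against LOADs
inside the section; and because every lock holder now executes an
smp_mb() inside its critical section, consecutive critical sections
get chained into a single global order, i.e. RCsc.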


That is, would something like the below make sense?

(This does not yet deal with mm_types.h and overlayfs, which also use
smp_mb__before_spinlock().)

---
 arch/arm64/include/asm/barrier.h   |  2 ++
 arch/powerpc/include/asm/barrier.h |  2 +-
 include/linux/spinlock.h           | 41 +++++++++++++++++++++++++++++---------
 kernel/sched/core.c                |  5 +++--
 4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 4eea7f618dce..d5cc8b58f942 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -104,6 +104,8 @@ do {									\
 	VAL;								\
 })
 
+#define smp_mb__after_spinlock()	smp_mb()
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index c0deafc212b8..23d64d7196b7 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -74,7 +74,7 @@ do {									\
 	___p1;								\
 })
 
-#define smp_mb__before_spinlock()   smp_mb()
+#define smp_mb__after_spinlock()   smp_mb()
 
 #include <asm-generic/barrier.h>
 
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 47dd0cebd204..284616dad607 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -118,16 +118,39 @@ do {								\
 #endif
 
 /*
- * Despite its name it doesn't necessarily has to be a full barrier.
- * It should only guarantee that a STORE before the critical section
- * can not be reordered with LOADs and STOREs inside this section.
- * spin_lock() is the one-way barrier, this LOAD can not escape out
- * of the region. So the default implementation simply ensures that
- * a STORE can not move into the critical section, smp_wmb() should
- * serialize it with another STORE done by spin_lock().
+ * This barrier must provide two things:
+ *
+ *   - it must guarantee a STORE before the spin_lock() is ordered against a
+ *     LOAD after it, see the comments at its two usage sites.
+ *
+ *   - it must ensure the critical section is RCsc.
+ *
+ * The latter is important for cases where we observe values written by other
+ * CPUs in spin-loops, without barriers, while being subject to scheduling.
+ *
+ * CPU0			CPU1			CPU2
+ *
+ * 			for (;;) {
+ * 			  if (READ_ONCE(X))
+ * 			  	break;
+ * 			}
+ * X=1
+ * 			<sched-out>
+ * 						<sched-in>
+ * 						r = X;
+ *
+ * Without transitivity it could be that CPU1 observes X!=0 and breaks the
+ * loop, we get migrated, and CPU2 still sees X==0.
+ *
+ * Since most load-store architectures implement ACQUIRE with an smp_mb() after
+ * the LL/SC loop, they need no further barriers. Similarly all our TSO
+ * architectures imply an smp_mb() for each atomic instruction and equally don't
+ * need more.
+ *
+ * Architectures that can implement ACQUIRE better need to take care.
  */
-#ifndef smp_mb__before_spinlock
-#define smp_mb__before_spinlock()	smp_wmb()
+#ifndef smp_mb__after_spinlock
+#define smp_mb__after_spinlock()	do { } while (0)
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 556cb07ab1cf..b151a33d393b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2006,8 +2006,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * reordered with p->state check below. This pairs with mb() in
 	 * set_current_state() the waiting thread does.
 	 */
-	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	smp_mb__after_spinlock();
 	if (!(p->state & state))
 		goto out;
 
@@ -3332,8 +3332,9 @@ static void __sched notrace __schedule(bool preempt)
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
 	 * done by the caller to avoid the race with signal_wake_up().
 	 */
-	smp_mb__before_spinlock();
 	raw_spin_lock(&rq->lock);
+	smp_mb__after_spinlock();
+
 	cookie = lockdep_pin_lock(&rq->lock);
 
 	rq->clock_skip_update <<= 1; /* promote REQ to ACT */

Thread overview: 18+ messages
2016-09-05  9:37 Peter Zijlstra [this message]
2016-09-05  9:56 ` Question on smp_mb__before_spinlock kbuild test robot
2016-09-05 10:19   ` Peter Zijlstra
2016-09-05 11:26     ` Fengguang Wu
2016-09-05 10:10 ` Will Deacon
2016-09-06 11:17   ` Peter Zijlstra
2016-09-06 17:42     ` Will Deacon
2016-09-05 10:37 ` Paul E. McKenney
2016-09-05 11:34   ` Peter Zijlstra
2016-09-05 13:57     ` Paul E. McKenney
2016-09-05 10:51 ` kbuild test robot
2016-09-07 12:17 ` Nicholas Piggin
2016-09-07 13:23   ` Peter Zijlstra
2016-09-07 13:51     ` Will Deacon
2016-09-12  2:35       ` Nicholas Piggin
2016-09-12  2:27     ` Nicholas Piggin
2016-09-12 12:54       ` Peter Zijlstra
2016-09-13  2:05         ` Nicholas Piggin
