linux-kernel.vger.kernel.org archive mirror
* [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
@ 2018-06-25  9:17 Andrea Parri
  2018-06-25  9:50 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Andrea Parri @ 2018-06-25  9:17 UTC (permalink / raw)
  To: linux-kernel, linux-doc
  Cc: Andrea Parri, Alan Stern, Will Deacon, Peter Zijlstra,
	Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave,
	Luc Maranget, Paul E. McKenney, Akira Yokosawa, Daniel Lustig,
	Jonathan Corbet, Ingo Molnar, Randy Dunlap

Both the implementation and the users' expectation [1] for the various
wakeup primitives have evolved over time, but the documentation has not
kept up with these changes: brings it into 2018.

[1] http://lkml.kernel.org/r/20180424091510.GB4064@hirez.programming.kicks-ass.net

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
[ aparri: Apply feedback from Alan Stern. ]
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jade Alglave <j.alglave@ucl.ac.uk>
Cc: Luc Maranget <luc.maranget@inria.fr>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Akira Yokosawa <akiyks@gmail.com>
Cc: Daniel Lustig <dlustig@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
---
 Documentation/memory-barriers.txt | 43 ++++++++++++++++++++++++---------------
 kernel/sched/completion.c         |  8 ++++----
 kernel/sched/core.c               | 11 +++++-----
 kernel/sched/wait.c               | 24 ++++++++++------------
 4 files changed, 47 insertions(+), 39 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index a02d6bbfc9d0a..bf58fa1671b62 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2179,32 +2179,41 @@ or:
 	event_indicated = 1;
 	wake_up_process(event_daemon);
 
-A write memory barrier is implied by wake_up() and co.  if and only if they
-wake something up.  The barrier occurs before the task state is cleared, and so
-sits between the STORE to indicate the event and the STORE to set TASK_RUNNING:
+A general memory barrier is executed by wake_up() if it wakes something up.
+If it doesn't wake anything up then a memory barrier may or may not be
+executed; you must not rely on it.  The barrier occurs before the task state
+is accessed, in part., it sits between the STORE to indicate the event and
+the STORE to set TASK_RUNNING:
 
-	CPU 1				CPU 2
+	CPU 1 (Sleeper)			CPU 2 (Waker)
 	===============================	===============================
 	set_current_state();		STORE event_indicated
 	  smp_store_mb();		wake_up();
-	    STORE current->state	  <write barrier>
-	    <general barrier>		  STORE current->state
-	LOAD event_indicated
+	    STORE current->state	  ...
+	    <general barrier>		  <general barrier>
+	LOAD event_indicated		  if ((LOAD task->state) & TASK_NORMAL)
+					    STORE task->state
 
-To repeat, this write memory barrier is present if and only if something
-is actually awakened.  To see this, consider the following sequence of
-events, where X and Y are both initially zero:
+where "task" is the thread being woken up and it equals CPU 1's current.
+
+To repeat, a general memory barrier is guaranteed to be executed by wake_up()
+if something is actually awakened, but otherwise there is no such guarantee.
+To see this, consider the following sequence of events, where X and Y are both
+initially zero:
 
 	CPU 1				CPU 2
 	===============================	===============================
-	X = 1;				STORE event_indicated
+	X = 1;				Y = 1;
 	smp_mb();			wake_up();
-	Y = 1;				wait_event(wq, Y == 1);
-	wake_up();			  load from Y sees 1, no memory barrier
-					load from X might see 0
+	LOAD Y				LOAD X
+
+If a wakeup does occur, one (at least) of the two loads must see 1.  If, on
+the other hand, a wakeup does not occur, both loads might see 0.
 
-In contrast, if a wakeup does occur, CPU 2's load from X would be guaranteed
-to see 1.
+wake_up_process() always executes a general memory barrier.  The barrier again
+occurs before the task state is accessed.  In particular, if the wake_up() in
+the previous snippet were replaced by a call to wake_up_process() then one of
+the two loads would be guaranteed to see 1.
 
 The available waker functions include:
 
@@ -2224,6 +2233,8 @@ The available waker functions include:
 	wake_up_poll();
 	wake_up_process();
 
+In terms of memory ordering, these functions all provide the same guarantees of
+a wake_up() (or stronger).
 
 [!] Note that the memory barriers implied by the sleeper and the waker do _not_
 order multiple stores before the wake-up with respect to loads of those stored
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index e426b0cb9ac63..a1ad5b7d5521b 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -22,8 +22,8 @@
  *
  * See also complete_all(), wait_for_completion() and related routines.
  *
- * It may be assumed that this function implies a write memory barrier before
- * changing the task state if and only if any tasks are woken up.
+ * If this function wakes up a task, it executes a full memory barrier before
+ * accessing the task state.
  */
 void complete(struct completion *x)
 {
@@ -44,8 +44,8 @@ EXPORT_SYMBOL(complete);
  *
  * This will wake up all threads waiting on this particular completion event.
  *
- * It may be assumed that this function implies a write memory barrier before
- * changing the task state if and only if any tasks are woken up.
+ * If this function wakes up a task, it executes a full memory barrier before
+ * accessing the task state.
  *
  * Since complete_all() sets the completion of @x permanently to done
  * to allow multiple waiters to finish, a call to reinit_completion()
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfd49a932bdb6..4718da10ccb6c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -413,8 +413,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	 * its already queued (either by us or someone else) and will get the
 	 * wakeup due to that.
 	 *
-	 * This cmpxchg() implies a full barrier, which pairs with the write
-	 * barrier implied by the wakeup in wake_up_q().
+	 * This cmpxchg() executes a full barrier, which pairs with the full
+	 * barrier executed in the wakeup in wake_up_q().
 	 */
 	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
 		return;
@@ -442,8 +442,8 @@ void wake_up_q(struct wake_q_head *head)
 		task->wake_q.next = NULL;
 
 		/*
-		 * wake_up_process() implies a wmb() to pair with the queueing
-		 * in wake_q_add() so as not to miss wakeups.
+		 * wake_up_process() executes a full barrier, which pairs with
+		 * the queueing in wake_q_add() so as not to miss wakeups.
 		 */
 		wake_up_process(task);
 		put_task_struct(task);
@@ -2141,8 +2141,7 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
  *
  * Return: 1 if the process was woken up, 0 if it was already running.
  *
- * It may be assumed that this function implies a write memory barrier before
- * changing the task state if and only if any tasks are woken up.
+ * This function executes a full memory barrier before accessing the task state.
  */
 int wake_up_process(struct task_struct *p)
 {
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 928be527477eb..eaafc58543592 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -134,8 +134,8 @@ static void __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int
  * @nr_exclusive: how many wake-one or wake-many threads to wake up
  * @key: is directly passed to the wakeup function
  *
- * It may be assumed that this function implies a write memory barrier before
- * changing the task state if and only if any tasks are woken up.
+ * If this function wakes up a task, it executes a full memory barrier before
+ * accessing the task state.
  */
 void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -180,8 +180,8 @@ EXPORT_SYMBOL_GPL(__wake_up_locked_key_bookmark);
  *
  * On UP it can prevent extra preemption.
  *
- * It may be assumed that this function implies a write memory barrier before
- * changing the task state if and only if any tasks are woken up.
+ * If this function wakes up a task, it executes a full memory barrier before
+ * accessing the task state.
  */
 void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -408,19 +408,23 @@ long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout)
 {
 	set_current_state(mode); /* A */
 	/*
-	 * The above implies an smp_mb(), which matches with the smp_wmb() from
+	 * The above executes an smp_mb(), which matches with the smp_wmb() from
 	 * woken_wake_function() such that if we observe WQ_FLAG_WOKEN we must
 	 * also observe all state before the wakeup.
+	 *
+	 * XXX: Specify memory accesses and communication relations.
 	 */
 	if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !is_kthread_should_stop())
 		timeout = schedule_timeout(timeout);
 	__set_current_state(TASK_RUNNING);
 
 	/*
-	 * The below implies an smp_mb(), it too pairs with the smp_wmb() from
+	 * The below executes an smp_mb(), it too pairs with the smp_wmb() from
 	 * woken_wake_function() such that we must either observe the wait
 	 * condition being true _OR_ WQ_FLAG_WOKEN such that we will not miss
 	 * an event.
+	 *
+	 * XXX: Specify memory accesses and communication relations.
 	 */
 	smp_store_mb(wq_entry->flags, wq_entry->flags & ~WQ_FLAG_WOKEN); /* B */
 
@@ -430,13 +434,7 @@ EXPORT_SYMBOL(wait_woken);
 
 int woken_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key)
 {
-	/*
-	 * Although this function is called under waitqueue lock, LOCK
-	 * doesn't imply write barrier and the users expects write
-	 * barrier semantics on wakeup functions.  The following
-	 * smp_wmb() is equivalent to smp_wmb() in try_to_wake_up()
-	 * and is paired with smp_store_mb() in wait_woken().
-	 */
+	/* Pairs with the smp_store_mb() from wait_woken(). */
 	smp_wmb(); /* C */
 	wq_entry->flags |= WQ_FLAG_WOKEN;
 
-- 
2.7.4



* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25  9:17 [PATCH] doc: Update wake_up() & co. memory-barrier guarantees Andrea Parri
@ 2018-06-25  9:50 ` Peter Zijlstra
  2018-06-25 10:56   ` Andrea Parri
  2018-06-25 12:12 ` David Howells
  2018-06-25 16:56 ` Alan Stern
  2 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2018-06-25  9:50 UTC (permalink / raw)
  To: Andrea Parri
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, Jun 25, 2018 at 11:17:38AM +0200, Andrea Parri wrote:
> Both the implementation and the users' expectation [1] for the various
> wakeup primitives have evolved over time, but the documentation has not
> kept up with these changes: brings it into 2018.

I wanted to reply to this saying that I'm not aware of anything relying
on this actually being a smp_mb() and that I've been treating it as a
RELEASE.

But then I found my own comment that goes with smp_mb__after_spinlock(),
which explains why we do in fact need the transitive thing if I'm not
mistaken.
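
(For reference, the comment and barrier in question sit at the head of
try_to_wake_up(), which at this point looks roughly like the sketch below;
trimmed, not verbatim:)

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();	/* full barrier */
	if (!(p->state & state))	/* the ->state load it orders against */
		goto out;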

So yes, I suppose we're entirely stuck with the full memory barrier
semantics like that. But I still find it easier to think of it like a
RELEASE that pairs with the ACQUIRE of waking up, such that the task
is guaranteed to observe its own wake condition.

And maybe that is the thing I'm missing here. These comments only state
that it does in fact imply a full memory barrier, but do not explain
why, should it?


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25  9:50 ` Peter Zijlstra
@ 2018-06-25 10:56   ` Andrea Parri
  2018-06-25 12:31     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Parri @ 2018-06-25 10:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, Jun 25, 2018 at 11:50:31AM +0200, Peter Zijlstra wrote:
> On Mon, Jun 25, 2018 at 11:17:38AM +0200, Andrea Parri wrote:
> > Both the implementation and the users' expectation [1] for the various
> > wakeup primitives have evolved over time, but the documentation has not
> > kept up with these changes: brings it into 2018.
> 
> I wanted to reply to this saying that I'm not aware of anything relying
> on this actually being a smp_mb() and that I've been treating it as a
> RELEASE.
> 
> But then I found my own comment that goes with smp_mb__after_spinlock(),
> which explains why we do in fact need the transitive thing if I'm not
> mistaken.

A concrete example being the store-buffering pattern reported in [1].
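
Roughly, in the style of the memory-barriers.txt snippets (a sketch, not
verbatim from [1]):

	CPU 1 (sleeper)			CPU 2 (waker)
	===============================	===============================
	STORE current->state		STORE event_indicated
	<full barrier>			wake_up_process();
	LOAD event_indicated		  <full barrier>
					  LOAD task->state

If either side's barrier were weaker than a full barrier, both loads
could miss the other side's store and the wakeup could be lost.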


> 
> > So yes, I suppose we're entirely stuck with the full memory barrier
> semantics like that. But I still find it easier to think of it like a
> RELEASE that pairs with the ACQUIRE of waking up, such that the task
> > is guaranteed to observe its own wake condition.
> 
> And maybe that is the thing I'm missing here. These comments only state
> that it does in fact imply a full memory barrier, but do not explain
> why, should it?

"code (people) is relying on it" is really the only "why" I can think
of.  With this patch, that same/SB pattern is also reported in memory
-barriers.txt.  Other ideas?

  Andrea


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25  9:17 [PATCH] doc: Update wake_up() & co. memory-barrier guarantees Andrea Parri
  2018-06-25  9:50 ` Peter Zijlstra
@ 2018-06-25 12:12 ` David Howells
  2018-06-25 12:28   ` Andrea Parri
  2018-06-25 16:56 ` Alan Stern
  2 siblings, 1 reply; 19+ messages in thread
From: David Howells @ 2018-06-25 12:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Andrea Parri, linux-kernel, linux-doc, Alan Stern,
	Will Deacon, Boqun Feng, Nicholas Piggin, Jade Alglave,
	Luc Maranget, Paul E. McKenney, Akira Yokosawa, Daniel Lustig,
	Jonathan Corbet, Ingo Molnar, Randy Dunlap

Peter Zijlstra <peterz@infradead.org> wrote:

> So yes, I suppose we're entirely stuck with the full memory barrier
> semantics like that. But I still find it easier to think of it like a
> RELEASE that pairs with the ACQUIRE of waking up, such that the task
> is guaranteed to observe its own wake condition.
> 
> And maybe that is the thing I'm missing here. These comments only state
> that it does in fact imply a full memory barrier, but do not explain
> why, should it?

I think because RELEASE and ACQUIRE concepts didn't really exist in Linux at
the time I wrote the doc, so the choices were read/readdep, write or full.

Since this document defines the *minimum* you can expect rather than what the
kernel actually gives you, I think it probably makes sense to switch to
RELEASE and ACQUIRE here.

David


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 12:12 ` David Howells
@ 2018-06-25 12:28   ` Andrea Parri
  2018-06-25 13:00     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Parri @ 2018-06-25 12:28 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, linux-kernel, linux-doc, Alan Stern, Will Deacon,
	Boqun Feng, Nicholas Piggin, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, Jun 25, 2018 at 01:12:45PM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > So yes, I suppose we're entirely stuck with the full memory barrier
> > semantics like that. But I still find it easier to think of it like a
> > RELEASE that pairs with the ACQUIRE of waking up, such that the task
> > is guaranteed to observe its own wake condition.
> > 
> > And maybe that is the thing I'm missing here. These comments only state
> > that it does in fact imply a full memory barrier, but do not explain
> > why, should it?
> 
> I think because RELEASE and ACQUIRE concepts didn't really exist in Linux at
> the time I wrote the doc, so the choices were read/readdep, write or full.
> 
> Since this document defines the *minimum* you can expect rather than what the
> kernel actually gives you, I think it probably makes sense to switch to
> RELEASE and ACQUIRE here.

RELEASE and ACQUIRE are not enough in SB.  Can you elaborate?

  Andrea


> 
> David


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 10:56   ` Andrea Parri
@ 2018-06-25 12:31     ` Peter Zijlstra
  2018-06-25 13:16       ` Andrea Parri
  2018-06-27 14:15       ` Andrea Parri
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2018-06-25 12:31 UTC (permalink / raw)
  To: Andrea Parri
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, Jun 25, 2018 at 12:56:18PM +0200, Andrea Parri wrote:
> On Mon, Jun 25, 2018 at 11:50:31AM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 25, 2018 at 11:17:38AM +0200, Andrea Parri wrote:
> > > Both the implementation and the users' expectation [1] for the various
> > > wakeup primitives have evolved over time, but the documentation has not
> > > kept up with these changes: brings it into 2018.
> > 
> > I wanted to reply to this saying that I'm not aware of anything relying
> > on this actually being a smp_mb() and that I've been treating it as a
> > RELEASE.
> > 
> > But then I found my own comment that goes with smp_mb__after_spinlock(),
> > which explains why we do in fact need the transitive thing if I'm not
> > mistaken.
> 
> A concrete example being the store-buffering pattern reported in [1].

Well, that example only needs a store->load barrier. It so happens
smp_mb() is the only one actually doing that, but imagine we had a
weaker barrier that did just that, one that did not imply the full
transitivity smp_mb() does.

Then the example from [1] could use that weaker thing.

> > So yes, I suppose we're entirely stuck with the full memory barrier
> > semantics like that. But I still find it easier to think of it like a
> > RELEASE that pairs with the ACQUIRE of waking up, such that the task
> > is guaranteed to observe its own wake condition.
> > 
> > And maybe that is the thing I'm missing here. These comments only state
> > that it does in fact imply a full memory barrier, but do not explain
> > why, should it?
> 
> "code (people) is relying on it" is really the only "why" I can think
> of.  With this patch, that same/SB pattern is also reported in memory
> -barriers.txt.  Other ideas?

So I'm not actually sure how many people rely on the RCsc transitive
smp_mb() here. People certainly rely on the RELEASE semantics, and the
code itself requires the store->load ordering, together that gives us
the smp_mb() because that's simply the only barrier we have.

And looking at smp_mb__after_spinlock() again, we really only need the
RCsc thing for rq->lock, not for the wakeups. The wakeups really only
need that RCpc RELEASE + store->load thing (which we don't have).

So yes, smp_mb(), however the below still makes more sense to me, or am
I just being obtuse again?

---
 kernel/sched/core.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a98d54cd5535..8374d01b2820 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1879,7 +1879,9 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
  *  C) LOCK of the rq(c1)->lock scheduling in task
  *
  * Transitivity guarantees that B happens after A and C after B.
- * Note: we only require RCpc transitivity.
+ * Note: we only require RCpc transitivity for these cases,
+ *       but see smp_mb__after_spinlock() for why rq->lock is required
+ *       to be RCsc.
  * Note: the CPU doing B need not be c0 or c1
  *
  * Example:
@@ -1944,13 +1946,14 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
  * However; for wakeups there is a second guarantee we must provide, namely we
  * must observe the state that lead to our wakeup. That is, not only must our
  * task observe its own prior state, it must also observe the stores prior to
- * its wakeup.
+ * its wakeup, see set_current_state().
  *
  * This means that any means of doing remote wakeups must order the CPU doing
- * the wakeup against the CPU the task is going to end up running on. This,
- * however, is already required for the regular Program-Order guarantee above,
- * since the waking CPU is the one issueing the ACQUIRE (smp_cond_load_acquire).
- *
+ * the wakeup against the CPU the task is going to end up running on. This
+ * means two things: firstly that try_to_wake_up() must (at least) imply a
+ * RELEASE (smp_mb__after_spinlock()), and secondly, as is already required
+ * for the regular Program-Order guarantee above, that waking implies an ACQUIRE
+ * (see smp_cond_load_acquire() above).
  */
 
 /**
@@ -1966,6 +1969,10 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
  * Atomic against schedule() which would dequeue a task, also see
  * set_current_state().
  *
+ * Implies at least a RELEASE such that the waking task is guaranteed to
+ * observe the stores to the wait-condition; see set_task_state() and the
+ * Program-Order constraints.
+ *
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 12:28   ` Andrea Parri
@ 2018-06-25 13:00     ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2018-06-25 13:00 UTC (permalink / raw)
  To: Andrea Parri
  Cc: David Howells, linux-kernel, linux-doc, Alan Stern, Will Deacon,
	Boqun Feng, Nicholas Piggin, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, Jun 25, 2018 at 02:28:50PM +0200, Andrea Parri wrote:
> On Mon, Jun 25, 2018 at 01:12:45PM +0100, David Howells wrote:
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > So yes, I suppose we're entirely stuck with the full memory barrier
> > > semantics like that. But I still find it easier to think of it like a
> > > RELEASE that pairs with the ACQUIRE of waking up, such that the task
> > > is guaranteed to observe its own wake condition.
> > > 
> > > And maybe that is the thing I'm missing here. These comments only state
> > > that it does in fact imply a full memory barrier, but do not explain
> > > why, should it?
> > 
> > I think because RELEASE and ACQUIRE concepts didn't really exist in Linux at
> > the time I wrote the doc, so the choices were read/readdep, write or full.
> > 
> > Since this document defines the *minimum* you can expect rather than what the
> > kernel actually gives you, I think it probably makes sense to switch to
> > RELEASE and ACQUIRE here.
> 
> RELEASE and ACQUIRE are not enough in SB.  Can you elaborate?

I prefer RELEASE vs wait-condition and treat task->state as an internal
matter. Also note how the set_current_task() comment is fairly vague on
what exact barriers are used. It just states 'sufficient'.

Maybe I should just give up and accept smp_mb(); strictly speaking
that is overkill, but it is the only sufficient barrier we currently
have.



* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 12:31     ` Peter Zijlstra
@ 2018-06-25 13:16       ` Andrea Parri
  2018-06-25 14:18         ` Peter Zijlstra
  2018-06-27 14:15       ` Andrea Parri
  1 sibling, 1 reply; 19+ messages in thread
From: Andrea Parri @ 2018-06-25 13:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

> > A concrete example being the store-buffering pattern reported in [1].
> 
> Well, that example only needs a store->load barrier. It so happens
> smp_mb() is the only one actually doing that, but imagine we had a
> weaker barrier that did just that, one that did not imply the full
> transitivity smp_mb() does.
> 
> Then the example from [1] could use that weaker thing.

Absolutely (and that would be "fence w,r" on RISC-V, IIUC).
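
Concretely, such a store->load-only barrier could hypothetically be
wrapped as in the sketch below; this macro does not exist in the kernel
and is only illustrative:

	/* hypothetical: order prior stores against subsequent loads only */
	#define smp_store_load_fence() \
		__asm__ __volatile__ ("fence w,r" : : : "memory")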


> 
> > > So yes, I suppose we're entirely stuck with the full memory barrier
> > > semantics like that. But I still find it easier to think of it like a
> > > RELEASE that pairs with the ACQUIRE of waking up, such that the task
> > > is guaranteed to observe its own wake condition.
> > > 
> > > And maybe that is the thing I'm missing here. These comments only state
> > > that it does in fact imply a full memory barrier, but do not explain
> > > why, should it?
> > 
> > "code (people) is relying on it" is really the only "why" I can think
> > of.  With this patch, that same/SB pattern is also reported in memory
> > -barriers.txt.  Other ideas?
> 
> So I'm not actually sure how many people rely on the RCsc transitive
> smp_mb() here. People certainly rely on the RELEASE semantics, and the
> code itself requires the store->load ordering, together that gives us
> the smp_mb() because that's simply the only barrier we have.
> 
> And looking at smp_mb__after_spinlock() again, we really only need the
> RCsc thing for rq->lock, not for the wakeups. The wakeups really only
> need that RCpc RELEASE + store->load thing (which we don't have).
> 
> So yes, smp_mb(), however the below still makes more sense to me, or am
> I just being obtuse again?
> 
> ---
>  kernel/sched/core.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a98d54cd5535..8374d01b2820 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1879,7 +1879,9 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   *  C) LOCK of the rq(c1)->lock scheduling in task
>   *
>   * Transitivity guarantees that B happens after A and C after B.
> - * Note: we only require RCpc transitivity.
> + * Note: we only require RCpc transitivity for these cases,
> + *       but see smp_mb__after_spinlock() for why rq->lock is required
> + *       to be RCsc.
>   * Note: the CPU doing B need not be c0 or c1

FWIW, we discussed this pattern here:

  http://lkml.kernel.org/r/20171018010748.GA4017@andrea


>   *
>   * Example:
> @@ -1944,13 +1946,14 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   * However; for wakeups there is a second guarantee we must provide, namely we
>   * must observe the state that lead to our wakeup. That is, not only must our
>   * task observe its own prior state, it must also observe the stores prior to
> - * its wakeup.
> + * its wakeup, see set_current_state().
>   *
>   * This means that any means of doing remote wakeups must order the CPU doing
> - * the wakeup against the CPU the task is going to end up running on. This,
> - * however, is already required for the regular Program-Order guarantee above,
> - * since the waking CPU is the one issueing the ACQUIRE (smp_cond_load_acquire).
> - *
> + * the wakeup against the CPU the task is going to end up running on. This
> + * means two things: firstly that try_to_wake_up() must (at least) imply a
> + * RELEASE (smp_mb__after_spinlock()), and secondly, as is already required
> + * for the regular Program-Order guarantee above, that waking implies an ACQUIRE
> + * (see smp_cond_load_acquire() above).
>   */
>  
>  /**
> @@ -1966,6 +1969,10 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   * Atomic against schedule() which would dequeue a task, also see
>   * set_current_state().
>   *
> + * Implies at least a RELEASE such that the waking task is guaranteed to
> + * observe the stores to the wait-condition; see set_task_state() and the
> + * Program-Order constraints.

[s/set_task_state/set_current_state ?]

I'd stick to "Implies/Executes at least a full barrier"; this is in fact
already documented in the function body:

	/*
	 * If we are going to wake up a thread waiting for CONDITION we
	 * need to ensure that CONDITION=1 done by the caller can not be
	 * reordered with p->state check below. This pairs with mb() in
	 * set_current_state() the waiting thread does.
	 */

(this is, again, that "store->load barrier"/SB).

I'll try to integrate these changes in v2, if there is no objection.

  Andrea


> + *
>   * Return: %true if @p->state changes (an actual wakeup was done),
>   *	   %false otherwise.
>   */


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 13:16       ` Andrea Parri
@ 2018-06-25 14:18         ` Peter Zijlstra
  2018-06-25 14:56           ` Andrea Parri
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2018-06-25 14:18 UTC (permalink / raw)
  To: Andrea Parri
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, Jun 25, 2018 at 03:16:43PM +0200, Andrea Parri wrote:
> > > A concrete example being the store-buffering pattern reported in [1].
> > 
> > Well, that example only needs a store->load barrier. It so happens
> > smp_mb() is the only one actually doing that, but imagine we had a
> > weaker barrier that did just that, one that did not imply the full
> > transitivity smp_mb() does.
> > 
> > Then the example from [1] could use that weaker thing.
> 
> Absolutely (and that would be "fence w,r" on RISC-V, IIUC).

Ah cute. What is the transitivity model of those "fence" instructions? I
see their smp_mb() is "fence rw,rw" and smp_mb() must be RCsc. Otoh
their smp_wmb() is "fence w,w" which is only required to be RCpc.

So what does RISC-V do for "w,w" and "w,r" like things?

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a98d54cd5535..8374d01b2820 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1879,7 +1879,9 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> >   *  C) LOCK of the rq(c1)->lock scheduling in task
> >   *
> >   * Transitivity guarantees that B happens after A and C after B.
> > - * Note: we only require RCpc transitivity.
> > + * Note: we only require RCpc transitivity for these cases,
> > + *       but see smp_mb__after_spinlock() for why rq->lock is required
> > + *       to be RCsc.
> >   * Note: the CPU doing B need not be c0 or c1
> 
> FWIW, we discussed this pattern here:
> 
>   http://lkml.kernel.org/r/20171018010748.GA4017@andrea

That's not the pattern from smp_mb__after_spinlock(), right? But the
other two from this comment.

> > @@ -1966,6 +1969,10 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> >   * Atomic against schedule() which would dequeue a task, also see
> >   * set_current_state().
> >   *
> > + * Implies at least a RELEASE such that the waking task is guaranteed to
> > + * observe the stores to the wait-condition; see set_task_state() and the
> > + * Program-Order constraints.
> 
> [s/set_task_state/set_current_state ?]

Yes, we got rid of set_task_state(), someone forgot to tell my fingers
:-)

> I'd stick to "Implies/Executes at least a full barrier"; this is in fact
> already documented in the function body:
> 
> 	/*
> 	 * If we are going to wake up a thread waiting for CONDITION we
> 	 * need to ensure that CONDITION=1 done by the caller can not be
> 	 * reordered with p->state check below. This pairs with mb() in
> 	 * set_current_state() the waiting thread does.
> 	 */
> 
> (this is, again, that "store->load barrier"/SB).
> 
> I'll try to integrate these changes in v2, if there is no objection.

Thanks!


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 14:18         ` Peter Zijlstra
@ 2018-06-25 14:56           ` Andrea Parri
  2018-06-25 15:44             ` Daniel Lustig
  2018-06-25 16:37             ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Andrea Parri @ 2018-06-25 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, Jun 25, 2018 at 04:18:30PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 25, 2018 at 03:16:43PM +0200, Andrea Parri wrote:
> > > > A concrete example being the store-buffering pattern reported in [1].
> > > 
> > > Well, that example only needs a store->load barrier. It so happens
> > > smp_mb() is the only one actually doing that, but imagine we had a
> > > weaker barrier that did just that, one that did not imply the full
> > > transitivity smp_mb() does.
> > > 
> > > Then the example from [1] could use that weaker thing.
> > 
> > Absolutely (and that would be "fence w,r" on RISC-V, IIUC).
> 
> Ah cute. What is the transitivity model of those "fence" instructions? I
> see their smp_mb() is "fence rw,rw" and smp_mb() must be RCsc. Otoh
> their smp_wmb() is "fence w,w" which is only required to be RCpc.
> 
> So what does RISC-V do for "w,w" and "w,r" like things?

I'd defer to Daniel (in Cc:) for this ;-)  I simply checked the SB pattern
plus w,r fences against the following models:

  http://www.cl.cam.ac.uk/~sf502/regressions/rmem/
  http://moscova.inria.fr/~maranget/cats7/riscv/ 


> 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index a98d54cd5535..8374d01b2820 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1879,7 +1879,9 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> > >   *  C) LOCK of the rq(c1)->lock scheduling in task
> > >   *
> > >   * Transitivity guarantees that B happens after A and C after B.
> > > - * Note: we only require RCpc transitivity.
> > > + * Note: we only require RCpc transitivity for these cases,
> > > + *       but see smp_mb__after_spinlock() for why rq->lock is required
> > > + *       to be RCsc.
> > >   * Note: the CPU doing B need not be c0 or c1
> > 
> > FWIW, we discussed this pattern here:
> > 
> >   http://lkml.kernel.org/r/20171018010748.GA4017@andrea
> 
> That's not the pattern from smp_mb__after_spinlock(), right? But the
> other two from this comment.

Indeed.


> 
> > > @@ -1966,6 +1969,10 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> > >   * Atomic against schedule() which would dequeue a task, also see
> > >   * set_current_state().
> > >   *
> > > + * Implies at least a RELEASE such that the waking task is guaranteed to
> > > + * observe the stores to the wait-condition; see set_task_state() and the
> > > + * Program-Order constraints.
> > 
> > [s/set_task_state/set_current_state ?]
> 
> Yes, we got rid of set_task_state(), someone forgot to tell my fingers
> :-)
> 
> > I'd stick to "Implies/Executes at least a full barrier"; this is in fact
> > already documented in the function body:
> > 
> > 	/*
> > 	 * If we are going to wake up a thread waiting for CONDITION we
> > 	 * need to ensure that CONDITION=1 done by the caller can not be
> > 	 * reordered with p->state check below. This pairs with mb() in
> > 	 * set_current_state() the waiting thread does.
> > 	 */
> > 
> > (this is, again, that "store->load barrier"/SB).
> > 
> > I'll try to integrate these changes in v2, if there is no objection.
> 
> Thanks!

Ah, before sending v2, I'd really appreciate some comments on the XXXs
I've added to wait_woken() as I'm not sure I understand the patterns in
question. For example, the second comment says:

  /*
   * The below implies an smp_mb(), it too pairs with the smp_wmb() from
   * woken_wake_function() such that we must either observe the wait
   * condition being true _OR_ WQ_FLAG_WOKEN such that we will not miss
   * an event.
   */

From this I understand:

   wq_entry->flags &= ~WQ_FLAG_WOKEN;      condition = true;
   smp_mb() // B                           smp_wmb(); // C
   [next iteration of the loop]            wq_entry->flags |= WQ_FLAG_WOKEN;
   if (condition)
      break;

   BUG_ON(!condition && !(wq_entry->flags & WQ_FLAG_WOKEN))

IOW, this is an R-like pattern: if this is the case, the smp_wmb() does
_not_ prevent the BUG_ON() from firing; according to LKMM (and powerpc)
a full barrier would be needed.
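
FWIW, the pattern can be written as a litmus test along the following
lines (a sketch: "flags" stands for wq_entry->flags, "cond" for the wait
condition, and clearing/setting WQ_FLAG_WOKEN is modelled as storing 0/1):

	C R+mb+wmb

	{}

	P0(int *flags, int *cond)
	{
		int r0;

		WRITE_ONCE(*flags, 0);	/* flags &= ~WQ_FLAG_WOKEN */
		smp_mb();		/* B */
		r0 = READ_ONCE(*cond);
	}

	P1(int *flags, int *cond)
	{
		WRITE_ONCE(*cond, 1);	/* condition = true */
		smp_wmb();		/* C */
		WRITE_ONCE(*flags, 1);	/* flags |= WQ_FLAG_WOKEN */
	}

	exists (0:r0=0 /\ flags=0)

where the "exists" clause is the BUG_ON() state above.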

Same RFC for the first comment:

  /*
   * The above implies an smp_mb(), which matches with the smp_wmb() from
   * woken_wake_function() such that if we observe WQ_FLAG_WOKEN we must
   * also observe all state before the wakeup.
   */

What is the corresponding snippet & BUG_ON()?

  Andrea


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 14:56           ` Andrea Parri
@ 2018-06-25 15:44             ` Daniel Lustig
  2018-06-25 16:38               ` Peter Zijlstra
  2018-06-25 16:37             ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Daniel Lustig @ 2018-06-25 15:44 UTC (permalink / raw)
  To: Andrea Parri, Peter Zijlstra
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Jonathan Corbet, Ingo Molnar,
	Randy Dunlap

On 6/25/2018 7:56 AM, Andrea Parri wrote:
> On Mon, Jun 25, 2018 at 04:18:30PM +0200, Peter Zijlstra wrote:
>> On Mon, Jun 25, 2018 at 03:16:43PM +0200, Andrea Parri wrote:
>>>>> A concrete example being the store-buffering pattern reported in [1].
>>>>
>>>> Well, that example only needs a store->load barrier. It so happens
>>>> smp_mb() is the only one actually doing that, but imagine we had a
>>>> weaker barrier that did just that, one that did not imply the full
>>>> transitivity smp_mb() does.
>>>>
>>>> Then the example from [1] could use that weaker thing.
>>>
>>> Absolutely (and that would be "fence w,r" on RISC-V, IIUC).
>>
>> Ah cute. What is the transitivity model of those "fence" instructions? I
>> see their smp_mb() is "fence rw,rw" and smp_mb() must be RCsc. Otoh
>> their smp_wmb() is "fence w,w" which is only required to be RCpc.
>>
>> So what does RISC-V do for "w,w" and "w,r" like things?
> 
> I'd defer to Daniel (in Cc:) for this ;-)  I simply checked the SB pattern
> plus w,r fences against the following models:
> 
>   http://www.cl.cam.ac.uk/~sf502/regressions/rmem/
>   http://moscova.inria.fr/~maranget/cats7/riscv/ 
> 

First off, the latest RISC-V spec is here, in case anyone hasn't seen it:
https://github.com/riscv/riscv-isa-manual.  It's all public now,
fortunately.  There's a PDF link at the bottom of that page too.  The
spec now has proposed Linux mappings (not all of which have actually made
it upstream...) in Appendix A.5 as well.  We'd welcome general feedback
on those.

Now to the question above:

RISC-V is (other-)multi-copy-atomic, so I don't think transitivity
should be an issue here.

We do have a "fence w,r", but we decided to warn against actually using
it for a few reasons: 1) lack of known common use cases :), 2) IIRC
there was some corner case discrepancy between the axiomatic and
operational models if we allowed it, and 3) in practice, it's already
both expensive enough and obscure enough that many or most
implementations will simply just treat it as "fence rw,rw" anyway.

So, in theory, "fence w,r" should be enough to prevent SB-like patterns.
It's just not yet clear that it's a big enough win that it's worth
creating a new fence macro for it, or pulling the current RISC-V
recommendation against its use.  What do you all think?

Are there other known use cases where you really do want a "fence w,r"
that's intentionally not also something stronger?  How common would the
pattern be?

Dan


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 14:56           ` Andrea Parri
  2018-06-25 15:44             ` Daniel Lustig
@ 2018-06-25 16:37             ` Peter Zijlstra
  2018-06-26 10:09               ` Andrea Parri
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2018-06-25 16:37 UTC (permalink / raw)
  To: Andrea Parri
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, Jun 25, 2018 at 04:56:11PM +0200, Andrea Parri wrote:

> Ah, before sending v2, I'd really appreciate some comments on the XXXs
> I've added to wait_woken() as I'm not sure I understand the patterns in
> question.

Oh man, lemme see if I can remember how all that was supposed to work.

So the basic idea was that we cannot rely on the normal task->state
rules because testing @condition can schedule itself. So instead we add
more state. But then we need to ensure that we don't lose either a
wakeup or the wakeup state.
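
For context, a typical wait_woken() caller looks more or less like the
sketch below (assuming a wait-queue head "wq_head" and a wait condition
that the waker sets before calling wake_up()):

	DEFINE_WAIT_FUNC(wait, woken_wake_function);

	add_wait_queue(&wq_head, &wait);
	for (;;) {
		if (condition)
			break;
		timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
		if (!timeout)
			break;
	}
	remove_wait_queue(&wq_head, &wait);

so the WQ_FLAG_WOKEN bookkeeping has to work even though evaluating
"condition" may itself block or schedule.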

> For example, the second comment says:
> 
>   /*
>    * The below implies an smp_mb(), it too pairs with the smp_wmb() from
>    * woken_wake_function() such that we must either observe the wait
>    * condition being true _OR_ WQ_FLAG_WOKEN such that we will not miss
>    * an event.
>    */
> 
> From this I understand:
> 
>    wq_entry->flags &= ~WQ_FLAG_WOKEN;      condition = true;
>    smp_mb() // B                           smp_wmb(); // C
>    [next iteration of the loop]            wq_entry->flags |= WQ_FLAG_WOKEN;
>    if (condition)
>       break;
> 
>    BUG_ON(!condition && !(wq_entry->flags & WQ_FLAG_WOKEN))

Right, basically if we get a spurious wakeup and our ttwu() 'fails', we
must either see condition on the next iteration, or ensure the next
iteration doesn't sleep, so we'll loop around and test condition yet
again.

> IOW, this is an R-like pattern: if this is the case, the smp_wmb() does
> _not_ prevent the BUG_ON() from firing; according to LKMM (and powerpc)
> a full barrier would be needed.

Hmmm, how come? store-store collision? Yes I suppose you're right.

> Same RFC for the first comment:
> 
>   /*
>    * The above implies an smp_mb(), which matches with the smp_wmb() from
>    * woken_wake_function() such that if we observe WQ_FLAG_WOKEN we must
>    * also observe all state before the wakeup.
>    */
> 
> What is the corresponding snippet & BUG_ON()?

The comment there suggests:

	if (condition)
		break;

	set_current_state(UNINTERRUPTIBLE);		condition = true;
	/* smp_mb() */					smp_wmb();
							wq_entry->flags |= WQ_FLAG_WOKEN;
	if (!wq_entry->flags & WQ_FLAG_WOKEN)
		schedule();


	BUG_ON((wq_entry->flags & WQ_FLAG_WOKEN) && !condition);


But looking at that now, I think that's wrong. Because the purpose
was that, if we don't do the try_to_wake_up(), our task still needs to
observe the condition store.

But for that we need a barrier between the wq_entry->flags load and the
second condition test, which would (again) be B, not A.


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 15:44             ` Daniel Lustig
@ 2018-06-25 16:38               ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2018-06-25 16:38 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: Andrea Parri, linux-kernel, linux-doc, Alan Stern, Will Deacon,
	Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave,
	Luc Maranget, Paul E. McKenney, Akira Yokosawa, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, Jun 25, 2018 at 08:44:14AM -0700, Daniel Lustig wrote:
> RISC-V is (other-)multi-copy-atomic, so I don't think transitivity
> should be an issue here.

Ah, ok.

> We do have a "fence w,r", but we decided to warn against actually using
> it for a few reasons: 1) lack of known common use cases :), 2) IIRC
> there was some corner case discrepancy between the axiomatic and
> operational models if we allowed it, and 3) in practice, it's already
> both expensive enough and obscure enough that many or most
> implementations will simply just treat it as "fence rw,rw" anyway.

Because the majority of the cost is flushing the store-buffer in either
case?

> So, in theory, "fence w,r" should be enough to prevent SB-like patterns.
> It's just not yet clear that it's a big enough win that it's worth
> creating a new fence macro for it, or pulling the current RISC-V
> recommendation against its use.  What do you all think?

It was mostly a theoretical argument for why smp_mb() is too strong, not
a real practical desire to have w,r.


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25  9:17 [PATCH] doc: Update wake_up() & co. memory-barrier guarantees Andrea Parri
  2018-06-25  9:50 ` Peter Zijlstra
  2018-06-25 12:12 ` David Howells
@ 2018-06-25 16:56 ` Alan Stern
  2018-06-26 10:11   ` Andrea Parri
  2 siblings, 1 reply; 19+ messages in thread
From: Alan Stern @ 2018-06-25 16:56 UTC (permalink / raw)
  To: Andrea Parri
  Cc: linux-kernel, linux-doc, Will Deacon, Peter Zijlstra, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Mon, 25 Jun 2018, Andrea Parri wrote:

> Both the implementation and the users' expectation [1] for the various
> wakeup primitives have evolved over time, but the documentation has not
> kept up with these changes: brings it into 2018.
> 
> [1] http://lkml.kernel.org/r/20180424091510.GB4064@hirez.programming.kicks-ass.net
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
> [ aparri: Apply feedback from Alan Stern. ]
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Jade Alglave <j.alglave@ucl.ac.uk>
> Cc: Luc Maranget <luc.maranget@inria.fr>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Akira Yokosawa <akiyks@gmail.com>
> Cc: Daniel Lustig <dlustig@nvidia.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> ---
>  Documentation/memory-barriers.txt | 43 ++++++++++++++++++++++++---------------
>  kernel/sched/completion.c         |  8 ++++----
>  kernel/sched/core.c               | 11 +++++-----
>  kernel/sched/wait.c               | 24 ++++++++++------------
>  4 files changed, 47 insertions(+), 39 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a02d6bbfc9d0a..bf58fa1671b62 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -2179,32 +2179,41 @@ or:
>  	event_indicated = 1;
>  	wake_up_process(event_daemon);
>  
> -A write memory barrier is implied by wake_up() and co.  if and only if they
> -wake something up.  The barrier occurs before the task state is cleared, and so
> -sits between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> +A general memory barrier is executed by wake_up() if it wakes something up.
> +If it doesn't wake anything up then a memory barrier may or may not be
> +executed; you must not rely on it.  The barrier occurs before the task state
> +is accessed, in part., it sits between the STORE to indicate the event and
> +the STORE to set TASK_RUNNING:

Minor suggestion: Instead of "in part.", how about "that is"?

(I generally find "in part." to be at least a little confusing,
probably because "part" is itself a word and "in part" is a 
reasonably common phrase in English.)

>  
> -	CPU 1				CPU 2
> +	CPU 1 (Sleeper)			CPU 2 (Waker)
>  	===============================	===============================
>  	set_current_state();		STORE event_indicated
>  	  smp_store_mb();		wake_up();
> -	    STORE current->state	  <write barrier>
> -	    <general barrier>		  STORE current->state
> -	LOAD event_indicated
> +	    STORE current->state	  ...
> +	    <general barrier>		  <general barrier>
> +	LOAD event_indicated		  if ((LOAD task->state) & TASK_NORMAL)
> +					    STORE task->state
>  
> -To repeat, this write memory barrier is present if and only if something
> -is actually awakened.  To see this, consider the following sequence of
> -events, where X and Y are both initially zero:
> +where "task" is the thread being woken up and it equals CPU 1's current.

Since "task" is in quotation marks, "current" should also be in
quotation marks.

Alan



* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 16:37             ` Peter Zijlstra
@ 2018-06-26 10:09               ` Andrea Parri
  2018-06-26 15:30                 ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Parri @ 2018-06-26 10:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

> > For example, the second comment says:
> > 
> >   /*
> >    * The below implies an smp_mb(), it too pairs with the smp_wmb() from
> >    * woken_wake_function() such that we must either observe the wait
> >    * condition being true _OR_ WQ_FLAG_WOKEN such that we will not miss
> >    * an event.
> >    */
> > 
> > From this I understand:
> > 
> >    wq_entry->flags &= ~WQ_FLAG_WOKEN;      condition = true;
> >    smp_mb() // B                           smp_wmb(); // C
> >    [next iteration of the loop]            wq_entry->flags |= WQ_FLAG_WOKEN;
> >    if (condition)
> >       break;
> > 
> >    BUG_ON(!condition && !(wq_entry->flags & WQ_FLAG_WOKEN))
> 
> Right, basically if we get a spurious wakeup and our ttwu() 'fails', we
> must either see condition on the next iteration, or ensure the next
> iteration doesn't sleep, so we'll loop around and test condition yet
> again.
> 
> > IOW, this is an R-like pattern: if this is the case, the smp_wmb() does
> > _not_ prevent the BUG_ON() from firing; according to LKMM (and powerpc)
> > a full barrier would be needed.
> 
> Hmmm, how come? store-store collision? Yes I suppose you're right.

Ehh, the corresponding powerpc test is architecturally allowed; in the
operational model, the BUG_ON() state can be reached by the following
steps:

 1. let the writes all reach the storage subsystem,

 2. commit the partial coherence order from "->flags |= WQ_FLAG_WOKEN"
    to "->flags &= ~WQ_FLAG_WOKEN"

 3. propagate "->flags &= ~WQ_FLAG_WOKEN" to the other thread

 4. commit and acknowledge the sync (B)

 5. satisfy the read

 6. propagate "condition = true" and the lwsync (C) to the other thread.

AFAICT, this state remains _unobserved_ via litmus7 experiments.


> 
> > Same RFC for the first comment:
> > 
> >   /*
> >    * The above implies an smp_mb(), which matches with the smp_wmb() from
> >    * woken_wake_function() such that if we observe WQ_FLAG_WOKEN we must
> >    * also observe all state before the wakeup.
> >    */
> > 
> > What is the corresponding snippet & BUG_ON()?
> 
> The comment there suggests:
> 
> 	if (condition)
> 		break;
> 
> 	set_current_state(UNINTERRUPTIBLE);		condition = true;
> 	/* smp_mb() */					smp_wmb();
> 							wq_entry->flags |= WQ_FLAG_WOKEN;
> 	if (!wq_entry->flags & WQ_FLAG_WOKEN)
> 		schedule();
> 
> 
> 	BUG_ON((wq_entry->flags & WQ_FLAG_WOKEN) && !condition);
> 
> 
> But looking at that now, I think that's wrong. Because the purpose
> was that, if we don't do the try_to_wake_up(), our task still needs to
> observe the condition store.
> 
> But for that we need a barrier between the wq_entry->flags load and the
> second condition test, which would (again) be B, not A.

Agreed.  Now that I stared at the code a bit more, I think that (A) is
still needed for the synchronization on "->state" and "->flags" (an SB
pattern seems again to be hidden in the call to try_to_wake_up()):

  p->state = mode;                           wq_entry->flags |= WQ_FLAG_WOKEN;
  smp_mb(); // A                             try_to_wake_up():
  if (!(wq_entry->flags & WQ_FLAG_WOKEN))      <full barrier>
     schedule()                                if (!(p->state & mode))
                                                 goto out;

  BUG_ON(!(wq_entry->flags & WQ_FLAG_WOKEN) && !(p->state & mode))
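
Encoded as a rough litmus sketch ("state" for p->state, "flags" for
wq_entry->flags, with the waker's smp_mb() standing in for the full
barrier in try_to_wake_up()):

	C SB+state+flags

	{}

	P0(int *state, int *flags)
	{
		int r0;

		WRITE_ONCE(*state, 1);	/* p->state = mode */
		smp_mb();		/* A */
		r0 = READ_ONCE(*flags);
	}

	P1(int *state, int *flags)
	{
		int r1;

		WRITE_ONCE(*flags, 1);	/* flags |= WQ_FLAG_WOKEN */
		smp_mb();		/* try_to_wake_up()'s full barrier */
		r1 = READ_ONCE(*state);
	}

	exists (0:r0=0 /\ 1:r1=0)

With both full barriers the "exists" state is forbidden, i.e. at least
one of the two loads sees 1, which is what the BUG_ON() above relies on.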

So, I think that we should keep (A).

I am planning to send these changes (smp_mb() in woken_wake_function()
and fixes to the comments) as a separate patch.

Thanks,
  Andrea


* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 16:56 ` Alan Stern
@ 2018-06-26 10:11   ` Andrea Parri
  2018-06-26 13:49     ` Alan Stern
  0 siblings, 1 reply; 19+ messages in thread
From: Andrea Parri @ 2018-06-26 10:11 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-kernel, linux-doc, Will Deacon, Peter Zijlstra, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

> > -A write memory barrier is implied by wake_up() and co.  if and only if they
> > -wake something up.  The barrier occurs before the task state is cleared, and so
> > -sits between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> > +A general memory barrier is executed by wake_up() if it wakes something up.
> > +If it doesn't wake anything up then a memory barrier may or may not be
> > +executed; you must not rely on it.  The barrier occurs before the task state
> > +is accessed, in part., it sits between the STORE to indicate the event and
> > +the STORE to set TASK_RUNNING:
> 
> Minor suggestion: Instead of "in part.", how about "that is"?
> 
> (I generally find "in part." to be at least a little confusing,
> probably because "part" is itself a word and "in part" is a 
> reasonably common phrase in English.)

Mmh, the fact is that that "before the task state is accessed" does want
to include the LOAD from ->state to check for the task state (recall the
pattern in [1])...; how about if I expand "in part." to "in particular"?


> 
> >  
> > -	CPU 1				CPU 2
> > +	CPU 1 (Sleeper)			CPU 2 (Waker)
> >  	===============================	===============================
> >  	set_current_state();		STORE event_indicated
> >  	  smp_store_mb();		wake_up();
> > -	    STORE current->state	  <write barrier>
> > -	    <general barrier>		  STORE current->state
> > -	LOAD event_indicated
> > +	    STORE current->state	  ...
> > +	    <general barrier>		  <general barrier>
> > +	LOAD event_indicated		  if ((LOAD task->state) & TASK_NORMAL)
> > +					    STORE task->state
> >  
> > -To repeat, this write memory barrier is present if and only if something
> > -is actually awakened.  To see this, consider the following sequence of
> > -events, where X and Y are both initially zero:
> > +where "task" is the thread being woken up and it equals CPU 1's current.
> 
> Since "task" is in quotation marks, "current" should also be in
> quotation marks.

Sure, will fix in v2.

Thanks,
  Andrea


> 
> Alan
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-26 10:11   ` Andrea Parri
@ 2018-06-26 13:49     ` Alan Stern
  0 siblings, 0 replies; 19+ messages in thread
From: Alan Stern @ 2018-06-26 13:49 UTC (permalink / raw)
  To: Andrea Parri
  Cc: linux-kernel, linux-doc, Will Deacon, Peter Zijlstra, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Tue, 26 Jun 2018, Andrea Parri wrote:

> > > -A write memory barrier is implied by wake_up() and co.  if and only if they
> > > -wake something up.  The barrier occurs before the task state is cleared, and so
> > > -sits between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> > > +A general memory barrier is executed by wake_up() if it wakes something up.
> > > +If it doesn't wake anything up then a memory barrier may or may not be
> > > +executed; you must not rely on it.  The barrier occurs before the task state
> > > +is accessed, in part., it sits between the STORE to indicate the event and
> > > +the STORE to set TASK_RUNNING:
> > 
> > Minor suggestion: Instead of "in part.", how about "that is"?
> > 
> > (I generally find "in part." to be at least a little confusing,
> > probably because "part" is itself a word and "in part" is a 
> > reasonably common phrase in English.)
> 
> Mmh, the fact is that that "before the task state is accessed" does want
> to include the LOAD from ->state to check for the task state (recall the
> pattern in [1])...; how about if I expand "in part." to "in particular"?

That would be acceptable.

Alan


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-26 10:09               ` Andrea Parri
@ 2018-06-26 15:30                 ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2018-06-26 15:30 UTC (permalink / raw)
  To: Andrea Parri
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

On Tue, Jun 26, 2018 at 12:09:42PM +0200, Andrea Parri wrote:
> > > Same RFC for the first comment:
> > > 
> > >   /*
> > >    * The above implies an smp_mb(), which matches with the smp_wmb() from
> > >    * woken_wake_function() such that if we observe WQ_FLAG_WOKEN we must
> > >    * also observe all state before the wakeup.
> > >    */
> > > 
> > > What is the corresponding snippet & BUG_ON()?
> > 
> > The comment there suggest:
> > 
> > 	if (condition)
> > 		break;
> > 
> > 	set_current_state(UNINTERRUPTIBLE);		condition = true;
> > 	/* smp_mb() */					smp_wmb();
> > 							wq_entry->flags |= WQ_FLAG_WOKEN;
> > 	if (!(wq_entry->flags & WQ_FLAG_WOKEN))
> > 		schedule();
> > 
> > 
> > 	BUG_ON((wq_entry->flags & WQ_FLAG_WOKEN) && !condition);
> > 
> > 
> > But looking at that now, I think that's wrong. Because the purpose
> > was that, if we don't do the try_to_wake_up(), our task still needs to
> > observe the condition store.
> > 
> > But for that we need a barrier between the wq_entry->flags load and the
> > second condition test, which would (again) be B, not A.
> 
> Agreed.  Now that I stared at the code a bit more, I think that (A) is
> still needed for the synchronization on "->state" and "->flags" (an SB
> pattern seems again to be hidden in the call to try_to_wake_up()):
> 
>   p->state = mode;                           wq_entry->flags |= WQ_FLAG_WOKEN;
>   smp_mb(); // A                             try_to_wake_up():
>   if (!(wq_entry->flags & WQ_FLAG_WOKEN))      <full barrier>
>      schedule()                                if (!(p->state & mode))
>                                                  goto out;
> 
>   BUG_ON(!(wq_entry->flags & WQ_FLAG_WOKEN) && !(p->state & mode))
> 
> So, I think that we should keep (A).

Yes, very much so. Once we actually get to use ttwu() that barrier is
required.

> I am planning to send these changes (smp_mb() in woken_wake_function()
> and fixes to the comments) as a separate patch.

Probably makes sense. Thanks for looking at this, I have vague memories
of being slightly confused when I wrote all that :-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] doc: Update wake_up() & co. memory-barrier guarantees
  2018-06-25 12:31     ` Peter Zijlstra
  2018-06-25 13:16       ` Andrea Parri
@ 2018-06-27 14:15       ` Andrea Parri
  1 sibling, 0 replies; 19+ messages in thread
From: Andrea Parri @ 2018-06-27 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-doc, Alan Stern, Will Deacon, Boqun Feng,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Jonathan Corbet,
	Ingo Molnar, Randy Dunlap

> So I'm not actually sure how many people rely on the RCsc transitive
> smp_mb() here. People certainly rely on the RELEASE semantics, and the
> code itself requires the store->load ordering, together that gives us
> the smp_mb() because that's simply the only barrier we have.
> 
> And looking at smp_mb__after_spinlock() again, we really only need the
> RCsc thing for rq->lock, not for the wakeups. The wakeups really only
> need that RCpc RELEASE + store->load thing (which we don't have).
> 
> So yes, smp_mb(), however the below still makes more sense to me, or am
> I just being obtuse again?

While trying to integrate these remarks into v1 and looking again at the
comment before smp_mb__after_spinlock(), I remembered a discussion where
Boqun suggested some improvements to this comment, so I wrote the commit
reported at the end of this email.

This raises the following two issues:

 1) First, the problem of integrating the resulting comment into v1,
    where I've been talking about _full_ barriers associated with the
    wakeup functions but where these are actually implemented as the
    sequence below (a fuller sketch of the calling path is given after
    this list):

       spin_lock(s);
       smp_mb__after_spinlock();

 2) Second, the problem of formalizing the requirements described in
    that comment (remark: informally, the LKMM currently states that
    the sequence "spin_lock(s); smp_mb__after_spinlock();" generates
    a full barrier; in particular, this sequence orders
    
       {STORE,LOAD} -> {STORE,LOAD}

    according to the current LKMM).
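
To make (1) concrete, that sequence shows up in the wakeup path roughly
as follows (a much simplified sketch of try_to_wake_up(); most of the
function is omitted):

   static int
   try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
   {
           unsigned long flags;
           int success = 0;

           raw_spin_lock_irqsave(&p->pi_lock, flags);
           smp_mb__after_spinlock();      /* the "full barrier" of the wakeup */
           if (!(p->state & state))
                   goto out;

           /* ... actually queue the task for execution ... */
           success = 1;
   out:
           raw_spin_unlock_irqrestore(&p->pi_lock, flags);

           return success;
   }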

For (1), I could simply replace each occurrence of "executes a full memory
barrier" with "executes the sequence spin_lock(s); smp_mb__after_spinlock()";
I haven't really thought about (2) yet, but please notice that something as
simple as

   let mb = [...]  |
            ([W] ; po? ; [LKW] ; fencerel(After-spinlock) ; [R])

would _not_ guarantee "RCsc transitivity" ...

A different approach (that could solve both problems at once) would be to
follow the current formalization in LKMM and to modify the comment before
smp_mb__after_spinlock() accordingly (say, informally, "it's required that
the sequence spin_lock(); smp_mb__after_spinlock() provides a full barrier").

Thoughts?

  Andrea

From c3648d5022bedcd356198efa65703e01541cbd3f Mon Sep 17 00:00:00 2001
From: Andrea Parri <andrea.parri@amarulasolutions.com>
Date: Wed, 27 Jun 2018 10:53:30 +0200
Subject: [PATCH 2/3] locking: Fix comment for smp_mb__after_spinlock()

Boqun reported that the snippet described in the header comment for
smp_mb__after_spinlock() can be misleading, because acquire/release
chains already provide us with the underlying guarantee (due to the
A-cumulativity of release).

This commit fixes the comment following Boqun's example in [1].

It's worth noting here that LKMM maintainers are currently actively
debating whether to enforce RCsc transitivity of locking operations
"by definition" [2]; should the guarantee be enforced in the future,
the requirement for smp_mb__after_spinlock() could be simplified to
include only the STORE->LOAD ordering requirement.

[1] http://lkml.kernel.org/r/20180312085600.aczjkpn73axzs2sb@tardis
[2] http://lkml.kernel.org/r/Pine.LNX.4.44L0.1711271553490.1424-100000@iolanthe.rowland.org
    http://lkml.kernel.org/r/Pine.LNX.4.44L0.1806211322160.2381-100000@iolanthe.rowland.org

Reported-and-Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
---
 include/linux/spinlock.h | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 1e8a464358384..c74828fe8d75c 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -121,22 +121,22 @@ do {								\
  *
  *   - it must ensure the critical section is RCsc.
  *
- * The latter is important for cases where we observe values written by other
- * CPUs in spin-loops, without barriers, while being subject to scheduling.
+ * The latter requirement guarantees that stores from two critical sections
+ * in different CPUs are ordered even outside the critical sections.  As an
+ * example illustrating this property, consider the following snippet:
  *
- * CPU0			CPU1			CPU2
+ * CPU0			CPU1				CPU2
  *
- *			for (;;) {
- *			  if (READ_ONCE(X))
- *			    break;
- *			}
- * X=1
- *			<sched-out>
- *						<sched-in>
- *						r = X;
+ * spin_lock(s);	spin_lock(s);			r2 = READ_ONCE(Y);
+ * WRITE_ONCE(X, 1);	smp_mb__after_spinlock();	smp_rmb();
+ * spin_unlock(s);	r1 = READ_ONCE(X);		r3 = READ_ONCE(X);
+ *			WRITE_ONCE(Y, 1);
+ *			spin_unlock(s);
  *
- * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
- * we get migrated and CPU2 sees X==0.
+ * Without RCsc transitivity, it is allowed that CPU0's critical section
+ * precedes CPU1's critical section (r1=1) and that CPU2 observes CPU1's
+ * store to Y (r2=1) while it does not observe CPU0's store to X (r3=0),
+ * despite the presence of the smp_rmb().
  *
  * Since most load-store architectures implement ACQUIRE with an smp_mb() after
  * the LL/SC loop, they need no further barriers. Similarly all our TSO
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-06-27 14:15 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-25  9:17 [PATCH] doc: Update wake_up() & co. memory-barrier guarantees Andrea Parri
2018-06-25  9:50 ` Peter Zijlstra
2018-06-25 10:56   ` Andrea Parri
2018-06-25 12:31     ` Peter Zijlstra
2018-06-25 13:16       ` Andrea Parri
2018-06-25 14:18         ` Peter Zijlstra
2018-06-25 14:56           ` Andrea Parri
2018-06-25 15:44             ` Daniel Lustig
2018-06-25 16:38               ` Peter Zijlstra
2018-06-25 16:37             ` Peter Zijlstra
2018-06-26 10:09               ` Andrea Parri
2018-06-26 15:30                 ` Peter Zijlstra
2018-06-27 14:15       ` Andrea Parri
2018-06-25 12:12 ` David Howells
2018-06-25 12:28   ` Andrea Parri
2018-06-25 13:00     ` Peter Zijlstra
2018-06-25 16:56 ` Alan Stern
2018-06-26 10:11   ` Andrea Parri
2018-06-26 13:49     ` Alan Stern
