* [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
@ 2015-09-08  1:14 Boqun Feng
  2015-09-09 19:28 ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2015-09-08  1:14 UTC (permalink / raw)
  To: linux-kernel, linux-doc
  Cc: Paul E. McKenney, Jonathan Corbet, Michal Hocko, Oleg Nesterov,
	Peter Zijlstra, David Howells, Linus Torvalds, Boqun Feng

Two examples for barriers in wake_up() and co. in memory-barriers.txt
are misleading, along with their explanations:

1.	The example which wanted to explain the write barrier in
	wake_up() and co. [spotted by Oleg Nesterov <oleg@redhat.com>]

2.	The example which wanted to explain that the write barriers in
	wake_up() and co. only exist iff a wakeup actually occurs.

For example #1, according to Oleg Nesterov:

>
>       The barrier occurs before the task state is cleared
>
> is not actually right. This is misleading. What is really important is that
> we have a barrier before we _read_ the task state. And again, again, the
> fact that we actually have the write barrier is just the implementation
> detail.
>

And example #2 is actually an example which shows that the barriers in
wait_event() and co. only exist iff a sleep actually occurs.

Furthermore, these barriers are only used for the correctness of
sleeping and waking up, i.e. they exist only to guarantee the ordering
of memory accesses to the task states and the global variables
indicating an event. Users can't rely on them for other things, so
memory-barriers.txt had better call this out and remove the
misleading examples.

This patch removes the misleading examples along with their
explanations, calls out that the implied barriers only apply to sleep-
and wakeup-related variables, and adds a new example to explain the
implied barrier in wake_up() and co.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 Documentation/memory-barriers.txt | 42 +++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index eafa6a5..07de72f 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1948,6 +1948,10 @@ these appear to happen in the right order, the primitives to begin the process
 of going to sleep, and the primitives to initiate a wake up imply certain
 barriers.
 
+[!] Note that these implied barriers are only for the correctness of sleep and
+wake-up, so don't rely on them for anything other than the task states and the
+global variables indicating the events.
+
 Firstly, the sleeper normally follows something like this sequence of events:
 
 	for (;;) {
@@ -1997,32 +2001,22 @@ or:
 	event_indicated = 1;
 	wake_up_process(event_daemon);
 
-A write memory barrier is implied by wake_up() and co. if and only if they wake
-something up.  The barrier occurs before the task state is cleared, and so sits
-between the STORE to indicate the event and the STORE to set TASK_RUNNING:
-
-	CPU 1				CPU 2
-	===============================	===============================
-	set_current_state();		STORE event_indicated
-	  smp_store_mb();		wake_up();
-	    STORE current->state	  <write barrier>
-	    <general barrier>		  STORE current->state
-	LOAD event_indicated
+A memory barrier is implied by wake_up() and co. if and only if they wake
+something up. The memory barrier here need not be a general barrier; it only
+needs to guarantee that a STORE preceding the barrier can never be reordered
+after a LOAD following it (i.e. a STORE-LOAD barrier). This barrier
+guarantees that the event has been indicated before the waker reads the
+wakee's task state:
 
-To repeat, this write memory barrier is present if and only if something
-is actually awakened.  To see this, consider the following sequence of
-events, where X and Y are both initially zero:
+	CPU 1
+	===============================
+	STORE event_indicated;
+	wake_up_process(wakee);
+	  <STORE-LOAD barrier>
+	  LOAD wakee->state;
 
-	CPU 1				CPU 2
-	===============================	===============================
-	X = 1;				STORE event_indicated
-	smp_mb();			wake_up();
-	Y = 1;				wait_event(wq, Y == 1);
-	wake_up();			  load from Y sees 1, no memory barrier
-					load from X might see 0
-
-In contrast, if a wakeup does occur, CPU 2's load from X would be guaranteed
-to see 1.
+This barrier pairs with the general barrier implied by set_current_state() on
+the sleeper side.
 
 The available waker functions include:
 
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-08  1:14 [PATCH] Documentation: Remove misleading examples of the barriers in wake_*() Boqun Feng
@ 2015-09-09 19:28 ` Paul E. McKenney
  2015-09-10  2:16   ` Boqun Feng
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2015-09-09 19:28 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linux-doc, Jonathan Corbet, Michal Hocko,
	Oleg Nesterov, Peter Zijlstra, David Howells, Linus Torvalds

On Tue, Sep 08, 2015 at 09:14:01AM +0800, Boqun Feng wrote:
> Two examples for barriers in wake_up() and co. in memory-barriers.txt
> are misleading, along with their explanations:
> 
> 1.	The example which wanted to explain the write barrier in
> 	wake_up() and co. [spotted by Oleg Nesterov <oleg@redhat.com>]
> 
> 2.	The example which wanted to explain that the write barriers in
> 	wake_up() and co. only exist iff a wakeup actually occurs.
> 
> For example #1, according to Oleg Nesterov:
> 
> >
> >       The barrier occurs before the task state is cleared
> >
> > is not actually right. This is misleading. What is really important is that
> > we have a barrier before we _read_ the task state. And again, again, the
> > fact that we actually have the write barrier is just the implementation
> > detail.
> >
> 
> And the example #2 is actually an example which could explain that the
> barriers in wait_event() and co. only exist iff a sleep actually occurs.
> 
> Further more, these barriers are only used for the correctness of
> sleeping and waking up, i.e. they exist only to guarantee the ordering
> of memory accesses to the task states and the global variables
> indicating an event. Users can't rely on them for other things, so
> memory-barriers.txt had better to call this out and remove the
> misleading examples.
> 
> This patch removes the misleading examples along with their
> explanations, calls it out that those implied barriers are only for
> sleep and wakeup related variables and adds a new example to explain the
> implied barrier in wake_up() and co.
> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>

At this point, I would favor replacing that entire section with a short
paragraph describing what guarantees are provided, perhaps with an example
showing what added barriers/locks/whatever are required.  My feeling is
that we should avoid saying too much about the internals of wait_event()
and wake_up().

Or am I missing something?

							Thanx, Paul

> ---
>  Documentation/memory-barriers.txt | 42 +++++++++++++++++----------------------
>  1 file changed, 18 insertions(+), 24 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index eafa6a5..07de72f 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1948,6 +1948,10 @@ these appear to happen in the right order, the primitives to begin the process
>  of going to sleep, and the primitives to initiate a wake up imply certain
>  barriers.
> 
> +[!] Note that these implied barriers are only for the correctness of sleep and
> +wake-up. So don't rely on these barriers for things that are neither the task
> +states nor the global variables indicating the events.
> +
>  Firstly, the sleeper normally follows something like this sequence of events:
> 
>  	for (;;) {
> @@ -1997,32 +2001,22 @@ or:
>  	event_indicated = 1;
>  	wake_up_process(event_daemon);
> 
> -A write memory barrier is implied by wake_up() and co. if and only if they wake
> -something up.  The barrier occurs before the task state is cleared, and so sits
> -between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> -
> -	CPU 1				CPU 2
> -	===============================	===============================
> -	set_current_state();		STORE event_indicated
> -	  smp_store_mb();		wake_up();
> -	    STORE current->state	  <write barrier>
> -	    <general barrier>		  STORE current->state
> -	LOAD event_indicated
> +A memory barrier is implied by wake_up() and co. if and only if they wake
> +something up. The memory barrier here is not necessary to be a general barrier,
> +it only needs to guarantee a STORE preceding this barrier can never be
> +reordered after a LOAD following this barrier(i.e. a STORE-LOAD barrier). This
> +barrier guarantees that the event has been indicated before the waker read the
> +wakee's task state:
> 
> -To repeat, this write memory barrier is present if and only if something
> -is actually awakened.  To see this, consider the following sequence of
> -events, where X and Y are both initially zero:
> +	CPU 1
> +	===============================
> +	STORE event_indicated;
> +	wake_up_process(wakee);
> +	  <STORE-LOAD barrier>
> +	  LOAD wakee->state;
> 
> -	CPU 1				CPU 2
> -	===============================	===============================
> -	X = 1;				STORE event_indicated
> -	smp_mb();			wake_up();
> -	Y = 1;				wait_event(wq, Y == 1);
> -	wake_up();			  load from Y sees 1, no memory barrier
> -					load from X might see 0
> -
> -In contrast, if a wakeup does occur, CPU 2's load from X would be guaranteed
> -to see 1.
> +This barrier pairs with the general barrier implied by set_current_state() on
> +the sleeper side.
> 
>  The available waker functions include:
> 
> -- 
> 2.5.1
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-09 19:28 ` Paul E. McKenney
@ 2015-09-10  2:16   ` Boqun Feng
  2015-09-10 17:55     ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2015-09-10  2:16 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, linux-doc, Jonathan Corbet, Michal Hocko,
	Oleg Nesterov, Peter Zijlstra, David Howells, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 5086 bytes --]

On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote:
> On Tue, Sep 08, 2015 at 09:14:01AM +0800, Boqun Feng wrote:
> > Two examples for barriers in wake_up() and co. in memory-barriers.txt
> > are misleading, along with their explanations:
> > 
> > 1.	The example which wanted to explain the write barrier in
> > 	wake_up() and co. [spotted by Oleg Nesterov <oleg@redhat.com>]
> > 
> > 2.	The example which wanted to explain that the write barriers in
> > 	wake_up() and co. only exist iff a wakeup actually occurs.
> > 
> > For example #1, according to Oleg Nesterov:
> > 
> > >
> > >       The barrier occurs before the task state is cleared
> > >
> > > is not actually right. This is misleading. What is really important is that
> > > we have a barrier before we _read_ the task state. And again, again, the
> > > fact that we actually have the write barrier is just the implementation
> > > detail.
> > >
> > 
> > And the example #2 is actually an example which could explain that the
> > barriers in wait_event() and co. only exist iff a sleep actually occurs.
> > 
> > Further more, these barriers are only used for the correctness of
> > sleeping and waking up, i.e. they exist only to guarantee the ordering
> > of memory accesses to the task states and the global variables
> > indicating an event. Users can't rely on them for other things, so
> > memory-barriers.txt had better to call this out and remove the
> > misleading examples.
> > 
> > This patch removes the misleading examples along with their
> > explanations, calls it out that those implied barriers are only for
> > sleep and wakeup related variables and adds a new example to explain the
> > implied barrier in wake_up() and co.
> > 
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> 
> At this point, I would favor replacing that entire section with a short
> paragraph describing what guarantees are provided, perhaps with an example
> showing what added barriers/locks/whatever are required.  My feeling is
> that we should avoid saying too much about the internals of wait_event()
> and wake_up().

Good idea!

However, I think a few more words would help understanding, so I kept the
original first paragraph and also added a NOTE paragraph which, I admit,
may be a little redundant ;-)

How about the following as the whole new section?


SLEEP AND WAKE-UP FUNCTIONS
---------------------------

Sleeping and waking on an event flagged in global data can be viewed as an
interaction between two pieces of data: the task state of the task waiting for
the event and the global data used to indicate the event.  To make sure that
these appear to happen in the right order, the primitives to begin the process
of going to sleep, and the primitives to initiate a wake up imply certain
barriers.

The memory ordering requirement here can be expressed by two STORE-LOAD
barriers (barriers which guarantee that a STORE preceding them can never be
reordered after a LOAD following them). One STORE-LOAD barrier is needed on the
sleeper/wakee side, before reading a variable used to indicate the event and
after setting the state of the current task. Another STORE-LOAD barrier is
needed on the waker side, before reading the state of the wakee task and after
setting a variable used to indicate the event. These two barriers pair with
each other to avoid race conditions between sleepers/wakees and wakers:

	sleeper/wakee on CPU 1		waker on CPU 2
	========================	========================
	{ wakee->state = TASK_RUNNING, event_indicated = 0 }
	STORE current->state=TASK_INTERRUPTIBLE
	<STORE-LOAD barrier>
	c = LOAD event_indicated	
					STORE event_indicated=1
					<STORE-LOAD barrier>
					s = LOAD wakee->state	

	assert(!(c == 0 && s == TASK_RUNNING));

A STORE-LOAD barrier is implied after setting task state by wait-related functions:

	prepare_to_wait();
	prepare_to_wait_exclusive();
	prepare_to_wait_event();

A STORE-LOAD barrier is implied before reading task state by wake-related functions:

	complete();
	wake_up();
	wake_up_all();
	wake_up_bit();
	wake_up_interruptible();
	wake_up_interruptible_all();
	wake_up_interruptible_nr();
	wake_up_interruptible_poll();
	wake_up_interruptible_sync();
	wake_up_interruptible_sync_poll();
	wake_up_locked();
	wake_up_locked_poll();
	wake_up_nr();
	wake_up_poll();
	wake_up_process();

Make sure an appropriate wake-related function is called after setting the
global data used to indicate an event.

[!] Note that these implied barriers are only for the correctness of sleep and
wake-up, so don't rely on them for anything other than the task states and the
global variables indicating the events.


git log --stat for this is:

 1 file changed, 29 insertions(+), 108 deletions(-)

which I think is better, thanks to your advice ;-)


I will rewrite the commit message and send a new patch if this looks
good to you.

Regards,
Boqun

> 
> Or am I missing something?
> 
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-10  2:16   ` Boqun Feng
@ 2015-09-10 17:55     ` Oleg Nesterov
  2015-09-11 16:59       ` Boqun Feng
  2015-09-17 13:01       ` Peter Zijlstra
  0 siblings, 2 replies; 23+ messages in thread
From: Oleg Nesterov @ 2015-09-10 17:55 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, linux-kernel, linux-doc, Jonathan Corbet,
	Michal Hocko, Peter Zijlstra, David Howells, Linus Torvalds

On 09/10, Boqun Feng wrote:
>
> On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote:
> > My feeling is
> > that we should avoid saying too much about the internals of wait_event()
> > and wake_up().

I feel the same. I simply can't understand what we are trying to
document ;)

For example,

> A STORE-LOAD barrier is implied after setting task state by wait-related functions:
>
> 	prepare_to_wait();
> 	prepare_to_wait_exclusive();
> 	prepare_to_wait_event();

I won't argue, but to me this looks misleading too.

Yes, prepare_to_wait()->set_current_state() implies mb() and thus
a STORE-LOAD barrier.

But this has nothing to do with the explanation above. We do not
need this barrier to avoid the race with wake_up(). Again, again,
we can safely rely on wq->lock and acquire/release semantics.

This barrier is only needed if you do, say,

	CONDITION = 1;

	if (waitqueue_active(wq))
		wake_up(wq);

And note that the code above is wrong without another mb() after
CONDITION = 1.

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-10 17:55     ` Oleg Nesterov
@ 2015-09-11 16:59       ` Boqun Feng
  2015-09-17 13:01       ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2015-09-11 16:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, linux-kernel, linux-doc, Jonathan Corbet,
	Michal Hocko, Peter Zijlstra, David Howells, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 5020 bytes --]

Hi Oleg,

On Thu, Sep 10, 2015 at 07:55:57PM +0200, Oleg Nesterov wrote:
> On 09/10, Boqun Feng wrote:
> >
> > On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote:
> > > My feeling is
> > > that we should avoid saying too much about the internals of wait_event()
> > > and wake_up().
> 
> I feel the same. I simply can't understand what we are trying to
> document ;)
> 

What I think we should document here is what memory ordering guarantees
we can rely on with these sleep/wakeup primitives, or what kind of
barriers these primitives imply, because the structure of
memory-barriers.txt here is:

 (*) Implicit kernel memory barriers.

     - Locking functions.
     - Interrupt disabling functions.
   ->- Sleep and wake-up functions.<-
     - Miscellaneous functions.

> For example,
> 
> > A STORE-LOAD barrier is implied after setting task state by wait-related functions:
> >
> > 	prepare_to_wait();
> > 	prepare_to_wait_exclusive();
> > 	prepare_to_wait_event();
> 
> I won't argue, but to me this looks misleading too.
> 
> Yes, prepare_to_wait()->set_current_state() implies mb() and thus
> a STORE-LOAD barrier.
> 
> But this has nothing to do with the explanation above. We do not
> need this barrier to avoid the race with wake_up(). Again, again,
> we can safely rely on wq->lock and acquire/release semantics.
> 

Yes, you are right. prepare_to_wait*() shouldn't be put here. What should
be put here is set_current_state(), whose STORE-LOAD barrier pairs with
the STORE-LOAD barrier of wake_up_process().

> This barrier is only needed if you do, say,
> 
> 	CONDITION = 1;
> 
> 	if (waitqueue_active(wq))
> 		wake_up(wq);
> 
> And note that the code above is wrong without another mb() after
> CONDITION = 1.
> 

Understood, I admit I didn't realize this before.

To summarize, we have three kinds of data related to sleep/wakeup:

*	CONDITIONs: global data used to indicate events

*	task states

*	wait queues(may not be used, if users use set_current_state() +
	schedule() to sleep and wake_up_process() to wake up)

IIUC, races on wait queues are mostly avoided because of wq->locks,
and if a wait queue is used, races on task states are also avoided
because the states are read and written with a wq->lock held in the
sleep/wakeup functions. So only in two cases do we need STORE-LOAD
barriers to avoid the race:

1.	no wait queue is used (e.g. rcu_boost_kthread): we need a STORE-LOAD
	barrier to order accesses to task states and CONDITIONs, so we have
	barriers in wake_up_process() and set_current_state().

2.	a wait queue is accessed without the wq->lock held (e.g. your
	example): we need a STORE-LOAD barrier to order accesses to wait
	queues and CONDITIONs.

Since case #1 still exists in the kernel, we'd better keep this section in
memory-barriers.txt; however, I'm not sure whether we should mention
case #2 in this section.

Here is a modified version, without mentioning case #2:


SLEEP AND WAKE-UP FUNCTIONS
---------------------------

Sleeping and waking on an event flagged in global data can be viewed as an
interaction between two pieces of data: the task state of the task waiting for
the event and the global data used to indicate the event.  To make sure that
these appear to happen in the right order, the primitives to begin the process
of going to sleep, and the primitives to initiate a wake up imply certain
barriers.

If a wait queue is used, all accesses to task states are protected by the lock
of the wait queue, so the race on task states is avoided. However, if no wait
queue is used, we need some memory ordering guarantee to avoid the race between
sleepers/wakees and wakers.

The memory ordering requirement here can be expressed by two STORE-LOAD
barriers (barriers which guarantee that a STORE preceding them can never be
reordered after a LOAD following them). One STORE-LOAD barrier is needed on the
sleeper/wakee side, before reading a variable used to indicate the event and
after setting the state of the current task. Another STORE-LOAD barrier is
needed on the waker side, before reading the state of the wakee task and after
setting a variable used to indicate the event. These two barriers pair with
each other to avoid race conditions between sleepers/wakees and wakers:

	sleeper/wakee on CPU 1		waker on CPU 2
	========================	========================
	{ wakee->state = TASK_RUNNING, event_indicated = 0 }
	STORE current->state=TASK_INTERRUPTIBLE
	<STORE-LOAD barrier>
	c = LOAD event_indicated	
					STORE event_indicated=1
					<STORE-LOAD barrier>
					s = LOAD wakee->state	

	assert(!(c == 0 && s == TASK_RUNNING));

A STORE-LOAD barrier is implied after setting the task state in
set_current_state() and before reading the task state in wake_up_process().

Make sure to call set_current_state() before reading the global data used to
indicate the event and sleeping, and to call wake_up_process() after setting
the global data used to indicate an event.
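
To state that rule as a usage sketch (kernel-style pseudocode mirroring the
sleeper/waker example earlier in memory-barriers.txt; event_indicated and
event_daemon are that example's illustrative names):

	/* sleeper/wakee */
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);	/* implies the STORE-LOAD barrier */
		if (event_indicated)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

	/* waker */
	event_indicated = 1;
	wake_up_process(event_daemon);		/* implies the STORE-LOAD barrier */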


Regards,
Boqun


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-10 17:55     ` Oleg Nesterov
  2015-09-11 16:59       ` Boqun Feng
@ 2015-09-17 13:01       ` Peter Zijlstra
  2015-09-17 17:01         ` Oleg Nesterov
  2015-09-24 13:21         ` Boqun Feng
  1 sibling, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2015-09-17 13:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Boqun Feng, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Thu, Sep 10, 2015 at 07:55:57PM +0200, Oleg Nesterov wrote:
> On 09/10, Boqun Feng wrote:
> >
> > On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote:
> > > My feeling is
> > > that we should avoid saying too much about the internals of wait_event()
> > > and wake_up().
> 
> I feel the same. I simply can't understand what we are trying to
> document ;)

So I've been sitting on this for a while and figured I'd finish it now.

These are some notes on the scheduler locking and how it provides program
order guarantees on SMP systems.

Included in it are some of the details on this subject, because a wakeup
has two prior states that are of importance: the task's own prior state
and the wakeup state; both should be considered in the 'program order'
flow.

So maybe we can reduce the description in memory-barriers.txt to this
'split' program order guarantee, where a woken task must observe both
its own prior state and the wakeup state.


---
 kernel/sched/core.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 137 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ab415aa15c4..4fffde6f7c08 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1897,6 +1897,143 @@ static void ttwu_queue(struct task_struct *p, int cpu)
 	raw_spin_unlock(&rq->lock);
 }
 
+/*
+ * Notes on process order guarantees on SMP systems.
+ *
+ *
+ *   PREEMPTION/MIGRATION
+ *
+ * Regular preemption/migration is safe because as long as the task is runnable
+ * migrations involve both rq locks, albeit not (necessarily) at the same time.
+ *
+ * So we get (we allow 3 CPU migrations):
+ *
+ *   CPU0            CPU1            CPU2
+ *
+ *   LOCK rq(0)->lock
+ *   sched-out X
+ *   sched-in Y
+ *   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(0)->lock // MB against CPU0
+ *                                   dequeue X
+ *                                   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(1)->lock
+ *                                   enqueue X
+ *                                   UNLOCK rq(1)->lock
+ *
+ *                   LOCK rq(1)->lock // MB against CPU2
+ *                   sched-out Z
+ *                   sched-in X
+ *                   UNLOCK rq(1)->lock
+ *
+ * and the first LOCK rq(0) on CPU2 gives a full order against the UNLOCK rq(0)
+ * on CPU0. Similarly the LOCK rq(1) on CPU1 provides full order against the
+ * UNLOCK rq(1) on CPU2, therefore by the time task X runs on CPU1 it must
+ * observe the state it left behind on CPU0.
+ *
+ *
+ *   BLOCKING -- aka. SLEEP + WAKEUP
+ *
+ * For blocking things are a little more interesting, because when we dequeue
+ * the task, we don't need to acquire the old rq lock in order to migrate it.
+ *
+ * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the task
+ * to CPU2 (the most complex example):
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                    cond = true
+ *
+ *   WMB              WMB (aka smp_mb__before_spinlock)
+ *   LOCK rq(0)->lock LOCK X->pi_lock
+ *
+ *   dequeue X
+ *                    while (X->on_cpu)
+ *                      cpu_relax()
+ *   sched-out X
+ *   WMB
+ *   X->on_cpu = 0
+ *                    RMB
+ *                    X->state = WAKING
+ *                    set_task_cpu(X,1)
+ *                      WMB
+ *                      ti(X)->cpu = 2
+ *
+ *                    llist_add(X, rq(2)) // MB
+ *                                          llist_del_all() // MB
+ *
+ *                                          LOCK rq(2)->lock
+ *                                          enqueue X
+ *                                          X->state = RUNNING
+ *                                          UNLOCK rq(2)->lock
+ *
+ *                                          LOCK rq(2)->lock
+ *                                          sched-out Z
+ *                                          sched-in X
+ *                                          UNLOCK rq(2)->lock
+ *
+ *                                          if (cond) // _TRUE_
+ *                    UNLOCK X->pi_lock
+ *   UNLOCK rq(0)->lock
+ *
+ * So in this case the scheduler does not provide an obvious full barrier; but
+ * stores on CPU0 cannot get delayed past the MB, the llist primitives order
+ * between CPU1 and CPU2 and X's loads on CPU2 cannot get before the last LOCK.
+ *
+ * Which again leads to the guarantee that by the time X gets to run on CPU2
+ * it must observe the state it left behind on CPU0.
+ *
+ * However; for blocking there is a second guarantee we must provide, namely we
+ * must observe the state that led to our wakeup. That is, not only must X
+ * observe its own prior state, it must also observe the @cond store.
+ *
+ * This too is achieved in the above, purely by the llist primitives ordering
+ * CPU1 to CPU2.
+ *
+ * There is however a much more interesting case for this guarantee, where X
+ * never makes it off CPU0:
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                    cond = true
+ *
+ *   WMB              WMB (aka smp_mb__before_spinlock)
+ *                    LOCK X->pi_lock
+ *
+ *                    if (X->on_rq)
+ *                      LOCK rq(0)->lock
+ *                      X->state = RUNNING
+ *                      UNLOCK rq(0)->lock
+ *
+ *   LOCK rq(0)->lock // MB against CPU1
+ *   UNLOCK rq(0)->lock
+ *
+ *   if (cond) // _TRUE_
+ *
+ *                    UNLOCK X->pi_lock
+ *
+ * Here our task X never quite leaves the CPU, the wakeup happens before we can
+ * dequeue and schedule someone else. In this case we must still observe cond
+ * after our call to schedule() completes.
+ *
+ * This is achieved by the smp_mb__before_spinlock() WMB which ensures the store
+ * cannot leak inside the LOCK, and LOCK rq(0)->lock on CPU0 provides full order
+ * against the UNLOCK rq(0)->lock from CPU1. Furthermore our load of cond cannot
+ * happen before this same LOCK.
+ *
+ * Therefore, again, we're good.
+ */
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-17 13:01       ` Peter Zijlstra
@ 2015-09-17 17:01         ` Oleg Nesterov
  2015-09-18  6:49           ` Peter Zijlstra
  2015-09-24 13:21         ` Boqun Feng
  1 sibling, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2015-09-17 17:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On 09/17, Peter Zijlstra wrote:
>
> Included in it are some of the details on this subject, because a wakeup
> has two prior states that are of importance, the tasks own prior state
> and the wakeup state, both should be considered in the 'program order'
> flow.

Great. Just one question,

> + *   BLOCKING -- aka. SLEEP + WAKEUP
> + *
> + * For blocking things are a little more interesting, because when we dequeue
> + * the task, we don't need to acquire the old rq lock in order to migrate it.
> + *
> + * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the task
> + * to CPU2 (the most complex example):
> + *
> + *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending)
> + *
> + *   X->state = UNINTERRUPTIBLE
> + *   MB
> + *   if (cond)
> + *     break
> + *                    cond = true
> + *
> + *   WMB              WMB (aka smp_mb__before_spinlock)

Yes, both CPU's do WMB-aka-smp_mb__before_spinlock...

But afaics in this particular case we do not really need them?
So perhaps we should not even mention them?

Because (if I am right) this can confuse the reader who will try
to understand how/where do we rely on these barriers.

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-17 17:01         ` Oleg Nesterov
@ 2015-09-18  6:49           ` Peter Zijlstra
  2015-09-21 17:46             ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-09-18  6:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Boqun Feng, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Thu, Sep 17, 2015 at 07:01:11PM +0200, Oleg Nesterov wrote:
> On 09/17, Peter Zijlstra wrote:
> >
> > Included in it are some of the details on this subject, because a wakeup
> > has two prior states that are of importance, the tasks own prior state
> > and the wakeup state, both should be considered in the 'program order'
> > flow.
> 
> Great. Just one question,
> 
> > + *   BLOCKING -- aka. SLEEP + WAKEUP
> > + *
> > + * For blocking things are a little more interesting, because when we dequeue
> > + * the task, we don't need to acquire the old rq lock in order to migrate it.
> > + *
> > + * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the task
> > + * to CPU2 (the most complex example):
> > + *
> > + *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending)
> > + *
> > + *   X->state = UNINTERRUPTIBLE
> > + *   MB
> > + *   if (cond)
> > + *     break
> > + *                    cond = true
> > + *
> > + *   WMB              WMB (aka smp_mb__before_spinlock)
> 
> Yes, both CPU's do WMB-aka-smp_mb__before_spinlock...
> 
> But afaics in this particular case we do not really need them?
> So perhaps we should not even mention them?
> 
> Because (if I am right) this can confuse the reader who will try
> to understand how/where do we rely on these barriers.

Good point. Initially I put all barriers in, but now that we've figured
out which are important (the text is correct, right? please double
check) we can remove the rest.


* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-18  6:49           ` Peter Zijlstra
@ 2015-09-21 17:46             ` Oleg Nesterov
  2015-10-06 16:04               ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2015-09-21 17:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On 09/18, Peter Zijlstra wrote:
>
> the text is correct, right?

Yes, it looks good to me and helpful.

But damn. I forgot why exactly try_to_wake_up() needs rmb() after
->on_cpu check... It looks reasonable in any case, but I do not
see any strong reason immediately.

Say,

	p->sched_contributes_to_load = !!task_contributes_to_load(p);
	p->state = TASK_WAKING;

we can actually do this before "while (p->on_cpu)", afaics. However
we must not do this before the previous p->on_rq check.

So perhaps this rmb() helps to ensure task_contributes_to_load() can't
happen before the p->on_rq check...

As for "p->state = TASK_WAKING" we have the control dependency in both
cases. But the modern fashion suggests using _CTRL(). Although cpu_relax()
should imply barrier(), afaik this is not documented.

In short, I got lost ;) Now I don't even understand why we do not need
another rmb() between p->on_rq and p->on_cpu. Suppose a thread T does

	set_current_state(...);
	schedule();

it can be preempted in between, after that we have "on_rq && !on_cpu".
Then it gets CPU again and calls schedule() which clears on_rq.

What guarantees that if ttwu() sees on_rq == 0 cleared by schedule()
then it can _not_ still see the old value of on_cpu == 0?

Oleg.



* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-17 13:01       ` Peter Zijlstra
  2015-09-17 17:01         ` Oleg Nesterov
@ 2015-09-24 13:21         ` Boqun Feng
  2015-10-06 16:06           ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2015-09-24 13:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon


Hi Peter,

On Thu, Sep 17, 2015 at 03:01:26PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 10, 2015 at 07:55:57PM +0200, Oleg Nesterov wrote:
> > On 09/10, Boqun Feng wrote:
> > >
> > > On Wed, Sep 09, 2015 at 12:28:22PM -0700, Paul E. McKenney wrote:
> > > > My feeling is
> > > > that we should avoid saying too much about the internals of wait_event()
> > > > and wake_up().
> > 
> > I feel the same. I simply can't understand what we are trying to
> > document ;)
> 
> So I've been sitting on this for a while and figured I'd finish it now.
> 
> It are some notes on the scheduler locking and how it provides program
> order guarantees on SMP systems.
> 
> Included in it are some of the details on this subject, because a wakeup
> has two prior states that are of importance, the tasks own prior state
> and the wakeup state, both should be considered in the 'program order'
> flow.
> 

Great and very helpful ;-)

> So maybe we can reduce the description in memory-barriers to this
> 'split' program order guarantee, where a woken task must observe both
> its own prior state and its wakee state.
                              ^^^^^
I think you mean "waker" here, right?

And the waker is not necessarily the same task who set the @cond to
true, right? If so, I feel like it's really hard to *use* this 'split'
program order guarantee in other places than sleep/wakeup itself. Could
you give an example? Thank you.

Regards,
Boqun



* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-21 17:46             ` Oleg Nesterov
@ 2015-10-06 16:04               ` Peter Zijlstra
  2015-10-06 16:24                 ` Peter Zijlstra
                                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Peter Zijlstra @ 2015-10-06 16:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Boqun Feng, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Mon, Sep 21, 2015 at 07:46:11PM +0200, Oleg Nesterov wrote:
> On 09/18, Peter Zijlstra wrote:
> >
> > the text is correct, right?
> 
> Yes, it looks good to me and helpful.
> 
> But damn. I forgot why exactly try_to_wake_up() needs rmb() after
> ->on_cpu check... It looks reasonable in any case, but I do not
> see any strong reason immediately.

I read it like the smp_rmb() we have for
acquire__after_spin_is_unlocked. Except, as you note below, we need
an smp_read_barrier_depends for control barriers as well....

(I'm starting to think we have more control deps than we were
thinking...)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1947,7 +1947,13 @@ try_to_wake_up(struct task_struct *p, un
 	while (p->on_cpu)
 		cpu_relax();
 	/*
-	 * Pairs with the smp_wmb() in finish_lock_switch().
+	 * Combined with the control dependency above, we have an effective
+	 * smp_load_acquire() without the need for full barriers.
+	 *
+	 * Pairs with the smp_store_release() in finish_lock_switch().
+	 *
+	 * This ensures that tasks getting woken will be fully ordered against
+	 * their previous state and preserve Program Order.
 	 */
 	smp_rmb();
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1073,6 +1073,9 @@ static inline void finish_lock_switch(st
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
 	 *
+	 * In particular, the load of prev->state in finish_task_switch() must
+	 * happen before this.
+	 *
 	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
 	smp_store_release(&prev->on_cpu, 0);


Updates the comments to clarify the release/acquire pair on p->on_cpu.

> Say,
> 
> 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> 	p->state = TASK_WAKING;
> 
> we can actually do this before "while (p->on_cpu)", afaics. However
> we must not do this before the previous p->on_rq check.

No, we must not touch the task before p->on_cpu is cleared; up until
that point the task is owned by the 'previous' CPU.

> So perhaps this rmb() helps to ensure task_contributes_to_load() can't
> happen before p->on_rq check...
> 
> As for "p->state = TASK_WAKING" we have the control dependency in both
> cases. But the modern fashion suggests to use _CTRL().

Yes, but I'm not sure we should go write:

	while (READ_ONCE_CTRL(p->on_cpu))
		cpu_relax();

Or:

	while (p->on_cpu)
		cpu_relax();

	smp_read_barrier_depends();

It seems to me that doing the smp_mb() (for Alpha) inside the loop might
be sub-optimal.

That said, it would be good if Paul (or anyone really) can explain to me
the reason for: 5af4692a75da ("smp: Make control dependencies work on
Alpha, improve documentation"). The Changelog simply states that Alpha
needs the mb, but not how/why etc.

> Although cpu_relax()
> should imply barrier(), but afaik this is not documented.

I think we're relying on that in many places..

> In short, I got lost ;) Now I don't even understand why we do not need
> another rmb() between p->on_rq and p->on_cpu. Suppose a thread T does
> 
> 	set_current_state(...);
> 	schedule();
> 
> it can be preempted in between, after that we have "on_rq && !on_cpu".
> Then it gets CPU again and calls schedule() which clears on_rq.
> 
> What guarantees that if ttwu() sees on_rq == 0 cleared by schedule()
> then it can _not_ still see the old value of on_cpu == 0?

Right, let me go have a think about that ;-)


* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-09-24 13:21         ` Boqun Feng
@ 2015-10-06 16:06           ` Peter Zijlstra
  2015-10-11 15:26             ` Boqun Feng
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-10-06 16:06 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Oleg Nesterov, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Thu, Sep 24, 2015 at 09:21:22PM +0800, Boqun Feng wrote:
> > Included in it are some of the details on this subject, because a wakeup
> > has two prior states that are of importance, the tasks own prior state
> > and the wakeup state, both should be considered in the 'program order'
> > flow.
> > 
> 
> Great and very helpful ;-)
> 
> > So maybe we can reduce the description in memory-barriers to this
> > 'split' program order guarantee, where a woken task must observe both
> > its own prior state and its wakee state.
>                               ^^^^^
> I think you mean "waker" here, right?

Yes.

> And the waker is not necessarily the same task who set the @cond to
> true, right? 

It should be.

> If so, I feel like it's really hard to *use* this 'split'
> program order guarantee in other places than sleep/wakeup itself. Could
> you give an example? Thank you.

It was not meant to be used in any other scenario; the 'split' PO really
is part of the whole sleep/wakeup. It does not apply to anything else.


* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-06 16:04               ` Peter Zijlstra
@ 2015-10-06 16:24                 ` Peter Zijlstra
  2015-10-06 16:35                   ` Will Deacon
  2015-10-06 19:57                 ` Peter Zijlstra
  2015-10-07 11:10                 ` Peter Zijlstra
  2 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-10-06 16:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Boqun Feng, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Tue, Oct 06, 2015 at 06:04:50PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 21, 2015 at 07:46:11PM +0200, Oleg Nesterov wrote:
> > On 09/18, Peter Zijlstra wrote:
> > >
> > > the text is correct, right?
> > 
> > Yes, it looks good to me and helpful.
> > 
> > But damn. I forgot why exactly try_to_wake_up() needs rmb() after
> > ->on_cpu check... It looks reasonable in any case, but I do not
> > see any strong reason immediately.
> 
> I read it like the smp_rmb() we have for
> acquire__after_spin_is_unlocked. Except, as you note below, we need
> an smp_read_barrier_depends for control barriers as well....

> Yes, but I'm not sure we should go write:
> 
> 	while (READ_ONCE_CTRL(p->on_cpu))
> 		cpu_relax();
> 
> Or:
> 
> 	while (p->on_cpu)
> 		cpu_relax();
> 
> 	smp_read_barrier_depends();
> 
> It seems to me that doing the smp_mb() (for Alpha) inside the loop might
> be sub-optimal.

And also referring to:

  lkml.kernel.org/r/20150812133109.GA8266@redhat.com

Do we want something like this?

#define smp_spin_acquire(cond) do {		\
	while (cond)				\
		cpu_relax();			\
	smp_read_barrier_depends(); /* ctrl */	\
	smp_rmb(); /* ctrl + rmb := acquire */	\
} while (0)

And use it like:

	smp_spin_acquire(raw_spin_is_locked(&task->pi_lock));

That might work for your task_work_run() and the scheduler case,
although it might be somewhat awkward for sem_wait_array().


* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-06 16:24                 ` Peter Zijlstra
@ 2015-10-06 16:35                   ` Will Deacon
  0 siblings, 0 replies; 23+ messages in thread
From: Will Deacon @ 2015-10-06 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Boqun Feng, Paul E. McKenney, linux-kernel,
	linux-doc, Jonathan Corbet, Michal Hocko, David Howells,
	Linus Torvalds

On Tue, Oct 06, 2015 at 06:24:23PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 06, 2015 at 06:04:50PM +0200, Peter Zijlstra wrote:
> > On Mon, Sep 21, 2015 at 07:46:11PM +0200, Oleg Nesterov wrote:
> > > On 09/18, Peter Zijlstra wrote:
> > > >
> > > > the text is correct, right?
> > > 
> > > Yes, it looks good to me and helpful.
> > > 
> > > But damn. I forgot why exactly try_to_wake_up() needs rmb() after
> > > ->on_cpu check... It looks reasonable in any case, but I do not
> > > see any strong reason immediately.
> > 
> > I read it like the smp_rmb() we have for
> > acquire__after_spin_is_unlocked. Except, as you note below, we need
> > an smp_read_barrier_depends for control barriers as well....
> 
> > Yes, but I'm not sure we should go write:
> > 
> > 	while (READ_ONCE_CTRL(p->on_cpu))
> > 		cpu_relax();
> > 
> > Or:
> > 
> > 	while (p->on_cpu)
> > 		cpu_relax();
> > 
> > 	smp_read_barrier_depends();
> > 
> > It seems to me that doing the smp_mb() (for Alpha) inside the loop might
> > be sub-optimal.
> 
> And also referring to:
> 
>   lkml.kernel.org/r/20150812133109.GA8266@redhat.com
> 
> Do we want something like this?
> 
> #define smp_spin_acquire(cond) do {		\
> 	while (cond)				\
> 		cpu_relax();			\
> 	smp_read_barrier_depends(); /* ctrl */	\
> 	smp_rmb(); /* ctrl + rmb := acquire */	\
> } while (0)
> 
> And use it like:
> 
> 	smp_spin_acquire(raw_spin_is_locked(&task->pi_lock));
> 
> That might work for your task_work_run() and the scheduler case,
> although it might be somewhat awkward for sem_wait_array().

I could *really* use something like this for implementing power-saving
busy loops for arch/arm64 (i.e. in the qrwlock code). We have a WFE
instruction (wait for event) that can stop the processor clock and resume
it when the exclusive monitor is cleared (i.e. a cacheline migrates to
another CPU). That means we can implement a targeted wake-up when an
unlocker writes to a node in a queued lock, which isn't something
expressible with cpu_relax alone.

Will


* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-06 16:04               ` Peter Zijlstra
  2015-10-06 16:24                 ` Peter Zijlstra
@ 2015-10-06 19:57                 ` Peter Zijlstra
  2015-10-07 11:10                 ` Peter Zijlstra
  2 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2015-10-06 19:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Boqun Feng, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

> On Mon, Sep 21, 2015 at 07:46:11PM +0200, Oleg Nesterov wrote:

> > In short, I got lost ;) Now I don't even understand why we do not need
> > another rmb() between p->on_rq and p->on_cpu. Suppose a thread T does
> > 
> > 	set_current_state(...);
> > 	schedule();
> > 
> > it can be preempted in between, after that we have "on_rq && !on_cpu".
> > Then it gets CPU again and calls schedule() which clears on_rq.
> > 
> > What guarantees that if ttwu() sees on_rq == 0 cleared by schedule()
> > then it can _not_ still see the old value of on_cpu == 0?

I think you're right. Does the below adequately explain things?

I'll have another look tomorrow to see if I still agree with myself, but
for now I think I've convinced myself you're right.

---
Subject: sched: Fix race in try_to_wake_up() vs schedule()

Oleg noticed that it's possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.

Even though the overlap is very limited (the task is in the middle of
being scheduled out), it could still result in corruption of the
scheduler data structures.


	CPU0				CPU1

	set_current_state(...)

	<preempt_schedule>
	  context_switch(X, Y)
	    prepare_lock_switch(Y)
	      Y->on_cpu = 1;
	    finish_lock_switch(X)
	      store_release(X->on_cpu, 0);

					try_to_wake_up(X)
					  LOCK(p->pi_lock);

					  t = X->on_cpu; // 0

	  context_switch(Y, X)
	    prepare_lock_switch(X)
	      X->on_cpu = 1;
	    finish_lock_switch(Y)
	      store_release(Y->on_cpu, 0);
	</preempt_schedule>

	schedule();
	  deactivate_task(X);
	  X->on_rq = 0;

					  if (X->on_rq) // false

					  if (t) while (X->on_cpu)
					    cpu_relax();

	  context_switch(X, ..)
	    finish_lock_switch(X)
	      store_release(X->on_cpu, 0);


Avoid the load of X->on_cpu being hoisted over the X->on_rq load.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2084,6 +2084,25 @@ try_to_wake_up(struct task_struct *p, un
 
 #ifdef CONFIG_SMP
 	/*
+	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
+	 * possible to, falsely, observe p->on_cpu == 0.
+	 *
+	 * One must be running (->on_cpu == 1) in order to remove oneself
+	 * from the runqueue.
+	 *
+	 *  [S] ->on_cpu = 1;	[L] ->on_rq
+	 *      UNLOCK rq->lock
+	 *			RMB
+	 *      LOCK   rq->lock
+	 *  [S] ->on_rq = 0;    [L] ->on_cpu
+	 *
+	 * Pairs with the full barrier implied in the UNLOCK+LOCK on rq->lock
+	 * from the consecutive calls to schedule(); the first switching to our
+	 * task, the second putting it to sleep.
+	 */
+	smp_rmb();
+
+	/*
 	 * If the owning (remote) cpu is still in the middle of schedule() with
 	 * this task as prev, wait until its done referencing the task.
 	 */


* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-06 16:04               ` Peter Zijlstra
  2015-10-06 16:24                 ` Peter Zijlstra
  2015-10-06 19:57                 ` Peter Zijlstra
@ 2015-10-07 11:10                 ` Peter Zijlstra
  2015-10-07 15:40                   ` Paul E. McKenney
  2 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-10-07 11:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Boqun Feng, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Tue, Oct 06, 2015 at 06:04:50PM +0200, Peter Zijlstra wrote:
> That said, it would be good if Paul (or anyone really) can explain to me
> the reason for: 5af4692a75da ("smp: Make control dependencies work on
> Alpha, improve documentation"). The Changelog simply states that Alpha
> needs the mb, but not how/why etc.

Also, I suppose Documentation/circular-buffer.txt needs help.


* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-07 11:10                 ` Peter Zijlstra
@ 2015-10-07 15:40                   ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2015-10-07 15:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Boqun Feng, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Wed, Oct 07, 2015 at 01:10:24PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 06, 2015 at 06:04:50PM +0200, Peter Zijlstra wrote:
> > That said, it would be good if Paul (or anyone really) can explain to me
> > the reason for: 5af4692a75da ("smp: Make control dependencies work on
> > Alpha, improve documentation"). The Changelog simply states that Alpha
> > needs the mb, but not how/why etc.

"Alpha AXP Architecture Reference Manual, Second Edition", Sites & Witek,
page 5-20, Section 5.6.3, "Implied Barriers":

	In Alpha AXP, there are no implied barriers.  If an implied
	barrier is needed for functionally correct access to shared
	data, it must be written as an explicit instruction.  (Software
	must explicitly include any needed MB, WMB, or CALL_PAL IMB
	instructions.)

On exactly how the hardware might choose not to respect control
dependencies, I must defer to someone who knows about the Alpha hardware.
On my last opportunity to discuss this with Alpha architects, I was
concerned only about data dependencies, and didn't think to ask about
control dependencies.

> Also, I suppose Documentation/circular-buffer.txt needs help.

Who wrote that???  ;-)

How about the following?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt
index 88951b179262..c71c0cab7bbb 100644
--- a/Documentation/circular-buffers.txt
+++ b/Documentation/circular-buffers.txt
@@ -161,7 +161,7 @@ The producer will look something like this:
 
 	unsigned long head = buffer->head;
 	/* The spin_unlock() and next spin_lock() provide needed ordering. */
-	unsigned long tail = ACCESS_ONCE(buffer->tail);
+	unsigned long tail = READ_ONCE_CTRL(buffer->tail);
 
 	if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
 		/* insert one item into the buffer */
@@ -222,7 +222,7 @@ This will instruct the CPU to make sure the index is up to date before reading
 the new item, and then it shall make sure the CPU has finished reading the item
 before it writes the new tail pointer, which will erase the item.
 
-Note the use of ACCESS_ONCE() and smp_load_acquire() to read the
+Note the use of READ_ONCE_CTRL() and smp_load_acquire() to read the
 opposition index.  This prevents the compiler from discarding and
 reloading its cached value - which some compilers will do across
 smp_read_barrier_depends().  This isn't strictly needed if you can



* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-06 16:06           ` Peter Zijlstra
@ 2015-10-11 15:26             ` Boqun Feng
  2015-10-12  0:40               ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2015-10-11 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Paul E. McKenney, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon


On Tue, Oct 06, 2015 at 06:06:50PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 24, 2015 at 09:21:22PM +0800, Boqun Feng wrote:
> > > Included in it are some of the details on this subject, because a wakeup
> > > has two prior states that are of importance, the tasks own prior state
> > > and the wakeup state, both should be considered in the 'program order'
> > > flow.
> > > 
> > 
> > Great and very helpful ;-)
> > 
> > > So maybe we can reduce the description in memory-barriers to this
> > > 'split' program order guarantee, where a woken task must observe both
> > > its own prior state and its wakee state.
> >                               ^^^^^
> > I think you mean "waker" here, right?
> 
> Yes.
> 
> > And the waker is not necessarily the same task who set the @cond to
> > true, right? 
> 
> It should be.
> 
> > If so, I feel like it's really hard to *use* this 'split'
> > program order guarantee in other places than sleep/wakeup itself. Could
> > you give an example? Thank you.
> 
> It was not meant to be used in any other scenario; the 'split' PO really
> is part of the whole sleep/wakeup. It does not apply to anything else.

Got it. So at this point, I think it's better to remove the entire
"Sleep and wake-up functions" section in memory-barriers.txt, because
this ordering guarantee is not for users other than sleep/wakeup. Any
concern, Paul?

Regards,
Boqun



* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-11 15:26             ` Boqun Feng
@ 2015-10-12  0:40               ` Paul E. McKenney
  2015-10-12  9:06                 ` Boqun Feng
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2015-10-12  0:40 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Oleg Nesterov, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Sun, Oct 11, 2015 at 11:26:40PM +0800, Boqun Feng wrote:
> On Tue, Oct 06, 2015 at 06:06:50PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 24, 2015 at 09:21:22PM +0800, Boqun Feng wrote:
> > > > Included in it are some of the details on this subject, because a wakeup
> > > > has two prior states that are of importance, the tasks own prior state
> > > > and the wakeup state, both should be considered in the 'program order'
> > > > flow.
> > > > 
> > > 
> > > Great and very helpful ;-)
> > > 
> > > > So maybe we can reduce the description in memory-barriers to this
> > > > 'split' program order guarantee, where a woken task must observe both
> > > > its own prior state and its wakee state.
> > >                               ^^^^^
> > > I think you mean "waker" here, right?
> > 
> > Yes.
> > 
> > > And the waker is not necessarily the same task who set the @cond to
> > > true, right? 
> > 
> > It should be.
> > 
> > > If so, I feel like it's really hard to *use* this 'split'
> > > program order guarantee in other places than sleep/wakeup itself. Could
> > > you give an example? Thank you.
> > 
> > It was not meant to be used in any other scenario; the 'split' PO really
> > is part of the whole sleep/wakeup. It does not apply to anything else.
> 
> Got it. So at this point, I think it's better to remove the entire
> "Sleep and wake-up functions" section in memory-barriers.txt. Because
> this order guarantee is not for other users except sleep/wakeup. Any
> concern, Paul?

The concern I have with just removing it is that it is all too easy for
people to assume that they provide ordering.  So we should at least have
a section stating clearly that ordering is not guaranteed without help
from locks, release-acquire, explicit memory barriers, etc.

							Thanx, Paul



* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-12  0:40               ` Paul E. McKenney
@ 2015-10-12  9:06                 ` Boqun Feng
  2015-10-12 11:54                   ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2015-10-12  9:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Oleg Nesterov, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon


On Sun, Oct 11, 2015 at 05:40:44PM -0700, Paul E. McKenney wrote:
> On Sun, Oct 11, 2015 at 11:26:40PM +0800, Boqun Feng wrote:
> > On Tue, Oct 06, 2015 at 06:06:50PM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 24, 2015 at 09:21:22PM +0800, Boqun Feng wrote:
> > > > > Included in it are some of the details on this subject, because a wakeup
> > > > > has two prior states that are of importance, the tasks own prior state
> > > > > and the wakeup state, both should be considered in the 'program order'
> > > > > flow.
> > > > > 
> > > > 
> > > > Great and very helpful ;-)
> > > > 
> > > > > So maybe we can reduce the description in memory-barriers to this
> > > > > 'split' program order guarantee, where a woken task must observe both
> > > > > its own prior state and its wakee state.
> > > >                               ^^^^^
> > > > I think you mean "waker" here, right?
> > > 
> > > Yes.
> > > 
> > > > And the waker is not necessarily the same task who set the @cond to
> > > > true, right? 
> > > 
> > > It should be.
> > > 
> > > > If so, I feel like it's really hard to *use* this 'split'
> > > > program order guarantee in other places than sleep/wakeup itself. Could
> > > > you give an example? Thank you.
> > > 
> > > It was not meant to be used in any other scenario; the 'split' PO really
> > > is part of the whole sleep/wakeup. It does not apply to anything else.
> > 
> > Got it. So at this point, I think it's better to remove the entire
> > "Sleep and wake-up functions" section in memory-barriers.txt. Because
> > this order guarantee is not for other users except sleep/wakeup. Any
> > concern, Paul?
> 
> The concern I have with just removing it is that it is all too easy for
> people to assume that they provide ordering.  So we should at least have

Understood.

But, IMO, the position of this section is already misleading:

(*) Implicit kernel memory barriers.
     - Locking functions.
     - Interrupt disabling functions.
   ->- Sleep and wake-up functions.<-
     - Miscellaneous functions.

I read it as saying that sleep and wake-up functions provide some kernel
memory barriers which we can use *externally* (outside sleep/wakeup
themselves).

So how about something like:

(*) Barriers only for internal use
     - Sleep and wake-up functions.

Regards,
Boqun

> a section stating clearly that ordering is not guaranteed without help
> from locks, release-acquire, explicit memory barriers, etc.
> 
> 							Thanx, Paul
> 



* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-12  9:06                 ` Boqun Feng
@ 2015-10-12 11:54                   ` Peter Zijlstra
  2015-10-12 13:09                     ` Boqun Feng
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-10-12 11:54 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Mon, Oct 12, 2015 at 05:06:36PM +0800, Boqun Feng wrote:
> Understood.
> 
> But, IMO, the position of this section is already misleading:
> 
> (*) Implicit kernel memory barriers.
>      - Locking functions.
>      - Interrupt disabling functions.
>    ->- Sleep and wake-up functions.<-
>      - Miscellaneous functions.
> 
> I read it as that sleep and wake-up functions provide some kernel memory
> barriers which we can use *externally*(outside sleep/wakeup themselves).

I think it is useful to state that the primitives handle the ordering
between the waker and wakee wrt the 'blocking' state.

But I've not put much thought into wording. I wanted to finish the
process-order 'comment' patch first.


* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-12 11:54                   ` Peter Zijlstra
@ 2015-10-12 13:09                     ` Boqun Feng
  2015-10-12 16:26                       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2015-10-12 13:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon


On Mon, Oct 12, 2015 at 01:54:38PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 12, 2015 at 05:06:36PM +0800, Boqun Feng wrote:
> > Understood.
> > 
> > But, IMO, the position of this section is already misleading:
> > 
> > (*) Implicit kernel memory barriers.
> >      - Locking functions.
> >      - Interrupt disabling functions.
> >    ->- Sleep and wake-up functions.<-
> >      - Miscellaneous functions.
> > 
> > I read it as saying that sleep and wake-up functions provide some kernel memory
> > barriers which we can use *externally* (outside sleep/wakeup themselves).
> 
> I think it is useful to state that the primitives handle the ordering
> between the waker and wakee wrt the 'blocking' state.
> 

I agree that's useful; however, the 'blocking' state is something
internal to sleep and wakeup, right? I'm not sure whether the users of
wake_up() and wait_event() will care much about this, or whether they
need to understand it in that much detail to use wake_up() and
wait_event() correctly.

I treat this part of memory-barriers.txt as an API document to describe
the implicit barriers in some primitives, which can be used *externally*
by someone, but anyway, that's just my own opinion ;-)

> But I've not put much thought into wording. I wanted to finish the
> process-order 'comment' patch first.

Of course. Actually your 'comment' patch is the reason why I think this
section may be removed.

Regards,
Boqun



* Re: [PATCH] Documentation: Remove misleading examples of the barriers in wake_*()
  2015-10-12 13:09                     ` Boqun Feng
@ 2015-10-12 16:26                       ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2015-10-12 16:26 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, Oleg Nesterov, linux-kernel, linux-doc,
	Jonathan Corbet, Michal Hocko, David Howells, Linus Torvalds,
	Will Deacon

On Mon, Oct 12, 2015 at 09:09:24PM +0800, Boqun Feng wrote:
> On Mon, Oct 12, 2015 at 01:54:38PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 12, 2015 at 05:06:36PM +0800, Boqun Feng wrote:
> > > Understood.
> > > 
> > > But, IMO, the position of this section is already misleading:
> > > 
> > > (*) Implicit kernel memory barriers.
> > >      - Locking functions.
> > >      - Interrupt disabling functions.
> > >    ->- Sleep and wake-up functions.<-
> > >      - Miscellaneous functions.
> > > 
> > > I read it as saying that sleep and wake-up functions provide some kernel memory
> > > barriers which we can use *externally* (outside sleep/wakeup themselves).
> > 
> > I think it is useful to state that the primitives handle the ordering
> > between the waker and wakee wrt the 'blocking' state.
> > 
> 
> I agree that's useful, however, the 'blocking' state is something
> internal for sleep and wakeup, right? 

Not entirely; it's also the @cond thing in wait queues, i.e.:

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (@cond)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

vs.

	@cond = true;
	wake_up_process(p);


So we guarantee that 'p' will see the @cond stores IF it does the
wakeup. (If it does not, i.e. 'p' wasn't sleeping, any guarantee is out
the window.)

> I'm not sure whether the users of
> wake_up() and wait_event() will care much about this, or whether they
> need to understand it in that much detail to use wake_up() and
> wait_event() correctly.

I think it's mostly natural; but it explains why you don't have to do:

	wait_event(wq, @cond);

vs.

	@cond = true;
	smp_wmb();
	wake_up(wq);

(or worse...)


> > But I've not put much thought into wording. I wanted to finish the
> > process-order 'comment' patch first.
> 
> Of course. Actually your 'comment' patch is the reason why I think this
> section may be removed.

Yes, that is another option, referring to the comment, once that's
sorted.


Thread overview: 23+ messages
2015-09-08  1:14 [PATCH] Documentation: Remove misleading examples of the barriers in wake_*() Boqun Feng
2015-09-09 19:28 ` Paul E. McKenney
2015-09-10  2:16   ` Boqun Feng
2015-09-10 17:55     ` Oleg Nesterov
2015-09-11 16:59       ` Boqun Feng
2015-09-17 13:01       ` Peter Zijlstra
2015-09-17 17:01         ` Oleg Nesterov
2015-09-18  6:49           ` Peter Zijlstra
2015-09-21 17:46             ` Oleg Nesterov
2015-10-06 16:04               ` Peter Zijlstra
2015-10-06 16:24                 ` Peter Zijlstra
2015-10-06 16:35                   ` Will Deacon
2015-10-06 19:57                 ` Peter Zijlstra
2015-10-07 11:10                 ` Peter Zijlstra
2015-10-07 15:40                   ` Paul E. McKenney
2015-09-24 13:21         ` Boqun Feng
2015-10-06 16:06           ` Peter Zijlstra
2015-10-11 15:26             ` Boqun Feng
2015-10-12  0:40               ` Paul E. McKenney
2015-10-12  9:06                 ` Boqun Feng
2015-10-12 11:54                   ` Peter Zijlstra
2015-10-12 13:09                     ` Boqun Feng
2015-10-12 16:26                       ` Peter Zijlstra
