All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] slow_work_thread() should do the exclusive wait
@ 2009-04-13 18:17 Oleg Nesterov
  2009-04-13 19:03 ` Trond Myklebust
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-13 18:17 UTC (permalink / raw)
  To: Andrew Morton, David Howells
  Cc: Serge Hallyn, Steve Dickson, Trond Myklebust, Al Viro,
	Daire Byrne, linux-kernel

slow_work_thread() sleeps on slow_work_thread_wq without WQ_FLAG_EXCLUSIVE,
this means that slow_work_enqueue()->__wake_up(nr_exclusive => 1) wakes up
all kslowd threads. Afaics this is not what we want, change slow_work_thread()
to use prepare_to_wait_exclusive().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>

--- 6.30/kernel/slow-work.c~1_SW_EXCLUSIVE	2009-04-06 00:03:42.000000000 +0200
+++ 6.30/kernel/slow-work.c	2009-04-13 19:40:20.000000000 +0200
@@ -372,8 +372,8 @@ static int slow_work_thread(void *_data)
 		vsmax *= atomic_read(&slow_work_thread_count);
 		vsmax /= 100;
 
-		prepare_to_wait(&slow_work_thread_wq, &wait,
-				TASK_INTERRUPTIBLE);
+		prepare_to_wait_exclusive(&slow_work_thread_wq, &wait,
+						TASK_INTERRUPTIBLE);
 		if (!freezing(current) &&
 		    !slow_work_threads_should_exit &&
 		    !slow_work_available(vsmax) &&


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 18:17 [PATCH] slow_work_thread() should do the exclusive wait Oleg Nesterov
@ 2009-04-13 19:03 ` Trond Myklebust
  2009-04-13 19:14   ` Oleg Nesterov
  2009-04-13 21:35 ` David Howells
  2009-04-13 21:40 ` David Howells
  2 siblings, 1 reply; 42+ messages in thread
From: Trond Myklebust @ 2009-04-13 19:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, David Howells, Serge Hallyn, Steve Dickson,
	Al Viro, Daire Byrne, linux-kernel

On Mon, 2009-04-13 at 20:17 +0200, Oleg Nesterov wrote:
> slow_work_thread() sleeps on slow_work_thread_wq without WQ_FLAG_EXCLUSIVE,
> this means that slow_work_enqueue()->__wake_up(nr_exclusive => 1) wakes up
> all kslowd threads. Afaics this is not what we want, change slow_work_thread()
> to use prepare_to_wait_exclusive().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> 
> --- 6.30/kernel/slow-work.c~1_SW_EXCLUSIVE	2009-04-06 00:03:42.000000000 +0200
> +++ 6.30/kernel/slow-work.c	2009-04-13 19:40:20.000000000 +0200
> @@ -372,8 +372,8 @@ static int slow_work_thread(void *_data)
>  		vsmax *= atomic_read(&slow_work_thread_count);
>  		vsmax /= 100;
>  
> -		prepare_to_wait(&slow_work_thread_wq, &wait,
> -				TASK_INTERRUPTIBLE);
> +		prepare_to_wait_exclusive(&slow_work_thread_wq, &wait,
> +						TASK_INTERRUPTIBLE);
>  		if (!freezing(current) &&
>  		    !slow_work_threads_should_exit &&
>  		    !slow_work_available(vsmax) &&
> 

Should that really be TASK_INTERRUPTIBLE? I don't see anything obvious
in the enclosing for(;;) loop that checks for or handles signals...

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 19:03 ` Trond Myklebust
@ 2009-04-13 19:14   ` Oleg Nesterov
  0 siblings, 0 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-13 19:14 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, David Howells, Serge Hallyn, Steve Dickson,
	Al Viro, Daire Byrne, linux-kernel

On 04/13, Trond Myklebust wrote:
>
> On Mon, 2009-04-13 at 20:17 +0200, Oleg Nesterov wrote:
> > slow_work_thread() sleeps on slow_work_thread_wq without WQ_FLAG_EXCLUSIVE,
> > this means that slow_work_enqueue()->__wake_up(nr_exclusive => 1) wakes up
> > all kslowd threads. Afaics this is not what we want, change slow_work_thread()
> > to use prepare_to_wait_exclusive().
> > 
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > 
> > --- 6.30/kernel/slow-work.c~1_SW_EXCLUSIVE	2009-04-06 00:03:42.000000000 +0200
> > +++ 6.30/kernel/slow-work.c	2009-04-13 19:40:20.000000000 +0200
> > @@ -372,8 +372,8 @@ static int slow_work_thread(void *_data)
> >  		vsmax *= atomic_read(&slow_work_thread_count);
> >  		vsmax /= 100;
> >  
> > -		prepare_to_wait(&slow_work_thread_wq, &wait,
> > -				TASK_INTERRUPTIBLE);
> > +		prepare_to_wait_exclusive(&slow_work_thread_wq, &wait,
> > +						TASK_INTERRUPTIBLE);
> >  		if (!freezing(current) &&
> >  		    !slow_work_threads_should_exit &&
> >  		    !slow_work_available(vsmax) &&
> > 
> 
> Should that really be TASK_INTERRUPTIBLE? I don't see anything obvious
> in the enclosing for(;;) loop that checks for or handles signals...

I guess TASK_INTERRUPTIBLE was chosen to not contribute to calc_load(),
nr_active() returns nr_running + nr_uninterruptible.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 18:17 [PATCH] slow_work_thread() should do the exclusive wait Oleg Nesterov
  2009-04-13 19:03 ` Trond Myklebust
@ 2009-04-13 21:35 ` David Howells
  2009-04-13 21:40 ` David Howells
  2 siblings, 0 replies; 42+ messages in thread
From: David Howells @ 2009-04-13 21:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: dhowells, Andrew Morton, Serge Hallyn, Steve Dickson,
	Trond Myklebust, Al Viro, Daire Byrne, linux-kernel

Oleg Nesterov <oleg@redhat.com> wrote:

> slow_work_thread() sleeps on slow_work_thread_wq without WQ_FLAG_EXCLUSIVE,
> this means that slow_work_enqueue()->__wake_up(nr_exclusive => 1) wakes up
> all kslowd threads. Afaics this is not what we want, change slow_work_thread()
> to use prepare_to_wait_exclusive().

Hmmm...  I think you may be right.  I think I was assuming that wake_up()
would only wake up the first item on the queue, but that's not strictly what
it does...

David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 18:17 [PATCH] slow_work_thread() should do the exclusive wait Oleg Nesterov
  2009-04-13 19:03 ` Trond Myklebust
  2009-04-13 21:35 ` David Howells
@ 2009-04-13 21:40 ` David Howells
  2009-04-13 21:48   ` Oleg Nesterov
  2 siblings, 1 reply; 42+ messages in thread
From: David Howells @ 2009-04-13 21:40 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: dhowells, Oleg Nesterov, Andrew Morton, Serge Hallyn,
	Steve Dickson, Al Viro, Daire Byrne, linux-kernel

Trond Myklebust <Trond.Myklebust@netapp.com> wrote:

> Should that really be TASK_INTERRUPTIBLE? I don't see anything obvious
> in the enclosing for(;;) loop that checks for or handles signals...

If it were TASK_UNINTERRUPTIBLE, it would sit there in the D-state when not
doing anything.  I must admit, I thought I was calling daemonize(), but that
seems to have got lost somewhere.

David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 21:40 ` David Howells
@ 2009-04-13 21:48   ` Oleg Nesterov
  2009-04-13 21:57     ` Trond Myklebust
  0 siblings, 1 reply; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-13 21:48 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, Andrew Morton, Serge Hallyn, Steve Dickson,
	Al Viro, Daire Byrne, linux-kernel

On 04/13, David Howells wrote:
>
> Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
>
> > Should that really be TASK_INTERRUPTIBLE? I don't see anything obvious
> > in the enclosing for(;;) loop that checks for or handles signals...
>
> If it were TASK_UNINTERRUPTIBLE, it would sit there in the D-state when not
> doing anything.  I must admit, I thought I was calling daemonize(), but that
> seems to have got lost somewhere.

daemonize() is not needed, kthread_create() creates the kernel thread which
ignores all signals. So it doesn't matter which state we use to sleep,
TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 21:48   ` Oleg Nesterov
@ 2009-04-13 21:57     ` Trond Myklebust
  2009-04-13 22:24       ` Oleg Nesterov
  0 siblings, 1 reply; 42+ messages in thread
From: Trond Myklebust @ 2009-04-13 21:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Howells, Andrew Morton, Serge Hallyn, Steve Dickson,
	Al Viro, Daire Byrne, linux-kernel

On Mon, 2009-04-13 at 23:48 +0200, Oleg Nesterov wrote:
> On 04/13, David Howells wrote:
> >
> > Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> >
> > > Should that really be TASK_INTERRUPTIBLE? I don't see anything obvious
> > > in the enclosing for(;;) loop that checks for or handles signals...
> >
> > If it were TASK_UNINTERRUPTIBLE, it would sit there in the D-state when not
> > doing anything.  I must admit, I thought I was calling daemonize(), but that
> > seems to have got lost somewhere.
> 
> daemonize() is not needed, kthread_create() creates the kernel thread which
> ignores all signals. So it doesn't matter which state we use to sleep,
> TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.

Yes, but that is precisely why it is cleaner to use
TASK_UNINTERRUPTIBLE. It documents the fact that signal handling isn't
needed (whether or not the thread is blocking them).

Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 21:57     ` Trond Myklebust
@ 2009-04-13 22:24       ` Oleg Nesterov
  2009-04-15 23:27         ` Andrew Morton
                           ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-13 22:24 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: David Howells, Andrew Morton, Serge Hallyn, Steve Dickson,
	Al Viro, Daire Byrne, linux-kernel

On 04/13, Trond Myklebust wrote:
>
> On Mon, 2009-04-13 at 23:48 +0200, Oleg Nesterov wrote:
> > On 04/13, David Howells wrote:
> > >
> > > Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> > >
> > > > Should that really be TASK_INTERRUPTIBLE? I don't see anything obvious
> > > > in the enclosing for(;;) loop that checks for or handles signals...
> > >
> > > If it were TASK_UNINTERRUPTIBLE, it would sit there in the D-state when not
> > > doing anything.  I must admit, I thought I was calling daemonize(), but that
> > > seems to have got lost somewhere.
> >
> > daemonize() is not needed, kthread_create() creates the kernel thread which
> > ignores all signals. So it doesn't matter which state we use to sleep,
> > TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
>
> Yes, but that is precisely why it is cleaner to use
> TASK_UNINTERRUPTIBLE. It documents the fact that signal handling isn't
> needed (whether or not the thread is blocking them).

Agreed. But TASK_UNINTERRUPTIBLE can confuse a user which does
"cat /proc/loadavg" on the idle machine...

Note that, for example, worker_thread() uses TASK_INTERRUPTIBLE too, and I
think for the same reason.

I dunno.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 22:24       ` Oleg Nesterov
@ 2009-04-15 23:27         ` Andrew Morton
  2009-04-16  9:10         ` David Howells
  2009-04-23 16:00         ` [PATCH] slow_work_thread() should do the exclusive wait David Howells
  2 siblings, 0 replies; 42+ messages in thread
From: Andrew Morton @ 2009-04-15 23:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Trond.Myklebust, dhowells, serue, steved, viro, Daire.Byrne,
	linux-kernel

On Tue, 14 Apr 2009 00:24:51 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 04/13, Trond Myklebust wrote:
> >
> > On Mon, 2009-04-13 at 23:48 +0200, Oleg Nesterov wrote:
> > > On 04/13, David Howells wrote:
> > > >
> > > > Trond Myklebust <Trond.Myklebust@netapp.com> wrote:
> > > >
> > > > > Should that really be TASK_INTERRUPTIBLE? I don't see anything obvious
> > > > > in the enclosing for(;;) loop that checks for or handles signals...
> > > >
> > > > If it were TASK_UNINTERRUPTIBLE, it would sit there in the D-state when not
> > > > doing anything.  I must admit, I thought I was calling daemonize(), but that
> > > > seems to have got lost somewhere.
> > >
> > > daemonize() is not needed, kthread_create() creates the kernel thread which
> > > ignores all signals. So it doesn't matter which state we use to sleep,
> > > TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
> >
> > Yes, but that is precisely why it is cleaner to use
> > TASK_UNINTERRUPTIBLE. It documents the fact that signal handling isn't
> > needed (whether or not the thread is blocking them).
> 
> Agreed. But TASK_UNINTERRUPTIBLE can confuse a user which does
> "cat /proc/loadavg" on the idle machine...
> 
> Note that, for example, worker_thread() uses TASK_INTERRUPTIBLE too, and I
> think for the same reason.
> 

Yup.  It's a very common pattern for kernel threads to sleep in state
TASK_INTERRUPTIBLE.  It is "well known" (lol) that kernel threads don't
accept signals, and having a kernel thread sleep in state
TASK_UNINTERRUPTIBLE will indeed contribute to load average and we get
distressed emails quite promptly when we do that.

The patch itself is a little worrisome.  The wake-all semantics are
very good at covering up little race bugs.  And switching to wake-once
is a great way of exposing hitherto-unsuspected races.

<looks for races>

Nothing immediately leaps out, but you know how these things are.

I wonder if slow_work_cull_timeout() should have some sort of barrier,
so the write is suitably visible to the woken thread.  Bearing in mind
that the thread might _already_ have been woken by someone else?


off-topic: afacit the code will cull a maximum of one thread per five
seconds.  But the rate of thread _creation_ is, afacit, unbound.  Are
there scenarios in which we can get a runaway thread count?




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 22:24       ` Oleg Nesterov
  2009-04-15 23:27         ` Andrew Morton
@ 2009-04-16  9:10         ` David Howells
  2009-04-16 14:33           ` Oleg Nesterov
  2009-04-22 13:37           ` [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier David Howells
  2009-04-23 16:00         ` [PATCH] slow_work_thread() should do the exclusive wait David Howells
  2 siblings, 2 replies; 42+ messages in thread
From: David Howells @ 2009-04-16  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dhowells, Oleg Nesterov, Trond.Myklebust, serue, steved, viro,
	Daire.Byrne, linux-kernel

Andrew Morton <akpm@linux-foundation.org> wrote:

> The patch itself is a little worrisome.  The wake-all semantics are
> very good at covering up little race bugs.  And switching to wake-once
> is a great way of exposing hitherto-unsuspected races.

It's something I'm intending to test, once I get MN10300 working again (which
for some reason it isn't).

> I wonder if slow_work_cull_timeout() should have some sort of barrier,
> so the write is suitably visible to the woken thread.

That's an interesting question.  Should wake_up() imply a barrier of any sort,
I wonder.  Well, __wake_up() does impose a barrier as it uses a spinlock, but
I wonder if that's sufficient.

> Bearing in mind that the thread might _already_ have been woken by someone
> else?

If the thread is woken by someone else, there must be work for it to do, in
which case it wouldn't be culled anyway.

> off-topic: afacit the code will cull a maximum of one thread per five
> seconds.  But the rate of thread _creation_ is, afacit, unbound.  Are
> there scenarios in which we can get a runaway thread count?

The maximum number of threads is limited (slow_work_max_threads).

David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-16  9:10         ` David Howells
@ 2009-04-16 14:33           ` Oleg Nesterov
  2009-04-22 13:37           ` [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier David Howells
  1 sibling, 0 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-16 14:33 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Morton, Trond.Myklebust, serue, steved, viro, Daire.Byrne,
	linux-kernel

On 04/16, David Howells wrote:
>
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > I wonder if slow_work_cull_timeout() should have some sort of barrier,
> > so the write is suitably visible to the woken thread.
>
> That's an interesting question.  Should wake_up() imply a barrier of any sort,
> I wonder.  Well, __wake_up() does impose a barrier as it uses a spinlock, but
> I wonder if that's sufficient.

wake_up() does imply the barrier. Note the smp_wmb() in try_to_wake_up().
And in fact this wmb() implies mb(), because spin_lock() itself is STORE,
and the futher LOADs can't leak up before spin_lock().

But afaics, this doesn't matter? prepare_to_wait() sets task->state under
wait_queue_head_t->lock and wake_up() takes this look too, so we can't miss
the event.

Or I completely misunderstood the issue...

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-16  9:10         ` David Howells
  2009-04-16 14:33           ` Oleg Nesterov
@ 2009-04-22 13:37           ` David Howells
  2009-04-22 13:51             ` Ingo Molnar
                               ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: David Howells @ 2009-04-22 13:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: dhowells, Andrew Morton, Trond.Myklebust, serue, steved, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

Oleg Nesterov <oleg@redhat.com> wrote:

> > That's an interesting question.  Should wake_up() imply a barrier of any
> > sort, I wonder.  Well, __wake_up() does impose a barrier as it uses a
> > spinlock, but I wonder if that's sufficient.
> 
> wake_up() does imply the barrier. Note the smp_wmb() in try_to_wake_up().
> And in fact this wmb() implies mb(), because spin_lock() itself is STORE,
> and the futher LOADs can't leak up before spin_lock().
> 
> But afaics, this doesn't matter? prepare_to_wait() sets task->state under
> wait_queue_head_t->lock and wake_up() takes this look too, so we can't miss
> the event.
> 
> Or I completely misunderstood the issue...

The problem is not what wake_up() and co. do, it's what you are allowed to
assume that they do.

However, I think you're right, and that we can assume they imply a full memory
barrier.  To this end, I've attached a patch to document this.

David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier

Add to the memory barriers document to note that wake_up(), complete() and
co. all imply a full memory barrier.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/memory-barriers.txt |    4 ++++
 kernel/sched.c                    |   10 ++++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)


diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index f5b7127..2c8062c 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1224,6 +1224,10 @@ Other functions that imply barriers:
 
  (*) schedule() and similar imply full memory barriers.
 
+ (*) wake_up(), try_to_wake_up() and co. imply a full memory barrier.
+
+ (*) complete() and co. imply a full memory barrier.
+
 
 =================================
 INTER-CPU LOCKING BARRIER EFFECTS
diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..faccaa0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2337,6 +2337,8 @@ static int sched_balance_self(int cpu, int flag)
  * runnable without the overhead of this.
  *
  * returns failure only if the task is already active.
+ *
+ * It may be assumed that this function implies a full memory barrier.
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 {
@@ -5241,6 +5243,8 @@ void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
  * @mode: which threads
  * @nr_exclusive: how many wake-one or wake-many threads to wake up
  * @key: is directly passed to the wakeup function
+ *
+ * It may be assumed that this function implies a full memory barrier.
  */
 void __wake_up(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -5279,6 +5283,8 @@ void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
  * with each other. This can prevent needless bouncing between CPUs.
  *
  * On UP it can prevent extra preemption.
+ *
+ * It may be assumed that this function implies a full memory barrier.
  */
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -5315,6 +5321,8 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
  * awakened in the same order in which they were queued.
  *
  * See also complete_all(), wait_for_completion() and related routines.
+ *
+ * It may be assumed that this function implies a full memory barrier.
  */
 void complete(struct completion *x)
 {
@@ -5332,6 +5340,8 @@ EXPORT_SYMBOL(complete);
  * @x:  holds the state of this particular completion
  *
  * This will wake up all threads waiting on this particular completion event.
+ *
+ * It may be assumed that this function implies a full memory barrier.
  */
 void complete_all(struct completion *x)
 {

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 13:37           ` [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier David Howells
@ 2009-04-22 13:51             ` Ingo Molnar
  2009-04-22 14:39               ` Oleg Nesterov
  2009-04-22 15:12             ` David Howells
  2009-04-22 16:23             ` David Howells
  2 siblings, 1 reply; 42+ messages in thread
From: Ingo Molnar @ 2009-04-22 13:51 UTC (permalink / raw)
  To: David Howells
  Cc: Oleg Nesterov, Andrew Morton, Trond.Myklebust, serue, steved,
	viro, Paul E. McKenney, Nick Piggin, linux-kernel


* David Howells <dhowells@redhat.com> wrote:

> Oleg Nesterov <oleg@redhat.com> wrote:
> 
> > > That's an interesting question.  Should wake_up() imply a barrier of any
> > > sort, I wonder.  Well, __wake_up() does impose a barrier as it uses a
> > > spinlock, but I wonder if that's sufficient.
> > 
> > wake_up() does imply the barrier. Note the smp_wmb() in try_to_wake_up().
> > And in fact this wmb() implies mb(), because spin_lock() itself is STORE,
> > and the futher LOADs can't leak up before spin_lock().
> > 
> > But afaics, this doesn't matter? prepare_to_wait() sets 
> > task->state under wait_queue_head_t->lock and wake_up() takes 
> > this look too, so we can't miss the event.
> > 
> > Or I completely misunderstood the issue...
> 
> The problem is not what wake_up() and co. do, it's what you are 
> allowed to assume that they do.
> 
> However, I think you're right, and that we can assume they imply a 
> full memory barrier.  To this end, I've attached a patch to 
> document this.
> 
> David
> ---
> From: David Howells <dhowells@redhat.com>
> Subject: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
> 
> Add to the memory barriers document to note that wake_up(), complete() and
> co. all imply a full memory barrier.

No. They dont generally imply a full memory barrier versus any 
arbitrary prior (or following) memory access.

try_to_wake_up() has an smp_wmb() so it is a write memory barrier 
(but not necessarily a read memory barrier). Otherwise there are 
spinlocks there but spinlocks are not explicit 'full memory 
barriers'.

Also, there's a sub-detail wrt. the wake_up() variants in that they 
have a fastpath for the !q case - then they dont have any atomics at 
all - they just return straight away.

Another sub-detail: wakeups using a special wakeup handler might not 
even call try_to_wake_up() - so in their case not even a write 
barrier can be assumed.

So your patch is misleading in a number of areas here.

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 13:51             ` Ingo Molnar
@ 2009-04-22 14:39               ` Oleg Nesterov
  2009-04-22 14:56                 ` Ingo Molnar
  0 siblings, 1 reply; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-22 14:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Howells, Andrew Morton, Trond.Myklebust, serue, steved,
	viro, Paul E. McKenney, Nick Piggin, linux-kernel

On 04/22, Ingo Molnar wrote:
>
> * David Howells <dhowells@redhat.com> wrote:
>
> > Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > > That's an interesting question.  Should wake_up() imply a barrier of any
> > > > sort, I wonder.  Well, __wake_up() does impose a barrier as it uses a
> > > > spinlock, but I wonder if that's sufficient.
> > >
> > > wake_up() does imply the barrier. Note the smp_wmb() in try_to_wake_up().
> > > And in fact this wmb() implies mb(), because spin_lock() itself is STORE,
> > > and the futher LOADs can't leak up before spin_lock().
> > >
> > > But afaics, this doesn't matter? prepare_to_wait() sets
> > > task->state under wait_queue_head_t->lock and wake_up() takes
> > > this look too, so we can't miss the event.
> > >
> > > Or I completely misunderstood the issue...
> >
> > The problem is not what wake_up() and co. do, it's what you are
> > allowed to assume that they do.
> >
> > However, I think you're right, and that we can assume they imply a
> > full memory barrier.  To this end, I've attached a patch to
> > document this.
> >
> > David
> > ---
> > From: David Howells <dhowells@redhat.com>
> > Subject: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
> >
> > Add to the memory barriers document to note that wake_up(), complete() and
> > co. all imply a full memory barrier.
>
> No. They dont generally imply a full memory barrier versus any
> arbitrary prior (or following) memory access.
>
> try_to_wake_up() has an smp_wmb() so it is a write memory barrier
> (but not necessarily a read memory barrier). Otherwise there are
> spinlocks there but spinlocks are not explicit 'full memory
> barriers'.

Yes. But please look at the changelog in

	 "Add memory barrier semantics to wake_up() & co"
	 04e2f1741d235ba599037734878d72e57cb302b5

However, I must admit, I don't understand how to document the semantics
correctly. This wmb() before spin_lock() ensures we don't read task->state
before previous STOREs. This is what we care about, and this is what
I meant when I said "this wmb() implies mb()".

So, I think that try_to_wake_up() implies that the LOADS after it can't be
reordered with STOREs before it (and wmb() of course).

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 14:39               ` Oleg Nesterov
@ 2009-04-22 14:56                 ` Ingo Molnar
  2009-04-22 15:07                   ` Oleg Nesterov
  0 siblings, 1 reply; 42+ messages in thread
From: Ingo Molnar @ 2009-04-22 14:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Howells, Andrew Morton, Trond.Myklebust, serue, steved,
	viro, Paul E. McKenney, Nick Piggin, linux-kernel


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 04/22, Ingo Molnar wrote:
> >
> > * David Howells <dhowells@redhat.com> wrote:
> >
> > > Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > > > That's an interesting question.  Should wake_up() imply a barrier of any
> > > > > sort, I wonder.  Well, __wake_up() does impose a barrier as it uses a
> > > > > spinlock, but I wonder if that's sufficient.
> > > >
> > > > wake_up() does imply the barrier. Note the smp_wmb() in try_to_wake_up().
> > > > And in fact this wmb() implies mb(), because spin_lock() itself is STORE,
> > > > and the futher LOADs can't leak up before spin_lock().
> > > >
> > > > But afaics, this doesn't matter? prepare_to_wait() sets
> > > > task->state under wait_queue_head_t->lock and wake_up() takes
> > > > this look too, so we can't miss the event.
> > > >
> > > > Or I completely misunderstood the issue...
> > >
> > > The problem is not what wake_up() and co. do, it's what you are
> > > allowed to assume that they do.
> > >
> > > However, I think you're right, and that we can assume they imply a
> > > full memory barrier.  To this end, I've attached a patch to
> > > document this.
> > >
> > > David
> > > ---
> > > From: David Howells <dhowells@redhat.com>
> > > Subject: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
> > >
> > > Add to the memory barriers document to note that wake_up(), complete() and
> > > co. all imply a full memory barrier.
> >
> > No. They dont generally imply a full memory barrier versus any
> > arbitrary prior (or following) memory access.
> >
> > try_to_wake_up() has an smp_wmb() so it is a write memory barrier
> > (but not necessarily a read memory barrier). Otherwise there are
> > spinlocks there but spinlocks are not explicit 'full memory
> > barriers'.
> 
> Yes. But please look at the changelog in
> 
> 	 "Add memory barrier semantics to wake_up() & co"
> 	 04e2f1741d235ba599037734878d72e57cb302b5

yes - but still that commit is only wrt. the ->state check.

> However, I must admit, I don't understand how to document the 
> semantics correctly. This wmb() before spin_lock() ensures we 
> don't read task->state before previous STOREs. This is what we 
> care about, and this is what I meant when I said "this wmb() 
> implies mb()".
> 
> So, I think that try_to_wake_up() implies that the LOADS after it 
> can't be reordered with STOREs before it (and wmb() of course).

Note that the patch David sent says "full memory barrier", not "full 
memory barrier wrt. task->state":

+ (*) wake_up(), try_to_wake_up() and co. imply a full memory barrier.
+
+ (*) complete() and co. imply a full memory barrier.

These statements are not true in that form, as this code does not 
imply a full memory barrier. It does imply one on task->state 
_alone_ (and a couple of other wq-internal variables it happens to 
read for sure).

But even that one isnt entirely true in the two sub-cases i noted: 
the !wq case (which can happen in object state teardown) and the 
special ->func handler (which can happen in custom wakeup code a'la 
eventpoll).

So adding a comment that says "this is a full memory barrier" is 
simply not true to that extent, and is easily misunderstood. Adding 
"this is a fully memory barrier for task->state dependent data flow" 
would be more correct. (with a 'as long as wq is not NULL, and as 
long as the code using this isnt overriding ->func)

Agreed?

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 14:56                 ` Ingo Molnar
@ 2009-04-22 15:07                   ` Oleg Nesterov
  0 siblings, 0 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-22 15:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Howells, Andrew Morton, Trond.Myklebust, serue, steved,
	viro, Paul E. McKenney, Nick Piggin, linux-kernel

On 04/22, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > So, I think that try_to_wake_up() implies that the LOADS after it
> > can't be reordered with STOREs before it (and wmb() of course).
>
> Note that the patch David sent says "full memory barrier", not "full
> memory barrier wrt. task->state":
>
> + (*) wake_up(), try_to_wake_up() and co. imply a full memory barrier.
> +
> + (*) complete() and co. imply a full memory barrier.
>
> These statements are not true in that form, as this code does not
> imply a full memory barrier. It does imply one on task->state
> _alone_ (and a couple of other wq-internal variables it happens to
> read for sure).
>
> But even that one isnt entirely true in the two sub-cases i noted:
> the !wq case (which can happen in object state teardown) and the
> special ->func handler (which can happen in custom wakeup code a'la
> eventpoll).
>
> So adding a comment that says "this is a full memory barrier" is
> simply not true to that extent, and is easily misunderstood. Adding
> "this is a fully memory barrier for task->state dependent data flow"
> would be more correct. (with a 'as long as wq is not NULL, and as
> long as the code using this isnt overriding ->func)
>
> Agreed?

Yes sure.

Except... not that it really matters, but the reading of ->state is
not "special". I mean,

	STORE;
	try_to_wake_up();
	LOAD;

in this case try_to_wake_up() acts as a barrier for STORE/LOAD. But
probably we should not rely on this. So personally I agree with
"for task->state dependent data flow" above.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 13:37           ` [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier David Howells
  2009-04-22 13:51             ` Ingo Molnar
@ 2009-04-22 15:12             ` David Howells
  2009-04-22 15:19               ` Ingo Molnar
  2009-04-22 16:23             ` David Howells
  2 siblings, 1 reply; 42+ messages in thread
From: David Howells @ 2009-04-22 15:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dhowells, Oleg Nesterov, Andrew Morton, Trond.Myklebust, serue,
	steved, viro, Paul E. McKenney, Nick Piggin, linux-kernel

Ingo Molnar <mingo@elte.hu> wrote:

> No. They dont generally imply a full memory barrier versus any
> arbitrary prior (or following) memory access.
>
> try_to_wake_up() has an smp_wmb() so it is a write memory barrier
> (but not necessarily a read memory barrier). Otherwise there are
> spinlocks there but spinlocks are not explicit 'full memory
> barriers'.

Blech.  That's a good point LOCK...UNLOCK does not imply a full barrier.

So we can't assume that complete(), wake_up() and co. imply any barriers.

All we can assume is that try_to_wake_up() implies a write barrier, but we
can't assume that that will be called via __wake_up_common().

David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 15:12             ` David Howells
@ 2009-04-22 15:19               ` Ingo Molnar
  0 siblings, 0 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-04-22 15:19 UTC (permalink / raw)
  To: David Howells
  Cc: Oleg Nesterov, Andrew Morton, Trond.Myklebust, serue, steved,
	viro, Paul E. McKenney, Nick Piggin, linux-kernel


* David Howells <dhowells@redhat.com> wrote:

> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > No. They dont generally imply a full memory barrier versus any
> > arbitrary prior (or following) memory access.
> >
> > try_to_wake_up() has an smp_wmb() so it is a write memory barrier
> > (but not necessarily a read memory barrier). Otherwise there are
> > spinlocks there but spinlocks are not explicit 'full memory
> > barriers'.
> 
> Blech.  That's a good point LOCK...UNLOCK does not imply a full 
> barrier.
> 
> So we can't assume that complete(), wake_up() and co. imply any 
> barriers.
> 
> All we can assume is that try_to_wake_up() implies a write 
> barrier, but we can't assume that that will be called via 
> __wake_up_common().

Yeah, it's all too special-case. We might rely on it in special, 
well-argued cases but we should not document it as a general barrier 
property.

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 13:37           ` [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier David Howells
  2009-04-22 13:51             ` Ingo Molnar
  2009-04-22 15:12             ` David Howells
@ 2009-04-22 16:23             ` David Howells
  2009-04-22 17:57               ` Ingo Molnar
  2009-04-23 16:32               ` [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a " David Howells
  2 siblings, 2 replies; 42+ messages in thread
From: David Howells @ 2009-04-22 16:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dhowells, Oleg Nesterov, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> So we can't assume that complete(), wake_up() and co. imply any barriers.
> 
> All we can assume is that try_to_wake_up() implies a write barrier, but we
> can't assume that that will be called via __wake_up_common().

So how about this, then?

David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] Document that wake_up(), complete() and co. may not imply a memory barrier

Add to the memory barriers document to note that wake_up(), complete() and
co. may not be assumed to imply any sort of memory barrier, with the exception
of try_to_wake_up() and things derived from that.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/memory-barriers.txt |   34 +++++++++++++++++++++++++++++++++-
 kernel/sched.c                    |   10 ++++++++++
 2 files changed, 43 insertions(+), 1 deletions(-)


diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index f5b7127..6bd626a 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -42,6 +42,7 @@ Contents:
 
      - Interprocessor interaction.
      - Atomic operations.
+     - Wake up of processes
      - Accessing devices.
      - Interrupts.
 
@@ -1224,6 +1225,9 @@ Other functions that imply barriers:
 
  (*) schedule() and similar imply full memory barriers.
 
+ (*) try_to_wake_up() and things derived from that imply a write memory
+     barrier.
+
 
 =================================
 INTER-CPU LOCKING BARRIER EFFECTS
@@ -1366,13 +1370,15 @@ WHERE ARE MEMORY BARRIERS NEEDED?
 
 Under normal operation, memory operation reordering is generally not going to
 be a problem as a single-threaded linear piece of code will still appear to
-work correctly, even if it's in an SMP kernel.  There are, however, three
+work correctly, even if it's in an SMP kernel.  There are, however, five
 circumstances in which reordering definitely _could_ be a problem:
 
  (*) Interprocessor interaction.
 
  (*) Atomic operations.
 
+ (*) Wake up of processes.
+
  (*) Accessing devices.
 
  (*) Interrupts.
@@ -1568,6 +1574,32 @@ and in such cases the special barrier primitives will be no-ops.
 See Documentation/atomic_ops.txt for more information.
 
 
+WAKE UP OF PROCESSES
+--------------------
+
+An unlock, write memory barrier or a full memory barrier may be needed before a
+call to wake up another processes if the waker sets some state that the sleeper
+will need to see.
+
+	complete();
+	wake_up();
+	wake_up_all();
+	wake_up_bit();
+	wake_up_interruptible();
+	wake_up_interruptible_all();
+	wake_up_interruptible_nr();
+	wake_up_interruptible_poll();
+	wake_up_interruptible_sync();
+	wake_up_interruptible_sync_poll();
+	wake_up_locked();
+	wake_up_locked_poll();
+	wake_up_nr();
+	wake_up_poll();
+
+The sleeper may then need to interpolate a lock, read or full memory barrier
+before accessing that state.
+
+
 ACCESSING DEVICES
 -----------------
 
diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..7cbc3de 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2337,6 +2337,8 @@ static int sched_balance_self(int cpu, int flag)
  * runnable without the overhead of this.
  *
  * returns failure only if the task is already active.
+ *
+ * It may be assumed that this function implies a full memory barrier.
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 {
@@ -5241,6 +5243,8 @@ void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
  * @mode: which threads
  * @nr_exclusive: how many wake-one or wake-many threads to wake up
  * @key: is directly passed to the wakeup function
+ *
+ * It may not be assumed that this function implies any sort of memory barrier.
  */
 void __wake_up(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -5279,6 +5283,8 @@ void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
  * with each other. This can prevent needless bouncing between CPUs.
  *
  * On UP it can prevent extra preemption.
+ *
+ * It may not be assumed that this function implies any sort of memory barrier.
  */
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -5315,6 +5321,8 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
  * awakened in the same order in which they were queued.
  *
  * See also complete_all(), wait_for_completion() and related routines.
+ *
+ * It may not be assumed that this function implies any sort of memory barrier.
  */
 void complete(struct completion *x)
 {
@@ -5332,6 +5340,8 @@ EXPORT_SYMBOL(complete);
  * @x:  holds the state of this particular completion
  *
  * This will wake up all threads waiting on this particular completion event.
+ *
+ * It may not be assumed that this function implies any sort of memory barrier.
  */
 void complete_all(struct completion *x)
 {

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 16:23             ` David Howells
@ 2009-04-22 17:57               ` Ingo Molnar
  2009-04-23 16:36                 ` Oleg Nesterov
  2009-04-23 20:37                 ` David Howells
  2009-04-23 16:32               ` [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a " David Howells
  1 sibling, 2 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-04-22 17:57 UTC (permalink / raw)
  To: David Howells
  Cc: Oleg Nesterov, Andrew Morton, serue, viro, Paul E. McKenney,
	Nick Piggin, linux-kernel


* David Howells <dhowells@redhat.com> wrote:

> +WAKE UP OF PROCESSES
> +--------------------
> +
> +An unlock, write memory barrier or a full memory barrier may be needed before a
> +call to wake up another processes if the waker sets some state that the sleeper
> +will need to see.
> +
> +	complete();
> +	wake_up();
> +	wake_up_all();
> +	wake_up_bit();
> +	wake_up_interruptible();
> +	wake_up_interruptible_all();
> +	wake_up_interruptible_nr();
> +	wake_up_interruptible_poll();
> +	wake_up_interruptible_sync();
> +	wake_up_interruptible_sync_poll();
> +	wake_up_locked();
> +	wake_up_locked_poll();
> +	wake_up_nr();
> +	wake_up_poll();
> +
> +The sleeper may then need to interpolate a lock, read or full memory barrier
> +before accessing that state.

Why would an unlock be needed before a call to wake_up() variants?

If a piece of code does:

	wake_up*(&object->wq);
	...
	spin_unlock(&object->lock);

that's perfectly race-free, as long as the wakee always starts with 
something like:

	spin_lock(&object->lock);
	...

I.e. waking up sooner than dropping the lock always OK (and it's a 
frequent operation). Your text seems to imply that a barrier is 
needed (or the unlock needs to happen first) but that's wrong i 
think.

The dangerous pattern is lockless code doing wakeups. But lockless 
code always has to use proper barriers or atomics anyway, and has to 
be aware of the fact that kernel primitives they call are not 
necessarily full memory barriers.

In fact i'd encourage to _not_ document try_to_lock() as a write 
barrier either - but rather have explicit barriers where they are 
needed. Then we could remove that barrier from try_to_wake_up() too 
;-)

Hm?

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-13 22:24       ` Oleg Nesterov
  2009-04-15 23:27         ` Andrew Morton
  2009-04-16  9:10         ` David Howells
@ 2009-04-23 16:00         ` David Howells
  2009-04-23 16:18           ` Oleg Nesterov
  2 siblings, 1 reply; 42+ messages in thread
From: David Howells @ 2009-04-23 16:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dhowells, Oleg Nesterov, Trond.Myklebust, serue, steved, viro,
	linux-kernel

Andrew Morton <akpm@linux-foundation.org> wrote:

> I wonder if slow_work_cull_timeout() should have some sort of barrier,
> so the write is suitably visible to the woken thread.  Bearing in mind
> that the thread might _already_ have been woken by someone else?

Perhaps the attached patch?

David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] slow_work_cull_timeout() should have a memory barrier

slow_work_cull_timeout() should have a write memory barrier so that the setting
of the cull flag is seen before the wakeup takes place.  This is required
because wake_up() does not guarantee any memory barriership at all.

Concomitant to this, slow_work_thread() should have a read memory barrier
between its return from schedule() and its testing of slow_work_cull() as
finish_wait() isn't a guaranteed barrier either.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 kernel/slow-work.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)


diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index 521ed20..96e418d 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -382,6 +382,7 @@ static int slow_work_thread(void *_data)
 		finish_wait(&slow_work_thread_wq, &wait);
 
 		try_to_freeze();
+		smp_rmb();
 
 		vsmax = vslow_work_proportion;
 		vsmax *= atomic_read(&slow_work_thread_count);
@@ -416,6 +417,7 @@ static int slow_work_thread(void *_data)
 static void slow_work_cull_timeout(unsigned long data)
 {
 	slow_work_cull = true;
+	smp_wmb();
 	wake_up(&slow_work_thread_wq);
 }
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] slow_work_thread() should do the exclusive wait
  2009-04-23 16:00         ` [PATCH] slow_work_thread() should do the exclusive wait David Howells
@ 2009-04-23 16:18           ` Oleg Nesterov
  0 siblings, 0 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-23 16:18 UTC (permalink / raw)
  To: David Howells
  Cc: Andrew Morton, Trond.Myklebust, serue, steved, viro,
	linux-kernel, Ingo Molnar

(add Ingo)

On 04/23, David Howells wrote:
>
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > I wonder if slow_work_cull_timeout() should have some sort of barrier,
> > so the write is suitably visible to the woken thread.  Bearing in mind
> > that the thread might _already_ have been woken by someone else?
>
> Perhaps the attached patch?
>
> David
> ---
> From: David Howells <dhowells@redhat.com>
> Subject: [PATCH] slow_work_cull_timeout() should have a memory barrier
>
> slow_work_cull_timeout() should have a write memory barrier so that the setting
> of the cull flag is seen before the wakeup takes place.  This is required
> because wake_up() does not guarantee any memory barriership at all.
>
> Concomitant to this, slow_work_thread() should have a read memory barrier
> between its return from schedule() and its testing of slow_work_cull() as
> finish_wait() isn't a guaranteed barrier either.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  kernel/slow-work.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
>
> diff --git a/kernel/slow-work.c b/kernel/slow-work.c
> index 521ed20..96e418d 100644
> --- a/kernel/slow-work.c
> +++ b/kernel/slow-work.c
> @@ -382,6 +382,7 @@ static int slow_work_thread(void *_data)
>  		finish_wait(&slow_work_thread_wq, &wait);
>
>  		try_to_freeze();
> +		smp_rmb();
>
>  		vsmax = vslow_work_proportion;
>  		vsmax *= atomic_read(&slow_work_thread_count);
> @@ -416,6 +417,7 @@ static int slow_work_thread(void *_data)
>  static void slow_work_cull_timeout(unsigned long data)
>  {
>  	slow_work_cull = true;
> +	smp_wmb();
>  	wake_up(&slow_work_thread_wq);
>  }

Confused. If we need this barrier, a lot of similar code is broken.

	slow_work_cull_timeout:

		slow_work_cull = true;
		wake_up(&slow_work_thread_wq);


	slow_work_thread:

		prepare_to_wait(&slow_work_thread_wq);
		if (!slow_work_cull)
			schedule();
		finish_wait(&slow_work_thread_wq);

		if (slow_work_cull)
			.....

Both wake_up() and prepare_to_wait() take the same wait_queue_head_t->lock,
and prepare_to_wait() does set_current_state() under this lock.

How can we miss the event? If wake_up() happens before prepare_to_wait(),
slow_work_thread() must see slow_work_cull = T, otherwise the subsequent
wake_up() must see the result of list_add() + set_current_state() and
wake up the sleeping thread.

Could you please clarify?

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-22 16:23             ` David Howells
  2009-04-22 17:57               ` Ingo Molnar
@ 2009-04-23 16:32               ` David Howells
  2009-04-23 16:55                 ` Oleg Nesterov
                                   ` (3 more replies)
  1 sibling, 4 replies; 42+ messages in thread
From: David Howells @ 2009-04-23 16:32 UTC (permalink / raw)
  To: Ingo Molnar, torvalds
  Cc: dhowells, Oleg Nesterov, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

Ingo Molnar <mingo@elte.hu> wrote:

> Why would an unlock be needed before a call to wake_up() variants?

Good point.  I've amended my patch again (see attached).

> In fact i'd encourage to _not_ document try_to_lock() as a write barrier
> either

Did you mean try_to_wake_up()?  Or did you mean things like
spin_lock_trylock()?  If the latter, it *has* to be a LOCK-class barrier if it
succeeds - otherwise what's the point?

> - but rather have explicit barriers where they are needed. Then we
> could remove that barrier from try_to_wake_up() too ;-)

I was wondering if wake_up() and friends should in fact imply smp_wmb(), but I
guess that they're often used in conjunction with spinlocks - and in such a
situation a barrier is unnecessary overhead.

David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier

Add to the memory barriers document to note that try_to_wake_up(), wake_up(),
complete(), finish_wait() and co. should not be assumed to imply any sort of
memory barrier.

This is because:

 (1) A lot of the time, memory barriers in the wake-up and sleeper paths would
     be superfluous due to the use of locks.

 (2) It is possible to pass right through wake_up() and co. without hitting
     anything at all or anything other than a spin_lock/spin_unlock (if
     try_to_wake_up() isn't invoked).

 (3) The smp_wmb() should probably move out of try_to_wake_up() and into
     suitable places in its callers.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/memory-barriers.txt |   34 +++++++++++++++++++++++++++++++++-
 kernel/sched.c                    |   11 +++++++++++
 2 files changed, 44 insertions(+), 1 deletions(-)


diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index f5b7127..8c32e23 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -42,6 +42,7 @@ Contents:
 
      - Interprocessor interaction.
      - Atomic operations.
+     - Wake up of processes
      - Accessing devices.
      - Interrupts.
 
@@ -1366,13 +1367,15 @@ WHERE ARE MEMORY BARRIERS NEEDED?
 
 Under normal operation, memory operation reordering is generally not going to
 be a problem as a single-threaded linear piece of code will still appear to
-work correctly, even if it's in an SMP kernel.  There are, however, three
+work correctly, even if it's in an SMP kernel.  There are, however, five
 circumstances in which reordering definitely _could_ be a problem:
 
  (*) Interprocessor interaction.
 
  (*) Atomic operations.
 
+ (*) Wake up of processes.
+
  (*) Accessing devices.
 
  (*) Interrupts.
@@ -1568,6 +1571,35 @@ and in such cases the special barrier primitives will be no-ops.
 See Documentation/atomic_ops.txt for more information.
 
 
+WAKE UP OF PROCESSES
+--------------------
+
+If locking is not used, and if the waker sets some state that the sleeper will
+need to see, a write memory barrier or a full memory barrier may be needed
+before one of the following calls is used to wake up another process:
+
+	complete();
+	try_to_wake_up();
+	wake_up();
+	wake_up_all();
+	wake_up_bit();
+	wake_up_interruptible();
+	wake_up_interruptible_all();
+	wake_up_interruptible_nr();
+	wake_up_interruptible_poll();
+	wake_up_interruptible_sync();
+	wake_up_interruptible_sync_poll();
+	wake_up_locked();
+	wake_up_locked_poll();
+	wake_up_nr();
+	wake_up_poll();
+
+After waking, and assuming it doesn't take a matching lock, the sleeper may
+need to interpolate a read or full memory barrier before accessing that state
+as finish_wait() does not imply a barrier either, and schedule() only implies a
+barrier on entry.
+
+
 ACCESSING DEVICES
 -----------------
 
diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..2ef0479 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2337,6 +2337,9 @@ static int sched_balance_self(int cpu, int flag)
  * runnable without the overhead of this.
  *
  * returns failure only if the task is already active.
+ *
+ * It should not be assumed that this function implies any sort of memory
+ * barrier.
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 {
@@ -5241,6 +5244,8 @@ void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
  * @mode: which threads
  * @nr_exclusive: how many wake-one or wake-many threads to wake up
  * @key: is directly passed to the wakeup function
+ *
+ * It may not be assumed that this function implies any sort of memory barrier.
  */
 void __wake_up(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -5279,6 +5284,8 @@ void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
  * with each other. This can prevent needless bouncing between CPUs.
  *
  * On UP it can prevent extra preemption.
+ *
+ * It may not be assumed that this function implies any sort of memory barrier.
  */
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -5315,6 +5322,8 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
  * awakened in the same order in which they were queued.
  *
  * See also complete_all(), wait_for_completion() and related routines.
+ *
+ * It may not be assumed that this function implies any sort of memory barrier.
  */
 void complete(struct completion *x)
 {
@@ -5332,6 +5341,8 @@ EXPORT_SYMBOL(complete);
  * @x:  holds the state of this particular completion
  *
  * This will wake up all threads waiting on this particular completion event.
+ *
+ * It may not be assumed that this function implies any sort of memory barrier.
  */
 void complete_all(struct completion *x)
 {

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 17:57               ` Ingo Molnar
@ 2009-04-23 16:36                 ` Oleg Nesterov
  2009-04-23 20:37                 ` David Howells
  1 sibling, 0 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-23 16:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Howells, Andrew Morton, serue, viro, Paul E. McKenney,
	Nick Piggin, linux-kernel

On 04/22, Ingo Molnar wrote:
>
> * David Howells <dhowells@redhat.com> wrote:
>
> The dangerous pattern is lockless code doing wakeups. But lockless
> code always has to use proper barriers or atomics anyway, and has to
> be aware of the fact that kernel primitives they call are not
> necessarily full memory barriers.
>
> In fact i'd encourage to _not_ document try_to_lock() as a write
> barrier either - but rather have explicit barriers where they are
> needed. Then we could remove that barrier from try_to_wake_up() too
> ;-)

Well. I think in that case almost every try_to_wake_up/wake_up_process
needs mb().

For example:

	do_nanosleep:

		for (;;) {
			set_current_state(TASK_INTERRUPTIBLE);
			if (likely(t->task))
				schedule();
			break;
		}

	hrtimer_wakeup:

		task = t->task
		t->task = NULL;
		wake_up_process(task);

If try_to_wake_up() has no the barrier semantics, we can miss the event.
"t->task = NULL" and the reading of task->state must not be reordered.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-23 16:32               ` [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a " David Howells
@ 2009-04-23 16:55                 ` Oleg Nesterov
  2009-04-23 17:07                 ` Linus Torvalds
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-23 16:55 UTC (permalink / raw)
  To: David Howells
  Cc: Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

On 04/23, David Howells wrote:
>
> +	complete();
> +	try_to_wake_up();
> +	wake_up();
> +	wake_up_all();
> +	wake_up_bit();
> +	wake_up_interruptible();
> +	wake_up_interruptible_all();
> +	wake_up_interruptible_nr();
> +	wake_up_interruptible_poll();
> +	wake_up_interruptible_sync();
> +	wake_up_interruptible_sync_poll();
> +	wake_up_locked();
> +	wake_up_locked_poll();
> +	wake_up_nr();
> +	wake_up_poll();
> +
> +After waking, and assuming it doesn't take a matching lock, the sleeper may
> +need to interpolate a read or full memory barrier before accessing that state
> +as finish_wait() does not imply a barrier either, and schedule() only implies a
> +barrier on entry.

Well. I am starting to suspect I missed something, but I disagree. Or I just
(this is very possible) misunderstand the above.

finish_wait() doesn't imply a barrier, but why this matters?

And if we don't use prepare_to_wait() and just do

	for (;;) {
		set_current_state(WHATEVER);
		if (!CONDITION)
			schedule();
		break;
	}

we do have mb(), but

> + *
> + * It should not be assumed that this function implies any sort of memory
> + * barrier.
>   */
>  static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)

unless try_to_wake_up() has a barrier semantics too, this code

	CONDITION = 1;
	wake_up_process(waiter);

is not right, and that mb above can't help.


Could you please give the code example which shows we need a barrier
after finish_wait() ?

I am just trying to understand what I missed.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-23 16:32               ` [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a " David Howells
  2009-04-23 16:55                 ` Oleg Nesterov
@ 2009-04-23 17:07                 ` Linus Torvalds
  2009-04-23 20:35                 ` David Howells
  2009-04-24 11:46                 ` David Howells
  3 siblings, 0 replies; 42+ messages in thread
From: Linus Torvalds @ 2009-04-23 17:07 UTC (permalink / raw)
  To: David Howells
  Cc: Ingo Molnar, Oleg Nesterov, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel



On Thu, 23 Apr 2009, David Howells wrote:
>
> I was wondering if wake_up() and friends should in fact imply smp_wmb(), but I
> guess that they're often used in conjunction with spinlocks - and in such a
> situation a barrier is unnecessary overhead.

I think we _have_ to imply a smp_wmb() in the wakup semantics, because 
otherwise sleepers can't do anything sane (no amount of barriers on the 
sleeping side will help). IOW, there basically has to be an implied write 
barrier between the thing that causes an event to become true, and the 
thing that turns 'task->state' back to RUNNING.

This is similar to the issue of doing cross-CPU IPI's: sending an IPI 
_must_ imply a memory barrier with the IPI mechanism, because otherwise 
the receiver could never do anything sane.

		Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-23 16:32               ` [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a " David Howells
  2009-04-23 16:55                 ` Oleg Nesterov
  2009-04-23 17:07                 ` Linus Torvalds
@ 2009-04-23 20:35                 ` David Howells
  2009-04-23 21:12                   ` Linus Torvalds
  2009-04-24 11:46                 ` David Howells
  3 siblings, 1 reply; 42+ messages in thread
From: David Howells @ 2009-04-23 20:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Ingo Molnar, Oleg Nesterov, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I think we _have_ to imply a smp_wmb() in the wakup semantics, because 
> otherwise sleepers can't do anything sane (no amount of barriers on the 
> sleeping side will help). IOW, there basically has to be an implied write 
> barrier between the thing that causes an event to become true, and the 
> thing that turns 'task->state' back to RUNNING.

Well, Ingo's point is it could be left up to the caller of wake_up() to supply
the barrier:

	*my_variable = 1234;
	smp_wmb();
	wake_up(&my_queue);

or:

	spin_lock(&my_lock);
	*my_variable = 1234;
	wake_up(&my_queue);
	spin_unlock(&my_lock);

on the condition that the sleeper also gets the lock.

Also, he points out that wake_up() and co. may not insert useful memory barrier
at all, for three reasons:

 (1) If there's no-one to wake up, then certain wake functions will return
     immediately.

 (2) If there's no-one to wake up, then other wake functions will only impose
     LOCK and UNLOCK barriers.

 (3) If someone supplies a special awakener instead of default_wake_function(),
     then they can bypass try_to_wake_up() and whatever barrier that implies.

     Though as far as I can see, if you want to wake someone up, you *have* to
     go through try_to_wake_up().

What I'd like to say is that wake_up() and friends _will_ interpose at least a
write barrier _if_ they wake anything up (which is more or less what you said
above).  If they don't wake anything up, then there's no need for a memory
barrier between the assignment to my_variable and the non-existent alterations
to the state of the task not being awoken.

David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier
  2009-04-22 17:57               ` Ingo Molnar
  2009-04-23 16:36                 ` Oleg Nesterov
@ 2009-04-23 20:37                 ` David Howells
  1 sibling, 0 replies; 42+ messages in thread
From: David Howells @ 2009-04-23 20:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: dhowells, Ingo Molnar, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

Oleg Nesterov <oleg@redhat.com> wrote:

> Well. I think in that case almost every try_to_wake_up/wake_up_process
> needs mb().

Not exactly.  They need smp_wmb() in the wake_up() path - unless they use
locks, which a lot of cases do.

David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-23 20:35                 ` David Howells
@ 2009-04-23 21:12                   ` Linus Torvalds
  2009-04-23 21:24                     ` Ingo Molnar
  0 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2009-04-23 21:12 UTC (permalink / raw)
  To: David Howells
  Cc: Ingo Molnar, Oleg Nesterov, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel



On Thu, 23 Apr 2009, David Howells wrote:
> 
> Well, Ingo's point is it could be left up to the caller of wake_up() to supply
> the barrier:
> 
> 	*my_variable = 1234;
> 	smp_wmb();
> 	wake_up(&my_queue);

That's bogus.

EVERY SINGLE wake_up() would need it.

There is _always_ a reason for the wakeup. And yes, you can re-order 
things, but we normally don't. Just make the rule be that there's an 
implied smp_wmb() instead.

It's not like it's going to cost anything on any sane architecture anyway. 
So asking people to add "smp_wmb()" calls before their wakups just makes 
the source code unreadable and fragile, for no actual advantage.

		Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-23 21:12                   ` Linus Torvalds
@ 2009-04-23 21:24                     ` Ingo Molnar
  0 siblings, 0 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-04-23 21:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Oleg Nesterov, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 23 Apr 2009, David Howells wrote:
> > 
> > Well, Ingo's point is it could be left up to the caller of 
> > wake_up() to supply the barrier:

Well, i mainly reacted to your documentation patch which was 
incorrect as it said wake-up implies a _FULL_ memory barrier.

I also suggested that lockless code should have its barriers clearly 
documented and they should not really rely on kernel facilities 
acting as memory barriers.

Then i also suggested that maybe in the future we could remove the 
smp_wmb() from try_to_wake_up(). That was just an afterthought, and 
a rather stupid one at that, as Linus quickly noted :-)

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-23 16:32               ` [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a " David Howells
                                   ` (2 preceding siblings ...)
  2009-04-23 20:35                 ` David Howells
@ 2009-04-24 11:46                 ` David Howells
  2009-04-24 15:08                   ` Paul E. McKenney
                                     ` (3 more replies)
  3 siblings, 4 replies; 42+ messages in thread
From: David Howells @ 2009-04-24 11:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: dhowells, Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

Oleg Nesterov <oleg@redhat.com> wrote:

> Well. I am starting to suspect I missed something, but I disagree. Or I just
> (this is very possible) misunderstand the above.

No, it's me getting myself confused.  After thinking about it for a while, I
see that you're right: the potential misordering, I think, is between the
setting of the task state and reading the data value in the sleeper, and
between the writing of the data value and the clearing of the task state in
the waker.

That means that provided that:

 (1) set_current_state() interpolates a full memory barrier after setting the
     task state then there's no problem in the sleeper, and

 (2) wake_up() interpolates a write memory barrier before clearing the task
     state - _if_ it wakes anything up - then there's no problem in the waker
     either.

How about the attached doc patch?

David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] Document memory barriers implied by sleep/wake-up primitives

Add a section to the memory barriers document to note the implied memory
barriers of sleep primitives (set_current_state() and wrappers) and wake-up
primitives (wake_up() and co.).

Also extend the in-code comments on the wake_up() functions to note these
implied barriers.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/memory-barriers.txt |   93 +++++++++++++++++++++++++++++++++++++
 kernel/sched.c                    |   23 +++++++++
 2 files changed, 115 insertions(+), 1 deletions(-)


diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index f5b7127..48d3c77 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -31,6 +31,7 @@ Contents:
 
      - Locking functions.
      - Interrupt disabling functions.
+     - Sleep and wake-up functions.
      - Miscellaneous functions.
 
  (*) Inter-CPU locking barrier effects.
@@ -1217,6 +1218,96 @@ barriers are required in such a situation, they must be provided from some
 other means.
 
 
+SLEEP AND WAKE-UP FUNCTIONS
+---------------------------
+
+Sleeping and waking on an event flagged in global data can be viewed as an
+interaction between two pieces of data: the task state of the task waiting for
+the event and the global data used to indicate the event.  To make sure that
+these appear to happen in the right order, the primitives to begin the process
+of going to sleep, and the primitives to initiate a wake up imply certain
+barriers.
+
+Firstly, the sleeper normally follows something like this sequence of events:
+
+	for (;;) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (event_indicated)
+			break;
+		schedule();
+	}
+
+A general memory barrier is interpolated automatically by set_current_state()
+after it has altered the task state:
+
+	CPU 1
+	===============================
+	set_current_state();
+	  set_mb();
+	    STORE current->state
+	    <general barrier>
+	LOAD event_indicated
+
+set_current_state() may be wrapped by:
+
+	prepare_to_wait();
+	prepare_to_wait_exclusive();
+
+which therefore also imply a general memory barrier after setting the state.
+The whole sequence above is available in various canned forms, all of which
+interpolate the memory barrier in the right place:
+
+	wait_event();
+	wait_event_interruptible();
+	wait_event_interruptible_exclusive();
+	wait_event_interruptible_timeout();
+	wait_event_killable();
+	wait_event_timeout();
+	wait_on_bit();
+	wait_on_bit_lock();
+
+
+Secondly, code that performs a wake up normally follows something like this:
+
+	event_indicated = 1;
+	wake_up(&event_wait_queue);
+
+or:
+
+	event_indicated = 1;
+	wake_up_process(event_daemon);
+
+A write memory barrier is implied by wake_up() and co. if and only if they wake
+something up.  The barrier occurs before the task state is cleared, and so sits
+between the STORE to indicate the event and the STORE to set TASK_RUNNING:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	set_current_state();		STORE event_indicated
+	  set_mb();			wake_up();
+	    STORE current->state	  <write barrier>
+	    <general barrier>		  STORE current->state
+	LOAD event_indicated
+
+The available waker functions include:
+
+	complete();
+	wake_up();
+	wake_up_all();
+	wake_up_bit();
+	wake_up_interruptible();
+	wake_up_interruptible_all();
+	wake_up_interruptible_nr();
+	wake_up_interruptible_poll();
+	wake_up_interruptible_sync();
+	wake_up_interruptible_sync_poll();
+	wake_up_locked();
+	wake_up_locked_poll();
+	wake_up_nr();
+	wake_up_poll();
+	wake_up_process();
+
+
 MISCELLANEOUS FUNCTIONS
 -----------------------
 
@@ -1366,7 +1457,7 @@ WHERE ARE MEMORY BARRIERS NEEDED?
 
 Under normal operation, memory operation reordering is generally not going to
 be a problem as a single-threaded linear piece of code will still appear to
-work correctly, even if it's in an SMP kernel.  There are, however, three
+work correctly, even if it's in an SMP kernel.  There are, however, four
 circumstances in which reordering definitely _could_ be a problem:
 
  (*) Interprocessor interaction.
diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..fd0c2ce 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2458,6 +2458,17 @@ out:
 	return success;
 }
 
+/**
+ * wake_up_process - Wake up a specific process
+ * @p: The process to be woken up.
+ *
+ * Attempt to wake up the nominated process and move it to the set of runnable
+ * processes.  Returns 1 if the process was woken up, 0 if it was already
+ * running.
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
+ */
 int wake_up_process(struct task_struct *p)
 {
 	return try_to_wake_up(p, TASK_ALL, 0);
@@ -5241,6 +5252,9 @@ void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
  * @mode: which threads
  * @nr_exclusive: how many wake-one or wake-many threads to wake up
  * @key: is directly passed to the wakeup function
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
  */
 void __wake_up(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -5279,6 +5293,9 @@ void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
  * with each other. This can prevent needless bouncing between CPUs.
  *
  * On UP it can prevent extra preemption.
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
  */
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, void *key)
@@ -5315,6 +5332,9 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
  * awakened in the same order in which they were queued.
  *
  * See also complete_all(), wait_for_completion() and related routines.
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
  */
 void complete(struct completion *x)
 {
@@ -5332,6 +5352,9 @@ EXPORT_SYMBOL(complete);
  * @x:  holds the state of this particular completion
  *
  * This will wake up all threads waiting on this particular completion event.
+ *
+ * It may be assumed that this function implies a write memory barrier before
+ * changing the task state if and only if any tasks are woken up.
  */
 void complete_all(struct completion *x)
 {

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-24 11:46                 ` David Howells
@ 2009-04-24 15:08                   ` Paul E. McKenney
  2009-04-24 17:08                     ` Oleg Nesterov
  2009-04-24 17:28                   ` Oleg Nesterov
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2009-04-24 15:08 UTC (permalink / raw)
  To: David Howells
  Cc: Oleg Nesterov, Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Nick Piggin, linux-kernel

On Fri, Apr 24, 2009 at 12:46:41PM +0100, David Howells wrote:
> Oleg Nesterov <oleg@redhat.com> wrote:
> 
> > Well. I am starting to suspect I missed something, but I disagree. Or I just
> > (this is very possible) misunderstand the above.
> 
> No, it's me getting myself confused.  After thinking about it for a while, I
> see that you're right: the potential misordering, I think, is between the
> setting of the task state and reading the data value in the sleeper, and
> between the writing of the data value and the clearing of the task state in
> the waker.
> 
> That means that provided that:
> 
>  (1) set_current_state() interpolates a full memory barrier after setting the
>      task state then there's no problem in the sleeper, and
> 
>  (2) wake_up() interpolates a write memory barrier before clearing the task
>      state - _if_ it wakes anything up - then there's no problem in the waker
>      either.
> 
> How about the attached doc patch?

One question, assuming that this documentation intends to guide the
reader on where to put the locking and/or memory-barrier primitives...

Suppose we have the following sequence of events:

1.	The waiter does "set_current_state(TASK_UNINTERRUPTIBLE);".
	This implies a full memory barrier.

2.	The awakener updates some shared state.

3.	The awakener does "event_indicated = 1;".

4.	The waiter does "if (event_indicated)", and, finding that
	the event has in fact been indicated, does "break".

5.	The waiter accesses the shared state set in #2 above.

6.	Some time later, the awakener does "wake_up(&event_wait_queue);"
	This does not awaken anyone, so no memory barrier.

Because there is no memory barrier between #2 and #3, reordering by
either the compiler or the CPU might cause the awakener to update the
event_indicated flag in #3 -before- completing its update of shared
state in #2.  Less likely (but still possible) optimizations might
cause the waiter to access the shared state in #5 before checking
the event_indicated flag in #4.

So, for this to work correctly, don't we need at least an smp_wmb()
between #2 and #3 and at least an smp_rmb() between #4 and #5?  And if
#2 does reads (but not writes) at least one variable in the shared state
that #5 writes to, don't both barriers need to be smp_mb()?

Or should I have waited until fully waking up before replying to this
email?  ;-)

For everyone's amusement, I have placed these memory barriers in the code
samples below.

And, yes, if these primitives are used with appropriate locks, then no
memory barriers are required.  So, again, for everyone's amusement, I
have indicated where the locking primitives would go -- not necessarily
where the memory barriers go due to the semipermeable nature of the
barriers implied by the locking primitives.

If I understand the purpose of this documentation, it would be good to
tell the reader where the required locks and/or memory barriers must go.

							Thanx, Paul

> David
> ---
> From: David Howells <dhowells@redhat.com>
> Subject: [PATCH] Document memory barriers implied by sleep/wake-up primitives
> 
> Add a section to the memory barriers document to note the implied memory
> barriers of sleep primitives (set_current_state() and wrappers) and wake-up
> primitives (wake_up() and co.).
> 
> Also extend the in-code comments on the wake_up() functions to note these
> implied barriers.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  Documentation/memory-barriers.txt |   93 +++++++++++++++++++++++++++++++++++++
>  kernel/sched.c                    |   23 +++++++++
>  2 files changed, 115 insertions(+), 1 deletions(-)
> 
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index f5b7127..48d3c77 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -31,6 +31,7 @@ Contents:
> 
>       - Locking functions.
>       - Interrupt disabling functions.
> +     - Sleep and wake-up functions.
>       - Miscellaneous functions.
> 
>   (*) Inter-CPU locking barrier effects.
> @@ -1217,6 +1218,96 @@ barriers are required in such a situation, they must be provided from some
>  other means.
> 
> 
> +SLEEP AND WAKE-UP FUNCTIONS
> +---------------------------
> +
> +Sleeping and waking on an event flagged in global data can be viewed as an
> +interaction between two pieces of data: the task state of the task waiting for
> +the event and the global data used to indicate the event.  To make sure that
> +these appear to happen in the right order, the primitives to begin the process
> +of going to sleep, and the primitives to initiate a wake up imply certain
> +barriers.
> +
> +Firstly, the sleeper normally follows something like this sequence of events:
> +
> +	for (;;) {
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		if (event_indicated)
> +			break;
> +		schedule();
> +	}
	smp_rmb(); /* or smp_mb() if writing later, or spin_lock(). */
> +
> +A general memory barrier is interpolated automatically by set_current_state()
> +after it has altered the task state:
> +
> +	CPU 1
> +	===============================
> +	set_current_state();
> +	  set_mb();
> +	    STORE current->state
> +	    <general barrier>
> +	LOAD event_indicated
> +
> +set_current_state() may be wrapped by:
> +
> +	prepare_to_wait();
> +	prepare_to_wait_exclusive();
> +
> +which therefore also imply a general memory barrier after setting the state.
> +The whole sequence above is available in various canned forms, all of which
> +interpolate the memory barrier in the right place:
> +
> +	wait_event();
> +	wait_event_interruptible();
> +	wait_event_interruptible_exclusive();
> +	wait_event_interruptible_timeout();
> +	wait_event_killable();
> +	wait_event_timeout();
> +	wait_on_bit();
> +	wait_on_bit_lock();
> +
> +
> +Secondly, code that performs a wake up normally follows something like this:
> +
	smp_wmb(); /* or smp_mb() if trailing prior read to a variable.  */
> +	event_indicated = 1;
> +	wake_up(&event_wait_queue);
> +
> +or:
> +
	smp_wmb(); /* or smp_mb() if trailing prior read to a variable.  */
> +	event_indicated = 1;
	/* Or spin_unlock() here instead. */
> +	wake_up_process(event_daemon);
> +
> +A write memory barrier is implied by wake_up() and co. if and only if they wake
> +something up.  The barrier occurs before the task state is cleared, and so sits
> +between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> +
> +	CPU 1				CPU 2
> +	===============================	===============================
> +	set_current_state();		STORE event_indicated
> +	  set_mb();			wake_up();
> +	    STORE current->state	  <write barrier>
> +	    <general barrier>		  STORE current->state
> +	LOAD event_indicated
> +
> +The available waker functions include:
> +
> +	complete();
> +	wake_up();
> +	wake_up_all();
> +	wake_up_bit();
> +	wake_up_interruptible();
> +	wake_up_interruptible_all();
> +	wake_up_interruptible_nr();
> +	wake_up_interruptible_poll();
> +	wake_up_interruptible_sync();
> +	wake_up_interruptible_sync_poll();
> +	wake_up_locked();
> +	wake_up_locked_poll();
> +	wake_up_nr();
> +	wake_up_poll();
> +	wake_up_process();
> +
> +
>  MISCELLANEOUS FUNCTIONS
>  -----------------------
> 
> @@ -1366,7 +1457,7 @@ WHERE ARE MEMORY BARRIERS NEEDED?
> 
>  Under normal operation, memory operation reordering is generally not going to
>  be a problem as a single-threaded linear piece of code will still appear to
> -work correctly, even if it's in an SMP kernel.  There are, however, three
> +work correctly, even if it's in an SMP kernel.  There are, however, four
>  circumstances in which reordering definitely _could_ be a problem:
> 
>   (*) Interprocessor interaction.
> diff --git a/kernel/sched.c b/kernel/sched.c
> index b902e58..fd0c2ce 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2458,6 +2458,17 @@ out:
>  	return success;
>  }
> 
> +/**
> + * wake_up_process - Wake up a specific process
> + * @p: The process to be woken up.
> + *
> + * Attempt to wake up the nominated process and move it to the set of runnable
> + * processes.  Returns 1 if the process was woken up, 0 if it was already
> + * running.
> + *
> + * It may be assumed that this function implies a write memory barrier before
> + * changing the task state if and only if any tasks are woken up.
> + */
>  int wake_up_process(struct task_struct *p)
>  {
>  	return try_to_wake_up(p, TASK_ALL, 0);
> @@ -5241,6 +5252,9 @@ void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
>   * @mode: which threads
>   * @nr_exclusive: how many wake-one or wake-many threads to wake up
>   * @key: is directly passed to the wakeup function
> + *
> + * It may be assumed that this function implies a write memory barrier before
> + * changing the task state if and only if any tasks are woken up.
>   */
>  void __wake_up(wait_queue_head_t *q, unsigned int mode,
>  			int nr_exclusive, void *key)
> @@ -5279,6 +5293,9 @@ void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
>   * with each other. This can prevent needless bouncing between CPUs.
>   *
>   * On UP it can prevent extra preemption.
> + *
> + * It may be assumed that this function implies a write memory barrier before
> + * changing the task state if and only if any tasks are woken up.
>   */
>  void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
>  			int nr_exclusive, void *key)
> @@ -5315,6 +5332,9 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
>   * awakened in the same order in which they were queued.
>   *
>   * See also complete_all(), wait_for_completion() and related routines.
> + *
> + * It may be assumed that this function implies a write memory barrier before
> + * changing the task state if and only if any tasks are woken up.
>   */
>  void complete(struct completion *x)
>  {
> @@ -5332,6 +5352,9 @@ EXPORT_SYMBOL(complete);
>   * @x:  holds the state of this particular completion
>   *
>   * This will wake up all threads waiting on this particular completion event.
> + *
> + * It may be assumed that this function implies a write memory barrier before
> + * changing the task state if and only if any tasks are woken up.
>   */
>  void complete_all(struct completion *x)
>  {

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-24 15:08                   ` Paul E. McKenney
@ 2009-04-24 17:08                     ` Oleg Nesterov
  2009-04-24 17:43                       ` Paul E. McKenney
  0 siblings, 1 reply; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-24 17:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: David Howells, Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Nick Piggin, linux-kernel

On 04/24, Paul E. McKenney wrote:
>
> One question, assuming that this documentation intends to guide the
> reader on where to put the locking and/or memory-barrier primitives...
>
> Suppose we have the following sequence of events:
>
> 1.	The waiter does "set_current_state(TASK_UNINTERRUPTIBLE);".
> 	This implies a full memory barrier.
>
> 2.	The awakener updates some shared state.
>
> 3.	The awakener does "event_indicated = 1;".
>
> 4.	The waiter does "if (event_indicated)", and, finding that
> 	the event has in fact been indicated, does "break".
>
> 5.	The waiter accesses the shared state set in #2 above.
>
> 6.	Some time later, the awakener does "wake_up(&event_wait_queue);"
> 	This does not awaken anyone, so no memory barrier.
>
> Because there is no memory barrier between #2 and #3, reordering by
> either the compiler or the CPU might cause the awakener to update the
> event_indicated flag in #3 -before- completing its update of shared
> state in #2.  Less likely (but still possible) optimizations might
> cause the waiter to access the shared state in #5 before checking
> the event_indicated flag in #4.

Do you mean something like

	awakener:

		DATA = value;
		DATA_IS_READY = true;
		wake_up(wq);


	waiter:

		set_current_state(UNINTERRUPTIBLE);
		if (DATA_IS_READY)
			do_something(DATA);

?

Imho, the code above is just buggy and should be ignored by documentation ;)

Or do I miss your point?

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-24 11:46                 ` David Howells
  2009-04-24 15:08                   ` Paul E. McKenney
@ 2009-04-24 17:28                   ` Oleg Nesterov
  2009-04-24 17:48                   ` David Howells
  2009-04-24 17:53                   ` David Howells
  3 siblings, 0 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-24 17:28 UTC (permalink / raw)
  To: David Howells
  Cc: Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

On 04/24, David Howells wrote:
>
>  (2) wake_up() interpolates a write memory barrier before clearing the task
>      state - _if_ it wakes anything up - then there's no problem in the waker
>      either.
>
[...snip...]
>
> +A write memory barrier is implied by wake_up() and co. if and only if they wake
> +something up.  The barrier occurs before the task state is cleared, and so sits
> +between the STORE to indicate the event and the STORE to set TASK_RUNNING:

Very minor nit. Perhaps it makes sense to mention that we also need the
barrier before _reading_ the task->state as well. Or not, I am not sure ;)
Just in case...

	event_indicated = 1;
	wake_up_process(event_daemon);

Suppose that "event_indicated = 1" leaks into try_to_wake_up() after we
read p->state. In this case we have

	try_to_wake_up:

		if (!(p->state & state))
			goto out;		// do nothing

		// WINDOW

		event_indicated = 1;		// leaked

In that case the whole

	set_current_state(TASK_UNINTERRUPTIBLE);
	if (event_indicated)
		break;
	schedule();

can happen in the WINDOW above.

But again, this is the real nitpick, and probably just the "implementation
details" which should not be documented.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-24 17:08                     ` Oleg Nesterov
@ 2009-04-24 17:43                       ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2009-04-24 17:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Howells, Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Nick Piggin, linux-kernel

On Fri, Apr 24, 2009 at 07:08:30PM +0200, Oleg Nesterov wrote:
> On 04/24, Paul E. McKenney wrote:
> >
> > One question, assuming that this documentation intends to guide the
> > reader on where to put the locking and/or memory-barrier primitives...
> >
> > Suppose we have the following sequence of events:
> >
> > 1.	The waiter does "set_current_state(TASK_UNINTERRUPTIBLE);".
> > 	This implies a full memory barrier.
> >
> > 2.	The awakener updates some shared state.
> >
> > 3.	The awakener does "event_indicated = 1;".
> >
> > 4.	The waiter does "if (event_indicated)", and, finding that
> > 	the event has in fact been indicated, does "break".
> >
> > 5.	The waiter accesses the shared state set in #2 above.
> >
> > 6.	Some time later, the awakener does "wake_up(&event_wait_queue);"
> > 	This does not awaken anyone, so no memory barrier.
> >
> > Because there is no memory barrier between #2 and #3, reordering by
> > either the compiler or the CPU might cause the awakener to update the
> > event_indicated flag in #3 -before- completing its update of shared
> > state in #2.  Less likely (but still possible) optimizations might
> > cause the waiter to access the shared state in #5 before checking
> > the event_indicated flag in #4.
> 
> Do you mean something like
> 
> 	awakener:
> 
> 		DATA = value;
> 		DATA_IS_READY = true;
> 		wake_up(wq);
> 
> 
> 	waiter:
> 
> 		set_current_state(UNINTERRUPTIBLE);
> 		if (DATA_IS_READY)
> 			do_something(DATA);
> 
> ?
> 
> Imho, the code above is just buggy and should be ignored by documentation ;)
> 
> Or do I miss your point?

I was hoping that this sort of code might be actively discouraged by the
documentation.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-24 11:46                 ` David Howells
  2009-04-24 15:08                   ` Paul E. McKenney
  2009-04-24 17:28                   ` Oleg Nesterov
@ 2009-04-24 17:48                   ` David Howells
  2009-04-24 18:06                     ` Paul E. McKenney
  2009-04-28 10:18                     ` David Howells
  2009-04-24 17:53                   ` David Howells
  3 siblings, 2 replies; 42+ messages in thread
From: David Howells @ 2009-04-24 17:48 UTC (permalink / raw)
  To: paulmck
  Cc: dhowells, Oleg Nesterov, Ingo Molnar, torvalds, Andrew Morton,
	serue, viro, Nick Piggin, linux-kernel

Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> Because there is no memory barrier between #2 and #3, reordering by
> either the compiler or the CPU might cause the awakener to update the
> event_indicated flag in #3 -before- completing its update of shared
> state in #2.

If the ordering of #2 and #3 is important with respect to each other, then the
awakener must manually interpolate a barrier of some sort between the two
_before_ calling wake_up() (or it should wrap them in a lock).

As I've tried to make clear in my documentation:

	Sleeping and waking on an event flagged in global data can be viewed as
	an interaction between two pieces of data: ===> the task state of the
	task waiting for the event and the global data used to indicate the
	event <===.

the barrier in wake_up() is only concerned with the ordering of #3 vs #6.  That
is all it _can_ impose an order upon, since #2 and #3 both happen before
wake_up() is called, and #3 is what causes the sleeper to break out of the
sleep loop.

> So, for this to work correctly, don't we need at least an smp_wmb()
> between #2 and #3 and at least an smp_rmb() between #4 and #5?  And if
> #2 does reads (but not writes) at least one variable in the shared state
> that #5 writes to, don't both barriers need to be smp_mb()?

Yes, but that's beyond the scope of this section.  set_current_state() imposes
the partial ordering { #1, #4 } and wake_up() the partial ordering { #3, #6 }
because those are the controlling features of the loop.

Managing the data beyond that is up to the caller of set_current_state() and
the caller of wake_up().

David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-24 11:46                 ` David Howells
                                     ` (2 preceding siblings ...)
  2009-04-24 17:48                   ` David Howells
@ 2009-04-24 17:53                   ` David Howells
  2009-04-24 18:30                     ` Oleg Nesterov
  3 siblings, 1 reply; 42+ messages in thread
From: David Howells @ 2009-04-24 17:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: dhowells, Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

Oleg Nesterov <oleg@redhat.com> wrote:

> Suppose that "event_indicated = 1" leaks into try_to_wake_up() after we
> read p->state.

In that case, it's entirely possible that the smp_wmb() in try_to_wake_up()
should actually be an smp_mb(), but that on whichever arch patch:

	commit 04e2f1741d235ba599037734878d72e57cb302b5
	Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
	Date:   Sat Feb 23 18:05:03 2008 -0800
	Subject: Add memory barrier semantics to wake_up() & co

was tested on, it made no difference.

David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-24 17:48                   ` David Howells
@ 2009-04-24 18:06                     ` Paul E. McKenney
  2009-04-28 10:18                     ` David Howells
  1 sibling, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2009-04-24 18:06 UTC (permalink / raw)
  To: David Howells
  Cc: Oleg Nesterov, Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Nick Piggin, linux-kernel

On Fri, Apr 24, 2009 at 06:48:06PM +0100, David Howells wrote:
> Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > Because there is no memory barrier between #2 and #3, reordering by
> > either the compiler or the CPU might cause the awakener to update the
> > event_indicated flag in #3 -before- completing its update of shared
> > state in #2.
> 
> If the ordering of #2 and #3 is important with respect to each other, then the
> awakener must manually interpolate a barrier of some sort between the two
> _before_ calling wake_up() (or it should wrap them in a lock).
> 
> As I've tried to make clear in my documentation:
> 
> 	Sleeping and waking on an event flagged in global data can be viewed as
> 	an interaction between two pieces of data: ===> the task state of the
> 	task waiting for the event and the global data used to indicate the
> 	event <===.
> 
> the barrier in wake_up() is only concerned with the ordering of #3 vs #6.  That
> is all it _can_ impose an order upon, since #2 and #3 both happen before
> wake_up() is called, and #3 is what causes the sleeper to break out of the
> sleep loop.
> 
> > So, for this to work correctly, don't we need at least an smp_wmb()
> > between #2 and #3 and at least an smp_rmb() between #4 and #5?  And if
> > #2 does reads (but not writes) at least one variable in the shared state
> > that #5 writes to, don't both barriers need to be smp_mb()?
> 
> Yes, but that's beyond the scope of this section.  set_current_state() imposes
> the partial ordering { #1, #4 } and wake_up() the partial ordering { #3, #6 }
> because those are the controlling features of the loop.
> 
> Managing the data beyond that is up to the caller of set_current_state() and
> the caller of wake_up().

Fair enough!

But I would strongly suggest at least a note calling this out, preferably a
"don't do this" example.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-24 17:53                   ` David Howells
@ 2009-04-24 18:30                     ` Oleg Nesterov
  0 siblings, 0 replies; 42+ messages in thread
From: Oleg Nesterov @ 2009-04-24 18:30 UTC (permalink / raw)
  To: David Howells
  Cc: Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Paul E. McKenney, Nick Piggin, linux-kernel

On 04/24, David Howells wrote:
>
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > Suppose that "event_indicated = 1" leaks into try_to_wake_up() after we
> > read p->state.
>
> In that case, it's entirely possible that the smp_wmb() in try_to_wake_up()
> should actually be an smp_mb(), but that on whichever arch patch:
>
> 	commit 04e2f1741d235ba599037734878d72e57cb302b5
> 	Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
> 	Date:   Sat Feb 23 18:05:03 2008 -0800
> 	Subject: Add memory barrier semantics to wake_up() & co
>
> was tested on, it made no difference.

I think that, from the correctness pov, this patch does not depend on arch.
>From the changelog:

    However, adding a smp_wmb() to before the spinlock should now order the
    writing of CONDITION wrt the lock itself, which in turn is ordered wrt
    the accesses within the spinlock (which includes the reading of the old
    state).

IOW, spin_lock() itself is STORE, and has the acquire semantics. This means
that "wmb() + spin_lock()" must correctly serialize "STOREs bfore" and
"STOREs/LOADs after". At least this is my understanding.


But does it really "equal" to mb() on all arches ? I don't know, but I guess
it is not possible to derive from documentation. So, at least in theory,
try_to_wake_up() doesn't provide rmb() afaics, and

	LOAD_1;
	try_to_wake_up();
	LOAD_2;

this can be re-ordered.


But I know nothing about how mb/lock _actually_ play with hardware, even on
x86.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-24 17:48                   ` David Howells
  2009-04-24 18:06                     ` Paul E. McKenney
@ 2009-04-28 10:18                     ` David Howells
  2009-04-28 13:00                       ` Paul E. McKenney
  1 sibling, 1 reply; 42+ messages in thread
From: David Howells @ 2009-04-28 10:18 UTC (permalink / raw)
  To: paulmck
  Cc: dhowells, Oleg Nesterov, Ingo Molnar, torvalds, Andrew Morton,
	serue, viro, Nick Piggin, linux-kernel

Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> But I would strongly suggest at least a note calling this out, preferably a
> "don't do this" example.

How about I add this to the bottom of the new section:

[!] Note that the memory barriers implied by the sleeper and the waker do _not_
order multiple stores before the wake-up with respect to loads of those stored
values after the sleeper has called set_current_state().  For instance, if the
sleeper does:

	set_current_state(TASK_INTERRUPTIBLE);
	if (event_indicated)
		break;
	__set_current_state(TASK_RUNNING);
	do_something(my_data);

and the waker does:

	my_data = value;
	event_indicated = 1;
	wake_up(&event_wait_queue);

there's no guarantee that the change to event_indicated will be perceived by
the sleeper as coming after the change to my_data.  In such a circumstance, the
code on both sides must interpolate its own memory barriers between the
separate data accesses.  Thus the above sleeper ought to do:

	set_current_state(TASK_INTERRUPTIBLE);
	if (event_indicated) {
		smp_rmb();
		do_something(my_data);
	}

and the waker should do:

	my_data = value;
	smp_wmb();
	event_indicated = 1;
	wake_up(&event_wait_queue);


David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a memory barrier
  2009-04-28 10:18                     ` David Howells
@ 2009-04-28 13:00                       ` Paul E. McKenney
  0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2009-04-28 13:00 UTC (permalink / raw)
  To: David Howells
  Cc: Oleg Nesterov, Ingo Molnar, torvalds, Andrew Morton, serue, viro,
	Nick Piggin, linux-kernel

On Tue, Apr 28, 2009 at 11:18:51AM +0100, David Howells wrote:
> Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > But I would strongly suggest at least a note calling this out, preferably a
> > "don't do this" example.
> 
> How about I add this to the bottom of the new section:
> 
> [!] Note that the memory barriers implied by the sleeper and the waker do _not_
> order multiple stores before the wake-up with respect to loads of those stored
> values after the sleeper has called set_current_state().  For instance, if the
> sleeper does:
> 
> 	set_current_state(TASK_INTERRUPTIBLE);
> 	if (event_indicated)
> 		break;
> 	__set_current_state(TASK_RUNNING);
> 	do_something(my_data);
> 
> and the waker does:
> 
> 	my_data = value;
> 	event_indicated = 1;
> 	wake_up(&event_wait_queue);
> 
> there's no guarantee that the change to event_indicated will be perceived by
> the sleeper as coming after the change to my_data.  In such a circumstance, the
> code on both sides must interpolate its own memory barriers between the
> separate data accesses.  Thus the above sleeper ought to do:
> 
> 	set_current_state(TASK_INTERRUPTIBLE);
> 	if (event_indicated) {
> 		smp_rmb();
> 		do_something(my_data);
> 	}
> 
> and the waker should do:
> 
> 	my_data = value;
> 	smp_wmb();
> 	event_indicated = 1;
> 	wake_up(&event_wait_queue);

Looks good to me!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] slow_work_thread() should do the exclusive wait
@ 2009-06-11 12:12 David Howells
  0 siblings, 0 replies; 42+ messages in thread
From: David Howells @ 2009-06-11 12:12 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, linux-kernel, Oleg Nesterov, David Howells

From: Oleg Nesterov <oleg@redhat.com>

slow_work_thread() sleeps on slow_work_thread_wq without WQ_FLAG_EXCLUSIVE,
this means that slow_work_enqueue()->__wake_up(nr_exclusive => 1) wakes up all
kslowd threads.  This is not what we want, so we change slow_work_thread() to
use prepare_to_wait_exclusive() instead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 kernel/slow-work.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)


diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index b28d191..521ed20 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -372,8 +372,8 @@ static int slow_work_thread(void *_data)
 		vsmax *= atomic_read(&slow_work_thread_count);
 		vsmax /= 100;
 
-		prepare_to_wait(&slow_work_thread_wq, &wait,
-				TASK_INTERRUPTIBLE);
+		prepare_to_wait_exclusive(&slow_work_thread_wq, &wait,
+					  TASK_INTERRUPTIBLE);
 		if (!freezing(current) &&
 		    !slow_work_threads_should_exit &&
 		    !slow_work_available(vsmax) &&


^ permalink raw reply related	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2009-06-11 12:13 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-13 18:17 [PATCH] slow_work_thread() should do the exclusive wait Oleg Nesterov
2009-04-13 19:03 ` Trond Myklebust
2009-04-13 19:14   ` Oleg Nesterov
2009-04-13 21:35 ` David Howells
2009-04-13 21:40 ` David Howells
2009-04-13 21:48   ` Oleg Nesterov
2009-04-13 21:57     ` Trond Myklebust
2009-04-13 22:24       ` Oleg Nesterov
2009-04-15 23:27         ` Andrew Morton
2009-04-16  9:10         ` David Howells
2009-04-16 14:33           ` Oleg Nesterov
2009-04-22 13:37           ` [PATCH] Document that wake_up(), complete() and co. imply a full memory barrier David Howells
2009-04-22 13:51             ` Ingo Molnar
2009-04-22 14:39               ` Oleg Nesterov
2009-04-22 14:56                 ` Ingo Molnar
2009-04-22 15:07                   ` Oleg Nesterov
2009-04-22 15:12             ` David Howells
2009-04-22 15:19               ` Ingo Molnar
2009-04-22 16:23             ` David Howells
2009-04-22 17:57               ` Ingo Molnar
2009-04-23 16:36                 ` Oleg Nesterov
2009-04-23 20:37                 ` David Howells
2009-04-23 16:32               ` [PATCH] It may not be assumed that wake_up(), finish_wait() and co. imply a " David Howells
2009-04-23 16:55                 ` Oleg Nesterov
2009-04-23 17:07                 ` Linus Torvalds
2009-04-23 20:35                 ` David Howells
2009-04-23 21:12                   ` Linus Torvalds
2009-04-23 21:24                     ` Ingo Molnar
2009-04-24 11:46                 ` David Howells
2009-04-24 15:08                   ` Paul E. McKenney
2009-04-24 17:08                     ` Oleg Nesterov
2009-04-24 17:43                       ` Paul E. McKenney
2009-04-24 17:28                   ` Oleg Nesterov
2009-04-24 17:48                   ` David Howells
2009-04-24 18:06                     ` Paul E. McKenney
2009-04-28 10:18                     ` David Howells
2009-04-28 13:00                       ` Paul E. McKenney
2009-04-24 17:53                   ` David Howells
2009-04-24 18:30                     ` Oleg Nesterov
2009-04-23 16:00         ` [PATCH] slow_work_thread() should do the exclusive wait David Howells
2009-04-23 16:18           ` Oleg Nesterov
2009-06-11 12:12 David Howells

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.