linux-kernel.vger.kernel.org archive mirror
* [BUG] TASK_DEAD task is able to be woken up in special condition
@ 2011-12-22  0:42 Yasunori Goto
  2011-12-22  2:14 ` KOSAKI Motohiro
  2011-12-23  9:49 ` Peter Zijlstra
  0 siblings, 2 replies; 55+ messages in thread
From: Yasunori Goto @ 2011-12-22  0:42 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Hiroyuki KAMEZAWA, Motohiro Kosaki, Linux Kernel ML


Hello

I found that a TASK_DEAD task can be woken up under a special condition.
I would like to report this bug. Please check it.

Here is the sequence in which it occurs.

----------------------------------+-----------------------------
                                  |
           CPU A                  |             CPU B
----------------------------------+-----------------------------
TASK A calls exit()....

do_exit()

  exit_mm()
    down_read(mm->mmap_sem);
    
    rwsem_down_failed_common()

      set TASK_UNINTERRUPTIBLE
      set waiter.task <= task A
      list_add to sem->wait_list
           :
      raw_spin_unlock_irq()
      (I/O interrupt occurred)

                                      __rwsem_do_wake(mmap_sem)

                                        list_del(&waiter->list);
                                        waiter->task = NULL
                                        wake_up_process(task A)
                                          try_to_wake_up()
                                             (task is still
                                               TASK_UNINTERRUPTIBLE,
                                              p->on_rq is still 1)

                                              ttwu_do_wakeup()
                                                 (*A)
                                                   :
     (I/O interruption handler finished)

      if (!waiter.task) 
          schedule() is not called
          because waiter.task is NULL.
      
      tsk->state = TASK_RUNNING

          :
                                              check_preempt_curr();
                                                  :
  task->state = TASK_DEAD
                                              (*B)
                                        <---    set TASK_RUNNING (*C)



     schedule()
     (exit task is running again)
     BUG_ON() is called!
--------------------------------------------------------


You may think that the execution time between (*A) and (*B) is very short,
because interrupts are disabled, so setting TASK_RUNNING at (*C)
must be executed before TASK_DEAD is set.


HOWEVER, if an SMI occurs between (*A) and (*B),
(*C) can execute AFTER TASK_DEAD is set!
Then the exited task is scheduled again, and BUG_ON() is called....

This is a very bad scenario.
And I suppose this phenomenon can occur on a guest system of a
virtual machine too.

Please fix it.

I suppose task->pi_lock should be held when task->state is changed to
TASK_DEAD, as in the following patch (not tested yet), because
try_to_wake_up() holds it before checking the task state.


Thanks, 

----
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 kernel/exit.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-3.2-rc4/kernel/exit.c
===================================================================
--- linux-3.2-rc4.orig/kernel/exit.c
+++ linux-3.2-rc4/kernel/exit.c
@@ -1038,8 +1038,11 @@ NORET_TYPE void do_exit(long code)
 
 	preempt_disable();
 	exit_rcu();
+
+	spin_lock(&tsk->pi_lock, flags);
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
+	spin_unlock(&tsk->pi_lock, flags);
 	schedule();
 	BUG();
 	/* Avoid "noreturn function does return".  */




-- 
Yasunori Goto 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-22  0:42 [BUG] TASK_DEAD task is able to be woken up in special condition Yasunori Goto
@ 2011-12-22  2:14 ` KOSAKI Motohiro
  2011-12-22  8:22   ` Yasunori Goto
  2011-12-23  9:49 ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: KOSAKI Motohiro @ 2011-12-22  2:14 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Peter Zijlstra, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

(12/21/11 7:42 PM), Yasunori Goto wrote:
>
> Hello
>
> I found TASK_DEAD task is able to be woken up in special condition.
> I would like to report this bug. Please check it.
>
> Here is the sequence how it occurs.
>
> ----------------------------------+-----------------------------
>                                    |
>             CPU A                  |             CPU B
> ----------------------------------+-----------------------------
> TASK A calls exit()....
>
> do_exit()
>
>    exit_mm()
>      down_read(mm->mmap_sem);
>
>      rwsem_down_failed_common()
>
>        set TASK_UNINTERRUPTIBLE
>        set waiter.task<= task A
>        list_add to sem->wait_list
>             :
>        raw_spin_unlock_irq()
>        (I/O interrupt occurred)
>
>                                        __rwsem_do_wake(mmap_sem)
>
>                                          list_del(&waiter->list);
>                                          waiter->task = NULL
>                                          wake_up_process(task A)
>                                            try_to_wake_up()
>                                               (task is still
>                                                 TASK_UNINTERRUPTIBLE)
>                                                p->on_rq is still 1.)
>
>                                                ttwu_do_wakeup()
>                                                   (*A)
>                                                     :
>       (I/O interruption handler finished)
>
>        if (!waiter.task)
>            schedule() is not called
>            due to waiter.task is NULL.
>
>        tsk->state = TASK_RUNNING
>
>            :
>                                                check_preempt_curr();
>                                                    :
>    task->state = TASK_DEAD
>                                                (*B)
>                                          <---    set TASK_RUNNING (*C)
>
>
>
>       schedule()
>       (exit task is running again)
>       BUG_ON() is called!
> --------------------------------------------------------
>
>
> You probably think that execution time between (*A) and (*B) is very short,
> because the interruption is disabled, and setting TASK_RUNNING at (*C)
> must be executed before setting TASK_DEAD.
>
>
> HOWEVER, if SMI is interrupted between (*A) and (*B),
> (*C) is able to execute AFTER setting TASK_DEAD!
> Then, exited task is scheduled again, and BUG_ON() is called....
>
> This is a very bad scenario.
> But, I suppose this phenomenon is able to occur on a guest system of
> virtual machine too.
>
> Please fix it.
>
> I suppose task->pi_lock should be held when task->state is changed to
> TASK_DEAD like the following patch (not tested yet).
> Because try_to_wake_up() hold it before checking task state.
>
>
> Thanks,
>
> ----
> Signed-off-by: Yasunori Goto<y-goto@jp.fujitsu.com>
>
> ---
>   kernel/exit.c |    3 +++
>   1 file changed, 3 insertions(+)
>
> Index: linux-3.2-rc4/kernel/exit.c
> ===================================================================
> --- linux-3.2-rc4.orig/kernel/exit.c
> +++ linux-3.2-rc4/kernel/exit.c
> @@ -1038,8 +1038,11 @@ NORET_TYPE void do_exit(long code)
>
>   	preempt_disable();
>   	exit_rcu();
> +
> +	spin_lock(&tsk->pi_lock, flags);
>   	/* causes final put_task_struct in finish_task_switch(). */
>   	tsk->state = TASK_DEAD;
> +	spin_unlock(&tsk->pi_lock, flags);
>   	schedule();
>   	BUG();
>   	/* Avoid "noreturn function does return".  */

I suspect this is not only a TASK_DEAD issue but a fundamental rwsem issue,
because a lot of places assume "current->state = newstate" is safe and
don't need any synchronization. So I worry that losing
TASK_UNINTERRUPTIBLE could cause a catastrophe just like TASK_DEAD.

How about the following patch? Anyway, rwsem_down_failed_common() is
definitely a slowpath, so killing a micro-optimization there is not so
much of a problem, I guess.



diff --git a/lib/rwsem.c b/lib/rwsem.c
index 410aa11..e2a0c9a 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -208,9 +208,9 @@ rwsem_down_failed_common(struct rw_semaphore *sem,

         /* wait to be given the lock */
         for (;;) {
+               schedule();
                 if (!waiter.task)
                         break;
-               schedule();
                 set_task_state(tsk, TASK_UNINTERRUPTIBLE);
         }






* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-22  2:14 ` KOSAKI Motohiro
@ 2011-12-22  8:22   ` Yasunori Goto
  2011-12-22 20:02     ` KOSAKI Motohiro
  0 siblings, 1 reply; 55+ messages in thread
From: Yasunori Goto @ 2011-12-22  8:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Peter Zijlstra, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

Thank you for your response.


> > ----
> > Signed-off-by: Yasunori Goto<y-goto@jp.fujitsu.com>
> >
> > ---
> >   kernel/exit.c |    3 +++
> >   1 file changed, 3 insertions(+)
> >
> > Index: linux-3.2-rc4/kernel/exit.c
> > ===================================================================
> > --- linux-3.2-rc4.orig/kernel/exit.c
> > +++ linux-3.2-rc4/kernel/exit.c
> > @@ -1038,8 +1038,11 @@ NORET_TYPE void do_exit(long code)
> >
> >   	preempt_disable();
> >   	exit_rcu();
> > +
> > +	spin_lock(&tsk->pi_lock, flags);
> >   	/* causes final put_task_struct in finish_task_switch(). */
> >   	tsk->state = TASK_DEAD;
> > +	spin_unlock(&tsk->pi_lock, flags);
> >   	schedule();
> >   	BUG();
> >   	/* Avoid "noreturn function does return".  */
> 
> I suspect this is not only a TASK_DEAD issue but a fundamental rwsem issue,
> because a lot of places assume "current->state = newstate" is safe and
> don't need any synchronization. So I worry that losing
> TASK_UNINTERRUPTIBLE could cause a catastrophe just like TASK_DEAD.

I don't understand why this is a catastrophe.
I suppose the task is just woken up from TASK_UNINTERRUPTIBLE
by try_to_wake_up() in a race condition. That seems to be a normal situation.....

But the TASK_DEAD state is special. It must not return to the TASK_RUNNING state.

> 
> How about the following patch? Anyway, rwsem_down_failed_common() is
> definitely a slowpath, so killing a micro-optimization there is not so
> much of a problem, I guess.
> 
> 
> 
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 410aa11..e2a0c9a 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -208,9 +208,9 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
> 
>          /* wait to be given the lock */
>          for (;;) {
> +               schedule();
>                  if (!waiter.task)
>                          break;
> -               schedule();
>                  set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>          }
> 

Hmmmmmmm.
Are you sure there is no route by which a TASK_DEAD task can be woken up,
like the rwsem one?


Thanks.
-- 
Yasunori Goto 




* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-22  8:22   ` Yasunori Goto
@ 2011-12-22 20:02     ` KOSAKI Motohiro
  0 siblings, 0 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2011-12-22 20:02 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Peter Zijlstra, Ingo Molnar, Hiroyuki KAMEZAWA, Linux Kernel ML

>> I suspect this is not only a TASK_DEAD issue but a fundamental rwsem issue,
>> because a lot of places assume "current->state = newstate" is safe and
>> don't need any synchronization. So I worry that losing
>> TASK_UNINTERRUPTIBLE could cause a catastrophe just like TASK_DEAD.
>
> I don't understand why this is a catastrophe.
> I suppose the task is just woken up from TASK_UNINTERRUPTIBLE
> by try_to_wake_up() in a race condition. That seems to be a normal situation.....

First, yes, try_to_wake_up() has a race, and callers must care about
spurious wakeups. But mutex, rwsem and the other synchronization primitives
must hide it and ensure their locking semantics. That's one of the reasons
why we use a lock primitive instead of a bare try_to_wake_up(). I don't
think "hey, all driver code must know the scheduler race and scheduler
internals deeply" is a practical solution.


> But TASK_DEAD status is special. It must not return to TASK_RUNNING state.

O.K., agreed. But your patch seems hacky: (1) I worry about code outside
the scheduler knowing about pi_lock and taking it, and (2) all of the
other places that change task->state don't take pi_lock, so your proposed
locking semantics are not clear.

So removing TASK_DEAD and adding a new member task->is_dead is an
alternative idea.



>> @@ -208,9 +208,9 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
>>
>>          /* wait to be given the lock */
>>          for (;;) {
>> +               schedule();
>>                  if (!waiter.task)
>>                          break;
>> -               schedule();
>>                  set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>>          }
>>
>
> Hmmmmmmm.
> Are you sure there is no route by which a TASK_DEAD task can be woken up,
> like the rwsem one?

Not sure, but any such routes would be bugs if they exist, I think.


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-22  0:42 [BUG] TASK_DEAD task is able to be woken up in special condition Yasunori Goto
  2011-12-22  2:14 ` KOSAKI Motohiro
@ 2011-12-23  9:49 ` Peter Zijlstra
  2011-12-23 15:41   ` Oleg Nesterov
  2011-12-26  6:52   ` Yasunori Goto
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2011-12-23  9:49 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki, Linux Kernel ML,
	Oleg Nesterov, KOSAKI Motohiro

On Thu, 2011-12-22 at 09:42 +0900, Yasunori Goto wrote:
> I found TASK_DEAD task is able to be woken up in special condition.
> I would like to report this bug. Please check it.

How did you find it? Manual inspection? Inspection of a core-dump?

> Here is the sequence how it occurs.
> 
> ----------------------------------+-----------------------------
>                                   |
>            CPU A                  |             CPU B
> ----------------------------------+-----------------------------
> TASK A calls exit()....
> 
> do_exit()
> 
>   exit_mm()
>     down_read(mm->mmap_sem);
>     
>     rwsem_down_failed_common()
> 
>       set TASK_UNINTERRUPTIBLE
>       set waiter.task <= task A
>       list_add to sem->wait_list
>            :
>       raw_spin_unlock_irq()
>       (I/O interrupt occurred)
> 
>                                       __rwsem_do_wake(mmap_sem)
> 
>                                         list_del(&waiter->list);
>                                         waiter->task = NULL
>                                         wake_up_process(task A)
>                                           try_to_wake_up()
>                                              (task is still
>                                                TASK_UNINTERRUPTIBLE)
>                                               p->on_rq is still 1.)
> 
>                                               ttwu_do_wakeup()
>                                                  (*A)
>                                                    :
>      (I/O interruption handler finished)
> 
>       if (!waiter.task) 
>           schedule() is not called
>           due to waiter.task is NULL.
>       
>       tsk->state = TASK_RUNNING
> 
>           :
>                                               check_preempt_curr();
>                                                   :
>   task->state = TASK_DEAD
>                                               (*B)
>                                         <---    set TASK_RUNNING (*C)
> 
> 
> 
>      schedule()
>      (exit task is running again)
>      BUG_ON() is called!
> --------------------------------------------------------

<snip>

> This is a very bad scenario.
> But, I suppose this phenomenon is able to occur on a guest system of
> virtual machine too.
> 
> Please fix it.
> 
> I suppose task->pi_lock should be held when task->state is changed to 
> TASK_DEAD like the following patch (not tested yet).
> Because try_to_wake_up() hold it before checking task state.

I don't think this can actually happen; note the raw_spin_unlock_wait()
in do_exit() long before setting TASK_DEAD, which should synchronize
against the in-progress wakeup and ensure it has finished and has set
TASK_RUNNING. Spurious wakeups after that won't see a state to act on
and will terminate immediately without touching the state.


> ---
>  kernel/exit.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: linux-3.2-rc4/kernel/exit.c
> ===================================================================
> --- linux-3.2-rc4.orig/kernel/exit.c
> +++ linux-3.2-rc4/kernel/exit.c
> @@ -1038,8 +1038,11 @@ NORET_TYPE void do_exit(long code)
>  
>         preempt_disable();
>         exit_rcu();
> +
> +       spin_lock(&tsk->pi_lock, flags);
>         /* causes final put_task_struct in finish_task_switch(). */
>         tsk->state = TASK_DEAD;
> +       spin_unlock(&tsk->pi_lock, flags);
>         schedule();
>         BUG();
>         /* Avoid "noreturn function does return".  */ 

Note, ->pi_lock is a raw_spinlock_t, those should've been raw_spin_*().


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-23  9:49 ` Peter Zijlstra
@ 2011-12-23 15:41   ` Oleg Nesterov
  2011-12-26  8:23     ` Yasunori Goto
  2011-12-26  6:52   ` Yasunori Goto
  1 sibling, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2011-12-23 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yasunori Goto, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

On 12/23, Peter Zijlstra wrote:
>
> On Thu, 2011-12-22 at 09:42 +0900, Yasunori Goto wrote:
> >
> > ----------------------------------+-----------------------------
> >                                   |
> >            CPU A                  |             CPU B
> > ----------------------------------+-----------------------------
> > TASK A calls exit()....
> >
> > do_exit()
> >
> >   exit_mm()
> >     down_read(mm->mmap_sem);
> >
> >     rwsem_down_failed_common()
> >
> >       set TASK_UNINTERRUPTIBLE
> >       set waiter.task <= task A
> >       list_add to sem->wait_list
> >            :
> >       raw_spin_unlock_irq()
> >       (I/O interrupt occurred)
> >
> >                                       __rwsem_do_wake(mmap_sem)
> >
> >                                         list_del(&waiter->list);
> >                                         waiter->task = NULL
> >                                         wake_up_process(task A)
> >                                           try_to_wake_up()
> >                                              (task is still
> >                                                TASK_UNINTERRUPTIBLE)
> >                                               p->on_rq is still 1.)
> >
> >                                               ttwu_do_wakeup()
> >                                                  (*A)
> >                                                    :
> >      (I/O interruption handler finished)
> >
> >       if (!waiter.task)
> >           schedule() is not called
> >           due to waiter.task is NULL.
> >
> >       tsk->state = TASK_RUNNING
> >
> >           :
> >                                               check_preempt_curr();
> >                                                   :
> >   task->state = TASK_DEAD
> >                                               (*B)
> >                                         <---    set TASK_RUNNING (*C)
> >
> >
> >
> >      schedule()
> >      (exit task is running again)
> >      BUG_ON() is called!
> > --------------------------------------------------------
>
> <snip>
>
> > This is a very bad scenario.
> > But, I suppose this phenomenon is able to occur on a guest system of
> > virtual machine too.
> >
> > Please fix it.
> >
> > I suppose task->pi_lock should be held when task->state is changed to
> > TASK_DEAD like the following patch (not tested yet).
> > Because try_to_wake_up() hold it before checking task state.
>
> I don't think this can actually happen, note the raw_spin_unlock_wait()
> in do_exit() long before setting TASK_DEAD, that should synchronize
> against the in-progress wakeup and ensure its finished and has set
> TASK_RUNNING.

How? raw_spin_unlock_wait() is called before exit_mm(); after that the
exiting task plays with its ->state.

IIRC, this was already discussed a bit. Say, try_to_wake_up(TASK_INTERRUPTIBLE)
can wake up a TASK_UNINTERRUPTIBLE task if it temporarily set INTERRUPTIBLE but
didn't call schedule() in that state.

Maybe ttwu_do_wakeup() should do cmpxchg(old_state, RUNNING)?

Oleg.



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-23  9:49 ` Peter Zijlstra
  2011-12-23 15:41   ` Oleg Nesterov
@ 2011-12-26  6:52   ` Yasunori Goto
  1 sibling, 0 replies; 55+ messages in thread
From: Yasunori Goto @ 2011-12-26  6:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki, Linux Kernel ML,
	Oleg Nesterov

> On Thu, 2011-12-22 at 09:42 +0900, Yasunori Goto wrote:
> > I found TASK_DEAD task is able to be woken up in special condition.
> > I would like to report this bug. Please check it.
> 
> How did you find it? Manual inspection? Inspection of a core-dump?

The original trouble occurred on a distributor's kernel based on 2.6.32.
The kernel called BUG() because a TASK_DEAD task was scheduled again.
I chased it with the trace in a crash dump, and I confirmed this sequence.

From my code review, current 3.2-rc6 seems to have the same problem, so I
posted this report.


> 
> > Here is the sequence how it occurs.
> > 
> > ----------------------------------+-----------------------------
> >                                   |
> >            CPU A                  |             CPU B
> > ----------------------------------+-----------------------------
> > TASK A calls exit()....
> > 
> > do_exit()
> > 
> >   exit_mm()
> >     down_read(mm->mmap_sem);
> >     
> >     rwsem_down_failed_common()
> > 
> >       set TASK_UNINTERRUPTIBLE
> >       set waiter.task <= task A
> >       list_add to sem->wait_list
> >            :
> >       raw_spin_unlock_irq()
> >       (I/O interrupt occurred)
> > 
> >                                       __rwsem_do_wake(mmap_sem)
> > 
> >                                         list_del(&waiter->list);
> >                                         waiter->task = NULL
> >                                         wake_up_process(task A)
> >                                           try_to_wake_up()
> >                                              (task is still
> >                                                TASK_UNINTERRUPTIBLE)
> >                                               p->on_rq is still 1.)
> > 
> >                                               ttwu_do_wakeup()
> >                                                  (*A)
> >                                                    :
> >      (I/O interruption handler finished)
> > 
> >       if (!waiter.task) 
> >           schedule() is not called
> >           due to waiter.task is NULL.
> >       
> >       tsk->state = TASK_RUNNING
> > 
> >           :
> >                                               check_preempt_curr();
> >                                                   :
> >   task->state = TASK_DEAD
> >                                               (*B)
> >                                         <---    set TASK_RUNNING (*C)
> > 
> > 
> > 
> >      schedule()
> >      (exit task is running again)
> >      BUG_ON() is called!
> > --------------------------------------------------------
> 
> <snip>
> 
> > This is a very bad scenario.
> > But, I suppose this phenomenon is able to occur on a guest system of
> > virtual machine too.
> > 
> > Please fix it.
> > 
> > I suppose task->pi_lock should be held when task->state is changed to 
> > TASK_DEAD like the following patch (not tested yet).
> > Because try_to_wake_up() hold it before checking task state.
> 
> I don't think this can actually happen, note the raw_spin_unlock_wait()
> in do_exit() long before setting TASK_DEAD, that should synchronize
> against the in-progress wakeup and ensure its finished and has set
> TASK_RUNNING. Spurious wakeups after that won't see a state to act on
> and will terminate immediately without touching state.

As Oleg-san said, raw_spin_unlock_wait() is called before exit_mm().
This race condition occurred after taking the rwsem of mmap_sem in exit_mm().

Thanks.

> 
> 
> > ---
> >  kernel/exit.c |    3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > Index: linux-3.2-rc4/kernel/exit.c
> > ===================================================================
> > --- linux-3.2-rc4.orig/kernel/exit.c
> > +++ linux-3.2-rc4/kernel/exit.c
> > @@ -1038,8 +1038,11 @@ NORET_TYPE void do_exit(long code)
> >  
> >         preempt_disable();
> >         exit_rcu();
> > +
> > +       spin_lock(&tsk->pi_lock, flags);
> >         /* causes final put_task_struct in finish_task_switch(). */
> >         tsk->state = TASK_DEAD;
> > +       spin_unlock(&tsk->pi_lock, flags);
> >         schedule();
> >         BUG();
> >         /* Avoid "noreturn function does return".  */ 
> 
> Note, ->pi_lock is a raw_spinlock_t, those should've been raw_spin_*().




-- 
Yasunori Goto 




* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-23 15:41   ` Oleg Nesterov
@ 2011-12-26  8:23     ` Yasunori Goto
  2011-12-26 17:11       ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Yasunori Goto @ 2011-12-26  8:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

> On 12/23, Peter Zijlstra wrote:
> >
> > On Thu, 2011-12-22 at 09:42 +0900, Yasunori Goto wrote:
> > >
> > > ----------------------------------+-----------------------------
> > >                                   |
> > >            CPU A                  |             CPU B
> > > ----------------------------------+-----------------------------
> > > TASK A calls exit()....
> > >
> > > do_exit()
> > >
> > >   exit_mm()
> > >     down_read(mm->mmap_sem);
> > >
> > >     rwsem_down_failed_common()
> > >
> > >       set TASK_UNINTERRUPTIBLE
> > >       set waiter.task <= task A
> > >       list_add to sem->wait_list
> > >            :
> > >       raw_spin_unlock_irq()
> > >       (I/O interrupt occurred)
> > >
> > >                                       __rwsem_do_wake(mmap_sem)
> > >
> > >                                         list_del(&waiter->list);
> > >                                         waiter->task = NULL
> > >                                         wake_up_process(task A)
> > >                                           try_to_wake_up()
> > >                                              (task is still
> > >                                                TASK_UNINTERRUPTIBLE)
> > >                                               p->on_rq is still 1.)
> > >
> > >                                               ttwu_do_wakeup()
> > >                                                  (*A)
> > >                                                    :
> > >      (I/O interruption handler finished)
> > >
> > >       if (!waiter.task)
> > >           schedule() is not called
> > >           due to waiter.task is NULL.
> > >
> > >       tsk->state = TASK_RUNNING
> > >
> > >           :
> > >                                               check_preempt_curr();
> > >                                                   :
> > >   task->state = TASK_DEAD
> > >                                               (*B)
> > >                                         <---    set TASK_RUNNING (*C)
> > >
> > >
> > >
> > >      schedule()
> > >      (exit task is running again)
> > >      BUG_ON() is called!
> > > --------------------------------------------------------
> >
> > <snip>
> >
> > > This is a very bad scenario.
> > > But, I suppose this phenomenon is able to occur on a guest system of
> > > virtual machine too.
> > >
> > > Please fix it.
> > >
> > > I suppose task->pi_lock should be held when task->state is changed to
> > > TASK_DEAD like the following patch (not tested yet).
> > > Because try_to_wake_up() hold it before checking task state.
> >
> > I don't think this can actually happen, note the raw_spin_unlock_wait()
> > in do_exit() long before setting TASK_DEAD, that should synchronize
> > against the in-progress wakeup and ensure its finished and has set
> > TASK_RUNNING.
> 
> How? raw_spin_unlock_wait() is called before exit_mm(); after that the
> exiting task plays with its ->state.
> 
> IIRC, this was already discussed a bit. Say, try_to_wake_up(TASK_INTERRUPTIBLE)
> can wake up a TASK_UNINTERRUPTIBLE task if it temporarily set INTERRUPTIBLE but
> didn't call schedule() in that state.

Oleg-san,

Could you point me to that discussion?
I don't understand yet how it occurs...


> 
> May be ttwu_do_wakeup() should do cmpxchg(old_state, RUNNING) ?
> 
> Oleg.
> 

-- 
Yasunori Goto 




* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-26  8:23     ` Yasunori Goto
@ 2011-12-26 17:11       ` Oleg Nesterov
  2011-12-27  6:48         ` Yasunori Goto
  2011-12-28 21:07         ` KOSAKI Motohiro
  0 siblings, 2 replies; 55+ messages in thread
From: Oleg Nesterov @ 2011-12-26 17:11 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Peter Zijlstra, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

On 12/26, Yasunori Goto wrote:
>
> >
> > IIRC, this was already discussed a bit. Say, try_to_wake_up(TASK_INTERRUPTIBLE)
> > can wakeup a TASK_UNINTERRUPTIBLE task if it temporary sets INTERRUPTIBLE but
> > doesn't call schedule() in this state.
>
> Oleg-san,
>
> Could you point the discussion?
> I don't understand yet how it occurred...

Suppose that the task T does

	set_current_state(TASK_INTERRUPTIBLE);

	set_current_state(TASK_UNINTERRUPTIBLE);
	schedule();

try_to_wake_up(TASK_UNINTERRUPTIBLE) in between can observe this task
in the TASK_INTERRUPTIBLE state. Then it can set RUNNING/WAKING after T
sets ->state = TASK_UNINTERRUPTIBLE.

For example, this is possible if T simply does wait_event() twice and
the 1st wait_event() doesn't sleep.

Basically this is the same race you described, but I think you have found
a case where we can't tolerate the spurious wakeup.

Oleg.



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-26 17:11       ` Oleg Nesterov
@ 2011-12-27  6:48         ` Yasunori Goto
  2012-01-06 10:22           ` Yasunori Goto
  2011-12-28 21:07         ` KOSAKI Motohiro
  1 sibling, 1 reply; 55+ messages in thread
From: Yasunori Goto @ 2011-12-27  6:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

> On 12/26, Yasunori Goto wrote:
> >
> > >
> > > IIRC, this was already discussed a bit. Say, try_to_wake_up(TASK_INTERRUPTIBLE)
> > > can wake up a TASK_UNINTERRUPTIBLE task if it temporarily set INTERRUPTIBLE but
> > > didn't call schedule() in that state.
> >
> > Oleg-san,
> >
> > Could you point the discussion?
> > I don't understand yet how it occurred...
> 
> Suppose that the task T does
> 
> 	set_current_state(TASK_INTERRUPTIBLE);
> 
> 	set_current_state(TASK_UNINTERRUPTIBLE);
> 	schedule();
> 
> try_to_wake_up(TASK_UNINTERRUPTIBLE) in between can observe this task
> in TASK_INTERRUPTIBLE state. Then it can set RUNNING/WAKING after T
> sets ->state = TASK_UNINTERRUPTIBLE.
> 
> For example, this is possible if T simply does wait_event() twice when
> the 1st wait_event() doesn't sleep.
> 
> Basically this is the same race you described, but I think you found
> a case where we can't tolerate the spurious wakeup.
> 
> Oleg.

Thanks.

I suppose your idea of using cmpxchg(old_state, RUNNING) is the best way.
I'll remake my patch.




-- 
Yasunori Goto 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-26 17:11       ` Oleg Nesterov
  2011-12-27  6:48         ` Yasunori Goto
@ 2011-12-28 21:07         ` KOSAKI Motohiro
  2012-01-24 10:23           ` Peter Zijlstra
  2012-01-25 10:10           ` Peter Zijlstra
  1 sibling, 2 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2011-12-28 21:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yasunori Goto, Peter Zijlstra, Ingo Molnar, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

(12/26/11 12:11 PM), Oleg Nesterov wrote:
> On 12/26, Yasunori Goto wrote:
>>
>>>
>>> IIRC, this was already discussed a bit. Say, try_to_wake_up(TASK_INTERRUPTIBLE)
>>> can wake up a TASK_UNINTERRUPTIBLE task if it temporarily sets INTERRUPTIBLE but
>>> doesn't call schedule() in this state.
>>
>> Oleg-san,
>>
>> Could you point me to that discussion?
>> I don't understand yet how it occurred...
>
> Suppose that the task T does
>
> 	set_current_state(TASK_INTERRUPTIBLE);
>
> 	set_current_state(TASK_UNINTERRUPTIBLE);
> 	schedule();
>
> try_to_wake_up(TASK_UNINTERRUPTIBLE) in between can observe this task
> in TASK_INTERRUPTIBLE state. Then it can set RUNNING/WAKING after T
> sets ->state = TASK_UNINTERRUPTIBLE.
>
> For example, this is possible if T simply does wait_event() twice when
> the 1st wait_event() doesn't sleep.
>
> Basically this is the same race you described, but I think you found
> a case where we can't tolerate the spurious wakeup.

Hi

I looked at the scheduler code briefly today. Now I'm afraid the following
code has a similar race.


         if (task_contributes_to_load(p))
                 rq->nr_uninterruptible--;



Can't the following scenario happen?


CPU0                    CPU1
--------------------------------------------------------
deactivate_task()
                        task->state = TASK_UNINTERRUPTIBLE;
activate_task()
   rq->nr_uninterruptible--;

                        schedule()
                          deactivate_task()
                            rq->nr_uninterruptible++;

The net result is that nr_uninterruptible is not incremented, even though
the task is now sleeping uninterruptibly.


I'm still not sure. I need to read more sched code.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-27  6:48         ` Yasunori Goto
@ 2012-01-06 10:22           ` Yasunori Goto
  2012-01-06 11:01             ` Peter Zijlstra
  2012-01-06 13:48             ` [BUG] TASK_DEAD task is able to be woken up in special condition Oleg Nesterov
  0 siblings, 2 replies; 55+ messages in thread
From: Yasunori Goto @ 2012-01-06 10:22 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki, Linux Kernel ML

Hello,

I made a trial patch that changes the task state with cmpxchg to solve this problem.
Please comment.

So far I have only confirmed that my box boots with it, and I would like to get
rough agreement on this approach first.
If the approach is roughly OK, I will test this patch more.

Thanks,

--------------------

try_to_wake_up() has a problem: it may change a task's state from
TASK_DEAD back to TASK_RUNNING when racing with an SMI or with the
scheduling of a virtual machine guest. (See: https://lkml.org/lkml/2011/12/21/523)
As a result, the exited task is scheduled again and a panic occurs.

In addition, try_to_wake_up(TASK_INTERRUPTIBLE) can wake up a
TASK_UNINTERRUPTIBLE task if the task temporarily sets TASK_INTERRUPTIBLE
but doesn't call schedule() in that state.

With this patch, the kernel changes the task state with cmpxchg when the
task is on a run queue. If the task state does not match the required
state, the kernel aborts the wakeup.


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 kernel/sched.c |   96 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 79 insertions(+), 17 deletions(-)

Index: linux-3.2-rc7/kernel/sched.c
===================================================================
--- linux-3.2-rc7.orig/kernel/sched.c
+++ linux-3.2-rc7/kernel/sched.c
@@ -2650,16 +2650,31 @@ static void ttwu_activate(struct rq *rq,
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
-/*
- * Mark the task runnable and perform wakeup-preemption.
- */
-static void
-ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
+static int
+change_task_state_cmpxchg(struct task_struct *p, unsigned int target_state,
+			  unsigned int new)
 {
-	trace_sched_wakeup(p, true);
-	check_preempt_curr(rq, p, wake_flags);
+	unsigned int old, tmp;
 
-	p->state = TASK_RUNNING;
+	old = p->state;
+
+	while (1) {
+		if (old == new)
+			return 1;
+
+		if (unlikely(!(old & target_state)))
+			return 0;
+
+		tmp = cmpxchg(&p->state, old, new);
+		if (likely(tmp == old))
+			return 1;
+
+		old = tmp;
+	}
+}
+
+static void ttwu_do_woken(struct rq *rq, struct task_struct *p)
+{
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken)
 		p->sched_class->task_woken(rq, p);
@@ -2677,6 +2692,45 @@ ttwu_do_wakeup(struct rq *rq, struct tas
 #endif
 }
 
+/*
+ * Mark the task runnable and perform wakeup-preemption.
+ */
+static void
+ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
+{
+	trace_sched_wakeup(p, true);
+	check_preempt_curr(rq, p, wake_flags);
+
+	p->state = TASK_RUNNING;
+
+	ttwu_do_woken(rq, p);
+}
+
+/*
+ * Mark the task runnable and perform wakeup-preemption with
+ * making sure the state of task. It might be changed due to
+ * race condition.
+ */
+static int
+ttwu_do_wakeup_state_check(struct rq *rq, struct task_struct *p,
+			   unsigned int target_state, int wake_flags)
+{
+	int ret;
+
+	ret = change_task_state_cmpxchg(p, target_state, TASK_RUNNING);
+	trace_sched_wakeup(p, ret);
+
+	/* failed to change state due to race? */
+	if (!ret)
+		return -EINVAL;
+
+	check_preempt_curr(rq, p, wake_flags);
+
+	ttwu_do_woken(rq, p);
+
+	return 1;
+}
+
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
 {
@@ -2695,16 +2749,16 @@ ttwu_do_activate(struct rq *rq, struct t
  * since all we need to do is flip p->state to TASK_RUNNING, since
  * the task is still ->on_rq.
  */
-static int ttwu_remote(struct task_struct *p, int wake_flags)
+static int
+ttwu_remote(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	struct rq *rq;
 	int ret = 0;
 
 	rq = __task_rq_lock(p);
-	if (p->on_rq) {
-		ttwu_do_wakeup(rq, p, wake_flags);
-		ret = 1;
-	}
+	if (p->on_rq)
+		ret = ttwu_do_wakeup_state_check(rq, p, state, wake_flags);
+
 	__task_rq_unlock(rq);
 
 	return ret;
@@ -2766,7 +2820,8 @@ static void ttwu_queue_remote(struct tas
 }
 
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
-static int ttwu_activate_remote(struct task_struct *p, int wake_flags)
+static int
+ttwu_activate_remote(struct task_struct *p, int wake_flags)
 {
 	struct rq *rq;
 	int ret = 0;
@@ -2828,11 +2883,18 @@ try_to_wake_up(struct task_struct *p, un
 	if (!(p->state & state))
 		goto out;
 
-	success = 1; /* we're going to change ->state */
 	cpu = task_cpu(p);
 
-	if (p->on_rq && ttwu_remote(p, wake_flags))
-		goto stat;
+	if (p->on_rq) {
+		int ret;
+		ret = ttwu_remote(p, state, wake_flags);
+		if (likely(ret == 1)) {
+			success = 1;
+			goto stat;
+		} else if (unlikely(ret < 0))
+			goto out;
+	}
+	success = 1; /* we're going to change ->state */
 
 #ifdef CONFIG_SMP
 	/*


-- 
Yasunori Goto 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-06 10:22           ` Yasunori Goto
@ 2012-01-06 11:01             ` Peter Zijlstra
  2012-01-06 12:01               ` Yasunori Goto
  2012-01-06 13:48             ` [BUG] TASK_DEAD task is able to be woken up in special condition Oleg Nesterov
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-06 11:01 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Oleg Nesterov, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

On Fri, 2012-01-06 at 19:22 +0900, Yasunori Goto wrote:
> I just confirmed booting up on my box, and I would like to get rough agreement
> about this way to solve this issue at first. 

I really don't like it. It makes the ttwu path more complex and more
expensive. ttwu is one of the hottest and more complex paths in the
scheduler, it needs neither more overhead nor more complexity.

I'd really much rather put another raw_spin_unlock_wait() in do_exit()
before we set TASK_DEAD. It probably needs an smp memory barrier too.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-06 11:01             ` Peter Zijlstra
@ 2012-01-06 12:01               ` Yasunori Goto
  2012-01-06 12:43                 ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Yasunori Goto @ 2012-01-06 12:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

> On Fri, 2012-01-06 at 19:22 +0900, Yasunori Goto wrote:
> > I just confirmed booting up on my box, and I would like to get rough agreement
> > about this way to solve this issue at first. 
> 
> I really don't like it. It makes the ttwu path more complex and more
> expensive. ttwu is one of the hottest and more complex paths in the
> scheduler, it needs neither more overhead nor more complexity.

Hmmmmm.
Ok.
> 
> I'd really much rather put another raw_spin_unlocked_wait() in do_exit()
> before we set TASK_DEAD. It probably needs an smp memory barrier too.

Do you mean the following patch?

---

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 kernel/exit.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-3.2-rc7/kernel/exit.c
===================================================================
--- linux-3.2-rc7.orig/kernel/exit.c
+++ linux-3.2-rc7/kernel/exit.c
@@ -1038,6 +1038,10 @@ NORET_TYPE void do_exit(long code)
 
 	preempt_disable();
 	exit_rcu();
+
+	smp_mb();
+	raw_spin_unlock_wait(&tsk->pi_lock);
+
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	schedule();


-- 
Yasunori Goto 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-06 12:01               ` Yasunori Goto
@ 2012-01-06 12:43                 ` Peter Zijlstra
  2012-01-06 14:12                   ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-06 12:43 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Oleg Nesterov, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

On Fri, 2012-01-06 at 21:01 +0900, Yasunori Goto wrote:

> Do you mean the following patch?

Yes, something like that. At that point ->state should be TASK_RUNNING
(since we are, after all, running). The unlock_wait() will synchronize
against any in-progress ttwu() while its fast path is a non-atomic
compare. Any ttwu() after this will bail since it will observe either
TASK_RUNNING or TASK_DEAD, neither of which is a state it will act upon.

Now the only question that remains is if we need the full memory barrier
or if we can get away with less.

I guess the mb separates the write to ->state (setting TASK_RUNNING)
from the read of ->pi_lock. The remote CPU must see the TASK_RUNNING,
and we must see ->pi_lock taken if it is.

I also can't find anything to 'borrow' a barrier from (well I can for
mainline, but not for -rt).

So yes, I guess the below will do, albeit it needs a somewhat
comprehensive comment explaining its need.

Oleg, can you agree?

> ---
> 
> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
> 
> ---
>  kernel/exit.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> Index: linux-3.2-rc7/kernel/exit.c
> ===================================================================
> --- linux-3.2-rc7.orig/kernel/exit.c
> +++ linux-3.2-rc7/kernel/exit.c
> @@ -1038,6 +1038,10 @@ NORET_TYPE void do_exit(long code)
>  
>  	preempt_disable();
>  	exit_rcu();
> +
> +	smp_mb();
> +	raw_spin_unlock_wait(&tsk->pi_lock);
> +
>  	/* causes final put_task_struct in finish_task_switch(). */
>  	tsk->state = TASK_DEAD;
>  	schedule();
> 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-06 10:22           ` Yasunori Goto
  2012-01-06 11:01             ` Peter Zijlstra
@ 2012-01-06 13:48             ` Oleg Nesterov
  1 sibling, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-06 13:48 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Peter Zijlstra, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

On 01/06, Yasunori Goto wrote:
>
> +change_task_state_cmpxchg(struct task_struct *p, unsigned int target_state,
> +			  unsigned int new)
>  {
> -	trace_sched_wakeup(p, true);
> -	check_preempt_curr(rq, p, wake_flags);
> +	unsigned int old, tmp;
>
> -	p->state = TASK_RUNNING;
> +	old = p->state;
> +
> +	while (1) {
> +		if (old == new)
> +			return 1;
> +
> +		if (unlikely(!(old & target_state)))
> +			return 0;
> +
> +		tmp = cmpxchg(&p->state, old, new);
> +		if (likely(tmp == old))
> +			return 1;
> +
> +		old = tmp;
> +	}

I do not really think we should retry if cmpxchg fails. The state was
changed after the initial check; we can pretend it was changed after we
changed it successfully.

> @@ -2828,11 +2883,18 @@ try_to_wake_up(struct task_struct *p, un
>  	if (!(p->state & state))
>  		goto out;
>
> -	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>
> -	if (p->on_rq && ttwu_remote(p, wake_flags))
> -		goto stat;
> +	if (p->on_rq) {
> +		int ret;
> +		ret = ttwu_remote(p, state, wake_flags);
> +		if (likely(ret == 1)) {
> +			success = 1;
> +			goto stat;
> +		} else if (unlikely(ret < 0))
> +			goto out;
> +	}
> +	success = 1; /* we're going to change ->state */

But this is not enough: you should also recheck the state below. Although
in this case you do not need atomic ops, since the target can do nothing
with its ->state.

Anyway, I have to agree with Peter even though I suggested this initially.
It adds unpleasant complications. Hopefully you have found the only
case where a spurious wakeup really hurts; it is probably better to
fix do_exit().

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-06 12:43                 ` Peter Zijlstra
@ 2012-01-06 14:12                   ` Oleg Nesterov
  2012-01-06 14:19                     ` Oleg Nesterov
  2012-01-07  1:31                     ` Yasunori Goto
  0 siblings, 2 replies; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-06 14:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yasunori Goto, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

On 01/06, Peter Zijlstra wrote:
>
> On Fri, 2012-01-06 at 21:01 +0900, Yasunori Goto wrote:
>
> > Do you mean the following patch?
>
> Yes, something like that. At that point ->state should be TASK_RUNNING
> (since we are after all running). The unlock_wait() will synchronize
> against any in-progress ttwu() while its fast path is a non-atomic
> compare. Any ttwu after this will bail since it will either observe
> TASK_RUNNING or TASK_DEAD, neither are a state it will act upon.
>
> Now the only question that remains is if we need the full memory barrier
> or if we can get away with less.
>
> I guess the mb separates the write to ->state (setting TASK_RUNNING)
> from the read of ->pi_lock. The remote CPU must see the TASK_RUNNING,
> and we must see ->pi_lock taken if it is.

Yes, I think we need the full mb, STORE vs LOAD.

> > --- linux-3.2-rc7.orig/kernel/exit.c
> > +++ linux-3.2-rc7/kernel/exit.c
> > @@ -1038,6 +1038,10 @@ NORET_TYPE void do_exit(long code)
> >
> >  	preempt_disable();
> >  	exit_rcu();
> > +
> > +	smp_mb();
> > +	raw_spin_unlock_wait(&tsk->pi_lock);
> > +
> >  	/* causes final put_task_struct in finish_task_switch(). */
> >  	tsk->state = TASK_DEAD;

Interesting. Initially I thought this is wrong and we should do

	raw_spin_unlock_wait(pi_lock);

	mb();

	tsk->state = TASK_DEAD;

This "obviously" serializes LOAD(pi_lock) and STORE(state).

But when I re-read your explanation above I think you are right:
mb() before unlock_wait() should work too, it just refers to a
state = RUNNING store in the past.

But this makes me worry. We are doing a lot of things after
exit_mm(). In particular we take tasklist_lock in exit_notify()
and then do_exit() takes task_lock(). But every unlock + lock
implies mb(). So how was it possible to hit this bug???

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-06 14:12                   ` Oleg Nesterov
@ 2012-01-06 14:19                     ` Oleg Nesterov
  2012-01-07  1:31                     ` Yasunori Goto
  1 sibling, 0 replies; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-06 14:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yasunori Goto, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

On 01/06, Oleg Nesterov wrote:
>
> But this makes me worry. We are doing a lot of things after
> exit_mm(). In particular we take tasklist_lock in exit_notify()
> and then do_exit() takes task_lock(). But every unlock + lock
> implies mb(). So how it was possible to hit this bug???

Damn. please ignore me.

Somehow I forgot that _there is no_ spin_unlock_wait() in the
current code ;)

Oleg.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-06 14:12                   ` Oleg Nesterov
  2012-01-06 14:19                     ` Oleg Nesterov
@ 2012-01-07  1:31                     ` Yasunori Goto
  2012-01-16 11:51                       ` Yasunori Goto
  1 sibling, 1 reply; 55+ messages in thread
From: Yasunori Goto @ 2012-01-07  1:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

> On 01/06, Peter Zijlstra wrote:
> >
> > On Fri, 2012-01-06 at 21:01 +0900, Yasunori Goto wrote:
> >
> > > Do you mean the following patch?
> >
> > Yes, something like that. At that point ->state should be TASK_RUNNING
> > (since we are after all running). The unlock_wait() will synchronize
> > against any in-progress ttwu() while its fast path is a non-atomic
> > compare. Any ttwu after this will bail since it will either observe
> > TASK_RUNNING or TASK_DEAD, neither are a state it will act upon.
> >
> > Now the only question that remains is if we need the full memory barrier
> > or if we can get away with less.
> >
> > I guess the mb separates the write to ->state (setting TASK_RUNNING)
> > from the read of ->pi_lock. The remote CPU must see the TASK_RUNNING,
> > and we must see ->pi_lock taken if it is.
> 
> Yes, I think we need the full mb, STORE vs LOAD.
> 
> > > --- linux-3.2-rc7.orig/kernel/exit.c
> > > +++ linux-3.2-rc7/kernel/exit.c
> > > @@ -1038,6 +1038,10 @@ NORET_TYPE void do_exit(long code)
> > >
> > >  	preempt_disable();
> > >  	exit_rcu();
> > > +
> > > +	smp_mb();
> > > +	raw_spin_unlock_wait(&tsk->pi_lock);
> > > +
> > >  	/* causes final put_task_struct in finish_task_switch(). */
> > >  	tsk->state = TASK_DEAD;
> 
> Interesting. Initially I thought this is wrong and we should do
> 
> 	raw_spin_unlock_wait(pi_lock);
> 
> 	mb();
> 
> 	tsk->state = TASK_DEAD;
> 
> This "obviously" serializes LOAD(pi_lock) and STORE(state).
> 
> But when I re-read your explanation above I think you are right,
> mb() before unlock_wait() should work too, just it refers to
> state = RUNNING in the past.

Ok. Thanks for your agreement.
I'll test this patch.


-- 
Yasunori Goto 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-07  1:31                     ` Yasunori Goto
@ 2012-01-16 11:51                       ` Yasunori Goto
  2012-01-16 13:38                         ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Yasunori Goto @ 2012-01-16 11:51 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki, Linux Kernel ML


> Ok. Thanks for your agreement.
> I'll test this patch.

Hi.

I tested this patch. I've not found any problems so far.
Please merge it.


-----
try_to_wake_up() has a problem: it may change a task's state from
TASK_DEAD back to TASK_RUNNING when racing with an SMI or with the
scheduling of a virtual machine guest. (See: https://lkml.org/lkml/2011/12/21/523)
As a result, the exited task is scheduled again and a panic occurs.

With this patch, do_exit() waits for task->pi_lock, which is taken by
try_to_wake_up(), to be released. This guarantees that any in-flight
wakeup has finished before the task sets TASK_DEAD.


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 kernel/exit.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-3.2/kernel/exit.c
===================================================================
--- linux-3.2.orig/kernel/exit.c
+++ linux-3.2/kernel/exit.c
@@ -1038,6 +1038,14 @@ NORET_TYPE void do_exit(long code)
 
 	preempt_disable();
 	exit_rcu();
+
+	/*
+	 * try_to_wake_up() might be waking me up due to race condition.
+	 * Make sure it is finished.
+	 */
+	smp_mb();
+	raw_spin_unlock_wait(&tsk->pi_lock);
+
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	schedule();

-- 
Yasunori Goto 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-16 11:51                       ` Yasunori Goto
@ 2012-01-16 13:38                         ` Peter Zijlstra
  2012-01-17  8:40                           ` Yasunori Goto
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-16 13:38 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Oleg Nesterov, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

On Mon, 2012-01-16 at 20:51 +0900, Yasunori Goto wrote:

> try_to_wake_up() has a problem which may change status from TASK_DEAD to
> TASK_RUNNING in race condition with SMI or guest environment of virtual
> machine. (See: https://lkml.org/lkml/2011/12/21/523)
> As a result, exited task is scheduled() again and panic occurs.
> 
> By this patch, do_exit() waits for releasing task->pi_lock which is used
> in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after
> waking up.


This Changelog isn't very good. Please spell out the problem instead of
referring to it so that people using git-blame and the like don't then
have to go look up some (possibly non-existent) web-resource.

> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
> 
> ---
>  kernel/exit.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> Index: linux-3.2/kernel/exit.c
> ===================================================================
> --- linux-3.2.orig/kernel/exit.c
> +++ linux-3.2/kernel/exit.c
> @@ -1038,6 +1038,14 @@ NORET_TYPE void do_exit(long code)
>  
>  	preempt_disable();
>  	exit_rcu();
> +
> +	/*
> +	 * try_to_wake_up() might be waking me up due to race condition.
> +	 * Make sure it is finished.
> +	 */

That comment is waaaay too terse. What race and what must be finished?

> +	smp_mb();
> +	raw_spin_unlock_wait(&tsk->pi_lock);
> +
>  	/* causes final put_task_struct in finish_task_switch(). */
>  	tsk->state = TASK_DEAD;
>  	schedule();
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-16 13:38                         ` Peter Zijlstra
@ 2012-01-17  8:40                           ` Yasunori Goto
  2012-01-17  9:06                             ` Ingo Molnar
  2012-01-28 12:03                             ` [tip:sched/core] sched: Fix ancient race in do_exit() tip-bot for Yasunori Goto
  0 siblings, 2 replies; 55+ messages in thread
From: Yasunori Goto @ 2012-01-17  8:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Ingo Molnar, Hiroyuki KAMEZAWA, Motohiro Kosaki,
	Linux Kernel ML

> On Mon, 2012-01-16 at 20:51 +0900, Yasunori Goto wrote:
> 
> > try_to_wake_up() has a problem which may change status from TASK_DEAD to
> > TASK_RUNNING in race condition with SMI or guest environment of virtual
> > machine. (See: https://lkml.org/lkml/2011/12/21/523)
> > As a result, exited task is scheduled() again and panic occurs.
> > 
> > By this patch, do_exit() waits for releasing task->pi_lock which is used
> > in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after
> > waking up.
> 
> 
> This Changelog isn't very good. Please spell out the problem instead of
> referring to it so that people using git-blame and the like don't then
> have to go look up some (possibly non-existent) web-resource.

Ahh, Ok.

I rewrote the description and the comment in the patch.
Please check them.

-----
try_to_wake_up() has a problem: it may change a task's state from
TASK_DEAD back to TASK_RUNNING when racing with an SMI or with the
scheduling of a virtual machine guest. As a result, the exited task is
scheduled again and a panic occurs.

Here is the sequence how it occurs.

----------------------------------+-----------------------------
                                  |
           CPU A                  |             CPU B
----------------------------------+-----------------------------
TASK A calls exit()....

do_exit()

  exit_mm()
    down_read(mm->mmap_sem);
    
    rwsem_down_failed_common()

      set TASK_UNINTERRUPTIBLE
      set waiter.task <= task A
      list_add to sem->wait_list
           :
      raw_spin_unlock_irq()
      (I/O interruption occured)

                                      __rwsem_do_wake(mmap_sem)

                                        list_del(&waiter->list);
                                        waiter->task = NULL
                                        wake_up_process(task A)
                                          try_to_wake_up()
                                             (task is still
                                               TASK_UNINTERRUPTIBLE)
                                              p->on_rq is still 1.)

                                              ttwu_do_wakeup()
                                                 (*A)
                                                   :
     (I/O interruption handler finished)

      if (!waiter.task) 
          schedule() is not called
          due to waiter.task is NULL.
      
      tsk->state = TASK_RUNNING

          :
                                              check_preempt_curr();
                                                  :
  task->state = TASK_DEAD
                                              (*B)
                                        <---    set TASK_RUNNING (*C)



     schedule()
     (exit task is running again)
     BUG_ON() is called!
--------------------------------------------------------


The execution time between (*A) and (*B) is usually very short, because
interrupts are disabled, so setting TASK_RUNNING at (*C) is normally
executed before TASK_DEAD is set.

HOWEVER, if an SMI occurs between (*A) and (*B),
(*C) can execute AFTER TASK_DEAD has been set!
Then the exited task is scheduled again, and BUG_ON() is called....

If the system runs as a guest of a virtual machine, the time
between (*A) and (*B) may also become long due to hypervisor scheduling,
and the same phenomenon can occur.



With this patch, do_exit() waits for task->pi_lock, which is taken by
try_to_wake_up(), to be released. This guarantees that any in-flight
wakeup has finished before the task sets TASK_DEAD.


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 kernel/exit.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Index: linux-3.2/kernel/exit.c
===================================================================
--- linux-3.2.orig/kernel/exit.c
+++ linux-3.2/kernel/exit.c
@@ -1038,6 +1038,22 @@ NORET_TYPE void do_exit(long code)
 
 	preempt_disable();
 	exit_rcu();
+
+	/*
+	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
+	 * when both of the following conditions are true:
+	 *   - there is a race on mmap_sem (it is acquired by
+	 *     exit_mm()), and
+	 *   - an SMI occurs before TASK_RUNNING is set (or the
+	 *     hypervisor of a virtual machine switches to another guest).
+	 * As a result, we may become TASK_RUNNING after becoming TASK_DEAD.
+	 *
+	 * To avoid this, we have to wait for tsk->pi_lock, which is held
+	 * by try_to_wake_up(), to be released.
+	 */
+	smp_mb();
+	raw_spin_unlock_wait(&tsk->pi_lock);
+
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	schedule();
-----


-- 
Yasunori Goto 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-17  8:40                           ` Yasunori Goto
@ 2012-01-17  9:06                             ` Ingo Molnar
  2012-01-17 15:12                               ` Oleg Nesterov
  2012-01-28 12:03                             ` [tip:sched/core] sched: Fix ancient race in do_exit() tip-bot for Yasunori Goto
  1 sibling, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2012-01-17  9:06 UTC (permalink / raw)
  To: Yasunori Goto, Thomas Gleixner
  Cc: Peter Zijlstra, Oleg Nesterov, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML


* Yasunori Goto <y-goto@jp.fujitsu.com> wrote:

> --- linux-3.2.orig/kernel/exit.c
> +++ linux-3.2/kernel/exit.c
> @@ -1038,6 +1038,22 @@ NORET_TYPE void do_exit(long code)
>  
>  	preempt_disable();
>  	exit_rcu();
> +
> +	/*
> +	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> +	 * when the following two conditions become true.
> +	 *   - There is race condition of mmap_sem (It is acquired by
> +	 *     exit_mm()), and
> +	 *   - SMI occurs before setting TASK_RUNNING.
> +	 *     (or hypervisor of virtual machine switches to other guest)
> +	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> +	 *
> +	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> +	 * is held by try_to_wake_up()
> +	 */
> +	smp_mb();
> +	raw_spin_unlock_wait(&tsk->pi_lock);

Hm, unlock_wait() is really nasty. Wouldn't the adoption of the 
-rt kernel's delayed-task-put logic solve most of these races? 
It's the patch below; we could get rid of the 
CONFIG_PREEMPT_RT_BASE dependency and make it unconditional.

[ The -rt kernel is facing similar "sudden outbreak of large 
  delays" constraints as hypervisors or SMI victims do, so even 
  if the delayed-task-put patch does not solve the race, some 
  other -rt patch might :-) ]

Thanks,

	Ingo

-------------------->
Subject: sched-delay-put-task.patch
From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 31 May 2011 16:59:16 +0200

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/sched.h |   13 +++++++++++++
 kernel/fork.c         |   11 +++++++++++
 2 files changed, 24 insertions(+)

Index: linux-3.2/include/linux/sched.h
===================================================================
--- linux-3.2.orig/include/linux/sched.h
+++ linux-3.2/include/linux/sched.h
@@ -1588,6 +1588,9 @@ struct task_struct {
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 	atomic_t ptrace_bp_refcnt;
 #endif
+#ifdef CONFIG_PREEMPT_RT_BASE
+	struct rcu_head put_rcu;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -1772,6 +1775,15 @@ extern struct pid *cad_pid;
 extern void free_task(struct task_struct *tsk);
 #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
 
+#ifdef CONFIG_PREEMPT_RT_BASE
+extern void __put_task_struct_cb(struct rcu_head *rhp);
+
+static inline void put_task_struct(struct task_struct *t)
+{
+	if (atomic_dec_and_test(&t->usage))
+		call_rcu(&t->put_rcu, __put_task_struct_cb);
+}
+#else
 extern void __put_task_struct(struct task_struct *t);
 
 static inline void put_task_struct(struct task_struct *t)
@@ -1779,6 +1791,7 @@ static inline void put_task_struct(struc
 	if (atomic_dec_and_test(&t->usage))
 		__put_task_struct(t);
 }
+#endif
 
 extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
 extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
Index: linux-3.2/kernel/fork.c
===================================================================
--- linux-3.2.orig/kernel/fork.c
+++ linux-3.2/kernel/fork.c
@@ -197,7 +197,18 @@ void __put_task_struct(struct task_struc
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
+#ifndef CONFIG_PREEMPT_RT_BASE
 EXPORT_SYMBOL_GPL(__put_task_struct);
+#else
+void __put_task_struct_cb(struct rcu_head *rhp)
+{
+	struct task_struct *tsk = container_of(rhp, struct task_struct, put_rcu);
+
+	__put_task_struct(tsk);
+
+}
+EXPORT_SYMBOL_GPL(__put_task_struct_cb);
+#endif
 
 /*
  * macro override instead of weak attribute alias, to workaround


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-17  9:06                             ` Ingo Molnar
@ 2012-01-17 15:12                               ` Oleg Nesterov
  2012-01-18  9:42                                 ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-17 15:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Yasunori Goto, Thomas Gleixner, Peter Zijlstra,
	Hiroyuki KAMEZAWA, Motohiro Kosaki, Linux Kernel ML

On 01/17, Ingo Molnar wrote:
>
> * Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> > --- linux-3.2.orig/kernel/exit.c
> > +++ linux-3.2/kernel/exit.c
> > @@ -1038,6 +1038,22 @@ NORET_TYPE void do_exit(long code)
> >
> >  	preempt_disable();
> >  	exit_rcu();
> > +
> > +	/*
> > +	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> > +	 * when the following two conditions are both true:
> > +	 *   - there is a race condition on mmap_sem (it is acquired by
> > +	 *     exit_mm()), and
> > +	 *   - an SMI occurs before TASK_RUNNING is set (or the hypervisor
> > +	 *     of a virtual machine switches to another guest).
> > +	 * As a result, we may become TASK_RUNNING after becoming TASK_DEAD.
> > +	 *
> > +	 * To avoid this, wait until tsk->pi_lock, which may be held by
> > +	 * try_to_wake_up(), is released.
> > +	 */
> > +	smp_mb();
> > +	raw_spin_unlock_wait(&tsk->pi_lock);
>
> Hm, unlock_wait() is really nasty. Wouldn't the adoption of the
> -rt kernel's delayed task put logic solve most of these races?

How? The problem is that the exiting task can do the last schedule()
in TASK_RUNNING state, this breaks the TASK_DEAD logic in
finish_task_switch().

Oleg.



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-17 15:12                               ` Oleg Nesterov
@ 2012-01-18  9:42                                 ` Ingo Molnar
  2012-01-18 14:20                                   ` Oleg Nesterov
  2012-01-24 10:11                                   ` Peter Zijlstra
  0 siblings, 2 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-01-18  9:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yasunori Goto, Thomas Gleixner, Peter Zijlstra,
	Hiroyuki KAMEZAWA, Motohiro Kosaki, Linux Kernel ML


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 01/17, Ingo Molnar wrote:
> >
> > * Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
> >
> > > --- linux-3.2.orig/kernel/exit.c
> > > +++ linux-3.2/kernel/exit.c
> > > @@ -1038,6 +1038,22 @@ NORET_TYPE void do_exit(long code)
> > >
> > >  	preempt_disable();
> > >  	exit_rcu();
> > > +
> > > +	/*
> > > +	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> > > +	 * when the following two conditions are both true:
> > > +	 *   - there is a race condition on mmap_sem (it is acquired by
> > > +	 *     exit_mm()), and
> > > +	 *   - an SMI occurs before TASK_RUNNING is set (or the hypervisor
> > > +	 *     of a virtual machine switches to another guest).
> > > +	 * As a result, we may become TASK_RUNNING after becoming TASK_DEAD.
> > > +	 *
> > > +	 * To avoid this, wait until tsk->pi_lock, which may be held by
> > > +	 * try_to_wake_up(), is released.
> > > +	 */
> > > +	smp_mb();
> > > +	raw_spin_unlock_wait(&tsk->pi_lock);
> >
> > Hm, unlock_wait() is really nasty. Wouldn't the adoption of 
> > the -rt kernel's delayed task put logic solve most of these 
> > races?
> 
> How? The problem is that the exiting task can do the last 
> schedule() in TASK_RUNNING state, this breaks the TASK_DEAD 
> logic in finish_task_switch().

Well, but does the -rt kernel suffer from the same race? It can 
generate delays at the exact same place, and can generate much 
longer delays than an SMI, if a high-prio RT task comes along.

So if there's something in the -rt kernel that fixes this race 
we'd like to have that. If the bug is present in the -rt kernel 
then why didn't it ever get triggered? We caught much narrower 
races in -rt, and very early on in the project.

Thanks,

	Ingo


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-18  9:42                                 ` Ingo Molnar
@ 2012-01-18 14:20                                   ` Oleg Nesterov
  2012-01-24 10:19                                     ` Peter Zijlstra
  2012-01-24 10:11                                   ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-18 14:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Yasunori Goto, Thomas Gleixner, Peter Zijlstra,
	Hiroyuki KAMEZAWA, Motohiro Kosaki, Linux Kernel ML

On 01/18, Ingo Molnar wrote:
>
> Well, but does the -rt kernel suffer from the same race?

I think yes.

As a reminder, the problem is generic; it is not bound to this
particular place. Let me repeat:

Suppose that the task T does

	set_current_state(TASK_INTERRUPTIBLE);

	set_current_state(TASK_UNINTERRUPTIBLE);
	schedule();

try_to_wake_up(TASK_UNINTERRUPTIBLE) in between can observe this task
in TASK_INTERRUPTIBLE state. Then it can set RUNNING/WAKING after T
sets ->state = TASK_UNINTERRUPTIBLE.

For example, this is possible if T simply does wait_event() twice and
the 1st wait_event() doesn't sleep.


do_exit() is different because it cannot handle a spurious wakeup.
Well, maybe we can? We can simply do

		for (;;) {
			tsk->state = TASK_DEAD;
			schedule();
		}

__schedule() can't race with ttwu() once it takes rq->lock. If the
exiting task is deactivated, finish_task_switch() will see EXIT_DEAD.
Unless I missed something, the only problem is preempt_disable(),
but schedule_debug() checks ->exit_state.

OTOH, if we fix this race then probably schedule_debug() should
check state == EXIT_DEAD instead.

> So if there's something in the -rt kernel that fixes this race
> we'd like to have that. If the bug is present in the -rt kernel
> then why didn't it ever get triggered? We caught much narrower
> races in -rt, and very early on in the project.

I do not know.

Oleg.



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-18  9:42                                 ` Ingo Molnar
  2012-01-18 14:20                                   ` Oleg Nesterov
@ 2012-01-24 10:11                                   ` Peter Zijlstra
  2012-01-26  9:39                                     ` Ingo Molnar
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-24 10:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Oleg Nesterov, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On Wed, 2012-01-18 at 10:42 +0100, Ingo Molnar wrote:
> Well, but does the -rt kernel suffer from the same race? 

Yes it does

> It can 
> generate delays at the exact same place, and can generate much 
> longer delays than an SMI, if a high-prio RT task comes along.
> 
No it can't, since do_exit() plays silly games with preempt_disable().



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-18 14:20                                   ` Oleg Nesterov
@ 2012-01-24 10:19                                     ` Peter Zijlstra
  2012-01-24 10:55                                       ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-24 10:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On Wed, 2012-01-18 at 15:20 +0100, Oleg Nesterov wrote:
> do_exit() is different because it can not handle the spurious wakeup.
> Well, may be we can? we can simply do
> 
>                 for (;;) {
>                         tsk->state = TASK_DEAD;
>                         schedule();
>                 }
> 
> __schedule() can't race with ttwu() once it takes rq->lock. If the
> exiting task is deactivated, finish_task_switch() will see EXIT_DEAD.

TASK_DEAD, right?

> Unless I missed something, the only problem is preempt_disable(),
> but schedule_debug() checks ->exit_state.
> 
> OTOH, if we fix this race then probably schedule_debug() should
> check state == EXIT_DEAD instead. 

Hmm, interesting. On the up side that removes the need for that inf loop
after BUG, down side is of course that we lose the BUG itself too. Now
I'm not too sure we actually care about that, a task spinning at 100% in
x state should be fairly obvious borkage and it's not like we hit this
thing very often.





* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-28 21:07         ` KOSAKI Motohiro
@ 2012-01-24 10:23           ` Peter Zijlstra
  2012-01-24 18:01             ` KOSAKI Motohiro
  2012-01-25 10:10           ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-24 10:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Oleg Nesterov, Yasunori Goto, Ingo Molnar, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On Wed, 2011-12-28 at 16:07 -0500, KOSAKI Motohiro wrote:
> I looked at the scheduler code briefly today. Now I'm afraid the
> following code has a similar race.
> 
> 
>          if (task_contributes_to_load(p))
>                  rq->nr_uninterruptible--;
> 
> 
> 
> Can't the following scenario happen?
> 
> 
> CPU0                    CPU1
> --------------------------------------------------------
> deactivate_task()
>                         task->state = TASK_UNINTERRUPTIBLE;
> activate_task()
>    rq->nr_uninterruptible--;
> 
>                         schedule()
>                           deactivate_task()
>                             rq->nr_uninterruptible++;
> 
> In total, nr_uninterruptible wasn't incremented.
> 
> 
> I'm still not sure. I need to read more sched code. 

You shouldn't ever set another task's ->state. 



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-24 10:19                                     ` Peter Zijlstra
@ 2012-01-24 10:55                                       ` Peter Zijlstra
  2012-01-24 17:25                                         ` KOSAKI Motohiro
  2012-01-25 15:45                                         ` Oleg Nesterov
  0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-24 10:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On Tue, 2012-01-24 at 11:19 +0100, Peter Zijlstra wrote:
> On Wed, 2012-01-18 at 15:20 +0100, Oleg Nesterov wrote:
> > do_exit() is different because it can not handle the spurious wakeup.
> > Well, may be we can? we can simply do
> > 
> >                 for (;;) {
> >                         tsk->state = TASK_DEAD;
> >                         schedule();
> >                 }
> > 
> > __schedule() can't race with ttwu() once it takes rq->lock. If the
> > exiting task is deactivated, finish_task_switch() will see EXIT_DEAD.
> 
> TASK_DEAD, right?
> 
> > Unless I missed something, the only problem is preempt_disable(),
> > but schedule_debug() checks ->exit_state.
> > 
> > OTOH, if we fix this race then probably schedule_debug() should
> > check state == EXIT_DEAD instead. 
> 
> Hmm, interesting. On the up side that removes the need for that inf loop
> after BUG, down side is of course that we lose the BUG itself too. Now
> I'm not too sure we actually care about that, a task spinning at 100% in
> x state should be fairly obvious borkage and it's not like we hit this
> thing very often.

Something like so, right? schedule_debug() already tests
prev->exit_state so it should DTRT afaict.

Also, while going over this again, I think Yasunori-San's patch isn't
sufficient; note how the p->state = TASK_RUNNING in ttwu_do_wakeup() can
happen outside of p->pi_lock when the task gets queued on a remote cpu.

---
 kernel/exit.c |   17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 294b170..ccd4f84 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1039,13 +1039,18 @@ void do_exit(long code)
 		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
 	exit_rcu();
 	/* causes final put_task_struct in finish_task_switch(). */
-	tsk->state = TASK_DEAD;
 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
-	schedule();
-	BUG();
-	/* Avoid "noreturn function does return".  */
-	for (;;)
-		cpu_relax();	/* For when BUG is null */
+	for (;;) {
+		/*
+		 * A spurious wakeup, e.g. generated by rwsem when down()'s call
+		 * to schedule() doesn't happen but the wakeup from the
+		 * previous owner's up() does, can stomp on our ->state.
+		 *
+		 * This loop also avoids the "noreturn function does return" warning.
+		 */
+		tsk->state = TASK_DEAD;
+		schedule();
+	}
 }
 
 EXPORT_SYMBOL_GPL(do_exit);




* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-24 10:55                                       ` Peter Zijlstra
@ 2012-01-24 17:25                                         ` KOSAKI Motohiro
  2012-01-25 15:45                                         ` Oleg Nesterov
  1 sibling, 0 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2012-01-24 17:25 UTC (permalink / raw)
  To: peterz; +Cc: oleg, mingo, y-goto, tglx, kamezawa.hiroyu, linux-kernel

On 1/24/2012 5:55 AM, Peter Zijlstra wrote:
> On Tue, 2012-01-24 at 11:19 +0100, Peter Zijlstra wrote:
>> On Wed, 2012-01-18 at 15:20 +0100, Oleg Nesterov wrote:
>>> do_exit() is different because it can not handle the spurious wakeup.
>>> Well, may be we can? we can simply do
>>>
>>>                 for (;;) {
>>>                         tsk->state = TASK_DEAD;
>>>                         schedule();
>>>                 }
>>>
>>> __schedule() can't race with ttwu() once it takes rq->lock. If the
>>> exiting task is deactivated, finish_task_switch() will see EXIT_DEAD.
>>
>> TASK_DEAD, right?
>>
>>> Unless I missed something, the only problem is preempt_disable(),
>>> but schedule_debug() checks ->exit_state.
>>>
>>> OTOH, if we fix this race then probably schedule_debug() should
>>> check state == EXIT_DEAD instead. 
>>
>> Hmm, interesting. On the up side that removes the need for that inf loop
>> after BUG, down side is of course that we lose the BUG itself too. Now
>> I'm not too sure we actually care about that, a task spinning at 100% in
>> x state should be fairly obvious borkage and it's not like we hit this
>> thing very often.
> 
> Something like so, right? schedule_debug() already tests
> prev->exit_state so it should DTRT afaict.
> 
> Also, while going over this again, I think Yasunori-San's patch isn't
> sufficient, note how the p->state = TASK_RUNNING in ttwu_do_wakeup() can
> happen outside of p->pi_lock when the task gets queued on a remote cpu.
> 
> ---
>  kernel/exit.c |   17 +++++++++++------
>  1 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 294b170..ccd4f84 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -1039,13 +1039,18 @@ void do_exit(long code)
>  		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
>  	exit_rcu();
>  	/* causes final put_task_struct in finish_task_switch(). */
> -	tsk->state = TASK_DEAD;
>  	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
> -	schedule();
> -	BUG();
> -	/* Avoid "noreturn function does return".  */
> -	for (;;)
> -		cpu_relax();	/* For when BUG is null */
> +	for (;;) {
> +		/*
> +		 * A spurious wakeup, e.g. generated by rwsem when down()'s call
> +		 * to schedule() doesn't happen but the wakeup from the
> +		 * previous owner's up() does, can stomp on our ->state.
> +		 *
> +		 * This loop also avoids the "noreturn function does return" warning.
> +		 */
> +		tsk->state = TASK_DEAD;
> +		schedule();
> +	}
>  }

This looks ok to me.



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-24 10:23           ` Peter Zijlstra
@ 2012-01-24 18:01             ` KOSAKI Motohiro
  2012-01-25  6:15               ` Mike Galbraith
  0 siblings, 1 reply; 55+ messages in thread
From: KOSAKI Motohiro @ 2012-01-24 18:01 UTC (permalink / raw)
  To: peterz
  Cc: kosaki.motohiro, oleg, y-goto, mingo, kamezawa.hiroyu, linux-kernel

On 1/24/2012 5:23 AM, Peter Zijlstra wrote:
> On Wed, 2011-12-28 at 16:07 -0500, KOSAKI Motohiro wrote:
>> I looked at the scheduler code briefly today. Now I'm afraid the
>> following code has a similar race.
>>
>>
>>          if (task_contributes_to_load(p))
>>                  rq->nr_uninterruptible--;
>>
>>
>>
>> Can't the following scenario happen?
>>
>>
>> CPU0                    CPU1
>> --------------------------------------------------------
>> deactivate_task()
>>                         task->state = TASK_UNINTERRUPTIBLE;
>> activate_task()
>>    rq->nr_uninterruptible--;
>>
>>                         schedule()
>>                           deactivate_task()
>>                             rq->nr_uninterruptible++;
>>
>> In total, nr_uninterruptible wasn't incremented.
>>
>>
>> I'm still not sure. I need to read more sched code. 
> 
> You shouldn't ever set another task's ->state. 

I'm sorry, I haven't caught your point. I think the following step is
valid kernel code. Do you disagree?

>>                         task->state = TASK_UNINTERRUPTIBLE;
>>                         schedule()



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-24 18:01             ` KOSAKI Motohiro
@ 2012-01-25  6:15               ` Mike Galbraith
  2012-01-26 21:24                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Galbraith @ 2012-01-25  6:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: peterz, kosaki.motohiro, oleg, y-goto, mingo, kamezawa.hiroyu,
	linux-kernel

On Tue, 2012-01-24 at 13:01 -0500, KOSAKI Motohiro wrote:
> On 1/24/2012 5:23 AM, Peter Zijlstra wrote:
> > On Wed, 2011-12-28 at 16:07 -0500, KOSAKI Motohiro wrote:
> >> I looked at the scheduler code briefly today. Now I'm afraid the
> >> following code has a similar race.
> >>
> >>
> >>          if (task_contributes_to_load(p))
> >>                  rq->nr_uninterruptible--;
> >>
> >>
> >>
> >> Can't the following scenario happen?
> >>
> >>
> >> CPU0                    CPU1
> >> --------------------------------------------------------
> >> deactivate_task()
> >>                         task->state = TASK_UNINTERRUPTIBLE;
> >> activate_task()
> >>    rq->nr_uninterruptible--;
> >>
> >>                         schedule()
> >>                           deactivate_task()
> >>                             rq->nr_uninterruptible++;
> >>
> >> In total, nr_uninterruptible wasn't incremented.
> >>
> >>
> >> I'm still not sure. I need to read more sched code. 
> > 
> > You shouldn't ever set another task's ->state. 
> 
> I'm sorry, I haven't caught your point. I think the following step is
> valid kernel code. Do you disagree?
> 
> >>                         task->state = TASK_UNINTERRUPTIBLE;
> >>                         schedule()

I think you meant:
	__set_current_state(TASK_UNINTERRUPTIBLE);
	schedule();

The way you wrote it, the task doesn't have to be current, so it could be
doing the bad thing Peter pointed out: diddling *another* task's ->state.

	-Mike



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2011-12-28 21:07         ` KOSAKI Motohiro
  2012-01-24 10:23           ` Peter Zijlstra
@ 2012-01-25 10:10           ` Peter Zijlstra
  2012-01-26 20:25             ` [tip:sched/urgent] sched: Fix rq->nr_uninterruptible update race tip-bot for Peter Zijlstra
  2012-01-26 21:21             ` [BUG] TASK_DEAD task is able to be woken up in special condition KOSAKI Motohiro
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-25 10:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Oleg Nesterov, Yasunori Goto, Ingo Molnar, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On Wed, 2011-12-28 at 16:07 -0500, KOSAKI Motohiro wrote:
> 
> CPU0                    CPU1
> --------------------------------------------------------
> deactivate_task()
>                         task->state = TASK_UNINTERRUPTIBLE;
> activate_task()
>    rq->nr_uninterruptible--;
> 
>                         schedule()
>                           deactivate_task()
>                             rq->nr_uninterruptible++;
> 
> 

Hmm, I think you're right, e.g. when CPU0 does __sched_setscheduler() on
the task running on CPU1, i.e. when @task is CPU1's current.

I think only __sched_setscheduler() is really a problem; the other
activate/deactivate users besides schedule and wakeup are __migrate_task()
and normalize_task().

__migrate_task() will only run on tasks that aren't actually running
anywhere, so the above scenario can't happen, and normalize_task() is never
used on normal systems (sysrq-n).

So I guess the below cures things and cleans up a bit.. no?

---
 kernel/sched/core.c |   18 ++++++------------
 1 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index df00cb0..e067df1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -723,9 +723,6 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
-/*
- * activate_task - move a task to the runqueue.
- */
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (task_contributes_to_load(p))
@@ -734,9 +731,6 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
 	enqueue_task(rq, p, flags);
 }
 
-/*
- * deactivate_task - remove a task from the runqueue.
- */
 void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (task_contributes_to_load(p))
@@ -4134,7 +4128,7 @@ static int __sched_setscheduler(struct task_struct *p, int policy,
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
-		deactivate_task(rq, p, 0);
+		dequeue_task(rq, p, 0);
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
@@ -4147,7 +4141,7 @@ static int __sched_setscheduler(struct task_struct *p, int policy,
 	if (running)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq)
-		activate_task(rq, p, 0);
+		enqueue_task(rq, p, 0);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 	task_rq_unlock(rq, p, &flags);
@@ -4998,9 +4992,9 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 	 * placed properly.
 	 */
 	if (p->on_rq) {
-		deactivate_task(rq_src, p, 0);
+		dequeue_task(rq_src, p, 0);
 		set_task_cpu(p, dest_cpu);
-		activate_task(rq_dest, p, 0);
+		enqueue_task(rq_dest, p, 0);
 		check_preempt_curr(rq_dest, p, 0);
 	}
 done:
@@ -7032,10 +7026,10 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
 
 	on_rq = p->on_rq;
 	if (on_rq)
-		deactivate_task(rq, p, 0);
+		dequeue_task(rq, p, 0);
 	__setscheduler(rq, p, SCHED_NORMAL, 0);
 	if (on_rq) {
-		activate_task(rq, p, 0);
+		enqueue_task(rq, p, 0);
 		resched_task(rq->curr);
 	}
 




* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-24 10:55                                       ` Peter Zijlstra
  2012-01-24 17:25                                         ` KOSAKI Motohiro
@ 2012-01-25 15:45                                         ` Oleg Nesterov
  2012-01-25 16:51                                           ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-25 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On 01/24, Peter Zijlstra wrote:
>
> On Tue, 2012-01-24 at 11:19 +0100, Peter Zijlstra wrote:
> > On Wed, 2012-01-18 at 15:20 +0100, Oleg Nesterov wrote:
> > > do_exit() is different because it can not handle the spurious wakeup.
> > > Well, may be we can? we can simply do
> > >
> > >                 for (;;) {
> > >                         tsk->state = TASK_DEAD;
> > >                         schedule();
> > >                 }
> > >
> > > __schedule() can't race with ttwu() once it takes rq->lock. If the
> > > exiting task is deactivated, finish_task_switch() will see EXIT_DEAD.
> >
> > TASK_DEAD, right?

Yes, but... I simply can't understand what I was thinking about.
And probably I missed something again, but I think this can't work.

Afaics, this can only help to prevent the race with ttwu_remote()
doing ttwu_do_wakeup() under rq->lock.

But we still can race with the !p->on_rq case which sets TASK_WAKING.
It can do this after finish_task_switch() observes TASK_DEAD and does
put_task_struct().

> I think Yasunori-San's patch isn't
> sufficient, note how the p->state = TASK_RUNNING in ttwu_do_wakeup() can
> happen outside of p->pi_lock when the task gets queued on a remote cpu.

Hmm, really? I am not sure, but I do not trust myself.

To simplify, you mean that

	mb();
	unlock_wait(pi_lock);

	tsk->state = TASK_DEAD;

can change ->state from TASK_WAKING to TASK_DEAD, right? Is this really
possible? ttwu() ensures p->on_rq == F in this case.

Oleg.



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-25 15:45                                         ` Oleg Nesterov
@ 2012-01-25 16:51                                           ` Peter Zijlstra
  2012-01-25 17:43                                             ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-25 16:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On Wed, 2012-01-25 at 16:45 +0100, Oleg Nesterov wrote:
> On 01/24, Peter Zijlstra wrote:
> >
> > On Tue, 2012-01-24 at 11:19 +0100, Peter Zijlstra wrote:
> > > On Wed, 2012-01-18 at 15:20 +0100, Oleg Nesterov wrote:
> > > > do_exit() is different because it can not handle the spurious wakeup.
> > > > Well, may be we can? we can simply do
> > > >
> > > >                 for (;;) {
> > > >                         tsk->state = TASK_DEAD;
> > > >                         schedule();
> > > >                 }
> > > >
> > > > __schedule() can't race with ttwu() once it takes rq->lock. If the
> > > > exiting task is deactivated, finish_task_switch() will see EXIT_DEAD.
> > >
> > > TASK_DEAD, right?
> 
> Yes, but... I simply can't understand what I was thinking about.
> And probably I missed something again, but I think this can't work.

Oh man, total confusion. :-) Every time I look at this bug I see
different shadows on the wall.

> Afaics, this can only help to prevent the race with ttwu_remote()
> doing ttwu_do_wakeup() under rq->lock.

ttwu_do_wakeup() must always be called with rq->lock held.

> But we still can race with the !p->on_rq case which sets TASK_WAKING.
> It can do this after finish_task_switch() observes TASK_DEAD and does
> put_task_struct().

<random scribbling that got erased>

No, see below !p->on_rq isn't possible and thus pi_lock is indeed
sufficient.

> > I think Yasunori-San's patch isn't
> > sufficient, note how the p->state = TASK_RUNNING in ttwu_do_wakeup() can
> > happen outside of p->pi_lock when the task gets queued on a remote cpu.
> 
> Hmm, really? I am not sure, but I do not trust myself.
> 
> To simplify, you mean that
> 
> 	mb();
> 	unlock_wait(pi_lock);
> 
> 	tsk->state = TASK_DEAD;
> 
> can change ->state from TASK_WAKING to TASK_DEAD, right? Is this really
> possible? ttwu() ensures p->on_rq == F in this case.

Ahhh.. hold on, p->on_rq must be true, since we disabled preemption
before setting TASK_DEAD, so the thing cannot be scheduled out.

Does this mean that both Yasunori-San's solution and yours work again?

/me goes in search of a fresh mind.. shees!






* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-25 16:51                                           ` Peter Zijlstra
@ 2012-01-25 17:43                                             ` Oleg Nesterov
  2012-01-26 15:32                                               ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-25 17:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On 01/25, Peter Zijlstra wrote:
>
> On Wed, 2012-01-25 at 16:45 +0100, Oleg Nesterov wrote:
> > > > >
> > > > >                 for (;;) {
> > > > >                         tsk->state = TASK_DEAD;
> > > > >                         schedule();
> > > > >                 }
> > > > >
> > > > > __schedule() can't race with ttwu() once it takes rq->lock. If the
> > > > > exiting task is deactivated, finish_task_switch() will see EXIT_DEAD.
> > > >
> > > > TASK_DEAD, right?
> >
> > Yes, but... I simply can't understand what I was thinking about.
> > And probably I missed something again, but I think this can't work.
>
> Oh man, total confusion. :-) Every time I look at this bug I see
> different shadows on the wall.

Same here ;)

And this time I do not understand your reply.

> > Afaics, this can only help to prevent the race with ttwu_remote()
> > doing ttwu_do_wakeup() under rq->lock.
>
> ttwu_do_wakeup() must always be called with rq->lock held.

Yes sure. I meant the code above can't race with p->on_rq == T case.

> > But we still can race with the !p->on_rq case which sets TASK_WAKING.
> > It can do this after finish_task_switch() observes TASK_DEAD and does
> > put_task_struct().
>
> <random scribbling that got erased>
>
> No, see below !p->on_rq isn't possible and thus pi_lock is indeed
> sufficient.

Which pi_lock? __schedule() doesn't take it. Hmm, see below...

> > > I think Yasunori-San's patch isn't
> > > sufficient, note how the p->state = TASK_RUNNING in ttwu_do_wakeup() can
> > > happen outside of p->pi_lock when the task gets queued on a remote cpu.
> >
> > Hmm, really? I am not sure, but I do not trust myself.
> >
> > To simplify, you mean that
> >
> > 	mb();
> > 	unlock_wait(pi_lock);
> >
> > 	tsk->state = TASK_DEAD;
> >
> > can change ->state from TASK_WAKING to TASK_DEAD, right? Is this really
> > possible? ttwu() ensures p->on_rq == F in this case.
>
> Ahhh.. hold on, p->on_rq must be true, since we disabled preemption
> before setting TASK_DEAD, so the thing cannot be scheduled out.

Why? __schedule() checks "preempt_count() & PREEMPT_ACTIVE". And it
should be scheduled out, in general this task struct will be freed soon.

> Does this mean that both Yasunori-San's solution and yours work again?

I think that Yasunori-San's solution should work.

But,

> /me goes in search of a fresh mind.. shees!

Yes! I need the fresh head too. Probably just to realize I was completely
wrong again.

Oleg.



* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-24 10:11                                   ` Peter Zijlstra
@ 2012-01-26  9:39                                     ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-01-26  9:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML


* Peter Zijlstra <peterz@infradead.org> wrote:

> > It can generate delays at the exact same place, and can 
> > generate much longer delays than an SMI, if a high-prio RT 
> > task comes along.
> 
> No it can't, since do_exit() plays silly games with 
> preempt_disable().

ok, it being a critical section on both -rt and vanilla upstream 
explains it.

Thanks,

	Ingo


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-25 17:43                                             ` Oleg Nesterov
@ 2012-01-26 15:32                                               ` Peter Zijlstra
  2012-01-26 16:26                                                 ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-26 15:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On Wed, 2012-01-25 at 18:43 +0100, Oleg Nesterov wrote:
> And this time I do not understand your reply.

Hmm, I wouldn't either, I think I got ->on_cpu and ->on_rq confused. Let
me try again.

So since we never call schedule(), the p->on_rq thing will always be
true. This means we don't need to consider all the icky ttwu paths after
that; it also means the whole thing is inside ->pi_lock.

So we only have to consider the exact case Yasunori-San illustrated, and
waiting on ->pi_lock is sufficient.

However I think your proposal:

>                 for (;;) {
>                         tsk->state = TASK_DEAD;
>                         schedule();
>                 }

should equally work, if we hit the race and call schedule() with ->state
= TASK_RUNNING, we'll simply loop and try again observing the TASK_DEAD
on the second+ go.

Now I prefer your solution since it doesn't add anything to the exit
path.


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-26 15:32                                               ` Peter Zijlstra
@ 2012-01-26 16:26                                                 ` Oleg Nesterov
  2012-01-27  8:59                                                   ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-26 16:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On 01/26, Peter Zijlstra wrote:
>
> So since we never call schedule(), the p->on_rq thing will always be
> true. This means we don't need to consider all the icky ttwu paths after
> that; it also means the whole thing is inside ->pi_lock.
>
> So we only have to consider the exact case Yasunori-San illustrated, and
> waiting on ->pi_lock is sufficient.

Yes, and this is why I think Yasunori-san's patch should work. Because,
to remind, it adds unlock_wait(pi_lock).

> However I think your proposal:
>
> >                 for (;;) {
> >                         tsk->state = TASK_DEAD;
> >                         schedule();
> >                 }
>
> should equally work, if we hit the race and call schedule() with ->state
> = TASK_RUNNING,

Yes, in this case everything is fine, but we can schedule() with TASK_DEAD
state. preempt_disable() can't (and shouldn't) prevent deactivate_task().

To simplify, try_to_wake_up() does

		spin_lock(pi_lock);

		if (!(p->state & state))
			goto out;

		/* WINDOW */

		if (p->on_rq) {
			... everything is fine ...
		}

		p->state = TASK_WAKING;
		ttwu_queue(p, cpu);

And the exiting task does

	// but do not sleep ...
	current->state = TASK_UNINTERRUPTIBLE;
							// ttwu() checks ->state
	...
	tsk->state = TASK_DEAD;
	schedule();
		-> deactivate_task();
		-> tsk->on_rq = 0;
		-> finish_task_switch();

							// ttwu() checks ->on_rq

In theory it can do this all in the WINDOW above. In this case we
can wake it up again, after finish_task_switch()->put_task_struct().

No?

Oleg.



* [tip:sched/urgent] sched: Fix rq->nr_uninterruptible update race
  2012-01-25 10:10           ` Peter Zijlstra
@ 2012-01-26 20:25             ` tip-bot for Peter Zijlstra
  2012-01-27  5:20               ` Rakib Mullick
  2012-01-26 21:21             ` [BUG] TASK_DEAD task is able to be woken up in special condition KOSAKI Motohiro
  1 sibling, 1 reply; 55+ messages in thread
From: tip-bot for Peter Zijlstra @ 2012-01-26 20:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, kosaki.motohiro, tglx, mingo

Commit-ID:  4ca9b72b71f10147bd21969c1805f5b2c4ca7b7b
Gitweb:     http://git.kernel.org/tip/4ca9b72b71f10147bd21969c1805f5b2c4ca7b7b
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Wed, 25 Jan 2012 11:50:51 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 26 Jan 2012 19:38:09 +0100

sched: Fix rq->nr_uninterruptible update race

KOSAKI Motohiro noticed the following race:

 > CPU0                    CPU1
 > --------------------------------------------------------
 > deactivate_task()
 >                         task->state = TASK_UNINTERRUPTIBLE;
 > activate_task()
 >    rq->nr_uninterruptible--;
 >
 >                         schedule()
 >                           deactivate_task()
 >                             rq->nr_uninterruptible++;
 >

Kosaki-San's scenario is possible when CPU0 runs
__sched_setscheduler() against CPU1's current @task.

__sched_setscheduler() does a dequeue/enqueue in order to move
the task to its new queue (position) to reflect the newly provided
scheduling parameters. However, it should be completely invariant with
respect to nr_uninterruptible accounting: sched_setscheduler() doesn't
affect readiness to run, merely policy on when to run.

So convert the inappropriate activate/deactivate_task usage to
enqueue/dequeue_task, which avoids the nr_uninterruptible accounting.

Also convert the two other sites: __migrate_task() and
normalize_task() that still use activate/deactivate_task. These sites
aren't really a problem since __migrate_task() will only be called on
non-running tasks (and is therefore immune to the described problem)
and normalize_task() isn't ever used on regular systems.

Also remove the comments from activate/deactivate_task since they're
misleading at best.

Reported-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1327486224.2614.45.camel@laptop
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/core.c |   18 ++++++------------
 1 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index df00cb0..e067df1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -723,9 +723,6 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
-/*
- * activate_task - move a task to the runqueue.
- */
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (task_contributes_to_load(p))
@@ -734,9 +731,6 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
 	enqueue_task(rq, p, flags);
 }
 
-/*
- * deactivate_task - remove a task from the runqueue.
- */
 void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (task_contributes_to_load(p))
@@ -4134,7 +4128,7 @@ recheck:
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
-		deactivate_task(rq, p, 0);
+		dequeue_task(rq, p, 0);
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
@@ -4147,7 +4141,7 @@ recheck:
 	if (running)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq)
-		activate_task(rq, p, 0);
+		enqueue_task(rq, p, 0);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 	task_rq_unlock(rq, p, &flags);
@@ -4998,9 +4992,9 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 	 * placed properly.
 	 */
 	if (p->on_rq) {
-		deactivate_task(rq_src, p, 0);
+		dequeue_task(rq_src, p, 0);
 		set_task_cpu(p, dest_cpu);
-		activate_task(rq_dest, p, 0);
+		enqueue_task(rq_dest, p, 0);
 		check_preempt_curr(rq_dest, p, 0);
 	}
 done:
@@ -7032,10 +7026,10 @@ static void normalize_task(struct rq *rq, struct task_struct *p)
 
 	on_rq = p->on_rq;
 	if (on_rq)
-		deactivate_task(rq, p, 0);
+		dequeue_task(rq, p, 0);
 	__setscheduler(rq, p, SCHED_NORMAL, 0);
 	if (on_rq) {
-		activate_task(rq, p, 0);
+		enqueue_task(rq, p, 0);
 		resched_task(rq->curr);
 	}
 


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-25 10:10           ` Peter Zijlstra
  2012-01-26 20:25             ` [tip:sched/urgent] sched: Fix rq->nr_uninterruptible update race tip-bot for Peter Zijlstra
@ 2012-01-26 21:21             ` KOSAKI Motohiro
  2012-01-27  8:21               ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: KOSAKI Motohiro @ 2012-01-26 21:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Yasunori Goto, Ingo Molnar, Hiroyuki KAMEZAWA,
	Linux Kernel ML

2012/1/25 Peter Zijlstra <peterz@infradead.org>:
> On Wed, 2011-12-28 at 16:07 -0500, KOSAKI Motohiro wrote:
>>
>> CPU0                    CPU1
>> --------------------------------------------------------
>> deactivate_task()
>>                         task->state = TASK_UNINTERRUPTIBLE;
>> activate_task()
>>    rq->nr_uninterruptible--;
>>
>>                         schedule()
>>                           deactivate_task()
>>                             rq->nr_uninterruptible++;
>>
>>
>
> Hmm, I think you're right, when CPU0 does __sched_setscheduler() on the
> task running on CPU1 and CPU1's @task is current.
>
> I think only __sched_setscheduler() is really a problem, the other
> activate/deactivate users not schedule or wakeup are __migrate_task()
> and normalize_task().
>
> __migrate_task() will only run on tasks that aren't actually running
> anywhere so the above scenario can't happen, normalize_task() is never
> used on normal systems (sysrq-n).
>
> So I guess the below cures things and cleans up a bit.. no?

Sorry for the delay. Yeah, I think this should work. (even though I
don't understand __migrate_task() in detail)

Thank you!

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-25  6:15               ` Mike Galbraith
@ 2012-01-26 21:24                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2012-01-26 21:24 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: peterz, oleg, y-goto, mingo, kamezawa.hiroyu, linux-kernel

>> >>                         task->state = TASK_UNINTERRUPTIBLE;
>> >>                         schedule()
>
> I think you meant:
>        __set_current_state(TASK_UNINTERRUPTIBLE);
>        schedule();
>
> The way you wrote it, task doesn't have to be current, so could be doing
> the bad thing Peter pointed out, diddling *another* tasks ->state.

Silly me. Thank you for following up on this.


* Re: [tip:sched/urgent] sched: Fix rq->nr_uninterruptible update race
  2012-01-26 20:25             ` [tip:sched/urgent] sched: Fix rq->nr_uninterruptible update race tip-bot for Peter Zijlstra
@ 2012-01-27  5:20               ` Rakib Mullick
  2012-01-27  8:19                 ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Rakib Mullick @ 2012-01-27  5:20 UTC (permalink / raw)
  To: linux-kernel, a.p.zijlstra, kosaki.motohiro, mingo

On Fri, Jan 27, 2012 at 2:25 AM, tip-bot for Peter Zijlstra
<a.p.zijlstra@chello.nl> wrote:
> Commit-ID:  4ca9b72b71f10147bd21969c1805f5b2c4ca7b7b
> Gitweb:     http://git.kernel.org/tip/4ca9b72b71f10147bd21969c1805f5b2c4ca7b7b
> Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
> AuthorDate: Wed, 25 Jan 2012 11:50:51 +0100
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Thu, 26 Jan 2012 19:38:09 +0100
>
> sched: Fix rq->nr_uninterruptible update race
>
> KOSAKI Motohiro noticed the following race:
>
>  > CPU0                    CPU1
>  > --------------------------------------------------------
>  > deactivate_task()
>  >                         task->state = TASK_UNINTERRUPTIBLE;
>  > activate_task()
>  >    rq->nr_uninterruptible--;
>  >
>  >                         schedule()
>  >                           deactivate_task()
>  >                             rq->nr_uninterruptible++;
>  >
>
> Kosaki-San's scenario is possible when CPU0 runs
> __sched_setscheduler() against CPU1's current @task.
>
> __sched_setscheduler() does a dequeue/enqueue in order to move
> the task to its new queue (position) to reflect the newly provided
> scheduling parameters. However, it should be completely invariant with
> respect to nr_uninterruptible accounting: sched_setscheduler() doesn't
> affect readiness to run, merely policy on when to run.
>
> So convert the inappropriate activate/deactivate_task usage to
> enqueue/dequeue_task, which avoids the nr_uninterruptible accounting.
>
Why would we want to avoid nr_uninterruptible accounting?
nr_uninterruptible has impact on load calculation, we might not get
the proper load weight if we don't account it. isn't it?

Thanks,
Rakib


* Re: [tip:sched/urgent] sched: Fix rq->nr_uninterruptible update race
  2012-01-27  5:20               ` Rakib Mullick
@ 2012-01-27  8:19                 ` Peter Zijlstra
  2012-01-27 14:11                   ` Rakib Mullick
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-27  8:19 UTC (permalink / raw)
  To: Rakib Mullick; +Cc: linux-kernel, kosaki.motohiro, mingo

On Fri, 2012-01-27 at 11:20 +0600, Rakib Mullick wrote:
> On Fri, Jan 27, 2012 at 2:25 AM, tip-bot for Peter Zijlstra
> <a.p.zijlstra@chello.nl> wrote:
> > Commit-ID:  4ca9b72b71f10147bd21969c1805f5b2c4ca7b7b
> > Gitweb:     http://git.kernel.org/tip/4ca9b72b71f10147bd21969c1805f5b2c4ca7b7b
> > Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
> > AuthorDate: Wed, 25 Jan 2012 11:50:51 +0100
> > Committer:  Ingo Molnar <mingo@elte.hu>
> > CommitDate: Thu, 26 Jan 2012 19:38:09 +0100
> >
> > sched: Fix rq->nr_uninterruptible update race
> >
> > KOSAKI Motohiro noticed the following race:
> >
> >  > CPU0                    CPU1
> >  > --------------------------------------------------------
> >  > deactivate_task()
> >  >                         task->state = TASK_UNINTERRUPTIBLE;
> >  > activate_task()
> >  >    rq->nr_uninterruptible--;
> >  >
> >  >                         schedule()
> >  >                           deactivate_task()
> >  >                             rq->nr_uninterruptible++;
> >  >
> >
> > Kosaki-San's scenario is possible when CPU0 runs
> > __sched_setscheduler() against CPU1's current @task.
> >
> > __sched_setscheduler() does a dequeue/enqueue in order to move
> > the task to its new queue (position) to reflect the newly provided
> > scheduling parameters. However, it should be completely invariant with
> > respect to nr_uninterruptible accounting: sched_setscheduler() doesn't
> > affect readiness to run, merely policy on when to run.
> >
> > So convert the inappropriate activate/deactivate_task usage to
> > enqueue/dequeue_task, which avoids the nr_uninterruptible accounting.
> >
> Why would we want to avoid nr_uninterruptible accounting?
> nr_uninterruptible has impact on load calculation, we might not get
> the proper load weight if we don't account it. isn't it?

Read again ;-)

sched_setscheduler() did:

  deactivate_task(); // remove it from the queue

  // change tasks's scheduler paramater

  activate_task(); // queue it in the new place

it is invariant wrt nr_uninterruptible but does include the
nr_uninterruptible accounting logic.

Now Kosaki-San noticed that if the task manages to change its ->state at
an inopportune moment (right between the dequeue and enqueue) we'll get
screwy nr_uninterruptible accounting.


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-26 21:21             ` [BUG] TASK_DEAD task is able to be woken up in special condition KOSAKI Motohiro
@ 2012-01-27  8:21               ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-27  8:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Oleg Nesterov, Yasunori Goto, Ingo Molnar, Hiroyuki KAMEZAWA,
	Linux Kernel ML

On Thu, 2012-01-26 at 16:21 -0500, KOSAKI Motohiro wrote:
> 
> Sorry for the delay. Yeah, I think this should work. (even though I
> don't understand __migrate_task() in detail)

__migrate_task() isn't a problem, its use of activate/deactivate is safe
but also superfluous so I removed it too.


* Re: [BUG] TASK_DEAD task is able to be woken up in special condition
  2012-01-26 16:26                                                 ` Oleg Nesterov
@ 2012-01-27  8:59                                                   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2012-01-27  8:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Yasunori Goto, Thomas Gleixner, Hiroyuki KAMEZAWA,
	Motohiro Kosaki, Linux Kernel ML

On Thu, 2012-01-26 at 17:26 +0100, Oleg Nesterov wrote:
> On 01/26, Peter Zijlstra wrote:

> > However I think your proposal:
> >
> > >                 for (;;) {
> > >                         tsk->state = TASK_DEAD;
> > >                         schedule();
> > >                 }
> >
> > should equally work, if we hit the race and call schedule() with ->state
> > = TASK_RUNNING,
> 
> Yes, in this case everything is fine, but we can schedule() with TASK_DEAD
> state. preempt_disable() can't (and shouldn't) prevent deactivate_task().
> 
> To simplify, try_to_wake_up() does
> 
> 		spin_lock(pi_lock);
> 
> 		if (!(p->state & state))
> 			goto out;
> 
> 		/* WINDOW */
> 
> 		if (p->on_rq) {
> 			... everything is fine ...
> 		}
> 
> 		p->state = TASK_WAKING;
> 		ttwu_queue(p, cpu);
> 
> And the exiting task does
> 
> 	// but do not sleep ...
> 	current->state = TASK_UNINTERRUPTIBLE;
> 							// ttwu() checks ->state
> 	...
> 	tsk->state = TASK_DEAD;
> 	schedule();
> 		-> deactivate_task();
> 		-> tsk->on_rq = 0;
> 		-> finish_task_switch();
> 
> 							// ttwu() checks ->on_rq
> 
> In theory it can do this all in the WINDOW above. In this case we
> can wake it up again, after finish_task_switch()->put_task_struct().
> 
> No?

Yes, bugger. Ok I'll queue Yasunori-san's patch as is.

Thanks everyone!


* Re: [tip:sched/urgent] sched: Fix rq->nr_uninterruptible update race
  2012-01-27  8:19                 ` Peter Zijlstra
@ 2012-01-27 14:11                   ` Rakib Mullick
  0 siblings, 0 replies; 55+ messages in thread
From: Rakib Mullick @ 2012-01-27 14:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, kosaki.motohiro, mingo

On Fri, Jan 27, 2012 at 2:19 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2012-01-27 at 11:20 +0600, Rakib Mullick wrote:
>> On Fri, Jan 27, 2012 at 2:25 AM, tip-bot for Peter Zijlstra
>> <a.p.zijlstra@chello.nl> wrote:

>> Why would we want to avoid nr_uninterruptible accounting?
>> nr_uninterruptible has impact on load calculation, we might not get
>> the proper load weight if we don't account it. isn't it?
>
> Read again ;-)
>
Wasn't enough! I had to use paper and pen ;-)

> sched_setscheduler() did:
>
>  deactivate_task(); // remove it from the queue
>
>  // change tasks's scheduler paramater
>
>  activate_task(); // queue it in the new place
>
> it is invariant wrt nr_uninterruptible but does include the
> nr_uninterruptible accounting logic.
>
> Now Kosaki-San noticed that if the task manages to change its ->state at
> an inopportune moment (right between the dequeue and enqueue) we'll get
> screwy nr_uninterruptible accounting.

I got your point and yes, we'll get screwy nr_uninterruptible
accounting if we call {activate,deactivate}_task(). We need to avoid
nr_uninterruptible accounting.

Thanks,
Rakib


* [tip:sched/core] sched: Fix ancient race in do_exit()
  2012-01-17  8:40                           ` Yasunori Goto
  2012-01-17  9:06                             ` Ingo Molnar
@ 2012-01-28 12:03                             ` tip-bot for Yasunori Goto
  2012-01-28 21:12                               ` Linus Torvalds
  1 sibling, 1 reply; 55+ messages in thread
From: tip-bot for Yasunori Goto @ 2012-01-28 12:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, y-goto, akpm,
	tglx, oleg, mingo

Commit-ID:  b5740f4b2cb3503b436925eb2242bc3d75cd3dfe
Gitweb:     http://git.kernel.org/tip/b5740f4b2cb3503b436925eb2242bc3d75cd3dfe
Author:     Yasunori Goto <y-goto@jp.fujitsu.com>
AuthorDate: Tue, 17 Jan 2012 17:40:31 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 27 Jan 2012 11:55:36 +0100

sched: Fix ancient race in do_exit()

try_to_wake_up() has a problem: it may change a task's state from
TASK_DEAD back to TASK_RUNNING in a race condition involving SMI or the
guest environment of a virtual machine. As a result, the exited task is
scheduled again and a panic occurs.

Here is the sequence how it occurs:

 ----------------------------------+-----------------------------
                                   |
            CPU A                  |             CPU B
 ----------------------------------+-----------------------------

TASK A calls exit()....

do_exit()

  exit_mm()
    down_read(mm->mmap_sem);

    rwsem_down_failed_common()

      set TASK_UNINTERRUPTIBLE
      set waiter.task <= task A
      list_add to sem->wait_list
           :
      raw_spin_unlock_irq()
      (I/O interruption occurred)

                                      __rwsem_do_wake(mmap_sem)

                                        list_del(&waiter->list);
                                        waiter->task = NULL
                                        wake_up_process(task A)
                                          try_to_wake_up()
                                             (task is still
                                               TASK_UNINTERRUPTIBLE,
                                              p->on_rq is still 1.)

                                              ttwu_do_wakeup()
                                                 (*A)
                                                   :
     (I/O interruption handler finished)

      if (!waiter.task)
          schedule() is not called
          because waiter.task is NULL.

      tsk->state = TASK_RUNNING

          :
                                              check_preempt_curr();
                                                  :
  task->state = TASK_DEAD
                                              (*B)
                                        <---    set TASK_RUNNING (*C)

     schedule()
     (exit task is running again)
     BUG_ON() is called!
 --------------------------------------------------------

The execution time between (*A) and (*B) is usually very short, because
interrupts are disabled there, so setting TASK_RUNNING at (*C) is
normally executed before TASK_DEAD is set.

HOWEVER, if an SMI interrupts between (*A) and (*B),
(*C) can execute AFTER TASK_DEAD has been set!
Then the exited task is scheduled again, and BUG_ON() is called....

If the system runs as a guest of a virtual machine, the time between
(*A) and (*B) may also be long, due to hypervisor scheduling, and the
same phenomenon can occur.

With this patch, do_exit() waits for task->pi_lock, which is taken by
try_to_wake_up(), to be released. This guarantees that the task becomes
TASK_DEAD only after any concurrent wakeup has completed.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/exit.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 294b170..4b4042f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1038,6 +1038,22 @@ void do_exit(long code)
 	if (tsk->nr_dirtied)
 		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
 	exit_rcu();
+
+	/*
+	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
+	 * when the following two conditions become true:
+	 *   - There is a race condition on mmap_sem (it is acquired by
+	 *     exit_mm()), and
+	 *   - An SMI occurs before TASK_RUNNING is set
+	 *     (or a hypervisor of a virtual machine switches to another guest).
+	 * As a result, we may become TASK_RUNNING after becoming TASK_DEAD.
+	 *
+	 * To avoid this, we have to wait for tsk->pi_lock, which is held
+	 * by try_to_wake_up(), to be released.
+	 */
+	smp_mb();
+	raw_spin_unlock_wait(&tsk->pi_lock);
+
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */


* Re: [tip:sched/core] sched: Fix ancient race in do_exit()
  2012-01-28 12:03                             ` [tip:sched/core] sched: Fix ancient race in do_exit() tip-bot for Yasunori Goto
@ 2012-01-28 21:12                               ` Linus Torvalds
  2012-01-29 16:07                                 ` Oleg Nesterov
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2012-01-28 21:12 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, a.p.zijlstra, torvalds, y-goto, akpm,
	tglx, oleg, mingo
  Cc: linux-tip-commits

On Sat, Jan 28, 2012 at 4:03 AM, tip-bot for Yasunori Goto
<y-goto@jp.fujitsu.com> wrote:
>
> sched: Fix ancient race in do_exit()

Ugh.

It would be much nicer to just clear the rwsem waiter->task thing
*after* waking the task up, which would avoid this race entirely,
afaik.

Tell me, why wouldn't that work? rwsem_down_failed_common() does

        /* wait to be given the lock */
        for (;;) {
                if (!waiter.task)
                        break;
               ...

so then we wouldn't need the task refcount crap in rwsem either etc,
and we'd get rid of all races with wakeup.

I wonder why we're clearing that whole waiter->task so early.

                       Linus


* Re: [tip:sched/core] sched: Fix ancient race in do_exit()
  2012-01-28 21:12                               ` Linus Torvalds
@ 2012-01-29 16:07                                 ` Oleg Nesterov
  2012-01-29 17:44                                   ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-29 16:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mingo, hpa, linux-kernel, a.p.zijlstra, y-goto, akpm, tglx,
	mingo, linux-tip-commits

On 01/28, Linus Torvalds wrote:
>
> On Sat, Jan 28, 2012 at 4:03 AM, tip-bot for Yasunori Goto
> <y-goto@jp.fujitsu.com> wrote:
> >
> > sched: Fix ancient race in do_exit()
>
> Ugh.
>
> It would be much nicer to just clear the rwsem waiter->task thing
> *after* waking the task up, which would avoid this race entirely,
> afaik.

How? The problem is that wake_up_process(tsk) sees this task in
TASK_UNINTERRUPTIBLE state (the first "p->state & state" check in
try_to_wake_up), but then this task changes its state to TASK_DEAD
without schedule() and ttwu() does s/TASK_DEAD/TASK_RUNNING/.

IOW, the task doing

	current->state = TASK_A;
	...
	current->state = TASK_B;
	schedule();

can be woken up by try_to_wake_up(TASK_A), despite the fact it
sleeps in TASK_B. do_exit() is only "special" because it is not
easy to handle the spurious wakeup.
	
> Tell me, why wouldn't that work? rwsem_down_failed_common() does
>
>         /* wait to be given the lock */
>         for (;;) {
>                 if (!waiter.task)
>                         break;
>                ...
>
> so then we wouldn't need the task refcount crap in rwsem either etc,
> and we'd get rid of all races with wakeup.
>
> I wonder why we're clearing that whole waiter->task so early.

I must have missed something. I can't understand how this can help,
and "clear the rwsem waiter->task thing *after* waking" looks
obviously wrong. If we do this, then we can miss the "!!waiter.task"
condition. The loop above actually does

	set_task_state(TASK_UNINTERRUPTIBLE);

	if (!waiter.task)
		break;
	schedule();

and
	wake_up_process(tsk);
	waiter->task = NULL;

can happen right after set_task_state().

Oleg.



* Re: [tip:sched/core] sched: Fix ancient race in do_exit()
  2012-01-29 16:07                                 ` Oleg Nesterov
@ 2012-01-29 17:44                                   ` Linus Torvalds
  2012-01-29 18:28                                     ` Linus Torvalds
  2012-01-29 18:59                                     ` Oleg Nesterov
  0 siblings, 2 replies; 55+ messages in thread
From: Linus Torvalds @ 2012-01-29 17:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, hpa, linux-kernel, a.p.zijlstra, y-goto, akpm, tglx,
	mingo, linux-tip-commits

[-- Attachment #1: Type: text/plain, Size: 4066 bytes --]

On Sun, Jan 29, 2012 at 8:07 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/28, Linus Torvalds wrote:
>>
>> It would be much nicer to just clear the rwsem waiter->task thing
>> *after* waking the task up, which would avoid this race entirely,
>> afaik.
>
> How? The problem is that wake_up_process(tsk) sees this task in
> TASK_UNINTERRUPTIBLE state (the first "p->state & state" check in
> try_to_wake_up), but then this task changes its state to TASK_DEAD
> without schedule() and ttwu() does s/TASK_DEAD/TASK_RUNNING/.

Exactly.

The problem is that THE TASK HAS LONG SINCE GONE AWAY FROM THE RWSEM!

The task is already finishing up the exit (it might even have
*completed* the exit, which is why we take the extra reference count
to the task) when the wakeup happens.

THAT is the problem. The fact that in that big problem ("we do a
spurious wakeup for the rwsem long after the task no longer waits for
it") we then have a really small and tiny race (the race between
TASK_UNINTERRUPTIBLE -> (TASK_RUNNING ->) TASK_DEAD) is just a tiny
small detail.

So the real problem is that rwsem's are broken. Normal wakeups never
have this issue, because they use the spinlock for the waitqueue to
serialize: either a process wakes up in a timely manner, or the
process has removed itself from the wait-queue - you never see wakeups
long after the process has stopped caring.

The whole problem with the exit sequence is just a random detail. The
same spurious wakeup could happen in other places too - it just
doesn't happen to cause an Oops anywhere else (that we know about -
who knows if some code would be unhappy with getting woken up
unexpectedly).




>> Tell me, why wouldn't that work? rwsem_down_failed_common() does
>>
>>         /* wait to be given the lock */
>>         for (;;) {
>>                 if (!waiter.task)
>>                         break;
>>                ...
>>
>> so then we wouldn't need the task refcount crap in rwsem either etc,
>> and we'd get rid of all races with wakeup.
>>
>> I wonder why we're clearing that whole waiter->task so early.
>
> I must have missed something. I can't understand how this can help,
> and "clear the rwsem waiter->task thing *after* waking" looks
> obviously wrong.

Why? It helps because it means that the task that did the "down()" on
the rwsem will *never* leave its task pointer around to be spuriously
woken later: it will wait for the waiter->task entry to be NULL - so
that we know that the wakeup path has used the task _and_ woken it up,
so there is no possibility for the task pointer still being active and
used to wake up some *random* state later.

> If we do this, then we can miss the "!!waiter.task"
> condition. The loop above actually does
>
>        set_task_state(TASK_UNINTERRUPTIBLE);
>
>        if (!waiter.task)
>                break;
>        schedule();
>
> and
>        wake_up_process(tsk);
>        waiter->task = NULL;
>
> can happen right after set_task_state().

So? That's the *point*. We want to wait until the wakeup is *done*
before we return from down(). We do *not* want to "break" (and go off
doing something else) while some other CPU may still be using 'task'.

But yes, if you're talking about TASK_UNINTERRUPTIBLE, we do need to
just remove the setting of that entirely. It needs to be set *before*
adding us to the list, not after. That's just a bug - we get woken up
when we've been given the lock.

IOW, I think the fix should look like the attached.  TOTALLY UNTESTED!
This is certainly a much scarier patch, but I think it fixes the
actual *reason* for the problem, rather than just working around the
symptom.

So it may be completely and utterly broken for some subtle reason, but
I don't see what it would be. It seems to clean up and simplify the
logic, and remove all the bogus workarounds for the fact that we used
to do things stupidly.

But maybe there's some reason for those "stupid" things. I just don't see it.

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1273 bytes --]

 lib/rwsem.c |   10 ++--------
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 410aa1189b13..7afb403918da 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -92,10 +92,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 */
 	list_del(&waiter->list);
 	tsk = waiter->task;
+	wake_up_process(tsk);
 	smp_mb();
 	waiter->task = NULL;
-	wake_up_process(tsk);
-	put_task_struct(tsk);
 	goto out;
 
  readers_only:
@@ -146,10 +145,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		waiter = list_entry(next, struct rwsem_waiter, list);
 		next = waiter->list.next;
 		tsk = waiter->task;
+		wake_up_process(tsk);
 		smp_mb();
 		waiter->task = NULL;
-		wake_up_process(tsk);
-		put_task_struct(tsk);
 	}
 
 	sem->wait_list.next = next;
@@ -183,7 +181,6 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	raw_spin_lock_irq(&sem->wait_lock);
 	waiter.task = tsk;
 	waiter.flags = flags;
-	get_task_struct(tsk);
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
@@ -211,11 +208,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 		if (!waiter.task)
 			break;
 		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 
-	tsk->state = TASK_RUNNING;
-
 	return sem;
 }
 


* Re: [tip:sched/core] sched: Fix ancient race in do_exit()
  2012-01-29 17:44                                   ` Linus Torvalds
@ 2012-01-29 18:28                                     ` Linus Torvalds
  2012-01-29 18:59                                     ` Oleg Nesterov
  1 sibling, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2012-01-29 18:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, hpa, linux-kernel, a.p.zijlstra, y-goto, akpm, tglx,
	mingo, linux-tip-commits

On Sun, Jan 29, 2012 at 9:44 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So it may be completely and utterly broken for some subtle reason, but
> I don't see what it would be. It seems to clean up and simplify the
> logic, and remove all the bogus workarounds for the fact that we used
> to do things stupidly.
>
> But maybe there's some reason for those "stupid" things. I just don't see it.

Hmm. Ok, so I see one reason for it. The silly extraneous "set task to
TASK_UNINTERRUPTIBLE" shouldn't matter normally - even if there are
spurious wakeups (say, disk IO while taking a page fault - not that I
see why we'd be on any wait queues yet), we'll just schedule a bit
more than we need in the extremely unlikely case that they hit us.

But for RT tasks with higher priorities, looping - even if we call
schedule() all the time - can cause livelocks. Damn. So while I don't
think the spurious wakeup is a big issue (I don't think it happens in
practice), it could lead to problems.

I think we could possibly use the "flags" field to do that "are we
just about to get woken up" logic, and set TASK_UNINTERRUPTIBLE in the
loop - and just clear "flags" before doing the wakeup (the same way we
used to clear "task"). Dunno. Ideas?

                         Linus


* Re: [tip:sched/core] sched: Fix ancient race in do_exit()
  2012-01-29 17:44                                   ` Linus Torvalds
  2012-01-29 18:28                                     ` Linus Torvalds
@ 2012-01-29 18:59                                     ` Oleg Nesterov
  2012-01-30 16:27                                       ` Linus Torvalds
  1 sibling, 1 reply; 55+ messages in thread
From: Oleg Nesterov @ 2012-01-29 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mingo, hpa, linux-kernel, a.p.zijlstra, y-goto, akpm, tglx,
	mingo, linux-tip-commits

On 01/29, Linus Torvalds wrote:
>
> But yes, if you're talking about TASK_UNINTERRUPTIBLE, we do need to
> just remove the setting of that entirely. It needs to be set *before*
> adding us to the list, not after. That's just a bug - we get woken up
> when we've been given the lock.

Yes, I think this should work although I am not familiar with this code.

If we remove set_task_state() from the main waiting loop we can never race
with __rwsem_do_wake()->try_to_wake_up() seeing us in UNINTERRUPTIBLE state.
rwsem_down_failed_common() simply can't return until UNINTERRUPTIBLE->RUNNING
transition is finished (__rwsem_do_wake does wakeup first).

And since we do not play with current->state after spin_unlock(), it is
fine to "race" with waiter->task clearing; at worst we do an unnecessary
but harmless schedule() in TASK_RUNNING.

> So it may be completely and utterly broken for some subtle reason,

Well, what about another spurious wakeup from somewhere? In this case
rwsem_down_failed_common() will do a busy-wait loop.

> @@ -92,10 +92,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
>  	 */
>  	list_del(&waiter->list);
>  	tsk = waiter->task;
> +	wake_up_process(tsk);
>  	smp_mb();
>  	waiter->task = NULL;

OK, now I understand why we need "clear after wakeup".

But then I don't really understand this mb, perhaps wmb() is enough?
Afaics we only need to ensure we change waiter->task after changing
task's state.

OTOH,

> @@ -183,7 +181,6 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
>  	raw_spin_lock_irq(&sem->wait_lock);
>  	waiter.task = tsk;
>  	waiter.flags = flags;
> -	get_task_struct(tsk);
>
>  	if (list_empty(&sem->wait_list))
>  		adjustment += RWSEM_WAITING_BIAS;
> @@ -211,11 +208,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
>  		if (!waiter.task)
>  			break;
>  		schedule();
> -		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>  	}
>
> -	tsk->state = TASK_RUNNING;
> -
>  	return sem;
>  }

Suppose that this task does tsk->state = TASK_WHATEVER after that.
It seems that we need mb() before return, otherwise the next ->state
change can be reordered with "if (!waiter.task)" above. Or not?

Oleg.



* Re: [tip:sched/core] sched: Fix ancient race in do_exit()
  2012-01-29 18:59                                     ` Oleg Nesterov
@ 2012-01-30 16:27                                       ` Linus Torvalds
  0 siblings, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2012-01-30 16:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, hpa, linux-kernel, a.p.zijlstra, y-goto, akpm, tglx,
	mingo, linux-tip-commits

On Sun, Jan 29, 2012 at 10:59 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> If we remove set_task_state() from the main waiting loop we can never race
> with __rwsem_do_wake()->try_to_wake_up() seeing us in UNINTERRUPTIBLE state.
> rwsem_down_failed_common() simply can't return until UNINTERRUPTIBLE->RUNNING
> transition is finished (__rwsem_do_wake does wakeup first).
>
> And since we do not play with current->state after spin_unlock(), it is
> fine to "race" with waiter->task clearing, just we can do the unnecessary
> but harmless schedule() in TASK_RUNNING.

So the more I think about this code, the more nervous I get.

I did a version that got the sem->wait_lock if it hit the race case
("task" still non-NULL after the schedule), and I could convince
myself that it was race-free and safe. But I couldn't actually really
convince myself that it was better than the old code, because I worry
that we'd actually hit that case much too often (ie another CPU woke
us up, but hasn't cleared "task" yet, so spin on the spinlock etc).

So I do hate the bug in rwsem, but I'm now willing to entertain the
possibility that the bug actually means that unlocking of an rwsem is
"nicely decoupled" from the process that got the semaphore, and may be
a performance advantage.

To be honest, I don't *really* believe that, but changing the rwsem.c
code at this point just scares me enough that without lots and lots of
people looking at it and actually doing some performance analysis, I'm
willing to just say "screw it, I'll take Yasunori Goto's solution".

I'm definitely still not a huge fan of that patch, but I'm not a huge
fan of the alternative either. So Ack on the patch, and I'll just try
to think about this all some more.

                         Linus


end of thread, other threads:[~2012-01-30 16:27 UTC | newest]

Thread overview: 55+ messages
2011-12-22  0:42 [BUG] TASK_DEAD task is able to be woken up in special condition Yasunori Goto
2011-12-22  2:14 ` KOSAKI Motohiro
2011-12-22  8:22   ` Yasunori Goto
2011-12-22 20:02     ` KOSAKI Motohiro
2011-12-23  9:49 ` Peter Zijlstra
2011-12-23 15:41   ` Oleg Nesterov
2011-12-26  8:23     ` Yasunori Goto
2011-12-26 17:11       ` Oleg Nesterov
2011-12-27  6:48         ` Yasunori Goto
2012-01-06 10:22           ` Yasunori Goto
2012-01-06 11:01             ` Peter Zijlstra
2012-01-06 12:01               ` Yasunori Goto
2012-01-06 12:43                 ` Peter Zijlstra
2012-01-06 14:12                   ` Oleg Nesterov
2012-01-06 14:19                     ` Oleg Nesterov
2012-01-07  1:31                     ` Yasunori Goto
2012-01-16 11:51                       ` Yasunori Goto
2012-01-16 13:38                         ` Peter Zijlstra
2012-01-17  8:40                           ` Yasunori Goto
2012-01-17  9:06                             ` Ingo Molnar
2012-01-17 15:12                               ` Oleg Nesterov
2012-01-18  9:42                                 ` Ingo Molnar
2012-01-18 14:20                                   ` Oleg Nesterov
2012-01-24 10:19                                     ` Peter Zijlstra
2012-01-24 10:55                                       ` Peter Zijlstra
2012-01-24 17:25                                         ` KOSAKI Motohiro
2012-01-25 15:45                                         ` Oleg Nesterov
2012-01-25 16:51                                           ` Peter Zijlstra
2012-01-25 17:43                                             ` Oleg Nesterov
2012-01-26 15:32                                               ` Peter Zijlstra
2012-01-26 16:26                                                 ` Oleg Nesterov
2012-01-27  8:59                                                   ` Peter Zijlstra
2012-01-24 10:11                                   ` Peter Zijlstra
2012-01-26  9:39                                     ` Ingo Molnar
2012-01-28 12:03                             ` [tip:sched/core] sched: Fix ancient race in do_exit() tip-bot for Yasunori Goto
2012-01-28 21:12                               ` Linus Torvalds
2012-01-29 16:07                                 ` Oleg Nesterov
2012-01-29 17:44                                   ` Linus Torvalds
2012-01-29 18:28                                     ` Linus Torvalds
2012-01-29 18:59                                     ` Oleg Nesterov
2012-01-30 16:27                                       ` Linus Torvalds
2012-01-06 13:48             ` [BUG] TASK_DEAD task is able to be woken up in special condition Oleg Nesterov
2011-12-28 21:07         ` KOSAKI Motohiro
2012-01-24 10:23           ` Peter Zijlstra
2012-01-24 18:01             ` KOSAKI Motohiro
2012-01-25  6:15               ` Mike Galbraith
2012-01-26 21:24                 ` KOSAKI Motohiro
2012-01-25 10:10           ` Peter Zijlstra
2012-01-26 20:25             ` [tip:sched/urgent] sched: Fix rq->nr_uninterruptible update race tip-bot for Peter Zijlstra
2012-01-27  5:20               ` Rakib Mullick
2012-01-27  8:19                 ` Peter Zijlstra
2012-01-27 14:11                   ` Rakib Mullick
2012-01-26 21:21             ` [BUG] TASK_DEAD task is able to be woken up in special condition KOSAKI Motohiro
2012-01-27  8:21               ` Peter Zijlstra
2011-12-26  6:52   ` Yasunori Goto
