linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
@ 2014-08-25 10:54 Kautuk Consul
  2014-08-25 15:57 ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Kautuk Consul @ 2014-08-25 10:54 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov, Michal Hocko, David Rientjes,
	Ionut Alexa, Guillaume Morin, linux-kernel
  Cc: Kautuk Consul

I encountered a BUG() scenario within do_exit() on an ARM system.

The problem is due to a race between do_exit() and try_to_wake_up()
on different CPUs, caused by the use of sleeping primitives such as __down_common
and wait_for_common.

Race Scenario
=============

Let us assume there are 2 CPUs, A and B, which execute code in the following order:
1)	CPU A was running in user mode and enters kernel mode via some
	syscall/exception handler.
2)	CPU A sets the current task (t) state to TASK_INTERRUPTIBLE via __down_common
	or wait_for_common.
3)	CPU A checks for signal_pending() and returns due to TIF_SIGPENDING
	being set in t's threadinfo due to a previous signal (say SIGKILL) being
	received on this task t.
4)	CPU A returns to the assembly trap handler and calls
	do_work_pending() -> do_signal() -> get_signal() -> do_group_exit()
	-> do_exit()
	CPU A has not yet executed the following line of code before the final
	call to schedule():
    /* causes final put_task_struct in finish_task_switch(). */
    tsk->state = TASK_DEAD;
5)	CPU B tries to send a signal to task t (currently executing on CPU A)
	and thus enters: signal_wake_up_state() -> wake_up_state() ->
					 try_to_wake_up()
6)	CPU B executes all code in try_to_wake_up() till the call to
	ttwu_queue -> ttwu_do_activate -> ttwu_do_wakeup().
	CPU B has still not executed the following code in ttwu_do_wakeup():
	p->state = TASK_RUNNING;
7)	CPU A executes the following line of code:
    /* causes final put_task_struct in finish_task_switch(). */
    tsk->state = TASK_DEAD;
8)	CPU B executes the following code in ttwu_do_wakeup():
	p->state = TASK_RUNNING;
9)	CPU A continues to the call to do_exit() -> schedule().
	Since the tsk->state is TASK_RUNNING, the call to schedule() returns and
	do_exit() -> BUG() is hit on CPU A (a minimal userspace sketch of this
	lost update follows after this list).
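
To make the lost update in steps 6)-9) concrete, here is a minimal
userspace analogue (illustrative only: task_state stands in for
tsk->state, the threads stand in for the two CPUs, and the constants
are made up):

	#include <pthread.h>
	#include <stdio.h>

	#define TASK_RUNNING		0
	#define TASK_INTERRUPTIBLE	1
	#define TASK_DEAD		2

	static volatile int task_state = TASK_INTERRUPTIBLE;

	static void *waker(void *arg)		/* CPU B: try_to_wake_up() */
	{
		task_state = TASK_RUNNING;	/* step 8) */
		return NULL;
	}

	int main(void)
	{
		pthread_t b;

		pthread_create(&b, NULL, waker, NULL);	/* steps 5)-6) */
		task_state = TASK_DEAD;			/* step 7) */
		pthread_join(b, NULL);
		if (task_state != TASK_DEAD)		/* step 9) */
			printf("BUG(): TASK_DEAD was overwritten\n");
		return 0;
	}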

Alternate Solution
==================

An alternate solution would be to simply set the current task state to
TASK_RUNNING in __down_common(), wait_for_common() and all other interruptible
sleeping primitives, in their if (signal_pending/signal_pending_state)
conditional blocks (sketched after the list below).

But this change seems more logical to me because:
i)		It involves fewer changes to the kernel core code.
ii)		Any future sleeping primitives in the kernel will also not suffer
		from this kind of race scenario.
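
For reference, the alternate fix would look roughly like this in a generic
interruptible wait loop (an illustrative sketch only, not a tested patch;
err and wait_condition are placeholders):

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (signal_pending(current)) {
			/* Leave in TASK_RUNNING so that nothing is
			 * left for a late try_to_wake_up() to do
			 * before we proceed towards do_exit(). */
			__set_current_state(TASK_RUNNING);
			err = -EINTR;
			break;
		}
		if (wait_condition)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);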

Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
---
 kernel/exit.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..69a8231 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -824,14 +824,16 @@ void do_exit(long code)
 	 *     (or hypervisor of virtual machine switches to other guest)
 	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
 	 *
-	 * To avoid it, we have to wait for releasing tsk->pi_lock which
-	 * is held by try_to_wake_up()
+	 * To solve this, we have to compete for tsk->pi_lock which is held by
+	 * try_to_wake_up().
 	 */
-	smp_mb();
-	raw_spin_unlock_wait(&tsk->pi_lock);
+	raw_spin_lock(&tsk->pi_lock);
 
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
+
+	raw_spin_unlock(&tsk->pi_lock);
+
 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
 	schedule();
 	BUG();
-- 
1.7.9.5



* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-08-25 10:54 [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up() Kautuk Consul
@ 2014-08-25 15:57 ` Oleg Nesterov
  2014-08-26  4:45   ` Kautuk Consul
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Oleg Nesterov @ 2014-08-25 15:57 UTC (permalink / raw)
  To: Kautuk Consul, Peter Zijlstra, Ingo Molnar
  Cc: Andrew Morton, Michal Hocko, David Rientjes, Ionut Alexa,
	Guillaume Morin, linux-kernel, Kirill Tkhai

On 08/25, Kautuk Consul wrote:
>
> I encountered a BUG() scenario within do_exit() on an ARM system.
>
> The problem is due to a race between do_exit() and try_to_wake_up()
> on different CPUs, caused by the use of sleeping primitives such as __down_common
> and wait_for_common.
>
> Race Scenario
> =============
>
> Let us assume there are 2 CPUs, A and B, which execute code in the following order:
> 1)	CPU A was running in user mode and enters kernel mode via some
> 	syscall/exception handler.
> 2)	CPU A sets the current task (t) state to TASK_INTERRUPTIBLE via __down_common
> 	or wait_for_common.
> 3)	CPU A checks for signal_pending() and returns due to TIF_SIGPENDING
> 	being set in t's threadinfo due to a previous signal (say SIGKILL) being
> 	received on this task t.
> 4)	CPU A returns to the assembly trap handler and calls
> 	do_work_pending() -> do_signal() -> get_signal() -> do_group_exit()
> 	-> do_exit()
> 	CPU A has not yet executed the following line of code before the final
> 	call to schedule():
>     /* causes final put_task_struct in finish_task_switch(). */
>     tsk->state = TASK_DEAD;
> 5)	CPU B tries to send a signal to task t (currently executing on CPU A)
> 	and thus enters: signal_wake_up_state() -> wake_up_state() ->
> 					 try_to_wake_up()
> 6)	CPU B executes all code in try_to_wake_up() till the call to
> 	ttwu_queue -> ttwu_do_activate -> ttwu_do_wakeup().
> 	CPU B has still not executed the following code in ttwu_do_wakeup():
> 	p->state = TASK_RUNNING;
> 7)	CPU A executes the following line of code:
>     /* causes final put_task_struct in finish_task_switch(). */
>     tsk->state = TASK_DEAD;
> 8)	CPU B executes the following code in ttwu_do_wakeup():
> 	p->state = TASK_RUNNING;
> 9)	CPU A continues to the call to do_exit() -> schedule().
> 	Since the tsk->state is TASK_RUNNING, the call to schedule() returns and
> 	do_exit() -> BUG() is hit on CPU A.
>
> Alternate Solution
> ==================
>
> An alternate solution would be to simply set the current task state to
> TASK_RUNNING in __down_common(), wait_for_common() and all other interruptible
> sleeping primitives, in their if (signal_pending/signal_pending_state)
> conditional blocks.
>
> But this change seems more logical to me because:
> i)		It involves fewer changes to the kernel core code.
> ii)		Any future sleeping primitives in the kernel will also not suffer
> 		from this kind of race scenario.
>
> Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
> ---
>  kernel/exit.c |   10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 32c58f7..69a8231 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -824,14 +824,16 @@ void do_exit(long code)
>  	 *     (or hypervisor of virtual machine switches to other guest)
>  	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
>  	 *
> -	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> -	 * is held by try_to_wake_up()
> +	 * To solve this, we have to compete for tsk->pi_lock which is held by
> +	 * try_to_wake_up().
>  	 */
> -	smp_mb();
> -	raw_spin_unlock_wait(&tsk->pi_lock);
> +	raw_spin_lock(&tsk->pi_lock);
>
>  	/* causes final put_task_struct in finish_task_switch(). */
>  	tsk->state = TASK_DEAD;
> +
> +	raw_spin_unlock(&tsk->pi_lock);
> +
>  	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
>  	schedule();
>  	BUG();
> --

Peter, do you remember another problem with TASK_DEAD we discussed recently?
(prev_state == TASK_DEAD detection in finish_task_switch() still looks racy).

I am starting to think that perhaps we need something like below, what do
you all think?

Oleg.

--- x/kernel/sched/core.c
+++ x/kernel/sched/core.c
@@ -2205,9 +2205,10 @@ static void finish_task_switch(struct rq
 	__releases(rq->lock)
 {
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
+	struct task_struct *dead = rq->dead;
 
 	rq->prev_mm = NULL;
+	rq->dead = NULL;
 
 	/*
 	 * A task struct has one reference for the use as "current".
@@ -2220,7 +2221,6 @@ static void finish_task_switch(struct rq
 	 * be dropped twice.
 	 *		Manfred Spraul <manfred@colorfullife.com>
 	 */
-	prev_state = prev->state;
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
@@ -2230,16 +2230,16 @@ static void finish_task_switch(struct rq
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
-	if (unlikely(prev_state == TASK_DEAD)) {
-		if (prev->sched_class->task_dead)
-			prev->sched_class->task_dead(prev);
+	if (unlikely(dead)) {
+		if (dead->sched_class->task_dead)
+			dead->sched_class->task_dead(dead);
 
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
 		 */
-		kprobe_flush_task(prev);
-		put_task_struct(prev);
+		kprobe_flush_task(dead);
+		put_task_struct(dead);
 	}
 
 	tick_nohz_task_switch(current);
@@ -2770,11 +2770,15 @@ need_resched:
 	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
+	if (unlikely(rq->dead))
+		goto deactivate;
+
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
 		} else {
+deactivate:
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
@@ -2826,6 +2830,15 @@ need_resched:
 		goto need_resched;
 }
 
+// called under preempt_disable();
+void exit_schedule()
+{
+	// TODO: kill TASK_DEAD, this is only for proc
+	current->state = TASK_DEAD;
+	task_rq(current)->dead = current;
+	__schedule();
+}
+
 static inline void sched_submit_work(struct task_struct *tsk)
 {
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -815,25 +815,8 @@ void do_exit(long code)
 		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
 	exit_rcu();
 
-	/*
-	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
-	 * when the following two conditions become true.
-	 *   - There is race condition of mmap_sem (It is acquired by
-	 *     exit_mm()), and
-	 *   - SMI occurs before setting TASK_RUNINNG.
-	 *     (or hypervisor of virtual machine switches to other guest)
-	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
-	 *
-	 * To avoid it, we have to wait for releasing tsk->pi_lock which
-	 * is held by try_to_wake_up()
-	 */
-	smp_mb();
-	raw_spin_unlock_wait(&tsk->pi_lock);
-
-	/* causes final put_task_struct in finish_task_switch(). */
-	tsk->state = TASK_DEAD;
 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
-	schedule();
+	exit_schedule();
 	BUG();
 	/* Avoid "noreturn function does return".  */
 	for (;;)



* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-08-25 15:57 ` Oleg Nesterov
@ 2014-08-26  4:45   ` Kautuk Consul
  2014-08-26 15:03     ` Oleg Nesterov
  2014-09-01 15:39   ` Peter Zijlstra
  2014-09-03  9:04   ` [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up() Kirill Tkhai
  2 siblings, 1 reply; 25+ messages in thread
From: Kautuk Consul @ 2014-08-26  4:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

Sorry folks,

I got one thing wrong:
From some more code review, both __down_common() and
do_wait_for_common() inspect signal_pending() only while in
TASK_RUNNING.

So I think that it cannot be possible that this happened on my system
due to __down_common() and/or wait_for_common().

Which leaves only the possibility that the BUG() on my embedded
system happened due to a driver which was trying to implement its own
sleeping primitive by first setting the task state to TASK_INTERRUPTIBLE
and then checking signal_pending/signal_pending_state.
(But this is generally frowned upon and not really acceptable nowadays.)

I'll review those drivers for this kind of mistake.
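
The broken pattern I will be looking for is roughly the following
(hypothetical driver code, for illustration only):

	set_current_state(TASK_INTERRUPTIBLE);
	if (signal_pending(current)) {
		__set_current_state(TASK_RUNNING);
		/* A try_to_wake_up() that already observed
		 * TASK_INTERRUPTIBLE can still be in flight here, and
		 * can later overwrite the TASK_DEAD store in do_exit(). */
		return -ERESTARTSYS;
	}
	schedule();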


On Mon, Aug 25, 2014 at 9:27 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 08/25, Kautuk Consul wrote:
>>
>> I encountered a BUG() scenario within do_exit() on an ARM system.
>>
>> The problem is due to a race between do_exit() and try_to_wake_up()
>> on different CPUs, caused by the use of sleeping primitives such as __down_common
>> and wait_for_common.
>>
>> Race Scenario
>> =============
>>
>> Let us assume there are 2 CPUs, A and B, which execute code in the following order:
>> 1)    CPU A was running in user mode and enters kernel mode via some
>>       syscall/exception handler.
>> 2)    CPU A sets the current task (t) state to TASK_INTERRUPTIBLE via __down_common
>>       or wait_for_common.
>> 3)    CPU A checks for signal_pending() and returns due to TIF_SIGPENDING
>>       being set in t's threadinfo due to a previous signal (say SIGKILL) being
>>       received on this task t.
>> 4)    CPU A returns to the assembly trap handler and calls
>>       do_work_pending() -> do_signal() -> get_signal() -> do_group_exit()
>>       -> do_exit()
>>       CPU A has not yet executed the following line of code before the final
>>       call to schedule():
>>     /* causes final put_task_struct in finish_task_switch(). */
>>     tsk->state = TASK_DEAD;
>> 5)    CPU B tries to send a signal to task t (currently executing on CPU A)
>>       and thus enters: signal_wake_up_state() -> wake_up_state() ->
>>                                        try_to_wake_up()
>> 6)    CPU B executes all code in try_to_wake_up() till the call to
>>       ttwu_queue -> ttwu_do_activate -> ttwu_do_wakeup().
>>       CPU B has still not executed the following code in ttwu_do_wakeup():
>>       p->state = TASK_RUNNING;
>> 7)    CPU A executes the following line of code:
>>     /* causes final put_task_struct in finish_task_switch(). */
>>     tsk->state = TASK_DEAD;
>> 8)    CPU B executes the following code in ttwu_do_wakeup():
>>       p->state = TASK_RUNNING;
>> 9)    CPU A continues to the call to do_exit() -> schedule().
>>       Since the tsk->state is TASK_RUNNING, the call to schedule() returns and
>>       do_exit() -> BUG() is hit on CPU A.
>>
>> Alternate Solution
>> ==================
>>
>> An alternate solution would be to simply set the current task state to
>> TASK_RUNNING in __down_common(), wait_for_common() and all other interruptible
>> sleeping primitives, in their if (signal_pending/signal_pending_state)
>> conditional blocks.
>>
>> But this change seems more logical to me because:
>> i)            It involves fewer changes to the kernel core code.
>> ii)           Any future sleeping primitives in the kernel will also not suffer
>>               from this kind of race scenario.
>>
>> Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
>> ---
>>  kernel/exit.c |   10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 32c58f7..69a8231 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -824,14 +824,16 @@ void do_exit(long code)
>>        *     (or hypervisor of virtual machine switches to other guest)
>>        *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
>>        *
>> -      * To avoid it, we have to wait for releasing tsk->pi_lock which
>> -      * is held by try_to_wake_up()
>> +      * To solve this, we have to compete for tsk->pi_lock which is held by
>> +      * try_to_wake_up().
>>        */
>> -     smp_mb();
>> -     raw_spin_unlock_wait(&tsk->pi_lock);
>> +     raw_spin_lock(&tsk->pi_lock);
>>
>>       /* causes final put_task_struct in finish_task_switch(). */
>>       tsk->state = TASK_DEAD;
>> +
>> +     raw_spin_unlock(&tsk->pi_lock);
>> +
>>       tsk->flags |= PF_NOFREEZE;      /* tell freezer to ignore us */
>>       schedule();
>>       BUG();
>> --
>
> Peter, do you remember another problem with TASK_DEAD we discussed recently?
> (prev_state == TASK_DEAD detection in finish_task_switch() still looks racy).
>
> I am starting to think that perhaps we need something like below, what do
> you all think?
>
> Oleg.
>
> --- x/kernel/sched/core.c
> +++ x/kernel/sched/core.c
> @@ -2205,9 +2205,10 @@ static void finish_task_switch(struct rq
>         __releases(rq->lock)
>  {
>         struct mm_struct *mm = rq->prev_mm;
> -       long prev_state;
> +       struct task_struct *dead = rq->dead;
>
>         rq->prev_mm = NULL;
> +       rq->dead = NULL;
>
>         /*
>          * A task struct has one reference for the use as "current".
> @@ -2220,7 +2221,6 @@ static void finish_task_switch(struct rq
>          * be dropped twice.
>          *              Manfred Spraul <manfred@colorfullife.com>
>          */
> -       prev_state = prev->state;
>         vtime_task_switch(prev);
>         finish_arch_switch(prev);
>         perf_event_task_sched_in(prev, current);
> @@ -2230,16 +2230,16 @@ static void finish_task_switch(struct rq
>         fire_sched_in_preempt_notifiers(current);
>         if (mm)
>                 mmdrop(mm);
> -       if (unlikely(prev_state == TASK_DEAD)) {
> -               if (prev->sched_class->task_dead)
> -                       prev->sched_class->task_dead(prev);
> +       if (unlikely(dead)) {
> +               if (dead->sched_class->task_dead)
> +                       dead->sched_class->task_dead(dead);
>
>                 /*
>                  * Remove function-return probe instances associated with this
>                  * task and put them back on the free list.
>                  */
> -               kprobe_flush_task(prev);
> -               put_task_struct(prev);
> +               kprobe_flush_task(dead);
> +               put_task_struct(dead);
>         }
>
>         tick_nohz_task_switch(current);
> @@ -2770,11 +2770,15 @@ need_resched:
>         smp_mb__before_spinlock();
>         raw_spin_lock_irq(&rq->lock);
>
> +       if (unlikely(rq->dead))
> +               goto deactivate;
> +
>         switch_count = &prev->nivcsw;
>         if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>                 if (unlikely(signal_pending_state(prev->state, prev))) {
>                         prev->state = TASK_RUNNING;
>                 } else {
> +deactivate:
>                         deactivate_task(rq, prev, DEQUEUE_SLEEP);
>                         prev->on_rq = 0;
>
> @@ -2826,6 +2830,15 @@ need_resched:
>                 goto need_resched;
>  }
>
> +// called under preempt_disable();
> +void exit_schedule()
> +{
> +       // TODO: kill TASK_DEAD, this is only for proc
> +       current->state = TASK_DEAD;
> +       task_rq(current)->dead = current;
> +       __schedule();
> +}
> +
>  static inline void sched_submit_work(struct task_struct *tsk)
>  {
>         if (!tsk->state || tsk_is_pi_blocked(tsk))
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -815,25 +815,8 @@ void do_exit(long code)
>                 __this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
>         exit_rcu();
>
> -       /*
> -        * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> -        * when the following two conditions become true.
> -        *   - There is race condition of mmap_sem (It is acquired by
> -        *     exit_mm()), and
> -        *   - SMI occurs before setting TASK_RUNINNG.
> -        *     (or hypervisor of virtual machine switches to other guest)
> -        *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> -        *
> -        * To avoid it, we have to wait for releasing tsk->pi_lock which
> -        * is held by try_to_wake_up()
> -        */
> -       smp_mb();
> -       raw_spin_unlock_wait(&tsk->pi_lock);
> -
> -       /* causes final put_task_struct in finish_task_switch(). */
> -       tsk->state = TASK_DEAD;
>         tsk->flags |= PF_NOFREEZE;      /* tell freezer to ignore us */
> -       schedule();
> +       exit_schedule();
>         BUG();
>         /* Avoid "noreturn function does return".  */
>         for (;;)
>


* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-08-26  4:45   ` Kautuk Consul
@ 2014-08-26 15:03     ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2014-08-26 15:03 UTC (permalink / raw)
  To: Kautuk Consul
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On 08/26, Kautuk Consul wrote:
>
> I got one thing wrong:

Yes, your description was not accurate, but

> From some more code review, both __down_common() and
> do_wait_for_common() inspect signal_pending() only while in
> TASK_RUNNING.

this doesn't really matter, or I missed something.

We have too many problems with this TASK_DEAD state. I have to admit that
I no longer understand why we do not need a barrier after spin_unlock_wait().

	set_current_state(TASK_UNINTERRUPTIBLE);	// sleeping primitive

	__set_current_state(TASK_RUNNING);		// signal_pending() path

	// do_exit()

	mb();
	spin_unlock_wait(&tsk->pi_lock);

	tsk->state = TASK_DEAD;

	schedule();

Previously I was convinced, but now I think that ttwu(TASK_UNINTERRUPTIBLE)
can still change TASK_DEAD into TASK_RUNNING if the CPU reorders spin_unlock_wait()
and "state = TASK_DEAD"; see the sketch below.

Perhaps I am wrong, and in any case we can fix this, but there is another
problem: in theory finish_task_switch() can race with the RUNNING -> DEAD
transition.

So I still think that the (incomplete) patch I sent probably makes sense, even
if it adds the ugly rq->dead check into __schedule().

Let's wait for Peter.

Oleg.



* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-08-25 15:57 ` Oleg Nesterov
  2014-08-26  4:45   ` Kautuk Consul
@ 2014-09-01 15:39   ` Peter Zijlstra
  2014-09-01 17:58     ` Oleg Nesterov
  2014-09-03  9:04   ` [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up() Kirill Tkhai
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-09-01 15:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On Mon, Aug 25, 2014 at 05:57:38PM +0200, Oleg Nesterov wrote:
> Peter, do you remember another problem with TASK_DEAD we discussed recently?
> (prev_state == TASK_DEAD detection in finish_task_switch() still looks racy).

Uhm, right. That was somewhere on the todo list :-)

> I am starting to think that perhaps we need something like below, what do
> you all think?

I'm thinking you lost the hunk that adds rq::dead :-), more comments
below.

> --- x/kernel/sched/core.c
> +++ x/kernel/sched/core.c
> @@ -2205,9 +2205,10 @@ static void finish_task_switch(struct rq
>  	__releases(rq->lock)
>  {
>  	struct mm_struct *mm = rq->prev_mm;
> -	long prev_state;
> +	struct task_struct *dead = rq->dead;
>  
>  	rq->prev_mm = NULL;
> +	rq->dead = NULL;
>  
>  	/*
>  	 * A task struct has one reference for the use as "current".
> @@ -2220,7 +2221,6 @@ static void finish_task_switch(struct rq
>  	 * be dropped twice.
>  	 *		Manfred Spraul <manfred@colorfullife.com>
>  	 */

Clearly that comment needs to go as well..

> -	prev_state = prev->state;
>  	vtime_task_switch(prev);
>  	finish_arch_switch(prev);
>  	perf_event_task_sched_in(prev, current);
> @@ -2230,16 +2230,16 @@ static void finish_task_switch(struct rq
>  	fire_sched_in_preempt_notifiers(current);
>  	if (mm)
>  		mmdrop(mm);
> -	if (unlikely(prev_state == TASK_DEAD)) {
> -		if (prev->sched_class->task_dead)
> -			prev->sched_class->task_dead(prev);
> +	if (unlikely(dead)) {

	BUG_ON(dead != prev); ?

> +		if (dead->sched_class->task_dead)
> +			dead->sched_class->task_dead(dead);
>  
>  		/*
>  		 * Remove function-return probe instances associated with this
>  		 * task and put them back on the free list.
>  		 */
> -		kprobe_flush_task(prev);
> -		put_task_struct(prev);
> +		kprobe_flush_task(dead);
> +		put_task_struct(dead);
>  	}
>  
>  	tick_nohz_task_switch(current);
> @@ -2770,11 +2770,15 @@ need_resched:
>  	smp_mb__before_spinlock();
>  	raw_spin_lock_irq(&rq->lock);
>  
> +	if (unlikely(rq->dead))
> +		goto deactivate;
> +

Yeah, it would be best to not have to do that; ideally we would be able
to maybe do both: set rq->dead and current->state = TASK_DEAD.

Hmm, your exit_schedule() already does this, so why this extra test?

>  	switch_count = &prev->nivcsw;
>  	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>  		if (unlikely(signal_pending_state(prev->state, prev))) {
>  			prev->state = TASK_RUNNING;
>  		} else {
> +deactivate:
>  			deactivate_task(rq, prev, DEQUEUE_SLEEP);
>  			prev->on_rq = 0;
>  
> @@ -2826,6 +2830,15 @@ need_resched:
>  		goto need_resched;
>  }
>  
> +// called under preempt_disable();
> +void exit_schedule()
> +{
> +	// TODO: kill TASK_DEAD, this is only for proc
> +	current->state = TASK_DEAD;

Ah, not so, it's also to avoid that extra condition in schedule() :-)

> +	task_rq(current)->dead = current;
> +	__schedule();
> +}
> +
>  static inline void sched_submit_work(struct task_struct *tsk)
>  {
>  	if (!tsk->state || tsk_is_pi_blocked(tsk))
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -815,25 +815,8 @@ void do_exit(long code)
>  		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
>  	exit_rcu();
>  
> -	/*
> -	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> -	 * when the following two conditions become true.
> -	 *   - There is race condition of mmap_sem (It is acquired by
> -	 *     exit_mm()), and
> -	 *   - SMI occurs before setting TASK_RUNINNG.
> -	 *     (or hypervisor of virtual machine switches to other guest)
> -	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> -	 *
> -	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> -	 * is held by try_to_wake_up()
> -	 */
> -	smp_mb();
> -	raw_spin_unlock_wait(&tsk->pi_lock);
> -
> -	/* causes final put_task_struct in finish_task_switch(). */
> -	tsk->state = TASK_DEAD;
>  	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
> -	schedule();
> +	exit_schedule();
>  	BUG();
>  	/* Avoid "noreturn function does return".  */
>  	for (;;)

Yes, something like this might work fine..


* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-01 15:39   ` Peter Zijlstra
@ 2014-09-01 17:58     ` Oleg Nesterov
  2014-09-01 19:09       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2014-09-01 17:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On 09/01, Peter Zijlstra wrote:
>
> On Mon, Aug 25, 2014 at 05:57:38PM +0200, Oleg Nesterov wrote:
> > Peter, do you remember another problem with TASK_DEAD we discussed recently?
> > (prev_state == TASK_DEAD detection in finish_task_switch() still looks racy).
>
> Uhm, right. That was somewhere on the todo list :-)
>
> > I am starting to think that perhaps we need something like below, what do
> > you all think?
>
> I'm thinking you lost the hunk that adds rq::dead :-), more comments
> below.

And "goto deactivate" should be moved down, after "switch_count"
initialization.

> > +	if (unlikely(rq->dead))
> > +		goto deactivate;
> > +
>
> Yeah, it would be best to not have to do that; ideally we would be able
> to maybe do both: set rq->dead and current->state = TASK_DEAD.

To avoid spin_unlock_wait() in do_exit(). But on second thought this
can't work, please see below.

> > --- x/kernel/exit.c
> > +++ x/kernel/exit.c
> > @@ -815,25 +815,8 @@ void do_exit(long code)
> >  		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
> >  	exit_rcu();
> >
> > -	/*
> > -	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> > -	 * when the following two conditions become true.
> > -	 *   - There is race condition of mmap_sem (It is acquired by
> > -	 *     exit_mm()), and
> > -	 *   - SMI occurs before setting TASK_RUNINNG.
> > -	 *     (or hypervisor of virtual machine switches to other guest)
> > -	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> > -	 *
> > -	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> > -	 * is held by try_to_wake_up()
> > -	 */
> > -	smp_mb();
> > -	raw_spin_unlock_wait(&tsk->pi_lock);
> > -
> > -	/* causes final put_task_struct in finish_task_switch(). */
> > -	tsk->state = TASK_DEAD;
> >  	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
> > -	schedule();
> > +	exit_schedule();
> >  	BUG();
> >  	/* Avoid "noreturn function does return".  */
> >  	for (;;)
>
> Yes, something like this might work fine..

Not really :/ Yes, rq->dead (or just "bool prev_dead") should obviously
solve the problem with ttwu() after the last schedule(). But only in the
sense that the dying task won't be activated.

However, the very fact that another CPU can look at this task_struct
means that we still need spin_unlock_wait(). If nothing else, to ensure
that try_to_wake_up()->spin_unlock(pi_lock) won't write into the memory
we are going to free.

So I think the comment in do_exit() should be updated too, and smp_mb()
should be moved under raw_spin_unlock_wait(), but ...

But if I am right, doesn't this mean that we have even more problems with
postmortem wakeups??? Why can't ttwu() _start_ after spin_unlock_wait()?

Oleg.



* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-01 17:58     ` Oleg Nesterov
@ 2014-09-01 19:09       ` Peter Zijlstra
  2014-09-02 15:52         ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-09-01 19:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On Mon, Sep 01, 2014 at 07:58:51PM +0200, Oleg Nesterov wrote:

> However, the very fact that another CPU can look at this task_struct
> means that we still need spin_unlock_wait(). If nothing else to ensure
> that try_to_wake_up()->spin_unlock(pi_lock) won't write into the memory
> we are going to free.

task_struct is RCU freed, if it still has a 'reference' to the task, it
shouldn't be going 'away', right?

> So I think the comment in do_exit() should be updated too, and smp_mb()
> should be moved under raw_spin_unlock_wait(), but ...
> 
> But if I am right, doesn't this mean that we have even more problems with
> postmortem wakeups??? Why can't ttwu() _start_ after spin_unlock_wait()?

ttwu should bail at: if (!(p->state & state)) goto out; That should
never match with TASK_DEAD.
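
For completeness, the bail-out I mean, with that era's state values
(sketch; TASK_NORMAL is the mask a typical wakeup passes):

	#define TASK_INTERRUPTIBLE	1
	#define TASK_UNINTERRUPTIBLE	2
	#define TASK_NORMAL		(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
	#define TASK_DEAD		64

	/* in try_to_wake_up(): */
	if (!(p->state & state))	/* 64 & 3 == 0: TASK_DEAD never matches */
		goto out;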

Either that, or I should go sleep already :-) I shifted 7 hours
yesterday, so I'm still somewhat jet-lagged.


* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-01 19:09       ` Peter Zijlstra
@ 2014-09-02 15:52         ` Oleg Nesterov
  2014-09-02 16:47           ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2014-09-02 15:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On 09/01, Peter Zijlstra wrote:
>
> On Mon, Sep 01, 2014 at 07:58:51PM +0200, Oleg Nesterov wrote:
>
> > However, the very fact that another CPU can look at this task_struct
> > means that we still need spin_unlock_wait(). If nothing else to ensure
> > that try_to_wake_up()->spin_unlock(pi_lock) won't write into the memory
> > we are going to free.
>
> task_struct is RCU freed, if it still has a 'reference' to the task,

Not really, put_task_struct() frees this memory once the counter is zero,
but this doesn't matter,

> it shouldn't be going 'away', right?

Yes, thanks for correcting me. Somehow I forgot that the caller of ttwu()
should have a reference anyway. And indeed, say, __rwsem_do_wake() does
have one. Otherwise this code would be obviously buggy in any case.
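
The protocol I mean, roughly (simplified from the rwsem code, not an
exact copy):

	/* sleeper, when it queues itself: */
	waiter.task = current;
	get_task_struct(current);

	/* waker, possibly after the sleeper has entered do_exit(): */
	tsk = waiter->task;
	wake_up_process(tsk);
	put_task_struct(tsk);	/* the task_struct stays valid until here,
				 * so ttwu()'s spin_unlock(&tsk->pi_lock)
				 * cannot write into freed memory */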

> > So I think the comment in do_exit() should be updated too, and smp_mb()
> > should be moved under raw_spin_unlock_wait(), but ...
> >
> > But if I am right, doesn't this mean that we have even more problems with
> > postmortem wakeups??? Why can't ttwu() _start_ after spin_unlock_wait()?
>
> ttwu should bail at: if (!(p->state & state)) goto out; That should
> never match with TASK_DEAD.

See above. I meant another problem, but I was wrong.

OK. So this patch should probably work. But let me think again and send
it tomorrow. Because today (and yesterday) I didn't really sleep ;)

Oleg.



* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-02 15:52         ` Oleg Nesterov
@ 2014-09-02 16:47           ` Oleg Nesterov
  2014-09-02 17:39             ` Peter Zijlstra
  2014-09-03 16:08             ` task_numa_fault() && TASK_DEAD Oleg Nesterov
  0 siblings, 2 replies; 25+ messages in thread
From: Oleg Nesterov @ 2014-09-02 16:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On 09/02, Oleg Nesterov wrote:
>
> OK. So this patch should probably work. But let me think again and send
> it tomorrow. Because today (and yesterday) I didn't really sleep ;)

But since I already wrote v2 yesterday, let me show it anyway. Perhaps
you will notice something wrong immediately...

So, once again, this patch adds the ugly "goto" into schedule(). OTOH,
it removes the ugly spin_unlock_wait(pi_lock).

TASK_DEAD can die. The only valid user is schedule_debug(), trivial to
change. The usage of TASK_DEAD in task_numa_fault() is wrong in any case.

In fact, I think that the next change can change exit_schedule() to use
PREEMPT_ACTIVE, and then we can simply remove the TASK_DEAD check in
schedule_debug().

Oleg.


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..ef47159 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2407,6 +2407,7 @@ extern int copy_thread(unsigned long, unsigned long, unsigned long,
 			struct task_struct *);
 extern void flush_thread(void);
 extern void exit_thread(void);
+extern void exit_schedule(void) __noreturn;
 
 extern void exit_files(struct task_struct *);
 extern void __cleanup_sighand(struct sighand_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..75c994f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -815,29 +815,8 @@ void do_exit(long code)
 		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
 	exit_rcu();
 
-	/*
-	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
-	 * when the following two conditions become true.
-	 *   - There is race condition of mmap_sem (It is acquired by
-	 *     exit_mm()), and
-	 *   - SMI occurs before setting TASK_RUNINNG.
-	 *     (or hypervisor of virtual machine switches to other guest)
-	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
-	 *
-	 * To avoid it, we have to wait for releasing tsk->pi_lock which
-	 * is held by try_to_wake_up()
-	 */
-	smp_mb();
-	raw_spin_unlock_wait(&tsk->pi_lock);
-
-	/* causes final put_task_struct in finish_task_switch(). */
-	tsk->state = TASK_DEAD;
 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
-	schedule();
-	BUG();
-	/* Avoid "noreturn function does return".  */
-	for (;;)
-		cpu_relax();	/* For when BUG is null */
+	exit_schedule();
 }
 EXPORT_SYMBOL_GPL(do_exit);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eee12b3..0a422e6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2205,22 +2205,11 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	__releases(rq->lock)
 {
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
+	bool prev_dead = rq->prev_dead;
 
 	rq->prev_mm = NULL;
+	rq->prev_dead = false;
 
-	/*
-	 * A task struct has one reference for the use as "current".
-	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
-	 * schedule one last time. The schedule call will never return, and
-	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
-	 */
-	prev_state = prev->state;
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
@@ -2230,7 +2219,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
-	if (unlikely(prev_state == TASK_DEAD)) {
+	if (unlikely(prev_dead)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
@@ -2771,10 +2760,14 @@ need_resched:
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;
+	if (unlikely(rq->prev_dead))
+		goto deactivate;
+
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
 		} else {
+deactivate:
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
@@ -2826,6 +2819,14 @@ need_resched:
 		goto need_resched;
 }
 
+void exit_schedule(void)
+{
+	current->state = TASK_DEAD; /* TODO: kill TASK_DEAD altogether */
+	task_rq(current)->prev_dead = true;
+	__schedule();
+	BUG();
+}
+
 static inline void sched_submit_work(struct task_struct *tsk)
 {
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 579712f..b97f98e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -560,6 +560,7 @@ struct rq {
 	struct task_struct *curr, *idle, *stop;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
+	bool prev_dead;
 
 	u64 clock;
 	u64 clock_task;



* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-02 16:47           ` Oleg Nesterov
@ 2014-09-02 17:39             ` Peter Zijlstra
  2014-09-03 13:36               ` Oleg Nesterov
  2014-09-03 16:08             ` task_numa_fault() && TASK_DEAD Oleg Nesterov
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-09-02 17:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On Tue, Sep 02, 2014 at 06:47:14PM +0200, Oleg Nesterov wrote:

> But since I already wrote v2 yesterday, let me show it anyway. Perhaps
> you will notice something wrong immediately...
> 
> So, once again, this patch adds the ugly "goto" into schedule(). OTOH,
> it removes the ugly spin_unlock_wait(pi_lock).

But schedule() is called _far_ more often than exit(). It would be
really good not to have to do that. 

> TASK_DEAD can die. The only valid user is schedule_debug(), trivial to
> change. The usage of TASK_DEAD in task_numa_fault() is wrong in any case.
> 
> In fact, I think that the next change can change exit_schedule() to use
> PREEMPT_ACTIVE, and then we can simply remove the TASK_DEAD check in
> schedule_debug().

So you worry about concurrent wakeups vs setting TASK_DEAD and thereby
losing it, right?

Would not something like:

	spin_lock_irq(&current->pi_lock);
	__set_current_state(TASK_DEAD);
	spin_unlock_irq(&current->pi_lock);

not be race free and similarly expensive to the smp_mb() we have there
now?

> -	BUG();
> -	/* Avoid "noreturn function does return".  */
> -	for (;;)
> -		cpu_relax();	/* For when BUG is null */


> +void exit_schedule(void)
> +{
> +	current->state = TASK_DEAD; /* TODO: kill TASK_DEAD altogether */
> +	task_rq(current)->prev_dead = true;
> +	__schedule();
> +	BUG();

you lost that for loop.


* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-08-25 15:57 ` Oleg Nesterov
  2014-08-26  4:45   ` Kautuk Consul
  2014-09-01 15:39   ` Peter Zijlstra
@ 2014-09-03  9:04   ` Kirill Tkhai
  2014-09-03  9:45     ` Peter Zijlstra
  2 siblings, 1 reply; 25+ messages in thread
From: Kirill Tkhai @ 2014-09-03  9:04 UTC (permalink / raw)
  To: Oleg Nesterov, Kautuk Consul, Peter Zijlstra, Ingo Molnar
  Cc: Andrew Morton, Michal Hocko, David Rientjes, Ionut Alexa,
	Guillaume Morin, linux-kernel

25.08.2014, 20:01, "Oleg Nesterov" <oleg@redhat.com>:
> Peter, do you remember another problem with TASK_DEAD we discussed recently?
> (prev_state == TASK_DEAD detection in finish_task_switch() still looks racy).

One more problem with task_dead just to mention it here.

The code below races with a concurrent change of sched_class:

        if (prev->sched_class->task_dead)
                prev->sched_class->task_dead(prev);

switched_from_dl() does not cancel running timers.

So, if dl_task_timer() is slow (it is unhappy acquiring rq->lock), the
timer may still be executing when the task is already dead.
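
To spell the window out (a sketch; switched_from_dl() does roughly only
the try-to-cancel half):

	if (hrtimer_try_to_cancel(&p->dl.dl_timer) == -1) {
		/* -1 means the callback is running right now.  Nothing
		 * here waits for it (hrtimer_cancel() would spin while
		 * we hold the locks the callback wants), so
		 * dl_task_timer() can still run against a task that is
		 * about to be freed. */
	}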

Kirill


* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-03  9:04   ` [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up() Kirill Tkhai
@ 2014-09-03  9:45     ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-09-03  9:45 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Oleg Nesterov, Kautuk Consul, Ingo Molnar, Andrew Morton,
	Michal Hocko, David Rientjes, Ionut Alexa, Guillaume Morin,
	linux-kernel

On Wed, Sep 03, 2014 at 01:04:36PM +0400, Kirill Tkhai wrote:
> 25.08.2014, 20:01, "Oleg Nesterov" <oleg@redhat.com>:
> > Peter, do you remember another problem with TASK_DEAD we discussed recently?
> > (prev_state == TASK_DEAD detection in finish_task_switch() still looks racy).
> 
> One more problem with task_dead just to mention it here.
> 
> The code below races with a concurrent change of sched_class:
> 
>         if (prev->sched_class->task_dead)
>                 prev->sched_class->task_dead(prev);
> 
> switched_from_dl() does not cancel running timers.

Well, it does a try_to_cancel(), but yes, that can fail. Now I suspect
you cannot actually do hrtimer_cancel() from switched_from because it's
called with locks held and the timer function will also try to acquire
those locks.

But yes, that appears to be an actual problem indeed.


* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-02 17:39             ` Peter Zijlstra
@ 2014-09-03 13:36               ` Oleg Nesterov
  2014-09-03 14:44                 ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2014-09-03 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

Peter, sorry for slow responses.

On 09/02, Peter Zijlstra wrote:
>
> On Tue, Sep 02, 2014 at 06:47:14PM +0200, Oleg Nesterov wrote:
>
> > But since I already wrote v2 yesterday, let me show it anyway. Perhaps
> > you will notice something wrong immediately...
> >
> > So, once again, this patch adds the ugly "goto" into schedule(). OTOH,
> > it removes the ugly spin_unlock_wait(pi_lock).
>
> But schedule() is called _far_ more often than exit(). It would be
> really good not to have to do that.

Yes, sure, performance-wise this is not a win. My point was that this way
the whole "last schedule" logic becomes very simple.

But OK, I buy your nack. I understand that we should not penalize
__schedule() if possible. Let's forget this patch.

> > TASK_DEAD can die. The only valid user is schedule_debug(), trivial to
> > change. The usage of TASK_DEAD in task_numa_fault() is wrong in any case.
> >
> > In fact, I think that the next change can change exit_schedule() to use
> > PREEMPT_ACTIVE, and then we can simply remove the TASK_DEAD check in
> > schedule_debug().
>
> So you worry about concurrent wakeups vs setting TASK_DEAD and thereby
> losing it, right?
>
> Would not something like:
>
> 	spin_lock_irq(&current->pi_lock);
> 	__set_current_state(TASK_DEAD);
> 	spin_unlock_irq(&current->pi_lock);

Sure. This should obviously fix the problem.

And, I think, another mb() after unlock_wait should fix it as well.

> not be race free and similarly expensive to the smp_mb() we have there
> now?

Ah, I simply do not know what is cheaper, even on x86. Well, we need
to enable/disable irqs, but again I do not really know how much this
costs. I cannot even say which (imo) looks better, the lock/unlock above
or

	// Ensure that the previous __set_current_state(RUNNING) can't
	// leak after spin_unlock_wait()
	smp_mb();
	spin_unlock_wait();
	// Another mb to ensure this too can't be reordered with unlock_wait
	set_current_state(TASK_DEAD);

What do you think looks better?

Oleg.



* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-03 13:36               ` Oleg Nesterov
@ 2014-09-03 14:44                 ` Peter Zijlstra
  2014-09-03 15:18                   ` Oleg Nesterov
  2014-09-04  5:04                   ` Ingo Molnar
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-09-03 14:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On Wed, Sep 03, 2014 at 03:36:40PM +0200, Oleg Nesterov wrote:
> Peter, sorry for slow responses.

No worries, I'm not entirely fast myself. Slept most of the day :-)

> Ah, I simply do not know what is cheaper, even on x86. Well, we need
> to enable/disable irqs, but again I do not really know how much this
> costs.

Ah good point about that IRQ thing, yes that's horribly expensive.

> I can even say what (imo) looks better, lock/unlock above or
> 
> 	// Ensure that the previous __set_current_state(RUNNING) can't
> 	// leak after spin_unlock_wait()
> 	smp_mb();
> 	spin_unlock_wait();
> 	// Another mb to ensure this too can't be reordered with unlock_wait
> 	set_current_state(TASK_DEAD);
> 
> What do you think looks better?

spin_unlock_wait() would be a control dependency, right? Therefore that
store could not creep up anyhow.
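
That is, spin_unlock_wait() is basically a load loop (sketch):

	while (raw_spin_is_locked(&tsk->pi_lock))
		cpu_relax();
	/*
	 * The store below is control-dependent on the loads in the
	 * loop above; a CPU must not make a speculative store visible,
	 * so it cannot move the store up before the loop is done.
	 */
	tsk->state = TASK_DEAD;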


* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-03 14:44                 ` Peter Zijlstra
@ 2014-09-03 15:18                   ` Oleg Nesterov
  2014-09-04  7:15                     ` Peter Zijlstra
  2014-09-04  5:04                   ` Ingo Molnar
  1 sibling, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2014-09-03 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On 09/03, Peter Zijlstra wrote:
>
> On Wed, Sep 03, 2014 at 03:36:40PM +0200, Oleg Nesterov wrote:
> >
> > 	// Ensure that the previous __set_current_state(RUNNING) can't
> > 	// leak after spin_unlock_wait()
> > 	smp_mb();
> > 	spin_unlock_wait();
> > 	// Another mb to ensure this too can't be reordered with unlock_wait
> > 	set_current_state(TASK_DEAD);
> >
> > What do you think looks better?
>
> spin_unlock_wait() would be a control dependency right? Therefore that
> store could not creep up anyhow.

Hmm, indeed, thanks! This probably means that task_work_run() can use
rmb() instead of mb().

What I can't understand is whether we still need a compiler barrier or not.
Probably "in theory yes"?

Oleg.



* task_numa_fault() && TASK_DEAD
  2014-09-02 16:47           ` Oleg Nesterov
  2014-09-02 17:39             ` Peter Zijlstra
@ 2014-09-03 16:08             ` Oleg Nesterov
  2014-09-03 16:33               ` Rik van Riel
  2014-09-04  7:11               ` Peter Zijlstra
  1 sibling, 2 replies; 25+ messages in thread
From: Oleg Nesterov @ 2014-09-03 16:08 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel, Mel Gorman
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai

On 09/02, Oleg Nesterov wrote:
>
> The usage of TASK_DEAD in task_numa_fault() is wrong in any case.

Rik, I can't understand why task_numa_fault() needs this check at all,
but "if (p->state == TASK_DEAD)" certainly looks wrong. You could replace
this check with BUG_ON(p->state == TASK_DEAD). Perhaps you meant PF_EXITING?

And a stupid (really, I don't understand this code) question:

	/* for example, ksmd faulting in a user's mm */
	if (!p->mm)
		return;

OK, but perhaps it makes sense to pass "mm" as another argument and do

	/* ksmd faulting in a user's mm, or debugger, or kthread use_mm() caller */
	if (p->mm != mm)
		return;

?

Oleg.



* Re: task_numa_fault() && TASK_DEAD
  2014-09-03 16:08             ` task_numa_fault() && TASK_DEAD Oleg Nesterov
@ 2014-09-03 16:33               ` Rik van Riel
  2014-09-04  7:11               ` Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: Rik van Riel @ 2014-09-03 16:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Mel Gorman
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai


On 09/03/2014 12:08 PM, Oleg Nesterov wrote:
> On 09/02, Oleg Nesterov wrote:
>> 
>> The usage of TASK_DEAD in task_numa_fault() is wrong in any
>> case.
> 
> Rik, I can't understand why task_numa_fault() needs this check at
> all, but "if (p->state == TASK_DEAD)" certainly looks wrong. You
> could replace this check with BUG_ON(p->state == TASK_DEAD).
> Perhaps you meant PF_EXITING?

I do not know why that code is in there, either.

I suspect it was added after some conversation on irc, with
either Peter or Mel.

> And a stupid (really, I don't understand this code) question:
> 
> 	/* for example, ksmd faulting in a user's mm */
> 	if (!p->mm)
> 		return;
> 
> OK, but perhaps it makes sense to pass "mm" as another argument and
> do
> 
> 	/* ksmd faulting in a user's mm, or debugger, or kthread use_mm()
> 	   caller */
> 	if (p->mm != mm)
> 		return;

I suppose that makes sense, since it would be possible for one task
to cause a page fault in another task's mm, with e.g. ptrace peek/poke
or similar code paths.

Currently the numa code could end up accounting such a fault in the
wrong mm, when it would be better to not account the fault at all.

This is a bit of a corner case, and probably not the highest priority
thing to fix right now, but it would be fairly easy.

-- 
All rights reversed


* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-03 14:44                 ` Peter Zijlstra
  2014-09-03 15:18                   ` Oleg Nesterov
@ 2014-09-04  5:04                   ` Ingo Molnar
  2014-09-04  6:32                     ` Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2014-09-04  5:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Kautuk Consul, Ingo Molnar, Andrew Morton,
	Michal Hocko, David Rientjes, Ionut Alexa, Guillaume Morin,
	linux-kernel, Kirill Tkhai


* Peter Zijlstra <peterz@infradead.org> wrote:

> > Ah, I simply do not know what is cheaper, even on x86. Well, 
> > we need to enable/disable irqs, but again I do not really 
> > know how much this costs.
> 
> Ah good point about that IRQ thing, yes that's horribly 
> expensive.

Enabling/disabling local IRQs is not really expensive (it's a 
flat cost essentially - below 10 cycles on modern x86 CPUs) - 
especially if we consider the 100x-1000x frequency difference 
between schedule() and exit(), on typical systems:

  $ grep -E 'ctxt|processes' /proc/stat 
  ctxt 47166536
  processes 91876

And that's from a system that emphatically does not schedule 
much. On others the difference is much larger.
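
(For the sample above that ratio is 47166536 / 91876 ≈ 513, i.e. even
here schedule() runs roughly 500 times for every process that exits.)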

So please don't push complexity into the scheduler from 
lower-freq areas of the kernel!

Thanks,

	Ingo


* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-04  5:04                   ` Ingo Molnar
@ 2014-09-04  6:32                     ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-09-04  6:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Oleg Nesterov, Kautuk Consul, Ingo Molnar, Andrew Morton,
	Michal Hocko, David Rientjes, Ionut Alexa, Guillaume Morin,
	linux-kernel, Kirill Tkhai

On Thu, Sep 04, 2014 at 07:04:24AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > Ah, I simply do not know what is cheaper, even on x86. Well, 
> > > we need to enable/disable irqs, but again I do not really 
> > > know how much this costs.
> > 
> > Ah good point about that IRQ thing, yes that's horribly 
> > expensive.
> 
> Enabling/disabling local IRQs is not really expensive (it's a 
> flat cost essentially - below 10 cycles on modern x86 CPUs) - 
> especially if we consider the 100x-1000x frequency difference 
> between schedule() and exit(), on typical systems:
> 
>   $ grep -E 'ctxt|processes' /proc/stat 
>   ctxt 47166536
>   processes 91876
> 
> And that's from a system that emphatically does not schedule 
> much. On others the difference is much larger.
> 
> So please don't push complexity into the scheduler from 
> lower-freq areas of the kernel!

We were very much not going to make schedule() more expensive, just
figuring out other ways.

And while it's good to know modern x86 has cheap INT flag poking, this is
very much not true for other archs and not even older x86 (remember P4?
:-).


* Re: task_numa_fault() && TASK_DEAD
  2014-09-03 16:08             ` task_numa_fault() && TASK_DEAD Oleg Nesterov
  2014-09-03 16:33               ` Rik van Riel
@ 2014-09-04  7:11               ` Peter Zijlstra
  2014-09-04 10:39                 ` Oleg Nesterov
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-09-04  7:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rik van Riel, Mel Gorman, Kautuk Consul, Ingo Molnar,
	Andrew Morton, Michal Hocko, David Rientjes, Ionut Alexa,
	Guillaume Morin, linux-kernel, Kirill Tkhai

On Wed, Sep 03, 2014 at 06:08:19PM +0200, Oleg Nesterov wrote:
> On 09/02, Oleg Nesterov wrote:
> >
> > The usage of TASK_DEAD in task_numa_fault() is wrong in any case.
> 
> Rik, I can't understand why task_numa_fault() needs this check at all,
> but "if (p->state == TASK_DEAD)" looks certainly wrong. You could replace
> this check with BUG_ON(p->state == TASK_DEAD). Perhaps you meant PF_EXITING?

Looking at 82727018b it appears the intent was to make sure we don't
re-create ->numa_faults after we free it. But you're right, we should
never get there with TASK_DEAD.

Also, given that task_numa_free() is called from __put_task_struct(), I
think we can safely delete this clause.
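
The deletion under discussion would amount to roughly this (against the
3.16-era kernel/sched/fair.c; the hunk context is approximate):

	--- a/kernel/sched/fair.c
	+++ b/kernel/sched/fair.c
	@@ task_numa_fault() @@
	-	if (p->state == TASK_DEAD)
	-		return;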

> And a stupid (really, I don't understand this code) question:
> 
> 	/* for example, ksmd faulting in a user's mm */
> 	if (!p->mm)
> 		return;

In general kernel threads have !->mm, and those cannot do the
accounting. The only way to get here is through get_user_pages() with
tsk != current and/or mm != current->mm.

> OK, but perhaps it makes sense to pass "mm" as another argument and do
> 
> 	/* ksmd faulting in a user's mm, or debugger, or kthread use_mm() caller */
> 	if (p->mm != mm)
> 		return;
> 
> ?

I'm still somewhat fuzzy in the brain, but that doesn't appear to
actually work: use_mm() explicitly sets ->mm, so in that case it would
match just fine.
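
For reference, a stripped-down sketch of use_mm() as it looks in this
era (mm/mmu_context.c; the mm_count and switch_mm() handling is elided),
showing why the proposed check would not fire for a kthread using a
user's mm:

	void use_mm(struct mm_struct *mm)
	{
		struct task_struct *tsk = current;

		task_lock(tsk);
		tsk->mm = mm;		/* from here on p->mm == mm, */
		tsk->active_mm = mm;	/* so "p->mm != mm" never fires */
		task_unlock(tsk);
	}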

That said, I don't think we really need to worry about this. The !->mm
case is special in that it can't ever work; the other cases are
extremely rare and won't skew the accounting much, if at all.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-03 15:18                   ` Oleg Nesterov
@ 2014-09-04  7:15                     ` Peter Zijlstra
  2014-09-04 17:03                       ` Paul E. McKenney
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-09-04  7:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Ionut Alexa, Guillaume Morin, linux-kernel,
	Kirill Tkhai, Paul McKenney

On Wed, Sep 03, 2014 at 05:18:48PM +0200, Oleg Nesterov wrote:
> On 09/03, Peter Zijlstra wrote:
> >
> > On Wed, Sep 03, 2014 at 03:36:40PM +0200, Oleg Nesterov wrote:
> > >
> > > 	// Ensure that the previous __set_current_state(RUNNING) can't
> > > 	// leak after spin_unlock_wait()
> > > 	smp_mb();
> > > 	spin_unlock_wait();
> > > 	// Another mb to ensure this too can't be reordered with unlock_wait
> > > 	set_current_state(TASK_DEAD);
> > >
> > > What do you think looks better?
> >
> > spin_unlock_wait() would be a control dependency, right? Therefore that
> > store could not creep up anyhow.
> 
> Hmm, indeed, thanks! This probably means that task_work_run() can use
> rmb() instead of mb().
> 
> What I can't understand is whether we still need a compiler barrier or
> not. Probably "in theory, yes"?

Yes, this is where I'm forever in doubt as well. The worry is the
compiler reordering things, but I'm not sure how it would do that in
this case; then again, I've been shown not to be creative enough in
these cases many times before.

Paul might know, he's had much more exposure to compiler people.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: task_numa_fault() && TASK_DEAD
  2014-09-04  7:11               ` Peter Zijlstra
@ 2014-09-04 10:39                 ` Oleg Nesterov
  2014-09-04 19:14                   ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2014-09-04 10:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Mel Gorman, Kautuk Consul, Ingo Molnar,
	Andrew Morton, Michal Hocko, David Rientjes, Ionut Alexa,
	Guillaume Morin, linux-kernel, Kirill Tkhai

On 09/04, Peter Zijlstra wrote:
>
> On Wed, Sep 03, 2014 at 06:08:19PM +0200, Oleg Nesterov wrote:
>
> > And a stupid (really, I don't understand this code) question:
> >
> > 	/* for example, ksmd faulting in a user's mm */
> > 	if (!p->mm)
> > 		return;
>
> In general kernel threads have !->mm, and those cannot do the
> accounting. The only way to get here is through get_user_pages() with
> tsk != current and/or mm != current->mm.
>
> > OK, but perhaps it makes sense to pass "mm" as another argument and do
> >
> > 	/* ksmd faulting in a user's mm, or debugger, or kthread use_mm() caller */
> > 	if (p->mm != mm)
> > 		return;
> >
> > ?
>
> I'm still somewhat fuzzy in the brain, but that doesn't appear to
> actually work: use_mm() explicitly sets ->mm, so in that case it would
> match just fine.

Yes, yes, sorry, I meant

	if (p->mm != mm || (p->flags & PF_KTHREAD))
		return;

> That said, I don't think we really need to worry about this. The !->mm
> case is special in that it can't ever work; the other cases are
> extremely rare and won't skew the accounting much, if at all.

Sure, this is not a bugfix. To me this change looks like a cleanup,
because I think that, say, ksmd doesn't really differ from a debugger in
this case (ignoring the fact that ->mm == NULL can probably lead to a
crash), or from a use_mm() caller.

But of course I agree, this is minor.

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
  2014-09-04  7:15                     ` Peter Zijlstra
@ 2014-09-04 17:03                       ` Paul E. McKenney
  0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2014-09-04 17:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Kautuk Consul, Ingo Molnar, Andrew Morton,
	Michal Hocko, David Rientjes, Ionut Alexa, Guillaume Morin,
	linux-kernel, Kirill Tkhai

On Thu, Sep 04, 2014 at 09:15:26AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 03, 2014 at 05:18:48PM +0200, Oleg Nesterov wrote:
> > On 09/03, Peter Zijlstra wrote:
> > >
> > > On Wed, Sep 03, 2014 at 03:36:40PM +0200, Oleg Nesterov wrote:
> > > >
> > > > 	// Ensure that the previous __set_current_state(RUNNING) can't
> > > > 	// leak after spin_unlock_wait()
> > > > 	smp_mb();
> > > > 	spin_unlock_wait();
> > > > 	// Another mb to ensure this too can't be reordered with unlock_wait
> > > > 	set_current_state(TASK_DEAD);
> > > >
> > > > What do you think looks better?
> > >
> > > spin_unlock_wait() would be a control dependency, right? Therefore that
> > > store could not creep up anyhow.
> > 
> > Hmm, indeed, thanks! This probably means that task_work_run() can use
> > rmb() instead of mb().
> > 
> > What I can't understand is whether we still need a compiler barrier or
> > not. Probably "in theory, yes"?
> 
> Yes, this is where I'm forever in doubt as well. The worry is the
> compiler reordering things, but I'm not sure how it would do that in
> this case; then again, I've been shown not to be creative enough in
> these cases many times before.
> 
> Paul might know, he's had much more exposure to compiler people.

Well, if we are talking about the code sequence above, spin_unlock_wait()
does reads followed by a conditional, and set_current_state() does a
write.  The one thing that might be missing for this to work is that
Alpha might need an ACCESS_ONCE() to avoid read reordering.

(Yes, ACCESS_ONCE() is required for the other architectures to meet
the letter of the law in memory-barriers.txt, but if you know that
your particular architecture and compiler won't mess you up, you
have more freedom in your arch-specific code.)
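
A sketch of that sequence with the reads made explicit
(spin_unlock_wait() is open-coded purely for illustration, and "locked"
stands in for the arch-specific lock word):

	smp_mb();				/* order the earlier state write */
	while (ACCESS_ONCE(lock->locked))	/* reads feeding a conditional */
		cpu_relax();
	set_current_state(TASK_DEAD);		/* store; the control dependency
						 * keeps it after the final read */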

							Thanx, Paul

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: task_numa_fault() && TASK_DEAD
  2014-09-04 10:39                 ` Oleg Nesterov
@ 2014-09-04 19:14                   ` Hugh Dickins
  2014-09-05 11:35                     ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2014-09-04 19:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Rik van Riel, Mel Gorman, Kautuk Consul,
	Ingo Molnar, Andrew Morton, Michal Hocko, David Rientjes,
	Ionut Alexa, Guillaume Morin, linux-kernel, Kirill Tkhai

On Thu, 4 Sep 2014, Oleg Nesterov wrote:
> On 09/04, Peter Zijlstra wrote:
> > On Wed, Sep 03, 2014 at 06:08:19PM +0200, Oleg Nesterov wrote:
> >
> > > And a stupid (really, I don't understand this code) question:
> > >
> > > 	/* for example, ksmd faulting in a user's mm */
> > > 	if (!p->mm)
> > > 		return;

I don't understand your difficulty with that; I thought the comment
was helpful enough.  Does the original commit comment help?

commit 2832bc19f6668fd00116f61f821105040599ef8b
Author: Hugh Dickins <hughd@google.com>
Date:   Wed Dec 19 17:42:16 2012 -0800

    sched: numa: ksm: fix oops in task_numa_placment()
    
    task_numa_placement() oopsed on NULL p->mm when task_numa_fault() got
    called in the handling of break_ksm() for ksmd.  That might be a
    peculiar case, which perhaps KSM could take steps to avoid? But it's
    more robust if task_numa_placement() allows for such a possibility.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: task_numa_fault() && TASK_DEAD
  2014-09-04 19:14                   ` Hugh Dickins
@ 2014-09-05 11:35                     ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2014-09-05 11:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Rik van Riel, Mel Gorman, Kautuk Consul,
	Ingo Molnar, Andrew Morton, Michal Hocko, David Rientjes,
	Ionut Alexa, Guillaume Morin, linux-kernel, Kirill Tkhai

On 09/04, Hugh Dickins wrote:
>
> On Thu, 4 Sep 2014, Oleg Nesterov wrote:
> > On 09/04, Peter Zijlstra wrote:
> > > On Wed, Sep 03, 2014 at 06:08:19PM +0200, Oleg Nesterov wrote:
> > >
> > > > And a stupid (really, I don't understand this code) question:
> > > >
> > > > 	/* for example, ksmd faulting in a user's mm */
> > > > 	if (!p->mm)
> > > > 		return;
>
> I don't understand your difficulty with that; I thought the comment
> was helpful enough.

Yes, yes, sorry for the confusion; the comment and the check itself look clear.

> Does the original commit comment help?
>
> commit 2832bc19f6668fd00116f61f821105040599ef8b
> Author: Hugh Dickins <hughd@google.com>
> Date:   Wed Dec 19 17:42:16 2012 -0800
>
>     sched: numa: ksm: fix oops in task_numa_placment()
>
>     task_numa_placement() oopsed on NULL p->mm

Yes. But I thought that even if task_numa_placement() didn't oops when
->mm == NULL, it would be better to exclude ksmd, along with other
kthreads which can have ->mm != NULL, and PTRACE_POKE or
sys_process_vm_writev() users:

	/* do nothing if this task accesses a foreign mm */
	if (p->mm != mm || (p->flags & PF_KTHREAD))
		return;
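
(Here the p->mm != mm half catches debuggers, process_vm_writev()
callers, and ksmd, whose ->mm is NULL; the PF_KTHREAD half catches
use_mm() kthreads, whose ->mm does match.)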

Please forget it; this is minor, and my main question was about the
wrong usage of TASK_DEAD.

Oleg.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2014-09-05 11:38 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-25 10:54 [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up() Kautuk Consul
2014-08-25 15:57 ` Oleg Nesterov
2014-08-26  4:45   ` Kautuk Consul
2014-08-26 15:03     ` Oleg Nesterov
2014-09-01 15:39   ` Peter Zijlstra
2014-09-01 17:58     ` Oleg Nesterov
2014-09-01 19:09       ` Peter Zijlstra
2014-09-02 15:52         ` Oleg Nesterov
2014-09-02 16:47           ` Oleg Nesterov
2014-09-02 17:39             ` Peter Zijlstra
2014-09-03 13:36               ` Oleg Nesterov
2014-09-03 14:44                 ` Peter Zijlstra
2014-09-03 15:18                   ` Oleg Nesterov
2014-09-04  7:15                     ` Peter Zijlstra
2014-09-04 17:03                       ` Paul E. McKenney
2014-09-04  5:04                   ` Ingo Molnar
2014-09-04  6:32                     ` Peter Zijlstra
2014-09-03 16:08             ` task_numa_fault() && TASK_DEAD Oleg Nesterov
2014-09-03 16:33               ` Rik van Riel
2014-09-04  7:11               ` Peter Zijlstra
2014-09-04 10:39                 ` Oleg Nesterov
2014-09-04 19:14                   ` Hugh Dickins
2014-09-05 11:35                     ` Oleg Nesterov
2014-09-03  9:04   ` [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up() Kirill Tkhai
2014-09-03  9:45     ` Peter Zijlstra
