In-Reply-To: <20140825155738.GA5944@redhat.com>
References: <1408964064-21447-1-git-send-email-consul.kautuk@gmail.com> <20140825155738.GA5944@redhat.com>
Date: Tue, 26 Aug 2014 10:15:28 +0530
Subject: Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
From: Kautuk Consul
To: Oleg Nesterov
Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Michal Hocko, David Rientjes,
 Ionut Alexa, Guillaume Morin, linux-kernel@vger.kernel.org, Kirill Tkhai

Sorry folks, I got one thing wrong:

After some more code review, both __down_common() and
do_wait_for_common() inspect signal_pending() only while in
TASK_RUNNING. So I don't think this could have happened on my system
via __down_common() and/or wait_for_common().

That leaves only the possibility that the BUG() on my embedded system
was caused by a driver trying to implement its own sleeping primitive
by first setting the task state to TASK_INTERRUPTIBLE and then
checking signal_pending()/signal_pending_state(). (That pattern is
generally frowned upon anyway and not really acceptable nowadays.)

I'll review those drivers for this kind of mistake.
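To make the suspect driver pattern concrete, here is a minimal sketch
(hypothetical code, not from any real driver; my_device, event_ready
and the function names are made up). The broken variant returns from
the signal_pending() branch while still in TASK_INTERRUPTIBLE; the
safer variant checks signals only while TASK_RUNNING, the same
ordering __down_common() uses:

#include <linux/errno.h>
#include <linux/sched.h>

struct my_device {
	bool event_ready;		/* set by the wakeup side */
};

/* BROKEN: signal_pending() is checked while TASK_INTERRUPTIBLE and we
 * return without restoring TASK_RUNNING, which is exactly the window
 * for the do_exit() vs. try_to_wake_up() race described below.
 */
static int broken_wait_for_event(struct my_device *dev)
{
	set_current_state(TASK_INTERRUPTIBLE);
	while (!dev->event_ready) {
		if (signal_pending(current))
			return -ERESTARTSYS;	/* bug: still TASK_INTERRUPTIBLE */
		schedule();
		set_current_state(TASK_INTERRUPTIBLE);
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}

/* SAFER: signals are inspected only while TASK_RUNNING, as in
 * __down_common(); the barrier in set_current_state() makes the
 * re-check of the condition before schedule() safe.
 */
static int safer_wait_for_event(struct my_device *dev)
{
	for (;;) {
		if (dev->event_ready)
			return 0;
		if (signal_pending(current))	/* checked while TASK_RUNNING */
			return -ERESTARTSYS;
		set_current_state(TASK_INTERRUPTIBLE);
		if (!dev->event_ready)
			schedule();
		__set_current_state(TASK_RUNNING);
	}
}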
On Mon, Aug 25, 2014 at 9:27 PM, Oleg Nesterov wrote:
> On 08/25, Kautuk Consul wrote:
>>
>> I encountered a BUG() scenario within do_exit() on an ARM system.
>>
>> The problem is a race between do_exit() and try_to_wake_up() on
>> different CPUs when sleeping primitives such as __down_common() and
>> wait_for_common() are used.
>>
>> Race Scenario
>> =============
>>
>> Let us assume there are two CPUs, A and B, which execute code in the
>> following order:
>> 1) CPU A is running in user mode and enters kernel mode via some
>>    syscall/exception handler.
>> 2) CPU A sets the current task's (t's) state to TASK_INTERRUPTIBLE
>>    via __down_common() or wait_for_common().
>> 3) CPU A checks signal_pending() and returns because TIF_SIGPENDING
>>    is set in t's thread_info, a signal (say SIGKILL) having
>>    previously been delivered to t.
>> 4) CPU A returns to the assembly trap handler and calls
>>    do_work_pending() -> do_signal() -> get_signal() -> do_group_exit()
>>    -> do_exit().
>>    CPU A has not yet executed the following line of code before the
>>    final call to schedule():
>>        /* causes final put_task_struct in finish_task_switch(). */
>>        tsk->state = TASK_DEAD;
>> 5) CPU B tries to send a signal to task t (currently executing on
>>    CPU A) and thus enters: signal_wake_up_state() -> wake_up_state()
>>    -> try_to_wake_up().
>> 6) CPU B executes all code in try_to_wake_up() up to the call chain
>>    ttwu_queue() -> ttwu_do_activate() -> ttwu_do_wakeup().
>>    CPU B has still not executed the following code in ttwu_do_wakeup():
>>        p->state = TASK_RUNNING;
>> 7) CPU A executes the following line of code:
>>        /* causes final put_task_struct in finish_task_switch(). */
>>        tsk->state = TASK_DEAD;
>> 8) CPU B executes the following code in ttwu_do_wakeup():
>>        p->state = TASK_RUNNING;
>> 9) CPU A continues to the call to schedule() in do_exit().
>>    Since tsk->state is TASK_RUNNING, the call to schedule() returns
>>    and do_exit() -> BUG() is hit on CPU A.
>>
>> Alternate Solution
>> ==================
>>
>> An alternate solution would be to simply set the current task state
>> to TASK_RUNNING in __down_common(), wait_for_common() and all other
>> interruptible sleeping primitives, in their
>> if (signal_pending()/signal_pending_state()) conditional blocks.
>>
>> But this change seems more logical to me because:
>> i)  It involves fewer changes to core kernel code.
>> ii) Any sleeping primitives added to the kernel in the future will
>>     also not suffer from this kind of race.
>>
>> Signed-off-by: Kautuk Consul
>> ---
>>  kernel/exit.c | 10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 32c58f7..69a8231 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -824,14 +824,16 @@ void do_exit(long code)
>>  	 * (or hypervisor of virtual machine switches to other guest)
>>  	 * As a result, we may become TASK_RUNNING after becoming TASK_DEAD
>>  	 *
>> -	 * To avoid it, we have to wait for releasing tsk->pi_lock which
>> -	 * is held by try_to_wake_up()
>> +	 * To solve this, we have to compete for tsk->pi_lock which is held by
>> +	 * try_to_wake_up().
>>  	 */
>> -	smp_mb();
>> -	raw_spin_unlock_wait(&tsk->pi_lock);
>> +	raw_spin_lock(&tsk->pi_lock);
>>
>>  	/* causes final put_task_struct in finish_task_switch(). */
>>  	tsk->state = TASK_DEAD;
>> +
>> +	raw_spin_unlock(&tsk->pi_lock);
>> +
>>  	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
>>  	schedule();
>>  	BUG();
>> --
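For context on why tsk->pi_lock is the lock to compete for: below is a
simplified skeleton of try_to_wake_up() as it looked in that era's
scheduler (reconstructed from memory, with the on_rq/on_cpu handling
and statistics elided; not a verbatim quote). The waker's p->state
check happens under p->pi_lock:

/* Simplified sketch of try_to_wake_up(); many details elided. */
static int try_to_wake_up(struct task_struct *p, unsigned int state,
			  int wake_flags)
{
	unsigned long flags;
	int cpu, success = 0;

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	if (!(p->state & state))	/* TASK_DEAD never matches TASK_NORMAL */
		goto out;

	success = 1;			/* we are going to change ->state */
	cpu = task_cpu(p);

	/* ... on_rq / on_cpu handling elided ... */

	ttwu_queue(p, cpu);	/* leads to ttwu_do_wakeup(): p->state = TASK_RUNNING */
out:
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
	return success;
}

If I read the patch's intent correctly, a waker that acquires pi_lock
only after do_exit() has set TASK_DEAD now fails the p->state check and
bails out, whereas the old smp_mb() + raw_spin_unlock_wait() could only
wait out a waker that was already inside the locked region.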
> Peter, do you remember another problem with TASK_DEAD we discussed recently?
> (prev_state == TASK_DEAD detection in finish_task_switch() still looks racy).
>
> I am starting to think that perhaps we need something like below, what do
> you all think?
>
> Oleg.
>
> --- x/kernel/sched/core.c
> +++ x/kernel/sched/core.c
> @@ -2205,9 +2205,10 @@ static void finish_task_switch(struct rq
>  	__releases(rq->lock)
>  {
>  	struct mm_struct *mm = rq->prev_mm;
> -	long prev_state;
> +	struct task_struct *dead = rq->dead;
>
>  	rq->prev_mm = NULL;
> +	rq->dead = NULL;
>
>  	/*
>  	 * A task struct has one reference for the use as "current".
> @@ -2220,7 +2221,6 @@ static void finish_task_switch(struct rq
>  	 * be dropped twice.
>  	 *		Manfred Spraul
>  	 */
> -	prev_state = prev->state;
>  	vtime_task_switch(prev);
>  	finish_arch_switch(prev);
>  	perf_event_task_sched_in(prev, current);
> @@ -2230,16 +2230,16 @@ static void finish_task_switch(struct rq
>  	fire_sched_in_preempt_notifiers(current);
>  	if (mm)
>  		mmdrop(mm);
> -	if (unlikely(prev_state == TASK_DEAD)) {
> -		if (prev->sched_class->task_dead)
> -			prev->sched_class->task_dead(prev);
> +	if (unlikely(dead)) {
> +		if (dead->sched_class->task_dead)
> +			dead->sched_class->task_dead(dead);
>
>  		/*
>  		 * Remove function-return probe instances associated with this
>  		 * task and put them back on the free list.
>  		 */
> -		kprobe_flush_task(prev);
> -		put_task_struct(prev);
> +		kprobe_flush_task(dead);
> +		put_task_struct(dead);
>  	}
>
>  	tick_nohz_task_switch(current);
> @@ -2770,11 +2770,15 @@ need_resched:
>  	smp_mb__before_spinlock();
>  	raw_spin_lock_irq(&rq->lock);
>
> +	if (unlikely(rq->dead))
> +		goto deactivate;
> +
>  	switch_count = &prev->nivcsw;
>  	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>  		if (unlikely(signal_pending_state(prev->state, prev))) {
>  			prev->state = TASK_RUNNING;
>  		} else {
> +deactivate:
>  			deactivate_task(rq, prev, DEQUEUE_SLEEP);
>  			prev->on_rq = 0;
>
> @@ -2826,6 +2830,15 @@ need_resched:
>  	goto need_resched;
>  }
>
> +// called under preempt_disable();
> +void exit_schedule()
> +{
> +	// TODO: kill TASK_DEAD, this is only for proc
> +	current->state = TASK_DEAD;
> +	task_rq(current)->dead = current;
> +	__schedule();
> +}
> +
>  static inline void sched_submit_work(struct task_struct *tsk)
>  {
>  	if (!tsk->state || tsk_is_pi_blocked(tsk))
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -815,25 +815,8 @@ void do_exit(long code)
>  		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
>  	exit_rcu();
>
> -	/*
> -	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> -	 * when the following two conditions become true.
> -	 *   - There is race condition of mmap_sem (It is acquired by
> -	 *     exit_mm()), and
> -	 *   - SMI occurs before setting TASK_RUNINNG.
> -	 *     (or hypervisor of virtual machine switches to other guest)
> -	 *   As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> -	 *
> -	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> -	 * is held by try_to_wake_up()
> -	 */
> -	smp_mb();
> -	raw_spin_unlock_wait(&tsk->pi_lock);
> -
> -	/* causes final put_task_struct in finish_task_switch(). */
> -	tsk->state = TASK_DEAD;
>  	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
> -	schedule();
> +	exit_schedule();
>  	BUG();
>  	/* Avoid "noreturn function does return". */
>  	for (;;)
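One note on Oleg's sketch: it reads and writes a new rq->dead field,
which the diff above does not itself declare. Presumably struct rq in
kernel/sched/sched.h would also need something like the following (my
guess at a plausible placement next to prev_mm, not part of the quoted
patch):

struct rq {
	/* ... existing fields ... */
	struct mm_struct *prev_mm;

	/*
	 * Exiting task to be reaped in finish_task_switch(); set by
	 * exit_schedule() with preemption disabled, consumed and
	 * cleared by the next finish_task_switch() on this runqueue.
	 */
	struct task_struct *dead;

	/* ... existing fields ... */
};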