* [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states
@ 2018-04-30 14:17 Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 14:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, oleg, will.deacon, mpe, bigeasy, gkohli, neeraju, peterz

Hi,

The first patch should fix the reported TASK_PARKED problem and the
second should fix any potential TASK_STOPPED, TASK_TRACED issues.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop
  2018-04-30 14:17 [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states Peter Zijlstra
@ 2018-04-30 14:17 ` Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 14:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, oleg, will.deacon, mpe, bigeasy, gkohli, neeraju, peterz

[-- Attachment #1: peterz-sched-smpboot-1.patch --]
[-- Type: text/plain, Size: 1859 bytes --]

Gaurav reported a problem with __kthread_parkme() where a concurrent
try_to_wake_up() could result in competing stores to ->state; when the
TASK_PARKED store got lost, bad things would happen.

The comment near set_current_state() actually mentions this competing
store, but only covers the case against TASK_RUNNING. The same store,
with different timing, can happen against a subsequent !RUNNING store.

Normally this is not a problem, because, as per that same comment, the
!RUNNING state store is inside a condition-based wait-loop:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!need_sleep)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

If we lose the (first) TASK_UNINTERRUPTIBLE store to a previous
(concurrent) wakeup, the schedule() will NO-OP and we'll go around the
loop once more.

The problem here is that the TASK_PARKED store is not inside the
KTHREAD_SHOULD_PARK condition wait-loop.

There is a genuine issue with sleeps that do not have a condition; this
is addressed in a subsequent patch.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/kthread.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -177,12 +177,13 @@ void *kthread_probe_data(struct task_str
 static void __kthread_parkme(struct kthread *self)
 {
-	__set_current_state(TASK_PARKED);
-	while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
+	for (;;) {
+		set_current_state(TASK_PARKED);
+		if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
+			break;
 		if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
 			complete(&self->parked);
 		schedule();
-		__set_current_state(TASK_PARKED);
 	}
 	clear_bit(KTHREAD_IS_PARKED, &self->flags);
 	__set_current_state(TASK_RUNNING);

^ permalink raw reply	[flat|nested] 9+ messages in thread
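[Editorial sketch] The wait-loop pattern the changelog relies on can be modeled in plain C. This is a userspace, single-threaded model — not kernel code; the state values, struct fields, and the single injected wakeup are all illustrative. It shows why the fixed loop tolerates a lost TASK_PARKED store: losing it only costs one extra trip around the loop.

```c
#include <assert.h>

enum { TASK_RUNNING, TASK_PARKED };

struct task {
	int state;
	int should_park;	/* models KTHREAD_SHOULD_PARK */
	int pending_wakeup;	/* one racing try_to_wake_up() in flight */
};

/*
 * Model of the fixed __kthread_parkme() loop: the TASK_PARKED store is
 * inside the wait-loop, so a racing wakeup that overwrites it with
 * TASK_RUNNING merely causes the (modeled) schedule() to NO-OP; the
 * next iteration re-establishes TASK_PARKED.  Returns the number of
 * iterations taken before the task really blocks, or exits the loop.
 */
static int parkme_model(struct task *t)
{
	int iters = 0;

	for (;;) {
		iters++;
		t->state = TASK_PARKED;		/* set_current_state() */
		if (t->pending_wakeup) {	/* competing ttwu() store */
			t->state = TASK_RUNNING;
			t->pending_wakeup = 0;
		}
		if (!t->should_park)
			break;
		if (t->state == TASK_PARKED)
			return iters;	/* schedule() really blocks: parked */
		/* store was lost: schedule() NO-OPs, go around once more */
	}
	t->state = TASK_RUNNING;
	return iters;
}
```

With a wakeup in flight the model blocks in TASK_PARKED on the second pass; without one, on the first.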
* [PATCH 2/2] sched: Introduce set_special_state()
  2018-04-30 14:17 [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop Peter Zijlstra
@ 2018-04-30 14:17 ` Peter Zijlstra
  2018-04-30 16:45   ` Oleg Nesterov
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 14:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, oleg, will.deacon, mpe, bigeasy, gkohli, neeraju, peterz

[-- Attachment #1: peterz-sched-smpboot-2.patch --]
[-- Type: text/plain, Size: 6347 bytes --]

Gaurav reported a perceived problem with TASK_PARKED, which turned out
to be a broken wait-loop pattern in __kthread_parkme(), but the reported
issue can (and does) in fact happen for states that do not do
condition-based sleeps.

When the 'current->state = TASK_RUNNING' store of a previous
(concurrent) try_to_wake_up() collides with the setting of a 'special'
sleep state, we can lose that sleep state.

Normal condition-based wait-loops are immune to this problem, but sleep
states that are not condition-based are subject to it.

There already is a fix for TASK_DEAD. Abstract that and also apply it
to TASK_STOPPED and TASK_TRACED, neither of which uses a
condition-based wait-loop.

Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h        |   52 ++++++++++++++++++++++++++++++++++++-----
 include/linux/sched/signal.h |    2 -
 kernel/sched/core.c          |   17 --------------
 kernel/signal.c              |    4 +--
 4 files changed, 51 insertions(+), 24 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -112,17 +112,38 @@ struct task_group;

 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP

+/*
+ * Special states are those that do not use the normal wait-loop pattern. See
+ * the comment with set_special_state().
+ */
+#define is_special_task_state(state)				\
+	((state) == TASK_DEAD ||				\
+	 (state) == TASK_STOPPED ||				\
+	 (state) == TASK_TRACED)
+
 #define __set_current_state(state_value)			\
 	do {							\
+		WARN_ON_ONCE(is_special_task_state(state_value));\
 		current->task_state_change = _THIS_IP_;		\
 		current->state = (state_value);			\
 	} while (0)
+
 #define set_current_state(state_value)				\
 	do {							\
+		WARN_ON_ONCE(is_special_task_state(state_value));\
 		current->task_state_change = _THIS_IP_;		\
 		smp_store_mb(current->state, (state_value));	\
 	} while (0)

+#define set_special_state(state_value)					\
+	do {								\
+		unsigned long flags; /* may shadow */			\
+		WARN_ON_ONCE(!is_special_task_state(state_value));	\
+		raw_spin_lock_irqsave(&current->pi_lock, flags);	\
+		current->task_state_change = _THIS_IP_;			\
+		current->state = (state_value);				\
+		raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
+	} while (0)
 #else
 /*
  * set_current_state() includes a barrier so that the write of current->state
@@ -144,8 +165,8 @@ struct task_group;
  *
  * The above is typically ordered against the wakeup, which does:
  *
- *   need_sleep = false;
- *   wake_up_state(p, TASK_UNINTERRUPTIBLE);
+ *	need_sleep = false;
+ *	wake_up_state(p, TASK_UNINTERRUPTIBLE);
  *
  * Where wake_up_state() (and all other wakeup primitives) imply enough
  * barriers to order the store of the variable against wakeup.
@@ -154,12 +175,33 @@ struct task_group;
  * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
  * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
  *
- * This is obviously fine, since they both store the exact same value.
+ * However, with slightly different timing the wakeup TASK_RUNNING store can
+ * also collide with the TASK_UNINTERRUPTIBLE store. Losing that store is not
+ * a problem either, because that will result in one extra go around the loop
+ * and our @cond test will save the day.
  *
  * Also see the comments of try_to_wake_up().
  */
-#define __set_current_state(state_value) do { current->state = (state_value); } while (0)
-#define set_current_state(state_value) smp_store_mb(current->state, (state_value))
+#define __set_current_state(state_value)				\
+	current->state = (state_value)
+
+#define set_current_state(state_value)					\
+	smp_store_mb(current->state, (state_value))
+
+/*
+ * set_special_state() should be used for those states when the blocking task
+ * cannot use the regular condition-based wait-loop. In that case we must
+ * serialize against wakeups such that any possible in-flight TASK_RUNNING
+ * stores will not collide with our state change.
+ */
+#define set_special_state(state_value)					\
+	do {								\
+		unsigned long flags; /* may shadow */			\
+		raw_spin_lock_irqsave(&current->pi_lock, flags);	\
+		current->state = (state_value);				\
+		raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
+	} while (0)
+
 #endif

 /* Task command name length: */
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -280,7 +280,7 @@ static inline void kernel_signal_stop(vo
 {
 	spin_lock_irq(&current->sighand->siglock);
 	if (current->jobctl & JOBCTL_STOP_DEQUEUED)
-		__set_current_state(TASK_STOPPED);
+		set_special_state(TASK_STOPPED);
 	spin_unlock_irq(&current->sighand->siglock);

 	schedule();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3498,23 +3498,8 @@ static void __sched notrace __schedule(b

 void __noreturn do_task_dead(void)
 {
-	/*
-	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
-	 * when the following two conditions become true.
-	 *   - There is race condition of mmap_sem (It is acquired by
-	 *     exit_mm()), and
-	 *   - SMI occurs before setting TASK_RUNINNG.
-	 *     (or hypervisor of virtual machine switches to other guest)
-	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
-	 *
-	 * To avoid it, we have to wait for releasing tsk->pi_lock which
-	 * is held by try_to_wake_up()
-	 */
-	raw_spin_lock_irq(&current->pi_lock);
-	raw_spin_unlock_irq(&current->pi_lock);
-
 	/* Causes final put_task_struct in finish_task_switch(): */
-	__set_current_state(TASK_DEAD);
+	set_special_state(TASK_DEAD);

 	/* Tell freezer to ignore us: */
 	current->flags |= PF_NOFREEZE;
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
 	 * atomic with respect to siglock and should be done after the arch
 	 * hook as siglock is released and regrabbed across it.
 	 */
-	set_current_state(TASK_TRACED);
+	set_special_state(TASK_TRACED);

 	current->last_siginfo = info;
 	current->exit_code = exit_code;
@@ -2176,7 +2176,7 @@ static bool do_signal_stop(int signr)
 		if (task_participate_group_stop(current))
 			notify = CLD_STOPPED;

-		__set_current_state(TASK_STOPPED);
+		set_special_state(TASK_STOPPED);
 		spin_unlock_irq(&current->sighand->siglock);

 		/*

^ permalink raw reply	[flat|nested] 9+ messages in thread
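[Editorial sketch] What set_special_state() buys can be modeled in userspace C. This is a sequential, heavily simplified model — not kernel code: an int "lock" stands in for raw_spin_lock_irqsave() on ->pi_lock, the state values are illustrative, and the wakeup side is reduced to its essence, namely that try_to_wake_up() re-checks ->state under ->pi_lock before storing TASK_RUNNING. Once the special store is made under that same lock, a late wakeup no longer matches and backs off instead of clobbering it.

```c
#include <assert.h>

enum { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_DEAD };

static int task_state = TASK_INTERRUPTIBLE;
static int pi_lock;	/* stand-in for ->pi_lock in this sequential model */

static void lock(void)   { assert(!pi_lock); pi_lock = 1; }
static void unlock(void) { assert(pi_lock);  pi_lock = 0; }

/*
 * Wakeup side: try_to_wake_up() takes ->pi_lock, checks that ->state
 * still matches the sleep state being woken, and only then stores
 * TASK_RUNNING.
 */
static void ttwu_model(int wake_state)
{
	lock();
	if (task_state == wake_state)
		task_state = TASK_RUNNING;
	unlock();
}

/*
 * set_special_state(): performing the store under ->pi_lock means any
 * concurrent wakeup either completes its TASK_RUNNING store before we
 * take the lock (and we overwrite it), or re-checks ->state after we
 * release the lock and backs off.  Either way the special state wins.
 */
static void set_special_state_model(int state)
{
	lock();
	task_state = state;
	unlock();
}
```

A plain unlocked store, by contrast, could be overwritten by the in-flight TASK_RUNNING store with no condition loop to recover it — exactly the changelog's scenario.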
* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
@ 2018-04-30 16:45   ` Oleg Nesterov
  2018-04-30 19:40     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2018-04-30 16:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On 04/30, Peter Zijlstra wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
>  	 * atomic with respect to siglock and should be done after the arch
>  	 * hook as siglock is released and regrabbed across it.
>  	 */
> -	set_current_state(TASK_TRACED);
> +	set_special_state(TASK_TRACED);

Yes, but please note the comment above: we need a barrier after
state = TASK_TRACED, which is why ptrace_stop() does set_current_state(),
not __set_current_state().

Otherwise both patches look good to me, feel free to add

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-04-30 16:45   ` Oleg Nesterov
@ 2018-04-30 19:40     ` Peter Zijlstra
  2018-05-01  9:50       ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 19:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On Mon, Apr 30, 2018 at 06:45:46PM +0200, Oleg Nesterov wrote:
> On 04/30, Peter Zijlstra wrote:
> >
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
> >  	 * atomic with respect to siglock and should be done after the arch
> >  	 * hook as siglock is released and regrabbed across it.
> >  	 */
> > -	set_current_state(TASK_TRACED);
> > +	set_special_state(TASK_TRACED);
>
> Yes, but please note the comment above: we need a barrier after
> state = TASK_TRACED, which is why ptrace_stop() does set_current_state(),
> not __set_current_state().

OK, so I got properly lost in that stuff.

The comment says we need to set TASK_TRACED before clearing
JOBCTL_TRAPPING because of do_wait(), but I cannot seem to locate code
in do_wait() and below that cares about JOBCTL_TRAPPING.

Could you elucidate my tired brain?

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-04-30 19:40     ` Peter Zijlstra
@ 2018-05-01  9:50       ` Peter Zijlstra
  2018-05-01 13:59         ` Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-05-01 9:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On Mon, Apr 30, 2018 at 09:40:24PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 30, 2018 at 06:45:46PM +0200, Oleg Nesterov wrote:
> > On 04/30, Peter Zijlstra wrote:
> > >
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
> > >  	 * atomic with respect to siglock and should be done after the arch
> > >  	 * hook as siglock is released and regrabbed across it.
> > >  	 */
> > > -	set_current_state(TASK_TRACED);
> > > +	set_special_state(TASK_TRACED);
> >
> > Yes, but please note the comment above: we need a barrier after
> > state = TASK_TRACED, which is why ptrace_stop() does set_current_state(),
> > not __set_current_state().
>
> OK, so I got properly lost in that stuff.
>
> The comment says we need to set TASK_TRACED before clearing
> JOBCTL_TRAPPING because of do_wait(), but I cannot seem to locate code
> in do_wait() and below that cares about JOBCTL_TRAPPING.
>
> Could you elucidate my tired brain?

The only code I found that seems to care is ptrace_attach(), where we
wait for JOBCTL_TRAPPING to get cleared. That same function has a
comment about hiding the STOPPED -> RUNNING -> TRACED transition. So
I'm assuming it needs to observe TRACED if it observes !TRAPPING.

But I don't think there are enough barriers on that end to guarantee
this. Any ->state load after wait_on_bit() is, afaict, free to have
happened before the ->jobctl load.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-05-01  9:50       ` Peter Zijlstra
@ 2018-05-01 13:59         ` Oleg Nesterov
  2018-05-01 14:22           ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2018-05-01 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On 05/01, Peter Zijlstra wrote:
>
> The only code I found that seems to care is ptrace_attach(), where we
> wait for JOBCTL_TRAPPING to get cleared. That same function has a
> comment about hiding the STOPPED -> RUNNING -> TRACED transition. So
> I'm assuming it needs to observe TRACED if it observes !TRAPPING.

Yes, exactly.

> But I don't think there are enough barriers on that end to guarantee
> this. Any ->state load after wait_on_bit() is, afaict, free to have
> happened before the ->jobctl load.

do_wait() does set_current_state() before it checks ->state or anything else.

Oleg.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-05-01 13:59         ` Oleg Nesterov
@ 2018-05-01 14:22           ` Peter Zijlstra
  2018-05-01 15:10             ` Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-05-01 14:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On Tue, May 01, 2018 at 03:59:24PM +0200, Oleg Nesterov wrote:
> On 05/01, Peter Zijlstra wrote:
> >
> > The only code I found that seems to care is ptrace_attach(), where we
> > wait for JOBCTL_TRAPPING to get cleared. That same function has a
> > comment about hiding the STOPPED -> RUNNING -> TRACED transition. So
> > I'm assuming it needs to observe TRACED if it observes !TRAPPING.
>
> Yes, exactly.
>
> > But I don't think there are enough barriers on that end to guarantee
> > this. Any ->state load after wait_on_bit() is, afaict, free to have
> > happened before the ->jobctl load.
>
> do_wait() does set_current_state() before it checks ->state or anything else.

But how are ptrace_attach() and do_wait() related? I guess I'm missing
something fairly fundamental here. Is it userspace doing these two
system calls back-to-back or something? Is that not a little bit
fragile, to rely on a barrier in do_wait() for this?

Anyway, does the below look ok?

---
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1961,14 +1961,26 @@ static void ptrace_stop(int exit_code, i
 		return;
 	}

+	set_special_state(TASK_TRACED);
+
 	/*
 	 * We're committing to trapping.  TRACED should be visible before
 	 * TRAPPING is cleared; otherwise, the tracer might fail do_wait().
 	 * Also, transition to TRACED and updates to ->jobctl should be
 	 * atomic with respect to siglock and should be done after the arch
 	 * hook as siglock is released and regrabbed across it.
+	 *
+	 *     TRACER				TRACEE
+	 *     ptrace_attach()
+	 * [L]   wait_on_bit(JOBCTL_TRAPPING)	[S] set_special_state(TRACED)
+	 *     do_wait()
+	 *       set_current_state()		    smp_wmb();
+	 *       ptrace_do_wait()
+	 *         wait_task_stopped()
+	 *           task_stopped_code()
+	 * [L]       task_is_traced()		[S] task_clear_jobctl_trapping();
 	 */
-	set_special_state(TASK_TRACED);
+	smp_wmb();

 	current->last_siginfo = info;
 	current->exit_code = exit_code;

^ permalink raw reply	[flat|nested] 9+ messages in thread
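[Editorial sketch] The [S]/[L] pairing in the diagram is the classic message-passing pattern, and it can be sketched in userspace C11 atomics. This is an analogue, not the kernel code: a release store on the "flag" plays the role of the smp_wmb() ahead of the JOBCTL_TRAPPING clear, and an acquire load plays the role of the read-side ordering that do_wait()'s set_current_state() provides. The names and values below are illustrative.

```c
#include <stdatomic.h>

#define TASK_TRACED 8	/* illustrative value, not the kernel's */

static int task_state;			/* plain ->state store          */
static atomic_int jobctl_trapping = 1;	/* models JOBCTL_TRAPPING       */

/* Tracee side: [S] store TRACED, then barrier, then clear TRAPPING. */
static void tracee_stop(void)
{
	task_state = TASK_TRACED;
	/* release ordering: the state store is visible before the clear */
	atomic_store_explicit(&jobctl_trapping, 0, memory_order_release);
}

/*
 * Tracer side: [L] wait for !TRAPPING with acquire ordering, after
 * which the TASK_TRACED store is guaranteed visible; the ->state load
 * cannot be hoisted above the ->jobctl load.
 */
static int tracer_observe(void)
{
	while (atomic_load_explicit(&jobctl_trapping, memory_order_acquire))
		;	/* models wait_on_bit(JOBCTL_TRAPPING) */
	return task_state;
}
```

The test below runs the two sides back-to-back in one thread; the guarantee of interest is, of course, the concurrent case, where release/acquire (like the kernel's barrier pairing) forbids observing !TRAPPING without TRACED.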
* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-05-01 14:22           ` Peter Zijlstra
@ 2018-05-01 15:10             ` Oleg Nesterov
  0 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2018-05-01 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On 05/01, Peter Zijlstra wrote:
>
> On Tue, May 01, 2018 at 03:59:24PM +0200, Oleg Nesterov wrote:
> > On 05/01, Peter Zijlstra wrote:
> > >
> > > The only code I found that seems to care is ptrace_attach(), where we
> > > wait for JOBCTL_TRAPPING to get cleared. That same function has a
> > > comment about hiding the STOPPED -> RUNNING -> TRACED transition. So
> > > I'm assuming it needs to observe TRACED if it observes !TRAPPING.
> >
> > Yes, exactly.
> >
> > > But I don't think there are enough barriers on that end to guarantee
> > > this. Any ->state load after wait_on_bit() is, afaict, free to have
> > > happened before the ->jobctl load.
> >
> > do_wait() does set_current_state() before it checks ->state or anything else.
>
> But how are ptrace_attach() and do_wait() related?

Yes.

> I guess I'm missing something fairly fundamental here.

You are missing the fact that the ptrace API is very old and ugly ;)

Just one example. If the debugger knows that the task is STOPPED, then
it has every right to do, say,

	ptrace(PTRACE_ATTACH, pid);
	BUG_ON(pid != waitpid(pid, NULL, WNOHANG));

Or even do another ptrace() request right after ptrace(PTRACE_ATTACH)
returns, without do_wait(). And unless my memory fools me, gdb even has
some test-cases for this... Not sure, but it certainly looks at the
tracee's state in /proc before it does PTRACE_ATTACH, because if the
tracee was already STOPPED then gdb won't get any notification from it.

> Anyway, does the below look ok?

Yes, thanks.

Oleg.

^ permalink raw reply	[flat|nested] 9+ messages in thread
Thread overview: 9+ messages
  2018-04-30 14:17 [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
  2018-04-30 16:45   ` Oleg Nesterov
  2018-04-30 19:40     ` Peter Zijlstra
  2018-05-01  9:50       ` Peter Zijlstra
  2018-05-01 13:59         ` Oleg Nesterov
  2018-05-01 14:22           ` Peter Zijlstra
  2018-05-01 15:10             ` Oleg Nesterov