linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states
@ 2018-04-30 14:17 Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 14:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, oleg, will.deacon, mpe, bigeasy, gkohli, neeraju, peterz

Hi,

The first patch should fix the reported TASK_PARKED problem and the second
should fix any potential TASK_STOPPED, TASK_TRACED issues.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop
  2018-04-30 14:17 [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states Peter Zijlstra
@ 2018-04-30 14:17 ` Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 14:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, oleg, will.deacon, mpe, bigeasy, gkohli, neeraju, peterz

[-- Attachment #1: peterz-sched-smpboot-1.patch --]
[-- Type: text/plain, Size: 1859 bytes --]

Gaurav reported a problem with __kthread_parkme() where a concurrent
try_to_wake_up() could result in competing stores to ->state; if the
TASK_PARKED store is the one that gets lost, bad things happen.

The comment near set_current_state() actually mentions this competing
store, but only covers the case against TASK_RUNNING. The same store,
with different timing, can also happen against a subsequent !RUNNING
store.

This normally is not a problem because, as per that same comment, the
!RUNNING state store is inside a condition-based wait-loop:

  for (;;) {
    set_current_state(TASK_UNINTERRUPTIBLE);
    if (!need_sleep)
      break;
    schedule();
  }
  __set_current_state(TASK_RUNNING);

If we lose the (first) TASK_UNINTERRUPTIBLE store to a previous
(concurrent) wakeup, the schedule() will NO-OP and we'll go around the
loop once more.

The problem here is that the TASK_PARKED store is not inside the
KTHREAD_SHOULD_PARK condition wait-loop.

There is a genuine issue with sleeps that do not have a condition;
this is addressed in a subsequent patch.

Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/kthread.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -177,12 +177,13 @@ void *kthread_probe_data(struct task_str
 
 static void __kthread_parkme(struct kthread *self)
 {
-	__set_current_state(TASK_PARKED);
-	while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
+	for (;;) {
+		set_current_state(TASK_PARKED);
+		if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
+			break;
 		if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
 			complete(&self->parked);
 		schedule();
-		__set_current_state(TASK_PARKED);
 	}
 	clear_bit(KTHREAD_IS_PARKED, &self->flags);
 	__set_current_state(TASK_RUNNING);

* [PATCH 2/2] sched: Introduce set_special_state()
  2018-04-30 14:17 [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states Peter Zijlstra
  2018-04-30 14:17 ` [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop Peter Zijlstra
@ 2018-04-30 14:17 ` Peter Zijlstra
  2018-04-30 16:45   ` Oleg Nesterov
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 14:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, oleg, will.deacon, mpe, bigeasy, gkohli, neeraju, peterz

[-- Attachment #1: peterz-sched-smpboot-2.patch --]
[-- Type: text/plain, Size: 6347 bytes --]

Gaurav reported a perceived problem with TASK_PARKED, which turned out
to be a broken wait-loop pattern in __kthread_parkme(), but the
reported issue can (and does) in fact happen for states that do not use
condition-based sleeps.

When the 'current->state = TASK_RUNNING' store of a previous
(concurrent) try_to_wake_up() collides with the setting of a 'special'
sleep state, we can lose the sleep state.

Normal condition-based wait-loops are immune to this problem, but sleep
states that are not condition based are subject to it.

There already is a fix for TASK_DEAD. Abstract that and also apply it to
TASK_STOPPED and TASK_TRACED, both of which also lack a condition-based
wait-loop.

Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h        |   52 ++++++++++++++++++++++++++++++++++++++-----
 include/linux/sched/signal.h |    2 -
 kernel/sched/core.c          |   17 --------------
 kernel/signal.c              |    4 +--
 4 files changed, 51 insertions(+), 24 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -112,17 +112,38 @@ struct task_group;
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 
+/*
+ * Special states are those that do not use the normal wait-loop pattern. See
+ * the comment with set_special_state().
+ */
+#define is_special_task_state(state)	\
+	((state) == TASK_DEAD    ||	\
+	 (state) == TASK_STOPPED ||	\
+	 (state) == TASK_TRACED)
+
 #define __set_current_state(state_value)			\
 	do {							\
+		WARN_ON_ONCE(is_special_task_state(state_value));	\
 		current->task_state_change = _THIS_IP_;		\
 		current->state = (state_value);			\
 	} while (0)
+
 #define set_current_state(state_value)				\
 	do {							\
+		WARN_ON_ONCE(is_special_task_state(state_value));	\
 		current->task_state_change = _THIS_IP_;		\
 		smp_store_mb(current->state, (state_value));	\
 	} while (0)
 
+#define set_special_state(state_value)					\
+	do {								\
+		unsigned long flags; /* may shadow */			\
+		WARN_ON_ONCE(!is_special_task_state(state_value));	\
+		raw_spin_lock_irqsave(&current->pi_lock, flags);	\
+		current->task_state_change = _THIS_IP_;			\
+		current->state = (state_value);				\
+		raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
+	} while (0)
 #else
 /*
  * set_current_state() includes a barrier so that the write of current->state
@@ -144,8 +165,8 @@ struct task_group;
  *
  * The above is typically ordered against the wakeup, which does:
  *
- *	need_sleep = false;
- *	wake_up_state(p, TASK_UNINTERRUPTIBLE);
+ *   need_sleep = false;
+ *   wake_up_state(p, TASK_UNINTERRUPTIBLE);
  *
  * Where wake_up_state() (and all other wakeup primitives) imply enough
  * barriers to order the store of the variable against wakeup.
@@ -154,12 +175,33 @@ struct task_group;
  * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
  * TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
  *
- * This is obviously fine, since they both store the exact same value.
+ * However, with slightly different timing the wakeup TASK_RUNNING store can
+ * also collide with the TASK_UNINTERRUPTIBLE store. Losing that store is not
+ * a problem either because that will result in one extra go around the loop
+ * and our @cond test will save the day.
  *
  * Also see the comments of try_to_wake_up().
  */
-#define __set_current_state(state_value) do { current->state = (state_value); } while (0)
-#define set_current_state(state_value)	 smp_store_mb(current->state, (state_value))
+#define __set_current_state(state_value)				\
+	current->state = (state_value)
+
+#define set_current_state(state_value)					\
+	smp_store_mb(current->state, (state_value))
+
+/*
+ * set_special_state() should be used for those states when the blocking task
+ * can not use the regular condition based wait-loop. In that case we must
+ * serialize against wakeups such that any possible in-flight TASK_RUNNING stores
+ * will not collide with our state change.
+ */
+#define set_special_state(state_value)					\
+	do {								\
+		unsigned long flags; /* may shadow */			\
+		raw_spin_lock_irqsave(&current->pi_lock, flags);	\
+		current->state = (state_value);				\
+		raw_spin_unlock_irqrestore(&current->pi_lock, flags);	\
+	} while (0)
+
 #endif
 
 /* Task command name length: */
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -280,7 +280,7 @@ static inline void kernel_signal_stop(vo
 {
 	spin_lock_irq(&current->sighand->siglock);
 	if (current->jobctl & JOBCTL_STOP_DEQUEUED)
-		__set_current_state(TASK_STOPPED);
+		set_special_state(TASK_STOPPED);
 	spin_unlock_irq(&current->sighand->siglock);
 
 	schedule();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3498,23 +3498,8 @@ static void __sched notrace __schedule(b
 
 void __noreturn do_task_dead(void)
 {
-	/*
-	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
-	 * when the following two conditions become true.
-	 *   - There is race condition of mmap_sem (It is acquired by
-	 *     exit_mm()), and
-	 *   - SMI occurs before setting TASK_RUNINNG.
-	 *     (or hypervisor of virtual machine switches to other guest)
-	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
-	 *
-	 * To avoid it, we have to wait for releasing tsk->pi_lock which
-	 * is held by try_to_wake_up()
-	 */
-	raw_spin_lock_irq(&current->pi_lock);
-	raw_spin_unlock_irq(&current->pi_lock);
-
 	/* Causes final put_task_struct in finish_task_switch(): */
-	__set_current_state(TASK_DEAD);
+	set_special_state(TASK_DEAD);
 
 	/* Tell freezer to ignore us: */
 	current->flags |= PF_NOFREEZE;
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
 	 * atomic with respect to siglock and should be done after the arch
 	 * hook as siglock is released and regrabbed across it.
 	 */
-	set_current_state(TASK_TRACED);
+	set_special_state(TASK_TRACED);
 
 	current->last_siginfo = info;
 	current->exit_code = exit_code;
@@ -2176,7 +2176,7 @@ static bool do_signal_stop(int signr)
 		if (task_participate_group_stop(current))
 			notify = CLD_STOPPED;
 
-		__set_current_state(TASK_STOPPED);
+		set_special_state(TASK_STOPPED);
 		spin_unlock_irq(&current->sighand->siglock);
 
 		/*

* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
@ 2018-04-30 16:45   ` Oleg Nesterov
  2018-04-30 19:40     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2018-04-30 16:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On 04/30, Peter Zijlstra wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
>  	 * atomic with respect to siglock and should be done after the arch
>  	 * hook as siglock is released and regrabbed across it.
>  	 */
> -	set_current_state(TASK_TRACED);
> +	set_special_state(TASK_TRACED);

Yes, but please note the comment above: we need a barrier after state = TASK_TRACED;
that is why ptrace_stop() does set_current_state(), not __set_current_state().

Otherwise both patches look good to me, feel free to add

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-04-30 16:45   ` Oleg Nesterov
@ 2018-04-30 19:40     ` Peter Zijlstra
  2018-05-01  9:50       ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-04-30 19:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On Mon, Apr 30, 2018 at 06:45:46PM +0200, Oleg Nesterov wrote:
> On 04/30, Peter Zijlstra wrote:
> >
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
> >  	 * atomic with respect to siglock and should be done after the arch
> >  	 * hook as siglock is released and regrabbed across it.
> >  	 */
> > -	set_current_state(TASK_TRACED);
> > +	set_special_state(TASK_TRACED);
> 
> Yes, but please note the comment above: we need a barrier after state = TASK_TRACED;
> that is why ptrace_stop() does set_current_state(), not __set_current_state().

OK, so I got properly lost in that stuff.

The comment says that we need to set TASK_TRACED before clearing
JOBCTL_TRAPPING because of do_wait(), but I cannot seem to locate code
in do_wait() and below that cares about JOBCTL_TRAPPING.

Could you elucidate my tired brain?

* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-04-30 19:40     ` Peter Zijlstra
@ 2018-05-01  9:50       ` Peter Zijlstra
  2018-05-01 13:59         ` Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-05-01  9:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On Mon, Apr 30, 2018 at 09:40:24PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 30, 2018 at 06:45:46PM +0200, Oleg Nesterov wrote:
> > On 04/30, Peter Zijlstra wrote:
> > >
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -1968,7 +1968,7 @@ static void ptrace_stop(int exit_code, i
> > >  	 * atomic with respect to siglock and should be done after the arch
> > >  	 * hook as siglock is released and regrabbed across it.
> > >  	 */
> > > -	set_current_state(TASK_TRACED);
> > > +	set_special_state(TASK_TRACED);
> > 
> > Yes, but please note the comment above: we need a barrier after state = TASK_TRACED;
> > that is why ptrace_stop() does set_current_state(), not __set_current_state().
> 
> OK, so I got properly lost in that stuff.
> 
> The comment says that we need to set TASK_TRACED before clearing
> JOBCTL_TRAPPING because of do_wait(), but I cannot seem to locate code
> in do_wait() and below that cares about JOBCTL_TRAPPING.
> 
> Could you elucidate my tired brain?

The only code I found that seems to care is ptrace_attach(), where we
wait for JOBCTL_TRAPPING to get cleared. That same function has a
comment about hiding the STOPPED -> RUNNING -> TRACED transition. So I'm
assuming it needs to observe TRACED if it observes !TRAPPING.

But I don't think there's enough barriers on that end to guarantee this.
Any ->state load after wait_on_bit() is, afaict, free to have happened
before the ->jobctl load.

* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-05-01  9:50       ` Peter Zijlstra
@ 2018-05-01 13:59         ` Oleg Nesterov
  2018-05-01 14:22           ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2018-05-01 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On 05/01, Peter Zijlstra wrote:
>
> The only code I found that seems to care is ptrace_attach(), where we
> wait for JOBCTL_TRAPPING to get cleared. That same function has a
> comment about hiding the STOPPED -> RUNNING -> TRACED transition. So I'm
> assuming it needs to observe TRACED if it observes !TRAPPING.

Yes, exactly.

> But I don't think there's enough barriers on that end to guarantee this.
> Any ->state load after wait_on_bit() is, afaict, free to have happened
> before the ->jobctl load.

do_wait() does set_current_state() before it checks ->state or anything else.

Oleg.

* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-05-01 13:59         ` Oleg Nesterov
@ 2018-05-01 14:22           ` Peter Zijlstra
  2018-05-01 15:10             ` Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-05-01 14:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On Tue, May 01, 2018 at 03:59:24PM +0200, Oleg Nesterov wrote:
> On 05/01, Peter Zijlstra wrote:
> >
> > The only code I found that seems to care is ptrace_attach(), where we
> > wait for JOBCTL_TRAPPING to get cleared. That same function has a
> > comment about hiding the STOPPED -> RUNNING -> TRACED transition. So I'm
> > assuming it needs to observe TRACED if it observes !TRAPPING.
> 
> Yes, exactly.
> 
> > But I don't think there's enough barriers on that end to guarantee this.
> > Any ->state load after wait_on_bit() is, afaict, free to have happened
> > before the ->jobctl load.
> 
> do_wait() does set_current_state() before it checks ->state or anything else.

But how are ptrace_attach() and do_wait() related? I guess I'm missing
something fairly fundamental here. Is it userspace doing these two
system calls back-to-back or something?

Is that not a little bit fragile, to rely on a barrier in do_wait()
for this?

Anyway, does the below look ok?

---
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1961,14 +1961,26 @@ static void ptrace_stop(int exit_code, i
 			return;
 	}
 
+	set_special_state(TASK_TRACED);
+
 	/*
 	 * We're committing to trapping.  TRACED should be visible before
 	 * TRAPPING is cleared; otherwise, the tracer might fail do_wait().
 	 * Also, transition to TRACED and updates to ->jobctl should be
 	 * atomic with respect to siglock and should be done after the arch
 	 * hook as siglock is released and regrabbed across it.
+	 *
+	 *     TRACER				    TRACEE
+	 *     ptrace_attach()
+	 * [L]   wait_on_bit(JOBCTL_TRAPPING)	[S] set_special_state(TRACED)
+	 *     do_wait()
+	 *       set_current_state()                smp_wmb();
+	 *       ptrace_do_wait()
+	 *         wait_task_stopped()
+	 *           task_stopped_code()
+	 * [L]         task_is_traced()		[S] task_clear_jobctl_trapping();
 	 */
-	set_special_state(TASK_TRACED);
+	smp_wmb();
 
 	current->last_siginfo = info;
 	current->exit_code = exit_code;

* Re: [PATCH 2/2] sched: Introduce set_special_state()
  2018-05-01 14:22           ` Peter Zijlstra
@ 2018-05-01 15:10             ` Oleg Nesterov
  0 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2018-05-01 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, linux-kernel, will.deacon, mpe, bigeasy, gkohli, neeraju

On 05/01, Peter Zijlstra wrote:
>
> On Tue, May 01, 2018 at 03:59:24PM +0200, Oleg Nesterov wrote:
> > On 05/01, Peter Zijlstra wrote:
> > >
> > > The only code I found that seems to care is ptrace_attach(), where we
> > > wait for JOBCTL_TRAPPING to get cleared. That same function has a
> > > comment about hiding the STOPPED -> RUNNING -> TRACED transition. So I'm
> > > assuming it needs to observe TRACED if it observes !TRAPPING.
> >
> > Yes, exactly.
> >
> > > But I don't think there's enough barriers on that end to guarantee this.
> > > Any ->state load after wait_on_bit() is, afaict, free to have happened
> > > before the ->jobctl load.
> >
> > do_wait() does set_current_state() before it checks ->state or anything else.
>
> But how are ptrace_attach() and do_wait() related?

Yes.

> I guess I'm missing
> something fairly fundamental here.

You are missing the fact that the ptrace API is very old and ugly ;)

Just one example. If the debugger knows that the task is STOPPED then it has
all rights to do, say,

	ptrace(PTRACE_ATTACH, pid);
	BUG_ON(pid != waitpid(pid, WNOHANG));

Or even do another ptrace() request right after ptrace(PTRACE_ATTACH) returns,
without do_wait().

And unless my memory fools me, gdb even has some test-cases for this... Not sure,
but it certainly looks at tracee->state in /proc before it does PTRACE_ATTACH,
because if it was already STOPPED then gdb won't have any notification from the
tracee.

> Anyway, does the below look ok?

Yes, thanks.

Oleg.

end of thread, other threads:[~2018-05-01 15:10 UTC | newest]

Thread overview: 9+ messages
2018-04-30 14:17 [PATCH 0/2] sched,kthread: Fix TASK_PARKED and special sleep states Peter Zijlstra
2018-04-30 14:17 ` [PATCH 1/2] kthread: Fix kthread_parkme() wait-loop Peter Zijlstra
2018-04-30 14:17 ` [PATCH 2/2] sched: Introduce set_special_state() Peter Zijlstra
2018-04-30 16:45   ` Oleg Nesterov
2018-04-30 19:40     ` Peter Zijlstra
2018-05-01  9:50       ` Peter Zijlstra
2018-05-01 13:59         ` Oleg Nesterov
2018-05-01 14:22           ` Peter Zijlstra
2018-05-01 15:10             ` Oleg Nesterov
