* [RFC][PATCH] sched: Fix TASK_DEAD race in finish_task_switch()
@ 2015-09-29 12:45 Peter Zijlstra
  2015-09-29 16:40 ` Linus Torvalds
  2015-10-06 16:12 ` [tip:sched/core] sched/core: " tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 5+ messages in thread
From: Peter Zijlstra @ 2015-09-29 12:45 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel, tglx, manfred, torvalds, oleg, will.deacon


Reviving a discussion at least a year old, I just remembered we had an
issue here while doing those PREEMPT_ACTIVE patches.

---

So the problem this patch is trying to address is as follows:


        CPU0                            CPU1

        context_switch(A, B)
                                        ttwu(A)
                                          LOCK A->pi_lock
                                          A->on_cpu == 0
        finish_task_switch(A)
          prev_state = A->state  <-.
          WMB                      |
          A->on_cpu = 0;           |
          UNLOCK rq0->lock         |
                                   |    context_switch(C, A)
                                   `--  A->state = TASK_DEAD
          prev_state == TASK_DEAD
            put_task_struct(A)
                                        context_switch(A, C)
                                        finish_task_switch(A)
                                          A->state == TASK_DEAD
                                            put_task_struct(A)


The argument being that the WMB will allow the load of A->state on CPU0
to cross over the A->on_cpu store and observe CPU1's subsequent store of
A->state, which will then result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.
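
For completeness, the wakeup side now looks roughly like this (paraphrased
and heavily simplified, not the literal source):

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	...
	/*
	 * Wait until the previous CPU is completely done switching us
	 * out; this is the spin the diagram's "A->on_cpu == 0" stands for.
	 */
	while (p->on_cpu)
		cpu_relax();
	smp_rmb();
	...
	/* it then takes some rq->lock, but not necessarily the one CPU0 holds */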

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not on important ones like ARM.

Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  | 10 +++++-----
 kernel/sched/sched.h |  5 +++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e3a81ad9721a..7346133f1f4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2515,11 +2515,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
 	 * schedule one last time. The schedule call will never return, and
 	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could race with its RUNNING -> DEAD
+	 * transition, resulting in a double drop.
 	 */
 	prev_state = prev->state;
 	vtime_task_switch(prev);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index af6f252e7e34..046242feff3a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1072,9 +1072,10 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
-	smp_wmb();
-	prev->on_cpu = 0;
+	smp_store_release(&prev->on_cpu, 0);
 #endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */


* Re: [RFC][PATCH] sched: Fix TASK_DEAD race in finish_task_switch()
  2015-09-29 12:45 [RFC][PATCH] sched: Fix TASK_DEAD race in finish_task_switch() Peter Zijlstra
@ 2015-09-29 16:40 ` Linus Torvalds
  2015-09-29 16:50   ` Peter Zijlstra
  2015-10-06 16:12 ` [tip:sched/core] sched/core: " tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2015-09-29 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linux Kernel Mailing List, Thomas Gleixner,
	Manfred Spraul, Oleg Nesterov, Will Deacon

On Tue, Sep 29, 2015 at 8:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> +        *
> +        * Pairs with the control dependency and rmb in try_to_wake_up().
>          */

So this comment makes me nervous. A control dependency doesn't
actually do anything on powerpc and ARM (or alpha, or MIPS, or any
number of other architectures). Basically, a conditional branch ends up
not being the usual kind of data dependency (which works on everything
but alpha), because conditional branches are predicted and loads after
them are speculated.
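
Concretely, what makes me nervous is this kind of pattern (a hand-waved
sketch with made-up variables, not the scheduler code):

	while (READ_ONCE(*flag))	/* load feeding a conditional branch */
		cpu_relax();
	r = READ_ONCE(*data);		/* the branch is predicted, so nothing
					   stops this load from being speculated
					   before the *flag load on weak CPUs */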

Also, using smp_store_release() instead of a wmb() is going to be very
expensive on old ARM and a number of other not-so-great architectures.
On x86, both end up being just a scheduling thing. On other modern
architectures, store releases are fairly cheap, but wmb() is cheap
too.

So long-term, the wmb->store_release conversion probably makes sense,
but it's at least debatable for now.

The one advantage that release/acquire do have is that particularly
when they pair up, they have much nicer semantics (ie you get the same
ordering guarantees as if you had a lock). But if they don't pair up,
there are actually some advantages to smp_wmb() over the alternatives.
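
The pairing I mean is the usual message-passing shape, roughly (made-up
variable names):

	/* producer */
	WRITE_ONCE(data, 1);
	smp_store_release(&flag, 1);

	/* consumer */
	if (smp_load_acquire(&flag))
		r = READ_ONCE(data);	/* guaranteed to see data == 1 */

With a bare smp_wmb() you only get the store-side ordering and have to
supply the read side yourself.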

               Linus


* Re: [RFC][PATCH] sched: Fix TASK_DEAD race in finish_task_switch()
  2015-09-29 16:40 ` Linus Torvalds
@ 2015-09-29 16:50   ` Peter Zijlstra
  2015-09-29 17:41     ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2015-09-29 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Thomas Gleixner,
	Manfred Spraul, Oleg Nesterov, Will Deacon

On Tue, Sep 29, 2015 at 12:40:22PM -0400, Linus Torvalds wrote:
> On Tue, Sep 29, 2015 at 8:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > +        *
> > +        * Pairs with the control dependency and rmb in try_to_wake_up().
> >          */
> 
> So this comment makes me nervous. A control dependency doesn't
> actually do anything on powerpc and ARM (or alpha, or MIPS, or any
> number of other architectures). Basically, a conditional branch ends up
> not being the usual kind of data dependency (which works on everything
> but alpha), because conditional branches are predicted and loads after
> them are speculated.

The control dependency creates a LOAD->STORE order, that is, no STOREs
can happen until we observe !p->on_cpu.

	while (p->on_cpu)
		cpu_relax();

The subsequent:

	smp_rmb();

ensures no later loads can pass up before the ->on_cpu load. Combined they
provide a LOAD->(LOAD/STORE) barrier.

Nothing is allowed to pass up over that combination -- not unlike an
smp_load_acquire() but without the expensive MB.
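
IOW, the pairing I have in mind (sketched; the prev->state load really
lives in finish_task_switch() proper):

	/* CPU0: finish_lock_switch()		*/
	prev_state = prev->state;		/* cannot be ordered after ...	*/
	smp_store_release(&prev->on_cpu, 0);	/* ... this store's visibility	*/

	/* CPU1: try_to_wake_up()		*/
	while (p->on_cpu)			/* ctrl-dep: no later STOREs	*/
		cpu_relax();			/* until we see on_cpu == 0	*/
	smp_rmb();				/* and no later LOADs either	*/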

> Also, using smp_store_release() instead of a wmb() is going to be very
> expensive on old ARM and a number of other not-so-great architectures.

Yes :-(

> On x86, both end up being just a scheduling thing. On other modern
> architectures, store releases are fairly cheap, but wmb() is cheap
> too.

Right, but wmb isn't sufficient as it doesn't order the prev->state LOAD
vs the prev->on_cpu = 0 STORE. If those happen in the wrong order the
described race can happen and we get a use-after-free.
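
That is, annotating the old code:

	prev_state = prev->state;	/* LOAD				*/
	smp_wmb();			/* orders STOREs vs STOREs only	*/
	prev->on_cpu = 0;		/* STORE -- free to become visible
					   before the LOAD above completes */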

Of course, x86 isn't affected (the reorder is disallowed by TSO) nor is
PPC (its wmb() is lwsync which also disallows this reorder).

But ARM/MIPS etc.. are affected and these are the archs now getting the
full barrier.

(And note that arm64 gets to use their store-release instruction, which
might be better than the full barrier -- but maybe not as great as their
wmb, I have no idea on their relative costs.)

> So long-term, the wmb->store_release conversion probably makes sense,
> but it's at least debatable for now.

I'm all open to alternative solutions to this race.



* Re: [RFC][PATCH] sched: Fix TASK_DEAD race in finish_task_switch()
  2015-09-29 16:50   ` Peter Zijlstra
@ 2015-09-29 17:41     ` Linus Torvalds
  0 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2015-09-29 17:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linux Kernel Mailing List, Thomas Gleixner,
	Manfred Spraul, Oleg Nesterov, Will Deacon

On Tue, Sep 29, 2015 at 12:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> The control dependency creates a LOAD->STORE order, that is, no STOREs
> can happen until we observe !p->on_cpu.

Fair enough.

> Right, but wmb isn't sufficient as it doesn't order the prev->state LOAD
> vs the prev->on_cpu = 0 STORE. If those happen in the wrong order the
> described race can happen and we get a use-after-free.

.. and you convinced me. The patch is good. Ack.

               Linus


* [tip:sched/core] sched/core: Fix TASK_DEAD race in finish_task_switch()
  2015-09-29 12:45 [RFC][PATCH] sched: Fix TASK_DEAD race in finish_task_switch() Peter Zijlstra
  2015-09-29 16:40 ` Linus Torvalds
@ 2015-10-06 16:12 ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 5+ messages in thread
From: tip-bot for Peter Zijlstra @ 2015-10-06 16:12 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, linux-kernel, mingo, oleg, hpa, peterz, torvalds

Commit-ID:  95913d97914f44db2b81271c2e2ebd4d2ac2df83
Gitweb:     http://git.kernel.org/tip/95913d97914f44db2b81271c2e2ebd4d2ac2df83
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 29 Sep 2015 14:45:09 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 6 Oct 2015 17:05:17 +0200

sched/core: Fix TASK_DEAD race in finish_task_switch()

So the problem this patch is trying to address is as follows:

        CPU0                            CPU1

        context_switch(A, B)
                                        ttwu(A)
                                          LOCK A->pi_lock
                                          A->on_cpu == 0
        finish_task_switch(A)
          prev_state = A->state  <-.
          WMB                      |
          A->on_cpu = 0;           |
          UNLOCK rq0->lock         |
                                   |    context_switch(C, A)
                                   `--  A->state = TASK_DEAD
          prev_state == TASK_DEAD
            put_task_struct(A)
                                        context_switch(A, C)
                                        finish_task_switch(A)
                                          A->state == TASK_DEAD
                                            put_task_struct(A)

The argument being that the WMB will allow the load of A->state on CPU0
to cross over the A->on_cpu store and observe CPU1's subsequent store of
A->state, which will then result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # v3.1+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 10 +++++-----
 kernel/sched/sched.h |  5 +++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6159531..10a8faa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2517,11 +2517,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
 	 * schedule one last time. The schedule call will never return, and
 	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could race with its RUNNING -> DEAD
+	 * transition, resulting in a double drop.
 	 */
 	prev_state = prev->state;
 	vtime_task_switch(prev);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 68cda11..6d2a119 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1078,9 +1078,10 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
-	smp_wmb();
-	prev->on_cpu = 0;
+	smp_store_release(&prev->on_cpu, 0);
 #endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
