linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org
Cc: linux-kernel@vger.kernel.org, tglx@linutronix.de,
	manfred@colorfullife.com, torvalds@linux-foundation.org,
	oleg@redhat.com, will.deacon@arm.com
Subject: [RFC][PATCH] sched: Fix TASK_DEAD race in finish_task_switch()
Date: Tue, 29 Sep 2015 14:45:09 +0200	[thread overview]
Message-ID: <20150929124509.GG3816@twins.programming.kicks-ass.net> (raw)


Reviving a discussion at least a year old, I just remembered we had an
issue here while doing those PREEMPT_ACTIVE patches.

---

So the problem this patch is trying to address is as follows:


        CPU0                            CPU1

        context_switch(A, B)
                                        ttwu(A)
                                          LOCK A->pi_lock
                                          A->on_cpu == 0
        finish_task_switch(A)
          prev_state = A->state  <-.
          WMB                      |
          A->on_cpu = 0;           |
          UNLOCK rq0->lock         |
                                   |    context_switch(C, A)
                                   `--  A->state = TASK_DEAD
          prev_state == TASK_DEAD
            put_task_struct(A)
                                        context_switch(A, C)
                                        finish_task_switch(A)
                                          A->state == TASK_DEAD
                                            put_task_struct(A)


The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.

Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  | 10 +++++-----
 kernel/sched/sched.h |  5 +++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e3a81ad9721a..7346133f1f4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2515,11 +2515,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
 	 * schedule one last time. The schedule call will never return, and
 	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could rave with its RUNNING -> DEAD
+	 * transition, resulting in a double drop.
 	 */
 	prev_state = prev->state;
 	vtime_task_switch(prev);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index af6f252e7e34..046242feff3a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1072,9 +1072,10 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
-	smp_wmb();
-	prev->on_cpu = 0;
+	smp_store_release(&prev->on_cpu, 0);
 #endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */

             reply	other threads:[~2015-09-29 12:50 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-29 12:45 Peter Zijlstra [this message]
2015-09-29 16:40 ` [RFC][PATCH] sched: Fix TASK_DEAD race in finish_task_switch() Linus Torvalds
2015-09-29 16:50   ` Peter Zijlstra
2015-09-29 17:41     ` Linus Torvalds
2015-10-06 16:12 ` [tip:sched/core] sched/core: " tip-bot for Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150929124509.GG3816@twins.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=manfred@colorfullife.com \
    --cc=mingo@kernel.org \
    --cc=oleg@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).