linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Quentin Perret <qperret@google.com>
Cc: linux-kernel@vger.kernel.org, aaron.lwe@gmail.com,
	valentin.schneider@arm.com, mingo@kernel.org, pauld@redhat.com,
	jdesfossez@digitalocean.com, naravamudan@digitalocean.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	juri.lelli@redhat.com, rostedt@goodmis.org, bsegall@google.com,
	mgorman@suse.de, kernel-team@android.com, john.stultz@linaro.org
Subject: Re: NULL pointer dereference in pick_next_task_fair
Date: Wed, 6 Nov 2019 13:05:25 +0100	[thread overview]
Message-ID: <20191106120525.GX4131@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <20191028174603.GA246917@google.com>

On Mon, Oct 28, 2019 at 05:46:03PM +0000, Quentin Perret wrote:
> 
> After digging a bit, the offending commit seems to be:
> 
>     67692435c411 ("sched: Rework pick_next_task() slow-path")
> 
> By 'offending' I mean that reverting it makes the issue go away. The
> issue comes from the fact that pick_next_entity() returns a NULL se in
> the 'simple' path of pick_next_task_fair(), which causes obvious
> problems in the subsequent call to set_next_entity().
> 
> I'll dig more, but if anybody understands the issue in the meatime feel
> free to send me a patch to try out :)

So for all those who didn't follow along on IRC, the below seems to cure
things.

The only thing I'm now considering is if we shouldn't be setting
->on_cpu=2 _before_ calling put_prev_task(). I'll go audit the RT/DL
cases.

---
Subject: sched: Fix pick_next_task() vs 'change' pattern race
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Nov 4 22:18:14 CET 2019

Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
inadvertly introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().

The comments about dropping rq->lock, in for example
newidle_balance(), only mentions the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:

	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.

The order in pick_next_task() is mandated by the fact that RT/DL
put_prev_task() can pull other RT tasks, in which case we should not
call newidle_balance() since we'll not be going idle. Similarly, we
cannot put newidle_balance() in put_prev_task_fair() because it should
be called every time we'll end up selecting the idle task.

Given that we're stuck with this order, the only solution is fixing
the 'change' pattern. The simplest fix seems to be to 'absuse'
p->on_cpu to carry more state. Adding more state to p->on_rq is
possible but is far more invasive and also ends up duplicating much of
the state we already carry in p->on_cpu.

Introduce task_on_rq_curr() to indicate the if
sched_class::set_next_task() has been called -- and we thus need to
call put_prev_task().

Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Quentin Perret <qperret@google.com>
Tested-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c  |   22 +++++++++++++++-------
 kernel/sched/sched.h |   16 ++++++++++++++++
 2 files changed, 31 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1595,7 +1595,7 @@ void do_set_cpus_allowed(struct task_str
 	lockdep_assert_held(&p->pi_lock);
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 
 	if (queued) {
 		/*
@@ -3934,8 +3934,16 @@ pick_next_task(struct rq *rq, struct tas
 	 * can PULL higher prio tasks when we lower the RQ 'priority'.
 	 */
 	prev->sched_class->put_prev_task(rq, prev, rf);
-	if (!rq->nr_running)
+#ifdef CONFIG_SMP
+	if (!rq->nr_running) {
+		/*
+		 * Make sure task_on_rq_curr() fails, such that we don't do
+		 * put_prev_task() + set_next_task() on this task again.
+		 */
+		prev->on_cpu = 2;
 		newidle_balance(rq, rf);
+	}
+#endif
 
 	for_each_class(class) {
 		p = class->pick_next_task(rq, NULL, NULL);
@@ -4422,7 +4430,7 @@ void rt_mutex_setprio(struct task_struct
 
 	prev_class = p->sched_class;
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flag);
 	if (running)
@@ -4509,7 +4517,7 @@ void set_user_nice(struct task_struct *p
 		goto out_unlock;
 	}
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	if (running)
@@ -4957,7 +4965,7 @@ static int __sched_setscheduler(struct t
 	}
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flags);
 	if (running)
@@ -6141,7 +6149,7 @@ void sched_setnuma(struct task_struct *p
 
 	rq = task_rq_lock(p, &rf);
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_on_rq_curr(rq, p);
 
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -7031,7 +7039,7 @@ void sched_move_task(struct task_struct
 	rq = task_rq_lock(tsk, &rf);
 	update_rq_clock(rq);
 
-	running = task_current(rq, tsk);
+	running = task_on_rq_curr(rq, tsk);
 	queued = task_on_rq_queued(tsk);
 
 	if (queued)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1628,6 +1628,22 @@ static inline int task_running(struct rq
 #endif
 }
 
+/*
+ * If true, @p has had sched_class::set_next_task() called on it.
+ * See pick_next_task().
+ */
+static inline bool task_on_rq_curr(struct rq *rq, struct task_struct *p)
+{
+#ifdef CONFIG_SMP
+	return rq->curr == p && p->on_cpu == 1;
+#else
+	return rq->curr == p;
+#endif
+}
+
+/*
+ * If true, @p has has sched_class::enqueue_task() called on it.
+ */
 static inline int task_on_rq_queued(struct task_struct *p)
 {
 	return p->on_rq == TASK_ON_RQ_QUEUED;

  parent reply	other threads:[~2019-11-06 12:06 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-28 17:46 NULL pointer dereference in pick_next_task_fair Quentin Perret
2019-10-28 21:49 ` Peter Zijlstra
2019-10-29 11:34 ` Peter Zijlstra
2019-10-29 11:50   ` Quentin Perret
2019-10-30 22:50     ` Ram Muthiah
2019-10-31  1:33       ` Valentin Schneider
2019-10-31 10:54         ` Valentin Schneider
2019-10-31 14:24           ` Valentin Schneider
2019-10-31 22:15       ` Valentin Schneider
2019-11-06 12:05 ` Peter Zijlstra [this message]
2019-11-06 13:08   ` Peter Zijlstra
2019-11-06 15:04     ` Qais Yousef
2019-11-06 16:57       ` Peter Zijlstra
2019-11-06 17:26         ` Qais Yousef
2019-11-06 15:51   ` Kirill Tkhai
2019-11-06 16:54     ` Peter Zijlstra
2019-11-06 17:27       ` Peter Zijlstra
2019-11-07  8:36         ` Kirill Tkhai
2019-11-07 13:26           ` Peter Zijlstra
2019-11-07 15:12             ` Kirill Tkhai
2019-11-07 15:42               ` Peter Zijlstra
2019-11-07 15:53                 ` Kirill Tkhai
2019-11-07 15:38             ` Quentin Perret
2019-11-07 18:43               ` Peter Zijlstra
2019-11-07 19:27                 ` Quentin Perret
2019-11-07 19:31                   ` Peter Zijlstra
2019-11-07 19:42                     ` Quentin Perret
2019-11-07 19:29                 ` Peter Zijlstra
2019-11-08 11:02                   ` Quentin Perret
2019-11-08 11:47                     ` Valentin Schneider
2019-11-08 11:58                       ` Quentin Perret
2019-11-08 12:00                     ` Peter Zijlstra
2019-11-08 12:15                       ` Quentin Perret
2019-11-08 12:35                         ` Peter Zijlstra
2019-11-08 12:24                       ` Peter Zijlstra
2019-11-08 11:55                   ` Peter Zijlstra
2019-11-08 12:52                     ` Peter Zijlstra
2019-11-07 16:09             ` Qais Yousef

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191106120525.GX4131@hirez.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=aaron.lwe@gmail.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=jdesfossez@digitalocean.com \
    --cc=john.stultz@linaro.org \
    --cc=juri.lelli@redhat.com \
    --cc=kernel-team@android.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=naravamudan@digitalocean.com \
    --cc=pauld@redhat.com \
    --cc=qperret@google.com \
    --cc=rostedt@goodmis.org \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).