* [PATCH 0/7] scheduler patches
@ 2019-11-08 13:15 Peter Zijlstra
  2019-11-08 13:15 ` [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race Peter Zijlstra
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 13:15 UTC (permalink / raw)
  To: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman
  Cc: linux-kernel, qperret, valentin.schneider, qais.yousef, ktkhai, peterz

Hi, here's the critical scheduler fix (#1) and the rest of the patches that
resulted from staring at the code.



* [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race
  2019-11-08 13:15 [PATCH 0/7] scheduler patches Peter Zijlstra
@ 2019-11-08 13:15 ` Peter Zijlstra
  2019-11-08 14:28   ` Quentin Perret
                     ` (2 more replies)
  2019-11-08 13:15 ` [PATCH 2/7] sched/fair: Better document newidle_balance() Peter Zijlstra
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 13:15 UTC (permalink / raw)
  To: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman
  Cc: linux-kernel, qperret, valentin.schneider, qais.yousef, ktkhai, peterz

Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
inadvertently introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().

The comments about dropping rq->lock, in for example
newidle_balance(), only mention the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:

	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.
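
To make the interleaving concrete, a minimal single-threaded userspace
model of the double invocation follows; all names are stand-ins rather
than kernel code, and the comments mark which CPU would run each step:

	#include <assert.h>
	#include <stdio.h>

	static int prev_put;	/* set once put_prev_task(prev) has run */

	static void model_put_prev_task(const char *cpu)
	{
		printf("%s: put_prev_task(prev)\n", cpu);
		assert(!prev_put && "double put_prev_task() -> corrupted state");
		prev_put = 1;
	}

	int main(void)
	{
		/* CPU0, in the pick_next_task() slow path: */
		model_put_prev_task("CPU0");
		printf("CPU0: newidle_balance() drops rq->lock\n");

		/*
		 * CPU1 takes rq->lock and runs the 'change' pattern on prev;
		 * task_current() is still true, so it puts prev a second time:
		 */
		model_put_prev_task("CPU1");	/* the assert fires here */

		return 0;
	}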

Furthermore, it turns out we lost the RT-pull when we put the last DL
task.

Fix both problems by extracting the balancing from put_prev_task() and
doing a multi-class balance() pass before put_prev_task().
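
As a sketch of the shape of that pass -- stand-in types and callbacks
only, the real ones are in the patch below:

	#include <stdbool.h>
	#include <stdio.h>

	struct class_model {
		const char *name;
		bool (*balance)(void);	/* true: runnable work at >= this prio */
		const struct class_model *next;
	};

	static bool balance_dl_model(void)   { return false; }
	static bool balance_rt_model(void)   { return true; }	/* e.g. pulled an RT task */
	static bool balance_fair_model(void) { return false; }

	static const struct class_model fair_m = { "fair", balance_fair_model, NULL };
	static const struct class_model rt_m   = { "rt",   balance_rt_model,   &fair_m };
	static const struct class_model dl_m   = { "dl",   balance_dl_model,   &rt_m };

	int main(void)
	{
		/* Walk from prev's class downwards, like for_class_range(). */
		for (const struct class_model *c = &dl_m; c; c = c->next) {
			printf("balance_%s()\n", c->name);
			if (c->balance())
				break;	/* runnable task of >= prio found; stop */
		}
		return 0;
	}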

Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |   21 +++++++++++++++------
 kernel/sched/deadline.c  |   40 ++++++++++++++++++++--------------------
 kernel/sched/fair.c      |   15 ++++++++++++---
 kernel/sched/idle.c      |    9 ++++++++-
 kernel/sched/rt.c        |   37 +++++++++++++++++++------------------
 kernel/sched/sched.h     |   30 +++++++++++++++++++++++++++---
 kernel/sched/stop_task.c |   18 +++++++++++-------
 7 files changed, 112 insertions(+), 58 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3929,13 +3929,22 @@ pick_next_task(struct rq *rq, struct tas
 	}
 
 restart:
+#ifdef CONFIG_SMP
 	/*
-	 * Ensure that we put DL/RT tasks before the pick loop, such that they
-	 * can PULL higher prio tasks when we lower the RQ 'priority'.
+	 * We must do the balancing pass before put_prev_task(), such
+	 * that when we release the rq->lock the task is in the same
+	 * state as before we took rq->lock.
+	 *
+	 * We can terminate the balance pass as soon as we know there is
+	 * a runnable task of @class priority or higher.
 	 */
-	prev->sched_class->put_prev_task(rq, prev, rf);
-	if (!rq->nr_running)
-		newidle_balance(rq, rf);
+	for_class_range(class, prev->sched_class, &idle_sched_class) {
+		if (class->balance(rq, prev, rf))
+			break;
+	}
+#endif
+
+	put_prev_task(rq, prev);
 
 	for_each_class(class) {
 		p = class->pick_next_task(rq, NULL, NULL);
@@ -6201,7 +6210,7 @@ static struct task_struct *__pick_migrat
 	for_each_class(class) {
 		next = class->pick_next_task(rq, NULL, NULL);
 		if (next) {
-			next->sched_class->put_prev_task(rq, next, NULL);
+			next->sched_class->put_prev_task(rq, next);
 			return next;
 		}
 	}
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1691,6 +1691,22 @@ static void check_preempt_equal_dl(struc
 	resched_curr(rq);
 }
 
+static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+{
+	if (!on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
+
+	return sched_stop_runnable(rq) || sched_dl_runnable(rq);
+}
 #endif /* CONFIG_SMP */
 
 /*
@@ -1758,45 +1774,28 @@ static struct task_struct *
 pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct sched_dl_entity *dl_se;
+	struct dl_rq *dl_rq = &rq->dl;
 	struct task_struct *p;
-	struct dl_rq *dl_rq;
 
 	WARN_ON_ONCE(prev || rf);
 
-	dl_rq = &rq->dl;
-
-	if (unlikely(!dl_rq->dl_nr_running))
+	if (!sched_dl_runnable(rq))
 		return NULL;
 
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
-
 	p = dl_task_of(dl_se);
-
 	set_next_task_dl(rq, p);
-
 	return p;
 }
 
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
 
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
-
-	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we've
-		 * not yet started the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_dl_task(rq);
-		rq_repin_lock(rq, rf);
-	}
 }
 
 /*
@@ -2442,6 +2441,7 @@ const struct sched_class dl_sched_class
 	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_dl,
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6423,6 +6423,15 @@ static void task_dead_fair(struct task_s
 {
 	remove_entity_load_avg(&p->se);
 }
+
+static int
+balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	if (rq->nr_running)
+		return 1;
+
+	return newidle_balance(rq, rf) != 0;
+}
 #endif /* CONFIG_SMP */
 
 static unsigned long wakeup_gran(struct sched_entity *se)
@@ -6599,7 +6608,7 @@ pick_next_task_fair(struct rq *rq, struc
 	int new_tasks;
 
 again:
-	if (!cfs_rq->nr_running)
+	if (!sched_fair_runnable(rq))
 		goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -6737,7 +6746,7 @@ done: __maybe_unused;
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 {
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
@@ -10597,11 +10606,11 @@ const struct sched_class fair_sched_clas
 	.check_preempt_curr	= check_preempt_wakeup,
 
 	.pick_next_task		= pick_next_task_fair,
-
 	.put_prev_task		= put_prev_task_fair,
 	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_fair,
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -365,6 +365,12 @@ select_task_rq_idle(struct task_struct *
 {
 	return task_cpu(p); /* IDLE tasks are never migrated */
 }
+
+static int
+balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return WARN_ON_ONCE(1);
+}
 #endif
 
 /*
@@ -375,7 +381,7 @@ static void check_preempt_curr_idle(stru
 	resched_curr(rq);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
 }
 
@@ -460,6 +466,7 @@ const struct sched_class idle_sched_clas
 	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1469,6 +1469,22 @@ static void check_preempt_equal_prio(str
 	resched_curr(rq);
 }
 
+static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+{
+	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_rt_task(rq);
+		rq_repin_lock(rq, rf);
+	}
+
+	return sched_stop_runnable(rq) || sched_dl_runnable(rq) || sched_rt_runnable(rq);
+}
 #endif /* CONFIG_SMP */
 
 /*
@@ -1552,21 +1568,18 @@ static struct task_struct *
 pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *p;
-	struct rt_rq *rt_rq = &rq->rt;
 
 	WARN_ON_ONCE(prev || rf);
 
-	if (!rt_rq->rt_queued)
+	if (!sched_rt_runnable(rq))
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
-
 	set_next_task_rt(rq, p);
-
 	return p;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);
 
@@ -1578,18 +1591,6 @@ static void put_prev_task_rt(struct rq *
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
-
-	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we've
-		 * not yet started the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_rt_task(rq);
-		rq_repin_lock(rq, rf);
-	}
 }
 
 #ifdef CONFIG_SMP
@@ -2366,8 +2367,8 @@ const struct sched_class rt_sched_class
 	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_rt,
 	.select_task_rq		= select_task_rq_rt,
-
 	.set_cpus_allowed       = set_cpus_allowed_common,
 	.rq_online              = rq_online_rt,
 	.rq_offline             = rq_offline_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1727,10 +1727,11 @@ struct sched_class {
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
-	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
+	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
+	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
@@ -1773,7 +1774,7 @@ struct sched_class {
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
-	prev->sched_class->put_prev_task(rq, prev, NULL);
+	prev->sched_class->put_prev_task(rq, prev);
 }
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
@@ -1787,8 +1788,12 @@ static inline void set_next_task(struct
 #else
 #define sched_class_highest (&dl_sched_class)
 #endif
+
+#define for_class_range(class, _from, _to) \
+	for (class = (_from); class != (_to); class = class->next)
+
 #define for_each_class(class) \
-   for (class = sched_class_highest; class; class = class->next)
+	for_class_range(class, sched_class_highest, NULL)
 
 extern const struct sched_class stop_sched_class;
 extern const struct sched_class dl_sched_class;
@@ -1796,6 +1801,25 @@ extern const struct sched_class rt_sched
 extern const struct sched_class fair_sched_class;
 extern const struct sched_class idle_sched_class;
 
+static inline bool sched_stop_runnable(struct rq *rq)
+{
+	return rq->stop && task_on_rq_queued(rq->stop);
+}
+
+static inline bool sched_dl_runnable(struct rq *rq)
+{
+	return rq->dl.dl_nr_running > 0;
+}
+
+static inline bool sched_rt_runnable(struct rq *rq)
+{
+	return rq->rt.rt_queued > 0;
+}
+
+static inline bool sched_fair_runnable(struct rq *rq)
+{
+	return rq->cfs.nr_running > 0;
+}
 
 #ifdef CONFIG_SMP
 
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -15,6 +15,12 @@ select_task_rq_stop(struct task_struct *
 {
 	return task_cpu(p); /* stop tasks never migrate */
 }
+
+static int
+balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return sched_stop_runnable(rq);
+}
 #endif /* CONFIG_SMP */
 
 static void
@@ -31,16 +37,13 @@ static void set_next_task_stop(struct rq
 static struct task_struct *
 pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *stop = rq->stop;
-
 	WARN_ON_ONCE(prev || rf);
 
-	if (!stop || !task_on_rq_queued(stop))
+	if (!sched_stop_runnable(rq))
 		return NULL;
 
-	set_next_task_stop(rq, stop);
-
-	return stop;
+	set_next_task_stop(rq, rq->stop);
+	return rq->stop;
 }
 
 static void
@@ -60,7 +63,7 @@ static void yield_task_stop(struct rq *r
 	BUG(); /* the stop task should never yield, it's pointless. */
 }
 
-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;
@@ -129,6 +132,7 @@ const struct sched_class stop_sched_clas
 	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
+	.balance		= balance_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif




* [PATCH 2/7] sched/fair: Better document newidle_balance()
  2019-11-08 13:15 [PATCH 0/7] scheduler patches Peter Zijlstra
  2019-11-08 13:15 ` [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race Peter Zijlstra
@ 2019-11-08 13:15 ` Peter Zijlstra
  2019-11-11  9:32   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2019-11-08 13:15 ` [PATCH 3/7] sched: Make pick_next_task_idle() more consistent Peter Zijlstra
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 13:15 UTC (permalink / raw)
  To: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman
  Cc: linux-kernel, qperret, valentin.schneider, qais.yousef, ktkhai, peterz

Whilst chasing the pick_next_task() race, we ran into some confusion
about the newidle_balance() return values. Document them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    5 +++++
 1 file changed, 5 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9920,6 +9920,11 @@ static inline void nohz_newidle_balance(
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
+ *
+ * Returns:
+ *  <0 - we released the lock and there are !fair tasks present
+ *   0 - failed, no new tasks
+ *  >0 - success, new (fair) tasks present
  */
 int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
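
For illustration, a hedged userspace sketch of how a caller might branch
on the three documented return values (the helper name here is made up):

	#include <stdio.h>

	static const char *interpret_newidle(int ret)
	{
		if (ret < 0)
			return "lock was released and !fair tasks appeared";
		if (ret == 0)
			return "failed, no new tasks";
		return "success, new (fair) tasks present";
	}

	int main(void)
	{
		int samples[] = { -1, 0, 2 };
		int i;

		for (i = 0; i < 3; i++)
			printf("%2d: %s\n", samples[i], interpret_newidle(samples[i]));
		return 0;
	}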




* [PATCH 3/7] sched: Make pick_next_task_idle() more consistent
  2019-11-08 13:15 [PATCH 0/7] scheduler patches Peter Zijlstra
  2019-11-08 13:15 ` [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race Peter Zijlstra
  2019-11-08 13:15 ` [PATCH 2/7] sched/fair: Better document newidle_balance() Peter Zijlstra
@ 2019-11-08 13:15 ` Peter Zijlstra
  2019-11-11  9:32   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
  2019-11-08 13:15 ` [PATCH 4/7] sched: Optimize pick_next_task() Peter Zijlstra
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 13:15 UTC (permalink / raw)
  To: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman
  Cc: linux-kernel, qperret, valentin.schneider, qais.yousef, ktkhai, peterz

Only pick_next_task_fair() needs the @prev and @rf arguments; these are
required to implement the cpu-cgroup optimization. None of the other
pick_next_task() methods need them. Make pick_next_task_idle() more
consistent.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    6 ++++--
 kernel/sched/idle.c |    3 +--
 2 files changed, 5 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3922,8 +3922,10 @@ pick_next_task(struct rq *rq, struct tas
 			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
-		if (unlikely(!p))
-			p = idle_sched_class.pick_next_task(rq, prev, rf);
+		if (unlikely(!p)) {
+			put_prev_task(rq, prev);
+			p = idle_sched_class.pick_next_task(rq, NULL, NULL);
+		}
 
 		return p;
 	}
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -390,8 +390,7 @@ pick_next_task_idle(struct rq *rq, struc
 {
 	struct task_struct *next = rq->idle;
 
-	if (prev)
-		put_prev_task(rq, prev);
+	WARN_ON_ONCE(prev || rf);
 
 	set_next_task_idle(rq, next);
 




* [PATCH 4/7] sched: Optimize pick_next_task()
  2019-11-08 13:15 [PATCH 0/7] scheduler patches Peter Zijlstra
                   ` (2 preceding siblings ...)
  2019-11-08 13:15 ` [PATCH 3/7] sched: Make pick_next_task_idle() more consistent Peter Zijlstra
@ 2019-11-08 13:15 ` Peter Zijlstra
  2019-11-08 14:33   ` Quentin Perret
  2019-11-11  9:32   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
  2019-11-08 13:15 ` [PATCH 5/7] sched: Simplify sched_class::pick_next_task() Peter Zijlstra
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 13:15 UTC (permalink / raw)
  To: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman
  Cc: linux-kernel, qperret, valentin.schneider, qais.yousef, ktkhai, peterz

Ever since we moved the sched_class defenitions into their own files,
the constant expression {fair,idle}_sched_class.pick_next_task() is
not in fact a compile time constant anymore and results in an indirect
call (barring LTO).

Fix that by exposing pick_next_task_{fair,idle}() directly; this gets
rid of the indirect call (and RETPOLINE) on the fast path.

Also remove the unlikely() from the idle case; it is in fact /the/ way
we select idle -- and that is a very common thing to do.

Performance for will-it-scale/sched_yield improves by 2% (as reported
by 0-day).
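
The difference in miniature, as a userspace sketch -- in the kernel the
class struct is defined in another translation unit, so the compiler
cannot fold the pointer (barring LTO); all names below are stand-ins:

	#include <stdio.h>

	struct class_model {
		int (*pick_next_task)(void);
	};

	/* In the kernel, this definition lives in a different file. */
	static int pick_fair_model(void) { return 42; }

	static const struct class_model fair_model = {
		.pick_next_task = pick_fair_model,
	};

	int main(void)
	{
		/*
		 * Indirect: load the pointer, then call through it
		 * (a retpoline when the mitigations are enabled).
		 */
		int a = fair_model.pick_next_task();

		/* Direct: the target is known at link time. */
		int b = pick_fair_model();

		printf("%d %d\n", a, b);
		return 0;
	}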

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |    6 +++---
 kernel/sched/fair.c  |    2 +-
 kernel/sched/idle.c  |    2 +-
 kernel/sched/sched.h |    3 +++
 4 files changed, 8 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3917,14 +3917,14 @@ pick_next_task(struct rq *rq, struct tas
 		    prev->sched_class == &fair_sched_class) &&
 		   rq->nr_running == rq->cfs.h_nr_running)) {
 
-		p = fair_sched_class.pick_next_task(rq, prev, rf);
+		p = pick_next_task_fair(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
-		if (unlikely(!p)) {
+		if (!p) {
 			put_prev_task(rq, prev);
-			p = idle_sched_class.pick_next_task(rq, NULL, NULL);
+			p = pick_next_task_idle(rq, NULL, NULL);
 		}
 
 		return p;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6599,7 +6599,7 @@ static void check_preempt_wakeup(struct
 		set_last_buddy(se);
 }
 
-static struct task_struct *
+struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct cfs_rq *cfs_rq = &rq->cfs;
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -391,7 +391,7 @@ static void set_next_task_idle(struct rq
 	schedstat_inc(rq->sched_goidle);
 }
 
-static struct task_struct *
+struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next = rq->idle;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1821,6 +1821,9 @@ static inline bool sched_fair_runnable(s
 	return rq->cfs.nr_running > 0;
 }
 
+extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+extern struct task_struct *pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+
 #ifdef CONFIG_SMP
 
 extern void update_group_capacity(struct sched_domain *sd, int cpu);




* [PATCH 5/7] sched: Simplify sched_class::pick_next_task()
  2019-11-08 13:15 [PATCH 0/7] scheduler patches Peter Zijlstra
                   ` (3 preceding siblings ...)
  2019-11-08 13:15 ` [PATCH 4/7] sched: Optimize pick_next_task() Peter Zijlstra
@ 2019-11-08 13:15 ` Peter Zijlstra
  2019-11-11  9:32   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
  2019-11-08 13:15 ` [PATCH 6/7] sched/fair: Use mul_u32_u32() Peter Zijlstra
  2019-11-08 13:16 ` [PATCH 7/7] sched: Further clarify sched_class::set_next_task() Peter Zijlstra
  6 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 13:15 UTC (permalink / raw)
  To: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman
  Cc: linux-kernel, qperret, valentin.schneider, qais.yousef, ktkhai, peterz

Now that the indirect class call never uses the last two arguments of
pick_next_task(), remove them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |    6 +++---
 kernel/sched/deadline.c  |    5 +----
 kernel/sched/fair.c      |    7 ++++++-
 kernel/sched/idle.c      |    5 +----
 kernel/sched/rt.c        |    5 +----
 kernel/sched/sched.h     |   18 +++---------------
 kernel/sched/stop_task.c |    5 +----
 7 files changed, 16 insertions(+), 35 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3924,7 +3924,7 @@ pick_next_task(struct rq *rq, struct tas
 		/* Assumes fair_sched_class->next == idle_sched_class */
 		if (!p) {
 			put_prev_task(rq, prev);
-			p = pick_next_task_idle(rq, NULL, NULL);
+			p = pick_next_task_idle(rq);
 		}
 
 		return p;
@@ -3941,7 +3941,7 @@ pick_next_task(struct rq *rq, struct tas
 	put_prev_task(rq, prev);
 
 	for_each_class(class) {
-		p = class->pick_next_task(rq, NULL, NULL);
+		p = class->pick_next_task(rq);
 		if (p)
 			return p;
 	}
@@ -6203,7 +6203,7 @@ static struct task_struct *__pick_migrat
 	struct task_struct *next;
 
 	for_each_class(class) {
-		next = class->pick_next_task(rq, NULL, NULL);
+		next = class->pick_next_task(rq);
 		if (next) {
 			next->sched_class->put_prev_task(rq, next);
 			return next;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1770,15 +1770,12 @@ static struct sched_dl_entity *pick_next
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *
-pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_next_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
 	struct task_struct *p;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!sched_dl_runnable(rq))
 		return NULL;
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6743,6 +6743,11 @@ done: __maybe_unused;
 	return NULL;
 }
 
+static struct task_struct *__pick_next_task_fair(struct rq *rq)
+{
+	return pick_next_task_fair(rq, NULL, NULL);
+}
+
 /*
  * Account for a descheduled task:
  */
@@ -10621,7 +10626,7 @@ const struct sched_class fair_sched_clas
 
 	.check_preempt_curr	= check_preempt_wakeup,
 
-	.pick_next_task		= pick_next_task_fair,
+	.pick_next_task		= __pick_next_task_fair,
 	.put_prev_task		= put_prev_task_fair,
 	.set_next_task          = set_next_task_fair,
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -391,13 +391,10 @@ static void set_next_task_idle(struct rq
 	schedstat_inc(rq->sched_goidle);
 }
 
-struct task_struct *
-pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
 
-	WARN_ON_ONCE(prev || rf);
-
 	set_next_task_idle(rq, next);
 
 	return next;
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1564,13 +1564,10 @@ static struct task_struct *_pick_next_ta
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *
-pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_next_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!sched_rt_runnable(rq))
 		return NULL;
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1713,20 +1713,8 @@ struct sched_class {
 
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
 
-	/*
-	 * Both @prev and @rf are optional and may be NULL, in which case the
-	 * caller must already have invoked put_prev_task(rq, prev, rf).
-	 *
-	 * Otherwise it is the responsibility of the pick_next_task() to call
-	 * put_prev_task() on the @prev task or something equivalent, IFF it
-	 * returns a next task.
-	 *
-	 * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
-	 * higher prio class has runnable tasks.
-	 */
-	struct task_struct * (*pick_next_task)(struct rq *rq,
-					       struct task_struct *prev,
-					       struct rq_flags *rf);
+	struct task_struct *(*pick_next_task)(struct rq *rq);
+
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
@@ -1822,7 +1810,7 @@ static inline bool sched_fair_runnable(s
 }
 
 extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
-extern struct task_struct *pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+extern struct task_struct *pick_next_task_idle(struct rq *rq);
 
 #ifdef CONFIG_SMP
 
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -34,11 +34,8 @@ static void set_next_task_stop(struct rq
 	stop->se.exec_start = rq_clock_task(rq);
 }
 
-static struct task_struct *
-pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_next_task_stop(struct rq *rq)
 {
-	WARN_ON_ONCE(prev || rf);
-
 	if (!sched_stop_runnable(rq))
 		return NULL;
 




* [PATCH 6/7] sched/fair: Use mul_u32_u32()
  2019-11-08 13:15 [PATCH 0/7] scheduler patches Peter Zijlstra
                   ` (4 preceding siblings ...)
  2019-11-08 13:15 ` [PATCH 5/7] sched: Simplify sched_class::pick_next_task() Peter Zijlstra
@ 2019-11-08 13:15 ` Peter Zijlstra
  2019-11-11  9:32   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2019-11-08 13:16 ` [PATCH 7/7] sched: Further clarify sched_class::set_next_task() Peter Zijlstra
  6 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 13:15 UTC (permalink / raw)
  To: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman
  Cc: linux-kernel, qperret, valentin.schneider, qais.yousef, ktkhai, peterz

While reading the code I encountered another site where we should be
using mul_u32_u32() because GCC just won't take a hint.
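
A hedged userspace model of what the helper guarantees; mul_u32_u32()
itself is per-arch kernel code, this merely mimics the 32x32->64
widening multiply, and the sample values are arbitrary:

	#include <inttypes.h>
	#include <stdint.h>
	#include <stdio.h>

	static inline uint64_t mul_u32_u32_model(uint32_t a, uint32_t b)
	{
		return (uint64_t)a * b;	/* a single 32x32->64 MUL on x86 */
	}

	int main(void)
	{
		uint64_t fact = 0xfffff000u;
		uint32_t inv_weight = 0x00400000u;

		/* The cast-based hint the patch replaces: */
		uint64_t hinted = (uint64_t)(uint32_t)fact * inv_weight;
		uint64_t helper = mul_u32_u32_model((uint32_t)fact, inv_weight);

		printf("%#" PRIx64 " %#" PRIx64 "\n", hinted, helper);
		return 0;
	}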

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -229,8 +229,7 @@ static u64 __calc_delta(u64 delta_exec,
 		}
 	}
 
-	/* hint to use a 32x32->64 mul */
-	fact = (u64)(u32)fact * lw->inv_weight;
+	fact = mul_u32_u32(fact, lw->inv_weight);
 
 	while (fact >> 32) {
 		fact >>= 1;




* [PATCH 7/7] sched: Further clarify sched_class::set_next_task()
  2019-11-08 13:15 [PATCH 0/7] scheduler patches Peter Zijlstra
                   ` (5 preceding siblings ...)
  2019-11-08 13:15 ` [PATCH 6/7] sched/fair: Use mul_u32_u32() Peter Zijlstra
@ 2019-11-08 13:16 ` Peter Zijlstra
  2019-11-11  9:32   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
  6 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 13:16 UTC (permalink / raw)
  To: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman
  Cc: linux-kernel, qperret, valentin.schneider, qais.yousef, ktkhai, peterz

It turns out there really is something special about the first
set_next_task() invocation. Specifically, the 'change' pattern really
should not cause balance callbacks.
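
As a sketch of the intent -- stand-in names, the real callbacks follow
in the diff:

	#include <stdbool.h>
	#include <stdio.h>

	static void set_next_task_model(bool first)
	{
		printf("update exec_start, dequeue from pushable\n");	/* always */

		if (!first)
			return;		/* the 'change' pattern stops here */

		/* Only a genuine pick may start hrtick / queue callbacks. */
		printf("start hrtick, queue balance callbacks\n");
	}

	int main(void)
	{
		set_next_task_model(true);	/* pick_next_task_*() path */
		set_next_task_model(false);	/* 'change' pattern path */
		return 0;
	}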

Fixes: f95d4eaee6d0 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/deadline.c  |    7 +++++--
 kernel/sched/fair.c      |    2 +-
 kernel/sched/idle.c      |    4 ++--
 kernel/sched/rt.c        |    7 +++++--
 kernel/sched/sched.h     |    4 ++--
 kernel/sched/stop_task.c |    4 ++--
 6 files changed, 17 insertions(+), 11 deletions(-)

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1743,13 +1743,16 @@ static void start_hrtick_dl(struct rq *r
 }
 #endif
 
-static void set_next_task_dl(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
 
+	if (!first)
+		return;
+
 	if (hrtick_enabled(rq))
 		start_hrtick_dl(rq, p);
 
@@ -1782,7 +1785,7 @@ static struct task_struct *pick_next_tas
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 	p = dl_task_of(dl_se);
-	set_next_task_dl(rq, p);
+	set_next_task_dl(rq, p, true);
 	return p;
 }
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10343,7 +10343,7 @@ static void switched_to_fair(struct rq *
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_next_task_fair(struct rq *rq, struct task_struct *p)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 {
 	struct sched_entity *se = &p->se;
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -385,7 +385,7 @@ static void put_prev_task_idle(struct rq
 {
 }
 
-static void set_next_task_idle(struct rq *rq, struct task_struct *next)
+static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
@@ -395,7 +395,7 @@ struct task_struct *pick_next_task_idle(
 {
 	struct task_struct *next = rq->idle;
 
-	set_next_task_idle(rq, next);
+	set_next_task_idle(rq, next, true);
 
 	return next;
 }
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1515,13 +1515,16 @@ static void check_preempt_curr_rt(struct
 #endif
 }
 
-static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
 
+	if (!first)
+		return;
+
 	/*
 	 * If prev task was rt, put_prev_task() has already updated the
 	 * utilization. We only care of the case where we start to schedule a
@@ -1572,7 +1575,7 @@ static struct task_struct *pick_next_tas
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
-	set_next_task_rt(rq, p);
+	set_next_task_rt(rq, p, true);
 	return p;
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1716,7 +1716,7 @@ struct sched_class {
 	struct task_struct *(*pick_next_task)(struct rq *rq);
 
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
-	void (*set_next_task)(struct rq *rq, struct task_struct *p);
+	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
 
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
@@ -1768,7 +1768,7 @@ static inline void put_prev_task(struct
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
 	WARN_ON_ONCE(rq->curr != next);
-	next->sched_class->set_next_task(rq, next);
+	next->sched_class->set_next_task(rq, next, false);
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -29,7 +29,7 @@ check_preempt_curr_stop(struct rq *rq, s
 	/* we're never preempted */
 }
 
-static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
+static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool first)
 {
 	stop->se.exec_start = rq_clock_task(rq);
 }
@@ -39,7 +39,7 @@ static struct task_struct *pick_next_tas
 	if (!sched_stop_runnable(rq))
 		return NULL;
 
-	set_next_task_stop(rq, rq->stop);
+	set_next_task_stop(rq, rq->stop, true);
 	return rq->stop;
 }
 




* Re: [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race
  2019-11-08 13:15 ` [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race Peter Zijlstra
@ 2019-11-08 14:28   ` Quentin Perret
  2019-11-08 16:05   ` Valentin Schneider
  2019-11-08 17:03   ` Qais Yousef
  2 siblings, 0 replies; 21+ messages in thread
From: Quentin Perret @ 2019-11-08 14:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman, linux-kernel, valentin.schneider, qais.yousef,
	ktkhai

On Friday 08 Nov 2019 at 14:15:54 (+0100), Peter Zijlstra wrote:
> Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
> inadvertently introduced a race because it changed a previously
> unexplored dependency between dropping the rq->lock and
> sched_class::put_prev_task().
> 
> The comments about dropping rq->lock, in for example
> newidle_balance(), only mention the task being current and ->on_cpu
> being set. But when we look at the 'change' pattern (in for example
> sched_setnuma()):
> 
> 	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
> 	running = task_current(rq, p); /* rq->curr == p */
> 
> 	if (queued)
> 		dequeue_task(...);
> 	if (running)
> 		put_prev_task(...);
> 
> 	/* change task properties */
> 
> 	if (queued)
> 		enqueue_task(...);
> 	if (running)
> 		set_next_task(...);
> 
> It becomes obvious that if we do this after put_prev_task() has
> already been called on @p, things go sideways. This is exactly what
> the commit in question allows to happen when it does:
> 
> 	prev->sched_class->put_prev_task(rq, prev, rf);
> 	if (!rq->nr_running)
> 		newidle_balance(rq, rf);
> 
> The newidle_balance() call will drop rq->lock after we've called
> put_prev_task() and that allows the above 'change' pattern to
> interleave and mess up the state.
> 
> Furthermore, it turns out we lost the RT-pull when we put the last DL
> task.
> 
> Fix both problems by extracting the balancing from put_prev_task() and
> doing a multi-class balance() pass before put_prev_task().
> 
> Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> Reported-by: Quentin Perret <qperret@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

The reproducer that triggered in 30sec or so has now been running for
3 hours:

   Tested-by: Quentin Perret <qperret@google.com>

Thanks for the fix,
Quentin


* Re: [PATCH 4/7] sched: Optimize pick_next_task()
  2019-11-08 13:15 ` [PATCH 4/7] sched: Optimize pick_next_task() Peter Zijlstra
@ 2019-11-08 14:33   ` Quentin Perret
  2019-11-08 16:46     ` Vincent Guittot
  2019-11-08 20:46     ` Peter Zijlstra
  2019-11-11  9:32   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Quentin Perret @ 2019-11-08 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman, linux-kernel, valentin.schneider, qais.yousef,
	ktkhai

On Friday 08 Nov 2019 at 14:15:57 (+0100), Peter Zijlstra wrote:
> Ever since we moved the sched_class defenitions into their own files,

s/defenitions/definitions

> the constant expression {fair,idle}_sched_class.pick_next_task() is
> not in fact a compile time constant anymore and results in an indirect
> call (barring LTO).
> 
> > Fix that by exposing pick_next_task_{fair,idle}() directly; this gets
> rid of the indirect call (and RETPOLINE) on the fast path.
> 
> > Also remove the unlikely() from the idle case; it is in fact /the/ way
> we select idle -- and that is a very common thing to do.

I assumed this was to optimize the case where we did find a cfs task to
run. That is, we can afford to hit the unlikely case when there is no
work to do after, but when there is, we shouldn't spend time checking
the idle case. Makes sense?

Thanks,
Quentin


* Re: [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race
  2019-11-08 13:15 ` [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race Peter Zijlstra
  2019-11-08 14:28   ` Quentin Perret
@ 2019-11-08 16:05   ` Valentin Schneider
  2019-11-08 20:49     ` Peter Zijlstra
  2019-11-08 17:03   ` Qais Yousef
  2 siblings, 1 reply; 21+ messages in thread
From: Valentin Schneider @ 2019-11-08 16:05 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot, dietmar.eggemann,
	juri.lelli, rostedt, bsegall, mgorman
  Cc: linux-kernel, qperret, qais.yousef, ktkhai

On 08/11/2019 13:15, Peter Zijlstra wrote:
> Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> Reported-by: Quentin Perret <qperret@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I've been running the same reproducer as Quentin's for a similar length of
time (~3h) and it's still going strong, so FWIW:

Tested-by: Valentin Schneider <valentin.schneider@arm.com>

> ---
>  kernel/sched/core.c      |   21 +++++++++++++++------
>  kernel/sched/deadline.c  |   40 ++++++++++++++++++++--------------------
>  kernel/sched/fair.c      |   15 ++++++++++++---
>  kernel/sched/idle.c      |    9 ++++++++-
>  kernel/sched/rt.c        |   37 +++++++++++++++++++------------------
>  kernel/sched/sched.h     |   30 +++++++++++++++++++++++++++---
>  kernel/sched/stop_task.c |   18 +++++++++++-------
>  7 files changed, 112 insertions(+), 58 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3929,13 +3929,22 @@ pick_next_task(struct rq *rq, struct tas
>  	}
>  
>  restart:
> +#ifdef CONFIG_SMP

I suppose we could dump that in a core balance_prev_task() to avoid the
inline #ifdeffery, but eh...

>  	/*
> -	 * Ensure that we put DL/RT tasks before the pick loop, such that they
> -	 * can PULL higher prio tasks when we lower the RQ 'priority'.
> +	 * We must do the balancing pass before put_prev_task(), such
> +	 * that when we release the rq->lock the task is in the same
> +	 * state as before we took rq->lock.
> +	 *
> +	 * We can terminate the balance pass as soon as we know there is
> +	 * a runnable task of @class priority or higher.
>  	 */
> -	prev->sched_class->put_prev_task(rq, prev, rf);
> -	if (!rq->nr_running)
> -		newidle_balance(rq, rf);
> +	for_class_range(class, prev->sched_class, &idle_sched_class) {
> +		if (class->balance(rq, prev, rf))
> +			break;
> +	}
> +#endif
> +
> +	put_prev_task(rq, prev);
>  
>  	for_each_class(class) {
>  		p = class->pick_next_task(rq, NULL, NULL);

> @@ -1469,6 +1469,22 @@ static void check_preempt_equal_prio(str
>  	resched_curr(rq);
>  }
>  
> +static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> +{
> +	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
> +		/*
> +		 * This is OK, because current is on_cpu, which avoids it being
> +		 * picked for load-balance and preemption/IRQs are still
> +		 * disabled avoiding further scheduler activity on it and we've
> +		 * not yet started the picking loop.
> +		 */
> +		rq_unpin_lock(rq, rf);
> +		pull_rt_task(rq);
> +		rq_repin_lock(rq, rf);
> +	}
> +
> +	return sched_stop_runnable(rq) || sched_dl_runnable(rq) || sched_rt_runnable(rq);

So we already have some dependencies on the class ordering (e.g. fair->idle),
but I'm wondering if it would make sense to have these runnable functions be
defined as sched_class callbacks?

e.g.

  rt_sched_class.runnable = rt_runnable

w/ rt_runnable() just being a non-inlined sched_rt_runnable() you define
further down the patch (or a wrapper to it). The balance return pattern could
then become:

  for_class_range(class, sched_class_highest, rt_sched_class->next)
  	if (class->runnable(rq))
		return true;
  return false;

(and replace rt_sched_class by whatever class' balance callback this is)

It's a bit neater, but I'm pretty sure it's going to run worse :/
The only unaffected one would be fair, since newidle_balance() already does
that "for free".

> +}
>  #endif /* CONFIG_SMP */
>  
>  /*


* Re: [PATCH 4/7] sched: Optimize pick_next_task()
  2019-11-08 14:33   ` Quentin Perret
@ 2019-11-08 16:46     ` Vincent Guittot
  2019-11-08 20:46     ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2019-11-08 16:46 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel,
	Valentin Schneider, Qais Yousef, ktkhai

On Fri, 8 Nov 2019 at 15:33, Quentin Perret <qperret@google.com> wrote:
>
> On Friday 08 Nov 2019 at 14:15:57 (+0100), Peter Zijlstra wrote:
> > Ever since we moved the sched_class defenitions into their own files,
>
> s/defenitions/definitions
>
> > the constant expression {fair,idle}_sched_class.pick_next_task() is
> > not in fact a compile time constant anymore and results in an indirect
> > call (barring LTO).
> >
> > Fix that by exposing pick_next_task_{fair,idle}() directly; this gets
> > rid of the indirect call (and RETPOLINE) on the fast path.
> >
> > Also remove the unlikely() from the idle case; it is in fact /the/ way
> > we select idle -- and that is a very common thing to do.
>
> I assumed this was to optimize the case where we did find a cfs task to
> run. That is, we can afford to hit the unlikely case when there is no
> work to do after, but when there is, we shouldn't spend time checking
> the idle case. Makes sense?

I have the same understanding as Quentin

>
> Thanks,
> Quentin


* Re: [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race
  2019-11-08 13:15 ` [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race Peter Zijlstra
  2019-11-08 14:28   ` Quentin Perret
  2019-11-08 16:05   ` Valentin Schneider
@ 2019-11-08 17:03   ` Qais Yousef
  2 siblings, 0 replies; 21+ messages in thread
From: Qais Yousef @ 2019-11-08 17:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman, linux-kernel, qperret, valentin.schneider,
	ktkhai

On 11/08/19 14:15, Peter Zijlstra wrote:
> Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
> inadvertently introduced a race because it changed a previously
> unexplored dependency between dropping the rq->lock and
> sched_class::put_prev_task().
> 
> The comments about dropping rq->lock, in for example
> newidle_balance(), only mention the task being current and ->on_cpu
> being set. But when we look at the 'change' pattern (in for example
> sched_setnuma()):
> 
> 	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
> 	running = task_current(rq, p); /* rq->curr == p */
> 
> 	if (queued)
> 		dequeue_task(...);
> 	if (running)
> 		put_prev_task(...);
> 
> 	/* change task properties */
> 
> 	if (queued)
> 		enqueue_task(...);
> 	if (running)
> 		set_next_task(...);
> 
> It becomes obvious that if we do this after put_prev_task() has
> already been called on @p, things go sideways. This is exactly what
> the commit in question allows to happen when it does:
> 
> 	prev->sched_class->put_prev_task(rq, prev, rf);
> 	if (!rq->nr_running)
> 		newidle_balance(rq, rf);
> 
> The newidle_balance() call will drop rq->lock after we've called
> put_prev_task() and that allows the above 'change' pattern to
> interleave and mess up the state.
> 
> Furthermore, it turns out we lost the RT-pull when we put the last DL
> task.
> 
> Fix both problems by extracting the balancing from put_prev_task() and
> doing a multi-class balance() pass before put_prev_task().
> 
> Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
> Reported-by: Quentin Perret <qperret@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---

This feels more fluent and less delicate than the previous approach for sure.

Spinning the whole series in my reproducer at the minute.

Thinking ahead, can we do something to help us debug/spot this in a better way
in the future?

I'm thinking maybe something like

	save_rq_state();
	rq_unlock();

	.
	.
	.

	rq_lock();
	assert_rq_state(); /* Bug here if something we don't expect has changed
			      when the lock wasn't held */

might help us verify no one has tried to change the state of the rq when the
lock wasn't held.

At least this will help us catch races like this closer to the point where
things go wrong.

We might as well annotate the functions that modify the state of the rq to help
debug the sequence of events?

Happy to carry out the work if there's anything sensible that we can actually
do.

Cheers

--
Qais Yousef


* Re: [PATCH 4/7] sched: Optimize pick_next_task()
  2019-11-08 14:33   ` Quentin Perret
  2019-11-08 16:46     ` Vincent Guittot
@ 2019-11-08 20:46     ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 20:46 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman, linux-kernel, valentin.schneider, qais.yousef,
	ktkhai

On Fri, Nov 08, 2019 at 02:33:48PM +0000, Quentin Perret wrote:
> On Friday 08 Nov 2019 at 14:15:57 (+0100), Peter Zijlstra wrote:

> > Also remove the unlikely() from the idle case; it is in fact /the/ way
> > we select idle -- and that is a very common thing to do.
> 
> I assumed this was to optimize the case where we did find a cfs task to
> run. That is, we can afford to hit the unlikely case when there is no
> work to do after, but when there is, we shouldn't spend time checking
> the idle case. Makes sense?

There is that, but it is also the way we pick idle when nr_running drops
to 0.

I just couldn't make up my mind if the unlikely made sense or not.


* Re: [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race
  2019-11-08 16:05   ` Valentin Schneider
@ 2019-11-08 20:49     ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2019-11-08 20:49 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: mingo, vincent.guittot, dietmar.eggemann, juri.lelli, rostedt,
	bsegall, mgorman, linux-kernel, qperret, qais.yousef, ktkhai

On Fri, Nov 08, 2019 at 04:05:25PM +0000, Valentin Schneider wrote:
> On 08/11/2019 13:15, Peter Zijlstra wrote:
> > +static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> > +{
> > +	if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
> > +		/*
> > +		 * This is OK, because current is on_cpu, which avoids it being
> > +		 * picked for load-balance and preemption/IRQs are still
> > +		 * disabled avoiding further scheduler activity on it and we've
> > +		 * not yet started the picking loop.
> > +		 */
> > +		rq_unpin_lock(rq, rf);
> > +		pull_rt_task(rq);
> > +		rq_repin_lock(rq, rf);
> > +	}
> > +
> > +	return sched_stop_runnable(rq) || sched_dl_runnable(rq) || sched_rt_runnable(rq);
> 
> So we already have some dependencies on the class ordering (e.g. fair->idle),
> but I'm wondering if would it make sense to have these runnable functions be
> defined as sched_class callbacks?
> 
> e.g.
> 
>   rt_sched_class.runnable = rt_runnable
> 
> w/ rt_runnable() just being a non-inlined sched_rt_runnable() you define
> further down the patch (or a wrapper to it). The balance return pattern could
> then become:
> 
>   for_class_range(class, sched_class_highest, rt_sched_class->next)
>   	if (class->runnable(rq))
> 		return true;
>   return false;
> 
> (and replace rt_sched_class by whatever class' balance callback this is)
> 
> It's a bit neater, but I'm pretty sure it's going to run worse :/
> The only unaffected one would be fair, since newidle_balance() already does
> that "for free".

Yeah, it'll be pretty terrible :/

That said, I might have some clues on how to get rid of all the indirect
calls, but I need to play around a bit. It'll be invasive though :/
(like that ever stopped me).


* [tip: sched/core] sched/core: Further clarify sched_class::set_next_task()
  2019-11-08 13:16 ` [PATCH 7/7] sched: Further clarify sched_class::set_next_task() Peter Zijlstra
@ 2019-11-11  9:32   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2019-11-11  9:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Linus Torvalds, Thomas Gleixner, bsegall, dietmar.eggemann,
	juri.lelli, ktkhai, mgorman, qais.yousef, qperret, rostedt,
	valentin.schneider, vincent.guittot, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     a0e813f26ebcb25c0b5e504498fbd796cca1a4ba
Gitweb:        https://git.kernel.org/tip/a0e813f26ebcb25c0b5e504498fbd796cca1a4ba
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 08 Nov 2019 14:16:00 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 11 Nov 2019 08:35:21 +01:00

sched/core: Further clarify sched_class::set_next_task()

It turns out there really is something special about the first
set_next_task() invocation. Specifically, the 'change' pattern really
should not cause balance callbacks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: f95d4eaee6d0 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/deadline.c  | 7 +++++--
 kernel/sched/fair.c      | 2 +-
 kernel/sched/idle.c      | 4 ++--
 kernel/sched/rt.c        | 7 +++++--
 kernel/sched/sched.h     | 4 ++--
 kernel/sched/stop_task.c | 4 ++--
 6 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f7fbb44..43323f8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1743,13 +1743,16 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 }
 #endif
 
-static void set_next_task_dl(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
 
+	if (!first)
+		return;
+
 	if (hrtick_enabled(rq))
 		start_hrtick_dl(rq, p);
 
@@ -1782,7 +1785,7 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 	p = dl_task_of(dl_se);
-	set_next_task_dl(rq, p);
+	set_next_task_dl(rq, p, true);
 	return p;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba97d1a..81eba55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10344,7 +10344,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_next_task_fair(struct rq *rq, struct task_struct *p)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 {
 	struct sched_entity *se = &p->se;
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f88b79e..428cd05 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -385,7 +385,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
 {
 }
 
-static void set_next_task_idle(struct rq *rq, struct task_struct *next)
+static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
@@ -395,7 +395,7 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
 
-	set_next_task_idle(rq, next);
+	set_next_task_idle(rq, next, true);
 
 	return next;
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 38027c0..e591d40 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1515,13 +1515,16 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
 #endif
 }
 
-static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
 
+	if (!first)
+		return;
+
 	/*
 	 * If prev task was rt, put_prev_task() has already updated the
 	 * utilization. We only care of the case where we start to schedule a
@@ -1572,7 +1575,7 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
-	set_next_task_rt(rq, p);
+	set_next_task_rt(rq, p, true);
 	return p;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 75d96cc..05c2827 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1716,7 +1716,7 @@ struct sched_class {
 	struct task_struct *(*pick_next_task)(struct rq *rq);
 
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
-	void (*set_next_task)(struct rq *rq, struct task_struct *p);
+	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
 
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
@@ -1768,7 +1768,7 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
 	WARN_ON_ONCE(rq->curr != next);
-	next->sched_class->set_next_task(rq, next);
+	next->sched_class->set_next_task(rq, next, false);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 0aefdfb..4c9e997 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -29,7 +29,7 @@ check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
 	/* we're never preempted */
 }
 
-static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
+static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool first)
 {
 	stop->se.exec_start = rq_clock_task(rq);
 }
@@ -39,7 +39,7 @@ static struct task_struct *pick_next_task_stop(struct rq *rq)
 	if (!sched_stop_runnable(rq))
 		return NULL;
 
-	set_next_task_stop(rq, rq->stop);
+	set_next_task_stop(rq, rq->stop, true);
 	return rq->stop;
 }
 
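For reference, a minimal sketch (not part of the patch) contrasting the
two kinds of call sites the new @first argument distinguishes, as seen
in the hunks above:

	/* fresh pick: p is newly scheduled in, do the full work */
	set_next_task_rt(rq, p, true);

	/* 'change' pattern: next never stopped being rq->curr, so the
	 * expensive part was already done; hence first == false */
	next->sched_class->set_next_task(rq, next, false);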

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip: sched/core] sched/fair: Use mul_u32_u32()
  2019-11-08 13:15 ` [PATCH 6/7] sched/fair: Use mul_u32_u32() Peter Zijlstra
@ 2019-11-11  9:32   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2019-11-11  9:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Linus Torvalds, Thomas Gleixner, bsegall, dietmar.eggemann,
	juri.lelli, ktkhai, mgorman, qais.yousef, qperret, rostedt,
	valentin.schneider, vincent.guittot, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2eeb01a28c9233333bf229a5b4b0559f4bd22b52
Gitweb:        https://git.kernel.org/tip/2eeb01a28c9233333bf229a5b4b0559f4bd22b52
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 08 Nov 2019 14:15:59 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 11 Nov 2019 08:35:20 +01:00

sched/fair: Use mul_u32_u32()

While reading the code I encountered another site where we should be
using mul_u32_u32() because GCC just won't take a hint.
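
As a minimal sketch (assuming the mul_u32_u32() helper from
include/linux/math64.h; the function names here are made up), the
cast-based hint and the explicit helper look like this:

	/* hint: hope the compiler notices both operands fit in 32 bits */
	static u64 prod_hinted(u32 a, u32 b)
	{
		return (u64)a * b;
	}

	/* explicit: mul_u32_u32() guarantees a single 32x32->64 multiply */
	static u64 prod_explicit(u32 a, u32 b)
	{
		return mul_u32_u32(a, b);
	}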

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.717931380@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1789193..ba97d1a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -229,8 +229,7 @@ static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight
 		}
 	}
 
-	/* hint to use a 32x32->64 mul */
-	fact = (u64)(u32)fact * lw->inv_weight;
+	fact = mul_u32_u32(fact, lw->inv_weight);
 
 	while (fact >> 32) {
 		fact >>= 1;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip: sched/core] sched/core: Simplify sched_class::pick_next_task()
  2019-11-08 13:15 ` [PATCH 5/7] sched: Simplify sched_class::pick_next_task() Peter Zijlstra
@ 2019-11-11  9:32   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2019-11-11  9:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Linus Torvalds, Thomas Gleixner, bsegall, dietmar.eggemann,
	juri.lelli, ktkhai, mgorman, qais.yousef, qperret, rostedt,
	valentin.schneider, vincent.guittot, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     98c2f700edb413e4baa4a0368c5861d96211a775
Gitweb:        https://git.kernel.org/tip/98c2f700edb413e4baa4a0368c5861d96211a775
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 08 Nov 2019 14:15:58 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 11 Nov 2019 08:35:20 +01:00

sched/core: Simplify sched_class::pick_next_task()

Now that the indirect class call never uses the last two arguments of
pick_next_task(), remove them.
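
Where a class still wants the long form internally, a trivial shim
keeps the class hook on the simplified signature; the fair class does
exactly that in the diff below:

	/* shim: adapt the argument-less class hook to the internal helper */
	static struct task_struct *__pick_next_task_fair(struct rq *rq)
	{
		return pick_next_task_fair(rq, NULL, NULL);
	}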

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.660595546@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c      |  6 +++---
 kernel/sched/deadline.c  |  5 +----
 kernel/sched/fair.c      |  7 ++++++-
 kernel/sched/idle.c      |  5 +----
 kernel/sched/rt.c        |  5 +----
 kernel/sched/sched.h     | 18 +++---------------
 kernel/sched/stop_task.c |  5 +----
 7 files changed, 16 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7cf6547..513a479 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3924,7 +3924,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		/* Assumes fair_sched_class->next == idle_sched_class */
 		if (!p) {
 			put_prev_task(rq, prev);
-			p = pick_next_task_idle(rq, NULL, NULL);
+			p = pick_next_task_idle(rq);
 		}
 
 		return p;
@@ -3949,7 +3949,7 @@ restart:
 	put_prev_task(rq, prev);
 
 	for_each_class(class) {
-		p = class->pick_next_task(rq, NULL, NULL);
+		p = class->pick_next_task(rq);
 		if (p)
 			return p;
 	}
@@ -6210,7 +6210,7 @@ static struct task_struct *__pick_migrate_task(struct rq *rq)
 	struct task_struct *next;
 
 	for_each_class(class) {
-		next = class->pick_next_task(rq, NULL, NULL);
+		next = class->pick_next_task(rq);
 		if (next) {
 			next->sched_class->put_prev_task(rq, next);
 			return next;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a8a0803..f7fbb44 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1770,15 +1770,12 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *
-pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_next_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
 	struct task_struct *p;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!sched_dl_runnable(rq))
 		return NULL;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da81451..1789193 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6755,6 +6755,11 @@ idle:
 	return NULL;
 }
 
+static struct task_struct *__pick_next_task_fair(struct rq *rq)
+{
+	return pick_next_task_fair(rq, NULL, NULL);
+}
+
 /*
  * Account for a descheduled task:
  */
@@ -10622,7 +10627,7 @@ const struct sched_class fair_sched_class = {
 
 	.check_preempt_curr	= check_preempt_wakeup,
 
-	.pick_next_task		= pick_next_task_fair,
+	.pick_next_task		= __pick_next_task_fair,
 	.put_prev_task		= put_prev_task_fair,
 	.set_next_task          = set_next_task_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 0fdceac..f88b79e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -391,13 +391,10 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next)
 	schedstat_inc(rq->sched_goidle);
 }
 
-struct task_struct *
-pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
 
-	WARN_ON_ONCE(prev || rf);
-
 	set_next_task_idle(rq, next);
 
 	return next;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9b8adc0..38027c0 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1564,13 +1564,10 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *
-pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_next_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!sched_rt_runnable(rq))
 		return NULL;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 66172a3..75d96cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1713,20 +1713,8 @@ struct sched_class {
 
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
 
-	/*
-	 * Both @prev and @rf are optional and may be NULL, in which case the
-	 * caller must already have invoked put_prev_task(rq, prev, rf).
-	 *
-	 * Otherwise it is the responsibility of the pick_next_task() to call
-	 * put_prev_task() on the @prev task or something equivalent, IFF it
-	 * returns a next task.
-	 *
-	 * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
-	 * higher prio class has runnable tasks.
-	 */
-	struct task_struct * (*pick_next_task)(struct rq *rq,
-					       struct task_struct *prev,
-					       struct rq_flags *rf);
+	struct task_struct *(*pick_next_task)(struct rq *rq);
+
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
@@ -1822,7 +1810,7 @@ static inline bool sched_fair_runnable(struct rq *rq)
 }
 
 extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
-extern struct task_struct *pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+extern struct task_struct *pick_next_task_idle(struct rq *rq);
 
 #ifdef CONFIG_SMP
 
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index c064073..0aefdfb 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -34,11 +34,8 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
 	stop->se.exec_start = rq_clock_task(rq);
 }
 
-static struct task_struct *
-pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_next_task_stop(struct rq *rq)
 {
-	WARN_ON_ONCE(prev || rf);
-
 	if (!sched_stop_runnable(rq))
 		return NULL;
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip: sched/core] sched/core: Optimize pick_next_task()
  2019-11-08 13:15 ` [PATCH 4/7] sched: Optimize pick_next_task() Peter Zijlstra
  2019-11-08 14:33   ` Quentin Perret
@ 2019-11-11  9:32   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2019-11-11  9:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Linus Torvalds, Thomas Gleixner, bsegall, dietmar.eggemann,
	juri.lelli, ktkhai, mgorman, qais.yousef, qperret, rostedt,
	valentin.schneider, vincent.guittot, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5d7d605642b28a5911198a405a6072f091bfbee6
Gitweb:        https://git.kernel.org/tip/5d7d605642b28a5911198a405a6072f091bfbee6
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 08 Nov 2019 14:15:57 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 11 Nov 2019 08:35:19 +01:00

sched/core: Optimize pick_next_task()

Ever since we moved the sched_class definitions into their own files,
the expression {fair,idle}_sched_class.pick_next_task() is no longer a
compile-time constant and results in an indirect call (barring LTO).

Fix that by exposing pick_next_task_{fair,idle}() directly, this gets
rid of the indirect call (and RETPOLINE) on the fast path.

Also remove the unlikely() from the idle case; it is in fact /the/ way
we select idle -- and that is a very common thing to do.

Performance for will-it-scale/sched_yield improves by 2% (as reported
by 0-day).
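
In short, the fast path goes from an indirect to a direct call (cf.
the diff below):

	/* before: indirect call through the class pointer, hits RETPOLINE */
	p = fair_sched_class.pick_next_task(rq, prev, rf);

	/* after: direct call, resolved at link time */
	p = pick_next_task_fair(rq, prev, rf);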

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.603037345@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 6 +++---
 kernel/sched/fair.c  | 2 +-
 kernel/sched/idle.c  | 2 +-
 kernel/sched/sched.h | 3 +++
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 59c4f29..7cf6547 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3917,14 +3917,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		    prev->sched_class == &fair_sched_class) &&
 		   rq->nr_running == rq->cfs.h_nr_running)) {
 
-		p = fair_sched_class.pick_next_task(rq, prev, rf);
+		p = pick_next_task_fair(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
-		if (unlikely(!p)) {
+		if (!p) {
 			put_prev_task(rq, prev);
-			p = idle_sched_class.pick_next_task(rq, NULL, NULL);
+			p = pick_next_task_idle(rq, NULL, NULL);
 		}
 
 		return p;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c48a695..da81451 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6611,7 +6611,7 @@ preempt:
 		set_last_buddy(se);
 }
 
-static struct task_struct *
+struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct cfs_rq *cfs_rq = &rq->cfs;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 179d1d4..0fdceac 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -391,7 +391,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next)
 	schedstat_inc(rq->sched_goidle);
 }
 
-static struct task_struct *
+struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next = rq->idle;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c8870c5..66172a3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1821,6 +1821,9 @@ static inline bool sched_fair_runnable(struct rq *rq)
 	return rq->cfs.nr_running > 0;
 }
 
+extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+extern struct task_struct *pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+
 #ifdef CONFIG_SMP
 
 extern void update_group_capacity(struct sched_domain *sd, int cpu);

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip: sched/core] sched/core: Make pick_next_task_idle() more consistent
  2019-11-08 13:15 ` [PATCH 3/7] sched: Make pick_next_task_idle() more consistent Peter Zijlstra
@ 2019-11-11  9:32   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2019-11-11  9:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Linus Torvalds, Thomas Gleixner, bsegall, dietmar.eggemann,
	juri.lelli, ktkhai, mgorman, qais.yousef, qperret, rostedt,
	valentin.schneider, vincent.guittot, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f488e1057bb97b881ed317557dc5e62ff8747393
Gitweb:        https://git.kernel.org/tip/f488e1057bb97b881ed317557dc5e62ff8747393
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 08 Nov 2019 14:15:56 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 11 Nov 2019 08:35:19 +01:00

sched/core: Make pick_next_task_idle() more consistent

Only pick_next_task_fair() needs the @prev and @rf arguments; these are
required to implement the cpu-cgroup optimization. None of the other
pick_next_task() methods need this. Make pick_next_task_idle() more
consistent.
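
The caller now puts the previous task itself before picking idle, so
the idle method can simply assert that it never sees the extra
arguments (cf. the diff below):

	/* caller: prev is put before idle is picked */
	put_prev_task(rq, prev);
	p = idle_sched_class.pick_next_task(rq, NULL, NULL);

	/* callee: the extra arguments must always be NULL here */
	WARN_ON_ONCE(prev || rf);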

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.545730862@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 6 ++++--
 kernel/sched/idle.c | 3 +--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f2eb36..59c4f29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3922,8 +3922,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
-		if (unlikely(!p))
-			p = idle_sched_class.pick_next_task(rq, prev, rf);
+		if (unlikely(!p)) {
+			put_prev_task(rq, prev);
+			p = idle_sched_class.pick_next_task(rq, NULL, NULL);
+		}
 
 		return p;
 	}
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f65ef1e..179d1d4 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -396,8 +396,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next = rq->idle;
 
-	if (prev)
-		put_prev_task(rq, prev);
+	WARN_ON_ONCE(prev || rf);
 
 	set_next_task_idle(rq, next);
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip: sched/core] sched/fair: Better document newidle_balance()
  2019-11-08 13:15 ` [PATCH 2/7] sched/fair: Better document newidle_balance() Peter Zijlstra
@ 2019-11-11  9:32   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2019-11-11  9:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Linus Torvalds, Thomas Gleixner, bsegall, dietmar.eggemann,
	juri.lelli, ktkhai, mgorman, qais.yousef, qperret, rostedt,
	valentin.schneider, vincent.guittot, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7277a34c6be0b2972bdd1fea88c7cef409bed5b4
Gitweb:        https://git.kernel.org/tip/7277a34c6be0b2972bdd1fea88c7cef409bed5b4
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 08 Nov 2019 14:15:55 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 11 Nov 2019 08:35:18 +01:00

sched/fair: Better document newidle_balance()

Whilst chasing the pick_next_task() race, we ran into some confusion
about the newidle_balance() return values. Document them.

[ mingo: Minor edits. ]
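
As a sketch of how a caller is expected to act on these values
(modelled on the pick_next_task_fair() idle path, not part of this
patch):

	int pulled = newidle_balance(rq, rf);

	if (pulled < 0)		/* lock was dropped, !fair tasks appeared */
		return RETRY_TASK;
	if (pulled > 0)		/* pulled new fair tasks, redo the pick */
		goto again;
	/* pulled == 0: nothing to run, go idle */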

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.488364308@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e622f4..c48a695 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9941,6 +9941,11 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
+ *
+ * Returns:
+ *   < 0 - we released the lock and there are !fair tasks present
+ *     0 - failed, no new tasks
+ *   > 0 - success, new (fair) tasks present
  */
 int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {

^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-11-11  9:33 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-08 13:15 [PATCH 0/7] scheduler patches Peter Zijlstra
2019-11-08 13:15 ` [PATCH 1/7] sched: Fix pick_next_task() vs change pattern race Peter Zijlstra
2019-11-08 14:28   ` Quentin Perret
2019-11-08 16:05   ` Valentin Schneider
2019-11-08 20:49     ` Peter Zijlstra
2019-11-08 17:03   ` Qais Yousef
2019-11-08 13:15 ` [PATCH 2/7] sched/fair: Better document newidle_balance() Peter Zijlstra
2019-11-11  9:32   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2019-11-08 13:15 ` [PATCH 3/7] sched: Make pick_next_task_idle() more consistent Peter Zijlstra
2019-11-11  9:32   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
2019-11-08 13:15 ` [PATCH 4/7] sched: Optimize pick_next_task() Peter Zijlstra
2019-11-08 14:33   ` Quentin Perret
2019-11-08 16:46     ` Vincent Guittot
2019-11-08 20:46     ` Peter Zijlstra
2019-11-11  9:32   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
2019-11-08 13:15 ` [PATCH 5/7] sched: Simplify sched_class::pick_next_task() Peter Zijlstra
2019-11-11  9:32   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
2019-11-08 13:15 ` [PATCH 6/7] sched/fair: Use mul_u32_u32() Peter Zijlstra
2019-11-11  9:32   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2019-11-08 13:16 ` [PATCH 7/7] sched: Further clarify sched_class::set_next_task() Peter Zijlstra
2019-11-11  9:32   ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
