* [PATCH] sched/core: Fix forceidle balancing
@ 2022-03-30 16:05 Peter Zijlstra
  2022-03-30 16:22 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Peter Zijlstra @ 2022-03-30 16:05 UTC (permalink / raw)
  To: Ingo Molnar, Steven Rostedt, vincent.guittot
  Cc: LKML, Thomas Gleixner, Sebastian Andrzej Siewior, joel,
	dietmar.eggemann, bsegall, mgorman, bristot


Steve reported that ChromeOS encounters the forceidle balancer being
run from rt_mutex_setprio()'s balance_callback() invocation and
explodes.

Now, the forceidle balancer gets queued every time the idle task gets
selected, in set_next_task(), which is strictly too often.
rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

However, rt_mutex_setprio() will explicitly not run this pattern on
the idle task (since priority boosting the idle task is quite insane).
Most other 'change' pattern users are pidhash based and would also not
apply to idle.

Also, the change pattern doesn't contain a __balance_callback()
invocation and hence we could have an out-of-band balance-callback,
which *should* trigger the WARN in rq_pin_lock() (which guards against
this exact anti-pattern).

So while none of that explains how this happens, it does indicate that
having it in set_next_task() might not be the most robust option.

Instead, explicitly queue the forceidle balancer from pick_next_task()
when it does indeed result in forceidle selection. Having it here
ensures it can only be triggered under the __schedule() rq->lock
instance, and hence must be run from that context.

This also happens to clean up the code a little, so win-win.

Fixes: d2dfa17bc7de ("sched: Trivial forced-newidle balancer")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   16 +++++++++++-----
 kernel/sched/idle.c  |    1 -
 kernel/sched/sched.h |    6 ------
 3 files changed, 11 insertions(+), 12 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5752,6 +5752,8 @@ static inline struct task_struct *pick_t
 
 extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
 
+static void queue_core_balance(struct rq *rq);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -5801,7 +5803,7 @@ pick_next_task(struct rq *rq, struct tas
 		}
 
 		rq->core_pick = NULL;
-		return next;
+		goto out;
 	}
 
 	put_prev_task_balance(rq, prev, rf);
@@ -5851,7 +5853,7 @@ pick_next_task(struct rq *rq, struct tas
 			 */
 			WARN_ON_ONCE(fi_before);
 			task_vruntime_update(rq, next, false);
-			goto done;
+			goto out_set_next;
 		}
 	}
 
@@ -5970,8 +5972,12 @@ pick_next_task(struct rq *rq, struct tas
 		resched_curr(rq_i);
 	}
 
-done:
+out_set_next:
 	set_next_task(rq, next);
+out:
+	if (rq->core->core_forceidle_count && next == rq->idle)
+		queue_core_balance(rq);
+
 	return next;
 }
 
@@ -6066,7 +6072,7 @@ static void sched_core_balance(struct rq
 
 static DEFINE_PER_CPU(struct callback_head, core_balance_head);
 
-void queue_core_balance(struct rq *rq)
+static void queue_core_balance(struct rq *rq)
 {
 	if (!sched_core_enabled(rq))
 		return;
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -434,7 +434,6 @@ static void set_next_task_idle(struct rq
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
-	queue_core_balance(rq);
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1232,8 +1232,6 @@ static inline bool sched_group_cookie_ma
 	return false;
 }
 
-extern void queue_core_balance(struct rq *rq);
-
 static inline bool sched_core_enqueued(struct task_struct *p)
 {
 	return !RB_EMPTY_NODE(&p->core_node);
@@ -1267,10 +1265,6 @@ static inline raw_spinlock_t *__rq_lockp
 	return &rq->__lock;
 }
 
-static inline void queue_core_balance(struct rq *rq)
-{
-}
-
 static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
 {
 	return true;


* Re: [PATCH] sched/core: Fix forceidle balancing
  2022-03-30 16:05 [PATCH] sched/core: Fix forceidle balancing Peter Zijlstra
@ 2022-03-30 16:22 ` Peter Zijlstra
  2022-03-31 19:00 ` Joel Fernandes
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2022-03-30 16:22 UTC (permalink / raw)
  To: Ingo Molnar, Steven Rostedt, vincent.guittot
  Cc: LKML, Thomas Gleixner, Sebastian Andrzej Siewior, joel,
	dietmar.eggemann, bsegall, mgorman, bristot

On Wed, Mar 30, 2022 at 06:05:35PM +0200, Peter Zijlstra wrote:

> rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:
> 
> 	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
> 	running = task_current(rq, p); /* rq->curr == p */
> 
> 	if (queued)
> 		dequeue_task(...);
> 	if (running)
> 		put_prev_task(...);
> 
> 	/* change task properties */
> 
> 	if (queued)
> 		enqueue_task(...);
> 	if (running)
> 		set_next_task(...);
> 

Also, while procrastinating on writing this changelog, I did the
below... is that worth it?

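(Aside, for anyone not familiar with the trick the below leans on:
__attribute__((cleanup)) makes the compiler call a function when the
annotated variable goes out of scope, which is what lets FOR_CHANGE_GUARD
pair the dequeue_task()/put_prev_task() half with the
enqueue_task()/set_next_task() half on every exit path. A minimal
userspace sketch of the same shape; the names are made up for
illustration, only the structure matches the patch below.)

/*
 * Standalone illustration of the cleanup-attribute guard pattern;
 * build with e.g.: gcc -o guard guard.c
 */
#include <stdbool.h>
#include <stdio.h>

struct guard {
	bool queued, running, done;
};

static inline struct guard guard_init(bool queued, bool running)
{
	struct guard g = { .queued = queued, .running = running };

	if (g.queued)
		printf("dequeue_task()\n");
	if (g.running)
		printf("put_prev_task()\n");

	return g;
}

/* runs automatically when a __cleanup__ variable goes out of scope */
static inline void guard_fini(struct guard *g)
{
	if (g->queued)
		printf("enqueue_task()\n");
	if (g->running)
		printf("set_next_task()\n");
}

#define __cleanup(fn)	__attribute__((__cleanup__(fn)))

#define FOR_GUARD(name, q, r)						\
	for (struct guard __cleanup(guard_fini) name = guard_init(q, r); \
	     !name.done; name.done = true)

int main(void)
{
	FOR_GUARD(cg, true, true) {
		printf("/* change task properties */\n");
	} /* guard_fini(&cg) runs when the loop is left */

	return 0;
}

The point of the shape is that the restore half cannot be forgotten on
any path out of the braced section.
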
---
 kernel/sched/core.c | 249 ++++++++++++++++++++++++----------------------------
 1 file changed, 115 insertions(+), 134 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d575b4914925..6455e0bbbeb9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2549,11 +2549,53 @@ void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_ma
 	p->nr_cpus_allowed = cpumask_weight(new_mask);
 }
 
+struct change_guard {
+	struct task_struct *p;
+	struct rq *rq;
+	unsigned int flags;
+	bool queued, running, done;
+};
+
+static inline struct change_guard
+change_guard_init(struct rq *rq, struct task_struct *p, unsigned int flags)
+{
+	struct change_guard cg = {
+		.rq = rq,
+		.p = p,
+		.flags = flags,
+		.queued = task_on_rq_queued(p),
+		.running = task_current(rq, p),
+	};
+
+	if (cg.queued)
+		dequeue_task(rq, p, flags);
+	if (cg.running)
+		put_prev_task(rq, p);
+
+	return cg;
+}
+
+static inline void change_guard_fini(struct change_guard *cg)
+{
+	if (cg->queued)
+		enqueue_task(cg->rq, cg->p, cg->flags | ENQUEUE_NOCLOCK);
+	if (cg->running)
+		set_next_task(cg->rq, cg->p);
+}
+
+#define __cleanup(_fini)		__attribute__((__cleanup__(_fini)))
+
+#define CHANGE_GUARD(name, _rq, _p, _flags)			\
+	struct change_guard __cleanup(change_guard_fini) name =	\
+		change_guard_init(_rq, _p, _flags)
+
+#define FOR_CHANGE_GUARD(name, _rq, _p, _flags)	\
+	for (CHANGE_GUARD(name, _rq, _p, _flags); !name.done; name.done = true)
+
 static void
 __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32 flags)
 {
 	struct rq *rq = task_rq(p);
-	bool queued, running;
 
 	/*
 	 * This here violates the locking rules for affinity, since we're only
@@ -2572,26 +2614,16 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 	else
 		lockdep_assert_held(&p->pi_lock);
 
-	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
-
-	if (queued) {
-		/*
-		 * Because __kthread_bind() calls this on blocked tasks without
-		 * holding rq->lock.
-		 */
-		lockdep_assert_rq_held(rq);
-		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
+	FOR_CHANGE_GUARD(cg, rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK) {
+		if (cg.queued) {
+			/*
+			 * Because __kthread_bind() calls this on blocked tasks without
+			 * holding rq->lock.
+			 */
+			lockdep_assert_rq_held(rq);
+		}
+		p->sched_class->set_cpus_allowed(p, new_mask, flags);
 	}
-	if (running)
-		put_prev_task(rq, p);
-
-	p->sched_class->set_cpus_allowed(p, new_mask, flags);
-
-	if (queued)
-		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
-	if (running)
-		set_next_task(rq, p);
 }
 
 void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
@@ -6745,7 +6777,7 @@ static inline int rt_effective_prio(struct task_struct *p, int prio)
  */
 void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 {
-	int prio, oldprio, queued, running, queue_flag =
+	int prio, oldprio, queue_flag =
 		DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
 	const struct sched_class *prev_class;
 	struct rq_flags rf;
@@ -6805,49 +6837,39 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 		queue_flag &= ~DEQUEUE_MOVE;
 
 	prev_class = p->sched_class;
-	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
-	if (queued)
-		dequeue_task(rq, p, queue_flag);
-	if (running)
-		put_prev_task(rq, p);
-
-	/*
-	 * Boosting condition are:
-	 * 1. -rt task is running and holds mutex A
-	 *      --> -dl task blocks on mutex A
-	 *
-	 * 2. -dl task is running and holds mutex A
-	 *      --> -dl task blocks on mutex A and could preempt the
-	 *          running task
-	 */
-	if (dl_prio(prio)) {
-		if (!dl_prio(p->normal_prio) ||
-		    (pi_task && dl_prio(pi_task->prio) &&
-		     dl_entity_preempt(&pi_task->dl, &p->dl))) {
-			p->dl.pi_se = pi_task->dl.pi_se;
-			queue_flag |= ENQUEUE_REPLENISH;
+	FOR_CHANGE_GUARD(cg, rq, p, queue_flag) {
+		/*
+		 * Boosting condition are:
+		 * 1. -rt task is running and holds mutex A
+		 *      --> -dl task blocks on mutex A
+		 *
+		 * 2. -dl task is running and holds mutex A
+		 *      --> -dl task blocks on mutex A and could preempt the
+		 *          running task
+		 */
+		if (dl_prio(prio)) {
+			if (!dl_prio(p->normal_prio) ||
+			    (pi_task && dl_prio(pi_task->prio) &&
+			     dl_entity_preempt(&pi_task->dl, &p->dl))) {
+				p->dl.pi_se = pi_task->dl.pi_se;
+				queue_flag |= ENQUEUE_REPLENISH;
+			} else {
+				p->dl.pi_se = &p->dl;
+			}
+		} else if (rt_prio(prio)) {
+			if (dl_prio(oldprio))
+				p->dl.pi_se = &p->dl;
+			if (oldprio < prio)
+				queue_flag |= ENQUEUE_HEAD;
 		} else {
-			p->dl.pi_se = &p->dl;
+			if (dl_prio(oldprio))
+				p->dl.pi_se = &p->dl;
+			if (rt_prio(oldprio))
+				p->rt.timeout = 0;
 		}
-	} else if (rt_prio(prio)) {
-		if (dl_prio(oldprio))
-			p->dl.pi_se = &p->dl;
-		if (oldprio < prio)
-			queue_flag |= ENQUEUE_HEAD;
-	} else {
-		if (dl_prio(oldprio))
-			p->dl.pi_se = &p->dl;
-		if (rt_prio(oldprio))
-			p->rt.timeout = 0;
-	}
 
-	__setscheduler_prio(p, prio);
-
-	if (queued)
-		enqueue_task(rq, p, queue_flag);
-	if (running)
-		set_next_task(rq, p);
+		__setscheduler_prio(p, prio);
+	}
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -6869,10 +6891,9 @@ static inline int rt_effective_prio(struct task_struct *p, int prio)
 
 void set_user_nice(struct task_struct *p, long nice)
 {
-	bool queued, running;
-	int old_prio;
 	struct rq_flags rf;
 	struct rq *rq;
+	int old_prio;
 
 	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
 		return;
@@ -6893,22 +6914,12 @@ void set_user_nice(struct task_struct *p, long nice)
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
-	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
-	if (queued)
-		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
-	if (running)
-		put_prev_task(rq, p);
-
-	p->static_prio = NICE_TO_PRIO(nice);
-	set_load_weight(p, true);
-	old_prio = p->prio;
-	p->prio = effective_prio(p);
-
-	if (queued)
-		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
-	if (running)
-		set_next_task(rq, p);
+	FOR_CHANGE_GUARD(cg, rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK) {
+		p->static_prio = NICE_TO_PRIO(nice);
+		set_load_weight(p, true);
+		old_prio = p->prio;
+		p->prio = effective_prio(p);
+	}
 
 	/*
 	 * If the task increased its priority or is running and
@@ -7216,8 +7227,8 @@ static int __sched_setscheduler(struct task_struct *p,
 				bool user, bool pi)
 {
 	int oldpolicy = -1, policy = attr->sched_policy;
-	int retval, oldprio, newprio, queued, running;
 	const struct sched_class *prev_class;
+	int retval, oldprio, newprio;
 	struct callback_head *head;
 	struct rq_flags rf;
 	int reset_on_fork;
@@ -7428,33 +7439,24 @@ static int __sched_setscheduler(struct task_struct *p,
 			queue_flags &= ~DEQUEUE_MOVE;
 	}
 
-	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
-	if (queued)
-		dequeue_task(rq, p, queue_flags);
-	if (running)
-		put_prev_task(rq, p);
-
 	prev_class = p->sched_class;
 
-	if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
-		__setscheduler_params(p, attr);
-		__setscheduler_prio(p, newprio);
-	}
-	__setscheduler_uclamp(p, attr);
-
-	if (queued) {
-		/*
-		 * We enqueue to tail when the priority of a task is
-		 * increased (user space view).
-		 */
-		if (oldprio < p->prio)
-			queue_flags |= ENQUEUE_HEAD;
+	FOR_CHANGE_GUARD(cg, rq, p, queue_flags) {
+		if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
+			__setscheduler_params(p, attr);
+			__setscheduler_prio(p, newprio);
+		}
+		__setscheduler_uclamp(p, attr);
 
-		enqueue_task(rq, p, queue_flags);
+		if (cg.queued) {
+			/*
+			 * We enqueue to tail when the priority of a task is
+			 * increased (user space view).
+			 */
+			if (oldprio < p->prio)
+				cg.flags |= ENQUEUE_HEAD;
+		}
 	}
-	if (running)
-		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 
@@ -8915,25 +8917,13 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
  */
 void sched_setnuma(struct task_struct *p, int nid)
 {
-	bool queued, running;
 	struct rq_flags rf;
 	struct rq *rq;
 
 	rq = task_rq_lock(p, &rf);
-	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
-
-	if (queued)
-		dequeue_task(rq, p, DEQUEUE_SAVE);
-	if (running)
-		put_prev_task(rq, p);
-
-	p->numa_preferred_nid = nid;
-
-	if (queued)
-		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
-	if (running)
-		set_next_task(rq, p);
+	FOR_CHANGE_GUARD(cg, rq, p, DEQUEUE_SAVE) {
+		p->numa_preferred_nid = nid;
+	}
 	task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -10035,28 +10025,19 @@ static void sched_change_group(struct task_struct *tsk, int type)
  */
 void sched_move_task(struct task_struct *tsk)
 {
-	int queued, running, queue_flags =
-		DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+	bool resched = false;
 	struct rq_flags rf;
 	struct rq *rq;
 
 	rq = task_rq_lock(tsk, &rf);
-	update_rq_clock(rq);
-
-	running = task_current(rq, tsk);
-	queued = task_on_rq_queued(tsk);
 
-	if (queued)
-		dequeue_task(rq, tsk, queue_flags);
-	if (running)
-		put_prev_task(rq, tsk);
-
-	sched_change_group(tsk, TASK_MOVE_GROUP);
+	FOR_CHANGE_GUARD(cg, rq, tsk, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+		sched_change_group(tsk, TASK_MOVE_GROUP);
+		if (cg.running)
+			resched = true;
+	}
 
-	if (queued)
-		enqueue_task(rq, tsk, queue_flags);
-	if (running) {
-		set_next_task(rq, tsk);
+	if (resched) {
 		/*
 		 * After changing group, the running task may have joined a
 		 * throttled one but it's still the running task. Trigger a


* Re: [PATCH] sched/core: Fix forceidle balancing
  2022-03-30 16:05 [PATCH] sched/core: Fix forceidle balancing Peter Zijlstra
  2022-03-30 16:22 ` Peter Zijlstra
@ 2022-03-31 19:00 ` Joel Fernandes
  2022-04-01 11:46   ` Peter Zijlstra
  2022-04-01 11:41 ` Peter Zijlstra
  2022-04-05  8:22 ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  3 siblings, 1 reply; 7+ messages in thread
From: Joel Fernandes @ 2022-03-31 19:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Vincent Guittot, LKML,
	Thomas Gleixner, Sebastian Andrzej Siewior, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Guenter Roeck

Hi,

By the way, might be slightly related - we still see crashes with
pick_task_fair() in our kernel even with this change:
https://lkml.org/lkml/2020/11/17/2137

Is it possible that, when doing pick_task_fair(), especially on a remote
CPU, both "cfs_rq->curr" and the rbtree's "left" are NULL with core
scheduling? In that case, se will be NULL and can cause crashes, right?
I think the code assumes this can never happen.

+Guenter Roeck  kindly debugged pick_task_fair() in a crash as
follows. Copying some details he mentioned in a bug report:

Assembler/source:

  25:   e8 4f 11 00 00          call   0x1179             ; se = pick_next_entity(cfs_rq, curr);
  2a:*  48 8b 98 60 01 00 00    mov    0x160(%rax),%rbx   ; trapping instruction [cfs_rq = group_cfs_rq(se);]
  31:   48 85 db                test   %rbx,%rbx
  34:   75 d1                   jne    0x7
  36:   48 89 c7                mov    %rax,%rdi

At 2a: RAX = se == NULL after pick_next_entity(). Looking closely at
pick_next_entity(), it can indeed return NULL if curr is NULL and if
left in pick_next_entity() is NULL. Per line 7:, curr is in %r14 and
is indeed 0.

Thoughts?

-Joel


On Wed, Mar 30, 2022 at 12:05 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Steve reported that ChromeOS encounters the forceidle balancer being
> ran from rt_mutex_setprio()'s balance_callback() invocation and
> explodes.
>
> Now, the forceidle balancer gets queued every time the idle task gets
> selected, set_next_task(), which is strictly too often.
> rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:
>
>         queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
>         running = task_current(rq, p); /* rq->curr == p */
>
>         if (queued)
>                 dequeue_task(...);
>         if (running)
>                 put_prev_task(...);
>
>         /* change task properties */
>
>         if (queued)
>                 enqueue_task(...);
>         if (running)
>                 set_next_task(...);
>
> However, rt_mutex_setprio() will explicitly not run this pattern on
> the idle task (since priority boosting the idle task is quite insane).
> Most other 'change' pattern users are pidhash based and would also not
> apply to idle.
>
> Also, the change pattern doesn't contain a __balance_callback()
> invocation and hence we could have an out-of-band balance-callback,
> which *should* trigger the WARN in rq_pin_lock() (which guards against
> this exact anti-pattern).
>
> So while none of that explains how this happens, it does indicate that
> having it in set_next_task() might not be the most robust option.
>
> Instead, explicitly queue the forceidle balancer from pick_next_task()
> when it does indeed result in forceidle selection. Having it here,
> ensures it can only be triggered under the __schedule() rq->lock
> instance, and hence must be ran from that context.
>
> This also happens to clean up the code a little, so win-win.
>
> Fixes: d2dfa17bc7de ("sched: Trivial forced-newidle balancer")
> Reported-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  |   16 +++++++++++-----
>  kernel/sched/idle.c  |    1 -
>  kernel/sched/sched.h |    6 ------
>  3 files changed, 11 insertions(+), 12 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5752,6 +5752,8 @@ static inline struct task_struct *pick_t
>
>  extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
>
> +static void queue_core_balance(struct rq *rq);
> +
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> @@ -5801,7 +5803,7 @@ pick_next_task(struct rq *rq, struct tas
>                 }
>
>                 rq->core_pick = NULL;
> -               return next;
> +               goto out;
>         }
>
>         put_prev_task_balance(rq, prev, rf);
> @@ -5851,7 +5853,7 @@ pick_next_task(struct rq *rq, struct tas
>                          */
>                         WARN_ON_ONCE(fi_before);
>                         task_vruntime_update(rq, next, false);
> -                       goto done;
> +                       goto out_set_next;
>                 }
>         }
>
> @@ -5970,8 +5972,12 @@ pick_next_task(struct rq *rq, struct tas
>                 resched_curr(rq_i);
>         }
>
> -done:
> +out_set_next:
>         set_next_task(rq, next);
> +out:
> +       if (rq->core->core_forceidle_count && next == rq->idle)
> +               queue_core_balance(rq);
> +
>         return next;
>  }
>
> @@ -6066,7 +6072,7 @@ static void sched_core_balance(struct rq
>
>  static DEFINE_PER_CPU(struct callback_head, core_balance_head);
>
> -void queue_core_balance(struct rq *rq)
> +static void queue_core_balance(struct rq *rq)
>  {
>         if (!sched_core_enabled(rq))
>                 return;
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -434,7 +434,6 @@ static void set_next_task_idle(struct rq
>  {
>         update_idle_core(rq);
>         schedstat_inc(rq->sched_goidle);
> -       queue_core_balance(rq);
>  }
>
>  #ifdef CONFIG_SMP
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1232,8 +1232,6 @@ static inline bool sched_group_cookie_ma
>         return false;
>  }
>
> -extern void queue_core_balance(struct rq *rq);
> -
>  static inline bool sched_core_enqueued(struct task_struct *p)
>  {
>         return !RB_EMPTY_NODE(&p->core_node);
> @@ -1267,10 +1265,6 @@ static inline raw_spinlock_t *__rq_lockp
>         return &rq->__lock;
>  }
>
> -static inline void queue_core_balance(struct rq *rq)
> -{
> -}
> -
>  static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
>  {
>         return true;


* Re: [PATCH] sched/core: Fix forceidle balancing
  2022-03-30 16:05 [PATCH] sched/core: Fix forceidle balancing Peter Zijlstra
  2022-03-30 16:22 ` Peter Zijlstra
  2022-03-31 19:00 ` Joel Fernandes
@ 2022-04-01 11:41 ` Peter Zijlstra
  2022-04-05  8:22 ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2022-04-01 11:41 UTC (permalink / raw)
  To: Ingo Molnar, Steven Rostedt, vincent.guittot
  Cc: LKML, Thomas Gleixner, Sebastian Andrzej Siewior, joel,
	dietmar.eggemann, bsegall, mgorman, bristot

On Wed, Mar 30, 2022 at 06:05:35PM +0200, Peter Zijlstra wrote:
> 
> Steve reported that ChromeOS encounters the forceidle balancer being
> ran from rt_mutex_setprio()'s balance_callback() invocation and
> explodes.
> 
> Now, the forceidle balancer gets queued every time the idle task gets
> selected, set_next_task(), which is strictly too often.
> rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:
> 
> 	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
> 	running = task_current(rq, p); /* rq->curr == p */
> 
> 	if (queued)
> 		dequeue_task(...);
> 	if (running)
> 		put_prev_task(...);
> 
> 	/* change task properties */
> 
> 	if (queued)
> 		enqueue_task(...);
> 	if (running)
> 		set_next_task(...);
> 
> However, rt_mutex_setprio() will explicitly not run this pattern on
> the idle task (since priority boosting the idle task is quite insane).
> Most other 'change' pattern users are pidhash based and would also not
> apply to idle.
> 
> Also, the change pattern doesn't contain a __balance_callback()
> invocation and hence we could have an out-of-band balance-callback,
> which *should* trigger the WARN in rq_pin_lock() (which guards against
> this exact anti-pattern).
> 
> So while none of that explains how this happens, it does indicate that
> having it in set_next_task() might not be the most robust option.
> 
> Instead, explicitly queue the forceidle balancer from pick_next_task()
> when it does indeed result in forceidle selection. Having it here,
> ensures it can only be triggered under the __schedule() rq->lock
> instance, and hence must be ran from that context.
> 
> This also happens to clean up the code a little, so win-win.

So I couldn't figure out how this could happen without triggering other
warnings, because as I mentioned elsewhere, commit 565790d28b1e ("sched:
Fix balance_callback()") should've caused a different splat.

But then Dietmar reminded me that ChromeOS is probably running some
ancient crud with backports on :/ and will most likely not have that
commit.
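
(For anyone following along: the WARN the changelog refers to is the
assertion in rq_pin_lock(). A rough sketch of that check, paraphrased
from memory rather than quoted from any particular tree:)

static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
	rf->cookie = lockdep_pin_lock(__rq_lockp(rq));

#ifdef CONFIG_SCHED_DEBUG
	/*
	 * Sketch, not verbatim: complain if a balance callback was left
	 * queued outside of the __schedule() path; balance_push_callback
	 * is the one pinned callback allowed to linger.
	 */
	SCHED_WARN_ON(rq->balance_callback &&
		      rq->balance_callback != &balance_push_callback);
#endif
}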



* Re: [PATCH] sched/core: Fix forceidle balancing
  2022-03-31 19:00 ` Joel Fernandes
@ 2022-04-01 11:46   ` Peter Zijlstra
  2022-04-05 19:25     ` Joel Fernandes
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2022-04-01 11:46 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Ingo Molnar, Steven Rostedt, Vincent Guittot, LKML,
	Thomas Gleixner, Sebastian Andrzej Siewior, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Guenter Roeck

On Thu, Mar 31, 2022 at 03:00:40PM -0400, Joel Fernandes wrote:
> Hi,
> 
> By the way, might be slightly related - we still see crashes with
> pick_task_fair() in our kernel even with this change:
> https://lkml.org/lkml/2020/11/17/2137

Please do not use lkml.org. Please use something with a Message-ID in it,
like lore.

> Is it possible that when doing pick_task_fair() especially on a remote
> CPU, both the "cfs_rq->curr" and the rbtree's "left" be NULL with core
> scheduling? In this case, se will be NULL and can cause crashes right?
> I think the code assumes this can never happen.
> 
> +Guenter Roeck  kindly debugged pick_task_fair() in a crash as
> follows. Copying some details he mentioned in a bug report:
> 
> Assembler/source:
> 
>   25:   e8 4f 11 00 00          call   0x1179             ; se =
> pick_next_entity(cfs_rq, curr);
>   2a:*  48 8b 98 60 01 00 00    mov    0x160(%rax),%rbx   ; trapping
> instruction [cfs_rq = group_cfs_rq(se);]
>   31:   48 85 db                test   %rbx,%rbx
>   34:   75 d1                   jne    0x7
>   36:   48 89 c7                mov    %rax,%rdi
> 
> At 2a: RAX = se == NULL after pick_next_entity(). Looking closely into
> pick_next_entity(), it can indeed return NULL if curr is NULL and if
> left in pick_next_entity() is NULL. Per line 7:, curr is in %r14 and
> indeed 0.
> 
> Thoughts?

It is possible for ->curr and ->leftmost to be NULL, but then we should
also have ->nr_running == 0 and not call pick in the first place,
because picking a task from no tasks doesn't make much sense.
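
(To make that invariant concrete, here is a tiny userspace model of the
pick path under discussion; none of these names are the kernel's, it
only mirrors the shape of the argument:)

#include <stdio.h>
#include <stddef.h>

/* stand-ins for sched_entity / cfs_rq; purely illustrative */
struct entity {
	struct entity *my_q;		/* plays the role of group_cfs_rq(se) */
};

struct runqueue {
	unsigned int nr_running;
	struct entity *curr;		/* may be NULL, e.g. on a remote pick */
	struct entity *leftmost;	/* may be NULL if the tree is empty */
};

/* prefer the leftmost queued entity, otherwise fall back to curr */
static struct entity *pick_entity(struct runqueue *rq)
{
	return rq->leftmost ? rq->leftmost : rq->curr;
}

static struct entity *pick_task(struct runqueue *rq)
{
	struct entity *se;

	if (!rq->nr_running)		/* the guard discussed above */
		return NULL;

	se = pick_entity(rq);
	/*
	 * If both curr and leftmost were NULL and we got here anyway,
	 * this dereference is the reported "mov 0x160(%rax)" trap.
	 */
	return se->my_q ? se->my_q : se;
}

int main(void)
{
	struct runqueue empty = { .nr_running = 0 };

	/* with the guard intact this prints a null pointer, no crash */
	printf("picked: %p\n", (void *)pick_task(&empty));
	return 0;
}

In other words, a NULL se out of the pick would mean the pick ran with
nothing runnable, which is exactly what the nr_running check is there
to prevent.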


* [tip: sched/urgent] sched/core: Fix forceidle balancing
  2022-03-30 16:05 [PATCH] sched/core: Fix forceidle balancing Peter Zijlstra
                   ` (2 preceding siblings ...)
  2022-04-01 11:41 ` Peter Zijlstra
@ 2022-04-05  8:22 ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 7+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2022-04-05  8:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Steven Rostedt, Peter Zijlstra (Intel),
	T.J. Alumbaugh, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     5b6547ed97f4f5dfc23f8e3970af6d11d7b7ed7e
Gitweb:        https://git.kernel.org/tip/5b6547ed97f4f5dfc23f8e3970af6d11d7b7ed7e
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 16 Mar 2022 22:03:41 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 Apr 2022 09:59:36 +02:00

sched/core: Fix forceidle balancing

Steve reported that ChromeOS encounters the forceidle balancer being
run from rt_mutex_setprio()'s balance_callback() invocation and
explodes.

Now, the forceidle balancer gets queued every time the idle task gets
selected, in set_next_task(), which is strictly too often.
rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

However, rt_mutex_setprio() will explicitly not run this pattern on
the idle task (since priority boosting the idle task is quite insane).
Most other 'change' pattern users are pidhash based and would also not
apply to idle.

Also, the change pattern doesn't contain a __balance_callback()
invocation and hence we could have an out-of-band balance-callback,
which *should* trigger the WARN in rq_pin_lock() (which guards against
this exact anti-pattern).

So while none of that explains how this happens, it does indicate that
having it in set_next_task() might not be the most robust option.

Instead, explicitly queue the forceidle balancer from pick_next_task()
when it does indeed result in forceidle selection. Having it here
ensures it can only be triggered under the __schedule() rq->lock
instance, and hence must be run from that context.

This also happens to clean up the code a little, so win-win.

Fixes: d2dfa17bc7de ("sched: Trivial forced-newidle balancer")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: T.J. Alumbaugh <talumbau@chromium.org>
Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-ass.net
---
 kernel/sched/core.c  | 14 ++++++++++----
 kernel/sched/idle.c  |  1 -
 kernel/sched/sched.h |  6 ------
 3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d575b49..017ee78 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5752,6 +5752,8 @@ static inline struct task_struct *pick_task(struct rq *rq)
 
 extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
 
+static void queue_core_balance(struct rq *rq);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -5801,7 +5803,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 
 		rq->core_pick = NULL;
-		return next;
+		goto out;
 	}
 
 	put_prev_task_balance(rq, prev, rf);
@@ -5851,7 +5853,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 */
 			WARN_ON_ONCE(fi_before);
 			task_vruntime_update(rq, next, false);
-			goto done;
+			goto out_set_next;
 		}
 	}
 
@@ -5970,8 +5972,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		resched_curr(rq_i);
 	}
 
-done:
+out_set_next:
 	set_next_task(rq, next);
+out:
+	if (rq->core->core_forceidle_count && next == rq->idle)
+		queue_core_balance(rq);
+
 	return next;
 }
 
@@ -6066,7 +6072,7 @@ static void sched_core_balance(struct rq *rq)
 
 static DEFINE_PER_CPU(struct callback_head, core_balance_head);
 
-void queue_core_balance(struct rq *rq)
+static void queue_core_balance(struct rq *rq)
 {
 	if (!sched_core_enabled(rq))
 		return;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8f8b502..ecb0d70 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -434,7 +434,6 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
-	queue_core_balance(rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 58263f9..8dccb34 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1232,8 +1232,6 @@ static inline bool sched_group_cookie_match(struct rq *rq,
 	return false;
 }
 
-extern void queue_core_balance(struct rq *rq);
-
 static inline bool sched_core_enqueued(struct task_struct *p)
 {
 	return !RB_EMPTY_NODE(&p->core_node);
@@ -1267,10 +1265,6 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
-static inline void queue_core_balance(struct rq *rq)
-{
-}
-
 static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
 {
 	return true;


* Re: [PATCH] sched/core: Fix forceidle balancing
  2022-04-01 11:46   ` Peter Zijlstra
@ 2022-04-05 19:25     ` Joel Fernandes
  0 siblings, 0 replies; 7+ messages in thread
From: Joel Fernandes @ 2022-04-05 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Vincent Guittot, LKML,
	Thomas Gleixner, Sebastian Andrzej Siewior, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Guenter Roeck

On Fri, Apr 1, 2022 at 7:46 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Mar 31, 2022 at 03:00:40PM -0400, Joel Fernandes wrote:
> > Hi,
> >
> > By the way, might be slightly related - we still see crashes with
> > pick_task_fair() in our kernel even with this change:
> > https://lkml.org/lkml/2020/11/17/2137
>
> Please as to not use lkml.org. Please use something with a MsgID in like
> lore.

Yep, will do.

> > Is it possible that when doing pick_task_fair() especially on a remote
> > CPU, both the "cfs_rq->curr" and the rbtree's "left" be NULL with core
> > scheduling? In this case, se will be NULL and can cause crashes right?
> > I think the code assumes this can never happen.
> >
> > +Guenter Roeck  kindly debugged pick_task_fair() in a crash as
> > follows. Copying some details he mentioned in a bug report:
> >
> > Assembler/source:
> >
> >   25:   e8 4f 11 00 00          call   0x1179             ; se =
> > pick_next_entity(cfs_rq, curr);
> >   2a:*  48 8b 98 60 01 00 00    mov    0x160(%rax),%rbx   ; trapping
> > instruction [cfs_rq = group_cfs_rq(se);]
> >   31:   48 85 db                test   %rbx,%rbx
> >   34:   75 d1                   jne    0x7
> >   36:   48 89 c7                mov    %rax,%rdi
> >
> > At 2a: RAX = se == NULL after pick_next_entity(). Looking closely into
> > pick_next_entity(), it can indeed return NULL if curr is NULL and if
> > left in pick_next_entity() is NULL. Per line 7:, curr is in %r14 and
> > indeed 0.
> >
> > Thoughts?
>
> It is possible for ->curr and ->leftmost to be NULL, but then we should
> also be having ->nr_running == 0 and not call pick in the first place.
> Because picking a task from no tasks doesn't make much sense.

Indeed, the code checks nr_running, so it is really bizarre. My
guess is this is kernel memory corruption due to an unrelated bug or
something; it is also not easy to trigger.

Thanks,

 - Joel

