All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sched/core: fix pick_next_task 'max' tracking
@ 2021-08-18  0:56 Josh Don
  2021-08-18  4:35 ` Tao Zhou
  2021-08-23 11:16 ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Josh Don @ 2021-08-18  0:56 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, Vineeth Pillai,
	linux-kernel, Josh Don

For core-sched, pick_next_task will update the 'max' task if there is a
cookie mismatch (since in this case the new task must be of higher
priority than the current max). However, we fail to update 'max' if
we've found a task with a matching cookie and higher priority than
'max'.

This can result in extra iterations on SMT-X machines, where X > 2.

As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
the following, in order:

- SMT-0: p1, with cookie A, and priority 10 (max = p1)
- SMT-1: p2, with cookie A, and priority 30 (max not updated here)
- SMT-2: p3, with cookie B, and priority 20 (max = p2)
	> invalidate the other picks and retry

Here, we should have instead updated 'max' when picking for SMT-1. Note
that this code would eventually have righted itself, since the retry
loop would re-pick p2, and update 'max' accordingly. However, this patch
avoids the extra round-trip.

Signed-off-by: Josh Don <joshdon@google.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3431939699dc..110ea7582a33 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5623,6 +5623,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 					occ = 1;
 					goto again;
 				}
+			} else if (prio_less(max, p, fi_before)) {
+				max = p;
 			}
 		}
 	}
-- 
2.33.0.rc1.237.g0d66db33f3-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-18  0:56 [PATCH] sched/core: fix pick_next_task 'max' tracking Josh Don
@ 2021-08-18  4:35 ` Tao Zhou
  2021-08-18 15:18   ` Tao Zhou
  2021-08-23 11:16 ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Tao Zhou @ 2021-08-18  4:35 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, Vineeth Pillai,
	linux-kernel, tao.zhou

Hi Josh,

On Tue, Aug 17, 2021 at 05:56:15PM -0700, Josh Don wrote:
> For core-sched, pick_next_task will update the 'max' task if there is a
> cookie mismatch (since in this case the new task must be of higher
> priority than the current max). However, we fail to update 'max' if
> we've found a task with a matching cookie and higher priority than
> 'max'.
> 
> This can result in extra iterations on SMT-X machines, where X > 2.
> 
> As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
> the following, in order:
> 
> - SMT-0: p1, with cookie A, and priority 10 (max = p1)
> - SMT-1: p2, with cookie A, and priority 30 (max not updated here)

Thanks for your illustration. Good catch.
The guilty is 'cookie_equals(class_pick, cookie))' condition in pick_task()

> - SMT-2: p3, with cookie B, and priority 20 (max = p2)
> 	> invalidate the other picks and retry
> 
> Here, we should have instead updated 'max' when picking for SMT-1. Note
> that this code would eventually have righted itself, since the retry
> loop would re-pick p2, and update 'max' accordingly. However, this patch
> avoids the extra round-trip.

This is correct then may increase the chance to retry. That means it is
more possible to filter the max first(not sure).

> Signed-off-by: Josh Don <joshdon@google.com>
> ---
>  kernel/sched/core.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3431939699dc..110ea7582a33 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5623,6 +5623,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  					occ = 1;
>  					goto again;
>  				}
> +			} else if (prio_less(max, p, fi_before)) {
> +				max = p;
>  			}
>  		}
>  	}
> -- 
> 2.33.0.rc1.237.g0d66db33f3-goog
> 


Thanks,
Tao

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-18  4:35 ` Tao Zhou
@ 2021-08-18 15:18   ` Tao Zhou
  0 siblings, 0 replies; 17+ messages in thread
From: Tao Zhou @ 2021-08-18 15:18 UTC (permalink / raw)
  To: Josh Don, Tao Zhou
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, Vineeth Pillai,
	linux-kernel

Hi,

On Wed, Aug 18, 2021 at 12:35:13PM +0800, Tao Zhou wrote:
> Hi Josh,
> 
> On Tue, Aug 17, 2021 at 05:56:15PM -0700, Josh Don wrote:
> > For core-sched, pick_next_task will update the 'max' task if there is a
> > cookie mismatch (since in this case the new task must be of higher
> > priority than the current max). However, we fail to update 'max' if
> > we've found a task with a matching cookie and higher priority than
> > 'max'.
> > 
> > This can result in extra iterations on SMT-X machines, where X > 2.
> > 
> > As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
> > the following, in order:
> > 
> > - SMT-0: p1, with cookie A, and priority 10 (max = p1)
> > - SMT-1: p2, with cookie A, and priority 30 (max not updated here)
> 
> Thanks for your illustration. Good catch.
> The guilty is 'cookie_equals(class_pick, cookie))' condition in pick_task()
> 
> > - SMT-2: p3, with cookie B, and priority 20 (max = p2)
> > 	> invalidate the other picks and retry
> > 
> > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > that this code would eventually have righted itself, since the retry
> > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > avoids the extra round-trip.
> 
> This is correct then may increase the chance to retry. That means it is
> more possible to filter the max first(not sure).
> 
> > Signed-off-by: Josh Don <joshdon@google.com>
> > ---
> >  kernel/sched/core.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 3431939699dc..110ea7582a33 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5623,6 +5623,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  					occ = 1;
> >  					goto again;
> >  				}
> > +			} else if (prio_less(max, p, fi_before)) {
> > +				max = p;
> >  			}
> >  		}
> >  	}
> > -- 
> > 2.33.0.rc1.237.g0d66db33f3-goog
> > 

My ugly patch need to continue, not compiled.
Rebase on the previous patch.

From 79f25599486c79ecb2e78875498b24f1320060b8 Mon Sep 17 00:00:00 2001
From: Tao Zhou <tao.zhou@linux.dev>
Date: Wed, 18 Aug 2021 00:07:38 +0800
Subject: [PATCH] optimize pick_next_task()

sched/core: Optimize pick_next_task() not sure

---
 kernel/sched/core.c  | 78 ++++++++++++++++++++++++++++++--------------
 kernel/sched/sched.h |  1 +
 2 files changed, 54 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20ffcc044134..18f236f75170 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5370,28 +5370,42 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 		if (max && class_pick->core_cookie &&
 		    prio_less(class_pick, max, in_fi))
 			return idle_sched_class.pick_task(rq);
-
-		return class_pick;
 	}
 
-	/*
-	 * If class_pick is idle or matches cookie, return early.
-	 */
-	if (cookie_equals(class_pick, cookie))
-		return class_pick;
+	return class_pick;
+}
+
+static task_struct *
+filter_max_prio(struct rq *rq, struct task_struct *class_pick, struct task_struct *max, bool in_fi)
+{
+	unsigned long cookie = rq->core->core_cookie;
+	struct task_struct *cookie_pick = NULL;
+	bool cookie_core_temp = false;
 
-	cookie_pick = sched_core_find(rq, cookie);
+	rq->core_temp = class_pick;
 
-	/*
-	 * If class > max && class > cookie, it is the highest priority task on
-	 * the core (so far) and it must be selected, otherwise we must go with
-	 * the cookie pick in order to satisfy the constraint.
-	 */
-	if (prio_less(cookie_pick, class_pick, in_fi) &&
-	    (!max || prio_less(max, class_pick, in_fi)))
-		return class_pick;
+	if (cookie) {
+		if (!cookie_equals(class_pick, cookie)) {
+			cookie_pick = sched_core_find(rq, cookie);
+			/*
+			 * If class > max && class > cookie, it is the
+			 * highest priority task on the core (so far)
+			 * and it must be selected, otherwise we must
+			 * go with the cookie pick in order to satisfy
+			 * the constraint.
+			 */
+			if (prio_less(cookie_pick, class_pick, in_fi) &&
+			    (!max || prio_less(max, class_pick, in_fi)))
+				return class_pick;
+			cookie_core_temp = true;
+		} else if (prio_less(max, class_pick, fi_before))
+				return class_pick;
+	}
 
-	return cookie_pick;
+	if (cookie_core_temp)
+		rq->core_temp = cookie_pick;
+
+	return NULL;
 }
 
 extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
@@ -5508,24 +5522,37 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * order.
 	 */
 	for_each_class(class) {
-again:
+		struct task_struct *class_pick, *cookie_pick;
+		struct rq *rq_i;
+
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			rq_i = cpu_rq(i);
+			class_pick = pick_task(rq_i, class, max, fi_before);
+			/*
+			 * This sibling doesn't yet have a suitable task to run.
+			 */
+			if (!class_pick)
+				continue;
+
+			if (filter_max_prio(rq_i, class_pick, max, fi_before))
+				max = class_pick;
+		}
+
 		for_each_cpu_wrap(i, smt_mask, cpu) {
-			struct rq *rq_i = cpu_rq(i);
 			struct task_struct *p;
+			rq_i = cpu_rq(i);
 
 			if (rq_i->core_pick)
 				continue;
 
 			/*
-			 * If this sibling doesn't yet have a suitable task to
-			 * run; ask for the most eligible task, given the
-			 * highest priority task already selected for this
-			 * core.
+			 * This sibling doesn't yet have a suitable task to run.
 			 */
-			p = pick_task(rq_i, class, max, fi_before);
-			if (!p)
+			if (!rq_i->core_temp)
 				continue;
 
+			p = rq_i->core_temp;
+
 			if (!is_task_rq_idle(p))
 				occ++;
 
@@ -9024,6 +9051,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
 		rq->core_pick = NULL;
+		rq->core_temp = NULL;
 		rq->core_enabled = 0;
 		rq->core_tree = RB_ROOT;
 		rq->core_forceidle = false;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14a41a243f7b..2b21a3846b8e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1089,6 +1089,7 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	struct task_struct	*core_pick;
+	struct task_struct	*core_temp;
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
-- 
2.31.1



Thanks,
Tao

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-18  0:56 [PATCH] sched/core: fix pick_next_task 'max' tracking Josh Don
  2021-08-18  4:35 ` Tao Zhou
@ 2021-08-23 11:16 ` Peter Zijlstra
  2021-08-23 15:38   ` Tao Zhou
                     ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-08-23 11:16 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, Vineeth Pillai,
	linux-kernel, tao.zhou

On Tue, Aug 17, 2021 at 05:56:15PM -0700, Josh Don wrote:
> For core-sched, pick_next_task will update the 'max' task if there is a
> cookie mismatch (since in this case the new task must be of higher
> priority than the current max). However, we fail to update 'max' if
> we've found a task with a matching cookie and higher priority than
> 'max'.
> 
> This can result in extra iterations on SMT-X machines, where X > 2.
> 
> As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
> the following, in order:
> 
> - SMT-0: p1, with cookie A, and priority 10 (max = p1)
> - SMT-1: p2, with cookie A, and priority 30 (max not updated here)
> - SMT-2: p3, with cookie B, and priority 20 (max = p2)
> 	> invalidate the other picks and retry
> 
> Here, we should have instead updated 'max' when picking for SMT-1. Note
> that this code would eventually have righted itself, since the retry
> loop would re-pick p2, and update 'max' accordingly. However, this patch
> avoids the extra round-trip.

Going with the observation Tao made; how about we rewrite the whole lot
to not be mind-bending complicated :-)

How's this? It seems to build and pass the core-sched selftest thingy
(so it must be perfect, right? :-)

---
 kernel/sched/core.c  | 147 ++++++++++++++-------------------------------------
 kernel/sched/sched.h |   1 +
 2 files changed, 41 insertions(+), 107 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ceae25ea8a0e..e896250c2fb8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5404,66 +5404,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
-// XXX fairness/fwd progress conditions
-/*
- * Returns
- * - NULL if there is no runnable task for this class.
- * - the highest priority task for this runqueue if it matches
- *   rq->core->core_cookie or its priority is greater than max.
- * - Else returns idle_task.
- */
-static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
-{
-	struct task_struct *class_pick, *cookie_pick;
-	unsigned long cookie = rq->core->core_cookie;
-
-	class_pick = class->pick_task(rq);
-	if (!class_pick)
-		return NULL;
-
-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max, in_fi))
-			return idle_sched_class.pick_task(rq);
-
-		return class_pick;
-	}
-
-	/*
-	 * If class_pick is idle or matches cookie, return early.
-	 */
-	if (cookie_equals(class_pick, cookie))
-		return class_pick;
-
-	cookie_pick = sched_core_find(rq, cookie);
-
-	/*
-	 * If class > max && class > cookie, it is the highest priority task on
-	 * the core (so far) and it must be selected, otherwise we must go with
-	 * the cookie pick in order to satisfy the constraint.
-	 */
-	if (prio_less(cookie_pick, class_pick, in_fi) &&
-	    (!max || prio_less(max, class_pick, in_fi)))
-		return class_pick;
-
-	return cookie_pick;
-}
-
 extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
 
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *next, *max = NULL;
+	struct task_struct *next, *p, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
 	bool fi_before = false;
-	int i, j, cpu, occ = 0;
+	unsigned long cookie;
+	int i, cpu, occ = 0;
+	struct rq *rq_i;
 	bool need_sync;
 
 	if (!sched_core_enabled(rq))
@@ -5554,76 +5506,57 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 	}
 
-	for_each_cpu(i, smt_mask) {
-		struct rq *rq_i = cpu_rq(i);
-
+	/*
+	 * For each thread: do the regular task pick and find the max prio task
+	 * amongst them.
+	 *
+	 * Tie-break prio towards the current CPU
+	 */
+	for_each_cpu_wrap(i, smt_mask, cpu) {
+		rq_i = cpu_rq(i);
 		rq_i->core_pick = NULL;
 
 		if (i != cpu)
 			update_rq_clock(rq_i);
+
+		for_each_class(class) {
+			p = rq_i->core_temp = class->pick_task(rq_i);
+			if (p)
+				break;
+		}
+
+		if (!max || prio_less(max, p, fi_before))
+			max = p;
 	}
 
+	cookie = rq->core->core_cookie = max->core_cookie;
+
 	/*
-	 * Try and select tasks for each sibling in descending sched_class
-	 * order.
+	 * For each thread: try and find a runnable task that matches @max or
+	 * force idle.
 	 */
-	for_each_class(class) {
-again:
-		for_each_cpu_wrap(i, smt_mask, cpu) {
-			struct rq *rq_i = cpu_rq(i);
-			struct task_struct *p;
-
-			if (rq_i->core_pick)
-				continue;
+	for_each_cpu(i, smt_mask) {
+		rq_i = cpu_rq(i);
+		p = rq_i->core_temp;
 
-			/*
-			 * If this sibling doesn't yet have a suitable task to
-			 * run; ask for the most eligible task, given the
-			 * highest priority task already selected for this
-			 * core.
-			 */
-			p = pick_task(rq_i, class, max, fi_before);
+		if (!cookie_equals(p, cookie)) {
+			p = NULL;
+			if (cookie)
+				p = sched_core_find(rq_i, cookie);
 			if (!p)
-				continue;
+				p = idle_sched_class.pick_task(rq_i);
+		}
 
-			if (!is_task_rq_idle(p))
-				occ++;
+		rq_i->core_pick = p;
 
-			rq_i->core_pick = p;
-			if (rq_i->idle == p && rq_i->nr_running) {
+		if (p == rq_i->idle) {
+			if (rq_i->nr_running) {
 				rq->core->core_forceidle = true;
 				if (!fi_before)
 					rq->core->core_forceidle_seq++;
 			}
-
-			/*
-			 * If this new candidate is of higher priority than the
-			 * previous; and they're incompatible; we need to wipe
-			 * the slate and start over. pick_task makes sure that
-			 * p's priority is more than max if it doesn't match
-			 * max's cookie.
-			 *
-			 * NOTE: this is a linear max-filter and is thus bounded
-			 * in execution time.
-			 */
-			if (!max || !cookie_match(max, p)) {
-				struct task_struct *old_max = max;
-
-				rq->core->core_cookie = p->core_cookie;
-				max = p;
-
-				if (old_max) {
-					rq->core->core_forceidle = false;
-					for_each_cpu(j, smt_mask) {
-						if (j == i)
-							continue;
-
-						cpu_rq(j)->core_pick = NULL;
-					}
-					occ = 1;
-					goto again;
-				}
-			}
+		} else {
+			occ++;
 		}
 	}
 
@@ -5643,7 +5576,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * non-matching user state.
 	 */
 	for_each_cpu(i, smt_mask) {
-		struct rq *rq_i = cpu_rq(i);
+		rq_i = cpu_rq(i);
 
 		/*
 		 * An online sibling might have gone offline before a task
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d9f8d73a1d84..0760b460903a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1091,6 +1091,7 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	struct task_struct	*core_pick;
+	struct task_struct	*core_temp;
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-23 11:16 ` Peter Zijlstra
@ 2021-08-23 15:38   ` Tao Zhou
  2021-08-23 20:25   ` Vineeth Pillai
  2021-08-23 23:24   ` [PATCH] sched/core: fix pick_next_task 'max' tracking Josh Don
  2 siblings, 0 replies; 17+ messages in thread
From: Tao Zhou @ 2021-08-23 15:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, Vineeth Pillai,
	linux-kernel, tao.zhou

Hi Peter,

On Mon, Aug 23, 2021 at 01:16:56PM +0200, Peter Zijlstra wrote:

> On Tue, Aug 17, 2021 at 05:56:15PM -0700, Josh Don wrote:
> > For core-sched, pick_next_task will update the 'max' task if there is a
> > cookie mismatch (since in this case the new task must be of higher
> > priority than the current max). However, we fail to update 'max' if
> > we've found a task with a matching cookie and higher priority than
> > 'max'.
> > 
> > This can result in extra iterations on SMT-X machines, where X > 2.
> > 
> > As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
> > the following, in order:
> > 
> > - SMT-0: p1, with cookie A, and priority 10 (max = p1)
> > - SMT-1: p2, with cookie A, and priority 30 (max not updated here)
> > - SMT-2: p3, with cookie B, and priority 20 (max = p2)
> > 	> invalidate the other picks and retry
> > 
> > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > that this code would eventually have righted itself, since the retry
> > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > avoids the extra round-trip.
> 
> Going with the observation Tao made; how about we rewrite the whole lot
> to not be mind-bending complicated :-)
> 
> How's this? It seems to build and pass the core-sched selftest thingy
> (so it must be perfect, right? :-)
> 
> ---
>  kernel/sched/core.c  | 147 ++++++++++++++-------------------------------------
>  kernel/sched/sched.h |   1 +
>  2 files changed, 41 insertions(+), 107 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ceae25ea8a0e..e896250c2fb8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5404,66 +5404,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>  	return a->core_cookie == b->core_cookie;
>  }
>  
> -// XXX fairness/fwd progress conditions
> -/*
> - * Returns
> - * - NULL if there is no runnable task for this class.
> - * - the highest priority task for this runqueue if it matches
> - *   rq->core->core_cookie or its priority is greater than max.
> - * - Else returns idle_task.
> - */
> -static struct task_struct *
> -pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
> -{
> -	struct task_struct *class_pick, *cookie_pick;
> -	unsigned long cookie = rq->core->core_cookie;
> -
> -	class_pick = class->pick_task(rq);
> -	if (!class_pick)
> -		return NULL;
> -
> -	if (!cookie) {
> -		/*
> -		 * If class_pick is tagged, return it only if it has
> -		 * higher priority than max.
> -		 */
> -		if (max && class_pick->core_cookie &&
> -		    prio_less(class_pick, max, in_fi))
> -			return idle_sched_class.pick_task(rq);
> -
> -		return class_pick;
> -	}
> -
> -	/*
> -	 * If class_pick is idle or matches cookie, return early.
> -	 */
> -	if (cookie_equals(class_pick, cookie))
> -		return class_pick;
> -
> -	cookie_pick = sched_core_find(rq, cookie);
> -
> -	/*
> -	 * If class > max && class > cookie, it is the highest priority task on
> -	 * the core (so far) and it must be selected, otherwise we must go with
> -	 * the cookie pick in order to satisfy the constraint.
> -	 */
> -	if (prio_less(cookie_pick, class_pick, in_fi) &&
> -	    (!max || prio_less(max, class_pick, in_fi)))
> -		return class_pick;
> -
> -	return cookie_pick;
> -}
> -
>  extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
>  
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> -	struct task_struct *next, *max = NULL;
> +	struct task_struct *next, *p, *max = NULL;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
>  	bool fi_before = false;
> -	int i, j, cpu, occ = 0;
> +	unsigned long cookie;
> +	int i, cpu, occ = 0;
> +	struct rq *rq_i;
>  	bool need_sync;
>  
>  	if (!sched_core_enabled(rq))
> @@ -5554,76 +5506,57 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  		}
>  	}
>  
> -	for_each_cpu(i, smt_mask) {
> -		struct rq *rq_i = cpu_rq(i);
> -
> +	/*
> +	 * For each thread: do the regular task pick and find the max prio task
> +	 * amongst them.
> +	 *
> +	 * Tie-break prio towards the current CPU
> +	 */
> +	for_each_cpu_wrap(i, smt_mask, cpu) {
> +		rq_i = cpu_rq(i);
>  		rq_i->core_pick = NULL;
>  
>  		if (i != cpu)
>  			update_rq_clock(rq_i);
> +
> +		for_each_class(class) {
> +			p = rq_i->core_temp = class->pick_task(rq_i);
> +			if (p)
> +				break;
> +		}
> +
> +		if (!max || prio_less(max, p, fi_before))

The above 'prio_less(max, p, fi_before)' condition covers the
case of Josh's fix if I'm not wrong.

The rewriting code looks clarity and simpler..

> +			max = p;
>  	}
>  
> +	cookie = rq->core->core_cookie = max->core_cookie;
> +
>  	/*
> -	 * Try and select tasks for each sibling in descending sched_class
> -	 * order.
> +	 * For each thread: try and find a runnable task that matches @max or
> +	 * force idle.
>  	 */
> -	for_each_class(class) {
> -again:
> -		for_each_cpu_wrap(i, smt_mask, cpu) {
> -			struct rq *rq_i = cpu_rq(i);
> -			struct task_struct *p;
> -
> -			if (rq_i->core_pick)
> -				continue;
> +	for_each_cpu(i, smt_mask) {
> +		rq_i = cpu_rq(i);
> +		p = rq_i->core_temp;
>  
> -			/*
> -			 * If this sibling doesn't yet have a suitable task to
> -			 * run; ask for the most eligible task, given the
> -			 * highest priority task already selected for this
> -			 * core.
> -			 */
> -			p = pick_task(rq_i, class, max, fi_before);
> +		if (!cookie_equals(p, cookie)) {
> +			p = NULL;
> +			if (cookie)
> +				p = sched_core_find(rq_i, cookie);
>  			if (!p)
> -				continue;
> +				p = idle_sched_class.pick_task(rq_i);
> +		}
>  
> -			if (!is_task_rq_idle(p))
> -				occ++;
> +		rq_i->core_pick = p;
>  
> -			rq_i->core_pick = p;
> -			if (rq_i->idle == p && rq_i->nr_running) {
> +		if (p == rq_i->idle) {
> +			if (rq_i->nr_running) {
>  				rq->core->core_forceidle = true;
>  				if (!fi_before)
>  					rq->core->core_forceidle_seq++;
>  			}
> -
> -			/*
> -			 * If this new candidate is of higher priority than the
> -			 * previous; and they're incompatible; we need to wipe
> -			 * the slate and start over. pick_task makes sure that
> -			 * p's priority is more than max if it doesn't match
> -			 * max's cookie.
> -			 *
> -			 * NOTE: this is a linear max-filter and is thus bounded
> -			 * in execution time.
> -			 */
> -			if (!max || !cookie_match(max, p)) {
> -				struct task_struct *old_max = max;
> -
> -				rq->core->core_cookie = p->core_cookie;
> -				max = p;
> -
> -				if (old_max) {
> -					rq->core->core_forceidle = false;
> -					for_each_cpu(j, smt_mask) {
> -						if (j == i)
> -							continue;
> -
> -						cpu_rq(j)->core_pick = NULL;
> -					}
> -					occ = 1;
> -					goto again;
> -				}
> -			}
> +		} else {
> +			occ++;
>  		}
>  	}
>  
> @@ -5643,7 +5576,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	 * non-matching user state.
>  	 */
>  	for_each_cpu(i, smt_mask) {
> -		struct rq *rq_i = cpu_rq(i);
> +		rq_i = cpu_rq(i);
>  
>  		/*
>  		 * An online sibling might have gone offline before a task
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d9f8d73a1d84..0760b460903a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1091,6 +1091,7 @@ struct rq {
>  	/* per rq */
>  	struct rq		*core;
>  	struct task_struct	*core_pick;
> +	struct task_struct	*core_temp;
>  	unsigned int		core_enabled;
>  	unsigned int		core_sched_seq;
>  	struct rb_root		core_tree;




Thanks,
Tao

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-23 11:16 ` Peter Zijlstra
  2021-08-23 15:38   ` Tao Zhou
@ 2021-08-23 20:25   ` Vineeth Pillai
  2021-08-23 22:57     ` Tao Zhou
  2021-08-24  9:03     ` Peter Zijlstra
  2021-08-23 23:24   ` [PATCH] sched/core: fix pick_next_task 'max' tracking Josh Don
  2 siblings, 2 replies; 17+ messages in thread
From: Vineeth Pillai @ 2021-08-23 20:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, linux-kernel,
	tao.zhou

Hi Peter,


> > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > that this code would eventually have righted itself, since the retry
> > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > avoids the extra round-trip.
>
> Going with the observation Tao made; how about we rewrite the whole lot
> to not be mind-bending complicated :-)
>
> How's this? It seems to build and pass the core-sched selftest thingy
> (so it must be perfect, right? :-)
>
Nice, the code is much simpler now :-). A minor suggestion down..

> -       for_each_cpu(i, smt_mask) {
> -               struct rq *rq_i = cpu_rq(i);
> -
> +       /*
> +        * For each thread: do the regular task pick and find the max prio task
> +        * amongst them.
> +        *
> +        * Tie-break prio towards the current CPU
> +        */
> +       for_each_cpu_wrap(i, smt_mask, cpu) {
> +               rq_i = cpu_rq(i);
>                 rq_i->core_pick = NULL;
>
>                 if (i != cpu)
>                         update_rq_clock(rq_i);
> +
> +               for_each_class(class) {
> +                       p = rq_i->core_temp = class->pick_task(rq_i);
I think we can use core_pick to store the pick here and core_temp
might not be required. What do you feel?

> +                       if (p)
> +                               break;
> +               }
> +
> +               if (!max || prio_less(max, p, fi_before))
> +                       max = p;


Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-23 20:25   ` Vineeth Pillai
@ 2021-08-23 22:57     ` Tao Zhou
  2021-08-24  9:03     ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Tao Zhou @ 2021-08-23 22:57 UTC (permalink / raw)
  To: Vineeth Pillai
  Cc: Peter Zijlstra, Josh Don, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Joel Fernandes,
	linux-kernel, tao.zhou

Hi Vineeth,

On Mon, Aug 23, 2021 at 04:25:28PM -0400, Vineeth Pillai wrote:
> Hi Peter,
> 
> 
> > > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > > that this code would eventually have righted itself, since the retry
> > > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > > avoids the extra round-trip.
> >
> > Going with the observation Tao made; how about we rewrite the whole lot
> > to not be mind-bending complicated :-)
> >
> > How's this? It seems to build and pass the core-sched selftest thingy
> > (so it must be perfect, right? :-)
> >
> Nice, the code is much simpler now :-). A minor suggestion down..
> 
> > -       for_each_cpu(i, smt_mask) {
> > -               struct rq *rq_i = cpu_rq(i);
> > -
> > +       /*
> > +        * For each thread: do the regular task pick and find the max prio task
> > +        * amongst them.
> > +        *
> > +        * Tie-break prio towards the current CPU
> > +        */
> > +       for_each_cpu_wrap(i, smt_mask, cpu) {
> > +               rq_i = cpu_rq(i);
> >                 rq_i->core_pick = NULL;
> >
> >                 if (i != cpu)
> >                         update_rq_clock(rq_i);
> > +
> > +               for_each_class(class) {
> > +                       p = rq_i->core_temp = class->pick_task(rq_i);
> I think we can use core_pick to store the pick here and core_temp
> might not be required. What do you feel?

You're right.

The @core_temp load the class pick, the @core_pick is the final
pick(class or cookie). Using @core_pick to store class pick first
and then the final pick is right and save the bytes) but just a
little not clarity from my end :-)

> > +                       if (p)
> > +                               break;
> > +               }
> > +
> > +               if (!max || prio_less(max, p, fi_before))
> > +                       max = p;
> 
> 
> Thanks,
> Vineeth



Thanks,
Tao

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-23 11:16 ` Peter Zijlstra
  2021-08-23 15:38   ` Tao Zhou
  2021-08-23 20:25   ` Vineeth Pillai
@ 2021-08-23 23:24   ` Josh Don
  2021-08-24  3:01     ` Tao Zhou
  2021-08-24  8:55     ` Peter Zijlstra
  2 siblings, 2 replies; 17+ messages in thread
From: Josh Don @ 2021-08-23 23:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, Vineeth Pillai,
	linux-kernel, Tao Zhou

Hi Peter,

On Mon, Aug 23, 2021 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
[snip]
> +       for_each_cpu(i, smt_mask) {
> +               rq_i = cpu_rq(i);
> +               p = rq_i->core_temp;
>
> -                       /*
> -                        * If this sibling doesn't yet have a suitable task to
> -                        * run; ask for the most eligible task, given the
> -                        * highest priority task already selected for this
> -                        * core.
> -                        */
> -                       p = pick_task(rq_i, class, max, fi_before);
> +               if (!cookie_equals(p, cookie)) {
> +                       p = NULL;
> +                       if (cookie)
> +                               p = sched_core_find(rq_i, cookie);

In the case that 'max' has a zero cookie, shouldn't we search for a
match on this cpu if the original class pick ('p') had a non-zero
cookie? We don't enqueue tasks with zero cookie in the core_tree, so I
forget if there was some other reasoning here.

>                         if (!p)
> -                               continue;
> +                               p = idle_sched_class.pick_task(rq_i);
> +               }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-23 23:24   ` [PATCH] sched/core: fix pick_next_task 'max' tracking Josh Don
@ 2021-08-24  3:01     ` Tao Zhou
  2021-08-24  8:55     ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Tao Zhou @ 2021-08-24  3:01 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, Vineeth Pillai,
	linux-kernel, tao.zhou

Hi Josh,

On Mon, Aug 23, 2021 at 04:24:26PM -0700, Josh Don wrote:
> Hi Peter,
> 
> On Mon, Aug 23, 2021 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> [snip]
> > +       for_each_cpu(i, smt_mask) {
> > +               rq_i = cpu_rq(i);
> > +               p = rq_i->core_temp;
> >
> > -                       /*
> > -                        * If this sibling doesn't yet have a suitable task to
> > -                        * run; ask for the most eligible task, given the
> > -                        * highest priority task already selected for this
> > -                        * core.
> > -                        */
> > -                       p = pick_task(rq_i, class, max, fi_before);
> > +               if (!cookie_equals(p, cookie)) {
> > +                       p = NULL;
> > +                       if (cookie)
> > +                               p = sched_core_find(rq_i, cookie);
> 
> In the case that 'max' has a zero cookie, shouldn't we search for a

In the original pick_task(), when cookie is zero, we choose class pick
task or force idle task.

> match on this cpu if the original class pick ('p') had a non-zero
> cookie? We don't enqueue tasks with zero cookie in the core_tree, so I
> forget if there was some other reasoning here.

So, no need to search the core_tree when cookie is zero.

But I'm not sure that force idle pick condition(in pick_task())
is also covered by this clause in the first filter max loop in
the rewrite..
  
  '
  if (!max || prio_less(max, p, fi_before))
		max = p;
  '

I just thought there are three(one add by Josh) places that can return
max;

1, !cookie condition and class_pick > max
2, cookie equal condition and class_pick > max(add by Josh)
3, cookie not equal condition and class_pick > max.

The rewrite change a little with the class pick, it loops all class to
find the max first(finally will get one task and not be possible return
NULL). This is not the same with the original.

It is very possible that I'm getting wrong, then please shout to me.

> >                         if (!p)
> > -                               continue;
> > +                               p = idle_sched_class.pick_task(rq_i);
> > +               }



Thanks,
Tao

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-23 23:24   ` [PATCH] sched/core: fix pick_next_task 'max' tracking Josh Don
  2021-08-24  3:01     ` Tao Zhou
@ 2021-08-24  8:55     ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-08-24  8:55 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, Vineeth Pillai,
	linux-kernel, Tao Zhou

On Mon, Aug 23, 2021 at 04:24:26PM -0700, Josh Don wrote:
> Hi Peter,
> 
> On Mon, Aug 23, 2021 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> [snip]
> > +       for_each_cpu(i, smt_mask) {
> > +               rq_i = cpu_rq(i);
> > +               p = rq_i->core_temp;
> >
> > -                       /*
> > -                        * If this sibling doesn't yet have a suitable task to
> > -                        * run; ask for the most eligible task, given the
> > -                        * highest priority task already selected for this
> > -                        * core.
> > -                        */
> > -                       p = pick_task(rq_i, class, max, fi_before);
> > +               if (!cookie_equals(p, cookie)) {
> > +                       p = NULL;
> > +                       if (cookie)
> > +                               p = sched_core_find(rq_i, cookie);
> 
> In the case that 'max' has a zero cookie, shouldn't we search for a
> match on this cpu if the original class pick ('p') had a non-zero
> cookie? We don't enqueue tasks with zero cookie in the core_tree, so I
> forget if there was some other reasoning here.

IIRC we don't keep 0-cookies in the tree. Lemme check.

Yeah, see sched_core_enqueue(), they bail for 0-cookie.

This is indeed sub-optimal, but also having the 0-cookie tasks in the
tree has other issues IIRC.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: fix pick_next_task 'max' tracking
  2021-08-23 20:25   ` Vineeth Pillai
  2021-08-23 22:57     ` Tao Zhou
@ 2021-08-24  9:03     ` Peter Zijlstra
  2021-08-24  9:38       ` [PATCH] sched/core: Simplify core-wide task selection Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-08-24  9:03 UTC (permalink / raw)
  To: Vineeth Pillai
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, linux-kernel,
	tao.zhou

On Mon, Aug 23, 2021 at 04:25:28PM -0400, Vineeth Pillai wrote:
> Hi Peter,
> 
> 
> > > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > > that this code would eventually have righted itself, since the retry
> > > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > > avoids the extra round-trip.
> >
> > Going with the observation Tao made; how about we rewrite the whole lot
> > to not be mind-bending complicated :-)
> >
> > How's this? It seems to build and pass the core-sched selftest thingy
> > (so it must be perfect, right? :-)
> >
> Nice, the code is much simpler now :-). A minor suggestion down..
> 
> > -       for_each_cpu(i, smt_mask) {
> > -               struct rq *rq_i = cpu_rq(i);
> > -
> > +       /*
> > +        * For each thread: do the regular task pick and find the max prio task
> > +        * amongst them.
> > +        *
> > +        * Tie-break prio towards the current CPU
> > +        */
> > +       for_each_cpu_wrap(i, smt_mask, cpu) {
> > +               rq_i = cpu_rq(i);
> >                 rq_i->core_pick = NULL;
> >
> >                 if (i != cpu)
> >                         update_rq_clock(rq_i);
> > +
> > +               for_each_class(class) {
> > +                       p = rq_i->core_temp = class->pick_task(rq_i);
> I think we can use core_pick to store the pick here and core_temp
> might not be required. What do you feel?

Indeed we can; makes the code a little less obvious but saves a few
bytes.

Let me go do that and also attempt a Changelog to go with it ;-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] sched/core: Simplify core-wide task selection
  2021-08-24  9:03     ` Peter Zijlstra
@ 2021-08-24  9:38       ` Peter Zijlstra
  2021-08-24 12:15         ` Tao Zhou
                           ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-08-24  9:38 UTC (permalink / raw)
  To: Vineeth Pillai
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, linux-kernel,
	tao.zhou

On Tue, Aug 24, 2021 at 11:03:58AM +0200, Peter Zijlstra wrote:
> Let me go do that and also attempt a Changelog to go with it ;-)

How's this then?

---
Subject: sched/core: Simplify core-wide task selection
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Aug 24 11:05:47 CEST 2021

Tao suggested a two-pass task selection to avoid the retry loop.

Not only does it avoid the retry loop, it results in *much* simpler
code.

This also fixes an issue spotted by Josh Don where, for SMT3+, we can
forget to update max on the first pass and get to do an extra round.

Suggested-by: Tao Zhou <tao.zhou@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 156 +++++++++++++++-------------------------------------
 1 file changed, 45 insertions(+), 111 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ceae25ea8a0e..8a9a32df5f38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5381,8 +5381,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			return p;
 	}
 
-	/* The idle class should always have a runnable task: */
-	BUG();
+	BUG(); /* The idle class should always have a runnable task. */
 }
 
 #ifdef CONFIG_SCHED_CORE
@@ -5404,54 +5403,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
-// XXX fairness/fwd progress conditions
-/*
- * Returns
- * - NULL if there is no runnable task for this class.
- * - the highest priority task for this runqueue if it matches
- *   rq->core->core_cookie or its priority is greater than max.
- * - Else returns idle_task.
- */
-static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
+static inline struct task_struct *pick_task(struct rq *rq)
 {
-	struct task_struct *class_pick, *cookie_pick;
-	unsigned long cookie = rq->core->core_cookie;
-
-	class_pick = class->pick_task(rq);
-	if (!class_pick)
-		return NULL;
-
-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max, in_fi))
-			return idle_sched_class.pick_task(rq);
+	const struct sched_class *class;
+	struct task_struct *p;
 
-		return class_pick;
+	for_each_class(class) {
+		p = class->pick_task(rq);
+		if (p)
+			return p;
 	}
 
-	/*
-	 * If class_pick is idle or matches cookie, return early.
-	 */
-	if (cookie_equals(class_pick, cookie))
-		return class_pick;
-
-	cookie_pick = sched_core_find(rq, cookie);
-
-	/*
-	 * If class > max && class > cookie, it is the highest priority task on
-	 * the core (so far) and it must be selected, otherwise we must go with
-	 * the cookie pick in order to satisfy the constraint.
-	 */
-	if (prio_less(cookie_pick, class_pick, in_fi) &&
-	    (!max || prio_less(max, class_pick, in_fi)))
-		return class_pick;
-
-	return cookie_pick;
+	BUG(); /* The idle class should always have a runnable task. */
 }
 
 extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
@@ -5459,11 +5422,12 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *next, *max = NULL;
-	const struct sched_class *class;
+	struct task_struct *next, *p, *max = NULL;
 	const struct cpumask *smt_mask;
 	bool fi_before = false;
-	int i, j, cpu, occ = 0;
+	unsigned long cookie;
+	int i, cpu, occ = 0;
+	struct rq *rq_i;
 	bool need_sync;
 
 	if (!sched_core_enabled(rq))
@@ -5536,12 +5500,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * and there are no cookied tasks running on siblings.
 	 */
 	if (!need_sync) {
-		for_each_class(class) {
-			next = class->pick_task(rq);
-			if (next)
-				break;
-		}
-
+		next = pick_task(rq);
 		if (!next->core_cookie) {
 			rq->core_pick = NULL;
 			/*
@@ -5554,76 +5513,51 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 	}
 
-	for_each_cpu(i, smt_mask) {
-		struct rq *rq_i = cpu_rq(i);
-
-		rq_i->core_pick = NULL;
+	/*
+	 * For each thread: do the regular task pick and find the max prio task
+	 * amongst them.
+	 *
+	 * Tie-break prio towards the current CPU
+	 */
+	for_each_cpu_wrap(i, smt_mask, cpu) {
+		rq_i = cpu_rq(i);
 
 		if (i != cpu)
 			update_rq_clock(rq_i);
+
+		p = rq_i->core_pick = pick_task(rq_i);
+		if (!max || prio_less(max, p, fi_before))
+			max = p;
 	}
 
+	cookie = rq->core->core_cookie = max->core_cookie;
+
 	/*
-	 * Try and select tasks for each sibling in descending sched_class
-	 * order.
+	 * For each thread: try and find a runnable task that matches @max or
+	 * force idle.
 	 */
-	for_each_class(class) {
-again:
-		for_each_cpu_wrap(i, smt_mask, cpu) {
-			struct rq *rq_i = cpu_rq(i);
-			struct task_struct *p;
-
-			if (rq_i->core_pick)
-				continue;
+	for_each_cpu(i, smt_mask) {
+		rq_i = cpu_rq(i);
+		p = rq_i->core_pick;
 
-			/*
-			 * If this sibling doesn't yet have a suitable task to
-			 * run; ask for the most eligible task, given the
-			 * highest priority task already selected for this
-			 * core.
-			 */
-			p = pick_task(rq_i, class, max, fi_before);
+		if (!cookie_equals(p, cookie)) {
+			p = NULL;
+			if (cookie)
+				p = sched_core_find(rq_i, cookie);
 			if (!p)
-				continue;
+				p = idle_sched_class.pick_task(rq_i);
+		}
 
-			if (!is_task_rq_idle(p))
-				occ++;
+		rq_i->core_pick = p;
 
-			rq_i->core_pick = p;
-			if (rq_i->idle == p && rq_i->nr_running) {
+		if (p == rq_i->idle) {
+			if (rq_i->nr_running) {
 				rq->core->core_forceidle = true;
 				if (!fi_before)
 					rq->core->core_forceidle_seq++;
 			}
-
-			/*
-			 * If this new candidate is of higher priority than the
-			 * previous; and they're incompatible; we need to wipe
-			 * the slate and start over. pick_task makes sure that
-			 * p's priority is more than max if it doesn't match
-			 * max's cookie.
-			 *
-			 * NOTE: this is a linear max-filter and is thus bounded
-			 * in execution time.
-			 */
-			if (!max || !cookie_match(max, p)) {
-				struct task_struct *old_max = max;
-
-				rq->core->core_cookie = p->core_cookie;
-				max = p;
-
-				if (old_max) {
-					rq->core->core_forceidle = false;
-					for_each_cpu(j, smt_mask) {
-						if (j == i)
-							continue;
-
-						cpu_rq(j)->core_pick = NULL;
-					}
-					occ = 1;
-					goto again;
-				}
-			}
+		} else {
+			occ++;
 		}
 	}
 
@@ -5643,7 +5577,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * non-matching user state.
 	 */
 	for_each_cpu(i, smt_mask) {
-		struct rq *rq_i = cpu_rq(i);
+		rq_i = cpu_rq(i);
 
 		/*
 		 * An online sibling might have gone offline before a task

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: Simplify core-wide task selection
  2021-08-24  9:38       ` [PATCH] sched/core: Simplify core-wide task selection Peter Zijlstra
@ 2021-08-24 12:15         ` Tao Zhou
  2021-08-24 17:40         ` Josh Don
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Tao Zhou @ 2021-08-24 12:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vineeth Pillai, Josh Don, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Joel Fernandes,
	linux-kernel, tao.zhou

Hi Peter,

On Tue, Aug 24, 2021 at 11:38:02AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 24, 2021 at 11:03:58AM +0200, Peter Zijlstra wrote:
> > Let me go do that and also attempt a Changelog to go with it ;-)
> 
> How's this then?
> 
> ---
> Subject: sched/core: Simplify core-wide task selection
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Aug 24 11:05:47 CEST 2021
> 
> Tao suggested a two-pass task selection to avoid the retry loop.
> 
> Not only does it avoid the retry loop, it results in *much* simpler
> code.
> 
> This also fixes an issue spotted by Josh Don where, for SMT3+, we can
> forget to update max on the first pass and get to do an extra round.
> 
> Suggested-by: Tao Zhou <tao.zhou@linux.dev>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c | 156 +++++++++++++++-------------------------------------
>  1 file changed, 45 insertions(+), 111 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ceae25ea8a0e..8a9a32df5f38 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5381,8 +5381,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  			return p;
>  	}
>  
> -	/* The idle class should always have a runnable task: */
> -	BUG();
> +	BUG(); /* The idle class should always have a runnable task. */

How can use pick_task() here instead.

The Reported-by is comfortable and enough for me.

Otherwise, Look good to me.

>  }
>  
>  #ifdef CONFIG_SCHED_CORE
> @@ -5404,54 +5403,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>  	return a->core_cookie == b->core_cookie;
>  }
>  
> -// XXX fairness/fwd progress conditions
> -/*
> - * Returns
> - * - NULL if there is no runnable task for this class.
> - * - the highest priority task for this runqueue if it matches
> - *   rq->core->core_cookie or its priority is greater than max.
> - * - Else returns idle_task.
> - */
> -static struct task_struct *
> -pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
> +static inline struct task_struct *pick_task(struct rq *rq)
>  {
> -	struct task_struct *class_pick, *cookie_pick;
> -	unsigned long cookie = rq->core->core_cookie;
> -
> -	class_pick = class->pick_task(rq);
> -	if (!class_pick)
> -		return NULL;
> -
> -	if (!cookie) {
> -		/*
> -		 * If class_pick is tagged, return it only if it has
> -		 * higher priority than max.
> -		 */
> -		if (max && class_pick->core_cookie &&
> -		    prio_less(class_pick, max, in_fi))
> -			return idle_sched_class.pick_task(rq);
> +	const struct sched_class *class;
> +	struct task_struct *p;
>  
> -		return class_pick;
> +	for_each_class(class) {
> +		p = class->pick_task(rq);
> +		if (p)
> +			return p;
>  	}
>  
> -	/*
> -	 * If class_pick is idle or matches cookie, return early.
> -	 */
> -	if (cookie_equals(class_pick, cookie))
> -		return class_pick;
> -
> -	cookie_pick = sched_core_find(rq, cookie);
> -
> -	/*
> -	 * If class > max && class > cookie, it is the highest priority task on
> -	 * the core (so far) and it must be selected, otherwise we must go with
> -	 * the cookie pick in order to satisfy the constraint.
> -	 */
> -	if (prio_less(cookie_pick, class_pick, in_fi) &&
> -	    (!max || prio_less(max, class_pick, in_fi)))
> -		return class_pick;
> -
> -	return cookie_pick;
> +	BUG(); /* The idle class should always have a runnable task. */
>  }
>  
>  extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
> @@ -5459,11 +5422,12 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> -	struct task_struct *next, *max = NULL;
> -	const struct sched_class *class;
> +	struct task_struct *next, *p, *max = NULL;
>  	const struct cpumask *smt_mask;
>  	bool fi_before = false;
> -	int i, j, cpu, occ = 0;
> +	unsigned long cookie;
> +	int i, cpu, occ = 0;
> +	struct rq *rq_i;
>  	bool need_sync;
>  
>  	if (!sched_core_enabled(rq))
> @@ -5536,12 +5500,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	 * and there are no cookied tasks running on siblings.
>  	 */
>  	if (!need_sync) {
> -		for_each_class(class) {
> -			next = class->pick_task(rq);
> -			if (next)
> -				break;
> -		}
> -
> +		next = pick_task(rq);
>  		if (!next->core_cookie) {
>  			rq->core_pick = NULL;
>  			/*
> @@ -5554,76 +5513,51 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  		}
>  	}
>  
> -	for_each_cpu(i, smt_mask) {
> -		struct rq *rq_i = cpu_rq(i);
> -
> -		rq_i->core_pick = NULL;
> +	/*
> +	 * For each thread: do the regular task pick and find the max prio task
> +	 * amongst them.
> +	 *
> +	 * Tie-break prio towards the current CPU
> +	 */
> +	for_each_cpu_wrap(i, smt_mask, cpu) {
> +		rq_i = cpu_rq(i);
>  
>  		if (i != cpu)
>  			update_rq_clock(rq_i);
> +
> +		p = rq_i->core_pick = pick_task(rq_i);
> +		if (!max || prio_less(max, p, fi_before))
> +			max = p;
>  	}
>  
> +	cookie = rq->core->core_cookie = max->core_cookie;
> +
>  	/*
> -	 * Try and select tasks for each sibling in descending sched_class
> -	 * order.
> +	 * For each thread: try and find a runnable task that matches @max or
> +	 * force idle.
>  	 */
> -	for_each_class(class) {
> -again:
> -		for_each_cpu_wrap(i, smt_mask, cpu) {
> -			struct rq *rq_i = cpu_rq(i);
> -			struct task_struct *p;
> -
> -			if (rq_i->core_pick)
> -				continue;
> +	for_each_cpu(i, smt_mask) {
> +		rq_i = cpu_rq(i);
> +		p = rq_i->core_pick;
>  
> -			/*
> -			 * If this sibling doesn't yet have a suitable task to
> -			 * run; ask for the most eligible task, given the
> -			 * highest priority task already selected for this
> -			 * core.
> -			 */
> -			p = pick_task(rq_i, class, max, fi_before);
> +		if (!cookie_equals(p, cookie)) {
> +			p = NULL;
> +			if (cookie)
> +				p = sched_core_find(rq_i, cookie);
>  			if (!p)
> -				continue;
> +				p = idle_sched_class.pick_task(rq_i);
> +		}
>  
> -			if (!is_task_rq_idle(p))
> -				occ++;
> +		rq_i->core_pick = p;
>  
> -			rq_i->core_pick = p;
> -			if (rq_i->idle == p && rq_i->nr_running) {
> +		if (p == rq_i->idle) {
> +			if (rq_i->nr_running) {
>  				rq->core->core_forceidle = true;
>  				if (!fi_before)
>  					rq->core->core_forceidle_seq++;
>  			}
> -
> -			/*
> -			 * If this new candidate is of higher priority than the
> -			 * previous; and they're incompatible; we need to wipe
> -			 * the slate and start over. pick_task makes sure that
> -			 * p's priority is more than max if it doesn't match
> -			 * max's cookie.
> -			 *
> -			 * NOTE: this is a linear max-filter and is thus bounded
> -			 * in execution time.
> -			 */
> -			if (!max || !cookie_match(max, p)) {
> -				struct task_struct *old_max = max;
> -
> -				rq->core->core_cookie = p->core_cookie;
> -				max = p;
> -
> -				if (old_max) {
> -					rq->core->core_forceidle = false;
> -					for_each_cpu(j, smt_mask) {
> -						if (j == i)
> -							continue;
> -
> -						cpu_rq(j)->core_pick = NULL;
> -					}
> -					occ = 1;
> -					goto again;
> -				}
> -			}
> +		} else {
> +			occ++;
>  		}
>  	}
>  
> @@ -5643,7 +5577,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	 * non-matching user state.
>  	 */
>  	for_each_cpu(i, smt_mask) {
> -		struct rq *rq_i = cpu_rq(i);
> +		rq_i = cpu_rq(i);
>  
>  		/*
>  		 * An online sibling might have gone offline before a task



Thanks,
Tao

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: Simplify core-wide task selection
  2021-08-24  9:38       ` [PATCH] sched/core: Simplify core-wide task selection Peter Zijlstra
  2021-08-24 12:15         ` Tao Zhou
@ 2021-08-24 17:40         ` Josh Don
  2021-08-24 18:28         ` Vineeth Pillai
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Josh Don @ 2021-08-24 17:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vineeth Pillai, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, linux-kernel,
	Tao Zhou

On Tue, Aug 24, 2021 at 2:38 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 24, 2021 at 11:03:58AM +0200, Peter Zijlstra wrote:
> > Let me go do that and also attempt a Changelog to go with it ;-)
>
> How's this then?
>
> ---
> Subject: sched/core: Simplify core-wide task selection
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Aug 24 11:05:47 CEST 2021
>
> Tao suggested a two-pass task selection to avoid the retry loop.
>
> Not only does it avoid the retry loop, it results in *much* simpler
> code.
>
> This also fixes an issue spotted by Josh Don where, for SMT3+, we can
> forget to update max on the first pass and get to do an extra round.
>
> Suggested-by: Tao Zhou <tao.zhou@linux.dev>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Josh Don <joshdon@google.com>

> ---
>  kernel/sched/core.c | 156 +++++++++++++++-------------------------------------
>  1 file changed, 45 insertions(+), 111 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ceae25ea8a0e..8a9a32df5f38 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5381,8 +5381,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>                         return p;
>         }
>
> -       /* The idle class should always have a runnable task: */
> -       BUG();
> +       BUG(); /* The idle class should always have a runnable task. */
>  }
>
>  #ifdef CONFIG_SCHED_CORE
> @@ -5404,54 +5403,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>         return a->core_cookie == b->core_cookie;
>  }
>
> -// XXX fairness/fwd progress conditions
> -/*
> - * Returns
> - * - NULL if there is no runnable task for this class.
> - * - the highest priority task for this runqueue if it matches
> - *   rq->core->core_cookie or its priority is greater than max.
> - * - Else returns idle_task.
> - */
> -static struct task_struct *
> -pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
> +static inline struct task_struct *pick_task(struct rq *rq)
>  {
> -       struct task_struct *class_pick, *cookie_pick;
> -       unsigned long cookie = rq->core->core_cookie;
> -
> -       class_pick = class->pick_task(rq);
> -       if (!class_pick)
> -               return NULL;
> -
> -       if (!cookie) {
> -               /*
> -                * If class_pick is tagged, return it only if it has
> -                * higher priority than max.
> -                */
> -               if (max && class_pick->core_cookie &&
> -                   prio_less(class_pick, max, in_fi))
> -                       return idle_sched_class.pick_task(rq);
> +       const struct sched_class *class;
> +       struct task_struct *p;
>
> -               return class_pick;
> +       for_each_class(class) {
> +               p = class->pick_task(rq);
> +               if (p)
> +                       return p;
>         }
>
> -       /*
> -        * If class_pick is idle or matches cookie, return early.
> -        */
> -       if (cookie_equals(class_pick, cookie))
> -               return class_pick;
> -
> -       cookie_pick = sched_core_find(rq, cookie);
> -
> -       /*
> -        * If class > max && class > cookie, it is the highest priority task on
> -        * the core (so far) and it must be selected, otherwise we must go with
> -        * the cookie pick in order to satisfy the constraint.
> -        */
> -       if (prio_less(cookie_pick, class_pick, in_fi) &&
> -           (!max || prio_less(max, class_pick, in_fi)))
> -               return class_pick;
> -
> -       return cookie_pick;
> +       BUG(); /* The idle class should always have a runnable task. */
>  }
>
>  extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
> @@ -5459,11 +5422,12 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> -       struct task_struct *next, *max = NULL;
> -       const struct sched_class *class;
> +       struct task_struct *next, *p, *max = NULL;
>         const struct cpumask *smt_mask;
>         bool fi_before = false;
> -       int i, j, cpu, occ = 0;
> +       unsigned long cookie;
> +       int i, cpu, occ = 0;
> +       struct rq *rq_i;
>         bool need_sync;
>
>         if (!sched_core_enabled(rq))
> @@ -5536,12 +5500,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>          * and there are no cookied tasks running on siblings.
>          */
>         if (!need_sync) {
> -               for_each_class(class) {
> -                       next = class->pick_task(rq);
> -                       if (next)
> -                               break;
> -               }
> -
> +               next = pick_task(rq);
>                 if (!next->core_cookie) {
>                         rq->core_pick = NULL;
>                         /*
> @@ -5554,76 +5513,51 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>                 }
>         }
>
> -       for_each_cpu(i, smt_mask) {
> -               struct rq *rq_i = cpu_rq(i);
> -
> -               rq_i->core_pick = NULL;
> +       /*
> +        * For each thread: do the regular task pick and find the max prio task
> +        * amongst them.
> +        *
> +        * Tie-break prio towards the current CPU
> +        */
> +       for_each_cpu_wrap(i, smt_mask, cpu) {
> +               rq_i = cpu_rq(i);
>
>                 if (i != cpu)
>                         update_rq_clock(rq_i);
> +
> +               p = rq_i->core_pick = pick_task(rq_i);
> +               if (!max || prio_less(max, p, fi_before))
> +                       max = p;
>         }
>
> +       cookie = rq->core->core_cookie = max->core_cookie;
> +
>         /*
> -        * Try and select tasks for each sibling in descending sched_class
> -        * order.
> +        * For each thread: try and find a runnable task that matches @max or
> +        * force idle.
>          */
> -       for_each_class(class) {
> -again:
> -               for_each_cpu_wrap(i, smt_mask, cpu) {
> -                       struct rq *rq_i = cpu_rq(i);
> -                       struct task_struct *p;
> -
> -                       if (rq_i->core_pick)
> -                               continue;
> +       for_each_cpu(i, smt_mask) {
> +               rq_i = cpu_rq(i);
> +               p = rq_i->core_pick;
>
> -                       /*
> -                        * If this sibling doesn't yet have a suitable task to
> -                        * run; ask for the most eligible task, given the
> -                        * highest priority task already selected for this
> -                        * core.
> -                        */
> -                       p = pick_task(rq_i, class, max, fi_before);
> +               if (!cookie_equals(p, cookie)) {
> +                       p = NULL;
> +                       if (cookie)
> +                               p = sched_core_find(rq_i, cookie);
>                         if (!p)
> -                               continue;
> +                               p = idle_sched_class.pick_task(rq_i);
> +               }
>
> -                       if (!is_task_rq_idle(p))
> -                               occ++;
> +               rq_i->core_pick = p;
>
> -                       rq_i->core_pick = p;
> -                       if (rq_i->idle == p && rq_i->nr_running) {
> +               if (p == rq_i->idle) {
> +                       if (rq_i->nr_running) {
>                                 rq->core->core_forceidle = true;
>                                 if (!fi_before)
>                                         rq->core->core_forceidle_seq++;
>                         }
> -
> -                       /*
> -                        * If this new candidate is of higher priority than the
> -                        * previous; and they're incompatible; we need to wipe
> -                        * the slate and start over. pick_task makes sure that
> -                        * p's priority is more than max if it doesn't match
> -                        * max's cookie.
> -                        *
> -                        * NOTE: this is a linear max-filter and is thus bounded
> -                        * in execution time.
> -                        */
> -                       if (!max || !cookie_match(max, p)) {
> -                               struct task_struct *old_max = max;
> -
> -                               rq->core->core_cookie = p->core_cookie;
> -                               max = p;
> -
> -                               if (old_max) {
> -                                       rq->core->core_forceidle = false;
> -                                       for_each_cpu(j, smt_mask) {
> -                                               if (j == i)
> -                                                       continue;
> -
> -                                               cpu_rq(j)->core_pick = NULL;
> -                                       }
> -                                       occ = 1;
> -                                       goto again;
> -                               }
> -                       }
> +               } else {
> +                       occ++;
>                 }
>         }
>
> @@ -5643,7 +5577,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>          * non-matching user state.
>          */
>         for_each_cpu(i, smt_mask) {
> -               struct rq *rq_i = cpu_rq(i);
> +               rq_i = cpu_rq(i);
>
>                 /*
>                  * An online sibling might have gone offline before a task

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] sched/core: Simplify core-wide task selection
  2021-08-24  9:38       ` [PATCH] sched/core: Simplify core-wide task selection Peter Zijlstra
  2021-08-24 12:15         ` Tao Zhou
  2021-08-24 17:40         ` Josh Don
@ 2021-08-24 18:28         ` Vineeth Pillai
  2021-09-09 11:18         ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-10-05 14:12         ` tip-bot2 for Peter Zijlstra
  4 siblings, 0 replies; 17+ messages in thread
From: Vineeth Pillai @ 2021-08-24 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Joel Fernandes, linux-kernel,
	tao.zhou

Reviewed-by: Vineeth Pillai (Microsoft) <vineethrp@gmail.com>


On Tue, Aug 24, 2021 at 5:38 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 24, 2021 at 11:03:58AM +0200, Peter Zijlstra wrote:
> > Let me go do that and also attempt a Changelog to go with it ;-)
>
> How's this then?
>
> ---
> Subject: sched/core: Simplify core-wide task selection
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Aug 24 11:05:47 CEST 2021
>
> Tao suggested a two-pass task selection to avoid the retry loop.
>
> Not only does it avoid the retry loop, it results in *much* simpler
> code.
>
> This also fixes an issue spotted by Josh Don where, for SMT3+, we can
> forget to update max on the first pass and get to do an extra round.
>
> Suggested-by: Tao Zhou <tao.zhou@linux.dev>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c | 156 +++++++++++++++-------------------------------------
>  1 file changed, 45 insertions(+), 111 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ceae25ea8a0e..8a9a32df5f38 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5381,8 +5381,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>                         return p;
>         }
>
> -       /* The idle class should always have a runnable task: */
> -       BUG();
> +       BUG(); /* The idle class should always have a runnable task. */
>  }
>
>  #ifdef CONFIG_SCHED_CORE
> @@ -5404,54 +5403,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
>         return a->core_cookie == b->core_cookie;
>  }
>
> -// XXX fairness/fwd progress conditions
> -/*
> - * Returns
> - * - NULL if there is no runnable task for this class.
> - * - the highest priority task for this runqueue if it matches
> - *   rq->core->core_cookie or its priority is greater than max.
> - * - Else returns idle_task.
> - */
> -static struct task_struct *
> -pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
> +static inline struct task_struct *pick_task(struct rq *rq)
>  {
> -       struct task_struct *class_pick, *cookie_pick;
> -       unsigned long cookie = rq->core->core_cookie;
> -
> -       class_pick = class->pick_task(rq);
> -       if (!class_pick)
> -               return NULL;
> -
> -       if (!cookie) {
> -               /*
> -                * If class_pick is tagged, return it only if it has
> -                * higher priority than max.
> -                */
> -               if (max && class_pick->core_cookie &&
> -                   prio_less(class_pick, max, in_fi))
> -                       return idle_sched_class.pick_task(rq);
> +       const struct sched_class *class;
> +       struct task_struct *p;
>
> -               return class_pick;
> +       for_each_class(class) {
> +               p = class->pick_task(rq);
> +               if (p)
> +                       return p;
>         }
>
> -       /*
> -        * If class_pick is idle or matches cookie, return early.
> -        */
> -       if (cookie_equals(class_pick, cookie))
> -               return class_pick;
> -
> -       cookie_pick = sched_core_find(rq, cookie);
> -
> -       /*
> -        * If class > max && class > cookie, it is the highest priority task on
> -        * the core (so far) and it must be selected, otherwise we must go with
> -        * the cookie pick in order to satisfy the constraint.
> -        */
> -       if (prio_less(cookie_pick, class_pick, in_fi) &&
> -           (!max || prio_less(max, class_pick, in_fi)))
> -               return class_pick;
> -
> -       return cookie_pick;
> +       BUG(); /* The idle class should always have a runnable task. */
>  }
>
>  extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
> @@ -5459,11 +5422,12 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> -       struct task_struct *next, *max = NULL;
> -       const struct sched_class *class;
> +       struct task_struct *next, *p, *max = NULL;
>         const struct cpumask *smt_mask;
>         bool fi_before = false;
> -       int i, j, cpu, occ = 0;
> +       unsigned long cookie;
> +       int i, cpu, occ = 0;
> +       struct rq *rq_i;
>         bool need_sync;
>
>         if (!sched_core_enabled(rq))
> @@ -5536,12 +5500,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>          * and there are no cookied tasks running on siblings.
>          */
>         if (!need_sync) {
> -               for_each_class(class) {
> -                       next = class->pick_task(rq);
> -                       if (next)
> -                               break;
> -               }
> -
> +               next = pick_task(rq);
>                 if (!next->core_cookie) {
>                         rq->core_pick = NULL;
>                         /*
> @@ -5554,76 +5513,51 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>                 }
>         }
>
> -       for_each_cpu(i, smt_mask) {
> -               struct rq *rq_i = cpu_rq(i);
> -
> -               rq_i->core_pick = NULL;
> +       /*
> +        * For each thread: do the regular task pick and find the max prio task
> +        * amongst them.
> +        *
> +        * Tie-break prio towards the current CPU
> +        */
> +       for_each_cpu_wrap(i, smt_mask, cpu) {
> +               rq_i = cpu_rq(i);
>
>                 if (i != cpu)
>                         update_rq_clock(rq_i);
> +
> +               p = rq_i->core_pick = pick_task(rq_i);
> +               if (!max || prio_less(max, p, fi_before))
> +                       max = p;
>         }
>
> +       cookie = rq->core->core_cookie = max->core_cookie;
> +
>         /*
> -        * Try and select tasks for each sibling in descending sched_class
> -        * order.
> +        * For each thread: try and find a runnable task that matches @max or
> +        * force idle.
>          */
> -       for_each_class(class) {
> -again:
> -               for_each_cpu_wrap(i, smt_mask, cpu) {
> -                       struct rq *rq_i = cpu_rq(i);
> -                       struct task_struct *p;
> -
> -                       if (rq_i->core_pick)
> -                               continue;
> +       for_each_cpu(i, smt_mask) {
> +               rq_i = cpu_rq(i);
> +               p = rq_i->core_pick;
>
> -                       /*
> -                        * If this sibling doesn't yet have a suitable task to
> -                        * run; ask for the most eligible task, given the
> -                        * highest priority task already selected for this
> -                        * core.
> -                        */
> -                       p = pick_task(rq_i, class, max, fi_before);
> +               if (!cookie_equals(p, cookie)) {
> +                       p = NULL;
> +                       if (cookie)
> +                               p = sched_core_find(rq_i, cookie);
>                         if (!p)
> -                               continue;
> +                               p = idle_sched_class.pick_task(rq_i);
> +               }
>
> -                       if (!is_task_rq_idle(p))
> -                               occ++;
> +               rq_i->core_pick = p;
>
> -                       rq_i->core_pick = p;
> -                       if (rq_i->idle == p && rq_i->nr_running) {
> +               if (p == rq_i->idle) {
> +                       if (rq_i->nr_running) {
>                                 rq->core->core_forceidle = true;
>                                 if (!fi_before)
>                                         rq->core->core_forceidle_seq++;
>                         }
> -
> -                       /*
> -                        * If this new candidate is of higher priority than the
> -                        * previous; and they're incompatible; we need to wipe
> -                        * the slate and start over. pick_task makes sure that
> -                        * p's priority is more than max if it doesn't match
> -                        * max's cookie.
> -                        *
> -                        * NOTE: this is a linear max-filter and is thus bounded
> -                        * in execution time.
> -                        */
> -                       if (!max || !cookie_match(max, p)) {
> -                               struct task_struct *old_max = max;
> -
> -                               rq->core->core_cookie = p->core_cookie;
> -                               max = p;
> -
> -                               if (old_max) {
> -                                       rq->core->core_forceidle = false;
> -                                       for_each_cpu(j, smt_mask) {
> -                                               if (j == i)
> -                                                       continue;
> -
> -                                               cpu_rq(j)->core_pick = NULL;
> -                                       }
> -                                       occ = 1;
> -                                       goto again;
> -                               }
> -                       }
> +               } else {
> +                       occ++;
>                 }
>         }
>
> @@ -5643,7 +5577,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>          * non-matching user state.
>          */
>         for_each_cpu(i, smt_mask) {
> -               struct rq *rq_i = cpu_rq(i);
> +               rq_i = cpu_rq(i);
>
>                 /*
>                  * An online sibling might have gone offline before a task



--
Cheers,
~Vineeth

## "Its not the load that breaks you, but the way u carry it!" ##

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip: sched/core] sched/core: Simplify core-wide task selection
  2021-08-24  9:38       ` [PATCH] sched/core: Simplify core-wide task selection Peter Zijlstra
                           ` (2 preceding siblings ...)
  2021-08-24 18:28         ` Vineeth Pillai
@ 2021-09-09 11:18         ` tip-bot2 for Peter Zijlstra
  2021-10-05 14:12         ` tip-bot2 for Peter Zijlstra
  4 siblings, 0 replies; 17+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-09-09 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Tao Zhou, Peter Zijlstra (Intel),
	Josh Don, Vineeth Pillai (Microsoft),
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     4b2b9ed7b987eb6243b609095fe5c8c65340e5ee
Gitweb:        https://git.kernel.org/tip/4b2b9ed7b987eb6243b609095fe5c8c65340e5ee
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Aug 2021 11:05:47 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 09 Sep 2021 11:27:30 +02:00

sched/core: Simplify core-wide task selection

Tao suggested a two-pass task selection to avoid the retry loop.

Not only does it avoid the retry loop, it results in *much* simpler
code.

This also fixes an issue spotted by Josh Don where, for SMT3+, we can
forget to update max on the first pass and get to do an extra round.

Suggested-by: Tao Zhou <tao.zhou@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Reviewed-by: Vineeth Pillai (Microsoft) <vineethrp@gmail.com>
Link: https://lkml.kernel.org/r/YSS9+k1teA9oPEKl@hirez.programming.kicks-ass.net
---
 kernel/sched/core.c | 156 ++++++++++++-------------------------------
 1 file changed, 45 insertions(+), 111 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d19d1ba..953ff36 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5518,8 +5518,7 @@ restart:
 			return p;
 	}
 
-	/* The idle class should always have a runnable task: */
-	BUG();
+	BUG(); /* The idle class should always have a runnable task. */
 }
 
 #ifdef CONFIG_SCHED_CORE
@@ -5541,54 +5540,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
-// XXX fairness/fwd progress conditions
-/*
- * Returns
- * - NULL if there is no runnable task for this class.
- * - the highest priority task for this runqueue if it matches
- *   rq->core->core_cookie or its priority is greater than max.
- * - Else returns idle_task.
- */
-static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
+static inline struct task_struct *pick_task(struct rq *rq)
 {
-	struct task_struct *class_pick, *cookie_pick;
-	unsigned long cookie = rq->core->core_cookie;
-
-	class_pick = class->pick_task(rq);
-	if (!class_pick)
-		return NULL;
-
-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max, in_fi))
-			return idle_sched_class.pick_task(rq);
+	const struct sched_class *class;
+	struct task_struct *p;
 
-		return class_pick;
+	for_each_class(class) {
+		p = class->pick_task(rq);
+		if (p)
+			return p;
 	}
 
-	/*
-	 * If class_pick is idle or matches cookie, return early.
-	 */
-	if (cookie_equals(class_pick, cookie))
-		return class_pick;
-
-	cookie_pick = sched_core_find(rq, cookie);
-
-	/*
-	 * If class > max && class > cookie, it is the highest priority task on
-	 * the core (so far) and it must be selected, otherwise we must go with
-	 * the cookie pick in order to satisfy the constraint.
-	 */
-	if (prio_less(cookie_pick, class_pick, in_fi) &&
-	    (!max || prio_less(max, class_pick, in_fi)))
-		return class_pick;
-
-	return cookie_pick;
+	BUG(); /* The idle class should always have a runnable task. */
 }
 
 extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
@@ -5596,11 +5559,12 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *next, *max = NULL;
-	const struct sched_class *class;
+	struct task_struct *next, *p, *max = NULL;
 	const struct cpumask *smt_mask;
 	bool fi_before = false;
-	int i, j, cpu, occ = 0;
+	unsigned long cookie;
+	int i, cpu, occ = 0;
+	struct rq *rq_i;
 	bool need_sync;
 
 	if (!sched_core_enabled(rq))
@@ -5673,12 +5637,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * and there are no cookied tasks running on siblings.
 	 */
 	if (!need_sync) {
-		for_each_class(class) {
-			next = class->pick_task(rq);
-			if (next)
-				break;
-		}
-
+		next = pick_task(rq);
 		if (!next->core_cookie) {
 			rq->core_pick = NULL;
 			/*
@@ -5691,76 +5650,51 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 	}
 
-	for_each_cpu(i, smt_mask) {
-		struct rq *rq_i = cpu_rq(i);
-
-		rq_i->core_pick = NULL;
+	/*
+	 * For each thread: do the regular task pick and find the max prio task
+	 * amongst them.
+	 *
+	 * Tie-break prio towards the current CPU
+	 */
+	for_each_cpu_wrap(i, smt_mask, cpu) {
+		rq_i = cpu_rq(i);
 
 		if (i != cpu)
 			update_rq_clock(rq_i);
+
+		p = rq_i->core_pick = pick_task(rq_i);
+		if (!max || prio_less(max, p, fi_before))
+			max = p;
 	}
 
+	cookie = rq->core->core_cookie = max->core_cookie;
+
 	/*
-	 * Try and select tasks for each sibling in descending sched_class
-	 * order.
+	 * For each thread: try and find a runnable task that matches @max or
+	 * force idle.
 	 */
-	for_each_class(class) {
-again:
-		for_each_cpu_wrap(i, smt_mask, cpu) {
-			struct rq *rq_i = cpu_rq(i);
-			struct task_struct *p;
-
-			if (rq_i->core_pick)
-				continue;
+	for_each_cpu(i, smt_mask) {
+		rq_i = cpu_rq(i);
+		p = rq_i->core_pick;
 
-			/*
-			 * If this sibling doesn't yet have a suitable task to
-			 * run; ask for the most eligible task, given the
-			 * highest priority task already selected for this
-			 * core.
-			 */
-			p = pick_task(rq_i, class, max, fi_before);
+		if (!cookie_equals(p, cookie)) {
+			p = NULL;
+			if (cookie)
+				p = sched_core_find(rq_i, cookie);
 			if (!p)
-				continue;
+				p = idle_sched_class.pick_task(rq_i);
+		}
 
-			if (!is_task_rq_idle(p))
-				occ++;
+		rq_i->core_pick = p;
 
-			rq_i->core_pick = p;
-			if (rq_i->idle == p && rq_i->nr_running) {
+		if (p == rq_i->idle) {
+			if (rq_i->nr_running) {
 				rq->core->core_forceidle = true;
 				if (!fi_before)
 					rq->core->core_forceidle_seq++;
 			}
-
-			/*
-			 * If this new candidate is of higher priority than the
-			 * previous; and they're incompatible; we need to wipe
-			 * the slate and start over. pick_task makes sure that
-			 * p's priority is more than max if it doesn't match
-			 * max's cookie.
-			 *
-			 * NOTE: this is a linear max-filter and is thus bounded
-			 * in execution time.
-			 */
-			if (!max || !cookie_match(max, p)) {
-				struct task_struct *old_max = max;
-
-				rq->core->core_cookie = p->core_cookie;
-				max = p;
-
-				if (old_max) {
-					rq->core->core_forceidle = false;
-					for_each_cpu(j, smt_mask) {
-						if (j == i)
-							continue;
-
-						cpu_rq(j)->core_pick = NULL;
-					}
-					occ = 1;
-					goto again;
-				}
-			}
+		} else {
+			occ++;
 		}
 	}
 
@@ -5780,7 +5714,7 @@ again:
 	 * non-matching user state.
 	 */
 	for_each_cpu(i, smt_mask) {
-		struct rq *rq_i = cpu_rq(i);
+		rq_i = cpu_rq(i);
 
 		/*
 		 * An online sibling might have gone offline before a task

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip: sched/core] sched/core: Simplify core-wide task selection
  2021-08-24  9:38       ` [PATCH] sched/core: Simplify core-wide task selection Peter Zijlstra
                           ` (3 preceding siblings ...)
  2021-09-09 11:18         ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2021-10-05 14:12         ` tip-bot2 for Peter Zijlstra
  4 siblings, 0 replies; 17+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-10-05 14:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Tao Zhou, Peter Zijlstra (Intel),
	Josh Don, Vineeth Pillai (Microsoft),
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     bc9ffef31bf59819c9fc032178534ff9ed7c4981
Gitweb:        https://git.kernel.org/tip/bc9ffef31bf59819c9fc032178534ff9ed7c4981
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 24 Aug 2021 11:05:47 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 Oct 2021 15:51:33 +02:00

sched/core: Simplify core-wide task selection

Tao suggested a two-pass task selection to avoid the retry loop.

Not only does it avoid the retry loop, it results in *much* simpler
code.

This also fixes an issue spotted by Josh Don where, for SMT3+, we can
forget to update max on the first pass and get to do an extra round.

Suggested-by: Tao Zhou <tao.zhou@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Reviewed-by: Vineeth Pillai (Microsoft) <vineethrp@gmail.com>
Link: https://lkml.kernel.org/r/YSS9+k1teA9oPEKl@hirez.programming.kicks-ass.net
---
 kernel/sched/core.c | 156 ++++++++++++-------------------------------
 1 file changed, 45 insertions(+), 111 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2672694..f963c81 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5580,8 +5580,7 @@ restart:
 			return p;
 	}
 
-	/* The idle class should always have a runnable task: */
-	BUG();
+	BUG(); /* The idle class should always have a runnable task. */
 }
 
 #ifdef CONFIG_SCHED_CORE
@@ -5603,54 +5602,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
-// XXX fairness/fwd progress conditions
-/*
- * Returns
- * - NULL if there is no runnable task for this class.
- * - the highest priority task for this runqueue if it matches
- *   rq->core->core_cookie or its priority is greater than max.
- * - Else returns idle_task.
- */
-static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
+static inline struct task_struct *pick_task(struct rq *rq)
 {
-	struct task_struct *class_pick, *cookie_pick;
-	unsigned long cookie = rq->core->core_cookie;
-
-	class_pick = class->pick_task(rq);
-	if (!class_pick)
-		return NULL;
-
-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max, in_fi))
-			return idle_sched_class.pick_task(rq);
+	const struct sched_class *class;
+	struct task_struct *p;
 
-		return class_pick;
+	for_each_class(class) {
+		p = class->pick_task(rq);
+		if (p)
+			return p;
 	}
 
-	/*
-	 * If class_pick is idle or matches cookie, return early.
-	 */
-	if (cookie_equals(class_pick, cookie))
-		return class_pick;
-
-	cookie_pick = sched_core_find(rq, cookie);
-
-	/*
-	 * If class > max && class > cookie, it is the highest priority task on
-	 * the core (so far) and it must be selected, otherwise we must go with
-	 * the cookie pick in order to satisfy the constraint.
-	 */
-	if (prio_less(cookie_pick, class_pick, in_fi) &&
-	    (!max || prio_less(max, class_pick, in_fi)))
-		return class_pick;
-
-	return cookie_pick;
+	BUG(); /* The idle class should always have a runnable task. */
 }
 
 extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
@@ -5658,11 +5621,12 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *next, *max = NULL;
-	const struct sched_class *class;
+	struct task_struct *next, *p, *max = NULL;
 	const struct cpumask *smt_mask;
 	bool fi_before = false;
-	int i, j, cpu, occ = 0;
+	unsigned long cookie;
+	int i, cpu, occ = 0;
+	struct rq *rq_i;
 	bool need_sync;
 
 	if (!sched_core_enabled(rq))
@@ -5735,12 +5699,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * and there are no cookied tasks running on siblings.
 	 */
 	if (!need_sync) {
-		for_each_class(class) {
-			next = class->pick_task(rq);
-			if (next)
-				break;
-		}
-
+		next = pick_task(rq);
 		if (!next->core_cookie) {
 			rq->core_pick = NULL;
 			/*
@@ -5753,76 +5712,51 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 	}
 
-	for_each_cpu(i, smt_mask) {
-		struct rq *rq_i = cpu_rq(i);
-
-		rq_i->core_pick = NULL;
+	/*
+	 * For each thread: do the regular task pick and find the max prio task
+	 * amongst them.
+	 *
+	 * Tie-break prio towards the current CPU
+	 */
+	for_each_cpu_wrap(i, smt_mask, cpu) {
+		rq_i = cpu_rq(i);
 
 		if (i != cpu)
 			update_rq_clock(rq_i);
+
+		p = rq_i->core_pick = pick_task(rq_i);
+		if (!max || prio_less(max, p, fi_before))
+			max = p;
 	}
 
+	cookie = rq->core->core_cookie = max->core_cookie;
+
 	/*
-	 * Try and select tasks for each sibling in descending sched_class
-	 * order.
+	 * For each thread: try and find a runnable task that matches @max or
+	 * force idle.
 	 */
-	for_each_class(class) {
-again:
-		for_each_cpu_wrap(i, smt_mask, cpu) {
-			struct rq *rq_i = cpu_rq(i);
-			struct task_struct *p;
-
-			if (rq_i->core_pick)
-				continue;
+	for_each_cpu(i, smt_mask) {
+		rq_i = cpu_rq(i);
+		p = rq_i->core_pick;
 
-			/*
-			 * If this sibling doesn't yet have a suitable task to
-			 * run; ask for the most eligible task, given the
-			 * highest priority task already selected for this
-			 * core.
-			 */
-			p = pick_task(rq_i, class, max, fi_before);
+		if (!cookie_equals(p, cookie)) {
+			p = NULL;
+			if (cookie)
+				p = sched_core_find(rq_i, cookie);
 			if (!p)
-				continue;
+				p = idle_sched_class.pick_task(rq_i);
+		}
 
-			if (!is_task_rq_idle(p))
-				occ++;
+		rq_i->core_pick = p;
 
-			rq_i->core_pick = p;
-			if (rq_i->idle == p && rq_i->nr_running) {
+		if (p == rq_i->idle) {
+			if (rq_i->nr_running) {
 				rq->core->core_forceidle = true;
 				if (!fi_before)
 					rq->core->core_forceidle_seq++;
 			}
-
-			/*
-			 * If this new candidate is of higher priority than the
-			 * previous; and they're incompatible; we need to wipe
-			 * the slate and start over. pick_task makes sure that
-			 * p's priority is more than max if it doesn't match
-			 * max's cookie.
-			 *
-			 * NOTE: this is a linear max-filter and is thus bounded
-			 * in execution time.
-			 */
-			if (!max || !cookie_match(max, p)) {
-				struct task_struct *old_max = max;
-
-				rq->core->core_cookie = p->core_cookie;
-				max = p;
-
-				if (old_max) {
-					rq->core->core_forceidle = false;
-					for_each_cpu(j, smt_mask) {
-						if (j == i)
-							continue;
-
-						cpu_rq(j)->core_pick = NULL;
-					}
-					occ = 1;
-					goto again;
-				}
-			}
+		} else {
+			occ++;
 		}
 	}
 
@@ -5842,7 +5776,7 @@ again:
 	 * non-matching user state.
 	 */
 	for_each_cpu(i, smt_mask) {
-		struct rq *rq_i = cpu_rq(i);
+		rq_i = cpu_rq(i);
 
 		/*
 		 * An online sibling might have gone offline before a task

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-10-05 14:13 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-18  0:56 [PATCH] sched/core: fix pick_next_task 'max' tracking Josh Don
2021-08-18  4:35 ` Tao Zhou
2021-08-18 15:18   ` Tao Zhou
2021-08-23 11:16 ` Peter Zijlstra
2021-08-23 15:38   ` Tao Zhou
2021-08-23 20:25   ` Vineeth Pillai
2021-08-23 22:57     ` Tao Zhou
2021-08-24  9:03     ` Peter Zijlstra
2021-08-24  9:38       ` [PATCH] sched/core: Simplify core-wide task selection Peter Zijlstra
2021-08-24 12:15         ` Tao Zhou
2021-08-24 17:40         ` Josh Don
2021-08-24 18:28         ` Vineeth Pillai
2021-09-09 11:18         ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-10-05 14:12         ` tip-bot2 for Peter Zijlstra
2021-08-23 23:24   ` [PATCH] sched/core: fix pick_next_task 'max' tracking Josh Don
2021-08-24  3:01     ` Tao Zhou
2021-08-24  8:55     ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.