linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/8] sched/fair: misfit task load-balance tweaks
@ 2021-01-28 18:31 Valentin Schneider
  2021-01-28 18:31 ` [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
                   ` (8 more replies)
  0 siblings, 9 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-01-28 18:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

Hi folks,

Here is this year's series of misfit changes. On the menu:

o Patch 1 is an independent active balance cleanup
o Patch 2 adds some more sched_asym_cpucapacity static branches
o Patch 3 introduces yet another margin for capacity to capacity
  comparisons
o Patches 4-6 build on top of patch 3 and change capacity comparisons
  throughout misfit load balancing  
o Patches 7-8 fix some extra misfit issues I've been seeing on "real"
  workloads.

IMO the somewhat controversial bit is patch 3, because it attempts to solve
margin issues by... Adding another margin. This does solve issues on
existing platforms (e.g. Pixel4), but we'll be back to square one the day
some "clever" folks spin a platform with two different CPU capacities less than
5% apart.

This is based on top of today's tip/sched/core at:

  65bcf072e20e ("sched: Use task_current() instead of 'rq->curr == p'")

Cheers,
Valentin

Valentin Schneider (8):
  sched/fair: Clean up active balance nr_balance_failed trickery
  sched/fair: Add more sched_asym_cpucapacity static branch checks
  sched/fair: Tweak misfit-related capacity checks
  sched/fair: Use dst_cpu's capacity rather than group {min, max}
    capacity
  sched/fair: Make check_misfit_status() only compare dynamic capacities
  sched/fair: Filter out locally-unsolvable misfit imbalances
  sched/fair: Attempt misfit active balance when migration_type !=
    migrate_misfit
  sched/fair: Relax task_hot() for misfit tasks

 kernel/sched/fair.c  | 138 ++++++++++++++++++++++++++-----------------
 kernel/sched/sched.h |   6 ++
 2 files changed, 89 insertions(+), 55 deletions(-)

--
2.27.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
@ 2021-01-28 18:31 ` Valentin Schneider
  2021-02-03 15:14   ` Qais Yousef
  2021-02-05 13:51   ` Vincent Guittot
  2021-01-28 18:31 ` [PATCH 2/8] sched/fair: Add more sched_asym_cpucapacity static branch checks Valentin Schneider
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-01-28 18:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().

This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.

Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 197a51473e0c..0f6a4e58ce3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7423,6 +7423,7 @@ enum migration_type {
 #define LBF_SOME_PINNED	0x08
 #define LBF_NOHZ_STATS	0x10
 #define LBF_NOHZ_AGAIN	0x20
+#define LBF_ACTIVE_LB	0x40
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -7608,10 +7609,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) destination numa is preferred
-	 * 2) task is cache cold, or
-	 * 3) too many balance attempts have failed.
+	 * 1) active balance
+	 * 2) destination numa is preferred
+	 * 3) task is cache cold, or
+	 * 4) too many balance attempts have failed.
 	 */
+	if (env->flags & LBF_ACTIVE_LB)
+		return 1;
+
 	tsk_cache_hot = migrate_degrades_locality(p, env);
 	if (tsk_cache_hot == -1)
 		tsk_cache_hot = task_hot(p, env);
@@ -9805,9 +9810,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 					active_load_balance_cpu_stop, busiest,
 					&busiest->active_balance_work);
 			}
-
-			/* We've kicked active balancing, force task migration. */
-			sd->nr_balance_failed = sd->cache_nice_tries+1;
 		}
 	} else {
 		sd->nr_balance_failed = 0;
@@ -9963,7 +9965,8 @@ static int active_load_balance_cpu_stop(void *data)
 			 * @dst_grpmask we need to make that test go away with lying
 			 * about DST_PINNED.
 			 */
-			.flags		= LBF_DST_PINNED,
+			.flags		= LBF_DST_PINNED |
+					  LBF_ACTIVE_LB,
 		};
 
 		schedstat_inc(sd->alb_count);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/8] sched/fair: Add more sched_asym_cpucapacity static branch checks
  2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
  2021-01-28 18:31 ` [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
@ 2021-01-28 18:31 ` Valentin Schneider
  2021-02-03 15:14   ` Qais Yousef
  2021-02-09  8:42   ` Vincent Guittot
  2021-01-28 18:31 ` [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks Valentin Schneider
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-01-28 18:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Qais Yousef, Quentin Perret,
	Pavan Kondeti

Rik noted a while back that a handful of

  sd->flags & SD_ASYM_CPUCAPACITY

& family in the CFS load-balancer code aren't guarded by the
sched_asym_cpucapacity static branch.

The load-balancer is already doing a humongous amount of work, but turning
those checks into NOPs for those who don't need it is fairly
straightforward, so do that.

Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c  | 11 ++++++-----
 kernel/sched/sched.h |  6 ++++++
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f6a4e58ce3c..7d01fa0bfc7e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8465,7 +8465,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			continue;
 
 		/* Check for a misfit task on the cpu */
-		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+		if (sd_has_asym_cpucapacity(env->sd) &&
 		    sgs->group_misfit_task_load < rq->misfit_task_load) {
 			sgs->group_misfit_task_load = rq->misfit_task_load;
 			*sg_status |= SG_OVERLOAD;
@@ -8522,7 +8522,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * CPUs in the group should either be possible to resolve
 	 * internally or be covered by avg_load imbalance (eventually).
 	 */
-	if (sgs->group_type == group_misfit_task &&
+	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
+	    sgs->group_type == group_misfit_task &&
 	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
@@ -8605,7 +8606,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * throughput. Maximize throughput, power/energy consequences are not
 	 * considered.
 	 */
-	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+	if (sd_has_asym_cpucapacity(env->sd) &&
 	    (sgs->group_type <= group_fully_busy) &&
 	    (group_smaller_min_cpu_capacity(sds->local, sg)))
 		return false;
@@ -8728,7 +8729,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 	}
 
 	/* Check if task fits in the group */
-	if (sd->flags & SD_ASYM_CPUCAPACITY &&
+	if (sd_has_asym_cpucapacity(sd) &&
 	    !task_fits_capacity(p, group->sgc->max_capacity)) {
 		sgs->group_misfit_task_load = 1;
 	}
@@ -9419,7 +9420,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * Higher per-CPU capacity is considered better than balancing
 		 * average load.
 		 */
-		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+		if (sd_has_asym_cpucapacity(env->sd) &&
 		    capacity_of(env->dst_cpu) < capacity &&
 		    nr_running == 1)
 			continue;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 045b01064c1e..21bd71f58c06 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1482,6 +1482,12 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 extern struct static_key_false sched_asym_cpucapacity;
 
+static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
+{
+	return static_branch_unlikely(&sched_asym_cpucapacity) &&
+		sd->flags & SD_ASYM_CPUCAPACITY;
+}
+
 struct sched_group_capacity {
 	atomic_t		ref;
 	/*
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
  2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
  2021-01-28 18:31 ` [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
  2021-01-28 18:31 ` [PATCH 2/8] sched/fair: Add more sched_asym_cpucapacity static branch checks Valentin Schneider
@ 2021-01-28 18:31 ` Valentin Schneider
  2021-02-03 15:15   ` Qais Yousef
  2021-02-05 14:31   ` Vincent Guittot
  2021-01-28 18:31 ` [PATCH 4/8] sched/fair: Use dst_cpu's capacity rather than group {min, max} capacity Valentin Schneider
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-01-28 18:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass

  group_smaller_max_cpu_capacity(<candidate group>, <local group>);

which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.

Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.

One may argue the 20% margin of fits_capacity() is excessive in the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.

Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
existing capacity checks. Note that check_cpu_capacity() uses yet another
margin (sd->imbalance_pct), and is left alone for now.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d01fa0bfc7e..58ce0b22fcb0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  */
 #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
 
+/*
+ * The margin used when comparing CPU capacities.
+ * is 'cap' noticeably greater than 'ref'
+ *
+ * (default: ~5%)
+ */
+#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
 static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
 {
 	return rq->misfit_task_load &&
-		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
+		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
 		 check_cpu_capacity(rq, sd));
 }
 
@@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 static inline bool
 group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
-	return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
+	return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
 }
 
 /*
@@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 static inline bool
 group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
-	return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
+	return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
 }
 
 static inline enum
@@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * average load.
 		 */
 		if (sd_has_asym_cpucapacity(env->sd) &&
-		    capacity_of(env->dst_cpu) < capacity &&
+		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
 		    nr_running == 1)
 			continue;
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 4/8] sched/fair: Use dst_cpu's capacity rather than group {min, max} capacity
  2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (2 preceding siblings ...)
  2021-01-28 18:31 ` [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks Valentin Schneider
@ 2021-01-28 18:31 ` Valentin Schneider
  2021-02-03 15:15   ` Qais Yousef
  2021-01-28 18:31 ` [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities Valentin Schneider
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-01-28 18:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

Comparing capacity extrema of local and source sched_group's doesn't make
much sense when at the day of the day the imbalance will be pulled by a
known env->dst_cpu, whose capacity can be anywhere within the local group's
capacity extrema.

Replace group_smaller_{min, max}_cpu_capacity() with comparisons of the
source group's min/max capacity and the destination CPU's capacity.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 31 +++----------------------------
 1 file changed, 3 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58ce0b22fcb0..0959a770ecc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8352,26 +8352,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 	return false;
 }
 
-/*
- * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity than sched_group ref.
- */
-static inline bool
-group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
-	return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
-}
-
-/*
- * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity_orig than sched_group ref.
- */
-static inline bool
-group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
-	return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
-}
-
 static inline enum
 group_type group_classify(unsigned int imbalance_pct,
 			  struct sched_group *group,
@@ -8523,15 +8503,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (!sgs->sum_h_nr_running)
 		return false;
 
-	/*
-	 * Don't try to pull misfit tasks we can't help.
-	 * We can use max_capacity here as reduction in capacity on some
-	 * CPUs in the group should either be possible to resolve
-	 * internally or be covered by avg_load imbalance (eventually).
-	 */
+	/* Don't try to pull misfit tasks we can't help */
 	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
 	    sgs->group_type == group_misfit_task &&
-	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
@@ -8615,7 +8590,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 */
 	if (sd_has_asym_cpucapacity(env->sd) &&
 	    (sgs->group_type <= group_fully_busy) &&
-	    (group_smaller_min_cpu_capacity(sds->local, sg)))
+	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
 		return false;
 
 	return true;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities
  2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (3 preceding siblings ...)
  2021-01-28 18:31 ` [PATCH 4/8] sched/fair: Use dst_cpu's capacity rather than group {min, max} capacity Valentin Schneider
@ 2021-01-28 18:31 ` Valentin Schneider
  2021-02-03 15:15   ` Qais Yousef
  2021-01-28 18:31 ` [PATCH 6/8] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-01-28 18:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

check_misfit_status() checks for both capacity pressure & available CPUs
with higher capacity. Now that we have a sane(ish) capacity comparison
margin which is used throughout load-balance, this can be condensed into a
single check:

  capacity_greater(<root_domain max capacity>, <misfit task CPU's capacity>);

This has the added benefit of returning false if the misfit task CPU's is
heavily pressured, but there are no better candidates for migration.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0959a770ecc0..ef44474b8fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8254,14 +8254,12 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
 
 /*
  * Check whether a rq has a misfit task and if it looks like we can actually
- * help that task: we can migrate the task to a CPU of higher capacity, or
- * the task's current CPU is heavily pressured.
+ * help that task: we can migrate the task to a CPU of higher capacity.
  */
-static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
+static inline int check_misfit_status(struct rq *rq)
 {
 	return rq->misfit_task_load &&
-		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
-		 check_cpu_capacity(rq, sd));
+		capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity);
 }
 
 /*
@@ -10238,7 +10236,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
 		 * to run the misfit task on.
 		 */
-		if (check_misfit_status(rq, sd)) {
+		if (check_misfit_status(rq)) {
 			flags = NOHZ_KICK_MASK;
 			goto unlock;
 		}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 6/8] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (4 preceding siblings ...)
  2021-01-28 18:31 ` [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities Valentin Schneider
@ 2021-01-28 18:31 ` Valentin Schneider
  2021-02-03 15:16   ` Qais Yousef
  2021-01-28 18:31 ` [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit Valentin Schneider
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-01-28 18:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

Consider the following (hypothetical) asymmetric CPU capacity topology,
with some amount of capacity pressure (RT | DL | IRQ | thermal):

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3

  | CPU | capacity_orig | capacity |
  |-----+---------------+----------|
  |   0 |           870 |      860 |
  |   1 |           870 |      600 |
  |   2 |          1024 |      850 |
  |   3 |          1024 |      860 |

If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
sufficiently busy, i.e. don't have enough spare capacity to accommodate
CPU1's misfit task. This would then fall on CPU2 to pull the task.

This currently won't happen, because CPU2 will fail

  capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)

in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
level. In this case, the max_capacity is that of CPU0's, which is at this
point in time greater than that of CPU2's. This comparison doesn't make
much sense, given that the only CPUs we should care about in this scenario
are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
destination CPU).

Aggregate a misfit task's load into sgs->group_misfit_task_load only if
env->dst_cpu would grant it a capacity uplift. Separately track whether a
sched_group contains a misfit task to still classify it as
group_misfit_task and not pick it as busiest group when pulling from a
lower-capacity CPU (which is the current behaviour and prevents
down-migration).

Since find_busiest_queue() can now iterate over CPUs with a higher capacity
than the local CPU's, add a capacity check there.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ef44474b8fbf..0ac2f876b86f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5765,6 +5765,12 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+/* Is CPU a's capacity noticeably greater than CPU b's? */
+static inline bool cpu_capacity_greater(int a, int b)
+{
+	return capacity_greater(capacity_of(a), capacity_of(b));
+}
+
 static void record_wakee(struct task_struct *p)
 {
 	/*
@@ -8093,7 +8099,8 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
-	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+	unsigned long group_misfit_task_load; /* Task load that can be uplifted */
+	int           group_has_misfit_task; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -8364,7 +8371,7 @@ group_type group_classify(unsigned int imbalance_pct,
 	if (sgs->group_asym_packing)
 		return group_asym_packing;
 
-	if (sgs->group_misfit_task_load)
+	if (sgs->group_has_misfit_task)
 		return group_misfit_task;
 
 	if (!group_has_capacity(imbalance_pct, sgs))
@@ -8450,11 +8457,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			continue;
 
 		/* Check for a misfit task on the cpu */
-		if (sd_has_asym_cpucapacity(env->sd) &&
-		    sgs->group_misfit_task_load < rq->misfit_task_load) {
-			sgs->group_misfit_task_load = rq->misfit_task_load;
-			*sg_status |= SG_OVERLOAD;
-		}
+		if (!sd_has_asym_cpucapacity(env->sd) ||
+		    !rq->misfit_task_load)
+			continue;
+
+		*sg_status |= SG_OVERLOAD;
+		sgs->group_has_misfit_task = true;
+
+		/*
+		 * Don't attempt to maximize load for misfit tasks that can't be
+		 * granted a CPU capacity uplift.
+		 */
+		if (cpu_capacity_greater(env->dst_cpu, i))
+			sgs->group_misfit_task_load = max(
+				sgs->group_misfit_task_load,
+				rq->misfit_task_load);
 	}
 
 	/* Check if dst CPU is idle and preferred to this group */
@@ -8504,7 +8521,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	/* Don't try to pull misfit tasks we can't help */
 	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
 	    sgs->group_type == group_misfit_task &&
-	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+	    (!sgs->group_misfit_task_load ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
@@ -9464,15 +9481,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		case migrate_misfit:
 			/*
 			 * For ASYM_CPUCAPACITY domains with misfit tasks we
-			 * simply seek the "biggest" misfit task.
+			 * simply seek the "biggest" misfit task we can
+			 * accommodate.
 			 */
+			if (!cpu_capacity_greater(env->dst_cpu, i))
+				continue;
+
 			if (rq->misfit_task_load > busiest_load) {
 				busiest_load = rq->misfit_task_load;
 				busiest = rq;
 			}
 
 			break;
-
 		}
 	}
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit
  2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (5 preceding siblings ...)
  2021-01-28 18:31 ` [PATCH 6/8] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
@ 2021-01-28 18:31 ` Valentin Schneider
  2021-02-03 15:16   ` Qais Yousef
  2021-02-09  8:58   ` Vincent Guittot
  2021-01-28 18:31 ` [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
  2021-02-03 15:14 ` [PATCH 0/8] sched/fair: misfit task load-balance tweaks Qais Yousef
  8 siblings, 2 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-01-28 18:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and CPUs 2-3 as
bigs. The resulting sched_domain hierarchy is:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3

When running a multithreaded CPU-bound workload (i.e. 1 hog per CPU), the
expected behaviour is to have the about-to-idle big CPUs pull a hog from
the LITTLEs, since bigs will complete their work sooner than LITTLEs.

Further Consider a scenario where:
- CPU0 is idle (e.g. its hog got migrated to one of the big CPUs)
- CPU1 is currently executing a per-CPU kworker, preempting the CPU hog
- CPU2 and CPU3 are executing CPU-hogs

CPU0 goes through load_balance() at MC level, and tries to pick stuff from
CPU1, but:
- the hog can't be pulled, because it's task_hot()
- the kworker can't be pulled, because it's pinned to CPU1, which sets
  LBF_SOME_PINNED

This load balance attempts ends with no load pulled, LBF_SOME_PINNED set,
and as a consequence we set the imbalance flag of DIE's [0, 1]
sched_group_capacity.

Shortly after, CPU2 completes its work and is about to go idle. It goes
through the newidle_balance(), and we would really like it to active
balance the hog running on CPU1 (which is a misfit task). However,
sgc->imbalance is set for the LITTLE group at DIE level, so the group gets
classified as group_imbalanced rather than group_misfit_task.

Unlike group_misfit_task (via migrate_misfit), the active balance logic
doesn't have any specific case for group_imbalanced, so CPU2 ends up going
idle. We'll have to wait for a load balance on CPU0 or CPU1 to happen and
clear the imbalance flag, and then for another DIE-level load-balance on
CPU2 to happen to pull the task off of CPU1. That's several precious
milliseconds wasted down the drain.

Giving group_misfit_task a higher group_classify() priority than
group_imbalance doesn't seem like the right thing to do. Instead, make
need_active_balance() return true for any migration_type when the
destination CPU is idle and the source CPU has a misfit task.

While at it, add an sd_has_asym_cpucapacity() guard in
need_active_balance().

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0ac2f876b86f..cba9f97d9beb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9557,9 +9557,22 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	if (!sd_has_asym_cpucapacity(sd))
+		return 0;
+
 	if (env->migration_type == migrate_misfit)
 		return 1;
 
+	/*
+	 * If we failed to pull anything and the src_rq has a misfit task, but
+	 * the busiest group_type was higher than group_misfit_task, try to
+	 * go for a misfit active balance anyway.
+	 */
+	if ((env->idle != CPU_NOT_IDLE) &&
+	    env->src_rq->misfit_task_load &&
+	    cpu_capacity_greater(env->dst_cpu, env->src_cpu))
+		return 1;
+
 	return 0;
 }
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks
  2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (6 preceding siblings ...)
  2021-01-28 18:31 ` [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit Valentin Schneider
@ 2021-01-28 18:31 ` Valentin Schneider
  2021-02-03 15:17   ` Qais Yousef
  2021-02-08 16:21   ` Vincent Guittot
  2021-02-03 15:14 ` [PATCH 0/8] sched/fair: misfit task load-balance tweaks Qais Yousef
  8 siblings, 2 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-01-28 18:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

Misfit tasks can and will be preempted by the stopper to migrate them over
to a higher-capacity CPU. However, when runnable but not current misfit
tasks are scanned by the load balancer (i.e. detach_tasks()), the
task_hot() ratelimiting logic may prevent us from enqueuing said task onto
a higher-capacity CPU.

Align detach_tasks() with the active-balance logic and let it pick a
cache-hot misfit task when the destination CPU can provide a capacity
uplift.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cba9f97d9beb..c2351b87824f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7484,6 +7484,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	if (env->sd->flags & SD_SHARE_CPUCAPACITY)
 		return 0;
 
+	/*
+	 * On a (sane) asymmetric CPU capacity system, the increase in compute
+	 * capacity should offset any potential performance hit caused by a
+	 * migration.
+	 */
+	if (sd_has_asym_cpucapacity(env->sd) &&
+	    env->idle != CPU_NOT_IDLE &&
+	    !task_fits_capacity(p, capacity_of(env->src_cpu)) &&
+	    cpu_capacity_greater(env->dst_cpu, env->src_cpu))
+		return 0;
+
 	/*
 	 * Buddy candidates are cache hot:
 	 */
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/8] sched/fair: misfit task load-balance tweaks
  2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (7 preceding siblings ...)
  2021-01-28 18:31 ` [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
@ 2021-02-03 15:14 ` Qais Yousef
  2021-02-03 18:43   ` Valentin Schneider
  8 siblings, 1 reply; 46+ messages in thread
From: Qais Yousef @ 2021-02-03 15:14 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 01/28/21 18:31, Valentin Schneider wrote:
> Hi folks,
> 
> Here is this year's series of misfit changes. On the menu:
> 
> o Patch 1 is an independent active balance cleanup
> o Patch 2 adds some more sched_asym_cpucapacity static branches
> o Patch 3 introduces yet another margin for capacity to capacity
>   comparisons
> o Patches 4-6 build on top of patch 3 and change capacity comparisons
>   throughout misfit load balancing  
> o Patches 7-8 fix some extra misfit issues I've been seeing on "real"
>   workloads.
> 
> IMO the somewhat controversial bit is patch 3, because it attempts to solve
> margin issues by... Adding another margin. This does solve issues on
> existing platforms (e.g. Pixel4), but we'll be back to square one the day
> some "clever" folks spin a platform with two different CPU capacities less than
> 5% apart.

One more margin is a cause of apprehension to me. But in this case I think it
is the appropriate thing to do now. I can't think of a scenario where this
could hurt.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-01-28 18:31 ` [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
@ 2021-02-03 15:14   ` Qais Yousef
  2021-02-03 18:42     ` Valentin Schneider
  2021-02-05 13:51   ` Vincent Guittot
  1 sibling, 1 reply; 46+ messages in thread
From: Qais Yousef @ 2021-02-03 15:14 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

Hi Valentin

On 01/28/21 18:31, Valentin Schneider wrote:
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
> 
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
> 
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..0f6a4e58ce3c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7423,6 +7423,7 @@ enum migration_type {
>  #define LBF_SOME_PINNED	0x08
>  #define LBF_NOHZ_STATS	0x10
>  #define LBF_NOHZ_AGAIN	0x20
> +#define LBF_ACTIVE_LB	0x40
>  
>  struct lb_env {
>  	struct sched_domain	*sd;
> @@ -7608,10 +7609,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  
>  	/*
>  	 * Aggressive migration if:
> -	 * 1) destination numa is preferred
> -	 * 2) task is cache cold, or
> -	 * 3) too many balance attempts have failed.
> +	 * 1) active balance
> +	 * 2) destination numa is preferred
> +	 * 3) task is cache cold, or
> +	 * 4) too many balance attempts have failed.
>  	 */
> +	if (env->flags & LBF_ACTIVE_LB)
> +		return 1;
> +
>  	tsk_cache_hot = migrate_degrades_locality(p, env);
>  	if (tsk_cache_hot == -1)
>  		tsk_cache_hot = task_hot(p, env);
> @@ -9805,9 +9810,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  					active_load_balance_cpu_stop, busiest,
>  					&busiest->active_balance_work);
>  			}
> -
> -			/* We've kicked active balancing, force task migration. */
> -			sd->nr_balance_failed = sd->cache_nice_tries+1;

This has an impact on future calls to need_active_balance() too, no? We enter
this path because need_active_balance() returned true; one of the conditions it
checks for is

	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);

So since we used to reset nr_balanced_failed to cache_nice_tries+1, the above
condition would be false in the next call or two IIUC. But since we remove
that, we could end up here again soon.

Was this intentional?

Thanks

--
Qais Yousef

>  		}
>  	} else {
>  		sd->nr_balance_failed = 0;
> @@ -9963,7 +9965,8 @@ static int active_load_balance_cpu_stop(void *data)
>  			 * @dst_grpmask we need to make that test go away with lying
>  			 * about DST_PINNED.
>  			 */
> -			.flags		= LBF_DST_PINNED,
> +			.flags		= LBF_DST_PINNED |
> +					  LBF_ACTIVE_LB,
>  		};
>  
>  		schedstat_inc(sd->alb_count);
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/8] sched/fair: Add more sched_asym_cpucapacity static branch checks
  2021-01-28 18:31 ` [PATCH 2/8] sched/fair: Add more sched_asym_cpucapacity static branch checks Valentin Schneider
@ 2021-02-03 15:14   ` Qais Yousef
  2021-02-09  8:42   ` Vincent Guittot
  1 sibling, 0 replies; 46+ messages in thread
From: Qais Yousef @ 2021-02-03 15:14 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Rik van Riel, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen,
	Quentin Perret, Pavan Kondeti

On 01/28/21 18:31, Valentin Schneider wrote:
> Rik noted a while back that a handful of
> 
>   sd->flags & SD_ASYM_CPUCAPACITY
> 
> & family in the CFS load-balancer code aren't guarded by the
> sched_asym_cpucapacity static branch.
> 
> The load-balancer is already doing a humongous amount of work, but turning
> those checks into NOPs for those who don't need it is fairly
> straightforward, so do that.
> 
> Suggested-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---

Reviewed-by: Qais Yousef <qais.yousef@arm.com>

Thanks

--
Qais Yousef

>  kernel/sched/fair.c  | 11 ++++++-----
>  kernel/sched/sched.h |  6 ++++++
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f6a4e58ce3c..7d01fa0bfc7e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8465,7 +8465,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			continue;
>  
>  		/* Check for a misfit task on the cpu */
> -		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +		if (sd_has_asym_cpucapacity(env->sd) &&
>  		    sgs->group_misfit_task_load < rq->misfit_task_load) {
>  			sgs->group_misfit_task_load = rq->misfit_task_load;
>  			*sg_status |= SG_OVERLOAD;
> @@ -8522,7 +8522,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	 * CPUs in the group should either be possible to resolve
>  	 * internally or be covered by avg_load imbalance (eventually).
>  	 */
> -	if (sgs->group_type == group_misfit_task &&
> +	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
> +	    sgs->group_type == group_misfit_task &&
>  	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
>  	     sds->local_stat.group_type != group_has_spare))
>  		return false;
> @@ -8605,7 +8606,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	 * throughput. Maximize throughput, power/energy consequences are not
>  	 * considered.
>  	 */
> -	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> +	if (sd_has_asym_cpucapacity(env->sd) &&
>  	    (sgs->group_type <= group_fully_busy) &&
>  	    (group_smaller_min_cpu_capacity(sds->local, sg)))
>  		return false;
> @@ -8728,7 +8729,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
>  	}
>  
>  	/* Check if task fits in the group */
> -	if (sd->flags & SD_ASYM_CPUCAPACITY &&
> +	if (sd_has_asym_cpucapacity(sd) &&
>  	    !task_fits_capacity(p, group->sgc->max_capacity)) {
>  		sgs->group_misfit_task_load = 1;
>  	}
> @@ -9419,7 +9420,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  		 * Higher per-CPU capacity is considered better than balancing
>  		 * average load.
>  		 */
> -		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +		if (sd_has_asym_cpucapacity(env->sd) &&
>  		    capacity_of(env->dst_cpu) < capacity &&
>  		    nr_running == 1)
>  			continue;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 045b01064c1e..21bd71f58c06 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1482,6 +1482,12 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>  extern struct static_key_false sched_asym_cpucapacity;
>  
> +static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
> +{
> +	return static_branch_unlikely(&sched_asym_cpucapacity) &&
> +		sd->flags & SD_ASYM_CPUCAPACITY;
> +}
> +
>  struct sched_group_capacity {
>  	atomic_t		ref;
>  	/*
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
  2021-01-28 18:31 ` [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks Valentin Schneider
@ 2021-02-03 15:15   ` Qais Yousef
  2021-02-03 18:42     ` Valentin Schneider
  2021-02-05 14:31   ` Vincent Guittot
  1 sibling, 1 reply; 46+ messages in thread
From: Qais Yousef @ 2021-02-03 15:15 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 01/28/21 18:31, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
> 
>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
> 
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
> 
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
> 
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
> 
> Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
> existing capacity checks. Note that check_cpu_capacity() uses yet another
> margin (sd->imbalance_pct), and is left alone for now.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7d01fa0bfc7e..58ce0b22fcb0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   */
>  #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
>  
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap' noticeably greater than 'ref'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)

nit: can we use cap1 and cap2 and make the implementation use '>' instead of
'<'? ie:

	#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

this is more intuitive to read IMHO. Especially few lines below we have

	return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);

which pass 'ref->...' as cap which can be confusing when looking at the
function signature @ref.

Either way, this LGTM

Reviewed-by: Qais Yousef <qais.yousef@arm.com>

Thanks

--
Qais Yousef

>  #endif
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>  {
>  	return rq->misfit_task_load &&
> -		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> +		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
>  		 check_cpu_capacity(rq, sd));
>  }
>  
> @@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>  static inline bool
>  group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -	return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> +	return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
>  }
>  
>  /*
> @@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  static inline bool
>  group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -	return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> +	return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>  }
>  
>  static inline enum
> @@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  		 * average load.
>  		 */
>  		if (sd_has_asym_cpucapacity(env->sd) &&
> -		    capacity_of(env->dst_cpu) < capacity &&
> +		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
>  		    nr_running == 1)
>  			continue;
>  
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/8] sched/fair: Use dst_cpu's capacity rather than group {min, max} capacity
  2021-01-28 18:31 ` [PATCH 4/8] sched/fair: Use dst_cpu's capacity rather than group {min, max} capacity Valentin Schneider
@ 2021-02-03 15:15   ` Qais Yousef
  0 siblings, 0 replies; 46+ messages in thread
From: Qais Yousef @ 2021-02-03 15:15 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 01/28/21 18:31, Valentin Schneider wrote:
> Comparing capacity extrema of local and source sched_group's doesn't make
> much sense when at the day of the day the imbalance will be pulled by a
> known env->dst_cpu, whose capacity can be anywhere within the local group's
> capacity extrema.
> 
> Replace group_smaller_{min, max}_cpu_capacity() with comparisons of the
> source group's min/max capacity and the destination CPU's capacity.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---

Reviewed-by: Qais Yousef <qais.yousef@arm.com>

Thanks

--
Qais Yousef

>  kernel/sched/fair.c | 31 +++----------------------------
>  1 file changed, 3 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 58ce0b22fcb0..0959a770ecc0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8352,26 +8352,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>  	return false;
>  }
>  
> -/*
> - * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity than sched_group ref.
> - */
> -static inline bool
> -group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> -	return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
> -}
> -
> -/*
> - * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity_orig than sched_group ref.
> - */
> -static inline bool
> -group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> -	return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
> -}
> -
>  static inline enum
>  group_type group_classify(unsigned int imbalance_pct,
>  			  struct sched_group *group,
> @@ -8523,15 +8503,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	if (!sgs->sum_h_nr_running)
>  		return false;
>  
> -	/*
> -	 * Don't try to pull misfit tasks we can't help.
> -	 * We can use max_capacity here as reduction in capacity on some
> -	 * CPUs in the group should either be possible to resolve
> -	 * internally or be covered by avg_load imbalance (eventually).
> -	 */
> +	/* Don't try to pull misfit tasks we can't help */
>  	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
>  	    sgs->group_type == group_misfit_task &&
> -	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
> +	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
>  	     sds->local_stat.group_type != group_has_spare))
>  		return false;
>  
> @@ -8615,7 +8590,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	 */
>  	if (sd_has_asym_cpucapacity(env->sd) &&
>  	    (sgs->group_type <= group_fully_busy) &&
> -	    (group_smaller_min_cpu_capacity(sds->local, sg)))
> +	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
>  		return false;
>  
>  	return true;
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities
  2021-01-28 18:31 ` [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities Valentin Schneider
@ 2021-02-03 15:15   ` Qais Yousef
  2021-02-04 10:49     ` Dietmar Eggemann
  0 siblings, 1 reply; 46+ messages in thread
From: Qais Yousef @ 2021-02-03 15:15 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 01/28/21 18:31, Valentin Schneider wrote:
> check_misfit_status() checks for both capacity pressure & available CPUs
> with higher capacity. Now that we have a sane(ish) capacity comparison
> margin which is used throughout load-balance, this can be condensed into a
> single check:
> 
>   capacity_greater(<root_domain max capacity>, <misfit task CPU's capacity>);
> 
> This has the added benefit of returning false if the misfit task CPU's is
> heavily pressured, but there are no better candidates for migration.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---

check_cpu_capacity() call looks redundant now, yes.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>

Thanks

--
Qais Yousef

>  kernel/sched/fair.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0959a770ecc0..ef44474b8fbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8254,14 +8254,12 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>  
>  /*
>   * Check whether a rq has a misfit task and if it looks like we can actually
> - * help that task: we can migrate the task to a CPU of higher capacity, or
> - * the task's current CPU is heavily pressured.
> + * help that task: we can migrate the task to a CPU of higher capacity.
>   */
> -static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> +static inline int check_misfit_status(struct rq *rq)
>  {
>  	return rq->misfit_task_load &&
> -		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
> -		 check_cpu_capacity(rq, sd));
> +		capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity);
>  }
>  
>  /*
> @@ -10238,7 +10236,7 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
>  		 * to run the misfit task on.
>  		 */
> -		if (check_misfit_status(rq, sd)) {
> +		if (check_misfit_status(rq)) {
>  			flags = NOHZ_KICK_MASK;
>  			goto unlock;
>  		}
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/8] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-01-28 18:31 ` [PATCH 6/8] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
@ 2021-02-03 15:16   ` Qais Yousef
  2021-02-03 18:43     ` Valentin Schneider
  0 siblings, 1 reply; 46+ messages in thread
From: Qais Yousef @ 2021-02-03 15:16 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 01/28/21 18:31, Valentin Schneider wrote:
> Consider the following (hypothetical) asymmetric CPU capacity topology,
> with some amount of capacity pressure (RT | DL | IRQ | thermal):
> 
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
> 
>   | CPU | capacity_orig | capacity |
>   |-----+---------------+----------|
>   |   0 |           870 |      860 |
>   |   1 |           870 |      600 |
>   |   2 |          1024 |      850 |
>   |   3 |          1024 |      860 |
> 
> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> CPU1's misfit task. This would then fall on CPU2 to pull the task.

I think this scenario would be hard in practice, but not impossible. Maybe
gaming could push the system that hard.

> 
> This currently won't happen, because CPU2 will fail
> 
>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
> 
> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> level. In this case, the max_capacity is that of CPU0's, which is at this
> point in time greater than that of CPU2's. This comparison doesn't make
> much sense, given that the only CPUs we should care about in this scenario
> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> destination CPU).
> 
> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> env->dst_cpu would grant it a capacity uplift. Separately track whether a
> sched_group contains a misfit task to still classify it as
> group_misfit_task and not pick it as busiest group when pulling from a
> lower-capacity CPU (which is the current behaviour and prevents
> down-migration).
> 
> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
> than the local CPU's, add a capacity check there.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++----------
>  1 file changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ef44474b8fbf..0ac2f876b86f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5765,6 +5765,12 @@ static unsigned long capacity_of(int cpu)
>  	return cpu_rq(cpu)->cpu_capacity;
>  }
>  
> +/* Is CPU a's capacity noticeably greater than CPU b's? */
> +static inline bool cpu_capacity_greater(int a, int b)
> +{
> +	return capacity_greater(capacity_of(a), capacity_of(b));
> +}
> +
>  static void record_wakee(struct task_struct *p)
>  {
>  	/*
> @@ -8093,7 +8099,8 @@ struct sg_lb_stats {
>  	unsigned int group_weight;
>  	enum group_type group_type;
>  	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> -	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> +	unsigned long group_misfit_task_load; /* Task load that can be uplifted */
> +	int           group_has_misfit_task; /* A CPU has a task too big for its capacity */
>  #ifdef CONFIG_NUMA_BALANCING
>  	unsigned int nr_numa_running;
>  	unsigned int nr_preferred_running;
> @@ -8364,7 +8371,7 @@ group_type group_classify(unsigned int imbalance_pct,
>  	if (sgs->group_asym_packing)
>  		return group_asym_packing;
>  
> -	if (sgs->group_misfit_task_load)
> +	if (sgs->group_has_misfit_task)
>  		return group_misfit_task;
>  
>  	if (!group_has_capacity(imbalance_pct, sgs))
> @@ -8450,11 +8457,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			continue;
>  
>  		/* Check for a misfit task on the cpu */
> -		if (sd_has_asym_cpucapacity(env->sd) &&
> -		    sgs->group_misfit_task_load < rq->misfit_task_load) {
> -			sgs->group_misfit_task_load = rq->misfit_task_load;
> -			*sg_status |= SG_OVERLOAD;
> -		}
> +		if (!sd_has_asym_cpucapacity(env->sd) ||
> +		    !rq->misfit_task_load)
> +			continue;
> +
> +		*sg_status |= SG_OVERLOAD;
> +		sgs->group_has_misfit_task = true;
> +
> +		/*
> +		 * Don't attempt to maximize load for misfit tasks that can't be
> +		 * granted a CPU capacity uplift.
> +		 */
> +		if (cpu_capacity_greater(env->dst_cpu, i))
> +			sgs->group_misfit_task_load = max(
> +				sgs->group_misfit_task_load,
> +				rq->misfit_task_load);

nit: missing curly braces around the if.

>  	}
>  
>  	/* Check if dst CPU is idle and preferred to this group */
> @@ -8504,7 +8521,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	/* Don't try to pull misfit tasks we can't help */
>  	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
>  	    sgs->group_type == group_misfit_task &&
> -	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> +	    (!sgs->group_misfit_task_load ||
>  	     sds->local_stat.group_type != group_has_spare))
>  		return false;
>  
> @@ -9464,15 +9481,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  		case migrate_misfit:
>  			/*
>  			 * For ASYM_CPUCAPACITY domains with misfit tasks we
> -			 * simply seek the "biggest" misfit task.
> +			 * simply seek the "biggest" misfit task we can
> +			 * accommodate.
>  			 */
> +			if (!cpu_capacity_greater(env->dst_cpu, i))
> +				continue;

Both this hunk and the one above mean we will end up searching harder to pull
the task into the right cpu taking actual capacity into account. Which is
a good improvement.

Reviewed-by: Qais Yousef <qais.youesf@arm.com>

Thanks

--
Qais Yousef

> +
>  			if (rq->misfit_task_load > busiest_load) {
>  				busiest_load = rq->misfit_task_load;
>  				busiest = rq;
>  			}
>  
>  			break;
> -
>  		}
>  	}
>  
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit
  2021-01-28 18:31 ` [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit Valentin Schneider
@ 2021-02-03 15:16   ` Qais Yousef
  2021-02-03 18:43     ` Valentin Schneider
  2021-02-09  8:58   ` Vincent Guittot
  1 sibling, 1 reply; 46+ messages in thread
From: Qais Yousef @ 2021-02-03 15:16 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 01/28/21 18:31, Valentin Schneider wrote:
> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and CPUs 2-3 as
> bigs. The resulting sched_domain hierarchy is:
> 
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
> 
> When running a multithreaded CPU-bound workload (i.e. 1 hog per CPU), the
> expected behaviour is to have the about-to-idle big CPUs pull a hog from
> the LITTLEs, since bigs will complete their work sooner than LITTLEs.
> 
> Further Consider a scenario where:
> - CPU0 is idle (e.g. its hog got migrated to one of the big CPUs)
> - CPU1 is currently executing a per-CPU kworker, preempting the CPU hog
> - CPU2 and CPU3 are executing CPU-hogs
> 
> CPU0 goes through load_balance() at MC level, and tries to pick stuff from
> CPU1, but:
> - the hog can't be pulled, because it's task_hot()
> - the kworker can't be pulled, because it's pinned to CPU1, which sets
>   LBF_SOME_PINNED
> 
> This load balance attempts ends with no load pulled, LBF_SOME_PINNED set,
> and as a consequence we set the imbalance flag of DIE's [0, 1]
> sched_group_capacity.
> 
> Shortly after, CPU2 completes its work and is about to go idle. It goes
> through the newidle_balance(), and we would really like it to active
> balance the hog running on CPU1 (which is a misfit task). However,
> sgc->imbalance is set for the LITTLE group at DIE level, so the group gets
> classified as group_imbalanced rather than group_misfit_task.
> 
> Unlike group_misfit_task (via migrate_misfit), the active balance logic
> doesn't have any specific case for group_imbalanced, so CPU2 ends up going
> idle. We'll have to wait for a load balance on CPU0 or CPU1 to happen and
> clear the imbalance flag, and then for another DIE-level load-balance on
> CPU2 to happen to pull the task off of CPU1. That's several precious
> milliseconds wasted down the drain.

I think that might help with in games where some worker threads could get end
up on little cores and this delay in up migration could cause frames to drop.

> 
> Giving group_misfit_task a higher group_classify() priority than
> group_imbalance doesn't seem like the right thing to do. Instead, make
> need_active_balance() return true for any migration_type when the

s/need_active_balance()/voluntary_active_balance()/?

> destination CPU is idle and the source CPU has a misfit task.
> 
> While at it, add an sd_has_asym_cpucapacity() guard in
> need_active_balance().

ditto.

> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0ac2f876b86f..cba9f97d9beb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9557,9 +9557,22 @@ static int need_active_balance(struct lb_env *env)
>  			return 1;
>  	}
>  
> +	if (!sd_has_asym_cpucapacity(sd))
> +		return 0;
> +
>  	if (env->migration_type == migrate_misfit)
>  		return 1;
>  
> +	/*
> +	 * If we failed to pull anything and the src_rq has a misfit task, but
> +	 * the busiest group_type was higher than group_misfit_task, try to
> +	 * go for a misfit active balance anyway.
> +	 */
> +	if ((env->idle != CPU_NOT_IDLE) &&
> +	    env->src_rq->misfit_task_load &&
> +	    cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> +		return 1;
> +

Reviewed-by: Qais Yousef <qais.yousef@arm.com>

Thanks

--
Qais Yousef

>  	return 0;
>  }
>  
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks
  2021-01-28 18:31 ` [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
@ 2021-02-03 15:17   ` Qais Yousef
  2021-02-08 16:21   ` Vincent Guittot
  1 sibling, 0 replies; 46+ messages in thread
From: Qais Yousef @ 2021-02-03 15:17 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 01/28/21 18:31, Valentin Schneider wrote:
> Misfit tasks can and will be preempted by the stopper to migrate them over
> to a higher-capacity CPU. However, when runnable but not current misfit
> tasks are scanned by the load balancer (i.e. detach_tasks()), the
> task_hot() ratelimiting logic may prevent us from enqueuing said task onto
> a higher-capacity CPU.
> 
> Align detach_tasks() with the active-balance logic and let it pick a
> cache-hot misfit task when the destination CPU can provide a capacity
> uplift.

Good catch.

> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cba9f97d9beb..c2351b87824f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7484,6 +7484,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>  	if (env->sd->flags & SD_SHARE_CPUCAPACITY)
>  		return 0;
>  
> +	/*
> +	 * On a (sane) asymmetric CPU capacity system, the increase in compute
> +	 * capacity should offset any potential performance hit caused by a
> +	 * migration.
> +	 */
> +	if (sd_has_asym_cpucapacity(env->sd) &&
> +	    env->idle != CPU_NOT_IDLE &&
> +	    !task_fits_capacity(p, capacity_of(env->src_cpu)) &&

Note for a very busy task that is running on the biggest cpu this will always
return true.

> +	    cpu_capacity_greater(env->dst_cpu, env->src_cpu))

But this will save us from triggering unnecessary migration.

We could swap them and optimize for this particular case, but tbh this is the
type of micro optimization that is hard to know whether it makes a real
difference or not..

Anyways, this looks good to me.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>

Thanks

--
Qais Yousef

> +		return 0;
> +
>  	/*
>  	 * Buddy candidates are cache hot:
>  	 */
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-02-03 15:14   ` Qais Yousef
@ 2021-02-03 18:42     ` Valentin Schneider
  2021-02-04 15:05       ` Qais Yousef
  0 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-02-03 18:42 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

>> @@ -9805,9 +9810,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  					active_load_balance_cpu_stop, busiest,
>>  					&busiest->active_balance_work);
>>  			}
>> -
>> -			/* We've kicked active balancing, force task migration. */
>> -			sd->nr_balance_failed = sd->cache_nice_tries+1;
>
> This has an impact on future calls to need_active_balance() too, no? We enter
> this path because need_active_balance() returned true; one of the conditions it
> checks for is
>
> 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
>
> So since we used to reset nr_balanced_failed to cache_nice_tries+1, the above
> condition would be false in the next call or two IIUC. But since we remove
> that, we could end up here again soon.
>
> Was this intentional?
>

Partially, I'd say :-)

If you look at active_load_balance_cpu_stop(), it does

  sd->nr_balance_failed = 0;

when it successfully pulls a task. So we get a reset of the failed counter
on pull, which I've preserved. As for interactions with later
need_active_balance(), the commit that introduced the current counter write
(which is over 15 years old!):  

  3950745131e2 ("[PATCH] sched: fix SMT scheduling problems")

only states the task_hot() issue; thus I'm doubtful whether said
interaction was intentional.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
  2021-02-03 15:15   ` Qais Yousef
@ 2021-02-03 18:42     ` Valentin Schneider
  0 siblings, 0 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-02-03 18:42 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 03/02/21 15:15, Qais Yousef wrote:
> On 01/28/21 18:31, Valentin Schneider wrote:
>> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>>   */
>>  #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
>>  
>> +/*
>> + * The margin used when comparing CPU capacities.
>> + * is 'cap' noticeably greater than 'ref'
>> + *
>> + * (default: ~5%)
>> + */
>> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)
>
> nit: can we use cap1 and cap2 and make the implementation use '>' instead of
> '<'? ie:
>
> 	#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
>
> this is more intuitive to read IMHO. Especially few lines below we have
>
> 	return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>
> which pass 'ref->...' as cap which can be confusing when looking at the
> function signature @ref.
>

Unfortunate naming indeed... And I suppose it can't hurt to follow the
argument "declaration" order. 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/8] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-02-03 15:16   ` Qais Yousef
@ 2021-02-03 18:43     ` Valentin Schneider
  0 siblings, 0 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-02-03 18:43 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 03/02/21 15:16, Qais Yousef wrote:
> On 01/28/21 18:31, Valentin Schneider wrote:
>> Consider the following (hypothetical) asymmetric CPU capacity topology,
>> with some amount of capacity pressure (RT | DL | IRQ | thermal):
>> 
>>   DIE [          ]
>>   MC  [    ][    ]
>>        0  1  2  3
>> 
>>   | CPU | capacity_orig | capacity |
>>   |-----+---------------+----------|
>>   |   0 |           870 |      860 |
>>   |   1 |           870 |      600 |
>>   |   2 |          1024 |      850 |
>>   |   3 |          1024 |      860 |
>> 
>> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
>> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
>> sufficiently busy, i.e. don't have enough spare capacity to accommodate
>> CPU1's misfit task. This would then fall on CPU2 to pull the task.
>
> I think this scenario would be hard in practice, but not impossible. Maybe
> gaming could push the system that hard.
>

Actually I wouldn't be surprised if a moderatly busy Android environment
could hit this - slight thermal pressure on the bigs, RT pressure because
we know folks love (ab)using RT, a pinch of IRQs in the mix...

>> @@ -8450,11 +8457,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>  			continue;
>>  
>>  		/* Check for a misfit task on the cpu */
>> -		if (sd_has_asym_cpucapacity(env->sd) &&
>> -		    sgs->group_misfit_task_load < rq->misfit_task_load) {
>> -			sgs->group_misfit_task_load = rq->misfit_task_load;
>> -			*sg_status |= SG_OVERLOAD;
>> -		}
>> +		if (!sd_has_asym_cpucapacity(env->sd) ||
>> +		    !rq->misfit_task_load)
>> +			continue;
>> +
>> +		*sg_status |= SG_OVERLOAD;
>> +		sgs->group_has_misfit_task = true;
>> +
>> +		/*
>> +		 * Don't attempt to maximize load for misfit tasks that can't be
>> +		 * granted a CPU capacity uplift.
>> +		 */
>> +		if (cpu_capacity_greater(env->dst_cpu, i))
>> +			sgs->group_misfit_task_load = max(
>> +				sgs->group_misfit_task_load,
>> +				rq->misfit_task_load);
>
> nit: missing curly braces around the if.
>

Ack.

>> @@ -8504,7 +8521,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>  	/* Don't try to pull misfit tasks we can't help */
>>  	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
>>  	    sgs->group_type == group_misfit_task &&
>> -	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
>> +	    (!sgs->group_misfit_task_load ||
>>  	     sds->local_stat.group_type != group_has_spare))
>>  		return false;
>>  
>> @@ -9464,15 +9481,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>>  		case migrate_misfit:
>>  			/*
>>  			 * For ASYM_CPUCAPACITY domains with misfit tasks we
>> -			 * simply seek the "biggest" misfit task.
>> +			 * simply seek the "biggest" misfit task we can
>> +			 * accommodate.
>>  			 */
>> +			if (!cpu_capacity_greater(env->dst_cpu, i))
>> +				continue;
>
> Both this hunk and the one above mean we will end up searching harder to pull
> the task into the right cpu taking actual capacity into account. Which is
> a good improvement.
>

Note that those extra checks are to make sure we *don't* downmigrate tasks
(as stated somewhere above, this change lets find_busiest_queue() iterate
over CPUs bigger than the local CPU's, which wasn't the case before). A "big"
CPU will still get the chance to pull a "medium" task, even if a "medium"
CPU would have been a "better" choice.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit
  2021-02-03 15:16   ` Qais Yousef
@ 2021-02-03 18:43     ` Valentin Schneider
  2021-02-04 11:44       ` Dietmar Eggemann
  0 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-02-03 18:43 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 03/02/21 15:16, Qais Yousef wrote:
> On 01/28/21 18:31, Valentin Schneider wrote:
>> Giving group_misfit_task a higher group_classify() priority than
>> group_imbalance doesn't seem like the right thing to do. Instead, make
>> need_active_balance() return true for any migration_type when the
>
> s/need_active_balance()/voluntary_active_balance()/?
>
>> destination CPU is idle and the source CPU has a misfit task.
>> 
>> While at it, add an sd_has_asym_cpucapacity() guard in
>> need_active_balance().
>
> ditto.
>

Myes, clearly this has been left to ferment for too long!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/8] sched/fair: misfit task load-balance tweaks
  2021-02-03 15:14 ` [PATCH 0/8] sched/fair: misfit task load-balance tweaks Qais Yousef
@ 2021-02-03 18:43   ` Valentin Schneider
  2021-02-04 12:03     ` Dietmar Eggemann
  0 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-02-03 18:43 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

Hi Qais,

On 03/02/21 15:14, Qais Yousef wrote:
> On 01/28/21 18:31, Valentin Schneider wrote:
>> Hi folks,
>> 
>> Here is this year's series of misfit changes. On the menu:
>> 
>> o Patch 1 is an independent active balance cleanup
>> o Patch 2 adds some more sched_asym_cpucapacity static branches
>> o Patch 3 introduces yet another margin for capacity to capacity
>>   comparisons
>> o Patches 4-6 build on top of patch 3 and change capacity comparisons
>>   throughout misfit load balancing  
>> o Patches 7-8 fix some extra misfit issues I've been seeing on "real"
>>   workloads.
>> 
>> IMO the somewhat controversial bit is patch 3, because it attempts to solve
>> margin issues by... Adding another margin. This does solve issues on
>> existing platforms (e.g. Pixel4), but we'll be back to square one the day
>> some "clever" folks spin a platform with two different CPU capacities less than
>> 5% apart.
>
> One more margin is a cause of apprehension to me. But in this case I think it
> is the appropriate thing to do now. I can't think of a scenario where this
> could hurt.
>

Thanks for having a look!

> Thanks
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities
  2021-02-03 15:15   ` Qais Yousef
@ 2021-02-04 10:49     ` Dietmar Eggemann
  2021-02-04 11:34       ` Valentin Schneider
  0 siblings, 1 reply; 46+ messages in thread
From: Dietmar Eggemann @ 2021-02-04 10:49 UTC (permalink / raw)
  To: Qais Yousef, Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Quentin Perret, Pavan Kondeti, Rik van Riel

On 03/02/2021 16:15, Qais Yousef wrote:
> On 01/28/21 18:31, Valentin Schneider wrote:

[...]

>> @@ -10238,7 +10236,7 @@ static void nohz_balancer_kick(struct rq *rq)
>>  		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
>>  		 * to run the misfit task on.
>>  		 */
>> -		if (check_misfit_status(rq, sd)) {
>> +		if (check_misfit_status(rq)) {

Since check_misfit_status() doesn't need sd anymore it looks like that
rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu)) could be replaced by
static_branch_unlikely(&sched_asym_cpucapacity)) in nohz_balancer_kick().

But as you mentioned in an earlier conversation we do need to check sd
because of asymmetric CPU capacity systems w/ exclusive cpusets which
could create symmetric islands (unique capacity_orig among CPUs).

Maybe worth putting a comment here (similar to the one in sis()) so
people don't try to optimize?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities
  2021-02-04 10:49     ` Dietmar Eggemann
@ 2021-02-04 11:34       ` Valentin Schneider
  2021-02-04 14:57         ` Dietmar Eggemann
  0 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-02-04 11:34 UTC (permalink / raw)
  To: Dietmar Eggemann, Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Quentin Perret, Pavan Kondeti, Rik van Riel

On 04/02/21 11:49, Dietmar Eggemann wrote:
> On 03/02/2021 16:15, Qais Yousef wrote:
>> On 01/28/21 18:31, Valentin Schneider wrote:
>
> [...]
>
>>> @@ -10238,7 +10236,7 @@ static void nohz_balancer_kick(struct rq *rq)
>>>  		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
>>>  		 * to run the misfit task on.
>>>  		 */
>>> -		if (check_misfit_status(rq, sd)) {
>>> +		if (check_misfit_status(rq)) {
>
> Since check_misfit_status() doesn't need sd anymore it looks like that
> rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu)) could be replaced by
> static_branch_unlikely(&sched_asym_cpucapacity)) in nohz_balancer_kick().
>
> But as you mentioned in an earlier conversation we do need to check sd
> because of asymmetric CPU capacity systems w/ exclusive cpusets which
> could create symmetric islands (unique capacity_orig among CPUs).
>
> Maybe worth putting a comment here (similar to the one in sis()) so
> people don't try to optimize?

How about:

--->8---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2351b87824f..4b71f4d1d324 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6322,15 +6322,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * sd_asym_cpucapacity rather than sd_llc.
 	 */
 	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+		/* See sd_has_asym_cpucapacity() */
 		sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
-		/*
-		 * On an asymmetric CPU capacity system where an exclusive
-		 * cpuset defines a symmetric island (i.e. one unique
-		 * capacity_orig value through the cpuset), the key will be set
-		 * but the CPUs within that cpuset will not have a domain with
-		 * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
-		 * capacity path.
-		 */
 		if (sd) {
 			i = select_idle_capacity(p, sd, target);
 			return ((unsigned)i < nr_cpumask_bits) ? i : target;
@@ -10274,6 +10267,10 @@ static void nohz_balancer_kick(struct rq *rq)
 		}
 	}
 
+	/*
+	 * Below checks don't actually use the sd, but they still hinge on its
+	 * presence. See sd_has_asym_cpucapacity().
+	 */
 	sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu));
 	if (sd) {
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 21bd71f58c06..ea7f0155e268 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1482,6 +1482,33 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 extern struct static_key_false sched_asym_cpucapacity;
 
+/*
+ * Note that the static key is system-wide, but the visibility of
+ * SD_ASYM_CPUCAPACITY isn't. Thus the static key being enabled does not
+ * imply all CPUs can see asymmetry.
+ *
+ * Consider an asymmetric CPU capacity system such as:
+ *
+ * MC [           ]
+ *     0 1 2 3 4 5
+ *     L L L L B B
+ *
+ * w/ arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
+ *
+ * By default, booting this system will enable the sched_asym_cpucapacity
+ * static key, and all CPUs will see SD_ASYM_CPUCAPACITY set at their MC
+ * sched_domain.
+ *
+ * Further consider exclusive cpusets creating a "symmetric island":
+ *
+ * MC [   ][      ]
+ *     0 1  2 3 4 5
+ *     L L  L L B B
+ *
+ * Again, booting this will enable the static key, but CPUs 0-1 will *not* have
+ * SD_ASYM_CPUCAPACITY set in any of their sched_domain. This is the intending
+ * behaviour, as CPUs 0-1 should be treated as a regular, isolated SMP system.
+ */
 static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
 {
 	return static_branch_unlikely(&sched_asym_cpucapacity) &&

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit
  2021-02-03 18:43     ` Valentin Schneider
@ 2021-02-04 11:44       ` Dietmar Eggemann
  2021-02-04 12:22         ` Valentin Schneider
  0 siblings, 1 reply; 46+ messages in thread
From: Dietmar Eggemann @ 2021-02-04 11:44 UTC (permalink / raw)
  To: Valentin Schneider, Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Quentin Perret, Pavan Kondeti, Rik van Riel

On 03/02/2021 19:43, Valentin Schneider wrote:
> On 03/02/21 15:16, Qais Yousef wrote:
>> On 01/28/21 18:31, Valentin Schneider wrote:
>>> Giving group_misfit_task a higher group_classify() priority than
>>> group_imbalance doesn't seem like the right thing to do. Instead, make
>>> need_active_balance() return true for any migration_type when the
>>
>> s/need_active_balance()/voluntary_active_balance()/?
>>
>>> destination CPU is idle and the source CPU has a misfit task.
>>>
>>> While at it, add an sd_has_asym_cpucapacity() guard in
>>> need_active_balance().
>>
>> ditto.
>>
> 
> Myes, clearly this has been left to ferment for too long!

Wasn't the migrate_misfit condition moved from
voluntary_active_balance() into need_active_balance() by commit
("sched/fair: Reduce cases for active balance")?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/8] sched/fair: misfit task load-balance tweaks
  2021-02-03 18:43   ` Valentin Schneider
@ 2021-02-04 12:03     ` Dietmar Eggemann
  2021-02-04 12:36       ` Valentin Schneider
  0 siblings, 1 reply; 46+ messages in thread
From: Dietmar Eggemann @ 2021-02-04 12:03 UTC (permalink / raw)
  To: Valentin Schneider, Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Quentin Perret, Pavan Kondeti, Rik van Riel

On 03/02/2021 19:43, Valentin Schneider wrote:
> Hi Qais,
> 
> On 03/02/21 15:14, Qais Yousef wrote:
>> On 01/28/21 18:31, Valentin Schneider wrote:
>>> Hi folks,
>>>
>>> Here is this year's series of misfit changes. On the menu:
>>>
>>> o Patch 1 is an independent active balance cleanup
>>> o Patch 2 adds some more sched_asym_cpucapacity static branches
>>> o Patch 3 introduces yet another margin for capacity to capacity
>>>   comparisons
>>> o Patches 4-6 build on top of patch 3 and change capacity comparisons
>>>   throughout misfit load balancing  
>>> o Patches 7-8 fix some extra misfit issues I've been seeing on "real"
>>>   workloads.
>>>
>>> IMO the somewhat controversial bit is patch 3, because it attempts to solve
>>> margin issues by... Adding another margin. This does solve issues on
>>> existing platforms (e.g. Pixel4), but we'll be back to square one the day
>>> some "clever" folks spin a platform with two different CPU capacities less than
>>> 5% apart.
>>
>> One more margin is a cause of apprehension to me. But in this case I think it
>> is the appropriate thing to do now. I can't think of a scenario where this
>> could hurt.
>>
> 
> Thanks for having a look!
> 
>> Thanks
>>
>> --
>> Qais Yousef

How did you verify the benefit of these changes?

It's clear that you need a platform with capacity_orig diffs <20%
between CPU types (like Pixel4 - SD855 (4x261, 3x871, 1x1024) or QC's
RB5 platform - SD865 (4x284, 3x871, 1*1024)) but which
benchmark/testcase did you use?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit
  2021-02-04 11:44       ` Dietmar Eggemann
@ 2021-02-04 12:22         ` Valentin Schneider
  0 siblings, 0 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-02-04 12:22 UTC (permalink / raw)
  To: Dietmar Eggemann, Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Quentin Perret, Pavan Kondeti, Rik van Riel

On 04/02/21 12:44, Dietmar Eggemann wrote:
> On 03/02/2021 19:43, Valentin Schneider wrote:
>> On 03/02/21 15:16, Qais Yousef wrote:
>>> On 01/28/21 18:31, Valentin Schneider wrote:
>>>> Giving group_misfit_task a higher group_classify() priority than
>>>> group_imbalance doesn't seem like the right thing to do. Instead, make
>>>> need_active_balance() return true for any migration_type when the
>>>
>>> s/need_active_balance()/voluntary_active_balance()/?
>>>
>>>> destination CPU is idle and the source CPU has a misfit task.
>>>>
>>>> While at it, add an sd_has_asym_cpucapacity() guard in
>>>> need_active_balance().
>>>
>>> ditto.
>>>
>> 
>> Myes, clearly this has been left to ferment for too long!
>
> Wasn't the migrate_misfit condition moved from
> voluntary_active_balance() into need_active_balance() by commit
> ("sched/fair: Reduce cases for active balance")?

Bah, you're right, I got confused with when I first wrote that vs when I
last updated the changelog.

As of

  e9b9734b7465 ("sched/fair: Reduce cases for active balance")e9b9734b7465

The above changelog is actually correct.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/8] sched/fair: misfit task load-balance tweaks
  2021-02-04 12:03     ` Dietmar Eggemann
@ 2021-02-04 12:36       ` Valentin Schneider
  0 siblings, 0 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-02-04 12:36 UTC (permalink / raw)
  To: Dietmar Eggemann, Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Quentin Perret, Pavan Kondeti, Rik van Riel

On 04/02/21 13:03, Dietmar Eggemann wrote:
> How did you verify the benefit of these changes?
>
> It's clear that you need a platform with capacity_orig diffs <20%
> between CPU types (like Pixel4 - SD855 (4x261, 3x871, 1x1024) or QC's
> RB5 platform - SD865 (4x284, 3x871, 1*1024)) but which
> benchmark/testcase did you use?

Benchmark is the usual culprit:

  https://lisa-linux-integrated-system-analysis.readthedocs.io/en/master/kernel_tests.html#lisa.tests.scheduler.misfit.StaggeredFinishes

This test spawns 1 CPU hog per CPU, and screams whenever a CPU of capacity
X is running a hog while another CPU of capacity Y > X has been idling for
"too long" (a few ms). IOW it makes sure upmigration happens in a timely
manner.

Some of the test platforms (Juno (4+2 big.LITTLE), HiKey960 (4+4
big.LITTLE)) show some improvements due to the last 2 patches.

As for systems with CPUs in the [819-1024] "deadzone", Ionela's been kindly
running said test on said RB5, and the upmigrations look just fine with the
patches applied.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities
  2021-02-04 11:34       ` Valentin Schneider
@ 2021-02-04 14:57         ` Dietmar Eggemann
  0 siblings, 0 replies; 46+ messages in thread
From: Dietmar Eggemann @ 2021-02-04 14:57 UTC (permalink / raw)
  To: Valentin Schneider, Qais Yousef
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Morten Rasmussen, Quentin Perret, Pavan Kondeti, Rik van Riel

On 04/02/2021 12:34, Valentin Schneider wrote:
> On 04/02/21 11:49, Dietmar Eggemann wrote:
>> On 03/02/2021 16:15, Qais Yousef wrote:
>>> On 01/28/21 18:31, Valentin Schneider wrote:

[...]

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 21bd71f58c06..ea7f0155e268 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1482,6 +1482,33 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>  extern struct static_key_false sched_asym_cpucapacity;
>  
> +/*
> + * Note that the static key is system-wide, but the visibility of
> + * SD_ASYM_CPUCAPACITY isn't. Thus the static key being enabled does not
> + * imply all CPUs can see asymmetry.
> + *
> + * Consider an asymmetric CPU capacity system such as:
> + *
> + * MC [           ]
> + *     0 1 2 3 4 5
> + *     L L L L B B
> + *
> + * w/ arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
> + *
> + * By default, booting this system will enable the sched_asym_cpucapacity
> + * static key, and all CPUs will see SD_ASYM_CPUCAPACITY set at their MC
> + * sched_domain.
> + *
> + * Further consider exclusive cpusets creating a "symmetric island":
> + *
> + * MC [   ][      ]
> + *     0 1  2 3 4 5
> + *     L L  L L B B
> + *
> + * Again, booting this will enable the static key, but CPUs 0-1 will *not* have
> + * SD_ASYM_CPUCAPACITY set in any of their sched_domain. This is the intending

s/intending/intended

> + * behaviour, as CPUs 0-1 should be treated as a regular, isolated SMP system.
> + */

LGTM.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-02-03 18:42     ` Valentin Schneider
@ 2021-02-04 15:05       ` Qais Yousef
  0 siblings, 0 replies; 46+ messages in thread
From: Qais Yousef @ 2021-02-04 15:05 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 02/03/21 18:42, Valentin Schneider wrote:
> >> @@ -9805,9 +9810,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >>  					active_load_balance_cpu_stop, busiest,
> >>  					&busiest->active_balance_work);
> >>  			}
> >> -
> >> -			/* We've kicked active balancing, force task migration. */
> >> -			sd->nr_balance_failed = sd->cache_nice_tries+1;
> >
> > This has an impact on future calls to need_active_balance() too, no? We enter
> > this path because need_active_balance() returned true; one of the conditions it
> > checks for is
> >
> > 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
> >
> > So since we used to reset nr_balanced_failed to cache_nice_tries+1, the above
> > condition would be false in the next call or two IIUC. But since we remove
> > that, we could end up here again soon.
> >
> > Was this intentional?
> >
> 
> Partially, I'd say :-)
> 
> If you look at active_load_balance_cpu_stop(), it does
> 
>   sd->nr_balance_failed = 0;
> 
> when it successfully pulls a task. So we get a reset of the failed counter
> on pull, which I've preserved. As for interactions with later
> need_active_balance(), the commit that introduced the current counter write
> (which is over 15 years old!):  
> 
>   3950745131e2 ("[PATCH] sched: fix SMT scheduling problems")
> 
> only states the task_hot() issue; thus I'm doubtful whether said
> interaction was intentional.

The '+1' was added in that comment. 'Original' code was just resetting the
nr_balance_failed cache_nice_tries, so that we don't do another one too soon
I think.

With this change, no active balance is allowed until later. Which makes sense.
I can't see why we would have allowed another kick sooner tbh. But as you say,
this is ancient piece of logic.

I agree I can't see a reason to worry about this (potential) change of
behavior.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-01-28 18:31 ` [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
  2021-02-03 15:14   ` Qais Yousef
@ 2021-02-05 13:51   ` Vincent Guittot
  2021-02-05 14:05     ` Valentin Schneider
  1 sibling, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2021-02-05 13:51 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
>
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
>
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose.
>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..0f6a4e58ce3c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7423,6 +7423,7 @@ enum migration_type {
>  #define LBF_SOME_PINNED        0x08
>  #define LBF_NOHZ_STATS 0x10
>  #define LBF_NOHZ_AGAIN 0x20
> +#define LBF_ACTIVE_LB  0x40
>
>  struct lb_env {
>         struct sched_domain     *sd;
> @@ -7608,10 +7609,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>
>         /*
>          * Aggressive migration if:
> -        * 1) destination numa is preferred
> -        * 2) task is cache cold, or
> -        * 3) too many balance attempts have failed.
> +        * 1) active balance
> +        * 2) destination numa is preferred
> +        * 3) task is cache cold, or
> +        * 4) too many balance attempts have failed.
>          */
> +       if (env->flags & LBF_ACTIVE_LB)
> +               return 1;
> +

This changes the behavior for numa system because it skips
migrate_degrades_locality() which can return 1 and prevent active
migration whatever nr_balance_failed

Is that intentional ?

>         tsk_cache_hot = migrate_degrades_locality(p, env);
>         if (tsk_cache_hot == -1)
>                 tsk_cache_hot = task_hot(p, env);
> @@ -9805,9 +9810,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                                         active_load_balance_cpu_stop, busiest,
>                                         &busiest->active_balance_work);
>                         }
> -
> -                       /* We've kicked active balancing, force task migration. */
> -                       sd->nr_balance_failed = sd->cache_nice_tries+1;
>                 }
>         } else {
>                 sd->nr_balance_failed = 0;
> @@ -9963,7 +9965,8 @@ static int active_load_balance_cpu_stop(void *data)
>                          * @dst_grpmask we need to make that test go away with lying
>                          * about DST_PINNED.
>                          */
> -                       .flags          = LBF_DST_PINNED,
> +                       .flags          = LBF_DST_PINNED |
> +                                         LBF_ACTIVE_LB,
>                 };
>
>                 schedstat_inc(sd->alb_count);
> --
> 2.27.0
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-02-05 13:51   ` Vincent Guittot
@ 2021-02-05 14:05     ` Valentin Schneider
  2021-02-05 14:34       ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-02-05 14:05 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On 05/02/21 14:51, Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> When triggering an active load balance, sd->nr_balance_failed is set to
>> such a value that any further can_migrate_task() using said sd will ignore
>> the output of task_hot().
>>
>> This behaviour makes sense, as active load balance intentionally preempts a
>> rq's running task to migrate it right away, but this asynchronous write is
>> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
>> before the sd->nr_balance_failed write either becomes visible to the
>> stopper's CPU or even happens on the CPU that appended the stopper work.
>>
>> Add a struct lb_env flag to denote active balancing, and use it in
>> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
>> same purpose.
>>
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>> ---
>>  kernel/sched/fair.c | 17 ++++++++++-------
>>  1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 197a51473e0c..0f6a4e58ce3c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7423,6 +7423,7 @@ enum migration_type {
>>  #define LBF_SOME_PINNED        0x08
>>  #define LBF_NOHZ_STATS 0x10
>>  #define LBF_NOHZ_AGAIN 0x20
>> +#define LBF_ACTIVE_LB  0x40
>>
>>  struct lb_env {
>>         struct sched_domain     *sd;
>> @@ -7608,10 +7609,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>
>>         /*
>>          * Aggressive migration if:
>> -        * 1) destination numa is preferred
>> -        * 2) task is cache cold, or
>> -        * 3) too many balance attempts have failed.
>> +        * 1) active balance
>> +        * 2) destination numa is preferred
>> +        * 3) task is cache cold, or
>> +        * 4) too many balance attempts have failed.
>>          */
>> +       if (env->flags & LBF_ACTIVE_LB)
>> +               return 1;
>> +
>
> This changes the behavior for numa system because it skips
> migrate_degrades_locality() which can return 1 and prevent active
> migration whatever nr_balance_failed
>
> Is that intentional ?
>

If I read this right, the result of migrate_degrades_locality() is
(currently) ignored if

  env->sd->nr_balance_failed > env->sd->cache_nice_tries

While on the load_balance() side, we have:

  /* We've kicked active balancing, force task migration. */
  sd->nr_balance_failed = sd->cache_nice_tries+1;

So we should currently be ignoring migrate_degrades_locality() in the
active balance case - what I wrote in the changelog for task_hot() still
applies to migrate_degrades_locality().

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
  2021-01-28 18:31 ` [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks Valentin Schneider
  2021-02-03 15:15   ` Qais Yousef
@ 2021-02-05 14:31   ` Vincent Guittot
  2021-02-05 16:59     ` Valentin Schneider
  1 sibling, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2021-02-05 14:31 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);

group_smaller_max_cpu_capacity and  group_smaller_max_cpu_capacity are
removed in the next patch. Merge this and the next and directly remove
them

>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
> existing capacity checks. Note that check_cpu_capacity() uses yet another
> margin (sd->imbalance_pct), and is left alone for now.
>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7d01fa0bfc7e..58ce0b22fcb0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   */
>  #define fits_capacity(cap, max)        ((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap' noticeably greater than 'ref'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)
>  #endif
>
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>  {
>         return rq->misfit_task_load &&
> -               (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> +               (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||

Why do you add a margin here whereas there was no margin before ?

>                  check_cpu_capacity(rq, sd));
>  }
>
> @@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>  static inline bool
>  group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -       return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> +       return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
>  }
>
>  /*
> @@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  static inline bool
>  group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -       return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> +       return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>  }
>
>  static inline enum
> @@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>                  * average load.
>                  */
>                 if (sd_has_asym_cpucapacity(env->sd) &&
> -                   capacity_of(env->dst_cpu) < capacity &&
> +                   !capacity_greater(capacity_of(env->dst_cpu), capacity) &&

same here

>                     nr_running == 1)
>                         continue;
>
> --
> 2.27.0
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-02-05 14:05     ` Valentin Schneider
@ 2021-02-05 14:34       ` Vincent Guittot
  0 siblings, 0 replies; 46+ messages in thread
From: Vincent Guittot @ 2021-02-05 14:34 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On Fri, 5 Feb 2021 at 15:05, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 05/02/21 14:51, Vincent Guittot wrote:
> > On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
> > <valentin.schneider@arm.com> wrote:
> >>
> >> When triggering an active load balance, sd->nr_balance_failed is set to
> >> such a value that any further can_migrate_task() using said sd will ignore
> >> the output of task_hot().
> >>
> >> This behaviour makes sense, as active load balance intentionally preempts a
> >> rq's running task to migrate it right away, but this asynchronous write is
> >> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> >> before the sd->nr_balance_failed write either becomes visible to the
> >> stopper's CPU or even happens on the CPU that appended the stopper work.
> >>
> >> Add a struct lb_env flag to denote active balancing, and use it in
> >> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> >> same purpose.
> >>
> >> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> >> ---
> >>  kernel/sched/fair.c | 17 ++++++++++-------
> >>  1 file changed, 10 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 197a51473e0c..0f6a4e58ce3c 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7423,6 +7423,7 @@ enum migration_type {
> >>  #define LBF_SOME_PINNED        0x08
> >>  #define LBF_NOHZ_STATS 0x10
> >>  #define LBF_NOHZ_AGAIN 0x20
> >> +#define LBF_ACTIVE_LB  0x40
> >>
> >>  struct lb_env {
> >>         struct sched_domain     *sd;
> >> @@ -7608,10 +7609,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >>
> >>         /*
> >>          * Aggressive migration if:
> >> -        * 1) destination numa is preferred
> >> -        * 2) task is cache cold, or
> >> -        * 3) too many balance attempts have failed.
> >> +        * 1) active balance
> >> +        * 2) destination numa is preferred
> >> +        * 3) task is cache cold, or
> >> +        * 4) too many balance attempts have failed.
> >>          */
> >> +       if (env->flags & LBF_ACTIVE_LB)
> >> +               return 1;
> >> +
> >
> > This changes the behavior for numa system because it skips
> > migrate_degrades_locality() which can return 1 and prevent active
> > migration whatever nr_balance_failed
> >
> > Is that intentional ?
> >
>
> If I read this right, the result of migrate_degrades_locality() is
> (currently) ignored if
>
>   env->sd->nr_balance_failed > env->sd->cache_nice_tries

You're right, I have misread the || condition

>
> While on the load_balance() side, we have:
>
>   /* We've kicked active balancing, force task migration. */
>   sd->nr_balance_failed = sd->cache_nice_tries+1;
>
> So we should currently be ignoring migrate_degrades_locality() in the
> active balance case - what I wrote in the changelog for task_hot() still
> applies to migrate_degrades_locality().

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
  2021-02-05 14:31   ` Vincent Guittot
@ 2021-02-05 16:59     ` Valentin Schneider
  2021-02-05 17:17       ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-02-05 16:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On 05/02/21 15:31, Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> During load-balance, groups classified as group_misfit_task are filtered
>> out if they do not pass
>>
>>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> group_smaller_max_cpu_capacity and  group_smaller_max_cpu_capacity are
> removed in the next patch. Merge this and the next and directly remove
> them
>

OK.

>> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>>  {
>>         return rq->misfit_task_load &&
>> -               (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
>> +               (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
>
> Why do you add a margin here whereas there was no margin before ?
>

Comparing capacities without any sort of filter can lead to ping-ponging
tasks around (capacity values very easily fluctuate by +/- 1, if not more).
I'm guilty of doing two things at once here: replace existing users, and
convert callsites that should be existing users. I can split the conversion
in a separate patch.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
  2021-02-05 16:59     ` Valentin Schneider
@ 2021-02-05 17:17       ` Vincent Guittot
  2021-02-05 20:07         ` Valentin Schneider
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2021-02-05 17:17 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On Fri, 5 Feb 2021 at 18:00, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 05/02/21 15:31, Vincent Guittot wrote:
> > On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
> > <valentin.schneider@arm.com> wrote:
> >>
> >> During load-balance, groups classified as group_misfit_task are filtered
> >> out if they do not pass
> >>
> >>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
> >
> > group_smaller_max_cpu_capacity and  group_smaller_max_cpu_capacity are
> > removed in the next patch. Merge this and the next and directly remove
> > them
> >
>
> OK.
>
> >> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> >>  {
> >>         return rq->misfit_task_load &&
> >> -               (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> >> +               (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
> >
> > Why do you add a margin here whereas there was no margin before ?
> >
>
> Comparing capacities without any sort of filter can lead to ping-ponging
> tasks around (capacity values very easily fluctuate by +/- 1, if not more).

max_cpu_capacity reflects the max of the cpu_capacity_orig values
don't aim to change and can be considered as static values.
It would be better to fix this rounding problem (if any) in
topology_get_cpu_scale instead of computing a margin every time it's
used

> I'm guilty of doing two things at once here: replace existing users, and
> convert callsites that should be existing users. I can split the conversion
> in a separate patch.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
  2021-02-05 17:17       ` Vincent Guittot
@ 2021-02-05 20:07         ` Valentin Schneider
  2021-02-08 15:29           ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-02-05 20:07 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On 05/02/21 18:17, Vincent Guittot wrote:
> On Fri, 5 Feb 2021 at 18:00, Valentin Schneider
>> >> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>> >>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>> >>  {
>> >>         return rq->misfit_task_load &&
>> >> -               (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
>> >> +               (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
>> >
>> > Why do you add a margin here whereas there was no margin before ?
>> >
>>
>> Comparing capacities without any sort of filter can lead to ping-ponging
>> tasks around (capacity values very easily fluctuate by +/- 1, if not more).
>
> max_cpu_capacity reflects the max of the cpu_capacity_orig values
> don't aim to change and can be considered as static values.
> It would be better to fix this rounding problem (if any) in
> topology_get_cpu_scale instead of computing a margin every time it's
> used
>

That's embarrassing, I was convinced we had something updating
rd->max_cpu_capacity with actual rq->capacity values... But as you point
out that's absolutely not the case, it's all based on rq->capacity_orig,
which completely invalidates patch 5/8.

Welp.

Perhaps I can still keep 5/8 with something like

  if (!rq->misfit_task_load)
          return false;

  do {
          if (capacity_greater(group->sgc->max_capacity, rq->cpu_capacity))
                  return true;

          group = group->next;
  } while (group != sd->groups);

  return false;

This works somewhat well for big.LITTLE, but for DynamIQ systems under a
single L3 this ends up iterating over all the CPUs :/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
  2021-02-05 20:07         ` Valentin Schneider
@ 2021-02-08 15:29           ` Vincent Guittot
  2021-02-08 17:49             ` Valentin Schneider
  0 siblings, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2021-02-08 15:29 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On Fri, 5 Feb 2021 at 21:07, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 05/02/21 18:17, Vincent Guittot wrote:
> > On Fri, 5 Feb 2021 at 18:00, Valentin Schneider
> >> >> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >> >>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> >> >>  {
> >> >>         return rq->misfit_task_load &&
> >> >> -               (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> >> >> +               (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
> >> >
> >> > Why do you add a margin here whereas there was no margin before ?
> >> >
> >>
> >> Comparing capacities without any sort of filter can lead to ping-ponging
> >> tasks around (capacity values very easily fluctuate by +/- 1, if not more).
> >
> > max_cpu_capacity reflects the max of the cpu_capacity_orig values
> > don't aim to change and can be considered as static values.
> > It would be better to fix this rounding problem (if any) in
> > topology_get_cpu_scale instead of computing a margin every time it's
> > used
> >
>
> That's embarrassing, I was convinced we had something updating
> rd->max_cpu_capacity with actual rq->capacity values... But as you point
> out that's absolutely not the case, it's all based on rq->capacity_orig,
> which completely invalidates patch 5/8.
>
> Welp.
>
> Perhaps I can still keep 5/8 with something like
>
>   if (!rq->misfit_task_load)
>           return false;
>
>   do {
>           if (capacity_greater(group->sgc->max_capacity, rq->cpu_capacity))
>                   return true;
>
>           group = group->next;
>   } while (group != sd->groups);

I don't catch what you want to achieve with this  while loop compared
to the original condition which is :
trigger a load_balance :
- if there is CPU with higher original capacity
- or if the capacity of this cpu has significantly reduced because of
pressure and there is maybe others with more capacity even if it's one
with highest original capacity

>
>   return false;
>
> This works somewhat well for big.LITTLE, but for DynamIQ systems under a
> single L3 this ends up iterating over all the CPUs :/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks
  2021-01-28 18:31 ` [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
  2021-02-03 15:17   ` Qais Yousef
@ 2021-02-08 16:21   ` Vincent Guittot
  2021-02-08 18:24     ` Valentin Schneider
  1 sibling, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2021-02-08 16:21 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Misfit tasks can and will be preempted by the stopper to migrate them over
> to a higher-capacity CPU. However, when runnable but not current misfit
> tasks are scanned by the load balancer (i.e. detach_tasks()), the
> task_hot() ratelimiting logic may prevent us from enqueuing said task onto
> a higher-capacity CPU.
>
> Align detach_tasks() with the active-balance logic and let it pick a
> cache-hot misfit task when the destination CPU can provide a capacity
> uplift.
>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cba9f97d9beb..c2351b87824f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7484,6 +7484,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>         if (env->sd->flags & SD_SHARE_CPUCAPACITY)
>                 return 0;
>
> +       /*
> +        * On a (sane) asymmetric CPU capacity system, the increase in compute
> +        * capacity should offset any potential performance hit caused by a
> +        * migration.
> +        */
> +       if (sd_has_asym_cpucapacity(env->sd) &&
> +           env->idle != CPU_NOT_IDLE &&
> +           !task_fits_capacity(p, capacity_of(env->src_cpu)) &&
> +           cpu_capacity_greater(env->dst_cpu, env->src_cpu))

Why not using env->migration_type to directly detect that it's a
misfit task active migration ?

> +               return 0;
> +
>         /*
>          * Buddy candidates are cache hot:
>          */
> --
> 2.27.0
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
  2021-02-08 15:29           ` Vincent Guittot
@ 2021-02-08 17:49             ` Valentin Schneider
  0 siblings, 0 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-02-08 17:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On 08/02/21 16:29, Vincent Guittot wrote:
> On Fri, 5 Feb 2021 at 21:07, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> Perhaps I can still keep 5/8 with something like
>>
>>   if (!rq->misfit_task_load)
>>           return false;
>>
>>   do {
>>           if (capacity_greater(group->sgc->max_capacity, rq->cpu_capacity))
>>                   return true;
>>
>>           group = group->next;
>>   } while (group != sd->groups);
>
> I don't catch what you want to achieve with this  while loop compared
> to the original condition which is :
> trigger a load_balance :
> - if there is CPU with higher original capacity
> - or if the capacity of this cpu has significantly reduced because of
> pressure and there is maybe others with more capacity even if it's one
> with highest original capacity
>

If we had a root-domain-wide (dynamic) capacity maximum, we could make
check_misfit_status() return false if the CPU *is* pressured but there is
no better alternative - e.g. if all other CPUs are pressured even worse.

This isn't a correctness issue as the nohz load-balance will just not
migrate the misfit task, but it would be nice to prevent the nohz kick
altogether.

I might ditch this for now and revisit it later.

>>
>>   return false;
>>
>> This works somewhat well for big.LITTLE, but for DynamIQ systems under a
>> single L3 this ends up iterating over all the CPUs :/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks
  2021-02-08 16:21   ` Vincent Guittot
@ 2021-02-08 18:24     ` Valentin Schneider
  2021-02-09  8:56       ` Vincent Guittot
  0 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-02-08 18:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On 08/02/21 17:21, Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> Misfit tasks can and will be preempted by the stopper to migrate them over
>> to a higher-capacity CPU. However, when runnable but not current misfit
>> tasks are scanned by the load balancer (i.e. detach_tasks()), the
>> task_hot() ratelimiting logic may prevent us from enqueuing said task onto
>> a higher-capacity CPU.
>>
>> Align detach_tasks() with the active-balance logic and let it pick a
>> cache-hot misfit task when the destination CPU can provide a capacity
>> uplift.
>>
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>> ---
>>  kernel/sched/fair.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index cba9f97d9beb..c2351b87824f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7484,6 +7484,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>>         if (env->sd->flags & SD_SHARE_CPUCAPACITY)
>>                 return 0;
>>
>> +       /*
>> +        * On a (sane) asymmetric CPU capacity system, the increase in compute
>> +        * capacity should offset any potential performance hit caused by a
>> +        * migration.
>> +        */
>> +       if (sd_has_asym_cpucapacity(env->sd) &&
>> +           env->idle != CPU_NOT_IDLE &&
>> +           !task_fits_capacity(p, capacity_of(env->src_cpu)) &&
>> +           cpu_capacity_greater(env->dst_cpu, env->src_cpu))
>
> Why not using env->migration_type to directly detect that it's a
> misfit task active migration ?
>

This is admittedly a kludge. Consider the scenario described in patch 7/8,
i.e.:
- there's a misfit task running on a LITTLE CPU
- a big CPU completes its work and is about to go through newidle_balance()

Now, consider by the time that big CPU gets into load_balance(), the misfit
task on the LITTLE CPU got preempted by a percpu kworker. As of right now,
it's quite likely the imbalance won't be classified as group_misfit_task,
but as group_overloaded (depends on the topology / workload, but that's a
symptom I've been seeing).

Unfortunately, even if we e.g. change the misfit load-balance logic to also
track preempted misfit tasks (rather than just the rq's current), this
could still happen AFAICT.

Long story short, we already trigger an active-balance to upmigrate running
misfit tasks, this changes task_hot() to allow any preempted task that
doesn't fit on its CPU to be upmigrated (regardless of the imbalance
classification).

>> +               return 0;
>> +
>>         /*
>>          * Buddy candidates are cache hot:
>>          */
>> --
>> 2.27.0
>>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/8] sched/fair: Add more sched_asym_cpucapacity static branch checks
  2021-01-28 18:31 ` [PATCH 2/8] sched/fair: Add more sched_asym_cpucapacity static branch checks Valentin Schneider
  2021-02-03 15:14   ` Qais Yousef
@ 2021-02-09  8:42   ` Vincent Guittot
  1 sibling, 0 replies; 46+ messages in thread
From: Vincent Guittot @ 2021-02-09  8:42 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Rik van Riel, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Morten Rasmussen, Qais Yousef, Quentin Perret,
	Pavan Kondeti

On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Rik noted a while back that a handful of
>
>   sd->flags & SD_ASYM_CPUCAPACITY
>
> & family in the CFS load-balancer code aren't guarded by the
> sched_asym_cpucapacity static branch.
>
> The load-balancer is already doing a humongous amount of work, but turning
> those checks into NOPs for those who don't need it is fairly
> straightforward, so do that.

It would be good to get figures showing an improvement. We are there
in the LB which is a slow path. Whereas using the static branch in
fast path makes sense, I don't see the benefit there. Also why not
doing the same for others like SD_ASYM_PACKING.
I'm not convinced by this change which IMO doesn't provide any benefit
but adds complexity.

>
> Suggested-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c  | 11 ++++++-----
>  kernel/sched/sched.h |  6 ++++++
>  2 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f6a4e58ce3c..7d01fa0bfc7e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8465,7 +8465,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                         continue;
>
>                 /* Check for a misfit task on the cpu */
> -               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +               if (sd_has_asym_cpucapacity(env->sd) &&
>                     sgs->group_misfit_task_load < rq->misfit_task_load) {
>                         sgs->group_misfit_task_load = rq->misfit_task_load;
>                         *sg_status |= SG_OVERLOAD;
> @@ -8522,7 +8522,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          * CPUs in the group should either be possible to resolve
>          * internally or be covered by avg_load imbalance (eventually).
>          */
> -       if (sgs->group_type == group_misfit_task &&
> +       if (static_branch_unlikely(&sched_asym_cpucapacity) &&
> +           sgs->group_type == group_misfit_task &&
>             (!group_smaller_max_cpu_capacity(sg, sds->local) ||
>              sds->local_stat.group_type != group_has_spare))
>                 return false;
> @@ -8605,7 +8606,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          * throughput. Maximize throughput, power/energy consequences are not
>          * considered.
>          */
> -       if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> +       if (sd_has_asym_cpucapacity(env->sd) &&
>             (sgs->group_type <= group_fully_busy) &&
>             (group_smaller_min_cpu_capacity(sds->local, sg)))
>                 return false;
> @@ -8728,7 +8729,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
>         }
>
>         /* Check if task fits in the group */
> -       if (sd->flags & SD_ASYM_CPUCAPACITY &&
> +       if (sd_has_asym_cpucapacity(sd) &&
>             !task_fits_capacity(p, group->sgc->max_capacity)) {
>                 sgs->group_misfit_task_load = 1;
>         }
> @@ -9419,7 +9420,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>                  * Higher per-CPU capacity is considered better than balancing
>                  * average load.
>                  */
> -               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +               if (sd_has_asym_cpucapacity(env->sd) &&
>                     capacity_of(env->dst_cpu) < capacity &&
>                     nr_running == 1)
>                         continue;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 045b01064c1e..21bd71f58c06 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1482,6 +1482,12 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>  extern struct static_key_false sched_asym_cpucapacity;
>
> +static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
> +{
> +       return static_branch_unlikely(&sched_asym_cpucapacity) &&
> +               sd->flags & SD_ASYM_CPUCAPACITY;
> +}
> +
>  struct sched_group_capacity {
>         atomic_t                ref;
>         /*
> --
> 2.27.0
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks
  2021-02-08 18:24     ` Valentin Schneider
@ 2021-02-09  8:56       ` Vincent Guittot
  0 siblings, 0 replies; 46+ messages in thread
From: Vincent Guittot @ 2021-02-09  8:56 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On Mon, 8 Feb 2021 at 19:24, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 08/02/21 17:21, Vincent Guittot wrote:
> > On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
> > <valentin.schneider@arm.com> wrote:
> >>
> >> Misfit tasks can and will be preempted by the stopper to migrate them over
> >> to a higher-capacity CPU. However, when runnable but not current misfit
> >> tasks are scanned by the load balancer (i.e. detach_tasks()), the
> >> task_hot() ratelimiting logic may prevent us from enqueuing said task onto
> >> a higher-capacity CPU.
> >>
> >> Align detach_tasks() with the active-balance logic and let it pick a
> >> cache-hot misfit task when the destination CPU can provide a capacity
> >> uplift.
> >>
> >> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> >> ---
> >>  kernel/sched/fair.c | 11 +++++++++++
> >>  1 file changed, 11 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index cba9f97d9beb..c2351b87824f 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7484,6 +7484,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
> >>         if (env->sd->flags & SD_SHARE_CPUCAPACITY)
> >>                 return 0;
> >>
> >> +       /*
> >> +        * On a (sane) asymmetric CPU capacity system, the increase in compute
> >> +        * capacity should offset any potential performance hit caused by a
> >> +        * migration.
> >> +        */
> >> +       if (sd_has_asym_cpucapacity(env->sd) &&
> >> +           env->idle != CPU_NOT_IDLE &&
> >> +           !task_fits_capacity(p, capacity_of(env->src_cpu)) &&
> >> +           cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> >
> > Why not using env->migration_type to directly detect that it's a
> > misfit task active migration ?
> >
>
> This is admittedly a kludge. Consider the scenario described in patch 7/8,
> i.e.:
> - there's a misfit task running on a LITTLE CPU
> - a big CPU completes its work and is about to go through newidle_balance()
>
> Now, consider by the time that big CPU gets into load_balance(), the misfit
> task on the LITTLE CPU got preempted by a percpu kworker. As of right now,
> it's quite likely the imbalance won't be classified as group_misfit_task,
> but as group_overloaded (depends on the topology / workload, but that's a
> symptom I've been seeing).

IIRC, we already discussed this. And you should better track that a
misfit task remains on the rq instead of adding a lot of special case
everywhere

>
> Unfortunately, even if we e.g. change the misfit load-balance logic to also
> track preempted misfit tasks (rather than just the rq's current), this
> could still happen AFAICT.
>
> Long story short, we already trigger an active-balance to upmigrate running
> misfit tasks, this changes task_hot() to allow any preempted task that
> doesn't fit on its CPU to be upmigrated (regardless of the imbalance
> classification).
>
> >> +               return 0;
> >> +
> >>         /*
> >>          * Buddy candidates are cache hot:
> >>          */
> >> --
> >> 2.27.0
> >>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit
  2021-01-28 18:31 ` [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit Valentin Schneider
  2021-02-03 15:16   ` Qais Yousef
@ 2021-02-09  8:58   ` Vincent Guittot
  2021-02-09 18:19     ` Valentin Schneider
  1 sibling, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2021-02-09  8:58 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and CPUs 2-3 as
> bigs. The resulting sched_domain hierarchy is:
>
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
>
> When running a multithreaded CPU-bound workload (i.e. 1 hog per CPU), the
> expected behaviour is to have the about-to-idle big CPUs pull a hog from
> the LITTLEs, since bigs will complete their work sooner than LITTLEs.
>
> Further Consider a scenario where:
> - CPU0 is idle (e.g. its hog got migrated to one of the big CPUs)
> - CPU1 is currently executing a per-CPU kworker, preempting the CPU hog
> - CPU2 and CPU3 are executing CPU-hogs
>
> CPU0 goes through load_balance() at MC level, and tries to pick stuff from
> CPU1, but:
> - the hog can't be pulled, because it's task_hot()
> - the kworker can't be pulled, because it's pinned to CPU1, which sets
>   LBF_SOME_PINNED
>
> This load balance attempts ends with no load pulled, LBF_SOME_PINNED set,
> and as a consequence we set the imbalance flag of DIE's [0, 1]
> sched_group_capacity.
>
> Shortly after, CPU2 completes its work and is about to go idle. It goes
> through the newidle_balance(), and we would really like it to active
> balance the hog running on CPU1 (which is a misfit task). However,
> sgc->imbalance is set for the LITTLE group at DIE level, so the group gets
> classified as group_imbalanced rather than group_misfit_task.
>
> Unlike group_misfit_task (via migrate_misfit), the active balance logic
> doesn't have any specific case for group_imbalanced, so CPU2 ends up going
> idle. We'll have to wait for a load balance on CPU0 or CPU1 to happen and
> clear the imbalance flag, and then for another DIE-level load-balance on
> CPU2 to happen to pull the task off of CPU1. That's several precious
> milliseconds wasted down the drain.
>
> Giving group_misfit_task a higher group_classify() priority than
> group_imbalance doesn't seem like the right thing to do. Instead, make

Could you explain why ?

> need_active_balance() return true for any migration_type when the
> destination CPU is idle and the source CPU has a misfit task.
>
> While at it, add an sd_has_asym_cpucapacity() guard in
> need_active_balance().
>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0ac2f876b86f..cba9f97d9beb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9557,9 +9557,22 @@ static int need_active_balance(struct lb_env *env)
>                         return 1;
>         }
>
> +       if (!sd_has_asym_cpucapacity(sd))
> +               return 0;
> +
>         if (env->migration_type == migrate_misfit)
>                 return 1;
>
> +       /*
> +        * If we failed to pull anything and the src_rq has a misfit task, but
> +        * the busiest group_type was higher than group_misfit_task, try to
> +        * go for a misfit active balance anyway.
> +        */
> +       if ((env->idle != CPU_NOT_IDLE) &&
> +           env->src_rq->misfit_task_load &&
> +           cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> +               return 1;
> +
>         return 0;
>  }
>
> --
> 2.27.0
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit
  2021-02-09  8:58   ` Vincent Guittot
@ 2021-02-09 18:19     ` Valentin Schneider
  0 siblings, 0 replies; 46+ messages in thread
From: Valentin Schneider @ 2021-02-09 18:19 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On 09/02/21 09:58, Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
>> Giving group_misfit_task a higher group_classify() priority than
>> group_imbalance doesn't seem like the right thing to do. Instead, make
>
> Could you explain why ?
>

Morten had intentionally placed it above (then) group_other but below
group_imbalanced:

  3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type")

The reasoning being misfit balance shouldn't take higher priority than
jarring imbalance issues. group_imbalanced is a mixed bag and difficult to
classify, but for sure group_overloaded takes priority as it ought to imply
you can move tasks around without doing an active balance (there's more
tasks than CPUs).

Then again, we do have issues where the busiest group is group_overloaded,
but we'd "want" this to be classified as misfit. This ties in with patch 8.

Take the CPU hog vs pcpu kworker example on a big.LITTLE platform:

  a,b,c,d are our CPU-hogging tasks
  k is a per-CPU kworker

        {CPU0 | a a a a k
  LITTLE{CPU1 | b b b b a
        ------|---------
        {CPU2 | c c c c .
  bigs  {CPU3 | d d d d ^
                        |
                        |
                  newidle pull

CPU2 finished its work and goes through a newidle balance. Ideally here it
would pull either 'a' or 'b' which are CPU-bound tasks running on LITTLE
CPUs. Unfortunately, a per-CPU kworker woke up on CPU0, so since we have:

  DIE [        ]
  MC  [   ][   ]
       0 1  2 3

the DIE (0-1) sched_group has 3 runnable tasks, two of which are CPU hogs:
it gets classified as group_overloaded. Only task 'a' can be pulled, and it
requires patch 8 to be migrated in this scenario.


I'm not sure how we could better classify this, even if admitting we
started tracking preempted misfit tasks. Perhaps not group classification
itself, but the migration_type? i.e. something like

  if (nr_running - sgs->group_weight <= nr_misfits)
  => all preempted tasks are misfit
  => migration_type = migrate_misfit

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2021-02-09 19:54 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-28 18:31 [PATCH 0/8] sched/fair: misfit task load-balance tweaks Valentin Schneider
2021-01-28 18:31 ` [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
2021-02-03 15:14   ` Qais Yousef
2021-02-03 18:42     ` Valentin Schneider
2021-02-04 15:05       ` Qais Yousef
2021-02-05 13:51   ` Vincent Guittot
2021-02-05 14:05     ` Valentin Schneider
2021-02-05 14:34       ` Vincent Guittot
2021-01-28 18:31 ` [PATCH 2/8] sched/fair: Add more sched_asym_cpucapacity static branch checks Valentin Schneider
2021-02-03 15:14   ` Qais Yousef
2021-02-09  8:42   ` Vincent Guittot
2021-01-28 18:31 ` [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks Valentin Schneider
2021-02-03 15:15   ` Qais Yousef
2021-02-03 18:42     ` Valentin Schneider
2021-02-05 14:31   ` Vincent Guittot
2021-02-05 16:59     ` Valentin Schneider
2021-02-05 17:17       ` Vincent Guittot
2021-02-05 20:07         ` Valentin Schneider
2021-02-08 15:29           ` Vincent Guittot
2021-02-08 17:49             ` Valentin Schneider
2021-01-28 18:31 ` [PATCH 4/8] sched/fair: Use dst_cpu's capacity rather than group {min, max} capacity Valentin Schneider
2021-02-03 15:15   ` Qais Yousef
2021-01-28 18:31 ` [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities Valentin Schneider
2021-02-03 15:15   ` Qais Yousef
2021-02-04 10:49     ` Dietmar Eggemann
2021-02-04 11:34       ` Valentin Schneider
2021-02-04 14:57         ` Dietmar Eggemann
2021-01-28 18:31 ` [PATCH 6/8] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
2021-02-03 15:16   ` Qais Yousef
2021-02-03 18:43     ` Valentin Schneider
2021-01-28 18:31 ` [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit Valentin Schneider
2021-02-03 15:16   ` Qais Yousef
2021-02-03 18:43     ` Valentin Schneider
2021-02-04 11:44       ` Dietmar Eggemann
2021-02-04 12:22         ` Valentin Schneider
2021-02-09  8:58   ` Vincent Guittot
2021-02-09 18:19     ` Valentin Schneider
2021-01-28 18:31 ` [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
2021-02-03 15:17   ` Qais Yousef
2021-02-08 16:21   ` Vincent Guittot
2021-02-08 18:24     ` Valentin Schneider
2021-02-09  8:56       ` Vincent Guittot
2021-02-03 15:14 ` [PATCH 0/8] sched/fair: misfit task load-balance tweaks Qais Yousef
2021-02-03 18:43   ` Valentin Schneider
2021-02-04 12:03     ` Dietmar Eggemann
2021-02-04 12:36       ` Valentin Schneider

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).