* [PATCH v3 0/7] sched/fair: misfit task load-balance tweaks
@ 2021-03-11 12:05 Valentin Schneider
  2021-03-11 12:05 ` [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Valentin Schneider @ 2021-03-11 12:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

Hi folks,

Here is this year's series of misfit changes. On the menu:

o Patch 1 prevents pcpu kworkers from causing group_imbalanced
o Patch 2 is an independent active balance cleanup
o Patch 3 adds some more sched_asym_cpucapacity static branches
o Patch 4 introduces yet another margin for capacity to capacity
  comparisons
o Patches 5-6 build on top of patch 4 and change capacity comparisons
  throughout misfit load balancing  
o Patch 7 aligns running and non-running misfit task cache hotness
  considerations

IMO the somewhat controversial bit is patch 4, because it attempts to solve
margin issues by... Adding another margin. This does solve issues on
existing platforms (e.g. Pixel4), but we'll be back to square one the day
some "clever" folks spin a platform with two different CPU capacities less
than 5% apart.

This is based on top of today's tip/sched/core at:

  13c2235b2b28 ("sched: Remove unnecessary variable from schedule_tail()")

Testing
=======

I ran my usual [1] misfit tests on
o TC2
o Juno
o HiKey960
o Dragonboard845C
o RB5

RB5 has a similar topology to Pixel4 and highlights the problem of having
two different CPU capacity values above 819 (in this case 871 and 1024):
without these patches, CPU hogs (i.e. misfit tasks) running on the "medium"
CPUs will never be upmigrated to a "big" via misfit balance.
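
A quick way to see the cutoff is to evaluate both margins side by side.
The sketch below is purely illustrative and not part of the series: the
macros are copied from fits_capacity() in kernel/sched/fair.c and from
patch 4, and the 980 value is a made-up "less than 5% apart" capacity:

  /* margin-demo.c - standalone userspace sketch */
  #include <stdio.h>

  /* ~20% margin, utilization vs capacity (existing helper) */
  #define fits_capacity(cap, max)      ((cap) * 1280 < (max) * 1024)
  /* ~5% margin, capacity vs capacity (added in patch 4) */
  #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

  int main(void)
  {
          /* RB5 / Pixel4 "medium" vs "big": 871 vs 1024 */
          printf("%d\n", fits_capacity(871, 1024));    /* 0: 871 > 819  */
          printf("%d\n", capacity_greater(1024, 871)); /* 1: ~17% apart */
          /* hypothetical pair of capacities less than 5% apart */
          printf("%d\n", capacity_greater(1024, 980)); /* 0: stuck again */
          return 0;
  }

IOW the new margin unblocks the 871 vs 1024 case, but capacities less than
~5% apart would be stuck again, as noted above.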


The 0day bot reported [3] the first patch causes a ~14% regression on its
stress-ng.vm-segv testcase. I ran that testcase on: 

o Ampere eMAG (arm64, 32 cores)
o 2-socket Xeon E5-2690 (x86, 40 cores)

and found at worst a -0.3% regression and at best a 2% improvement - I'm
getting nowhere near -14%.
  
Revisions
=========

v2 -> v3
--------

o Rebased on top of latest tip/sched/core
o Added test results vs stress-ng.vm-segv

v1 -> v2
--------

o Collected Reviewed-by
o Minor comment and code cleanups

o Consolidated static key vs SD flag explanation (Dietmar)

  Note to Vincent: I didn't measure the impact of adding said static key to
  load_balance(); I do however believe it is a low hanging fruit. The
  wrapper keeps things neat and tidy, and is also helpful for documenting
  the intricacies of the static key status vs the presence of the SD flag
  in a CPU's sched_domain hierarchy.
  
o Removed v1 patch 4 - root_domain.max_cpu_capacity is absolutely not what
  I had convinced myself it was.
o Squashed capacity margin usage with removal of
  group_smaller_{min, max}_capacity() (Vincent)   
o Replaced v1 patch 7 with Lingutla's can_migrate_task() patch [2]
o Rewrote task_hot() modification changelog

Links
=====

[1]: https://lisa-linux-integrated-system-analysis.readthedocs.io/en/master/kernel_tests.html#lisa.tests.scheduler.misfit.StaggeredFinishes
[2]: http://lore.kernel.org/r/20210217120854.1280-1-clingutla@codeaurora.org
[3]: http://lore.kernel.org/r/20210223023004.GB25487@xsang-OptiPlex-9020

Cheers,
Valentin

Lingutla Chandrasekhar (1):
  sched/fair: Ignore percpu threads for imbalance pulls

Valentin Schneider (6):
  sched/fair: Clean up active balance nr_balance_failed trickery
  sched/fair: Add more sched_asym_cpucapacity static branch checks
  sched/fair: Introduce a CPU capacity comparison helper
  sched/fair: Employ capacity_greater() throughout load_balance()
  sched/fair: Filter out locally-unsolvable misfit imbalances
  sched/fair: Relax task_hot() for misfit tasks

 kernel/sched/fair.c  | 128 ++++++++++++++++++++++++-------------------
 kernel/sched/sched.h |  33 +++++++++++
 2 files changed, 105 insertions(+), 56 deletions(-)

--
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls
  2021-03-11 12:05 [PATCH v3 0/7] sched/fair: misfit task load-balance tweaks Valentin Schneider
@ 2021-03-11 12:05 ` Valentin Schneider
  2021-03-16 15:49   ` Dietmar Eggemann
  2021-03-11 12:05 ` [PATCH v3 2/7] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2021-03-11 12:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen, Qais Yousef,
	Quentin Perret, Pavan Kondeti, Rik van Riel

From: Lingutla Chandrasekhar <clingutla@codeaurora.org>

During load balancing, when the balancing group is unable to pull tasks
from the busy group due to ->cpus_ptr constraints, it sets
LBF_SOME_PINNED in the lb env flags. As a consequence, sgc->imbalance
is set for the parent domain level, which classifies the group as
imbalanced so that it can get help from another balancing CPU.

Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
CPUs 2-3 as Bigs, with the following scenario:
- CPU0 doing newly_idle balancing
- CPU1 running percpu kworker and RT task (small tasks)
- CPU2 running 2 big tasks
- CPU3 running 1 medium task

While CPU0 is doing a newly_idle load balance at MC level, it fails to
pull the percpu kworker from CPU1, sets LBF_SOME_PINNED in the lb env
flags, and sets sgc->imbalance at the DIE level domain. As LBF_ALL_PINNED
is not cleared, it tries to redo the balancing with CPU1 cleared from the
env cpus, but it doesn't find another busiest_group, so CPU0 stops
balancing at MC level without clearing 'sgc->imbalance' and restarts the
load balancing at DIE level.

CPU0 (the balancing CPU) then finds the LITTLEs' group to be the
busiest_group, with group type group_imbalanced. The Bigs' group,
classified a level below the imbalanced type, is ignored as a busiest
candidate, and the balancing is aborted without pulling any tasks (by
that time, CPU1 might not have any running tasks).

Classifying the group as imbalanced because of percpu threads is a
suboptimal decision, so don't set LBF_SOME_PINNED for per cpu threads.

Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e2ab1e00ef9..83aea97fbf22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7565,6 +7565,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
 
+	/* Disregard pcpu kthreads; they are where they need to be. */
+	if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+		return 0;
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 2/7] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-03-11 12:05 [PATCH v3 0/7] sched/fair: misfit task load-balance tweaks Valentin Schneider
  2021-03-11 12:05 ` [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
@ 2021-03-11 12:05 ` Valentin Schneider
  2021-03-17 10:52   ` Dietmar Eggemann
  2021-03-11 12:05 ` [PATCH v3 3/7] sched/fair: Add more sched_asym_cpucapacity static branch checks Valentin Schneider
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2021-03-11 12:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().

This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.

Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83aea97fbf22..f50a902bdf24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7420,6 +7420,7 @@ enum migration_type {
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
 #define LBF_SOME_PINNED	0x08
+#define LBF_ACTIVE_LB	0x10
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -7609,10 +7610,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) destination numa is preferred
-	 * 2) task is cache cold, or
-	 * 3) too many balance attempts have failed.
+	 * 1) active balance
+	 * 2) destination numa is preferred
+	 * 3) task is cache cold, or
+	 * 4) too many balance attempts have failed.
 	 */
+	if (env->flags & LBF_ACTIVE_LB)
+		return 1;
+
 	tsk_cache_hot = migrate_degrades_locality(p, env);
 	if (tsk_cache_hot == -1)
 		tsk_cache_hot = task_hot(p, env);
@@ -9794,9 +9799,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 					active_load_balance_cpu_stop, busiest,
 					&busiest->active_balance_work);
 			}
-
-			/* We've kicked active balancing, force task migration. */
-			sd->nr_balance_failed = sd->cache_nice_tries+1;
 		}
 	} else {
 		sd->nr_balance_failed = 0;
@@ -9952,7 +9954,8 @@ static int active_load_balance_cpu_stop(void *data)
 			 * @dst_grpmask we need to make that test go away with lying
 			 * about DST_PINNED.
 			 */
-			.flags		= LBF_DST_PINNED,
+			.flags		= LBF_DST_PINNED |
+					  LBF_ACTIVE_LB,
 		};
 
 		schedstat_inc(sd->alb_count);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 3/7] sched/fair: Add more sched_asym_cpucapacity static branch checks
  2021-03-11 12:05 [PATCH v3 0/7] sched/fair: misfit task load-balance tweaks Valentin Schneider
  2021-03-11 12:05 ` [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
  2021-03-11 12:05 ` [PATCH v3 2/7] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
@ 2021-03-11 12:05 ` Valentin Schneider
  2021-03-15 14:18   ` Vincent Guittot
  2021-03-11 12:05 ` [PATCH v3 4/7] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2021-03-11 12:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, Qais Yousef, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen,
	Quentin Perret, Pavan Kondeti, Lingutla Chandrasekhar

Rik noted a while back that a handful of

  sd->flags & SD_ASYM_CPUCAPACITY

& family in the CFS load-balancer code aren't guarded by the
sched_asym_cpucapacity static branch.

Turning those checks into NOPs for those who don't need it is fairly
straightforward, and hiding it in a helper doesn't change code size in all
but one spot. It also gives us a place to document the differences between
checking the static key and checking the SD flag.

Suggested-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c  | 21 ++++++++-------------
 kernel/sched/sched.h | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f50a902bdf24..db892f6e222f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6300,15 +6300,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * sd_asym_cpucapacity rather than sd_llc.
 	 */
 	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+		/* See sd_has_asym_cpucapacity() */
 		sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
-		/*
-		 * On an asymmetric CPU capacity system where an exclusive
-		 * cpuset defines a symmetric island (i.e. one unique
-		 * capacity_orig value through the cpuset), the key will be set
-		 * but the CPUs within that cpuset will not have a domain with
-		 * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
-		 * capacity path.
-		 */
 		if (sd) {
 			i = select_idle_capacity(p, sd, target);
 			return ((unsigned)i < nr_cpumask_bits) ? i : target;
@@ -8467,7 +8460,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			continue;
 
 		/* Check for a misfit task on the cpu */
-		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+		if (sd_has_asym_cpucapacity(env->sd) &&
 		    sgs->group_misfit_task_load < rq->misfit_task_load) {
 			sgs->group_misfit_task_load = rq->misfit_task_load;
 			*sg_status |= SG_OVERLOAD;
@@ -8524,7 +8517,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * CPUs in the group should either be possible to resolve
 	 * internally or be covered by avg_load imbalance (eventually).
 	 */
-	if (sgs->group_type == group_misfit_task &&
+	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
+	    sgs->group_type == group_misfit_task &&
 	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
@@ -8607,7 +8601,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * throughput. Maximize throughput, power/energy consequences are not
 	 * considered.
 	 */
-	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+	if (sd_has_asym_cpucapacity(env->sd) &&
 	    (sgs->group_type <= group_fully_busy) &&
 	    (group_smaller_min_cpu_capacity(sds->local, sg)))
 		return false;
@@ -8730,7 +8724,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 	}
 
 	/* Check if task fits in the group */
-	if (sd->flags & SD_ASYM_CPUCAPACITY &&
+	if (sd_has_asym_cpucapacity(sd) &&
 	    !task_fits_capacity(p, group->sgc->max_capacity)) {
 		sgs->group_misfit_task_load = 1;
 	}
@@ -9408,7 +9402,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * Higher per-CPU capacity is considered better than balancing
 		 * average load.
 		 */
-		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+		if (sd_has_asym_cpucapacity(env->sd) &&
 		    capacity_of(env->dst_cpu) < capacity &&
 		    nr_running == 1)
 			continue;
@@ -10225,6 +10219,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		}
 	}
 
+	 /* See sd_has_asym_cpucapacity(). */
 	sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu));
 	if (sd) {
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2e09a647c4f..27bf70bc86c7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1492,6 +1492,39 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 extern struct static_key_false sched_asym_cpucapacity;
 
+/*
+ * Note that the static key is system-wide, but the visibility of
+ * SD_ASYM_CPUCAPACITY isn't. Thus the static key being enabled does not
+ * imply all CPUs can see asymmetry.
+ *
+ * Consider an asymmetric CPU capacity system such as:
+ *
+ * MC [           ]
+ *     0 1 2 3 4 5
+ *     L L L L B B
+ *
+ * w/ arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
+ *
+ * By default, booting this system will enable the sched_asym_cpucapacity
+ * static key, and all CPUs will see SD_ASYM_CPUCAPACITY set at their MC
+ * sched_domain.
+ *
+ * Further consider exclusive cpusets creating a "symmetric island":
+ *
+ * MC [   ][      ]
+ *     0 1  2 3 4 5
+ *     L L  L L B B
+ *
+ * Again, booting this will enable the static key, but CPUs 0-1 will *not* have
+ * SD_ASYM_CPUCAPACITY set in any of their sched_domain. This is the intended
+ * behaviour, as CPUs 0-1 should be treated as a regular, isolated SMP system.
+ */
+static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
+{
+	return static_branch_unlikely(&sched_asym_cpucapacity) &&
+		sd->flags & SD_ASYM_CPUCAPACITY;
+}
+
 struct sched_group_capacity {
 	atomic_t		ref;
 	/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 4/7] sched/fair: Introduce a CPU capacity comparison helper
  2021-03-11 12:05 [PATCH v3 0/7] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (2 preceding siblings ...)
  2021-03-11 12:05 ` [PATCH v3 3/7] sched/fair: Add more sched_asym_cpucapacity static branch checks Valentin Schneider
@ 2021-03-11 12:05 ` Valentin Schneider
  2021-03-15 14:24   ` Vincent Guittot
  2021-03-31 11:34   ` Chandra Sekhar Lingutla
  2021-03-11 12:05 ` [PATCH v3 5/7] sched/fair: Employ capacity_greater() throughout load_balance() Valentin Schneider
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Valentin Schneider @ 2021-03-11 12:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Qais Yousef, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass

  group_smaller_max_cpu_capacity(<candidate group>, <local group>);

which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.

Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.

One may argue the 20% margin of fits_capacity() is excessive with the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.
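
For reference, a minimal sketch of where the numbers above come from,
assuming SCHED_CAPACITY_SCALE == 1024 (illustration only, not part of
the patch):

  /*
   * fits_capacity(cap, max) := cap * 1280 < max * 1024
   *   => cap < max * 1024 / 1280 = max * 0.8
   *   => with max == 1024, anything above 819 "doesn't fit".
   *
   * capacity_greater(cap1, cap2) := cap1 * 1024 > cap2 * 1078
   *   => cap1 > cap2 * 1078 / 1024 ~= cap2 * 1.053
   *   => cap1 must be ~5% greater than cap2; 1024 vs 871 now passes.
   */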

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db892f6e222f..ddb2ab3edf6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  */
 #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
 
+/*
+ * The margin used when comparing CPU capacities.
+ * is 'cap1' noticeably greater than 'cap2'
+ *
+ * (default: ~5%)
+ */
+#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 5/7] sched/fair: Employ capacity_greater() throughout load_balance()
  2021-03-11 12:05 [PATCH v3 0/7] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (3 preceding siblings ...)
  2021-03-11 12:05 ` [PATCH v3 4/7] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
@ 2021-03-11 12:05 ` Valentin Schneider
  2021-03-31 11:35   ` Chandra Sekhar Lingutla
  2021-03-11 12:05 ` [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
  2021-03-11 12:05 ` [PATCH v3 7/7] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
  6 siblings, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2021-03-11 12:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Qais Yousef, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 33 ++++-----------------------------
 1 file changed, 4 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ddb2ab3edf6d..1e8a242cd1f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8350,26 +8350,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 	return false;
 }
 
-/*
- * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity than sched_group ref.
- */
-static inline bool
-group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
-	return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
-}
-
-/*
- * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity_orig than sched_group ref.
- */
-static inline bool
-group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
-	return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
-}
-
 static inline enum
 group_type group_classify(unsigned int imbalance_pct,
 			  struct sched_group *group,
@@ -8518,15 +8498,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (!sgs->sum_h_nr_running)
 		return false;
 
-	/*
-	 * Don't try to pull misfit tasks we can't help.
-	 * We can use max_capacity here as reduction in capacity on some
-	 * CPUs in the group should either be possible to resolve
-	 * internally or be covered by avg_load imbalance (eventually).
-	 */
+	/* Don't try to pull misfit tasks we can't help */
 	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
 	    sgs->group_type == group_misfit_task &&
-	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
@@ -8610,7 +8585,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 */
 	if (sd_has_asym_cpucapacity(env->sd) &&
 	    (sgs->group_type <= group_fully_busy) &&
-	    (group_smaller_min_cpu_capacity(sds->local, sg)))
+	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
 		return false;
 
 	return true;
@@ -9410,7 +9385,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * average load.
 		 */
 		if (sd_has_asym_cpucapacity(env->sd) &&
-		    capacity_of(env->dst_cpu) < capacity &&
+		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
 		    nr_running == 1)
 			continue;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-03-11 12:05 [PATCH v3 0/7] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (4 preceding siblings ...)
  2021-03-11 12:05 ` [PATCH v3 5/7] sched/fair: Employ capacity_greater() throughout load_balance() Valentin Schneider
@ 2021-03-11 12:05 ` Valentin Schneider
  2021-03-15 15:13   ` Vincent Guittot
  2021-03-11 12:05 ` [PATCH v3 7/7] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
  6 siblings, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2021-03-11 12:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Qais Yousef, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

Consider the following (hypothetical) asymmetric CPU capacity topology,
with some amount of capacity pressure (RT | DL | IRQ | thermal):

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3

  | CPU | capacity_orig | capacity |
  |-----+---------------+----------|
  |   0 |           870 |      860 |
  |   1 |           870 |      600 |
  |   2 |          1024 |      850 |
  |   3 |          1024 |      860 |

If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 to be
sufficiently busy, i.e. they don't have enough spare capacity to
accommodate CPU1's misfit task. It would then fall to CPU2 to pull the
task.

This currently won't happen, because CPU2 will fail

  capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)

in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
level. In this case, the max_capacity is that of CPU0's, which is at this
point in time greater than that of CPU2's. This comparison doesn't make
much sense, given that the only CPUs we should care about in this scenario
are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
destination CPU).

Aggregate a misfit task's load into sgs->group_misfit_task_load only if
env->dst_cpu would grant it a capacity uplift. Separately track whether a
sched_group contains a misfit task to still classify it as
group_misfit_task and not pick it as busiest group when pulling from a
lower-capacity CPU (which is the current behaviour and prevents
down-migration).

Since find_busiest_queue() can now iterate over CPUs with a higher capacity
than the local CPU's, add a capacity check there.
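
Plugging the capacity table above into the helpers makes this concrete
(illustrative arithmetic only, using the ~5% capacity_greater() margin
from patch 4):

  /*
   * Current check in update_sd_pick_busiest(), sg == [0, 1] at DIE level,
   * sg->sgc->max_capacity == CPU0's capacity, i.e. 860 per the table:
   *
   *   capacity_greater(850, 860): 850 * 1024 = 870400
   *                               860 * 1078 = 927080 => false, no pull
   *
   * Per-CPU check used here instead (dst_cpu == CPU2, misfit CPU == CPU1):
   *
   *   cpu_capacity_greater(CPU2, CPU1):
   *     capacity_greater(850, 600): 850 * 1024 = 870400
   *                                 600 * 1078 = 646800 => true, can help
   */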

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e8a242cd1f7..41cdda7a8ea6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+/* Is CPU a's capacity noticeably greater than CPU b's? */
+static inline bool cpu_capacity_greater(int a, int b)
+{
+	return capacity_greater(capacity_of(a), capacity_of(b));
+}
+
 static void record_wakee(struct task_struct *p)
 {
 	/*
@@ -8091,7 +8097,8 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
-	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+	unsigned long group_misfit_task_load; /* Task load that can be uplifted */
+	int           group_has_misfit_task; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -8364,7 +8371,7 @@ group_type group_classify(unsigned int imbalance_pct,
 	if (sgs->group_asym_packing)
 		return group_asym_packing;
 
-	if (sgs->group_misfit_task_load)
+	if (sgs->group_has_misfit_task)
 		return group_misfit_task;
 
 	if (!group_has_capacity(imbalance_pct, sgs))
@@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			continue;
 
 		/* Check for a misfit task on the cpu */
-		if (sd_has_asym_cpucapacity(env->sd) &&
-		    sgs->group_misfit_task_load < rq->misfit_task_load) {
-			sgs->group_misfit_task_load = rq->misfit_task_load;
-			*sg_status |= SG_OVERLOAD;
+		if (!sd_has_asym_cpucapacity(env->sd) ||
+		    !rq->misfit_task_load)
+			continue;
+
+		*sg_status |= SG_OVERLOAD;
+		sgs->group_has_misfit_task = true;
+
+		/*
+		 * Don't attempt to maximize load for misfit tasks that can't be
+		 * granted a CPU capacity uplift.
+		 */
+		if (cpu_capacity_greater(env->dst_cpu, i)) {
+			sgs->group_misfit_task_load = max(
+				sgs->group_misfit_task_load,
+				rq->misfit_task_load);
 		}
 	}
 
@@ -8501,7 +8519,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	/* Don't try to pull misfit tasks we can't help */
 	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
 	    sgs->group_type == group_misfit_task &&
-	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+	    (!sgs->group_misfit_task_load ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
@@ -9448,15 +9466,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		case migrate_misfit:
 			/*
 			 * For ASYM_CPUCAPACITY domains with misfit tasks we
-			 * simply seek the "biggest" misfit task.
+			 * simply seek the "biggest" misfit task we can
+			 * accommodate.
 			 */
+			if (!cpu_capacity_greater(env->dst_cpu, i))
+				continue;
+
 			if (rq->misfit_task_load > busiest_load) {
 				busiest_load = rq->misfit_task_load;
 				busiest = rq;
 			}
 
 			break;
-
 		}
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 7/7] sched/fair: Relax task_hot() for misfit tasks
  2021-03-11 12:05 [PATCH v3 0/7] sched/fair: misfit task load-balance tweaks Valentin Schneider
                   ` (5 preceding siblings ...)
  2021-03-11 12:05 ` [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
@ 2021-03-11 12:05 ` Valentin Schneider
  6 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2021-03-11 12:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Qais Yousef, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

Consider the following topology:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3

  capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3})

w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024).

When CPU2 goes through load_balance() (via periodic / NOHZ balance), it
should pull one CPU hog from either CPU0 or CPU1 (this is misfit task
upmigration). However, should e.g. a pcpu kworker wake up on CPU0 just before
this load_balance() happens and preempt the CPU hog running there, we would
have, for the [0-1] group at CPU2's DIE level:

o sgs->sum_nr_running > sgs->group_weight
o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct

IOW, this group is group_overloaded.

Considering CPU0 is picked by find_busiest_queue(), we would then visit the
preempted CPU hog in detach_tasks(). However, given it has just been
preempted by this pcpu kworker, task_hot() will prevent it from being
detached. We then leave load_balance() without having done anything.

Long story short, preempted misfit tasks are affected by task_hot(), while
currently running misfit tasks are intentionally preempted by the stopper
task to migrate them over to a higher-capacity CPU.

Align detach_tasks() with the active-balance logic and let it pick a
cache-hot misfit task when the destination CPU can provide a capacity
uplift.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41cdda7a8ea6..5454429cea7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7474,6 +7474,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	if (env->sd->flags & SD_SHARE_CPUCAPACITY)
 		return 0;
 
+	/*
+	 * On a (sane) asymmetric CPU capacity system, the increase in compute
+	 * capacity should offset any potential performance hit caused by a
+	 * migration.
+	 */
+	if (sd_has_asym_cpucapacity(env->sd) &&
+	    env->idle != CPU_NOT_IDLE &&
+	    !task_fits_capacity(p, capacity_of(env->src_cpu)) &&
+	    cpu_capacity_greater(env->dst_cpu, env->src_cpu))
+		return 0;
+
 	/*
 	 * Buddy candidates are cache hot:
 	 */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 3/7] sched/fair: Add more sched_asym_cpucapacity static branch checks
  2021-03-11 12:05 ` [PATCH v3 3/7] sched/fair: Add more sched_asym_cpucapacity static branch checks Valentin Schneider
@ 2021-03-15 14:18   ` Vincent Guittot
  2021-03-15 19:24     ` Valentin Schneider
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2021-03-15 14:18 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Rik van Riel, Qais Yousef, Peter Zijlstra,
	Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Lingutla Chandrasekhar

On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Rik noted a while back that a handful of
>
>   sd->flags & SD_ASYM_CPUCAPACITY
>
> & family in the CFS load-balancer code aren't guarded by the
> sched_asym_cpucapacity static branch.

Guarding asym capacity with a static branch in the fast path makes sense,
but I see no benefit in this slow path other than hiding and complexifying
the code. Also, if you go this way, then you have to add a nop in all the
other places where the flag or a group_type might be unused.

>
> Turning those checks into NOPs for those who don't need it is fairly
> straightforward, and hiding it in a helper doesn't change code size in all
> but one spot. It also gives us a place to document the differences between
> checking the static key and checking the SD flag.
>
> Suggested-by: Rik van Riel <riel@surriel.com>
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c  | 21 ++++++++-------------
>  kernel/sched/sched.h | 33 +++++++++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f50a902bdf24..db892f6e222f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6300,15 +6300,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>          * sd_asym_cpucapacity rather than sd_llc.
>          */
>         if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> +               /* See sd_has_asym_cpucapacity() */
>                 sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
> -               /*
> -                * On an asymmetric CPU capacity system where an exclusive
> -                * cpuset defines a symmetric island (i.e. one unique
> -                * capacity_orig value through the cpuset), the key will be set
> -                * but the CPUs within that cpuset will not have a domain with
> -                * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
> -                * capacity path.
> -                */
>                 if (sd) {
>                         i = select_idle_capacity(p, sd, target);
>                         return ((unsigned)i < nr_cpumask_bits) ? i : target;
> @@ -8467,7 +8460,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                         continue;
>
>                 /* Check for a misfit task on the cpu */
> -               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +               if (sd_has_asym_cpucapacity(env->sd) &&
>                     sgs->group_misfit_task_load < rq->misfit_task_load) {
>                         sgs->group_misfit_task_load = rq->misfit_task_load;
>                         *sg_status |= SG_OVERLOAD;
> @@ -8524,7 +8517,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          * CPUs in the group should either be possible to resolve
>          * internally or be covered by avg_load imbalance (eventually).
>          */
> -       if (sgs->group_type == group_misfit_task &&
> +       if (static_branch_unlikely(&sched_asym_cpucapacity) &&
> +           sgs->group_type == group_misfit_task &&
>             (!group_smaller_max_cpu_capacity(sg, sds->local) ||
>              sds->local_stat.group_type != group_has_spare))
>                 return false;
> @@ -8607,7 +8601,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          * throughput. Maximize throughput, power/energy consequences are not
>          * considered.
>          */
> -       if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> +       if (sd_has_asym_cpucapacity(env->sd) &&
>             (sgs->group_type <= group_fully_busy) &&
>             (group_smaller_min_cpu_capacity(sds->local, sg)))
>                 return false;
> @@ -8730,7 +8724,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
>         }
>
>         /* Check if task fits in the group */
> -       if (sd->flags & SD_ASYM_CPUCAPACITY &&
> +       if (sd_has_asym_cpucapacity(sd) &&
>             !task_fits_capacity(p, group->sgc->max_capacity)) {
>                 sgs->group_misfit_task_load = 1;
>         }
> @@ -9408,7 +9402,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>                  * Higher per-CPU capacity is considered better than balancing
>                  * average load.
>                  */
> -               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +               if (sd_has_asym_cpucapacity(env->sd) &&
>                     capacity_of(env->dst_cpu) < capacity &&
>                     nr_running == 1)
>                         continue;
> @@ -10225,6 +10219,7 @@ static void nohz_balancer_kick(struct rq *rq)
>                 }
>         }
>
> +        /* See sd_has_asym_cpucapacity(). */
>         sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu));
>         if (sd) {
>                 /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d2e09a647c4f..27bf70bc86c7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1492,6 +1492,39 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>  extern struct static_key_false sched_asym_cpucapacity;
>
> +/*
> + * Note that the static key is system-wide, but the visibility of
> + * SD_ASYM_CPUCAPACITY isn't. Thus the static key being enabled does not
> + * imply all CPUs can see asymmetry.
> + *
> + * Consider an asymmetric CPU capacity system such as:
> + *
> + * MC [           ]
> + *     0 1 2 3 4 5
> + *     L L L L B B
> + *
> + * w/ arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
> + *
> + * By default, booting this system will enable the sched_asym_cpucapacity
> + * static key, and all CPUs will see SD_ASYM_CPUCAPACITY set at their MC
> + * sched_domain.
> + *
> + * Further consider exclusive cpusets creating a "symmetric island":
> + *
> + * MC [   ][      ]
> + *     0 1  2 3 4 5
> + *     L L  L L B B
> + *
> + * Again, booting this will enable the static key, but CPUs 0-1 will *not* have
> + * SD_ASYM_CPUCAPACITY set in any of their sched_domain. This is the intended
> + * behaviour, as CPUs 0-1 should be treated as a regular, isolated SMP system.
> + */
> +static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
> +{
> +       return static_branch_unlikely(&sched_asym_cpucapacity) &&
> +               sd->flags & SD_ASYM_CPUCAPACITY;
> +}
> +
>  struct sched_group_capacity {
>         atomic_t                ref;
>         /*
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/7] sched/fair: Introduce a CPU capacity comparison helper
  2021-03-11 12:05 ` [PATCH v3 4/7] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
@ 2021-03-15 14:24   ` Vincent Guittot
  2021-03-15 19:24     ` Valentin Schneider
  2021-03-31 11:34   ` Chandra Sekhar Lingutla
  1 sibling, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2021-03-15 14:24 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Qais Yousef, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index db892f6e222f..ddb2ab3edf6d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   */
>  #define fits_capacity(cap, max)        ((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap1' noticeably greater than 'cap2'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

defined but not used.

Should be merged with the next patch, which starts to use it.

>  #endif
>
>  #ifdef CONFIG_CFS_BANDWIDTH
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-03-11 12:05 ` [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
@ 2021-03-15 15:13   ` Vincent Guittot
  2021-03-15 19:18     ` Valentin Schneider
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2021-03-15 15:13 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Qais Yousef, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Consider the following (hypothetical) asymmetric CPU capacity topology,
> with some amount of capacity pressure (RT | DL | IRQ | thermal):
>
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
>
>   | CPU | capacity_orig | capacity |
>   |-----+---------------+----------|
>   |   0 |           870 |      860 |
>   |   1 |           870 |      600 |
>   |   2 |          1024 |      850 |
>   |   3 |          1024 |      860 |
>
> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> CPU1's misfit task. This would then fall on CPU2 to pull the task.
>
> This currently won't happen, because CPU2 will fail
>
>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)

which has been introduced by the previous patch (patch 5)

>
> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> level. In this case, the max_capacity is that of CPU0's, which is at this
> point in time greater than that of CPU2's. This comparison doesn't make
> much sense, given that the only CPUs we should care about in this scenario
> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> destination CPU).
>
> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> env->dst_cpu would grant it a capacity uplift. Separately track whether a
> sched_group contains a misfit task to still classify it as
> group_misfit_task and not pick it as busiest group when pulling from a

Could you give more details about why we should keep tracking the
group as misfit? Do you have a UC in mind?

> lower-capacity CPU (which is the current behaviour and prevents
> down-migration).
>
> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
> than the local CPU's, add a capacity check there.
>
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++---------
>  1 file changed, 30 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e8a242cd1f7..41cdda7a8ea6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
>         return cpu_rq(cpu)->cpu_capacity;
>  }
>
> +/* Is CPU a's capacity noticeably greater than CPU b's? */
> +static inline bool cpu_capacity_greater(int a, int b)
> +{
> +       return capacity_greater(capacity_of(a), capacity_of(b));
> +}
> +
>  static void record_wakee(struct task_struct *p)
>  {
>         /*
> @@ -8091,7 +8097,8 @@ struct sg_lb_stats {
>         unsigned int group_weight;
>         enum group_type group_type;
>         unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> -       unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> +       unsigned long group_misfit_task_load; /* Task load that can be uplifted */
> +       int           group_has_misfit_task; /* A CPU has a task too big for its capacity */
>  #ifdef CONFIG_NUMA_BALANCING
>         unsigned int nr_numa_running;
>         unsigned int nr_preferred_running;
> @@ -8364,7 +8371,7 @@ group_type group_classify(unsigned int imbalance_pct,
>         if (sgs->group_asym_packing)
>                 return group_asym_packing;
>
> -       if (sgs->group_misfit_task_load)
> +       if (sgs->group_has_misfit_task)
>                 return group_misfit_task;
>
>         if (!group_has_capacity(imbalance_pct, sgs))
> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                         continue;
>
>                 /* Check for a misfit task on the cpu */
> -               if (sd_has_asym_cpucapacity(env->sd) &&
> -                   sgs->group_misfit_task_load < rq->misfit_task_load) {
> -                       sgs->group_misfit_task_load = rq->misfit_task_load;
> -                       *sg_status |= SG_OVERLOAD;
> +               if (!sd_has_asym_cpucapacity(env->sd) ||
> +                   !rq->misfit_task_load)
> +                       continue;
> +
> +               *sg_status |= SG_OVERLOAD;
> +               sgs->group_has_misfit_task = true;
> +
> +               /*
> +                * Don't attempt to maximize load for misfit tasks that can't be
> +                * granted a CPU capacity uplift.
> +                */
> +               if (cpu_capacity_greater(env->dst_cpu, i)) {
> +                       sgs->group_misfit_task_load = max(
> +                               sgs->group_misfit_task_load,
> +                               rq->misfit_task_load);

Please encapsulate all this misfit specific code in a dedicated
function which will be called from update_sg_lb_stats

>                 }
>         }
>
> @@ -8501,7 +8519,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>         /* Don't try to pull misfit tasks we can't help */
>         if (static_branch_unlikely(&sched_asym_cpucapacity) &&
>             sgs->group_type == group_misfit_task &&
> -           (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> +           (!sgs->group_misfit_task_load ||
>              sds->local_stat.group_type != group_has_spare))
>                 return false;
>
> @@ -9448,15 +9466,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>                 case migrate_misfit:
>                         /*
>                          * For ASYM_CPUCAPACITY domains with misfit tasks we
> -                        * simply seek the "biggest" misfit task.
> +                        * simply seek the "biggest" misfit task we can
> +                        * accommodate.
>                          */
> +                       if (!cpu_capacity_greater(env->dst_cpu, i))
> +                               continue;
> +
>                         if (rq->misfit_task_load > busiest_load) {
>                                 busiest_load = rq->misfit_task_load;
>                                 busiest = rq;
>                         }
>
>                         break;
> -
>                 }
>         }
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-03-15 15:13   ` Vincent Guittot
@ 2021-03-15 19:18     ` Valentin Schneider
  2021-03-19 15:19       ` Vincent Guittot
  0 siblings, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2021-03-15 19:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Qais Yousef, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

On 15/03/21 16:13, Vincent Guittot wrote:
> On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> Consider the following (hypothetical) asymmetric CPU capacity topology,
>> with some amount of capacity pressure (RT | DL | IRQ | thermal):
>>
>>   DIE [          ]
>>   MC  [    ][    ]
>>        0  1  2  3
>>
>>   | CPU | capacity_orig | capacity |
>>   |-----+---------------+----------|
>>   |   0 |           870 |      860 |
>>   |   1 |           870 |      600 |
>>   |   2 |          1024 |      850 |
>>   |   3 |          1024 |      860 |
>>
>> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
>> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
>> sufficiently busy, i.e. don't have enough spare capacity to accommodate
>> CPU1's misfit task. This would then fall on CPU2 to pull the task.
>>
>> This currently won't happen, because CPU2 will fail
>>
>>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
>
> which has been introduced by the previous patch: patch5
>
>>
>> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
>> level. In this case, the max_capacity is that of CPU0's, which is at this
>> point in time greater than that of CPU2's. This comparison doesn't make
>> much sense, given that the only CPUs we should care about in this scenario
>> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
>> destination CPU).
>>
>> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
>> env->dst_cpu would grant it a capacity uplift. Separately track whether a
>> sched_group contains a misfit task to still classify it as
>> group_misfit_task and not pick it as busiest group when pulling from a
>
> Could you give more details about why we should keep tracking the
> group as misfit ? Do you have a UC in mind ?
>

As stated, the current behaviour is to classify groups as group_misfit_task
regardless of the dst_cpu's capacity. When we see a group_misfit_task
candidate group with a higher per-CPU capacity than the local group, we
don't pick it as busiest.

I initially thought not marking those as group_misfit_task was the right
thing to do, as they could then be classified as group_fully_busy or
group_has_spare. Consider:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_capacity(L) < arch_scale_capacity(B)

  CPUs 0-1 are idle / lightly loaded
  CPU2 has a misfit task and a few very small tasks
  CPU3 has a few very small tasks

When CPU0 is running load_balance() at DIE level, right now we'll classify
the [2-3] group as group_misfit_task and not pick it as busiest because the
local group has a lower CPU capacity.

If we didn't do that, we could leave the misfit task alone and pull some
small task(s) from CPU2 or CPU3, which would be a good thing to
do. However, by allowing a group containing a misfit task to be picked as
the busiest group when a CPU of lower capacity is pulling, we run the risk
of the misfit task itself being downmigrated - e.g. if we repeatedly
increment the sd->nr_balance_failed counter and do an active balance (maybe
because the small tasks were unfortunately cache_hot()).

It's less than ideal, but I considered not downmigrating misfit tasks was
the thing to prioritize (and FWIW it also maintains current behaviour).


Another approach would be to add task utilization vs CPU capacity checks in
detach_tasks() and need_active_balance() to prevent downmigration when
env->imbalance_type < group_misfit_task. This may go against the busiest
group selection heuristics however (misfit tasks could be the main
contributors to the imbalance, but we end up not moving them).


>> lower-capacity CPU (which is the current behaviour and prevents
>> down-migration).
>>
>> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
>> than the local CPU's, add a capacity check there.
>>
>> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

>> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>                         continue;
>>
>>                 /* Check for a misfit task on the cpu */
>> -               if (sd_has_asym_cpucapacity(env->sd) &&
>> -                   sgs->group_misfit_task_load < rq->misfit_task_load) {
>> -                       sgs->group_misfit_task_load = rq->misfit_task_load;
>> -                       *sg_status |= SG_OVERLOAD;
>> +               if (!sd_has_asym_cpucapacity(env->sd) ||
>> +                   !rq->misfit_task_load)
>> +                       continue;
>> +
>> +               *sg_status |= SG_OVERLOAD;
>> +               sgs->group_has_misfit_task = true;
>> +
>> +               /*
>> +                * Don't attempt to maximize load for misfit tasks that can't be
>> +                * granted a CPU capacity uplift.
>> +                */
>> +               if (cpu_capacity_greater(env->dst_cpu, i)) {
>> +                       sgs->group_misfit_task_load = max(
>> +                               sgs->group_misfit_task_load,
>> +                               rq->misfit_task_load);
>
> Please encapsulate all this misfit specific code in a dedicated
> function which will be called from update_sg_lb_stats
>

Will do.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 3/7] sched/fair: Add more sched_asym_cpucapacity static branch checks
  2021-03-15 14:18   ` Vincent Guittot
@ 2021-03-15 19:24     ` Valentin Schneider
  0 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2021-03-15 19:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Rik van Riel, Qais Yousef, Peter Zijlstra,
	Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Lingutla Chandrasekhar


Hi Vincent,

Thanks for taking another look at this.

On 15/03/21 15:18, Vincent Guittot wrote:
> On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> Rik noted a while back that a handful of
>>
>>   sd->flags & SD_ASYM_CPUCAPACITY
>>
>> & family in the CFS load-balancer code aren't guarded by the
>> sched_asym_cpucapacity static branch.
>
> guarding asym capacity with static branch in fast path makes sense but
> I see no benefit in this slow path but hiding and complexifying the
> code. Also if you start with this way then you have to add a nop in
> all other places where flag or a group_type might be unused.
>

OK, fair enough, I'll drop this one.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/7] sched/fair: Introduce a CPU capacity comparison helper
  2021-03-15 14:24   ` Vincent Guittot
@ 2021-03-15 19:24     ` Valentin Schneider
  0 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2021-03-15 19:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Qais Yousef, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

On 15/03/21 15:24, Vincent Guittot wrote:
>> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>>   */
>>  #define fits_capacity(cap, max)        ((cap) * 1280 < (max) * 1024)
>>
>> +/*
>> + * The margin used when comparing CPU capacities.
>> + * is 'cap1' noticeably greater than 'cap2'
>> + *
>> + * (default: ~5%)
>> + */
>> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
>
> defined but not used.
>
> Should be merged with the next patch, which starts to use it
>

Will do.
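
To make the two margins concrete for a Pixel 4-style medium/big pair
(871 vs 1024), here is the arithmetic behind the macros quoted above, as a
tiny standalone check:

#include <assert.h>

int main(void)
{
        /* fits_capacity(cap, max): (cap) * 1280 < (max) * 1024, i.e. ~20% headroom */
        assert(!(871 * 1280 < 1024 * 1024));    /* 1114880 >= 1048576: 871 doesn't "fit" in 1024 */

        /* capacity_greater(cap1, cap2): (cap1) * 1024 > (cap2) * 1078, i.e. ~5% margin */
        assert(1024 * 1024 > 871 * 1078);       /* 1048576 > 938938: 1024 is noticeably greater */

        return 0;
}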

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls
  2021-03-11 12:05 ` [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
@ 2021-03-16 15:49   ` Dietmar Eggemann
  2021-03-16 16:03     ` Chandra Sekhar Lingutla
  2021-03-16 17:31     ` Valentin Schneider
  0 siblings, 2 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2021-03-16 15:49 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Qais Yousef, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 11/03/2021 13:05, Valentin Schneider wrote:
> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
> 
> During load balancing, when the balancing group is unable to pull a task
> from the busy group due to ->cpus_ptr constraints, it sets
> LBF_SOME_PINNED in the lb env flags; as a consequence, sgc->imbalance
> is set for its parent domain level, which gets the group classified
> as imbalanced so it can get help from another balancing CPU.
> 
> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and

Does it have to be a big.LITTLE system? I assume this issue also happens
on an SMP system.

> CPUs 2-3 as Bigs with below scenario:
> - CPU0 doing newly_idle balancing
> - CPU1 running percpu kworker and RT task (small tasks)

What's the role of the small RT task here in the story?

> - CPU2 running 2 big tasks
> - CPU3 running 1 medium task
> 
> While CPU0 is doing a newly_idle load balance at MC level, it fails to
> pull the percpu kworker from CPU1, sets LBF_SOME_PINNED in the lb env flags
> and sets sgc->imbalance at the DIE level domain. As LBF_ALL_PINNED is not
> cleared, it tries to redo the balancing by clearing CPU1 in the env cpus,
> but it doesn't find another busiest_group, so CPU0 stops balancing at MC
> level without clearing 'sgc->imbalance' and restarts the load balancing at
> DIE level.
> 
> CPU0 (the balancing CPU) then finds the LITTLEs' group as busiest_group with
> group type imbalanced, while the Bigs, classified at a level below the
> imbalanced type, are ignored when picking the busiest, so the balancing is
> aborted without pulling any tasks (by that time, CPU1 might no longer have
> any running tasks).
> 
> Classifying the group as imbalanced because of percpu threads is a
> suboptimal decision. So don't use LBF_SOME_PINNED for per cpu threads.

This sentence mentioned per-cpu threads (and so does the patch name) but
the implementation (only) deals with per-cpu kernel threads. IMHO, it
would be good to align this.

> 
> Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2e2ab1e00ef9..83aea97fbf22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7565,6 +7565,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>  		return 0;
>  
> +	/* Disregard pcpu kthreads; they are where they need to be. */
> +	if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
> +		return 0;
> +
>  	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>  		int cpu;
>  
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls
  2021-03-16 15:49   ` Dietmar Eggemann
@ 2021-03-16 16:03     ` Chandra Sekhar Lingutla
  2021-03-16 18:59       ` Dietmar Eggemann
  2021-03-16 17:31     ` Valentin Schneider
  1 sibling, 1 reply; 24+ messages in thread
From: Chandra Sekhar Lingutla @ 2021-03-16 16:03 UTC (permalink / raw)
  To: Dietmar Eggemann, Valentin Schneider, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Qais Yousef, Quentin Perret, Pavan Kondeti, Rik van Riel

Hi Dietmar,

On 3/16/2021 9:19 PM, Dietmar Eggemann wrote:
> On 11/03/2021 13:05, Valentin Schneider wrote:
>> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
>>
>> During load balancing, when the balancing group is unable to pull a task
>> from the busy group due to ->cpus_ptr constraints, it sets
>> LBF_SOME_PINNED in the lb env flags; as a consequence, sgc->imbalance
>> is set for its parent domain level, which gets the group classified
>> as imbalanced so it can get help from another balancing CPU.
>>
>> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
> Does it have to be a big.LITTLE system? I assume this issue also happens
> on an SMP system.

Yeah, the issue can happen on an SMP system as well. I will let Valentin
update the commit text in his next version of this series.

>> CPUs 2-3 as Bigs with below scenario:
>> - CPU0 doing newly_idle balancing
>> - CPU1 running percpu kworker and RT task (small tasks)
> What's the role of the small RT task here in the story?
This is to satisfy 'busiest->nr_running > 1' checks.
>> - CPU2 running 2 big tasks
>> - CPU3 running 1 medium task
>>
>> While CPU0 is doing a newly_idle load balance at MC level, it fails to
>> pull the percpu kworker from CPU1, sets LBF_SOME_PINNED in the lb env flags
>> and sets sgc->imbalance at the DIE level domain. As LBF_ALL_PINNED is not
>> cleared, it tries to redo the balancing by clearing CPU1 in the env cpus,
>> but it doesn't find another busiest_group, so CPU0 stops balancing at MC
>> level without clearing 'sgc->imbalance' and restarts the load balancing at
>> DIE level.
>>
>> CPU0 (the balancing CPU) then finds the LITTLEs' group as busiest_group with
>> group type imbalanced, while the Bigs, classified at a level below the
>> imbalanced type, are ignored when picking the busiest, so the balancing is
>> aborted without pulling any tasks (by that time, CPU1 might no longer have
>> any running tasks).
>>
>> Classifying the group as imbalanced because of percpu threads is a
>> suboptimal decision. So don't use LBF_SOME_PINNED for per cpu threads.
> This sentence mentioned per-cpu threads (and so does the patch name) but
> the implementation (only) deals with per-cpu kernel threads. IMHO, it
> would be good to align this.
I will let Valentin update this in the next version.
>> Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
>> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>> ---
>>   kernel/sched/fair.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2e2ab1e00ef9..83aea97fbf22 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7565,6 +7565,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>   	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>   		return 0;
>>   
>> +	/* Disregard pcpu kthreads; they are where they need to be. */
>> +	if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
>> +		return 0;
>> +
>>   	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>>   		int cpu;
>>   
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls
  2021-03-16 15:49   ` Dietmar Eggemann
  2021-03-16 16:03     ` Chandra Sekhar Lingutla
@ 2021-03-16 17:31     ` Valentin Schneider
  2021-03-16 19:06       ` Dietmar Eggemann
  1 sibling, 1 reply; 24+ messages in thread
From: Valentin Schneider @ 2021-03-16 17:31 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Qais Yousef, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 16/03/21 16:49, Dietmar Eggemann wrote:
> On 11/03/2021 13:05, Valentin Schneider wrote:
>> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
>>
>> During load balancing, when the balancing group is unable to pull a task
>> from the busy group due to ->cpus_ptr constraints, it sets
>> LBF_SOME_PINNED in the lb env flags; as a consequence, sgc->imbalance
>> is set for its parent domain level, which gets the group classified
>> as imbalanced so it can get help from another balancing CPU.
>>
>> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
>
> Does it have to be a big.LITTLE system? I assume this issue also happens
> on an SMP system.
>

Aye, though the consequences are "worse" on asym CPU capacity systems.

>> CPUs 2-3 as Bigs with below scenario:
>> - CPU0 doing newly_idle balancing
>> - CPU1 running percpu kworker and RT task (small tasks)
>
> What's the role of the small RT task here in the story?
>

I don't think it matters much here.

>> - CPU2 running 2 big tasks
>> - CPU3 running 1 medium task
>>
>> While CPU0 is doing a newly_idle load balance at MC level, it fails to
>> pull the percpu kworker from CPU1, sets LBF_SOME_PINNED in the lb env flags
>> and sets sgc->imbalance at the DIE level domain. As LBF_ALL_PINNED is not
>> cleared, it tries to redo the balancing by clearing CPU1 in the env cpus,
>> but it doesn't find another busiest_group, so CPU0 stops balancing at MC
>> level without clearing 'sgc->imbalance' and restarts the load balancing at
>> DIE level.
>>
>> CPU0 (the balancing CPU) then finds the LITTLEs' group as busiest_group with
>> group type imbalanced, while the Bigs, classified at a level below the
>> imbalanced type, are ignored when picking the busiest, so the balancing is
>> aborted without pulling any tasks (by that time, CPU1 might no longer have
>> any running tasks).
>>
>> Classifying the group as imbalanced because of percpu threads is a
>> suboptimal decision. So don't use LBF_SOME_PINNED for per cpu threads.
>
> This sentence mentioned per-cpu threads (and so does the patch name) but
> the implementation (only) deals with per-cpu kernel threads. IMHO, it
> would be good to align this.
>

Tell you what, I'll go for:
1) how can pcpu kthreads cause LBF_SOME_PINNED
2) why we may not want this, but still ignore !kthread pcpu tasks
3) why this is even more important for big.LITTLE

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls
  2021-03-16 16:03     ` Chandra Sekhar Lingutla
@ 2021-03-16 18:59       ` Dietmar Eggemann
  0 siblings, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2021-03-16 18:59 UTC (permalink / raw)
  To: Chandra Sekhar Lingutla, Valentin Schneider, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Qais Yousef, Quentin Perret, Pavan Kondeti, Rik van Riel

Hi Chandra,

On 16/03/2021 17:03, Chandra Sekhar Lingutla wrote:
> Hi Dietmar,
> 
> On 3/16/2021 9:19 PM, Dietmar Eggemann wrote:
>> On 11/03/2021 13:05, Valentin Schneider wrote:
>>> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>

[...]

>>> CPUs 2-3 as Bigs with below scenario:
>>> - CPU0 doing newly_idle balancing
>>> - CPU1 running percpu kworker and RT task (small tasks)
>> What's the role of the small RT task here in the story?
> This is to satisfy 'busiest->nr_running > 1' checks.

Ah, I see. Forgot about this bit of the story, the 'if
(busiest->nr_running > 1)' in load_balance().
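
For context, that check sits at the top of the task-pulling path in
load_balance(); condensed (not a verbatim quote):

        if (busiest->nr_running > 1) {
                /*
                 * Only from here do detach_tasks()/can_migrate_task() run,
                 * which is where the pinned percpu kworker ends up setting
                 * LBF_SOME_PINNED.
                 */
                env.flags |= LBF_ALL_PINNED;
                env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);

                cur_ld_moved = detach_tasks(&env);
        }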

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls
  2021-03-16 17:31     ` Valentin Schneider
@ 2021-03-16 19:06       ` Dietmar Eggemann
  0 siblings, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2021-03-16 19:06 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Qais Yousef, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 16/03/2021 18:31, Valentin Schneider wrote:
> On 16/03/21 16:49, Dietmar Eggemann wrote:
>> On 11/03/2021 13:05, Valentin Schneider wrote:
>>> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
>>>
>>> During load balancing, when the balancing group is unable to pull a task
>>> from the busy group due to ->cpus_ptr constraints, it sets
>>> LBF_SOME_PINNED in the lb env flags; as a consequence, sgc->imbalance
>>> is set for its parent domain level, which gets the group classified
>>> as imbalanced so it can get help from another balancing CPU.
>>>
>>> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
>>
>> Does it have to be a big.LITTLE system? I assume this issue also happens
>> on an SMP system.
>>
> 
> Aye, though the consequences are "worse" on asym CPU capacity systems.

I can only think of the higher group_type 'group_imbalanced' eclipsing
'group_misfit_task' here?

> 
>>> CPUs 2-3 as Bigs with below scenario:
>>> - CPU0 doing newly_idle balancing
>>> - CPU1 running percpu kworker and RT task (small tasks)
>>
>> What's the role of the small RT task here in the story?
>>
> 
> I don't think it matters much here.

Chandra already mentioned that this is part of the story, namely to get
the MC-level lb CPU1->CPU0 to start trying to move a task at all
(if (busiest->nr_running > 1)).

[...]

>> This sentence mentioned per-cpu threads (and so does the patch name) but
>> the implementation (only) deals with per-cpu kernel threads. IMHO, it
>> would be good to align this.
>>
> 
> Tell you what, I'll go for:
> 1) how can pcpu kthreads cause LBF_SOME_PINNED
> 2) why we may not want this, but still ignore !kthread pcpu tasks
> 3) why this is even more important for big.LITTLE

LGTM.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/7] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-03-11 12:05 ` [PATCH v3 2/7] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
@ 2021-03-17 10:52   ` Dietmar Eggemann
  0 siblings, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2021-03-17 10:52 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Qais Yousef, Quentin Perret, Pavan Kondeti, Rik van Riel,
	Lingutla Chandrasekhar

On 11/03/2021 13:05, Valentin Schneider wrote:

[...]

> @@ -9952,7 +9954,8 @@ static int active_load_balance_cpu_stop(void *data)
>  			 * @dst_grpmask we need to make that test go away with lying
>  			 * about DST_PINNED.
>  			 */
> -			.flags		= LBF_DST_PINNED,
> +			.flags		= LBF_DST_PINNED |
> +					  LBF_ACTIVE_LB,

Since you now have a dedicated LBF flag for active balancing, you could remove
the LBF_DST_PINNED here and adapt the if condition in can_migrate_task():

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f50a902bdf24..9f7feb512016 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7583,10 +7583,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
                 * meet load balance goals by pulling other tasks on src_cpu.
                 *
                 * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
-                * already computed one in current iteration.
+                * already computed one in current iteration or if it is an
+                * active balance.
                 */
-               if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
+               if (env->idle == CPU_NEWLY_IDLE ||
+                   (env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))) {
                        return 0;
+               }
 
                /* Prevent to re-select dst_cpu via env's CPUs: */
                for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
@@ -9948,14 +9951,7 @@ static int active_load_balance_cpu_stop(void *data)
                        .src_cpu        = busiest_rq->cpu,
                        .src_rq         = busiest_rq,
                        .idle           = CPU_IDLE,
-                       /*
-                        * can_migrate_task() doesn't need to compute new_dst_cpu
-                        * for active balancing. Since we have CPU_IDLE, but no
-                        * @dst_grpmask we need to make that test go away with lying
-                        * about DST_PINNED.
-                        */
-                       .flags          = LBF_DST_PINNED |
-                                         LBF_ACTIVE_LB,
+                       .flags          = LBF_ACTIVE_LB,
                };

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-03-15 19:18     ` Valentin Schneider
@ 2021-03-19 15:19       ` Vincent Guittot
  2021-03-23 18:51         ` Valentin Schneider
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2021-03-19 15:19 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Qais Yousef, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

On Mon, 15 Mar 2021 at 20:18, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 15/03/21 16:13, Vincent Guittot wrote:
> > On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
> > <valentin.schneider@arm.com> wrote:
> >>
> >> Consider the following (hypothetical) asymmetric CPU capacity topology,
> >> with some amount of capacity pressure (RT | DL | IRQ | thermal):
> >>
> >>   DIE [          ]
> >>   MC  [    ][    ]
> >>        0  1  2  3
> >>
> >>   | CPU | capacity_orig | capacity |
> >>   |-----+---------------+----------|
> >>   |   0 |           870 |      860 |
> >>   |   1 |           870 |      600 |
> >>   |   2 |          1024 |      850 |
> >>   |   3 |          1024 |      860 |
> >>
> >> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> >> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> >> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> >> CPU1's misfit task. This would then fall on CPU2 to pull the task.
> >>
> >> This currently won't happen, because CPU2 will fail
> >>
> >>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
> >
> > which has been introduced by the previous patch: patch5
> >
> >>
> >> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> >> level. In this case, the max_capacity is that of CPU0's, which is at this
> >> point in time greater than that of CPU2's. This comparison doesn't make
> >> much sense, given that the only CPUs we should care about in this scenario
> >> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> >> destination CPU).
> >>
> >> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> >> env->dst_cpu would grant it a capacity uplift. Separately track whether a
> >> sched_group contains a misfit task to still classify it as
> >> group_misfit_task and not pick it as busiest group when pulling from a
> >
> > Could you give more details about why we should keep tracking the
> > group as misfit ? Do you have a UC in mind ?
> >
>
> As stated, the current behaviour is to classify groups as group_misfit_task
> regardless of the dst_cpu's capacity. When we see a group_misfit_task
> candidate group with higher per-CPU capacity than the local group, we
> don't pick it as busiest.
>
> I initially thought not marking those as group_misfit_task was the right
> thing to do, as they could then be classified as group_fully_busy or
> group_has_spare. Consider:
>
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
>        L  L  B  B
>
>   arch_scale_capacity(L) < arch_scale_capacity(B)
>
>   CPUs 0-1 are idle / lightly loaded
>   CPU2 has a misfit task and a few very small tasks
>   CPU3 has a few very small tasks
>
> When CPU0 is running load_balance() at DIE level, right now we'll classify
> the [2-3] group as group_misfit_task and not pick it as busiest because the
> local group has a lower CPU capacity.
>
> If we didn't do that, we could leave the misfit task alone and pull some
> small task(s) from CPU2 or CPU3, which would be a good thing to

Are you sure? The last check in update_sd_pick_busiest() should already
filter this. So it should be enough to let it be classified correctly.

A group should be classified as group_misfit_task when there is a task
to migrate in priority compared to some other groups. In your case,
you tag it as group_misfit_task but in order to do the opposite, i.e.
to make sure it is not selected. As mentioned above, this will be
filtered by the last check in update_sd_pick_busiest().

> do. However, by allowing a group containing a misfit task to be picked as
> the busiest group when a CPU of lower capacity is pulling, we run the risk
> of the misfit task itself being downmigrated - e.g. if we repeatedly
> increment the sd->nr_balance_failed counter and do an active balance (maybe
> because the small tasks were unfortunately cache_hot()).
>
> It's less than ideal, but I considered not downmigrating misfit tasks was
> the thing to prioritize (and FWIW it also maintains current behaviour).
>
>
> Another approach would be to add task utilization vs CPU capacity checks in
> detach_tasks() and need_active_balance() to prevent downmigration when
> env->imbalance_type < group_misfit_task. This may go against the busiest
> group selection heuristics however (misfit tasks could be the main
> contributors to the imbalance, but we end up not moving them).
>
>
> >> lower-capacity CPU (which is the current behaviour and prevents
> >> down-migration).
> >>
> >> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
> >> than the local CPU's, add a capacity check there.
> >>
> >> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> >> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>
> >> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>                         continue;
> >>
> >>                 /* Check for a misfit task on the cpu */
> >> -               if (sd_has_asym_cpucapacity(env->sd) &&
> >> -                   sgs->group_misfit_task_load < rq->misfit_task_load) {
> >> -                       sgs->group_misfit_task_load = rq->misfit_task_load;
> >> -                       *sg_status |= SG_OVERLOAD;
> >> +               if (!sd_has_asym_cpucapacity(env->sd) ||
> >> +                   !rq->misfit_task_load)
> >> +                       continue;
> >> +
> >> +               *sg_status |= SG_OVERLOAD;
> >> +               sgs->group_has_misfit_task = true;
> >> +
> >> +               /*
> >> +                * Don't attempt to maximize load for misfit tasks that can't be
> >> +                * granted a CPU capacity uplift.
> >> +                */
> >> +               if (cpu_capacity_greater(env->dst_cpu, i)) {
> >> +                       sgs->group_misfit_task_load = max(
> >> +                               sgs->group_misfit_task_load,
> >> +                               rq->misfit_task_load);
> >
> > Please encapsulate all this misfit specific code in a dedicated
> > function which will be called from update_sg_lb_stats
> >
>
> Will do.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-03-19 15:19       ` Vincent Guittot
@ 2021-03-23 18:51         ` Valentin Schneider
  0 siblings, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2021-03-23 18:51 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Qais Yousef, Peter Zijlstra, Ingo Molnar,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel, Lingutla Chandrasekhar

On 19/03/21 16:19, Vincent Guittot wrote:
> On Mon, 15 Mar 2021 at 20:18, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>> As stated, the current behaviour is to classify groups as group_misfit_task
>> regardless of the dst_cpu's capacity. When we see a group_misfit_task
>> candidate group with higher per-CPU capacity than the local group, we
>> don't pick it as busiest.
>>
>> I initially thought not marking those as group_misfit_task was the right
>> thing to do, as they could then be classified as group_fully_busy or
>> group_has_spare. Consider:
>>
>>   DIE [          ]
>>   MC  [    ][    ]
>>        0  1  2  3
>>        L  L  B  B
>>
>>   arch_scale_capacity(L) < arch_scale_capacity(B)
>>
>>   CPUs 0-1 are idle / lightly loaded
>>   CPU2 has a misfit task and a few very small tasks
>>   CPU3 has a few very small tasks
>>
>> When CPU0 is running load_balance() at DIE level, right now we'll classify
>> the [2-3] group as group_misfit_task and not pick it as busiest because the
>> local group has a lower CPU capacity.
>>
>> If we didn't do that, we could leave the misfit task alone and pull some
>> small task(s) from CPU2 or CPU3, which would be a good thing to
>
> Are you sure? The last check in update_sd_pick_busiest() should already
> filter this. So it should be enough to let it be classified correctly.
>
> A group should be classified as group_misfit_task when there is a task
> to migrate in priority compared to some other groups. In your case,
> you tag it as group_misfit_task but in order to do the opposite, i.e.
> to make sure it is not selected. As mentioned above, this will be
> filtered by the last check in update_sd_pick_busiest().
>

This hinges on sgc->min_capacity, which might be influenced by a CPU in the
candidate group being severely pressured by IRQ / thermal / RT / DL. That
said, you have a point in that this check and the one in
find_busiest_queue() catch most scenarios I can think of.

Let me ponder this some more, and throw it at the test infrastructure
monster if I go down that route.
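
For context, the check referred to above, as it reads after patch 5 of this
series, is the tail filter in update_sd_pick_busiest():

        if (sd_has_asym_cpucapacity(env->sd) &&
            (sgs->group_type <= group_fully_busy) &&
            (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
                return false;

i.e. don't pick a candidate group whose least-capable CPU is still noticeably
more capable than the pulling CPU, which is exactly where a heavily pressured
sgc->min_capacity can skew the decision.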

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/7] sched/fair: Introduce a CPU capacity comparison helper
  2021-03-11 12:05 ` [PATCH v3 4/7] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
  2021-03-15 14:24   ` Vincent Guittot
@ 2021-03-31 11:34   ` Chandra Sekhar Lingutla
  1 sibling, 0 replies; 24+ messages in thread
From: Chandra Sekhar Lingutla @ 2021-03-31 11:34 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Qais Yousef, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

Hi Valentin,

The reduced margin is helping our platforms. Please feel free to add my
Tested-by tag:

Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>

On 3/11/2021 5:35 PM, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
>    group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>   kernel/sched/fair.c | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index db892f6e222f..ddb2ab3edf6d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>    */
>   #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
>   
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap1' noticeably greater than 'cap2'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
>   #endif
>   
>   #ifdef CONFIG_CFS_BANDWIDTH

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/7] sched/fair: Employ capacity_greater() throughout load_balance()
  2021-03-11 12:05 ` [PATCH v3 5/7] sched/fair: Employ capacity_greater() throughout load_balance() Valentin Schneider
@ 2021-03-31 11:35   ` Chandra Sekhar Lingutla
  0 siblings, 0 replies; 24+ messages in thread
From: Chandra Sekhar Lingutla @ 2021-03-31 11:35 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Qais Yousef, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Morten Rasmussen, Quentin Perret,
	Pavan Kondeti, Rik van Riel

Hi Valentin,

The reduced margin is helping our platforms. Please feel free to add my
Tested-by tag:

Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>

On 3/11/2021 5:35 PM, Valentin Schneider wrote:
> While at it, replace group_smaller_{min, max}_cpu_capacity() with
> comparisons of the source group's min/max capacity and the destination
> CPU's capacity.
>
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>   kernel/sched/fair.c | 33 ++++-----------------------------
>   1 file changed, 4 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ddb2ab3edf6d..1e8a242cd1f7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8350,26 +8350,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>   	return false;
>   }
>   
> -/*
> - * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity than sched_group ref.
> - */
> -static inline bool
> -group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> -	return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> -}
> -
> -/*
> - * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity_orig than sched_group ref.
> - */
> -static inline bool
> -group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> -	return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> -}
> -
>   static inline enum
>   group_type group_classify(unsigned int imbalance_pct,
>   			  struct sched_group *group,
> @@ -8518,15 +8498,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   	if (!sgs->sum_h_nr_running)
>   		return false;
>   
> -	/*
> -	 * Don't try to pull misfit tasks we can't help.
> -	 * We can use max_capacity here as reduction in capacity on some
> -	 * CPUs in the group should either be possible to resolve
> -	 * internally or be covered by avg_load imbalance (eventually).
> -	 */
> +	/* Don't try to pull misfit tasks we can't help */
>   	if (static_branch_unlikely(&sched_asym_cpucapacity) &&
>   	    sgs->group_type == group_misfit_task &&
> -	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
> +	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
>   	     sds->local_stat.group_type != group_has_spare))
>   		return false;
>   
> @@ -8610,7 +8585,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   	 */
>   	if (sd_has_asym_cpucapacity(env->sd) &&
>   	    (sgs->group_type <= group_fully_busy) &&
> -	    (group_smaller_min_cpu_capacity(sds->local, sg)))
> +	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
>   		return false;
>   
>   	return true;
> @@ -9410,7 +9385,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>   		 * average load.
>   		 */
>   		if (sd_has_asym_cpucapacity(env->sd) &&
> -		    capacity_of(env->dst_cpu) < capacity &&
> +		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
>   		    nr_running == 1)
>   			continue;
>   

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-03-31 11:35 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-11 12:05 [PATCH v3 0/7] sched/fair: misfit task load-balance tweaks Valentin Schneider
2021-03-11 12:05 ` [PATCH v3 1/7] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
2021-03-16 15:49   ` Dietmar Eggemann
2021-03-16 16:03     ` Chandra Sekhar Lingutla
2021-03-16 18:59       ` Dietmar Eggemann
2021-03-16 17:31     ` Valentin Schneider
2021-03-16 19:06       ` Dietmar Eggemann
2021-03-11 12:05 ` [PATCH v3 2/7] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
2021-03-17 10:52   ` Dietmar Eggemann
2021-03-11 12:05 ` [PATCH v3 3/7] sched/fair: Add more sched_asym_cpucapacity static branch checks Valentin Schneider
2021-03-15 14:18   ` Vincent Guittot
2021-03-15 19:24     ` Valentin Schneider
2021-03-11 12:05 ` [PATCH v3 4/7] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
2021-03-15 14:24   ` Vincent Guittot
2021-03-15 19:24     ` Valentin Schneider
2021-03-31 11:34   ` Chandra Sekhar Lingutla
2021-03-11 12:05 ` [PATCH v3 5/7] sched/fair: Employ capacity_greater() throughout load_balance() Valentin Schneider
2021-03-31 11:35   ` Chandra Sekhar Lingutla
2021-03-11 12:05 ` [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
2021-03-15 15:13   ` Vincent Guittot
2021-03-15 19:18     ` Valentin Schneider
2021-03-19 15:19       ` Vincent Guittot
2021-03-23 18:51         ` Valentin Schneider
2021-03-11 12:05 ` [PATCH v3 7/7] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
