* [PATCH 0/2] sched/fair: (The return of) misfit task load-balance tweaks
@ 2021-04-15 17:58 Valentin Schneider
  2021-04-15 17:58 ` [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
  2021-04-15 17:58 ` [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
  0 siblings, 2 replies; 21+ messages in thread
From: Valentin Schneider @ 2021-04-15 17:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

Hi folks,

These are the misfit-specific bits I tore out of [1] and have been further
chewing on.

o Patch 1 reworks group vs CPU capacity checks. It removes a safeguard we
  had against downmigrations, so it had to grow fatter to compensate.
o Patch 2 aligns running and non-running misfit task cache hotness
  considerations.

This is based on top of today's tip/sched/core at:

  e0ccd4a3b01 ("rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()")
  
Revisions
=========

v3 -> part 2 v1
---------------

o Removed sg_lb_stats->group_has_misfit_task (Vincent)
o Shoved misfit specific update_sg_lb_stats() work into its own function
  (Vincent)

o Added migrate_degrades_capacity()
o Made task_hot() tweak depend on
  src_grp_type == group_has_spare
  rather than
  idle != CPU_NOT_IDLE

v2 -> v3
--------

o Rebased on top of latest tip/sched/core
o Added test results vs stress-ng.vm-segv

v1 -> v2
--------

o Collected Reviewed-by
o Minor comment and code cleanups

o Consolidated static key vs SD flag explanation (Dietmar)

  Note to Vincent: I didn't measure the impact of adding said static key to
  load_balance(); I do however believe it is a low hanging fruit. The
  wrapper keeps things neat and tidy, and is also helpful for documenting
  the intricacies of the static key status vs the presence of the SD flag
  in a CPU's sched_domain hierarchy.
  
o Removed v1 patch 4 - root_domain.max_cpu_capacity is absolutely not what
  I had convinced myself it was.
o Squashed capacity margin usage with removal of
  group_smaller_{min, max}_capacity() (Vincent)   
o Replaced v1 patch 7 with Lingutla's can_migrate_task() patch [2]
o Rewrote task_hot() modification changelog

Links
=====

[1]: http://lore.kernel.org/r/20210311120527.167870-1-valentin.schneider@arm.com
[2]: http://lore.kernel.org/r/20210217120854.1280-1-clingutla@codeaurora.org


Cheers,
Valentin

Valentin Schneider (2):
  sched/fair: Filter out locally-unsolvable misfit imbalances
  sched/fair: Relax task_hot() for misfit tasks

 kernel/sched/fair.c | 99 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 81 insertions(+), 18 deletions(-)

--
2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-04-15 17:58 [PATCH 0/2] sched/fair: (The return of) misfit task load-balance tweaks Valentin Schneider
@ 2021-04-15 17:58 ` Valentin Schneider
  2021-04-15 18:47   ` Rik van Riel
                     ` (2 more replies)
  2021-04-15 17:58 ` [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
  1 sibling, 3 replies; 21+ messages in thread
From: Valentin Schneider @ 2021-04-15 17:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

Consider the following (hypothetical) asymmetric CPU capacity topology,
with some amount of capacity pressure (RT | DL | IRQ | thermal):

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3

  | CPU | capacity_orig | capacity |
  |-----+---------------+----------|
  |   0 |           870 |      860 |
  |   1 |           870 |      600 |
  |   2 |          1024 |      850 |
  |   3 |          1024 |      860 |

If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 to be
sufficiently busy, i.e. without enough spare capacity to accommodate
CPU1's misfit task. It would then fall to CPU2 to pull the task.

This currently won't happen, because CPU2 will fail

  capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)

in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
level. In this case, the max_capacity is that of CPU0's, which is at this
point in time greater than that of CPU2's. This comparison doesn't make
much sense, given that the only CPUs we should care about in this scenario
are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
destination CPU).
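
As a quick sanity check of the numbers above, here is a minimal userspace
sketch (not kernel code) of the comparison, assuming capacity_greater()
applies the ~5% margin currently used in fair.c (cap1 * 1024 > cap2 * 1078):

```c
/*
 * Standalone sketch of fair.c's capacity_greater(); the 1078/1024
 * (~5%) margin is an assumption about the current implementation.
 */
static int capacity_greater(unsigned long cap1, unsigned long cap2)
{
	return cap1 * 1024 > cap2 * 1078;
}
```

With the table above, capacity_greater(850, 870), i.e. CPU2's capacity vs
the group's max_capacity (CPU0's), is false, so the group is discarded,
even though capacity_greater(850, 600), i.e. CPU2 vs CPU1 (the CPU that
actually hosts the misfit task), holds.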

Aggregate a misfit task's load into sgs->group_misfit_task_load only if
env->dst_cpu would grant it a capacity uplift.

Note that the aforementioned capacity vs sgc->max_capacity comparison was
meant to prevent misfit task downmigration: candidate groups classified as
group_misfit_task but with a higher (max) CPU capacity than the destination CPU
would be discarded. This change makes it so said group_misfit_task
classification can't happen anymore, which may cause some undesired
downmigrations.

Further tweak find_busiest_queue() to ensure this doesn't happen. Also note
find_busiest_queue() can now iterate over CPUs with a higher capacity than
the local CPU's, so add a capacity check there.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 45 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b8ae02f1994..d2d1a69d7aa7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+/* Is CPU a's capacity noticeably greater than CPU b's? */
+static inline bool cpu_capacity_greater(int a, int b)
+{
+	return capacity_greater(capacity_of(a), capacity_of(b));
+}
+
 static void record_wakee(struct task_struct *p)
 {
 	/*
@@ -7486,6 +7492,7 @@ struct lb_env {
 
 	enum fbq_type		fbq_type;
 	enum migration_type	migration_type;
+	enum group_type         src_grp_type;
 	struct list_head	tasks;
 };
 
@@ -8447,6 +8454,32 @@ static bool update_nohz_stats(struct rq *rq)
 #endif
 }
 
+static inline void update_sg_lb_misfit_stats(struct lb_env *env,
+					     struct sched_group *group,
+					     struct sg_lb_stats *sgs,
+					     int *sg_status,
+					     int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY) ||
+	    !rq->misfit_task_load)
+		return;
+
+	*sg_status |= SG_OVERLOAD;
+
+	/*
+	 * Don't attempt to maximize load for misfit tasks that can't be
+	 * granted a CPU capacity uplift.
+	 */
+	if (cpu_capacity_greater(env->dst_cpu, cpu)) {
+		sgs->group_misfit_task_load = max(
+			sgs->group_misfit_task_load,
+			rq->misfit_task_load);
+	}
+
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -8498,12 +8531,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (local_group)
 			continue;
 
-		/* Check for a misfit task on the cpu */
-		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    sgs->group_misfit_task_load < rq->misfit_task_load) {
-			sgs->group_misfit_task_load = rq->misfit_task_load;
-			*sg_status |= SG_OVERLOAD;
-		}
+		update_sg_lb_misfit_stats(env, group, sgs, sg_status, i);
 	}
 
 	/* Check if dst CPU is idle and preferred to this group */
@@ -8550,15 +8578,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (!sgs->sum_h_nr_running)
 		return false;
 
-	/*
-	 * Don't try to pull misfit tasks we can't help.
-	 * We can use max_capacity here as reduction in capacity on some
-	 * CPUs in the group should either be possible to resolve
-	 * internally or be covered by avg_load imbalance (eventually).
-	 */
+	/* Don't try to pull misfit tasks we can't help */
 	if (sgs->group_type == group_misfit_task &&
-	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
-	     sds->local_stat.group_type != group_has_spare))
+	     sds->local_stat.group_type != group_has_spare)
 		return false;
 
 	if (sgs->group_type > busiest->group_type)
@@ -9288,6 +9310,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (!sds.busiest)
 		goto out_balanced;
 
+	env->src_grp_type = busiest->group_type;
+
 	/* Misfit tasks should be dealt with regardless of the avg load */
 	if (busiest->group_type == group_misfit_task)
 		goto force_balance;
@@ -9441,8 +9465,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * average load.
 		 */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
-		    nr_running == 1)
+		    env->src_grp_type <= group_fully_busy &&
+		    !capacity_greater(capacity_of(env->dst_cpu), capacity))
 			continue;
 
 		switch (env->migration_type) {
@@ -9504,15 +9528,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		case migrate_misfit:
 			/*
 			 * For ASYM_CPUCAPACITY domains with misfit tasks we
-			 * simply seek the "biggest" misfit task.
+			 * simply seek the "biggest" misfit task we can
+			 * accommodate.
 			 */
+			if (!cpu_capacity_greater(env->dst_cpu, i))
+				continue;
+
 			if (rq->misfit_task_load > busiest_load) {
 				busiest_load = rq->misfit_task_load;
 				busiest = rq;
 			}
 
 			break;
-
 		}
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-15 17:58 [PATCH 0/2] sched/fair: (The return of) misfit task load-balance tweaks Valentin Schneider
  2021-04-15 17:58 ` [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
@ 2021-04-15 17:58 ` Valentin Schneider
  2021-04-15 20:39   ` Rik van Riel
  2021-04-16 13:51   ` Vincent Guittot
  1 sibling, 2 replies; 21+ messages in thread
From: Valentin Schneider @ 2021-04-15 17:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

Consider the following topology:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3

  capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3})

w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024).

When CPU2 goes through load_balance() (via periodic / NOHZ balance), it
should pull one CPU hog from either CPU0 or CPU1 (this is misfit task
upmigration). However, should e.g. a pcpu kworker wake up on CPU0 just
before this load_balance() happens and preempt the CPU hog running there,
we would have, for the [0-1] group at CPU2's DIE level:

o sgs->sum_nr_running > sgs->group_weight
o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct

IOW, this group is group_overloaded.
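
A minimal sketch (hypothetical numbers; imbalance_pct = 117 is only assumed
to be the domain default) of the group_overloaded classification implied by
the two conditions above:

```c
/*
 * Sketch of the two group_overloaded conditions; mirrors the shape of
 * fair.c's group_is_overloaded() but is a standalone illustration.
 */
static int group_is_overloaded(unsigned int imbalance_pct,
			       unsigned int sum_nr_running,
			       unsigned int group_weight,
			       unsigned long group_capacity,
			       unsigned long group_util)
{
	if (sum_nr_running <= group_weight)
		return 0;

	return group_capacity * 100 < group_util * imbalance_pct;
}
```

E.g. the two CPU hogs plus the woken kworker give sum_nr_running = 3
against group_weight = 2, and with group_util close to 2 * 1024 while the
group's capacity is pressured below that, both conditions hold and the
group is classified group_overloaded.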

Assuming CPU0 is picked by find_busiest_queue(), we would then visit the
preempted CPU hog in detach_tasks(). However, given it has just been
preempted by this pcpu kworker, task_hot() will prevent it from being
detached. We then leave load_balance() without having done anything.

Long story short, preempted misfit tasks are affected by task_hot(), while
currently running misfit tasks are intentionally preempted by the stopper
task to migrate them over to a higher-capacity CPU.

Align detach_tasks() with the active-balance logic and let it pick a
cache-hot misfit task when the destination CPU can provide a capacity
uplift.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d2d1a69d7aa7..43fc98d34276 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7493,6 +7493,7 @@ struct lb_env {
 	enum fbq_type		fbq_type;
 	enum migration_type	migration_type;
 	enum group_type         src_grp_type;
+	enum group_type         dst_grp_type;
 	struct list_head	tasks;
 };
 
@@ -7533,6 +7534,31 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+
+/*
+ * What does migrating this task do to our capacity-aware scheduling criterion?
+ *
+ * Returns 1, if the task needs more capacity than the dst CPU can provide.
+ * Returns 0, if the task needs the extra capacity provided by the dst CPU
+ * Returns -1, if the task isn't impacted by the migration wrt capacity.
+ */
+static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
+{
+	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+		return -1;
+
+	if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
+		if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
+			return 0;
+		else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
+			return 1;
+		else
+			return -1;
+	}
+
+	return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * Returns 1, if task migration degrades locality
@@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (tsk_cache_hot == -1)
 		tsk_cache_hot = task_hot(p, env);
 
+	/*
+	 * On a (sane) asymmetric CPU capacity system, the increase in compute
+	 * capacity should offset any potential performance hit caused by a
+	 * migration.
+	 */
+	if ((env->dst_grp_type == group_has_spare) &&
+	    !migrate_degrades_capacity(p, env))
+		tsk_cache_hot = 0;
+
 	if (tsk_cache_hot <= 0 ||
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 		if (tsk_cache_hot == 1) {
@@ -9310,6 +9345,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (!sds.busiest)
 		goto out_balanced;
 
+	env->dst_grp_type = local->group_type;
 	env->src_grp_type = busiest->group_type;
 
 	/* Misfit tasks should be dealt with regardless of the avg load */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-04-15 17:58 ` [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
@ 2021-04-15 18:47   ` Rik van Riel
  2021-04-16 13:29   ` Vincent Guittot
  2021-04-22  9:48   ` Dietmar Eggemann
  2 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2021-04-15 18:47 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Lingutla Chandrasekhar

[-- Attachment #1: Type: text/plain, Size: 430 bytes --]

On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> ...
> Further tweak find_busiest_queue() to ensure this doesn't happen.
> Also note
> find_busiest_queue() can now iterate over CPUs with a higher capacity
> than
> the local CPU's, so add a capacity check there.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-15 17:58 ` [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
@ 2021-04-15 20:39   ` Rik van Riel
  2021-04-16  9:43     ` Valentin Schneider
  2021-04-16 13:51   ` Vincent Guittot
  1 sibling, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2021-04-15 20:39 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Lingutla Chandrasekhar

[-- Attachment #1: Type: text/plain, Size: 1305 bytes --]

On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> Consider the following topology:
> 
> Long story short, preempted misfit tasks are affected by task_hot(),
> while
> currently running misfit tasks are intentionally preempted by the
> stopper
> task to migrate them over to a higher-capacity CPU.
> 
> Align detach_tasks() with the active-balance logic and let it pick a
> cache-hot misfit task when the destination CPU can provide a capacity
> uplift.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Rik van Riel <riel@surriel.com>


This patch looks good, but...

> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
> struct lb_env *env)
>  	if (tsk_cache_hot == -1)
>  		tsk_cache_hot = task_hot(p, env);
>  
> +	/*
> +	 * On a (sane) asymmetric CPU capacity system, the increase in
> compute
> +	 * capacity should offset any potential performance hit caused
> by a
> +	 * migration.
> +	 */
> +	if ((env->dst_grp_type == group_has_spare) &&
> +	    !migrate_degrades_capacity(p, env))
> +		tsk_cache_hot = 0;

... I'm starting to wonder if we should not rename the
tsk_cache_hot variable to something else to make this
code more readable. Probably in another patch :)

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-15 20:39   ` Rik van Riel
@ 2021-04-16  9:43     ` Valentin Schneider
  2021-04-19 12:59       ` Phil Auld
  0 siblings, 1 reply; 21+ messages in thread
From: Valentin Schneider @ 2021-04-16  9:43 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Lingutla Chandrasekhar

On 15/04/21 16:39, Rik van Riel wrote:
> On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
>> Consider the following topology:
>>
>> Long story short, preempted misfit tasks are affected by task_hot(),
>> while
>> currently running misfit tasks are intentionally preempted by the
>> stopper
>> task to migrate them over to a higher-capacity CPU.
>>
>> Align detach_tasks() with the active-balance logic and let it pick a
>> cache-hot misfit task when the destination CPU can provide a capacity
>> uplift.
>>
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>
> Reviewed-by: Rik van Riel <riel@surriel.com>
>

Thanks!

>
> This patch looks good, but...
>
>> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
>> struct lb_env *env)
>>      if (tsk_cache_hot == -1)
>>              tsk_cache_hot = task_hot(p, env);
>>
>> +	/*
>> +	 * On a (sane) asymmetric CPU capacity system, the increase in
>> compute
>> +	 * capacity should offset any potential performance hit caused
>> by a
>> +	 * migration.
>> +	 */
>> +	if ((env->dst_grp_type == group_has_spare) &&
>> +	    !migrate_degrades_capacity(p, env))
>> +		tsk_cache_hot = 0;
>
> ... I'm starting to wonder if we should not rename the
> tsk_cache_hot variable to something else to make this
> code more readable. Probably in another patch :)
>

I'd tend to agree, but naming is hard. "migration_harmful"?

> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-04-15 17:58 ` [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
  2021-04-15 18:47   ` Rik van Riel
@ 2021-04-16 13:29   ` Vincent Guittot
  2021-04-19 17:13     ` Valentin Schneider
  2021-04-22  9:48   ` Dietmar Eggemann
  2 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2021-04-16 13:29 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

On Thu, 15 Apr 2021 at 19:58, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Consider the following (hypothetical) asymmetric CPU capacity topology,
> with some amount of capacity pressure (RT | DL | IRQ | thermal):
>
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
>
>   | CPU | capacity_orig | capacity |
>   |-----+---------------+----------|
>   |   0 |           870 |      860 |
>   |   1 |           870 |      600 |
>   |   2 |          1024 |      850 |
>   |   3 |          1024 |      860 |
>
> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 to be
> sufficiently busy, i.e. without enough spare capacity to accommodate
> CPU1's misfit task. It would then fall to CPU2 to pull the task.
>
> This currently won't happen, because CPU2 will fail
>
>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
>
> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> level. In this case, the max_capacity is that of CPU0's, which is at this
> point in time greater than that of CPU2's. This comparison doesn't make
> much sense, given that the only CPUs we should care about in this scenario
> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> destination CPU).
>
> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> env->dst_cpu would grant it a capacity uplift.
>
> Note that the aforementioned capacity vs sgc->max_capacity comparison was
> meant to prevent misfit task downmigration: candidate groups classified as
> group_misfit_task but with a higher (max) CPU capacity than the destination CPU
> would be discarded. This change makes it so said group_misfit_task
> classification can't happen anymore, which may cause some undesired
> downmigrations.
>
> Further tweak find_busiest_queue() to ensure this doesn't happen. Also note
> find_busiest_queue() can now iterate over CPUs with a higher capacity than
> the local CPU's, so add a capacity check there.
>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 45 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9b8ae02f1994..d2d1a69d7aa7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5759,6 +5759,12 @@ static unsigned long capacity_of(int cpu)
>         return cpu_rq(cpu)->cpu_capacity;
>  }
>
> +/* Is CPU a's capacity noticeably greater than CPU b's? */
> +static inline bool cpu_capacity_greater(int a, int b)
> +{
> +       return capacity_greater(capacity_of(a), capacity_of(b));
> +}
> +
>  static void record_wakee(struct task_struct *p)
>  {
>         /*
> @@ -7486,6 +7492,7 @@ struct lb_env {
>
>         enum fbq_type           fbq_type;
>         enum migration_type     migration_type;
> +       enum group_type         src_grp_type;
>         struct list_head        tasks;
>  };
>
> @@ -8447,6 +8454,32 @@ static bool update_nohz_stats(struct rq *rq)
>  #endif
>  }
>
> +static inline void update_sg_lb_misfit_stats(struct lb_env *env,
> +                                            struct sched_group *group,
> +                                            struct sg_lb_stats *sgs,
> +                                            int *sg_status,
> +                                            int cpu)
> +{
> +       struct rq *rq = cpu_rq(cpu);
> +
> +       if (!(env->sd->flags & SD_ASYM_CPUCAPACITY) ||
> +           !rq->misfit_task_load)
> +               return;
> +
> +       *sg_status |= SG_OVERLOAD;
> +
> +       /*
> +        * Don't attempt to maximize load for misfit tasks that can't be
> +        * granted a CPU capacity uplift.
> +        */
> +       if (cpu_capacity_greater(env->dst_cpu, cpu)) {
> +               sgs->group_misfit_task_load = max(
> +                       sgs->group_misfit_task_load,
> +                       rq->misfit_task_load);
> +       }
> +
> +}
> +
>  /**
>   * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -8498,12 +8531,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 if (local_group)
>                         continue;
>
> -               /* Check for a misfit task on the cpu */
> -               if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> -                   sgs->group_misfit_task_load < rq->misfit_task_load) {
> -                       sgs->group_misfit_task_load = rq->misfit_task_load;
> -                       *sg_status |= SG_OVERLOAD;
> -               }
> +               update_sg_lb_misfit_stats(env, group, sgs, sg_status, i);
>         }
>
>         /* Check if dst CPU is idle and preferred to this group */
> @@ -8550,15 +8578,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>         if (!sgs->sum_h_nr_running)
>                 return false;
>
> -       /*
> -        * Don't try to pull misfit tasks we can't help.
> -        * We can use max_capacity here as reduction in capacity on some
> -        * CPUs in the group should either be possible to resolve
> -        * internally or be covered by avg_load imbalance (eventually).
> -        */
> +       /* Don't try to pull misfit tasks we can't help */
>         if (sgs->group_type == group_misfit_task &&
> -           (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> -            sds->local_stat.group_type != group_has_spare))
> +            sds->local_stat.group_type != group_has_spare)
>                 return false;
>
>         if (sgs->group_type > busiest->group_type)
> @@ -9288,6 +9310,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>         if (!sds.busiest)
>                 goto out_balanced;
>
> +       env->src_grp_type = busiest->group_type;
> +
>         /* Misfit tasks should be dealt with regardless of the avg load */
>         if (busiest->group_type == group_misfit_task)
>                 goto force_balance;
> @@ -9441,8 +9465,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>                  * average load.
>                  */
>                 if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> -                   !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
> -                   nr_running == 1)
> +                   env->src_grp_type <= group_fully_busy &&
> +                   !capacity_greater(capacity_of(env->dst_cpu), capacity))
>                         continue;
>
>                 switch (env->migration_type) {
> @@ -9504,15 +9528,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>                 case migrate_misfit:
>                         /*
>                          * For ASYM_CPUCAPACITY domains with misfit tasks we
> -                        * simply seek the "biggest" misfit task.
> +                        * simply seek the "biggest" misfit task we can
> +                        * accommodate.
>                          */
> +                       if (!cpu_capacity_greater(env->dst_cpu, i))

Use the same level of interface as above. This makes the code and the
condition easier to follow in find_busiest_queue():

capacity_greater(capacity_of(env->dst_cpu), capacity_of(i))


> +                               continue;
> +
>                         if (rq->misfit_task_load > busiest_load) {
>                                 busiest_load = rq->misfit_task_load;
>                                 busiest = rq;
>                         }
>
>                         break;
> -
>                 }
>         }
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-15 17:58 ` [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
  2021-04-15 20:39   ` Rik van Riel
@ 2021-04-16 13:51   ` Vincent Guittot
  2021-04-19 17:13     ` Valentin Schneider
  1 sibling, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2021-04-16 13:51 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

On Thursday 15 Apr 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
> Consider the following topology:
> 
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
> 
>   capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3})
> 
> w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024).
> 
> When CPU2 goes through load_balance() (via periodic / NOHZ balance), it
> should pull one CPU hog from either CPU0 or CPU1 (this is misfit task
> upmigration). However, should e.g. a pcpu kworker wake up on CPU0 just
> before this load_balance() happens and preempt the CPU hog running there,
> we would have, for the [0-1] group at CPU2's DIE level:
> 
> o sgs->sum_nr_running > sgs->group_weight
> o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct
> 
> IOW, this group is group_overloaded.
> 
> Assuming CPU0 is picked by find_busiest_queue(), we would then visit the
> preempted CPU hog in detach_tasks(). However, given it has just been
> preempted by this pcpu kworker, task_hot() will prevent it from being
> detached. We then leave load_balance() without having done anything.
> 
> Long story short, preempted misfit tasks are affected by task_hot(), while
> currently running misfit tasks are intentionally preempted by the stopper
> task to migrate them over to a higher-capacity CPU.
> 
> Align detach_tasks() with the active-balance logic and let it pick a
> cache-hot misfit task when the destination CPU can provide a capacity
> uplift.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d2d1a69d7aa7..43fc98d34276 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7493,6 +7493,7 @@ struct lb_env {
>  	enum fbq_type		fbq_type;
>  	enum migration_type	migration_type;
>  	enum group_type         src_grp_type;
> +	enum group_type         dst_grp_type;
>  	struct list_head	tasks;
>  };
>  
> @@ -7533,6 +7534,31 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +
> +/*
> + * What does migrating this task do to our capacity-aware scheduling criterion?
> + *
> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
> + */
> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> +{
> +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> +		return -1;
> +
> +	if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
> +		if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> +			return 0;
> +		else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
> +			return 1;
> +		else
> +			return -1;
> +	}

Reaching this point means the task fits src_cpu's capacity, so why test p against dst_cpu?

> +
> +	return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
> +}

I prefer the below, which is easier to read because the same variables are used everywhere and you can remove cpu_capacity_greater:

static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
{
    unsigned long src_capacity, dst_capacity;

    if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
        return -1;

    src_capacity = capacity_of(env->src_cpu);
    dst_capacity = capacity_of(env->dst_cpu);

    if (!task_fits_capacity(p, src_capacity)) {
        if (capacity_greater(dst_capacity, src_capacity))
            return 0;
        else if (capacity_greater(src_capacity, dst_capacity))
            return 1;
        else
            return -1;
    }

    return task_fits_capacity(p, dst_capacity) ? -1 : 1;
}


> +
>  #ifdef CONFIG_NUMA_BALANCING
>  /*
>   * Returns 1, if task migration degrades locality
> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (tsk_cache_hot == -1)
>  		tsk_cache_hot = task_hot(p, env);
>  
> +	/*
> +	 * On a (sane) asymmetric CPU capacity system, the increase in compute
> +	 * capacity should offset any potential performance hit caused by a
> +	 * migration.
> +	 */
> +	if ((env->dst_grp_type == group_has_spare) &&

Shouldn't it be env->src_grp_type == group_misfit_task, to only care about the misfit task case as
stated in $subject?


> +	    !migrate_degrades_capacity(p, env))
> +		tsk_cache_hot = 0;
> +
>  	if (tsk_cache_hot <= 0 ||
>  	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
>  		if (tsk_cache_hot == 1) {
> @@ -9310,6 +9345,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	if (!sds.busiest)
>  		goto out_balanced;
>  
> +	env->dst_grp_type = local->group_type;
>  	env->src_grp_type = busiest->group_type;
>  
>  	/* Misfit tasks should be dealt with regardless of the avg load */
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-16  9:43     ` Valentin Schneider
@ 2021-04-19 12:59       ` Phil Auld
  2021-04-19 17:17         ` Valentin Schneider
  0 siblings, 1 reply; 21+ messages in thread
From: Phil Auld @ 2021-04-19 12:59 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Rik van Riel, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen, Qais Yousef,
	Quentin Perret, Pavan Kondeti, Lingutla Chandrasekhar

On Fri, Apr 16, 2021 at 10:43:38AM +0100 Valentin Schneider wrote:
> On 15/04/21 16:39, Rik van Riel wrote:
> > On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> >> Consider the following topology:
> >>
> >> Long story short, preempted misfit tasks are affected by task_hot(),
> >> while
> >> currently running misfit tasks are intentionally preempted by the
> >> stopper
> >> task to migrate them over to a higher-capacity CPU.
> >>
> >> Align detach_tasks() with the active-balance logic and let it pick a
> >> cache-hot misfit task when the destination CPU can provide a capacity
> >> uplift.
> >>
> >> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> >
> > Reviewed-by: Rik van Riel <riel@surriel.com>
> >
> 
> Thanks!
> 
> >
> > This patch looks good, but...
> >
> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
> >> struct lb_env *env)
> >>      if (tsk_cache_hot == -1)
> >>              tsk_cache_hot = task_hot(p, env);
> >>
> >> +	/*
> >> +	 * On a (sane) asymmetric CPU capacity system, the increase in
> >> compute
> >> +	 * capacity should offset any potential performance hit caused
> >> by a
> >> +	 * migration.
> >> +	 */
> >> +	if ((env->dst_grp_type == group_has_spare) &&
> >> +	    !migrate_degrades_capacity(p, env))
> >> +		tsk_cache_hot = 0;
> >
> > ... I'm starting to wonder if we should not rename the
> > tsk_cache_hot variable to something else to make this
> > code more readable. Probably in another patch :)
> >
> 
> I'd tend to agree, but naming is hard. "migration_harmful" ?

I thought Rik meant tsk_cache_hot, for which I'd suggest at least
buying a vowel and putting an 'a' in there :) 


Cheers,
Phil

> 
> > --
> > All Rights Reversed.
> 

-- 



* Re: [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-04-16 13:29   ` Vincent Guittot
@ 2021-04-19 17:13     ` Valentin Schneider
  0 siblings, 0 replies; 21+ messages in thread
From: Valentin Schneider @ 2021-04-19 17:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

On 16/04/21 15:29, Vincent Guittot wrote:
> On Thu, 15 Apr 2021 at 19:58, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>> @@ -9441,8 +9465,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>>                  * average load.
>>                  */
>>                 if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
>> -                   !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
>> -                   nr_running == 1)
>> +                   env->src_grp_type <= group_fully_busy &&
>> +                   !capacity_greater(capacity_of(env->dst_cpu), capacity))
>>                         continue;
>>
>>                 switch (env->migration_type) {
>> @@ -9504,15 +9528,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>>                 case migrate_misfit:
>>                         /*
>>                          * For ASYM_CPUCAPACITY domains with misfit tasks we
>> -                        * simply seek the "biggest" misfit task.
>> +                        * simply seek the "biggest" misfit task we can
>> +                        * accommodate.
>>                          */
>> +                       if (!cpu_capacity_greater(env->dst_cpu, i))
>
> Use the same level of interface as above. This makes the code and the
> condition easier to follow in find_busiest_queue()
>
> capacity_greater(capacity_of(env->dst_cpu), capacity_of(i))
>

OK


* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-16 13:51   ` Vincent Guittot
@ 2021-04-19 17:13     ` Valentin Schneider
  2021-04-20 14:33       ` Vincent Guittot
  0 siblings, 1 reply; 21+ messages in thread
From: Valentin Schneider @ 2021-04-19 17:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

On 16/04/21 15:51, Vincent Guittot wrote:
> On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
>> +
>> +/*
>> + * What does migrating this task do to our capacity-aware scheduling criterion?
>> + *
>> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
>> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
>> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
>> + */
>> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
>> +{
>> +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
>> +		return -1;
>> +
>> +	if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
>> +		if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
>> +			return 0;
>> +		else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
>> +			return 1;
>> +		else
>> +			return -1;
>> +	}
>
> Getting here means the task fits src_cpu's capacity, so why test p against dst_cpu?
>

Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
which it *doesn't* fit.

>> +
>> +	return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
>> +}
>
> I prefer the below, which is easier to read because the same var is used everywhere and you can remove cpu_capacity_greater.
>
> static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> {
>     unsigned long src_capacity, dst_capacity;
>
>     if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
>         return -1;
>
>     src_capacity = capacity_of(env->src_cpu);
>     dst_capacity = capacity_of(env->dst_cpu);
>
>     if (!task_fits_capacity(p, src_capacity)) {
>         if (capacity_greater(dst_capacity, src_capacity))
>             return 0;
>         else if (capacity_greater(src_capacity, dst_capacity))
>             return 1;
>         else
>             return -1;
>     }
>
>     return task_fits_capacity(p, dst_capacity) ? -1 : 1;
> }
>

I'll take it, thanks!

>
>> +
>>  #ifdef CONFIG_NUMA_BALANCING
>>  /*
>>   * Returns 1, if task migration degrades locality
>> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>      if (tsk_cache_hot == -1)
>>              tsk_cache_hot = task_hot(p, env);
>>
>> +	/*
>> +	 * On a (sane) asymmetric CPU capacity system, the increase in compute
>> +	 * capacity should offset any potential performance hit caused by a
>> +	 * migration.
>> +	 */
>> +	if ((env->dst_grp_type == group_has_spare) &&
>
> Shouldn't it be env->src_grp_type == group_misfit_task, to only care about the misfit task case as
> stated in $subject?
>

Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
could give us a better picture. Staring at this some more, this isn't so
true when the group size goes up - there's no guarantees the dst_cpu is the
one that has spare cycles, and the other CPUs might not be able to grant
the capacity uplift dst_cpu can.

As for not using src_grp_type == group_misfit_task, this is pretty much the
same as [1]. CPU-bound (misfit) task + some other task on the same rq
implies group_overloaded classification when balancing at MC level (no SMT,
so one group per CPU).

[1]: http://lore.kernel.org/r/jhjblcuv2mo.mognet@arm.com


* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-19 12:59       ` Phil Auld
@ 2021-04-19 17:17         ` Valentin Schneider
  2021-04-19 20:23           ` Phil Auld
  0 siblings, 1 reply; 21+ messages in thread
From: Valentin Schneider @ 2021-04-19 17:17 UTC (permalink / raw)
  To: Phil Auld
  Cc: Rik van Riel, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen, Qais Yousef,
	Quentin Perret, Pavan Kondeti, Lingutla Chandrasekhar

On 19/04/21 08:59, Phil Auld wrote:
> On Fri, Apr 16, 2021 at 10:43:38AM +0100 Valentin Schneider wrote:
>> On 15/04/21 16:39, Rik van Riel wrote:
>> > On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
>> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
>> >> struct lb_env *env)
>> >>      if (tsk_cache_hot == -1)
>> >>              tsk_cache_hot = task_hot(p, env);
>> >>
>> >> +	/*
>> >> +	 * On a (sane) asymmetric CPU capacity system, the increase in
>> >> compute
>> >> +	 * capacity should offset any potential performance hit caused
>> >> by a
>> >> +	 * migration.
>> >> +	 */
>> >> +	if ((env->dst_grp_type == group_has_spare) &&
>> >> +	    !migrate_degrades_capacity(p, env))
>> >> +		tsk_cache_hot = 0;
>> >
>> > ... I'm starting to wonder if we should not rename the
>> > tsk_cache_hot variable to something else to make this
>> > code more readable. Probably in another patch :)
>> >
>>
>> I'd tend to agree, but naming is hard. "migration_harmful" ?
>
> I thought Rik meant tsk_cache_hot, for which I'd suggest at least
> buying a vowel and putting an 'a' in there :)
>

That's the one I was eyeing: s/tsk_cache_hot/migration_harmful/ or
somesuch. Right now we're feeding it:

o migrate_degrades_locality()
o task_hot()

and I'm adding another one, so that's 2/3 which don't actually care about
cache hotness, but rather "does doing this migration degrade/improve
$criterion?"


* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-19 17:17         ` Valentin Schneider
@ 2021-04-19 20:23           ` Phil Auld
  0 siblings, 0 replies; 21+ messages in thread
From: Phil Auld @ 2021-04-19 20:23 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Rik van Riel, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen, Qais Yousef,
	Quentin Perret, Pavan Kondeti, Lingutla Chandrasekhar

On Mon, Apr 19, 2021 at 06:17:47PM +0100 Valentin Schneider wrote:
> On 19/04/21 08:59, Phil Auld wrote:
> > On Fri, Apr 16, 2021 at 10:43:38AM +0100 Valentin Schneider wrote:
> >> On 15/04/21 16:39, Rik van Riel wrote:
> >> > On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> >> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
> >> >> struct lb_env *env)
> >> >>      if (tsk_cache_hot == -1)
> >> >>              tsk_cache_hot = task_hot(p, env);
> >> >>
> >> >> +	/*
> >> >> +	 * On a (sane) asymmetric CPU capacity system, the increase in
> >> >> compute
> >> >> +	 * capacity should offset any potential performance hit caused
> >> >> by a
> >> >> +	 * migration.
> >> >> +	 */
> >> >> +	if ((env->dst_grp_type == group_has_spare) &&
> >> >> +	    !migrate_degrades_capacity(p, env))
> >> >> +		tsk_cache_hot = 0;
> >> >
> >> > ... I'm starting to wonder if we should not rename the
> >> > tsk_cache_hot variable to something else to make this
> >> > code more readable. Probably in another patch :)
> >> >
> >>
> >> I'd tend to agree, but naming is hard. "migration_harmful" ?
> >
> > I thought Rik meant tsk_cache_hot, for which I'd suggest at least
> > buying a vowel and putting an 'a' in there :)
> >
> 
> That's the one I was eyeing: s/tsk_cache_hot/migration_harmful/ or
> somesuch. Right now we're feeding it:
>

Fair enough, my bad, the migration part immediately drew me to
migrate_degrades_capacity().

> o migrate_degrades_locality()
> o task_hot()
> 
> and I'm adding another one, so that's 2/3 which don't actually care about
> cache hotness, but rather "does doing this migration degrade/improve
> $criterion?"
> 

prefer_no_migrate? 


Cheers,
Phil
-- 



* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-19 17:13     ` Valentin Schneider
@ 2021-04-20 14:33       ` Vincent Guittot
  2021-04-21 10:52         ` Valentin Schneider
  0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2021-04-20 14:33 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

On Mon, 19 Apr 2021 at 19:13, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 16/04/21 15:51, Vincent Guittot wrote:
> > On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
> >> +
> >> +/*
> >> + * What does migrating this task do to our capacity-aware scheduling criterion?
> >> + *
> >> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
> >> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
> >> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
> >> + */
> >> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> >> +{
> >> +    if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> >> +            return -1;
> >> +
> >> +    if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
> >> +            if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> >> +                    return 0;
> >> +            else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
> >> +                    return 1;
> >> +            else
> >> +                    return -1;
> >> +    }
> >
> > Getting here means the task fits src_cpu's capacity, so why test p against dst_cpu?
> >
>
> Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
> which it *doesn't* fit.

OK. I was confused because I thought that this was only to force
migration in the group_misfit_task case, but you tried to extend it to
other cases... I'm not convinced that you succeeded in covering all cases.

Also, I found this function which returns 3 values a bit disturbing.
IIUC you tried to align with migrate_degrades_locality(), but you would have
been better off aligning with task_hot() and returning only 0 or 1; -1 is not used.

>
> >> +
> >> +    return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
> >> +}
> >
> > I prefer the below, which is easier to read because the same var is used everywhere and you can remove cpu_capacity_greater.
> >
> > static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> > {
> >     unsigned long src_capacity, dst_capacity;
> >
> >     if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> >         return -1;
> >
> >     src_capacity = capacity_of(env->src_cpu);
> >     dst_capacity = capacity_of(env->dst_cpu);
> >
> >     if (!task_fits_capacity(p, src_capacity)) {
> >         if (capacity_greater(dst_capacity, src_capacity))
> >             return 0;
> >         else if (capacity_greater(src_capacity, dst_capacity))
> >             return 1;
> >         else
> >             return -1;
> >     }
> >
> >     return task_fits_capacity(p, dst_capacity) ? -1 : 1;
> > }
> >
>
> I'll take it, thanks!
>
> >
> >> +
> >>  #ifdef CONFIG_NUMA_BALANCING
> >>  /*
> >>   * Returns 1, if task migration degrades locality
> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >>      if (tsk_cache_hot == -1)
> >>              tsk_cache_hot = task_hot(p, env);
> >>
> >> +    /*
> >> +     * On a (sane) asymmetric CPU capacity system, the increase in compute
> >> +     * capacity should offset any potential performance hit caused by a
> >> +     * migration.
> >> +     */
> >> +    if ((env->dst_grp_type == group_has_spare) &&
> >
> > Shouldn't it be env->src_grp_type == group_misfit_task, to only care about the misfit task case as
> > stated in $subject?
> >
>
> Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
> could give us a better picture. Staring at this some more, this isn't so
> true when the group size goes up - there's no guarantees the dst_cpu is the
> one that has spare cycles, and the other CPUs might not be able to grant
> the capacity uplift dst_cpu can.

yeah you have to keep checking for env->idle != CPU_NOT_IDLE

>
> As for not using src_grp_type == group_misfit_task, this is pretty much the
> same as [1]. CPU-bound (misfit) task + some other task on the same rq
> implies group_overloaded classification when balancing at MC level (no SMT,
> so one group per CPU).

Is it something that happens often, or just a sporadic/transient state?
I mean, is it really worth the extra complexity, and do you see a
performance improvement?

You should focus instead on fixing the simple case of a group_misfit_task.
The other cases look far more complex, with lots of corner cases.

>
> [1]: http://lore.kernel.org/r/jhjblcuv2mo.mognet@arm.com


* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-20 14:33       ` Vincent Guittot
@ 2021-04-21 10:52         ` Valentin Schneider
  2021-04-22 17:29           ` Dietmar Eggemann
  2021-04-30  6:58           ` Vincent Guittot
  0 siblings, 2 replies; 21+ messages in thread
From: Valentin Schneider @ 2021-04-21 10:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

On 20/04/21 16:33, Vincent Guittot wrote:
> On Mon, 19 Apr 2021 at 19:13, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> On 16/04/21 15:51, Vincent Guittot wrote:
>> > On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
>> >> +
>> >> +/*
>> >> + * What does migrating this task do to our capacity-aware scheduling criterion?
>> >> + *
>> >> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
>> >> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
>> >> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
>> >> + */
>> >> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
>> >> +{
>> >> +    if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
>> >> +            return -1;
>> >> +
>> >> +    if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
>> >> +            if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
>> >> +                    return 0;
>> >> +            else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
>> >> +                    return 1;
>> >> +            else
>> >> +                    return -1;
>> >> +    }
>> >
>> > Getting here means the task fits src_cpu's capacity, so why test p against dst_cpu?
>> >
>>
>> Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
>> which it *doesn't* fit.
>
> OK. I was confused because I thought that this was only to force
> migration in the group_misfit_task case, but you tried to extend it to
> other cases... I'm not convinced that you succeeded in covering all cases.
>
> Also, I found this function which returns 3 values a bit disturbing.
> IIUC you tried to align with migrate_degrades_locality(), but you would have
> been better off aligning with task_hot() and returning only 0 or 1; -1 is not used.
>

Ack, will do.

>> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>> >>      if (tsk_cache_hot == -1)
>> >>              tsk_cache_hot = task_hot(p, env);
>> >>
>> >> +    /*
>> >> +     * On a (sane) asymmetric CPU capacity system, the increase in compute
>> >> +     * capacity should offset any potential performance hit caused by a
>> >> +     * migration.
>> >> +     */
>> >> +    if ((env->dst_grp_type == group_has_spare) &&
>> >
>> > Shouldn't it be env->src_grp_type == group_misfit_task, to only care about the misfit task case as
>> > stated in $subject?
>> >
>>
>> Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
>> could give us a better picture. Staring at this some more, this isn't so
>> true when the group size goes up - there's no guarantees the dst_cpu is the
>> one that has spare cycles, and the other CPUs might not be able to grant
>> the capacity uplift dst_cpu can.
>
> yeah you have to keep checking for env->idle != CPU_NOT_IDLE
>
>>
>> As for not using src_grp_type == group_misfit_task, this is pretty much the
>> same as [1]. CPU-bound (misfit) task + some other task on the same rq
>> implies group_overloaded classification when balancing at MC level (no SMT,
>> so one group per CPU).
>
> Is it something that happens often, or just a sporadic/transient state?
> I mean, is it really worth the extra complexity, and do you see a
> performance improvement?
>

"Unfortunately" yes, this is a relatively common scenario when running "1
big task per CPU" types of workloads. The expected behaviour for big.LITTLE
systems is to upmigrate tasks stuck on the LITTLE CPUs as soon as a big CPU
becomes free, usually via newidle balance (which, since they process work
faster than the LITTLEs, is bound to happen), and an extra task being
enqueued at "the wrong time" can prevent this from happening.

This usually means a misfit task can take a few dozen extra ms than it
should to be migrated - in the tests I run (which are pretty much this 1
hog per CPU workload) this happens about ~20% of the time.

> You should focus instead on fixing the simple case of a group_misfit_task.
> The other cases look far more complex, with lots of corner cases.
>
>>
>> [1]: http://lore.kernel.org/r/jhjblcuv2mo.mognet@arm.com


* Re: [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-04-15 17:58 ` [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
  2021-04-15 18:47   ` Rik van Riel
  2021-04-16 13:29   ` Vincent Guittot
@ 2021-04-22  9:48   ` Dietmar Eggemann
  2021-04-22 19:19     ` Valentin Schneider
  2 siblings, 1 reply; 21+ messages in thread
From: Dietmar Eggemann @ 2021-04-22  9:48 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Qais Yousef, Quentin Perret, Pavan Kondeti, Rik van Riel,
	Lingutla Chandrasekhar

On 15/04/2021 19:58, Valentin Schneider wrote:

[...]

> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> env->dst_cpu would grant it a capacity uplift.
> 
> Note that the aforementioned capacity vs sgc->max_capacity comparison was
> meant to prevent misfit task downmigration: candidate groups classified as
> group_misfit_task but with a higher (max) CPU capacity than the destination CPU
> would be discarded. This change makes it so said group_misfit_task
> classification can't happen anymore, which may cause some undesired
> downmigrations.
> 
> Further tweak find_busiest_queue() to ensure this doesn't happen. 

Maybe you can briefly describe here what the new mechanism is that you
replace the old 'prevent misfit task downmigration' safeguard with.

Also note
> find_busiest_queue() can now iterate over CPUs with a higher capacity than
> the local CPU's, so add a capacity check there.

[...]

> +static inline void update_sg_lb_misfit_stats(struct lb_env *env,
> +					     struct sched_group *group,

Seems to be not used.

> +					     struct sg_lb_stats *sgs,
> +					     int *sg_status,
> +					     int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +
> +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY) ||
> +	    !rq->misfit_task_load)
> +		return;
> +
> +	*sg_status |= SG_OVERLOAD;
> +
> +	/*
> +	 * Don't attempt to maximize load for misfit tasks that can't be
> +	 * granted a CPU capacity uplift.
> +	 */
> +	if (cpu_capacity_greater(env->dst_cpu, cpu)) {
> +		sgs->group_misfit_task_load = max(
> +			sgs->group_misfit_task_load,
> +			rq->misfit_task_load);
> +	}
> +
> +}
> +

[...]

> @@ -9288,6 +9310,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	if (!sds.busiest)
>  		goto out_balanced;
>  
> +	env->src_grp_type = busiest->group_type;

Maybe add a short comment on why you set it here in fbg(): it's only used later
in fbq() for asym CPU capacity sd's, but is set unconditionally.

[...]


* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-21 10:52         ` Valentin Schneider
@ 2021-04-22 17:29           ` Dietmar Eggemann
  2021-04-22 19:19             ` Valentin Schneider
  2021-04-30  6:58           ` Vincent Guittot
  1 sibling, 1 reply; 21+ messages in thread
From: Dietmar Eggemann @ 2021-04-22 17:29 UTC (permalink / raw)
  To: Valentin Schneider, Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Quentin Perret, Pavan Kondeti, Rik van Riel,
	Lingutla Chandrasekhar

On 21/04/2021 12:52, Valentin Schneider wrote:
> On 20/04/21 16:33, Vincent Guittot wrote:
>> On Mon, 19 Apr 2021 at 19:13, Valentin Schneider
>> <valentin.schneider@arm.com> wrote:
>>>
>>> On 16/04/21 15:51, Vincent Guittot wrote:
>>>> On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
>>>>> +
>>>>> +/*
>>>>> + * What does migrating this task do to our capacity-aware scheduling criterion?
>>>>> + *
>>>>> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
>>>>> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
>>>>> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
>>>>> + */
>>>>> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
>>>>> +{
>>>>> +    if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
>>>>> +            return -1;
>>>>> +
>>>>> +    if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
>>>>> +            if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
>>>>> +                    return 0;
>>>>> +            else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
>>>>> +                    return 1;
>>>>> +            else
>>>>> +                    return -1;
>>>>> +    }
>>>>
>>>> Getting here means the task fits src_cpu's capacity, so why test p against dst_cpu?
>>>>
>>>
>>> Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
>>> which it *doesn't* fit.
>>
>> OK. I was confused because I thought that this was only to force
>> migration in the group_misfit_task case, but you tried to extend it to
>> other cases... I'm not convinced that you succeeded in covering all cases.
>>
>> Also, I found this function which returns 3 values a bit disturbing.
>> IIUC you tried to align with migrate_degrades_locality(), but you would have
>> been better off aligning with task_hot() and returning only 0 or 1; -1 is not used.
>>
> 
> Ack, will do.
> 
>>>>> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>>>      if (tsk_cache_hot == -1)
>>>>>              tsk_cache_hot = task_hot(p, env);
>>>>>
>>>>> +    /*
>>>>> +     * On a (sane) asymmetric CPU capacity system, the increase in compute
>>>>> +     * capacity should offset any potential performance hit caused by a
>>>>> +     * migration.
>>>>> +     */
>>>>> +    if ((env->dst_grp_type == group_has_spare) &&
>>>>
>>>> Shouldn't it be env->src_grp_type == group_misfit_task, to only care about the misfit task case as
>>>> stated in $subject?
>>>>
>>>
>>> Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
>>> could give us a better picture. Staring at this some more, this isn't so
>>> true when the group size goes up - there's no guarantees the dst_cpu is the
>>> one that has spare cycles, and the other CPUs might not be able to grant
>>> the capacity uplift dst_cpu can.
>>
>> yeah you have to keep checking for env->idle != CPU_NOT_IDLE
>>
>>>
>>> As for not using src_grp_type == group_misfit_task, this is pretty much the
>>> same as [1]. CPU-bound (misfit) task + some other task on the same rq
>>> implies group_overloaded classification when balancing at MC level (no SMT,
>>> so one group per CPU).
>>
>> Is it something that happens often, or just a sporadic/transient state?
>> I mean, is it really worth the extra complexity, and do you see a
>> performance improvement?
>>
> 
> "Unfortunately" yes, this is a relatively common scenario when running "1
> big task per CPU" types of workloads. The expected behaviour for big.LITTLE
> systems is to upmigrate tasks stuck on the LITTLE CPUs as soon as a big CPU
> becomes free, usually via newidle balance (which, since they process work
> faster than the LITTLEs, is bound to happen), and an extra task being
> enqueued at "the wrong time" can prevent this from happening.
> 
> This usually means a misfit task can take a few dozen extra ms than it
> should to be migrated - in the tests I run (which are pretty much this 1
> hog per CPU workload) this happens about ~20% of the time.
> 
>> You should focus instead on fixing the simple case of a group_misfit_task.
>> The other cases look far more complex, with lots of corner cases.
>>
>>>
>>> [1]: http://lore.kernel.org/r/jhjblcuv2mo.mognet@arm.com


Just to make sure I can follow the conversation ...

In case you:

(1) return 1 instead of -1

(2) keep the `env->idle != CPU_NOT_IDLE` check

(3) and remove the `dst_grp_type == group_has_spare` check

you are pretty much back to what you had in [PATCH v3 7/7] directly in
task_hot() except:

(4) the 'if (p fits on src_cpu && p doesn't fit on dst_cpu) => tsk_cache_hot' check?


* Re: [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances
  2021-04-22  9:48   ` Dietmar Eggemann
@ 2021-04-22 19:19     ` Valentin Schneider
  0 siblings, 0 replies; 21+ messages in thread
From: Valentin Schneider @ 2021-04-22 19:19 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Qais Yousef, Quentin Perret, Pavan Kondeti, Rik van Riel,
	Lingutla Chandrasekhar

On 22/04/21 11:48, Dietmar Eggemann wrote:
> On 15/04/2021 19:58, Valentin Schneider wrote:
>
> [...]
>
>> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
>> env->dst_cpu would grant it a capacity uplift.
>>
>> Note that the aforementioned capacity vs sgc->max_capacity comparison was
>> meant to prevent misfit task downmigration: candidate groups classified as
>> group_misfit_task but with a higher (max) CPU capacity than the destination CPU
>> would be discarded. This change makes it so said group_misfit_task
>> classification can't happen anymore, which may cause some undesired
>> downmigrations.
>>
>> Further tweak find_busiest_queue() to ensure this doesn't happen.
>
> Maybe you can briefly describe here what the new mechanism is that you
> replace the old 'prevent misfit task downmigration' one with.
>

Will do.

>> Also note
>> find_busiest_queue() can now iterate over CPUs with a higher capacity than
>> the local CPU's, so add a capacity check there.
>
> [...]
>
>> +static inline void update_sg_lb_misfit_stats(struct lb_env *env,
>> +					     struct sched_group *group,
>
> Seems to be not used.
>

Right, that's an update_sg_lb_stats() copy/paste fail :-)

>
>> @@ -9288,6 +9310,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>      if (!sds.busiest)
>>              goto out_balanced;
>>
>> +	env->src_grp_type = busiest->group_type;
>
> Maybe add a short comment on why you set it here in fbg(). It's only
> used later in fbq() for asym. CPU capacity sds, but is set unconditionally.
>

Seeing as almost every other line there has an accompanying comment, I
think I'll do that.

> [...]


* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-22 17:29           ` Dietmar Eggemann
@ 2021-04-22 19:19             ` Valentin Schneider
  0 siblings, 0 replies; 21+ messages in thread
From: Valentin Schneider @ 2021-04-22 19:19 UTC (permalink / raw)
  To: Dietmar Eggemann, Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Morten Rasmussen,
	Qais Yousef, Quentin Perret, Pavan Kondeti, Rik van Riel,
	Lingutla Chandrasekhar

On 22/04/21 19:29, Dietmar Eggemann wrote:
> Just to make sure I can follow the conversation ...
>
> In case you:
>
> (1) return 1 instead of -1
>
> (2) keep the `env->idle != CPU_NOT_IDLE` check
>
> (3) and remove the `dst_grp_type == group_has_spare` check
>
> you are pretty much back to what you had in [PATCH v3 7/7] directly in
> task_hot() except:
>
> (4) the 'if (p fits on src_cpu && p !fits dst_cpu) => tsk_cache_hot' check?

Pretty much, I'll now filter the return value of task_hot() rather than
affect it directly.


* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-21 10:52         ` Valentin Schneider
  2021-04-22 17:29           ` Dietmar Eggemann
@ 2021-04-30  6:58           ` Vincent Guittot
  2021-05-07 13:46             ` Valentin Schneider
  1 sibling, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2021-04-30  6:58 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

On Wed, 21 Apr 2021 at 12:52, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> On 20/04/21 16:33, Vincent Guittot wrote:
> > On Mon, 19 Apr 2021 at 19:13, Valentin Schneider
> > <valentin.schneider@arm.com> wrote:
> >>
> >> On 16/04/21 15:51, Vincent Guittot wrote:
> >> > On Thursday 15 April 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
> >> >> +
> >> >> +/*
> >> >> + * What does migrating this task do to our capacity-aware scheduling criterion?
> >> >> + *
> >> >> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
> >> >> + * Returns 0, if the task needs the extra capacity provided by the dst CPU
> >> >> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
> >> >> + */
> >> >> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> >> >> +{
> >> >> +    if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> >> >> +            return -1;
> >> >> +
> >> >> +    if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
> >> >> +            if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> >> >> +                    return 0;
> >> >> +            else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
> >> >> +                    return 1;
> >> >> +            else
> >> >> +                    return -1;
> >> >> +    }
> >> >
> >> > Being there means that the task fits src_cpu's capacity, so why test p against dst_cpu?
> >> >
> >>
> >> Because if p fits on src_cpu, we don't want to move it to a dst_cpu on
> >> which it *doesn't* fit.
> >
> > OK. I was confused because I thought that this was only to force
> > migration in the group_misfit_task case, but you tried to extend it to
> > other cases... I'm not convinced you succeeded in covering them all.
> >
> > Also, I find this function returning 3 values a bit disturbing.
> > IIUC you tried to align with migrate_degrades_locality(), but you would
> > have been better off aligning with task_hot() and returning only 0 or 1;
> > -1 is not used.
> >
>
> Ack, will do.
>
> >> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >> >>      if (tsk_cache_hot == -1)
> >> >>              tsk_cache_hot = task_hot(p, env);
> >> >>
> >> >> +    /*
> >> >> +     * On a (sane) asymmetric CPU capacity system, the increase in compute
> >> >> +     * capacity should offset any potential performance hit caused by a
> >> >> +     * migration.
> >> >> +     */
> >> >> +    if ((env->dst_grp_type == group_has_spare) &&
> >> >
> >> > Shouldn't it be env->src_grp_type == group_misfit_task to only care of misfit task case as
> >> > stated in $subject
> >> >
> >>
> >> Previously this was env->idle != CPU_NOT_IDLE, but I figured dst_grp_type
> >> could give us a better picture. Staring at this some more, this isn't so
> >> true when the group size goes up - there's no guarantee the dst_cpu is the
> >> one that has spare cycles, and the other CPUs might not be able to grant
> >> the capacity uplift dst_cpu can.
> >
> > yeah you have to keep checking for env->idle != CPU_NOT_IDLE
> >
> >>
> >> As for not using src_grp_type == group_misfit_task, this is pretty much the
> >> same as [1]. CPU-bound (misfit) task + some other task on the same rq
> >> implies group_overloaded classification when balancing at MC level (no SMT,
> >> so one group per CPU).
> >
> > Is it something that happens often, or just a sporadic/transient state?
> > I mean, is it really worth the extra complexity, and do you see a
> > performance improvement?
> >
>
> "Unfortunately" yes, this is a relatively common scenario when running "1
> big task per CPU" types of workloads. The expected behaviour for big.LITTLE
> systems is to upmigrate tasks stuck on the LITTLE CPUs as soon as a big CPU
> becomes free, usually via newidle balance (which, since they process work
> faster than the LITTLEs, is bound to happen), and an extra task being
> enqueued at "the wrong time" can prevent this from happening.
>
> This usually means a misfit task can take a few dozen ms longer than it

A few dozen ms is quite long. With a big core idle, it should retry
every 8ms on a quad x quad system, and I suspect the next try will be
during the next tick. It would be good to understand why it has to
wait so long.

> should to be migrated - in the tests I run (which are pretty much this 1
> hog per CPU workload) this happens about ~20% of the time.
>
> > You would do better to focus on fixing the simple group_misfit_task
> > case. The other cases look far more complex, with lots of corner cases.
> >
> >>
> >> [1]: http://lore.kernel.org/r/jhjblcuv2mo.mognet@arm.com


* Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks
  2021-04-30  6:58           ` Vincent Guittot
@ 2021-05-07 13:46             ` Valentin Schneider
  0 siblings, 0 replies; 21+ messages in thread
From: Valentin Schneider @ 2021-05-07 13:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar


Hi Vincent, apologies for the belated reply

On 30/04/21 08:58, Vincent Guittot wrote:
> On Wed, 21 Apr 2021 at 12:52, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>> On 20/04/21 16:33, Vincent Guittot wrote:
>> > Is it something that happens often, or just a sporadic/transient state?
>> > I mean, is it really worth the extra complexity, and do you see a
>> > performance improvement?
>> >
>>
>> "Unfortunately" yes, this is a relatively common scenario when running "1
>> big task per CPU" types of workloads. The expected behaviour for big.LITTLE
>> systems is to upmigrate tasks stuck on the LITTLE CPUs as soon as a big CPU
>> becomes free, usually via newidle balance (which, since they process work
>> faster than the LITTLEs, is bound to happen), and an extra task being
>> enqueued at "the wrong time" can prevent this from happening.
>>
>> This usually means a misfit task can take a few dozen ms longer than it
>
> A few dozen ms is quite long. With a big core idle, it should retry
> every 8ms on a quad x quad system, and I suspect the next try will be
> during the next tick. It would be good to understand why it has to
> wait so long.
>

True, IIRC this was mostly due to a compound effect of the different issues
I've described in that thread (and the previous one). Now that

  9bcb959d05ee ("sched/fair: Ignore percpu threads for imbalance pulls")

is in, I'll re-run some tests against upstream and see how we fare.
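
As a side note on the cadence Vincent mentions: a domain's periodic balance
interval scales with its weight, so on a quad-LITTLE + quad-big system the
top domain's minimum interval works out to 8ms. A sketch, assuming the
conventional min_interval = sd_weight ms, max_interval = 2 * sd_weight ms,
and an illustrative busy multiplier:

```python
def balance_intervals(sd_weight, busy_factor=16):
    """Per-domain balance intervals in ms, using the convention that
    min_interval = sd_weight and max_interval = 2 * sd_weight; a busy
    CPU stretches its interval by busy_factor."""
    min_interval = sd_weight
    max_interval = 2 * sd_weight
    return min_interval, max_interval, min_interval * busy_factor

# quad-LITTLE + quad-big: MC domains span 4 CPUs, the top domain spans 8
mc = balance_intervals(4)   # (4, 8, 64)
die = balance_intervals(8)  # (8, 16, 128)
```

With the top domain's minimum interval at 8ms, an idle big CPU gets a pull
attempt roughly every 8ms, which is why tens of ms of delay points at
something beyond interval pacing.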



Thread overview: 21+ messages
2021-04-15 17:58 [PATCH 0/2] sched/fair: (The return of) misfit task load-balance tweaks Valentin Schneider
2021-04-15 17:58 ` [PATCH 1/2] sched/fair: Filter out locally-unsolvable misfit imbalances Valentin Schneider
2021-04-15 18:47   ` Rik van Riel
2021-04-16 13:29   ` Vincent Guittot
2021-04-19 17:13     ` Valentin Schneider
2021-04-22  9:48   ` Dietmar Eggemann
2021-04-22 19:19     ` Valentin Schneider
2021-04-15 17:58 ` [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks Valentin Schneider
2021-04-15 20:39   ` Rik van Riel
2021-04-16  9:43     ` Valentin Schneider
2021-04-19 12:59       ` Phil Auld
2021-04-19 17:17         ` Valentin Schneider
2021-04-19 20:23           ` Phil Auld
2021-04-16 13:51   ` Vincent Guittot
2021-04-19 17:13     ` Valentin Schneider
2021-04-20 14:33       ` Vincent Guittot
2021-04-21 10:52         ` Valentin Schneider
2021-04-22 17:29           ` Dietmar Eggemann
2021-04-22 19:19             ` Valentin Schneider
2021-04-30  6:58           ` Vincent Guittot
2021-05-07 13:46             ` Valentin Schneider
