* [PATCH v4 0/3] sched/fair: load-balance vs capacity margins
@ 2021-04-01 19:30 Valentin Schneider
  2021-04-01 19:30 ` [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-01 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

Hi folks,

I've split out the extra misfit patches from v3, as I'm still reworking them
following Vincent's comments. In the meantime, I believe the first few patches
of the series can still be considered standalone.

o Patch 1 prevents pcpu kworkers from causing group_imbalanced
o Patch 2 is an independent active balance cleanup
o Patch 3 introduces yet another margin for capacity-to-capacity
  comparisons

The "important" one is patch 3, as it solves misfit migration issues on newer
platforms.
  
This is based on top of today's tip/sched/core at:

  0a2b65c03e9b ("sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()")

Testing
=======

I ran my usual [1] misfit tests on
o TC2
o Juno
o HiKey960
o Dragonboard845C
o RB5

RB5 has a similar topology to the Pixel 4 and highlights the problem of having
two different CPU capacity values above 819 (in this case 871 and 1024):
without these patches, CPU hogs (i.e. misfit tasks) running on the "medium"
CPUs will never be upmigrated to a "big" CPU via misfit balancing.
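
To make the margins concrete, here's a quick standalone userspace sketch
(illustration only, not part of the series); the two margin macros are copied
from kernel/sched/fair.c as it stands with patch 3 applied:

  #include <assert.h>

  /* Copied from kernel/sched/fair.c; the second one is introduced in patch 3 */
  #define fits_capacity(cap, max)      ((cap) * 1280 < (max) * 1024)   /* ~20% */
  #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078) /* ~5% */

  int main(void)
  {
          /* "medium" capacity 871 vs "big" capacity 1024 (RB5 / Pixel 4) */
          assert(!fits_capacity(871, 1024));   /* 871 * 1280 = 1114880 >= 1048576 */
          assert(capacity_greater(1024, 871)); /* 1048576 > 871 * 1078 = 938938 */
          return 0;
  }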


The 0day bot reported [3] that the first patch causes a ~14% regression on its
stress-ng.vm-segv testcase. I ran that testcase on:

o Ampere eMAG (arm64, 32 cores)
o 2-socket Xeon E5-2690 (x86, 40 cores)

and found at worst a -0.3% regression and at best a 2% improvement - I'm
getting nowhere near -14%.
  
Revisions
=========

v3 -> v4
--------
o Tore out the extra misfit patches

o Rewrote patch 1 changelog (Dietmar)
o Reused LBF_ACTIVE_BALANCE to ditch LBF_DST_PINNED active balance logic
  (Dietmar)
o Collected Tested-by (Lingutla)  

o Squashed capacity_greater() introduction and use (Vincent)
o Removed sched_asym_cpucapacity() static key proliferation (Vincent)

v2 -> v3
--------

o Rebased on top of latest tip/sched/core
o Added test results vs stress-ng.vm-segv

v1 -> v2
--------

o Collected Reviewed-by
o Minor comment and code cleanups

o Consolidated static key vs SD flag explanation (Dietmar)

  Note to Vincent: I didn't measure the impact of adding said static key to
  load_balance(); I do, however, believe it is low-hanging fruit. The
  wrapper keeps things neat and tidy, and is also helpful for documenting
  the intricacies of the static key status vs the presence of the SD flag
  in a CPU's sched_domain hierarchy.
  
o Removed v1 patch 4 - root_domain.max_cpu_capacity is absolutely not what
  I had convinced myself it was.
o Squashed capacity margin usage with removal of
  group_smaller_{min, max}_capacity() (Vincent)   
o Replaced v1 patch 7 with Lingutla's can_migrate_task() patch [2]
o Rewrote task_hot() modification changelog

Links
=====

[1]: https://lisa-linux-integrated-system-analysis.readthedocs.io/en/master/kernel_tests.html#lisa.tests.scheduler.misfit.StaggeredFinishes
[2]: http://lore.kernel.org/r/20210217120854.1280-1-clingutla@codeaurora.org
[3]: http://lore.kernel.org/r/20210223023004.GB25487@xsang-OptiPlex-9020

Cheers,
Valentin

Lingutla Chandrasekhar (1):
  sched/fair: Ignore percpu threads for imbalance pulls

Valentin Schneider (2):
  sched/fair: Clean up active balance nr_balance_failed trickery
  sched/fair: Introduce a CPU capacity comparison helper

 kernel/sched/fair.c | 68 +++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 39 deletions(-)

--
2.25.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls
  2021-04-01 19:30 [PATCH v4 0/3] sched/fair: load-balance vs capacity margins Valentin Schneider
@ 2021-04-01 19:30 ` Valentin Schneider
  2021-04-02 12:52   ` Vincent Guittot
  2021-04-06 15:35   ` Dietmar Eggemann
  2021-04-01 19:30 ` [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
  2021-04-01 19:30 ` [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
  2 siblings, 2 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-01 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen, Qais Yousef,
	Quentin Perret, Pavan Kondeti, Rik van Riel

From: Lingutla Chandrasekhar <clingutla@codeaurora.org>

During load balance, LBF_SOME_PINNED will bet set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d73bdbb2d40..04d5e14fa261 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7567,6 +7567,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
 
+	/* Disregard pcpu kthreads; they are where they need to be. */
+	if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+		return 0;
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-04-01 19:30 [PATCH v4 0/3] sched/fair: load-balance vs capacity margins Valentin Schneider
  2021-04-01 19:30 ` [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
@ 2021-04-01 19:30 ` Valentin Schneider
  2021-04-02 12:57   ` Vincent Guittot
  2021-04-06 15:36   ` Dietmar Eggemann
  2021-04-01 19:30 ` [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
  2 siblings, 2 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-01 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().

This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.

Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose. Cleanup the LBF_DST_PINNED active balance special case.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04d5e14fa261..d8077f82a380 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7422,6 +7422,7 @@ enum migration_type {
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
 #define LBF_SOME_PINNED	0x08
+#define LBF_ACTIVE_LB	0x10
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -7583,10 +7584,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		 * our sched_group. We may want to revisit it if we couldn't
 		 * meet load balance goals by pulling other tasks on src_cpu.
 		 *
-		 * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
-		 * already computed one in current iteration.
+		 * Avoid computing new_dst_cpu
+		 * - for NEWLY_IDLE
+		 * - if we have already computed one in current iteration
+		 * - if it's an active balance
 		 */
-		if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
+		if (env->idle == CPU_NEWLY_IDLE ||
+		    env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))
 			return 0;
 
 		/* Prevent to re-select dst_cpu via env's CPUs: */
@@ -7611,10 +7615,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * Aggressive migration if:
-	 * 1) destination numa is preferred
-	 * 2) task is cache cold, or
-	 * 3) too many balance attempts have failed.
+	 * 1) active balance
+	 * 2) destination numa is preferred
+	 * 3) task is cache cold, or
+	 * 4) too many balance attempts have failed.
 	 */
+	if (env->flags & LBF_ACTIVE_LB)
+		return 1;
+
 	tsk_cache_hot = migrate_degrades_locality(p, env);
 	if (tsk_cache_hot == -1)
 		tsk_cache_hot = task_hot(p, env);
@@ -9805,9 +9813,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 					active_load_balance_cpu_stop, busiest,
 					&busiest->active_balance_work);
 			}
-
-			/* We've kicked active balancing, force task migration. */
-			sd->nr_balance_failed = sd->cache_nice_tries+1;
 		}
 	} else {
 		sd->nr_balance_failed = 0;
@@ -9957,13 +9962,7 @@ static int active_load_balance_cpu_stop(void *data)
 			.src_cpu	= busiest_rq->cpu,
 			.src_rq		= busiest_rq,
 			.idle		= CPU_IDLE,
-			/*
-			 * can_migrate_task() doesn't need to compute new_dst_cpu
-			 * for active balancing. Since we have CPU_IDLE, but no
-			 * @dst_grpmask we need to make that test go away with lying
-			 * about DST_PINNED.
-			 */
-			.flags		= LBF_DST_PINNED,
+			.flags		= LBF_ACTIVE_LB,
 		};
 
 		schedstat_inc(sd->alb_count);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper
  2021-04-01 19:30 [PATCH v4 0/3] sched/fair: load-balance vs capacity margins Valentin Schneider
  2021-04-01 19:30 ` [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
  2021-04-01 19:30 ` [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
@ 2021-04-01 19:30 ` Valentin Schneider
  2021-04-02 13:04   ` Vincent Guittot
  2021-04-06 15:37   ` Dietmar Eggemann
  2 siblings, 2 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-01 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Qais Yousef, Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen,
	Quentin Perret, Pavan Kondeti, Rik van Riel

During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass

  group_smaller_max_cpu_capacity(<candidate group>, <local group>);

which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.

Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.
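
For illustration only (a userspace sketch, not part of this patch), the 819
cutoff falls straight out of the macro: 1024 * 1024 / 1280 = 819.2, so 819 is
the largest capacity that still fits a 1024-capacity CPU:

  #include <stdio.h>

  #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

  int main(void)
  {
          unsigned int x = 1024;

          while (!fits_capacity(x, 1024))
                  x--;
          printf("largest capacity fitting 1024: %u\n", x); /* prints 819 */
          return 0;
  }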

One may argue the 20% margin of fits_capacity() is excessive with the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.

While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 33 ++++++++++-----------------------
 1 file changed, 10 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8077f82a380..c9c5c2697998 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  */
 #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
 
+/*
+ * The margin used when comparing CPU capacities.
+ * is 'cap1' noticeably greater than 'cap2'
+ *
+ * (default: ~5%)
+ */
+#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -8364,26 +8371,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 	return false;
 }
 
-/*
- * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity than sched_group ref.
- */
-static inline bool
-group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
-	return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
-}
-
-/*
- * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity_orig than sched_group ref.
- */
-static inline bool
-group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
-	return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
-}
-
 static inline enum
 group_type group_classify(unsigned int imbalance_pct,
 			  struct sched_group *group,
@@ -8539,7 +8526,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * internally or be covered by avg_load imbalance (eventually).
 	 */
 	if (sgs->group_type == group_misfit_task &&
-	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
@@ -8623,7 +8610,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 */
 	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
 	    (sgs->group_type <= group_fully_busy) &&
-	    (group_smaller_min_cpu_capacity(sds->local, sg)))
+	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
 		return false;
 
 	return true;
@@ -9423,7 +9410,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * average load.
 		 */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    capacity_of(env->dst_cpu) < capacity &&
+		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
 		    nr_running == 1)
 			continue;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls
  2021-04-01 19:30 ` [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
@ 2021-04-02 12:52   ` Vincent Guittot
  2021-04-06 15:35   ` Dietmar Eggemann
  1 sibling, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2021-04-02 12:52 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Lingutla Chandrasekhar, Peter Zijlstra,
	Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Qais Yousef,
	Quentin Perret, Pavan Kondeti, Rik van Riel

On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
>
> During load balance, LBF_SOME_PINNED will bet set if any candidate task
> cannot be detached due to CPU affinity constraints. This can result in
> setting env->sd->parent->sgc->group_imbalance, which can lead to a group
> being classified as group_imbalanced (rather than any of the other, lower
> group_type) when balancing at a higher level.
>
> In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
> set due to per-CPU kthreads being the only other runnable tasks on any
> given rq. This results in changing the group classification during
> load-balance at higher levels when in reality there is nothing that can be
> done for this affinity constraint: per-CPU kthreads, as the name implies,
> don't get to move around (modulo hotplug shenanigans).
>
> It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
> with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
> an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
> we can leverage here to not set LBF_SOME_PINNED.
>
> Note that the aforementioned classification to group_imbalance (when
> nothing can be done) is especially problematic on big.LITTLE systems, which
> have a topology the likes of:
>
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
>        L  L  B  B
>
>   arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
>
> Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
> level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
> the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
> CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
> can significantly delay the upmigration of said misfit tasks. Systems
> relying on ASYM_PACKING are likely to face similar issues.
>
> Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
> [Reword changelog]
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6d73bdbb2d40..04d5e14fa261 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7567,6 +7567,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>         if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>                 return 0;
>
> +       /* Disregard pcpu kthreads; they are where they need to be. */
> +       if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
> +               return 0;
> +
>         if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>                 int cpu;
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-04-01 19:30 ` [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
@ 2021-04-02 12:57   ` Vincent Guittot
  2021-04-06 15:36   ` Dietmar Eggemann
  1 sibling, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2021-04-02 12:57 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
	Rik van Riel, Lingutla Chandrasekhar

On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
>
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
>
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose. Cleanup the LBF_DST_PINNED active balance special case.
>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 31 +++++++++++++++----------------
>  1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04d5e14fa261..d8077f82a380 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7422,6 +7422,7 @@ enum migration_type {
>  #define LBF_NEED_BREAK 0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED        0x08
> +#define LBF_ACTIVE_LB  0x10
>
>  struct lb_env {
>         struct sched_domain     *sd;
> @@ -7583,10 +7584,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>                  * our sched_group. We may want to revisit it if we couldn't
>                  * meet load balance goals by pulling other tasks on src_cpu.
>                  *
> -                * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
> -                * already computed one in current iteration.
> +                * Avoid computing new_dst_cpu
> +                * - for NEWLY_IDLE
> +                * - if we have already computed one in current iteration
> +                * - if it's an active balance
>                  */
> -               if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
> +               if (env->idle == CPU_NEWLY_IDLE ||
> +                   env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))
>                         return 0;
>
>                 /* Prevent to re-select dst_cpu via env's CPUs: */
> @@ -7611,10 +7615,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>
>         /*
>          * Aggressive migration if:
> -        * 1) destination numa is preferred
> -        * 2) task is cache cold, or
> -        * 3) too many balance attempts have failed.
> +        * 1) active balance
> +        * 2) destination numa is preferred
> +        * 3) task is cache cold, or
> +        * 4) too many balance attempts have failed.
>          */
> +       if (env->flags & LBF_ACTIVE_LB)
> +               return 1;
> +
>         tsk_cache_hot = migrate_degrades_locality(p, env);
>         if (tsk_cache_hot == -1)
>                 tsk_cache_hot = task_hot(p, env);
> @@ -9805,9 +9813,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                                         active_load_balance_cpu_stop, busiest,
>                                         &busiest->active_balance_work);
>                         }
> -
> -                       /* We've kicked active balancing, force task migration. */
> -                       sd->nr_balance_failed = sd->cache_nice_tries+1;
>                 }
>         } else {
>                 sd->nr_balance_failed = 0;
> @@ -9957,13 +9962,7 @@ static int active_load_balance_cpu_stop(void *data)
>                         .src_cpu        = busiest_rq->cpu,
>                         .src_rq         = busiest_rq,
>                         .idle           = CPU_IDLE,
> -                       /*
> -                        * can_migrate_task() doesn't need to compute new_dst_cpu
> -                        * for active balancing. Since we have CPU_IDLE, but no
> -                        * @dst_grpmask we need to make that test go away with lying
> -                        * about DST_PINNED.
> -                        */
> -                       .flags          = LBF_DST_PINNED,
> +                       .flags          = LBF_ACTIVE_LB,
>                 };
>
>                 schedstat_inc(sd->alb_count);
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper
  2021-04-01 19:30 ` [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
@ 2021-04-02 13:04   ` Vincent Guittot
  2021-04-06 15:37   ` Dietmar Eggemann
  1 sibling, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2021-04-02 13:04 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Qais Yousef, Lingutla Chandrasekhar,
	Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Morten Rasmussen,
	Quentin Perret, Pavan Kondeti, Rik van Riel

On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive with the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> While at it, replace group_smaller_{min, max}_cpu_capacity() with
> comparisons of the source group's min/max capacity and the destination
> CPU's capacity.
>
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 33 ++++++++++-----------------------
>  1 file changed, 10 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d8077f82a380..c9c5c2697998 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   */
>  #define fits_capacity(cap, max)        ((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap1' noticeably greater than 'cap2'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
>  #endif
>
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8364,26 +8371,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>         return false;
>  }
>
> -/*
> - * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity than sched_group ref.
> - */
> -static inline bool
> -group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> -       return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> -}
> -
> -/*
> - * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity_orig than sched_group ref.
> - */
> -static inline bool
> -group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> -       return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> -}
> -
>  static inline enum
>  group_type group_classify(unsigned int imbalance_pct,
>                           struct sched_group *group,
> @@ -8539,7 +8526,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          * internally or be covered by avg_load imbalance (eventually).
>          */
>         if (sgs->group_type == group_misfit_task &&
> -           (!group_smaller_max_cpu_capacity(sg, sds->local) ||
> +           (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
>              sds->local_stat.group_type != group_has_spare))
>                 return false;
>
> @@ -8623,7 +8610,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          */
>         if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
>             (sgs->group_type <= group_fully_busy) &&
> -           (group_smaller_min_cpu_capacity(sds->local, sg)))
> +           (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
>                 return false;
>
>         return true;
> @@ -9423,7 +9410,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>                  * average load.
>                  */
>                 if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> -                   capacity_of(env->dst_cpu) < capacity &&
> +                   !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
>                     nr_running == 1)
>                         continue;
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls
  2021-04-01 19:30 ` [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
  2021-04-02 12:52   ` Vincent Guittot
@ 2021-04-06 15:35   ` Dietmar Eggemann
  2021-04-06 15:55     ` Valentin Schneider
  1 sibling, 1 reply; 12+ messages in thread
From: Dietmar Eggemann @ 2021-04-06 15:35 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Qais Yousef, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 01/04/2021 21:30, Valentin Schneider wrote:
> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
> 
> During load balance, LBF_SOME_PINNED will bet set if any candidate task

nitpick; s/bet/be ?

[...]

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery
  2021-04-01 19:30 ` [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
  2021-04-02 12:57   ` Vincent Guittot
@ 2021-04-06 15:36   ` Dietmar Eggemann
  1 sibling, 0 replies; 12+ messages in thread
From: Dietmar Eggemann @ 2021-04-06 15:36 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
	Qais Yousef, Quentin Perret, Pavan Kondeti, Rik van Riel,
	Lingutla Chandrasekhar

On 01/04/2021 21:30, Valentin Schneider wrote:
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
> 
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
> 
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose. Cleanup the LBF_DST_PINNED active balance special case.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

[...]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper
  2021-04-01 19:30 ` [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
  2021-04-02 13:04   ` Vincent Guittot
@ 2021-04-06 15:37   ` Dietmar Eggemann
  2021-04-06 15:59     ` Valentin Schneider
  1 sibling, 1 reply; 12+ messages in thread
From: Dietmar Eggemann @ 2021-04-06 15:37 UTC (permalink / raw)
  To: Valentin Schneider, linux-kernel
  Cc: Qais Yousef, Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On 01/04/2021 21:30, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
> 
>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
> 
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
> 
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
> 
> One may argue the 20% margin of fits_capacity() is excessive with the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
> 
> While at it, replace group_smaller_{min, max}_cpu_capacity() with
> comparisons of the source group's min/max capacity and the destination
> CPU's capacity.

IMHO, you haven't mentioned why you replace the local sched group with the dst
CPU. I can see that only the capacity of the dst CPU really makes sense
here. Might be worth mentioning in the patch header why. There is some
of it in v3 6/7, but that's a different change.

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

[...]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls
  2021-04-06 15:35   ` Dietmar Eggemann
@ 2021-04-06 15:55     ` Valentin Schneider
  0 siblings, 0 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-06 15:55 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Qais Yousef, Quentin Perret,
	Pavan Kondeti, Rik van Riel

On 06/04/21 17:35, Dietmar Eggemann wrote:
> On 01/04/2021 21:30, Valentin Schneider wrote:
>> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
>> 
>> During load balance, LBF_SOME_PINNED will bet set if any candidate task
>
> nitpick; s/bet/be ?
>

Yes indeed...

> [...]
>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper
  2021-04-06 15:37   ` Dietmar Eggemann
@ 2021-04-06 15:59     ` Valentin Schneider
  0 siblings, 0 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-06 15:59 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: Qais Yousef, Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
	Vincent Guittot, Morten Rasmussen, Quentin Perret, Pavan Kondeti,
	Rik van Riel

On 06/04/21 17:37, Dietmar Eggemann wrote:
> On 01/04/2021 21:30, Valentin Schneider wrote:
>> While at it, replace group_smaller_{min, max}_cpu_capacity() with
>> comparisons of the source group's min/max capacity and the destination
>> CPU's capacity.
>
> IMHO, you haven't mentioned why you replace the local sched group with the dst
> CPU. I can see that only the capacity of the dst CPU really makes sense
> here. Might be worth mentioning in the patch header why. There is some
> of it in v3 6/7, but that's a different change.
>

Right, some of it got lost in the shuffling. This was a separate change in
v1 (4/8). I can reuse that changelog into:

"""
Also note that comparing the capacity extrema of the local and source sched_groups
doesn't make much sense when, at the end of the day, the imbalance will be
pulled by a known env->dst_cpu, whose capacity can be anywhere within the
local group's capacity extrema.

While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.
"""

> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> [...]

^ permalink raw reply	[flat|nested] 12+ messages in thread

