* [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls
2021-04-01 19:30 [PATCH v4 0/3] sched/fair: load-balance vs capacity margins Valentin Schneider
@ 2021-04-01 19:30 ` Valentin Schneider
2021-04-02 12:52 ` Vincent Guittot
2021-04-06 15:35 ` Dietmar Eggemann
2021-04-01 19:30 ` [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
2021-04-01 19:30 ` [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
2 siblings, 2 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-01 19:30 UTC (permalink / raw)
To: linux-kernel
Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Morten Rasmussen, Qais Yousef,
Quentin Perret, Pavan Kondeti, Rik van Riel
From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
During load balance, LBF_SOME_PINNED will bet set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.
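Roughly, the propagation path is (simplified sketch, not verbatim fair.c):

  /* can_migrate_task(): a candidate task is pinned away from dst_cpu */
  env->flags |= LBF_SOME_PINNED;

  /* load_balance(): the pull fell short, blame affinity at the parent level */
  if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
          *group_imbalance = 1;   /* i.e. the parent domain's sgc->imbalance */

  /* group_classify() for that group then returns group_imbalanced */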
In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).
It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.
Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:
  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B
arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d73bdbb2d40..04d5e14fa261 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7567,6 +7567,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
+ /* Disregard pcpu kthreads; they are where they need to be. */
+ if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+ return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
--
2.25.1
* Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls
2021-04-01 19:30 ` [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
@ 2021-04-02 12:52 ` Vincent Guittot
2021-04-06 15:35 ` Dietmar Eggemann
1 sibling, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2021-04-02 12:52 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, Lingutla Chandrasekhar, Peter Zijlstra,
Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Qais Yousef,
Quentin Perret, Pavan Kondeti, Rik van Riel
On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
>
> During load balance, LBF_SOME_PINNED will bet set if any candidate task
> cannot be detached due to CPU affinity constraints. This can result in
> setting env->sd->parent->sgc->group_imbalance, which can lead to a group
> being classified as group_imbalanced (rather than any of the other, lower
> group_type) when balancing at a higher level.
>
> In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
> set due to per-CPU kthreads being the only other runnable tasks on any
> given rq. This results in changing the group classification during
> load-balance at higher levels when in reality there is nothing that can be
> done for this affinity constraint: per-CPU kthreads, as the name implies,
> don't get to move around (modulo hotplug shenanigans).
>
> It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
> with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
> an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
> we can leverage here to not set LBF_SOME_PINNED.
>
> Note that the aforementioned classification to group_imbalance (when
> nothing can be done) is especially problematic on big.LITTLE systems, which
> have a topology the likes of:
>
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
>        L  L  B  B
>
> arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
>
> Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
> level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
> the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
> CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
> can significantly delay the upmigration of said misfit tasks. Systems
> relying on ASYM_PACKING are likely to face similar issues.
>
> Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
> [Reword changelog]
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6d73bdbb2d40..04d5e14fa261 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7567,6 +7567,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> return 0;
>
> + /* Disregard pcpu kthreads; they are where they need to be. */
> + if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
> + return 0;
> +
> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
> int cpu;
>
> --
> 2.25.1
>
* Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls
2021-04-01 19:30 ` [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
2021-04-02 12:52 ` Vincent Guittot
@ 2021-04-06 15:35 ` Dietmar Eggemann
2021-04-06 15:55 ` Valentin Schneider
1 sibling, 1 reply; 12+ messages in thread
From: Dietmar Eggemann @ 2021-04-06 15:35 UTC (permalink / raw)
To: Valentin Schneider, linux-kernel
Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
Vincent Guittot, Morten Rasmussen, Qais Yousef, Quentin Perret,
Pavan Kondeti, Rik van Riel
On 01/04/2021 21:30, Valentin Schneider wrote:
> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
>
> During load balance, LBF_SOME_PINNED will bet set if any candidate task
nitpick; s/bet/be ?
[...]
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
* Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls
2021-04-06 15:35 ` Dietmar Eggemann
@ 2021-04-06 15:55 ` Valentin Schneider
0 siblings, 0 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-06 15:55 UTC (permalink / raw)
To: Dietmar Eggemann, linux-kernel
Cc: Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
Vincent Guittot, Morten Rasmussen, Qais Yousef, Quentin Perret,
Pavan Kondeti, Rik van Riel
On 06/04/21 17:35, Dietmar Eggemann wrote:
> On 01/04/2021 21:30, Valentin Schneider wrote:
>> From: Lingutla Chandrasekhar <clingutla@codeaurora.org>
>>
>> During load balance, LBF_SOME_PINNED will bet set if any candidate task
>
> nitpick; s/bet/be ?
>
Yes indeed...
> [...]
>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
* [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery
2021-04-01 19:30 [PATCH v4 0/3] sched/fair: load-balance vs capacity margins Valentin Schneider
2021-04-01 19:30 ` [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
@ 2021-04-01 19:30 ` Valentin Schneider
2021-04-02 12:57 ` Vincent Guittot
2021-04-06 15:36 ` Dietmar Eggemann
2021-04-01 19:30 ` [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
2 siblings, 2 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-01 19:30 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
Rik van Riel, Lingutla Chandrasekhar
When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().
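The override itself boils down to (paraphrased, not verbatim fair.c):

  /* can_migrate_task(): cache-hotness check */
  tsk_cache_hot = task_hot(p, env);
  if (tsk_cache_hot <= 0 ||
      env->sd->nr_balance_failed > env->sd->cache_nice_tries)
          return 1;   /* migrate regardless of hotness */
  return 0;

so writing nr_balance_failed = cache_nice_tries + 1 defeats the task_hot() veto.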
This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.
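A rough timeline of that window (CPU A runs load_balance(), CPU B hosts the
stopper; illustrative only):

  CPU A                                   CPU B
  stop_one_cpu_nowait(...)    ------>     active_load_balance_cpu_stop()
                                            can_migrate_task()
                                              /* stale nr_balance_failed,
                                                 task_hot() can still veto */
  sd->nr_balance_failed =
          sd->cache_nice_tries + 1;       /* write too late / not yet visible */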
Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose. Cleanup the LBF_DST_PINNED active balance special case.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
kernel/sched/fair.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04d5e14fa261..d8077f82a380 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7422,6 +7422,7 @@ enum migration_type {
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
#define LBF_SOME_PINNED 0x08
+#define LBF_ACTIVE_LB 0x10
struct lb_env {
struct sched_domain *sd;
@@ -7583,10 +7584,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* our sched_group. We may want to revisit it if we couldn't
* meet load balance goals by pulling other tasks on src_cpu.
*
- * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
- * already computed one in current iteration.
+ * Avoid computing new_dst_cpu
+ * - for NEWLY_IDLE
+ * - if we have already computed one in current iteration
+ * - if it's an active balance
*/
- if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
+ if (env->idle == CPU_NEWLY_IDLE ||
+ env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))
return 0;
/* Prevent to re-select dst_cpu via env's CPUs: */
@@ -7611,10 +7615,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* Aggressive migration if:
- * 1) destination numa is preferred
- * 2) task is cache cold, or
- * 3) too many balance attempts have failed.
+ * 1) active balance
+ * 2) destination numa is preferred
+ * 3) task is cache cold, or
+ * 4) too many balance attempts have failed.
*/
+ if (env->flags & LBF_ACTIVE_LB)
+ return 1;
+
tsk_cache_hot = migrate_degrades_locality(p, env);
if (tsk_cache_hot == -1)
tsk_cache_hot = task_hot(p, env);
@@ -9805,9 +9813,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
active_load_balance_cpu_stop, busiest,
&busiest->active_balance_work);
}
-
- /* We've kicked active balancing, force task migration. */
- sd->nr_balance_failed = sd->cache_nice_tries+1;
}
} else {
sd->nr_balance_failed = 0;
@@ -9957,13 +9962,7 @@ static int active_load_balance_cpu_stop(void *data)
.src_cpu = busiest_rq->cpu,
.src_rq = busiest_rq,
.idle = CPU_IDLE,
- /*
- * can_migrate_task() doesn't need to compute new_dst_cpu
- * for active balancing. Since we have CPU_IDLE, but no
- * @dst_grpmask we need to make that test go away with lying
- * about DST_PINNED.
- */
- .flags = LBF_DST_PINNED,
+ .flags = LBF_ACTIVE_LB,
};
schedstat_inc(sd->alb_count);
--
2.25.1
* Re: [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery
2021-04-01 19:30 ` [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
@ 2021-04-02 12:57 ` Vincent Guittot
2021-04-06 15:36 ` Dietmar Eggemann
1 sibling, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2021-04-02 12:57 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Qais Yousef, Quentin Perret, Pavan Kondeti,
Rik van Riel, Lingutla Chandrasekhar
On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
>
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
>
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose. Cleanup the LBF_DST_PINNED active balance special case.
>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 31 +++++++++++++++----------------
> 1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04d5e14fa261..d8077f82a380 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7422,6 +7422,7 @@ enum migration_type {
> #define LBF_NEED_BREAK 0x02
> #define LBF_DST_PINNED 0x04
> #define LBF_SOME_PINNED 0x08
> +#define LBF_ACTIVE_LB 0x10
>
> struct lb_env {
> struct sched_domain *sd;
> @@ -7583,10 +7584,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> * our sched_group. We may want to revisit it if we couldn't
> * meet load balance goals by pulling other tasks on src_cpu.
> *
> - * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
> - * already computed one in current iteration.
> + * Avoid computing new_dst_cpu
> + * - for NEWLY_IDLE
> + * - if we have already computed one in current iteration
> + * - if it's an active balance
> */
> - if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
> + if (env->idle == CPU_NEWLY_IDLE ||
> + env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))
> return 0;
>
> /* Prevent to re-select dst_cpu via env's CPUs: */
> @@ -7611,10 +7615,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>
> /*
> * Aggressive migration if:
> - * 1) destination numa is preferred
> - * 2) task is cache cold, or
> - * 3) too many balance attempts have failed.
> + * 1) active balance
> + * 2) destination numa is preferred
> + * 3) task is cache cold, or
> + * 4) too many balance attempts have failed.
> */
> + if (env->flags & LBF_ACTIVE_LB)
> + return 1;
> +
> tsk_cache_hot = migrate_degrades_locality(p, env);
> if (tsk_cache_hot == -1)
> tsk_cache_hot = task_hot(p, env);
> @@ -9805,9 +9813,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> active_load_balance_cpu_stop, busiest,
> &busiest->active_balance_work);
> }
> -
> - /* We've kicked active balancing, force task migration. */
> - sd->nr_balance_failed = sd->cache_nice_tries+1;
> }
> } else {
> sd->nr_balance_failed = 0;
> @@ -9957,13 +9962,7 @@ static int active_load_balance_cpu_stop(void *data)
> .src_cpu = busiest_rq->cpu,
> .src_rq = busiest_rq,
> .idle = CPU_IDLE,
> - /*
> - * can_migrate_task() doesn't need to compute new_dst_cpu
> - * for active balancing. Since we have CPU_IDLE, but no
> - * @dst_grpmask we need to make that test go away with lying
> - * about DST_PINNED.
> - */
> - .flags = LBF_DST_PINNED,
> + .flags = LBF_ACTIVE_LB,
> };
>
> schedstat_inc(sd->alb_count);
> --
> 2.25.1
>
* Re: [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery
2021-04-01 19:30 ` [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
2021-04-02 12:57 ` Vincent Guittot
@ 2021-04-06 15:36 ` Dietmar Eggemann
1 sibling, 0 replies; 12+ messages in thread
From: Dietmar Eggemann @ 2021-04-06 15:36 UTC (permalink / raw)
To: Valentin Schneider, linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Morten Rasmussen,
Qais Yousef, Quentin Perret, Pavan Kondeti, Rik van Riel,
Lingutla Chandrasekhar
On 01/04/2021 21:30, Valentin Schneider wrote:
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
>
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
>
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose. Cleanup the LBF_DST_PINNED active balance special case.
>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[...]
* [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper
2021-04-01 19:30 [PATCH v4 0/3] sched/fair: load-balance vs capacity margins Valentin Schneider
2021-04-01 19:30 ` [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
2021-04-01 19:30 ` [PATCH v4 2/3] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
@ 2021-04-01 19:30 ` Valentin Schneider
2021-04-02 13:04 ` Vincent Guittot
2021-04-06 15:37 ` Dietmar Eggemann
2 siblings, 2 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-01 19:30 UTC (permalink / raw)
To: linux-kernel
Cc: Qais Yousef, Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Morten Rasmussen,
Quentin Perret, Pavan Kondeti, Rik van Riel
During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass
group_smaller_max_cpu_capacity(<candidate group>, <local group>);
which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.
Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.
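For the record, the arithmetic goes roughly like this (fits_capacity() as
defined in fair.c, Pixel 4 numbers as above):

  fits_capacity(cap, max) := cap * 1280 < max * 1024    /* cap < ~80% of max */
  max = 1024              =>  cut-off at 1024 * 1024 / 1280 = 819.2
  "medium" CPU, cap = 871 =>  871 * 1280 = 1114880 > 1048576  =>  false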
One may argue the 20% margin of fits_capacity() is excessive with the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.
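With the ~5% helper introduced below, the same Pixel 4 scenario passes; a
rough check of the numbers, assuming the big dst CPU sits at its full
capacity of 1024:

  capacity_greater(cap1, cap2) := cap1 * 1024 > cap2 * 1078  /* cap1 > ~105% of cap2 */
  capacity_greater(1024, 871):   1024 * 1024 = 1048576 > 871 * 1078 = 938938  => true

i.e. the medium group is no longer filtered out as a misfit-task source.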
While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
kernel/sched/fair.c | 33 ++++++++++-----------------------
1 file changed, 10 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8077f82a380..c9c5c2697998 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
*/
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
+/*
+ * The margin used when comparing CPU capacities.
+ * is 'cap1' noticeably greater than 'cap2'
+ *
+ * (default: ~5%)
+ */
+#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
#endif
#ifdef CONFIG_CFS_BANDWIDTH
@@ -8364,26 +8371,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
return false;
}
-/*
- * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity than sched_group ref.
- */
-static inline bool
-group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
- return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
-}
-
-/*
- * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity_orig than sched_group ref.
- */
-static inline bool
-group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
- return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
-}
-
static inline enum
group_type group_classify(unsigned int imbalance_pct,
struct sched_group *group,
@@ -8539,7 +8526,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* internally or be covered by avg_load imbalance (eventually).
*/
if (sgs->group_type == group_misfit_task &&
- (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+ (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
@@ -8623,7 +8610,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
*/
if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
(sgs->group_type <= group_fully_busy) &&
- (group_smaller_min_cpu_capacity(sds->local, sg)))
+ (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
return false;
return true;
@@ -9423,7 +9410,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* average load.
*/
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- capacity_of(env->dst_cpu) < capacity &&
+ !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
nr_running == 1)
continue;
--
2.25.1
* Re: [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper
2021-04-01 19:30 ` [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
@ 2021-04-02 13:04 ` Vincent Guittot
2021-04-06 15:37 ` Dietmar Eggemann
1 sibling, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2021-04-02 13:04 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, Qais Yousef, Lingutla Chandrasekhar,
Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, Morten Rasmussen,
Quentin Perret, Pavan Kondeti, Rik van Riel
On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
> group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive with the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> While at it, replace group_smaller_{min, max}_cpu_capacity() with
> comparisons of the source group's min/max capacity and the destination
> CPU's capacity.
>
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 33 ++++++++++-----------------------
> 1 file changed, 10 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d8077f82a380..c9c5c2697998 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
> */
> #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap1' noticeably greater than 'cap2'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
> #endif
>
> #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8364,26 +8371,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
> return false;
> }
>
> -/*
> - * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity than sched_group ref.
> - */
> -static inline bool
> -group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> - return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> -}
> -
> -/*
> - * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity_orig than sched_group ref.
> - */
> -static inline bool
> -group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> -{
> - return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> -}
> -
> static inline enum
> group_type group_classify(unsigned int imbalance_pct,
> struct sched_group *group,
> @@ -8539,7 +8526,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * internally or be covered by avg_load imbalance (eventually).
> */
> if (sgs->group_type == group_misfit_task &&
> - (!group_smaller_max_cpu_capacity(sg, sds->local) ||
> + (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> sds->local_stat.group_type != group_has_spare))
> return false;
>
> @@ -8623,7 +8610,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> */
> if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> (sgs->group_type <= group_fully_busy) &&
> - (group_smaller_min_cpu_capacity(sds->local, sg)))
> + (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> return false;
>
> return true;
> @@ -9423,7 +9410,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> * average load.
> */
> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> - capacity_of(env->dst_cpu) < capacity &&
> + !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
> nr_running == 1)
> continue;
>
> --
> 2.25.1
>
* Re: [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper
2021-04-01 19:30 ` [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
2021-04-02 13:04 ` Vincent Guittot
@ 2021-04-06 15:37 ` Dietmar Eggemann
2021-04-06 15:59 ` Valentin Schneider
1 sibling, 1 reply; 12+ messages in thread
From: Dietmar Eggemann @ 2021-04-06 15:37 UTC (permalink / raw)
To: Valentin Schneider, linux-kernel
Cc: Qais Yousef, Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
Vincent Guittot, Morten Rasmussen, Quentin Perret, Pavan Kondeti,
Rik van Riel
On 01/04/2021 21:30, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
> group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive with the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> While at it, replace group_smaller_{min, max}_cpu_capacity() with
> comparisons of the source group's min/max capacity and the destination
> CPU's capacity.
IMHO, you haven't mentioned why you replace local sched group with dst
CPU. I can see that only the capacity of the dst CPU really makes sense
here. Might be worth mentioning in the patch header why. There is some
of it in v3 6/7 but that's a different change.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[...]
* Re: [PATCH v4 3/3] sched/fair: Introduce a CPU capacity comparison helper
2021-04-06 15:37 ` Dietmar Eggemann
@ 2021-04-06 15:59 ` Valentin Schneider
0 siblings, 0 replies; 12+ messages in thread
From: Valentin Schneider @ 2021-04-06 15:59 UTC (permalink / raw)
To: Dietmar Eggemann, linux-kernel
Cc: Qais Yousef, Lingutla Chandrasekhar, Peter Zijlstra, Ingo Molnar,
Vincent Guittot, Morten Rasmussen, Quentin Perret, Pavan Kondeti,
Rik van Riel
On 06/04/21 17:37, Dietmar Eggemann wrote:
> On 01/04/2021 21:30, Valentin Schneider wrote:
>> While at it, replace group_smaller_{min, max}_cpu_capacity() with
>> comparisons of the source group's min/max capacity and the destination
>> CPU's capacity.
>
> IMHO, you haven't mentioned why you replace local sched group with dst
> CPU. I can see that only the capacity of the dst CPU really makes sense
> here. Might be worth mentioning in the patch header why. There is some
> of it in v3 6/7 but that's a different change.
>
Right, some of it got lost in the shuffling. This was a separate change in
v1 (4/8). I can reuse that changelog into:
"""
Also note that comparing capacity extrema of local and source sched_groups
doesn't make much sense when at the end of the day the imbalance will be
pulled by a known env->dst_cpu, whose capacity can be anywhere within the
local group's capacity extrema.
While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.
"""
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> [...]