When the system doesn't have enough cycles for all tasks, the scheduler
must ensure a fair split of those CPU cycles between CFS tasks. The
fairness of some use cases can't be solved with a static distribution of
the tasks on the system and requires a periodic rebalancing, but this
dynamic behavior is not always optimal and a fair distribution of CPU
time is not always ensured.

The patchset improves fairness by relaxing the constraint for selecting
migratable tasks as the number of failed load balances increases. This
change then makes it possible to decrease the imbalance threshold,
because the 1st LB will try to migrate tasks that fully match the
imbalance.

Some test results:

- small 2 x 4 cores arm64 system

  hackbench -l (256000/#grp) -g #grp

  grp    tip/sched/core        +patchset             improvement
  1      1.420(+/- 11.72 %)    1.382(+/- 10.50 %)    2.72 %
  4      1.295(+/-  2.72 %)    1.218(+/-  2.97 %)    0.76 %
  8      1.220(+/-  2.17 %)    1.218(+/-  1.60 %)    0.17 %
  16     1.258(+/-  1.88 %)    1.250(+/-  1.78 %)    0.58 %

  fairness tests: run always-running rt-app threads and monitor the
  ratio between the min/max work done by the threads

                      v5.9-rc1               w/ patchset
  9 threads   avg     78.3%  (+/- 6.60%)     91.20% (+/- 2.44%)
              worst   68.6%                  85.67%
  11 threads  avg     65.91% (+/- 8.26%)     91.34% (+/- 1.87%)
              worst   53.52%                 87.26%

- large 2 nodes x 28 cores x 4 threads arm64 system

  The hackbench tests that I usually run, as well as the sp.C.x and
  lu.C.x tests with 224 threads, have not shown any difference, with a
  mix of less than 0.5% of improvements or regressions.

Changes for v2:
- rebased on tip/sched/core
- added comment for patch 3
- added acked and reviewed tags

Vincent Guittot (4):
  sched/fair: relax constraint on task's load during load balance
  sched/fair: reduce minimal imbalance threshold
  sched/fair: minimize concurrent LBs between domain level
  sched/fair: reduce busy load balance interval

 kernel/sched/fair.c     | 13 +++++++++++--
 kernel/sched/topology.c |  4 ++--
 2 files changed, 13 insertions(+), 4 deletions(-)

-- 
2.17.1
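To make the fairness metric above concrete, here is a minimal standalone
sketch of how the min/max ratio can be computed; it assumes per-thread
work counters collected from the rt-app run, and fairness_ratio() and the
sample values are illustrative, not rt-app code:

#include <stdio.h>

/* Illustrative helper: ratio of the least to the most work done,
 * as a percentage; 100% means perfectly fair CPU time sharing. */
static double fairness_ratio(const unsigned long *work, int nr)
{
	unsigned long min = work[0], max = work[0];

	for (int i = 1; i < nr; i++) {
		if (work[i] < min)
			min = work[i];
		if (work[i] > max)
			max = work[i];
	}
	return 100.0 * (double)min / (double)max;
}

int main(void)
{
	/* made-up loop counts reported by 9 always-running threads */
	unsigned long work[9] = { 980, 1000, 995, 870, 990, 1000, 940, 900, 970 };

	printf("fairness: %.2f%%\n", fairness_ratio(work, 9));
	return 0;
}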
Some UCs like 9 always-running tasks on 8 CPUs can't be balanced, and the
load balancer currently migrates the waiting task between the CPUs in an
almost random manner. The success of a rq pulling a task depends on the
value of nr_balance_failed of its domains and its ability to be faster
than others at detaching it. This behavior results in an unfair
distribution of the running time between tasks because some CPUs will run
most of the time, if not always, the same task whereas others will share
their time between several tasks.

Instead of using nr_balance_failed as a boolean to relax the condition
for detaching a task, the LB will use nr_balance_failed to relax the
threshold between the task's load and the imbalance. This mechanism
prevents the same rq or domain from always winning the load balance
fight.

Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33699db27ed5..d8320dc9d014 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7674,8 +7674,8 @@ static int detach_tasks(struct lb_env *env)
 			 * scheduler fails to find a good waiting task to
 			 * migrate.
 			 */
-			if (load/2 > env->imbalance &&
-			    env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
+
+			if ((load >> env->sd->nr_balance_failed) > env->imbalance)
 				goto next;
 
 			env->imbalance -= load;
-- 
2.17.1
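To illustrate the new condition, a standalone sketch of the arithmetic
(not kernel code; the load and imbalance values are made up): each failed
LB halves the effective load compared against the imbalance, so a task
that is too heavy on the first pass becomes migratable after a couple of
failures:

#include <stdio.h>

int main(void)
{
	/* made-up values: imbalance to resolve and one candidate task */
	unsigned long imbalance = 100, load = 300;

	for (unsigned int nr_balance_failed = 0; nr_balance_failed < 3;
	     nr_balance_failed++)
		/* same form as the new detach_tasks() condition */
		printf("nr_balance_failed=%u: %s\n", nr_balance_failed,
		       (load >> nr_balance_failed) > imbalance ?
		       "skip task" : "detach task");
	/* prints: skip task, skip task, detach task */
	return 0;
}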
The 25% default imbalance threshold for the DIE and NUMA domains is large
enough to generate significant unfairness between threads. A typical
example is the case of 11 threads running on 2x4 CPUs. The imbalance of
20% between the 2 groups of 4 cores is just low enough not to trigger
the load balance between the 2 groups. We will always have the same 6
threads on one group of 4 CPUs and the other 5 threads on the other
group of CPUs. With fair time sharing in each group, we end up with
+20% running time for the group of 5 threads.

Consider decreasing the imbalance threshold for the overloaded case,
where we use the load to balance tasks and to ensure fair time sharing.

Acked-by: Hillf Danton <hdanton@sina.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 249bec7b0a4c..41df62884cea 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1349,7 +1349,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
-		.imbalance_pct		= 125,
+		.imbalance_pct		= 117,
 		.cache_nice_tries	= 0,
-- 
2.17.1
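The arithmetic behind the 11-threads example, as a simplified standalone
sketch (the comparison is a simplification of the sched group load check,
and the unit load is made up): the busiest group carries 6/5 = 120% of
the local group's load, which clears a 117% threshold but not the old
125% one:

#include <stdio.h>

int main(void)
{
	/* 6 vs 5 equal-weight always-running threads, made-up unit load */
	unsigned long busiest = 6 * 1024, local = 5 * 1024;
	int pcts[2] = { 125, 117 };

	for (int i = 0; i < 2; i++)
		/* simplified group comparison: balance only when the
		 * busiest group exceeds the local one by imbalance_pct */
		printf("imbalance_pct=%d: %s\n", pcts[i],
		       busiest * 100 > local * pcts[i] ?
		       "balance" : "no balance");
	/* 125 -> no balance (120% < 125%), 117 -> balance */
	return 0;
}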
sched domains tend to trigger the load balance loop simultaneously, but
the larger domains often need more time to collect statistics. This
slowness makes the larger domain try to detach tasks from a rq whereas
tasks have already migrated somewhere else at a sub-domain level. This is
not a real problem for idle LB, because the period of the smaller domains
will increase while their CPUs are busy and this will leave time for the
higher ones to pull tasks. But this becomes a problem when all CPUs are
already busy, because all domains stay synced when they trigger their LB.

A simple way to minimize simultaneous LB of all domains is to decrement
the busy interval by 1 jiffy. Because of the busy_factor, the interval of
a larger domain will no longer be a multiple of the smaller ones.

Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8320dc9d014..458702062d3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9785,6 +9785,15 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
 
 	/* scale ms to jiffies */
 	interval = msecs_to_jiffies(interval);
+
+	/*
+	 * Reduce likelihood of busy balancing at higher domains racing with
+	 * balancing at lower domains by preventing their balancing periods
+	 * from being multiples of each other.
+	 */
+	if (cpu_busy)
+		interval -= 1;
+
 	interval = clamp(interval, 1UL, max_load_balance_interval);
 
 	return interval;
-- 
2.17.1
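A standalone sketch of the effect, assuming HZ=1000 (so 1 ms maps to
1 jiffy) and the busy_factor of 16 introduced by the next patch; the
domain weights are made up for illustration:

#include <stdio.h>

int main(void)
{
	/* made-up weights for two nested levels; HZ=1000 assumed so
	 * min_interval in ms equals jiffies */
	unsigned long busy_factor = 16;
	unsigned long weight[2] = { 4, 8 };	/* e.g. MC, DIE */

	for (int i = 0; i < 2; i++) {
		unsigned long busy = weight[i] * busy_factor;

		printf("level %d: synced %lu jiffies, desynced %lu\n",
		       i, busy, busy - 1);
	}
	/* 128 is an exact multiple of 64, so both levels fire together
	 * every 128 jiffies; 127 is not a multiple of 63, so after the
	 * -1 the two periods drift apart instead of staying synced. */
	return 0;
}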
The busy_factor, which increases the load balance interval when a cpu is
busy, is set to 32 by default. This value generates some huge LB
intervals on large systems like the THX2, made of 2 nodes x 28 cores x 4
threads. For such a system, the interval increases from 112ms to 3584ms
at MC level, and from 224ms to 7168ms at NUMA level.

Even on smaller systems, a lower busy factor has shown improvement in
the fair distribution of the running time, so let's reduce it for all.

Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 41df62884cea..a3a2417fec54 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1348,7 +1348,7 @@ sd_init(struct sched_domain_topology_level *tl,
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
-		.busy_factor		= 32,
+		.busy_factor		= 16,
 		.imbalance_pct		= 117,
 		.cache_nice_tries	= 0,
-- 
2.17.1
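The quoted intervals follow directly from min_interval = sd_weight; a
standalone sketch of the computation, assuming HZ=1000 and the THX2
topology above:

#include <stdio.h>

int main(void)
{
	/* THX2: 2 nodes x 28 cores x 4 threads; HZ=1000 assumed */
	struct { const char *level; unsigned long sd_weight; } dom[2] = {
		{ "MC",   28 * 4 },	/* 112 CPUs in a node */
		{ "NUMA", 2 * 28 * 4 },	/* 224 CPUs in the system */
	};

	for (int i = 0; i < 2; i++)
		/* busy interval = min_interval (sd_weight in ms) * busy_factor */
		printf("%-4s: busy_factor 32 -> %4lums, 16 -> %4lums\n",
		       dom[i].level, dom[i].sd_weight * 32,
		       dom[i].sd_weight * 16);
	/* prints: MC 3584ms -> 1792ms, NUMA 7168ms -> 3584ms */
	return 0;
}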
On 21/09/20 08:24, Vincent Guittot wrote:
> Some UCs like 9 always running tasks on 8 CPUs can't be balanced and the
> load balancer currently migrates the waiting task between the CPUs in an
> almost random manner. The success of a rq pulling a task depends on the
> value of nr_balance_failed of its domains and its ability to be faster
> than others at detaching it. This behavior results in an unfair distribution
> of the running time between tasks because some CPUs will run most of the
> time, if not always, the same task whereas others will share their time
> between several tasks.
>
> Instead of using nr_balance_failed as a boolean to relax the condition
> for detaching a task, the LB will use nr_balance_failed to relax the
> threshold between the task's load and the imbalance. This mechanism
> prevents the same rq or domain from always winning the load balance fight.
>
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
On 21/09/20 08:24, Vincent Guittot wrote:
> The 25% default imbalance threshold for DIE and NUMA domain is large
> enough to generate significant unfairness between threads. A typical
> example is the case of 11 threads running on 2x4 CPUs. The imbalance of
> 20% between the 2 groups of 4 cores is just low enough to not trigger
> the load balance between the 2 groups. We will always have the same 6
> threads on one group of 4 CPUs and the other 5 threads on the other
> group of CPUs. With fair time sharing in each group, we end up with
> +20% running time for the group of 5 threads.
>
> Consider decreasing the imbalance threshold for the overloaded case,
> where we use the load to balance tasks and to ensure fair time sharing.
>
> Acked-by: Hillf Danton <hdanton@sina.com>
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
On 21/09/20 08:24, Vincent Guittot wrote:
> sched domains tend to trigger the load balance loop simultaneously, but
> the larger domains often need more time to collect statistics. This
> slowness makes the larger domain try to detach tasks from a rq whereas
> tasks have already migrated somewhere else at a sub-domain level. This is
> not a real problem for idle LB, because the period of the smaller domains
> will increase while their CPUs are busy and this will leave time for the
> higher ones to pull tasks. But this becomes a problem when all CPUs are
> already busy, because all domains stay synced when they trigger their LB.
>
> A simple way to minimize simultaneous LB of all domains is to decrement
> the busy interval by 1 jiffy. Because of the busy_factor, the interval of
> a larger domain will no longer be a multiple of the smaller ones.
>
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
On 21/09/20 08:24, Vincent Guittot wrote:
> The busy_factor, which increases the load balance interval when a cpu is
> busy, is set to 32 by default. This value generates some huge LB intervals
> on large systems like the THX2, made of 2 nodes x 28 cores x 4 threads.
> For such a system, the interval increases from 112ms to 3584ms at MC
> level, and from 224ms to 7168ms at NUMA level.
>
> Even on smaller systems, a lower busy factor has shown improvement in the
> fair distribution of the running time, so let's reduce it for all.
>
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
On Mon, Sep 21, 2020 at 09:24:20AM +0200, Vincent Guittot wrote:
> When the system doesn't have enough cycles for all tasks, the scheduler
> must ensure a fair split of those CPU cycles between CFS tasks. The
> fairness of some use cases can't be solved with a static distribution of
> the tasks on the system and requires a periodic rebalancing, but this
> dynamic behavior is not always optimal and a fair distribution of CPU
> time is not always ensured.
>
FWIW, nothing bad fell out of the series from a battery of scheduler
tests across various machines. Headline-wise, EPYC 1 looked very bad for
hackbench, but a detailed look showed that it was great until the very
highest group count, where it looked bad. Otherwise EPYC 1 looked good,
as did EPYC 2. Various generations of Intel boxes showed marginal gains
or losses, nothing dramatic. will-it-scale for various test loads looked
fractionally worse across some machines, which may show up in the 0-day
bot, but it will probably be marginal.
As the patches are partially magic numbers that you could reason about
either way, I'm not going to say that the series is universally better.
However, it's slightly better in normal cases, your tests indicate it's
good for a specific corner case, and it does not look like anything
obvious falls apart.
--
Mel Gorman
SUSE Labs
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e4d32e4d5444977d8dc25fa98b3ce0a65544db8c
Gitweb:        https://git.kernel.org/tip/e4d32e4d5444977d8dc25fa98b3ce0a65544db8c
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Mon, 21 Sep 2020 09:24:23 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 25 Sep 2020 14:23:26 +02:00

sched/fair: Minimize concurrent LBs between domain level

sched domains tend to trigger the load balance loop simultaneously, but
the larger domains often need more time to collect statistics. This
slowness makes the larger domain try to detach tasks from a rq whereas
tasks have already migrated somewhere else at a sub-domain level. This is
not a real problem for idle LB, because the period of the smaller domains
will increase while their CPUs are busy and this will leave time for the
higher ones to pull tasks. But this becomes a problem when all CPUs are
already busy, because all domains stay synced when they trigger their LB.

A simple way to minimize simultaneous LB of all domains is to decrement
the busy interval by 1 jiffy. Because of the busy_factor, the interval of
a larger domain will no longer be a multiple of the smaller ones.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e3add3..24a5ee6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9790,6 +9790,15 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
 
 	/* scale ms to jiffies */
 	interval = msecs_to_jiffies(interval);
+
+	/*
+	 * Reduce likelihood of busy balancing at higher domains racing with
+	 * balancing at lower domains by preventing their balancing periods
+	 * from being multiples of each other.
+	 */
+	if (cpu_busy)
+		interval -= 1;
+
 	interval = clamp(interval, 1UL, max_load_balance_interval);
 
 	return interval;
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     6e7499135db724539ca887b3aa64122502875c71
Gitweb:        https://git.kernel.org/tip/6e7499135db724539ca887b3aa64122502875c71
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Mon, 21 Sep 2020 09:24:24 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 25 Sep 2020 14:23:26 +02:00

sched/fair: Reduce busy load balance interval

The busy_factor, which increases the load balance interval when a cpu is
busy, is set to 32 by default. This value generates some huge LB
intervals on large systems like the THX2, made of 2 nodes x 28 cores x 4
threads. For such a system, the interval increases from 112ms to 3584ms
at MC level, and from 224ms to 7168ms at NUMA level.

Even on smaller systems, a lower busy factor has shown improvement in
the fair distribution of the running time, so let's reduce it for all.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 41df628..a3a2417 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1348,7 +1348,7 @@ sd_init(struct sched_domain_topology_level *tl,
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
-		.busy_factor		= 32,
+		.busy_factor		= 16,
 		.imbalance_pct		= 117,
 		.cache_nice_tries	= 0,
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2208cdaa56c957e20d8e16f28819aeb47851cb1e
Gitweb:        https://git.kernel.org/tip/2208cdaa56c957e20d8e16f28819aeb47851cb1e
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Mon, 21 Sep 2020 09:24:22 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 25 Sep 2020 14:23:26 +02:00

sched/fair: Reduce minimal imbalance threshold

The 25% default imbalance threshold for the DIE and NUMA domains is
large enough to generate significant unfairness between threads. A
typical example is the case of 11 threads running on 2x4 CPUs. The
imbalance of 20% between the 2 groups of 4 cores is just low enough not
to trigger the load balance between the 2 groups. We will always have
the same 6 threads on one group of 4 CPUs and the other 5 threads on
the other group of CPUs. With fair time sharing in each group, we end
up with +20% running time for the group of 5 threads.

Consider decreasing the imbalance threshold for the overloaded case,
where we use the load to balance tasks and to ensure fair time sharing.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Hillf Danton <hdanton@sina.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 249bec7..41df628 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1349,7 +1349,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
-		.imbalance_pct		= 125,
+		.imbalance_pct		= 117,
 		.cache_nice_tries	= 0,
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5a7f555904671c0737819fe4d19bd6143de3f6c0
Gitweb:        https://git.kernel.org/tip/5a7f555904671c0737819fe4d19bd6143de3f6c0
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Mon, 21 Sep 2020 09:24:21 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 25 Sep 2020 14:23:25 +02:00

sched/fair: Relax constraint on task's load during load balance

Some UCs like 9 always-running tasks on 8 CPUs can't be balanced, and
the load balancer currently migrates the waiting task between the CPUs
in an almost random manner. The success of a rq pulling a task depends
on the value of nr_balance_failed of its domains and its ability to be
faster than others at detaching it. This behavior results in an unfair
distribution of the running time between tasks because some CPUs will
run most of the time, if not always, the same task whereas others will
share their time between several tasks.

Instead of using nr_balance_failed as a boolean to relax the condition
for detaching a task, the LB will use nr_balance_failed to relax the
threshold between the task's load and the imbalance. This mechanism
prevents the same rq or domain from always winning the load balance
fight.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b56276a..5e3add3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7679,8 +7679,8 @@ static int detach_tasks(struct lb_env *env)
 			 * scheduler fails to find a good waiting task to
 			 * migrate.
 			 */
-			if (load/2 > env->imbalance &&
-			    env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
+
+			if ((load >> env->sd->nr_balance_failed) > env->imbalance)
 				goto next;
 
 			env->imbalance -= load;