* [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
@ 2020-09-14 10:03 Vincent Guittot
  2020-09-14 10:03 ` [PATCH 1/4] sched/fair: relax constraint on task's load during load balance Vincent Guittot
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-09-14 10:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: valentin.schneider, Vincent Guittot

When the system doesn't have enough cycles for all tasks, the scheduler
must ensure a fair split of those CPU cycles between CFS tasks. The
fairness of some use cases can't be achieved with a static distribution
of the tasks on the system and requires periodic rebalancing, but this
dynamic behavior is not always optimal and a fair distribution of CPU
time is not always ensured.

This patchset improves fairness by relaxing the constraint on selecting
migratable tasks as the number of failed load balances increases. This
change then makes it possible to decrease the imbalance threshold,
because the 1st LB will try to migrate tasks that fully match the
imbalance.

Some test results:

- small 2 x 4 cores arm64 system

hackbench -l (256000/#grp) -g #grp

grp    tip/sched/core         +patchset             improvement
1      1.420(+/- 11.72 %)     1.382(+/-10.50 %)     2.72 %
4      1.295(+/-  2.72 %)     1.218(+/- 2.97 %)     0.76 %
8      1.220(+/-  2.17 %)     1.218(+/- 1.60 %)     0.17 %
16     1.258(+/-  1.88 %)     1.250(+/- 1.78 %)     0.58 %


fairness tests: run always-running rt-app threads and
monitor the ratio between the min and max work done by the threads

                  v5.9-rc1             w/ patchset
9 threads  avg     78.3% (+/- 6.60%)   91.20% (+/- 2.44%)
           worst   68.6%               85.67%

11 threads avg     65.91% (+/- 8.26%)  91.34% (+/- 1.87%)
           worst   53.52%              87.26%

- large 2 nodes x 28 cores x 4 threads arm64 system

The hackbench tests that I usually run, as well as the sp.C.x and lu.C.x
tests with 224 threads, have not shown any significant difference: a mix
of improvements and regressions below 0.5%.

Vincent Guittot (4):
  sched/fair: relax constraint on task's load during load balance
  sched/fair: reduce minimal imbalance threshold
  sched/fair: minimize concurrent LBs between domain level
  sched/fair: reduce busy load balance interval

 kernel/sched/fair.c     | 7 +++++--
 kernel/sched/topology.c | 4 ++--
 2 files changed, 7 insertions(+), 4 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] sched/fair: relax constraint on task's load during load balance
  2020-09-14 10:03 [PATCH 0/4] sched/fair: Improve fairness between cfs tasks Vincent Guittot
@ 2020-09-14 10:03 ` Vincent Guittot
  2020-09-14 10:03 ` [PATCH 2/4] sched/fair: reduce minimal imbalance threshold Vincent Guittot
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-09-14 10:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: valentin.schneider, Vincent Guittot

Some use cases, like 9 always-running tasks on 8 CPUs, can't be
balanced, and the load balancer currently migrates the waiting task
between the CPUs in an almost random manner. Whether a rq succeeds in
pulling a task depends on the value of nr_balance_failed of its domains
and on its ability to detach the task faster than the others. This
behavior results in an unfair distribution of running time between tasks
because some CPUs will run the same task most of the time, if not
always, whereas others will share their time between several tasks.

Instead of using nr_balance_failed as a boolean to relax the condition
for detaching a task, the LB will use nr_balance_failed to relax the
threshold between the task's load and the imbalance. This mechanism
prevents the same rq or domain from always winning the load balance
fight.
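
As an illustrative sketch (standalone C, not kernel code; the helper
names are mine), the old and new skip conditions from detach_tasks()
compare like this:

```c
/*
 * Sketch of the detach_tasks() change. Old rule: skip a task whose
 * load/2 exceeds the remaining imbalance, unless the domain has
 * already failed more than cache_nice_tries balance attempts.
 * New rule: the load threshold is relaxed progressively, halving
 * with each failed balance (load >> nr_balance_failed).
 */
int skip_detach_old(unsigned long load, unsigned long imbalance,
		    int nr_balance_failed, int cache_nice_tries)
{
	return load / 2 > imbalance && nr_balance_failed <= cache_nice_tries;
}

int skip_detach_new(unsigned long load, unsigned long imbalance,
		    int nr_balance_failed)
{
	return (load >> nr_balance_failed) > imbalance;
}
```

With load=1024 and imbalance=300, the new rule skips the task while
nr_balance_failed is 0 or 1 (1024 and 512 both exceed 300) and allows
the detach from the second failure on (256 <= 300), so a rq that keeps
losing the balance fight is progressively allowed to win.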

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ba8f230feb9..765be8273292 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7669,8 +7669,8 @@ static int detach_tasks(struct lb_env *env)
 			 * scheduler fails to find a good waiting task to
 			 * migrate.
 			 */
-			if (load/2 > env->imbalance &&
-			    env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
+
+			if ((load >> env->sd->nr_balance_failed) > env->imbalance)
 				goto next;
 
 			env->imbalance -= load;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/4] sched/fair: reduce minimal imbalance threshold
  2020-09-14 10:03 [PATCH 0/4] sched/fair: Improve fairness between cfs tasks Vincent Guittot
  2020-09-14 10:03 ` [PATCH 1/4] sched/fair: relax constraint on task's load during load balance Vincent Guittot
@ 2020-09-14 10:03 ` Vincent Guittot
  2020-09-15 19:04   ` Valentin Schneider
  2020-09-14 10:03 ` [PATCH 3/4] sched/fair: minimize concurrent LBs between domain level Vincent Guittot
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-09-14 10:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: valentin.schneider, Vincent Guittot

The 25% default imbalance threshold for the DIE and NUMA domains is
large enough to generate significant unfairness between threads. A
typical example is the case of 11 threads running on 2x4 CPUs. The
imbalance of 20% between the 2 groups of 4 cores is just low enough not
to trigger a load balance between the 2 groups. We will always have the
same 6 threads on one group of 4 CPUs and the other 5 threads on the
other group of CPUs. With fair time sharing within each group, we end up
with +20% running time for the group of 5 threads.

Decrease the imbalance threshold for the overloaded case, where we use
the load to balance tasks and to ensure fair time sharing.
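
As a standalone sketch of the out_balanced check in
find_busiest_group() (the helper is illustrative, not the kernel
function):

```c
/*
 * Illustrative: a busiest/local group pair is declared balanced when
 * 100 * busiest_avg_load <= imbalance_pct * local_avg_load.
 * With 6 vs 5 always-running threads, the load ratio is 120%:
 * the default imbalance_pct = 125 declares the groups balanced and
 * the unfairness persists, while 117 lets the load balance trigger.
 */
int groups_balanced(unsigned long busiest_load, unsigned long local_load,
		    unsigned int imbalance_pct)
{
	return 100 * busiest_load <= imbalance_pct * local_load;
}
```

Using the average loads of the two groups (6 vs 5 always-running
threads): 100*6 = 600 <= 125*5 = 625, so the imbalance is ignored with
the default threshold, whereas 600 > 117*5 = 585 triggers the balance.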

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9079d865a935..1a84b778755d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1337,7 +1337,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
-		.imbalance_pct		= 125,
+		.imbalance_pct		= 117,
 
 		.cache_nice_tries	= 0,
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/4] sched/fair: minimize concurrent LBs between domain level
  2020-09-14 10:03 [PATCH 0/4] sched/fair: Improve fairness between cfs tasks Vincent Guittot
  2020-09-14 10:03 ` [PATCH 1/4] sched/fair: relax constraint on task's load during load balance Vincent Guittot
  2020-09-14 10:03 ` [PATCH 2/4] sched/fair: reduce minimal imbalance threshold Vincent Guittot
@ 2020-09-14 10:03 ` Vincent Guittot
  2020-09-15 19:04   ` Valentin Schneider
  2020-09-14 10:03 ` [PATCH 4/4] sched/fair: reduce busy load balance interval Vincent Guittot
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-09-14 10:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: valentin.schneider, Vincent Guittot

sched domains tend to trigger the load balance loop simultaneously, but
the larger domains often need more time to collect statistics. This
slowness means that a larger domain can try to detach tasks from a rq
when those tasks have already been migrated somewhere else at a
sub-domain level. This is not a real problem for idle LB because the
period of the smaller domains will increase as their CPUs get busy,
which leaves time for the higher levels to pull tasks. But it becomes a
problem when all CPUs are already busy because all domains stay synced
when they trigger their LB.

A simple way to minimize simultaneous LB across domains is to decrement
the busy interval by 1 jiffy. Because of the busy_factor, the interval
of a larger domain will no longer be a multiple of the smaller ones.
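
A sketch of the interval arithmetic (plain C, not the kernel's
get_sd_balance_interval(); HZ = 1000 assumed so 1 ms == 1 jiffy):

```c
/*
 * Illustrative: busy balance intervals before and after the change.
 * Base intervals scale with domain size, so the busy intervals of
 * nested domains are exact multiples of each other (e.g. 128ms and
 * 256ms for a 4-CPU domain nested in an 8-CPU one with busy_factor
 * 32) and the domains keep triggering their balance in lockstep.
 * Subtracting one jiffy breaks the multiple: 127 and 255.
 */
unsigned long busy_balance_interval(unsigned long base_ms,
				    unsigned int busy_factor,
				    int subtract_one_jiffy)
{
	unsigned long interval = base_ms * busy_factor;

	if (subtract_one_jiffy)
		interval -= 1;
	return interval;
}
```

With the patch, the 4-CPU domain balances every 127ms and the 8-CPU one
every 255ms; since 255 is not a multiple of 127, the two levels stop
systematically waking up on the same tick.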

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 765be8273292..7d7eefd8e2d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9780,6 +9780,9 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
 
 	/* scale ms to jiffies */
 	interval = msecs_to_jiffies(interval);
+	if (cpu_busy)
+		interval -= 1;
+
 	interval = clamp(interval, 1UL, max_load_balance_interval);
 
 	return interval;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/4] sched/fair: reduce busy load balance interval
  2020-09-14 10:03 [PATCH 0/4] sched/fair: Improve fairness between cfs tasks Vincent Guittot
                   ` (2 preceding siblings ...)
  2020-09-14 10:03 ` [PATCH 3/4] sched/fair: minimize concurrent LBs between domain level Vincent Guittot
@ 2020-09-14 10:03 ` Vincent Guittot
  2020-09-15  9:11   ` Jiang Biao
  2020-09-15 19:04   ` Valentin Schneider
  2020-09-14 11:42 ` [PATCH 0/4] sched/fair: Improve fairness between cfs tasks peterz
  2020-09-15 19:05 ` Valentin Schneider
  5 siblings, 2 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-09-14 10:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: valentin.schneider, Vincent Guittot

The busy_factor, which increases the load balance interval when a cpu is
busy, is set to 32 by default. This value generates some huge LB
intervals on a large system like the THX2, made of 2 nodes x 28 cores x
4 threads. For such a system, the interval increases from 112ms to
3584ms at MC level, and from 224ms to 7168ms at NUMA level.

Even on smaller systems, a lower busy factor has shown improvement in
the fair distribution of running time, so let's reduce it for all.
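
The quoted intervals follow from min_interval = sd_weight (in ms)
multiplied by the busy_factor; a quick illustrative check, with the
THX2 figures assumed from the commit message:

```c
/*
 * Illustrative: busy-interval growth on a 2 nodes x 28 cores x
 * 4 threads system. The MC domain spans 112 CPUs and the NUMA domain
 * 224 CPUs; the base balance interval in ms equals the domain weight,
 * and it is multiplied by busy_factor when the cpu is busy.
 */
unsigned long busy_lb_interval_ms(unsigned int sd_weight,
				  unsigned int busy_factor)
{
	return (unsigned long)sd_weight * busy_factor;
}
```

busy_factor = 32 gives 112 * 32 = 3584ms at MC and 224 * 32 = 7168ms at
NUMA; the proposed busy_factor = 16 halves those to 1792ms and 3584ms.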

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1a84b778755d..a8477c9e8569 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
-		.busy_factor		= 32,
+		.busy_factor		= 16,
 		.imbalance_pct		= 117,
 
 		.cache_nice_tries	= 0,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
  2020-09-14 10:03 [PATCH 0/4] sched/fair: Improve fairness between cfs tasks Vincent Guittot
                   ` (3 preceding siblings ...)
  2020-09-14 10:03 ` [PATCH 4/4] sched/fair: reduce busy load balance interval Vincent Guittot
@ 2020-09-14 11:42 ` peterz
  2020-09-14 12:53   ` Phil Auld
                     ` (2 more replies)
  2020-09-15 19:05 ` Valentin Schneider
  5 siblings, 3 replies; 26+ messages in thread
From: peterz @ 2020-09-14 11:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	linux-kernel, valentin.schneider, pauld

On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> Vincent Guittot (4):
>   sched/fair: relax constraint on task's load during load balance
>   sched/fair: reduce minimal imbalance threshold
>   sched/fair: minimize concurrent LBs between domain level
>   sched/fair: reduce busy load balance interval

I see nothing objectionable there, a little more testing can't hurt, but
I'm tempted to apply them.

Phil, Mel, any chance you can run them through your respective setups?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
  2020-09-14 11:42 ` [PATCH 0/4] sched/fair: Improve fairness between cfs tasks peterz
@ 2020-09-14 12:53   ` Phil Auld
  2020-09-14 16:03     ` Vincent Guittot
  2020-09-14 15:50   ` Mel Gorman
  2020-09-18 16:39   ` Phil Auld
  2 siblings, 1 reply; 26+ messages in thread
From: Phil Auld @ 2020-09-14 12:53 UTC (permalink / raw)
  To: peterz
  Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, linux-kernel, valentin.schneider

On Mon, Sep 14, 2020 at 01:42:02PM +0200 peterz@infradead.org wrote:
> On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> > Vincent Guittot (4):
> >   sched/fair: relax constraint on task's load during load balance
> >   sched/fair: reduce minimal imbalance threshold
> >   sched/fair: minimize concurrent LBs between domain level
> >   sched/fair: reduce busy load balance interval
> 
> I see nothing objectionable there, a little more testing can't hurt, but
> I'm tempted to apply them.
> 
> Phil, Mel, any chance you can run them through your respective setups?
> 

Yep. I'll try to get something started today, results in a few days.

These look pretty innocuous. It'll be interesting to see what the effect is.


Cheers,
Phil
-- 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
  2020-09-14 11:42 ` [PATCH 0/4] sched/fair: Improve fairness between cfs tasks peterz
  2020-09-14 12:53   ` Phil Auld
@ 2020-09-14 15:50   ` Mel Gorman
  2020-09-14 16:04     ` Vincent Guittot
  2020-09-18 16:39   ` Phil Auld
  2 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2020-09-14 15:50 UTC (permalink / raw)
  To: peterz
  Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, linux-kernel, valentin.schneider, pauld

On Mon, Sep 14, 2020 at 01:42:02PM +0200, peterz@infradead.org wrote:
> On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> > Vincent Guittot (4):
> >   sched/fair: relax constraint on task's load during load balance
> >   sched/fair: reduce minimal imbalance threshold
> >   sched/fair: minimize concurrent LBs between domain level
> >   sched/fair: reduce busy load balance interval
> 
> I see nothing objectionable there, a little more testing can't hurt, but
> I'm tempted to apply them.
> 
> Phil, Mel, any chance you can run them through your respective setups?

They're queued but the test grid is backlogged at the moment. It'll be
a few days before the tests complete.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
  2020-09-14 12:53   ` Phil Auld
@ 2020-09-14 16:03     ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-09-14 16:03 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel,
	Valentin Schneider

On Mon, 14 Sep 2020 at 14:53, Phil Auld <pauld@redhat.com> wrote:
>
> On Mon, Sep 14, 2020 at 01:42:02PM +0200 peterz@infradead.org wrote:
> > On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> > > Vincent Guittot (4):
> > >   sched/fair: relax constraint on task's load during load balance
> > >   sched/fair: reduce minimal imbalance threshold
> > >   sched/fair: minimize concurrent LBs between domain level
> > >   sched/fair: reduce busy load balance interval
> >
> > I see nothing objectionable there, a little more testing can't hurt, but
> > I'm tempted to apply them.
> >
> > Phil, Mel, any chance you can run them through your respective setups?
> >
>
> Yep. I'll try to get something started today, results in a few days.

Thanks Phil

>
> These look pretty innocuous. It'll be interesting to see what the effect is.
>
>
> Cheers,
> Phil
> --
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
  2020-09-14 15:50   ` Mel Gorman
@ 2020-09-14 16:04     ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-09-14 16:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel, Valentin Schneider,
	Phil Auld

On Mon, 14 Sep 2020 at 17:51, Mel Gorman <mgorman@suse.de> wrote:
>
> On Mon, Sep 14, 2020 at 01:42:02PM +0200, peterz@infradead.org wrote:
> > On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> > > Vincent Guittot (4):
> > >   sched/fair: relax constraint on task's load during load balance
> > >   sched/fair: reduce minimal imbalance threshold
> > >   sched/fair: minimize concurrent LBs between domain level
> > >   sched/fair: reduce busy load balance interval
> >
> > I see nothing objectionable there, a little more testing can't hurt, but
> > I'm tempted to apply them.
> >
> > Phil, Mel, any chance you can run them through your respective setups?
>
> They're queued but the test grid is backlogged at the moment. It'll be
> a few days before the tests complete.

Thanks Mel

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched/fair: reduce busy load balance interval
  2020-09-14 10:03 ` [PATCH 4/4] sched/fair: reduce busy load balance interval Vincent Guittot
@ 2020-09-15  9:11   ` Jiang Biao
  2020-09-15  9:28     ` Vincent Guittot
  2020-09-15 19:04   ` Valentin Schneider
  1 sibling, 1 reply; 26+ messages in thread
From: Jiang Biao @ 2020-09-15  9:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel,
	Valentin Schneider

Hi, Vincent

On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> The busy_factor, which increases the load balance interval when a cpu
> is busy, is set to 32 by default. This value generates some huge LB
> intervals on a large system like the THX2, made of 2 nodes x 28 cores
> x 4 threads. For such a system, the interval increases from 112ms to
> 3584ms at MC level, and from 224ms to 7168ms at NUMA level.
Agreed that the interval is too big for that case.
But would it be too small for an AMD environment (like ROME) with 8 CPUs
at MC level (CCX) if we reduce the busy_factor?
For that case, the interval could be reduced from 256ms to 128ms.
Or should we define a MIN_INTERVAL for the MC level to avoid too small
an interval?

Thx.
Regards,
Jiang

>
> Even on smaller systems, a lower busy factor has shown improvement in
> the fair distribution of running time, so let's reduce it for all.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/topology.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 1a84b778755d..a8477c9e8569 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
>         *sd = (struct sched_domain){
>                 .min_interval           = sd_weight,
>                 .max_interval           = 2*sd_weight,
> -               .busy_factor            = 32,
> +               .busy_factor            = 16,
>                 .imbalance_pct          = 117,
>
>                 .cache_nice_tries       = 0,
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched/fair: reduce busy load balance interval
  2020-09-15  9:11   ` Jiang Biao
@ 2020-09-15  9:28     ` Vincent Guittot
  2020-09-15 11:36       ` Jiang Biao
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-09-15  9:28 UTC (permalink / raw)
  To: Jiang Biao
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel,
	Valentin Schneider

On Tue, 15 Sep 2020 at 11:11, Jiang Biao <benbjiang@gmail.com> wrote:
>
> Hi, Vincent
>
> On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > The busy_factor, which increases the load balance interval when a
> > cpu is busy, is set to 32 by default. This value generates some huge
> > LB intervals on a large system like the THX2, made of 2 nodes x
> > 28 cores x 4 threads. For such a system, the interval increases from
> > 112ms to 3584ms at MC level, and from 224ms to 7168ms at NUMA level.
> Agreed that the interval is too big for that case.
> But would it be too small for an AMD environment (like ROME) with 8
> CPUs at MC level (CCX) if we reduce the busy_factor?

Are you sure that this is too small? As mentioned in the commit message
below, I tested it on a small system (2x4 cores Arm64) and I have seen
some improvements.

> For that case, the interval could be reduced from 256ms to 128ms.
> Or should we define a MIN_INTERVAL for the MC level to avoid too small
> an interval?

What would be too small an interval?

Before this patch, we have for a level with 8 cores:
when idle, the interval is 8ms and it increases to 256ms when busy.
After the patch, we have:
when idle, the interval is still 8ms and it increases to 128ms when busy.

Regards,
Vincent

>
> Thx.
> Regards,
> Jiang
>
> >
> > Even on smaller systems, a lower busy factor has shown improvement
> > in the fair distribution of running time, so let's reduce it for all.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/topology.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 1a84b778755d..a8477c9e8569 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >         *sd = (struct sched_domain){
> >                 .min_interval           = sd_weight,
> >                 .max_interval           = 2*sd_weight,
> > -               .busy_factor            = 32,
> > +               .busy_factor            = 16,
> >                 .imbalance_pct          = 117,
> >
> >                 .cache_nice_tries       = 0,
> > --
> > 2.17.1
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched/fair: reduce busy load balance interval
  2020-09-15  9:28     ` Vincent Guittot
@ 2020-09-15 11:36       ` Jiang Biao
  2020-09-15 12:42         ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Jiang Biao @ 2020-09-15 11:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel,
	Valentin Schneider

Hi, Vincent

On Tue, 15 Sep 2020 at 17:28, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 15 Sep 2020 at 11:11, Jiang Biao <benbjiang@gmail.com> wrote:
> >
> > Hi, Vincent
> >
> > On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > The busy_factor, which increases the load balance interval when a
> > > cpu is busy, is set to 32 by default. This value generates some
> > > huge LB intervals on a large system like the THX2, made of
> > > 2 nodes x 28 cores x 4 threads. For such a system, the interval
> > > increases from 112ms to 3584ms at MC level, and from 224ms to
> > > 7168ms at NUMA level.
> > Agreed that the interval is too big for that case.
> > But would it be too small for an AMD environment (like ROME) with
> > 8 CPUs at MC level (CCX) if we reduce the busy_factor?
>
> Are you sure that this is too small? As mentioned in the commit
> message below, I tested it on a small system (2x4 cores Arm64) and I
> have seen some improvements.
Not so sure. :)
A small interval means more frequent balancing and more cost consumed
by balancing, especially for pinned vm cases.
For our case, we have AMD ROME servers made of 2 nodes x 48 cores x
2 threads, with 8c at MC level (within a CCX). The 256ms interval seems
a little too big for us, compared to an Intel Cascade Lake CPU with 48c
at MC level, whose balance interval is 1536ms. 128ms seems a little
wasteful. :)
I guess more balancing cost may hurt the throughput of sysbench-like
benchmarks. Just a guess.

>
> > For that case, the interval could be reduced from 256ms to 128ms.
> > Or should we define a MIN_INTERVAL for the MC level to avoid too
> > small an interval?
>
> What would be too small an interval?
That's hard to say. :)
My guess is that it only matters for large server systems.

Thanks.
Regards,
Jiang

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched/fair: reduce busy load balance interval
  2020-09-15 11:36       ` Jiang Biao
@ 2020-09-15 12:42         ` Vincent Guittot
  2020-09-16  1:14           ` Jiang Biao
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-09-15 12:42 UTC (permalink / raw)
  To: Jiang Biao
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel,
	Valentin Schneider

On Tue, 15 Sep 2020 at 13:36, Jiang Biao <benbjiang@gmail.com> wrote:
>
> Hi, Vincent
>
> On Tue, 15 Sep 2020 at 17:28, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Tue, 15 Sep 2020 at 11:11, Jiang Biao <benbjiang@gmail.com> wrote:
> > >
> > > Hi, Vincent
> > >
> > > On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
> > > <vincent.guittot@linaro.org> wrote:
> > > >
> > > > The busy_factor, which increases the load balance interval when
> > > > a cpu is busy, is set to 32 by default. This value generates
> > > > some huge LB intervals on a large system like the THX2, made of
> > > > 2 nodes x 28 cores x 4 threads. For such a system, the interval
> > > > increases from 112ms to 3584ms at MC level, and from 224ms to
> > > > 7168ms at NUMA level.
> > > Agreed that the interval is too big for that case.
> > > But would it be too small for an AMD environment (like ROME) with
> > > 8 CPUs at MC level (CCX) if we reduce the busy_factor?
> >
> > Are you sure that this is too small? As mentioned in the commit
> > message below, I tested it on a small system (2x4 cores Arm64) and I
> > have seen some improvements.
> Not so sure. :)
> A small interval means more frequent balancing and more cost consumed
> by balancing, especially for pinned vm cases.

If you are running only pinned threads, the interval can increase above
512ms, which means 8sec after applying the busy factor.

> For our case, we have AMD ROME servers made of 2 nodes x 48 cores x
> 2 threads, with 8c at MC level (within a CCX). The 256ms interval
> seems a little too big for us, compared to an Intel Cascade Lake CPU
> with 48c at MC

So IIUC your topology is:
2 nodes at NUMA
6 CCX at DIE level
8 cores per CCX at MC
2 threads per core at SMT

> level, whose balance interval is 1536ms. 128ms seems a little
> wasteful. :)

the 256ms/128ms interval only covers 8 cores whereas the 1536ms
interval covers the whole 48 cores

> I guess more balancing cost may hurt the throughput of sysbench-like
> benchmarks. Just a guess.
>
> >
> > > For that case, the interval could be reduced from 256ms to 128ms.
> > > Or should we define a MIN_INTERVAL for the MC level to avoid too
> > > small an interval?
> >
> > What would be too small an interval?
> That's hard to say. :)
> My guess is just for large server system cases.
>
> Thanks.
> Regards,
> Jiang

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: reduce minimal imbalance threshold
  2020-09-14 10:03 ` [PATCH 2/4] sched/fair: reduce minimal imbalance threshold Vincent Guittot
@ 2020-09-15 19:04   ` Valentin Schneider
  2020-09-16  6:53     ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Valentin Schneider @ 2020-09-15 19:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel


On 14/09/20 11:03, Vincent Guittot wrote:
> The 25% default imbalance threshold for the DIE and NUMA domains is
> large enough to generate significant unfairness between threads. A
> typical example is the case of 11 threads running on 2x4 CPUs. The
> imbalance of 20% between the 2 groups of 4 cores is just low enough
> not to trigger a load balance between the 2 groups. We will always
> have the same 6 threads on one group of 4 CPUs and the other 5 threads
> on the other group of CPUs. With fair time sharing within each group,
> we end up with +20% running time for the group of 5 threads.
>

AIUI this is the culprit:

                if (100 * busiest->avg_load <=
                                env->sd->imbalance_pct * local->avg_load)
                        goto out_balanced;

As in your case, imbalance_pct=120 becomes the tipping point.

Now, ultimately this would need to scale based on the underlying topology,
right? If you have a system with 2x32 cores running {33 threads, 34
threads}, the tipping point becomes imbalance_pct≈103; but then since you
have this many more cores, it is somewhat questionable.
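
A hypothetical helper (mine, not from the kernel) makes the tipping
point concrete, i.e. the smallest imbalance_pct that still reports the
groups balanced:

```c
/*
 * Illustrative: the smallest imbalance_pct for which
 * 100 * busiest_load <= pct * local_load holds, i.e. the integer
 * ceiling of 100 * busiest_load / local_load.
 */
unsigned int tipping_pct(unsigned long busiest_load, unsigned long local_load)
{
	return (100 * busiest_load + local_load - 1) / local_load;
}
```

With 6 vs 5 always-running threads per group this gives 120, matching
the 2x4 example; with 34 vs 33 threads it gives 104 (the ≈103 above),
so a fixed default is indeed far from the tipping point on wider
machines.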

> Decrease the imbalance threshold for the overloaded case, where we use
> the load to balance tasks and to ensure fair time sharing.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/topology.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9079d865a935..1a84b778755d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1337,7 +1337,7 @@ sd_init(struct sched_domain_topology_level *tl,
>               .min_interval		= sd_weight,
>               .max_interval		= 2*sd_weight,
>               .busy_factor		= 32,
> -		.imbalance_pct		= 125,
> +		.imbalance_pct		= 117,
>
>               .cache_nice_tries	= 0,

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] sched/fair: minimize concurrent LBs between domain level
  2020-09-14 10:03 ` [PATCH 3/4] sched/fair: minimize concurrent LBs between domain level Vincent Guittot
@ 2020-09-15 19:04   ` Valentin Schneider
  2020-09-16  6:54     ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Valentin Schneider @ 2020-09-15 19:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel


On 14/09/20 11:03, Vincent Guittot wrote:
> sched domains tend to trigger the load balance loop simultaneously,
> but the larger domains often need more time to collect statistics.
> This slowness means that a larger domain can try to detach tasks from
> a rq when those tasks have already been migrated somewhere else at a
> sub-domain level. This is not a real problem for idle LB because the
> period of the smaller domains will increase as their CPUs get busy,
> which leaves time for the higher levels to pull tasks. But it becomes
> a problem when all CPUs are already busy because all domains stay
> synced when they trigger their LB.
>
> A simple way to minimize simultaneous LB across domains is to
> decrement the busy interval by 1 jiffy. Because of the busy_factor,
> the interval of a larger domain will no longer be a multiple of the
> smaller ones.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 765be8273292..7d7eefd8e2d4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9780,6 +9780,9 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>
>       /* scale ms to jiffies */
>       interval = msecs_to_jiffies(interval);

A comment here would be nice, I think. What about:

/*
 * Reduce likelihood of (busy) balancing at higher domains racing with
 * balancing at lower domains by preventing their balancing periods from being
 * multiples of each other.
 */

> +	if (cpu_busy)
> +		interval -= 1;
> +
>       interval = clamp(interval, 1UL, max_load_balance_interval);
>
>       return interval;
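
To illustrate the effect, here is a small sketch (the domain weights are hypothetical examples; the busy interval is min_interval * busy_factor, with min_interval = sd_weight):

```python
# Sketch of the de-sync argument with hypothetical domain weights.
# Busy interval = sd_weight * busy_factor (treated as jiffies for simplicity).
busy_factor = 16

for small_weight, large_weight in [(4, 8), (8, 32), (28, 56)]:
    small = small_weight * busy_factor
    large = large_weight * busy_factor
    # Before the patch: the larger interval is an exact multiple of the
    # smaller one, so both busy balances keep firing on the same jiffies.
    assert large % small == 0
    # After the patch: one jiffy subtracted from each busy interval
    # breaks the multiple relationship, so the levels drift apart.
    assert (large - 1) % (small - 1) != 0
```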

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched/fair: reduce busy load balance interval
  2020-09-14 10:03 ` [PATCH 4/4] sched/fair: reduce busy load balance interval Vincent Guittot
  2020-09-15  9:11   ` Jiang Biao
@ 2020-09-15 19:04   ` Valentin Schneider
  2020-09-16  7:02     ` Vincent Guittot
  1 sibling, 1 reply; 26+ messages in thread
From: Valentin Schneider @ 2020-09-15 19:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel


On 14/09/20 11:03, Vincent Guittot wrote:
> The busy_factor, which increases the load balance interval when a cpu is
> busy, is set to 32 by default. This value generates some huge LB
> intervals on large systems like the THX2, made of 2 nodes x 28 cores x 4
> threads. For such a system, the interval increases from 112ms to 3584ms
> at MC level, and from 224ms to 7168ms at NUMA level.
>
> Even on smaller systems, a lower busy factor has shown improvements in
> the fair distribution of running time, so let's reduce it for all.
>

ISTR you mentioned taking this one step further and making
(interval * busy_factor) scale logarithmically with the number of CPUs to
avoid reaching outrageous numbers. Did you experiment with that already?
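
For reference, the arithmetic behind the quoted intervals can be sketched as follows (min_interval = sd_weight in ms, multiplied by busy_factor when the CPU is busy):

```python
# THX2 example from the commit message: 2 nodes x 28 cores x 4 threads.
# min_interval = sd_weight (in ms); a busy CPU multiplies it by busy_factor.
def busy_interval_ms(sd_weight, busy_factor):
    return sd_weight * busy_factor

assert busy_interval_ms(112, 32) == 3584  # MC level (28 cores x 4 threads), old factor
assert busy_interval_ms(112, 16) == 1792  # MC level with the new factor of 16
assert busy_interval_ms(224, 32) == 7168  # NUMA level (2 x 112 CPUs), old factor
assert busy_interval_ms(224, 16) == 3584  # NUMA level with the new factor
```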

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/topology.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 1a84b778755d..a8477c9e8569 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
>       *sd = (struct sched_domain){
>               .min_interval		= sd_weight,
>               .max_interval		= 2*sd_weight,
> -		.busy_factor		= 32,
> +		.busy_factor		= 16,
>               .imbalance_pct		= 117,
>
>               .cache_nice_tries	= 0,

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
  2020-09-14 10:03 [PATCH 0/4] sched/fair: Improve fairness between cfs tasks Vincent Guittot
                   ` (4 preceding siblings ...)
  2020-09-14 11:42 ` [PATCH 0/4] sched/fair: Improve fairness between cfs tasks peterz
@ 2020-09-15 19:05 ` Valentin Schneider
  5 siblings, 0 replies; 26+ messages in thread
From: Valentin Schneider @ 2020-09-15 19:05 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel


Hi Vincent,

On 14/09/20 11:03, Vincent Guittot wrote:
> When the system doesn't have enough cycles for all tasks, the scheduler
> must ensure a fair split of CPU cycles between CFS tasks. Some use cases
> can't be made fair with a static distribution of the tasks on the system
> and require periodic rebalancing, but this dynamic behavior is not
> always optimal and a fair distribution of CPU time is not always ensured.
>
> The patchset improves fairness by relaxing the constraint for selecting
> migratable tasks as the number of failed load balances grows. This then
> makes it possible to decrease the imbalance threshold, because the first
> LB will try to migrate tasks that fully match the imbalance.
>
> Some tests results:
>
> - small 2 x 4 cores arm64 system
>
> hackbench -l (256000/#grp) -g #grp
>
> grp    tip/sched/core         +patchset             improvement
> 1      1.420(+/- 11.72 %)     1.382(+/-10.50 %)     2.72 %
> 4      1.295(+/-  2.72 %)     1.218(+/- 2.97 %)     0.76 %
> 8      1.220(+/-  2.17 %)     1.218(+/- 1.60 %)     0.17 %
> 16     1.258(+/-  1.88 %)     1.250(+/- 1.78 %)     0.58 %
>
>
> fairness tests: run always running rt-app threads
> monitor the ratio between min/max work done by threads
>
>                   v5.9-rc1             w/ patchset
> 9 threads  avg     78.3% (+/- 6.60%)   91.20% (+/- 2.44%)
>            worst   68.6%               85.67%
>
> 11 threads avg     65.91% (+/- 8.26%)  91.34% (+/- 1.87%)
>            worst   53.52%              87.26%
>
> - large 2 nodes x 28 cores x 4 threads arm64 system
>
> The hackbench tests that I usually run, as well as the sp.C.x and lu.C.x
> tests with 224 threads, have not shown any difference, with a mix of less
> than 0.5% improvements or regressions.
>

Few nitpicks from my end, but no major objections - this looks mostly
sane to me.

> Vincent Guittot (4):
>   sched/fair: relax constraint on task's load during load balance
>   sched/fair: reduce minimal imbalance threshold
>   sched/fair: minimize concurrent LBs between domain level
>   sched/fair: reduce busy load balance interval
>
>  kernel/sched/fair.c     | 7 +++++--
>  kernel/sched/topology.c | 4 ++--
>  2 files changed, 7 insertions(+), 4 deletions(-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched/fair: reduce busy load balance interval
  2020-09-15 12:42         ` Vincent Guittot
@ 2020-09-16  1:14           ` Jiang Biao
  0 siblings, 0 replies; 26+ messages in thread
From: Jiang Biao @ 2020-09-16  1:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel,
	Valentin Schneider

Hi,

On Tue, 15 Sep 2020 at 20:43, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 15 Sep 2020 at 13:36, Jiang Biao <benbjiang@gmail.com> wrote:
> >
> > Hi, Vincent
> >
> > On Tue, 15 Sep 2020 at 17:28, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Tue, 15 Sep 2020 at 11:11, Jiang Biao <benbjiang@gmail.com> wrote:
> > > >
> > > > Hi, Vincent
> > > >
> > > > On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
> > > > <vincent.guittot@linaro.org> wrote:
> > > > >
> > > > > The busy_factor, which increases the load balance interval when a cpu is busy,
> > > > > is set to 32 by default. This value generates some huge LB intervals on
> > > > > large systems like the THX2, made of 2 nodes x 28 cores x 4 threads.
> > > > > For such a system, the interval increases from 112ms to 3584ms at MC level,
> > > > > and from 224ms to 7168ms at NUMA level.
> > > > Agreed that the interval is too big for that case.
> > > > But would it be too small for an AMD environment (like ROME) with 8 CPUs
> > > > at MC level (CCX), if we reduce the busy_factor?
> > >
> > > Are you sure that this is too small? As mentioned in the commit
> > > message below, I tested it on a small system (2x4 cores Arm64) and I
> > > have seen some improvements.
> > Not so sure. :)
> > A small interval means more frequent balancing and more cost spent on
> > balancing, especially for pinned VM cases.
>
> If you are running only pinned threads, the interval can increase
> above 512ms, which means 8 sec after applying the busy factor.
Yep. :)

>
> > For our case, we have AMD ROME servers made of 2 nodes x 48 cores x
> > 2 threads, with 8 cores at MC level (within a CCX). The 256ms interval
> > seems a little too big for us, compared to an Intel Cascade Lake CPU
> > with 48 cores at MC
>
> so IIUC your topology is :
> 2 nodes at NUMA
> 6 CCX at DIE level
> 8 cores per CCX at MC
> 2 threads per core at SMT
Yes.

>
> > level, whose balance interval is 1536ms. 128ms seems a bit more
> > wasteful. :)
>
> the 256ms/128ms interval only covers the 8 cores, whereas the 1536ms
> interval covers the whole 48 cores.
Yes. The real problem for us is that the CPU count difference between the
MC and DIE levels is too big (8 vs. 96): 3072ms at DIE level is too big
(reducing the busy_factor is good enough there), while 128ms at MC level
seems a little wasteful (if the busy_factor is reduced).
No objection to this patch, though. It still looks OK for us.

Thx.
Regards,
Jiang

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: reduce minimal imbalance threshold
  2020-09-15 19:04   ` Valentin Schneider
@ 2020-09-16  6:53     ` Vincent Guittot
  2020-09-16  8:33       ` Valentin Schneider
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-09-16  6:53 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel

On Tue, 15 Sep 2020 at 21:04, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> On 14/09/20 11:03, Vincent Guittot wrote:
> > The 25% default imbalance threshold for the DIE and NUMA domains is large
> > enough to generate significant unfairness between threads. A typical
> > example is the case of 11 threads running on 2x4 CPUs. The imbalance of
> > 20% between the 2 groups of 4 cores is just low enough not to trigger
> > the load balance between the 2 groups. We will always have the same 6
> > threads on one group of 4 CPUs and the other 5 threads on the other
> > group of CPUs. With fair time sharing within each group, we end up with
> > +20% running time for the group of 5 threads.
> >
>
> AIUI this is the culprit:
>
>                 if (100 * busiest->avg_load <=
>                                 env->sd->imbalance_pct * local->avg_load)
>                         goto out_balanced;
>
> As in your case imbalance_pct=120 becomes the tipping point.
>
> Now, ultimately this would need to scale based on the underlying topology,
> right? If you have a system with 2x32 cores running {33 threads, 34
> threads}, the tipping point becomes imbalance_pct≈103; but then since you
> have this many more cores, it is somewhat questionable.

I wanted to stay conservative and not trigger too much task migration
because of a small imbalance, so I decided to decrease the default
threshold to the same level as the MC groups, but this can still generate
unfairness. With your example of 2x32 cores, if you end up with 33 tasks
in one group and 38 in the other, the system is overloaded, so load and
imbalance_pct are used, but the imbalance stays below the new threshold
and the 33 tasks get 13% more running time.

This new imbalance_pct seems a reasonable step towards decreasing the
unfairness.
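
The tipping-point arithmetic can be checked with a small sketch; group load is taken here as simply proportional to the number of always-running threads, which is an illustrative assumption:

```python
# Mirrors the out_balanced check quoted above:
#   if (100 * busiest->avg_load <= sd->imbalance_pct * local->avg_load)
#           goto out_balanced;
def out_balanced(busiest_load, local_load, imbalance_pct):
    return 100 * busiest_load <= imbalance_pct * local_load

# 11 threads on 2x4 CPUs: 6 threads vs 5 threads per group.
assert out_balanced(6, 5, 125)       # old threshold: 600 <= 625, no LB
assert not out_balanced(6, 5, 117)   # new threshold: 600 > 585, LB triggers

# 2x32 cores with 38 vs 33 tasks: still below even the new threshold.
assert out_balanced(38, 33, 117)     # 3800 <= 3861, stays imbalanced
```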

>
> > Consider decreasing the imbalance threshold for the overloaded case,
> > where we use the load to balance tasks and to ensure fair time sharing.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/topology.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9079d865a935..1a84b778755d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1337,7 +1337,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >               .min_interval           = sd_weight,
> >               .max_interval           = 2*sd_weight,
> >               .busy_factor            = 32,
> > -             .imbalance_pct          = 125,
> > +             .imbalance_pct          = 117,
> >
> >               .cache_nice_tries       = 0,

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] sched/fair: minimize concurrent LBs between domain level
  2020-09-15 19:04   ` Valentin Schneider
@ 2020-09-16  6:54     ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-09-16  6:54 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel

On Tue, 15 Sep 2020 at 21:04, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> On 14/09/20 11:03, Vincent Guittot wrote:
> > sched domains tend to trigger the load balance loop simultaneously, but
> > the larger domains often need more time to collect statistics. This
> > slowness makes the larger domain try to detach tasks from a rq whereas
> > the tasks have already migrated somewhere else at a sub-domain level.
> > This is not a real problem for idle LB because the period of the smaller
> > domains will increase while their CPUs are busy, which leaves time for
> > the higher domains to pull tasks. But this becomes a problem when all
> > CPUs are already busy because all domains stay synced when they trigger
> > their LB.
> >
> > A simple way to minimize simultaneous LB of all domains is to decrement
> > the busy interval by one jiffy. Because of the busy_factor, the interval
> > of a larger domain will no longer be a multiple of the smaller ones.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 765be8273292..7d7eefd8e2d4 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9780,6 +9780,9 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
> >
> >       /* scale ms to jiffies */
> >       interval = msecs_to_jiffies(interval);
>
> A comment here would be nice, I think. What about:
>
> /*
>  * Reduce likelihood of (busy) balancing at higher domains racing with
>  * balancing at lower domains by preventing their balancing periods from being
>  * multiples of each other.
>  */

Yes a comment would be nice. Will add it

Thanks
>
> > +     if (cpu_busy)
> > +             interval -= 1;
> > +
> >       interval = clamp(interval, 1UL, max_load_balance_interval);
> >
> >       return interval;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched/fair: reduce busy load balance interval
  2020-09-15 19:04   ` Valentin Schneider
@ 2020-09-16  7:02     ` Vincent Guittot
  2020-09-16  8:34       ` Valentin Schneider
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-09-16  7:02 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel

On Tue, 15 Sep 2020 at 21:04, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> On 14/09/20 11:03, Vincent Guittot wrote:
> > The busy_factor, which increases the load balance interval when a cpu is
> > busy, is set to 32 by default. This value generates some huge LB
> > intervals on large systems like the THX2, made of 2 nodes x 28 cores x 4
> > threads. For such a system, the interval increases from 112ms to 3584ms
> > at MC level, and from 224ms to 7168ms at NUMA level.
> >
> > Even on smaller systems, a lower busy factor has shown improvements in
> > the fair distribution of running time, so let's reduce it for all.
> >
>
> ISTR you mentioned taking this one step further and making
> (interval * busy_factor) scale logarithmically with the number of CPUs to
> avoid reaching outrageous numbers. Did you experiment with that already?

Yes, I have tried the logarithmic scaling, but it didn't give any
benefit over this solution for the fairness problem, and it impacted
other use cases because it affects the idle interval. It also adds more
constraints to the computation of the interval and busy_factor, because
we can end up with the same interval for 2 consecutive levels.

That being said, it might be useful for other cases, but I haven't looked
further into this.
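
For illustration, one hypothetical shape of such a logarithmic scaling (names and formula are assumptions, not the variant that was actually tested):

```python
import math

# Hypothetical sketch: derive the busy factor from the domain's CPU count
# so the busy interval grows much more slowly than with a fixed factor.
def log_busy_factor(sd_weight, base=32):
    return max(2, base // max(1, int(math.log2(sd_weight))))

def busy_interval_ms(sd_weight):
    # min_interval = sd_weight in ms
    return sd_weight * log_busy_factor(sd_weight)

assert busy_interval_ms(112) == 560  # MC level on THX2, vs 3584ms with factor 32
assert busy_interval_ms(224) == 896  # NUMA level, vs 7168ms with factor 32
```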

>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/topology.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 1a84b778755d..a8477c9e8569 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >       *sd = (struct sched_domain){
> >               .min_interval           = sd_weight,
> >               .max_interval           = 2*sd_weight,
> > -             .busy_factor            = 32,
> > +             .busy_factor            = 16,
> >               .imbalance_pct          = 117,
> >
> >               .cache_nice_tries       = 0,

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: reduce minimal imbalance threshold
  2020-09-16  6:53     ` Vincent Guittot
@ 2020-09-16  8:33       ` Valentin Schneider
  0 siblings, 0 replies; 26+ messages in thread
From: Valentin Schneider @ 2020-09-16  8:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel


On 16/09/20 07:53, Vincent Guittot wrote:
> On Tue, 15 Sep 2020 at 21:04, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>> AIUI this is the culprit:
>>
>>                 if (100 * busiest->avg_load <=
>>                                 env->sd->imbalance_pct * local->avg_load)
>>                         goto out_balanced;
>>
>> As in your case imbalance_pct=120 becomes the tipping point.
>>
>> Now, ultimately this would need to scale based on the underlying topology,
>> right? If you have a system with 2x32 cores running {33 threads, 34
>> threads}, the tipping point becomes imbalance_pct≈103; but then since you
>> have this many more cores, it is somewhat questionable.
>
> I wanted to stay conservative and not trigger too much task migration
> because of a small imbalance, so I decided to decrease the default
> threshold to the same level as the MC groups, but this can still generate
> unfairness. With your example of 2x32 cores, if you end up with 33 tasks
> in one group and 38 in the other, the system is overloaded, so load and
> imbalance_pct are used, but the imbalance stays below the new threshold
> and the 33 tasks get 13% more running time.
>
> This new imbalance_pct seems a reasonable step towards decreasing the
> unfairness.
>

No major complaint on the change itself, it's just that this static
imbalance_pct assignment is something I've never really been satisfied with
- at the same time figuring a (or several) correct value from the topology
isn't straightforward either.

At the same time, I believe Peter would be happy to get rid of the decimal
faff and make it all simple shifts, which would limit how much we can
fine-tune these (not necessarily a bad thing).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched/fair: reduce busy load balance interval
  2020-09-16  7:02     ` Vincent Guittot
@ 2020-09-16  8:34       ` Valentin Schneider
  0 siblings, 0 replies; 26+ messages in thread
From: Valentin Schneider @ 2020-09-16  8:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel


On 16/09/20 08:02, Vincent Guittot wrote:
> On Tue, 15 Sep 2020 at 21:04, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>>
>> On 14/09/20 11:03, Vincent Guittot wrote:
>> > The busy_factor, which increases the load balance interval when a cpu is
>> > busy, is set to 32 by default. This value generates some huge LB
>> > intervals on large systems like the THX2, made of 2 nodes x 28 cores x 4
>> > threads. For such a system, the interval increases from 112ms to 3584ms
>> > at MC level, and from 224ms to 7168ms at NUMA level.
>> >
>> > Even on smaller systems, a lower busy factor has shown improvements in
>> > the fair distribution of running time, so let's reduce it for all.
>> >
>>
>> ISTR you mentioned taking this one step further and making
>> (interval * busy_factor) scale logarithmically with the number of CPUs to
>> avoid reaching outrageous numbers. Did you experiment with that already?
>
> Yes, I have tried the logarithmic scaling, but it didn't give any
> benefit over this solution for the fairness problem, and it impacted
> other use cases because it affects the idle interval. It also adds more
> constraints to the computation of the interval and busy_factor, because
> we can end up with the same interval for 2 consecutive levels.
>

Right, I suppose we could frob a topology level index in there to prevent
that if we really wanted to...

> That being said, it might be useful for other cases, but I haven't looked
> further into this.
>

Fair enough!

>>
>> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> > ---
>> >  kernel/sched/topology.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> > index 1a84b778755d..a8477c9e8569 100644
>> > --- a/kernel/sched/topology.c
>> > +++ b/kernel/sched/topology.c
>> > @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
>> >       *sd = (struct sched_domain){
>> >               .min_interval           = sd_weight,
>> >               .max_interval           = 2*sd_weight,
>> > -             .busy_factor            = 32,
>> > +             .busy_factor            = 16,
>> >               .imbalance_pct          = 117,
>> >
>> >               .cache_nice_tries       = 0,

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
  2020-09-14 11:42 ` [PATCH 0/4] sched/fair: Improve fairness between cfs tasks peterz
  2020-09-14 12:53   ` Phil Auld
  2020-09-14 15:50   ` Mel Gorman
@ 2020-09-18 16:39   ` Phil Auld
  2020-09-18 17:27     ` Phil Auld
  2 siblings, 1 reply; 26+ messages in thread
From: Phil Auld @ 2020-09-18 16:39 UTC (permalink / raw)
  To: peterz
  Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, linux-kernel, valentin.schneider, Jirka Hladky

Hi Peter,

On Mon, Sep 14, 2020 at 01:42:02PM +0200 peterz@infradead.org wrote:
> On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> > Vincent Guittot (4):
> >   sched/fair: relax constraint on task's load during load balance
> >   sched/fair: reduce minimal imbalance threshold
> >   sched/fair: minimize concurrent LBs between domain level
> >   sched/fair: reduce busy load balance interval
> 
> I see nothing objectionable there, a little more testing can't hurt, but
> I'm tempted to apply them.
> 
> Phil, Mel, any chance you can run them through your respective setups?
> 

Sorry for the delay. Things have been backing up...

We tested with this series and found there was no performance change in
our test suites. (We don't have a good way to share the actual numbers
outside right now, but since they aren't really different it probably
doesn't matter much here.)

The difference we did see was a slight decrease in the number of tasks
moved around at higher loads.  That seems to be a good thing even though
it didn't directly show time-based performance benefits (and was pretty
minor).

So if this helps other use cases we've got no problems with it.

Thanks,
Phil

-- 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks
  2020-09-18 16:39   ` Phil Auld
@ 2020-09-18 17:27     ` Phil Auld
  0 siblings, 0 replies; 26+ messages in thread
From: Phil Auld @ 2020-09-18 17:27 UTC (permalink / raw)
  To: peterz
  Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, linux-kernel, valentin.schneider, Jirka Hladky

On Fri, Sep 18, 2020 at 12:39:28PM -0400 Phil Auld wrote:
> Hi Peter,
> 
> On Mon, Sep 14, 2020 at 01:42:02PM +0200 peterz@infradead.org wrote:
> > On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> > > Vincent Guittot (4):
> > >   sched/fair: relax constraint on task's load during load balance
> > >   sched/fair: reduce minimal imbalance threshold
> > >   sched/fair: minimize concurrent LBs between domain level
> > >   sched/fair: reduce busy load balance interval
> > 
> > I see nothing objectionable there, a little more testing can't hurt, but
> > I'm tempted to apply them.
> > 
> > Phil, Mel, any chance you can run them through your respective setups?
> > 
> 
> Sorry for the delay. Things have been backing up...
> 
> We tested with this series and found there was no performance change in
> our test suites. (We don't have a good way to share the actual numbers
> outside right now, but since they aren't really different it probably
> doesn't matter much here.)
> 
> The difference we did see was a slight decrease in the number of tasks
> moved around at higher loads.  That seems to be a good thing even though
> it didn't directly show time-based performance benefits (and was pretty
> minor).
> 
> So if this helps other use cases we've got no problems with it.
>

Feel free to add a

Reviewed-by: Phil Auld <pauld@redhat.com>

Jirka did the actual testing so he can speak up with a Tested-by if he
wants to.


> Thanks,
> Phil
> 
> -- 
> 

-- 


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-09-18 17:29 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-14 10:03 [PATCH 0/4] sched/fair: Improve fairness between cfs tasks Vincent Guittot
2020-09-14 10:03 ` [PATCH 1/4] sched/fair: relax constraint on task's load during load balance Vincent Guittot
2020-09-14 10:03 ` [PATCH 2/4] sched/fair: reduce minimal imbalance threshold Vincent Guittot
2020-09-15 19:04   ` Valentin Schneider
2020-09-16  6:53     ` Vincent Guittot
2020-09-16  8:33       ` Valentin Schneider
2020-09-14 10:03 ` [PATCH 3/4] sched/fair: minimize concurrent LBs between domain level Vincent Guittot
2020-09-15 19:04   ` Valentin Schneider
2020-09-16  6:54     ` Vincent Guittot
2020-09-14 10:03 ` [PATCH 4/4] sched/fair: reduce busy load balance interval Vincent Guittot
2020-09-15  9:11   ` Jiang Biao
2020-09-15  9:28     ` Vincent Guittot
2020-09-15 11:36       ` Jiang Biao
2020-09-15 12:42         ` Vincent Guittot
2020-09-16  1:14           ` Jiang Biao
2020-09-15 19:04   ` Valentin Schneider
2020-09-16  7:02     ` Vincent Guittot
2020-09-16  8:34       ` Valentin Schneider
2020-09-14 11:42 ` [PATCH 0/4] sched/fair: Improve fairness between cfs tasks peterz
2020-09-14 12:53   ` Phil Auld
2020-09-14 16:03     ` Vincent Guittot
2020-09-14 15:50   ` Mel Gorman
2020-09-14 16:04     ` Vincent Guittot
2020-09-18 16:39   ` Phil Auld
2020-09-18 17:27     ` Phil Auld
2020-09-15 19:05 ` Valentin Schneider
