* [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
@ 2020-03-03 14:17 王贇
  2020-03-03 19:52 ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: 王贇 @ 2020-03-03 14:17 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	open list:SCHEDULER

During our testing we found a case where shares no longer work
correctly; the cgroup topology is as follows:

  /sys/fs/cgroup/cpu/A		(shares=102400)
  /sys/fs/cgroup/cpu/A/B	(shares=2)
  /sys/fs/cgroup/cpu/A/B/C	(shares=1024)

  /sys/fs/cgroup/cpu/D		(shares=1024)
  /sys/fs/cgroup/cpu/D/E	(shares=1024)
  /sys/fs/cgroup/cpu/D/E/F	(shares=1024)

The same benchmark is running in groups C and F, no other tasks are
running, and the benchmark is capable of consuming all the CPUs.

We expect group C to win more CPU resources, since it can enjoy all
the shares of group A, but it's F that wins much more.

The reason is that group B has its shares set to 2, which makes
group A's 'cfs_rq->load.weight' very small.

And in calc_group_shares() we calculate shares as:

  load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
  shares = (tg_shares * load) / tg_weight;

Since 'cfs_rq->load.weight' is too small, the load becomes 0 here;
although 'tg_shares' is 102400, the shares of the se which stands for
group A on the root cfs_rq become 2.

Meanwhile the se of D on the root cfs_rq is far bigger than 2, so it
wins the battle.

This patch adds a check for the zero load and turns it into MIN_SHARES
to fix the nonsense shares; after it is applied, group C wins as
expected.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84594f8aeaf8..53d705f75fa4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 	tg_shares = READ_ONCE(tg->shares);

 	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
+	if (!load && cfs_rq->load.weight)
+		load = MIN_SHARES;

 	tg_weight = atomic_long_read(&tg->load_avg);

-- 
2.14.4.44.g2045bb6


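To make the failure mode concrete, here is a minimal userspace sketch of
the calc_group_shares() arithmetic quoted above, fed with the numbers from
this report. The 96-CPU count, the resulting per-CPU weight of ~21 for A's
cfs_rq, and the approximation of tg_weight by this cfs_rq's own contribution
are assumptions for illustration; this is not the real kernel code path.

  #include <stdio.h>

  #define SCHED_FIXEDPOINT_SHIFT  10
  #define MIN_SHARES              2UL

  #define scale_load(w)           ((w) << SCHED_FIXEDPOINT_SHIFT)
  #define scale_load_down(w)      ((w) >> SCHED_FIXEDPOINT_SHIFT)

  int main(void)
  {
          unsigned long nr_cpus = 96;
          /* cpu.shares of group A, stored scaled by the kernel */
          unsigned long tg_shares = scale_load(102400UL);
          /* B's per-CPU group se weight, roughly B->shares / nr_cpus */
          unsigned long cfs_rq_weight = scale_load(2UL) / nr_cpus;  /* 21 */
          unsigned long load_avg = 0;  /* follows the truncated weight */

          unsigned long down = scale_load_down(cfs_rq_weight);
          unsigned long load = down > load_avg ? down : load_avg;
          /* approximate tg_weight by this cfs_rq's own contribution */
          unsigned long tg_weight = load;
          /* the real code clamps to MIN_SHARES; the ternary stands in for that */
          unsigned long shares = tg_weight ? tg_shares * load / tg_weight
                                           : MIN_SHARES;

          printf("cfs_rq->load.weight=%lu load=%lu shares=%lu\n",
                 cfs_rq_weight, load, shares);
          return 0;
  }

The sketch prints a weight of 21, a load of 0 and shares of 2, which is the
situation the patch above guards against: once the weight truncates to 0,
the 102400 shares of A never reach the numerator.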

* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-03 14:17 [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small 王贇
@ 2020-03-03 19:52 ` Peter Zijlstra
  2020-03-04  1:19   ` 王贇
                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Peter Zijlstra @ 2020-03-03 19:52 UTC (permalink / raw)
  To: 王贇
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, open list:SCHEDULER

On Tue, Mar 03, 2020 at 10:17:03PM +0800, 王贇 wrote:
> During our testing, we found a case that shares no longer
> working correctly, the cgroup topology is like:
> 
>   /sys/fs/cgroup/cpu/A		(shares=102400)
>   /sys/fs/cgroup/cpu/A/B	(shares=2)
>   /sys/fs/cgroup/cpu/A/B/C	(shares=1024)
> 
>   /sys/fs/cgroup/cpu/D		(shares=1024)
>   /sys/fs/cgroup/cpu/D/E	(shares=1024)
>   /sys/fs/cgroup/cpu/D/E/F	(shares=1024)
> 
> The same benchmark is running in group C & F, no other tasks are
> running, the benchmark is capable to consumed all the CPUs.
> 
> We suppose the group C will win more CPU resources since it could
> enjoy all the shares of group A, but it's F who wins much more.
> 
> The reason is because we have group B with shares as 2, which make
> the group A 'cfs_rq->load.weight' very small.
> 
> And in calc_group_shares() we calculate shares as:
> 
>   load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>   shares = (tg_shares * load) / tg_weight;
> 
> Since the 'cfs_rq->load.weight' is too small, the load become 0
> in here, although 'tg_shares' is 102400, shares of the se which
> stand for group A on root cfs_rq become 2.

Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
B->shares/nr_cpus.

> While the se of D on root cfs_rq is far more bigger than 2, so it
> wins the battle.
> 
> This patch add a check on the zero load and make it as MIN_SHARES
> to fix the nonsense shares, after applied the group C wins as
> expected.
> 
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> ---
>  kernel/sched/fair.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 84594f8aeaf8..53d705f75fa4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>  	tg_shares = READ_ONCE(tg->shares);
> 
>  	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> +	if (!load && cfs_rq->load.weight)
> +		load = MIN_SHARES;
> 
>  	tg_weight = atomic_long_read(&tg->load_avg);

Yeah, I suppose that'll do. Hurmph, wants a comment though.

But that has me looking at other users of scale_load_down(), and doesn't
at least update_tg_cfs_load() suffer the same problem?

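The scope of the truncation is easy to see in isolation: any scaled weight
below 1024 collapses to 0, so every user of scale_load_down() that can be
handed such a small group weight is exposed, not just calc_group_shares().
A tiny standalone sketch, with arbitrary sample weights:

  #include <stdio.h>

  #define SCHED_FIXEDPOINT_SHIFT  10
  #define scale_load_down(w)      ((w) >> SCHED_FIXEDPOINT_SHIFT)

  int main(void)
  {
          /* 21 ~ shares=2 split over 96 CPUs; 1048576 ~ one nice-0 task */
          unsigned long samples[] = { 1, 21, 1023, 2048, 1048576 };
          int i;

          for (i = 0; i < 5; i++)
                  printf("scaled weight %7lu -> %lu\n",
                         samples[i], scale_load_down(samples[i]));
          return 0;
  }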

* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-03 19:52 ` Peter Zijlstra
@ 2020-03-04  1:19   ` 王贇
  2020-03-04  8:47     ` Vincent Guittot
  2020-03-04  8:45   ` Vincent Guittot
  2020-03-04 18:47   ` bsegall
  2 siblings, 1 reply; 21+ messages in thread
From: 王贇 @ 2020-03-04  1:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, open list:SCHEDULER



On 2020/3/4 3:52 AM, Peter Zijlstra wrote:
[snip]
>> The reason is because we have group B with shares as 2, which make
>> the group A 'cfs_rq->load.weight' very small.
>>
>> And in calc_group_shares() we calculate shares as:
>>
>>   load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>>   shares = (tg_shares * load) / tg_weight;
>>
>> Since the 'cfs_rq->load.weight' is too small, the load become 0
>> in here, although 'tg_shares' is 102400, shares of the se which
>> stand for group A on root cfs_rq become 2.
> 
> Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
> B->shares/nr_cpus.

Yeah, that's exactly why it happens: even though the shares value of 2
is scaled up to 2048, on a 96-CPU platform each CPU gets only 21 when it
is split equally.

> 
>> While the se of D on root cfs_rq is far more bigger than 2, so it
>> wins the battle.
>>
>> This patch add a check on the zero load and make it as MIN_SHARES
>> to fix the nonsense shares, after applied the group C wins as
>> expected.
>>
>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
>> ---
>>  kernel/sched/fair.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 84594f8aeaf8..53d705f75fa4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>>  	tg_shares = READ_ONCE(tg->shares);
>>
>>  	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>> +	if (!load && cfs_rq->load.weight)
>> +		load = MIN_SHARES;
>>
>>  	tg_weight = atomic_long_read(&tg->load_avg);
> 
> Yeah, I suppose that'll do. Hurmph, wants a comment though.
> 
> But that has me looking at other users of scale_load_down(), and doesn't
> at least update_tg_cfs_load() suffer the same problem?

Good point :-) I'm not sure, but is scale_load_down() supposed to scale a
small value down to 0? If not, maybe we should fix the helper to make sure
it at least returns some real load, like:

# define scale_load_down(w) ((w + (1 << SCHED_FIXEDPOINT_SHIFT)) >> SCHED_FIXEDPOINT_SHIFT)

Regards,
Michael Wang

> 


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-03 19:52 ` Peter Zijlstra
  2020-03-04  1:19   ` 王贇
@ 2020-03-04  8:45   ` Vincent Guittot
  2020-03-04 18:47   ` bsegall
  2 siblings, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2020-03-04  8:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: 王贇,
	Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, open list:SCHEDULER

On Tue, 3 Mar 2020 at 20:52, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 03, 2020 at 10:17:03PM +0800, 王贇 wrote:
> > During our testing, we found a case that shares no longer
> > working correctly, the cgroup topology is like:
> >
> >   /sys/fs/cgroup/cpu/A                (shares=102400)
> >   /sys/fs/cgroup/cpu/A/B      (shares=2)
> >   /sys/fs/cgroup/cpu/A/B/C    (shares=1024)
> >
> >   /sys/fs/cgroup/cpu/D                (shares=1024)
> >   /sys/fs/cgroup/cpu/D/E      (shares=1024)
> >   /sys/fs/cgroup/cpu/D/E/F    (shares=1024)
> >
> > The same benchmark is running in group C & F, no other tasks are
> > running, the benchmark is capable to consumed all the CPUs.
> >
> > We suppose the group C will win more CPU resources since it could
> > enjoy all the shares of group A, but it's F who wins much more.
> >
> > The reason is because we have group B with shares as 2, which make
> > the group A 'cfs_rq->load.weight' very small.
> >
> > And in calc_group_shares() we calculate shares as:
> >
> >   load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> >   shares = (tg_shares * load) / tg_weight;
> >
> > Since the 'cfs_rq->load.weight' is too small, the load become 0
> > in here, although 'tg_shares' is 102400, shares of the se which
> > stand for group A on root cfs_rq become 2.
>
> Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
> B->shares/nr_cpus.
>
> > While the se of D on root cfs_rq is far more bigger than 2, so it
> > wins the battle.
> >
> > This patch add a check on the zero load and make it as MIN_SHARES
> > to fix the nonsense shares, after applied the group C wins as
> > expected.
> >
> > Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> > ---
> >  kernel/sched/fair.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 84594f8aeaf8..53d705f75fa4 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
> >       tg_shares = READ_ONCE(tg->shares);
> >
> >       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> > +     if (!load && cfs_rq->load.weight)
> > +             load = MIN_SHARES;
> >
> >       tg_weight = atomic_long_read(&tg->load_avg);
>
> Yeah, I suppose that'll do. Hurmph, wants a comment though.
>
> But that has me looking at other users of scale_load_down(), and doesn't
> at least update_tg_cfs_load() suffer the same problem?

Yes, and other places too, like the load_avg that will stay at 0, or the
fact that weight != 0 is used to assume that the se is enqueued and to
not remove the cfs_rq from the leaf_cfs_rq_list even if load_avg is null.


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-04  1:19   ` 王贇
@ 2020-03-04  8:47     ` Vincent Guittot
  2020-03-04  9:43       ` Vincent Guittot
  2020-03-04  9:52       ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Vincent Guittot @ 2020-03-04  8:47 UTC (permalink / raw)
  To: 王贇
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, open list:SCHEDULER

On Wed, 4 Mar 2020 at 02:19, 王贇 <yun.wang@linux.alibaba.com> wrote:
>
>
>
> On 2020/3/4 上午3:52, Peter Zijlstra wrote:
> [snip]
> >> The reason is because we have group B with shares as 2, which make
> >> the group A 'cfs_rq->load.weight' very small.
> >>
> >> And in calc_group_shares() we calculate shares as:
> >>
> >>   load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> >>   shares = (tg_shares * load) / tg_weight;
> >>
> >> Since the 'cfs_rq->load.weight' is too small, the load become 0
> >> in here, although 'tg_shares' is 102400, shares of the se which
> >> stand for group A on root cfs_rq become 2.
> >
> > Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
> > B->shares/nr_cpus.
>
> Yeah, that's exactly why it happens, even the share 2 scale up to 2048,
> on 96 CPUs platform, each CPU get only 21 in equal case.
>
> >
> >> While the se of D on root cfs_rq is far more bigger than 2, so it
> >> wins the battle.
> >>
> >> This patch add a check on the zero load and make it as MIN_SHARES
> >> to fix the nonsense shares, after applied the group C wins as
> >> expected.
> >>
> >> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> >> ---
> >>  kernel/sched/fair.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 84594f8aeaf8..53d705f75fa4 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
> >>      tg_shares = READ_ONCE(tg->shares);
> >>
> >>      load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> >> +    if (!load && cfs_rq->load.weight)
> >> +            load = MIN_SHARES;
> >>
> >>      tg_weight = atomic_long_read(&tg->load_avg);
> >
> > Yeah, I suppose that'll do. Hurmph, wants a comment though.
> >
> > But that has me looking at other users of scale_load_down(), and doesn't
> > at least update_tg_cfs_load() suffer the same problem?
>
> Good point :-) I'm not sure but is scale_load_down() supposed to scale small
> value into 0? If not, maybe we should fix the helper to make sure it at
> least return some real load? like:
>
> # define scale_load_down(w) ((w + (1 << SCHED_FIXEDPOINT_SHIFT)) >> SCHED_FIXEDPOINT_SHIFT)

you will add +1 of nice prio for each device

should we use instead
# define scale_load_down(w) ((w >> SCHED_FIXEDPOINT_SHIFT) ? (w >> SCHED_FIXEDPOINT_SHIFT) : MIN_SHARES)

Regards,
Vincent

>
> Regards,
> Michael Wang
>
> >


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-04  8:47     ` Vincent Guittot
@ 2020-03-04  9:43       ` Vincent Guittot
  2020-03-05  1:23         ` 王贇
  2020-03-04  9:52       ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2020-03-04  9:43 UTC (permalink / raw)
  To: 王贇
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, open list:SCHEDULER

On Wed, 4 Mar 2020 at 09:47, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Wed, 4 Mar 2020 at 02:19, 王贇 <yun.wang@linux.alibaba.com> wrote:
> >
> >
> >
> > On 2020/3/4 上午3:52, Peter Zijlstra wrote:
> > [snip]
> > >> The reason is because we have group B with shares as 2, which make
> > >> the group A 'cfs_rq->load.weight' very small.
> > >>
> > >> And in calc_group_shares() we calculate shares as:
> > >>
> > >>   load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> > >>   shares = (tg_shares * load) / tg_weight;
> > >>
> > >> Since the 'cfs_rq->load.weight' is too small, the load become 0
> > >> in here, although 'tg_shares' is 102400, shares of the se which
> > >> stand for group A on root cfs_rq become 2.
> > >
> > > Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
> > > B->shares/nr_cpus.
> >
> > Yeah, that's exactly why it happens, even the share 2 scale up to 2048,
> > on 96 CPUs platform, each CPU get only 21 in equal case.
> >
> > >
> > >> While the se of D on root cfs_rq is far more bigger than 2, so it
> > >> wins the battle.
> > >>
> > >> This patch add a check on the zero load and make it as MIN_SHARES
> > >> to fix the nonsense shares, after applied the group C wins as
> > >> expected.
> > >>
> > >> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> > >> ---
> > >>  kernel/sched/fair.c | 2 ++
> > >>  1 file changed, 2 insertions(+)
> > >>
> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > >> index 84594f8aeaf8..53d705f75fa4 100644
> > >> --- a/kernel/sched/fair.c
> > >> +++ b/kernel/sched/fair.c
> > >> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
> > >>      tg_shares = READ_ONCE(tg->shares);
> > >>
> > >>      load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> > >> +    if (!load && cfs_rq->load.weight)
> > >> +            load = MIN_SHARES;
> > >>
> > >>      tg_weight = atomic_long_read(&tg->load_avg);
> > >
> > > Yeah, I suppose that'll do. Hurmph, wants a comment though.
> > >
> > > But that has me looking at other users of scale_load_down(), and doesn't
> > > at least update_tg_cfs_load() suffer the same problem?
> >
> > Good point :-) I'm not sure but is scale_load_down() supposed to scale small
> > value into 0? If not, maybe we should fix the helper to make sure it at
> > least return some real load? like:
> >
> > # define scale_load_down(w) ((w + (1 << SCHED_FIXEDPOINT_SHIFT)) >> SCHED_FIXEDPOINT_SHIFT)
>
> you will add +1 of nice prio for each device

Of course, it's not the prio but only the weight that is different

>
> should we use instead
> # define scale_load_down(w) ((w >> SCHED_FIXEDPOINT_SHIFT) ? (w >>
> SCHED_FIXEDPOINT_SHIFT) : MIN_SHARES)
>
> Regards,
> Vincent
>
> >
> > Regards,
> > Michael Wang
> >
> > >


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-04  8:47     ` Vincent Guittot
  2020-03-04  9:43       ` Vincent Guittot
@ 2020-03-04  9:52       ` Peter Zijlstra
  2020-03-04 11:55         ` Vincent Guittot
  2020-03-05  1:08         ` 王贇
  1 sibling, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2020-03-04  9:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: 王贇,
	Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, open list:SCHEDULER

On Wed, Mar 04, 2020 at 09:47:34AM +0100, Vincent Guittot wrote:
> you will add +1 of nice prio for each device
> 
> should we use instead
> # define scale_load_down(w) ((w >> SCHED_FIXEDPOINT_SHIFT) ? (w >>
> SCHED_FIXEDPOINT_SHIFT) : MIN_SHARES)

That's '((w >> SHIFT) ?: MIN_SHARES)', but even that is not quite right.

I think we want something like:

#define scale_load_down(w) \
({ unsigned long ___w = (w); \
   if (___w) \
     ___w = max(MIN_SHARES, ___w >> SHIFT); \
   ___w; })

That is, we very much want to retain 0 I'm thinking.


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-04  9:52       ` Peter Zijlstra
@ 2020-03-04 11:55         ` Vincent Guittot
  2020-03-05  1:08         ` 王贇
  1 sibling, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2020-03-04 11:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: 王贇,
	Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, open list:SCHEDULER

On Wed, 4 Mar 2020 at 10:52, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 04, 2020 at 09:47:34AM +0100, Vincent Guittot wrote:
> > you will add +1 of nice prio for each device
> >
> > should we use instead
> > # define scale_load_down(w) ((w >> SCHED_FIXEDPOINT_SHIFT) ? (w >>
> > SCHED_FIXEDPOINT_SHIFT) : MIN_SHARES)
>
> That's '((w >> SHIFT) ?: MIN_SHARES)', but even that is not quite right.
>
> I think we want something like:
>
> #define scale_load_down(w) \
> ({ unsigned long ___w = (w); \
>    if (___w) \
>      ____w = max(MIN_SHARES, ___w >> SHIFT); \
>    ___w; })
>
> That is, we very much want to retain 0 I'm thinking.

yes, you're right

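Comparing the helper variants proposed in this sub-thread on a few sample
weights shows why "retain 0" matters. This is a standalone sketch: MIN_SHARES
and the shift value are taken from the discussion, the sample weights are
arbitrary, and the helpers are local stand-ins rather than the kernel macros.

  #include <stdio.h>

  #define SHIFT           10
  #define MIN_SHARES      2UL

  /* current helper: plain shift, small weights vanish */
  static unsigned long down_current(unsigned long w)
  {
          return w >> SHIFT;
  }

  /* round-up proposal: adds +1 to every weight, and 0 becomes 1 */
  static unsigned long down_roundup(unsigned long w)
  {
          return (w + (1UL << SHIFT)) >> SHIFT;
  }

  /* '?: MIN_SHARES' proposal: never returns 0, not even for weight 0 */
  static unsigned long down_min(unsigned long w)
  {
          return (w >> SHIFT) ? (w >> SHIFT) : MIN_SHARES;
  }

  /* clamp-but-keep-0 proposal: 0 stays 0, anything else is at least 2 */
  static unsigned long down_clamp(unsigned long w)
  {
          unsigned long down = w >> SHIFT;

          return w ? (down > MIN_SHARES ? down : MIN_SHARES) : 0;
  }

  int main(void)
  {
          unsigned long samples[] = { 0, 21, 1024, 2048, 1048576 };
          int i;

          for (i = 0; i < 5; i++)
                  printf("w=%7lu current=%4lu roundup=%4lu min=%4lu clamp=%4lu\n",
                         samples[i], down_current(samples[i]),
                         down_roundup(samples[i]), down_min(samples[i]),
                         down_clamp(samples[i]));
          return 0;
  }

Only the last variant keeps an empty cfs_rq at 0 while guaranteeing a
non-zero result for any non-empty one, which is exactly the "we very much
want to retain 0" requirement stated above.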

* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-03 19:52 ` Peter Zijlstra
  2020-03-04  1:19   ` 王贇
  2020-03-04  8:45   ` Vincent Guittot
@ 2020-03-04 18:47   ` bsegall
  2020-03-05  1:14     ` 王贇
  2 siblings, 1 reply; 21+ messages in thread
From: bsegall @ 2020-03-04 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: 王贇,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, open list:SCHEDULER

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Mar 03, 2020 at 10:17:03PM +0800, 王贇 wrote:
>> During our testing, we found a case that shares no longer
>> working correctly, the cgroup topology is like:
>> 
>>   /sys/fs/cgroup/cpu/A		(shares=102400)
>>   /sys/fs/cgroup/cpu/A/B	(shares=2)
>>   /sys/fs/cgroup/cpu/A/B/C	(shares=1024)
>> 
>>   /sys/fs/cgroup/cpu/D		(shares=1024)
>>   /sys/fs/cgroup/cpu/D/E	(shares=1024)
>>   /sys/fs/cgroup/cpu/D/E/F	(shares=1024)
>> 
>> The same benchmark is running in group C & F, no other tasks are
>> running, the benchmark is capable to consumed all the CPUs.
>> 
>> We suppose the group C will win more CPU resources since it could
>> enjoy all the shares of group A, but it's F who wins much more.
>> 
>> The reason is because we have group B with shares as 2, which make
>> the group A 'cfs_rq->load.weight' very small.
>> 
>> And in calc_group_shares() we calculate shares as:
>> 
>>   load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>>   shares = (tg_shares * load) / tg_weight;
>> 
>> Since the 'cfs_rq->load.weight' is too small, the load become 0
>> in here, although 'tg_shares' is 102400, shares of the se which
>> stand for group A on root cfs_rq become 2.
>
> Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
> B->shares/nr_cpus.
>
>> While the se of D on root cfs_rq is far more bigger than 2, so it
>> wins the battle.
>> 
>> This patch add a check on the zero load and make it as MIN_SHARES
>> to fix the nonsense shares, after applied the group C wins as
>> expected.
>> 
>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
>> ---
>>  kernel/sched/fair.c | 2 ++
>>  1 file changed, 2 insertions(+)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 84594f8aeaf8..53d705f75fa4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>>  	tg_shares = READ_ONCE(tg->shares);
>> 
>>  	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>> +	if (!load && cfs_rq->load.weight)
>> +		load = MIN_SHARES;
>> 
>>  	tg_weight = atomic_long_read(&tg->load_avg);
>
> Yeah, I suppose that'll do. Hurmph, wants a comment though.
>
> But that has me looking at other users of scale_load_down(), and doesn't
> at least update_tg_cfs_load() suffer the same problem?

I think instead we should probably scale_load_down(tg_shares) and
scale_load(load_avg). tg_shares is always a scaled integer, so just
moving the source of the scaling in the multiply should do the job.

ie

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcc968669aea..6d7a9d72d742 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3179,9 +3179,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
        long tg_weight, tg_shares, load, shares;
        struct task_group *tg = cfs_rq->tg;
 
-       tg_shares = READ_ONCE(tg->shares);
+       tg_shares = scale_load_down(READ_ONCE(tg->shares));
 
-       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
+       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
 
        tg_weight = atomic_long_read(&tg->load_avg);
 




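A rough numeric sketch of the precision argument: tg->shares is always a
multiple of 1024 in the scaled domain, so scaling it down loses nothing,
while the small cfs_rq weight now survives into the numerator. The weight
of 21 and the zero load_avg are assumptions carried over from the earlier
example, and the two numerators end up in different fixed-point domains,
so only the zero-versus-non-zero difference is meaningful here.

  #include <stdio.h>

  #define SCHED_FIXEDPOINT_SHIFT  10
  #define scale_load(w)           ((w) << SCHED_FIXEDPOINT_SHIFT)
  #define scale_load_down(w)      ((w) >> SCHED_FIXEDPOINT_SHIFT)

  int main(void)
  {
          unsigned long tg_shares_raw = scale_load(102400UL);
          unsigned long weight = 21;      /* A's cfs_rq->load.weight, scaled */
          unsigned long load_avg = 0;

          /* current code: the weight is scaled down and vanishes */
          unsigned long load_old = scale_load_down(weight) > load_avg ?
                                   scale_load_down(weight) : load_avg;

          /* rearranged: scale the shares down, scale the load_avg up */
          unsigned long tg_shares = scale_load_down(tg_shares_raw);
          unsigned long load_new = weight > scale_load(load_avg) ?
                                   weight : scale_load(load_avg);

          printf("old: tg_shares=%lu load=%lu numerator=%lu\n",
                 tg_shares_raw, load_old, tg_shares_raw * load_old);
          printf("new: tg_shares=%lu load=%lu numerator=%lu\n",
                 tg_shares, load_new, tg_shares * load_new);
          return 0;
  }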

* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-04  9:52       ` Peter Zijlstra
  2020-03-04 11:55         ` Vincent Guittot
@ 2020-03-05  1:08         ` 王贇
  1 sibling, 0 replies; 21+ messages in thread
From: 王贇 @ 2020-03-05  1:08 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, open list:SCHEDULER



On 2020/3/4 5:52 PM, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 09:47:34AM +0100, Vincent Guittot wrote:
>> you will add +1 of nice prio for each device
>>
>> should we use instead
>> # define scale_load_down(w) ((w >> SCHED_FIXEDPOINT_SHIFT) ? (w >>
>> SCHED_FIXEDPOINT_SHIFT) : MIN_SHARES)
> 
> That's '((w >> SHIFT) ?: MIN_SHARES)', but even that is not quite right.
> 
> I think we want something like:
> 
> #define scale_load_down(w) \
> ({ unsigned long ___w = (w); \
>    if (___w) \
>      ___w = max(MIN_SHARES, ___w >> SHIFT); \
>    ___w; })
> 
> That is, we very much want to retain 0 I'm thinking.

Should work; I'll give this one a test and send another fix :-)

Regards,
Michael Wang

> 


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-04 18:47   ` bsegall
@ 2020-03-05  1:14     ` 王贇
  2020-03-05  7:53       ` Vincent Guittot
  2020-03-06 19:17       ` bsegall
  0 siblings, 2 replies; 21+ messages in thread
From: 王贇 @ 2020-03-05  1:14 UTC (permalink / raw)
  To: bsegall, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, open list:SCHEDULER



On 2020/3/5 2:47 AM, bsegall@google.com wrote:
[snip]
>> Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
>> B->shares/nr_cpus.
>>
>>> While the se of D on root cfs_rq is far more bigger than 2, so it
>>> wins the battle.
>>>
>>> This patch add a check on the zero load and make it as MIN_SHARES
>>> to fix the nonsense shares, after applied the group C wins as
>>> expected.
>>>
>>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
>>> ---
>>>  kernel/sched/fair.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 84594f8aeaf8..53d705f75fa4 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>>>  	tg_shares = READ_ONCE(tg->shares);
>>>
>>>  	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>>> +	if (!load && cfs_rq->load.weight)
>>> +		load = MIN_SHARES;
>>>
>>>  	tg_weight = atomic_long_read(&tg->load_avg);
>>
>> Yeah, I suppose that'll do. Hurmph, wants a comment though.
>>
>> But that has me looking at other users of scale_load_down(), and doesn't
>> at least update_tg_cfs_load() suffer the same problem?
> 
> I think instead we should probably scale_load_down(tg_shares) and
> scale_load(load_avg). tg_shares is always a scaled integer, so just
> moving the source of the scaling in the multiply should do the job.
> 
> ie
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fcc968669aea..6d7a9d72d742 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3179,9 +3179,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>         long tg_weight, tg_shares, load, shares;
>         struct task_group *tg = cfs_rq->tg;
>  
> -       tg_shares = READ_ONCE(tg->shares);
> +       tg_shares = scale_load_down(READ_ONCE(tg->shares));
>  
> -       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> +       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
>  
>         tg_weight = atomic_long_read(&tg->load_avg);

Got the point, but IMHO fixing scale_load_down() sounds better, to
cover all the similar cases; let's first try that way and see if it
works :-)

Regards,
Michael Wang

>  
> 
> 


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-04  9:43       ` Vincent Guittot
@ 2020-03-05  1:23         ` 王贇
  0 siblings, 0 replies; 21+ messages in thread
From: 王贇 @ 2020-03-05  1:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, open list:SCHEDULER



On 2020/3/4 5:43 PM, Vincent Guittot wrote:
> On Wed, 4 Mar 2020 at 09:47, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>
>> On Wed, 4 Mar 2020 at 02:19, 王贇 <yun.wang@linux.alibaba.com> wrote:
>>>
>>>
>>>
>>> On 2020/3/4 上午3:52, Peter Zijlstra wrote:
>>> [snip]
>>>>> The reason is because we have group B with shares as 2, which make
>>>>> the group A 'cfs_rq->load.weight' very small.
>>>>>
>>>>> And in calc_group_shares() we calculate shares as:
>>>>>
>>>>>   load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>>>>>   shares = (tg_shares * load) / tg_weight;
>>>>>
>>>>> Since the 'cfs_rq->load.weight' is too small, the load become 0
>>>>> in here, although 'tg_shares' is 102400, shares of the se which
>>>>> stand for group A on root cfs_rq become 2.
>>>>
>>>> Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
>>>> B->shares/nr_cpus.
>>>
>>> Yeah, that's exactly why it happens, even the share 2 scale up to 2048,
>>> on 96 CPUs platform, each CPU get only 21 in equal case.
>>>
>>>>
>>>>> While the se of D on root cfs_rq is far more bigger than 2, so it
>>>>> wins the battle.
>>>>>
>>>>> This patch add a check on the zero load and make it as MIN_SHARES
>>>>> to fix the nonsense shares, after applied the group C wins as
>>>>> expected.
>>>>>
>>>>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
>>>>> ---
>>>>>  kernel/sched/fair.c | 2 ++
>>>>>  1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 84594f8aeaf8..53d705f75fa4 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>>>>>      tg_shares = READ_ONCE(tg->shares);
>>>>>
>>>>>      load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>>>>> +    if (!load && cfs_rq->load.weight)
>>>>> +            load = MIN_SHARES;
>>>>>
>>>>>      tg_weight = atomic_long_read(&tg->load_avg);
>>>>
>>>> Yeah, I suppose that'll do. Hurmph, wants a comment though.
>>>>
>>>> But that has me looking at other users of scale_load_down(), and doesn't
>>>> at least update_tg_cfs_load() suffer the same problem?
>>>
>>> Good point :-) I'm not sure but is scale_load_down() supposed to scale small
>>> value into 0? If not, maybe we should fix the helper to make sure it at
>>> least return some real load? like:
>>>
>>> # define scale_load_down(w) ((w + (1 << SCHED_FIXEDPOINT_SHIFT)) >> SCHED_FIXEDPOINT_SHIFT)
>>
>> you will add +1 of nice prio for each device
> 
> Of course, it's not prio but only weight which is different

That's right, we should only handle the problematic cases.

Regards,
Michael Wang

> 
>>
>> should we use instead
>> # define scale_load_down(w) ((w >> SCHED_FIXEDPOINT_SHIFT) ? (w >>
>> SCHED_FIXEDPOINT_SHIFT) : MIN_SHARES)
>>
>> Regards,
>> Vincent
>>
>>>
>>> Regards,
>>> Michael Wang
>>>
>>>>


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-05  1:14     ` 王贇
@ 2020-03-05  7:53       ` Vincent Guittot
  2020-03-06  4:23         ` 王贇
  2020-03-06 19:17       ` bsegall
  1 sibling, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2020-03-05  7:53 UTC (permalink / raw)
  To: 王贇
  Cc: Ben Segall, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	open list:SCHEDULER

On Thu, 5 Mar 2020 at 02:14, 王贇 <yun.wang@linux.alibaba.com> wrote:
>
>
>
> On 2020/3/5 上午2:47, bsegall@google.com wrote:
> [snip]
> >> Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
> >> B->shares/nr_cpus.
> >>
> >>> While the se of D on root cfs_rq is far more bigger than 2, so it
> >>> wins the battle.
> >>>
> >>> This patch add a check on the zero load and make it as MIN_SHARES
> >>> to fix the nonsense shares, after applied the group C wins as
> >>> expected.
> >>>
> >>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> >>> ---
> >>>  kernel/sched/fair.c | 2 ++
> >>>  1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 84594f8aeaf8..53d705f75fa4 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
> >>>     tg_shares = READ_ONCE(tg->shares);
> >>>
> >>>     load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> >>> +   if (!load && cfs_rq->load.weight)
> >>> +           load = MIN_SHARES;
> >>>
> >>>     tg_weight = atomic_long_read(&tg->load_avg);
> >>
> >> Yeah, I suppose that'll do. Hurmph, wants a comment though.
> >>
> >> But that has me looking at other users of scale_load_down(), and doesn't
> >> at least update_tg_cfs_load() suffer the same problem?
> >
> > I think instead we should probably scale_load_down(tg_shares) and
> > scale_load(load_avg). tg_shares is always a scaled integer, so just
> > moving the source of the scaling in the multiply should do the job.
> >
> > ie
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fcc968669aea..6d7a9d72d742 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3179,9 +3179,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
> >         long tg_weight, tg_shares, load, shares;
> >         struct task_group *tg = cfs_rq->tg;
> >
> > -       tg_shares = READ_ONCE(tg->shares);
> > +       tg_shares = scale_load_down(READ_ONCE(tg->shares));
> >
> > -       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> > +       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
> >
> >         tg_weight = atomic_long_read(&tg->load_avg);
>
> Get the point, but IMHO fix scale_load_down() sounds better, to
> cover all the similar cases, let's first try that way see if it's
> working :-)

The problem with this solution is that the avg.load_avg of the gse or
cfs_rq might stay at 0, because it uses
scale_load_down(se/cfs_rq->load.weight)

>
> Regards,
> Michael Wang
>
> >
> >
> >


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-05  7:53       ` Vincent Guittot
@ 2020-03-06  4:23         ` 王贇
  2020-03-06  8:04           ` Vincent Guittot
  0 siblings, 1 reply; 21+ messages in thread
From: 王贇 @ 2020-03-06  4:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ben Segall, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	open list:SCHEDULER



On 2020/3/5 3:53 PM, Vincent Guittot wrote:
> On Thu, 5 Mar 2020 at 02:14, 王贇 <yun.wang@linux.alibaba.com> wrote:
[snip]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index fcc968669aea..6d7a9d72d742 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3179,9 +3179,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>>>         long tg_weight, tg_shares, load, shares;
>>>         struct task_group *tg = cfs_rq->tg;
>>>
>>> -       tg_shares = READ_ONCE(tg->shares);
>>> +       tg_shares = scale_load_down(READ_ONCE(tg->shares));
>>>
>>> -       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>>> +       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
>>>
>>>         tg_weight = atomic_long_read(&tg->load_avg);
>>
>> Get the point, but IMHO fix scale_load_down() sounds better, to
>> cover all the similar cases, let's first try that way see if it's
>> working :-)
> 
> The problem with this solution is that the avg.load_avg of gse or
> cfs_rq might stay to 0 because it uses
> scale_load_down(se/cfs_rq->load.weight)

Will cfs_rq->load.weight be zero too without the scale-down?

If cfs_rq->load.weight holds at least something, the load will not be
zero after picking the max, correct?

Regards,
Michael Wang

> 
>>
>> Regards,
>> Michael Wang
>>
>>>
>>>
>>>


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-06  4:23         ` 王贇
@ 2020-03-06  8:04           ` Vincent Guittot
  2020-03-06  9:34             ` 王贇
  0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2020-03-06  8:04 UTC (permalink / raw)
  To: 王贇
  Cc: Ben Segall, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	open list:SCHEDULER

On Fri, 6 Mar 2020 at 05:23, 王贇 <yun.wang@linux.alibaba.com> wrote:
>
>
>
> On 2020/3/5 下午3:53, Vincent Guittot wrote:
> > On Thu, 5 Mar 2020 at 02:14, 王贇 <yun.wang@linux.alibaba.com> wrote:
> [snip]
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index fcc968669aea..6d7a9d72d742 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -3179,9 +3179,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
> >>>         long tg_weight, tg_shares, load, shares;
> >>>         struct task_group *tg = cfs_rq->tg;
> >>>
> >>> -       tg_shares = READ_ONCE(tg->shares);
> >>> +       tg_shares = scale_load_down(READ_ONCE(tg->shares));
> >>>
> >>> -       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> >>> +       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
> >>>
> >>>         tg_weight = atomic_long_read(&tg->load_avg);
> >>
> >> Get the point, but IMHO fix scale_load_down() sounds better, to
> >> cover all the similar cases, let's first try that way see if it's
> >> working :-)
> >
> > The problem with this solution is that the avg.load_avg of gse or
> > cfs_rq might stay to 0 because it uses
> > scale_load_down(se/cfs_rq->load.weight)
>
> Will cfs_rq->load.weight be zero too without scale down?

cfs_rq->load.weight will never be 0, its min is 2

>
> If cfs_rq->load.weight got at least something, the load will not be
> zero after pick the max, correct?

But the cfs_rq->avg.load_avg will never be anything other than 0,
whether there are heavy or light tasks in the group

>
> Regards,
> Michael Wang
>
> >
> >>
> >> Regards,
> >> Michael Wang
> >>
> >>>
> >>>
> >>>

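A rough model of why the group's load_avg is pinned at 0 here: PELT scales
its load tracking by the scaled-down weight, so once that weight truncates
to 0 the average cannot grow no matter how busy the group is. The formula
below is a deliberate simplification (the decaying sums are collapsed into
a plain runnable percentage), and the per-CPU weights for A's and D's
cfs_rq are assumed from the shares/nr_cpus reasoning earlier in the thread.

  #include <stdio.h>

  #define SCHED_FIXEDPOINT_SHIFT  10
  #define scale_load_down(w)      ((w) >> SCHED_FIXEDPOINT_SHIFT)

  int main(void)
  {
          /* ~2048/96 for A's cfs_rq, ~1048576/96 for D's */
          unsigned long weights[] = { 21, 10922 };
          unsigned long runnable_pct = 100;       /* busy-looping benchmark */
          int i;

          for (i = 0; i < 2; i++)
                  printf("scaled weight %5lu -> load_avg ~ %lu\n",
                         weights[i],
                         scale_load_down(weights[i]) * runnable_pct / 100);
          return 0;
  }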

* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-06  8:04           ` Vincent Guittot
@ 2020-03-06  9:34             ` 王贇
  0 siblings, 0 replies; 21+ messages in thread
From: 王贇 @ 2020-03-06  9:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ben Segall, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	open list:SCHEDULER

On 2020/3/6 4:04 PM, Vincent Guittot wrote:
> On Fri, 6 Mar 2020 at 05:23, 王贇 <yun.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2020/3/5 下午3:53, Vincent Guittot wrote:
>>> On Thu, 5 Mar 2020 at 02:14, 王贇 <yun.wang@linux.alibaba.com> wrote:
>> [snip]
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index fcc968669aea..6d7a9d72d742 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -3179,9 +3179,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>>>>>         long tg_weight, tg_shares, load, shares;
>>>>>         struct task_group *tg = cfs_rq->tg;
>>>>>
>>>>> -       tg_shares = READ_ONCE(tg->shares);
>>>>> +       tg_shares = scale_load_down(READ_ONCE(tg->shares));
>>>>>
>>>>> -       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>>>>> +       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
>>>>>
>>>>>         tg_weight = atomic_long_read(&tg->load_avg);
>>>>
>>>> Get the point, but IMHO fix scale_load_down() sounds better, to
>>>> cover all the similar cases, let's first try that way see if it's
>>>> working :-)
>>>
>>> The problem with this solution is that the avg.load_avg of gse or
>>> cfs_rq might stay to 0 because it uses
>>> scale_load_down(se/cfs_rq->load.weight)
>>
>> Will cfs_rq->load.weight be zero too without scale down?
> 
> cfs_rq->load.weight will never be 0, it's min is 2
> 
>>
>> If cfs_rq->load.weight got at least something, the load will not be
>> zero after pick the max, correct?
> 
> But the cfs_rq->avg.load_avg will never be other than 0 what ever
> there are heavy or light tasks in the group

Aha, got the point now :-)

BTW, would you like to give a review on
  [PATCH] sched: avoid scale real weight down to zero
please?

Regards,
Michael Wang


> 
>>
>> Regards,
>> Michael Wang
>>
>>>
>>>>
>>>> Regards,
>>>> Michael Wang
>>>>
>>>>>
>>>>>
>>>>>


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-05  1:14     ` 王贇
  2020-03-05  7:53       ` Vincent Guittot
@ 2020-03-06 19:17       ` bsegall
  2020-03-09 11:15         ` Vincent Guittot
  1 sibling, 1 reply; 21+ messages in thread
From: bsegall @ 2020-03-06 19:17 UTC (permalink / raw)
  To: 王贇
  Cc: bsegall, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	open list:SCHEDULER

王贇 <yun.wang@linux.alibaba.com> writes:

> On 2020/3/5 上午2:47, bsegall@google.com wrote:
> [snip]
>>> Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
>>> B->shares/nr_cpus.
>>>
>>>> While the se of D on root cfs_rq is far more bigger than 2, so it
>>>> wins the battle.
>>>>
>>>> This patch add a check on the zero load and make it as MIN_SHARES
>>>> to fix the nonsense shares, after applied the group C wins as
>>>> expected.
>>>>
>>>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
>>>> ---
>>>>  kernel/sched/fair.c | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 84594f8aeaf8..53d705f75fa4 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>>>>  	tg_shares = READ_ONCE(tg->shares);
>>>>
>>>>  	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>>>> +	if (!load && cfs_rq->load.weight)
>>>> +		load = MIN_SHARES;
>>>>
>>>>  	tg_weight = atomic_long_read(&tg->load_avg);
>>>
>>> Yeah, I suppose that'll do. Hurmph, wants a comment though.
>>>
>>> But that has me looking at other users of scale_load_down(), and doesn't
>>> at least update_tg_cfs_load() suffer the same problem?
>> 
>> I think instead we should probably scale_load_down(tg_shares) and
>> scale_load(load_avg). tg_shares is always a scaled integer, so just
>> moving the source of the scaling in the multiply should do the job.
>> 
>> ie
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fcc968669aea..6d7a9d72d742 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3179,9 +3179,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>>         long tg_weight, tg_shares, load, shares;
>>         struct task_group *tg = cfs_rq->tg;
>>  
>> -       tg_shares = READ_ONCE(tg->shares);
>> +       tg_shares = scale_load_down(READ_ONCE(tg->shares));
>>  
>> -       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>> +       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
>>  
>>         tg_weight = atomic_long_read(&tg->load_avg);
>
> Get the point, but IMHO fix scale_load_down() sounds better, to
> cover all the similar cases, let's first try that way see if it's
> working :-)

Yeah, that might not be a bad idea as well; it's just that doing this
fix would keep you from losing all your precision (and I'd have to think
if that would result in fairness issues like having all the group ses
having the full tg shares, or something like that).


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-06 19:17       ` bsegall
@ 2020-03-09 11:15         ` Vincent Guittot
  2020-03-10  3:42           ` 王贇
  0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2020-03-09 11:15 UTC (permalink / raw)
  To: Ben Segall
  Cc: 王贇,
	Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, open list:SCHEDULER

On Fri, 6 Mar 2020 at 20:17, <bsegall@google.com> wrote:
>
> 王贇 <yun.wang@linux.alibaba.com> writes:
>
> > On 2020/3/5 上午2:47, bsegall@google.com wrote:
> > [snip]
> >>> Argh, because A->cfs_rq.load.weight is B->se.load.weight which is
> >>> B->shares/nr_cpus.
> >>>
> >>>> While the se of D on root cfs_rq is far more bigger than 2, so it
> >>>> wins the battle.
> >>>>
> >>>> This patch add a check on the zero load and make it as MIN_SHARES
> >>>> to fix the nonsense shares, after applied the group C wins as
> >>>> expected.
> >>>>
> >>>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> >>>> ---
> >>>>  kernel/sched/fair.c | 2 ++
> >>>>  1 file changed, 2 insertions(+)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 84594f8aeaf8..53d705f75fa4 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -3182,6 +3182,8 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
> >>>>    tg_shares = READ_ONCE(tg->shares);
> >>>>
> >>>>    load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> >>>> +  if (!load && cfs_rq->load.weight)
> >>>> +          load = MIN_SHARES;
> >>>>
> >>>>    tg_weight = atomic_long_read(&tg->load_avg);
> >>>
> >>> Yeah, I suppose that'll do. Hurmph, wants a comment though.
> >>>
> >>> But that has me looking at other users of scale_load_down(), and doesn't
> >>> at least update_tg_cfs_load() suffer the same problem?
> >>
> >> I think instead we should probably scale_load_down(tg_shares) and
> >> scale_load(load_avg). tg_shares is always a scaled integer, so just
> >> moving the source of the scaling in the multiply should do the job.
> >>
> >> ie
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index fcc968669aea..6d7a9d72d742 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3179,9 +3179,9 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
> >>         long tg_weight, tg_shares, load, shares;
> >>         struct task_group *tg = cfs_rq->tg;
> >>
> >> -       tg_shares = READ_ONCE(tg->shares);
> >> +       tg_shares = scale_load_down(READ_ONCE(tg->shares));
> >>
> >> -       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> >> +       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
> >>
> >>         tg_weight = atomic_long_read(&tg->load_avg);
> >
> > Get the point, but IMHO fix scale_load_down() sounds better, to
> > cover all the similar cases, let's first try that way see if it's
> > working :-)
>
> Yeah, that might not be a bad idea as well; it's just that doing this
> fix would keep you from losing all your precision (and I'd have to think
> if that would result in fairness issues like having all the group ses
> having the full tg shares, or something like that).

AFAICT, we already have a fairness problem here, because
scale_load_down() is used in calc_delta_fair(), so all sched groups
that have a weight lower than 1024 will end up with the same increase
of their vruntime when running.
Then the load_avg is used to balance between rqs, so load_balance will
ensure at least 1 task per CPU but not more, because the load_avg that
is then used will stay null.

That being said, having a min of 2 for scale_load_down() will enable us
to have tg->load_avg != 0, so a tg_weight != 0, and each sched group
will not get the full shares. But it will make those groups completely
fair anyway.
The best solution would be not to scale down the weight, but that's a
bigger change.

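A simplified model of the fairness point above: once scale_load_down()
clamps to a minimum of 2, every group se whose scaled weight is below 2048
advances vruntime at the same rate, so such groups end up completely fair
among themselves regardless of their configured shares. The arithmetic
below deliberately ignores the kernel's inv_weight fixed-point machinery
and just divides by the clamped weight.

  #include <stdio.h>

  #define SCHED_FIXEDPOINT_SHIFT  10
  #define MIN_SHARES              2UL

  /* the "min of 2" variant of the helper discussed above */
  static unsigned long scale_load_down(unsigned long w)
  {
          unsigned long down = w >> SCHED_FIXEDPOINT_SHIFT;

          return w ? (down > MIN_SHARES ? down : MIN_SHARES) : 0;
  }

  int main(void)
  {
          unsigned long delta_exec = 1000000;     /* 1ms of runtime, in ns */
          unsigned long nice0 = 1024;             /* scaled-down NICE_0_LOAD */
          unsigned long weights[] = { 21, 500, 2047, 4096 };
          int i;

          for (i = 0; i < 4; i++)
                  printf("scaled weight %4lu -> vruntime delta %lu\n",
                         weights[i],
                         delta_exec * nice0 / scale_load_down(weights[i]));
          return 0;
  }

All weights that clamp to the same value advance vruntime identically,
which is the "completely fair anyway" behaviour described above.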

* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-09 11:15         ` Vincent Guittot
@ 2020-03-10  3:42           ` 王贇
  2020-03-10  7:57             ` Vincent Guittot
  0 siblings, 1 reply; 21+ messages in thread
From: 王贇 @ 2020-03-10  3:42 UTC (permalink / raw)
  To: Vincent Guittot, Ben Segall
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, open list:SCHEDULER



On 2020/3/9 7:15 PM, Vincent Guittot wrote:
[snip]
>>>> -       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
>>>> +       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
>>>>
>>>>         tg_weight = atomic_long_read(&tg->load_avg);
>>>
>>> Get the point, but IMHO fix scale_load_down() sounds better, to
>>> cover all the similar cases, let's first try that way see if it's
>>> working :-)
>>
>> Yeah, that might not be a bad idea as well; it's just that doing this
>> fix would keep you from losing all your precision (and I'd have to think
>> if that would result in fairness issues like having all the group ses
>> having the full tg shares, or something like that).
> 
> AFAICT, we already have a fairness problem case because
> scale_load_down is used in calc_delta_fair() so all sched groups that
> have a weight lower than 1024 will end up with the same increase of
> their vruntime when running.
> Then the load_avg is used to balance between rq so load_balance will
> ensure at least 1 task per CPU but not more because the load_avg which
> is then used will stay null.
> 
> That being said, having a min of 2 for scale_load_down will enable us
> to have the tg->load_avg != 0 so a tg_weight != 0 and each sched group
> will not have the full shares. But it will make those group completely
> fair anyway.
> The best solution would be not to scale down the weight but that's a
> bigger change

Does that mean changing all those 'load.weight' related calculations,
to preserve the scaled weight?

I suppose u64 is big enough for 'cfs_rq.load' to hold the scaled-up load;
changing all those places could be annoying but still fine.

However, I'm not quite sure about the benefit: how much more precision
will we gain, and does that really matter? Better to have some testing to
demonstrate it.

Regards,
Michael Wang


> 


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-10  3:42           ` 王贇
@ 2020-03-10  7:57             ` Vincent Guittot
  2020-03-10  8:15               ` 王贇
  0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2020-03-10  7:57 UTC (permalink / raw)
  To: 王贇
  Cc: Ben Segall, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	open list:SCHEDULER

On Tue, 10 Mar 2020 at 04:42, 王贇 <yun.wang@linux.alibaba.com> wrote:
>
>
>
> On 2020/3/9 下午7:15, Vincent Guittot wrote:
> [snip]
> >>>> -       load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
> >>>> +       load = max(cfs_rq->load.weight, scale_load(cfs_rq->avg.load_avg));
> >>>>
> >>>>         tg_weight = atomic_long_read(&tg->load_avg);
> >>>
> >>> Get the point, but IMHO fix scale_load_down() sounds better, to
> >>> cover all the similar cases, let's first try that way see if it's
> >>> working :-)
> >>
> >> Yeah, that might not be a bad idea as well; it's just that doing this
> >> fix would keep you from losing all your precision (and I'd have to think
> >> if that would result in fairness issues like having all the group ses
> >> having the full tg shares, or something like that).
> >
> > AFAICT, we already have a fairness problem case because
> > scale_load_down is used in calc_delta_fair() so all sched groups that
> > have a weight lower than 1024 will end up with the same increase of
> > their vruntime when running.
> > Then the load_avg is used to balance between rq so load_balance will
> > ensure at least 1 task per CPU but not more because the load_avg which
> > is then used will stay null.
> >
> > That being said, having a min of 2 for scale_load_down will enable us
> > to have the tg->load_avg != 0 so a tg_weight != 0 and each sched group
> > will not have the full shares. But it will make those group completely
> > fair anyway.
> > The best solution would be not to scale down the weight but that's a
> > bigger change
>
> Does that means a changing for all those 'load.weight' related
> calculation, to reserve the scaled weight?

yes, to make sure the calculations still fit in the variable

>
> I suppose u64 is capable for 'cfs_rq.load' to reserve the scaled up load,
> changing all those places could be annoying but still fine.

it's fine, but the max number of runnable tasks at the max priority on
a cfs_rq will decrease from around 4 billion to "only" 4 million.

>
> However, I'm not quite sure about the benefit, how much more precision
> we'll gain and does that really matters? better to have some testing to
> demonstrate it.

it will ensure better fairness across a larger range of share values. I
agree that we can wonder whether it's worth the effort for those low
share values. Would be interesting to know who uses such low values and
for which purpose

Regards,
Vincent
>
> Regards,
> Michael Wang
>
>
> >


* Re: [RFC PATCH] sched: fix the nonsense shares when load of cfs_rq is too, small
  2020-03-10  7:57             ` Vincent Guittot
@ 2020-03-10  8:15               ` 王贇
  0 siblings, 0 replies; 21+ messages in thread
From: 王贇 @ 2020-03-10  8:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ben Segall, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	open list:SCHEDULER



On 2020/3/10 3:57 PM, Vincent Guittot wrote:
[snip]
>>> That being said, having a min of 2 for scale_load_down will enable us
>>> to have the tg->load_avg != 0 so a tg_weight != 0 and each sched group
>>> will not have the full shares. But it will make those group completely
>>> fair anyway.
>>> The best solution would be not to scale down the weight but that's a
>>> bigger change
>>
>> Does that means a changing for all those 'load.weight' related
>> calculation, to reserve the scaled weight?
> 
> yes, to make sure that calculation still fit in the variable
> 
>>
>> I suppose u64 is capable for 'cfs_rq.load' to reserve the scaled up load,
>> changing all those places could be annoying but still fine.
> 
> it's fine but the max number of runnable tasks at the max priority on
> a cfs_rq  will decrease from around 4 billion to "only" 4 Million.
> 
>>
>> However, I'm not quite sure about the benefit, how much more precision
>> we'll gain and does that really matters? better to have some testing to
>> demonstrate it.
> 
> it will ensure a better fairness in a larger range of share value. I
> agree that we can wonder if it's worth the effort for those low share
> values. Wouldbe interesting to knwo who use such low value and for
> which purpose

AFAIK, the k8s stuff uses a share of 2 for the Best Effort type of Pods,
but that's just because they want them to run only when no other Pods
want to run; they won't be dealing with multiple shares under 1024 and
expecting good precision, I suppose.

Regards,
Michael Wang

> 
> Regards,
> Vincent
>>
>> Regards,
>> Michael Wang
>>
>>
>>>

