linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched/fair: handle case of task_h_load() returning 0
@ 2020-07-02 14:42 Vincent Guittot
  2020-07-02 16:11 ` Valentin Schneider
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Vincent Guittot @ 2020-07-02 14:42 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: valentin.schneider, Vincent Guittot

task_h_load() can return 0 in some situations, such as running stress-ng
mmapfork, which forks thousands of threads, in a sched group on a
224-core system. The load balancer doesn't handle this correctly because
env->imbalance never decreases, so it stops pulling tasks only after
reaching loop_max, which can be equal to the number of running tasks of
the cfs_rq. Make sure that the imbalance is decreased by at least 1.

Misfit task is the other feature that doesn't handle such a situation
correctly, although it's probably harder to hit the problem there
because of the smaller number of CPUs and running tasks on a
heterogeneous system.

We can't simply ensure that task_h_load() returns at least one because
that would imply handling underruns in other places.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fab1d17c575..62747c24aa9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 		return;
 	}
 
-	rq->misfit_task_load = task_h_load(p);
+	/*
+	 * Make sure that misfit_task_load will not be null even if
+	 * task_h_load() returns 0. misfit_task_load is only used to select
+	 * rq with highest load so adding 1 will not modify the result
+	 * of the comparison.
+	 */
+	rq->misfit_task_load = task_h_load(p) + 1;
 }
 
 #else /* CONFIG_SMP */
@@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
 			    env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
 				goto next;
 
+			/*
+			 * Depending of the number of CPUs and tasks and the
+			 * cgroup hierarchy, task_h_load() can return a null
+			 * value. Make sure that env->imbalance decreases
+			 * otherwise detach_tasks() will stop only after
+			 * detaching up to loop_max tasks.
+			 */
+			if (!load)
+				load = 1;
+
 			env->imbalance -= load;
 			break;
 
-- 
2.17.1



* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-02 14:42 [PATCH] sched/fair: handle case of task_h_load() returning 0 Vincent Guittot
@ 2020-07-02 16:11 ` Valentin Schneider
  2020-07-02 16:28   ` Vincent Guittot
  2020-07-08  9:44 ` Dietmar Eggemann
  2020-07-09 13:06 ` Valentin Schneider
  2 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-07-02 16:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel


On 02/07/20 15:42, Vincent Guittot wrote:
> task_h_load() can return 0 in some situations like running stress-ng
> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> system. The load balance doesn't handle this correctly because
> env->imbalance never decreases and it will stop pulling tasks only after
> reaching loop_max, which can be equal to the number of running tasks of
> the cfs. Make sure that imbalance will be decreased by at least 1.
>
> misfit task is the other feature that doesn't handle correctly such
> situation although it's probably more difficult to face the problem
> because of the smaller number of CPUs and running tasks on heterogenous
> system.
>
> We can't simply ensure that task_h_load() returns at least one because it
> would imply to handle underrun in other places.

Nasty one, that...

Random thought: isn't that the kind of thing we have scale_load() and
scale_load_down() for? There are more uses of task_h_load() than I would like
for this, but if we upscale its output (or introduce an upscaled variant),
we could do something like:

---
detach_tasks()
{
        long imbalance = env->imbalance;

        if (env->migration_type == migrate_load)
                imbalance = scale_load(imbalance);

        while (!list_empty(tasks)) {
                /* ... */
                switch (env->migration_type) {
                case migrate_load:
                        load = task_h_load_upscaled(p);
                        /* ... usual bits here ...*/
                        lsub_positive(&env->imbalance, load);
                        break;
                        /* ... */
                }

                if (!scale_load_down(env->imbalance))
                        break;
        }
}
---

It's not perfect, and there's still the misfit situation to sort out -
still, do you think this is something we could go towards?

>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fab1d17c575..62747c24aa9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
>               return;
>       }
>
> -	rq->misfit_task_load = task_h_load(p);
> +	/*
> +	 * Make sure that misfit_task_load will not be null even if
> +	 * task_h_load() returns 0. misfit_task_load is only used to select
> +	 * rq with highest load so adding 1 will not modify the result
> +	 * of the comparison.
> +	 */
> +	rq->misfit_task_load = task_h_load(p) + 1;
>  }
>
>  #else /* CONFIG_SMP */
> @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
>                           env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
>                               goto next;
>
> +			/*
> +			 * Depending of the number of CPUs and tasks and the
> +			 * cgroup hierarchy, task_h_load() can return a null
> +			 * value. Make sure that env->imbalance decreases
> +			 * otherwise detach_tasks() will stop only after
> +			 * detaching up to loop_max tasks.
> +			 */
> +			if (!load)
> +				load = 1;
> +
>                       env->imbalance -= load;
>                       break;


* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-02 16:11 ` Valentin Schneider
@ 2020-07-02 16:28   ` Vincent Guittot
  2020-07-07 13:30     ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2020-07-02 16:28 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel

On Thu, 2 Jul 2020 at 18:11, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> On 02/07/20 15:42, Vincent Guittot wrote:
> > task_h_load() can return 0 in some situations like running stress-ng
> > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> > system. The load balance doesn't handle this correctly because
> > env->imbalance never decreases and it will stop pulling tasks only after
> > reaching loop_max, which can be equal to the number of running tasks of
> > the cfs. Make sure that imbalance will be decreased by at least 1.
> >
> > misfit task is the other feature that doesn't handle correctly such
> > situation although it's probably more difficult to face the problem
> > because of the smaller number of CPUs and running tasks on heterogenous
> > system.
> >
> > We can't simply ensure that task_h_load() returns at least one because it
> > would imply to handle underrun in other places.
>
> Nasty one, that...
>
> Random thought: isn't that the kind of thing we have scale_load() and
> scale_load_down() for? There's more uses of task_h_load() than I would like
> for this, but if we upscale its output (or introduce an upscaled variant),
> we could do something like:
>
> ---
> detach_tasks()
> {
>         long imbalance = env->imbalance;
>
>         if (env->migration_type == migrate_load)
>                 imbalance = scale_load(imbalance);
>
>         while (!list_empty(tasks)) {
>                 /* ... */
>                 switch (env->migration_type) {
>                 case migrate_load:
>                         load = task_h_load_upscaled(p);
>                         /* ... usual bits here ...*/
>                         lsub_positive(&env->imbalance, load);
>                         break;
>                         /* ... */
>                 }
>
>                 if (!scale_load_down(env->imbalance))
>                         break;
>         }
> }
> ---
>
> It's not perfect, and there's still the misfit situation to sort out -
> still, do you think this is something we could go towards?

This will not work on 32-bit systems.

For 64-bit, I have to think a bit more about whether the upscale would
fix all cases and support propagation across a hierarchy. In that case
we could also consider making scale_load()/scale_load_down() a nop all
the time.

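For reference, this is roughly what those helpers look like (simplified
from kernel/sched/sched.h; the exact 64-bit scale_load_down() also
clamps small weights, so take this as a sketch rather than the verbatim
code):

#ifdef CONFIG_64BIT
# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
#else
# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
# define scale_load(w)		(w)
# define scale_load_down(w)	(w)
#endif

On 32-bit both helpers are identity macros, so there is no extra
resolution to upscale into.
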
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 18 +++++++++++++++++-
> >  1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6fab1d17c575..62747c24aa9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> >               return;
> >       }
> >
> > -     rq->misfit_task_load = task_h_load(p);
> > +     /*
> > +      * Make sure that misfit_task_load will not be null even if
> > +      * task_h_load() returns 0. misfit_task_load is only used to select
> > +      * rq with highest load so adding 1 will not modify the result
> > +      * of the comparison.
> > +      */
> > +     rq->misfit_task_load = task_h_load(p) + 1;
> >  }
> >
> >  #else /* CONFIG_SMP */
> > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> >                           env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> >                               goto next;
> >
> > +                     /*
> > +                      * Depending of the number of CPUs and tasks and the
> > +                      * cgroup hierarchy, task_h_load() can return a null
> > +                      * value. Make sure that env->imbalance decreases
> > +                      * otherwise detach_tasks() will stop only after
> > +                      * detaching up to loop_max tasks.
> > +                      */
> > +                     if (!load)
> > +                             load = 1;
> > +
> >                       env->imbalance -= load;
> >                       break;


* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-02 16:28   ` Vincent Guittot
@ 2020-07-07 13:30     ` Vincent Guittot
  2020-07-08 10:34       ` Valentin Schneider
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2020-07-07 13:30 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel

On Thu, 2 Jul 2020 at 18:28, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Thu, 2 Jul 2020 at 18:11, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
> >
> >
> > On 02/07/20 15:42, Vincent Guittot wrote:
> > > task_h_load() can return 0 in some situations like running stress-ng
> > > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> > > system. The load balance doesn't handle this correctly because
> > > env->imbalance never decreases and it will stop pulling tasks only after
> > > reaching loop_max, which can be equal to the number of running tasks of
> > > the cfs. Make sure that imbalance will be decreased by at least 1.
> > >
> > > misfit task is the other feature that doesn't handle correctly such
> > > situation although it's probably more difficult to face the problem
> > > because of the smaller number of CPUs and running tasks on heterogenous
> > > system.
> > >
> > > We can't simply ensure that task_h_load() returns at least one because it
> > > would imply to handle underrun in other places.
> >
> > Nasty one, that...
> >
> > Random thought: isn't that the kind of thing we have scale_load() and
> > scale_load_down() for? There's more uses of task_h_load() than I would like
> > for this, but if we upscale its output (or introduce an upscaled variant),
> > we could do something like:
> >
> > ---
> > detach_tasks()
> > {
> >         long imbalance = env->imbalance;
> >
> >         if (env->migration_type == migrate_load)
> >                 imbalance = scale_load(imbalance);
> >
> >         while (!list_empty(tasks)) {
> >                 /* ... */
> >                 switch (env->migration_type) {
> >                 case migrate_load:
> >                         load = task_h_load_upscaled(p);
> >                         /* ... usual bits here ...*/
> >                         lsub_positive(&env->imbalance, load);
> >                         break;
> >                         /* ... */
> >                 }
> >
> >                 if (!scale_load_down(env->imbalance))
> >                         break;
> >         }
> > }
> > ---
> >
> > It's not perfect, and there's still the misfit situation to sort out -
> > still, do you think this is something we could go towards?
>
> This will not work for 32bits system.
>
> For 64bits, I have to think a bit more if the upscale would fix all
> cases and support propagation across a hierarchy. And in this case we
> could also consider to make scale_load/scale_load_down a nop all the
> time

In addition to the problem remaining on 32-bit, the problem can still
happen after extending the scale, so the current patch still makes
sense.

Then, if we want to reduce the cases where task_h_load() returns 0, we
had better make scale_load_down() a nop; otherwise we will have to
maintain two values, h_load and a scaled h_load, across the hierarchy.

>
> >
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >  kernel/sched/fair.c | 18 +++++++++++++++++-
> > >  1 file changed, 17 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6fab1d17c575..62747c24aa9e 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > >               return;
> > >       }
> > >
> > > -     rq->misfit_task_load = task_h_load(p);
> > > +     /*
> > > +      * Make sure that misfit_task_load will not be null even if
> > > +      * task_h_load() returns 0. misfit_task_load is only used to select
> > > +      * rq with highest load so adding 1 will not modify the result
> > > +      * of the comparison.
> > > +      */
> > > +     rq->misfit_task_load = task_h_load(p) + 1;
> > >  }
> > >
> > >  #else /* CONFIG_SMP */
> > > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> > >                           env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> > >                               goto next;
> > >
> > > +                     /*
> > > +                      * Depending of the number of CPUs and tasks and the
> > > +                      * cgroup hierarchy, task_h_load() can return a null
> > > +                      * value. Make sure that env->imbalance decreases
> > > +                      * otherwise detach_tasks() will stop only after
> > > +                      * detaching up to loop_max tasks.
> > > +                      */
> > > +                     if (!load)
> > > +                             load = 1;
> > > +
> > >                       env->imbalance -= load;
> > >                       break;


* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-02 14:42 [PATCH] sched/fair: handle case of task_h_load() returning 0 Vincent Guittot
  2020-07-02 16:11 ` Valentin Schneider
@ 2020-07-08  9:44 ` Dietmar Eggemann
  2020-07-08  9:47   ` Vincent Guittot
  2020-07-09 13:06 ` Valentin Schneider
  2 siblings, 1 reply; 11+ messages in thread
From: Dietmar Eggemann @ 2020-07-08  9:44 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: valentin.schneider

On 02/07/2020 16:42, Vincent Guittot wrote:
> task_h_load() can return 0 in some situations like running stress-ng
> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> system. The load balance doesn't handle this correctly because

I guess the issue here is that 'cfs_rq->h_load' in 

task_h_load() {
    struct cfs_rq *cfs_rq = task_cfs_rq(p);
    ...   
    return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
                    cfs_rq_load_avg(cfs_rq) + 1);
}

is still ~0 (or at least pretty small) compared to se.avg.load_avg being
1024 and cfs_rq_load_avg(cfs_rq) being n*1024 in these load-balance
occurrences.
 
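As a rough illustration with made-up but representative numbers: with
the default tg.shares of 1024 spread over 224 CPUs, cfs_rq->h_load is
~1024/224 = 4, so with only 10 forked tasks of load ~1024 queued on that
group cfs_rq we already get

    div64_ul(1024 * 4, 10 * 1024 + 1) = 4096 / 10241 = 0

i.e. the divide truncates to 0 even though the task itself carries full
load.
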
> env->imbalance never decreases and it will stop pulling tasks only after
> reaching loop_max, which can be equal to the number of running tasks of
> the cfs. Make sure that imbalance will be decreased by at least 1.
> 
> misfit task is the other feature that doesn't handle correctly such
> situation although it's probably more difficult to face the problem
> because of the smaller number of CPUs and running tasks on heterogenous
> system.
> 
> We can't simply ensure that task_h_load() returns at least one because it
> would imply to handle underrun in other places.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fab1d17c575..62747c24aa9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
>  		return;
>  	}
>  
> -	rq->misfit_task_load = task_h_load(p);
> +	/*
> +	 * Make sure that misfit_task_load will not be null even if
> +	 * task_h_load() returns 0. misfit_task_load is only used to select
> +	 * rq with highest load so adding 1 will not modify the result
> +	 * of the comparison.
> +	 */
> +	rq->misfit_task_load = task_h_load(p) + 1;
>  }
>  
>  #else /* CONFIG_SMP */
> @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
>  			    env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
>  				goto next;
>  
> +			/*
> +			 * Depending of the number of CPUs and tasks and the
> +			 * cgroup hierarchy, task_h_load() can return a null
> +			 * value. Make sure that env->imbalance decreases
> +			 * otherwise detach_tasks() will stop only after
> +			 * detaching up to loop_max tasks.
> +			 */
> +			if (!load)
> +				load = 1;
> +
>  			env->imbalance -= load;
>  			break;

I assume that this is related to the LKP mail

https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?

I ran the test (5.8.0-rc4 w/o vs. w/ your patch) on 'Intel(R) Xeon(R)
CPU E5-2690 v2 @ 3.00GHz' (2*2*10, 40 CPUs).
I can't see changes of the magnitude shown in the email above (they
used a 96-CPU system though).
I used only scheduler stressor mmapfork in taskgroup /A/B/C:

 stress-ng --timeout 1 --times --verify --metrics-brief --sequential 40 --class scheduler --exclude (all except mmapfork)

5.8.0-rc4-custom-dieegg01-stress-ng-base

stress-ng: info:  [3720]  stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [3720]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [3720]  mmapfork            40      1.98     12.53     71.12        20.21         0.48
stress-ng: info:  [5201]  mmapfork            40      2.50     13.10     98.61        16.01         0.36
stress-ng: info:  [6682]  mmapfork            40      2.58     14.80     98.63        15.88         0.36
stress-ng: info:  [8195]  mmapfork            40      1.79     12.57     61.61        22.31         0.54
stress-ng: info:  [9679]  mmapfork            40      2.20     12.17     82.66        18.20         0.42
stress-ng: info:  [11164] mmapfork            40      2.61     15.09    102.86        16.86         0.37
stress-ng: info:  [12773] mmapfork            40      1.89     12.32     65.09        21.15         0.52
stress-ng: info:  [3883]  mmapfork            40      2.14     12.90     76.73        18.68         0.45
stress-ng: info:  [6845]  mmapfork            40      2.25     11.83     84.06        17.80         0.42
stress-ng: info:  [8326]  mmapfork            40      1.76     12.93     56.65        22.70         0.57

                                                                                Mean: 18.98 (σ: 2.369)
5.8.0-rc4-custom-dieegg01-stress-ng

stress-ng: info:  [3895]  mmapfork            40      2.40     13.56     92.83        16.67         0.38
stress-ng: info:  [5379]  mmapfork            40      2.08     13.65     74.11        19.23         0.46
stress-ng: info:  [6860]  mmapfork            40      2.15     13.72     80.24        18.62         0.43
stress-ng: info:  [8341]  mmapfork            40      2.37     13.74     90.93        16.85         0.38
stress-ng: info:  [9822]  mmapfork            40      2.10     12.48     83.85        19.09         0.42
stress-ng: info:  [13816] mmapfork            40      2.05     12.13     77.64        19.49         0.45
stress-ng: info:  [15297] mmapfork            40      2.53     13.16    100.26        15.84         0.35
stress-ng: info:  [16780] mmapfork            40      2.00     12.10     71.25        20.02         0.48
stress-ng: info:  [18262] mmapfork            40      1.73     12.24     57.69        23.09         0.57
stress-ng: info:  [19743] mmapfork            40      1.78     12.51     57.89        22.48         0.57

                                                                                Mean: 19.14 (σ: 2.239)







* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-08  9:44 ` Dietmar Eggemann
@ 2020-07-08  9:47   ` Vincent Guittot
  2020-07-09 13:34     ` Dietmar Eggemann
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2020-07-08  9:47 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Valentin Schneider

On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 02/07/2020 16:42, Vincent Guittot wrote:
> > task_h_load() can return 0 in some situations like running stress-ng
> > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> > system. The load balance doesn't handle this correctly because
>
> I guess the issue here is that 'cfs_rq->h_load' in
>
> task_h_load() {
>     struct cfs_rq *cfs_rq = task_cfs_rq(p);
>     ...
>     return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>                     cfs_rq_load_avg(cfs_rq) + 1);
> }
>
> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
> 1024 and cfs_rq_load_avg(cfs_rq) n*1024 in these lb occurrences.
>
> > env->imbalance never decreases and it will stop pulling tasks only after
> > reaching loop_max, which can be equal to the number of running tasks of
> > the cfs. Make sure that imbalance will be decreased by at least 1.
> >
> > misfit task is the other feature that doesn't handle correctly such
> > situation although it's probably more difficult to face the problem
> > because of the smaller number of CPUs and running tasks on heterogenous
> > system.
> >
> > We can't simply ensure that task_h_load() returns at least one because it
> > would imply to handle underrun in other places.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 18 +++++++++++++++++-
> >  1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6fab1d17c575..62747c24aa9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> >               return;
> >       }
> >
> > -     rq->misfit_task_load = task_h_load(p);
> > +     /*
> > +      * Make sure that misfit_task_load will not be null even if
> > +      * task_h_load() returns 0. misfit_task_load is only used to select
> > +      * rq with highest load so adding 1 will not modify the result
> > +      * of the comparison.
> > +      */
> > +     rq->misfit_task_load = task_h_load(p) + 1;
> >  }
> >
> >  #else /* CONFIG_SMP */
> > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> >                           env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> >                               goto next;
> >
> > +                     /*
> > +                      * Depending of the number of CPUs and tasks and the
> > +                      * cgroup hierarchy, task_h_load() can return a null
> > +                      * value. Make sure that env->imbalance decreases
> > +                      * otherwise detach_tasks() will stop only after
> > +                      * detaching up to loop_max tasks.
> > +                      */
> > +                     if (!load)
> > +                             load = 1;
> > +
> >                       env->imbalance -= load;
> >                       break;
>
> I assume that this is related to the LKP mail

I found this problem while studying the regression raised in the email
below, but this patch doesn't fix it. At least, it's not enough.

>
> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?
>
> I ran the test (5.8.0-rc4 w/o vs. w/ your patch) on 'Intel(R) Xeon(R)
> CPU E5-2690 v2 @ 3.00GHz' (2*2*10, 40 CPUs).
> I can't see the changes in the magnitude shown in the email above (they
> used a 96 CPU system though).
> I used only scheduler stressor mmapfork in taskgroup /A/B/C:
>
>  stress-ng --timeout 1 --times --verify --metrics-brief --sequential 40 --class scheduler --exclude (all except mmapfork)
>
> 5.8.0-rc4-custom-dieegg01-stress-ng-base
>
> stress-ng: info:  [3720]  stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
> stress-ng: info:  [3720]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
> stress-ng: info:  [3720]  mmapfork            40      1.98     12.53     71.12        20.21         0.48
> stress-ng: info:  [5201]  mmapfork            40      2.50     13.10     98.61        16.01         0.36
> stress-ng: info:  [6682]  mmapfork            40      2.58     14.80     98.63        15.88         0.36
> stress-ng: info:  [8195]  mmapfork            40      1.79     12.57     61.61        22.31         0.54
> stress-ng: info:  [9679]  mmapfork            40      2.20     12.17     82.66        18.20         0.42
> stress-ng: info:  [11164] mmapfork            40      2.61     15.09    102.86        16.86         0.37
> stress-ng: info:  [12773] mmapfork            40      1.89     12.32     65.09        21.15         0.52
> stress-ng: info:  [3883]  mmapfork            40      2.14     12.90     76.73        18.68         0.45
> stress-ng: info:  [6845]  mmapfork            40      2.25     11.83     84.06        17.80         0.42
> stress-ng: info:  [8326]  mmapfork            40      1.76     12.93     56.65        22.70         0.57
>
>                                                                                 Mean: 18.98 (σ: 2.369)
> 5.8.0-rc4-custom-dieegg01-stress-ng
>
> stress-ng: info:  [3895]  mmapfork            40      2.40     13.56     92.83        16.67         0.38
> stress-ng: info:  [5379]  mmapfork            40      2.08     13.65     74.11        19.23         0.46
> stress-ng: info:  [6860]  mmapfork            40      2.15     13.72     80.24        18.62         0.43
> stress-ng: info:  [8341]  mmapfork            40      2.37     13.74     90.93        16.85         0.38
> stress-ng: info:  [9822]  mmapfork            40      2.10     12.48     83.85        19.09         0.42
> stress-ng: info:  [13816] mmapfork            40      2.05     12.13     77.64        19.49         0.45
> stress-ng: info:  [15297] mmapfork            40      2.53     13.16    100.26        15.84         0.35
> stress-ng: info:  [16780] mmapfork            40      2.00     12.10     71.25        20.02         0.48
> stress-ng: info:  [18262] mmapfork            40      1.73     12.24     57.69        23.09         0.57
> stress-ng: info:  [19743] mmapfork            40      1.78     12.51     57.89        22.48         0.57
>
>                                                                                 Mean: 19.14 (σ: 2.239)
>
>
>
>
>


* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-07 13:30     ` Vincent Guittot
@ 2020-07-08 10:34       ` Valentin Schneider
  0 siblings, 0 replies; 11+ messages in thread
From: Valentin Schneider @ 2020-07-08 10:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel


On 07/07/20 14:30, Vincent Guittot wrote:
> On Thu, 2 Jul 2020 at 18:28, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>
>> On Thu, 2 Jul 2020 at 18:11, Valentin Schneider
>> <valentin.schneider@arm.com> wrote:
>> >
>> >
>> > On 02/07/20 15:42, Vincent Guittot wrote:
>> > > task_h_load() can return 0 in some situations like running stress-ng
>> > > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
>> > > system. The load balance doesn't handle this correctly because
>> > > env->imbalance never decreases and it will stop pulling tasks only after
>> > > reaching loop_max, which can be equal to the number of running tasks of
>> > > the cfs. Make sure that imbalance will be decreased by at least 1.
>> > >
>> > > misfit task is the other feature that doesn't handle correctly such
>> > > situation although it's probably more difficult to face the problem
>> > > because of the smaller number of CPUs and running tasks on heterogenous
>> > > system.
>> > >
>> > > We can't simply ensure that task_h_load() returns at least one because it
>> > > would imply to handle underrun in other places.
>> >
>> > Nasty one, that...
>> >
>> > Random thought: isn't that the kind of thing we have scale_load() and
>> > scale_load_down() for? There's more uses of task_h_load() than I would like
>> > for this, but if we upscale its output (or introduce an upscaled variant),
>> > we could do something like:
>> >
>> > ---
>> > detach_tasks()
>> > {
>> >         long imbalance = env->imbalance;
>> >
>> >         if (env->migration_type == migrate_load)
>> >                 imbalance = scale_load(imbalance);
>> >
>> >         while (!list_empty(tasks)) {
>> >                 /* ... */
>> >                 switch (env->migration_type) {
>> >                 case migrate_load:
>> >                         load = task_h_load_upscaled(p);
>> >                         /* ... usual bits here ...*/
>> >                         lsub_positive(&env->imbalance, load);
>> >                         break;
>> >                         /* ... */
>> >                 }
>> >
>> >                 if (!scale_load_down(env->imbalance))
>> >                         break;
>> >         }
>> > }
>> > ---
>> >
>> > It's not perfect, and there's still the misfit situation to sort out -
>> > still, do you think this is something we could go towards?
>>
>> This will not work for 32bits system.
>>
>> For 64bits, I have to think a bit more if the upscale would fix all
>> cases and support propagation across a hierarchy. And in this case we
>> could also consider to make scale_load/scale_load_down a nop all the
>> time
>
> In addition that problem remains on 32bits, the problem can still
> happen after extending the scale so this current patch still makes
> sense.
>

Right, I think we'd want to have that for 32-bit at the very least
anyway. I haven't done the math, but doesn't it require an obscene
number of tasks for this to still happen on 64-bit with the increased
resolution?

> Then if we want to reduce the cases where task_h_load returns 0, we
> should better make scale_load_down a nop otherwise we will have to
> maintain 2 values h_load and scale_h_load across the hierarchy
>

I don't fully grasp yet how much surgery that would require, but it does
sound like something we've been meaning to do, see e.g. se_weight:

 * XXX we want to get rid of these helpers and use the full load resolution.


* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-02 14:42 [PATCH] sched/fair: handle case of task_h_load() returning 0 Vincent Guittot
  2020-07-02 16:11 ` Valentin Schneider
  2020-07-08  9:44 ` Dietmar Eggemann
@ 2020-07-09 13:06 ` Valentin Schneider
  2020-07-09 13:51   ` Vincent Guittot
  2 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-07-09 13:06 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel


On 02/07/20 15:42, Vincent Guittot wrote:
> task_h_load() can return 0 in some situations like running stress-ng
> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> system. The load balance doesn't handle this correctly because
> env->imbalance never decreases and it will stop pulling tasks only after
> reaching loop_max, which can be equal to the number of running tasks of
> the cfs. Make sure that imbalance will be decreased by at least 1.
>
> misfit task is the other feature that doesn't handle correctly such
> situation although it's probably more difficult to face the problem
> because of the smaller number of CPUs and running tasks on heterogenous
> system.
>
> We can't simply ensure that task_h_load() returns at least one because it
> would imply to handle underrun in other places.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

I dug some more into this; if I got my math right, this can be reproduced
with a single task group below the root. Forked tasks get max load, so this
can be tried out with either tons of forks or tons of CPU hogs.

We need

  p->se.avg.load_avg * cfs_rq->h_load
  -----------------------------------  < 1
    cfs_rq_load_avg(cfs_rq) + 1

Assuming homogeneous system with tasks spread out all over (no other tasks
interfering), that should boil down to

  1024 * (tg.shares / nr_cpus)
  ---------------------------  < 1
  1024 * (nr_tasks_on_cpu)

IOW

  tg.shares / nr_cpus < nr_tasks_on_cpu

If we get tasks nicely spread out, a simple condition to hit this should be
to have more tasks than shares.

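Plugging rough numbers into the reproducer below (everything pinned to
one CPU, so the group entity should carry more or less the full 128
shares there), task_h_load() comes out at about
1024 * 128 / (130 * 1024 + 1) ~= 0.98, which truncates to 0.
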
I can hit task_h_load=0 with the following on my Juno (pinned to one CPU to
make things simpler; big.LITTLE doesn't yield equal weights between CPUs):

  cgcreate -g cpu:tg0

  echo 128 > /sys/fs/cgroup/cpu/tg0/cpu.shares

  for ((i=0; i<130; i++)); do
      # busy loop of your choice
      taskset -c 0 ./loop.sh &
      echo $! > /sys/fs/cgroup/cpu/tg0/tasks
  done

> ---
>  kernel/sched/fair.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fab1d17c575..62747c24aa9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
>               return;
>       }
>
> -	rq->misfit_task_load = task_h_load(p);
> +	/*
> +	 * Make sure that misfit_task_load will not be null even if
> +	 * task_h_load() returns 0. misfit_task_load is only used to select
> +	 * rq with highest load so adding 1 will not modify the result
> +	 * of the comparison.
> +	 */
> +	rq->misfit_task_load = task_h_load(p) + 1;

For here and below: wouldn't it be a tad cleaner to just do

        foo = max(task_h_load(p), 1);

Otherwise, I think I've properly convinced myself we do want to have
that in one form or another. So either way:

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

>  }
>
>  #else /* CONFIG_SMP */
> @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
>                           env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
>                               goto next;
>
> +			/*
> +			 * Depending of the number of CPUs and tasks and the
> +			 * cgroup hierarchy, task_h_load() can return a null
> +			 * value. Make sure that env->imbalance decreases
> +			 * otherwise detach_tasks() will stop only after
> +			 * detaching up to loop_max tasks.
> +			 */
> +			if (!load)
> +				load = 1;
> +
>                       env->imbalance -= load;
>                       break;


* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-08  9:47   ` Vincent Guittot
@ 2020-07-09 13:34     ` Dietmar Eggemann
  2020-07-09 13:52       ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Dietmar Eggemann @ 2020-07-09 13:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Valentin Schneider

On 08/07/2020 11:47, Vincent Guittot wrote:
> On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 02/07/2020 16:42, Vincent Guittot wrote:
>>> task_h_load() can return 0 in some situations like running stress-ng
>>> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
>>> system. The load balance doesn't handle this correctly because
>>
>> I guess the issue here is that 'cfs_rq->h_load' in
>>
>> task_h_load() {
>>     struct cfs_rq *cfs_rq = task_cfs_rq(p);
>>     ...
>>     return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>>                     cfs_rq_load_avg(cfs_rq) + 1);
>> }
>>
>> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
>> 1024 and cfs_rq_load_avg(cfs_rq) n*1024 in these lb occurrences.
>>
>>> env->imbalance never decreases and it will stop pulling tasks only after
>>> reaching loop_max, which can be equal to the number of running tasks of
>>> the cfs. Make sure that imbalance will be decreased by at least 1.

Looks like it's bounded by sched_nr_migrate (32 on my E5-2690 v2).

env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);

[...]

>> I assume that this is related to the LKP mail
> 
> I have found this problem while studying the regression raised in the
> email below but it doesn't fix it. At least, it's not enough
> 
>>
>> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?

I see. It also happens with other workloads but it's most visible
at the beginning of a workload (fork).

Still on E5-2690 v2 (2*2*10, 40 CPUs):

In the taskgroup, cfs_rq->h_load is ~1024/40 = 25, so this leads to
task_h_load() = 0 when cfs_rq->avg.load_avg is 40 times higher than the
individual task load (1024).

One incarnation of 20 loops w/o any progress (that's w/o your patch).

With loop='loop/loop_break/loop_max'
and load='p->se.avg.load_avg/cfs_rq->h_load/cfs_rq->avg.load_avg'

Jul  9 10:41:18 e105613-lin kernel: [73.068844] [stress-ng-mmapf 2907] SMT CPU37->CPU17 imb=8 loop=1/32/32 load=1023/23/43006
Jul  9 10:41:18 e105613-lin kernel: [73.068873] [stress-ng-mmapf 3501] SMT CPU37->CPU17 imb=8 loop=2/32/32 load=1022/23/41983
Jul  9 10:41:18 e105613-lin kernel: [73.068890] [stress-ng-mmapf 2602] SMT CPU37->CPU17 imb=8 loop=3/32/32 load=1023/23/40960
...
Jul  9 10:41:18 e105613-lin kernel: [73.069136] [stress-ng-mmapf 2520] SMT CPU37->CPU17 imb=8 loop=18/32/32 load=1023/23/25613
Jul  9 10:41:18 e105613-lin kernel: [73.069144] [stress-ng-mmapf 3107] SMT CPU37->CPU17 imb=8 loop=19/32/32 load=1021/23/24589
Jul  9 10:41:18 e105613-lin kernel: [73.069149] [stress-ng-mmapf 2672] SMT CPU37->CPU17 imb=8 loop=20/32/32 load=1024/23/23566
...

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>









* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-09 13:06 ` Valentin Schneider
@ 2020-07-09 13:51   ` Vincent Guittot
  0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2020-07-09 13:51 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel

On Thu, 9 Jul 2020 at 15:06, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
>
> On 02/07/20 15:42, Vincent Guittot wrote:
> > task_h_load() can return 0 in some situations like running stress-ng
> > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> > system. The load balance doesn't handle this correctly because
> > env->imbalance never decreases and it will stop pulling tasks only after
> > reaching loop_max, which can be equal to the number of running tasks of
> > the cfs. Make sure that imbalance will be decreased by at least 1.
> >
> > misfit task is the other feature that doesn't handle correctly such
> > situation although it's probably more difficult to face the problem
> > because of the smaller number of CPUs and running tasks on heterogenous
> > system.
> >
> > We can't simply ensure that task_h_load() returns at least one because it
> > would imply to handle underrun in other places.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> I dug some more into this; if I got my math right, this can be reproduced
> with a single task group below the root. Forked tasks get max load, so this
> can be tried out with either tons of forks or tons of CPU hogs.
>
> We need
>
>   p->se.avg.load_avg * cfs_rq->h_load
>   -----------------------------------  < 1
>     cfs_rq_load_avg(cfs_rq) + 1
>
> Assuming homogeneous system with tasks spread out all over (no other tasks
> interfering), that should boil down to
>
>   1024 * (tg.shares / nr_cpus)
>   ---------------------------  < 1
>   1024 * (nr_tasks_on_cpu)
>
> IOW
>
>   tg.shares / nr_cpus < nr_tasks_on_cpu
>
> If we get tasks nicely spread out, a simple condition to hit this should be
> to have more tasks than shares.
>
> I can hit task_h_load=0 with the following on my Juno (pinned to one CPU to
> make things simpler; big.LITTLE doesn't yield equal weights between CPUs):
>
>   cgcreate -g cpu:tg0
>
>   echo 128 > /sys/fs/cgroup/cpu/tg0/cpu.shares
>
>   for ((i=0; i<130; i++)); do
>       # busy loop of your choice
>       taskset -c 0 ./loop.sh &
>       echo $! > /sys/fs/cgroup/cpu/tg0/tasks
>   done
>
> > ---
> >  kernel/sched/fair.c | 18 +++++++++++++++++-
> >  1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6fab1d17c575..62747c24aa9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> >               return;
> >       }
> >
> > -     rq->misfit_task_load = task_h_load(p);
> > +     /*
> > +      * Make sure that misfit_task_load will not be null even if
> > +      * task_h_load() returns 0. misfit_task_load is only used to select
> > +      * rq with highest load so adding 1 will not modify the result
> > +      * of the comparison.
> > +      */
> > +     rq->misfit_task_load = task_h_load(p) + 1;
>
> For here and below; wouldn't it be a tad cleaner to just do
>
>         foo = max(task_h_load(p), 1);

+1

For the one below, my goal was mainly to avoid impacting the result of
the existing checks before applying the +1, but doing it earlier will
not change the results.

I'm going to update it

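Something like this, then (untested sketch of how the v2 hunks could
look; task_h_load() returns unsigned long):

	rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);

and in detach_tasks():

			/* task_h_load() can return 0; make sure we always make progress */
			load = max_t(unsigned long, task_h_load(p), 1);
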
>
> Otherwise, I think I've properly convinced myself we do want to have
> that in one form or another. So either way:
>
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

Thanks

>
> >  }
> >
> >  #else /* CONFIG_SMP */
> > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> >                           env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> >                               goto next;
> >
> > +                     /*
> > +                      * Depending of the number of CPUs and tasks and the
> > +                      * cgroup hierarchy, task_h_load() can return a null
> > +                      * value. Make sure that env->imbalance decreases
> > +                      * otherwise detach_tasks() will stop only after
> > +                      * detaching up to loop_max tasks.
> > +                      */
> > +                     if (!load)
> > +                             load = 1;
> > +
> >                       env->imbalance -= load;
> >                       break;


* Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
  2020-07-09 13:34     ` Dietmar Eggemann
@ 2020-07-09 13:52       ` Vincent Guittot
  0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2020-07-09 13:52 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, Valentin Schneider

On Thu, 9 Jul 2020 at 15:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/07/2020 11:47, Vincent Guittot wrote:
> > On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 02/07/2020 16:42, Vincent Guittot wrote:
> >>> task_h_load() can return 0 in some situations like running stress-ng
> >>> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> >>> system. The load balance doesn't handle this correctly because
> >>
> >> I guess the issue here is that 'cfs_rq->h_load' in
> >>
> >> task_h_load() {
> >>     struct cfs_rq *cfs_rq = task_cfs_rq(p);
> >>     ...
> >>     return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> >>                     cfs_rq_load_avg(cfs_rq) + 1);
> >> }
> >>
> >> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
> >> 1024 and cfs_rq_load_avg(cfs_rq) n*1024 in these lb occurrences.
> >>
> >>> env->imbalance never decreases and it will stop pulling tasks only after
> >>> reaching loop_max, which can be equal to the number of running tasks of
> >>> the cfs. Make sure that imbalance will be decreased by at least 1.
>
> Looks like it's bounded by sched_nr_migrate (32 on my E5-2690 v2).

yes

>
> env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>
> [...]
>
> >> I assume that this is related to the LKP mail
> >
> > I have found this problem while studying the regression raised in the
> > email below but it doesn't fix it. At least, it's not enough
> >
> >>
> >> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?
>
> I see. It also happens with other workloads but it's most visible
> at the beginning of a workload (fork).
>
> Still on E5-2690 v2 (2*2*10, 40 CPUs):
>
> In the taskgroup cfs_rq->h_load is ~ 1024/40 = 25 so this leads to
> task_h_load = 0 with cfs_rq->avg.load_avg 40 times higher than the
> individual task load (1024).
>
> One incarnation of 20 loops w/o any progress (that's w/o your patch).
>
> With loop='loop/loop_break/loop_max'
> and load='p->se.avg.load_avg/cfs_rq->h_load/cfs_rq->avg.load_avg'
>
> Jul  9 10:41:18 e105613-lin kernel: [73.068844] [stress-ng-mmapf 2907] SMT CPU37->CPU17 imb=8 loop=1/32/32 load=1023/23/43006
> Jul  9 10:41:18 e105613-lin kernel: [73.068873] [stress-ng-mmapf 3501] SMT CPU37->CPU17 imb=8 loop=2/32/32 load=1022/23/41983
> Jul  9 10:41:18 e105613-lin kernel: [73.068890] [stress-ng-mmapf 2602] SMT CPU37->CPU17 imb=8 loop=3/32/32 load=1023/23/40960
> ...
> Jul  9 10:41:18 e105613-lin kernel: [73.069136] [stress-ng-mmapf 2520] SMT CPU37->CPU17 imb=8 loop=18/32/32 load=1023/23/25613
> Jul  9 10:41:18 e105613-lin kernel: [73.069144] [stress-ng-mmapf 3107] SMT CPU37->CPU17 imb=8 loop=19/32/32 load=1021/23/24589
> Jul  9 10:41:18 e105613-lin kernel: [73.069149] [stress-ng-mmapf 2672] SMT CPU37->CPU17 imb=8 loop=20/32/32 load=1024/23/23566
> ...
>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thanks

>
>
>
>
>
>
>

