Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

From: Vincent Guittot <vincent.guittot@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Date: Tue, 2 May 2017 08:56:52 +0200	[thread overview]
Message-ID: <CAKfTPtBYokz_2P_VFSatBQT9+uzJFgUKD=ETksS2FUHZZ4mxAw@mail.gmail.com> (raw)
In-Reply-To: <20170428161450.GA4628@htj.duckdns.org>

On 28 April 2017 at 18:14, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
>>
>> The only interest of runnable_load_avg is to be null when a cfs_rq is
>> idle whereas load_avg is not  but not to be higher than load_avg. The
>> root cause is that load_balance only looks at "load" but not number of
>> task currently running and that's probably the main problem:
>> runnable_load_avg has been added because load_balance fails to filter
>> idle group and idle rq. We should better add a new type in
>> group_classify to tag group that are  idle and the same in
>> find_busiest_queue more.
>
> I'll follow up in the other subthread but there really is fundamental
> difference in how we calculate runnable_avg w/ and w/o cgroups.
> Indepndent of whether we can improve the load balancer further, it is
> an existing bug.

I'd like to weight that a bit.
The runnable_load_avg works correctly as it is because it reflects
correctly the load of runnable entity at root domain
If you start to propagate the runnable_load_avg on the load_avg of the
group entity, the load will become unstable.
runnable_load_avg has been added to fix load_balance being unable to
select the right busiest rq. So the goal is to use more and more
load_avg not the opposite

>
>> > * Can you please double check that the higher latencies w/ the patch
>> >   is reliably reproducible?  The test machines that I use have
>> >   variable management load.  They never dominate the machine but are
>> >   enough to disturb the results so that to drawing out a reliable
>> >   pattern takes a lot of repeated runs.  I'd really appreciate if you
>> >   could double check that the pattern is reliable with different run
>> >   patterns (ie. instead of 10 consecutive runs after another,
>> >   interleaved).
>>
>> I always let time between 2 consecutive run and the 10 consecutive
>> runs take around 2min to execute
>>
>> Then I have run several time these 10 consecutive test and results stay the same
>
> Can you please try the patch at the end of this message?  I'm
> wondering whether you're seeing the errors accumulating from the wrong
> min().

I still have the regression with the patch below.
The regression comes from the use runnable_load_avg to propagate. As
load_avg becomes null at some point, it break the computation of share
and the load_avg stay very low

>
>> >   ones.  If there's something I can easily buy, please point me to it.
>> >   If there's something I can loan, that'd be great too.
>>
>> It looks like most of web site are currently out of stock
>
> :(
>
>> > * If not, I'll try to clean up the debug patches I have and send them
>> >   your way to get more visiblity but given these things tend to be
>> >   very iterative, it might take quite a few back and forth.
>>
>> Yes, that could be usefull. Even a trace of regression could be useful
>
> Yeah, will clean it up a bit and post.  Will also post some traces.
>
> And here's the updated patch.  I didn't add the suggested
> scale_down()'s.  I'll follow up on that in the other subthread.
> Thanks.
>
>  kernel/sched/fair.c |   47 ++++++++++++++++++++---------------------------
>  1 file changed, 20 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3078,37 +3078,30 @@ static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -       long delta, load = gcfs_rq->avg.load_avg;
> +       long load = 0, delta;
>
>         /*
> -        * If the load of group cfs_rq is null, the load of the
> -        * sched_entity will also be null so we can skip the formula
> +        * A cfs_rq's load avg contribution to the parent should be scaled
> +        * to the sched_entity's weight.  Use freshly calculated shares
> +        * instead of @se->load.weight as the latter may not reflect
> +        * changes from the current scheduling operation.
> +        *
> +        * Note that the propagation source is runnable_load_avg instead of
> +        * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
> +        * the sum of the scaled loads of all tasks queued under it, which
> +        * is important for the correct operation of the load balancer.
> +        *
> +        * This can make the sched_entity's load_avg jumpier but that
> +        * correctly reflects what would happen without cgroups if each
> +        * task's load is scaled across nesting - the load is being
> +        * averaged at the task and each cfs_rq.
>          */
> -       if (load) {
> -               long tg_load;
> +       if (gcfs_rq->load.weight) {
> +               long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>
> -               /* Get tg's load and ensure tg_load > 0 */
> -               tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> -               /* Ensure tg_load >= load and updated with current load*/
> -               tg_load -= gcfs_rq->tg_load_avg_contrib;
> -               tg_load += load;
> -
> -               /*
> -                * We need to compute a correction term in the case that the
> -                * task group is consuming more CPU than a task of equal
> -                * weight. A task with a weight equals to tg->shares will have
> -                * a load less or equal to scale_load_down(tg->shares).
> -                * Similarly, the sched_entities that represent the task group
> -                * at parent level, can't have a load higher than
> -                * scale_load_down(tg->shares). And the Sum of sched_entities'
> -                * load must be <= scale_load_down(tg->shares).
> -                */
> -               if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> -                       /* scale gcfs_rq's load into tg's shares*/
> -                       load *= scale_load_down(gcfs_rq->tg->shares);
> -                       load /= tg_load;
> -               }
> +               load = min_t(long, scale_load_down(shares),
> +                            gcfs_rq->runnable_load_avg *
> +                            shares / gcfs_rq->load.weight);
>         }
>
>         delta = load - se->avg.load_avg;