From: Vincent Guittot <vincent.guittot@linaro.org>
To: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Joseph Salisbury <joseph.salisbury@canonical.com>,
	Ingo Molnar <mingo@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	LKML <linux-kernel@vger.kernel.org>,
	Mike Galbraith <efault@gmx.de>,
	omer.akram@canonical.com
Subject: Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
Date: Wed, 19 Oct 2016 19:41:36 +0200
Message-ID: <CAKfTPtCK=riUXyapEhTxB9ZF+HrPzO+=9pOQYsYsNgyRa+rc6w@mail.gmail.com>
In-Reply-To: <20161019132957.GA7509@e105550-lin.cambridge.arm.com>

On 19 October 2016 at 15:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
>> Le Tuesday 18 Oct 2016 à 12:34:12 (+0200), Peter Zijlstra a écrit :
>> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > > So aside from funny BIOSes, this should also show up when creating
>> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
>> > > > think.
>> > >
>> > > The problem is also that the load of the tg->se[cpu] that represents
>> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
>> > > alloc_fair_sched_group
>> > >      for_each_possible_cpu(i) {
>> > >          init_entity_runnable_average(se);
>> > >             sa->load_avg = scale_load_down(se->load.weight);
>> > >
>> > > Initializing sa->load_avg to 1024 for a newly created task makes
>> > > sense, as we don't know yet what its real load will be, but I'm not
>> > > sure that we have to do the same for an se that represents a task
>> > > group. This load should be initialized to 0, and it will increase as
>> > > tasks are moved/attached into the task group.
>> >
>> > Yes, I think that makes sense, not sure how horrible that is with the
>>
>> That should not be too bad, because this initial value only matters for
>> the few dozen ms that follow the creation of the task group.
>
> IMHO, it doesn't make much sense to initialize empty containers, which
> is what group sched_entities really are, to 1024. The value is meant to
> represent what is in the container, and at creation it is empty, so in
> my opinion initializing it to zero makes sense.
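
For reference, here is a tiny standalone sketch of where that 1024 comes
from and how it scales with the number of possible CPUs. The constants
below (SCHED_FIXEDPOINT_SHIFT, NICE_0_LOAD with the increased 64-bit load
resolution) and the CPU count are illustrative assumptions, not copied
verbatim from the kernel sources:

-----
#include <stdio.h>

/* Assumed to mirror a 64-bit kernel with FAIR_GROUP_SCHED (increased
 * load resolution); simplified for illustration only. */
#define SCHED_FIXEDPOINT_SHIFT	10
#define NICE_0_LOAD		(1UL << (2 * SCHED_FIXEDPOINT_SHIFT))
#define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	/* Hypothetical example: BIOS reports 8 possible CPUs */
	unsigned long nr_possible_cpus = 8;

	/* Models sa->load_avg = scale_load_down(se->load.weight),
	 * assuming se->load.weight == NICE_0_LOAD (nice-0 / default
	 * group weight), as init_entity_runnable_average() does today
	 * for both tasks and group entities. */
	unsigned long init_load_avg = scale_load_down(NICE_0_LOAD);

	printf("initial se->avg.load_avg          : %lu\n", init_load_avg);
	printf("summed over possible CPUs per tg  : %lu\n",
	       init_load_avg * nr_possible_cpus);
	return 0;
}
-----
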
>
>> > current state of things, but after your propagate patch, which
>> > reinstates the interactivity hack, that should work for sure.
>
> It actually works on mainline/tip as well.
>
> As I see it, the fundamental problem is keeping group entities up to
> date. Because the load_weight, and hence se->avg.load_avg, of each
> per-cpu group sched_entity depends on the group's
> cfs_rq->tg_load_avg_contrib for all cpus (tg->load_avg), including
> those that might be empty and therefore not enqueued, we must ensure
> that they are updated some other way, most naturally as part of
> update_blocked_averages().
>
> To guarantee that, it basically boils down to making sure:
> Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> leaf_cfs_rq_list.
>
> We can do that in different ways: 1) Add all cfs_rqs to the
> leaf_cfs_rq_list at task group creation, or 2) initialize group
> sched_entity contributions to zero and make sure that they are added to
> leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> on it.
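
To make that invariant concrete, below is a small user-space model (not
kernel code; the names and the decay are simplified) of what happens when
a cfs_rq with a non-zero contribution is missing from the list: the
update loop never visits it, so its contribution never decays and
tg->load_avg stays inflated:

-----
#include <stdio.h>

#define NR_CPUS 4

struct toy_cfs_rq {
	long contrib;	/* models cfs_rq->tg_load_avg_contrib */
	int on_list;	/* models being on rq->leaf_cfs_rq_list */
};

int main(void)
{
	struct toy_cfs_rq cfs_rq[NR_CPUS] = {
		{ 1024, 1 },	/* CPU0: something was enqueued, so listed */
		{ 1024, 0 },	/* CPU1-3: initialized to 1024 but never */
		{ 1024, 0 },	/* enqueued, hence never put on the list */
		{ 1024, 0 },
	};
	long tg_load_avg = 0;
	int cpu, period;

	/* Toy update_blocked_averages(): only listed cfs_rqs get their
	 * blocked contribution updated; decay is simplified to halving
	 * per period instead of the real PELT series. */
	for (period = 0; period < 10; period++)
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (cfs_rq[cpu].on_list)
				cfs_rq[cpu].contrib /= 2;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		tg_load_avg += cfs_rq[cpu].contrib;

	printf("tg->load_avg: %ld (of which %ld is stale)\n",
	       tg_load_avg, tg_load_avg - cfs_rq[0].contrib);
	return 0;
}
-----
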
>
> Vincent's patch below gives us the second option.
>
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>        * will definitely be update (after enqueue).
>>        */
>>       sa->period_contrib = 1023;
>> -     sa->load_avg = scale_load_down(se->load.weight);
>> +     /*
>> +      * Tasks are initialized with full load to be seen as heavy tasks
>> +      * until they get a chance to stabilize to their real load level.
>> +      * Group entities are initialized with zero load to reflect the fact
>> +      * that nothing has been attached to the task group yet.
>> +      */
>> +     if (entity_is_task(se))
>> +             sa->load_avg = scale_load_down(se->load.weight);
>>       sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>       /*
>>        * At this point, util_avg won't be used in select_task_rq_fair anyway
>
> I would suggest adding a comment somewhere stating that we need to keep
> group cfs_rqs up to date:
>
> -----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index abb3763dff69..2b820d489be0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
>                 if (throttled_hierarchy(cfs_rq))
>                         continue;
>
> +               /*
> +                * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> +                * _must_ be on the leaf_cfs_rq_list to ensure that group shares
> +                * are updated correctly.
> +                */

As discussed on IRC, the point is that even if the leaf cfs_rq is
added to the leaf_cfs_rq_list, that doesn't ensure that it will be
updated correctly for unplugged CPUs.

>                 if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
>                         update_tg_load_avg(cfs_rq, 0);
>         }
> -----
>
> I did a couple of simple tests on tip/sched/core to test whether
> Vincent's fix works even without reflecting group load/util in the group
> hierarchy:
>
> Juno (2xA57+4xA53)
>
> tip:
>         grouped hog(1) alone: 2841
>         non-grouped hogs(6) alone: 40830
>         grouped hog(1): 218
>         non-grouped hogs(6): 40580
>
> tip+vg:
>         grouped hog alone: 2849
>         non-grouped hogs(6) alone: 40831
>         grouped hog: 2363
>         non-grouped hogs: 38418
>
> See script below for details, but we basically see that the grouped task
> is not getting its 'fair' share on tip, while it does with Vincent's
> patch.
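
To get some intuition for why the grouped hog is starved so badly on tip,
here is a back-of-the-envelope model of the share computation. It is a
rough simplification of calc_cfs_shares() (which clamps the result and
adjusts the local term), and the numbers are illustrative assumptions
rather than values measured on the Juno:

-----
#include <stdio.h>

int main(void)
{
	long tg_shares = 1024;	/* default cpu.shares */
	long local_load = 1024;	/* the one CPU actually running the hog */
	long stale_cpus = 5;	/* per-cpu contributions stuck at their
				 * initial 1024 and never updated */
	long tg_load_avg = local_load + stale_cpus * 1024;

	/* Roughly: group se weight = tg->shares * local load / tg->load_avg */
	printf("group se weight with stale contributions: %ld\n",
	       tg_shares * local_load / tg_load_avg);
	printf("group se weight if they start at zero   : %ld\n",
	       tg_shares * local_load / local_load);
	return 0;
}
-----
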
>
> To summarize, I think Vincent's patch makes sense and works :-) More
> testing is needed, of course, to see if there are other problems.
>
> -----
>
> # Create 100 task groups:
> for i in `seq 1 100`;
> do
>         cgcreate -g cpu:/root/test$i
> done
>
> NCPUS=$(grep -c ^processor /proc/cpuinfo)
>
> # Run single cpu hog inside task group on first cpu _alone_:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench --test=cpu \
> --num-threads=1 --max-time=5 --max-requests=1000000 run | \
> awk '{if ($4=="events:") {print "grouped hog(1) alone: " $5}}'
>
> # Run cpu hogs outside task group _alone_:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "non-grouped hogs('$NCPUS') alone: " $5}}'
>
> # Run cpu hogs outside task group:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "non-grouped hogs('$NCPUS'): " $5}}' &
>
> # Run single cpu hog inside task group on first cpu:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench \
> --test=cpu --num-threads=1 --max-time=5 \
> --max-requests=1000000 run | awk '{if ($4=="events:") \
> {print "grouped hog(1): " $5}}'
>
> wait
>
> # Delete task groups:
> for i in `seq 1 100`;
> do
>         cgdelete -g cpu:/root/test$i
> done
