Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

From: Vincent Guittot <vincent.guittot@linaro.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Joseph Salisbury <joseph.salisbury@canonical.com>,
	Ingo Molnar <mingo@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	LKML <linux-kernel@vger.kernel.org>,
	Mike Galbraith <efault@gmx.de>,
	omer.akram@canonical.com
Subject: Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
Date: Mon, 17 Oct 2016 15:54:42 +0200
Message-ID: <CAKfTPtAxw1b-vy285HKtPUBFYuJdv2CFZH_gP3CMtZHs1wLPXg@mail.gmail.com>
In-Reply-To: <20161017131952.GR3117@twins.programming.kicks-ass.net>

On 17 October 2016 at 15:19, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:
>
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 8b03fb5..8926685 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> >   */
>> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> >  {
>> > -   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>> > +   unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
>> > +   long delta = load_avg - cfs_rq->tg_load_avg_contrib;
>> >
>> >     /*
>> >      * No need to update load_avg for root_task_group as it is not used.
>> > @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> >
>> >     if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> >             atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> > -           cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>> > +           cfs_rq->tg_load_avg_contrib = load_avg;
>> >     }
>> >  }
>>
>> Tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic
>> kernel) on a Lenovo T430 and it didn't help.
>
> Right, I don't think that race exists, we update cfs_rq->avg.load_avg
> with rq->lock held and have it held here, so it cannot change under us.

Yes, I agree. The patch was only meant to make sure that an unexpected
race condition doesn't happen.

>
> This might do with a few lockdep_assert_held() instances to clarify this
> though.
>
>> What seems to cure it is to get rid of this snippet (part of the commit
>> mentioned earlier in this thread: 3d30544f0212):
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 039de34f1521..16c692049fbf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -726,7 +726,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
>>         struct sched_avg *sa = &se->avg;
>>         long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
>>         u64 now = cfs_rq_clock_task(cfs_rq);
>> -       int tg_update;
>>
>>         if (cap > 0) {
>>                 if (cfs_rq->avg.util_avg != 0) {
>> @@ -759,10 +758,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
>>                 }
>>         }
>>
>> -       tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
>> +       update_cfs_rq_load_avg(now, cfs_rq, false);
>>         attach_entity_load_avg(cfs_rq, se);
>> -       if (tg_update)
>> -               update_tg_load_avg(cfs_rq, false);
>>  }
>>
>>  #else /* !CONFIG_SMP */
>>
>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system
>> initially because systemd will create all ~100 services (and therefore the
>> corresponding 2. level tg's) at once. In my previous example, there was 500ms
>> between the creation of 2 tg's so there was a lot of decaying going on in between.
>
> Cute... on current kernels that translates to simply removing the call
> to update_tg_load_avg(), lets see if we can figure out what goes
> sideways first though, because it _should_ decay back out. And if that

Yes, reaching ~300000-400000 is not an issue in itself. The problem is
that, in the buggy case, load_avg has decayed but the decay has not been
reflected in tg->load_avg.
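
To make the expected behaviour concrete, here is a minimal userspace
sketch (not kernel code) of the delta filter in update_tg_load_avg() as
quoted above. The model_* names and plain long fields are stand-ins I
made up for illustration; the real code uses struct cfs_rq, struct
task_group and atomic_long_add(). It only shows that once the per-cfs_rq
load_avg has decayed by more than 1/64 of the last contribution, a call
to the update should pull the group sum back down.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct model_tg {
	long load_avg;                      /* models tg->load_avg */
};

struct model_cfs_rq {
	struct model_tg *tg;
	unsigned long load_avg;             /* models cfs_rq->avg.load_avg */
	unsigned long tg_load_avg_contrib;  /* last value folded into tg */
};

/*
 * Same filter as the quoted update_tg_load_avg(): propagate only when
 * forced, or when the change exceeds 1/64 of the last contribution.
 * Returns 1 if tg->load_avg was updated, 0 otherwise.
 */
static int model_update_tg_load_avg(struct model_cfs_rq *cfs_rq, int force)
{
	long delta;

	assert(cfs_rq && cfs_rq->tg);

	/* Take one snapshot and use it for both the delta and the store. */
	unsigned long load_avg = cfs_rq->load_avg;

	delta = (long)load_avg - (long)cfs_rq->tg_load_avg_contrib;

	if (force || labs(delta) > (long)(cfs_rq->tg_load_avg_contrib / 64)) {
		cfs_rq->tg->load_avg += delta;
		cfs_rq->tg_load_avg_contrib = load_avg;
		return 1;
	}
	return 0;
}
```

In this model, a small decay is filtered out on purpose, but a full
decay of load_avg always passes the threshold. So a stale tg->load_avg
means the update was not called after the decay, not that the filter
swallowed it.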

> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>
> I'll see if I can reproduce this with a script creating heaps of cgroups
> in a hurry, I have a total lack of system-disease on all my machines.
>
>
> /me goes prod..