From: Vincent Guittot
To: Peter Zijlstra
Cc: Dietmar Eggemann, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
        Thomas Gleixner, LKML, Mike Galbraith, omer.akram@canonical.com
Subject: Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
Date: Mon, 17 Oct 2016 15:54:42 +0200
In-Reply-To: <20161017131952.GR3117@twins.programming.kicks-ass.net>
References: <57FFADC8.2020602@canonical.com>
        <43c59cba-2044-1de2-0f78-8f346bd1e3cb@arm.com>
        <20161014151827.GA10379@linaro.org>
        <2bb765e7-8a5f-c525-a6ae-fbec6fae6354@canonical.com>
        <20161017090903.GA11962@linaro.org>
        <4e15ad55-beeb-e860-0420-8f439d076758@arm.com>
        <20161017131952.GR3117@twins.programming.kicks-ass.net>

On 17 October 2016 at 15:19, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:
>
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 8b03fb5..8926685 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2902,7 +2902,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> >   */
>> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> >  {
>> > -        long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>> > +        unsigned long load_avg = READ_ONCE(cfs_rq->avg.load_avg);
>> > +        long delta = load_avg - cfs_rq->tg_load_avg_contrib;
>> >
>> >          /*
>> >           * No need to update load_avg for root_task_group as it is not used.
>> > @@ -2912,7 +2913,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> >
>> >          if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> >                  atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> > -                cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>> > +                cfs_rq->tg_load_avg_contrib = load_avg;
>> >          }
>> >  }
>>
>> Tested it on an Ubuntu 16.10 Server (on top of the default 4.8.0-22-generic
>> kernel) on a Lenovo T430 and it didn't help.
>
> Right, I don't think that race exists, we update cfs_rq->avg.load_avg
> with rq->lock held and have it held here, so it cannot change under us.

Yes, I agree. It was just to make sure that no unexpected race condition
was happening.

> This might do with a few lockdep_assert_held() instances to clarify this
> though.
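
Something like the snippet below? (completely untested, just a sketch to
make the locking assumption explicit; rq_of() is the existing fair.c
helper to get back to the rq, so take it as an illustration rather than
a proper patch):

 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

+        /* Caller is expected to hold the rq lock of this cfs_rq's rq */
+        lockdep_assert_held(&rq_of(cfs_rq)->lock);
+
         /*
          * No need to update load_avg for root_task_group as it is not used.
          */

With that in place, lockdep would complain immediately if
update_tg_load_avg() were ever called without the lock, which would rule
out the race theory for good.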
>> What seems to cure it is to get rid of this snippet (part of the commit
>> mentioned earlier in this thread: 3d30544f0212):
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 039de34f1521..16c692049fbf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -726,7 +726,6 @@ void post_init_entity_util_avg(struct sched_entity *se)
>>          struct sched_avg *sa = &se->avg;
>>          long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
>>          u64 now = cfs_rq_clock_task(cfs_rq);
>> -        int tg_update;
>>
>>          if (cap > 0) {
>>                  if (cfs_rq->avg.util_avg != 0) {
>> @@ -759,10 +758,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
>>                  }
>>          }
>>
>> -        tg_update = update_cfs_rq_load_avg(now, cfs_rq, false);
>> +        update_cfs_rq_load_avg(now, cfs_rq, false);
>>          attach_entity_load_avg(cfs_rq, se);
>> -        if (tg_update)
>> -                update_tg_load_avg(cfs_rq, false);
>>  }
>>
>>  #else /* !CONFIG_SMP */
>>
>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system
>> initially, because systemd will create all ~100 services (and therefore the
>> corresponding 2nd-level tg's) at once. In my previous example there was 500ms
>> between the creation of two tg's, so there was a lot of decaying going on in
>> between.
>
> Cute... on current kernels that translates to simply removing the call
> to update_tg_load_avg(), let's see if we can figure out what goes
> sideways first though, because it _should_ decay back out. And if that
> can fail here, I'm not seeing why that wouldn't fail elsewhere either.

Yes, reaching ~300000-400000 is not an issue in itself; the problem is
that cfs_rq->avg.load_avg has decayed but the decay has not been
reflected in tg->load_avg in the buggy case.

> I'll see if I can reproduce this with a script creating heaps of cgroups
> in a hurry, I have a total lack of system-disease on all my machines.
>
> /me goes prod..
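
If it helps, something along these lines should approximate the systemd
burst without needing a full distro install. It's an untested sketch
(a tiny C program rather than a script); it assumes the cgroup v1 cpu
controller is mounted at /sys/fs/cgroup/cpu, that it runs as root, and
the "burst-%d" group names are made up:

/*
 * Untested sketch: create a burst of cpu cgroups, roughly what systemd
 * does when it sets up ~100 services at boot. Assumes the cgroup v1
 * cpu controller is mounted at /sys/fs/cgroup/cpu and root privileges.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
        int i, nr = (argc > 1) ? atoi(argv[1]) : 100;
        char path[128];

        for (i = 0; i < nr; i++) {
                /* each mkdir creates a new 2nd-level task group */
                snprintf(path, sizeof(path),
                         "/sys/fs/cgroup/cpu/burst-%d", i);
                if (mkdir(path, 0755) && errno != EEXIST) {
                        perror(path);
                        return 1;
                }
        }
        return 0;
}

Running it and then watching the per-cfs_rq tg entries in
/proc/sched_debug should show whether the contributions decay back out
of tg->load_avg.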