Date: Wed, 19 Oct 2016 14:30:18 +0100
From: Morten Rasmussen
To: Vincent Guittot
Cc: Peter Zijlstra, Dietmar Eggemann, Joseph Salisbury, Ingo Molnar,
    Linus Torvalds, Thomas Gleixner, LKML, Mike Galbraith,
    omer.akram@canonical.com
Subject: Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
Message-ID: <20161019132957.GA7509@e105550-lin.cambridge.arm.com>
In-Reply-To: <20161018115651.GA20956@linaro.org>

On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> > > On 18 October 2016 at 11:07, Peter Zijlstra wrote:
> > > > So aside from funny BIOSes, this should also show up when creating
> > > > cgroups when you have offlined a few CPUs, which is far more common
> > > > I'd think.
> > >
> > > The problem is also that the load of the tg->se[cpu] that represents
> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
> > >
> > >   alloc_fair_sched_group()
> > >     for_each_possible_cpu(i) {
> > >       init_entity_runnable_average(se);
> > >         sa->load_avg = scale_load_down(se->load.weight);
> > >
> > > Initializing sa->load_avg to 1024 for a newly created task makes
> > > sense as we don't know yet what its real load will be, but I'm not
> > > sure that we have to do the same for an se that represents a task
> > > group. This load should be initialized to 0, and it will increase as
> > > tasks are moved/attached to the task group.
> >
> > Yes, I think that makes sense, not sure how horrible that is with the
>
> That should not be that bad, because this initial value is only useful
> for the few dozen ms that follow the creation of the task group.

IMHO, it doesn't make much sense to initialize empty containers, which
group sched_entities really are, to 1024. The value is meant to
represent what is in the container, and at creation it is empty, so in
my opinion initializing it to zero makes sense.

> > current state of things, but after your propagate patch, that
> > reinstates the interactivity hack that should work for sure.

It actually works on mainline/tip as well.

As I see it, the fundamental problem is keeping the group entities up
to date. Because the weight (se->load.weight), and hence
se->avg.load_avg, of each per-cpu group sched_entity depends on the
group's cfs_rq->tg_load_avg_contrib summed over all cpus
(tg->load_avg), including cpus whose cfs_rqs might be empty and
therefore not enqueued, we must ensure that those cfs_rqs are updated
some other way. Most naturally as part of update_blocked_averages().
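
To make that dependency concrete, here is a rough sketch of how each
cpu's group entity weight is derived. It is modelled on
calc_cfs_shares(), but simplified: the helper name is made up, and the
real code also biases the sum with the current cpu's not-yet-accounted
load:

-----
/*
 * Rough sketch, not the actual tip code: the weight handed to each
 * per-cpu group sched_entity is roughly
 *
 *	tg->shares * cfs_rq->load.weight / tg->load_avg
 *
 * where tg->load_avg is the sum of cfs_rq->tg_load_avg_contrib over
 * all cpus. A stale, never-decayed contribution from an empty cfs_rq
 * inflates the denominator and deflates the weight of every group
 * entity that actually has something running.
 */
static long approx_group_shares(struct task_group *tg, struct cfs_rq *cfs_rq)
{
	long tg_weight = atomic_long_read(&tg->load_avg);
	long shares = tg->shares * cfs_rq->load.weight;

	if (tg_weight)
		shares /= tg_weight;

	if (shares < MIN_SHARES)
		shares = MIN_SHARES;
	if (shares > tg->shares)
		shares = tg->shares;

	return shares;
}
-----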
To guarantee that, it basically boils down to making sure that:

    Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
    leaf_cfs_rq_list.

We can do that in different ways:

1) Add all cfs_rqs to the leaf_cfs_rq_list at task group creation, or

2) initialize the group sched_entity contributions to zero and make
   sure that a cfs_rq is added to the leaf_cfs_rq_list as soon as a
   sched_entity (task or group) is enqueued on it.

Vincent's patch below gives us the second option.

>  kernel/sched/fair.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8b03fb5..89776ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * will definitely be update (after enqueue).
>  	 */
>  	sa->period_contrib = 1023;
> -	sa->load_avg = scale_load_down(se->load.weight);
> +	/*
> +	 * Tasks are intialized with full load to be seen as heavy task until
> +	 * they get a chance to stabilize to their real load level.
> +	 * group entity are intialized with null load to reflect the fact that
> +	 * nothing has been attached yet to the task group.
> +	 */
> +	if (entity_is_task(se))
> +		sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	/*
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway

I would suggest adding a comment somewhere stating that we need to keep
the group cfs_rqs up to date:

-----
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index abb3763dff69..2b820d489be0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;
 
+		/*
+		 * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
+		 * _must_ be on the leaf_cfs_rq_list to ensure that group shares
+		 * are updated correctly.
+		 */
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 	}
-----

I did a couple of simple tests on tip/sched/core to check whether
Vincent's fix works even without reflecting group load/util in the
group hierarchy:

Juno (2xA57+4xA53):

tip:
	grouped hog(1) alone:		 2841
	non-grouped hogs(6) alone:	40830
	grouped hog(1):			  218
	non-grouped hogs(6):		40580

tip+vg:
	grouped hog alone:		 2849
	non-grouped hogs(6) alone:	40831
	grouped hog:			 2363
	non-grouped hogs:		38418

See the script at the end of this mail for details, but we basically
see that the grouped task is not getting its 'fair' share on tip (218
events vs. 2841 when running alone), while it does with Vincent's
patch (2363 vs. 2849). To summarize, I think Vincent's patch makes
sense and works :-) More testing is needed of course to see if there
are other problems.
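
As a footnote on why the list membership matters: tg->load_avg is only
synced with a cfs_rq's contribution when that cfs_rq is actually
visited by the update path. A simplified sketch of what
update_tg_load_avg() does (the helper name below is made up, and the
real code only syncs once the delta exceeds a small threshold):

-----
/*
 * Fold the delta between the cfs_rq's current load_avg and its last
 * recorded contribution into the group-wide sum. A cfs_rq that is
 * never visited here (because it is not on the leaf_cfs_rq_list)
 * keeps its stale tg_load_avg_contrib forever.
 */
static inline void sync_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	if (delta) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}
-----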
-----
# Create 100 task groups:
for i in `seq 1 100`; do
	cgcreate -g cpu:/root/test$i
done

NCPUS=$(grep -c ^processor /proc/cpuinfo)

# Run single cpu hog inside task group on first cpu _alone_:
cgexec -g cpu:/root/test100 taskset 0x01 sysbench --test=cpu \
	--num-threads=1 --max-time=5 --max-requests=1000000 run | \
	awk '{if ($4=="events:") {print "grouped hog(1) alone: " $5}}'

# Run cpu hogs outside task group _alone_:
sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
	--max-requests=1000000 run | awk '{if ($4=="events:") \
	{print "non-grouped hogs('$NCPUS') alone: " $5}}'

# Run cpu hogs outside task group:
sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
	--max-requests=1000000 run | awk '{if ($4=="events:") \
	{print "non-grouped hogs('$NCPUS'): " $5}}' &

# Run single cpu hog inside task group on first cpu:
cgexec -g cpu:/root/test100 taskset 0x01 sysbench \
	--test=cpu --num-threads=1 --max-time=5 \
	--max-requests=1000000 run | awk '{if ($4=="events:") \
	{print "grouped hog(1): " $5}}'

wait

# Delete task groups:
for i in `seq 1 100`; do
	cgdelete -g cpu:/root/test$i
done
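
If you want to watch the stale contributions directly,
/proc/sched_debug prints tg_load_avg_contrib and the group-wide
tg_load_avg for each group cfs_rq (going by my reading of
kernel/sched/debug.c; requires CONFIG_SCHED_DEBUG). Something like:

-----
# Dump the per-cpu contribution and the group-wide sum for test100; on
# tip the sum stays inflated even when the group is idle:
grep -A 20 ":/root/test100" /proc/sched_debug | grep tg_load_avg
-----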