Subject: Re: [rfc patch] sched/fair: Use instantaneous load for fork/exec balancing
To: Matt Fleming
References: <1465891111.1694.13.camel@gmail.com> <5760115C.7040306@arm.com>
 <1465922407.3626.21.camel@gmail.com> <5761752A.6000606@arm.com>
 <20160704150452.GP8415@codeblueprint.co.uk>
Cc: Mike Galbraith, Peter Zijlstra, Yuyang Du, LKML, Mel Gorman
From: Dietmar Eggemann
Message-ID: <57835FCC.70205@arm.com>
Date: Mon, 11 Jul 2016 09:58:52 +0100
In-Reply-To: <20160704150452.GP8415@codeblueprint.co.uk>

On 04/07/16 16:04, Matt Fleming wrote:
> On Wed, 15 Jun, at 04:32:58PM, Dietmar Eggemann wrote:
>> On 14/06/16 17:40, Mike Galbraith wrote:
>>> On Tue, 2016-06-14 at 15:14 +0100, Dietmar Eggemann wrote:
>>>
>>>> IMHO, the hackbench performance "boost" w/o 0905f04eb21f is due to the
>>>> fact that a new task gets all it's load decayed (making it a small task)
>>>> in the __update_load_avg() call in remove_entity_load_avg() because its
>>>> se->avg.last_update_time value is 0 which creates a huge time difference
>>>> comparing it to cfs_rq->avg.last_update_time. The patch 0905f04eb21f
>>>> avoids this and thus the task stays big se->avg.load_avg = 1024.
>>>
>>> I don't care much at all about the hackbench "regression" in its own
>>> right, and what causes it, for me, bottom line is that there are cases
>>> where we need to be able to resolve, and can't, simply because we're
>>> looking at a fuzzy (rippling) reflection.
>>
>> Understood. I just thought it would be nice to know why 0905f04eb21f
>> makes this problem even more visible. But so far I wasn't able to figure
>> out why this diff in se->avg.load_avg [1024 versus 0] has this effect on
>> cfs_rq->runnable_load_avg making it even less suitable in find idlest*.
>> enqueue_entity_load_avg()'s cfs_rq->runnable_load_* += sa->load_* looks
>> suspicious though.
>
> In my testing without 0905f04eb21f I saw that se->avg.load_avg
> actually managed to skip being decayed at all before the task was
> dequeued, which meant that cfs_rq->runnable_load_avg was more likely
> to be zero after dequeue, for those workloads like hackbench that
> essentially are just a fork bomb.

Do you mean the first dequeue when the task is forked?

These are the PELT-related functions which are called when the task is
forked:

detach_entity_load_avg
attach_entity_load_avg
remove_entity_load_avg <-- se->avg.load_avg is set to 0 w/o 0905f04eb21f
                           se->avg.load_avg stays 1024 w/ 0905f04eb21f
enqueue_entity_load_avg
attach_entity_load_avg (double attach is fixed on tip/sched/core)
dequeue_entity_load_avg

> se->avg.load_avg evaded decay because se->avg.period_contrib was being
> zero'd in __update_load_avg().

I don't see the relation to se->avg.period_contrib here. IMHO,
se->avg.period_contrib is purely there to manage the three different
update phases in __update_load_avg().
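Coming back to the huge time difference for a freshly forked task: a
crude userspace illustration of what happens in that
__update_load_avg() call. decay() below is only a made-up stand-in for
decay_load(), halving per 32 full periods and ignoring the fractional
y^n scaling, and the 10s of uptime are just picked for the example:

#include <stdio.h>

/* made-up stand-in for decay_load(): val * y^n with y^32 == 1/2,
   ignoring the fractional part the kernel takes from a lookup table */
static unsigned long decay(unsigned long val, unsigned long long n)
{
        for (; n >= 32 && val; n -= 32)
                val >>= 1;
        return val;
}

int main(void)
{
        unsigned long load_avg = 1024;                  /* new task, fully loaded */
        unsigned long long now = 10ULL * 1000000000;    /* 10s uptime, in ns */
        unsigned long long last_update_time = 0;        /* new task w/o 0905f04eb21f */

        /* roughly one 1024us PELT period per ms of elapsed time */
        unsigned long long periods = ((now - last_update_time) >> 10) / 1024;

        printf("periods=%llu -> load_avg=%lu\n", periods, decay(load_avg, periods));
        return 0;
}

This prints load_avg=0, i.e. w/o 0905f04eb21f the freshly forked task
looks tiny right after remove_entity_load_avg(), whereas w/ it
se->avg.load_avg stays at the initial 1024.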
This difference in the initial se->avg.load_avg value [0 or 1024] has
an influence on wake_affine() [weight = p->se.avg.load_avg;] for the
wakeup handling of the hackbench tasks in the 'send/receive data'
phase.

There are a couple of patches on tip/sched/core which might change
this behaviour: fork path, no double attach_entity_load_avg for new
task, no remove_entity_load_avg for new task, changes in
effective_load ...

> With 0905f04eb21f applied, it's less likely (though not impossible)
> that ->period_contrib will be zero'd and so we usually end up with
> some residual load in cfs_rq->runnable_load_avg on dequeue, and hence,
>
>   cfs_rq->runnable_load_avg > se->avg.load_avg
>
> even if 'se' is the only task on the runqueue.
>
> FYI, below is my quick and dirty hack that restored hackbench
> performance for the few machines I checked. I didn't try schbench
> with it.
>
> ---
>
> From 4e9856ea3dc56e356195ca035dab7302754ce59b Mon Sep 17 00:00:00 2001
> From: Matt Fleming
> Date: Thu, 9 Jun 2016 19:48:14 +0100
> Subject: [PATCH] sched/fair: Reset ::runnable_load_avg when dequeueing
>  last entity
>
> The task and runqueue load averages maintained in p->se.avg.load_avg
> and cfs_rq->runnable_load_avg respectively, can decay at different
> wall clock rates, which means that enqueueing and then dequeueing a
> task on an otherwise empty runqueue doesn't always leave
> ::runnable_load_avg with its initial value.
>
> This can lead to the situation where cfs_rq->runnable_load_avg has a
> non-zero value even though there are no runnable entities on the
> runqueue. Assuming no entity is enqueued on this runqueue for some
> time this residual load average will decay gradually as the load
> averages are updated.
>
> But we can optimise the special case of dequeueing the last entity
> and reset ::runnable_load_avg early, which gives a performance
> improvement to workloads that trigger the load balancer, such as
> fork-heavy applications when SD_BALANCE_FORK is set, because it gives
> a more up to date view of how busy the cpu is.
>
> Signed-off-by: Matt Fleming
> ---
>  kernel/sched/fair.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c6dd8bab010c..408ee90c7ea8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3007,10 +3007,20 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  static inline void
>  dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +        unsigned long load_avg = 0;
> +
>          update_load_avg(se, 1);
>
> -        cfs_rq->runnable_load_avg =
> -                max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +        /*
> +         * If we're about to dequeue the last runnable entity we can
> +         * reset the runnable load average to zero instead of waiting
> +         * for it to decay naturally. This gives the load balancer a
> +         * more timely and accurate view of how busy this cpu is.
> +         */
> +        if (cfs_rq->nr_running > 1)
> +                load_avg = max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +
> +        cfs_rq->runnable_load_avg = load_avg;
>          cfs_rq->runnable_load_sum =
>                  max_t(s64, cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
>  }
>
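And FWIW, a similarly crude illustration of the residual your changelog
describes, i.e. why the max_t() in dequeue_entity_load_avg() can leave
cfs_rq->runnable_load_avg != 0 behind. Again decay() is only a made-up
stand-in for decay_load(), and the two period counts are just picked so
that se and cfs_rq end up decayed by different amounts:

#include <stdio.h>

/* made-up stand-in for decay_load(): val * y^n with y^32 == 1/2 */
static unsigned long decay(unsigned long val, unsigned long long n)
{
        for (; n >= 32 && val; n -= 32)
                val >>= 1;
        return val;
}

int main(void)
{
        unsigned long se_load = 1024;   /* p->se.avg.load_avg after enqueue */
        unsigned long rq_load = 1024;   /* cfs_rq->runnable_load_avg after enqueue */

        /*
         * If se and cfs_rq are decayed over a different number of full
         * periods by the time the task is dequeued ...
         */
        se_load = decay(se_load, 64);
        rq_load = decay(rq_load, 32);

        /* ... the clamp on the dequeue path leaves a residual behind: */
        rq_load = rq_load > se_load ? rq_load - se_load : 0;

        printf("residual runnable_load_avg = %lu\n", rq_load);
        return 0;
}

That's the cfs_rq->runnable_load_avg > se->avg.load_avg situation you
describe, and what your nr_running > 1 check would wipe when the last
entity is dequeued.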