From: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Yuyang Du <yuyang.du@intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [rfc patch] sched/fair: Use instantaneous load for fork/exec balancing
Date: Mon, 11 Jul 2016 09:58:52 +0100	[thread overview]
Message-ID: <57835FCC.70205@arm.com> (raw)
In-Reply-To: <20160704150452.GP8415@codeblueprint.co.uk>

On 04/07/16 16:04, Matt Fleming wrote:
> On Wed, 15 Jun, at 04:32:58PM, Dietmar Eggemann wrote:
>> On 14/06/16 17:40, Mike Galbraith wrote:
>>> On Tue, 2016-06-14 at 15:14 +0100, Dietmar Eggemann wrote:
>>>
>>>> IMHO, the hackbench performance "boost" w/o 0905f04eb21f is due to the
>>>> fact that a new task gets all its load decayed (making it a small task)
>>>> in the __update_load_avg() call in remove_entity_load_avg() because its
>>>> se->avg.last_update_time value is 0 which creates a huge time difference
>>>> comparing it to cfs_rq->avg.last_update_time. The patch 0905f04eb21f
>>>> avoids this and thus the task stays big se->avg.load_avg = 1024.
>>>
>>> I don't care much at all about the hackbench "regression" in its own
>>> right, and what causes it, for me, bottom line is that there are cases
>>> where we need to be able to resolve, and can't, simply because we're
>>> looking at a fuzzy (rippling) reflection.
>>
>> Understood. I just thought it would be nice to know why 0905f04eb21f
>> makes this problem even more visible. But so far I haven't been able
>> to figure out why this diff in se->avg.load_avg [1024 versus 0] has
>> this effect on cfs_rq->runnable_load_avg, making it even less suitable
>> in the find idlest* path. enqueue_entity_load_avg()'s
>> cfs_rq->runnable_load_* += sa->load_* looks suspicious though.
> 
> In my testing without 0905f04eb21f I saw that se->avg.load_avg
> actually managed to skip being decayed at all before the task was
> dequeued, which meant that cfs_rq->runnable_load_avg was more likely
> to be zero after dequeue, for those workloads like hackbench that
> essentially are just a fork bomb.

Do you mean the first dequeue when the task is forked?

These are the pelt related functions which are called when the task is
forked:

detach_entity_load_avg
attach_entity_load_avg
remove_entity_load_avg <-- se->avg.load_avg is set to 0 w/o 0905f04eb21f
                           se->avg.load_avg stays 1024 w/ 0905f04eb21f
enqueue_entity_load_avg
attach_entity_load_avg (double attach is fixed on tip/sched/core)
dequeue_entity_load_avg

> se->avg.load_avg evaded decay because se->avg.period_contrib was being
> zero'd in __update_load_avg().

I don't see the relation to se->avg.period_contrib here. IMHO,
se->avg.period_contrib is purely there to manage the 3 different update
phases in __update_load_avg().

This difference in the initial se->avg.load_avg value [0 or 1024] has an
influence on wake_affine() [weight = p->se.avg.load_avg;] for the wakeup
handling of the hackbench tasks in the 'send/receive data' phase.

There are a couple of patches on tip/sched/core which might change the
behaviour of this: fork path, no double attach_entity_load_avg for new
task, no remove_entity_load_avg for new task, changes in effective_load ...

> With 0905f04eb21f applied, it's less likely (though not impossible)
> that ->period_contrib will be zero'd and so we usually end up with
> some residual load in cfs_rq->runnable_load_avg on dequeue, and hence,
> 
> 	cfs_rq->runnable_load_avg > se->avg.load_avg
> 
> even if 'se' is the only task on the runqueue.
> 
> FYI, below is my quick and dirty hack that restored hackbench
> performance for the few machines I checked. I didn't try schbench with
> it.
> 
> ---
> 
> From 4e9856ea3dc56e356195ca035dab7302754ce59b Mon Sep 17 00:00:00 2001
> From: Matt Fleming <matt@codeblueprint.co.uk>
> Date: Thu, 9 Jun 2016 19:48:14 +0100
> Subject: [PATCH] sched/fair: Reset ::runnable_load_avg when dequeueing last
>  entity
> 
> The task and runqueue load averages, maintained in p->se.avg.load_avg
> and cfs_rq->runnable_load_avg respectively, can decay at different
> wall clock rates, which means that enqueueing and then dequeueing a
> task on an otherwise empty runqueue doesn't always leave
> ::runnable_load_avg with its initial value.
> 
> This can lead to the situation where cfs_rq->runnable_load_avg has a
> non-zero value even though there are no runnable entities on the
> runqueue. Assuming no entity is enqueued on this runqueue for some
> time this residual load average will decay gradually as the load
> averages are updated.
> 
> But we can optimise the special case of dequeueing the last entity and
> reset ::runnable_load_avg early, which gives a performance improvement
> to workloads that trigger the load balancer, such as fork-heavy
> applications when SD_BALANCE_FORK is set, because it gives a more
> up-to-date view of how busy the cpu is.
> 
> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
> ---
>  kernel/sched/fair.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c6dd8bab010c..408ee90c7ea8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3007,10 +3007,20 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  static inline void
>  dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +	unsigned long load_avg = 0;
> +
>  	update_load_avg(se, 1);
>  
> -	cfs_rq->runnable_load_avg =
> -		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +	/*
> +	 * If we're about to dequeue the last runnable entity we can
> +	 * reset the runnable load average to zero instead of waiting
> +	 * for it to decay naturally. This gives the load balancer a
> +	 * more timely and accurate view of how busy this cpu is.
> +	 */
> +	if (cfs_rq->nr_running > 1)
> +		load_avg = max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +
> +	cfs_rq->runnable_load_avg = load_avg;
>  	cfs_rq->runnable_load_sum =
>  		max_t(s64,  cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
>  }
> 

