From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751824AbbLNOq1 (ORCPT <rfc822;w@1wt.eu>);
	Mon, 14 Dec 2015 09:46:27 -0500
Received: from foss.arm.com ([217.140.101.70]:43398 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751249AbbLNOq0 (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Mon, 14 Dec 2015 09:46:26 -0500
Date: Mon, 14 Dec 2015 14:46:46 +0000
From: Morten Rasmussen <morten.rasmussen@arm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Yuyang Du <yuyang.du@intel.com>, Andrey Ryabinin <aryabinin@virtuozzo.com>,
        mingo@redhat.com, linux-kernel@vger.kernel.org,
        Paul Turner <pjt@google.com>, Ben Segall <bsegall@google.com>
Subject: Re: [PATCH] sched/fair: fix mul overflow on 32-bit systems
Message-ID: <20151214144645.GA23930@e105550-lin.cambridge.arm.com>
References: <1449838518-26543-1-git-send-email-aryabinin@virtuozzo.com>
 <20151211132551.GO6356@twins.programming.kicks-ass.net>
 <20151211133612.GG6373@twins.programming.kicks-ass.net>
 <566AD6E1.2070005@virtuozzo.com>
 <20151211175751.GA27552@e105550-lin.cambridge.arm.com>
 <20151213224224.GC28098@intel.com>
 <20151214115453.GN6357@twins.programming.kicks-ass.net>
 <20151214130723.GB9870@e105550-lin.cambridge.arm.com>
 <20151214142021.GO6357@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20151214142021.GO6357@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Dec 14, 2015 at 03:20:21PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 14, 2015 at 01:07:26PM +0000, Morten Rasmussen wrote:
> 
> > Agreed, >100% is a transient state (which can be rather long) which only
> > means over-utilized, nothing more. Would you like the metric itself to
> > be changed to saturate at 100% or just cap it to 100% when used?
> 
> We already cap it when using it IIRC. But no, I was thinking of the
> measure itself.

Yes, okay.

> 
> > It is not straightforward to provide a bound on the sum.
> 
> Agreed..
> 
> > There isn't one for load_avg either.
> 
> But that one is fundamentally unbounded, whereas the util thing is
> fundamentally bounded, except our implementation isn't.

Agreed.

> 
> > If we want to guarantee an upper bound for
> > cfs_rq->avg.util_sum we have to somehow cap the se->avg.util_avg
> > contributions for each sched_entity. This cap depends on the cpu and how
> > many other tasks are associated with that cpu. The cap may have to
> > change when tasks migrate.
> 
> Yep, blows :-)
> 
> > > However, I think that makes sense, but would propose doing it
> > > differently. That condition is generally a maximum (assuming proper
> > > functioning of the weight based scheduling etc..) for any one task, so
> > > on migrate we can hard clip to this value.
> 
> > Why use load.weight to scale util_avg? It is affected by priority. Isn't
> > just the ratio 1/nr_running that you are after?
> 
> Remember, the util thing is based on running, so assuming each task
> always wants to run, each task gets to run w_i/\Sum_j w_j due to CFS
> being a weighted fair queueing thingy.

Of course, yes.

> 
> > IIUC, you propose to clip the sum itself. In which case you are running
> > into trouble when removing tasks. You don't know how much to remove from
> > the clipped sum.
> 
> Right, then we'll have to slowly gain it again.

If you have a seriously over-utilized cpu and migrate some of the tasks
to a different cpu, the old cpu may temporarily look lightly utilized
even if we leave some big tasks behind. That might lead us into trouble
if we start using util_avg as the basis for cpufreq decisions. If we
care about performance, the safe choice is to keep treating an
over-utilized cpu as over-utilized even after we have migrated tasks
away. We can only trust that the cpu is no longer over-utilized when
cfs_rq->avg.util_avg 'naturally' goes below 100%. So from that point of
view, it might be better to let it stay at 100% and let it sort itself
out.
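
To illustrate the concern, here is a toy model, not kernel code. All
names (toy_rq, toy_attach, toy_detach) are hypothetical, and the scale
of 1024 for 100% is an assumption. It compares a sum that is hard
clipped at 100% with the plain unclipped sum when a task is removed:

```c
#include <stdint.h>

#define TOY_UTIL_MAX 1024u /* assumed scale: 1024 == 100% of one cpu */

/* Toy model of one cpu's utilization, tracked in two ways. */
struct toy_rq {
	uint32_t util_clipped;   /* sum that is hard clipped at 100% */
	uint32_t util_unclipped; /* plain sum of task contributions */
};

/* A task arrives with its own utilization contribution. */
static void toy_attach(struct toy_rq *rq, uint32_t task_util)
{
	rq->util_unclipped += task_util;
	rq->util_clipped += task_util;
	if (rq->util_clipped > TOY_UTIL_MAX)
		rq->util_clipped = TOY_UTIL_MAX;
}

/* A task leaves; callers only detach what they attached earlier. */
static void toy_detach(struct toy_rq *rq, uint32_t task_util)
{
	rq->util_unclipped -= task_util;
	/*
	 * The clipped sum no longer knows how much of task_util it
	 * actually contained, so subtracting the full amount undershoots.
	 */
	rq->util_clipped = rq->util_clipped > task_util ?
			   rq->util_clipped - task_util : 0;
}
```

With three tasks of 800 each attached, the unclipped sum is 2400 and the
clipped one is 1024. After detaching one task the unclipped sum is 1600,
so still over-utilized, while the clipped sum drops to 224 and the cpu
looks lightly loaded.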

> > Another problem is that load.weight is just a snapshot while
> > avg.util_avg includes tasks that are not currently on the rq so the
> > scaling factor is probably bigger than what you want.
> 
> Our weight guesstimates also include non-running (aka blocked) tasks,
> right?

The rq/cfs_rq load.weight doesn't. It is updated through
update_load_{add,sub}() in account_entity_{enqueue,dequeue}(). So only
runnable+running tasks I think.
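
A simplified sketch of why that matters for a weight-based clip. This
is not the kernel implementation: toy_cfs_rq, toy_enqueue, toy_dequeue
and toy_util_clip are hypothetical names, and the 1024 scale is an
assumption. The weight sum only follows enqueue/dequeue, so a per-task
clip of w_i/\Sum_j w_j grows as soon as another task blocks:

```c
#include <assert.h>

#define TOY_UTIL_MAX 1024UL /* assumed scale: 1024 == 100% of one cpu */

/* Toy runqueue: the weight sum covers enqueued entities only. */
struct toy_cfs_rq {
	unsigned long load_weight;
};

/* Enqueue adds the entity's weight to the snapshot. */
static void toy_enqueue(struct toy_cfs_rq *rq, unsigned long weight)
{
	rq->load_weight += weight;
}

/* Dequeue (e.g. the task blocks) removes it again immediately. */
static void toy_dequeue(struct toy_cfs_rq *rq, unsigned long weight)
{
	assert(rq->load_weight >= weight);
	rq->load_weight -= weight;
}

/* Hypothetical clip for one task: TOY_UTIL_MAX * w_i / Sum_j w_j. */
static unsigned long toy_util_clip(const struct toy_cfs_rq *rq,
				   unsigned long weight)
{
	assert(rq->load_weight > 0);
	return TOY_UTIL_MAX * weight / rq->load_weight;
}
```

With two equal-weight tasks enqueued the clip for each is 512. Once one
of them blocks and is dequeued, the clip for the remaining task becomes
1024, even though the blocked task's contribution is still part of
avg.util_avg.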

> > If we leave the sum as it is (unclipped) add/remove shouldn't give us
> > any problems. The only problem is the overflow, which is solved by using
> > a 64bit type for load_avg. That is not an acceptable solution?
> 
> It might be. After all, any time any of this is needed we're CPU bound
> and the utilization measure is pointless anyway. That measure only
> matters if it's small and the sum is 'small'. After that it's back to
> the normal load based thingy.

Yes, agreed.
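
For reference, a small sketch of the overflow itself. It is a
standalone illustration, not the patch: the helper names are
hypothetical, and I am assuming 47742 for LOAD_AVG_MAX and using
uint32_t to stand in for a 32-bit unsigned long:

```c
#include <stdint.h>

#define TOY_LOAD_AVG_MAX 47742u /* assumed value of LOAD_AVG_MAX */

/* Product truncated to 32 bits, as on a 32-bit system. */
static uint32_t toy_mul32(uint32_t util_avg)
{
	return (uint32_t)(util_avg * TOY_LOAD_AVG_MAX);
}

/* Same product with the operand widened to 64 bits first. */
static uint64_t toy_mul64(uint32_t util_avg)
{
	return (uint64_t)util_avg * TOY_LOAD_AVG_MAX;
}

/* Returns 1 if the 32-bit product is exact, 0 if it wrapped. */
static int toy_fits32(uint32_t util_avg)
{
	return toy_mul64(util_avg) == (uint64_t)toy_mul32(util_avg);
}
```

Under those assumptions the 32-bit product is exact up to a summed
util_avg of 89962 and wraps from 89963 onwards, which is roughly 88
tasks' worth of 1024 each. The 64-bit variant has no such limit in
practice.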