Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig

From: bsegall@google.com
To: Yuyang Du <yuyang.du@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steve Muckle <steve.muckle@linaro.org>,
	"mingo\@redhat.com" <mingo@redhat.com>,
	"daniel.lezcano\@linaro.org" <daniel.lezcano@linaro.org>,
	"mturquette\@baylibre.com" <mturquette@baylibre.com>,
	"rjw\@rjwysocki.net" <rjw@rjwysocki.net>,
	Juri Lelli <Juri.Lelli@arm.com>,
	"sgurrappadi\@nvidia.com" <sgurrappadi@nvidia.com>,
	"pang.xunlei\@zte.com.cn" <pang.xunlei@zte.com.cn>,
	"linux-kernel\@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
Date: Wed, 23 Sep 2015 09:54:08 -0700	[thread overview]
Message-ID: <xm26zj0d84b3.fsf@sword-of-the-dawn.mtv.corp.google.com> (raw)
In-Reply-To: <20150922232222.GF11102@intel.com> (Yuyang Du's message of "Wed, 23 Sep 2015 07:22:22 +0800")

Yuyang Du <yuyang.du@intel.com> writes:

> On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote:
>> Yuyang Du <yuyang.du@intel.com> writes:
>> 
>> > On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote:
>> >> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
>> >> > or low resolution. So we have no reason to have low resolution (10bits) load_avg
>> >> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
>> >> > as opposed to now we have load_avg = runnable% * scale_load_down(load).
>> >> >
>> >> > We get rid of all scale_load_down() for runnable load average?
>> >> 
>> >> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
>> >> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
>> >> going to give errors on 32-bit (even with the old code in fact). This
>> >> should probably be fixed... somehow (dividing by 4 for load_sum on
>> >> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
>> >> 32-bit might have made sense but would be a weird difference between 32
>> >> and 64, and could break userspace anyway, so it's presumably too late
>> >> for that).
>> >> 
>> >> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
>> >> 32-bit.
>> >> 
>> >
>> > load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load.
>> > So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits
>> > load resolution. This is ok, because struct load_weight's load is also unsigned
>> > long. If overflown, cfs_rq->load.weight will be overflown in the first place.
>> >
>> > However, after a second thought, this is not quite right. Because load_avg is not
>> > necessarily no greater than load, since load_avg has blocked load in it. Although,
>> > load_avg is still at the same level as load (converging to be <= load), we may not
>> > want the risk to overflow on 32bit.
>  
> This second thought made a mistake (what was wrong with me). load_avg is for sure
> no greater than load with or without blocked load.
>
> With that said, it really does not matter what the following numbers are, 32bit or
> 64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry
> whether overflow or not, not the load_avg. It is as simple as that.
>
> With that, I think we can and should get rid of the scale_load_down()
> for load_avg.

load_avg yes is bounded by load.weight, but on 64-bit load_sum is only
bounded by load.weight * LOAD_AVG_MAX and is the same size as
load.weight (as I said below). There's still space for anything
reasonable though with 10 bits of SLR.

>
>> Yeah, I missed that load_sum was u64 and only load_avg was long. This
>> means we're fine on 32-bit with no SLR (or more precisely, cfs_rq
>> runnable_load_avg can overflow, but only when cfs_rq load.weight does,
>> so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37
>> tasks before load.weight overflows, and ~2^31 tasks before
>> runnable_load_avg does, which is obviously fine (and in fact you hit
>> PID_MAX_LIMIT and even if you had the cpu/ram/etc to not fall over).
>> 
>> Now, applying SLR to runnable_load_avg would cut this down to ~2^21
>> tasks running at once or 2^20 with cgroups, which is technically
>> allowed, though it seems utterly implausible (especially since this
>> would have to all be on one cpu). If SLR was increased as peterz asked
>> about, this could be an issue though.
>> 
>> All that said, using SLR on load_sum/load_avg as opposed to cfs_rq
>> runnable_load_avg would be fine, as they're limited to only one
>> task/cgroup's weight. Having it SLRed and cfs_rq not would be a
>> little odd, but not impossible.
>  
>
>> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
>> > nice-0's load, I don't understand why you want to separate them.
>> 
>> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
>> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
>> used in precisely one place other than the newish util_avg, and as I
>> mentioned it's not remotely clear what compute_imbalance is doing theer)
>
> Yes, it is not clear to me either.
>
> With the above proposal to get rid of scale_load_down() for load_avg, so I think
> now we can remove SCHED_LOAD_*, and rename scale_load() to user_to_kernel_load(),
> and raname scale_load_down() to kernel_to_user_load().
>
> Hmm?

I have no opinion on renaming the scale_load functions, it's certainly
reasonable, but the scale_load names seem fine too.