From: bsegall@google.com
To: Yuyang Du
Cc: Peter Zijlstra, Morten Rasmussen, Vincent Guittot, Dietmar Eggemann,
    Steve Muckle, mingo@redhat.com, daniel.lezcano@linaro.org,
    mturquette@baylibre.com, rjw@rjwysocki.net, Juri Lelli,
    sgurrappadi@nvidia.com, pang.xunlei@zte.com.cn,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
Date: Mon, 21 Sep 2015 10:30:04 -0700
In-Reply-To: <20150921011652.GD11102@intel.com> (Yuyang Du's message of
    "Mon, 21 Sep 2015 09:16:52 +0800")

Yuyang Du writes:

> On Thu, Sep 17, 2015 at 12:38:25PM +0200, Peter Zijlstra wrote:
>> On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:
>>
>> > While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
>> > mess?
>>
>> I suspect his patch will fail to compile on ARM, which uses
>> SCHED_CAPACITY_* outside of kernel/sched/*.
>>
>> But if you all (Ben, Yuyang, you) can agree on a patch simplifying these
>> things I'm not opposed to it.
>
> Yes, indeed. So SCHED_RESOLUTION_SHIFT has to be defined in include/linux/sched.h.
>
> With this, I think the code still needs some cleanup and, importantly,
> documentation.
>
> But first, I think load_sum and load_avg can accommodate NICE_0_LOAD at either
> high or low resolution. So we have no reason to keep a low-resolution (10-bit)
> load_avg when NICE_0_LOAD has high resolution (20 bits), because
> load_avg = runnable% * load, as opposed to the current
> load_avg = runnable% * scale_load_down(load).
>
> Shall we get rid of all scale_load_down() for the runnable load average?

Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, i.e. barely within a
32-bit unsigned long, but LOAD_AVG_MAX * MAX_SHARES already overflows on
32-bit (even with the old code, in fact). This should probably be fixed...
somehow. Dividing load_sum by 4 on 32-bit would work, though it would be
ugly. Reducing MAX_SHARES by 2 bits on 32-bit might have made sense, but it
would be a weird difference between 32-bit and 64-bit, and could break
userspace anyway, so it's presumably too late for that. 64-bit has ~30 bits
free, so this would be fine as long as SLR is 0 on 32-bit.
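To spell the arithmetic out, here is a quick userspace sanity check. The
constants are written out by hand (LOAD_AVG_MAX = 47742,
prio_to_weight[0] = 88761, MAX_SHARES = 1 << 18), so double-check them
against your tree before trusting the exact numbers:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Hand-copied stand-ins for the kernel constants discussed above. */
	const uint64_t load_avg_max  = 47742;		/* LOAD_AVG_MAX */
	const uint64_t nice_minus_20 = 88761;		/* prio_to_weight[0] */
	const uint64_t max_shares    = 1ULL << 18;	/* MAX_SHARES */

	uint64_t worst_weight = load_avg_max * nice_minus_20;
	uint64_t worst_shares = load_avg_max * max_shares;

	/* 4237627662: fits (barely) in a 32-bit unsigned long. */
	printf("LOAD_AVG_MAX * prio_to_weight[0] = %llu (fits in u32: %s)\n",
	       (unsigned long long)worst_weight,
	       worst_weight <= UINT32_MAX ? "yes" : "no");

	/* 12515278848: already past 2^32, so 32-bit overflows here. */
	printf("LOAD_AVG_MAX * MAX_SHARES        = %llu (fits in u32: %s)\n",
	       (unsigned long long)worst_shares,
	       worst_shares <= UINT32_MAX ? "yes" : "no");
	return 0;
}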
>
> --
> Subject: [PATCH] sched/fair: Generalize the load/util averages resolution
>  definition
>
> The metrics need a certain resolution to determine how much detail we
> can look into (i.e. to avoid losing detail to integer rounding), which
> also determines the range of the metrics.
>
> For instance, to increase the resolution of [0, 1] (two levels), one
> can multiply by 1024 and get [0, 1024] (1025 levels).
>
> In sched/fair, a few metrics depend on the resolution: load/load_avg,
> util_avg, and capacity (frequency adjustment). In order to reduce the
> risk of mistakes relating to resolution/range, we therefore generalize
> the resolution by defining a basic resolution constant, and then express
> all metrics in terms of that basic resolution. The basic resolution is
> 1024, or (1 << 10). Further, one can recursively apply the basic
> resolution to increase the final resolution.
>
> As pointed out by Ben Segall, NICE_0's weight (visible to user) and load
> have independent resolution, but they must be well calibrated.
>
> Signed-off-by: Yuyang Du
> ---
>  include/linux/sched.h |  9 ++++++---
>  kernel/sched/fair.c   |  4 ----
>  kernel/sched/sched.h  | 15 ++++++++++-----
>  3 files changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index bd38b3e..9b86f79 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -909,10 +909,13 @@ enum cpu_idle_type {
>  	CPU_MAX_IDLE_TYPES
>  };
>
> +# define SCHED_RESOLUTION_SHIFT	10
> +# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
> +
>  /*
>   * Increase resolution of cpu_capacity calculations
>   */
> -#define SCHED_CAPACITY_SHIFT	10
> +#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>
>  /*
> @@ -1180,8 +1183,8 @@ struct load_weight {
>   * 1) load_avg factors frequency scaling into the amount of time that a
>   * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
>   * aggregated such weights of all runnable and blocked sched_entities.
> - * 2) util_avg factors frequency and cpu scaling into the amount of time
> - * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
> + * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
> + * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
>   * For cfs_rq, it is the aggregated such times of all runnable and
>   * blocked sched_entities.
>   * The 64 bit load_sum can:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..c61fd8e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2522,10 +2522,6 @@ static u32 __compute_runnable_contrib(u64 n)
>  	return contrib + runnable_avg_yN_sum[n];
>  }
>
> -#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> -#error "load tracking assumes 2^10 as unit"
> -#endif
> -
>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>
>  /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3845a71..31b4022 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -53,18 +53,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>   * increased costs.
>   */
>  #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
> -# define SCHED_LOAD_RESOLUTION	10
> -# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
> -# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
> +# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
> +# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
>  #else
> -# define SCHED_LOAD_RESOLUTION	0
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
>  # define scale_load(w)		(w)
>  # define scale_load_down(w)	(w)
>  #endif
>
> -#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
>
> +/*
> + * NICE_0's weight (visible to user) and its load (invisible to user) have
> + * independent resolution, but they should be well calibrated. We use scale_load()
> + * and scale_load_down(w) to convert between them; the following must be true:
> + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
> + */
>  #define NICE_0_LOAD		SCHED_LOAD_SCALE
>  #define NICE_0_SHIFT		SCHED_LOAD_SHIFT

I still think tying the scale_load shift to the SCHED_CAPACITY/etc. shift
is silly, and tying NICE_0_LOAD/SHIFT in as well is worse. Honestly, if I
were going to change anything, it would be to define NICE_0_LOAD/SHIFT
entirely separately from SCHED_LOAD_SCALE/SHIFT.

However, I'm not sure whether calculate_imbalance's use of SCHED_LOAD_SCALE
is actually a separate use of 1024*SLR-as-percentage, or whether it is
basically assuming most tasks are nice-0. It sure /looks/ like it's
comparing values with different units: it computes (nr_running * CONST -
group_capacity) and compares that to load, so it looks like both (i.e.
increasing load.weight of everything on your system by X% would change
load balancer behavior here). Given that, it might make sense to make it
clear that capacity units and nice-0-task units have to be the same thing
due to load balancer approximations (though they are still entirely
separate from the SCHED_LOAD_RESOLUTION multiplier).
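To illustrate the unit mixing, here is a standalone toy rendering of the
shape of that calculate_imbalance() expression. This is not the kernel
code; the struct, field names, and numbers are made up for illustration
only:

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL	/* one nice-0 task's worth of load */
#define min(a, b)		((a) < (b) ? (a) : (b))

/* Toy stand-in for the per-group statistics used by the load balancer. */
struct sg_stats {
	unsigned long sum_nr_running;	/* count of runnable tasks */
	unsigned long group_capacity;	/* SCHED_CAPACITY_SCALE units */
	unsigned long avg_load;		/* load.weight-derived units */
};

int main(void)
{
	struct sg_stats busiest = {
		.sum_nr_running = 6,
		.group_capacity = 4 * 1024,
		.avg_load       = 5000,
	};
	unsigned long sds_avg_load = 3000;
	unsigned long load_above_capacity, max_pull;

	/* nr_running in nice-0-task units, minus capacity units ... */
	load_above_capacity = busiest.sum_nr_running * SCHED_LOAD_SCALE;
	if (load_above_capacity > busiest.group_capacity)
		load_above_capacity -= busiest.group_capacity;
	else
		load_above_capacity = ~0UL;

	/* ... min()'d against a delta in load.weight units: scale everyone's
	 * load.weight by a few percent and this min() can flip sides. */
	max_pull = min(busiest.avg_load - sds_avg_load, load_above_capacity);

	printf("load_above_capacity = %lu, max_pull = %lu\n",
	       load_above_capacity, max_pull);
	return 0;
}

With these made-up numbers max_pull comes from the avg_load delta (2000),
but inflating avg_load by ~5% makes load_above_capacity (2048) win instead,
which is the "load.weight of everything changes the balance point" effect.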