Date: Thu, 17 Sep 2015 11:51:38 +0200
From: Peter Zijlstra
To: Morten Rasmussen
Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo@redhat.com,
	daniel.lezcano@linaro.org, yuyang.du@intel.com, mturquette@baylibre.com,
	rjw@rjwysocki.net, Juri Lelli, sgurrappadi@nvidia.com,
	pang.xunlei@zte.com.cn, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
Message-ID: <20150917095138.GF3604@twins.programming.kicks-ass.net>
In-Reply-To: <20150911172246.GI27098@e105550-lin.cambridge.arm.com>

On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:
> I have done some runs with the proposed fixes added:
>
> 1. PeterZ's util_sum shift fix (change util_sum).
> 2. Morten's scaling of weight instead of time (reduce bit loss).
> 3. PeterZ's unconditional calls to arch*() functions (compiler opt).
>
> To be clear: 2 includes 1, and 3 includes 1 and 2.
>
> Runs were done with the default (#define) implementation of the
> arch-functions and with arch-specific implementations for ARM.
> Results:
>
> perf numbers are the average of three (x10) runs. Raw data is available
> further down.
>
> ARM TC2               #mul           #mul_all       perf bench
> arch*()           default  arm   default  arm   default     arm
>
> 1 shift_fix            10   16        22   36    13.401  13.288
> 2 scaled_weight        12   14        30   32    13.282  13.238
> 3 unconditional        12   14        26   32    13.296  13.427
>
> Intel E5-2690         #mul           #mul_all       perf bench
> arch*()           default         default         default
>
> 1 shift_fix            13                          14.786
> 2 scaled_weight        18                          15.078
> 3 unconditional        14                          15.195
>
>
> Overall it appears that fewer 'mul' instructions doesn't necessarily
> mean a better perf bench score. For ARM, 2 seems the best choice
> overall.

I suspect you're paying for having to do an actual load (which can miss)
there. So that makes sense.

> While 1 is better for Intel.

Right, because GCC shits itself with those conditionals. Weirdly though,
the below version does not seem so affected.

> I suggest that I spin a v2 of this series and go with scaled_weight to
> reduce bit loss. Any objections?

Just playing devil's advocate to myself: how about cgroups? Won't a
per-cpu share of the cgroup weight often be very small?

So I had a little play, and I'm not at all convinced we want to do this
(I've not actually run any numbers on it, but I can well imagine the
extra condition hurting on a branch mispredict), but it does show GCC
need not always get confused.
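To put rough numbers on the bit-loss argument behind scaled_weight first
(a toy standalone example with made-up values, not taken from Morten's
runs above): taking the >>10 shift against a small delta throws away a
relatively large part of the result, while taking it against the
(larger) weight keeps the result exact in this case.

#include <stdio.h>

int main(void)
{
	unsigned long delta = 3, weight = 1024, scale_freq = 512;

	/* cap_scale() order: shift the small delta first */
	unsigned long a = ((delta * scale_freq) >> 10) * weight;   /* 1 * 1024 = 1024 */

	/* scaled_weight order: shift the large weight first */
	unsigned long b = ((weight * scale_freq) >> 10) * delta;   /* 512 * 3 = 1536 */

	/* exact value: 3 * 1024 * 512 / 1024 = 1536 */
	printf("%lu %lu\n", a, b);
	return 0;
}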
---
 kernel/sched/fair.c | 58 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9176f7c588a8..1b60fbe3b86c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2519,7 +2519,25 @@ static u32 __compute_runnable_contrib(u64 n)
 #error "load tracking assumes 2^10 as unit"
 #endif
 
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+static __always_inline unsigned long fp_mult2(unsigned long x, unsigned long y)
+{
+	y *= x;
+	y >>= 10;
+
+	return y;
+}
+
+static __always_inline unsigned long fp_mult3(unsigned long x, unsigned long y, unsigned long z)
+{
+	if (x > y)
+		swap(x,y);
+
+	z *= y;
+	z >>= 10;
+	z *= x;
+
+	return z;
+}
 
 /*
  * We can represent the historical contribution to runnable average as the
@@ -2553,9 +2571,9 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
+	u64 delta, periods;
 	u32 contrib;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
+	unsigned int delta_w, decayed = 0;
 	unsigned long scale_freq, scale_cpu;
 
 	delta = now - sa->last_update_time;
@@ -2577,8 +2595,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	if (weight)
+		scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	if (running)
+		scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
@@ -2594,16 +2614,14 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
-			if (cfs_rq) {
-				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
-			}
+			unsigned long t = fp_mult3(delta_w, weight, scale_freq);
+			sa->load_sum += t;
+			if (cfs_rq)
+				cfs_rq->runnable_load_sum += t;
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
+			sa->util_sum += delta_w * fp_mult2(scale_cpu, scale_freq);
 
 		delta -= delta_w;
 
@@ -2620,25 +2638,25 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
-		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * contrib;
+			unsigned long t = fp_mult3(contrib, weight, scale_freq);
+			sa->load_sum += t;
 			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * contrib;
+				cfs_rq->runnable_load_sum += t;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_cpu;
+			sa->util_sum += contrib * fp_mult2(scale_cpu, scale_freq);
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * scaled_delta;
+		unsigned long t = fp_mult3(delta, weight, scale_freq);
+		sa->load_sum += t;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
+			cfs_rq->runnable_load_sum += t;
 	}
 	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
+		sa->util_sum += delta * fp_mult2(scale_cpu, scale_freq);
 
 	sa->period_contrib += delta;
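And to make the devil's-advocate point concrete, a standalone userspace
sketch of the fp_mult3() ordering from the patch (the values are
assumptions, picked to mimic a tiny per-cpu share of a group's weight):
always shifting against the weight loses bits again once the weight
itself is small, whereas shifting against whichever operand is larger
stays exact here.

#include <stdio.h>

/* userspace copy of the fp_mult3() idea above */
static unsigned long fp_mult3(unsigned long x, unsigned long y, unsigned long z)
{
	if (x > y) {			/* make y the larger of x and y */
		unsigned long t = x;
		x = y;
		y = t;
	}
	z *= y;				/* scale against the larger operand ... */
	z >>= 10;			/* ... so the shift truncates relatively little */
	z *= x;
	return z;
}

int main(void)
{
	/* assumed values: large delta, tiny per-cpu share of a group's weight */
	unsigned long delta = 900, weight = 3, scale_freq = 512;

	/* always shifting against the weight: ((3 * 512) >> 10) * 900 = 900 */
	unsigned long fixed = ((weight * scale_freq) >> 10) * delta;

	/* fp_mult3() shifts against delta here: ((900 * 512) >> 10) * 3 = 1350 (exact) */
	unsigned long swapped = fp_mult3(delta, weight, scale_freq);

	printf("%lu %lu\n", fixed, swapped);
	return 0;
}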