From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932381AbcESPgK (ORCPT ); Thu, 19 May 2016 11:36:10 -0400 Received: from foss.arm.com ([217.140.101.70]:35830 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932140AbcESPgI (ORCPT ); Thu, 19 May 2016 11:36:08 -0400 Date: Thu, 19 May 2016 16:36:38 +0100 From: Morten Rasmussen To: Vincent Guittot Cc: Yuyang Du , Peter Zijlstra , linux-kernel , Mike Galbraith , Ingo Molnar , Dietmar Eggemann , Wanpeng Li Subject: Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity Message-ID: <20160519153637.GA27946@e105550-lin.cambridge.arm.com> References: <1461958364-675-4-git-send-email-dietmar.eggemann@arm.com> <20160512214811.GB8790@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote: > On 12 May 2016 at 23:48, Yuyang Du wrote: > > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote: > >> Commit-ID: cfa10334318d8212d007da8c771187643c9cef35 > >> Gitweb: http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35 > >> Author: Morten Rasmussen > >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100 > >> Committer: Ingo Molnar > >> CommitDate: Thu, 12 May 2016 09:55:33 +0200 > >> > >> sched/fair: Correct unit of load_above_capacity > >> > >> In calculate_imbalance() load_above_capacity currently has the unit > >> [capacity] while it is used as being [load/capacity]. Not only is it > >> wrong it also makes it unlikely that load_above_capacity is ever used > >> as the subsequent code picks the smaller of load_above_capacity and > >> the avg_load > >> > >> This patch ensures that load_above_capacity has the right unit > >> [load/capacity]. > > load_above_capacity has the unit [load] and is then compared with other load I'm not sure if talk about the same code, I'm referring to: max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity); a bit further down. Here the first option has unit [load/capacity], and the subsequent code multiplies the result with [capacity] to get back to [load]: /* How much load to actually move to equalise the imbalance */ env->imbalance = min( max_pull * busiest->group_capacity, (sds->avg_load - local->avg_load) * local->group_capacity ) / SCHED_CAPACITY_SCALE; That lead me to the conclusion that load_above_capacity would have to be 'per capacity', i.e. [load/capacity], as well for the code to make sense. > > >> > >> Signed-off-by: Morten Rasmussen > >> [ Changed changelog to note it was in capacity unit; +rebase. ] > >> Signed-off-by: Peter Zijlstra (Intel) > >> Cc: Dietmar Eggemann > >> Cc: Linus Torvalds > >> Cc: Mike Galbraith > >> Cc: Peter Zijlstra > >> Cc: Thomas Gleixner > >> Cc: linux-kernel@vger.kernel.org > >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com > >> Signed-off-by: Ingo Molnar > >> --- > >> kernel/sched/fair.c | 6 ++++-- > >> 1 file changed, 4 insertions(+), 2 deletions(-) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index 2338105..218f8e8 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > >> if (busiest->group_type == group_overloaded && > >> local->group_type == group_overloaded) { > >> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE; > >> - if (load_above_capacity > busiest->group_capacity) > >> + if (load_above_capacity > busiest->group_capacity) { > >> load_above_capacity -= busiest->group_capacity; > >> - else > >> + load_above_capacity *= NICE_0_LOAD; > >> + load_above_capacity /= busiest->group_capacity; > > If I follow correctly the sequence, > load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE > - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity > so > load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE / > busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD > > so the unit is [load] I'm with you for the equation, but not for the unit and I find it hard convince myself what the unit really is :-( I think it is down to the fact that we are comparing [load] directly with [capacity] here, basically saying [load] == [capacity]. Most other places we scale [load] by [capacity], we don't compare the two or otherwise imply that they are the same unit. Do we have any other examples of this? The name load_above_capacity suggests that it should have unit [load], but we actually compute it by subtracting to values, where the latter clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97 sched/fair: Clean up scale confusion) changes the former to be [capacity]. load_above_capacity is later multiplied by [capacity] to determine the imbalance which must have unit [load]. So working our way backwards, load_above_capacity must have unit [load/capacity]. However, if [load] = [capacity] then what we really have is just group-normalized [load]. As said in my other reply, the code only really makes sense for under special circumstances where [load] == [capacity]. The easy, and possibly best, way out of this mess would be to replace the code with something like PeterZ's suggestion that I tried to implement in the patch included in my other reply. > Lets take the example of a group made of 2 cores with 2 threads per > core and the capacity of each core is 1,5* SCHED_CAPACITY_SCALE so the > capacity of the group is 3*SCHED_CAPACITY_SCALE. > If we have 6 tasks on this group : > load_above capacity = 1 *NICE_0_LOAD which means we want to remove no > more than 1 tasks from the group and let 5 tasks in the group whereas > we don't expect to have more than 4 tasks as we have 4 CPUs and > probably less because the compute capacity of each CPU is less than > the default one > > So i would have expected > load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD - > busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE > load_above capacity = 3*NICE_0_LOAD which means we want to remove no > more than 3 tasks and let 3 tasks in the group And this is exactly you get with this patch :-) load_above_capacity (through max_pull) is multiplied by the group capacity to compute that actual amount of [load] to remove: env->imbalance = load_above_capacity * busiest->group_capacity / SCHED_CAPACITY_SCALE = 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE = 3*NICE_0_LOAD I don't think we disagree on how it should work :-) Without the capacity scaling in this patch you get: env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) * 3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE = 9*SCHED_CAPACITY_SCALE Coming back to Yuyang's question. I think it should be NICE_0_LOAD to ensure that the resulting imbalance has the proper unit [load]. I'm happy to scrap this patch and replace the whole thing with something else that makes more sense, but IMHO it is needed if the current mess should make any sense at all.