From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753111AbbC3N3d (ORCPT ); Mon, 30 Mar 2015 09:29:33 -0400 Received: from mail-ob0-f177.google.com ([209.85.214.177]:32769 "EHLO mail-ob0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752701AbbC3N3b (ORCPT ); Mon, 30 Mar 2015 09:29:31 -0400 MIME-Version: 1.0 In-Reply-To: <20150330122449.GH21418@twins.programming.kicks-ass.net> References: <20150326130014.21532.17158.stgit@preeti.in.ibm.com> <20150327143839.GO18994@e105550-lin.cambridge.arm.com> <55158966.4050300@linux.vnet.ibm.com> <20150327175651.GR18994@e105550-lin.cambridge.arm.com> <20150330110632.GT23123@twins.programming.kicks-ass.net> <20150330120302.GT18994@e105550-lin.cambridge.arm.com> <20150330122449.GH21418@twins.programming.kicks-ass.net> From: Vincent Guittot Date: Mon, 30 Mar 2015 15:29:09 +0200 Message-ID: Subject: Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs To: Peter Zijlstra Cc: Morten Rasmussen , Preeti U Murthy , "mingo@kernel.org" , "riel@redhat.com" , "daniel.lezcano@linaro.org" , "srikar@linux.vnet.ibm.com" , "pjt@google.com" , "benh@kernel.crashing.org" , "efault@gmx.de" , "linux-kernel@vger.kernel.org" , "iamjoonsoo.kim@lge.com" , "svaidy@linux.vnet.ibm.com" , "tim.c.chen@linux.intel.com" , "jason.low2@hp.com" Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 30 March 2015 at 14:24, Peter Zijlstra wrote: > On Mon, Mar 30, 2015 at 01:03:03PM +0100, Morten Rasmussen wrote: >> On Mon, Mar 30, 2015 at 12:06:32PM +0100, Peter Zijlstra wrote: >> > On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote: >> > >> > > I agree that it is hard to predict how many additional cpus you need, >> > > but I don't think you necessarily need that information as long as you >> > > start by filling up the cpu that was kicked to do the >> > > nohz_idle_balance() first. >> > >> > > Reducing unnecessary wakeups is quite important for energy consumption >> > > and something a lot of effort is put into. You really don't want to wake >> > > up another cluster/package unnecessarily just because there was only one >> > > nohz-idle cpu left in the previous one which could have handled the >> > > additional load. It gets even worse if the other cluster is less >> > > energy-efficient (big.LITTLE). >> > >> > So the only way to get tasks to cross your cluster is by balancing that >> > domain. At this point we'll compute sg stats for either group >> > (=cluster). >> > >> > The only thing we need to ensure is that it doesn't view the small >> > cluster as overloaded (as long as it really isn't of course), as long as >> > its not viewed as overloaded it will not pull tasks from it into the big >> > cluster, no matter how many ILBs we run before the ILB duty cpu's >> > rebalance_domains() call. >> > >> > I'm really not seeing the problem here. >> >> I see. The group_classify() should take care of it in all cases of >> balancing across clusters. You would be iterating over all cpus in the >> other cluster running rebalance_domains() if the balancer cpu happens to >> be the last one in the little cluster though. However, within the >> cluster (in case you have 2 or more nohz-idle cpus) you still take a >> double hit. No? > > It can yes, but typically not I think. This all could use some 'help' > for sure. > > So the thing is, find_new_ilb() simply selects the first idle_cpus_mask > cpu, while at the same time, nohz_idle_balance() will iterate the > idle_cpus_mask with the first, being first (obviously). > > So it is very like that if we migrate on the ILB it is in fact to the > current CPU. > > In case we cannot, we have no choice but to wake up a second idle, > nothing really to be done about that. > > To put it another way, for ILB purposes the rebalance_domains() call is > mostly superfluous. The only other case is if the selected ILB target > became non-idle between being selected and getting to run the softirq > handler. At which point we should wake another anyhow too. > > Maybe something like the below helps -- albeit it could use a comment > too of course. > > --- > kernel/sched/fair.c | 15 ++++++++++----- > 1 file changed, 10 insertions(+), 5 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index fdae26eb7218..b879d4b3b599 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7624,11 +7624,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle) > * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the > * rebalancing for all the cpus for whom scheduler ticks are stopped. > */ > -static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > { > int this_cpu = this_rq->cpu; > - struct rq *rq; > int balance_cpu; > + struct rq *rq; > + bool done = false; > > if (idle != CPU_IDLE || > !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu))) > @@ -7647,6 +7648,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > break; > > rq = cpu_rq(balance_cpu); > + if (rq == this_rq) > + done = true; AFAICT, this can't happen because we start the for_each _cpu loop with: if (balance_cpu == this_cpu || !idle_cpu(balance_cpu)) continue; > > /* > * If time for next balance is due, > @@ -7666,6 +7669,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > nohz.next_balance = this_rq->next_balance; > end: > clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)); > + > + return done; > } > > /* > @@ -7744,7 +7749,7 @@ static inline bool nohz_kick_needed(struct rq *rq) > return kick; > } > #else > -static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { } > +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { return false; } > #endif > > /* > @@ -7765,8 +7770,8 @@ static void run_rebalance_domains(struct softirq_action *h) > * load balance only within the local sched_domain hierarchy > * and abort nohz_idle_balance altogether if we pull some load. > */ > - nohz_idle_balance(this_rq, idle); > - rebalance_domains(this_rq, idle); > + if (!nohz_idle_balance(this_rq, idle)) > + rebalance_domains(this_rq, idle); the nohz_idle_balance run rebalance_domains for all CPU except this CPU > } > > /*