Message-ID: <550368F4.5050905@nvidia.com>
Date: Fri, 13 Mar 2015 15:47:16 -0700
From: Sai Gurrappadi
To: Morten Rasmussen <morten.rasmussen@arm.com>, peterz@infradead.org,
 mingo@redhat.com
CC: vincent.guittot@linaro.org, Dietmar Eggemann, yuyang.du@intel.com,
 preeti@linux.vnet.ibm.com, mturquette@linaro.org, nico@linaro.org,
 rjw@rjwysocki.net, Juri Lelli, linux-kernel@vger.kernel.org
Subject: Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
References: <1423074685-6336-1-git-send-email-morten.rasmussen@arm.com>
 <1423074685-6336-34-git-send-email-morten.rasmussen@arm.com>
In-Reply-To: <1423074685-6336-34-git-send-email-morten.rasmussen@arm.com>

On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> Let available compute capacity and estimated energy impact select
> wake-up target cpu when energy-aware scheduling is enabled.
> energy_aware_wake_cpu() attempts to find group of cpus with sufficient
> compute capacity to accommodate the task and find a cpu with enough spare
> capacity to handle the task within that group. Preference is given to
> cpus with enough spare capacity at the current OPP. Finally, the energy
> impact of the new target and the previous task cpu is compared to select
> the wake-up target cpu.
>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 90 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b371f32..8713310 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5091,6 +5091,92 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  done:
>  	return target;
>  }
> +
> +static unsigned long group_max_capacity(struct sched_group *sg)
> +{
> +	int max_idx;
> +
> +	if (!sg->sge)
> +		return 0;
> +
> +	max_idx = sg->sge->nr_cap_states-1;
> +
> +	return sg->sge->cap_states[max_idx].cap;
> +}
> +
> +static inline unsigned long task_utilization(struct task_struct *p)
> +{
> +	return p->se.avg.utilization_avg_contrib;
> +}
> +
> +static int cpu_overutilized(int cpu, struct sched_domain *sd)
> +{
> +	return (capacity_orig_of(cpu) * 100) <
> +			(get_cpu_usage(cpu) * sd->imbalance_pct);
> +}
> +
> +static int energy_aware_wake_cpu(struct task_struct *p)
> +{
> +	struct sched_domain *sd;
> +	struct sched_group *sg, *sg_target;
> +	int target_max_cap = SCHED_CAPACITY_SCALE;
> +	int target_cpu = task_cpu(p);
> +	int i;
> +
> +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> +
> +	if (!sd)
> +		return -1;
> +
> +	sg = sd->groups;
> +	sg_target = sg;
> +	/* Find group with sufficient capacity */
> +	do {
> +		int sg_max_capacity = group_max_capacity(sg);
> +
> +		if (sg_max_capacity >= task_utilization(p) &&
> +		    sg_max_capacity <= target_max_cap) {
> +			sg_target = sg;
> +			target_max_cap = sg_max_capacity;
> +		}
> +	} while (sg = sg->next, sg != sd->groups);

If a 'small' task suddenly becomes 'big', i.e. close to 100% util, the
above loop would still pick the little/small cluster, because
task_utilization(p) is upper-bounded by the arch-invariant capacity of
the little CPU/group, right?

Also, this heuristic for determining sg_target is a big.LITTLE
assumption; I don't think it necessarily holds for all platforms. The
heuristic should be derived from the platform's energy model instead.
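Totally untested sketch to make the first point concrete
(group_fits_task() and the margin value are made up here, not part of
this series): require some headroom above the task's utilization before
a group is considered to fit it, so a task saturating the little cpus'
capacity ceiling no longer always passes the check for the little
group:

static int group_fits_task(struct sched_group *sg, struct task_struct *p)
{
	/* ~25% headroom: 1280/1024 ~= 1.25 (illustrative value) */
	unsigned long margin = 1280;

	return group_max_capacity(sg) * SCHED_CAPACITY_SCALE >=
			task_utilization(p) * margin;
}

The do/while above would then test group_fits_task(sg, p) instead of
comparing sg_max_capacity against task_utilization(p) directly. That
still doesn't derive the choice from the energy model, but it at least
removes the saturation problem.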
> +
> +	/* Find cpu with sufficient capacity */
> +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> +		int new_usage = get_cpu_usage(i) + task_utilization(p);

Isn't this double-accounting the task's usage in case task_cpu(p)
belongs to sg_target?
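If that's the case, I would have expected something along these lines
(untested; assumes p's utilization is still accounted to its previous
cpu by get_cpu_usage() at this point):

		int new_usage = get_cpu_usage(i);

		/* p's utilization is already included on its previous cpu */
		if (i != task_cpu(p))
			new_usage += task_utilization(p);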
> +
> +		if (new_usage > capacity_orig_of(i))
> +			continue;
> +
> +		if (new_usage < capacity_curr_of(i)) {
> +			target_cpu = i;
> +			if (!cpu_rq(i)->nr_running)
> +				break;
> +		}
> +
> +		/* cpu has capacity at higher OPP, keep it as fallback */
> +		if (target_cpu == task_cpu(p))
> +			target_cpu = i;
> +	}
> +
> +	if (target_cpu != task_cpu(p)) {
> +		struct energy_env eenv = {
> +			.usage_delta	= task_utilization(p),
> +			.src_cpu	= task_cpu(p),
> +			.dst_cpu	= target_cpu,
> +		};
> +
> +		/* Not enough spare capacity on previous cpu */
> +		if (cpu_overutilized(task_cpu(p), sd))
> +			return target_cpu;
> +
> +		if (energy_diff(&eenv) >= 0)
> +			return task_cpu(p);
> +	}
> +
> +	return target_cpu;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5138,6 +5224,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  	prev_cpu = cpu;
>
>  	if (sd_flag & SD_BALANCE_WAKE) {
> +		if (energy_aware()) {
> +			new_cpu = energy_aware_wake_cpu(p);
> +			goto unlock;
> +		}
>  		new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}

-Sai