Date: Thu, 1 Sep 2016 12:51:36 +0100
From: Patrick Bellasi
To: Morten Rasmussen
Cc: peterz@infradead.org, mingo@redhat.com, dietmar.eggemann@arm.com,
	yuyang.du@intel.com, vincent.guittot@linaro.org, mgalbraith@suse.de,
	sgurrappadi@nvidia.com, freedom.tan@mediatek.com,
	keita.kobayashi.ym@renesas.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 5/5] sched/fair: Track peak per-entity utilization
Message-ID: <20160901115136.GB31371@e105326-lin>
References: <1472640739-8778-1-git-send-email-morten.rasmussen@arm.com>
	<1472640739-8778-6-git-send-email-morten.rasmussen@arm.com>
In-Reply-To: <1472640739-8778-6-git-send-email-morten.rasmussen@arm.com>

On 31-Aug 11:52, Morten Rasmussen wrote:
> When using PELT (per-entity load tracking) utilization to place tasks at
> wake-up, using the decayed utilization (due to sleep) leads to
> under-estimation of the true utilization of the task. This could mean
> putting the task on a cpu with less available capacity than is actually
> needed. This issue can be mitigated by using 'peak' utilization instead
> of the decayed utilization for placement decisions, e.g. at task
> wake-up.
>
> The 'peak' utilization metric, util_peak, tracks util_avg when the task
> is running and retains its previous value while the task is
> blocked/waiting on the rq. It is instantly updated to track util_avg
> again as soon as the task is running again.
>
> cc: Ingo Molnar
> cc: Peter Zijlstra
>
> Signed-off-by: Morten Rasmussen
> ---
>  include/linux/sched.h |  2 +-
>  kernel/sched/fair.c   | 23 ++++++++++++++---------
>  2 files changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d75024053e9b..fff4e4b6e654 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1282,7 +1282,7 @@ struct load_weight {
>  struct sched_avg {
>  	u64 last_update_time, load_sum;
>  	u32 util_sum, period_contrib;
> -	unsigned long load_avg, util_avg;
> +	unsigned long load_avg, util_avg, util_peak;

By adding util_peak here (in sched_avg) we implicitly define a new
signal for CFS RQs as well, but in the rest of this patch it seems we
use it only for tasks?

Overall this seems to be a filtered signal on top of PELT, but just for
tasks. Perhaps something similar can be useful for CPU utilization as
well...
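Just to make that last point a bit more concrete: since the
__update_load_avg() hunk below updates sa->util_peak for every sched_avg
it is handed, a cfs_rq's util_peak should already be maintained, and a
CPU-side reader could simply mirror cpu_util(). Completely untested
sketch, and cpu_util_peak() is just a made-up name here, not something
this patch introduces:

	/* Hypothetical CPU-side peak utilization, mirroring cpu_util() */
	static int cpu_util_peak(int cpu)
	{
		unsigned long util = cpu_rq(cpu)->cfs.avg.util_peak;
		unsigned long capacity = capacity_orig_of(cpu);

		return (util >= capacity) ? capacity : util;
	}

Whether such a peak-filtered CPU signal would actually be useful is of
course a separate discussion.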
>  };
>  
>  #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 68d8b40c546b..27534e36555b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -692,6 +692,7 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
>  	 */
>  	sa->util_avg = 0;
> +	sa->util_peak = 0;

For consistency with the other sched_avg signals, perhaps we should also
report the value of util_peak from:

   kernel/sched/debug.c::{print_cfs_group_stats,proc_sched_show_task}

(see the first sketch at the bottom of this mail).

>  	sa->util_sum = 0;
>  	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>  }
> @@ -743,6 +744,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
>  	} else {
>  		sa->util_avg = cap;
>  	}
> +	sa->util_peak = sa->util_avg;
>  	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>  }
>  
> @@ -2804,6 +2806,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>  	}
>  
> +	if (running || sa->util_avg > sa->util_peak)
> +		sa->util_peak = sa->util_avg;

Do we really need to update this new signal so often? It seems that we
use it only at wake-up time; is it not enough to cache the util_avg
value in dequeue_task_fair() in case of a DEQUEUE_SLEEP? (see the second
sketch at the bottom of this mail)

> +
>  	return decayed;
>  }
>  
> @@ -5184,7 +5189,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>  	return 1;
>  }
>  
> -static inline int task_util(struct task_struct *p);
> +static inline int task_util_peak(struct task_struct *p);
>  static int cpu_util_wake(int cpu, struct task_struct *p);
>  
>  static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> @@ -5267,14 +5272,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  	/*
>  	 * The cross-over point between using spare capacity or least load
>  	 * is too conservative for high utilization tasks on partially
> -	 * utilized systems if we require spare_capacity > task_util(p),
> +	 * utilized systems if we require spare_capacity > task_util_peak(p),
>  	 * so we allow for some task stuffing by using
> -	 * spare_capacity > task_util(p)/2.
> +	 * spare_capacity > task_util_peak(p)/2.
>  	 */
> -	if (this_spare > task_util(p) / 2 &&
> +	if (this_spare > task_util_peak(p) / 2 &&
>  	    imbalance*this_spare > 100*most_spare)
>  		return NULL;
> -	else if (most_spare > task_util(p) / 2)
> +	else if (most_spare > task_util_peak(p) / 2)
>  		return most_spare_sg;
>  
>  	if (!idlest || 100*this_load < imbalance*min_load)
> @@ -5432,9 +5437,9 @@ static int cpu_util(int cpu)
>  	return (util >= capacity) ? capacity : util;
>  }
>  
> -static inline int task_util(struct task_struct *p)
> +static inline int task_util_peak(struct task_struct *p)
>  {
> -	return p->se.avg.util_avg;
> +	return p->se.avg.util_peak;
>  }
>  
>  /*
> @@ -5450,7 +5455,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p)
>  		return cpu_util(cpu);
>  
>  	capacity = capacity_orig_of(cpu);
> -	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
> +	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util_peak(p), 0);
>  
>  	return (util >= capacity) ? capacity : util;
>  }
> @@ -5476,7 +5481,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  	/* Bring task utilization in sync with prev_cpu */
>  	sync_entity_load_avg(&p->se);
>  
> -	return min_cap * 1024 < task_util(p) * capacity_margin;
> +	return min_cap * 1024 < task_util_peak(p) * capacity_margin;
>  }
>  
>  /*
> -- 
> 1.9.1
> 

-- 
#include <best/regards.h>

Patrick Bellasi
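---
Two completely untested sketches, only to make the two suggestions above
a bit more concrete (placement and exact form are indicative only, not a
tested alternative):

1) Reporting the new signal next to the existing sched_avg fields, e.g.
   from proc_sched_show_task() in kernel/sched/debug.c:

	P(se.avg.util_avg);
	P(se.avg.util_peak);

2) Caching the peak once per sleep from the dequeue path, instead of
   refreshing it on every __update_load_avg() call; the snapshot would
   of course have to happen after util_avg has been updated for the
   last time before the task blocks:

	/* somewhere in dequeue_task_fair(), sketch only */
	if (flags & DEQUEUE_SLEEP)
		p->se.avg.util_peak = p->se.avg.util_avg;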