* [PATCH 0/4] Utilization estimation (util_est) for FAIR tasks
From: Patrick Bellasi @ 2017-11-09 16:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes

The aim of this series is to improve some PELT behaviors to make it a
better fit for the scheduling of tasks common in embedded mobile
use-cases, without affecting other classes of workloads.

A complete description of these behaviors was presented in the
previous RFC [1] and further discussed at the last OSPM Summit [2] as
well as at the last two LPCs.

This series presents an implementation which improves on the initial RFC's
prototype. Specifically, this new implementation has been verified not to
impact in any noticeable way the performance of:

    perf bench sched messaging --pipe --thread --group 8 --loop 50000

when running 30 iterations on a dual-socket Intel(R) Xeon(R) CPU E5-2690 v2
@ 3.00GHz with 10 cores (20 threads) per socket, with sched_feat(UTIL_EST)
set to false.
With this feature enabled, the measured overhead is in the range of ~1% on
the same HW/SW test configuration.

That's the main reason why this sched feature is disabled by default.
A possible improvement would be the addition of a Kconfig option to set the
sched_feat's default value on systems where a 1% overhead on hackbench is
not a concern, e.g. mobile systems, especially considering the benefits
that estimated utilization brings to the workloads of interest.
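
As a reference, on a CONFIG_SCHED_DEBUG kernel the feature can also be
toggled at run time via the sched_features debugfs interface (assuming
debugfs is mounted at its usual location):

    echo UTIL_EST    > /sys/kernel/debug/sched_features
    echo NO_UTIL_EST > /sys/kernel/debug/sched_features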

From a functional standpoint, this implementation shows a more stable
utilization signal, compared to mainline, when running synthetic
benchmarks describing a set of interesting target use-cases.
This allows a better selection of the target CPU as well as a faster
selection of the most appropriate OPP.
A detailed description of the functional tests used has been already
covered in the previous RFC [1].

This series is based on v4.14-rc8 and is composed of four patches:
 1) a small refactoring preparing the ground
 2) the introduction of the required data structures to track util_est for
    both TASKs and CPUs
 3) the use of util_est in the wakeup and load balance paths
 4) the use of util_est in schedutil for frequency selection

Cheers,
Patrick

.:: References
==============
[1] https://lkml.org/lkml/2017/8/25/195
[2] slides: http://retis.sssup.it/ospm-summit/Downloads/OSPM_PELT_DecayClampingVsUtilEst.pdf
     video: http://youtu.be/adnSHPBGS-w

Patrick Bellasi (4):
  sched/fair: always use unsigned long for utilization
  sched/fair: add util_est on top of PELT
  sched/fair: use util_est in LB and WU paths
  sched/cpufreq_schedutil: use util_est for OPP selection

 include/linux/sched.h            |  21 +++++
 kernel/sched/cpufreq_schedutil.c |   6 +-
 kernel/sched/debug.c             |   4 +
 kernel/sched/fair.c              | 184 ++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h          |   5 ++
 kernel/sched/sched.h             |   1 +
 6 files changed, 209 insertions(+), 12 deletions(-)

-- 
2.14.1

* [PATCH 1/4] sched/fair: always use unsigned long for utilization
From: Patrick Bellasi @ 2017-11-09 16:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes

Utilization and capacity are tracked as unsigned long, however some
functions using them return an int, which is ultimately assigned back to
unsigned long variables.

Since there is no scope for using a different, signed type, this patch
consolidates the signature of functions returning utilization to always
use the native type.
As well as improving code consistency, this is also expected to benefit
code paths where utilization values need to be clamped, by avoiding
further type conversions or ugly type casts.
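
As a purely illustrative sketch (hypothetical helpers, not taken from the
kernel sources), this is the kind of mixed-type pattern the consolidation
avoids:

    /* Hypothetical example, not part of this patch. */
    static int           util_as_int(void)   { return 567; }
    static unsigned long util_as_ulong(void) { return 567; }

    /* int return: signed/unsigned conversions, and type-checking helpers
     * such as the kernel's min()/max() would need min_t() or casts. */
    static unsigned long clamp_old(unsigned long capacity)
    {
            unsigned long util = (unsigned long)util_as_int();
            return util >= capacity ? capacity : util;
    }

    /* unsigned long return: a single, native type end to end. */
    static unsigned long clamp_new(unsigned long capacity)
    {
            unsigned long util = util_as_ulong();
            return util >= capacity ? capacity : util;
    }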

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/fair.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c09ddf8c832..83bc5d69fe3a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5438,8 +5438,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return affine;
 }
 
-static inline int task_util(struct task_struct *p);
-static int cpu_util_wake(int cpu, struct task_struct *p);
+static inline unsigned long task_util(struct task_struct *p);
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
@@ -5870,7 +5870,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
  * capacity_orig) as it useful for predicting the capacity required after task
  * migrations (scheduler-driven DVFS).
  */
-static int cpu_util(int cpu)
+static unsigned long cpu_util(int cpu)
 {
 	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
@@ -5878,7 +5878,7 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
-static inline int task_util(struct task_struct *p)
+static inline unsigned long task_util(struct task_struct *p)
 {
 	return p->se.avg.util_avg;
 }
@@ -5887,7 +5887,7 @@ static inline int task_util(struct task_struct *p)
  * cpu_util_wake: Compute cpu utilization with any contributions from
  * the waking task p removed.
  */
-static int cpu_util_wake(int cpu, struct task_struct *p)
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
 {
 	unsigned long util, capacity;
 
-- 
2.14.1

* [PATCH 2/4] sched/fair: add util_est on top of PELT
From: Patrick Bellasi @ 2017-11-09 16:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes, linux-pm

The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil is able to pick the best frequency at which to run that task.

The same issue can affect task placement. Indeed, since a task's
utilization is already decayed at wakeup time, when the task is enqueued
on a CPU, a CPU which is actually running a big task can be temporarily
represented as almost empty. This leads to a race condition where other
tasks can potentially be allocated to a CPU which has just started
running a big task after a relatively long sleep period.

Moreover, the PELT utilization of a task can be updated every
millisecond, thus making it a continuously changing value for certain
longer-running tasks. This means that the instantaneous PELT utilization
of a RUNNING task is not really meaningful for properly supporting
scheduler decisions.

For all these reasons, a more stable signal could probably do a better
job of representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

    util_est(task) = max(task::util_avg, f(task::util_avg@dequeue_times))

This allows us to remember, via the function f(task::util_avg@dequeue_times),
how big a task was reported to be by PELT in its previous activations.

If a task changes its behavior and runs even longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).
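
The function f() above is implemented below as an exponential weighted
moving average of the dequeue-time samples. As a purely numerical
illustration (the 1/4 sample weight comes from the UTIL_EST_WEIGHT_SHIFT
definition added by this patch), assuming a previous ewma of 200 and a
dequeue-time sample util_last of 600:

    ewma(t) = w * util_last + (1 - w) * ewma(t-1)
            = 0.25 * 600 + 0.75 * 200 = 300

which, in the shift-based form used by the code, is:

    (600 + (200 << 2) - 200) >> 2 = 1200 >> 2 = 300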

Estimated utilization is defined only for root cfs_rqs. That's because
the only sensible consumers of this signal are the scheduler and
schedutil, when looking at the overall CPU utilization due to FAIR
tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:

    util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est_runnable)

where:

    cfs_rq::util_est_runnable = sum(util_est(task))
                                for each RUNNABLE task on that root cfs_rq
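
As an illustration with made-up numbers: with two RUNNABLE tasks whose
util_est are 300 and 150, on a CPU whose util_avg has decayed to 280
while the bigger task was sleeping:

    cfs_rq::util_est_runnable = 300 + 150 = 450
    util_est(cfs_rq)          = max(280, 450) = 450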

It's worth noting that the estimated utilization is tracked only for
entities of interest, specifically:
 - Tasks: to better support task placement decisions
 - root cfs_rqs: to better support both task placement decisions and
                 frequency selection for a CPU

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 include/linux/sched.h   |  21 ++++++++++
 kernel/sched/debug.c    |   4 ++
 kernel/sched/fair.c     | 102 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h |   5 +++
 kernel/sched/sched.h    |   1 +
 5 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fdf74f27acf1..bce77204c378 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -338,6 +338,21 @@ struct sched_avg {
 	unsigned long			util_avg;
 };
 
+/**
+ * Estimated utilization for FAIR tasks.
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. Sample's weight is
+ * chosen so that the EWMA will be relatively insensitive to transient changes
+ * to the task's workload.
+ */
+struct util_est {
+	unsigned long			last;
+	unsigned long			ewma;
+#define UTIL_EST_WEIGHT_SHIFT		2
+};
+
 struct sched_statistics {
 #ifdef CONFIG_SCHEDSTATS
 	u64				wait_start;
@@ -561,6 +576,12 @@ struct task_struct {
 
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
+	/*
+	 * Since we use se.avg.util_avg to update util_est fields,
+	 * this last can benefit from being close to se which
+	 * also defines se.avg as cache aligned.
+	 */
+	struct util_est			util_est;
 	struct sched_rt_entity		rt;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2f93e4a2d9f6..a4af385329b2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -564,6 +564,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "util_est_runnable",
+			cfs_rq->util_est_runnable);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed_load_avg",
 			atomic_long_read(&cfs_rq->removed_load_avg));
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed_util_avg",
@@ -1010,6 +1012,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.avg.load_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
+	P(util_est.ewma);
+	P(util_est.last);
 #endif
 	P(policy);
 	P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83bc5d69fe3a..f14d199e81ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -739,6 +739,12 @@ void init_entity_runnable_average(struct sched_entity *se)
 	sa->util_avg = 0;
 	sa->util_sum = 0;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
+
+	/* Utilization estimation */
+	if (entity_is_task(se)) {
+		task_of(se)->util_est.ewma = 0;
+		task_of(se)->util_est.last = 0;
+	}
 }
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
@@ -4870,6 +4876,20 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static inline unsigned long task_util(struct task_struct *p);
+static inline unsigned long task_util_est(struct task_struct *p);
+
+static inline void util_est_enqueue(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/* Update root cfs_rq's estimated utilization */
+	cfs_rq->util_est_runnable += task_util_est(p);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -4922,9 +4942,84 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se)
 		add_nr_running(rq, 1);
 
+	util_est_enqueue(p);
 	hrtick_update(rq);
 }
 
+static inline void util_est_dequeue(struct task_struct *p, int flags)
+{
+	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+	unsigned long util_last = task_util(p);
+	bool sleep = flags & DEQUEUE_SLEEP;
+	unsigned long ewma;
+	long util_est;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/*
+	 * Update root cfs_rq's estimated utilization
+	 *
+	 * If *p is the last task then the root cfs_rq's estimated utilization
+	 * of a CPU is 0 by definition.
+	 *
+	 * Otherwise, in removing *p's util_est from its cfs_rq's
+	 * util_est_runnable we should account for cases where this last
+	 * activation of *p was longer than the previous ones.
+	 * In these cases we also need to clamp the CPU's estimated
+	 * utilization at 0, instead of letting it underflow.
+	 */
+	if (cfs_rq->nr_running > 0) {
+		util_est  = cfs_rq->util_est_runnable;
+		util_est -= task_util_est(p);
+		if (util_est < 0)
+			util_est = 0;
+		cfs_rq->util_est_runnable = util_est;
+	} else {
+		cfs_rq->util_est_runnable = 0;
+	}
+
+	/*
+	 * Skip update of task's estimated utilization when the task has not
+	 * yet completed an activation, e.g. being migrated.
+	 */
+	if (!sleep)
+		return;
+
+	/*
+	 * Skip update of task's estimated utilization when its EWMA is already
+	 * ~1% close to its last activation value.
+	 */
+	util_est = p->util_est.ewma;
+	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
+		return;
+
+	/*
+	 * Update Task's estimated utilization
+	 *
+	 * When *p completes an activation we can consolidate another sample
+	 * about the task size. This is done by storing the last PELT value
+	 * for this task and using this value to load another sample in the
+	 * exponential weighted moving average:
+	 *
+	 *      ewma(t) = w *  task_util(p) + (1 - w) ewma(t-1)
+	 *              = w *  task_util(p) + ewma(t-1) - w * ewma(t-1)
+	 *              = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
+	 *
+	 * Where 'w' is the weight of new samples, which is configured to be
+	 * 0.25, thus making w=1/4
+	 */
+	p->util_est.last = util_last;
+	ewma = p->util_est.ewma;
+	if (likely(ewma != 0)) {
+		ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
+		ewma >>= UTIL_EST_WEIGHT_SHIFT;
+	} else {
+		ewma = util_last;
+	}
+	p->util_est.ewma = ewma;
+}
+
 static void set_next_buddy(struct sched_entity *se);
 
 /*
@@ -4981,6 +5076,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se)
 		sub_nr_running(rq, 1);
 
+	util_est_dequeue(p, flags);
 	hrtick_update(rq);
 }
 
@@ -5438,7 +5534,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return affine;
 }
 
-static inline unsigned long task_util(struct task_struct *p);
 static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -5883,6 +5978,11 @@ static inline unsigned long task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+static inline unsigned long task_util_est(struct task_struct *p)
+{
+	return max(p->util_est.ewma, p->util_est.last);
+}
+
 /*
  * cpu_util_wake: Compute cpu utilization with any contributions from
  * the waking task p removed.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9552fd5854bf..e9f312acc0d3 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
 SCHED_FEAT(WA_BIAS, true)
+
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e1666b3e2fb2..3abd13e68655 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -444,6 +444,7 @@ struct cfs_rq {
 	 * CFS load tracking
 	 */
 	struct sched_avg avg;
+	unsigned long util_est_runnable;
 	u64 runnable_load_sum;
 	unsigned long runnable_load_avg;
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.14.1

* [PATCH 3/4] sched/fair: use util_est in LB and WU paths
From: Patrick Bellasi @ 2017-11-09 16:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes, linux-pm

When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.

For example, since a task's utilization is decayed at wakeup time, when
a long sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result we get a race condition where other tasks can be placed on
the same CPU while it is still considered relatively empty.

In order to reduce this kind of race condition, this patch introduces the
support required to integrate the CPU's estimated utilization into
cpu_util_wake as well as into update_sg_lb_stats.

The estimated utilization of a CPU is defined as the maximum between its
PELT utilization and the sum of the estimated utilization of the tasks
currently RUNNABLE on that CPU.
This allows us to properly represent the expected utilization of a CPU
which, for example, has just started running a big task after a long
sleep period.
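
As a purely numerical illustration of the wakeup path changes below (all
figures made up): if the waking task p has task_util = 300, cpu_util for
its previous CPU is 350, and the tasks already RUNNABLE on that CPU have
a combined util_est_runnable of 400, then:

    cpu_util_wake = max(cpu_util - task_util, util_est_runnable)
                  = max(350 - 300, 400) = 400

so the spare capacity seen at wakeup time reflects what the runnable
tasks are expected to need, rather than p's decayed blocked utilization.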

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 68 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f14d199e81ed..76626e1a7645 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5973,6 +5973,41 @@ static unsigned long cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
+/**
+ * cpu_util_est: estimated utilization for the specified CPU
+ * @cpu: the CPU to get the estimated utilization for
+ *
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * PELT's utilization and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ *
+ * This allows us to properly represent the expected utilization of a CPU
+ * which has just started running a big task after a long sleep period. At
+ * the same time it preserves the benefits of the "blocked load" in describing
+ * the potential for other tasks waking up on the same CPU.
+ *
+ * Return: the estimated utilization for the specified CPU
+ */
+static inline unsigned long cpu_util_est(int cpu)
+{
+	unsigned long util, util_est;
+	unsigned long capacity;
+	struct cfs_rq *cfs_rq;
+
+	if (!sched_feat(UTIL_EST))
+		return cpu_util(cpu);
+
+	cfs_rq = &cpu_rq(cpu)->cfs;
+	util = cfs_rq->avg.util_avg;
+	util_est = cfs_rq->util_est_runnable;
+	util_est = max(util, util_est);
+
+	capacity = capacity_orig_of(cpu);
+	util_est = min(util_est, capacity);
+
+	return util_est;
+}
+
 static inline unsigned long task_util(struct task_struct *p)
 {
 	return p->se.avg.util_avg;
@@ -5989,16 +6024,43 @@ static inline unsigned long task_util_est(struct task_struct *p)
  */
 static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
 {
-	unsigned long util, capacity;
+	long util, util_est;
 
 	/* Task has no contribution or is new */
 	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
-		return cpu_util(cpu);
+		return cpu_util_est(cpu);
 
-	capacity = capacity_orig_of(cpu);
-	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+	/* Discount task's blocked util from CPU's util */
+	util = cpu_util(cpu) - task_util(p);
+	util = max(util, 0L);
 
-	return (util >= capacity) ? capacity : util;
+	if (!sched_feat(UTIL_EST))
+		return util;
+
+	/*
+	 * These are the main cases covered:
+	 * - if *p is the only task sleeping on this CPU, then:
+	 *      cpu_util (== task_util) > util_est (== 0)
+	 *   and thus we return:
+	 *      cpu_util_wake = (cpu_util - task_util) = 0
+	 *
+	 * - if other tasks are SLEEPING on the CPU *p is waking up on,
+	 *   then:
+	 *      cpu_util >= task_util
+	 *      cpu_util > util_est (== 0)
+	 *   and thus we discount *p's blocked utilization to return:
+	 *      cpu_util_wake = (cpu_util - task_util) >= 0
+	 *
+	 * - if other tasks are RUNNABLE on that CPU and
+	 *      util_est > cpu_util
+	 *   then we use util_est since it returns a more restrictive
+	 *   estimation of the spare capacity on that CPU, by just considering
+	 *   the expected utilization of tasks already runnable on that CPU.
+	 */
+	util_est = cpu_rq(cpu)->cfs.util_est_runnable;
+	util = max(util, util_est);
+
+	return util;
 }
 
 /*
@@ -7525,7 +7587,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
-		sgs->group_util += cpu_util(i);
+		sgs->group_util += cpu_util_est(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
-- 
2.14.1

* [PATCH 4/4] sched/cpufreq_schedutil: use util_est for OPP selection
From: Patrick Bellasi @ 2017-11-09 16:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes, linux-pm

When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and introduce delays in frequency selection.

For example, since a task's utilization is decayed at wakeup time, a
long sleeping big task which has just been enqueued does not immediately
add a significant contribution to the target CPU. This introduces some
latency before schedutil is able to detect the best frequency required
by that task.

Moreover, the PELT signal's build-up time is a function of the current
frequency, because of the scale-invariant load tracking support. Thus,
starting from a lower frequency, the utilization build-up time increases
even more and further delays the selection of the frequency which
actually serves the task's requirements.

In order to reduce this kind of latency, this patch integrates the
CPU's estimated utilization into the sugov_get_util function.

The estimated utilization of a CPU is defined as the maximum between its
PELT utilization and the sum of the estimated utilization of each task
currently RUNNABLE on that CPU.
This allows us to properly represent the expected utilization of a CPU
which, for example, has just started running a big task after a long
sleep period, and ultimately it allows us to select the best frequency
at which to run a task right after it wakes up.
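
As a purely numerical illustration (assuming, for the sake of the
example, schedutil's usual next_freq = 1.25 * max_freq * util / max
mapping, a max_freq of 2000 MHz and a capacity scale of 1024), a big
task whose util_avg has decayed to 120 but whose util_est is 420 would
see:

    1.25 * 2000 MHz * 120 / 1024  ~=  293 MHz   (PELT only)
    1.25 * 2000 MHz * 420 / 1024  ~= 1025 MHz   (with util_est)

so right after its wakeup the task is served at (close to) the OPP it
actually needs, instead of ramping up from a low frequency.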

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
 kernel/sched/cpufreq_schedutil.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 137733db6727..d72231e30d44 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -183,7 +183,11 @@ static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
 
 	cfs_max = arch_scale_cpu_capacity(NULL, cpu);
 
-	*util = min(rq->cfs.avg.util_avg, cfs_max);
+	*util = rq->cfs.avg.util_avg;
+	if (sched_feat(UTIL_EST))
+		*util = max(*util, rq->cfs.util_est_runnable);
+	*util = min(*util, cfs_max);
+
 	*max = cfs_max;
 }
 
-- 
2.14.1

* Re: [PATCH 0/4] Utilization estimation (util_est) for FAIR tasks
From: Dietmar Eggemann @ 2017-12-01  9:29 UTC (permalink / raw)
  To: Patrick Bellasi, linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes

Hi Patrick,

On 11/09/2017 05:41 PM, Patrick Bellasi wrote:

[...]

> This series is based on v4.14-rc8 and is composed of four patches:

Could you rebase this onto v4.15-rc1 or tip/sched/core so we have the
patch-set 'sched/fair: A bit of a cgroup/PELT overhaul' underneath?

-- Dietmar

[...]
