linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks
@ 2017-12-05 17:10 Patrick Bellasi
  2017-12-05 17:10 ` [PATCH v2 1/4] sched/fair: always used unsigned long for utilization Patrick Bellasi
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-05 17:10 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes

This is a respin of:
   https://lkml.org/lkml/2017/11/9/546
which has been rebased on v4.15-rc2 to have util_est now working on top
of the recent PeterZ's:
   [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul

The aim of this series is to improve some PELT behaviors to make it a
better fit for the scheduling of tasks common in embedded mobile
use-cases, without affecting other classes of workloads.

A complete description of these behaviors has been presented in the
previous RFC [1] and further discussed during the last OSPM Summit [2]
as well as during the last two LPCs.

This series presents an implementation which improves the initial RFC's
prototype. Specifically, this new implementation has been verified to
not impact in any noticeable way the performance of:

    perf bench sched messaging --pipe --thread --group 8 --loop 50000

when running 30 iterations on a dual-socket Intel(R) Xeon(R) CPU E5-2690
v2 @ 3.00GHz, with 10 cores (20 threads) per socket, and with the
sched_feat(UTIL_EST) set to False.
With this feature enabled, the measured overhead is in the range of ~1%
on the same HW/SW test configuration.

That's the main reason why this sched feature is disabled by default.
A possible improvement is the addition of a Kconfig option to toggle the
sched_feat's default value on systems where a ~1% overhead on hackbench is
not a concern, e.g. mobile systems, especially considering the benefits
that estimated utilization brings to the workloads of interest.
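
For reference, on a SCHED_DEBUG kernel the feature can also be toggled at
run-time via the usual debugfs interface, e.g.:

   echo UTIL_EST > /sys/kernel/debug/sched_features      # enable
   echo NO_UTIL_EST > /sys/kernel/debug/sched_features   # disable (default)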

From a functional standpoint, this implementation shows a more stable
utilization signal, compared to mainline, when running synthetic
benchmarks describing a set of interesting target use-cases.
This allows for a better selection of the target CPU as well as a
faster selection of the most appropriate OPP.
A detailed description of the used functional tests has been already
covered in the previous RFC [1].

This series is based on v4.15-rc2 and is composed of four patches:
 1) a small refactoring preparing the ground
 2) introducing the required data structures to track util_est of both
    TASKs and CPUs
 3) make use of util_est in the wakeup and load balance paths
 4) make use of util_est in schedutil for frequency selection

Cheers Patrick

.:: References
==============
[1] https://lkml.org/lkml/2017/8/25/195
[2] slides: http://retis.sssup.it/ospm-summit/Downloads/OSPM_PELT_DecayClampingVsUtilEst.pdf
     video: http://youtu.be/adnSHPBGS-w

Changes v1->v2:
 - rebase on top of v4.15-rc2
 - tested that overhauled PELT code does not affect the util_est

Patrick Bellasi (4):
  sched/fair: always used unsigned long for utilization
  sched/fair: add util_est on top of PELT
  sched/fair: use util_est in LB and WU paths
  sched/cpufreq_schedutil: use util_est for OPP selection

 include/linux/sched.h            |  21 +++++
 kernel/sched/cpufreq_schedutil.c |   6 +-
 kernel/sched/debug.c             |   4 +
 kernel/sched/fair.c              | 184 ++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h          |   5 ++
 kernel/sched/sched.h             |   1 +
 6 files changed, 209 insertions(+), 12 deletions(-)

-- 
2.14.1

* [PATCH v2 1/4] sched/fair: always used unsigned long for utilization
  2017-12-05 17:10 [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Patrick Bellasi
@ 2017-12-05 17:10 ` Patrick Bellasi
  2017-12-06  8:56   ` Vincent Guittot
  2018-01-10 12:14   ` [tip:sched/core] sched/fair: Use 'unsigned long' for utilization, consistently tip-bot for Patrick Bellasi
  2017-12-05 17:10 ` [PATCH v2 2/4] sched/fair: add util_est on top of PELT Patrick Bellasi
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-05 17:10 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes

Utilization and capacity are tracked as unsigned long; however, some
functions using them return an int, which is ultimately assigned back to
unsigned long variables.

Since there is no reason to use a different, signed type, consolidate the
signature of the functions returning utilization to always use the native
type.
As well as improving code consistency, this is also expected to benefit
code paths where utilization should be clamped, by avoiding further type
conversions or ugly type casts.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org

---
Changes v1->v2:
 - rebase on top of v4.15-rc2
 - tested that overhauled PELT code does not affect the util_est
---
 kernel/sched/fair.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4037e19bbca2..ad21550d008c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5721,8 +5721,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return affine;
 }
 
-static inline int task_util(struct task_struct *p);
-static int cpu_util_wake(int cpu, struct task_struct *p);
+static inline unsigned long task_util(struct task_struct *p);
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
@@ -6203,7 +6203,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
  * capacity_orig) as it useful for predicting the capacity required after task
  * migrations (scheduler-driven DVFS).
  */
-static int cpu_util(int cpu)
+static unsigned long cpu_util(int cpu)
 {
 	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
@@ -6211,7 +6211,7 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
-static inline int task_util(struct task_struct *p)
+static inline unsigned long task_util(struct task_struct *p)
 {
 	return p->se.avg.util_avg;
 }
@@ -6220,7 +6220,7 @@ static inline int task_util(struct task_struct *p)
  * cpu_util_wake: Compute cpu utilization with any contributions from
  * the waking task p removed.
  */
-static int cpu_util_wake(int cpu, struct task_struct *p)
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
 {
 	unsigned long util, capacity;
 
-- 
2.14.1

* [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-05 17:10 [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Patrick Bellasi
  2017-12-05 17:10 ` [PATCH v2 1/4] sched/fair: always used unsigned long for utilization Patrick Bellasi
@ 2017-12-05 17:10 ` Patrick Bellasi
  2017-12-13 16:05   ` Peter Zijlstra
                     ` (2 more replies)
  2017-12-05 17:10 ` [PATCH v2 3/4] sched/fair: use util_est in LB and WU paths Patrick Bellasi
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-05 17:10 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes

The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.

The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup time, when the task is enqueued
on a CPU, the CPU running that big task can temporarily be represented as
almost empty. This opens a window where other tasks can be placed on a CPU
which has just started to run a big task that slept for a relatively long
period.

Moreover, the PELT utilization of a task can be updated every millisecond,
thus making it a continuously changing value for longer-running tasks. This
means that the instantaneous PELT utilization of a RUNNING task is not
really meaningful for properly supporting scheduler decisions.

For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

    util_est(task) = max(task::util_avg, f(task::util_avg@dequeue_times))

This makes it possible to remember how big a task has been reported to be
by PELT in its previous activations, via the function
f(task::util_avg@dequeue_times).

If a task changes its behavior and runs even longer in a new activation,
after a certain time its util_est will just track the original PELT signal
(i.e. task::util_avg).

The estimated utilization is defined only for root cfs_rqs. That's
because the only sensible consumers of this signal are the scheduler and
schedutil, when looking for the overall CPU utilization due to FAIR
tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:

    util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est_runnable)

where:

    cfs_rq::util_est_runnable = sum(util_est(task))
                                for each RUNNABLE task on that root cfs_rq

It's worth noting that the estimated utilization is tracked only for
entities of interest, specifically:
 - tasks: to better support task placement decisions
 - root cfs_rqs: to better support both task placement decisions as
                 well as frequency selection
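
As a rough numeric example of the above (using the 1/4 EWMA weight and the
integer shifts from the implementation below): a task whose last three
activations were reported by PELT at dequeue time as 400, 500 and 450 (out
of SCHED_CAPACITY_SCALE = 1024) accumulates:

    ewma = 400                            (first sample bootstraps the EWMA)
    ewma = (500 + 4*400 - 400) / 4 = 425
    ewma = (450 + 4*425 - 425) / 4 = 431

so the value added at enqueue time, max(util_est.last, util_est.ewma) =
max(450, 431) = 450, is preserved even if task::util_avg has decayed to a
much smaller value after a long sleep.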

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes v1->v2:
 - rebase on top of v4.15-rc2
 - tested that overhauled PELT code does not affect the util_est
---
 include/linux/sched.h   |  21 ++++++++++
 kernel/sched/debug.c    |   4 ++
 kernel/sched/fair.c     | 102 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h |   5 +++
 kernel/sched/sched.h    |   1 +
 5 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21991d668d35..b01c0dc75ef5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -338,6 +338,21 @@ struct sched_avg {
 	unsigned long			util_avg;
 };
 
+/**
+ * Estimated utilization for FAIR tasks.
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. The weight of new
+ * samples is chosen so that the EWMA will be relatively insensitive to
+ * transient changes in the task's workload.
+ */
+struct util_est {
+	unsigned long			last;
+	unsigned long			ewma;
+#define UTIL_EST_WEIGHT_SHIFT		2
+};
+
 struct sched_statistics {
 #ifdef CONFIG_SCHEDSTATS
 	u64				wait_start;
@@ -562,6 +577,12 @@ struct task_struct {
 
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
+	/*
+	 * Since we use se.avg.util_avg to update util_est fields,
+	 * the latter can benefit from being close to se, which
+	 * also defines se.avg as cache aligned.
+	 */
+	struct util_est			util_est;
 	struct sched_rt_entity		rt;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1ca0130ed4f9..5ffa8234524a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -567,6 +567,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->avg.runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "util_est_runnable",
+			cfs_rq->util_est_runnable);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.load_avg",
 			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
@@ -1018,6 +1020,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
 	P(se.avg.last_update_time);
+	P(util_est.ewma);
+	P(util_est.last);
 #endif
 	P(policy);
 	P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad21550d008c..d8f3ed71010b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -732,6 +732,12 @@ void init_entity_runnable_average(struct sched_entity *se)
 	se->runnable_weight = se->load.weight;
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
+
+	/* Utilization estimation */
+	if (entity_is_task(se)) {
+		task_of(se)->util_est.ewma = 0;
+		task_of(se)->util_est.last = 0;
+	}
 }
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
@@ -5153,6 +5159,20 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static inline unsigned long task_util(struct task_struct *p);
+static inline unsigned long task_util_est(struct task_struct *p);
+
+static inline void util_est_enqueue(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/* Update root cfs_rq's estimated utilization */
+	cfs_rq->util_est_runnable += task_util_est(p);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5205,9 +5225,84 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se)
 		add_nr_running(rq, 1);
 
+	util_est_enqueue(p);
 	hrtick_update(rq);
 }
 
+static inline void util_est_dequeue(struct task_struct *p, int flags)
+{
+	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+	unsigned long util_last = task_util(p);
+	bool sleep = flags & DEQUEUE_SLEEP;
+	unsigned long ewma;
+	long util_est;
+
+	if (!sched_feat(UTIL_EST))
+		return;
+
+	/*
+	 * Update root cfs_rq's estimated utilization
+	 *
+	 * If *p is the last task then the root cfs_rq's estimated utilization
+	 * of a CPU is 0 by definition.
+	 *
+	 * Otherwise, when removing *p's util_est from its cfs_rq's
+	 * util_est_runnable, we should account for cases where this last
+	 * activation of *p was longer than the previous ones.
+	 * Also in these cases we need to set the estimated utilization
+	 * of the CPU to 0.
+	 */
+	if (cfs_rq->nr_running > 0) {
+		util_est  = cfs_rq->util_est_runnable;
+		util_est -= task_util_est(p);
+		if (util_est < 0)
+			util_est = 0;
+		cfs_rq->util_est_runnable = util_est;
+	} else {
+		cfs_rq->util_est_runnable = 0;
+	}
+
+	/*
+	 * Skip update of task's estimated utilization when the task has not
+	 * yet completed an activation, e.g. being migrated.
+	 */
+	if (!sleep)
+		return;
+
+	/*
+	 * Skip update of task's estimated utilization when its EWMA is already
+	 * within ~1% of its last activation value.
+	 */
+	util_est = p->util_est.ewma;
+	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
+		return;
+
+	/*
+	 * Update Task's estimated utilization
+	 *
+	 * When *p completes an activation we can consolidate another sample
+	 * about the task size. This is done by storing the last PELT value
+	 * for this task and using this value to load another sample in the
+	 * exponential weighted moving average:
+	 *
+	 *      ewma(t) = w *  task_util(p) + (1 - w) ewma(t-1)
+	 *              = w *  task_util(p) + ewma(t-1) - w * ewma(t-1)
+	 *              = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
+	 *
+	 * Where 'w' is the weight of new samples, which is configured to be
+	 * 0.25, thus making w=1/4
+	 */
+	p->util_est.last = util_last;
+	ewma = p->util_est.ewma;
+	if (likely(ewma != 0)) {
+		ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
+		ewma >>= UTIL_EST_WEIGHT_SHIFT;
+	} else {
+		ewma = util_last;
+	}
+	p->util_est.ewma = ewma;
+}
+
 static void set_next_buddy(struct sched_entity *se);
 
 /*
@@ -5264,6 +5359,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se)
 		sub_nr_running(rq, 1);
 
+	util_est_dequeue(p, flags);
 	hrtick_update(rq);
 }
 
@@ -5721,7 +5817,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return affine;
 }
 
-static inline unsigned long task_util(struct task_struct *p);
 static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6216,6 +6311,11 @@ static inline unsigned long task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+static inline unsigned long task_util_est(struct task_struct *p)
+{
+	return max(p->util_est.ewma, p->util_est.last);
+}
+
 /*
  * cpu_util_wake: Compute cpu utilization with any contributions from
  * the waking task p removed.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9552fd5854bf..e9f312acc0d3 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
 SCHED_FEAT(WA_BIAS, true)
+
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a212de..8371839075fa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -444,6 +444,7 @@ struct cfs_rq {
 	 * CFS load tracking
 	 */
 	struct sched_avg avg;
+	unsigned long util_est_runnable;
 #ifndef CONFIG_64BIT
 	u64 load_last_update_time_copy;
 #endif
-- 
2.14.1

* [PATCH v2 3/4] sched/fair: use util_est in LB and WU paths
  2017-12-05 17:10 [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Patrick Bellasi
  2017-12-05 17:10 ` [PATCH v2 1/4] sched/fair: always used unsigned long for utilization Patrick Bellasi
  2017-12-05 17:10 ` [PATCH v2 2/4] sched/fair: add util_est on top of PELT Patrick Bellasi
@ 2017-12-05 17:10 ` Patrick Bellasi
  2017-12-05 17:10 ` [PATCH v2 4/4] sched/cpufreq_schedutil: use util_est for OPP selection Patrick Bellasi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-05 17:10 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes

When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.

For example, since the task utilization is decayed at wakeup time, when
a long-sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result, we open a window where other tasks can be placed on the same
CPU while it is still considered relatively empty.

In order to reduce this kind of race condition, this patch introduces the
required support to use the CPU's estimated utilization in cpu_util_wake
as well as in update_sg_lb_stats.

The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization of the
tasks currently RUNNABLE on that CPU.
This allows properly representing the expected utilization of a CPU
which, for example, has just got a big task running after a long sleep
period.
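
For example (purely illustrative numbers): a CPU whose PELT utilization
has decayed to ~150 while a task with a util_est of ~600 has just been
enqueued on it is reported as ~600 busy instead of ~150, so the wakeup
and load-balance paths no longer treat it as almost idle.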

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes v1->v2:
 - rebase on top of v4.15-rc2
 - tested that overhauled PELT code does not affect the util_est
---
 kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 68 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8f3ed71010b..373d631efa91 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6306,6 +6306,41 @@ static unsigned long cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
+/**
+ * cpu_util_est: estimated utilization for the specified CPU
+ * @cpu: the CPU to get the estimated utilization for
+ *
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * PELT's utilization and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ *
+ * This allows properly representing the expected utilization of a CPU which
+ * has just got a big task running after a long sleep period. At the same time,
+ * however, it preserves the benefits of the "blocked load" in describing the
+ * potential for other tasks waking up on the same CPU.
+ *
+ * Return: the estimated utilization for the specified CPU
+ */
+static inline unsigned long cpu_util_est(int cpu)
+{
+	unsigned long util, util_est;
+	unsigned long capacity;
+	struct cfs_rq *cfs_rq;
+
+	if (!sched_feat(UTIL_EST))
+		return cpu_util(cpu);
+
+	cfs_rq = &cpu_rq(cpu)->cfs;
+	util = cfs_rq->avg.util_avg;
+	util_est = cfs_rq->util_est_runnable;
+	util_est = max(util, util_est);
+
+	capacity = capacity_orig_of(cpu);
+	util_est = min(util_est, capacity);
+
+	return util_est;
+}
+
 static inline unsigned long task_util(struct task_struct *p)
 {
 	return p->se.avg.util_avg;
@@ -6322,16 +6357,43 @@ static inline unsigned long task_util_est(struct task_struct *p)
  */
 static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
 {
-	unsigned long util, capacity;
+	long util, util_est;
 
 	/* Task has no contribution or is new */
 	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
-		return cpu_util(cpu);
+		return cpu_util_est(cpu);
 
-	capacity = capacity_orig_of(cpu);
-	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+	/* Discount task's blocked util from CPU's util */
+	util = cpu_util(cpu) - task_util(p);
+	util = max(util, 0L);
 
-	return (util >= capacity) ? capacity : util;
+	if (!sched_feat(UTIL_EST))
+		return util;
+
+	/*
+	 * These are the main cases covered:
+	 * - if *p is the only task sleeping on this CPU, then:
+	 *      cpu_util (== task_util) > util_est (== 0)
+	 *   and thus we return:
+	 *      cpu_util_wake = (cpu_util - task_util) = 0
+	 *
+	 * - if other tasks are SLEEPING on the CPU on which *p is just waking
+	 *   up, then:
+	 *      cpu_util >= task_util
+	 *      cpu_util > util_est (== 0)
+	 *   and thus we discount *p's blocked utilization to return:
+	 *      cpu_util_wake = (cpu_util - task_util) >= 0
+	 *
+	 * - if other tasks are RUNNABLE on that CPU and
+	 *      util_est > cpu_util
+	 *   then we use util_est since it returns a more restrictive
+	 *   estimation of the spare capacity on that CPU, by just considering
+	 *   the expected utilization of tasks already runnable on that CPU.
+	 */
+	util_est = cpu_rq(cpu)->cfs.util_est_runnable;
+	util = max(util, util_est);
+
+	return util;
 }
 
 /*
@@ -7857,7 +7919,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
-		sgs->group_util += cpu_util(i);
+		sgs->group_util += cpu_util_est(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		nr_running = rq->nr_running;
-- 
2.14.1

* [PATCH v2 4/4] sched/cpufreq_schedutil: use util_est for OPP selection
  2017-12-05 17:10 [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Patrick Bellasi
                   ` (2 preceding siblings ...)
  2017-12-05 17:10 ` [PATCH v2 3/4] sched/fair: use util_est in LB and WU paths Patrick Bellasi
@ 2017-12-05 17:10 ` Patrick Bellasi
  2017-12-16  2:35   ` Rafael J. Wysocki
  2017-12-13 16:03 ` [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Peter Zijlstra
  2017-12-13 17:56 ` Mike Galbraith
  5 siblings, 1 reply; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-05 17:10 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes

When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and delays on frequency selection.

For example, since the task utilization is decayed at wakeup time, a
newly enqueued, long-sleeping big task does not immediately add a
significant contribution to the target CPU. This introduces some latency
before schedutil is able to detect the best frequency required by
that task.

Moreover, the PELT signal's build-up time is a function of the current
frequency, because of the scale-invariant load tracking support. Thus,
when starting from a lower frequency, the utilization build-up time
increases even more and further delays the selection of the frequency
which best serves the task's requirements.

In order to reduce this kind of latency, this patch integrates the
usage of the CPU's estimated utilization in the sugov_get_util function.

The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization of each
currently RUNNABLE task on that CPU.
This allows properly representing the expected utilization of a CPU
which, for example, has just got a big task running after a long sleep
period, and ultimately it allows selecting the best frequency to run a
task right after it wakes up.
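
As a rough example (illustrative numbers, using schedutil's ~1.25 margin):
on a CPU with a 2 GHz maximum frequency, a freshly woken big task whose
util_avg has decayed to ~100/1024 would request only ~250 MHz, while a
util_est of ~450/1024 immediately requests ~1.1 GHz.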

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org

---
Changes v1->v2:
 - rebase on top of v4.15-rc2
 - tested that overhauled PELT code does not affect the util_est
---
 kernel/sched/cpufreq_schedutil.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 2f52ec0f1539..465430d99440 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -183,7 +183,11 @@ static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
 
 	cfs_max = arch_scale_cpu_capacity(NULL, cpu);
 
-	*util = min(rq->cfs.avg.util_avg, cfs_max);
+	*util = rq->cfs.avg.util_avg;
+	if (sched_feat(UTIL_EST))
+		*util = max(*util, rq->cfs.util_est_runnable);
+	*util = min(*util, cfs_max);
+
 	*max = cfs_max;
 }
 
-- 
2.14.1

* Re: [PATCH v2 1/4] sched/fair: always used unsigned long for utilization
  2017-12-05 17:10 ` [PATCH v2 1/4] sched/fair: always used unsigned long for utilization Patrick Bellasi
@ 2017-12-06  8:56   ` Vincent Guittot
  2018-01-10 12:14   ` [tip:sched/core] sched/fair: Use 'unsigned long' for utilization, consistently tip-bot for Patrick Bellasi
  1 sibling, 0 replies; 30+ messages in thread
From: Vincent Guittot @ 2017-12-06  8:56 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On 5 December 2017 at 18:10, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> Utilization and capacity are tracked as unsigned long, however some
> functions using them return an int which is ultimately assigned back to
> unsigned long variables.
>
> Since there is not scope on using a different and signed type, this
> consolidate the signature of functions returning utilization to always
> use the native type.
> As well as improving code consistency this is expected also benefits
> code paths where utilizations should be clamped by avoiding further type
> conversions or ugly type casts.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Reviewed-by: Chris Redpath <chris.redpath@arm.com>
> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: linux-kernel@vger.kernel.org

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> ---
> Changes v1->v2:
>  - rebase on top of v4.15-rc2
>  - tested that overhauled PELT code does not affect the util_est
> ---
>  kernel/sched/fair.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4037e19bbca2..ad21550d008c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5721,8 +5721,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>         return affine;
>  }
>
> -static inline int task_util(struct task_struct *p);
> -static int cpu_util_wake(int cpu, struct task_struct *p);
> +static inline unsigned long task_util(struct task_struct *p);
> +static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
>
>  static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>  {
> @@ -6203,7 +6203,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   * capacity_orig) as it useful for predicting the capacity required after task
>   * migrations (scheduler-driven DVFS).
>   */
> -static int cpu_util(int cpu)
> +static unsigned long cpu_util(int cpu)
>  {
>         unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
>         unsigned long capacity = capacity_orig_of(cpu);
> @@ -6211,7 +6211,7 @@ static int cpu_util(int cpu)
>         return (util >= capacity) ? capacity : util;
>  }
>
> -static inline int task_util(struct task_struct *p)
> +static inline unsigned long task_util(struct task_struct *p)
>  {
>         return p->se.avg.util_avg;
>  }
> @@ -6220,7 +6220,7 @@ static inline int task_util(struct task_struct *p)
>   * cpu_util_wake: Compute cpu utilization with any contributions from
>   * the waking task p removed.
>   */
> -static int cpu_util_wake(int cpu, struct task_struct *p)
> +static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
>  {
>         unsigned long util, capacity;
>
> --
> 2.14.1
>

* Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks
  2017-12-05 17:10 [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Patrick Bellasi
                   ` (3 preceding siblings ...)
  2017-12-05 17:10 ` [PATCH v2 4/4] sched/cpufreq_schedutil: use util_est for OPP selection Patrick Bellasi
@ 2017-12-13 16:03 ` Peter Zijlstra
  2017-12-13 16:23   ` Patrick Bellasi
  2017-12-13 17:56 ` Mike Galbraith
  5 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-13 16:03 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Tue, Dec 05, 2017 at 05:10:14PM +0000, Patrick Bellasi wrote:
> With this feature enabled, the measured overhead is in the range of ~1%
> on the same HW/SW test configuration.

That's quite a lot; did you look where that comes from?

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-05 17:10 ` [PATCH v2 2/4] sched/fair: add util_est on top of PELT Patrick Bellasi
@ 2017-12-13 16:05   ` Peter Zijlstra
  2017-12-15 14:02     ` Patrick Bellasi
  2017-12-13 16:16   ` Peter Zijlstra
  2017-12-13 16:19   ` Peter Zijlstra
  2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-13 16:05 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> +	if (cfs_rq->nr_running > 0) {
> +		util_est  = cfs_rq->util_est_runnable;
> +		util_est -= task_util_est(p);
> +		if (util_est < 0)
> +			util_est = 0;
> +		cfs_rq->util_est_runnable = util_est;
> +	} else {

I'm thinking that's an explicit load-store to avoid intermediate values
landing in cfs_rq->util_est_runnable, right?

That would need READ_ONCE() / WRITE_ONCE() I think, without that the
compiler is free to munge the lot together.
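
Something along these lines (just a sketch):

	util_est  = READ_ONCE(cfs_rq->util_est_runnable);
	util_est -= task_util_est(p);
	if (util_est < 0)
		util_est = 0;
	WRITE_ONCE(cfs_rq->util_est_runnable, util_est);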

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-05 17:10 ` [PATCH v2 2/4] sched/fair: add util_est on top of PELT Patrick Bellasi
  2017-12-13 16:05   ` Peter Zijlstra
@ 2017-12-13 16:16   ` Peter Zijlstra
  2017-12-15 12:14     ` Patrick Bellasi
  2017-12-13 16:19   ` Peter Zijlstra
  2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-13 16:16 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> +static inline void util_est_dequeue(struct task_struct *p, int flags)
> +{
> +	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> +	unsigned long util_last = task_util(p);
> +	bool sleep = flags & DEQUEUE_SLEEP;
> +	unsigned long ewma;
> +	long util_est;
> +
> +	if (!sched_feat(UTIL_EST))
> +		return;
> +
> +	/*
> +	 * Update root cfs_rq's estimated utilization
> +	 *
> +	 * If *p is the last task then the root cfs_rq's estimated utilization
> +	 * of a CPU is 0 by definition.
> +	 *
> +	 * Otherwise, in removing *p's util_est from its cfs_rq's
> +	 * util_est_runnable we should account for cases where this last
> +	 * activation of *p was longer then the previous ones.
> +	 * Also in these cases we need to set 0 the estimated utilization for
> +	 * the CPU.
> +	 */
> +	if (cfs_rq->nr_running > 0) {
> +		util_est  = cfs_rq->util_est_runnable;
> +		util_est -= task_util_est(p);
> +		if (util_est < 0)
> +			util_est = 0;
> +		cfs_rq->util_est_runnable = util_est;
> +	} else {
> +		cfs_rq->util_est_runnable = 0;
> +	}
> +
> +	/*
> +	 * Skip update of task's estimated utilization when the task has not
> +	 * yet completed an activation, e.g. being migrated.
> +	 */
> +	if (!sleep)
> +		return;
> +
> +	/*
> +	 * Skip update of task's estimated utilization when its EWMA is already
> +	 * ~1% close to its last activation value.
> +	 */
> +	util_est = p->util_est.ewma;
> +	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> +		return;

Isn't that computation almost as expensive as the stuff you're trying to
avoid?

> +	/*
> +	 * Update Task's estimated utilization
> +	 *
> +	 * When *p completes an activation we can consolidate another sample
> +	 * about the task size. This is done by storing the last PELT value
> +	 * for this task and using this value to load another sample in the
> +	 * exponential weighted moving average:
> +	 *
> +	 *      ewma(t) = w *  task_util(p) + (1 - w) ewma(t-1)
> +	 *              = w *  task_util(p) + ewma(t-1) - w * ewma(t-1)
> +	 *              = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
> +	 *
> +	 * Where 'w' is the weight of new samples, which is configured to be
> +	 * 0.25, thus making w=1/4
> +	 */
> +	p->util_est.last = util_last;
> +	ewma = p->util_est.ewma;
> +	if (likely(ewma != 0)) {

Why special case 0? Yes it helps with the initial ramp-on, but would not
an asymmetric IIR (with a consistent upward bias) be better?

> +		ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> +		ewma >>= UTIL_EST_WEIGHT_SHIFT;
> +	} else {
> +		ewma = util_last;
> +	}
> +	p->util_est.ewma = ewma;
> +}

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-05 17:10 ` [PATCH v2 2/4] sched/fair: add util_est on top of PELT Patrick Bellasi
  2017-12-13 16:05   ` Peter Zijlstra
  2017-12-13 16:16   ` Peter Zijlstra
@ 2017-12-13 16:19   ` Peter Zijlstra
  2017-12-13 16:36     ` Patrick Bellasi
  2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-13 16:19 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> @@ -562,6 +577,12 @@ struct task_struct {
>  
>  	const struct sched_class	*sched_class;
>  	struct sched_entity		se;
> +	/*
> +	 * Since we use se.avg.util_avg to update util_est fields,
> +	 * this last can benefit from being close to se which
> +	 * also defines se.avg as cache aligned.
> +	 */
> +	struct util_est			util_est;
>  	struct sched_rt_entity		rt;
>  #ifdef CONFIG_CGROUP_SCHED
>  	struct task_group		*sched_task_group;


> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b19552a212de..8371839075fa 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -444,6 +444,7 @@ struct cfs_rq {
>  	 * CFS load tracking
>  	 */
>  	struct sched_avg avg;
> +	unsigned long util_est_runnable;
>  #ifndef CONFIG_64BIT
>  	u64 load_last_update_time_copy;
>  #endif


So you put the util_est in task_struct (not sched_entity) but the
util_est_runnable in cfs_rq (not rq). Seems inconsistent.

* Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks
  2017-12-13 16:03 ` [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Peter Zijlstra
@ 2017-12-13 16:23   ` Patrick Bellasi
  0 siblings, 0 replies; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-13 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On 13-Dec 17:03, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:14PM +0000, Patrick Bellasi wrote:
> > With this feature enabled, the measured overhead is in the range of ~1%
> > on the same HW/SW test configuration.
> 
> That's quite a lot; did you look where that comes from?

I've tracked it down to util_est_dequeue(), introduced by PATCH 2/4,
mainly due to the EWMA update. Initially the running average was
implemented using the library function provided in:

   include/linux/average.h::DECLARE_EWMA

but that solution generated even more overhead.
That's why we switched to an "inline custom" implementation.
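
For reference, using that library looks something like this (an
illustrative sketch, not the actual code that was dropped):

   #include <linux/average.h>

   /* new samples weighted ~1/4, with 10 bits of fractional precision */
   DECLARE_EWMA(util_est, 10, 4);

   struct util_est {
	struct ewma_util_est	ewma;
	unsigned long		last;
   };

   /* at init time */
   ewma_util_est_init(&p->util_est.ewma);

   /* at dequeue time, to add a new sample */
   ewma_util_est_add(&p->util_est.ewma, task_util(p));

   /* when the estimate is needed */
   util = max(ewma_util_est_read(&p->util_est.ewma), p->util_est.last);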

Hackbench is quite stressful for that path, and we also have a few
branches which can play a role. One, for example, has been added to
skip the EWMA update when the rolling average is "close enough" to the
current PELT value.

All that considered, that's why I disable the sched_feat by default,
in which case we have zero overhead... in a !SCHED_DEBUG kernel the code
is actually removed by the compiler.

On mobile systems (i.e. non-hackbench scenarios) the additional benefit
to task placement and OPP selection is likely still worth the overhead.

Do you think the idea of having a Kconfig option to enable this feature
only on systems which do not care about the possible overhead is a viable
solution?

-- 
#include <best/regards.h>

Patrick Bellasi

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-13 16:19   ` Peter Zijlstra
@ 2017-12-13 16:36     ` Patrick Bellasi
  2017-12-13 17:03       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-13 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On 13-Dec 17:19, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > @@ -562,6 +577,12 @@ struct task_struct {
> >  
> >  	const struct sched_class	*sched_class;
> >  	struct sched_entity		se;
> > +	/*
> > +	 * Since we use se.avg.util_avg to update util_est fields,
> > +	 * this last can benefit from being close to se which
> > +	 * also defines se.avg as cache aligned.
> > +	 */
> > +	struct util_est			util_est;
> >  	struct sched_rt_entity		rt;
> >  #ifdef CONFIG_CGROUP_SCHED
> >  	struct task_group		*sched_task_group;
> 
> 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b19552a212de..8371839075fa 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -444,6 +444,7 @@ struct cfs_rq {
> >  	 * CFS load tracking
> >  	 */
> >  	struct sched_avg avg;
> > +	unsigned long util_est_runnable;
> >  #ifndef CONFIG_64BIT
> >  	u64 load_last_update_time_copy;
> >  #endif
> 
> 
> So you put the util_est in task_struct (not sched_entity) but the
> util_est_runnable in cfs_rq (not rq). Seems inconsistent.

One goal was to keep the util_est variables close to the util_avg used to
load the filter, for cache affinity's sake.

The other goal was to have util_est data only for tasks and CPUs' root
RQs, thus avoiding unused data for TGs' RQs and SEs.

Unfortunately the first goal does not allow fully achieving the second
and, you're right, the solution looks a bit inconsistent.

Do you think we should better disregard cache proximity and move
util_est_runnable to rq?

-- 
#include <best/regards.h>

Patrick Bellasi

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-13 16:36     ` Patrick Bellasi
@ 2017-12-13 17:03       ` Peter Zijlstra
  2017-12-15 12:03         ` Patrick Bellasi
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-13 17:03 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Wed, Dec 13, 2017 at 04:36:53PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:19, Peter Zijlstra wrote:
> > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > @@ -562,6 +577,12 @@ struct task_struct {
> > >  
> > >  	const struct sched_class	*sched_class;
> > >  	struct sched_entity		se;
> > > +	/*
> > > +	 * Since we use se.avg.util_avg to update util_est fields,
> > > +	 * this last can benefit from being close to se which
> > > +	 * also defines se.avg as cache aligned.
> > > +	 */
> > > +	struct util_est			util_est;

The thing is, since sched_entity has a member with cacheline alignment,
the whole structure must have cacheline alignment, and this util_est
_will_ start on a new line.

See also:

$ pahole -EC task_struct defconfig/kernel/sched/core.o

...
		struct sched_avg {
                                /* typedef u64 */ long long unsigned int last_update_time; /*   576     8 */
                                /* typedef u64 */ long long unsigned int load_sum;       /*   584     8 */
                                /* typedef u32 */ unsigned int util_sum;                 /*   592     4 */
                                /* typedef u32 */ unsigned int period_contrib;           /*   596     4 */
                                long unsigned int load_avg;                              /*   600     8 */
                                long unsigned int util_avg;                              /*   608     8 */
                        } avg; /*   576    40 */
                        /* --- cacheline 6 boundary (384 bytes) --- */
                } se; /*   192   448 */
                /* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */
                struct util_est {
                        long unsigned int last;                                          /*   640     8 */
                        long unsigned int ewma;                                          /*   648     8 */
                } util_est; /*   640    16 */
...

The thing is somewhat confused on which cacheline is which, but you'll
see sched_avg landing at 576 (cacheline #9) and util_est at 640 (line
#10).

> > >  	struct sched_rt_entity		rt;
> > >  #ifdef CONFIG_CGROUP_SCHED
> > >  	struct task_group		*sched_task_group;

> One goal was to keep util_est variables close to the util_avg used to
> load the filter, for caches affinity sakes.
> 
> The other goal was to have util_est data only for Tasks and CPU's
> RQ, thus avoiding unused data for TG's RQ and SE.
> 
> Unfortunately the first goal does not allow to achieve completely the
> second and, you right, the solution looks a bit inconsistent.
> 
> Do you think we should better disregard cache proximity and move
> util_est_runnable to rq?

proximity is likely important; I'd suggest moving util_est into
sched_entity.

* Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks
  2017-12-05 17:10 [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Patrick Bellasi
                   ` (4 preceding siblings ...)
  2017-12-13 16:03 ` [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Peter Zijlstra
@ 2017-12-13 17:56 ` Mike Galbraith
  2017-12-15 16:13   ` Patrick Bellasi
  5 siblings, 1 reply; 30+ messages in thread
From: Mike Galbraith @ 2017-12-13 17:56 UTC (permalink / raw)
  To: Patrick Bellasi, linux-kernel, linux-pm
  Cc: Ingo Molnar, Peter Zijlstra, Rafael J . Wysocki, Viresh Kumar,
	Vincent Guittot, Paul Turner, Dietmar Eggemann, Morten Rasmussen,
	Juri Lelli, Todd Kjos, Joel Fernandes

On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> This is a respin of:
>    https://lkml.org/lkml/2017/11/9/546
> which has been rebased on v4.15-rc2 to have util_est now working on top
> of the recent PeterZ's:
>    [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> 
> The aim of this series is to improve some PELT behaviors to make it a
> better fit for the scheduling of tasks common in embedded mobile
> use-cases, without affecting other classes of workloads.

I thought perhaps this patch set would improve the below behavior, but
alas it does not.  That's 3 instances of firefox playing youtube clips
being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
serializes the threaded desktop, making that threading kinda pointless,
and CFS not all that fair.

 6569 root      20   0    4048    704    628 R 100.0 0.004   5:10.48 7 cpuhog                                                                                                                                                                
 6573 root      20   0    4048    712    636 R 100.0 0.004   5:07.47 5 cpuhog                                                                                                                                                                
 6581 root      20   0    4048    696    620 R 100.0 0.004   5:07.36 1 cpuhog                                                                                                                                                                
 6585 root      20   0    4048    812    736 R 100.0 0.005   5:08.14 4 cpuhog                                                                                                                                                                
 6589 root      20   0    4048    712    636 R 100.0 0.004   5:06.42 6 cpuhog                                                                                                                                                                
 6577 root      20   0    4048    720    644 R 99.80 0.005   5:06.52 3 cpuhog                                                                                                                                                                
 6593 root      20   0    4048    728    652 R 99.60 0.005   5:04.25 0 cpuhog                                                                                                                                                                
 6755 mikeg     20   0 2714788 885324 179196 S 19.96 5.544   2:14.36 2 Web Content                                                                                                                                                           
 6620 mikeg     20   0 2318348 312336 145044 S 8.383 1.956   0:51.51 2 firefox                                                                                                                                                               
 3190 root      20   0  323944  71704  42368 S 3.194 0.449   0:11.90 2 Xorg                                                                                                                                                                  
 3718 root      20   0 3009580  67112  49256 S 0.599 0.420   0:02.89 2 kwin_x11                                                                                                                                                              
 3761 root      20   0  769760  90740  62048 S 0.399 0.568   0:03.46 2 konsole                                                                                                                                                               
 3845 root       9 -11  791224  20132  14236 S 0.399 0.126   0:03.00 2 pulseaudio                                                                                                                                                            
 3722 root      20   0 3722308 172568  88088 S 0.200 1.081   0:04.35 2 plasmashel

 ------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum     delay ms | Maximum delay at      |
 ------------------------------------------------------------------------------------------------------------------------------------
  Web Content:6755      |   2864.862 ms |     7314 | avg:    0.299 ms | max:   40.374 ms | sum: 2189.472 ms | max at:    375.769240 |
  Compositor:6680       |   1889.847 ms |     4672 | avg:    0.531 ms | max:   29.092 ms | sum: 2478.559 ms | max at:    375.759405 |
  MediaPl~back #3:(13)  |   3269.777 ms |     7853 | avg:    0.218 ms | max:   19.451 ms | sum: 1711.635 ms | max at:    391.123970 |
  MediaPl~back #4:(10)  |   1472.986 ms |     8189 | avg:    0.236 ms | max:   18.653 ms | sum: 1933.886 ms | max at:    376.124211 |
  MediaPl~back #1:(9)   |    601.788 ms |     6598 | avg:    0.247 ms | max:   17.823 ms | sum: 1627.852 ms | max at:    401.122567 |
  firefox:6620          |    303.181 ms |     6232 | avg:    0.111 ms | max:   15.602 ms | sum:  691.865 ms | max at:    385.078558 |
  Socket Thread:6639    |    667.537 ms |     4806 | avg:    0.069 ms | max:   12.638 ms | sum:  329.387 ms | max at:    380.827323 |
  MediaPD~oder #1:6835  |    154.737 ms |     1592 | avg:    0.700 ms | max:   10.139 ms | sum: 1113.688 ms | max at:    392.575370 |
  MediaTimer #1:6828    |     42.660 ms |     5250 | avg:    0.575 ms | max:    9.845 ms | sum: 3018.994 ms | max at:    380.823677 |
  MediaPD~oder #2:6840  |    150.822 ms |     1583 | avg:    0.703 ms | max:    9.639 ms | sum: 1112.962 ms | max at:    380.823741 |
...

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-13 17:03       ` Peter Zijlstra
@ 2017-12-15 12:03         ` Patrick Bellasi
  2017-12-15 12:58           ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-15 12:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On 13-Dec 18:03, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 04:36:53PM +0000, Patrick Bellasi wrote:
> > On 13-Dec 17:19, Peter Zijlstra wrote:
> > > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > > @@ -562,6 +577,12 @@ struct task_struct {
> > > >  
> > > >  	const struct sched_class	*sched_class;
> > > >  	struct sched_entity		se;
> > > > +	/*
> > > > +	 * Since we use se.avg.util_avg to update util_est fields,
> > > > +	 * this last can benefit from being close to se which
> > > > +	 * also defines se.avg as cache aligned.
> > > > +	 */
> > > > +	struct util_est			util_est;
> 
> The thing is, since sched_entity has a member with cacheline alignment,
> the whole structure must have cacheline alignment, and this util_est
> _will_ start on a new line.

Right, I was not considering that "aligned" also affects the start of
the following data.

> See also:
> 
> $ pahole -EC task_struct defconfig/kernel/sched/core.o
> 
> ...
> 		struct sched_avg {
>                                 /* typedef u64 */ long long unsigned int last_update_time; /*   576     8 */
>                                 /* typedef u64 */ long long unsigned int load_sum;       /*   584     8 */
>                                 /* typedef u32 */ unsigned int util_sum;                 /*   592     4 */
>                                 /* typedef u32 */ unsigned int period_contrib;           /*   596     4 */
>                                 long unsigned int load_avg;                              /*   600     8 */
>                                 long unsigned int util_avg;                              /*   608     8 */
>                         } avg; /*   576    40 */
>                         /* --- cacheline 6 boundary (384 bytes) --- */
>                 } se; /*   192   448 */
>                 /* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */
>                 struct util_est {
>                         long unsigned int last;                                          /*   640     8 */
>                         long unsigned int ewma;                                          /*   648     8 */
>                 } util_est; /*   640    16 */
> ...
> 
> The thing is somewhat confused on which cacheline is which, but you'll
> see sched_avg landing at 576 (cacheline #9) and util_est at 640 (line
> #10).
> 
> > > >  	struct sched_rt_entity		rt;
> > > >  #ifdef CONFIG_CGROUP_SCHED
> > > >  	struct task_group		*sched_task_group;
> 
> > One goal was to keep util_est variables close to the util_avg used to
> > load the filter, for caches affinity sakes.
> > 
> > The other goal was to have util_est data only for Tasks and CPU's
> > RQ, thus avoiding unused data for TG's RQ and SE.
> > 
> > Unfortunately the first goal does not allow to achieve completely the
> > second and, you right, the solution looks a bit inconsistent.
> > 
> > Do you think we should better disregard cache proximity and move
> > util_est_runnable to rq?
> 
> proximity is likely important; I'd suggest moving util_est into
> sched_entity.

So, by moving util_est right after sched_avg, here is what we get (with some
lines to better highlight 64B boundaries):

                const struct sched_class  * sched_class;                                 /*   152     8 */
                struct sched_entity {
			[...]
		---[ Line 9 ]-------------------------------------------------------------------------------
                        struct sched_avg {
                                /* typedef u64 */ long long unsigned int last_update_time; /*   576     8 */
                                /* typedef u64 */ long long unsigned int load_sum;       /*   584     8 */
                                /* typedef u64 */ long long unsigned int runnable_load_sum; /*   592     8 */
                                /* typedef u32 */ unsigned int util_sum;                 /*   600     4 */
                                /* typedef u32 */ unsigned int period_contrib;           /*   604     4 */
                                long unsigned int load_avg;                              /*   608     8 */
                                long unsigned int runnable_load_avg;                     /*   616     8 */
                                long unsigned int util_avg;                              /*   624     8 */
                        } avg; /*   576    56 */
                        /* --- cacheline 6 boundary (384 bytes) was 24 bytes ago --- */
                        struct util_est {
                                long unsigned int last;                                  /*   632     8 */
		---[ Line 10 ]------------------------------------------------------------------------------
                                long unsigned int ewma;                                  /*   640     8 */
                        } util_est; /*   632    16 */
                } se; /*   192   512 */
		---[ Line 11 ]------------------------------------------------------------------------------
                /* --- cacheline 9 boundary (576 bytes) was 24 bytes ago --- */
                struct sched_rt_entity {
                        struct list_head {
                                struct list_head * next;                                 /*   704     8 */
                                struct list_head * prev;                                 /*   712     8 */
                        } run_list; /*   704    16 */


As you can see we still end up with util_est spanning across two cache
lines and, even worse, with an almost empty Line 10. The point is that
sched_avg already uses 56B... which leaves just 8 bytes.

So, I can move util_est there and use unsigned int for "last" and "ewma"
storage. This should fix the cache alignment, but only for as long as we
do not add other stuff to sched_avg.

BTW, shouldn't it be possible to use a similar "fasting" approach for
load_avg and runnable_load_avg? Given their range a u32 should be good
enough, shouldn't it?
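
To make the layout concrete, here is a minimal sketch of the narrowed
variant (a sketch only: the field names follow the patch, the u32 storage
is just the proposal above, and sched_avg is assumed to be the 56B version
from the pahole dump):

struct util_est {
	unsigned int		last;		/* 4B */
	unsigned int		ewma;		/* 4B */
};						/* 8B total */

struct sched_entity {
	/* ... */
	struct sched_avg	avg;		/* 56B, starts on a cache line */
	struct util_est		util_est;	/* fills the remaining 8B of that line */
	/* ... */
};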

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-13 16:16   ` Peter Zijlstra
@ 2017-12-15 12:14     ` Patrick Bellasi
  2017-12-15 12:53       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-15 12:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On 13-Dec 17:16, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > +static inline void util_est_dequeue(struct task_struct *p, int flags)
> > +{
> > +	struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> > +	unsigned long util_last = task_util(p);
> > +	bool sleep = flags & DEQUEUE_SLEEP;
> > +	unsigned long ewma;
> > +	long util_est;
> > +
> > +	if (!sched_feat(UTIL_EST))
> > +		return;
> > +
> > +	/*
> > +	 * Update root cfs_rq's estimated utilization
> > +	 *
> > +	 * If *p is the last task then the root cfs_rq's estimated utilization
> > +	 * of a CPU is 0 by definition.
> > +	 *
> > +	 * Otherwise, in removing *p's util_est from its cfs_rq's
> > +	 * util_est_runnable we should account for cases where this last
> > +	 * activation of *p was longer then the previous ones.
> > +	 * Also in these cases we need to set 0 the estimated utilization for
> > +	 * the CPU.
> > +	 */
> > +	if (cfs_rq->nr_running > 0) {
> > +		util_est  = cfs_rq->util_est_runnable;
> > +		util_est -= task_util_est(p);
> > +		if (util_est < 0)
> > +			util_est = 0;
> > +		cfs_rq->util_est_runnable = util_est;
> > +	} else {
> > +		cfs_rq->util_est_runnable = 0;
> > +	}
> > +
> > +	/*
> > +	 * Skip update of task's estimated utilization when the task has not
> > +	 * yet completed an activation, e.g. being migrated.
> > +	 */
> > +	if (!sleep)
> > +		return;
> > +
> > +	/*
> > +	 * Skip update of task's estimated utilization when its EWMA is already
> > +	 * ~1% close to its last activation value.
> > +	 */
> > +	util_est = p->util_est.ewma;
> > +	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > +		return;
> 
> Isn't that computation almost as expensive as the stuff you're trying to
> avoid?

Mmm... maybe slightly simpler. I'll profile it again, but I remember
I added it because it was slightly better on hackbench.

This code in the end is just a "sub" and a "compare to constant", and
it's likely to bail out early for all "almost regular" tasks.

Are you worried about the branch overhead?

> > +	/*
> > +	 * Update Task's estimated utilization
> > +	 *
> > +	 * When *p completes an activation we can consolidate another sample
> > +	 * about the task size. This is done by storing the last PELT value
> > +	 * for this task and using this value to load another sample in the
> > +	 * exponential weighted moving average:
> > +	 *
> > +	 *      ewma(t) = w *  task_util(p) + (1 - w) ewma(t-1)
> > +	 *              = w *  task_util(p) + ewma(t-1) - w * ewma(t-1)
> > +	 *              = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
> > +	 *
> > +	 * Where 'w' is the weight of new samples, which is configured to be
> > +	 * 0.25, thus making w=1/4
> > +	 */
> > +	p->util_est.last = util_last;
> > +	ewma = p->util_est.ewma;
> > +	if (likely(ewma != 0)) {
> 
> Why special case 0? Yes it helps with the initial ramp-on, but would not
> an asymmetric IIR (with a consistent upward bias) be better?

Yes, maybe the fast ramp-up is not really necessary... I'll test it
without on some real use-cases and see if we really get any noticeable
benefit, otherwise I'll remove it.

Thanks for pointing this out.

> > +		ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> > +		ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > +	} else {
> > +		ewma = util_last;
> > +	}
> > +	p->util_est.ewma = ewma;
> > +}
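
To make the arithmetic above easy to check, here is a small self-contained
(userspace) sketch of the same EWMA update; the shift of 2 matches the
w=1/4 in the comment, while the sample values are made up:

#include <stdio.h>

#define UTIL_EST_WEIGHT_SHIFT	2	/* w = 1/4, as in the comment above */

/*
 * ewma(t) = w * util_last + (1 - w) * ewma(t-1), computed with shifts:
 * (util_last + 4*ewma - ewma) / 4 == util_last/4 + 3*ewma/4.
 */
static unsigned long util_est_update(unsigned long ewma, unsigned long util_last)
{
	if (!ewma)			/* fast initial ramp-up, as in the patch */
		return util_last;

	ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
	ewma >>= UTIL_EST_WEIGHT_SHIFT;
	return ewma;
}

int main(void)
{
	unsigned long samples[] = { 400, 400, 100, 100, 100 };
	unsigned long ewma = 0;

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		ewma = util_est_update(ewma, samples[i]);
		printf("sample=%lu ewma=%lu\n", samples[i], ewma);
	}
	return 0;
}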

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-15 12:14     ` Patrick Bellasi
@ 2017-12-15 12:53       ` Peter Zijlstra
  2017-12-15 15:41         ` Patrick Bellasi
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-15 12:53 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Fri, Dec 15, 2017 at 12:14:17PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:16, Peter Zijlstra wrote:

> > > +	/*
> > > +	 * Skip update of task's estimated utilization when its EWMA is already
> > > +	 * ~1% close to its last activation value.
> > > +	 */
> > > +	util_est = p->util_est.ewma;
> > > +	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > > +		return;
> > 
> > Isn't that computation almost as expensive as the stuff you're trying to
> > avoid?
> 
> Mmm... maybe slightly simpler. I'll profile it again, but I remember
> I added it because it was slightly better on hackbench.
> 
> This code in the end is just a "sub" and a "compare to constant", and
> it's likely to bail out early for all "almost regular" tasks.
> 
> Are you worried about the branch overhead?

It's a subtract, a test for sign, a conditional branch on the test, a negate,
a subtract of a constant and another conditional branch.

Branch overhead certainly matters too.

> > > +	p->util_est.last = util_last;
> > > +	ewma = p->util_est.ewma;
> > > +	if (likely(ewma != 0)) {
> > 
> > Why special case 0? Yes it helps with the initial ramp-on, but would not
> > an asymmetric IIR (with a consistent upward bias) be better?
> 
> Yes, maybe the fast ramp-up is not really necessary... I'll test it
> without on some real use-cases and see if we really get any noticeable
> benefit, otherwise I'll remove it.
> 
> Thanks for pointing this out.
> 
> > > +		ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> > > +		ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > > +	} else {
> > > +		ewma = util_last;
> > > +	}
> > > +	p->util_est.ewma = ewma;

And this, without the 0 case, is a shift, an add, a subtract and another
shift, followed by a store.

Which is fewer branches and roughly similar arith ops, some of which can
be done in parallel.

I suspect what you saw on the profile is the cacheline hit of the store,
but I'm not sure.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-15 12:03         ` Patrick Bellasi
@ 2017-12-15 12:58           ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-15 12:58 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Fri, Dec 15, 2017 at 12:03:31PM +0000, Patrick Bellasi wrote:

> So, by moving util_est right after sched_avg, here is what we get (with some
> lines to better highlight 64B boundaries):
> 
>                 const struct sched_class  * sched_class;                                 /*   152     8 */
>                 struct sched_entity {
> 			[...]
> 		---[ Line 9 ]-------------------------------------------------------------------------------
>                         struct sched_avg {
>                                 /* typedef u64 */ long long unsigned int last_update_time; /*   576     8 */
>                                 /* typedef u64 */ long long unsigned int load_sum;       /*   584     8 */
>                                 /* typedef u64 */ long long unsigned int runnable_load_sum; /*   592     8 */
>                                 /* typedef u32 */ unsigned int util_sum;                 /*   600     4 */
>                                 /* typedef u32 */ unsigned int period_contrib;           /*   604     4 */
>                                 long unsigned int load_avg;                              /*   608     8 */
>                                 long unsigned int runnable_load_avg;                     /*   616     8 */
>                                 long unsigned int util_avg;                              /*   624     8 */
>                         } avg; /*   576    56 */
>                         /* --- cacheline 6 boundary (384 bytes) was 24 bytes ago --- */
>                         struct util_est {
>                                 long unsigned int last;                                  /*   632     8 */
> 		---[ Line 10 ]------------------------------------------------------------------------------
>                                 long unsigned int ewma;                                  /*   640     8 */
>                         } util_est; /*   632    16 */
>                 } se; /*   192   512 */
> 		---[ Line 11 ]------------------------------------------------------------------------------
>                 /* --- cacheline 9 boundary (576 bytes) was 24 bytes ago --- */
>                 struct sched_rt_entity {
>                         struct list_head {
>                                 struct list_head * next;                                 /*   704     8 */
>                                 struct list_head * prev;                                 /*   712     8 */
>                         } run_list; /*   704    16 */
> 
> 
> As you can see we still end up with util_est spanning across two cache
> lines and, even worse, with an almost empty Line 10. The point is that
> sched_avg already uses 56B... which leaves just 8 bytes.

Yes, that's unfortunate.

> So, I can move util_est there and use unsigned int for "last" and "ewma"
> storage. This should fix the cache alignment, but only for as long as we
> do not add other stuff to sched_avg.
> 
> BTW, shouldn't it be possible to use a similar "fasting" approach for
> load_avg and runnable_load_avg? Given their range a u32 should be good
> enough, shouldn't it?

Probably, I'd have to page all that stuff back in :/

Another issue is that for tasks load and runnable_load are the exact
same; I just never found a sensible way to collapse that.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-13 16:05   ` Peter Zijlstra
@ 2017-12-15 14:02     ` Patrick Bellasi
  2017-12-15 14:07       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-15 14:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On 13-Dec 17:05, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > +	if (cfs_rq->nr_running > 0) {
> > +		util_est  = cfs_rq->util_est_runnable;
> > +		util_est -= task_util_est(p);
> > +		if (util_est < 0)
> > +			util_est = 0;
> > +		cfs_rq->util_est_runnable = util_est;
> > +	} else {
> 
> I'm thinking that's an explicit load-store to avoid intermediate values
> landing in cfs_rq->util_est_runnable, right?

That was mainly to have a signed util_est for the following "sub"...


> That would need READ_ONCE() / WRITE_ONCE() I think, without that the
> compiler is free to munge the lot together.

... do we still need the {READ,WRITE}_ONCE() in this case?
I guess adding them, however, does not hurt.

Going back to that code, it can likely be optimized to avoid
the else branch... will update and add the barriers.

Cheers Patrick.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-15 14:02     ` Patrick Bellasi
@ 2017-12-15 14:07       ` Peter Zijlstra
  2017-12-15 15:22         ` Patrick Bellasi
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-15 14:07 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Fri, Dec 15, 2017 at 02:02:18PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:05, Peter Zijlstra wrote:
> > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > +	if (cfs_rq->nr_running > 0) {
> > > +		util_est  = cfs_rq->util_est_runnable;
> > > +		util_est -= task_util_est(p);
> > > +		if (util_est < 0)
> > > +			util_est = 0;
> > > +		cfs_rq->util_est_runnable = util_est;
> > > +	} else {
> > 
> > I'm thinking that's an explicit load-store to avoid intermediate values
> > landing in cfs_rq->util_est_runnable, right?
> 
> That was mainly to have a signed util_est for the following "sub"...
> 
> 
> > That would need READ_ONCE() / WRITE_ONCE() I think, without that the
> > compiler is free to munge the lot together.
> 
> ... do we still need the {READ,WRITE}_ONCE() in this case?
> I guess adding them, however, does not hurt.

I think the compiler is free to munge it into something like:

	cfs_rq->util_est_runnable -= task_util_est();
	if (cfs_rq->util_est_runnable < 0)
		cfs_rq->util_est_runnable = 0

and it's a fairly simple optimization at that; it just needs to observe
that util_est is an alias for cfs_rq->util_est_runnable.

Using the volatile load/store completely destroys that optimization
though, so yes, I'd say it's definitely needed.
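
Spelled out, the explicit pattern looks like this (a self-contained
userspace sketch: the macros are stand-ins for the kernel accessors and
the struct is reduced to the single field from the patch):

#include <stdio.h>

/* Userspace stand-ins, only for this sketch. */
#define READ_ONCE(x)		(*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))

struct cfs_rq_sketch {
	unsigned long util_est_runnable;
};

/*
 * One volatile load, clamp in a local, one volatile store: the compiler
 * cannot turn this into a read-modify-write on the shared field, so no
 * transient (underflowed) value can land in util_est_runnable.
 */
static void util_est_sub(struct cfs_rq_sketch *cfs_rq, unsigned long task_util_est)
{
	long util_est = READ_ONCE(cfs_rq->util_est_runnable);

	util_est -= task_util_est;
	if (util_est < 0)
		util_est = 0;
	WRITE_ONCE(cfs_rq->util_est_runnable, util_est);
}

int main(void)
{
	struct cfs_rq_sketch rq = { .util_est_runnable = 100 };

	util_est_sub(&rq, 250);			/* would underflow without the clamp */
	printf("%lu\n", rq.util_est_runnable);	/* 0 */
	return 0;
}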

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-15 14:07       ` Peter Zijlstra
@ 2017-12-15 15:22         ` Patrick Bellasi
  0 siblings, 0 replies; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-15 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On 15-Dec 15:07, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 02:02:18PM +0000, Patrick Bellasi wrote:
> > On 13-Dec 17:05, Peter Zijlstra wrote:
> > > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > > +	if (cfs_rq->nr_running > 0) {
> > > > +		util_est  = cfs_rq->util_est_runnable;
> > > > +		util_est -= task_util_est(p);
> > > > +		if (util_est < 0)
> > > > +			util_est = 0;
> > > > +		cfs_rq->util_est_runnable = util_est;
> > > > +	} else {
> > > 
> > > I'm thinking that's an explicit load-store to avoid intermediate values
> > > landing in cfs_rq->util_est_runnable, right?
> > 
> > That was mainly to have a signed util_est for the following "sub"...
> > 
> > 
> > > That would need READ_ONCE() / WRITE_ONCE() I think, without that the
> > > compiler is free to munge the lot together.
> > 
> > ... do we still need the {READ,WRITE}_ONCE() in this case?
> > I guess adding them, however, does not hurt.
> 

This is just to better understand....

> I think the compiler is free to munge it into something like:
> 
> 	cfs_rq->util_est_runnable -= task_util_est();
> 	if (cfs_rq->util_est_runnable < 0)
> 		cfs_rq->util_est_runnable = 0
> 

I'm still confused, we have:

            long util_est
   unsigned long cfs_rq->util_est_runnable

The optimization above can potentially generate an overflow, can't it?

> and it's a fairly simple optimization at that; it just needs to observe
> that util_est is an alias for cfs_rq->util_est_runnable.

Since the first is signed and the latter unsigned, can the compiler still
consider them aliases?

At least on ARM I would expect a load of an unsigned value, some
computations on "signed registers", and finally a store of an unsigned
value. This is what I get:


        if (cfs_rq->nr_running > 0) {
    51e4:       91020000        add     x0, x0, #0x80
    51e8:       b9401802        ldr     w2, [x0,#24]
    51ec:       340004a2        cbz     w2, 5280 <dequeue_task_fair+0xb20>
	// skip branch for nr_running == 0

        return max(p->util_est.ewma, p->util_est.last);
    51f0:       f9403ba2        ldr     x2, [x29,#112]
    51f4:       f9418044        ldr     x4, [x2,#768]
    51f8:       f9418443        ldr     x3, [x2,#776]
	// x3 := p->util_est.ewma
	// x4 := p->util_est.last

                util_est -= task_util_est(p);
    51fc:       f9405002        ldr     x2, [x0,#160]
	// x2 := cfs_rq->util_est_runnable

        return max(p->util_est.ewma, p->util_est.last);
    5200:       eb04007f        cmp     x3, x4
    5204:       9a842063        csel    x3, x3, x4, cs
	// x3 := task_util_est(p) := max(p->util_est.ewma, p->util_est.last)

                cfs_rq->util_est_runnable = util_est;
    5208:       eb030042        subs    x2, x2, x3
	// x2 := util_est -= task_util_est(p);

    520c:       9a9f5042        csel    x2, x2, xzr, pl
	// x2 := max(0, util_est)

    5210:       f9005002        str     x2, [x0,#160]
	// store back into cfs_rq->util_est_runnable


And by adding {READ,WRITE}_ONCE I still get the same code.

While, compiling for x86, I get two different versions, here is
the one without {READ,WRITE}_ONCE:

       if (cfs_rq->nr_running > 0) {
    3e3e:       8b 90 98 00 00 00       mov    0x98(%rax),%edx
    3e44:       85 d2                   test   %edx,%edx
    3e46:       0f 84 e0 00 00 00       je     3f2c <dequeue_task_fair+0xf9c>
                util_est  = cfs_rq->util_est_runnable;
                util_est -= task_util_est(p);
    3e4c:       48 8b 74 24 28          mov    0x28(%rsp),%rsi
    3e51:       48 8b 96 80 02 00 00    mov    0x280(%rsi),%rdx
    3e58:       48 39 96 88 02 00 00    cmp    %rdx,0x288(%rsi)
    3e5f:       48 0f 43 96 88 02 00    cmovae 0x288(%rsi),%rdx
    3e66:       00 
                if (util_est < 0)
                        util_est = 0;
                cfs_rq->util_est_runnable = util_est;
    3e67:       48 8b b0 20 01 00 00    mov    0x120(%rax),%rsi
    3e6e:       48 29 d6                sub    %rdx,%rsi
    3e71:       48 89 f2                mov    %rsi,%rdx
    3e74:       be 00 00 00 00          mov    $0x0,%esi
    3e79:       48 0f 48 d6             cmovs  %rsi,%rdx
    3e7d:       48 89 90 20 01 00 00    mov    %rdx,0x120(%rax)

but I'm not confident in "parsing" it...


> Using the volatile load/store completely destroys that optimization
> though, so yes, I'd say it's definitely needed.

Ok, since it's definitely not harmful, I'll add it.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-15 12:53       ` Peter Zijlstra
@ 2017-12-15 15:41         ` Patrick Bellasi
  2017-12-20  8:57           ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-15 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On 15-Dec 13:53, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 12:14:17PM +0000, Patrick Bellasi wrote:
> > On 13-Dec 17:16, Peter Zijlstra wrote:
> 
> > > > +	/*
> > > > +	 * Skip update of task's estimated utilization when its EWMA is already
> > > > +	 * ~1% close to its last activation value.
> > > > +	 */
> > > > +	util_est = p->util_est.ewma;
> > > > +	if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > > > +		return;
> > > 
> > > Isn't that computation almost as expensive as the stuff you're trying to
> > > avoid?
> > 
> > Mmm... maybe slightly simpler. I'll profile it again, but I remember
> > I added it because it was slightly better on hackbench.
> > 
> > This code in the end is just a "sub" and a "compare to constant", and
> > it's likely to bail out early for all "almost regular" tasks.
> > 
> > Are you worried about the branch overhead?
> 
> It's a subtract, a test for sign, a conditional branch on the test, a negate,
> a subtract of a constant and another conditional branch.

Close enough, the actual code is:

        util_est = p->util_est.ewma;
    5218:       f9403ba3        ldr     x3, [x29,#112]
    521c:       f9418462        ldr     x2, [x3,#776]
        if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
    5220:       eb010040        subs    x0, x2, x1
    5224:       da805400        cneg    x0, x0, mi
    5228:       f100281f        cmp     x0, #0xa
    522c:       54fff9cd        b.le    5164 <dequeue_task_fair+0xa04>

> 
> Branch overhead certainly matters too.
> 
> > > > +	p->util_est.last = util_last;
> > > > +	ewma = p->util_est.ewma;
> > > > +	if (likely(ewma != 0)) {
> > > 
> > > Why special case 0? Yes it helps with the initial ramp-on, but would not
> > > an asymmetric IIR (with a consistent upward bias) be better?
> > 
> > Yes, maybe the fast ramp-up is not really necessary... I'll test it
> > without on some real use-cases and see if we really get any noticeable
> > benefit, otherwise I'll remove it.
> > 
> > Thanks for pointing this out.
> > 
> > > > +		ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> > > > +		ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > > > +	} else {
> > > > +		ewma = util_last;
> > > > +	}
> > > > +	p->util_est.ewma = ewma;
> 
> And this, without the 0 case, is a shift, an add, a subtract and another
> shift, followed by a store.

Actual code:

       p->util_est.last = util_last;
    5230:       f9018061        str     x1, [x3,#768]
        if (likely(ewma != 0)) {
    5234:       b40000a2        cbz     x2, 5248 <dequeue_task_fair+0xae8>
                ewma   = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
    5238:       d37ef440        lsl     x0, x2, #2
    523c:       cb020002        sub     x2, x0, x2
    5240:       8b010041        add     x1, x2, x1
                ewma >>= UTIL_EST_WEIGHT_SHIFT;
    5244:       d342fc21        lsr     x1, x1, #2
        p->util_est.ewma = ewma;
    5248:       f9403ba0        ldr     x0, [x29,#112]
    524c:       f9018401        str     x1, [x0,#776]
    5250:       17ffffc5        b       5164 <dequeue_task_fair+0xa04>

> 
> Which is fewer branches and roughly similar arith ops, some of which can
> be done in parallel.
> 
> I suspect what you saw on the profile is the cacheline hit of the store,
> but I'm not sure.

Yes, likely. Looking at the two chunks above, and considering the
removal of the 0 case, it's probably worth removing the first check.

I'll give it another try and measure hackbench overheads with the cache
alignment fixed.

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks
  2017-12-13 17:56 ` Mike Galbraith
@ 2017-12-15 16:13   ` Patrick Bellasi
  2017-12-15 20:23     ` Mike Galbraith
  0 siblings, 1 reply; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-15 16:13 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot, Paul Turner,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes

Hi Mike,

On 13-Dec 18:56, Mike Galbraith wrote:
> On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> > This is a respin of:
> >    https://lkml.org/lkml/2017/11/9/546
> > which has been rebased on v4.15-rc2 to have util_est now working on top
> > of the recent PeterZ's:
> >    [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> > 
> > The aim of this series is to improve some PELT behaviors to make it a
> > better fit for the scheduling of tasks common in embedded mobile
> > use-cases, without affecting other classes of workloads.
> 
> I thought perhaps this patch set would improve the below behavior, but
> alas it does not.  That's 3 instances of firefox playing youtube clips
> being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
> serializes the threaded desktop, making that threading kinda pointless,
> and CFS not all that fair.

Perhaps I don't completely get your use-case.
Are the cpuhog threads pinned to a CPU, or do they just happen to always
run on the same CPU?

I guess you would expect the three Firefox instances to be spread across
different CPUs. But whether this is possible also depends on the
specific task composition generated by Firefox, doesn't it?

Being a video playback pipeline, I would not be surprised to see that
most of the time we actually have only 1 or 2 tasks RUNNABLE, while
the others are sleeping... and if an HW decoder is involved, even if
you have three instances running you likely get only one pipeline
active at a time...

If that's the case, why should CFS move Firefox tasks around?


>  6569 root      20   0    4048    704    628 R 100.0 0.004   5:10.48 7 cpuhog
>  6573 root      20   0    4048    712    636 R 100.0 0.004   5:07.47 5 cpuhog
>  6581 root      20   0    4048    696    620 R 100.0 0.004   5:07.36 1 cpuhog
>  6585 root      20   0    4048    812    736 R 100.0 0.005   5:08.14 4 cpuhog
>  6589 root      20   0    4048    712    636 R 100.0 0.004   5:06.42 6 cpuhog
>  6577 root      20   0    4048    720    644 R 99.80 0.005   5:06.52 3 cpuhog
>  6593 root      20   0    4048    728    652 R 99.60 0.005   5:04.25 0 cpuhog
>  6755 mikeg     20   0 2714788 885324 179196 S 19.96 5.544   2:14.36 2 Web Content
>  6620 mikeg     20   0 2318348 312336 145044 S 8.383 1.956   0:51.51 2 firefox
>  3190 root      20   0  323944  71704  42368 S 3.194 0.449   0:11.90 2 Xorg
>  3718 root      20   0 3009580  67112  49256 S 0.599 0.420   0:02.89 2 kwin_x11
>  3761 root      20   0  769760  90740  62048 S 0.399 0.568   0:03.46 2 konsole
>  3845 root       9 -11  791224  20132  14236 S 0.399 0.126   0:03.00 2 pulseaudio
>  3722 root      20   0 3722308 172568  88088 S 0.200 1.081   0:04.35 2 plasmashel

Is this always happening... or do Firefox tasks sometimes get a chance
to run on CPUs other than CPU2?

Could it be that, looking at an htop output, we don't see these small
opportunities?


>  ------------------------------------------------------------------------------------------------------------------------------------
>   Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum     delay ms | Maximum delay at      |
>  ------------------------------------------------------------------------------------------------------------------------------------
>   Web Content:6755      |   2864.862 ms |     7314 | avg:    0.299 ms | max:   40.374 ms | sum: 2189.472 ms | max at:    375.769240 |
>   Compositor:6680       |   1889.847 ms |     4672 | avg:    0.531 ms | max:   29.092 ms | sum: 2478.559 ms | max at:    375.759405 |
>   MediaPl~back #3:(13)  |   3269.777 ms |     7853 | avg:    0.218 ms | max:   19.451 ms | sum: 1711.635 ms | max at:    391.123970 |
>   MediaPl~back #4:(10)  |   1472.986 ms |     8189 | avg:    0.236 ms | max:   18.653 ms | sum: 1933.886 ms | max at:    376.124211 |
>   MediaPl~back #1:(9)   |    601.788 ms |     6598 | avg:    0.247 ms | max:   17.823 ms | sum: 1627.852 ms | max at:    401.122567 |
>   firefox:6620          |    303.181 ms |     6232 | avg:    0.111 ms | max:   15.602 ms | sum:  691.865 ms | max at:    385.078558 |
>   Socket Thread:6639    |    667.537 ms |     4806 | avg:    0.069 ms | max:   12.638 ms | sum:  329.387 ms | max at:    380.827323 |
>   MediaPD~oder #1:6835  |    154.737 ms |     1592 | avg:    0.700 ms | max:   10.139 ms | sum: 1113.688 ms | max at:    392.575370 |
>   MediaTimer #1:6828    |     42.660 ms |     5250 | avg:    0.575 ms | max:    9.845 ms | sum: 3018.994 ms | max at:    380.823677 |
>   MediaPD~oder #2:6840  |    150.822 ms |     1583 | avg:    0.703 ms | max:    9.639 ms | sum: 1112.962 ms | max at:    380.823741 |

How do you get these stats?

It's definitely an interesting use-case; however, I think it's out of
the scope of util_est.

Regarding the specific statement "CFS not all that fair", I would say
that the fairness of CFS is defined, and has to be evaluated, within a
single CPU and on a temporal (not clock cycles) basis.

AFAIK, vruntime is progressed based on elapsed time, thus you can have
two tasks which get the same slice time but consume it at different
frequencies. In this case too we are not that fair, are we?

In the end it all boils down to some (as much as possible)
low-overhead heuristics. Thus, a proper description of a
reproducible use-case can help in improving them.

Can we model your use-case using a simple rt-app configuration?

This would likely help to have a simple and reproducible testing
scenario, to better understand where the issue actually is...
maybe by looking at an execution trace.

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks
  2017-12-15 16:13   ` Patrick Bellasi
@ 2017-12-15 20:23     ` Mike Galbraith
  2017-12-16  6:37       ` Mike Galbraith
  0 siblings, 1 reply; 30+ messages in thread
From: Mike Galbraith @ 2017-12-15 20:23 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot, Paul Turner,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes

On Fri, 2017-12-15 at 16:13 +0000, Patrick Bellasi wrote:
> Hi Mike,
> 
> On 13-Dec 18:56, Mike Galbraith wrote:
> > On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> > > This is a respin of:
> > >    https://lkml.org/lkml/2017/11/9/546
> > > which has been rebased on v4.15-rc2 to have util_est now working on top
> > > of the recent PeterZ's:
> > >    [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> > > 
> > > The aim of this series is to improve some PELT behaviors to make it a
> > > better fit for the scheduling of tasks common in embedded mobile
> > > use-cases, without affecting other classes of workloads.
> > 
> > I thought perhaps this patch set would improve the below behavior, but
> > alas it does not.  That's 3 instances of firefox playing youtube clips
> > being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
> > serializes the threaded desktop, making that threading kinda pointless,
> > and CFS not all that fair.
> 
> Perhaps I don't completely get your use-case.
> Are the cpuhog threads pinned to a CPU, or do they just happen to always
> run on the same CPU?

Nothing is pinned.

> I guess you would expect the three Firefox instances to be spread across
> different CPUs. But whether this is possible also depends on the
> specific task composition generated by Firefox, doesn't it?

It depends on load balancing.  We're letting firefox threads stack up
to 6 deep while single hogs dominate the box.

> Being a video playback pipeline, I would not be surprised to see that
> most of the time we actually have only 1 or 2 tasks RUNNABLE, while
> the others are sleeping... and if an HW decoder is involved, even if
> you have three instances running you likely get only one pipeline
> active at a time...
> 
> If that's the case, why should CFS move Firefox tasks around?

No, while they are indeed ~fairly synchronous, there is overlap.  If
there were not, there would be no wait time being accumulated. The load
wants to consume roughly one full core worth, but to achieve that, it
needs access to more than one runqueue, which we are not facilitating.

> Is this always happening... or do Firefox tasks sometimes get a chance
> to run on CPUs other than CPU2?

There is some escape going on, but not enough for the load to get its
fair share.  I have it sort of fixed up locally, but while patch keeps
changing, it's not getting any prettier, nor is it particularly
interested in letting me keep some performance gains I want, so...

> How do you get these stats?

perf sched record / perf sched lat.  I twiddled it to output accumulated
wait times as well for convenience; stock only shows max.  See below.
If you play with perf sched, you'll notice some... oddities about it.

> It's definitely an interesting use-case; however, I think it's out of
> the scope of util_est.

Yeah.  If I had been less busy and read the whole thing, I wouldn't
have taken it out for a spin.

> Regarding the specific statement "CFS not all that fair", I would say
> that the fairness of CFS is defined, and has to be evaluated, within a
> single CPU and on a temporal (not clock cycles) basis.

No, that doesn't really fly.  In fact, in the group scheduling code, we
actively pursue box wide fairness.  PELT is going a bit too far ATM.

Point: if you think it's OK to serialize these firefox threads, would
you still think so if those were kernel threads instead?  Serializing
your kernel is a clear fail, but unpinned kthreads can be stacked up
just as effectively as those browser threads are, eat needless wakeup
latency and pass it on.

> AFAIK, vruntime is progressed based on elapsed time, thus you can have
> two tasks which get the same slice time but consume it at different
> frequencies. In this case too we are not that fair, are we?

Time slices don't really exist as a concrete quantum in CFS.  There's
vruntime equalization, and that's it.

> In the end it all boils down to some (as much as possible)
> low-overhead heuristics. Thus, a proper description of a
> reproducible use-case can help in improving them.

Nah, heuristics are fickle beasts, they WILL knife you in the back,
it's just a question of how often, and how deep.

> Can we model your use-case using a simple rt-app configuration?

No idea.

> This would likely help to have a simple and reproducible testing
> scenario, to better understand where the issue actually is...
> maybe by looking at an execution trace.

It should be reproducible by anyone, just fire up NR_CPUS-1 pure hogs,
point firefox at youtube, open three clips in tabs, watch tasks stack.

Root cause IMHO is PELT having grown too aggressive.  SIS was made more
aggressive to compensate, but when you slam that door you get the full
PELT impact, and it stings, as does too aggressive bouncing when you
leave the escape hatch open.  Sticky wicket that.  Both of those want a
gentle rap upside the head, as they're both acting a bit nutty.

	-Mike

---
 tools/perf/builtin-sched.c |   34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -212,6 +212,7 @@ struct perf_sched {
 	u64		 run_avg;
 	u64		 all_runtime;
 	u64		 all_count;
+	u64		 all_lat;
 	u64		 cpu_last_switched[MAX_CPUS];
 	struct rb_root	 atom_root, sorted_atom_root, merged_atom_root;
 	struct list_head sort_list, cmp_pid;
@@ -1286,6 +1287,7 @@ static void output_lat_thread(struct per
 
 	sched->all_runtime += work_list->total_runtime;
 	sched->all_count   += work_list->nb_atoms;
+	sched->all_lat += work_list->total_lat;
 
 	if (work_list->num_merged > 1)
 		ret = printf("  %s:(%d) ", thread__comm_str(work_list->thread), work_list->num_merged);
@@ -1298,10 +1300,11 @@ static void output_lat_thread(struct per
 	avg = work_list->total_lat / work_list->nb_atoms;
 	timestamp__scnprintf_usec(work_list->max_lat_at, max_lat_at, sizeof(max_lat_at));
 
-	printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: %13s s\n",
+	printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | sum:%9.3f ms | max at: %13s s\n",
 	      (double)work_list->total_runtime / NSEC_PER_MSEC,
 		 work_list->nb_atoms, (double)avg / NSEC_PER_MSEC,
 		 (double)work_list->max_lat / NSEC_PER_MSEC,
+		 (double)work_list->total_lat / NSEC_PER_MSEC,
 		 max_lat_at);
 }
 
@@ -1347,6 +1350,16 @@ static int max_cmp(struct work_atoms *l,
 	return 0;
 }
 
+static int sum_cmp(struct work_atoms *l, struct work_atoms *r)
+{
+	if (l->total_lat < r->total_lat)
+		return -1;
+	if (l->total_lat > r->total_lat)
+		return 1;
+
+	return 0;
+}
+
 static int switch_cmp(struct work_atoms *l, struct work_atoms *r)
 {
 	if (l->nb_atoms < r->nb_atoms)
@@ -1378,6 +1391,10 @@ static int sort_dimension__add(const cha
 		.name = "max",
 		.cmp  = max_cmp,
 	};
+	static struct sort_dimension sum_sort_dimension = {
+		.name = "sum",
+		.cmp  = sum_cmp,
+	};
 	static struct sort_dimension pid_sort_dimension = {
 		.name = "pid",
 		.cmp  = pid_cmp,
@@ -1394,6 +1411,7 @@ static int sort_dimension__add(const cha
 		&pid_sort_dimension,
 		&avg_sort_dimension,
 		&max_sort_dimension,
+		&sum_sort_dimension,
 		&switch_sort_dimension,
 		&runtime_sort_dimension,
 	};
@@ -3090,9 +3108,9 @@ static int perf_sched__lat(struct perf_s
 	perf_sched__merge_lat(sched);
 	perf_sched__sort_lat(sched);
 
-	printf("\n -----------------------------------------------------------------------------------------------------------------\n");
-	printf("  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |\n");
-	printf(" -----------------------------------------------------------------------------------------------------------------\n");
+	printf("\n ------------------------------------------------------------------------------------------------------------------------------------\n");
+	printf("  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum     delay ms | Maximum delay at       |\n");
+	printf(" ------------------------------------------------------------------------------------------------------------------------------------\n");
 
 	next = rb_first(&sched->sorted_atom_root);
 
@@ -3105,11 +3123,11 @@ static int perf_sched__lat(struct perf_s
 		thread__zput(work_list->thread);
 	}
 
-	printf(" -----------------------------------------------------------------------------------------------------------------\n");
-	printf("  TOTAL:                |%11.3f ms |%9" PRIu64 " |\n",
-		(double)sched->all_runtime / NSEC_PER_MSEC, sched->all_count);
+	printf(" ------------------------------------------------------------------------------------------------------------\n");
+	printf("  TOTAL:                |%11.3f ms |%9" PRIu64 " |                                     |%14.3f ms |\n",
+		(double)sched->all_runtime / NSEC_PER_MSEC, sched->all_count, (double)sched->all_lat / NSEC_PER_MSEC);
 
-	printf(" ---------------------------------------------------\n");
+	printf(" ------------------------------------------------------------------------------------------------------------\n");
 
 	print_bad_events(sched);
 	printf("\n");

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/4] sched/cpufreq_schedutil: use util_est for OPP selection
  2017-12-05 17:10 ` [PATCH v2 4/4] sched/cpufreq_schedutil: use util_est for OPP selection Patrick Bellasi
@ 2017-12-16  2:35   ` Rafael J. Wysocki
  2017-12-18 10:48     ` Patrick Bellasi
  0 siblings, 1 reply; 30+ messages in thread
From: Rafael J. Wysocki @ 2017-12-16  2:35 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot, Paul Turner,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes

On Tuesday, December 5, 2017 6:10:18 PM CET Patrick Bellasi wrote:
> When schedutil looks at the CPU utilization, the current PELT value for
> that CPU is returned straight away. In certain scenarios this can have
> undesired side effects and delays on frequency selection.
> 
> For example, since the task utilization is decayed at wakeup time, a
> long sleeping big task newly enqueued does not add immediately a
> significant contribution to the target CPU. This introduces some latency
> before schedutil will be able to detect the best frequency required by
> that task.
> 
> Moreover, the PELT signal build-up time is function of the current
> frequency, because of the scale invariant load tracking support. Thus,
> starting from a lower frequency, the utilization build-up time will
> increase even more and further delays the selection of the actual
> frequency which better serves the task requirements.
> 
> In order to reduce these kind of latencies, this patch integrates the
> usage of the CPU's estimated utilization in the sugov_get_util function.
> 
> The estimated utilization of a CPU is defined to be the maximum between
> its PELT's utilization and the sum of the estimated utilization of each
> currently RUNNABLE task on that CPU.
> This allows to properly represent the expected utilization of a CPU which,
> for example, has just got a big task running after a long sleep period,
> and ultimately it allows to select the best frequency to run a task
> right after it wakes up.
> 
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Cc: Paul Turner <pjt@google.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
> 
> ---
> Changes v1->v2:
>  - rebase on top of v4.15-rc2
>  - tested that overhauled PELT code does not affect the util_est
> ---
>  kernel/sched/cpufreq_schedutil.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 2f52ec0f1539..465430d99440 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -183,7 +183,11 @@ static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
>  
>  	cfs_max = arch_scale_cpu_capacity(NULL, cpu);
>  
> -	*util = min(rq->cfs.avg.util_avg, cfs_max);
> +	*util = rq->cfs.avg.util_avg;

I would use a local variable here.

That *util everywhere looks a bit dirtyish.

> +	if (sched_feat(UTIL_EST))
> +		*util = max(*util, rq->cfs.util_est_runnable);
> +	*util = min(*util, cfs_max);
> +
>  	*max = cfs_max;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks
  2017-12-15 20:23     ` Mike Galbraith
@ 2017-12-16  6:37       ` Mike Galbraith
  0 siblings, 0 replies; 30+ messages in thread
From: Mike Galbraith @ 2017-12-16  6:37 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot, Paul Turner,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes

On Fri, 2017-12-15 at 21:23 +0100, Mike Galbraith wrote:
> 
> Point: if you think it's OK to serialize these firefox threads, would
> you still think so if those were kernel threads instead?  Serializing
> your kernel is a clear fail, but unpinned kthreads can be stacked up
> just as effectively as those browser threads are, eat needless wakeup
> latency and pass it on.

FWIW, somewhat cheezy example of that below.

(later, /me returns to [apparently endless] squabble w. PELT/SIS;)

bonnie in nfs mount of own box competing with 7 hogs:
 ------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum     delay ms | Maximum delay at      |
 ------------------------------------------------------------------------------------------------------------------------------------
  kworker/3:0:29        |    630.078 ms |    89669 | avg:    0.011 ms | max:  102.340 ms | sum:  962.919 ms | max at:    310.501277 |
  kworker/3:1H:464      |   1179.868 ms |   101944 | avg:    0.005 ms | max:  102.232 ms | sum:  480.915 ms | max at:    310.501273 |
  kswapd0:78            |   2662.230 ms |     1661 | avg:    0.128 ms | max:   93.935 ms | sum:  213.258 ms | max at:    310.503419 |
  nfsd:2039             |   3257.143 ms |    78448 | avg:    0.112 ms | max:   86.039 ms | sum: 8795.767 ms | max at:    258.847140 |
  nfsd:2038             |   3185.730 ms |    76253 | avg:    0.113 ms | max:   78.348 ms | sum: 8580.676 ms | max at:    258.831370 |
  nfsd:2042             |   3256.554 ms |    81423 | avg:    0.110 ms | max:   74.941 ms | sum: 8929.015 ms | max at:    288.397203 |
  nfsd:2040             |   3314.826 ms |    80396 | avg:    0.105 ms | max:   51.039 ms | sum: 8471.816 ms | max at:    363.870078 |
  nfsd:2036             |   3058.867 ms |    70460 | avg:    0.115 ms | max:   44.629 ms | sum: 8092.319 ms | max at:    250.074253 |
  nfsd:2037             |   3113.592 ms |    74276 | avg:    0.115 ms | max:   43.294 ms | sum: 8556.110 ms | max at:    310.443722 |
  konsole:4013          |    402.509 ms |      894 | avg:    0.148 ms | max:   38.129 ms | sum:  132.050 ms | max at:    332.156495 |
  haveged:497           |     11.831 ms |     1224 | avg:    0.104 ms | max:   37.575 ms | sum:  127.706 ms | max at:    350.669645 |
  nfsd:2043             |   3316.033 ms |    78303 | avg:    0.115 ms | max:   36.511 ms | sum: 8995.138 ms | max at:    248.576108 |
  nfsd:2035             |   3064.108 ms |    67413 | avg:    0.115 ms | max:   28.221 ms | sum: 7746.306 ms | max at:    313.785682 |
  bash:7022             |      0.342 ms |        1 | avg:   22.959 ms | max:   22.959 ms | sum:   22.959 ms | max at:    262.258960 |
  kworker/u16:4:354     |   2073.383 ms |     1550 | avg:    0.050 ms | max:   21.203 ms | sum:   77.185 ms | max at:    332.220678 |
  kworker/4:3:6975      |   1189.868 ms |   115776 | avg:    0.018 ms | max:   20.856 ms | sum: 2071.894 ms | max at:    348.142757 |
  kworker/2:4:6981      |    335.895 ms |    26617 | avg:    0.023 ms | max:   20.726 ms | sum:  625.102 ms | max at:    248.522083 |
  bash:7021             |      0.517 ms |        2 | avg:   10.363 ms | max:   20.726 ms | sum:   20.727 ms | max at:    262.235708 |
  ksoftirqd/2:22        |     65.718 ms |      998 | avg:    0.138 ms | max:   19.072 ms | sum:  137.827 ms | max at:    332.221676 |
  kworker/7:3:6969      |    625.724 ms |    84153 | avg:    0.010 ms | max:   18.838 ms | sum:  876.603 ms | max at:    264.188983 |
  bonnie:6965           |  79637.998 ms |    35434 | avg:    0.007 ms | max:   18.719 ms | sum:  256.748 ms | max at:    331.299867 | 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/4] sched/cpufreq_schedutil: use util_est for OPP selection
  2017-12-16  2:35   ` Rafael J. Wysocki
@ 2017-12-18 10:48     ` Patrick Bellasi
  0 siblings, 0 replies; 30+ messages in thread
From: Patrick Bellasi @ 2017-12-18 10:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Viresh Kumar, Vincent Guittot, Paul Turner,
	Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
	Joel Fernandes

Hi Rafael,

On 16-Dec 03:35, Rafael J. Wysocki wrote:
> On Tuesday, December 5, 2017 6:10:18 PM CET Patrick Bellasi wrote:
[...]

> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 2f52ec0f1539..465430d99440 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -183,7 +183,11 @@ static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
> >  
> >  	cfs_max = arch_scale_cpu_capacity(NULL, cpu);
> >  
> > -	*util = min(rq->cfs.avg.util_avg, cfs_max);
> > +	*util = rq->cfs.avg.util_avg;
> 
> I would use a local variable here.
> 
> That *util everywhere looks a bit dirtyish.

Yes, right... will update for the next respin.

> 
> > +	if (sched_feat(UTIL_EST))
> > +		*util = max(*util, rq->cfs.util_est_runnable);
> > +	*util = min(*util, cfs_max);
> > +
> >  	*max = cfs_max;
> >  }
> >  
> > 

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-15 15:41         ` Patrick Bellasi
@ 2017-12-20  8:57           ` Peter Zijlstra
  2017-12-20  9:02             ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-20  8:57 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Fri, Dec 15, 2017 at 03:41:40PM +0000, Patrick Bellasi wrote:
> Close enough, the actual code is:
> 
>         util_est = p->util_est.ewma;
>     5218:       f9403ba3        ldr     x3, [x29,#112]
>     521c:       f9418462        ldr     x2, [x3,#776]
>         if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
>     5220:       eb010040        subs    x0, x2, x1
>     5224:       da805400        cneg    x0, x0, mi
>     5228:       f100281f        cmp     x0, #0xa
>     522c:       54fff9cd        b.le    5164 <dequeue_task_fair+0xa04>

Ah, that cneg instruction is cute; on x86 we end up with something like:

bool abs_test(long s)
{
        return abs(s) < 32;
}

        cmpl    $-31, %eax
        jl      .L107
        movq    -8(%rbp), %rax
        cmpl    $31, %eax
        jg      .L107
        movl    $1, %eax
        jmp     .L108
.L107:
        movl    $0, %eax
.L108:


But I figured you can actually do:

	abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)

Which, if y is a constant, should result in nicer code, and it does for
x86:

        addq    $31, %rax
        cmpq    $62, %rax
        setbe   %al
        movzbl  %al, %eax

Just not measurably faster, I suppose because of all the dependencies :/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/4] sched/fair: add util_est on top of PELT
  2017-12-20  8:57           ` Peter Zijlstra
@ 2017-12-20  9:02             ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2017-12-20  9:02 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, Ingo Molnar, Rafael J . Wysocki,
	Viresh Kumar, Vincent Guittot, Paul Turner, Dietmar Eggemann,
	Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes

On Wed, Dec 20, 2017 at 09:57:47AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 03:41:40PM +0000, Patrick Bellasi wrote:
> > Close enough, the actual code is:
> > 
> >         util_est = p->util_est.ewma;
> >     5218:       f9403ba3        ldr     x3, [x29,#112]
> >     521c:       f9418462        ldr     x2, [x3,#776]
> >         if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> >     5220:       eb010040        subs    x0, x2, x1
> >     5224:       da805400        cneg    x0, x0, mi
> >     5228:       f100281f        cmp     x0, #0xa
> >     522c:       54fff9cd        b.le    5164 <dequeue_task_fair+0xa04>
> 
> Ah, that cneg instruction is cute; on x86 we end up with something like:
> 
> bool abs_test(long s)
> {
>         return abs(s) < 32;
> }
> 
>         cmpl    $-31, %eax
>         jl      .L107
>         movq    -8(%rbp), %rax
>         cmpl    $31, %eax
>         jg      .L107
>         movl    $1, %eax
>         jmp     .L108
> .L107:
>         movl    $0, %eax
> .L108:
> 
> 
> But I figured you can actually do:
> 
> 	abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
> 
> Which, if y is a constant, should result in nicer code, and it does for
> x86:
> 
>         addq    $31, %rax
>         cmpq    $62, %rax
>         setbe   %al
>         movzbl  %al, %eax
> 
> Just not measurably faster, I suppose because of all the dependencies :/

Ah no, it actually is; I'm an idiot and used 'long' for the return value.
If I use bool we lose that last movzbl and we go from around 4.0 cycles
down to 3.4 cycles.
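
For reference, a self-contained sketch of that rewrite, with a brute-force
check of the identity (it assumes x + y - 1 does not overflow):

#include <assert.h>
#include <stdbool.h>

/*
 * abs(x) < y  <=>  -(y-1) <= x <= y-1
 *             <=>  0 <= x + y - 1 <= 2*y - 2
 *             <=>  (unsigned long)(x + y - 1) < 2*y - 1
 */
static bool abs_lt(long x, long y)
{
	return (unsigned long)(x + y - 1) < (unsigned long)(2 * y - 1);
}

int main(void)
{
	for (long x = -1024; x <= 1024; x++)
		assert(abs_lt(x, 32) == (x > -32 && x < 32));
	return 0;
}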

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [tip:sched/core] sched/fair: Use 'unsigned long' for utilization, consistently
  2017-12-05 17:10 ` [PATCH v2 1/4] sched/fair: always used unsigned long for utilization Patrick Bellasi
  2017-12-06  8:56   ` Vincent Guittot
@ 2018-01-10 12:14   ` tip-bot for Patrick Bellasi
  1 sibling, 0 replies; 30+ messages in thread
From: tip-bot for Patrick Bellasi @ 2018-01-10 12:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dietmar.eggemann, tkjos, patrick.bellasi, torvalds, tglx,
	juri.lelli, pjt, vincent.guittot, mingo, morten.rasmussen,
	chris.redpath, rafael.j.wysocki, joelaf, peterz, brendan.jackman,
	hpa, linux-kernel, viresh.kumar

Commit-ID:  f01415fdbfe83380c2dfcf90b7b26042f88963aa
Gitweb:     https://git.kernel.org/tip/f01415fdbfe83380c2dfcf90b7b26042f88963aa
Author:     Patrick Bellasi <patrick.bellasi@arm.com>
AuthorDate: Tue, 5 Dec 2017 17:10:15 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 10 Jan 2018 11:30:28 +0100

sched/fair: Use 'unsigned long' for utilization, consistently

Utilization and capacity are tracked as 'unsigned long', however some
functions using them return an 'int' which is ultimately assigned back to
'unsigned long' variables.

Since there is no reason to use a different, signed type, consolidate
the signatures of the functions returning utilization so they always
use the native type.

This change improves code consistency, and it also benefits
code paths where utilizations should be clamped by avoiding
further type conversions or ugly type casts.
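
For illustration only (hypothetical helpers, not the kernel functions
touched by this patch), the difference amounts to:

/* old style: computed as 'unsigned long', returned as 'int' */
static int cpu_util_old(unsigned long util, unsigned long capacity)
{
	return (util >= capacity) ? capacity : util;
}

/* new style: the native type is kept end to end, so assigning the
 * result back to an 'unsigned long' or clamping it needs no casts */
static unsigned long cpu_util_new(unsigned long util, unsigned long capacity)
{
	return (util >= capacity) ? capacity : util;
}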

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2915c0d..de43bd8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5765,8 +5765,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return affine;
 }
 
-static inline int task_util(struct task_struct *p);
-static int cpu_util_wake(int cpu, struct task_struct *p);
+static inline unsigned long task_util(struct task_struct *p);
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
@@ -6247,7 +6247,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
  * capacity_orig) as it useful for predicting the capacity required after task
  * migrations (scheduler-driven DVFS).
  */
-static int cpu_util(int cpu)
+static unsigned long cpu_util(int cpu)
 {
 	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
@@ -6255,7 +6255,7 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
-static inline int task_util(struct task_struct *p)
+static inline unsigned long task_util(struct task_struct *p)
 {
 	return p->se.avg.util_avg;
 }
@@ -6264,7 +6264,7 @@ static inline int task_util(struct task_struct *p)
  * cpu_util_wake: Compute cpu utilization with any contributions from
  * the waking task p removed.
  */
-static int cpu_util_wake(int cpu, struct task_struct *p)
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
 {
 	unsigned long util, capacity;
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2018-01-10 12:19 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-05 17:10 [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Patrick Bellasi
2017-12-05 17:10 ` [PATCH v2 1/4] sched/fair: always used unsigned long for utilization Patrick Bellasi
2017-12-06  8:56   ` Vincent Guittot
2018-01-10 12:14   ` [tip:sched/core] sched/fair: Use 'unsigned long' for utilization, consistently tip-bot for Patrick Bellasi
2017-12-05 17:10 ` [PATCH v2 2/4] sched/fair: add util_est on top of PELT Patrick Bellasi
2017-12-13 16:05   ` Peter Zijlstra
2017-12-15 14:02     ` Patrick Bellasi
2017-12-15 14:07       ` Peter Zijlstra
2017-12-15 15:22         ` Patrick Bellasi
2017-12-13 16:16   ` Peter Zijlstra
2017-12-15 12:14     ` Patrick Bellasi
2017-12-15 12:53       ` Peter Zijlstra
2017-12-15 15:41         ` Patrick Bellasi
2017-12-20  8:57           ` Peter Zijlstra
2017-12-20  9:02             ` Peter Zijlstra
2017-12-13 16:19   ` Peter Zijlstra
2017-12-13 16:36     ` Patrick Bellasi
2017-12-13 17:03       ` Peter Zijlstra
2017-12-15 12:03         ` Patrick Bellasi
2017-12-15 12:58           ` Peter Zijlstra
2017-12-05 17:10 ` [PATCH v2 3/4] sched/fair: use util_est in LB and WU paths Patrick Bellasi
2017-12-05 17:10 ` [PATCH v2 4/4] sched/cpufreq_schedutil: use util_est for OPP selection Patrick Bellasi
2017-12-16  2:35   ` Rafael J. Wysocki
2017-12-18 10:48     ` Patrick Bellasi
2017-12-13 16:03 ` [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks Peter Zijlstra
2017-12-13 16:23   ` Patrick Bellasi
2017-12-13 17:56 ` Mike Galbraith
2017-12-15 16:13   ` Patrick Bellasi
2017-12-15 20:23     ` Mike Galbraith
2017-12-16  6:37       ` Mike Galbraith
