linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
@ 2019-06-27 17:15 Douglas RAILLARD
  2019-06-27 17:15 ` [RFC PATCH v2 1/5] PM: Introduce em_pd_get_higher_freq() Douglas RAILLARD
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Douglas RAILLARD @ 2019-06-27 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, mingo, peterz, rjw, viresh.kumar, quentin.perret,
	douglas.raillard, patrick.bellasi, dietmar.eggemann

Make schedutil cpufreq governor energy-aware.

- patch 1 introduces a function to retrieve a frequency given a base
  frequency and an energy cost margin.
- patch 2 links Energy Model perf_domain to sugov_policy.
- patch 3 updates get_next_freq() to make use of the Energy Model.
- patch 4 adds sugov_cpu_ramp_boost() function.
- patch 5 updates sugov_update_(single|shared)() to make use of
  sugov_cpu_ramp_boost().

The benefits of using the EM in schedutil are twofold:

1) Selecting the highest possible frequency for a given cost. Some
   platforms can have lower frequencies that are less efficient than
   higher ones, in which case they should be skipped for most purposes.
   They can still be useful to give more freedom to thermal throttling
   mechanisms, but not under normal circumstances (see the worked
   example after this list).
   Note: the EM framework will warn about such OPPs with "hertz/watts
   ratio non-monotonically decreasing".

2) Driving the frequency selection with power in mind, in addition to
   maximizing the utilization of the non-idle CPUs in the system.
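As an illustration of point 1), take a hypothetical performance domain
(numbers made up for the example; in the EM, cost = power * max_freq / freq):

    freq (MHz)   power (mW)   cost
       500          150        270
       700          180        231
       900          300        300

Any work done at 500 MHz can be done both faster and cheaper at 700 MHz.
With min_freq = 500 MHz and a zero cost margin, the allowed cost is 270,
and the highest frequency fitting in it is 700 MHz, so the inefficient
500 MHz OPP is skipped at no extra energy cost.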

Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
enabled in schedutil by
"sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".

Point 2) is enabled in
"sched/cpufreq: Boost schedutil frequency ramp up". It allows using
higher frequencies when it is known that the true utilization of
currently running tasks exceeds their previous stable point.
The benefits are:

* Boosting the frequency when the behavior of a runnable task changes,
  leading to an increase in utilization. That shortens the frequency
  ramp-up duration, which in turn allows the utilization signal to
  reach stable values quicker.  Since the allowed frequency boost is
  bounded in energy, it will behave consistently across platforms,
  regardless of the OPP cost range.

* The boost is only transient, and should not significantly impact the
  energy consumed by workloads with very stable utilization signals.

This has been lightly tested with an rt-app task ramping from 10% to 75%
utilization on a big core. Results are improved by fast ramp-up
EWMA [1], since it greatly reduces the oscillation in frequency at first
idle when ramping up.

[1] [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases
    Message-ID: <20190620150555.15717-1-patrick.bellasi@arm.com>
    https://lore.kernel.org/lkml/20190620150555.15717-1-patrick.bellasi@arm.com/


v1 -> v2:

  * Split the new sugov_cpu_ramp_boost() out of the existing
    sugov_cpu_is_busy(), as they pursue different goals.

  * Implement sugov_cpu_ramp_boost() based on the CFS util_avg and
    util_est_enqueued signals, rather than using the idle calls count.
    This makes the ramp boost much more accurate in finding boost
    opportunities, and gives a "continuous" output rather than a boolean.

  * Add EM_COST_MARGIN_SCALE=1024 to represent the margin values of
    em_pd_get_higher_freq(); e.g. a cost_margin of 256 allows a cost
    increase of up to 25%.


Douglas RAILLARD (5):
  PM: Introduce em_pd_get_higher_freq()
  sched/cpufreq: Attach perf domain to sugov policy
  sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()
  sched/cpufreq: Introduce sugov_cpu_ramp_boost
  sched/cpufreq: Boost schedutil frequency ramp up

 include/linux/energy_model.h     |  53 ++++++++++++++++
 kernel/sched/cpufreq_schedutil.c | 106 ++++++++++++++++++++++++++++++-
 2 files changed, 156 insertions(+), 3 deletions(-)

-- 
2.22.0



* [RFC PATCH v2 1/5] PM: Introduce em_pd_get_higher_freq()
  2019-06-27 17:15 [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Douglas RAILLARD
@ 2019-06-27 17:15 ` Douglas RAILLARD
  2019-06-27 17:16 ` [RFC PATCH v2 2/5] sched/cpufreq: Attach perf domain to sugov policy Douglas RAILLARD
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Douglas RAILLARD @ 2019-06-27 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, mingo, peterz, rjw, viresh.kumar, quentin.perret,
	douglas.raillard, patrick.bellasi, dietmar.eggemann

em_pd_get_higher_freq() returns a frequency greater than or equal to the
provided one, while taking into account a given cost margin. It also
skips inefficient OPPs whose cost is higher than that of an OPP with a
higher frequency.

The efficiency of an OPP is measured as efficiency = capacity/power.
OPPs with the same efficiency are assumed to be equivalent, since they
consume the same energy for a given amount of work. Completing that work
may take more or less time depending on the frequency, but the energy
consumed is the same.
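
As an illustration with hypothetical values (frequencies in kHz, as in
cpufreq tables), a caller ready to spend ~10% more energy on top of the
lowest OPP running at least 1 GHz could use:

	/*
	 * 102 =~ 10% of EM_COST_MARGIN_SCALE (1024). The returned
	 * frequency is at least 1000000 kHz, and its cost exceeds that
	 * of the lowest OPP >= 1 GHz by at most ~10%.
	 */
	unsigned long freq = em_pd_get_higher_freq(pd, 1000000, 102);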

Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
---
 include/linux/energy_model.h | 53 ++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index aa027f7bcb3e..cc9819967f8d 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -159,6 +159,53 @@ static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
 	return pd->nr_cap_states;
 }
 
+#define EM_COST_MARGIN_SCALE 1024U
+
+/**
+ * em_pd_get_higher_freq() - Get the highest frequency that does not exceed the
+ * given cost margin compared to min_freq
+ * @pd		: performance domain for which this must be done
+ * @min_freq	: minimum frequency to return
+ * @cost_margin	: allowed margin compared to min_freq, on the
+ *		  EM_COST_MARGIN_SCALE scale.
+ *
+ * Return: the chosen frequency, guaranteed to be at least as high as min_freq.
+ */
+static inline unsigned long em_pd_get_higher_freq(struct em_perf_domain *pd,
+	unsigned long min_freq, unsigned long cost_margin)
+{
+	unsigned long max_cost = 0;
+	struct em_cap_state *cs;
+	int i;
+
+	if (!pd)
+		return min_freq;
+
+	/* Compute the maximum allowed cost */
+	for (i = 0; i < pd->nr_cap_states; i++) {
+		cs = &pd->table[i];
+		if (cs->frequency >= min_freq) {
+			max_cost = cs->cost +
+				(cs->cost * cost_margin) / EM_COST_MARGIN_SCALE;
+			break;
+		}
+	}
+
+	/* Find the highest frequency that will not exceed the cost margin */
+	for (i = pd->nr_cap_states - 1; i >= 0; i--) {
+		cs = &pd->table[i];
+		if (cs->cost <= max_cost)
+			return cs->frequency;
+	}
+
+	/*
+	 * We should normally never reach here, unless min_freq was higher than
+	 * the highest available frequency, which is not expected to happen.
+	 */
+	return min_freq;
+}
+
+
 #else
 struct em_perf_domain {};
 struct em_data_callback {};
@@ -182,6 +229,12 @@ static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
 {
 	return 0;
 }
+
+static inline unsigned long em_pd_get_higher_freq(struct em_perf_domain *pd,
+	unsigned long min_freq, unsigned long cost_margin)
+{
+	return min_freq;
+}
 #endif
 
 #endif
-- 
2.22.0



* [RFC PATCH v2 2/5] sched/cpufreq: Attach perf domain to sugov policy
  2019-06-27 17:15 [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Douglas RAILLARD
  2019-06-27 17:15 ` [RFC PATCH v2 1/5] PM: Introduce em_pd_get_higher_freq() Douglas RAILLARD
@ 2019-06-27 17:16 ` Douglas RAILLARD
  2019-06-27 17:16 ` [RFC PATCH v2 3/5] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq() Douglas RAILLARD
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Douglas RAILLARD @ 2019-06-27 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, mingo, peterz, rjw, viresh.kumar, quentin.perret,
	douglas.raillard, patrick.bellasi, dietmar.eggemann

Attach an Energy Model perf_domain to each sugov_policy to prepare the
ground for energy-aware schedutil.

Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 39 ++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 9c0419087260..0a3ccc20adeb 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -41,6 +41,10 @@ struct sugov_policy {
 	bool			work_in_progress;
 
 	bool			need_freq_update;
+
+#ifdef CONFIG_ENERGY_MODEL
+	struct em_perf_domain *pd;
+#endif
 };
 
 struct sugov_cpu {
@@ -65,6 +69,38 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
 
 /************************ Governor internals ***********************/
 
+#ifdef CONFIG_ENERGY_MODEL
+static void sugov_policy_attach_pd(struct sugov_policy *sg_policy)
+{
+	struct em_perf_domain *pd;
+	struct cpufreq_policy *policy = sg_policy->policy;
+
+	sg_policy->pd = NULL;
+	pd = em_cpu_get(policy->cpu);
+	if (!pd)
+		return;
+
+	if (cpumask_equal(policy->related_cpus, to_cpumask(pd->cpus)))
+		sg_policy->pd = pd;
+	else
+		pr_warn("%s: Not all CPUs in schedutil policy %u share the same perf domain, no perf domain for that policy will be registered\n",
+			__func__, policy->cpu);
+}
+
+static struct em_perf_domain *sugov_policy_get_pd(
+						struct sugov_policy *sg_policy)
+{
+	return sg_policy->pd;
+}
+#else /* CONFIG_ENERGY_MODEL */
+static void sugov_policy_attach_pd(struct sugov_policy *sg_policy) {}
+static struct em_perf_domain *sugov_policy_get_pd(
+						struct sugov_policy *sg_policy)
+{
+	return NULL;
+}
+#endif /* CONFIG_ENERGY_MODEL */
+
 static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 {
 	s64 delta_ns;
@@ -850,6 +886,9 @@ static int sugov_start(struct cpufreq_policy *policy)
 							sugov_update_shared :
 							sugov_update_single);
 	}
+
+	sugov_policy_attach_pd(sg_policy);
+
 	return 0;
 }
 
-- 
2.22.0



* [RFC PATCH v2 3/5] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()
  2019-06-27 17:15 [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Douglas RAILLARD
  2019-06-27 17:15 ` [RFC PATCH v2 1/5] PM: Introduce em_pd_get_higher_freq() Douglas RAILLARD
  2019-06-27 17:16 ` [RFC PATCH v2 2/5] sched/cpufreq: Attach perf domain to sugov policy Douglas RAILLARD
@ 2019-06-27 17:16 ` Douglas RAILLARD
  2019-06-27 17:16 ` [RFC PATCH v2 4/5] sched/cpufreq: Introduce sugov_cpu_ramp_boost Douglas RAILLARD
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Douglas RAILLARD @ 2019-06-27 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, mingo, peterz, rjw, viresh.kumar, quentin.perret,
	douglas.raillard, patrick.bellasi, dietmar.eggemann

Choose the highest OPP for a given energy cost, which allows skipping
lower frequencies that would not be cheaper in terms of consumed power.
These frequencies can still be worth keeping in the energy model to give
more freedom to thermal throttling, but they should not be selected under
normal circumstances.

This also prepares the ground for energy-aware frequency boosting.

Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 0a3ccc20adeb..7ffc6fe3b670 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -10,6 +10,7 @@
 
 #include "sched.h"
 
+#include <linux/energy_model.h>
 #include <linux/sched/cpufreq.h>
 #include <trace/events/power.h>
 
@@ -201,9 +202,16 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
+	struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);
 
 	freq = map_util_freq(util, freq, max);
 
+	/*
+	 * Try to get a higher frequency if one is available, given the extra
+	 * power we are ready to spend.
+	 */
+	freq = em_pd_get_higher_freq(pd, freq, 0);
+
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
 		return sg_policy->next_freq;
 
-- 
2.22.0



* [RFC PATCH v2 4/5] sched/cpufreq: Introduce sugov_cpu_ramp_boost
  2019-06-27 17:15 [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Douglas RAILLARD
                   ` (2 preceding siblings ...)
  2019-06-27 17:16 ` [RFC PATCH v2 3/5] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq() Douglas RAILLARD
@ 2019-06-27 17:16 ` Douglas RAILLARD
  2019-06-28 15:08   ` Patrick Bellasi
  2019-06-27 17:16 ` [RFC PATCH v2 5/5] sched/cpufreq: Boost schedutil frequency ramp up Douglas RAILLARD
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Douglas RAILLARD @ 2019-06-27 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, mingo, peterz, rjw, viresh.kumar, quentin.perret,
	douglas.raillard, patrick.bellasi, dietmar.eggemann

Use the dynamics of the utilization signals to detect when the
utilization of a set of tasks starts increasing because of a change in
tasks' behavior. This allows detecting when spending extra power for
faster frequency ramp up response would be beneficial to the reactivity
of the system.

This ramp boost is computed as the difference
util_avg - util_est_enqueued. This number somehow represents a lower
bound of how much extra utilization these tasks are actually using,
compared to our best current stable knowledge of it (which is
util_est_enqueued).

When the set of runnable tasks changes, the boost is disabled, as the
impact of blocked utilization on util_avg will make the delta with
util_est_enqueued not very informative.
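
As an example with made-up numbers: a task stable at ~30% utilization
leaves util_est_enqueued at 307 on its CPU. If the task behavior changes
to longer activations, util_avg grows past that value while
util_est_enqueued stays put until the next dequeue:

	util_est_enqueued = 307  (same as at the last update, i.e. stable)
	util_avg          = 400
	ramp_boost        = 400 - 307 = 93

Any enqueue/dequeue moves util_est_enqueued, which resets the boost to 0
until the signal is stable again.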

Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 42 ++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7ffc6fe3b670..3eabfd815195 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -60,6 +60,9 @@ struct sugov_cpu {
 	unsigned long		bw_dl;
 	unsigned long		max;
 
+	unsigned long		ramp_boost;
+	unsigned long		util_est_enqueued;
+
 	/* The field below is for single-CPU policies only: */
 #ifdef CONFIG_NO_HZ_COMMON
 	unsigned long		saved_idle_calls;
@@ -174,6 +177,41 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
 	}
 }
 
+static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
+{
+	return READ_ONCE(sg_cpu->ramp_boost);
+}
+
+static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu,
+						 unsigned long util)
+{
+	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	unsigned long util_est_enqueued;
+	unsigned long util_avg;
+	unsigned long boost = 0;
+
+	util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);
+	util_avg = READ_ONCE(rq->cfs.avg.util_avg);
+
+	/*
+	 * Boost when util_avg becomes higher than the previous stable
+	 * knowledge of the enqueued tasks' set util, which is CPU's
+	 * util_est_enqueued.
+	 *
+	 * We try to spot changes in the workload itself, so we want to
+	 * avoid the noise of tasks being enqueued/dequeued. To do that,
+	 * we only trigger boosting when the "amount of work" enqueued
+	 * is stable.
+	 */
+	if (util_est_enqueued == sg_cpu->util_est_enqueued
+	    && util_avg > util_est_enqueued)
+		 boost = util_avg - util_est_enqueued;
+
+	sg_cpu->util_est_enqueued = util_est_enqueued;
+	WRITE_ONCE(sg_cpu->ramp_boost, boost);
+	return boost;
+}
+
 /**
  * get_next_freq - Compute a new frequency for a given cpufreq policy.
  * @sg_policy: schedutil policy object to compute the new frequency for.
@@ -504,6 +542,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	busy = sugov_cpu_is_busy(sg_cpu);
 
 	util = sugov_get_util(sg_cpu);
+	sugov_cpu_ramp_boost_update(sg_cpu, util);
 	max = sg_cpu->max;
 	util = sugov_iowait_apply(sg_cpu, time, util, max);
 	next_f = get_next_freq(sg_policy, util, max);
@@ -544,6 +583,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 		unsigned long j_util, j_max;
 
 		j_util = sugov_get_util(j_sg_cpu);
+		if (j_sg_cpu == sg_cpu)
+			sugov_cpu_ramp_boost_update(sg_cpu, j_util);
 		j_max = j_sg_cpu->max;
 		j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);
 
@@ -553,6 +594,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 		}
 	}
 
+
 	return get_next_freq(sg_policy, util, max);
 }
 
-- 
2.22.0



* [RFC PATCH v2 5/5] sched/cpufreq: Boost schedutil frequency ramp up
  2019-06-27 17:15 [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Douglas RAILLARD
                   ` (3 preceding siblings ...)
  2019-06-27 17:16 ` [RFC PATCH v2 4/5] sched/cpufreq: Introduce sugov_cpu_ramp_boost Douglas RAILLARD
@ 2019-06-27 17:16 ` Douglas RAILLARD
  2019-07-02 15:44 ` [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Peter Zijlstra
  2019-07-02 15:51 ` Peter Zijlstra
  6 siblings, 0 replies; 17+ messages in thread
From: Douglas RAILLARD @ 2019-06-27 17:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, mingo, peterz, rjw, viresh.kumar, quentin.perret,
	douglas.raillard, patrick.bellasi, dietmar.eggemann

In some situations, it can be worthwhile to temporarily spend more
power if that buys a useful frequency boost.

Use the new sugov_cpu_ramp_boost() function to drive an energy-aware
boost, on top of the minimal required frequency.

As that boost number is not accurate (and cannot be, without a crystal
ball), we only use it in a way that allows direct control over the power
it is going to cost. This allows keeping a platform-independent level of
control over the average power, while allowing for frequency bursts when
we know a (set of) tasks can make use of it.

In shared policies, the maximum of all CPUs' boosts is used. Since the
extra power expenditure is bounded, it cannot skyrocket even on
platforms with a large number of cores in the same frequency domain
and/or a very high ratio between the lowest and highest OPP cost.
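
For instance, with illustrative per-CPU boosts of {0, 93, 20, 0} in a
4-CPU policy:

	boost(policy)   = max(0, 93, 20, 0) = 93
	acceptable cost = cost(base_freq) * (1024 + 93) / 1024
	                ~ cost(base_freq) * 1.09

i.e. the allowed extra cost is that of the single most boosted CPU, not
the sum over all CPUs of the policy.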

Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3eabfd815195..d70bbbeaa5cf 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -217,6 +217,9 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu,
  * @sg_policy: schedutil policy object to compute the new frequency for.
  * @util: Current CPU utilization.
  * @max: CPU capacity.
+ * @boost: Extra power that can be spent on top of the minimum amount of power
+ *	required to meet capacity requirements, on the EM_COST_MARGIN_SCALE
+ *	scale.
  *
  * If the utilization is frequency-invariant, choose the new frequency to be
  * proportional to it, that is
@@ -235,7 +238,8 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu,
  * cpufreq driver limitations.
  */
 static unsigned int get_next_freq(struct sugov_policy *sg_policy,
-				  unsigned long util, unsigned long max)
+				  unsigned long util, unsigned long max,
+				  unsigned long boost)
 {
 	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned int freq = arch_scale_freq_invariant() ?
@@ -248,7 +252,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	 * Try to get a higher frequency if one is available, given the extra
 	 * power we are ready to spend.
 	 */
-	freq = em_pd_get_higher_freq(pd, freq, 0);
+	freq = em_pd_get_higher_freq(pd, freq, boost);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
 		return sg_policy->next_freq;
@@ -530,6 +534,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	unsigned long util, max;
 	unsigned int next_f;
 	bool busy;
+	unsigned long ramp_boost = 0;
 
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
@@ -542,10 +547,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	busy = sugov_cpu_is_busy(sg_cpu);
 
 	util = sugov_get_util(sg_cpu);
-	sugov_cpu_ramp_boost_update(sg_cpu, util);
+	ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu, util);
 	max = sg_cpu->max;
 	util = sugov_iowait_apply(sg_cpu, time, util, max);
-	next_f = get_next_freq(sg_policy, util, max);
+	next_f = get_next_freq(sg_policy, util, max, ramp_boost);
 	/*
 	 * Do not reduce the frequency if the CPU has not been idle
 	 * recently, as the reduction is likely to be premature then.
@@ -577,6 +582,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned long util = 0, max = 1;
 	unsigned int j;
+	unsigned long ramp_boost = 0;
+	unsigned long j_ramp_boost = 0;
 
 	for_each_cpu(j, policy->cpus) {
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
@@ -584,7 +591,11 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 
 		j_util = sugov_get_util(j_sg_cpu);
 		if (j_sg_cpu == sg_cpu)
-			sugov_cpu_ramp_boost_update(sg_cpu, j_util);
+			j_ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu, j_util);
+		else
+			j_ramp_boost = sugov_cpu_ramp_boost(j_sg_cpu);
+		ramp_boost = max(ramp_boost, j_ramp_boost);
+
 		j_max = j_sg_cpu->max;
 		j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);
 
@@ -595,7 +606,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 	}
 
 
-	return get_next_freq(sg_policy, util, max);
+	return get_next_freq(sg_policy, util, max, ramp_boost);
 }
 
 static void
-- 
2.22.0



* Re: [RFC PATCH v2 4/5] sched/cpufreq: Introduce sugov_cpu_ramp_boost
  2019-06-27 17:16 ` [RFC PATCH v2 4/5] sched/cpufreq: Introduce sugov_cpu_ramp_boost Douglas RAILLARD
@ 2019-06-28 15:08   ` Patrick Bellasi
  0 siblings, 0 replies; 17+ messages in thread
From: Patrick Bellasi @ 2019-06-28 15:08 UTC (permalink / raw)
  To: Douglas RAILLARD
  Cc: linux-kernel, linux-pm, mingo, peterz, rjw, viresh.kumar,
	quentin.perret, dietmar.eggemann

Hi Douglas,

On 27-Jun 18:16, Douglas RAILLARD wrote:
> Use the dynamics of the utilization signals to detect when the
> utilization of a set of tasks starts increasing because of a change in
> tasks' behavior. This allows detecting when spending extra power for
> faster frequency ramp up response would be beneficial to the reactivity
> of the system.
> 
> This ramp boost is computed as the difference
> util_avg - util_est_enqueued. This number somehow represents a lower
> bound of how much extra utilization these tasks are actually using,
> compared to our best current stable knowledge of it (which is
> util_est_enqueued).

Maybe it's worth calling out here that at rq-level we don't have an
EWMA. However, the enqueued estimated utilization is derived from
_task_util_est(), which factors in the moving average of tasks and
thus makes the signal more stable even in case of tasks switching
between big and small activations.

> When the set of runnable tasks changes, the boost is disabled as the
> impact of blocked utilization on util_avg will make the delta with
> util_est_enqueued not very informative.
> 
> Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
> ---
>  kernel/sched/cpufreq_schedutil.c | 42 ++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 7ffc6fe3b670..3eabfd815195 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -60,6 +60,9 @@ struct sugov_cpu {
>  	unsigned long		bw_dl;
>  	unsigned long		max;
>  
> +	unsigned long		ramp_boost;
> +	unsigned long		util_est_enqueued;
> +
>  	/* The field below is for single-CPU policies only: */
>  #ifdef CONFIG_NO_HZ_COMMON
>  	unsigned long		saved_idle_calls;
> @@ -174,6 +177,41 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
>  	}
>  }
>  
> +static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
> +{
> +	return READ_ONCE(sg_cpu->ramp_boost);
> +}
> +
> +static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu,
> +						 unsigned long util)
> +{
> +	struct rq *rq = cpu_rq(sg_cpu->cpu);

Since you don't really need the rq below, maybe better:

        struct sched_avg *sa = &cpu_rq(sg_cpu->cpu)->cfs.avg;

?

> +	unsigned long util_est_enqueued;
> +	unsigned long util_avg;
> +	unsigned long boost = 0;
> +
> +	util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);
> +	util_avg = READ_ONCE(rq->cfs.avg.util_avg);
> +
> +	/*
> +	 * Boost when util_avg becomes higher than the previous stable
> +	 * knowledge of the enqueued tasks' set util, which is CPU's
> +	 * util_est_enqueued.
> +	 *
> +	 * We try to spot changes in the workload itself, so we want to
> +	 * avoid the noise of tasks being enqueued/dequeued. To do that,
> +	 * we only trigger boosting when the "amount of work" enqueued
> +	 * is stable.
> +	 */
> +	if (util_est_enqueued == sg_cpu->util_est_enqueued
> +	    && util_avg > util_est_enqueued)
> +		 boost = util_avg - util_est_enqueued;

The above should be:


 	if (util_est_enqueued == sg_cpu->util_est_enqueued &&
 	    util_avg > util_est_enqueued) {
 		boost = util_avg - util_est_enqueued;
 	}

but perhaps you can also go for a fast bailout with something like:

        if (util_avg <= util_est_enqueued)
                return 0;
        if (util_est_enqueued == sg_cpu->util_est_enqueued)
                boost = util_avg - util_est_enqueued;


Moreover: could it make sense to add a threshold on a minimal boost
value to return non-zero?

> +
> +	sg_cpu->util_est_enqueued = util_est_enqueued;
> +	WRITE_ONCE(sg_cpu->ramp_boost, boost);
> +	return boost;

You don't seem to use this returned value: should be void?

> +}
> +
>  /**
>   * get_next_freq - Compute a new frequency for a given cpufreq policy.
>   * @sg_policy: schedutil policy object to compute the new frequency for.
> @@ -504,6 +542,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>  	busy = sugov_cpu_is_busy(sg_cpu);
>  
>  	util = sugov_get_util(sg_cpu);
> +	sugov_cpu_ramp_boost_update(sg_cpu, util);
>  	max = sg_cpu->max;
>  	util = sugov_iowait_apply(sg_cpu, time, util, max);
>  	next_f = get_next_freq(sg_policy, util, max);
> @@ -544,6 +583,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>  		unsigned long j_util, j_max;
>  
>  		j_util = sugov_get_util(j_sg_cpu);
> +		if (j_sg_cpu == sg_cpu)
> +			sugov_cpu_ramp_boost_update(sg_cpu, j_util);
>  		j_max = j_sg_cpu->max;
>  		j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);
>  
> @@ -553,6 +594,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>  		}
>  	}
>  
> +
>  	return get_next_freq(sg_policy, util, max);
>  }


Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-06-27 17:15 [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Douglas RAILLARD
                   ` (4 preceding siblings ...)
  2019-06-27 17:16 ` [RFC PATCH v2 5/5] sched/cpufreq: Boost schedutil frequency ramp up Douglas RAILLARD
@ 2019-07-02 15:44 ` Peter Zijlstra
  2019-07-03 13:38   ` Douglas Raillard
  2019-07-02 15:51 ` Peter Zijlstra
  6 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2019-07-02 15:44 UTC (permalink / raw)
  To: Douglas RAILLARD
  Cc: linux-kernel, linux-pm, mingo, rjw, viresh.kumar, quentin.perret,
	patrick.bellasi, dietmar.eggemann

On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:
> Make schedutil cpufreq governor energy-aware.
> 
> - patch 1 introduces a function to retrieve a frequency given a base
>   frequency and an energy cost margin.
> - patch 2 links Energy Model perf_domain to sugov_policy.
> - patch 3 updates get_next_freq() to make use of the Energy Model.

> 
> 1) Selecting the highest possible frequency for a given cost. Some
>    platforms can have lower frequencies that are less efficient than
>    higher ones, in which case they should be skipped for most purposes.
>    They can still be useful to give more freedom to thermal throttling
>    mechanisms, but not under normal circumstances.
>    note: the EM framework will warn about such OPPs "hertz/watts ratio
>    non-monotonically decreasing"

Humm, for some reason I was thinking we explicitly skipped those OPPs
and they already weren't used.

This isn't in fact so, and these first few patches make it so?


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-06-27 17:15 [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Douglas RAILLARD
                   ` (5 preceding siblings ...)
  2019-07-02 15:44 ` [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Peter Zijlstra
@ 2019-07-02 15:51 ` Peter Zijlstra
  2019-07-03 16:36   ` Douglas Raillard
  6 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2019-07-02 15:51 UTC (permalink / raw)
  To: Douglas RAILLARD
  Cc: linux-kernel, linux-pm, mingo, rjw, viresh.kumar, quentin.perret,
	patrick.bellasi, dietmar.eggemann

On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:
> Make schedutil cpufreq governor energy-aware.
> 
> - patch 4 adds sugov_cpu_ramp_boost() function.
> - patch 5 updates sugov_update_(single|shared)() to make use of
>   sugov_cpu_ramp_boost().
> 
> The benefits of using the EM in schedutil are twofold:

> 2) Driving the frequency selection with power in mind, in addition to
>    maximizing the utilization of the non-idle CPUs in the system.

> Point 2) is enabled in
> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
> higher frequencies when it is known that the true utilization of
> currently running tasks is exceeding their previous stable point.
> The benefits are:
> 
> * Boosting the frequency when the behavior of a runnable task changes,
>   leading to an increase in utilization. That shortens the frequency
>   ramp up duration, which in turns allows the utilization signal to
>   reach stable values quicker.  Since the allowed frequency boost is
>   bounded in energy, it will behave consistently across platforms,
>   regardless of the OPP cost range.
> 
> * The boost is only transient, and should not impact a lot the energy
>   consumed of workloads with very stable utilization signals.

So you're allowing a higher pick when the EWMA exceeds the enqueue
thing.

This then obviously has relation to Patrick's patch that makes the EWMA
asymmetric, but I'm thinking that the interaction is mostly favourable?

I'm not immediately seeing how it is transient; that is, PELT has a
wobble in it's steady state, is that accounted for?


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-07-02 15:44 ` [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Peter Zijlstra
@ 2019-07-03 13:38   ` Douglas Raillard
  2019-07-08 11:13     ` Patrick Bellasi
  0 siblings, 1 reply; 17+ messages in thread
From: Douglas Raillard @ 2019-07-03 13:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mingo, rjw, viresh.kumar, quentin.perret,
	patrick.bellasi, dietmar.eggemann

Hi Peter,

On 7/2/19 4:44 PM, Peter Zijlstra wrote:
> On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:
>> Make schedutil cpufreq governor energy-aware.
>>
>> - patch 1 introduces a function to retrieve a frequency given a base
>>    frequency and an energy cost margin.
>> - patch 2 links Energy Model perf_domain to sugov_policy.
>> - patch 3 updates get_next_freq() to make use of the Energy Model.
> 
>>
>> 1) Selecting the highest possible frequency for a given cost. Some
>>     platforms can have lower frequencies that are less efficient than
>>     higher ones, in which case they should be skipped for most purposes.
>>     They can still be useful to give more freedom to thermal throttling
>>     mechanisms, but not under normal circumstances.
>>     note: the EM framework will warn about such OPPs "hertz/watts ratio
>>     non-monotonically decreasing"
> 
> Humm, for some reason I was thinking we explicitly skipped those OPPs
> and they already weren't used.
> 
> This isn't in fact so, and these first few patches make it so?

That's correct: the cost information about each OPP has been introduced recently in mainline
by the energy model series. Without that info, the only way to skip them that comes to my
mind is to set a policy min frequency, since these inefficient OPPs are usually located
at the lower end.


Thanks,
Douglas


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-07-02 15:51 ` Peter Zijlstra
@ 2019-07-03 16:36   ` Douglas Raillard
  2019-07-08 11:09     ` Patrick Bellasi
  0 siblings, 1 reply; 17+ messages in thread
From: Douglas Raillard @ 2019-07-03 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mingo, rjw, viresh.kumar, quentin.perret,
	patrick.bellasi, dietmar.eggemann

On 7/2/19 4:51 PM, Peter Zijlstra wrote:
> On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:
>> Make schedutil cpufreq governor energy-aware.
>>
>> - patch 4 adds sugov_cpu_ramp_boost() function.
>> - patch 5 updates sugov_update_(single|shared)() to make use of
>>    sugov_cpu_ramp_boost().
>>
>> The benefits of using the EM in schedutil are twofold:
> 
>> 2) Driving the frequency selection with power in mind, in addition to
>>     maximizing the utilization of the non-idle CPUs in the system.
> 
>> Point 2) is enabled in
>> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
>> higher frequencies when it is known that the true utilization of
>> currently running tasks is exceeding their previous stable point.
>> The benefits are:
>>
>> * Boosting the frequency when the behavior of a runnable task changes,
>>    leading to an increase in utilization. That shortens the frequency
>>    ramp up duration, which in turns allows the utilization signal to
>>    reach stable values quicker.  Since the allowed frequency boost is
>>    bounded in energy, it will behave consistently across platforms,
>>    regardless of the OPP cost range.
>>
>> * The boost is only transient, and should not impact a lot the energy
>>    consumed of workloads with very stable utilization signals.
> 

[reordered original comments]

> This then obviously has relation to Patrick's patch that makes the EWMA
> asymmetric, but I'm thinking that the interaction is mostly favourable?

Making task_ue.ewma larger makes cpu_ue.enqueued larger, so Patrick's patch
helps increase the utilization as seen by schedutil in that transient time
(see discussion on schedutil signals at the bottom). That goes in the same
direction as this series.

> So you're allowing a higher pick when the EWMA exceeds the enqueue
> thing.

TLDR: Schedutil ramp boost works on CPU rq signals, for which util est EWMA
is not defined, but the idea is the same (replace util est EWMA by util_avg).

The important point here is that when util_avg for the task becomes higher
than task_ue.enqueued, it means the knowledge of the actual needs of the task
is turned into a lower bound (=task_ue.enqueued) rather than an exact value.
This means that selecting a higher frequency than that is:
a) necessary, the task needs more computational power to do its job.
b) a shot in the dark, as it's impossible to predict exactly how much it will
    need without a crystal ball.

When adding ramp boost, the bill is split: part of the "shot in the dark" comes from
the growing CPU's util_avg (see the schedutil_u definition at the bottom), and part of it
comes from the ramp boost. We don't want to make the boost too costly either, since
it's a shot in the dark. Therefore, we make the boost proportional to a battery life
cost rather than some guessed utilization.

Now that I think about it, it may make sense to let this ramp boost completely
handle this "future util prediction" case, as it's neither better nor worse than
util_avg at that (since it's based on it), but it allows better control over
the cost of a (mis)prediction.

> 
> I'm not immediately seeing how it is transient; that is, PELT has a
> wobble in it's steady state, is that accounted for?
> 

The transient-ness of the ramp boost I'm introducing comes from the fact that for a
periodic task at steady state, task_ue.enqueued <= task_u when the task is executing.
That is because task_ue.enqueued is sampled at dequeue time, precisely at the moment
at which task_u is reaching its max for that task. Since we only take into account
positive boosts, ramp boost will only have an impact in the "increase transients".


About signals schedutil is based on
===================================

Here is the state of signals used by schedutil to my knowledge to compute
the final "chosen_freq":

# let's define short names to talk about
task_ue = se.avg.util_est
task_u = se.avg.util_avg

cpu_ue = cfs_rq->avg.util_est
cpu_u = cfs_rq->avg.util_avg


# How things are defined
task_u ~= LOW_PASS_FILTER(task_activations)
task_ue.enqueued = SAMPLE_AT_DEQUEUE_AND_HOLD(task_u)
task_ue.ewma = LOW_PASS_FILTER(task_ue.enqueued)

# Patrick's patch amends task_ue.ewma definition this way:
task_ue.ewma =
	| task_ue.enqueued > task_ue.ewma: task_ue.enqueued
	| otherwise			 : LOW_PASS_FILTER(task_ue.enqueued)


cpu_ue.enqueued = SUM[MAX(task_ue.ewma, task_ue.enqueued) forall task_ue in enqueued_tasks]
cpu_u = SUM[task_u forall task_ue in enqueued_tasks]

# What schedutil considers when taking freq decisions

non_cfs_u = util of deadline + rt + irq
schedutil_u = non_cfs_u + APPLY_UCLAMP(MAX(cpu_ue.enqueued, cpu_u)) + iowait_boost
schedutil_base_freq = MAP_UTIL_FREQ(schedutil_u)

STABLE(signal) =
	| signal equal to the last time it was sampled by caller: True
	| otherwise                                             : False
# A diff between two util signals is converted to a EM_COST_MARGIN_SCALE value.
# They are different units, but the conversion factor is 1 in practice.
ramp_boost =
	| cpu_u > cpu_ue.enqueued && STABLE(cpu_ue.enqueued):
		(cpu_u - cpu_ue.enqueued) * (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE)
	| otherwise: 0

APPLY_RAMP_BOOST(boost, base_freq) = boosted_freq
	with
		acceptable_cost = ENERGY_MODEL_COST(base_freq)
				  * (EM_COST_MARGIN_SCALE + boost) / EM_COST_MARGIN_SCALE
		boosted_freq = MAX[freq forall freqs if ENERGY_MODEL_COST(freq) <= acceptable_cost]

# ramp-boost is applied on a freq instead of a util (unlike iowait_boost), since
# the function ENERGY_MODEL_COST(freq) is provided by the EM, and an equivalent
# ENERGY_MODEL_COST(util) would need extra calls to MAP_UTIL_FREQ().
schedutil_freq = APPLY_RAMP_BOOST(ramp_boost, schedutil_base_freq)

REAL_FREQ(ideal_freq) = MIN[freq forall freqs if freq >= ideal_freq]
POLICY_CLAMP(freq) =
	| freq < policy_min_freq: policy_min_freq
	| freq > policy_max_freq: policy_max_freq
	| otherwise		: freq
# Frequency finally used for the policy
chosen_freq = POLICY_CLAMP(REAL_FREQ(schedutil_freq))
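
As a concrete trace of the above (made-up values, assuming non_cfs_u = 0,
no uclamp clamping and no iowait_boost):

	cpu_ue.enqueued = 300 (stable), cpu_u = 360
	schedutil_u     = MAX(300, 360) = 360
	ramp_boost      = (360 - 300) * (1024/1024) = 60
	acceptable_cost = ENERGY_MODEL_COST(schedutil_base_freq)
	                  * (1024 + 60) / 1024
	schedutil_freq  = highest frequency whose cost fits in
	                  acceptable_cost, i.e. ~6% of cost headroom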


Thanks,
Douglas


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-07-03 16:36   ` Douglas Raillard
@ 2019-07-08 11:09     ` Patrick Bellasi
  2019-07-08 13:46       ` Douglas Raillard
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick Bellasi @ 2019-07-08 11:09 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: Peter Zijlstra, linux-kernel, linux-pm, mingo, rjw, viresh.kumar,
	quentin.perret, dietmar.eggemann

On 03-Jul 17:36, Douglas Raillard wrote:
> On 7/2/19 4:51 PM, Peter Zijlstra wrote:
> > On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:

[...]

> > I'm not immediately seeing how it is transient; that is, PELT has a
> > wobble in it's steady state, is that accounted for?
> > 
> 
> The transient-ness of the ramp boost I'm introducing comes from the fact that for a
> periodic task at steady state, task_ue.enqueued <= task_u when the task is executing.
                ^^^^^^^^^^^^^^^

I find your above "at steady state" a bit confusing.

The condition "task_ue.enqueued <= task_u" is true only for the first
task's big activation after a series of small activations, e.g. a task
switching from 20% to 70%.

That's the transient state you refer to, isn't it?

> That is because task_ue.enqueued is sampled at dequeue time, precisely at the moment
> at which task_u is reaching its max for that task.

Right, so in the example above we will have enqueued=20% while task_u
is going above to converge towards 70%

> Since we only take into account positive boosts, ramp boost will
> only have an impact in the "increase transients".

Right.

I think Peter was referring to the smallish wobbles we see when the
task already converged to 70%. If that's the case I would say they are
already fully covered also by the current util_est.

You are also correct in pointing out that in the steady state
ramp_boost will not be triggered in that steady state.

IMU, that's for two main reasons:
 a) it's very likely that enqueued <= util_avg
 b) even in case enqueued should turn out to be _slightly_ bigger than
    util_avg, the corresponding (proportional) ramp_boost would be so
    tiny to not have any noticeable effect on OPP selection.

Am I correct on point b) above?

Could you maybe come up with some experimental numbers related to that
case specifically?

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-07-03 13:38   ` Douglas Raillard
@ 2019-07-08 11:13     ` Patrick Bellasi
  2019-07-08 13:49       ` Douglas Raillard
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick Bellasi @ 2019-07-08 11:13 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: Peter Zijlstra, linux-kernel, linux-pm, mingo, rjw, viresh.kumar,
	quentin.perret, dietmar.eggemann

On 03-Jul 14:38, Douglas Raillard wrote:
> Hi Peter,
> 
> On 7/2/19 4:44 PM, Peter Zijlstra wrote:
> > On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:
> > > Make schedutil cpufreq governor energy-aware.
> > > 
> > > - patch 1 introduces a function to retrieve a frequency given a base
> > >    frequency and an energy cost margin.
> > > - patch 2 links Energy Model perf_domain to sugov_policy.
> > > - patch 3 updates get_next_freq() to make use of the Energy Model.
> > 
> > > 
> > > 1) Selecting the highest possible frequency for a given cost. Some
> > >     platforms can have lower frequencies that are less efficient than
> > >     higher ones, in which case they should be skipped for most purposes.
> > >     They can still be useful to give more freedom to thermal throttling
> > >     mechanisms, but not under normal circumstances.
> > >     note: the EM framework will warn about such OPPs "hertz/watts ratio
> > >     non-monotonically decreasing"
> > 
> > Humm, for some reason I was thinking we explicitly skipped those OPPs
> > and they already weren't used.
> > 
> > This isn't in fact so, and these first few patches make it so?
> 
> That's correct, the cost information about each OPP has been introduced recently in mainline
> by the energy model series. Without that info, the only way to skip them that comes to my
> mind is to set a policy min frequency, since these inefficient OPPs are usually located
> at the lower end.

Perhaps it's also worth to point out that the alternative approach you
point out above is a system wide solution.

While, the ramp_boost thingy you propose, it's a more fine grained
mechanisms which could be extended in the future to have a per-task
side. IOW, it could contribute to have better user-space hints, for
example to ramp_boost more certain tasks and not others.

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-07-08 11:09     ` Patrick Bellasi
@ 2019-07-08 13:46       ` Douglas Raillard
  2019-07-09 10:37         ` Patrick Bellasi
  0 siblings, 1 reply; 17+ messages in thread
From: Douglas Raillard @ 2019-07-08 13:46 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, linux-kernel, linux-pm, mingo, rjw, viresh.kumar,
	quentin.perret, dietmar.eggemann

Hi Patrick,

On 7/8/19 12:09 PM, Patrick Bellasi wrote:
> On 03-Jul 17:36, Douglas Raillard wrote:
>> On 7/2/19 4:51 PM, Peter Zijlstra wrote:
>>> On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:
> 
> [...]
> 
>>> I'm not immediately seeing how it is transient; that is, PELT has a
>>> wobble in it's steady state, is that accounted for?
>>>
>>
>> The transient-ness of the ramp boost I'm introducing comes from the fact that for a
>> periodic task at steady state, task_ue.enqueued <= task_u when the task is executing.
>                  ^^^^^^^^^^^^^^^
> 
> I find your above "at steady state" a bit confusing.
> 
> The condition "task_ue.enqueue <= task_u" is true only for the first
> task's big activation after a series of small activations, e.g. a task
> switching from 20% to 70%.

I actually made a typo and meant "task_u <= task_ue.enqueued". The rest of the paragraph
is aligned with that condition, sorry for the confusion.


> That's the transient stat you refer to, isn't it?
> 
>> That is because task_ue.enqueued is sampled at dequeue time, precisely at the moment
>> at which task_u is reaching its max for that task.
> 
> Right, so in the example above we will have enqueued=20% while task_u
> is going above to converge towards 70%
> 
>> Since we only take into account positive boosts, ramp boost will
>> only have an impact in the "increase transients".
> 
> Right.
> 
> I think Peter was referring to the smallish wobbles we see when the
> task already converged to 70%. If that's the case I would say they are
> already fully covered also by the current util_est.

Yes, that's covered by the "task_u <= task_ue.enqueued" condition, with task_ue.enqueued
not having any of this "mid freq" content that we call wobble here.

Util est enqueued acts as an adaptive filter that kills frequencies higher than 1/task_period,
task_period being the delta between the two previous "enqueue events". All that (mostly) remains
after that is util variation over larger periods, with a positive shift that increases with
the task period (mean(enqueued) = mean(util_avg) + f(task_period)).


> You are also correct in pointing out that in the steady state
> ramp_boost will not be triggered in that steady state.
> 
> IMU, that's for two main reasons:
>   a) it's very likely that enqueued <= util_avg
> >   b) even in case enqueued should turn out to be _slightly_ bigger than
>      util_avg, the corresponding (proportional) ramp_boost would be so
>      tiny to not have any noticeable effect on OPP selection.
> 
> Am I correct on point b) above?

Assuming you meant "util_avg slightly bigger than enqueued" (which is when boosting triggers),
then yes, since the ramp_boost effect is proportional to "task_u - task_ue.enqueued". It makes
it robust against that.

> 
> Could you maybe come up with some experimental numbers related to that
> case specifically?

With:
* an rt-app task ramping up from 5% to 75% util in one big step. The whole cycle is 0.6s long
  (0.3s at 5% followed by 0.3s at 75%). This cycle is repeated 20 times and the average of
  boosting is taken.

* a hikey 960 (this impacts the frequency at which the test runs at the beginning of the 75% phase,
   which impacts the number of missed activations before the util has ramped up).

* assuming an OPP exists for each util value (i.e. 1024 OPPs, so the effect
   of boost on consumption is not impacted by OPP capacity granularity)

Then the boosting feature would increase the average power consumption by 3.1%, out of which 0.12% can
be considered "spurious boosting" due to the util taking some time to really converge to its
steady state value. In practice, the impact of small boosts will be even lower, since they are less
likely to trigger the selection of a high OPP due to OPP capacity granularity > 1 util unit.

> 
> Best,
> Patrick
> 

Best regards,
Douglas


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-07-08 11:13     ` Patrick Bellasi
@ 2019-07-08 13:49       ` Douglas Raillard
  0 siblings, 0 replies; 17+ messages in thread
From: Douglas Raillard @ 2019-07-08 13:49 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, linux-kernel, linux-pm, mingo, rjw, viresh.kumar,
	quentin.perret, dietmar.eggemann



On 7/8/19 12:13 PM, Patrick Bellasi wrote:
> On 03-Jul 14:38, Douglas Raillard wrote:
>> Hi Peter,
>>
>> On 7/2/19 4:44 PM, Peter Zijlstra wrote:
>>> On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:
>>>> Make schedutil cpufreq governor energy-aware.
>>>>
>>>> - patch 1 introduces a function to retrieve a frequency given a base
>>>>     frequency and an energy cost margin.
>>>> - patch 2 links Energy Model perf_domain to sugov_policy.
>>>> - patch 3 updates get_next_freq() to make use of the Energy Model.
>>>
>>>>
>>>> 1) Selecting the highest possible frequency for a given cost. Some
>>>>      platforms can have lower frequencies that are less efficient than
>>>>      higher ones, in which case they should be skipped for most purposes.
>>>>      They can still be useful to give more freedom to thermal throttling
>>>>      mechanisms, but not under normal circumstances.
>>>>      note: the EM framework will warn about such OPPs "hertz/watts ratio
>>>>      non-monotonically decreasing"
>>>
>>> Humm, for some reason I was thinking we explicitly skipped those OPPs
>>> and they already weren't used.
>>>
>>> This isn't in fact so, and these first few patches make it so?
>>
>> That's correct, the cost information about each OPP has been introduced recently in mainline
>> by the energy model series. Without that info, the only way to skip them that comes to my
>> mind is to set a policy min frequency, since these inefficient OPPs are usually located
>> at the lower end.
> 
> Perhaps it's also worth to point out that the alternative approach you
> point out above is a system wide solution.
> 
> While, the ramp_boost thingy you propose, it's a more fine grained
> mechanisms which could be extended in the future to have a per-task
> side. IOW, it could contribute to have better user-space hints, for
> example to ramp_boost more certain tasks and not others.

ramp_boost and the situation you describe are more what solves point 2) (which has been cut out in that answer),
this point "1)" is really just about avoiding selection of some OPPs, regardless of task util. IOW, it's better to
skip the OPPs we talk about here, and race to idle at a higher OPP regardless of what the task need.


> Best,
> Patrick
> 

Cheers,
Douglas


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-07-08 13:46       ` Douglas Raillard
@ 2019-07-09 10:37         ` Patrick Bellasi
  2019-08-09 17:37           ` Douglas Raillard
  0 siblings, 1 reply; 17+ messages in thread
From: Patrick Bellasi @ 2019-07-09 10:37 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: Peter Zijlstra, linux-kernel, linux-pm, mingo, rjw, viresh.kumar,
	quentin.perret, dietmar.eggemann

On 08-Jul 14:46, Douglas Raillard wrote:
> Hi Patrick,
> 
> On 7/8/19 12:09 PM, Patrick Bellasi wrote:
> > On 03-Jul 17:36, Douglas Raillard wrote:
> > > On 7/2/19 4:51 PM, Peter Zijlstra wrote:
> > > > On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:

[...]

> > You are also correct in pointing out that in the steady state
> > ramp_boost will not be triggered in that steady state.
> > 
> > IMU, that's for two main reasons:
> >   a) it's very likely that enqueued <= util_avg
> >   b) even in case enqueued should turn out to be _slightly_ bigger than
> >      util_avg, the corresponding (proportional) ramp_boost would be so
> >      tiny to not have any noticeable effect on OPP selection.
> > 
> > Am I correct on point b) above?
> 
> Assuming you meant "util_avg slightly bigger than enqueued" (which is when boosting triggers),
> then yes, since the ramp_boost effect is proportional to "task_u - task_ue.enqueued". It makes
> against that.

Right :)

> > Could you maybe come up with some experimental numbers related to that
> > case specifically?
> 
> With:
> * an rt-app task ramping up from 5% to 75% util in one big step. The
> whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%). This
> cycle is repeated 20 times and the average of boosting is taken.
> 
> * a hikey 960 (this impacts the frequency at which the test runs at
> the beginning of the 75% phase, which impacts the number of missed
> activations before the util has ramped up).
> 
> * assuming an OPP exists for each util value (i.e. 1024 OPPs, so the
> effect of boost on consumption is not impacted by OPP capacities
> granularity)
> 
> Then the boosting feature would increase the average power
> consumption by 3.1%, out of which 0.12% can be considered "spurious
> boosting" due to the util taking some time to really converge to its
> steady state value.
>
> In practice, the impact of small boosts will be even lower since
> they will less likely trigger the selection of a high OPP due to OPP
> capacity granularity > 1 util unit.

That's ok for the energy side: you estimate a ~3% worst-case energy
increase on that specific target.

With boosting I expect the negative slack to improve.
Do you also have numbers/stats related to the negative slack?
Can you share a percentage figure for that improvement?

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware
  2019-07-09 10:37         ` Patrick Bellasi
@ 2019-08-09 17:37           ` Douglas Raillard
  0 siblings, 0 replies; 17+ messages in thread
From: Douglas Raillard @ 2019-08-09 17:37 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, linux-kernel, linux-pm, mingo, rjw, viresh.kumar,
	quentin.perret, dietmar.eggemann

Hi Patrick,

On 7/9/19 11:37 AM, Patrick Bellasi wrote:
> On 08-Jul 14:46, Douglas Raillard wrote:
>> Hi Patrick,
>>
>> On 7/8/19 12:09 PM, Patrick Bellasi wrote:
>>> On 03-Jul 17:36, Douglas Raillard wrote:
>>>> On 7/2/19 4:51 PM, Peter Zijlstra wrote:
>>>>> On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:
> 
> [...]
> 
>>> You are also correct in pointing out that in the steady state
>>> ramp_boost will not be triggered in that steady state.
>>>
>>> IMU, that's for two main reasons:
>>>    a) it's very likely that enqueued <= util_avg
>>>    b) even in case enqueued should turn out to be _slightly_ bigger than
>>>       util_avg, the corresponding (proportional) ramp_boost would be so
>>>       tiny to not have any noticeable effect on OPP selection.
>>>
>>> Am I correct on point b) above?
>>
>> Assuming you meant "util_avg slightly bigger than enqueued" (which is when boosting triggers),
>> then yes, since the ramp_boost effect is proportional to "task_u - task_ue.enqueued". It makes
>> against that.
> 
> Right :)
> 
>>> Could you maybe come up with some experimental numbers related to that
>>> case specifically?
>>
>> With:
>> * an rt-app task ramping up from 5% to 75% util in one big step. The
>> whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%). This
>> cycle is repeated 20 times and the average of boosting is taken.
>>
>> * a hikey 960 (this impact the frequency at which the test runs at
>> the beginning of 75% phase, which impacts the number of missed
>> activations before the util ramped up).
>>
>> * assuming an OPP exists for each util value (i.e. 1024 OPPs, so the
>> effect of boost on consumption is not impacted by OPP capacities
>> granularity)
>>
>> Then the boosting feature would increase the average power
>> consumption by 3.1%, out of which 0.12% can be considered "spurious
>> boosting" due to the util taking some time to really converge to its
>> steady state value.
>>
>> In practice, the impact of small boosts will be even lower since
>> they will less likely trigger the selection of a high OPP due to OPP
>> capacity granularity > 1 util unit.
> 
> That's ok for the energy side: you estimate a ~3% worst case more
> energy on that specific target.
> 
> By boosting I expect the negative boost to improve.
> Do you have also numbers/stats related to the negative slack?
> Can you share a percentage figure for that improvement?

I'm now testing on a Google Pixel 3 (Qcom Snapdragon 845) phone, with the same workload, pinned on a big core.
It has a lot more OPPs than a hikey 960, so gradations in boosting are better reflected on frequency selection.

avg slack (higher=better):
     Average time between task sleep and its next periodic activation.

avg negative slack (lower in absolute value=better):
     Same as avg slack, but only taking into account negative values.
     Negative slack means a task activation did not have enough time to complete before the next
     periodic activation fired, which is what we want to avoid.

boost energy overhead (lower=better):
     Extra power consumption induced by ramp boost, assuming a continuous OPP space (infinite number of OPPs)
     and single-CPU policies. In practice, a fixed number of OPPs decreases this value, and more CPUs per policy increase it,
     since boost(policy) = max(boost(cpu of policy)).

Without ramp boost:
+--------------------+--------------------+
|avg slack (us)      |avg negative slack  |
|                    |(us)                |
+--------------------+--------------------+
|6598.72             |-10217.13           |
|6595.49             |-10200.13           |
|6613.72             |-10401.06           |
|6600.29             |-9860.872           |
|6605.53             |-10057.64           |
|6612.05             |-10267.50           |
|6599.01             |-9939.60            |
|6593.79             |-9445.633           |
|6613.56             |-10276.75           |
|6595.44             |-9751.770           |
+--------------------+--------------------+
|average                                  |
+--------------------+--------------------+
|6602.76             |-10041.81           |
+--------------------+--------------------+


With ramp boost enabled:
+--------------------+--------------------+--------------------+
|boost energy        |avg slack (us)      |avg negative slack  |
|overhead (%)        |                    |(us)                |
+--------------------+--------------------+--------------------+
|3.05                |7148.93             |-5664.26            |
|3.04                |7144.69             |-5667.77            |
|3.05                |7149.05             |-5698.31            |
|2.97                |7126.71             |-6040.23            |
|3.02                |7140.28             |-5826.78            |
|3.03                |7135.11             |-5749.62            |
|3.05                |7140.24             |-5750.0             |
|3.05                |7144.84             |-5667.04            |
|3.07                |7157.30             |-5656.65            |
|3.06                |7154.65             |-5653.76            |
+--------------------+--------------------+--------------------+
|average                                                       |
+--------------------+--------------------+--------------------+
|3.039000            |7144.18             |-5737.44            |
+--------------------+--------------------+--------------------+


The negative slack is due to missed activations while the utilization signals
increase during the big utilization step. Ramp boost is designed to boost the
frequency during that phase, which materializes as 1.75x less negative slack,
for an extra power consumption under 3%.

> Best,
> Patrick
> 

Thanks,
Douglas


Thread overview: 17+ messages
2019-06-27 17:15 [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Douglas RAILLARD
2019-06-27 17:15 ` [RFC PATCH v2 1/5] PM: Introduce em_pd_get_higher_freq() Douglas RAILLARD
2019-06-27 17:16 ` [RFC PATCH v2 2/5] sched/cpufreq: Attach perf domain to sugov policy Douglas RAILLARD
2019-06-27 17:16 ` [RFC PATCH v2 3/5] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq() Douglas RAILLARD
2019-06-27 17:16 ` [RFC PATCH v2 4/5] sched/cpufreq: Introduce sugov_cpu_ramp_boost Douglas RAILLARD
2019-06-28 15:08   ` Patrick Bellasi
2019-06-27 17:16 ` [RFC PATCH v2 5/5] sched/cpufreq: Boost schedutil frequency ramp up Douglas RAILLARD
2019-07-02 15:44 ` [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware Peter Zijlstra
2019-07-03 13:38   ` Douglas Raillard
2019-07-08 11:13     ` Patrick Bellasi
2019-07-08 13:49       ` Douglas Raillard
2019-07-02 15:51 ` Peter Zijlstra
2019-07-03 16:36   ` Douglas Raillard
2019-07-08 11:09     ` Patrick Bellasi
2019-07-08 13:46       ` Douglas Raillard
2019-07-09 10:37         ` Patrick Bellasi
2019-08-09 17:37           ` Douglas Raillard
