* [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control
@ 2015-08-19 18:47 Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 01/14] sched/cpufreq_sched: use static key for cpu frequency selection Patrick Bellasi
                   ` (13 more replies)
  0 siblings, 14 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm

The topic of a single simple power-performance tunable that is wholly
scheduler-centric and has well-defined and predictable properties has come up
on several occasions in the past [1,2]. With techniques such as scheduler-driven
DVFS [4], we now have a good framework for implementing such a tunable.

This RFC introduces the foundation for adding a single, central tunable 'knob'
to the scheduler. The main goal of this RFC is to present an initial proposal
for a possible solution, as well as to trigger a discussion on how the ideas
here may be extended for integration with Energy Aware Scheduling [5].

Patch set organization
======================

The following patches implement the tunable knob stacked on top of sched-DVFS.
The knob extends the functionality provided by sched-DVFS to support task
performance boosting. It is exposed as a simple user-space facing interface
that allows tuning of system-wide scheduler behavior, ranging from energy
efficiency at one end to full performance boosting at the other.

The tunable can be used globally, so that it affects all tasks, or applied to
a selected set of tasks via a new CGroup controller.

The content of this RFC consists of three main parts:

Patches 01-07: sched:
         Juri's patches on sched-DVFS, which have been updated to
         address review comments from the last EAS RFCv5 posting.

Patches 08-11: sched, schedtune:
         A new proposal for "global task boosting"

Patches 12-14: sched, schedtune:
         An extension, based on CGroups, to support per-task boosting

These patches are based on tip/sched/core and depend on:

 1. patches to "compute capacity invariant load/utilization tracking"
    recently posted by Morten Rasmussen [3]

 2. patches for "scheduler-driven cpu frequency selection"
    which add the new sched-DVFS CPUFreq governor
    initially posted by Mike Turquette [4]
    and recently updated in the series [5] posted by Morten Rasmussen

For testing purposes an integration branch providing all the required
dependencies as well as the patches of this RFC is available here:

    git://www.linux-arm.com/linux-power eas/stune/rfcv1


Test results
============

Tests have been performed on an ARM Juno board, booted using only the LITTLE
cluster (4x ARM64 Cortex-A53 @ 850 MHz) to mimic a standard SMP platform.

Impact on scheduler performance
-------------------------------

Performance impact has been evaluated using the hackbench test provided by
perf with this command line:

     perf bench sched messaging --thread --group 25 --loop 1000

Reported completion times (CTime) in seconds are averages over 10 runs:

                 |           |            |     SchedTune
                 | Ondemand  | sched-DVFS |  Global    PerTask
-----------------+-----------+------------+--------------------
CTime        [s] |     50.9  |      50.3  |   51.1       51.3
 vs Ondemand [%] |      0.00 |      -1.19 |    0.34       0.84
-----------------+-----------+------------+--------------------
Energy           |           |            |
 vs Ondemand [%] |      0.00 |      -0.80 |    1.16       1.45
-----------------+-----------+------------+--------------------

Overall considerations are:

 1) sched-DVFS is quite well positioned compared to the Ondemand
    governor with respect to both performance and energy consumption

 2) SchedTune is always worse than the Ondemand governor, due to
    optimizations still missing from the current implementation for
    operating under saturated conditions

The SchedTune extension is useful only on a lightly loaded system.
On the other hand, when the system is saturated, the SchedTune support
should be automatically disabled. This automatic disabling is currently
being implemented and will be posted in the next revision of this RFC.


Performance/energy impacts of task boosting
-------------------------------------------

We considered a set of rt-app [6] generated workloads.
All the tests are executed using:
   - 4 threads (to match the number of available CPUs)
   - each thread has a 2ms period
   - duty-cycles (at the highest OPP) of 6, 13, 19, 25, 31, 38 and 44%
   - each workload runs for 60s
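For reference, with a 2ms period those duty cycles translate into per-thread
busy times as in the small sketch below (illustrative C only, not part of the
patch set; busy_time_us() is a made-up helper, the real figures come from the
rt-app JSON configuration):

```c
#include <assert.h>

/*
 * Busy time, in microseconds, of one rt-app thread given its period
 * and duty-cycle percentage. Illustrates the workload parameters
 * listed above; not code from this series.
 */
static unsigned int busy_time_us(unsigned int period_us, unsigned int duty_pct)
{
	return period_us * duty_pct / 100;
}
```

A 6% duty cycle thus runs for 120us of every 2000us period, while the 44%
case runs for 880us.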

The energy metric (EnergyDiff) is based on the energy counters available on
the Juno platform and reports the energy consumption for the complete
execution of the workload.

The performance evaluation is based on data obtained by rt-app [6] using the
same metric introduced with the EAS RFCv5 [5].

The following table reports the percentage variation on each metric.
Each variation compares:
    base   : workload run using the sched-DVFS governor but without boosting
    testNN : workload run using the sched-DVFS governor with a NN boost value
             configured for just the tasks of the workload,
             i.e. using per-task boosting

Reported numbers are averages on 10 runs for each test configuration.
Numbers in (parentheses) are references for the comments below the table.


  Test Id   : Comparison      | EnergyDiff [%] |  PerfIndex [%] |
  ----------------------------+----------------+----------------+
  Test_43   : test05 vs base  |  (1) -0.24     |  (4) -1.22     |
  Test_43   : test10 vs base  |      -0.25     |      -0.82     |
  Test_43   : test30 vs base  |      -0.22     |      -0.62     |
  Test_43   : test80 vs base  |      22.72     |      10.40     |
  ----------------------------+----------------+----------------+
  Test_44   : test05 vs base  |  (1) -0.37     |       1.43     |
  Test_44   : test10 vs base  |      -0.30     |       0.70     |
  Test_44   : test30 vs base  |       0.52     |       1.57     |
  Test_44   : test80 vs base  |      21.08     |      17.36     |
  ----------------------------+----------------+----------------+
  Test_45   : test05 vs base  |  (1) -0.17     |       1.00     |
  Test_45   : test10 vs base  |      -0.12     |      -0.22     |
  Test_45   : test30 vs base  |       4.15     |       8.25     |
  Test_45   : test80 vs base  |      21.84     |      22.38     |
  ----------------------------+----------------+----------------+
  Test_46   : test05 vs base  |  (1) -0.09     |      -0.48     |
  Test_46   : test10 vs base  |      -0.02     |      -1.06     |
  Test_46   : test30 vs base  |       4.36     |      13.01     |
  Test_46   : test80 vs base  |  (2) 21.15     |  (3) 29.58     |
  ----------------------------+----------------+----------------+
  Test_47   : test05 vs base  |       0.11     |       1.15     |
  Test_47   : test10 vs base  |       0.58     |       1.99     |
  Test_47   : test30 vs base  |       5.44     |       8.54     |
  Test_47   : test80 vs base  |  (2) 22.47     |  (3) 30.88     |
  ----------------------------+----------------+----------------+
  Test_48   : test05 vs base  |       4.23     |       5.00     |
  Test_48   : test10 vs base  |       7.32     |      16.88     |
  Test_48   : test30 vs base  |      14.75     |      28.72     |
  Test_48   : test80 vs base  |  (2) 29.11     |  (3) 42.30     |
  ----------------------------+----------------+----------------+
  Test_49   : test05 vs base  |       0.21     |       1.15     |
  Test_49   : test10 vs base  |       0.50     |       2.47     |
  Test_49   : test30 vs base  |       6.60     |      11.51     |
  Test_49   : test80 vs base  |  (2) 18.22     |  (3) 27.45     |
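The EnergyDiff and PerfIndex columns above are plain relative variations of
each testNN run against the base run; as a sketch of the arithmetic
(illustrative only, pct_variation() is a made-up helper):

```c
#include <assert.h>
#include <math.h>

/* Percentage variation of a boosted run's metric against the base run. */
static double pct_variation(double test_val, double base_val)
{
	return (test_val - base_val) / base_val * 100.0;
}
```

Negative values therefore mean the boosted run consumed less energy (or
performed worse, for PerfIndex) than the base run.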


Comments on Results
===================

The goal of the proposed task boosting strategy is to speed up task
completion by running tasks at a higher Operating Performance Point (OPP)
than the lowest OPP required by the specific workload.

Here are some considerations on reported results:

 a) Low-intensity workloads show a small decrease in energy
    consumption (1), probably due to a race-to-idle effect when running
    at a lower OPP. Otherwise, we generally observe an increase in
    energy consumption which is monotonic in, and proportional to, the
    configured boost value.

 b) Higher boost values (2) incur 20-30% more energy consumption,
    which is however compensated by the expected improvement in the
    performance metric.

 c) The PerfIndex is in general aligned with the magnitude of the boost
    value: the more we boost the workload, the sooner task activations
    complete and thus the better the PerfIndex metric (3)

 d) On really small workloads, when the boost value is relatively small (4),
    the overhead introduced by SchedTune is not compensated by the
    opportunity to select a higher OPP.
    This aspect is part of the SchedTune optimization that we will target
    in the next posting.


Conclusions
===========

The proposed patch set provides a simple and effective tunable knob for
boosting the performance of low-intensity tasks. The tunable works by
biasing sched-DVFS in its selection of the operating frequency, which makes
it possible to trade increased energy consumption for faster task
completion times.

This patch set provides just the foundation bits, which focus on OPP
selection. A further extension of this patch set, targeting integration
with the Energy Aware Scheduler (EAS) [5] by biasing CPU selection, is
under development.

This will complete the semantics of the boosting knob by providing a single
knob that allows:
   a) tuning sched-DVFS to mimic (dynamically and on a per-task basis) the
      behaviors of other governors (e.g. ondemand, performance and interactive)
   b) tuning EAS to be more energy-efficient or more performance oriented


References
==========

[1] Remove stale power aware scheduling remnants and dysfunctional knobs
      http://lkml.org/lkml/2012/5/18/91
[2] Power-efficient scheduling design
      http://lwn.net/Articles/552889
[3] Compute capacity invariant load/utilization tracking
      http://lkml.org/lkml/2015/8/14/296
[4] Scheduler-driven CPU frequency selection (RFCv3)
      http://lkml.org/lkml/2015/6/26/620
[5] Energy cost model for energy-aware scheduling (RFCv5)
      https://lkml.org/lkml/2015/7/7/754
[6] Extended RT-App to report Time to Completion
      https://github.com/scheduler-tools/rt-app.git exp/eas_v5


Juri Lelli (7):
  sched/cpufreq_sched: use static key for cpu frequency selection
  sched/fair: add triggers for OPP change requests
  sched/{core,fair}: trigger OPP change request on fork()
  sched/{fair,cpufreq_sched}: add reset_capacity interface
  sched/fair: jump to max OPP when crossing UP threshold
  sched/cpufreq_sched: modify pcpu_capacity handling
  sched/fair: cpufreq_sched triggers for load balancing

Patrick Bellasi (7):
  sched/tune: add detailed documentation
  sched/tune: add sysctl interface to define a boost value
  sched/fair: add function to convert boost value into "margin"
  sched/fair: add boosted CPU usage
  sched/tune: add initial support for CGroups based boosting
  sched/tune: compute and keep track of per CPU boost value
  sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value

 Documentation/scheduler/sched-tune.txt | 367 +++++++++++++++++++++++++++++
 include/linux/cgroup_subsys.h          |   4 +
 include/linux/sched/sysctl.h           |  16 ++
 init/Kconfig                           |  43 ++++
 kernel/sched/Makefile                  |   1 +
 kernel/sched/core.c                    |   2 +-
 kernel/sched/cpufreq_sched.c           |  28 ++-
 kernel/sched/fair.c                    | 168 +++++++++++++-
 kernel/sched/sched.h                   |  10 +
 kernel/sched/tune.c                    | 411 +++++++++++++++++++++++++++++++++
 kernel/sched/tune.h                    |  23 ++
 kernel/sysctl.c                        |  15 ++
 12 files changed, 1082 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/scheduler/sched-tune.txt
 create mode 100644 kernel/sched/tune.c
 create mode 100644 kernel/sched/tune.h

--
2.5.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH 01/14] sched/cpufreq_sched: use static key for cpu frequency selection
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 02/14] sched/fair: add triggers for OPP change requests Patrick Bellasi
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm, Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

Introduce a static key so that scheduler hot paths are affected only when
the sched governor is in use.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/cpufreq_sched.c | 14 ++++++++++++++
 kernel/sched/fair.c          |  2 ++
 kernel/sched/sched.h         |  6 ++++++
 3 files changed, 22 insertions(+)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index 5020f24..2968f3a 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -203,6 +203,18 @@ out:
 	return;
 }
 
+static inline void set_sched_energy_freq(void)
+{
+	if (!sched_energy_freq())
+		static_key_slow_inc(&__sched_energy_freq);
+}
+
+static inline void clear_sched_energy_freq(void)
+{
+	if (sched_energy_freq())
+		static_key_slow_dec(&__sched_energy_freq);
+}
+
 static int cpufreq_sched_start(struct cpufreq_policy *policy)
 {
 	struct gov_data *gd;
@@ -243,6 +255,7 @@ static int cpufreq_sched_start(struct cpufreq_policy *policy)
 
 	policy->governor_data = gd;
 	gd->policy = policy;
+	set_sched_energy_freq();
 	return 0;
 
 err:
@@ -254,6 +267,7 @@ static int cpufreq_sched_stop(struct cpufreq_policy *policy)
 {
 	struct gov_data *gd = policy->governor_data;
 
+	clear_sched_energy_freq();
 	if (cpufreq_driver_might_sleep()) {
 		kthread_stop(gd->task);
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a04b074..b35d90b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4075,6 +4075,8 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5af84b..07ab036 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1415,6 +1415,12 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 }
 #endif
 
+extern struct static_key __sched_energy_freq;
+static inline bool sched_energy_freq(void)
+{
+	return static_key_false(&__sched_energy_freq);
+}
+
 #ifdef CONFIG_CPU_FREQ_GOV_SCHED
 void cpufreq_sched_set_cap(int cpu, unsigned long util);
 #else
-- 
2.5.0



* [RFC PATCH 02/14] sched/fair: add triggers for OPP change requests
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 01/14] sched/cpufreq_sched: use static key for cpu frequency selection Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 03/14] sched/{core,fair}: trigger OPP change request on fork() Patrick Bellasi
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm, Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

Each time a task is {en,de}queued we might need to adapt the current
frequency to the new usage. Add triggers in {en,de}queue_task_fair() for
this purpose. Only trigger a frequency request if we are effectively waking
up or going to sleep; load balancing related calls are filtered out to
reduce the number of triggers.
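The capacity request computed in the update_capacity_of() hunk below reduces
to scaling the CPU utilization by a ~20% margin. A standalone sketch of that
arithmetic (illustrative only, assuming the usual 1024 capacity scale;
requested_capacity() is a made-up name mirroring the patch's formula):

```c
#include <assert.h>

#define CAPACITY_SCALE	1024	/* assumed SCHED_CAPACITY_SCALE */
#define CAPACITY_MARGIN	1280	/* as in the patch: 1280/1024 ~= 1.25 */

/* Mirror of the req_cap computation in update_capacity_of(). */
static unsigned long requested_capacity(unsigned long util,
					unsigned long capacity_orig)
{
	return util * CAPACITY_MARGIN / capacity_orig;
}
```

For a CPU at 50% utilization (util = 512 on a 1024 scale) the request is
640, i.e. utilization ends up at 80% of the requested capacity, leaving
roughly 20% head room for further utilization increases.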

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b35d90b..ebf86b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4075,6 +4075,26 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+/*
+ * ~20% capacity margin that we add to every capacity change
+ * request to provide some head room if task utilization further
+ * increases.
+ */
+static unsigned int capacity_margin = 1280;
+static unsigned long capacity_orig_of(int cpu);
+static int cpu_util(int cpu);
+
+static void update_capacity_of(int cpu)
+{
+	unsigned long req_cap;
+
+	if (!sched_energy_freq())
+		return;
+
+	req_cap = cpu_util(cpu) * capacity_margin / capacity_orig_of(cpu);
+	cpufreq_sched_set_cap(cpu, req_cap);
+}
+
 struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
 
 /*
@@ -4087,6 +4107,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	int task_new = !(flags & ENQUEUE_WAKEUP);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4118,9 +4139,22 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
 		add_nr_running(rq, 1);
 
+		/*
+		 * We want to potentially trigger a freq switch request only for
+                 * tasks that are waking up; this is because we get here also during
+		 * load balancing, but in these cases it seems wise to trigger
+		 * as single request after load balancing is done.
+		 *
+		 * XXX: how about fork()? Do we need a special flag/something
+		 *      to tell if we are here after a fork() (wakeup_task_new)?
+		 *
+		 */
+		if (!task_new)
+			update_capacity_of(cpu_of(rq));
+	}
 	hrtick_update(rq);
 }
 
@@ -4178,9 +4212,18 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
 		sub_nr_running(rq, 1);
 
+		/*
+		 * We want to potentially trigger a freq switch request only for
+		 * tasks that are going to sleep; this is because we get here also
+		 * during load balancing, but in these cases it seems wise to trigger
+		 * as single request after load balancing is done.
+		 */
+		if (task_sleep)
+			update_capacity_of(cpu_of(rq));
+	}
 	hrtick_update(rq);
 }
 
-- 
2.5.0



* [RFC PATCH 03/14] sched/{core,fair}: trigger OPP change request on fork()
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 01/14] sched/cpufreq_sched: use static key for cpu frequency selection Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 02/14] sched/fair: add triggers for OPP change requests Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 04/14] sched/{fair,cpufreq_sched}: add reset_capacity interface Patrick Bellasi
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm, Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

Patch "sched/fair: add triggers for OPP change requests" introduced OPP
change triggers in enqueue_task_fair(), but the trigger was operating only
for wakeups. In fact it makes sense to consider wakeup_new as well (i.e.,
fork()): since we don't know anything about a newly created task, we most
certainly want to jump to the max OPP so as not to harm its performance too
much.

However, it is not currently possible (or at least it wasn't evident to me
how to do so :/) to tell new wakeups from other (non-wakeup) operations.

This patch introduces an additional flag in sched.h that is set only at
fork() time and is then consumed in enqueue_task_fair() for this purpose.
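The resulting enqueue-path check can be sketched as follows (illustrative C;
ENQUEUE_WAKEUP_NEW is taken from the sched.h hunk below, while ENQUEUE_WAKEUP
is assumed to be 1 as in the sched.h of this era, and
wants_capacity_update() is a made-up helper):

```c
#include <assert.h>

#define ENQUEUE_WAKEUP		1	/* assumed value */
#define ENQUEUE_WAKEUP_NEW	16	/* added by this patch */

/*
 * An OPP update is requested both for real wakeups and for fork()-time
 * (wakeup_new) enqueues, but not for load-balance moves (flags == 0).
 */
static int wants_capacity_update(int flags)
{
	return (flags & (ENQUEUE_WAKEUP | ENQUEUE_WAKEUP_NEW)) != 0;
}
```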

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 9 +++------
 kernel/sched/sched.h | 1 +
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43952c7..e901340 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2360,7 +2360,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
 	rq = __task_rq_lock(p);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, ENQUEUE_WAKEUP_NEW);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebf86b4..a75ea07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4107,7 +4107,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
-	int task_new = !(flags & ENQUEUE_WAKEUP);
+	int task_new = flags & ENQUEUE_WAKEUP_NEW;
+	int task_wakeup = flags & ENQUEUE_WAKEUP;
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4147,12 +4148,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
                  * tasks that are waking up; this is because we get here also during
 		 * load balancing, but in these cases it seems wise to trigger
 		 * as single request after load balancing is done.
-		 *
-		 * XXX: how about fork()? Do we need a special flag/something
-		 *      to tell if we are here after a fork() (wakeup_task_new)?
-		 *
 		 */
-		if (!task_new)
+		if (task_new || task_wakeup)
 			update_capacity_of(cpu_of(rq));
 	}
 	hrtick_update(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 07ab036..1f0b433 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1164,6 +1164,7 @@ static const u32 prio_to_wmult[40] = {
 #define ENQUEUE_WAKING		0
 #endif
 #define ENQUEUE_REPLENISH	8
+#define ENQUEUE_WAKEUP_NEW	16
 
 #define DEQUEUE_SLEEP		1
 
-- 
2.5.0



* [RFC PATCH 04/14] sched/{fair,cpufreq_sched}: add reset_capacity interface
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (2 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 03/14] sched/{core,fair}: trigger OPP change request on fork() Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 05/14] sched/fair: jump to max OPP when crossing UP threshold Patrick Bellasi
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm, Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

When a CPU is going idle, it is pointless to ask for an OPP update: we
would wake up another task only to request the same capacity we are already
running at (the utilization gets moved to blocked_utilization). We thus add
the cpufreq_sched_reset_cap() interface to simply reset our current
capacity request without triggering any real update. At wakeup we will use
the decayed utilization to select an appropriate OPP.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/cpufreq_sched.c | 12 ++++++++++++
 kernel/sched/fair.c          |  8 ++++++--
 kernel/sched/sched.h         |  3 +++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index 2968f3a..e6b4a22 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -203,6 +203,18 @@ out:
 	return;
 }
 
+/**
+ * cpufreq_sched_reset_cap - interface to scheduler for resetting capacity
+ *                           requests
+ * @cpu: cpu whose capacity request has to be reset
+ *
+ * This _won't_ trigger any capacity update.
+ */
+void cpufreq_sched_reset_cap(int cpu)
+{
+	per_cpu(pcpu_capacity, cpu) = 0;
+}
+
 static inline void set_sched_energy_freq(void)
 {
 	if (!sched_energy_freq())
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a75ea07..2961e29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4218,8 +4218,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		 * during load balancing, but in these cases it seems wise to trigger
 		 * as single request after load balancing is done.
 		 */
-		if (task_sleep)
-			update_capacity_of(cpu_of(rq));
+		if (task_sleep) {
+			if (rq->cfs.nr_running)
+				update_capacity_of(cpu_of(rq));
+			else if (sched_energy_freq())
+				cpufreq_sched_reset_cap(cpu_of(rq));
+		}
 	}
 	hrtick_update(rq);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1f0b433..ad9293b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1424,9 +1424,12 @@ static inline bool sched_energy_freq(void)
 
 #ifdef CONFIG_CPU_FREQ_GOV_SCHED
 void cpufreq_sched_set_cap(int cpu, unsigned long util);
+void cpufreq_sched_reset_cap(int cpu);
 #else
 static inline void cpufreq_sched_set_cap(int cpu, unsigned long util)
 { }
+static inline void cpufreq_sched_reset_cap(int cpu)
+{ }
 #endif
 
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
-- 
2.5.0



* [RFC PATCH 05/14] sched/fair: jump to max OPP when crossing UP threshold
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (3 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 04/14] sched/{fair,cpufreq_sched}: add reset_capacity interface Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 06/14] sched/cpufreq_sched: modify pcpu_capacity handling Patrick Bellasi
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm, Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

Since the true utilization of a long-running task is not detectable while
it is running, and might be bigger than the current cpu capacity, create
maximum cpu capacity head room by requesting the maximum cpu capacity once
the cpu usage, scaled by the capacity margin, exceeds the current capacity.
This is also done to harm the performance of the task as little as possible.
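The trigger condition added to task_tick_fair() below compares the current
capacity against the margin-scaled usage. A standalone sketch of just that
check (illustrative only; 1024 is assumed for SCHED_LOAD_SCALE, 1280 is the
series' capacity_margin, and crossed_up_threshold() is a made-up name):

```c
#include <assert.h>

#define LOAD_SCALE	1024	/* assumed SCHED_LOAD_SCALE */
#define CAPACITY_MARGIN	1280

/*
 * True when cpu usage, scaled by the capacity margin, no longer fits
 * in the current capacity, i.e. usage exceeds ~80% of capacity_curr
 * (1024/1280 = 0.8). Crossing it requests a jump to the max OPP.
 */
static int crossed_up_threshold(unsigned long capacity_curr,
				unsigned long util)
{
	return capacity_curr * LOAD_SCALE < util * CAPACITY_MARGIN;
}
```

At the highest OPP (capacity_curr == 1024) the threshold sits at a
utilization of 820 on the 1024 scale.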

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/fair.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2961e29..6197b3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7864,6 +7864,24 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (numabalancing_enabled)
 		task_tick_numa(rq, curr);
+
+	/*
+	 * To make free room for a task that is building up its "real"
+	 * utilization and to harm its performance the least, request a
+	 * jump to max OPP as soon as get_cpu_usage() crosses the UP
+	 * threshold. The UP threshold is built relative to the current
+	 * capacity (OPP), by using capacity_margin.
+	 */
+	if (sched_energy_freq()) {
+		int cpu = cpu_of(rq);
+		unsigned long capacity_orig = capacity_orig_of(cpu);
+		unsigned long capacity_curr = capacity_curr_of(cpu);
+
+		if (capacity_curr < capacity_orig &&
+		    (capacity_curr * SCHED_LOAD_SCALE) <
+		    (cpu_util(cpu) * capacity_margin))
+			cpufreq_sched_set_cap(cpu, capacity_orig);
+	}
 }
 
 /*
-- 
2.5.0



* [RFC PATCH 06/14] sched/cpufreq_sched: modify pcpu_capacity handling
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (4 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 05/14] sched/fair: jump to max OPP when crossing UP threshold Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 07/14] sched/fair: cpufreq_sched triggers for load balancing Patrick Bellasi
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm, Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

Use the cpu argument of cpufreq_sched_set_cap() to handle per-cpu writes,
as the function can be called remotely (e.g., from load balancing code).

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Acked-by: Michael Turquette <mturquette@baylibre.com>
---
 kernel/sched/cpufreq_sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index e6b4a22..27f2cec 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -151,7 +151,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
 	unsigned long capacity_max = 0;
 
 	/* update per-cpu capacity request */
-	__this_cpu_write(pcpu_capacity, capacity);
+	per_cpu(pcpu_capacity, cpu) = capacity;
 
 	policy = cpufreq_cpu_get(cpu);
 	if (IS_ERR_OR_NULL(policy)) {
-- 
2.5.0



* [RFC PATCH 07/14] sched/fair: cpufreq_sched triggers for load balancing
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (5 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 06/14] sched/cpufreq_sched: modify pcpu_capacity handling Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 08/14] sched/tune: add detailed documentation Patrick Bellasi
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm, Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

As we don't trigger frequency changes from {en,de}queue_task_fair() during
load balancing, we need to do so explicitly on the load balancing paths.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/fair.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6197b3b..955dfe1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7030,6 +7030,11 @@ more_balance:
 		 * ld_moved     - cumulative load moved across iterations
 		 */
 		cur_ld_moved = detach_tasks(&env);
+		/*
+		 * We want to potentially lower env.src_cpu's OPP.
+		 */
+		if (cur_ld_moved)
+			update_capacity_of(env.src_cpu);
 
 		/*
 		 * We've detached some tasks from busiest_rq. Every
@@ -7044,6 +7049,10 @@ more_balance:
 		if (cur_ld_moved) {
 			attach_tasks(&env);
 			ld_moved += cur_ld_moved;
+			/*
+			 * We want to potentially raise env.dst_cpu's OPP.
+			 */
+			update_capacity_of(env.dst_cpu);
 		}
 
 		local_irq_restore(flags);
@@ -7398,8 +7407,13 @@ static int active_load_balance_cpu_stop(void *data)
 		schedstat_inc(sd, alb_count);
 
 		p = detach_one_task(&env);
-		if (p)
+		if (p) {
 			schedstat_inc(sd, alb_pushed);
+			/*
+			 * We want to potentially lower env.src_cpu's OPP.
+			 */
+			update_capacity_of(env.src_cpu);
+		}
 		else
 			schedstat_inc(sd, alb_failed);
 	}
@@ -7408,8 +7422,13 @@ out_unlock:
 	busiest_rq->active_balance = 0;
 	raw_spin_unlock(&busiest_rq->lock);
 
-	if (p)
+	if (p) {
 		attach_one_task(target_rq, p);
+		/*
+		 * We want to potentially raise target_cpu's OPP.
+		 */
+		update_capacity_of(target_cpu);
+	}
 
 	local_irq_enable();
 
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread
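The triggering logic added by this patch can be modelled in isolation. The
following is a user-space sketch (not kernel code; the counters and helper
names are illustrative) of the rule the hunks above implement: an OPP update
for the source CPU fires only when detach_tasks() actually moved something,
and one for the destination CPU fires only after those tasks are attached.

```c
#include <assert.h>

/* Counters recording how often each OPP-update hook fired. */
static int src_opp_updates;
static int dst_opp_updates;

static void update_capacity_of_src(void) { src_opp_updates++; }
static void update_capacity_of_dst(void) { dst_opp_updates++; }

/*
 * One balancing step: nr_detachable models the detach_tasks() result.
 * Returns the number of tasks moved.
 */
static int load_balance_step(int nr_detachable)
{
	int cur_ld_moved = nr_detachable;

	/* We want to potentially lower the src CPU's OPP. */
	if (cur_ld_moved)
		update_capacity_of_src();

	if (cur_ld_moved) {
		/* attach_tasks(); we may then raise the dst CPU's OPP. */
		update_capacity_of_dst();
	}

	return cur_ld_moved;
}
```

When no task could be detached, neither hook fires and both CPUs keep their
current OPPs, which is exactly the behaviour the cur_ld_moved checks encode.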

* [RFC PATCH 08/14] sched/tune: add detailed documentation
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (6 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 07/14] sched/fair: cpufreq_sched triggers for load balancing Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-09-02  6:49   ` [RFC,08/14] " Ricky Liang
  2015-08-19 18:47 ` [RFC PATCH 09/14] sched/tune: add sysctl interface to define a boost value Patrick Bellasi
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-pm, Jonathan Corbet, linux-doc

The topic of a single, simple power-performance tunable that is wholly
scheduler centric and has well-defined, predictable properties has
come up on several occasions in the past. With techniques such as
scheduler-driven DVFS, we now have a good framework for implementing
such a tunable.

This patch provides a detailed description of the motivations and design
decisions behind the SchedTune implementation.

cc: Jonathan Corbet <corbet@lwn.net>
cc: linux-doc@vger.kernel.org
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
---
 Documentation/scheduler/sched-tune.txt | 367 +++++++++++++++++++++++++++++++++
 1 file changed, 367 insertions(+)
 create mode 100644 Documentation/scheduler/sched-tune.txt

diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
new file mode 100644
index 0000000..cb795e6
--- /dev/null
+++ b/Documentation/scheduler/sched-tune.txt
@@ -0,0 +1,367 @@
+             Central, scheduler-driven, power-performance control
+                               (EXPERIMENTAL)
+
+Abstract
+========
+
+The topic of a single, simple power-performance tunable that is wholly
+scheduler centric and has well-defined, predictable properties has come up
+on several occasions in the past [1,2]. With techniques such as scheduler
+driven DVFS [3], we now have a good framework for implementing such a tunable.
+This document describes the overall ideas behind its design and implementation.
+
+
+Table of Contents
+=================
+
+1. Motivation
+2. Introduction
+3. Signal Boosting Strategy
+4. OPP selection using boosted CPU utilization
+5. Per task group boosting
+6. Questions and Answers
+   - What about "auto" mode?
+   - What about boosting on a congested system?
+   - How are multiple groups of tasks with different boost values managed?
+7. References
+
+
+1. Motivation
+=============
+
+Sched-DVFS [3] is a new event-driven cpufreq governor which allows the
+scheduler to select the optimal DVFS operating point (OPP) for running a task
+allocated to a CPU. The introduction of sched-DVFS enables running workloads at
+the most energy efficient OPPs.
+
+However, sometimes it may be desired to intentionally boost the performance of
+a workload even if that could imply a reasonable increase in energy
+consumption. For example, in order to reduce the response time of a task, we
+may want to run the task at a higher OPP than the one that is actually required
+by its CPU bandwidth demand.
+
+This last requirement is especially important if we consider that one of the
+main goals of the sched-DVFS component is to replace all currently available
+CPUFreq policies. Since sched-DVFS is event based, as opposed to the sampling
+driven governors we currently have, it is already more responsive at selecting
+the optimal OPP to run tasks allocated to a CPU. However, just tracking the
+actual task load demand may not be enough from a performance standpoint.  For
+example, it is not possible to get behaviors similar to those provided by the
+"performance" and "interactive" CPUFreq governors.
+
+This document describes an implementation of a tunable, stacked on top of the
+sched-DVFS which extends its functionality to support task performance
+boosting.
+
+By "performance boosting" we mean the reduction of the time required to
+complete a task activation, i.e. the time elapsed from a task wakeup to its
+next deactivation (e.g. because it goes back to sleep or it terminates).  For
+example, if we consider a simple periodic task which executes the same workload
+for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
+that task must complete each of its activations in less than 5[s].
+
+A previous attempt [5] to introduce such a boosting feature has not been
+successful mainly because of the complexity of the proposed solution.  The
+approach described in this document exposes a single simple interface to
+user-space.  This single tunable knob allows the tuning of system wide
+scheduler behaviours ranging from energy efficiency at one end through to
+incremental performance boosting at the other end.  This first tunable affects
+all tasks. However, a more advanced extension of the concept is also provided
+which uses CGroups to boost the performance of only selected tasks while using
+the energy efficient default for all others.
+
+The rest of this document introduces in more detail the proposed solution,
+which has been named SchedTune.
+
+
+2. Introduction
+===============
+
+SchedTune exposes a simple user-space interface with a single power-performance
+tunable:
+
+  /proc/sys/kernel/sched_cfs_boost
+
+This permits expressing a boost value as an integer in the range [0..100].
+
+A value of 0 (default) configures the CFS scheduler for maximum energy
+efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
+required to satisfy their workload demand.
+A value of 100 configures the scheduler for maximum performance, which
+translates to the selection of the maximum OPP on that CPU.
+
+Intermediate values between 0 and 100 can be set to suit other scenarios,
+for example to favour interactive response, or depending on other system
+events (battery level, etc.).
+
+A CGroup based extension is also provided, which permits further user-space
+defined task classification to tune the scheduler for different goals depending
+on the specific nature of the task, e.g. background vs interactive vs
+low-priority.
+
+The overall design of the SchedTune module is built on top of "Per-Entity Load
+Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
+Performance Point (OPP) selection.
+Each time a task is allocated on a CPU, sched-DVFS has the opportunity to tune
+the operating frequency of that CPU to better match the workload demand. The
+selection of the actual OPP being activated is influenced by the global boost
+value, or the boost value for the task CGroup when in use.
+
+This simple biasing approach leverages existing frameworks, which means minimal
+modifications to the scheduler, and yet it makes it possible to achieve a range
+of different behaviours, all from a single simple tunable knob.
+The only new concept introduced is that of signal boosting.
+
+
+3. Signal Boosting Strategy
+===========================
+
+The whole PELT machinery works based on the value of a few load tracking signals
+which basically track the CPU bandwidth requirements for tasks and the capacity
+of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
+some of these load tracking signals to make a task or RQ appear more demanding
+than it actually is.
+
+Which signals have to be inflated depends on the specific "consumer".  However,
+independently of the specific (signal, consumer) pair, it is important to
+define a simple and possibly consistent strategy for the concept of boosting a
+signal.
+
+A boosting strategy defines how the "abstract" user-space defined
+sched_cfs_boost value is translated into an internal "margin" value to be added
+to a signal to get its inflated value:
+
+  margin         := boosting_strategy(sched_cfs_boost, signal)
+  boosted_signal := signal + margin
+
+Different boosting strategies were identified and analyzed before selecting the
+one found to be most effective.
+
+Signal Proportional Compensation (SPC)
+--------------------------------------
+
+In this boosting strategy the sched_cfs_boost value is used to compute a
+margin which is proportional to the complement of the original signal.
+When a signal has a maximum possible value, its complement is defined as
+the delta between the actual value and that maximum.
+
+Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
+the maximum possible value, the margin becomes:
+
+	margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)
+
+Using this boosting strategy:
+- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
+- each intermediate sched_cfs_boost value inflates the signal in question by a
+  quantity which is proportional to its headroom below the maximum value.
+
+For example, by applying the SPC boosting strategy to the selection of the OPP
+to run a task it is possible to achieve these behaviors:
+
+-   0% boosting: run the task at the minimum OPP required by its workload
+- 100% boosting: run the task at the maximum OPP available for the CPU
+-  50% boosting: run at the half-way OPP between minimum and maximum
+
+This means that, at 50% boosting, a task will be scheduled to run half-way
+between its current performance and the maximum theoretically achievable on
+the specific target platform.
+
+A graphical representation of an SPC boosted signal is represented in the
+following figure where:
+ a) "-" represents the original signal
+ b) "b" represents a  50% boosted signal
+ c) "p" represents a 100% boosted signal
+
+
+   ^
+   |  SCHED_LOAD_SCALE
+   +-----------------------------------------------------------------+
+   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
+   |
+   |                                             boosted_signal
+   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
+   |
+   |                                            original signal
+   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
+   |                                          |
+   |bbbbbbbbbbbbbbbbbb                        |
+   |                                          |
+   |                                          |
+   |                                          |
+   |                  +-----------------------+
+   |                  |
+   |                  |
+   |                  |
+   |------------------+
+   |
+   |
+   +----------------------------------------------------------------------->
+
+The plot above shows a ramped load signal (the 'original signal') and its
+boosted equivalents. For each step of the original signal, the boosted signal
+corresponding to a 50% boost is midway between the original signal and the
+upper bound. Boosting by 100% generates a boosted signal which is always
+saturated at the upper bound.
+
+
+4. OPP selection using boosted CPU utilization
+==============================================
+
+It is worth calling out that the implementation does not introduce any new load
+signals. Instead, it provides an API to tune existing signals. This tuning is
+done on demand and only in scheduler code paths where it is sensible to do so.
+The new API calls are defined to return either the default signal or a boosted
+one, depending on the value of sched_cfs_boost. This is a clean and
+non-invasive modification of the existing code paths.
+
+The signal representing a CPU's utilization is boosted according to the
+previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
+(i.e. a CFS run-queue) to appear more utilized than it actually is.
+
+Thus, with sched_cfs_boost enabled, we have the following main functions to
+get the current utilization of a CPU:
+
+  cpu_util()
+  boosted_cpu_util()
+
+The new boosted_cpu_util() is similar to cpu_util() but returns a boosted
+utilization signal which is a function of the sched_cfs_boost value.
+
+This function is used in the CFS scheduler code paths where sched-DVFS needs to
+decide the OPP to run a CPU at.
+For example, this allows selecting the highest OPP for a CPU which has
+the boost value set to 100%.
+
+
+5. Per task group boosting
+==========================
+
+The availability of a single knob which is used to boost all tasks in the
+system is certainly a simple solution but it quite likely doesn't fit many
+utilization scenarios, especially in the mobile device space.
+
+For example, on battery powered devices there usually are many background
+services which are long running and need energy efficient scheduling. On the
+other hand, some applications are more performance sensitive and require an
+interactive response and/or maximum performance, regardless of the energy cost.
+To better service such scenarios, the SchedTune implementation has an extension
+that provides a more fine grained boosting interface.
+
+A new CGroup controller, namely "schedtune", can be enabled which allows the
+definition and configuration of task groups with different boost values.
+Tasks that require special performance can be put into separate CGroups.
+The value of the boost associated with the tasks in this group can be specified
+using a single knob exposed by the CGroup controller:
+
+   schedtune.boost
+
+This knob allows the definition of a boost value that is to be used for
+SPC boosting of all tasks attached to this group.
+
+The current schedtune controller implementation is really simple and has these
+main characteristics:
+
+  1) It is only possible to create 1 level depth hierarchies
+
+     The root control group defines the system-wide boost value to be applied
+     by default to all tasks. Its direct subgroups are named "boost groups" and
+     they define the boost value for a specific set of tasks.
+     Further nested subgroups are not allowed since they do not have a sensible
+     meaning from a user-space standpoint.
+
+  2) It is possible to define only a limited number of "boost groups"
+
+     This number is defined at compile time and by default configured to 16.
+     This is a design decision motivated by two main reasons:
+     a) In a real system we do not expect utilization scenarios with more than
+        a few boost groups. For example, a reasonable collection of groups
+        could be just "background", "interactive" and "performance".
+     b) It simplifies the implementation considerably, especially for the code
+	which has to compute the per CPU boosting once there are multiple
+        RUNNABLE tasks with different boost values.
+
+Such a simple design should allow servicing the main utilization scenarios
+identified so far. It provides a simple interface which can be used to manage
+the power-performance of all tasks or only selected tasks.
+Moreover, this interface can be easily integrated by user-space run-times (e.g.
+Android, ChromeOS) to implement a QoS solution for task boosting based on tasks
+classification, which has been a long standing requirement.
+
+Setup and usage
+---------------
+
+0. Use a kernel with CGROUP_SCHEDTUNE support enabled
+
+1. Check that the "schedtune" CGroup controller is available:
+
+   root@linaro-nano:~# cat /proc/cgroups
+   #subsys_name	hierarchy	num_cgroups	enabled
+   cpuset  	0		1		1
+   cpu     	0		1		1
+   schedtune	0		1		1
+
+2. Mount a tmpfs to create the CGroups mount point (Optional)
+
+   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
+
+3. Mount the "schedtune" controller
+
+   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
+   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
+
+4. Setup the system-wide boost value (Optional)
+
+   If not configured, the root control group has a 0% boost value, which
+   basically disables boosting for all tasks in the system, thus running
+   them in an energy-efficient mode.
+
+   root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost
+
+5. Create task groups and configure their specific boost value (Optional)
+
+   For example, here we create a "performance" boost group configured to
+   boost all its tasks to 100%:
+
+   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
+   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
+
+6. Move tasks into the boost group
+
+   For example, the following moves the task with PID $TASKPID (and all its
+   threads) into the "performance" boost group.
+
+   root@linaro-nano:~# echo "TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
+
+This simple configuration allows only the threads of the $TASKPID task to
+run, when needed, at the highest OPP on the most capable CPU of the system.
+
+
+6. Questions and Answers
+========================
+
+What about "auto" mode?
+-----------------------
+
+The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
+with some suitable user-space element. This element could use the exposed
+system-wide or cgroup based interface.
+
+How are multiple groups of tasks with different boost values managed?
+---------------------------------------------------------------------
+
+The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
+on a CPU. Once sched-DVFS selects the OPP to run a CPU at, the CPU utilization
+is boosted with a value which is the maximum of the boost values of the
+currently RUNNABLE tasks in its RQ.
+
+This allows sched-DVFS to boost a CPU only while there are boosted tasks ready
+to run and switch back to the energy efficient mode as soon as the last boosted
+task is dequeued.
+
+
+7. References
+=============
+[1] http://lwn.net/Articles/552889
+[2] http://lkml.org/lkml/2012/5/18/91
+[3] http://lkml.org/lkml/2015/6/26/620
+
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread
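The SPC computation described in the documentation above reduces to a couple
of integer operations. The following user-space C sketch (SCHED_LOAD_SCALE
assumed to be 1024, the usual value; function names are illustrative, not the
kernel's) reproduces the margin and boosted-signal formulas from section 3:

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024UL

/* margin := boost% * (SCHED_LOAD_SCALE - signal) / 100 */
static unsigned long spc_margin(unsigned long signal, unsigned long boost_pct)
{
	return boost_pct * (SCHED_LOAD_SCALE - signal) / 100;
}

/* boosted_signal := signal + margin */
static unsigned long spc_boosted(unsigned long signal, unsigned long boost_pct)
{
	return signal + spc_margin(signal, boost_pct);
}
```

With a 50% boost, a signal of 512 lands exactly midway between its original
value and SCHED_LOAD_SCALE (768), matching the plot in section 3; a 100%
boost saturates any signal at SCHED_LOAD_SCALE.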

* [RFC PATCH 09/14] sched/tune: add sysctl interface to define a boost value
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (7 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 08/14] sched/tune: add detailed documentation Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 10/14] sched/fair: add function to convert boost value into "margin" Patrick Bellasi
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm

The current (CFS) scheduler implementation does not allow boosting
task performance by running tasks at a higher OPP than the minimum
required to meet their workload demands.

To support task performance boosting, the scheduler should provide a
"knob" which allows tuning how much the system is optimised for energy
efficiency vs performance.

This patch is the first of a series which provides a simple interface to
define a tuning knob. One system-wide "boost" tunable is exposed via:
  /proc/sys/kernel/sched_cfs_boost
which can be configured in the range [0..100], to define a percentage
where:
  - 0%   boost selects the "standard" mode, scheduling tasks at the
         minimum capacity required by their workload demand
  - 100% boost pushes task performance to the maximum, regardless of
         the incurred energy consumption

A boost value between these two boundaries biases the power/performance
trade-off: the higher the boost value, the more the scheduler favours
performance boosting over energy efficiency.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
---
 include/linux/sched/sysctl.h | 16 ++++++++++++++++
 init/Kconfig                 | 26 ++++++++++++++++++++++++++
 kernel/sched/Makefile        |  1 +
 kernel/sched/tune.c          | 17 +++++++++++++++++
 kernel/sysctl.c              | 11 +++++++++++
 5 files changed, 71 insertions(+)
 create mode 100644 kernel/sched/tune.c

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731..4479e48 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -77,6 +77,22 @@ extern int sysctl_sched_rt_runtime;
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
 #endif
 
+#ifdef CONFIG_SCHED_TUNE
+extern unsigned int sysctl_sched_cfs_boost;
+int sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
+				   void __user *buffer, size_t *length,
+				   loff_t *ppos);
+static inline unsigned int get_sysctl_sched_cfs_boost(void)
+{
+	return sysctl_sched_cfs_boost;
+}
+#else
+static inline unsigned int get_sysctl_sched_cfs_boost(void)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern unsigned int sysctl_sched_autogroup_enabled;
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index af09b4f..7fa3419 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1220,6 +1220,32 @@ config SCHED_AUTOGROUP
 	  desktop applications.  Task group autogeneration is currently based
 	  upon task session.
 
+config SCHED_TUNE
+	bool "Boosting for CFS tasks (EXPERIMENTAL)"
+	help
+	  This option enables the system-wide support for task boosting.
+	  When this support is enabled a new sysctl interface is exposed to
+	  userspace via:
+	     /proc/sys/kernel/sched_cfs_boost
+	  which allows to set a system-wide boost value in range [0..100].
+
+	  The current boosting strategy is implemented in such a way that:
+	  - a 0% boost value selects the "standard" mode, scheduling all
+	    tasks at the minimum capacity required by their workload
+	    demand
+	  - a 100% boost value pushes task performance to the maximum,
+	    regardless of the incurred energy consumption
+
+	  A boost value between these two boundaries biases the
+	  power/performance trade-off: the higher the boost value, the
+	  more the scheduler is biased toward performance boosting
+	  instead of energy efficiency.
+
+	  Since this support exposes a single system-wide knob, the specified
+	  boost value is applied to all (CFS) tasks in the system.
+
+	  If unsure, say N.
+
 config SYSFS_DEPRECATED
 	bool "Enable deprecated sysfs features to support old userspace tools"
 	depends on SYSFS
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 90ed832..f804ef3 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -18,5 +18,6 @@ obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_SCHED_TUNE) += tune.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
 obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
new file mode 100644
index 0000000..4c44b1a
--- /dev/null
+++ b/kernel/sched/tune.c
@@ -0,0 +1,17 @@
+#include "sched.h"
+
+unsigned int sysctl_sched_cfs_boost __read_mostly;
+
+int
+sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
+			       void __user *buffer, size_t *lenp,
+			       loff_t *ppos)
+{
+	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (ret || !write)
+		return ret;
+
+	return 0;
+}
+
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 19b62b5..2b4673e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -433,6 +433,17 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &one,
 	},
 #endif
+#ifdef CONFIG_SCHED_TUNE
+	{
+		.procname	= "sched_cfs_boost",
+		.data		= &sysctl_sched_cfs_boost,
+		.maxlen		= sizeof(sysctl_sched_cfs_boost),
+		.mode		= 0644,
+		.proc_handler	= &sysctl_sched_cfs_boost_handler,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread
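The [0..100] range is enforced by proc_dointvec_minmax() through the
.extra1/.extra2 bounds in the sysctl table above, which rejects out-of-range
writes rather than clamping them. A user-space model of the resulting write
semantics (a sketch with an illustrative helper name, not kernel code):

```c
#include <errno.h>

/*
 * Model of writing to /proc/sys/kernel/sched_cfs_boost: values outside
 * [0..100] are rejected with -EINVAL and the stored value is untouched;
 * in-range values are stored.
 */
static int sched_cfs_boost = 0;	/* default: maximum energy efficiency */

static int write_cfs_boost(int val)
{
	if (val < 0 || val > 100)
		return -EINVAL;

	sched_cfs_boost = val;
	return 0;
}
```

From a shell this corresponds to `echo 50 > /proc/sys/kernel/sched_cfs_boost`
succeeding while `echo 101 > ...` fails with "Invalid argument".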

* [RFC PATCH 10/14] sched/fair: add function to convert boost value into "margin"
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (8 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 09/14] sched/tune: add sysctl interface to define a boost value Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 11/14] sched/fair: add boosted CPU usage Patrick Bellasi
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm

The basic idea of the boost knob is to "artificially inflate" a signal
to make a task or logical CPU appear more demanding than it actually
is. Independently of the specific signal, a consistent and possibly
simple semantics for the concept of "signal boosting" must define:
1. how we translate the boost percentage into a "margin" value to be added
   to the original signal to inflate
2. what is the meaning of a boost value from a user-space perspective

This patch provides the implementation of a possible boost semantic,
named "Signal Proportional Compensation" (SPC), where the boost
percentage (BP) is used to compute a margin (M) which is proportional to
the complement of the original signal (OS):
  M = BP * (SCHED_LOAD_SCALE - OS)
The computed margin is then added to the OS to obtain the Boosted Signal (BS):
  BS = OS + M

The proposed boost semantic has these main features:
- each signal gets a boost which is proportional to its delta with respect
  to the maximum available capacity in the system (i.e. SCHED_LOAD_SCALE)
- a 100% boost has a clear meaning from a user-space perspective, since it
  simply means running (possibly) "all" tasks at the max OPP
- each boost value improves the task performance by a quantity which is
  proportional to the maximum achievable performance on that system
Thus this semantics enforces a behaviour whereby:

  50% boosting means running half-way between the current and the
    maximum performance which a task could achieve on that system

This patch provides the code to implement a fast integer division to
convert a boost percentage (BP) value into a margin (M).

NOTE: this code is suitable for all signals operating in range
      [0..SCHED_LOAD_SCALE]

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
---
 kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 955dfe1..15fde75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,6 +4730,44 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	return 1;
 }
 
+#ifdef CONFIG_SCHED_TUNE
+
+static unsigned long
+schedtune_margin(unsigned long signal, unsigned long boost)
+{
+	unsigned long long margin = 0;
+
+	/*
+	 * Signal proportional compensation (SPC)
+	 *
+	 * The Boost (B) value is used to compute a Margin (M) which is
+	 * proportional to the complement of the original Signal (S):
+	 *   M = B * (SCHED_LOAD_SCALE - S)
+	 * The obtained M could be used by the caller to "boost" S.
+	 */
+	margin  = SCHED_LOAD_SCALE - signal;
+	margin *= boost;
+
+	/*
+	 * Fast integer division by constant:
+	 *  Constant   :                 (C) = 100
+	 *  Precision  : 0.1%            (P) = 0.1
+	 *  Reference  : C * 100 / P     (R) = 100000
+	 *
+	 * Thus:
+	 *  Shift bits : ceil(log(R,2))  (S) = 17
+	 *  Mult const : round(2^S/C)    (M) = 1311
+	 * With this pair, (margin * 1311) >> 17 approximates
+	 * margin / 100 within the stated 0.1% precision.
+	 */
+	margin  *= 1311;
+	margin >>= 17;
+
+	return margin;
+}
+
+#endif /* CONFIG_SCHED_TUNE */
+
 /*
  * find_idlest_group finds and returns the least busy CPU group within the
  * domain.
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread
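The multiply-and-shift constants above (M = 1311, S = 17) can be checked
exhaustively in user space over the whole input range schedtune_margin() can
produce, i.e. boost in [0..100] and signal in [0..SCHED_LOAD_SCALE] (assumed
1024 here). This is a verification sketch, not kernel code:

```c
#define SCHED_LOAD_SCALE 1024ULL

/* (x * 1311) >> 17, as used by schedtune_margin() to approximate x / 100 */
static unsigned long long fast_div100(unsigned long long x)
{
	return (x * 1311) >> 17;
}

/*
 * Since 1311/2^17 slightly overestimates 1/100, the approximation can
 * never undershoot the exact truncated division; over this input range
 * it overshoots by at most 1. Returns 1 if that holds everywhere.
 */
static int spc_fast_div_ok(void)
{
	unsigned long long boost, signal;

	for (boost = 0; boost <= 100; boost++) {
		for (signal = 0; signal <= SCHED_LOAD_SCALE; signal++) {
			unsigned long long x = boost * (SCHED_LOAD_SCALE - signal);

			if (fast_div100(x) < x / 100 ||
			    fast_div100(x) > x / 100 + 1)
				return 0;
		}
	}
	return 1;
}
```

At the range maximum (x = 100 * 1024 = 102400) the trick still yields
exactly 1024, so a 100% boost of a zero signal produces a full-scale margin.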

* [RFC PATCH 11/14] sched/fair: add boosted CPU usage
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (9 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 10/14] sched/fair: add function to convert boost value into "margin" Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 12/14] sched/tune: add initial support for CGroups based boosting Patrick Bellasi
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm

The CPU usage signal is used by the scheduler as an estimate of the
overall bandwidth currently allocated on a CPU. When sched-DVFS is in
use, this signal affects the selection of the operating point (OPP)
required to accommodate all the workload allocated to a CPU.
A convenient, and also minimally intrusive, way to boost the performance
of tasks running on a CPU is to boost the CPU usage signal each time it
is used to select an OPP.

This patch introduces a new function:
  boosted_cpu_util(cpu)
to return a boosted value for the utilization of a specified CPU.
The margin added to the original usage is:
  1. computed based on the "boosting strategy" in use
  2. proportional to the system-wide boost value defined via the
     provided user-space interface

The boosted signal is used (transparently) by sched-DVFS each time it
needs an estimate of the capacity required for a CPU.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
---
 kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 15fde75..633fcab4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4083,6 +4083,7 @@ static inline void hrtick_update(struct rq *rq)
 static unsigned int capacity_margin = 1280;
 static unsigned long capacity_orig_of(int cpu);
 static int cpu_util(int cpu);
+static inline unsigned long boosted_cpu_util(int cpu);
 
 static void update_capacity_of(int cpu)
 {
@@ -4091,7 +4092,8 @@ static void update_capacity_of(int cpu)
 	if (!sched_energy_freq())
 		return;
 
-	req_cap = cpu_util(cpu) * capacity_margin / capacity_orig_of(cpu);
+	req_cap = boosted_cpu_util(cpu);
+	req_cap = req_cap * capacity_margin / capacity_orig_of(cpu);
 	cpufreq_sched_set_cap(cpu, req_cap);
 }
 
@@ -4766,8 +4768,36 @@ schedtune_margin(unsigned long signal, unsigned long boost)
 	return margin;
 }
 
+static inline unsigned int
+schedtune_cpu_margin(unsigned long util)
+{
+	unsigned int boost = get_sysctl_sched_cfs_boost();
+
+	if (boost == 0)
+		return 0;
+
+	return schedtune_margin(util, boost);
+}
+
+#else /* CONFIG_SCHED_TUNE */
+
+static inline unsigned int
+schedtune_cpu_margin(unsigned long util)
+{
+	return 0;
+}
+
 #endif /* CONFIG_SCHED_TUNE */
 
+static inline unsigned long
+boosted_cpu_util(int cpu)
+{
+	unsigned long util = cpu_util(cpu);
+	unsigned long margin = schedtune_cpu_margin(util);
+
+	return util + margin;
+}
+
 /*
  * find_idlest_group finds and returns the least busy CPU group within the
  * domain.
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 12/14] sched/tune: add initial support for CGroups based boosting
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (10 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 11/14] sched/fair: add boosted CPU usage Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 13/14] sched/tune: compute and keep track of per CPU boost value Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 14/14] sched/{fair,tune}: track RUNNABLE tasks impact on " Patrick Bellasi
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, linux-pm, Tejun Heo, Li Zefan, Johannes Weiner

To support task performance boosting, the use of a single knob has the
advantage of being a simple solution, from both the implementation and
the usability standpoints. However, on a real system it can be
difficult to identify a single value for the knob which fits the needs
of many different tasks. For example, some kernel threads and/or
user-space background services are better managed the "standard" way,
while we still want to be able to boost the performance of specific
workloads.

In order to improve the flexibility of the task boosting mechanism,
this patch is the first of a small series which extends the previous
implementation to introduce "per task group" support.
This first patch introduces just the basic CGroups support: a new
"schedtune" CGroups controller is added which allows a different boost
value to be configured for each group of tasks.
To keep the implementation simple but still effective for a boosting
strategy, the new controller:
  1. allows only a two-layer hierarchy
  2. supports only a limited number of boost groups

A two-layer hierarchy allows each task to be placed either:
  a) in the root control group
     thus being subject to a system-wide boosting value
  b) in a child of the root group
     thus being subject to the specific boost value defined by that
     "boost group"

The limited number of "boost groups" supported is mainly motivated by
the observation that on a real system it is typically useful to have
only a few classes of tasks which deserve different treatment.
For example, background vs foreground or interactive vs low-priority.
As an additional benefit, a limited number of boost groups also allows
for a simpler implementation, especially for the code required to
compute the boost value for CPUs which have runnable tasks belonging
to different boost groups.
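Assuming the controller introduced by this series, its intended usage from user-space can be sketched as follows. The mount point and PID are illustrative, and "schedtune.boost" is the name the "boost" attribute gets once prefixed by the controller name on a legacy hierarchy:

```shell
# Mount the schedtune controller on a legacy (v1) hierarchy;
# the mount point is illustrative.
mount -t cgroup -o schedtune none /sys/fs/cgroup/schedtune

# The root group defines the system-wide boost value
# (kept in sync with /proc/sys/kernel/sched_cfs_boost).
echo 10 > /sys/fs/cgroup/schedtune/schedtune.boost

# Create a boost group for interactive tasks with a larger boost.
mkdir /sys/fs/cgroup/schedtune/interactive
echo 90 > /sys/fs/cgroup/schedtune/interactive/schedtune.boost

# Move a task into the boost group (illustrative PID).
echo 1234 > /sys/fs/cgroup/schedtune/interactive/tasks
```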

cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
---
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  17 ++++
 kernel/sched/tune.c           | 200 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c               |   4 +
 4 files changed, 225 insertions(+)

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..23befa0 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -15,6 +15,10 @@ SUBSYS(cpu)
 SUBSYS(cpuacct)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_SCHEDTUNE)
+SUBSYS(schedtune)
+#endif
+
 #if IS_ENABLED(CONFIG_BLK_CGROUP)
 SUBSYS(blkio)
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 7fa3419..4555e97 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -982,6 +982,23 @@ config CGROUP_CPUACCT
 	  Provides a simple Resource Controller for monitoring the
 	  total CPU consumed by the tasks in a cgroup.
 
+config CGROUP_SCHEDTUNE
+	bool "CFS tasks boosting cgroup subsystem (EXPERIMENTAL)"
+	depends on SCHED_TUNE
+	help
+	  This option provides the "schedtune" controller which improves the
+	  flexibility of the task boosting mechanism by introducing the support
+	  to define "per task" boost values.
+
+	  This new controller:
+	  1. allows only a two-layer hierarchy, where the root defines the
+	     system-wide boost value and each of its direct children defines
+	     a different "class of tasks" to be boosted with a different value
+	  2. supports up to 16 different task classes, each of which can be
+	     configured with a different boost value
+
+	  Say N if unsure.
+
 config PAGE_COUNTER
        bool
 
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 4c44b1a..a26295c 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -1,7 +1,207 @@
+#include <linux/cgroup.h>
+#include <linux/err.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+
 #include "sched.h"
 
 unsigned int sysctl_sched_cfs_boost __read_mostly;
 
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+
+/*
+ * EAS scheduler tunables for task groups.
+ */
+
+/* SchedTune tunables for a group of tasks */
+struct schedtune {
+	/* SchedTune CGroup subsystem */
+	struct cgroup_subsys_state css;
+
+	/* Boost group allocated ID */
+	int idx;
+
+	/* Boost value for tasks on that SchedTune CGroup */
+	int boost;
+
+};
+
+static inline struct schedtune *css_st(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct schedtune, css) : NULL;
+}
+
+static inline struct schedtune *task_schedtune(struct task_struct *tsk)
+{
+	return css_st(task_css(tsk, schedtune_cgrp_id));
+}
+
+static inline struct schedtune *parent_st(struct schedtune *st)
+{
+	return css_st(st->css.parent);
+}
+
+/*
+ * SchedTune root control group
+ * The root control group is used to define a system-wide boosting tuning,
+ * which is applied to all tasks in the system.
+ * Task-specific boost tuning can be specified by creating and
+ * configuring a child control group under the root one.
+ * By default, system-wide boosting is disabled, i.e. no boosting is applied
+ * to tasks which are not in a child control group.
+ */
+static struct schedtune
+root_schedtune = {
+	.boost	= 0,
+};
+
+/*
+ * Maximum number of boost groups to support
+ * When per-task boosting is used we still allow only a limited number of
+ * boost groups for two main reasons:
+ * 1. on a real system we usually have only a few classes of workloads which
+ *    it makes sense to boost with different values (e.g. background vs
+ *    foreground tasks, interactive vs low-priority tasks)
+ * 2. a limited number allows for a simpler and more memory/time efficient
+ *    implementation especially for the computation of the per-CPU boost
+ *    value
+ */
+#define BOOSTGROUPS_COUNT 16
+
+/* Array of configured boostgroups */
+static struct schedtune *allocated_group[BOOSTGROUPS_COUNT] = {
+	&root_schedtune,
+	NULL,
+};
+
+static u64
+boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct schedtune *st = css_st(css);
+
+	return st->boost;
+}
+
+static int
+boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
+	    u64 boost)
+{
+	struct schedtune *st = css_st(css);
+
+	if (boost < 0 || boost > 100)
+		return -EINVAL;
+
+	st->boost = boost;
+	if (css == &root_schedtune.css)
+		sysctl_sched_cfs_boost = boost;
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name = "boost",
+		.read_u64 = boost_read,
+		.write_u64 = boost_write,
+	},
+	{ }	/* terminate */
+};
+
+static int
+schedtune_boostgroup_init(struct schedtune *st)
+{
+	/* Keep track of allocated boost groups */
+	allocated_group[st->idx] = st;
+
+	return 0;
+}
+
+static int
+schedtune_init(void)
+{
+	struct boost_groups *bg;
+	int cpu;
+
+	/* Initialize the per CPU boost groups */
+	for_each_possible_cpu(cpu) {
+		bg = &per_cpu(cpu_boost_groups, cpu);
+		memset(bg, 0, sizeof(struct boost_groups));
+	}
+
+	pr_info("  schedtune configured to support %d boost groups\n",
+		BOOSTGROUPS_COUNT);
+	return 0;
+}
+
+static struct cgroup_subsys_state *
+schedtune_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct schedtune *st;
+	int idx;
+
+	if (!parent_css) {
+		schedtune_init();
+		return &root_schedtune.css;
+	}
+
+	/* Allow only single-level hierarchies */
+	if (parent_css != &root_schedtune.css) {
+		pr_err("Nested SchedTune boosting groups not allowed\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* Allow only a limited number of boosting groups */
+	for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx)
+		if (!allocated_group[idx])
+			break;
+	if (idx == BOOSTGROUPS_COUNT) {
+		pr_err("Trying to create more than %d SchedTune boosting groups\n",
+		       BOOSTGROUPS_COUNT);
+		return ERR_PTR(-ENOSPC);
+	}
+
+	st = kzalloc(sizeof(*st), GFP_KERNEL);
+	if (!st)
+		goto out;
+
+	/* Initialize per CPUs boost group support */
+	st->idx = idx;
+	if (schedtune_boostgroup_init(st))
+		goto release;
+
+	return &st->css;
+
+release:
+	kfree(st);
+out:
+	return ERR_PTR(-ENOMEM);
+}
+
+static void
+schedtune_boostgroup_release(struct schedtune *st)
+{
+	/* Keep track of allocated boost groups */
+	allocated_group[st->idx] = NULL;
+}
+
+static void
+schedtune_css_free(struct cgroup_subsys_state *css)
+{
+	struct schedtune *st = css_st(css);
+
+	schedtune_boostgroup_release(st);
+	kfree(st);
+}
+
+struct cgroup_subsys schedtune_cgrp_subsys = {
+	.css_alloc	= schedtune_css_alloc,
+	.css_free	= schedtune_css_free,
+	.legacy_cftypes	= files,
+	.early_init	= 1,
+};
+
+#endif /* CONFIG_CGROUP_SCHEDTUNE */
+
 int
 sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
 			       void __user *buffer, size_t *lenp,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2b4673e..d42162c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -438,7 +438,11 @@ static struct ctl_table kern_table[] = {
 		.procname	= "sched_cfs_boost",
 		.data		= &sysctl_sched_cfs_boost,
 		.maxlen		= sizeof(sysctl_sched_cfs_boost),
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+		.mode		= 0444,
+#else
 		.mode		= 0644,
+#endif
 		.proc_handler	= &sysctl_sched_cfs_boost_handler,
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
-- 
2.5.0



* [RFC PATCH 13/14] sched/tune: compute and keep track of per CPU boost value
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (11 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 12/14] sched/tune: add initial support for CGroups based boosting Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  2015-08-19 18:47 ` [RFC PATCH 14/14] sched/{fair,tune}: track RUNNABLE tasks impact on " Patrick Bellasi
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm

When per-task boosting is enabled, we can have multiple RUNNABLE tasks
which are concurrently scheduled on the same CPU, each one with a
different boost value.
For example, we could have a scenario like this:

  Task   SchedTune CGroup   Boost Value
    T1               root            0
    T2       low-priority           10
    T3        interactive           90

In these conditions we expect a CPU to be configured according to a
proper "aggregation" of the boost values required by all the tasks
currently scheduled on that CPU.

A suitable aggregation function is one which tracks the MAX boost
value of all the tasks RUNNABLE on a CPU. This approach always
satisfies the most boost-demanding task while at the same time:
 a) boosting all the concurrently scheduled tasks, thus reducing
    potential co-scheduling side-effects on demanding tasks
 b) reducing the number of frequency switches requested of sched-DVFS,
    thus being friendlier to architectures with slow frequency
    switching times

Every time a task enters/exits the RQ of a CPU the max boost value
should be updated considering all the boost groups currently "affecting"
that CPU, i.e. which have at least one RUNNABLE task currently allocated
on that CPU.

This patch introduces the required support to keep track of the boost
groups currently affecting CPUs. Thanks to the limited number of boost
groups, a small and memory-efficient per-CPU array of boost group
values (cpu_boost_groups) is used. It is updated for each CPU entry by
schedtune_boostgroup_update(), but only when a schedtune CGroup boost
value is updated. This is expected to be a rare operation, perhaps
done just once, at system boot time.
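The MAX aggregation described above can be modelled in plain user-space C. This is only a sketch of the logic implemented by schedtune_cpu_update() in the patch, with made-up sketch_* names, driven by the T1/T2/T3 example from this commit message:

```c
#include <assert.h>

#define BOOSTGROUPS_COUNT 16

/* Per-CPU bookkeeping, mirroring struct boost_groups in the patch. */
struct bg_entry {
	unsigned int boost;	/* boost value of this group */
	unsigned int tasks;	/* RUNNABLE tasks of this group on the CPU */
};

/*
 * Sketch of the aggregation: the CPU boost is the MAX boost among the
 * root group (always active) and every group with at least one
 * RUNNABLE task on that CPU.
 */
static unsigned int sketch_boost_max(const struct bg_entry g[BOOSTGROUPS_COUNT])
{
	unsigned int boost_max = g[0].boost;	/* root group always counts */
	int idx;

	for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
		if (g[idx].tasks == 0)
			continue;
		if (g[idx].boost > boost_max)
			boost_max = g[idx].boost;
	}
	return boost_max;
}

/* The T1/T2/T3 scenario; interactive_tasks toggles T3's presence. */
static unsigned int sketch_example(unsigned int interactive_tasks)
{
	struct bg_entry g[BOOSTGROUPS_COUNT] = {
		[0] = { .boost =  0, .tasks = 1 },	/* T1: root          */
		[1] = { .boost = 10, .tasks = 1 },	/* T2: low-priority  */
		[2] = { .boost = 90, .tasks = interactive_tasks },	/* T3 */
	};

	return sketch_boost_max(g);
}
```

With T3 runnable the CPU is boosted to 90; once T3 leaves the CPU, the aggregated value falls back to the next active group (10).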

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
---
 kernel/sched/tune.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index a26295c..3223ef3 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -1,5 +1,6 @@
 #include <linux/cgroup.h>
 #include <linux/err.h>
+#include <linux/percpu.h>
 #include <linux/printk.h>
 #include <linux/slab.h>
 
@@ -74,6 +75,89 @@ static struct schedtune *allocated_group[BOOSTGROUPS_COUNT] = {
 	NULL,
 };
 
+/* SchedTune boost groups
+ * Keep track of all the boost groups which impact a CPU, for example when a
+ * CPU has two RUNNABLE tasks belonging to two different boost groups and thus
+ * likely with different boost values.
+ * Since on each system we expect only a limited number of boost groups, here
+ * we use a simple array to keep track of the metrics required to compute the
+ * maximum per-CPU boosting value.
+ */
+struct boost_groups {
+	/* Maximum boost value for all RUNNABLE tasks on a CPU */
+	unsigned boost_max;
+	struct {
+		/* The boost for tasks on that boost group */
+		unsigned boost;
+		/* Count of RUNNABLE tasks on that boost group */
+		unsigned tasks;
+	} group[BOOSTGROUPS_COUNT];
+};
+
+/* Boost groups affecting each CPU in the system */
+DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
+
+static void
+schedtune_cpu_update(int cpu)
+{
+	struct boost_groups *bg;
+	unsigned boost_max;
+	int idx;
+
+	bg = &per_cpu(cpu_boost_groups, cpu);
+
+	/* The root boost group is always active */
+	boost_max = bg->group[0].boost;
+	for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
+		/*
+		 * A boost group affects a CPU only if it has
+		 * RUNNABLE tasks on that CPU
+		 */
+		if (bg->group[idx].tasks == 0)
+			continue;
+		boost_max = max(boost_max, bg->group[idx].boost);
+	}
+
+	bg->boost_max = boost_max;
+}
+
+static int
+schedtune_boostgroup_update(int idx, int boost)
+{
+	struct boost_groups *bg;
+	int cur_boost_max;
+	int old_boost;
+	int cpu;
+
+	/* Update per CPU boost groups */
+	for_each_possible_cpu(cpu) {
+		bg = &per_cpu(cpu_boost_groups, cpu);
+
+		/*
+		 * Keep track of current boost values to compute the per CPU
+		 * maximum only when it has been affected by the new value of
+		 * the updated boost group
+		 */
+		cur_boost_max = bg->boost_max;
+		old_boost = bg->group[idx].boost;
+
+		/* Update the boost value of this boost group */
+		bg->group[idx].boost = boost;
+
+		/* Check if this update increases the current max */
+		if (boost > cur_boost_max && bg->group[idx].tasks) {
+			bg->boost_max = boost;
+			continue;
+		}
+
+		/* Check if this update has decreased current max */
+		if (cur_boost_max == old_boost && old_boost > boost)
+			schedtune_cpu_update(cpu);
+	}
+
+	return 0;
+}
+
 static u64
 boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
@@ -95,6 +179,9 @@ boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
 	if (css == &root_schedtune.css)
 		sysctl_sched_cfs_boost = boost;
 
+	/* Update CPU boost */
+	schedtune_boostgroup_update(st->idx, st->boost);
+
 	return 0;
 }
 
@@ -110,9 +197,19 @@ static struct cftype files[] = {
 static int
 schedtune_boostgroup_init(struct schedtune *st)
 {
+	struct boost_groups *bg;
+	int cpu;
+
 	/* Keep track of allocated boost groups */
 	allocated_group[st->idx] = st;
 
+	/* Initialize the per CPU boost groups */
+	for_each_possible_cpu(cpu) {
+		bg = &per_cpu(cpu_boost_groups, cpu);
+		bg->group[st->idx].boost = 0;
+		bg->group[st->idx].tasks = 0;
+	}
+
 	return 0;
 }
 
@@ -180,6 +277,9 @@ out:
 static void
 schedtune_boostgroup_release(struct schedtune *st)
 {
+	/* Reset this boost group */
+	schedtune_boostgroup_update(st->idx, 0);
+
 	/* Keep track of allocated boost groups */
 	allocated_group[st->idx] = NULL;
 }
-- 
2.5.0



* [RFC PATCH 14/14] sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value
  2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control Patrick Bellasi
                   ` (12 preceding siblings ...)
  2015-08-19 18:47 ` [RFC PATCH 13/14] sched/tune: compute and keep track of per CPU boost value Patrick Bellasi
@ 2015-08-19 18:47 ` Patrick Bellasi
  13 siblings, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-08-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, linux-pm

When per-task boosting is enabled, every time a task enters/exits a CPU
its boost value could impact the currently selected OPP for that CPU.
Thus, the "aggregated" boost value for that CPU potentially needs to
be updated to match the current maximum boost value among all the tasks
currently RUNNABLE on that CPU.

This patch introduces the required support to keep track of which boost
groups are impacting a CPU. Each time a task is enqueued to/dequeued
from a CPU, its boost group is used to update a per-CPU counter of
RUNNABLE tasks for that group.
Only when the number of runnable tasks for a specific boost group
becomes 1 or 0 does the corresponding boost group change its effect on
that CPU, specifically:
  a) boost_group::tasks == 1: this boost group starts to impact the CPU
  b) boost_group::tasks == 0: this boost group stops impacting the CPU
In each of these two conditions the aggregation function:
  schedtune_cpu_update(cpu)
may need to run in order to identify the new maximum boost value
required for the CPU.

The proposed patch minimizes the number of times the aggregation
function is executed, while still providing the required support to
always boost a CPU to the maximum boost value required by its
currently RUNNABLE tasks.
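The enqueue/dequeue bookkeeping described above can be modelled in plain user-space C. This is only a sketch of schedtune_tasks_update(), under the assumption of a single CPU and a single boost group, with made-up sketch_* names; it shows that the aggregation is re-run only when a group's RUNNABLE counter reaches 1 or 0:

```c
#include <assert.h>

#define BOOSTGROUPS_COUNT 16

/* Per-group RUNNABLE task counters on one (illustrative) CPU. */
static unsigned int tasks[BOOSTGROUPS_COUNT];
static unsigned int updates;	/* how many times the aggregation ran */

/*
 * Sketch of the counter update: adjust the counter of the task's
 * boost group, clamping at zero, and re-run the (stubbed) aggregation
 * only when the group's effect on the CPU may have changed.
 */
static void sketch_tasks_update(int idx, int task_count)
{
	if (task_count < 0 && tasks[idx] <= (unsigned int)-task_count)
		tasks[idx] = 0;	/* avoid making the counter negative */
	else
		tasks[idx] += task_count;

	if (tasks[idx] == 1 || tasks[idx] == 0)
		updates++;	/* stands in for schedtune_cpu_update(cpu) */
}

/* Enqueue then dequeue n tasks of one group; count aggregation runs. */
static unsigned int sketch_updates_for(int enqueues)
{
	int i;

	tasks[3] = 0;	/* arbitrary group index for the example */
	updates = 0;
	for (i = 0; i < enqueues; ++i)
		sketch_tasks_update(3, 1);	/* enqueue */
	for (i = 0; i < enqueues; ++i)
		sketch_tasks_update(3, -1);	/* dequeue */
	return updates;
}
```

For a single task the aggregation runs twice (on activation and deactivation of the group); with several tasks of the same group it runs only on the boundary transitions, not on every enqueue/dequeue.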

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
---
 kernel/sched/fair.c | 17 +++++++---
 kernel/sched/tune.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/tune.h | 23 +++++++++++++
 3 files changed, 130 insertions(+), 4 deletions(-)
 create mode 100644 kernel/sched/tune.h

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 633fcab4..98470c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -34,6 +34,7 @@
 #include <trace/events/sched.h>
 
 #include "sched.h"
+#include "tune.h"
 
 /*
  * Targeted preemption latency for CPU-bound tasks:
@@ -4145,6 +4146,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se) {
 		add_nr_running(rq, 1);
 
+		schedtune_enqueue_task(p, cpu_of(rq));
+
 		/*
 		 * We want to potentially trigger a freq switch request only for
                  * tasks that are waking up; this is because we get here also during
@@ -4213,6 +4216,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	if (!se) {
 		sub_nr_running(rq, 1);
+		schedtune_dequeue_task(p, cpu_of(rq));
 
 		/*
 		 * We want to potentially trigger a freq switch request only for
@@ -4769,10 +4773,15 @@ schedtune_margin(unsigned long signal, unsigned long boost)
 }
 
 static inline unsigned int
-schedtune_cpu_margin(unsigned long util)
+schedtune_cpu_margin(unsigned long util, int cpu)
 {
-	unsigned int boost = get_sysctl_sched_cfs_boost();
+	unsigned int boost;
 
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+	boost = schedtune_cpu_boost(cpu);
+#else
+	boost = get_sysctl_sched_cfs_boost();
+#endif
 	if (boost == 0)
 		return 0;
 
@@ -4782,7 +4791,7 @@ schedtune_cpu_margin(unsigned long util)
 #else /* CONFIG_SCHED_TUNE */
 
 static inline unsigned int
-schedtune_cpu_margin(unsigned long util)
+schedtune_cpu_margin(unsigned long util, int cpu)
 {
 	return 0;
 }
@@ -4793,7 +4802,7 @@ static inline unsigned long
 boosted_cpu_util(int cpu)
 {
 	unsigned long util = cpu_util(cpu);
-	unsigned long margin = schedtune_cpu_margin(util);
+	unsigned long margin = schedtune_cpu_margin(util, cpu);
 
 	return util + margin;
 }
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 3223ef3..3838106 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -2,6 +2,7 @@
 #include <linux/err.h>
 #include <linux/percpu.h>
 #include <linux/printk.h>
+#include <linux/rcupdate.h>
 #include <linux/slab.h>
 
 #include "sched.h"
@@ -158,6 +159,87 @@ schedtune_boostgroup_update(int idx, int boost)
 	return 0;
 }
 
+static inline void
+schedtune_tasks_update(struct task_struct *p, int cpu, int idx, int task_count)
+{
+	struct boost_groups *bg;
+	int tasks;
+
+	bg = &per_cpu(cpu_boost_groups, cpu);
+
+	/* Update boosted tasks count while avoiding to make it negative */
+	if (task_count < 0 && bg->group[idx].tasks <= -task_count)
+		bg->group[idx].tasks = 0;
+	else
+		bg->group[idx].tasks += task_count;
+
+	/* Boost group activation or deactivation on that RQ */
+	tasks = bg->group[idx].tasks;
+	if (tasks == 1 || tasks == 0)
+		schedtune_cpu_update(cpu);
+}
+
+/*
+ * NOTE: This function must be called while holding the lock on the CPU RQ
+ */
+void schedtune_enqueue_task(struct task_struct *p, int cpu)
+{
+	struct schedtune *st;
+	int idx;
+
+	/*
+	 * When a task is marked PF_EXITING by do_exit() it's going to be
+	 * dequeued and enqueued multiple times in the exit path.
+	 * Thus we avoid any further update, since we do not want to change
+	 * CPU boosting while the task is exiting.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	/* Get task boost group */
+	rcu_read_lock();
+	st = task_schedtune(p);
+	idx = st->idx;
+	rcu_read_unlock();
+
+	schedtune_tasks_update(p, cpu, idx, 1);
+}
+
+/*
+ * NOTE: This function must be called while holding the lock on the CPU RQ
+ */
+void schedtune_dequeue_task(struct task_struct *p, int cpu)
+{
+	struct schedtune *st;
+	int idx;
+
+	/*
+	 * When a task is marked PF_EXITING by do_exit() it's going to be
+	 * dequeued and enqueued multiple times in the exit path.
+	 * Thus we avoid any further update, since we do not want to change
+	 * CPU boosting while the task is exiting.
+	 * The last dequeue will be done by cgroup exit() callback.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	/* Get task boost group */
+	rcu_read_lock();
+	st = task_schedtune(p);
+	idx = st->idx;
+	rcu_read_unlock();
+
+	schedtune_tasks_update(p, cpu, idx, -1);
+}
+
+int schedtune_cpu_boost(int cpu)
+{
+	struct boost_groups *bg;
+
+	bg = &per_cpu(cpu_boost_groups, cpu);
+	return bg->boost_max;
+}
+
 static u64
 boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
@@ -293,9 +375,21 @@ schedtune_css_free(struct cgroup_subsys_state *css)
 	kfree(st);
 }
 
+static void
+schedtune_exit(struct cgroup_subsys_state *css,
+	       struct cgroup_subsys_state *old_css,
+	       struct task_struct *tsk)
+{
+	struct schedtune *old_st = css_st(old_css);
+	int cpu = task_cpu(tsk);
+
+	schedtune_tasks_update(tsk, cpu, old_st->idx, -1);
+}
+
 struct cgroup_subsys schedtune_cgrp_subsys = {
 	.css_alloc	= schedtune_css_alloc,
 	.css_free	= schedtune_css_free,
+	.exit		= schedtune_exit,
 	.legacy_cftypes	= files,
 	.early_init	= 1,
 };
diff --git a/kernel/sched/tune.h b/kernel/sched/tune.h
new file mode 100644
index 0000000..4519028
--- /dev/null
+++ b/kernel/sched/tune.h
@@ -0,0 +1,23 @@
+
+#ifdef CONFIG_SCHED_TUNE
+
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+
+extern int schedtune_cpu_boost(int cpu);
+
+extern void schedtune_enqueue_task(struct task_struct *p, int cpu);
+extern void schedtune_dequeue_task(struct task_struct *p, int cpu);
+
+#else /* CONFIG_CGROUP_SCHEDTUNE */
+
+#define schedtune_enqueue_task(task, cpu) do { } while (0)
+#define schedtune_dequeue_task(task, cpu) do { } while (0)
+
+#endif /* CONFIG_CGROUP_SCHEDTUNE */
+
+#else /* CONFIG_SCHED_TUNE */
+
+#define schedtune_enqueue_task(task, cpu) do { } while (0)
+#define schedtune_dequeue_task(task, cpu) do { } while (0)
+
+#endif /* CONFIG_SCHED_TUNE */
-- 
2.5.0



* Re: [RFC,08/14] sched/tune: add detailed documentation
  2015-08-19 18:47 ` [RFC PATCH 08/14] sched/tune: add detailed documentation Patrick Bellasi
@ 2015-09-02  6:49   ` Ricky Liang
  2015-09-03  9:18     ` [RFC 08/14] " Patrick Bellasi
  0 siblings, 1 reply; 29+ messages in thread
From: Ricky Liang @ 2015-09-02  6:49 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc

Hi Patrick,

I wonder if this can replace the boost function in the interactive governor [0],
which is widely used in both Android and ChromeOS kernels.

My understanding is that the boost in the interactive governor simply raises
the OPP on selected cores. The SchedTune boost works by adding a margin to the
original load of a task, which makes the kernel think that the task is more
demanding than it actually is. My intuition was that they work differently and
could cause different reactions in the kernel.

I feel that the per-task cgroup SchedTune boost should work as expected, as it
only boosts a set of tasks and makes them appear relatively more demanding
compared to other tasks. But if the SchedTune boost is applied globally to
boost all the tasks in the system, will it cause unnecessary task migrations
as all the tasks appear to be highly demanding to the kernel?

Specifically, my question is: when the global SchedTune boost is enabled in an
on-demand manner, is it possible that a light task gets migrated to a big
core and in turn kicks out a heavy task originally on that core? I'm wondering
whether the global SchedTune boost could result in a "priority inversion",
causing the heavy task to run on a little core and the light task to run on a
big core.

[0]: https://android.googlesource.com/kernel/common.git/+/android-3.18/drivers/cpufreq/cpufreq_interactive.c

Thanks,
Ricky

On Wed, Aug 19, 2015 at 07:47:18PM +0100, Patrick Bellasi wrote:
> The topic of a single simple power-performance tunable, that is wholly
> scheduler centric, and has well defined and predictable properties has
> come up on several occasions in the past. With techniques such as a
> scheduler driven DVFS, we now have a good framework for implementing
> such a tunable.
> 
> This patch provides a detailed description of the motivations and design
> decisions behind the implementation of the SchedTune.
> 
> cc: Jonathan Corbet <corbet@lwn.net>
> cc: linux-doc@vger.kernel.org
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> 
> ---
> Documentation/scheduler/sched-tune.txt | 367 +++++++++++++++++++++++++++++++++
>  1 file changed, 367 insertions(+)
>  create mode 100644 Documentation/scheduler/sched-tune.txt
> 
> diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
> new file mode 100644
> index 0000000..cb795e6
> --- /dev/null
> +++ b/Documentation/scheduler/sched-tune.txt
> @@ -0,0 +1,367 @@
> +             Central, scheduler-driven, power-performance control
> +                               (EXPERIMENTAL)
> +
> +Abstract
> +========
> +
> +The topic of a single simple power-performance tunable, that is wholly
> +scheduler centric, and has well defined and predictable properties has come up
> +on several occasions in the past [1,2]. With techniques such as a scheduler
> +driven DVFS [3], we now have a good framework for implementing such a tunable.
> +This document describes the overall ideas behind its design and implementation.
> +
> +
> +Table of Contents
> +=================
> +
> +1. Motivation
> +2. Introduction
> +3. Signal Boosting Strategy
> +4. OPP selection using boosted CPU utilization
> +5. Per task group boosting
> +6. Question and Answers
> +   - What about "auto" mode?
> +   - What about boosting on a congested system?
> +   - How CPUs are boosted when we have tasks with multiple boost values?
> +7. References
> +
> +
> +1. Motivation
> +=============
> +
> +Sched-DVFS [3] is a new event-driven cpufreq governor which allows the
> +scheduler to select the optimal DVFS operating point (OPP) for running a task
> +allocated to a CPU. The introduction of sched-DVFS enables running workloads at
> +the most energy efficient OPPs.
> +
> +However, sometimes it may be desired to intentionally boost the performance of
> +a workload even if that could imply a reasonable increase in energy
> +consumption. For example, in order to reduce the response time of a task, we
> +may want to run the task at a higher OPP than the one that is actually required
> +by its CPU bandwidth demand.
> +
> +This last requirement is especially important if we consider that one of the
> +main goals of the sched-DVFS component is to replace all currently available
> +CPUFreq policies. Since sched-DVFS is event based, as opposed to the sampling
> +driven governors we currently have, it is already more responsive at selecting
> +the optimal OPP to run tasks allocated to a CPU. However, just tracking the
> +actual task load demand may not be enough from a performance standpoint.  For
> +example, it is not possible to get behaviors similar to those provided by the
> +"performance" and "interactive" CPUFreq governors.
> +
> +This document describes an implementation of a tunable, stacked on top of the
> +sched-DVFS which extends its functionality to support task performance
> +boosting.
> +
> +By "performance boosting" we mean the reduction of the time required to
> +complete a task activation, i.e. the time elapsed from a task wakeup to its
> +next deactivation (e.g. because it goes back to sleep or it terminates).  For
> +example, if we consider a simple periodic task which executes the same workload
> +for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
> +that task must complete each of its activations in less than 5[s].
> +
> +A previous attempt [5] to introduce such a boosting feature has not been
> +successful mainly because of the complexity of the proposed solution.  The
> +approach described in this document exposes a single simple interface to
> +user-space.  This single tunable knob allows the tuning of system wide
> +scheduler behaviours ranging from energy efficiency at one end through to
> +incremental performance boosting at the other end.  This first tunable affects
> +all tasks. However, a more advanced extension of the concept is also provided
> +which uses CGroups to boost the performance of only selected tasks while using
> +the energy efficient default for all others.
> +
> +The rest of this document introduces in more details the proposed solution
> +which has been named SchedTune.
> +
> +
> +2. Introduction
> +===============
> +
> +SchedTune exposes a simple user-space interface with a single power-performance
> +tunable:
> +
> +  /proc/sys/kernel/sched_cfs_boost
> +
> +This permits expressing a boost value as an integer in the range [0..100].
> +
> +A value of 0 (default) configures the CFS scheduler for maximum energy
> +efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
> +required to satisfy their workload demand.
> +A value of 100 configures the scheduler for maximum performance, which translates
> +to the selection of the maximum OPP on that CPU.
> +
> +The range between 0 and 100 can be set to satisfy other scenarios suitably. For
> +example to satisfy interactive response or depending on other system events
> +(battery level etc).
> +
> +A CGroup based extension is also provided, which permits further user-space
> +defined task classification to tune the scheduler for different goals depending
> +on the specific nature of the task, e.g. background vs interactive vs
> +low-priority.
> +
> +The overall design of the SchedTune module is built on top of "Per-Entity Load
> +Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
> +Performance Point (OPP) selection.
> +Each time a task is allocated on a CPU, sched-DVFS has the opportunity to tune
> +the operating frequency of that CPU to better match the workload demand. The
> +selection of the actual OPP being activated is influenced by the global boost
> +value, or the boost value for the task CGroup when in use.
> +
> +This simple biasing approach leverages existing frameworks, which means minimal
> +modifications to the scheduler, and yet it allows a range of different
> +behaviours to be achieved, all from a single simple tunable knob.
> +The only new concept introduced is that of signal boosting.
> +
> +
> +3. Signal Boosting Strategy
> +===========================
> +
> +The whole PELT machinery works based on the value of a few load tracking signals
> +which basically track the CPU bandwidth requirements for tasks and the capacity
> +of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
> +some of these load tracking signals to make a task or RQ appear more demanding
> +than it actually is.
> +
> +Which signals have to be inflated depends on the specific "consumer".  However,
> +independently from the specific (signal, consumer) pair, it is important to
> +define a simple and possibly consistent strategy for the concept of boosting a
> +signal.
> +
> +A boosting strategy defines how the "abstract" user-space defined
> +sched_cfs_boost value is translated into an internal "margin" value to be added
> +to a signal to get its inflated value:
> +
> +  margin         := boosting_strategy(sched_cfs_boost, signal)
> +  boosted_signal := signal + margin
> +
> +Different boosting strategies were identified and analyzed before selecting the
> +one found to be most effective.
> +
> +Signal Proportional Compensation (SPC)
> +--------------------------------------
> +
> +In this boosting strategy the sched_cfs_boost value is used to compute a
> +margin which is proportional to the complement of the original signal.
> +When a signal has a maximum possible value, its complement is defined as
> +the delta between its actual value and that maximum.
> +
> +Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
> +the maximum possible value, the margin becomes:
> +
> +	margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)
> +
> +Using this boosting strategy:
> +- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
> +- each value in the range of sched_cfs_boost effectively inflates the signal in
> +  question by a quantity which is proportional to its headroom, i.e. the gap
> +  between the signal and its maximum value.
> +
> +For example, by applying the SPC boosting strategy to the selection of the OPP
> +to run a task it is possible to achieve these behaviors:
> +
> +-   0% boosting: run the task at the minimum OPP required by its workload
> +- 100% boosting: run the task at the maximum OPP available for the CPU
> +-  50% boosting: run at the half-way OPP between minimum and maximum
> +
> +This means that, at 50% boosting, a task will be scheduled to run at half of
> +the maximum theoretically achievable performance on the specific target
> +platform.
> +
> +A graphical representation of an SPC boosted signal is represented in the
> +following figure where:
> + a) "-" represents the original signal
> + b) "b" represents a  50% boosted signal
> + c) "p" represents a 100% boosted signal
> +
> +
> +   ^
> +   |  SCHED_LOAD_SCALE
> +   +-----------------------------------------------------------------+
> +   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
> +   |
> +   |                                             boosted_signal
> +   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
> +   |
> +   |                                            original signal
> +   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
> +   |                                          |
> +   |bbbbbbbbbbbbbbbbbb                        |
> +   |                                          |
> +   |                                          |
> +   |                                          |
> +   |                  +-----------------------+
> +   |                  |
> +   |                  |
> +   |                  |
> +   |------------------+
> +   |
> +   |
> +   +----------------------------------------------------------------------->
> +
> +The plot above shows a ramped load signal (the "original signal") and its
> +boosted equivalent. For each step of the original signal the boosted signal
> +corresponding to a 50% boost is midway between the original signal and the
> +upper bound. Boosting by 100% generates a boosted signal which is always
> +saturated to the upper bound.
> +
> +
> +4. OPP selection using boosted CPU utilization
> +==============================================
> +
> +It is worth calling out that the implementation does not introduce any new load
> +signals. Instead, it provides an API to tune existing signals. This tuning is
> +done on demand and only in scheduler code paths where it is sensible to do so.
> +The new API calls are defined to return either the default signal or a boosted
> +one, depending on the value of sched_cfs_boost. This is a clean and
> +non-invasive modification of the existing code paths.
> +
> +The signal representing a CPU's utilization is boosted according to the
> +previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
> +(i.e. a CFS run-queue) to appear more utilized than it actually is.
> +
> +Thus, with the sched_cfs_boost enabled we have the following main functions to
> +get the current utilization of a CPU:
> +
> +  cpu_util()
> +  boosted_cpu_util()
> +
> +The new boosted_cpu_util() is similar to the first but returns a boosted
> +utilization signal which is a function of the sched_cfs_boost value.
> +
> +This function is used in the CFS scheduler code paths where sched-DVFS needs to
> +decide the OPP to run a CPU at.
> +For example, this allows selecting the highest OPP for a CPU which has
> +the boost value set to 100%.
> +
> +
> +5. Per task group boosting
> +==========================
> +
> +The availability of a single knob which is used to boost all tasks in the
> +system is certainly a simple solution but it quite likely doesn't fit many
> +utilization scenarios, especially in the mobile device space.
> +
> +For example, on battery powered devices there usually are many background
> +services which are long running and need energy efficient scheduling. On the
> +other hand, some applications are more performance sensitive and require an
> +interactive response and/or maximum performance, regardless of the energy cost.
> +To better service such scenarios, the SchedTune implementation has an extension
> +that provides a more fine grained boosting interface.
> +
> +A new CGroup controller, namely "schedtune", can be enabled, which allows the
> +definition and configuration of task groups with different boosting values.
> +Tasks that require special performance can be put into separate CGroups.
> +The value of the boost associated with the tasks in this group can be specified
> +using a single knob exposed by the CGroup controller:
> +
> +   schedtune.boost
> +
> +This knob allows the definition of a boost value that is to be used for
> +SPC boosting of all tasks attached to this group.
> +
> +The current schedtune controller implementation is really simple and has these
> +main characteristics:
> +
> +  1) It is only possible to create 1 level depth hierarchies
> +
> +     The root control groups define the system-wide boost value to be applied
> +     by default to all tasks. Its direct subgroups are named "boost groups" and
> +     they define the boost value for a specific set of tasks.
> +     Further nested subgroups are not allowed since they do not have a sensible
> +     meaning from a user-space standpoint.
> +
> +  2) It is possible to define only a limited number of "boost groups"
> +
> +     This number is defined at compile time and by default configured to 16.
> +     This is a design decision motivated by two main reasons:
> +     a) In a real system we do not expect utilization scenarios with more than
> +        a few boost groups. For example, a reasonable collection of groups
> +        could be just "background", "interactive" and "performance".
> +     b) It simplifies the implementation considerably, especially for the code
> +        which has to compute the per-CPU boosting when there are multiple
> +        RUNNABLE tasks with different boost values.
> +
> +Such a simple design should allow servicing the main utilization scenarios identified
> +so far. It provides a simple interface which can be used to manage the
> +power-performance of all tasks or only selected tasks.
> +Moreover, this interface can be easily integrated by user-space run-times (e.g.
> +Android, ChromeOS) to implement a QoS solution for task boosting based on tasks
> +classification, which has been a long standing requirement.
> +
> +Setup and usage
> +---------------
> +
> +0. Use a kernel with CGROUP_SCHEDTUNE support enabled
> +
> +1. Check that the "schedtune" CGroup controller is available:
> +
> +   root@linaro-nano:~# cat /proc/cgroups
> +   #subsys_name	hierarchy	num_cgroups	enabled
> +   cpuset  	0		1		1
> +   cpu     	0		1		1
> +   schedtune	0		1		1
> +
> +2. Mount a tmpfs to create the CGroups mount point (Optional)
> +
> +   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
> +
> +3. Mount the "schedtune" controller
> +
> +   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
> +   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
> +
> +4. Setup the system-wide boost value (Optional)
> +
> +   If not configured, the root control group has a 0% boost value, which
> +   basically disables boosting for all tasks in the system, thus running
> +   them in an energy-efficient mode.
> +
> +   root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost
> +
> +5. Create task groups and configure their specific boost value (Optional)
> +
> +   For example, here we create a "performance" boost group configured to
> +   boost all its tasks to 100%:
> +
> +   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
> +   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
> +
> +6. Move tasks into the boost group
> +
> +   For example, the following moves the task with PID $TASKPID (and all its
> +   threads) into the "performance" boost group.
> +
> +   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
> +
> +This simple configuration allows only the threads of task $TASKPID to run,
> +when needed, at the highest OPP on the most capable CPU of the system.
> +
> +
> +6. Question and Answers
> +=======================
> +
> +What about "auto" mode?
> +-----------------------
> +
> +The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
> +with some suitable user-space element. This element could use the exposed
> +system-wide or cgroup based interface.
> +
> +How are multiple groups of tasks with different boost values managed?
> +---------------------------------------------------------------------
> +
> +The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
> +on a CPU. Once sched-DVFS selects the OPP to run a CPU at, the CPU utilization
> +is boosted with a value which is the maximum of the boost values of the
> +currently RUNNABLE tasks in its RQ.
> +
> +This allows sched-DVFS to boost a CPU only while there are boosted tasks ready
> +to run and switch back to the energy efficient mode as soon as the last boosted
> +task is dequeued.
> +
> +
> +7. References
> +=============
> +[1] http://lwn.net/Articles/552889
> +[2] http://lkml.org/lkml/2012/5/18/91
> +[3] http://lkml.org/lkml/2015/6/26/620
> +

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-02  6:49   ` [RFC,08/14] " Ricky Liang
@ 2015-09-03  9:18     ` Patrick Bellasi
  2015-09-04  7:59       ` Ricky Liang
  2015-09-09 20:16       ` Steve Muckle
  0 siblings, 2 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-09-03  9:18 UTC (permalink / raw)
  To: Ricky Liang
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc

On Wed, Sep 02, 2015 at 07:49:58AM +0100, Ricky Liang wrote:
> Hi Patrick,

Hi Ricky,

> I wonder if this can replace the boost function in the interactive
> governor [0], which is widely used in both Android and ChromeOS
> kernels.

In my view, one of the main goals of sched-DVFS is actually to be
a solid and generic replacement for the different CPUFreq governors.
Being driven by the scheduler, sched-DVFS can exploit information on
CPU demand of active tasks in order to select the optimal Operating
Performance Point (OPP) using a "proactive" approach instead of the
"reactive" approach commonly used by existing governors.

In the current implementation proposed by this RFC, SchedTune is just
a simple mechanism on top of sched-DVFS to bias the selection of the
OPP.
In case a task runs for a limited amount of time at each of its
(sporadic) activations, it does not contribute a CPU load which selects
a higher OPP. Thus, the actual performance (i.e. time to completion)
of that task depends on which other tasks are co-scheduled with it.
If it has the chance to be scheduled on a loaded CPU it will run fast;
on the contrary, it will be slower when scheduled alone on a CPU
running at the lowest OPP.

If this task is (for whatever reason) "important" and should always
complete an activation as soon as possible, the current situation is:
a) we use the "performance" governor when we know that the task could
   be active, thus running the whole system in "race-to-idle" mode
b) we use the "interactive" governor (if possible, since it is not
   in mainline) and ensure that this task pokes the "boost" attribute
   when it is active

Notice that, for both these solutions:
1) unless we pin the task on a specific set of CPUs, we must enable
   this governor for all the frequency domains since we do not know on
   which CPU the scheduler will end up to run the task
2) the tuning for a single task is likely to affect the whole system:
   once the task has been started, all the tasks are going to be boosted
   even when this task is not runnable

SchedTune provides a "global tunable" which allows to get the same
results as a) and b) with the main advantage that only the specific
frequency domain where the task is RUNNING is boosted. Since we do not
need to pin the task to get this result this can simplify (eventually)
the modification required in user-space while still getting optimal
performances for the task without compromising overall system
consumption.

AFAIU, regarding specifically the boost modes supported by the
Interactive governor:

1) the "boost" tunable is substantially similar to setting to 100% the
   SchedTune boost value. Userspace is in charge to trigger the start
   and end of a boost period

2) the "boostpulse" tunable triggers a 100% boost.

   The main difference is that the Interactive governor resets the
   boost after a configurable time (usually 80ms) while in SchedTune
   the boost value is asserted until release by userspace.

   This has advantages and disadvantages. By using SchedTune the
   userspace has to release the boost explicitly. With the Interactive
   governor this is automatic, but still the userspace has to define a
   suitable timeout, which can be different for different tasks.

3) the "boost_input" tunable is just a hook exposed to kernel drivers
   which can generate input events expected to impact the user
   experience.
   The actual implementation is similar to the previous knob.

IMHO the "boostpulse/input_pulse" tunables are a simple solution to
the problem of running fast to get better UI interactive response.
Indeed, the driver/task which generates the input event is not
necessarily the actual target of the load and/or of the user-perceived
response.
Moreover, it boosts all the frequency domains independently from where
the actual UI related workload is running.

By exploiting scheduler information on the actual workload demand of
some tasks, we could aim at a more effective solution which boosts just
the required CPUs, and only when the task affecting the UI experience
is actually running. This is what the "per-task" SchedTune boosting is
trying to enable.

I'm wondering if you could provide some examples to better describe
when the "boostpulse" tunables are used in ChromiumOS.
Maybe by starting from the description of some use-cases we could
better understand if the tunables provided by the Interactive governor
are really required, or if we can figure out a better, even if
different, approach to be implemented in SchedTune.

> My understanding is that the boost in interactive governor is to
> simply raise the OPP on selected cores.

AFAIU the "boost" of the Interactive governor affects all the (online)
CPUs. Thus if you have a multi frequency domain system (e.g.
big.LITTLE), the Interactive governor switches to performance mode for
all the CPUs. This makes sense since that boosting is triggered by an
event, but it does not exploit any information on which tasks really
need boosting and where they are executed by the scheduler.

> The SchedTune boost works by adding a margin to the original load of
> a task which makes the kernel think that the task is more demanding
> than it actually is. My intuition was that they work differently and
> could cause different reaction in the kernel.

That's absolutely true, they work differently. However, it is worth
noticing that the SchedTune boost value is "consumed" just by
sched-DVFS, when it has to select an OPP.
There are no other links with the scheduler and/or signals "consumed"
by the scheduler. Specifically, all the task/RQ specific signals used
by the scheduler are not affected by the SchedTune value.

This is what happens in the SchedTune version presented by this
RFC. Internally we are working on an extension which integrates the
Energy-Aware scheduler (EAS).
In that case you are right, the boost value could affect some decision
of the EAS scheduler. For example, boosted tasks could end up being
moved into a more capable CPU of a big.LITTLE system even if they are
not generating a big utilization.

> I feel that the per-task cgroup ScheTune boost should work as
> expected as it only boosts a set of tasks and make them appear
> relatively high demanding comparing to other tasks. But if the
> ScheTune boost is applied globally to boost all the tasks in the
> system, will it cause unnecessary task migrations as all the tasks
> appear to be high demanding to the kernel?

IMHO the best usage of SchedTune is via "per-task" boosting, where it
is easier to control when the system must work at higher OPPs.
However, this will probably require more effort in the user-space
middleware layers to feed the scheduler with sensible information
about task demands.

Meanwhile, the current solutions are based on system-wide tuning, and
that's why SchedTune has been proposed with a support for "global"
boosting.

When we are boosting globally the only information we are providing to
the kernel is that we are in a rush and everything is important. Thus
yes, small tasks could eventually end up being moved into a more
capable CPU.

However, how SchedTune is going to bias task allocation is part of our
internal development targeting its integration with EAS.

> Specifically, my questions is: When the global SchedTune boost is
> enabled in a on-demand manner, is it possible that a light task gets
> migrated to the big core, and in turn kicks out a heavy task
> originally on that core?

In this RFC we presented just the initial idea of task boosting with a
solution which is generic enough to possibly replace some of the most
commonly used CPUFreq governors (e.g. Performance, Ondemand and
Interactive) while still being completely unrelated to the scheduler
decisions on task allocation.

We think that the approach of posting small and self-contained updates
can be more effective on creating consensus by working together on
designing and building a solution which fits many different needs.

> I'm wondering whether global SchedTune boost could result in a
> "priority inversion" causing the heavy task to run on the little
> core and the light task to run on the big core.

That's an interesting point we should keep into consideration for the
design of the complete solution.
I would prefer to postpone this discussion on the list until we
present the next extension of SchedTune which integrates into EAS.


> [0]: https://android.googlesource.com/kernel/common.git/+/android-3.18/drivers/cpufreq/cpufreq_interactive.c
> 
> Thanks,
> Ricky

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-03  9:18     ` [RFC 08/14] " Patrick Bellasi
@ 2015-09-04  7:59       ` Ricky Liang
  2015-09-09 20:16       ` Steve Muckle
  1 sibling, 0 replies; 29+ messages in thread
From: Ricky Liang @ 2015-09-04  7:59 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc

Hi Patrick,

Please find my replies inline.

On Thu, Sep 3, 2015 at 5:18 PM, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> On Wed, Sep 02, 2015 at 07:49:58AM +0100, Ricky Liang wrote:
>> Hi Patrick,
>
> Hi Ricky,
>
>> I wonder if this can replace the boost function in the interactive
>> governor [0], which is widely used in both Android and ChromeOS
>> kernels.
>
> In my view, one of the main goals of sched-DVFS is actually that to be
> a solid and generic replacement of different CPUFreq governors.
> Being driven by the scheduler, sched-DVFS can exploit information on
> CPU demand of active tasks in order to select the optimal Operating
> Performance Point (OPP) using a "proactive" approach instead of the
> "reactive" approach commonly used by existing governors.
>
> In the current implementation proposed by this RFC, SchedTune is just
> a simple mechanism on top of sched-DVFS to bias the selection of the
> OPP.
> In case a task is running for a limited amount of time at each of its
> (sporadic) activation, it does not contribute a CPU load which selects
> an higher OPP. Thus, the actual performance (i.e. time to completion)
> of that task depends on which other tasks are co-scheduled with it.
> If it has the chance to be scheduled on a loaded CPU it will run fast,
> to the contrary it will be slower when scheduled alone on a CPU
> running at the lowest OPP.
>
> If this task is (for whatever reason) "important" and should always
> complete an activation as soon as possible, the current situation is:
> a) we use the "performance" governor when we know that the task could
>    be active, thus running the whole system in "race-to-idle" mode
> b) we use the "interactive" governor (if possible, since it is not not
>    in mainline) and ensure that this task pokes the "boost" attribute
>    when it is active
>
> Notice that, for both these solutions:
> 1) unless we pin the task on a specific set of CPUs, we must enable
>    this governor for all the frequency domains since we do not know on
>    which CPU the scheduler will end up to run the task
> 2) the tuning for a single task is likely to affect the whole system
>    once the task as been started all the tasks are going to be boosted
>    even when this task is not runnable
>
> SchedTune provides a "global tunable" which allows to get the same
> results as a) and b) with the main advantage that only the specific
> frequency domain where the task is RUNNING is boosted. Since we do not
> need to pin the task to get this result this can simplify (eventually)
> the modification required in user-space while still getting optimal
> performances for the task without compromising overall system
> consumption.
>
> AFAIU, regarding specifically the boost modes supported by the
> Interactive governor:
>
> 1) the "boost" tunable is substantially similar to setting to 100% the
>    SchedTune boost value. Userspace is in charge to trigger the start
>    and end of a boost period
>
> 2) the "boostpulse" tunable triggers a 100% boost.
>
>    The main difference is that the Interactive governor resets the
>    boost after a configurable time (usually 80ms) while in SchedTune
>    the boost value is asserted until release by userspace.
>
>    This has advantages and disadvantages. By using SchedTune the
>    userspace has to release the boost explicitly. With the Interactive
>    governor this is automatic but still the userspace has to defined a
>    suitable timeout. However, this can be different for different
>    tasks.
>
> 3) the "boost_input" tunable is just an hook exposed to kernel drivers
>    which can generate input events expected to impact on user the
>    experience.
>    The actual implementation is just similar to the previous knob.
>
> IMHO the "boostpulse/input_pulse" tunables are a simple solution to
> the problem of running fast to get better UI interactive response.
> Indeed, the driver/task which generates the input event is not
> necessary the actual target of the load and/or user perceived
> response.
> Moreover, it boosts all the frequency domains independently from where
> the actual UI related workload is running.
>
> By exploiting scheduler information on the actual workload demand of
> some tasks, we could aim at a more effective solution which boost just
> the required CPUs and only when the task affecting the UI experience
> is actually running. This is what the "per-task" SchedTune boosting is
> trying to enable.
>
> I'm wondering if you could provide some example to better describe
> when the "boostpulse" tunables are used in ChromiumOS.
> Maybe that by starting from the description of some use-case we could
> better understand if the tunables provided by the Interactive governor
> are really required of if we can figure out a possible better even if
> different approach to be implemented in SchedTune.
>

In addition to the "boost" or "boostpulse" that are triggered by user space, in
ChromiumOS we register an input event handler in the interactive governor
which triggers an interactive boost upon receiving any input event. The handler
causes the interactive governor to boost all CPUs, and the boost lasts until
the CPUs go idle - in other words, until there's no work left for the CPUs to
do. Sometimes it's not trivial to tell which processes are crucial to
interactive response, so we are doing a global boost.

This is a use case specific to ChromiumOS, so it's probably not suitable to
be included in the mainline kernel. However, there are probably other similar
use cases out there so it's interesting to explore how SchedTune could
support this use case.

>> My understanding is that the boost in interactive governor is to
>> simply raise the OPP on selected cores.
>
> AFAIU the "boost" of the Interactive governor affects all the (online)
> CPUs. Thus if you have a multi frequency domain system (e.g.
> big.LITTLE), the Interactive governor switch to performance mode for
> all the CPUs. This makes sense since that boosting is triggered by an
> event but does not exploit any information on which tasks really need
> boosting and where they are executed by the scheduler.
>

You can also boost specific CPUs from user space. The boost can be enabled
at per-policy granularity. In any case you are right, the interactive governor
doesn't have context about tasks, so SchedTune can be more effective.

>> The SchedTune boost works by adding a margin to the original load of
>> a task which makes the kernel think that the task is more demanding
>> than it actually is. My intuition was that they work differently and
>> could cause different reaction in the kernel.
>
> That's absolutely true, they works differently. However it is worth to
> notice that the SchedTune boost value is "consumed" just by
> sched-DVFS, when it has to select an OPP.
> There are not other links with the scheduler and/or signals "consumed"
> by the scheduler. Specifically, all the task/RQ specific signals used
> by the scheduler are not affected by the SchedTune value.
>
> This is what happens in the SchedTune version presented by this
> RFC. Internally we are working on an extension which integrates the
> Energy-Aware scheduler (EAS).
> In that case you are right, the boost value could affect some decision
> of the EAS scheduler. For example, boosted tasks could end up being
> moved into a more capable CPU of a big.LITTLE system even if they are
> not generating a big utilization.
>
>> I feel that the per-task cgroup ScheTune boost should work as
>> expected as it only boosts a set of tasks and make them appear
>> relatively high demanding comparing to other tasks. But if the
>> ScheTune boost is applied globally to boost all the tasks in the
>> system, will it cause unnecessary task migrations as all the tasks
>> appear to be high demanding to the kernel?
>
> IMHO the best usage of SchedTune is via "per-task" boosting, where it
> is more easy to control when the system must work at higher OPPs.
> However, this will probably require more efforts in the user-space
> middleware layers to feed the scheduler with sensible information
> about tasks demands.
>
> Meanwhile, the current solutions are based on system-wide tuning, and
> that's why SchedTune has been proposed with a support for "global"
> boosting.
>
> When we are boosting globally the only information we are providing to
> the kernel is that we are in a rush and everything is important. Thus
> yes, small tasks could eventually end up being moved into a more
> capable CPU.
>
> However, how SchedTune is going to bias tasks allocation is part of our
> internal developments targeting its integration with EAS.
>
>> Specifically, my questions is: When the global SchedTune boost is
>> enabled in a on-demand manner, is it possible that a light task gets
>> migrated to the big core, and in turn kicks out a heavy task
>> originally on that core?
>
> In this RFC we presented just the initial idea of task boosting with a
> solution which is generic enough to possibly replace some of the most
> commonly used CPUFreq governors (e.g. Performance, Ondemand and
> Interactive) while still being completely unrelated from the scheduler
> decisions on tasks allocation.
>
> We think that the approach of posting small and self-contained updates
> can be more effective on creating consensus by working together on
> designing and building a solution which fits many different needs.
>
>> I'm wondering whether global SchedTune boost could result in a
>> "priority inversion" causing the heavy task to run on the little
>> core and the light task to run on the big core.
>
> That's an interesting point we should keep into consideration for the
> design of the complete solution.
> I would prefer to post-pone this discussion on the list once we will
> present the next extension of SchedTune which integrates into EAS.
>
>
>> [0]: https://android.googlesource.com/kernel/common.git/+/android-3.18/drivers/cpufreq/cpufreq_interactive.c
>>
>> Thanks,
>> Ricky
>
> Cheers,
> Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-03  9:18     ` [RFC 08/14] " Patrick Bellasi
  2015-09-04  7:59       ` Ricky Liang
@ 2015-09-09 20:16       ` Steve Muckle
  2015-09-11 11:09         ` Patrick Bellasi
  1 sibling, 1 reply; 29+ messages in thread
From: Steve Muckle @ 2015-09-09 20:16 UTC (permalink / raw)
  To: Patrick Bellasi, Ricky Liang
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc, Viresh Kumar

Hi Patrick,

On 09/03/2015 02:18 AM, Patrick Bellasi wrote:
> In my view, one of the main goals of sched-DVFS is actually to be
> a solid and generic replacement for the various CPUFreq governors.
> Being driven by the scheduler, sched-DVFS can exploit information on
> CPU demand of active tasks in order to select the optimal Operating
> Performance Point (OPP) using a "proactive" approach instead of the
> "reactive" approach commonly used by existing governors.

I'd agree that with knowledge of CPU demand on a per-task basis, rather
than the aggregate per-CPU demand that cpufreq governors use today, it
is possible to proactively address changes in CPU demand which result
from task migrations, task creation and exit, etc.

That said I believe setting the OPP based on a particular given
historical profile of task load still relies on a heuristic algorithm of
some sort where there is no single right answer. I am concerned about
whether sched-dvfs and SchedTune, as currently proposed, will support
enough of a range of possible heuristics/policies to effectively replace
the existing cpufreq governors.

The two most popular governors for normal operation in the mobile world are:

* ondemand: Samples periodically, CPU usage calculated as simple busy
fraction of last X ms window of time. Goes straight to fmax when load
exceeds up_threshold tunable %, otherwise scales frequency
proportionally with load. Can stay at fmax longer if requested before
re-evaluating by configuring the sampling_down_factor tunable.

* interactive: Samples periodically, CPU usage calculated as simple busy
fraction of last X ms window of time. Goes to an intermediate tunable
freq (hispeed_freq) when load exceeds a user definable threshold
(go_hispeed_load). Otherwise strives to maintain the CPU usage set by
the user in the "target_loads" array. Other knobs that affect behavior
include min_sample_time (min time to spend at a freq before slowing
down) and above_hispeed_delay (allows various delays to further raise
speed above hispeed freq).
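
For illustration, the two selection rules above might be sketched roughly as
follows. This is a deliberate simplification: the real governors in
drivers/cpufreq carry far more state, and the function names, signatures and
numeric values here are illustrative, not the kernel's.

```c
#include <assert.h>

/* Loads are expressed in percent, frequencies in kHz. */

/* ondemand: jump straight to fmax above up_threshold, otherwise
 * scale frequency proportionally with load, clamped at fmin. */
unsigned int ondemand_next_freq(unsigned int load,
				unsigned int up_threshold,
				unsigned int fmin, unsigned int fmax)
{
	if (load >= up_threshold)
		return fmax;
	unsigned int f = fmax * load / 100;
	return f < fmin ? fmin : f;
}

/* interactive: jump to hispeed_freq above go_hispeed_load, otherwise
 * pick the frequency at which current demand equals target_load. */
unsigned int interactive_next_freq(unsigned int load,
				   unsigned int cur_freq,
				   unsigned int go_hispeed_load,
				   unsigned int hispeed_freq,
				   unsigned int target_load,
				   unsigned int fmax)
{
	unsigned int f;

	if (load >= go_hispeed_load && cur_freq < hispeed_freq)
		return hispeed_freq;
	f = cur_freq * load / target_load;
	return f > fmax ? fmax : f;
}
```

Even in this toy form, the knobs (up_threshold, go_hispeed_load, target_load)
are where all the vendor tuning happens.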

It's also worth noting that mobile vendors typically add all sorts of
hacks on top of the existing cpufreq governors which further complicate
policy.

The current proposal:

* sched-dvfs/schedtune: Event driven, CPU usage calculated using
exponential moving average. AFAICS tries to maintain some % of idle
headroom, but if that headroom doesn't exist at task_tick_fair(), goes
to max frequency. Schedtune provides a way to boost/inflate the demand
of individual tasks or overall system demand.
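
As I understand the headroom behavior described above, it amounts to something
like the following sketch: request the smallest OPP whose capacity still leaves
the required idle headroom over the (possibly boosted) CPU usage, and fall back
to the maximum OPP when no headroom is left anywhere. The capacity values, the
~20% margin and pick_opp() itself are illustrative, not taken from the series.

```c
#include <assert.h>

/* Return the index of the lowest OPP that keeps ~20% idle headroom
 * over 'usage'; the last (max) OPP if none does. opp_cap[] holds
 * per-OPP compute capacities in ascending order. */
int pick_opp(unsigned long usage, const unsigned long *opp_cap, int nr_opp)
{
	int i;

	for (i = 0; i < nr_opp; i++) {
		/* usage must stay below 80% of the OPP's capacity */
		if (usage * 100 < opp_cap[i] * 80)
			return i;
	}
	return nr_opp - 1;	/* no headroom anywhere: max OPP */
}
```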

This looks a bit like ondemand to me but without the
sampling_down_factor functionality and using per-entity load tracking
instead of a simple window-based aggregate CPU usage. The interactive
functionality would require additional knobs. I don't think schedtune
will allow for tuning the latency of CPU frequency changes
(min_sample_time, above_hispeed_delay, etc).

A separate but related concern - in the (IMO likely, given the above)
case that folks want to tinker with that policy, it now means they're
hacking the scheduler as opposed to a self-contained frequency policy
plugin.

Another issue with policy (but not specific to this proposal) is that
putting a bunch of it in the CPU frequency selection may derail the
efforts of the EAS algorithm, which I'm still working on digesting.
Perhaps a unified sched/cpufreq policy could go there.

thanks,
Steve



* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-09 20:16       ` Steve Muckle
@ 2015-09-11 11:09         ` Patrick Bellasi
  2015-09-14 20:00           ` Steve Muckle
  0 siblings, 1 reply; 29+ messages in thread
From: Patrick Bellasi @ 2015-09-11 11:09 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Ricky Liang, Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc, Viresh Kumar

On Wed, Sep 09, 2015 at 09:16:10PM +0100, Steve Muckle wrote:
> Hi Patrick,

Hi Steve,
 
> On 09/03/2015 02:18 AM, Patrick Bellasi wrote:
> > In my view, one of the main goals of sched-DVFS is actually to be
> > a solid and generic replacement for the various CPUFreq governors.
> > Being driven by the scheduler, sched-DVFS can exploit information on
> > CPU demand of active tasks in order to select the optimal Operating
> > Performance Point (OPP) using a "proactive" approach instead of the
> > "reactive" approach commonly used by existing governors.
> 
> I'd agree that with knowledge of CPU demand on a per-task basis, rather
> than the aggregate per-CPU demand that cpufreq governors use today, it
> is possible to proactively address changes in CPU demand which result
> from task migrations, task creation and exit, etc.
> 
> That said I believe setting the OPP based on a particular given
> historical profile of task load still relies on a heuristic algorithm of
> some sort where there is no single right answer. I am concerned about
> whether sched-dvfs and SchedTune, as currently proposed, will support
> enough of a range of possible heuristics/policies to effectively replace
> the existing cpufreq governors.
> 
> The two most popular governors for normal operation in the mobile world:
> 
> * ondemand: Samples periodically, CPU usage calculated as simple busy
> fraction of last X ms window of time. Goes straight to fmax when load
> exceeds up_threshold tunable %, otherwise scales frequency
> proportionally with load. Can stay at fmax longer if requested before
> re-evaluating by configuring the sampling_down_factor tunable.
> 
> * interactive: Samples periodically, CPU usage calculated as simple busy
> fraction of last Xms window of time. Goes to an intermediate tunable
> freq (hispeed_freq) when load exceeds a user definable threshold
> (go_hispeed_load). Otherwise strives to maintain the CPU usage set by
> the user in the "target_loads" array. Other knobs that affect behavior
> include min_sample_time (min time to spend at a freq before slowing
> down) and above_hispeed_delay (allows various delays to further raise
> speed above hispeed freq).
> 
> It's also worth noting that mobile vendors typically add all sorts of
> hacks on top of the existing cpufreq governors which further complicate
> policy.

Could it be that many of the hacks introduced by vendors are just
there to implement a kind of "scenario based" tuning of the governors?
I mean, depending on the specific use-case, they try to refine the
values of the exposed tunables to improve performance,
responsiveness or power consumption?

If this is the case, it means that the currently available governors
are missing an important bit of information: what are the best
tunable values for a specific (set of) tasks?

> The current proposal:
> 
> * sched-dvfs/schedtune: Event driven, CPU usage calculated using
> exponential moving average. AFAICS tries to maintain some % of idle
> headroom, but if that headroom doesn't exist at task_tick_fair(), goes
> to max frequency. Schedtune provides a way to boost/inflate the demand
> of individual tasks or overall system demand.

That's quite a good description. One small correction is that, at
least in the implementation presented by this RFC, SchedTune is not
boosting individual tasks but just the CPU usage.
The link with tasks is just that SchedTune knows how much to boost a
CPU's usage by keeping track of which tasks are runnable on that CPU.
However, the utilization signal of each task is not actually modified
from the scheduler's standpoint.
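
To make the point concrete: the boost inflates only the CPU-level signal handed
to sched-DVFS. Assuming a margin proportional to the spare capacity above the
current usage (the exact formula in the series may differ; this is only a
sketch), it looks like:

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024

/* Inflate the CPU usage signal seen by sched-DVFS by boost_pct% of
 * the spare capacity. Per-task utilization signals are untouched. */
unsigned long boosted_cpu_usage(unsigned long usage, int boost_pct)
{
	unsigned long margin = (SCHED_LOAD_SCALE - usage) * boost_pct / 100;

	return usage + margin;
}
```

With boost at 0% the signal is unchanged; with boost at 100% it saturates to
full capacity, which is the "performance governor" end of the knob.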

> This looks a bit like ondemand to me but without the
> sampling_down_factor functionality and using per-entity load tracking
> instead of a simple window-based aggregate CPU usage.

I agree in principle.
An important difference worth noticing is that we use an "event
based" approach. This means that an enqueue/dequeue can trigger
an immediate OPP change.
If you consider that ondemand commonly uses a 20ms sampling rate while
an OPP switch hardly ever requires more than 1 or 2ms, this
means that sched-DVFS can be much more reactive in adapting to
variable loads.

> The interactive functionality would require additional knobs. I
> don't think schedtune will allow for tuning the latency of CPU
> frequency changes (min_sample_time, above_hispeed_delay, etc).

Well, there can certainly be some limitations in the current
implementation. Indeed, the goal of this RFC is to trigger the
discussion and verify whether the overall idea makes sense and how we
can improve it.

However, regarding specifically the latency of OPP changes, there are
a couple of extensions we were thinking about:
1. link the SchedTune boost value with the % of idle headroom which
   triggers an OPP increase
2. use the SchedTune boost value to define the high frequency to jump
   to when a CPU crosses the % of idle headroom

These are tunables which allow parameterizing the way the PELT
signal for CPU usage is interpreted by the sched-DVFS governor.

How such tunables should be exposed and tuned is still to be discussed.
Indeed, one of the main goals of sched-DVFS, and of SchedTune
specifically, is to simplify the tuning of a platform by exposing to
userspace a reduced number of tunables, preferably just one.
 
> A separate but related concern - in the (IMO likely, given the above)
> case that folks want to tinker with that policy, it now means they're
> hacking the scheduler as opposed to a self-contained frequency policy
> plugin.

I do not agree on that point. SchedTune and sched-DVFS are
frameworks quite well separated from the scheduler.
They are "consumers" of signals usually used by the scheduler, but
they do not directly affect scheduler decisions (at least in the
implementation proposed by this RFC).

Side effects are possible, of course. For example, the selection of one
OPP instead of another can affect the residency of a task on a CPU,
thus somehow biasing some scheduler decisions. However, I think this
kind of side effect can be produced by the current governors as
well.

That said, I agree with you that one can have the
impression of hacking the scheduler, since the main compilation unit
of SchedTune is a file under kernel/sched. If this turns out to be a
problem, for example from a maintenance perspective, perhaps we can
find a better location for that code.

> Another issue with policy (but not specific to this proposal) is that
> putting a bunch of it in the CPU frequency selection may derail the
> efforts of the EAS algorithm, which I'm still working on digesting.
> Perhaps a unified sched/cpufreq policy could go there.

We have an internal extension of SchedTune which proposes an
integration with EAS. We have not included it in this RFC, to keep
things simple by exposing in the first instance only the generic bits
which extend the sched-DVFS features.

However, one of the main goals of this proposal is to respond to a
couple of long-standing demands (e.g. [1,2]) for:
1. a better integration of CPUFreq with the scheduler, which has all
   the required knowledge about workload demands to target both
   performance and energy efficiency
2. a simple approach to configuring a system to care more about
   performance or energy efficiency

SchedTune mainly addresses the second point. Once SchedTune is
integrated with EAS, it will provide support for deciding, in an
energy-efficient way, how much we want to reduce power or boost
performance.

> thanks,
> Steve

Thanks for the interesting feedback, this is exactly the kind of
discussion we would like to have around this initial proposal.

Cheers Patrick

[1] https://lkml.org/lkml/2012/5/18/91
[2] http://lwn.net/Articles/552889/

-- 
#include <best/regards.h>

Patrick Bellasi



* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-11 11:09         ` Patrick Bellasi
@ 2015-09-14 20:00           ` Steve Muckle
  2015-09-15 15:00             ` Patrick Bellasi
  0 siblings, 1 reply; 29+ messages in thread
From: Steve Muckle @ 2015-09-14 20:00 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Ricky Liang, Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc, Viresh Kumar

Hi Patrick,

On 09/11/2015 04:09 AM, Patrick Bellasi wrote:
>> It's also worth noting that mobile vendors typically add all sorts of
>> hacks on top of the existing cpufreq governors which further complicate
>> policy.
> 
> Could it be that many of the hacks introduced by vendors are just
> there to implement a kind of "scenario based" tuning of governors?
> I mean, depending on the specific use-case they try to refine the
> value of exposed tunables to improve either performance,
> responsiveness or power consumption?

From what I've seen I think it's both scenario based tuning (add
functionality to detect and improve power/perf for say web browsing or
mp3 playback usecases specifically), as well as tailoring general case
behavior. Some of these are actually new features in the governor though
as opposed to just tweaks of existing tunables.

> If this is the case, it means that the currently available governors
> are missing an important bit of information: what are the best
> tunables values for a specific (set of) tasks?

Agreed, though I think those tunable values might also change for a
given set of tasks in different circumstances.

> 
>> The current proposal:
>>
>> * sched-dvfs/schedtune: Event driven, CPU usage calculated using
>> exponential moving average. AFAICS tries to maintain some % of idle
>> headroom, but if that headroom doesn't exist at task_tick_fair(), goes
>> to max frequency. Schedtune provides a way to boost/inflate the demand
>> of individual tasks or overall system demand.
> 
> That's quite of a good description. One small correction is that, at
> least in the implementation presented by this RFC, SchedTune is not
> boosting individual tasks but just the CPU usage.
> The link with tasks is just that SchedTune knows how much to boost a
> CPU usage by keeping track of which tasks are runnable on that CPU.
> However, the utilization signal of each task is not actually modified
> from the scheduler standpoint.

Ah yes I see what you mean. I was thinking of the cgroup stuff but I see
that max per-task boost is tracked per-CPU and that CPU's aggregate
usage is boosted accordingly.

>> This looks a bit like ondemand to me but without the
>> sampling_down_factor functionality and using per-entity load tracking
>> instead of a simple window-based aggregate CPU usage.
> 
> I agree in principle.
> An important difference worth to notice is that we use an "event
> based" approach. This means that an enqueue/dequeue can trigger
> an immediate OPP change.
> If you consider that commonly ondemand uses a 20ms sample rate while
> an OPP switch never requires (quite likely) more than 1 or 2 ms, this
> means that sched-DVFS can be much more reactive on adapting to
> variable loads.

"Can be" are the important words to me there... it'd be nice to be able
to control that. Aggressive frequency changes may not be desirable for
power or performance, even if the transition can be quickly completed.
The configuration values of min_sample_time and above_hispeed_delay in
the interactive governor on some recent devices may give clues as to
whether latency is being intentionally increased on various platforms.

The latency/reactiveness of CPU frequency changes is also IMO a product
of two things - the CPUfreq/sched-dvfs policy, and the task load
tracking algorithm. I don't have enough experience with the mainline
task load tracking algorithm yet to know how it will compare with the
window-based aggregate CPU usage metric used by mainline cpufreq
governors. But I would imagine it will smooth out some of the aggressive
nature of sched-dvfs' event-driven approach. The hardcoded values in the
task load tracking algorithm seem concerning though from a tuning
standpoint.

>> The interactive functionality would require additional knobs. I
...
> However, regarding specifically the latency on OPP changes, there are
> a couple of extension we was thinking about:
> 1. link the SchedTune boost value with the % of idle headroom which
>    triggers an OPP increase
> 2. use the SchedTune boost value to defined the high frequency to jump
>    at when a CPU crosses the % of idle headroom

Hmmm... This may be useful (only testing/profiling would tell) though it
may be nice to be able to tune these values.

> These are tunables which allows to parameterize the way the PELT
> signal for CPU usage is interpreted by the sched-DVFS governor.
> 
> How such tunables should be exposed and tuned is to be discussed.
> Indeed, one of the main goals of the sched-DVFS and SchedTune
> specifically, is to simplify the tuning of a platform by exposing to
> userspace a reduced number of tunables, preferably just one.

This last point (the desire for a single tunable) is perhaps at the root
of my main concern. There are users/vendors for whom the current
tunables are insufficient, resulting in their hacking the governors to
add more tunables or features in the policy.

Consolidating CPU frequency and idle management in the scheduler will
clean things up and probably make things more effective, but I don't
think it will remove the need for a highly configurable policy.

I'm curious about the drive for one tunable. Is that something there's
specifically been a broad call for? Don't get me wrong, I'm all for
simplification and cleanup, if the flexibility and used features can be
retained.

>> A separate but related concern - in the (IMO likely, given the above)
>> case that folks want to tinker with that policy, it now means they're
>> hacking the scheduler as opposed to a self-contained frequency policy
>> plugin.
> 
> I do not agree on that point. SchedTune, as well as sched-DVFS, are
> framework quit well separated from the scheduler.
> They are "consumers" of signals usually used by the scheduler, but
> they are not directly affecting scheduler decisions (at least in the
> implementation proposed by this RFC).

Agreed it's not affecting scheduler decision making (not directly). It's
more just the mixing of the policy into the same code, as margin is
added in enqueue_task_fair()/task_tick_fair() etc. That one in
particular would probably be easy to solve. A more difficult one is if
someone wants to make adjustments to the load tracking algorithm because
it is driving CPU frequency.

> Side effects are possible, of course. For example the selection of an
...
> However, one of the main goals of this proposal is to respond to a
> couple of long lasting demands (e.g. [1,2]) for:
> 1. a better integration of CPUFreq with the scheduler, which has all
>    the required knowledge about workloads demands to target both
>    performances and energy efficiency
> 2. a simple approach to configure a system to care more about
>    performance or energy-efficiency
> 
> SchedTune addresses mainly the second point. Once SchedTune is
> integrated with EAS it will provide a support to decide, in an
> energy-efficient way, how much we want to reduce power or boost
> performances.

The provided links definitely establish the need for (1) but I am still
wondering about the motivation for (2), because I don't think it's going
to be possible to boil everything down to a single slider tunable
without losing flexibility/functionality.

cheers,
Steve



* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-14 20:00           ` Steve Muckle
@ 2015-09-15 15:00             ` Patrick Bellasi
  2015-09-15 15:19               ` Peter Zijlstra
  2015-09-15 23:55               ` Steve Muckle
  0 siblings, 2 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-09-15 15:00 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Ricky Liang, Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc, Viresh Kumar

On Mon, Sep 14, 2015 at 09:00:51PM +0100, Steve Muckle wrote:
> Hi Patrick,
> 
> On 09/11/2015 04:09 AM, Patrick Bellasi wrote:
> >> It's also worth noting that mobile vendors typically add all sorts of
> >> hacks on top of the existing cpufreq governors which further complicate
> >> policy.
> > 
> > Could it be that many of the hacks introduced by vendors are just
> > there to implement a kind of "scenario based" tuning of governors?
> > I mean, depending on the specific use-case they try to refine the
> > value of exposed tunables to improve either performance,
> > responsiveness or power consumption?
> 
> From what I've seen I think it's both scenario based tuning (add
> functionality to detect and improve power/perf for say web browsing or
> mp3 playback usecases specifically), as well as tailoring general case
> behavior. Some of these are actually new features in the governor though
> as opposed to just tweaks of existing tunables.
> 
> > If this is the case, it means that the currently available governors
> > are missing an important bit of information: what are the best
> > tunables values for a specific (set of) tasks?
> 
> Agreed, though I also think those tunable values might also change for a
> given set of tasks in different circumstances.

Could you provide an example?

In my view the per-task support should be exploited just for quite
specialized tasks, which are usually not subject to many different
phases during their execution.

For example, in a graphics rendering pipeline usually we have a host
"controller" task and a set of "worker" tasks running on the
processing elements of the GPU.
Since the controller task is usually low intensity, it does not
generate a load on the CPU big enough to trigger the selection of a
higher OPP. The main issue in this case is that running this task at a
lower OPP can have a noticeable effect on latency, affecting the
performance of the whole graphics pipeline.

For example, on Intel machines I was able to verify that running two
OpenCL workloads concurrently on the same GPU gives better FPS than
running a single workload alone. This is mainly due to the
selection of a higher OPP on the CPU side when two instances are
running instead of just one.

In these scenarios, boosting the CPU OPP while a specific task
is runnable can help to achieve better performance.

> >> The current proposal:
> >>
> >> * sched-dvfs/schedtune: Event driven, CPU usage calculated using
> >> exponential moving average. AFAICS tries to maintain some % of idle
> >> headroom, but if that headroom doesn't exist at task_tick_fair(), goes
> >> to max frequency. Schedtune provides a way to boost/inflate the demand
> >> of individual tasks or overall system demand.
> > 
> > That's quite of a good description. One small correction is that, at
> > least in the implementation presented by this RFC, SchedTune is not
> > boosting individual tasks but just the CPU usage.
> > The link with tasks is just that SchedTune knows how much to boost a
> > CPU usage by keeping track of which tasks are runnable on that CPU.
> > However, the utilization signal of each task is not actually modified
> > from the scheduler standpoint.
> 
> Ah yes I see what you mean. I was thinking of the cgroup stuff but I see
> that max per-task boost is tracked per-CPU and that CPU's aggregate
> usage is boosted accordingly.

Right, the idea is to have a sort of "boosting inheritance" mechanism.
While two tasks with two different boost values are concurrently
runnable on a CPU, that CPU is boosted according to the maximum boost
value of these two tasks.
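
A minimal sketch of that aggregation rule (the actual implementation would keep
per-boost-group counters updated at enqueue/dequeue rather than scanning the
runnable tasks; cpu_boost() here is illustrative, not a function from the
series):

```c
#include <assert.h>

/* The CPU inherits the maximum boost value (in percent) among its
 * currently runnable tasks; an idle CPU gets no boost. */
int cpu_boost(const int *task_boost, int nr_runnable)
{
	int i, max = 0;

	for (i = 0; i < nr_runnable; i++)
		if (task_boost[i] > max)
			max = task_boost[i];
	return max;
}
```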

> >> This looks a bit like ondemand to me but without the
> >> sampling_down_factor functionality and using per-entity load tracking
> >> instead of a simple window-based aggregate CPU usage.
> > 
> > I agree in principle.
> > An important difference worth to notice is that we use an "event
> > based" approach. This means that an enqueue/dequeue can trigger
> > an immediate OPP change.
> > If you consider that commonly ondemand uses a 20ms sample rate while
> > an OPP switch never requires (quite likely) more than 1 or 2 ms, this
> > means that sched-DVFS can be much more reactive on adapting to
> > variable loads.
> 
> "Can be" are the important words to me there... it'd be nice to be able
> to control that. Aggressive frequency changes may not be desirable for
> power or performance, even if the transition can be quickly completed.
> The configuration values of min_sample_time and above_hispeed_delay in
> the interactive governor on some recent devices may give clues as to
> whether latency is being intentionally increased on various platforms.

IMO these knobs are more like fixes for a too "coarse grained" solution.
The main limitations of the current CPUFreq governors are:
1. they use a single set of knobs to track many different tasks
2. they use a system-wide view to control all tasks

The solution we get is working but, of course, it is an "average"
solution which satisfies only "on average" the requirements of
different tasks.

With SchedTune we would like to get a result similar to the one you
describe using min_sample_time and above_hispeed_delay, by linking
somehow the "interpretation" of the PELT signal with the boost value.

Right now we have in sched-DVFS an idle % headroom which is hardcoded
to be ~20% of the current OPP capacity. When the CPU usage crosses
that threshold, we switch straight to the max OPP.
If we could figure out a proper mechanism to link the boost signal to
both the idle % headroom and the target OPP, I think we could achieve
results quite similar to what you can get with the knobs offered by
the interactive governor.
The more you boost a task, the bigger the idle % headroom and
the higher the OPP you will jump to.
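
Just to sketch the idea in code (the linear mappings and all constants below
are purely illustrative; how boost should actually map onto these values is
exactly what is open for discussion):

```c
#include <assert.h>

/* Idle headroom grows from 20% (no boost) towards 50% (full boost),
 * so boosted tasks trigger an OPP increase earlier. */
unsigned int idle_headroom_pct(int boost_pct)
{
	return 20 + (30 * boost_pct) / 100;
}

/* Jump target grows from the next OPP (no boost) up to the max OPP
 * (full boost) when the headroom threshold is crossed. */
int jump_opp(int cur_opp, int max_opp, int boost_pct)
{
	int target = cur_opp + 1 +
		     ((max_opp - cur_opp - 1) * boost_pct) / 100;

	return target > max_opp ? max_opp : target;
}
```

This would recover the interactive governor's hispeed-style behavior at high
boost while degrading gracefully to a conservative step-up policy at zero boost.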

> The latency/reactiveness of CPU frequency changes are also IMO a product
> of two things - the CPUfreq/sched-dvfs policy, and the task load
> tracking algorithm. I don't have enough experience with the mainline
> task load tracking algorithm yet to know how it will compare with the
> window-based aggregate CPU usage metric used by mainline cpufreq
> governors. But I would imagine it will smooth out some of the aggressive
> nature of sched-dvfs' event-driven approach.

That's right, the PELT signal has dynamics which are well
defined by the time constants it uses. Task enqueue/dequeue events
can happen at a higher rate; however, these are only
"checkpoints" where the most up-to-date value of a PELT signal can be
used to take a decision.

> The hardcoded values in the
> task load tracking algorithm seem concerning though from a tuning
> standpoint.

I agree, that's why we are thinking about the solution described
before. Exploiting the boost value to replace the hardcoded thresholds
should give more flexibility while being per-task defined.
Hopefully, tuning per task can be easier and more effective than
selecting a single value that fits all needs.

> 
> >> The interactive functionality would require additional knobs. I
> ...
> > However, regarding specifically the latency on OPP changes, there are
> > a couple of extension we was thinking about:
> > 1. link the SchedTune boost value with the % of idle headroom which
> >    triggers an OPP increase
> > 2. use the SchedTune boost value to defined the high frequency to jump
> >    at when a CPU crosses the % of idle headroom
> 
> Hmmm... This may be useful (only testing/profiling would tell) though it
> may be nice to be able to tune these values.

Again, in my view the tuning should be per task, with a single knob.
The value of the knob should then be properly mapped onto other internal
values to obtain a well defined behavior, driven by information shared
with the scheduler, i.e. a PELT signal.

> > These are tunables which allows to parameterize the way the PELT
> > signal for CPU usage is interpreted by the sched-DVFS governor.
> > 
> > How such tunables should be exposed and tuned is to be discussed.
> > Indeed, one of the main goals of the sched-DVFS and SchedTune
> > specifically, is to simplify the tuning of a platform by exposing to
> > userspace a reduced number of tunables, preferably just one.
> 
> This last point (the desire for a single tunable) is perhaps at the root
> of my main concern. There are users/vendors for whom the current
> tunables are insufficient, resulting in their hacking the governors to
> add more tunables or features in the policy.

We should also consider that we are proposing not only a single
tunable but also a completely different standpoint: no longer a "blind"
system-wide view of the average system behavior, but instead a more
detailed view of task behavior. A single tunable used to "tag" tasks
is maybe not such a limited solution in this design.

> Consolidating CPU frequency and idle management in the scheduler will
> clean things up and probably make things more effective, but I don't
> think it will remove the need for a highly configurable policy.

This can be verified only by starting to use sched-DVFS + SchedTune in
real/synthetic setups, to see which features, if any, are missing,
or which specific use-cases are not properly managed.
If we are able to set up these experiments, perhaps we will be able to
identify a better design for a scheduler-driven solution.

> I'm curious about the drive for one tunable. Is that something there's
> specifically been a broad call for? Don't get me wrong, I'm all for
> simplification and cleanup, if the flexibility and used features can be
> retained.

That whole thread [1] was somehow calling for a solution which goes
in the direction of a single tunable.

The main idea is to exploit the current effort around EAS.
While we are redesigning some parts of the scheduler to be energy-aware,
it is convenient to also include in that design a knob which allows
configuring how much we want to optimize for reduced power consumption
or increased performance.

> >> A separate but related concern - in the (IMO likely, given the above)
> >> case that folks want to tinker with that policy, it now means they're
> >> hacking the scheduler as opposed to a self-contained frequency policy
> >> plugin.
> > 
> > I do not agree on that point. SchedTune, as well as sched-DVFS, are
> > framework quit well separated from the scheduler.
> > They are "consumers" of signals usually used by the scheduler, but
> > they are not directly affecting scheduler decisions (at least in the
> > implementation proposed by this RFC).
> 
> Agreed it's not affecting scheduler decision making (not directly). It's
> more just the mixing of the policy into the same code, as margin is
> added in enqueue_task_fair()/task_tick_fair() etc. That one in
> particular would probably be easy to solve. A more difficult one is if
> someone wants to make adjustments to the load tracking algorithm because
> it is driving CPU frequency.

That's not so straightforward.

We have plenty of experience, collected over the past years, with CPUFreq
governors and customer-specific mods.
Don't you think we can exploit that experience to reason about a
fresh new design that satisfies all the requirements while
possibly providing a simpler interface?

I agree with you that all the current scenarios must be supported by
the new proposal. We should probably start by listing them and come
up with a set of test cases that allow us to verify where we stand wrt
the state of the art.

Tools and benchmarks to verify the proposals and measure
regressions/improvements should become more and more widely used.
This is an even more important requirement to establish a common
language and aim at objective evaluations.
Moreover, it has already been requested by the scheduler maintainers in
the past.

> > Side effects are possible, of course. For example the selection of an
> ...
> > However, one of the main goals of this proposal is to respond to a
> > couple of long lasting demands (e.g. [1,2]) for:
> > 1. a better integration of CPUFreq with the scheduler, which has all
> >    the required knowledge about workload demands to target both
> >    performance and energy efficiency
> > 2. a simple approach to configure a system to care more about
> >    performance or energy-efficiency
> > 
> > SchedTune addresses mainly the second point. Once SchedTune is
> > integrated with EAS it will provide support for deciding, in an
> > energy-efficient way, how much we want to reduce power or boost
> > performance.
> 
> The provided links definitely establish the need for (1) but I am still
> wondering about the motivation for (2), because I don't think it's going
> to be possible to boil everything down to a single slider tunable
> without losing flexibility/functionality.

I see and understand your concerns; still, I am of the idea that we
should try to evaluate a different solution which possibly simplifies
the user-space interface as well as reduces the tuning effort.
All that without sacrificing the (measurable) efficiency of the final
result.

> cheers,
> Steve
> 

Thanks for this interesting discussion.

Patrick

[1] http://thread.gmane.org/gmane.linux.kernel/1236846/focus=1237796

-- 
#include <best/regards.h>

Patrick Bellasi


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-15 15:00             ` Patrick Bellasi
@ 2015-09-15 15:19               ` Peter Zijlstra
  2015-09-16  0:34                 ` Steve Muckle
  2015-09-15 23:55               ` Steve Muckle
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2015-09-15 15:19 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Steve Muckle, Ricky Liang, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc, Viresh Kumar

On Tue, Sep 15, 2015 at 04:00:45PM +0100, Patrick Bellasi wrote:

> > I'm curious about the drive for one tunable. Is that something there's
> > specifically been a broad call for? Don't get me wrong, I'm all for
> > simplification and cleanup, if the flexibility and used features can be
> > retained.
> 
> All this thread [1] was somehow calling out for a solution which goes
> in the direction of a single tunable.
> 
> The main idea is to exploit the current effort around EAS.
> While we are redesigning some parts of the scheduler to be
> energy-aware, it is convenient to also include in that design a knob
> which allows configuring how much we want to optimize for reduced
> power consumption or increased performance.

Please flip the argument around; providing lots of knobs for vendors to
do $magic with is _NOT_ a good thing.

The whole out-of-tree cpufreq governor hack fest Android thing is a
complete and utter fail on all levels. It's the embedded, ship, forget,
not contribute cycle all over again.

Making that harder is a _GOOD_ thing.

Esp. now that we get hardware which has multiple frequency domains on
the CPU cores, this is going to be really important.


> > Agreed it's not affecting scheduler decision making (not directly). It's
> > more just the mixing of the policy into the same code, as margin is
> > added in enqueue_task_fair()/task_tick_fair() etc. That one in
> > particular would probably be easy to solve. A more difficult one is if
> > someone wants to make adjustments to the load tracking algorithm because
> > it is driving CPU frequency.
> 
> That's not so straightforward.
> 
> We have plenty of experience, collected over the past years, on
> CPUFreq governors and customer-specific mods.
> Don't you think we can exploit that experience to reason around a
> fresh new design that satisfies all requirements while possibly
> providing a simpler interface?
> 
> I agree with you that all the current scenarios must be supported by
> the new proposal. We should probably start by listing them and coming
> up with a set of test cases that let us verify where we are wrt the
> state of the art.
> 
> Tools and benchmarks to verify the proposals and measure regressions
> and progress should see more and more use.
> This is an even more important requirement for establishing a common
> language and aiming at objective evaluations.
> Moreover, it has already been requested by scheduler maintainers in
> the past.

This.

And if $vendor feels their use case doesn't perform well, have them
contribute a benchmark for it. They must have one anyway -- how else are
they going to evaluate the current cpufreq hackery?


Do not encourage vendors to add 'features' in magic warts. Strive to
improve Linux for everyone.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-15 15:00             ` Patrick Bellasi
  2015-09-15 15:19               ` Peter Zijlstra
@ 2015-09-15 23:55               ` Steve Muckle
  2015-09-16  9:26                 ` Juri Lelli
  2015-09-16 10:03                 ` Patrick Bellasi
  1 sibling, 2 replies; 29+ messages in thread
From: Steve Muckle @ 2015-09-15 23:55 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Ricky Liang, Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc, Viresh Kumar

On 09/15/2015 08:00 AM, Patrick Bellasi wrote:
>> Agreed, though I also think those tunable values might also change for a
>> given set of tasks in different circumstances.
> 
> Could you provide an example?
>
> In my view the per-task support should be exploited just for quite
> specialized tasks, which are usually not subject to many different
> phases during their execution.

The surfaceflinger task in Android is a possible example. It can have
the same issue as the graphics controller task you mentioned - needing
to finish quickly so the overall display pipeline can meet its deadline,
but often not exerting enough CPU demand by itself to raise the
frequency high enough.

Since mobile platforms are so power sensitive though, it won't be
possible to boost surfaceflinger all the time. Perhaps the
surfaceflinger boost could be managed by some sort of userspace daemon
monitoring the sort of usecase running and/or whether display deadlines
are being missed, and updating a schedtune boost cgroup.

> For example, in a graphics rendering pipeline usually we have a host
...
> With SchedTune we would like to get a similar result to the one you
> describe using min_sample_time and above_hispeed_delay by linking
> somehow the "interpretation" of the PELT signal with the boost value.
> 
> Right now we have in sched-DVFS an idle % headroom which is hardcoded
> to be ~20% of the current OPP capacity. When the CPU usage crosses
> that threshold, we switch straight to the max OPP.
> If we could figure out a proper mechanism to link the boost signal to
> both the idle % headroom and the target OPP, I think we could achieve
> results quite similar to what you can get with the knobs offered by
> the interactive governor.
> The more you boost a task, the bigger the idle % headroom and the
> higher the OPP you will jump to.

Let's say I have a system with one task (to set aside the per-task vs.
global policy issue temporarily) and I want to define a policy which

 - quickly goes to 1.2GHz when the current frequency is less than
   that and demand exceeds capacity

 - waits at least 40ms (or just "a longer time") before increasing the
   frequency if the current frequency is 1.2GHz or higher

This is similar to (though a simplification of) what interactive is
often configured to do on mobile platforms. AFAIK it's a fairly common
strategy due to the power-perf curves and OPPs available on CPUs, and at
the same time striving to maintain decent UI responsiveness.

Even with the proposed modification to link boost with idle % and target
OPP I don't think there'd currently be a way to express this policy,
which goes beyond the linear scaling of the magnitude of CPU demand
requested by a task, idle headroom or target OPP.

> 
...
>> The hardcoded values in the
>> task load tracking algorithm seem concerning though from a tuning
>> standpoint.
> 
> I agree, that's why we are thinking about the solution described
> before. Exploiting the boost value to replace the hardcoded
> thresholds should give more flexibility while being per-task defined.
> Hopefully, tuning per task can be easier and more effective than
> selecting a single value fitting all needs.
> 
>>
>>>> The interactive functionality would require additional knobs. I
>> ...
>>> However, regarding specifically the latency on OPP changes, there are
>>> a couple of extensions we were thinking about:
>>> 1. link the SchedTune boost value with the % of idle headroom which
>>>    triggers an OPP increase
>>> 2. use the SchedTune boost value to define the frequency to jump
>>>    to when a CPU crosses the % of idle headroom
>>
>> Hmmm... This may be useful (only testing/profiling would tell) though it
>> may be nice to be able to tune these values.
> 
> Again, in my view the tuning should be per task with a single knob.
> The value of the knob should then be properly mapped onto other
> internal values to obtain a well defined behavior driven by
> information shared with the scheduler, i.e. a PELT signal.
> 
>>> These are tunables which allow parameterizing the way the PELT
>>> signal for CPU usage is interpreted by the sched-DVFS governor.
>>>
>>> How such tunables should be exposed and tuned is to be discussed.
>>> Indeed, one of the main goals of the sched-DVFS and SchedTune
>>> specifically, is to simplify the tuning of a platform by exposing to
>>> userspace a reduced number of tunables, preferably just one.
>>
>> This last point (the desire for a single tunable) is perhaps at the root
>> of my main concern. There are users/vendors for whom the current
>> tunables are insufficient, resulting in their hacking the governors to
>> add more tunables or features in the policy.
> 
> We should also consider that we are proposing not only a single
> tunable but also a completely different standpoint. No longer a
> "blind" system-wide view of average system behavior, but instead a
> more detailed view of task behavior. A single tunable used to "tag"
> tasks may not be such a limited solution in this design.

I think the algorithm is still fairly blind. There still has to be a
heuristic for future CPU usage, it's now just per-task and in the
scheduler (PELT), whereas it used to be per-CPU and in the governor.

This allows for good features like adjusting frequency right away on
task migration/creation/exit or per task boosting etc., but I think
policy will still be important. Tasks change their behavior all the
time, at least in the mobile usecases I've seen.

>> Consolidating CPU frequency and idle management in the scheduler will
>> clean things up and probably make things more effective, but I don't
>> think it will remove the need for a highly configurable policy.
> 
> This can be verified only by starting to use sched-DVFS + SchedTune
> on real/synthetic setups to check which features are eventually
> missing, or which specific use-cases are not properly managed.
> If we are able to set up these experiments, perhaps we will be able
> to identify a better design for a scheduler-driven solution.

Agree. I hope to be able to run some of these experiments to help.

>> I'm curious about the drive for one tunable. Is that something there's
...
> We have plenty of experience, collected on the past years, on CPUFreq
> governors and customer specific mods.
> Don't you think we can exploit that experience to reason around a
> fresh new design that allows to satisfy all requirements while
> providing possibly a simpler interface?

Sure. I'm just communicating requirements I've seen :) .

> I agree with you that all the current scenarios must be supported by
> the new proposal. We should probably start by listing them and come
> out with a set of test cases that allow to verify where we are wrt
> the state of the art.

Sounds like a good plan to me... Perhaps we could discuss some mobile
usecases next week at Linaro Connect?

cheers,
Steve


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-15 15:19               ` Peter Zijlstra
@ 2015-09-16  0:34                 ` Steve Muckle
  2015-09-16  7:47                   ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Steve Muckle @ 2015-09-16  0:34 UTC (permalink / raw)
  To: Peter Zijlstra, Patrick Bellasi
  Cc: Ricky Liang, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc, Viresh Kumar

On 09/15/2015 08:19 AM, Peter Zijlstra wrote:
> Please flip the argument around; providing lots of knobs for vendors to
> do $magic with is _NOT_ a good thing.
> 
> The whole out-of-tree cpufreq governor hack fest Android thing is a
> complete and utter fail on all levels. Its the embedded, ship, forget,
> not contribute cycle all over again.
> 
> Making that harder is a _GOOD_ thing.

I get why the plugin-like governor interface may encourage out of tree
development, but why would providing lots of policy knobs/tunables from
the scheduler be bad?

Shouldn't that hopefully reduce the likelihood that someone feels the
need to roll their own stack of kernel modifications which never make it
upstream?

cheers,
Steve

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-16  0:34                 ` Steve Muckle
@ 2015-09-16  7:47                   ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2015-09-16  7:47 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Patrick Bellasi, Ricky Liang, Ingo Molnar,
	linux-kernel, linux-pm, Jonathan Corbet, linux-doc, Viresh Kumar


* Steve Muckle <steve.muckle@linaro.org> wrote:

> On 09/15/2015 08:19 AM, Peter Zijlstra wrote:
> > Please flip the argument around; providing lots of knobs for vendors to
> > do $magic with is _NOT_ a good thing.
> > 
> > The whole out-of-tree cpufreq governor hack fest Android thing is a
> > complete and utter fail on all levels. Its the embedded, ship, forget,
> > not contribute cycle all over again.
> > 
> > Making that harder is a _GOOD_ thing.
> 
> I get why the plugin-like governor interface may encourage out of tree
> development, but why would providing lots of policy knobs/tunables from
> the scheduler be bad?

There's many disadvantages:

 - People/vendors learn to rely on their knob based hacks and start making noise 
   if one of the knobs changes or goes away, reporting regressions, not always 
   reporting that they have tunables twiddled.

 - If distros start using knobs it's easy to get into a situation where different 
   distros have different knobs, and people tune them further. For every single 
   bugreport we'd have to first make 100% sure what the knobs are.

 - Workloads can sometimes be improved by twiddling knobs - while breaking lots of 
   other workloads. We'd like people who contribute to the scheduler to think 
   about the hard problems and improve the whole picture, not just a small part of 
   it. There's a rich set of 'knobs' to prevent the scheduler from doing things 
   (the various resource affinity system calls and facilities), so it's not like
   user-space does not have the flexibility.

 - Having too much configuration space makes upgrades to newer kernels generally 
   harder - and we want to have the opposite effects.

> Shouldn't that hopefully reduce the likelihood that someone feels the need to 
> roll their own stack of kernel modifications which never make it upstream?

If a scheduler bug or inefficiency can be kludged around with a knob then that 
reduces the likelihood of the scheduler getting improved.

Otherwise I'd like to encourage people to change the source if they want to debug 
a problem or improve things - so having to roll your own patches isn't an 
unconditional negative - it's what this whole OSS thing is about.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-15 23:55               ` Steve Muckle
@ 2015-09-16  9:26                 ` Juri Lelli
  2015-09-16 13:49                   ` Vincent Guittot
  2015-09-16 10:03                 ` Patrick Bellasi
  1 sibling, 1 reply; 29+ messages in thread
From: Juri Lelli @ 2015-09-16  9:26 UTC (permalink / raw)
  To: Steve Muckle, Patrick Bellasi
  Cc: Ricky Liang, Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc, Viresh Kumar

Hi Steve,

thanks a lot for this interesting discussion.

On 16/09/15 00:55, Steve Muckle wrote:
> On 09/15/2015 08:00 AM, Patrick Bellasi wrote:
>>> Agreed, though I also think those tunable values might also change for a
>>> given set of tasks in different circumstances.
>>
>> Could you provide an example?
>>
>> In my view the per-task support should be exploited just for quite
>> specialized tasks, which are usually not subject to many different
>> phases during their execution.
> 
> The surfaceflinger task in Android is a possible example. It can have
> the same issue as the graphics controller task you mentioned - needing
> to finish quickly so the overall display pipeline can meet its deadline,
> but often not exerting enough CPU demand by itself to raise the
> frequency high enough.
>

SurfaceFlinger timeliness requirements, and maybe AudioFlinger's and
others' as well, might be better expressed by using other scheduling
classes, IMHO. SCHED_DEADLINE, for example, has built-in explicit
deadline awareness and might work better with this kind of activity.
Not to mention that Android has already started using SCHED_FIFO for
some of its time sensitive tasks. It seems to me that the long run goal
should be to give the scheduler more information about what is going on
and then use such information to make more informed decisions
(scheduling, OPP selection, etc.).

> Since mobile platforms are so power sensitive though, it won't be
> possible to boost surfaceflinger all the time. Perhaps the
> surfaceflinger boost could be managed by some sort of userspace daemon
> monitoring the sort of usecase running and/or whether display deadlines
> are being missed, and updating a schedtune boost cgroup.
> 

I'd say you would like to "boost" just enough to meet a certain quality
of service in the end.

>> For example, in a graphics rendering pipeline usually we have a host
> ...
>> With SchedTune we would like to get a similar result to the one you
>> describe using min_sample_time and above_hispeed_delay by linking
>> somehow the "interpretation" of the PELT signal with the boost value.
>>
>> Right now we have in sched-DVFS an idle % headroom which is hardcoded
>> to be ~20% of the current OPP capacity. When the CPU usage crosses
>> that threshold, we switch straight to the max OPP.
>> If we could figure out a proper mechanism to link the boost signal to
>> both the idle % headroom and the target OPP, I think we could achieve
>> results quite similar to what you can get with the knobs offered by
>> the interactive governor.
>> The more you boost a task, the bigger the idle % headroom and the
>> higher the OPP you will jump to.
> 
> Let's say I have a system with one task (to set aside the per-task vs.
> global policy issue temporarily) and I want to define a policy which
> 
>  - quickly goes to 1.2GHz when the current frequency is less than
>    that and demand exceeds capacity
> 
>  - waits at least 40ms (or just "a longer time") before increasing the
>    frequency if the current frequency is 1.2GHz or higher
> 
> This is similar to (though a simplification of) what interactive is
> often configured to do on mobile platforms. AFAIK it's a fairly common
> strategy due to the power-perf curves and OPPs available on CPUs, and at
> the same time striving to maintain decent UI responsiveness.
> 

Not that this is already in place, but, once we have an energy model
of the platform available to the scheduler (the EAS idea), shouldn't
this kind of consideration be possible without any explicit
configuration? I mean, it seems to me that you start reasoning about
trade-offs after you have obtained power-perf curves for your platform;
but, once this data is available to the scheduler, don't you think we
could put a bit more intelligence there to make the same kind of
decisions you would configure a governor to make?

> Even with the proposed modification to link boost with idle % and target
> OPP I don't think there'd currently be a way to express this policy,
> which goes beyond the linear scaling of the magnitude of CPU demand
> requested by a task, idle headroom or target OPP.
> 
>>
> ...
>>> The hardcoded values in the
>>> task load tracking algorithm seem concerning though from a tuning
>>> standpoint.
>>
>> I agree, that's why we are thinking about the solution described
>> before. Exploiting the boost value to replace the hardcoded
>> thresholds should give more flexibility while being per-task defined.
>> Hopefully, tuning per task can be easier and more effective than
>> selecting a single value fitting all needs.
>>
>>>
>>>>> The interactive functionality would require additional knobs. I
>>> ...
>>>> However, regarding specifically the latency on OPP changes, there are
>>>> a couple of extensions we were thinking about:
>>>> 1. link the SchedTune boost value with the % of idle headroom which
>>>>    triggers an OPP increase
>>>> 2. use the SchedTune boost value to define the frequency to jump
>>>>    to when a CPU crosses the % of idle headroom
>>>
>>> Hmmm... This may be useful (only testing/profiling would tell) though it
>>> may be nice to be able to tune these values.
>>
>> Again, in my view the tuning should be per task with a single knob.
>> The value of the knob should then be properly mapped onto other
>> internal values to obtain a well defined behavior driven by
>> information shared with the scheduler, i.e. a PELT signal.
>>
>>>> These are tunables which allow parameterizing the way the PELT
>>>> signal for CPU usage is interpreted by the sched-DVFS governor.
>>>>
>>>> How such tunables should be exposed and tuned is to be discussed.
>>>> Indeed, one of the main goals of the sched-DVFS and SchedTune
>>>> specifically, is to simplify the tuning of a platform by exposing to
>>>> userspace a reduced number of tunables, preferably just one.
>>>
>>> This last point (the desire for a single tunable) is perhaps at the root
>>> of my main concern. There are users/vendors for whom the current
>>> tunables are insufficient, resulting in their hacking the governors to
>>> add more tunables or features in the policy.
>>
>> We should also consider that we are proposing not only a single
>> tunable but also a completely different standpoint. No longer a
>> "blind" system-wide view of average system behavior, but instead a
>> more detailed view of task behavior. A single tunable used to "tag"
>> tasks may not be such a limited solution in this design.
> 
> I think the algorithm is still fairly blind. There still has to be a
> heuristic for future CPU usage, it's now just per-task and in the
> scheduler (PELT), whereas it used to be per-CPU and in the governor.
> 
> This allows for good features like adjusting frequency right away on
> task migration/creation/exit or per task boosting etc., but I think
> policy will still be important. Tasks change their behavior all the
> time, at least in the mobile usecases I've seen.
> 
>>> Consolidating CPU frequency and idle management in the scheduler will
>>> clean things up and probably make things more effective, but I don't
>>> think it will remove the need for a highly configurable policy.
>>
>> This can be verified only by starting to use sched-DVFS + SchedTune
>> on real/synthetic setups to check which features are eventually
>> missing, or which specific use-cases are not properly managed.
>> If we are able to set up these experiments, perhaps we will be able
>> to identify a better design for a scheduler-driven solution.
> 
> Agree. I hope to be able to run some of these experiments to help.
> 
>>> I'm curious about the drive for one tunable. Is that something there's
> ...
>> We have plenty of experience, collected over the past years, on
>> CPUFreq governors and customer-specific mods.
>> Don't you think we can exploit that experience to reason around a
>> fresh new design that satisfies all requirements while possibly
>> providing a simpler interface?
> 
> Sure. I'm just communicating requirements I've seen :) .
> 

And that's great! :-)

>> I agree with you that all the current scenarios must be supported by
>> the new proposal. We should probably start by listing them and coming
>> up with a set of test cases that let us verify where we are wrt the
>> state of the art.
> 
> Sounds like a good plan to me... Perhaps we could discuss some mobile
> usecases next week at Linaro Connect?
> 

I'm up for it!

Best,

- Juri


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-15 23:55               ` Steve Muckle
  2015-09-16  9:26                 ` Juri Lelli
@ 2015-09-16 10:03                 ` Patrick Bellasi
  1 sibling, 0 replies; 29+ messages in thread
From: Patrick Bellasi @ 2015-09-16 10:03 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Ricky Liang, Peter Zijlstra, Ingo Molnar, linux-kernel, linux-pm,
	Jonathan Corbet, linux-doc, Viresh Kumar

On Wed, Sep 16, 2015 at 12:55:12AM +0100, Steve Muckle wrote:
> On 09/15/2015 08:00 AM, Patrick Bellasi wrote:
> >> Agreed, though I also think those tunable values might also change for a
> >> given set of tasks in different circumstances.
> > 
> > Could you provide an example?
> >
> > In my view the per-task support should be exploited just for quite
> > specialized tasks, which are usually not subject to many different
> > phases during their execution.
> 
> The surfaceflinger task in Android is a possible example. It can have
> the same issue as the graphics controller task you mentioned - needing
> to finish quickly so the overall display pipeline can meet its deadline,
> but often not exerting enough CPU demand by itself to raise the
> frequency high enough.

Right, that's actually a really good example, and it can be an
interesting starting point for experiments to get plots and
performance numbers comparing interactive with the new proposal.

> Since mobile platforms are so power sensitive though, it won't be
> possible to boost surfaceflinger all the time. Perhaps the
> surfaceflinger boost could be managed by some sort of userspace daemon
> monitoring the sort of usecase running and/or whether display deadlines
> are being missed, and updating a schedtune boost cgroup.

That's a really good example of the need to expose a simple yet
effective interface. In the mobile space, middleware like Android or
Chrome (I'm thinking about ChromeOS devices) can provide valuable
input to the scheduler. IMHO, the more the scheduler knows about a
task (or set of tasks), the more we can aim at improving it to give a
standard and well tested solution which targets both energy efficiency
and performance boosting.

I'm still not completely convinced that CGroups could be a suitable
interface, especially considering the ongoing discussion on the
restructuring of the cpuset controller. However, the idea of providing
a per-task/per-process tunable interface is still sound to me.

> > For example, in a graphics rendering pipeline usually we have a host
> ...
> > With SchedTune we would like to get a similar result to the one you
> > describe using min_sample_time and above_hispeed_delay by linking
> > somehow the "interpretation" of the PELT signal with the boost value.
> > 
> > Right now we have in sched-DVFS an idle % headroom which is hardcoded
> > to be ~20% of the current OPP capacity. When the CPU usage crosses
> > that threshold, we switch straight to the max OPP.
> > If we could figure out a proper mechanism to link the boost signal to
> > both the idle % headroom and the target OPP, I think we could achieve
> > results quite similar to what you can get with the knobs offered by
> > the interactive governor.
> > The more you boost a task, the bigger the idle % headroom and the
> > higher the OPP you will jump to.
> 
> Let's say I have a system with one task (to set aside the per-task vs.
> global policy issue temporarily) and I want to define a policy which
> 
>  - quickly goes to 1.2GHz when the current frequency is less than
>    that and demand exceeds capacity
> 
>  - waits at least 40ms (or just "a longer time") before increasing the
>    frequency if the current frequency is 1.2GHz or higher
> 
> This is similar to (though a simplification of) what interactive is
> often configured to do on mobile platforms. AFAIK it's a fairly common
> strategy due to the power-perf curves and OPPs available on CPUs, and at
> the same time striving to maintain decent UI responsiveness.

In the proposal presented with this RFC there is just one "signal
boosting strategy", named "Signal Proportional Compensation" (SPC).
Actually, internally we are evaluating other boosting policies as
well, which we decided not to post in order to keep it simple at the
beginning.

What you are proposing makes sense and is similar to another policy
we were considering, which is however just a slight variation of SPC.
The idea is to use a parameter to define the compensation boundary.
Right now that boundary is set to SCHED_LOAD_SCALE (i.e. 1024), which
is the maximum capacity available on a system.

The same SPC works fine if we use a different value (lower than
SCHED_LOAD_SCALE), which could be configured for example to match a
specific feature of the OPP curve or, in the case of a big.LITTLE
system, the max capacity of a LITTLE cluster.

> Even with the proposed modification to link boost with idle % and target
> OPP I don't think there'd currently be a way to express this policy,
> which goes beyond the linear scaling of the magnitude of CPU demand
> requested by a task, idle headroom or target OPP.

In your example, a 100% compensation could be configured to select
exactly the 1.2GHz OPP. From that point, any further increase of OPP
will be driven just by the original (i.e. not boosted) task
utilization. This should allow, for example on a big.LITTLE system,
boosting a small task to the max OPP of the LITTLE cluster while
making it eligible for a migration to the big cluster only if its real
utilization (after a while) becomes bigger than the LITTLE capacity.

It's worth noticing that in this case "a longer time" is not
something defined once for all the tasks in a system; instead, it is
a time frame more closely related to the specific nature of the tasks
running on a CPU.

We do not spend time trying to tune a system to match all its tasks
on average; instead, we provide valuable information to the scheduler
(and sched-DVFS) so it understands when it is worth switching OPP
according to the information it has about the tasks it's managing.


> ...
> >> The hardcoded values in the
> >> task load tracking algorithm seem concerning though from a tuning
> >> standpoint.
> > 
> > I agree, that's why we are thinking about the solution described
> > before. Exploiting the boost value to replace the hardcoded
> > thresholds should give more flexibility while being per-task defined.
> > Hopefully, tuning per task can be easier and more effective than
> > selecting a single value fitting all needs.
> > 
> >>
> >>>> The interactive functionality would require additional knobs. I
> >> ...
> >>> However, regarding specifically the latency on OPP changes, there are
> >>> a couple of extension we was thinking about:
> >>> 1. link the SchedTune boost value with the % of idle headroom which
> >>>    triggers an OPP increase
> >>> 2. use the SchedTune boost value to defined the high frequency to jump
> >>>    at when a CPU crosses the % of idle headroom
> >>
> >> Hmmm... This may be useful (only testing/profiling would tell) though it
> >> may be nice to be able to tune these values.
> > 
> > Again, in my view the tuning should be per task with a single knob.
> > The value of the knob should than be properly mapped on other internal
> > values to obtain a well defined behavior driven by information shared
> > with the scheduler, i.e. a PELT signal.
> > 
> >>> These are tunables which allows to parameterize the way the PELT
> >>> signal for CPU usage is interpreted by the sched-DVFS governor.
> >>>
> >>> How such tunables should be exposed and tuned is to be discussed.
> >>> Indeed, one of the main goals of the sched-DVFS and SchedTune
> >>> specifically, is to simplify the tuning of a platform by exposing to
> >>> userspace a reduced number of tunables, preferably just one.
> >>
> >> This last point (the desire for a single tunable) is perhaps at the root
> >> of my main concern. There are users/vendors for whom the current
> >> tunables are insufficient, resulting in their hacking the governors to
> >> add more tunables or features in the policy.
> > 
> > We should also consider that we are proposing not only a single
> > tunable but also a completely different standpoint. Not more a "blind"
> > system-wide view on the average system behaviors, but instead a more
> > detailed view on tasks behaviors. A single tunable used to "tag" tasks
> > maybe it's not such a limited solution in this design.
> 
> I think the algorithm is still fairly blind. There still has to be a
> heuristic for future CPU usage, it's now just per-task and in the
> scheduler (PELT), whereas it used to be per-CPU and in the governor.

Forecasting the future is a tough task, especially if you do not get
meaningful information from informed entities. The main risk with
heuristics decoupled from such information is that you get just an
"average good" result, at the cost of a long and painful tuning
activity. If your workload mix changes, the tuning risks being
broken.

IMO a more valuable approach is to provide effective interfaces to
collect meaningful information. Then, underneath, a well defined
design can correlate and exploit all this information to take "good
enough" decisions.

> This allows for good features like adjusting frequency right away on
> task migration/creation/exit or per task boosting etc., but I think
> policy will still be important. Tasks change their behavior all the
> time, at least in the mobile usecases I've seen.

That's where a middleware (possibly) should have a simple and well
defined interface to update the hints given to the scheduler for a
specific task.

> >> Consolidating CPU frequency and idle management in the scheduler will
> >> clean things up and probably make things more effective, but I don't
> >> think it will remove the need for a highly configurable policy.
> > 
> > This can be verified only by starting to use sched-DVFS + SchedTune on
> > real/synthetic setup to verify which features are eventually missing,
> > or specific use-cases not properly managed.
> > If we are able to setup these experiments perhaps we will be able to
> > identify a better design for a scheduler driver solution.
> 
> Agree. I hope to be able to run some of these experiments to help.

Good. We should also discuss an effective way to run experiments and
collect/share results. We have some tools and ideas about that... we
can go into more detail next week at Linaro Connect.

> >> I'm curious about the drive for one tunable. Is that something there's
> ...
> > We have plenty of experience, collected on the past years, on CPUFreq
> > governors and customer specific mods.
> > Don't you think we can exploit that experience to reason around a
> > fresh new design that allows to satisfy all requirements while
> > providing possibly a simpler interface?
> 
> Sure. I'm just communicating requirements I've seen :) .

That's exactly what we need at this initial stage.
I think we are heading in the right direction to set up a fruitful
discussion.

> > I agree with you that all the current scenarios must be supported by
> > the new proposal. We should probably start by listing them and come
> > out with a set of test cases that allow to verify where we are wrt
> > the state of the art.
> 
> Sounds like a good plan to me... Perhaps we could discuss some mobile
> usecases next week at Linaro Connect?

Absolutely yes!

> 
> cheers,
> Steve
> 

Cheers, Patrick

-- 
#include <best/regards.h>

Patrick Bellasi



* Re: [RFC 08/14] sched/tune: add detailed documentation
  2015-09-16  9:26                 ` Juri Lelli
@ 2015-09-16 13:49                   ` Vincent Guittot
  0 siblings, 0 replies; 29+ messages in thread
From: Vincent Guittot @ 2015-09-16 13:49 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steve Muckle, Patrick Bellasi, Ricky Liang, Peter Zijlstra,
	Ingo Molnar, linux-kernel, linux-pm, Jonathan Corbet, linux-doc,
	Viresh Kumar

On 16 September 2015 at 11:26, Juri Lelli <juri.lelli@arm.com> wrote:
>
> Hi Steve,
>
> thanks a lot for this interesting discussion.
>
> On 16/09/15 00:55, Steve Muckle wrote:
> > On 09/15/2015 08:00 AM, Patrick Bellasi wrote:
> >>> Agreed, though I also think those tunable values might also change for a
> >>> given set of tasks in different circumstances.
> >>
> >> Could you provide an example?
> >>
> >> In my view the per-task support should be exploited just for quite
> >> specialized tasks, which are usually not subject to many different
> >> phases during their execution.
> >
> > The surfaceflinger task in Android is a possible example. It can have
> > the same issue as the graphics controller task you mentioned - needing
> > to finish quickly so the overall display pipeline can meet its deadline,
> > but often not exerting enough CPU demand by itself to raise the
> > frequency high enough.
> >
>
> SurfaceFlinger timeliness requirements, and maybe AudioFlinger's and
> others' as well, might be better expressed by using other scheduling
> classes, IMHO. SCHED_DEADLINE, for example, has built-in explicit

I fully agree on this point: we must be sure not to create a knob to
solve some latency/perf/power issue in a sched class when it can be
easily solved with a more appropriate sched class.
SurfaceFlinger and SCHED_DEADLINE are a good example of this kind of
"critical" task that can accept a limited amount of latency.
Vincent

>
> deadlines awareness and might work better with this kind of activities.
> Not to mention that Android has already started using SCHED_FIFO for
> some of its time sensitive tasks. It seems to me that the long run goal
> should be to give the scheduler more information about what is going on
> and then use such information to do more informed decisions (scheduling,
> OPP selection, etc.).
>
> > Since mobile platforms are so power sensitive though, it won't be
> > possible to boost surfaceflinger all the time. Perhaps the
> > surfaceflinger boost could be managed by some sort of userspace daemon
> > monitoring the sort of usecase running and/or whether display deadlines
> > are being missed, and updating a schedtune boost cgroup.
> >
>
> I'd say you would like to "boost" just enough to meet a certain quality
> of service in the end.
>
> >> For example, in a graphics rendering pipeline usually we have a host
> > ...
> >> With SchedTune we would like to get a similar result to the one you
> >> describe using min_sample_time and above_hispeed_delay by linking
> >> somehow the "interpretation" of the PELT signal with the boost value.
> >>
> >> Right now we have in sched-DVFS an idle % headroom which is hardcoded
> >> to be ~20% of the current OPP capacity. When we cross that boundary
> >> that threshold with the CPU usage, we switch straight to the max OPP.
> >> If we could figure out a proper mechanism to link the boost signal to
> >> both the idle % headroom and the target OPP, I think we could achieve
> >> quite similar results than what you can get with the knobs offered by
> >> the interactive governor.
> >> The more you boost a task the bigger is the idle % headroom and
> >> the higher is the OPP you will jump.
> >
> > Let's say I have a system with one task (to set aside the per-task vs.
> > global policy issue temporarily) and I want to define a policy which
> >
> >  - quickly goes to 1.2GHz when the current frequency is less than
> >    that and demand exceeds capacity
> >
> >  - waits at least 40ms (or just "a longer time") before increasing the
> >    frequency if the current frequency is 1.2GHz or higher
> >
> > This is similar to (though a simplification of) what interactive is
> > often configured to do on mobile platforms. AFAIK it's a fairly common
> > strategy due to the power-perf curves and OPPs available on CPUs, and at
> > the same time striving to maintain decent UI responsiveness.
> >
>
> Not that this is already in place, but, once we'll have an energy model
> of the platform available to the scheduler (the EAS idea), shouldn't
> this kind of considerations be possible without any explicit
> configuration? I mean, it seems to me that you start reasoning about
> trade-offs after you obtained power-perf curves for your platform; but,
> once this data will be available to the scheduler, don't you think we
> could put a bit more intelligence there to make the same kind of
> decisions you would configure a governor to do?
>
> > Even with the proposed modification to link boost with idle % and target
> > OPP I don't think there'd currently be a way to express this policy,
> > which goes beyond the linear scaling of the magnitude of CPU demand
> > requested by a task, idle headroom or target OPP.
> >
> >>
> > ...
> >>> The hardcoded values in the
> >>> task load tracking algorithm seem concerning though from a tuning
> >>> standpoint.
> >>
> >> I agree, that's why we are thinking about the solution described
> >> before. Exploit the boost value to replace the hardcoded thresholds
> >> should allow to get more flexibility while being per-task defined.
> >> Hopefully, tuning per task can be more easy and effective than
> >> selection a single value fitting all needs.
> >>
> >>>
> >>>>> The interactive functionality would require additional knobs. I
> >>> ...
> >>>> However, regarding specifically the latency on OPP changes, there are
> >>>> a couple of extension we was thinking about:
> >>>> 1. link the SchedTune boost value with the % of idle headroom which
> >>>>    triggers an OPP increase
> >>>> 2. use the SchedTune boost value to defined the high frequency to jump
> >>>>    at when a CPU crosses the % of idle headroom
> >>>
> >>> Hmmm... This may be useful (only testing/profiling would tell) though it
> >>> may be nice to be able to tune these values.
> >>
> >> Again, in my view the tuning should be per task with a single knob.
> >> The value of the knob should than be properly mapped on other internal
> >> values to obtain a well defined behavior driven by information shared
> >> with the scheduler, i.e. a PELT signal.
> >>
> >>>> These are tunables which allows to parameterize the way the PELT
> >>>> signal for CPU usage is interpreted by the sched-DVFS governor.
> >>>>
> >>>> How such tunables should be exposed and tuned is to be discussed.
> >>>> Indeed, one of the main goals of the sched-DVFS and SchedTune
> >>>> specifically, is to simplify the tuning of a platform by exposing to
> >>>> userspace a reduced number of tunables, preferably just one.
> >>>
> >>> This last point (the desire for a single tunable) is perhaps at the root
> >>> of my main concern. There are users/vendors for whom the current
> >>> tunables are insufficient, resulting in their hacking the governors to
> >>> add more tunables or features in the policy.
> >>
> >> We should also consider that we are proposing not only a single
> >> tunable but also a completely different standpoint. Not more a "blind"
> >> system-wide view on the average system behaviors, but instead a more
> >> detailed view on tasks behaviors. A single tunable used to "tag" tasks
> >> maybe it's not such a limited solution in this design.
> >
> > I think the algorithm is still fairly blind. There still has to be a
> > heuristic for future CPU usage, it's now just per-task and in the
> > scheduler (PELT), whereas it used to be per-CPU and in the governor.
> >
> > This allows for good features like adjusting frequency right away on
> > task migration/creation/exit or per task boosting etc., but I think
> > policy will still be important. Tasks change their behavior all the
> > time, at least in the mobile usecases I've seen.
> >
> >>> Consolidating CPU frequency and idle management in the scheduler will
> >>> clean things up and probably make things more effective, but I don't
> >>> think it will remove the need for a highly configurable policy.
> >>
> >> This can be verified only by starting to use sched-DVFS + SchedTune on
> >> real/synthetic setup to verify which features are eventually missing,
> >> or specific use-cases not properly managed.
> >> If we are able to setup these experiments perhaps we will be able to
> >> identify a better design for a scheduler driver solution.
> >
> > Agree. I hope to be able to run some of these experiments to help.
> >
> >>> I'm curious about the drive for one tunable. Is that something there's
> > ...
> >> We have plenty of experience, collected on the past years, on CPUFreq
> >> governors and customer specific mods.
> >> Don't you think we can exploit that experience to reason around a
> >> fresh new design that allows to satisfy all requirements while
> >> providing possibly a simpler interface?
> >
> > Sure. I'm just communicating requirements I've seen :) .
> >
>
> And that's great! :-)
>
> >> I agree with you that all the current scenarios must be supported by
> >> the new proposal. We should probably start by listing them and come
> >> out with a set of test cases that allow to verify where we are wrt
> >> the state of the art.
> >
> > Sounds like a good plan to me... Perhaps we could discuss some mobile
> > usecases next week at Linaro Connect?
> >
>
> I'm up for it!
>
> Best,
>
> - Juri
>


end of thread, other threads:[~2015-09-16 13:49 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-19 18:47 [RFC PATCH 00/14] sched: Central, scheduler-driven, power-perfomance control Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 01/14] sched/cpufreq_sched: use static key for cpu frequency selection Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 02/14] sched/fair: add triggers for OPP change requests Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 03/14] sched/{core,fair}: trigger OPP change request on fork() Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 04/14] sched/{fair,cpufreq_sched}: add reset_capacity interface Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 05/14] sched/fair: jump to max OPP when crossing UP threshold Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 06/14] sched/cpufreq_sched: modify pcpu_capacity handling Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 07/14] sched/fair: cpufreq_sched triggers for load balancing Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 08/14] sched/tune: add detailed documentation Patrick Bellasi
2015-09-02  6:49   ` [RFC,08/14] " Ricky Liang
2015-09-03  9:18     ` [RFC 08/14] " Patrick Bellasi
2015-09-04  7:59       ` Ricky Liang
2015-09-09 20:16       ` Steve Muckle
2015-09-11 11:09         ` Patrick Bellasi
2015-09-14 20:00           ` Steve Muckle
2015-09-15 15:00             ` Patrick Bellasi
2015-09-15 15:19               ` Peter Zijlstra
2015-09-16  0:34                 ` Steve Muckle
2015-09-16  7:47                   ` Ingo Molnar
2015-09-15 23:55               ` Steve Muckle
2015-09-16  9:26                 ` Juri Lelli
2015-09-16 13:49                   ` Vincent Guittot
2015-09-16 10:03                 ` Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 09/14] sched/tune: add sysctl interface to define a boost value Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 10/14] sched/fair: add function to convert boost value into "margin" Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 11/14] sched/fair: add boosted CPU usage Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 12/14] sched/tune: add initial support for CGroups based boosting Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 13/14] sched/tune: compute and keep track of per CPU boost value Patrick Bellasi
2015-08-19 18:47 ` [RFC PATCH 14/14] sched/{fair,tune}: track RUNNABLE tasks impact on " Patrick Bellasi
