* [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection
@ 2016-02-23  1:22 Steve Muckle
  2016-02-23  1:22 ` [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency Steve Muckle
                   ` (11 more replies)
  0 siblings, 12 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette

Scheduler-driven CPU frequency selection hopes to exploit both
per-task and global information in the scheduler to improve frequency
selection policy and achieve lower power consumption, improved
responsiveness/performance, and less reliance on heuristics and
tunables. For further discussion of this integration see [0].

This patch series implements a cpufreq governor which collects CPU
capacity requests from the fair, realtime, and deadline scheduling
classes. The fair and realtime scheduling classes are modified to make
these requests. The deadline class is not yet modified to make CPU
capacity requests.

Changes in this series since RFCv6 [1], posted December 9, 2015:
  Patch 3, sched: scheduler-driven cpu frequency selection
  - Added Kconfig dependency on IRQ_WORK.
  - Reworked locking.
  - Make throttling optional - it is not required in order to ensure that
    the previous frequency transition is complete.
  - Some fixes in cpufreq_sched_thread related to the task state.
  - Changes to support mixed fast and slow path operation.
  Patch 7: sched/fair: jump to max OPP when crossing UP threshold
  - move sched_freq_tick() call so rq lock is still held
  Patch 9: sched/deadline: split rt_avg in 2 distinct metrics
  - RFCv6 calculated DL capacity from DL task parameters, RFCv7 restores
    the original method of calculation but keeps DL capacity separate
  Patch 10: sched: rt scheduler sets capacity requirement
  - change #ifdef from CONFIG_SMP, trivial cleanup

Profiling results:
Performance profiling has been done using rt-app [2] to generate
various periodic workloads, each with a particular duty cycle. The time
to complete the busy portion of the duty cycle is measured, and overhead
is calculated as

overhead = (busy_duration_test_gov - busy_duration_perf_gov)/
         (busy_duration_pwrsave_gov - busy_duration_perf_gov)

This shows as a percentage how close the governor is to running the
workload at fmin (100%) or fmax (0%). The number of times the busy
duration exceeds the period of the periodic workload (an "overrun") is
also recorded. In the tables below the performance of the ondemand
(sampling_rate = 20ms), interactive (default tunables), and
scheduler-driven governors is evaluated using these metrics. The test
platform is a Samsung Chromebook 2 ("Peach Pi"). The workload is
affined to CPU0, an A15 with an fmin of 200MHz and an fmax of
1.8GHz. The interactive governor was incorporated/adapted from [3]. A
branch with the interactive governor and a few required dependency
patches for ARM is available at [4].
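
To make the overhead metric concrete, here is a small standalone sketch
using made-up busy durations (the numbers are purely illustrative and are
not taken from the tables below):

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical busy durations in msec; illustration only */
		double busy_perf    = 10.0;	/* performance governor (fmax) */
		double busy_pwrsave = 90.0;	/* powersave governor (fmin) */
		double busy_test    = 25.0;	/* governor under test */
		double overhead;

		overhead = (busy_test - busy_perf) /
			   (busy_pwrsave - busy_perf) * 100.0;

		/* prints "overhead = 18.75%": much closer to fmax than to fmin */
		printf("overhead = %.2f%%\n", overhead);
		return 0;
	}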

More detailed explanation of the columns below:
run: duration at fmax of the busy portion of the periodic workload in msec
period: duration of the entire period of the periodic workload in msec
loops: number of iterations of the periodic workload tested
OR: number of instances of overrun as described above
OH: overhead as calculated above

SCHED_OTHER workload:
 wload parameters	  ondemand        interactive     sched	
run	period	loops	OR	OH	OR	OH	OR	OH
1	100	100	0	62.07%	0	100.02%	0	78.49%
10	1000	10	0	21.80%	0	22.74%	0	72.56%
1	10	1000	0	21.72%	0	63.08%	0	52.40%
10	100	100	0	8.09%	0	15.53%	0	17.33%
100	1000	10	0	1.83%	0	1.77%	0	0.29%
6	33	300	0	15.32%	0	8.60%	0	17.34%
66	333	30	0	0.79%	0	3.18%	0	12.26%
4	10	1000	0	5.87%	0	10.21%	0	6.15%
40	100	100	0	0.41%	0	0.04%	0	2.68%
400	1000	10	0	0.42%	0	0.50%	0	1.22%
5	9	1000	2	3.82%	1	6.10%	0	2.51%
50	90	100	0	0.19%	0	0.05%	0	1.71%
500	900	10	0	0.37%	0	0.38%	0	1.82%
9	12	1000	6	1.79%	1	0.77%	0	0.26%
90	120	100	0	0.16%	1	0.05%	0	0.49%
900	1200	10	0	0.09%	0	0.26%	0	0.62%

SCHED_FIFO workload:
 wload parameters	  ondemand        interactive     sched	
run	period	loops	OR	OH	OR	OH	OR	OH
1	100	100	0	39.61%	0	100.49%	0	99.57%
10	1000	10	0	73.51%	0	21.09%	0	96.66%
1	10	1000	0	18.01%	0	61.46%	0	67.68%
10	100	100	0	31.31%	0	18.62%	0	77.01%
100	1000	10	0	58.80%	0	1.90%	0	15.40%
6	33	300	251	85.99%	0	9.20%	1	30.09%
66	333	30	24	84.03%	0	3.38%	0	33.23%
4	10	1000	0	6.23%	0	12.21%	10	11.54%
40	100	100	100	62.08%	0	0.11%	1	11.85%
400	1000	10	10	62.09%	0	0.51%	0	7.00%
5	9	1000	999	12.29%	1	6.03%	0	0.04%
50	90	100	99	61.47%	0	0.05%	2	6.53%
500	900	10	10	43.37%	0	0.39%	0	6.30%
9	12	1000	999	9.83%	0	0.01%	14	1.69%
90	120	100	99	61.47%	0	0.01%	28	2.29%
900	1200	10	10	43.31%	0	0.22%	0	2.15%

Note that at this point RT CPU capacity is measured via rt_avg. For
the above results sched_time_avg_ms has been set to 50ms.

Known issues:
 - More testing with real-world workloads, such as UI workloads and
   benchmarks, is required.
 - The power side of the characterization is in progress.
 - The deadline scheduling class does not yet make CPU capacity requests.
 - Not sure what's going on yet with the ondemand numbers above; it seems like
   there may be a regression with ondemand and RT tasks.

Dependencies:
Frequency invariant load tracking is required. For heterogeneous
systems such as big.LITTLE, CPU invariant load tracking is required as
well. The required support for ARM platforms along with a patch
creating tracepoints for cpufreq_sched is located in [5].

References:
[0] http://article.gmane.org/gmane.linux.kernel/1499836
[1] http://thread.gmane.org/gmane.linux.power-management.general/69176
[2] https://git.linaro.org/power/rt-app.git
[3] https://lkml.org/lkml/2015/10/28/782
[4] https://git.linaro.org/people/steve.muckle/kernel.git/shortlog/refs/heads/interactive
[5] https://git.linaro.org/people/steve.muckle/kernel.git/shortlog/refs/heads/sched-freq-rfcv7

Juri Lelli (3):
  sched/fair: add triggers for OPP change requests
  sched/{core,fair}: trigger OPP change request on fork()
  sched/fair: cpufreq_sched triggers for load balancing

Michael Turquette (2):
  cpufreq: introduce cpufreq_driver_is_slow
  sched: scheduler-driven cpu frequency selection

Morten Rasmussen (1):
  sched: Compute cpu capacity available at current frequency

Steve Muckle (1):
  sched/fair: jump to max OPP when crossing UP threshold

Vincent Guittot (3):
  sched: remove call of sched_avg_update from sched_rt_avg_update
  sched/deadline: split rt_avg in 2 distinct metrics
  sched: rt scheduler sets capacity requirement

 drivers/cpufreq/Kconfig      |  21 ++
 drivers/cpufreq/cpufreq.c    |   6 +
 include/linux/cpufreq.h      |  12 ++
 include/linux/sched.h        |   8 +
 kernel/sched/Makefile        |   1 +
 kernel/sched/core.c          |  43 +++-
 kernel/sched/cpufreq_sched.c | 459 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/deadline.c      |   2 +-
 kernel/sched/fair.c          | 108 +++++-----
 kernel/sched/rt.c            |  48 ++++-
 kernel/sched/sched.h         | 120 ++++++++++-
 11 files changed, 777 insertions(+), 51 deletions(-)
 create mode 100644 kernel/sched/cpufreq_sched.c

-- 
2.4.10

* [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-02-23  1:41   ` Rafael J. Wysocki
  2016-02-23  1:22 ` [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow Steve Muckle
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette

From: Morten Rasmussen <morten.rasmussen@arm.com>

capacity_orig_of() returns the max available compute capacity of a cpu.
For scale-invariant utilization tracking and energy-aware scheduling
decisions it is useful to know the compute capacity available at the
current OPP of a cpu.
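
As a hypothetical illustration of what the new helper computes (numbers
made up, not part of the patch): a CPU whose cpu_capacity_orig is 1024
and which is currently running at half of its maximum frequency, i.e.
arch_scale_freq_capacity() returning 512, has a current capacity of

	/* Sketch only: capacity_orig = 1024, freq scale = 512 (fmax/2) */
	unsigned long capacity_curr = 1024UL * 512 >> 10;	/* = 512 */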

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce24a4..3437e01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4821,6 +4821,17 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 #endif
 
 /*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+static unsigned long capacity_curr_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig *
+	       arch_scale_freq_capacity(NULL, cpu)
+	       >> SCHED_CAPACITY_SHIFT;
+}
+
+/*
  * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
  * A waker of many should wake a different task than the one last awakened
  * at a frequency roughly N times higher than one of its wakees.  In order
-- 
2.4.10

* [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
  2016-02-23  1:22 ` [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-02-23  1:31   ` Rafael J. Wysocki
  2016-02-23  1:22 ` [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection Steve Muckle
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette,
	Viresh Kumar

From: Michael Turquette <mturquette@baylibre.com>

Some architectures and platforms perform CPU frequency transitions
through a non-blocking method, while some might block or sleep. Even
when frequency transitions do not block or sleep they may be very slow.
This distinction is important when trying to change frequency from
a non-interruptible context in a scheduler hot path.

Describe this distinction with a cpufreq driver flag,
CPUFREQ_DRIVER_FAST. The default is to not have this flag set,
thus erring on the side of caution.

cpufreq_driver_is_slow() is also introduced in this patch. Setting
the above flag will allow this function to return false.
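
For reference, a driver whose transitions are fast and safe to issue from
atomic context would opt in roughly like this (hypothetical driver, shown
only as a sketch):

	static struct cpufreq_driver example_fast_driver = {
		.name	= "example-fast",
		/* transitions are safe and fast enough for scheduler hot paths */
		.flags	= CPUFREQ_DRIVER_FAST | CPUFREQ_NEED_INITIAL_FREQ_CHECK,
		/* .init, .verify, .target_index, etc. omitted */
	};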

[smuckle@linaro.org: change flag/API to include drivers that are too
 slow for scheduler hot paths, in addition to those that block/sleep]

Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Michael Turquette <mturquette@baylibre.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 drivers/cpufreq/cpufreq.c | 6 ++++++
 include/linux/cpufreq.h   | 9 +++++++++
 2 files changed, 15 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index e979ec7..88e63ca 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -154,6 +154,12 @@ bool have_governor_per_policy(void)
 }
 EXPORT_SYMBOL_GPL(have_governor_per_policy);
 
+bool cpufreq_driver_is_slow(void)
+{
+	return !(cpufreq_driver->flags & CPUFREQ_DRIVER_FAST);
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_is_slow);
+
 struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy)
 {
 	if (have_governor_per_policy())
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 88a4215..93e1c1c 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -160,6 +160,7 @@ u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy);
 int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
 int cpufreq_update_policy(unsigned int cpu);
 bool have_governor_per_policy(void);
+bool cpufreq_driver_is_slow(void);
 struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
 #else
 static inline unsigned int cpufreq_get(unsigned int cpu)
@@ -316,6 +317,14 @@ struct cpufreq_driver {
  */
 #define CPUFREQ_NEED_INITIAL_FREQ_CHECK	(1 << 5)
 
+/*
+ * Indicates that it is safe to call cpufreq_driver_target from
+ * non-interruptible context in scheduler hot paths.  Drivers must
+ * opt-in to this flag, as the safe default is that they might sleep
+ * or be too slow for hot path use.
+ */
+#define CPUFREQ_DRIVER_FAST		(1 << 6)
+
 int cpufreq_register_driver(struct cpufreq_driver *driver_data);
 int cpufreq_unregister_driver(struct cpufreq_driver *driver_data);
 
-- 
2.4.10

* [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
  2016-02-23  1:22 ` [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency Steve Muckle
  2016-02-23  1:22 ` [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-02-25  3:55   ` Rafael J. Wysocki
  2016-03-03 14:21   ` Ingo Molnar
  2016-02-23  1:22 ` [RFCv7 PATCH 04/10] sched/fair: add triggers for OPP change requests Steve Muckle
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette,
	Ricky Liang, Juri Lelli

From: Michael Turquette <mturquette@baylibre.com>

Scheduler-driven CPU frequency selection hopes to exploit both
per-task and global information in the scheduler to improve frequency
selection policy, achieving lower power consumption, improved
responsiveness/performance, and less reliance on heuristics and
tunables. For further discussion on the motivation of this integration
see [0].

This patch implements a shim layer between the Linux scheduler and the
cpufreq subsystem. The interface accepts capacity requests from the
CFS, RT and deadline sched classes. The requests from each sched class
are summed on each CPU with a margin applied to the CFS and RT
capacity requests to provide some headroom. Deadline requests are
expected to be precise enough given their nature to not require
headroom. The maximum total capacity request for a CPU in a frequency
domain drives the requested frequency for that domain.
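
As a rough numeric sketch of the arithmetic described above and
implemented below in update_cpu_capacity_request() and
update_fdomain_capacity_request() (all values are hypothetical):

	/* Sketch only; values chosen purely for illustration. */
	static unsigned long sketch_freq_request(void)
	{
		unsigned long cfs = 300, rt = 100, dl = 50;	/* per-class requests */
		unsigned long margin = 1280;		/* capacity_margin, ~25% headroom */
		unsigned long total = (cfs + rt) * margin / 1024 + dl;	/* = 550 */
		unsigned long fmax = 1800000;		/* policy->max in kHz */

		/* = 966796 kHz; rounded to a real OPP via the frequency table */
		return total * fmax >> 10;
	}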

Policy is determined by both the sched classes and this shim layer.

Note that this algorithm is event-driven. There is no polling loop to
check CPU idle time, nor any other mechanism that operates asynchronously
to the scheduler, aside from an optional throttling mechanism.

Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
code and test results, and to Ricky Liang <jcliang@chromium.org>
for initialization and static key inc/dec fixes.

[0] http://article.gmane.org/gmane.linux.kernel/1499836

[smuckle@linaro.org: various additions and fixes, revised commit text]

CC: Ricky Liang <jcliang@chromium.org>
Signed-off-by: Michael Turquette <mturquette@baylibre.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 drivers/cpufreq/Kconfig      |  21 ++
 include/linux/cpufreq.h      |   3 +
 include/linux/sched.h        |   8 +
 kernel/sched/Makefile        |   1 +
 kernel/sched/cpufreq_sched.c | 459 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         |  51 +++++
 6 files changed, 543 insertions(+)
 create mode 100644 kernel/sched/cpufreq_sched.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 659879a..82d1548 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,14 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the
 	  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHED
+	bool "sched"
+	select CPU_FREQ_GOV_SCHED
+	help
+	  Use the CPUfreq governor 'sched' as default. This scales
+	  cpu frequency using CPU utilization estimates from the
+	  scheduler.
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +191,19 @@ config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_SCHED
+	bool "'sched' cpufreq governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	select IRQ_WORK
+	help
+	  'sched' - this governor scales cpu frequency from the
+	  scheduler as a function of cpu capacity utilization. It does
+	  not evaluate utilization on a periodic basis (as ondemand
+	  does) but instead is event-driven by the scheduler.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 93e1c1c..ce8b895 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -495,6 +495,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED)
+extern struct cpufreq_governor cpufreq_gov_sched;
+#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_sched)
 #endif
 
 /*********************************************************************
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a292c4b..27a6cd8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -937,6 +937,14 @@ enum cpu_idle_type {
 #define SCHED_CAPACITY_SHIFT	10
 #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
 
+struct sched_capacity_reqs {
+	unsigned long cfs;
+	unsigned long rt;
+	unsigned long dl;
+
+	unsigned long total;
+};
+
 /*
  * Wake-queues are lists of tasks with a pending wakeup, whose
  * callers have already marked the task as woken internally,
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 6768797..90ed832 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
new file mode 100644
index 0000000..a113e4e
--- /dev/null
+++ b/kernel/sched/cpufreq_sched.c
@@ -0,0 +1,459 @@
+/*
+ *  Copyright (C)  2015 Michael Turquette <mturquette@linaro.org>
+ *  Copyright (C)  2015-2016 Steve Muckle <smuckle@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include <linux/irq_work.h>
+#include <linux/delay.h>
+#include <linux/string.h>
+
+#include "sched.h"
+
+struct static_key __read_mostly __sched_freq = STATIC_KEY_INIT_FALSE;
+static bool __read_mostly cpufreq_driver_slow;
+
+/*
+ * The number of enabled schedfreq policies is modified during GOV_START/STOP.
+ * It, along with whether the schedfreq static key is enabled, is protected by
+ * the gov_enable_lock.
+ */
+static int enabled_policies;
+static DEFINE_MUTEX(gov_enable_lock);
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
+static struct cpufreq_governor cpufreq_gov_sched;
+#endif
+
+/*
+ * Capacity margin added to CFS and RT capacity requests to provide
+ * some head room if task utilization further increases.
+ */
+unsigned int capacity_margin = 1280;
+
+static DEFINE_PER_CPU(struct gov_data *, cpu_gov_data);
+DEFINE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
+
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: next throttling period expiry. Derived from throttle_nsec
+ * @throttle_nsec: throttle period length in nanoseconds
+ * @task: worker thread for dvfs transition that may block/sleep
+ * @irq_work: callback used to wake up worker thread
+ * @policy: pointer to cpufreq policy associated with this governor data
+ * @fastpath_lock: prevents multiple CPUs in a frequency domain from racing
+ * with each other in fast path during calculation of domain frequency
+ * @slowpath_lock: mutex used to synchronize with slow path - ensure policy
+ * remains enabled, and eliminate racing between slow and fast path
+ * @enabled: boolean value indicating that the policy is started, protected
+ * by the slowpath_lock
+ * @requested_freq: last frequency requested by the sched governor
+ *
+ * struct gov_data is the per-policy cpufreq_sched-specific data
+ * structure. A per-policy instance of it is created when the
+ * cpufreq_sched governor receives the CPUFREQ_GOV_POLICY_INIT
+ * condition and a pointer to it exists in the gov_data member of
+ * struct cpufreq_policy.
+ */
+struct gov_data {
+	ktime_t throttle;
+	unsigned int throttle_nsec;
+	struct task_struct *task;
+	struct irq_work irq_work;
+	struct cpufreq_policy *policy;
+	raw_spinlock_t fastpath_lock;
+	struct mutex slowpath_lock;
+	unsigned int enabled;
+	unsigned int requested_freq;
+};
+
+static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy,
+					    unsigned int freq)
+{
+	struct gov_data *gd = policy->governor_data;
+
+	__cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
+	gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+}
+
+static bool finish_last_request(struct gov_data *gd)
+{
+	ktime_t now = ktime_get();
+
+	if (ktime_after(now, gd->throttle))
+		return false;
+
+	while (1) {
+		int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));
+
+		usec_left /= NSEC_PER_USEC;
+		usleep_range(usec_left, usec_left + 100);
+		now = ktime_get();
+		if (ktime_after(now, gd->throttle))
+			return true;
+	}
+}
+
+static int cpufreq_sched_thread(void *data)
+{
+	struct sched_param param;
+	struct gov_data *gd = (struct gov_data*) data;
+	struct cpufreq_policy *policy = gd->policy;
+	unsigned int new_request = 0;
+	unsigned int last_request = 0;
+	int ret;
+
+	param.sched_priority = 50;
+	ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
+	if (ret) {
+		pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+		do_exit(-EINVAL);
+	} else {
+		pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
+				__func__, gd->task->pid);
+	}
+
+	mutex_lock(&gd->slowpath_lock);
+
+	while (true) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (kthread_should_stop()) {
+			set_current_state(TASK_RUNNING);
+			break;
+		}
+		new_request = gd->requested_freq;
+		if (!gd->enabled || new_request == last_request) {
+			mutex_unlock(&gd->slowpath_lock);
+			schedule();
+			mutex_lock(&gd->slowpath_lock);
+		} else {
+			set_current_state(TASK_RUNNING);
+			/*
+			 * if the frequency thread sleeps while waiting to be
+			 * unthrottled, start over to check for a newer request
+			 */
+			if (finish_last_request(gd))
+				continue;
+			last_request = new_request;
+			cpufreq_sched_try_driver_target(policy, new_request);
+		}
+	}
+
+	mutex_unlock(&gd->slowpath_lock);
+
+	return 0;
+}
+
+static void cpufreq_sched_irq_work(struct irq_work *irq_work)
+{
+	struct gov_data *gd;
+
+	gd = container_of(irq_work, struct gov_data, irq_work);
+	if (!gd)
+		return;
+
+	wake_up_process(gd->task);
+}
+
+static void update_fdomain_capacity_request(int cpu)
+{
+	unsigned int freq_new, index_new, cpu_tmp;
+	struct cpufreq_policy *policy;
+	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
+	unsigned long capacity = 0;
+
+	if (!gd)
+		return;
+
+	/* interrupts already disabled here via rq locked */
+	raw_spin_lock(&gd->fastpath_lock);
+
+	policy = gd->policy;
+
+	for_each_cpu(cpu_tmp, policy->cpus) {
+		struct sched_capacity_reqs *scr;
+
+		scr = &per_cpu(cpu_sched_capacity_reqs, cpu_tmp);
+		capacity = max(capacity, scr->total);
+	}
+
+	freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
+
+	/*
+	 * Calling this without locking policy->rwsem means we race
+	 * against changes with policy->min and policy->max. This should
+	 * be okay though.
+	 */
+	if (cpufreq_frequency_table_target(policy, policy->freq_table,
+					   freq_new, CPUFREQ_RELATION_L,
+					   &index_new))
+		goto out;
+	freq_new = policy->freq_table[index_new].frequency;
+
+	if (freq_new == gd->requested_freq)
+		goto out;
+
+	gd->requested_freq = freq_new;
+
+	if (cpufreq_driver_slow || !mutex_trylock(&gd->slowpath_lock)) {
+		irq_work_queue_on(&gd->irq_work, cpu);
+	} else if (policy->transition_ongoing ||
+		   ktime_before(ktime_get(), gd->throttle)) {
+		mutex_unlock(&gd->slowpath_lock);
+		irq_work_queue_on(&gd->irq_work, cpu);
+	} else {
+		cpufreq_sched_try_driver_target(policy, freq_new);
+		mutex_unlock(&gd->slowpath_lock);
+	}
+
+out:
+	raw_spin_unlock(&gd->fastpath_lock);
+}
+
+void update_cpu_capacity_request(int cpu, bool request)
+{
+	unsigned long new_capacity;
+	struct sched_capacity_reqs *scr;
+
+	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
+	lockdep_assert_held(&cpu_rq(cpu)->lock);
+
+	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
+
+	new_capacity = scr->cfs + scr->rt;
+	new_capacity = new_capacity * capacity_margin
+		/ SCHED_CAPACITY_SCALE;
+	new_capacity += scr->dl;
+
+	if (new_capacity == scr->total)
+		return;
+
+	scr->total = new_capacity;
+	if (request)
+		update_fdomain_capacity_request(cpu);
+}
+
+static ssize_t show_throttle_nsec(struct cpufreq_policy *policy, char *buf)
+{
+	struct gov_data *gd = policy->governor_data;
+	return sprintf(buf, "%u\n", gd->throttle_nsec);
+}
+
+static ssize_t store_throttle_nsec(struct cpufreq_policy *policy,
+				   const char *buf, size_t count)
+{
+	struct gov_data *gd = policy->governor_data;
+	unsigned int input;
+	int ret;
+
+	ret = sscanf(buf, "%u", &input);
+
+	if (ret != 1)
+		return -EINVAL;
+
+	gd->throttle_nsec = input;
+	return count;
+}
+
+static struct freq_attr sched_freq_throttle_nsec_attr =
+	__ATTR(throttle_nsec, 0644, show_throttle_nsec, store_throttle_nsec);
+
+static struct attribute *sched_freq_sysfs_attribs[] = {
+	&sched_freq_throttle_nsec_attr.attr,
+	NULL
+};
+
+static struct attribute_group sched_freq_sysfs_group = {
+	.attrs = sched_freq_sysfs_attribs,
+	.name = "sched_freq",
+};
+
+static int cpufreq_sched_policy_init(struct cpufreq_policy *policy)
+{
+	struct gov_data *gd;
+	int ret;
+
+	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
+	if (!gd)
+		return -ENOMEM;
+	policy->governor_data = gd;
+	gd->policy = policy;
+	raw_spin_lock_init(&gd->fastpath_lock);
+	mutex_init(&gd->slowpath_lock);
+
+	ret = sysfs_create_group(&policy->kobj, &sched_freq_sysfs_group);
+	if (ret)
+		goto err_mem;
+
+	/*
+	 * Set up schedfreq thread for slow path freq transitions if
+	 * required by the driver.
+	 */
+	if (cpufreq_driver_is_slow()) {
+		cpufreq_driver_slow = true;
+		gd->task = kthread_create(cpufreq_sched_thread, gd,
+					  "kschedfreq:%d",
+					  cpumask_first(policy->related_cpus));
+		if (IS_ERR_OR_NULL(gd->task)) {
+			pr_err("%s: failed to create kschedfreq thread\n",
+			       __func__);
+			goto err_sysfs;
+		}
+		get_task_struct(gd->task);
+		kthread_bind_mask(gd->task, policy->related_cpus);
+		wake_up_process(gd->task);
+		init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
+	}
+	return 0;
+
+err_sysfs:
+	sysfs_remove_group(&policy->kobj, &sched_freq_sysfs_group);
+err_mem:
+	policy->governor_data = NULL;
+	kfree(gd);
+	return -ENOMEM;
+}
+
+static int cpufreq_sched_policy_exit(struct cpufreq_policy *policy)
+{
+	struct gov_data *gd = policy->governor_data;
+
+	/* Stop the schedfreq thread associated with this policy. */
+	if (cpufreq_driver_slow) {
+		kthread_stop(gd->task);
+		put_task_struct(gd->task);
+	}
+	sysfs_remove_group(&policy->kobj, &sched_freq_sysfs_group);
+	policy->governor_data = NULL;
+	kfree(gd);
+	return 0;
+}
+
+static int cpufreq_sched_start(struct cpufreq_policy *policy)
+{
+	struct gov_data *gd = policy->governor_data;
+	int cpu;
+
+	/*
+	 * The schedfreq static key is managed here so the global schedfreq
+	 * lock must be taken - a per-policy lock such as policy->rwsem is
+	 * not sufficient.
+	 */
+	mutex_lock(&gov_enable_lock);
+
+	gd->enabled = 1;
+
+	/*
+	 * Set up percpu information. Writing the percpu gd pointer will
+	 * enable the fast path if the static key is already enabled.
+	 */
+	for_each_cpu(cpu, policy->cpus) {
+		memset(&per_cpu(cpu_sched_capacity_reqs, cpu), 0,
+		       sizeof(struct sched_capacity_reqs));
+		per_cpu(cpu_gov_data, cpu) = gd;
+	}
+
+	if (enabled_policies == 0)
+		static_key_slow_inc(&__sched_freq);
+	enabled_policies++;
+	mutex_unlock(&gov_enable_lock);
+
+	return 0;
+}
+
+static void dummy(void *info) {}
+
+static int cpufreq_sched_stop(struct cpufreq_policy *policy)
+{
+	struct gov_data *gd = policy->governor_data;
+	int cpu;
+
+	/*
+	 * The schedfreq static key is managed here so the global schedfreq
+	 * lock must be taken - a per-policy lock such as policy->rwsem is
+	 * not sufficient.
+	 */
+	mutex_lock(&gov_enable_lock);
+
+	/*
+	 * The governor stop path may or may not hold policy->rwsem. There
+	 * must be synchronization with the slow path however.
+	 */
+	mutex_lock(&gd->slowpath_lock);
+
+	/*
+	 * Stop new entries into the hot path for all CPUs. This will
+	 * potentially affect other policies which are still running but
+	 * this is an infrequent operation.
+	 */
+	static_key_slow_dec(&__sched_freq);
+	enabled_policies--;
+
+	/*
+	 * Ensure that all CPUs currently part of this policy are out
+	 * of the hot path so that if this policy exits we can free gd.
+	 */
+	preempt_disable();
+	smp_call_function_many(policy->cpus, dummy, NULL, true);
+	preempt_enable();
+
+	/*
+	 * Other CPUs in other policies may still have the schedfreq
+	 * static key enabled. The percpu gd is used to signal which
+	 * CPUs are enabled in the sched gov during the hot path.
+	 */
+	for_each_cpu(cpu, policy->cpus)
+		per_cpu(cpu_gov_data, cpu) = NULL;
+
+	/* Pause the slow path for this policy. */
+	gd->enabled = 0;
+
+	if (enabled_policies)
+		static_key_slow_inc(&__sched_freq);
+	mutex_unlock(&gd->slowpath_lock);
+	mutex_unlock(&gov_enable_lock);
+
+	return 0;
+}
+
+static int cpufreq_sched_setup(struct cpufreq_policy *policy,
+			       unsigned int event)
+{
+	switch (event) {
+	case CPUFREQ_GOV_POLICY_INIT:
+		return cpufreq_sched_policy_init(policy);
+	case CPUFREQ_GOV_POLICY_EXIT:
+		return cpufreq_sched_policy_exit(policy);
+	case CPUFREQ_GOV_START:
+		return cpufreq_sched_start(policy);
+	case CPUFREQ_GOV_STOP:
+		return cpufreq_sched_stop(policy);
+	case CPUFREQ_GOV_LIMITS:
+		break;
+	}
+	return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
+static
+#endif
+struct cpufreq_governor cpufreq_gov_sched = {
+	.name			= "sched",
+	.governor		= cpufreq_sched_setup,
+	.owner			= THIS_MODULE,
+};
+
+static int __init cpufreq_sched_init(void)
+{
+	return cpufreq_register_governor(&cpufreq_gov_sched);
+}
+
+/* Try to make this the default governor */
+fs_initcall(cpufreq_sched_init);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1d58387..17908dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1384,6 +1384,57 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 }
 #endif
 
+#ifdef CONFIG_CPU_FREQ_GOV_SCHED
+extern unsigned int capacity_margin;
+extern struct static_key __sched_freq;
+
+static inline bool sched_freq(void)
+{
+	return static_key_false(&__sched_freq);
+}
+
+DECLARE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
+void update_cpu_capacity_request(int cpu, bool request);
+
+static inline void set_cfs_cpu_capacity(int cpu, bool request,
+					unsigned long capacity)
+{
+	if (per_cpu(cpu_sched_capacity_reqs, cpu).cfs != capacity) {
+		per_cpu(cpu_sched_capacity_reqs, cpu).cfs = capacity;
+		update_cpu_capacity_request(cpu, request);
+	}
+}
+
+static inline void set_rt_cpu_capacity(int cpu, bool request,
+				       unsigned long capacity)
+{
+	if (per_cpu(cpu_sched_capacity_reqs, cpu).rt != capacity) {
+		per_cpu(cpu_sched_capacity_reqs, cpu).rt = capacity;
+		update_cpu_capacity_request(cpu, request);
+	}
+}
+
+static inline void set_dl_cpu_capacity(int cpu, bool request,
+				       unsigned long capacity)
+{
+	if (per_cpu(cpu_sched_capacity_reqs, cpu).dl != capacity) {
+		per_cpu(cpu_sched_capacity_reqs, cpu).dl = capacity;
+		update_cpu_capacity_request(cpu, request);
+	}
+}
+#else
+static inline bool sched_freq(void) { return false; }
+static inline void set_cfs_cpu_capacity(int cpu, bool request,
+					unsigned long capacity)
+{ }
+static inline void set_rt_cpu_capacity(int cpu, bool request,
+				       unsigned long capacity)
+{ }
+static inline void set_dl_cpu_capacity(int cpu, bool request,
+				       unsigned long capacity)
+{ }
+#endif
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
-- 
2.4.10

* [RFCv7 PATCH 04/10] sched/fair: add triggers for OPP change requests
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
                   ` (2 preceding siblings ...)
  2016-02-23  1:22 ` [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-03-01  6:51   ` Ricky Liang
  2016-02-23  1:22 ` [RFCv7 PATCH 05/10] sched/{core,fair}: trigger OPP change request on fork() Steve Muckle
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette,
	Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

Each time a task is {en,de}queued we might need to adapt the current
frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
this purpose.  Only trigger a freq request if we are effectively waking up
or going to sleep.  Filter out load balancing related calls to reduce the
number of triggers.
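
For reference, the conversion performed by update_capacity_of() below
normalizes the CPU's utilization to the 0..SCHED_CAPACITY_SCALE range
before handing it to schedfreq; a hypothetical example (numbers made up):

	/* Sketch only: cpu_util() = 400 on a CPU with capacity_orig_of() = 800 */
	unsigned long req_cap = 400UL * 1024 / 800;	/* = 512, half of SCHED_CAPACITY_SCALE */
	/* set_cfs_cpu_capacity(cpu, true, req_cap) then requests ~50% capacity */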

[smuckle@linaro.org: resolve merge conflicts, define task_new,
 use renamed static key sched_freq]

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3437e01..f1f00a4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4283,6 +4283,21 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static unsigned long capacity_orig_of(int cpu);
+static int cpu_util(int cpu);
+
+static void update_capacity_of(int cpu)
+{
+	unsigned long req_cap;
+
+	if (!sched_freq())
+		return;
+
+	/* Convert scale-invariant capacity to cpu. */
+	req_cap = cpu_util(cpu) * SCHED_CAPACITY_SCALE / capacity_orig_of(cpu);
+	set_cfs_cpu_capacity(cpu, true, req_cap);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -4293,6 +4308,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	int task_new = !(flags & ENQUEUE_WAKEUP);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4324,9 +4340,23 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
 		add_nr_running(rq, 1);
 
+		/*
+		 * We want to potentially trigger a freq switch
+		 * request only for tasks that are waking up; this is
+		 * because we get here also during load balancing, but
+		 * in these cases it seems wise to trigger as single
+		 * request after load balancing is done.
+		 *
+		 * XXX: how about fork()? Do we need a special
+		 *      flag/something to tell if we are here after a
+		 *      fork() (wakeup_task_new)?
+		 */
+		if (!task_new)
+			update_capacity_of(cpu_of(rq));
+	}
 	hrtick_update(rq);
 }
 
@@ -4384,9 +4414,24 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
 		sub_nr_running(rq, 1);
 
+		/*
+		 * We want to potentially trigger a freq switch
+		 * request only for tasks that are going to sleep;
+		 * this is because we get here also during load
+		 * balancing, but in these cases it seems wise to
+		 * trigger as single request after load balancing is
+		 * done.
+		 */
+		if (task_sleep) {
+			if (rq->cfs.nr_running)
+				update_capacity_of(cpu_of(rq));
+			else if (sched_freq())
+				set_cfs_cpu_capacity(cpu_of(rq), false, 0);
+		}
+	}
 	hrtick_update(rq);
 }
 
-- 
2.4.10

* [RFCv7 PATCH 05/10] sched/{core,fair}: trigger OPP change request on fork()
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
                   ` (3 preceding siblings ...)
  2016-02-23  1:22 ` [RFCv7 PATCH 04/10] sched/fair: add triggers for OPP change requests Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-02-23  1:22 ` [RFCv7 PATCH 06/10] sched/fair: cpufreq_sched triggers for load balancing Steve Muckle
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette,
	Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

Patch "sched/fair: add triggers for OPP change requests" introduced OPP
change triggers for enqueue_task_fair(), but the trigger operated only
for wakeups. It makes sense to consider wakeup_new as well (i.e., fork()):
since we don't know anything about a newly created task, we almost
certainly want to jump to max OPP so as not to harm its performance too
much.

However, it is not currently possible (or at least it wasn't evident to me
how to do so :/) to tell new wakeups apart from other (non-wakeup)
operations.

This patch introduces an additional flag in sched.h that is set only at
fork() time and is then consumed in enqueue_task_fair() for this purpose.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 9 +++------
 kernel/sched/sched.h | 1 +
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 87ca0be..86297a2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2553,7 +2553,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
 	rq = __task_rq_lock(p);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, ENQUEUE_WAKEUP_NEW);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1f00a4..e7fab8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4308,7 +4308,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
-	int task_new = !(flags & ENQUEUE_WAKEUP);
+	int task_new = flags & ENQUEUE_WAKEUP_NEW;
+	int task_wakeup = flags & ENQUEUE_WAKEUP;
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4349,12 +4350,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		 * because we get here also during load balancing, but
 		 * in these cases it seems wise to trigger as single
 		 * request after load balancing is done.
-		 *
-		 * XXX: how about fork()? Do we need a special
-		 *      flag/something to tell if we are here after a
-		 *      fork() (wakeup_task_new)?
 		 */
-		if (!task_new)
+		if (task_new || task_wakeup)
 			update_capacity_of(cpu_of(rq));
 	}
 	hrtick_update(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 17908dd..9c26be2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1140,6 +1140,7 @@ extern const u32 sched_prio_to_wmult[40];
 #endif
 #define ENQUEUE_REPLENISH	0x08
 #define ENQUEUE_RESTORE	0x10
+#define ENQUEUE_WAKEUP_NEW	0x20
 
 #define DEQUEUE_SLEEP		0x01
 #define DEQUEUE_SAVE		0x02
-- 
2.4.10

* [RFCv7 PATCH 06/10] sched/fair: cpufreq_sched triggers for load balancing
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
                   ` (4 preceding siblings ...)
  2016-02-23  1:22 ` [RFCv7 PATCH 05/10] sched/{core,fair}: trigger OPP change request on fork() Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-02-23  1:22 ` [RFCv7 PATCH 07/10] sched/fair: jump to max OPP when crossing UP threshold Steve Muckle
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette,
	Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

As we don't trigger freq changes from {en,de}queue_task_fair() during load
balancing, we need to do so explicitly on load balancing paths.

[smuckle@linaro.org: move update_capacity_of calls so rq lock is held]

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/fair.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7fab8f..5531513 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6107,6 +6107,10 @@ static void attach_one_task(struct rq *rq, struct task_struct *p)
 {
 	raw_spin_lock(&rq->lock);
 	attach_task(rq, p);
+	/*
+	 * We want to potentially raise target_cpu's OPP.
+	 */
+	update_capacity_of(cpu_of(rq));
 	raw_spin_unlock(&rq->lock);
 }
 
@@ -6128,6 +6132,11 @@ static void attach_tasks(struct lb_env *env)
 		attach_task(env->dst_rq, p);
 	}
 
+	/*
+	 * We want to potentially raise env.dst_cpu's OPP.
+	 */
+	update_capacity_of(env->dst_cpu);
+
 	raw_spin_unlock(&env->dst_rq->lock);
 }
 
@@ -7267,6 +7276,11 @@ more_balance:
 		 * ld_moved     - cumulative load moved across iterations
 		 */
 		cur_ld_moved = detach_tasks(&env);
+		/*
+		 * We want to potentially lower env.src_cpu's OPP.
+		 */
+		if (cur_ld_moved)
+			update_capacity_of(env.src_cpu);
 
 		/*
 		 * We've detached some tasks from busiest_rq. Every
@@ -7631,8 +7645,13 @@ static int active_load_balance_cpu_stop(void *data)
 		schedstat_inc(sd, alb_count);
 
 		p = detach_one_task(&env);
-		if (p)
+		if (p) {
 			schedstat_inc(sd, alb_pushed);
+			/*
+			 * We want to potentially lower env.src_cpu's OPP.
+			 */
+			update_capacity_of(env.src_cpu);
+		}
 		else
 			schedstat_inc(sd, alb_failed);
 	}
-- 
2.4.10

* [RFCv7 PATCH 07/10] sched/fair: jump to max OPP when crossing UP threshold
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
                   ` (5 preceding siblings ...)
  2016-02-23  1:22 ` [RFCv7 PATCH 06/10] sched/fair: cpufreq_sched triggers for load balancing Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-02-23  1:22 ` [RFCv7 PATCH 08/10] sched: remove call of sched_avg_update from sched_rt_avg_update Steve Muckle
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette,
	Juri Lelli

Since the true utilization of a long-running task is not detectable
while it is running and might be bigger than the current CPU capacity,
create maximum CPU capacity headroom by requesting the maximum CPU
capacity once the CPU usage plus the capacity margin exceeds the current
capacity. This is also done to harm the performance of such a task as
little as possible.
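
A hypothetical illustration of the tick-time check added in
sched_freq_tick() below (numbers made up, sketch only):

	/* capacity_curr_of(cpu) = 614 (~60% of capacity_orig = 1024) */
	unsigned long capacity_curr = 614;
	/* cpu_util() = 450, scr->rt = 60, scr->dl = 0 */
	unsigned long needed = (450UL + 60) * 1280 / 1024 + 0;	/* = 637 */
	/* needed > capacity_curr, so set_cfs_cpu_capacity(cpu, true, capacity_max) */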

Original fair-class only version authored by Juri Lelli
<juri.lelli@arm.com>.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/core.c  | 40 +++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c  | 57 --------------------------------------------------
 kernel/sched/sched.h | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 99 insertions(+), 57 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 86297a2..747a7af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3020,6 +3020,45 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
+#ifdef CONFIG_CPU_FREQ_GOV_SCHED
+static unsigned long sum_capacity_reqs(unsigned long cfs_cap,
+				       struct sched_capacity_reqs *scr)
+{
+	unsigned long total = cfs_cap + scr->rt;
+
+	total = total * capacity_margin;
+	total /= SCHED_CAPACITY_SCALE;
+	total += scr->dl;
+	return total;
+}
+
+static void sched_freq_tick(int cpu)
+{
+	struct sched_capacity_reqs *scr;
+	unsigned long capacity_orig, capacity_curr;
+
+	if (!sched_freq())
+		return;
+
+	capacity_orig = capacity_orig_of(cpu);
+	capacity_curr = capacity_curr_of(cpu);
+	if (capacity_curr == capacity_orig)
+		return;
+
+	/*
+	 * To make free room for a task that is building up its "real"
+	 * utilization and to harm its performance the least, request
+	 * a jump to max OPP as soon as the margin of free capacity is
+	 * impacted (specified by capacity_margin).
+	 */
+	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
+	if (capacity_curr < sum_capacity_reqs(cpu_util(cpu), scr))
+		set_cfs_cpu_capacity(cpu, true, capacity_max);
+}
+#else
+static inline void sched_freq_tick(int cpu) { }
+#endif
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -3037,6 +3076,7 @@ void scheduler_tick(void)
 	curr->sched_class->task_tick(rq, curr, 0);
 	update_cpu_load_active(rq);
 	calc_global_load_tick(rq);
+	sched_freq_tick(cpu);
 	raw_spin_unlock(&rq->lock);
 
 	perf_event_task_tick();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5531513..cf7ae0a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4283,9 +4283,6 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
-static unsigned long capacity_orig_of(int cpu);
-static int cpu_util(int cpu);
-
 static void update_capacity_of(int cpu)
 {
 	unsigned long req_cap;
@@ -4685,15 +4682,6 @@ static unsigned long target_load(int cpu, int type)
 	return max(rq->cpu_load[type-1], total);
 }
 
-static unsigned long capacity_of(int cpu)
-{
-	return cpu_rq(cpu)->cpu_capacity;
-}
-
-static unsigned long capacity_orig_of(int cpu)
-{
-	return cpu_rq(cpu)->cpu_capacity_orig;
-}
 
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
@@ -4863,17 +4851,6 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 #endif
 
 /*
- * Returns the current capacity of cpu after applying both
- * cpu and freq scaling.
- */
-static unsigned long capacity_curr_of(int cpu)
-{
-	return cpu_rq(cpu)->cpu_capacity_orig *
-	       arch_scale_freq_capacity(NULL, cpu)
-	       >> SCHED_CAPACITY_SHIFT;
-}
-
-/*
  * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
  * A waker of many should wake a different task than the one last awakened
  * at a frequency roughly N times higher than one of its wakees.  In order
@@ -5117,40 +5094,6 @@ done:
 }
 
 /*
- * cpu_util returns the amount of capacity of a CPU that is used by CFS
- * tasks. The unit of the return value must be the one of capacity so we can
- * compare the utilization with the capacity of the CPU that is available for
- * CFS task (ie cpu_capacity).
- *
- * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
- * recent utilization of currently non-runnable tasks on a CPU. It represents
- * the amount of utilization of a CPU in the range [0..capacity_orig] where
- * capacity_orig is the cpu_capacity available at the highest frequency
- * (arch_scale_freq_capacity()).
- * The utilization of a CPU converges towards a sum equal to or less than the
- * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
- * the running time on this CPU scaled by capacity_curr.
- *
- * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
- * higher than capacity_orig because of unfortunate rounding in
- * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
- * the average stabilizes with the new running time. We need to check that the
- * utilization stays within the range of [0..capacity_orig] and cap it if
- * necessary. Without utilization capping, a group could be seen as overloaded
- * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
- * available capacity. We allow utilization to overshoot capacity_curr (but not
- * capacity_orig) as it useful for predicting the capacity required after task
- * migrations (scheduler-driven DVFS).
- */
-static int cpu_util(int cpu)
-{
-	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
-	unsigned long capacity = capacity_orig_of(cpu);
-
-	return (util >= capacity) ? capacity : util;
-}
-
-/*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
  * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9c26be2..59747d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1385,7 +1385,66 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline unsigned long capacity_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity;
+}
+
+static inline unsigned long capacity_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
+/*
+ * cpu_util returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be the one of capacity so we can
+ * compare the utilization with the capacity of the CPU that is available for
+ * CFS task (ie cpu_capacity).
+ *
+ * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
+ * recent utilization of currently non-runnable tasks on a CPU. It represents
+ * the amount of utilization of a CPU in the range [0..capacity_orig] where
+ * capacity_orig is the cpu_capacity available at the highest frequency
+ * (arch_scale_freq_capacity()).
+ * The utilization of a CPU converges towards a sum equal to or less than the
+ * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
+ * the running time on this CPU scaled by capacity_curr.
+ *
+ * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
+ * higher than capacity_orig because of unfortunate rounding in
+ * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
+ * the average stabilizes with the new running time. We need to check that the
+ * utilization stays within the range of [0..capacity_orig] and cap it if
+ * necessary. Without utilization capping, a group could be seen as overloaded
+ * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
+ * available capacity. We allow utilization to overshoot capacity_curr (but not
+ * capacity_orig) as it useful for predicting the capacity required after task
+ * migrations (scheduler-driven DVFS).
+ */
+static inline int cpu_util(int cpu)
+{
+	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	return (util >= capacity) ? capacity : util;
+}
+
+/*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+static inline unsigned long capacity_curr_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig *
+	       arch_scale_freq_capacity(NULL, cpu)
+	       >> SCHED_CAPACITY_SHIFT;
+}
+
+#endif
+
 #ifdef CONFIG_CPU_FREQ_GOV_SCHED
+#define capacity_max SCHED_CAPACITY_SCALE
 extern unsigned int capacity_margin;
 extern struct static_key __sched_freq;
 
-- 
2.4.10

* [RFCv7 PATCH 08/10] sched: remove call of sched_avg_update from sched_rt_avg_update
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
                   ` (6 preceding siblings ...)
  2016-02-23  1:22 ` [RFCv7 PATCH 07/10] sched/fair: jump to max OPP when crossing UP threshold Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-02-23  1:22 ` [RFCv7 PATCH 09/10] sched/deadline: split rt_avg in 2 distinct metrics Steve Muckle
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette

From: Vincent Guittot <vincent.guittot@linaro.org>

rt_avg is only used to scale the CPU's available capacity for CFS
tasks. As the update of this scaling is done during periodic load
balancing, we only have to ensure that sched_avg_update has been called
before any periodic load balancing. This requirement is already
fulfilled by __update_cpu_load, so the call in sched_rt_avg_update,
which is part of the hot path, is useless.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/sched.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 59747d8..3df21f2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1498,7 +1498,6 @@ static inline void set_dl_cpu_capacity(int cpu, bool request,
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
-	sched_avg_update(rq);
 }
 #else
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
-- 
2.4.10

* [RFCv7 PATCH 09/10] sched/deadline: split rt_avg in 2 distinct metrics
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
                   ` (7 preceding siblings ...)
  2016-02-23  1:22 ` [RFCv7 PATCH 08/10] sched: remove call of sched_avg_update from sched_rt_avg_update Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-02-23  1:22 ` [RFCv7 PATCH 10/10] sched: rt scheduler sets capacity requirement Steve Muckle
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette

From: Vincent Guittot <vincent.guittot@linaro.org>

rt_avg monitors the average load of RT tasks, deadline tasks and, when
enabled, interrupts. It is used to calculate the remaining capacity for
CFS tasks. Split rt_avg into 2 metrics: one for RT tasks and interrupts,
which keeps the name rt_avg, and another for deadline tasks, named
dl_avg.
Both values are still used to calculate the remaining capacity for CFS
tasks, but rt_avg is now also used to request capacity from sched-freq
for RT tasks.
As IRQ time is accounted together with RT tasks, it will be taken into
account in the capacity request.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/core.c     | 1 +
 kernel/sched/deadline.c | 2 +-
 kernel/sched/fair.c     | 1 +
 kernel/sched/sched.h    | 8 +++++++-
 4 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 747a7af..12a4a3a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -759,6 +759,7 @@ void sched_avg_update(struct rq *rq)
 		asm("" : "+rm" (rq->age_stamp));
 		rq->age_stamp += period;
 		rq->rt_avg /= 2;
+		rq->dl_avg /= 2;
 	}
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cd64c97..87dcee3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -747,7 +747,7 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.exec_start = rq_clock_task(rq);
 	cpuacct_charge(curr, delta_exec);
 
-	sched_rt_avg_update(rq, delta_exec);
+	sched_dl_avg_update(rq, delta_exec);
 
 	dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec;
 	if (dl_runtime_exceeded(dl_se)) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf7ae0a..3a812fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6278,6 +6278,7 @@ static unsigned long scale_rt_capacity(int cpu)
 	 */
 	age_stamp = READ_ONCE(rq->age_stamp);
 	avg = READ_ONCE(rq->rt_avg);
+	avg += READ_ONCE(rq->dl_avg);
 	delta = __rq_clock_broken(rq) - age_stamp;
 
 	if (unlikely(delta < 0))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3df21f2..ad6cc8b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -644,7 +644,7 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
-	u64 rt_avg;
+	u64 rt_avg, dl_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
 	u64 avg_idle;
@@ -1499,8 +1499,14 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
 }
+
+static inline void sched_dl_avg_update(struct rq *rq, u64 dl_delta)
+{
+	rq->dl_avg += dl_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
+}
 #else
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
+static inline void sched_dl_avg_update(struct rq *rq, u64 dl_delta) { }
 static inline void sched_avg_update(struct rq *rq) { }
 #endif
 
-- 
2.4.10

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [RFCv7 PATCH 10/10] sched: rt scheduler sets capacity requirement
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
                   ` (8 preceding siblings ...)
  2016-02-23  1:22 ` [RFCv7 PATCH 09/10] sched/deadline: split rt_avg in 2 distincts metrics Steve Muckle
@ 2016-02-23  1:22 ` Steve Muckle
  2016-02-23  1:33 ` [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
  2016-03-30  0:45 ` Yuyang Du
  11 siblings, 0 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:22 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette

From: Vincent Guittot <vincent.guittot@linaro.org>

Unlike deadline tasks, RT tasks don't provide any runtime constraints
beyond their priority. The only currently usable input for estimating
the capacity needed by RT tasks is the rt_avg metric, so we use it to
estimate the CPU capacity needed by the RT scheduler class.

In order to monitor the evolution of RT task load, we periodically
check it during the tick.

We then use the capacity consumed by the last activity as an estimate
for the next one. This is not especially accurate, but it is a good
starting point and has no impact on the wake-up path of RT tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/rt.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8ec86ab..da9086c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1426,6 +1426,41 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
 #endif
 }
 
+#ifdef CONFIG_CPU_FREQ_GOV_SCHED
+static void sched_rt_update_capacity_req(struct rq *rq)
+{
+	u64 total, used, age_stamp, avg;
+	s64 delta;
+
+	if (!sched_freq())
+		return;
+
+	sched_avg_update(rq);
+	/*
+	 * Since we're reading these variables without serialization make sure
+	 * we read them once before doing sanity checks on them.
+	 */
+	age_stamp = READ_ONCE(rq->age_stamp);
+	avg = READ_ONCE(rq->rt_avg);
+	delta = rq_clock(rq) - age_stamp;
+
+	if (unlikely(delta < 0))
+		delta = 0;
+
+	total = sched_avg_period() + delta;
+
+	used = div_u64(avg, total);
+	if (unlikely(used > SCHED_CAPACITY_SCALE))
+		used = SCHED_CAPACITY_SCALE;
+
+	set_rt_cpu_capacity(rq->cpu, true, (unsigned long)(used));
+}
+#else
+static inline void sched_rt_update_capacity_req(struct rq *rq)
+{ }
+
+#endif
+
 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
 						   struct rt_rq *rt_rq)
 {
@@ -1494,8 +1529,17 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 	if (prev->sched_class == &rt_sched_class)
 		update_curr_rt(rq);
 
-	if (!rt_rq->rt_queued)
+	if (!rt_rq->rt_queued) {
+		/*
+		 * The next task to be picked on this rq will have a lower
+		 * priority than rt tasks so we can spend some time to update
+		 * the capacity used by rt tasks based on the last activity.
+		 * This value will then be used as an estimate of the next
+		 * activity.
+		 */
+		sched_rt_update_capacity_req(rq);
 		return NULL;
+	}
 
 	put_prev_task(rq, prev);
 
@@ -2212,6 +2256,8 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 
 	update_curr_rt(rq);
 
+	sched_rt_update_capacity_req(rq);
+
 	watchdog(rq, p);
 
 	/*
-- 
2.4.10

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow
  2016-02-23  1:22 ` [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow Steve Muckle
@ 2016-02-23  1:31   ` Rafael J. Wysocki
  2016-02-26  0:50       ` Michael Turquette
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-23  1:31 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki,
	Linux Kernel Mailing List, linux-pm, Vincent Guittot,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Patrick Bellasi,
	Michael Turquette, Viresh Kumar

On Tue, Feb 23, 2016 at 2:22 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> From: Michael Turquette <mturquette@baylibre.com>
>
> Some architectures and platforms perform CPU frequency transitions
> through a non-blocking method, while some might block or sleep. Even
> when frequency transitions do not block or sleep they may be very slow.
> This distinction is important when trying to change frequency from
> a non-interruptible context in a scheduler hot path.
>
> Describe this distinction with a cpufreq driver flag,
> CPUFREQ_DRIVER_FAST. The default is to not have this flag set,
> thus erring on the side of caution.
>
> cpufreq_driver_is_slow() is also introduced in this patch. Setting
> the above flag will allow this function to return false.
>
> [smuckle@linaro.org: change flag/API to include drivers that are too
>  slow for scheduler hot paths, in addition to those that block/sleep]
>
> Cc: Rafael J. Wysocki <rafael@kernel.org>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Signed-off-by: Michael Turquette <mturquette@baylibre.com>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>

Something more sophisticated than this is needed, because one driver
may actually be able to do "fast" switching in some cases and may not
be able to do that in other cases.

For example, in the acpi-cpufreq case it all depends on what's in
the ACPI tables.
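
Something per-policy would fit that better than a driver-wide flag.  As
a rough, hypothetical sketch (the field and the helper below are not
existing cpufreq API, just an illustration):

	/*
	 * Hypothetical: the driver sets a per-policy flag at init time --
	 * acpi-cpufreq would set it only when the ACPI tables actually
	 * allow fast switching for this policy -- and the governor checks
	 * that instead of a driver-wide capability.
	 */
	static inline bool cpufreq_policy_can_fast_switch(struct cpufreq_policy *policy)
	{
		return policy->fast_switch_possible;
	}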

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
                   ` (9 preceding siblings ...)
  2016-02-23  1:22 ` [RFCv7 PATCH 10/10] sched: rt scheduler sets capacity requirement Steve Muckle
@ 2016-02-23  1:33 ` Steve Muckle
  2016-03-30  0:45 ` Yuyang Du
  11 siblings, 0 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-23  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Michael Turquette

On 02/22/2016 05:22 PM, Steve Muckle wrote:
> Scheduler-driven CPU frequency selection hopes to exploit both
> per-task and global information in the scheduler to improve frequency
> selection policy and achieve lower power consumption, improved
> responsiveness/performance, and less reliance on heuristics and
> tunables. For further discussion of this integration see [0].
> 
> This patch series implements a cpufreq governor which collects CPU
> capacity requests from the fair, realtime, and deadline scheduling
> classes. The fair and realtime scheduling classes are modified to make
> these requests. The deadline class is not yet modified to make CPU
> capacity requests.

This RFC series does not attempt to address any of today's feedback
regarding simplifying the hooks in the scheduler - I'd like some more
time to ponder that. But I thought it important to get the latest
version of this out for discussion as soon as possible.

thanks,
Steve

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency
  2016-02-23  1:22 ` [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency Steve Muckle
@ 2016-02-23  1:41   ` Rafael J. Wysocki
  2016-02-23  9:19     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-23  1:41 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki,
	Linux Kernel Mailing List, linux-pm, Vincent Guittot,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Patrick Bellasi,
	Michael Turquette

On Tue, Feb 23, 2016 at 2:22 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> From: Morten Rasmussen <morten.rasmussen@arm.com>
>
> capacity_orig_of() returns the max available compute capacity of a cpu.
> For scale-invariant utilization tracking and energy-aware scheduling
> decisions it is useful to know the compute capacity available at the
> current OPP of a cpu.
>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  kernel/sched/fair.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce24a4..3437e01 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4821,6 +4821,17 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>  #endif
>
>  /*
> + * Returns the current capacity of cpu after applying both
> + * cpu and freq scaling.
> + */
> +static unsigned long capacity_curr_of(int cpu)
> +{
> +       return cpu_rq(cpu)->cpu_capacity_orig *
> +              arch_scale_freq_capacity(NULL, cpu)

What about architectures that don't have this?

Why is that an architecture feature?

I can easily imagine two x86 platforms using different
scale_freq_capacity(), for example.

> +              >> SCHED_CAPACITY_SHIFT;
> +}
> +
> +/*
>   * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
>   * A waker of many should wake a different task than the one last awakened
>   * at a frequency roughly N times higher than one of its wakees.  In order
> --


Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency
  2016-02-23  1:41   ` Rafael J. Wysocki
@ 2016-02-23  9:19     ` Peter Zijlstra
  2016-02-26  1:37       ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2016-02-23  9:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Ingo Molnar, Linux Kernel Mailing List, linux-pm,
	Vincent Guittot, Morten Rasmussen, Dietmar Eggemann, Juri Lelli,
	Patrick Bellasi, Michael Turquette

On Tue, Feb 23, 2016 at 02:41:20AM +0100, Rafael J. Wysocki wrote:
> >  /*
> > + * Returns the current capacity of cpu after applying both
> > + * cpu and freq scaling.
> > + */
> > +static unsigned long capacity_curr_of(int cpu)
> > +{
> > +       return cpu_rq(cpu)->cpu_capacity_orig *
> > +              arch_scale_freq_capacity(NULL, cpu)
> 
> What about architectures that don't have this?

They get the 'default' which is a constant SCHED_CAPACITY_SCALE unit.
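
For reference, the arch-independent fallback in kernel/sched/sched.h is
just a constant, roughly:

	#ifndef arch_scale_freq_capacity
	static __always_inline
	unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
	{
		return SCHED_CAPACITY_SCALE;
	}
	#endif

so on such architectures capacity_curr_of() above simply degenerates to
cpu_capacity_orig.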

> Why is that an architecture feature?

Because not all archs can tell the frequency the same way. Some you
program the DVFS state and they really run at this speed, for those you
can simply report back.

For others, x86 for example, you program a DVFS 'hint' and the hardware
does whatever, we'd have to do APERF/MPERF samples to get an idea of the
actual frequency we ran at.

Also, having this makes the load tracking slightly more expensive:
instead of compile-time constants we get function calls and actual
multiplications. It's not _too_ bad, but still.

> I can easily imagine two x86 platforms using different
> scale_freq_capacity(), for example.

That's up to the arch; if different x86 platforms need different
thingies, the arch implementation needs to offer a selector -- this
isn't 'hard'.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-23  1:22 ` [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection Steve Muckle
@ 2016-02-25  3:55   ` Rafael J. Wysocki
  2016-02-25  9:21     ` Peter Zijlstra
                       ` (3 more replies)
  2016-03-03 14:21   ` Ingo Molnar
  1 sibling, 4 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-25  3:55 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang,
	Juri Lelli

Hi,

I promised a review and here it goes.

Let me focus on this one as the rest seems to depend on it.

On Monday, February 22, 2016 05:22:43 PM Steve Muckle wrote:
> From: Michael Turquette <mturquette@baylibre.com>
> 
> Scheduler-driven CPU frequency selection hopes to exploit both
> per-task and global information in the scheduler to improve frequency
> selection policy, achieving lower power consumption, improved
> responsiveness/performance, and less reliance on heuristics and
> tunables. For further discussion on the motivation of this integration
> see [0].
> 
> This patch implements a shim layer between the Linux scheduler and the
> cpufreq subsystem. The interface accepts capacity requests from the
> CFS, RT and deadline sched classes. The requests from each sched class
> are summed on each CPU with a margin applied to the CFS and RT
> capacity requests to provide some headroom. Deadline requests are
> expected to be precise enough given their nature to not require
> headroom. The maximum total capacity request for a CPU in a frequency
> domain drives the requested frequency for that domain.
> 
> Policy is determined by both the sched classes and this shim layer.
> 
> Note that this algorithm is event-driven. There is no polling loop to
> check cpu idle time nor any other method which is unsynchronized with
> the scheduler, aside from an optional throttling mechanism.
> 
> Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
> code and test results, and to Ricky Liang <jcliang@chromium.org>
> for initialization and static key inc/dec fixes.
> 
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
> 
> [smuckle@linaro.org: various additions and fixes, revised commit text]

Well, the changelog is still a bit terse in my view.  It should at least
describe the design somewhat (mention the static keys and how they are
used etc) and explain why things are done this way.

> CC: Ricky Liang <jcliang@chromium.org>
> Signed-off-by: Michael Turquette <mturquette@baylibre.com>
> Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  drivers/cpufreq/Kconfig      |  21 ++
>  include/linux/cpufreq.h      |   3 +
>  include/linux/sched.h        |   8 +
>  kernel/sched/Makefile        |   1 +
>  kernel/sched/cpufreq_sched.c | 459 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h         |  51 +++++
>  6 files changed, 543 insertions(+)
>  create mode 100644 kernel/sched/cpufreq_sched.c
> 

[cut]

> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 93e1c1c..ce8b895 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -495,6 +495,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
>  #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
>  extern struct cpufreq_governor cpufreq_gov_conservative;
>  #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED)
> +extern struct cpufreq_governor cpufreq_gov_sched;
> +#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_sched)
>  #endif
>  
>  /*********************************************************************
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a292c4b..27a6cd8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -937,6 +937,14 @@ enum cpu_idle_type {
>  #define SCHED_CAPACITY_SHIFT	10
>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>  
> +struct sched_capacity_reqs {
> +	unsigned long cfs;
> +	unsigned long rt;
> +	unsigned long dl;
> +
> +	unsigned long total;
> +};

Without a comment explaining what this represents it is quite hard to
decode it.

> +
>  /*
>   * Wake-queues are lists of tasks with a pending wakeup, whose
>   * callers have already marked the task as woken internally,
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 6768797..90ed832 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> new file mode 100644
> index 0000000..a113e4e
> --- /dev/null
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -0,0 +1,459 @@
> +/*

Any chance to add one sentence about what's in the file?

Besides, governors traditionally go to drivers/cpufreq.  Why is this different?

> + *  Copyright (C)  2015 Michael Turquette <mturquette@linaro.org>
> + *  Copyright (C)  2015-2016 Steve Muckle <smuckle@linaro.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/cpufreq.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/percpu.h>
> +#include <linux/irq_work.h>
> +#include <linux/delay.h>
> +#include <linux/string.h>
> +
> +#include "sched.h"
> +
> +struct static_key __read_mostly __sched_freq = STATIC_KEY_INIT_FALSE;

Well, I'm not familiar with static keys and how they work, so you'll need to
explain this part to me.  I'm assuming that there is some magic related to
the value of the key that changes set_*_cpu_capacity() into no-ops when the
key is 0, presumably by modifying kernel code.

So this is clever but tricky and my question here is why it is necessary.

For example, compared to the RCU-based approach I'm using, how much better
is this?  Yes, there is some tiny overhead related to checking whether callbacks
are present and invoking them, but is it really so bad?  Can you actually
measure the difference in any realistic workload?

One thing I personally like in the RCU-based approach is its universality.  The
callbacks may be installed by different entities in a uniform way: intel_pstate
can do that, the old governors can do that, my experimental schedutil code can
do that and your code could have done that too in principle.  And this is very
nice, because it is a common underlying mechanism that can be used by everybody
regardless of their particular implementations on the other side.
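
For reference, the interface amounts to roughly this (see
cpufreq_set_update_util_data() in linux-next):

	struct update_util_data {
		void (*func)(struct update_util_data *data, u64 time,
			     unsigned long util, unsigned long max);
	};

	void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);

and the scheduler invokes the installed callback, if any, under RCU from
its utilization update paths.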

Why would I want to use something different, then?

> +static bool __read_mostly cpufreq_driver_slow;
> +
> +/*
> + * The number of enabled schedfreq policies is modified during GOV_START/STOP.
> + * It, along with whether the schedfreq static key is enabled, is protected by
> + * the gov_enable_lock.
> + */

Well, it would be good to explain what the role of the number of enabled_policies is
at least briefly.

> +static int enabled_policies;
> +static DEFINE_MUTEX(gov_enable_lock);
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static struct cpufreq_governor cpufreq_gov_sched;
> +#endif

I don't think you need the #ifndef any more after recent changes in linux-next.

> +
> +/*
> + * Capacity margin added to CFS and RT capacity requests to provide
> + * some head room if task utilization further increases.
> + */

OK, where does this number come from?

> +unsigned int capacity_margin = 1280;
> +
> +static DEFINE_PER_CPU(struct gov_data *, cpu_gov_data);
> +DEFINE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
> +
> +/**
> + * gov_data - per-policy data internal to the governor
> + * @throttle: next throttling period expiry. Derived from throttle_nsec
> + * @throttle_nsec: throttle period length in nanoseconds
> + * @task: worker thread for dvfs transition that may block/sleep
> + * @irq_work: callback used to wake up worker thread
> + * @policy: pointer to cpufreq policy associated with this governor data
> + * @fastpath_lock: prevents multiple CPUs in a frequency domain from racing
> + * with each other in fast path during calculation of domain frequency
> + * @slowpath_lock: mutex used to synchronize with slow path - ensure policy
> + * remains enabled, and eliminate racing between slow and fast path
> + * @enabled: boolean value indicating that the policy is started, protected
> + * by the slowpath_lock
> + * @requested_freq: last frequency requested by the sched governor
> + *
> + * struct gov_data is the per-policy cpufreq_sched-specific data
> + * structure. A per-policy instance of it is created when the
> + * cpufreq_sched governor receives the CPUFREQ_GOV_POLICY_INIT
> + * condition and a pointer to it exists in the gov_data member of
> + * struct cpufreq_policy.
> + */
> +struct gov_data {
> +	ktime_t throttle;
> +	unsigned int throttle_nsec;
> +	struct task_struct *task;
> +	struct irq_work irq_work;
> +	struct cpufreq_policy *policy;
> +	raw_spinlock_t fastpath_lock;
> +	struct mutex slowpath_lock;
> +	unsigned int enabled;
> +	unsigned int requested_freq;
> +};
> +
> +static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy,
> +					    unsigned int freq)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +
> +	__cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
> +	gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
> +}
> +
> +static bool finish_last_request(struct gov_data *gd)
> +{
> +	ktime_t now = ktime_get();
> +
> +	if (ktime_after(now, gd->throttle))
> +		return false;
> +
> +	while (1) {

I would write this as

	do {

> +		int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));
> +
> +		usec_left /= NSEC_PER_USEC;
> +		usleep_range(usec_left, usec_left + 100);
> +		now = ktime_get();

	} while (ktime_before(now, gd->throttle));

	return true;

But maybe that's just me. :-)

> +		if (ktime_after(now, gd->throttle))
> +			return true;
> +	}
> +}

I'm not a big fan of this throttling mechanism overall, but more about that later.

> +
> +static int cpufreq_sched_thread(void *data)
> +{

Now, what really is the advantage of having those extra threads vs using
workqueues?

I guess the underlying concern is that RT tasks may stall workqueues indefinitely
in theory and then the frequency won't be updated, but there's much more kernel
stuff run from workqueues and if that is starved, you won't get very far anyway.

If you take special measures to prevent frequency change requests from being
stalled by RT tasks, question is why are they so special?  Aren't there any
other kernel activities that also should be protected from that and may be
more important than CPU frequency changes?

Plus if this really is the problem here, then it also affects the other cpufreq
governors, so maybe it should be solved for everybody in some common way?

> +	struct sched_param param;
> +	struct gov_data *gd = (struct gov_data*) data;
> +	struct cpufreq_policy *policy = gd->policy;
> +	unsigned int new_request = 0;
> +	unsigned int last_request = 0;
> +	int ret;
> +
> +	param.sched_priority = 50;
> +	ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
> +	if (ret) {
> +		pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> +		do_exit(-EINVAL);
> +	} else {
> +		pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
> +				__func__, gd->task->pid);
> +	}
> +
> +	mutex_lock(&gd->slowpath_lock);
> +
> +	while (true) {
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		if (kthread_should_stop()) {
> +			set_current_state(TASK_RUNNING);
> +			break;
> +		}
> +		new_request = gd->requested_freq;
> +		if (!gd->enabled || new_request == last_request) {

This generally is a mistake.

You can't assume that if you had requested a CPU to enter a specific P-state,
that state was actually entered.  In the platform-coordinated case the hardware
(or firmware) may choose to ignore your request and you won't be told about
that.

For this reason, you generally need to make the request every time even if it
is identical to the previous one.  Even in the one-CPU-per-policy case there
may be HW coordination you don't know about.

> +			mutex_unlock(&gd->slowpath_lock);
> +			schedule();
> +			mutex_lock(&gd->slowpath_lock);
> +		} else {
> +			set_current_state(TASK_RUNNING);
> +			/*
> +			 * if the frequency thread sleeps while waiting to be
> +			 * unthrottled, start over to check for a newer request
> +			 */
> +			if (finish_last_request(gd))
> +				continue;
> +			last_request = new_request;
> +			cpufreq_sched_try_driver_target(policy, new_request);
> +		}
> +	}
> +
> +	mutex_unlock(&gd->slowpath_lock);
> +
> +	return 0;
> +}
> +
> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
> +{
> +	struct gov_data *gd;
> +
> +	gd = container_of(irq_work, struct gov_data, irq_work);
> +	if (!gd)
> +		return;
> +
> +	wake_up_process(gd->task);

I'm wondering what would be wrong with writing it as

	if (gd)
		wake_up_process(gd->task);

And can gd turn out to be NULL here in any case?

> +}
> +
> +static void update_fdomain_capacity_request(int cpu)
> +{
> +	unsigned int freq_new, index_new, cpu_tmp;
> +	struct cpufreq_policy *policy;
> +	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
> +	unsigned long capacity = 0;
> +
> +	if (!gd)
> +		return;
> +

Why is this check necessary?

> +	/* interrupts already disabled here via rq locked */
> +	raw_spin_lock(&gd->fastpath_lock);

Well, if you compare this with the one-CPU-per-policy path in my experimental
schedutil governor code with the "fast switch" patch on top, you'll notice that
it doesn't use any locks and/or atomic ops.  That's very much on purpose and
here's where your whole gain from using static keys practically goes away.

> +
> +	policy = gd->policy;
> +
> +	for_each_cpu(cpu_tmp, policy->cpus) {
> +		struct sched_capacity_reqs *scr;
> +
> +		scr = &per_cpu(cpu_sched_capacity_reqs, cpu_tmp);
> +		capacity = max(capacity, scr->total);
> +	}

You could save a few cycles from this in the case when the policy is not
shared.

> +
> +	freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;

Where does this formula come from?

> +
> +	/*
> +	 * Calling this without locking policy->rwsem means we race
> +	 * against changes with policy->min and policy->max. This should
> +	 * be okay though.
> +	 */
> +	if (cpufreq_frequency_table_target(policy, policy->freq_table,
> +					   freq_new, CPUFREQ_RELATION_L,
> +					   &index_new))
> +		goto out;

__cpufreq_driver_target() will call this again, so isn't calling it here
a bit wasteful?

> +	freq_new = policy->freq_table[index_new].frequency;
> +
> +	if (freq_new == gd->requested_freq)
> +		goto out;
> +

Again, the above generally is a mistake for reasons explained earlier.

> +	gd->requested_freq = freq_new;
> +
> +	if (cpufreq_driver_slow || !mutex_trylock(&gd->slowpath_lock)) {

This really doesn't look good to me.

Why is the mutex needed here in the first place?  cpufreq_sched_stop() should
be able to make sure that this function won't be run again for the policy
without using this lock.

> +		irq_work_queue_on(&gd->irq_work, cpu);

I hope that you are aware of the fact that irq_work_queue_on() explodes
on uniprocessor ARM32 if you run an SMP kernel on it?

And what happens in the !cpufreq_driver_is_slow() case when we don't
initialize the irq_work?

> +	} else if (policy->transition_ongoing ||
> +		   ktime_before(ktime_get(), gd->throttle)) {

If this really runs in the scheduler paths, you don't want to have ktime_get()
here.

> +		mutex_unlock(&gd->slowpath_lock);
> +		irq_work_queue_on(&gd->irq_work, cpu);

Allright.

I think I have figured out how this is arranged, but I may be wrong. :-)

Here's my understanding of it.  If we are throttled, we don't just skip the
request.  Instead, we wake up the gd thread kind of in the hope that the
throttling may end when it actually wakes up.  So in fact we poke at the
gd thread on a regular basis asking it "Are you still throttled?" and that
happens on every call from the scheduler until the throttling is over if
I'm not mistaken.  This means that during throttling every call from the
scheduler generates an irq_work that wakes up the gd thread just to make it
check if it still is throttled and go to sleep again.  Please tell me
that I haven't understood this correctly.

The above aside, I personally think that rate limiting should happen at the source
and not at the worker thread level.  So if you're throttled, you should just
return immediately from this function without generating any more work.  That,
BTW, is what the sampling rate in my code is for.
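
Roughly, something near the top of update_fdomain_capacity_request()
along these lines (gd->last_update and its handling are made up here,
they are not fields in the patch):

	/*
	 * Hypothetical source-side throttle: bail out before taking any
	 * locks or queueing any irq_work.  The rq lock is already held
	 * by the caller, so rq_clock() is usable.
	 */
	u64 now = rq_clock(cpu_rq(cpu));

	if (now - gd->last_update < gd->throttle_nsec)
		return;
	gd->last_update = now;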

> +	} else {
> +		cpufreq_sched_try_driver_target(policy, freq_new);

Well, this is supposed to be the fast path AFAICS.

Did you actually look at what __cpufreq_driver_target() does in general?
Including the wait_event() in cpufreq_freq_transition_begin() to mention just
one suspicious thing?  And how much overhead it generates in the most general
case?

No, running *that* from the fast path is not a good idea.  Quite honestly,
you'd need a new driver callback and a new way to run it from the cpufreq core
to implement this in a reasonably efficient way.

> +		mutex_unlock(&gd->slowpath_lock);
> +	}
> +
> +out:
> +	raw_spin_unlock(&gd->fastpath_lock);
> +}
> +
> +void update_cpu_capacity_request(int cpu, bool request)
> +{
> +	unsigned long new_capacity;
> +	struct sched_capacity_reqs *scr;
> +
> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
> +
> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
> +
> +	new_capacity = scr->cfs + scr->rt;
> +	new_capacity = new_capacity * capacity_margin
> +		/ SCHED_CAPACITY_SCALE;
> +	new_capacity += scr->dl;

Can you please explain the formula here?

> +
> +	if (new_capacity == scr->total)
> +		return;
> +

The same mistake as before.

> +	scr->total = new_capacity;
> +	if (request)
> +		update_fdomain_capacity_request(cpu);
> +}
> +
> +static ssize_t show_throttle_nsec(struct cpufreq_policy *policy, char *buf)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +	return sprintf(buf, "%u\n", gd->throttle_nsec);
> +}
> +
> +static ssize_t store_throttle_nsec(struct cpufreq_policy *policy,
> +				   const char *buf, size_t count)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +	unsigned int input;
> +	int ret;
> +
> +	ret = sscanf(buf, "%u", &input);
> +
> +	if (ret != 1)
> +		return -EINVAL;
> +
> +	gd->throttle_nsec = input;
> +	return count;
> +}
> +
> +static struct freq_attr sched_freq_throttle_nsec_attr =
> +	__ATTR(throttle_nsec, 0644, show_throttle_nsec, store_throttle_nsec);
> +
> +static struct attribute *sched_freq_sysfs_attribs[] = {
> +	&sched_freq_throttle_nsec_attr.attr,
> +	NULL
> +};
> +
> +static struct attribute_group sched_freq_sysfs_group = {
> +	.attrs = sched_freq_sysfs_attribs,
> +	.name = "sched_freq",
> +};
> +
> +static int cpufreq_sched_policy_init(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd;
> +	int ret;
> +
> +	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
> +	if (!gd)
> +		return -ENOMEM;
> +	policy->governor_data = gd;
> +	gd->policy = policy;
> +	raw_spin_lock_init(&gd->fastpath_lock);
> +	mutex_init(&gd->slowpath_lock);
> +
> +	ret = sysfs_create_group(&policy->kobj, &sched_freq_sysfs_group);
> +	if (ret)
> +		goto err_mem;
> +
> +	/*
> +	 * Set up schedfreq thread for slow path freq transitions if
> +	 * required by the driver.
> +	 */
> +	if (cpufreq_driver_is_slow()) {
> +		cpufreq_driver_slow = true;
> +		gd->task = kthread_create(cpufreq_sched_thread, gd,
> +					  "kschedfreq:%d",
> +					  cpumask_first(policy->related_cpus));
> +		if (IS_ERR_OR_NULL(gd->task)) {
> +			pr_err("%s: failed to create kschedfreq thread\n",
> +			       __func__);
> +			goto err_sysfs;
> +		}
> +		get_task_struct(gd->task);
> +		kthread_bind_mask(gd->task, policy->related_cpus);
> +		wake_up_process(gd->task);
> +		init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
> +	}
> +	return 0;
> +
> +err_sysfs:
> +	sysfs_remove_group(&policy->kobj, &sched_freq_sysfs_group);
> +err_mem:
> +	policy->governor_data = NULL;
> +	kfree(gd);
> +	return -ENOMEM;
> +}
> +
> +static int cpufreq_sched_policy_exit(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +
> +	/* Stop the schedfreq thread associated with this policy. */
> +	if (cpufreq_driver_slow) {
> +		kthread_stop(gd->task);
> +		put_task_struct(gd->task);
> +	}
> +	sysfs_remove_group(&policy->kobj, &sched_freq_sysfs_group);
> +	policy->governor_data = NULL;
> +	kfree(gd);
> +	return 0;
> +}
> +
> +static int cpufreq_sched_start(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +	int cpu;
> +
> +	/*
> +	 * The schedfreq static key is managed here so the global schedfreq
> +	 * lock must be taken - a per-policy lock such as policy->rwsem is
> +	 * not sufficient.
> +	 */
> +	mutex_lock(&gov_enable_lock);
> +
> +	gd->enabled = 1;
> +
> +	/*
> +	 * Set up percpu information. Writing the percpu gd pointer will
> +	 * enable the fast path if the static key is already enabled.
> +	 */
> +	for_each_cpu(cpu, policy->cpus) {
> +		memset(&per_cpu(cpu_sched_capacity_reqs, cpu), 0,
> +		       sizeof(struct sched_capacity_reqs));
> +		per_cpu(cpu_gov_data, cpu) = gd;
> +	}
> +
> +	if (enabled_policies == 0)
> +		static_key_slow_inc(&__sched_freq);
> +	enabled_policies++;
> +	mutex_unlock(&gov_enable_lock);
> +
> +	return 0;
> +}
> +
> +static void dummy(void *info) {}
> +
> +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +	int cpu;
> +
> +	/*
> +	 * The schedfreq static key is managed here so the global schedfreq
> +	 * lock must be taken - a per-policy lock such as policy->rwsem is
> +	 * not sufficient.
> +	 */
> +	mutex_lock(&gov_enable_lock);
> +
> +	/*
> +	 * The governor stop path may or may not hold policy->rwsem. There
> +	 * must be synchronization with the slow path however.
> +	 */
> +	mutex_lock(&gd->slowpath_lock);
> +
> +	/*
> +	 * Stop new entries into the hot path for all CPUs. This will
> +	 * potentially affect other policies which are still running but
> +	 * this is an infrequent operation.
> +	 */
> +	static_key_slow_dec(&__sched_freq);
> +	enabled_policies--;
> +
> +	/*
> +	 * Ensure that all CPUs currently part of this policy are out
> +	 * of the hot path so that if this policy exits we can free gd.
> +	 */
> +	preempt_disable();
> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> +	preempt_enable();

I'm not sure how this works, can you please tell me?

> +
> +	/*
> +	 * Other CPUs in other policies may still have the schedfreq
> +	 * static key enabled. The percpu gd is used to signal which
> +	 * CPUs are enabled in the sched gov during the hot path.
> +	 */
> +	for_each_cpu(cpu, policy->cpus)
> +		per_cpu(cpu_gov_data, cpu) = NULL;
> +
> +	/* Pause the slow path for this policy. */
> +	gd->enabled = 0;
> +
> +	if (enabled_policies)
> +		static_key_slow_inc(&__sched_freq);
> +	mutex_unlock(&gd->slowpath_lock);
> +	mutex_unlock(&gov_enable_lock);
> +
> +	return 0;
> +}
> +
> +static int cpufreq_sched_setup(struct cpufreq_policy *policy,
> +			       unsigned int event)
> +{
> +	switch (event) {
> +	case CPUFREQ_GOV_POLICY_INIT:
> +		return cpufreq_sched_policy_init(policy);
> +	case CPUFREQ_GOV_POLICY_EXIT:
> +		return cpufreq_sched_policy_exit(policy);
> +	case CPUFREQ_GOV_START:
> +		return cpufreq_sched_start(policy);
> +	case CPUFREQ_GOV_STOP:
> +		return cpufreq_sched_stop(policy);
> +	case CPUFREQ_GOV_LIMITS:
> +		break;
> +	}
> +	return 0;
> +}
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static
> +#endif
> +struct cpufreq_governor cpufreq_gov_sched = {
> +	.name			= "sched",
> +	.governor		= cpufreq_sched_setup,
> +	.owner			= THIS_MODULE,
> +};
> +
> +static int __init cpufreq_sched_init(void)
> +{
> +	return cpufreq_register_governor(&cpufreq_gov_sched);
> +}
> +
> +/* Try to make this the default governor */
> +fs_initcall(cpufreq_sched_init);

I have no comments to the rest of the patch.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-25  3:55   ` Rafael J. Wysocki
@ 2016-02-25  9:21     ` Peter Zijlstra
  2016-02-25 21:04       ` Rafael J. Wysocki
  2016-02-25  9:28     ` Peter Zijlstra
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2016-02-25  9:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Thu, Feb 25, 2016 at 04:55:57AM +0100, Rafael J. Wysocki wrote:
> Well, I'm not familiar with static keys and how they work, so you'll need to
> explain this part to me. 

See include/linux/jump_label.h, it has lots of text on them. There is
also Documentation/static-keys.txt
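
In this series the hot-path check presumably boils down to something
like:

	static inline bool sched_freq(void)
	{
		return static_key_false(&__sched_freq);
	}

where static_key_false() compiles to a branch that is patched out until
the key is incremented, so the call sites cost (almost) nothing while no
schedfreq policy is active.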

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-25  3:55   ` Rafael J. Wysocki
  2016-02-25  9:21     ` Peter Zijlstra
@ 2016-02-25  9:28     ` Peter Zijlstra
  2016-02-25 21:08       ` Rafael J. Wysocki
  2016-02-25 11:04     ` Rafael J. Wysocki
  2016-02-26  0:34     ` Steve Muckle
  3 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2016-02-25  9:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Thu, Feb 25, 2016 at 04:55:57AM +0100, Rafael J. Wysocki wrote:
> > +static void dummy(void *info) {}
> > +
> > +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> > +{
> > +	struct gov_data *gd = policy->governor_data;
> > +	int cpu;
> > +
> > +	/*
> > +	 * The schedfreq static key is managed here so the global schedfreq
> > +	 * lock must be taken - a per-policy lock such as policy->rwsem is
> > +	 * not sufficient.
> > +	 */
> > +	mutex_lock(&gov_enable_lock);
> > +
> > +	/*
> > +	 * The governor stop path may or may not hold policy->rwsem. There
> > +	 * must be synchronization with the slow path however.
> > +	 */
> > +	mutex_lock(&gd->slowpath_lock);
> > +
> > +	/*
> > +	 * Stop new entries into the hot path for all CPUs. This will
> > +	 * potentially affect other policies which are still running but
> > +	 * this is an infrequent operation.
> > +	 */
> > +	static_key_slow_dec(&__sched_freq);
> > +	enabled_policies--;
> > +
> > +	/*
> > +	 * Ensure that all CPUs currently part of this policy are out
> > +	 * of the hot path so that if this policy exits we can free gd.
> > +	 */
> > +	preempt_disable();
> > +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> > +	preempt_enable();
> 
> I'm not sure how this works, can you please tell me?

I think it relies on the fact that rq->lock disables IRQs, so if we've
managed to IPI all relevant CPUs, it means they cannot be inside a
rq->lock section.

Its vile though; one should not spray IPIs if one can avoid it. Such
things are much better done with RCU. Sure sync_sched() takes a little
longer, but this isn't a fast path by any measure.
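
That is, roughly:

	/* stop new entries into the fast path */
	static_key_slow_dec(&__sched_freq);

	for_each_cpu(cpu, policy->cpus)
		per_cpu(cpu_gov_data, cpu) = NULL;

	/*
	 * The fast path runs under the rq lock with IRQs off, i.e. inside
	 * an RCU-sched read-side section, so this is enough to guarantee
	 * nobody is still dereferencing gd:
	 */
	synchronize_sched();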

> > +
> > +	/*
> > +	 * Other CPUs in other policies may still have the schedfreq
> > +	 * static key enabled. The percpu gd is used to signal which
> > +	 * CPUs are enabled in the sched gov during the hot path.
> > +	 */
> > +	for_each_cpu(cpu, policy->cpus)
> > +		per_cpu(cpu_gov_data, cpu) = NULL;
> > +
> > +	/* Pause the slow path for this policy. */
> > +	gd->enabled = 0;
> > +
> > +	if (enabled_policies)
> > +		static_key_slow_inc(&__sched_freq);
> > +	mutex_unlock(&gd->slowpath_lock);
> > +	mutex_unlock(&gov_enable_lock);
> > +
> > +	return 0;
> > +}

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-25  3:55   ` Rafael J. Wysocki
  2016-02-25  9:21     ` Peter Zijlstra
  2016-02-25  9:28     ` Peter Zijlstra
@ 2016-02-25 11:04     ` Rafael J. Wysocki
  2016-02-26  0:34     ` Steve Muckle
  3 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-25 11:04 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang,
	Juri Lelli

On Thursday, February 25, 2016 04:55:57 AM Rafael J. Wysocki wrote:
> Hi,
> 

[cut]

> > +	while (true) {
> > +		set_current_state(TASK_INTERRUPTIBLE);
> > +		if (kthread_should_stop()) {
> > +			set_current_state(TASK_RUNNING);
> > +			break;
> > +		}
> > +		new_request = gd->requested_freq;
> > +		if (!gd->enabled || new_request == last_request) {
> 
> This generally is a mistake.
> 
> You can't assume that if you had requested a CPU to enter a specific P-state,
> that state was actually entered.  In the platform-coordinated case the hardware
> (or firmware) may choose to ignore your request and you won't be told about
> that.
> 
> For this reason, you generally need to make the request every time even if it
> is identical to the previous one.  Even in the one-CPU-per-policy case there
> may be HW coordination you don't know about.

Please scratch this bit, actually (and analogous comments below).

It is reasonable to expect that the platform will remember your last request,
so you don't have to repeat it.

Which means that I can optimize the schedutil thing somewhat. :-)

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-25  9:21     ` Peter Zijlstra
@ 2016-02-25 21:04       ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-25 21:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Thursday, February 25, 2016 10:21:50 AM Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 04:55:57AM +0100, Rafael J. Wysocki wrote:
> > Well, I'm not familiar with static keys and how they work, so you'll need to
> > explain this part to me. 
> 
> See include/linux/jump_label.h, it has lots of text on them. There is
> also Documentation/static-keys.txt

Thanks for the pointers!

It looks like the author of the $subject patch hasn't looked at the latter
document lately.

In any case, IMO this might be used to hide the cpufreq_update_util() call
sites from the scheduler code in case no one has set anything via
cpufreq_set_update_util_data() for any CPUs.

That essentially is when things like the performance governor are in use,
so it may be worth doing, but that's mostly your judgement call. :-)
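
E.g. something along these lines (the key name and the
__cpufreq_update_util() helper are hypothetical):

	struct static_key cpufreq_util_key = STATIC_KEY_INIT_FALSE;

	static inline void cpufreq_update_util(u64 time, unsigned long util,
					       unsigned long max)
	{
		if (static_key_false(&cpufreq_util_key))
			__cpufreq_update_util(time, util, max);
	}

with the key incremented/decremented from cpufreq_set_update_util_data()
as callbacks are installed and removed.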

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-25  9:28     ` Peter Zijlstra
@ 2016-02-25 21:08       ` Rafael J. Wysocki
  2016-02-26  9:18         ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-25 21:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Thursday, February 25, 2016 10:28:37 AM Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 04:55:57AM +0100, Rafael J. Wysocki wrote:
> > > +static void dummy(void *info) {}
> > > +
> > > +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> > > +{
> > > +	struct gov_data *gd = policy->governor_data;
> > > +	int cpu;
> > > +
> > > +	/*
> > > +	 * The schedfreq static key is managed here so the global schedfreq
> > > +	 * lock must be taken - a per-policy lock such as policy->rwsem is
> > > +	 * not sufficient.
> > > +	 */
> > > +	mutex_lock(&gov_enable_lock);
> > > +
> > > +	/*
> > > +	 * The governor stop path may or may not hold policy->rwsem. There
> > > +	 * must be synchronization with the slow path however.
> > > +	 */
> > > +	mutex_lock(&gd->slowpath_lock);
> > > +
> > > +	/*
> > > +	 * Stop new entries into the hot path for all CPUs. This will
> > > +	 * potentially affect other policies which are still running but
> > > +	 * this is an infrequent operation.
> > > +	 */
> > > +	static_key_slow_dec(&__sched_freq);
> > > +	enabled_policies--;
> > > +
> > > +	/*
> > > +	 * Ensure that all CPUs currently part of this policy are out
> > > +	 * of the hot path so that if this policy exits we can free gd.
> > > +	 */
> > > +	preempt_disable();
> > > +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> > > +	preempt_enable();
> > 
> > I'm not sure how this works, can you please tell me?
> 
> I think it relies on the fact that rq->lock disables IRQs, so if we've
> managed to IPI all relevant CPUs, it means they cannot be inside a
> rq->lock section.
> 
> Its vile though; one should not spray IPIs if one can avoid it. Such
> things are much better done with RCU. Sure sync_sched() takes a little
> longer, but this isn't a fast path by any measure.

I see, thanks!

BTW, when cpufreq_update_util() callbacks are removed, I use synchronize_rcu()
to wait for the running ones, but would it be better to use synchronize_sched()
in there instead?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-25  3:55   ` Rafael J. Wysocki
                       ` (2 preceding siblings ...)
  2016-02-25 11:04     ` Rafael J. Wysocki
@ 2016-02-26  0:34     ` Steve Muckle
  2016-02-27  2:39       ` Rafael J. Wysocki
                         ` (2 more replies)
  3 siblings, 3 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-26  0:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On 02/24/2016 07:55 PM, Rafael J. Wysocki wrote:
> Hi,
> 
> I promised a review and here it goes.

Thanks Rafael for your detailed review.

> 
> Let me focus on this one as the rest seems to depend on it.
> 
> On Monday, February 22, 2016 05:22:43 PM Steve Muckle wrote:
>> From: Michael Turquette <mturquette@baylibre.com>
>>
>> Scheduler-driven CPU frequency selection hopes to exploit both
>> per-task and global information in the scheduler to improve frequency
>> selection policy, achieving lower power consumption, improved
>> responsiveness/performance, and less reliance on heuristics and
>> tunables. For further discussion on the motivation of this integration
>> see [0].
>>
>> This patch implements a shim layer between the Linux scheduler and the
>> cpufreq subsystem. The interface accepts capacity requests from the
>> CFS, RT and deadline sched classes. The requests from each sched class
>> are summed on each CPU with a margin applied to the CFS and RT
>> capacity requests to provide some headroom. Deadline requests are
>> expected to be precise enough given their nature to not require
>> headroom. The maximum total capacity request for a CPU in a frequency
>> domain drives the requested frequency for that domain.
>>
>> Policy is determined by both the sched classes and this shim layer.
>>
>> Note that this algorithm is event-driven. There is no polling loop to
>> check cpu idle time nor any other method which is unsynchronized with
>> the scheduler, aside from an optional throttling mechanism.
>>
>> Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
>> code and test results, and to Ricky Liang <jcliang@chromium.org>
>> for initialization and static key inc/dec fixes.
>>
>> [0] http://article.gmane.org/gmane.linux.kernel/1499836
>>
>> [smuckle@linaro.org: various additions and fixes, revised commit text]
> 
> Well, the changelog is still a bit terse in my view.  It should at least
> describe the design somewhat (mention the static keys and how they are
> used etc) and explain why things are done this way.

Sure. Will add more, including the details you mention.

> 
...
>>  /*********************************************************************
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index a292c4b..27a6cd8 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -937,6 +937,14 @@ enum cpu_idle_type {
>>  #define SCHED_CAPACITY_SHIFT	10
>>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>>  
>> +struct sched_capacity_reqs {
>> +	unsigned long cfs;
>> +	unsigned long rt;
>> +	unsigned long dl;
>> +
>> +	unsigned long total;
>> +};
> 
> Without a comment explaining what this represents it is quite hard to
> decode it.

This is the per-CPU representation of capacity requests from each sched
class. The cfs, rt, and dl capacity numbers are all on a scale of 0 to
SCHED_CAPACITY_SCALE. I'll add a comment here.
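
Something along the lines of:

	/*
	 * Per-CPU capacity requests from each sched class, all on the
	 * 0..SCHED_CAPACITY_SCALE scale of the CPU at its highest OPP.
	 * 'total' is the sum of the requests, with headroom applied to
	 * the cfs and rt contributions, and is what drives the frequency
	 * selection for the CPU's frequency domain.
	 */
	struct sched_capacity_reqs {
		unsigned long cfs;
		unsigned long rt;
		unsigned long dl;

		unsigned long total;
	};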

> 
>> +
>>  /*
>>   * Wake-queues are lists of tasks with a pending wakeup, whose
>>   * callers have already marked the task as woken internally,
>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
>> index 6768797..90ed832 100644
>> --- a/kernel/sched/Makefile
>> +++ b/kernel/sched/Makefile
>> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
>> +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
>> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
>> new file mode 100644
>> index 0000000..a113e4e
>> --- /dev/null
>> +++ b/kernel/sched/cpufreq_sched.c
>> @@ -0,0 +1,459 @@
>> +/*
> 
> Any chance to add one sentence about what's in the file?

Sure, will do.

> 
> Besides, governors traditionally go to drivers/cpufreq.  Why is this different?

This predates my involvement, but I'm guessing it was done because the
long-term vision was for cpufreq to eventually be removed from the
picture. It may be quite some time before that happens, though -
communicating directly with drivers may be difficult given the need to
synchronize with thermal or userspace min/max requests etc., plus we've
got a longstanding API exposed to userspace.

I'm happy to move this to drivers/cpufreq.

> 
>> + *  Copyright (C)  2015 Michael Turquette <mturquette@linaro.org>
>> + *  Copyright (C)  2015-2016 Steve Muckle <smuckle@linaro.org>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include <linux/cpufreq.h>
>> +#include <linux/module.h>
>> +#include <linux/kthread.h>
>> +#include <linux/percpu.h>
>> +#include <linux/irq_work.h>
>> +#include <linux/delay.h>
>> +#include <linux/string.h>
>> +
>> +#include "sched.h"
>> +
>> +struct static_key __read_mostly __sched_freq = STATIC_KEY_INIT_FALSE;
> 
> Well, I'm not familiar with static keys and how they work, so you'll need to
> explain this part to me.  I'm assuming that there is some magic related to
> the value of the key that changes set_*_cpu_capacity() into no-ops when the
> key is 0, presumably by modifying kernel code.

That's correct.

> 
> So this is clever but tricky and my question here is why it is necessary.
> 
> For example, compared to the RCU-based approach I'm using, how much better
> is this?  Yes, there is some tiny overhead related to checking whether callbacks
> are present and invoking them, but is it really so bad?  Can you actually
> measure the difference in any realistic workload?

I have not tried to measure any difference. We erred on the side of
caution, to avoid impacting the scheduler hot paths at all if possible.

> One thing I personally like in the RCU-based approach is its universality.  The
> callbacks may be installed by different entities in a uniform way: intel_pstate
> can do that, the old governors can do that, my experimental schedutil code can
> do that and your code could have done that too in principle.  And this is very
> nice, because it is a common underlying mechanism that can be used by everybody
> regardless of their particular implementations on the other side.
> 
> Why would I want to use something different, then?

I've got nothing against a callback registration mechanism. As you
mentioned in another mail it could itself use static keys, enabling the
static key when a callback is registered and disabling it otherwise to
avoid calling into cpufreq_update_util().

> 
>> +static bool __read_mostly cpufreq_driver_slow;
>> +
>> +/*
>> + * The number of enabled schedfreq policies is modified during GOV_START/STOP.
>> + * It, along with whether the schedfreq static key is enabled, is protected by
>> + * the gov_enable_lock.
>> + */
> 
> Well, it would be good to explain what the role of the number of enabled_policies is
> at least briefly.

It's just that if the number of enabled policies is > 0, the static key
must be enabled so that the callbacks work. I'll clarify that in the
comment here.

> 
>> +static int enabled_policies;
>> +static DEFINE_MUTEX(gov_enable_lock);
>> +
>> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
>> +static struct cpufreq_governor cpufreq_gov_sched;
>> +#endif
> 
> I don't think you need the #ifndef any more after recent changes in linux-next.

Ah, I've been basing this on tip/sched/core. I should probably move over
to linux-next. Will do so and check this.

> 
>> +
>> +/*
>> + * Capacity margin added to CFS and RT capacity requests to provide
>> + * some head room if task utilization further increases.
>> + */
> 
> OK, where does this number come from?

Someone's posterior :) .

This really should be a tunable IMO, but there's a fairly strong
anti-tunable sentiment, so it's been left hard-coded in an attempt to
provide something that "just works."

At the least I can add a comment saying that the 20% idle headroom
requirement was an off the cuff estimate and that at this time, we don't
have significant data to suggest it's the best number.

> 
...
>> +static bool finish_last_request(struct gov_data *gd)
>> +{
>> +	ktime_t now = ktime_get();
>> +
>> +	if (ktime_after(now, gd->throttle))
>> +		return false;
>> +
>> +	while (1) {
> 
> I would write this as
> 
> 	do {
> 
>> +		int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));
>> +
>> +		usec_left /= NSEC_PER_USEC;
>> +		usleep_range(usec_left, usec_left + 100);
>> +		now = ktime_get();
> 
> 	} while (ktime_before(now, gd->throttle));
> 
> 	return true;
> 
> But maybe that's just me. :-)

I agree that's cleaner, will change it.
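
For reference, the reworked helper would then read roughly like this
(just the body quoted above restructured as you suggest, not the final
patch):

	static bool finish_last_request(struct gov_data *gd)
	{
		ktime_t now = ktime_get();

		if (ktime_after(now, gd->throttle))
			return false;

		do {
			int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));

			usec_left /= NSEC_PER_USEC;
			usleep_range(usec_left, usec_left + 100);
			now = ktime_get();
		} while (ktime_before(now, gd->throttle));

		return true;
	}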

> 
>> +		if (ktime_after(now, gd->throttle))
>> +			return true;
>> +	}
>> +}
> 
> I'm not a big fan of this throttling mechanism overall, but more about that later.
> 
>> +
>> +static int cpufreq_sched_thread(void *data)
>> +{
> 
> Now, what really is the advantage of having those extra threads vs using
> workqueues?
> 
> I guess the underlying concern is that RT tasks may stall workqueues indefinitely
> in theory and then the frequency won't be updated, but there's much more kernel
> stuff run from workqueues and if that is starved, you won't get very far anyway.
> 
> If you take special measures to prevent frequency change requests from being
> stalled by RT tasks, question is why are they so special?  Aren't there any
> other kernel activities that also should be protected from that and may be
> more important than CPU frequency changes?

I think updating the CPU frequency during periods of heavy RT/DL load is
one of the most (if not the most) important things. I can't speak for
other system activities that may get blocked, but there's an opportunity
to protect CPU frequency changes here, and it seems worth taking to me.

> 
> Plus if this really is the problem here, then it also affects the other cpufreq
> governors, so maybe it should be solved for everybody in some common way?

Agreed, I'd think a freq change thread that serves frequency change
requests would be a useful common component. The locking and throttling
(slowpath_lock, finish_last_request()) are somewhat specific to this
implementation, but could probably be done generically and maybe even
used in other governors. If you're okay with it though I'd like to view
that as a slightly longer term effort, as I think it would get unwieldy
trying to do that as part of this initial change.

> 
...
>> +
>> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
>> +{
>> +	struct gov_data *gd;
>> +
>> +	gd = container_of(irq_work, struct gov_data, irq_work);
>> +	if (!gd)
>> +		return;
>> +
>> +	wake_up_process(gd->task);
> 
> I'm wondering what would be wrong with writing it as
> 
> 	if (gd)
> 		wake_up_process(gd->task);
> 
> And can gd turn out to be NULL here in any case?

In practice I don't think this would ever happen, but there's not
anything that would guarantee the policy can't be stopped and exit while
one of these irq_works is in flight. This would free not only gd but
the irq_work structure itself.

Rather than check if gd is NULL here I think synchronization is required
to flush an in flight irq_work when the policy is being stopped. I will
add an irq_work_sync in the policy stop path and remove the NULL check.
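
Roughly, the intended change (a sketch, not the final patch):

	static void cpufreq_sched_irq_work(struct irq_work *irq_work)
	{
		struct gov_data *gd = container_of(irq_work, struct gov_data,
						   irq_work);

		/* gd cannot be NULL once the stop path syncs the irq_work */
		wake_up_process(gd->task);
	}

	/* in the GOV_STOP path, before gd is freed: */
	irq_work_sync(&gd->irq_work);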

> 
>> +}
>> +
>> +static void update_fdomain_capacity_request(int cpu)
>> +{
>> +	unsigned int freq_new, index_new, cpu_tmp;
>> +	struct cpufreq_policy *policy;
>> +	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
>> +	unsigned long capacity = 0;
>> +
>> +	if (!gd)
>> +		return;
>> +
> 
> Why is this check necessary?

As soon as one policy in the system uses scheduler-guided frequency the
static key will be enabled and all CPUs will be calling into the
scheduler hooks and can potentially come through this path. But some
CPUs may not be part of a scheduler-guided frequency policy. Those CPUs
will have a NULL gov_data pointer.

That being said, I think this test should be moved up into
update_cpu_capacity_request() to avoid that computation when it is not
required. Better still would be bailing when required in the
set_*_cpu_capacity() macros in kernel/sched/sched.h. I'll see if I can
do that instead.

> 
>> +	/* interrupts already disabled here via rq locked */
>> +	raw_spin_lock(&gd->fastpath_lock);
> 
> Well, if you compare this with the one-CPU-per-policy path in my experimental
> schedutil governor code with the "fast switch" patch on top, you'll notice that
> it doesn't use any locks and/or atomic ops.  That's very much on purpose and
> here's where your whole gain from using static keys practically goes away.

Sure, I like the idea of special fast callbacks for the simple case of
UP frequency domains and will add it to the list of things to do here.

The static key is really a separate issue IMO as it's about minimizing
the impact of this feature on scheduler hot paths when schedfreq is not
enabled.

> 
>> +
>> +	policy = gd->policy;
>> +
>> +	for_each_cpu(cpu_tmp, policy->cpus) {
>> +		struct sched_capacity_reqs *scr;
>> +
>> +		scr = &per_cpu(cpu_sched_capacity_reqs, cpu_tmp);
>> +		capacity = max(capacity, scr->total);
>> +	}
> 
> You could save a few cycles from this in the case when the policy is not
> shared.

Agreed. Will look at making special UP paths.

> 
>> +
>> +	freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> 
> Where does this formula come from?

Capacity here is 0 to SCHED_CAPACITY_SCALE, so this is translating the
capacity request to a frequency request via
(capacity/SCHED_CAPACITY_SCALE) * policy->max. I'll add a comment to
this effect.
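
For example (with a hypothetical policy->max of 2000000 kHz), a capacity
request of 512 would translate as

	freq_new = 512 * 2000000 >> SCHED_CAPACITY_SHIFT
		 = 1000000 kHz

i.e. a half-scale capacity request maps to half of the policy's maximum
frequency.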

The race with policy->max potentially changing also deserves a comment.
If policy->max changes say just before we read it here to do this
translation, the scheduler PELT numbers will still largely be based on
the old policy->max value, because they are an exponential moving
average and will take time to re-adjust. So at least for now I'm
ignoring this race as I don't think it's really meaningful to attempt
any synchronization.

> 
>> +
>> +	/*
>> +	 * Calling this without locking policy->rwsem means we race
>> +	 * against changes with policy->min and policy->max. This should
>> +	 * be okay though.
>> +	 */
>> +	if (cpufreq_frequency_table_target(policy, policy->freq_table,
>> +					   freq_new, CPUFREQ_RELATION_L,
>> +					   &index_new))
>> +		goto out;
> 
> __cpufreq_driver_target() will call this again, so isn't calling it here
> a bit wasteful?

I wanted to avoid waking up the frequency change thread (an expensive
operation) whenever possible, or even making an unnecessary fastpath
frequency request, so translating the raw frequency request to a
supported target frequency allows us to bail if the actual requested
target frequency will end up being the same as the current one. I
thought this was more valuable than the extra table lookup here.

Actually, I could make this better by storing both the current
frequency and the next lower table frequency - this would allow me to
do a much simpler test to see if we'd end up requesting the same
frequency or something different.
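
Something along these lines (a sketch - gd->freq_below is a hypothetical
new field caching the table frequency immediately below the one
currently requested):

	/*
	 * With CPUFREQ_RELATION_L, any raw request in the range
	 * (gd->freq_below, gd->requested_freq] resolves to the frequency
	 * we already requested, so bail without a table lookup.
	 */
	if (freq_new <= gd->requested_freq && freq_new > gd->freq_below)
		goto out;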

> 
>> +	freq_new = policy->freq_table[index_new].frequency;
>> +
>> +	if (freq_new == gd->requested_freq)
>> +		goto out;
>> +
> 
> Again, the above generally is a mistake for reasons explained earlier.

(skipping per the other email)

> 
>> +	gd->requested_freq = freq_new;
>> +
>> +	if (cpufreq_driver_slow || !mutex_trylock(&gd->slowpath_lock)) {
> 
> This really doesn't look good to me.
> 
> Why is the mutex needed here in the first place?  cpufreq_sched_stop() should
> be able to make sure that this function won't be run again for the policy
> without using this lock.

The acquisition of the slowpath_lock here isn't for synchronizing with
the policy being stopped - that's handled differently via the smp call
in the stop routine.

Rather it may be the case that schedfreq runs both in the fast and slow
path on a target (maybe because of throttling, or because a previously
started asynchronous transition isn't done yet). If that is so, then
when the slow path is active, I do not want to attempt a transition in
the fast path.

> 
>> +		irq_work_queue_on(&gd->irq_work, cpu);
> 
> I hope that you are aware of the fact that irq_work_queue_on() explodes
> on uniprocessor ARM32 if you run an SMP kernel on it?

No, I wasn't. Fortunately I think switching it to irq_work_queue() to
run the irq_work on the same CPU as the fast path should be fine
(assuming there's no similar landmine there).

> 
> And what happens in the !cpufreq_driver_is_slow() case when we don't
> initialize the irq_work?

That's a bug, the irq_work should always be initialized. Will fix.

> 
>> +	} else if (policy->transition_ongoing ||
>> +		   ktime_before(ktime_get(), gd->throttle)) {
> 
> If this really runs in the scheduler paths, you don't want to have ktime_get()
> here.

I'll switch this to be sched_clock() based.

> 
>> +		mutex_unlock(&gd->slowpath_lock);
>> +		irq_work_queue_on(&gd->irq_work, cpu);
> 
> Allright.
> 
> I think I have figured out how this is arranged, but I may be wrong. :-)
> 
> Here's my understanding of it.  If we are throttled, we don't just skip the
> request.  Instead, we wake up the gd thread kind of in the hope that the
> throttling may end when it actually wakes up.  So in fact we poke at the
> gd thread on a regular basis asking it "Are you still throttled?" and that
> happens on every call from the scheduler until the throttling is over if
> I'm not mistaken.  This means that during throttling every call from the
> scheduler generates an irq_work that wakes up the gd thread just to make it
> check if it still is throttled and go to sleep again.  Please tell me
> that I haven't understood this correctly.

Sorry, yes this should be doing something much more intelligent such as
checking to see if the freq thread is already due to wake up at some
point, and setting a timer for it if not.

> 
> The above aside, I personally think that rate limitting should happen at the source
> and not at the worker thread level.  So if you're throttled, you should just
> return immediately from this function without generating any more work.  That,
> BTW, is what the sampling rate in my code is for.

I will work on modifying the throttling here to be more sane.

> 
>> +	} else {
>> +		cpufreq_sched_try_driver_target(policy, freq_new);
> 
> Well, this is supposed to be the fast path AFAICS.
> 
> Did you actually look at what __cpufreq_driver_target() does in general?
> Including the wait_event() in cpufreq_freq_transition_begin() to mention just
> one suspicious thing?  And how much overhead it generates in the most general
> case?
> 
> No, running *that* from the fast path is not a good idea.  Quite honestly,
> you'd need a new driver callback and a new way to run it from the cpufreq core
> to implement this in a reasonably efficient way.

Yes I'm aware of the wait_event(). I've attempted to work around it by
ensuring schedfreq does not issue a __cpufreq_driver_target attempt
while a transition is in flight (transition_ongoing), and ensuring the
slow and fast paths can't race. I'm not sure yet whether this is enough
when something like thermal or userspace changes min/max - that could
independently start a transition which may cause a fast path request
here to block. This governor does not use the dbs stuff.

While it's not exactly lean, I didn't see anything else in
__cpufreq_driver_target() that looked really terrible.

I've nothing against a new fast switch driver interface. It may be nice
to support unmodified drivers in the fast path as well, if it can be
made to work, even though it may not be optimal.

> 
>> +		mutex_unlock(&gd->slowpath_lock);
>> +	}
>> +
>> +out:
>> +	raw_spin_unlock(&gd->fastpath_lock);
>> +}
>> +
>> +void update_cpu_capacity_request(int cpu, bool request)
>> +{
>> +	unsigned long new_capacity;
>> +	struct sched_capacity_reqs *scr;
>> +
>> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
>> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
>> +
>> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
>> +
>> +	new_capacity = scr->cfs + scr->rt;
>> +	new_capacity = new_capacity * capacity_margin
>> +		/ SCHED_CAPACITY_SCALE;
>> +	new_capacity += scr->dl;
> 
> Can you please explain the formula here?

The deadline class is expected to provide precise enough capacity
requirements (since tasks admitted to it have CPU bandwidth parameters)
such that it does not require additional CPU headroom.

So we're applying the CPU headroom to the CFS and RT capacity requests
only, then adding the DL request.

I'll add a comment here.
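
For example, assuming capacity_margin is 1280 (1.25 * SCHED_CAPACITY_SCALE,
which corresponds to the ~20% headroom mentioned earlier), a request of
scr->cfs + scr->rt = 400 with scr->dl = 100 works out to

	new_capacity = 400 * 1280 / 1024 + 100
		     = 500 + 100 = 600

so only the CFS and RT portions get the headroom boost.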

> 
>> +
>> +	if (new_capacity == scr->total)
>> +		return;
>> +
> 
> The same mistake as before.

(assuming this is part of the same comment in the other email)

> 
...
>> +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
>> +{
>> +	struct gov_data *gd = policy->governor_data;
>> +	int cpu;
>> +
>> +	/*
>> +	 * The schedfreq static key is managed here so the global schedfreq
>> +	 * lock must be taken - a per-policy lock such as policy->rwsem is
>> +	 * not sufficient.
>> +	 */
>> +	mutex_lock(&gov_enable_lock);
>> +
>> +	/*
>> +	 * The governor stop path may or may not hold policy->rwsem. There
>> +	 * must be synchronization with the slow path however.
>> +	 */
>> +	mutex_lock(&gd->slowpath_lock);
>> +
>> +	/*
>> +	 * Stop new entries into the hot path for all CPUs. This will
>> +	 * potentially affect other policies which are still running but
>> +	 * this is an infrequent operation.
>> +	 */
>> +	static_key_slow_dec(&__sched_freq);
>> +	enabled_policies--;
>> +
>> +	/*
>> +	 * Ensure that all CPUs currently part of this policy are out
>> +	 * of the hot path so that if this policy exits we can free gd.
>> +	 */
>> +	preempt_disable();
>> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
>> +	preempt_enable();
> 
> I'm not sure how this works, can you please tell me?

Peter correctly interpreted my intentions.

The RCU possibility also crossed my mind. They both seemed like a bit of
a hack to me - this wouldn't really be doing any RCU per se, rather
relying on its implementation. I'll switch to RCU since that seems to be
preferred though. It's certainly cleaner to write.
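
A minimal sketch of the RCU-based variant of the stop path (assuming, as
discussed elsewhere in the thread, that the hot path only runs with the
rq lock held, i.e. inside an RCU-sched read-side section):

	static_key_slow_dec(&__sched_freq);
	enabled_policies--;

	/*
	 * CPUs already inside the hot path are in an RCU-sched read-side
	 * critical section; once this returns it is safe to free gd.
	 */
	synchronize_sched();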

> 
...
> 
> I have no comments to the rest of the patch.

Thanks again for the review.

thanks,
Steve

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow
  2016-02-23  1:31   ` Rafael J. Wysocki
@ 2016-02-26  0:50       ` Michael Turquette
  0 siblings, 0 replies; 51+ messages in thread
From: Michael Turquette @ 2016-02-26  0:50 UTC (permalink / raw)
  To: Rafael J. Wysocki, Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki,
	Linux Kernel Mailing List, linux-pm, Vincent Guittot,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Patrick Bellasi,
	Viresh Kumar

Quoting Rafael J. Wysocki (2016-02-22 17:31:09)
> On Tue, Feb 23, 2016 at 2:22 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> > From: Michael Turquette <mturquette@baylibre.com>
> >
> > Some architectures and platforms perform CPU frequency transitions
> > through a non-blocking method, while some might block or sleep. Even
> > when frequency transitions do not block or sleep they may be very slow.
> > This distinction is important when trying to change frequency from
> > a non-interruptible context in a scheduler hot path.
> >
> > Describe this distinction with a cpufreq driver flag,
> > CPUFREQ_DRIVER_FAST. The default is to not have this flag set,
> > thus erring on the side of caution.
> >
> > cpufreq_driver_is_slow() is also introduced in this patch. Setting
> > the above flag will allow this function to return false.
> >
> > [smuckle@linaro.org: change flag/API to include drivers that are too
> >  slow for scheduler hot paths, in addition to those that block/sleep]
> >
> > Cc: Rafael J. Wysocki <rafael@kernel.org>
> > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > Signed-off-by: Michael Turquette <mturquette@baylibre.com>
> > Signed-off-by: Steve Muckle <smuckle@linaro.org>
> 
> Something more sophisticated than this is needed, because one driver
> may actually be able to do "fast" switching in some cases and may not
> be able to do that in other cases.

Those drivers can set the flag dynamically when they probe based on
their ACPI tables.

> 
> For example, in the acpi-cpufreq case all depends on what's there in
> the ACPI tables.

It's all a moot point until the locking in cpufreq is changed. Until
those changes are made it is a bad idea to call cpufreq_driver_target()
from schedule() context, regardless of the underlying hardware, and all
platforms should kick that work out to the kthread.

Regards,
Mike

> 
> Thanks,
> Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow
  2016-02-26  0:50       ` Michael Turquette
  (?)
@ 2016-02-26  1:07       ` Steve Muckle
  -1 siblings, 0 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-26  1:07 UTC (permalink / raw)
  To: Michael Turquette, Rafael J. Wysocki
  Cc: Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List, linux-pm,
	Vincent Guittot, Morten Rasmussen, Dietmar Eggemann, Juri Lelli,
	Patrick Bellasi, Viresh Kumar

On 02/25/2016 04:50 PM, Michael Turquette wrote:
>> > Something more sophisticated than this is needed, because one driver
>> > may actually be able to do "fast" switching in some cases and may not
>> > be able to do that in other cases.
>
> Those drivers can set the flag dynamically when they probe based on
> their ACPI tables.

I was thinking that the reference here was to a driver that may be able
to do fast switching for some transitions and not for others, say
perhaps depending on the current and target frequencies, or the state of
the regulators, or other system conditions.

Rafael has proposed a fast_switch() addition to the cpufreq API which
currently returns void. Perhaps that could be extended to return success
or failure from the driver. The driver aborts if it cannot complete the
request atomically and quickly.

The scheduler-driven governor could attempt a fast switch if the
callback is installed (and the other criteria for the fast switch are
met, such as not throttled etc, no request already in flight etc). If
the fast switch aborts, fall back to the slow path.
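
In governor terms that might look something like this sketch (a
cpufreq_driver_fast_switch() that returns an error code is the
hypothetical extension proposed above, and "throttled" /
"transition_in_flight" stand in for the checks already discussed):

	if (!cpufreq_driver_slow && !throttled && !transition_in_flight) {
		/* try the atomic path, fall back if the driver can't do it now */
		if (cpufreq_driver_fast_switch(policy, freq_new) < 0)
			irq_work_queue(&gd->irq_work);
	} else {
		irq_work_queue(&gd->irq_work);
	}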

I suppose the governor could also just see if policy->cur has changed as
opposed to checking cpufreq_driver_fast_switch's return value. But then
we can't tell the difference between the fast transition failing because
it must be re-attempted in the slow path, and the fast transition
failing because of some other more serious reason. In the latter case
the request should probably just be dropped rather than retried in the
slow path.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow
  2016-02-26  0:50       ` Michael Turquette
  (?)
  (?)
@ 2016-02-26  1:16       ` Rafael J. Wysocki
       [not found]         ` <20160226185503.2278.20479@quark.deferred.io>
  -1 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-26  1:16 UTC (permalink / raw)
  To: Michael Turquette
  Cc: Rafael J. Wysocki, Steve Muckle, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, linux-pm, Vincent Guittot,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Patrick Bellasi,
	Viresh Kumar

On Thursday, February 25, 2016 04:50:29 PM Michael Turquette wrote:
> Quoting Rafael J. Wysocki (2016-02-22 17:31:09)
> > On Tue, Feb 23, 2016 at 2:22 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> > > From: Michael Turquette <mturquette@baylibre.com>
> > >
> > > Some architectures and platforms perform CPU frequency transitions
> > > through a non-blocking method, while some might block or sleep. Even
> > > when frequency transitions do not block or sleep they may be very slow.
> > > This distinction is important when trying to change frequency from
> > > a non-interruptible context in a scheduler hot path.
> > >
> > > Describe this distinction with a cpufreq driver flag,
> > > CPUFREQ_DRIVER_FAST. The default is to not have this flag set,
> > > thus erring on the side of caution.
> > >
> > > cpufreq_driver_is_slow() is also introduced in this patch. Setting
> > > the above flag will allow this function to return false.
> > >
> > > [smuckle@linaro.org: change flag/API to include drivers that are too
> > >  slow for scheduler hot paths, in addition to those that block/sleep]
> > >
> > > Cc: Rafael J. Wysocki <rafael@kernel.org>
> > > Cc: Viresh Kumar <viresh.kumar@linaro.org>
> > > Signed-off-by: Michael Turquette <mturquette@baylibre.com>
> > > Signed-off-by: Steve Muckle <smuckle@linaro.org>
> > 
> > Something more sophisticated than this is needed, because one driver
> > may actually be able to do "fast" switching in some cases and may not
> > be able to do that in other cases.
> 
> Those drivers can set the flag dynamically when they probe based on
> their ACPI tables.

No, they can't.

Being able to to the "fast" switching is a property of the policy and
the driver together and it may change with CPU going online/offline.

> > 
> > For example, in the acpi-cpufreq case all depends on what's there in
> > the ACPI tables.
> 
> It's all a moot point until the locking in cpufreq is changed.

No, it isn't.  Look at this, for example: https://patchwork.kernel.org/patch/8426741/

> Until those changes are made it is a bad idea to call cpufreq_driver_target()
> from schedule() context, regardless of the underlying hardware, and all
> platforms should kick that work out to the kthread.

Calling cpufreq_driver_target() from the scheduler is a bad idea overall,
not just because of the locking.

But there are other ways to switch frequencies from scheduler paths.  I run
such code on my test box daily without any problems.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency
  2016-02-23  9:19     ` Peter Zijlstra
@ 2016-02-26  1:37       ` Rafael J. Wysocki
  2016-02-26  9:14         ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-26  1:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Steve Muckle, Ingo Molnar,
	Linux Kernel Mailing List, linux-pm, Vincent Guittot,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Patrick Bellasi,
	Michael Turquette

On Tuesday, February 23, 2016 10:19:16 AM Peter Zijlstra wrote:
> On Tue, Feb 23, 2016 at 02:41:20AM +0100, Rafael J. Wysocki wrote:
> > >  /*
> > > + * Returns the current capacity of cpu after applying both
> > > + * cpu and freq scaling.
> > > + */
> > > +static unsigned long capacity_curr_of(int cpu)
> > > +{
> > > +       return cpu_rq(cpu)->cpu_capacity_orig *
> > > +              arch_scale_freq_capacity(NULL, cpu)
> > 
> > What about architectures that don't have this?
> 
> They get the 'default' which is a constant SCHED_CAPACITY_SCALE unit.
> 
> > Why is that an architecture feature?
> 
> Because not all archs can tell the frequency the same way. Some you
> program the DVFS state and they really run at this speed, for those you
> can simply report back.
> 
> For others, x86 for example, you program a DVFS 'hint' and the hardware
> does whatever, we'd have to do APERF/MPERF samples to get an idea of the
> actual frequency we ran at.
> 
> Also, the having of this makes the load tracking slightly more
> expensive, instead of compile time constants we get function calls and
> actual multiplications. Its not _too_ bad, but still.

That's all correct, but my question should rather be: is arch the right
granularity?

In theory, there may be ARM64-based platforms using ACPI and behaving
like x86 in that respect in the future.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency
  2016-02-26  1:37       ` Rafael J. Wysocki
@ 2016-02-26  9:14         ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2016-02-26  9:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Steve Muckle, Ingo Molnar,
	Linux Kernel Mailing List, linux-pm, Vincent Guittot,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Patrick Bellasi,
	Michael Turquette

On Fri, Feb 26, 2016 at 02:37:19AM +0100, Rafael J. Wysocki wrote:

> That's all correct, but my question should rather be: is arch the right
> granularity?
> 
> In theory, there may be ARM64-based platforms using ACPI and behaving
> like x86 in that respect in the future.

Ah, so I started these hooks way before the cpufreq/cpuidle etc.
integration push.

Maybe we should look at something like that, but performance is really
critical; you most definitely do not want 3 indirections just because of
abstract framework crap - that's measurable overhead on these callsites.

Hence the current inline with constant value or single function call.
And if archs would want a selector, I would recommend boot time call
instruction rewrites a-la alternatives/paravirt.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-25 21:08       ` Rafael J. Wysocki
@ 2016-02-26  9:18         ` Peter Zijlstra
  2016-02-27  0:08           ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2016-02-26  9:18 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Thu, Feb 25, 2016 at 10:08:48PM +0100, Rafael J. Wysocki wrote:
> On Thursday, February 25, 2016 10:28:37 AM Peter Zijlstra wrote:
> > Its vile though; one should not spray IPIs if one can avoid it. Such
> > things are much better done with RCU. Sure sync_sched() takes a little
> > longer, but this isn't a fast path by any measure.
> 
> I see, thanks!
> 
> BTW, when cpufreq_update_util() callbacks are removed, I use synchronize_rcu()
> to wait for the running ones, but would it be better to use synchronize_sched()
> in there instead?

So I think we only call the callback with rq->lock held, in which case
sync_sched() is good enough.

It would allow you to get rid of the rcu_read_{,un}lock() calls as well.

The down-side is that it all makes the code a little harder to get,
because you're relying on caller context to DTRT.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow
       [not found]         ` <20160226185503.2278.20479@quark.deferred.io>
@ 2016-02-26 21:00           ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-26 21:00 UTC (permalink / raw)
  To: Michael Turquette
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Steve Muckle,
	Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List, linux-pm,
	Vincent Guittot, Morten Rasmussen, Dietmar Eggemann, Juri Lelli,
	Patrick Bellasi, Viresh Kumar

On Fri, Feb 26, 2016 at 7:55 PM, Michael Turquette
<mturquette@baylibre.com> wrote:
> Quoting Rafael J. Wysocki (2016-02-25 17:16:17)
>> On Thursday, February 25, 2016 04:50:29 PM Michael Turquette wrote:
>> > Quoting Rafael J. Wysocki (2016-02-22 17:31:09)
>> > > On Tue, Feb 23, 2016 at 2:22 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
>> > > For example, in the acpi-cpufreq case all depends on what's there in
>> > > the ACPI tables.
>> >
>> > It's all a moot point until the locking in cpufreq is changed.
>>
>> No, it isn't.  Look at this, for example: https://patchwork.kernel.org/patch/8426741/
>
> Thanks for the pointer. Do you mind Cc'ing me on future versions?

I'll do that.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-26  9:18         ` Peter Zijlstra
@ 2016-02-27  0:08           ` Rafael J. Wysocki
  2016-03-01 12:57             ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-27  0:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Friday, February 26, 2016 10:18:43 AM Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 10:08:48PM +0100, Rafael J. Wysocki wrote:
> > On Thursday, February 25, 2016 10:28:37 AM Peter Zijlstra wrote:
> > > Its vile though; one should not spray IPIs if one can avoid it. Such
> > > things are much better done with RCU. Sure sync_sched() takes a little
> > > longer, but this isn't a fast path by any measure.
> > 
> > I see, thanks!
> > 
> > BTW, when cpufreq_update_util() callbacks are removed, I use synchronize_rcu()
> > to wait for the running ones, but would it be better to use synchronize_sched()
> > in there instead?
> 
> So I think we only call the callback with rq->lock held, in which case
> sync_sched() is good enough.
> 
> It would allow you to get rid of the rcu_read_{,un}lock() calls as well.
> 
> The down-side is that it all makes the code a little harder to get,
> because you're relying on caller context to DTRT.

OK, so what about the below (on top of linux-next)?

It has passed my cursory testing.

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] cpufreq: Reduce cpufreq_update_util() overhead a bit

Use the observation that cpufreq_update_util() is only called
by the scheduler with rq->lock held, so the callers of
cpufreq_set_update_util_data() can use synchronize_sched()
instead of synchronize_rcu() to wait for cpufreq_update_util()
to complete.  Moreover, if they are updated to do that,
rcu_read_(un)lock() calls in cpufreq_update_util() might be
replaced with rcu_read_(un)lock_sched(), respectively, but
those aren't really necessary, because the scheduler calls
that function from RCU-sched read-side critical sections
already.

In addition to that, if cpufreq_set_update_util_data() checks
the func field in the struct update_util_data before setting
the per-CPU pointer to it, the data->func check may be dropped
from cpufreq_update_util() as well.

Make the above changes to reduce the overhead from
cpufreq_update_util() in the scheduler paths invoking it
and to make the cleanup after removing its callbacks less
heavy-weight somewhat.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/cpufreq.c          |   21 +++++++++++++--------
 drivers/cpufreq/cpufreq_governor.c |    2 +-
 drivers/cpufreq/intel_pstate.c     |    4 ++--
 3 files changed, 16 insertions(+), 11 deletions(-)

Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -77,12 +77,15 @@ static DEFINE_PER_CPU(struct update_util
  * to call from cpufreq_update_util().  That function will be called from an RCU
  * read-side critical section, so it must not sleep.
  *
- * Callers must use RCU callbacks to free any memory that might be accessed
- * via the old update_util_data pointer or invoke synchronize_rcu() right after
- * this function to avoid use-after-free.
+ * Callers must use RCU-sched callbacks to free any memory that might be
+ * accessed via the old update_util_data pointer or invoke synchronize_sched()
+ * right after this function to avoid use-after-free.
  */
 void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
 {
+	if (WARN_ON(data && !data->func))
+		return;
+
 	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
 }
 EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
@@ -95,18 +98,20 @@ EXPORT_SYMBOL_GPL(cpufreq_set_update_uti
  *
  * This function is called by the scheduler on every invocation of
  * update_load_avg() on the CPU whose utilization is being updated.
+ *
+ * It can only be called from RCU-sched read-side critical sections.
  */
 void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
 {
 	struct update_util_data *data;
 
-	rcu_read_lock();
-
 	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
-	if (data && data->func)
+	/*
+	 * If this isn't inside of an RCU-sched read-side critical section, data
+	 * may become NULL after the check below.
+	 */
+	if (data)
 		data->func(data, time, util, max);
-
-	rcu_read_unlock();
 }
 
 /* Flag to suspend/resume CPUFreq governors */
Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
+++ linux-pm/drivers/cpufreq/cpufreq_governor.c
@@ -280,7 +280,7 @@ static inline void gov_clear_update_util
 	for_each_cpu(i, policy->cpus)
 		cpufreq_set_update_util_data(i, NULL);
 
-	synchronize_rcu();
+	synchronize_sched();
 }
 
 static void gov_cancel_work(struct cpufreq_policy *policy)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -1171,7 +1171,7 @@ static void intel_pstate_stop_cpu(struct
 	pr_debug("intel_pstate: CPU %d exiting\n", cpu_num);
 
 	cpufreq_set_update_util_data(cpu_num, NULL);
-	synchronize_rcu();
+	synchronize_sched();
 
 	if (hwp_active)
 		return;
@@ -1429,7 +1429,7 @@ out:
 	for_each_online_cpu(cpu) {
 		if (all_cpu_data[cpu]) {
 			cpufreq_set_update_util_data(cpu, NULL);
-			synchronize_rcu();
+			synchronize_sched();
 			kfree(all_cpu_data[cpu]);
 		}
 	}

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-26  0:34     ` Steve Muckle
@ 2016-02-27  2:39       ` Rafael J. Wysocki
  2016-02-27  4:17         ` Steve Muckle
  2016-03-01 13:17       ` Peter Zijlstra
  2016-03-02  7:49       ` Michael Turquette
  2 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-27  2:39 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Thursday, February 25, 2016 04:34:23 PM Steve Muckle wrote:
> On 02/24/2016 07:55 PM, Rafael J. Wysocki wrote:
> > Hi,

[cut]

> > One thing I personally like in the RCU-based approach is its universality.  The
> > callbacks may be installed by different entities in a uniform way: intel_pstate
> > can do that, the old governors can do that, my experimental schedutil code can
> > do that and your code could have done that too in principle.  And this is very
> > nice, because it is a common underlying mechanism that can be used by everybody
> > regardless of their particular implementations on the other side.
> > 
> > Why would I want to use something different, then?
> 
> I've got nothing against a callback registration mechanism. As you
> mentioned in another mail it could itself use static keys, enabling the
> static key when a callback is registered and disabling it otherwise to
> avoid calling into cpufreq_update_util().

But then it would only make a difference if cpufreq_update_util() was not
used at all (ie. no callbacks installed for any policies by anyone).  The
only reason why it may matter is that the total number of systems using
the performance governor is quite large AFAICS and they would benefit from
that.

[cut]

> 
> > 
> >> +
> >> +/*
> >> + * Capacity margin added to CFS and RT capacity requests to provide
> >> + * some head room if task utilization further increases.
> >> + */
> > 
> > OK, where does this number come from?
> 
> Someone's posterior :) .
> 
> This really should be a tunable IMO, but there's a fairly strong
> anti-tunable sentiment, so it's been left hard-coded in an attempt to
> provide something that "just works."

Ouch.

> At the least I can add a comment saying that the 20% idle headroom
> requirement was an off the cuff estimate and that at this time, we don't
> have significant data to suggest it's the best number.

Well, in this area, every number has to be justified.  Otherwise we end
up with things that sort of work, but nobody actually understands why.

[cut]

> > 
> >> +
> >> +static int cpufreq_sched_thread(void *data)
> >> +{
> > 
> > Now, what really is the advantage of having those extra threads vs using
> > workqueues?
> > 
> > I guess the underlying concern is that RT tasks may stall workqueues indefinitely
> > in theory and then the frequency won't be updated, but there's much more kernel
> > stuff run from workqueues and if that is starved, you won't get very far anyway.
> > 
> > If you take special measures to prevent frequency change requests from being
> > stalled by RT tasks, question is why are they so special?  Aren't there any
> > other kernel activities that also should be protected from that and may be
> > more important than CPU frequency changes?
> 
> I think updating the CPU frequency during periods of heavy RT/DL load is
> one of the most (if not the most) important things. I can't speak for
> other system activities that may get blocked, but there's an opportunity
> to protect CPU frequency changes here, and it seems worth taking to me.

So do it in a general way for everybody and not just for one governor
that you happen to be working on.

That said I'm unconvinced about the approach still.

Having more RT threads in a system that already is under RT pressure seems like
a recipe for trouble.  Moreover, it's likely that those new RT threads will
disturb the system's normal operation somehow even without the RT pressure and
have you investigated that?  Also having them per policy may be overkill and
binding them to policy CPUs only is not necessary.

Overall, it looks like a dynamic pool of threads that may run on every CPU
might be a better approach, but that would almost duplicate the workqueues
subsystem, so is it really worth it?

And is the problem actually visible in practice?  I have no record of any reports
mentioning it, although theoretically it's been there forever, so had it been
real, someone would have noticed it and complained about it IMO.

> > 
> > Plus if this really is the problem here, then it also affects the other cpufreq
> > governors, so maybe it should be solved for everybody in some common way?
> 
> Agreed, I'd think a freq change thread that serves frequency change
> requests would be a useful common component. The locking and throttling
> (slowpath_lock, finish_last_request()) are somewhat specific to this
> implementation, but could probably be done generically and maybe even
> used in other governors. If you're okay with it though I'd like to view
> that as a slightly longer term effort, as I think it would get unwieldy
> trying to do that as part of this initial change.

I really am not sure if this is useful at all, so why bother with it initially?

> > 
> ...
> >> +
> >> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
> >> +{
> >> +	struct gov_data *gd;
> >> +
> >> +	gd = container_of(irq_work, struct gov_data, irq_work);
> >> +	if (!gd)
> >> +		return;
> >> +
> >> +	wake_up_process(gd->task);
> > 
> > I'm wondering what would be wrong with writing it as
> > 
> > 	if (gd)
> > 		wake_up_process(gd->task);
> > 
> > And can gd turn out to be NULL here in any case?
> 
> In practice I don't think this would ever happen, but there's not
> anything that would guarantee the policy can't be stopped and exit while
> one of these irq_works is in flight. This would free not only gd but
> the irq_work structure itself.
> 
> Rather than check if gd is NULL here I think synchronization is required
> to flush an in flight irq_work when the policy is being stopped.

Right.

> I will add an irq_work_sync in the policy stop path and remove the NULL check.
> 
> > 
> >> +}
> >> +
> >> +static void update_fdomain_capacity_request(int cpu)
> >> +{
> >> +	unsigned int freq_new, index_new, cpu_tmp;
> >> +	struct cpufreq_policy *policy;
> >> +	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
> >> +	unsigned long capacity = 0;
> >> +
> >> +	if (!gd)
> >> +		return;
> >> +
> > 
> > Why is this check necessary?
> 
> As soon as one policy in the system uses scheduler-guided frequency the
> static key will be enabled and all CPUs will be calling into the
> scheduler hooks and can potentially come through this path. But some
> CPUs may not be part of a scheduler-guided frequency policy. Those CPUs
> will have a NULL gov_data pointer.
> 
> That being said, I think this test should be moved up into
> update_cpu_capacity_request() to avoid that computation when it is not
> required. Better still would be bailing when required in the
> set_*_cpu_capacity() macros in kernel/sched/sched.h. I'll see if I can
> do that instead.

If you switch over to using cpufreq_update_util() callbacks like the other
governors, you won't need it any more, because then you'll be guaranteed that
you won't run if the callback hasn't been installed for this CPU.

[cut]

> > 
> >> +
> >> +	freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> > 
> > Where does this formula come from?
> 
> Capacity here is 0 to SCHED_CAPACITY_SCALE, so this is translating the
> capacity request to a frequency request via
> (capacity/SCHED_CAPACITY_SCALE) * policy->max. I'll add a comment to
> this effect.
>
> The race with policy->max potentially changing also deserves a comment.
> If policy->max changes say just before we read it here to do this
> translation, the scheduler PELT numbers will still largely be based on
> the old policy->max value, because they are an exponential moving
> average and will take time to re-adjust. So at least for now I'm
> ignoring this race as I don't think it's really meaningful to attempt
> any synchronization.

Agreed.

> > 
> >> +
> >> +	/*
> >> +	 * Calling this without locking policy->rwsem means we race
> >> +	 * against changes with policy->min and policy->max. This should
> >> +	 * be okay though.
> >> +	 */
> >> +	if (cpufreq_frequency_table_target(policy, policy->freq_table,
> >> +					   freq_new, CPUFREQ_RELATION_L,
> >> +					   &index_new))
> >> +		goto out;
> > 
> > __cpufreq_driver_target() will call this again, so isn't calling it here
> > a bit wasteful?
> 
> I wanted to avoid waking up the frequency change thread (an expensive
> operation) whenever possible, or even making an unnecessary fastpath
> frequency request,

In the fastpath case the driver will check if the new freq is equal to the
old one as it generally has no guarantee that the freq it is asked for
matches one it has in the table.  It needs to look it up and check.

And the lookup may actually be the most expensive operation in that case
(the request itself may be a matter of a read from and a write to a fast
register), so by adding it here you pretty much double the overhead.  Not
to mention the fact that the driver's lookup may be optimized quite a bit
compared to what cpufreq_frequency_table_target() does.

> so translating the raw frequency request to a
> supported target frequency allows us to bail if the actual requested
> target frequency will end up being the same as the current one. I
> thought this was more valuable than the extra table lookup here.

You don't need to do a table lookup for that.  You only need to store
the frequency that you have previously requested.

As long as the driver is sane, it should deterministically choose the same
frequency from the table for the same target value every time, so comparing
with what you passed to it before should be sufficient.

> Actually, I could make this better by storing both the current
> frequency and the next lower table frequency - this would allow me to
> do a much simpler test to see if we'd end up requesting the same
> frequency or something different.
> 
> > 
> >> +	freq_new = policy->freq_table[index_new].frequency;
> >> +
> >> +	if (freq_new == gd->requested_freq)
> >> +		goto out;
> >> +
> > 
> > Again, the above generally is a mistake for reasons explained earlier.
> 
> (skipping per the other email)
> 
> > 
> >> +	gd->requested_freq = freq_new;
> >> +
> >> +	if (cpufreq_driver_slow || !mutex_trylock(&gd->slowpath_lock)) {
> > 
> > This really doesn't look good to me.
> > 
> > Why is the mutex needed here in the first place?  cpufreq_sched_stop() should
> > be able to make sure that this function won't be run again for the policy
> > without using this lock.
> 
> The acquisition of the slowpath_lock here isn't for synchronizing with
> the policy being stopped - that's handled differently via the smp call
> in the stop routine.
> 
> Rather it may be the case that schedfreq runs both in the fast and slow
> path on a target (maybe because of throttling, or because a previously
> started asynchronous transition isn't done yet). If that is so, then
> when the slow path is active, I do not want to attempt a transition in
> the fast path.

But you never throttle in the "fast switch" case, do you?  You don't
start the gd thread in that case even.

That aside, playing tricks with mutexes like that is ugly like hell and
doesn't make the code easier to understand in any way.

> 
> > 
> >> +		irq_work_queue_on(&gd->irq_work, cpu);
> > 
> > I hope that you are aware of the fact that irq_work_queue_on() explodes
> > on uniprocessor ARM32 if you run an SMP kernel on it?
> 
> No, I wasn't. Fortunately I think switching it to irq_work_queue() to
> run the irq_work on the same CPU as the fast path should be fine
> (assuming there's no similar landmine there).

You don't have to run it on the same CPU, though.  It doesn't matter what
CPU kicks your worker thread, does it?

[cut]

> >> +	} else {
> >> +		cpufreq_sched_try_driver_target(policy, freq_new);
> > 
> > Well, this is supposed to be the fast path AFAICS.
> > 
> > Did you actually look at what __cpufreq_driver_target() does in general?
> > Including the wait_event() in cpufreq_freq_transition_begin() to mention just
> > one suspicious thing?  And how much overhead it generates in the most general
> > case?
> > 
> > No, running *that* from the fast path is not a good idea.  Quite honestly,
> > you'd need a new driver callback and a new way to run it from the cpufreq core
> > to implement this in a reasonably efficient way.
> 
> Yes I'm aware of the wait_event(). I've attempted to work around it by
> ensuring schedfreq does not issue a __cpufreq_driver_target attempt
> while a transition is in flight (transition_ongoing), and ensuring the
> slow and fast paths can't race. I'm not sure yet whether this is enough
> when something like thermal or userspace changes min/max - that could
> independently start a transition which may cause a fast path request
> here to block. This governor does not use the dbs stuff.
> 
> While it's not exactly lean, I didn't see anything else in
> __cpufreq_driver_target() that looked really terrible.
> 
> I've nothing against a new fast switch driver interface. It may be nice
> to support unmodified drivers in the fast path as well, if it can be
> made to work, even though it may not be optimal.

My bet is that you'd need to modify drivers for that anyway, because none of
them meet the fast path requirements for the very simple reason that they
weren't designed with that in mind.

So if you need to modify them anyway, why not to add a new callback to them
in the process?

> > 
> >> +		mutex_unlock(&gd->slowpath_lock);
> >> +	}
> >> +
> >> +out:
> >> +	raw_spin_unlock(&gd->fastpath_lock);
> >> +}
> >> +
> >> +void update_cpu_capacity_request(int cpu, bool request)
> >> +{
> >> +	unsigned long new_capacity;
> >> +	struct sched_capacity_reqs *scr;
> >> +
> >> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
> >> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
> >> +
> >> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
> >> +
> >> +	new_capacity = scr->cfs + scr->rt;
> >> +	new_capacity = new_capacity * capacity_margin
> >> +		/ SCHED_CAPACITY_SCALE;
> >> +	new_capacity += scr->dl;
> > 
> > Can you please explain the formula here?
> 
> The deadline class is expected to provide precise enough capacity
> requirements (since tasks admitted to it have CPU bandwidth parameters)
> such that it does not require additional CPU headroom.
> 
> So we're applying the CPU headroom to the CFS and RT capacity requests
> only, then adding the DL request.
> 
> I'll add a comment here.

One thing that has escaped me: How do you take policy->min into account?
In particular, what if you end up with a frequency below policy->min?

[cut]

> >> +	/*
> >> +	 * Ensure that all CPUs currently part of this policy are out
> >> +	 * of the hot path so that if this policy exits we can free gd.
> >> +	 */
> >> +	preempt_disable();
> >> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> >> +	preempt_enable();
> > 
> > I'm not sure how this works, can you please tell me?
> 
> Peter correctly interpreted my intentions.
> 
> The RCU possibility also crossed my mind. They both seemed like a bit of
> a hack to me - this wouldn't really be doing any RCU per se, rather
> relying on its implementation. I'll switch to RCU since that seems to be
> preferred though. It's certainly cleaner to write.

Well, in that case you simply can switch over to using cpufreq_update_util()
callbacks and then you'll get that part for free.

Overall, sorry to say that, I don't quite see much to salvage in this patch.

The throttling implementation is a disaster.  The idea of having RT worker
threads per policy is questionable for a few reasons.  The idea of invoking
the existing driver callbacks from the fast path is not a good one and
trying to use __cpufreq_driver_target() for that only makes it worse.
The mutex manipulations under a raw spinlock are firmly in the "don't do
that" category and the static keys won't be useful any more or need to be
used elsewhere.

The way that you compute the frequency to ask for kind of might be defended,
but it has problems too, like arbitrary factors coming out of thin air.

So what's left then, realistically?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-27  2:39       ` Rafael J. Wysocki
@ 2016-02-27  4:17         ` Steve Muckle
  2016-02-28  2:26           ` Rafael J. Wysocki
  2016-03-01 13:26           ` Peter Zijlstra
  0 siblings, 2 replies; 51+ messages in thread
From: Steve Muckle @ 2016-02-27  4:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On 02/26/2016 06:39 PM, Rafael J. Wysocki wrote:
>>> One thing I personally like in the RCU-based approach is its universality.  The
>>> callbacks may be installed by different entities in a uniform way: intel_pstate
>>> can do that, the old governors can do that, my experimental schedutil code can
>>> do that and your code could have done that too in principle.  And this is very
>>> nice, because it is a common underlying mechanism that can be used by everybody
>>> regardless of their particular implementations on the other side.
>>>
>>> Why would I want to use something different, then?
>>
>> I've got nothing against a callback registration mechanism. As you
>> mentioned in another mail it could itself use static keys, enabling the
>> static key when a callback is registered and disabling it otherwise to
>> avoid calling into cpufreq_update_util().
> 
> But then it would only make a difference if cpufreq_update_util() was not
> used at all (ie. no callbacks installed for any policies by anyone).  The
> only reason why it may matter is that the total number of systems using
> the performance governor is quite large AFAICS and they would benefit from
> that.

I'd think that's a benefit worth preserving, but I guess that's Peter
and Ingo's call.

> 
...
>>>> +/*
>>>> + * Capacity margin added to CFS and RT capacity requests to provide
>>>> + * some head room if task utilization further increases.
>>>> + */
>>>
>>> OK, where does this number come from?
>>
>> Someone's posterior :) .
>>
>> This really should be a tunable IMO, but there's a fairly strong
>> anti-tunable sentiment, so it's been left hard-coded in an attempt to
>> provide something that "just works."
> 
> Ouch.
> 
>> At the least I can add a comment saying that the 20% idle headroom
>> requirement was an off the cuff estimate and that at this time, we don't
>> have significant data to suggest it's the best number.
> 
> Well, in this area, every number has to be justified.  Otherwise we end
> up with things that sort of work, but nobody actually understands why.

It's just a starting point. There's a lot of profiling and tuning that
has yet to happen. We figured there were larger design issues to discuss
prior to spending a lot of time tweaking the headroom value.

> 
> [cut]
> 
>>>
>>>> +
>>>> +static int cpufreq_sched_thread(void *data)
>>>> +{
>>>
>>> Now, what really is the advantage of having those extra threads vs using
>>> workqueues?
>>>
>>> I guess the underlying concern is that RT tasks may stall workqueues indefinitely
>>> in theory and then the frequency won't be updated, but there's much more kernel
>>> stuff run from workqueues and if that is starved, you won't get very far anyway.
>>>
>>> If you take special measures to prevent frequency change requests from being
>>> stalled by RT tasks, question is why are they so special?  Aren't there any
>>> other kernel activities that also should be protected from that and may be
>>> more important than CPU frequency changes?
>>
>> I think updating the CPU frequency during periods of heavy RT/DL load is
>> one of the most (if not the most) important things. I can't speak for
>> other system activities that may get blocked, but there's an opportunity
>> to protect CPU frequency changes here, and it seems worth taking to me.
> 
> So do it in a general way for everybody and not just for one governor
> that you happen to be working on.
> 
> That said I'm unconvinced about the approach still.
> 
> Having more RT threads in a system that already is under RT pressure seems like
> a recipe for trouble.  Moreover, it's likely that those new RT threads will
> disturb the system's normal operation somehow even without the RT pressure and
> have you investigated that?

Sorry I'm not sure what you mean by disturb normal operation.

Generally speaking, increasing the capacity of a heavily loaded system
seems to me to be something that should run urgently, so that the system
can potentially get itself out of trouble and meet the workload's needs.

> Also having them per policy may be overkill and
> binding them to policy CPUs only is not necessary.
> 
> Overall, it looks like a dynamic pool of threads that may run on every CPU
> might be a better approach, but that would almost duplicate the workqueues
> subsystem, so is it really worth it?
> 
> And is the problem actually visible in practice?  I have no record of any reports
> mentioning it, although theoretically it's been there forever, so had it been
> real, someone would have noticed it and complained about it IMO.

While I don't have a test case drawn up to provide it seems like it'd be
easy to create one. More importantly the interactive governor in Android
uses this same kind of model, starting a frequency change thread and
making it RT. Android is particularly sensitive to latency in frequency
response. So that's likely one big reason why you're not hearing about
this issue - some folks have already worked around it.

> 
>>>
>>> Plus if this really is the problem here, then it also affects the other cpufreq
>>> governors, so maybe it should be solved for everybody in some common way?
>>
>> Agreed, I'd think a freq change thread that serves frequency change
>> requests would be a useful common component. The locking and throttling
>> (slowpath_lock, finish_last_request()) are somewhat specific to this
>> implementation, but could probably be done generically and maybe even
>> used in other governors. If you're okay with it though I'd like to view
>> that as a slightly longer term effort, as I think it would get unwieldy
>> trying to do that as part of this initial change.
> 
> I really am not sure if this is useful at all, so why bother with it initially?

I still think ensuring CPU frequency changes aren't delayed by other
task activity is useful...

> 
...
>>>
>>>> +}
>>>> +
>>>> +static void update_fdomain_capacity_request(int cpu)
>>>> +{
>>>> +	unsigned int freq_new, index_new, cpu_tmp;
>>>> +	struct cpufreq_policy *policy;
>>>> +	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
>>>> +	unsigned long capacity = 0;
>>>> +
>>>> +	if (!gd)
>>>> +		return;
>>>> +
>>>
>>> Why is this check necessary?
>>
>> As soon as one policy in the system uses scheduler-guided frequency the
>> static key will be enabled and all CPUs will be calling into the
>> scheduler hooks and can potentially come through this path. But some
>> CPUs may not be part of a scheduler-guided frequency policy. Those CPUs
>> will have a NULL gov_data pointer.
>>
>> That being said, I think this test should be moved up into
>> update_cpu_capacity_request() to avoid that computation when it is not
>> required. Better still would be bailing when required in the
>> set_*_cpu_capacity() macros in kernel/sched/sched.h. I'll see if I can
>> do that instead.
> 
> If you switch over to using cpufreq_update_util() callbacks like the other
> governors, you won't need it any more, because then you'll be guaranteed that
> you won't run if the callback hasn't been installed for this CPU.

Sure.

> 
...
>>>> +		irq_work_queue_on(&gd->irq_work, cpu);
>>>
>>> I hope that you are aware of the fact that irq_work_queue_on() explodes
>>> on uniprocessor ARM32 if you run an SMP kernel on it?
>>
>> No, I wasn't. Fortunately I think switching it to irq_work_queue() to
>> run the irq_work on the same CPU as the fast path should be fine
>> (assuming there's no similar landmine there).
> 
> You don't have to run it on the same CPU, though.  It doesn't matter what
> CPU kicks your worker thread, does it?

I don't want to wake a random CPU to potentially just do the kick.

> 
> [cut]
> 
>>>> +	} else {
>>>> +		cpufreq_sched_try_driver_target(policy, freq_new);
>>>
>>> Well, this is supposed to be the fast path AFAICS.
>>>
>>> Did you actually look at what __cpufreq_driver_target() does in general?
>>> Including the wait_event() in cpufreq_freq_transition_begin() to mention just
>>> one suspicious thing?  And how much overhead it generates in the most general
>>> case?
>>>
>>> No, running *that* from the fast path is not a good idea.  Quite honestly,
>>> you'd need a new driver callback and a new way to run it from the cpufreq core
>>> to implement this in a reasonably efficient way.
>>
>> Yes I'm aware of the wait_event(). I've attempted to work around it by
>> ensuring schedfreq does not issue a __cpufreq_driver_target attempt
>> while a transition is in flight (transition_ongoing), and ensuring the
>> slow and fast paths can't race. I'm not sure yet whether this is enough
>> if something like thermal or userspace changes min/max, whether that can
>> independently start a transition that may cause a fast path request here
>> to block. This governor does not use the dbs stuff.
>>
>> While it's not exactly lean, I didn't see anything else in
>> __cpufreq_driver_target() that looked really terrible.
>>
>> I've nothing against a new fast switch driver interface. It may be nice
>> to support unmodified drivers in the fast path as well, if it can be
>> made to work, even though it may not be optimal.
> 
> My bet is that you'd need to modify drivers for that anyway, because none of
> them meet the fast path requirements for the very simple reason that they
> weren't designed with that in mind.
> 
> So if you need to modify them anyway, why not to add a new callback to them
> in the process?

This implies to me that there are drivers which can be made to work in
the fast path (i.e. do not sleep and are reasonably quick) but currently
do not. Are there really such drivers? Why would they have unnecessary
blocking or work in them?

I would have thought that if a driver sleeps or is slow, it is because
of platform limitations (like a blocking call to adjust a regulator)
which are unavoidable.
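
For example, a typical ARM target callback looks roughly like this
(simplified sketch, not any specific driver; cpu_reg, cpu_clk and
freq_table stand for driver-private data):

static int typical_target_index(struct cpufreq_policy *policy,
				unsigned int index)
{
	unsigned long freq_hz = freq_table[index].frequency * 1000UL;
	int volt_uv = 1100000;	/* placeholder: would come from the OPP table */
	int ret;

	/* Talking to a PMIC, typically over I2C: this sleeps. */
	ret = regulator_set_voltage(cpu_reg, volt_uv, volt_uv);
	if (ret)
		return ret;

	/* Reprogramming the PLL can block too (the clk framework takes mutexes). */
	return clk_set_rate(cpu_clk, freq_hz);
}

(Voltage first is the scale-up ordering; it's reversed when scaling down.)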

> 
>>>
>>>> +		mutex_unlock(&gd->slowpath_lock);
>>>> +	}
>>>> +
>>>> +out:
>>>> +	raw_spin_unlock(&gd->fastpath_lock);
>>>> +}
>>>> +
>>>> +void update_cpu_capacity_request(int cpu, bool request)
>>>> +{
>>>> +	unsigned long new_capacity;
>>>> +	struct sched_capacity_reqs *scr;
>>>> +
>>>> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
>>>> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
>>>> +
>>>> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
>>>> +
>>>> +	new_capacity = scr->cfs + scr->rt;
>>>> +	new_capacity = new_capacity * capacity_margin
>>>> +		/ SCHED_CAPACITY_SCALE;
>>>> +	new_capacity += scr->dl;
>>>
>>> Can you please explain the formula here?
>>
>> The deadline class is expected to provide precise enough capacity
>> requirements (since tasks admitted to it have CPU bandwidth parameters)
>> such that it does not require additional CPU headroom.
>>
>> So we're applying the CPU headroom to the CFS and RT capacity requests
>> only, then adding the DL request.
>>
>> I'll add a comment here.
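
To make the numbers concrete (assuming capacity_margin is 1280, i.e.
1.25 * SCHED_CAPACITY_SCALE, which is where the ~20% idle headroom
mentioned above comes from): with scr->cfs = 300, scr->rt = 100 and
scr->dl = 128,

	(300 + 100) * 1280 / 1024 = 500
	500 + 128                 = 628

so CFS and RT get 25% extra capacity while the DL request is added at
face value.
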
> 
> One thing that has escaped me: How do you take policy->min into account?
> In particular, what if you end up with a frequency below policy->min?

__cpufreq_driver_target will floor the request with policy->min.
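
(IIRC the relevant bit near the top of __cpufreq_driver_target() is
roughly

	/* Make sure that target_freq is within supported range */
	target_freq = clamp_val(target_freq, policy->min, policy->max);

so out-of-range requests get clamped before the driver ever sees them.)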

> 
> [cut]
> 
>>>> +	/*
>>>> +	 * Ensure that all CPUs currently part of this policy are out
>>>> +	 * of the hot path so that if this policy exits we can free gd.
>>>> +	 */
>>>> +	preempt_disable();
>>>> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
>>>> +	preempt_enable();
>>>
>>> I'm not sure how this works, can you please tell me?
>>
>> Peter correctly interpreted my intentions.
>>
>> The RCU possibility also crossed my mind. They both seemed like a bit of
>> a hack to me - this wouldn't really be doing any RCU per se, rather
>> relying on its implementation. I'll switch to RCU since that seems to be
>> preferred though. It's certainly cleaner to write.
> 
> Well, in that case you simply can switch over to using cpufreq_update_util()
> callbacks and then you'll get that part for free.

I'm not sure. Removing the callbacks by itself doesn't guarantee all the
CPUs are out of them - don't you still need something to synchronize
with that?

> Overall, sorry to say that, I don't quite see much to salvage in this patch.
> 
> The throttling implementation is a disaster.  The idea of having RT worker
> threads per policy is questionable for a few reasons.  The idea of invoking
> the existing driver callbacks from the fast path is not a good one and
> trying to use __cpufreq_driver_target() for that only makes it worse.
> The mutex manipulations under a raw spinlock are firmly in the "don't do
> that" category and the static keys won't be useful any more or need to be
> used elsewhere.
> 
> The way that you compute the frequency to ask for kind of might be defended,
> but it has problems too, like arbitrary factors coming out of thin air.
> 
> So what's left then, realistically?

Aside from the throttling, which I agree was a last minute and
not-well-thought-out addition and needs to be fixed, I'd still challenge
the statements above. But I don't think there's much point. I'm happy to
work towards enabling ARM in schedutil. I have some review comments
which I am just finishing up and will send shortly.

thanks,
Steve

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-27  4:17         ` Steve Muckle
@ 2016-02-28  2:26           ` Rafael J. Wysocki
  2016-03-01 14:31             ` Peter Zijlstra
  2016-03-01 13:26           ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-02-28  2:26 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Friday, February 26, 2016 08:17:46 PM Steve Muckle wrote:
> On 02/26/2016 06:39 PM, Rafael J. Wysocki wrote:
> >>> One thing I personally like in the RCU-based approach is its universality.  The
> >>> callbacks may be installed by different entities in a uniform way: intel_pstate
> >>> can do that, the old governors can do that, my experimental schedutil code can
> >>> do that and your code could have done that too in principle.  And this is very
> >>> nice, because it is a common underlying mechanism that can be used by everybody
> >>> regardless of their particular implementations on the other side.
> >>>
> >>> Why would I want to use something different, then?
> >>
> >> I've got nothing against a callback registration mechanism. As you
> >> mentioned in another mail it could itself use static keys, enabling the
> >> static key when a callback is registered and disabling it otherwise to
> >> avoid calling into cpufreq_update_util().
> > 
> > But then it would only make a difference if cpufreq_update_util() was not
> > used at all (ie. no callbacks installed for any policies by anyone).  The
> > only reason why it may matter is that the total number of systems using
> > the performance governor is quite large AFAICS and they would benefit from
> > that.
> 
> I'd think that's a benefit worth preserving, but I guess that's Peter
> and Ingo's call.

That's exactly what I said to Peter. :-)

[cut]

> > 
> > That said I'm unconvinced about the approach still.
> > 
> > Having more RT threads in a system that already is under RT pressure seems like
> > a recipe for trouble.  Moreover, it's likely that those new RT threads will
> > disturb the system's normal operation somehow even without the RT pressure and
> > have you investigated that?
> 
> Sorry I'm not sure what you mean by disturb normal operation.

That would introduce a number of extra RT threads that would be woken up quite
often and on a regular basis, so there would be some extra RT noise in the
system, especially on systems with one CPU per cpufreq policy and many CPUs.

That's not present ATM and surely need not be completely transparent.

> Generally speaking, increasing the capacity of a heavily loaded system
> seems to me to be something that should run urgently, so that the system
> can potentially get itself out of trouble and meet the workload's needs.
> 
> > Also having them per policy may be overkill and
> > binding them to policy CPUs only is not necessary.
> > 
> > Overall, it looks like a dynamic pool of threads that may run on every CPU
> > might be a better approach, but that would almost duplicate the workqueues
> > subsystem, so is it really worth it?
> > 
> > And is the problem actually visible in practice?  I have no record of any reports
> > mentioning it, although theoretically it's been there forever, so had it been
> > real, someone would have noticed it and complained about it IMO.
> 
> While I don't have a test case drawn up to provide it seems like it'd be
> easy to create one. More importantly the interactive governor in Android
> uses this same kind of model, starting a frequency change thread and
> making it RT. Android is particularly sensitive to latency in frequency
> response. So that's likely one big reason why you're not hearing about
> this issue - some folks have already worked around it.

OK, so Android is the reason. :-)

Fair enough.  I still think that care is needed here, though.

[cut]

> >>
> >> I've nothing against a new fast switch driver interface. It may be nice
> >> to support unmodified drivers in the fast path as well, if it can be
> >> made to work, even though it may not be optimal.
> > 
> > My bet is that you'd need to modify drivers for that anyway, because none of
> > them meet the fast path requirements for the very simple reason that they
> > weren't designed with that in mind.
> > 
> > So if you need to modify them anyway, why not to add a new callback to them
> > in the process?
> 
> This implies to me that there are drivers which can be made to work in
> the fast path (i.e. do not sleep and are reasonably quick) but currently
> do not. Are there really such drivers? Why would they have unnecessary
> blocking or work in them?

The ACPI driver is an example here.  It uses smp_call_function_many() to
run stuff on policy CPUs and that cannot be called from the fast path,
so the driver has to be modified.

> I would have thought that if a driver sleeps or is slow, it is because
> of platform limitations (like a blocking call to adjust a regulator)
> which are unavoidable.

Well, if you don't need to worry about sleeping, it's common to use
things like mutexes for synchronization etc.  Quite simply, if you don't
expect that the given piece of code will ever be run from interrupt
context, you'll not code for that, and the scheduler paths have even more
restrictions than typical interrupt handlers.

> >>>
> >>>> +		mutex_unlock(&gd->slowpath_lock);
> >>>> +	}
> >>>> +
> >>>> +out:
> >>>> +	raw_spin_unlock(&gd->fastpath_lock);
> >>>> +}
> >>>> +
> >>>> +void update_cpu_capacity_request(int cpu, bool request)
> >>>> +{
> >>>> +	unsigned long new_capacity;
> >>>> +	struct sched_capacity_reqs *scr;
> >>>> +
> >>>> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
> >>>> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
> >>>> +
> >>>> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
> >>>> +
> >>>> +	new_capacity = scr->cfs + scr->rt;
> >>>> +	new_capacity = new_capacity * capacity_margin
> >>>> +		/ SCHED_CAPACITY_SCALE;
> >>>> +	new_capacity += scr->dl;
> >>>
> >>> Can you please explain the formula here?
> >>
> >> The deadline class is expected to provide precise enough capacity
> >> requirements (since tasks admitted to it have CPU bandwidth parameters)
> >> such that it does not require additional CPU headroom.
> >>
> >> So we're applying the CPU headroom to the CFS and RT capacity requests
> >> only, then adding the DL request.
> >>
> >> I'll add a comment here.
> > 
> > One thing that has escaped me: How do you take policy->min into account?
> > In particular, what if you end up with a frequency below policy->min?
> 
> __cpufreq_driver_target will floor the request with policy->min.

I see.

> > 
> > [cut]
> > 
> >>>> +	/*
> >>>> +	 * Ensure that all CPUs currently part of this policy are out
> >>>> +	 * of the hot path so that if this policy exits we can free gd.
> >>>> +	 */
> >>>> +	preempt_disable();
> >>>> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> >>>> +	preempt_enable();
> >>>
> >>> I'm not sure how this works, can you please tell me?
> >>
> >> Peter correctly interpreted my intentions.
> >>
> >> The RCU possibility also crossed my mind. They both seemed like a bit of
> >> a hack to me - this wouldn't really be doing any RCU per se, rather
> >> relying on its implementation. I'll switch to RCU since that seems to be
> >> preferred though. It's certainly cleaner to write.
> > 
> > Well, in that case you simply can switch over to using cpufreq_update_util()
> > callbacks and then you'll get that part for free.
> 
> I'm not sure. Removing the callbacks by itself doesn't guarantee all the
> CPUs are out of them - don't you still need something to synchronize
> with that?

The update_util pointers are dereferenced and checked against NULL and the
callbacks are run (if present) in the same RCU read-side critical section.
synchronize_rcu() ensures that (a) all of the previously running read-side
critical sections complete before it returns and (b) all of the read-side
critical sections that begin after it has returned will see the new value
of the pointer.  This guarantees that the callbacks will not be running
after clearing the update_util pointers and executing synchronize_rcu().
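
Schematically, the governor-side teardown then is just (sketch only,
following the sequence described above; cpufreq_set_update_util_data()
is the setter from my patch and gd stands for whatever per-policy data
the governor wants to free):

	for_each_cpu(cpu, policy->cpus)
		cpufreq_set_update_util_data(cpu, NULL);

	/* wait for all in-flight callback invocations to finish */
	synchronize_rcu();

	/* no CPU can be running the callback now; governor data can go */
	kfree(gd);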

> > Overall, sorry to say that, I don't quite see much to salvage in this patch.
> > 
> > The throttling implementation is a disaster.  The idea of having RT worker
> > threads per policy is questionable for a few reasons.  The idea of invoking
> > the existing driver callbacks from the fast path is not a good one and
> > trying to use __cpufreq_driver_target() for that only makes it worse.
> > The mutex manipulations under a raw spinlock are firmly in the "don't do
> > that" category and the static keys won't be useful any more or need to be
> > used elsewhere.
> > 
> > The way that you compute the frequency to ask for kind of might be defended,
> > but it has problems too, like arbitrary factors coming out of thin air.
> > 
> > So what's left then, realistically?
> 
> Aside from the throttling, which I agree was a last minute and
> not-well-thought-out addition and needs to be fixed, I'd still challenge
> the statements above. But I don't think there's much point. I'm happy to
> work towards enabling ARM in schedutil.

Cool! :-)

> I have some review comments which I am just finishing up and will send shortly.

I've responded to those already I think.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 04/10] sched/fair: add triggers for OPP change requests
  2016-02-23  1:22 ` [RFCv7 PATCH 04/10] sched/fair: add triggers for OPP change requests Steve Muckle
@ 2016-03-01  6:51   ` Ricky Liang
  2016-03-03  3:55     ` Steve Muckle
  0 siblings, 1 reply; 51+ messages in thread
From: Ricky Liang @ 2016-03-01  6:51 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, open list,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette

Hi Steve,

On Tue, Feb 23, 2016 at 9:22 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> From: Juri Lelli <juri.lelli@arm.com>
>
> Each time a task is {en,de}queued we might need to adapt the current
> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
> this purpose.  Only trigger a freq request if we are effectively waking up
> or going to sleep.  Filter out load balancing related calls to reduce the
> number of triggers.
>
> [smuckle@linaro.org: resolve merge conflicts, define task_new,
>  use renamed static key sched_freq]
>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 47 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3437e01..f1f00a4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4283,6 +4283,21 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>
> +static unsigned long capacity_orig_of(int cpu);
> +static int cpu_util(int cpu);
> +
> +static void update_capacity_of(int cpu)
> +{
> +       unsigned long req_cap;
> +
> +       if (!sched_freq())
> +               return;
> +
> +       /* Convert scale-invariant capacity to cpu. */
> +       req_cap = cpu_util(cpu) * SCHED_CAPACITY_SCALE / capacity_orig_of(cpu);
> +       set_cfs_cpu_capacity(cpu, true, req_cap);
> +}
> +

The change hunks of this patch should probably all depend on
CONFIG_SMP as capacity_orig_of() and cpu_util() are only available
when CONFIG_SMP is enabled.
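
i.e. roughly (just sketching the shape of it, body taken from the hunk
quoted above):

#ifdef CONFIG_SMP
static void update_capacity_of(int cpu)
{
	unsigned long req_cap;

	if (!sched_freq())
		return;

	/* Convert scale-invariant capacity to cpu. */
	req_cap = cpu_util(cpu) * SCHED_CAPACITY_SCALE / capacity_orig_of(cpu);
	set_cfs_cpu_capacity(cpu, true, req_cap);
}
#else
static inline void update_capacity_of(int cpu) { }
#endif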

[snip...]

Thanks,
Ricky

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-27  0:08           ` Rafael J. Wysocki
@ 2016-03-01 12:57             ` Peter Zijlstra
  2016-03-01 19:44               ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2016-03-01 12:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Sat, Feb 27, 2016 at 01:08:02AM +0100, Rafael J. Wysocki wrote:
> @@ -95,18 +98,20 @@ EXPORT_SYMBOL_GPL(cpufreq_set_update_uti
>   *
>   * This function is called by the scheduler on every invocation of
>   * update_load_avg() on the CPU whose utilization is being updated.
> + *
> + * It can only be called from RCU-sched read-side critical sections.
>   */
>  void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
>  {
>  	struct update_util_data *data;
>  
> -	rcu_read_lock();
> -

Maybe, just because I'm paranoid, add something like:

#ifdef CONFIG_LOCKDEP
	WARN_ON(debug_locks && !rcu_read_lock_sched_held());
#endif

>  	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
> -	if (data && data->func)
> +	/*
> +	 * If this isn't inside of an RCU-sched read-side critical section, data
> +	 * may become NULL after the check below.
> +	 */
> +	if (data)
>  		data->func(data, time, util, max);
> -
> -	rcu_read_unlock();
>  }

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-26  0:34     ` Steve Muckle
  2016-02-27  2:39       ` Rafael J. Wysocki
@ 2016-03-01 13:17       ` Peter Zijlstra
  2016-03-02  7:49       ` Michael Turquette
  2 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2016-03-01 13:17 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Thu, Feb 25, 2016 at 04:34:23PM -0800, Steve Muckle wrote:
> >> +	/*
> >> +	 * Ensure that all CPUs currently part of this policy are out
> >> +	 * of the hot path so that if this policy exits we can free gd.
> >> +	 */
> >> +	preempt_disable();
> >> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> >> +	preempt_enable();
> > 
> > I'm not sure how this works, can you please tell me?
> 
> Peter correctly interpreted my intentions.
> 
> The RCU possibility also crossed my mind. They both seemed like a bit of
> a hack to me - this wouldn't really be doing any RCU per se, rather
> relying on its implementation. I'll switch to RCU since that seems to be
> preferred though. It's certainly cleaner to write.

RCU is widely used in this fashion. synchronize_sched() is explicitly
constructed to sync against preempt disable sections. This is not an
implementation detail.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-27  4:17         ` Steve Muckle
  2016-02-28  2:26           ` Rafael J. Wysocki
@ 2016-03-01 13:26           ` Peter Zijlstra
  1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2016-03-01 13:26 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Fri, Feb 26, 2016 at 08:17:46PM -0800, Steve Muckle wrote:
> > But then it would only make a difference if cpufreq_update_util() was not
> > used at all (ie. no callbacks installed for any policies by anyone).  The
> > only reason why it may matter is that the total number of systems using
> > the performance governor is quite large AFAICS and they would benefit from
> > that.
> 
> I'd think that's a benefit worth preserving, but I guess that's Peter
> and Ingo's call.

Probably worth it, most of this is on rather fast paths.

See commit 1cde2930e154 ("sched/preempt: Add static_key() to
preempt_notifiers") for example.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-28  2:26           ` Rafael J. Wysocki
@ 2016-03-01 14:31             ` Peter Zijlstra
  2016-03-01 20:32               ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2016-03-01 14:31 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Steve Muckle, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang

On Sun, Feb 28, 2016 at 03:26:21AM +0100, Rafael J. Wysocki wrote:

> > > That said I'm unconvinced about the approach still.
> > > 
> > > Having more RT threads in a system that already is under RT pressure seems like
> > > a recipe for trouble.  Moreover, it's likely that those new RT threads will
> > > disturb the system's normal operation somehow even without the RT pressure and
> > > have you investigated that?
> > 
> > Sorry I'm not sure what you mean by disturb normal operation.
> 
> That would introduce a number of extra RT threads that would be woken up quite
> often and on a regular basis, so there would be some extra RT noise in the
> system, especially on systems with one CPU per cpufreq policy and many CPUs.
> 
> That's not present ATM and surely need not be completely transparent.

Having RT tasks should not be a problem. You can always set their
priority such that they do not interfere with an actual RT workload.
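
E.g. something like (sketch only; the priority value is arbitrary and
gd->task stands for the per-policy kthread -- the point is just to keep
it below whatever the 'real' RT workload runs at):

	struct sched_param param = { .sched_priority = 1 };

	sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);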

> > Generally speaking, increasing the capacity of a heavily loaded system
> > seems to me to be something that should run urgently, so that the system
> > can potentially get itself out of trouble and meet the workload's needs.
> > 
> > > Also having them per policy may be overkill and
> > > binding them to policy CPUs only is not necessary.
> > > 
> > > Overall, it looks like a dynamic pool of threads that may run on every CPU
> > > might be a better approach, but that would almost duplicate the workqueues
> > > subsystem, so is it really worth it?
> > > 
> > > And is the problem actually visible in practice?  I have no record of any reports
> > > mentioning it, although theoretically it's been there forever, so had it been
> > > real, someone would have noticed it and complained about it IMO.
> > 
> > While I don't have a test case drawn up to provide it seems like it'd be
> > easy to create one. More importantly the interactive governor in Android
> > uses this same kind of model, starting a frequency change thread and
> > making it RT. Android is particularly sensitive to latency in frequency
> > response. So that's likely one big reason why you're not hearing about
> > this issue - some folks have already worked around it.
> 
> OK, so Android is the reason. :-)
> 
> Fair enough.  I still think that care is needed here, though.

So I sort of see the point of having per-cpu RT kthread tasks to effect
the OPP change for CFS. And here I would indeed suggest just having a
task per cpu; moving tasks about is too complex and generates even more
noise.

But the problem of having RT tasks is that you should run RT tasks at
max OPP.

Now you can probably fudge things and not account these RT tasks to the
RT 'workload' since you know their characteristics etc. But it's going
to be ugly, I suspect.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-03-01 12:57             ` Peter Zijlstra
@ 2016-03-01 19:44               ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-03-01 19:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Steve Muckle, Ingo Molnar, Rafael J. Wysocki,
	Linux Kernel Mailing List, linux-pm, Vincent Guittot,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Patrick Bellasi,
	Michael Turquette, Ricky Liang

On Tue, Mar 1, 2016 at 1:57 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Feb 27, 2016 at 01:08:02AM +0100, Rafael J. Wysocki wrote:
>> @@ -95,18 +98,20 @@ EXPORT_SYMBOL_GPL(cpufreq_set_update_uti
>>   *
>>   * This function is called by the scheduler on every invocation of
>>   * update_load_avg() on the CPU whose utilization is being updated.
>> + *
>> + * It can only be called from RCU-sched read-side critical sections.
>>   */
>>  void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
>>  {
>>       struct update_util_data *data;
>>
>> -     rcu_read_lock();
>> -
>
> Maybe, just because I'm paranoid, add something like:
>
> #ifdef CONFIG_LOCKDEP
>         WARN_ON(debug_locks && !rcu_read_lock_sched_held());
> #endif

OK

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-03-01 14:31             ` Peter Zijlstra
@ 2016-03-01 20:32               ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-03-01 20:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Steve Muckle, Ingo Molnar, Rafael J. Wysocki,
	Linux Kernel Mailing List, linux-pm, Vincent Guittot,
	Morten Rasmussen, Dietmar Eggemann, Juri Lelli, Patrick Bellasi,
	Michael Turquette, Ricky Liang

On Tue, Mar 1, 2016 at 3:31 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Feb 28, 2016 at 03:26:21AM +0100, Rafael J. Wysocki wrote:
>
>> > > That said I'm unconvinced about the approach still.
>> > >
>> > > Having more RT threads in a system that already is under RT pressure seems like
>> > > a recipe for trouble.  Moreover, it's likely that those new RT threads will
>> > > disturb the system's normal operation somehow even without the RT pressure and
>> > > have you investigated that?
>> >
>> > Sorry I'm not sure what you mean by disturb normal operation.
>>
>> That would introduce a number of extra RT threads that would be woken up quite
>> often and on a regular basis, so there would be some extra RT noise in the
>> system, especially on systems with one CPU per cpufreq policy and many CPUs.
>>
>> That's not present ATM and surely need not be completely transparent.
>
> Having RT tasks should not be a problem. You can always set their
> priority such that they do not interfere with an actual RT workload.
>
>> > Generally speaking, increasing the capacity of a heavily loaded system
>> > seems to me to be something that should run urgently, so that the system
>> > can potentially get itself out of trouble and meet the workload's needs.
>> >
>> > > Also having them per policy may be overkill and
>> > > binding them to policy CPUs only is not necessary.
>> > >
>> > > Overall, it looks like a dynamic pool of threads that may run on every CPU
>> > > might be a better approach, but that would almost duplicate the workqueues
>> > > subsystem, so is it really worth it?
>> > >
>> > > And is the problem actually visible in practice?  I have no record of any reports
>> > > mentioning it, although theoretically it's been there forever, so had it been
>> > > real, someone would have noticed it and complained about it IMO.
>> >
>> > While I don't have a test case drawn up to provide it seems like it'd be
>> > easy to create one. More importantly the interactive governor in Android
>> > uses this same kind of model, starting a frequency change thread and
>> > making it RT. Android is particularly sensitive to latency in frequency
>> > response. So that's likely one big reason why you're not hearing about
>> > this issue - some folks have already worked around it.
>>
>> OK, so Android is the reason. :-)
>>
>> Fair enough.  I still think that care is needed here, though.
>
> So I sort of see the point of having per-cpu RT kthread tasks to effect
> the OPP change for CFS. And here I would indeed suggest just having a
> task per cpu; moving tasks about is too complex and generates even more
> noise.
>
> But the problem of having RT tasks is that you should run RT tasks at
> max OPP.
>
> Now you can probably fudge things and not account these RT tasks to the
> RT 'workload' since you know their characteristics etc. But it's going
> to be ugly, I suspect.

Good point.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-26  0:34     ` Steve Muckle
  2016-02-27  2:39       ` Rafael J. Wysocki
  2016-03-01 13:17       ` Peter Zijlstra
@ 2016-03-02  7:49       ` Michael Turquette
  2016-03-03  2:49         ` Rafael J. Wysocki
  2016-03-03 13:03         ` Peter Zijlstra
  2 siblings, 2 replies; 51+ messages in thread
From: Michael Turquette @ 2016-03-02  7:49 UTC (permalink / raw)
  To: Steve Muckle, Rafael J. Wysocki
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Ricky Liang

Hi,

I'm still catching up on the plurality of scheduler/cpufreq threads but
I thought I would chime in with some historical reasons for why
cpufreq_sched.c looks the way it does today.

Quoting Steve Muckle (2016-02-25 16:34:23)
> On 02/24/2016 07:55 PM, Rafael J. Wysocki wrote:
> > On Monday, February 22, 2016 05:22:43 PM Steve Muckle wrote:
> > Besides, governors traditionally go to drivers/cpufreq.  Why is this different?
> 
> This predates my involvement but I'm guessing this was done because the
> long term vision was for cpufreq to eventually be removed from the
> picture. But it may be quite some time before that happens - I think
> communicating directly with drivers may be difficult due to the need to
> synchronize with thermal or userspace min/max requests etc, plus there's
> the fact that we've got a longstanding API exposed to userspace.
> 
> I'm happy to move this to drivers/cpufreq.

Originally it was put in kernel/sched/ because cpufreq_sched.c needed
access to kernel/sched/sched.h. Rafael's implementation flips the
push-pull relationship between the governor and scheduler so that this
is not necessary. If that requirement is removed from Steve's series
then there is no need to put the file outside of drivers/cpufreq.

...
> > One thing I personally like in the RCU-based approach is its universality.  The
> > callbacks may be installed by different entities in a uniform way: intel_pstate
> > can do that, the old governors can do that, my experimental schedutil code can
> > do that and your code could have done that too in principle.  And this is very
> > nice, because it is a common underlying mechanism that can be used by everybody
> > regardless of their particular implementations on the other side.
> > 
> > Why would I want to use something different, then?
> 
> I've got nothing against a callback registration mechanism. As you
> mentioned in another mail it could itself use static keys, enabling the
> static key when a callback is registered and disabling it otherwise to
> avoid calling into cpufreq_update_util().

I'm very happy to see the return of capacity/util ops. I had these in my
initial prototype back in 2014 but Peter shot it down, partially due to
the performance hit of indirect function call overhead in the fast path:

http://permalink.gmane.org/gmane.linux.kernel/1803410

...
> >> +
> >> +/*
> >> + * Capacity margin added to CFS and RT capacity requests to provide
> >> + * some head room if task utilization further increases.
> >> + */
> > 
> > OK, where does this number come from?
> 
> Someone's posterior :) .

Indeed, this margin is based on a posterior-derived value, in particular
the 25% imbalance_pct in struct sched_domain:

include/linux/sched.h:
struct sched_domain {
        /* These fields must be setup */
        ...
        unsigned int imbalance_pct;     /* No balance until over watermark */
...

and,

kernel/sched/core.c:
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl, int cpu)
{
        struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
        ...
        sd = (struct sched_domain){
                .min_interval           = sd_weight,
                .max_interval           = 2*sd_weight,
                .busy_factor            = 32,
                .imbalance_pct          = 125,
...

> 
> This really should be a tunable IMO, but there's a fairly strong
> anti-tunable sentiment, so it's been left hard-coded in an attempt to
> provide something that "just works."

It should definitely be tunable.

> 
> At the least I can add a comment saying that the 20% idle headroom
> requirement was an off the cuff estimate and that at this time, we don't
> have significant data to suggest it's the best number.

Yeah, it was just for prototyping.

...
> > Now, what really is the advantage of having those extra threads vs using
> > workqueues?

More historical anecdotes: The RT/DL stuff below is absolutely worth
talking about, but if your question is, "why did the design choose
kthreads over wq's", the answer is legacy.

Originally back in 2014 Morten played with queuing work up via wq within
schedule() context, but hit a problem where that caused re-entering
schedule() from schedule() and entering reentrancy hell and entering
reentrancy hell and entering reentrancy hell and entering reentrancy
hell ... well you get the idea.

I came up with an ugly workaround to wake up a dormant kthread with
wake_up_process() which was called from an arch-specific callback to
change cpu frequency and later Juri suggested to use irq_work to kick
the kthread, which is still the method used in Steve's patch series.

Seems that it is fine to toss work onto a workqueue from irq_work
context however, which is nice, except that the RT thread approach
yields better latency. I'll let you guys sort out the issues around
RT/DL starvation, which seems like a real issue.
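
Roughly this shape, for reference (field names approximate, not copied
from the series):

static void cpufreq_sched_irq_work(struct irq_work *irq_work)
{
	struct gov_data *gd = container_of(irq_work, struct gov_data, irq_work);

	wake_up_process(gd->task);	/* kick the frequency-change kthread */
}

/* at governor init: */
	init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);

/* from the scheduler hook, where we cannot wake a task directly: */
	irq_work_queue(&gd->irq_work);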

> > 
> > I guess the underlying concern is that RT tasks may stall workqueues indefinitely
> > in theory and then the frequency won't be updated, but there's much more kernel
> > stuff run from workqueues and if that is starved, you won't get very far anyway.
> > 
> > If you take special measures to prevent frequency change requests from being
> > stalled by RT tasks, question is why are they so special?  Aren't there any
> > other kernel activities that also should be protected from that and may be
> > more important than CPU frequency changes?
> 
> I think updating the CPU frequency during periods of heavy RT/DL load is
> one of the most (if not the most) important things. I can't speak for
> other system activities that may get blocked, but there's an opportunity
> to protect CPU frequency changes here, and it seems worth taking to me.
> 
> > 
> > Plus if this really is the problem here, then it also affects the other cpufreq
> > governors, so maybe it should be solved for everybody in some common way?
> 
> Agreed, I'd think a freq change thread that serves frequency change
> requests would be a useful common component. The locking and throttling
> (slowpath_lock, finish_last_request()) are somewhat specific to this
> implementation, but could probably be done generically and maybe even
> used in other governors. If you're okay with it though I'd like to view
> that as a slightly longer term effort, as I think it would get unwieldy
> trying to do that as part of this initial change.

I do not have any data to back up a case for stalls caused by RT/DL
starvation, but conceptually I would say that latency is fundamentally
more important in a scheduler-driven cpu frequency selection scenario,
versus the legacy timer-based governors.

In the latter case we get woken up by a timer (prior to Rafael's recent
"cpufreq: governor: Replace timers with utilization update callbacks"
patch), we sample idleness/busyness, and change frequency, all in one go
and all from process context.

In the case of the scheduler selecting frequency in the hot path, with
hardware that takes a long time to transition frequency (and thus must
be done in a slow path), we want to minimize the time delta between the
scheduler picking a frequency and the thread that executes that change
actually being run.

In my over-simplified view of the scheduler, it would be great if we
could have a backdoor mechanism to place the frequency transition
kthread onto a runqueue from within the schedule() context and dispense
with the irq_work stuff in Steve's series altogether.

Regards,
Mike

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-03-02  7:49       ` Michael Turquette
@ 2016-03-03  2:49         ` Rafael J. Wysocki
  2016-03-03  3:50           ` Steve Muckle
  2016-03-03 13:03         ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2016-03-03  2:49 UTC (permalink / raw)
  To: Michael Turquette
  Cc: Steve Muckle, Rafael J. Wysocki, Peter Zijlstra, Ingo Molnar,
	Rafael J. Wysocki, Linux Kernel Mailing List, linux-pm,
	Vincent Guittot, Morten Rasmussen, Dietmar Eggemann, Juri Lelli,
	Patrick Bellasi, Ricky Liang

On Wed, Mar 2, 2016 at 8:49 AM, Michael Turquette
<mturquette@baylibre.com> wrote:
>

[cut]

> I do not have any data to back up a case for stalls caused by RT/DL
> starvation, but conceptually I would say that latency is fundamentally
> more important in a scheduler-driven cpu frequency selection scenario,
> versus the legacy timer-based governors.
>
> In the latter case we get woken up by a timer (prior to Rafael's recent
> "cpufreq: governor: Replace timers with utilization update callbacks"
> patch), we sample idleness/busyness, and change frequency, all in one go
> and all from process context.
>
> In the case of the scheduler selecting frequency in the hot path, with
> hardware that takes a long time to transition frequency (and thus must
> be done in a slow path), we want to minimize the time delta between the
> scheduler picking a frequency and the thread that executes that change
> actually being run.

That is a good point.  However, Peter's point about the RT tasks
having to run at max util and affecting the frequency control this
way is a good one too.

I'm not actually sure if RT is the right answer here.  DL may be a
better choice.  After all, we want the thing to happen shortly, but
not necessarily at full speed.

So something like a DL workqueue would be quite useful here it seems.

> In my over-simplified view of the scheduler, it would be great if we
> could have a backdoor mechanism to place the frequency transition
> kthread onto a runqueue from within the schedule() context and dispense
> with the irq_work stuff in Steve's series altogether.

Well, I use irq_work() now in schedutil and ondemand/conservative too
for queuing up work items and it gets the job done.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-03-03  2:49         ` Rafael J. Wysocki
@ 2016-03-03  3:50           ` Steve Muckle
  2016-03-03  9:34             ` Juri Lelli
  0 siblings, 1 reply; 51+ messages in thread
From: Steve Muckle @ 2016-03-03  3:50 UTC (permalink / raw)
  To: Rafael J. Wysocki, Michael Turquette, Peter Zijlstra, Ingo Molnar
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, linux-pm,
	Vincent Guittot, Morten Rasmussen, Dietmar Eggemann, Juri Lelli,
	Patrick Bellasi, Ricky Liang

On 03/02/2016 06:49 PM, Rafael J. Wysocki wrote:
> I'm not actually sure if RT is the right answer here.  DL may be a
> better choice.  After all, we want the thing to happen shortly, but
> not necessarily at full speed.
> 
> So something like a DL workqueue would be quite useful here it seems.

The DL idea seems like a good one to me.

It would also prevent cpufreq changes from being delayed by other RT or
DL tasks.

thanks,
Steve

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 04/10] sched/fair: add triggers for OPP change requests
  2016-03-01  6:51   ` Ricky Liang
@ 2016-03-03  3:55     ` Steve Muckle
  0 siblings, 0 replies; 51+ messages in thread
From: Steve Muckle @ 2016-03-03  3:55 UTC (permalink / raw)
  To: Ricky Liang
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, open list,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette

Hi Ricky,

On 02/29/2016 10:51 PM, Ricky Liang wrote:
> The change hunks of this patch should probably all depend on
> CONFIG_SMP as capacity_orig_of() and cpu_util() are only available
> when CONFIG_SMP is enabled.

Yeah, I was deferring cleaning that up until there was more buy in on
the overall solution. But it looks like we will be moving forward using
Rafael's schedutil governor. The most recent posting of that is here:

http://thread.gmane.org/gmane.linux.kernel/2166378

thanks,
Steve

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-03-03  3:50           ` Steve Muckle
@ 2016-03-03  9:34             ` Juri Lelli
  0 siblings, 0 replies; 51+ messages in thread
From: Juri Lelli @ 2016-03-03  9:34 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Rafael J. Wysocki, Michael Turquette, Peter Zijlstra,
	Ingo Molnar, Rafael J. Wysocki, Linux Kernel Mailing List,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Patrick Bellasi, Ricky Liang

On 02/03/16 19:50, Steve Muckle wrote:
> On 03/02/2016 06:49 PM, Rafael J. Wysocki wrote:
> > I'm not actually sure if RT is the right answer here.  DL may be a
> > better choice.  After all, we want the thing to happen shortly, but
> > not necessarily at full speed.
> > 
> > So something like a DL workqueue would be quite useful here it seems.
> 
> The DL idea seems like a good one to me.
> 
> It would also prevent cpufreq changes from being delayed by other RT or
> DL tasks.
> 

Dimensioning period and runtime could require some experimenting, but
it's worth a try, I agree.
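
Just to make it concrete, something along these lines, where picking
runtime/period is exactly the open question (the numbers below are pure
placeholders, and "kthread" is the per-policy/per-cpu worker):

	struct sched_attr attr = {
		.sched_policy   = SCHED_DEADLINE,
		.sched_runtime  =  100 * NSEC_PER_USEC,
		.sched_deadline = 1000 * NSEC_PER_USEC,
		.sched_period   = 1000 * NSEC_PER_USEC,
	};

	sched_setattr(kthread, &attr);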

Best,

- Juri

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-03-02  7:49       ` Michael Turquette
  2016-03-03  2:49         ` Rafael J. Wysocki
@ 2016-03-03 13:03         ` Peter Zijlstra
  1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2016-03-03 13:03 UTC (permalink / raw)
  To: Michael Turquette
  Cc: Steve Muckle, Rafael J. Wysocki, Ingo Molnar, Rafael J. Wysocki,
	linux-kernel, linux-pm, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Juri Lelli, Patrick Bellasi, Ricky Liang

On Tue, Mar 01, 2016 at 11:49:10PM -0800, Michael Turquette wrote:
> 
> In my over-simplified view of the scheduler, it would be great if we
> could have a backdoor mechanism to place the frequency transition
> kthread onto a runqueue from within the schedule() context and dispense
> with the irq_work stuff in Steve's series altogether.

This is actually very very hard :/

So while there is something similar for workqueues,
try_to_wake_up_local(), that will not work for the cpufreq stuff.

The main problem is that schedule() is done with rq->lock held, but
wakeups need p->pi_lock, but it so happens that rq->lock nests inside of
p->pi_lock.

Now, the workqueue stuff with try_to_wake_up_local() can get away with
dropping rq->lock, because of where it is called, way early in
schedule() before we really muck things up.

The cpufreq hook otoh is called all over the place.

The second problem is that doing a wakeup will in fact also end up
calling the cpufreq hook, so you're back in recursion hell.

The third problem is that cpufreq is called from wakeups, which would
want to do another wakeup (see point 2), but this also means we have to
nest p->pi_lock, and we can't really do that either.
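
To spell out the ordering problem (informal):

	/*
	 *   p->pi_lock
	 *     rq->lock
	 *
	 * try_to_wake_up() takes p->pi_lock first and may then take a
	 * runqueue lock, so doing a wakeup while already holding rq->lock
	 * (which is where the cpufreq hook runs) inverts that order.
	 */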

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection
  2016-02-23  1:22 ` [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection Steve Muckle
  2016-02-25  3:55   ` Rafael J. Wysocki
@ 2016-03-03 14:21   ` Ingo Molnar
  1 sibling, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2016-03-03 14:21 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette, Ricky Liang


* Steve Muckle <steve.muckle@linaro.org> wrote:

> From: Michael Turquette <mturquette@baylibre.com>
> 
> Scheduler-driven CPU frequency selection hopes to exploit both
> per-task and global information in the scheduler to improve frequency
> selection policy, achieving lower power consumption, improved
> responsiveness/performance, and less reliance on heuristics and
> tunables. For further discussion on the motivation of this integration
> see [0].
> 
> This patch implements a shim layer between the Linux scheduler and the
> cpufreq subsystem. The interface accepts capacity requests from the
> CFS, RT and deadline sched classes. The requests from each sched class
> are summed on each CPU with a margin applied to the CFS and RT
> capacity requests to provide some headroom. Deadline requests are
> expected to be precise enough given their nature to not require
> headroom. The maximum total capacity request for a CPU in a frequency
> domain drives the requested frequency for that domain.
> 
> Policy is determined by both the sched classes and this shim layer.
> 
> Note that this algorithm is event-driven. There is no polling loop to
> check cpu idle time nor any other method which is unsynchronized with
> the scheduler, aside from an optional throttling mechanism.
> 
> Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
> code and test results, and to Ricky Liang <jcliang@chromium.org>
> for initialization and static key inc/dec fixes.
> 
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
> 
> [smuckle@linaro.org: various additions and fixes, revised commit text]
> 
> CC: Ricky Liang <jcliang@chromium.org>
> Signed-off-by: Michael Turquette <mturquette@baylibre.com>
> Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  drivers/cpufreq/Kconfig      |  21 ++
>  include/linux/cpufreq.h      |   3 +
>  include/linux/sched.h        |   8 +
>  kernel/sched/Makefile        |   1 +
>  kernel/sched/cpufreq_sched.c | 459 +++++++++++++++++++++++++++++++++++++++++++

Please rename this to kernel/sched/cpufreq.c - no need to say 'sched' twice! :-)

>  kernel/sched/sched.h         |  51 +++++
>  6 files changed, 543 insertions(+)
>  create mode 100644 kernel/sched/cpufreq_sched.c

So I really like how you push all high level code into kernel/sched/cpufreq.c and 
use the cpufreq drivers only for actual low level frequency switching.

It would be nice to converge this code with the code from Rafael:

   [PATCH 0/6] cpufreq: schedutil governor

i.e. use scheduler internal metrics within the scheduler, and create a clear 
interface between low level cpufreq drivers and the cpufreq code living in the 
scheduler.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection
  2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
                   ` (10 preceding siblings ...)
  2016-02-23  1:33 ` [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
@ 2016-03-30  0:45 ` Yuyang Du
  2016-03-31  1:35   ` Steve Muckle
  11 siblings, 1 reply; 51+ messages in thread
From: Yuyang Du @ 2016-03-30  0:45 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette

Hi Steve,

On Mon, Feb 22, 2016 at 05:22:40PM -0800, Steve Muckle wrote:
> The number of times the busy
> duration exceeds the period of the periodic workload (an "overrun") is
> also recorded.

Could you please explain more about overrun?

> SCHED_OTHER workload:
>  wload parameters	  ondemand        interactive     sched	
> run	period	loops	OR	OH	OR	OH	OR	OH
> 1	100	100	0	62.07%	0	100.02%	0	78.49%
> 10	1000	10	0	21.80%	0	22.74%	0	72.56%
> 1	10	1000	0	21.72%	0	63.08%	0	52.40%
> 10	100	100	0	8.09%	0	15.53%	0	17.33%
> 100	1000	10	0	1.83%	0	1.77%	0	0.29%
> 6	33	300	0	15.32%	0	8.60%	0	17.34%
> 66	333	30	0	0.79%	0	3.18%	0	12.26%
> 4	10	1000	0	5.87%	0	10.21%	0	6.15%
> 40	100	100	0	0.41%	0	0.04%	0	2.68%
> 400	1000	10	0	0.42%	0	0.50%	0	1.22%
> 5	9	1000	2	3.82%	1	6.10%	0	2.51%
> 50	90	100	0	0.19%	0	0.05%	0	1.71%
> 500	900	10	0	0.37%	0	0.38%	0	1.82%
> 9	12	1000	6	1.79%	1	0.77%	0	0.26%
> 90	120	100	0	0.16%	1	0.05%	0	0.49%
> 900	1200	10	0	0.09%	0	0.26%	0	0.62%
 
Could you please also explain what we can learn from the wload vs. OH/OR
results?

Or if you have already explained these questions before, could you please
point to the links? Thank you.

Yuyang

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection
  2016-03-31  1:35   ` Steve Muckle
@ 2016-03-30 20:22     ` Yuyang Du
  0 siblings, 0 replies; 51+ messages in thread
From: Yuyang Du @ 2016-03-30 20:22 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette

Hi Steve,

On Wed, Mar 30, 2016 at 06:35:23PM -0700, Steve Muckle wrote:
> This series was dropped in favor of Rafael's schedutil. But on the
> chance that you're still curious about the test setup used to quantify
> the series I'll explain below.
 
I will catch up and learn both.

> These results are meant to show how the governors perform across varying
> workload intensities and periodicities. Higher overhead (OH) numbers
> indicate that the completion times of each period of the workload were
> closer to what they would be when run at fmin (100% overhead would be as
> slow as fmin, 0% overhead would be as fast as fmax). And as described
> above, overruns (OR) indicate that the governor was not responsive
> enough to finish the work in each period of the workload.
> 
> These are just performance metrics so they only tell half the story.
> Power is not factored in at all.
> 
> This provides a quick sanity check that the governor under test (in this
> case, the now defunct schedfreq, or sched for short) performs similarly
> to two of the most commonly used governors, ondemand and interactive, in
> steady-state periodic workloads. In the data above, sched looks good for
> the most part, with the second test case being the biggest exception.
 
Yes, it is indeed a quick sanity check.

Thanks,
Yuyang

* Re: [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection
  2016-03-30  0:45 ` Yuyang Du
@ 2016-03-31  1:35   ` Steve Muckle
  2016-03-30 20:22     ` Yuyang Du
  0 siblings, 1 reply; 51+ messages in thread
From: Steve Muckle @ 2016-03-31  1:35 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Ingo Molnar, Rafael J. Wysocki, linux-kernel,
	linux-pm, Vincent Guittot, Morten Rasmussen, Dietmar Eggemann,
	Juri Lelli, Patrick Bellasi, Michael Turquette

Hi Yuyang,

This series was dropped in favor of Rafael's schedutil. But on the
chance that you're still curious about the test setup used to quantify
the series, I'll explain below.

On 03/29/2016 05:45 PM, Yuyang Du wrote:
> Hi Steve,
> 
> On Mon, Feb 22, 2016 at 05:22:40PM -0800, Steve Muckle wrote:
>> The number of times the busy
>> duration exceeds the period of the periodic workload (an "overrun") is
>> also recorded.
> 
> Could you please explain a bit more about the overrun metric?

Each of the 16 workloads is periodic. The period of the workload may not
be long enough to fit the busy ("run") duration at lower CPU
frequencies. If the governor has not raised the CPU frequency high
enough, the busy duration will exceed the period of the workload. This
is an "overrun" and in this synthetic workload represents a deadline
being missed.
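
To make that concrete, here is a minimal sketch of the overrun
bookkeeping (illustrative code only, not the actual rt-app/test harness;
all names are made up here):

#include <stdio.h>

/*
 * Count an "overrun": a period whose measured busy ("run") duration
 * did not fit within the period itself, i.e. a missed deadline in
 * this synthetic workload.
 */
struct period_sample {
	double busy_ms;		/* measured busy duration of one period */
	double period_ms;	/* nominal period of the workload */
};

static int count_overruns(const struct period_sample *s, int n)
{
	int overruns = 0;

	for (int i = 0; i < n; i++) {
		if (s[i].busy_ms > s[i].period_ms)
			overruns++;
	}
	return overruns;
}

int main(void)
{
	/* e.g. the run=9/period=12 case: the second period overruns */
	struct period_sample samples[] = {
		{ 9.3, 12.0 }, { 12.6, 12.0 }, { 11.1, 12.0 },
	};
	int n = sizeof(samples) / sizeof(samples[0]);

	printf("OR = %d\n", count_overruns(samples, n));
	return 0;
}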

>> SCHED_OTHER workload:
>>  wload parameters	  ondemand        interactive     sched	
>> run	period	loops	OR	OH	OR	OH	OR	OH
>> 1	100	100	0	62.07%	0	100.02%	0	78.49%
>> 10	1000	10	0	21.80%	0	22.74%	0	72.56%
>> 1	10	1000	0	21.72%	0	63.08%	0	52.40%
>> 10	100	100	0	8.09%	0	15.53%	0	17.33%
>> 100	1000	10	0	1.83%	0	1.77%	0	0.29%
>> 6	33	300	0	15.32%	0	8.60%	0	17.34%
>> 66	333	30	0	0.79%	0	3.18%	0	12.26%
>> 4	10	1000	0	5.87%	0	10.21%	0	6.15%
>> 40	100	100	0	0.41%	0	0.04%	0	2.68%
>> 400	1000	10	0	0.42%	0	0.50%	0	1.22%
>> 5	9	1000	2	3.82%	1	6.10%	0	2.51%
>> 50	90	100	0	0.19%	0	0.05%	0	1.71%
>> 500	900	10	0	0.37%	0	0.38%	0	1.82%
>> 9	12	1000	6	1.79%	1	0.77%	0	0.26%
>> 90	120	100	0	0.16%	1	0.05%	0	0.49%
>> 900	1200	10	0	0.09%	0	0.26%	0	0.62%
>  
> Could you please also explain what we can learn from the wload vs. OH/OR
> results?

These results are meant to show how the governors perform across varying
workload intensities and periodicities. Higher overhead (OH) numbers
indicate that the completion times of each period of the workload were
closer to what they would be when run at fmin (100% overhead would be as
slow as fmin, 0% overhead would be as fast as fmax). And as described
above, overruns (OR) indicate that the governor was not responsive
enough to finish the work in each period of the workload.
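
Put differently, OH is the measured completion time normalized linearly
between what it would be at fmax (0%) and at fmin (100%). A minimal
sketch of that calculation (illustrative names only, not the actual
harness code):

#include <stdio.h>

/*
 * Overhead as a percentage: 0% means the busy duration matched the
 * fmax baseline, 100% means it matched the fmin baseline.
 */
static double overhead_pct(double busy_test_ms, double busy_fmax_ms,
			   double busy_fmin_ms)
{
	return 100.0 * (busy_test_ms - busy_fmax_ms) /
		       (busy_fmin_ms - busy_fmax_ms);
}

int main(void)
{
	/* e.g. work taking 10ms at fmax and 90ms at fmin, 12ms under test */
	printf("OH = %.2f%%\n", overhead_pct(12.0, 10.0, 90.0));
	return 0;
}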

These are just performance metrics so they only tell half the story.
Power is not factored in at all.

This provides a quick sanity check that the governor under test (in this
case, the now defunct schedfreq, or sched for short) performs similarly
to two of the most commonly used governors, ondemand and interactive, in
steady-state periodic workloads. In the data above, sched looks good for
the most part, with the second test case being the biggest exception.

thanks,
Steve

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
2016-02-23  1:22 [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
2016-02-23  1:22 ` [RFCv7 PATCH 01/10] sched: Compute cpu capacity available at current frequency Steve Muckle
2016-02-23  1:41   ` Rafael J. Wysocki
2016-02-23  9:19     ` Peter Zijlstra
2016-02-26  1:37       ` Rafael J. Wysocki
2016-02-26  9:14         ` Peter Zijlstra
2016-02-23  1:22 ` [RFCv7 PATCH 02/10] cpufreq: introduce cpufreq_driver_is_slow Steve Muckle
2016-02-23  1:31   ` Rafael J. Wysocki
2016-02-26  0:50     ` Michael Turquette
2016-02-26  0:50       ` Michael Turquette
2016-02-26  1:07       ` Steve Muckle
2016-02-26  1:16       ` Rafael J. Wysocki
     [not found]         ` <20160226185503.2278.20479@quark.deferred.io>
2016-02-26 21:00           ` Rafael J. Wysocki
2016-02-23  1:22 ` [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection Steve Muckle
2016-02-25  3:55   ` Rafael J. Wysocki
2016-02-25  9:21     ` Peter Zijlstra
2016-02-25 21:04       ` Rafael J. Wysocki
2016-02-25  9:28     ` Peter Zijlstra
2016-02-25 21:08       ` Rafael J. Wysocki
2016-02-26  9:18         ` Peter Zijlstra
2016-02-27  0:08           ` Rafael J. Wysocki
2016-03-01 12:57             ` Peter Zijlstra
2016-03-01 19:44               ` Rafael J. Wysocki
2016-02-25 11:04     ` Rafael J. Wysocki
2016-02-26  0:34     ` Steve Muckle
2016-02-27  2:39       ` Rafael J. Wysocki
2016-02-27  4:17         ` Steve Muckle
2016-02-28  2:26           ` Rafael J. Wysocki
2016-03-01 14:31             ` Peter Zijlstra
2016-03-01 20:32               ` Rafael J. Wysocki
2016-03-01 13:26           ` Peter Zijlstra
2016-03-01 13:17       ` Peter Zijlstra
2016-03-02  7:49       ` Michael Turquette
2016-03-03  2:49         ` Rafael J. Wysocki
2016-03-03  3:50           ` Steve Muckle
2016-03-03  9:34             ` Juri Lelli
2016-03-03 13:03         ` Peter Zijlstra
2016-03-03 14:21   ` Ingo Molnar
2016-02-23  1:22 ` [RFCv7 PATCH 04/10] sched/fair: add triggers for OPP change requests Steve Muckle
2016-03-01  6:51   ` Ricky Liang
2016-03-03  3:55     ` Steve Muckle
2016-02-23  1:22 ` [RFCv7 PATCH 05/10] sched/{core,fair}: trigger OPP change request on fork() Steve Muckle
2016-02-23  1:22 ` [RFCv7 PATCH 06/10] sched/fair: cpufreq_sched triggers for load balancing Steve Muckle
2016-02-23  1:22 ` [RFCv7 PATCH 07/10] sched/fair: jump to max OPP when crossing UP threshold Steve Muckle
2016-02-23  1:22 ` [RFCv7 PATCH 08/10] sched: remove call of sched_avg_update from sched_rt_avg_update Steve Muckle
2016-02-23  1:22 ` [RFCv7 PATCH 09/10] sched/deadline: split rt_avg in 2 distincts metrics Steve Muckle
2016-02-23  1:22 ` [RFCv7 PATCH 10/10] sched: rt scheduler sets capacity requirement Steve Muckle
2016-02-23  1:33 ` [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection Steve Muckle
2016-03-30  0:45 ` Yuyang Du
2016-03-31  1:35   ` Steve Muckle
2016-03-30 20:22     ` Yuyang Du
