linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/7] scheduler-driven cpu frequency scaling
@ 2014-10-22  6:07 Mike Turquette
  2014-10-22  6:07 ` [PATCH RFC 1/7] sched: Make energy awareness a sched feature Mike Turquette
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Mike Turquette @ 2014-10-22  6:07 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria, Mike Turquette

This series demonstrates cpu frequency scaling via a simple policy
driven by the scheduler. Specifically, the policy evaluates cpu frequency
when cpu utilization is updated from enqueue_task_fair and
dequeue_task_fair. The policy itself uses a simple up/down threshold
scheme based on the same 80%/20% cpu utilization boundaries that are
used by default in the ondemand cpufreq governor.

This series is not intended for merging, but instead to ignite some
discussion around scheduler-driven cpu frequency selection. Of
particular interest to me is the policy itself and how it might
integrate with task placement in CFS's load_balance. Additionally I'd
like to ask the scheduler experts about which call sites in CFS are
right for evaluating cpu frequency selection; maybe
{en,de}queue_task_fair are not such a good idea?

The messiest part of this series is the cpumask stuff, where I tried to
track which cpus have updated statistics in the case of a sched_entity
which contains several other sched_entities that are spread across cpus.
As discussed at Linux Plumbers Conference 2014, I will replace this
complexity with simpler logic that ignores scheduler cgroups in the next
version. In any case I am posting the code I have now.

This code is experimental and bugs are Guaranteed.

These patches are based on the scale invariance series from Morten[0].
The variable names in this RFC will doubtless change once that work is
rebased onto Vincent's series[1].

[0] http://lkml.kernel.org/r/<1411403047-32010-1-git-send-email-morten.rasmussen@arm.com>
[1] http://lkml.kernel.org/r/<1412684017-16595-1-git-send-email-vincent.guittot@linaro.org>

Mike Turquette (6):
  sched: cfs: declare capacity_of in sched.h
  sched: fair: add usage_util_of helper
  cpufreq: add per-governor private data
  sched: cfs: cpu frequency scaling arch functions
  sched: cfs: cpu frequency scaling based on task placement
  sched: energy_model: simple cpu frequency scaling policy

Morten Rasmussen (1):
  sched: Make energy awareness a sched feature

 drivers/cpufreq/Kconfig     |  21 +++
 include/linux/cpufreq.h     |   6 +
 kernel/sched/Makefile       |   1 +
 kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c         |  69 ++++++++-
 kernel/sched/features.h     |   6 +
 kernel/sched/sched.h        |   3 +
 7 files changed, 445 insertions(+), 2 deletions(-)
 create mode 100644 kernel/sched/energy_model.c

-- 
1.8.3.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH RFC 1/7] sched: Make energy awareness a sched feature
  2014-10-22  6:07 [PATCH RFC 0/7] scheduler-driven cpu frequency scaling Mike Turquette
@ 2014-10-22  6:07 ` Mike Turquette
  2014-10-22  6:07 ` [PATCH RFC 2/7] sched: cfs: declare capacity_of in sched.h Mike Turquette
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mike Turquette @ 2014-10-22  6:07 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria, Morten Rasmussen, Mike Turquette

From: Morten Rasmussen <morten.rasmussen@arm.com>

This patch introduces the ENERGY_AWARE sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set to false when SCHED_DEBUG is not defined, so energy
awareness cannot be enabled without SCHED_DEBUG. This sched_feature
knob will be replaced later with a more appropriate control knob when
things have matured a bit.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
[mturquette@linaro.org: moved energy_aware above enqueue_task_fair]
---
 kernel/sched/fair.c     | 5 +++++
 kernel/sched/features.h | 6 ++++++
 2 files changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6738160..90b36cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3978,6 +3978,11 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static inline bool energy_aware(void)
+{
+	return sched_feat(ENERGY_AWARE);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d1..199ee3a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,3 +83,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
  */
 SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
+
+/*
+ * Energy aware scheduling. Use platform energy model to guide scheduling
+ * decisions optimizing for energy efficiency.
+ */
+SCHED_FEAT(ENERGY_AWARE, false)
-- 
1.8.3.2



* [PATCH RFC 2/7] sched: cfs: declare capacity_of in sched.h
  2014-10-22  6:07 [PATCH RFC 0/7] scheduler-driven cpu frequency scaling Mike Turquette
  2014-10-22  6:07 ` [PATCH RFC 1/7] sched: Make energy awareness a sched feature Mike Turquette
@ 2014-10-22  6:07 ` Mike Turquette
  2014-10-22  6:07 ` [PATCH RFC 3/7] sched: fair: add usage_util_of helper Mike Turquette
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mike Turquette @ 2014-10-22  6:07 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria, Mike Turquette

capacity_of is useful for cpu frequency scaling policies. Share it via
sched.h so that selectable cpu frequency scaling policies can make use
of it.

Signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c  | 7 +++++--
 kernel/sched/sched.h | 2 ++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 90b36cc..15f5638 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1018,7 +1018,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
 /* Cached statistics for all CPUs within a node */
@@ -2056,6 +2055,10 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SMP
+unsigned long capacity_of(int cpu);
+#endif /* CONFIG_SMP */
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -4132,7 +4135,7 @@ static unsigned long target_load(int cpu, int type)
 	return max(rq->cpu_load[type-1], total);
 }
 
-static unsigned long capacity_of(int cpu)
+unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04940f8..9a28d38 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -309,6 +309,8 @@ struct cfs_bandwidth { };
 
 #endif	/* CONFIG_CGROUP_SCHED */
 
+extern unsigned long capacity_of(int cpu);
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-- 
1.8.3.2



* [PATCH RFC 3/7] sched: fair: add usage_util_of helper
  2014-10-22  6:07 [PATCH RFC 0/7] scheduler-driven cpu frequency scaling Mike Turquette
  2014-10-22  6:07 ` [PATCH RFC 1/7] sched: Make energy awareness a sched feature Mike Turquette
  2014-10-22  6:07 ` [PATCH RFC 2/7] sched: cfs: declare capacity_of in sched.h Mike Turquette
@ 2014-10-22  6:07 ` Mike Turquette
  2014-10-22  6:07 ` [PATCH RFC 4/7] cpufreq: add per-governor private data Mike Turquette
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Mike Turquette @ 2014-10-22  6:07 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria, Mike Turquette

Signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c  | 6 ++++++
 kernel/sched/sched.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 15f5638..0930ad8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2057,6 +2057,7 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 
 #ifdef CONFIG_SMP
 unsigned long capacity_of(int cpu);
+unsigned long usage_util_of(int cpu);
 #endif /* CONFIG_SMP */
 
 static void
@@ -4140,6 +4141,11 @@ unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+unsigned long usage_util_of(int cpu)
+{
+	return cpu_rq(cpu)->cfs.usage_util_avg;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a28d38..c34cbfc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -310,6 +310,7 @@ struct cfs_bandwidth { };
 #endif	/* CONFIG_CGROUP_SCHED */
 
 extern unsigned long capacity_of(int cpu);
+extern unsigned long usage_util_of(int cpu);
 
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
-- 
1.8.3.2



* [PATCH RFC 4/7] cpufreq: add per-governor private data
  2014-10-22  6:07 [PATCH RFC 0/7] scheduler-driven cpu frequency scaling Mike Turquette
                   ` (2 preceding siblings ...)
  2014-10-22  6:07 ` [PATCH RFC 3/7] sched: fair: add usage_util_of helper Mike Turquette
@ 2014-10-22  6:07 ` Mike Turquette
  2014-10-22  6:26   ` Viresh Kumar
  2014-10-22  6:07 ` [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions Mike Turquette
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Mike Turquette @ 2014-10-22  6:07 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria, Mike Turquette, Viresh Kumar, Rafael J. Wysocki

Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 include/linux/cpufreq.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 138336b..91d173c 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -115,6 +115,9 @@ struct cpufreq_policy {
 
 	/* For cpufreq driver's internal use */
 	void			*driver_data;
+
+	/* For cpufreq governor's internal use */
+	void			*gov_data;
 };
 
 /* Only for ACPI */
-- 
1.8.3.2



* [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions
  2014-10-22  6:07 [PATCH RFC 0/7] scheduler-driven cpu frequency scaling Mike Turquette
                   ` (3 preceding siblings ...)
  2014-10-22  6:07 ` [PATCH RFC 4/7] cpufreq: add per-governor private data Mike Turquette
@ 2014-10-22  6:07 ` Mike Turquette
  2014-10-22 20:06   ` Rik van Riel
  2014-10-22  6:07 ` [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement Mike Turquette
  2014-10-22  6:07 ` [PATCH RFC 7/7] sched: energy_model: simple cpu frequency scaling policy Mike Turquette
  6 siblings, 1 reply; 21+ messages in thread
From: Mike Turquette @ 2014-10-22  6:07 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria, Mike Turquette

arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the
scheduler to evaluate if cpu frequency should change and to invoke that
change from a safe context.

They are weakly defined arch functions that do nothing by default. A
CPUfreq governor could use these functions to implement a frequency
scaling policy based on updates to per-task statistics or updates to
per-cpu utilization.

As discussed at Linux Plumbers Conference 2014, the goal will be to
focus on a single cpu frequency scaling policy that works for everyone.
That may mean that the weak arch functions definitions can be removed
entirely and a single policy implements that logic for all
architectures.

Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0930ad8..1af6f6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2265,6 +2265,8 @@ static u32 __compute_runnable_contrib(u64 n)
 }
 
 unsigned long arch_scale_load_capacity(int cpu);
+void arch_eval_cpu_freq(struct cpumask *cpus);
+void arch_scale_cpu_freq(void);
 
 /*
  * We can represent the historical contribution to runnable average as the
@@ -5805,6 +5807,16 @@ unsigned long __weak arch_scale_load_capacity(int cpu)
 	return default_scale_load_capacity(cpu);
 }
 
+void __weak arch_eval_cpu_freq(struct cpumask *cpus)
+{
+	return;
+}
+
+void __weak arch_scale_cpu_freq(void)
+{
+	return;
+}
+
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-- 
1.8.3.2



* [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement
  2014-10-22  6:07 [PATCH RFC 0/7] scheduler-driven cpu frequency scaling Mike Turquette
                   ` (4 preceding siblings ...)
  2014-10-22  6:07 ` [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions Mike Turquette
@ 2014-10-22  6:07 ` Mike Turquette
  2014-10-23  4:03   ` Preeti U Murthy
                     ` (3 more replies)
  2014-10-22  6:07 ` [PATCH RFC 7/7] sched: energy_model: simple cpu frequency scaling policy Mike Turquette
  6 siblings, 4 replies; 21+ messages in thread
From: Mike Turquette @ 2014-10-22  6:07 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria, Mike Turquette

{en,de}queue_task_fair are updated to track which cpus will have changed
utilization values as a function of task queueing. The affected cpus are
passed on to arch_eval_cpu_freq for further machine-specific processing
based on a selectable policy.

arch_scale_cpu_freq is called from run_rebalance_domains as a way to
kick off the scaling process (via wake_up_process), so as to prevent
re-entering the {en,de}queue code.

All of the call sites in this patch are up for discussion. Does it make
sense to track which cpus have updated statistics in enqueue_task_fair?
I chose this because I wanted to gather statistics for all cpus affected
in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14, the
next version of this patch will focus on the simpler case of not using
scheduler cgroups, which should remove a good chunk of this code,
including the cpumask stuff.

Also discussed at LPC14 is the fact that load_balance is a very
interesting place to do this, as frequency can be considered in concert
with task placement. Please put forth any ideas on a sensible way to do
this.

Is run_rebalance_domains a logical place to change cpu frequency? What
other call sites make sense?

Even for platforms that can target a cpu frequency without sleeping
(x86, some ARM platforms with PM microcontrollers) it is currently
necessary to always kick the frequency target work out into a kthread.
This is because of the rwsem usage in the cpufreq core, which might
sleep. Replacing that lock type is probably a good idea.

Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1af6f6d..3619f63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
 
@@ -4049,6 +4067,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -4089,12 +4110,27 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track runqueues/cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		sub_nr_running(rq, 1);
 		update_rq_runnable_avg(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
 
@@ -7536,6 +7572,9 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 * stopped.
 	 */
 	nohz_idle_balance(this_rq, idle);
+
+	if (energy_aware())
+		arch_scale_cpu_freq();
 }
 
 /*
-- 
1.8.3.2



* [PATCH RFC 7/7] sched: energy_model: simple cpu frequency scaling policy
  2014-10-22  6:07 [PATCH RFC 0/7] scheduler-driven cpu frequency scaling Mike Turquette
                   ` (5 preceding siblings ...)
  2014-10-22  6:07 ` [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement Mike Turquette
@ 2014-10-22  6:07 ` Mike Turquette
  2014-10-27 19:43   ` Dietmar Eggemann
  2014-10-28 14:27   ` Peter Zijlstra
  6 siblings, 2 replies; 21+ messages in thread
From: Mike Turquette @ 2014-10-22  6:07 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria, Mike Turquette

Building on top of the scale invariant capacity patches and earlier
patches in this series that prepare CFS for scaling cpu frequency, this
patch implements a simple, naive ondemand-like cpu frequency scaling
policy that is driven by enqueue_task_fair and dequeue_task_fair. This
new policy is named "energy_model" as an homage to the on-going work in
that area. It is NOT an actual energy model.

This policy is implemented using the CPUfreq governor interface for two
main reasons:

1) re-using the CPUfreq machine drivers without using the governor
interface is hard. I do not foresee any issue continuing to use the
governor interface going forward but it is worth making clear what this
patch does up front.

2) using the CPUfreq interface allows us to switch between the
energy_model governor and other CPUfreq governors (such as ondemand) at
run-time. This is very useful for comparative testing and tuning.

A caveat to #2 above is that the weak arch function used by the governor
means that only one scheduler-driven policy can be linked at a time.
This limitation does not apply to "traditional" governors. I raised this
in my previous capacity_ops patches[0] but as discussed at LPC14 last
week, it seems desirable to pursue a single cpu frequency scaling policy
at first, and try to make that work for everyone interested in using it.
If that model breaks down then we can revisit the idea of dynamic
selection of scheduler-driven cpu frequency scaling.

Unlike legacy CPUfreq governors, this policy does not implement its own
logic loop (such as a workqueue triggered by a timer), but instead uses
an event-driven design. Frequency is evaluated by entering
{en,de}queue_task_fair and then a kthread is woken from
run_rebalance_domains which scales cpu frequency based on the latest
evaluation.

The policy implemented in this patch takes the highest cpu utilization
from policy->cpus and uses that to select a frequency target based on
the same 80%/20% thresholds used as defaults in ondemand. Frequency-scaled
thresholds are pre-computed when energy_model inits. The frequency
selection is a simple comparison of cpu utilization (as defined in
Morten's latest RFC) to the threshold values. In the future this logic
could be replaced with something more sophisticated that uses PELT to
get a historical overview. Ideas are welcome.

Note that the pre-computed thresholds above do not take into account
micro-architecture differences (SMT or big.LITTLE hardware), only
frequency invariance.

Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 drivers/cpufreq/Kconfig     |  21 +++
 include/linux/cpufreq.h     |   3 +
 kernel/sched/Makefile       |   1 +
 kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 366 insertions(+)
 create mode 100644 kernel/sched/energy_model.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 22b42d5..78a2caa 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the
 	  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
+	bool "energy_model"
+	select CPU_FREQ_GOV_ENERGY_MODEL
+	select CPU_FREQ_GOV_PERFORMANCE
+	help
+	  Use the CPUfreq governor 'energy_model' as default. This
+	  scales cpu frequency from the scheduler as per-task statistics
+	  are updated.
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,18 @@ config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_ENERGY_MODEL
+	tristate "'energy model' cpufreq governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  'energy_model' - this governor scales cpu frequency from the
+	  scheduler as a function of cpu utilization. It does not
+	  evaluate utilization on a periodic basis (unlike ondemand) but
+	  instead is invoked from CFS when updating per-task statistics.
+
+	  If in doubt, say N.
+
 config CPUFREQ_GENERIC
 	tristate "Generic cpufreq driver"
 	depends on HAVE_CLK && OF
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 91d173c..69cbbec 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -482,6 +482,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL)
+extern struct cpufreq_governor cpufreq_gov_energy_model;
+#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_energy_model)
 #endif
 
 /*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index ab32b7b..7cd404c 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_ENERGY_MODEL) += energy_model.o
diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c
new file mode 100644
index 0000000..5cdea9a
--- /dev/null
+++ b/kernel/sched/energy_model.c
@@ -0,0 +1,341 @@
+/*
+ *  Copyright (C)  2014 Michael Turquette <mturquette@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+
+#include "sched.h"
+
+#define THROTTLE_MSEC		50
+#define UP_THRESHOLD		80
+#define DOWN_THRESHOLD		20
+
+/**
+ * em_data - per-policy data used by energy_model
+ * @throttle: bail if current time is less than ktime_throttle.
+ * 		    Derived from THROTTLE_MSEC
+ * @up_threshold:   table of normalized capacity states to determine if cpu
+ * 		    should run faster. Derived from UP_THRESHOLD
+ * @down_threshold: table of normalized capacity states to determine if cpu
+ * 		    should run slower. Derived from DOWN_THRESHOLD
+ *
+ * struct em_data is the per-policy energy_model-specific data structure. A
+ * per-policy instance of it is created when the energy_model governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct em_data {
+	/* per-policy throttling */
+	ktime_t throttle;
+	unsigned int *up_threshold;
+	unsigned int *down_threshold;
+	struct task_struct *task;
+	atomic_long_t target_freq;
+	atomic_t need_wake_task;
+};
+
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears all of the data structures down and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int energy_model_thread(void *data)
+{
+	struct sched_param param;
+	struct cpufreq_policy *policy;
+	struct em_data *em;
+	int ret;
+
+	policy = (struct cpufreq_policy *) data;
+	if (!policy) {
+		pr_warn("%s: missing policy\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	em = policy->gov_data;
+	if (!em) {
+		pr_warn("%s: missing governor data\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	param.sched_priority = 0;
+	sched_setscheduler(current, SCHED_FIFO, &param);
+
+
+	do {
+		down_write(&policy->rwsem);
+		if (!atomic_read(&em->need_wake_task))  {
+			up_write(&policy->rwsem);
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+			continue;
+		}
+
+		ret = __cpufreq_driver_target(policy, atomic_read(&em->target_freq),
+				CPUFREQ_RELATION_H);
+		if (ret)
+			pr_debug("%s: __cpufreq_driver_target returned %d\n",
+					__func__, ret);
+
+		em->throttle = ktime_get();
+		atomic_set(&em->need_wake_task, 0);
+		up_write(&policy->rwsem);
+	} while (!kthread_should_stop());
+
+	do_exit(0);
+}
+
+static void em_wake_up_process(struct task_struct *task)
+{
+	/* this is null during early boot */
+	if (IS_ERR_OR_NULL(task)) {
+		return;
+	}
+
+	wake_up_process(task);
+}
+
+void arch_scale_cpu_freq(void)
+{
+	struct cpufreq_policy *policy;
+	struct em_data *em;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		policy = cpufreq_cpu_get(cpu);
+		if (IS_ERR_OR_NULL(policy))
+			continue;
+
+		em = policy->gov_data;
+		if (!em)
+			continue;
+
+		/*
+		 * FIXME replace the atomic stuff by holding write-locks
+		 * in arch_eval_cpu_freq?
+		 */
+		if (atomic_read(&em->need_wake_task)) {
+			em_wake_up_process(em->task);
+		}
+
+		cpufreq_cpu_put(policy);
+	}
+}
+
+/**
+ * arch_eval_cpu_freq - scale cpu frequency based on CFS utilization
+ * @update_cpus: mask of CPUs with updated utilization and capacity
+ *
+ * Declared and weakly defined in kernel/sched/fair.c. This definition overrides
+ * the default. In the case of CONFIG_FAIR_GROUP_SCHED, update_cpus may
+ * contain cpus that are not in the same policy. Otherwise update_cpus will be
+ * a single cpu.
+ *
+ * Holds read lock for policy->rw_sem.
+ *
+ * FIXME weak arch function means that only one definition of this function can
+ * be linked. How to support multiple energy model policies?
+ */
+void arch_eval_cpu_freq(struct cpumask *update_cpus)
+{
+	struct cpufreq_policy *policy;
+	struct em_data *em;
+	int index;
+	unsigned int cpu, tmp;
+	unsigned long percent_util = 0, max_util = 0, cap = 0, util = 0;
+
+	/*
+	 * In the case of CONFIG_FAIR_GROUP_SCHED, policy->cpus may be a subset
+	 * of update_cpus. In that case, take the first cpu in update_cpus, get
+	 * its policy and try to scale the affected cpus. Then we clear the
+	 * corresponding bits from update_cpus and try again. If a policy does
+	 * not exist for a cpu then we remove that bit as well, preventing an
+	 * infinite loop.
+	 */
+	while (!cpumask_empty(update_cpus)) {
+		percent_util = 0;
+		max_util = 0;
+		cap = 0;
+		util = 0;
+
+		cpu = cpumask_first(update_cpus);
+		policy = cpufreq_cpu_get(cpu);
+		if (IS_ERR_OR_NULL(policy)) {
+			cpumask_clear_cpu(cpu, update_cpus);
+			continue;
+		}
+
+		if (!policy->gov_data)
+			return;
+
+		em = policy->gov_data;
+
+		if (ktime_before(ktime_get(), em->throttle)) {
+			trace_printk("THROTTLED");
+			goto bail;
+		}
+
+		/*
+		 * try scaling cpus
+		 *
+		 * algorithm assumptions & description:
+		 * 	all cpus in a policy run at the same rate/capacity.
+		 * 	choose frequency target based on most utilized cpu.
+		 * 	do not care about aggregating cpu utilization.
+		 * 	do not track any historical trends beyond utilization
+		 * 	if max_util > 80% of current capacity,
+		 * 		go to max capacity
+		 * 	if max_util < 20% of current capacity,
+		 * 		go to the next lowest capacity
+		 * 	otherwise, stay at the same capacity state
+		 */
+		for_each_cpu(tmp, policy->cpus) {
+			util = usage_util_of(tmp);
+			if (util > max_util)
+				max_util = util;
+		}
+
+		cap = capacity_of(cpu);
+		if (!cap) {
+			goto bail;
+		}
+
+		index = cpufreq_frequency_table_get_index(policy, policy->cur);
+		if (max_util > em->up_threshold[index]) {
+			/* write em->target_freq with read lock held */
+			atomic_long_set(&em->target_freq, policy->max);
+			/*
+			 * FIXME this is gross. convert arch_eval_cpu_freq to
+			 * hold the write lock?
+			 */
+			atomic_set(&em->need_wake_task, 1);
+		} else if (max_util < em->down_threshold[index]) {
+			/* write em->target_freq with read lock held */
+			atomic_long_set(&em->target_freq, policy->cur - 1);
+			/*
+			 * FIXME this is gross. convert arch_eval_cpu_freq to
+			 * hold the write lock?
+			 */
+			atomic_set(&em->need_wake_task, 1);
+		}
+
+bail:
+		/* remove policy->cpus from update_cpus */
+		cpumask_andnot(update_cpus, update_cpus, policy->cpus);
+		cpufreq_cpu_put(policy);
+	}
+
+	return;
+}
+
+static void em_start(struct cpufreq_policy *policy)
+{
+	int index = 0, count = 0;
+	unsigned int capacity;
+	struct em_data *em;
+	struct cpufreq_frequency_table *pos;
+
+	/* prepare per-policy private data */
+	em = kzalloc(sizeof(*em), GFP_KERNEL);
+	if (!em) {
+		pr_debug("%s: failed to allocate private data\n", __func__);
+		return;
+	}
+
+	policy->gov_data = em;
+
+	/* how many entries in the frequency table? */
+	cpufreq_for_each_entry(pos, policy->freq_table)
+		count++;
+
+	/* pre-compute thresholds */
+	em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
+	em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
+
+	cpufreq_for_each_entry(pos, policy->freq_table) {
+		/* FIXME capacity below is not scaled for uarch */
+		capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
+		em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
+		em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
+		pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
+				__func__, cpumask_first(policy->cpus), index,
+				capacity, em->up_threshold[index],
+				em->down_threshold[index]);
+		index++;
+	}
+
+	/* init per-policy kthread */
+	em->task = kthread_create(energy_model_thread, policy, "kenergy_model_task");
+	if (IS_ERR_OR_NULL(em->task))
+		pr_err("%s: failed to create kenergy_model_task thread\n", __func__);
+}
+
+
+static void em_stop(struct cpufreq_policy *policy)
+{
+	struct em_data *em;
+
+	em = policy->gov_data;
+
+	kthread_stop(em->task);
+
+	/* replace with devm counterparts */
+	kfree(em->up_threshold);
+	kfree(em->down_threshold);
+	kfree(em);
+}
+
+static int energy_model_setup(struct cpufreq_policy *policy, unsigned int event)
+{
+	switch (event) {
+		case CPUFREQ_GOV_START:
+			/* Start managing the frequency */
+			em_start(policy);
+			return 0;
+
+		case CPUFREQ_GOV_STOP:
+			em_stop(policy);
+			return 0;
+
+		case CPUFREQ_GOV_LIMITS:	/* unused */
+		case CPUFREQ_GOV_POLICY_INIT:	/* unused */
+		case CPUFREQ_GOV_POLICY_EXIT:	/* unused */
+			break;
+	}
+	return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
+static
+#endif
+struct cpufreq_governor cpufreq_gov_energy_model = {
+	.name			= "energy_model",
+	.governor		= energy_model_setup,
+	.owner			= THIS_MODULE,
+};
+
+static int __init energy_model_init(void)
+{
+	return cpufreq_register_governor(&cpufreq_gov_energy_model);
+}
+
+static void __exit energy_model_exit(void)
+{
+	cpufreq_unregister_governor(&cpufreq_gov_energy_model);
+}
+
+/* Try to make this the default governor */
+fs_initcall(energy_model_init);
+module_exit(energy_model_exit);
+
+MODULE_LICENSE("GPL");
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 4/7] cpufreq: add per-governor private data
  2014-10-22  6:07 ` [PATCH RFC 4/7] cpufreq: add per-governor private data Mike Turquette
@ 2014-10-22  6:26   ` Viresh Kumar
  2014-10-22  6:35     ` Mike Turquette
  0 siblings, 1 reply; 21+ messages in thread
From: Viresh Kumar @ 2014-10-22  6:26 UTC (permalink / raw)
  To: Mike Turquette
  Cc: Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Preeti U Murthy, Morten Rasmussen, kamalesh, riel, efault,
	Nicolas Pitre, Lists linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann, Paul Turner, bsegall, Vincent Guittot,
	Patch Tracking, Tuukka Tikkanen, Amit Kucheria,
	Rafael J. Wysocki

On 22 October 2014 11:37, Mike Turquette <mturquette@linaro.org> wrote:
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> Signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
>  include/linux/cpufreq.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 138336b..91d173c 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -115,6 +115,9 @@ struct cpufreq_policy {
>
>         /* For cpufreq driver's internal use */
>         void                    *driver_data;
> +
> +       /* For cpufreq governor's internal use */
> +       void                    *gov_data;

It's already there: governor_data ...

Am I missing something ?


* Re: [PATCH RFC 4/7] cpufreq: add per-governor private data
  2014-10-22  6:26   ` Viresh Kumar
@ 2014-10-22  6:35     ` Mike Turquette
  0 siblings, 0 replies; 21+ messages in thread
From: Mike Turquette @ 2014-10-22  6:35 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List,
	Preeti U Murthy, Morten Rasmussen, kamalesh, riel, efault,
	Nicolas Pitre, Lists linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann, Paul Turner, Benjamin Segall, Vincent Guittot,
	Patch Tracking, Tuukka Tikkanen, Amit Kucheria,
	Rafael J. Wysocki

On Tue, Oct 21, 2014 at 11:26 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 22 October 2014 11:37, Mike Turquette <mturquette@linaro.org> wrote:
>> Cc: Viresh Kumar <viresh.kumar@linaro.org>
>> Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
>> Signed-off-by: Mike Turquette <mturquette@linaro.org>
>> ---
>>  include/linux/cpufreq.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
>> index 138336b..91d173c 100644
>> --- a/include/linux/cpufreq.h
>> +++ b/include/linux/cpufreq.h
>> @@ -115,6 +115,9 @@ struct cpufreq_policy {
>>
>>         /* For cpufreq driver's internal use */
>>         void                    *driver_data;
>> +
>> +       /* For cpufreq governor's internal use */
>> +       void                    *gov_data;
>
> Its already there: governor_data ..
>
> Am I missing something ?

Oops. That's what I get for hacking while jetlagged. Please disregard the noise.

Regards,
Mike


* Re: [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions
  2014-10-22  6:07 ` [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions Mike Turquette
@ 2014-10-22 20:06   ` Rik van Riel
  2014-10-22 23:20     ` Mike Turquette
  0 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2014-10-22 20:06 UTC (permalink / raw)
  To: Mike Turquette, peterz, mingo
  Cc: linux-kernel, preeti, Morten.Rasmussen, kamalesh, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/22/2014 02:07 AM, Mike Turquette wrote:
> arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the 
> scheduler to evaluate if cpu frequency should change and to invoke
> that change from a safe context.
> 
> They are weakly defined arch functions that do nothing by default.
> A CPUfreq governor could use these functions to implement a
> frequency scaling policy based on updates to per-task statistics or
> updates to per-cpu utilization.
> 
> As discussed at Linux Plumbers Conference 2014, the goal will be
> to focus on a single cpu frequency scaling policy that works for
> everyone. That may mean that the weak arch functions definitions
> can be removed entirely and a single policy implements that logic
> for all architectures.

On virtual machines, we probably want to use both frequency and
steal time to calculate the factor.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUSA5XAAoJEM553pKExN6DeRYH/jeXImjO2/WZFp82Yv6ukMxI
r8/kzrLMA+NS1XXCWYIcOiBqReEabkZZmypt21Tdnpkvi4GbZPpG0PEApSvOfqWE
w71J87cpMGV/e4uLcBDcvgHJX8RBQLO/ZqDcMm+zcSoeJ3G3NMK2YlZp3Uf8xqcB
tE2VGW7o2yEqNJL1fqYb++3upQmc10vIFqxVIJfP+TqZRyaVP+5kBqOMDTWb5qCV
qZjBKe1jDX5sLLGfY0ddAeuUH1iEJBIUMCcr027ezcqRp4YoqIrHRInHmNxEs5Az
9PN8N0yGgqhvkcCfXG7He+tQBHECOnjyQlrM/2K8Cw11RziwDkC/yYIp3DPgjxc=
=f/8V
-----END PGP SIGNATURE-----


* Re: [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions
  2014-10-22 20:06   ` Rik van Riel
@ 2014-10-22 23:20     ` Mike Turquette
  2014-10-23  1:42       ` Rik van Riel
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Turquette @ 2014-10-22 23:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Morten Rasmussen, kamalesh, efault, Nicolas Pitre, linaro-kernel,
	Daniel Lezcano, Dietmar Eggemann, Paul Turner, Benjamin Segall,
	Vincent Guittot, Patch Tracking, Tuukka Tikkanen, Amit Kucheria

On Wed, Oct 22, 2014 at 1:06 PM, Rik van Riel <riel@redhat.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 10/22/2014 02:07 AM, Mike Turquette wrote:
>> arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the
>> scheduler to evaluate if cpu frequency should change and to invoke
>> that change from a safe context.
>>
>> They are weakly defined arch functions that do nothing by default.
>> A CPUfreq governor could use these functions to implement a
>> frequency scaling policy based on updates to per-task statistics or
>> updates to per-cpu utilization.
>>
>> As discussed at Linux Plumbers Conference 2014, the goal will be
>> to focus on a single cpu frequency scaling policy that works for
>> everyone. That may mean that the weak arch functions definitions
>> can be removed entirely and a single policy implements that logic
>> for all architectures.
>
> On virtual machines, we probably want to use both frequency and
> steal time to calculate the factor.

You mean for calculating desired cpu frequency on a virtual guest? Is
that something we want to do?

Thanks,
Mike

>
> - --
> All rights reversed
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQEcBAEBAgAGBQJUSA5XAAoJEM553pKExN6DeRYH/jeXImjO2/WZFp82Yv6ukMxI
> r8/kzrLMA+NS1XXCWYIcOiBqReEabkZZmypt21Tdnpkvi4GbZPpG0PEApSvOfqWE
> w71J87cpMGV/e4uLcBDcvgHJX8RBQLO/ZqDcMm+zcSoeJ3G3NMK2YlZp3Uf8xqcB
> tE2VGW7o2yEqNJL1fqYb++3upQmc10vIFqxVIJfP+TqZRyaVP+5kBqOMDTWb5qCV
> qZjBKe1jDX5sLLGfY0ddAeuUH1iEJBIUMCcr027ezcqRp4YoqIrHRInHmNxEs5Az
> 9PN8N0yGgqhvkcCfXG7He+tQBHECOnjyQlrM/2K8Cw11RziwDkC/yYIp3DPgjxc=
> =f/8V
> -----END PGP SIGNATURE-----


* Re: [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions
  2014-10-22 23:20     ` Mike Turquette
@ 2014-10-23  1:42       ` Rik van Riel
  2014-10-23  2:12         ` Mike Galbraith
  0 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2014-10-23  1:42 UTC (permalink / raw)
  To: Mike Turquette
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Morten Rasmussen, kamalesh, efault, Nicolas Pitre, linaro-kernel,
	Daniel Lezcano, Dietmar Eggemann, Paul Turner, Benjamin Segall,
	Vincent Guittot, Patch Tracking, Tuukka Tikkanen, Amit Kucheria

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/22/2014 07:20 PM, Mike Turquette wrote:
> On Wed, Oct 22, 2014 at 1:06 PM, Rik van Riel <riel@redhat.com>
> wrote: On 10/22/2014 02:07 AM, Mike Turquette wrote:
>>>> arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow
>>>> the scheduler to evaluate if cpu frequency should change and
>>>> to invoke that change from a safe context.
>>>> 
>>>> They are weakly defined arch functions that do nothing by
>>>> default. A CPUfreq governor could use these functions to
>>>> implement a frequency scaling policy based on updates to
>>>> per-task statistics or updates to per-cpu utilization.
>>>> 
>>>> As discussed at Linux Plumbers Conference 2014, the goal will
>>>> be to focus on a single cpu frequency scaling policy that
>>>> works for everyone. That may mean that the weak arch
>>>> functions definitions can be removed entirely and a single
>>>> policy implements that logic for all architectures.
> 
> On virtual machines, we probably want to use both frequency and 
> steal time to calculate the factor.
> 
>> You mean for calculating desired cpu frequency on a virtual
>> guest? Is that something we want to do?

A guest will be unable to set the cpu frequency, but it should
know what the frequency is, so it can take the capacity of each
CPU into account when doing things like load balancing.

This has little impact on this patch series, the impact is more
in the load balancer, which can see how much compute capacity is
available on each CPU, and adjust the load accordingly.

I have seen some code come by that adjusts each cpu's compute_capacity,
but do not remember whether it looks at cpu frequency, and am pretty
sure it does not look at steal time currently :)

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUSF0DAAoJEM553pKExN6DYgkIALSZxKxQhMAJl0VUrBtPEFlr
cXOr0jKS/0FowS22agzpJr/OoWi58mGGm6mKr6LkoZJ34K96Y6/H4ie7Sr7Q4BL/
A4hQpTwxHzGasawQwdQOG/lW2q2oDUqsQuxRQDOs97I4vtYwxsj+D3qDtfIyaosf
f7ctWDQMzBBgLlrDn1wWmDE6K1pxa2eqnf0rRVSRNRXQ/lncHHzPdFOj4sJE9RVQ
E47gqeisDf+m7TyvG1I9MN6ZIHMEfgaQcmVvO8/QGqnb1ZMom6JTCDa4UqAd97XB
1NQ/QSJvQ5ED/cCfLy91YguEr/GY+QFsKeCjL1604e+3lsN4DjuejtcUP9/LQVs=
=On7B
-----END PGP SIGNATURE-----


* Re: [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions
  2014-10-23  1:42       ` Rik van Riel
@ 2014-10-23  2:12         ` Mike Galbraith
  2014-10-23  2:42           ` Rik van Riel
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Galbraith @ 2014-10-23  2:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mike Turquette, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Preeti U Murthy, Morten Rasmussen, kamalesh, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Vincent Guittot, Patch Tracking,
	Tuukka Tikkanen, Amit Kucheria

On Wed, 2014-10-22 at 21:42 -0400, Rik van Riel wrote: 
> On 10/22/2014 07:20 PM, Mike Turquette wrote:
> > On Wed, Oct 22, 2014 at 1:06 PM, Rik van Riel <riel@redhat.com>
> > wrote: On 10/22/2014 02:07 AM, Mike Turquette wrote:
> >>>> arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow
> >>>> the scheduler to evaluate if cpu frequency should change and
> >>>> to invoke that change from a safe context.
> >>>> 
> >>>> They are weakly defined arch functions that do nothing by
> >>>> default. A CPUfreq governor could use these functions to
> >>>> implement a frequency scaling policy based on updates to
> >>>> per-task statistics or updates to per-cpu utilization.
> >>>> 
> >>>> As discussed at Linux Plumbers Conference 2014, the goal will
> >>>> be to focus on a single cpu frequency scaling policy that
> >>>> works for everyone. That may mean that the weak arch
> >>>> functions definitions can be removed entirely and a single
> >>>> policy implements that logic for all architectures.
> > 
> > On virtual machines, we probably want to use both frequency and 
> > steal time to calculate the factor.
> > 
> >> You mean for calculating desired cpu frequency on a virtual
> >> guest? Is that something we want to do?
> 
> A guest will be unable to set the cpu frequency, but it should
> know what the frequency is, so it can take the capacity of each
> CPU into account when doing things like load balancing.

Hm.  Why does using vaporite freq/capacity/whatever make any sense, the
silicon under the V(aporite)PU can/does change at the drop of a hat, no?

-Mike



* Re: [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions
  2014-10-23  2:12         ` Mike Galbraith
@ 2014-10-23  2:42           ` Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2014-10-23  2:42 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Mike Turquette, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Preeti U Murthy, Morten Rasmussen, kamalesh, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Vincent Guittot, Patch Tracking,
	Tuukka Tikkanen, Amit Kucheria

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/22/2014 10:12 PM, Mike Galbraith wrote:
> On Wed, 2014-10-22 at 21:42 -0400, Rik van Riel wrote:
>> On 10/22/2014 07:20 PM, Mike Turquette wrote:
>>> On Wed, Oct 22, 2014 at 1:06 PM, Rik van Riel
>>> <riel@redhat.com> wrote: On 10/22/2014 02:07 AM, Mike Turquette
>>> wrote:
>>>>>> arch_eval_cpu_freq and arch_scale_cpu_freq are added to
>>>>>> allow the scheduler to evaluate if cpu frequency should
>>>>>> change and to invoke that change from a safe context.
>>>>>> 
>>>>>> They are weakly defined arch functions that do nothing
>>>>>> by default. A CPUfreq governor could use these functions
>>>>>> to implement a frequency scaling policy based on updates
>>>>>> to per-task statistics or updates to per-cpu
>>>>>> utilization.
>>>>>> 
>>>>>> As discussed at Linux Plumbers Conference 2014, the goal
>>>>>> will be to focus on a single cpu frequency scaling policy
>>>>>> that works for everyone. That may mean that the weak
>>>>>> arch functions definitions can be removed entirely and a
>>>>>> single policy implements that logic for all
>>>>>> architectures.
>>> 
>>> On virtual machines, we probably want to use both frequency and
>>>  steal time to calculate the factor.
>>> 
>>>> You mean for calculating desired cpu frequency on a virtual 
>>>> guest? Is that something we want to do?
>> 
>> A guest will be unable to set the cpu frequency, but it should 
>> know what the frequency is, so it can take the capacity of each 
>> CPU into account when doing things like load balancing.
> 
> Hm.  Why does using vaporite freq/capacity/whatever make any sense,
> the silicon under the V(aporite)PU can/does change at the drop of a
> hat, no?

It can, but IIRC that should cause the kvmclock data for that VCPU
to be regenerated, and the VCPU should be able to use that to figure
out that the frequency changed the next time it runs the scheduler
code on that VCPU.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUSGsmAAoJEM553pKExN6DlhUH/RLmVoHmab2zfPgZfRXWD9PX
yKkx1tmoNPFAdp7l1xgz+fIVtp5I7gUnCo03r0x3JDL8dYiEfU1BfX1bs2WSresL
7q50DVLQe8VXIqgmu1INqzQSJGfF9yOW4Kgg2xHkNBoWUdt+3fjF9JSEMJFxOZOs
pFT85ITTs0zFIRDlwdEBEs0kRLEqh0JBeLx501RSC9VQ9OIZ3lp9O1BnawQ8WI0o
Qq8ODXFgy1BGUE+Ow+skP8MnQUyBgb6b+f0Q6AmK/Er6lzw8PMwNvnmYN14ruR3R
YkTjsyYxlYlzrx2IKZNWuYy5OXguRIslWi67fI0k/yE2WVHy/yXPbRErYQfM2o8=
=PeDr
-----END PGP SIGNATURE-----


* Re: [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement
  2014-10-22  6:07 ` [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement Mike Turquette
@ 2014-10-23  4:03   ` Preeti U Murthy
  2014-10-27 15:55   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Preeti U Murthy @ 2014-10-23  4:03 UTC (permalink / raw)
  To: Mike Turquette, peterz, mingo
  Cc: linux-kernel, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria

Hi Mike,

On 10/22/2014 11:37 AM, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as function of task queueing. The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.
> 
> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.
> 
> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?
> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the

Can you explain how pstate selection can get affected by the presence of
task groups? We are after all concerned with the cpu load. So when we
enqueue/dequeue a task, we update the cpu load and pass it on for cpu
pstate scaling. How does this change if we have task groups?
I know that this issue was brought up during LPC, but I have not yet
managed to gain clarity here.

> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.
> 
> Also discussed at LPC14 is that fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.
> 
> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?
> 
> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.
> 
> Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
>  kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1af6f6d..3619f63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
> 
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
> @@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> 
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);

All the cfs_rqs that you iterate through here will belong to the same
rq/cpu right?

Regards
Preeti U Murthy



* Re: [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement
  2014-10-22  6:07 ` [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement Mike Turquette
  2014-10-23  4:03   ` Preeti U Murthy
@ 2014-10-27 15:55   ` Peter Zijlstra
  2014-10-27 17:42   ` Dietmar Eggemann
  2014-11-27 10:46   ` Preeti U Murthy
  3 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2014-10-27 15:55 UTC (permalink / raw)
  To: Mike Turquette
  Cc: mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	dietmar.eggemann, pjt, bsegall, vincent.guittot, patches,
	tuukka.tikkanen, amit.kucheria

On Tue, Oct 21, 2014 at 11:07:30PM -0700, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as function of task queueing. The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.

Yeah, I'm not sure about the arch eval hook, ideally it'd be all
integrated with the energy model.

> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.

We might want a better name for that :-) dvfs_set_freq() or whatnot, or
maybe preserve the cpufreq_*() namespace, people seem to know that that
is the linux dvfs name.

> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?

Like I said, I don't think so, we guesstimate and approximate everything
anyhow, don't bother trying to be 'perfect' here, it's excessively
expensive.

> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.

Yes please, make the cpumask stuff go away :-)

> Also discussed at LPC14 is that fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.

Ideally it'd be natural fallout of Morten's energy model.

If you take a multi-core energy model, find its bifurcations and map its
solution spaces I suspect there to be a fairly small set of actual
behaviours.

The problem is, nobody seems to have done this yet so we don't know.

Once you've done this, you can try and minimize the model by proving you
retain all behaviour modes, but for now Morten has a rather full
parameter space (not complete though, and the impact of the missing
parameters might or might not be relevant, impossible to prove until we
have the above done).

> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?

For the legacy systems, maybe.

> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.

I think it would be best to start with this, ideally we'd be able to RCU
free the thing such that either holding the rwsem or rcu_read_lock is
sufficient for usage, that way the sleeping muck can grab the rwsem, the
non-sleeping stuff can grab rcu_read_lock.

But I've not looked at the cpufreq stuff at all.


* Re: [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement
  2014-10-22  6:07 ` [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement Mike Turquette
  2014-10-23  4:03   ` Preeti U Murthy
  2014-10-27 15:55   ` Peter Zijlstra
@ 2014-10-27 17:42   ` Dietmar Eggemann
  2014-11-27 10:46   ` Preeti U Murthy
  3 siblings, 0 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2014-10-27 17:42 UTC (permalink / raw)
  To: Mike Turquette, peterz, mingo
  Cc: linux-kernel, preeti, Morten Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, pjt, bsegall,
	vincent.guittot, patches, tuukka.tikkanen, amit.kucheria

Hi Mike,

On 22/10/14 07:07, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as function of task queueing.

The sentence is a little bit misleading. We update the se utilization
contrib and the cfs_rq utilization in {en,de}queue_task_fair for a
specific se and a specific cpu = rq_of(cfs_rq_of(se))->cpu .

> The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.

I'm not sure if separating the evaluation and the setting of the cpu
frequency makes sense. You could evaluate and possibly set the cpu
frequency in one go. Right now you evaluate if the cfs_rq utilization
exceeds the thresholds for the current index every time a task is
enqueued or dequeued but that's not necessary since you only try to set
the cpu frequency in the softirq. The history (and the future if we
consider blocked utilization) is already captured in the cfs_rq
utilization itself.

> 
> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.

The name is misleading from the viewpoint of the CFS sched class. The
original scaling function of the CFS scheduler
(arch_scale_{freq,smt/cpu,rt}_capacity) scale capacity based on
frequency, uarch or rt. So your function should be called
arch_scale_util_cpu_freq or even better arch_set_cpu_freq.

> 
> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?

Not really because cfs_rq utilization tracks the history/(future) of cpu
utilization and you can evaluate the signal when you want to set the cpu
frequency.

> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.

I don't understand why you should care about task groups at all. The
task groups' contribution to the utilization of a cpu should already be
accounted for in the appropriate cpu's cfs_rq utilization signal.

But I can see a dependency on the fact that there is a difference
between systems with per-cluster (package) or per-cpu frequency scaling.
But there is no SD_SHARE_FREQDOMAIN (sched domain flag) today which,
applied at the MC sched domain level, could tell you that we deal with
per-cluster frequency scaling.
On systems with per-cpu frequency scaling you can set the frequency for
individual cpus by hooking into the scheduler but for systems with
per-cluster frequency scaling, you would have to respect the maximum cpu
utilization of all cpus in the cluster.

A similar problem occurs with hardware threads (SMT sd level).

But I don't know right now how the sd topology hierarchy can come in
handy here.

> 
> Also discussed at LPC14 is that fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.
> 
> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?

At least it's a good place to test this feature for now.

> 
> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.
> 
> Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
>  kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1af6f6d..3619f63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
>  
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
> @@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
>  	}
>  
> +	/* !CONFIG_FAIR_GROUP_SCHED */
>  	if (!se) {
>  		update_rq_runnable_avg(rq, rq->nr_running);
>  		add_nr_running(rq, 1);
> +
> +		/*
> +		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
> +		 * typedef update_cpus into an int and skip all of the cpumask
> +		 * stuff
> +		 */
> +		cpumask_set_cpu(cpu_of(rq), &update_cpus);
>  	}
> +
> +	if (energy_aware())
> +		if (!cpumask_empty(&update_cpus))
> +			arch_eval_cpu_freq(&update_cpus);
> +
>  	hrtick_update(rq);
>  }
>  
> @@ -4049,6 +4067,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
>  	int task_sleep = flags & DEQUEUE_SLEEP;
> +	struct cpumask update_cpus;
> +
> +	cpumask_clear(&update_cpus);
>  
>  	for_each_sched_entity(se) {
>  		cfs_rq = cfs_rq_of(se);
> @@ -4089,12 +4110,27 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  
>  		update_cfs_shares(cfs_rq);
>  		update_entity_load_avg(se, 1);
> +		/* track runqueues/cpus that need to be re-evaluated */
> +		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
>  	}
>  
> +	/* !CONFIG_FAIR_GROUP_SCHED */
>  	if (!se) {
>  		sub_nr_running(rq, 1);
>  		update_rq_runnable_avg(rq, 1);
> +
> +		/*
> +		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
> +		 * typedef update_cpus into an int and skip all of the cpumask
> +		 * stuff
> +		 */
> +		cpumask_set_cpu(cpu_of(rq), &update_cpus);
>  	}
> +
> +	if (energy_aware())
> +		if (!cpumask_empty(&update_cpus))
> +			arch_eval_cpu_freq(&update_cpus);
> +
>  	hrtick_update(rq);
>  }
>  
> @@ -7536,6 +7572,9 @@ static void run_rebalance_domains(struct softirq_action *h)
>  	 * stopped.
>  	 */
>  	nohz_idle_balance(this_rq, idle);
> +
> +	if (energy_aware())
> +		arch_scale_cpu_freq();
>  }
>  
>  /*
> 



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 7/7] sched: energy_model: simple cpu frequency scaling policy
  2014-10-22  6:07 ` [PATCH RFC 7/7] sched: energy_model: simple cpu frequency scaling policy Mike Turquette
@ 2014-10-27 19:43   ` Dietmar Eggemann
  2014-10-28 14:27   ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2014-10-27 19:43 UTC (permalink / raw)
  To: Mike Turquette, peterz, mingo
  Cc: linux-kernel, preeti, Morten Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, pjt, bsegall,
	vincent.guittot, patches, tuukka.tikkanen, amit.kucheria

On 22/10/14 07:07, Mike Turquette wrote:
> Building on top of the scale invariant capacity patches and earlier

We don't have scale invariant capacity yet, only scale invariant
load/utilization.

> patches in this series that prepare CFS for scaling cpu frequency, this
> patch implements a simple, naive ondemand-like cpu frequency scaling
> policy that is driven by enqueue_task_fair and dequeue_task_fair. This
> new policy is named "energy_model" as an homage to the on-going work in
> that area. It is NOT an actual energy model.

Maybe it's worth mentioning that you simply take SCHED_CAPACITY_SCALE
and multiply it with the OPP frequency/max frequency of that cpu to get
the capacity at that OPP. You're not using the capacity-related energy
values 'struct capacity:cap' from the energy model, which would have to
be measured for the particular platform.

[...]

> The policy implemented in this patch takes the highest cpu utilization
> from policy->cpus and uses that to select a frequency target based on the
> same 80%/20% thresholds used as defaults in ondemand. Frequency-scaled
> thresholds are pre-computed when energy_model inits. The frequency
> selection is a simple comparison of cpu utilization (as defined in
> Morten's latest RFC) to the threshold values. In the future this logic
> could be replaced with something more sophisticated that uses PELT to
> get a historical overview. Ideas are welcome.

This is what I don't grasp. The se utilization contrib and the cfs_rq
utilization are PELT signals, so don't they already provide history
information? I mean, comparing the cfs_rq utilization PELT signal with a
number from an energy model is essentially EAS.

> 
> Note that the pre-computed thresholds above do not take into account
> micro-architecture differences (SMT or big.LITTLE hardware), only
> frequency invariance.
> 
> Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
>  drivers/cpufreq/Kconfig     |  21 +++
>  include/linux/cpufreq.h     |   3 +
>  kernel/sched/Makefile       |   1 +
>  kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 366 insertions(+)
>  create mode 100644 kernel/sched/energy_model.c
> 

[...]

> +/**
> + * em_data - per-policy data used by energy_model
> + * @throttle: bail if current time is less than ktime_throttle.
> + *                 Derived from THROTTLE_MSEC
> + * @up_threshold:   table of normalized capacity states to determine if cpu
> + *                 should run faster. Derived from UP_THRESHOLD
> + * @down_threshold: table of normalized capacity states to determine if cpu
> + *                 should run slower. Derived from DOWN_THRESHOLD
> + *
> + * struct em_data is the per-policy energy_model-specific data structure. A
> + * per-policy instance of it is created when the energy_model governor receives
> + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
> + * member of struct cpufreq_policy.
> + *
> + * Readers of this data must call down_read(policy->rwsem). Writers must
> + * call down_write(policy->rwsem).
> + */
> +struct em_data {
> +       /* per-policy throttling */
> +       ktime_t throttle;
> +       unsigned int *up_threshold;
> +       unsigned int *down_threshold;
> +       struct task_struct *task;
> +       atomic_long_t target_freq;
> +       atomic_t need_wake_task;
> +};

On my Chromebook2 (Exynos 5 Octa 5800) I end up with 2 kernel threads
(one for each cluster). There is a 'for_each_online_cpu' in
arch_scale_cpu_freq and I can see that the em data thread is invoked for
both clusters every time. Is this the intended behaviour?

It looks like you achieve the desired behaviour of per-cluster
freq-scaling for this system, but it's not clear to me how this is done
from the design perspective, and what would have to be changed if we
wanted to run it on a per-cpu frequency scaling system.

Coming back to your question of where you should call arch_scale_cpu_freq:
another issue is for which cpu you should call it. For EAS we want to be
able to either raise the cpu frequency of the busiest cpu or do task
migration away from the busiest cpu. So maybe arch_scale_cpu_freq should
be called later in load_balance, once we have figured out which one is the
busiest cpu?
This would map nicely to load balance in MC sd level for per-cpu
frequency scaling and in DIE sd level for per-cluster frequency scaling.
But then, where do you hook in to lower the frequency eventually? And
what happens in load-balance for all the other 'sd level <-> per-foo
frequency scaling' combinations?

[...]

> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
> +static
> +#endif
> +struct cpufreq_governor cpufreq_gov_energy_model = {
> +       .name                   = "energy_model",
> +       .governor               = energy_model_setup,
> +       .owner                  = THIS_MODULE,
> +};
> +
> +static int __init energy_model_init(void)
> +{
> +       return cpufreq_register_governor(&cpufreq_gov_energy_model);
> +}
> +

Probably not that important at this stage. I always hit

[    8.601824] ------------[ cut here ]------------
[    8.601869] WARNING: CPU: 6 PID: 3229 at
drivers/cpufreq/cpufreq_governor.c:266 cpufreq_governor_dbs+0x6f4/0x6f8()
[    8.601884] Modules linked in:
[    8.601912] CPU: 6 PID: 3229 Comm: cpufreq-set Not tainted
3.17.0-rc3-00293-g5cf54ebcaea6 #16
[    8.601953] [<c0015224>] (unwind_backtrace) from [<c0011cd4>]
(show_stack+0x18/0x1c)
[    8.601982] [<c0011cd4>] (show_stack) from [<c04c5b28>]
(dump_stack+0x80/0xc0)
[    8.602011] [<c04c5b28>] (dump_stack) from [<c0022fd8>]
(warn_slowpath_common+0x78/0x94)
[    8.602041] [<c0022fd8>] (warn_slowpath_common) from [<c00230a8>]
(warn_slowpath_null+0x24/0x2c)
[    8.602071] [<c00230a8>] (warn_slowpath_null) from [<c03a74c8>]
(cpufreq_governor_dbs+0x6f4/0x6f8)
[    8.602100] [<c03a74c8>] (cpufreq_governor_dbs) from [<c03a1b58>]
(__cpufreq_governor+0x140/0x240)
[    8.602126] [<c03a1b58>] (__cpufreq_governor) from [<c03a31b0>]
(cpufreq_set_policy+0x18c/0x20c)
[    8.602153] [<c03a31b0>] (cpufreq_set_policy) from [<c03a3400>]
(store_scaling_governor+0x78/0xa4)
[    8.602179] [<c03a3400>] (store_scaling_governor) from [<c03a149c>]
(store+0x94/0xc0)
[    8.602207] [<c03a149c>] (store) from [<c015c268>]
(kernfs_fop_write+0xc8/0x188)
[    8.602236] [<c015c268>] (kernfs_fop_write) from [<c00ffc00>]
(vfs_write+0xac/0x1b8)
[    8.602263] [<c00ffc00>] (vfs_write) from [<c010023c>]
(SyS_write+0x48/0x9c)
[    8.602290] [<c010023c>] (SyS_write) from [<c000e600>]
(ret_fast_syscall+0x0/0x30)
[    8.602307] ---[ end trace bedc9e3b94a57ef2 ]---

when I configure CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL=y during
initial system start.

[...]








* Re: [PATCH RFC 7/7] sched: energy_model: simple cpu frequency scaling policy
  2014-10-22  6:07 ` [PATCH RFC 7/7] sched: energy_model: simple cpu frequency scaling policy Mike Turquette
  2014-10-27 19:43   ` Dietmar Eggemann
@ 2014-10-28 14:27   ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2014-10-28 14:27 UTC (permalink / raw)
  To: Mike Turquette
  Cc: mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	dietmar.eggemann, pjt, bsegall, vincent.guittot, patches,
	tuukka.tikkanen, amit.kucheria

On Tue, Oct 21, 2014 at 11:07:31PM -0700, Mike Turquette wrote:
> Unlike legacy CPUfreq governors, this policy does not implement its own
> logic loop (such as a workqueue triggered by a timer), but instead uses
> an event-driven design. Frequency is evaluated by entering
> {en,de}queue_task_fair and then a kthread is woken from
> run_rebalance_domains which scales cpu frequency based on the latest
> evaluation.

Also note that we probably want to extend the governor to include the
other sched classes; deadline, for example, is a good candidate to include,
as it already explicitly provides utilization requirements from which
you can compute a hard minimum frequency, below which the task set is
unschedulable.

fifo/rr are far harder to do, since for them we don't have anything
useful; the best we can do, I suppose, is some statistical
over-provisioning, but with no guarantees.




* Re: [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement
  2014-10-22  6:07 ` [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement Mike Turquette
                     ` (2 preceding siblings ...)
  2014-10-27 17:42   ` Dietmar Eggemann
@ 2014-11-27 10:46   ` Preeti U Murthy
  3 siblings, 0 replies; 21+ messages in thread
From: Preeti U Murthy @ 2014-11-27 10:46 UTC (permalink / raw)
  To: Mike Turquette, peterz, mingo
  Cc: linux-kernel, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, linaro-kernel, daniel.lezcano, dietmar.eggemann,
	pjt, bsegall, vincent.guittot, patches, tuukka.tikkanen,
	amit.kucheria, Shilpa Bhat

On 10/22/2014 11:37 AM, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as a function of task queueing. The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.
> 
> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.
> 
> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?
> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.
> 
> Also discussed at LPC14 is the fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.

I believe load balancing would be the right place to evaluate the
frequency at which CPUs must run. find_busiest_group() is already
iterating through all the CPUs and calculating the load on them, so this
information is readily available. What remains is to determine which
CPUs in the group have load above some threshold, and to queue a kthread
on each such cpu to scale its frequency, while the current cpu continues
with its load balancing.

There is another positive I see in evaluating cpu frequency during load
balancing. The frequency at which load balancing is run is already
optimized for scalability. One of the factors considered is that if any
sibling cpu has carried out load balancing in the recent past, the
current cpu defers doing the same. This naturally ensures that only one
cpu in the power domain takes care of frequency scaling each time, and
there is no need for explicit synchronization between the policy cpus to
do this.

Regards
Preeti U Murthy



end of thread, other threads:[~2014-11-27 10:47 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-22  6:07 [PATCH RFC 0/7] scheduler-driven cpu frequency scaling Mike Turquette
2014-10-22  6:07 ` [PATCH RFC 1/7] sched: Make energy awareness a sched feature Mike Turquette
2014-10-22  6:07 ` [PATCH RFC 2/7] sched: cfs: declare capacity_of in sched.h Mike Turquette
2014-10-22  6:07 ` [PATCH RFC 3/7] sched: fair: add usage_util_of helper Mike Turquette
2014-10-22  6:07 ` [PATCH RFC 4/7] cpufreq: add per-governor private data Mike Turquette
2014-10-22  6:26   ` Viresh Kumar
2014-10-22  6:35     ` Mike Turquette
2014-10-22  6:07 ` [PATCH RFC 5/7] sched: cfs: cpu frequency scaling arch functions Mike Turquette
2014-10-22 20:06   ` Rik van Riel
2014-10-22 23:20     ` Mike Turquette
2014-10-23  1:42       ` Rik van Riel
2014-10-23  2:12         ` Mike Galbraith
2014-10-23  2:42           ` Rik van Riel
2014-10-22  6:07 ` [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement Mike Turquette
2014-10-23  4:03   ` Preeti U Murthy
2014-10-27 15:55   ` Peter Zijlstra
2014-10-27 17:42   ` Dietmar Eggemann
2014-11-27 10:46   ` Preeti U Murthy
2014-10-22  6:07 ` [PATCH RFC 7/7] sched: energy_model: simple cpu frequency scaling policy Mike Turquette
2014-10-27 19:43   ` Dietmar Eggemann
2014-10-28 14:27   ` Peter Zijlstra
