linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/4] Scheduler idle notifiers and users
@ 2012-02-08  1:39 Anton Vorontsov
  2012-02-08  1:41 ` [PATCH 1/4] sched: Introduce idle notifiers API Anton Vorontsov
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Anton Vorontsov @ 2012-02-08  1:39 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Dave Jones, Russell King
  Cc: Oleg Nesterov, Benjamin Herrenschmidt, Paul E. McKenney,
	Nicolas Pitre, Mike Chan, Todd Poynor, cpufreq, kernel-team,
	linaro-kernel, linux-arm-kernel, linux-kernel

Hi all,

Some drivers need to know when the scheduler is idling. The most
straightforward way is to gracefully hook into the idle loop.

On x86 there are "CPU idle" notifiers in the inner idle loop, but
scheduler idle notifiers are different. These notifiers do not run on
every entry to or exit from cpuidle; instead, they are used to notify
about scheduler state changes, not HW states.

In other words, CPU idle notifiers work inside the while(!need_resched())
loop (nested in the idle loop), while scheduler idle notifiers work
outside of that loop.
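
The difference in nesting can be sketched in userspace C. This is only a
model: struct idle_counts and model_idle_period() are hypothetical names
made up for illustration, and the "wakeups" parameter stands in for the
number of times the CPU leaves a low-power state before need_resched()
becomes true.

```c
#include <assert.h>

/* Counters standing in for notifier callbacks (hypothetical names). */
struct idle_counts {
	int sched_idle_start;	/* scheduler idle notifier, entry */
	int sched_idle_end;	/* scheduler idle notifier, exit */
	int cpu_idle_enter;	/* "CPU idle" style callback, per sleep */
	int cpu_idle_exit;	/* "CPU idle" style callback, per wakeup */
};

/*
 * Model one scheduler idle period during which the CPU enters and
 * leaves a low-power state `wakeups` times before there is work to do.
 */
void model_idle_period(struct idle_counts *c, int wakeups)
{
	int woken = 0;

	assert(c && wakeups >= 0);

	c->sched_idle_start++;		/* outside the inner loop */
	while (woken < wakeups) {	/* models while (!need_resched()) */
		c->cpu_idle_enter++;	/* inside: once per low-power entry */
		woken++;		/* pretend an interrupt woke the CPU */
		c->cpu_idle_exit++;	/* inside: once per low-power exit */
	}
	c->sched_idle_end++;		/* outside the inner loop */
}
```

With wakeups set to 5, the inner callbacks are counted five times each,
while the scheduler idle start and end callbacks are counted once each.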

The first two patches consolidate the scheduler idle entry/exit
points and convert architectures to this new API.

The third patch is a new cpufreq governor, the commit message
briefly describes it.

The fourth patch is another user of the notifiers, a trivial one.

Thanks,

p.s. For reference, the old discussion about CPU/PM idle
     notifiers: http://lkml.org/lkml/2011/6/27/391

-- 
Anton Vorontsov
Email: cbouatmailru@gmail.com


* [PATCH 1/4] sched: Introduce idle notifiers API
  2012-02-08  1:39 [PATCH RFC 0/4] Scheduler idle notifiers and users Anton Vorontsov
@ 2012-02-08  1:41 ` Anton Vorontsov
  2012-02-08  1:43 ` [PATCH 2/4] sched: Wire up idle notifiers Anton Vorontsov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Anton Vorontsov @ 2012-02-08  1:41 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Dave Jones, Russell King
  Cc: Oleg Nesterov, Benjamin Herrenschmidt, Paul E. McKenney,
	Nicolas Pitre, Mike Chan, Todd Poynor, cpufreq, kernel-team,
	linaro-kernel, linux-arm-kernel, linux-kernel

Idle notifiers may be used as a hint to the code that needs to know when
there are no tasks to execute, and the scheduler is idling, or when the
idling period ends. This patch implements a simple notifiers API.

Notes:

- Unlike the x86 "CPU idle" notifiers API, these notifiers do not run on
  every entry to or exit from cpuidle. Instead, they are only used
  to notify about scheduler state changes, not HW states.

  In other words, CPU idle notifiers work inside while(!need_resched())
  loop, and scheduler idle notifiers will work outside of this loop.

- rcu_idle_{enter,exit} are wired as built-ins, bypassing the
  sched_idle_notifier chain.

  We might change it later to get rid of sched_idle_enter_condrcu()
  stuff on powerpc and x86. But that's just an implementation detail,
  so let's keep things simple for now.

- tick_nohz_idle_enter() is also wired as a built-in; there is not much
  gain in moving it to the sched_idle_notifier chain.
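
For illustration, here is a minimal userspace model of how a driver might
consume these notifications. The SCHED_IDLE_START/SCHED_IDLE_END values
and the sched_idle_notifier_* names come from this patch; everything else
is an assumption made for the sketch: the cut-down struct notifier_block,
the plain singly linked chain (the patch uses an atomic notifier chain),
and the idle_stats user are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SCHED_IDLE_START	1
#define SCHED_IDLE_END		2
#define NOTIFY_OK		0x0001

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down model of struct notifier_block (no priority field). */
struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long val, void *data);
	struct notifier_block *next;
};

static struct notifier_block *sched_idle_chain;

/* Model only: prepend to a plain list, no locking or RCU. */
void sched_idle_notifier_register(struct notifier_block *nb)
{
	assert(nb && nb->notifier_call);
	nb->next = sched_idle_chain;
	sched_idle_chain = nb;
}

void sched_idle_notifier_call_chain(unsigned long val)
{
	struct notifier_block *nb;

	for (nb = sched_idle_chain; nb; nb = nb->next)
		nb->notifier_call(nb, val, NULL);
}

/* Hypothetical user: counts completed scheduler idle periods. */
struct idle_stats {
	struct notifier_block nb;
	int in_idle;
	int idle_periods;
};

int idle_stats_notify(struct notifier_block *nb, unsigned long val,
		      void *data)
{
	struct idle_stats *s = container_of(nb, struct idle_stats, nb);

	(void)data;	/* the chain in this patch always passes NULL */
	switch (val) {
	case SCHED_IDLE_START:
		s->in_idle = 1;
		break;
	case SCHED_IDLE_END:
		s->in_idle = 0;
		s->idle_periods++;
		break;
	}
	return NOTIFY_OK;
}
```

A user would embed a notifier_block in its own state, set notifier_call
to its callback, and register it once; each START/END pair delivered by
the chain then corresponds to one scheduler idle period.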

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 include/linux/sched.h |   10 ++++++++++
 kernel/sched/core.c   |   38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4032ec1..e82f721 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1960,6 +1960,16 @@ extern void sched_clock_idle_sleep_event(void);
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 #endif
 
+#define SCHED_IDLE_START	1
+#define SCHED_IDLE_END		2
+extern void sched_idle_notifier_register(struct notifier_block *nb);
+extern void sched_idle_notifier_unregister(struct notifier_block *nb);
+extern void sched_idle_notifier_call_chain(unsigned long val);
+extern void sched_idle_enter_condrcu(bool idle_uses_rcu);
+extern void sched_idle_exit_condrcu(bool idle_uses_rcu);
+static inline void sched_idle_enter(void) { sched_idle_enter_condrcu(0); }
+static inline void sched_idle_exit(void) { sched_idle_exit_condrcu(0); }
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 /*
  * An i/f to runtime opt-in for irq time accounting based off of sched_clock.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fd7b25e..62798ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1810,6 +1810,44 @@ void wake_up_new_task(struct task_struct *p)
 	task_rq_unlock(rq, p, &flags);
 }
 
+static ATOMIC_NOTIFIER_HEAD(sched_idle_notifier);
+
+void sched_idle_notifier_register(struct notifier_block *nb)
+{
+	atomic_notifier_chain_register(&sched_idle_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(sched_idle_notifier_register);
+
+void sched_idle_notifier_unregister(struct notifier_block *nb)
+{
+	atomic_notifier_chain_unregister(&sched_idle_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(sched_idle_notifier_unregister);
+
+void sched_idle_notifier_call_chain(unsigned long val)
+{
+	atomic_notifier_call_chain(&sched_idle_notifier, val, NULL);
+}
+EXPORT_SYMBOL_GPL(sched_idle_notifier_call_chain);
+
+void sched_idle_enter_condrcu(bool idle_uses_rcu)
+{
+	tick_nohz_idle_enter();
+	if (!idle_uses_rcu)
+		rcu_idle_enter();
+	sched_idle_notifier_call_chain(SCHED_IDLE_START);
+}
+EXPORT_SYMBOL_GPL(sched_idle_enter_condrcu);
+
+void sched_idle_exit_condrcu(bool idle_uses_rcu)
+{
+	sched_idle_notifier_call_chain(SCHED_IDLE_END);
+	if (!idle_uses_rcu)
+		rcu_idle_exit();
+	tick_nohz_idle_exit();
+}
+EXPORT_SYMBOL_GPL(sched_idle_exit_condrcu);
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 
 /**
-- 
1.7.7.6



* [PATCH 2/4] sched: Wire up idle notifiers
  2012-02-08  1:39 [PATCH RFC 0/4] Scheduler idle notifiers and users Anton Vorontsov
  2012-02-08  1:41 ` [PATCH 1/4] sched: Introduce idle notifiers API Anton Vorontsov
@ 2012-02-08  1:43 ` Anton Vorontsov
  2012-02-08  1:44 ` [PATCH 3/4] cpufreq: New 'interactive' governor Anton Vorontsov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Anton Vorontsov @ 2012-02-08  1:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Dave Jones, Russell King
  Cc: Oleg Nesterov, Benjamin Herrenschmidt, Paul E. McKenney,
	Nicolas Pitre, Mike Chan, Todd Poynor, cpufreq, kernel-team,
	linaro-kernel, linux-arm-kernel, linux-kernel

Tweak arch files to wire up the sched_idle routines.

The changes are trivial except for powerpc and x86; for these
architectures we have to use the _condrcu variants.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 arch/arm/kernel/process.c              |    6 ++----
 arch/avr32/kernel/process.c            |    6 ++----
 arch/blackfin/kernel/process.c         |    6 ++----
 arch/c6x/kernel/process.c              |    6 ++----
 arch/microblaze/kernel/process.c       |    6 ++----
 arch/mips/kernel/process.c             |    6 ++----
 arch/openrisc/kernel/idle.c            |    6 ++----
 arch/powerpc/kernel/idle.c             |    8 ++------
 arch/powerpc/platforms/iseries/setup.c |   12 ++++--------
 arch/s390/kernel/process.c             |    6 ++----
 arch/sh/kernel/idle.c                  |    6 ++----
 arch/sparc/kernel/process_64.c         |    6 ++----
 arch/tile/kernel/process.c             |    6 ++----
 arch/um/kernel/process.c               |    6 ++----
 arch/unicore32/kernel/process.c        |    6 ++----
 arch/x86/kernel/process_32.c           |    6 ++----
 arch/x86/kernel/process_64.c           |    4 ++--
 17 files changed, 36 insertions(+), 72 deletions(-)

diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 971d65c..f2bac2d 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -206,8 +206,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		leds_event(led_idle_start);
 		while (!need_resched()) {
 #ifdef CONFIG_HOTPLUG_CPU
@@ -237,8 +236,7 @@ void cpu_idle(void)
 			}
 		}
 		leds_event(led_idle_end);
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index ea33957..a993186 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -34,12 +34,10 @@ void cpu_idle(void)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (!need_resched())
 			cpu_idle_sleep();
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 8dd0416..91fd39b8 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -88,12 +88,10 @@ void cpu_idle(void)
 #endif
 		if (!idle)
 			idle = default_idle;
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (!need_resched())
 			idle();
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/c6x/kernel/process.c b/arch/c6x/kernel/process.c
index 7ca8c41..64eefc4 100644
--- a/arch/c6x/kernel/process.c
+++ b/arch/c6x/kernel/process.c
@@ -71,8 +71,7 @@ void cpu_idle(void)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (1) {
 			local_irq_disable();
 			if (need_resched()) {
@@ -81,8 +80,7 @@ void cpu_idle(void)
 			}
 			c6x_idle(); /* enables local irqs */
 		}
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 
 		preempt_enable_no_resched();
 		schedule();
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 7dcb5bf..ac0ddd0 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -103,12 +103,10 @@ void cpu_idle(void)
 		if (!idle)
 			idle = default_idle;
 
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (!need_resched())
 			idle();
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 
 		preempt_enable_no_resched();
 		schedule();
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 7955409..72ed62b8 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -56,8 +56,7 @@ void __noreturn cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (!need_resched() && cpu_online(cpu)) {
 #ifdef CONFIG_MIPS_MT_SMTC
 			extern void smtc_idle_loop_hook(void);
@@ -78,8 +77,7 @@ void __noreturn cpu_idle(void)
 		     system_state == SYSTEM_BOOTING))
 			play_dead();
 #endif
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/openrisc/kernel/idle.c b/arch/openrisc/kernel/idle.c
index e5fc7887..ab5dd49 100644
--- a/arch/openrisc/kernel/idle.c
+++ b/arch/openrisc/kernel/idle.c
@@ -51,8 +51,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 
 		while (!need_resched()) {
 			check_pgt_cache();
@@ -70,8 +69,7 @@ void cpu_idle(void)
 			set_thread_flag(TIF_POLLING_NRFLAG);
 		}
 
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index 7c66ce1..c89172d 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -66,9 +66,7 @@ void cpu_idle(void)
 
 	set_thread_flag(TIF_POLLING_NRFLAG);
 	while (1) {
-		tick_nohz_idle_enter();
-		if (!idle_uses_rcu)
-			rcu_idle_enter();
+		sched_idle_enter_condrcu(idle_uses_rcu);
 
 		while (!need_resched() && !cpu_should_die()) {
 			ppc64_runlatch_off();
@@ -106,9 +104,7 @@ void cpu_idle(void)
 
 		HMT_medium();
 		ppc64_runlatch_on();
-		if (!idle_uses_rcu)
-			rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit_condrcu(idle_uses_rcu);
 		preempt_enable_no_resched();
 		if (cpu_should_die())
 			cpu_die();
diff --git a/arch/powerpc/platforms/iseries/setup.c b/arch/powerpc/platforms/iseries/setup.c
index 8fc6258..496bd5e 100644
--- a/arch/powerpc/platforms/iseries/setup.c
+++ b/arch/powerpc/platforms/iseries/setup.c
@@ -563,8 +563,7 @@ static void yield_shared_processor(void)
 static void iseries_shared_idle(void)
 {
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (!need_resched() && !hvlpevent_is_pending()) {
 			local_irq_disable();
 			ppc64_runlatch_off();
@@ -578,8 +577,7 @@ static void iseries_shared_idle(void)
 		}
 
 		ppc64_runlatch_on();
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 
 		if (hvlpevent_is_pending())
 			process_iSeries_events();
@@ -595,8 +593,7 @@ static void iseries_dedicated_idle(void)
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		if (!need_resched()) {
 			while (!need_resched()) {
 				ppc64_runlatch_off();
@@ -613,8 +610,7 @@ static void iseries_dedicated_idle(void)
 		}
 
 		ppc64_runlatch_on();
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 3201ae4..1446fdf 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -91,12 +91,10 @@ static void default_idle(void)
 void cpu_idle(void)
 {
 	for (;;) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (!need_resched())
 			default_idle();
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/sh/kernel/idle.c b/arch/sh/kernel/idle.c
index 406508d..5d8acc2 100644
--- a/arch/sh/kernel/idle.c
+++ b/arch/sh/kernel/idle.c
@@ -89,8 +89,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 
 		while (!need_resched()) {
 			check_pgt_cache();
@@ -112,8 +111,7 @@ void cpu_idle(void)
 			start_critical_timings();
 		}
 
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index 39d8b05..a5d0062 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -95,14 +95,12 @@ void cpu_idle(void)
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while(1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 
 		while (!need_resched() && !cpu_is_offline(cpu))
 			sparc64_yield(cpu);
 
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 
 		preempt_enable_no_resched();
 
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 4c1ac6e..436f366 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -85,8 +85,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (!need_resched()) {
 			if (cpu_is_offline(cpu))
 				BUG();  /* no HOTPLUG_CPU */
@@ -106,8 +105,7 @@ void cpu_idle(void)
 				local_irq_enable();
 			current_thread_info()->status |= TS_POLLING;
 		}
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 69f2490..20b1a39 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -246,12 +246,10 @@ void default_idle(void)
 		if (need_resched())
 			schedule();
 
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		nsecs = disable_timer();
 		idle_sleep(nsecs);
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 	}
 }
 
diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index 52edc2b..ec540dc 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -55,8 +55,7 @@ void cpu_idle(void)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (!need_resched()) {
 			local_irq_disable();
 			stop_critical_timings();
@@ -64,8 +63,7 @@ void cpu_idle(void)
 			local_irq_enable();
 			start_critical_timings();
 		}
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 485204f..0e5a4c3 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -99,8 +99,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
-		rcu_idle_enter();
+		sched_idle_enter();
 		while (!need_resched()) {
 
 			check_pgt_cache();
@@ -117,8 +116,7 @@ void cpu_idle(void)
 				pm_idle();
 			start_critical_timings();
 		}
-		rcu_idle_exit();
-		tick_nohz_idle_exit();
+		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9b9fe4a..4d8bc3d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -122,7 +122,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_idle_enter();
+		sched_idle_enter_condrcu(1);
 		while (!need_resched()) {
 
 			rmb();
@@ -155,7 +155,7 @@ void cpu_idle(void)
 			__exit_idle();
 		}
 
-		tick_nohz_idle_exit();
+		sched_idle_exit_condrcu(1);
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
-- 
1.7.7.6



* [PATCH 3/4] cpufreq: New 'interactive' governor
  2012-02-08  1:39 [PATCH RFC 0/4] Scheduler idle notifiers and users Anton Vorontsov
  2012-02-08  1:41 ` [PATCH 1/4] sched: Introduce idle notifiers API Anton Vorontsov
  2012-02-08  1:43 ` [PATCH 2/4] sched: Wire up idle notifiers Anton Vorontsov
@ 2012-02-08  1:44 ` Anton Vorontsov
  2012-02-08 23:00   ` Vincent Guittot
  2012-02-08  1:44 ` [PATCH 4/4] ARM: Move leds idle start/stop calls to sched idle notifiers Anton Vorontsov
  2012-02-08  3:05 ` [PATCH RFC 0/4] Scheduler idle notifiers and users Peter Zijlstra
  4 siblings, 1 reply; 34+ messages in thread
From: Anton Vorontsov @ 2012-02-08  1:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Dave Jones, Russell King
  Cc: Oleg Nesterov, Benjamin Herrenschmidt, Paul E. McKenney,
	Nicolas Pitre, Mike Chan, Todd Poynor, cpufreq, kernel-team,
	linaro-kernel, linux-arm-kernel, linux-kernel

From: Mike Chan <mike@android.com>

This governor is designed for latency-sensitive workloads, such as
interactive user interfaces.  The interactive governor aims to be
significantly more responsive by ramping the CPU up quickly when
CPU-intensive activity begins.

Existing governors sample CPU load at a particular rate, typically
every X ms.  This can lead to under-powering UI threads for the period of
time during which the user begins interacting with a previously-idle system
until the next sample period happens.

The 'interactive' governor uses a different approach. Instead of sampling
the CPU at a specified rate, the governor will check whether to scale the
CPU frequency up soon after coming out of idle.  When the CPU comes out of
idle, a timer is configured to fire within 1-2 ticks.  If the CPU is very
busy from exiting idle to when the timer fires then we assume the CPU is
underpowered and ramp to MAX speed.

If the CPU was not sufficiently busy to immediately ramp to MAX speed, then
the governor evaluates the CPU load since the last speed adjustment,
choosing the higher of that longer-term load and the short-term load
since idle exit to determine the CPU speed to ramp to.
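
The load arithmetic described above can be sketched in plain C. This
mirrors the calculation done in cpufreq_interactive_timer() in this
patch, but load_pct() and pick_target_khz() are names made up for the
sketch, and it stops before the step where the real code rounds the
result to an entry in the driver's frequency table.

```c
#include <assert.h>

/* Busy percentage over a window of delta_time_us microseconds. */
unsigned int load_pct(unsigned int delta_time_us,
		      unsigned int delta_idle_us)
{
	if (delta_time_us == 0 || delta_idle_us > delta_time_us)
		return 0;
	return 100 * (delta_time_us - delta_idle_us) / delta_time_us;
}

/*
 * Pick a raw target frequency from the short-term load (since idle
 * exit) and the long-term load (since the last frequency change).
 * The frequency parameters are hypothetical inputs in kHz.
 */
unsigned int pick_target_khz(unsigned int short_load,
			     unsigned int long_load,
			     unsigned int cur_khz,
			     unsigned int min_khz,
			     unsigned int max_khz,
			     unsigned int hispeed_khz,
			     unsigned int go_hispeed_load)
{
	/* Use the greater of the two load estimates. */
	unsigned int load = short_load > long_load ? short_load : long_load;

	assert(min_khz <= max_khz);

	if (load >= go_hispeed_load) {
		/* Burst from the lowest speed: jump straight to hispeed. */
		if (cur_khz == min_khz)
			return hispeed_khz;
		return max_khz * load / 100;
	}
	/* Otherwise scale the current frequency by the load. */
	return cur_khz * load / 100;
}
```

For example, with arbitrary frequencies of 300000 kHz (min) and
1200000 kHz (max and hispeed), a CPU at min speed that was idle for
1000 us of a 20000 us window has a load of 95 and is sent to 1200000 kHz
when go_hispeed_load is 95.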

A realtime thread is used for scaling up, giving the remaining tasks the
CPU performance benefit, unlike existing governors, which are more likely
to schedule ramp-up work to occur after your performance-starved tasks
have completed.

The tunables for this governor are:
/sys/devices/system/cpu/cpufreq/interactive/min_sample_time:
        The minimum amount of time to spend at the current frequency before
        ramping down. This is to ensure that the governor has seen enough
        historic CPU load data to determine the appropriate workload.
        Default is 20000 us.
/sys/devices/system/cpu/cpufreq/interactive/go_hispeed_load:
        The CPU load, in percent, at which to ramp to max speed.
        Default is 95.

Signed-off-by: Mike Chan <mike@android.com>
Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: Allen Martin <amartin@nvidia.com> (submitted improvements)
Signed-off-by: Axel Haslam <axelhaslam@ti.com> (submitted improvements)
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 drivers/cpufreq/Kconfig               |   26 ++
 drivers/cpufreq/Makefile              |    1 +
 drivers/cpufreq/cpufreq_interactive.c |  700 +++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h               |    3 +
 4 files changed, 730 insertions(+), 0 deletions(-)
 create mode 100644 drivers/cpufreq/cpufreq_interactive.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index e24a2a1..c47cc46 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -99,6 +99,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the
 	  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_INTERACTIVE
+	bool "interactive"
+	select CPU_FREQ_GOV_INTERACTIVE
+	help
+	  Use the CPUFreq governor 'interactive' as default. This allows
+	  you to get a full dynamic cpu frequency capable system by simply
+	  loading your cpufreq low-level hardware driver, using the
+	  'interactive' governor for latency-sensitive workloads.
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -179,6 +188,23 @@ config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_INTERACTIVE
+	tristate "'interactive' cpufreq policy governor"
+	help
+	  'interactive' - This driver adds a dynamic cpufreq policy governor
+	  designed for latency-sensitive workloads.
+
+	  This governor attempts to reduce the latency of clock
+	  increases so that the system is more responsive to
+	  interactive workloads.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called cpufreq_interactive.
+
+	  For details, take a look at linux/Documentation/cpu-freq.
+
+	  If in doubt, say N.
+
 menu "x86 CPU frequency scaling drivers"
 depends on X86
 source "drivers/cpufreq/Kconfig.x86"
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index ac000fa..f84c99b 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+= cpufreq_powersave.o
 obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
 obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
 obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_INTERACTIVE)	+= cpufreq_interactive.o
 
 # CPUfreq cross-arch helpers
 obj-$(CONFIG_CPU_FREQ_TABLE)		+= freq_table.o
diff --git a/drivers/cpufreq/cpufreq_interactive.c b/drivers/cpufreq/cpufreq_interactive.c
new file mode 100644
index 0000000..188096a
--- /dev/null
+++ b/drivers/cpufreq/cpufreq_interactive.c
@@ -0,0 +1,700 @@
+/*
+ * Copyright (C) 2010 Google, Inc.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Author: Mike Chan (mike@android.com)
+ */
+
+#include <linux/module.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/cpufreq.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/tick.h>
+#include <linux/time.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+#include <linux/kthread.h>
+#include <linux/mutex.h>
+
+static atomic_t active_count = ATOMIC_INIT(0);
+
+struct cpufreq_interactive_cpuinfo {
+	struct timer_list cpu_timer;
+	int timer_idlecancel;
+	u64 time_in_idle;
+	u64 idle_exit_time;
+	u64 timer_run_time;
+	int idling;
+	u64 freq_change_time;
+	u64 freq_change_time_in_idle;
+	struct cpufreq_policy *policy;
+	struct cpufreq_frequency_table *freq_table;
+	unsigned int target_freq;
+	int governor_enabled;
+};
+
+static DEFINE_PER_CPU(struct cpufreq_interactive_cpuinfo, cpuinfo);
+
+/* Workqueues handle frequency scaling */
+static struct task_struct *up_task;
+static struct workqueue_struct *down_wq;
+static struct work_struct freq_scale_down_work;
+static cpumask_t up_cpumask;
+static spinlock_t up_cpumask_lock;
+static cpumask_t down_cpumask;
+static spinlock_t down_cpumask_lock;
+static struct mutex set_speed_lock;
+
+/* Hi speed to bump to from lo speed when load burst (default max) */
+static u64 hispeed_freq;
+
+/* Go to hi speed when CPU load at or above this value. */
+#define DEFAULT_GO_HISPEED_LOAD 95
+static unsigned long go_hispeed_load;
+
+/*
+ * The minimum amount of time to spend at a frequency before we can ramp down.
+ */
+#define DEFAULT_MIN_SAMPLE_TIME (20 * USEC_PER_MSEC)
+static unsigned long min_sample_time;
+
+/*
+ * The sample rate of the timer used to increase frequency
+ */
+#define DEFAULT_TIMER_RATE (20 * USEC_PER_MSEC)
+static unsigned long timer_rate;
+
+static int cpufreq_governor_interactive(struct cpufreq_policy *policy,
+		unsigned int event);
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_INTERACTIVE
+static
+#endif
+struct cpufreq_governor cpufreq_gov_interactive = {
+	.name = "interactive",
+	.governor = cpufreq_governor_interactive,
+	.max_transition_latency = 10000000,
+	.owner = THIS_MODULE,
+};
+
+static void cpufreq_interactive_timer(unsigned long data)
+{
+	unsigned int delta_idle;
+	unsigned int delta_time;
+	int cpu_load;
+	int load_since_change;
+	u64 time_in_idle;
+	u64 idle_exit_time;
+	struct cpufreq_interactive_cpuinfo *pcpu =
+		&per_cpu(cpuinfo, data);
+	u64 now_idle;
+	unsigned int new_freq;
+	unsigned int index;
+	unsigned long flags;
+
+	smp_rmb();
+
+	if (!pcpu->governor_enabled)
+		goto exit;
+
+	/*
+	 * Once pcpu->timer_run_time is updated to >= pcpu->idle_exit_time,
+	 * this lets idle exit know the current idle time sample has
+	 * been processed, and idle exit can generate a new sample and
+	 * re-arm the timer.  This prevents a concurrent idle
+	 * exit on that CPU from writing a new set of info at the same time
+	 * the timer function runs (the timer function can't use that info
+	 * until more time passes).
+	 */
+	time_in_idle = pcpu->time_in_idle;
+	idle_exit_time = pcpu->idle_exit_time;
+	now_idle = get_cpu_idle_time_us(data, &pcpu->timer_run_time);
+	smp_wmb();
+
+	/* If we raced with cancelling a timer, skip. */
+	if (!idle_exit_time)
+		goto exit;
+
+	delta_idle = (unsigned int) (now_idle - time_in_idle);
+	delta_time = (unsigned int) (pcpu->timer_run_time - idle_exit_time);
+
+	/*
+	 * If timer ran less than 1ms after short-term sample started, retry.
+	 */
+	if (delta_time < 1000)
+		goto rearm;
+
+	if (delta_idle > delta_time)
+		cpu_load = 0;
+	else
+		cpu_load = 100 * (delta_time - delta_idle) / delta_time;
+
+	delta_idle = (unsigned int) (now_idle - pcpu->freq_change_time_in_idle);
+	delta_time = (unsigned int) (pcpu->timer_run_time - pcpu->freq_change_time);
+
+	if ((delta_time == 0) || (delta_idle > delta_time))
+		load_since_change = 0;
+	else
+		load_since_change =
+			100 * (delta_time - delta_idle) / delta_time;
+
+	/*
+	 * Choose greater of short-term load (since last idle timer
+	 * started or timer function re-armed itself) or long-term load
+	 * (since last frequency change).
+	 */
+	if (load_since_change > cpu_load)
+		cpu_load = load_since_change;
+
+	if (cpu_load >= go_hispeed_load) {
+		if (pcpu->policy->cur == pcpu->policy->min)
+			new_freq = hispeed_freq;
+		else
+			new_freq = pcpu->policy->max * cpu_load / 100;
+	} else {
+		new_freq = pcpu->policy->cur * cpu_load / 100;
+	}
+
+	if (cpufreq_frequency_table_target(pcpu->policy, pcpu->freq_table,
+					   new_freq, CPUFREQ_RELATION_H,
+					   &index)) {
+		pr_warn_once("timer %d: cpufreq_frequency_table_target error\n",
+			     (int) data);
+		goto rearm;
+	}
+
+	new_freq = pcpu->freq_table[index].frequency;
+
+	if (pcpu->target_freq == new_freq)
+		goto rearm_if_notmax;
+
+	/*
+	 * Do not scale down unless we have been at this frequency for the
+	 * minimum sample time.
+	 */
+	if (new_freq < pcpu->target_freq) {
+		if (pcpu->timer_run_time - pcpu->freq_change_time
+				< min_sample_time)
+			goto rearm;
+	}
+
+	if (new_freq < pcpu->target_freq) {
+		pcpu->target_freq = new_freq;
+		spin_lock_irqsave(&down_cpumask_lock, flags);
+		cpumask_set_cpu(data, &down_cpumask);
+		spin_unlock_irqrestore(&down_cpumask_lock, flags);
+		queue_work(down_wq, &freq_scale_down_work);
+	} else {
+		pcpu->target_freq = new_freq;
+		spin_lock_irqsave(&up_cpumask_lock, flags);
+		cpumask_set_cpu(data, &up_cpumask);
+		spin_unlock_irqrestore(&up_cpumask_lock, flags);
+		wake_up_process(up_task);
+	}
+
+rearm_if_notmax:
+	/*
+	 * Already set max speed and don't see a need to change that,
+	 * wait until next idle to re-evaluate, don't need timer.
+	 */
+	if (pcpu->target_freq == pcpu->policy->max)
+		goto exit;
+
+rearm:
+	if (!timer_pending(&pcpu->cpu_timer)) {
+		/*
+		 * If already at min: if that CPU is idle, don't set timer.
+		 * Else cancel the timer if that CPU goes idle.  We don't
+		 * need to re-evaluate speed until the next idle exit.
+		 */
+		if (pcpu->target_freq == pcpu->policy->min) {
+			smp_rmb();
+
+			if (pcpu->idling)
+				goto exit;
+
+			pcpu->timer_idlecancel = 1;
+		}
+
+		pcpu->time_in_idle = get_cpu_idle_time_us(
+			data, &pcpu->idle_exit_time);
+		mod_timer(&pcpu->cpu_timer,
+			  jiffies + usecs_to_jiffies(timer_rate));
+	}
+
+exit:
+	return;
+}
+
+static void cpufreq_interactive_idle_start(void)
+{
+	struct cpufreq_interactive_cpuinfo *pcpu =
+		&per_cpu(cpuinfo, smp_processor_id());
+	int pending;
+
+	if (!pcpu->governor_enabled)
+		return;
+
+	pcpu->idling = 1;
+	smp_wmb();
+	pending = timer_pending(&pcpu->cpu_timer);
+
+	if (pcpu->target_freq != pcpu->policy->min) {
+#ifdef CONFIG_SMP
+		/*
+		 * Entering idle while not at lowest speed.  On some
+		 * platforms this can hold the other CPU(s) at that speed
+		 * even though the CPU is idle. Set a timer to re-evaluate
+		 * speed so this idle CPU doesn't hold the other CPUs above
+		 * min indefinitely.  This should probably be a quirk of
+		 * the CPUFreq driver.
+		 */
+		if (!pending) {
+			pcpu->time_in_idle = get_cpu_idle_time_us(
+				smp_processor_id(), &pcpu->idle_exit_time);
+			pcpu->timer_idlecancel = 0;
+			mod_timer(&pcpu->cpu_timer,
+				  jiffies + usecs_to_jiffies(timer_rate));
+		}
+#endif
+	} else {
+		/*
+		 * If at min speed and entering idle after load has
+		 * already been evaluated, and a timer has been set just in
+		 * case the CPU suddenly goes busy, cancel that timer.  The
+		 * CPU didn't go busy; we'll recheck things upon idle exit.
+		 */
+		if (pending && pcpu->timer_idlecancel) {
+			del_timer(&pcpu->cpu_timer);
+			/*
+			 * Ensure last timer run time is after current idle
+			 * sample start time, so next idle exit will always
+			 * start a new idle sampling period.
+			 */
+			pcpu->idle_exit_time = 0;
+			pcpu->timer_idlecancel = 0;
+		}
+	}
+
+}
+
+static void cpufreq_interactive_idle_end(void)
+{
+	struct cpufreq_interactive_cpuinfo *pcpu =
+		&per_cpu(cpuinfo, smp_processor_id());
+
+	pcpu->idling = 0;
+	smp_wmb();
+
+	/*
+	 * Arm the timer for 1-2 ticks later if not already, and if the timer
+	 * function has already processed the previous load sampling
+	 * interval.  (If the timer is not pending but has not processed
+	 * the previous interval, it is probably racing with us on another
+	 * CPU.  Let it compute load based on the previous sample and then
+	 * re-arm the timer for another interval when it's done, rather
+	 * than updating the interval start time to be "now", which doesn't
+	 * give the timer function enough time to make a decision on this
+	 * run.)
+	 */
+	if (timer_pending(&pcpu->cpu_timer) == 0 &&
+	    pcpu->timer_run_time >= pcpu->idle_exit_time &&
+	    pcpu->governor_enabled) {
+		pcpu->time_in_idle =
+			get_cpu_idle_time_us(smp_processor_id(),
+					     &pcpu->idle_exit_time);
+		pcpu->timer_idlecancel = 0;
+		mod_timer(&pcpu->cpu_timer,
+			  jiffies + usecs_to_jiffies(timer_rate));
+	}
+
+}
+
+static int cpufreq_interactive_up_task(void *data)
+{
+	unsigned int cpu;
+	cpumask_t tmp_mask;
+	unsigned long flags;
+	struct cpufreq_interactive_cpuinfo *pcpu;
+
+	while (1) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		spin_lock_irqsave(&up_cpumask_lock, flags);
+
+		if (cpumask_empty(&up_cpumask)) {
+			spin_unlock_irqrestore(&up_cpumask_lock, flags);
+			schedule();
+
+			if (kthread_should_stop())
+				break;
+
+			spin_lock_irqsave(&up_cpumask_lock, flags);
+		}
+
+		set_current_state(TASK_RUNNING);
+		tmp_mask = up_cpumask;
+		cpumask_clear(&up_cpumask);
+		spin_unlock_irqrestore(&up_cpumask_lock, flags);
+
+		for_each_cpu(cpu, &tmp_mask) {
+			unsigned int j;
+			unsigned int max_freq = 0;
+
+			pcpu = &per_cpu(cpuinfo, cpu);
+			smp_rmb();
+
+			if (!pcpu->governor_enabled)
+				continue;
+
+			mutex_lock(&set_speed_lock);
+
+			for_each_cpu(j, pcpu->policy->cpus) {
+				struct cpufreq_interactive_cpuinfo *pjcpu =
+					&per_cpu(cpuinfo, j);
+
+				if (pjcpu->target_freq > max_freq)
+					max_freq = pjcpu->target_freq;
+			}
+
+			if (max_freq != pcpu->policy->cur)
+				__cpufreq_driver_target(pcpu->policy,
+							max_freq,
+							CPUFREQ_RELATION_H);
+			mutex_unlock(&set_speed_lock);
+
+			pcpu->freq_change_time_in_idle =
+				get_cpu_idle_time_us(cpu,
+						     &pcpu->freq_change_time);
+		}
+	}
+
+	return 0;
+}
+
+static void cpufreq_interactive_freq_down(struct work_struct *work)
+{
+	unsigned int cpu;
+	cpumask_t tmp_mask;
+	unsigned long flags;
+	struct cpufreq_interactive_cpuinfo *pcpu;
+
+	spin_lock_irqsave(&down_cpumask_lock, flags);
+	tmp_mask = down_cpumask;
+	cpumask_clear(&down_cpumask);
+	spin_unlock_irqrestore(&down_cpumask_lock, flags);
+
+	for_each_cpu(cpu, &tmp_mask) {
+		unsigned int j;
+		unsigned int max_freq = 0;
+
+		pcpu = &per_cpu(cpuinfo, cpu);
+		smp_rmb();
+
+		if (!pcpu->governor_enabled)
+			continue;
+
+		mutex_lock(&set_speed_lock);
+
+		for_each_cpu(j, pcpu->policy->cpus) {
+			struct cpufreq_interactive_cpuinfo *pjcpu =
+				&per_cpu(cpuinfo, j);
+
+			if (pjcpu->target_freq > max_freq)
+				max_freq = pjcpu->target_freq;
+		}
+
+		if (max_freq != pcpu->policy->cur)
+			__cpufreq_driver_target(pcpu->policy, max_freq,
+						CPUFREQ_RELATION_H);
+
+		mutex_unlock(&set_speed_lock);
+		pcpu->freq_change_time_in_idle =
+			get_cpu_idle_time_us(cpu,
+					     &pcpu->freq_change_time);
+	}
+}
+
+static ssize_t show_hispeed_freq(struct kobject *kobj,
+				 struct attribute *attr, char *buf)
+{
+	return sprintf(buf, "%llu\n", hispeed_freq);
+}
+
+static ssize_t store_hispeed_freq(struct kobject *kobj,
+				  struct attribute *attr, const char *buf,
+				  size_t count)
+{
+	int ret;
+	u64 val;
+
+	ret = strict_strtoull(buf, 0, &val);
+	if (ret < 0)
+		return ret;
+	hispeed_freq = val;
+	return count;
+}
+
+static struct global_attr hispeed_freq_attr = __ATTR(hispeed_freq, 0644,
+		show_hispeed_freq, store_hispeed_freq);
+
+
+static ssize_t show_go_hispeed_load(struct kobject *kobj,
+				     struct attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", go_hispeed_load);
+}
+
+static ssize_t store_go_hispeed_load(struct kobject *kobj,
+			struct attribute *attr, const char *buf, size_t count)
+{
+	int ret;
+	unsigned long val;
+
+	ret = strict_strtoul(buf, 0, &val);
+	if (ret < 0)
+		return ret;
+	go_hispeed_load = val;
+	return count;
+}
+
+static struct global_attr go_hispeed_load_attr = __ATTR(go_hispeed_load, 0644,
+		show_go_hispeed_load, store_go_hispeed_load);
+
+static ssize_t show_min_sample_time(struct kobject *kobj,
+				struct attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", min_sample_time);
+}
+
+static ssize_t store_min_sample_time(struct kobject *kobj,
+			struct attribute *attr, const char *buf, size_t count)
+{
+	int ret;
+	unsigned long val;
+
+	ret = strict_strtoul(buf, 0, &val);
+	if (ret < 0)
+		return ret;
+	min_sample_time = val;
+	return count;
+}
+
+static struct global_attr min_sample_time_attr = __ATTR(min_sample_time, 0644,
+		show_min_sample_time, store_min_sample_time);
+
+static ssize_t show_timer_rate(struct kobject *kobj,
+			struct attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", timer_rate);
+}
+
+static ssize_t store_timer_rate(struct kobject *kobj,
+			struct attribute *attr, const char *buf, size_t count)
+{
+	int ret;
+	unsigned long val;
+
+	ret = strict_strtoul(buf, 0, &val);
+	if (ret < 0)
+		return ret;
+	timer_rate = val;
+	return count;
+}
+
+static struct global_attr timer_rate_attr = __ATTR(timer_rate, 0644,
+		show_timer_rate, store_timer_rate);
+
+static struct attribute *interactive_attributes[] = {
+	&hispeed_freq_attr.attr,
+	&go_hispeed_load_attr.attr,
+	&min_sample_time_attr.attr,
+	&timer_rate_attr.attr,
+	NULL,
+};
+
+static struct attribute_group interactive_attr_group = {
+	.attrs = interactive_attributes,
+	.name = "interactive",
+};
+
+static int cpufreq_governor_interactive(struct cpufreq_policy *policy,
+		unsigned int event)
+{
+	int rc;
+	unsigned int j;
+	struct cpufreq_interactive_cpuinfo *pcpu;
+	struct cpufreq_frequency_table *freq_table;
+
+	switch (event) {
+	case CPUFREQ_GOV_START:
+		if (!cpu_online(policy->cpu))
+			return -EINVAL;
+
+		freq_table =
+			cpufreq_frequency_get_table(policy->cpu);
+
+		for_each_cpu(j, policy->cpus) {
+			pcpu = &per_cpu(cpuinfo, j);
+			pcpu->policy = policy;
+			pcpu->target_freq = policy->cur;
+			pcpu->freq_table = freq_table;
+			pcpu->freq_change_time_in_idle =
+				get_cpu_idle_time_us(j,
+					     &pcpu->freq_change_time);
+			pcpu->governor_enabled = 1;
+			smp_wmb();
+		}
+
+		if (!hispeed_freq)
+			hispeed_freq = policy->max;
+
+		/*
+		 * Do not register the idle hook and create sysfs
+		 * entries if we have already done so.
+		 */
+		if (atomic_inc_return(&active_count) > 1)
+			return 0;
+
+		rc = sysfs_create_group(cpufreq_global_kobject,
+				&interactive_attr_group);
+		if (rc)
+			return rc;
+
+		break;
+
+	case CPUFREQ_GOV_STOP:
+		for_each_cpu(j, policy->cpus) {
+			pcpu = &per_cpu(cpuinfo, j);
+			pcpu->governor_enabled = 0;
+			smp_wmb();
+			del_timer_sync(&pcpu->cpu_timer);
+
+			/*
+			 * Reset idle exit time since we may cancel the timer
+			 * before it can run after the last idle exit time,
+			 * to avoid tripping the check in idle exit for a timer
+			 * that is trying to run.
+			 */
+			pcpu->idle_exit_time = 0;
+		}
+
+		flush_work(&freq_scale_down_work);
+		if (atomic_dec_return(&active_count) > 0)
+			return 0;
+
+		sysfs_remove_group(cpufreq_global_kobject,
+				&interactive_attr_group);
+
+		break;
+
+	case CPUFREQ_GOV_LIMITS:
+		if (policy->max < policy->cur)
+			__cpufreq_driver_target(policy,
+					policy->max, CPUFREQ_RELATION_H);
+		else if (policy->min > policy->cur)
+			__cpufreq_driver_target(policy,
+					policy->min, CPUFREQ_RELATION_L);
+		break;
+	}
+	return 0;
+}
+
+static int cpufreq_interactive_idle_notifier(struct notifier_block *nb,
+					     unsigned long val,
+					     void *data)
+{
+	switch (val) {
+	case SCHED_IDLE_START:
+		cpufreq_interactive_idle_start();
+		break;
+	case SCHED_IDLE_END:
+		cpufreq_interactive_idle_end();
+		break;
+	}
+
+	return 0;
+}
+
+static struct notifier_block cpufreq_interactive_idle_nb = {
+	.notifier_call = cpufreq_interactive_idle_notifier,
+};
+
+static int __init cpufreq_interactive_init(void)
+{
+	unsigned int i;
+	struct cpufreq_interactive_cpuinfo *pcpu;
+	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+	go_hispeed_load = DEFAULT_GO_HISPEED_LOAD;
+	min_sample_time = DEFAULT_MIN_SAMPLE_TIME;
+	timer_rate = DEFAULT_TIMER_RATE;
+
+	/* Initialize per-cpu timers */
+	for_each_possible_cpu(i) {
+		pcpu = &per_cpu(cpuinfo, i);
+		init_timer(&pcpu->cpu_timer);
+		pcpu->cpu_timer.function = cpufreq_interactive_timer;
+		pcpu->cpu_timer.data = i;
+	}
+
+	up_task = kthread_create(cpufreq_interactive_up_task, NULL,
+				 "kinteractiveup");
+	if (IS_ERR(up_task))
+		return PTR_ERR(up_task);
+
+	sched_setscheduler_nocheck(up_task, SCHED_FIFO, &param);
+	get_task_struct(up_task);
+
+	/* No rescuer thread, bind to CPU queuing the work for possibly
+	   warm cache (probably doesn't matter much). */
+	down_wq = alloc_workqueue("kinteractive_down", 0, 1);
+
+	if (!down_wq)
+		goto err_freeuptask;
+
+	INIT_WORK(&freq_scale_down_work,
+		  cpufreq_interactive_freq_down);
+
+	spin_lock_init(&up_cpumask_lock);
+	spin_lock_init(&down_cpumask_lock);
+	mutex_init(&set_speed_lock);
+
+	sched_idle_notifier_register(&cpufreq_interactive_idle_nb);
+
+	return cpufreq_register_governor(&cpufreq_gov_interactive);
+
+err_freeuptask:
+	put_task_struct(up_task);
+	return -ENOMEM;
+}
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_INTERACTIVE
+fs_initcall(cpufreq_interactive_init);
+#else
+module_init(cpufreq_interactive_init);
+#endif
+
+static void __exit cpufreq_interactive_exit(void)
+{
+	cpufreq_unregister_governor(&cpufreq_gov_interactive);
+	kthread_stop(up_task);
+	put_task_struct(up_task);
+	destroy_workqueue(down_wq);
+}
+
+module_exit(cpufreq_interactive_exit);
+
+MODULE_AUTHOR("Mike Chan <mike@android.com>");
+MODULE_DESCRIPTION("'cpufreq_interactive' - A cpufreq governor for "
+	"Latency sensitive workloads");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 6216115..c6126b9 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -363,6 +363,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_INTERACTIVE)
+extern struct cpufreq_governor cpufreq_gov_interactive;
+#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_interactive)
 #endif
 
 
-- 
1.7.7.6



* [PATCH 4/4] ARM: Move leds idle start/stop calls to sched idle notifiers
  2012-02-08  1:39 [PATCH RFC 0/4] Scheduler idle notifiers and users Anton Vorontsov
                   ` (2 preceding siblings ...)
  2012-02-08  1:44 ` [PATCH 3/4] cpufreq: New 'interactive' governor Anton Vorontsov
@ 2012-02-08  1:44 ` Anton Vorontsov
  2012-02-08  3:05 ` [PATCH RFC 0/4] Scheduler idle notifiers and users Peter Zijlstra
  4 siblings, 0 replies; 34+ messages in thread
From: Anton Vorontsov @ 2012-02-08  1:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Dave Jones, Russell King
  Cc: Oleg Nesterov, Benjamin Herrenschmidt, Paul E. McKenney,
	Nicolas Pitre, Mike Chan, Todd Poynor, cpufreq, kernel-team,
	linaro-kernel, linux-arm-kernel, linux-kernel

From: Todd Poynor <toddpoynor@google.com>

Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 arch/arm/kernel/leds.c    |   25 ++++++++++++++++++++++++-
 arch/arm/kernel/process.c |    3 ---
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/arm/kernel/leds.c b/arch/arm/kernel/leds.c
index 1911dae..fbc30d4 100644
--- a/arch/arm/kernel/leds.c
+++ b/arch/arm/kernel/leds.c
@@ -9,6 +9,8 @@
  */
 #include <linux/export.h>
 #include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/notifier.h>
 #include <linux/device.h>
 #include <linux/syscore_ops.h>
 #include <linux/string.h>
@@ -103,6 +105,25 @@ static struct syscore_ops leds_syscore_ops = {
 	.resume		= leds_resume,
 };
 
+static int leds_idle_notifier(struct notifier_block *nb, unsigned long val,
+			      void *data)
+{
+	switch (val) {
+	case SCHED_IDLE_START:
+		leds_event(led_idle_start);
+		break;
+	case SCHED_IDLE_END:
+		leds_event(led_idle_end);
+		break;
+	}
+
+	return 0;
+}
+
+static struct notifier_block leds_idle_nb = {
+	.notifier_call = leds_idle_notifier,
+};
+
 static int __init leds_init(void)
 {
 	int ret;
@@ -111,8 +132,10 @@ static int __init leds_init(void)
 		ret = device_register(&leds_device);
 	if (ret == 0)
 		ret = device_create_file(&leds_device, &dev_attr_event);
-	if (ret == 0)
+	if (ret == 0) {
 		register_syscore_ops(&leds_syscore_ops);
+		sched_idle_notifier_register(&leds_idle_nb);
+	}
 	return ret;
 }
 
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index f2bac2d..f4b53aa 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -33,7 +33,6 @@
 #include <linux/cpuidle.h>
 
 #include <asm/cacheflush.h>
-#include <asm/leds.h>
 #include <asm/processor.h>
 #include <asm/system.h>
 #include <asm/thread_notify.h>
@@ -207,7 +206,6 @@ void cpu_idle(void)
 	/* endless idle loop with no priority at all */
 	while (1) {
 		sched_idle_enter();
-		leds_event(led_idle_start);
 		while (!need_resched()) {
 #ifdef CONFIG_HOTPLUG_CPU
 			if (cpu_is_offline(smp_processor_id()))
@@ -235,7 +233,6 @@ void cpu_idle(void)
 				local_irq_enable();
 			}
 		}
-		leds_event(led_idle_end);
 		sched_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
-- 
1.7.7.6


* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-08  1:39 [PATCH RFC 0/4] Scheduler idle notifiers and users Anton Vorontsov
                   ` (3 preceding siblings ...)
  2012-02-08  1:44 ` [PATCH 4/4] ARM: Move leds idle start/stop calls to sched idle notifiers Anton Vorontsov
@ 2012-02-08  3:05 ` Peter Zijlstra
  2012-02-08 20:23   ` Dave Jones
  4 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2012-02-08  3:05 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Ingo Molnar, Dave Jones, Russell King, Oleg Nesterov,
	Benjamin Herrenschmidt, Paul E. McKenney, Nicolas Pitre,
	Mike Chan, Todd Poynor, cpufreq, kernel-team, linaro-kernel,
	linux-arm-kernel, linux-kernel, Arjan Van De Ven

On Wed, 2012-02-08 at 05:39 +0400, Anton Vorontsov wrote:
> Hi all,
> 
> For some drivers we need to know when scheduler is idling. The most
> straightforward way is to gracefully hook into the idle loop.
> 
> On x86 there are "CPU idle" notifiers in the inner idle loop, but
> scheduler idle notifiers are different. These notifiers do not run on
> every invocation/exit from cpuidle, instead they used to notify about
> scheduler state changes, not HW states.
> 
> In other words, CPU idle notifiers work inside while(!need_resched())
> loop (nested into idle loop), while scheduler idle notifier work
> outside of the loop.
> 
> The first two patches consolidate scheduler idle entry/exit
> points, and converts architectures to this new API.
> 
> The third patch is a new cpufreq governor, the commit message
> briefly describes it.

Argh, no.. cpufreq so sucks rocks. Can we please just scrap it and write
an entirely new infrastructure that is much more connected to the
scheduler and do away with this stupid need to set P-states from a
schedulable context.

We can maybe keep cpufreq around for the broken ass hardware that needs
to schedule in order to change its state, but gah.

We're going to do per-task avg-load tracking soon
(https://lkml.org/lkml/2012/2/1/763). If you can use that (if not, tell
us why), you can do task-based policy and migrate the P-state/freq along
with tasks.

By keeping per-task avg-runtime and accounting on migration we can
compute an avg-runtime per cpu, and select a freq based on that to
either minimize idle time (if that's what your platform wants) or boost
and run to idle right along with scheduling on wakeup and sleep.
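
A minimal user-space sketch of the accounting described above, for
illustration only. The struct and function names are hypothetical and do not
exist in the kernel. It also assumes each task's average runtime is expressed
as a percentage of one CPU running at its maximum frequency, which the
paragraph above does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping; these are not kernel structures. */
struct task_load {
	unsigned int avg_runtime_pct;	/* assumed: share of a CPU at max freq, 0..100 */
	int cpu;			/* CPU the task is accounted to, -1 if none */
};

struct cpu_load {
	unsigned int sum_runtime_pct;	/* sum over the tasks accounted to this CPU */
};

/* Move a task's contribution from its current CPU (if any) to CPU dst. */
static void account_migration(struct cpu_load *cpus, struct task_load *t,
			      int dst)
{
	if (t->cpu >= 0) {
		/* The old CPU must still hold this task's contribution. */
		assert(cpus[t->cpu].sum_runtime_pct >= t->avg_runtime_pct);
		cpus[t->cpu].sum_runtime_pct -= t->avg_runtime_pct;
	}
	cpus[dst].sum_runtime_pct += t->avg_runtime_pct;
	t->cpu = dst;
}

/*
 * Pick the lowest frequency in an ascending table that covers the summed
 * demand, which is clamped to 100% of the highest frequency.
 */
static unsigned int pick_freq(const struct cpu_load *c,
			      const unsigned int *table, size_t n)
{
	unsigned int load = c->sum_runtime_pct > 100 ? 100 : c->sum_runtime_pct;
	unsigned long long want = (unsigned long long)table[n - 1] * load / 100;
	size_t i;

	for (i = 0; i < n; i++)
		if (table[i] >= want)
			return table[i];
	return table[n - 1];
}
```

A real implementation would also have to resynchronise the per-CPU sum as
each task's average changes over time; the sketch only covers the migration
step.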

Arjan talked about something like that several times.. and I always
forget what policy is best for what chips etc. All I know is that
cpufreq sucks because it's strictly per-cpu and oblivious to task
movement.



* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-08  3:05 ` [PATCH RFC 0/4] Scheduler idle notifiers and users Peter Zijlstra
@ 2012-02-08 20:23   ` Dave Jones
  2012-02-08 21:33     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Jones @ 2012-02-08 20:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anton Vorontsov, Ingo Molnar, Russell King, Oleg Nesterov,
	Benjamin Herrenschmidt, Paul E. McKenney, Nicolas Pitre,
	Mike Chan, Todd Poynor, cpufreq, kernel-team, linaro-kernel,
	linux-arm-kernel, linux-kernel, Arjan Van De Ven

On Wed, Feb 08, 2012 at 04:05:55AM +0100, Peter Zijlstra wrote:
 
 > Argh, no.. cpufreq so sucks rocks. Can we please just scrap it and write
 > an entirely new infrastructure that is much more connected to the
 > scheduler and do away with this stupid need to set P-states from a
 > schedulable context.

Well there's bits of it that will live on regardless of implementation
(The lower level drivers are pretty much necessary). But all the rest..

If the new scheduler bits grew a per-task proc file for their power saving
policy (powersave/performance/scale on-demand), and a sysfs knob to set
the default policy, then I think a lot of the horrors in ondemand.c etc
could just go away.

Some of what the existing governors do would need reimplementing, but the
scheduler has the smarts to make the right decisions anyway.

The midlayer glue (cpufreq.c) could mostly go away, along with as many
of the user-facing knobs as possible.

I think the biggest mistake we ever made with cpufreq was making it
so configurable. If we redesign it, just say no to plugin governors, and
yes to a lot fewer sysfs knobs.

So, provide mechanism to kill off all the governors, and there's a
migration path from what we have now to something that just works
in a lot more cases, while remaining configurable enough for the corner-cases.

	Dave



* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-08 20:23   ` Dave Jones
@ 2012-02-08 21:33     ` Benjamin Herrenschmidt
  2012-02-09  7:51       ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Herrenschmidt @ 2012-02-08 21:33 UTC (permalink / raw)
  To: Dave Jones
  Cc: Peter Zijlstra, Anton Vorontsov, Ingo Molnar, Russell King,
	Oleg Nesterov, Paul E. McKenney, Nicolas Pitre, Mike Chan,
	Todd Poynor, cpufreq, kernel-team, linaro-kernel,
	linux-arm-kernel, linux-kernel, Arjan Van De Ven

On Wed, 2012-02-08 at 15:23 -0500, Dave Jones wrote:
> I think the biggest mistake we ever made with cpufreq was making it
> so configurable. If we redesign it, just say no to plugin governors, and
> yes to a lot fewer sysfs knobs.
> 
> So, provide mechanism to kill off all the governors, and there's a
> migration path from what we have now to something that just works
> in a lot more cases, while remaining configurable enough for the corner-cases.

On the other hand, the need for schedulable contexts may not necessarily
go away.

If you look beyond x86, there are several issues that come into the
picture. i2c clock chips & power control chips are slow (the i2c bus
itself is). You don't want to spin for hundreds of microseconds while you
do those transactions.

I have seen many cases where the clock control can be done quite
quickly, but on the other hand, the voltage control takes dozens of ms
to reach the target value & stabilize.

That could be done asynchronously .. as long as the scheduler doesn't
constantly hammer it with change requests.
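
One way to picture "asynchronous without hammering" is a coalescing
requester: the fast path only records the latest wanted frequency, and a
context that may sleep applies it no more often than a minimum interval.
This is a user-space sketch under those assumptions. The struct, the function
names and the interval policy are hypothetical and are not taken from any
posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request coalescer; not an existing kernel interface. */
struct freq_requester {
	uint64_t min_interval_us;	/* minimum spacing between transitions */
	uint64_t last_change_us;	/* when the last transition was started */
	unsigned int cur_khz;		/* frequency last handed to the hardware */
	unsigned int pending_khz;	/* latest unapplied request, 0 if none */
};

/*
 * Record a request from the fast path. Returns true if the caller should
 * kick the slow path now, false if the request is only remembered.
 */
static bool freq_request(struct freq_requester *r, unsigned int khz,
			 uint64_t now_us)
{
	assert(now_us >= r->last_change_us);	/* assumes a monotonic clock */

	if (khz == r->cur_khz) {
		r->pending_khz = 0;	/* cancels any queued change */
		return false;
	}
	r->pending_khz = khz;		/* newer requests overwrite older ones */
	return now_us - r->last_change_us >= r->min_interval_us;
}

/*
 * Slow path, run from a context that may sleep (e.g. around an i2c
 * transfer). Applies only the most recent request and returns the
 * frequency now in effect.
 */
static unsigned int freq_apply(struct freq_requester *r, uint64_t now_us)
{
	if (r->pending_khz) {
		r->cur_khz = r->pending_khz;
		r->pending_khz = 0;
		r->last_change_us = now_us;
	}
	return r->cur_khz;
}
```

Requests that arrive inside the interval are not lost, they are merged: only
the last one is applied once the slow path runs again, so a slow regulator
sees at most one transition per interval.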

Cheers,
Ben.




* Re: [PATCH 3/4] cpufreq: New 'interactive' governor
  2012-02-08  1:44 ` [PATCH 3/4] cpufreq: New 'interactive' governor Anton Vorontsov
@ 2012-02-08 23:00   ` Vincent Guittot
  2012-02-09  0:32     ` Anton Vorontsov
  0 siblings, 1 reply; 34+ messages in thread
From: Vincent Guittot @ 2012-02-08 23:00 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Ingo Molnar, Peter Zijlstra, Dave Jones, Russell King,
	Oleg Nesterov, Benjamin Herrenschmidt, Paul E. McKenney,
	Nicolas Pitre, Mike Chan, Todd Poynor, cpufreq, kernel-team,
	linaro-kernel, linux-arm-kernel, linux-kernel

Hi Anton,

Do you have some figures that show the improvement in responsiveness
compared to other governors such as ondemand? It would be interesting
to test the interactive governor with cpufreq-bench and compare the
results with ondemand.

Regards,
Vincent

On 7 February 2012 17:44, Anton Vorontsov <anton.vorontsov@linaro.org> wrote:
> From: Mike Chan <mike@android.com>
>
> This governor is designed for latency-sensitive workloads, such as
> interactive user interfaces.  The interactive governor aims to be
> significantly more responsive to ramp CPU quickly up when CPU-intensive
> activity begins.
>
> Existing governors sample CPU load at a particular rate, typically
> every X ms.  This can lead to under-powering UI threads for the period of
> time during which the user begins interacting with a previously-idle system
> until the next sample period happens.
>
> The 'interactive' governor uses a different approach. Instead of sampling
> the CPU at a specified rate, the governor will check whether to scale the
> CPU frequency up soon after coming out of idle.  When the CPU comes out of
> idle, a timer is configured to fire within 1-2 ticks.  If the CPU is very
> busy from exiting idle to when the timer fires then we assume the CPU is
> underpowered and ramp to MAX speed.
>
> If the CPU was not sufficiently busy to immediately ramp to MAX speed, then
> the governor evaluates the CPU load since the last speed adjustment,
> choosing the highest value between that longer-term load or the short-term
> load since idle exit to determine the CPU speed to ramp to.
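
The load arithmetic described in the two paragraphs above can be sketched
outside the kernel as follows. It mirrors the calculation in the quoted
cpufreq_interactive_timer() further down, but the function names load_pct()
and pick_target_khz() are made up for this illustration, and the result
would still have to be rounded to an entry of the frequency table.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Busy percentage of a sampling window. Returns 0 for an empty window or
 * when the idle time exceeds the window length.
 */
static unsigned int load_pct(uint64_t delta_time_us, uint64_t delta_idle_us)
{
	if (delta_time_us == 0 || delta_idle_us > delta_time_us)
		return 0;
	return (unsigned int)(100 * (delta_time_us - delta_idle_us) /
			      delta_time_us);
}

/*
 * Choose a raw target frequency from the larger of the short-term load
 * (since idle exit) and the long-term load (since the last speed change).
 */
static unsigned int pick_target_khz(unsigned int short_load,
				    unsigned int long_load,
				    unsigned int cur_khz,
				    unsigned int min_khz,
				    unsigned int max_khz,
				    unsigned int hispeed_khz,
				    unsigned int go_hispeed_load)
{
	unsigned int load = short_load > long_load ? short_load : long_load;

	assert(load <= 100);	/* both inputs are percentages */

	if (load >= go_hispeed_load) {
		/* Burst from the lowest speed: jump straight to hispeed. */
		if (cur_khz == min_khz)
			return hispeed_khz;
		return (unsigned int)((uint64_t)max_khz * load / 100);
	}
	/* Otherwise scale the current speed by the observed load. */
	return (unsigned int)((uint64_t)cur_khz * load / 100);
}
```

For example, a 20000 us window with 5000 us idle gives a load of 75, and with
go_hispeed_load at 95 a CPU running at 800000 kHz with loads of 50 and 60
gets a raw target of 480000 kHz.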
>
> A realtime thread is used for scaling up, giving the remaining tasks the
> CPU performance benefit, unlike existing governors which are more likely to
> schedule rampup work to occur after your performance starved tasks have
> completed.
>
> The tuneables for this governor are:
> /sys/devices/system/cpu/cpufreq/interactive/min_sample_time:
>        The minimum amount of time to spend at the current frequency before
>        ramping down. This is to ensure that the governor has seen enough
>        historic CPU load data to determine the appropriate workload.
>        Default is 20000 uS.
> /sys/devices/system/cpu/cpufreq/interactive/go_hispeed_load
>        The CPU load at which to ramp to max speed.  Default is 95.
>
> Signed-off-by: Mike Chan <mike@android.com>
> Signed-off-by: Todd Poynor <toddpoynor@google.com>
> Signed-off-by: Allen Martin <amartin@nvidia.com> (submitted improvements)
> Signed-off-by: Axel Haslam <axelhaslam@ti.com> (submitted improvements)
> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
> ---
>  drivers/cpufreq/Kconfig               |   26 ++
>  drivers/cpufreq/Makefile              |    1 +
>  drivers/cpufreq/cpufreq_interactive.c |  700 +++++++++++++++++++++++++++++++++
>  include/linux/cpufreq.h               |    3 +
>  4 files changed, 730 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/cpufreq/cpufreq_interactive.c
>
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index e24a2a1..c47cc46 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -99,6 +99,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
>          Be aware that not all cpufreq drivers support the conservative
>          governor. If unsure have a look at the help section of the
>          driver. Fallback governor will be the performance governor.
> +
> +config CPU_FREQ_DEFAULT_GOV_INTERACTIVE
> +       bool "interactive"
> +       select CPU_FREQ_GOV_INTERACTIVE
> +       help
> +         Use the CPUFreq governor 'interactive' as default. This allows
> +         you to get a full dynamic cpu frequency capable system by simply
> +         loading your cpufreq low-level hardware driver, using the
> +         'interactive' governor for latency-sensitive workloads.
>  endchoice
>
>  config CPU_FREQ_GOV_PERFORMANCE
> @@ -179,6 +188,23 @@ config CPU_FREQ_GOV_CONSERVATIVE
>
>          If in doubt, say N.
>
> +config CPU_FREQ_GOV_INTERACTIVE
> +       tristate "'interactive' cpufreq policy governor"
> +       help
> +         'interactive' - This driver adds a dynamic cpufreq policy governor
> +         designed for latency-sensitive workloads.
> +
> +         This governor attempts to reduce the latency of clock
> +         increases so that the system is more responsive to
> +         interactive workloads.
> +
> +         To compile this driver as a module, choose M here: the
> +         module will be called cpufreq_interactive.
> +
> +         For details, take a look at linux/Documentation/cpu-freq.
> +
> +         If in doubt, say N.
> +
>  menu "x86 CPU frequency scaling drivers"
>  depends on X86
>  source "drivers/cpufreq/Kconfig.x86"
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
> index ac000fa..f84c99b 100644
> --- a/drivers/cpufreq/Makefile
> +++ b/drivers/cpufreq/Makefile
> @@ -9,6 +9,7 @@ obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)    += cpufreq_powersave.o
>  obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)   += cpufreq_userspace.o
>  obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)    += cpufreq_ondemand.o
>  obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)        += cpufreq_conservative.o
> +obj-$(CONFIG_CPU_FREQ_GOV_INTERACTIVE) += cpufreq_interactive.o
>
>  # CPUfreq cross-arch helpers
>  obj-$(CONFIG_CPU_FREQ_TABLE)           += freq_table.o
> diff --git a/drivers/cpufreq/cpufreq_interactive.c b/drivers/cpufreq/cpufreq_interactive.c
> new file mode 100644
> index 0000000..188096a
> --- /dev/null
> +++ b/drivers/cpufreq/cpufreq_interactive.c
> @@ -0,0 +1,700 @@
> +/*
> + * Copyright (C) 2010 Google, Inc.
> + *
> + * This software is licensed under the terms of the GNU General Public
> + * License version 2, as published by the Free Software Foundation, and
> + * may be copied, distributed, and modified under those terms.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Author: Mike Chan (mike@android.com)
> + */
> +
> +#include <linux/module.h>
> +#include <linux/cpu.h>
> +#include <linux/cpumask.h>
> +#include <linux/cpufreq.h>
> +#include <linux/mutex.h>
> +#include <linux/sched.h>
> +#include <linux/tick.h>
> +#include <linux/time.h>
> +#include <linux/timer.h>
> +#include <linux/workqueue.h>
> +#include <linux/kthread.h>
> +#include <linux/mutex.h>
> +
> +static atomic_t active_count = ATOMIC_INIT(0);
> +
> +struct cpufreq_interactive_cpuinfo {
> +       struct timer_list cpu_timer;
> +       int timer_idlecancel;
> +       u64 time_in_idle;
> +       u64 idle_exit_time;
> +       u64 timer_run_time;
> +       int idling;
> +       u64 freq_change_time;
> +       u64 freq_change_time_in_idle;
> +       struct cpufreq_policy *policy;
> +       struct cpufreq_frequency_table *freq_table;
> +       unsigned int target_freq;
> +       int governor_enabled;
> +};
> +
> +static DEFINE_PER_CPU(struct cpufreq_interactive_cpuinfo, cpuinfo);
> +
> +/* Workqueues handle frequency scaling */
> +static struct task_struct *up_task;
> +static struct workqueue_struct *down_wq;
> +static struct work_struct freq_scale_down_work;
> +static cpumask_t up_cpumask;
> +static spinlock_t up_cpumask_lock;
> +static cpumask_t down_cpumask;
> +static spinlock_t down_cpumask_lock;
> +static struct mutex set_speed_lock;
> +
> +/* Hi speed to bump to from lo speed when load burst (default max) */
> +static u64 hispeed_freq;
> +
> +/* Go to hi speed when CPU load at or above this value. */
> +#define DEFAULT_GO_HISPEED_LOAD 95
> +static unsigned long go_hispeed_load;
> +
> +/*
> + * The minimum amount of time to spend at a frequency before we can ramp down.
> + */
> +#define DEFAULT_MIN_SAMPLE_TIME (20 * USEC_PER_MSEC)
> +static unsigned long min_sample_time;
> +
> +/*
> + * The sample rate of the timer used to increase frequency
> + */
> +#define DEFAULT_TIMER_RATE (20 * USEC_PER_MSEC)
> +static unsigned long timer_rate;
> +
> +static int cpufreq_governor_interactive(struct cpufreq_policy *policy,
> +               unsigned int event);
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_INTERACTIVE
> +static
> +#endif
> +struct cpufreq_governor cpufreq_gov_interactive = {
> +       .name = "interactive",
> +       .governor = cpufreq_governor_interactive,
> +       .max_transition_latency = 10000000,
> +       .owner = THIS_MODULE,
> +};
> +
> +static void cpufreq_interactive_timer(unsigned long data)
> +{
> +       unsigned int delta_idle;
> +       unsigned int delta_time;
> +       int cpu_load;
> +       int load_since_change;
> +       u64 time_in_idle;
> +       u64 idle_exit_time;
> +       struct cpufreq_interactive_cpuinfo *pcpu =
> +               &per_cpu(cpuinfo, data);
> +       u64 now_idle;
> +       unsigned int new_freq;
> +       unsigned int index;
> +       unsigned long flags;
> +
> +       smp_rmb();
> +
> +       if (!pcpu->governor_enabled)
> +               goto exit;
> +
> +       /*
> +        * Once pcpu->timer_run_time is updated to >= pcpu->idle_exit_time,
> +        * this lets idle exit know the current idle time sample has
> +        * been processed, and idle exit can generate a new sample and
> +        * re-arm the timer.  This prevents a concurrent idle
> +        * exit on that CPU from writing a new set of info at the same time
> +        * the timer function runs (the timer function can't use that info
> +        * until more time passes).
> +        */
> +       time_in_idle = pcpu->time_in_idle;
> +       idle_exit_time = pcpu->idle_exit_time;
> +       now_idle = get_cpu_idle_time_us(data, &pcpu->timer_run_time);
> +       smp_wmb();
> +
> +       /* If we raced with cancelling a timer, skip. */
> +       if (!idle_exit_time)
> +               goto exit;
> +
> +       delta_idle = (unsigned int) (now_idle - time_in_idle);
> +       delta_time = (unsigned int) (pcpu->timer_run_time - idle_exit_time);
> +
> +       /*
> +        * If timer ran less than 1ms after short-term sample started, retry.
> +        */
> +       if (delta_time < 1000)
> +               goto rearm;
> +
> +       if (delta_idle > delta_time)
> +               cpu_load = 0;
> +       else
> +               cpu_load = 100 * (delta_time - delta_idle) / delta_time;
> +
> +       delta_idle = (unsigned int) (now_idle - pcpu->freq_change_time_in_idle);
> +       delta_time = (unsigned int) (pcpu->timer_run_time - pcpu->freq_change_time);
> +
> +       if ((delta_time == 0) || (delta_idle > delta_time))
> +               load_since_change = 0;
> +       else
> +               load_since_change =
> +                       100 * (delta_time - delta_idle) / delta_time;
> +
> +       /*
> +        * Choose greater of short-term load (since last idle timer
> +        * started or timer function re-armed itself) or long-term load
> +        * (since last frequency change).
> +        */
> +       if (load_since_change > cpu_load)
> +               cpu_load = load_since_change;
> +
> +       if (cpu_load >= go_hispeed_load) {
> +               if (pcpu->policy->cur == pcpu->policy->min)
> +                       new_freq = hispeed_freq;
> +               else
> +                       new_freq = pcpu->policy->max * cpu_load / 100;
> +       } else {
> +               new_freq = pcpu->policy->cur * cpu_load / 100;
> +       }
> +
> +       if (cpufreq_frequency_table_target(pcpu->policy, pcpu->freq_table,
> +                                          new_freq, CPUFREQ_RELATION_H,
> +                                          &index)) {
> +               pr_warn_once("timer %d: cpufreq_frequency_table_target error\n",
> +                            (int) data);
> +               goto rearm;
> +       }
> +
> +       new_freq = pcpu->freq_table[index].frequency;
> +
> +       if (pcpu->target_freq == new_freq)
> +               goto rearm_if_notmax;
> +
> +       /*
> +        * Do not scale down unless we have been at this frequency for the
> +        * minimum sample time.
> +        */
> +       if (new_freq < pcpu->target_freq) {
> +               if (pcpu->timer_run_time - pcpu->freq_change_time
> +                               < min_sample_time)
> +                       goto rearm;
> +       }
> +
> +       if (new_freq < pcpu->target_freq) {
> +               pcpu->target_freq = new_freq;
> +               spin_lock_irqsave(&down_cpumask_lock, flags);
> +               cpumask_set_cpu(data, &down_cpumask);
> +               spin_unlock_irqrestore(&down_cpumask_lock, flags);
> +               queue_work(down_wq, &freq_scale_down_work);
> +       } else {
> +               pcpu->target_freq = new_freq;
> +               spin_lock_irqsave(&up_cpumask_lock, flags);
> +               cpumask_set_cpu(data, &up_cpumask);
> +               spin_unlock_irqrestore(&up_cpumask_lock, flags);
> +               wake_up_process(up_task);
> +       }
> +
> +rearm_if_notmax:
> +       /*
> +        * Already set max speed and don't see a need to change that,
> +        * wait until next idle to re-evaluate, don't need timer.
> +        */
> +       if (pcpu->target_freq == pcpu->policy->max)
> +               goto exit;
> +
> +rearm:
> +       if (!timer_pending(&pcpu->cpu_timer)) {
> +               /*
> +                * If already at min: if that CPU is idle, don't set timer.
> +                * Else cancel the timer if that CPU goes idle.  We don't
> +                * need to re-evaluate speed until the next idle exit.
> +                */
> +               if (pcpu->target_freq == pcpu->policy->min) {
> +                       smp_rmb();
> +
> +                       if (pcpu->idling)
> +                               goto exit;
> +
> +                       pcpu->timer_idlecancel = 1;
> +               }
> +
> +               pcpu->time_in_idle = get_cpu_idle_time_us(
> +                       data, &pcpu->idle_exit_time);
> +               mod_timer(&pcpu->cpu_timer,
> +                         jiffies + usecs_to_jiffies(timer_rate));
> +       }
> +
> +exit:
> +       return;
> +}
> +
> +static void cpufreq_interactive_idle_start(void)
> +{
> +       struct cpufreq_interactive_cpuinfo *pcpu =
> +               &per_cpu(cpuinfo, smp_processor_id());
> +       int pending;
> +
> +       if (!pcpu->governor_enabled)
> +               return;
> +
> +       pcpu->idling = 1;
> +       smp_wmb();
> +       pending = timer_pending(&pcpu->cpu_timer);
> +
> +       if (pcpu->target_freq != pcpu->policy->min) {
> +#ifdef CONFIG_SMP
> +               /*
> +                * Entering idle while not at lowest speed.  On some
> +                * platforms this can hold the other CPU(s) at that speed
> +                * even though the CPU is idle. Set a timer to re-evaluate
> +                * speed so this idle CPU doesn't hold the other CPUs above
> +                * min indefinitely.  This should probably be a quirk of
> +                * the CPUFreq driver.
> +                */
> +               if (!pending) {
> +                       pcpu->time_in_idle = get_cpu_idle_time_us(
> +                               smp_processor_id(), &pcpu->idle_exit_time);
> +                       pcpu->timer_idlecancel = 0;
> +                       mod_timer(&pcpu->cpu_timer,
> +                                 jiffies + usecs_to_jiffies(timer_rate));
> +               }
> +#endif
> +       } else {
> +               /*
> +                * If at min speed and entering idle after load has
> +                * already been evaluated, and a timer has been set just in
> +                * case the CPU suddenly goes busy, cancel that timer.  The
> +                * CPU didn't go busy; we'll recheck things upon idle exit.
> +                */
> +               if (pending && pcpu->timer_idlecancel) {
> +                       del_timer(&pcpu->cpu_timer);
> +                       /*
> +                        * Ensure last timer run time is after current idle
> +                        * sample start time, so next idle exit will always
> +                        * start a new idle sampling period.
> +                        */
> +                       pcpu->idle_exit_time = 0;
> +                       pcpu->timer_idlecancel = 0;
> +               }
> +       }
> +
> +}
> +
> +static void cpufreq_interactive_idle_end(void)
> +{
> +       struct cpufreq_interactive_cpuinfo *pcpu =
> +               &per_cpu(cpuinfo, smp_processor_id());
> +
> +       pcpu->idling = 0;
> +       smp_wmb();
> +
> +       /*
> +        * Arm the timer for 1-2 ticks later if not already, and if the timer
> +        * function has already processed the previous load sampling
> +        * interval.  (If the timer is not pending but has not processed
> +        * the previous interval, it is probably racing with us on another
> +        * CPU.  Let it compute load based on the previous sample and then
> +        * re-arm the timer for another interval when it's done, rather
> +        * than updating the interval start time to be "now", which doesn't
> +        * give the timer function enough time to make a decision on this
> +        * run.)
> +        */
> +       if (timer_pending(&pcpu->cpu_timer) == 0 &&
> +           pcpu->timer_run_time >= pcpu->idle_exit_time &&
> +           pcpu->governor_enabled) {
> +               pcpu->time_in_idle =
> +                       get_cpu_idle_time_us(smp_processor_id(),
> +                                            &pcpu->idle_exit_time);
> +               pcpu->timer_idlecancel = 0;
> +               mod_timer(&pcpu->cpu_timer,
> +                         jiffies + usecs_to_jiffies(timer_rate));
> +       }
> +
> +}
> +
> +static int cpufreq_interactive_up_task(void *data)
> +{
> +       unsigned int cpu;
> +       cpumask_t tmp_mask;
> +       unsigned long flags;
> +       struct cpufreq_interactive_cpuinfo *pcpu;
> +
> +       while (1) {
> +               set_current_state(TASK_INTERRUPTIBLE);
> +               spin_lock_irqsave(&up_cpumask_lock, flags);
> +
> +               if (cpumask_empty(&up_cpumask)) {
> +                       spin_unlock_irqrestore(&up_cpumask_lock, flags);
> +                       schedule();
> +
> +                       if (kthread_should_stop())
> +                               break;
> +
> +                       spin_lock_irqsave(&up_cpumask_lock, flags);
> +               }
> +
> +               set_current_state(TASK_RUNNING);
> +               tmp_mask = up_cpumask;
> +               cpumask_clear(&up_cpumask);
> +               spin_unlock_irqrestore(&up_cpumask_lock, flags);
> +
> +               for_each_cpu(cpu, &tmp_mask) {
> +                       unsigned int j;
> +                       unsigned int max_freq = 0;
> +
> +                       pcpu = &per_cpu(cpuinfo, cpu);
> +                       smp_rmb();
> +
> +                       if (!pcpu->governor_enabled)
> +                               continue;
> +
> +                       mutex_lock(&set_speed_lock);
> +
> +                       for_each_cpu(j, pcpu->policy->cpus) {
> +                               struct cpufreq_interactive_cpuinfo *pjcpu =
> +                                       &per_cpu(cpuinfo, j);
> +
> +                               if (pjcpu->target_freq > max_freq)
> +                                       max_freq = pjcpu->target_freq;
> +                       }
> +
> +                       if (max_freq != pcpu->policy->cur)
> +                               __cpufreq_driver_target(pcpu->policy,
> +                                                       max_freq,
> +                                                       CPUFREQ_RELATION_H);
> +                       mutex_unlock(&set_speed_lock);
> +
> +                       pcpu->freq_change_time_in_idle =
> +                               get_cpu_idle_time_us(cpu,
> +                                                    &pcpu->freq_change_time);
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static void cpufreq_interactive_freq_down(struct work_struct *work)
> +{
> +       unsigned int cpu;
> +       cpumask_t tmp_mask;
> +       unsigned long flags;
> +       struct cpufreq_interactive_cpuinfo *pcpu;
> +
> +       spin_lock_irqsave(&down_cpumask_lock, flags);
> +       tmp_mask = down_cpumask;
> +       cpumask_clear(&down_cpumask);
> +       spin_unlock_irqrestore(&down_cpumask_lock, flags);
> +
> +       for_each_cpu(cpu, &tmp_mask) {
> +               unsigned int j;
> +               unsigned int max_freq = 0;
> +
> +               pcpu = &per_cpu(cpuinfo, cpu);
> +               smp_rmb();
> +
> +               if (!pcpu->governor_enabled)
> +                       continue;
> +
> +               mutex_lock(&set_speed_lock);
> +
> +               for_each_cpu(j, pcpu->policy->cpus) {
> +                       struct cpufreq_interactive_cpuinfo *pjcpu =
> +                               &per_cpu(cpuinfo, j);
> +
> +                       if (pjcpu->target_freq > max_freq)
> +                               max_freq = pjcpu->target_freq;
> +               }
> +
> +               if (max_freq != pcpu->policy->cur)
> +                       __cpufreq_driver_target(pcpu->policy, max_freq,
> +                                               CPUFREQ_RELATION_H);
> +
> +               mutex_unlock(&set_speed_lock);
> +               pcpu->freq_change_time_in_idle =
> +                       get_cpu_idle_time_us(cpu,
> +                                            &pcpu->freq_change_time);
> +       }
> +}
> +
> +static ssize_t show_hispeed_freq(struct kobject *kobj,
> +                                struct attribute *attr, char *buf)
> +{
> +       return sprintf(buf, "%llu\n", hispeed_freq);
> +}
> +
> +static ssize_t store_hispeed_freq(struct kobject *kobj,
> +                                 struct attribute *attr, const char *buf,
> +                                 size_t count)
> +{
> +       int ret;
> +       u64 val;
> +
> +       ret = strict_strtoull(buf, 0, &val);
> +       if (ret < 0)
> +               return ret;
> +       hispeed_freq = val;
> +       return count;
> +}
> +
> +static struct global_attr hispeed_freq_attr = __ATTR(hispeed_freq, 0644,
> +               show_hispeed_freq, store_hispeed_freq);
> +
> +
> +static ssize_t show_go_hispeed_load(struct kobject *kobj,
> +                                    struct attribute *attr, char *buf)
> +{
> +       return sprintf(buf, "%lu\n", go_hispeed_load);
> +}
> +
> +static ssize_t store_go_hispeed_load(struct kobject *kobj,
> +                       struct attribute *attr, const char *buf, size_t count)
> +{
> +       int ret;
> +       unsigned long val;
> +
> +       ret = strict_strtoul(buf, 0, &val);
> +       if (ret < 0)
> +               return ret;
> +       go_hispeed_load = val;
> +       return count;
> +}
> +
> +static struct global_attr go_hispeed_load_attr = __ATTR(go_hispeed_load, 0644,
> +               show_go_hispeed_load, store_go_hispeed_load);
> +
> +static ssize_t show_min_sample_time(struct kobject *kobj,
> +                               struct attribute *attr, char *buf)
> +{
> +       return sprintf(buf, "%lu\n", min_sample_time);
> +}
> +
> +static ssize_t store_min_sample_time(struct kobject *kobj,
> +                       struct attribute *attr, const char *buf, size_t count)
> +{
> +       int ret;
> +       unsigned long val;
> +
> +       ret = strict_strtoul(buf, 0, &val);
> +       if (ret < 0)
> +               return ret;
> +       min_sample_time = val;
> +       return count;
> +}
> +
> +static struct global_attr min_sample_time_attr = __ATTR(min_sample_time, 0644,
> +               show_min_sample_time, store_min_sample_time);
> +
> +static ssize_t show_timer_rate(struct kobject *kobj,
> +                       struct attribute *attr, char *buf)
> +{
> +       return sprintf(buf, "%lu\n", timer_rate);
> +}
> +
> +static ssize_t store_timer_rate(struct kobject *kobj,
> +                       struct attribute *attr, const char *buf, size_t count)
> +{
> +       int ret;
> +       unsigned long val;
> +
> +       ret = strict_strtoul(buf, 0, &val);
> +       if (ret < 0)
> +               return ret;
> +       timer_rate = val;
> +       return count;
> +}
> +
> +static struct global_attr timer_rate_attr = __ATTR(timer_rate, 0644,
> +               show_timer_rate, store_timer_rate);
> +
> +static struct attribute *interactive_attributes[] = {
> +       &hispeed_freq_attr.attr,
> +       &go_hispeed_load_attr.attr,
> +       &min_sample_time_attr.attr,
> +       &timer_rate_attr.attr,
> +       NULL,
> +};
> +
> +static struct attribute_group interactive_attr_group = {
> +       .attrs = interactive_attributes,
> +       .name = "interactive",
> +};
> +
> +static int cpufreq_governor_interactive(struct cpufreq_policy *policy,
> +               unsigned int event)
> +{
> +       int rc;
> +       unsigned int j;
> +       struct cpufreq_interactive_cpuinfo *pcpu;
> +       struct cpufreq_frequency_table *freq_table;
> +
> +       switch (event) {
> +       case CPUFREQ_GOV_START:
> +               if (!cpu_online(policy->cpu))
> +                       return -EINVAL;
> +
> +               freq_table =
> +                       cpufreq_frequency_get_table(policy->cpu);
> +
> +               for_each_cpu(j, policy->cpus) {
> +                       pcpu = &per_cpu(cpuinfo, j);
> +                       pcpu->policy = policy;
> +                       pcpu->target_freq = policy->cur;
> +                       pcpu->freq_table = freq_table;
> +                       pcpu->freq_change_time_in_idle =
> +                               get_cpu_idle_time_us(j,
> +                                            &pcpu->freq_change_time);
> +                       pcpu->governor_enabled = 1;
> +                       smp_wmb();
> +               }
> +
> +               if (!hispeed_freq)
> +                       hispeed_freq = policy->max;
> +
> +               /*
> +                * Do not register the idle hook and create sysfs
> +                * entries if we have already done so.
> +                */
> +               if (atomic_inc_return(&active_count) > 1)
> +                       return 0;
> +
> +               rc = sysfs_create_group(cpufreq_global_kobject,
> +                               &interactive_attr_group);
> +               if (rc)
> +                       return rc;
> +
> +               break;
> +
> +       case CPUFREQ_GOV_STOP:
> +               for_each_cpu(j, policy->cpus) {
> +                       pcpu = &per_cpu(cpuinfo, j);
> +                       pcpu->governor_enabled = 0;
> +                       smp_wmb();
> +                       del_timer_sync(&pcpu->cpu_timer);
> +
> +                       /*
> +                        * Reset idle exit time since we may cancel the timer
> +                        * before it can run after the last idle exit time,
> +                        * to avoid tripping the check in idle exit for a timer
> +                        * that is trying to run.
> +                        */
> +                       pcpu->idle_exit_time = 0;
> +               }
> +
> +               flush_work(&freq_scale_down_work);
> +               if (atomic_dec_return(&active_count) > 0)
> +                       return 0;
> +
> +               sysfs_remove_group(cpufreq_global_kobject,
> +                               &interactive_attr_group);
> +
> +               break;
> +
> +       case CPUFREQ_GOV_LIMITS:
> +               if (policy->max < policy->cur)
> +                       __cpufreq_driver_target(policy,
> +                                       policy->max, CPUFREQ_RELATION_H);
> +               else if (policy->min > policy->cur)
> +                       __cpufreq_driver_target(policy,
> +                                       policy->min, CPUFREQ_RELATION_L);
> +               break;
> +       }
> +       return 0;
> +}
> +
> +static int cpufreq_interactive_idle_notifier(struct notifier_block *nb,
> +                                            unsigned long val,
> +                                            void *data)
> +{
> +       switch (val) {
> +       case SCHED_IDLE_START:
> +               cpufreq_interactive_idle_start();
> +               break;
> +       case SCHED_IDLE_END:
> +               cpufreq_interactive_idle_end();
> +               break;
> +       }
> +
> +       return 0;
> +}
> +
> +static struct notifier_block cpufreq_interactive_idle_nb = {
> +       .notifier_call = cpufreq_interactive_idle_notifier,
> +};
> +
> +static int __init cpufreq_interactive_init(void)
> +{
> +       unsigned int i;
> +       struct cpufreq_interactive_cpuinfo *pcpu;
> +       struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> +
> +       go_hispeed_load = DEFAULT_GO_HISPEED_LOAD;
> +       min_sample_time = DEFAULT_MIN_SAMPLE_TIME;
> +       timer_rate = DEFAULT_TIMER_RATE;
> +
> +       /* Initialize per-cpu timers */
> +       for_each_possible_cpu(i) {
> +               pcpu = &per_cpu(cpuinfo, i);
> +               init_timer(&pcpu->cpu_timer);
> +               pcpu->cpu_timer.function = cpufreq_interactive_timer;
> +               pcpu->cpu_timer.data = i;
> +       }
> +
> +       up_task = kthread_create(cpufreq_interactive_up_task, NULL,
> +                                "kinteractiveup");
> +       if (IS_ERR(up_task))
> +               return PTR_ERR(up_task);
> +
> +       sched_setscheduler_nocheck(up_task, SCHED_FIFO, &param);
> +       get_task_struct(up_task);
> +
> +       /* No rescuer thread, bind to CPU queuing the work for possibly
> +          warm cache (probably doesn't matter much). */
> +       down_wq = alloc_workqueue("kinteractive_down", 0, 1);
> +
> +       if (!down_wq)
> +               goto err_freeuptask;
> +
> +       INIT_WORK(&freq_scale_down_work,
> +                 cpufreq_interactive_freq_down);
> +
> +       spin_lock_init(&up_cpumask_lock);
> +       spin_lock_init(&down_cpumask_lock);
> +       mutex_init(&set_speed_lock);
> +
> +       sched_idle_notifier_register(&cpufreq_interactive_idle_nb);
> +
> +       return cpufreq_register_governor(&cpufreq_gov_interactive);
> +
> +err_freeuptask:
> +       put_task_struct(up_task);
> +       return -ENOMEM;
> +}
> +
> +#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_INTERACTIVE
> +fs_initcall(cpufreq_interactive_init);
> +#else
> +module_init(cpufreq_interactive_init);
> +#endif
> +
> +static void __exit cpufreq_interactive_exit(void)
> +{
> +       cpufreq_unregister_governor(&cpufreq_gov_interactive);
> +       kthread_stop(up_task);
> +       put_task_struct(up_task);
> +       destroy_workqueue(down_wq);
> +}
> +
> +module_exit(cpufreq_interactive_exit);
> +
> +MODULE_AUTHOR("Mike Chan <mike@android.com>");
> +MODULE_DESCRIPTION("'cpufreq_interactive' - A cpufreq governor for "
> +       "Latency sensitive workloads");
> +MODULE_LICENSE("GPL");
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 6216115..c6126b9 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -363,6 +363,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
>  #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
>  extern struct cpufreq_governor cpufreq_gov_conservative;
>  #define CPUFREQ_DEFAULT_GOVERNOR       (&cpufreq_gov_conservative)
> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_INTERACTIVE)
> +extern struct cpufreq_governor cpufreq_gov_interactive;
> +#define CPUFREQ_DEFAULT_GOVERNOR       (&cpufreq_gov_interactive)
>  #endif
>
>
> --
> 1.7.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe cpufreq" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
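
For readers skimming the quoted patch, the load and target-frequency
arithmetic in cpufreq_interactive_timer() can be sketched in plain
userspace C. This is only an illustration under assumptions: the helper
names load_pct() and raw_target_khz() are made up here and do not appear
in the patch, the times are taken to be microseconds (as returned by
get_cpu_idle_time_us()), and the sketch uses 64-bit intermediates where
the patch uses unsigned int.

```c
#include <stdint.h>

/* Busy percentage of a sampling window; both arguments in microseconds. */
static unsigned int load_pct(uint64_t delta_time, uint64_t delta_idle)
{
	/* As in the patch, an empty or inconsistent window counts as 0% load. */
	if (delta_time == 0 || delta_idle > delta_time)
		return 0;
	return (unsigned int)(100 * (delta_time - delta_idle) / delta_time);
}

/*
 * Raw target frequency in kHz, before it is snapped to the frequency
 * table.  The governor uses the greater of the short-term load and the
 * load since the last frequency change.
 */
static unsigned int raw_target_khz(unsigned int short_load,
				   unsigned int long_load,
				   unsigned int go_hispeed_load,
				   unsigned int cur, unsigned int min,
				   unsigned int max,
				   unsigned int hispeed_freq)
{
	unsigned int load = short_load > long_load ? short_load : long_load;

	if (load >= go_hispeed_load)
		return cur == min ? hispeed_freq : max * load / 100;
	return cur * load / 100;
}
```

In the patch the raw value is then passed to
cpufreq_frequency_table_target() with CPUFREQ_RELATION_H, and the short-term
sample is retried rather than evaluated when the window is under 1 ms; the
sketch leaves both steps out.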

* Re: [PATCH 3/4] cpufreq: New 'interactive' governor
  2012-02-08 23:00   ` Vincent Guittot
@ 2012-02-09  0:32     ` Anton Vorontsov
  0 siblings, 0 replies; 34+ messages in thread
From: Anton Vorontsov @ 2012-02-09  0:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Dave Jones, Russell King,
	Oleg Nesterov, Benjamin Herrenschmidt, Paul E. McKenney,
	Nicolas Pitre, Mike Chan, Todd Poynor, cpufreq, kernel-team,
	linaro-kernel, linux-arm-kernel, linux-kernel

Hello Vincent,

On Wed, Feb 08, 2012 at 03:00:59PM -0800, Vincent Guittot wrote:
> Have you got some figures which show the improvement in
> responsiveness compared to other governors like ondemand?
> It could be interesting to test the interactive governor with
> cpufreq-bench and compare the results with ondemand.

I don't have any numbers handy, but no doubt the governor brings
some improvements which you can see on a real device.

Anyway, the point of sending out these RFC patches was to get a
technical review of the approach, because there's not much point
in pushing code that isn't acceptable on technical merits,
no matter how much better the numbers it gives might be.

And scheduler folks aren't happy with the whole approach, so I
guess we should go back to the drawing board. :-)

Thanks,

p.s. Sure, in the end we'll have to measure 'interactive' vs.
'ondemand' vs. 'newapproach'. And maybe now it's time to actually
measure 'interactive' governor in numbers... I'll get back to
this thread when I get the numbers.

-- 
Anton Vorontsov
Email: cbouatmailru@gmail.com

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-08 21:33     ` Benjamin Herrenschmidt
@ 2012-02-09  7:51       ` Ingo Molnar
  2012-02-11  3:15         ` Saravana Kannan
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2012-02-09  7:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Dave Jones, Peter Zijlstra, Anton Vorontsov, Russell King,
	Oleg Nesterov, Paul E. McKenney, Nicolas Pitre, Mike Chan,
	Todd Poynor, cpufreq, kernel-team, linaro-kernel,
	linux-arm-kernel, linux-kernel, Arjan Van De Ven


* Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Wed, 2012-02-08 at 15:23 -0500, Dave Jones wrote:
> > I think the biggest mistake we ever made with cpufreq was making it
> > so configurable. If we redesign it, just say no to plugin governors,
> > and
> > yes to a lot fewer sysfs knobs.
> > 
> > So, provide mechanism to kill off all the governors, and there's a
> > migration path from what we have now to something that just works
> > in a lot more cases, while remaining configurable enough for the
> > corner-cases.
> 
> On the other hand, the need for schedulable contexts may not
> necessarily go away.

We will support it, but the *sane* hw solution is where 
frequency transitions can be done atomically. Most workloads 
change their characteristics very quickly, and so does their
power management profile.

The user-space driven policy model failed for that reason: it
was *way* too slow in reacting - and slow hardware transitions
suck for a similar reason as well.

We accommodate all hardware as well as we can, but we *design*
for proper hardware. So Peter is right, this should be done 
properly.

Thanks,

	Ingo

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-09  7:51       ` Ingo Molnar
@ 2012-02-11  3:15         ` Saravana Kannan
  2012-02-11 14:39           ` Mark Brown
  2012-02-11 14:45           ` Ingo Molnar
  0 siblings, 2 replies; 34+ messages in thread
From: Saravana Kannan @ 2012-02-11  3:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Benjamin Herrenschmidt, Todd Poynor, Russell King,
	Peter Zijlstra, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, linaro-kernel, Mike Chan,
	Dave Jones, Paul E. McKenney, kernel-team, linux-arm-kernel,
	Arjan Van De Ven

On 02/08/2012 11:51 PM, Ingo Molnar wrote:
>
> * Benjamin Herrenschmidt<benh@kernel.crashing.org>  wrote:
>
>> On Wed, 2012-02-08 at 15:23 -0500, Dave Jones wrote:
>>> I think the biggest mistake we ever made with cpufreq was making it
>>> so configurable. If we redesign it, just say no to plugin governors,
>>> and
>>> yes to a lot fewer sysfs knobs.
>>>
>>> So, provide mechanism to kill off all the governors, and there's a
>>> migration path from what we have now to something that just works
>>> in a lot more cases, while remaining configurable enough for the
>>> corner-cases.
>>
>> On the other hand, the need for schedulable contexts may not
>> necessarily go away.
>
> We will support it, but the *sane* hw solution is where
> frequency transitions can be done atomically.

I'm not sure atomicity has much to do with this. From what I can tell, 
it's about the physical characteristics of the voltage source and the 
load on said source.

After some quick digging around for info on one of our platforms
(ARM/MSM), it looks like it will take 200us to ramp up the power rail
from the voltage for the lowest CPU freq to the voltage for the highest
CPU freq. And that's ignoring any communication delay. The 200us is
purely how long it takes for the PMIC output to settle given the power
load from the CPU. I would think other PMICs from different
manufacturers would be in the same ballpark.

200us is a lot of time to add to a context switch or to busy wait on 
when the processors today can run at GHz speeds.
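
To put that figure in perspective, here is a back-of-the-envelope sketch.
The helper name cycles_during() is made up for this illustration, and the
1 GHz clock is just an example value for "GHz speeds", not a number from
any datasheet.

```c
#include <stdint.h>

/* Cycles a CPU clocked at freq_hz would execute during wait_us microseconds. */
static uint64_t cycles_during(uint64_t freq_hz, uint64_t wait_us)
{
	return freq_hz * wait_us / 1000000;
}
```

Calling cycles_during(1000000000ULL, 200) gives 200000, so a busy wait
covering the full 200us settle time would burn on the order of two hundred
thousand cycles at 1 GHz.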

So, with what I know, this doesn't look like a matter of broken HW unless
the PMIC I'm looking up data for is a really crappy one. I'm sure others
in the community know more about PMICs than I do and they can correct me
if the general PMIC voltage settling characteristic is much better than
the one I'm looking at.

> Most workloads
> change their characteristics very quickly, and so does their
> power management profile.
>
> The user-space driven policy model failed for that reason: it
> was *way* too slow in reacting - and slow hardware transitions
> suck for a similar reason as well.

I think we all agree on this.

> We accommodate all hardware as well as we can, but we *design*
> for proper hardware. So Peter is right, this should be done
> properly.

When you say accommodate all hardware, does it mean we will keep around
CPUfreq and allow attempts at improving it? Or will we completely move
to scheduler-based CPU freq scaling, but not try to force atomicity?
Say, maybe queue up a notification to a CPU driver to scale up the
frequency as soon as it can?

IMHO, I think the problem with CPUfreq and its dynamic governors today 
is that they do a timer based sampling of the CPU load instead of 
getting some hints from the scheduler when the scheduler knows that the 
load average is quite high.

-Saravana

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-11  3:15         ` Saravana Kannan
@ 2012-02-11 14:39           ` Mark Brown
  2012-02-11 14:53             ` Peter Zijlstra
  2012-02-11 14:45           ` Ingo Molnar
  1 sibling, 1 reply; 34+ messages in thread
From: Mark Brown @ 2012-02-11 14:39 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Ingo Molnar, Benjamin Herrenschmidt, Todd Poynor, Russell King,
	Peter Zijlstra, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, linaro-kernel, Mike Chan,
	Dave Jones, Paul E. McKenney, kernel-team, linux-arm-kernel,
	Arjan Van De Ven

On Fri, Feb 10, 2012 at 07:15:10PM -0800, Saravana Kannan wrote:
> On 02/08/2012 11:51 PM, Ingo Molnar wrote:
> >* Benjamin Herrenschmidt<benh@kernel.crashing.org>  wrote:

> >>On the other hand, the need for schedulable contexts may not
> >>necessarily go away.

> >We will support it, but the *sane* hw solution is where
> >frequency transitions can be done atomically.

> I'm not sure atomicity has much to do with this. From what I can
> tell, it's about the physical characteristics of the voltage source
> and the load on said source.

> After a quick digging around for some info for one of our platforms
> (ARM/MSM), it looks like it will take 200us to ramp up the power
> rail from the voltage for the lowest CPU freq to voltage for the
> highest CPU freq. And that's ignoring any communication delay. The
> 200us is purely how long it takes for the PMIC output to settle
> given the power load from the CPU. I would think other PMICs from
> different manufacturers would be in the same ballpark.

No matter how good the PMICs get, the CPUs are also improving the speed
with which they can do frequency changes, so I expect this is always
going to need consideration on at least some systems.

> 200us is a lot of time to add to a context switch or to busy wait on
> when the processors today can run at GHz speeds.

Absolutely, and as you say this ignores communication overheads - often
PMICs are connected via I2C, which can only be communicated with in
schedulable context and which takes substantially more than microseconds
to interact with.  Usually in systems where scaling performance is
important there will also be GPIOs to signal voltage changes, but we
can't rely on them being there, and you can often do some useful stuff
if you also interact via I2C.

For step downs this isn't such a big deal, as we don't often care if the
voltage drops immediately, but for step ups it's critical: if the
voltage hasn't ramped before the CPU tries to run at the higher
frequency, the CPU will brown out.
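
The ordering constraint can be sketched as a small planning function.
Everything named here (enum step, plan_transition()) is hypothetical and
only for illustration; it is not an existing cpufreq or regulator API. The
step-down order shown, frequency first and voltage afterwards, is the usual
convention rather than something spelled out in this thread.

```c
#include <stddef.h>

enum step { STEP_SET_VOLTAGE, STEP_WAIT_SETTLE, STEP_SET_FREQ };

/*
 * Fill out[] with the ordered steps for a frequency change and return
 * how many steps were written (at most 3).
 */
static size_t plan_transition(unsigned int cur_khz, unsigned int new_khz,
			      enum step out[3])
{
	size_t n = 0;

	if (new_khz == cur_khz)
		return 0;

	if (new_khz > cur_khz) {
		/* Step up: the rail must have settled before the clock rises. */
		out[n++] = STEP_SET_VOLTAGE;
		out[n++] = STEP_WAIT_SETTLE;
		out[n++] = STEP_SET_FREQ;
	} else {
		/* Step down: lower the clock first; the voltage can follow lazily. */
		out[n++] = STEP_SET_FREQ;
		out[n++] = STEP_SET_VOLTAGE;
	}
	return n;
}
```

Only the step-up path contains a wait, which is why that direction is the
one that needs either a schedulable context or a busy wait.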

> >We accommodate all hardware as well as we can, but we *design*
> >for proper hardware. So Peter is right, this should be done
> >properly.

> When you say accommodate all hardware, does it mean we will keep
> around CPUfreq and allow attempts at improving it? Or we will
> completely move to scheduler based CPU freq scaling, but won't try
> to force atomicity? Say, maybe queue up a notification to a CPU
> driver to scale up the frequency as soon as it can?

We could also make the system aware of the multiple steps in scaling so
that it can do things like kick off voltage ramps and wait for them to
complete before performing the frequency change; I'm sure there's room
to do useful things there.  Possibly having the concept of expanding the
range of currently available frequencies, for example.

> IMHO, I think the problem with CPUfreq and its dynamic governors
> today is that they do a timer based sampling of the CPU load instead
> of getting some hints from the scheduler when the scheduler knows
> that the load average is quite high.

Yes, this seems like a big issue - often the interval before the
governors will react can end up being human visible which is
unfortunate.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-11  3:15         ` Saravana Kannan
  2012-02-11 14:39           ` Mark Brown
@ 2012-02-11 14:45           ` Ingo Molnar
  2012-02-14 23:20             ` Saravana Kannan
  1 sibling, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2012-02-11 14:45 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Benjamin Herrenschmidt, Todd Poynor, Russell King,
	Peter Zijlstra, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, linaro-kernel, Mike Chan,
	Dave Jones, Paul E. McKenney, kernel-team, linux-arm-kernel,
	Arjan Van De Ven


* Saravana Kannan <skannan@codeaurora.org> wrote:

> When you say accommodate all hardware, does it mean we will 
> keep around CPUfreq and allow attempts at improving it? Or we 
> will completely move to scheduler based CPU freq scaling, but 
> won't try to force atomicity? Say, maybe queue up a 
> notification to a CPU driver to scale up the frequency as soon 
> as it can?

I don't think we should (or even could) force atomicity - we 
adapt to whatever the hardware can do.

But the design should be directed at systems where frequency 
changes can be done in a reasonably fast manner. That is what the 
future is - any change we initiate today takes years to reach 
actual products/systems.

> IMHO, I think the problem with CPUfreq and its dynamic 
> governors today is that they do a timer based sampling of the 
> CPU load instead of getting some hints from the scheduler when 
> the scheduler knows that the load average is quite high.

Yes - that is one of the "frequency changes are slow" 
assumptions - which is wrong.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-11 14:39           ` Mark Brown
@ 2012-02-11 14:53             ` Peter Zijlstra
  2012-02-11 15:33               ` Mark Brown
  2012-02-12 21:33               ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2012-02-11 14:53 UTC (permalink / raw)
  To: Mark Brown
  Cc: Saravana Kannan, Ingo Molnar, Benjamin Herrenschmidt,
	Todd Poynor, Russell King, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, linaro-kernel, Mike Chan,
	Dave Jones, Paul E. McKenney, kernel-team, linux-arm-kernel,
	Arjan Van De Ven

On Sat, 2012-02-11 at 14:39 +0000, Mark Brown wrote:
> 
> For step downs this isn't such a big deal as we don't often care if the
> voltage drops immediately but for step ups it's critical as if the
> voltage hasn't ramped before the CPU tries to run at the higher
> frequency the CPU will brown out. 

Why isn't all this done by micro-controllers, software writes a desired
state in some machine register (fast), micro-controller sets about
making it so in an asynchronous way. If it finds the settings have
changed by the time it reached its former goal, goto 1.

Having to actually wait for this in software is quite ridiculous.
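
The "write a desired state, let the controller converge, goto 1" loop
above can be modelled roughly as below. This is only a sketch: struct
pm_ctrl, pm_request() and pm_tick() are made-up names for illustration,
not an existing interface, and operating points are plain integers.

```c
#include <assert.h>

/* Hypothetical model of a power micro-controller: software only writes
 * `requested`; the controller moves `current` toward it step by step. */
struct pm_ctrl {
	int requested;	/* desired operating point, written by software */
	int current;	/* operating point the hardware is actually at */
};

/* Fast path, what the CPU does: a single register write, never waits. */
static void pm_request(struct pm_ctrl *c, int opp)
{
	assert(opp >= 0);	/* operating points are non-negative here */
	c->requested = opp;
}

/* One controller tick: re-read the request and move one step closer.
 * Returns 1 while still converging, 0 once the goal has been reached. */
static int pm_tick(struct pm_ctrl *c)
{
	if (c->current < c->requested)
		c->current++;
	else if (c->current > c->requested)
		c->current--;
	return c->current != c->requested;
}
```

Because pm_tick() re-reads the request on every step, a request that
changes mid-ramp simply redirects the controller; the CPU side never
blocks.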


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-11 14:53             ` Peter Zijlstra
@ 2012-02-11 15:33               ` Mark Brown
  2012-02-15 13:38                 ` Peter Zijlstra
  2012-02-12 21:33               ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 34+ messages in thread
From: Mark Brown @ 2012-02-11 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Saravana Kannan, Ingo Molnar, Benjamin Herrenschmidt,
	Todd Poynor, Russell King, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, linaro-kernel, Mike Chan,
	Dave Jones, Paul E. McKenney, kernel-team, linux-arm-kernel,
	Arjan Van De Ven

On Sat, Feb 11, 2012 at 03:53:03PM +0100, Peter Zijlstra wrote:
> On Sat, 2012-02-11 at 14:39 +0000, Mark Brown wrote:

> > For step downs this isn't such a big deal as we don't often care if the
> > voltage drops immediately but for step ups it's critical as if the
> > voltage hasn't ramped before the CPU tries to run at the higher
> > frequency the CPU will brown out. 

> Why isn't all this done by micro-controllers, software writes a desired
> state in some machine register (fast), micro-controller sets about
> making it so in an asynchronous way. If it finds the settings have
> changed by the time it reached its former goal, goto 1.

*Something* is going to have to wait for all the steps to take
place. If, when frequency scaling, you're not particularly worried about
waiting for the actual completion of your frequency change, then
doing it in the CPU isn't that hard - you just post the request off
elsewhere and let it get on with trying to implement whatever the last
request it saw was (along with all the other constraints it's seeing).
Modulo non-trivial implementation issues, of course.

> Having to actually wait for this in software is quite ridiculous.

Well, it's also not terribly hard.  There's use cases for having this
stuff offloaded but if you're not doing that stuff then why deal with
the complication of designing the hardware?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-11 14:53             ` Peter Zijlstra
  2012-02-11 15:33               ` Mark Brown
@ 2012-02-12 21:33               ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 34+ messages in thread
From: Benjamin Herrenschmidt @ 2012-02-12 21:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mark Brown, Saravana Kannan, Ingo Molnar, Todd Poynor,
	Russell King, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, linaro-kernel, Mike Chan,
	Dave Jones, Paul E. McKenney, kernel-team, linux-arm-kernel,
	Arjan Van De Ven

On Sat, 2012-02-11 at 15:53 +0100, Peter Zijlstra wrote:
> On Sat, 2012-02-11 at 14:39 +0000, Mark Brown wrote:
> > 
> > For step downs this isn't such a big deal as we don't often care if the
> > voltage drops immediately but for step ups it's critical as if the
> > voltage hasn't ramped before the CPU tries to run at the higher
> > frequency the CPU will brown out. 
> 
> Why isn't all this done by micro-controllers, software writes a desired
> state in some machine register (fast), micro-controller sets about
> making it so in an asynchronous way. If it finds the settings have
> changed by the time it reached its former goal, goto 1.
> 
> Having to actually wait for this in software is quite ridiculous.

Not necessarily micro-controllers no. Or rather, it's generally done by
uC (or system controllers) on desktop or server machines, but not on
embedded (ie. phones) where arguably that's where it is the most
important :-)

Now often (but not always) the trigger to initiate a change is indeed a
simple register or register-based gpio, so that's fast. But you also
often have to wait for a transition to be complete before initiating
another one, or synchronize between voltage and frequency.

For example, if you are ramping up, you need to up the voltage first,
then wait for it to reach the nominal value & stabilize, then ramp up
the frequency. It's not that often automated.

Then there's the case where to communicate with those chips, you have to
go via a bus such as i2c which requires schedulable contexts.
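
A minimal sketch of the ordering described above, assuming a made-up
linear voltage table: struct cpu_power, min_uv_for() and the other
helpers are hypothetical, and set_voltage_and_wait() stands in for
"program the regulator and wait until it has settled".

```c
#include <assert.h>

/* Hypothetical rail/clock model; the names are illustrative only. */
struct cpu_power {
	int uv;		/* current supply voltage in microvolts */
	int khz;	/* current CPU clock in kHz */
};

/* Assumed minimum voltage for a frequency (made-up linear table). */
static int min_uv_for(int khz)
{
	return 900000 + (khz / 1000) * 250;
}

/* Stand-in for programming the regulator and waiting for it to settle. */
static void set_voltage_and_wait(struct cpu_power *p, int uv)
{
	p->uv = uv;
}

static void set_frequency(struct cpu_power *p, int khz)
{
	/* Running faster than the rail allows is the brown-out case. */
	assert(p->uv >= min_uv_for(khz));
	p->khz = khz;
}

/* Step up: voltage first, then frequency. Step down: the reverse. */
static void scale_to(struct cpu_power *p, int khz)
{
	int uv = min_uv_for(khz);

	if (khz > p->khz) {
		set_voltage_and_wait(p, uv);
		set_frequency(p, khz);
	} else {
		set_frequency(p, khz);
		set_voltage_and_wait(p, uv);
	}
}
```

The assert in set_frequency() trips if the two steps are swapped on the
way up, which is the case that matters.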

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-11 14:45           ` Ingo Molnar
@ 2012-02-14 23:20             ` Saravana Kannan
  2012-02-15 13:38               ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Saravana Kannan @ 2012-02-14 23:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linaro-kernel, Russell King, Peter Zijlstra, Nicolas Pitre,
	Benjamin Herrenschmidt, Oleg Nesterov, cpufreq, linux-kernel,
	Anton Vorontsov, Paul E. McKenney, Mike Chan, Dave Jones,
	Todd Poynor, kernel-team, linux-arm-kernel, Arjan Van De Ven


On 02/11/2012 06:45 AM, Ingo Molnar wrote:
>
> * Saravana Kannan<skannan@codeaurora.org>  wrote:
>
>> When you say accommodate all hardware, does it mean we will
>> keep around CPUfreq and allow attempts at improving it? Or we
>> will completely move to scheduler based CPU freq scaling, but
>> won't try to force atomicity? Say, maybe queue up a
>> notification to a CPU driver to scale up the frequency as soon
>> as it can?
>
> I don't think we should (or even could) force atomicity - we
> adapt to whatever the hardware can do.

Maybe I misread the emails from Peter and you, but it sounded like the 
idea being proposed was to directly do a freq change from the scheduler. 
That would force the freq change API to be atomic (if it can be 
implemented is another issue). That's what I was referring to when I 
loosely used the terms "force atomicity".

> But the design should be directed at systems where frequency
> changes can be done in a reasonably fast manner. That is what the
> future is - any change we initiate today takes years to reach
> actual products/systems.

As long as the new design doesn't treat archs needing schedulable 
context to set freq as a second class citizen, I think we would all be 
happy. Because it's not just a matter of it being old hardware. 
Sometimes the decision to let the SW do the voltage scaling also comes 
down to HW cost. Considering Linux runs on such a wide set of archs, I 
think we shouldn't treat the need for schedulable context for freq 
setting as "broken" or "not sane".


>> IMHO, I think the problem with CPUfreq and its dynamic
>> governors today is that they do a timer based sampling of the
>> CPU load instead of getting some hints from the scheduler when
>> the scheduler knows that the load average is quite high.
>
> Yes - that is one of the "frequency changes are slow"
> assumptions - which is wrong.

Thanks,
Saravana

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-11 15:33               ` Mark Brown
@ 2012-02-15 13:38                 ` Peter Zijlstra
  2012-02-15 16:04                   ` Mark Brown
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2012-02-15 13:38 UTC (permalink / raw)
  To: Mark Brown
  Cc: Saravana Kannan, Ingo Molnar, Benjamin Herrenschmidt,
	Todd Poynor, Russell King, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, linaro-kernel, Mike Chan,
	Dave Jones, Paul E. McKenney, kernel-team, linux-arm-kernel,
	Arjan Van De Ven

On Sat, 2012-02-11 at 15:33 +0000, Mark Brown wrote:
> > Having to actually wait for this in software is quite ridiculous.
> 
> Well, it's also not terribly hard. 

Having to schedule from the scheduler is. Which is exactly the situation
you'll end up with if you want scheduler driven cpufreq, which I thought
everybody wanted because polling state sucks.

>  There's use cases for having this
> stuff offloaded but if you're not doing that stuff then why deal with
> the complication of designing the hardware? 

Because doing it in software is more expensive?

Penny-wise pound-foolish like thing.. you make the software requirements
more complex, which results in more bugs (more cost in debugging), more
runtime (for doing the 'software' thing), less power savings.

Esp since all this uC/system-controller stuff is already available and
validated.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-14 23:20             ` Saravana Kannan
@ 2012-02-15 13:38               ` Peter Zijlstra
  2012-02-15 14:02                 ` Russell King - ARM Linux
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2012-02-15 13:38 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Ingo Molnar, linaro-kernel, Russell King, Nicolas Pitre,
	Benjamin Herrenschmidt, Oleg Nesterov, cpufreq, linux-kernel,
	Anton Vorontsov, Paul E. McKenney, Mike Chan, Dave Jones,
	Todd Poynor, kernel-team, linux-arm-kernel, Arjan Van De Ven,
	Thomas Gleixner

On Tue, 2012-02-14 at 15:20 -0800, Saravana Kannan wrote:
> On 02/11/2012 06:45 AM, Ingo Molnar wrote:
> >
> > * Saravana Kannan<skannan@codeaurora.org>  wrote:
> >
> >> When you say accommodate all hardware, does it mean we will
> >> keep around CPUfreq and allow attempts at improving it? Or we
> >> will completely move to scheduler based CPU freq scaling, but
> >> won't try to force atomicity? Say, maybe queue up a
> >> notification to a CPU driver to scale up the frequency as soon
> >> as it can?
> >
> > I don't think we should (or even could) force atomicity - we
> > adapt to whatever the hardware can do.
> 
> Maybe I misread the emails from Peter and you, but it sounded like the 
> idea being proposed was to directly do a freq change from the scheduler. 
> That would force the freq change API to be atomic (if it can be 
> implemented is another issue). That's what I was referring to when I 
> loosely used the terms "force atomicity".

Right, so we all agree cpufreq wants scheduler notifications because
polling sucks. The result is indeed you get to do cpufreq from atomic
context, because scheduling from the scheduler is 'interesting'.

> > But the design should be directed at systems where frequency
> > changes can be done in a reasonably fast manner. That is what the
> > future is - any change we initiate today takes years to reach
> > actual products/systems.
> 
> As long as the new design doesn't treat archs needing schedulable 
> context to set freq as a second class citizen, I think we would all be 
> happy.

I would really really like to do just that, if only to encourage
hardware people to just do the thing in hardware. Wanting both ultimate
power savings and crappy hardware just doesn't work -- and yes I'm
sticking to PMIC on i2c is shit as is having to manually sync up voltage
and freq changes.

>  Because it's not just a matter of it being old hardware. 
> Sometimes the decision to let the SW do the voltage scaling also comes 
> down to HW cost. Considering Linux runs on such a wide set of archs, I 
> think we shouldn't treat the need for schedulable context for freq 
> setting as "broken" or "not sane".

So you'd rather spend double the money on trying to get software working
on broken ass hardware?

A lot of these "let's save 3 transistors, software can fix it up" hardware
feat^Wfailures end up spending more than the savings on making the
software do the fixup. I'm sure tglx can share a few stories here.

Now we could probably kludge something, and we might have to, but I'll
hate you guys for it.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-15 13:38               ` Peter Zijlstra
@ 2012-02-15 14:02                 ` Russell King - ARM Linux
  2012-02-15 15:01                   ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Russell King - ARM Linux @ 2012-02-15 14:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Saravana Kannan, Ingo Molnar, linaro-kernel, Nicolas Pitre,
	Benjamin Herrenschmidt, Oleg Nesterov, cpufreq, linux-kernel,
	Anton Vorontsov, Paul E. McKenney, Mike Chan, Dave Jones,
	Todd Poynor, kernel-team, linux-arm-kernel, Arjan Van De Ven,
	Thomas Gleixner

On Wed, Feb 15, 2012 at 02:38:05PM +0100, Peter Zijlstra wrote:
> On Tue, 2012-02-14 at 15:20 -0800, Saravana Kannan wrote:
> > On 02/11/2012 06:45 AM, Ingo Molnar wrote:
> > >
> > > * Saravana Kannan<skannan@codeaurora.org>  wrote:
> > >
> > >> When you say accommodate all hardware, does it mean we will
> > >> keep around CPUfreq and allow attempts at improving it? Or we
> > >> will completely move to scheduler based CPU freq scaling, but
> > >> won't try to force atomicity? Say, maybe queue up a
> > >> notification to a CPU driver to scale up the frequency as soon
> > >> as it can?
> > >
> > > I don't think we should (or even could) force atomicity - we
> > > adapt to whatever the hardware can do.
> > 
> > Maybe I misread the emails from Peter and you, but it sounded like the 
> > idea being proposed was to directly do a freq change from the scheduler. 
> > That would force the freq change API to be atomic (if it can be 
> > implemented is another issue). That's what I was referring to when I 
> > loosely used the terms "force atomicity".
> 
> Right, so we all agree cpufreq wants scheduler notifications because
> polling sucks. The result is indeed you get to do cpufreq from atomic
> context, because scheduling from the scheduler is 'interesting'.

There's a problem with that: SA11x0 platforms (for which cpufreq was
_originally_ written before it sprouted all the policy stuff which
Linus demanded) need to notify drivers when the CPU frequency changes so
that drivers can readjust stuff to keep within the bounds of the hardware.

Unfortunately, there's embedded platforms out there where the CPU core
clock is not just the CPU core clock, but also is the memory bus clock,
PCMCIA clock, and some peripheral clocks.  All these peripherals need
their timing registers rewritten when the CPU core clock changes.

Even more unfortunately, some of these peripherals can't be adjusted
with the click of your fingers: you have to wait for them to finish
what they're doing.  In the case of a LCD controller, that means the
hardware must finish displaying the current frame before the LCD
controller will shut down and let you change its registers.

We _could_ make it atomic, but in return we'd have to spin in the driver
for maybe 20+ ms, during which time the system would not be able to do
anything else, not even those threaded IRQs.  That's on top of however
long it takes for the CPU core clock PLL to re-lock at the requested
frequency.  That might not be too bad if the CPU clock rate changes
only occasionally, but if we're talking about doing that more often
then I think there's something wrong with the cpufreq policy design.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-15 14:02                 ` Russell King - ARM Linux
@ 2012-02-15 15:01                   ` Peter Zijlstra
  2012-02-15 16:00                     ` Russell King - ARM Linux
                                       ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Peter Zijlstra @ 2012-02-15 15:01 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Saravana Kannan, Ingo Molnar, linaro-kernel, Nicolas Pitre,
	Benjamin Herrenschmidt, Oleg Nesterov, cpufreq, linux-kernel,
	Anton Vorontsov, Paul E. McKenney, Mike Chan, Dave Jones,
	Todd Poynor, kernel-team, linux-arm-kernel, Arjan Van De Ven,
	Thomas Gleixner

On Wed, 2012-02-15 at 14:02 +0000, Russell King - ARM Linux wrote:

> There's a problem with that: SA11x0 platforms (for which cpufreq was
> _originally_ written before it sprouted all the policy stuff which
> Linus demanded) need to notify drivers when the CPU frequency changes so
> that drivers can readjust stuff to keep within the bounds of the hardware.
> 
> Unfortunately, there's embedded platforms out there where the CPU core
> clock is not just the CPU core clock, but also is the memory bus clock,
> PCMCIA clock, and some peripheral clocks.  All these peripherals need
> their timing registers rewritten when the CPU core clock changes.
> 
> Even more unfortunately, some of these peripherals can't be adjusted
> with the click of your fingers: you have to wait for them to finish
> what they're doing.  In the case of a LCD controller, that means the
> hardware must finish displaying the current frame before the LCD
> controller will shut down and let you change its registers.
> 
> We _could_ make it atomic, but in return we'd have to spin in the driver
> for maybe 20+ ms, during which time the system would not be able to do
> anything else, not even those threaded IRQs. 

Thing is, the scheduler doesn't care about completion, all it needs is
to be able to kick-start the thing atomically. So do you really have to
wait for it, or can you do an interrupt driven state machine?

Anyway, one possibility is to keep cpufreq in its current state and use
that for this 'interesting' class of hardware -- clearly its current
state is good enough for it. And transition all sane hardware over to a
new scheme.

Another possibility is we'll try and fudge something in the scheduler
that either wakes a special per-cpu thread or allows enqueueing work and
make this CONFIG_goo available to these platforms so as not to add to
fast-path overhead of others.

A third possibility is to self-IPI and take it from there.. assuming
these platforms can actually self-IPI.

>  That's on top of however
> long it takes for the CPU core clock PLL to re-lock at the requested
> frequency.  That might not be too bad if the CPU clock rate changes
> only occasionally, but if we're talking about doing that more often
> then I think there's something wrong with the cpufreq policy design.

I guess that all will depend on the hardware.. there'll still be some
sort of governor in between taking the per-cpu/task load-tracking data
and scheduler events and using that to compute some volt/freq setting.
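
As a rough illustration of such an in-between governor step, the sketch
below maps a load figure onto a frequency table. Everything here is an
assumption for the sake of the example: pick_freq() is a made-up name,
the table is sorted ascending, and util/max_util are in whatever units
the load tracking happens to use.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the lowest table frequency whose share of the top frequency
 * covers the load's share of max_util. Hypothetical helper. */
static unsigned int pick_freq(const unsigned int *table, size_t n,
			      unsigned int util, unsigned int max_util)
{
	assert(n > 0 && max_util > 0);

	unsigned int top = table[n - 1];

	for (size_t i = 0; i < n; i++) {
		/* table[i] / top >= util / max_util, without dividing */
		if ((unsigned long long)table[i] * max_util >=
		    (unsigned long long)util * top)
			return table[i];
	}
	return top;	/* load above max_util: clamp to the fastest entry */
}
```

How the chosen frequency is then programmed (atomically or deferred) is
a separate question from how it is picked.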

From what I've heard there's a number of different classes of hardware
out there, some like race to idle, some can power gate more than others
etc.. I'm not particularly bothered by those details, I'm sure there's
people who are.

All I really want is to consolidate all the various statistics we have
across cpufreq/cpuidle/sched and provide cpufreq with scheduler
callbacks because they've been telling me their current polling stuff
sucks rocks.

Also the current state of affairs is that the cpufreq stuff is trying to
guess what the scheduler is doing, and people are feeding that back into
the scheduler. This I need to stop from happening ;-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-15 15:01                   ` Peter Zijlstra
@ 2012-02-15 16:00                     ` Russell King - ARM Linux
  2012-02-15 16:09                       ` Peter Zijlstra
  2012-02-16  3:31                     ` Benjamin Herrenschmidt
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Russell King - ARM Linux @ 2012-02-15 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Saravana Kannan, Ingo Molnar, linaro-kernel, Nicolas Pitre,
	Benjamin Herrenschmidt, Oleg Nesterov, cpufreq, linux-kernel,
	Anton Vorontsov, Paul E. McKenney, Mike Chan, Dave Jones,
	Todd Poynor, kernel-team, linux-arm-kernel, Arjan Van De Ven,
	Thomas Gleixner

On Wed, Feb 15, 2012 at 04:01:03PM +0100, Peter Zijlstra wrote:
> On Wed, 2012-02-15 at 14:02 +0000, Russell King - ARM Linux wrote:
> 
> > There's a problem with that: SA11x0 platforms (for which cpufreq was
> > _originally_ written before it sprouted all the policy stuff which
> > Linus demanded) need to notify drivers when the CPU frequency changes so
> > that drivers can readjust stuff to keep within the bounds of the hardware.
> > 
> > Unfortunately, there's embedded platforms out there where the CPU core
> > clock is not just the CPU core clock, but also is the memory bus clock,
> > PCMCIA clock, and some peripheral clocks.  All these peripherals need
> > their timing registers rewritten when the CPU core clock changes.
> > 
> > Even more unfortunately, some of these peripherals can't be adjusted
> > with the click of your fingers: you have to wait for them to finish
> > what they're doing.  In the case of a LCD controller, that means the
> > hardware must finish displaying the current frame before the LCD
> > controller will shut down and let you change its registers.
> > 
> > We _could_ make it atomic, but in return we'd have to spin in the driver
> > for maybe 20+ ms, during which time the system would not be able to do
> > anything else, not even those threaded IRQs. 
> 
> Thing is, the scheduler doesn't care about completion, all it needs is
> to be able to kick-start the thing atomically. So do you really have to
> wait for it, or can you do an interrupt driven state machine?

Well, in the case I'm most familiar with, the sequence is this:

1. Receive request to change frequency
2. Shut down LCD controller and wait for it to stop (can take 20ms)
3. Check new frequency wrt old frequency, reprogram timings for PCMCIA
   if moving to a faster clock rate
4. Simultaneously program new frequency and SDRAM clocking
5. Recheck new frequency wrt old frequency, reprogram things for PCMCIA
   if moving to a slower clock rate
6. Reconfigure LCD controller

Now, the problem is that this sequence is spread across multiple drivers,
and today is handled by a notifier, which can sleep in the case of the
LCD controller waiting for the shutdown to complete.
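
The ordering in steps 2-6 can be sketched as below. This only records
which step would run when; the enum names and change_core_clock() are
invented for illustration and do not correspond to the actual SA11x0
drivers or notifier chain.

```c
#include <assert.h>
#include <stddef.h>

enum step { LCD_OFF, PCMCIA_TIMING, CORE_AND_SDRAM, LCD_ON };

#define MAX_STEPS 8

/* Records the order in which the (hypothetical) steps were taken. */
struct trace {
	enum step steps[MAX_STEPS];
	size_t n;
};

static void record(struct trace *t, enum step s)
{
	assert(t->n < MAX_STEPS);
	t->steps[t->n++] = s;
}

static void change_core_clock(struct trace *t, int old_khz, int new_khz)
{
	record(t, LCD_OFF);			/* step 2: may sleep ~20ms */
	if (new_khz > old_khz)
		record(t, PCMCIA_TIMING);	/* step 3: before speeding up */
	record(t, CORE_AND_SDRAM);		/* step 4 */
	if (new_khz < old_khz)
		record(t, PCMCIA_TIMING);	/* step 5: after slowing down */
	record(t, LCD_ON);			/* step 6 */
}
```

The PCMCIA timings move to the safe side of the clock change in both
directions, which is why the same reprogramming shows up as two steps.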

> A third possibility is to self-IPI and take it from there.. assuming
> these platforms can actually self-IPI.

Even if there was an IPI (not talking about SMP anyway) I'm not sure
what good it would be.  We can (and do) get an IRQ from the LCD
controller when its shutdown is complete, but that would have to be
somehow propagated back up to the cpufreq code.  And the cpufreq code
would have to know that the LCD controller was alive and therefore had
to be waited for.  All sounds rather yucky to me.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-15 13:38                 ` Peter Zijlstra
@ 2012-02-15 16:04                   ` Mark Brown
  0 siblings, 0 replies; 34+ messages in thread
From: Mark Brown @ 2012-02-15 16:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Saravana Kannan, Ingo Molnar, Benjamin Herrenschmidt,
	Todd Poynor, Russell King, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, linaro-kernel, Mike Chan,
	Dave Jones, Paul E. McKenney, kernel-team, linux-arm-kernel,
	Arjan Van De Ven

[-- Attachment #1: Type: text/plain, Size: 1250 bytes --]

On Wed, Feb 15, 2012 at 02:38:04PM +0100, Peter Zijlstra wrote:
> On Sat, 2012-02-11 at 15:33 +0000, Mark Brown wrote:

> >  There's use cases for having this
> > stuff offloaded but if you're not doing that stuff then why deal with
> > the complication of designing the hardware? 

> Because doing it in software is more expensive?

> Penny-wise pound-foolish like thing.. you make the software requirements
> more complex, which results in more bugs (more cost in debugging), more
> runtime (for doing the 'software' thing), less power savings.

> Esp since all this uC/system-controller stuff is already available and
> validated.

It's really not - like I say most of the times people have tried to
deploy this on embedded systems it's just made everyone more miserable
and typically winds up getting turned off or bypassed.  The PMICs are
much more decoupled from the CPUs and the power control on the SoCs is
more fine grained than you seem to see in the desktop market.  There's
software effort but people are willing to spend that for microamps and
all other things being equal they'd typically rather spend it in the
software stack they're already working with rather than in a separate
stack for a microcontroller.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-15 16:00                     ` Russell King - ARM Linux
@ 2012-02-15 16:09                       ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2012-02-15 16:09 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Saravana Kannan, Ingo Molnar, linaro-kernel, Nicolas Pitre,
	Benjamin Herrenschmidt, Oleg Nesterov, cpufreq, linux-kernel,
	Anton Vorontsov, Paul E. McKenney, Mike Chan, Dave Jones,
	Todd Poynor, kernel-team, linux-arm-kernel, Arjan Van De Ven,
	Thomas Gleixner

On Wed, 2012-02-15 at 16:00 +0000, Russell King - ARM Linux wrote:
> 
> > A third possibility is to self-IPI and take it from there.. assuming
> > these platforms can actually self-IPI.
> 
> Even if there was an IPI (not talking about SMP anyway) I'm not sure
> what good it would be.  We can (and do) get an IRQ from the LCD
> controller when its shutdown is complete, but that would have to be
> somehow propagated back up to the cpufreq code.  And the cpufreq code
> would have to know that the LCD controller was alive and therefore had
> to be waited for.  All sounds rather yucky to me. 

If you can self-IPI from the scheduler context (which has IRQs disabled),
once you get to the ipi handler your scheduler locks are gone and you
can queue a worklet or wake some kthread to do all the sleeping stuff.
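
Roughly, the hand-off would look like the model below. This is a
single-threaded sketch, not kernel code: struct freq_defer and both
functions are made up, setting `pending` stands in for raising the
self-IPI, and worker_run() stands in for whatever worklet or kthread
ends up doing the sleeping parts.

```c
#include <assert.h>
#include <stdbool.h>

struct freq_defer {
	int target_khz;		/* latest request; later ones overwrite */
	bool pending;		/* set in "atomic" context, cleared by worker */
	int applied_khz;	/* what the slow path last programmed */
	int slow_calls;		/* how many times the slow path ran */
};

/* Called from the scheduler hook: must not sleep, so record and flag. */
static void request_from_atomic(struct freq_defer *d, int khz)
{
	assert(khz > 0);
	d->target_khz = khz;
	d->pending = true;	/* stands in for raising the self-IPI */
}

/* Runs later in a context that may sleep (worklet/kthread). */
static void worker_run(struct freq_defer *d)
{
	if (!d->pending)
		return;
	d->pending = false;
	d->applied_khz = d->target_khz;	/* notifiers, i2c etc. go here */
	d->slow_calls++;
}
```

Several requests arriving before the worker runs collapse into one slow
transition to the most recent target.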



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-15 15:01                   ` Peter Zijlstra
  2012-02-15 16:00                     ` Russell King - ARM Linux
@ 2012-02-16  3:31                     ` Benjamin Herrenschmidt
  2012-02-16 10:14                       ` Peter Zijlstra
  2012-02-17  9:00                     ` Dominik Brodowski
  2012-02-21 12:38                     ` Pantelis Antoniou
  3 siblings, 1 reply; 34+ messages in thread
From: Benjamin Herrenschmidt @ 2012-02-16  3:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Russell King - ARM Linux, Saravana Kannan, Ingo Molnar,
	linaro-kernel, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, Paul E. McKenney, Mike Chan,
	Dave Jones, Todd Poynor, kernel-team, linux-arm-kernel,
	Arjan Van De Ven, Thomas Gleixner

On Wed, 2012-02-15 at 16:01 +0100, Peter Zijlstra wrote:
> 
> Thing is, the scheduler doesn't care about completion, all it needs is
> to be able to kick-start the thing atomically. So do you really have to
> wait for it, or can you do an interrupt driven state machine?

Or the scheduler callback could schedule a wq to do the job ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-16  3:31                     ` Benjamin Herrenschmidt
@ 2012-02-16 10:14                       ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2012-02-16 10:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Russell King - ARM Linux, Saravana Kannan, Ingo Molnar,
	linaro-kernel, Nicolas Pitre, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, Paul E. McKenney, Mike Chan,
	Dave Jones, Todd Poynor, kernel-team, linux-arm-kernel,
	Arjan Van De Ven, Thomas Gleixner

On Thu, 2012-02-16 at 14:31 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2012-02-15 at 16:01 +0100, Peter Zijlstra wrote:
> > 
> > Thing is, the scheduler doesn't care about completion, all it needs is
> > to be able to kick-start the thing atomically. So do you really have to
> > wait for it, or can you do an interrupt driven state machine?
> 
> Or the scheduler callback could schedule a wq to do the job ?

That'll end up being very ugly due to lock inversion etc. If we can get
out of this using self-IPIs I'd much prefer that.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-15 15:01                   ` Peter Zijlstra
  2012-02-15 16:00                     ` Russell King - ARM Linux
  2012-02-16  3:31                     ` Benjamin Herrenschmidt
@ 2012-02-17  9:00                     ` Dominik Brodowski
  2012-02-20 11:03                       ` Peter Zijlstra
  2012-02-21 12:38                     ` Pantelis Antoniou
  3 siblings, 1 reply; 34+ messages in thread
From: Dominik Brodowski @ 2012-02-17  9:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Russell King - ARM Linux, Saravana Kannan, Ingo Molnar,
	linaro-kernel, Nicolas Pitre, Benjamin Herrenschmidt,
	Oleg Nesterov, cpufreq, linux-kernel, Anton Vorontsov,
	Paul E. McKenney, Mike Chan, Dave Jones, Todd Poynor,
	kernel-team, linux-arm-kernel, Arjan Van De Ven, Thomas Gleixner

On Wed, Feb 15, 2012 at 04:01:03PM +0100, Peter Zijlstra wrote:
> On Wed, 2012-02-15 at 14:02 +0000, Russell King - ARM Linux wrote:
> 
> > There's a problem with that: SA11x0 platforms (for which cpufreq was
> > _originally_ written for before it spouted all the policy stuff which
> > Linus demanded) need to notify drivers when the CPU frequency changes so
> > that drivers can readjust stuff to keep within the bounds of the hardware.
> > 
> > Unfortunately, there's embedded platforms out there where the CPU core
> > clock is not just the CPU core clock, but also is the memory bus clock,
> > PCMCIA clock, and some peripheral clocks.  All these peripherals need
> > their timing registers rewritten when the CPU core clock changes.
> > 
> > Even more unfortunately, some of these peripherals can't be adjusted
> > with the click of your fingers: you have to wait for them to finish
> > what they're doing.  In the case of a LCD controller, that means the
> > hardware must finish displaying the current frame before the LCD
> > controller will shut down and let you change its registers.
> > 
> > We _could_ make it atomic, but in return we'd have to spin in the driver
> > for maybe 20+ ms, during which time the system would not be able to do
> > anything else, not even those threaded IRQs. 
> 
> Thing is, the scheduler doesn't care about completion, all it needs is
> to be able to kick-start the thing atomically. So you really have to
> wait for it or can you do an interrupt driven state machine?
> 
> Anyway, one possibility is to keep cpufreq in its current state and use
> that for this 'interesting' class of hardware -- clearly its current
> state is good enough for it. And transition all sane hardware over to a
> new scheme.
> 
> Another possibility is we'll try and fudge something in the scheduler
> that either wakes a special per-cpu thread or allow enqueueing work and
> make this CONFIG_goo available to these platforms so as not to add to
> fast-path overhead of others.

Well, we can actually have both: Adding a new cpufreq governor "scheduler"
is easy. The scheduler stores the target frequency (in per-cent or
per-mille) in (per-cpu) data available to this governor, and kicks a
(per-cpu?) thread which then handles the rest -- by existing cpufreq means.
The cpufreq part is easy, the sched part less so (I think).

Of course, this is still slower than manipulating some MSRs in sched.c
directly. However, we could make use of the existing infrastructure, and not
worry about whether things need to schedule, need to busy-loop etc, whether
we have thermal implications which mean that some frequencies are not
available etc.
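
A minimal userspace sketch of the governor side of this, in C, assuming
the scheduler has already stored a per-mille target for the CPU. The
function name pick_freq_khz(), the sorted frequency table and the
max_allowed_khz thermal cap are all hypothetical and do not correspond to
the real cpufreq API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the lowest table frequency that satisfies target_permille of the
 * highest currently permitted frequency.
 *
 * freqs_khz must be sorted in ascending order. max_allowed_khz models a
 * (hypothetical) thermal limit that makes higher entries unavailable.
 */
unsigned int pick_freq_khz(const unsigned int *freqs_khz, size_t n,
			   unsigned int target_permille,
			   unsigned int max_allowed_khz)
{
	unsigned int best = 0;
	unsigned long long want;
	size_t i;

	assert(n > 0 && target_permille <= 1000);

	/* Highest entry that the thermal cap still permits. */
	for (i = 0; i < n; i++)
		if (freqs_khz[i] <= max_allowed_khz)
			best = freqs_khz[i];
	if (best == 0)
		return freqs_khz[0];	/* cap is below the table: use the minimum */

	want = (unsigned long long)best * target_permille / 1000;

	/* Lowest permitted entry that meets the request. */
	for (i = 0; i < n; i++)
		if (freqs_khz[i] >= want)
			return freqs_khz[i];
	return best;
}
```

With a table of 200, 400, 800 and 1000 MHz, a target of 500 per-mille and
no thermal cap, this picks 800 MHz, the lowest entry that is at least half
of the maximum.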

Best,
	Dominik


* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-17  9:00                     ` Dominik Brodowski
@ 2012-02-20 11:03                       ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2012-02-20 11:03 UTC (permalink / raw)
  To: Dominik Brodowski
  Cc: Russell King - ARM Linux, Saravana Kannan, Ingo Molnar,
	linaro-kernel, Nicolas Pitre, Benjamin Herrenschmidt,
	Oleg Nesterov, cpufreq, linux-kernel, Anton Vorontsov,
	Paul E. McKenney, Mike Chan, Dave Jones, Todd Poynor,
	kernel-team, linux-arm-kernel, Arjan Van De Ven, Thomas Gleixner

On Fri, 2012-02-17 at 10:00 +0100, Dominik Brodowski wrote:
> 
> Well, we can actually have both: Adding a new cpufreq governor "scheduler"
> is easy. The scheduler stores the target frequency (in per-cent or
> per-mille) in (per-cpu) data available to this governor, and kicks a
> (per-cpu?) thread which then handles the rest -- by existing cpufreq means.
> The cpufreq part is easy, the sched part less so (I think). 

You might not have been reading what I wrote, kicking a kthread (or
doing any other scheduler activity) from within the scheduler is way
ugly and something I'd really rather avoid if at all possible.

Yes I could do it, but I really really don't want to.


* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-15 15:01                   ` Peter Zijlstra
                                       ` (2 preceding siblings ...)
  2012-02-17  9:00                     ` Dominik Brodowski
@ 2012-02-21 12:38                     ` Pantelis Antoniou
  2012-02-21 12:56                       ` Peter Zijlstra
  3 siblings, 1 reply; 34+ messages in thread
From: Pantelis Antoniou @ 2012-02-21 12:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Russell King - ARM Linux, Saravana Kannan, Ingo Molnar,
	linaro-kernel, Nicolas Pitre, Benjamin Herrenschmidt,
	Oleg Nesterov, cpufreq, linux-kernel, Anton Vorontsov,
	Paul E. McKenney, Mike Chan, Dave Jones, Todd Poynor,
	kernel-team, linux-arm-kernel, Arjan Van De Ven, Thomas Gleixner

Hi there,

On Feb 15, 2012, at 5:01 PM, Peter Zijlstra wrote:

> On Wed, 2012-02-15 at 14:02 +0000, Russell King - ARM Linux wrote:
> 

<snip>
> 
> I guess that all will depend on the hardware.. there'll still be some
> sort of governor in between taking the per-cpu/task load-tracking data
> and scheduler events and using that to compute some volt/freq setting.
> 
> From what I've heard there's a number of different classes of hardware
> out there, some like race to idle, some can power gate more than others
> etc.. I'm not particularly bothered by those details, I'm sure there's
> people who are.
> 
> All I really want is to consolidate all the various statistics we have
> across cpufreq/cpuidle/sched and provide cpufreq with scheduler
> callbacks because they've been telling me their current polling stuff
> sucks rocks.
> 
> Also the current state of affairs is that the cpufreq stuff is trying to
> guess what the scheduler is doing, and people are feeding that back into
> the scheduler. This I need to stop from happening ;-)

If I may interject one more point here.

If we go to all the trouble of integrating cpufreq/cpuidle/sched into scheduler
callbacks, we should place hooks into the thermal framework/PM as well.

It will be pretty common to have per-core temperature readings on most
modern SoCs.

It is quite conceivable to have a case with a multi-core CPU where, due
to load imbalance, one (or more) of the cores is running at full speed
while the rest are mostly idle. What you want to do, for best performance
and conceivably better power consumption, is not to throttle the
frequency or lower the voltage of the overloaded CPU, but to migrate the
load to one of the cooler CPUs.

This affects CPU capacity immediately, i.e. you shouldn't schedule more
load on a CPU that is too hot, since you'll only end up triggering thermal
shutdown. The ideal solution would be to round-robin the load from the
hot CPU to the cooler ones, but not so fast that we lose out due to the
migration of state from one CPU to the other.

In a nutshell, the processing capacity of a core is not static, i.e. it
might degrade over time due to the increase of temperature caused by the
previous load.
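
To make the "capacity is not static" point concrete, here is a minimal
userspace sketch in C. Everything in it is an assumption made for
illustration: the struct, the two temperature thresholds and the linear
derating between them are hypothetical, not something the thermal
framework or the scheduler provides.

```c
#include <assert.h>
#include <stddef.h>

struct cpu_state {
	unsigned int max_capacity;	/* capacity when cool, arbitrary units */
	unsigned int load;		/* current load, same units */
	int temp_mc;			/* temperature in millidegrees Celsius */
};

/* Hypothetical thresholds; real values would come from the platform. */
#define TEMP_DERATE_START_MC	70000
#define TEMP_CRITICAL_MC	100000

/* Capacity shrinks linearly from full at 70 C down to zero at 100 C. */
unsigned int effective_capacity(const struct cpu_state *c)
{
	if (c->temp_mc <= TEMP_DERATE_START_MC)
		return c->max_capacity;
	if (c->temp_mc >= TEMP_CRITICAL_MC)
		return 0;
	return (unsigned int)((unsigned long long)c->max_capacity *
			      (unsigned int)(TEMP_CRITICAL_MC - c->temp_mc) /
			      (TEMP_CRITICAL_MC - TEMP_DERATE_START_MC));
}

/* Return the index of the CPU with the most spare effective capacity. */
size_t pick_target_cpu(const struct cpu_state *cpus, size_t n)
{
	size_t best = 0, i;
	unsigned int best_spare = 0;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		unsigned int cap = effective_capacity(&cpus[i]);
		unsigned int spare = cap > cpus[i].load ? cap - cpus[i].load : 0;

		if (spare > best_spare) {
			best_spare = spare;
			best = i;
		}
	}
	return best;
}
```

A hot CPU at 95 C keeps only a sixth of its nominal capacity in this model,
so new load is steered to a cooler CPU even if that CPU is nominally busier.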

What do you think?

Regards

-- Pantelis
  



* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-21 12:38                     ` Pantelis Antoniou
@ 2012-02-21 12:56                       ` Peter Zijlstra
  2012-02-21 13:31                         ` Pantelis Antoniou
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2012-02-21 12:56 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Russell King - ARM Linux, Saravana Kannan, Ingo Molnar,
	linaro-kernel, Nicolas Pitre, Benjamin Herrenschmidt,
	Oleg Nesterov, cpufreq, linux-kernel, Anton Vorontsov,
	Paul E. McKenney, Mike Chan, Dave Jones, Todd Poynor,
	kernel-team, linux-arm-kernel, Arjan Van De Ven, Thomas Gleixner

On Tue, 2012-02-21 at 14:38 +0200, Pantelis Antoniou wrote:
> 
> If we go to all the trouble of integrating cpufreq/cpuidle/sched into scheduler
> callbacks, we should place hooks into the thermal framework/PM as well.
> 
> It will be pretty common to have per-core temperature readings on most
> modern SoCs.
>
> It is quite conceivable to have a case with a multi-core CPU where, due
> to load imbalance, one (or more) of the cores is running at full speed
> while the rest are mostly idle. What you want to do, for best performance
> and conceivably better power consumption, is not to throttle the
> frequency or lower the voltage of the overloaded CPU, but to migrate the
> load to one of the cooler CPUs.
>
> This affects CPU capacity immediately, i.e. you shouldn't schedule more
> load on a CPU that is too hot, since you'll only end up triggering thermal
> shutdown. The ideal solution would be to round-robin the load from the
> hot CPU to the cooler ones, but not so fast that we lose out due to the
> migration of state from one CPU to the other.
> 
> In a nutshell, the processing capacity of a core is not static, i.e. it
> might degrade over time due to the increase of temperature caused by the
> previous load.
> 
> What do you think? 

This is called core-hopping, and yes that's a nice goal, although I
would like to do that after we get the 'simple' bits up and running. I
suspect it'll end up being slightly more complex than we'd like, due
to the fact that the goal conflicts with wanting to aggregate things on
cpu0 due to cpu0 being special for a host of reasons.




* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-21 12:56                       ` Peter Zijlstra
@ 2012-02-21 13:31                         ` Pantelis Antoniou
  2012-02-21 14:52                           ` Amit Kucheria
  0 siblings, 1 reply; 34+ messages in thread
From: Pantelis Antoniou @ 2012-02-21 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Russell King - ARM Linux, Saravana Kannan, Ingo Molnar,
	linaro-kernel, Nicolas Pitre, Benjamin Herrenschmidt,
	Oleg Nesterov, cpufreq, linux-kernel, Anton Vorontsov,
	Paul E. McKenney, Mike Chan, Dave Jones, Todd Poynor,
	kernel-team, linux-arm-kernel, Arjan Van De Ven, Thomas Gleixner


On Feb 21, 2012, at 2:56 PM, Peter Zijlstra wrote:

> On Tue, 2012-02-21 at 14:38 +0200, Pantelis Antoniou wrote:
>> 
>> If we go to all the trouble of integrating cpufreq/cpuidle/sched into scheduler
>> callbacks, we should place hooks into the thermal framework/PM as well.
>> 
>> It will be pretty common to have per-core temperature readings on most
>> modern SoCs.
>>
>> It is quite conceivable to have a case with a multi-core CPU where, due
>> to load imbalance, one (or more) of the cores is running at full speed
>> while the rest are mostly idle. What you want to do, for best performance
>> and conceivably better power consumption, is not to throttle the
>> frequency or lower the voltage of the overloaded CPU, but to migrate the
>> load to one of the cooler CPUs.
>>
>> This affects CPU capacity immediately, i.e. you shouldn't schedule more
>> load on a CPU that is too hot, since you'll only end up triggering thermal
>> shutdown. The ideal solution would be to round-robin the load from the
>> hot CPU to the cooler ones, but not so fast that we lose out due to the
>> migration of state from one CPU to the other.
>> 
>> In a nutshell, the processing capacity of a core is not static, i.e. it
>> might degrade over time due to the increase of temperature caused by the
>> previous load.
>> 
>> What do you think? 
> 
> This is called core-hopping, and yes that's a nice goal, although I
> would like to do that after we get the 'simple' bits up and running. I
> suspect it'll end up being slightly more complex than we'd like, due
> to the fact that the goal conflicts with wanting to aggregate things on
> cpu0 due to cpu0 being special for a host of reasons.
> 
> 

Hi Peter,

Agreed. We need to get there step by step, and I think that per-task load tracking
is the first one. We do have other metrics besides load that can influence the
scheduler decisions, with the most obvious being power consumption.

BTW, since we're going to the trouble of calculating per-task load with
increased accuracy, how about giving some thought to translating the load numbers
into an absolute format? I.e. with the CPUs now having fluctuating performance
(due to cpufreq etc.), one would say that each CPU has a capacity of X bogomips
(or some other absolute unit) per OPP. Perhaps having such a bogomips number
calculated per-task would make things easier.
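
A small worked sketch of that translation, in userspace C. The helper
names and the unitless per-OPP capacity numbers are assumptions made for
the example; nothing here is an existing scheduler interface.

```c
#include <assert.h>

/*
 * Convert a relative utilisation (per-mille of the CPU at its current
 * OPP) into an absolute load, given the capacity of that OPP in some
 * absolute unit (bogomips-like, hypothetical).
 */
unsigned int absolute_load(unsigned int util_permille,
			   unsigned int opp_capacity)
{
	assert(util_permille <= 1000);
	return (unsigned int)((unsigned long long)util_permille *
			      opp_capacity / 1000);
}

/*
 * Predict the relative utilisation (per-mille, capped at 1000) that the
 * same absolute load would cause at another OPP.
 */
unsigned int util_at_opp(unsigned int abs_load, unsigned int opp_capacity)
{
	unsigned long long u;

	assert(opp_capacity > 0);
	u = (unsigned long long)abs_load * 1000 / opp_capacity;
	return u > 1000 ? 1000 : (unsigned int)u;
}
```

For example, a task that keeps a CPU 50% busy at an OPP with capacity 512
has an absolute load of 256; at an OPP with capacity 1024 the same task
would be expected to use about 25% of the CPU.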

Perhaps the same can be done with power/energy, i.e. have a per-task power
consumption figure that we can use for scheduling, according to the available
power budget per CPU.

Dunno, it might not be feasible ATM, but having a power-aware scheduler would
assume some kind of power measurement, no?

Regards

-- Pantelis


 



* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-21 13:31                         ` Pantelis Antoniou
@ 2012-02-21 14:52                           ` Amit Kucheria
  2012-02-21 17:06                             ` Pantelis Antoniou
  0 siblings, 1 reply; 34+ messages in thread
From: Amit Kucheria @ 2012-02-21 14:52 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Peter Zijlstra, linaro-kernel, Russell King - ARM Linux,
	Nicolas Pitre, Benjamin Herrenschmidt, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, Todd Poynor, Saravana Kannan,
	Mike Chan, Dave Jones, Ingo Molnar, Paul E. McKenney,
	kernel-team, linux-arm-kernel, Arjan Van De Ven

On Tue, Feb 21, 2012 at 3:31 PM, Pantelis Antoniou
<panto@antoniou-consulting.com> wrote:
>
> On Feb 21, 2012, at 2:56 PM, Peter Zijlstra wrote:
>
>> On Tue, 2012-02-21 at 14:38 +0200, Pantelis Antoniou wrote:
>>>
>>> If we go to all the trouble of integrating cpufreq/cpuidle/sched into scheduler
>>> callbacks, we should place hooks into the thermal framework/PM as well.
>>>
>>> It will be pretty common to have per-core temperature readings on most
>>> modern SoCs.
>>>
>>> It is quite conceivable to have a case with a multi-core CPU where, due
>>> to load imbalance, one (or more) of the cores is running at full speed
>>> while the rest are mostly idle. What you want to do, for best performance
>>> and conceivably better power consumption, is not to throttle the
>>> frequency or lower the voltage of the overloaded CPU, but to migrate the
>>> load to one of the cooler CPUs.
>>>
>>> This affects CPU capacity immediately, i.e. you shouldn't schedule more
>>> load on a CPU that is too hot, since you'll only end up triggering thermal
>>> shutdown. The ideal solution would be to round-robin the load from the
>>> hot CPU to the cooler ones, but not so fast that we lose out due to the
>>> migration of state from one CPU to the other.
>>>
>>> In a nutshell, the processing capacity of a core is not static, i.e. it
>>> might degrade over time due to the increase of temperature caused by the
>>> previous load.
>>>
>>> What do you think?
>>
>> This is called core-hopping, and yes that's a nice goal, although I
>> would like to do that after we get the 'simple' bits up and running. I
>> suspect it'll end up being slightly more complex than we'd like, due
>> to the fact that the goal conflicts with wanting to aggregate things on
>> cpu0 due to cpu0 being special for a host of reasons.
>>
>>
>
> Hi Peter,
>
> Agreed. We need to get there step by step, and I think that per-task load tracking
> is the first one. We do have other metrics besides load that can influence the
> scheduler decisions, with the most obvious being power consumption.
>
> BTW, since we're going to the trouble of calculating per-task load with
> increased accuracy, how about giving some thought to translating the load numbers
> into an absolute format? I.e. with the CPUs now having fluctuating performance
> (due to cpufreq etc.), one would say that each CPU has a capacity of X bogomips
> (or some other absolute unit) per OPP. Perhaps having such a bogomips number
> calculated per-task would make things easier.
>
> Perhaps the same can be done with power/energy, i.e. have a per-task power
> consumption figure that we can use for scheduling, according to the available
> power budget per CPU.
>
> Dunno, it might not be feasible ATM, but having a power-aware scheduler would
> assume some kind of power measurement, no?

No please. We don't want to document ADC requirements, current probe
specs and sampling rates to successfully run the Linux kernel. :)

But from the scheduler mini-summit, there is acceptance that we need
to pass *some* knowledge of CPU characteristics to Linux. These need
to be distilled down to a few that guide scheduler policy, e.g. the power
cost of using a core. This in turn would influence the scheduler's
spread-or-gather decision (whether it is better to consolidate tasks onto
a few cores or to spread them out at low frequencies). Manufacturing
processes and CPU architecture obviously play a role in the differences here.
However, I don't expect the unit for these parameters to be mW. :)
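
As a toy illustration of how a per-core power cost could feed the
spread-or-gather decision, here is a userspace C sketch. The struct, the
function names and the unitless cost values are hypothetical; the cost is
deliberately not expressed in mW.

```c
#include <assert.h>

/* One candidate way of placing the runnable tasks. */
struct placement {
	unsigned int cores;		/* number of cores kept busy */
	unsigned int cost_per_core;	/* unitless cost of one busy core at the chosen OPP */
};

/* Total cost of a placement: busy cores times the cost of each. */
unsigned int placement_cost(struct placement p)
{
	assert(p.cores > 0);
	return p.cores * p.cost_per_core;
}

/*
 * Returns 1 if gathering tasks onto few cores is no more expensive than
 * spreading them out, 0 otherwise.
 */
int prefer_gather(struct placement gather, struct placement spread)
{
	return placement_cost(gather) <= placement_cost(spread);
}
```

On a platform where one fast core costs 100 and four slow cores cost 20
each, spreading wins (80 against 100); if the slow cores cost 30 each,
gathering wins (100 against 120). Only the relative costs matter here.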

/Amit


* Re: [PATCH RFC 0/4] Scheduler idle notifiers and users
  2012-02-21 14:52                           ` Amit Kucheria
@ 2012-02-21 17:06                             ` Pantelis Antoniou
  0 siblings, 0 replies; 34+ messages in thread
From: Pantelis Antoniou @ 2012-02-21 17:06 UTC (permalink / raw)
  To: Amit Kucheria
  Cc: Peter Zijlstra, linaro-kernel, Russell King - ARM Linux,
	Nicolas Pitre, Benjamin Herrenschmidt, Oleg Nesterov, cpufreq,
	linux-kernel, Anton Vorontsov, Todd Poynor, Saravana Kannan,
	Mike Chan, Dave Jones, Ingo Molnar, Paul E. McKenney,
	kernel-team, linux-arm-kernel, Arjan Van De Ven

Hi Amit,

On Feb 21, 2012, at 4:52 PM, Amit Kucheria wrote:

> On Tue, Feb 21, 2012 at 3:31 PM, Pantelis Antoniou
> <panto@antoniou-consulting.com> wrote:
>> 
>> On Feb 21, 2012, at 2:56 PM, Peter Zijlstra wrote:
>> 
>>> On Tue, 2012-02-21 at 14:38 +0200, Pantelis Antoniou wrote:
>>>> 
>>>> If we go to all the trouble of integrating cpufreq/cpuidle/sched into scheduler
>>>> callbacks, we should place hooks into the thermal framework/PM as well.
>>>> 
>>>> It will be pretty common to have per-core temperature readings on most
>>>> modern SoCs.
>>>>
>>>> It is quite conceivable to have a case with a multi-core CPU where, due
>>>> to load imbalance, one (or more) of the cores is running at full speed
>>>> while the rest are mostly idle. What you want to do, for best performance
>>>> and conceivably better power consumption, is not to throttle the
>>>> frequency or lower the voltage of the overloaded CPU, but to migrate the
>>>> load to one of the cooler CPUs.
>>>>
>>>> This affects CPU capacity immediately, i.e. you shouldn't schedule more
>>>> load on a CPU that is too hot, since you'll only end up triggering thermal
>>>> shutdown. The ideal solution would be to round-robin the load from the
>>>> hot CPU to the cooler ones, but not so fast that we lose out due to the
>>>> migration of state from one CPU to the other.
>>>> 
>>>> In a nutshell, the processing capacity of a core is not static, i.e. it
>>>> might degrade over time due to the increase of temperature caused by the
>>>> previous load.
>>>> 
>>>> What do you think?
>>> 
>>> This is called core-hopping, and yes that's a nice goal, although I
>>> would like to do that after we get the 'simple' bits up and running. I
>>> suspect it'll end up being slightly more complex than we'd like, due
>>> to the fact that the goal conflicts with wanting to aggregate things on
>>> cpu0 due to cpu0 being special for a host of reasons.
>>> 
>>> 
>> 
>> Hi Peter,
>> 
>> Agreed. We need to get there step by step, and I think that per-task load tracking
>> is the first one. We do have other metrics besides load that can influence the
>> scheduler decisions, with the most obvious being power consumption.
>> 
>> BTW, since we're going to the trouble of calculating per-task load with
>> increased accuracy, how about giving some thought to translating the load numbers
>> into an absolute format? I.e. with the CPUs now having fluctuating performance
>> (due to cpufreq etc.), one would say that each CPU has a capacity of X bogomips
>> (or some other absolute unit) per OPP. Perhaps having such a bogomips number
>> calculated per-task would make things easier.
>> 
>> Perhaps the same can be done with power/energy, i.e. have a per-task power
>> consumption figure that we can use for scheduling, according to the available
>> power budget per CPU.
>> 
>> Dunno, it might not be feasible ATM, but having a power-aware scheduler would
>> assume some kind of power measurement, no?
> 
> No please. We don't want to document ADC requirements, current probe
> specs and sampling rates to successfully run the Linux kernel. :)
> 

No, we certainly don't want to do that :). I only care about some kind
of absolute metric, and not something relative to the maximum
speed at which one of the cores can run. Whether there's some way for
a user-space app to convert that value into something like a mW measurement
is somebody else's problem :)

> But from the scheduler mini-summit, there is acceptance that we need
> to pass *some* knowledge of CPU characteristics to Linux. These need
> to be distilled down to a few that guide scheduler policy, e.g. the power
> cost of using a core. This in turn would influence the scheduler's
> spread-or-gather decision (whether it is better to consolidate tasks onto
> a few cores or to spread them out at low frequencies). Manufacturing
> processes and CPU architecture obviously play a role in the differences here.
> However, I don't expect the unit for these parameters to be mW. :)
> 
> /Amit

Yes, that is what we need. 

The problem of a power-aware scheduler, the way I see it, is a matter of getting
to a point of dynamic equilibrium between acceptable performance and acceptable
power usage.

It seems we will have the per-task CPU load value, so we have a measure of
the force pushing to one side; we will need something pushing to the other.

Regards

-- Pantelis
 






Thread overview: 34+ messages
2012-02-08  1:39 [PATCH RFC 0/4] Scheduler idle notifiers and users Anton Vorontsov
2012-02-08  1:41 ` [PATCH 1/4] sched: Introduce idle notifiers API Anton Vorontsov
2012-02-08  1:43 ` [PATCH 2/4] sched: Wire up idle notifiers Anton Vorontsov
2012-02-08  1:44 ` [PATCH 3/4] cpufreq: New 'interactive' governor Anton Vorontsov
2012-02-08 23:00   ` Vincent Guittot
2012-02-09  0:32     ` Anton Vorontsov
2012-02-08  1:44 ` [PATCH 4/4] ARM: Move leds idle start/stop calls to sched idle notifiers Anton Vorontsov
2012-02-08  3:05 ` [PATCH RFC 0/4] Scheduler idle notifiers and users Peter Zijlstra
2012-02-08 20:23   ` Dave Jones
2012-02-08 21:33     ` Benjamin Herrenschmidt
2012-02-09  7:51       ` Ingo Molnar
2012-02-11  3:15         ` Saravana Kannan
2012-02-11 14:39           ` Mark Brown
2012-02-11 14:53             ` Peter Zijlstra
2012-02-11 15:33               ` Mark Brown
2012-02-15 13:38                 ` Peter Zijlstra
2012-02-15 16:04                   ` Mark Brown
2012-02-12 21:33               ` Benjamin Herrenschmidt
2012-02-11 14:45           ` Ingo Molnar
2012-02-14 23:20             ` Saravana Kannan
2012-02-15 13:38               ` Peter Zijlstra
2012-02-15 14:02                 ` Russell King - ARM Linux
2012-02-15 15:01                   ` Peter Zijlstra
2012-02-15 16:00                     ` Russell King - ARM Linux
2012-02-15 16:09                       ` Peter Zijlstra
2012-02-16  3:31                     ` Benjamin Herrenschmidt
2012-02-16 10:14                       ` Peter Zijlstra
2012-02-17  9:00                     ` Dominik Brodowski
2012-02-20 11:03                       ` Peter Zijlstra
2012-02-21 12:38                     ` Pantelis Antoniou
2012-02-21 12:56                       ` Peter Zijlstra
2012-02-21 13:31                         ` Pantelis Antoniou
2012-02-21 14:52                           ` Amit Kucheria
2012-02-21 17:06                             ` Pantelis Antoniou
