All of lore.kernel.org
* [v2 PATCH 0/4] timers: framework for migration between CPU
@ 2009-03-04 12:12 Arun R Bharadwaj
  2009-03-04 12:14 ` [v2 PATCH 1/4] timers: framework to identify pinned timers Arun R Bharadwaj
                   ` (11 more replies)
  0 siblings, 12 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-04 12:12 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, arun, svaidy

Hi,


In an SMP system, tasks are scheduled on different CPUs by the
scheduler and interrupts are distributed by the irqbalance daemon,
but timers remain stuck on the CPUs on which they were initialised.
Timers queued by tasks get re-queued on the CPU where the task next
gets to run, but timers queued from IRQ context, such as those in
device drivers, stay on the CPU on which they were initialised.
This framework helps move all 'movable timers' from one CPU to any
other CPU of choice using a sysfs interface.

Original posting can be found here --> http://lkml.org/lkml/2009/2/20/121

Based on Ingo's suggestion, I have extended the scheduler power-saving
code, which already identifies an idle load balancer CPU, to also
automatically attract timers from all attractable sources.

Also, I have removed the per-cpu sysfs interface and instead created a
single entry at /sys/devices/system/cpu/enable_timer_migration.
This allows users to enable timer migration as a policy and lets
the kernel decide the target CPU to move timers to, as well as the
thresholds for when to initiate a timer migration and when to stop.

Timers from idle cpus are migrated to the idle-load-balancer cpu.
The idle load balancer is the one idle cpu that keeps the scheduler
tick running and performs system management tasks and load balancing
on behalf of the other idle cpus.  Attracting timers from the other
idle cpus reduces wakeups for them, while increasing the probability
that timer expiry overlaps with the scheduler tick on the idle load
balancer cpu.

However, this technique has a drawback: if the idle load balancer cpu
is re-nominated too often based on system behaviour, the timers
ping-pong between cpus. This issue can be solved by optimising the
selection of the idle load balancer cpu, as described by Gautham in
the following patch: http://lkml.org/lkml/2008/9/23/82

With Gautham's patch included, the idle-load-balancer is selected
from a semi-idle package, and we were able to experimentally verify
that the idle-load-balancer cpu is selected consistently and that
timers are consolidated on this cpu.

The following patches are included:
PATCH 1/4 - framework to identify pinned timers.
PATCH 2/4 - identifying the existing pinned hrtimers.
PATCH 3/4 - sysfs hook to enable timer migration.
PATCH 4/4 - logic to enable timer migration.

The patchset is based on the latest tip/master.


The following experiment was carried out to demonstrate the
functionality of the patch.
The machine used is a 2-socket, quad-core machine with HT enabled.

I run `make -j4` pinned to 4 CPUs, and use a driver which
continuously queues timers on one CPU.
With the timers queued, I measure the sleep state residency
for a period of 10s.
Next, I enable timer migration and measure the sleep state
residency again.
The comparison of sleep state residency values is posted below,
along with the difference in the Local Timer Interrupt (LOC) rate
from /proc/interrupts.

Timer migration is enabled with:

	echo 1 > /sys/devices/system/cpu/enable_timer_migration

and disabled with:

	echo 0 > /sys/devices/system/cpu/enable_timer_migration


$ taskset -c 4,5,6,7 make -j4

my_driver queuing timers continuously on CPU 10.

idle load balancer currently on CPU 15


Case1: Without timer migration		Case2: With timer migration

   --------------------			   --------------------
   | Core | LOC Count |			   | Core | LOC Count |
   | 4    |   2504    |			   | 4    |   2503    |
   | 5    |   2502    |			   | 5    |   2503    |
   | 6    |   2502    |			   | 6    |   2502    |
   | 7    |   2498    |			   | 7    |   2500    |
   | 10   |   2501    |			   | 10   |     35    |
   | 15   |   2501    |			   | 15   |   2501    |
   --------------------			   --------------------

   ---------------------		   ---------------------
   | Core | Sleep time |		   | Core | Sleep time |
   | 4    |    0.47168 |		   | 4    |    0.49601 |
   | 5    |    0.44301 |		   | 5    |    0.37153 |
   | 6    |    0.38979 |		   | 6    |    0.51286 |
   | 7    |    0.42829 |		   | 7    |    0.49635 |
   | 10   |    9.86652 |		   | 10   |   10.04216 |
   | 15   |    0.43048 |		   | 15   |    0.49056 |
   ---------------------		   ---------------------

Here, all the timers queued by the driver on CPU10 are moved to CPU15,
which is the idle load balancer.


--arun


* [v2 PATCH 1/4] timers: framework to identify pinned timers.
  2009-03-04 12:12 [v2 PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
@ 2009-03-04 12:14 ` Arun R Bharadwaj
  2009-03-05 16:53   ` Oleg Nesterov
  2009-03-05 16:53   ` Oleg Nesterov
  2009-03-04 12:14 ` Arun R Bharadwaj
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-04 12:14 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, svaidy, Arun Bharadwaj

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-04 17:42:49]:

This patch creates a new framework for identifying cpu-pinned timers
and hrtimers.


This framework is needed because pinned timers are expected to fire on
the same CPU on which they are queued, so they must be identified and
excluded from migration.


For regular timers, a new flag called TBASE_PINNED_FLAG is created.
Since the last 3 bits of the tvec_base pointer are guaranteed to be 0,
and since the last bit is already used to indicate deferrable timers,
the second-last bit is used to indicate cpu-pinned regular timers.
The functions that manage TBASE_PINNED_FLAG are implemented similarly
to those which manage TBASE_DEFERRABLE_FLAG.


For hrtimers, a new interface, hrtimer_start_pinned(), is created,
which can be used to queue cpu-pinned hrtimers.


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 include/linux/hrtimer.h |   24 ++++++++++++++++++++----
 kernel/hrtimer.c        |   34 ++++++++++++++++++++++++++++------
 kernel/timer.c          |   30 +++++++++++++++++++++++++++---
 3 files changed, 75 insertions(+), 13 deletions(-)

Index: linux.trees.git/kernel/timer.c
===================================================================
--- linux.trees.git.orig/kernel/timer.c
+++ linux.trees.git/kernel/timer.c
@@ -37,6 +37,7 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <linux/timer.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -87,8 +88,12 @@ static DEFINE_PER_CPU(struct tvec_base *
  * the new flag to indicate whether the timer is deferrable
  */
 #define TBASE_DEFERRABLE_FLAG		(0x1)
+#define TBASE_PINNED_FLAG		(0x2)
 
-/* Functions below help us manage 'deferrable' flag */
+/*
+ * Functions below help us manage
+ * 'deferrable' flag and 'cpu-pinned-timer' flag
+ */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
 {
 	return ((unsigned int)(unsigned long)base & TBASE_DEFERRABLE_FLAG);
@@ -96,7 +101,8 @@ static inline unsigned int tbase_get_def
 
 static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
 {
-	return ((struct tvec_base *)((unsigned long)base & ~TBASE_DEFERRABLE_FLAG));
+	return (struct tvec_base *)((unsigned long)base &
+			~(TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG));
 }
 
 static inline void timer_set_deferrable(struct timer_list *timer)
@@ -105,11 +111,28 @@ static inline void timer_set_deferrable(
 				       TBASE_DEFERRABLE_FLAG));
 }
 
+static inline unsigned long tbase_get_pinned(struct tvec_base *base)
+{
+	return (unsigned long)base & TBASE_PINNED_FLAG;
+}
+
+static inline unsigned long tbase_get_flag_bits(struct timer_list *timer)
+{
+	return tbase_get_deferrable(timer->base) |
+				tbase_get_pinned(timer->base);
+}
+
 static inline void
 timer_set_base(struct timer_list *timer, struct tvec_base *new_base)
 {
 	timer->base = (struct tvec_base *)((unsigned long)(new_base) |
-				      tbase_get_deferrable(timer->base));
+					tbase_get_flag_bits(timer));
+}
+
+static inline void timer_set_pinned(struct timer_list *timer)
+{
+	timer->base = ((struct tvec_base *)((unsigned long)(timer->base) |
+				TBASE_PINNED_FLAG));
 }
 
 static unsigned long round_jiffies_common(unsigned long j, int cpu,
@@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
 	struct tvec_base *base = per_cpu(tvec_bases, cpu);
 	unsigned long flags;
 
+	timer_set_pinned(timer);
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
 	spin_lock_irqsave(&base->lock, flags);
Index: linux.trees.git/include/linux/hrtimer.h
===================================================================
--- linux.trees.git.orig/include/linux/hrtimer.h
+++ linux.trees.git/include/linux/hrtimer.h
@@ -331,23 +331,39 @@ static inline void hrtimer_init_on_stack
 static inline void destroy_hrtimer_on_stack(struct hrtimer *timer) { }
 #endif
 
+#define HRTIMER_NOT_PINNED	0
+#define HRTIMER_PINNED		1
 /* Basic timer operations: */
 extern int hrtimer_start(struct hrtimer *timer, ktime_t tim,
 			 const enum hrtimer_mode mode);
+extern int hrtimer_start_pinned(struct hrtimer *timer, ktime_t tim,
+			const enum hrtimer_mode mode);
 extern int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
-			unsigned long range_ns, const enum hrtimer_mode mode);
+	unsigned long range_ns, const enum hrtimer_mode mode, int pinned);
 extern int hrtimer_cancel(struct hrtimer *timer);
 extern int hrtimer_try_to_cancel(struct hrtimer *timer);
 
-static inline int hrtimer_start_expires(struct hrtimer *timer,
-						enum hrtimer_mode mode)
+static inline int __hrtimer_start_expires(struct hrtimer *timer,
+					enum hrtimer_mode mode, int pinned)
 {
 	unsigned long delta;
 	ktime_t soft, hard;
 	soft = hrtimer_get_softexpires(timer);
 	hard = hrtimer_get_expires(timer);
 	delta = ktime_to_ns(ktime_sub(hard, soft));
-	return hrtimer_start_range_ns(timer, soft, delta, mode);
+	return hrtimer_start_range_ns(timer, soft, delta, mode, pinned);
+}
+
+static inline int hrtimer_start_expires(struct hrtimer *timer,
+						enum hrtimer_mode mode)
+{
+	return __hrtimer_start_expires(timer, mode, HRTIMER_NOT_PINNED);
+}
+
+static inline int hrtimer_start_expires_pinned(struct hrtimer *timer,
+						enum hrtimer_mode mode)
+{
+	return __hrtimer_start_expires(timer, mode, HRTIMER_PINNED);
 }
 
 static inline int hrtimer_restart(struct hrtimer *timer)
Index: linux.trees.git/kernel/hrtimer.c
===================================================================
--- linux.trees.git.orig/kernel/hrtimer.c
+++ linux.trees.git/kernel/hrtimer.c
@@ -193,7 +193,8 @@ struct hrtimer_clock_base *lock_hrtimer_
  * Switch the timer base to the current CPU when possible.
  */
 static inline struct hrtimer_clock_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base)
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
+int pinned)
 {
 	struct hrtimer_clock_base *new_base;
 	struct hrtimer_cpu_base *new_cpu_base;
@@ -897,9 +898,8 @@ remove_hrtimer(struct hrtimer *timer, st
  *  0 on success
  *  1 when the timer was active
  */
-int
-hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, unsigned long delta_ns,
-			const enum hrtimer_mode mode)
+int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
+	unsigned long delta_ns, const enum hrtimer_mode mode, int pinned)
 {
 	struct hrtimer_clock_base *base, *new_base;
 	unsigned long flags;
@@ -911,7 +911,7 @@ hrtimer_start_range_ns(struct hrtimer *t
 	ret = remove_hrtimer(timer, base);
 
 	/* Switch the timer base, if necessary: */
-	new_base = switch_hrtimer_base(timer, base);
+	new_base = switch_hrtimer_base(timer, base, pinned);
 
 	if (mode == HRTIMER_MODE_REL) {
 		tim = ktime_add_safe(tim, new_base->get_time());
@@ -948,6 +948,12 @@ hrtimer_start_range_ns(struct hrtimer *t
 }
 EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);
 
+int __hrtimer_start(struct hrtimer *timer, ktime_t tim,
+const enum hrtimer_mode mode, int pinned)
+{
+	return hrtimer_start_range_ns(timer, tim, 0, mode, pinned);
+}
+
 /**
  * hrtimer_start - (re)start an hrtimer on the current CPU
  * @timer:	the timer to be added
@@ -961,10 +967,26 @@ EXPORT_SYMBOL_GPL(hrtimer_start_range_ns
 int
 hrtimer_start(struct hrtimer *timer, ktime_t tim, const enum hrtimer_mode mode)
 {
-	return hrtimer_start_range_ns(timer, tim, 0, mode);
+	return __hrtimer_start(timer, tim, mode, HRTIMER_NOT_PINNED);
 }
 EXPORT_SYMBOL_GPL(hrtimer_start);
 
+/**
+ * hrtimer_start_pinned - start a CPU-pinned hrtimer
+ * @timer:      the timer to be added
+ * @tim:        expiry time
+ * @mode:       expiry mode: absolute (HRTIMER_ABS) or relative (HRTIMER_REL)
+ *
+ * Returns:
+ *  0 on success
+ *  1 when the timer was active
+ */
+int hrtimer_start_pinned(struct hrtimer *timer,
+	ktime_t tim, const enum hrtimer_mode mode)
+{
+	return __hrtimer_start(timer, tim, mode, HRTIMER_PINNED);
+}
+EXPORT_SYMBOL_GPL(hrtimer_start_pinned);
 
 /**
  * hrtimer_try_to_cancel - try to deactivate a timer



* [v2 PATCH 2/4] timers: identifying the existing pinned hrtimers.
  2009-03-04 12:12 [v2 PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
  2009-03-04 12:14 ` [v2 PATCH 1/4] timers: framework to identify pinned timers Arun R Bharadwaj
  2009-03-04 12:14 ` Arun R Bharadwaj
@ 2009-03-04 12:16 ` Arun R Bharadwaj
  2009-03-04 12:16 ` Arun R Bharadwaj
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-04 12:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, svaidy, Arun Bharadwaj

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-04 17:42:49]:

The following pinned hrtimers have been identified and marked:
1) sched_rt_period_timer
2) hrtick_timer
3) tick_sched_timer
4) stack_trace_timer_fn

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 kernel/sched.c               |    5 +++--
 kernel/time/tick-sched.c     |    7 ++++---
 kernel/trace/trace_sysprof.c |    3 ++-
 3 files changed, 9 insertions(+), 6 deletions(-)

Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c
+++ linux.trees.git/kernel/sched.c
@@ -236,7 +236,7 @@ static void start_rt_bandwidth(struct rt
 
 		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
 		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-		hrtimer_start_expires(&rt_b->rt_period_timer,
+		hrtimer_start_expires_pinned(&rt_b->rt_period_timer,
 				HRTIMER_MODE_ABS);
 	}
 	spin_unlock(&rt_b->rt_runtime_lock);
@@ -1156,7 +1156,8 @@ static __init void init_hrtick(void)
  */
 static void hrtick_start(struct rq *rq, u64 delay)
 {
-	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay), HRTIMER_MODE_REL);
+	hrtimer_start_pinned(&rq->hrtick_timer, ns_to_ktime(delay),
+				HRTIMER_MODE_REL);
 }
 
 static inline void init_hrtick(void)
Index: linux.trees.git/kernel/time/tick-sched.c
===================================================================
--- linux.trees.git.orig/kernel/time/tick-sched.c
+++ linux.trees.git/kernel/time/tick-sched.c
@@ -348,7 +348,7 @@ void tick_nohz_stop_sched_tick(int inidl
 		ts->idle_expires = expires;
 
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
-			hrtimer_start(&ts->sched_timer, expires,
+			hrtimer_start_pinned(&ts->sched_timer, expires,
 				      HRTIMER_MODE_ABS);
 			/* Check, if the timer was already in the past */
 			if (hrtimer_active(&ts->sched_timer))
@@ -394,7 +394,7 @@ static void tick_nohz_restart(struct tic
 		hrtimer_forward(&ts->sched_timer, now, tick_period);
 
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
-			hrtimer_start_expires(&ts->sched_timer,
+			hrtimer_start_expires_pinned(&ts->sched_timer,
 				      HRTIMER_MODE_ABS);
 			/* Check, if the timer was already in the past */
 			if (hrtimer_active(&ts->sched_timer))
@@ -698,7 +698,8 @@ void tick_setup_sched_timer(void)
 
 	for (;;) {
 		hrtimer_forward(&ts->sched_timer, now, tick_period);
-		hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS);
+		hrtimer_start_expires_pinned(&ts->sched_timer,
+						HRTIMER_MODE_ABS);
 		/* Check, if the timer was already in the past */
 		if (hrtimer_active(&ts->sched_timer))
 			break;
Index: linux.trees.git/kernel/trace/trace_sysprof.c
===================================================================
--- linux.trees.git.orig/kernel/trace/trace_sysprof.c
+++ linux.trees.git/kernel/trace/trace_sysprof.c
@@ -203,7 +203,8 @@ static void start_stack_timer(void *unus
 	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	hrtimer->function = stack_trace_timer_fn;
 
-	hrtimer_start(hrtimer, ns_to_ktime(sample_period), HRTIMER_MODE_REL);
+	hrtimer_start_pinned(hrtimer, ns_to_ktime(sample_period),
+				HRTIMER_MODE_REL);
 }
 
 static void start_stack_timers(void)



* [v2 PATCH 3/4] timers: sysfs hook to enable timer migration
  2009-03-04 12:12 [v2 PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
                   ` (4 preceding siblings ...)
  2009-03-04 12:18 ` [v2 PATCH 3/4] timers: sysfs hook to enable timer migration Arun R Bharadwaj
@ 2009-03-04 12:18 ` Arun R Bharadwaj
  2009-03-04 12:19 ` [v2 PATCH 4/4] timers: logic " Arun R Bharadwaj
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-04 12:18 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, svaidy, Arun Bharadwaj

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-04 17:42:49]:

This patch creates the necessary sysfs interface for timer migration.

The interface is located at
/sys/devices/system/cpu/enable_timer_migration

Timer migration is enabled with:

	echo 1 > /sys/devices/system/cpu/enable_timer_migration

and disabled with:

	echo 0 > /sys/devices/system/cpu/enable_timer_migration


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

Index: linux.trees.git/include/linux/sched.h
===================================================================
--- linux.trees.git.orig/include/linux/sched.h
+++ linux.trees.git/include/linux/sched.h
@@ -803,6 +803,7 @@ enum powersavings_balance_level {
 };
 
 extern int sched_mc_power_savings, sched_smt_power_savings;
+extern int enable_timer_migration;
 
 static inline int sd_balance_for_mc_power(void)
 {
Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c
+++ linux.trees.git/kernel/sched.c
@@ -7445,6 +7445,7 @@ static void sched_domain_node_span(int n
 #endif /* CONFIG_NUMA */
 
 int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
+int enable_timer_migration;
 
 /*
  * The cpus mask in sched_group and sched_domain hangs off the end.
@@ -8317,6 +8318,29 @@ static SYSDEV_CLASS_ATTR(sched_smt_power
 		   sched_smt_power_savings_store);
 #endif
 
+static ssize_t timer_migration_show(struct sysdev_class *class,
+					   char *page)
+{
+	return sprintf(page, "%u\n", enable_timer_migration);
+}
+static ssize_t timer_migration_store(struct sysdev_class *class,
+					    const char *buf, size_t count)
+{
+	unsigned int level = 0;
+
+	if (sscanf(buf, "%u", &level) != 1)
+		return -EINVAL;
+
+	if (level > 1)
+		return -EINVAL;
+
+	enable_timer_migration = level;
+
+	return count;
+}
+static SYSDEV_CLASS_ATTR(enable_timer_migration, 0644,
+			 timer_migration_show,
+			 timer_migration_store);
 int __init sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
 {
 	int err = 0;
@@ -8331,6 +8355,10 @@ int __init sched_create_sysfs_power_savi
 		err = sysfs_create_file(&cls->kset.kobj,
 					&attr_sched_mc_power_savings.attr);
 #endif
+	if (!err)
+		err = sysfs_create_file(&cls->kset.kobj,
+					&attr_enable_timer_migration.attr);
+
 	return err;
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-04 12:12 [v2 PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
                   ` (6 preceding siblings ...)
  2009-03-04 12:19 ` [v2 PATCH 4/4] timers: logic " Arun R Bharadwaj
@ 2009-03-04 12:19 ` Arun R Bharadwaj
  2009-03-04 16:33   ` Daniel Walker
                     ` (5 more replies)
  2009-03-04 17:33 ` [v2 PATCH 0/4] timers: framework for migration between CPU Ingo Molnar
                   ` (3 subsequent siblings)
  11 siblings, 6 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-04 12:19 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, svaidy, Arun Bharadwaj

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-04 17:42:49]:

This patch migrates all non pinned timers and hrtimers to the current
idle load balancer, from all the idle CPUs. Timers firing on busy CPUs
are not migrated.


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 include/linux/sched.h |    1 +
 kernel/hrtimer.c      |   12 +++++++++++-
 kernel/sched.c        |    5 +++++
 kernel/timer.c        |   13 ++++++++++++-
 4 files changed, 29 insertions(+), 2 deletions(-)

Index: linux.trees.git/kernel/timer.c
===================================================================
--- linux.trees.git.orig/kernel/timer.c
+++ linux.trees.git/kernel/timer.c
@@ -38,6 +38,7 @@
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
 #include <linux/timer.h>
+#include <linux/sched.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -628,7 +629,7 @@ __mod_timer(struct timer_list *timer, un
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret;
+	int ret, current_cpu, preferred_cpu;
 
 	ret = 0;
 
@@ -649,6 +650,16 @@ __mod_timer(struct timer_list *timer, un
 
 	new_base = __get_cpu_var(tvec_bases);
 
+	current_cpu = smp_processor_id();
+	preferred_cpu = get_nohz_load_balancer();
+	if (enable_timer_migration && !tbase_get_pinned(timer->base) &&
+			idle_cpu(current_cpu) && preferred_cpu != -1) {
+		new_base = per_cpu(tvec_bases, preferred_cpu);
+		timer_set_base(timer, new_base);
+		timer->expires = expires;
+		internal_add_timer(new_base, timer);
+		goto out_unlock;
+	}
 	if (base != new_base) {
 		/*
 		 * We are trying to schedule the timer on the local CPU.
Index: linux.trees.git/kernel/hrtimer.c
===================================================================
--- linux.trees.git.orig/kernel/hrtimer.c
+++ linux.trees.git/kernel/hrtimer.c
@@ -43,6 +43,8 @@
 #include <linux/seq_file.h>
 #include <linux/err.h>
 #include <linux/debugobjects.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
 
 #include <asm/uaccess.h>
 
@@ -198,8 +200,16 @@ int pinned)
 {
 	struct hrtimer_clock_base *new_base;
 	struct hrtimer_cpu_base *new_cpu_base;
+	int current_cpu, preferred_cpu;
+
+	current_cpu = smp_processor_id();
+	preferred_cpu = get_nohz_load_balancer();
+	if (enable_timer_migration && !pinned && preferred_cpu != -1 &&
+			idle_cpu(current_cpu))
+		new_cpu_base = &per_cpu(hrtimer_bases, preferred_cpu);
+	else
+		new_cpu_base = &__get_cpu_var(hrtimer_bases);
 
-	new_cpu_base = &__get_cpu_var(hrtimer_bases);
 	new_base = &new_cpu_base->clock_base[base->index];
 
 	if (base != new_base) {
Index: linux.trees.git/include/linux/sched.h
===================================================================
--- linux.trees.git.orig/include/linux/sched.h
+++ linux.trees.git/include/linux/sched.h
@@ -265,6 +265,7 @@ static inline int select_nohz_load_balan
 }
 #endif
 
+extern int get_nohz_load_balancer(void);
 /*
  * Only dump TASK_* tasks. (0 for all tasks)
  */
Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c
+++ linux.trees.git/kernel/sched.c
@@ -4009,6 +4009,11 @@ static struct {
 	.load_balancer = ATOMIC_INIT(-1),
 };
 
+inline int get_nohz_load_balancer(void)
+{
+	return atomic_read(&nohz.load_balancer);
+}
+
 /*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-04 12:19 ` Arun R Bharadwaj
  2009-03-04 16:33   ` Daniel Walker
@ 2009-03-04 16:33   ` Daniel Walker
  2009-03-04 16:52     ` Arun R Bharadwaj
  2009-03-04 16:52     ` Arun R Bharadwaj
  2009-03-05 15:34   ` Oleg Nesterov
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 46+ messages in thread
From: Daniel Walker @ 2009-03-04 16:33 UTC (permalink / raw)
  To: arun
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

On Wed, 2009-03-04 at 17:49 +0530, Arun R Bharadwaj wrote:
> @@ -649,6 +650,16 @@ __mod_timer(struct timer_list *timer, un
>  
>         new_base = __get_cpu_var(tvec_bases);
>  
> +       current_cpu = smp_processor_id();
> +       preferred_cpu = get_nohz_load_balancer();
> +       if (enable_timer_migration && !tbase_get_pinned(timer->base) &&
> +                       idle_cpu(current_cpu) && preferred_cpu != -1) {
> +               new_base = per_cpu(tvec_bases, preferred_cpu);
> +               timer_set_base(timer, new_base);
> +               timer->expires = expires;
> +               internal_add_timer(new_base, timer);
> +               goto out_unlock;
> +       }

Are you sure this compiles w/ CONFIG_SMP=n? It looks like
enable_timer_migration wouldn't exist in that case, but the chunks
you're adding in the timer code are still using it.

Daniel


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-04 16:33   ` Daniel Walker
@ 2009-03-04 16:52     ` Arun R Bharadwaj
  2009-03-04 16:52     ` Arun R Bharadwaj
  1 sibling, 0 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-04 16:52 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

* Daniel Walker <dwalker@fifo99.com> [2009-03-04 08:33:09]:

> On Wed, 2009-03-04 at 17:49 +0530, Arun R Bharadwaj wrote:
> > @@ -649,6 +650,16 @@ __mod_timer(struct timer_list *timer, un
> >  
> >         new_base = __get_cpu_var(tvec_bases);
> >  
> > +       current_cpu = smp_processor_id();
> > +       preferred_cpu = get_nohz_load_balancer();
> > +       if (enable_timer_migration && !tbase_get_pinned(timer->base) &&
> > +                       idle_cpu(current_cpu) && preferred_cpu != -1) {
> > +               new_base = per_cpu(tvec_bases, preferred_cpu);
> > +               timer_set_base(timer, new_base);
> > +               timer->expires = expires;
> > +               internal_add_timer(new_base, timer);
> > +               goto out_unlock;
> > +       }
> 
> Are you sure this compiles w/ CONFIG_SMP=n? It looks like
> enable_timer_migration wouldn't exist in that case, but the chunks
> you're adding in the timer code are still using it.
> 
> Daniel
>

Hi Daniel,

Thanks for pointing this out. I'll clean this up in my next iteration.

--arun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 0/4] timers: framework for migration between CPU
  2009-03-04 12:12 [v2 PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
                   ` (7 preceding siblings ...)
  2009-03-04 12:19 ` Arun R Bharadwaj
@ 2009-03-04 17:33 ` Ingo Molnar
  2009-03-04 18:06   ` Vaidyanathan Srinivasan
  2009-03-04 18:06   ` Vaidyanathan Srinivasan
  2009-03-04 17:33 ` Ingo Molnar
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 46+ messages in thread
From: Ingo Molnar @ 2009-03-04 17:33 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy


* Arun R Bharadwaj <arun@linux.vnet.ibm.com> wrote:

> $taskset -c 4,5,6,7 make -j4
> 
> my_driver queuing timers continuously on CPU 10.
> 
> idle load balancer currently on CPU 15
> 
> 
> Case1: Without timer migration		Case2: With timer migration
> 
>    --------------------			   --------------------
>    | Core | LOC Count |			   | Core | LOC Count |
>    | 4    |   2504    |			   | 4    |   2503    |
>    | 5    |   2502    |			   | 5    |   2503    |
>    | 6    |   2502    |			   | 6    |   2502    |
>    | 7    |   2498    |			   | 7    |   2500    |
>    | 10   |   2501    |			   | 10   |     35    |
>    | 15   |   2501    |			   | 15   |   2501    |
>    --------------------			   --------------------
> 
>    ---------------------		   --------------------
>    | Core | Sleep time |		   | Core | Sleep time |
>    | 4    |    0.47168 |		   | 4    |    0.49601 |
>    | 5    |    0.44301 |		   | 5    |    0.37153 |
>    | 6    |    0.38979 |		   | 6    |    0.51286 |
>    | 7    |    0.42829 |		   | 7    |    0.49635 |
>    | 10   |    9.86652 |		   | 10   |   10.04216 |
>    | 15   |    0.43048 |		   | 15   |    0.49056 |
>    ---------------------		   ---------------------
> 
> Here, all the timers queued by the driver on CPU10 are moved to CPU15,
> which is the idle load balancer.

The numbers with this automatic method based on the ilb-cpu look 
pretty convincing. Is this what you expected it to be?

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 0/4] timers: framework for migration between CPU
  2009-03-04 17:33 ` [v2 PATCH 0/4] timers: framework for migration between CPU Ingo Molnar
  2009-03-04 18:06   ` Vaidyanathan Srinivasan
@ 2009-03-04 18:06   ` Vaidyanathan Srinivasan
  2009-03-04 18:29     ` Ingo Molnar
  2009-03-04 18:29     ` Ingo Molnar
  1 sibling, 2 replies; 46+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-03-04 18:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arun R Bharadwaj, linux-kernel, linux-pm, a.p.zijlstra, ego,
	tglx, andi, venkatesh.pallipadi, vatsa, arjan

* Ingo Molnar <mingo@elte.hu> [2009-03-04 18:33:21]:

> 
> * Arun R Bharadwaj <arun@linux.vnet.ibm.com> wrote:
> 
> > $taskset -c 4,5,6,7 make -j4
> > 
> > my_driver queuing timers continuously on CPU 10.
> > 
> > idle load balancer currently on CPU 15
> > 
> > 
> > Case1: Without timer migration		Case2: With timer migration
> > 
> >    --------------------			   --------------------
> >    | Core | LOC Count |			   | Core | LOC Count |
> >    | 4    |   2504    |			   | 4    |   2503    |
> >    | 5    |   2502    |			   | 5    |   2503    |
> >    | 6    |   2502    |			   | 6    |   2502    |
> >    | 7    |   2498    |			   | 7    |   2500    |
> >    | 10   |   2501    |			   | 10   |     35    |
> >    | 15   |   2501    |			   | 15   |   2501    |
> >    --------------------			   --------------------
> > 
> >    ---------------------		   --------------------
> >    | Core | Sleep time |		   | Core | Sleep time |
> >    | 4    |    0.47168 |		   | 4    |    0.49601 |
> >    | 5    |    0.44301 |		   | 5    |    0.37153 |
> >    | 6    |    0.38979 |		   | 6    |    0.51286 |
> >    | 7    |    0.42829 |		   | 7    |    0.49635 |
> >    | 10   |    9.86652 |		   | 10   |   10.04216 |
> >    | 15   |    0.43048 |		   | 15   |    0.49056 |
> >    ---------------------		   ---------------------
> > 
> > Here, all the timers queued by the driver on CPU10 are moved to CPU15,
> > which is the idle load balancer.
> 
> The numbers with this automatic method based on the ilb-cpu look 
> pretty convincing. Is this what you expected it to be?

Yes Ingo, these are the expected results and they look pretty good.
However, two parameters are controlled in this experiment:

1) The system is moderately loaded with kernbench so that there are
   some busy CPUs and some idle cpus, and the no_hz mask does not
   change often.  This leads to stable ilb-cpu selection.  If the
   system is either completely idle, or so lightly loaded that the
   ilb keeps being re-nominated, then timers keep following the ilb
   cpu and it is very difficult to experimentally observe the
   benefits.

   Even if the ilb bounces, consolidating timers should increase
   overlap between timers and reduce wakeups from idle.

   Optimising the ilb selection should significantly improve the
   experimental results for this patch.

2) The timer test driver creates quite a large timer load so that the
   effect of migration is observable as a sleep-time difference on the
   expected target cpu.  This kind of timer load may not be uncommon
   with lots of application stacks loaded on an enterprise system.

--Vaidy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 0/4] timers: framework for migration between CPU
  2009-03-04 18:06   ` Vaidyanathan Srinivasan
  2009-03-04 18:29     ` Ingo Molnar
@ 2009-03-04 18:29     ` Ingo Molnar
  1 sibling, 0 replies; 46+ messages in thread
From: Ingo Molnar @ 2009-03-04 18:29 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Arun R Bharadwaj, linux-kernel, linux-pm, a.p.zijlstra, ego,
	tglx, andi, venkatesh.pallipadi, vatsa, arjan


* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@elte.hu> [2009-03-04 18:33:21]:
> 
> > 
> > * Arun R Bharadwaj <arun@linux.vnet.ibm.com> wrote:
> > 
> > > $taskset -c 4,5,6,7 make -j4
> > > 
> > > my_driver queuing timers continuously on CPU 10.
> > > 
> > > idle load balancer currently on CPU 15
> > > 
> > > 
> > > Case1: Without timer migration		Case2: With timer migration
> > > 
> > >    --------------------			   --------------------
> > >    | Core | LOC Count |			   | Core | LOC Count |
> > >    | 4    |   2504    |			   | 4    |   2503    |
> > >    | 5    |   2502    |			   | 5    |   2503    |
> > >    | 6    |   2502    |			   | 6    |   2502    |
> > >    | 7    |   2498    |			   | 7    |   2500    |
> > >    | 10   |   2501    |			   | 10   |     35    |
> > >    | 15   |   2501    |			   | 15   |   2501    |
> > >    --------------------			   --------------------
> > > 
> > >    ---------------------		   --------------------
> > >    | Core | Sleep time |		   | Core | Sleep time |
> > >    | 4    |    0.47168 |		   | 4    |    0.49601 |
> > >    | 5    |    0.44301 |		   | 5    |    0.37153 |
> > >    | 6    |    0.38979 |		   | 6    |    0.51286 |
> > >    | 7    |    0.42829 |		   | 7    |    0.49635 |
> > >    | 10   |    9.86652 |		   | 10   |   10.04216 |
> > >    | 15   |    0.43048 |		   | 15   |    0.49056 |
> > >    ---------------------		   ---------------------
> > > 
> > > Here, all the timers queued by the driver on CPU10 are moved to CPU15,
> > > which is the idle load balancer.
> > 
> > The numbers with this automatic method based on the ilb-cpu look 
> > pretty convincing. Is this what you expected it to be?
> 
> Yes Ingo, this is the expected results and looks pretty good.  However
> there are two parameters controlled in this experiment:
> 
> 1) The system is moderately loaded with kernbench so that there are
>    some busy CPUs and some idle cpus, and the no_hz mask does not
>    change often.  This leads to stable ilb-cpu selection.  If the
>    system is either completely idle or loaded so lightly that the
>    ilb nomination keeps changing, then timers keep following the ilb
>    cpu and it is very difficult to experimentally observe the benefits.
> 
>    Even if the ilb bounces, consolidating timers should increase
>    overlap between timers and reduce the wakeup from idle.
> 
>    Optimising the ilb selection should significantly improve
>    experimental results for this patch.
> 
> 2) The timer test driver creates quite a large timer load so that the
>    effect of migration is observable as a sleep time difference on the
>    expected target cpu.  This kind of timer load may not be uncommon
>    with a large application stack loaded on an enterprise system.

The important thing to watch out for is to not have _worse_
performance due to the ilb jumping around too much. So as long as
you can prove that the numbers don't get worse, you are golden.

Power-saving via migration will only work if there's a 
concentrated workload to begin with.

So the best results will be in combination with scheduler 
power-saving patches. (which too make the ilb jump less in 
essence)

So by getting scheduler power saving enhancements your method 
will work better too - there's good synergy and no dependency on 
any user-space component.

Btw., could you please turn the runtime switch into a /proc/sys
sysctl, exposed only when CONFIG_SCHED_DEBUG=y? Otherwise it should
be default-enabled with no ability to turn it off.

	Ingo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-04 12:19 ` Arun R Bharadwaj
                     ` (2 preceding siblings ...)
  2009-03-05 15:34   ` Oleg Nesterov
@ 2009-03-05 15:34   ` Oleg Nesterov
  2009-03-05 16:23   ` Oleg Nesterov
  2009-03-05 16:23   ` Oleg Nesterov
  5 siblings, 0 replies; 46+ messages in thread
From: Oleg Nesterov @ 2009-03-05 15:34 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

On 03/04, Arun R Bharadwaj wrote:
>
> @@ -628,7 +629,7 @@ __mod_timer(struct timer_list *timer, un
>  {
>  	struct tvec_base *base, *new_base;
>  	unsigned long flags;
> -	int ret;
> +	int ret, current_cpu, preferred_cpu;
>  
>  	ret = 0;
>  
> @@ -649,6 +650,16 @@ __mod_timer(struct timer_list *timer, un
>  
>  	new_base = __get_cpu_var(tvec_bases);
>  
> +	current_cpu = smp_processor_id();
> +	preferred_cpu = get_nohz_load_balancer();
> +	if (enable_timer_migration && !tbase_get_pinned(timer->base) &&
> +			idle_cpu(current_cpu) && preferred_cpu != -1) {
> +		new_base = per_cpu(tvec_bases, preferred_cpu);
> +		timer_set_base(timer, new_base);
> +		timer->expires = expires;
> +		internal_add_timer(new_base, timer);
> +		goto out_unlock;

I didn't read the whole series, but this looks very wrong.

We cannot do internal_add_timer() etc. until we lock new_base; please
look at how the current code does this under "if (base != new_base)".

I think you can do something like

	-	new_base = __get_cpu_var(tvec_bases);
	+
	+	new_cpu = smp_processor_id();
	+	if (enable_timer_migration && ....)
	+		new_cpu = preferred_cpu;
	+
	+	new_base = per_cpu(tvec_bases, new_cpu);

		if (base != new_base) {

Oleg.
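
The locking pattern Oleg points at can be modeled in a small userspace sketch (this is an illustration with hypothetical names and pthread mutexes, not the kernel code): a timer may only be queued on another CPU's base after that base's lock is taken, which is why the current __mod_timer() drops the old lock and takes the new one under "if (base != new_base)".

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_CPUS 4

/* Per-CPU timer base: a lock plus a count of queued timers. */
struct tvec_base {
	pthread_mutex_t lock;
	int pending;
};

static struct tvec_base bases[NR_CPUS] = {
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
};

struct timer_list {
	struct tvec_base *base;
	unsigned long expires;
};

/*
 * Re-queue @timer on @new_cpu, mimicking __mod_timer()'s base-switch
 * dance: detach under the old base's lock, drop it, take the new
 * base's lock, and only then add the timer there.
 */
static void mod_timer_to(struct timer_list *timer, int new_cpu,
			 unsigned long expires)
{
	struct tvec_base *base = timer->base;
	struct tvec_base *new_base = &bases[new_cpu];

	pthread_mutex_lock(&base->lock);
	base->pending--;		/* detach from the old base */
	if (base != new_base) {
		timer->base = NULL;	/* owned by nobody while unlocked */
		pthread_mutex_unlock(&base->lock);
		pthread_mutex_lock(&new_base->lock);
	}
	timer->base = new_base;
	timer->expires = expires;
	new_base->pending++;		/* internal_add_timer() equivalent */
	pthread_mutex_unlock(&new_base->lock);
}
```

The bug being flagged is exactly the shortcut this sketch avoids: calling the internal_add_timer() equivalent on new_base while still holding only the old base's lock.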


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-04 12:19 ` Arun R Bharadwaj
                     ` (4 preceding siblings ...)
  2009-03-05 16:23   ` Oleg Nesterov
@ 2009-03-05 16:23   ` Oleg Nesterov
  2009-03-06  3:21     ` Gautham R Shenoy
  2009-03-06  3:21     ` Gautham R Shenoy
  5 siblings, 2 replies; 46+ messages in thread
From: Oleg Nesterov @ 2009-03-05 16:23 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

On 03/04, Arun R Bharadwaj wrote:
>
> +++ linux.trees.git/kernel/sched.c
> @@ -4009,6 +4009,11 @@ static struct {
>  	.load_balancer = ATOMIC_INIT(-1),
>  };
>
> +inline int get_nohz_load_balancer(void)

inline?

> +{
> +	return atomic_read(&nohz.load_balancer);
> +}

Shouldn't we reset .load_balancer when this CPU is CPU_DOWN'ed?
Otherwise the timer can migrate to the dead CPU.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 1/4] timers: framework to identify pinned timers.
  2009-03-04 12:14 ` [v2 PATCH 1/4] timers: framework to identify pinned timers Arun R Bharadwaj
  2009-03-05 16:53   ` Oleg Nesterov
@ 2009-03-05 16:53   ` Oleg Nesterov
  2009-03-06  6:14     ` Arun R Bharadwaj
                       ` (3 more replies)
  1 sibling, 4 replies; 46+ messages in thread
From: Oleg Nesterov @ 2009-03-05 16:53 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

On 03/04, Arun R Bharadwaj wrote:
>
> +static inline unsigned long tbase_get_flag_bits(struct timer_list *timer)
> +{
> +	return tbase_get_deferrable(timer->base) |
> +				tbase_get_pinned(timer->base);
> +}

I'd say this looks a bit strange. Hopefully compiler can optimize this code
to return (unsigned long)base & (TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG).
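
The trick under discussion can be sketched in userspace C (an illustration, assuming as the patch does that tvec_base pointers are at least 4-byte aligned, so the two low bits of the base pointer can carry per-timer flags; the accessor here takes the tagged pointer directly rather than a timer_list):

```c
#include <assert.h>

#define TBASE_DEFERRABLE_FLAG	0x1UL
#define TBASE_PINNED_FLAG	0x2UL

/* 4-byte alignment guarantees the two low pointer bits are free. */
struct tvec_base { int dummy; } __attribute__((aligned(4)));

static inline unsigned long tbase_get_deferrable(struct tvec_base *base)
{
	return (unsigned long)base & TBASE_DEFERRABLE_FLAG;
}

static inline unsigned long tbase_get_pinned(struct tvec_base *base)
{
	return (unsigned long)base & TBASE_PINNED_FLAG;
}

/*
 * The combined accessor: a compiler may fold the two masks and the OR
 * into a single base & (TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG).
 */
static inline unsigned long tbase_get_flag_bits(struct tvec_base *base)
{
	return tbase_get_deferrable(base) | tbase_get_pinned(base);
}
```

Whether the fold actually happens is the open question in this subthread; the two forms are functionally equivalent either way.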

> @@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
>  	struct tvec_base *base = per_cpu(tvec_bases, cpu);
>  	unsigned long flags;
>
> +	timer_set_pinned(timer);

But we never clear TBASE_PINNED_FLAG?

If we use mod_timer() next time, the timer remains "pinned". I do not say
this is really wrong, but a bit strange imho.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 0/4] timers: framework for migration between CPU
  2009-03-04 12:12 [v2 PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
                   ` (9 preceding siblings ...)
  2009-03-04 17:33 ` Ingo Molnar
@ 2009-03-06  0:14 ` Arjan van de Ven
  2009-03-06  4:23   ` Arun R Bharadwaj
  2009-03-06  4:23   ` Arun R Bharadwaj
  2009-03-06  0:14 ` Arjan van de Ven
  11 siblings, 2 replies; 46+ messages in thread
From: Arjan van de Ven @ 2009-03-06  0:14 UTC (permalink / raw)
  To: arun
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, svaidy

On Wed, 4 Mar 2009 17:42:49 +0530
Arun R Bharadwaj <arun@linux.vnet.ibm.com> wrote:
> With the timers queued I measure the sleep state residency
> for a period of 10s.
> Next, I enable timer migration and measure the sleep state
> residency period.

is your table in seconds? Because realistically, at least on PC like
hardware, doing work to get sleep times over a 100 / 200 msecs or so
isn't going to save you any amount of measurable power anymore...



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-05 16:23   ` Oleg Nesterov
@ 2009-03-06  3:21     ` Gautham R Shenoy
  2009-03-06 14:50       ` Oleg Nesterov
  2009-03-06 14:50       ` Oleg Nesterov
  2009-03-06  3:21     ` Gautham R Shenoy
  1 sibling, 2 replies; 46+ messages in thread
From: Gautham R Shenoy @ 2009-03-06  3:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arun R Bharadwaj, linux-kernel, linux-pm, a.p.zijlstra, tglx,
	mingo, andi, venkatesh.pallipadi, vatsa, arjan, svaidy

On Thu, Mar 05, 2009 at 05:23:29PM +0100, Oleg Nesterov wrote:
> On 03/04, Arun R Bharadwaj wrote:
> >
> > +++ linux.trees.git/kernel/sched.c
> > @@ -4009,6 +4009,11 @@ static struct {
> >  	.load_balancer = ATOMIC_INIT(-1),
> >  };
> >
> > +inline int get_nohz_load_balancer(void)
> 
> inline?
> 
> > +{
> > +	return atomic_read(&nohz.load_balancer);
> > +}
> 
> Shouldn't we reset .load_balancer when this CPU is CPU_DOWN'ed ?
> Otherwise the timer can migrate to the dead CPU.

In the select_nohz_load_balancer() code, we check if this CPU is in the
cpu_active_map.  If not, this CPU relinquishes being the idle
load balancer.  Also, the timer migration code in the CPU down path would
migrate any timers queued onto this CPU, right?

> 
> Oleg.

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 0/4] timers: framework for migration between CPU
  2009-03-06  0:14 ` Arjan van de Ven
  2009-03-06  4:23   ` Arun R Bharadwaj
@ 2009-03-06  4:23   ` Arun R Bharadwaj
  1 sibling, 0 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-06  4:23 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, svaidy

* Arjan van de Ven <arjan@infradead.org> [2009-03-05 16:14:51]:

> On Wed, 4 Mar 2009 17:42:49 +0530
> Arun R Bharadwaj <arun@linux.vnet.ibm.com> wrote:
> > With the timers queued I measure the sleep state residency
> > for a period of 10s.
> > Next, I enable timer migration and measure the sleep state
> > residency period.
> 
> is your table in seconds? Because realistically, at least on PC like
> hardware, doing work to get sleep times over a 100 / 200 msecs or so
> isn't going to save you any amount of measurable power anymore...
>

Hi Arjan,

Yes, the table is in seconds.
Rather than looking at timer migration as a direct means of obtaining
power savings, I would like to see it as one of the mechanisms that
helps prevent an almost-idle cpu from waking up unnecessarily, which
would incur a power penalty.  This way, there is a chance of all the
CPUs in a particular package being idle and the entire package going
into deep sleep.
So we cannot really get big power savings just by doing timer
migration alone.

--arun

> 
> 
> -- 
> Arjan van de Ven 	Intel Open Source Technology Centre
> For development, discussion and tips for power savings, 
> visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 1/4] timers: framework to identify pinned timers.
  2009-03-05 16:53   ` Oleg Nesterov
@ 2009-03-06  6:14     ` Arun R Bharadwaj
  2009-03-06 15:03       ` Oleg Nesterov
  2009-03-06 15:03       ` Oleg Nesterov
  2009-03-06  6:14     ` Arun R Bharadwaj
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-06  6:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

* Oleg Nesterov <oleg@redhat.com> [2009-03-05 17:53:40]:

> On 03/04, Arun R Bharadwaj wrote:
> >
> > +static inline unsigned long tbase_get_flag_bits(struct timer_list *timer)
> > +{
> > +	return tbase_get_deferrable(timer->base) |
> > +				tbase_get_pinned(timer->base);
> > +}
> 
> I'd say this looks a bit strange. Hopefully compiler can optimize this code
> to return (unsigned long)base & (TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG).
>

Yes, it does.

> > @@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
> >  	struct tvec_base *base = per_cpu(tvec_bases, cpu);
> >  	unsigned long flags;
> >
> > +	timer_set_pinned(timer);
> 
> But we never clear TBASE_PINNED_FLAG?
> 
> If we use mod_timer() next time, the timer remains "pinned". I do not say
> this is really wrong, but a bit strange imho.
> 
> Oleg.
>

A pinned timer would expect to continue firing on the same CPU
even if mod_timer() is called on it the next time, right?
That's why I have not cleared the TBASE_PINNED_FLAG.

--arun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 1/4] timers: framework to identify pinned timers.
  2009-03-05 16:53   ` Oleg Nesterov
  2009-03-06  6:14     ` Arun R Bharadwaj
  2009-03-06  6:14     ` Arun R Bharadwaj
@ 2009-03-06  7:01     ` Arun R Bharadwaj
  2009-03-06  7:01     ` Arun R Bharadwaj
  3 siblings, 0 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-06  7:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

* Oleg Nesterov <oleg@redhat.com> [2009-03-05 17:53:40]:

> On 03/04, Arun R Bharadwaj wrote:
> >
> > +static inline unsigned long tbase_get_flag_bits(struct timer_list *timer)
> > +{
> > +	return tbase_get_deferrable(timer->base) |
> > +				tbase_get_pinned(timer->base);
> > +}
> 
> I'd say this looks a bit strange. Hopefully compiler can optimize this code
> to return (unsigned long)base & (TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG).
>

I misread the above statement.
The code is functionally correct, but I am not sure the compiler
optimizes it in the way you have mentioned.

> > @@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
> >  	struct tvec_base *base = per_cpu(tvec_bases, cpu);
> >  	unsigned long flags;
> >
> > +	timer_set_pinned(timer);
> 
> But we never clear TBASE_PINNED_FLAG?
> 
> If we use mod_timer() next time, the timer remains "pinned". I do not say
> this is really wrong, but a bit strange imho.
> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-06  3:21     ` Gautham R Shenoy
@ 2009-03-06 14:50       ` Oleg Nesterov
  2009-03-06 15:49         ` Gautham R Shenoy
  2009-03-06 15:49         ` Gautham R Shenoy
  2009-03-06 14:50       ` Oleg Nesterov
  1 sibling, 2 replies; 46+ messages in thread
From: Oleg Nesterov @ 2009-03-06 14:50 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Arun R Bharadwaj, linux-kernel, linux-pm, a.p.zijlstra, tglx,
	mingo, andi, venkatesh.pallipadi, vatsa, arjan, svaidy

On 03/06, Gautham R Shenoy wrote:
>
> On Thu, Mar 05, 2009 at 05:23:29PM +0100, Oleg Nesterov wrote:
> > On 03/04, Arun R Bharadwaj wrote:
> > >
> > > +++ linux.trees.git/kernel/sched.c
> > > @@ -4009,6 +4009,11 @@ static struct {
> > >  	.load_balancer = ATOMIC_INIT(-1),
> > >  };
> > >
> > > +inline int get_nohz_load_balancer(void)
> >
> > inline?
> > 
> > > +{
> > > +	return atomic_read(&nohz.load_balancer);
> > > +}
> >
> > Shouldn't we reset .load_balancer when this CPU is CPU_DOWN'ed ?
> > Otherwise the timer can migrate to the dead CPU.
>
> In the select_nohz_load_balancer() code, we check if this CPU is in the
> cpu_active_map. If no, then this CPU relinquishes being the idle
> load balancer.

I don't understand this code, but I am not sure select_nohz_load_balancer()
is always called on the cpu_down() path before migrate_timers()/migrate_hrtimers().

If this is true, then there is no problem.

> Also, the timer migration code in the CPU down path would
> migrate any timers queued onto this CPU, right ?

Yes, but if .load_balancer is not cleared before the timer migration,
mod_timer() can move the timer to the dead CPU after migration.
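
The window Oleg describes, and one way to close it, can be modeled in a toy userspace sketch (all names and the cpu_online array are illustrative, not the kernel's): nohz.load_balancer names a preferred target CPU, mod_timer() consults it, and the hotplug path must clear the nomination before the dying CPU's timers are migrated.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS 4

/* -1 means no idle load balancer is currently nominated. */
static atomic_int load_balancer = -1;
static _Atomic int cpu_online[NR_CPUS] = { 1, 1, 1, 1 };

static int get_nohz_load_balancer(void)
{
	return atomic_load(&load_balancer);
}

/*
 * Pick where a migratable timer should be queued: the nominated ilb
 * CPU if there is one and it is still online, else the current CPU.
 */
static int pick_timer_cpu(int current_cpu)
{
	int preferred = get_nohz_load_balancer();

	if (preferred != -1 && atomic_load(&cpu_online[preferred]))
		return preferred;
	return current_cpu;
}

/*
 * cpu_down() path: relinquish the balancer nomination *before* the CPU
 * goes offline and its timers are migrated, so a concurrent
 * mod_timer() can no longer pick the dying CPU.
 */
static void cpu_down(int cpu)
{
	int expected = cpu;

	atomic_compare_exchange_strong(&load_balancer, &expected, -1);
	atomic_store(&cpu_online[cpu], 0);
}
```

Without the clear-before-offline ordering in cpu_down(), a timer re-queued between the migration pass and the nomination update would land on a dead CPU, which is the race being debated here.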

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 1/4] timers: framework to identify pinned timers.
  2009-03-06  6:14     ` Arun R Bharadwaj
@ 2009-03-06 15:03       ` Oleg Nesterov
  2009-03-06 15:03       ` Oleg Nesterov
  1 sibling, 0 replies; 46+ messages in thread
From: Oleg Nesterov @ 2009-03-06 15:03 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

On 03/06, Arun R Bharadwaj wrote:
>
> * Oleg Nesterov <oleg@redhat.com> [2009-03-05 17:53:40]:
>
> > > @@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
> > >  	struct tvec_base *base = per_cpu(tvec_bases, cpu);
> > >  	unsigned long flags;
> > >
> > > +	timer_set_pinned(timer);
> >
> > But we never clear TBASE_PINNED_FLAG?
> >
> > If we use mod_timer() next time, the timer remains "pinned". I do not say
> > this is really wrong, but a bit strange imho.
> >
> > Oleg.
> >
>
> The pinned timer would expect to continue firing on the same CPU
> although it does a mod_timer() the next time, right?

Why? Let's suppose we call queue_delayed_work_on(), and next time
we use queue_delayed_work() with the same dwork. The timer is still
pinned, this doesn't look consistent to me.

> Thats why I have not cleared the TBASE_PINNED_FLAG.

Personally, I don't like this flag. I think that "pinned" should
be the argument for mod_timer(), not the "property" of timer_list.

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-06 14:50       ` Oleg Nesterov
@ 2009-03-06 15:49         ` Gautham R Shenoy
  2009-03-06 17:08           ` Oleg Nesterov
  2009-03-06 17:08           ` Oleg Nesterov
  2009-03-06 15:49         ` Gautham R Shenoy
  1 sibling, 2 replies; 46+ messages in thread
From: Gautham R Shenoy @ 2009-03-06 15:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arun R Bharadwaj, linux-kernel, linux-pm, a.p.zijlstra, tglx,
	mingo, andi, venkatesh.pallipadi, vatsa, arjan, svaidy

On Fri, Mar 06, 2009 at 03:50:02PM +0100, Oleg Nesterov wrote:
> On 03/06, Gautham R Shenoy wrote:
> >
> > On Thu, Mar 05, 2009 at 05:23:29PM +0100, Oleg Nesterov wrote:
> > > On 03/04, Arun R Bharadwaj wrote:
> > > >
> > > > +++ linux.trees.git/kernel/sched.c
> > > > @@ -4009,6 +4009,11 @@ static struct {
> > > >  	.load_balancer = ATOMIC_INIT(-1),
> > > >  };
> > > >
> > > > +inline int get_nohz_load_balancer(void)
> > >
> > > inline?
> > > 
> > > > +{
> > > > +	return atomic_read(&nohz.load_balancer);
> > > > +}
> > >
> > > Shouldn't we reset .load_balancer when this CPU is CPU_DOWN'ed ?
> > > Otherwise the timer can migrate to the dead CPU.
> >
> > In the select_nohz_load_balancer() code, we check if this CPU is in the
> > cpu_active_map. If no, then this CPU relinquishes being the idle
> > load balancer.

.load_balancer is the idle load balancer CPU, which performs
load balancing on behalf of all the idle cpus in the system.
It's the only idle CPU which doesn't turn off its ticks.

This CPU relinquishes its role as the idle load balancer when
a) it finds a runnable task in its runqueue, i.e. before exiting idle
   state,
OR
b) all the CPUs in the system go idle.

Since it doesn't turn off its ticks, it calls
select_nohz_load_balancer() every scheduler tick, and thus can observe
that it's no longer set in cpu_active_map within a tick or two.

Also, the cpu_down() path calls migrate_timers() in CPU_DEAD, i.e. after
the CPU has gone down. But to take the CPU down, we would have invoked
the stop_machine_run() code on that CPU, which by itself would make the
CPU relinquish its role as the idle load balancer owing to condition (a).

Thus I believe mod_timer() can never migrate a timer onto a DEAD CPU
during/_after_ we invoke migrate_timers() from the CPU_DEAD callpath.

> 
> I don't understand this code, but I am not sure select_nohz_load_balancer()
> is always called on cpu_down() path before migrate_timers/migrate_hrtimers.
> 
> If this is true, then there is no problem.
> 
> > Also, the timer migration code in the CPU down path would
> > migrate any timers queued onto this CPU, right ?
> 
> Yes, but if .load_balancer is not cleared before the timer migration,
> mod_timer() can move the timer to the dead CPU after migration.
> 
> Oleg.

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-06 15:49         ` Gautham R Shenoy
  2009-03-06 17:08           ` Oleg Nesterov
@ 2009-03-06 17:08           ` Oleg Nesterov
  2009-03-06 17:26             ` Gautham R Shenoy
  2009-03-06 17:26             ` Gautham R Shenoy
  1 sibling, 2 replies; 46+ messages in thread
From: Oleg Nesterov @ 2009-03-06 17:08 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Arun R Bharadwaj, linux-kernel, linux-pm, a.p.zijlstra, tglx,
	mingo, andi, venkatesh.pallipadi, vatsa, arjan, svaidy

On 03/06, Gautham R Shenoy wrote:
>
> .load_balancer is the idle load balancer CPU which performs
> load balancing on behalf of all the idle cpus in the system.
> It's the only idle CPU which doesn't turn off it's ticks.
>
> This CPU relinquishes it's role as the idle load balancer when
> a) it finds a runnable task in it's runqueue, i.e before exiting idle
>    state.
> OR
> b) all the CPUs in the system go idle.
>
> Since it doesn't turn off it's ticks, it calls
> select_nohz_load_balancer() every scheduler tick, and thus can observe
> that it's no longer set in cpu_active_map within a tick or two.
>
> Also, the cpu_down() path calls migrate_timers() in CPU_DEAD:,i.e after
> the CPU has gone down. But to take the CPU down, we would have invoked
> the stop_machine_run() code on that CPU, which by itself would make the
> CPU relinquish it's role as the idle load balancer owing to condition (a).
>
> Thus I believe, mod_timer() can never migrate a timer on to a DEAD CPU,
> during/_after_ we invoke migrate_timers() from the CPU_DEAD callpath.

OK, thanks a lot for your explanation!

Oleg.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v2 PATCH 4/4] timers: logic to enable timer migration.
  2009-03-06 17:08           ` Oleg Nesterov
  2009-03-06 17:26             ` Gautham R Shenoy
@ 2009-03-06 17:26             ` Gautham R Shenoy
  1 sibling, 0 replies; 46+ messages in thread
From: Gautham R Shenoy @ 2009-03-06 17:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arun R Bharadwaj, linux-kernel, linux-pm, a.p.zijlstra, tglx,
	mingo, andi, venkatesh.pallipadi, vatsa, arjan, svaidy

On Fri, Mar 06, 2009 at 06:08:34PM +0100, Oleg Nesterov wrote:
> On 03/06, Gautham R Shenoy wrote:
> >
> > .load_balancer is the idle load balancer CPU which performs
> > load balancing on behalf of all the idle cpus in the system.
> > It's the only idle CPU which doesn't turn off it's ticks.
> >
> > This CPU relinquishes it's role as the idle load balancer when
> > a) it finds a runnable task in it's runqueue, i.e before exiting idle
> >    state.
> > OR
> > b) all the CPUs in the system go idle.
> >
> > Since it doesn't turn off it's ticks, it calls
> > select_nohz_load_balancer() every scheduler tick, and thus can observe
> > that it's no longer set in cpu_active_map within a tick or two.
> >
> > Also, the cpu_down() path calls migrate_timers() in CPU_DEAD:,i.e after
> > the CPU has gone down. But to take the CPU down, we would have invoked
> > the stop_machine_run() code on that CPU, which by itself would make the
> > CPU relinquish it's role as the idle load balancer owing to condition (a).
> >
> > Thus I believe, mod_timer() can never migrate a timer on to a DEAD CPU,
> > during/_after_ we invoke migrate_timers() from the CPU_DEAD callpath.
> 
> OK, thanks a lot for your explanation!

Glad it helped :-)
> 
> Oleg.

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [v2 PATCH 0/4] timers: framework for migration between CPU
@ 2009-03-04 12:12 Arun R Bharadwaj
  0 siblings, 0 replies; 46+ messages in thread
From: Arun R Bharadwaj @ 2009-03-04 12:12 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, vatsa, andi, arun, tglx, mingo, arjan

Hi,


In an SMP system, tasks are scheduled on different CPUs by the
scheduler and interrupts are managed by the irqbalance daemon, but
timers remain stuck on the CPUs on which they were initialised.
Timers queued by tasks get re-queued on the CPU where the task next
gets to run, but timers queued from IRQ context, like the ones in
device drivers, stay on the CPU on which they were initialised.  This
framework helps move all 'movable timers' from one CPU to any other
CPU of choice using a sysfs interface.

Original posting can be found here --> http://lkml.org/lkml/2009/2/20/121

Based on Ingo's suggestion, I have extended the scheduler power-saving
code, which already identifies an idle load balancer CPU, to also
automatically attract all movable timers.

Also, I have removed the per-cpu sysfs interface and instead created a
single entry at /sys/devices/system/cpu/enable_timer_migration.
This allows users to enable timer migration as a policy, letting the
kernel decide the target CPU for the timers as well as the thresholds
for when to initiate a timer migration and when to stop.

Timers from idle cpus are migrated to the idle-load-balancer cpu.
The idle load balancer is the one idle cpu that keeps the sched
tick running and does system management tasks and load balancing
on behalf of the other idle cpus.  Attracting timers from the
other idle cpus reduces wakeups for them while increasing the
probability of overlap with the sched tick on the idle load balancer cpu.

However, this technique has a drawback: if the idle load balancer cpu
is re-nominated too often based on system behaviour, timers ping-pong
between cpus. This issue can be solved by optimising the selection of
the idle load balancer cpu, as described by Gautham in the following
patch: http://lkml.org/lkml/2008/9/23/82.

If the idle load balancer is selected from a semi-idle package, by
including Gautham's patch, then we can experimentally verify that the
idle-load-balancer cpu is selected consistently and that timers are
consolidated onto it.

The following patches are included:
PATCH 1/4 - framework to identify pinned timers.
PATCH 2/4 - identifying the existing pinned hrtimers.
PATCH 3/4 - sysfs hook to enable timer migration.
PATCH 4/4 - logic to enable timer migration.

The patchset is based on the latest tip/master.


The following experiment was carried out to demonstrate the
functionality of the patch.
The machine used is a 2 socket, quad core machine, with HT enabled.

I run a `make -j4` pinned to 4 CPUs.
I have used a driver which continuously queues timers on a CPU.
With the timers queued, I measure the sleep state residency
for a period of 10s.
Next, I enable timer migration and measure the sleep state
residency for the same period.
The comparison of sleep state residency values is posted below.

The difference in the Local Timer Interrupt (LOC) count from
/proc/interrupts is also posted below.

Timer migration is enabled with:

	echo 1 > /sys/devices/system/cpu/enable_timer_migration

and disabled with:

	echo 0 > /sys/devices/system/cpu/enable_timer_migration


$taskset -c 4,5,6,7 make -j4

my_driver queuing timers continuously on CPU 10.

idle load balancer currently on CPU 15


Case1: Without timer migration		Case2: With timer migration

   --------------------			   --------------------
   | Core | LOC Count |			   | Core | LOC Count |
   | 4    |   2504    |			   | 4    |   2503    |
   | 5    |   2502    |			   | 5    |   2503    |
   | 6    |   2502    |			   | 6    |   2502    |
   | 7    |   2498    |			   | 7    |   2500    |
   | 10   |   2501    |			   | 10   |     35    |
   | 15   |   2501    |			   | 15   |   2501    |
   --------------------			   --------------------

   ---------------------		   --------------------
   | Core | Sleep time |		   | Core | Sleep time |
   | 4    |    0.47168 |		   | 4    |    0.49601 |
   | 5    |    0.44301 |		   | 5    |    0.37153 |
   | 6    |    0.38979 |		   | 6    |    0.51286 |
   | 7    |    0.42829 |		   | 7    |    0.49635 |
   | 10   |    9.86652 |		   | 10   |   10.04216 |
   | 15   |    0.43048 |		   | 15   |    0.49056 |
   ---------------------		   ---------------------

Here, all the timers queued by the driver on CPU10 are moved to CPU15,
which is the idle load balancer.


--arun

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2009-03-06 17:27 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-04 12:12 [v2 PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
2009-03-04 12:14 ` [v2 PATCH 1/4] timers: framework to identify pinned timers Arun R Bharadwaj
2009-03-05 16:53   ` Oleg Nesterov
2009-03-06  6:14     ` Arun R Bharadwaj
2009-03-06 15:03       ` Oleg Nesterov
2009-03-06  7:01     ` Arun R Bharadwaj
2009-03-04 12:16 ` [v2 PATCH 2/4] timers: identifying the existing pinned hrtimers Arun R Bharadwaj
2009-03-04 12:18 ` [v2 PATCH 3/4] timers: sysfs hook to enable timer migration Arun R Bharadwaj
2009-03-04 12:19 ` [v2 PATCH 4/4] timers: logic " Arun R Bharadwaj
2009-03-04 16:33   ` Daniel Walker
2009-03-04 16:52     ` Arun R Bharadwaj
2009-03-05 15:34   ` Oleg Nesterov
2009-03-05 16:23   ` Oleg Nesterov
2009-03-06  3:21     ` Gautham R Shenoy
2009-03-06 14:50       ` Oleg Nesterov
2009-03-06 15:49         ` Gautham R Shenoy
2009-03-06 17:08           ` Oleg Nesterov
2009-03-06 17:26             ` Gautham R Shenoy
2009-03-04 17:33 ` [v2 PATCH 0/4] timers: framework for migration between CPU Ingo Molnar
2009-03-04 18:06   ` Vaidyanathan Srinivasan
2009-03-04 18:29     ` Ingo Molnar
2009-03-06  0:14 ` Arjan van de Ven
2009-03-06  4:23   ` Arun R Bharadwaj
2009-03-04 12:12 Arun R Bharadwaj
