linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] timers: framework for migration between CPU
@ 2009-02-20 12:55 Arun R Bharadwaj
  2009-02-20 12:57 ` [RFC PATCH 1/4] timers: framework to identify pinned timers Arun R Bharadwaj
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Arun R Bharadwaj @ 2009-02-20 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, arun

Hi,


In an SMP system, tasks are scheduled on different CPUs by the
scheduler and interrupts are managed by the irqbalancer daemon, but
timers remain stuck on the CPUs on which they were initialised.
Timers queued by tasks get re-queued on the CPU where the task runs
next, but timers from IRQ context, like the ones in device drivers,
remain stuck on the CPU where they were initialised.  This framework
helps move all 'movable timers' from one CPU to any other CPU of
choice using a sysfs interface.

Why is that a problem?

In a completely idle system with a large number of cores and CPU
packages, a few timers stuck on each core will force the
corresponding CPU package to wake up for a short duration to service
the timer interrupt.

Timers eventually have to run on some CPU in the system, but the
ability to move timers from one CPU to another helps consolidate them
onto fewer CPUs.  Consolidating timers onto one or two cores
in a large system reduces CPU wakeups from idle, since there is a
better chance of servicing multiple timers during one wakeup
interval.  This technique could also help the 'range timer' framework,
where timers expiring close together in time can be combined to save
wakeups for the CPU.

Migrating timers away from a select set of CPUs and consolidating
them improves deep sleep state residency and reduces the number of
CPU wakeups from idle. This framework and patch series is an enabler
for a higher-level framework to evacuate CPU packages and consolidate
work in an almost idle system.

Currently, timers are migrated only during the cpu offline operation.
Since cpu-hotplug is too heavyweight for this purpose, this patch
series demonstrates a lightweight timer migration framework.

My earlier post to lkml in this area can be found at
http://lkml.org/lkml/2008/10/16/138

Evacuating timers from certain CPUs can also help other situations,
like HPC or a highly optimised system running a specific set of
applications.  Essentially, this framework helps us control the
spread of OS/device-driver timers in a multi-cpu system.

The following patches are included:
PATCH 1/4 - framework to identify pinned timers.
PATCH 2/4 - sysfs hook to enable timer migration.
PATCH 3/4 - identifying the existing pinned hrtimers.
PATCH 4/4 - logic to enable timer migration.

The patches are based against kernel version 2.6.29-rc5

The following experiment was carried out to demonstrate the
functionality of the patch.
The machine used is a 2 socket, quad core machine.

I have used a driver which continuously queues timers on a CPU.
With the timers queued, I measure the sleep state residency
for a period of 10s.
Next, I enable timer migration, move all timers away from
that CPU to a specific cpu, and measure the sleep state residency again.
The comparison of sleep state residency values is posted below.

Also, the difference in the Local Timer Interrupt (LOC) rate
from /proc/interrupts is posted below.

The interface for timer migration is located at
/sys/devices/system/cpu/cpuX/timer_migration

By echoing a target cpu number we can enable migration for that cpu.

	echo 4 > /sys/devices/system/cpu/cpu1/timer_migration

this would move all regular timers and hrtimers from cpu1 to cpu4 as
new timers are queued or old timers are re-queued.
Timers already in the queue will not be migrated and will
fire one last time on cpu1.

	echo 1 > /sys/devices/system/cpu/cpu1/timer_migration

Echoing a cpu's own number back into its entry restores the
no-migration state, so this would stop timer migration from cpu1.

---------------------------------------------------------------------------
Timers are being queued on CPU2 using my test driver.


	Package 0			Package 1	    Local Timer
								Count
---------------------------- ----------------------------	C0 167
|Core| Sleep time          | |Core|       Sleep time    |	C1 310
|0   |	 8.58219	   | |4   |        10.05127     |	C2 2542
|1   |	10.04206           | |5   |        10.05216     |	C3 268
|2   |   9.77348           | |6   |        10.05386     |	C4 54
|3   |  10.03901           | |7   |        10.05540     |	C5 27
---------------------------- ----------------------------	C6 28
								C7 20
Since timers are being queued on CPU2, Core sleep state residency of CPU2
is relatively low compared to others, barring CPU0. The LOC count
shows a high interrupt rate on CPU2, as expected.

---------------------------------------------------------------------------
Timers Migrated to CPU7

	Package 0			Package 1	    Local Timer
								Count
----------------------------  ----------------------------	C0 129
|Core|       Sleep time    |  |Core|       Sleep time    |	C1 206
|0   |        8.94301      |  |4   |        10.04280     |	C2 203
|1   |       10.05429      |  |5   |        10.04471     |	C3 292
|2   |       10.04477      |  |6   |        10.04320     |	C4 33
|3   |       10.04570      |  |7   |         9.77789     |	C5 25
----------------------------  ----------------------------	C6 42
								C7 2033
Here, timers are being migrated from CPU2 to CPU7. The sleep state
residency value of CPU2 has gone up and that of CPU7 has come down.
Also, LOC count shows that timers have been moved.

---------------------------------------------------------------------------
Timers migrated to CPU1

	Package 0			Package 1	    Local Timer
							        Count
----------------------------  ----------------------------      C0 210
|Core|       Sleep time    |  |Core|       Sleep time    |      C1 2049
|0   |        9.50814      |  |4   |        10.05087     |      C2 331
|1   |        9.81115      |  |5   |        10.05121     |      C3 307
|2   |       10.04120      |  |6   |        10.05312     |      C4 324
|3   |       10.04015      |  |7   |        10.05327     |      C5 22
----------------------------  ----------------------------      C6 27
                                                                C7 27

---------------------------------------------------------------------------

Please let me know your comments.

--arun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 1/4] timers: framework to identify pinned timers.
  2009-02-20 12:55 [RFC PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
@ 2009-02-20 12:57 ` Arun R Bharadwaj
  2009-02-20 12:58 ` [RFC PATCH 2/4] timers: sysfs hook to enable timer migration Arun R Bharadwaj
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Arun R Bharadwaj @ 2009-02-20 12:57 UTC (permalink / raw)
  To: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-02-20 18:25:16]:

This patch creates a new framework for identifying cpu-pinned timers
and hrtimers.

This framework is needed because pinned timers are expected to fire on
the same CPU on which they are queued, so it is essential to identify
them and make sure they are not migrated.


For regular timers, a new flag called TBASE_PINNED_FLAG is created.
Since the last 3 bits of the tvec_base pointer are guaranteed to be 0,
and since the lowest bit is already used to indicate deferrable timers,
the next bit is used to indicate cpu-pinned regular timers.
The functions that manage TBASE_PINNED_FLAG are implemented like
those that manage TBASE_DEFERRABLE_FLAG.


For hrtimers, a new interface, hrtimer_start_pinned(), is created,
which can be used to queue a cpu-pinned hrtimer.



Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 include/linux/hrtimer.h |   24 ++++++++++++++++++++----
 kernel/hrtimer.c        |   34 ++++++++++++++++++++++++++++------
 kernel/timer.c          |   30 +++++++++++++++++++++++++++---
 3 files changed, 75 insertions(+), 13 deletions(-)

Index: git-2.6/kernel/timer.c
===================================================================
--- git-2.6.orig/kernel/timer.c
+++ git-2.6/kernel/timer.c
@@ -37,6 +37,7 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <linux/timer.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -87,8 +88,12 @@ static DEFINE_PER_CPU(struct tvec_base *
  * the new flag to indicate whether the timer is deferrable
  */
 #define TBASE_DEFERRABLE_FLAG		(0x1)
+#define TBASE_PINNED_FLAG		(0x2)
 
-/* Functions below help us manage 'deferrable' flag */
+/*
+ * Functions below help us manage
+ * 'deferrable' flag and 'cpu-pinned-timer' flag
+ */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
 {
 	return ((unsigned int)(unsigned long)base & TBASE_DEFERRABLE_FLAG);
@@ -96,7 +101,8 @@ static inline unsigned int tbase_get_def
 
 static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
 {
-	return ((struct tvec_base *)((unsigned long)base & ~TBASE_DEFERRABLE_FLAG));
+	return (struct tvec_base *)((unsigned long)base &
+			~(TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG));
 }
 
 static inline void timer_set_deferrable(struct timer_list *timer)
@@ -105,11 +111,28 @@ static inline void timer_set_deferrable(
 				       TBASE_DEFERRABLE_FLAG));
 }
 
+static inline unsigned long tbase_get_pinned(struct tvec_base *base)
+{
+	return (unsigned long)base & TBASE_PINNED_FLAG;
+}
+
+static inline unsigned long tbase_get_flag_bits(struct timer_list *timer)
+{
+	return tbase_get_deferrable(timer->base) |
+				tbase_get_pinned(timer->base);
+}
+
 static inline void
 timer_set_base(struct timer_list *timer, struct tvec_base *new_base)
 {
 	timer->base = (struct tvec_base *)((unsigned long)(new_base) |
-				      tbase_get_deferrable(timer->base));
+					tbase_get_flag_bits(timer));
+}
+
+static inline void timer_set_pinned(struct timer_list *timer)
+{
+	timer->base = ((struct tvec_base *)((unsigned long)(timer->base) |
+				TBASE_PINNED_FLAG));
 }
 
 static unsigned long round_jiffies_common(unsigned long j, int cpu,
@@ -648,6 +671,7 @@ void add_timer_on(struct timer_list *tim
 	struct tvec_base *base = per_cpu(tvec_bases, cpu);
 	unsigned long flags;
 
+	timer_set_pinned(timer);
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
 	spin_lock_irqsave(&base->lock, flags);
Index: git-2.6/include/linux/hrtimer.h
===================================================================
--- git-2.6.orig/include/linux/hrtimer.h
+++ git-2.6/include/linux/hrtimer.h
@@ -331,23 +331,39 @@ static inline void hrtimer_init_on_stack
 static inline void destroy_hrtimer_on_stack(struct hrtimer *timer) { }
 #endif
 
+#define HRTIMER_NOT_PINNED	0
+#define HRTIMER_PINNED		1
 /* Basic timer operations: */
 extern int hrtimer_start(struct hrtimer *timer, ktime_t tim,
 			 const enum hrtimer_mode mode);
+extern int hrtimer_start_pinned(struct hrtimer *timer, ktime_t tim,
+			const enum hrtimer_mode mode);
 extern int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
-			unsigned long range_ns, const enum hrtimer_mode mode);
+	unsigned long range_ns, const enum hrtimer_mode mode, int pinned);
 extern int hrtimer_cancel(struct hrtimer *timer);
 extern int hrtimer_try_to_cancel(struct hrtimer *timer);
 
-static inline int hrtimer_start_expires(struct hrtimer *timer,
-						enum hrtimer_mode mode)
+static inline int __hrtimer_start_expires(struct hrtimer *timer,
+					enum hrtimer_mode mode, int pinned)
 {
 	unsigned long delta;
 	ktime_t soft, hard;
 	soft = hrtimer_get_softexpires(timer);
 	hard = hrtimer_get_expires(timer);
 	delta = ktime_to_ns(ktime_sub(hard, soft));
-	return hrtimer_start_range_ns(timer, soft, delta, mode);
+	return hrtimer_start_range_ns(timer, soft, delta, mode, pinned);
+}
+
+static inline int hrtimer_start_expires(struct hrtimer *timer,
+						enum hrtimer_mode mode)
+{
+	return __hrtimer_start_expires(timer, mode, HRTIMER_NOT_PINNED);
+}
+
+static inline int hrtimer_start_expires_pinned(struct hrtimer *timer,
+						enum hrtimer_mode mode)
+{
+	return __hrtimer_start_expires(timer, mode, HRTIMER_PINNED);
 }
 
 static inline int hrtimer_restart(struct hrtimer *timer)
Index: git-2.6/kernel/hrtimer.c
===================================================================
--- git-2.6.orig/kernel/hrtimer.c
+++ git-2.6/kernel/hrtimer.c
@@ -193,7 +193,8 @@ struct hrtimer_clock_base *lock_hrtimer_
  * Switch the timer base to the current CPU when possible.
  */
 static inline struct hrtimer_clock_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base)
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
+int pinned)
 {
 	struct hrtimer_clock_base *new_base;
 	struct hrtimer_cpu_base *new_cpu_base;
@@ -897,9 +898,8 @@ remove_hrtimer(struct hrtimer *timer, st
  *  0 on success
  *  1 when the timer was active
  */
-int
-hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, unsigned long delta_ns,
-			const enum hrtimer_mode mode)
+int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
+	unsigned long delta_ns, const enum hrtimer_mode mode, int pinned)
 {
 	struct hrtimer_clock_base *base, *new_base;
 	unsigned long flags;
@@ -911,7 +911,7 @@ hrtimer_start_range_ns(struct hrtimer *t
 	ret = remove_hrtimer(timer, base);
 
 	/* Switch the timer base, if necessary: */
-	new_base = switch_hrtimer_base(timer, base);
+	new_base = switch_hrtimer_base(timer, base, pinned);
 
 	if (mode == HRTIMER_MODE_REL) {
 		tim = ktime_add_safe(tim, new_base->get_time());
@@ -948,6 +948,12 @@ hrtimer_start_range_ns(struct hrtimer *t
 }
 EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);
 
+int __hrtimer_start(struct hrtimer *timer, ktime_t tim,
+const enum hrtimer_mode mode, int pinned)
+{
+	return hrtimer_start_range_ns(timer, tim, 0, mode, pinned);
+}
+
 /**
  * hrtimer_start - (re)start an hrtimer on the current CPU
  * @timer:	the timer to be added
@@ -961,10 +967,26 @@ EXPORT_SYMBOL_GPL(hrtimer_start_range_ns
 int
 hrtimer_start(struct hrtimer *timer, ktime_t tim, const enum hrtimer_mode mode)
 {
-	return hrtimer_start_range_ns(timer, tim, 0, mode);
+	return __hrtimer_start(timer, tim, mode, HRTIMER_NOT_PINNED);
 }
 EXPORT_SYMBOL_GPL(hrtimer_start);
 
+/**
+ * hrtimer_start_pinned - start a CPU-pinned hrtimer
+ * @timer:      the timer to be added
+ * @tim:        expiry time
+ * @mode:       expiry mode: absolute (HRTIMER_ABS) or relative (HRTIMER_REL)
+ *
+ * Returns:
+ *  0 on success
+ *  1 when the timer was active
+ */
+int hrtimer_start_pinned(struct hrtimer *timer,
+	ktime_t tim, const enum hrtimer_mode mode)
+{
+	return __hrtimer_start(timer, tim, mode, HRTIMER_PINNED);
+}
+EXPORT_SYMBOL_GPL(hrtimer_start_pinned);
 
 /**
  * hrtimer_try_to_cancel - try to deactivate a timer


* [RFC PATCH 2/4] timers: sysfs hook to enable timer migration.
  2009-02-20 12:55 [RFC PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
  2009-02-20 12:57 ` [RFC PATCH 1/4] timers: framework to identify pinned timers Arun R Bharadwaj
@ 2009-02-20 12:58 ` Arun R Bharadwaj
  2009-02-20 13:00 ` [RFC PATCH 3/4] timers: identifying the existing pinned hrtimers Arun R Bharadwaj
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Arun R Bharadwaj @ 2009-02-20 12:58 UTC (permalink / raw)
  To: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-02-20 18:25:16]:

This patch creates the necessary sysfs interface for timer migration.

The interface is located at
/sys/devices/system/cpu/cpuX/timer_migration

These sysfs entries are initialized to their respective cpu ids;
this represents the no-migration state.

Echoing a target cpu number into a cpu's entry enables migration for
that cpu: all movable timers queued on it are then migrated to the
target cpu.

e.g. echo 4 > /sys/devices/system/cpu/cpu1/timer_migration

this would move all regular timers and hrtimers from cpu1 to cpu4.



Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 drivers/base/cpu.c    |   44 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/timer.h |    2 ++
 2 files changed, 46 insertions(+)

Index: git-2.6/drivers/base/cpu.c
===================================================================
--- git-2.6.orig/drivers/base/cpu.c
+++ git-2.6/drivers/base/cpu.c
@@ -20,6 +20,45 @@ EXPORT_SYMBOL(cpu_sysdev_class);
 
 static DEFINE_PER_CPU(struct sys_device *, cpu_sys_devices);
 
+DEFINE_PER_CPU(int, enable_timer_migration);
+
+/*
+ * This function initializes sysfs entries for enabling timer migration.
+ * Each per_cpu enable_timer_migration is initialized to its cpu_id.
+ * Echoing a value other than the cpu_id will set that value as the target
+ * cpu to which the timers are to be migrated.
+ */
+void initialize_timer_migration_sysfs(void)
+{
+	int cpu;
+	for_each_possible_cpu(cpu)
+		per_cpu(enable_timer_migration, cpu) = cpu;
+}
+
+static ssize_t timer_migration_show(struct sys_device *dev,
+			struct sysdev_attribute *attr, char *buf)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+	return sprintf(buf, "%u\n", per_cpu(enable_timer_migration,
+		cpu->sysdev.id));
+}
+static ssize_t
+timer_migration_store(struct sys_device *dev, struct sysdev_attribute *attr,
+			const char *buf, size_t count)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+	ssize_t ret = -EINVAL;
+	int target_cpu;
+	if (sscanf(buf, "%d", &target_cpu) && cpu_online(target_cpu)) {
+		ret = count;
+		per_cpu(enable_timer_migration, cpu->sysdev.id) = target_cpu;
+	}
+
+	return ret;
+}
+static SYSDEV_ATTR(timer_migration, 0666,
+		timer_migration_show, timer_migration_store);
+
 #ifdef CONFIG_HOTPLUG_CPU
 static ssize_t show_online(struct sys_device *dev, struct sysdev_attribute *attr,
 			   char *buf)
@@ -221,6 +260,11 @@ int __cpuinit register_cpu(struct cpu *c
 	if (!error)
 		error = sysdev_create_file(&cpu->sysdev, &attr_crash_notes);
 #endif
+
+	if (!error) {
+		error = sysdev_create_file(&cpu->sysdev, &attr_timer_migration);
+		initialize_timer_migration_sysfs();
+	}
 	return error;
 }
 
Index: git-2.6/include/linux/timer.h
===================================================================
--- git-2.6.orig/include/linux/timer.h
+++ git-2.6/include/linux/timer.h
@@ -192,3 +192,5 @@ unsigned long round_jiffies_up(unsigned 
 unsigned long round_jiffies_up_relative(unsigned long j);
 
 #endif
+
+DECLARE_PER_CPU(int, enable_timer_migration);


* [RFC PATCH 3/4] timers: identifying the existing pinned hrtimers.
  2009-02-20 12:55 [RFC PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
  2009-02-20 12:57 ` [RFC PATCH 1/4] timers: framework to identify pinned timers Arun R Bharadwaj
  2009-02-20 12:58 ` [RFC PATCH 2/4] timers: sysfs hook to enable timer migration Arun R Bharadwaj
@ 2009-02-20 13:00 ` Arun R Bharadwaj
  2009-02-20 13:01 ` [RFC PATCH 4/4] timers: logic to enable timer migration Arun R Bharadwaj
  2009-02-20 13:21 ` [RFC PATCH 0/4] timers: framework for migration between CPU Ingo Molnar
  4 siblings, 0 replies; 23+ messages in thread
From: Arun R Bharadwaj @ 2009-02-20 13:00 UTC (permalink / raw)
  To: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-02-20 18:25:16]:

The following pinned hrtimers have been identified and marked:
1) sched_rt_period_timer
2) tick_sched_timer
3) stack_trace_timer_fn
4) hrtick_timer (via hrtick_start)



Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 kernel/sched.c               |    5 +++--
 kernel/time/tick-sched.c     |    7 ++++---
 kernel/trace/trace_sysprof.c |    3 ++-
 3 files changed, 9 insertions(+), 6 deletions(-)

Index: git-2.6/kernel/sched.c
===================================================================
--- git-2.6.orig/kernel/sched.c
+++ git-2.6/kernel/sched.c
@@ -236,7 +236,7 @@ static void start_rt_bandwidth(struct rt
 
 		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
 		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-		hrtimer_start_expires(&rt_b->rt_period_timer,
+		hrtimer_start_expires_pinned(&rt_b->rt_period_timer,
 				HRTIMER_MODE_ABS);
 	}
 	spin_unlock(&rt_b->rt_runtime_lock);
@@ -1129,7 +1129,8 @@ static __init void init_hrtick(void)
  */
 static void hrtick_start(struct rq *rq, u64 delay)
 {
-	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay), HRTIMER_MODE_REL);
+	hrtimer_start_pinned(&rq->hrtick_timer, ns_to_ktime(delay),
+				HRTIMER_MODE_REL);
 }
 
 static inline void init_hrtick(void)
Index: git-2.6/kernel/time/tick-sched.c
===================================================================
--- git-2.6.orig/kernel/time/tick-sched.c
+++ git-2.6/kernel/time/tick-sched.c
@@ -348,7 +348,7 @@ void tick_nohz_stop_sched_tick(int inidl
 		ts->idle_expires = expires;
 
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
-			hrtimer_start(&ts->sched_timer, expires,
+			hrtimer_start_pinned(&ts->sched_timer, expires,
 				      HRTIMER_MODE_ABS);
 			/* Check, if the timer was already in the past */
 			if (hrtimer_active(&ts->sched_timer))
@@ -394,7 +394,7 @@ static void tick_nohz_restart(struct tic
 		hrtimer_forward(&ts->sched_timer, now, tick_period);
 
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
-			hrtimer_start_expires(&ts->sched_timer,
+			hrtimer_start_expires_pinned(&ts->sched_timer,
 				      HRTIMER_MODE_ABS);
 			/* Check, if the timer was already in the past */
 			if (hrtimer_active(&ts->sched_timer))
@@ -698,7 +698,8 @@ void tick_setup_sched_timer(void)
 
 	for (;;) {
 		hrtimer_forward(&ts->sched_timer, now, tick_period);
-		hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS);
+		hrtimer_start_expires_pinned(&ts->sched_timer,
+						HRTIMER_MODE_ABS);
 		/* Check, if the timer was already in the past */
 		if (hrtimer_active(&ts->sched_timer))
 			break;
Index: git-2.6/kernel/trace/trace_sysprof.c
===================================================================
--- git-2.6.orig/kernel/trace/trace_sysprof.c
+++ git-2.6/kernel/trace/trace_sysprof.c
@@ -203,7 +203,8 @@ static void start_stack_timer(void *unus
 	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	hrtimer->function = stack_trace_timer_fn;
 
-	hrtimer_start(hrtimer, ns_to_ktime(sample_period), HRTIMER_MODE_REL);
+	hrtimer_start_pinned(hrtimer, ns_to_ktime(sample_period),
+				HRTIMER_MODE_REL);
 }
 
 static void start_stack_timers(void)


* [RFC PATCH 4/4] timers: logic to enable timer migration.
  2009-02-20 12:55 [RFC PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
                   ` (2 preceding siblings ...)
  2009-02-20 13:00 ` [RFC PATCH 3/4] timers: identifying the existing pinned hrtimers Arun R Bharadwaj
@ 2009-02-20 13:01 ` Arun R Bharadwaj
  2009-02-20 13:21 ` [RFC PATCH 0/4] timers: framework for migration between CPU Ingo Molnar
  4 siblings, 0 replies; 23+ messages in thread
From: Arun R Bharadwaj @ 2009-02-20 13:01 UTC (permalink / raw)
  To: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-02-20 18:25:16]:

This patch migrates all non-pinned timers and hrtimers to the target
CPU.

Timer migration is enabled by setting the sysfs entry of the
particular CPU. At timer-queueing time, we check whether migration is
enabled for that CPU. If so, the target CPU's base is set as the new
timer base and the timer is queued on that CPU.


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 kernel/hrtimer.c |   12 +++++++++++-
 kernel/timer.c   |   15 ++++++++++++++-
 2 files changed, 25 insertions(+), 2 deletions(-)

Index: git-2.6/kernel/timer.c
===================================================================
--- git-2.6.orig/kernel/timer.c
+++ git-2.6/kernel/timer.c
@@ -616,7 +616,7 @@ int __mod_timer(struct timer_list *timer
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret = 0;
+	int ret = 0, target_cpu;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
@@ -631,6 +631,18 @@ int __mod_timer(struct timer_list *timer
 	debug_timer_activate(timer);
 
 	new_base = __get_cpu_var(tvec_bases);
+	/*
+	 * @target_cpu: The CPU to which timers are to be migrated.
+	 * If timer migration is disabled, target_cpu = this_cpu.
+	 */
+	target_cpu = __get_cpu_var(enable_timer_migration);
+	if (!tbase_get_pinned(timer->base) && cpu_online(target_cpu)) {
+		new_base = per_cpu(tvec_bases, target_cpu);
+		timer_set_base(timer, new_base);
+		timer->expires = expires;
+		internal_add_timer(new_base, timer);
+		goto out;
+	}
 
 	if (base != new_base) {
 		/*
@@ -652,6 +664,7 @@ int __mod_timer(struct timer_list *timer
 
 	timer->expires = expires;
 	internal_add_timer(base, timer);
+out:
 	spin_unlock_irqrestore(&base->lock, flags);
 
 	return ret;
Index: git-2.6/kernel/hrtimer.c
===================================================================
--- git-2.6.orig/kernel/hrtimer.c
+++ git-2.6/kernel/hrtimer.c
@@ -198,8 +198,18 @@ int pinned)
 {
 	struct hrtimer_clock_base *new_base;
 	struct hrtimer_cpu_base *new_cpu_base;
+	int target_cpu;
+
+	/*
+	 * @target_cpu: CPU to which the timers are to be migrated.
+	 * If timer migration is disabled, target_cpu = this_cpu
+	 */
+	target_cpu = __get_cpu_var(enable_timer_migration);
+	if (cpu_online(target_cpu) && !pinned)
+		new_cpu_base = &per_cpu(hrtimer_bases, target_cpu);
+	else
+		new_cpu_base = &__get_cpu_var(hrtimer_bases);
 
-	new_cpu_base = &__get_cpu_var(hrtimer_bases);
 	new_base = &new_cpu_base->clock_base[base->index];
 
 	if (base != new_base) {


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-20 12:55 [RFC PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
                   ` (3 preceding siblings ...)
  2009-02-20 13:01 ` [RFC PATCH 4/4] timers: logic to enable timer migration Arun R Bharadwaj
@ 2009-02-20 13:21 ` Ingo Molnar
  2009-02-20 14:14   ` Vaidyanathan Srinivasan
  2009-02-23  7:59   ` Arun R Bharadwaj
  4 siblings, 2 replies; 23+ messages in thread
From: Ingo Molnar @ 2009-02-20 13:21 UTC (permalink / raw)
  To: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, andi,
	venkatesh.pallipadi, vatsa, arjan, arun


* Arun R Bharadwaj <arun@linux.vnet.ibm.com> wrote:

> Hi,
> 
> 
> In an SMP system, tasks are scheduled on different CPUs by the 
> scheduler, interrupts are managed by irqbalancer daemon, but 
> timers are still stuck to the CPUs that they have been 
> initialised.  Timers queued by tasks gets re-queued on the CPU 
> where the task gets to run next, but timers from IRQ context 
> like the ones in device drivers are still stuck on the CPU 
> they were initialised.  This framework will help move all 
> 'movable timers' from one CPU to any other CPU of choice using 
> a sysfs interface.

hm, the intention is good, the concept of migrating timers to 
their target CPU is good as well. We already do some of that for 
regular timers.

But the whole sysfs interface you implemented here is not 
particularly clean nor is it efficient.

The main problem is that timers are really fast-moving entities, 
and so are the tasks they are related to.

Your implementation completely ties the direction of migration 
(the timer scheduling) to a clumsy sysfs interface:

+	if (sscanf(buf, "%d", &target_cpu) && cpu_online(target_cpu)) {
+               ret = count;
+               per_cpu(enable_timer_migration, cpu->sysdev.id) = target_cpu;
+	}

That doesn't really scale and I doubt it works in practice. We 
should not schedule timers via sysfs, we should let the kernel 
do it automatically. [*]

So what i'd suggest instead is extend the scheduler power-saving 
code, which already identifies a 'load balancer CPU', to also 
attract all attractable sources of timers - automatically. See 
the 'load_balancer' CPU logic in kernel/sched.c.

Does that sound OK to you? I think the end result might even 
give better numbers - and out of box.

I'd also suggest to not do that rather ugly 
enable_timer_migration per-cpu variable, but simply reuse the 
existing nohz.load_balancer as a target CPU.

Also, please base your patches on the latest timer tree (which 
already modified some of this code in this cycle):

  http://people.redhat.com/mingo/tip.git/README

Btw., could you please also fix your mailer to not do this to 
us:

Mail-Followup-To: linux-kernel@vger.kernel.org,
        linux-pm@lists.linux-foundation.org, a.p.zijlstra@chello.nl,
        ego@in.ibm.com, tglx@linutronix.de, mingo@elte.hu,
        andi@firstfloor.org, venkatesh.pallipadi@intel.com,
        vatsa@linux.vnet.ibm.com, arjan@infradead.org

it messes up the replies.

	Ingo

[*] IRQ migration (where you possibly got the sysfs idea from) 
    is a special case where 'slow scheduling' via a user-space 
    daemon is possible: they are an external source of events 
    and they are concentrators of work. The same concept does 
    not apply to timers, most of which are inherently 
    task-generated.



* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-20 13:21 ` [RFC PATCH 0/4] timers: framework for migration between CPU Ingo Molnar
@ 2009-02-20 14:14   ` Vaidyanathan Srinivasan
  2009-02-20 16:07     ` Ingo Molnar
  2009-02-23  7:59   ` Arun R Bharadwaj
  1 sibling, 1 reply; 23+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-02-20 14:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, andi,
	venkatesh.pallipadi, vatsa, arjan, arun

* Ingo Molnar <mingo@elte.hu> [2009-02-20 14:21:45]:

> 
> * Arun R Bharadwaj <arun@linux.vnet.ibm.com> wrote:
> 
> > Hi,
> > 
> > 
> > In an SMP system, tasks are scheduled on different CPUs by the 
> > scheduler, interrupts are managed by irqbalancer daemon, but 
> > timers are still stuck to the CPUs that they have been 
> > initialised.  Timers queued by tasks gets re-queued on the CPU 
> > where the task gets to run next, but timers from IRQ context 
> > like the ones in device drivers are still stuck on the CPU 
> > they were initialised.  This framework will help move all 
> > 'movable timers' from one CPU to any other CPU of choice using 
> > a sysfs interface.
> 
> hm, the intention is good, the concept of migrating timers to 
> their target CPU is good as well. We already do some of that for 
> regular timers.
> 
> But the whole sysfs interface you implemented here is not 
> particularly clean nor is it efficient.
> 
> The main problem is that timers are really fast-moving entities, 
> and so are the tasks they are related to.
> 
> Your implementation completely ties the direction of migration 
> (the timer scheduling) to a clumsy sysfs interface:
> 
> +	if (sscanf(buf, "%d", &target_cpu) && cpu_online(target_cpu)) {
> +               ret = count;
> +               per_cpu(enable_timer_migration, cpu->sysdev.id) = target_cpu;
> +	}
> 
> That doesnt really scale and i doubt it works in practice. We 
> should not schedule timers via sysfs, we should let the kernel 
> do it auomatically. [*]

Hi Ingo,

Thanks for comments on the overall goal.  Having an in-kernel
framework to attract the 'movable' timers will be ideal.
 
> So what i'd suggest instead is extend the scheduler power-saving 
> code, which already identifies a 'load balancer CPU', to also 
> attract all attractable sources of timers - automatically. See 
> the 'load_balancer' CPU logic in kernel/sched.c.
> 
> Does that sound OK to you? I think the end result might even 
> give better numbers - and out of box.

I would agree that we can at least try that approach and compare
how we score.

> I'd also suggest to not do that rather ugly 
> enable_timer_migration per-cpu variable, but simply reuse the 
> existing nohz.load_balancer as a target CPU.

This is a good idea to automatically bias the timers.  But this
nohz.load_balancer is a very fast moving target and we will need some
heuristics to estimate overall system idleness before moving the
timers.

I would agree that the power saving load balancer has a good view of
the system and can potentially guide the timer biasing framework.

--Vaidy

> Also, please base your patches on the latest timer tree (which 
> already modified some of this code in this cycle):
> 
>   http://people.redhat.com/mingo/tip.git/README
> 
> Btw., could you please also fix your mailer to not do this to 
> us:
> 
> Mail-Followup-To: linux-kernel@vger.kernel.org,
>         linux-pm@lists.linux-foundation.org, a.p.zijlstra@chello.nl,
>         ego@in.ibm.com, tglx@linutronix.de, mingo@elte.hu,
>         andi@firstfloor.org, venkatesh.pallipadi@intel.com,
>         vatsa@linux.vnet.ibm.com, arjan@infradead.org
> 
> it messes up the replies.
> 
> 	Ingo
> 
> [*] IRQ migration (where you possibly got the sysfs idea from) 
>     is a special case where 'slow scheduling' via a user-space 
>     daemon is possible: they are an external source of events 
>     and they are concentrators of work. The same concept does 
>     not apply to timers, most of which are inherently 
>     task-generated.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-20 14:14   ` Vaidyanathan Srinivasan
@ 2009-02-20 16:07     ` Ingo Molnar
  2009-02-20 19:57       ` Arjan van de Ven
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2009-02-20 16:07 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, andi,
	venkatesh.pallipadi, vatsa, arjan, arun, Vaidyanathan Srinivasan,
	Suresh Siddha


* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> > I'd also suggest to not do that rather ugly 
> > enable_timer_migration per-cpu variable, but simply reuse 
> > the existing nohz.load_balancer as a target CPU.
> 
> This is a good idea to automatically bias the timers.  But 
> this nohz.load_balancer is a very fast moving target and we 
> will need some heuristics to estimate overall system idleness 
> before moving the timers.
> 
> I would agree that the power saving load balancer has a good 
> view of the system and can potentially guide the timer biasing 
> framework.

Yeah, it's a fast moving target, but it already concentrates 
the load somewhat.

	Ingo


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-20 16:07     ` Ingo Molnar
@ 2009-02-20 19:57       ` Arjan van de Ven
  2009-02-20 21:53         ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Arjan van de Ven @ 2009-02-20 19:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vaidyanathan Srinivasan, linux-kernel, linux-pm, a.p.zijlstra,
	ego, tglx, andi, venkatesh.pallipadi, vatsa, arun, Suresh Siddha

On Fri, 20 Feb 2009 17:07:37 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > > I'd also suggest to not do that rather ugly 
> > > enable_timer_migration per-cpu variable, but simply reuse 
> > > the existing nohz.load_balancer as a target CPU.
> > 
> > This is a good idea to automatically bias the timers.  But 
> > this nohz.load_balancer is a very fast moving target and we 
> > will need some heuristics to estimate overall system idleness 
> > before moving the timers.
> > 
> > I would agree that the power saving load balancer has a good 
> > view of the system and can potentially guide the timer biasing 
> > framework.
> 
> Yeah, it's a fast moving target, but it already concentrates 
> the load somewhat.
> 

I wonder if the real answer for this isn't to have timers be considered 
schedulable-entities and have the regular scheduler decide where they
actually run.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-20 19:57       ` Arjan van de Ven
@ 2009-02-20 21:53         ` Ingo Molnar
  2009-02-23  7:55           ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2009-02-20 21:53 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Vaidyanathan Srinivasan, linux-kernel, linux-pm, a.p.zijlstra,
	ego, tglx, andi, venkatesh.pallipadi, vatsa, arun, Suresh Siddha


* Arjan van de Ven <arjan@infradead.org> wrote:

> On Fri, 20 Feb 2009 17:07:37 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > 
> > > > I'd also suggest to not do that rather ugly 
> > > > enable_timer_migration per-cpu variable, but simply reuse 
> > > > the existing nohz.load_balancer as a target CPU.
> > > 
> > > This is a good idea to automatically bias the timers.  But 
> > > this nohz.load_balancer is a very fast moving target and we 
> > > will need some heuristics to estimate overall system idleness 
> > > before moving the timers.
> > > 
> > > I would agree that the power saving load balancer has a good 
> > > view of the system and can potentially guide the timer biasing 
> > > framework.
> > 
> > Yeah, it's a fast moving target, but it already concentrates 
> > the load somewhat.
> > 
> 
> I wonder if the real answer for this isn't to have timers be 
> considered schedulable-entities and have the regular scheduler 
> decide where they actually run.

hm, not sure - it's a bit heavy for that.

	Ingo


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-20 21:53         ` Ingo Molnar
@ 2009-02-23  7:55           ` Balbir Singh
  2009-02-23  9:11             ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2009-02-23  7:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Vaidyanathan Srinivasan, linux-kernel,
	linux-pm, a.p.zijlstra, ego, tglx, andi, venkatesh.pallipadi,
	vatsa, arun, Suresh Siddha

* Ingo Molnar <mingo@elte.hu> [2009-02-20 22:53:18]:

> 
> * Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > On Fri, 20 Feb 2009 17:07:37 +0100
> > Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > 
> > > * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > > 
> > > > > I'd also suggest to not do that rather ugly 
> > > > > enable_timer_migration per-cpu variable, but simply reuse 
> > > > > the existing nohz.load_balancer as a target CPU.
> > > > 
> > > > This is a good idea to automatically bias the timers.  But 
> > > > this nohz.load_balancer is a very fast moving target and we 
> > > > will need some heuristics to estimate overall system idleness 
> > > > before moving the timers.
> > > > 
> > > > I would agree that the power saving load balancer has a good 
> > > > view of the system and can potentially guide the timer biasing 
> > > > framework.
> > > 
> > > Yeah, it's a fast moving target, but it already concentrates 
> > > the load somewhat.
> > > 
> > 
> > I wonder if the real answer for this isn't to have timers be 
> > considered schedulable-entities and have the regular scheduler 
> > decide where they actually run.
> 
> hm, not sure - it's a bit heavy for that.
>

I think the basic timer migration policy should exist in user space.
One way of looking at it: as we begin to consolidate, using range
timers and migrating all timers to a smaller number of CPUs would make
a whole lot of sense.

As far as the scheduler making those decisions is concerned, my
concern is that load balancing is a continuous process and timers
don't necessarily work that way. I'd stick my neck out and say that
irqbalance, range timers and timer migration should all belong to user
space. irqbalance and range timers already do, so timer migration
should too.

-- 
	Balbir


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-20 13:21 ` [RFC PATCH 0/4] timers: framework for migration between CPU Ingo Molnar
  2009-02-20 14:14   ` Vaidyanathan Srinivasan
@ 2009-02-23  7:59   ` Arun R Bharadwaj
  1 sibling, 0 replies; 23+ messages in thread
From: Arun R Bharadwaj @ 2009-02-23  7:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, andi,
	venkatesh.pallipadi, vatsa, arjan

* Ingo Molnar <mingo@elte.hu> [2009-02-20 14:21:45]:

> 
> * Arun R Bharadwaj <arun@linux.vnet.ibm.com> wrote:
> 
> > Hi,
> > 
> > 
> > In an SMP system, tasks are scheduled on different CPUs by the 
> > scheduler, interrupts are managed by irqbalancer daemon, but 
> > timers are still stuck to the CPUs that they have been 
> > initialised.  Timers queued by tasks gets re-queued on the CPU 
> > where the task gets to run next, but timers from IRQ context 
> > like the ones in device drivers are still stuck on the CPU 
> > they were initialised.  This framework will help move all 
> > 'movable timers' from one CPU to any other CPU of choice using 
> > a sysfs interface.
> 
> hm, the intention is good, the concept of migrating timers to 
> their target CPU is good as well. We already do some of that for 
> regular timers.
> 
> But the whole sysfs interface you implemented here is not 
> particularly clean nor is it efficient.
> 
> The main problem is that timers are really fast-moving entities, 
> and so are the tasks they are related to.
> 
> Your implementation completely ties the direction of migration 
> (the timer scheduling) to a clumsy sysfs interface:
> 
> +	if (sscanf(buf, "%d", &target_cpu) && cpu_online(target_cpu)) {
> +               ret = count;
> +               per_cpu(enable_timer_migration, cpu->sysdev.id) = target_cpu;
> +	}
> 
> That doesnt really scale and i doubt it works in practice. We 
> should not schedule timers via sysfs, we should let the kernel 
> do it auomatically. [*]
> 
> So what i'd suggest instead is extend the scheduler power-saving 
> code, which already identifies a 'load balancer CPU', to also 
> attract all attractable sources of timers - automatically. See 
> the 'load_balancer' CPU logic in kernel/sched.c.
> 
> Does that sound OK to you? I think the end result might even 
> give better numbers - and out of box.
> 
> I'd also suggest to not do that rather ugly 
> enable_timer_migration per-cpu variable, but simply reuse the 
> existing nohz.load_balancer as a target CPU.
>

Hi Ingo,

Thanks a lot for your comments.
Sure, what you are suggesting makes sense. Having an in-kernel method
to move timers is much better than exposing the interface to the user.
I will give this a shot.

--arun

> Also, please base your patches on the latest timer tree (which 
> already modified some of this code in this cycle):
> 
>   http://people.redhat.com/mingo/tip.git/README
> 
> Btw., could you please also fix your mailer to not do this to 
> us:
> 
> Mail-Followup-To: linux-kernel@vger.kernel.org,
>         linux-pm@lists.linux-foundation.org, a.p.zijlstra@chello.nl,
>         ego@in.ibm.com, tglx@linutronix.de, mingo@elte.hu,
>         andi@firstfloor.org, venkatesh.pallipadi@intel.com,
>         vatsa@linux.vnet.ibm.com, arjan@infradead.org
> 
> it messes up the replies.
> 
> 	Ingo
> 
> [*] IRQ migration (where you possibly got the sysfs idea from) 
>     is a special case where 'slow scheduling' via a user-space 
>     daemon is possible: they are an external source of events 
>     and they are concentrators of work. The same concept does 
>     not apply to timers, most of which are inherently 
>     task-generated.
> 


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-23  7:55           ` Balbir Singh
@ 2009-02-23  9:11             ` Ingo Molnar
  2009-02-23  9:48               ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2009-02-23  9:11 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Arjan van de Ven, Vaidyanathan Srinivasan, linux-kernel,
	linux-pm, a.p.zijlstra, ego, tglx, andi, venkatesh.pallipadi,
	vatsa, arun, Suresh Siddha


* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@elte.hu> [2009-02-20 22:53:18]:
> 
> > 
> > * Arjan van de Ven <arjan@infradead.org> wrote:
> > 
> > > On Fri, 20 Feb 2009 17:07:37 +0100
> > > Ingo Molnar <mingo@elte.hu> wrote:
> > > 
> > > > 
> > > > * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > > I'd also suggest to not do that rather ugly 
> > > > > > enable_timer_migration per-cpu variable, but simply reuse 
> > > > > > the existing nohz.load_balancer as a target CPU.
> > > > > 
> > > > > This is a good idea to automatically bias the timers.  But 
> > > > > this nohz.load_balancer is a very fast moving target and we 
> > > > > will need some heuristics to estimate overall system idleness 
> > > > > before moving the timers.
> > > > > 
> > > > > I would agree that the power saving load balancer has a good 
> > > > > view of the system and can potentially guide the timer biasing 
> > > > > framework.
> > > > 
> > > > Yeah, it's a fast moving target, but it already concentrates 
> > > > the load somewhat.
> > > > 
> > > 
> > > I wonder if the real answer for this isn't to have timers be 
> > > considered schedulable-entities and have the regular scheduler 
> > > decide where they actually run.
> > 
> > hm, not sure - it's a bit heavy for that.
> >
> 
> I think the basic timer migration policy should exist in user 
> space.

I disagree.

> One of the ways of looking at it is, as we begin to 
> consolidate, using range timers and migrating all timers to 
> lesser number of CPUs would make a whole lot of sense.
> 
> As far as the scheduler making those decisions is concerned, 
> my concern is that the load balancing is a continuous process 
> and timers don't necessarily work that way. I'd put my neck 
> out and say that irqbalance, range timers and timer migration 
> should all belong to user space. irqbalance and range timers 
> do, so should timer migration.

As I said in my first reply, IRQ migration is special because 
IRQs are not kernel-internal objects; they come externally, so 
there's a lot of user-space enumeration, policy and other steps 
involved. Furthermore, IRQs are migrated in a 'slow' fashion.

Timers on the other hand are fast entities tied to _tasks_ 
primarily, not external entities. Hence they should migrate 
according to the CPU where the activities of the system 
concentrates - i.e. where tasks are running.

Another thing: do you argue for the existing timer-migration 
code we have in mod_timer() to move to user-space too? It isn't a 
consistent argument to push 'some' of it to user-space and keep 
some of it in kernel-space.

	Ingo


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-23  9:11             ` Ingo Molnar
@ 2009-02-23  9:48               ` Balbir Singh
  2009-02-23 10:22                 ` Ingo Molnar
  2009-02-23 10:38                 ` Vaidyanathan Srinivasan
  0 siblings, 2 replies; 23+ messages in thread
From: Balbir Singh @ 2009-02-23  9:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Vaidyanathan Srinivasan, linux-kernel,
	linux-pm, a.p.zijlstra, ego, tglx, andi, venkatesh.pallipadi,
	vatsa, arun, Suresh Siddha

* Ingo Molnar <mingo@elte.hu> [2009-02-23 10:11:58]:

> 
> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * Ingo Molnar <mingo@elte.hu> [2009-02-20 22:53:18]:
> > 
> > > 
> > > * Arjan van de Ven <arjan@infradead.org> wrote:
> > > 
> > > > On Fri, 20 Feb 2009 17:07:37 +0100
> > > > Ingo Molnar <mingo@elte.hu> wrote:
> > > > 
> > > > > 
> > > > > * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > > I'd also suggest to not do that rather ugly 
> > > > > > > enable_timer_migration per-cpu variable, but simply reuse 
> > > > > > > the existing nohz.load_balancer as a target CPU.
> > > > > > 
> > > > > > This is a good idea to automatically bias the timers.  But 
> > > > > > this nohz.load_balancer is a very fast moving target and we 
> > > > > > will need some heuristics to estimate overall system idleness 
> > > > > > before moving the timers.
> > > > > > 
> > > > > > I would agree that the power saving load balancer has a good 
> > > > > > view of the system and can potentially guide the timer biasing 
> > > > > > framework.
> > > > > 
> > > > > Yeah, it's a fast moving target, but it already concentrates 
> > > > > the load somewhat.
> > > > > 
> > > > 
> > > > I wonder if the real answer for this isn't to have timers be 
> > > > considered schedulable-entities and have the regular scheduler 
> > > > decide where they actually run.
> > > 
> > > hm, not sure - it's a bit heavy for that.
> > >
> > 
> > I think the basic timer migration policy should exist in user 
> > space.
> 
> I disagree.
>

See below
 
> > One of the ways of looking at it is, as we begin to 
> > consolidate, using range timers and migrating all timers to 
> > lesser number of CPUs would make a whole lot of sense.
> > 
> > As far as the scheduler making those decisions is concerned, 
> > my concern is that the load balancing is a continuous process 
> > and timers don't necessarily work that way. I'd put my neck 
> > out and say that irqbalance, range timers and timer migration 
> > should all belong to user space. irqbalance and range timers 
> > do, so should timer migration.
> 
> As i said it my first reply, IRQ migration is special because 
> they are not kernel-internal objects, they come externally so 
> there's a lot of user-space enumeration, policy and other steps 
> involved. Furthermore, IRQs are migrated in a 'slow' fashion.
> 
> Timers on the other hand are fast entities tied to _tasks_ 
> primarily, not external entities. 

Timers are also queued due to external events like interrupts (device
drivers tend to set off timers all the time). I am not fully against
what you've said; at some semantic level you are suggesting that, at a
higher level of power saving, when the scheduler balances timers it is
doing a form of soft CPU hotplug on the system, migrating timers and
tasks away from idle CPUs when the load can be handled by other CPUs.
See below as well.

> Hence they should migrate 
> according to the CPU where the activities of the system 
> concentrates - i.e. where tasks are running.
> 
> Another thing: do you argue for the existing timer-migration 
> code we have in mod_timer() to move to user-space too? It isnt a 
> consistent argument to push 'some' of it to user-space, and some 
> of it in kernel-space.
> 

No.. mod_timer() is correct where it belongs.

Consider the powertop usage scenario today

1. Powertop displays a list of timers and common causes of wakeup
2. It recommends policies in user space that can affect power savings
   a. usb autosuspend
   b. wireless link management
   c. disable HAL polling

My argument is: why can't we, in the future, add

   d. Use range timers
   e. Consolidate timers

to that list?

Even sched_mc=n is set by user space, so really the
policy is in user space.


> 	Ingo

-- 
	Balbir


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-23  9:48               ` Balbir Singh
@ 2009-02-23 10:22                 ` Ingo Molnar
  2009-02-23 11:24                   ` Balbir Singh
  2009-02-23 10:38                 ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2009-02-23 10:22 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Arjan van de Ven, Vaidyanathan Srinivasan, linux-kernel,
	linux-pm, a.p.zijlstra, ego, tglx, andi, venkatesh.pallipadi,
	vatsa, arun, Suresh Siddha


* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@elte.hu> [2009-02-23 10:11:58]:
> 
> > 
> > * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * Ingo Molnar <mingo@elte.hu> [2009-02-20 22:53:18]:
> > > 
> > > > 
> > > > * Arjan van de Ven <arjan@infradead.org> wrote:
> > > > 
> > > > > On Fri, 20 Feb 2009 17:07:37 +0100
> > > > > Ingo Molnar <mingo@elte.hu> wrote:
> > > > > 
> > > > > > 
> > > > > > * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > > > > > 
> > > > > > > > I'd also suggest to not do that rather ugly 
> > > > > > > > enable_timer_migration per-cpu variable, but simply reuse 
> > > > > > > > the existing nohz.load_balancer as a target CPU.
> > > > > > > 
> > > > > > > This is a good idea to automatically bias the timers.  But 
> > > > > > > this nohz.load_balancer is a very fast moving target and we 
> > > > > > > will need some heuristics to estimate overall system idleness 
> > > > > > > before moving the timers.
> > > > > > > 
> > > > > > > I would agree that the power saving load balancer has a good 
> > > > > > > view of the system and can potentially guide the timer biasing 
> > > > > > > framework.
> > > > > > 
> > > > > > Yeah, it's a fast moving target, but it already concentrates 
> > > > > > the load somewhat.
> > > > > > 
> > > > > 
> > > > > I wonder if the real answer for this isn't to have timers be 
> > > > > considered schedulable-entities and have the regular scheduler 
> > > > > decide where they actually run.
> > > > 
> > > > hm, not sure - it's a bit heavy for that.
> > > >
> > > 
> > > I think the basic timer migration policy should exist in user 
> > > space.
> > 
> > I disagree.
> >
> 
> See below
>  
> > > One of the ways of looking at it is, as we begin to 
> > > consolidate, using range timers and migrating all timers to 
> > > lesser number of CPUs would make a whole lot of sense.
> > > 
> > > As far as the scheduler making those decisions is concerned, 
> > > my concern is that the load balancing is a continuous process 
> > > and timers don't necessarily work that way. I'd put my neck 
> > > out and say that irqbalance, range timers and timer migration 
> > > should all belong to user space. irqbalance and range timers 
> > > do, so should timer migration.
> > 
> > As i said it my first reply, IRQ migration is special because 
> > they are not kernel-internal objects, they come externally so 
> > there's a lot of user-space enumeration, policy and other steps 
> > involved. Furthermore, IRQs are migrated in a 'slow' fashion.
> > 
> > Timers on the other hand are fast entities tied to _tasks_ 
> > primarily, not external entities. 
> 
> Timers are also queued due to external events like interrupts 
> (device drivers tend to set of timers all the time). [...]

That is a silly argument. Tasks are created due to 'external 
events' as well, such as the user hitting a key.

What matters, and what my argument was about, is the distinction 
of whether the kernel _generates_ the event. For most IRQ events it 
does not; for the overwhelming majority of timer events it 
consciously generates them. That makes them very much different.

> [...] I am not fully against what you've said, at some 
> semantic level what you are suggesting is that at a higher 
> level of power saving, when the scheduler balances timers it 
> is doing a form of soft CPU hotplug on the system by migrating 
> timers and tasks away from idle CPUs when the load can be 
> handled by other CPUs. See below as well.
> 
> > Hence they should migrate 
> > according to the CPU where the activities of the system 
> > concentrates - i.e. where tasks are running.
> > 
> > Another thing: do you argue for the existing timer-migration 
> > code we have in mod_timer() to move to user-space too? It isnt a 
> > consistent argument to push 'some' of it to user-space, and some 
> > of it in kernel-space.
> > 
> 
> No.. mod_timer() is correct where it belongs.

You did not reply to my statement that the argument is a double 
standard. Why should certain migrations be done in the kernel and 
others not?

> Consider the powertop usage scenario today
> 
> 1. Powertop displays a list of timers and common causes of wakeup
> 2. It recommends policies in user space that can affect power savings
>    a. usb autosuspend
>    b. wireless link management
>    c. disable HAL polling

That's different - those are PowerTop timer event _reduction_ 
policies. Not migration policies of existing timers.

> My argument is, why can't we add
> 
>    d. Use range timers
>    e. Consolidate timers
> 
> In the future.
> 
> Even sched_mc=n is set by user space, so really the
> policy is in user space.

That is different again. sched_mc is a broad switch, not a 
dynamic control like the sysfs migration interface that was 
introduced in this patchset, which is the patchset we are 
discussing.

	Ingo


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-23  9:48               ` Balbir Singh
  2009-02-23 10:22                 ` Ingo Molnar
@ 2009-02-23 10:38                 ` Vaidyanathan Srinivasan
  2009-02-23 11:07                   ` Ingo Molnar
  1 sibling, 1 reply; 23+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-02-23 10:38 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Ingo Molnar, Arjan van de Ven, linux-kernel, linux-pm,
	a.p.zijlstra, ego, tglx, andi, venkatesh.pallipadi, vatsa, arun,
	Suresh Siddha

* Balbir Singh <balbir@linux.vnet.ibm.com> [2009-02-23 15:18:50]:

> * Ingo Molnar <mingo@elte.hu> [2009-02-23 10:11:58]:
> 
> > 
> > * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * Ingo Molnar <mingo@elte.hu> [2009-02-20 22:53:18]:
> > > 
> > > > 
> > > > * Arjan van de Ven <arjan@infradead.org> wrote:
> > > > 
> > > > > On Fri, 20 Feb 2009 17:07:37 +0100
> > > > > Ingo Molnar <mingo@elte.hu> wrote:
> > > > > 
> > > > > > 
> > > > > > * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > > > > > 
> > > > > > > > I'd also suggest to not do that rather ugly 
> > > > > > > > enable_timer_migration per-cpu variable, but simply reuse 
> > > > > > > > the existing nohz.load_balancer as a target CPU.
> > > > > > > 
> > > > > > > This is a good idea to automatically bias the timers.  But 
> > > > > > > this nohz.load_balancer is a very fast moving target and we 
> > > > > > > will need some heuristics to estimate overall system idleness 
> > > > > > > before moving the timers.
> > > > > > > 
> > > > > > > I would agree that the power saving load balancer has a good 
> > > > > > > view of the system and can potentially guide the timer biasing 
> > > > > > > framework.
> > > > > > 
> > > > > > Yeah, it's a fast moving target, but it already concentrates 
> > > > > > the load somewhat.
> > > > > > 
> > > > > 
> > > > > I wonder if the real answer for this isn't to have timers be 
> > > > > considered schedulable-entities and have the regular scheduler 
> > > > > decide where they actually run.
> > > > 
> > > > hm, not sure - it's a bit heavy for that.
> > > >
> > > 
> > > I think the basic timer migration policy should exist in user 
> > > space.
> > 
> > I disagree.
> >
> 
> See below
> 
> > > One of the ways of looking at it is, as we begin to 
> > > consolidate, using range timers and migrating all timers to 
> > > lesser number of CPUs would make a whole lot of sense.
> > > 
> > > As far as the scheduler making those decisions is concerned, 
> > > my concern is that the load balancing is a continuous process 
> > > and timers don't necessarily work that way. I'd put my neck 
> > > out and say that irqbalance, range timers and timer migration 
> > > should all belong to user space. irqbalance and range timers 
> > > do, so should timer migration.
> > 
> > As i said it my first reply, IRQ migration is special because 
> > they are not kernel-internal objects, they come externally so 
> > there's a lot of user-space enumeration, policy and other steps 
> > involved. Furthermore, IRQs are migrated in a 'slow' fashion.
> > 
> > Timers on the other hand are fast entities tied to _tasks_ 
> > primarily, not external entities. 
> 
> Timers are also queued due to external events like interrupts (device
> drivers tend to set of timers all the time). I am not fully against
> what you've said, at some semantic level what you are suggesting is
> that at a higher level of power saving, when the scheduler balances
> timers it is doing a form of soft CPU hotplug on the system by
> migrating timers and tasks away from idle CPUs when the load can be
> handled by other CPUs. See below as well.
> 
> > Hence they should migrate 
> > according to the CPU where the activities of the system 
> > concentrates - i.e. where tasks are running.
> > 
> > Another thing: do you argue for the existing timer-migration 
> > code we have in mod_timer() to move to user-space too? It isnt a 
> > consistent argument to push 'some' of it to user-space, and some 
> > of it in kernel-space.
> > 
> 
> No.. mod_timer() is correct where it belongs.
> 
> Consider the powertop usage scenario today
> 
> 1. Powertop displays a list of timers and common causes of wakeup
> 2. It recommends policies in user space that can affect power savings
>    a. usb autosuspend
>    b. wireless link management
>    c. disable HAL polling
> 
> My argument is, why can't we add
> 
>    d. Use range timers
>    e. Consolidate timers
> 
> In the future.
> 
> Even sched_mc=n is set by user space, so really the
> policy is in user space.

Hi Balbir, 

I would agree that the policy would exist in user space.  But what
Ingo is suggesting is that the decision of actually choosing the
destination CPU to consolidate on should come from the existing
scheduler's power-save balancer code.

My understanding is that we will certainly have a sysfs tunable to
'enable' timer migration or consolidation, similar to the sched_mc=2
policy, but the actual set of CPUs to evacuate and the correct set of
target CPUs to consolidate on should come from the scheduler and not
necessarily from user space.

The scheduler should be able to figure out the following parameters:

* Identify set of idle CPUs (CPU package) from which timers can be
  removed
* Identify a semi-idle or idle CPU package to which the timers can be
  moved
* Decide when to start moving timers as the system has a large number
  of idle CPUs
* Decide when to stop migrating as the system becomes less idle and
  utilisation increases

Guiding all of the above decisions from user space may not be fast
enough.

--Vaidy


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-23 10:38                 ` Vaidyanathan Srinivasan
@ 2009-02-23 11:07                   ` Ingo Molnar
  2009-02-23 11:25                     ` Balbir Singh
  2009-02-26  8:58                     ` Dipankar Sarma
  0 siblings, 2 replies; 23+ messages in thread
From: Ingo Molnar @ 2009-02-23 11:07 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Balbir Singh, Arjan van de Ven, linux-kernel, linux-pm,
	a.p.zijlstra, ego, tglx, andi, venkatesh.pallipadi, vatsa, arun,
	Suresh Siddha


* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> My understanding is that we will certainly have a sysfs 
> tunable to 'enable' timer migration or consolidation, similar 
> to the sched_mc=2 policy, but the actual set of CPUs to 
> evacuate and the correct set of target CPUs to consolidate 
> should come from the scheduler and not necessarily from the 
> user space.

Yes.

> The scheduler should be able to figure out the following 
> parameters:
> 
> * Identify set of idle CPUs (CPU package) from which timers 
>   can be removed
> * Identify a semi-idle or idle CPU package to which the timers
>   can be moved
> * Decide when to start moving timers once the system has a
>   large number of idle CPUs
> * Decide when to stop migrating as the system becomes less
>   idle and utilisation increases
> 
> Guiding all of the above decisions from user space may not be 
> fast enough.

Exactly.

	Ingo


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-23 10:22                 ` Ingo Molnar
@ 2009-02-23 11:24                   ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2009-02-23 11:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Vaidyanathan Srinivasan, linux-kernel,
	linux-pm, a.p.zijlstra, ego, tglx, andi, venkatesh.pallipadi,
	vatsa, arun, Suresh Siddha

* Ingo Molnar <mingo@elte.hu> [2009-02-23 11:22:34]:

> 
> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * Ingo Molnar <mingo@elte.hu> [2009-02-23 10:11:58]:
> > 
> > > 
> > > * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > * Ingo Molnar <mingo@elte.hu> [2009-02-20 22:53:18]:
> > > > 
> > > > > 
> > > > > * Arjan van de Ven <arjan@infradead.org> wrote:
> > > > > 
> > > > > > On Fri, 20 Feb 2009 17:07:37 +0100
> > > > > > Ingo Molnar <mingo@elte.hu> wrote:
> > > > > > 
> > > > > > > 
> > > > > > > * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > > > > > > 
> > > > > > > > > I'd also suggest to not do that rather ugly 
> > > > > > > > > enable_timer_migration per-cpu variable, but simply reuse 
> > > > > > > > > the existing nohz.load_balancer as a target CPU.
> > > > > > > > 
> > > > > > > > This is a good idea to automatically bias the timers.  But 
> > > > > > > > this nohz.load_balancer is a very fast moving target and we 
> > > > > > > > will need some heuristics to estimate overall system idleness 
> > > > > > > > before moving the timers.
> > > > > > > > 
> > > > > > > > I would agree that the power saving load balancer has a good 
> > > > > > > > view of the system and can potentially guide the timer biasing 
> > > > > > > > framework.
> > > > > > > 
> > > > > > > Yeah, it's a fast moving target, but it already concentrates 
> > > > > > > the load somewhat.
> > > > > > > 
> > > > > > 
> > > > > > I wonder if the real answer for this isn't to have timers be 
> > > > > > considered schedulable-entities and have the regular scheduler 
> > > > > > decide where they actually run.
> > > > > 
> > > > > hm, not sure - it's a bit heavy for that.
> > > > >
> > > > 
> > > > I think the basic timer migration policy should exist in user 
> > > > space.
> > > 
> > > I disagree.
> > >
> > 
> > See below
> >  
> > > > One of the ways of looking at it is, as we begin to 
> > > > consolidate, using range timers and migrating all timers to 
> > > > lesser number of CPUs would make a whole lot of sense.
> > > > 
> > > > As far as the scheduler making those decisions is concerned, 
> > > > my concern is that the load balancing is a continuous process 
> > > > and timers don't necessarily work that way. I'd put my neck 
> > > > out and say that irqbalance, range timers and timer migration 
> > > > should all belong to user space. irqbalance and range timers 
> > > > do, so should timer migration.
> > > 
> > > As i said it my first reply, IRQ migration is special because 
> > > they are not kernel-internal objects, they come externally so 
> > > there's a lot of user-space enumeration, policy and other steps 
> > > involved. Furthermore, IRQs are migrated in a 'slow' fashion.
> > > 
> > > Timers on the other hand are fast entities tied to _tasks_ 
> > > primarily, not external entities. 
> > 
> > Timers are also queued due to external events like interrupts 
> > (device drivers tend to set of timers all the time). [...]
> 
> That is a silly argument. Tasks are created due to 'external 
> events' as well such as the user hitting a key.
> 

The point I was trying to make was that not all timers are due to
tasks; some are due to interrupts, hence the focus on getting
irqbalance and timers to work together.

> What matters, and what was my argument is the distinction 
> whether the kernel _generates_ the event. For most IRQ events it 
> does not, for the overwhelming majority of timers events it 
> consciously generates timer events. Which makes them all the 
> much different.
>

Yes, agreed
 
> > [...] I am not fully against what you've said, at some 
> > semantic level what you are suggesting is that at a higher 
> > level of power saving, when the scheduler balances timers it 
> > is doing a form of soft CPU hotplug on the system by migrating 
> > timers and tasks away from idle CPUs when the load can be 
> > handled by other CPUs. See below as well.
> > 
> > > Hence they should migrate 
> > > according to the CPU where the activities of the system 
> > > concentrates - i.e. where tasks are running.
> > > 
> > > Another thing: do you argue for the existing timer-migration 
> > > code we have in mod_timer() to move to user-space too? It isnt a 
> > > consistent argument to push 'some' of it to user-space, and some 
> > > of it in kernel-space.
> > > 
> > 
> > No.. mod_timer() is correct where it belongs.
> 
> You did not reply to my statement that the argument is a double 
> standard. Why do certain migrations in the kernel and some not?

Sorry, I am not sure I understand - which portions of mod_timer()
were you recommending be moved to user space?

> 
> > Consider the powertop usage scenario today
> > 
> > 1. Powertop displays a list of timers and common causes of wakeup
> > 2. It recommends policies in user space that can affect power savings
> >    a. usb autosuspend
> >    b. wireless link management
> >    c. disable HAL polling
> 
> That's different - those are PowerTop timer event _reduction_ 
> policies. Not migration policies of existing timers.
> 
> > My argument is, why can't we add
> > 
> >    d. Use range timers
> >    e. Consolidate timers
> > 
> > In the future.
> > 
> > Even sched_mc=n is set by user space, so really the
> > policy is in user space.
> 
> that is different again. sched_mc is a broad switch not a 
> dynamic control like the sysfs migration interface that was 
> introduced in this patchset. Which patchset we are discussing.
>

The timer migration patchset. We are discussing sched_mc=n, since I
expect sched_mc=3 or so to enable timer migration.

I guess we could try and select the target cpu for consolidation from
within the scheduler, but my concerns are

1. Not all timers are due to tasks
2. The effect of automatically migrating a timer from within the
scheduler can vary, since we don't know the load associated with a
timer.

Having said that, some experimentation with Ingo's suggestion of
automatically selecting the target CPU would be nice.

 

-- 
	Balbir


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-23 11:07                   ` Ingo Molnar
@ 2009-02-23 11:25                     ` Balbir Singh
  2009-02-26  8:58                     ` Dipankar Sarma
  1 sibling, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2009-02-23 11:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vaidyanathan Srinivasan, Arjan van de Ven, linux-kernel,
	linux-pm, a.p.zijlstra, ego, tglx, andi, venkatesh.pallipadi,
	vatsa, arun, Suresh Siddha

* Ingo Molnar <mingo@elte.hu> [2009-02-23 12:07:25]:

> 
> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > My understanding is that we will certainly have a sysfs 
> > tunable to 'enable' timer migration or consolidation, similar 
> > to the sched_mc=2 policy, but the actual set of CPUs to 
> > evacuate and the correct set of target CPUs to consolidate 
> > should come from the scheduler and not necessarily from the 
> > user space.
> 
> Yes.
> 
> > The scheduler should be able to figure out the following 
> > parameters:
> > 
> > * Identify set of idle CPUs (CPU package) from which timers 
> >   can be removed
> > * Identify a semi-idle or idle CPU package to which the timers
> >   can be moved
> > * Decide when to start moving timers once the system has a
> >   large number of idle CPUs
> > * Decide when to stop migrating as the system becomes less
> >   idle and utilisation increases
> > 
> > Guiding all of the above decisions from user space may not be 
> > fast enough.
> 
> Exactly.
>

OK, let's head that way for now. I've highlighted my concerns in a
different email. Experimentation will definitely show us whether they
are justified.

-- 
	Balbir


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-23 11:07                   ` Ingo Molnar
  2009-02-23 11:25                     ` Balbir Singh
@ 2009-02-26  8:58                     ` Dipankar Sarma
  2009-02-26 15:45                       ` Ingo Molnar
  1 sibling, 1 reply; 23+ messages in thread
From: Dipankar Sarma @ 2009-02-26  8:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vaidyanathan Srinivasan, Balbir Singh, Arjan van de Ven,
	linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, andi,
	venkatesh.pallipadi, vatsa, arun, Suresh Siddha

On Mon, Feb 23, 2009 at 12:07:25PM +0100, Ingo Molnar wrote:
> 
> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > * Identify set of idle CPUs (CPU package) from which timers 
> >   can be removed
> > * Identify a semi-idle or idle CPU package to which the timers
> >   can be moved
> > * Decide when to start moving timers as the system has large
> >   number of idle CPUs
> > * Decide when to stop migrating as system becomes less idle
> >   and utilisation increases
> > 
> > Guiding all of the above decisions from user space may not be 
> > fast enough.
> 
> Exactly.

That is true for power management. However, there are other
situations where we may need targeted avoidance of timers.
Certain types of applications - HPC, for example - prefer to
avoid jitter due to periodic timers. It would be good to be
able to say "avoid these CPUs for timers" while those CPUs are
being used for HPC tasks.

Thanks
Dipankar


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-26  8:58                     ` Dipankar Sarma
@ 2009-02-26 15:45                       ` Ingo Molnar
  2009-02-26 16:02                         ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2009-02-26 15:45 UTC (permalink / raw)
  To: Dipankar Sarma
  Cc: Vaidyanathan Srinivasan, Balbir Singh, Arjan van de Ven,
	linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, andi,
	venkatesh.pallipadi, vatsa, arun, Suresh Siddha


* Dipankar Sarma <dipankar@in.ibm.com> wrote:

> On Mon, Feb 23, 2009 at 12:07:25PM +0100, Ingo Molnar wrote:
> > 
> > * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > 
> > > * Identify set of idle CPUs (CPU package) from which timers 
> > >   can be removed
> > > * Identify a semi-idle or idle CPU package to which the timers
> > >   can be moved
> > > * Decide when to start moving timers once the system has a
> > >   large number of idle CPUs
> > > * Decide when to stop migrating as the system becomes less
> > >   idle and utilisation increases
> > > 
> > > Guiding all of the above decisions from user space may not be 
> > > fast enough.
> > 
> > Exactly.
> 
> That is true for power management. However, there are other 
> situations where we may need targeted avoidance of timers. 
> Certain types of applications - HPC, for example - prefer to 
> avoid jitter due to periodic timers. It would be good to be 
> able to say "avoid these CPUs for timers" while those CPUs 
> are being used for HPC tasks.

Yes - but that kind of policy should be coupled and expressed 
via cpusets. /proc based irq_affinity is just a limited, 
inflexible hack. All things IRQ partitioning should be handled 
via cpusets - perhaps via the 'system sets' idea from Peter?

	Ingo


* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-26 15:45                       ` Ingo Molnar
@ 2009-02-26 16:02                         ` Peter Zijlstra
  2009-02-26 16:12                           ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2009-02-26 16:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dipankar Sarma, Vaidyanathan Srinivasan, Balbir Singh,
	Arjan van de Ven, linux-kernel, linux-pm, ego, tglx, andi,
	venkatesh.pallipadi, vatsa, arun, Suresh Siddha

On Thu, 2009-02-26 at 16:45 +0100, Ingo Molnar wrote:

> Yes - but that kind of policy should be coupled and expressed 
> via cpusets. /proc based irq_affinity is just a limited, 
> inflexible hack. All things IRQ partitioning should be handled 
> via cpusets - perhaps via the 'system sets' idea from Peter?

All we got out of that idea was the default_smp_affinity thing
in /proc/irq and a headache trying to work out silly details.

Maybe we ought to try again...



* Re: [RFC PATCH 0/4] timers: framework for migration between CPU
  2009-02-26 16:02                         ` Peter Zijlstra
@ 2009-02-26 16:12                           ` Ingo Molnar
  0 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2009-02-26 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dipankar Sarma, Vaidyanathan Srinivasan, Balbir Singh,
	Arjan van de Ven, linux-kernel, linux-pm, ego, tglx, andi,
	venkatesh.pallipadi, vatsa, arun, Suresh Siddha


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2009-02-26 at 16:45 +0100, Ingo Molnar wrote:
> 
> > Yes - but that kind of policy should be coupled and expressed 
> > via cpusets. /proc based irq_affinity is just a limited, 
> > inflexible hack. All things IRQ partitioning should be handled 
> > via cpusets - perhaps via the 'system sets' idea from Peter?
> 
> All we got out of that idea was the default_smp_affinity thing 
> in /proc/irq and a headache trying to work out silly details.
> 
> Maybe we ought to try again...

Your system sets patch was actually very sane; it just fell
victim to a merge window, I think. Mind re-sending it?

	Ingo


end of thread, other threads:[~2009-02-26 16:13 UTC | newest]

Thread overview: 23+ messages
-- links below jump to the message on this page --
2009-02-20 12:55 [RFC PATCH 0/4] timers: framework for migration between CPU Arun R Bharadwaj
2009-02-20 12:57 ` [RFC PATCH 1/4] timers: framework to identify pinned timers Arun R Bharadwaj
2009-02-20 12:58 ` [RFC PATCH 2/4] timers: sysfs hook to enable timer migration Arun R Bharadwaj
2009-02-20 13:00 ` [RFC PATCH 3/4] timers: identifying the existing pinned hrtimers Arun R Bharadwaj
2009-02-20 13:01 ` [RFC PATCH 4/4] timers: logic to enable timer migration Arun R Bharadwaj
2009-02-20 13:21 ` [RFC PATCH 0/4] timers: framework for migration between CPU Ingo Molnar
2009-02-20 14:14   ` Vaidyanathan Srinivasan
2009-02-20 16:07     ` Ingo Molnar
2009-02-20 19:57       ` Arjan van de Ven
2009-02-20 21:53         ` Ingo Molnar
2009-02-23  7:55           ` Balbir Singh
2009-02-23  9:11             ` Ingo Molnar
2009-02-23  9:48               ` Balbir Singh
2009-02-23 10:22                 ` Ingo Molnar
2009-02-23 11:24                   ` Balbir Singh
2009-02-23 10:38                 ` Vaidyanathan Srinivasan
2009-02-23 11:07                   ` Ingo Molnar
2009-02-23 11:25                     ` Balbir Singh
2009-02-26  8:58                     ` Dipankar Sarma
2009-02-26 15:45                       ` Ingo Molnar
2009-02-26 16:02                         ` Peter Zijlstra
2009-02-26 16:12                           ` Ingo Molnar
2009-02-23  7:59   ` Arun R Bharadwaj
