linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 00/23]
@ 2006-09-29 23:58 Thomas Gleixner
  2006-09-29 23:58 ` [patch 01/23] GTOD: exponential update_wall_time Thomas Gleixner
                   ` (24 more replies)
  0 siblings, 25 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

We are pleased to announce the next version of our "high resolution 
timers" and "dynticks" patchset, which implements two important features 
that Linux lacked for many years.

The patchset is against 2.6.18-mm2. (Since our last release there were 
no big changes, other than bugfixes and internal releasification 
cleanups, and the merge to -mm. The queue is bisect-friendly.)

If review and feedback is positive we'd like this patchset to be added 
to the 2.6.19 kernel. It has been maintained ontop of ktimers initially 
(more than a year ago), and then ontop of hrtimers (after ktimers were 
renamed to hrtimers and the hrtimer subsystem went upstream in January). 
Various -hrt iterations have been announced on lkml numerous times in 
the past year.

Now that the hrtimers subsystem and most of John Stultz Generic Time Of 
Day work is upstream, this patchset is straightforward and carries 
little risks if high-res timers are turned off (which is the default).

This patchset has been tested on various i686 systems. (We have the 
x86_64 patches too, but we'd like to concentrate on this first wave 
initially.)

The patchset can also be found at:

  http://www.tglx.de/projects/hrtimers/2.6.18-mm2/patch-2.6.18-mm2-hrt-dyntick1.patches.tar.bz2

	Thomas, Ingo

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 01/23] GTOD: exponential update_wall_time
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-29 23:58 ` [patch 02/23] GTOD: persistent clock support, core Thomas Gleixner
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: update-times-exponential.patch --]
[-- Type: text/plain, Size: 2745 bytes --]

From: John Stultz <johnstul@us.ibm.com>

Accumulate time in update_wall_time() exponentially.  This avoids long
running loops seen with the dynticks patch as well as the problematic
hang seen on systems with broken clocksources.

NOTE: this only has relevance on dyntick kernels, so the quality of
NTP updates on jiffies-tick systems is unaffected. (non-dyntick
kernels call update_wall_time() in every timer tick)

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--

 kernel/timer.c |   28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-09-30 01:41:14.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-09-30 01:41:14.000000000 +0200
@@ -907,6 +907,7 @@ static void clocksource_adjust(struct cl
 static void update_wall_time(void)
 {
 	cycle_t offset;
+	int shift = 0;
 
 	/* Make sure we're fully resumed: */
 	if (unlikely(timekeeping_suspended))
@@ -919,28 +920,39 @@ static void update_wall_time(void)
 #endif
 	clock->xtime_nsec += (s64)xtime.tv_nsec << clock->shift;
 
+	while (offset > clock->cycle_interval << (shift + 1))
+		shift++;
+
 	/* normally this loop will run just once, however in the
 	 * case of lost or late ticks, it will accumulate correctly.
 	 */
 	while (offset >= clock->cycle_interval) {
+		if (offset < (clock->cycle_interval << shift)) {
+			shift--;
+			continue;
+		}
+
 		/* accumulate one interval */
-		clock->xtime_nsec += clock->xtime_interval;
-		clock->cycle_last += clock->cycle_interval;
-		offset -= clock->cycle_interval;
+		clock->xtime_nsec += clock->xtime_interval << shift;
+		clock->cycle_last += clock->cycle_interval << shift;
+		offset -= clock->cycle_interval << shift;
 
-		if (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
+		while (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
 			clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift;
 			xtime.tv_sec++;
 			second_overflow();
 		}
 
 		/* interpolator bits */
-		time_interpolator_update(clock->xtime_interval
-						>> clock->shift);
+		time_interpolator_update((clock->xtime_interval
+						>> clock->shift)<<shift);
 
 		/* accumulate error between NTP and clock interval */
-		clock->error += current_tick_length();
-		clock->error -= clock->xtime_interval << (TICK_LENGTH_SHIFT - clock->shift);
+		clock->error += current_tick_length() << shift;
+		clock->error -= (clock->xtime_interval
+			<< (TICK_LENGTH_SHIFT - clock->shift))<<shift;
+
+		shift--;
 	}
 
 	/* correct the clock when NTP error is too big */

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 02/23] GTOD: persistent clock support, core
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
  2006-09-29 23:58 ` [patch 01/23] GTOD: exponential update_wall_time Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:35   ` Andrew Morton
  2006-09-29 23:58 ` [patch 03/23] GTOD: persistent clock support, i386 Thomas Gleixner
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: linux-2.6.18-rc6_timeofday-persistent-clock-generic_C6.patch --]
[-- Type: text/plain, Size: 4767 bytes --]

From: John Stultz <johnstul@us.ibm.com>

persistent clock support: do proper timekeeping across suspend/resume.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h |    3 +++
 include/linux/time.h    |    1 +
 kernel/hrtimer.c        |    8 ++++++++
 kernel/timer.c          |   34 +++++++++++++++++++++++++++++++---
 4 files changed, 43 insertions(+), 3 deletions(-)

linux-2.6.18-rc6_timeofday-persistent-clock-generic_C6.patch
Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-09-30 01:41:14.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-09-30 01:41:15.000000000 +0200
@@ -146,6 +146,9 @@ extern void hrtimer_init_sleeper(struct 
 /* Soft interrupt function to run the hrtimer queues: */
 extern void hrtimer_run_queues(void);
 
+/* Resume notification */
+void hrtimer_notify_resume(void);
+
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
 
Index: linux-2.6.18-mm2/include/linux/time.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/time.h	2006-09-30 01:41:14.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/time.h	2006-09-30 01:41:15.000000000 +0200
@@ -92,6 +92,7 @@ extern struct timespec xtime;
 extern struct timespec wall_to_monotonic;
 extern seqlock_t xtime_lock;
 
+extern unsigned long read_persistent_clock(void);
 void timekeeping_init(void);
 
 static inline unsigned long get_seconds(void)
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-09-30 01:41:14.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-09-30 01:41:15.000000000 +0200
@@ -287,6 +287,14 @@ static unsigned long ktime_divns(const k
 #endif /* BITS_PER_LONG >= 64 */
 
 /*
+ * Timekeeping resumed notification
+ */
+void hrtimer_notify_resume(void)
+{
+	clock_was_set();
+}
+
+/*
  * Counterpart to lock_timer_base above:
  */
 static inline
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-09-30 01:41:14.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-09-30 01:41:15.000000000 +0200
@@ -41,6 +41,9 @@
 #include <asm/timex.h>
 #include <asm/io.h>
 
+/* jiffies at the most recent update of wall time */
+unsigned long wall_jiffies = INITIAL_JIFFIES;
+
 u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
 EXPORT_SYMBOL(jiffies_64);
@@ -743,12 +746,20 @@ int timekeeping_is_continuous(void)
 	return ret;
 }
 
+/* Weak dummy function for arches that do not yet support it.
+ * XXX - Do be sure to remove it once all arches implement it.
+ */
+unsigned long __attribute__((weak)) read_persistent_clock(void)
+{
+	return 0;
+}
+
 /*
  * timekeeping_init - Initializes the clocksource and common timekeeping values
  */
 void __init timekeeping_init(void)
 {
-	unsigned long flags;
+	unsigned long flags, sec = read_persistent_clock();
 
 	write_seqlock_irqsave(&xtime_lock, flags);
 
@@ -758,11 +769,18 @@ void __init timekeeping_init(void)
 	clocksource_calculate_interval(clock, tick_nsec);
 	clock->cycle_last = clocksource_read(clock);
 
+	xtime.tv_sec = sec;
+	xtime.tv_nsec = (jiffies % HZ) * (NSEC_PER_SEC / HZ);
+	set_normalized_timespec(&wall_to_monotonic,
+		-xtime.tv_sec, -xtime.tv_nsec);
+
 	write_sequnlock_irqrestore(&xtime_lock, flags);
 }
 
 
 static int timekeeping_suspended;
+static unsigned long timekeeping_suspend_time;
+
 /**
  * timekeeping_resume - Resumes the generic timekeeping subsystem.
  * @dev:	unused
@@ -773,14 +791,23 @@ static int timekeeping_suspended;
  */
 static int timekeeping_resume(struct sys_device *dev)
 {
-	unsigned long flags;
+	unsigned long flags, now = read_persistent_clock();
 
 	write_seqlock_irqsave(&xtime_lock, flags);
-	/* restart the last cycle value */
+
+	if (now && (now > timekeeping_suspend_time)) {
+		unsigned long sleep_length = now - timekeeping_suspend_time;
+		xtime.tv_sec += sleep_length;
+		jiffies_64 += sleep_length * HZ;
+	}
+	/* re-base the last cycle value */
 	clock->cycle_last = clocksource_read(clock);
 	clock->error = 0;
 	timekeeping_suspended = 0;
 	write_sequnlock_irqrestore(&xtime_lock, flags);
+
+	hrtimer_notify_resume();
+
 	return 0;
 }
 
@@ -790,6 +817,7 @@ static int timekeeping_suspend(struct sy
 
 	write_seqlock_irqsave(&xtime_lock, flags);
 	timekeeping_suspended = 1;
+	timekeeping_suspend_time = read_persistent_clock();
 	write_sequnlock_irqrestore(&xtime_lock, flags);
 	return 0;
 }

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 03/23] GTOD: persistent clock support, i386
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
  2006-09-29 23:58 ` [patch 01/23] GTOD: exponential update_wall_time Thomas Gleixner
  2006-09-29 23:58 ` [patch 02/23] GTOD: persistent clock support, core Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:36   ` Andrew Morton
  2006-09-29 23:58 ` [patch 04/23] time: uninline jiffies.h Thomas Gleixner
                   ` (21 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: linux-2.6.18-rc6_timeofday-persistent-clock-i386_C6.patch --]
[-- Type: text/plain, Size: 6080 bytes --]

From: John Stultz <johnstul@us.ibm.com>

persistent clock support: do proper timekeeping across suspend/resume,
i386 arch support.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 arch/i386/kernel/apm.c  |   44 ---------------------------------------
 arch/i386/kernel/time.c |   54 +-----------------------------------------------
 2 files changed, 2 insertions(+), 96 deletions(-)

linux-2.6.18-rc6_timeofday-persistent-clock-i386_C6.patch
Index: linux-2.6.18-mm2/arch/i386/kernel/apm.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/apm.c	2006-09-30 01:41:14.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/apm.c	2006-09-30 01:41:15.000000000 +0200
@@ -234,7 +234,6 @@
 
 #include "io_ports.h"
 
-extern unsigned long get_cmos_time(void);
 extern void machine_real_restart(unsigned char *, int);
 
 #if defined(CONFIG_APM_DISPLAY_BLANK) && defined(CONFIG_VT)
@@ -1153,28 +1152,6 @@ out:
 	spin_unlock(&user_list_lock);
 }
 
-static void set_time(void)
-{
-	struct timespec ts;
-	if (got_clock_diff) {	/* Must know time zone in order to set clock */
-		ts.tv_sec = get_cmos_time() + clock_cmos_diff;
-		ts.tv_nsec = 0;
-		do_settimeofday(&ts);
-	} 
-}
-
-static void get_time_diff(void)
-{
-#ifndef CONFIG_APM_RTC_IS_GMT
-	/*
-	 * Estimate time zone so that set_time can update the clock
-	 */
-	clock_cmos_diff = -get_cmos_time();
-	clock_cmos_diff += get_seconds();
-	got_clock_diff = 1;
-#endif
-}
-
 static void reinit_timer(void)
 {
 #ifdef INIT_TIMER_AFTER_SUSPEND
@@ -1214,19 +1191,6 @@ static int suspend(int vetoable)
 	local_irq_disable();
 	device_power_down(PMSG_SUSPEND);
 
-	/* serialize with the timer interrupt */
-	write_seqlock(&xtime_lock);
-
-	/* protect against access to timer chip registers */
-	spin_lock(&i8253_lock);
-
-	get_time_diff();
-	/*
-	 * Irq spinlock must be dropped around set_system_power_state.
-	 * We'll undo any timer changes due to interrupts below.
-	 */
-	spin_unlock(&i8253_lock);
-	write_sequnlock(&xtime_lock);
 	local_irq_enable();
 
 	save_processor_state();
@@ -1235,7 +1199,6 @@ static int suspend(int vetoable)
 	restore_processor_state();
 
 	local_irq_disable();
-	set_time();
 	reinit_timer();
 
 	if (err == APM_NO_ERROR)
@@ -1265,11 +1228,6 @@ static void standby(void)
 
 	local_irq_disable();
 	device_power_down(PMSG_SUSPEND);
-	/* serialize with the timer interrupt */
-	write_seqlock(&xtime_lock);
-	/* If needed, notify drivers here */
-	get_time_diff();
-	write_sequnlock(&xtime_lock);
 	local_irq_enable();
 
 	err = set_system_power_state(APM_STATE_STANDBY);
@@ -1363,7 +1321,6 @@ static void check_events(void)
 			ignore_bounce = 1;
 			if ((event != APM_NORMAL_RESUME)
 			    || (ignore_normal_resume == 0)) {
-				set_time();
 				device_resume();
 				pm_send_all(PM_RESUME, (void *)0);
 				queue_event(event, NULL);
@@ -1379,7 +1336,6 @@ static void check_events(void)
 			break;
 
 		case APM_UPDATE_TIME:
-			set_time();
 			break;
 
 		case APM_CRITICAL_SUSPEND:
Index: linux-2.6.18-mm2/arch/i386/kernel/time.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/time.c	2006-09-30 01:41:14.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/time.c	2006-09-30 01:41:15.000000000 +0200
@@ -216,7 +216,7 @@ irqreturn_t timer_interrupt(int irq, voi
 }
 
 /* not static: needed by APM */
-unsigned long get_cmos_time(void)
+unsigned long read_persistent_clock(void)
 {
 	unsigned long retval;
 	unsigned long flags;
@@ -232,7 +232,7 @@ unsigned long get_cmos_time(void)
 
 	return retval;
 }
-EXPORT_SYMBOL(get_cmos_time);
+EXPORT_SYMBOL(read_persistent_clock);
 
 static void sync_cmos_clock(unsigned long dummy);
 
@@ -283,58 +283,19 @@ void notify_arch_cmos_timer(void)
 	mod_timer(&sync_cmos_timer, jiffies + 1);
 }
 
-static long clock_cmos_diff;
-static unsigned long sleep_start;
-
-static int timer_suspend(struct sys_device *dev, pm_message_t state)
-{
-	/*
-	 * Estimate time zone so that set_time can update the clock
-	 */
-	unsigned long ctime =  get_cmos_time();
-
-	clock_cmos_diff = -ctime;
-	clock_cmos_diff += get_seconds();
-	sleep_start = ctime;
-	return 0;
-}
-
 static int timer_resume(struct sys_device *dev)
 {
-	unsigned long flags;
-	unsigned long sec;
-	unsigned long ctime = get_cmos_time();
-	long sleep_length = (ctime - sleep_start) * HZ;
-	struct timespec ts;
-
-	if (sleep_length < 0) {
-		printk(KERN_WARNING "CMOS clock skew detected in timer resume!\n");
-		/* The time after the resume must not be earlier than the time
-		 * before the suspend or some nasty things will happen
-		 */
-		sleep_length = 0;
-		ctime = sleep_start;
-	}
 #ifdef CONFIG_HPET_TIMER
 	if (is_hpet_enabled())
 		hpet_reenable();
 #endif
 	setup_pit_timer();
 
-	sec = ctime + clock_cmos_diff;
-	ts.tv_sec = sec;
-	ts.tv_nsec = 0;
-	do_settimeofday(&ts);
-	write_seqlock_irqsave(&xtime_lock, flags);
-	jiffies_64 += sleep_length;
-	write_sequnlock_irqrestore(&xtime_lock, flags);
-	touch_softlockup_watchdog();
 	return 0;
 }
 
 static struct sysdev_class timer_sysclass = {
 	.resume = timer_resume,
-	.suspend = timer_suspend,
 	set_kset_name("timer"),
 };
 
@@ -360,12 +321,6 @@ extern void (*late_time_init)(void);
 /* Duplicate of time_init() below, with hpet_enable part added */
 static void __init hpet_time_init(void)
 {
-	struct timespec ts;
-	ts.tv_sec = get_cmos_time();
-	ts.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ);
-
-	do_settimeofday(&ts);
-
 	if ((hpet_enable() >= 0) && hpet_use_timer) {
 		printk("Using HPET for base-timer\n");
 	}
@@ -376,7 +331,6 @@ static void __init hpet_time_init(void)
 
 void __init time_init(void)
 {
-	struct timespec ts;
 #ifdef CONFIG_HPET_TIMER
 	if (is_hpet_capable()) {
 		/*
@@ -387,10 +341,6 @@ void __init time_init(void)
 		return;
 	}
 #endif
-	ts.tv_sec = get_cmos_time();
-	ts.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ);
-
-	do_settimeofday(&ts);
 
 	time_init_hook();
 }

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 04/23] time: uninline jiffies.h
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (2 preceding siblings ...)
  2006-09-29 23:58 ` [patch 03/23] GTOD: persistent clock support, i386 Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-29 23:58 ` [patch 05/23] time: fix msecs_to_jiffies() bug Thomas Gleixner
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: uninline-jiffies-h.patch --]
[-- Type: text/plain, Size: 13958 bytes --]

From: Ingo Molnar <mingo@elte.hu>

there are load of fat functions hidden in jiffies.h. Uninline them.
No code changes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 include/linux/jiffies.h |  223 +++---------------------------------------------
 kernel/time.c           |  218 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 234 insertions(+), 207 deletions(-)

Index: linux-2.6.18-mm2/include/linux/jiffies.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/jiffies.h	2006-09-30 01:41:14.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/jiffies.h	2006-09-30 01:41:15.000000000 +0200
@@ -259,215 +259,24 @@ static inline u64 get_jiffies_64(void)
 #endif
 
 /*
- * Convert jiffies to milliseconds and back.
- *
- * Avoid unnecessary multiplications/divisions in the
- * two most common HZ cases:
+ * Convert various time units to each other:
  */
-static inline unsigned int jiffies_to_msecs(const unsigned long j)
-{
-#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
-	return (MSEC_PER_SEC / HZ) * j;
-#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
-	return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
-#else
-	return (j * MSEC_PER_SEC) / HZ;
-#endif
-}
-
-static inline unsigned int jiffies_to_usecs(const unsigned long j)
-{
-#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
-	return (USEC_PER_SEC / HZ) * j;
-#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
-	return (j + (HZ / USEC_PER_SEC) - 1)/(HZ / USEC_PER_SEC);
-#else
-	return (j * USEC_PER_SEC) / HZ;
-#endif
-}
-
-static inline unsigned long msecs_to_jiffies(const unsigned int m)
-{
-	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
-		return MAX_JIFFY_OFFSET;
-#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
-	return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
-#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
-	return m * (HZ / MSEC_PER_SEC);
-#else
-	return (m * HZ + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
-#endif
-}
-
-static inline unsigned long usecs_to_jiffies(const unsigned int u)
-{
-	if (u > jiffies_to_usecs(MAX_JIFFY_OFFSET))
-		return MAX_JIFFY_OFFSET;
-#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
-	return (u + (USEC_PER_SEC / HZ) - 1) / (USEC_PER_SEC / HZ);
-#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
-	return u * (HZ / USEC_PER_SEC);
-#else
-	return (u * HZ + USEC_PER_SEC - 1) / USEC_PER_SEC;
-#endif
-}
-
-/*
- * The TICK_NSEC - 1 rounds up the value to the next resolution.  Note
- * that a remainder subtract here would not do the right thing as the
- * resolution values don't fall on second boundries.  I.e. the line:
- * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
- *
- * Rather, we just shift the bits off the right.
- *
- * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
- * value to a scaled second value.
- */
-static __inline__ unsigned long
-timespec_to_jiffies(const struct timespec *value)
-{
-	unsigned long sec = value->tv_sec;
-	long nsec = value->tv_nsec + TICK_NSEC - 1;
-
-	if (sec >= MAX_SEC_IN_JIFFIES){
-		sec = MAX_SEC_IN_JIFFIES;
-		nsec = 0;
-	}
-	return (((u64)sec * SEC_CONVERSION) +
-		(((u64)nsec * NSEC_CONVERSION) >>
-		 (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
-
-}
-
-static __inline__ void
-jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
-{
-	/*
-	 * Convert jiffies to nanoseconds and separate with
-	 * one divide.
-	 */
-	u64 nsec = (u64)jiffies * TICK_NSEC;
-	value->tv_sec = div_long_long_rem(nsec, NSEC_PER_SEC, &value->tv_nsec);
-}
-
-/* Same for "timeval"
- *
- * Well, almost.  The problem here is that the real system resolution is
- * in nanoseconds and the value being converted is in micro seconds.
- * Also for some machines (those that use HZ = 1024, in-particular),
- * there is a LARGE error in the tick size in microseconds.
-
- * The solution we use is to do the rounding AFTER we convert the
- * microsecond part.  Thus the USEC_ROUND, the bits to be shifted off.
- * Instruction wise, this should cost only an additional add with carry
- * instruction above the way it was done above.
- */
-static __inline__ unsigned long
-timeval_to_jiffies(const struct timeval *value)
-{
-	unsigned long sec = value->tv_sec;
-	long usec = value->tv_usec;
-
-	if (sec >= MAX_SEC_IN_JIFFIES){
-		sec = MAX_SEC_IN_JIFFIES;
-		usec = 0;
-	}
-	return (((u64)sec * SEC_CONVERSION) +
-		(((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
-		 (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
-}
-
-static __inline__ void
-jiffies_to_timeval(const unsigned long jiffies, struct timeval *value)
-{
-	/*
-	 * Convert jiffies to nanoseconds and separate with
-	 * one divide.
-	 */
-	u64 nsec = (u64)jiffies * TICK_NSEC;
-	long tv_usec;
-
-	value->tv_sec = div_long_long_rem(nsec, NSEC_PER_SEC, &tv_usec);
-	tv_usec /= NSEC_PER_USEC;
-	value->tv_usec = tv_usec;
-}
-
-/*
- * Convert jiffies/jiffies_64 to clock_t and back.
- */
-static inline clock_t jiffies_to_clock_t(long x)
-{
-#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
-	return x / (HZ / USER_HZ);
-#else
-	u64 tmp = (u64)x * TICK_NSEC;
-	do_div(tmp, (NSEC_PER_SEC / USER_HZ));
-	return (long)tmp;
-#endif
-}
-
-static inline unsigned long clock_t_to_jiffies(unsigned long x)
-{
-#if (HZ % USER_HZ)==0
-	if (x >= ~0UL / (HZ / USER_HZ))
-		return ~0UL;
-	return x * (HZ / USER_HZ);
-#else
-	u64 jif;
-
-	/* Don't worry about loss of precision here .. */
-	if (x >= ~0UL / HZ * USER_HZ)
-		return ~0UL;
-
-	/* .. but do try to contain it here */
-	jif = x * (u64) HZ;
-	do_div(jif, USER_HZ);
-	return jif;
-#endif
-}
-
-static inline u64 jiffies_64_to_clock_t(u64 x)
-{
-#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
-	do_div(x, HZ / USER_HZ);
-#else
-	/*
-	 * There are better ways that don't overflow early,
-	 * but even this doesn't overflow in hundreds of years
-	 * in 64 bits, so..
-	 */
-	x *= TICK_NSEC;
-	do_div(x, (NSEC_PER_SEC / USER_HZ));
-#endif
-	return x;
-}
-
-static inline u64 nsec_to_clock_t(u64 x)
-{
-#if (NSEC_PER_SEC % USER_HZ) == 0
-	do_div(x, (NSEC_PER_SEC / USER_HZ));
-#elif (USER_HZ % 512) == 0
-	x *= USER_HZ/512;
-	do_div(x, (NSEC_PER_SEC / 512));
-#else
-	/*
-         * max relative error 5.7e-8 (1.8s per year) for USER_HZ <= 1024,
-         * overflow after 64.99 years.
-         * exact for HZ=60, 72, 90, 120, 144, 180, 300, 600, 900, ...
-         */
-	x *= 9;
-	do_div(x, (unsigned long)((9ull * NSEC_PER_SEC + (USER_HZ/2))
-	                          / USER_HZ));
-#endif
-	return x;
-}
+extern unsigned int jiffies_to_msecs(const unsigned long j);
+extern unsigned int jiffies_to_usecs(const unsigned long j);
+extern unsigned long msecs_to_jiffies(const unsigned int m);
+extern unsigned long usecs_to_jiffies(const unsigned int u);
+extern unsigned long timespec_to_jiffies(const struct timespec *value);
+extern void jiffies_to_timespec(const unsigned long jiffies,
+				struct timespec *value);
+extern unsigned long timeval_to_jiffies(const struct timeval *value);
+extern void jiffies_to_timeval(const unsigned long jiffies,
+			       struct timeval *value);
+extern clock_t jiffies_to_clock_t(long x);
+extern unsigned long clock_t_to_jiffies(unsigned long x);
+extern u64 jiffies_64_to_clock_t(u64 x);
+extern u64 nsec_to_clock_t(u64 x);
+extern int nsec_to_timestamp(char *s, u64 t);
 
-static inline int nsec_to_timestamp(char *s, u64 t)
-{
-	unsigned long nsec_rem = do_div(t, NSEC_PER_SEC);
-	return sprintf(s, "[%5lu.%06lu]", (unsigned long)t,
-		       nsec_rem/NSEC_PER_USEC);
-}
 #define TIMESTAMP_SIZE	30
 
 #endif
Index: linux-2.6.18-mm2/kernel/time.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time.c	2006-09-30 01:41:14.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time.c	2006-09-30 01:41:15.000000000 +0200
@@ -470,6 +470,224 @@ struct timeval ns_to_timeval(const s64 n
 	return tv;
 }
 
+/*
+ * Convert jiffies to milliseconds and back.
+ *
+ * Avoid unnecessary multiplications/divisions in the
+ * two most common HZ cases:
+ */
+unsigned int jiffies_to_msecs(const unsigned long j)
+{
+#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
+	return (MSEC_PER_SEC / HZ) * j;
+#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
+	return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
+#else
+	return (j * MSEC_PER_SEC) / HZ;
+#endif
+}
+EXPORT_SYMBOL(jiffies_to_msecs);
+
+unsigned int jiffies_to_usecs(const unsigned long j)
+{
+#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
+	return (USEC_PER_SEC / HZ) * j;
+#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
+	return (j + (HZ / USEC_PER_SEC) - 1)/(HZ / USEC_PER_SEC);
+#else
+	return (j * USEC_PER_SEC) / HZ;
+#endif
+}
+EXPORT_SYMBOL(jiffies_to_usecs);
+
+unsigned long msecs_to_jiffies(const unsigned int m)
+{
+	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+		return MAX_JIFFY_OFFSET;
+#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
+	return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
+#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
+	return m * (HZ / MSEC_PER_SEC);
+#else
+	return (m * HZ + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
+#endif
+}
+EXPORT_SYMBOL(msecs_to_jiffies);
+
+unsigned long usecs_to_jiffies(const unsigned int u)
+{
+	if (u > jiffies_to_usecs(MAX_JIFFY_OFFSET))
+		return MAX_JIFFY_OFFSET;
+#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
+	return (u + (USEC_PER_SEC / HZ) - 1) / (USEC_PER_SEC / HZ);
+#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
+	return u * (HZ / USEC_PER_SEC);
+#else
+	return (u * HZ + USEC_PER_SEC - 1) / USEC_PER_SEC;
+#endif
+}
+EXPORT_SYMBOL(usecs_to_jiffies);
+
+/*
+ * The TICK_NSEC - 1 rounds up the value to the next resolution.  Note
+ * that a remainder subtract here would not do the right thing as the
+ * resolution values don't fall on second boundries.  I.e. the line:
+ * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
+ *
+ * Rather, we just shift the bits off the right.
+ *
+ * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
+ * value to a scaled second value.
+ */
+unsigned long
+timespec_to_jiffies(const struct timespec *value)
+{
+	unsigned long sec = value->tv_sec;
+	long nsec = value->tv_nsec + TICK_NSEC - 1;
+
+	if (sec >= MAX_SEC_IN_JIFFIES){
+		sec = MAX_SEC_IN_JIFFIES;
+		nsec = 0;
+	}
+	return (((u64)sec * SEC_CONVERSION) +
+		(((u64)nsec * NSEC_CONVERSION) >>
+		 (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
+
+}
+EXPORT_SYMBOL(timespec_to_jiffies);
+
+void
+jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
+{
+	/*
+	 * Convert jiffies to nanoseconds and separate with
+	 * one divide.
+	 */
+	u64 nsec = (u64)jiffies * TICK_NSEC;
+	value->tv_sec = div_long_long_rem(nsec, NSEC_PER_SEC, &value->tv_nsec);
+}
+
+/* Same for "timeval"
+ *
+ * Well, almost.  The problem here is that the real system resolution is
+ * in nanoseconds and the value being converted is in micro seconds.
+ * Also for some machines (those that use HZ = 1024, in-particular),
+ * there is a LARGE error in the tick size in microseconds.
+
+ * The solution we use is to do the rounding AFTER we convert the
+ * microsecond part.  Thus the USEC_ROUND, the bits to be shifted off.
+ * Instruction wise, this should cost only an additional add with carry
+ * instruction above the way it was done above.
+ */
+unsigned long
+timeval_to_jiffies(const struct timeval *value)
+{
+	unsigned long sec = value->tv_sec;
+	long usec = value->tv_usec;
+
+	if (sec >= MAX_SEC_IN_JIFFIES){
+		sec = MAX_SEC_IN_JIFFIES;
+		usec = 0;
+	}
+	return (((u64)sec * SEC_CONVERSION) +
+		(((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
+		 (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
+}
+
+void jiffies_to_timeval(const unsigned long jiffies, struct timeval *value)
+{
+	/*
+	 * Convert jiffies to nanoseconds and separate with
+	 * one divide.
+	 */
+	u64 nsec = (u64)jiffies * TICK_NSEC;
+	long tv_usec;
+
+	value->tv_sec = div_long_long_rem(nsec, NSEC_PER_SEC, &tv_usec);
+	tv_usec /= NSEC_PER_USEC;
+	value->tv_usec = tv_usec;
+}
+
+/*
+ * Convert jiffies/jiffies_64 to clock_t and back.
+ */
+clock_t jiffies_to_clock_t(long x)
+{
+#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
+	return x / (HZ / USER_HZ);
+#else
+	u64 tmp = (u64)x * TICK_NSEC;
+	do_div(tmp, (NSEC_PER_SEC / USER_HZ));
+	return (long)tmp;
+#endif
+}
+EXPORT_SYMBOL(jiffies_to_clock_t);
+
+unsigned long clock_t_to_jiffies(unsigned long x)
+{
+#if (HZ % USER_HZ)==0
+	if (x >= ~0UL / (HZ / USER_HZ))
+		return ~0UL;
+	return x * (HZ / USER_HZ);
+#else
+	u64 jif;
+
+	/* Don't worry about loss of precision here .. */
+	if (x >= ~0UL / HZ * USER_HZ)
+		return ~0UL;
+
+	/* .. but do try to contain it here */
+	jif = x * (u64) HZ;
+	do_div(jif, USER_HZ);
+	return jif;
+#endif
+}
+EXPORT_SYMBOL(clock_t_to_jiffies);
+
+u64 jiffies_64_to_clock_t(u64 x)
+{
+#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
+	do_div(x, HZ / USER_HZ);
+#else
+	/*
+	 * There are better ways that don't overflow early,
+	 * but even this doesn't overflow in hundreds of years
+	 * in 64 bits, so..
+	 */
+	x *= TICK_NSEC;
+	do_div(x, (NSEC_PER_SEC / USER_HZ));
+#endif
+	return x;
+}
+
+EXPORT_SYMBOL(jiffies_64_to_clock_t);
+
+u64 nsec_to_clock_t(u64 x)
+{
+#if (NSEC_PER_SEC % USER_HZ) == 0
+	do_div(x, (NSEC_PER_SEC / USER_HZ));
+#elif (USER_HZ % 512) == 0
+	x *= USER_HZ/512;
+	do_div(x, (NSEC_PER_SEC / 512));
+#else
+	/*
+         * max relative error 5.7e-8 (1.8s per year) for USER_HZ <= 1024,
+         * overflow after 64.99 years.
+         * exact for HZ=60, 72, 90, 120, 144, 180, 300, 600, 900, ...
+         */
+	x *= 9;
+	do_div(x, (unsigned long)((9ull * NSEC_PER_SEC + (USER_HZ/2)) /
+				  USER_HZ));
+#endif
+	return x;
+}
+
+int nsec_to_timestamp(char *s, u64 t)
+{
+	unsigned long nsec_rem = do_div(t, NSEC_PER_SEC);
+	return sprintf(s, "[%5lu.%06lu]", (unsigned long)t,
+		       nsec_rem/NSEC_PER_USEC);
+}
 __attribute__((weak)) unsigned long long timestamp_clock(void)
 {
 	return sched_clock();

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 05/23] time: fix msecs_to_jiffies() bug
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (3 preceding siblings ...)
  2006-09-29 23:58 ` [patch 04/23] time: uninline jiffies.h Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-29 23:58 ` [patch 06/23] time: fix timeout overflow Thomas Gleixner
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: fix-msec-conversion.patch --]
[-- Type: text/plain, Size: 2680 bytes --]

From: Ingo Molnar <mingo@elte.hu>

fix multiple conversion bugs in msecs_to_jiffies().

the main problem is that this condition:

       if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))

overflows if HZ is smaller than 1000!

this change is user-visible: for HZ=250 SUS-compliant poll()-timeout
value of -20 is mistakenly converted to 'immediate timeout'.

(the new dyntick code also triggered this, as it frequently creates
'lagging timer wheel' scenarios.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 kernel/time.c |   43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.18-mm2/kernel/time.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time.c	2006-09-30 01:41:15.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time.c	2006-09-30 01:41:15.000000000 +0200
@@ -500,15 +500,56 @@ unsigned int jiffies_to_usecs(const unsi
 }
 EXPORT_SYMBOL(jiffies_to_usecs);
 
+/*
+ * When we convert to jiffies then we interpret incoming values
+ * the following way:
+ *
+ * - negative values mean 'infinite timeout' (MAX_JIFFY_OFFSET)
+ *
+ * - 'too large' values [that would result in larger than
+ *   MAX_JIFFY_OFFSET values] mean 'infinite timeout' too.
+ *
+ * - all other values are converted to jiffies by either multiplying
+ *   the input value by a factor or dividing it with a factor
+ *
+ * We must also be careful about 32-bit overflows.
+ */
 unsigned long msecs_to_jiffies(const unsigned int m)
 {
-	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+	/*
+	 * Negative value, means infinite timeout:
+	 */
+	if ((int)m < 0)
 		return MAX_JIFFY_OFFSET;
+
 #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
+	/*
+	 * HZ is equal to or smaller than 1000, and 1000 is a nice
+	 * round multiple of HZ, divide with the factor between them,
+	 * but round upwards:
+	 */
 	return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
 #elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
+	/*
+	 * HZ is larger than 1000, and HZ is a nice round multiple of
+	 * 1000 - simply multiply with the factor between them.
+	 *
+	 * But first make sure the multiplication result cannot
+	 * overflow:
+	 */
+	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+		return MAX_JIFFY_OFFSET;
+
 	return m * (HZ / MSEC_PER_SEC);
 #else
+	/*
+	 * Generic case - multiply, round and divide. But first
+	 * check that if we are doing a net multiplication, that
+	 * we wouldnt overflow:
+	 */
+	if (HZ > MSEC_PER_SEC && m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+		return MAX_JIFFY_OFFSET;
+
 	return (m * HZ + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
 #endif
 }

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 06/23] time: fix timeout overflow
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (4 preceding siblings ...)
  2006-09-29 23:58 ` [patch 05/23] time: fix msecs_to_jiffies() bug Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-29 23:58 ` [patch 07/23] cleanup: uninline irq_enter() and move it into a function Thomas Gleixner
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: max-jiffies-timeout-prevent-overflow.patch --]
[-- Type: text/plain, Size: 1239 bytes --]

From: Ingo Molnar <mingo@elte.hu>

prevent timeout overflow if timer ticks are behind jiffies (due to high
softirq load or due to dyntick), by limiting the valid timeout range
to MAX_LONG/2.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 include/linux/jiffies.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.18-mm2/include/linux/jiffies.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/jiffies.h	2006-09-30 01:41:15.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/jiffies.h	2006-09-30 01:41:16.000000000 +0200
@@ -142,13 +142,13 @@ static inline u64 get_jiffies_64(void)
  *
  * And some not so obvious.
  *
- * Note that we don't want to return MAX_LONG, because
+ * Note that we don't want to return LONG_MAX, because
  * for various timeout reasons we often end up having
  * to wait "jiffies+1" in order to guarantee that we wait
  * at _least_ "jiffies" - so "jiffies+1" had better still
  * be positive.
  */
-#define MAX_JIFFY_OFFSET ((~0UL >> 1)-1)
+#define MAX_JIFFY_OFFSET ((LONG_MAX >> 1)-1)
 
 /*
  * We want to do realistic conversions of time so we need to use the same

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 07/23] cleanup: uninline irq_enter() and move it into a function
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (5 preceding siblings ...)
  2006-09-29 23:58 ` [patch 06/23] time: fix timeout overflow Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:36   ` Andrew Morton
  2006-09-29 23:58 ` [patch 08/23] dynticks: prepare the RCU code Thomas Gleixner
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: unmacro-irq-enter.patch --]
[-- Type: text/plain, Size: 1592 bytes --]

From: Ingo Molnar <mingo@elte.hu>

uninline irq_enter(). [dynticks adds more stuff to it]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 include/linux/hardirq.h |    7 +------
 kernel/softirq.c        |    7 +++++++
 2 files changed, 8 insertions(+), 6 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hardirq.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hardirq.h	2006-09-30 01:41:13.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hardirq.h	2006-09-30 01:41:16.000000000 +0200
@@ -106,12 +106,7 @@ static inline void account_system_vtime(
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define irq_enter()					\
-	do {						\
-		account_system_vtime(current);		\
-		add_preempt_count(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
-	} while (0)
+extern void irq_enter(void);
 
 /*
  * Exit irq context without processing softirqs:
Index: linux-2.6.18-mm2/kernel/softirq.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/softirq.c	2006-09-30 01:41:13.000000000 +0200
+++ linux-2.6.18-mm2/kernel/softirq.c	2006-09-30 01:41:16.000000000 +0200
@@ -279,6 +279,13 @@ EXPORT_SYMBOL(do_softirq);
 # define invoke_softirq()	do_softirq()
 #endif
 
+extern void irq_enter(void)
+{
+	account_system_vtime(current);
+	add_preempt_count(HARDIRQ_OFFSET);
+	trace_hardirq_enter();
+}
+
 /*
  * Exit an interrupt context. Process softirqs if needed and possible:
  */

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 08/23] dynticks: prepare the RCU code
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (6 preceding siblings ...)
  2006-09-29 23:58 ` [patch 07/23] cleanup: uninline irq_enter() and move it into a function Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:36   ` Andrew Morton
  2006-09-29 23:58 ` [patch 09/23] dynticks: extend next_timer_interrupt() to use a reference jiffie Thomas Gleixner
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: rcu-prepare-for-nohz.patch --]
[-- Type: text/plain, Size: 2590 bytes --]

From: Ingo Molnar <mingo@elte.hu>

prepare the RCU code for dynticks/nohz. Since on nohz kernels there
is no guaranteed timer IRQ that processes RCU callbacks, the idle
code has to make sure that all RCU callbacks that can be finished
off are indeed finished off. This patch adds the necessary APIs:
rcu_advance_callbacks() [to register quiescent state] and
rcu_process_callbacks() [to finish finishable RCU callbacks].

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 include/linux/rcupdate.h |    2 ++
 kernel/rcupdate.c        |   13 ++++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6.18-mm2/include/linux/rcupdate.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/rcupdate.h	2006-09-30 01:41:13.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/rcupdate.h	2006-09-30 01:41:16.000000000 +0200
@@ -271,6 +271,7 @@ extern int rcu_needs_cpu(int cpu);
 
 extern void rcu_init(void);
 extern void rcu_check_callbacks(int cpu, int user);
+extern void rcu_advance_callbacks(int cpu, int user);
 extern void rcu_restart_cpu(int cpu);
 extern long rcu_batches_completed(void);
 extern long rcu_batches_completed_bh(void);
@@ -283,6 +284,7 @@ extern void FASTCALL(call_rcu_bh(struct 
 extern void synchronize_rcu(void);
 void synchronize_idle(void);
 extern void rcu_barrier(void);
+extern void rcu_process_callbacks(unsigned long unused);
 
 #endif /* __KERNEL__ */
 #endif /* __LINUX_RCUPDATE_H */
Index: linux-2.6.18-mm2/kernel/rcupdate.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/rcupdate.c	2006-09-30 01:41:13.000000000 +0200
+++ linux-2.6.18-mm2/kernel/rcupdate.c	2006-09-30 01:41:16.000000000 +0200
@@ -460,7 +460,7 @@ static void __rcu_process_callbacks(stru
 		rcu_do_batch(rdp);
 }
 
-static void rcu_process_callbacks(unsigned long unused)
+void rcu_process_callbacks(unsigned long unused)
 {
 	__rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data));
 	__rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
@@ -515,6 +515,17 @@ int rcu_needs_cpu(int cpu)
 	return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu));
 }
 
+void rcu_advance_callbacks(int cpu, int user)
+{
+	if (user ||
+	    (idle_cpu(cpu) && !in_softirq() &&
+				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
+		rcu_qsctr_inc(cpu);
+		rcu_bh_qsctr_inc(cpu);
+	} else if (!in_softirq())
+		rcu_bh_qsctr_inc(cpu);
+}
+
 void rcu_check_callbacks(int cpu, int user)
 {
 	if (user || 

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 09/23] dynticks: extend next_timer_interrupt() to use a reference jiffie
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (7 preceding siblings ...)
  2006-09-29 23:58 ` [patch 08/23] dynticks: prepare the RCU code Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:37   ` Andrew Morton
  2006-09-29 23:58 ` [patch 10/23] hrtimers: clean up locking Thomas Gleixner
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: extend-timer-next-interrupt.patch --]
[-- Type: text/plain, Size: 5269 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

For CONFIG_NO_HZ we need to calculate the next timer wheel event based
to a given jiffie value. Extend the existing code to allow the extra now
argument. Provide a compability function for the existing implementations
to call the function with now = jiffies.
This also solves the racyness of the original code vs. jiffies changing
during the iteration.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
----
 include/linux/timer.h |   10 +++++++
 kernel/timer.c        |   64 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 56 insertions(+), 18 deletions(-)

Index: linux-2.6.18-mm2/include/linux/timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/timer.h	2006-09-30 01:41:12.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/timer.h	2006-09-30 01:41:16.000000000 +0200
@@ -61,7 +61,17 @@ extern int del_timer(struct timer_list *
 extern int __mod_timer(struct timer_list *timer, unsigned long expires);
 extern int mod_timer(struct timer_list *timer, unsigned long expires);
 
+/*
+ * Return when the next timer-wheel timeout occurs (in absolute jiffies),
+ * locks the timer base:
+ */
 extern unsigned long next_timer_interrupt(void);
+/*
+ * Return when the next timer-wheel timeout occurs (in absolute jiffies),
+ * locks the timer base and does the comparison against the given
+ * jiffie.
+ */
+extern unsigned long get_next_timer_interrupt(unsigned long now);
 
 /***
  * add_timer - start a timer
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-09-30 01:41:15.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-09-30 01:41:16.000000000 +0200
@@ -468,29 +468,28 @@ static inline void __run_timers(tvec_bas
  * is used on S/390 to stop all activity when a cpus is idle.
  * This functions needs to be called disabled.
  */
-unsigned long next_timer_interrupt(void)
+unsigned long __next_timer_interrupt(tvec_base_t *base, unsigned long now)
 {
-	tvec_base_t *base;
 	struct list_head *list;
-	struct timer_list *nte;
+	struct timer_list *nte, *found = NULL;
 	unsigned long expires;
-	unsigned long hr_expires = MAX_JIFFY_OFFSET;
-	ktime_t hr_delta;
 	tvec_t *varray[4];
 	int i, j;
 
-	hr_delta = hrtimer_get_next_event();
+#ifndef CONFIG_NO_HZ
+	unsigned long hr_expires = MAX_JIFFY_OFFSET;
+	ktime_t hr_delta = hrtimer_get_next_event();
+
 	if (hr_delta.tv64 != KTIME_MAX) {
 		struct timespec tsdelta;
 		tsdelta = ktime_to_timespec(hr_delta);
 		hr_expires = timespec_to_jiffies(&tsdelta);
 		if (hr_expires < 3)
-			return hr_expires + jiffies;
+			return hr_expires + now;
 	}
-	hr_expires += jiffies;
+	hr_expires += now;
+#endif
 
-	base = __get_cpu_var(tvec_bases);
-	spin_lock(&base->lock);
 	expires = base->timer_jiffies + (LONG_MAX >> 1);
 	list = NULL;
 
@@ -499,6 +498,7 @@ unsigned long next_timer_interrupt(void)
 	do {
 		list_for_each_entry(nte, base->tv1.vec + j, entry) {
 			expires = nte->expires;
+			found = nte;
 			if (j < (base->timer_jiffies & TVR_MASK))
 				list = base->tv2.vec + (INDEX(0));
 			goto found;
@@ -518,9 +518,12 @@ unsigned long next_timer_interrupt(void)
 				j = (j + 1) & TVN_MASK;
 				continue;
 			}
-			list_for_each_entry(nte, varray[i]->vec + j, entry)
-				if (time_before(nte->expires, expires))
+			list_for_each_entry(nte, varray[i]->vec + j, entry) {
+				if (time_before(nte->expires, expires)) {
 					expires = nte->expires;
+					found = nte;
+				}
+			}
 			if (j < (INDEX(i)) && i < 3)
 				list = varray[i + 1]->vec + (INDEX(i + 1));
 			goto found;
@@ -534,12 +537,15 @@ found:
 		 * where we found the timer element.
 		 */
 		list_for_each_entry(nte, list, entry) {
-			if (time_before(nte->expires, expires))
+			if (time_before(nte->expires, expires)) {
 				expires = nte->expires;
+				found = nte;
+			}
 		}
 	}
-	spin_unlock(&base->lock);
+	WARN_ON(!found);
 
+#ifndef CONFIG_NO_HZ
 	/*
 	 * It can happen that other CPUs service timer IRQs and increment
 	 * jiffies, but we have not yet got a local timer tick to process
@@ -553,14 +559,36 @@ found:
 	 * would falsely evaluate to true.  If that is the case, just
 	 * return jiffies so that we can immediately fire the local timer
 	 */
-	if (time_before(expires, jiffies))
-		return jiffies;
+	if (time_before(expires, now))
+		expires = now;
+	else if (time_before(hr_expires, expires))
+		expires = hr_expires;
+#endif
+	/*
+	 * 'Timer wheel time' can lag behind 'jiffies time' due to
+	 * delayed processing, so make sure we return a value that
+	 * makes sense externally:
+	 */
+	return expires - (now - base->timer_jiffies);
+}
+
+unsigned long get_next_timer_interrupt(unsigned long now)
+{
+	tvec_base_t *base = __get_cpu_var(tvec_bases);
+	unsigned long expires;
 
-	if (time_before(hr_expires, expires))
-		return hr_expires;
+	spin_lock(&base->lock);
+	expires = __next_timer_interrupt(base, now);
+	spin_unlock(&base->lock);
 
 	return expires;
 }
+
+unsigned long next_timer_interrupt(void)
+{
+	return get_next_timer_interrupt(jiffies);
+}
+
 #endif
 
 /******************************************************************/

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 10/23] hrtimers: clean up locking
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (8 preceding siblings ...)
  2006-09-29 23:58 ` [patch 09/23] dynticks: extend next_timer_interrupt() to use a reference jiffie Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:37   ` Andrew Morton
  2006-09-29 23:58 ` [patch 11/23] hrtimers: state tracking Thomas Gleixner
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-single-lock.patch --]
[-- Type: text/plain, Size: 16077 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

improve kernel/hrtimers.c locking: use a per-CPU base with a lock to
control locking of all clocks belonging to a CPU. This simplifies
code that needs to lock all clocks at once. This makes life easier
for high-res timers and dyntick. No functional change should happen
due to this.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h |   37 ++++++---
 kernel/hrtimer.c        |  181 +++++++++++++++++++++++++-----------------------
 2 files changed, 121 insertions(+), 97 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-09-30 01:41:15.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-09-30 01:41:17.000000000 +0200
@@ -36,7 +36,7 @@ enum hrtimer_restart {
 
 #define HRTIMER_INACTIVE	((void *)1UL)
 
-struct hrtimer_base;
+struct hrtimer_clock_base;
 
 /**
  * struct hrtimer - the basic hrtimer structure
@@ -50,10 +50,10 @@ struct hrtimer_base;
  * The hrtimer structure must be initialized by init_hrtimer_#CLOCKTYPE()
  */
 struct hrtimer {
-	struct rb_node		node;
-	ktime_t			expires;
-	int			(*function)(struct hrtimer *);
-	struct hrtimer_base	*base;
+	struct rb_node			node;
+	ktime_t				expires;
+	int				(*function)(struct hrtimer *);
+	struct hrtimer_clock_base	*base;
 };
 
 /**
@@ -68,31 +68,44 @@ struct hrtimer_sleeper {
 	struct task_struct *task;
 };
 
+struct hrtimer_cpu_base;
+
 /**
  * struct hrtimer_base - the timer base for a specific clock
  * @index:		clock type index for per_cpu support when moving a timer
  *			to a base on another cpu.
- * @lock:		lock protecting the base and associated timers
  * @active:		red black tree root node for the active timers
  * @first:		pointer to the timer node which expires first
  * @resolution:		the resolution of the clock, in nanoseconds
  * @get_time:		function to retrieve the current time of the clock
  * @get_softirq_time:	function to retrieve the current time from the softirq
- * @curr_timer:		the timer which is executing a callback right now
  * @softirq_time:	the time when running the hrtimer queue in the softirq
- * @lock_key:		the lock_class_key for use with lockdep
  */
-struct hrtimer_base {
+struct hrtimer_clock_base {
+	struct hrtimer_cpu_base	*cpu_base;
 	clockid_t		index;
-	spinlock_t		lock;
 	struct rb_root		active;
 	struct rb_node		*first;
 	ktime_t			resolution;
 	ktime_t			(*get_time)(void);
 	ktime_t			(*get_softirq_time)(void);
-	struct hrtimer		*curr_timer;
 	ktime_t			softirq_time;
-	struct lock_class_key lock_key;
+};
+
+#define HRTIMER_MAX_CLOCK_BASES 2
+
+/*
+ * struct hrtimer_cpu_base - the per cpu clock bases
+ * @lock:		lock protecting the base and associated clock bases and timers
+ * @lock_key:		the lock_class_key for use with lockdep
+ * @clock_base:		array of clock bases for this cpu
+ * @curr_timer:		the timer which is executing a callback right now
+ */
+struct hrtimer_cpu_base {
+	spinlock_t			lock;
+	struct lock_class_key		lock_key;
+	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
+	struct hrtimer			*curr_timer;
 };
 
 /*
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-09-30 01:41:15.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-09-30 01:41:17.000000000 +0200
@@ -1,8 +1,8 @@
 /*
  *  linux/kernel/hrtimer.c
  *
- *  Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
- *  Copyright(C) 2005, Red Hat, Inc., Ingo Molnar
+ *  Copyright(C) 2005-2006, Thomas Gleixner <tglx@linutronix.de>
+ *  Copyright(C) 2005-2006, Red Hat, Inc., Ingo Molnar
  *
  *  High-resolution kernel timers
  *
@@ -79,21 +79,22 @@ EXPORT_SYMBOL_GPL(ktime_get_real);
  * This ensures that we capture erroneous accesses to these clock ids
  * rather than moving them into the range of valid clock id's.
  */
-
-#define MAX_HRTIMER_BASES 2
-
-static DEFINE_PER_CPU(struct hrtimer_base, hrtimer_bases[MAX_HRTIMER_BASES]) =
+static DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
 {
+
+	.clock_base =
 	{
-		.index = CLOCK_REALTIME,
-		.get_time = &ktime_get_real,
-		.resolution = KTIME_REALTIME_RES,
-	},
-	{
-		.index = CLOCK_MONOTONIC,
-		.get_time = &ktime_get,
-		.resolution = KTIME_MONOTONIC_RES,
-	},
+		{
+			.index = CLOCK_REALTIME,
+			.get_time = &ktime_get_real,
+			.resolution = KTIME_REALTIME_RES,
+		},
+		{
+			.index = CLOCK_MONOTONIC,
+			.get_time = &ktime_get,
+			.resolution = KTIME_MONOTONIC_RES,
+		},
+	}
 };
 
 /**
@@ -125,7 +126,7 @@ EXPORT_SYMBOL_GPL(ktime_get_ts);
  * Get the coarse grained time at the softirq based on xtime and
  * wall_to_monotonic.
  */
-static void hrtimer_get_softirq_time(struct hrtimer_base *base)
+static void hrtimer_get_softirq_time(struct hrtimer_cpu_base *base)
 {
 	ktime_t xtim, tomono;
 	unsigned long seq;
@@ -137,8 +138,9 @@ static void hrtimer_get_softirq_time(str
 
 	} while (read_seqretry(&xtime_lock, seq));
 
-	base[CLOCK_REALTIME].softirq_time = xtim;
-	base[CLOCK_MONOTONIC].softirq_time = ktime_add(xtim, tomono);
+	base->clock_base[CLOCK_REALTIME].softirq_time = xtim;
+	base->clock_base[CLOCK_MONOTONIC].softirq_time =
+		ktime_add(xtim, tomono);
 }
 
 /*
@@ -161,19 +163,20 @@ static void hrtimer_get_softirq_time(str
  * possible to set timer->base = NULL and drop the lock: the timer remains
  * locked.
  */
-static struct hrtimer_base *lock_hrtimer_base(const struct hrtimer *timer,
-					      unsigned long *flags)
+static
+struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+					     unsigned long *flags)
 {
-	struct hrtimer_base *base;
+	struct hrtimer_clock_base *base;
 
 	for (;;) {
 		base = timer->base;
 		if (likely(base != NULL)) {
-			spin_lock_irqsave(&base->lock, *flags);
+			spin_lock_irqsave(&base->cpu_base->lock, *flags);
 			if (likely(base == timer->base))
 				return base;
 			/* The timer has migrated to another CPU: */
-			spin_unlock_irqrestore(&base->lock, *flags);
+			spin_unlock_irqrestore(&base->cpu_base->lock, *flags);
 		}
 		cpu_relax();
 	}
@@ -182,12 +185,14 @@ static struct hrtimer_base *lock_hrtimer
 /*
  * Switch the timer base to the current CPU when possible.
  */
-static inline struct hrtimer_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_base *base)
+static inline struct hrtimer_clock_base *
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
-	struct hrtimer_base *new_base;
+	struct hrtimer_clock_base *new_base;
+	struct hrtimer_cpu_base *new_cpu_base;
 
-	new_base = &__get_cpu_var(hrtimer_bases)[base->index];
+	new_cpu_base = &__get_cpu_var(hrtimer_bases);
+	new_base = &new_cpu_base->clock_base[base->index];
 
 	if (base != new_base) {
 		/*
@@ -199,13 +204,13 @@ switch_hrtimer_base(struct hrtimer *time
 		 * completed. There is no conflict as we hold the lock until
 		 * the timer is enqueued.
 		 */
-		if (unlikely(base->curr_timer == timer))
+		if (unlikely(base->cpu_base->curr_timer == timer))
 			return base;
 
 		/* See the comment in lock_timer_base() */
 		timer->base = NULL;
-		spin_unlock(&base->lock);
-		spin_lock(&new_base->lock);
+		spin_unlock(&base->cpu_base->lock);
+		spin_lock(&new_base->cpu_base->lock);
 		timer->base = new_base;
 	}
 	return new_base;
@@ -215,12 +220,12 @@ switch_hrtimer_base(struct hrtimer *time
 
 #define set_curr_timer(b, t)		do { } while (0)
 
-static inline struct hrtimer_base *
+static inline struct hrtimer_clock_base *
 lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 {
-	struct hrtimer_base *base = timer->base;
+	struct hrtimer_clock_base *base = timer->base;
 
-	spin_lock_irqsave(&base->lock, *flags);
+	spin_lock_irqsave(&base->cpu_base->lock, *flags);
 
 	return base;
 }
@@ -300,7 +305,7 @@ void hrtimer_notify_resume(void)
 static inline
 void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 {
-	spin_unlock_irqrestore(&timer->base->lock, *flags);
+	spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
 }
 
 /**
@@ -350,7 +355,8 @@ hrtimer_forward(struct hrtimer *timer, k
  * The timer is inserted in expiry order. Insertion into the
  * red black tree is O(log(n)). Must hold the base lock.
  */
-static void enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_base *base)
+static void enqueue_hrtimer(struct hrtimer *timer,
+			    struct hrtimer_clock_base *base)
 {
 	struct rb_node **link = &base->active.rb_node;
 	struct rb_node *parent = NULL;
@@ -389,7 +395,8 @@ static void enqueue_hrtimer(struct hrtim
  *
  * Caller must hold the base lock.
  */
-static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_base *base)
+static void __remove_hrtimer(struct hrtimer *timer,
+			     struct hrtimer_clock_base *base)
 {
 	/*
 	 * Remove the timer from the rbtree and replace the
@@ -405,7 +412,7 @@ static void __remove_hrtimer(struct hrti
  * remove hrtimer, called with base lock held
  */
 static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_base *base)
+remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
 	if (hrtimer_active(timer)) {
 		__remove_hrtimer(timer, base);
@@ -427,7 +434,7 @@ remove_hrtimer(struct hrtimer *timer, st
 int
 hrtimer_start(struct hrtimer *timer, ktime_t tim, const enum hrtimer_mode mode)
 {
-	struct hrtimer_base *base, *new_base;
+	struct hrtimer_clock_base *base, *new_base;
 	unsigned long flags;
 	int ret;
 
@@ -474,13 +481,13 @@ EXPORT_SYMBOL_GPL(hrtimer_start);
  */
 int hrtimer_try_to_cancel(struct hrtimer *timer)
 {
-	struct hrtimer_base *base;
+	struct hrtimer_clock_base *base;
 	unsigned long flags;
 	int ret = -1;
 
 	base = lock_hrtimer_base(timer, &flags);
 
-	if (base->curr_timer != timer)
+	if (base->cpu_base->curr_timer != timer)
 		ret = remove_hrtimer(timer, base);
 
 	unlock_hrtimer_base(timer, &flags);
@@ -516,7 +523,7 @@ EXPORT_SYMBOL_GPL(hrtimer_cancel);
  */
 ktime_t hrtimer_get_remaining(const struct hrtimer *timer)
 {
-	struct hrtimer_base *base;
+	struct hrtimer_clock_base *base;
 	unsigned long flags;
 	ktime_t rem;
 
@@ -537,26 +544,29 @@ EXPORT_SYMBOL_GPL(hrtimer_get_remaining)
  */
 ktime_t hrtimer_get_next_event(void)
 {
-	struct hrtimer_base *base = __get_cpu_var(hrtimer_bases);
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	struct hrtimer_clock_base *base = cpu_base->clock_base;
 	ktime_t delta, mindelta = { .tv64 = KTIME_MAX };
 	unsigned long flags;
 	int i;
 
-	for (i = 0; i < MAX_HRTIMER_BASES; i++, base++) {
+	spin_lock_irqsave(&cpu_base->lock, flags);
+
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
 		struct hrtimer *timer;
 
-		spin_lock_irqsave(&base->lock, flags);
-		if (!base->first) {
-			spin_unlock_irqrestore(&base->lock, flags);
+		if (!base->first)
 			continue;
-		}
+
 		timer = rb_entry(base->first, struct hrtimer, node);
 		delta.tv64 = timer->expires.tv64;
-		spin_unlock_irqrestore(&base->lock, flags);
 		delta = ktime_sub(delta, base->get_time());
 		if (delta.tv64 < mindelta.tv64)
 			mindelta.tv64 = delta.tv64;
 	}
+
+	spin_unlock_irqrestore(&cpu_base->lock, flags);
+
 	if (mindelta.tv64 < 0)
 		mindelta.tv64 = 0;
 	return mindelta;
@@ -572,16 +582,16 @@ ktime_t hrtimer_get_next_event(void)
 void hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
 		  enum hrtimer_mode mode)
 {
-	struct hrtimer_base *bases;
+	struct hrtimer_cpu_base *cpu_base;
 
 	memset(timer, 0, sizeof(struct hrtimer));
 
-	bases = __raw_get_cpu_var(hrtimer_bases);
+	cpu_base = &__raw_get_cpu_var(hrtimer_bases);
 
 	if (clock_id == CLOCK_REALTIME && mode != HRTIMER_ABS)
 		clock_id = CLOCK_MONOTONIC;
 
-	timer->base = &bases[clock_id];
+	timer->base = &cpu_base->clock_base[clock_id];
 	rb_set_parent(&timer->node, &timer->node);
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
@@ -596,10 +606,10 @@ EXPORT_SYMBOL_GPL(hrtimer_init);
  */
 int hrtimer_get_res(const clockid_t which_clock, struct timespec *tp)
 {
-	struct hrtimer_base *bases;
+	struct hrtimer_cpu_base *cpu_base;
 
-	bases = __raw_get_cpu_var(hrtimer_bases);
-	*tp = ktime_to_timespec(bases[which_clock].resolution);
+	cpu_base = &__raw_get_cpu_var(hrtimer_bases);
+	*tp = ktime_to_timespec(cpu_base->clock_base[which_clock].resolution);
 
 	return 0;
 }
@@ -608,9 +618,11 @@ EXPORT_SYMBOL_GPL(hrtimer_get_res);
 /*
  * Expire the per base hrtimer-queue:
  */
-static inline void run_hrtimer_queue(struct hrtimer_base *base)
+static inline void run_hrtimer_queue(struct hrtimer_cpu_base *cpu_base,
+				     int index)
 {
 	struct rb_node *node;
+	struct hrtimer_clock_base *base = &cpu_base->clock_base[index];
 
 	if (!base->first)
 		return;
@@ -618,7 +630,7 @@ static inline void run_hrtimer_queue(str
 	if (base->get_softirq_time)
 		base->softirq_time = base->get_softirq_time();
 
-	spin_lock_irq(&base->lock);
+	spin_lock_irq(&cpu_base->lock);
 
 	while ((node = base->first)) {
 		struct hrtimer *timer;
@@ -630,21 +642,21 @@ static inline void run_hrtimer_queue(str
 			break;
 
 		fn = timer->function;
-		set_curr_timer(base, timer);
+		set_curr_timer(cpu_base, timer);
 		__remove_hrtimer(timer, base);
-		spin_unlock_irq(&base->lock);
+		spin_unlock_irq(&cpu_base->lock);
 
 		restart = fn(timer);
 
-		spin_lock_irq(&base->lock);
+		spin_lock_irq(&cpu_base->lock);
 
 		if (restart != HRTIMER_NORESTART) {
 			BUG_ON(hrtimer_active(timer));
 			enqueue_hrtimer(timer, base);
 		}
 	}
-	set_curr_timer(base, NULL);
-	spin_unlock_irq(&base->lock);
+	set_curr_timer(cpu_base, NULL);
+	spin_unlock_irq(&cpu_base->lock);
 }
 
 /*
@@ -652,13 +664,13 @@ static inline void run_hrtimer_queue(str
  */
 void hrtimer_run_queues(void)
 {
-	struct hrtimer_base *base = __get_cpu_var(hrtimer_bases);
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
 	int i;
 
-	hrtimer_get_softirq_time(base);
+	hrtimer_get_softirq_time(cpu_base);
 
-	for (i = 0; i < MAX_HRTIMER_BASES; i++)
-		run_hrtimer_queue(&base[i]);
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+		run_hrtimer_queue(cpu_base, i);
 }
 
 /*
@@ -787,19 +799,21 @@ sys_nanosleep(struct timespec __user *rq
  */
 static void __devinit init_hrtimers_cpu(int cpu)
 {
-	struct hrtimer_base *base = per_cpu(hrtimer_bases, cpu);
+	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
 	int i;
 
-	for (i = 0; i < MAX_HRTIMER_BASES; i++, base++) {
-		spin_lock_init(&base->lock);
-		lockdep_set_class(&base->lock, &base->lock_key);
-	}
+	spin_lock_init(&cpu_base->lock);
+	lockdep_set_class(&cpu_base->lock, &cpu_base->lock_key);
+
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+		cpu_base->clock_base[i].cpu_base = cpu_base;
+
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static void migrate_hrtimer_list(struct hrtimer_base *old_base,
-				struct hrtimer_base *new_base)
+static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
+				struct hrtimer_clock_base *new_base)
 {
 	struct hrtimer *timer;
 	struct rb_node *node;
@@ -814,29 +828,26 @@ static void migrate_hrtimer_list(struct 
 
 static void migrate_hrtimers(int cpu)
 {
-	struct hrtimer_base *old_base, *new_base;
+	struct hrtimer_cpu_base *old_base, *new_base;
 	int i;
 
 	BUG_ON(cpu_online(cpu));
-	old_base = per_cpu(hrtimer_bases, cpu);
-	new_base = get_cpu_var(hrtimer_bases);
+	old_base = &per_cpu(hrtimer_bases, cpu);
+	new_base = &get_cpu_var(hrtimer_bases);
 
 	local_irq_disable();
 
-	for (i = 0; i < MAX_HRTIMER_BASES; i++) {
-
-		spin_lock(&new_base->lock);
-		spin_lock(&old_base->lock);
+	spin_lock(&new_base->lock);
+	spin_lock(&old_base->lock);
 
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
 		BUG_ON(old_base->curr_timer);
 
-		migrate_hrtimer_list(old_base, new_base);
-
-		spin_unlock(&old_base->lock);
-		spin_unlock(&new_base->lock);
-		old_base++;
-		new_base++;
+		migrate_hrtimer_list(&old_base->clock_base[i],
+				     &new_base->clock_base[i]);
 	}
+	spin_unlock(&old_base->lock);
+	spin_unlock(&new_base->lock);
 
 	local_irq_enable();
 	put_cpu_var(hrtimer_bases);

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 11/23] hrtimers: state tracking
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (9 preceding siblings ...)
  2006-09-29 23:58 ` [patch 10/23] hrtimers: clean up locking Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:37   ` Andrew Morton
  2006-09-29 23:58 ` [patch 12/23] hrtimers: clean up callback tracking Thomas Gleixner
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-change-state-tracking.patch --]
[-- Type: text/plain, Size: 4248 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

reintroduce ktimers feature "optimized away" by the ktimers
review process: multiple hrtimer states to enable the running
of hrtimers without holding the cpu-base-lock.

(the "optimized" rbtree hack carried only 2 states worth of
information and we need 3.)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h |    7 +++++--
 kernel/hrtimer.c        |   17 ++++++++++-------
 2 files changed, 15 insertions(+), 9 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-09-30 01:41:17.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-09-30 01:41:17.000000000 +0200
@@ -34,7 +34,9 @@ enum hrtimer_restart {
 	HRTIMER_RESTART,
 };
 
-#define HRTIMER_INACTIVE	((void *)1UL)
+#define HRTIMER_INACTIVE	0x00
+#define HRTIMER_ACTIVE		0x01
+#define HRTIMER_CALLBACK	0x02
 
 struct hrtimer_clock_base;
 
@@ -54,6 +56,7 @@ struct hrtimer {
 	ktime_t				expires;
 	int				(*function)(struct hrtimer *);
 	struct hrtimer_clock_base	*base;
+	unsigned long			state;
 };
 
 /**
@@ -139,7 +142,7 @@ extern ktime_t hrtimer_get_next_event(vo
 
 static inline int hrtimer_active(const struct hrtimer *timer)
 {
-	return rb_parent(&timer->node) != &timer->node;
+	return timer->state != HRTIMER_INACTIVE;
 }
 
 /* Forward a hrtimer so it expires after now: */
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-09-30 01:41:17.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-09-30 01:41:17.000000000 +0200
@@ -384,6 +384,7 @@ static void enqueue_hrtimer(struct hrtim
 	 */
 	rb_link_node(&timer->node, parent, link);
 	rb_insert_color(&timer->node, &base->active);
+	timer->state |= HRTIMER_ACTIVE;
 
 	if (!base->first || timer->expires.tv64 <
 	    rb_entry(base->first, struct hrtimer, node)->expires.tv64)
@@ -396,7 +397,8 @@ static void enqueue_hrtimer(struct hrtim
  * Caller must hold the base lock.
  */
 static void __remove_hrtimer(struct hrtimer *timer,
-			     struct hrtimer_clock_base *base)
+			     struct hrtimer_clock_base *base,
+			     unsigned long newstate)
 {
 	/*
 	 * Remove the timer from the rbtree and replace the
@@ -405,7 +407,7 @@ static void __remove_hrtimer(struct hrti
 	if (base->first == &timer->node)
 		base->first = rb_next(&timer->node);
 	rb_erase(&timer->node, &base->active);
-	rb_set_parent(&timer->node, &timer->node);
+	timer->state = newstate;
 }
 
 /*
@@ -415,7 +417,7 @@ static inline int
 remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
 	if (hrtimer_active(timer)) {
-		__remove_hrtimer(timer, base);
+		__remove_hrtimer(timer, base, HRTIMER_INACTIVE);
 		return 1;
 	}
 	return 0;
@@ -487,7 +489,7 @@ int hrtimer_try_to_cancel(struct hrtimer
 
 	base = lock_hrtimer_base(timer, &flags);
 
-	if (base->cpu_base->curr_timer != timer)
+	if (!(timer->state & HRTIMER_CALLBACK))
 		ret = remove_hrtimer(timer, base);
 
 	unlock_hrtimer_base(timer, &flags);
@@ -592,7 +594,6 @@ void hrtimer_init(struct hrtimer *timer,
 		clock_id = CLOCK_MONOTONIC;
 
 	timer->base = &cpu_base->clock_base[clock_id];
-	rb_set_parent(&timer->node, &timer->node);
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
 
@@ -643,13 +644,14 @@ static inline void run_hrtimer_queue(str
 
 		fn = timer->function;
 		set_curr_timer(cpu_base, timer);
-		__remove_hrtimer(timer, base);
+		__remove_hrtimer(timer, base, HRTIMER_CALLBACK);
 		spin_unlock_irq(&cpu_base->lock);
 
 		restart = fn(timer);
 
 		spin_lock_irq(&cpu_base->lock);
 
+		timer->state &= ~HRTIMER_CALLBACK;
 		if (restart != HRTIMER_NORESTART) {
 			BUG_ON(hrtimer_active(timer));
 			enqueue_hrtimer(timer, base);
@@ -820,7 +822,8 @@ static void migrate_hrtimer_list(struct 
 
 	while ((node = rb_first(&old_base->active))) {
 		timer = rb_entry(node, struct hrtimer, node);
-		__remove_hrtimer(timer, old_base);
+		BUG_ON(timer->state & HRTIMER_CALLBACK);
+		__remove_hrtimer(timer, old_base, HRTIMER_INACTIVE);
 		timer->base = new_base;
 		enqueue_hrtimer(timer, new_base);
 	}

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 12/23] hrtimers: clean up callback tracking
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (10 preceding siblings ...)
  2006-09-29 23:58 ` [patch 11/23] hrtimers: state tracking Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-29 23:58 ` [patch 13/23] clockevents: core Thomas Gleixner
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-change-callback-tracking.patch --]
[-- Type: text/plain, Size: 2747 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

reintroduce ktimers feature "optimized away" by the ktimers
review process: remove the curr_timer pointer from the cpu-base
and use the hrtimer state.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h |    1 -
 kernel/hrtimer.c        |   10 +---------
 2 files changed, 1 insertion(+), 10 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-09-30 01:41:17.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-09-30 01:41:17.000000000 +0200
@@ -108,7 +108,6 @@ struct hrtimer_cpu_base {
 	spinlock_t			lock;
 	struct lock_class_key		lock_key;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
-	struct hrtimer			*curr_timer;
 };
 
 /*
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-09-30 01:41:17.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-09-30 01:41:17.000000000 +0200
@@ -149,8 +149,6 @@ static void hrtimer_get_softirq_time(str
  */
 #ifdef CONFIG_SMP
 
-#define set_curr_timer(b, t)		do { (b)->curr_timer = (t); } while (0)
-
 /*
  * We are using hashed locking: holding per_cpu(hrtimer_bases)[n].lock
  * means that all timers which are tied to this base via timer->base are
@@ -204,7 +202,7 @@ switch_hrtimer_base(struct hrtimer *time
 		 * completed. There is no conflict as we hold the lock until
 		 * the timer is enqueued.
 		 */
-		if (unlikely(base->cpu_base->curr_timer == timer))
+		if (unlikely(timer->state & HRTIMER_CALLBACK))
 			return base;
 
 		/* See the comment in lock_timer_base() */
@@ -218,8 +216,6 @@ switch_hrtimer_base(struct hrtimer *time
 
 #else /* CONFIG_SMP */
 
-#define set_curr_timer(b, t)		do { } while (0)
-
 static inline struct hrtimer_clock_base *
 lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 {
@@ -643,7 +639,6 @@ static inline void run_hrtimer_queue(str
 			break;
 
 		fn = timer->function;
-		set_curr_timer(cpu_base, timer);
 		__remove_hrtimer(timer, base, HRTIMER_CALLBACK);
 		spin_unlock_irq(&cpu_base->lock);
 
@@ -657,7 +652,6 @@ static inline void run_hrtimer_queue(str
 			enqueue_hrtimer(timer, base);
 		}
 	}
-	set_curr_timer(cpu_base, NULL);
 	spin_unlock_irq(&cpu_base->lock);
 }
 
@@ -844,8 +838,6 @@ static void migrate_hrtimers(int cpu)
 	spin_lock(&old_base->lock);
 
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		BUG_ON(old_base->curr_timer);
-
 		migrate_hrtimer_list(&old_base->clock_base[i],
 				     &new_base->clock_base[i]);
 	}

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 13/23] clockevents: core
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (11 preceding siblings ...)
  2006-09-29 23:58 ` [patch 12/23] hrtimers: clean up callback tracking Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:39   ` Andrew Morton
  2006-09-29 23:58 ` [patch 14/23] clockevents: drivers for i386 Thomas Gleixner
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: clockevents-base.patch --]
[-- Type: text/plain, Size: 20165 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

We have two types of clock event devices:
- global events (one device per system)
- local events (one device per cpu)

We assign the various time(r) related interrupts to those devices:

- global tick
- profiling (per cpu)
- next timer events (per cpu)

architectures register their clockevent sources, with specific capability
masks set, and the generic high-res-timers code picks the best one,
without the architecture having to worry about that.

here are the capabilities a clockevent driver can register:

 #define CLOCK_CAP_TICK		0x000001
 #define CLOCK_CAP_UPDATE	0x000002
 #define CLOCK_CAP_PROFILE	0x000004
 #define CLOCK_CAP_NEXTEVT	0x000008

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/clockchips.h |  104 ++++++++
 include/linux/hrtimer.h    |    3 
 init/main.c                |    2 
 kernel/hrtimer.c           |    6 
 kernel/time/Makefile       |    2 
 kernel/time/clockevents.c  |  527 +++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 642 insertions(+), 2 deletions(-)

Index: linux-2.6.18-mm2/include/linux/clockchips.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/include/linux/clockchips.h	2006-09-30 01:41:17.000000000 +0200
@@ -0,0 +1,104 @@
+/*  linux/include/linux/clockchips.h
+ *
+ *  This file contains the structure definitions for clockchips.
+ *
+ *  If you are not a clockchip, or the time of day code, you should
+ *  not be including this file!
+ */
+#ifndef _LINUX_CLOCKCHIPS_H
+#define _LINUX_CLOCKCHIPS_H
+
+#include <linux/config.h>
+
+#ifdef CONFIG_GENERIC_TIME
+
+#include <linux/clocksource.h>
+#include <linux/interrupt.h>
+
+/* Clock event mode commands */
+enum {
+	CLOCK_EVT_PERIODIC,
+	CLOCK_EVT_ONESHOT,
+	CLOCK_EVT_SHUTDOWN,
+};
+
+/* Clock event capability flags */
+#define CLOCK_CAP_TICK		0x000001
+#define CLOCK_CAP_UPDATE	0x000002
+#ifndef CONFIG_PROFILE_NMI
+# define CLOCK_CAP_PROFILE	0x000004
+#else
+# define CLOCK_CAP_PROFILE	0x000000
+#endif
+#ifdef CONFIG_HIGH_RES_TIMERS
+# define CLOCK_CAP_NEXTEVT	0x000008
+#else
+# define CLOCK_CAP_NEXTEVT	0x000000
+#endif
+
+#define CLOCK_BASE_CAPS_MASK	(CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | \
+				 CLOCK_CAP_UPDATE)
+#define CLOCK_CAPS_MASK		(CLOCK_BASE_CAPS_MASK | CLOCK_CAP_NEXTEVT)
+
+struct clock_event;
+
+/**
+ * struct clock_event - clock event descriptor
+ *
+ * @name:		ptr to clock event name
+ * @capabilities:	capabilities of the event chip
+ * @max_delta_ns:	maximum delta value in ns
+ * @min_delta_ns:	minimum delta value in ns
+ * @mult:		nanosecond to cycles multiplier
+ * @shift:		nanoseconds to cycles divisor (power of two)
+ * @set_next_event:	set next event
+ * @set_mode:		set mode function
+ * @suspend:		suspend function (optional)
+ * @resume:		resume function (optional)
+ * @evthandler:		Assigned by the framework to be called by the low
+ *			level handler of the event source
+ */
+struct clock_event {
+	const char	*name;
+	unsigned int	capabilities;
+	unsigned long	max_delta_ns;
+	unsigned long	min_delta_ns;
+	unsigned long	mult;
+	int		shift;
+	void		(*set_next_event)(unsigned long evt,
+					  struct clock_event *);
+	void		(*set_mode)(int mode, struct clock_event *);
+	int		(*suspend)(struct clock_event *);
+	int		(*resume)(struct clock_event *);
+	void		(*event_handler)(struct pt_regs *regs);
+};
+
+/*
+ * Calculate a multiplication factor
+ */
+static inline unsigned long div_sc(unsigned long a, unsigned long b,
+				   int shift)
+{
+	uint64_t tmp = ((uint64_t)a) << shift;
+	do_div(tmp, b);
+	return (unsigned long) tmp;
+}
+
+/* Clock event layer functions */
+extern int register_local_clockevent(struct clock_event *);
+extern int register_global_clockevent(struct clock_event *);
+extern unsigned long clockevent_delta2ns(unsigned long latch,
+					 struct clock_event *evt);
+extern void clockevents_init(void);
+
+extern int clockevents_init_next_event(void);
+extern int clockevents_set_next_event(ktime_t expires, int force);
+extern int clockevents_next_event_available(void);
+extern void clockevents_resume_events(void);
+
+#else
+# define clockevents_init()		do { } while(0)
+# define clockevents_resume_events()	do { } while(0)
+#endif
+
+#endif
Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-09-30 01:41:17.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-09-30 01:41:17.000000000 +0200
@@ -116,6 +116,9 @@ struct hrtimer_cpu_base {
  * is expired in the next softirq when the clock was advanced.
  */
 #define clock_was_set()		do { } while (0)
+#define hrtimer_clock_notify()	do { } while (0)
+extern ktime_t ktime_get(void);
+extern ktime_t ktime_get_real(void);
 
 /* Exported timer functions: */
 
Index: linux-2.6.18-mm2/init/main.c
===================================================================
--- linux-2.6.18-mm2.orig/init/main.c	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/init/main.c	2006-09-30 01:41:17.000000000 +0200
@@ -36,6 +36,7 @@
 #include <linux/moduleparam.h>
 #include <linux/kallsyms.h>
 #include <linux/writeback.h>
+#include <linux/clockchips.h>
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
 #include <linux/efi.h>
@@ -529,6 +530,7 @@ asmlinkage void __init start_kernel(void
 	rcu_init();
 	init_IRQ();
 	pidhash_init();
+	clockevents_init();
 	init_timers();
 	hrtimers_init();
 	softirq_init();
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-09-30 01:41:17.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-09-30 01:41:17.000000000 +0200
@@ -30,6 +30,7 @@
  *  For licencing details see kernel-base/COPYING
  */
 
+#include <linux/clockchips.h>
 #include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/percpu.h>
@@ -45,7 +46,7 @@
  *
  * returns the time in ktime_t format
  */
-static ktime_t ktime_get(void)
+ktime_t ktime_get(void)
 {
 	struct timespec now;
 
@@ -59,7 +60,7 @@ static ktime_t ktime_get(void)
  *
  * returns the time in ktime_t format
  */
-static ktime_t ktime_get_real(void)
+ktime_t ktime_get_real(void)
 {
 	struct timespec now;
 
@@ -292,6 +293,7 @@ static unsigned long ktime_divns(const k
  */
 void hrtimer_notify_resume(void)
 {
+	clockevents_resume_events();
 	clock_was_set();
 }
 
Index: linux-2.6.18-mm2/kernel/time/Makefile
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time/Makefile	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time/Makefile	2006-09-30 01:41:17.000000000 +0200
@@ -1 +1,3 @@
 obj-y += ntp.o clocksource.o jiffies.o
+
+obj-$(CONFIG_GENERIC_TIME) += clockevents.o
Index: linux-2.6.18-mm2/kernel/time/clockevents.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/kernel/time/clockevents.c	2006-09-30 01:41:17.000000000 +0200
@@ -0,0 +1,527 @@
+/*
+ * linux/kernel/time/clockevents.c
+ *
+ * This file contains functions which manage clock event drivers.
+ *
+ * Copyright(C) 2005-2006, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2005-2006, Red Hat, Inc., Ingo Molnar
+ *
+ * We have two types of clock event devices:
+ * - global events (one device per system)
+ * - local events (one device per cpu)
+ *
+ * We assign the various time(r) related interrupts to those devices
+ *
+ * - global tick
+ * - profiling (per cpu)
+ * - next timer events (per cpu)
+ *
+ * TODO:
+ * - implement variable frequency profiling
+ *
+ * This code is licenced under the GPL version 2. For details see
+ * kernel-base/COPYING.
+ */
+
+#include <linux/clockchips.h>
+#include <linux/cpu.h>
+#include <linux/irq.h>
+#include <linux/init.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/profile.h>
+#include <linux/sysdev.h>
+#include <linux/hrtimer.h>
+
+#define MAX_CLOCK_EVENTS	4
+#define GLOBAL_CLOCK_EVENT	MAX_CLOCK_EVENTS
+
+struct event_descr {
+	struct clock_event *event;
+	unsigned int mode;
+	unsigned int real_caps;
+	struct irqaction action;
+};
+
+struct local_events {
+	int installed;
+	struct event_descr events[MAX_CLOCK_EVENTS];
+	struct clock_event *nextevt;
+};
+
+/* Variables related to the global event source */
+static __read_mostly struct event_descr global_eventsource;
+
+/* Variables related to the per cpu local event sources */
+static DEFINE_PER_CPU(struct local_events, local_eventsources);
+
+/* lock to protect the above */
+static DEFINE_SPINLOCK(events_lock);
+
+/*
+ * Math helper. Convert a latch value to ns
+ */
+unsigned long clockevent_delta2ns(unsigned long latch, struct clock_event *evt)
+{
+	u64 clc = ((u64) latch << evt->shift);
+
+	do_div(clc, evt->mult);
+	if (clc < KTIME_MONOTONIC_RES.tv64)
+		clc = KTIME_MONOTONIC_RES.tv64;
+	if (clc > LONG_MAX)
+		clc = LONG_MAX;
+
+	return (unsigned long) clc;
+}
+
+/*
+ * Bootup and lowres handler: ticks only
+ */
+static void handle_tick(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+}
+
+/*
+ * Bootup and lowres handler: ticks and update_process_times
+ */
+static void handle_tick_update(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+
+	update_process_times(user_mode(regs));
+}
+
+/*
+ * Bootup and lowres handler: ticks and profileing
+ */
+static void handle_tick_profile(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+
+	profile_tick(CPU_PROFILING, regs);
+}
+
+/*
+ * Bootup and lowres handler: ticks, update_process_times and profiling
+ */
+static void handle_tick_update_profile(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+
+	update_process_times(user_mode(regs));
+	profile_tick(CPU_PROFILING, regs);
+}
+
+/*
+ * Bootup and lowres handler: update_process_times
+ */
+static void handle_update(struct pt_regs *regs)
+{
+	update_process_times(user_mode(regs));
+}
+
+/*
+ * Bootup and lowres handler: update_process_times and profiling
+ */
+static void handle_update_profile(struct pt_regs *regs)
+{
+	update_process_times(user_mode(regs));
+	profile_tick(CPU_PROFILING, regs);
+}
+
+/*
+ * Bootup and lowres handler: profiling
+ */
+static void handle_profile(struct pt_regs *regs)
+{
+	profile_tick(CPU_PROFILING, regs);
+}
+
+/*
+ * Noop handler when we shut down an event source
+ */
+static void handle_noop(struct pt_regs *regs)
+{
+}
+
+/*
+ * Lookup table for bootup and lowres event assignment
+ */
+static void __read_mostly *event_handlers[] = {
+	handle_noop,			/* 0: No capability selected */
+	handle_tick,			/* 1: Tick only	*/
+	handle_update,			/* 2: Update process times */
+	handle_tick_update,		/* 3: Tick + update process times */
+	handle_profile,			/* 4: Profiling int */
+	handle_tick_profile,		/* 5: Tick + Profiling int */
+	handle_update_profile,		/* 6: Update process times +
+					      profiling */
+	handle_tick_update_profile,	/* 7: Tick + update process times +
+					      profiling */
+#ifdef CONFIG_HIGH_RES_TIMERS
+	hrtimer_interrupt,		/* 8: Reprogrammable event source */
+#endif
+};
+
+/*
+ * Start up an event source
+ */
+static void startup_event(struct clock_event *evt, unsigned int caps)
+{
+	int mode;
+
+	if (caps == CLOCK_CAP_NEXTEVT)
+		mode = CLOCK_EVT_ONESHOT;
+	else
+		mode = CLOCK_EVT_PERIODIC;
+
+	evt->set_mode(mode, evt);
+}
+
+/*
+ * Setup an event source. Assign an handler and start it up
+ * When the event source has no own interrupt handler we setup
+ * the interrupt too.
+ */
+static void setup_event(struct event_descr *descr, struct clock_event *evt,
+			unsigned int caps)
+{
+	void *handler = event_handlers[caps];
+
+	/* Set the event handler */
+	evt->event_handler = handler;
+
+	/* Store all relevant information */
+	descr->real_caps = caps;
+
+	startup_event(evt, caps);
+
+	printk(KERN_INFO "Event source %s configured with caps set: "
+	       "%02x\n", evt->name, descr->real_caps);
+}
+
+/**
+ * register_global_clockevent - register the device which generates
+ *			     global clock events
+ *
+ * @evt:	The device which generates global clock events (ticks)
+ *
+ * This can be a device which is only necessary for bootup. On UP systems this
+ * might be the only event source which is used for everything including
+ * high resolution events.
+ *
+ * When a cpu local event source is installed the global event source is
+ * switched off in the high resolution timer / tickless mode.
+ */
+int __init register_global_clockevent(struct clock_event *evt)
+{
+	/* Already installed? */
+	if (global_eventsource.event) {
+		printk(KERN_ERR "Global clock event source already installed: "
+		       "%s. Ignoring new global eventsoruce %s\n",
+		       global_eventsource.event->name,
+		       evt->name);
+		return -EBUSY;
+	}
+
+	/* Preset the handler in any case */
+	evt->event_handler = handle_noop;
+
+	/*
+	 * Check, whether it is a valid global event source
+	 */
+	if (!(evt->capabilities & CLOCK_BASE_CAPS_MASK)) {
+		printk(KERN_ERR "Unsupported event source %s\n", evt->name);
+		return -EINVAL;
+	}
+
+	/* Mask out high resolution capabilities for now */
+	global_eventsource.event = evt;
+	setup_event(&global_eventsource, evt,
+		    evt->capabilities & CLOCK_BASE_CAPS_MASK);
+	return 0;
+}
+
+/*
+ * Mask out the functionality which is covered by the new event source
+ * and assign a new event handler.
+ */
+static void recalc_active_event(struct event_descr *descr,
+				unsigned int newcaps)
+{
+	unsigned int caps;
+
+	if (!descr->real_caps)
+		return;
+
+	/* Mask the overlapping bits */
+	caps = descr->real_caps & ~newcaps;
+
+	/* Assign the new event handler */
+	if (caps) {
+		descr->event->event_handler = event_handlers[caps];
+		printk(KERN_INFO "Event source %s new caps set: %02x\n" ,
+		       descr->event->name, caps);
+	} else {
+		descr->event->event_handler = handle_noop;
+
+		if (descr->event->set_mode)
+			descr->event->set_mode(CLOCK_EVT_SHUTDOWN,
+					       descr->event);
+
+		printk(KERN_INFO "Event source %s disabled\n" ,
+		       descr->event->name);
+	}
+	descr->real_caps = caps;
+}
+
+/*
+ * Recalc the events and reassign the handlers if necessary
+ */
+static int recalc_events(struct local_events *sources, struct clock_event *evt,
+			 unsigned int caps, int new)
+{
+	int i;
+
+	if (new && sources->installed == MAX_CLOCK_EVENTS)
+		return -ENOSPC;
+
+	/*
+	 * If there is no handler and this is not a next-event capable
+	 * event source, refuse to handle it
+	 */
+	if (!evt->capabilities & CLOCK_CAP_NEXTEVT && !event_handlers[caps]) {
+		printk(KERN_ERR "Unsupported event source %s\n", evt->name);
+		return -EINVAL;
+	}
+
+	if (caps && global_eventsource.event && global_eventsource.event != evt)
+		recalc_active_event(&global_eventsource, caps);
+
+	for (i = 0; i < sources->installed; i++) {
+		if (sources->events[i].event != evt)
+			recalc_active_event(&sources->events[i], caps);
+	}
+
+	if (new)
+		sources->events[sources->installed++].event = evt;
+
+	if (caps) {
+		/* Is next_event event source going to be installed? */
+		if (caps & CLOCK_CAP_NEXTEVT)
+			caps = CLOCK_CAP_NEXTEVT;
+
+		setup_event(&sources->events[sources->installed],
+			    evt, caps);
+	} else
+		printk(KERN_INFO "Inactive event source %s registered\n",
+		       evt->name);
+
+	return 0;
+}
+
+/**
+ * register_local_clockevent - Set up a cpu local clock event device
+ *
+ * @evt:	event device to be registered
+ */
+int register_local_clockevent(struct clock_event *evt)
+{
+	struct local_events *sources = &__get_cpu_var(local_eventsources);
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&events_lock, flags);
+
+	/* Preset the handler in any case */
+	evt->event_handler = handle_noop;
+
+	/* Recalc event sources and maybe reassign handlers */
+	ret = recalc_events(sources, evt,
+			    evt->capabilities & CLOCK_BASE_CAPS_MASK, 1);
+
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	/*
+	 * Trigger hrtimers, when the event source is next-event
+	 * capable
+	 */
+	if (!ret && (evt->capabilities & CLOCK_CAP_NEXTEVT))
+		hrtimer_clock_notify();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(register_local_clockevent);
+
+/*
+ * Find a next-event capable event source
+ */
+static int get_next_event_source(void)
+{
+	struct local_events *sources = &__get_cpu_var(local_eventsources);
+	int i;
+
+	for (i = 0; i < sources->installed; i++) {
+		struct clock_event *evt;
+
+		evt = sources->events[i].event;
+		if (evt->capabilities & CLOCK_CAP_NEXTEVT)
+			return i;
+	}
+
+#ifndef CONFIG_SMP
+	if (global_eventsource.event->capabilities & CLOCK_CAP_NEXTEVT)
+		return GLOBAL_CLOCK_EVENT;
+#endif
+	return -ENODEV;
+}
+
+/**
+ * clockevents_next_event_available - Check for a installed next-event source
+ */
+int clockevents_next_event_available(void)
+{
+	unsigned long flags;
+	int idx;
+
+	spin_lock_irqsave(&events_lock, flags);
+	idx = get_next_event_source();
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	return idx < 0 ? 0 : 1;
+}
+
+int clockevents_init_next_event(void)
+{
+	struct local_events *sources = &__get_cpu_var(local_eventsources);
+	struct clock_event *nextevt;
+	unsigned long flags;
+	int idx, ret = -ENODEV;
+
+	if (sources->nextevt)
+		return -EBUSY;
+
+	spin_lock_irqsave(&events_lock, flags);
+
+	idx = get_next_event_source();
+	if (idx < 0)
+		goto out_unlock;
+
+	if (idx == GLOBAL_CLOCK_EVENT)
+		nextevt = global_eventsource.event;
+	else
+		nextevt = sources->events[idx].event;
+
+	ret = recalc_events(sources, nextevt, CLOCK_CAPS_MASK, 0);
+	if (!ret)
+		sources->nextevt = nextevt;
+ out_unlock:
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	return ret;
+}
+
+int clockevents_set_next_event(ktime_t expires, int force)
+{
+	struct local_events *sources = &__get_cpu_var(local_eventsources);
+	int64_t delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
+	struct clock_event *nextevt = sources->nextevt;
+	unsigned long long clc;
+
+	if (delta <= 0 && !force)
+		return -ETIME;
+
+	if (delta > nextevt->max_delta_ns)
+		delta = nextevt->max_delta_ns;
+	if (delta < nextevt->min_delta_ns)
+		delta = nextevt->min_delta_ns;
+
+	clc = delta * nextevt->mult;
+	clc >>= nextevt->shift;
+	nextevt->set_next_event((unsigned long)clc, sources->nextevt);
+
+	return 0;
+}
+
+/*
+ * Resume the cpu local clock events
+ */
+static void clockevents_resume_local_events(void *arg)
+{
+	struct local_events *sources = &__get_cpu_var(local_eventsources);
+	int i;
+
+	for (i = 0; i < sources->installed; i++) {
+		if (sources->events[i].real_caps)
+			startup_event(sources->events[i].event,
+				      sources->events[i].real_caps);
+	}
+}
+
+/*
+ * Called after timekeeping is functional again
+ */
+void clockevents_resume_events(void)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	/* Resume global event source */
+	if (global_eventsource.real_caps)
+		startup_event(global_eventsource.event,
+			      global_eventsource.real_caps);
+
+	clockevents_resume_local_events(NULL);
+	local_irq_restore(flags);
+
+	touch_softlockup_watchdog();
+
+	if (smp_call_function(clockevents_resume_local_events, NULL, 1, 1))
+		BUG();
+
+}
+
+/*
+ * Functions related to initialization and hotplug
+ */
+static int clockevents_cpu_notify(struct notifier_block *self,
+				  unsigned long action, void *hcpu)
+{
+	switch(action) {
+	case CPU_UP_PREPARE:
+		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DEAD:
+		/*
+		 * Do something sensible here !
+		 * Disable the cpu local clocksources
+		 */
+		break;
+#endif
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __devinitdata clockevents_nb = {
+	.notifier_call	= clockevents_cpu_notify,
+};
+
+void __init clockevents_init(void)
+{
+	clockevents_cpu_notify(&clockevents_nb, (unsigned long)CPU_UP_PREPARE,
+				(void *)(long)smp_processor_id());
+	register_cpu_notifier(&clockevents_nb);
+}

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 14/23] clockevents: drivers for i386
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (12 preceding siblings ...)
  2006-09-29 23:58 ` [patch 13/23] clockevents: core Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:40   ` Andrew Morton
  2006-09-29 23:58 ` [patch 15/23] high-res timers: core Thomas Gleixner
                   ` (10 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: clockevents-i386.patch --]
[-- Type: text/plain, Size: 12263 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

add clockevent drivers for i386: lapic (local) and PIT (global).
Update the timer IRQ to call into the PIT driver's event handler
and the lapic-timer IRQ to call into the lapic clockevent driver.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 arch/i386/kernel/apic.c                  |   81 ++++++++++++++++++++++++-------
 arch/i386/kernel/i8253.c                 |   60 +++++++++++++++++++---
 arch/i386/kernel/time.c                  |   45 -----------------
 include/asm-i386/i8253.h                 |    1 
 include/asm-i386/mach-default/do_timer.h |    3 -
 5 files changed, 120 insertions(+), 70 deletions(-)

Index: linux-2.6.18-mm2/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/apic.c	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/apic.c	2006-09-30 01:41:18.000000000 +0200
@@ -25,6 +25,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/sysdev.h>
 #include <linux/cpu.h>
+#include <linux/clockchips.h>
 #include <linux/module.h>
 
 #include <asm/atomic.h>
@@ -70,6 +71,23 @@ static inline void lapic_enable(void)
  */
 int apic_verbosity;
 
+static unsigned int calibration_result;
+
+static void lapic_next_event(unsigned long delta, struct clock_event *evt);
+static void lapic_timer_setup(int mode, struct clock_event *evt);
+
+static struct clock_event lapic_clockevent = {
+	.name = "lapic",
+	.capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
+#ifdef CONFIG_SMP
+			| CLOCK_CAP_UPDATE
+#endif
+	,
+	.shift = 32,
+	.set_mode = lapic_timer_setup,
+	.set_next_event = lapic_next_event,
+};
+static DEFINE_PER_CPU(struct clock_event, lapic_events);
 
 static void apic_pm_activate(void);
 
@@ -919,6 +937,11 @@ fake_ioapic_page:
  */
 
 /*
+ * FIXME: Move this to i8253.h. There is no need to keep the access to
+ * the PIT scattered all around the place -tglx
+ */
+
+/*
  * The timer chip is already set up at HZ interrupts per second here,
  * but we do not accept timer interrupts yet. We only allow the BP
  * to calibrate.
@@ -976,13 +999,15 @@ void (*wait_timer_tick)(void) __devinitd
 
 #define APIC_DIVISOR 16
 
-static void __setup_APIC_LVTT(unsigned int clocks)
+static void __setup_APIC_LVTT(unsigned int clocks, int oneshot)
 {
 	unsigned int lvtt_value, tmp_value, ver;
 	int cpu = smp_processor_id();
 
 	ver = GET_APIC_VERSION(apic_read(APIC_LVR));
-	lvtt_value = APIC_LVT_TIMER_PERIODIC | LOCAL_TIMER_VECTOR;
+	lvtt_value = LOCAL_TIMER_VECTOR;
+	if (!oneshot)
+		lvtt_value |= APIC_LVT_TIMER_PERIODIC;
 	if (!APIC_INTEGRATED(ver))
 		lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
 
@@ -999,23 +1024,31 @@ static void __setup_APIC_LVTT(unsigned i
 				& ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE))
 				| APIC_TDR_DIV_16);
 
-	apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR);
+	if (!oneshot)
+		apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR);
+}
+
+static void lapic_next_event(unsigned long delta, struct clock_event *evt)
+{
+	apic_write_around(APIC_TMICT, delta);
 }
 
-static void __devinit setup_APIC_timer(unsigned int clocks)
+static void lapic_timer_setup(int mode, struct clock_event *evt)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
+	__setup_APIC_LVTT(calibration_result, mode != CLOCK_EVT_PERIODIC);
+	local_irq_restore(flags);
+}
 
-	/*
-	 * Wait for IRQ0's slice:
-	 */
-	wait_timer_tick();
+static void __devinit setup_APIC_timer(void)
+{
+	struct clock_event *levt = &__get_cpu_var(lapic_events);
 
-	__setup_APIC_LVTT(clocks);
+	memcpy(levt, &lapic_clockevent, sizeof(*levt));
 
-	local_irq_restore(flags);
+	register_local_clockevent(levt);
 }
 
 /*
@@ -1024,6 +1057,8 @@ static void __devinit setup_APIC_timer(u
  * to calibrate, since some later bootup code depends on getting
  * the first irq? Ugh.
  *
+ * TODO: Fix this rather than saying "Ugh" -tglx
+ *
  * We want to do the calibration only once since we
  * want to have local timer irqs syncron. CPUs connected
  * by the same APIC bus have the very same bus frequency.
@@ -1046,7 +1081,7 @@ static int __init calibrate_APIC_clock(v
 	 * value into the APIC clock, we just want to get the
 	 * counter running for calibration.
 	 */
-	__setup_APIC_LVTT(1000000000);
+	__setup_APIC_LVTT(1000000000, 0);
 
 	/*
 	 * The timer chip counts down to zero. Let's wait
@@ -1083,6 +1118,14 @@ static int __init calibrate_APIC_clock(v
 
 	result = (tt1-tt2)*APIC_DIVISOR/LOOPS;
 
+	/* Calculate the scaled math multiplication factor */
+	lapic_clockevent.mult = div_sc(tt1-tt2, TICK_NSEC * LOOPS, 32);
+	lapic_clockevent.max_delta_ns =
+		clockevent_delta2ns(0x7FFFFF, &lapic_clockevent);
+	printk("lapic max_delta_ns: %ld\n", lapic_clockevent.max_delta_ns);
+	lapic_clockevent.min_delta_ns =
+		clockevent_delta2ns(0xF, &lapic_clockevent);
+
 	if (cpu_has_tsc)
 		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
 			"%ld.%04ld MHz.\n",
@@ -1097,8 +1140,6 @@ static int __init calibrate_APIC_clock(v
 	return result;
 }
 
-static unsigned int calibration_result;
-
 void __init setup_boot_APIC_clock(void)
 {
 	unsigned long flags;
@@ -1111,14 +1152,14 @@ void __init setup_boot_APIC_clock(void)
 	/*
 	 * Now set up the timer for real.
 	 */
-	setup_APIC_timer(calibration_result);
+	setup_APIC_timer();
 
 	local_irq_restore(flags);
 }
 
 void __devinit setup_secondary_APIC_clock(void)
 {
-	setup_APIC_timer(calibration_result);
+	setup_APIC_timer();
 }
 
 void disable_APIC_timer(void)
@@ -1164,6 +1205,13 @@ void switch_APIC_timer_to_ipi(void *cpum
 	    !cpu_isset(cpu, timer_bcast_ipi)) {
 		disable_APIC_timer();
 		cpu_set(cpu, timer_bcast_ipi);
+#ifdef CONFIG_HIGH_RES_TIMERS
+		printk("Disabling NO_HZ and high resolution timers "
+		       "due to timer broadcasting\n");
+		for_each_possible_cpu(cpu)
+			per_cpu(lapic_events, cpu).capabilities &=
+				~CLOCK_CAP_NEXTEVT;
+#endif
 	}
 }
 EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
@@ -1224,6 +1272,7 @@ inline void smp_local_timer_interrupt(st
 fastcall void smp_apic_timer_interrupt(struct pt_regs *regs)
 {
 	int cpu = smp_processor_id();
+	struct clock_event *evt = &per_cpu(lapic_events, cpu);
 
 	/*
 	 * the NMI deadlock-detector uses this.
@@ -1241,7 +1290,7 @@ fastcall void smp_apic_timer_interrupt(s
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
 	irq_enter();
-	smp_local_timer_interrupt(regs);
+	evt->event_handler(regs);
 	irq_exit();
 }
 
Index: linux-2.6.18-mm2/arch/i386/kernel/i8253.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/i8253.c	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/i8253.c	2006-09-30 01:41:18.000000000 +0200
@@ -2,7 +2,7 @@
  * i8253.c  8253/PIT functions
  *
  */
-#include <linux/clocksource.h>
+#include <linux/clockchips.h>
 #include <linux/spinlock.h>
 #include <linux/jiffies.h>
 #include <linux/sysdev.h>
@@ -19,19 +19,63 @@
 DEFINE_SPINLOCK(i8253_lock);
 EXPORT_SYMBOL(i8253_lock);
 
-void setup_pit_timer(void)
+static void init_pit_timer(int mode, struct clock_event *evt)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&i8253_lock, flags);
+
+	switch(mode) {
+	case CLOCK_EVT_PERIODIC:
+		/* binary, mode 2, LSB/MSB, ch 0 */
+		outb_p(0x34, PIT_MODE);
+		udelay(10);
+		outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
+		outb(LATCH >> 8 , PIT_CH0);	/* MSB */
+		break;
+
+	case CLOCK_EVT_ONESHOT:
+	case CLOCK_EVT_SHUTDOWN:
+		/* One shot setup */
+		outb_p(0x38, PIT_MODE);
+		udelay(10);
+		break;
+	}
+	spin_unlock_irqrestore(&i8253_lock, flags);
+}
+
+static void pit_next_event(unsigned long delta, struct clock_event *evt)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&i8253_lock, flags);
-	outb_p(0x34,PIT_MODE);		/* binary, mode 2, LSB/MSB, ch 0 */
-	udelay(10);
-	outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
-	udelay(10);
-	outb(LATCH >> 8 , PIT_CH0);	/* MSB */
+	outb_p(delta & 0xff , PIT_CH0);	/* LSB */
+	outb(delta >> 8 , PIT_CH0);	/* MSB */
 	spin_unlock_irqrestore(&i8253_lock, flags);
 }
 
+struct clock_event pit_clockevent = {
+	.name		= "pit",
+	.capabilities	= CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | CLOCK_CAP_UPDATE
+#ifndef CONFIG_SMP
+			| CLOCK_CAP_NEXTEVT
+#endif
+	,
+	.set_mode	= init_pit_timer,
+	.set_next_event = pit_next_event,
+	.shift		= 32,
+};
+
+void setup_pit_timer(void)
+{
+	pit_clockevent.mult = div_sc(CLOCK_TICK_RATE, NSEC_PER_SEC, 32);
+	pit_clockevent.max_delta_ns =
+		clockevent_delta2ns(0x7FFF, &pit_clockevent);
+	pit_clockevent.min_delta_ns =
+		clockevent_delta2ns(0xF, &pit_clockevent);
+	register_global_clockevent(&pit_clockevent);
+}
+
 /*
  * Since the PIT overflows every tick, its not very useful
  * to just read by itself. So use jiffies to emulate a free
@@ -46,7 +90,7 @@ static cycle_t pit_read(void)
 	static u32 old_jifs;
 
 	spin_lock_irqsave(&i8253_lock, flags);
-        /*
+	/*
 	 * Although our caller may have the read side of xtime_lock,
 	 * this is now a seqlock, and we are cheating in this routine
 	 * by having side effects on state that we cannot undo if
Index: linux-2.6.18-mm2/arch/i386/kernel/time.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/time.c	2006-09-30 01:41:15.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/time.c	2006-09-30 01:41:18.000000000 +0200
@@ -163,15 +163,6 @@ EXPORT_SYMBOL(profile_pc);
  */
 irqreturn_t timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
 {
-	/*
-	 * Here we are in the timer irq handler. We just have irqs locally
-	 * disabled but we don't know if the timer_bh is running on the other
-	 * CPU. We need to avoid to SMP race with it. NOTE: we don' t need
-	 * the irq version of write_lock because as just said we have irq
-	 * locally disabled. -arca
-	 */
-	write_seqlock(&xtime_lock);
-
 #ifdef CONFIG_X86_IO_APIC
 	if (timer_ack) {
 		/*
@@ -190,7 +181,6 @@ irqreturn_t timer_interrupt(int irq, voi
 
 	do_timer_interrupt_hook(regs);
 
-
 	if (MCA_bus) {
 		/* The PS/2 uses level-triggered interrupts.  You can't
 		turn them off, nor would you want to (any attempt to
@@ -205,8 +195,6 @@ irqreturn_t timer_interrupt(int irq, voi
 		outb_p( irq|0x80, 0x61 );	/* reset the IRQ */
 	}
 
-	write_sequnlock(&xtime_lock);
-
 #ifdef CONFIG_X86_LOCAL_APIC
 	if (using_apic_timer)
 		smp_send_timer_broadcast_ipi(regs);
@@ -283,39 +271,6 @@ void notify_arch_cmos_timer(void)
 	mod_timer(&sync_cmos_timer, jiffies + 1);
 }
 
-static int timer_resume(struct sys_device *dev)
-{
-#ifdef CONFIG_HPET_TIMER
-	if (is_hpet_enabled())
-		hpet_reenable();
-#endif
-	setup_pit_timer();
-
-	return 0;
-}
-
-static struct sysdev_class timer_sysclass = {
-	.resume = timer_resume,
-	set_kset_name("timer"),
-};
-
-
-/* XXX this driverfs stuff should probably go elsewhere later -john */
-static struct sys_device device_timer = {
-	.id	= 0,
-	.cls	= &timer_sysclass,
-};
-
-static int time_init_device(void)
-{
-	int error = sysdev_class_register(&timer_sysclass);
-	if (!error)
-		error = sysdev_register(&device_timer);
-	return error;
-}
-
-device_initcall(time_init_device);
-
 #ifdef CONFIG_HPET_TIMER
 extern void (*late_time_init)(void);
 /* Duplicate of time_init() below, with hpet_enable part added */
Index: linux-2.6.18-mm2/include/asm-i386/i8253.h
===================================================================
--- linux-2.6.18-mm2.orig/include/asm-i386/i8253.h	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/include/asm-i386/i8253.h	2006-09-30 01:41:18.000000000 +0200
@@ -2,5 +2,6 @@
 #define __ASM_I8253_H__
 
 extern spinlock_t i8253_lock;
+extern struct clock_event pit_clockevent;
 
 #endif	/* __ASM_I8253_H__ */
Index: linux-2.6.18-mm2/include/asm-i386/mach-default/do_timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/asm-i386/mach-default/do_timer.h	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/include/asm-i386/mach-default/do_timer.h	2006-09-30 01:41:18.000000000 +0200
@@ -1,7 +1,8 @@
 /* defines for inline arch setup functions */
-
+#include <linux/clockchips.h>
 #include <asm/apic.h>
 #include <asm/i8259.h>
+#include <asm/i8253.h>
 
 /**
  * do_timer_interrupt_hook - hook into timer tick

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 15/23] high-res timers: core
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (13 preceding siblings ...)
  2006-09-29 23:58 ` [patch 14/23] clockevents: drivers for i386 Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:43   ` Andrew Morton
  2006-09-29 23:58 ` [patch 16/23] dynticks: core Thomas Gleixner
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-highres.patch --]
[-- Type: text/plain, Size: 29334 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

add the core bits of high-res timers support.

the design makes use of the existing hrtimers subsystem which manages a
per-CPU and per-clock tree of timers, and the clockevents framework, which
provides a standard API to request programmable clock events from. The
core code does not have to know about the clock details - it makes use
of clockevents_set_next_event().

the code also provides dyntick functionality: it is implemented via a
per-cpu sched_tick hrtimer that is set to HZ frequency, but which is
reprogrammed to a longer timeout before going idle, and reprogrammed to
HZ again once the CPU goes busy again. (If an non-timer IRQ hits the
idle task then it will process jiffies before calling the IRQ code.)

the impact to non-high-res architectures is intended to be minimal.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h   |   70 +++++
 include/linux/interrupt.h |    5 
 include/linux/ktime.h     |    3 
 kernel/hrtimer.c          |  576 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/itimer.c           |    2 
 kernel/posix-timers.c     |    2 
 kernel/time/Kconfig       |   22 +
 kernel/timer.c            |    1 
 8 files changed, 649 insertions(+), 32 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-09-30 01:41:17.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-09-30 01:41:18.000000000 +0200
@@ -17,6 +17,7 @@
 
 #include <linux/rbtree.h>
 #include <linux/ktime.h>
+#include <linux/timer.h>
 #include <linux/init.h>
 #include <linux/list.h>
 #include <linux/wait.h>
@@ -34,9 +35,17 @@ enum hrtimer_restart {
 	HRTIMER_RESTART,
 };
 
+enum hrtimer_cb_mode {
+	HRTIMER_CB_SOFTIRQ,
+	HRTIMER_CB_IRQSAFE,
+	HRTIMER_CB_IRQSAFE_NO_RESTART,
+	HRTIMER_CB_IRQSAFE_NO_SOFTIRQ,
+};
+
 #define HRTIMER_INACTIVE	0x00
 #define HRTIMER_ACTIVE		0x01
 #define HRTIMER_CALLBACK	0x02
+#define HRTIMER_PENDING		0x04
 
 struct hrtimer_clock_base;
 
@@ -49,6 +58,10 @@ struct hrtimer_clock_base;
  * @function:	timer expiry callback function
  * @base:	pointer to the timer base (per cpu and per clock)
  *
+ * @mode:	high resolution timer feature to allow executing the
+ *		callback in the hardirq context (wakeups)
+ * @cb_entry:	list head to enqueue an expired timer into the callback list
+ *
  * The hrtimer structure must be initialized by init_hrtimer_#CLOCKTYPE()
  */
 struct hrtimer {
@@ -57,6 +70,10 @@ struct hrtimer {
 	int				(*function)(struct hrtimer *);
 	struct hrtimer_clock_base	*base;
 	unsigned long			state;
+#ifdef CONFIG_HIGH_RES_TIMERS
+	int				cb_mode;
+	struct list_head		cb_entry;
+#endif
 };
 
 /**
@@ -83,6 +100,9 @@ struct hrtimer_cpu_base;
  * @get_time:		function to retrieve the current time of the clock
  * @get_softirq_time:	function to retrieve the current time from the softirq
  * @softirq_time:	the time when running the hrtimer queue in the softirq
+ * @cb_pending:		list of timers where the callback is pending
+ * @offset:		offset of this clock to the monotonic base
+ * @reprogram:		function to reprogram the timer event
  */
 struct hrtimer_clock_base {
 	struct hrtimer_cpu_base	*cpu_base;
@@ -93,6 +113,12 @@ struct hrtimer_clock_base {
 	ktime_t			(*get_time)(void);
 	ktime_t			(*get_softirq_time)(void);
 	ktime_t			softirq_time;
+#ifdef CONFIG_HIGH_RES_TIMERS
+	ktime_t			offset;
+	int			(*reprogram)(struct hrtimer *t,
+					     struct hrtimer_clock_base *b,
+					     ktime_t n);
+#endif
 };
 
 #define HRTIMER_MAX_CLOCK_BASES 2
@@ -108,17 +134,53 @@ struct hrtimer_cpu_base {
 	spinlock_t			lock;
 	struct lock_class_key		lock_key;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
+#ifdef CONFIG_HIGH_RES_TIMERS
+	ktime_t				expires_next;
+	int				hres_active;
+	unsigned long			check_clocks;
+	struct list_head		cb_pending;
+	struct hrtimer			sched_timer;
+	struct pt_regs			*sched_regs;
+	unsigned long			events;
+#endif
 };
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+
+extern void hrtimer_clock_notify(void);
+extern void clock_was_set(void);
+extern void hrtimer_interrupt(struct pt_regs *regs);
+
+# define hrtimer_cb_get_time(t)	(t)->base->get_time()
+# define hrtimer_hres_active	(__get_cpu_var(hrtimer_bases).hres_active)
+/*
+ * The resolution of the clocks. The resolution value is returned in
+ * the clock_getres() system call to give application programmers an
+ * idea of the (in)accuracy of timers. Timer values are rounded up to
+ * this resolution values.
+ */
+# define KTIME_HIGH_RES		(ktime_t) { .tv64 = CONFIG_HIGH_RES_RESOLUTION }
+# define KTIME_MONOTONIC_RES	KTIME_HIGH_RES
+
+#else
+
+# define KTIME_MONOTONIC_RES	KTIME_LOW_RES
+
 /*
  * clock_was_set() is a NOP for non- high-resolution systems. The
  * time-sorted order guarantees that a timer does not expire early and
  * is expired in the next softirq when the clock was advanced.
  */
-#define clock_was_set()		do { } while (0)
-#define hrtimer_clock_notify()	do { } while (0)
-extern ktime_t ktime_get(void);
-extern ktime_t ktime_get_real(void);
+# define clock_was_set()		do { } while (0)
+# define hrtimer_clock_notify()		do { } while (0)
+
+# define hrtimer_cb_get_time(t)		(t)->base->softirq_time
+# define hrtimer_hres_active		0
+
+#endif
+
+  extern ktime_t ktime_get(void);
+  extern ktime_t ktime_get_real(void);
 
 /* Exported timer functions: */
 
Index: linux-2.6.18-mm2/include/linux/interrupt.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/interrupt.h	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/interrupt.h	2006-09-30 01:41:18.000000000 +0200
@@ -235,7 +235,10 @@ enum
 	NET_TX_SOFTIRQ,
 	NET_RX_SOFTIRQ,
 	BLOCK_SOFTIRQ,
-	TASKLET_SOFTIRQ
+	TASKLET_SOFTIRQ,
+#ifdef CONFIG_HIGH_RES_TIMERS
+	HRTIMER_SOFTIRQ,
+#endif
 };
 
 /* softirq mask and active fields moved to irq_cpustat_t in
Index: linux-2.6.18-mm2/include/linux/ktime.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/ktime.h	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/ktime.h	2006-09-30 01:41:18.000000000 +0200
@@ -261,8 +261,7 @@ static inline u64 ktime_to_ns(const ktim
  * idea of the (in)accuracy of timers. Timer values are rounded up to
  * this resolution values.
  */
-#define KTIME_REALTIME_RES	(ktime_t){ .tv64 = TICK_NSEC }
-#define KTIME_MONOTONIC_RES	(ktime_t){ .tv64 = TICK_NSEC }
+#define KTIME_LOW_RES		(ktime_t){ .tv64 = TICK_NSEC }
 
 /* Get the monotonic time in timespec format: */
 extern void ktime_get_ts(struct timespec *ts);
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-09-30 01:41:17.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-09-30 01:41:18.000000000 +0200
@@ -37,7 +37,11 @@
 #include <linux/hrtimer.h>
 #include <linux/notifier.h>
 #include <linux/syscalls.h>
+#include <linux/kallsyms.h>
 #include <linux/interrupt.h>
+#include <linux/clockchips.h>
+#include <linux/profile.h>
+#include <linux/seq_file.h>
 
 #include <asm/uaccess.h>
 
@@ -80,7 +84,7 @@ EXPORT_SYMBOL_GPL(ktime_get_real);
  * This ensures that we capture erroneous accesses to these clock ids
  * rather than moving them into the range of valid clock id's.
  */
-static DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
+DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
 {
 
 	.clock_base =
@@ -88,12 +92,12 @@ static DEFINE_PER_CPU(struct hrtimer_cpu
 		{
 			.index = CLOCK_REALTIME,
 			.get_time = &ktime_get_real,
-			.resolution = KTIME_REALTIME_RES,
+			.resolution = KTIME_LOW_RES,
 		},
 		{
 			.index = CLOCK_MONOTONIC,
 			.get_time = &ktime_get,
-			.resolution = KTIME_MONOTONIC_RES,
+			.resolution = KTIME_LOW_RES,
 		},
 	}
 };
@@ -227,7 +231,7 @@ lock_hrtimer_base(const struct hrtimer *
 	return base;
 }
 
-#define switch_hrtimer_base(t, b)	(b)
+# define switch_hrtimer_base(t, b)	(b)
 
 #endif	/* !CONFIG_SMP */
 
@@ -258,9 +262,6 @@ ktime_t ktime_add_ns(const ktime_t kt, u
 
 	return ktime_add(kt, tmp);
 }
-
-#else /* CONFIG_KTIME_SCALAR */
-
 # endif /* !CONFIG_KTIME_SCALAR */
 
 /*
@@ -288,11 +289,366 @@ static unsigned long ktime_divns(const k
 # define ktime_divns(kt, div)		(unsigned long)((kt).tv64 / (div))
 #endif /* BITS_PER_LONG >= 64 */
 
+/* High resolution timer related functions */
+#ifdef CONFIG_HIGH_RES_TIMERS
+
+static ktime_t last_jiffies_update;
+
+/*
+ * Reprogramm the event source with checking both queues for the
+ * next event
+ * Called with interrupts disabled and base->lock held
+ */
+static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base)
+{
+	int i;
+	struct hrtimer_clock_base *base = cpu_base->clock_base;
+	ktime_t expires;
+
+	cpu_base->expires_next.tv64 = KTIME_MAX;
+
+	for (i = HRTIMER_MAX_CLOCK_BASES; i ; i--, base++) {
+		struct hrtimer *timer;
+
+		if (!base->first)
+			continue;
+		timer = rb_entry(base->first, struct hrtimer, node);
+		expires = ktime_sub(timer->expires, base->offset);
+		if (expires.tv64 < cpu_base->expires_next.tv64)
+			cpu_base->expires_next = expires;
+	}
+
+	if (cpu_base->expires_next.tv64 != KTIME_MAX)
+		clockevents_set_next_event(cpu_base->expires_next, 1);
+}
+
+/*
+ * Shared reprogramming for clock_realtime and clock_monotonic
+ *
+ * When a new expires first timer is enqueued, we have
+ * to check, whether it expires earlier than the timer
+ * for which the hrt time source was armed.
+ *
+ * Called with interrupts disabled and base->cpu_base.lock held
+ */
+static int hrtimer_reprogram(struct hrtimer *timer,
+			     struct hrtimer_clock_base *base)
+{
+	ktime_t *expires_next = &__get_cpu_var(hrtimer_bases).expires_next;
+	ktime_t expires = ktime_sub(timer->expires, base->offset);
+	int res;
+
+	/* Callback running on another CPU ? */
+	if (timer->state & HRTIMER_CALLBACK)
+		return 0;
+
+	if (expires.tv64 >= expires_next->tv64)
+		return 0;
+
+	res = clockevents_set_next_event(expires, 0);
+	if (!res)
+		*expires_next = expires;
+	return res;
+}
+
+
+/*
+ * Retrigger next event is called after clock was set
+ */
+static void retrigger_next_event(void *arg)
+{
+	struct hrtimer_cpu_base *base;
+	struct timespec realtime_offset;
+	unsigned long flags, seq;
+
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		set_normalized_timespec(&realtime_offset,
+					-wall_to_monotonic.tv_sec,
+					-wall_to_monotonic.tv_nsec);
+	} while (read_seqretry(&xtime_lock, seq));
+
+	base = &per_cpu(hrtimer_bases, smp_processor_id());
+
+	/* Adjust CLOCK_REALTIME offset */
+	spin_lock_irqsave(&base->lock, flags);
+	base->clock_base[CLOCK_REALTIME].offset =
+		timespec_to_ktime(realtime_offset);
+
+	hrtimer_force_reprogram(base);
+	spin_unlock_irqrestore(&base->lock, flags);
+}
+
+/*
+ * Clock realtime was set
+ *
+ * Change the offset of the realtime clock vs. the monotonic
+ * clock.
+ *
+ * We might have to reprogram the high resolution timer interrupt. On
+ * SMP we call the architecture specific code to retrigger _all_ high
+ * resolution timer interrupts. On UP we just disable interrupts and
+ * call the high resolution interrupt code.
+ */
+void clock_was_set(void)
+{
+	preempt_disable();
+	if (hrtimer_hres_active) {
+		retrigger_next_event(NULL);
+
+		if (smp_call_function(retrigger_next_event, NULL, 1, 1))
+			BUG();
+	}
+	preempt_enable();
+}
+
+/**
+ * hrtimer_clock_notify - A clock source or a clock event has been installed
+ *
+ * Notify the per cpu softirqs to recheck the clock sources and events
+ */
+void hrtimer_clock_notify(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		set_bit(0, &per_cpu(hrtimer_bases, i).check_clocks);
+}
+
+
+static const ktime_t nsec_per_hz = { .tv64 = NSEC_PER_SEC / HZ };
+
+/*
+ * We switched off the global tick source when switching to high resolution
+ * mode. Update jiffies64.
+ *
+ * Must be called with interrupts disabled !
+ *
+ * FIXME: We need a mechanism to assign the update to a CPU. In principle this
+ * is not hard, but when dynamic ticks come into play it starts to be. We don't
+ * want to wake up a complete idle cpu just to update jiffies, so we need
+ * something more intellegent than a mere "do this only on CPUx".
+ */
+static void update_jiffies64(ktime_t now)
+{
+	ktime_t delta;
+
+	write_seqlock(&xtime_lock);
+
+	delta = ktime_sub(now, last_jiffies_update);
+	if (delta.tv64 >= nsec_per_hz.tv64) {
+
+		unsigned long orun = 1;
+
+		delta = ktime_sub(delta, nsec_per_hz);
+		last_jiffies_update = ktime_add(last_jiffies_update,
+						nsec_per_hz);
+
+		/* Slow path for long timeouts */
+		if (unlikely(delta.tv64 >= nsec_per_hz.tv64)) {
+			s64 incr = ktime_to_ns(nsec_per_hz);
+			orun = ktime_divns(delta, incr);
+
+			last_jiffies_update = ktime_add_ns(last_jiffies_update,
+							   incr * orun);
+			jiffies_64 += orun;
+			orun++;
+		}
+		do_timer(orun);
+	}
+	write_sequnlock(&xtime_lock);
+}
+
+/*
+ * We rearm the timer until we get disabled by the idle code
+ */
+static int hrtimer_sched_tick(struct hrtimer *timer)
+{
+	unsigned long flags;
+	struct hrtimer_cpu_base *cpu_base =
+		container_of(timer, struct hrtimer_cpu_base, sched_timer);
+
+	local_irq_save(flags);
+	/*
+	 * Do not call, when we are not in irq context and have
+	 * no valid regs pointer
+	 */
+	if (cpu_base->sched_regs) {
+		update_process_times(user_mode(cpu_base->sched_regs));
+		profile_tick(CPU_PROFILING, cpu_base->sched_regs);
+	}
+
+	hrtimer_forward(timer, hrtimer_cb_get_time(timer), nsec_per_hz);
+	local_irq_restore(flags);
+
+	return HRTIMER_RESTART;
+}
+
+/*
+ * A change in the clock source or clock events was detected.
+ * Check the clock source and the events, whether we can switch to
+ * high resolution mode or not.
+ *
+ * TODO: Handle the removal of clock sources / events
+ */
+static void hrtimer_check_clocks(void)
+{
+	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
+	unsigned long flags;
+	ktime_t now;
+
+	if (!test_and_clear_bit(0, &base->check_clocks))
+		return;
+
+	if (!timekeeping_is_continuous())
+		return;
+
+	if (!clockevents_next_event_available())
+		return;
+
+	local_irq_save(flags);
+
+	if (base->hres_active) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	now = ktime_get();
+	if (clockevents_init_next_event()) {
+		local_irq_restore(flags);
+		return;
+	}
+	base->hres_active = 1;
+	base->clock_base[CLOCK_REALTIME].resolution = KTIME_HIGH_RES;
+	base->clock_base[CLOCK_MONOTONIC].resolution = KTIME_HIGH_RES;
+
+	/* Did we start the jiffies update yet ? */
+	if (last_jiffies_update.tv64 == 0) {
+		write_seqlock(&xtime_lock);
+		last_jiffies_update = now;
+		write_sequnlock(&xtime_lock);
+	}
+
+	/*
+	 * Emulate tick processing via per-CPU hrtimers:
+	 */
+	hrtimer_init(&base->sched_timer, CLOCK_MONOTONIC, HRTIMER_REL);
+	base->sched_timer.function = hrtimer_sched_tick;
+	base->sched_timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
+	hrtimer_start(&base->sched_timer, nsec_per_hz, HRTIMER_REL);
+
+	/* "Retrigger" the interrupt to get things going */
+	retrigger_next_event(NULL);
+	local_irq_restore(flags);
+	printk(KERN_INFO "hrtimers: Switched to high resolution mode CPU %d\n",
+	       smp_processor_id());
+}
+
+static inline int hrtimer_cb_pending(const struct hrtimer *timer)
+{
+	return !list_empty(&timer->cb_entry);
+}
+
+static inline void hrtimer_remove_cb_pending(struct hrtimer *timer)
+{
+	list_del_init(&timer->cb_entry);
+}
+
+static inline void hrtimer_add_cb_pending(struct hrtimer *timer,
+					  struct hrtimer_clock_base *base)
+{
+	list_add_tail(&timer->cb_entry, &base->cpu_base->cb_pending);
+	timer->state = HRTIMER_PENDING;
+}
+
+static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
+{
+	base->expires_next.tv64 = KTIME_MAX;
+	set_bit(0, &base->check_clocks);
+	base->hres_active = 0;
+	INIT_LIST_HEAD(&base->cb_pending);
+}
+
+static inline void hrtimer_init_timer_hres(struct hrtimer *timer)
+{
+	INIT_LIST_HEAD(&timer->cb_entry);
+}
+
+static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
+					    struct hrtimer_clock_base *base)
+{
+	/*
+	 * When High resolution timers are active try to reprogram. Note, that
+	 * in case the state has HRTIMER_CALLBACK set, no reprogramming and no
+	 * expiry check happens. The timer gets enqueued into the rbtree and
+	 * the reprogramming / expiry check is done in the hrtimer_interrupt or
+	 * in the softirq.
+	 */
+	if (hrtimer_hres_active && hrtimer_reprogram(timer, base)) {
+
+		/* Timer is expired, act upon the callback mode */
+		switch(timer->cb_mode) {
+		case HRTIMER_CB_IRQSAFE_NO_RESTART:
+			/*
+			 * We can call the callback from here. No restart
+			 * happens, so no danger of recursion
+			 */
+			BUG_ON(timer->function(timer) != HRTIMER_NORESTART);
+			return 1;
+		case HRTIMER_CB_IRQSAFE_NO_SOFTIRQ:
+			/*
+			 * This is solely for the sched tick emulation with
+			 * dynamic tick support to ensure that we do not
+			 * restart the tick right on the edge and end up with
+			 * the tick timer in the softirq ! The calling site
+			 * takes care of this.
+			 */
+			return 1;
+		case HRTIMER_CB_IRQSAFE:
+		case HRTIMER_CB_SOFTIRQ:
+			/*
+			 * Move everything else into the softirq pending list !
+			 */
+			hrtimer_add_cb_pending(timer, base);
+			raise_softirq(HRTIMER_SOFTIRQ);
+			return 1;
+		default:
+			BUG();
+		}
+	}
+	return 0;
+}
+
+static inline void hrtimer_resume_jiffie_update(void)
+{
+	unsigned long flags;
+	ktime_t now = ktime_get();
+
+	write_seqlock_irqsave(&xtime_lock, flags);
+	last_jiffies_update = now;
+	write_sequnlock_irqrestore(&xtime_lock, flags);
+}
+
+#else
+
+# define hrtimer_hres_active		0
+# define hrtimer_check_clocks()		do { } while (0)
+# define hrtimer_enqueue_reprogram(t,b)	0
+# define hrtimer_force_reprogram(b)	do { } while (0)
+# define hrtimer_cb_pending(t)		0
+# define hrtimer_remove_cb_pending(t)	do { } while (0)
+# define hrtimer_init_hres(c)		do { } while (0)
+# define hrtimer_init_timer_hres(t)	do { } while (0)
+# define hrtimer_resume_jiffie_update()	do { } while (0)
+
+#endif /* CONFIG_HIGH_RES_TIMERS */
+
 /*
  * Timekeeping resumed notification
  */
 void hrtimer_notify_resume(void)
 {
+	hrtimer_resume_jiffie_update();
 	clockevents_resume_events();
 	clock_was_set();
 }
@@ -380,13 +736,18 @@ static void enqueue_hrtimer(struct hrtim
 	 * Insert the timer to the rbtree and check whether it
 	 * replaces the first pending timer
 	 */
+	if (!base->first || timer->expires.tv64 <
+	    rb_entry(base->first, struct hrtimer, node)->expires.tv64) {
+
+		if (hrtimer_enqueue_reprogram(timer, base))
+			return;
+
+		base->first = &timer->node;
+	}
+
 	rb_link_node(&timer->node, parent, link);
 	rb_insert_color(&timer->node, &base->active);
 	timer->state |= HRTIMER_ACTIVE;
-
-	if (!base->first || timer->expires.tv64 <
-	    rb_entry(base->first, struct hrtimer, node)->expires.tv64)
-		base->first = &timer->node;
 }
 
 /*
@@ -396,15 +757,23 @@ static void enqueue_hrtimer(struct hrtim
  */
 static void __remove_hrtimer(struct hrtimer *timer,
 			     struct hrtimer_clock_base *base,
-			     unsigned long newstate)
+			     unsigned long newstate, int reprogram)
 {
-	/*
-	 * Remove the timer from the rbtree and replace the
-	 * first entry pointer if necessary.
-	 */
-	if (base->first == &timer->node)
-		base->first = rb_next(&timer->node);
-	rb_erase(&timer->node, &base->active);
+	/* High res. callback list. NOP for !HIGHRES */
+	if (hrtimer_cb_pending(timer))
+		hrtimer_remove_cb_pending(timer);
+	else {
+		/*
+		 * Remove the timer from the rbtree and replace the
+		 * first entry pointer if necessary.
+		 */
+		if (base->first == &timer->node) {
+			base->first = rb_next(&timer->node);
+			if (reprogram && hrtimer_hres_active)
+				hrtimer_force_reprogram(base->cpu_base);
+		}
+		rb_erase(&timer->node, &base->active);
+	}
 	timer->state = newstate;
 }
 
@@ -415,7 +784,11 @@ static inline int
 remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
 	if (hrtimer_active(timer)) {
-		__remove_hrtimer(timer, base, HRTIMER_INACTIVE);
+		/*
+		 * Remove the timer and force reprogramming when high
+		 * resolution mode is active
+		 */
+		__remove_hrtimer(timer, base, HRTIMER_INACTIVE, 1);
 		return 1;
 	}
 	return 0;
@@ -550,6 +923,13 @@ ktime_t hrtimer_get_next_event(void)
 	unsigned long flags;
 	int i;
 
+	/*
+	 * In high-res mode we dont need to get the next high-res
+	 * event on a tickless system:
+	 */
+	if (hrtimer_hres_active)
+		return mindelta;
+
 	spin_lock_irqsave(&cpu_base->lock, flags);
 
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
@@ -592,6 +972,7 @@ void hrtimer_init(struct hrtimer *timer,
 		clock_id = CLOCK_MONOTONIC;
 
 	timer->base = &cpu_base->clock_base[clock_id];
+	hrtimer_init_timer_hres(timer);
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
 
@@ -614,6 +995,138 @@ int hrtimer_get_res(const clockid_t whic
 }
 EXPORT_SYMBOL_GPL(hrtimer_get_res);
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+
+/*
+ * High resolution timer interrupt
+ * Called with interrupts disabled
+ */
+void hrtimer_interrupt(struct pt_regs *regs)
+{
+	struct hrtimer_clock_base *base;
+	ktime_t expires_next, now;
+	int i, raise = 0, cpu = smp_processor_id();
+	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+	BUG_ON(!cpu_base->hres_active);
+
+	/* Store the regs for an possible sched_timer callback */
+	cpu_base->sched_regs = regs;
+	cpu_base->events++;
+
+ retry:
+	now = ktime_get();
+
+	/* Check, if the jiffies need an update */
+	update_jiffies64(now);
+
+	expires_next.tv64 = KTIME_MAX;
+
+	base = cpu_base->clock_base;
+
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+		ktime_t basenow;
+		struct rb_node *node;
+
+		spin_lock(&cpu_base->lock);
+
+		basenow = ktime_add(now, base->offset);
+
+		while ((node = base->first)) {
+			struct hrtimer *timer;
+
+			timer = rb_entry(node, struct hrtimer, node);
+
+			if (basenow.tv64 < timer->expires.tv64) {
+				ktime_t expires;
+
+				expires = ktime_sub(timer->expires,
+						    base->offset);
+				if (expires.tv64 < expires_next.tv64)
+					expires_next = expires;
+				break;
+			}
+
+			/* Move softirq callbacks to the pending list */
+			if (timer->cb_mode == HRTIMER_CB_SOFTIRQ) {
+				__remove_hrtimer(timer, base, HRTIMER_PENDING, 0);
+				hrtimer_add_cb_pending(timer, base);
+				raise = 1;
+				continue;
+			}
+
+			__remove_hrtimer(timer, base, HRTIMER_CALLBACK, 0);
+
+			if (timer->function(timer) != HRTIMER_NORESTART) {
+				BUG_ON(timer->state != HRTIMER_CALLBACK);
+				/*
+				 * state == HRTIMER_CALLBACK prevents
+				 * reprogramming. We do this when we break out
+				 * of the loop !
+				 */
+				enqueue_hrtimer(timer, base);
+			}
+			timer->state &= ~HRTIMER_CALLBACK;
+		}
+		spin_unlock(&cpu_base->lock);
+		base++;
+	}
+
+	cpu_base->expires_next = expires_next;
+
+	/* Reprogramming necessary ? */
+	if (expires_next.tv64 != KTIME_MAX) {
+		if (clockevents_set_next_event(expires_next, 0))
+			goto retry;
+	}
+
+	/* Invalidate regs */
+	cpu_base->sched_regs = NULL;
+
+	/* Raise softirq ? */
+	if (raise)
+		raise_softirq(HRTIMER_SOFTIRQ);
+}
+
+static void run_hrtimer_softirq(struct softirq_action *h)
+{
+	struct hrtimer_cpu_base *cpu_base;
+
+	cpu_base = &per_cpu(hrtimer_bases, smp_processor_id());
+
+	spin_lock_irq(&cpu_base->lock);
+
+	while (!list_empty(&cpu_base->cb_pending)) {
+		struct hrtimer *timer;
+		int (*fn)(struct hrtimer *);
+		int restart;
+
+		timer = list_entry(cpu_base->cb_pending.next,
+				   struct hrtimer, cb_entry);
+
+		fn = timer->function;
+		__remove_hrtimer(timer, timer->base, HRTIMER_CALLBACK, 0);
+		spin_unlock_irq(&cpu_base->lock);
+
+		restart = fn(timer);
+
+		spin_lock_irq(&cpu_base->lock);
+
+		timer->state &= ~HRTIMER_CALLBACK;
+		if (restart == HRTIMER_RESTART) {
+			BUG_ON(hrtimer_active(timer));
+			enqueue_hrtimer(timer, timer->base);
+		} else if (hrtimer_active(timer)) {
+			/* Timer was rearmed on another CPU: */
+			if (timer->base->first == &timer->node)
+				hrtimer_reprogram(timer, timer->base);
+		}
+	}
+	spin_unlock_irq(&cpu_base->lock);
+}
+
+#endif	/* CONFIG_HIGH_RES_TIMERS */
+
 /*
  * Expire the per base hrtimer-queue:
  */
@@ -641,7 +1154,7 @@ static inline void run_hrtimer_queue(str
 			break;
 
 		fn = timer->function;
-		__remove_hrtimer(timer, base, HRTIMER_CALLBACK);
+		__remove_hrtimer(timer, base, HRTIMER_CALLBACK, 0);
 		spin_unlock_irq(&cpu_base->lock);
 
 		restart = fn(timer);
@@ -659,12 +1172,21 @@ static inline void run_hrtimer_queue(str
 
 /*
  * Called from timer softirq every jiffy, expire hrtimers:
+ *
+ * For HRT its the fall back code to run the softirq in the timer
+ * softirq context in case the hrtimer initialization failed or has
+ * not been done yet.
  */
 void hrtimer_run_queues(void)
 {
 	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
 	int i;
 
+	hrtimer_check_clocks();
+
+	if (hrtimer_hres_active)
+		return;
+
 	hrtimer_get_softirq_time(cpu_base);
 
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
@@ -691,6 +1213,9 @@ void hrtimer_init_sleeper(struct hrtimer
 {
 	sl->timer.function = hrtimer_wakeup;
 	sl->task = task;
+#ifdef CONFIG_HIGH_RES_TIMERS
+	sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_RESTART;
+#endif
 }
 
 static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
@@ -701,7 +1226,8 @@ static int __sched do_nanosleep(struct h
 		set_current_state(TASK_INTERRUPTIBLE);
 		hrtimer_start(&t->timer, t->timer.expires, mode);
 
-		schedule();
+		if (likely(t->task))
+			schedule();
 
 		hrtimer_cancel(&t->timer);
 		mode = HRTIMER_ABS;
@@ -806,6 +1332,7 @@ static void __devinit init_hrtimers_cpu(
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
 		cpu_base->clock_base[i].cpu_base = cpu_base;
 
+	hrtimer_init_hres(cpu_base);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -819,7 +1346,7 @@ static void migrate_hrtimer_list(struct 
 	while ((node = rb_first(&old_base->active))) {
 		timer = rb_entry(node, struct hrtimer, node);
 		BUG_ON(timer->state & HRTIMER_CALLBACK);
-		__remove_hrtimer(timer, old_base, HRTIMER_INACTIVE);
+		__remove_hrtimer(timer, old_base, HRTIMER_INACTIVE, 0);
 		timer->base = new_base;
 		enqueue_hrtimer(timer, new_base);
 	}
@@ -884,5 +1411,8 @@ void __init hrtimers_init(void)
 	hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
 			  (void *)(long)smp_processor_id());
 	register_cpu_notifier(&hrtimers_nb);
+#ifdef CONFIG_HIGH_RES_TIMERS
+	open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
+#endif
 }
 
Index: linux-2.6.18-mm2/kernel/itimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/itimer.c	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/kernel/itimer.c	2006-09-30 01:41:18.000000000 +0200
@@ -136,7 +136,7 @@ int it_real_fn(struct hrtimer *timer)
 	send_group_sig_info(SIGALRM, SEND_SIG_PRIV, sig->tsk);
 
 	if (sig->it_real_incr.tv64 != 0) {
-		hrtimer_forward(timer, timer->base->softirq_time,
+		hrtimer_forward(timer, hrtimer_cb_get_time(timer),
 				sig->it_real_incr);
 		return HRTIMER_RESTART;
 	}
Index: linux-2.6.18-mm2/kernel/posix-timers.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/posix-timers.c	2006-09-30 01:41:11.000000000 +0200
+++ linux-2.6.18-mm2/kernel/posix-timers.c	2006-09-30 01:41:18.000000000 +0200
@@ -356,7 +356,7 @@ static int posix_timer_fn(struct hrtimer
 		if (timr->it.real.interval.tv64 != 0) {
 			timr->it_overrun +=
 				hrtimer_forward(timer,
-						timer->base->softirq_time,
+						hrtimer_cb_get_time(timer),
 						timr->it.real.interval);
 			ret = HRTIMER_RESTART;
 			++timr->it_requeue_pending;
Index: linux-2.6.18-mm2/kernel/time/Kconfig
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/kernel/time/Kconfig	2006-09-30 01:41:18.000000000 +0200
@@ -0,0 +1,22 @@
+#
+# Timer subsystem related configuration options
+#
+config HIGH_RES_TIMERS
+	bool "High Resolution Timer Support"
+	depends on GENERIC_TIME
+	help
+	  This option enables high resolution timer support. If your
+	  hardware is not capable then this option only increases
+	  the size of the kernel image.
+
+config HIGH_RES_RESOLUTION
+	int "High Resolution Timer resolution (nanoseconds)"
+	depends on HIGH_RES_TIMERS
+	default 1000
+	help
+	  This sets the resolution in nanoseconds of the high resolution
+	  timers. Too fine a resolution (small a number) will usually
+	  not be observable due to normal system latencies.  For an
+          800 MHz processor about 10,000 (10 microseconds) is recommended as a
+	  finest resolution.  If you don't need that sort of resolution,
+	  larger values may generate less overhead.
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-09-30 01:41:16.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-09-30 01:41:18.000000000 +0200
@@ -1022,6 +1022,7 @@ static void update_wall_time(void)
 	if (change_clocksource()) {
 		clock->error = 0;
 		clock->xtime_nsec = 0;
+		hrtimer_clock_notify();
 		clocksource_calculate_interval(clock, tick_nsec);
 	}
 }

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 16/23] dynticks: core
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (14 preceding siblings ...)
  2006-09-29 23:58 ` [patch 15/23] high-res timers: core Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:44   ` Andrew Morton
  2006-09-29 23:58 ` [patch 17/23] dyntick: add nohz stats to /proc/stat Thomas Gleixner
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-no-idle-hz.patch --]
[-- Type: text/plain, Size: 11186 bytes --]

From: Ingo Molnar <mingo@elte.hu>

dynticks core code.

Add idling-stats to the cpu base (to be used to optimize power
management decisions), add the scheduler tick and its stop/restart
functions, and the jiffies-update function to be called when an irq
context hits the idle context.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 include/linux/hrtimer.h |   28 +++++++
 kernel/hrtimer.c        |  185 ++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/softirq.c        |   11 ++
 kernel/time/Kconfig     |    8 ++
 kernel/timer.c          |    2 
 5 files changed, 226 insertions(+), 8 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-09-30 01:41:18.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-09-30 01:41:18.000000000 +0200
@@ -142,6 +142,14 @@ struct hrtimer_cpu_base {
 	struct hrtimer			sched_timer;
 	struct pt_regs			*sched_regs;
 	unsigned long			events;
+#ifdef CONFIG_NO_HZ
+	ktime_t				idle_tick;
+	int				tick_stopped;
+	unsigned long			idle_jiffies;
+	unsigned long			idle_calls;
+	unsigned long			idle_sleeps;
+	unsigned long			idle_sleeptime;
+#endif
 #endif
 };
 
@@ -200,7 +208,7 @@ extern int hrtimer_try_to_cancel(struct 
 extern ktime_t hrtimer_get_remaining(const struct hrtimer *timer);
 extern int hrtimer_get_res(const clockid_t which_clock, struct timespec *tp);
 
-#ifdef CONFIG_NO_IDLE_HZ
+#if defined(CONFIG_NO_IDLE_HZ) || defined(CONFIG_NO_HZ)
 extern ktime_t hrtimer_get_next_event(void);
 #endif
 
@@ -229,6 +237,24 @@ extern void hrtimer_run_queues(void);
 /* Resume notification */
 void hrtimer_notify_resume(void);
 
+#ifdef CONFIG_NO_HZ
+extern void hrtimer_trigger_next_hz_tick(struct tvec_t_base_s *base);
+extern int hrtimer_stop_sched_tick(void);
+extern void hrtimer_restart_sched_tick(void);
+extern void update_jiffies(void);
+struct seq_file;
+extern void show_no_hz_stats(struct seq_file *p);
+#else
+# define hrtimer_trigger_next_hz_tick(base)	do { } while (0)
+static inline int hrtimer_stop_sched_tick(void)
+{
+	return 0;
+}
+# define hrtimer_restart_sched_tick()		do { } while (0)
+# define update_jiffies()			do { } while (0)
+# define show_no_hz_stats(p)			do { } while (0)
+#endif
+
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
 
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-09-30 01:41:18.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-09-30 01:41:18.000000000 +0200
@@ -437,7 +437,6 @@ static void update_jiffies64(ktime_t now
 
 	delta = ktime_sub(now, last_jiffies_update);
 	if (delta.tv64 >= nsec_per_hz.tv64) {
-
 		unsigned long orun = 1;
 
 		delta = ktime_sub(delta, nsec_per_hz);
@@ -451,7 +450,6 @@ static void update_jiffies64(ktime_t now
 
 			last_jiffies_update = ktime_add_ns(last_jiffies_update,
 							   incr * orun);
-			jiffies_64 += orun;
 			orun++;
 		}
 		do_timer(orun);
@@ -459,28 +457,201 @@ static void update_jiffies64(ktime_t now
 	write_sequnlock(&xtime_lock);
 }
 
+#ifdef CONFIG_NO_HZ
+/*
+ * Called from interrupt entry when then CPU was idle
+ */
+void update_jiffies(void)
+{
+	unsigned long flags;
+	ktime_t now;
+
+	if (unlikely(!hrtimer_hres_active))
+		return;
+
+	now = ktime_get();
+
+	local_irq_save(flags);
+	update_jiffies64(now);
+	local_irq_restore(flags);
+}
+
+/*
+ * Called from the idle thread so careful!
+ */
+int hrtimer_stop_sched_tick(void)
+{
+	int cpu = smp_processor_id();
+	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+	unsigned long seq, last_jiffies, next_jiffies;
+	ktime_t last_update, expires;
+	unsigned long delta_jiffies;
+	unsigned long flags;
+
+	if (unlikely(!hrtimer_hres_active))
+		return 0;
+
+	local_irq_save(flags);
+
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		last_update = last_jiffies_update;
+		last_jiffies = jiffies;
+	} while (read_seqretry(&xtime_lock, seq));
+
+	next_jiffies = get_next_timer_interrupt(last_jiffies);
+	delta_jiffies = next_jiffies - last_jiffies;
+
+	cpu_base->idle_calls++;
+
+	if ((long)delta_jiffies >= 1) {
+		/*
+		 * Save the current tick time, so we can restart the
+		 * scheduler tick when we get woken up before the next
+		 * wheel timer expires
+		 */
+		cpu_base->idle_tick = cpu_base->sched_timer.expires;
+		expires = ktime_add_ns(last_update,
+				       nsec_per_hz.tv64 * delta_jiffies);
+		hrtimer_start(&cpu_base->sched_timer, expires, HRTIMER_ABS);
+		cpu_base->idle_sleeps++;
+		cpu_base->idle_jiffies = last_jiffies;
+		cpu_base->tick_stopped = 1;
+	} else {
+		/* Keep the timer alive */
+		if ((long) delta_jiffies < 0)
+			raise_softirq(TIMER_SOFTIRQ);
+	}
+
+	if (local_softirq_pending()) {
+		inc_preempt_count();
+		do_softirq();
+		dec_preempt_count();
+	}
+
+	WARN_ON(!idle_cpu(cpu));
+	/*
+	 * RCU normally depends on the timer IRQ kicking completion
+	 * in every tick. We have to do this here now:
+	 */
+	if (rcu_pending(cpu)) {
+		/*
+		 * We are in quiescent state, so advance callbacks:
+		 */
+		rcu_advance_callbacks(cpu, 1);
+		local_irq_enable();
+		local_bh_disable();
+		rcu_process_callbacks(0);
+		local_bh_enable();
+	}
+
+	local_irq_restore(flags);
+
+	return need_resched();
+}
+
+void hrtimer_restart_sched_tick(void)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	unsigned long flags;
+	ktime_t now;
+
+	if (!hrtimer_hres_active || !cpu_base->tick_stopped)
+		return;
+
+	/* Update jiffies first */
+	now = ktime_get();
+
+	local_irq_save(flags);
+	update_jiffies64(now);
+
+	/*
+	 * Update process times would randomly account the time we slept to
+	 * whatever the context of the next sched tick is.  Enforce that this
+	 * is accounted to idle !
+	 */
+	add_preempt_count(HARDIRQ_OFFSET);
+	update_process_times(0);
+	sub_preempt_count(HARDIRQ_OFFSET);
+
+	cpu_base->idle_sleeptime += jiffies - cpu_base->idle_jiffies;
+
+	cpu_base->tick_stopped  = 0;
+	hrtimer_cancel(&cpu_base->sched_timer);
+	cpu_base->sched_timer.expires = cpu_base->idle_tick;
+
+	while (1) {
+		hrtimer_forward(&cpu_base->sched_timer, now, nsec_per_hz);
+		hrtimer_start(&cpu_base->sched_timer,
+			      cpu_base->sched_timer.expires, HRTIMER_ABS);
+		if (hrtimer_active(&cpu_base->sched_timer))
+			break;
+		/* We missed an update */
+		update_jiffies64(now);
+		now = ktime_get();
+	}
+	local_irq_restore(flags);
+}
+
+void show_no_hz_stats(struct seq_file *p)
+{
+	int cpu;
+	unsigned long calls = 0, sleeps = 0, time = 0, events = 0;
+
+	for_each_online_cpu(cpu) {
+		struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu);
+
+		calls += base->idle_calls;
+		sleeps += base->idle_sleeps;
+		time += base->idle_sleeptime;
+		events += base->events;
+
+		seq_printf(p, "nohz cpu%d I:%lu S:%lu T:%lu A:%lu E: %lu\n",
+			   cpu, base->idle_calls, base->idle_sleeps,
+			   base->idle_sleeptime, base->idle_sleeps ?
+			   base->idle_sleeptime / sleeps : 0, base->events);
+	}
+#ifdef CONFIG_SMP
+	seq_printf(p, "nohz total I:%lu S:%lu T:%lu A:%lu E:%lu\n",
+		   calls, sleeps, time, sleeps ? time / sleeps : 0, events);
+#endif
+}
+
+#endif
+
+
 /*
  * We rearm the timer until we get disabled by the idle code
+ * Called with interrupts disabled.
  */
 static int hrtimer_sched_tick(struct hrtimer *timer)
 {
-	unsigned long flags;
 	struct hrtimer_cpu_base *cpu_base =
 		container_of(timer, struct hrtimer_cpu_base, sched_timer);
 
-	local_irq_save(flags);
 	/*
 	 * Do not call, when we are not in irq context and have
 	 * no valid regs pointer
 	 */
 	if (cpu_base->sched_regs) {
+		/*
+		 * update_process_times() might take tasklist_lock, hence
+		 * drop the base lock. sched-tick hrtimers are per-CPU and
+		 * never accessible by userspace APIs, so this is safe to do.
+		 */
+		spin_unlock(&cpu_base->lock);
 		update_process_times(user_mode(cpu_base->sched_regs));
 		profile_tick(CPU_PROFILING, cpu_base->sched_regs);
+		spin_lock(&cpu_base->lock);
 	}
 
 	hrtimer_forward(timer, hrtimer_cb_get_time(timer), nsec_per_hz);
-	local_irq_restore(flags);
 
+#ifdef CONFIG_NO_HZ
+	/* Do not restart, when we are in the idle loop */
+	if (cpu_base->tick_stopped)
+		return HRTIMER_NORESTART;
+#endif
 	return HRTIMER_RESTART;
 }
 
@@ -908,7 +1079,7 @@ ktime_t hrtimer_get_remaining(const stru
 }
 EXPORT_SYMBOL_GPL(hrtimer_get_remaining);
 
-#ifdef CONFIG_NO_IDLE_HZ
+#if defined(CONFIG_NO_IDLE_HZ) || defined(CONFIG_NO_HZ)
 /**
  * hrtimer_get_next_event - get the time until next expiry event
  *
@@ -923,12 +1094,14 @@ ktime_t hrtimer_get_next_event(void)
 	unsigned long flags;
 	int i;
 
+#ifndef CONFIG_NO_HZ
 	/*
 	 * In high-res mode we dont need to get the next high-res
 	 * event on a tickless system:
 	 */
 	if (hrtimer_hres_active)
 		return mindelta;
+#endif
 
 	spin_lock_irqsave(&cpu_base->lock, flags);
 
Index: linux-2.6.18-mm2/kernel/softirq.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/softirq.c	2006-09-30 01:41:16.000000000 +0200
+++ linux-2.6.18-mm2/kernel/softirq.c	2006-09-30 01:41:18.000000000 +0200
@@ -284,6 +284,11 @@ extern void irq_enter(void)
 	account_system_vtime(current);
 	add_preempt_count(HARDIRQ_OFFSET);
 	trace_hardirq_enter();
+
+#ifdef CONFIG_NO_HZ
+	if (idle_cpu(smp_processor_id()))
+		update_jiffies();
+#endif
 }
 
 /*
@@ -296,6 +301,12 @@ void irq_exit(void)
 	sub_preempt_count(IRQ_EXIT_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
+
+#ifdef CONFIG_NO_HZ
+	/* Make sure that timer wheel updates are propagated */
+	if (!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched())
+		hrtimer_stop_sched_tick();
+#endif
 	preempt_enable_no_resched();
 }
 
Index: linux-2.6.18-mm2/kernel/time/Kconfig
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time/Kconfig	2006-09-30 01:41:18.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time/Kconfig	2006-09-30 01:41:18.000000000 +0200
@@ -20,3 +20,11 @@ config HIGH_RES_RESOLUTION
           800 MHz processor about 10,000 (10 microseconds) is recommended as a
 	  finest resolution.  If you don't need that sort of resolution,
 	  larger values may generate less overhead.
+
+config NO_HZ
+	bool "Tickless System (Dynamic Ticks)"
+	depends on GENERIC_TIME && HIGH_RES_TIMERS
+	help
+	  This option enables a tickless system: timer interrupts will
+	  only trigger on an as-needed basis both when the system is
+	  busy and when the system is idle.
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-09-30 01:41:18.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-09-30 01:41:18.000000000 +0200
@@ -462,7 +462,7 @@ static inline void __run_timers(tvec_bas
 	spin_unlock_irq(&base->lock);
 }
 
-#ifdef CONFIG_NO_IDLE_HZ
+#if defined(CONFIG_NO_IDLE_HZ) || defined(CONFIG_NO_HZ)
 /*
  * Find out when the next timer event is due to happen. This
  * is used on S/390 to stop all activity when a cpus is idle.

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 17/23] dyntick: add nohz stats to /proc/stat
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (15 preceding siblings ...)
  2006-09-29 23:58 ` [patch 16/23] dynticks: core Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-29 23:58 ` [patch 18/23] dynticks: i386 arch code Thomas Gleixner
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-add-stats.patch --]
[-- Type: text/plain, Size: 653 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

add nohz stats to /proc/stat.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 fs/proc/proc_misc.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6.18-mm2/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.18-mm2.orig/fs/proc/proc_misc.c	2006-09-30 01:41:10.000000000 +0200
+++ linux-2.6.18-mm2/fs/proc/proc_misc.c	2006-09-30 01:41:19.000000000 +0200
@@ -527,6 +527,8 @@ static int show_stat(struct seq_file *p,
 		nr_running(),
 		nr_iowait());
 
+	show_no_hz_stats(p);
+
 	return 0;
 }
 

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 18/23] dynticks: i386 arch code
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (16 preceding siblings ...)
  2006-09-29 23:58 ` [patch 17/23] dyntick: add nohz stats to /proc/stat Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:45   ` Andrew Morton
  2006-09-29 23:58 ` [patch 19/23] high-res timers, dynticks: enable i386 support Thomas Gleixner
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: i368-prepare-no-hz.patch --]
[-- Type: text/plain, Size: 2344 bytes --]

From: Ingo Molnar <mingo@elte.hu>

prepare i386 for dyntick: idle handler callbacks and IRQ callback.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
----
 arch/i386/kernel/nmi.c     |    3 ++-
 arch/i386/kernel/process.c |   27 +++++++++++++++------------
 2 files changed, 17 insertions(+), 13 deletions(-)

Index: linux-2.6.18-mm2/arch/i386/kernel/nmi.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/nmi.c	2006-09-30 01:41:10.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/nmi.c	2006-09-30 01:41:19.000000000 +0200
@@ -20,6 +20,7 @@
 #include <linux/sysdev.h>
 #include <linux/sysctl.h>
 #include <linux/percpu.h>
+#include <linux/kernel_stat.h>
 #include <linux/dmi.h>
 #include <linux/kprobes.h>
 
@@ -908,7 +909,7 @@ __kprobes int nmi_watchdog_tick(struct p
 		touched = 1;
 	}
 
-	sum = per_cpu(irq_stat, cpu).apic_timer_irqs;
+	sum = per_cpu(irq_stat, cpu).apic_timer_irqs + kstat_irqs(0);
 
 	/* if the apic timer isn't firing, this cpu isn't doing much */
 	if (!touched && last_irq_sums[cpu] == sum) {
Index: linux-2.6.18-mm2/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/process.c	2006-09-30 01:41:10.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/process.c	2006-09-30 01:41:19.000000000 +0200
@@ -178,24 +178,27 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		while (!need_resched()) {
-			void (*idle)(void);
+		if (!hrtimer_stop_sched_tick()) {
+			while (!need_resched()) {
+				void (*idle)(void);
 
-			if (__get_cpu_var(cpu_idle_state))
-				__get_cpu_var(cpu_idle_state) = 0;
+				if (__get_cpu_var(cpu_idle_state))
+					__get_cpu_var(cpu_idle_state) = 0;
 
-			rmb();
-			idle = pm_idle;
+				rmb();
+				idle = pm_idle;
 
-			if (!idle)
-				idle = default_idle;
+				if (!idle)
+					idle = default_idle;
 
-			if (cpu_is_offline(cpu))
-				play_dead();
+				if (cpu_is_offline(cpu))
+					play_dead();
 
-			__get_cpu_var(irq_stat).idle_timestamp = jiffies;
-			idle();
+				__get_cpu_var(irq_stat).idle_timestamp = jiffies;
+				idle();
+			}
 		}
+		hrtimer_restart_sched_tick();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 19/23] high-res timers, dynticks: enable i386 support
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (17 preceding siblings ...)
  2006-09-29 23:58 ` [patch 18/23] dynticks: i386 arch code Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-29 23:58 ` [patch 20/23] add /proc/sys/kernel/timeout_granularity Thomas Gleixner
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-hres-i386.patch --]
[-- Type: text/plain, Size: 692 bytes --]

From: Ingo Molnar <mingo@elte.hu>

enable high-res timers and dyntick on i386.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 arch/i386/Kconfig |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6.18-mm2/arch/i386/Kconfig
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/Kconfig	2006-09-30 01:41:10.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/Kconfig	2006-09-30 01:41:19.000000000 +0200
@@ -61,6 +61,8 @@ source "init/Kconfig"
 
 menu "Processor type and features"
 
+source "kernel/time/Kconfig"
+
 config SMP
 	bool "Symmetric multi-processing support"
 	---help---

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 20/23] add /proc/sys/kernel/timeout_granularity
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (18 preceding siblings ...)
  2006-09-29 23:58 ` [patch 19/23] high-res timers, dynticks: enable i386 support Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:45   ` Andrew Morton
  2006-09-29 23:58 ` [patch 21/23] debugging feature: timer stats Thomas Gleixner
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: timer-tick-multiplier.patch --]
[-- Type: text/plain, Size: 4865 bytes --]

From: Ingo Molnar <mingo@elte.hu>

Introduce timeout granularity: process timer wheel timers every
timeout_granularity jiffies. Defaults to 1 (process timers HZ times
per second - most finegrained).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 Documentation/kernel-parameters.txt |    6 ++++++
 include/linux/sysctl.h              |    1 +
 include/linux/timer.h               |    1 +
 kernel/sysctl.c                     |   10 ++++++++++
 kernel/timer.c                      |   24 +++++++++++++++++++++---
 5 files changed, 39 insertions(+), 3 deletions(-)

Index: linux-2.6.18-mm2/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.18-mm2.orig/Documentation/kernel-parameters.txt	2006-09-30 01:41:09.000000000 +0200
+++ linux-2.6.18-mm2/Documentation/kernel-parameters.txt	2006-09-30 01:41:19.000000000 +0200
@@ -1637,6 +1637,12 @@ and is between 256 and 4096 characters. 
 
 	time		Show timing data prefixed to each printk message line
 
+	timeout_granularity=
+			[KNL]
+			Timeout granularity: process timer wheel timers every
+			timeout_granularity jiffies. Defaults to 1 (process
+			timers HZ times per second - most finegrained).
+
 	clocksource=	[GENERIC_TIME] Override the default clocksource
 			Override the default clocksource and use the clocksource
 			with the name specified.
Index: linux-2.6.18-mm2/include/linux/sysctl.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/sysctl.h	2006-09-30 01:41:09.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/sysctl.h	2006-09-30 01:41:19.000000000 +0200
@@ -153,6 +153,7 @@ enum
 	KERN_MAX_LOCK_DEPTH=74,
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+	KERN_TIMEOUT_GRANULARITY=77, /* int: timeout granularity in jiffies */
 };
 
 
Index: linux-2.6.18-mm2/include/linux/timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/timer.h	2006-09-30 01:41:16.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/timer.h	2006-09-30 01:41:19.000000000 +0200
@@ -18,6 +18,7 @@ struct timer_list {
 };
 
 extern struct tvec_t_base_s boot_tvec_bases;
+extern unsigned int timeout_granularity;
 
 #define TIMER_INITIALIZER(_function, _expires, _data) {		\
 		.function = (_function),			\
Index: linux-2.6.18-mm2/kernel/sysctl.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/sysctl.c	2006-09-30 01:41:09.000000000 +0200
+++ linux-2.6.18-mm2/kernel/sysctl.c	2006-09-30 01:41:19.000000000 +0200
@@ -640,6 +640,16 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_NO_HZ
+	{
+		.ctl_name       = KERN_TIMEOUT_GRANULARITY,
+		.procname       = "timeout_granularity",
+		.data           = &timeout_granularity,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec,
+	},
+#endif
 	{
 		.ctl_name	= KERN_PIDMAX,
 		.procname	= "pid_max",
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-09-30 01:41:18.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-09-30 01:41:19.000000000 +0200
@@ -66,6 +66,8 @@ typedef struct tvec_root_s {
 	struct list_head vec[TVR_SIZE];
 } tvec_root_t;
 
+unsigned int __read_mostly timeout_granularity = 1;
+
 struct tvec_t_base_s {
 	spinlock_t lock;
 	struct timer_list *running_timer;
@@ -417,7 +419,9 @@ static inline void __run_timers(tvec_bas
 	struct timer_list *timer;
 
 	spin_lock_irq(&base->lock);
-	while (time_after_eq(jiffies, base->timer_jiffies)) {
+
+	while (time_before_eq(base->timer_jiffies, jiffies)) {
+
 		struct list_head work_list;
 		struct list_head *head = &work_list;
  		int index = base->timer_jiffies & TVR_MASK;
@@ -569,7 +573,15 @@ found:
 	 * delayed processing, so make sure we return a value that
 	 * makes sense externally:
 	 */
-	return expires - (now - base->timer_jiffies);
+	expires -= (now - base->timer_jiffies);
+
+	/*
+	 * Round it up per timeout_granularity:
+	 */
+	expires += timeout_granularity - 1;
+	expires -= expires % timeout_granularity;
+
+	return expires;
 }
 
 unsigned long get_next_timer_interrupt(unsigned long now)
@@ -1112,7 +1124,13 @@ static void run_timer_softirq(struct sof
  */
 void run_local_timers(void)
 {
-	raise_softirq(TIMER_SOFTIRQ);
+	tvec_base_t *base = per_cpu(tvec_bases, smp_processor_id());
+	/*
+	 * Only wake up the TIMER_SOFTIRQ every timeout_granularity
+	 * jiffies:
+	 */
+	if (time_before_eq(base->timer_jiffies + timeout_granularity, jiffies))
+		raise_softirq(TIMER_SOFTIRQ);
 	softlockup_tick();
 }
 

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 21/23] debugging feature: timer stats
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (19 preceding siblings ...)
  2006-09-29 23:58 ` [patch 20/23] add /proc/sys/kernel/timeout_granularity Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:46   ` Andrew Morton
  2006-09-29 23:58 ` [patch 22/23] dynticks: increase SLAB timeouts Thomas Gleixner
                   ` (3 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: timer_stats.patch --]
[-- Type: text/plain, Size: 19100 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

add /proc/tstats support: debugging feature to profile timer expiration.
Both the starting site, process/PID and the expiration function is
captured. This allows the quick identification of timer event sources
in a system.

sample output:

 # echo 1 > /proc/tstats
 # cat /proc/tstats
 Timerstats sample period: 3.888770 s
   12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
   15,     1 swapper          hcd_submit_urb (rh_timer_func)
    4,   959 kedac            schedule_timeout (process_timeout)
    1,     0 swapper          page_writeback_init (wb_timer_fn)
   28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
   22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
    3,  3100 bash             schedule_timeout (process_timeout)
    1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
    1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
    1,     1 swapper          neigh_table_init_no_netlink (neigh_periodic_timer)
    1,  2292 ip               __netdev_watchdog_up (dev_watchdog)
    1,    23 events/1         do_cache_clean (delayed_work_timer_fn)
 90 total events, 30.0 events/sec

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h   |   44 ++++++++
 include/linux/timer.h     |   48 +++++++++
 kernel/hrtimer.c          |   26 +++++
 kernel/time/Makefile      |    3 
 kernel/time/timer_stats.c |  227 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/timer.c            |   29 +++++
 kernel/workqueue.c        |    8 +
 lib/Kconfig.debug         |   11 ++
 8 files changed, 391 insertions(+), 5 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-09-30 01:41:18.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-09-30 01:41:20.000000000 +0200
@@ -74,6 +74,11 @@ struct hrtimer {
 	int				cb_mode;
 	struct list_head		cb_entry;
 #endif
+#ifdef CONFIG_TIMER_STATS
+	void				*start_site;
+	char				start_comm[16];
+	int				start_pid;
+#endif
 };
 
 /**
@@ -258,4 +263,43 @@ static inline int hrtimer_stop_sched_tic
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
 
+/*
+ * Timer-statistics info:
+ */
+#ifdef CONFIG_TIMER_STATS
+
+extern void tstats_update_stats(void *timer, pid_t pid, void *startf,
+				void *timerf, char * comm);
+
+static inline void tstats_account_hrtimer(struct hrtimer *timer)
+{
+	tstats_update_stats(timer, timer->start_pid, timer->start_site,
+			    timer->function, timer->start_comm);
+}
+
+extern void __tstats_hrtimer_set_start_info(struct hrtimer *timer, void *addr);
+
+static inline void tstats_hrtimer_set_start_info(struct hrtimer *timer)
+{
+	__tstats_hrtimer_set_start_info(timer, __builtin_return_address(0));
+}
+
+static inline void tstats_hrtimer_clear_start_info(struct hrtimer *timer)
+{
+	timer->start_site = NULL;
+}
+#else
+static inline void tstats_account_hrtimer(struct hrtimer *timer)
+{
+}
+
+static inline void tstats_hrtimer_set_start_info(struct hrtimer *timer)
+{
+}
+
+static inline void tstats_hrtimer_clear_start_info(struct hrtimer *timer)
+{
+}
+#endif
+
 #endif
Index: linux-2.6.18-mm2/include/linux/timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/timer.h	2006-09-30 01:41:19.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/timer.h	2006-09-30 01:41:20.000000000 +0200
@@ -2,6 +2,7 @@
 #define _LINUX_TIMER_H
 
 #include <linux/list.h>
+#include <linux/ktime.h>
 #include <linux/spinlock.h>
 #include <linux/stddef.h>
 
@@ -15,6 +16,11 @@ struct timer_list {
 	unsigned long data;
 
 	struct tvec_t_base_s *base;
+#ifdef CONFIG_TIMER_STATS
+	void *start_site;
+	char start_comm[16];
+	int start_pid;
+#endif
 };
 
 extern struct tvec_t_base_s boot_tvec_bases;
@@ -74,6 +80,48 @@ extern unsigned long next_timer_interrup
  */
 extern unsigned long get_next_timer_interrupt(unsigned long now);
 
+/*
+ * Timer-statistics info:
+ */
+#ifdef CONFIG_TIMER_STATS
+
+extern void tstats_update_stats(void *timer, pid_t pid, void *startf,
+				void *timerf, char * comm);
+
+static inline void tstats_account_timer(struct timer_list *timer)
+{
+	tstats_update_stats(timer, timer->start_pid, timer->start_site,
+			    timer->function, timer->start_comm);
+}
+
+extern void __tstats_timer_set_start_info(struct timer_list *timer, void *addr);
+
+static inline void tstats_timer_set_start_info(struct timer_list *timer)
+{
+	__tstats_timer_set_start_info(timer, __builtin_return_address(0));
+}
+
+static inline void tstats_timer_clear_start_info(struct timer_list *timer)
+{
+	timer->start_site = NULL;
+}
+#else
+static inline void tstats_account_timer(struct timer_list *timer)
+{
+}
+
+static inline void tstats_timer_set_start_info(struct timer_list *timer)
+{
+}
+
+static inline void tstats_timer_clear_start_info(struct timer_list *timer)
+{
+}
+#endif
+
+extern void delayed_work_timer_fn(unsigned long __data);
+
+
 /***
  * add_timer - start a timer
  * @timer: the timer to be added
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-09-30 01:41:18.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-09-30 01:41:20.000000000 +0200
@@ -814,6 +814,18 @@ static inline void hrtimer_resume_jiffie
 
 #endif /* CONFIG_HIGH_RES_TIMERS */
 
+#ifdef CONFIG_TIMER_STATS
+void __tstats_hrtimer_set_start_info(struct hrtimer *timer, void *addr)
+{
+	if (timer->start_site)
+		return;
+
+	timer->start_site = addr;
+	memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
+	timer->start_pid = current->pid;
+}
+#endif
+
 /*
  * Timekeeping resumed notification
  */
@@ -959,6 +971,7 @@ remove_hrtimer(struct hrtimer *timer, st
 		 * Remove the timer and force reprogramming when high
 		 * resolution mode is active
 		 */
+		tstats_hrtimer_clear_start_info(timer);
 		__remove_hrtimer(timer, base, HRTIMER_INACTIVE, 1);
 		return 1;
 	}
@@ -1005,6 +1018,8 @@ hrtimer_start(struct hrtimer *timer, kti
 	}
 	timer->expires = tim;
 
+	tstats_hrtimer_set_start_info(timer);
+
 	enqueue_hrtimer(timer, new_base);
 
 	unlock_hrtimer_base(timer, &flags);
@@ -1146,6 +1161,12 @@ void hrtimer_init(struct hrtimer *timer,
 
 	timer->base = &cpu_base->clock_base[clock_id];
 	hrtimer_init_timer_hres(timer);
+
+#ifdef CONFIG_TIMER_STATS
+	timer->start_site = NULL;
+	timer->start_pid = -1;
+	memset(timer->start_comm, 0, TASK_COMM_LEN);
+#endif
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
 
@@ -1229,6 +1250,7 @@ void hrtimer_interrupt(struct pt_regs *r
 			}
 
 			__remove_hrtimer(timer, base, HRTIMER_CALLBACK, 0);
+			tstats_account_hrtimer(timer);
 
 			if (timer->function(timer) != HRTIMER_NORESTART) {
 				BUG_ON(timer->state != HRTIMER_CALLBACK);
@@ -1277,6 +1299,8 @@ static void run_hrtimer_softirq(struct s
 		timer = list_entry(cpu_base->cb_pending.next,
 				   struct hrtimer, cb_entry);
 
+		tstats_account_hrtimer(timer);
+
 		fn = timer->function;
 		__remove_hrtimer(timer, timer->base, HRTIMER_CALLBACK, 0);
 		spin_unlock_irq(&cpu_base->lock);
@@ -1326,6 +1350,8 @@ static inline void run_hrtimer_queue(str
 		if (base->softirq_time.tv64 <= timer->expires.tv64)
 			break;
 
+		tstats_account_hrtimer(timer);
+
 		fn = timer->function;
 		__remove_hrtimer(timer, base, HRTIMER_CALLBACK, 0);
 		spin_unlock_irq(&cpu_base->lock);
Index: linux-2.6.18-mm2/kernel/time/Makefile
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time/Makefile	2006-09-30 01:41:17.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time/Makefile	2006-09-30 01:41:20.000000000 +0200
@@ -1,3 +1,4 @@
 obj-y += ntp.o clocksource.o jiffies.o
 
-obj-$(CONFIG_GENERIC_TIME) += clockevents.o
+obj-$(CONFIG_GENERIC_TIME)	+= clockevents.o
+obj-$(CONFIG_TIMER_STATS)	+= timer_stats.o
Index: linux-2.6.18-mm2/kernel/time/timer_stats.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/kernel/time/timer_stats.c	2006-09-30 01:41:20.000000000 +0200
@@ -0,0 +1,227 @@
+/*
+ * kernel/time/timer_stats.c
+ *
+ * Copyright(C) 2006, Red Hat, Inc., Ingo Molnar
+ * Copyright(C) 2006, Thomas Gleixner <tglx@timesys.com>
+ *
+ * Based on: timer_top.c
+ *	Copyright (C) 2005 Instituto Nokia de Tecnologia - INdT - Manaus
+ *	Written by Daniel Petrini <d.pensator@gmail.com>
+ *
+ * Collect timer usage statistics.
+ *
+ * We export the addresses and counting of timer functions being called,
+ * the pid and cmdline from the owner process if applicable.
+ *
+ * Start/stop data collection:
+ * # echo 1[0] >/proc/tstats
+ *
+ * Display the collected information:
+ * # cat /proc/tstats
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/proc_fs.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+
+#include <asm/uaccess.h>
+
+static DEFINE_SPINLOCK(tstats_lock);
+static int tstats_status;
+static ktime_t tstats_time;
+
+enum tstats_stat {
+	TSTATS_INACTIVE,
+	TSTATS_ACTIVE,
+	TSTATS_READOUT,
+	TSTATS_RESET,
+};
+
+struct tstats_entry {
+	void			*timer;
+	void			*start_func;
+	void			*expire_func;
+	unsigned long		counter;
+	pid_t			pid;
+	char			comm[TASK_COMM_LEN + 1];
+};
+
+#define TSTATS_MAX_ENTRIES	1024
+
+static struct tstats_entry tstats[TSTATS_MAX_ENTRIES];
+
+void tstats_update_stats(void *timer, pid_t pid, void *startf,
+			 void *timerf, char * comm)
+{
+	struct tstats_entry *entry = tstats;
+	unsigned long flags;
+	int i;
+
+	spin_lock_irqsave(&tstats_lock, flags);
+	if (tstats_status != TSTATS_ACTIVE)
+		goto out_unlock;
+
+	for (i = 0; i < TSTATS_MAX_ENTRIES; i++, entry++) {
+		if (entry->timer == timer &&
+		    entry->start_func == startf &&
+		    entry->expire_func == timerf &&
+		    entry->pid == pid) {
+
+			entry->counter++;
+			break;
+		}
+		if (!entry->timer) {
+			entry->timer = timer;
+			entry->start_func = startf;
+			entry->expire_func = timerf;
+			entry->counter = 1;
+			entry->pid = pid;
+			memcpy(entry->comm, comm, TASK_COMM_LEN);
+			entry->comm[TASK_COMM_LEN] = 0;
+			break;
+		}
+	}
+
+ out_unlock:
+	spin_unlock_irqrestore(&tstats_lock, flags);
+}
+
+static void tstats_reset(void)
+{
+	memset(tstats, 0, sizeof(tstats));
+}
+
+static void print_name_offset(struct seq_file *m, unsigned long addr)
+{
+	char namebuf[KSYM_NAME_LEN+1];
+	unsigned long size, offset;
+	const char *sym_name;
+	char *modname;
+
+	sym_name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf);
+	if (sym_name)
+		seq_printf(m, "%s", sym_name);
+	else
+		seq_printf(m, "<%p>", (void *)addr);
+}
+
+static int tstats_show(struct seq_file *m, void *v)
+{
+	struct tstats_entry *entry = tstats;
+	struct timespec period;
+	unsigned long ms;
+	long events = 0;
+	int i;
+
+	spin_lock_irq(&tstats_lock);
+	switch(tstats_status) {
+	case TSTATS_ACTIVE:
+		tstats_time = ktime_sub(ktime_get(), tstats_time);
+	case TSTATS_INACTIVE:
+		tstats_status = TSTATS_READOUT;
+		break;
+	default:
+		spin_unlock_irq(&tstats_lock);
+		return -EBUSY;
+	}
+	spin_unlock_irq(&tstats_lock);
+
+	period = ktime_to_timespec(tstats_time);
+	ms = period.tv_nsec % 1000000;
+
+	seq_printf(m, "Timerstats sample period: %ld.%3ld s\n",
+		   period.tv_sec, ms);
+
+	for (i = 0; i < TSTATS_MAX_ENTRIES && entry->timer; i++, entry++) {
+		seq_printf(m, "%4lu, %5d %-16s ", entry->counter, entry->pid,
+			   entry->comm);
+
+		print_name_offset(m, (unsigned long)entry->start_func);
+		seq_puts(m, " (");
+		print_name_offset(m, (unsigned long)entry->expire_func);
+		seq_puts(m, ")\n");
+		events += entry->counter;
+	}
+
+	ms += period.tv_sec * 1000;
+	if (events && period.tv_sec)
+		seq_printf(m, "%ld total events, %ld.%ld events/sec\n", events,
+			   events / period.tv_sec, events * 1000 / ms);
+	else
+		seq_printf(m, "%ld total events\n", events);
+
+	tstats_status = TSTATS_INACTIVE;
+	return 0;
+}
+
+static ssize_t tstats_write(struct file *file, const char __user *buf,
+			    size_t count, loff_t *offs)
+{
+	char ctl[2];
+
+	if (count != 2 || *offs)
+		return -EINVAL;
+
+	if (copy_from_user(ctl, buf, count))
+		return -EFAULT;
+
+	switch (ctl[0]) {
+	case '0':
+		spin_lock_irq(&tstats_lock);
+		if (tstats_status == TSTATS_ACTIVE) {
+			tstats_status = TSTATS_INACTIVE;
+			tstats_time = ktime_sub(ktime_get(), tstats_time);
+		}
+		spin_unlock_irq(&tstats_lock);
+		break;
+	case '1':
+		spin_lock_irq(&tstats_lock);
+		if (tstats_status == TSTATS_INACTIVE) {
+			tstats_status = TSTATS_RESET;
+			spin_unlock_irq(&tstats_lock);
+			tstats_reset();
+			tstats_time = ktime_get();
+			tstats_status = TSTATS_ACTIVE;
+		} else
+			spin_unlock_irq(&tstats_lock);
+		break;
+	default:
+		count = -EINVAL;
+	}
+
+	return count;
+}
+
+static int tstats_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, tstats_show, NULL);
+}
+
+static struct file_operations tstats_fops = {
+	.open		= tstats_open,
+	.read		= seq_read,
+	.write		= tstats_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init init_tstats(void)
+{
+	struct proc_dir_entry *pe = create_proc_entry("tstats", 0666, NULL);
+
+	if (!pe)
+		return -ENOMEM;
+
+	pe->proc_fops = &tstats_fops;
+
+	return 0;
+}
+module_init(init_tstats);
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-09-30 01:41:19.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-09-30 01:41:20.000000000 +0200
@@ -34,6 +34,7 @@
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
 #include <linux/delay.h>
+#include <linux/kallsyms.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -135,6 +136,18 @@ static void internal_add_timer(tvec_base
 	list_add_tail(&timer->entry, vec);
 }
 
+#ifdef CONFIG_TIMER_STATS
+void __tstats_timer_set_start_info(struct timer_list *timer, void *addr)
+{
+	if (timer->start_site)
+		return;
+
+	timer->start_site = addr;
+	memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
+	timer->start_pid = current->pid;
+}
+#endif
+
 /**
  * init_timer - initialize a timer.
  * @timer: the timer to be initialized
@@ -146,11 +159,16 @@ void fastcall init_timer(struct timer_li
 {
 	timer->entry.next = NULL;
 	timer->base = __raw_get_cpu_var(tvec_bases);
+#ifdef CONFIG_TIMER_STATS
+	timer->start_site = NULL;
+	timer->start_pid = -1;
+	memset(timer->start_comm, 0, TASK_COMM_LEN);
+#endif
 }
 EXPORT_SYMBOL(init_timer);
 
 static inline void detach_timer(struct timer_list *timer,
-					int clear_pending)
+				int clear_pending)
 {
 	struct list_head *entry = &timer->entry;
 
@@ -197,6 +215,7 @@ int __mod_timer(struct timer_list *timer
 	unsigned long flags;
 	int ret = 0;
 
+	tstats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
 
 	base = lock_timer_base(timer, &flags);
@@ -247,6 +266,7 @@ void add_timer_on(struct timer_list *tim
 	tvec_base_t *base = per_cpu(tvec_bases, cpu);
   	unsigned long flags;
 
+	tstats_timer_set_start_info(timer);
   	BUG_ON(timer_pending(timer) || !timer->function);
 	spin_lock_irqsave(&base->lock, flags);
 	timer->base = base;
@@ -279,6 +299,7 @@ int mod_timer(struct timer_list *timer, 
 {
 	BUG_ON(!timer->function);
 
+	tstats_timer_set_start_info(timer);
 	/*
 	 * This is a common optimization triggered by the
 	 * networking code - if the timer is re-modified
@@ -309,6 +330,7 @@ int del_timer(struct timer_list *timer)
 	unsigned long flags;
 	int ret = 0;
 
+	tstats_timer_clear_start_info(timer);
 	if (timer_pending(timer)) {
 		base = lock_timer_base(timer, &flags);
 		if (timer_pending(timer)) {
@@ -444,6 +466,8 @@ static inline void __run_timers(tvec_bas
  			fn = timer->function;
  			data = timer->data;
 
+			tstats_account_timer(timer);
+
 			set_running_timer(base, timer);
 			detach_timer(timer, 1);
 			spin_unlock_irq(&base->lock);
@@ -1114,7 +1138,8 @@ static void run_timer_softirq(struct sof
 {
 	tvec_base_t *base = __get_cpu_var(tvec_bases);
 
- 	hrtimer_run_queues();
+	hrtimer_run_queues();
+
 	if (time_after_eq(jiffies, base->timer_jiffies))
 		__run_timers(base);
 }
Index: linux-2.6.18-mm2/kernel/workqueue.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/workqueue.c	2006-09-30 01:41:09.000000000 +0200
+++ linux-2.6.18-mm2/kernel/workqueue.c	2006-09-30 01:41:20.000000000 +0200
@@ -119,12 +119,14 @@ int fastcall queue_work(struct workqueue
 }
 EXPORT_SYMBOL_GPL(queue_work);
 
-static void delayed_work_timer_fn(unsigned long __data)
+void delayed_work_timer_fn(unsigned long __data)
 {
 	struct work_struct *work = (struct work_struct *)__data;
 	struct workqueue_struct *wq = work->wq_data;
 	int cpu = smp_processor_id();
+	struct list_head *head;
 
+	head = &per_cpu_ptr(wq->cpu_wq, cpu)->more_work.task_list;
 	if (unlikely(is_single_threaded(wq)))
 		cpu = singlethread_cpu;
 
@@ -140,11 +142,12 @@ static void delayed_work_timer_fn(unsign
  * Returns non-zero if it was successfully added.
  */
 int fastcall queue_delayed_work(struct workqueue_struct *wq,
-			struct work_struct *work, unsigned long delay)
+				struct work_struct *work, unsigned long delay)
 {
 	int ret = 0;
 	struct timer_list *timer = &work->timer;
 
+	tstats_timer_set_start_info(&work->timer);
 	if (!test_and_set_bit(0, &work->pending)) {
 		BUG_ON(timer_pending(timer));
 		BUG_ON(!list_empty(&work->entry));
@@ -469,6 +472,7 @@ EXPORT_SYMBOL(schedule_work);
  */
 int fastcall schedule_delayed_work(struct work_struct *work, unsigned long delay)
 {
+	tstats_timer_set_start_info(&work->timer);
 	return queue_delayed_work(keventd_wq, work, delay);
 }
 EXPORT_SYMBOL(schedule_delayed_work);
Index: linux-2.6.18-mm2/lib/Kconfig.debug
===================================================================
--- linux-2.6.18-mm2.orig/lib/Kconfig.debug	2006-09-30 01:41:09.000000000 +0200
+++ linux-2.6.18-mm2/lib/Kconfig.debug	2006-09-30 01:41:20.000000000 +0200
@@ -109,6 +109,17 @@ config SCHEDSTATS
 	  application, you can say N to avoid the very slight overhead
 	  this adds.
 
+config TIMER_STATS
+	bool "Collect kernel timers statistics"
+	depends on DEBUG_KERNEL && PROC_FS
+	help
+	  If you say Y here, additional code will be inserted into the
+	  timer routines to collect statistics about kernel timers being
+	  reprogrammed. The statistics can be read from /proc/tstats.
+	  The statistics collection is started by writing 1 to /proc/tstats,
+	  writing 0 stops it. This feature is useful to collect information
+	  about timer usage patterns in kernel and userspace.
+
 config DEBUG_SLAB
 	bool "Debug slab memory allocations"
 	depends on DEBUG_KERNEL && SLAB

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 22/23] dynticks: increase SLAB timeouts
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (20 preceding siblings ...)
  2006-09-29 23:58 ` [patch 21/23] debugging feature: timer stats Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:49   ` Andrew Morton
  2006-09-29 23:58 ` [patch 23/23] dynticks: decrease I8042_POLL_PERIOD Thomas Gleixner
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: slab-slowdown-nohz.patch --]
[-- Type: text/plain, Size: 1011 bytes --]

From: Ingo Molnar <mingo@elte.hu>

decrease the rate of SLAB timers going off. Reduces the amount
of timers going off in an idle system.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 mm/slab.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6.18-mm2/mm/slab.c
===================================================================
--- linux-2.6.18-mm2.orig/mm/slab.c	2006-09-30 01:41:09.000000000 +0200
+++ linux-2.6.18-mm2/mm/slab.c	2006-09-30 01:41:20.000000000 +0200
@@ -457,8 +457,13 @@ struct kmem_cache {
  * OTOH the cpuarrays can contain lots of objects,
  * which could lock up otherwise freeable slabs.
  */
-#define REAPTIMEOUT_CPUC	(2*HZ)
-#define REAPTIMEOUT_LIST3	(4*HZ)
+#ifdef CONFIG_NO_HZ
+# define REAPTIMEOUT_CPUC	(4*HZ)
+# define REAPTIMEOUT_LIST3	(8*HZ)
+#else
+# define REAPTIMEOUT_CPUC	(2*HZ)
+# define REAPTIMEOUT_LIST3	(4*HZ)
+#endif
 
 #if STATS
 #define	STATS_INC_ACTIVE(x)	((x)->num_active++)

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [patch 23/23] dynticks: decrease I8042_POLL_PERIOD
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (21 preceding siblings ...)
  2006-09-29 23:58 ` [patch 22/23] dynticks: increase SLAB timeouts Thomas Gleixner
@ 2006-09-29 23:58 ` Thomas Gleixner
  2006-09-30  8:49   ` Andrew Morton
  2006-09-30  8:35 ` [patch 00/23] Andrew Morton
  2006-09-30  8:35 ` Andrew Morton
  24 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-29 23:58 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: i8042-poll-less.patch --]
[-- Type: text/plain, Size: 1366 bytes --]

From: Ingo Molnar <mingo@elte.hu>

decrease the rate of timers going off. Also work around apparent
kbd-init bug by making the first timeout short.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 drivers/input/serio/i8042.c |    2 +-
 drivers/input/serio/i8042.h |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.18-mm2/drivers/input/serio/i8042.c
===================================================================
--- linux-2.6.18-mm2.orig/drivers/input/serio/i8042.c	2006-09-30 01:41:08.000000000 +0200
+++ linux-2.6.18-mm2/drivers/input/serio/i8042.c	2006-09-30 01:41:20.000000000 +0200
@@ -1101,7 +1101,7 @@ static int __devinit i8042_probe(struct 
 		goto err_controller_cleanup;
 	}
 
-	mod_timer(&i8042_timer, jiffies + I8042_POLL_PERIOD);
+	mod_timer(&i8042_timer, jiffies + 2); //I8042_POLL_PERIOD);
 	return 0;
 
  err_unregister_ports:
Index: linux-2.6.18-mm2/drivers/input/serio/i8042.h
===================================================================
--- linux-2.6.18-mm2.orig/drivers/input/serio/i8042.h	2006-09-30 01:41:08.000000000 +0200
+++ linux-2.6.18-mm2/drivers/input/serio/i8042.h	2006-09-30 01:41:20.000000000 +0200
@@ -43,7 +43,7 @@
  * polling.
  */
 
-#define I8042_POLL_PERIOD	HZ/20
+#define I8042_POLL_PERIOD	(10*HZ)
 
 /*
  * Status register bits.

--


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 00/23]
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (22 preceding siblings ...)
  2006-09-29 23:58 ` [patch 23/23] dynticks: decrease I8042_POLL_PERIOD Thomas Gleixner
@ 2006-09-30  8:35 ` Andrew Morton
  2006-09-30 19:17   ` Thomas Gleixner
  2006-09-30  8:35 ` Andrew Morton
  24 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:18 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> We are pleased to announce the next version of our "high resolution 
> timers" and "dynticks" patchset, which implements two important features 
> that Linux lacked for many years.

Could we please have a full description of these features?  All we have at
present is "high resolution timers" and "dynticks", which is ludicrously
terse.  Please also describe the design and implementation.  This is basic
stuff for a run-of-the-mill patch, let alone a feature like this one.

I don't believe I can adequately review this work without that information.
I can try, but obviously such a review will not be as beneficial - it can
only cover trivial matters.

With all the work which has gone into this, and with the impact which it
will have upon us all it is totally disproportionate that no more than a
few minutes were spent telling the rest of us what it does and how it
does it.

We've had a lot of problems with timekeeping in recent years, and they have
been hard and slow to solve.  Hence I'd like to set the bar very high on
the maintainability and understandability of this work.  And what I see
here doesn't look really great from that POV.

Thanks.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 00/23]
  2006-09-29 23:58 [patch 00/23] Thomas Gleixner
                   ` (23 preceding siblings ...)
  2006-09-30  8:35 ` [patch 00/23] Andrew Morton
@ 2006-09-30  8:35 ` Andrew Morton
  24 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


General readability comment.  This:

enum hrtimer_mode {
	HRTIMER_ABS,	/* Time value is absolute */
	HRTIMER_REL,	/* Time value is relative to now */
};

enum hrtimer_restart {
	HRTIMER_NORESTART,
	HRTIMER_RESTART,
};


#define HRTIMER_INACTIVE	0x00
#define HRTIMER_ACTIVE		0x01
#define HRTIMER_CALLBACK	0x02
#define HRTIMER_PENDING		0x04


is quite bad.  They enumerate different conceptual things but they are all
in the sane namespace.  The reader sees code which is using a mixture of
HRTIMER_ABS, HRTIMER_RESTART and HRTIMER_ACTIVE and gets needlessly
confused over what they all signify.  

This:

enum hrtimer_cb_mode {
	HRTIMER_CB_SOFTIRQ,
	HRTIMER_CB_IRQSAFE,
	HRTIMER_CB_IRQSAFE_NO_RESTART,
	HRTIMER_CB_IRQSAFE_NO_SOFTIRQ,
};

is better.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 02/23] GTOD: persistent clock support, core
  2006-09-29 23:58 ` [patch 02/23] GTOD: persistent clock support, core Thomas Gleixner
@ 2006-09-30  8:35   ` Andrew Morton
  2006-09-30 17:15     ` Jan Engelhardt
  2006-10-02 21:49     ` john stultz
  0 siblings, 2 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


On Fri, 29 Sep 2006 23:58:21 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: John Stultz <johnstul@us.ibm.com>
> 
> persistent clock support: do proper timekeeping across suspend/resume.

How?

> +/* Weak dummy function for arches that do not yet support it.
> + * XXX - Do be sure to remove it once all arches implement it.
> + */
> +unsigned long __attribute__((weak)) read_persistent_clock(void)
> +{
> +	return 0;
> +}

Seconds?  microseconds?  jiffies?  walltime?  uptime?

Needs some comments.


>  void __init timekeeping_init(void)
>  {
> -	unsigned long flags;
> +	unsigned long flags, sec = read_persistent_clock();

So it apparently returns seconds-since-epoch?

If so, why?

>  	write_seqlock_irqsave(&xtime_lock, flags);
>  
> @@ -758,11 +769,18 @@ void __init timekeeping_init(void)
>  	clocksource_calculate_interval(clock, tick_nsec);
>  	clock->cycle_last = clocksource_read(clock);
>  
> +	xtime.tv_sec = sec;
> +	xtime.tv_nsec = (jiffies % HZ) * (NSEC_PER_SEC / HZ);

Why is it valid to take the second from the persistent clock and the
fraction-of-a-second from jiffies?  Some comments describing the
implementation would improve its understandability and maintainability.

This statement can set xtime.tv_nsec to a value >= NSEC_PER_SEC.  Should it
not be normalised?

> +	set_normalized_timespec(&wall_to_monotonic,
> +		-xtime.tv_sec, -xtime.tv_nsec);
> +
>  	write_sequnlock_irqrestore(&xtime_lock, flags);
>  }
>  
>  
>  static int timekeeping_suspended;
> +static unsigned long timekeeping_suspend_time;

In what units?

> +
>  /**
>   * timekeeping_resume - Resumes the generic timekeeping subsystem.
>   * @dev:	unused
> @@ -773,14 +791,23 @@ static int timekeeping_suspended;
>   */
>  static int timekeeping_resume(struct sys_device *dev)
>  {
> -	unsigned long flags;
> +	unsigned long flags, now = read_persistent_clock();

Would whoever keeps doing that please stop it?  This:

	unsigned long flags;
	unsigned long now = read_persistent_clock();

is more readable and makes for more readable patches in the future.

>  	write_seqlock_irqsave(&xtime_lock, flags);
> -	/* restart the last cycle value */
> +
> +	if (now && (now > timekeeping_suspend_time)) {
> +		unsigned long sleep_length = now - timekeeping_suspend_time;
> +		xtime.tv_sec += sleep_length;
> +		jiffies_64 += sleep_length * HZ;

sleep_length will overflow if we slept for more than 49 days, and HZ=1000.

> +	}
> +	/* re-base the last cycle value */
>  	clock->cycle_last = clocksource_read(clock);
>  	clock->error = 0;
>  	timekeeping_suspended = 0;
>  	write_sequnlock_irqrestore(&xtime_lock, flags);
> +
> +	hrtimer_notify_resume();
> +
>  	return 0;
>  }
>  
> @@ -790,6 +817,7 @@ static int timekeeping_suspend(struct sy
>  
>  	write_seqlock_irqsave(&xtime_lock, flags);
>  	timekeeping_suspended = 1;
> +	timekeeping_suspend_time = read_persistent_clock();
>  	write_sequnlock_irqrestore(&xtime_lock, flags);
>  	return 0;
>  }
> 
> --

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 03/23] GTOD: persistent clock support, i386
  2006-09-29 23:58 ` [patch 03/23] GTOD: persistent clock support, i386 Thomas Gleixner
@ 2006-09-30  8:36   ` Andrew Morton
  2006-10-02 22:03     ` john stultz
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:22 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> persistent clock support: do proper timekeeping across suspend/resume,
> i386 arch support.
> 

This description implies that the patch implements something for i386

>  arch/i386/kernel/apm.c  |   44 ---------------------------------------
>  arch/i386/kernel/time.c |   54 +---------------------------------------------

but all it does is delete stuff.

I _assume_ that it switches i386 over to using the (undescribed) generic
core, and it does that merely by implementing read_persistent_clock().

But I'd have expected to see some Kconfig change in there as well?

Does this implementation support all forms of persistent clock which are
known to exist on i386 platforms?

If/when you issue new changelogs, please describe what has to be done to
port other architectures over to use this overall framework.

Do ports for other architectures exist?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 07/23] cleanup: uninline irq_enter() and move it into a function
  2006-09-29 23:58 ` [patch 07/23] cleanup: uninline irq_enter() and move it into a function Thomas Gleixner
@ 2006-09-30  8:36   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:26 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> --- linux-2.6.18-mm2.orig/kernel/softirq.c	2006-09-30 01:41:13.000000000 +0200
> +++ linux-2.6.18-mm2/kernel/softirq.c	2006-09-30 01:41:16.000000000 +0200
> @@ -279,6 +279,13 @@ EXPORT_SYMBOL(do_softirq);
>  # define invoke_softirq()	do_softirq()
>  #endif
>  
> +extern void irq_enter(void)

unneeded `extern'.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 08/23] dynticks: prepare the RCU code
  2006-09-29 23:58 ` [patch 08/23] dynticks: prepare the RCU code Thomas Gleixner
@ 2006-09-30  8:36   ` Andrew Morton
  2006-09-30 12:25     ` Dipankar Sarma
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:36 UTC (permalink / raw)
  To: Thomas Gleixner, Dipankar Sarma, Paul E. McKenney
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:27 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> prepare the RCU code for dynticks/nohz. Since on nohz kernels there
> is no guaranteed timer IRQ that processes RCU callbacks, the idle
> code has to make sure that all RCU callbacks that can be finished
> off are indeed finished off. This patch adds the necessary APIs:
> rcu_advance_callbacks() [to register quiescent state] and
> rcu_process_callbacks() [to finish finishable RCU callbacks].
> 
> ...
>  
> +void rcu_advance_callbacks(int cpu, int user)
> +{
> +	if (user ||
> +	    (idle_cpu(cpu) && !in_softirq() &&
> +				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
> +		rcu_qsctr_inc(cpu);
> +		rcu_bh_qsctr_inc(cpu);
> +	} else if (!in_softirq())
> +		rcu_bh_qsctr_inc(cpu);
> +}
> +

I hope this function is immediately clear to the RCU maintainers, because it's
complete mud to me.

An introductory comment which describes what this function does and how it
does it seems appropriate.  And some words which decrypt the tests in there
might be needed too.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 09/23] dynticks: extend next_timer_interrupt() to use a reference jiffie
  2006-09-29 23:58 ` [patch 09/23] dynticks: extend next_timer_interrupt() to use a reference jiffie Thomas Gleixner
@ 2006-09-30  8:37   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:28 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> For CONFIG_NO_HZ we need to calculate the next timer wheel event based
> to a given jiffie value. Extend the existing code to allow the extra now
> argument. Provide a compability function for the existing implementations
> to call the function with now = jiffies.
> This also solves the racyness of the original code vs. jiffies changing
> during the iteration.
> 

I think this change has the potential to significantly increase the hold
time of tvec_base_t.lock.  Quite a lot of code has been moved inside that
lock, but most worrisome is that hrtimer_get_next_event() is also now
inside it.

What workloads is this change likely to impact, and how can we set about
verifying that we aren't introducing a problem?

Was that (unchangelogged) locking change even needed?  If so, why?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 10/23] hrtimers: clean up locking
  2006-09-29 23:58 ` [patch 10/23] hrtimers: clean up locking Thomas Gleixner
@ 2006-09-30  8:37   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:29 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> improve kernel/hrtimers.c locking: use a per-CPU base with a lock to
> control locking of all clocks belonging to a CPU. This simplifies
> code that needs to lock all clocks at once. This makes life easier
> for high-res timers and dyntick. No functional change should happen
> due to this.
> 
> .. 
>
> -struct hrtimer_base;
> +struct hrtimer_clock_base;

It is better to place these forward declarations right at the top of the
include file.  That prevents people from later accidentally adding another
forward declaration of the same structure at an earlier point in the file,
and keeps all the same types of thing in the same place.

(two instances in this file)

> + * struct hrtimer_cpu_base - the per cpu clock bases
> + * @lock:		lock protecting the base and associated clock bases and timers

Looks crappy in 80-cols.  But I don't know if breaking this line will break
kerneldoc?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 11/23] hrtimers: state tracking
  2006-09-29 23:58 ` [patch 11/23] hrtimers: state tracking Thomas Gleixner
@ 2006-09-30  8:37   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:30 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> reintroduce ktimers feature "optimized away" by the ktimers
> review process: multiple hrtimer states to enable the running
> of hrtimers without holding the cpu-base-lock.
> 
> (the "optimized" rbtree hack carried only 2 states worth of
> information and we need 3.)
> 

uh, I'll believe you ;)

> -#define HRTIMER_INACTIVE	((void *)1UL)
> +#define HRTIMER_INACTIVE	0x00
> +#define HRTIMER_ACTIVE		0x01
> +#define HRTIMER_CALLBACK	0x02
>  
>  struct hrtimer_clock_base;
>  
> @@ -54,6 +56,7 @@ struct hrtimer {
>  	ktime_t				expires;
>  	int				(*function)(struct hrtimer *);
>  	struct hrtimer_clock_base	*base;
> +	unsigned long			state;

I assume that `state' here takes the above enumerated values HRTIMER_*?

Using an enum would make that explicit, and more understandable.

Does it really need to be a long type?

>  static inline int hrtimer_active(const struct hrtimer *timer)
>  {
> -	return rb_parent(&timer->node) != &timer->node;
> +	return timer->state != HRTIMER_INACTIVE;
>  }

This implies that HRTIMER_CALLBACK is an "active" state, yes?  If so, how
come?  Perhaps a comment here would aid understandability.

> +	timer->state |= HRTIMER_ACTIVE;

No!  It's a bitfield!  The plot thickens.

How come hrtimer_active() tests for equality of all bits if it's a bitfield?

> +	timer->state = newstate;

No, it's not a bitfield.  It's a scalar.

> +	if (!(timer->state & HRTIMER_CALLBACK))

whoop, it's a bitfield again.

>  		ret = remove_hrtimer(timer, base);
>  
>  	unlock_hrtimer_base(timer, &flags);
> @@ -592,7 +594,6 @@ void hrtimer_init(struct hrtimer *timer,
>  		clock_id = CLOCK_MONOTONIC;
>  
>  	timer->base = &cpu_base->clock_base[clock_id];
> -	rb_set_parent(&timer->node, &timer->node);
>  }
>  EXPORT_SYMBOL_GPL(hrtimer_init);
>  
> @@ -643,13 +644,14 @@ static inline void run_hrtimer_queue(str
>  
>  		fn = timer->function;
>  		set_curr_timer(cpu_base, timer);
> -		__remove_hrtimer(timer, base);
> +		__remove_hrtimer(timer, base, HRTIMER_CALLBACK);

How come this was assigned to state, and not or-ed into it?

> +		timer->state &= ~HRTIMER_CALLBACK;

Please document the locking for timer->state.

Please also document its various states.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 13/23] clockevents: core
  2006-09-29 23:58 ` [patch 13/23] clockevents: core Thomas Gleixner
@ 2006-09-30  8:39   ` Andrew Morton
  2006-10-03  4:33     ` John Kacur
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:32 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> We have two types of clock event devices:
> - global events (one device per system)
> - local events (one device per cpu)
> 
> We assign the various time(r) related interrupts to those devices:
> 
> - global tick
> - profiling (per cpu)
> - next timer events (per cpu)
> 
> architectures register their clockevent sources, with specific capability
> masks set, and the generic high-res-timers code picks the best one,
> without the architecture having to worry about that.
> 
> here are the capabilities a clockevent driver can register:
> 
>  #define CLOCK_CAP_TICK		0x000001
>  #define CLOCK_CAP_UPDATE	0x000002
>  #define CLOCK_CAP_PROFILE	0x000004
>  #define CLOCK_CAP_NEXTEVT	0x000008

OK..  Perhaps this info is worth promoting to a code comment.

> +++ linux-2.6.18-mm2/include/linux/clockchips.h	2006-09-30 01:41:17.000000000 +0200
> @@ -0,0 +1,104 @@
> +/*  linux/include/linux/clockchips.h
> + *
> + *  This file contains the structure definitions for clockchips.
> + *
> + *  If you are not a clockchip, or the time of day code, you should
> + *  not be including this file!
> + */
> +#ifndef _LINUX_CLOCKCHIPS_H
> +#define _LINUX_CLOCKCHIPS_H
> +
> +#include <linux/config.h>

The build system includes config.h for you.

> +#ifdef CONFIG_GENERIC_TIME
> +
> +#include <linux/clocksource.h>
> +#include <linux/interrupt.h>
> +
> +/* Clock event mode commands */
> +enum {
> +	CLOCK_EVT_PERIODIC,
> +	CLOCK_EVT_ONESHOT,
> +	CLOCK_EVT_SHUTDOWN,
> +};
> +
> +/* Clock event capability flags */
> +#define CLOCK_CAP_TICK		0x000001
> +#define CLOCK_CAP_UPDATE	0x000002
> +#ifndef CONFIG_PROFILE_NMI
> +# define CLOCK_CAP_PROFILE	0x000004
> +#else
> +# define CLOCK_CAP_PROFILE	0x000000
> +#endif
> +#ifdef CONFIG_HIGH_RES_TIMERS
> +# define CLOCK_CAP_NEXTEVT	0x000008
> +#else
> +# define CLOCK_CAP_NEXTEVT	0x000000
> +#endif

There is no CONFIG_PROFILE_NMI in the kernel nor anywhere else in this
patchset.

> +#define CLOCK_BASE_CAPS_MASK	(CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | \
> +				 CLOCK_CAP_UPDATE)
> +#define CLOCK_CAPS_MASK		(CLOCK_BASE_CAPS_MASK | CLOCK_CAP_NEXTEVT)
> +
> +struct clock_event;
> +
> +/**
> + * struct clock_event - clock event descriptor
> + *
> + * @name:		ptr to clock event name
> + * @capabilities:	capabilities of the event chip
> + * @max_delta_ns:	maximum delta value in ns
> + * @min_delta_ns:	minimum delta value in ns
> + * @mult:		nanosecond to cycles multiplier
> + * @shift:		nanoseconds to cycles divisor (power of two)
> + * @set_next_event:	set next event
> + * @set_mode:		set mode function
> + * @suspend:		suspend function (optional)
> + * @resume:		resume function (optional)
> + * @evthandler:		Assigned by the framework to be called by the low
> + *			level handler of the event source
> + */
> +struct clock_event {
> +	const char	*name;
> +	unsigned int	capabilities;
> +	unsigned long	max_delta_ns;
> +	unsigned long	min_delta_ns;
> +	unsigned long	mult;
> +	int		shift;
> +	void		(*set_next_event)(unsigned long evt,
> +					  struct clock_event *);
> +	void		(*set_mode)(int mode, struct clock_event *);
> +	int		(*suspend)(struct clock_event *);
> +	int		(*resume)(struct clock_event *);
> +	void		(*event_handler)(struct pt_regs *regs);
> +};

hm.  The term "clock_event" implies "something which happens": ie, an
event.

But a `struct clock_event' is, what?  Actually a source of events?

Is this a well-chosen name?

> +/*
> + * Calculate a multiplication factor
> + */
> +static inline unsigned long div_sc(unsigned long a, unsigned long b,
> +				   int shift)
> +{
> +	uint64_t tmp = ((uint64_t)a) << shift;
> +	do_div(tmp, b);
> +	return (unsigned long) tmp;
> +}

What does "div_sc" stand for??

> Index: linux-2.6.18-mm2/kernel/time/Makefile
> ===================================================================
> --- linux-2.6.18-mm2.orig/kernel/time/Makefile	2006-09-30 01:41:11.000000000 +0200
> +++ linux-2.6.18-mm2/kernel/time/Makefile	2006-09-30 01:41:17.000000000 +0200
> @@ -1 +1,3 @@
>  obj-y += ntp.o clocksource.o jiffies.o
> +
> +obj-$(CONFIG_GENERIC_TIME) += clockevents.o
> Index: linux-2.6.18-mm2/kernel/time/clockevents.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.18-mm2/kernel/time/clockevents.c	2006-09-30 01:41:17.000000000 +0200
> @@ -0,0 +1,527 @@
> +/*
> + * linux/kernel/time/clockevents.c
> + *
> + * This file contains functions which manage clock event drivers.
> + *
> + * Copyright(C) 2005-2006, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2005-2006, Red Hat, Inc., Ingo Molnar
> + *
> + * We have two types of clock event devices:
> + * - global events (one device per system)
> + * - local events (one device per cpu)

So perhaps s/clock_event/clock_event_driver/.

Or clock_event_device?

> + * We assign the various time(r) related interrupts to those devices
> + *
> + * - global tick
> + * - profiling (per cpu)
> + * - next timer events (per cpu)
> + *
> + * TODO:
> + * - implement variable frequency profiling
> + *
> + * This code is licenced under the GPL version 2. For details see
> + * kernel-base/COPYING.
> + */
> +
> +#include <linux/clockchips.h>
> +#include <linux/cpu.h>
> +#include <linux/irq.h>
> +#include <linux/init.h>
> +#include <linux/notifier.h>
> +#include <linux/module.h>
> +#include <linux/percpu.h>
> +#include <linux/profile.h>
> +#include <linux/sysdev.h>
> +#include <linux/hrtimer.h>
> +
> +#define MAX_CLOCK_EVENTS	4
> +#define GLOBAL_CLOCK_EVENT	MAX_CLOCK_EVENTS
> +
> +struct event_descr {
> +	struct clock_event *event;
> +	unsigned int mode;
> +	unsigned int real_caps;
> +	struct irqaction action;
> +};
> +
> +struct local_events {
> +	int installed;
> +	struct event_descr events[MAX_CLOCK_EVENTS];
> +	struct clock_event *nextevt;
> +};
> +
> +/* Variables related to the global event source */
> +static __read_mostly struct event_descr global_eventsource;
> +
> +/* Variables related to the per cpu local event sources */
> +static DEFINE_PER_CPU(struct local_events, local_eventsources);
> +
> +/* lock to protect the above */
> +static DEFINE_SPINLOCK(events_lock);

Does "the above" really refer to cpu-local storage?

> +/*
> + * Recalc the events and reassign the handlers if necessary
> + */
> +static int recalc_events(struct local_events *sources, struct clock_event *evt,
> +			 unsigned int caps, int new)

It's good to document the caller-provided environmental requirements.  I
see from the callers that this requires spin_lock_irq(&events_lock).

> +{
> +	int i;
> +
> +	if (new && sources->installed == MAX_CLOCK_EVENTS)
> +		return -ENOSPC;
> +
> +	/*
> +	 * If there is no handler and this is not a next-event capable
> +	 * event source, refuse to handle it
> +	 */
> +	if (!evt->capabilities & CLOCK_CAP_NEXTEVT && !event_handlers[caps]) {

bug - needs parentheses.

> +		printk(KERN_ERR "Unsupported event source %s\n", evt->name);
> +		return -EINVAL;
> +	}
> +
> +	if (caps && global_eventsource.event && global_eventsource.event != evt)
> +		recalc_active_event(&global_eventsource, caps);
> +
> +	for (i = 0; i < sources->installed; i++) {
> +		if (sources->events[i].event != evt)
> +			recalc_active_event(&sources->events[i], caps);
> +	}
> +
> +	if (new)
> +		sources->events[sources->installed++].event = evt;
> +
> +	if (caps) {
> +		/* Is next_event event source going to be installed? */
> +		if (caps & CLOCK_CAP_NEXTEVT)
> +			caps = CLOCK_CAP_NEXTEVT;
> +
> +		setup_event(&sources->events[sources->installed],
> +			    evt, caps);
> +	} else
> +		printk(KERN_INFO "Inactive event source %s registered\n",
> +		       evt->name);
> +
> +	return 0;
> +}
> +
> +/**
> + * register_local_clockevent - Set up a cpu local clock event device

We have a mixture of clock_event and clockevent.

> + *
> + * @evt:	event device to be registered
> + */
> +int register_local_clockevent(struct clock_event *evt)
> +{
> +	struct local_events *sources = &__get_cpu_var(local_eventsources);

event_sources?

> +	unsigned long flags;
> +	int ret;
> +
> +	spin_lock_irqsave(&events_lock, flags);
> +
> +	/* Preset the handler in any case */
> +	evt->event_handler = handle_noop;
> +
> +	/* Recalc event sources and maybe reassign handlers */
> +	ret = recalc_events(sources, evt,
> +			    evt->capabilities & CLOCK_BASE_CAPS_MASK, 1);
> +
> +	spin_unlock_irqrestore(&events_lock, flags);
> +
> +	/*
> +	 * Trigger hrtimers, when the event source is next-event
> +	 * capable
> +	 */
> +	if (!ret && (evt->capabilities & CLOCK_CAP_NEXTEVT))
> +		hrtimer_clock_notify();
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(register_local_clockevent);
> +
> +/*
> + * Find a next-event capable event source
> + */
> +static int get_next_event_source(void)
> +{
> +	struct local_events *sources = &__get_cpu_var(local_eventsources);
> +	int i;
> +
> +	for (i = 0; i < sources->installed; i++) {
> +		struct clock_event *evt;
> +
> +		evt = sources->events[i].event;
> +		if (evt->capabilities & CLOCK_CAP_NEXTEVT)
> +			return i;
> +	}
> +
> +#ifndef CONFIG_SMP
> +	if (global_eventsource.event->capabilities & CLOCK_CAP_NEXTEVT)
> +		return GLOBAL_CLOCK_EVENT;
> +#endif

How come this is UP-only?  Perhaps a comment describing what's going on here.

> +	return -ENODEV;
> +}
> +
> +/**
> + * clockevents_next_event_available - Check for a installed next-event source
> + */
> +int clockevents_next_event_available(void)
> +{
> +	unsigned long flags;
> +	int idx;
> +
> +	spin_lock_irqsave(&events_lock, flags);
> +	idx = get_next_event_source();
> +	spin_unlock_irqrestore(&events_lock, flags);
> +
> +	return idx < 0 ? 0 : 1;

Perhaps IS_ERR_VALUE() could be used to make this code clearer.

I really wish kerneldoc had a standard way of describing return values. 
People often leave it out, and it's important.

(Although it's fairly obvious here due to the well-chosen function name)

(But it'll be even better when generic-boolean is merged, and people start
using it).

> +}
> +
> +int clockevents_init_next_event(void)
> +{
> +	struct local_events *sources = &__get_cpu_var(local_eventsources);
> +	struct clock_event *nextevt;
> +	unsigned long flags;
> +	int idx, ret = -ENODEV;
> +
> +	if (sources->nextevt)
> +		return -EBUSY;
> +
> +	spin_lock_irqsave(&events_lock, flags);
> +
> +	idx = get_next_event_source();
> +	if (idx < 0)
> +		goto out_unlock;
> +
> +	if (idx == GLOBAL_CLOCK_EVENT)
> +		nextevt = global_eventsource.event;
> +	else
> +		nextevt = sources->events[idx].event;
> +
> +	ret = recalc_events(sources, nextevt, CLOCK_CAPS_MASK, 0);
> +	if (!ret)
> +		sources->nextevt = nextevt;
> + out_unlock:
> +	spin_unlock_irqrestore(&events_lock, flags);
> +
> +	return ret;
> +}
> +
> +int clockevents_set_next_event(ktime_t expires, int force)
> +{
> +	struct local_events *sources = &__get_cpu_var(local_eventsources);
> +	int64_t delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
> +	struct clock_event *nextevt = sources->nextevt;
> +	unsigned long long clc;
> +
> +	if (delta <= 0 && !force)
> +		return -ETIME;
> +
> +	if (delta > nextevt->max_delta_ns)
> +		delta = nextevt->max_delta_ns;
> +	if (delta < nextevt->min_delta_ns)
> +		delta = nextevt->min_delta_ns;
> +
> +	clc = delta * nextevt->mult;
> +	clc >>= nextevt->shift;
> +	nextevt->set_next_event((unsigned long)clc, sources->nextevt);
> +
> +	return 0;
> +}

These functions are exported to the whole kernel, but are undocumented.

AFAIK the timer and hrtimer code consistently uses s64's and u64's.  But
someone has snuck a couple of int64_t's in here.  Please review all the
patches for that.

> +/*
> + * Resume the cpu local clock events
> + */
> +static void clockevents_resume_local_events(void *arg)
> +{
> +	struct local_events *sources = &__get_cpu_var(local_eventsources);
> +	int i;
> +
> +	for (i = 0; i < sources->installed; i++) {
> +		if (sources->events[i].real_caps)
> +			startup_event(sources->events[i].event,
> +				      sources->events[i].real_caps);
> +	}
> +}
> +
> +/*
> + * Called after timekeeping is functional again
> + */
> +void clockevents_resume_events(void)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	/* Resume global event source */
> +	if (global_eventsource.real_caps)
> +		startup_event(global_eventsource.event,
> +			      global_eventsource.real_caps);
> +
> +	clockevents_resume_local_events(NULL);
> +	local_irq_restore(flags);
> +
> +	touch_softlockup_watchdog();
> +
> +	if (smp_call_function(clockevents_resume_local_events, NULL, 1, 1))
> +		BUG();
> +
> +}

hm.  The kernel's core resume code likes to call resume handlers under
local_irq_disable().  Does that happen here?  A BUG_ON(irqs_disabled())
would tell.

The above code can be simplified via on_each_cpu().

> +/*
> + * Functions related to initialization and hotplug
> + */
> +static int clockevents_cpu_notify(struct notifier_block *self,
> +				  unsigned long action, void *hcpu)
> +{
> +	switch(action) {
> +	case CPU_UP_PREPARE:
> +		break;
> +#ifdef CONFIG_HOTPLUG_CPU
> +	case CPU_DEAD:
> +		/*
> +		 * Do something sensible here !
> +		 * Disable the cpu local clocksources
> +		 */
> +		break;
> +#endif
> +	default:
> +		break;
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block __devinitdata clockevents_nb = {
> +	.notifier_call	= clockevents_cpu_notify,
> +};
> +
> +void __init clockevents_init(void)
> +{
> +	clockevents_cpu_notify(&clockevents_nb, (unsigned long)CPU_UP_PREPARE,
> +				(void *)(long)smp_processor_id());
> +	register_cpu_notifier(&clockevents_nb);
> +}
> 

No...  None of this code should be present if !CONFIG_HOTPLUG_CPU.  See
cpuid_class_cpu_callback() for an example
.  

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 14/23] clockevents: drivers for i386
  2006-09-29 23:58 ` [patch 14/23] clockevents: drivers for i386 Thomas Gleixner
@ 2006-09-30  8:40   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:33 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> add clockevent drivers for i386: lapic (local) and PIT (global).
> Update the timer IRQ to call into the PIT driver's event handler
> and the lapic-timer IRQ to call into the lapic clockevent driver.
> 

What's the story on other clock sources?  hpet, pm-timer?

> --- linux-2.6.18-mm2.orig/arch/i386/kernel/apic.c	2006-09-30 01:41:11.000000000 +0200
> +++ linux-2.6.18-mm2/arch/i386/kernel/apic.c	2006-09-30 01:41:18.000000000 +0200
> @@ -25,6 +25,7 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/sysdev.h>
>  #include <linux/cpu.h>
> +#include <linux/clockchips.h>
>  #include <linux/module.h>
>  
>  #include <asm/atomic.h>
> @@ -70,6 +71,23 @@ static inline void lapic_enable(void)
>   */
>  int apic_verbosity;
>  
> +static unsigned int calibration_result;
> +
> +static void lapic_next_event(unsigned long delta, struct clock_event *evt);
> +static void lapic_timer_setup(int mode, struct clock_event *evt);
> +
> +static struct clock_event lapic_clockevent = {
> +	.name = "lapic",
> +	.capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
> +#ifdef CONFIG_SMP
> +			| CLOCK_CAP_UPDATE
> +#endif

I can't work out why this is SMP-only.  Please add changelog information
or, better, a comment explaining this.

> -static void __devinit setup_APIC_timer(unsigned int clocks)
> +static void lapic_timer_setup(int mode, struct clock_event *evt)
>  {
>  	unsigned long flags;
>  
>  	local_irq_save(flags);
> +	__setup_APIC_LVTT(calibration_result, mode != CLOCK_EVT_PERIODIC);
> +	local_irq_restore(flags);
> +}
>  
> -	/*
> -	 * Wait for IRQ0's slice:
> -	 */
> -	wait_timer_tick();
> -	wait_timer_tick();

For what reason was setup_APIC_timer() "Waiting for IRQ0's slice" and why
is it safe to remove that?

> +#ifdef CONFIG_HIGH_RES_TIMERS
> +		printk("Disabling NO_HZ and high resolution timers "
> +		       "due to timer broadcasting\n");
> +		for_each_possible_cpu(cpu)
> +			per_cpu(lapic_events, cpu).capabilities &=
> +				~CLOCK_CAP_NEXTEVT;
> +#endif

I think you mean "due to lack of timer broadcasting"?

No comment, no changelog entry -> I don't understand why this code is here.

>   *
>   */
> -#include <linux/clocksource.h>
> +#include <linux/clockchips.h>
>  #include <linux/spinlock.h>
>  #include <linux/jiffies.h>
>  #include <linux/sysdev.h>
> @@ -19,19 +19,63 @@
>  DEFINE_SPINLOCK(i8253_lock);
>  EXPORT_SYMBOL(i8253_lock);
>  
> -void setup_pit_timer(void)
> +static void init_pit_timer(int mode, struct clock_event *evt)

No. `mode' is not an integer.  It has type `enum you_forgot_to_give_it_a_name'.

In several places the timer code is using enums as integers - basically
using them as a glorified #define.

This loses the main advantages of enums: readability.  They permit the
reader to say "ah, that's a clock event type" instead of "oh, thats an
integer".

Please give those enums a name, and use it everywhere, rather than relying
upon implicit conversion to `int'.

(C sucks)

> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&i8253_lock, flags);
> +
> +	switch(mode) {
> +	case CLOCK_EVT_PERIODIC:
> +		/* binary, mode 2, LSB/MSB, ch 0 */
> +		outb_p(0x34, PIT_MODE);
> +		udelay(10);
> +		outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
> +		outb(LATCH >> 8 , PIT_CH0);	/* MSB */
> +		break;
> +
> +	case CLOCK_EVT_ONESHOT:
> +	case CLOCK_EVT_SHUTDOWN:
> +		/* One shot setup */
> +		outb_p(0x38, PIT_MODE);
> +		udelay(10);
> +		break;
> +	}
> +	spin_unlock_irqrestore(&i8253_lock, flags);
> +}
> +
> +static void pit_next_event(unsigned long delta, struct clock_event *evt)
>  {
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&i8253_lock, flags);
> -	outb_p(0x34,PIT_MODE);		/* binary, mode 2, LSB/MSB, ch 0 */
> -	udelay(10);
> -	outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
> -	udelay(10);
> -	outb(LATCH >> 8 , PIT_CH0);	/* MSB */
> +	outb_p(delta & 0xff , PIT_CH0);	/* LSB */
> +	outb(delta >> 8 , PIT_CH0);	/* MSB */
>  	spin_unlock_irqrestore(&i8253_lock, flags);
>  }
>  
> +struct clock_event pit_clockevent = {
> +	.name		= "pit",
> +	.capabilities	= CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | CLOCK_CAP_UPDATE
> +#ifndef CONFIG_SMP
> +			| CLOCK_CAP_NEXTEVT
> +#endif

Again, the CONFIG_SMP conditionality is a complete mystery to this reader.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 15/23] high-res timers: core
  2006-09-29 23:58 ` [patch 15/23] high-res timers: core Thomas Gleixner
@ 2006-09-30  8:43   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:34 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> add the core bits of high-res timers support.
> 
> the design makes use of the existing hrtimers subsystem which manages a
> per-CPU and per-clock tree of timers, and the clockevents framework, which
> provides a standard API to request programmable clock events from. The
> core code does not have to know about the clock details - it makes use
> of clockevents_set_next_event().
> 
> the code also provides dyntick functionality: it is implemented via a
> per-cpu sched_tick hrtimer that is set to HZ frequency, but which is
> reprogrammed to a longer timeout before going idle, and reprogrammed to
> HZ again once the CPU goes busy again. (If an non-timer IRQ hits the
> idle task then it will process jiffies before calling the IRQ code.)
> 
> the impact to non-high-res architectures is intended to be minimal.
> 
> ...
>  
> @@ -108,17 +134,53 @@ struct hrtimer_cpu_base {
>  	spinlock_t			lock;
>  	struct lock_class_key		lock_key;
>  	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
> +#ifdef CONFIG_HIGH_RES_TIMERS
> +	ktime_t				expires_next;
> +	int				hres_active;
> +	unsigned long			check_clocks;
> +	struct list_head		cb_pending;
> +	struct hrtimer			sched_timer;
> +	struct pt_regs			*sched_regs;
> +	unsigned long			events;
> +#endif

You forgot to update the kerneldoc for this struct.

Does `events' needs to be long?

<looks>

oh, it's a scalar this time ;)

> +#ifdef CONFIG_HIGH_RES_TIMERS
> +
> +extern void hrtimer_clock_notify(void);
> +extern void clock_was_set(void);
> +extern void hrtimer_interrupt(struct pt_regs *regs);
> +
> +# define hrtimer_cb_get_time(t)	(t)->base->get_time()
> +# define hrtimer_hres_active	(__get_cpu_var(hrtimer_bases).hres_active)

These two could be inline functions?

That might cause include file ordering problems I guess.

> +/*
> + * The resolution of the clocks. The resolution value is returned in
> + * the clock_getres() system call to give application programmers an
> + * idea of the (in)accuracy of timers. Timer values are rounded up to
> + * this resolution values.
> + */
> +# define KTIME_HIGH_RES		(ktime_t) { .tv64 = CONFIG_HIGH_RES_RESOLUTION }
> +# define KTIME_MONOTONIC_RES	KTIME_HIGH_RES
> +
> +#else
> +
> +# define KTIME_MONOTONIC_RES	KTIME_LOW_RES
> +
>  /*
>   * clock_was_set() is a NOP for non- high-resolution systems. The
>   * time-sorted order guarantees that a timer does not expire early and
>   * is expired in the next softirq when the clock was advanced.
>   */
> -#define clock_was_set()		do { } while (0)
> -#define hrtimer_clock_notify()	do { } while (0)
> -extern ktime_t ktime_get(void);
> -extern ktime_t ktime_get_real(void);
> +# define clock_was_set()		do { } while (0)
> +# define hrtimer_clock_notify()		do { } while (0)

these could be inlines.

> +# define hrtimer_cb_get_time(t)		(t)->base->softirq_time

Does this need parenthesisation?  Probably it's OK..  An inline function
would be nicer.

> +# define hrtimer_hres_active		0

Perhaps this would be better if it was presented as a function.

> +/* High resolution timer related functions */
> +#ifdef CONFIG_HIGH_RES_TIMERS
> +
> +static ktime_t last_jiffies_update;

What's this do?

> +/*
> + * Reprogramm the event source with checking both queues for the

"Reprogramme" ;)

> + * next event
> + * Called with interrupts disabled and base->lock held
> + */
> +static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base)
> +{
> +	int i;
> +	struct hrtimer_clock_base *base = cpu_base->clock_base;
> +	ktime_t expires;
> +
> +	cpu_base->expires_next.tv64 = KTIME_MAX;
> +
> +	for (i = HRTIMER_MAX_CLOCK_BASES; i ; i--, base++) {

Downcounting loops hurt my brain.  Does it actually generate better code?

> +		struct hrtimer *timer;
> +
> +		if (!base->first)
> +			continue;
> +		timer = rb_entry(base->first, struct hrtimer, node);
> +		expires = ktime_sub(timer->expires, base->offset);
> +		if (expires.tv64 < cpu_base->expires_next.tv64)
> +			cpu_base->expires_next = expires;
> +	}
> +
> +	if (cpu_base->expires_next.tv64 != KTIME_MAX)
> +		clockevents_set_next_event(cpu_base->expires_next, 1);
> +}
> +
> +/*
> + * Shared reprogramming for clock_realtime and clock_monotonic
> + *
> + * When a new expires first timer is enqueued, we have

That sentence might need work.

> +/*
> + * Retrigger next event is called after clock was set
> + */
> +static void retrigger_next_event(void *arg)
> +{
> +	struct hrtimer_cpu_base *base;
> +	struct timespec realtime_offset;
> +	unsigned long flags, seq;
> +
> +	do {
> +		seq = read_seqbegin(&xtime_lock);
> +		set_normalized_timespec(&realtime_offset,
> +					-wall_to_monotonic.tv_sec,
> +					-wall_to_monotonic.tv_nsec);
> +	} while (read_seqretry(&xtime_lock, seq));
> +
> +	base = &per_cpu(hrtimer_bases, smp_processor_id());
> +
> +	/* Adjust CLOCK_REALTIME offset */
> +	spin_lock_irqsave(&base->lock, flags);
> +	base->clock_base[CLOCK_REALTIME].offset =
> +		timespec_to_ktime(realtime_offset);
> +
> +	hrtimer_force_reprogram(base);
> +	spin_unlock_irqrestore(&base->lock, flags);
> +}
> +
> +/*
> + * Clock realtime was set
> + *
> + * Change the offset of the realtime clock vs. the monotonic
> + * clock.
> + *
> + * We might have to reprogram the high resolution timer interrupt. On
> + * SMP we call the architecture specific code to retrigger _all_ high
> + * resolution timer interrupts. On UP we just disable interrupts and
> + * call the high resolution interrupt code.
> + */
> +void clock_was_set(void)
> +{
> +	preempt_disable();
> +	if (hrtimer_hres_active) {
> +		retrigger_next_event(NULL);
> +
> +		if (smp_call_function(retrigger_next_event, NULL, 1, 1))
> +			BUG();
> +	}
> +	preempt_enable();
> +}

If you use on_each_cpu() here you know that retrigger_next_event() will be
called under local_irq_disable().  The preempt_disable() goes away and the
spin_lock_irqsave() in retrigger_next_event() becomes a spin_lock() and
everything becomes simpler.

> +/**
> + * hrtimer_clock_notify - A clock source or a clock event has been installed
> + *
> + * Notify the per cpu softirqs to recheck the clock sources and events
> + */
> +void hrtimer_clock_notify(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < NR_CPUS; i++)
> +		set_bit(0, &per_cpu(hrtimer_bases, i).check_clocks);
> +}

This will go splat if/when the arch chooses to not implement per-cpu
storage for not-possible CPUs.  Use for_each_possible_cpu().

> +
> +static const ktime_t nsec_per_hz = { .tv64 = NSEC_PER_SEC / HZ };
> +

This could use the same trick as KTIME_HIGH_RES and friends.  But perhaps
the compiler will generate the same code..

> +/*
> + * We switched off the global tick source when switching to high resolution
> + * mode. Update jiffies64.
> + *
> + * Must be called with interrupts disabled !
> + *
> + * FIXME: We need a mechanism to assign the update to a CPU. In principle this
> + * is not hard, but when dynamic ticks come into play it starts to be. We don't
> + * want to wake up a complete idle cpu just to update jiffies, so we need
> + * something more intellegent than a mere "do this only on CPUx".
> + */
> +static void update_jiffies64(ktime_t now)
> +{
> +	ktime_t delta;
> +
> +	write_seqlock(&xtime_lock);
> +
> +	delta = ktime_sub(now, last_jiffies_update);
> +	if (delta.tv64 >= nsec_per_hz.tv64) {
> +

stray blank line.

> +		unsigned long orun = 1;

"orun"?

> +
> +		delta = ktime_sub(delta, nsec_per_hz);
> +		last_jiffies_update = ktime_add(last_jiffies_update,
> +						nsec_per_hz);
> +
> +		/* Slow path for long timeouts */
> +		if (unlikely(delta.tv64 >= nsec_per_hz.tv64)) {
> +			s64 incr = ktime_to_ns(nsec_per_hz);
> +			orun = ktime_divns(delta, incr);
> +
> +			last_jiffies_update = ktime_add_ns(last_jiffies_update,
> +							   incr * orun);
> +			jiffies_64 += orun;
> +			orun++;
> +		}

That's a bit of a hack isn't it?  do_timer() owns the modification of
jiffies_64, so why is this code modifying it as well?

> +		do_timer(orun);

twice?

I suspect a bug.

> +	}
> +	write_sequnlock(&xtime_lock);
> +}
> +
> +/*
> + * We rearm the timer until we get disabled by the idle code
> + */
> +static int hrtimer_sched_tick(struct hrtimer *timer)
> +{
> +	unsigned long flags;
> +	struct hrtimer_cpu_base *cpu_base =
> +		container_of(timer, struct hrtimer_cpu_base, sched_timer);
> +
> +	local_irq_save(flags);
> +	/*
> +	 * Do not call, when we are not in irq context and have
> +	 * no valid regs pointer
> +	 */
> +	if (cpu_base->sched_regs) {
> +		update_process_times(user_mode(cpu_base->sched_regs));
> +		profile_tick(CPU_PROFILING, cpu_base->sched_regs);
> +	}
> +
> +	hrtimer_forward(timer, hrtimer_cb_get_time(timer), nsec_per_hz);
> +	local_irq_restore(flags);
> +
> +	return HRTIMER_RESTART;

bah.  hrtimer_restart is an `enum hrtimer_restart', not an integer.

> +	printk(KERN_INFO "hrtimers: Switched to high resolution mode CPU %d\n",
> +	       smp_processor_id());

"on CPU"

> +
> +static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
> +					    struct hrtimer_clock_base *base)
> +{
> +	/*
> +	 * When High resolution timers are active try to reprogram. Note, that
> +	 * in case the state has HRTIMER_CALLBACK set, no reprogramming and no
> +	 * expiry check happens. The timer gets enqueued into the rbtree and
> +	 * the reprogramming / expiry check is done in the hrtimer_interrupt or
> +	 * in the softirq.
> +	 */

This (useful) comment should be above the function, not inside it.

> +	if (hrtimer_hres_active && hrtimer_reprogram(timer, base)) {
> +
> +		/* Timer is expired, act upon the callback mode */
> +		switch(timer->cb_mode) {
> +		case HRTIMER_CB_IRQSAFE_NO_RESTART:
> +			/*
> +			 * We can call the callback from here. No restart
> +			 * happens, so no danger of recursion
> +			 */
> +			BUG_ON(timer->function(timer) != HRTIMER_NORESTART);

Doing assert(thing-which-has-side-effects) is poor form.

I doubt if the kernel will work if someone goes and disables BUG_ON, but
it's a laudable objective.


> +			return 1;
> +		case HRTIMER_CB_IRQSAFE_NO_SOFTIRQ:
> +			/*
> +			 * This is solely for the sched tick emulation with
> +			 * dynamic tick support to ensure that we do not
> +			 * restart the tick right on the edge and end up with
> +			 * the tick timer in the softirq ! The calling site
> +			 * takes care of this.
> +			 */
> +			return 1;
> +		case HRTIMER_CB_IRQSAFE:
> +		case HRTIMER_CB_SOFTIRQ:
> +			/*
> +			 * Move everything else into the softirq pending list !
> +			 */
> +			hrtimer_add_cb_pending(timer, base);
> +			raise_softirq(HRTIMER_SOFTIRQ);
> +			return 1;
> +		default:
> +			BUG();
> +		}
> +	}
> +	return 0;
> +}
> +
> +static inline void hrtimer_resume_jiffie_update(void)

hrtimer_resume_jiffy_update

> +{
> +	unsigned long flags;
> +	ktime_t now = ktime_get();
> +
> +	write_seqlock_irqsave(&xtime_lock, flags);
> +	last_jiffies_update = now;
> +	write_sequnlock_irqrestore(&xtime_lock, flags);
> +}
> +
> +#else
> +
> +# define hrtimer_hres_active		0
> +# define hrtimer_check_clocks()		do { } while (0)
> +# define hrtimer_enqueue_reprogram(t,b)	0
> +# define hrtimer_force_reprogram(b)	do { } while (0)
> +# define hrtimer_cb_pending(t)		0
> +# define hrtimer_remove_cb_pending(t)	do { } while (0)
> +# define hrtimer_init_hres(c)		do { } while (0)
> +# define hrtimer_init_timer_hres(t)	do { } while (0)
> +# define hrtimer_resume_jiffie_update()	do { } while (0)
> +
> +#endif /* CONFIG_HIGH_RES_TIMERS */
> +
>  /*
>   * Timekeeping resumed notification

resume

> +#ifdef CONFIG_HIGH_RES_TIMERS
> +
> +/*
> + * High resolution timer interrupt
> + * Called with interrupts disabled
> + */
> +void hrtimer_interrupt(struct pt_regs *regs)
> +{
> +	struct hrtimer_clock_base *base;
> +	ktime_t expires_next, now;
> +	int i, raise = 0, cpu = smp_processor_id();
> +	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
> +
> +	BUG_ON(!cpu_base->hres_active);
> +
> +	/* Store the regs for an possible sched_timer callback */
> +	cpu_base->sched_regs = regs;
> +	cpu_base->events++;
> +
> + retry:
> +	now = ktime_get();
> +
> +	/* Check, if the jiffies need an update */
> +	update_jiffies64(now);
> +
> +	expires_next.tv64 = KTIME_MAX;
> +
> +	base = cpu_base->clock_base;
> +
> +	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
> +		ktime_t basenow;
> +		struct rb_node *node;
> +
> +		spin_lock(&cpu_base->lock);
> +
> +		basenow = ktime_add(now, base->offset);

Would it be better to take the lock outside the loop, rather than hammering
on it like this?


> +		while ((node = base->first)) {
> +			struct hrtimer *timer;
> +
> +			timer = rb_entry(node, struct hrtimer, node);
> +
> +			if (basenow.tv64 < timer->expires.tv64) {
> +				ktime_t expires;
> +
> +				expires = ktime_sub(timer->expires,
> +						    base->offset);
> +				if (expires.tv64 < expires_next.tv64)
> +					expires_next = expires;
> +				break;
> +			}
> +
> +			/* Move softirq callbacks to the pending list */
> +			if (timer->cb_mode == HRTIMER_CB_SOFTIRQ) {
> +				__remove_hrtimer(timer, base, HRTIMER_PENDING, 0);
> +				hrtimer_add_cb_pending(timer, base);
> +				raise = 1;
> +				continue;
> +			}
> +
> +			__remove_hrtimer(timer, base, HRTIMER_CALLBACK, 0);
> +
> +			if (timer->function(timer) != HRTIMER_NORESTART) {
> +				BUG_ON(timer->state != HRTIMER_CALLBACK);
> +				/*
> +				 * state == HRTIMER_CALLBACK prevents
> +				 * reprogramming. We do this when we break out
> +				 * of the loop !
> +				 */
> +				enqueue_hrtimer(timer, base);
> +			}
> +			timer->state &= ~HRTIMER_CALLBACK;
> +		}
> +		spin_unlock(&cpu_base->lock);
> +		base++;
> +	}
> +
> +	cpu_base->expires_next = expires_next;
> +
> +	/* Reprogramming necessary ? */
> +	if (expires_next.tv64 != KTIME_MAX) {
> +		if (clockevents_set_next_event(expires_next, 0))
> +			goto retry;
> +	}
> +
> +	/* Invalidate regs */
> +	cpu_base->sched_regs = NULL;
> +
> +	/* Raise softirq ? */
> +	if (raise)
> +		raise_softirq(HRTIMER_SOFTIRQ);
> +}
> +
>  
> ...
>
>  static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
> @@ -701,7 +1226,8 @@ static int __sched do_nanosleep(struct h
>  		set_current_state(TASK_INTERRUPTIBLE);
>  		hrtimer_start(&t->timer, t->timer.expires, mode);
>  
> -		schedule();
> +		if (likely(t->task))
> +			schedule();

Why?  Needs a comment.

> @@ -0,0 +1,22 @@
> +#
> +# Timer subsystem related configuration options
> +#
> +config HIGH_RES_TIMERS
> +	bool "High Resolution Timer Support"
> +	depends on GENERIC_TIME
> +	help
> +	  This option enables high resolution timer support. If your
> +	  hardware is not capable then this option only increases
> +	  the size of the kernel image.
> +
> +config HIGH_RES_RESOLUTION
> +	int "High Resolution Timer resolution (nanoseconds)"
> +	depends on HIGH_RES_TIMERS
> +	default 1000
> +	help
> +	  This sets the resolution in nanoseconds of the high resolution
> +	  timers. Too fine a resolution (small a number) will usually
> +	  not be observable due to normal system latencies.  For an
> +          800 MHz processor about 10,000 (10 microseconds) is recommended as a
> +	  finest resolution.  If you don't need that sort of resolution,
> +	  larger values may generate less overhead.

In that case the default is far too low.

What value are you suggesting that users and vendors set it to?



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 16/23] dynticks: core
  2006-09-29 23:58 ` [patch 16/23] dynticks: core Thomas Gleixner
@ 2006-09-30  8:44   ` Andrew Morton
  2006-09-30 12:11     ` Dipankar Sarma
  0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:44 UTC (permalink / raw)
  To: Thomas Gleixner, Paul E. McKenney, Dipankar Sarma
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:35 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> dynticks core code.
> 
> Add idling-stats to the cpu base (to be used to optimize power
> management decisions), add the scheduler tick and its stop/restart
> functions, and the jiffies-update function to be called when an irq
> context hits the idle context.
> 

I worry that we're making this feature optional.

Certainly for the public testing period we should wire these new features
to "on".

But long-term this is yet another question which we'll need to ask when
we're trying to work out why someone's computer failed.

> --- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-09-30 01:41:18.000000000 +0200
> +++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-09-30 01:41:18.000000000 +0200
> @@ -142,6 +142,14 @@ struct hrtimer_cpu_base {
>  	struct hrtimer			sched_timer;
>  	struct pt_regs			*sched_regs;
>  	unsigned long			events;
> +#ifdef CONFIG_NO_HZ
> +	ktime_t				idle_tick;
> +	int				tick_stopped;
> +	unsigned long			idle_jiffies;
> +	unsigned long			idle_calls;
> +	unsigned long			idle_sleeps;
> +	unsigned long			idle_sleeptime;
> +#endif

Forgot to update this structure's kerneldoc.

> +# define show_no_hz_stats(p)			do { } while (0)

static inlines provide type checking.

> @@ -451,7 +450,6 @@ static void update_jiffies64(ktime_t now
>  
>  			last_jiffies_update = ktime_add_ns(last_jiffies_update,
>  							   incr * orun);
> -			jiffies_64 += orun;
>  			orun++;
>  		}

I think we just fixed that bug I might have seen.

>  		do_timer(orun);
> @@ -459,28 +457,201 @@ static void update_jiffies64(ktime_t now
>  	write_sequnlock(&xtime_lock);
>  }
>  
> +#ifdef CONFIG_NO_HZ
> +/*
> + * Called from interrupt entry when then CPU was idle

tpyo

> + */
> +void update_jiffies(void)
> +{
> +	unsigned long flags;
> +	ktime_t now;
> +
> +	if (unlikely(!hrtimer_hres_active))
> +		return;
> +
> +	now = ktime_get();
> +
> +	local_irq_save(flags);
> +	update_jiffies64(now);
> +	local_irq_restore(flags);
> +}
> +
> +/*
> + * Called from the idle thread so careful!

about what?

> + */
> +int hrtimer_stop_sched_tick(void)
> +{
> +	int cpu = smp_processor_id();
> +	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
> +	unsigned long seq, last_jiffies, next_jiffies;
> +	ktime_t last_update, expires;
> +	unsigned long delta_jiffies;
> +	unsigned long flags;
> +
> +	if (unlikely(!hrtimer_hres_active))
> +		return 0;
> +
> +	local_irq_save(flags);

Do we really need local_irq_save() here?  If it's called from the idle
thread then presumably local IRQs are enabled already.  They'd better be,
because this function unconditionally enables them in a couple of places.

> +	do {
> +		seq = read_seqbegin(&xtime_lock);
> +		last_update = last_jiffies_update;
> +		last_jiffies = jiffies;
> +	} while (read_seqretry(&xtime_lock, seq));
> +
> +	next_jiffies = get_next_timer_interrupt(last_jiffies);
> +	delta_jiffies = next_jiffies - last_jiffies;
> +
> +	cpu_base->idle_calls++;
> +
> +	if ((long)delta_jiffies >= 1) {
> +		/*
> +		 * Save the current tick time, so we can restart the
> +		 * scheduler tick when we get woken up before the next
> +		 * wheel timer expires
> +		 */
> +		cpu_base->idle_tick = cpu_base->sched_timer.expires;
> +		expires = ktime_add_ns(last_update,
> +				       nsec_per_hz.tv64 * delta_jiffies);
> +		hrtimer_start(&cpu_base->sched_timer, expires, HRTIMER_ABS);
> +		cpu_base->idle_sleeps++;
> +		cpu_base->idle_jiffies = last_jiffies;
> +		cpu_base->tick_stopped = 1;
> +	} else {
> +		/* Keep the timer alive */
> +		if ((long) delta_jiffies < 0)
> +			raise_softirq(TIMER_SOFTIRQ);
> +	}
> +
> +	if (local_softirq_pending()) {
> +		inc_preempt_count();

I am unable to work out why the inc_preempt_count() is there.  Please add
comment.

> +		do_softirq();
> +		dec_preempt_count();
> +	}
> +
> +	WARN_ON(!idle_cpu(cpu));
> +	/*
> +	 * RCU normally depends on the timer IRQ kicking completion
> +	 * in every tick. We have to do this here now:
> +	 */
> +	if (rcu_pending(cpu)) {
> +		/*
> +		 * We are in quiescent state, so advance callbacks:
> +		 */
> +		rcu_advance_callbacks(cpu, 1);
> +		local_irq_enable();
> +		local_bh_disable();
> +		rcu_process_callbacks(0);
> +		local_bh_enable();
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return need_resched();
> +}

Are the RCU guys OK with this?

> +void hrtimer_restart_sched_tick(void)

Am unable to work out what this does from its implementation and from its
caller.  Please document it.


> +{
> +	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
> +	unsigned long flags;
> +	ktime_t now;
> +
> +	if (!hrtimer_hres_active || !cpu_base->tick_stopped)
> +		return;
> +
> +	/* Update jiffies first */
> +	now = ktime_get();
> +
> +	local_irq_save(flags);

The sole caller of this function calls it with local interrupts enabled. 
local_irq_disable() could be used here.

> +	update_jiffies64(now);
> +
> +	/*
> +	 * Update process times would randomly account the time we slept to
> +	 * whatever the context of the next sched tick is.  Enforce that this
> +	 * is accounted to idle !
> +	 */
> +	add_preempt_count(HARDIRQ_OFFSET);
> +	update_process_times(0);
> +	sub_preempt_count(HARDIRQ_OFFSET);
> +
> +	cpu_base->idle_sleeptime += jiffies - cpu_base->idle_jiffies;
> +
> +	cpu_base->tick_stopped  = 0;
> +	hrtimer_cancel(&cpu_base->sched_timer);
> +	cpu_base->sched_timer.expires = cpu_base->idle_tick;
> +
> +	while (1) {
> +		hrtimer_forward(&cpu_base->sched_timer, now, nsec_per_hz);
> +		hrtimer_start(&cpu_base->sched_timer,
> +			      cpu_base->sched_timer.expires, HRTIMER_ABS);
> +		if (hrtimer_active(&cpu_base->sched_timer))
> +			break;
> +		/* We missed an update */
> +		update_jiffies64(now);
> +		now = ktime_get();
> +	}
> +	local_irq_restore(flags);
> +}
> +
> +void show_no_hz_stats(struct seq_file *p)
> +{
> +	int cpu;
> +	unsigned long calls = 0, sleeps = 0, time = 0, events = 0;
> +
> +	for_each_online_cpu(cpu) {
> +		struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu);
> +
> +		calls += base->idle_calls;
> +		sleeps += base->idle_sleeps;
> +		time += base->idle_sleeptime;
> +		events += base->events;
> +
> +		seq_printf(p, "nohz cpu%d I:%lu S:%lu T:%lu A:%lu E: %lu\n",
> +			   cpu, base->idle_calls, base->idle_sleeps,
> +			   base->idle_sleeptime, base->idle_sleeps ?
> +			   base->idle_sleeptime / sleeps : 0, base->events);
> +	}
> +#ifdef CONFIG_SMP
> +	seq_printf(p, "nohz total I:%lu S:%lu T:%lu A:%lu E:%lu\n",
> +		   calls, sleeps, time, sleeps ? time / sleeps : 0, events);
> +#endif
> +}

Wouldn't it be better to display the "total" line on UP rather than cpu0?



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 18/23] dynticks: i386 arch code
  2006-09-29 23:58 ` [patch 18/23] dynticks: i386 arch code Thomas Gleixner
@ 2006-09-30  8:45   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:37 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> prepare i386 for dyntick: idle handler callbacks and IRQ callback.
> 
> Index: linux-2.6.18-mm2/arch/i386/kernel/nmi.c
> ===================================================================
> --- linux-2.6.18-mm2.orig/arch/i386/kernel/nmi.c	2006-09-30 01:41:10.000000000 +0200
> +++ linux-2.6.18-mm2/arch/i386/kernel/nmi.c	2006-09-30 01:41:19.000000000 +0200
> @@ -20,6 +20,7 @@
>  #include <linux/sysdev.h>
>  #include <linux/sysctl.h>
>  #include <linux/percpu.h>
> +#include <linux/kernel_stat.h>
>  #include <linux/dmi.h>
>  #include <linux/kprobes.h>
>  
> @@ -908,7 +909,7 @@ __kprobes int nmi_watchdog_tick(struct p
>  		touched = 1;
>  	}
>  
> -	sum = per_cpu(irq_stat, cpu).apic_timer_irqs;
> +	sum = per_cpu(irq_stat, cpu).apic_timer_irqs + kstat_irqs(0);

Why?

> ===================================================================
> --- linux-2.6.18-mm2.orig/arch/i386/kernel/process.c	2006-09-30 01:41:10.000000000 +0200
> +++ linux-2.6.18-mm2/arch/i386/kernel/process.c	2006-09-30 01:41:19.000000000 +0200
> @@ -178,24 +178,27 @@ void cpu_idle(void)
>  
>  	/* endless idle loop with no priority at all */
>  	while (1) {
> -		while (!need_resched()) {
> -			void (*idle)(void);
> +		if (!hrtimer_stop_sched_tick()) {
> +			while (!need_resched()) {

I don't see why hrtimer_stop_sched_tick() returns need_resched().  We
immediately reevaluate it anyway.  hrtimer_stop_sched_tick() could return 1.

> +				void (*idle)(void);
>  
> -			if (__get_cpu_var(cpu_idle_state))
> -				__get_cpu_var(cpu_idle_state) = 0;
> +				if (__get_cpu_var(cpu_idle_state))
> +					__get_cpu_var(cpu_idle_state) = 0;
>  
> -			rmb();
> -			idle = pm_idle;
> +				rmb();
> +				idle = pm_idle;
>  
> -			if (!idle)
> -				idle = default_idle;
> +				if (!idle)
> +					idle = default_idle;
>  
> -			if (cpu_is_offline(cpu))
> -				play_dead();
> +				if (cpu_is_offline(cpu))
> +					play_dead();
>  
> -			__get_cpu_var(irq_stat).idle_timestamp = jiffies;
> -			idle();
> +				__get_cpu_var(irq_stat).idle_timestamp = jiffies;
> +				idle();
> +			}
>  		}
> +		hrtimer_restart_sched_tick();
>  		preempt_enable_no_resched();
>  		schedule();
>  		preempt_disable();
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 20/23] add /proc/sys/kernel/timeout_granularity
  2006-09-29 23:58 ` [patch 20/23] add /proc/sys/kernel/timeout_granularity Thomas Gleixner
@ 2006-09-30  8:45   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:39 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> Introduce timeout granularity: process timer wheel timers every
> timeout_granularity jiffies. Defaults to 1 (process timers HZ times
> per second - most finegrained).
> 
> ...
>
> +	timeout_granularity=
> +			[KNL]
> +			Timeout granularity: process timer wheel timers every
> +			timeout_granularity jiffies. Defaults to 1 (process
> +			timers HZ times per second - most finegrained).
> +

Please do not expose HZ to userspace in this fashion.  It means that an
application which was developed and tested on a 1000Hz kernel might fail on a
250Hz kernel.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 21/23] debugging feature: timer stats
  2006-09-29 23:58 ` [patch 21/23] debugging feature: timer stats Thomas Gleixner
@ 2006-09-30  8:46   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:41 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> add /proc/tstats support:

/proc/timer_stats would be a better name.

Is this intended as a mainlineable feature?  If so, some documentation
would be nice.

> debugging feature to profile timer expiration.
> Both the starting site, process/PID and the expiration function is
> captured. This allows the quick identification of timer event sources
> in a system.
> 
> sample output:
> 
>  # echo 1 > /proc/tstats
>  # cat /proc/tstats
>  Timerstats sample period: 3.888770 s
>    12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
>    15,     1 swapper          hcd_submit_urb (rh_timer_func)
>     4,   959 kedac            schedule_timeout (process_timeout)
>     1,     0 swapper          page_writeback_init (wb_timer_fn)
>    28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
>    22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
>     3,  3100 bash             schedule_timeout (process_timeout)
>     1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
>     1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
>     1,     1 swapper          neigh_table_init_no_netlink (neigh_periodic_timer)
>     1,  2292 ip               __netdev_watchdog_up (dev_watchdog)
>     1,    23 events/1         do_cache_clean (delayed_work_timer_fn)
>  90 total events, 30.0 events/sec
> 
> ...
> 
> @@ -74,6 +74,11 @@ struct hrtimer {
>  	int				cb_mode;
>  	struct list_head		cb_entry;
>  #endif
> +#ifdef CONFIG_TIMER_STATS
> +	void				*start_site;
> +	char				start_comm[16];
> +	int				start_pid;
> +#endif
>  };

Forgot to update this struct's kerneldoc.

> +extern void tstats_update_stats(void *timer, pid_t pid, void *startf,

timer_stats_* would be nicer...

> Index: linux-2.6.18-mm2/include/linux/timer.h
> ===================================================================
> --- linux-2.6.18-mm2.orig/include/linux/timer.h	2006-09-30 01:41:19.000000000 +0200
> +++ linux-2.6.18-mm2/include/linux/timer.h	2006-09-30 01:41:20.000000000 +0200
> @@ -2,6 +2,7 @@
>  #define _LINUX_TIMER_H
>  
>  #include <linux/list.h>
> +#include <linux/ktime.h>
>  #include <linux/spinlock.h>
>  #include <linux/stddef.h>
>  
> @@ -15,6 +16,11 @@ struct timer_list {
>  	unsigned long data;
>  
>  	struct tvec_t_base_s *base;
> +#ifdef CONFIG_TIMER_STATS
> +	void *start_site;
> +	char start_comm[16];
> +	int start_pid;
> +#endif
>  };

hm.  So it's very much a developer thing.

> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.18-mm2/kernel/time/timer_stats.c	2006-09-30 01:41:20.000000000 +0200
> @@ -0,0 +1,227 @@
> +/*
> + * kernel/time/timer_stats.c
> + *
> + * Copyright(C) 2006, Red Hat, Inc., Ingo Molnar
> + * Copyright(C) 2006, Thomas Gleixner <tglx@timesys.com>
> + *
> + * Based on: timer_top.c
> + *	Copyright (C) 2005 Instituto Nokia de Tecnologia - INdT - Manaus
> + *	Written by Daniel Petrini <d.pensator@gmail.com>

We're missing a Signed-off-by:?

> + * Collect timer usage statistics.
> + *
> + * We export the addresses and counting of timer functions being called,
> + * the pid and cmdline from the owner process if applicable.
> + *
> + * Start/stop data collection:
> + * # echo 1[0] >/proc/tstats
> + *
> + * Display the collected information:
> + * # cat /proc/tstats
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/list.h>
> +#include <linux/proc_fs.h>
> +#include <linux/module.h>
> +#include <linux/spinlock.h>
> +#include <linux/sched.h>
> +#include <linux/seq_file.h>
> +#include <linux/kallsyms.h>
> +
> +#include <asm/uaccess.h>
> +
> +static DEFINE_SPINLOCK(tstats_lock);
> +static int tstats_status;

This is not an integer.  It has type `enum tstats_stat'.

> +static ktime_t tstats_time;
> +
> +enum tstats_stat {
> +	TSTATS_INACTIVE,
> +	TSTATS_ACTIVE,
> +	TSTATS_READOUT,
> +	TSTATS_RESET,
> +};
> +
>
> ...
>
> +static void tstats_reset(void)
> +{
> +	memset(tstats, 0, sizeof(tstats));
> +}

This is called without the lock held, which looks wrong.

> +static ssize_t tstats_write(struct file *file, const char __user *buf,
> +			    size_t count, loff_t *offs)
> +{
> +	char ctl[2];
> +
> +	if (count != 2 || *offs)
> +		return -EINVAL;

Let's hope nobody's echo command does while(n) write(fd, *p++, 1);

> -static void delayed_work_timer_fn(unsigned long __data)
> +void delayed_work_timer_fn(unsigned long __data)
>  {
>  	struct work_struct *work = (struct work_struct *)__data;
>  	struct workqueue_struct *wq = work->wq_data;
>  	int cpu = smp_processor_id();
> +	struct list_head *head;
>  
> +	head = &per_cpu_ptr(wq->cpu_wq, cpu)->more_work.task_list;

This doesn't do anything.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 22/23] dynticks: increase SLAB timeouts
  2006-09-29 23:58 ` [patch 22/23] dynticks: increase SLAB timeouts Thomas Gleixner
@ 2006-09-30  8:49   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:42 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> decrease the rate of SLAB timers going off. Reduces the amount
> of timers going off in an idle system.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> --
>  mm/slab.c |    9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.18-mm2/mm/slab.c
> ===================================================================
> --- linux-2.6.18-mm2.orig/mm/slab.c	2006-09-30 01:41:09.000000000 +0200
> +++ linux-2.6.18-mm2/mm/slab.c	2006-09-30 01:41:20.000000000 +0200
> @@ -457,8 +457,13 @@ struct kmem_cache {
>   * OTOH the cpuarrays can contain lots of objects,
>   * which could lock up otherwise freeable slabs.
>   */
> -#define REAPTIMEOUT_CPUC	(2*HZ)
> -#define REAPTIMEOUT_LIST3	(4*HZ)
> +#ifdef CONFIG_NO_HZ
> +# define REAPTIMEOUT_CPUC	(4*HZ)
> +# define REAPTIMEOUT_LIST3	(8*HZ)
> +#else
> +# define REAPTIMEOUT_CPUC	(2*HZ)
> +# define REAPTIMEOUT_LIST3	(4*HZ)
> +#endif
>  
>  #if STATS
>  #define	STATS_INC_ACTIVE(x)	((x)->num_active++)
> 

err, no.

a) We shouldn't go and assume that "No Hz" implies "I want the CPU to
   remain idle for long periods".

   It's a good assumption, but that is an *application* of NO_HZ and the
   above should be a separate configuration option.  Or, better, runtime
   configurable.

b) This reap timeout is there for a reason.  We shouldn't just go and
   modify memory management behaviour because someone selected NO_HZ.

   Again, a runtime tunable is preferable.

Then again, two seconds is quite a long time, surely?  And increasing it to
just four seconds hardly seems worth the effort.


Still, the code you're patching is pretty lame anyway.  It shouldn't be
using time.  Time is meaningless in the mm context.  I'm not sure what it
_should_ be using though.  Perhaps every-Nth-kmem_cache_alloc or something.

It's trying to measure "is this memory I'm holding likely to be in the
CPU's cache any more".  So perhaps time is a close-enough basis.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 23/23] dynticks: decrease I8042_POLL_PERIOD
  2006-09-29 23:58 ` [patch 23/23] dynticks: decrease I8042_POLL_PERIOD Thomas Gleixner
@ 2006-09-30  8:49   ` Andrew Morton
  0 siblings, 0 replies; 55+ messages in thread
From: Andrew Morton @ 2006-09-30  8:49 UTC (permalink / raw)
  To: Thomas Gleixner, Dmitry Torokhov
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Fri, 29 Sep 2006 23:58:43 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> decrease the rate of timers going off. Also work around apparent
> kbd-init bug by making the first timeout short.
> 

Again, please don't make unrelated kernel functions behave differently like
this.

> 
> Index: linux-2.6.18-mm2/drivers/input/serio/i8042.c
> ===================================================================
> --- linux-2.6.18-mm2.orig/drivers/input/serio/i8042.c	2006-09-30 01:41:08.000000000 +0200
> +++ linux-2.6.18-mm2/drivers/input/serio/i8042.c	2006-09-30 01:41:20.000000000 +0200
> @@ -1101,7 +1101,7 @@ static int __devinit i8042_probe(struct 
>  		goto err_controller_cleanup;
>  	}
>  
> -	mod_timer(&i8042_timer, jiffies + I8042_POLL_PERIOD);
> +	mod_timer(&i8042_timer, jiffies + 2); //I8042_POLL_PERIOD);
>  	return 0;
>  
>   err_unregister_ports:
> Index: linux-2.6.18-mm2/drivers/input/serio/i8042.h
> ===================================================================
> --- linux-2.6.18-mm2.orig/drivers/input/serio/i8042.h	2006-09-30 01:41:08.000000000 +0200
> +++ linux-2.6.18-mm2/drivers/input/serio/i8042.h	2006-09-30 01:41:20.000000000 +0200
> @@ -43,7 +43,7 @@
>   * polling.
>   */
>  
> -#define I8042_POLL_PERIOD	HZ/20
> +#define I8042_POLL_PERIOD	(10*HZ)

That's a huge change.  Perhaps the interval was too short in the first
case.  I guess waiting ten seconds for your keyboard or mouse to come to
life after hot-add is liveable with.

But whatever.  This timer gets deleted in Dmitry's current development tree:

commit de9ce703c6b807b1dfef5942df4f2fadd0fdb67a
Author: Dmitry Torokhov <dtor@insightbb.com>
Date:   Sun Sep 10 21:57:21 2006 -0400

    Input: i8042 - get rid of polling timer
    
    Remove polling timer that was used to detect keybord/mice hotplug and
    register both IRQs right away instead of waiting for a driver to
    attach to a port.
    
    Signed-off-by: Dmitry Torokhov <dtor@mail.ru>

so problem solved.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 16/23] dynticks: core
  2006-09-30  8:44   ` Andrew Morton
@ 2006-09-30 12:11     ` Dipankar Sarma
  0 siblings, 0 replies; 55+ messages in thread
From: Dipankar Sarma @ 2006-09-30 12:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, Paul E. McKenney, LKML, Ingo Molnar, Jim Gettys,
	John Stultz, David Woodhouse, Arjan van de Ven, Dave Jones

On Sat, Sep 30, 2006 at 01:44:56AM -0700, Andrew Morton wrote:
> On Fri, 29 Sep 2006 23:58:35 -0000
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > 
> > dynticks core code.
> > 
> > Add idling-stats to the cpu base (to be used to optimize power
> > management decisions), add the scheduler tick and its stop/restart
> > functions, and the jiffies-update function to be called when an irq
> > context hits the idle context.
> > 
> 
> I worry that we're making this feature optional.
> > +	/*
> > +	 * RCU normally depends on the timer IRQ kicking completion
> > +	 * in every tick. We have to do this here now:
> > +	 */
> > +	if (rcu_pending(cpu)) {
> > +		/*
> > +		 * We are in quiescent state, so advance callbacks:
> > +		 */
> > +		rcu_advance_callbacks(cpu, 1);
> > +		local_irq_enable(); <----------------- Here
> > +		local_bh_disable();
> > +		rcu_process_callbacks(0);
> > +		local_bh_enable();
> > +	}
> > +
> > +	local_irq_restore(flags);
> > +
> > +	return need_resched();
> > +}
> 
> Are the RCU guys OK with this?

What prevents more RCU callbacks getting queued up by an
irq after irqs are enabled (marked Here) ? This seems racy.
The s390 implementation is correct - there we back out
if RCU is pending. Also, one call
to rcu_process_callbacks() doesn't guarantee that all
the RCUs are processed. They can be rate limited.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 08/23] dynticks: prepare the RCU code
  2006-09-30  8:36   ` Andrew Morton
@ 2006-09-30 12:25     ` Dipankar Sarma
  2006-09-30 13:09       ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Dipankar Sarma @ 2006-09-30 12:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, Paul E. McKenney, LKML, Ingo Molnar, Jim Gettys,
	John Stultz, David Woodhouse, Arjan van de Ven, Dave Jones

On Sat, Sep 30, 2006 at 01:36:41AM -0700, Andrew Morton wrote:
> On Fri, 29 Sep 2006 23:58:27 -0000
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > 
> > prepare the RCU code for dynticks/nohz. Since on nohz kernels there
> > is no guaranteed timer IRQ that processes RCU callbacks, the idle
> > code has to make sure that all RCU callbacks that can be finished
> > off are indeed finished off. This patch adds the necessary APIs:
> > rcu_advance_callbacks() [to register quiescent state] and
> > rcu_process_callbacks() [to finish finishable RCU callbacks].
> > 
> > ...
> >  
> > +void rcu_advance_callbacks(int cpu, int user)
> > +{
> > +	if (user ||
> > +	    (idle_cpu(cpu) && !in_softirq() &&
> > +				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
> > +		rcu_qsctr_inc(cpu);
> > +		rcu_bh_qsctr_inc(cpu);
> > +	} else if (!in_softirq())
> > +		rcu_bh_qsctr_inc(cpu);
> > +}
> > +
> 
> I hope this function is immediately clear to the RCU maintainers, because it's
> complete mud to me.
> 

Ingo,

It is duplicating code. That can be easily fixed, but we need to figure
out what we really want from RCU when we are about to switch off
the ticks. It is hard if you want to finish off all the pending
RCUs and go to nohz state. Can you live with backing out if
there are pending RCUs ?

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 08/23] dynticks: prepare the RCU code
  2006-09-30 12:25     ` Dipankar Sarma
@ 2006-09-30 13:09       ` Ingo Molnar
  2006-09-30 13:52         ` Dipankar Sarma
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2006-09-30 13:09 UTC (permalink / raw)
  To: Dipankar Sarma
  Cc: Andrew Morton, Thomas Gleixner, Paul E. McKenney, LKML,
	Jim Gettys, John Stultz, David Woodhouse, Arjan van de Ven,
	Dave Jones


* Dipankar Sarma <dipankar@in.ibm.com> wrote:

> It is duplicating code. That can be easily fixed, but we need to 
> figure out what we really want from RCU when we are about to switch 
> off the ticks. It is hard if you want to finish off all the pending 
> RCUs and go to nohz state. Can you live with backing out if there are 
> pending RCUs ?

the thing is that when we go idle we /want/ to process whatever delayed 
work there might be - rate limited or not. Do you agree with that 
approach? I consider this a performance feature as well: this way we can 
utilize otherwise lost idle time. It is not a problem that we dont 
'batch' this processing: we are really idle and we've got free cycles to 
burn. We could even do an RCU processing loop that immediately breaks 
out if need_resched() gets set [by an IRQ or by another CPU].

secondly, i think i saw functionality problems when RCU was not 
completed before going idle - for example synchronize_rcu() on another 
CPU would hang.

what approach would you suggest to achieve these goals?

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 08/23] dynticks: prepare the RCU code
  2006-09-30 13:09       ` Ingo Molnar
@ 2006-09-30 13:52         ` Dipankar Sarma
  2006-09-30 21:35           ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Dipankar Sarma @ 2006-09-30 13:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Thomas Gleixner, Paul E. McKenney, LKML,
	Jim Gettys, John Stultz, David Woodhouse, Arjan van de Ven,
	Dave Jones

On Sat, Sep 30, 2006 at 03:09:58PM +0200, Ingo Molnar wrote:
> * Dipankar Sarma <dipankar@in.ibm.com> wrote:
> 
> > It is duplicating code. That can be easily fixed, but we need to 
> > figure out what we really want from RCU when we are about to switch 
> > off the ticks. It is hard if you want to finish off all the pending 
> > RCUs and go to nohz state. Can you live with backing out if there are 
> > pending RCUs ?
> 
> the thing is that when we go idle we /want/ to process whatever delayed 
> work there might be - rate limited or not. Do you agree with that 
> approach? I consider this a performance feature as well: this way we can 
> utilize otherwise lost idle time. It is not a problem that we dont 
> 'batch' this processing: we are really idle and we've got free cycles to 
> burn. We could even do an RCU processing loop that immediately breaks 
> out if need_resched() gets set [by an IRQ or by another CPU].

If you don't care about rate limiting RCU processing (you wouldn't
in CONFIG_PREEMPT_RT), you still have to deal with the situation
that one CPU going idle doesn't guarantee that you can process
all pending RCUs. You can process the finished ones, but what
about the ones that are still waiting for the grace period
beyond the current cpu ?

> 
> secondly, i think i saw functionality problems when RCU was not 
> completed before going idle - for example synchronize_rcu() on another 
> CPU would hang.

That is probably because of what I mention above. In the original
CONFIG_NO_IDLE_HZ, we don't go into a nohz state if there are
RCUs pending in that cpu.

> 
> what approach would you suggest to achieve these goals?

There is no way to finish all the RCUs in a given cpu
unless you are prepared to wait for a grace period or so. 
So, you go to idle, keep checking in the timer tick and as soon 
as all RCUs are done, go to nohz state. You can keep
processing RCUs in every idle tick so that if you have only
finished RCU callbacks on that cpu, you can go to nohz rightaway.
I can add that API.

Of course, you can do what cpu hotplug does - move the RCUs
to another CPU. But that is an expensive operaation.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 02/23] GTOD: persistent clock support, core
  2006-09-30  8:35   ` Andrew Morton
@ 2006-09-30 17:15     ` Jan Engelhardt
  2006-10-02 21:49     ` john stultz
  1 sibling, 0 replies; 55+ messages in thread
From: Jan Engelhardt @ 2006-09-30 17:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones


>> persistent clock support: do proper timekeeping across suspend/resume.
>
>How?

Rereading the RTC seems the only way to me. Someone prove me wrong, and 
do it fast! :)


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 00/23]
  2006-09-30  8:35 ` [patch 00/23] Andrew Morton
@ 2006-09-30 19:17   ` Thomas Gleixner
  0 siblings, 0 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-09-30 19:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Sat, 2006-09-30 at 01:35 -0700, Andrew Morton wrote:
> Could we please have a full description of these features?  All we have at
> present is "high resolution timers" and "dynticks", which is ludicrously
> terse.  Please also describe the design and implementation.  This is basic
> stuff for a run-of-the-mill patch, let alone a feature like this one.
> 
> I don't believe I can adequately review this work without that information.
> I can try, but obviously such a review will not be as beneficial - it can
> only cover trivial matters.
> 
> With all the work which has gone into this, and with the impact which it
> will have upon us all it is totally disproportionate that no more than a
> few minutes were spent telling the rest of us what it does and how it
> does it.
> 
> We've had a lot of problems with timekeeping in recent years, and they have
> been hard and slow to solve.  Hence I'd like to set the bar very high on
> the maintainability and understandability of this work.  And what I see
> here doesn't look really great from that POV.

Fair enough. Point taken. We're working on it and on the comments you
gave. Thanks for taking the time to go through it nevertheless.

The OLS proceedings
http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf

and the slides of my talk 
http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf

might also shed some light on the design.

	tglx



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 08/23] dynticks: prepare the RCU code
  2006-09-30 13:52         ` Dipankar Sarma
@ 2006-09-30 21:35           ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2006-09-30 21:35 UTC (permalink / raw)
  To: Dipankar Sarma
  Cc: Andrew Morton, Thomas Gleixner, Paul E. McKenney, LKML,
	Jim Gettys, John Stultz, David Woodhouse, Arjan van de Ven,
	Dave Jones


* Dipankar Sarma <dipankar@in.ibm.com> wrote:

> > secondly, i think i saw functionality problems when RCU was not 
> > completed before going idle - for example synchronize_rcu() on 
> > another CPU would hang.
> 
> That is probably because of what I mention above. In the original 
> CONFIG_NO_IDLE_HZ, we don't go into a nohz state if there are RCUs 
> pending in that cpu.

hm. I just tried it and it seems completing RCU processing isnt even 
necessary. I'll drop the RCU hackery. If we need anything then in 
synchronize_rcu [which is a rare and slowpath op]: there (on NO_HZ) we 
should tickle all cpus via an smp_call_function().

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 02/23] GTOD: persistent clock support, core
  2006-09-30  8:35   ` Andrew Morton
  2006-09-30 17:15     ` Jan Engelhardt
@ 2006-10-02 21:49     ` john stultz
  1 sibling, 0 replies; 55+ messages in thread
From: john stultz @ 2006-10-02 21:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Jim Gettys, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Sat, 2006-09-30 at 01:35 -0700, Andrew Morton wrote:
> On Fri, 29 Sep 2006 23:58:21 -0000
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > From: John Stultz <johnstul@us.ibm.com>
> > 
> > persistent clock support: do proper timekeeping across suspend/resume.
> 
> How?

Improved description included below.

> > +/* Weak dummy function for arches that do not yet support it.
> > + * XXX - Do be sure to remove it once all arches implement it.
> > + */
> > +unsigned long __attribute__((weak)) read_persistent_clock(void)
> > +{
> > +	return 0;
> > +}
> 
> Seconds?  microseconds?  jiffies?  walltime?  uptime?
> 
> Needs some comments.

Agreed. Thanks for pointing it out.


> 
> >  void __init timekeeping_init(void)
> >  {
> > -	unsigned long flags;
> > +	unsigned long flags, sec = read_persistent_clock();
> 
> So it apparently returns seconds-since-epoch?
> 
> If so, why?
> 
> >  	write_seqlock_irqsave(&xtime_lock, flags);
> >  
> > @@ -758,11 +769,18 @@ void __init timekeeping_init(void)
> >  	clocksource_calculate_interval(clock, tick_nsec);
> >  	clock->cycle_last = clocksource_read(clock);
> >  
> > +	xtime.tv_sec = sec;
> > +	xtime.tv_nsec = (jiffies % HZ) * (NSEC_PER_SEC / HZ);
> 
> Why is it valid to take the second from the persistent clock and the
> fraction-of-a-second from jiffies?  Some comments describing the
> implementation would improve its understandability and maintainability.

Yea. i386 and other arches have done this for awhile, so I preserved it.
However on further inspection, it really doesn't make much sense. We're
pre-timer interurpts anyway, so jiffies won't have begun yet. So now I
just initialize it to zero.

> This statement can set xtime.tv_nsec to a value >= NSEC_PER_SEC.  Should it
> not be normalised?

Yep, it is, and you commented just above it.:)

> > +	set_normalized_timespec(&wall_to_monotonic,
> > +		-xtime.tv_sec, -xtime.tv_nsec);
> > +
> >  	write_sequnlock_irqrestore(&xtime_lock, flags);
> >  }
> >  
> >  static int timekeeping_suspended;
> > +static unsigned long timekeeping_suspend_time;
> 
> In what units?

Fixed.

> > +
> >  /**
> >   * timekeeping_resume - Resumes the generic timekeeping subsystem.
> >   * @dev:	unused
> > @@ -773,14 +791,23 @@ static int timekeeping_suspended;
> >   */
> >  static int timekeeping_resume(struct sys_device *dev)
> >  {
> > -	unsigned long flags;
> > +	unsigned long flags, now = read_persistent_clock();
> 
> Would whoever keeps doing that please stop it?  This:
> 	unsigned long flags;
> 	unsigned long now = read_persistent_clock();
> 
> is more readable and makes for more readable patches in the future.

Fixed.

> >  	write_seqlock_irqsave(&xtime_lock, flags);
> > -	/* restart the last cycle value */
> > +
> > +	if (now && (now > timekeeping_suspend_time)) {
> > +		unsigned long sleep_length = now - timekeeping_suspend_time;
> > +		xtime.tv_sec += sleep_length;
> > +		jiffies_64 += sleep_length * HZ;
> 
> sleep_length will overflow if we slept for more than 49 days, and HZ=1000.

Oh! Great catch! Fixed.

Thanks so much for the thorough review!

Updated patch follows:

thanks
-john


Implement generic timekeeping suspend/resume accounting by introducing 
the read_persistent_clock() interface. This is an arch specific 
function that returns the seconds since the epoch using the arch 
defined battery backed clock.

Aside from allowing the removal of duplicate arch specific resume 
logic, this change helps avoid potential resume time ordering issues 
between generic and arch specific time code.

This patch only provides the generic usage of this new function and a 
weak dummy function, that always returns zero if no arch specific 
function is defined. Thus if no persistent clock is present, no change 
in behavior should be seen with this patch.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
--

 include/linux/hrtimer.h |    3 +++
 include/linux/time.h    |    1 +
 kernel/hrtimer.c        |    8 ++++++++
 kernel/timer.c          |   40 +++++++++++++++++++++++++++++++++++++++-
 4 files changed, 51 insertions(+), 1 deletion(-)

linux-2.6.18_timeofday-persistent-clock-generic_C7.patch
============================================
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index fca9302..660d91d 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -146,6 +146,9 @@ extern void hrtimer_init_sleeper(struct 
 /* Soft interrupt function to run the hrtimer queues: */
 extern void hrtimer_run_queues(void);
 
+/* Resume notification */
+void hrtimer_notify_resume(void);
+
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
 
diff --git a/include/linux/time.h b/include/linux/time.h
index a5b7399..db31d2a 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -92,6 +92,7 @@ extern struct timespec xtime;
 extern struct timespec wall_to_monotonic;
 extern seqlock_t xtime_lock;
 
+extern unsigned long read_persistent_clock(void);
 void timekeeping_init(void);
 
 static inline unsigned long get_seconds(void)
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index d0ba190..090b752 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -287,6 +287,14 @@ static unsigned long ktime_divns(const k
 #endif /* BITS_PER_LONG >= 64 */
 
 /*
+ * Timekeeping resumed notification
+ */
+void hrtimer_notify_resume(void)
+{
+	clock_was_set();
+}
+
+/*
  * Counterpart to lock_timer_base above:
  */
 static inline
diff --git a/kernel/timer.c b/kernel/timer.c
index c1c7fbc..5069139 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -41,6 +41,9 @@
 #include <asm/timex.h>
 #include <asm/io.h>
 
+/* jiffies at the most recent update of wall time */
+unsigned long wall_jiffies = INITIAL_JIFFIES;
+
 u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
 EXPORT_SYMBOL(jiffies_64);
@@ -743,12 +746,27 @@ int timekeeping_is_continuous(void)
 	return ret;
 }
 
+/**
+ * read_persistent_clock -  Return time in seconds from the persistent clock.
+ *
+ * Weak dummy function for arches that do not yet support it.
+ * Returns seconds from epoch using the battery backed persistent clock.
+ * Returns zero if unsupported.
+ *
+ *  XXX - Do be sure to remove it once all arches implement it.
+ */
+unsigned long __attribute__((weak)) read_persistent_clock(void)
+{
+	return 0;
+}
+
 /*
  * timekeeping_init - Initializes the clocksource and common timekeeping values
  */
 void __init timekeeping_init(void)
 {
 	unsigned long flags;
+	unsigned long sec = read_persistent_clock();
 
 	write_seqlock_irqsave(&xtime_lock, flags);
 
@@ -758,11 +776,20 @@ void __init timekeeping_init(void)
 	clocksource_calculate_interval(clock, tick_nsec);
 	clock->cycle_last = clocksource_read(clock);
 
+	xtime.tv_sec = sec;
+	xtime.tv_nsec = 0;
+	set_normalized_timespec(&wall_to_monotonic,
+		-xtime.tv_sec, -xtime.tv_nsec);
+
 	write_sequnlock_irqrestore(&xtime_lock, flags);
 }
 
 
+/* flag for if timekeeping is suspended */
 static int timekeeping_suspended;
+/* time in seconds when suspend began */
+static unsigned long timekeeping_suspend_time;
+
 /**
  * timekeeping_resume - Resumes the generic timekeeping subsystem.
  * @dev:	unused
@@ -774,13 +801,23 @@ static int timekeeping_suspended;
 static int timekeeping_resume(struct sys_device *dev)
 {
 	unsigned long flags;
+	unsigned long now = read_persistent_clock();
 
 	write_seqlock_irqsave(&xtime_lock, flags);
-	/* restart the last cycle value */
+
+	if (now && (now > timekeeping_suspend_time)) {
+		unsigned long sleep_length = now - timekeeping_suspend_time;
+		xtime.tv_sec += sleep_length;
+		jiffies_64 += (u64)sleep_length * HZ;
+	}
+	/* re-base the last cycle value */
 	clock->cycle_last = clocksource_read(clock);
 	clock->error = 0;
 	timekeeping_suspended = 0;
 	write_sequnlock_irqrestore(&xtime_lock, flags);
+
+	hrtimer_notify_resume();
+
 	return 0;
 }
 
@@ -790,6 +827,7 @@ static int timekeeping_suspend(struct sy
 
 	write_seqlock_irqsave(&xtime_lock, flags);
 	timekeeping_suspended = 1;
+	timekeeping_suspend_time = read_persistent_clock();
 	write_sequnlock_irqrestore(&xtime_lock, flags);
 	return 0;
 }



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [patch 03/23] GTOD: persistent clock support, i386
  2006-09-30  8:36   ` Andrew Morton
@ 2006-10-02 22:03     ` john stultz
  2006-10-02 22:44       ` Andrew Morton
  0 siblings, 1 reply; 55+ messages in thread
From: john stultz @ 2006-10-02 22:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Jim Gettys, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Sat, 2006-09-30 at 01:36 -0700, Andrew Morton wrote:
> On Fri, 29 Sep 2006 23:58:22 -0000
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > persistent clock support: do proper timekeeping across suspend/resume,
> > i386 arch support.
> > 
> 
> This description implies that the patch implements something for i386

> >  arch/i386/kernel/apm.c  |   44 ---------------------------------------
> >  arch/i386/kernel/time.c |   54 +---------------------------------------------
> 
> but all it does is delete stuff.

Improved description included.


> I _assume_ that it switches i386 over to using the (undescribed) generic
> core, and it does that merely by implementing read_persistent_clock().
> 
> But I'd have expected to see some Kconfig change in there as well?

Since there is a generic weak read_persistent_clock function, all that
is needed is for an arch to implement the read_persistent_clock function
and remove its arch specific suspend/resume code.

> Does this implementation support all forms of persistent clock which are
> known to exist on i386 platforms?

Yep. It converts the read_cmos_clock() function, which covers legacy
CMOS and EFI clocks.

> If/when you issue new changelogs, please describe what has to be done to
> port other architectures over to use this overall framework.

Included below.

> Do ports for other architectures exist?

I made a quick attempt earlier and covered most of the arches. However,
I'm sort of wrapping this up w/ the generic time conversion (it was one
of the changes I dropped in the GTOD rework earlier this year). I was
going to re-add this later, but then Thomas started seeing resume
ordering issues w/ the dynticks patch, so I raised the patches again.

Updated patch below:

thanks
-john


Convert i386's read_cmos_clock to the read_persistent_clock interface 
and remove the arch specific suspend/resume code, as the generic 
timekeeping code will now handle it.

If you wish to convert your arch to the read_persistent_clock code:
1) Implement read_persistent_clock in your arch.
2) Kill off xtime and jiffies modification in arch specific
initialization and suspend/resume.


Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
--

 apm.c  |   43 -------------------------------------------
 time.c |   55 ++-----------------------------------------------------
 2 files changed, 2 insertions(+), 96 deletions(-)

linux-2.6.18_timeofday-persistent-clock-i386_C7.patch
============================================
diff --git a/arch/i386/kernel/apm.c b/arch/i386/kernel/apm.c
index b42f2d9..e40e7ef 100644
--- a/arch/i386/kernel/apm.c
+++ b/arch/i386/kernel/apm.c
@@ -1153,28 +1153,6 @@ out:
 	spin_unlock(&user_list_lock);
 }
 
-static void set_time(void)
-{
-	struct timespec ts;
-	if (got_clock_diff) {	/* Must know time zone in order to set clock */
-		ts.tv_sec = get_cmos_time() + clock_cmos_diff;
-		ts.tv_nsec = 0;
-		do_settimeofday(&ts);
-	} 
-}
-
-static void get_time_diff(void)
-{
-#ifndef CONFIG_APM_RTC_IS_GMT
-	/*
-	 * Estimate time zone so that set_time can update the clock
-	 */
-	clock_cmos_diff = -get_cmos_time();
-	clock_cmos_diff += get_seconds();
-	got_clock_diff = 1;
-#endif
-}
-
 static void reinit_timer(void)
 {
 #ifdef INIT_TIMER_AFTER_SUSPEND
@@ -1214,19 +1192,6 @@ static int suspend(int vetoable)
 	local_irq_disable();
 	device_power_down(PMSG_SUSPEND);
 
-	/* serialize with the timer interrupt */
-	write_seqlock(&xtime_lock);
-
-	/* protect against access to timer chip registers */
-	spin_lock(&i8253_lock);
-
-	get_time_diff();
-	/*
-	 * Irq spinlock must be dropped around set_system_power_state.
-	 * We'll undo any timer changes due to interrupts below.
-	 */
-	spin_unlock(&i8253_lock);
-	write_sequnlock(&xtime_lock);
 	local_irq_enable();
 
 	save_processor_state();
@@ -1235,7 +1200,6 @@ static int suspend(int vetoable)
 	restore_processor_state();
 
 	local_irq_disable();
-	set_time();
 	reinit_timer();
 
 	if (err == APM_NO_ERROR)
@@ -1265,11 +1229,6 @@ static void standby(void)
 
 	local_irq_disable();
 	device_power_down(PMSG_SUSPEND);
-	/* serialize with the timer interrupt */
-	write_seqlock(&xtime_lock);
-	/* If needed, notify drivers here */
-	get_time_diff();
-	write_sequnlock(&xtime_lock);
 	local_irq_enable();
 
 	err = set_system_power_state(APM_STATE_STANDBY);
@@ -1363,7 +1322,6 @@ static void check_events(void)
 			ignore_bounce = 1;
 			if ((event != APM_NORMAL_RESUME)
 			    || (ignore_normal_resume == 0)) {
-				set_time();
 				device_resume();
 				pm_send_all(PM_RESUME, (void *)0);
 				queue_event(event, NULL);
@@ -1379,7 +1337,6 @@ static void check_events(void)
 			break;
 
 		case APM_UPDATE_TIME:
-			set_time();
 			break;
 
 		case APM_CRITICAL_SUSPEND:
diff --git a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
index 58a2d55..e43fe9a 100644
--- a/arch/i386/kernel/time.c
+++ b/arch/i386/kernel/time.c
@@ -216,7 +216,7 @@ irqreturn_t timer_interrupt(int irq, voi
 }
 
 /* not static: needed by APM */
-unsigned long get_cmos_time(void)
+unsigned long read_persistent_clock(void)
 {
 	unsigned long retval;
 	unsigned long flags;
@@ -232,7 +232,7 @@ unsigned long get_cmos_time(void)
 
 	return retval;
 }
-EXPORT_SYMBOL(get_cmos_time);
+EXPORT_SYMBOL(read_persistent_clock);
 
 static void sync_cmos_clock(unsigned long dummy);
 
@@ -283,58 +283,19 @@ void notify_arch_cmos_timer(void)
 	mod_timer(&sync_cmos_timer, jiffies + 1);
 }
 
-static long clock_cmos_diff;
-static unsigned long sleep_start;
-
-static int timer_suspend(struct sys_device *dev, pm_message_t state)
-{
-	/*
-	 * Estimate time zone so that set_time can update the clock
-	 */
-	unsigned long ctime =  get_cmos_time();
-
-	clock_cmos_diff = -ctime;
-	clock_cmos_diff += get_seconds();
-	sleep_start = ctime;
-	return 0;
-}
-
 static int timer_resume(struct sys_device *dev)
 {
-	unsigned long flags;
-	unsigned long sec;
-	unsigned long ctime = get_cmos_time();
-	long sleep_length = (ctime - sleep_start) * HZ;
-	struct timespec ts;
-
-	if (sleep_length < 0) {
-		printk(KERN_WARNING "CMOS clock skew detected in timer resume!\n");
-		/* The time after the resume must not be earlier than the time
-		 * before the suspend or some nasty things will happen
-		 */
-		sleep_length = 0;
-		ctime = sleep_start;
-	}
 #ifdef CONFIG_HPET_TIMER
 	if (is_hpet_enabled())
 		hpet_reenable();
 #endif
 	setup_pit_timer();
-
-	sec = ctime + clock_cmos_diff;
-	ts.tv_sec = sec;
-	ts.tv_nsec = 0;
-	do_settimeofday(&ts);
-	write_seqlock_irqsave(&xtime_lock, flags);
-	jiffies_64 += sleep_length;
-	write_sequnlock_irqrestore(&xtime_lock, flags);
 	touch_softlockup_watchdog();
 	return 0;
 }
 
 static struct sysdev_class timer_sysclass = {
 	.resume = timer_resume,
-	.suspend = timer_suspend,
 	set_kset_name("timer"),
 };
 
@@ -360,12 +321,6 @@ extern void (*late_time_init)(void);
 /* Duplicate of time_init() below, with hpet_enable part added */
 static void __init hpet_time_init(void)
 {
-	struct timespec ts;
-	ts.tv_sec = get_cmos_time();
-	ts.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ);
-
-	do_settimeofday(&ts);
-
 	if ((hpet_enable() >= 0) && hpet_use_timer) {
 		printk("Using HPET for base-timer\n");
 	}
@@ -376,7 +331,6 @@ static void __init hpet_time_init(void)
 
 void __init time_init(void)
 {
-	struct timespec ts;
 #ifdef CONFIG_HPET_TIMER
 	if (is_hpet_capable()) {
 		/*
@@ -387,10 +341,5 @@ void __init time_init(void)
 		return;
 	}
 #endif
-	ts.tv_sec = get_cmos_time();
-	ts.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ);
-
-	do_settimeofday(&ts);
-
 	time_init_hook();
 }




^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [patch 03/23] GTOD: persistent clock support, i386
  2006-10-02 22:03     ` john stultz
@ 2006-10-02 22:44       ` Andrew Morton
  2006-10-02 23:09         ` john stultz
  2006-10-03 23:30         ` Thomas Gleixner
  0 siblings, 2 replies; 55+ messages in thread
From: Andrew Morton @ 2006-10-02 22:44 UTC (permalink / raw)
  To: john stultz
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Jim Gettys, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Mon, 02 Oct 2006 15:03:37 -0700
john stultz <johnstul@us.ibm.com> wrote:

>  static int timer_resume(struct sys_device *dev)
>  {
> -	unsigned long flags;
> -	unsigned long sec;
> -	unsigned long ctime = get_cmos_time();
> -	long sleep_length = (ctime - sleep_start) * HZ;
> -	struct timespec ts;
> -
> -	if (sleep_length < 0) {
> -		printk(KERN_WARNING "CMOS clock skew detected in timer resume!\n");
> -		/* The time after the resume must not be earlier than the time
> -		 * before the suspend or some nasty things will happen
> -		 */
> -		sleep_length = 0;
> -		ctime = sleep_start;
> -	}
>  #ifdef CONFIG_HPET_TIMER
>  	if (is_hpet_enabled())
>  		hpet_reenable();
>  #endif
>  	setup_pit_timer();
> -
> -	sec = ctime + clock_cmos_diff;
> -	ts.tv_sec = sec;
> -	ts.tv_nsec = 0;
> -	do_settimeofday(&ts);
> -	write_seqlock_irqsave(&xtime_lock, flags);
> -	jiffies_64 += sleep_length;
> -	write_sequnlock_irqrestore(&xtime_lock, flags);
>  	touch_softlockup_watchdog();
>  	return 0;
>  }

In this version of the patch, you no longer remove the
touch_softlockup_watchdog() call from timer_resume().

But clockevents-drivers-for-i386.patch deletes timer_resume()
altogether.

Hence we might need to put that re-added touch_softlockup_watchdog() call
into somewhere else now.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 03/23] GTOD: persistent clock support, i386
  2006-10-02 22:44       ` Andrew Morton
@ 2006-10-02 23:09         ` john stultz
  2006-10-03 23:30         ` Thomas Gleixner
  1 sibling, 0 replies; 55+ messages in thread
From: john stultz @ 2006-10-02 23:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Jim Gettys, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Mon, 2006-10-02 at 15:44 -0700, Andrew Morton wrote:
> On Mon, 02 Oct 2006 15:03:37 -0700
> john stultz <johnstul@us.ibm.com> wrote:
> 
> >  static int timer_resume(struct sys_device *dev)
> >  {
> > -	unsigned long flags;
> > -	unsigned long sec;
> > -	unsigned long ctime = get_cmos_time();
> > -	long sleep_length = (ctime - sleep_start) * HZ;
> > -	struct timespec ts;
> > -
> > -	if (sleep_length < 0) {
> > -		printk(KERN_WARNING "CMOS clock skew detected in timer resume!\n");
> > -		/* The time after the resume must not be earlier than the time
> > -		 * before the suspend or some nasty things will happen
> > -		 */
> > -		sleep_length = 0;
> > -		ctime = sleep_start;
> > -	}
> >  #ifdef CONFIG_HPET_TIMER
> >  	if (is_hpet_enabled())
> >  		hpet_reenable();
> >  #endif
> >  	setup_pit_timer();
> > -
> > -	sec = ctime + clock_cmos_diff;
> > -	ts.tv_sec = sec;
> > -	ts.tv_nsec = 0;
> > -	do_settimeofday(&ts);
> > -	write_seqlock_irqsave(&xtime_lock, flags);
> > -	jiffies_64 += sleep_length;
> > -	write_sequnlock_irqrestore(&xtime_lock, flags);
> >  	touch_softlockup_watchdog();
> >  	return 0;
> >  }
> 
> In this version of the patch, you no longer remove the
> touch_softlockup_watchdog() call from timer_resume().

Yea. That removal was added by Thomas, and I didn't merge it into my
tree. 

> But clockevents-drivers-for-i386.patch deletes timer_resume()
> altogether.
> 
> Hence we might need to put that re-added touch_softlockup_watchdog() call
> into somewhere else now.

Yea. While my dropping the change wasn't intentional, it seems the
change isn't really part of the persistent_clock changes, so the removal
should be done in one of the clockevents patches.

But I'll have to defer to tglx as to which one.

thanks
-john


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 13/23] clockevents: core
  2006-09-30  8:39   ` Andrew Morton
@ 2006-10-03  4:33     ` John Kacur
  0 siblings, 0 replies; 55+ messages in thread
From: John Kacur @ 2006-10-03  4:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones, jkacur

On Sat, 2006-09-30 at 01:39 -0700, Andrew Morton wrote:
> On Fri, 29 Sep 2006 23:58:32 -0000
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > From: Thomas Gleixner <tglx@linutronix.de>
> > 
> > We have two types of clock event devices:
> > - global events (one device per system)
> > - local events (one device per cpu)
> > 
> > We assign the various time(r) related interrupts to those devices:
> > 
> > - global tick
> > - profiling (per cpu)
> > - next timer events (per cpu)
> > 
> > architectures register their clockevent sources, with specific capability
> > masks set, and the generic high-res-timers code picks the best one,
> > without the architecture having to worry about that.
> > 
> > here are the capabilities a clockevent driver can register:
> > 
> >  #define CLOCK_CAP_TICK		0x000001
> >  #define CLOCK_CAP_UPDATE	0x000002
> >  #define CLOCK_CAP_PROFILE	0x000004
> >  #define CLOCK_CAP_NEXTEVT	0x000008
> 
> OK..  Perhaps this info is worth promoting to a code comment.
> 
> > +++ linux-2.6.18-mm2/include/linux/clockchips.h	2006-09-30 01:41:17.000000000 +0200
> > @@ -0,0 +1,104 @@
> > +/*  linux/include/linux/clockchips.h
> > + *
> > + *  This file contains the structure definitions for clockchips.
> > + *
> > + *  If you are not a clockchip, or the time of day code, you should
> > + *  not be including this file!
> > + */
> > +#ifndef _LINUX_CLOCKCHIPS_H
> > +#define _LINUX_CLOCKCHIPS_H
> > +
> > +#include <linux/config.h>
> 
> The build system includes config.h for you.
> 
> > +#ifdef CONFIG_GENERIC_TIME
> > +
> > +#include <linux/clocksource.h>
> > +#include <linux/interrupt.h>
> > +
> > +/* Clock event mode commands */
> > +enum {
> > +	CLOCK_EVT_PERIODIC,
> > +	CLOCK_EVT_ONESHOT,
> > +	CLOCK_EVT_SHUTDOWN,
> > +};
> > +
> > +/* Clock event capability flags */
> > +#define CLOCK_CAP_TICK		0x000001
> > +#define CLOCK_CAP_UPDATE	0x000002
> > +#ifndef CONFIG_PROFILE_NMI
> > +# define CLOCK_CAP_PROFILE	0x000004
> > +#else
> > +# define CLOCK_CAP_PROFILE	0x000000
> > +#endif
> > +#ifdef CONFIG_HIGH_RES_TIMERS
> > +# define CLOCK_CAP_NEXTEVT	0x000008
> > +#else
> > +# define CLOCK_CAP_NEXTEVT	0x000000
> > +#endif
> 
> There is no CONFIG_PROFILE_NMI in the kernel nor anywhere else in this
> patchset.
> 
---SNIP----

As I've pointed out - this breaks the ability to do timer tick profiling
too.
http://marc.theaimsgroup.com/?l=linux-kernel&m=115484411119770&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=115484446530853&w=2


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [patch 03/23] GTOD: persistent clock support, i386
  2006-10-02 22:44       ` Andrew Morton
  2006-10-02 23:09         ` john stultz
@ 2006-10-03 23:30         ` Thomas Gleixner
  1 sibling, 0 replies; 55+ messages in thread
From: Thomas Gleixner @ 2006-10-03 23:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: john stultz, LKML, Ingo Molnar, Jim Gettys, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Mon, 2006-10-02 at 15:44 -0700, Andrew Morton wrote:
> > -	write_seqlock_irqsave(&xtime_lock, flags);
> > -	jiffies_64 += sleep_length;
> > -	write_sequnlock_irqrestore(&xtime_lock, flags);
> >  	touch_softlockup_watchdog();
> >  	return 0;
> >  }
> 
> In this version of the patch, you no longer remove the
> touch_softlockup_watchdog() call from timer_resume().
> 
> But clockevents-drivers-for-i386.patch deletes timer_resume()
> altogether.
> 
> Hence we might need to put that re-added touch_softlockup_watchdog() call
> into somewhere else now.

clockevents has is it in the resume path.

static void clockevents_resume_local_events(void *arg)
{
....
        touch_softlockup_watchdog();
}

	tglx



^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2006-10-03 23:27 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-09-29 23:58 [patch 00/23] Thomas Gleixner
2006-09-29 23:58 ` [patch 01/23] GTOD: exponential update_wall_time Thomas Gleixner
2006-09-29 23:58 ` [patch 02/23] GTOD: persistent clock support, core Thomas Gleixner
2006-09-30  8:35   ` Andrew Morton
2006-09-30 17:15     ` Jan Engelhardt
2006-10-02 21:49     ` john stultz
2006-09-29 23:58 ` [patch 03/23] GTOD: persistent clock support, i386 Thomas Gleixner
2006-09-30  8:36   ` Andrew Morton
2006-10-02 22:03     ` john stultz
2006-10-02 22:44       ` Andrew Morton
2006-10-02 23:09         ` john stultz
2006-10-03 23:30         ` Thomas Gleixner
2006-09-29 23:58 ` [patch 04/23] time: uninline jiffies.h Thomas Gleixner
2006-09-29 23:58 ` [patch 05/23] time: fix msecs_to_jiffies() bug Thomas Gleixner
2006-09-29 23:58 ` [patch 06/23] time: fix timeout overflow Thomas Gleixner
2006-09-29 23:58 ` [patch 07/23] cleanup: uninline irq_enter() and move it into a function Thomas Gleixner
2006-09-30  8:36   ` Andrew Morton
2006-09-29 23:58 ` [patch 08/23] dynticks: prepare the RCU code Thomas Gleixner
2006-09-30  8:36   ` Andrew Morton
2006-09-30 12:25     ` Dipankar Sarma
2006-09-30 13:09       ` Ingo Molnar
2006-09-30 13:52         ` Dipankar Sarma
2006-09-30 21:35           ` Ingo Molnar
2006-09-29 23:58 ` [patch 09/23] dynticks: extend next_timer_interrupt() to use a reference jiffie Thomas Gleixner
2006-09-30  8:37   ` Andrew Morton
2006-09-29 23:58 ` [patch 10/23] hrtimers: clean up locking Thomas Gleixner
2006-09-30  8:37   ` Andrew Morton
2006-09-29 23:58 ` [patch 11/23] hrtimers: state tracking Thomas Gleixner
2006-09-30  8:37   ` Andrew Morton
2006-09-29 23:58 ` [patch 12/23] hrtimers: clean up callback tracking Thomas Gleixner
2006-09-29 23:58 ` [patch 13/23] clockevents: core Thomas Gleixner
2006-09-30  8:39   ` Andrew Morton
2006-10-03  4:33     ` John Kacur
2006-09-29 23:58 ` [patch 14/23] clockevents: drivers for i386 Thomas Gleixner
2006-09-30  8:40   ` Andrew Morton
2006-09-29 23:58 ` [patch 15/23] high-res timers: core Thomas Gleixner
2006-09-30  8:43   ` Andrew Morton
2006-09-29 23:58 ` [patch 16/23] dynticks: core Thomas Gleixner
2006-09-30  8:44   ` Andrew Morton
2006-09-30 12:11     ` Dipankar Sarma
2006-09-29 23:58 ` [patch 17/23] dyntick: add nohz stats to /proc/stat Thomas Gleixner
2006-09-29 23:58 ` [patch 18/23] dynticks: i386 arch code Thomas Gleixner
2006-09-30  8:45   ` Andrew Morton
2006-09-29 23:58 ` [patch 19/23] high-res timers, dynticks: enable i386 support Thomas Gleixner
2006-09-29 23:58 ` [patch 20/23] add /proc/sys/kernel/timeout_granularity Thomas Gleixner
2006-09-30  8:45   ` Andrew Morton
2006-09-29 23:58 ` [patch 21/23] debugging feature: timer stats Thomas Gleixner
2006-09-30  8:46   ` Andrew Morton
2006-09-29 23:58 ` [patch 22/23] dynticks: increase SLAB timeouts Thomas Gleixner
2006-09-30  8:49   ` Andrew Morton
2006-09-29 23:58 ` [patch 23/23] dynticks: decrease I8042_POLL_PERIOD Thomas Gleixner
2006-09-30  8:49   ` Andrew Morton
2006-09-30  8:35 ` [patch 00/23] Andrew Morton
2006-09-30 19:17   ` Thomas Gleixner
2006-09-30  8:35 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).