linux-kernel.vger.kernel.org archive mirror
* [patch 00/21] high resolution timers / dynamic ticks - V2
@ 2006-10-01 22:59 Thomas Gleixner
  2006-10-01 22:59 ` [patch 01/21] GTOD: exponential update_wall_time Thomas Gleixner
                   ` (24 more replies)
  0 siblings, 25 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 22:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

Andrew,

the following patch series is an update in response to your review.

The following points have been addressed:

- documentation for the high res / dyntick design 

- documentation for timer_stats

- removal of the patches which modify timeout behaviour (i8042, slab, timeout
granularity). Those are definitely worth investigating further, but they are
not a fundamental part of the high res / dyntick feature.

- cleanup of enum -> int abuse

- namespace cleanup

- kernel doc fixups

- improved comments all over the place

- bugs pointed out in the review resolved

- mismerge from -mm1 to -mm2 in the clockevents-i386 patch repaired

- rcu hackery removed: This was a leftover from an attempt to enforce the RCU
updates to be processed on the way to idle rather than waiting for the grace
period expiry. This is interesting to further reduce the idle wakeups with
respect to power saving, but is not necessary for the basic functionality of
the dyntick patch set.


We did not address the GTOD patches, as we want to wait for John's input on
your comments. The persistent clock modifications are useful in two ways:

 1. completely remove manipulation of xtime related variables from the
    architecture code
 2. ensure the correct resume ordering

We experienced resume problems with earlier versions of the high resolution
timer / dyntick patches and we were able to identify the unordered update as
the cause. After an initial workaround similar to the current code, John
resurrected his timeofday-persistent-xxx patch set, which integrates nicely
with the already merged GTOD functionality.

The series contains two new patches:

#09:	hrtimer-enum-and-namespace-cleanup.patch 
	(new patch, resolves review issues vs. enums and namespaces)

#13:	time-and-timer-documentation.patch
	(Move hrtimer.txt to a new directory and add high res / dyntick
	design notes)

A broken out series and a combined patch are available at the usual place:
http://tglx.de/projects/hrtimers/2.6.18-mm2/

The following design notes are also part of patch #13


High resolution timers and dynamic ticks design notes
-----------------------------------------------------

Further information can be found in the paper of the OLS 2006 talk "hrtimers
and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can
be found on the OLS website:
http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf

The slides to this talk are available from:
http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf

The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the
changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the
design of the Linux time(r) system before hrtimers and other building blocks
got merged into mainline.

Note: the paper and the slides talk about "clock event source", while we have
switched to the name "clock event devices" in the meantime.

The design contains the following basic building blocks:

- hrtimer base infrastructure
- timeofday and clock source management
- clock event management
- high resolution timer functionality
- dynamic ticks


hrtimer base infrastructure
---------------------------

The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of
the base implementation are covered in Documentation/hrtimer/hrtimer.txt. See
also figure #2 (OLS slides p. 15).

The main differences to the timer wheel, which holds the armed timer_list type
timers, are:
       - time ordered enqueueing into a rb-tree
       - independent of ticks (the processing is based on nanoseconds)
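
The time ordered enqueueing can be sketched roughly as below. This is an
illustration only: all sketch_* names are hypothetical and not the kernel's
hrtimer structures, and the tree is a plain binary search tree without the
rebalancing step a real rb-tree performs after linking a node. What it shows
is that expiry times are nanosecond values and that the earliest expiring
timer is known right after insertion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified timer node; not the kernel's struct hrtimer. */
struct sketch_timer {
	int64_t expires_ns;		/* absolute expiry time in nanoseconds */
	struct sketch_timer *left, *right;
};

struct sketch_base {
	struct sketch_timer *root;
	struct sketch_timer *first;	/* earliest expiring timer, or NULL */
};

/*
 * Insert a timer ordered by expiry time. A real rb-tree would rebalance
 * after linking the node; that step is left out of this sketch.
 */
static void sketch_enqueue(struct sketch_base *base, struct sketch_timer *t)
{
	struct sketch_timer **link = &base->root;
	int leftmost = 1;

	t->left = t->right = NULL;
	while (*link) {
		if (t->expires_ns < (*link)->expires_ns) {
			link = &(*link)->left;
		} else {
			/* equal expiry goes right: first come, first served */
			link = &(*link)->right;
			leftmost = 0;
		}
	}
	*link = t;

	/* Only a timer that always went left can be the new earliest one */
	if (leftmost)
		base->first = t;
}
```

Tracking the leftmost node at insertion time means that the question "does
the next event change?" can be answered without walking the tree again.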


timeofday and clock source management
-------------------------------------

John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of
code out of the architecture-specific areas into a generic management
framework, as illustrated in figure #3 (OLS slides p. 18). The architecture
specific portion is reduced to the low level hardware details of the clock
sources, which are registered in the framework and selected on a quality based
decision. The low level code provides hardware setup and readout routines and
initializes data structures, which are used by the generic time keeping code to
convert the clock ticks to nanosecond based time values. All other time keeping
related functionality is moved into the generic code. The GTOD base patch got
merged into the 2.6.18 kernel.
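
The conversion of clock ticks to nanosecond based time values is commonly
done with a multiply and shift pair. The following is a minimal sketch under
that assumption; the sketch_* names and the structure layout are made up for
illustration and are not the GTOD data structures.

```c
#include <stdint.h>

/* Hypothetical clock source description; field names are illustrative. */
struct sketch_clocksource {
	uint32_t mult;	/* nanoseconds per cycle, scaled by 2^shift */
	uint32_t shift;	/* scaling exponent */
};

/*
 * ns = (cycles * mult) >> shift
 * Note: cycles * mult can overflow 64 bits for very large cycle deltas;
 * the sketch assumes the caller converts short intervals only.
 */
static uint64_t sketch_cyc2ns(const struct sketch_clocksource *cs,
			      uint64_t cycles)
{
	return (cycles * cs->mult) >> cs->shift;
}

/* Derive mult for a counter running at freq_hz, rounded to nearest. */
static uint32_t sketch_hz2mult(uint32_t freq_hz, uint32_t shift)
{
	uint64_t tmp = (uint64_t)1000000000 << shift;

	tmp += freq_hz / 2;
	return (uint32_t)(tmp / freq_hz);
}
```

For a 1 MHz counter and a shift of 10 this gives a mult of 1024000, so 1000
cycles convert to 1000000 ns.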

Further information about the Generic Time Of Day framework is available in the
OLS 2005 Proceedings Volume 1:
http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf

The paper "We Are Not Getting Any Younger: A New Approach to Time and
Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan.

Figure #3 (OLS slides p.18) illustrates the transformation.


clock event management
----------------------

While clock sources provide read access to the monotonically increasing time
value, clock event devices are used to schedule the next event
interrupt(s). The next event is currently defined to be periodic, with its
period defined at compile time. The setup and selection of the event device
for various event driven functionalities is hardwired into the architecture
dependent code. This results in duplicated code across all architectures and
makes it extremely difficult to change the configuration of the system to use
event interrupt devices other than those already built into the
architecture. Another implication of the current design is that it is necessary
to touch all the architecture-specific implementations in order to provide new
functionality like high resolution timers or dynamic ticks.

The clock events subsystem tries to address this problem by providing a generic
solution to manage clock event devices and their usage for the various clock
event driven kernel functionalities. The goal of the clock event subsystem is
to minimize the clock event related architecture dependent code to the pure
hardware related handling and to allow easy addition and utilization of new
clock event devices. It also minimizes the duplicated code across the
architectures as it provides generic functionality down to the interrupt
service handler, which is almost inherently hardware dependent.

Clock event devices are registered either by the architecture dependent boot
code or at module insertion time. Each clock event device fills a data
structure with clock-specific property parameters and callback functions. The
clock event management decides, by using the specified property parameters, the
set of system functions a clock event device will be used to support. This
includes the distinction of per-CPU and per-system global event devices.

System-level global event devices are used for the Linux periodic tick. Per-CPU
event devices are used to provide local CPU functionality such as process
accounting, profiling, and high resolution timers.

The management layer assigns one or more of the following functions to a clock
event device:
      - system global periodic tick (jiffies update)
      - cpu local update_process_times
      - cpu local profiling
      - cpu local next event interrupt (non periodic mode)

The clock event device delegates the selection of those timer interrupt related
functions completely to the management layer. The clock management layer stores
a function pointer in the device description structure, which has to be called
from the hardware level handler. This removes a lot of duplicated code from the
architecture specific timer interrupt handlers and hands the control over the
clock event devices and the assignment of timer interrupt related functionality
to the core code.
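
The function pointer delegation described above could look roughly like the
following sketch. All names (sketch_clock_event_device, event_handler,
sketch_assign_handler and so on) are assumptions made for illustration and
are not taken from the patch set.

```c
#include <stddef.h>

struct sketch_clock_event_device;

typedef void (*sketch_event_handler_t)(struct sketch_clock_event_device *dev);

/* Hypothetical device description filled in by the architecture code. */
struct sketch_clock_event_device {
	const char *name;
	sketch_event_handler_t event_handler;	/* set by the management layer */
	unsigned long events;			/* demo counter only */
};

/* One of the functions the management layer may assign to a device. */
static void sketch_periodic_tick(struct sketch_clock_event_device *dev)
{
	dev->events++;	/* stands in for the jiffies update etc. */
}

/* Management layer: decides which function the device serves. */
static void sketch_assign_handler(struct sketch_clock_event_device *dev,
				  sketch_event_handler_t handler)
{
	dev->event_handler = handler;
}

/*
 * Hardware level interrupt handler: acknowledges the hardware (omitted
 * here) and calls whatever the management layer has installed.
 */
static void sketch_hw_timer_interrupt(struct sketch_clock_event_device *dev)
{
	if (dev->event_handler)
		dev->event_handler(dev);
}
```

The architecture code never decides what the interrupt is used for; swapping
the periodic tick for a next event handler is a single pointer assignment in
the core code.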

The clock event layer API is rather small. Aside from the clock event device
registration interface it provides functions to schedule the next event
interrupt, a clock event device notification service, and support for suspend
and resume.

The framework adds about 700 lines of code, which results in a 2KB increase of
the kernel binary size. The conversion of i386 removes about 100 lines of
code. The binary size decrease is in the range of 400 bytes. We believe that
the increase of flexibility and the avoidance of duplicated code across
architectures justifies the slight increase of the binary size.

The conversion of an architecture has no functional impact, but makes it
possible to utilize the high resolution and dynamic tick functionalities
without any change to the clock event device and timer interrupt code. After
the conversion, high resolution timers and dynamic ticks are enabled simply by
adding the kernel/time/Kconfig file to the architecture specific Kconfig and
adding the dynamic tick specific calls to the idle routine (a total of 3 lines
added to the idle function and the Kconfig file).

Figure #4 (OLS slides p.20) illustrates the transformation.


high resolution timer functionality
-----------------------------------

During system boot it is not possible to use the high resolution timer
functionality, while making it possible would be difficult and would serve no
useful function. The initialization of the clock event device framework, the
clock source framework (GTOD) and hrtimers itself has to be done and
appropriate clock sources and clock event devices have to be registered before
the high resolution functionality can work. Up to the point where hrtimers are
initialized, the system works in the usual low resolution periodic mode. The
clock source and the clock event device layers provide notification functions
which inform hrtimers about availability of new hardware. hrtimers validates
the usability of the registered clock sources and clock event devices before
switching to high resolution mode. This ensures also that a kernel which is
configured for high resolution timers can run on a system which lacks the
necessary hardware support.

The high resolution timer code does not support SMP machines which have only
global clock event devices. The support of such hardware would involve IPI
calls when an interrupt happens. The overhead would be much larger than the
benefit. This is the reason why we currently disable high resolution and
dynamic ticks on i386 SMP systems which stop the local APIC in C3 power
state. A workaround is available as an idea, but the problem has not been
tackled yet.

The time ordered insertion of timers provides all the infrastructure to decide
whether the event device has to be reprogrammed when a timer is added. The
decision is made per timer base and synchronized across per-cpu timer bases in
a support function. The design allows the system to utilize separate per-CPU
clock event devices for the per-CPU timer bases, but currently only one
reprogrammable clock event device per-CPU is utilized.

When the timer interrupt happens, the next event interrupt handler is called
from the clock event distribution code and moves expired timers from the
red-black tree to a separate doubly linked list and invokes the softirq
handler. An additional mode field in the hrtimer structure allows the system to
execute callback functions directly from the next event interrupt handler. This
is restricted to code which can safely be executed in the hard interrupt
context. This applies, for example, to the common case of a wakeup function as
used by nanosleep. The advantage of executing the handler in the interrupt
context is the avoidance of up to two context switches - from the interrupted
context to the softirq and to the task which is woken up by the expired
timer.
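
A rough sketch of the expiry handling described above follows. It is a
simplification under stated assumptions: the time ordered queue is modelled
as a sorted singly linked list instead of the red-black tree, the pending
list is singly linked, and all sketch_* names, including the mode values,
are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback modes; names are not from the patch set. */
enum sketch_cb_mode {
	SKETCH_CB_SOFTIRQ,	/* callback must run from the softirq */
	SKETCH_CB_IRQSAFE	/* callback may run in hard interrupt context */
};

struct sketch_hrtimer {
	int64_t expires_ns;
	enum sketch_cb_mode mode;
	void (*function)(struct sketch_hrtimer *t);
	struct sketch_hrtimer *next;
};

static unsigned int sketch_wakeups;

/* Example of an interrupt safe callback, e.g. a plain wakeup. */
static void sketch_wakeup(struct sketch_hrtimer *t)
{
	(void)t;
	sketch_wakeups++;
}

/*
 * Next event interrupt: take every expired timer off the time ordered
 * queue. Interrupt safe callbacks run right away, all others are appended
 * to the pending list, which the softirq would process later.
 */
static void sketch_hrtimer_interrupt(struct sketch_hrtimer **queue,
				     struct sketch_hrtimer **pending,
				     int64_t now_ns)
{
	struct sketch_hrtimer **tail = pending;

	while (*tail)
		tail = &(*tail)->next;

	while (*queue && (*queue)->expires_ns <= now_ns) {
		struct sketch_hrtimer *t = *queue;

		*queue = t->next;
		t->next = NULL;
		if (t->mode == SKETCH_CB_IRQSAFE) {
			t->function(t);
		} else {
			*tail = t;
			tail = &t->next;
		}
	}
}
```

Running the wakeup directly from the interrupt is what saves the detour
through the softirq for the common nanosleep style case.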

Once a system has switched to high resolution mode, the periodic tick is
switched off. This disables the per system global periodic clock event device -
e.g. the PIT on i386 SMP systems.

The periodic tick functionality is provided by a per-cpu hrtimer. The callback
function is executed in the next event interrupt context, updates jiffies,
and calls update_process_times and profiling. The implementation of the hrtimer
based periodic tick is designed to be extended with dynamic tick functionality.
This makes it possible to use a single clock event device to schedule high
resolution timer and periodic events (jiffies tick, profiling, process
accounting) on UP systems. This has been proven to work with the PIT on i386
and the Incrementer on PPC.

The softirq for running the hrtimer queues and executing the callbacks has been
separated from the tick bound timer softirq to allow accurate delivery of high
resolution timer signals which are used by itimer and POSIX interval
timers. The execution of this softirq can still be delayed by other softirqs,
but the overall latencies have been significantly improved by this separation.

Figure #5 (OLS slides p.22) illustrates the transformation.


dynamic ticks
-------------

Dynamic ticks are the logical consequence of the hrtimer based periodic tick
replacement (sched_tick). The functionality of the sched_tick hrtimer is
extended by three functions:

- hrtimer_stop_sched_tick
- hrtimer_restart_sched_tick
- hrtimer_update_jiffies

hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code
evaluates the next scheduled timer event (from both hrtimers and the timer
wheel) and in case that the next event is further away than the next tick it
reprograms the sched_tick to this future event, to allow longer idle sleeps
without worthless interruption by the periodic tick. The function is also
called when an interrupt happens during the idle period, which does not cause a
reschedule. The call is necessary as the interrupt handler might have armed a
new timer whose expiry time is before the time which was identified as the
nearest event in the previous call to hrtimer_stop_sched_tick.
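
The decision made when going idle can be sketched as below. This is a
minimal illustration of the rule stated above and not the code from the
patch; the function name and its parameters are hypothetical.

```c
#include <stdint.h>

/*
 * Return the expiry time to program for the sched_tick when the CPU goes
 * idle. next_hrtimer_ns and next_wheel_ns are the earliest pending events
 * from hrtimers and from the timer wheel (both assumptions of this
 * sketch, passed in by the caller).
 */
static int64_t sketch_next_sched_tick(int64_t now_ns, int64_t tick_period_ns,
				      int64_t next_hrtimer_ns,
				      int64_t next_wheel_ns)
{
	int64_t next_tick = now_ns + tick_period_ns;
	int64_t next_event = next_hrtimer_ns < next_wheel_ns ?
			     next_hrtimer_ns : next_wheel_ns;

	/* Next event is further away than the next tick: sleep longer */
	if (next_event > next_tick)
		return next_event;

	/* Otherwise keep the regular tick */
	return next_tick;
}
```

With a 4 ms tick period and the nearest event 30 ms away, the tick is
deferred to that event; with an event only 1 ms away, the regular tick at
4 ms is kept.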

hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before
it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic tick,
which is kept active until the next call to hrtimer_stop_sched_tick().

hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens
in the idle period to make sure that jiffies are up to date and the interrupt
handler does not have to deal with a possibly stale jiffy value.

The dynamic tick feature provides statistical values which are exported to
userspace via /proc/stat and can be made available for enhanced power
management control.

The implementation leaves room for further development like full tickless
systems, where the time slice is controlled by the scheduler, variable
frequency profiling, and a complete removal of jiffies in the future.


Aside from the current initial submission of i386 support, the patchset has
already been extended to x86_64 and ARM. Initial (work in progress) support is
also available for MIPS and PowerPC.

	  Thomas, Ingo

--



* [patch 01/21] GTOD: exponential update_wall_time
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
@ 2006-10-01 22:59 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 02/21] GTOD: persistent clock support, core Thomas Gleixner
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 22:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: update-times-exponential.patch --]
[-- Type: text/plain, Size: 2745 bytes --]

From: John Stultz <johnstul@us.ibm.com>

Accumulate time in update_wall_time() exponentially.  This avoids long
running loops seen with the dynticks patch as well as the problematic
hang seen on systems with broken clocksources.

NOTE: this only has relevance on dyntick kernels, so the quality of
NTP updates on jiffies-tick systems is unaffected. (non-dyntick
kernels call update_wall_time() in every timer tick)
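
For illustration, the exponential accumulation can be simulated in userspace
as sketched below. The structure is a stripped down, hypothetical stand-in
for the clocksource (no NTP error handling, no second overflow, nanoseconds
kept unshifted). Unlike the patch, the initial scan in this sketch uses >=,
which guarantees that the shift never becomes negative while the main loop
is still running; very large offsets that would overflow the shifted
interval are not handled.

```c
#include <stdint.h>

/* Hypothetical, reduced clock state for the simulation. */
struct sketch_clock {
	uint64_t cycle_interval;	/* cycles per tick */
	uint64_t xtime_interval;	/* nanoseconds per tick */
	uint64_t cycle_last;
	uint64_t xtime_nsec;
};

/*
 * Accumulate 'offset' cycles in power of two multiples of one interval.
 * Returns the number of accumulation steps, which is the number of set
 * bits in (offset / cycle_interval) instead of that value itself.
 */
static unsigned int sketch_accumulate(struct sketch_clock *clock,
				      uint64_t offset)
{
	unsigned int steps = 0;
	int shift = 0;

	/* Find the largest chunk (interval << shift) that fits */
	while (offset >= clock->cycle_interval << (shift + 1))
		shift++;

	while (offset >= clock->cycle_interval) {
		if (offset < (clock->cycle_interval << shift)) {
			shift--;
			continue;
		}

		/* accumulate 2^shift intervals at once */
		clock->xtime_nsec += clock->xtime_interval << shift;
		clock->cycle_last += clock->cycle_interval << shift;
		offset -= clock->cycle_interval << shift;
		steps++;

		shift--;
	}
	return steps;
}
```

An offset of 1000 intervals is accumulated in 6 steps (512 + 256 + 128 + 64 +
32 + 8) instead of 1000 loop iterations, while the accumulated totals stay
the same as with one interval per iteration.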

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--

 kernel/timer.c |   28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-10-02 00:55:50.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-10-02 00:55:50.000000000 +0200
@@ -907,6 +907,7 @@ static void clocksource_adjust(struct cl
 static void update_wall_time(void)
 {
 	cycle_t offset;
+	int shift = 0;
 
 	/* Make sure we're fully resumed: */
 	if (unlikely(timekeeping_suspended))
@@ -919,28 +920,39 @@ static void update_wall_time(void)
 #endif
 	clock->xtime_nsec += (s64)xtime.tv_nsec << clock->shift;
 
+	while (offset > clock->cycle_interval << (shift + 1))
+		shift++;
+
 	/* normally this loop will run just once, however in the
 	 * case of lost or late ticks, it will accumulate correctly.
 	 */
 	while (offset >= clock->cycle_interval) {
+		if (offset < (clock->cycle_interval << shift)) {
+			shift--;
+			continue;
+		}
+
 		/* accumulate one interval */
-		clock->xtime_nsec += clock->xtime_interval;
-		clock->cycle_last += clock->cycle_interval;
-		offset -= clock->cycle_interval;
+		clock->xtime_nsec += clock->xtime_interval << shift;
+		clock->cycle_last += clock->cycle_interval << shift;
+		offset -= clock->cycle_interval << shift;
 
-		if (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
+		while (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
 			clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift;
 			xtime.tv_sec++;
 			second_overflow();
 		}
 
 		/* interpolator bits */
-		time_interpolator_update(clock->xtime_interval
-						>> clock->shift);
+		time_interpolator_update((clock->xtime_interval
+						>> clock->shift)<<shift);
 
 		/* accumulate error between NTP and clock interval */
-		clock->error += current_tick_length();
-		clock->error -= clock->xtime_interval << (TICK_LENGTH_SHIFT - clock->shift);
+		clock->error += current_tick_length() << shift;
+		clock->error -= (clock->xtime_interval
+			<< (TICK_LENGTH_SHIFT - clock->shift))<<shift;
+
+		shift--;
 	}
 
 	/* correct the clock when NTP error is too big */

--



* [patch 02/21] GTOD: persistent clock support, core
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
  2006-10-01 22:59 ` [patch 01/21] GTOD: exponential update_wall_time Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 03/21] GTOD: persistent clock support, i386 Thomas Gleixner
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: linux-2.6.18-rc6_timeofday-persistent-clock-generic_C6.patch --]
[-- Type: text/plain, Size: 4767 bytes --]

From: John Stultz <johnstul@us.ibm.com>

persistent clock support: do proper timekeeping across suspend/resume.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h |    3 +++
 include/linux/time.h    |    1 +
 kernel/hrtimer.c        |    8 ++++++++
 kernel/timer.c          |   34 +++++++++++++++++++++++++++++++---
 4 files changed, 43 insertions(+), 3 deletions(-)

linux-2.6.18-rc6_timeofday-persistent-clock-generic_C6.patch
Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-10-02 00:55:49.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-10-02 00:55:50.000000000 +0200
@@ -146,6 +146,9 @@ extern void hrtimer_init_sleeper(struct 
 /* Soft interrupt function to run the hrtimer queues: */
 extern void hrtimer_run_queues(void);
 
+/* Resume notification */
+void hrtimer_notify_resume(void);
+
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
 
Index: linux-2.6.18-mm2/include/linux/time.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/time.h	2006-10-02 00:55:49.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/time.h	2006-10-02 00:55:50.000000000 +0200
@@ -92,6 +92,7 @@ extern struct timespec xtime;
 extern struct timespec wall_to_monotonic;
 extern seqlock_t xtime_lock;
 
+extern unsigned long read_persistent_clock(void);
 void timekeeping_init(void);
 
 static inline unsigned long get_seconds(void)
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:49.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:50.000000000 +0200
@@ -287,6 +287,14 @@ static unsigned long ktime_divns(const k
 #endif /* BITS_PER_LONG >= 64 */
 
 /*
+ * Timekeeping resumed notification
+ */
+void hrtimer_notify_resume(void)
+{
+	clock_was_set();
+}
+
+/*
  * Counterpart to lock_timer_base above:
  */
 static inline
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-10-02 00:55:50.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-10-02 00:55:50.000000000 +0200
@@ -41,6 +41,9 @@
 #include <asm/timex.h>
 #include <asm/io.h>
 
+/* jiffies at the most recent update of wall time */
+unsigned long wall_jiffies = INITIAL_JIFFIES;
+
 u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
 EXPORT_SYMBOL(jiffies_64);
@@ -743,12 +746,20 @@ int timekeeping_is_continuous(void)
 	return ret;
 }
 
+/* Weak dummy function for arches that do not yet support it.
+ * XXX - Do be sure to remove it once all arches implement it.
+ */
+unsigned long __attribute__((weak)) read_persistent_clock(void)
+{
+	return 0;
+}
+
 /*
  * timekeeping_init - Initializes the clocksource and common timekeeping values
  */
 void __init timekeeping_init(void)
 {
-	unsigned long flags;
+	unsigned long flags, sec = read_persistent_clock();
 
 	write_seqlock_irqsave(&xtime_lock, flags);
 
@@ -758,11 +769,18 @@ void __init timekeeping_init(void)
 	clocksource_calculate_interval(clock, tick_nsec);
 	clock->cycle_last = clocksource_read(clock);
 
+	xtime.tv_sec = sec;
+	xtime.tv_nsec = (jiffies % HZ) * (NSEC_PER_SEC / HZ);
+	set_normalized_timespec(&wall_to_monotonic,
+		-xtime.tv_sec, -xtime.tv_nsec);
+
 	write_sequnlock_irqrestore(&xtime_lock, flags);
 }
 
 
 static int timekeeping_suspended;
+static unsigned long timekeeping_suspend_time;
+
 /**
  * timekeeping_resume - Resumes the generic timekeeping subsystem.
  * @dev:	unused
@@ -773,14 +791,23 @@ static int timekeeping_suspended;
  */
 static int timekeeping_resume(struct sys_device *dev)
 {
-	unsigned long flags;
+	unsigned long flags, now = read_persistent_clock();
 
 	write_seqlock_irqsave(&xtime_lock, flags);
-	/* restart the last cycle value */
+
+	if (now && (now > timekeeping_suspend_time)) {
+		unsigned long sleep_length = now - timekeeping_suspend_time;
+		xtime.tv_sec += sleep_length;
+		jiffies_64 += sleep_length * HZ;
+	}
+	/* re-base the last cycle value */
 	clock->cycle_last = clocksource_read(clock);
 	clock->error = 0;
 	timekeeping_suspended = 0;
 	write_sequnlock_irqrestore(&xtime_lock, flags);
+
+	hrtimer_notify_resume();
+
 	return 0;
 }
 
@@ -790,6 +817,7 @@ static int timekeeping_suspend(struct sy
 
 	write_seqlock_irqsave(&xtime_lock, flags);
 	timekeeping_suspended = 1;
+	timekeeping_suspend_time = read_persistent_clock();
 	write_sequnlock_irqrestore(&xtime_lock, flags);
 	return 0;
 }

--



* [patch 03/21] GTOD: persistent clock support, i386
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
  2006-10-01 22:59 ` [patch 01/21] GTOD: exponential update_wall_time Thomas Gleixner
  2006-10-01 23:00 ` [patch 02/21] GTOD: persistent clock support, core Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 04/21] time: uninline jiffies.h Thomas Gleixner
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: linux-2.6.18-rc6_timeofday-persistent-clock-i386_C6.patch --]
[-- Type: text/plain, Size: 6080 bytes --]

From: John Stultz <johnstul@us.ibm.com>

persistent clock support: do proper timekeeping across suspend/resume,
i386 arch support.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 arch/i386/kernel/apm.c  |   44 ---------------------------------------
 arch/i386/kernel/time.c |   54 +-----------------------------------------------
 2 files changed, 2 insertions(+), 96 deletions(-)

linux-2.6.18-rc6_timeofday-persistent-clock-i386_C6.patch
Index: linux-2.6.18-mm2/arch/i386/kernel/apm.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/apm.c	2006-10-02 00:55:49.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/apm.c	2006-10-02 00:55:50.000000000 +0200
@@ -234,7 +234,6 @@
 
 #include "io_ports.h"
 
-extern unsigned long get_cmos_time(void);
 extern void machine_real_restart(unsigned char *, int);
 
 #if defined(CONFIG_APM_DISPLAY_BLANK) && defined(CONFIG_VT)
@@ -1153,28 +1152,6 @@ out:
 	spin_unlock(&user_list_lock);
 }
 
-static void set_time(void)
-{
-	struct timespec ts;
-	if (got_clock_diff) {	/* Must know time zone in order to set clock */
-		ts.tv_sec = get_cmos_time() + clock_cmos_diff;
-		ts.tv_nsec = 0;
-		do_settimeofday(&ts);
-	} 
-}
-
-static void get_time_diff(void)
-{
-#ifndef CONFIG_APM_RTC_IS_GMT
-	/*
-	 * Estimate time zone so that set_time can update the clock
-	 */
-	clock_cmos_diff = -get_cmos_time();
-	clock_cmos_diff += get_seconds();
-	got_clock_diff = 1;
-#endif
-}
-
 static void reinit_timer(void)
 {
 #ifdef INIT_TIMER_AFTER_SUSPEND
@@ -1214,19 +1191,6 @@ static int suspend(int vetoable)
 	local_irq_disable();
 	device_power_down(PMSG_SUSPEND);
 
-	/* serialize with the timer interrupt */
-	write_seqlock(&xtime_lock);
-
-	/* protect against access to timer chip registers */
-	spin_lock(&i8253_lock);
-
-	get_time_diff();
-	/*
-	 * Irq spinlock must be dropped around set_system_power_state.
-	 * We'll undo any timer changes due to interrupts below.
-	 */
-	spin_unlock(&i8253_lock);
-	write_sequnlock(&xtime_lock);
 	local_irq_enable();
 
 	save_processor_state();
@@ -1235,7 +1199,6 @@ static int suspend(int vetoable)
 	restore_processor_state();
 
 	local_irq_disable();
-	set_time();
 	reinit_timer();
 
 	if (err == APM_NO_ERROR)
@@ -1265,11 +1228,6 @@ static void standby(void)
 
 	local_irq_disable();
 	device_power_down(PMSG_SUSPEND);
-	/* serialize with the timer interrupt */
-	write_seqlock(&xtime_lock);
-	/* If needed, notify drivers here */
-	get_time_diff();
-	write_sequnlock(&xtime_lock);
 	local_irq_enable();
 
 	err = set_system_power_state(APM_STATE_STANDBY);
@@ -1363,7 +1321,6 @@ static void check_events(void)
 			ignore_bounce = 1;
 			if ((event != APM_NORMAL_RESUME)
 			    || (ignore_normal_resume == 0)) {
-				set_time();
 				device_resume();
 				pm_send_all(PM_RESUME, (void *)0);
 				queue_event(event, NULL);
@@ -1379,7 +1336,6 @@ static void check_events(void)
 			break;
 
 		case APM_UPDATE_TIME:
-			set_time();
 			break;
 
 		case APM_CRITICAL_SUSPEND:
Index: linux-2.6.18-mm2/arch/i386/kernel/time.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/time.c	2006-10-02 00:55:49.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/time.c	2006-10-02 00:55:50.000000000 +0200
@@ -216,7 +216,7 @@ irqreturn_t timer_interrupt(int irq, voi
 }
 
 /* not static: needed by APM */
-unsigned long get_cmos_time(void)
+unsigned long read_persistent_clock(void)
 {
 	unsigned long retval;
 	unsigned long flags;
@@ -232,7 +232,7 @@ unsigned long get_cmos_time(void)
 
 	return retval;
 }
-EXPORT_SYMBOL(get_cmos_time);
+EXPORT_SYMBOL(read_persistent_clock);
 
 static void sync_cmos_clock(unsigned long dummy);
 
@@ -283,58 +283,19 @@ void notify_arch_cmos_timer(void)
 	mod_timer(&sync_cmos_timer, jiffies + 1);
 }
 
-static long clock_cmos_diff;
-static unsigned long sleep_start;
-
-static int timer_suspend(struct sys_device *dev, pm_message_t state)
-{
-	/*
-	 * Estimate time zone so that set_time can update the clock
-	 */
-	unsigned long ctime =  get_cmos_time();
-
-	clock_cmos_diff = -ctime;
-	clock_cmos_diff += get_seconds();
-	sleep_start = ctime;
-	return 0;
-}
-
 static int timer_resume(struct sys_device *dev)
 {
-	unsigned long flags;
-	unsigned long sec;
-	unsigned long ctime = get_cmos_time();
-	long sleep_length = (ctime - sleep_start) * HZ;
-	struct timespec ts;
-
-	if (sleep_length < 0) {
-		printk(KERN_WARNING "CMOS clock skew detected in timer resume!\n");
-		/* The time after the resume must not be earlier than the time
-		 * before the suspend or some nasty things will happen
-		 */
-		sleep_length = 0;
-		ctime = sleep_start;
-	}
 #ifdef CONFIG_HPET_TIMER
 	if (is_hpet_enabled())
 		hpet_reenable();
 #endif
 	setup_pit_timer();
 
-	sec = ctime + clock_cmos_diff;
-	ts.tv_sec = sec;
-	ts.tv_nsec = 0;
-	do_settimeofday(&ts);
-	write_seqlock_irqsave(&xtime_lock, flags);
-	jiffies_64 += sleep_length;
-	write_sequnlock_irqrestore(&xtime_lock, flags);
-	touch_softlockup_watchdog();
 	return 0;
 }
 
 static struct sysdev_class timer_sysclass = {
 	.resume = timer_resume,
-	.suspend = timer_suspend,
 	set_kset_name("timer"),
 };
 
@@ -360,12 +321,6 @@ extern void (*late_time_init)(void);
 /* Duplicate of time_init() below, with hpet_enable part added */
 static void __init hpet_time_init(void)
 {
-	struct timespec ts;
-	ts.tv_sec = get_cmos_time();
-	ts.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ);
-
-	do_settimeofday(&ts);
-
 	if ((hpet_enable() >= 0) && hpet_use_timer) {
 		printk("Using HPET for base-timer\n");
 	}
@@ -376,7 +331,6 @@ static void __init hpet_time_init(void)
 
 void __init time_init(void)
 {
-	struct timespec ts;
 #ifdef CONFIG_HPET_TIMER
 	if (is_hpet_capable()) {
 		/*
@@ -387,10 +341,6 @@ void __init time_init(void)
 		return;
 	}
 #endif
-	ts.tv_sec = get_cmos_time();
-	ts.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ);
-
-	do_settimeofday(&ts);
 
 	time_init_hook();
 }

--



* [patch 04/21] time: uninline jiffies.h
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (2 preceding siblings ...)
  2006-10-01 23:00 ` [patch 03/21] GTOD: persistent clock support, i386 Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 05/21] time: fix msecs_to_jiffies() bug Thomas Gleixner
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: uninline-jiffies-h.patch --]
[-- Type: text/plain, Size: 13958 bytes --]

From: Ingo Molnar <mingo@elte.hu>

There are loads of fat functions hidden in jiffies.h. Uninline them.
No code changes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 include/linux/jiffies.h |  223 +++---------------------------------------------
 kernel/time.c           |  218 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 234 insertions(+), 207 deletions(-)

Index: linux-2.6.18-mm2/include/linux/jiffies.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/jiffies.h	2006-10-02 00:55:49.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/jiffies.h	2006-10-02 00:55:50.000000000 +0200
@@ -259,215 +259,24 @@ static inline u64 get_jiffies_64(void)
 #endif
 
 /*
- * Convert jiffies to milliseconds and back.
- *
- * Avoid unnecessary multiplications/divisions in the
- * two most common HZ cases:
+ * Convert various time units to each other:
  */
-static inline unsigned int jiffies_to_msecs(const unsigned long j)
-{
-#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
-	return (MSEC_PER_SEC / HZ) * j;
-#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
-	return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
-#else
-	return (j * MSEC_PER_SEC) / HZ;
-#endif
-}
-
-static inline unsigned int jiffies_to_usecs(const unsigned long j)
-{
-#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
-	return (USEC_PER_SEC / HZ) * j;
-#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
-	return (j + (HZ / USEC_PER_SEC) - 1)/(HZ / USEC_PER_SEC);
-#else
-	return (j * USEC_PER_SEC) / HZ;
-#endif
-}
-
-static inline unsigned long msecs_to_jiffies(const unsigned int m)
-{
-	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
-		return MAX_JIFFY_OFFSET;
-#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
-	return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
-#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
-	return m * (HZ / MSEC_PER_SEC);
-#else
-	return (m * HZ + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
-#endif
-}
-
-static inline unsigned long usecs_to_jiffies(const unsigned int u)
-{
-	if (u > jiffies_to_usecs(MAX_JIFFY_OFFSET))
-		return MAX_JIFFY_OFFSET;
-#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
-	return (u + (USEC_PER_SEC / HZ) - 1) / (USEC_PER_SEC / HZ);
-#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
-	return u * (HZ / USEC_PER_SEC);
-#else
-	return (u * HZ + USEC_PER_SEC - 1) / USEC_PER_SEC;
-#endif
-}
-
-/*
- * The TICK_NSEC - 1 rounds up the value to the next resolution.  Note
- * that a remainder subtract here would not do the right thing as the
- * resolution values don't fall on second boundries.  I.e. the line:
- * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
- *
- * Rather, we just shift the bits off the right.
- *
- * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
- * value to a scaled second value.
- */
-static __inline__ unsigned long
-timespec_to_jiffies(const struct timespec *value)
-{
-	unsigned long sec = value->tv_sec;
-	long nsec = value->tv_nsec + TICK_NSEC - 1;
-
-	if (sec >= MAX_SEC_IN_JIFFIES){
-		sec = MAX_SEC_IN_JIFFIES;
-		nsec = 0;
-	}
-	return (((u64)sec * SEC_CONVERSION) +
-		(((u64)nsec * NSEC_CONVERSION) >>
-		 (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
-
-}
-
-static __inline__ void
-jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
-{
-	/*
-	 * Convert jiffies to nanoseconds and separate with
-	 * one divide.
-	 */
-	u64 nsec = (u64)jiffies * TICK_NSEC;
-	value->tv_sec = div_long_long_rem(nsec, NSEC_PER_SEC, &value->tv_nsec);
-}
-
-/* Same for "timeval"
- *
- * Well, almost.  The problem here is that the real system resolution is
- * in nanoseconds and the value being converted is in micro seconds.
- * Also for some machines (those that use HZ = 1024, in-particular),
- * there is a LARGE error in the tick size in microseconds.
-
- * The solution we use is to do the rounding AFTER we convert the
- * microsecond part.  Thus the USEC_ROUND, the bits to be shifted off.
- * Instruction wise, this should cost only an additional add with carry
- * instruction above the way it was done above.
- */
-static __inline__ unsigned long
-timeval_to_jiffies(const struct timeval *value)
-{
-	unsigned long sec = value->tv_sec;
-	long usec = value->tv_usec;
-
-	if (sec >= MAX_SEC_IN_JIFFIES){
-		sec = MAX_SEC_IN_JIFFIES;
-		usec = 0;
-	}
-	return (((u64)sec * SEC_CONVERSION) +
-		(((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
-		 (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
-}
-
-static __inline__ void
-jiffies_to_timeval(const unsigned long jiffies, struct timeval *value)
-{
-	/*
-	 * Convert jiffies to nanoseconds and separate with
-	 * one divide.
-	 */
-	u64 nsec = (u64)jiffies * TICK_NSEC;
-	long tv_usec;
-
-	value->tv_sec = div_long_long_rem(nsec, NSEC_PER_SEC, &tv_usec);
-	tv_usec /= NSEC_PER_USEC;
-	value->tv_usec = tv_usec;
-}
-
-/*
- * Convert jiffies/jiffies_64 to clock_t and back.
- */
-static inline clock_t jiffies_to_clock_t(long x)
-{
-#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
-	return x / (HZ / USER_HZ);
-#else
-	u64 tmp = (u64)x * TICK_NSEC;
-	do_div(tmp, (NSEC_PER_SEC / USER_HZ));
-	return (long)tmp;
-#endif
-}
-
-static inline unsigned long clock_t_to_jiffies(unsigned long x)
-{
-#if (HZ % USER_HZ)==0
-	if (x >= ~0UL / (HZ / USER_HZ))
-		return ~0UL;
-	return x * (HZ / USER_HZ);
-#else
-	u64 jif;
-
-	/* Don't worry about loss of precision here .. */
-	if (x >= ~0UL / HZ * USER_HZ)
-		return ~0UL;
-
-	/* .. but do try to contain it here */
-	jif = x * (u64) HZ;
-	do_div(jif, USER_HZ);
-	return jif;
-#endif
-}
-
-static inline u64 jiffies_64_to_clock_t(u64 x)
-{
-#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
-	do_div(x, HZ / USER_HZ);
-#else
-	/*
-	 * There are better ways that don't overflow early,
-	 * but even this doesn't overflow in hundreds of years
-	 * in 64 bits, so..
-	 */
-	x *= TICK_NSEC;
-	do_div(x, (NSEC_PER_SEC / USER_HZ));
-#endif
-	return x;
-}
-
-static inline u64 nsec_to_clock_t(u64 x)
-{
-#if (NSEC_PER_SEC % USER_HZ) == 0
-	do_div(x, (NSEC_PER_SEC / USER_HZ));
-#elif (USER_HZ % 512) == 0
-	x *= USER_HZ/512;
-	do_div(x, (NSEC_PER_SEC / 512));
-#else
-	/*
-         * max relative error 5.7e-8 (1.8s per year) for USER_HZ <= 1024,
-         * overflow after 64.99 years.
-         * exact for HZ=60, 72, 90, 120, 144, 180, 300, 600, 900, ...
-         */
-	x *= 9;
-	do_div(x, (unsigned long)((9ull * NSEC_PER_SEC + (USER_HZ/2))
-	                          / USER_HZ));
-#endif
-	return x;
-}
+extern unsigned int jiffies_to_msecs(const unsigned long j);
+extern unsigned int jiffies_to_usecs(const unsigned long j);
+extern unsigned long msecs_to_jiffies(const unsigned int m);
+extern unsigned long usecs_to_jiffies(const unsigned int u);
+extern unsigned long timespec_to_jiffies(const struct timespec *value);
+extern void jiffies_to_timespec(const unsigned long jiffies,
+				struct timespec *value);
+extern unsigned long timeval_to_jiffies(const struct timeval *value);
+extern void jiffies_to_timeval(const unsigned long jiffies,
+			       struct timeval *value);
+extern clock_t jiffies_to_clock_t(long x);
+extern unsigned long clock_t_to_jiffies(unsigned long x);
+extern u64 jiffies_64_to_clock_t(u64 x);
+extern u64 nsec_to_clock_t(u64 x);
+extern int nsec_to_timestamp(char *s, u64 t);
 
-static inline int nsec_to_timestamp(char *s, u64 t)
-{
-	unsigned long nsec_rem = do_div(t, NSEC_PER_SEC);
-	return sprintf(s, "[%5lu.%06lu]", (unsigned long)t,
-		       nsec_rem/NSEC_PER_USEC);
-}
 #define TIMESTAMP_SIZE	30
 
 #endif
Index: linux-2.6.18-mm2/kernel/time.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time.c	2006-10-02 00:55:49.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time.c	2006-10-02 00:55:50.000000000 +0200
@@ -470,6 +470,224 @@ struct timeval ns_to_timeval(const s64 n
 	return tv;
 }
 
+/*
+ * Convert jiffies to milliseconds and back.
+ *
+ * Avoid unnecessary multiplications/divisions in the
+ * two most common HZ cases:
+ */
+unsigned int jiffies_to_msecs(const unsigned long j)
+{
+#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
+	return (MSEC_PER_SEC / HZ) * j;
+#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
+	return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
+#else
+	return (j * MSEC_PER_SEC) / HZ;
+#endif
+}
+EXPORT_SYMBOL(jiffies_to_msecs);
+
+unsigned int jiffies_to_usecs(const unsigned long j)
+{
+#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
+	return (USEC_PER_SEC / HZ) * j;
+#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
+	return (j + (HZ / USEC_PER_SEC) - 1)/(HZ / USEC_PER_SEC);
+#else
+	return (j * USEC_PER_SEC) / HZ;
+#endif
+}
+EXPORT_SYMBOL(jiffies_to_usecs);
+
+unsigned long msecs_to_jiffies(const unsigned int m)
+{
+	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+		return MAX_JIFFY_OFFSET;
+#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
+	return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
+#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
+	return m * (HZ / MSEC_PER_SEC);
+#else
+	return (m * HZ + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
+#endif
+}
+EXPORT_SYMBOL(msecs_to_jiffies);
+
+unsigned long usecs_to_jiffies(const unsigned int u)
+{
+	if (u > jiffies_to_usecs(MAX_JIFFY_OFFSET))
+		return MAX_JIFFY_OFFSET;
+#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
+	return (u + (USEC_PER_SEC / HZ) - 1) / (USEC_PER_SEC / HZ);
+#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
+	return u * (HZ / USEC_PER_SEC);
+#else
+	return (u * HZ + USEC_PER_SEC - 1) / USEC_PER_SEC;
+#endif
+}
+EXPORT_SYMBOL(usecs_to_jiffies);
+
+/*
+ * The TICK_NSEC - 1 rounds up the value to the next resolution.  Note
+ * that a remainder subtract here would not do the right thing as the
+ * resolution values don't fall on second boundries.  I.e. the line:
+ * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
+ *
+ * Rather, we just shift the bits off the right.
+ *
+ * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
+ * value to a scaled second value.
+ */
+unsigned long
+timespec_to_jiffies(const struct timespec *value)
+{
+	unsigned long sec = value->tv_sec;
+	long nsec = value->tv_nsec + TICK_NSEC - 1;
+
+	if (sec >= MAX_SEC_IN_JIFFIES){
+		sec = MAX_SEC_IN_JIFFIES;
+		nsec = 0;
+	}
+	return (((u64)sec * SEC_CONVERSION) +
+		(((u64)nsec * NSEC_CONVERSION) >>
+		 (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
+
+}
+EXPORT_SYMBOL(timespec_to_jiffies);
+
+void
+jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
+{
+	/*
+	 * Convert jiffies to nanoseconds and separate with
+	 * one divide.
+	 */
+	u64 nsec = (u64)jiffies * TICK_NSEC;
+	value->tv_sec = div_long_long_rem(nsec, NSEC_PER_SEC, &value->tv_nsec);
+}
+
+/* Same for "timeval"
+ *
+ * Well, almost.  The problem here is that the real system resolution is
+ * in nanoseconds and the value being converted is in micro seconds.
+ * Also for some machines (those that use HZ = 1024, in-particular),
+ * there is a LARGE error in the tick size in microseconds.
+
+ * The solution we use is to do the rounding AFTER we convert the
+ * microsecond part.  Thus the USEC_ROUND, the bits to be shifted off.
+ * Instruction wise, this should cost only an additional add with carry
+ * instruction above the way it was done above.
+ */
+unsigned long
+timeval_to_jiffies(const struct timeval *value)
+{
+	unsigned long sec = value->tv_sec;
+	long usec = value->tv_usec;
+
+	if (sec >= MAX_SEC_IN_JIFFIES){
+		sec = MAX_SEC_IN_JIFFIES;
+		usec = 0;
+	}
+	return (((u64)sec * SEC_CONVERSION) +
+		(((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
+		 (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
+}
+
+void jiffies_to_timeval(const unsigned long jiffies, struct timeval *value)
+{
+	/*
+	 * Convert jiffies to nanoseconds and separate with
+	 * one divide.
+	 */
+	u64 nsec = (u64)jiffies * TICK_NSEC;
+	long tv_usec;
+
+	value->tv_sec = div_long_long_rem(nsec, NSEC_PER_SEC, &tv_usec);
+	tv_usec /= NSEC_PER_USEC;
+	value->tv_usec = tv_usec;
+}
+
+/*
+ * Convert jiffies/jiffies_64 to clock_t and back.
+ */
+clock_t jiffies_to_clock_t(long x)
+{
+#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
+	return x / (HZ / USER_HZ);
+#else
+	u64 tmp = (u64)x * TICK_NSEC;
+	do_div(tmp, (NSEC_PER_SEC / USER_HZ));
+	return (long)tmp;
+#endif
+}
+EXPORT_SYMBOL(jiffies_to_clock_t);
+
+unsigned long clock_t_to_jiffies(unsigned long x)
+{
+#if (HZ % USER_HZ)==0
+	if (x >= ~0UL / (HZ / USER_HZ))
+		return ~0UL;
+	return x * (HZ / USER_HZ);
+#else
+	u64 jif;
+
+	/* Don't worry about loss of precision here .. */
+	if (x >= ~0UL / HZ * USER_HZ)
+		return ~0UL;
+
+	/* .. but do try to contain it here */
+	jif = x * (u64) HZ;
+	do_div(jif, USER_HZ);
+	return jif;
+#endif
+}
+EXPORT_SYMBOL(clock_t_to_jiffies);
+
+u64 jiffies_64_to_clock_t(u64 x)
+{
+#if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
+	do_div(x, HZ / USER_HZ);
+#else
+	/*
+	 * There are better ways that don't overflow early,
+	 * but even this doesn't overflow in hundreds of years
+	 * in 64 bits, so..
+	 */
+	x *= TICK_NSEC;
+	do_div(x, (NSEC_PER_SEC / USER_HZ));
+#endif
+	return x;
+}
+
+EXPORT_SYMBOL(jiffies_64_to_clock_t);
+
+u64 nsec_to_clock_t(u64 x)
+{
+#if (NSEC_PER_SEC % USER_HZ) == 0
+	do_div(x, (NSEC_PER_SEC / USER_HZ));
+#elif (USER_HZ % 512) == 0
+	x *= USER_HZ/512;
+	do_div(x, (NSEC_PER_SEC / 512));
+#else
+	/*
+         * max relative error 5.7e-8 (1.8s per year) for USER_HZ <= 1024,
+         * overflow after 64.99 years.
+         * exact for HZ=60, 72, 90, 120, 144, 180, 300, 600, 900, ...
+         */
+	x *= 9;
+	do_div(x, (unsigned long)((9ull * NSEC_PER_SEC + (USER_HZ/2)) /
+				  USER_HZ));
+#endif
+	return x;
+}
+
+int nsec_to_timestamp(char *s, u64 t)
+{
+	unsigned long nsec_rem = do_div(t, NSEC_PER_SEC);
+	return sprintf(s, "[%5lu.%06lu]", (unsigned long)t,
+		       nsec_rem/NSEC_PER_USEC);
+}
 __attribute__((weak)) unsigned long long timestamp_clock(void)
 {
 	return sched_clock();

--


* [patch 05/21] time: fix msecs_to_jiffies() bug
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (3 preceding siblings ...)
  2006-10-01 23:00 ` [patch 04/21] time: uninline jiffies.h Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 06/21] time: fix timeout overflow Thomas Gleixner
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: fix-msec-conversion.patch --]
[-- Type: text/plain, Size: 2680 bytes --]

From: Ingo Molnar <mingo@elte.hu>

Fix multiple conversion bugs in msecs_to_jiffies().

The main problem is that this condition:

       if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))

overflows if HZ is smaller than 1000!

This change is user-visible: for HZ=250, a SUS-compliant poll() timeout
value of -20 is mistakenly converted to an 'immediate timeout'.

(The new dyntick code also triggered this, as it frequently creates
'lagging timer wheel' scenarios.)
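
The wrap-around in the old guard can be reproduced in user space. The
sketch below is not kernel code: it assumes a 32-bit unsigned long
(modelled with uint32_t), takes HZ as a runtime parameter, covers only
the "HZ divides 1000" branch, and uses hypothetical model_* names.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MSEC_PER_SEC	1000u
/* (~0UL >> 1) - 1, assuming a 32-bit unsigned long */
#define MODEL_MAX_JIFFY_OFFSET	((UINT32_MAX >> 1) - 1)

/* jiffies -> msecs; the multiplication wraps modulo 2^32 */
static uint32_t model_jiffies_to_msecs(uint32_t j, uint32_t hz)
{
	assert(hz > 0 && hz <= MODEL_MSEC_PER_SEC &&
	       MODEL_MSEC_PER_SEC % hz == 0);
	return (MODEL_MSEC_PER_SEC / hz) * j;
}

/* Old behaviour: the guard compares against a possibly wrapped value */
static uint32_t model_msecs_to_jiffies_old(uint32_t m, uint32_t hz)
{
	uint32_t factor = MODEL_MSEC_PER_SEC / hz;

	if (m > model_jiffies_to_msecs(MODEL_MAX_JIFFY_OFFSET, hz))
		return MODEL_MAX_JIFFY_OFFSET;
	return (m + factor - 1) / factor;
}

/* Patched behaviour: a negative input means infinite timeout */
static uint32_t model_msecs_to_jiffies_new(uint32_t m, uint32_t hz)
{
	uint32_t factor = MODEL_MSEC_PER_SEC / hz;

	if ((int32_t)m < 0)
		return MODEL_MAX_JIFFY_OFFSET;
	return (m + factor - 1) / factor;
}
```

With hz = 250 the model threshold evaluates to 0xFFFFFFF8, so
(uint32_t)-20 (0xFFFFFFEC) slips past the old guard, while the sign
check in the patched variant catches it.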

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 kernel/time.c |   43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.18-mm2/kernel/time.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time.c	2006-10-02 00:55:50.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time.c	2006-10-02 00:55:51.000000000 +0200
@@ -500,15 +500,56 @@ unsigned int jiffies_to_usecs(const unsi
 }
 EXPORT_SYMBOL(jiffies_to_usecs);
 
+/*
+ * When we convert to jiffies then we interpret incoming values
+ * the following way:
+ *
+ * - negative values mean 'infinite timeout' (MAX_JIFFY_OFFSET)
+ *
+ * - 'too large' values [that would result in larger than
+ *   MAX_JIFFY_OFFSET values] mean 'infinite timeout' too.
+ *
+ * - all other values are converted to jiffies by either multiplying
+ *   the input value by a factor or dividing it with a factor
+ *
+ * We must also be careful about 32-bit overflows.
+ */
 unsigned long msecs_to_jiffies(const unsigned int m)
 {
-	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+	/*
+	 * Negative value, means infinite timeout:
+	 */
+	if ((int)m < 0)
 		return MAX_JIFFY_OFFSET;
+
 #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
+	/*
+	 * HZ is equal to or smaller than 1000, and 1000 is a nice
+	 * round multiple of HZ, divide with the factor between them,
+	 * but round upwards:
+	 */
 	return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
 #elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
+	/*
+	 * HZ is larger than 1000, and HZ is a nice round multiple of
+	 * 1000 - simply multiply with the factor between them.
+	 *
+	 * But first make sure the multiplication result cannot
+	 * overflow:
+	 */
+	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+		return MAX_JIFFY_OFFSET;
+
 	return m * (HZ / MSEC_PER_SEC);
 #else
+	/*
+	 * Generic case - multiply, round and divide. But first
+	 * check that if we are doing a net multiplication, that
+	 * we wouldn't overflow:
+	 */
+	if (HZ > MSEC_PER_SEC && m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+		return MAX_JIFFY_OFFSET;
+
 	return (m * HZ + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
 #endif
 }

--


* [patch 06/21] time: fix timeout overflow
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (4 preceding siblings ...)
  2006-10-01 23:00 ` [patch 05/21] time: fix msecs_to_jiffies() bug Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 07/21] cleanup: uninline irq_enter() and move it into a function Thomas Gleixner
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: max-jiffies-timeout-prevent-overflow.patch --]
[-- Type: text/plain, Size: 1239 bytes --]

From: Ingo Molnar <mingo@elte.hu>

Prevent timeout overflow if timer ticks are behind jiffies (due to high
softirq load or due to dyntick), by limiting the valid timeout range
to LONG_MAX/2.
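
Why halving the range helps can be shown with a small user-space model.
Assumptions: 32-bit jiffies modelled with uint32_t, a wrap-safe
comparison in the style of the kernel's time_before(), hypothetical
model_* names, and a lag value chosen purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_OLD_MAX	((UINT32_MAX >> 1) - 1)	/* (~0UL >> 1) - 1     */
#define MODEL_NEW_MAX	((INT32_MAX >> 1) - 1)	/* (LONG_MAX >> 1) - 1 */

/* Wrap-safe "a is before b": the sign of the 32-bit difference */
static bool model_time_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * A timer is armed at 'now' with 'timeout' jiffies and then examined
 * relative to a timer wheel that is 'lag' jiffies behind 'now'.
 * Returns true if the expiry wrongly appears to lie in the past.
 */
static bool model_looks_expired(uint32_t now, uint32_t lag, uint32_t timeout)
{
	uint32_t expires = now + timeout;
	uint32_t wheel = now - lag;

	assert(timeout <= MODEL_OLD_MAX);
	return model_time_before(expires, wheel);
}
```

In this model a lag of 100 jiffies is enough to make a timeout of
MODEL_OLD_MAX look expired, whereas MODEL_NEW_MAX leaves roughly
LONG_MAX/2 jiffies of headroom for a lagging wheel.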

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 include/linux/jiffies.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.18-mm2/include/linux/jiffies.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/jiffies.h	2006-10-02 00:55:50.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/jiffies.h	2006-10-02 00:55:51.000000000 +0200
@@ -142,13 +142,13 @@ static inline u64 get_jiffies_64(void)
  *
  * And some not so obvious.
  *
- * Note that we don't want to return MAX_LONG, because
+ * Note that we don't want to return LONG_MAX, because
  * for various timeout reasons we often end up having
  * to wait "jiffies+1" in order to guarantee that we wait
  * at _least_ "jiffies" - so "jiffies+1" had better still
  * be positive.
  */
-#define MAX_JIFFY_OFFSET ((~0UL >> 1)-1)
+#define MAX_JIFFY_OFFSET ((LONG_MAX >> 1)-1)
 
 /*
  * We want to do realistic conversions of time so we need to use the same

--


* [patch 07/21] cleanup: uninline irq_enter() and move it into a function
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (5 preceding siblings ...)
  2006-10-01 23:00 ` [patch 06/21] time: fix timeout overflow Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 08/21] dynticks: extend next_timer_interrupt() to use a reference jiffie Thomas Gleixner
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: unmacro-irq-enter.patch --]
[-- Type: text/plain, Size: 1598 bytes --]

From: Ingo Molnar <mingo@elte.hu>

Uninline irq_enter(). [dynticks adds more stuff to it]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 include/linux/hardirq.h |    7 +------
 kernel/softirq.c        |   10 ++++++++++
 2 files changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hardirq.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hardirq.h	2006-10-02 00:55:48.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hardirq.h	2006-10-02 00:55:51.000000000 +0200
@@ -106,12 +106,7 @@ static inline void account_system_vtime(
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
-#define irq_enter()					\
-	do {						\
-		account_system_vtime(current);		\
-		add_preempt_count(HARDIRQ_OFFSET);	\
-		trace_hardirq_enter();			\
-	} while (0)
+extern void irq_enter(void);
 
 /*
  * Exit irq context without processing softirqs:
Index: linux-2.6.18-mm2/kernel/softirq.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/softirq.c	2006-10-02 00:55:48.000000000 +0200
+++ linux-2.6.18-mm2/kernel/softirq.c	2006-10-02 00:55:51.000000000 +0200
@@ -273,6 +273,16 @@ EXPORT_SYMBOL(do_softirq);
 
 #endif
 
+/*
+ * Enter an interrupt context.
+ */
+void irq_enter(void)
+{
+	account_system_vtime(current);
+	add_preempt_count(HARDIRQ_OFFSET);
+	trace_hardirq_enter();
+}
+
 #ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
 # define invoke_softirq()	__do_softirq()
 #else

--


* [patch 08/21] dynticks: extend next_timer_interrupt() to use a reference jiffie
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (6 preceding siblings ...)
  2006-10-01 23:00 ` [patch 07/21] cleanup: uninline irq_enter() and move it into a function Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 09/21] hrtimers: namespace and enum cleanup Thomas Gleixner
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: extend-timer-next-interrupt.patch --]
[-- Type: text/plain, Size: 5838 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

For CONFIG_NO_HZ we need to calculate the next timer wheel event based
on a given jiffies value. Extend the existing code to accept the extra
'now' argument. Provide a compatibility function so that the existing
implementations can call the function with now = jiffies.
This also fixes the raciness of the original code vs. jiffies changing
during the iteration.
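
The shape of the change, a core routine that takes an explicit
reference time plus a thin wrapper that samples the counter once, can
be sketched in user space. This is only an illustration:
model_jiffies, model_next_event_from() and model_next_event() are
hypothetical names, and a flat array stands in for the timer wheel.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Stand-in for the kernel's jiffies counter */
static volatile unsigned long model_jiffies;

/* Wrap-safe "a is before b" */
static int model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Core routine: every comparison uses the caller's snapshot 'now' */
unsigned long model_next_event_from(const unsigned long *expiries,
				    size_t n, unsigned long now)
{
	/* With no timers pending, report an event far in the future */
	unsigned long expires = now + (LONG_MAX >> 1);
	size_t i;

	assert(expiries != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (model_time_before(expiries[i], expires))
			expires = expiries[i];

	/* An already expired timer means "fire right away" */
	if (model_time_before(expires, now))
		expires = now;
	return expires;
}

/* Compatibility wrapper: read the counter exactly once */
unsigned long model_next_event(const unsigned long *expiries, size_t n)
{
	return model_next_event_from(expiries, n, model_jiffies);
}
```

Because the counter is read only in the wrapper, a tick that arrives
during the scan cannot change the value the scan compares against.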

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
----
 include/linux/timer.h |   10 +++++
 kernel/timer.c        |   97 ++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 81 insertions(+), 26 deletions(-)

Index: linux-2.6.18-mm2/include/linux/timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/timer.h	2006-10-02 00:55:48.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/timer.h	2006-10-02 00:55:51.000000000 +0200
@@ -61,7 +61,17 @@ extern int del_timer(struct timer_list *
 extern int __mod_timer(struct timer_list *timer, unsigned long expires);
 extern int mod_timer(struct timer_list *timer, unsigned long expires);
 
+/*
+ * Return when the next timer-wheel timeout occurs (in absolute jiffies),
+ * locks the timer base:
+ */
 extern unsigned long next_timer_interrupt(void);
+/*
+ * Return when the next timer-wheel timeout occurs (in absolute jiffies),
+ * locks the timer base and does the comparison against the given
+ * jiffie.
+ */
+extern unsigned long get_next_timer_interrupt(unsigned long now);
 
 /***
  * add_timer - start a timer
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-10-02 00:55:50.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-10-02 00:55:51.000000000 +0200
@@ -468,29 +468,14 @@ static inline void __run_timers(tvec_bas
  * is used on S/390 to stop all activity when a cpus is idle.
  * This functions needs to be called disabled.
  */
-unsigned long next_timer_interrupt(void)
+unsigned long __next_timer_interrupt(tvec_base_t *base, unsigned long now)
 {
-	tvec_base_t *base;
 	struct list_head *list;
-	struct timer_list *nte;
+	struct timer_list *nte, *found = NULL;
 	unsigned long expires;
-	unsigned long hr_expires = MAX_JIFFY_OFFSET;
-	ktime_t hr_delta;
 	tvec_t *varray[4];
 	int i, j;
 
-	hr_delta = hrtimer_get_next_event();
-	if (hr_delta.tv64 != KTIME_MAX) {
-		struct timespec tsdelta;
-		tsdelta = ktime_to_timespec(hr_delta);
-		hr_expires = timespec_to_jiffies(&tsdelta);
-		if (hr_expires < 3)
-			return hr_expires + jiffies;
-	}
-	hr_expires += jiffies;
-
-	base = __get_cpu_var(tvec_bases);
-	spin_lock(&base->lock);
 	expires = base->timer_jiffies + (LONG_MAX >> 1);
 	list = NULL;
 
@@ -499,6 +484,7 @@ unsigned long next_timer_interrupt(void)
 	do {
 		list_for_each_entry(nte, base->tv1.vec + j, entry) {
 			expires = nte->expires;
+			found = nte;
 			if (j < (base->timer_jiffies & TVR_MASK))
 				list = base->tv2.vec + (INDEX(0));
 			goto found;
@@ -518,9 +504,12 @@ unsigned long next_timer_interrupt(void)
 				j = (j + 1) & TVN_MASK;
 				continue;
 			}
-			list_for_each_entry(nte, varray[i]->vec + j, entry)
-				if (time_before(nte->expires, expires))
+			list_for_each_entry(nte, varray[i]->vec + j, entry) {
+				if (time_before(nte->expires, expires)) {
 					expires = nte->expires;
+					found = nte;
+				}
+			}
 			if (j < (INDEX(i)) && i < 3)
 				list = varray[i + 1]->vec + (INDEX(i + 1));
 			goto found;
@@ -534,10 +523,59 @@ found:
 		 * where we found the timer element.
 		 */
 		list_for_each_entry(nte, list, entry) {
-			if (time_before(nte->expires, expires))
+			if (time_before(nte->expires, expires)) {
 				expires = nte->expires;
+				found = nte;
+			}
 		}
 	}
+	WARN_ON(!found);
+
+	return expires;
+}
+
+#ifdef CONFIG_NO_HZ
+
+unsigned long get_next_timer_interrupt(unsigned long now)
+{
+	tvec_base_t *base = __get_cpu_var(tvec_bases);
+	unsigned long expires;
+
+	spin_lock(&base->lock);
+	expires = __next_timer_interrupt(base, now);
+	spin_unlock(&base->lock);
+
+	/*
+	 * 'Timer wheel time' can lag behind 'jiffies time' due to
+	 * delayed processing, so make sure we return a value that
+	 * makes sense externally. base->timer_jiffies is unchanged,
+	 * so it is safe to access it outside the lock.
+	 */
+
+	return expires - (now - base->timer_jiffies);
+}
+
+#else
+
+unsigned long next_timer_interrupt(void)
+{
+	tvec_base_t *base = __get_cpu_var(tvec_bases);
+	unsigned long expires;
+	unsigned long now = jiffies;
+	unsigned long hr_expires = MAX_JIFFY_OFFSET;
+	ktime_t hr_delta = hrtimer_get_next_event();
+
+	if (hr_delta.tv64 != KTIME_MAX) {
+		struct timespec tsdelta;
+		tsdelta = ktime_to_timespec(hr_delta);
+		hr_expires = timespec_to_jiffies(&tsdelta);
+		if (hr_expires < 3)
+			return hr_expires + now;
+	}
+	hr_expires += now;
+
+	spin_lock(&base->lock);
+	expires = __next_timer_interrupt(base, now);
 	spin_unlock(&base->lock);
 
 	/*
@@ -553,16 +591,23 @@ found:
 	 * would falsely evaluate to true.  If that is the case, just
 	 * return jiffies so that we can immediately fire the local timer
 	 */
-	if (time_before(expires, jiffies))
-		return jiffies;
+	if (time_before(expires, now))
+		expires = now;
+	else if (time_before(hr_expires, expires))
+		expires = hr_expires;
 
-	if (time_before(hr_expires, expires))
-		return hr_expires;
-
-	return expires;
+	/*
+	 * 'Timer wheel time' can lag behind 'jiffies time' due to
+	 * delayed processing, so make sure we return a value that
+	 * makes sense externally. base->timer_jiffies is unchanged,
+	 * so it is safe to access it outside the lock.
+	 */
+	return expires - (now - base->timer_jiffies);
 }
 #endif
 
+#endif
+
 /******************************************************************/
 
 /* 

--


* [patch 09/21] hrtimers: namespace and enum cleanup
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (7 preceding siblings ...)
  2006-10-01 23:00 ` [patch 08/21] dynticks: extend next_timer_interrupt() to use a reference jiffie Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 10/21] hrtimers: clean up locking Thomas Gleixner
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-enum-and-namespace-cleanup.patch --]
[-- Type: text/plain, Size: 10463 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

- hrtimers did not use the hrtimer_restart enum and relied on the implicit
  int representation. Fix the prototypes and the functions using the enums.
- Use separate name spaces for the enumerations
- Convert hrtimer_restart macro to inline function
- Add comments
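
A callback written against the typed prototype might look like the
sketch below. It is a user-space model, not kernel code: the enum
mirrors the one in this patch, while struct model_timer, model_tick()
and model_run() are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Return values for the callback function (as in the patch) */
enum hrtimer_restart {
	HRTIMER_NORESTART,	/* Timer is not restarted */
	HRTIMER_RESTART,	/* Timer must be restarted */
};

/* Hypothetical stand-in for struct hrtimer */
struct model_timer {
	enum hrtimer_restart	(*function)(struct model_timer *);
	int			remaining;
};

/* Callback: ask for a restart until the countdown reaches zero */
static enum hrtimer_restart model_tick(struct model_timer *t)
{
	if (--t->remaining > 0)
		return HRTIMER_RESTART;
	return HRTIMER_NORESTART;
}

/* Run the callback until it declines a restart; return the call count */
static int model_run(struct model_timer *t)
{
	int calls = 0;

	assert(t != NULL && t->function != NULL);
	do {
		calls++;
	} while (t->function(t) == HRTIMER_RESTART);
	return calls;
}
```

With the enum in the prototype, the compiler sees the intended return
type at every assignment to ->function instead of a bare int.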

No functional changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h |   20 ++++++++++++--------
 include/linux/timer.h   |    2 +-
 kernel/fork.c           |    2 +-
 kernel/futex.c          |    2 +-
 kernel/hrtimer.c        |   18 +++++++++---------
 kernel/itimer.c         |    4 ++--
 kernel/posix-timers.c   |   13 +++++++------
 kernel/rtmutex.c        |    2 +-
 8 files changed, 34 insertions(+), 29 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-10-02 00:55:50.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-10-02 00:55:52.000000000 +0200
@@ -25,17 +25,18 @@
  * Mode arguments of xxx_hrtimer functions:
  */
 enum hrtimer_mode {
-	HRTIMER_ABS,	/* Time value is absolute */
-	HRTIMER_REL,	/* Time value is relative to now */
+	HRTIMER_MODE_ABS,	/* Time value is absolute */
+	HRTIMER_MODE_REL,	/* Time value is relative to now */
 };
 
+/*
+ * Return values for the callback function
+ */
 enum hrtimer_restart {
-	HRTIMER_NORESTART,
-	HRTIMER_RESTART,
+	HRTIMER_NORESTART,	/* Timer is not restarted */
+	HRTIMER_RESTART,	/* Timer must be restarted */
 };
 
-#define HRTIMER_INACTIVE	((void *)1UL)
-
 struct hrtimer_base;
 
 /**
@@ -52,7 +53,7 @@ struct hrtimer_base;
 struct hrtimer {
 	struct rb_node		node;
 	ktime_t			expires;
-	int			(*function)(struct hrtimer *);
+	enum hrtimer_restart	(*function)(struct hrtimer *);
 	struct hrtimer_base	*base;
 };
 
@@ -114,7 +115,10 @@ extern int hrtimer_start(struct hrtimer 
 extern int hrtimer_cancel(struct hrtimer *timer);
 extern int hrtimer_try_to_cancel(struct hrtimer *timer);
 
-#define hrtimer_restart(timer) hrtimer_start((timer), (timer)->expires, HRTIMER_ABS)
+static inline int hrtimer_restart(struct hrtimer *timer)
+{
+	return hrtimer_start(timer, timer->expires, HRTIMER_MODE_ABS);
+}
 
 /* Query timers: */
 extern ktime_t hrtimer_get_remaining(const struct hrtimer *timer);
Index: linux-2.6.18-mm2/include/linux/timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/timer.h	2006-10-02 00:55:51.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/timer.h	2006-10-02 00:55:52.000000000 +0200
@@ -106,6 +106,6 @@ static inline void add_timer(struct time
 extern void init_timers(void);
 extern void run_local_timers(void);
 struct hrtimer;
-extern int it_real_fn(struct hrtimer *);
+extern enum hrtimer_restart it_real_fn(struct hrtimer *);
 
 #endif
Index: linux-2.6.18-mm2/kernel/fork.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/fork.c	2006-10-02 00:55:48.000000000 +0200
+++ linux-2.6.18-mm2/kernel/fork.c	2006-10-02 00:55:52.000000000 +0200
@@ -855,7 +855,7 @@ static inline int copy_signal(unsigned l
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);
 
-	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_REL);
+	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	sig->it_real_incr.tv64 = 0;
 	sig->real_timer.function = it_real_fn;
 	sig->tsk = tsk;
Index: linux-2.6.18-mm2/kernel/futex.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/futex.c	2006-10-02 00:55:48.000000000 +0200
+++ linux-2.6.18-mm2/kernel/futex.c	2006-10-02 00:55:52.000000000 +0200
@@ -1135,7 +1135,7 @@ static int futex_lock_pi(u32 __user *uad
 
 	if (sec != MAX_SCHEDULE_TIMEOUT) {
 		to = &timeout;
-		hrtimer_init(&to->timer, CLOCK_REALTIME, HRTIMER_ABS);
+		hrtimer_init(&to->timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
 		hrtimer_init_sleeper(to, current);
 		to->timer.expires = ktime_set(sec, nsec);
 	}
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:50.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:52.000000000 +0200
@@ -439,7 +439,7 @@ hrtimer_start(struct hrtimer *timer, kti
 	/* Switch the timer base, if necessary: */
 	new_base = switch_hrtimer_base(timer, base);
 
-	if (mode == HRTIMER_REL) {
+	if (mode == HRTIMER_MODE_REL) {
 		tim = ktime_add(tim, new_base->get_time());
 		/*
 		 * CONFIG_TIME_LOW_RES is a temporary way for architectures
@@ -578,7 +578,7 @@ void hrtimer_init(struct hrtimer *timer,
 
 	bases = __raw_get_cpu_var(hrtimer_bases);
 
-	if (clock_id == CLOCK_REALTIME && mode != HRTIMER_ABS)
+	if (clock_id == CLOCK_REALTIME && mode != HRTIMER_MODE_ABS)
 		clock_id = CLOCK_MONOTONIC;
 
 	timer->base = &bases[clock_id];
@@ -622,7 +622,7 @@ static inline void run_hrtimer_queue(str
 
 	while ((node = base->first)) {
 		struct hrtimer *timer;
-		int (*fn)(struct hrtimer *);
+		enum hrtimer_restart (*fn)(struct hrtimer *);
 		int restart;
 
 		timer = rb_entry(node, struct hrtimer, node);
@@ -664,7 +664,7 @@ void hrtimer_run_queues(void)
 /*
  * Sleep related functions:
  */
-static int hrtimer_wakeup(struct hrtimer *timer)
+static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
 {
 	struct hrtimer_sleeper *t =
 		container_of(timer, struct hrtimer_sleeper, timer);
@@ -694,7 +694,7 @@ static int __sched do_nanosleep(struct h
 		schedule();
 
 		hrtimer_cancel(&t->timer);
-		mode = HRTIMER_ABS;
+		mode = HRTIMER_MODE_ABS;
 
 	} while (t->task && !signal_pending(current));
 
@@ -710,10 +710,10 @@ long __sched hrtimer_nanosleep_restart(s
 
 	restart->fn = do_no_restart_syscall;
 
-	hrtimer_init(&t.timer, restart->arg0, HRTIMER_ABS);
+	hrtimer_init(&t.timer, restart->arg0, HRTIMER_MODE_ABS);
 	t.timer.expires.tv64 = ((u64)restart->arg3 << 32) | (u64) restart->arg2;
 
-	if (do_nanosleep(&t, HRTIMER_ABS))
+	if (do_nanosleep(&t, HRTIMER_MODE_ABS))
 		return 0;
 
 	rmtp = (struct timespec __user *) restart->arg1;
@@ -746,7 +746,7 @@ long hrtimer_nanosleep(struct timespec *
 		return 0;
 
 	/* Absolute timers do not update the rmtp value and restart: */
-	if (mode == HRTIMER_ABS)
+	if (mode == HRTIMER_MODE_ABS)
 		return -ERESTARTNOHAND;
 
 	if (rmtp) {
@@ -779,7 +779,7 @@ sys_nanosleep(struct timespec __user *rq
 	if (!timespec_valid(&tu))
 		return -EINVAL;
 
-	return hrtimer_nanosleep(&tu, rmtp, HRTIMER_REL, CLOCK_MONOTONIC);
+	return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
 }
 
 /*
Index: linux-2.6.18-mm2/kernel/itimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/itimer.c	2006-10-02 00:55:48.000000000 +0200
+++ linux-2.6.18-mm2/kernel/itimer.c	2006-10-02 00:55:52.000000000 +0200
@@ -128,7 +128,7 @@ asmlinkage long sys_getitimer(int which,
 /*
  * The timer is automagically restarted, when interval != 0
  */
-int it_real_fn(struct hrtimer *timer)
+enum hrtimer_restart it_real_fn(struct hrtimer *timer)
 {
 	struct signal_struct *sig =
 	    container_of(timer, struct signal_struct, real_timer);
@@ -235,7 +235,7 @@ again:
 			timeval_to_ktime(value->it_interval);
 		expires = timeval_to_ktime(value->it_value);
 		if (expires.tv64 != 0)
-			hrtimer_start(timer, expires, HRTIMER_REL);
+			hrtimer_start(timer, expires, HRTIMER_MODE_REL);
 		spin_unlock_irq(&tsk->sighand->siglock);
 		break;
 	case ITIMER_VIRTUAL:
Index: linux-2.6.18-mm2/kernel/posix-timers.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/posix-timers.c	2006-10-02 00:55:48.000000000 +0200
+++ linux-2.6.18-mm2/kernel/posix-timers.c	2006-10-02 00:55:52.000000000 +0200
@@ -145,7 +145,7 @@ static int common_timer_set(struct k_iti
 			    struct itimerspec *, struct itimerspec *);
 static int common_timer_del(struct k_itimer *timer);
 
-static int posix_timer_fn(struct hrtimer *data);
+static enum hrtimer_restart posix_timer_fn(struct hrtimer *data);
 
 static struct k_itimer *lock_timer(timer_t timer_id, unsigned long *flags);
 
@@ -334,12 +334,12 @@ EXPORT_SYMBOL_GPL(posix_timer_event);
 
  * This code is for CLOCK_REALTIME* and CLOCK_MONOTONIC* timers.
  */
-static int posix_timer_fn(struct hrtimer *timer)
+static enum hrtimer_restart posix_timer_fn(struct hrtimer *timer)
 {
 	struct k_itimer *timr;
 	unsigned long flags;
 	int si_private = 0;
-	int ret = HRTIMER_NORESTART;
+	enum hrtimer_restart ret = HRTIMER_NORESTART;
 
 	timr = container_of(timer, struct k_itimer, it.real.timer);
 	spin_lock_irqsave(&timr->it_lock, flags);
@@ -723,7 +723,7 @@ common_timer_set(struct k_itimer *timr, 
 	if (!new_setting->it_value.tv_sec && !new_setting->it_value.tv_nsec)
 		return 0;
 
-	mode = flags & TIMER_ABSTIME ? HRTIMER_ABS : HRTIMER_REL;
+	mode = flags & TIMER_ABSTIME ? HRTIMER_MODE_ABS : HRTIMER_MODE_REL;
 	hrtimer_init(&timr->it.real.timer, timr->it_clock, mode);
 	timr->it.real.timer.function = posix_timer_fn;
 
@@ -735,7 +735,7 @@ common_timer_set(struct k_itimer *timr, 
 	/* SIGEV_NONE timers are not queued ! See common_timer_get */
 	if (((timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE)) {
 		/* Setup correct expiry time for relative timers */
-		if (mode == HRTIMER_REL)
+		if (mode == HRTIMER_MODE_REL)
 			timer->expires = ktime_add(timer->expires,
 						   timer->base->get_time());
 		return 0;
@@ -951,7 +951,8 @@ static int common_nsleep(const clockid_t
 			 struct timespec *tsave, struct timespec __user *rmtp)
 {
 	return hrtimer_nanosleep(tsave, rmtp, flags & TIMER_ABSTIME ?
-				 HRTIMER_ABS : HRTIMER_REL, which_clock);
+				 HRTIMER_MODE_ABS : HRTIMER_MODE_REL,
+				 which_clock);
 }
 
 asmlinkage long
Index: linux-2.6.18-mm2/kernel/rtmutex.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/rtmutex.c	2006-10-02 00:55:48.000000000 +0200
+++ linux-2.6.18-mm2/kernel/rtmutex.c	2006-10-02 00:55:52.000000000 +0200
@@ -625,7 +625,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 	/* Setup the timer, when timeout != NULL */
 	if (unlikely(timeout))
 		hrtimer_start(&timeout->timer, timeout->timer.expires,
-			      HRTIMER_ABS);
+			      HRTIMER_MODE_ABS);
 
 	for (;;) {
 		/* Try to acquire the lock: */

--


* [patch 10/21] hrtimers: clean up locking
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (8 preceding siblings ...)
  2006-10-01 23:00 ` [patch 09/21] hrtimers: namespace and enum cleanup Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 11/21] hrtimers: state tracking Thomas Gleixner
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-single-lock.patch --]
[-- Type: text/plain, Size: 16461 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Improve kernel/hrtimer.c locking: use a per-CPU base with a single lock
that protects all clock bases belonging to a CPU. This simplifies code
that needs to lock all clocks at once, which makes life easier for
high-res timers and dyntick. No functional change is intended.
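
For illustration only (not part of the patch): a minimal userspace
sketch of the layout described above, where one per-CPU lock covers
every clock base. All names here (toy_lock, cpu_base_init, next_event)
are hypothetical stand-ins, and the toy lock only counts acquisitions;
it is not a real spinlock.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CLOCK_BASES 2		/* mirrors HRTIMER_MAX_CLOCK_BASES */
#define NO_TIMER	INT64_MAX	/* marks an empty clock base */

/* Toy stand-in for spinlock_t: counts acquisitions, checks pairing. */
struct toy_lock {
	int held;
	unsigned int acquisitions;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

struct cpu_base;

struct clock_base {
	struct cpu_base *cpu_base;	/* back pointer to the lock owner */
	int64_t first_expiry;		/* earliest expiry, NO_TIMER if none */
	int64_t now;			/* stand-in for get_time() */
};

struct cpu_base {
	struct toy_lock lock;		/* one lock for all clock bases */
	struct clock_base clock_base[MAX_CLOCK_BASES];
};

static void cpu_base_init(struct cpu_base *cb)
{
	int i;

	cb->lock.held = 0;
	cb->lock.acquisitions = 0;
	for (i = 0; i < MAX_CLOCK_BASES; i++) {
		cb->clock_base[i].cpu_base = cb;
		cb->clock_base[i].first_expiry = NO_TIMER;
		cb->clock_base[i].now = 0;
	}
}

/* Smallest delta to the next expiry over all clock bases, clamped at 0. */
static int64_t next_event(struct cpu_base *cb)
{
	int64_t mindelta = INT64_MAX;
	int i;

	toy_lock_acquire(&cb->lock);
	for (i = 0; i < MAX_CLOCK_BASES; i++) {
		struct clock_base *base = &cb->clock_base[i];
		int64_t delta;

		if (base->first_expiry == NO_TIMER)
			continue;
		delta = base->first_expiry - base->now;
		if (delta < mindelta)
			mindelta = delta;
	}
	toy_lock_release(&cb->lock);

	return mindelta < 0 ? 0 : mindelta;
}
```

With this layout a scan over all clock bases, as in
hrtimer_get_next_event(), takes the lock once instead of once per clock.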

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h |   43 +++++++----
 kernel/hrtimer.c        |  182 +++++++++++++++++++++++++-----------------------
 2 files changed, 125 insertions(+), 100 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-10-02 00:55:52.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-10-02 00:55:52.000000000 +0200
@@ -21,6 +21,9 @@
 #include <linux/list.h>
 #include <linux/wait.h>
 
+struct hrtimer_clock_base;
+struct hrtimer_cpu_base;
+
 /*
  * Mode arguments of xxx_hrtimer functions:
  */
@@ -37,8 +40,6 @@ enum hrtimer_restart {
 	HRTIMER_RESTART,	/* Timer must be restarted */
 };
 
-struct hrtimer_base;
-
 /**
  * struct hrtimer - the basic hrtimer structure
  * @node:	red black tree node for time ordered insertion
@@ -51,10 +52,10 @@ struct hrtimer_base;
  * The hrtimer structure must be initialized by init_hrtimer_#CLOCKTYPE()
  */
 struct hrtimer {
-	struct rb_node		node;
-	ktime_t			expires;
-	enum hrtimer_restart	(*function)(struct hrtimer *);
-	struct hrtimer_base	*base;
+	struct rb_node			node;
+	ktime_t				expires;
+	enum hrtimer_restart		(*function)(struct hrtimer *);
+	struct hrtimer_clock_base	*base;
 };
 
 /**
@@ -71,29 +72,41 @@ struct hrtimer_sleeper {
 
 /**
  * struct hrtimer_base - the timer base for a specific clock
- * @index:		clock type index for per_cpu support when moving a timer
- *			to a base on another cpu.
- * @lock:		lock protecting the base and associated timers
+ * @index:		clock type index for per_cpu support when moving a
+ *			timer to a base on another cpu.
  * @active:		red black tree root node for the active timers
  * @first:		pointer to the timer node which expires first
  * @resolution:		the resolution of the clock, in nanoseconds
  * @get_time:		function to retrieve the current time of the clock
  * @get_softirq_time:	function to retrieve the current time from the softirq
- * @curr_timer:		the timer which is executing a callback right now
  * @softirq_time:	the time when running the hrtimer queue in the softirq
- * @lock_key:		the lock_class_key for use with lockdep
  */
-struct hrtimer_base {
+struct hrtimer_clock_base {
+	struct hrtimer_cpu_base	*cpu_base;
 	clockid_t		index;
-	spinlock_t		lock;
 	struct rb_root		active;
 	struct rb_node		*first;
 	ktime_t			resolution;
 	ktime_t			(*get_time)(void);
 	ktime_t			(*get_softirq_time)(void);
-	struct hrtimer		*curr_timer;
 	ktime_t			softirq_time;
-	struct lock_class_key lock_key;
+};
+
+#define HRTIMER_MAX_CLOCK_BASES 2
+
+/*
+ * struct hrtimer_cpu_base - the per cpu clock bases
+ * @lock:		lock protecting the base and associated clock bases
+ *			and timers
+ * @lock_key:		the lock_class_key for use with lockdep
+ * @clock_base:		array of clock bases for this cpu
+ * @curr_timer:		the timer which is executing a callback right now
+ */
+struct hrtimer_cpu_base {
+	spinlock_t			lock;
+	struct lock_class_key		lock_key;
+	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
+	struct hrtimer			*curr_timer;
 };
 
 /*
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:52.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:52.000000000 +0200
@@ -1,8 +1,9 @@
 /*
  *  linux/kernel/hrtimer.c
  *
- *  Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
- *  Copyright(C) 2005, Red Hat, Inc., Ingo Molnar
+ *  Copyright(C) 2005-2006, Thomas Gleixner <tglx@linutronix.de>
+ *  Copyright(C) 2005-2006, Red Hat, Inc., Ingo Molnar
+ *  Copyright(C) 2006	    Timesys Corp., Thomas Gleixner <tglx@timesys.com>
  *
  *  High-resolution kernel timers
  *
@@ -79,21 +80,22 @@ EXPORT_SYMBOL_GPL(ktime_get_real);
  * This ensures that we capture erroneous accesses to these clock ids
  * rather than moving them into the range of valid clock id's.
  */
-
-#define MAX_HRTIMER_BASES 2
-
-static DEFINE_PER_CPU(struct hrtimer_base, hrtimer_bases[MAX_HRTIMER_BASES]) =
+static DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
 {
+
+	.clock_base =
 	{
-		.index = CLOCK_REALTIME,
-		.get_time = &ktime_get_real,
-		.resolution = KTIME_REALTIME_RES,
-	},
-	{
-		.index = CLOCK_MONOTONIC,
-		.get_time = &ktime_get,
-		.resolution = KTIME_MONOTONIC_RES,
-	},
+		{
+			.index = CLOCK_REALTIME,
+			.get_time = &ktime_get_real,
+			.resolution = KTIME_REALTIME_RES,
+		},
+		{
+			.index = CLOCK_MONOTONIC,
+			.get_time = &ktime_get,
+			.resolution = KTIME_MONOTONIC_RES,
+		},
+	}
 };
 
 /**
@@ -125,7 +127,7 @@ EXPORT_SYMBOL_GPL(ktime_get_ts);
  * Get the coarse grained time at the softirq based on xtime and
  * wall_to_monotonic.
  */
-static void hrtimer_get_softirq_time(struct hrtimer_base *base)
+static void hrtimer_get_softirq_time(struct hrtimer_cpu_base *base)
 {
 	ktime_t xtim, tomono;
 	unsigned long seq;
@@ -137,8 +139,9 @@ static void hrtimer_get_softirq_time(str
 
 	} while (read_seqretry(&xtime_lock, seq));
 
-	base[CLOCK_REALTIME].softirq_time = xtim;
-	base[CLOCK_MONOTONIC].softirq_time = ktime_add(xtim, tomono);
+	base->clock_base[CLOCK_REALTIME].softirq_time = xtim;
+	base->clock_base[CLOCK_MONOTONIC].softirq_time =
+		ktime_add(xtim, tomono);
 }
 
 /*
@@ -161,19 +164,20 @@ static void hrtimer_get_softirq_time(str
  * possible to set timer->base = NULL and drop the lock: the timer remains
  * locked.
  */
-static struct hrtimer_base *lock_hrtimer_base(const struct hrtimer *timer,
-					      unsigned long *flags)
+static
+struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+					     unsigned long *flags)
 {
-	struct hrtimer_base *base;
+	struct hrtimer_clock_base *base;
 
 	for (;;) {
 		base = timer->base;
 		if (likely(base != NULL)) {
-			spin_lock_irqsave(&base->lock, *flags);
+			spin_lock_irqsave(&base->cpu_base->lock, *flags);
 			if (likely(base == timer->base))
 				return base;
 			/* The timer has migrated to another CPU: */
-			spin_unlock_irqrestore(&base->lock, *flags);
+			spin_unlock_irqrestore(&base->cpu_base->lock, *flags);
 		}
 		cpu_relax();
 	}
@@ -182,12 +186,14 @@ static struct hrtimer_base *lock_hrtimer
 /*
  * Switch the timer base to the current CPU when possible.
  */
-static inline struct hrtimer_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_base *base)
+static inline struct hrtimer_clock_base *
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
-	struct hrtimer_base *new_base;
+	struct hrtimer_clock_base *new_base;
+	struct hrtimer_cpu_base *new_cpu_base;
 
-	new_base = &__get_cpu_var(hrtimer_bases)[base->index];
+	new_cpu_base = &__get_cpu_var(hrtimer_bases);
+	new_base = &new_cpu_base->clock_base[base->index];
 
 	if (base != new_base) {
 		/*
@@ -199,13 +205,13 @@ switch_hrtimer_base(struct hrtimer *time
 		 * completed. There is no conflict as we hold the lock until
 		 * the timer is enqueued.
 		 */
-		if (unlikely(base->curr_timer == timer))
+		if (unlikely(base->cpu_base->curr_timer == timer))
 			return base;
 
 		/* See the comment in lock_timer_base() */
 		timer->base = NULL;
-		spin_unlock(&base->lock);
-		spin_lock(&new_base->lock);
+		spin_unlock(&base->cpu_base->lock);
+		spin_lock(&new_base->cpu_base->lock);
 		timer->base = new_base;
 	}
 	return new_base;
@@ -215,12 +221,12 @@ switch_hrtimer_base(struct hrtimer *time
 
 #define set_curr_timer(b, t)		do { } while (0)
 
-static inline struct hrtimer_base *
+static inline struct hrtimer_clock_base *
 lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 {
-	struct hrtimer_base *base = timer->base;
+	struct hrtimer_clock_base *base = timer->base;
 
-	spin_lock_irqsave(&base->lock, *flags);
+	spin_lock_irqsave(&base->cpu_base->lock, *flags);
 
 	return base;
 }
@@ -300,7 +306,7 @@ void hrtimer_notify_resume(void)
 static inline
 void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 {
-	spin_unlock_irqrestore(&timer->base->lock, *flags);
+	spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
 }
 
 /**
@@ -350,7 +356,8 @@ hrtimer_forward(struct hrtimer *timer, k
  * The timer is inserted in expiry order. Insertion into the
  * red black tree is O(log(n)). Must hold the base lock.
  */
-static void enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_base *base)
+static void enqueue_hrtimer(struct hrtimer *timer,
+			    struct hrtimer_clock_base *base)
 {
 	struct rb_node **link = &base->active.rb_node;
 	struct rb_node *parent = NULL;
@@ -389,7 +396,8 @@ static void enqueue_hrtimer(struct hrtim
  *
  * Caller must hold the base lock.
  */
-static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_base *base)
+static void __remove_hrtimer(struct hrtimer *timer,
+			     struct hrtimer_clock_base *base)
 {
 	/*
 	 * Remove the timer from the rbtree and replace the
@@ -405,7 +413,7 @@ static void __remove_hrtimer(struct hrti
  * remove hrtimer, called with base lock held
  */
 static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_base *base)
+remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
 	if (hrtimer_active(timer)) {
 		__remove_hrtimer(timer, base);
@@ -427,7 +435,7 @@ remove_hrtimer(struct hrtimer *timer, st
 int
 hrtimer_start(struct hrtimer *timer, ktime_t tim, const enum hrtimer_mode mode)
 {
-	struct hrtimer_base *base, *new_base;
+	struct hrtimer_clock_base *base, *new_base;
 	unsigned long flags;
 	int ret;
 
@@ -474,13 +482,13 @@ EXPORT_SYMBOL_GPL(hrtimer_start);
  */
 int hrtimer_try_to_cancel(struct hrtimer *timer)
 {
-	struct hrtimer_base *base;
+	struct hrtimer_clock_base *base;
 	unsigned long flags;
 	int ret = -1;
 
 	base = lock_hrtimer_base(timer, &flags);
 
-	if (base->curr_timer != timer)
+	if (base->cpu_base->curr_timer != timer)
 		ret = remove_hrtimer(timer, base);
 
 	unlock_hrtimer_base(timer, &flags);
@@ -516,7 +524,7 @@ EXPORT_SYMBOL_GPL(hrtimer_cancel);
  */
 ktime_t hrtimer_get_remaining(const struct hrtimer *timer)
 {
-	struct hrtimer_base *base;
+	struct hrtimer_clock_base *base;
 	unsigned long flags;
 	ktime_t rem;
 
@@ -537,26 +545,29 @@ EXPORT_SYMBOL_GPL(hrtimer_get_remaining)
  */
 ktime_t hrtimer_get_next_event(void)
 {
-	struct hrtimer_base *base = __get_cpu_var(hrtimer_bases);
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	struct hrtimer_clock_base *base = cpu_base->clock_base;
 	ktime_t delta, mindelta = { .tv64 = KTIME_MAX };
 	unsigned long flags;
 	int i;
 
-	for (i = 0; i < MAX_HRTIMER_BASES; i++, base++) {
+	spin_lock_irqsave(&cpu_base->lock, flags);
+
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
 		struct hrtimer *timer;
 
-		spin_lock_irqsave(&base->lock, flags);
-		if (!base->first) {
-			spin_unlock_irqrestore(&base->lock, flags);
+		if (!base->first)
 			continue;
-		}
+
 		timer = rb_entry(base->first, struct hrtimer, node);
 		delta.tv64 = timer->expires.tv64;
-		spin_unlock_irqrestore(&base->lock, flags);
 		delta = ktime_sub(delta, base->get_time());
 		if (delta.tv64 < mindelta.tv64)
 			mindelta.tv64 = delta.tv64;
 	}
+
+	spin_unlock_irqrestore(&cpu_base->lock, flags);
+
 	if (mindelta.tv64 < 0)
 		mindelta.tv64 = 0;
 	return mindelta;
@@ -572,16 +583,16 @@ ktime_t hrtimer_get_next_event(void)
 void hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
 		  enum hrtimer_mode mode)
 {
-	struct hrtimer_base *bases;
+	struct hrtimer_cpu_base *cpu_base;
 
 	memset(timer, 0, sizeof(struct hrtimer));
 
-	bases = __raw_get_cpu_var(hrtimer_bases);
+	cpu_base = &__raw_get_cpu_var(hrtimer_bases);
 
 	if (clock_id == CLOCK_REALTIME && mode != HRTIMER_MODE_ABS)
 		clock_id = CLOCK_MONOTONIC;
 
-	timer->base = &bases[clock_id];
+	timer->base = &cpu_base->clock_base[clock_id];
 	rb_set_parent(&timer->node, &timer->node);
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
@@ -596,10 +607,10 @@ EXPORT_SYMBOL_GPL(hrtimer_init);
  */
 int hrtimer_get_res(const clockid_t which_clock, struct timespec *tp)
 {
-	struct hrtimer_base *bases;
+	struct hrtimer_cpu_base *cpu_base;
 
-	bases = __raw_get_cpu_var(hrtimer_bases);
-	*tp = ktime_to_timespec(bases[which_clock].resolution);
+	cpu_base = &__raw_get_cpu_var(hrtimer_bases);
+	*tp = ktime_to_timespec(cpu_base->clock_base[which_clock].resolution);
 
 	return 0;
 }
@@ -608,9 +619,11 @@ EXPORT_SYMBOL_GPL(hrtimer_get_res);
 /*
  * Expire the per base hrtimer-queue:
  */
-static inline void run_hrtimer_queue(struct hrtimer_base *base)
+static inline void run_hrtimer_queue(struct hrtimer_cpu_base *cpu_base,
+				     int index)
 {
 	struct rb_node *node;
+	struct hrtimer_clock_base *base = &cpu_base->clock_base[index];
 
 	if (!base->first)
 		return;
@@ -618,7 +631,7 @@ static inline void run_hrtimer_queue(str
 	if (base->get_softirq_time)
 		base->softirq_time = base->get_softirq_time();
 
-	spin_lock_irq(&base->lock);
+	spin_lock_irq(&cpu_base->lock);
 
 	while ((node = base->first)) {
 		struct hrtimer *timer;
@@ -630,21 +643,21 @@ static inline void run_hrtimer_queue(str
 			break;
 
 		fn = timer->function;
-		set_curr_timer(base, timer);
+		set_curr_timer(cpu_base, timer);
 		__remove_hrtimer(timer, base);
-		spin_unlock_irq(&base->lock);
+		spin_unlock_irq(&cpu_base->lock);
 
 		restart = fn(timer);
 
-		spin_lock_irq(&base->lock);
+		spin_lock_irq(&cpu_base->lock);
 
 		if (restart != HRTIMER_NORESTART) {
 			BUG_ON(hrtimer_active(timer));
 			enqueue_hrtimer(timer, base);
 		}
 	}
-	set_curr_timer(base, NULL);
-	spin_unlock_irq(&base->lock);
+	set_curr_timer(cpu_base, NULL);
+	spin_unlock_irq(&cpu_base->lock);
 }
 
 /*
@@ -652,13 +665,13 @@ static inline void run_hrtimer_queue(str
  */
 void hrtimer_run_queues(void)
 {
-	struct hrtimer_base *base = __get_cpu_var(hrtimer_bases);
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
 	int i;
 
-	hrtimer_get_softirq_time(base);
+	hrtimer_get_softirq_time(cpu_base);
 
-	for (i = 0; i < MAX_HRTIMER_BASES; i++)
-		run_hrtimer_queue(&base[i]);
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+		run_hrtimer_queue(cpu_base, i);
 }
 
 /*
@@ -787,19 +800,21 @@ sys_nanosleep(struct timespec __user *rq
  */
 static void __devinit init_hrtimers_cpu(int cpu)
 {
-	struct hrtimer_base *base = per_cpu(hrtimer_bases, cpu);
+	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
 	int i;
 
-	for (i = 0; i < MAX_HRTIMER_BASES; i++, base++) {
-		spin_lock_init(&base->lock);
-		lockdep_set_class(&base->lock, &base->lock_key);
-	}
+	spin_lock_init(&cpu_base->lock);
+	lockdep_set_class(&cpu_base->lock, &cpu_base->lock_key);
+
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+		cpu_base->clock_base[i].cpu_base = cpu_base;
+
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
 
-static void migrate_hrtimer_list(struct hrtimer_base *old_base,
-				struct hrtimer_base *new_base)
+static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
+				struct hrtimer_clock_base *new_base)
 {
 	struct hrtimer *timer;
 	struct rb_node *node;
@@ -814,29 +829,26 @@ static void migrate_hrtimer_list(struct 
 
 static void migrate_hrtimers(int cpu)
 {
-	struct hrtimer_base *old_base, *new_base;
+	struct hrtimer_cpu_base *old_base, *new_base;
 	int i;
 
 	BUG_ON(cpu_online(cpu));
-	old_base = per_cpu(hrtimer_bases, cpu);
-	new_base = get_cpu_var(hrtimer_bases);
+	old_base = &per_cpu(hrtimer_bases, cpu);
+	new_base = &get_cpu_var(hrtimer_bases);
 
 	local_irq_disable();
 
-	for (i = 0; i < MAX_HRTIMER_BASES; i++) {
-
-		spin_lock(&new_base->lock);
-		spin_lock(&old_base->lock);
+	spin_lock(&new_base->lock);
+	spin_lock(&old_base->lock);
 
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
 		BUG_ON(old_base->curr_timer);
 
-		migrate_hrtimer_list(old_base, new_base);
-
-		spin_unlock(&old_base->lock);
-		spin_unlock(&new_base->lock);
-		old_base++;
-		new_base++;
+		migrate_hrtimer_list(&old_base->clock_base[i],
+				     &new_base->clock_base[i]);
 	}
+	spin_unlock(&old_base->lock);
+	spin_unlock(&new_base->lock);
 
 	local_irq_enable();
 	put_cpu_var(hrtimer_bases);

--


* [patch 11/21] hrtimers: state tracking
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (9 preceding siblings ...)
  2006-10-01 23:00 ` [patch 10/21] hrtimers: clean up locking Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:00 ` [patch 12/21] hrtimers: clean up callback tracking Thomas Gleixner
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-change-state-tracking.patch --]
[-- Type: text/plain, Size: 6043 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Reintroduce a ktimers feature that was "optimized away" by the ktimers
review process: multiple hrtimer states, which enable the running of
hrtimers without holding the cpu-base lock.

(The "optimized" rbtree hack carried only 2 states' worth of
information, and we need 4 for high resolution timers and
dynamic ticks.)
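
For illustration only (not part of the patch): a small userspace sketch
of the four states and their transitions. The toy_* helpers and the
STATE_* names are hypothetical; they only mimic how the state bits are
set and cleared, without any rbtree or locking.

```c
#include <assert.h>

#define STATE_INACTIVE	0x00UL
#define STATE_ENQUEUED	0x01UL	/* queued in the rbtree */
#define STATE_CALLBACK	0x02UL	/* callback function running */

struct toy_timer {
	unsigned long state;
};

/* Enqueue ORs the bit in, so a running callback stays visible. */
static void toy_enqueue(struct toy_timer *t)
{
	t->state |= STATE_ENQUEUED;
}

/* Removal overwrites the state with whatever the caller asks for. */
static void toy_remove(struct toy_timer *t, unsigned long newstate)
{
	t->state = newstate;
}

/* Called after the callback returned: clear only the callback bit. */
static void toy_callback_done(struct toy_timer *t)
{
	t->state &= ~STATE_CALLBACK;
}

/* Active means enqueued, running its callback, or both. */
static int toy_active(const struct toy_timer *t)
{
	return t->state != STATE_INACTIVE;
}

/*
 * Returns -1 while the callback runs, 1 if the timer was active and
 * got removed, 0 if it was not active at all.
 */
static int toy_try_to_cancel(struct toy_timer *t)
{
	if (t->state & STATE_CALLBACK)
		return -1;
	if (!toy_active(t))
		return 0;
	toy_remove(t, STATE_INACTIVE);
	return 1;
}
```

Walking one timer through enqueue, expiry, a requeue from another CPU
while the callback still runs, and a final cancel visits the states
0x01, 0x02, 0x03, 0x01 and 0x00 in that order.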

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h |   36 +++++++++++++++++++++++++++++++++++-
 kernel/hrtimer.c        |   21 ++++++++++++++-------
 2 files changed, 49 insertions(+), 8 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-10-02 00:55:52.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-10-02 00:55:52.000000000 +0200
@@ -40,6 +40,34 @@ enum hrtimer_restart {
 	HRTIMER_RESTART,	/* Timer must be restarted */
 };
 
+/*
+ * Bit values to track state of the timer
+ *
+ * Possible states:
+ *
+ * 0x00		inactive
+ * 0x01		enqueued into rbtree
+ * 0x02		callback function running
+ * 0x03		callback function running and enqueued
+ *		(was requeued on another CPU)
+ *
+ * The "callback function running and enqueued" status is only possible on
+ * SMP. It happens for example when a posix timer expired and the callback
+ * queued a signal. Between dropping the lock which protects the posix timer
+ * and reacquiring the base lock of the hrtimer, another CPU can deliver the
+ * signal and rearm the timer. We have to preserve the callback running state,
+ * as otherwise the timer could be removed before the softirq code finishes the
+ * handling of the timer.
+ *
+ * The HRTIMER_STATE_ENQUEUED bit is always or'ed to the current state to
+ * preserve the HRTIMER_STATE_CALLBACK bit in the above scenario.
+ *
+ * All state transitions are protected by cpu_base->lock.
+ */
+#define HRTIMER_STATE_INACTIVE	0x00
+#define HRTIMER_STATE_ENQUEUED	0x01
+#define HRTIMER_STATE_CALLBACK	0x02
+
 /**
  * struct hrtimer - the basic hrtimer structure
  * @node:	red black tree node for time ordered insertion
@@ -48,6 +76,7 @@ enum hrtimer_restart {
  *		which the timer is based.
  * @function:	timer expiry callback function
  * @base:	pointer to the timer base (per cpu and per clock)
+ * @state:	state information (See bit values above)
  *
  * The hrtimer structure must be initialized by init_hrtimer_#CLOCKTYPE()
  */
@@ -56,6 +85,7 @@ struct hrtimer {
 	ktime_t				expires;
 	enum hrtimer_restart		(*function)(struct hrtimer *);
 	struct hrtimer_clock_base	*base;
+	unsigned long			state;
 };
 
 /**
@@ -141,9 +171,13 @@ extern int hrtimer_get_res(const clockid
 extern ktime_t hrtimer_get_next_event(void);
 #endif
 
+/*
+ * A timer is active, when it is enqueued into the rbtree or the callback
+ * function is running.
+ */
 static inline int hrtimer_active(const struct hrtimer *timer)
 {
-	return rb_parent(&timer->node) != &timer->node;
+	return timer->state != HRTIMER_STATE_INACTIVE;
 }
 
 /* Forward a hrtimer so it expires after now: */
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:52.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:52.000000000 +0200
@@ -385,6 +385,11 @@ static void enqueue_hrtimer(struct hrtim
 	 */
 	rb_link_node(&timer->node, parent, link);
 	rb_insert_color(&timer->node, &base->active);
+	/*
+	 * HRTIMER_STATE_ENQUEUED is or'ed to the current state to preserve the
+	 * state of a possibly running callback.
+	 */
+	timer->state |= HRTIMER_STATE_ENQUEUED;
 
 	if (!base->first || timer->expires.tv64 <
 	    rb_entry(base->first, struct hrtimer, node)->expires.tv64)
@@ -397,7 +402,8 @@ static void enqueue_hrtimer(struct hrtim
  * Caller must hold the base lock.
  */
 static void __remove_hrtimer(struct hrtimer *timer,
-			     struct hrtimer_clock_base *base)
+			     struct hrtimer_clock_base *base,
+			     unsigned long newstate)
 {
 	/*
 	 * Remove the timer from the rbtree and replace the
@@ -406,7 +412,7 @@ static void __remove_hrtimer(struct hrti
 	if (base->first == &timer->node)
 		base->first = rb_next(&timer->node);
 	rb_erase(&timer->node, &base->active);
-	rb_set_parent(&timer->node, &timer->node);
+	timer->state = newstate;
 }
 
 /*
@@ -416,7 +422,7 @@ static inline int
 remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
 	if (hrtimer_active(timer)) {
-		__remove_hrtimer(timer, base);
+		__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE);
 		return 1;
 	}
 	return 0;
@@ -488,7 +494,7 @@ int hrtimer_try_to_cancel(struct hrtimer
 
 	base = lock_hrtimer_base(timer, &flags);
 
-	if (base->cpu_base->curr_timer != timer)
+	if (!(timer->state & HRTIMER_STATE_CALLBACK))
 		ret = remove_hrtimer(timer, base);
 
 	unlock_hrtimer_base(timer, &flags);
@@ -593,7 +599,6 @@ void hrtimer_init(struct hrtimer *timer,
 		clock_id = CLOCK_MONOTONIC;
 
 	timer->base = &cpu_base->clock_base[clock_id];
-	rb_set_parent(&timer->node, &timer->node);
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
 
@@ -644,13 +649,14 @@ static inline void run_hrtimer_queue(str
 
 		fn = timer->function;
 		set_curr_timer(cpu_base, timer);
-		__remove_hrtimer(timer, base);
+		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK);
 		spin_unlock_irq(&cpu_base->lock);
 
 		restart = fn(timer);
 
 		spin_lock_irq(&cpu_base->lock);
 
+		timer->state &= ~HRTIMER_STATE_CALLBACK;
 		if (restart != HRTIMER_NORESTART) {
 			BUG_ON(hrtimer_active(timer));
 			enqueue_hrtimer(timer, base);
@@ -821,7 +827,8 @@ static void migrate_hrtimer_list(struct 
 
 	while ((node = rb_first(&old_base->active))) {
 		timer = rb_entry(node, struct hrtimer, node);
-		__remove_hrtimer(timer, old_base);
+		BUG_ON(timer->state & HRTIMER_STATE_CALLBACK);
+		__remove_hrtimer(timer, old_base, HRTIMER_STATE_INACTIVE);
 		timer->base = new_base;
 		enqueue_hrtimer(timer, new_base);
 	}

--
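
For readers following the state tracking change above, here is a minimal
userspace sketch of the idea, not the kernel code itself. It assumes the
state bits are INACTIVE = 0x00, ENQUEUED = 0x01 and CALLBACK = 0x02 (the
definitions are not part of the hunks quoted here), that enqueueing ORs in
the ENQUEUED bit, and that a cancel attempt on a timer whose callback is
running returns -1. All sketch_* names are made up for illustration; locking
and the rbtree are left out.

```c
#include <assert.h>

/* Assumed bit values; the real definitions are not in the quoted hunks. */
enum {
	SKETCH_STATE_INACTIVE = 0x00,
	SKETCH_STATE_ENQUEUED = 0x01,
	SKETCH_STATE_CALLBACK = 0x02,
};

struct sketch_timer {
	unsigned long state;
};

/* Result of the cancel attempt made from inside the demo callback. */
static int sketch_seen_in_callback;

static int sketch_active(const struct sketch_timer *t)
{
	return t->state != SKETCH_STATE_INACTIVE;
}

/* Stand-in for enqueue_hrtimer(): only the state update is modelled. */
static void sketch_enqueue(struct sketch_timer *t)
{
	t->state |= SKETCH_STATE_ENQUEUED;
}

/* Stand-in for __remove_hrtimer(): the caller chooses the new state. */
static void sketch_remove(struct sketch_timer *t, unsigned long newstate)
{
	t->state = newstate;
}

/* Returns 1 if removed, 0 if not active, -1 if the callback is running. */
static int sketch_try_to_cancel(struct sketch_timer *t)
{
	int ret = -1;

	if (!(t->state & SKETCH_STATE_CALLBACK)) {
		ret = sketch_active(t);
		if (ret)
			sketch_remove(t, SKETCH_STATE_INACTIVE);
	}
	return ret;
}

/* Expiry path: mark CALLBACK, run fn, clear the bit, requeue on restart. */
static void sketch_run(struct sketch_timer *t,
		       int (*fn)(struct sketch_timer *))
{
	int restart;

	sketch_remove(t, SKETCH_STATE_CALLBACK);
	restart = fn(t);
	t->state &= ~(unsigned long)SKETCH_STATE_CALLBACK;
	if (restart) {
		assert(!sketch_active(t));
		sketch_enqueue(t);
	}
}

/* Demo callback: tries to cancel itself, then asks for no restart. */
static int sketch_cb_oneshot(struct sketch_timer *t)
{
	sketch_seen_in_callback = sketch_try_to_cancel(t);
	return 0;
}

/* Demo callback: asks to be requeued. */
static int sketch_cb_restart(struct sketch_timer *t)
{
	(void)t;
	return 1;
}
```

In this sketch sketch_try_to_cancel() notices a running callback by looking
at the timer alone, which is the property the patch relies on.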



* [patch 12/21] hrtimers: clean up callback tracking
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (10 preceding siblings ...)
  2006-10-01 23:00 ` [patch 11/21] hrtimers: state tracking Thomas Gleixner
@ 2006-10-01 23:00 ` Thomas Gleixner
  2006-10-01 23:01 ` [patch 13/21] hrtimers: Move and add documentation Thomas Gleixner
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-change-callback-tracking.patch --]
[-- Type: text/plain, Size: 2759 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

reintroduce ktimers feature "optimized away" by the ktimers
review process: remove the curr_timer pointer from the cpu-base
and use the hrtimer state.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h |    1 -
 kernel/hrtimer.c        |   10 +---------
 2 files changed, 1 insertion(+), 10 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-10-02 00:55:52.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-10-02 00:55:53.000000000 +0200
@@ -136,7 +136,6 @@ struct hrtimer_cpu_base {
 	spinlock_t			lock;
 	struct lock_class_key		lock_key;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
-	struct hrtimer			*curr_timer;
 };
 
 /*
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:52.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:53.000000000 +0200
@@ -150,8 +150,6 @@ static void hrtimer_get_softirq_time(str
  */
 #ifdef CONFIG_SMP
 
-#define set_curr_timer(b, t)		do { (b)->curr_timer = (t); } while (0)
-
 /*
  * We are using hashed locking: holding per_cpu(hrtimer_bases)[n].lock
  * means that all timers which are tied to this base via timer->base are
@@ -205,7 +203,7 @@ switch_hrtimer_base(struct hrtimer *time
 		 * completed. There is no conflict as we hold the lock until
 		 * the timer is enqueued.
 		 */
-		if (unlikely(base->cpu_base->curr_timer == timer))
+		if (unlikely(timer->state & HRTIMER_STATE_CALLBACK))
 			return base;
 
 		/* See the comment in lock_timer_base() */
@@ -219,8 +217,6 @@ switch_hrtimer_base(struct hrtimer *time
 
 #else /* CONFIG_SMP */
 
-#define set_curr_timer(b, t)		do { } while (0)
-
 static inline struct hrtimer_clock_base *
 lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 {
@@ -648,7 +644,6 @@ static inline void run_hrtimer_queue(str
 			break;
 
 		fn = timer->function;
-		set_curr_timer(cpu_base, timer);
 		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK);
 		spin_unlock_irq(&cpu_base->lock);
 
@@ -662,7 +657,6 @@ static inline void run_hrtimer_queue(str
 			enqueue_hrtimer(timer, base);
 		}
 	}
-	set_curr_timer(cpu_base, NULL);
 	spin_unlock_irq(&cpu_base->lock);
 }
 
@@ -849,8 +843,6 @@ static void migrate_hrtimers(int cpu)
 	spin_lock(&old_base->lock);
 
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		BUG_ON(old_base->curr_timer);
-
 		migrate_hrtimer_list(&old_base->clock_base[i],
 				     &new_base->clock_base[i]);
 	}

--
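
The switch_hrtimer_base() check changed above can be sketched in userspace
as follows. This is only an illustration under assumptions: the CALLBACK bit
value 0x02 and all mig_* names are hypothetical, and the locking done by the
real function is left out. It shows that a timer whose callback is running
stays on its current base, decided from the timer's own state without any
per-CPU curr_timer pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MIG_STATE_CALLBACK 0x02UL	/* assumed bit value */

struct mig_base {
	int cpu;
};

struct mig_timer {
	struct mig_base *base;
	unsigned long state;
};

/*
 * Move the timer to new_base unless its callback is currently running.
 * Returns the base the timer is on afterwards.
 */
static struct mig_base *mig_switch_base(struct mig_timer *t,
					struct mig_base *new_base)
{
	struct mig_base *base;

	assert(t != NULL && new_base != NULL);
	base = t->base;

	if (base != new_base) {
		/* A running callback pins the timer to its current base. */
		if (t->state & MIG_STATE_CALLBACK)
			return base;
		t->base = new_base;
	}
	return new_base;
}
```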



* [patch 13/21] hrtimers: Move and add documentation
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (11 preceding siblings ...)
  2006-10-01 23:00 ` [patch 12/21] hrtimers: clean up callback tracking Thomas Gleixner
@ 2006-10-01 23:01 ` Thomas Gleixner
  2006-10-01 23:01 ` [patch 14/21] clockevents: core Thomas Gleixner
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: time-and-timer-documentation.patch --]
[-- Type: text/plain, Size: 32451 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Move the initial hrtimer.txt document to the new directory
"Documentation/hrtimer"

Add design notes for the high resolution timer and dynamic tick
functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

 Documentation/hrtimer/highres.txt  |  249 +++++++++++++++++++++++++++++++++++++
 Documentation/hrtimer/hrtimers.txt |  178 ++++++++++++++++++++++++++
 Documentation/hrtimers.txt         |  178 --------------------------
 3 files changed, 427 insertions(+), 178 deletions(-)

Index: linux-2.6.18-mm2/Documentation/hrtimer/hrtimers.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/Documentation/hrtimer/hrtimers.txt	2006-10-02 00:55:53.000000000 +0200
@@ -0,0 +1,178 @@
+
+hrtimers - subsystem for high-resolution kernel timers
+----------------------------------------------------
+
+This patch introduces a new subsystem for high-resolution kernel timers.
+
+One might ask the question: we already have a timer subsystem
+(kernel/timers.c), why do we need two timer subsystems? After a lot of
+back and forth trying to integrate high-resolution and high-precision
+features into the existing timer framework, and after testing various
+such high-resolution timer implementations in practice, we came to the
+conclusion that the timer wheel code is fundamentally not suitable for
+such an approach. We initially didn't believe this ('there must be a way
+to solve this'), and spent a considerable effort trying to integrate
+things into the timer wheel, but we failed. In hindsight, there are
+several reasons why such integration is hard/impossible:
+
+- the forced handling of low-resolution and high-resolution timers in
+  the same way leads to a lot of compromises, macro magic and #ifdef
+  mess. The timers.c code is very "tightly coded" around jiffies and
+  32-bitness assumptions, and has been honed and micro-optimized for a
+  relatively narrow use case (jiffies in a relatively narrow HZ range)
+  for many years - and thus even small extensions to it easily break
+  the wheel concept, leading to even worse compromises. The timer wheel
+  code is very good and tight code, there's zero problems with it in its
+  current usage - but it is simply not suitable to be extended for
+  high-res timers.
+
+- the unpredictable [O(N)] overhead of cascading leads to delays which
+  necessitate a more complex handling of high resolution timers, which
+  in turn decreases robustness. Such a design still led to rather large
+  timing inaccuracies. Cascading is a fundamental property of the timer
+  wheel concept, it cannot be 'designed out' without inevitably
+  degrading other portions of the timers.c code in an unacceptable way.
+
+- the implementation of the current posix-timer subsystem on top of
+  the timer wheel has already introduced a quite complex handling of
+  the required readjusting of absolute CLOCK_REALTIME timers at
+  settimeofday or NTP time - further underlining our experience by
+  example: that the timer wheel data structure is too rigid for high-res
+  timers.
+
+- the timer wheel code is most optimal for use cases which can be
+  identified as "timeouts". Such timeouts are usually set up to cover
+  error conditions in various I/O paths, such as networking and block
+  I/O. The vast majority of those timers never expire and are rarely
+  recascaded because the expected correct event arrives in time so they
+  can be removed from the timer wheel before any further processing of
+  them becomes necessary. Thus the users of these timeouts can accept
+  the granularity and precision tradeoffs of the timer wheel, and
+  largely expect the timer subsystem to have near-zero overhead.
+  Accurate timing for them is not a core purpose - in fact most of the
+  timeout values used are ad-hoc. For them it is at most a necessary
+  evil to guarantee the processing of actual timeout completions
+  (because most of the timeouts are deleted before completion), which
+  should thus be as cheap and unintrusive as possible.
+
+The primary users of precision timers are user-space applications that
+utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
+users like drivers and subsystems which require precise timed events
+(e.g. multimedia) can benefit from the availability of a separate
+high-resolution timer subsystem as well.
+
+While this subsystem does not offer high-resolution clock sources just
+yet, the hrtimer subsystem can be easily extended with high-resolution
+clock capabilities, and patches for that exist and are maturing quickly.
+The increasing demand for realtime and multimedia applications along
+with other potential users for precise timers gives another reason to
+separate the "timeout" and "precise timer" subsystems.
+
+Another potential benefit is that such a separation allows even more
+special-purpose optimization of the existing timer wheel for the low
+resolution and low precision use cases - once the precision-sensitive
+APIs are separated from the timer wheel and are migrated over to
+hrtimers. E.g. we could decrease the frequency of the timeout subsystem
+from 250 Hz to 100 Hz (or even lower).
+
+hrtimer subsystem implementation details
+----------------------------------------
+
+the basic design considerations were:
+
+- simplicity
+
+- data structure not bound to jiffies or any other granularity. All the
+  kernel logic works at 64-bit nanoseconds resolution - no compromises.
+
+- simplification of existing, timing related kernel code
+
+another basic requirement was the immediate enqueueing and ordering of
+timers at activation time. After looking at several possible solutions
+such as radix trees and hashes, we chose the red black tree as the basic
+data structure. Rbtrees are available as a library in the kernel and are
+used in various performance-critical areas of e.g. memory management and
+file systems. The rbtree is solely used for time sorted ordering, while
+a separate list is used to give the expiry code fast access to the
+queued timers, without having to walk the rbtree.
+
+(This separate list is also useful for later when we'll introduce
+high-resolution clocks, where we need separate pending and expired
+queues while keeping the time-order intact.)
+
+Time-ordered enqueueing is not purely for the purposes of
+high-resolution clocks though, it also simplifies the handling of
+absolute timers based on a low-resolution CLOCK_REALTIME. The existing
+implementation needed to keep an extra list of all armed absolute
+CLOCK_REALTIME timers along with complex locking. In case of
+settimeofday and NTP, all the timers (!) had to be dequeued, the
+time-changing code had to fix them up one by one, and all of them had to
+be enqueued again. The time-ordered enqueueing and the storage of the
+expiry time in absolute time units removes all this complex and poorly
+scaling code from the posix-timer implementation - the clock can simply
+be set without having to touch the rbtree. This also makes the handling
+of posix-timers simpler in general.
+
+The locking and per-CPU behavior of hrtimers was mostly taken from the
+existing timer wheel code, as it is mature and well suited. Sharing code
+was not really a win, due to the different data structures. Also, the
+hrtimer functions now have clearer behavior and clearer names - such as
+hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
+equivalent to del_timer() and del_timer_sync()] - so there's no direct
+1:1 mapping between them on the algorithmic level, and thus no real
+potential for code sharing either.
+
+Basic data types: every time value, absolute or relative, is in a
+special nanosecond-resolution type: ktime_t. The kernel-internal
+representation of ktime_t values and operations is implemented via
+macros and inline functions, and can be switched between a "hybrid
+union" type and a plain "scalar" 64bit nanoseconds representation (at
+compile time). The hybrid union type optimizes time conversions on 32bit
+CPUs. This build-time-selectable ktime_t storage format was implemented
+to avoid the performance impact of 64-bit multiplications and divisions
+on 32bit CPUs. Such operations are frequently necessary to convert
+between the storage formats provided by kernel and userspace interfaces
+and the internal time format. (See include/linux/ktime.h for further
+details.)
+
+hrtimers - rounding of timer values
+-----------------------------------
+
+the hrtimer code will round timer events to lower-resolution clocks
+because it has to. Otherwise it will do no artificial rounding at all.
+
+one question is, what resolution value should be returned to the user by
+the clock_getres() interface. This will return whatever real resolution
+a given clock has - be it low-res, high-res, or artificially-low-res.
+
+hrtimers - testing and verification
+----------------------------------
+
+We used the high-resolution clock subsystem on top of hrtimers to verify
+the hrtimer implementation details in practice, and we also ran the posix
+timer tests in order to ensure specification compliance. We also ran
+tests on low-resolution clocks.
+
+The hrtimer patch converts the following kernel functionality to use
+hrtimers:
+
+ - nanosleep
+ - itimers
+ - posix-timers
+
+The conversion of nanosleep and posix-timers enabled the unification of
+nanosleep and clock_nanosleep.
+
+The code was successfully compiled for the following platforms:
+
+ i386, x86_64, ARM, PPC, PPC64, IA64
+
+The code was run-tested on the following platforms:
+
+ i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
+
+hrtimers were also integrated into the -rt tree, along with a
+hrtimers-based high-resolution clock implementation, so the hrtimers
+code got a healthy amount of testing and use in practice.
+
+	Thomas Gleixner, Ingo Molnar
Index: linux-2.6.18-mm2/Documentation/hrtimers.txt
===================================================================
--- linux-2.6.18-mm2.orig/Documentation/hrtimers.txt	2006-10-02 00:55:47.000000000 +0200
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,178 +0,0 @@
-
-hrtimers - subsystem for high-resolution kernel timers
-----------------------------------------------------
-
-This patch introduces a new subsystem for high-resolution kernel timers.
-
-One might ask the question: we already have a timer subsystem
-(kernel/timers.c), why do we need two timer subsystems? After a lot of
-back and forth trying to integrate high-resolution and high-precision
-features into the existing timer framework, and after testing various
-such high-resolution timer implementations in practice, we came to the
-conclusion that the timer wheel code is fundamentally not suitable for
-such an approach. We initially didnt believe this ('there must be a way
-to solve this'), and spent a considerable effort trying to integrate
-things into the timer wheel, but we failed. In hindsight, there are
-several reasons why such integration is hard/impossible:
-
-- the forced handling of low-resolution and high-resolution timers in
-  the same way leads to a lot of compromises, macro magic and #ifdef
-  mess. The timers.c code is very "tightly coded" around jiffies and
-  32-bitness assumptions, and has been honed and micro-optimized for a
-  relatively narrow use case (jiffies in a relatively narrow HZ range)
-  for many years - and thus even small extensions to it easily break
-  the wheel concept, leading to even worse compromises. The timer wheel
-  code is very good and tight code, there's zero problems with it in its
-  current usage - but it is simply not suitable to be extended for
-  high-res timers.
-
-- the unpredictable [O(N)] overhead of cascading leads to delays which
-  necessiate a more complex handling of high resolution timers, which
-  in turn decreases robustness. Such a design still led to rather large
-  timing inaccuracies. Cascading is a fundamental property of the timer
-  wheel concept, it cannot be 'designed out' without unevitably
-  degrading other portions of the timers.c code in an unacceptable way.
-
-- the implementation of the current posix-timer subsystem on top of
-  the timer wheel has already introduced a quite complex handling of
-  the required readjusting of absolute CLOCK_REALTIME timers at
-  settimeofday or NTP time - further underlying our experience by
-  example: that the timer wheel data structure is too rigid for high-res
-  timers.
-
-- the timer wheel code is most optimal for use cases which can be
-  identified as "timeouts". Such timeouts are usually set up to cover
-  error conditions in various I/O paths, such as networking and block
-  I/O. The vast majority of those timers never expire and are rarely
-  recascaded because the expected correct event arrives in time so they
-  can be removed from the timer wheel before any further processing of
-  them becomes necessary. Thus the users of these timeouts can accept
-  the granularity and precision tradeoffs of the timer wheel, and
-  largely expect the timer subsystem to have near-zero overhead.
-  Accurate timing for them is not a core purpose - in fact most of the
-  timeout values used are ad-hoc. For them it is at most a necessary
-  evil to guarantee the processing of actual timeout completions
-  (because most of the timeouts are deleted before completion), which
-  should thus be as cheap and unintrusive as possible.
-
-The primary users of precision timers are user-space applications that
-utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
-users like drivers and subsystems which require precise timed events
-(e.g. multimedia) can benefit from the availability of a seperate
-high-resolution timer subsystem as well.
-
-While this subsystem does not offer high-resolution clock sources just
-yet, the hrtimer subsystem can be easily extended with high-resolution
-clock capabilities, and patches for that exist and are maturing quickly.
-The increasing demand for realtime and multimedia applications along
-with other potential users for precise timers gives another reason to
-separate the "timeout" and "precise timer" subsystems.
-
-Another potential benefit is that such a seperation allows even more
-special-purpose optimization of the existing timer wheel for the low
-resolution and low precision use cases - once the precision-sensitive
-APIs are separated from the timer wheel and are migrated over to
-hrtimers. E.g. we could decrease the frequency of the timeout subsystem
-from 250 Hz to 100 HZ (or even smaller).
-
-hrtimer subsystem implementation details
-----------------------------------------
-
-the basic design considerations were:
-
-- simplicity
-
-- data structure not bound to jiffies or any other granularity. All the
-  kernel logic works at 64-bit nanoseconds resolution - no compromises.
-
-- simplification of existing, timing related kernel code
-
-another basic requirement was the immediate enqueueing and ordering of
-timers at activation time. After looking at several possible solutions
-such as radix trees and hashes, we chose the red black tree as the basic
-data structure. Rbtrees are available as a library in the kernel and are
-used in various performance-critical areas of e.g. memory management and
-file systems. The rbtree is solely used for time sorted ordering, while
-a separate list is used to give the expiry code fast access to the
-queued timers, without having to walk the rbtree.
-
-(This seperate list is also useful for later when we'll introduce
-high-resolution clocks, where we need seperate pending and expired
-queues while keeping the time-order intact.)
-
-Time-ordered enqueueing is not purely for the purposes of
-high-resolution clocks though, it also simplifies the handling of
-absolute timers based on a low-resolution CLOCK_REALTIME. The existing
-implementation needed to keep an extra list of all armed absolute
-CLOCK_REALTIME timers along with complex locking. In case of
-settimeofday and NTP, all the timers (!) had to be dequeued, the
-time-changing code had to fix them up one by one, and all of them had to
-be enqueued again. The time-ordered enqueueing and the storage of the
-expiry time in absolute time units removes all this complex and poorly
-scaling code from the posix-timer implementation - the clock can simply
-be set without having to touch the rbtree. This also makes the handling
-of posix-timers simpler in general.
-
-The locking and per-CPU behavior of hrtimers was mostly taken from the
-existing timer wheel code, as it is mature and well suited. Sharing code
-was not really a win, due to the different data structures. Also, the
-hrtimer functions now have clearer behavior and clearer names - such as
-hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
-equivalent to del_timer() and del_timer_sync()] - so there's no direct
-1:1 mapping between them on the algorithmical level, and thus no real
-potential for code sharing either.
-
-Basic data types: every time value, absolute or relative, is in a
-special nanosecond-resolution type: ktime_t. The kernel-internal
-representation of ktime_t values and operations is implemented via
-macros and inline functions, and can be switched between a "hybrid
-union" type and a plain "scalar" 64bit nanoseconds representation (at
-compile time). The hybrid union type optimizes time conversions on 32bit
-CPUs. This build-time-selectable ktime_t storage format was implemented
-to avoid the performance impact of 64-bit multiplications and divisions
-on 32bit CPUs. Such operations are frequently necessary to convert
-between the storage formats provided by kernel and userspace interfaces
-and the internal time format. (See include/linux/ktime.h for further
-details.)
-
-hrtimers - rounding of timer values
------------------------------------
-
-the hrtimer code will round timer events to lower-resolution clocks
-because it has to. Otherwise it will do no artificial rounding at all.
-
-one question is, what resolution value should be returned to the user by
-the clock_getres() interface. This will return whatever real resolution
-a given clock has - be it low-res, high-res, or artificially-low-res.
-
-hrtimers - testing and verification
-----------------------------------
-
-We used the high-resolution clock subsystem ontop of hrtimers to verify
-the hrtimer implementation details in praxis, and we also ran the posix
-timer tests in order to ensure specification compliance. We also ran
-tests on low-resolution clocks.
-
-The hrtimer patch converts the following kernel functionality to use
-hrtimers:
-
- - nanosleep
- - itimers
- - posix-timers
-
-The conversion of nanosleep and posix-timers enabled the unification of
-nanosleep and clock_nanosleep.
-
-The code was successfully compiled for the following platforms:
-
- i386, x86_64, ARM, PPC, PPC64, IA64
-
-The code was run-tested on the following platforms:
-
- i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
-
-hrtimers were also integrated into the -rt tree, along with a
-hrtimers-based high-resolution clock implementation, so the hrtimers
-code got a healthy amount of testing and use in practice.
-
-	Thomas Gleixner, Ingo Molnar
Index: linux-2.6.18-mm2/Documentation/hrtimer/highres.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/Documentation/hrtimer/highres.txt	2006-10-02 00:55:53.000000000 +0200
@@ -0,0 +1,249 @@
+High resolution timers and dynamic ticks design notes
+-----------------------------------------------------
+
+Further information can be found in the paper of the OLS 2006 talk "hrtimers
+and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can
+be found on the OLS website:
+http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf
+
+The slides to this talk are available from:
+http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf
+
+The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the
+changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the
+design of the Linux time(r) system before hrtimers and other building blocks
+got merged into mainline.
+
+Note: the paper and the slides talk about "clock event sources", while we have
+since switched to the name "clock event devices".
+
+The design contains the following basic building blocks:
+
+- hrtimer base infrastructure
+- timeofday and clock source management
+- clock event management
+- high resolution timer functionality
+- dynamic ticks
+
+
+hrtimer base infrastructure
+---------------------------
+
+The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of
+the base implementation are covered in Documentation/hrtimer/hrtimers.txt. See
+also figure #2 (OLS slides p. 15).
+
+The main differences to the timer wheel, which holds the armed timer_list type
+timers are:
+       - time ordered enqueueing into a rb-tree
+       - independent of ticks (the processing is based on nanoseconds)
+
+
+timeofday and clock source management
+-------------------------------------
+
+John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of
+code out of the architecture-specific areas into a generic management
+framework, as illustrated in figure #3 (OLS slides p. 18). The architecture
+specific portion is reduced to the low level hardware details of the clock
+sources, which are registered in the framework and selected on a quality based
+decision. The low level code provides hardware setup and readout routines and
+initializes data structures, which are used by the generic time keeping code to
+convert the clock ticks to nanosecond based time values. All other time keeping
+related functionality is moved into the generic code. The GTOD base patch got
+merged into the 2.6.18 kernel.
+
+Further information about the Generic Time Of Day framework is available in the
+OLS 2005 Proceedings Volume 1:
+http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf
+
+The paper "We Are Not Getting Any Younger: A New Approach to Time and
+Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan.
+
+Figure #3 (OLS slides p.18) illustrates the transformation.
+
+
+clock event management
+----------------------
+
+While clock sources provide read access to the monotonically increasing time
+value, clock event devices are used to schedule the next event
+interrupt(s). The next event is currently defined to be periodic, with its
+period defined at compile time. The setup and selection of the event device
+for various event driven functionalities is hardwired into the architecture
+dependent code. This results in duplicated code across all architectures and
+makes it extremely difficult to change the configuration of the system to use
+event interrupt devices other than those already built into the
+architecture. Another implication of the current design is that it is necessary
+to touch all the architecture-specific implementations in order to provide new
+functionality like high resolution timers or dynamic ticks.
+
+The clock events subsystem tries to address this problem by providing a generic
+solution to manage clock event devices and their usage for the various clock
+event driven kernel functionalities. The goal of the clock event subsystem is
+to minimize the clock event related architecture dependent code to the pure
+hardware related handling and to allow easy addition and utilization of new
+clock event devices. It also minimizes the duplicated code across the
+architectures as it provides generic functionality down to the interrupt
+service handler, which is almost inherently hardware dependent.
+
+Clock event devices are registered either by the architecture dependent boot
+code or at module insertion time. Each clock event device fills a data
+structure with clock-specific property parameters and callback functions. The
+clock event management decides, by using the specified property parameters, the
+set of system functions a clock event device will be used to support. This
+includes the distinction of per-CPU and per-system global event devices.
+
+System-level global event devices are used for the Linux periodic tick. Per-CPU
+event devices are used to provide local CPU functionality such as process
+accounting, profiling, and high resolution timers.
+
+The management layer assigns one or more of the following functions to a clock
+event device:
+      - system global periodic tick (jiffies update)
+      - cpu local update_process_times
+      - cpu local profiling
+      - cpu local next event interrupt (non periodic mode)
+
+The clock event device delegates the selection of those timer interrupt related
+functions completely to the management layer. The clock management layer stores
+a function pointer in the device description structure, which has to be called
+from the hardware level handler. This removes a lot of duplicated code from the
+architecture specific timer interrupt handlers and hands the control over the
+clock event devices and the assignment of timer interrupt related functionality
+to the core code.
+
+The clock event layer API is rather small. Aside from the clock event device
+registration interface it provides functions to schedule the next event
+interrupt, clock event device notification service and support for suspend and
+resume.
+
+The framework adds about 700 lines of code which results in a 2KB increase of
+the kernel binary size. The conversion of i386 removes about 100 lines of
+code. The binary size decrease is in the range of 400 bytes. We believe that the
+increase of flexibility and the avoidance of duplicated code across
+architectures justifies the slight increase of the binary size.
+
+The conversion of an architecture has no functional impact, but allows the
+high resolution and dynamic tick functionalities to be used without any change
+to the clock event device and timer interrupt code. After the conversion the
+enabling of high resolution timers and dynamic ticks is simply provided by
+adding the kernel/time/Kconfig file to the architecture specific Kconfig and
+adding the dynamic tick specific calls to the idle routine (a total of 3 lines
+added to the idle function and the Kconfig file).
+
+Figure #4 (OLS slides p.20) illustrates the transformation.
+
+
+high resolution timer functionality
+-----------------------------------
+
+During system boot it is not possible to use the high resolution timer
+functionality, while making it possible would be difficult and would serve no
+useful function. The initialization of the clock event device framework, the
+clock source framework (GTOD) and hrtimers itself has to be done and
+appropriate clock sources and clock event devices have to be registered before
+the high resolution functionality can work. Up to the point where hrtimers are
+initialized, the system works in the usual low resolution periodic mode. The
+clock source and the clock event device layers provide notification functions
+which inform hrtimers about availability of new hardware. hrtimers validates
+the usability of the registered clock sources and clock event devices before
+switching to high resolution mode. This also ensures that a kernel which is
+configured for high resolution timers can run on a system which lacks the
+necessary hardware support.
+
+The high resolution timer code does not support SMP machines which have only
+global clock event devices. The support of such hardware would involve IPI
+calls when an interrupt happens. The overhead would be much larger than the
+benefit. This is the reason why we currently disable high resolution and
+dynamic ticks on i386 SMP systems which stop the local APIC in C3 power
+state. A workaround is available as an idea, but the problem has not been
+tackled yet.
+
+The time ordered insertion of timers provides all the infrastructure to decide
+whether the event device has to be reprogrammed when a timer is added. The
+decision is made per timer base and synchronized across per-cpu timer bases in
+a support function. The design allows the system to utilize separate per-CPU
+clock event devices for the per-CPU timer bases, but currently only one
+reprogrammable clock event device per-CPU is utilized.
+
+When the timer interrupt happens, the next event interrupt handler is called
+from the clock event distribution code and moves expired timers from the
+red-black tree to a separate doubly linked list and invokes the softirq
+handler. An additional mode field in the hrtimer structure allows the system to
+execute callback functions directly from the next event interrupt handler. This
+is restricted to code which can safely be executed in the hard interrupt
+context. This applies, for example, to the common case of a wakeup function as
+used by nanosleep. The advantage of executing the handler in the interrupt
+context is the avoidance of up to two context switches - from the interrupted
+context to the softirq and to the task which is woken up by the expired
+timer.
+
+Once a system has switched to high resolution mode, the periodic tick is
+switched off. This disables the per system global periodic clock event device -
+e.g. the PIT on i386 SMP systems.
+
+The periodic tick functionality is provided by a per-cpu hrtimer. The callback
+function is executed in the next event interrupt context and updates jiffies
+and calls update_process_times and profiling. The implementation of the hrtimer
+based periodic tick is designed to be extended with dynamic tick functionality.
+This allows a single clock event device to be used to schedule high resolution
+timer and periodic events (jiffies tick, profiling, process accounting) on UP
+systems. This has been proven to work with the PIT on i386 and the Incrementer
+on PPC.
+
+The softirq for running the hrtimer queues and executing the callbacks has been
+separated from the tick bound timer softirq to allow accurate delivery of high
+resolution timer signals which are used by itimer and POSIX interval
+timers. The execution of this softirq can still be delayed by other softirqs,
+but the overall latencies have been significantly improved by this separation.
+
+Figure #5 (OLS slides p.22) illustrates the transformation.
+
+
+dynamic ticks
+-------------
+
+Dynamic ticks are the logical consequence of the hrtimer based periodic tick
+replacement (sched_tick). The functionality of the sched_tick hrtimer is
+extended by three functions:
+
+- hrtimer_stop_sched_tick
+- hrtimer_restart_sched_tick
+- hrtimer_update_jiffies
+
+hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code
+evaluates the next scheduled timer event (from both hrtimers and the timer
+wheel) and, if the next event is further away than the next tick, it
+reprograms the sched_tick to this future event, to allow longer idle sleeps
+without needless interruption by the periodic tick. The function is also
+called when an interrupt happens during the idle period, which does not cause a
+reschedule. The call is necessary as the interrupt handler might have armed a
+new timer whose expiry time is before the time which was identified as the
+nearest event in the previous call to hrtimer_stop_sched_tick.
+
+hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before
+it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic tick,
+which is kept active until the next call to hrtimer_stop_sched_tick().
+
+hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens
+in the idle period to make sure that jiffies are up to date and the interrupt
+handler does not have to deal with a possibly stale jiffies value.
+
+The dynamic tick feature provides statistical values which are exported to
+userspace via /proc/stat and can be made available for enhanced power
+management control.
+
+The implementation leaves room for further development like full tickless
+systems, where the time slice is controlled by the scheduler, variable
+frequency profiling, and a complete removal of jiffies in the future.
+
+
+Besides the current initial submission of i386 support, the patchset has
+already been extended to x86_64 and ARM. Initial (work in progress) support is
+also available for MIPS and PowerPC.
+
+	  Thomas, Ingo
+
+
+
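As a side note on the dynamic ticks section above, the decision taken when a
CPU goes idle can be sketched in a few lines of userspace C. This is only an
illustration under simplifying assumptions, not the code from the patch set:
demo_idle_wakeup_ns() and its parameters are made-up names, and all times are
plain nanosecond counters rather than ktime_t values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch set): decide for which absolute
 * time the tick timer should be programmed when the CPU goes idle.
 *
 * last_tick_ns:   time of the last periodic tick
 * tick_period_ns: length of one tick (e.g. 4000000 for HZ=250)
 * next_timer_ns:  earliest pending timer, taken over hrtimers and the
 *                 timer wheel
 */
static uint64_t demo_idle_wakeup_ns(uint64_t last_tick_ns,
				    uint64_t tick_period_ns,
				    uint64_t next_timer_ns)
{
	uint64_t next_tick_ns = last_tick_ns + tick_period_ns;

	assert(tick_period_ns > 0);

	/* A timer is due before or at the next tick: keep the tick as is */
	if (next_timer_ns <= next_tick_ns)
		return next_tick_ns;

	/* Nothing to do until the timer expires: skip the ticks in between */
	return next_timer_ns;
}
```

With a 4ms tick and the next timer 50ms away, the sketch returns the timer
expiry, so the CPU can sleep through the ticks in between. If an interrupt
arms an earlier timer during the idle period, calling the helper again yields
the earlier wakeup, which corresponds to the re-evaluation described above.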

--



* [patch 14/21] clockevents: core
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (12 preceding siblings ...)
  2006-10-01 23:01 ` [patch 13/21] hrtimers: Move and add documentation Thomas Gleixner
@ 2006-10-01 23:01 ` Thomas Gleixner
  2006-10-01 23:01 ` [patch 15/21] clockevents: drivers for i386 Thomas Gleixner
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: clockevents-base.patch --]
[-- Type: text/plain, Size: 23175 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add a framework to manage clock event devices.

We have two types of clock event devices:
- global events (one device per system)
- local events (one device per cpu)

We assign the various time(r) related interrupts to those devices:

- global tick (advances jiffies)
- update process times (per cpu)
- profiling (per cpu)
- next timer events (per cpu)

Architectures register their clock event devices, with specific capability
bits set, and the framework code assigns the appropriate event handler
to the event device. The functionality is assigned via an event handler to
avoid runtime evaluation of the assigned function bits.
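
To illustrate the handler selection: the capability bits can serve directly as
an index into a table of handlers, so no flag has to be tested at interrupt
time. The following is a minimal userspace sketch and not the kernel code; the
demo_* names and the counter structure are made up for illustration, only the
bit values mirror the CLOCK_CAP_* defines.

```c
#include <assert.h>
#include <stddef.h>

/* Capability bits, same values as CLOCK_CAP_TICK/UPDATE/PROFILE */
#define DEMO_CAP_TICK		0x1u
#define DEMO_CAP_UPDATE		0x2u
#define DEMO_CAP_PROFILE	0x4u
#define DEMO_BASE_CAPS_MASK	(DEMO_CAP_TICK | DEMO_CAP_UPDATE | \
				 DEMO_CAP_PROFILE)

/* Hypothetical counters which make the effect of a handler observable */
struct demo_counters {
	unsigned int ticks;
	unsigned int updates;
	unsigned int profiles;
};

typedef void (*demo_handler_t)(struct demo_counters *c);

static void demo_noop(struct demo_counters *c) { (void)c; }
static void demo_tick(struct demo_counters *c) { c->ticks++; }
static void demo_update(struct demo_counters *c) { c->updates++; }
static void demo_profile(struct demo_counters *c) { c->profiles++; }

static void demo_tick_update(struct demo_counters *c)
{
	demo_tick(c);
	demo_update(c);
}

static void demo_tick_profile(struct demo_counters *c)
{
	demo_tick(c);
	demo_profile(c);
}

static void demo_update_profile(struct demo_counters *c)
{
	demo_update(c);
	demo_profile(c);
}

static void demo_tick_update_profile(struct demo_counters *c)
{
	demo_tick(c);
	demo_update(c);
	demo_profile(c);
}

/* The index is the capability bitmask itself */
static const demo_handler_t demo_handlers[] = {
	demo_noop,			/* 0: no capability selected */
	demo_tick,			/* 1: tick */
	demo_update,			/* 2: update */
	demo_tick_update,		/* 3: tick + update */
	demo_profile,			/* 4: profile */
	demo_tick_profile,		/* 5: tick + profile */
	demo_update_profile,		/* 6: update + profile */
	demo_tick_update_profile,	/* 7: tick + update + profile */
};

/* Pick the handler once, at registration time */
static demo_handler_t demo_select_handler(unsigned int caps)
{
	caps &= DEMO_BASE_CAPS_MASK;
	assert(caps < sizeof(demo_handlers) / sizeof(demo_handlers[0]));
	return demo_handlers[caps];
}
```

The cost is one specialised function per combination, which stays manageable
with three base capability bits.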

This allows the clock event devices to be controlled without the architectures
having to worry about the details of function assignment. This is also a
prerequisite for high resolution timers and dynamic ticks, as it allows the
core code to control the clock functionality without intrusive changes
to the architecture code.

When high resolution timers and dynamic ticks are disabled, there is no
change in the behaviour of the system.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/clockchips.h |  122 +++++++++
 include/linux/hrtimer.h    |    3 
 init/main.c                |    2 
 kernel/hrtimer.c           |    6 
 kernel/time/Makefile       |    2 
 kernel/time/clockevents.c  |  567 +++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 700 insertions(+), 2 deletions(-)

Index: linux-2.6.18-mm2/include/linux/clockchips.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/include/linux/clockchips.h	2006-10-02 00:55:53.000000000 +0200
@@ -0,0 +1,122 @@
+/*  linux/include/linux/clockchips.h
+ *
+ *  This file contains the structure definitions for clockchips.
+ *
+ *  If you are not a clockchip, or the time of day code, you should
+ *  not be including this file!
+ */
+#ifndef _LINUX_CLOCKCHIPS_H
+#define _LINUX_CLOCKCHIPS_H
+
+#ifdef CONFIG_GENERIC_CLOCKEVENTS
+
+#include <linux/clocksource.h>
+#include <linux/interrupt.h>
+
+struct clock_event_device;
+
+/* Clock event mode commands */
+enum clock_event_mode {
+	CLOCK_EVT_PERIODIC,
+	CLOCK_EVT_ONESHOT,
+	CLOCK_EVT_SHUTDOWN,
+};
+
+/*
+ * Clock event capability flags:
+ *
+ * CAP_TICK:	The event source should be used for the periodic tick
+ * CAP_UPDATE:	The event source handler should call update_process_times()
+ * CAP_PROFILE: The event source handler should call profile_tick()
+ * CAP_NEXTEVT:	The event source can be reprogrammed in oneshot mode and is
+ *		a per cpu event source.
+ *
+ * The capability flags are used to select the appropriate handler for an event
+ * source. On an i386 UP system the PIT can serve all of the functionalities,
+ * while on a SMP system the PIT is solely used for the periodic tick and the
+ * local APIC timers are used for UPDATE / PROFILE / NEXTEVT. To avoid the run
+ * time query of those flags, the clock events layer assigns the appropriate
+ * event handler function, which contains only the selected calls, to the
+ * event.
+ */
+#define CLOCK_CAP_TICK		0x000001
+#define CLOCK_CAP_UPDATE	0x000002
+#define CLOCK_CAP_PROFILE	0x000004
+#ifdef CONFIG_HIGH_RES_TIMERS
+# define CLOCK_CAP_NEXTEVT	0x000008
+#else
+# define CLOCK_CAP_NEXTEVT	0x000000
+#endif
+
+#define CLOCK_BASE_CAPS_MASK	(CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | \
+				 CLOCK_CAP_UPDATE)
+#define CLOCK_CAPS_MASK		(CLOCK_BASE_CAPS_MASK | CLOCK_CAP_NEXTEVT)
+
+/**
+ * struct clock_event_device - clock event descriptor
+ *
+ * @name:		ptr to clock event name
+ * @capabilities:	capabilities of the event chip
+ * @max_delta_ns:	maximum delta value in ns
+ * @min_delta_ns:	minimum delta value in ns
+ * @mult:		nanosecond to cycles multiplier
+ * @shift:		nanoseconds to cycles divisor (power of two)
+ * @set_next_event:	set next event
+ * @set_mode:		set mode function
+ * @suspend:		suspend function (optional)
+ * @resume:		resume function (optional)
+ * @event_handler:	Assigned by the framework to be called by the low
+ *			level handler of the event source
+ */
+struct clock_event_device {
+	const char	*name;
+	unsigned int	capabilities;
+	unsigned long	max_delta_ns;
+	unsigned long	min_delta_ns;
+	unsigned long	mult;
+	int		shift;
+	void		(*set_next_event)(unsigned long evt,
+					  struct clock_event_device *);
+	void		(*set_mode)(enum clock_event_mode mode,
+				    struct clock_event_device *);
+	void		(*event_handler)(struct pt_regs *regs);
+};
+
+/*
+ * Calculate a multiplication factor for scaled math, which is used to convert
+ * nanoseconds based values to clock ticks:
+ *
+ * clock_ticks = (nanoseconds * factor) >> shift.
+ *
+ * div_sc is the rearranged equation to calculate a factor from a given clock
+ * ticks / nanoseconds ratio:
+ *
+ * factor = (clock_ticks << shift) / nanoseconds
+ */
+static inline unsigned long div_sc(unsigned long ticks, unsigned long nsec,
+				   int shift)
+{
+	uint64_t tmp = ((uint64_t)ticks) << shift;
+
+	do_div(tmp, nsec);
+	return (unsigned long) tmp;
+}
+
+/* Clock event layer functions */
+extern int register_local_clockevent(struct clock_event_device *);
+extern int register_global_clockevent(struct clock_event_device *);
+extern unsigned long clockevent_delta2ns(unsigned long latch,
+					 struct clock_event_device *evt);
+extern void clockevents_init(void);
+
+extern int clockevents_init_next_event(void);
+extern int clockevents_set_next_event(ktime_t expires, int force);
+extern int clockevents_next_event_available(void);
+extern void clockevents_resume_events(void);
+
+#else
+# define clockevents_init()		do { } while(0)
+# define clockevents_resume_events()	do { } while(0)
+#endif
+
+#endif
Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-10-02 00:55:53.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-10-02 00:55:53.000000000 +0200
@@ -144,6 +144,9 @@ struct hrtimer_cpu_base {
  * is expired in the next softirq when the clock was advanced.
  */
 #define clock_was_set()		do { } while (0)
+#define hrtimer_clock_notify()	do { } while (0)
+extern ktime_t ktime_get(void);
+extern ktime_t ktime_get_real(void);
 
 /* Exported timer functions: */
 
Index: linux-2.6.18-mm2/init/main.c
===================================================================
--- linux-2.6.18-mm2.orig/init/main.c	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/init/main.c	2006-10-02 00:55:53.000000000 +0200
@@ -36,6 +36,7 @@
 #include <linux/moduleparam.h>
 #include <linux/kallsyms.h>
 #include <linux/writeback.h>
+#include <linux/clockchips.h>
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
 #include <linux/efi.h>
@@ -529,6 +530,7 @@ asmlinkage void __init start_kernel(void
 	rcu_init();
 	init_IRQ();
 	pidhash_init();
+	clockevents_init();
 	init_timers();
 	hrtimers_init();
 	softirq_init();
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:53.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:53.000000000 +0200
@@ -31,6 +31,7 @@
  *  For licencing details see kernel-base/COPYING
  */
 
+#include <linux/clockchips.h>
 #include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/percpu.h>
@@ -46,7 +47,7 @@
  *
  * returns the time in ktime_t format
  */
-static ktime_t ktime_get(void)
+ktime_t ktime_get(void)
 {
 	struct timespec now;
 
@@ -60,7 +61,7 @@ static ktime_t ktime_get(void)
  *
  * returns the time in ktime_t format
  */
-static ktime_t ktime_get_real(void)
+ktime_t ktime_get_real(void)
 {
 	struct timespec now;
 
@@ -293,6 +294,7 @@ static unsigned long ktime_divns(const k
  */
 void hrtimer_notify_resume(void)
 {
+	clockevents_resume_events();
 	clock_was_set();
 }
 
Index: linux-2.6.18-mm2/kernel/time/Makefile
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time/Makefile	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time/Makefile	2006-10-02 00:55:53.000000000 +0200
@@ -1 +1,3 @@
 obj-y += ntp.o clocksource.o jiffies.o
+
+obj-$(CONFIG_GENERIC_CLOCKEVENTS) += clockevents.o
Index: linux-2.6.18-mm2/kernel/time/clockevents.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/kernel/time/clockevents.c	2006-10-02 00:55:53.000000000 +0200
@@ -0,0 +1,567 @@
+/*
+ * linux/kernel/time/clockevents.c
+ *
+ * This file contains functions which manage clock event drivers.
+ *
+ * Copyright(C) 2005-2006, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2005-2006, Red Hat, Inc., Ingo Molnar
+ *
+ * We have two types of clock event devices:
+ * - global events (one device per system)
+ * - local events (one device per cpu)
+ *
+ * We assign the various time(r) related interrupts to those devices
+ *
+ * - global tick
+ * - profiling (per cpu)
+ * - next timer events (per cpu)
+ *
+ * TODO:
+ * - implement variable frequency profiling
+ *
+ * This code is licenced under the GPL version 2. For details see
+ * kernel-base/COPYING.
+ */
+
+#include <linux/clockchips.h>
+#include <linux/cpu.h>
+#include <linux/irq.h>
+#include <linux/init.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/profile.h>
+#include <linux/sysdev.h>
+#include <linux/hrtimer.h>
+
+#define MAX_CLOCK_EVENTS	4
+#define GLOBAL_CLOCK_EVENT	MAX_CLOCK_EVENTS
+
+struct event_descr {
+	struct clock_event_device *event;
+	unsigned int mode;
+	unsigned int real_caps;
+	struct irqaction action;
+};
+
+struct local_events {
+	int installed;
+	struct event_descr events[MAX_CLOCK_EVENTS];
+	struct clock_event_device *nextevt;
+};
+
+/* Variables related to the global event device */
+static __read_mostly struct event_descr global_eventdevice;
+
+/*
+ * Lock to protect the above.
+ *
+ * Only the public management functions have to take this lock. The fast path
+ * of the framework, e.g. reprogramming the next event device is lockless as
+ * it is per cpu.
+ */
+static DEFINE_SPINLOCK(events_lock);
+
+/* Variables related to the per cpu local event devices */
+static DEFINE_PER_CPU(struct local_events, local_eventdevices);
+
+/*
+ * Math helper. Convert a latch value (device ticks) to nanoseconds
+ */
+unsigned long clockevent_delta2ns(unsigned long latch,
+				  struct clock_event_device *evt)
+{
+	u64 clc = ((u64) latch << evt->shift);
+
+	do_div(clc, evt->mult);
+	if (clc < KTIME_MONOTONIC_RES.tv64)
+		clc = KTIME_MONOTONIC_RES.tv64;
+	if (clc > LONG_MAX)
+		clc = LONG_MAX;
+
+	return (unsigned long) clc;
+}
+
+/*
+ * Bootup and lowres handler: ticks only
+ */
+static void handle_tick(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+}
+
+/*
+ * Bootup and lowres handler: ticks and update_process_times
+ */
+static void handle_tick_update(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+
+	update_process_times(user_mode(regs));
+}
+
+/*
+ * Bootup and lowres handler: ticks and profiling
+ */
+static void handle_tick_profile(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+
+	profile_tick(CPU_PROFILING, regs);
+}
+
+/*
+ * Bootup and lowres handler: ticks, update_process_times and profiling
+ */
+static void handle_tick_update_profile(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+
+	update_process_times(user_mode(regs));
+	profile_tick(CPU_PROFILING, regs);
+}
+
+/*
+ * Bootup and lowres handler: update_process_times
+ */
+static void handle_update(struct pt_regs *regs)
+{
+	update_process_times(user_mode(regs));
+}
+
+/*
+ * Bootup and lowres handler: update_process_times and profiling
+ */
+static void handle_update_profile(struct pt_regs *regs)
+{
+	update_process_times(user_mode(regs));
+	profile_tick(CPU_PROFILING, regs);
+}
+
+/*
+ * Bootup and lowres handler: profiling
+ */
+static void handle_profile(struct pt_regs *regs)
+{
+	profile_tick(CPU_PROFILING, regs);
+}
+
+/*
+ * Noop handler when we shut down an event device
+ */
+static void handle_noop(struct pt_regs *regs)
+{
+}
+
+/*
+ * Lookup table for bootup and lowres event assignment
+ *
+ * The event handler is chosen by the capability flags of the clock event
+ * device.
+ */
+static void __read_mostly *event_handlers[] = {
+	handle_noop,			/* 0: No capability selected */
+	handle_tick,			/* 1: Tick only	*/
+	handle_update,			/* 2: Update process times */
+	handle_tick_update,		/* 3: Tick + update process times */
+	handle_profile,			/* 4: Profiling int */
+	handle_tick_profile,		/* 5: Tick + Profiling int */
+	handle_update_profile,		/* 6: Update process times +
+					      profiling */
+	handle_tick_update_profile,	/* 7: Tick + update process times +
+					      profiling */
+#ifdef CONFIG_HIGH_RES_TIMERS
+	hrtimer_interrupt,		/* 8: Reprogrammable event device */
+#endif
+};
+
+/*
+ * Start up an event device
+ */
+static void startup_event(struct clock_event_device *evt, unsigned int caps)
+{
+	int mode;
+
+	if (caps == CLOCK_CAP_NEXTEVT)
+		mode = CLOCK_EVT_ONESHOT;
+	else
+		mode = CLOCK_EVT_PERIODIC;
+
+	evt->set_mode(mode, evt);
+}
+
+/*
+ * Setup an event device. Assign a handler and start it up
+ */
+static void setup_event(struct event_descr *descr,
+			struct clock_event_device *evt, unsigned int caps)
+{
+	void *handler = event_handlers[caps];
+
+	/* Set the event handler */
+	evt->event_handler = handler;
+
+	/* Store all relevant information */
+	descr->real_caps = caps;
+
+	startup_event(evt, caps);
+
+	printk(KERN_INFO "Clock event device %s configured with caps set: "
+	       "%02x\n", evt->name, descr->real_caps);
+}
+
+/**
+ * register_global_clockevent - register the device which generates
+ *			     global clock events
+ * @evt:	The device which generates global clock events (ticks)
+ *
+ * This can be a device which is only necessary for bootup. On UP systems this
+ * might be the only event device which is used for everything including
+ * high resolution events.
+ *
+ * When a cpu local event device is installed the global event device is
+ * switched off in the high resolution timer / tickless mode.
+ */
+int __init register_global_clockevent(struct clock_event_device *evt)
+{
+	/* Already installed? */
+	if (global_eventdevice.event) {
+		printk(KERN_ERR "Global clock event device already installed: "
+		       "%s. Ignoring new global event source %s\n",
+		       global_eventdevice.event->name,
+		       evt->name);
+		return -EBUSY;
+	}
+
+	/* Preset the handler in any case */
+	evt->event_handler = handle_noop;
+
+	/*
+	 * Check, whether it is a valid global event device
+	 */
+	if (!(evt->capabilities & CLOCK_BASE_CAPS_MASK)) {
+		printk(KERN_ERR "Unsupported clock event device %s\n",
+		       evt->name);
+		return -EINVAL;
+	}
+
+#ifdef CONFIG_SMP
+	/*
+	 * On UP systems the global clock event device can be used as the next
+	 * event device. On SMP this is disabled because the next event device
+	 * must be per CPU.
+	 */
+	evt->capabilities &= ~CLOCK_CAP_NEXTEVT;
+#endif
+
+	/* Mask out high resolution capabilities for now */
+	global_eventdevice.event = evt;
+	setup_event(&global_eventdevice, evt,
+		    evt->capabilities & CLOCK_BASE_CAPS_MASK);
+	return 0;
+}
+
+/*
+ * Mask out the functionality which is covered by the new event device
+ * and assign a new event handler.
+ */
+static void recalc_active_event(struct event_descr *descr,
+				unsigned int newcaps)
+{
+	unsigned int caps;
+
+	if (!descr->real_caps)
+		return;
+
+	/* Mask the overlapping bits */
+	caps = descr->real_caps & ~newcaps;
+
+	/* Assign the new event handler */
+	if (caps) {
+		descr->event->event_handler = event_handlers[caps];
+		printk(KERN_INFO "Clock event device %s new caps set: %02x\n" ,
+		       descr->event->name, caps);
+	} else {
+		descr->event->event_handler = handle_noop;
+
+		if (descr->event->set_mode)
+			descr->event->set_mode(CLOCK_EVT_SHUTDOWN,
+					       descr->event);
+
+		printk(KERN_INFO "Clock event device %s disabled\n" ,
+		       descr->event->name);
+	}
+	descr->real_caps = caps;
+}
+
+/*
+ * Recalc the events and reassign the handlers if necessary
+ *
+ * Called with events_lock held to protect the global event device.
+ */
+static int recalc_events(struct local_events *devices,
+			 struct clock_event_device *evt, unsigned int caps,
+			 int new)
+{
+	int i;
+
+	if (new && devices->installed == MAX_CLOCK_EVENTS)
+		return -ENOSPC;
+
+	/*
+	 * If there is no handler and this is not a next-event capable
+	 * event device, refuse to handle it
+	 */
+	if (!(evt->capabilities & CLOCK_CAP_NEXTEVT) && !event_handlers[caps]) {
+		printk(KERN_ERR "Unsupported clock event device %s\n",
+		       evt->name);
+		return -EINVAL;
+	}
+
+	if (caps && global_eventdevice.event && global_eventdevice.event != evt)
+		recalc_active_event(&global_eventdevice, caps);
+
+	for (i = 0; i < devices->installed; i++) {
+		if (devices->events[i].event != evt)
+			recalc_active_event(&devices->events[i], caps);
+	}
+
+	if (new)
+		devices->events[devices->installed++].event = evt;
+
+	if (caps) {
+		/* Is next_event event device going to be installed? */
+		if (caps & CLOCK_CAP_NEXTEVT)
+			caps = CLOCK_CAP_NEXTEVT;
+
+		setup_event(&devices->events[devices->installed],
+			    evt, caps);
+	} else
+		printk(KERN_INFO "Inactive clock event device %s registered\n",
+		       evt->name);
+
+	return 0;
+}
+
+/**
+ * register_local_clockevent - Set up a cpu local clock event device
+ * @evt:	event device to be registered
+ */
+int register_local_clockevent(struct clock_event_device *evt)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&events_lock, flags);
+
+	/* Preset the handler in any case */
+	evt->event_handler = handle_noop;
+
+	/* Recalc event devices and maybe reassign handlers */
+	ret = recalc_events(devices, evt,
+			    evt->capabilities & CLOCK_BASE_CAPS_MASK, 1);
+
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	/*
+	 * Trigger hrtimers, when the event device is next-event
+	 * capable
+	 */
+	if (!ret && (evt->capabilities & CLOCK_CAP_NEXTEVT))
+		hrtimer_clock_notify();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(register_local_clockevent);
+
+/*
+ * Find a next-event capable event device
+ *
+ * Called with events_lock held to protect the global event device.
+ */
+static int get_next_event_device(void)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	int i;
+
+	for (i = 0; i < devices->installed; i++) {
+		struct clock_event_device *evt;
+
+		evt = devices->events[i].event;
+		if (evt->capabilities & CLOCK_CAP_NEXTEVT)
+			return i;
+	}
+
+	if (global_eventdevice.event->capabilities & CLOCK_CAP_NEXTEVT)
+		return GLOBAL_CLOCK_EVENT;
+
+	return -ENODEV;
+}
+
+/**
+ * clockevents_next_event_available - Check for an installed next-event device
+ *
+ * Returns 1, when such a device exists, otherwise 0
+ */
+int clockevents_next_event_available(void)
+{
+	unsigned long flags;
+	int idx;
+
+	spin_lock_irqsave(&events_lock, flags);
+	idx = get_next_event_device();
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	return IS_ERR_VALUE(idx) ? 0 : 1;
+}
+
+/**
+ * clockevents_init_next_event - switch to next event (oneshot) mode
+ *
+ * Switch to one shot mode. On SMP systems the global event (tick) device is
+ * switched off. It is replaced by a hrtimer. On UP systems the global event
+ * device might be the only one and can be used as the next event device too.
+ *
+ * Returns 0 on success, otherwise an error code.
+ */
+int clockevents_init_next_event(void)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	struct clock_event_device *nextevt;
+	unsigned long flags;
+	int idx, ret = -ENODEV;
+
+	if (devices->nextevt)
+		return -EBUSY;
+
+	spin_lock_irqsave(&events_lock, flags);
+
+	idx = get_next_event_device();
+	if (idx < 0)
+		goto out_unlock;
+
+	if (idx == GLOBAL_CLOCK_EVENT)
+		nextevt = global_eventdevice.event;
+	else
+		nextevt = devices->events[idx].event;
+
+	ret = recalc_events(devices, nextevt, CLOCK_CAPS_MASK, 0);
+	if (!ret)
+		devices->nextevt = nextevt;
+ out_unlock:
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	return ret;
+}
+
+/**
+ * clockevents_set_next_event - Reprogram the clock event device.
+ * @expires:	absolute expiry time (monotonic clock)
+ * @force:	when set, enforce reprogramming, even if the event is in the
+ *		past
+ *
+ * Returns 0 on success, -ETIME when the event is in the past and force is not
+ * set.
+ */
+int clockevents_set_next_event(ktime_t expires, int force)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	int64_t delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
+	struct clock_event_device *nextevt = devices->nextevt;
+	unsigned long long clc;
+
+	if (delta <= 0 && !force)
+		return -ETIME;
+
+	if (delta > nextevt->max_delta_ns)
+		delta = nextevt->max_delta_ns;
+	if (delta < nextevt->min_delta_ns)
+		delta = nextevt->min_delta_ns;
+
+	clc = delta * nextevt->mult;
+	clc >>= nextevt->shift;
+	nextevt->set_next_event((unsigned long)clc, devices->nextevt);
+
+	return 0;
+}
+
+/*
+ * Resume the cpu local clock events
+ */
+static void clockevents_resume_local_events(void *arg)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	int i;
+
+	for (i = 0; i < devices->installed; i++) {
+		if (devices->events[i].real_caps)
+			startup_event(devices->events[i].event,
+				      devices->events[i].real_caps);
+	}
+	touch_softlockup_watchdog();
+}
+
+/**
+ * clockevents_resume_events - resume the active clock devices
+ *
+ * Called after timekeeping is functional again
+ */
+void clockevents_resume_events(void)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	/* Resume global event device */
+	if (global_eventdevice.real_caps)
+		startup_event(global_eventdevice.event,
+			      global_eventdevice.real_caps);
+
+	local_irq_restore(flags);
+
+	/* Restart the CPU local events everywhere */
+	on_each_cpu(clockevents_resume_local_events, NULL, 0, 1);
+}
+
+/*
+ * Functions related to initialization and hotplug
+ */
+static int clockevents_cpu_notify(struct notifier_block *self,
+				  unsigned long action, void *hcpu)
+{
+	switch(action) {
+	case CPU_UP_PREPARE:
+		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DEAD:
+		/*
+		 * Do something sensible here !
+		 * Disable the cpu local clock event devices ???
+		 */
+		break;
+#endif
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __devinitdata clockevents_nb = {
+	.notifier_call	= clockevents_cpu_notify,
+};
+
+void __init clockevents_init(void)
+{
+	clockevents_cpu_notify(&clockevents_nb, (unsigned long)CPU_UP_PREPARE,
+				(void *)(long)smp_processor_id());
+	register_cpu_notifier(&clockevents_nb);
+}
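
A note on the scaled math used by div_sc() and clockevents_set_next_event()
above: the conversion from nanoseconds to device ticks is a multiply and a
shift, after the delta has been clamped to the limits of the device. The
following userspace sketch shows the arithmetic with fixed width types. It is
an illustration only; the demo_* names and the 1 MHz example device are
assumptions, and the caller is assumed to keep max_delta_ns * mult below 2^64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct clock_event_device */
struct demo_clock_dev {
	uint64_t max_delta_ns;	/* longest programmable delta */
	uint64_t min_delta_ns;	/* shortest programmable delta */
	uint64_t mult;		/* nanoseconds to ticks multiplier */
	unsigned int shift;	/* nanoseconds to ticks divisor (power of two) */
};

/* factor = (ticks << shift) / nsec, the same equation as div_sc() */
static uint64_t demo_div_sc(uint64_t ticks, uint64_t nsec, unsigned int shift)
{
	assert(nsec != 0);
	return (ticks << shift) / nsec;
}

/* Clamp the delta to the device limits, then convert it to device ticks */
static uint64_t demo_ns_to_ticks(const struct demo_clock_dev *dev,
				 int64_t delta_ns)
{
	uint64_t delta = delta_ns < 0 ? 0 : (uint64_t)delta_ns;

	if (delta > dev->max_delta_ns)
		delta = dev->max_delta_ns;
	if (delta < dev->min_delta_ns)
		delta = dev->min_delta_ns;

	/* ticks = (nanoseconds * mult) >> shift, rounding down */
	return (delta * dev->mult) >> dev->shift;
}
```

For a device running at 1 MHz (10^6 ticks per 10^9 ns) and shift = 32 the
factor is 4294967, and a delta of 1ms converts to 999 ticks rather than 1000
because both the factor and the final shift round down.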

--



* [patch 15/21] clockevents: drivers for i386
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (13 preceding siblings ...)
  2006-10-01 23:01 ` [patch 14/21] clockevents: core Thomas Gleixner
@ 2006-10-01 23:01 ` Thomas Gleixner
  2006-10-01 23:01 ` [patch 16/21] high-res timers: core Thomas Gleixner
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: clockevents-i386.patch --]
[-- Type: text/plain, Size: 18380 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add clockevent drivers for i386: lapic (local) and PIT (global).
Update the timer IRQ to call into the PIT driver's event handler
and the lapic-timer IRQ to call into the lapic clockevent driver.
The assignment of timer functionality is delegated to the core
framework code and replaces the compile and runtime evaluation in
do_timer_interrupt_hook().
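
The effect on the interrupt path can be sketched as follows: the low level
timer interrupt no longer decides what to do, it only calls whatever handler
the framework has assigned to the device. This is a userspace illustration
with made-up demo_* names, not the i386 code from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pt_regs */
struct demo_regs {
	int user_mode;
};

/* Hypothetical, reduced stand-in for struct clock_event_device */
struct demo_event_device {
	const char *name;
	void (*event_handler)(struct demo_regs *regs);
};

static unsigned int demo_tick_count;

static void demo_handle_noop(struct demo_regs *regs)
{
	(void)regs;
}

static void demo_handle_tick(struct demo_regs *regs)
{
	(void)regs;
	demo_tick_count++;
}

/* The handler is preset to the noop handler, as at registration time */
static struct demo_event_device demo_pit = { "pit", demo_handle_noop };

/* Stand-in for the framework assigning (or withdrawing) functionality */
static void demo_assign_handler(struct demo_event_device *evt,
				void (*handler)(struct demo_regs *regs))
{
	evt->event_handler = handler ? handler : demo_handle_noop;
}

/* Stand-in for the architecture timer interrupt: no flag evaluation here */
static void demo_timer_interrupt(struct demo_regs *regs)
{
	assert(demo_pit.event_handler != NULL);
	demo_pit.event_handler(regs);
}
```

Because the device always carries a valid handler, the interrupt code needs
neither #ifdefs nor runtime checks when the framework shuts a device down or
moves functionality to another device.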

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 arch/i386/Kconfig                        |    4 +
 arch/i386/kernel/apic.c                  |  117 ++++++++++++++++++++++++++-----
 arch/i386/kernel/i8253.c                 |   94 ++++++++++++++++++++++--
 arch/i386/kernel/time.c                  |   45 -----------
 include/asm-i386/i8253.h                 |   18 ++++
 include/asm-i386/mach-default/do_timer.h |   27 +------
 include/asm-i386/mach-visws/do_timer.h   |    2 
 include/asm-i386/mach-voyager/do_timer.h |   15 ++-
 8 files changed, 225 insertions(+), 97 deletions(-)

Index: linux-2.6.18-mm2/arch/i386/Kconfig
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/Kconfig	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/Kconfig	2006-10-02 00:55:53.000000000 +0200
@@ -18,6 +18,10 @@ config GENERIC_TIME
 	bool
 	default y
 
+config GENERIC_CLOCKEVENTS
+	bool
+	default y
+
 config LOCKDEP_SUPPORT
 	bool
 	default y
Index: linux-2.6.18-mm2/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/apic.c	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/apic.c	2006-10-02 00:55:53.000000000 +0200
@@ -25,6 +25,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/sysdev.h>
 #include <linux/cpu.h>
+#include <linux/clockchips.h>
 #include <linux/module.h>
 
 #include <asm/atomic.h>
@@ -70,6 +71,25 @@ static inline void lapic_enable(void)
  */
 int apic_verbosity;
 
+static unsigned int calibration_result;
+
+static void lapic_next_event(unsigned long delta,
+			     struct clock_event_device *evt);
+static void lapic_timer_setup(enum clock_event_mode mode,
+			      struct clock_event_device *evt);
+
+/*
+ * The local apic timer can be used for any function which is CPU local.
+ */
+static struct clock_event_device lapic_clockevent = {
+	.name = "lapic",
+	.capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
+			| CLOCK_CAP_UPDATE,
+	.shift = 32,
+	.set_mode = lapic_timer_setup,
+	.set_next_event = lapic_next_event,
+};
+static DEFINE_PER_CPU(struct clock_event_device, lapic_events);
 
 static void apic_pm_activate(void);
 
@@ -919,6 +939,11 @@ fake_ioapic_page:
  */
 
 /*
+ * FIXME: Move this to i8253.h. There is no need to keep the access to
+ * the PIT scattered all around the place -tglx
+ */
+
+/*
  * The timer chip is already set up at HZ interrupts per second here,
  * but we do not accept timer interrupts yet. We only allow the BP
  * to calibrate.
@@ -976,13 +1001,15 @@ void (*wait_timer_tick)(void) __devinitd
 
 #define APIC_DIVISOR 16
 
-static void __setup_APIC_LVTT(unsigned int clocks)
+static void __setup_APIC_LVTT(unsigned int clocks, int oneshot)
 {
 	unsigned int lvtt_value, tmp_value, ver;
 	int cpu = smp_processor_id();
 
 	ver = GET_APIC_VERSION(apic_read(APIC_LVR));
-	lvtt_value = APIC_LVT_TIMER_PERIODIC | LOCAL_TIMER_VECTOR;
+	lvtt_value = LOCAL_TIMER_VECTOR;
+	if (!oneshot)
+		lvtt_value |= APIC_LVT_TIMER_PERIODIC;
 	if (!APIC_INTEGRATED(ver))
 		lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
 
@@ -999,23 +1026,43 @@ static void __setup_APIC_LVTT(unsigned i
 				& ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE))
 				| APIC_TDR_DIV_16);
 
-	apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR);
+	if (!oneshot)
+		apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR);
 }
 
-static void __devinit setup_APIC_timer(unsigned int clocks)
+static void lapic_next_event(unsigned long delta,
+			     struct clock_event_device *evt)
+{
+	apic_write_around(APIC_TMICT, delta);
+}
+
+static void lapic_timer_setup(enum clock_event_mode mode,
+			      struct clock_event_device *evt)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
+	if (mode == CLOCK_EVT_PERIODIC) {
+		/*
+		 * Wait for IRQ0's slice:
+		 */
+		wait_timer_tick();
+	}
+	__setup_APIC_LVTT(calibration_result, mode != CLOCK_EVT_PERIODIC);
+	local_irq_restore(flags);
+}
 
-	/*
-	 * Wait for IRQ0's slice:
-	 */
-	wait_timer_tick();
+/*
+ * Setup the local APIC timer for this CPU. Copy the initialized values
+ * of the boot CPU and register the clock event in the framework.
+ */
+static void __devinit setup_APIC_timer(void)
+{
+	struct clock_event_device *levt = &__get_cpu_var(lapic_events);
 
-	__setup_APIC_LVTT(clocks);
+	memcpy(levt, &lapic_clockevent, sizeof(*levt));
 
-	local_irq_restore(flags);
+	register_local_clockevent(levt);
 }
 
 /*
@@ -1024,6 +1071,8 @@ static void __devinit setup_APIC_timer(u
  * to calibrate, since some later bootup code depends on getting
  * the first irq? Ugh.
  *
+ * TODO: Fix this rather than saying "Ugh" -tglx
+ *
  * We want to do the calibration only once since we
  * want to have local timer irqs syncron. CPUs connected
  * by the same APIC bus have the very same bus frequency.
@@ -1046,7 +1095,7 @@ static int __init calibrate_APIC_clock(v
 	 * value into the APIC clock, we just want to get the
 	 * counter running for calibration.
 	 */
-	__setup_APIC_LVTT(1000000000);
+	__setup_APIC_LVTT(1000000000, 0);
 
 	/*
 	 * The timer chip counts down to zero. Let's wait
@@ -1083,6 +1132,14 @@ static int __init calibrate_APIC_clock(v
 
 	result = (tt1-tt2)*APIC_DIVISOR/LOOPS;
 
+	/* Calculate the scaled math multiplication factor */
+	lapic_clockevent.mult = div_sc(tt1-tt2, TICK_NSEC * LOOPS, 32);
+	lapic_clockevent.max_delta_ns =
+		clockevent_delta2ns(0x7FFFFF, &lapic_clockevent);
+	printk("lapic max_delta_ns: %ld\n", lapic_clockevent.max_delta_ns);
+	lapic_clockevent.min_delta_ns =
+		clockevent_delta2ns(0xF, &lapic_clockevent);
+
 	if (cpu_has_tsc)
 		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
 			"%ld.%04ld MHz.\n",
@@ -1097,8 +1154,6 @@ static int __init calibrate_APIC_clock(v
 	return result;
 }
 
-static unsigned int calibration_result;
-
 void __init setup_boot_APIC_clock(void)
 {
 	unsigned long flags;
@@ -1111,14 +1166,14 @@ void __init setup_boot_APIC_clock(void)
 	/*
 	 * Now set up the timer for real.
 	 */
-	setup_APIC_timer(calibration_result);
+	setup_APIC_timer();
 
 	local_irq_restore(flags);
 }
 
 void __devinit setup_secondary_APIC_clock(void)
 {
-	setup_APIC_timer(calibration_result);
+	setup_APIC_timer();
 }
 
 void disable_APIC_timer(void)
@@ -1164,6 +1219,35 @@ void switch_APIC_timer_to_ipi(void *cpum
 	    !cpu_isset(cpu, timer_bcast_ipi)) {
 		disable_APIC_timer();
 		cpu_set(cpu, timer_bcast_ipi);
+#ifdef CONFIG_HIGH_RES_TIMERS
+		/*
+		 * C3 stops the local apic timer. We can not make high
+		 * resolution timers and dynamic ticks work with one global
+		 * timer. Disable the NEXTEVT capability, so high resolution /
+		 * dyntick mode gets disabled too.
+		 *
+		 * There is a solution for this problem, but this is beyond the
+		 * scope of this initial patchset:
+		 *
+		 * When the local apic timer is unusable in C3, then we can
+		 * utilize the PIT to provide a global wakeup, which can be
+		 * directed to the CPU which has the earliest wakeup
+		 * point. Once the CPU is up again, the local apic is resumed
+		 * and can be used for the per cpu clock events again. It's not
+		 * hard to provide the infrastructure, but I need more insight
+		 * into the ACPI code to get it right.
+		 *
+		 * Disable the highres/dyntick feature in this case for now,
+		 * until somebody beats the ACPI clue into me. :)
+		 *
+		 *	tglx
+		 */
+		printk("Disabling NO_HZ and high resolution timers "
+		       "due to timer broadcasting (C3 stops local apic)\n");
+		for_each_possible_cpu(cpu)
+			per_cpu(lapic_events, cpu).capabilities &=
+				~CLOCK_CAP_NEXTEVT;
+#endif
 	}
 }
 EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
@@ -1224,6 +1308,7 @@ inline void smp_local_timer_interrupt(st
 fastcall void smp_apic_timer_interrupt(struct pt_regs *regs)
 {
 	int cpu = smp_processor_id();
+	struct clock_event_device *evt = &per_cpu(lapic_events, cpu);
 
 	/*
 	 * the NMI deadlock-detector uses this.
@@ -1241,7 +1326,7 @@ fastcall void smp_apic_timer_interrupt(s
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
 	irq_enter();
-	smp_local_timer_interrupt(regs);
+	evt->event_handler(regs);
 	irq_exit();
 }
 
Index: linux-2.6.18-mm2/arch/i386/kernel/i8253.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/i8253.c	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/i8253.c	2006-10-02 00:55:53.000000000 +0200
@@ -2,7 +2,7 @@
  * i8253.c  8253/PIT functions
  *
  */
-#include <linux/clocksource.h>
+#include <linux/clockchips.h>
 #include <linux/spinlock.h>
 #include <linux/jiffies.h>
 #include <linux/sysdev.h>
@@ -19,20 +19,98 @@
 DEFINE_SPINLOCK(i8253_lock);
 EXPORT_SYMBOL(i8253_lock);
 
-void setup_pit_timer(void)
+#ifdef CONFIG_HPET_TIMER
+/*
+ * HPET replaces the PIT when enabled, so we need to know which of
+ * the two timers is used.
+ */
+struct clock_event_device *global_clock_event;
+#endif
+
+/*
+ * Initialize the PIT timer.
+ *
+ * This is also called after resume to bring the PIT into operation again.
+ */
+static void init_pit_timer(enum clock_event_mode mode,
+			   struct clock_event_device *evt)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&i8253_lock, flags);
+
+	switch(mode) {
+	case CLOCK_EVT_PERIODIC:
+		/* binary, mode 2, LSB/MSB, ch 0 */
+		outb_p(0x34, PIT_MODE);
+		udelay(10);
+		outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
+		outb(LATCH >> 8 , PIT_CH0);	/* MSB */
+		break;
+
+	case CLOCK_EVT_ONESHOT:
+	case CLOCK_EVT_SHUTDOWN:
+		/* One shot setup */
+		outb_p(0x38, PIT_MODE);
+		udelay(10);
+		break;
+	}
+	spin_unlock_irqrestore(&i8253_lock, flags);
+}
+
+/*
+ * Program the next event in oneshot mode
+ *
+ * Delta is given in PIT ticks
+ */
+static void pit_next_event(unsigned long delta, struct clock_event_device *evt)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&i8253_lock, flags);
-	outb_p(0x34,PIT_MODE);		/* binary, mode 2, LSB/MSB, ch 0 */
-	udelay(10);
-	outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
-	udelay(10);
-	outb(LATCH >> 8 , PIT_CH0);	/* MSB */
+	outb_p(delta & 0xff , PIT_CH0);	/* LSB */
+	outb(delta >> 8 , PIT_CH0);	/* MSB */
 	spin_unlock_irqrestore(&i8253_lock, flags);
 }
 
 /*
+ * On UP the PIT can serve all of the possible timer functions. On SMP systems
+ * it can be solely used for the global tick.
+ *
+ * The profiling and update capabilities are switched off once the local apic is
+ * registered. This mechanism replaces the previous #ifdef LOCAL_APIC -
+ * !using_apic_timer decisions in do_timer_interrupt_hook()
+ */
+struct clock_event_device pit_clockevent = {
+	.name		= "pit",
+	.capabilities	= CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | CLOCK_CAP_UPDATE
+#ifndef CONFIG_SMP
+			| CLOCK_CAP_NEXTEVT
+#endif
+	,
+	.set_mode	= init_pit_timer,
+	.set_next_event = pit_next_event,
+	.shift		= 32,
+};
+
+/*
+ * Initialize the conversion factor and the min/max deltas of the clock event
+ * structure and register the clock event source with the framework.
+ */
+void __init setup_pit_timer(void)
+{
+	pit_clockevent.mult = div_sc(CLOCK_TICK_RATE, NSEC_PER_SEC, 32);
+	pit_clockevent.max_delta_ns =
+		clockevent_delta2ns(0x7FFF, &pit_clockevent);
+	pit_clockevent.min_delta_ns =
+		clockevent_delta2ns(0xF, &pit_clockevent);
+	register_global_clockevent(&pit_clockevent);
+#ifdef CONFIG_HPET_TIMER
+	global_clock_event = &pit_clockevent;
+#endif
+}
+
+/*
  * Since the PIT overflows every tick, its not very useful
  * to just read by itself. So use jiffies to emulate a free
  * running counter:
@@ -46,7 +124,7 @@ static cycle_t pit_read(void)
 	static u32 old_jifs;
 
 	spin_lock_irqsave(&i8253_lock, flags);
-        /*
+	/*
 	 * Although our caller may have the read side of xtime_lock,
 	 * this is now a seqlock, and we are cheating in this routine
 	 * by having side effects on state that we cannot undo if
Index: linux-2.6.18-mm2/arch/i386/kernel/time.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/time.c	2006-10-02 00:55:50.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/time.c	2006-10-02 00:55:53.000000000 +0200
@@ -163,15 +163,6 @@ EXPORT_SYMBOL(profile_pc);
  */
 irqreturn_t timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
 {
-	/*
-	 * Here we are in the timer irq handler. We just have irqs locally
-	 * disabled but we don't know if the timer_bh is running on the other
-	 * CPU. We need to avoid to SMP race with it. NOTE: we don' t need
-	 * the irq version of write_lock because as just said we have irq
-	 * locally disabled. -arca
-	 */
-	write_seqlock(&xtime_lock);
-
 #ifdef CONFIG_X86_IO_APIC
 	if (timer_ack) {
 		/*
@@ -190,7 +181,6 @@ irqreturn_t timer_interrupt(int irq, voi
 
 	do_timer_interrupt_hook(regs);
 
-
 	if (MCA_bus) {
 		/* The PS/2 uses level-triggered interrupts.  You can't
 		turn them off, nor would you want to (any attempt to
@@ -205,8 +195,6 @@ irqreturn_t timer_interrupt(int irq, voi
 		outb_p( irq|0x80, 0x61 );	/* reset the IRQ */
 	}
 
-	write_sequnlock(&xtime_lock);
-
 #ifdef CONFIG_X86_LOCAL_APIC
 	if (using_apic_timer)
 		smp_send_timer_broadcast_ipi(regs);
@@ -283,39 +271,6 @@ void notify_arch_cmos_timer(void)
 	mod_timer(&sync_cmos_timer, jiffies + 1);
 }
 
-static int timer_resume(struct sys_device *dev)
-{
-#ifdef CONFIG_HPET_TIMER
-	if (is_hpet_enabled())
-		hpet_reenable();
-#endif
-	setup_pit_timer();
-
-	return 0;
-}
-
-static struct sysdev_class timer_sysclass = {
-	.resume = timer_resume,
-	set_kset_name("timer"),
-};
-
-
-/* XXX this driverfs stuff should probably go elsewhere later -john */
-static struct sys_device device_timer = {
-	.id	= 0,
-	.cls	= &timer_sysclass,
-};
-
-static int time_init_device(void)
-{
-	int error = sysdev_class_register(&timer_sysclass);
-	if (!error)
-		error = sysdev_register(&device_timer);
-	return error;
-}
-
-device_initcall(time_init_device);
-
 #ifdef CONFIG_HPET_TIMER
 extern void (*late_time_init)(void);
 /* Duplicate of time_init() below, with hpet_enable part added */
Index: linux-2.6.18-mm2/include/asm-i386/i8253.h
===================================================================
--- linux-2.6.18-mm2.orig/include/asm-i386/i8253.h	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/include/asm-i386/i8253.h	2006-10-02 00:55:53.000000000 +0200
@@ -3,4 +3,22 @@
 
 extern spinlock_t i8253_lock;
 
+#ifdef CONFIG_HPET_TIMER
+extern struct clock_event_device *global_clock_event;
+#else
+extern struct clock_event_device pit_clockevent;
+# define global_clock_event (&pit_clockevent)
+#endif
+
+/**
+ * pit_interrupt_hook - hook into timer tick
+ * @regs:	standard registers from interrupt
+ *
+ * Call the global clock event handler.
+ **/
+static inline void pit_interrupt_hook(struct pt_regs *regs)
+{
+	global_clock_event->event_handler(regs);
+}
+
 #endif	/* __ASM_I8253_H__ */
Index: linux-2.6.18-mm2/include/asm-i386/mach-default/do_timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/asm-i386/mach-default/do_timer.h	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/include/asm-i386/mach-default/do_timer.h	2006-10-02 00:55:53.000000000 +0200
@@ -1,39 +1,20 @@
 /* defines for inline arch setup functions */
+#include <linux/clockchips.h>
 
-#include <asm/apic.h>
 #include <asm/i8259.h>
+#include <asm/i8253.h>
 
 /**
  * do_timer_interrupt_hook - hook into timer tick
  * @regs:	standard registers from interrupt
  *
- * Description:
- *	This hook is called immediately after the timer interrupt is ack'd.
- *	It's primary purpose is to allow architectures that don't possess
- *	individual per CPU clocks (like the CPU APICs supply) to broadcast the
- *	timer interrupt as a means of triggering reschedules etc.
+ * Call the pit clock event handler. see asm/i8253.h
  **/
-
 static inline void do_timer_interrupt_hook(struct pt_regs *regs)
 {
-	do_timer(1);
-#ifndef CONFIG_SMP
-	update_process_times(user_mode_vm(regs));
-#endif
-/*
- * In the SMP case we use the local APIC timer interrupt to do the
- * profiling, except when we simulate SMP mode on a uniprocessor
- * system, in that case we have to call the local interrupt handler.
- */
-#ifndef CONFIG_X86_LOCAL_APIC
-	profile_tick(CPU_PROFILING, regs);
-#else
-	if (!using_apic_timer)
-		smp_local_timer_interrupt(regs);
-#endif
+	pit_interrupt_hook(regs);
 }
 
-
 /* you can safely undefine this if you don't have the Neptune chipset */
 
 #define BUGGY_NEPTUN_TIMER
Index: linux-2.6.18-mm2/include/asm-i386/mach-visws/do_timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/asm-i386/mach-visws/do_timer.h	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/include/asm-i386/mach-visws/do_timer.h	2006-10-02 00:55:53.000000000 +0200
@@ -9,7 +9,9 @@ static inline void do_timer_interrupt_ho
 	/* Clear the interrupt */
 	co_cpu_write(CO_CPU_STAT,co_cpu_read(CO_CPU_STAT) & ~CO_STAT_TIMEINTR);
 
+	write_seqlock(&xtime_lock);
 	do_timer(1);
+	write_sequnlock(&xtime_lock);
 #ifndef CONFIG_SMP
 	update_process_times(user_mode_vm(regs));
 #endif
Index: linux-2.6.18-mm2/include/asm-i386/mach-voyager/do_timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/asm-i386/mach-voyager/do_timer.h	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/include/asm-i386/mach-voyager/do_timer.h	2006-10-02 00:55:53.000000000 +0200
@@ -1,13 +1,18 @@
 /* defines for inline arch setup functions */
+#include <linux/clockchips.h>
+
 #include <asm/voyager.h>
+#include <asm/i8253.h>
 
+/**
+ * do_timer_interrupt_hook - hook into timer tick
+ * @regs:	standard registers from interrupt
+ *
+ * Call the pit clock event handler. see asm/i8253.h
+ **/
 static inline void do_timer_interrupt_hook(struct pt_regs *regs)
 {
-	do_timer(1);
-#ifndef CONFIG_SMP
-	update_process_times(user_mode_vm(regs));
-#endif
-
+	pit_interrupt_hook(regs);
 	voyager_timer_interrupt(regs);
 }
 

--


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 16/21] high-res timers: core
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (14 preceding siblings ...)
  2006-10-01 23:01 ` [patch 15/21] clockevents: drivers for i386 Thomas Gleixner
@ 2006-10-01 23:01 ` Thomas Gleixner
  2006-10-02 11:50   ` Paulo Marques
  2006-10-01 23:01 ` [patch 17/21] dynticks: core Thomas Gleixner
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-highres.patch --]
[-- Type: text/plain, Size: 35123 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add the core bits of high-res timers support.

The design makes use of the existing hrtimers subsystem which manages a
per-CPU and per-clock tree of timers, and the clockevents framework, which
provides a standard API to request programmable clock events from. The
core code does not have to know about the clock details - it makes use
of clockevents_set_next_event().
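
As a rough illustration of the reprogramming idea (a userspace sketch, not
the kernel code): the next event is the earliest first-expiring timer over
all clock bases of a CPU, after converting each expiry to the monotonic
base. All sketch_* names are hypothetical stand-ins, and times are plain
nanosecond integers instead of ktime_t.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_KTIME_MAX INT64_MAX

/* Hypothetical stand-in for one per-CPU clock base */
struct sketch_clock_base {
	int has_timer;		/* non-zero if a timer is queued */
	int64_t first_expiry;	/* earliest expiry on this clock, in ns */
	int64_t offset;		/* offset of this clock to monotonic, in ns */
};

/*
 * Return the earliest expiry over all bases in monotonic ns, or
 * SKETCH_KTIME_MAX when no timer is queued at all.
 */
static int64_t sketch_next_event(const struct sketch_clock_base *bases,
				 size_t n)
{
	int64_t next = SKETCH_KTIME_MAX;
	size_t i;

	for (i = 0; i < n; i++) {
		int64_t expires;

		if (!bases[i].has_timer)
			continue;
		/* Convert to the monotonic base before comparing */
		expires = bases[i].first_expiry - bases[i].offset;
		if (expires < next)
			next = expires;
	}
	return next;
}
```

A caller would only program the clock event device when the returned value
differs from SKETCH_KTIME_MAX, i.e. when at least one timer is queued.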

Once the prerequisites for high resolution mode (a continuous time source for
time keeping and a reprogrammable clock event device) are available, the
hrtimer code is switched to high resolution mode. The per-cpu clock event
devices are switched into one shot mode and, on SMP systems, a global clock
event device (e.g. PIT on i386), if available, is switched off.
The periodic tick, which updates jiffies and calls update_process_times
and profiling, is provided by a per-cpu hrtimer. The callback function is
executed in the timer interrupt context. The hrtimer based implementation
of the periodic tick is designed to be extended with dynamic tick
functionality.
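
To illustrate the tick emulation (again a userspace sketch under stated
assumptions, not the kernel code): when the tick callback runs late, all
whole tick periods which elapsed since the last update have to be accounted
at once. The sketch_* names are hypothetical, locking is left out, and the
period stands in for NSEC_PER_SEC / HZ.

```c
#include <stdint.h>

/* Hypothetical tick bookkeeping, all times in ns */
struct sketch_tick_state {
	int64_t last_update;	/* time of the last jiffies update */
	int64_t period;		/* length of one tick */
	uint64_t jiffies;	/* emulated tick counter */
};

/*
 * Account for every whole tick period which elapsed up to 'now'.
 * Returns the number of ticks added.
 */
static uint64_t sketch_update_jiffies(struct sketch_tick_state *s,
				      int64_t now)
{
	int64_t delta = now - s->last_update;
	uint64_t ticks;

	/* Less than one period elapsed: nothing to do */
	if (delta < s->period)
		return 0;

	ticks = (uint64_t)(delta / s->period);
	/* Advance by whole periods only, so no time is lost */
	s->last_update += (int64_t)ticks * s->period;
	s->jiffies += ticks;
	return ticks;
}
```

Advancing last_update by whole periods rather than setting it to 'now'
keeps the remainder for the next update, so the counter does not drift.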

The impact on non-high-res architectures is intended to be minimal.

More detailed information is available in Documentation/hrtimer/highres.txt

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hrtimer.h   |  100 ++++++-
 include/linux/interrupt.h |    5 
 include/linux/ktime.h     |    3 
 kernel/hrtimer.c          |  652 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/itimer.c           |    2 
 kernel/posix-timers.c     |    2 
 kernel/time/Kconfig       |   22 +
 kernel/timer.c            |    1 
 8 files changed, 753 insertions(+), 34 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-10-02 00:55:53.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-10-02 00:55:54.000000000 +0200
@@ -17,6 +17,7 @@
 
 #include <linux/rbtree.h>
 #include <linux/ktime.h>
+#include <linux/timer.h>
 #include <linux/init.h>
 #include <linux/list.h>
 #include <linux/wait.h>
@@ -41,6 +42,23 @@ enum hrtimer_restart {
 };
 
 /*
+ * hrtimer callback modes:
+ *
+ *	HRTIMER_CB_SOFTIRQ:		Callback must run in softirq context
+ *	HRTIMER_CB_IRQSAFE:		Callback may run in hardirq context
+ *	HRTIMER_CB_IRQSAFE_NO_RESTART:	Callback may run in hardirq context and
+ *					does not restart the timer
+ *	HRTIMER_CB_IRQSAFE_NO_SOFTIRQ:	Callback must run in hardirq context
+ *					Special mode for tick emulation
+ */
+enum hrtimer_cb_mode {
+	HRTIMER_CB_SOFTIRQ,
+	HRTIMER_CB_IRQSAFE,
+	HRTIMER_CB_IRQSAFE_NO_RESTART,
+	HRTIMER_CB_IRQSAFE_NO_SOFTIRQ,
+};
+
+/*
  * Bit values to track state of the timer
  *
  * Possible states:
@@ -50,6 +68,7 @@ enum hrtimer_restart {
  * 0x02		callback function running
  * 0x03		callback function running and enqueued
  *		(was requeued on another CPU)
+ * 0x04		callback pending (high resolution mode)
  *
  * The "callback function running and enqueued" status is only possible on
  * SMP. It happens for example when a posix timer expired and the callback
@@ -67,6 +86,7 @@ enum hrtimer_restart {
 #define HRTIMER_STATE_INACTIVE	0x00
 #define HRTIMER_STATE_ENQUEUED	0x01
 #define HRTIMER_STATE_CALLBACK	0x02
+#define HRTIMER_STATE_PENDING	0x04
 
 /**
  * struct hrtimer - the basic hrtimer structure
@@ -77,6 +97,9 @@ enum hrtimer_restart {
  * @function:	timer expiry callback function
  * @base:	pointer to the timer base (per cpu and per clock)
  * @state:	state information (See bit values above)
+ * @cb_mode:	high resolution timer feature to select the callback execution
+ *		 mode
+ * @cb_entry:	list head to enqueue an expired timer into the callback list
  *
  * The hrtimer structure must be initialized by init_hrtimer_#CLOCKTYPE()
  */
@@ -86,6 +109,10 @@ struct hrtimer {
 	enum hrtimer_restart		(*function)(struct hrtimer *);
 	struct hrtimer_clock_base	*base;
 	unsigned long			state;
+#ifdef CONFIG_HIGH_RES_TIMERS
+	enum hrtimer_cb_mode		cb_mode;
+	struct list_head		cb_entry;
+#endif
 };
 
 /**
@@ -110,6 +137,9 @@ struct hrtimer_sleeper {
  * @get_time:		function to retrieve the current time of the clock
  * @get_softirq_time:	function to retrieve the current time from the softirq
  * @softirq_time:	the time when running the hrtimer queue in the softirq
+ * @cb_pending:		list of timers where the callback is pending
+ * @offset:		offset of this clock to the monotonic base
+ * @reprogram:		function to reprogram the timer event
  */
 struct hrtimer_clock_base {
 	struct hrtimer_cpu_base	*cpu_base;
@@ -120,6 +150,12 @@ struct hrtimer_clock_base {
 	ktime_t			(*get_time)(void);
 	ktime_t			(*get_softirq_time)(void);
 	ktime_t			softirq_time;
+#ifdef CONFIG_HIGH_RES_TIMERS
+	ktime_t			offset;
+	int			(*reprogram)(struct hrtimer *t,
+					     struct hrtimer_clock_base *b,
+					     ktime_t n);
+#endif
 };
 
 #define HRTIMER_MAX_CLOCK_BASES 2
@@ -131,20 +167,80 @@ struct hrtimer_clock_base {
  * @lock_key:		the lock_class_key for use with lockdep
  * @clock_base:		array of clock bases for this cpu
  * @curr_timer:		the timer which is executing a callback right now
+ * @expires_next:	absolute time of the next event which was scheduled
+ *			via clockevents_set_next_event()
+ * @hres_active:	State of high resolution mode
+ * @check_clocks:	Indicator. When set, evaluate whether the time source and
+ *			clock event devices allow high resolution mode to be
+ *			activated.
+ * @cb_pending:		Expired timers are moved from the rbtree to this
+ *			list in the timer interrupt. The list is processed
+ *			in the softirq.
+ * @sched_timer:	hrtimer to schedule the periodic tick in high
+ *			resolution mode
+ * @sched_regs:		Temporary storage for pt_regs for the sched_timer
+ *			callback
  */
 struct hrtimer_cpu_base {
 	spinlock_t			lock;
 	struct lock_class_key		lock_key;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
+#ifdef CONFIG_HIGH_RES_TIMERS
+	ktime_t				expires_next;
+	int				hres_active;
+	unsigned long			check_clocks;
+	struct list_head		cb_pending;
+	struct hrtimer			sched_timer;
+	struct pt_regs			*sched_regs;
+#endif
 };
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+
+extern void hrtimer_clock_notify(void);
+extern void clock_was_set(void);
+extern void hrtimer_interrupt(struct pt_regs *regs);
+
+/*
+ * In high resolution mode the time reference must be read accurately
+ */
+static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer)
+{
+	return timer->base->get_time();
+}
+
+/*
+ * The resolution of the clocks. The resolution value is returned in
+ * the clock_getres() system call to give application programmers an
+ * idea of the (in)accuracy of timers. Timer values are rounded up to
+ * this resolution value.
+ */
+# define KTIME_HIGH_RES		(ktime_t) { .tv64 = CONFIG_HIGH_RES_RESOLUTION }
+# define KTIME_MONOTONIC_RES	KTIME_HIGH_RES
+
+#else
+
+# define KTIME_MONOTONIC_RES	KTIME_LOW_RES
+
 /*
  * clock_was_set() is a NOP for non- high-resolution systems. The
  * time-sorted order guarantees that a timer does not expire early and
  * is expired in the next softirq when the clock was advanced.
  */
-#define clock_was_set()		do { } while (0)
-#define hrtimer_clock_notify()	do { } while (0)
+static inline void clock_was_set(void) { }
+static inline void hrtimer_clock_notify(void) { }
+
+/*
+ * In non high resolution mode the time reference is taken from
+ * the base softirq time variable.
+ */
+static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer)
+{
+	return timer->base->softirq_time;
+}
+
+#endif
+
 extern ktime_t ktime_get(void);
 extern ktime_t ktime_get_real(void);
 
Index: linux-2.6.18-mm2/include/linux/interrupt.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/interrupt.h	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/interrupt.h	2006-10-02 00:55:54.000000000 +0200
@@ -235,7 +235,10 @@ enum
 	NET_TX_SOFTIRQ,
 	NET_RX_SOFTIRQ,
 	BLOCK_SOFTIRQ,
-	TASKLET_SOFTIRQ
+	TASKLET_SOFTIRQ,
+#ifdef CONFIG_HIGH_RES_TIMERS
+	HRTIMER_SOFTIRQ,
+#endif
 };
 
 /* softirq mask and active fields moved to irq_cpustat_t in
Index: linux-2.6.18-mm2/include/linux/ktime.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/ktime.h	2006-10-02 00:55:46.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/ktime.h	2006-10-02 00:55:54.000000000 +0200
@@ -261,8 +261,7 @@ static inline u64 ktime_to_ns(const ktim
  * idea of the (in)accuracy of timers. Timer values are rounded up to
  * this resolution values.
  */
-#define KTIME_REALTIME_RES	(ktime_t){ .tv64 = TICK_NSEC }
-#define KTIME_MONOTONIC_RES	(ktime_t){ .tv64 = TICK_NSEC }
+#define KTIME_LOW_RES		(ktime_t){ .tv64 = TICK_NSEC }
 
 /* Get the monotonic time in timespec format: */
 extern void ktime_get_ts(struct timespec *ts);
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:53.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:54.000000000 +0200
@@ -38,7 +38,11 @@
 #include <linux/hrtimer.h>
 #include <linux/notifier.h>
 #include <linux/syscalls.h>
+#include <linux/kallsyms.h>
 #include <linux/interrupt.h>
+#include <linux/clockchips.h>
+#include <linux/profile.h>
+#include <linux/seq_file.h>
 
 #include <asm/uaccess.h>
 
@@ -81,7 +85,7 @@ EXPORT_SYMBOL_GPL(ktime_get_real);
  * This ensures that we capture erroneous accesses to these clock ids
  * rather than moving them into the range of valid clock id's.
  */
-static DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
+DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
 {
 
 	.clock_base =
@@ -89,12 +93,12 @@ static DEFINE_PER_CPU(struct hrtimer_cpu
 		{
 			.index = CLOCK_REALTIME,
 			.get_time = &ktime_get_real,
-			.resolution = KTIME_REALTIME_RES,
+			.resolution = KTIME_LOW_RES,
 		},
 		{
 			.index = CLOCK_MONOTONIC,
 			.get_time = &ktime_get,
-			.resolution = KTIME_MONOTONIC_RES,
+			.resolution = KTIME_LOW_RES,
 		},
 	}
 };
@@ -228,7 +232,7 @@ lock_hrtimer_base(const struct hrtimer *
 	return base;
 }
 
-#define switch_hrtimer_base(t, b)	(b)
+# define switch_hrtimer_base(t, b)	(b)
 
 #endif	/* !CONFIG_SMP */
 
@@ -259,9 +263,6 @@ ktime_t ktime_add_ns(const ktime_t kt, u
 
 	return ktime_add(kt, tmp);
 }
-
-#else /* CONFIG_KTIME_SCALAR */
-
 # endif /* !CONFIG_KTIME_SCALAR */
 
 /*
@@ -289,11 +290,411 @@ static unsigned long ktime_divns(const k
 # define ktime_divns(kt, div)		(unsigned long)((kt).tv64 / (div))
 #endif /* BITS_PER_LONG >= 64 */
 
+/* High resolution timer related functions */
+#ifdef CONFIG_HIGH_RES_TIMERS
+
+/*
+ * Is the high resolution mode active ?
+ */
+static inline int hrtimer_hres_active(void)
+{
+	return __get_cpu_var(hrtimer_bases).hres_active;
+}
+
+/*
+ * The time, when the last jiffy update happened. Protected by xtime_lock.
+ */
+static ktime_t last_jiffies_update;
+
+/*
+ * Reprogram the event source, checking both queues for the
+ * next event.
+ * Called with interrupts disabled and base->lock held
+ */
+static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base)
+{
+	int i;
+	struct hrtimer_clock_base *base = cpu_base->clock_base;
+	ktime_t expires;
+
+	cpu_base->expires_next.tv64 = KTIME_MAX;
+
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
+		struct hrtimer *timer;
+
+		if (!base->first)
+			continue;
+		timer = rb_entry(base->first, struct hrtimer, node);
+		expires = ktime_sub(timer->expires, base->offset);
+		if (expires.tv64 < cpu_base->expires_next.tv64)
+			cpu_base->expires_next = expires;
+	}
+
+	if (cpu_base->expires_next.tv64 != KTIME_MAX)
+		clockevents_set_next_event(cpu_base->expires_next, 1);
+}
+
+/*
+ * Shared reprogramming for clock_realtime and clock_monotonic
+ *
+ * When a timer is enqueued and expires earlier than the already enqueued
+ * timers, we have to check, whether it expires earlier than the timer for
+ * which the clock event device was armed.
+ *
+ * Called with interrupts disabled and base->cpu_base.lock held
+ */
+static int hrtimer_reprogram(struct hrtimer *timer,
+			     struct hrtimer_clock_base *base)
+{
+	ktime_t *expires_next = &__get_cpu_var(hrtimer_bases).expires_next;
+	ktime_t expires = ktime_sub(timer->expires, base->offset);
+	int res;
+
+	/*
+	 * When the callback is running, we do not reprogram the clock event
+	 * device. The timer callback is either running on a different CPU or
+	 * the callback is executed in the hrtimer_interrupt context. The
+	 * reprogramming is handled either by the softirq, which called the
+	 * callback or at the end of the hrtimer_interrupt.
+	 */
+	if (timer->state & HRTIMER_STATE_CALLBACK)
+		return 0;
+
+	if (expires.tv64 >= expires_next->tv64)
+		return 0;
+
+	/*
+	 * Clockevents returns -ETIME, when the event was in the past.
+	 */
+	res = clockevents_set_next_event(expires, 0);
+	if (!IS_ERR_VALUE(res))
+		*expires_next = expires;
+	return res;
+}
+
+
+/*
+ * Retrigger next event is called after clock was set
+ *
+ * Called with interrupts disabled via on_each_cpu()
+ */
+static void retrigger_next_event(void *arg)
+{
+	struct hrtimer_cpu_base *base;
+	struct timespec realtime_offset;
+	unsigned long seq;
+
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		set_normalized_timespec(&realtime_offset,
+					-wall_to_monotonic.tv_sec,
+					-wall_to_monotonic.tv_nsec);
+	} while (read_seqretry(&xtime_lock, seq));
+
+	base = &__get_cpu_var(hrtimer_bases);
+
+	/* Adjust CLOCK_REALTIME offset */
+	spin_lock(&base->lock);
+	base->clock_base[CLOCK_REALTIME].offset =
+		timespec_to_ktime(realtime_offset);
+
+	hrtimer_force_reprogram(base);
+	spin_unlock(&base->lock);
+}
+
+/*
+ * Clock realtime was set
+ *
+ * Change the offset of the realtime clock vs. the monotonic
+ * clock.
+ *
+ * We might have to reprogram the high resolution timer interrupt. On
+ * SMP we call the architecture specific code to retrigger _all_ high
+ * resolution timer interrupts. On UP we just disable interrupts and
+ * call the high resolution interrupt code.
+ */
+void clock_was_set(void)
+{
+	/* Retrigger the CPU local events everywhere */
+	if (hrtimer_hres_active())
+		on_each_cpu(retrigger_next_event, NULL, 0, 1);
+}
+
+/**
+ * hrtimer_clock_notify - A clock source or a clock event has been installed
+ *
+ * Notify the per cpu softirqs to recheck the clock sources and events
+ */
+void hrtimer_clock_notify(void)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		set_bit(0, &per_cpu(hrtimer_bases, i).check_clocks);
+}
+
+static const ktime_t nsec_per_hz = { .tv64 = NSEC_PER_SEC / HZ };
+
+/*
+ * We switched off the global tick source when switching to high resolution
+ * mode. Update jiffies64.
+ *
+ * Must be called with interrupts disabled !
+ *
+ * FIXME: We need a mechanism to assign the update to a CPU. In principle this
+ * is not hard, but when dynamic ticks come into play it starts to be. We don't
+ * want to wake up a completely idle CPU just to update jiffies, so we need
+ * something more intelligent than a mere "do this only on CPUx".
+ */
+static void update_jiffies64(ktime_t now)
+{
+	unsigned long seq;
+	ktime_t delta;
+
+	/* Preevaluate to avoid lock contention */
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		delta = ktime_sub(now, last_jiffies_update);
+	} while (read_seqretry(&xtime_lock, seq));
+
+	if (delta.tv64 < nsec_per_hz.tv64)
+		return;
+
+	/* Reevaluate with xtime_lock held */
+	write_seqlock(&xtime_lock);
+
+	delta = ktime_sub(now, last_jiffies_update);
+	if (delta.tv64 >= nsec_per_hz.tv64) {
+		unsigned long ticks = 1;
+
+		delta = ktime_sub(delta, nsec_per_hz);
+		last_jiffies_update = ktime_add(last_jiffies_update,
+						nsec_per_hz);
+
+		/* Slow path for long timeouts */
+		if (unlikely(delta.tv64 >= nsec_per_hz.tv64)) {
+			s64 incr = ktime_to_ns(nsec_per_hz);
+
+			ticks = ktime_divns(delta, incr);
+
+			last_jiffies_update = ktime_add_ns(last_jiffies_update,
+							   incr * ticks);
+			ticks++;
+		}
+		do_timer(ticks);
+	}
+	write_sequnlock(&xtime_lock);
+}
+
+/*
+ * We rearm the timer until we get disabled by the idle code
+ * Called with interrupts disabled.
+ */
+static enum hrtimer_restart hrtimer_sched_tick(struct hrtimer *timer)
+{
+	struct hrtimer_cpu_base *cpu_base =
+		container_of(timer, struct hrtimer_cpu_base, sched_timer);
+
+	/*
+	 * Do not call, when we are not in irq context and have
+	 * no valid regs pointer
+	 */
+	if (cpu_base->sched_regs) {
+		/*
+		 * update_process_times() might take tasklist_lock, hence
+		 * drop the base lock. sched-tick hrtimers are per-CPU and
+		 * never accessible by userspace APIs, so this is safe to do.
+		 */
+		spin_unlock(&cpu_base->lock);
+		update_process_times(user_mode(cpu_base->sched_regs));
+		profile_tick(CPU_PROFILING, cpu_base->sched_regs);
+		spin_lock(&cpu_base->lock);
+	}
+
+	hrtimer_forward(timer, hrtimer_cb_get_time(timer), nsec_per_hz);
+
+	return HRTIMER_RESTART;
+}
+
+/*
+ * A change in the clock source or clock events was detected.
+ * Check the clock source and the events, whether we can switch to
+ * high resolution mode or not.
+ *
+ * TODO: Handle the removal of clock sources / events
+ */
+static void hrtimer_check_clocks(void)
+{
+	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
+	unsigned long flags;
+	ktime_t now;
+
+	if (!test_and_clear_bit(0, &base->check_clocks))
+		return;
+
+	if (!timekeeping_is_continuous())
+		return;
+
+	if (!clockevents_next_event_available())
+		return;
+
+	local_irq_save(flags);
+
+	if (base->hres_active) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	now = ktime_get();
+	if (clockevents_init_next_event()) {
+		local_irq_restore(flags);
+		return;
+	}
+	base->hres_active = 1;
+	base->clock_base[CLOCK_REALTIME].resolution = KTIME_HIGH_RES;
+	base->clock_base[CLOCK_MONOTONIC].resolution = KTIME_HIGH_RES;
+
+	/* Did we start the jiffies update yet ? */
+	if (last_jiffies_update.tv64 == 0) {
+		write_seqlock(&xtime_lock);
+		last_jiffies_update = now;
+		write_sequnlock(&xtime_lock);
+	}
+
+	/*
+	 * Emulate tick processing via per-CPU hrtimers:
+	 */
+	hrtimer_init(&base->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	base->sched_timer.function = hrtimer_sched_tick;
+	base->sched_timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
+	hrtimer_start(&base->sched_timer, nsec_per_hz, HRTIMER_MODE_REL);
+
+	/* "Retrigger" the interrupt to get things going */
+	retrigger_next_event(NULL);
+	local_irq_restore(flags);
+	printk(KERN_INFO "Switched to high resolution mode on CPU %d\n",
+	       smp_processor_id());
+}
+
+/*
+ * Check, whether the timer is on the callback pending list
+ */
+static inline int hrtimer_cb_pending(const struct hrtimer *timer)
+{
+	return timer->state == HRTIMER_STATE_PENDING;
+}
+
+/*
+ * Remove a timer from the callback pending list
+ */
+static inline void hrtimer_remove_cb_pending(struct hrtimer *timer)
+{
+	list_del_init(&timer->cb_entry);
+}
+
+/*
+ * Initialize the high resolution related parts of cpu_base
+ */
+static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
+{
+	base->expires_next.tv64 = KTIME_MAX;
+	set_bit(0, &base->check_clocks);
+	base->hres_active = 0;
+	INIT_LIST_HEAD(&base->cb_pending);
+}
+
+/*
+ * Initialize the high resolution related parts of a hrtimer
+ */
+static inline void hrtimer_init_timer_hres(struct hrtimer *timer)
+{
+	INIT_LIST_HEAD(&timer->cb_entry);
+}
+
+/*
+ * When High resolution timers are active, try to reprogram. Note, that in case
+ * the state has HRTIMER_STATE_CALLBACK set, no reprogramming and no expiry
+ * check happens. The timer gets enqueued into the rbtree. The reprogramming
+ * and expiry check is done in the hrtimer_interrupt or in the softirq.
+ */
+static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
+					    struct hrtimer_clock_base *base)
+{
+	if (base->cpu_base->hres_active && hrtimer_reprogram(timer, base)) {
+
+		/* Timer is expired, act upon the callback mode */
+		switch (timer->cb_mode) {
+		case HRTIMER_CB_IRQSAFE_NO_RESTART:
+			/*
+			 * We can call the callback from here. No restart
+			 * happens, so no danger of recursion
+			 */
+			BUG_ON(timer->function(timer) != HRTIMER_NORESTART);
+			return 1;
+		case HRTIMER_CB_IRQSAFE_NO_SOFTIRQ:
+			/*
+			 * This is solely for the sched tick emulation with
+			 * dynamic tick support to ensure that we do not
+			 * restart the tick right on the edge and end up with
+			 * the tick timer in the softirq ! The calling site
+			 * takes care of this.
+			 */
+			return 1;
+		case HRTIMER_CB_IRQSAFE:
+		case HRTIMER_CB_SOFTIRQ:
+			/*
+			 * Move everything else into the softirq pending list !
+			 */
+			list_add_tail(&timer->cb_entry,
+				      &base->cpu_base->cb_pending);
+			timer->state = HRTIMER_STATE_PENDING;
+			raise_softirq(HRTIMER_SOFTIRQ);
+			return 1;
+		default:
+			BUG();
+		}
+	}
+	return 0;
+}
+
+/*
+ * Called after timekeeping resumed and updated jiffies64. Set the jiffies
+ * update time to now.
+ */
+static inline void hrtimer_resume_jiffy_update(void)
+{
+	unsigned long flags;
+	ktime_t now = ktime_get();
+
+	write_seqlock_irqsave(&xtime_lock, flags);
+	last_jiffies_update = now;
+	write_sequnlock_irqrestore(&xtime_lock, flags);
+}
+
+#else
+
+static inline int hrtimer_hres_active(void) { return 0; }
+static inline void hrtimer_check_clocks(void) { }
+static inline void hrtimer_force_reprogram(struct hrtimer_cpu_base *base) { }
+static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
+					    struct hrtimer_clock_base *base)
+{
+	return 0;
+}
+static inline int hrtimer_cb_pending(struct hrtimer *timer) { return 0; }
+static inline void hrtimer_remove_cb_pending(struct hrtimer *timer) { }
+static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_init_timer_hres(struct hrtimer *timer) { }
+static inline void hrtimer_resume_jiffy_update(void) { }
+
+#endif /* CONFIG_HIGH_RES_TIMERS */
+
 /*
  * Timekeeping resumed notification
  */
 void hrtimer_notify_resume(void)
 {
+	hrtimer_resume_jiffy_update();
 	clockevents_resume_events();
 	clock_was_set();
 }
@@ -355,7 +756,7 @@ hrtimer_forward(struct hrtimer *timer, k
  * red black tree is O(log(n)). Must hold the base lock.
  */
 static void enqueue_hrtimer(struct hrtimer *timer,
-			    struct hrtimer_clock_base *base)
+			    struct hrtimer_clock_base *base, int reprogram)
 {
 	struct rb_node **link = &base->active.rb_node;
 	struct rb_node *parent = NULL;
@@ -381,6 +782,22 @@ static void enqueue_hrtimer(struct hrtim
 	 * Insert the timer to the rbtree and check whether it
 	 * replaces the first pending timer
 	 */
+	if (!base->first || timer->expires.tv64 <
+	    rb_entry(base->first, struct hrtimer, node)->expires.tv64) {
+		/*
+		 * Reprogram the clock event device. When the timer is already
+		 * expired hrtimer_enqueue_reprogram has either called the
+		 * callback or added it to the pending list and raised the
+		 * softirq.
+		 *
+		 * This is a NOP for !HIGHRES
+		 */
+		if (reprogram && hrtimer_enqueue_reprogram(timer, base))
+			return;
+
+		base->first = &timer->node;
+	}
+
 	rb_link_node(&timer->node, parent, link);
 	rb_insert_color(&timer->node, &base->active);
 	/*
@@ -388,28 +805,38 @@ static void enqueue_hrtimer(struct hrtim
 	 * state of a possibly running callback.
 	 */
 	timer->state |= HRTIMER_STATE_ENQUEUED;
-
-	if (!base->first || timer->expires.tv64 <
-	    rb_entry(base->first, struct hrtimer, node)->expires.tv64)
-		base->first = &timer->node;
 }
 
 /*
  * __remove_hrtimer - internal function to remove a timer
  *
  * Caller must hold the base lock.
+ *
+ * High resolution timer mode reprograms the clock event device when the
+ * timer is the one which expires next. The caller can disable this by setting
+ * reprogram to zero. This is useful, when the context does a reprogramming
+ * anyway (e.g. timer interrupt)
  */
 static void __remove_hrtimer(struct hrtimer *timer,
 			     struct hrtimer_clock_base *base,
-			     unsigned long newstate)
+			     unsigned long newstate, int reprogram)
 {
-	/*
-	 * Remove the timer from the rbtree and replace the
-	 * first entry pointer if necessary.
-	 */
-	if (base->first == &timer->node)
-		base->first = rb_next(&timer->node);
-	rb_erase(&timer->node, &base->active);
+	/* High res. callback list. NOP for !HIGHRES */
+	if (hrtimer_cb_pending(timer))
+		hrtimer_remove_cb_pending(timer);
+	else {
+		/*
+		 * Remove the timer from the rbtree and replace the
+		 * first entry pointer if necessary.
+		 */
+		if (base->first == &timer->node) {
+			base->first = rb_next(&timer->node);
+			/* Reprogram the clock event device, if enabled */
+			if (reprogram && hrtimer_hres_active())
+				hrtimer_force_reprogram(base->cpu_base);
+		}
+		rb_erase(&timer->node, &base->active);
+	}
 	timer->state = newstate;
 }
 
@@ -420,7 +847,19 @@ static inline int
 remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
 	if (hrtimer_active(timer)) {
-		__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE);
+		int reprogram;
+
+		/*
+		 * Remove the timer and force reprogramming when high
+		 * resolution mode is active and the timer is on the current
+		 * CPU. If we remove a timer on another CPU, reprogramming is
+		 * skipped. The interrupt event on this CPU is fired and
+		 * reprogramming happens in the interrupt handler. This is a
+		 * rare case and less expensive than a smp call.
+		 */
+		reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);
+		__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE,
+				 reprogram);
 		return 1;
 	}
 	return 0;
@@ -466,7 +905,7 @@ hrtimer_start(struct hrtimer *timer, kti
 	}
 	timer->expires = tim;
 
-	enqueue_hrtimer(timer, new_base);
+	enqueue_hrtimer(timer, new_base, base == new_base);
 
 	unlock_hrtimer_base(timer, &flags);
 
@@ -597,6 +1036,7 @@ void hrtimer_init(struct hrtimer *timer,
 		clock_id = CLOCK_MONOTONIC;
 
 	timer->base = &cpu_base->clock_base[clock_id];
+	hrtimer_init_timer_hres(timer);
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
 
@@ -619,6 +1059,144 @@ int hrtimer_get_res(const clockid_t whic
 }
 EXPORT_SYMBOL_GPL(hrtimer_get_res);
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+
+/*
+ * High resolution timer interrupt
+ * Called with interrupts disabled
+ */
+void hrtimer_interrupt(struct pt_regs *regs)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	struct hrtimer_clock_base *base;
+	ktime_t expires_next, now;
+	int i, raise = 0;
+
+	BUG_ON(!cpu_base->hres_active);
+
+	/* Store the regs for an possible sched_timer callback */
+	cpu_base->sched_regs = regs;
+
+ retry:
+	now = ktime_get();
+
+	/* Check, if the jiffies need an update */
+	update_jiffies64(now);
+
+	expires_next.tv64 = KTIME_MAX;
+
+	base = cpu_base->clock_base;
+
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+		ktime_t basenow;
+		struct rb_node *node;
+
+		spin_lock(&cpu_base->lock);
+
+		basenow = ktime_add(now, base->offset);
+
+		while ((node = base->first)) {
+			struct hrtimer *timer;
+
+			timer = rb_entry(node, struct hrtimer, node);
+
+			if (basenow.tv64 < timer->expires.tv64) {
+				ktime_t expires;
+
+				expires = ktime_sub(timer->expires,
+						    base->offset);
+				if (expires.tv64 < expires_next.tv64)
+					expires_next = expires;
+				break;
+			}
+
+			/* Move softirq callbacks to the pending list */
+			if (timer->cb_mode == HRTIMER_CB_SOFTIRQ) {
+				__remove_hrtimer(timer, base,
+						 HRTIMER_STATE_PENDING, 0);
+				list_add_tail(&timer->cb_entry,
+					      &base->cpu_base->cb_pending);
+				raise = 1;
+				continue;
+			}
+
+			__remove_hrtimer(timer, base,
+					 HRTIMER_STATE_CALLBACK, 0);
+
+			if (timer->function(timer) != HRTIMER_NORESTART) {
+				BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+				/*
+				 * Do not reprogram. We do this when we break
+				 * out of the loop !
+				 */
+				enqueue_hrtimer(timer, base, 0);
+			}
+			timer->state &= ~HRTIMER_STATE_CALLBACK;
+		}
+		spin_unlock(&cpu_base->lock);
+		base++;
+	}
+
+	cpu_base->expires_next = expires_next;
+
+	/* Reprogramming necessary ? */
+	if (expires_next.tv64 != KTIME_MAX) {
+		if (clockevents_set_next_event(expires_next, 0))
+			goto retry;
+	}
+
+	/* Invalidate regs */
+	cpu_base->sched_regs = NULL;
+
+	/* Raise softirq ? */
+	if (raise)
+		raise_softirq(HRTIMER_SOFTIRQ);
+}
+
+static void run_hrtimer_softirq(struct softirq_action *h)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
+	spin_lock_irq(&cpu_base->lock);
+
+	while (!list_empty(&cpu_base->cb_pending)) {
+		enum hrtimer_restart (*fn)(struct hrtimer *);
+		struct hrtimer *timer;
+		int restart;
+
+		timer = list_entry(cpu_base->cb_pending.next,
+				   struct hrtimer, cb_entry);
+
+		fn = timer->function;
+		__remove_hrtimer(timer, timer->base, HRTIMER_STATE_CALLBACK, 0);
+		spin_unlock_irq(&cpu_base->lock);
+
+		restart = fn(timer);
+
+		spin_lock_irq(&cpu_base->lock);
+
+		timer->state &= ~HRTIMER_STATE_CALLBACK;
+		if (restart == HRTIMER_RESTART) {
+			BUG_ON(hrtimer_active(timer));
+			/*
+			 * Enqueue the timer, allow reprogramming of the event
+			 * device
+			 */
+			enqueue_hrtimer(timer, timer->base, 1);
+		} else if (hrtimer_active(timer)) {
+			/*
+			 * If the timer was rearmed on another CPU, reprogram
+			 * the event device.
+			 */
+			if (timer->base->first == &timer->node)
+				hrtimer_reprogram(timer, timer->base);
+		}
+	}
+	spin_unlock_irq(&cpu_base->lock);
+}
+
+#endif	/* CONFIG_HIGH_RES_TIMERS */
+
 /*
  * Expire the per base hrtimer-queue:
  */
@@ -646,7 +1224,7 @@ static inline void run_hrtimer_queue(str
 			break;
 
 		fn = timer->function;
-		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK);
+		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
 		spin_unlock_irq(&cpu_base->lock);
 
 		restart = fn(timer);
@@ -656,7 +1234,7 @@ static inline void run_hrtimer_queue(str
 		timer->state &= ~HRTIMER_STATE_CALLBACK;
 		if (restart != HRTIMER_NORESTART) {
 			BUG_ON(hrtimer_active(timer));
-			enqueue_hrtimer(timer, base);
+			enqueue_hrtimer(timer, base, 0);
 		}
 	}
 	spin_unlock_irq(&cpu_base->lock);
@@ -664,12 +1242,21 @@ static inline void run_hrtimer_queue(str
 
 /*
  * Called from timer softirq every jiffy, expire hrtimers:
+ *
+ * For HRT this is the fallback path: hrtimers are run from the timer
+ * softirq context in case the hrtimer initialization failed or has
+ * not been done yet.
  */
 void hrtimer_run_queues(void)
 {
 	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
 	int i;
 
+	hrtimer_check_clocks();
+
+	if (hrtimer_hres_active())
+		return;
+
 	hrtimer_get_softirq_time(cpu_base);
 
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
@@ -696,6 +1283,9 @@ void hrtimer_init_sleeper(struct hrtimer
 {
 	sl->timer.function = hrtimer_wakeup;
 	sl->task = task;
+#ifdef CONFIG_HIGH_RES_TIMERS
+	sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_RESTART;
+#endif
 }
 
 static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
@@ -706,7 +1296,8 @@ static int __sched do_nanosleep(struct h
 		set_current_state(TASK_INTERRUPTIBLE);
 		hrtimer_start(&t->timer, t->timer.expires, mode);
 
-		schedule();
+		if (likely(t->task))
+			schedule();
 
 		hrtimer_cancel(&t->timer);
 		mode = HRTIMER_MODE_ABS;
@@ -811,6 +1402,7 @@ static void __devinit init_hrtimers_cpu(
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
 		cpu_base->clock_base[i].cpu_base = cpu_base;
 
+	hrtimer_init_hres(cpu_base);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -824,9 +1416,12 @@ static void migrate_hrtimer_list(struct 
 	while ((node = rb_first(&old_base->active))) {
 		timer = rb_entry(node, struct hrtimer, node);
 		BUG_ON(timer->state & HRTIMER_CALLBACK);
-		__remove_hrtimer(timer, old_base, HRTIMER_INACTIVE);
+		__remove_hrtimer(timer, old_base, HRTIMER_INACTIVE, 0);
 		timer->base = new_base;
-		enqueue_hrtimer(timer, new_base);
+		/*
+		 * Enqueue the timer. Allow reprogramming of the event device
+		 */
+		enqueue_hrtimer(timer, new_base, 1);
 	}
 }
 
@@ -889,5 +1484,8 @@ void __init hrtimers_init(void)
 	hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
 			  (void *)(long)smp_processor_id());
 	register_cpu_notifier(&hrtimers_nb);
+#ifdef CONFIG_HIGH_RES_TIMERS
+	open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
+#endif
 }
 
Index: linux-2.6.18-mm2/kernel/itimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/itimer.c	2006-10-02 00:55:52.000000000 +0200
+++ linux-2.6.18-mm2/kernel/itimer.c	2006-10-02 00:55:54.000000000 +0200
@@ -136,7 +136,7 @@ enum hrtimer_restart it_real_fn(struct h
 	send_group_sig_info(SIGALRM, SEND_SIG_PRIV, sig->tsk);
 
 	if (sig->it_real_incr.tv64 != 0) {
-		hrtimer_forward(timer, timer->base->softirq_time,
+		hrtimer_forward(timer, hrtimer_cb_get_time(timer),
 				sig->it_real_incr);
 		return HRTIMER_RESTART;
 	}
Index: linux-2.6.18-mm2/kernel/posix-timers.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/posix-timers.c	2006-10-02 00:55:52.000000000 +0200
+++ linux-2.6.18-mm2/kernel/posix-timers.c	2006-10-02 00:55:54.000000000 +0200
@@ -356,7 +356,7 @@ static enum hrtimer_restart posix_timer_
 		if (timr->it.real.interval.tv64 != 0) {
 			timr->it_overrun +=
 				hrtimer_forward(timer,
-						timer->base->softirq_time,
+						hrtimer_cb_get_time(timer),
 						timr->it.real.interval);
 			ret = HRTIMER_RESTART;
 			++timr->it_requeue_pending;
Index: linux-2.6.18-mm2/kernel/time/Kconfig
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/kernel/time/Kconfig	2006-10-02 00:55:54.000000000 +0200
@@ -0,0 +1,22 @@
+#
+# Timer subsystem related configuration options
+#
+config HIGH_RES_TIMERS
+	bool "High Resolution Timer Support"
+	depends on GENERIC_TIME
+	help
+	  This option enables high resolution timer support. If your
+	  hardware is not capable then this option only increases
+	  the size of the kernel image.
+
+config HIGH_RES_RESOLUTION
+	int "High Resolution Timer resolution (nanoseconds)"
+	depends on HIGH_RES_TIMERS
+	default 1000
+	help
+	  This sets the resolution in nanoseconds of the high resolution
+	  timers. Too fine a resolution (too small a number) will usually
+	  not be observable due to normal system latencies.  For an
+          800 MHz processor about 10,000 (10 microseconds) is recommended as a
+	  finest resolution.  If you don't need that sort of resolution,
+	  larger values may generate less overhead.
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-10-02 00:55:51.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-10-02 00:55:54.000000000 +0200
@@ -1039,6 +1039,7 @@ static void update_wall_time(void)
 	if (change_clocksource()) {
 		clock->error = 0;
 		clock->xtime_nsec = 0;
+		hrtimer_clock_notify();
 		clocksource_calculate_interval(clock, tick_nsec);
 	}
 }
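
For readers following the jiffies catch-up in update_jiffies64() in the
kernel/hrtimer.c hunk above, here is a minimal userspace sketch of the
arithmetic. It is not the kernel code: it uses a single division instead
of the fast path / slow path split, omits the xtime_lock handling, and
all sketch_* / SKETCH_* names as well as HZ = 250 are assumptions made
for illustration only.

```c
#include <stdint.h>

/* Hypothetical stand-ins for nsec_per_hz; HZ = 250 is an assumption. */
#define SKETCH_HZ		250
#define SKETCH_NSEC_PER_SEC	1000000000LL
#define SKETCH_TICK_NS		(SKETCH_NSEC_PER_SEC / SKETCH_HZ)

/*
 * Return the number of whole tick periods between *last_update and now,
 * and advance *last_update by exactly that many periods, so that the
 * sub-tick remainder is carried over to the next call.
 */
static unsigned long sketch_ticks_elapsed(int64_t now, int64_t *last_update)
{
	int64_t delta = now - *last_update;
	unsigned long ticks;

	/* Less than one full period has passed: nothing to account */
	if (delta < SKETCH_TICK_NS)
		return 0;

	ticks = (unsigned long)(delta / SKETCH_TICK_NS);
	*last_update += (int64_t)ticks * SKETCH_TICK_NS;
	return ticks;
}
```

The returned count corresponds to the value the patch hands to
do_timer(); keeping the last update time on a tick boundary is what
prevents the remainder from being lost across long idle periods.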

--



* [patch 17/21] dynticks: core
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (15 preceding siblings ...)
  2006-10-01 23:01 ` [patch 16/21] high-res timers: core Thomas Gleixner
@ 2006-10-01 23:01 ` Thomas Gleixner
  2006-10-02  6:41   ` [patch] dynticks: core, NMI watchdog fix Ingo Molnar
  2006-10-01 23:01 ` [patch 18/21] dyntick: add nohz stats to /proc/stat Thomas Gleixner
                   ` (7 subsequent siblings)
  24 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-no-idle-hz.patch --]
[-- Type: text/plain, Size: 12882 bytes --]

From: Ingo Molnar <mingo@elte.hu>

dynamic ticks core code.

This is an extension to the per-cpu sched_tick timer of the high
resolution timer functionality. Before going idle, the sched_tick timer
is reprogrammed to a longer timeout when no timer events are due in the
next tick. The periodic tick is resumed when the CPU leaves the idle
state. If a non-timer IRQ hits the idle task, jiffies are updated from
irq_enter() before the interrupt code is called; otherwise the interrupt
handler could end up using a stale jiffies value.
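
The decision described above can be sketched in plain C as follows.
This is only an illustration of the expiry calculation done in
hrtimer_stop_sched_tick(): the sketch_* names are hypothetical, HZ = 250
is an assumption, and locking, statistics and the actual hrtimer_start()
call are left out.

```c
#include <stdint.h>

#define SKETCH_TICK_NS	4000000LL	/* one tick in ns, assumes HZ = 250 */

/*
 * Decide whether the tick may be stopped and, if so, when it has to
 * fire next. Returns 1 and stores the absolute expiry time in *expires
 * when the next timer wheel event is at least one jiffy away, and
 * returns 0 (leaving *expires untouched) otherwise.
 */
static int sketch_next_idle_expiry(unsigned long last_jiffies,
				   unsigned long next_jiffies,
				   int64_t last_update_ns,
				   int64_t *expires)
{
	/* The signed difference keeps this correct across jiffies wraparound */
	long delta_jiffies = (long)(next_jiffies - last_jiffies);

	if (delta_jiffies < 1)
		return 0;

	/* Expiry is counted from the time jiffies were last updated */
	*expires = last_update_ns + SKETCH_TICK_NS * delta_jiffies;
	return 1;
}
```

Basing the expiry on the last jiffies update rather than on the current
time keeps the reprogrammed timer aligned with the tick boundaries.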

The per-cpu idle statistics information can be used to optimize power
management decisions.
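
As a rough illustration of how a consumer might evaluate these
statistics, the sketch below derives the average sleep time per stopped
tick period (the "A" value, T/S, printed by show_no_hz_stats()). The
struct and function names are hypothetical; only the field meanings are
taken from the hrtimer_cpu_base documentation in this patch.

```c
#include <stdint.h>

/* Hypothetical userspace copy of the per-cpu idle counters */
struct sketch_idle_stats {
	unsigned long idle_calls;	/* total number of idle calls */
	unsigned long idle_sleeps;	/* idle calls which stopped the tick */
	uint64_t idle_sleeptime_ns;	/* time slept with the tick stopped */
};

/*
 * Average sleep time per stopped-tick idle period in nanoseconds.
 * Returns 0 when the tick was never stopped, which avoids a division
 * by zero in the same way show_no_hz_stats() does.
 */
static uint64_t sketch_avg_sleep_ns(const struct sketch_idle_stats *s)
{
	if (!s->idle_sleeps)
		return 0;
	return s->idle_sleeptime_ns / s->idle_sleeps;
}
```

A power management policy could, for example, compare this average
against the entry latency of a deeper sleep state; that use is an
assumption here and not part of the patch.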

More detailed information is available in Documentation/hrtimer/highres.txt

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 include/linux/hrtimer.h |   32 ++++++
 kernel/hrtimer.c        |  221 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/softirq.c        |   11 ++
 kernel/time/Kconfig     |    8 +
 kernel/timer.c          |    2 
 5 files changed, 273 insertions(+), 1 deletion(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-10-02 00:55:54.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-10-02 00:55:54.000000000 +0200
@@ -22,6 +22,7 @@
 #include <linux/list.h>
 #include <linux/wait.h>
 
+struct seq_file;
 struct hrtimer_clock_base;
 struct hrtimer_cpu_base;
 
@@ -180,6 +181,16 @@ struct hrtimer_clock_base {
  *			resolution mode
  * @sched_regs:		Temporary storage for pt_regs for the sched_timer
  *			callback
+ * @nr_events:		Total number of timer interrupt events
+ * @idle_tick:		Store the last idle tick expiry time when the tick
+ *			timer is modified for idle sleeps. This is necessary
+ *			to resume the tick timer operation in the timeline
+ *			when the CPU returns from idle
+ * @tick_stopped:	Indicator that the idle tick has been stopped
+ * @idle_calls:		Total number of idle calls
+ * @idle_sleeps:	Number of idle calls, where the sched tick was stopped
+ * @idle_entrytime:	Time when the idle call was entered
+ * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
  */
 struct hrtimer_cpu_base {
 	spinlock_t			lock;
@@ -192,6 +203,15 @@ struct hrtimer_cpu_base {
 	struct list_head		cb_pending;
 	struct hrtimer			sched_timer;
 	struct pt_regs			*sched_regs;
+	unsigned long			nr_events;
+#endif
+#ifdef CONFIG_NO_HZ
+	ktime_t				idle_tick;
+	int				tick_stopped;
+	unsigned long			idle_calls;
+	unsigned long			idle_sleeps;
+	ktime_t				idle_entrytime;
+	ktime_t				idle_sleeptime;
 #endif
 };
 
@@ -298,6 +318,18 @@ extern void hrtimer_run_queues(void);
 /* Resume notification */
 void hrtimer_notify_resume(void);
 
+#ifdef CONFIG_NO_HZ
+extern void hrtimer_stop_sched_tick(void);
+extern void hrtimer_restart_sched_tick(void);
+extern void hrtimer_update_jiffies(void);
+extern void show_no_hz_stats(struct seq_file *p);
+#else
+static inline void hrtimer_stop_sched_tick(void) { }
+static inline void hrtimer_restart_sched_tick(void) { }
+static inline void hrtimer_update_jiffies(void) { }
+static inline void show_no_hz_stats(struct seq_file *p) { }
+#endif
+
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
 
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:54.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:54.000000000 +0200
@@ -486,6 +486,221 @@ static void update_jiffies64(ktime_t now
 	write_sequnlock(&xtime_lock);
 }
 
+#ifdef CONFIG_NO_HZ
+/**
+ * hrtimer_update_jiffies - update jiffies when idle was interrupted
+ *
+ * Called from interrupt entry when the CPU was idle
+ *
+ * In case the sched_tick was stopped on this CPU, we have to check if jiffies
+ * must be updated. Otherwise an interrupt handler could use a stale jiffy
+ * value.
+ */
+void hrtimer_update_jiffies(void)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	unsigned long flags;
+	ktime_t now;
+
+	if (!cpu_base->tick_stopped || !cpu_base->hres_active)
+		return;
+
+	now = ktime_get();
+
+	local_irq_save(flags);
+	update_jiffies64(now);
+	local_irq_restore(flags);
+}
+
+/**
+ * hrtimer_stop_sched_tick - stop the idle tick from the idle task
+ *
+ * When the next event is more than a tick into the future, stop the idle tick.
+ * Called either from the idle loop or from irq_exit() when an idle period was
+ * just interrupted by an interrupt which did not cause a reschedule.
+ */
+void hrtimer_stop_sched_tick(void)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	unsigned long seq, last_jiffies, next_jiffies;
+	ktime_t last_update, expires, now;
+	unsigned long delta_jiffies;
+	unsigned long flags;
+
+	if (unlikely(!cpu_base->hres_active))
+		return;
+
+	local_irq_save(flags);
+
+	now = ktime_get();
+	/*
+	 * When called from irq_exit we need to account the idle sleep time
+	 * correctly.
+	 */
+	if (cpu_base->tick_stopped) {
+		ktime_t delta = ktime_sub(now, cpu_base->idle_entrytime);
+
+		cpu_base->idle_sleeptime = ktime_add(cpu_base->idle_sleeptime,
+						     delta);
+	}
+	cpu_base->idle_entrytime = now;
+	cpu_base->idle_calls++;
+
+	/* Read jiffies and the time when jiffies were updated last */
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		last_update = last_jiffies_update;
+		last_jiffies = jiffies;
+	} while (read_seqretry(&xtime_lock, seq));
+
+	/* Get the next timer wheel timer */
+	next_jiffies = get_next_timer_interrupt(last_jiffies);
+	delta_jiffies = next_jiffies - last_jiffies;
+
+	if ((long)delta_jiffies >= 1) {
+		/*
+		 * hrtimer_stop_sched_tick can be called several times before
+		 * the hrtimer_restart_sched_tick is called. This happens when
+		 * interrupts arrive which do not cause a reschedule. In the
+		 * first call we save the current tick time, so we can restart
+		 * the scheduler tick in hrtimer_restart_sched_tick.
+		 */
+		if (!cpu_base->tick_stopped) {
+			cpu_base->idle_tick = cpu_base->sched_timer.expires;
+			cpu_base->tick_stopped = 1;
+		}
+		/* calculate the expiry time for the next timer wheel timer */
+		expires = ktime_add_ns(last_update,
+				       nsec_per_hz.tv64 * delta_jiffies);
+		hrtimer_start(&cpu_base->sched_timer, expires,
+			      HRTIMER_MODE_ABS);
+		cpu_base->idle_sleeps++;
+	} else {
+		/* Raise the softirq if the timer wheel is behind jiffies */
+		if ((long) delta_jiffies < 0)
+			raise_softirq_irqoff(TIMER_SOFTIRQ);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * hrtimer_restart_sched_tick - restart the idle tick from the idle task
+ *
+ * Restart the idle tick when the CPU is woken up from idle
+ */
+void hrtimer_restart_sched_tick(void)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	ktime_t now, delta;
+
+	if (!cpu_base->hres_active || !cpu_base->tick_stopped)
+		return;
+
+	/* Update jiffies first */
+	now = ktime_get();
+
+	local_irq_disable();
+	update_jiffies64(now);
+
+	/*
+	 * Update process times would randomly account the time we slept to
+	 * whatever the context of the next sched tick is.  Enforce that this
+	 * is accounted to idle !
+	 */
+	add_preempt_count(HARDIRQ_OFFSET);
+	update_process_times(0);
+	sub_preempt_count(HARDIRQ_OFFSET);
+
+	/* Account the idle time */
+	delta = ktime_sub(now, cpu_base->idle_entrytime);
+	cpu_base->idle_sleeptime = ktime_add(cpu_base->idle_sleeptime, delta);
+
+	/*
+	 * Cancel the scheduled timer and restore the tick
+	 */
+	cpu_base->tick_stopped = 0;
+	hrtimer_cancel(&cpu_base->sched_timer);
+	cpu_base->sched_timer.expires = cpu_base->idle_tick;
+
+	while (1) {
+		/* Forward the time to expire in the future */
+		hrtimer_forward(&cpu_base->sched_timer, now, nsec_per_hz);
+		hrtimer_start(&cpu_base->sched_timer,
+			      cpu_base->sched_timer.expires, HRTIMER_MODE_ABS);
+
+		/* Check, if the timer was already in the past */
+		if (hrtimer_active(&cpu_base->sched_timer))
+			break;
+		/* Update jiffies and reread time */
+		update_jiffies64(now);
+		now = ktime_get();
+	}
+	local_irq_enable();
+}
+
+/**
+ * show_no_hz_stats - print out the no hz statistics
+ *
+ * The no_hz statistics are appended at the end of /proc/stat
+ *
+ * I: total number of idle calls
+ * S: number of idle calls which stopped the sched tick
+ * T: Summed up sleep time in idle with sched tick stopped (unit is seconds)
+ * A: Average sleep time: T/S (unit is seconds)
+ * E: Total number of timer interrupt events
+ */
+void show_no_hz_stats(struct seq_file *p)
+{
+	unsigned long calls = 0, sleeps = 0, events = 0;
+	struct timeval tsum, tavg;
+	ktime_t totaltime = { .tv64 = 0 };
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu);
+
+		calls += base->idle_calls;
+		sleeps += base->idle_sleeps;
+		totaltime = ktime_add(totaltime, base->idle_sleeptime);
+		events += base->nr_events;
+
+#ifdef CONFIG_SMP
+		tsum = ktime_to_timeval(base->idle_sleeptime);
+		if (base->idle_sleeps) {
+			uint64_t nsec = ktime_to_ns(base->idle_sleeptime);
+
+			do_div(nsec, base->idle_sleeps);
+			tavg = ns_to_timeval(nsec);
+		} else
+			tavg.tv_sec = tavg.tv_usec = 0;
+
+		seq_printf(p, "nohz cpu%d I:%lu S:%lu T:%d.%06d A:%d.%06d E: %lu\n",
+			   cpu, base->idle_calls, base->idle_sleeps,
+			   (int) tsum.tv_sec, (int) tsum.tv_usec,
+			   (int) tavg.tv_sec, (int) tavg.tv_usec,
+			   base->nr_events);
+#endif
+	}
+
+	tsum = ktime_to_timeval(totaltime);
+	if (sleeps) {
+		uint64_t nsec = ktime_to_ns(totaltime);
+
+		do_div(nsec, sleeps);
+		tavg = ns_to_timeval(nsec);
+	} else
+		tavg.tv_sec = tavg.tv_usec = 0;
+
+	seq_printf(p, "nohz total I:%lu S:%lu T:%d.%06d A:%d.%06d E: %lu\n",
+		   calls, sleeps,
+		   (int) tsum.tv_sec, (int) tsum.tv_usec,
+		   (int) tavg.tv_sec, (int) tavg.tv_usec,
+		   events);
+}
+
+#endif
+
 /*
  * We rearm the timer until we get disabled by the idle code
  * Called with interrupts disabled.
@@ -513,6 +728,11 @@ static enum hrtimer_restart hrtimer_sche
 
 	hrtimer_forward(timer, hrtimer_cb_get_time(timer), nsec_per_hz);
 
+#ifdef CONFIG_NO_HZ
+	/* Do not restart, when we are in the idle loop */
+	if (cpu_base->tick_stopped)
+		return HRTIMER_NORESTART;
+#endif
 	return HRTIMER_RESTART;
 }
 
@@ -1076,6 +1296,7 @@ void hrtimer_interrupt(struct pt_regs *r
 
 	/* Store the regs for an possible sched_timer callback */
 	cpu_base->sched_regs = regs;
+	cpu_base->nr_events++;
 
  retry:
 	now = ktime_get();
Index: linux-2.6.18-mm2/kernel/softirq.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/softirq.c	2006-10-02 00:55:51.000000000 +0200
+++ linux-2.6.18-mm2/kernel/softirq.c	2006-10-02 00:55:54.000000000 +0200
@@ -281,6 +281,11 @@ void irq_enter(void)
 	account_system_vtime(current);
 	add_preempt_count(HARDIRQ_OFFSET);
 	trace_hardirq_enter();
+
+#ifdef CONFIG_NO_HZ
+	if (idle_cpu(smp_processor_id()) && !in_interrupt())
+		hrtimer_update_jiffies();
+#endif
 }
 
 #ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
@@ -299,6 +304,12 @@ void irq_exit(void)
 	sub_preempt_count(IRQ_EXIT_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
+
+#ifdef CONFIG_NO_HZ
+	/* Make sure that timer wheel updates are propagated */
+	if (!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched())
+		hrtimer_stop_sched_tick();
+#endif
 	preempt_enable_no_resched();
 }
 
Index: linux-2.6.18-mm2/kernel/time/Kconfig
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time/Kconfig	2006-10-02 00:55:54.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time/Kconfig	2006-10-02 00:55:54.000000000 +0200
@@ -20,3 +20,11 @@ config HIGH_RES_RESOLUTION
           800 MHz processor about 10,000 (10 microseconds) is recommended as a
 	  finest resolution.  If you don't need that sort of resolution,
 	  larger values may generate less overhead.
+
+config NO_HZ
+	bool "Tickless System (Dynamic Ticks)"
+	depends on GENERIC_TIME && HIGH_RES_TIMERS
+	help
+	  This option enables a tickless system: timer interrupts will
+	  only trigger on an as-needed basis both when the system is
+	  busy and when the system is idle.
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-10-02 00:55:54.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-10-02 00:55:54.000000000 +0200
@@ -462,7 +462,7 @@ static inline void __run_timers(tvec_bas
 	spin_unlock_irq(&base->lock);
 }
 
-#ifdef CONFIG_NO_IDLE_HZ
+#if defined(CONFIG_NO_IDLE_HZ) || defined(CONFIG_NO_HZ)
 /*
  * Find out when the next timer event is due to happen. This
  * is used on S/390 to stop all activity when a cpus is idle.

--


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 18/21] dyntick: add nohz stats to /proc/stat
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (16 preceding siblings ...)
  2006-10-01 23:01 ` [patch 17/21] dynticks: core Thomas Gleixner
@ 2006-10-01 23:01 ` Thomas Gleixner
  2006-10-01 23:01 ` [patch 19/21] dynticks: i386 arch code Thomas Gleixner
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-add-stats.patch --]
[-- Type: text/plain, Size: 653 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

add nohz stats to /proc/stat.
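
For anyone who wants to consume the new line from user space, here is a
minimal sketch. It assumes the "nohz total" format emitted by
show_no_hz_stats() in patch 17 of this series
("nohz total I:%lu S:%lu T:%d.%06d A:%d.%06d E: %lu"). The struct and the
function name parse_nohz_total() are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space view of the "nohz total" line in /proc/stat. */
struct nohz_totals {
	unsigned long idle_calls;	/* I: calls into the idle tick-stop path */
	unsigned long idle_sleeps;	/* S: times the tick was really stopped */
	int sum_sec, sum_usec;		/* T: accumulated sleep time */
	int avg_sec, avg_usec;		/* A: average sleep time */
	unsigned long events;		/* E: hrtimer interrupt events */
};

/* Returns 0 on success, -1 if the line is not a "nohz total" line. */
static int parse_nohz_total(const char *line, struct nohz_totals *t)
{
	int n;

	assert(line && t);

	/* "%d" accepts the zero-padded microsecond field written by "%06d". */
	n = sscanf(line, "nohz total I:%lu S:%lu T:%d.%d A:%d.%d E: %lu",
		   &t->idle_calls, &t->idle_sleeps,
		   &t->sum_sec, &t->sum_usec,
		   &t->avg_sec, &t->avg_usec, &t->events);

	return n == 7 ? 0 : -1;
}
```

A caller would read /proc/stat line by line and hand each line to
parse_nohz_total(); lines that do not match are rejected with -1.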

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 fs/proc/proc_misc.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6.18-mm2/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.18-mm2.orig/fs/proc/proc_misc.c	2006-10-02 00:55:45.000000000 +0200
+++ linux-2.6.18-mm2/fs/proc/proc_misc.c	2006-10-02 00:55:54.000000000 +0200
@@ -527,6 +527,8 @@ static int show_stat(struct seq_file *p,
 		nr_running(),
 		nr_iowait());
 
+	show_no_hz_stats(p);
+
 	return 0;
 }
 

--


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 19/21] dynticks: i386 arch code
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (17 preceding siblings ...)
  2006-10-01 23:01 ` [patch 18/21] dyntick: add nohz stats to /proc/stat Thomas Gleixner
@ 2006-10-01 23:01 ` Thomas Gleixner
  2006-10-01 23:01 ` [patch 20/21] high-res timers, dynticks: enable i386 support Thomas Gleixner
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: i368-prepare-no-hz.patch --]
[-- Type: text/plain, Size: 959 bytes --]

From: Ingo Molnar <mingo@elte.hu>

Prepare i386 for dyntick: idle handler callbacks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
----
 arch/i386/kernel/process.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6.18-mm2/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/kernel/process.c	2006-10-02 00:55:45.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/kernel/process.c	2006-10-02 00:55:55.000000000 +0200
@@ -178,6 +178,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
+		hrtimer_stop_sched_tick();
 		while (!need_resched()) {
 			void (*idle)(void);
 
@@ -196,6 +197,7 @@ void cpu_idle(void)
 			__get_cpu_var(irq_stat).idle_timestamp = jiffies;
 			idle();
 		}
+		hrtimer_restart_sched_tick();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();

--


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 20/21] high-res timers, dynticks: enable i386 support
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (18 preceding siblings ...)
  2006-10-01 23:01 ` [patch 19/21] dynticks: i386 arch code Thomas Gleixner
@ 2006-10-01 23:01 ` Thomas Gleixner
  2006-10-01 23:01 ` [patch 21/21] debugging feature: timer stats Thomas Gleixner
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: hrtimer-hres-i386.patch --]
[-- Type: text/plain, Size: 692 bytes --]

From: Ingo Molnar <mingo@elte.hu>

enable high-res timers and dyntick on i386.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
--
 arch/i386/Kconfig |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6.18-mm2/arch/i386/Kconfig
===================================================================
--- linux-2.6.18-mm2.orig/arch/i386/Kconfig	2006-10-02 00:55:53.000000000 +0200
+++ linux-2.6.18-mm2/arch/i386/Kconfig	2006-10-02 00:55:55.000000000 +0200
@@ -65,6 +65,8 @@ source "init/Kconfig"
 
 menu "Processor type and features"
 
+source "kernel/time/Kconfig"
+
 config SMP
 	bool "Symmetric multi-processing support"
 	---help---

--


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 21/21] debugging feature: timer stats
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (19 preceding siblings ...)
  2006-10-01 23:01 ` [patch 20/21] high-res timers, dynticks: enable i386 support Thomas Gleixner
@ 2006-10-01 23:01 ` Thomas Gleixner
  2006-10-02  5:11 ` [patch 00/21] high resolution timers / dynamic ticks - V2 Valdis.Kletnieks
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-01 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

[-- Attachment #1: timer_stats.patch --]
[-- Type: text/plain, Size: 24711 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add /proc/timer_stats support: a debugging feature to profile timer expiration.
The starting site, the process name/PID and the expiration function are all
captured. This allows the quick identification of timer event sources
in a system.

sample output:

 # echo 1 > /proc/timer_stats
 # cat /proc/timer_stats
 Timerstats sample period: 3.888770 s
   12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
   15,     1 swapper          hcd_submit_urb (rh_timer_func)
    4,   959 kedac            schedule_timeout (process_timeout)
    1,     0 swapper          page_writeback_init (wb_timer_fn)
   28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
   22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
    3,  3100 bash             schedule_timeout (process_timeout)
    1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
    1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
    1,     1 swapper          neigh_table_init_no_netlink (neigh_periodic_timer)
    1,  2292 ip               __netdev_watchdog_up (dev_watchdog)
    1,    23 events/1         do_cache_clean (delayed_work_timer_fn)
 90 total events, 30.0 events/sec
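
A minimal user-space sketch for post-processing such output follows. It
assumes the per-entry layout printed by tstats_show() in this patch
("%4lu, %5d %-16s " followed by "start_func (expire_func)"). The struct, the
function name parse_tstat_line() and the 64 byte symbol buffers are
hypothetical choices for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define COMM_FIELD	16	/* width of the "%-16s" comm column */

/* Hypothetical parsed form of one /proc/timer_stats entry line. */
struct tstat_line {
	unsigned long count;		/* number of timer events */
	int pid;			/* pid which started the timer */
	char comm[COMM_FIELD + 1];	/* process name, padding stripped */
	char start[64];			/* function which started the timer */
	char expire[64];		/* timer callback function */
};

/* Returns 0 on success, -1 for header, summary or malformed lines. */
static int parse_tstat_line(const char *line, struct tstat_line *e)
{
	int used = 0;
	size_t len;

	assert(line && e);

	if (sscanf(line, "%lu, %d%n", &e->count, &e->pid, &used) != 2 ||
	    line[used] != ' ')
		return -1;
	line += used + 1;

	/* comm may contain blanks ("IRQ 4"), so take the fixed-width column */
	if (strlen(line) < COMM_FIELD + 1)
		return -1;
	memcpy(e->comm, line, COMM_FIELD);
	e->comm[COMM_FIELD] = '\0';
	for (len = COMM_FIELD; len > 0 && e->comm[len - 1] == ' '; len--)
		e->comm[len - 1] = '\0';
	line += COMM_FIELD;

	if (sscanf(line, " %63s (%63[^)])", e->start, e->expire) != 2)
		return -1;

	return 0;
}
```

The fixed-width read of the comm column is deliberate: splitting on
whitespace would break for names such as "IRQ 4".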

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 Documentation/hrtimer/timer_stats.txt |   68 +++++++++
 include/linux/hrtimer.h               |   53 +++++++
 include/linux/timer.h                 |   49 ++++++
 kernel/hrtimer.c                      |   28 +++
 kernel/time/Makefile                  |    3 
 kernel/time/timer_stats.c             |  244 ++++++++++++++++++++++++++++++++++
 kernel/timer.c                        |   29 +++-
 kernel/workqueue.c                    |    6 
 lib/Kconfig.debug                     |   11 +
 9 files changed, 484 insertions(+), 7 deletions(-)

Index: linux-2.6.18-mm2/include/linux/hrtimer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/hrtimer.h	2006-10-02 00:55:54.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/hrtimer.h	2006-10-02 00:55:55.000000000 +0200
@@ -101,8 +101,14 @@ enum hrtimer_cb_mode {
  * @cb_mode:	high resolution timer feature to select the callback execution
  *		 mode
  * @cb_entry:	list head to enqueue an expired timer into the callback list
+ * @start_site:	timer statistics field to store the site where the timer
+ *		was started
+ * @start_comm: timer statistics field to store the name of the process which
+ *		started the timer
+ * @start_pid: timer statistics field to store the pid of the task which
+ *		started the timer
  *
- * The hrtimer structure must be initialized by init_hrtimer_#CLOCKTYPE()
+ * The hrtimer structure must be initialized by hrtimer_init()
  */
 struct hrtimer {
 	struct rb_node			node;
@@ -114,6 +120,11 @@ struct hrtimer {
 	enum hrtimer_cb_mode		cb_mode;
 	struct list_head		cb_entry;
 #endif
+#ifdef CONFIG_TIMER_STATS
+	void				*start_site;
+	char				start_comm[16];
+	int				start_pid;
+#endif
 };
 
 /**
@@ -333,4 +344,44 @@ static inline void show_no_hz_stats(stru
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
 
+/*
+ * Timer-statistics info:
+ */
+#ifdef CONFIG_TIMER_STATS
+
+extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
+				     void *timerf, char * comm);
+
+static inline void timer_stats_account_hrtimer(struct hrtimer *timer)
+{
+	timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
+				 timer->function, timer->start_comm);
+}
+
+extern void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer,
+						 void *addr);
+
+static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
+{
+	__timer_stats_hrtimer_set_start_info(timer, __builtin_return_address(0));
+}
+
+static inline void timer_stats_hrtimer_clear_start_info(struct hrtimer *timer)
+{
+	timer->start_site = NULL;
+}
+#else
+static inline void timer_stats_account_hrtimer(struct hrtimer *timer)
+{
+}
+
+static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
+{
+}
+
+static inline void timer_stats_hrtimer_clear_start_info(struct hrtimer *timer)
+{
+}
+#endif
+
 #endif
Index: linux-2.6.18-mm2/include/linux/timer.h
===================================================================
--- linux-2.6.18-mm2.orig/include/linux/timer.h	2006-10-02 00:55:52.000000000 +0200
+++ linux-2.6.18-mm2/include/linux/timer.h	2006-10-02 00:55:55.000000000 +0200
@@ -2,6 +2,7 @@
 #define _LINUX_TIMER_H
 
 #include <linux/list.h>
+#include <linux/ktime.h>
 #include <linux/spinlock.h>
 #include <linux/stddef.h>
 
@@ -15,6 +16,11 @@ struct timer_list {
 	unsigned long data;
 
 	struct tvec_t_base_s *base;
+#ifdef CONFIG_TIMER_STATS
+	void *start_site;
+	char start_comm[16];
+	int start_pid;
+#endif
 };
 
 extern struct tvec_t_base_s boot_tvec_bases;
@@ -73,6 +79,49 @@ extern unsigned long next_timer_interrup
  */
 extern unsigned long get_next_timer_interrupt(unsigned long now);
 
+/*
+ * Timer-statistics info:
+ */
+#ifdef CONFIG_TIMER_STATS
+
+extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
+				     void *timerf, char * comm);
+
+static inline void timer_stats_account_timer(struct timer_list *timer)
+{
+	timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
+				 timer->function, timer->start_comm);
+}
+
+extern void __timer_stats_timer_set_start_info(struct timer_list *timer,
+					       void *addr);
+
+static inline void timer_stats_timer_set_start_info(struct timer_list *timer)
+{
+	__timer_stats_timer_set_start_info(timer, __builtin_return_address(0));
+}
+
+static inline void timer_stats_timer_clear_start_info(struct timer_list *timer)
+{
+	timer->start_site = NULL;
+}
+#else
+static inline void timer_stats_account_timer(struct timer_list *timer)
+{
+}
+
+static inline void timer_stats_timer_set_start_info(struct timer_list *timer)
+{
+}
+
+static inline void timer_stats_timer_clear_start_info(struct timer_list *timer)
+{
+}
+#endif
+
+extern void delayed_work_timer_fn(unsigned long __data);
+
+
 /***
  * add_timer - start a timer
  * @timer: the timer to be added
Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:54.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:55.000000000 +0200
@@ -457,7 +457,7 @@ static void update_jiffies64(ktime_t now
 		delta = ktime_sub(now, last_jiffies_update);
 	} while (read_seqretry(&xtime_lock, seq));
 
-	if (delta.tv64 >= nsec_per_hz.tv64)
+	if (delta.tv64 < nsec_per_hz.tv64)
 		return;
 
 	/* Reevalute with xtime_lock held */
@@ -909,6 +909,18 @@ static inline void hrtimer_resume_jiffy_
 
 #endif /* CONFIG_HIGH_RES_TIMERS */
 
+#ifdef CONFIG_TIMER_STATS
+void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer, void *addr)
+{
+	if (timer->start_site)
+		return;
+
+	timer->start_site = addr;
+	memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
+	timer->start_pid = current->pid;
+}
+#endif
+
 /*
  * Timekeeping resumed notification
  */
@@ -1077,6 +1089,7 @@ remove_hrtimer(struct hrtimer *timer, st
 		 * reprogramming happens in the interrupt handler. This is a
 		 * rare case and less expensive than a smp call.
 		 */
+		timer_stats_hrtimer_clear_start_info(timer);
 		reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);
 		__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE,
 				 reprogram);
@@ -1125,6 +1138,8 @@ hrtimer_start(struct hrtimer *timer, kti
 	}
 	timer->expires = tim;
 
+	timer_stats_hrtimer_set_start_info(timer);
+
 	enqueue_hrtimer(timer, new_base, base == new_base);
 
 	unlock_hrtimer_base(timer, &flags);
@@ -1257,6 +1272,12 @@ void hrtimer_init(struct hrtimer *timer,
 
 	timer->base = &cpu_base->clock_base[clock_id];
 	hrtimer_init_timer_hres(timer);
+
+#ifdef CONFIG_TIMER_STATS
+	timer->start_site = NULL;
+	timer->start_pid = -1;
+	memset(timer->start_comm, 0, TASK_COMM_LEN);
+#endif
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
 
@@ -1343,6 +1364,7 @@ void hrtimer_interrupt(struct pt_regs *r
 
 			__remove_hrtimer(timer, base,
 					 HRTIMER_STATE_CALLBACK, 0);
+			timer_stats_account_hrtimer(timer);
 
 			if (timer->function(timer) != HRTIMER_NORESTART) {
 				BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
@@ -1388,6 +1410,8 @@ static void run_hrtimer_softirq(struct s
 		timer = list_entry(cpu_base->cb_pending.next,
 				   struct hrtimer, cb_entry);
 
+		timer_stats_account_hrtimer(timer);
+
 		fn = timer->function;
 		__remove_hrtimer(timer, timer->base, HRTIMER_STATE_CALLBACK, 0);
 		spin_unlock_irq(&cpu_base->lock);
@@ -1444,6 +1468,8 @@ static inline void run_hrtimer_queue(str
 		if (base->softirq_time.tv64 <= timer->expires.tv64)
 			break;
 
+		timer_stats_account_hrtimer(timer);
+
 		fn = timer->function;
 		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
 		spin_unlock_irq(&cpu_base->lock);
Index: linux-2.6.18-mm2/kernel/time/Makefile
===================================================================
--- linux-2.6.18-mm2.orig/kernel/time/Makefile	2006-10-02 00:55:53.000000000 +0200
+++ linux-2.6.18-mm2/kernel/time/Makefile	2006-10-02 00:55:55.000000000 +0200
@@ -1,3 +1,4 @@
 obj-y += ntp.o clocksource.o jiffies.o
 
-obj-$(CONFIG_GENERIC_CLOCKEVENTS) += clockevents.o
+obj-$(CONFIG_GENERIC_CLOCKEVENTS)	+= clockevents.o
+obj-$(CONFIG_TIMER_STATS)		+= timer_stats.o
Index: linux-2.6.18-mm2/kernel/time/timer_stats.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/kernel/time/timer_stats.c	2006-10-02 00:55:55.000000000 +0200
@@ -0,0 +1,244 @@
+/*
+ * kernel/time/timer_stats.c
+ *
+ * Collect timer usage statistics.
+ *
+ * Copyright(C) 2006, Red Hat, Inc., Ingo Molnar
+ * Copyright(C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
+ *
+ * timer_stats is based on timer_top, a similar facility which was part of
+ * Con Kolivas' dyntick patch set. It was developed by Daniel Petrini at the
+ * Instituto Nokia de Tecnologia - INdT - Manaus. timer_top's design was based
+ * on dynamic allocation of the statistics entries rather than the static array
+ * which is used by timer_stats. It was written for the pre-hrtimer kernel code
+ * and therefore did not take hrtimers into account. Nevertheless it provided
+ * the base for the timer_stats implementation and was a helpful source of
+ * inspiration in the first place. Kudos to Daniel and the Nokia folks for this
+ * effort.
+ *
+ * timer_top.c is
+ *	Copyright (C) 2005 Instituto Nokia de Tecnologia - INdT - Manaus
+ *	Written by Daniel Petrini <d.pensator@gmail.com>
+ *	timer_top.c was released under the GNU General Public License version 2
+ *
+ * We export the addresses and counting of timer functions being called,
+ * the pid and cmdline from the owner process if applicable.
+ *
+ * Start/stop data collection:
+ * # echo 1[0] >/proc/timer_stats
+ *
+ * Display the collected information:
+ * # cat /proc/timer_stats
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/proc_fs.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+
+#include <asm/uaccess.h>
+
+enum tstats_stat {
+	TSTATS_INACTIVE,
+	TSTATS_ACTIVE,
+	TSTATS_READOUT,
+	TSTATS_RESET,
+};
+
+struct tstats_entry {
+	void			*timer;
+	void			*start_func;
+	void			*expire_func;
+	unsigned long		counter;
+	pid_t			pid;
+	char			comm[TASK_COMM_LEN + 1];
+};
+
+#define TSTATS_MAX_ENTRIES	1024
+
+static struct tstats_entry tstats[TSTATS_MAX_ENTRIES];
+static DEFINE_SPINLOCK(tstats_lock);
+static enum tstats_stat tstats_status;
+static ktime_t tstats_time;
+
+/**
+ * timer_stats_update_stats - Update the statistics for a timer.
+ * @timer:	pointer to either a timer_list or a hrtimer
+ * @pid:	the pid of the task which set up the timer
+ * @startf:	pointer to the function which did the timer setup
+ * @timerf:	pointer to the timer callback function of the timer
+ * @comm:	name of the process which set up the timer
+ *
+ * When the timer is already registered, then the event counter is
+ * incremented. Otherwise the timer is registered in a free slot.
+ */
+void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
+			      void *timerf, char * comm)
+{
+	struct tstats_entry *entry = tstats;
+	unsigned long flags;
+	int i;
+
+	spin_lock_irqsave(&tstats_lock, flags);
+	if (tstats_status != TSTATS_ACTIVE)
+		goto out_unlock;
+
+	for (i = 0; i < TSTATS_MAX_ENTRIES; i++, entry++) {
+		if (entry->timer == timer &&
+		    entry->start_func == startf &&
+		    entry->expire_func == timerf &&
+		    entry->pid == pid) {
+
+			entry->counter++;
+			break;
+		}
+		if (!entry->timer) {
+			entry->timer = timer;
+			entry->start_func = startf;
+			entry->expire_func = timerf;
+			entry->counter = 1;
+			entry->pid = pid;
+			memcpy(entry->comm, comm, TASK_COMM_LEN);
+			entry->comm[TASK_COMM_LEN] = 0;
+			break;
+		}
+	}
+
+ out_unlock:
+	spin_unlock_irqrestore(&tstats_lock, flags);
+}
+
+static void print_name_offset(struct seq_file *m, unsigned long addr)
+{
+	char namebuf[KSYM_NAME_LEN+1];
+	unsigned long size, offset;
+	const char *sym_name;
+	char *modname;
+
+	sym_name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf);
+	if (sym_name)
+		seq_printf(m, "%s", sym_name);
+	else
+		seq_printf(m, "<%p>", (void *)addr);
+}
+
+static int tstats_show(struct seq_file *m, void *v)
+{
+	struct tstats_entry *entry = tstats;
+	struct timespec period;
+	unsigned long ms;
+	long events = 0;
+	int i;
+
+	spin_lock_irq(&tstats_lock);
+	switch(tstats_status) {
+	case TSTATS_ACTIVE:
+		tstats_time = ktime_sub(ktime_get(), tstats_time);
+	case TSTATS_INACTIVE:
+		tstats_status = TSTATS_READOUT;
+		break;
+	default:
+		spin_unlock_irq(&tstats_lock);
+		return -EBUSY;
+	}
+	spin_unlock_irq(&tstats_lock);
+
+	period = ktime_to_timespec(tstats_time);
+	ms = period.tv_nsec / 1000000;
+
+	seq_printf(m, "Timerstats sample period: %ld.%03lu s\n",
+		   period.tv_sec, ms);
+
+	for (i = 0; i < TSTATS_MAX_ENTRIES && entry->timer; i++, entry++) {
+		seq_printf(m, "%4lu, %5d %-16s ", entry->counter, entry->pid,
+			   entry->comm);
+
+		print_name_offset(m, (unsigned long)entry->start_func);
+		seq_puts(m, " (");
+		print_name_offset(m, (unsigned long)entry->expire_func);
+		seq_puts(m, ")\n");
+		events += entry->counter;
+	}
+
+	ms += period.tv_sec * 1000;
+	if (events && ms)
+		seq_printf(m, "%ld total events, %lu.%lu events/sec\n", events,
+			   events * 1000 / ms, events * 10000 / ms % 10);
+	else
+		seq_printf(m, "%ld total events\n", events);
+
+	tstats_status = TSTATS_INACTIVE;
+	return 0;
+}
+
+static ssize_t tstats_write(struct file *file, const char __user *buf,
+			    size_t count, loff_t *offs)
+{
+	char ctl[2];
+
+	if (count != 2 || *offs)
+		return -EINVAL;
+
+	if (copy_from_user(ctl, buf, count))
+		return -EFAULT;
+
+	switch (ctl[0]) {
+	case '0':
+		spin_lock_irq(&tstats_lock);
+		if (tstats_status == TSTATS_ACTIVE) {
+			tstats_status = TSTATS_INACTIVE;
+			tstats_time = ktime_sub(ktime_get(), tstats_time);
+		}
+		spin_unlock_irq(&tstats_lock);
+		break;
+	case '1':
+		spin_lock_irq(&tstats_lock);
+		if (tstats_status == TSTATS_INACTIVE) {
+			tstats_status = TSTATS_RESET;
+			memset(tstats, 0, sizeof(tstats));
+			tstats_time = ktime_get();
+			tstats_status = TSTATS_ACTIVE;
+		}
+		spin_unlock_irq(&tstats_lock);
+		break;
+	default:
+		count = -EINVAL;
+	}
+
+	return count;
+}
+
+static int tstats_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, tstats_show, NULL);
+}
+
+static struct file_operations tstats_fops = {
+	.open		= tstats_open,
+	.read		= seq_read,
+	.write		= tstats_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init init_tstats(void)
+{
+	struct proc_dir_entry *pe;
+
+	pe = create_proc_entry("timer_stats", 0666, NULL);
+
+	if (!pe)
+		return -ENOMEM;
+
+	pe->proc_fops = &tstats_fops;
+
+	return 0;
+}
+module_init(init_tstats);
Index: linux-2.6.18-mm2/kernel/timer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/timer.c	2006-10-02 00:55:54.000000000 +0200
+++ linux-2.6.18-mm2/kernel/timer.c	2006-10-02 00:55:55.000000000 +0200
@@ -34,6 +34,7 @@
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
 #include <linux/delay.h>
+#include <linux/kallsyms.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -133,6 +134,18 @@ static void internal_add_timer(tvec_base
 	list_add_tail(&timer->entry, vec);
 }
 
+#ifdef CONFIG_TIMER_STATS
+void __timer_stats_timer_set_start_info(struct timer_list *timer, void *addr)
+{
+	if (timer->start_site)
+		return;
+
+	timer->start_site = addr;
+	memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
+	timer->start_pid = current->pid;
+}
+#endif
+
 /**
  * init_timer - initialize a timer.
  * @timer: the timer to be initialized
@@ -144,11 +157,16 @@ void fastcall init_timer(struct timer_li
 {
 	timer->entry.next = NULL;
 	timer->base = __raw_get_cpu_var(tvec_bases);
+#ifdef CONFIG_TIMER_STATS
+	timer->start_site = NULL;
+	timer->start_pid = -1;
+	memset(timer->start_comm, 0, TASK_COMM_LEN);
+#endif
 }
 EXPORT_SYMBOL(init_timer);
 
 static inline void detach_timer(struct timer_list *timer,
-					int clear_pending)
+				int clear_pending)
 {
 	struct list_head *entry = &timer->entry;
 
@@ -195,6 +213,7 @@ int __mod_timer(struct timer_list *timer
 	unsigned long flags;
 	int ret = 0;
 
+	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
 
 	base = lock_timer_base(timer, &flags);
@@ -245,6 +264,7 @@ void add_timer_on(struct timer_list *tim
 	tvec_base_t *base = per_cpu(tvec_bases, cpu);
   	unsigned long flags;
 
+	timer_stats_timer_set_start_info(timer);
   	BUG_ON(timer_pending(timer) || !timer->function);
 	spin_lock_irqsave(&base->lock, flags);
 	timer->base = base;
@@ -277,6 +297,7 @@ int mod_timer(struct timer_list *timer, 
 {
 	BUG_ON(!timer->function);
 
+	timer_stats_timer_set_start_info(timer);
 	/*
 	 * This is a common optimization triggered by the
 	 * networking code - if the timer is re-modified
@@ -307,6 +328,7 @@ int del_timer(struct timer_list *timer)
 	unsigned long flags;
 	int ret = 0;
 
+	timer_stats_timer_clear_start_info(timer);
 	if (timer_pending(timer)) {
 		base = lock_timer_base(timer, &flags);
 		if (timer_pending(timer)) {
@@ -440,6 +462,8 @@ static inline void __run_timers(tvec_bas
  			fn = timer->function;
  			data = timer->data;
 
+			timer_stats_account_timer(timer);
+
 			set_running_timer(base, timer);
 			detach_timer(timer, 1);
 			spin_unlock_irq(&base->lock);
@@ -1119,7 +1143,8 @@ static void run_timer_softirq(struct sof
 {
 	tvec_base_t *base = __get_cpu_var(tvec_bases);
 
- 	hrtimer_run_queues();
+	hrtimer_run_queues();
+
 	if (time_after_eq(jiffies, base->timer_jiffies))
 		__run_timers(base);
 }
Index: linux-2.6.18-mm2/kernel/workqueue.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/workqueue.c	2006-10-02 00:55:44.000000000 +0200
+++ linux-2.6.18-mm2/kernel/workqueue.c	2006-10-02 00:55:55.000000000 +0200
@@ -119,7 +119,7 @@ int fastcall queue_work(struct workqueue
 }
 EXPORT_SYMBOL_GPL(queue_work);
 
-static void delayed_work_timer_fn(unsigned long __data)
+void delayed_work_timer_fn(unsigned long __data)
 {
 	struct work_struct *work = (struct work_struct *)__data;
 	struct workqueue_struct *wq = work->wq_data;
@@ -140,11 +140,12 @@ static void delayed_work_timer_fn(unsign
  * Returns non-zero if it was successfully added.
  */
 int fastcall queue_delayed_work(struct workqueue_struct *wq,
-			struct work_struct *work, unsigned long delay)
+				struct work_struct *work, unsigned long delay)
 {
 	int ret = 0;
 	struct timer_list *timer = &work->timer;
 
+	timer_stats_timer_set_start_info(&work->timer);
 	if (!test_and_set_bit(0, &work->pending)) {
 		BUG_ON(timer_pending(timer));
 		BUG_ON(!list_empty(&work->entry));
@@ -469,6 +470,7 @@ EXPORT_SYMBOL(schedule_work);
  */
 int fastcall schedule_delayed_work(struct work_struct *work, unsigned long delay)
 {
+	timer_stats_timer_set_start_info(&work->timer);
 	return queue_delayed_work(keventd_wq, work, delay);
 }
 EXPORT_SYMBOL(schedule_delayed_work);
Index: linux-2.6.18-mm2/lib/Kconfig.debug
===================================================================
--- linux-2.6.18-mm2.orig/lib/Kconfig.debug	2006-10-02 00:55:44.000000000 +0200
+++ linux-2.6.18-mm2/lib/Kconfig.debug	2006-10-02 00:55:55.000000000 +0200
@@ -109,6 +109,17 @@ config SCHEDSTATS
 	  application, you can say N to avoid the very slight overhead
 	  this adds.
 
+config TIMER_STATS
+	bool "Collect kernel timer statistics"
+	depends on DEBUG_KERNEL && PROC_FS
+	help
+	  If you say Y here, additional code will be inserted into the
+	  timer routines to collect statistics about kernel timers being
+	  reprogrammed. The statistics can be read from /proc/timer_stats.
+	  Statistics collection is started by writing 1 to /proc/timer_stats;
+	  writing 0 stops it. This feature is useful to collect information
+	  about timer usage patterns in kernel and userspace.
+
 config DEBUG_SLAB
 	bool "Debug slab memory allocations"
 	depends on DEBUG_KERNEL && SLAB
Index: linux-2.6.18-mm2/Documentation/hrtimer/timer_stats.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-mm2/Documentation/hrtimer/timer_stats.txt	2006-10-02 00:55:55.000000000 +0200
@@ -0,0 +1,68 @@
+timer_stats - timer usage statistics
+------------------------------------
+
+timer_stats is a debugging facility to make the timer (ab)usage in a Linux
+system visible to kernel and userspace developers. It is not intended for
+production usage as it adds significant overhead to the (hr)timer code and the
+(hr)timer data structures.
+
+timer_stats should be used by kernel and userspace developers to verify that
+their code does not make undue use of timers. This helps to avoid unnecessary
+wakeups and thereby to reduce power consumption.
+
+It can be enabled by CONFIG_TIMER_STATS in the "Kernel hacking" configuration
+section.
+
+timer_stats collects information about the timer events which are fired in a
+Linux system over a sample period:
+
+- the pid of the task (process) which initialized the timer
+- the name of the process which initialized the timer
+- the function where the timer was initialized
+- the callback function which is associated with the timer
+- the number of events (callbacks)
+
+timer_stats adds an entry to /proc: /proc/timer_stats
+
+This entry is used to control the statistics functionality and to read out the
+sampled information.
+
+The timer_stats functionality is inactive on bootup.
+
+To activate a sample period issue:
+# echo 1 >/proc/timer_stats
+
+To stop a sample period issue:
+# echo 0 >/proc/timer_stats
+
+The statistics can be retrieved by:
+# cat /proc/timer_stats
+
+The readout of /proc/timer_stats automatically disables sampling. The sampled
+information is kept until a new sample period is started. This allows multiple
+readouts.
+
+Sample output of /proc/timer_stats:
+
+Timerstats sample period: 3.888770 s
+  12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
+  15,     1 swapper          hcd_submit_urb (rh_timer_func)
+   4,   959 kedac            schedule_timeout (process_timeout)
+   1,     0 swapper          page_writeback_init (wb_timer_fn)
+  28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
+  22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
+   3,  3100 bash             schedule_timeout (process_timeout)
+   1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
+   1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
+   1,     1 swapper          neigh_table_init_no_netlink (neigh_periodic_timer)
+   1,  2292 ip               __netdev_watchdog_up (dev_watchdog)
+   1,    23 events/1         do_cache_clean (delayed_work_timer_fn)
+90 total events, 30.0 events/sec
+
+The first column is the number of events, the second column the pid, the third
+column is the name of the process. The fourth column shows the function which
+initialized the timer and in parentheses the callback function which was
+executed on expiry.
+
+    Thomas, Ingo
+

--
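
The column layout described in the timer_stats documentation above can be
picked apart in userspace roughly as follows. This is only a sketch based on
the sample output shown in the patch: struct timer_stat, copy_trimmed() and
parse_timer_stat() are hypothetical names, the buffer sizes are arbitrary, and
the parser works backwards from the parenthesised callback because the process
name can itself contain a space ("IRQ 4").

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed entry line of /proc/timer_stats (field sizes are arbitrary). */
struct timer_stat {
	unsigned long count;	/* number of events */
	int pid;		/* pid that initialized the timer */
	char comm[32];		/* process name, may contain spaces */
	char start_func[64];	/* function that initialized the timer */
	char expire_func[64];	/* callback executed on expiry */
};

/* Copy [begin, end) into dst without surrounding blanks, NUL-terminated. */
static void copy_trimmed(char *dst, size_t size, const char *begin,
			 const char *end)
{
	size_t len;

	while (begin < end && *begin == ' ')
		begin++;
	while (end > begin && end[-1] == ' ')
		end--;
	len = (size_t)(end - begin);
	if (len >= size)
		len = size - 1;
	memcpy(dst, begin, len);
	dst[len] = '\0';
}

/* Returns 0 for an entry line, -1 for anything else (header, totals). */
static int parse_timer_stat(const char *line, struct timer_stat *ts)
{
	const char *rest, *open, *close, *fn_start, *fn_end;
	int used = 0;

	assert(line && ts);
	if (sscanf(line, " %lu, %d%n", &ts->count, &ts->pid, &used) != 2)
		return -1;
	rest = line + used;
	open = strrchr(rest, '(');
	close = strrchr(rest, ')');
	if (!open || !close || close < open)
		return -1;

	/* Walk back from '(' over blanks, then over the function name. */
	fn_end = open;
	while (fn_end > rest && fn_end[-1] == ' ')
		fn_end--;
	fn_start = fn_end;
	while (fn_start > rest && fn_start[-1] != ' ')
		fn_start--;
	if (fn_start == fn_end)
		return -1;

	copy_trimmed(ts->comm, sizeof(ts->comm), rest, fn_start);
	copy_trimmed(ts->start_func, sizeof(ts->start_func), fn_start, fn_end);
	copy_trimmed(ts->expire_func, sizeof(ts->expire_func), open + 1, close);
	return 0;
}
```

Fed the "IRQ 4" line from the sample output, this yields count 22, pid 2948,
the name "IRQ 4", tty_flip_buffer_push as the initializing function and
delayed_work_timer_fn as the callback. The "Timerstats sample period" and
"total events" lines are rejected with -1.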



* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (20 preceding siblings ...)
  2006-10-01 23:01 ` [patch 21/21] debugging feature: timer stats Thomas Gleixner
@ 2006-10-02  5:11 ` Valdis.Kletnieks
  2006-10-02 13:02 ` Valdis.Kletnieks
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 58+ messages in thread
From: Valdis.Kletnieks @ 2006-10-02  5:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: Type: text/plain, Size: 1044 bytes --]

On Sun, 01 Oct 2006 22:59:01 -0000, Thomas Gleixner said:

> the following patch series is an update in response to your review.

This complains if you try to compile with -Werror-implicit-function-declaration
and rightly so, as we're missing a #include to define IS_ERR_VALUE().

Patch attached.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>

--- linux-2.6.18-mm2/kernel/hrtimer.c.buggy	2006-10-02 00:46:50.000000000 -0400
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 01:02:55.000000000 -0400
@@ -43,6 +43,7 @@
 #include <linux/clockchips.h>
 #include <linux/profile.h>
 #include <linux/seq_file.h>
+#include <linux/err.h>
 
 #include <asm/uaccess.h>
 
--- linux-2.6.18-mm2/kernel/time/clockevents.c.buggy	2006-10-02 00:46:50.000000000 -0400
+++ linux-2.6.18-mm2/kernel/time/clockevents.c	2006-10-02 01:04:22.000000000 -0400
@@ -33,6 +33,7 @@
 #include <linux/profile.h>
 #include <linux/sysdev.h>
 #include <linux/hrtimer.h>
+#include <linux/err.h>
 
 #define MAX_CLOCK_EVENTS	4
 #define GLOBAL_CLOCK_EVENT	MAX_CLOCK_EVENTS


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* [patch] dynticks: core, NMI watchdog fix
  2006-10-01 23:01 ` [patch 17/21] dynticks: core Thomas Gleixner
@ 2006-10-02  6:41   ` Ingo Molnar
  2006-10-02  8:54     ` [patch] dynticks: core, NMI watchdog fix, #2 Ingo Molnar
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2006-10-02  6:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


Andrew: find below a fix for a bug that could cause lockups if NO_HZ, 
lockdep and the NMI watchdog are all activated. This patch comes after 
dynticks-core.patch. Compile and boot tested.

	Ingo

----------------->
Subject: dynticks: core, NMI watchdog fix
From: Ingo Molnar <mingo@elte.hu>

fix an NMI-watchdog interaction: we don't want to call
hrtimer_update_jiffies() in NMI contexts ...

create __irq_enter() (which is symmetric to __irq_exit())
and use it in nmi_enter() and irq_enter().

[ Note: like __irq_exit() it needs to be a macro because
  hardirq.h is included early on and types like struct
  task_struct are not available yet. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 include/linux/hardirq.h |   12 +++++++++++-
 kernel/softirq.c        |    5 +----
 2 files changed, 12 insertions(+), 5 deletions(-)

Index: linux/include/linux/hardirq.h
===================================================================
--- linux.orig/include/linux/hardirq.h
+++ linux/include/linux/hardirq.h
@@ -106,6 +106,16 @@ static inline void account_system_vtime(
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
+#define __irq_enter()					\
+	do {						\
+		account_system_vtime(current);		\
+		add_preempt_count(HARDIRQ_OFFSET);	\
+		trace_hardirq_enter();			\
+	} while (0)
+
+/*
+ * Enter irq context (on NO_HZ, update jiffies):
+ */
 extern void irq_enter(void);
 
 /*
@@ -123,7 +133,7 @@ extern void irq_enter(void);
  */
 extern void irq_exit(void);
 
-#define nmi_enter()		do { lockdep_off(); irq_enter(); } while (0)
+#define nmi_enter()		do { lockdep_off(); __irq_enter(); } while (0)
 #define nmi_exit()		do { __irq_exit(); lockdep_on(); } while (0)
 
 #endif /* LINUX_HARDIRQ_H */
Index: linux/kernel/softirq.c
===================================================================
--- linux.orig/kernel/softirq.c
+++ linux/kernel/softirq.c
@@ -278,10 +278,7 @@ EXPORT_SYMBOL(do_softirq);
  */
 void irq_enter(void)
 {
-	account_system_vtime(current);
-	add_preempt_count(HARDIRQ_OFFSET);
-	trace_hardirq_enter();
-
+	__irq_enter();
 #ifdef CONFIG_NO_HZ
 	if (idle_cpu(smp_processor_id()))
 		hrtimer_update_jiffies();
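
The note in the changelog about __irq_enter() having to be a macro can be
illustrated outside the kernel. The following is a simplified userspace
analogue, not kernel code: struct task, current_task, irq_depth_inc() and
enter_twice() are made-up names standing in for struct task_struct, current
and the real helpers. The point it tries to show is that a macro body is not
compiled until it is expanded, so it may mention members of a type that is
still incomplete where the macro is defined.

```c
#include <assert.h>

struct task;				/* only a forward declaration so far */
extern struct task *current_task;	/* hypothetical stand-in for `current` */

/*
 * Fine as a macro: the body is just text until it is expanded, so
 * nothing here needs the layout of struct task yet.
 */
#define irq_depth_inc()				\
	do {					\
		current_task->irq_depth++;	\
	} while (0)

/*
 * The same body as a function would be rejected at this point, because
 * the compiler would have to dereference an incomplete type right away:
 *
 *	static inline void irq_depth_inc_fn(void)
 *	{
 *		current_task->irq_depth++;
 *	}
 */

struct task {				/* the full definition arrives later */
	int irq_depth;
};

static struct task init_task;
struct task *current_task = &init_task;

/* By the time the macro is used, struct task is complete. */
static int enter_twice(void)
{
	assert(current_task);
	current_task->irq_depth = 0;
	irq_depth_inc();
	irq_depth_inc();
	return current_task->irq_depth;
}
```

Calling enter_twice() returns 2. Moving the commented-out inline function
above the struct definition into live code should make the compiler complain
about a pointer to an incomplete type, which is the situation the changelog
describes for hardirq.h.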


* [patch] dynticks: core, NMI watchdog fix, #2
  2006-10-02  6:41   ` [patch] dynticks: core, NMI watchdog fix Ingo Molnar
@ 2006-10-02  8:54     ` Ingo Molnar
  0 siblings, 0 replies; 58+ messages in thread
From: Ingo Molnar @ 2006-10-02  8:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


Andrew, there's one more fallout of the dyntick queue: the fix below 
goes after (or into) dynticks-core-nmi-watchdog-fix.patch. The bug only 
affected NO_HZ kernels.

	Ingo

----------------->
Subject: dynticks: core, NMI watchdog fix, #2
From: Ingo Molnar <mingo@elte.hu>

a partial fix of the NMI watchdog bug sneaked into yesterday night's
queue - this patch removes that extra in_interrupt() condition.

(One effect of this bug is a 'slow' serial console on NO_HZ, because we 
never update jiffies and the serial console's timeouts get confused by 
it.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/softirq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/softirq.c
===================================================================
--- linux.orig/kernel/softirq.c
+++ linux/kernel/softirq.c
@@ -280,7 +280,7 @@ void irq_enter(void)
 {
 	__irq_enter();
 #ifdef CONFIG_NO_HZ
-	if (idle_cpu(smp_processor_id()) && !in_interrupt())
+	if (idle_cpu(smp_processor_id()))
 		hrtimer_update_jiffies();
 #endif
 }
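
The changelog does not spell out why the extra in_interrupt() condition kept
jiffies from being updated. One plausible reading is that __irq_enter() has
already added HARDIRQ_OFFSET to the preempt count by the time the check runs,
so in_interrupt() is always true there and the branch is never taken. The
userspace model below is a sketch of that reading only: the bit layout is an
assumption modelled on include/linux/hardirq.h, and irq_enter_sim(),
irq_exit_sim() and count_updates() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: 8 preempt bits, 8 softirq bits, 12 hardirq bits. */
#define SOFTIRQ_SHIFT	8
#define HARDIRQ_SHIFT	16
#define SOFTIRQ_MASK	(0xffU << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK	(0xfffU << HARDIRQ_SHIFT)
#define HARDIRQ_OFFSET	(1U << HARDIRQ_SHIFT)

static unsigned int preempt_count;
static unsigned long jiffies_updates;

static bool in_interrupt(void)
{
	return (preempt_count & (HARDIRQ_MASK | SOFTIRQ_MASK)) != 0;
}

/* Model of irq_enter() on an idle CPU, with or without the extra check. */
static void irq_enter_sim(bool cpu_idle, bool extra_check)
{
	preempt_count += HARDIRQ_OFFSET;	/* done by __irq_enter() */
	if (cpu_idle && (!extra_check || !in_interrupt()))
		jiffies_updates++;	/* stands in for hrtimer_update_jiffies() */
}

static void irq_exit_sim(void)
{
	preempt_count -= HARDIRQ_OFFSET;
}

/* Take nr_irqs interrupts from idle and count the jiffies updates. */
static unsigned long count_updates(bool extra_check, int nr_irqs)
{
	int i;

	assert(preempt_count == 0);
	jiffies_updates = 0;
	for (i = 0; i < nr_irqs; i++) {
		irq_enter_sim(true, extra_check);
		irq_exit_sim();
	}
	return jiffies_updates;
}
```

In this model count_updates(true, 5) returns 0 and count_updates(false, 5)
returns 5, which would match the "we never update jiffies" symptom mentioned
in the changelog.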


* Re: [patch 16/21] high-res timers: core
  2006-10-01 23:01 ` [patch 16/21] high-res timers: core Thomas Gleixner
@ 2006-10-02 11:50   ` Paulo Marques
  0 siblings, 0 replies; 58+ messages in thread
From: Paulo Marques @ 2006-10-02 11:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

Thomas Gleixner wrote:
> [...]
> Index: linux-2.6.18-mm2/kernel/hrtimer.c
> ===================================================================
> --- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 00:55:53.000000000 +0200
> +++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 00:55:54.000000000 +0200
> @@ -38,7 +38,11 @@
>  #include <linux/hrtimer.h>
>  #include <linux/notifier.h>
>  #include <linux/syscalls.h>
> +#include <linux/kallsyms.h>

I'm not knowledgeable enough about the timer code to review these patches, but 
I always keep an eye out for kallsyms uses.

It seems that this include is unused. Maybe it was some debug stuff that 
got moved (or removed) later?

The patch to kernel/timer.c seems to have the same unused include, too.

-- 
Paulo Marques - www.grupopie.com

"The face of a child can say it all, especially the
mouth part of the face."


* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (21 preceding siblings ...)
  2006-10-02  5:11 ` [patch 00/21] high resolution timers / dynamic ticks - V2 Valdis.Kletnieks
@ 2006-10-02 13:02 ` Valdis.Kletnieks
  2006-10-02 13:43   ` Thomas Gleixner
  2006-10-03  3:23 ` [patch 00/21] high resolution timers / dynamic ticks - V2 Andrew Morton
  2006-10-03  4:00 ` Andrew Morton
  24 siblings, 1 reply; 58+ messages in thread
From: Valdis.Kletnieks @ 2006-10-02 13:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

[-- Attachment #1: Type: text/plain, Size: 1011 bytes --]

On Sun, 01 Oct 2006 22:59:01 -0000, Thomas Gleixner said:
> the following patch series is an update in response to your review.

First runtime results - no lockups or other severe badness in a half-hour or so
of running.

-mm2-hrt-dynticks5 shows severe clock drift issues if you run 'cpuspeed'.

Using speedstep-ich as a kernel built-in, and cpuspeed is invoked as:

cpuspeed -d -n -i 10 -p 10 50 -a /proc/acpi/ac_adapter/*/state

If cpuspeed drops the CPU speed from the default 1.6GHz down to 1.2GHz (the
only 2 speeds available on this core), the system clock proceeds to lose
about 15 seconds a minute.  I haven't dug further into why yet. (If the system
is busy so cpuspeed keeps the processor at 1.6GHz, the clock doesn't drift
as much - so it looks like a "when speed is 1.2GHz" issue...)

I'm also seeing gkrellm reporting about 25% CPU use when "near-idle" (X is up
but not much is going on) when that's usually down around 5-6%.  I need to
collect some oprofile numbers and investigate that as well.

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-02 13:02 ` Valdis.Kletnieks
@ 2006-10-02 13:43   ` Thomas Gleixner
  2006-10-02 18:25     ` Valdis.Kletnieks
  0 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-02 13:43 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

On Mon, 2006-10-02 at 09:02 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Sun, 01 Oct 2006 22:59:01 -0000, Thomas Gleixner said:
> > the following patch series is an update in response to your review.
> 
> First runtime results - no lockups or other severe badness in a half-hour or so
> of running.
> 
> -mm2-hrt-dynticks5 shows severe clock drift issues if you run 'cpuspeed'.
> 
> Using speedstep-ich as a kernel built-in, and cpuspeed is invoked as:
> 
> cpuspeed -d -n -i 10 -p 10 50 -a /proc/acpi/ac_adapter/*/state
> 
> If cpuspeed drops the CPU speed from the default 1.6GHz down to 1.2GHz (the
> only 2 speeds available on this core), the system clock proceeds to lose
> about 15 seconds a minute.  I haven't dug further into why yet. (If the system
> is busy so cpuspeed keeps the processor at 1.6GHz, the clock doesn't drift
> as much - so it looks like a "when speed is 1.2GHz" issue...)

Can you please send me the bootlog and further dmesg output (especially
when related to timers / cpufreq).

> I'm also seeing gkrellm reporting about 25% CPU use when "near-idle" (X is up
> but not much is going on) when that's usually down around 5-6%.  I need to
> collect some oprofile numbers and investigate that as well.

I'll look into the accounting fixups again.

	tglx


		



* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-02 13:43   ` Thomas Gleixner
@ 2006-10-02 18:25     ` Valdis.Kletnieks
  2006-10-02 18:38       ` john stultz
  2006-10-02 18:43       ` [patch] dynticks core: Fix idle time accounting Thomas Gleixner
  0 siblings, 2 replies; 58+ messages in thread
From: Valdis.Kletnieks @ 2006-10-02 18:25 UTC (permalink / raw)
  To: tglx
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones


[-- Attachment #1.1: Type: text/plain, Size: 1229 bytes --]

(Sorry for the size of the note, there's some 50K of logs attached)

On Mon, 02 Oct 2006 15:43:02 +0200, Thomas Gleixner said:
> Can you please send me the bootlog and further dmesg output (especially
> when related to timers / cpufreq).

I booted the box to single-user both times, and then started cpuspeed.
I then did a cat of /proc/interrupts, /proc/uptime, and a date command,
waited 60 seconds according to my watch, and repeated.  I then dumped
the dmesg.  The -dyntick kernel moved 'uptime' almost exactly 45 seconds
(almost certainly a by-product of running at 1.2GHz rather than 1.6GHz).
Does the dyntick code make any unwritten assumptions about a jiffy or
bogomips remaining constant?
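
The 45 second figure is what one would expect if time were being counted in
CPU cycles that were calibrated at 1.6GHz while the CPU actually runs at
1.2GHz. That is only the hypothesis raised above, not a confirmed diagnosis,
and apparent_seconds() is a hypothetical helper used to check the arithmetic:

```c
#include <assert.h>

/*
 * Seconds a clock appears to advance during real_seconds if it counts
 * cycles at cur_mhz but converts them using calibrated_mhz.
 */
static double apparent_seconds(double real_seconds, double cur_mhz,
			       double calibrated_mhz)
{
	assert(calibrated_mhz > 0.0);
	return real_seconds * cur_mhz / calibrated_mhz;
}
```

With real_seconds = 60, cur_mhz = 1200 and calibrated_mhz = 1600 this gives
45 seconds, i.e. 15 seconds lost per minute, which lines up with both the
earlier "about 15 seconds a minute" observation and the uptime measurement.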

Attached - config diff, date and /proc dumps from both -mm2 and -mm2-dyntick,
and the dmesg's from both boots.

Yell if you have any other questions/suggestions/etc..

> > I'm also seeing gkrellm reporting about 25% CPU use when "near-idle" (X is up
> > but not much is going on) when that's usually down around 5-6%.  I need to
> > collect some oprofile numbers and investigate that as well.
>
> I'll look into the accounting fixups again.

I still need to get oprofile runs of this and see what's going on.

[-- Attachment #1.2: config.diff --]
[-- Type: text/plain , Size: 996 bytes --]

--- linux-2.6.18-mm2/.config	2006-10-02 10:11:34.000000000 -0400
+++ linux-2.6.18-mm2-hrt-dyntick5/.config	2006-10-02 02:19:09.000000000 -0400
@@ -1,10 +1,11 @@
 #
 # Automatically generated make config: don't edit
-# Linux kernel version: 2.6.18-mm2
-# Mon Oct  2 10:11:34 2006
+# Linux kernel version: 2.6.18-mm2-hrt-dyntick5
+# Mon Oct  2 02:19:09 2006
 #
 CONFIG_X86_32=y
 CONFIG_GENERIC_TIME=y
+CONFIG_GENERIC_CLOCKEVENTS=y
 CONFIG_LOCKDEP_SUPPORT=y
 CONFIG_STACKTRACE_SUPPORT=y
 CONFIG_SEMAPHORE_SLEEPERS=y
@@ -103,6 +104,9 @@
 #
 # Processor type and features
 #
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_HIGH_RES_RESOLUTION=1000
+CONFIG_NO_HZ=y
 # CONFIG_SMP is not set
 CONFIG_X86_PC=y
 # CONFIG_X86_ELAN is not set
@@ -2007,6 +2011,7 @@
 CONFIG_LOG_BUF_SHIFT=17
 # CONFIG_DETECT_SOFTLOCKUP is not set
 # CONFIG_SCHEDSTATS is not set
+CONFIG_TIMER_STATS=y
 # CONFIG_DEBUG_SLAB is not set
 # CONFIG_DEBUG_PREEMPT is not set
 # CONFIG_DEBUG_RT_MUTEXES is not set

[-- Attachment #1.3: date.mm2 --]
[-- Type: text/plain , Size: 1390 bytes --]

           CPU0       
  0:      60357    XT-PIC-level    timer
  1:        183    XT-PIC-level    i8042
  2:          0    XT-PIC-level    cascade
  5:          0    XT-PIC-level    Intel 82801CA-ICH3
  6:          3    XT-PIC-level    floppy
  8:          1    XT-PIC-level    rtc
  9:          1    XT-PIC-level    acpi
 11:         39    XT-PIC-level    uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, ohci1394, yenta, yenta, yenta, pcmcia2.0
 12:        114    XT-PIC-level    i8042
 14:       2045    XT-PIC-level    libata
 15:          0    XT-PIC-level    libata
NMI:          0 
LOC:          0 
ERR:          0
MIS:          0
58.75 50.70
Mon Oct  2 11:32:07 EDT 2006
           CPU0       
  0:     120338    XT-PIC-level    timer
  1:        256    XT-PIC-level    i8042
  2:          0    XT-PIC-level    cascade
  5:          0    XT-PIC-level    Intel 82801CA-ICH3
  6:          3    XT-PIC-level    floppy
  8:          1    XT-PIC-level    rtc
  9:          1    XT-PIC-level    acpi
 11:         39    XT-PIC-level    uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, ohci1394, yenta, yenta, yenta, pcmcia2.0
 12:        114    XT-PIC-level    i8042
 14:       2058    XT-PIC-level    libata
 15:          0    XT-PIC-level    libata
NMI:          0 
LOC:          0 
ERR:          0
MIS:          0
116.04 110.64
Mon Oct  2 11:33:04 EDT 2006

[-- Attachment #1.4: date.mm2-dyntick --]
[-- Type: text/plain , Size: 1388 bytes --]

           CPU0       
  0:      15931    XT-PIC-level    timer
  1:        164    XT-PIC-level    i8042
  2:          0    XT-PIC-level    cascade
  5:          0    XT-PIC-level    Intel 82801CA-ICH3
  6:          3    XT-PIC-level    floppy
  8:          1    XT-PIC-level    rtc
  9:          0    XT-PIC-level    acpi
 11:         38    XT-PIC-level    uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, ohci1394, yenta, yenta, yenta, pcmcia2.0
 12:        114    XT-PIC-level    i8042
 14:       2066    XT-PIC-level    libata
 15:          0    XT-PIC-level    libata
NMI:          0 
LOC:          0 
ERR:          0
MIS:          0
73.89 6.84
Mon Oct  2 11:23:24 EDT 2006
           CPU0       
  0:      24314    XT-PIC-level    timer
  1:        205    XT-PIC-level    i8042
  2:          0    XT-PIC-level    cascade
  5:          0    XT-PIC-level    Intel 82801CA-ICH3
  6:          3    XT-PIC-level    floppy
  8:          1    XT-PIC-level    rtc
  9:          0    XT-PIC-level    acpi
 11:         38    XT-PIC-level    uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, ohci1394, yenta, yenta, yenta, pcmcia2.0
 12:        114    XT-PIC-level    i8042
 14:       2079    XT-PIC-level    libata
 15:          0    XT-PIC-level    libata
NMI:          0 
LOC:          0 
ERR:          0
MIS:          0
118.96 11.36
Mon Oct  2 11:24:09 EDT 2006
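
For comparison, the differences between the two snapshots in each of the dumps
above can be computed as follows. The numbers are copied from the IRQ 0 line,
the first field of /proc/uptime and the date output of each attachment; struct
sample and the helper functions are hypothetical names used only for this
sketch.

```c
#include <assert.h>

/* One snapshot: timer interrupts (IRQ 0), uptime, and `date` time of day. */
struct sample {
	unsigned long timer_irqs;
	double uptime_s;
	int wall_s;		/* seconds since midnight */
};

static int hms(int h, int m, int s)
{
	return (h * 60 + m) * 60 + s;
}

/* Field-wise difference between two snapshots. */
static struct sample sample_delta(struct sample before, struct sample after)
{
	struct sample d;

	assert(after.timer_irqs >= before.timer_irqs);
	d.timer_irqs = after.timer_irqs - before.timer_irqs;
	d.uptime_s = after.uptime_s - before.uptime_s;
	d.wall_s = after.wall_s - before.wall_s;
	return d;
}

/* Values from the date.mm2 attachment. */
static struct sample mm2_delta(void)
{
	struct sample a = { 60357, 58.75, hms(11, 32, 7) };
	struct sample b = { 120338, 116.04, hms(11, 33, 4) };

	return sample_delta(a, b);
}

/* Values from the date.mm2-dyntick attachment. */
static struct sample dyntick_delta(void)
{
	struct sample a = { 15931, 73.89, hms(11, 23, 24) };
	struct sample b = { 24314, 118.96, hms(11, 24, 9) };

	return sample_delta(a, b);
}
```

For the plain -mm2 kernel this gives 59981 timer interrupts, 57.29 s of uptime
and 57 s by date. For the dyntick kernel it gives 8383 timer interrupts,
45.07 s of uptime and 45 s by date, over what was meant to be the same 60
second wait.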

[-- Attachment #1.5: dmesg.mm2 --]
[-- Type: text/plain , Size: 27348 bytes --]

Linux version 2.6.18-mm2 (valdis@turing-police.cc.vt.edu) (gcc version 4.1.1 20060926 (Red Hat 4.1.1-26)) #1 PREEMPT Mon Oct 2 10:45:34 EDT 2006
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start: 0000000000000000 size: 000000000009fc00 end: 000000000009fc00 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000000009fc00 size: 0000000000000400 end: 00000000000a0000 type: 2
copy_e820_map() start: 0000000000100000 size: 000000002fee2800 end: 000000002ffe2800 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000002ffe2800 size: 000000000001d800 end: 0000000030000000 type: 2
copy_e820_map() start: 00000000feda0000 size: 0000000000060000 end: 00000000fee00000 type: 2
copy_e820_map() start: 00000000ffb80000 size: 0000000000480000 end: 0000000100000000 type: 2
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000002ffe2800 (usable)
 BIOS-e820: 000000002ffe2800 - 0000000030000000 (reserved)
 BIOS-e820: 00000000feda0000 - 00000000fee00000 (reserved)
 BIOS-e820: 00000000ffb80000 - 0000000100000000 (reserved)
767MB LOWMEM available.
Entering add_active_range(0, 0, 196578) 0 entries of 256 used
Zone PFN ranges:
  DMA             0 ->     4096
  Normal       4096 ->   196578
early_node_map[1] active PFN ranges
    0:        0 ->   196578
On node 0 totalpages: 196578
  DMA zone: 32 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 4064 pages, LIFO batch:0
  Normal zone: 1503 pages used for memmap
  Normal zone: 190979 pages, LIFO batch:31
DMI 2.3 present.
ACPI: RSDP (v000 DELL                                  ) @ 0x000fde50
ACPI: RSDT (v001 DELL    CPi R   0x27d40107 ASL  0x00000061) @ 0x000fde64
ACPI: FADT (v001 DELL    CPi R   0x27d40107 ASL  0x00000061) @ 0x000fde90
ACPI: DSDT (v001 INT430 SYSFexxx 0x00001001 MSFT 0x0100000e) @ 0x00000000
ACPI: PM-Timer IO Port: 0x808
Allocating PCI resources starting at 40000000 (gap: 30000000:ceda0000)
Detected 1595.436 MHz processor.
Built 1 zonelists.  Total pages: 195043
Kernel command line: vga=794 quiet crashkernel=64M@16M single
Local APIC disabled by BIOS -- you can enable it with "lapic"
mapped APIC to ffffd000 (05603000)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 16384 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 706936k/786312k available (2406k kernel code, 78804k reserved, 1041k data, 180k init, 0k highmem)
virtual kernel memory layout:
    fixmap  : 0xfffb7000 - 0xfffff000   ( 288 kB)
    vmalloc : 0xf0800000 - 0xfffb5000   ( 247 MB)
    lowmem  : 0xc0000000 - 0xeffe2000   ( 767 MB)
      .init : 0xc04e6000 - 0xc0513000   ( 180 kB)
      .data : 0xc0359960 - 0xc045e0a8   (1041 kB)
      .text : 0xc0100000 - 0xc0359960   (2406 kB)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 3192.27 BogoMIPS (lpj=1596136)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 3febf9ff 00000000 00000000 00000000 00000000 00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: After all inits, caps: 3febf9ff 00000000 00000000 00000080 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU0: Intel P4/Xeon Extended MCE MSRs (12) available
CPU0: Thermal monitoring enabled
CPU: Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz stepping 04
Checking 'hlt' instruction... OK.
ACPI: Core revision 20060707
ACPI: setting ELCR to 0200 (from 0800)
checking if image is initramfs... it is
Freeing initrd memory: 1824k freed
NET: Registered protocol family 16
ACPI: ACPI Dock Station Driver 
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfbfee, last bus=2
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
ACPI: Assume root bridge [\_SB_.PCI0] bus is 0
PCI quirk: region 0800-087f claimed by ICH4 ACPI/GPIO/TCO
PCI quirk: region 0880-08bf claimed by ICH4 GPIO
PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.1
Boot video device is 0000:01:00.0
PCI: Transparent bridge - 0000:00:1e.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 9 10 *11)
ACPI: PCI Interrupt Link [LNKB] (IRQs 5 7) *11
ACPI: PCI Interrupt Link [LNKC] (IRQs 9 10 *11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 5 7 9 10 *11)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCIE._PRT]
ACPI: Power Resource [PADA] (on)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 15 devices
Intel 82802 RNG detected
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
pnp: 00:02: ioport range 0x4d0-0x4d1 has been reserved
pnp: 00:02: ioport range 0x800-0x805 could not be reserved
pnp: 00:02: ioport range 0x808-0x80f could not be reserved
pnp: 00:03: ioport range 0x806-0x807 has been reserved
pnp: 00:03: ioport range 0x810-0x85f could not be reserved
pnp: 00:03: ioport range 0x860-0x87f has been reserved
pnp: 00:03: ioport range 0x880-0x8bf has been reserved
pnp: 00:03: ioport range 0x8c0-0x8df has been reserved
pnp: 00:03: ioport range 0x8e0-0x8ff has been reserved
pnp: 00:08: ioport range 0x900-0x91f has been reserved
pnp: 00:08: ioport range 0x3f0-0x3f1 has been reserved
PCI: Bridge: 0000:00:01.0
  IO window: c000-cfff
  MEM window: fc000000-fdffffff
  PREFETCH window: d8000000-e7ffffff
PCI: Bus 3, cardbus bridge: 0000:02:01.0
  IO window: 0000e000-0000e0ff
  IO window: 0000e400-0000e4ff
  PREFETCH window: 40000000-41ffffff
  MEM window: f4000000-f5ffffff
PCI: Bus 7, cardbus bridge: 0000:02:01.1
  IO window: 0000e800-0000e8ff
  IO window: 0000f000-0000f0ff
  PREFETCH window: 42000000-43ffffff
  MEM window: f6000000-f7ffffff
PCI: Bus 11, cardbus bridge: 0000:02:03.0
  IO window: 0000f400-0000f4ff
  IO window: 0000f800-0000f8ff
  PREFETCH window: 44000000-45ffffff
  MEM window: fa000000-fbffffff
PCI: Bridge: 0000:00:1e.0
  IO window: e000-ffff
  MEM window: f4000000-fbffffff
  PREFETCH window: 40000000-46ffffff
PCI: Setting latency timer of device 0000:00:1e.0 to 64
PCI: Enabling device 0000:02:01.0 (0000 -> 0003)
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
PCI: setting IRQ 11 as level-triggered
ACPI: PCI Interrupt 0000:02:01.0[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
PCI: Enabling device 0000:02:01.1 (0000 -> 0003)
ACPI: PCI Interrupt 0000:02:01.1[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
ACPI: PCI Interrupt 0000:02:03.0[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 7, 524288 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
Machine check exception polling timer started.
speedstep: frequency transition measured seems out of range (0 nSec), falling back to a safe one of 500000 nSec.
audit: initializing netlink socket (disabled)
audit(1159803069.386:1): initialized
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
vesafb: framebuffer at 0xe0000000, mapped to 0xf0880000, using 5120k, total 32768k
vesafb: mode is 1280x1024x16, linelength=2560, pages=1
vesafb: protected mode interface info at c000:e140
vesafb: pmi: set display start = c00ce185, set palette = c00ce20a
vesafb: pmi: ports = b4c3 b503 ba03 c003 c103 c403 c503 c603 c703 c803 c903 cc03 ce03 cf03 d003 d103 d203 d303 d403 d503 da03 ff03 
vesafb: scrolling: redraw
vesafb: Truecolor: size=0:5:6:5, shift=0:11:5:0
Console: switching to colour frame buffer device 160x64
fb0: VESA VGA frame buffer device
ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
Hangcheck: starting hangcheck timer 0.9.0 (tick is 180 seconds, margin is 60 seconds).
Hangcheck: Using get_cycles().
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:0c: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 5
PCI: setting IRQ 5 as level-triggered
ACPI: PCI Interrupt 0000:00:1f.6[B] -> Link [LNKB] -> GSI 5 (level, low) -> IRQ 5
ACPI: PCI interrupt for device 0000:00:1f.6 disabled
RAMDISK driver initialized: 16 RAM disks of 10240K size 1024 blocksize
loop: loaded (max 8 devices)
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
ACPI: PCI Interrupt 0000:02:00.0[A] -> Link [LNKC] -> GSI 11 (level, low) -> IRQ 11
3c59x: Donald Becker and others. www.scyld.com/network/vortex.html
0000:02:00.0: 3Com PCI 3c905C Tornado at f0804c00.
ACPI: PCI Interrupt 0000:02:08.0[A] -> Link [LNKC] -> GSI 11 (level, low) -> IRQ 11
0000:02:08.0: 3Com PCI 3c905C Tornado at f0806800.
libata version 2.00 loaded.
ata_piix 0000:00:1f.1: version 2.00ac7
PCI: Enabling device 0000:00:1f.1 (0005 -> 0007)
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 11
ACPI: PCI Interrupt 0000:00:1f.1[A] -> Link [LNKA] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:00:1f.1 to 64
ata1: PATA max UDMA/100 cmd 0x1F0 ctl 0x3F6 bmdma 0xBFA0 irq 14
ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xBFA8 irq 15
scsi0 : ata_piix
ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt 0000:00:1d.0[A] -> Link [LNKA] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:00:1d.0 to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:1d.0: irq 11, io base 0x0000bf80
usb usb1: new device found, idVendor=0000, idProduct=0000
usb usb1: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb1: Product: UHCI Host Controller
usb usb1: Manufacturer: Linux 2.6.18-mm2 uhci_hcd
usb usb1: SerialNumber: 0000:00:1d.0
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.1[B] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:00:1d.1 to 64
uhci_hcd 0000:00:1d.1: UHCI Host Controller
uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:1d.1: irq 11, io base 0x0000bf40
usb usb2: new device found, idVendor=0000, idProduct=0000
usb usb2: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: UHCI Host Controller
usb usb2: Manufacturer: Linux 2.6.18-mm2 uhci_hcd
usb usb2: SerialNumber: 0000:00:1d.1
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.2[C] -> Link [LNKC] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:00:1d.2 to 64
uhci_hcd 0000:00:1d.2: UHCI Host Controller
uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1d.2: irq 11, io base 0x0000bf20
usb usb3: new device found, idVendor=0000, idProduct=0000
usb usb3: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb3: Product: UHCI Host Controller
usb usb3: Manufacturer: Linux 2.6.18-mm2 uhci_hcd
usb usb3: SerialNumber: 0000:00:1d.2
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
usbcore: registered new interface driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: PS/2 Controller [PNP0303:KBC,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
mice: PS/2 mouse device common for all mice
input: PC Speaker as /class/input/input0
input: AT Translated Set 2 keyboard as /class/input/input1
ata1.00: ATA-6, max UDMA/100, 117210240 sectors: LBA 
ata1.00: ata1: dev 0 multi count 8
ata1.01: ATAPI, max UDMA/33
ata1.00: configured for UDMA/100
usb 2-2: new low speed USB device using uhci_hcd and address 2
ata1.01: configured for UDMA/33
scsi1 : ata_piix
usb 2-2: new device found, idVendor=045e, idProduct=0023
usb 2-2: new device strings: Mfr=1, Product=2, SerialNumber=0
usb 2-2: Product: Microsoft Trackball Optical®
usb 2-2: Manufacturer: Microsoft
usb 2-2: configuration #1 chosen from 1 choice
input: Microsoft Microsoft Trackball Optical® as /class/input/input2
input: USB HID v1.00 Mouse [Microsoft Microsoft Trackball Optical®] on usb-0000:00:1d.1-2
input: DualPoint Stick as /class/input/input3
input: AlpsPS/2 ALPS DualPoint TouchPad as /class/input/input4
scsi 0:0:0:0: Direct-Access     ATA      FUJITSU MHV2060A 0000 PQ: 0 ANSI: 5
SCSI device sda: 117210240 512-byte hdwr sectors (60012 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 117210240 512-byte hdwr sectors (60012 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
 sda:<6>i2c /dev entries driver
device-mapper: ioctl: 4.10.0-ioctl (2006-09-14) initialised: dm-devel@redhat.com
EDAC MC: Ver: 2.0.1 Oct  2 2006
Advanced Linux Sound Architecture Driver Version 1.0.12rc1 (Thu Jun 22 13:55:50 2006 UTC).
ACPI: PCI Interrupt 0000:00:1f.5[B] -> Link [LNKB] -> GSI 5 (level, low) -> IRQ 5
PCI: Setting latency timer of device 0000:00:1f.5 to 64
ALSA device list:
  No soundcards found.
TCP bic registered
Initializing XFRM netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
NET: Registered protocol family 17
NET: Registered protocol family 15
Using IPI Shortcut mode
Time: tsc clocksource has been installed.
Freeing unused kernel memory: 180k freed
Write protecting the kernel read-only data: 433k
 sda1 sda2
sd 0:0:0:0: Attached scsi disk sda
scsi 0:0:1:0: CD-ROM            TOSHIBA  CDRW/DVD SDR2102 1D13 PQ: 0 ANSI: 5
sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
sr 0:0:1:0: Attached scsi CD-ROM sr0
intel8x0_measure_ac97_clock: measured 50385 usecs
intel8x0: clocking to 48000
video bus notify
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
security:  6 users, 5 roles, 2013 types, 80 bools, 1 sens, 1024 cats
security:  58 classes, 111120 rules
SELinux:  Completing initialization.
SELinux:  Setting up existing superblocks.
SELinux: initialized (dev dm-0, type ext3), uses xattr
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
SELinux: initialized (dev selinuxfs, type selinuxfs), uses genfs_contexts
SELinux: initialized (dev mqueue, type mqueue), uses transition SIDs
SELinux: initialized (dev devpts, type devpts), uses transition SIDs
SELinux: initialized (dev eventpollfs, type eventpollfs), uses task SIDs
SELinux: initialized (dev inotifyfs, type inotifyfs), uses genfs_contexts
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev debugfs, type debugfs), uses genfs_contexts
SELinux: initialized (dev futexfs, type futexfs), uses genfs_contexts
SELinux: initialized (dev pipefs, type pipefs), uses task SIDs
SELinux: initialized (dev sockfs, type sockfs), uses task SIDs
SELinux: initialized (dev proc, type proc), uses genfs_contexts
SELinux: initialized (dev bdev, type bdev), uses genfs_contexts
SELinux: initialized (dev rootfs, type rootfs), uses genfs_contexts
SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
audit(1159803074.616:2): policy loaded auid=4294967295
audit(1159803074.993:3): avc:  denied  { execute } for  pid=458 comm="rc.sysinit" name="hostname" dev=dm-0 ino=33031 scontext=system_u:system_r:initrc_t:s0 tcontext=system_u:object_r:hostname_exec_t:s0 tclass=file
audit(1159803074.993:4): avc:  denied  { execute_no_trans } for  pid=458 comm="rc.sysinit" name="hostname" dev=dm-0 ino=33031 scontext=system_u:system_r:initrc_t:s0 tcontext=system_u:object_r:hostname_exec_t:s0 tclass=file
audit(1159803074.993:5): avc:  denied  { read } for  pid=458 comm="rc.sysinit" name="hostname" dev=dm-0 ino=33031 scontext=system_u:system_r:initrc_t:s0 tcontext=system_u:object_r:hostname_exec_t:s0 tclass=file
Real Time Clock Driver v1.12ac
audit(1159803078.140:6): avc:  denied  { dac_override } for  pid=488 comm="dmesg" capability=1 scontext=system_u:system_r:dmesg_t:s0 tcontext=system_u:system_r:dmesg_t:s0 tclass=capability
audit(1159803079.947:7): avc:  denied  { dac_override } for  pid=559 comm="pam_console_app" capability=1 scontext=system_u:system_r:pam_console_t:s0-s0:c0.c1023 tcontext=system_u:system_r:pam_console_t:s0-s0:c0.c1023 tclass=capability
ACPI: PCI Interrupt 0000:02:01.2[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[11]  MMIO=[f8fff000-f8fff7ff]  Max Packet=[2048]  IR/IT contexts=[4/4]
Yenta: CardBus bridge found at 0000:02:01.0 [1028:00d5]
Yenta: Using CSCINT to route CSC interrupts to PCI
Yenta: Routing CardBus interrupts to PCI
Yenta TI: socket 0000:02:01.0, mfunc 0x05033002, devctl 0x64
Yenta: CardBus bridge found at 0000:02:01.1 [1028:00d5]
Yenta: Using CSCINT to route CSC interrupts to PCI
Yenta: Routing CardBus interrupts to PCI
Yenta TI: socket 0000:02:01.1, mfunc 0x05033002, devctl 0x64
Yenta: CardBus bridge found at 0000:02:03.0 [12a3:ab01]
Yenta: Enabling burst memory read transactions
Yenta: Using CSCINT to route CSC interrupts to PCI
Yenta: Routing CardBus interrupts to PCI
Yenta TI: socket 0000:02:03.0, mfunc 0x01000002, devctl 0x60
iTCO_wdt: Intel TCO WatchDog Timer Driver v1.00 (30-Jul-2006)
iTCO_wdt: Found a ICH3-M TCO device (Version=1, TCOBASE=0x0860)
iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
Linux agpgart interface v0.101 (c) Dave Jones
Yenta: ISA IRQ mask 0x0498, PCI irq 11
Socket status: 30000020
pcmcia: parent PCI bridge I/O window: 0xe000 - 0xffff
cs: IO port probe 0xe000-0xffff: clean.
pcmcia: parent PCI bridge Memory window: 0xf4000000 - 0xfbffffff
pcmcia: parent PCI bridge Memory window: 0x40000000 - 0x46ffffff
ohci1394: fw-host0: Running dma failed because Node ID is not valid
Yenta: ISA IRQ mask 0x0498, PCI irq 11
Socket status: 30000006
pcmcia: parent PCI bridge I/O window: 0xe000 - 0xffff
cs: IO port probe 0xe000-0xffff: clean.
pcmcia: parent PCI bridge Memory window: 0xf4000000 - 0xfbffffff
pcmcia: parent PCI bridge Memory window: 0x40000000 - 0x46ffffff
agpgart: Detected an Intel i845 Chipset.
agpgart: AGP aperture is 64M @ 0xe8000000
Yenta: ISA IRQ mask 0x0000, PCI irq 11
Socket status: 30000010
pcmcia: parent PCI bridge I/O window: 0xe000 - 0xffff
cs: IO port probe 0xe000-0xffff: clean.
pcmcia: parent PCI bridge Memory window: 0xf4000000 - 0xfbffffff
pcmcia: parent PCI bridge Memory window: 0x40000000 - 0x46ffffff
pccard: CardBus card inserted into slot 0
PCI: Enabling device 0000:03:00.0 (0000 -> 0003)
ACPI: PCI Interrupt 0000:03:00.0[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:03:00.0 to 64
eth2: Xircom cardbus revision 3 at irq 11 
PCI: Enabling device 0000:03:00.1 (0000 -> 0003)
ACPI: PCI Interrupt 0000:03:00.1[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
0000:03:00.1: ttyS1 at I/O 0xe080 (irq = 11) is a 16550A
ohci1394: fw-host0: AT dma reset ctx=0, aborting transmission
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
pccard: PCMCIA card inserted into slot 2
ieee1394: Host added: ID:BUS[0-00:1023]  GUID[374fc0002a71c021]
cs: memory probe 0xf4000000-0xfbffffff: excluding 0xf4000000-0xf8ffffff 0xfa000000-0xfbffffff
pcmcia: registering new device pcmcia2.0
orinoco 0.15 (David Gibson <hermes@gibson.dropbear.id.au>, Pavel Roskin <proski@gnu.org>, et al)
orinoco_cs 0.15 (David Gibson <hermes@gibson.dropbear.id.au>, Pavel Roskin <proski@gnu.org>, et al)
pcmcia: request for exclusive IRQ could not be fulfilled.
pcmcia: the driver needs updating to supported shared IRQ lines.
cs: IO port probe 0x100-0x3af: excluding 0x370-0x37f
cs: IO port probe 0x3e0-0x4ff: clean.
cs: IO port probe 0x820-0x8ff: clean.
cs: IO port probe 0xc00-0xcf7: clean.
cs: IO port probe 0xa00-0xaff: clean.
cs: IO port probe 0x100-0x3af: excluding 0x370-0x37f
cs: IO port probe 0x3e0-0x4ff: clean.
cs: IO port probe 0x820-0x8ff: clean.
cs: IO port probe 0xc00-0xcf7: clean.
cs: IO port probe 0xa00-0xaff: clean.
cs: IO port probe 0x100-0x3af: excluding 0x370-0x37f
cs: IO port probe 0x3e0-0x4ff: clean.
cs: IO port probe 0x820-0x8ff: clean.
cs: IO port probe 0xc00-0xcf7: clean.
cs: IO port probe 0xa00-0xaff: clean.
eth2: Hardware identity 0005:0004:0005:0000
eth2: Station identity  001f:0001:0008:000a
eth2: Firmware determined as Lucent/Agere 8.10
eth2: Ad-hoc demo mode supported
eth2: IEEE standard IBSS ad-hoc mode supported
eth2: WEP supported, 104-bit key
eth2: MAC address 00:02:2D:5C:11:48
eth2: Station name "HERMES I"
eth2: ready
eth2: orinoco_cs at 2.0, irq 11, io 0xe100-0xe13f
Non-volatile memory driver v1.2
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
Dell laptop SMM driver v1.14 21/02/2005 Massimo Dal Zotto (dz@debian.org)
Netfilter messages via NETLINK v0.30.
audit(1159803085.289:8): avc:  denied  { getattr } for  pid=457 comm="rc.sysinit" name="hostname" dev=dm-0 ino=33031 scontext=system_u:system_r:initrc_t:s0 tcontext=system_u:object_r:hostname_exec_t:s0 tclass=file
ACPI: AC Adapter [AC] (on-line)
ACPI: Battery Slot [BAT0] (battery absent)
ACPI: Battery Slot [BAT1] (battery absent)
ACPI: Lid Switch [LID]
ACPI: Power Button (CM) [PBTN]
ACPI: Sleep Button (CM) [SBTN]
Using specific hotkey driver
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: Thermal Zone [THM] (67 C)
EXT3 FS on dm-0, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev sda1, type ext3), uses xattr
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-11, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-11, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-7, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-8, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-9, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-10, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-10, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-6, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-4, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-4, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-5, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-1, type ext3), uses xattr
Adding 1572856k swap on /dev/rootvg/swap.  Priority:-1 extents:1 across:1572856k
SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts
audit(1159803121.098:9): avc:  denied  { getattr } for  pid=1508 comm="bash" name="cpuspeed" dev=dm-0 ino=197100 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159803121.098:10): avc:  denied  { execute } for  pid=1508 comm="bash" name="cpuspeed" dev=dm-0 ino=197100 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159803121.098:11): avc:  denied  { read } for  pid=1508 comm="bash" name="cpuspeed" dev=dm-0 ino=197100 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159803122.483:12): avc:  denied  { execute_no_trans } for  pid=1535 comm="bash" name="cpuspeed" dev=dm-0 ino=197100 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159803122.527:13): avc:  denied  { ioctl } for  pid=1535 comm="cpuspeed" name="cpuspeed" dev=dm-0 ino=197100 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159803122.572:14): avc:  denied  { getattr } for  pid=1535 comm="cpuspeed" name="cpuspeed" dev=dm-7 ino=295240 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:cpuspeed_exec_t:s0 tclass=file
audit(1159803122.617:15): avc:  denied  { execute } for  pid=1539 comm="bash" name="cpuspeed" dev=dm-7 ino=295240 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:cpuspeed_exec_t:s0 tclass=file
audit(1159803122.617:16): avc:  denied  { read } for  pid=1539 comm="bash" name="cpuspeed" dev=dm-7 ino=295240 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:cpuspeed_exec_t:s0 tclass=file
audit(1159803122.617:17): avc:  denied  { execute_no_trans } for  pid=1540 comm="bash" name="cpuspeed" dev=dm-7 ino=295240 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:cpuspeed_exec_t:s0 tclass=file
TSC appears to be running slowly. Marking it as unstable
Time: acpi_pm clocksource has been installed.

[-- Attachment #1.6: dmesg.mm2-dyntick --]
[-- Type: text/plain , Size: 27438 bytes --]

Linux version 2.6.18-mm2-hrt-dyntick5 (valdis@turing-police.cc.vt.edu) (gcc version 4.1.1 20060926 (Red Hat 4.1.1-26)) #1 PREEMPT Mon Oct 2 02:41:10 EDT 2006
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start: 0000000000000000 size: 000000000009fc00 end: 000000000009fc00 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000000009fc00 size: 0000000000000400 end: 00000000000a0000 type: 2
copy_e820_map() start: 0000000000100000 size: 000000002fee2800 end: 000000002ffe2800 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000002ffe2800 size: 000000000001d800 end: 0000000030000000 type: 2
copy_e820_map() start: 00000000feda0000 size: 0000000000060000 end: 00000000fee00000 type: 2
copy_e820_map() start: 00000000ffb80000 size: 0000000000480000 end: 0000000100000000 type: 2
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000002ffe2800 (usable)
 BIOS-e820: 000000002ffe2800 - 0000000030000000 (reserved)
 BIOS-e820: 00000000feda0000 - 00000000fee00000 (reserved)
 BIOS-e820: 00000000ffb80000 - 0000000100000000 (reserved)
767MB LOWMEM available.
Entering add_active_range(0, 0, 196578) 0 entries of 256 used
Zone PFN ranges:
  DMA             0 ->     4096
  Normal       4096 ->   196578
early_node_map[1] active PFN ranges
    0:        0 ->   196578
On node 0 totalpages: 196578
  DMA zone: 32 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 4064 pages, LIFO batch:0
  Normal zone: 1503 pages used for memmap
  Normal zone: 190979 pages, LIFO batch:31
DMI 2.3 present.
ACPI: RSDP (v000 DELL                                  ) @ 0x000fde50
ACPI: RSDT (v001 DELL    CPi R   0x27d40107 ASL  0x00000061) @ 0x000fde64
ACPI: FADT (v001 DELL    CPi R   0x27d40107 ASL  0x00000061) @ 0x000fde90
ACPI: DSDT (v001 INT430 SYSFexxx 0x00001001 MSFT 0x0100000e) @ 0x00000000
ACPI: PM-Timer IO Port: 0x808
Allocating PCI resources starting at 40000000 (gap: 30000000:ceda0000)
Detected 1595.398 MHz processor.
Built 1 zonelists.  Total pages: 195043
Kernel command line: vga=794 quiet crashkernel=64M@16M single
Local APIC disabled by BIOS -- you can enable it with "lapic"
mapped APIC to ffffd000 (05603000)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
Clock event device pit configured with caps set: 07
PID hash table entries: 4096 (order: 12, 16384 bytes)
Console: colour dummy device 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 706888k/786312k available (2412k kernel code, 78852k reserved, 1044k data, 180k init, 0k highmem)
virtual kernel memory layout:
    fixmap  : 0xfffb7000 - 0xfffff000   ( 288 kB)
    vmalloc : 0xf0800000 - 0xfffb5000   ( 247 MB)
    lowmem  : 0xc0000000 - 0xeffe2000   ( 767 MB)
      .init : 0xc04e8000 - 0xc0515000   ( 180 kB)
      .data : 0xc035b03f - 0xc0460128   (1044 kB)
      .text : 0xc0100000 - 0xc035b03f   (2412 kB)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 3192.28 BogoMIPS (lpj=1596144)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 3febf9ff 00000000 00000000 00000000 00000000 00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: After all inits, caps: 3febf9ff 00000000 00000000 00000080 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU0: Intel P4/Xeon Extended MCE MSRs (12) available
CPU0: Thermal monitoring enabled
CPU: Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz stepping 04
Checking 'hlt' instruction... OK.
ACPI: Core revision 20060707
ACPI: setting ELCR to 0200 (from 0800)
checking if image is initramfs... it is
Freeing initrd memory: 1824k freed
NET: Registered protocol family 16
ACPI: ACPI Dock Station Driver 
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfbfee, last bus=2
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
ACPI: Assume root bridge [\_SB_.PCI0] bus is 0
PCI quirk: region 0800-087f claimed by ICH4 ACPI/GPIO/TCO
PCI quirk: region 0880-08bf claimed by ICH4 GPIO
PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.1
Boot video device is 0000:01:00.0
PCI: Transparent bridge - 0000:00:1e.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 9 10 *11)
ACPI: PCI Interrupt Link [LNKB] (IRQs 5 7) *11
ACPI: PCI Interrupt Link [LNKC] (IRQs 9 10 *11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 5 7 9 10 *11)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCIE._PRT]
ACPI: Power Resource [PADA] (on)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 15 devices
Intel 82802 RNG detected
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
pnp: 00:02: ioport range 0x4d0-0x4d1 has been reserved
pnp: 00:02: ioport range 0x800-0x805 could not be reserved
pnp: 00:02: ioport range 0x808-0x80f could not be reserved
pnp: 00:03: ioport range 0x806-0x807 has been reserved
pnp: 00:03: ioport range 0x810-0x85f could not be reserved
pnp: 00:03: ioport range 0x860-0x87f has been reserved
pnp: 00:03: ioport range 0x880-0x8bf has been reserved
pnp: 00:03: ioport range 0x8c0-0x8df has been reserved
pnp: 00:03: ioport range 0x8e0-0x8ff has been reserved
pnp: 00:08: ioport range 0x900-0x91f has been reserved
pnp: 00:08: ioport range 0x3f0-0x3f1 has been reserved
PCI: Bridge: 0000:00:01.0
  IO window: c000-cfff
  MEM window: fc000000-fdffffff
  PREFETCH window: d8000000-e7ffffff
PCI: Bus 3, cardbus bridge: 0000:02:01.0
  IO window: 0000e000-0000e0ff
  IO window: 0000e400-0000e4ff
  PREFETCH window: 40000000-41ffffff
  MEM window: f4000000-f5ffffff
PCI: Bus 7, cardbus bridge: 0000:02:01.1
  IO window: 0000e800-0000e8ff
  IO window: 0000f000-0000f0ff
  PREFETCH window: 42000000-43ffffff
  MEM window: f6000000-f7ffffff
PCI: Bus 11, cardbus bridge: 0000:02:03.0
  IO window: 0000f400-0000f4ff
  IO window: 0000f800-0000f8ff
  PREFETCH window: 44000000-45ffffff
  MEM window: fa000000-fbffffff
PCI: Bridge: 0000:00:1e.0
  IO window: e000-ffff
  MEM window: f4000000-fbffffff
  PREFETCH window: 40000000-46ffffff
PCI: Setting latency timer of device 0000:00:1e.0 to 64
PCI: Enabling device 0000:02:01.0 (0000 -> 0003)
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
PCI: setting IRQ 11 as level-triggered
ACPI: PCI Interrupt 0000:02:01.0[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
PCI: Enabling device 0000:02:01.1 (0000 -> 0003)
ACPI: PCI Interrupt 0000:02:01.1[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
ACPI: PCI Interrupt 0000:02:03.0[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 7, 524288 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
Machine check exception polling timer started.
speedstep: frequency transition measured seems out of range (0 nSec), falling back to a safe one of 500000 nSec.
audit: initializing netlink socket (disabled)
audit(1159802531.385:1): initialized
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
vesafb: framebuffer at 0xe0000000, mapped to 0xf0880000, using 5120k, total 32768k
vesafb: mode is 1280x1024x16, linelength=2560, pages=1
vesafb: protected mode interface info at c000:e140
vesafb: pmi: set display start = c00ce185, set palette = c00ce20a
vesafb: pmi: ports = b4c3 b503 ba03 c003 c103 c403 c503 c603 c703 c803 c903 cc03 ce03 cf03 d003 d103 d203 d303 d403 d503 da03 ff03 
vesafb: scrolling: redraw
vesafb: Truecolor: size=0:5:6:5, shift=0:11:5:0
Console: switching to colour frame buffer device 160x64
fb0: VESA VGA frame buffer device
ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
Hangcheck: starting hangcheck timer 0.9.0 (tick is 180 seconds, margin is 60 seconds).
Hangcheck: Using get_cycles().
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:0c: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 5
PCI: setting IRQ 5 as level-triggered
ACPI: PCI Interrupt 0000:00:1f.6[B] -> Link [LNKB] -> GSI 5 (level, low) -> IRQ 5
ACPI: PCI interrupt for device 0000:00:1f.6 disabled
RAMDISK driver initialized: 16 RAM disks of 10240K size 1024 blocksize
loop: loaded (max 8 devices)
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
ACPI: PCI Interrupt 0000:02:00.0[A] -> Link [LNKC] -> GSI 11 (level, low) -> IRQ 11
3c59x: Donald Becker and others. www.scyld.com/network/vortex.html
0000:02:00.0: 3Com PCI 3c905C Tornado at f0804c00.
ACPI: PCI Interrupt 0000:02:08.0[A] -> Link [LNKC] -> GSI 11 (level, low) -> IRQ 11
0000:02:08.0: 3Com PCI 3c905C Tornado at f0806800.
libata version 2.00 loaded.
ata_piix 0000:00:1f.1: version 2.00ac7
PCI: Enabling device 0000:00:1f.1 (0005 -> 0007)
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 11
ACPI: PCI Interrupt 0000:00:1f.1[A] -> Link [LNKA] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:00:1f.1 to 64
ata1: PATA max UDMA/100 cmd 0x1F0 ctl 0x3F6 bmdma 0xBFA0 irq 14
ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xBFA8 irq 15
scsi0 : ata_piix
ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt 0000:00:1d.0[A] -> Link [LNKA] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:00:1d.0 to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:1d.0: irq 11, io base 0x0000bf80
usb usb1: new device found, idVendor=0000, idProduct=0000
usb usb1: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb1: Product: UHCI Host Controller
usb usb1: Manufacturer: Linux 2.6.18-mm2-hrt-dyntick5 uhci_hcd
usb usb1: SerialNumber: 0000:00:1d.0
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.1[B] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:00:1d.1 to 64
uhci_hcd 0000:00:1d.1: UHCI Host Controller
uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:1d.1: irq 11, io base 0x0000bf40
usb usb2: new device found, idVendor=0000, idProduct=0000
usb usb2: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: UHCI Host Controller
usb usb2: Manufacturer: Linux 2.6.18-mm2-hrt-dyntick5 uhci_hcd
usb usb2: SerialNumber: 0000:00:1d.1
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.2[C] -> Link [LNKC] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:00:1d.2 to 64
uhci_hcd 0000:00:1d.2: UHCI Host Controller
uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1d.2: irq 11, io base 0x0000bf20
usb usb3: new device found, idVendor=0000, idProduct=0000
usb usb3: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb3: Product: UHCI Host Controller
usb usb3: Manufacturer: Linux 2.6.18-mm2-hrt-dyntick5 uhci_hcd
usb usb3: SerialNumber: 0000:00:1d.2
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
usbcore: registered new interface driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: PS/2 Controller [PNP0303:KBC,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
mice: PS/2 mouse device common for all mice
input: PC Speaker as /class/input/input0
input: AT Translated Set 2 keyboard as /class/input/input1
ata1.00: ATA-6, max UDMA/100, 117210240 sectors: LBA 
ata1.00: ata1: dev 0 multi count 8
ata1.01: ATAPI, max UDMA/33
ata1.00: configured for UDMA/100
usb 2-2: new low speed USB device using uhci_hcd and address 2
ata1.01: configured for UDMA/33
scsi1 : ata_piix
usb 2-2: new device found, idVendor=045e, idProduct=0023
usb 2-2: new device strings: Mfr=1, Product=2, SerialNumber=0
usb 2-2: Product: Microsoft Trackball Optical®
usb 2-2: Manufacturer: Microsoft
usb 2-2: configuration #1 chosen from 1 choice
input: Microsoft Microsoft Trackball Optical® as /class/input/input2
input: USB HID v1.00 Mouse [Microsoft Microsoft Trackball Optical®] on usb-0000:00:1d.1-2
input: DualPoint Stick as /class/input/input3
input: AlpsPS/2 ALPS DualPoint TouchPad as /class/input/input4
scsi 0:0:0:0: Direct-Access     ATA      FUJITSU MHV2060A 0000 PQ: 0 ANSI: 5
SCSI device sda: 117210240 512-byte hdwr sectors (60012 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 117210240 512-byte hdwr sectors (60012 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
 sda:<6>i2c /dev entries driver
device-mapper: ioctl: 4.10.0-ioctl (2006-09-14) initialised: dm-devel@redhat.com
EDAC MC: Ver: 2.0.1 Oct  2 2006
Advanced Linux Sound Architecture Driver Version 1.0.12rc1 (Thu Jun 22 13:55:50 2006 UTC).
ACPI: PCI Interrupt 0000:00:1f.5[B] -> Link [LNKB] -> GSI 5 (level, low) -> IRQ 5
PCI: Setting latency timer of device 0000:00:1f.5 to 64
ALSA device list:
  No soundcards found.
TCP bic registered
Initializing XFRM netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
NET: Registered protocol family 17
NET: Registered protocol family 15
Using IPI Shortcut mode
Time: tsc clocksource has been installed.
Clock event device pit configured with caps set: 08
Switched to high resolution mode on CPU 0
Freeing unused kernel memory: 180k freed
Write protecting the kernel read-only data: 434k
 sda1 sda2
sd 0:0:0:0: Attached scsi disk sda
scsi 0:0:1:0: CD-ROM            TOSHIBA  CDRW/DVD SDR2102 1D13 PQ: 0 ANSI: 5
sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
sr 0:0:1:0: Attached scsi CD-ROM sr0
intel8x0_measure_ac97_clock: measured 51386 usecs
intel8x0: clocking to 48000
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
security:  6 users, 5 roles, 2013 types, 80 bools, 1 sens, 1024 cats
security:  58 classes, 111120 rules
SELinux:  Completing initialization.
SELinux:  Setting up existing superblocks.
SELinux: initialized (dev dm-0, type ext3), uses xattr
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
SELinux: initialized (dev selinuxfs, type selinuxfs), uses genfs_contexts
SELinux: initialized (dev mqueue, type mqueue), uses transition SIDs
SELinux: initialized (dev devpts, type devpts), uses transition SIDs
SELinux: initialized (dev eventpollfs, type eventpollfs), uses task SIDs
SELinux: initialized (dev inotifyfs, type inotifyfs), uses genfs_contexts
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev debugfs, type debugfs), uses genfs_contexts
SELinux: initialized (dev futexfs, type futexfs), uses genfs_contexts
SELinux: initialized (dev pipefs, type pipefs), uses task SIDs
SELinux: initialized (dev sockfs, type sockfs), uses task SIDs
SELinux: initialized (dev proc, type proc), uses genfs_contexts
SELinux: initialized (dev bdev, type bdev), uses genfs_contexts
SELinux: initialized (dev rootfs, type rootfs), uses genfs_contexts
SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
audit(1159802536.642:2): policy loaded auid=4294967295
audit(1159802537.006:3): avc:  denied  { execute } for  pid=457 comm="rc.sysinit" name="hostname" dev=dm-0 ino=33031 scontext=system_u:system_r:initrc_t:s0 tcontext=system_u:object_r:hostname_exec_t:s0 tclass=file
audit(1159802537.006:4): avc:  denied  { execute_no_trans } for  pid=457 comm="rc.sysinit" name="hostname" dev=dm-0 ino=33031 scontext=system_u:system_r:initrc_t:s0 tcontext=system_u:object_r:hostname_exec_t:s0 tclass=file
audit(1159802537.006:5): avc:  denied  { read } for  pid=457 comm="rc.sysinit" name="hostname" dev=dm-0 ino=33031 scontext=system_u:system_r:initrc_t:s0 tcontext=system_u:object_r:hostname_exec_t:s0 tclass=file
Real Time Clock Driver v1.12ac
audit(1159802539.162:6): avc:  denied  { dac_override } for  pid=487 comm="dmesg" capability=1 scontext=system_u:system_r:dmesg_t:s0 tcontext=system_u:system_r:dmesg_t:s0 tclass=capability
audit(1159802540.991:7): avc:  denied  { dac_override } for  pid=542 comm="pam_console_app" capability=1 scontext=system_u:system_r:pam_console_t:s0-s0:c0.c1023 tcontext=system_u:system_r:pam_console_t:s0-s0:c0.c1023 tclass=capability
ACPI: PCI Interrupt 0000:02:01.2[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[11]  MMIO=[f8fff000-f8fff7ff]  Max Packet=[2048]  IR/IT contexts=[4/4]
Yenta: CardBus bridge found at 0000:02:01.0 [1028:00d5]
Yenta: Using CSCINT to route CSC interrupts to PCI
Yenta: Routing CardBus interrupts to PCI
Yenta TI: socket 0000:02:01.0, mfunc 0x05033002, devctl 0x64
Yenta: CardBus bridge found at 0000:02:01.1 [1028:00d5]
Yenta: Using CSCINT to route CSC interrupts to PCI
Yenta: Routing CardBus interrupts to PCI
Yenta TI: socket 0000:02:01.1, mfunc 0x05033002, devctl 0x64
Yenta: CardBus bridge found at 0000:02:03.0 [12a3:ab01]
Yenta: Enabling burst memory read transactions
Yenta: Using CSCINT to route CSC interrupts to PCI
Yenta: Routing CardBus interrupts to PCI
Yenta TI: socket 0000:02:03.0, mfunc 0x01000002, devctl 0x60
Linux agpgart interface v0.101 (c) Dave Jones
iTCO_wdt: Intel TCO WatchDog Timer Driver v1.00 (30-Jul-2006)
iTCO_wdt: Found a ICH3-M TCO device (Version=1, TCOBASE=0x0860)
iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
agpgart: Detected an Intel i845 Chipset.
agpgart: AGP aperture is 64M @ 0xe8000000
ohci1394: fw-host0: Running dma failed because Node ID is not valid
Yenta: ISA IRQ mask 0x0498, PCI irq 11
Socket status: 30000020
pcmcia: parent PCI bridge I/O window: 0xe000 - 0xffff
cs: IO port probe 0xe000-0xffff: clean.
pcmcia: parent PCI bridge Memory window: 0xf4000000 - 0xfbffffff
pcmcia: parent PCI bridge Memory window: 0x40000000 - 0x46ffffff
Yenta: ISA IRQ mask 0x0498, PCI irq 11
Socket status: 30000006
pcmcia: parent PCI bridge I/O window: 0xe000 - 0xffff
cs: IO port probe 0xe000-0xffff: clean.
pcmcia: parent PCI bridge Memory window: 0xf4000000 - 0xfbffffff
pcmcia: parent PCI bridge Memory window: 0x40000000 - 0x46ffffff
Yenta: ISA IRQ mask 0x0000, PCI irq 11
Socket status: 30000010
pcmcia: parent PCI bridge I/O window: 0xe000 - 0xffff
cs: IO port probe 0xe000-0xffff: clean.
pcmcia: parent PCI bridge Memory window: 0xf4000000 - 0xfbffffff
pcmcia: parent PCI bridge Memory window: 0x40000000 - 0x46ffffff
pccard: CardBus card inserted into slot 0
PCI: Enabling device 0000:03:00.0 (0000 -> 0003)
ACPI: PCI Interrupt 0000:03:00.0[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
PCI: Setting latency timer of device 0000:03:00.0 to 64
eth2: Xircom cardbus revision 3 at irq 11 
PCI: Enabling device 0000:03:00.1 (0000 -> 0003)
ACPI: PCI Interrupt 0000:03:00.1[A] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
0000:03:00.1: ttyS1 at I/O 0xe080 (irq = 11) is a 16550A
ohci1394: fw-host0: AT dma reset ctx=0, aborting transmission
ieee1394: Current remote IRM is not 1394a-2000 compliant, resetting...
pccard: PCMCIA card inserted into slot 2
ieee1394: Host added: ID:BUS[0-00:1023]  GUID[374fc0002a71c021]
cs: IO port probe 0x100-0x3af: excluding 0x370-0x37f
cs: IO port probe 0x3e0-0x4ff: clean.
cs: IO port probe 0x820-0x8ff: clean.
cs: IO port probe 0xc00-0xcf7: clean.
cs: IO port probe 0xa00-0xaff: clean.
cs: IO port probe 0x100-0x3af: excluding 0x370-0x37f
cs: IO port probe 0x3e0-0x4ff: clean.
cs: IO port probe 0x820-0x8ff: clean.
cs: IO port probe 0xc00-0xcf7: clean.
cs: IO port probe 0xa00-0xaff: clean.
cs: memory probe 0xf4000000-0xfbffffff: excluding 0xf4000000-0xf8ffffff 0xfa000000-0xfbffffff
pcmcia: registering new device pcmcia2.0
cs: IO port probe 0x100-0x3af: excluding 0x370-0x37f
cs: IO port probe 0x3e0-0x4ff: clean.
cs: IO port probe 0x820-0x8ff: clean.
cs: IO port probe 0xc00-0xcf7: clean.
cs: IO port probe 0xa00-0xaff: clean.
orinoco 0.15 (David Gibson <hermes@gibson.dropbear.id.au>, Pavel Roskin <proski@gnu.org>, et al)
orinoco_cs 0.15 (David Gibson <hermes@gibson.dropbear.id.au>, Pavel Roskin <proski@gnu.org>, et al)
pcmcia: request for exclusive IRQ could not be fulfilled.
pcmcia: the driver needs updating to supported shared IRQ lines.
eth2: Hardware identity 0005:0004:0005:0000
eth2: Station identity  001f:0001:0008:000a
eth2: Firmware determined as Lucent/Agere 8.10
eth2: Ad-hoc demo mode supported
eth2: IEEE standard IBSS ad-hoc mode supported
eth2: WEP supported, 104-bit key
eth2: MAC address 00:02:2D:5C:11:48
eth2: Station name "HERMES I"
eth2: ready
eth2: orinoco_cs at 2.0, irq 11, io 0xe100-0xe13f
Non-volatile memory driver v1.2
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
Dell laptop SMM driver v1.14 21/02/2005 Massimo Dal Zotto (dz@debian.org)
Netfilter messages via NETLINK v0.30.
audit(1159802546.388:8): avc:  denied  { getattr } for  pid=456 comm="rc.sysinit" name="hostname" dev=dm-0 ino=33031 scontext=system_u:system_r:initrc_t:s0 tcontext=system_u:object_r:hostname_exec_t:s0 tclass=file
ACPI: AC Adapter [AC] (on-line)
ACPI: Battery Slot [BAT0] (battery absent)
ACPI: Battery Slot [BAT1] (battery absent)
ACPI: Lid Switch [LID]
ACPI: Power Button (CM) [PBTN]
ACPI: Sleep Button (CM) [SBTN]
Using specific hotkey driver
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: Thermal Zone [THM] (68 C)
EXT3 FS on dm-0, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev sda1, type ext3), uses xattr
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-11, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-11, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-7, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-8, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-9, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-10, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-10, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-6, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3 FS on dm-4, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-4, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-5, type ext3), uses xattr
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev dm-1, type ext3), uses xattr
Adding 1572856k swap on /dev/rootvg/swap.  Priority:-1 extents:1 across:1572856k
SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts
audit(1159802586.630:9): avc:  denied  { getattr } for  pid=1502 comm="bash" name="spamassassin" dev=dm-0 ino=197880 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159802586.630:10): avc:  denied  { execute } for  pid=1502 comm="bash" name="spamassassin" dev=dm-0 ino=197880 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159802586.630:11): avc:  denied  { read } for  pid=1502 comm="bash" name="spamassassin" dev=dm-0 ino=197880 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159802592.469:12): avc:  denied  { execute_no_trans } for  pid=1525 comm="bash" name="cpuspeed" dev=dm-0 ino=197100 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159802592.506:13): avc:  denied  { ioctl } for  pid=1525 comm="cpuspeed" name="cpuspeed" dev=dm-0 ino=197100 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:initrc_exec_t:s0 tclass=file
audit(1159802592.551:14): avc:  denied  { getattr } for  pid=1525 comm="cpuspeed" name="cpuspeed" dev=dm-7 ino=295240 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:cpuspeed_exec_t:s0 tclass=file
audit(1159802592.585:15): avc:  denied  { execute } for  pid=1529 comm="bash" name="cpuspeed" dev=dm-7 ino=295240 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:cpuspeed_exec_t:s0 tclass=file
audit(1159802592.585:16): avc:  denied  { read } for  pid=1529 comm="bash" name="cpuspeed" dev=dm-7 ino=295240 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:cpuspeed_exec_t:s0 tclass=file
audit(1159802592.586:17): avc:  denied  { execute_no_trans } for  pid=1530 comm="bash" name="cpuspeed" dev=dm-7 ino=295240 scontext=system_u:system_r:sysadm_t:s0 tcontext=system_u:object_r:cpuspeed_exec_t:s0 tclass=file



* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-02 18:25     ` Valdis.Kletnieks
@ 2006-10-02 18:38       ` john stultz
  2006-10-02 19:08         ` Valdis.Kletnieks
  2006-10-02 18:43       ` [patch] dynticks core: Fix idle time accounting Thomas Gleixner
  1 sibling, 1 reply; 58+ messages in thread
From: john stultz @ 2006-10-02 18:38 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: tglx, Andrew Morton, LKML, Ingo Molnar, Jim Gettys,
	David Woodhouse, Arjan van de Ven, Dave Jones

On Mon, 2006-10-02 at 14:25 -0400, Valdis.Kletnieks@vt.edu wrote:
> (Sorry for the size of the note, there's some 50K of logs attached)
>
> On Mon, 02 Oct 2006 15:43:02 +0200, Thomas Gleixner said:
> > Can you please send me the bootlog and further dmesg output (especially
> > when related to timers / cpufreq).
> 
> I booted the box to single-user both times, and then started cpuspeed.
> I then did a cat of /proc/interrupts, /proc/uptime, and a date command,
> waited 60 seconds according to my watch, and repeated.  I then dumped
> the dmesg.  The -dyntick kernel moved 'uptime' almost exactly 45 seconds
> (almost certainly a by-product of running at 1.2Ghz rather than 1.6Ghz).
> Does the dyntick code make any unwritten assumptions about a jiffie or
> bogomips remaining constant?
> 
> Attached - config diff, date and /proc dumps from both -mm2 and -mm2-dyntick,
> and the dmesg's from both boots.
> 
> Yell if you have any other questions/suggestions/etc..

Hmmm. So w/ -mm2 we're seeing the TSC get detected as running too slowly
(and it's replaced w/ the ACPI PM), but for some reason that doesn't
happen w/ the dynticks patch.

Now, how is cpuspeed changing the cpufreq? Is it using the /sys
interface? I've got hooks in so when the cpufreq changes we should mark
it unstable and fall back to ACPI PM, but maybe I missed whatever hook
cpuspeed is using.

thanks
-john



* [patch] dynticks core: Fix idle time accounting
  2006-10-02 18:25     ` Valdis.Kletnieks
  2006-10-02 18:38       ` john stultz
@ 2006-10-02 18:43       ` Thomas Gleixner
  2006-10-02 20:17         ` Valdis.Kletnieks
  1 sibling, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-02 18:43 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

On Mon, 2006-10-02 at 14:25 -0400, Valdis.Kletnieks@vt.edu wrote:
> > > I'm also seeing gkrellm reporting about 25% CPU use when "near-idle" (X is up
> > > but not much is going on) when that's usually down around 5-6%.  I need to
> > > collect some oprofile numbers and investigate that as well.
> >
> > I look into the accounting fixups again.
> I still need to get oprofile runs of this and see what's going on.

The patch below fixes the accounting weirdness.

	tglx

----------------->
Subject: dynticks core: Fix idle time accounting
From: Thomas Gleixner <tglx@linutronix.de>

The extended sleeps during idle must be accounted to the idle thread.
The original accounting fixup was too naive. The time must be accounted
when the idle thread is interrupted and the jiffies update code has
forwarded jiffies. Otherwise the accounting is done on random targets.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---

 kernel/hrtimer.c |   34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

Index: linux-2.6.18-mm2/kernel/hrtimer.c
===================================================================
--- linux-2.6.18-mm2.orig/kernel/hrtimer.c	2006-10-02 19:22:14.000000000 +0200
+++ linux-2.6.18-mm2/kernel/hrtimer.c	2006-10-02 19:22:14.000000000 +0200
@@ -44,6 +44,7 @@
 #include <linux/profile.h>
 #include <linux/seq_file.h>
 #include <linux/err.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/uaccess.h>
 
@@ -447,10 +448,11 @@ static const ktime_t nsec_per_hz = { .tv
  * want to wake up a complete idle cpu just to update jiffies, so we need
  * something more intellegent than a mere "do this only on CPUx".
  */
-static void update_jiffies64(ktime_t now)
+static unsigned long update_jiffies64(ktime_t now)
 {
 	unsigned long seq;
 	ktime_t delta;
+	unsigned long ticks = 0;
 
 	/* Preevaluate to avoid lock contention */
 	do {
@@ -459,14 +461,13 @@ static void update_jiffies64(ktime_t now
 	} while (read_seqretry(&xtime_lock, seq));
 
 	if (delta.tv64 < nsec_per_hz.tv64)
-		return;
+		return 0;
 
 	/* Reevalute with xtime_lock held */
 	write_seqlock(&xtime_lock);
 
 	delta = ktime_sub(now, last_jiffies_update);
 	if (delta.tv64 >= nsec_per_hz.tv64) {
-		unsigned long ticks = 1;
 
 		delta = ktime_sub(delta, nsec_per_hz);
 		last_jiffies_update = ktime_add(last_jiffies_update,
@@ -480,11 +481,13 @@ static void update_jiffies64(ktime_t now
 
 			last_jiffies_update = ktime_add_ns(last_jiffies_update,
 							   incr * ticks);
-			ticks++;
 		}
+		ticks++;
 		do_timer(ticks);
 	}
 	write_sequnlock(&xtime_lock);
+
+	return ticks;
 }
 
 #ifdef CONFIG_NO_HZ
@@ -500,7 +503,7 @@ static void update_jiffies64(ktime_t now
 void hrtimer_update_jiffies(void)
 {
 	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
-	unsigned long flags;
+	unsigned long flags, ticks;
 	ktime_t now;
 
 	if (!cpu_base->tick_stopped || !cpu_base->hres_active)
@@ -509,7 +512,17 @@ void hrtimer_update_jiffies(void)
 	now = ktime_get();
 
 	local_irq_save(flags);
-	update_jiffies64(now);
+	ticks = update_jiffies64(now);
+	if (ticks) {
+		/*
+		 * We stopped the tick in idle and this function got called to
+		 * update jiffies. Update process times would randomly account
+		 * the time we slept to whatever the context of the next sched
+		 * tick is. Enforce that this is accounted to idle !
+		 */
+		account_system_time(current, HARDIRQ_OFFSET,
+				    jiffies_to_cputime(ticks));
+	}
 	local_irq_restore(flags);
 }
 
@@ -604,15 +617,6 @@ void hrtimer_restart_sched_tick(void)
 	local_irq_disable();
 	update_jiffies64(now);
 
-	/*
-	 * Update process times would randomly account the time we slept to
-	 * whatever the context of the next sched tick is.  Enforce that this
-	 * is accounted to idle !
-	 */
-	add_preempt_count(HARDIRQ_OFFSET);
-	update_process_times(0);
-	sub_preempt_count(HARDIRQ_OFFSET);
-
 	/* Account the idle time */
 	delta = ktime_sub(now, cpu_base->idle_entrytime);
 	cpu_base->idle_sleeptime = ktime_add(cpu_base->idle_sleeptime, delta);
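
For readers following the arithmetic, here is a rough userspace sketch of the
catch-up computation that update_jiffies64() performs after this change. It is
not the kernel code: TICK_NSEC, last_update_ns and catch_up_ticks() are names
made up for illustration, plain 64-bit integers stand in for ktime_t, a 1 ms
tick is assumed, and the xtime_lock handling and the do_timer() call are left
out.

```c
#include <stdint.h>

/* Assumed tick length: 1 ms, i.e. HZ=1000 (stand-in for nsec_per_hz). */
#define TICK_NSEC 1000000LL

/* Stand-in for last_jiffies_update, in nanoseconds. */
int64_t last_update_ns;

/*
 * Return how many whole ticks have elapsed since the last update and
 * advance last_update_ns by exactly that many ticks. Returns 0 when
 * less than one tick has passed, mirroring the early "return 0" path.
 */
unsigned long catch_up_ticks(int64_t now_ns)
{
	int64_t delta = now_ns - last_update_ns;
	unsigned long ticks = 0;

	if (delta < TICK_NSEC)
		return 0;

	/* Consume the first tick. */
	delta -= TICK_NSEC;
	last_update_ns += TICK_NSEC;

	/* Slow path: a long sleep skipped several more ticks. */
	if (delta >= TICK_NSEC) {
		ticks = (unsigned long)(delta / TICK_NSEC);
		last_update_ns += TICK_NSEC * (int64_t)ticks;
	}

	/* Add the first tick back in, as the patch does after the loop body. */
	ticks++;
	return ticks;
}
```

With these assumptions, a call 5.5 ms after the last update returns 5 and
leaves the remaining 0.5 ms for the next update. The returned count is what the
patch hands to account_system_time() so that the whole sleep is charged in one
place.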




* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-02 18:38       ` john stultz
@ 2006-10-02 19:08         ` Valdis.Kletnieks
  2006-10-02 19:23           ` john stultz
  0 siblings, 1 reply; 58+ messages in thread
From: Valdis.Kletnieks @ 2006-10-02 19:08 UTC (permalink / raw)
  To: john stultz
  Cc: tglx, Andrew Morton, LKML, Ingo Molnar, Jim Gettys,
	David Woodhouse, Arjan van de Ven, Dave Jones


On Mon, 02 Oct 2006 11:38:36 PDT, john stultz said:

> Hmmm. So w/ -mm2 we're seeing the TSC get detected as running too slowly
> (and its replaced w/ the ACPI PM), but for some reason that doesn't
> happen w/ the dynticks patch.

It's been switching to ACPI PM for somewhere near forever. I never bothered
to check into that because the PM timer provides a reasonably stable clock
source (it drifts at about 24 ppm and NTP is happy with it, and I haven't
gotten annoyed at the fact the PM timer is slow to read...)

I wonder if the TSC has been broken for forever on this box, and I'm just
seeing it because dynticks doesn't fall over to PM timer..

> Now, how is cpuspeed changing the cpufreq? Is it using the /sys
> interface? I've got hooks in so when the cpufreq changes we should mark
> it unstable and fall back to ACPI PM, but maybe I missed whatever hook
> cpuspeed is using.

Looking at the source, it appears to do this:

const char SYSFS_CURRENT_SPEED_FILE[] =
     "/sys/devices/system/cpu/cpu%u/cpufreq/scaling_setspeed";

// set the current CPU speed
void set_speed(unsigned value)
{
#ifdef DEBUG
    fprintf(stderr, "[cpu%u] Setting speed to: %uKHz\n", cpu, value);
#endif
    write_line(CURRENT_SPEED_FILE, "%u\n", value);
    // give CPU / chipset voltage time to settle down
    usleep(10000);
}
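
The excerpt does not show write_line(). As a minimal sketch of the same sysfs
write in plain C: write_khz() and set_speed_sysfs() are names invented here,
not cpuspeed functions, and the sketch assumes the cpufreq "userspace" governor
is selected, since scaling_setspeed only takes a frequency in that case.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write "<khz>\n" to the given file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int write_khz(const char *path, unsigned int khz)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%u\n", khz) > 0;
	/* sysfs reports many errors only when the write is flushed. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}

/* Build the per-CPU path from the format string quoted above and write to it. */
int set_speed_sysfs(unsigned int cpu, unsigned int khz)
{
	char path[128];

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/cpufreq/scaling_setspeed", cpu);
	return write_khz(path, khz);
}
```

So yes, as far as I can tell from this, the change goes through the /sys
interface, e.g. set_speed_sysfs(0, 1200000) for 1.2GHz on cpu0 (needs root).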




* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-02 19:08         ` Valdis.Kletnieks
@ 2006-10-02 19:23           ` john stultz
  0 siblings, 0 replies; 58+ messages in thread
From: john stultz @ 2006-10-02 19:23 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: tglx, Andrew Morton, LKML, Ingo Molnar, Jim Gettys,
	David Woodhouse, Arjan van de Ven, Dave Jones

On Mon, 2006-10-02 at 15:08 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 02 Oct 2006 11:38:36 PDT, john stultz said:
> 
> > Hmmm. So w/ -mm2 we're seeing the TSC get detected as running too slowly
> > (and its replaced w/ the ACPI PM), but for some reason that doesn't
> > happen w/ the dynticks patch.
> 
> It's been switching to ACPI PM for somewhere near forever, I never bothered
> to check into that because the PM timer provides a reasonably stable clock
> source (it drifts at about 24 ppm and NTP is happy with it, and I haven't
> gotten annoyed at the fact the PM timer is slow to read...)
> 
> I wonder if the TSC has been broken for forever on this box, and I'm just
> seeing it because dynticks doesn't fall over to PM timer..

This is what I suspect is the issue. Probably due to the new jiffies
accounting being now time based, and one of the TSC unstable checks (the
one you're tripping) being jiffies based. A tad bit circular :). I'm
working w/ tglx to see what we can do here.

> > Now, how is cpuspeed changing the cpufreq? Is it using the /sys
> > interface? I've got hooks in so when the cpufreq changes we should mark
> > it unstable and fall back to ACPI PM, but maybe I missed whatever hook
> > cpuspeed is using.
> 
> Looking at the source, it appears to do this:
> 
> const char SYSFS_CURRENT_SPEED_FILE[] =
>      "/sys/devices/system/cpu/cpu%u/cpufreq/scaling_setspeed";
> 
> // set the current CPU speed
> void set_speed(unsigned value)
> {
> #ifdef DEBUG
>     fprintf(stderr, "[cpu%u] Setting speed to: %uKHz\n", cpu, value);
> #endif
>     write_line(CURRENT_SPEED_FILE, "%u\n", value);
>     // give CPU / chipset voltage time to settle down
>     usleep(10000);
> }

I'll also take a peek there and see if I can add an extra hook, so we
don't have to rely on the jiffies stability check.

thanks
-john



* Re: [patch] dynticks core: Fix idle time accounting
  2006-10-02 18:43       ` [patch] dynticks core: Fix idle time accounting Thomas Gleixner
@ 2006-10-02 20:17         ` Valdis.Kletnieks
  2006-10-02 21:22           ` Thomas Gleixner
  0 siblings, 1 reply; 58+ messages in thread
From: Valdis.Kletnieks @ 2006-10-02 20:17 UTC (permalink / raw)
  To: tglx
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones


On Mon, 02 Oct 2006 20:43:26 +0200, Thomas Gleixner said:
>
> The patch below fixes the accounting weirdness.

I think it's still slightly defective, or at least suffering from a
disconnect between the numbers and what is going on - the numbers reported
in /proc/stat add up to the total number of timer interrupts, but that's not
necessarily representative of what happened...

% cat /proc/stat;sleep 15;cat /proc/stat
cpu  27634 0 7762 20470 881 331 252 0
cpu0 27634 0 7762 20470 881 331 252 0
intr 812332 631476 2960 0 4 4 12667 3 14 1 1 4 142891 114 0 22193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 2187603
btime 1159817297
processes 4028
procs_running 1
procs_blocked 0
nohz total I:397276 S:379955 T:1187.393123 A:0.003125 E: 629447
cpu  27753 0 7818 20739 881 332 253 0
cpu0 27753 0 7818 20739 881 332 253 0
intr 819027 636542 2969 0 4 4 12801 3 14 1 1 4 144371 114 0 22199 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 2209881
btime 1159817297
processes 4033
procs_running 1
procs_blocked 0
nohz total I:401991 S:384494 T:1200.732924 A:0.003122 E: 634513


And the deltas between the sums for cpu0 are equal to the difference of
the first intr (where the timer is) - ticking along at about 446/sec over
that 15 second timeframe. And sure enough, the 'user' field is about 1/3 of
the total interrupts.

The breakage is that userspace tools like gkrellm and vmstat and top are quite
happy in saying "oh, we averaged 446 ticks/sec over the last N seconds? That's
odd, but I can deal..." but unfortunately they treat all of the ticks as the
same "width" - and the idle ones are probably twice as wide if not wider.  I'm
not sure how to fix that.

(The "thought experiment" for this - imagine over a 10 second period, an idle
machine takes 100 short timeslices for a running process, and 100 very long
sleeps 10 times as long as the first 100.  What should /proc/stat report at
that point?)



* Re: [patch] dynticks core: Fix idle time accounting
  2006-10-02 20:17         ` Valdis.Kletnieks
@ 2006-10-02 21:22           ` Thomas Gleixner
  2006-10-02 21:35             ` Valdis.Kletnieks
  0 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-02 21:22 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

On Mon, 2006-10-02 at 16:17 -0400, Valdis.Kletnieks@vt.edu wrote:
> cpu  27634 0 7762 20470 881 331 252 0
> cpu0 27634 0 7762 20470 881 331 252 0
> intr 812332 631476 2960 0 4 4 12667 3 14 1 1 4 142891 114 0 22193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> ctxt 2187603
> btime 1159817297
> processes 4028
> procs_running 1
> procs_blocked 0
> nohz total I:397276 S:379955 T:1187.393123 A:0.003125 E: 629447
> cpu  27753 0 7818 20739 881 332 253 0
> cpu0 27753 0 7818 20739 881 332 253 0
> intr 819027 636542 2969 0 4 4 12801 3 14 1 1 4 144371 114 0 22199 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> ctxt 2209881
> btime 1159817297
> processes 4033
> procs_running 1
> procs_blocked 0
> nohz total I:401991 S:384494 T:1200.732924 A:0.003122 E: 634513

Strange.

/me digs deeper

	tglx




* Re: [patch] dynticks core: Fix idle time accounting
  2006-10-02 21:22           ` Thomas Gleixner
@ 2006-10-02 21:35             ` Valdis.Kletnieks
  2006-10-03 20:02               ` Thomas Gleixner
  0 siblings, 1 reply; 58+ messages in thread
From: Valdis.Kletnieks @ 2006-10-02 21:35 UTC (permalink / raw)
  To: tglx
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones


On Mon, 02 Oct 2006 23:22:38 +0200, Thomas Gleixner said:
> On Mon, 2006-10-02 at 16:17 -0400, Valdis.Kletnieks@vt.edu wrote:
> > cpu  27634 0 7762 20470 881 331 252 0
> > cpu0 27634 0 7762 20470 881 331 252 0
> > intr 812332 631476 2960 0 4 4 12667 3 14 1 1 4 142891 114 0 22193 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0
> > ctxt 2187603
> > btime 1159817297
> > processes 4028
> > procs_running 1
> > procs_blocked 0
> > nohz total I:397276 S:379955 T:1187.393123 A:0.003125 E: 629447
> > cpu  27753 0 7818 20739 881 332 253 0
> > cpu0 27753 0 7818 20739 881 332 253 0
> > intr 819027 636542 2969 0 4 4 12801 3 14 1 1 4 144371 114 0 22199 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0
> > ctxt 2209881
> > btime 1159817297
> > processes 4033
> > procs_running 1
> > procs_blocked 0
> > nohz total I:401991 S:384494 T:1200.732924 A:0.003122 E: 634513
> 
> Strange.
> 
> /me digs deeper

Not really strange at all - between code inspection and checking other stuff,
I'm now convinced the *counts* of "was the previous tick user/nice/system/idle"
reported in the cpu0 lines are accurate and report the relative counts
correctly.  The problem is that userspace tools are assuming that all the ticks
reported are created equal.  "We had 200 ticks, total, 100 were user and 100
were idle, so we were at 50/50 user/idle" - but in reality we had 100 10-ms
user ticks and 100 100-ms idle ticks and were only about 10% busy.....

We could "pump up" the relative counts - if 1 no-hz tick would have been 5ms
long, increment the count by 5 rather than 1 (for an alleged 1 kHz tick).
However, when we do that, we break the property that the sum of the ticks
in the 'cpu0' line is equal to the number of timer interrupts reported in the
'intr' line.

Like I said - unclear how to fix this....
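
To put numbers on the 50/50 versus 10% example above, here is a small sketch of
the two ways of computing "busy". The struct and function names are made up for
illustration and nothing here reads /proc/stat; it only contrasts counting
ticks with weighting each tick by the time it covered. Neither function guards
against a zero total.

```c
/* Hypothetical record: how many ticks of one kind, and how long each lasted. */
struct tick_class {
	unsigned long count;
	double width_ms;
};

/* Busy fraction the way the tools compute it: every tick counts the same. */
double busy_by_count(struct tick_class busy, struct tick_class idle)
{
	return (double)busy.count / (double)(busy.count + idle.count);
}

/* Busy fraction with each tick weighted by the time it actually covered. */
double busy_by_time(struct tick_class busy, struct tick_class idle)
{
	double busy_ms = (double)busy.count * busy.width_ms;
	double idle_ms = (double)idle.count * idle.width_ms;

	return busy_ms / (busy_ms + idle_ms);
}
```

For 100 user ticks of 10 ms and 100 idle ticks of 100 ms, busy_by_count() gives
0.5 while busy_by_time() gives 1000/11000, a bit over 9%.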



* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (22 preceding siblings ...)
  2006-10-02 13:02 ` Valdis.Kletnieks
@ 2006-10-03  3:23 ` Andrew Morton
  2006-10-03  4:00 ` Andrew Morton
  24 siblings, 0 replies; 58+ messages in thread
From: Andrew Morton @ 2006-10-03  3:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Sun, 01 Oct 2006 23:00:45 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

> We did not address the GTOD patches, as we want to wait for John's input on
> your comments.

I note that the default CONFIG_HIGH_RES_RESOLUTION is still 1000 (one
microsecond), which is far higher resolution than you actually recommend.

I did query that last time around.  I'd prefer not to have to go back and
re-review it all, please...


* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
                   ` (23 preceding siblings ...)
  2006-10-03  3:23 ` [patch 00/21] high resolution timers / dynamic ticks - V2 Andrew Morton
@ 2006-10-03  4:00 ` Andrew Morton
  2006-10-03  8:38   ` Thomas Gleixner
  2006-10-03  8:47   ` Ingo Molnar
  24 siblings, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2006-10-03  4:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


These patches make my Vaio run really really slowly.  Maybe a quarter of
the normal speed or lower.  Bisection shows that the bug is introduced by
clockevents-drivers-for-i386.patch+clockevents-drivers-for-i386-fix.patch

With all patches applied, the slowdown happens with
CONFIG_HIGH_RES_TIMERS=n and also with CONFIG_HIGH_RES_TIMERS=y &&
CONFIG_NO_HZ=y.  So something got collaterally damaged.

I put various helpful stuff at http://userweb.kernel.org/~akpm/x/

I uploaded all the patches I was using to
http://userweb.kernel.org/~akpm/x/patches/

It doesn't seem to be a cpufreq thing: cpuinfo_min_freq=800kHz,
cpuinfo_max_freq=2GHz and cpuinfo_cur_freq goes up to 2GHz under load. 
Wall time is increasing at one second per second.




* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-03  4:00 ` Andrew Morton
@ 2006-10-03  8:38   ` Thomas Gleixner
  2006-10-03  8:47   ` Ingo Molnar
  1 sibling, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-03  8:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Mon, 2006-10-02 at 21:00 -0700, Andrew Morton wrote:
> These patches make my Vaio run really really slowly.  Maybe a quarter of
> the normal speed or lower.  Bisection shows that the bug is introduced by
> clockevents-drivers-for-i386.patch+clockevents-drivers-for-i386-fix.patch
> 
> With all patches applied, the slowdown happens with
> CONFIG_HIGH_RES_TIMERS=n and also with CONFIG_HIGH_RES_TIMERS=y &&
> CONFIG_NO_HZ=y.  So something got collaterally damaged.
> 
> I put various helpful stuff at http://userweb.kernel.org/~akpm/x/

> I uploaded all the patches I was using to
> http://userweb.kernel.org/~akpm/x/patches/

That's basically the same set I have here +/- the fixups

> It doesn't seem to be a cpufreq thing: cpuinfo_min_freq=800kHz,
> cpuinfo_max_freq=2GHz and cpuinfo_cur_freq goes up to 2GHz under load. 
> Wall time is increasing at one second per second.

I'll retest on my Vaio.

	tglx




* Re: [patch 00/21] high resolution timers / dynamic ticks - V2
  2006-10-03  4:00 ` Andrew Morton
  2006-10-03  8:38   ` Thomas Gleixner
@ 2006-10-03  8:47   ` Ingo Molnar
  2006-10-03 10:35     ` [patch] clockevents: drivers for i386, fix #2 Ingo Molnar
  1 sibling, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2006-10-03  8:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


* Andrew Morton <akpm@osdl.org> wrote:

> These patches make my Vaio run really really slowly.  Maybe a quarter 
> of the normal speed or lower.  Bisection shows that the bug is 
> introduced by 
> clockevents-drivers-for-i386.patch+clockevents-drivers-for-i386-fix.patch
> 
> With all patches applied, the slowdown happens with 
> CONFIG_HIGH_RES_TIMERS=n and also with CONFIG_HIGH_RES_TIMERS=y && 
> CONFIG_NO_HZ=y.  So something got collaterally damaged.

yeah, i suspect it works again if you disable:

 CONFIG_X86_UP_APIC=y
 CONFIG_X86_UP_IOAPIC=y
 CONFIG_X86_LOCAL_APIC=y
 CONFIG_X86_IO_APIC=y

as the slowdown has the feeling of a runaway lapic timer irq.

from code review so far we can only see an udelay(10) difference in the 
initialization sequence of the PIT - we'll send a fix for that but i 
dont think that's the cause of the bug.

investigating it.

	Ingo


* [patch] clockevents: drivers for i386, fix #2
  2006-10-03  8:47   ` Ingo Molnar
@ 2006-10-03 10:35     ` Ingo Molnar
  2006-10-04  3:36       ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2006-10-03 10:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


* Ingo Molnar <mingo@elte.hu> wrote:

> yeah, i suspect it works again if you disable:
> 
>  CONFIG_X86_UP_APIC=y
>  CONFIG_X86_UP_IOAPIC=y
>  CONFIG_X86_LOCAL_APIC=y
>  CONFIG_X86_IO_APIC=y
> 
> as the slowdown has the feeling of a runaway lapic timer irq.
> 
> from code review so far we can only see a udelay(10) difference in 
> the initialization sequence of the PIT - we'll send a fix for that but 
> i don't think that's the cause of the bug.

the patch below fixes that particular bug. But ... the symptoms you are 
describing have the feeling of being apic related.

	Ingo

-------------------->
Subject: clockevents: drivers for i386, fix #2
From: Ingo Molnar <mingo@elte.hu>

add back a mistakenly removed udelay(10) to the PIT initialization
sequence.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/kernel/i8253.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux/arch/i386/kernel/i8253.c
===================================================================
--- linux.orig/arch/i386/kernel/i8253.c
+++ linux/arch/i386/kernel/i8253.c
@@ -45,6 +45,7 @@ static void init_pit_timer(enum clock_ev
 		outb_p(0x34, PIT_MODE);
 		udelay(10);
 		outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
+		udelay(10);
 		outb(LATCH >> 8 , PIT_CH0);	/* MSB */
 		break;
 


* Re: [patch] dynticks core: Fix idle time accounting
  2006-10-02 21:35             ` Valdis.Kletnieks
@ 2006-10-03 20:02               ` Thomas Gleixner
  2006-10-03 21:05                 ` Thomas Gleixner
  2006-10-04  2:33                 ` Valdis.Kletnieks
  0 siblings, 2 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-03 20:02 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

On Mon, 2006-10-02 at 17:35 -0400, Valdis.Kletnieks@vt.edu wrote:
> We could "pump up" the relative counts - if 1 no-hz tick would have been 5ms
> long, increment the count by 5 rather than 1 (for an alleged 1 kHz tick).
> However, when we do that, we break the property that the sum of the ticks
> in the 'cpu0' line is equal to the number of timer interrupts reported in the
> 'intr' line.

I found a way to fix my thinkos. I put up a queue with all fixes to:

http://www.tglx.de/projects/hrtimers/2.6.18-mm3/patch-2.6.18-mm3-hrt-dyntick1.patches.tar.bz2

Can you please verify if it makes your problem go away ?

	tglx




* Re: [patch] dynticks core: Fix idle time accounting
  2006-10-03 20:02               ` Thomas Gleixner
@ 2006-10-03 21:05                 ` Thomas Gleixner
  2006-10-04  2:33                 ` Valdis.Kletnieks
  1 sibling, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-03 21:05 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones

On Tue, 2006-10-03 at 22:02 +0200, Thomas Gleixner wrote:
> I found a way to fix my thinkos. I put up a queue with all fixes to:
> 
> http://www.tglx.de/projects/hrtimers/2.6.18-mm3/patch-2.6.18-mm3-hrt-dyntick1.patches.tar.bz2
> 
> Can you please verify if it makes your problem go away ?

Please use dyntick2, as #1 is missing a fix. Sorry.

	tglx




* Re: [patch] dynticks core: Fix idle time accounting
  2006-10-03 20:02               ` Thomas Gleixner
  2006-10-03 21:05                 ` Thomas Gleixner
@ 2006-10-04  2:33                 ` Valdis.Kletnieks
  2006-10-04  7:56                   ` Ingo Molnar
  1 sibling, 1 reply; 58+ messages in thread
From: Valdis.Kletnieks @ 2006-10-04  2:33 UTC (permalink / raw)
  To: tglx
  Cc: Andrew Morton, LKML, Ingo Molnar, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones


On Tue, 03 Oct 2006 22:02:30 +0200, Thomas Gleixner said:

> I found a way to fix my thinkos. I put up a queue with all fixes to:
> 
> http://www.tglx.de/projects/hrtimers/2.6.18-mm3/patch-2.6.18-mm3-hrt-dyntick1.patches.tar.bz2
> 
> Can you please verify if it makes your problem go away ?

Was -dyntick3 by the time I got there.

The user/system/idle/wait numbers now look sane, with one caveat:

static const ktime_t nsec_per_hz = { .tv64 = NSEC_PER_SEC / HZ };
...
                if (unlikely(delta.tv64 >= nsec_per_hz.tv64)) {
                        s64 incr = ktime_to_ns(nsec_per_hz);
                        ticks = ktime_divns(delta, incr);

Even though I have CONFIG_HZ=1000, this ends up generating a synthetic
count that works out to 100 per second.  gkrellm and vmstat are happy with
that state of affairs, but I'm not sure why it came out to 100/sec rather
than 1000/sec.

% cat /proc/stat /proc/uptime
cpu  28224 4688 9159 168290 9143 283 256 0
cpu0 28224 4688 9159 168290 9143 283 256 0
intr 818891 627337 3466 0 4 4 6459 3 7 1 1 4 160328 115 0 21162 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 971441
btime 1159926408
processes 4544
procs_running 1
procs_blocked 0
nohz total I:367986 S:302473 T:1737.640072 A:0.005744 E: 625327
2176.02 1775.11

Also, it still disagrees with speedstep - it isn't noticing that the TSC has
gone slow and dropping back to the PM timer.

All in all, we're making progress. ;)


* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-03 10:35     ` [patch] clockevents: drivers for i386, fix #2 Ingo Molnar
@ 2006-10-04  3:36       ` Andrew Morton
  2006-10-04  6:46         ` Ingo Molnar
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2006-10-04  3:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Tue, 3 Oct 2006 12:35:03 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> add back a mistakenly removed udelay(10) to the PIT initialization
> sequence.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/i386/kernel/i8253.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> Index: linux/arch/i386/kernel/i8253.c
> ===================================================================
> --- linux.orig/arch/i386/kernel/i8253.c
> +++ linux/arch/i386/kernel/i8253.c
> @@ -45,6 +45,7 @@ static void init_pit_timer(enum clock_ev
>  		outb_p(0x34, PIT_MODE);
>  		udelay(10);
>  		outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
> +		udelay(10);
>  		outb(LATCH >> 8 , PIT_CH0);	/* MSB */
>  		break;
>  

Doesn't help.

> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > yeah, i suspect it works again if you disable:
> > 
> >  CONFIG_X86_UP_APIC=y
> >  CONFIG_X86_UP_IOAPIC=y
> >  CONFIG_X86_LOCAL_APIC=y
> >  CONFIG_X86_IO_APIC=y
> > 
> > as the slowdown has the feeling of a runaway lapic timer irq.
> > 

Disabling IO_APIC doesn't fix the slowdown.

Disabling LOCAL_APIC does fix it.


* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04  3:36       ` Andrew Morton
@ 2006-10-04  6:46         ` Ingo Molnar
  2006-10-04  7:32           ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2006-10-04  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


* Andrew Morton <akpm@osdl.org> wrote:

> Disabling LOCAL_APIC does fix it.

thanks, that narrows it down quite a bit. (We've double-checked the 
lapic path and it seemed all our changes are NOPs, but obviously they 
aren't and we'll check it all again.)

(if you have that kernel still booted by any chance then do you see the 
'LOC' IRQ count in /proc/interrupts or any other count in /proc/stat 
increasing at an alarming rate? That would narrow it down to lapic timer 
misprogramming.)

	Ingo


* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04  6:46         ` Ingo Molnar
@ 2006-10-04  7:32           ` Andrew Morton
  2006-10-04  7:41             ` Ingo Molnar
  2006-10-04  7:55             ` Ingo Molnar
  0 siblings, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2006-10-04  7:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Wed, 4 Oct 2006 08:46:20 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > Disabling LOCAL_APIC does fix it.
> 
> thanks, that narrows it down quite a bit. (We've double-checked the 
> lapic path and it seemed all our changes are NOPs, but obviously they 
> aren't and we'll check it all again.)
> 
> (if you have that kernel still booted by any chance then do you see the 
> 'LOC' IRQ count in /proc/interrupts or any other count in /proc/stat 
> increasing at an alarming rate? That would narrow it down to lapic timer 
> misprogramming.)
> 

None of the interrupts are doing anything wrong.  oprofile shows nothing
alarming.

Disabling cpufreq in config doesn't fix it.

Userspace can count to a billion in 3.9 seconds when this problem is
present, which is the same time as it takes on a non-slow kernel.

`sleep 5' takes 5 seconds.

Yet initscripts take a long time (especially applying the ipfilter firewall
rules for some reason), and `startx' takes a long time, etc.  This kernel
takes 112 seconds to boot to a login prompt - other kernels take 56 seconds
(interesting ratio..)

Weird.
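
The counting test mentioned above can be approximated with a sketch like the
one below. The program actually used is not shown in this mail, so the
function names, the volatile counter, and the use of clock() (which reports
CPU time rather than wall time) are all assumptions made for illustration.

```c
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: count from 0 up to 'limit'. The counter is volatile
 * so the compiler cannot collapse the loop into a single assignment.
 */
static uint64_t count_to(uint64_t limit)
{
	volatile uint64_t i = 0;

	while (i < limit)
		i++;
	return i;
}

/*
 * Hypothetical helper: CPU seconds consumed while counting to 'limit'.
 * clock() measures processor time used by the process, not elapsed time.
 */
static double cpu_seconds_to_count(uint64_t limit)
{
	clock_t start = clock();

	count_to(limit);
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling cpu_seconds_to_count(1000000000ULL) on a slow and a non-slow kernel
and comparing the results would show whether raw CPU throughput differs
between the two, which is the point of the measurement above.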


* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04  7:32           ` Andrew Morton
@ 2006-10-04  7:41             ` Ingo Molnar
  2006-10-04  8:01               ` Andrew Morton
  2006-10-04  7:55             ` Ingo Molnar
  1 sibling, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2006-10-04  7:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


* Andrew Morton <akpm@osdl.org> wrote:

> Yet initscripts take a long time (especially applying the ipfilter 
> firewall rules for some reason), and `startx' takes a long time, etc.  
> This kernel takes 112 seconds to boot to a login prompt - other 
> kernels take 56 seconds (interesting ratio..)

you are still using the non-hres config, correct? (so this is still 
collateral damage on vanilla kernel functionality)

	Ingo


* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04  7:32           ` Andrew Morton
  2006-10-04  7:41             ` Ingo Molnar
@ 2006-10-04  7:55             ` Ingo Molnar
  2006-10-04  8:15               ` Andrew Morton
  1 sibling, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2006-10-04  7:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


* Andrew Morton <akpm@osdl.org> wrote:

> None of the interrupts are doing anything wrong.  oprofile shows 
> nothing alarming.
> 
> Disabling cpufreq in config doesn't fix it.
> 
> Userspace can count to a billion in 3.9 seconds when this problem is 
> present, which is the same time as it takes on a non-slow kernel.
> 
> `sleep 5' takes 5 seconds.
> 
> Yet initscripts take a long time (especially applying the ipfilter 
> firewall rules for some reason), and `startx' takes a long time, etc.  
> This kernel takes 112 seconds to boot to a login prompt - other 
> kernels take 56 seconds (interesting ratio..)

hm, do you have the NMI watchdog enabled by any chance? [in particular, 
do you have nmi_watchdog=2?] Although your bootlog does not show it.

	Ingo


* Re: [patch] dynticks core: Fix idle time accounting
  2006-10-04  2:33                 ` Valdis.Kletnieks
@ 2006-10-04  7:56                   ` Ingo Molnar
  2006-10-04  9:58                     ` Valdis.Kletnieks
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2006-10-04  7:56 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: tglx, Andrew Morton, LKML, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones


* Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:

> Even though I have CONFIG_HZ=1000, this ends up generating a synthetic 
> count that works out to 100 per second.  gkrellm and vmstat are happy 
> with that state of affairs, but I'm not sure why it came out to 
> 100/sec rather than 1000/sec.

that's how it worked for quite some time: all userspace APIs are 
HZ-independent and depend on USER_HZ (which is 100 even if HZ is 1000).

	Ingo


* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04  7:41             ` Ingo Molnar
@ 2006-10-04  8:01               ` Andrew Morton
  0 siblings, 0 replies; 58+ messages in thread
From: Andrew Morton @ 2006-10-04  8:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Wed, 4 Oct 2006 09:41:42 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > Yet initscripts take a long time (especially applying the ipfilter 
> > firewall rules for some reason), and `startx' takes a long time, etc.  
> > This kernel takes 112 seconds to boot to a login prompt - other 
> > kernels take 56 seconds (interesting ratio..)
> 
> you are still using the non-hres config, correct? (so this is still 
> collateral damage on vanilla kernel functionality)
> 

yup.


* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04  7:55             ` Ingo Molnar
@ 2006-10-04  8:15               ` Andrew Morton
  2006-10-04 10:53                 ` Ingo Molnar
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2006-10-04  8:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Wed, 4 Oct 2006 09:55:40 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > None of the interrupts are doing anything wrong.  oprofile shows 
> > nothing alarming.
> > 
> > Disabling cpufreq in config doesn't fix it.
> > 
> > Userspace can count to a billion in 3.9 seconds when this problem is 
> > present, which is the same time as it takes on a non-slow kernel.
> > 
> > `sleep 5' takes 5 seconds.
> > 
> > Yet initscripts take a long time (especially applying the ipfilter 
> > firewall rules for some reason), and `startx' takes a long time, etc.  
> > This kernel takes 112 seconds to boot to a login prompt - other 
> > kernels take 56 seconds (interesting ratio..)
> 
> hm, do you have the NMI watchdog enabled by any chance? [in particular, 
> do you have nmi_watchdog=2?] Although your bootlog does not show it.
> 

There's no nmi_watchdog setting in the kernel boot command line and the
NMI counter isn't incrementing.


* Re: [patch] dynticks core: Fix idle time accounting
  2006-10-04  7:56                   ` Ingo Molnar
@ 2006-10-04  9:58                     ` Valdis.Kletnieks
  0 siblings, 0 replies; 58+ messages in thread
From: Valdis.Kletnieks @ 2006-10-04  9:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, Andrew Morton, LKML, Jim Gettys, John Stultz,
	David Woodhouse, Arjan van de Ven, Dave Jones


On Wed, 04 Oct 2006 09:56:57 +0200, Ingo Molnar said:
> 
> * Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:
> 
> > Even though I have CONFIG_HZ=1000, this ends up generating a synthetic 
> > count that works out to 100 per second.  gkrellm and vmstat are happy 
> > with that state of affairs, but I'm not sure why it came out to 
> > 100/sec rather than 1000/sec.
> 
> that's how it worked for quite some time: all userspace APIs are 
> HZ-independent and depend on USER_HZ (which is 100 even if HZ is 1000).

Nevermind - I missed where fs/proc/proc_misc.c applied jiffies_64_to_clock_t()
to the number before handing it to userspace.  So the numbers *were* being
kept in terms of HZ (as my reading of the code indicated), they just didn't
manage to escape to userspace that way....
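
To make the scaling concrete: the cpu0 fields in the /proc/stat output quoted
earlier in this thread sum to 220043 against 2176.02 seconds of uptime, which
is roughly 100 per second. Below is a minimal userspace sketch of the
HZ to USER_HZ conversion, assuming HZ is an exact multiple of USER_HZ. The
SKETCH_* constants and ticks_to_user_hz() are made-up names for illustration;
this is not the kernel's jiffies_64_to_clock_t() implementation.

```c
#include <stdint.h>

#define SKETCH_HZ	1000u	/* assumed kernel tick rate (CONFIG_HZ=1000) */
#define SKETCH_USER_HZ	100u	/* tick rate reported to userspace */

/*
 * Hypothetical helper: scale a count kept in HZ units down to USER_HZ
 * units. Only valid when SKETCH_HZ is an exact multiple of SKETCH_USER_HZ.
 */
static uint64_t ticks_to_user_hz(uint64_t ticks)
{
	return ticks / (SKETCH_HZ / SKETCH_USER_HZ);
}
```

With these assumed values, one second of accounting (1000 ticks in HZ units)
is reported to userspace as 100.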





* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04  8:15               ` Andrew Morton
@ 2006-10-04 10:53                 ` Ingo Molnar
  2006-10-04 11:19                   ` Thomas Gleixner
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2006-10-04 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Thomas Gleixner, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


* Andrew Morton <akpm@osdl.org> wrote:

> > hm, do you have the NMI watchdog enabled by any chance? [in 
> > particular, do you have nmi_watchdog=2?] Although your bootlog does 
> > not show it.
> > 
> 
> There's no nmi_watchdog setting in the kernel boot command line and 
> the NMI counter isn't incrementing.

there's one material difference we just found: in the !hres case we'll 
do the timer IRQ handling mostly from the lapic vector - while in 
mainline we do it from the irq0 vector. So, what does your 
/proc/interrupts look like? How frequently does LOC increase, and how 
frequently does IRQ 0 increase?

(meanwhile we'll fix and restore things so that it matches mainline 
behavior.)

	Ingo


* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04 10:53                 ` Ingo Molnar
@ 2006-10-04 11:19                   ` Thomas Gleixner
  2006-10-04 16:02                     ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-04 11:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Wed, 2006-10-04 at 12:53 +0200, Ingo Molnar wrote:
> there's one material difference we just found: in the !hres case we'll 
> do the timer IRQ handling mostly from the lapic vector - while in 
> mainline we do it from the irq0 vector. So, what does your 
> /proc/interrupts look like? How frequently does LOC increase, and how 
> frequently does IRQ 0 increase?
> 
> (meanwhile we'll fix and restore things so that it matches mainline 
> behavior.)

Andrew, does the patch below fix your problem ?

You should see the same weird behaviour when you run a plain -mm3 with
CONFIG_SMP=y on that box. This moves update_process_times() to the lapic
too.
	tglx


Index: linux-2.6.18-mm3/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.18-mm3.orig/arch/i386/kernel/apic.c	2006-10-04 13:02:35.000000000 +0200
+++ linux-2.6.18-mm3/arch/i386/kernel/apic.c	2006-10-04 12:59:06.000000000 +0200
@@ -84,7 +84,9 @@ static void lapic_timer_setup(enum clock
 static struct clock_event_device lapic_clockevent = {
 	.name = "lapic",
 	.capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
+#ifdef CONFIG_SMP
 			| CLOCK_CAP_UPDATE,
+#endif
 	.shift = 32,
 	.set_mode = lapic_timer_setup,
 	.set_next_event = lapic_next_event,






* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04 11:19                   ` Thomas Gleixner
@ 2006-10-04 16:02                     ` Andrew Morton
  2006-10-04 16:20                       ` Thomas Gleixner
  2006-10-04 16:35                       ` Ingo Molnar
  0 siblings, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2006-10-04 16:02 UTC (permalink / raw)
  To: tglx
  Cc: Ingo Molnar, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Wed, 04 Oct 2006 13:19:35 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 2006-10-04 at 12:53 +0200, Ingo Molnar wrote:
> > there's one material difference we just found: in the !hres case we'll 
> > do the timer IRQ handling mostly from the lapic vector - while in 
> > mainline we do it from the irq0 vector. So, what does your 
> > /proc/interrupts look like? How frequently does LOC increase, and how 
> > frequently does IRQ 0 increase?

sony:/home/akpm> cat /proc/interrupts ; sleep 1 ; cat /proc/interrupts
           CPU0       
  0:      39256   IO-APIC-edge      timer
  1:          8   IO-APIC-edge      i8042
  8:          1   IO-APIC-edge      rtc
  9:        160   IO-APIC-fasteoi   acpi
 11:          3   IO-APIC-edge      sonypi
 12:        107   IO-APIC-edge      i8042
 14:          5   IO-APIC-edge      libata
 15:          0   IO-APIC-edge      libata
 16:          1   IO-APIC-fasteoi   yenta, uhci_hcd:usb4
 17:        246   IO-APIC-fasteoi   ohci1394, eth0
 18:       5759   IO-APIC-fasteoi   libata
 19:          3   IO-APIC-fasteoi   ipw2200
 20:        710   IO-APIC-fasteoi   HDA Intel, uhci_hcd:usb3
 21:          2   IO-APIC-fasteoi   ehci_hcd:usb1
 22:          0   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb5
NMI:          0 
LOC:       3131 
ERR:          0
MIS:          0
           CPU0       
  0:      39519   IO-APIC-edge      timer
  1:          8   IO-APIC-edge      i8042
  8:          1   IO-APIC-edge      rtc
  9:        160   IO-APIC-fasteoi   acpi
 11:          3   IO-APIC-edge      sonypi
 12:        107   IO-APIC-edge      i8042
 14:          5   IO-APIC-edge      libata
 15:          0   IO-APIC-edge      libata
 16:          1   IO-APIC-fasteoi   yenta, uhci_hcd:usb4
 17:        248   IO-APIC-fasteoi   ohci1394, eth0
 18:       5759   IO-APIC-fasteoi   libata
 19:          3   IO-APIC-fasteoi   ipw2200
 20:        715   IO-APIC-fasteoi   HDA Intel, uhci_hcd:usb3
 21:          2   IO-APIC-fasteoi   ehci_hcd:usb1
 22:          0   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb5
NMI:          0 
LOC:       3134 
ERR:          0
MIS:          0

> > (meanwhile we'll fix and restore things so that it matches mainline 
> > behavior.)
> 
> Andrew, does the patch below fix your problem ?
> 
> You should see the same weird behaviour when you run a plain -mm3 with
> CONFIG_SMP=y on that box. This moves update_process_times() to the lapic
> too.
> 	tglx
> 
> 
> Index: linux-2.6.18-mm3/arch/i386/kernel/apic.c
> ===================================================================
> --- linux-2.6.18-mm3.orig/arch/i386/kernel/apic.c	2006-10-04 13:02:35.000000000 +0200
> +++ linux-2.6.18-mm3/arch/i386/kernel/apic.c	2006-10-04 12:59:06.000000000 +0200
> @@ -84,7 +84,9 @@ static void lapic_timer_setup(enum clock
>  static struct clock_event_device lapic_clockevent = {
>  	.name = "lapic",
>  	.capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
> +#ifdef CONFIG_SMP
>  			| CLOCK_CAP_UPDATE,
> +#endif
>  	.shift = 32,
>  	.set_mode = lapic_timer_setup,
>  	.set_next_event = lapic_next_event,

that (after a tweak to make it compile) fixes it.   What's it do?



* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04 16:02                     ` Andrew Morton
@ 2006-10-04 16:20                       ` Thomas Gleixner
  2006-10-04 16:35                       ` Ingo Molnar
  1 sibling, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2006-10-04 16:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones

On Wed, 2006-10-04 at 09:02 -0700, Andrew Morton wrote:
> On Wed, 04 Oct 2006 13:19:35 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Wed, 2006-10-04 at 12:53 +0200, Ingo Molnar wrote:
> > > there's one material difference we just found: in the !hres case we'll 
> > > do the timer IRQ handling mostly from the lapic vector - while in 
> > > mainline we do it from the irq0 vector. So, what does your 
> > > /proc/interrupts look like? How frequently does LOC increase, and how 
> > > frequently does IRQ 0 increase?
> 
> sony:/home/akpm> cat /proc/interrupts ; sleep 1 ; cat /proc/interrupts
>            CPU0       
>   0:      39256   IO-APIC-edge      timer
> LOC:       3131 

>   0:      39519   IO-APIC-edge      timer
> LOC:       3134 

delta IRQ == 263
delta LOC == 3

That explains the problem. The lapic frequency seems to be way off. I
have no good idea offhand how to detect such lapic brokenness.

> >  static struct clock_event_device lapic_clockevent = {
> >  	.name = "lapic",
> >  	.capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
> > +#ifdef CONFIG_SMP
> >  			| CLOCK_CAP_UPDATE,
> > +#endif
> >  	.shift = 32,
> >  	.set_mode = lapic_timer_setup,
> >  	.set_next_event = lapic_next_event,
> 
> that (after a tweak to make it compile) fixes it.   What's it do?

It brings update_process_times() back into IRQ0. On systems with a
working lapic, it would not matter. SMP moves update_process_times() to
lapic too. That's why I asked whether a SMP=y kernel has the same
problems on this box.
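
The delta arithmetic above can be written down as a small sketch. This is
only a userspace illustration working on two samples of the IRQ 0 and LOC
counters taken one second apart; it is not a proposal for in-kernel
detection. The struct, the function names, and the percentage threshold are
all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the two counters read from /proc/interrupts. */
struct irq_sample {
	uint64_t irq0;	/* "0:" timer line */
	uint64_t loc;	/* "LOC:" local APIC timer line */
};

/* Difference between two readings of a monotonically increasing counter. */
static uint64_t counter_delta(uint64_t before, uint64_t after)
{
	return after - before;
}

/*
 * Hypothetical check: report true when LOC advanced by less than
 * 'min_percent' percent of what IRQ 0 advanced by over the same interval.
 * The threshold is arbitrary and chosen by the caller.
 */
static bool lapic_looks_slow(struct irq_sample a, struct irq_sample b,
			     uint64_t min_percent)
{
	uint64_t d_irq = counter_delta(a.irq0, b.irq0);
	uint64_t d_loc = counter_delta(a.loc, b.loc);

	if (d_irq == 0)
		return false;	/* nothing to compare against */
	return d_loc * 100 < d_irq * min_percent;
}
```

Fed with the two samples quoted above (39256/3131 and 39519/3134), the deltas
come out as 263 and 3, and the check trips for any reasonable threshold.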

	tglx




* Re: [patch] clockevents: drivers for i386, fix #2
  2006-10-04 16:02                     ` Andrew Morton
  2006-10-04 16:20                       ` Thomas Gleixner
@ 2006-10-04 16:35                       ` Ingo Molnar
  1 sibling, 0 replies; 58+ messages in thread
From: Ingo Molnar @ 2006-10-04 16:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tglx, LKML, Jim Gettys, John Stultz, David Woodhouse,
	Arjan van de Ven, Dave Jones


* Andrew Morton <akpm@osdl.org> wrote:

> >  	.name = "lapic",
> >  	.capabilities = CLOCK_CAP_NEXTEVT | CLOCK_CAP_PROFILE
> > +#ifdef CONFIG_SMP
> >  			| CLOCK_CAP_UPDATE,
> > +#endif
> >  	.shift = 32,
> >  	.set_mode = lapic_timer_setup,
> >  	.set_next_event = lapic_next_event,
> 
> that (after a tweak to make it compile) fixes it. [...]

cool!

the vanilla SMP kernel will likely show similar effects on your laptop. 
We'll figure out a safe way to detect this quirk, and to work around it 
or turn off the lapic timer driver in that case.

(Btw., this bug was collateral damage from the cleanups. Many people are 
running -rt on laptops and i think we'd have noticed otherwise.)

	Ingo


end of thread, other threads:[~2006-10-04 16:43 UTC | newest]

Thread overview: 58+ messages
2006-10-01 22:59 [patch 00/21] high resolution timers / dynamic ticks - V2 Thomas Gleixner
2006-10-01 22:59 ` [patch 01/21] GTOD: exponential update_wall_time Thomas Gleixner
2006-10-01 23:00 ` [patch 02/21] GTOD: persistent clock support, core Thomas Gleixner
2006-10-01 23:00 ` [patch 03/21] GTOD: persistent clock support, i386 Thomas Gleixner
2006-10-01 23:00 ` [patch 04/21] time: uninline jiffies.h Thomas Gleixner
2006-10-01 23:00 ` [patch 05/21] time: fix msecs_to_jiffies() bug Thomas Gleixner
2006-10-01 23:00 ` [patch 06/21] time: fix timeout overflow Thomas Gleixner
2006-10-01 23:00 ` [patch 07/21] cleanup: uninline irq_enter() and move it into a function Thomas Gleixner
2006-10-01 23:00 ` [patch 08/21] dynticks: extend next_timer_interrupt() to use a reference jiffie Thomas Gleixner
2006-10-01 23:00 ` [patch 09/21] hrtimers: namespace and enum cleanup Thomas Gleixner
2006-10-01 23:00 ` [patch 10/21] hrtimers: clean up locking Thomas Gleixner
2006-10-01 23:00 ` [patch 11/21] hrtimers: state tracking Thomas Gleixner
2006-10-01 23:00 ` [patch 12/21] hrtimers: clean up callback tracking Thomas Gleixner
2006-10-01 23:01 ` [patch 13/21] hrtimers: Move and add documentation Thomas Gleixner
2006-10-01 23:01 ` [patch 14/21] clockevents: core Thomas Gleixner
2006-10-01 23:01 ` [patch 15/21] clockevents: drivers for i386 Thomas Gleixner
2006-10-01 23:01 ` [patch 16/21] high-res timers: core Thomas Gleixner
2006-10-02 11:50   ` Paulo Marques
2006-10-01 23:01 ` [patch 17/21] dynticks: core Thomas Gleixner
2006-10-02  6:41   ` [patch] dynticks: core, NMI watchdog fix Ingo Molnar
2006-10-02  8:54     ` [patch] dynticks: core, NMI watchdog fix, #2 Ingo Molnar
2006-10-01 23:01 ` [patch 18/21] dyntick: add nohz stats to /proc/stat Thomas Gleixner
2006-10-01 23:01 ` [patch 19/21] dynticks: i386 arch code Thomas Gleixner
2006-10-01 23:01 ` [patch 20/21] high-res timers, dynticks: enable i386 support Thomas Gleixner
2006-10-01 23:01 ` [patch 21/21] debugging feature: timer stats Thomas Gleixner
2006-10-02  5:11 ` [patch 00/21] high resolution timers / dynamic ticks - V2 Valdis.Kletnieks
2006-10-02 13:02 ` Valdis.Kletnieks
2006-10-02 13:43   ` Thomas Gleixner
2006-10-02 18:25     ` Valdis.Kletnieks
2006-10-02 18:38       ` john stultz
2006-10-02 19:08         ` Valdis.Kletnieks
2006-10-02 19:23           ` john stultz
2006-10-02 18:43       ` [patch] dynticks core: Fix idle time accounting Thomas Gleixner
2006-10-02 20:17         ` Valdis.Kletnieks
2006-10-02 21:22           ` Thomas Gleixner
2006-10-02 21:35             ` Valdis.Kletnieks
2006-10-03 20:02               ` Thomas Gleixner
2006-10-03 21:05                 ` Thomas Gleixner
2006-10-04  2:33                 ` Valdis.Kletnieks
2006-10-04  7:56                   ` Ingo Molnar
2006-10-04  9:58                     ` Valdis.Kletnieks
2006-10-03  3:23 ` [patch 00/21] high resolution timers / dynamic ticks - V2 Andrew Morton
2006-10-03  4:00 ` Andrew Morton
2006-10-03  8:38   ` Thomas Gleixner
2006-10-03  8:47   ` Ingo Molnar
2006-10-03 10:35     ` [patch] clockevents: drivers for i386, fix #2 Ingo Molnar
2006-10-04  3:36       ` Andrew Morton
2006-10-04  6:46         ` Ingo Molnar
2006-10-04  7:32           ` Andrew Morton
2006-10-04  7:41             ` Ingo Molnar
2006-10-04  8:01               ` Andrew Morton
2006-10-04  7:55             ` Ingo Molnar
2006-10-04  8:15               ` Andrew Morton
2006-10-04 10:53                 ` Ingo Molnar
2006-10-04 11:19                   ` Thomas Gleixner
2006-10-04 16:02                     ` Andrew Morton
2006-10-04 16:20                       ` Thomas Gleixner
2006-10-04 16:35                       ` Ingo Molnar
