* [PATCH] scheduler: fix x86 regression in native_sched_clock
@ 2007-12-07  1:19 Stefano Brivio
  2007-12-07  5:29 ` Nick Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Stefano Brivio @ 2007-12-07  1:19 UTC (permalink / raw)
  To: Ingo Molnar, Robert Love
  Cc: linux-kernel, Dave Jones, Rafael J. Wysocki, Michael Buesch,
	Thomas Gleixner, Andrew Morton, Len Brown

This patch fixes a regression introduced by:

commit bb29ab26863c022743143f27956cc0ca362f258c
Author: Ingo Molnar <mingo@elte.hu>
Date:   Mon Jul 9 18:51:59 2007 +0200

This caused the jiffies counter to leap back and forth on cpufreq changes
on my x86 box. I'd say that we can't always assume that TSC does "small
errors" only, when marked unstable. On cpufreq changes these errors can be
huge.
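
To put a number on it (the figures below are made up, purely to
illustrate the failure mode): cyc2ns_scale is calibrated once against
the TSC frequency at calibration time, so after a cpufreq transition
every cycles_2_ns() conversion is off by the frequency ratio. A
standalone userspace sketch:

/* Hypothetical illustration, not kernel code: the scale is computed
 * for a 2 GHz TSC, then the TSC actually ticks at 1 GHz. */
#include <stdio.h>
#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10	/* 2^10, as in tsc_32.c */

int main(void)
{
	uint64_t calibrated_khz = 2000000;	/* scale computed here */
	uint64_t actual_khz     = 1000000;	/* TSC now ticks here */
	uint64_t scale = (1000000ULL << CYC2NS_SCALE_FACTOR) / calibrated_khz;

	/* khz is cycles per millisecond, so this many cycles elapse
	 * in one real millisecond at the new frequency: */
	uint64_t cyc = actual_khz;
	uint64_t ns  = (cyc * scale) >> CYC2NS_SCALE_FACTOR;

	printf("1 ms of real time reported as %llu ns\n",
	       (unsigned long long)ns);	/* ~500000: clock at half speed */
	return 0;
}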

The original bug report can be found here:
http://bugzilla.kernel.org/show_bug.cgi?id=9475


Signed-off-by: Stefano Brivio <stefano.brivio@polimi.it>

---

diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 9ebc0da..d29cd9c 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -98,13 +98,8 @@ unsigned long long native_sched_clock(void)
 
 	/*
 	 * Fall back to jiffies if there's no TSC available:
-	 * ( But note that we still use it if the TSC is marked
-	 *   unstable. We do this because unlike Time Of Day,
-	 *   the scheduler clock tolerates small errors and it's
-	 *   very important for it to be as fast as the platform
-	 *   can achive it. )
 	 */
-	if (unlikely(!tsc_enabled && !tsc_unstable))
+	if (unlikely(!tsc_enabled))
 		/* No locking but a rare wrong value is not a big deal: */
 		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
 

-- 
Ciao
Stefano


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  1:19 [PATCH] scheduler: fix x86 regression in native_sched_clock Stefano Brivio
@ 2007-12-07  5:29 ` Nick Piggin
  2007-12-07  5:51 ` Thomas Gleixner
  2007-12-07  8:45 ` Ingo Molnar
  2 siblings, 0 replies; 47+ messages in thread
From: Nick Piggin @ 2007-12-07  5:29 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Ingo Molnar, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

On Friday 07 December 2007 12:19, Stefano Brivio wrote:
> This patch fixes a regression introduced by:
>
> commit bb29ab26863c022743143f27956cc0ca362f258c
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Mon Jul 9 18:51:59 2007 +0200
>
> This caused the jiffies counter to leap back and forth on cpufreq changes
> on my x86 box. I'd say that we can't always assume that TSC does "small
> errors" only, when marked unstable. On cpufreq changes these errors can be
> huge.
>
> The original bug report can be found here:
> http://bugzilla.kernel.org/show_bug.cgi?id=9475
>
>
> Signed-off-by: Stefano Brivio <stefano.brivio@polimi.it>

While your fix should probably go into 2.6.24...

This particular issue has aggravated me enough times. Let's
fix the damn thing properly already... I think what would work best
is a relatively simple change to the API along these lines:
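
(The full patch is attached below; the intended usage pattern, as a
minimal sketch with hypothetical caller names: each consumer of the
clock keeps its own raw sample, and sched_clock() hands back only the
delta since that sample, so a TSC warp can corrupt at most one delta
instead of permanently skewing an absolute timestamp.)

	/* hypothetical caller of the proposed sched_clock(u64 *sample): */
	static u64 my_sample;		/* last raw clock sample */
	static u64 my_clock_ns;		/* accumulated nanoseconds */

	static void my_update_clock(void)
	{
		/* ns elapsed since *my_sample; sample updated in place */
		my_clock_ns += sched_clock(&my_sample);
	}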

[-- Attachment #2: sched-clock.patch --]
[-- Type: text/x-diff, Size: 7824 bytes --]

Index: linux-2.6/arch/x86/kernel/tsc_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/tsc_32.c
+++ linux-2.6/arch/x86/kernel/tsc_32.c
@@ -92,27 +92,35 @@ static inline void set_cyc2ns_scale(unsi
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
-unsigned long long native_sched_clock(void)
+u64 native_sched_clock(u64 *sample)
 {
-	unsigned long long this_offset;
+	u64 now, delta;
 
 	/*
-	 * Fall back to jiffies if there's no TSC available:
+	 * Fall back to the default sched_clock() implementation (keep in
+	 * synch with kernel/sched.c) if there's no TSC available:
 	 * ( But note that we still use it if the TSC is marked
 	 *   unstable. We do this because unlike Time Of Day,
 	 *   the scheduler clock tolerates small errors and it's
 	 *   very important for it to be as fast as the platform
 	 *   can achive it. )
 	 */
-	if (unlikely(!tsc_enabled && !tsc_unstable))
-		/* No locking but a rare wrong value is not a big deal: */
-		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
+	if (unlikely(!tsc_enabled && !tsc_unstable)) {
+		now = (u64)jiffies;
+		delta = now - *sample;
+		*sample = now;
+
+		return delta * (NSEC_PER_SEC / HZ);
+
+	} else {
+		/* read the Time Stamp Counter: */
+		rdtscll(now);
+		delta = now - *sample;
+		*sample = now;
 
-	/* read the Time Stamp Counter: */
-	rdtscll(this_offset);
-
-	/* return the value in ns */
-	return cycles_2_ns(this_offset);
+		/* return the delta value in ns */
+		return cycles_2_ns(delta);
+	}
 }
 
 /* We need to define a real function for sched_clock, to override the
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -72,9 +72,13 @@
  * This is default implementation.
  * Architectures and sub-architectures can override this.
  */
-unsigned long long __attribute__((weak)) sched_clock(void)
+u64 __attribute__((weak)) sched_clock(u64 *sample)
 {
-	return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
+	u64 now = (u64)jiffies;
+	u64 delta = now - *sample;
+	*sample = now;
+
+	return delta * (NSEC_PER_SEC / HZ);
 }
 
 /*
@@ -314,7 +318,7 @@ struct rq {
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
 
-	u64 clock, prev_clock_raw;
+	u64 clock, prev_clock_sample;
 	s64 clock_max_delta;
 
 	unsigned int clock_warps, clock_overflows;
@@ -385,9 +389,7 @@ static inline int cpu_of(struct rq *rq)
  */
 static void __update_rq_clock(struct rq *rq)
 {
-	u64 prev_raw = rq->prev_clock_raw;
-	u64 now = sched_clock();
-	s64 delta = now - prev_raw;
+	u64 delta = sched_clock(&rq->prev_clock_sample);
 	u64 clock = rq->clock;
 
 #ifdef CONFIG_SCHED_DEBUG
@@ -416,7 +418,6 @@ static void __update_rq_clock(struct rq 
 		}
 	}
 
-	rq->prev_clock_raw = now;
 	rq->clock = clock;
 }
 
@@ -656,7 +657,6 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_sleep
 void sched_clock_idle_wakeup_event(u64 delta_ns)
 {
 	struct rq *rq = cpu_rq(smp_processor_id());
-	u64 now = sched_clock();
 
 	rq->idle_clock += delta_ns;
 	/*
@@ -666,7 +666,7 @@ void sched_clock_idle_wakeup_event(u64 d
 	 * rq clock:
 	 */
 	spin_lock(&rq->lock);
-	rq->prev_clock_raw = now;
+	(void)sched_clock(&rq->prev_clock_sample);
 	rq->clock += delta_ns;
 	spin_unlock(&rq->lock);
 }
@@ -4967,7 +4967,7 @@ void __cpuinit init_idle(struct task_str
 	unsigned long flags;
 
 	__sched_fork(idle);
-	idle->se.exec_start = sched_clock();
+	idle->se.exec_start = 0;
 
 	idle->prio = idle->normal_prio = MAX_PRIO;
 	idle->cpus_allowed = cpumask_of_cpu(cpu);
Index: linux-2.6/arch/x86/kernel/tsc_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/tsc_64.c
+++ linux-2.6/arch/x86/kernel/tsc_64.c
@@ -30,18 +30,21 @@ static unsigned long long cycles_2_ns(un
 	return (cyc * cyc2ns_scale) >> NS_SCALE;
 }
 
-unsigned long long sched_clock(void)
+u64 sched_clock(u64 *sample)
 {
-	unsigned long a = 0;
+	u64 now, delta;
 
 	/* Could do CPU core sync here. Opteron can execute rdtsc speculatively,
 	 * which means it is not completely exact and may not be monotonous
 	 * between CPUs. But the errors should be too small to matter for
 	 * scheduling purposes.
 	 */
+	rdtscll(now);
+	delta = now - *sample;
+	*sample = now;
 
-	rdtscll(a);
-	return cycles_2_ns(a);
+	/* return the delta value in ns */
+	return cycles_2_ns(delta);
 }
 
 static int tsc_unstable;
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1433,7 +1433,7 @@ static inline int set_cpus_allowed(struc
 }
 #endif
 
-extern unsigned long long sched_clock(void);
+extern u64 sched_clock(u64 *sample);
 
 /*
  * For kernel-internal use: high-speed (but slightly incorrect) per-cpu
Index: linux-2.6/include/asm-x86/timer.h
===================================================================
--- linux-2.6.orig/include/asm-x86/timer.h
+++ linux-2.6/include/asm-x86/timer.h
@@ -5,7 +5,7 @@
 
 #define TICK_SIZE (tick_nsec / 1000)
 
-unsigned long long native_sched_clock(void);
+u64 native_sched_clock(u64 *sample);
 unsigned long native_calculate_cpu_khz(void);
 
 extern int timer_ack;
Index: linux-2.6/kernel/lockdep.c
===================================================================
--- linux-2.6.orig/kernel/lockdep.c
+++ linux-2.6/kernel/lockdep.c
@@ -216,7 +216,7 @@ static void lock_release_holdtime(struct
 	if (!lock_stat)
 		return;
 
-	holdtime = sched_clock() - hlock->holdtime_stamp;
+	holdtime = sched_clock(&hlock->holdtime_stamp);
 
 	stats = get_lock_stats(hlock->class);
 	if (hlock->read)
@@ -2413,7 +2413,7 @@ static int __lock_acquire(struct lockdep
 	hlock->hardirqs_off = hardirqs_off;
 #ifdef CONFIG_LOCK_STAT
 	hlock->waittime_stamp = 0;
-	hlock->holdtime_stamp = sched_clock();
+	(void)sched_clock(&hlock->holdtime_stamp);
 #endif
 
 	if (check == 2 && !mark_irqflags(curr, hlock))
@@ -2780,7 +2780,7 @@ __lock_contended(struct lockdep_map *loc
 	return;
 
 found_it:
-	hlock->waittime_stamp = sched_clock();
+	(void)sched_clock(&hlock->waittime_stamp);
 
 	point = lock_contention_point(hlock->class, ip);
 
@@ -2825,9 +2825,8 @@ __lock_acquired(struct lockdep_map *lock
 found_it:
 	cpu = smp_processor_id();
 	if (hlock->waittime_stamp) {
-		now = sched_clock();
-		waittime = now - hlock->waittime_stamp;
-		hlock->holdtime_stamp = now;
+		waittime = sched_clock(&hlock->waittime_stamp);
+		hlock->holdtime_stamp = hlock->waittime_stamp;
 	}
 
 	stats = get_lock_stats(hlock->class);
Index: linux-2.6/kernel/printk.c
===================================================================
--- linux-2.6.orig/kernel/printk.c
+++ linux-2.6/kernel/printk.c
@@ -575,7 +575,7 @@ __setup("time", printk_time_setup);
 
 __attribute__((weak)) unsigned long long printk_clock(void)
 {
-	return sched_clock();
+	return (unsigned long long)get_cycles();
 }
 
 /* Check if we have any console registered that can be called early in boot. */
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -176,7 +176,7 @@ static void print_cpu(struct seq_file *m
 	P(curr->pid);
 	PN(clock);
 	PN(idle_clock);
-	PN(prev_clock_raw);
+	PN(prev_clock_sample);
 	P(clock_warps);
 	P(clock_overflows);
 	P(clock_deep_idle_events);
@@ -353,12 +353,12 @@ void proc_sched_show_task(struct task_st
 #undef __P
 
 	{
-		u64 t0, t1;
+		u64 sample, delta;
 
-		t0 = sched_clock();
-		t1 = sched_clock();
+		(void)sched_clock(&sample);
+		delta = sched_clock(&sample);
 		SEQ_printf(m, "%-35s:%21Ld\n",
-			   "clock-delta", (long long)(t1-t0));
+			   "clock-delta", (long long)delta);
 	}
 }
 


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  1:19 [PATCH] scheduler: fix x86 regression in native_sched_clock Stefano Brivio
  2007-12-07  5:29 ` Nick Piggin
@ 2007-12-07  5:51 ` Thomas Gleixner
  2007-12-07  7:18   ` Guillaume Chazarain
  2007-12-07 10:37   ` [PATCH] scheduler: fix x86 regression in native_sched_clock Andi Kleen
  2007-12-07  8:45 ` Ingo Molnar
  2 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2007-12-07  5:51 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Ingo Molnar, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Andrew Morton, Len Brown

On Fri, 7 Dec 2007, Stefano Brivio wrote:

> This patch fixes a regression introduced by:
> 
> commit bb29ab26863c022743143f27956cc0ca362f258c
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Mon Jul 9 18:51:59 2007 +0200
> 
> This caused the jiffies counter to leap back and forth on cpufreq changes
> on my x86 box. I'd say that we can't always assume that TSC does "small
> errors" only, when marked unstable. On cpufreq changes these errors can be
> huge.

Hmrpf. sched_clock() is used for the time stamp of the printks. We
need to find some better solution other than killing off the tsc
access completely.
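
(For reference, the timestamp path in vprintk() is roughly the
following - a simplified sketch of the code the patches in this thread
touch:)

	t = printk_clock();			/* sched_clock() by default */
	nanosec_rem = do_div(t, 1000000000);	/* t = secs, rem = nsecs */
	tlen = sprintf(tbuf, "<%c>[%5lu.%06lu] ", loglev_char,
		       (unsigned long)t, nanosec_rem / 1000);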

Ingo ???

Thanks,

     tglx
 
> The original bug report can be found here:
> http://bugzilla.kernel.org/show_bug.cgi?id=9475
> 
> 
> Signed-off-by: Stefano Brivio <stefano.brivio@polimi.it>
> 
> ---
> 
> diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
> index 9ebc0da..d29cd9c 100644
> --- a/arch/x86/kernel/tsc_32.c
> +++ b/arch/x86/kernel/tsc_32.c
> @@ -98,13 +98,8 @@ unsigned long long native_sched_clock(void)
>  
>  	/*
>  	 * Fall back to jiffies if there's no TSC available:
> -	 * ( But note that we still use it if the TSC is marked
> -	 *   unstable. We do this because unlike Time Of Day,
> -	 *   the scheduler clock tolerates small errors and it's
> -	 *   very important for it to be as fast as the platform
> -	 *   can achive it. )
>  	 */
> -	if (unlikely(!tsc_enabled && !tsc_unstable))
> +	if (unlikely(!tsc_enabled))
>  		/* No locking but a rare wrong value is not a big deal: */
>  		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
>  
> 
> -- 
> Ciao
> Stefano
> 


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  5:51 ` Thomas Gleixner
@ 2007-12-07  7:18   ` Guillaume Chazarain
  2007-12-07  8:02     ` Guillaume Chazarain
  2007-12-07 10:37   ` [PATCH] scheduler: fix x86 regression in native_sched_clock Andi Kleen
  1 sibling, 1 reply; 47+ messages in thread
From: Guillaume Chazarain @ 2007-12-07  7:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Stefano Brivio, Ingo Molnar, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Andrew Morton,
	Len Brown

On Dec 7, 2007 6:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Hmrpf. sched_clock() is used for the time stamp of the printks. We
> need to find some better solution other than killing off the tsc
> access completely.

Something like http://lkml.org/lkml/2007/3/16/291 that would need some refresh?

-- 
Guillaume


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  7:18   ` Guillaume Chazarain
@ 2007-12-07  8:02     ` Guillaume Chazarain
  2007-12-07  8:51       ` Ingo Molnar
  2007-12-07 13:55       ` [patch] x86: scale cyc_2_nsec according to CPU frequency Ingo Molnar
  0 siblings, 2 replies; 47+ messages in thread
From: Guillaume Chazarain @ 2007-12-07  8:02 UTC (permalink / raw)
  To: Guillaume Chazarain
  Cc: Thomas Gleixner, Stefano Brivio, Ingo Molnar, Robert Love,
	linux-kernel, Dave Jones, Rafael J. Wysocki, Michael Buesch,
	Andrew Morton

"Guillaume Chazarain" <guichaz@yahoo.fr> wrote:

> On Dec 7, 2007 6:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Hmrpf. sched_clock() is used for the time stamp of the printks. We
> > need to find some better solution other than killing off the tsc
> > access completely.
> 
> Something like http://lkml.org/lkml/2007/3/16/291 that would need some refresh?

And here is a refreshed one just for testing with 2.6-git. The 64 bit
part is a shamelessly untested copy/paste as I cannot test it.

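Roughly, the invariant that set_cyc2ns_scale() below maintains (the
same algebra as in the patch, written out flat, with SF =
CYC2NS_SCALE_FACTOR):

	ns_now     = ((tsc_now * scale_old) >> SF) + offset_old
	scale_new  = (NSEC_PER_MSEC << SF) / cpu_khz
	offset_new = offset_old + ns_now
	           - (((tsc_now * scale_new) >> SF) + offset_old)
	           = ns_now - ((tsc_now * scale_new) >> SF)

so ((tsc_now * scale_new) >> SF) + offset_new == ns_now, i.e. the
converted clock stays continuous across the frequency change.
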
diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 9ebc0da..d561b2f 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include <linux/jiffies.h>
 #include <linux/init.h>
 #include <linux/dmi.h>
+#include <linux/percpu.h>
 
 #include <asm/delay.h>
 #include <asm/tsc.h>
@@ -78,15 +79,32 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
+
+static void set_cyc2ns_scale(unsigned long cpu_khz)
 {
-	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	struct cyc2ns_params *params;
+	unsigned long flags;
+	unsigned long long tsc_now, ns_now;
+
+	rdtscll(tsc_now);
+	params = &get_cpu_var(cyc2ns);
+
+	local_irq_save(flags);
+	ns_now = __cycles_2_ns(params, tsc_now);
+
+	params->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	params->offset += ns_now - __cycles_2_ns(params, tsc_now);
+	local_irq_restore(flags);
+
+	put_cpu_var(cyc2ns);
 }
 
 /*
diff --git a/arch/x86/kernel/tsc_64.c b/arch/x86/kernel/tsc_64.c
index 9c70af4..93e7a06 100644
--- a/arch/x86/kernel/tsc_64.c
+++ b/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include <asm/hpet.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,25 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
 
-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz)
 {
-	cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+	struct cyc2ns_params *params;
+	unsigned long flags;
+	unsigned long long tsc_now, ns_now;
 
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	return (cyc * cyc2ns_scale) >> NS_SCALE;
+	rdtscll(tsc_now);
+	params = &get_cpu_var(cyc2ns);
+
+	local_irq_save(flags);
+	ns_now = __cycles_2_ns(params, tsc_now);
+
+	params->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	params->offset += ns_now - __cycles_2_ns(params, tsc_now);
+	local_irq_restore(flags);
+
+	put_cpu_var(cyc2ns);
 }
 
 unsigned long long sched_clock(void)
diff --git a/include/asm-x86/timer.h b/include/asm-x86/timer.h
index 0db7e99..ff4f2a3 100644
--- a/include/asm-x86/timer.h
+++ b/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
 #define _ASMi386_TIMER_H
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/percpu.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *		ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,44 @@ extern int recalibrate_cpu_khz(void);
  *	And since SC is a constant power of two, we can convert the div
  *  into a shift.
  *
- *  We can use khz divisor instead of mhz to keep a better percision, since
+ *  We can use khz divisor instead of mhz to keep a better precision, since
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-extern unsigned long cyc2ns_scale __read_mostly;
+
+struct cyc2ns_params {
+	unsigned long scale;
+	unsigned long long offset;
+};
+
+DECLARE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(struct cyc2ns_params *params,
+					       unsigned long long cyc)
 {
-	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	return ((cyc * params->scale) >> CYC2NS_SCALE_FACTOR) + params->offset;
 }
 
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	struct cyc2ns_params *params;
+	unsigned long flags;
+	unsigned long long ns;
+
+	params = &get_cpu_var(cyc2ns);
+
+	local_irq_save(flags);
+	ns = __cycles_2_ns(params, cyc);
+	local_irq_restore(flags);
+
+	put_cpu_var(cyc2ns);
+	return ns;
+}
 
 #endif


-- 
Guillaume


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  1:19 [PATCH] scheduler: fix x86 regression in native_sched_clock Stefano Brivio
  2007-12-07  5:29 ` Nick Piggin
  2007-12-07  5:51 ` Thomas Gleixner
@ 2007-12-07  8:45 ` Ingo Molnar
  2007-12-07 10:32   ` Andrew Morton
  2007-12-07 11:13   ` Nick Piggin
  2 siblings, 2 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07  8:45 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Robert Love, linux-kernel, Dave Jones, Rafael J. Wysocki,
	Michael Buesch, Thomas Gleixner, Andrew Morton, Len Brown


* Stefano Brivio <stefano.brivio@polimi.it> wrote:

> This patch fixes a regression introduced by:
> 
> commit bb29ab26863c022743143f27956cc0ca362f258c
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Mon Jul 9 18:51:59 2007 +0200
> 
> This caused the jiffies counter to leap back and forth on cpufreq 
> changes on my x86 box. I'd say that we can't always assume that TSC 
> does "small errors" only, when marked unstable. On cpufreq changes 
> these errors can be huge.

ah, printk_clock() still uses sched_clock(), not jiffies. So it's not 
the jiffies counter that goes back and forth, it's sched_clock() - so 
this is a printk timestamps anomaly, not related to jiffies. I thought 
we have fixed this bug in the printk code already: sched_clock() is a 
'raw' interface that should not be used directly - the proper interface 
is cpu_clock(cpu). Does the patch below help?

	Ingo

----------------------->
Subject: sched: fix CONFIG_PRINTK_TIME's reliance on sched_clock()
From: Ingo Molnar <mingo@elte.hu>

Stefano Brivio reported weird printk timestamp behavior during
CPU frequency changes:

  http://bugzilla.kernel.org/show_bug.cgi?id=9475

fix CONFIG_PRINTK_TIME's reliance on sched_clock() and use cpu_clock()
instead.

Reported-and-bisected-by: Stefano Brivio <stefano.brivio@polimi.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/printk.c |    2 +-
 kernel/sched.c  |    7 ++++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -680,7 +680,7 @@ asmlinkage int vprintk(const char *fmt, 
 					loglev_char = default_message_loglevel
 						+ '0';
 				}
-				t = printk_clock();
+				t = cpu_clock(printk_cpu);
 				nanosec_rem = do_div(t, 1000000000);
 				tlen = sprintf(tbuf,
 						"<%c>[%5lu.%06lu] ",
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -599,7 +599,12 @@ unsigned long long cpu_clock(int cpu)
 
 	local_irq_save(flags);
 	rq = cpu_rq(cpu);
-	update_rq_clock(rq);
+	/*
+	 * Only call sched_clock() if the scheduler has already been
+	 * initialized (some code might call cpu_clock() very early):
+	 */
+	if (rq->idle)
+		update_rq_clock(rq);
 	now = rq->clock;
 	local_irq_restore(flags);
 


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  8:02     ` Guillaume Chazarain
@ 2007-12-07  8:51       ` Ingo Molnar
  2007-12-07  9:29         ` Guillaume Chazarain
  2007-12-07 13:55       ` [patch] x86: scale cyc_2_nsec according to CPU frequency Ingo Molnar
  1 sibling, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07  8:51 UTC (permalink / raw)
  To: Guillaume Chazarain
  Cc: Thomas Gleixner, Stefano Brivio, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Andrew Morton


* Guillaume Chazarain <guichaz@yahoo.fr> wrote:

> > Something like http://lkml.org/lkml/2007/3/16/291 that would need 
> > some refresh?
> 
> And here is a refreshed one just for testing with 2.6-git. The 64 bit 
> part is a shamelessly untested copy/paste as I cannot test it.

yeah, we can do something like this in 2.6.25 - this will improve the 
quality of sched_clock(). The other patch i sent should solve the 
problem for 2.6.24 - printk should not be using raw sched_clock() calls. 
(as the name says it's for the scheduler's internal use.) I've also 
queued up the patch below - it removes the now unnecessary printk clock 
code.

	Ingo

--------------------->
Subject: sched: remove printk_clock()
From: Ingo Molnar <mingo@elte.hu>

printk_clock() is obsolete - it has been replaced with cpu_clock().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/arm/kernel/time.c  |   11 -----------
 arch/ia64/kernel/time.c |   27 ---------------------------
 kernel/printk.c         |    5 -----
 3 files changed, 43 deletions(-)

Index: linux/arch/arm/kernel/time.c
===================================================================
--- linux.orig/arch/arm/kernel/time.c
+++ linux/arch/arm/kernel/time.c
@@ -79,17 +79,6 @@ static unsigned long dummy_gettimeoffset
 }
 #endif
 
-/*
- * An implementation of printk_clock() independent from
- * sched_clock().  This avoids non-bootable kernels when
- * printk_clock is enabled.
- */
-unsigned long long printk_clock(void)
-{
-	return (unsigned long long)(jiffies - INITIAL_JIFFIES) *
-			(1000000000 / HZ);
-}
-
 static unsigned long next_rtc_update;
 
 /*
Index: linux/arch/ia64/kernel/time.c
===================================================================
--- linux.orig/arch/ia64/kernel/time.c
+++ linux/arch/ia64/kernel/time.c
@@ -344,33 +344,6 @@ udelay (unsigned long usecs)
 }
 EXPORT_SYMBOL(udelay);
 
-static unsigned long long ia64_itc_printk_clock(void)
-{
-	if (ia64_get_kr(IA64_KR_PER_CPU_DATA))
-		return sched_clock();
-	return 0;
-}
-
-static unsigned long long ia64_default_printk_clock(void)
-{
-	return (unsigned long long)(jiffies_64 - INITIAL_JIFFIES) *
-		(1000000000/HZ);
-}
-
-unsigned long long (*ia64_printk_clock)(void) = &ia64_default_printk_clock;
-
-unsigned long long printk_clock(void)
-{
-	return ia64_printk_clock();
-}
-
-void __init
-ia64_setup_printk_clock(void)
-{
-	if (!(sal_platform_features & IA64_SAL_PLATFORM_FEATURE_ITC_DRIFT))
-		ia64_printk_clock = ia64_itc_printk_clock;
-}
-
 /* IA64 doesn't cache the timezone */
 void update_vsyscall_tz(void)
 {
Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -573,11 +573,6 @@ static int __init printk_time_setup(char
 
 __setup("time", printk_time_setup);
 
-__attribute__((weak)) unsigned long long printk_clock(void)
-{
-	return sched_clock();
-}
-
 /* Check if we have any console registered that can be called early in boot. */
 static int have_callable_console(void)
 {


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  8:51       ` Ingo Molnar
@ 2007-12-07  9:29         ` Guillaume Chazarain
  2007-12-07  9:59           ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Guillaume Chazarain @ 2007-12-07  9:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Guillaume Chazarain, Thomas Gleixner, Stefano Brivio,
	Robert Love, linux-kernel, Dave Jones, Rafael J. Wysocki,
	Michael Buesch, Andrew Morton

On Fri, 7 Dec 2007 09:51:21 +0100,
Ingo Molnar <mingo@elte.hu> wrote:

> yeah, we can do something like this in 2.6.25 - this will improve the 
> quality of sched_clock().

Thanks a lot for your interest!

I'll clean it up and resend it later. As I don't have the necessary
knowledge to do the tsc_{32,64}.c unification, should I copy paste
common functions into tsc_32.c and tsc_64.c to ease later unification
or should I start a common .c file?

Thanks again for showing interest.

-- 
Guillaume


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  9:29         ` Guillaume Chazarain
@ 2007-12-07  9:59           ` Ingo Molnar
  0 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07  9:59 UTC (permalink / raw)
  To: Guillaume Chazarain
  Cc: Thomas Gleixner, Stefano Brivio, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Andrew Morton


* Guillaume Chazarain <guichaz@yahoo.fr> wrote:

> I'll clean it up and resend it later. As I don't have the necessary 
> knowledge to do the tsc_{32,64}.c unification, should I copy paste 
> common functions into tsc_32.c and tsc_64.c to ease later unification 
> or should I start a common .c file?

note that there are a couple of existing patches in this area. One is 
the fix below. There's also older frequency-scaling TSC patches - i'll 
try to dig them out.

	Ingo

---------------->
Subject: x86: idle wakeup event in the HLT loop
From: Ingo Molnar <mingo@elte.hu>

do a proper idle-wakeup event on HLT as well - some CPUs stop the TSC
in HLT too, not just when going through the ACPI methods.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process_32.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Index: linux/arch/x86/kernel/process_32.c
===================================================================
--- linux.orig/arch/x86/kernel/process_32.c
+++ linux/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
 		smp_mb();
 
 		local_irq_disable();
-		if (!need_resched())
+		if (!need_resched()) {
+			ktime_t t0, t1;
+			u64 t0n, t1n;
+
+			t0 = ktime_get();
+			t0n = ktime_to_ns(t0);
 			safe_halt();	/* enables interrupts racelessly */
-		else
-			local_irq_enable();
+			local_irq_disable();
+			t1 = ktime_get();
+			t1n = ktime_to_ns(t1);
+			sched_clock_idle_wakeup_event(t1n - t0n);
+		}
+		local_irq_enable();
 		current_thread_info()->status |= TS_POLLING;
 	} else {
 		/* loop is done by the caller */


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  8:45 ` Ingo Molnar
@ 2007-12-07 10:32   ` Andrew Morton
  2007-12-07 10:40     ` Ingo Molnar
  2007-12-07 11:13   ` Nick Piggin
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2007-12-07 10:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner, Len Brown

On Fri, 7 Dec 2007 09:45:59 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Stefano Brivio <stefano.brivio@polimi.it> wrote:
> 
> > This patch fixes a regression introduced by:
> > 
> > commit bb29ab26863c022743143f27956cc0ca362f258c
> > Author: Ingo Molnar <mingo@elte.hu>
> > Date:   Mon Jul 9 18:51:59 2007 +0200
> > 
> > This caused the jiffies counter to leap back and forth on cpufreq 
> > changes on my x86 box. I'd say that we can't always assume that TSC 
> > does "small errors" only, when marked unstable. On cpufreq changes 
> > these errors can be huge.
> 
> ah, printk_clock() still uses sched_clock(), not jiffies. So it's not 
> the jiffies counter that goes back and forth, it's sched_clock() - so 
> this is a printk timestamps anomaly, not related to jiffies. I thought 
> we have fixed this bug in the printk code already: sched_clock() is a 
> 'raw' interface that should not be used directly - the proper interface 
> is cpu_clock(cpu). Does the patch below help?
> 
> 	Ingo
> 
> ----------------------->
> Subject: sched: fix CONFIG_PRINTK_TIME's reliance on sched_clock()
> From: Ingo Molnar <mingo@elte.hu>
> 
> Stefano Brivio reported weird printk timestamp behavior during
> CPU frequency changes:
> 
>   http://bugzilla.kernel.org/show_bug.cgi?id=9475
> 
> fix CONFIG_PRINTK_TIME's reliance on sched_clock() and use cpu_clock()
> instead.
> 
> Reported-and-bisected-by: Stefano Brivio <stefano.brivio@polimi.it>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  kernel/printk.c |    2 +-
>  kernel/sched.c  |    7 ++++++-
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> Index: linux/kernel/printk.c
> ===================================================================
> --- linux.orig/kernel/printk.c
> +++ linux/kernel/printk.c
> @@ -680,7 +680,7 @@ asmlinkage int vprintk(const char *fmt, 
>  					loglev_char = default_message_loglevel
>  						+ '0';
>  				}
> -				t = printk_clock();
> +				t = cpu_clock(printk_cpu);
>  				nanosec_rem = do_div(t, 1000000000);
>  				tlen = sprintf(tbuf,
>  						"<%c>[%5lu.%06lu] ",

A bit risky - it's quite an expansion of code which no longer can call printk.

You might want to take that WARN_ON out of __update_rq_clock() ;)



* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  5:51 ` Thomas Gleixner
  2007-12-07  7:18   ` Guillaume Chazarain
@ 2007-12-07 10:37   ` Andi Kleen
  1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2007-12-07 10:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Stefano Brivio, Ingo Molnar, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Andrew Morton,
	Len Brown

Thomas Gleixner <tglx@linutronix.de> writes:
>
> Hmrpf. sched_clock() is used for the time stamp of the printks. We
> need to find some better solution other than killing off the tsc
> access completely.

Doing it properly requires pretty much most of my old sched-clock ff patch.
Complicated and not pretty, but ..
Unfortunately that version still had some jumps on cpufreq, but they
are fixable there.

-Andi


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 10:32   ` Andrew Morton
@ 2007-12-07 10:40     ` Ingo Molnar
  2007-12-07 11:07       ` Ingo Molnar
  2007-12-07 11:09       ` Andrew Morton
  0 siblings, 2 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 10:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner, Len Brown


* Andrew Morton <akpm@linux-foundation.org> wrote:

> > -				t = printk_clock();
> > +				t = cpu_clock(printk_cpu);
> >  				nanosec_rem = do_div(t, 1000000000);
> >  				tlen = sprintf(tbuf,
> >  						"<%c>[%5lu.%06lu] ",
> 
> A bit risky - it's quite an expansion of code which no longer can call 
> printk.
> 
> You might want to take that WARN_ON out of __update_rq_clock() ;)

hm, dont we already detect printk recursions and turn them into a silent 
return instead of a hang/crash?

	Ingo


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 10:40     ` Ingo Molnar
@ 2007-12-07 11:07       ` Ingo Molnar
  2007-12-07 11:09       ` Andrew Morton
  1 sibling, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner, Len Brown


* Ingo Molnar <mingo@elte.hu> wrote:

> > > -				t = printk_clock();
> > > +				t = cpu_clock(printk_cpu);
> > >  				nanosec_rem = do_div(t, 1000000000);
> > >  				tlen = sprintf(tbuf,
> > >  						"<%c>[%5lu.%06lu] ",
> > 
> > A bit risky - it's quite an expansion of code which no longer can call 
> > printk.
> > 
> > You might want to take that WARN_ON out of __update_rq_clock() ;)
> 
> hm, dont we already detect printk recursions and turn them into a 
> silent return instead of a hang/crash?

ugh, we dont. So i guess the (tested) patch below is highly needed. (If 
such incidents become frequent then we could save the stackdump of the 
recursion via save_stack_trace() too - but i wanted to keep the initial 
code simple.)

	Ingo

---------------->
Subject: printk: make printk more robust by not allowing recursion
From: Ingo Molnar <mingo@elte.hu>

make printk more robust by allowing recursion only if there's a crash
going on. Also add recursion detection.

I've tested it with an artificially injected printk recursion - instead
of a lockup or spontaneous reboot or other crash, the output was a well
controlled:

[   41.057335] SysRq : <2>BUG: recent printk recursion!
[   41.057335] loglevel0-8 reBoot Crashdump show-all-locks(D) tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks

also do all this printk logic with irqs disabled.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/printk.c |   52 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 10 deletions(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -623,30 +623,57 @@ asmlinkage int printk(const char *fmt, .
 /* cpu currently holding logbuf_lock */
 static volatile unsigned int printk_cpu = UINT_MAX;
 
+const char printk_recursion_bug_msg [] =
+			KERN_CRIT "BUG: recent printk recursion!\n";
+static int printk_recursion_bug;
+
 asmlinkage int vprintk(const char *fmt, va_list args)
 {
+	static int log_level_unknown = 1;
+	static char printk_buf[1024];
+
 	unsigned long flags;
-	int printed_len;
+	int printed_len = 0;
+	int this_cpu;
 	char *p;
-	static char printk_buf[1024];
-	static int log_level_unknown = 1;
 
 	boot_delay_msec();
 
 	preempt_disable();
-	if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id())
-		/* If a crash is occurring during printk() on this CPU,
-		 * make sure we can't deadlock */
-		zap_locks();
-
 	/* This stops the holder of console_sem just where we want him */
 	raw_local_irq_save(flags);
+	this_cpu = smp_processor_id();
+
+	/*
+	 * Ouch, printk recursed into itself!
+	 */
+	if (unlikely(printk_cpu == this_cpu)) {
+		/*
+		 * If a crash is occurring during printk() on this CPU,
+		 * then try to get the crash message out but make sure
+		 * we can't deadlock. Otherwise just return to avoid the
+		 * recursion and return - but flag the recursion so that
+		 * it can be printed at the next appropriate moment:
+		 */
+		if (!oops_in_progress) {
+			printk_recursion_bug = 1;
+			goto out_restore_irqs;
+		}
+		zap_locks();
+	}
+
 	lockdep_off();
 	spin_lock(&logbuf_lock);
-	printk_cpu = smp_processor_id();
+	printk_cpu = this_cpu;
 
+	if (printk_recursion_bug) {
+		printk_recursion_bug = 0;
+		strcpy(printk_buf, printk_recursion_bug_msg);
+		printed_len = sizeof(printk_recursion_bug_msg);
+	}
 	/* Emit the output into the temporary buffer */
-	printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args);
+	printed_len += vscnprintf(printk_buf + printed_len,
+				  sizeof(printk_buf) - printed_len, fmt, args);
 
 	/*
 	 * Copy the output into log_buf.  If the caller didn't provide
@@ -675,6 +702,10 @@ asmlinkage int vprintk(const char *fmt, 
 					loglev_char = default_message_loglevel
 						+ '0';
 				}
+				if (panic_timeout) {
+					panic_timeout = 0;
+					printk("recurse!\n");
+				}
 				t = cpu_clock(printk_cpu);
 				nanosec_rem = do_div(t, 1000000000);
 				tlen = sprintf(tbuf,
@@ -739,6 +770,7 @@ asmlinkage int vprintk(const char *fmt, 
 		printk_cpu = UINT_MAX;
 		spin_unlock(&logbuf_lock);
 		lockdep_on();
+out_restore_irqs:
 		raw_local_irq_restore(flags);
 	}
 


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 10:40     ` Ingo Molnar
  2007-12-07 11:07       ` Ingo Molnar
@ 2007-12-07 11:09       ` Andrew Morton
  2007-12-07 11:12         ` Ingo Molnar
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2007-12-07 11:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner, Len Brown

On Fri, 7 Dec 2007 11:40:13 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > -				t = printk_clock();
> > > +				t = cpu_clock(printk_cpu);
> > >  				nanosec_rem = do_div(t, 1000000000);
> > >  				tlen = sprintf(tbuf,
> > >  						"<%c>[%5lu.%06lu] ",
> > 
> > A bit risky - it's quite an expansion of code which no longer can call 
> > printk.
> > 
> > You might want to take that WARN_ON out of __update_rq_clock() ;)
> 
> hm, dont we already detect printk recursions and turn them into a silent 
> return instead of a hang/crash?
> 

We'll pop the locks and will proceed to do the nested printk.  So
__update_rq_clock() will need rather a lot of stack ;)


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 11:09       ` Andrew Morton
@ 2007-12-07 11:12         ` Ingo Molnar
  0 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner, Len Brown


* Andrew Morton <akpm@linux-foundation.org> wrote:

> > > A bit risky - it's quite an expansion of code which no longer can 
> > > call printk.
> > > 
> > > You might want to take that WARN_ON out of __update_rq_clock() ;)
> > 
> > hm, dont we already detect printk recursions and turn them into a 
> > silent return instead of a hang/crash?
> 
> We'll pop the locks and will proceed to do the nested printk.  So 
> __update_rq_clock() will need rather a lot of stack ;)

yeah. That behavior of printk is rather fragile. I think my previous 
patch should handle all such incidents.

	Ingo


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07  8:45 ` Ingo Molnar
  2007-12-07 10:32   ` Andrew Morton
@ 2007-12-07 11:13   ` Nick Piggin
  2007-12-07 11:17     ` Ingo Molnar
                       ` (3 more replies)
  1 sibling, 4 replies; 47+ messages in thread
From: Nick Piggin @ 2007-12-07 11:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

On Friday 07 December 2007 19:45, Ingo Molnar wrote:
> * Stefano Brivio <stefano.brivio@polimi.it> wrote:
> > This patch fixes a regression introduced by:
> >
> > commit bb29ab26863c022743143f27956cc0ca362f258c
> > Author: Ingo Molnar <mingo@elte.hu>
> > Date:   Mon Jul 9 18:51:59 2007 +0200
> >
> > This caused the jiffies counter to leap back and forth on cpufreq
> > changes on my x86 box. I'd say that we can't always assume that TSC
> > does "small errors" only, when marked unstable. On cpufreq changes
> > these errors can be huge.
>
> ah, printk_clock() still uses sched_clock(), not jiffies. So it's not
> the jiffies counter that goes back and forth, it's sched_clock() - so
> this is a printk timestamps anomaly, not related to jiffies. I thought
> we have fixed this bug in the printk code already: sched_clock() is a
> 'raw' interface that should not be used directly - the proper interface
> is cpu_clock(cpu).

It's a single CPU box, so sched_clock() jumping would still be
problematic, no?

My patch should fix the worst cpufreq sched_clock jumping issue
I think.


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 11:13   ` Nick Piggin
@ 2007-12-07 11:17     ` Ingo Molnar
  2007-12-07 16:48       ` Nick Piggin
  2007-12-07 11:18     ` Guillaume Chazarain
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 11:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > ah, printk_clock() still uses sched_clock(), not jiffies. So it's 
> > not the jiffies counter that goes back and forth, it's sched_clock() 
> > - so this is a printk timestamps anomaly, not related to jiffies. I 
> > thought we have fixed this bug in the printk code already: 
> > sched_clock() is a 'raw' interface that should not be used directly 
> > - the proper interface is cpu_clock(cpu).
> 
> It's a single CPU box, so sched_clock() jumping would still be 
> problematic, no?

sched_clock() is an internal API - the non-jumping API to be used by 
printk is cpu_clock().
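
(Why cpu_clock() does not jump - a simplified sketch of the filtering
in __update_rq_clock(), which cpu_clock() goes through before a raw
timestamp ever reaches rq->clock:)

	delta = now - prev_raw;			/* raw sched_clock() delta */
	if (unlikely(delta < 0)) {
		clock++;			/* clamp backward warps */
		rq->clock_warps++;
	} else
		clock += delta;			/* normal forward progress */
	rq->clock = clock;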

	Ingo


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 11:13   ` Nick Piggin
  2007-12-07 11:17     ` Ingo Molnar
@ 2007-12-07 11:18     ` Guillaume Chazarain
  2007-12-07 11:57       ` Guillaume Chazarain
  2007-12-07 11:23     ` stefano.brivio
  2007-12-07 11:24     ` Ingo Molnar
  3 siblings, 1 reply; 47+ messages in thread
From: Guillaume Chazarain @ 2007-12-07 11:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Stefano Brivio, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

On Dec 7, 2007 12:13 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> My patch should fix the worst cpufreq sched_clock jumping issue
> I think.

Any pointer to it?

Thanks.

-- 
Guillaume


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 11:13   ` Nick Piggin
  2007-12-07 11:17     ` Ingo Molnar
  2007-12-07 11:18     ` Guillaume Chazarain
@ 2007-12-07 11:23     ` stefano.brivio
  2007-12-07 12:11       ` Ingo Molnar
  2007-12-07 12:25       ` Ingo Molnar
  2007-12-07 11:24     ` Ingo Molnar
  3 siblings, 2 replies; 47+ messages in thread
From: stefano.brivio @ 2007-12-07 11:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

Quoting Nick Piggin <nickpiggin@yahoo.com.au>:

> On Friday 07 December 2007 19:45, Ingo Molnar wrote:
>>
>> ah, printk_clock() still uses sched_clock(), not jiffies. So it's not
>> the jiffies counter that goes back and forth, it's sched_clock() - so
>> this is a printk timestamps anomaly, not related to jiffies. I thought
>> we have fixed this bug in the printk code already: sched_clock() is a
>> 'raw' interface that should not be used directly - the proper interface
>> is cpu_clock(cpu).
>
> It's a single CPU box, so sched_clock() jumping would still be
> problematic, no?

I guess so. Definitely, it didn't look like a printk issue. Drivers
don't read logs, usually. But they got confused anyway (it seems that
udelays get scaled or fail or somesuch - I can't test it right now,
will provide more feedback in a few hours).


--
Ciao
Stefano





* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 11:13   ` Nick Piggin
                       ` (2 preceding siblings ...)
  2007-12-07 11:23     ` stefano.brivio
@ 2007-12-07 11:24     ` Ingo Molnar
  3 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 11:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> My patch should fix the worst cpufreq sched_clock jumping issue I 
> think.

but it degrades the precision of sched_clock() and has other problems as 
well. cpu_clock() is the right interface to use for such things.

	Ingo


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 11:18     ` Guillaume Chazarain
@ 2007-12-07 11:57       ` Guillaume Chazarain
  0 siblings, 0 replies; 47+ messages in thread
From: Guillaume Chazarain @ 2007-12-07 11:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Stefano Brivio, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

On Dec 7, 2007 12:18 PM, Guillaume Chazarain <guichaz@yahoo.fr> wrote:
> Any pointer to it?

Nevermind, I found it ... in this same thread :-(

-- 
Guillaume


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 11:23     ` stefano.brivio
@ 2007-12-07 12:11       ` Ingo Molnar
  2007-12-07 12:25       ` Ingo Molnar
  1 sibling, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 12:11 UTC (permalink / raw)
  To: stefano.brivio
  Cc: Nick Piggin, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown


* stefano.brivio@polimi.it <stefano.brivio@polimi.it> wrote:

>> It's a single CPU box, so sched_clock() jumping would still be 
>> problematic, no?
>
> I guess so. Definitely, it didn't look like a printk issue. Drivers 
> don't read logs, usually. But they got confused anyway (it seems that 
> udelay's get scaled or fail or somesuch - I can't test it right now, 
> will provide more feedback in a few hours).

no, i think it's just another aspect of the broken TSC on that hardware. 
Does the patch below improve things?

	Ingo

------------------->
Subject: x86: cpu_clock() based udelay
From: Ingo Molnar <mingo@elte.hu>

use cpu_clock() for TSC based udelay - it's more reliable than raw
TSC based delay loops.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/lib/delay_32.c |   20 ++++++++++++--------
 arch/x86/lib/delay_64.c |   27 ++++++++++++++++++---------
 2 files changed, 30 insertions(+), 17 deletions(-)

Index: linux/arch/x86/lib/delay_32.c
===================================================================
--- linux.orig/arch/x86/lib/delay_32.c
+++ linux/arch/x86/lib/delay_32.c
@@ -38,17 +38,21 @@ static void delay_loop(unsigned long loo
 		:"0" (loops));
 }
 
-/* TSC based delay: */
+/* cpu_clock() [TSC] based delay: */
 static void delay_tsc(unsigned long loops)
 {
-	unsigned long bclock, now;
+	unsigned long long start, stop, now;
+	int this_cpu;
+
+	preempt_disable();
+
+	this_cpu = smp_processor_id();
+	start = now = cpu_clock(this_cpu);
+	stop = start + loops;
+
+	while ((long long)(stop - now) > 0)
+		now = cpu_clock(this_cpu);
 
-	preempt_disable();		/* TSC's are per-cpu */
-	rdtscl(bclock);
-	do {
-		rep_nop();
-		rdtscl(now);
-	} while ((now-bclock) < loops);
 	preempt_enable();
 }
 
Index: linux/arch/x86/lib/delay_64.c
===================================================================
--- linux.orig/arch/x86/lib/delay_64.c
+++ linux/arch/x86/lib/delay_64.c
@@ -26,19 +26,28 @@ int read_current_timer(unsigned long *ti
 	return 0;
 }
 
-void __delay(unsigned long loops)
+/* cpu_clock() [TSC] based delay: */
+static void delay_tsc(unsigned long loops)
 {
-	unsigned bclock, now;
+	unsigned long long start, stop, now;
+	int this_cpu;
+
+	preempt_disable();
+
+	this_cpu = smp_processor_id();
+	start = now = cpu_clock(this_cpu);
+	stop = start + loops;
+
+	while ((long long)(stop - now) > 0)
+		now = cpu_clock(this_cpu);
 
-	preempt_disable();		/* TSC's are pre-cpu */
-	rdtscl(bclock);
-	do {
-		rep_nop(); 
-		rdtscl(now);
-	}
-	while ((now-bclock) < loops);
 	preempt_enable();
 }
+
+void __delay(unsigned long loops)
+{
+	delay_tsc(loops);
+}
 EXPORT_SYMBOL(__delay);
 
 inline void __const_udelay(unsigned long xloops)


* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 11:23     ` stefano.brivio
  2007-12-07 12:11       ` Ingo Molnar
@ 2007-12-07 12:25       ` Ingo Molnar
  2007-12-07 12:35         ` Ingo Molnar
  1 sibling, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 12:25 UTC (permalink / raw)
  To: stefano.brivio
  Cc: Nick Piggin, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown, Guillaume Chazarain


ok, here's a rollup of 11 patches that relate to this. I hoped we could 
wait with this for 2.6.25, but it seems more urgent as per Stefano's 
testing, as udelay() and drivers are affected as well.

Stefano, could you try this ontop of a recent-ish Linus tree - does this 
resolve all issues? (without introducing new ones ;-)

	Ingo

Index: linux/arch/arm/kernel/time.c
===================================================================
--- linux.orig/arch/arm/kernel/time.c
+++ linux/arch/arm/kernel/time.c
@@ -79,17 +79,6 @@ static unsigned long dummy_gettimeoffset
 }
 #endif
 
-/*
- * An implementation of printk_clock() independent from
- * sched_clock().  This avoids non-bootable kernels when
- * printk_clock is enabled.
- */
-unsigned long long printk_clock(void)
-{
-	return (unsigned long long)(jiffies - INITIAL_JIFFIES) *
-			(1000000000 / HZ);
-}
-
 static unsigned long next_rtc_update;
 
 /*
Index: linux/arch/ia64/kernel/time.c
===================================================================
--- linux.orig/arch/ia64/kernel/time.c
+++ linux/arch/ia64/kernel/time.c
@@ -344,33 +344,6 @@ udelay (unsigned long usecs)
 }
 EXPORT_SYMBOL(udelay);
 
-static unsigned long long ia64_itc_printk_clock(void)
-{
-	if (ia64_get_kr(IA64_KR_PER_CPU_DATA))
-		return sched_clock();
-	return 0;
-}
-
-static unsigned long long ia64_default_printk_clock(void)
-{
-	return (unsigned long long)(jiffies_64 - INITIAL_JIFFIES) *
-		(1000000000/HZ);
-}
-
-unsigned long long (*ia64_printk_clock)(void) = &ia64_default_printk_clock;
-
-unsigned long long printk_clock(void)
-{
-	return ia64_printk_clock();
-}
-
-void __init
-ia64_setup_printk_clock(void)
-{
-	if (!(sal_platform_features & IA64_SAL_PLATFORM_FEATURE_ITC_DRIFT))
-		ia64_printk_clock = ia64_itc_printk_clock;
-}
-
 /* IA64 doesn't cache the timezone */
 void update_vsyscall_tz(void)
 {
Index: linux/arch/x86/kernel/process_32.c
===================================================================
--- linux.orig/arch/x86/kernel/process_32.c
+++ linux/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
 		smp_mb();
 
 		local_irq_disable();
-		if (!need_resched())
+		if (!need_resched()) {
+			ktime_t t0, t1;
+			u64 t0n, t1n;
+
+			t0 = ktime_get();
+			t0n = ktime_to_ns(t0);
 			safe_halt();	/* enables interrupts racelessly */
-		else
-			local_irq_enable();
+			local_irq_disable();
+			t1 = ktime_get();
+			t1n = ktime_to_ns(t1);
+			sched_clock_idle_wakeup_event(t1n - t0n);
+		}
+		local_irq_enable();
 		current_thread_info()->status |= TS_POLLING;
 	} else {
 		/* loop is done by the caller */
Index: linux/arch/x86/kernel/tsc_32.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_32.c
+++ linux/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include <linux/jiffies.h>
 #include <linux/init.h>
 #include <linux/dmi.h>
+#include <linux/percpu.h>
 
 #include <asm/delay.h>
 #include <asm/tsc.h>
@@ -78,15 +79,32 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
+
+static void set_cyc2ns_scale(unsigned long cpu_khz)
 {
-	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	struct cyc2ns_params *params;
+	unsigned long flags;
+	unsigned long long tsc_now, ns_now;
+
+	rdtscll(tsc_now);
+	params = &get_cpu_var(cyc2ns);
+
+	local_irq_save(flags);
+	ns_now = __cycles_2_ns(params, tsc_now);
+
+	params->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	params->offset += ns_now - __cycles_2_ns(params, tsc_now);
+	local_irq_restore(flags);
+
+	put_cpu_var(cyc2ns);
 }
 
 /*
Index: linux/arch/x86/kernel/tsc_64.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_64.c
+++ linux/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include <asm/hpet.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,25 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
 
-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz)
 {
-	cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+	struct cyc2ns_params *params;
+	unsigned long flags;
+	unsigned long long tsc_now, ns_now;
 
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	return (cyc * cyc2ns_scale) >> NS_SCALE;
+	rdtscll(tsc_now);
+	params = &get_cpu_var(cyc2ns);
+
+	local_irq_save(flags);
+	ns_now = __cycles_2_ns(params, tsc_now);
+
+	params->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	params->offset += ns_now - __cycles_2_ns(params, tsc_now);
+	local_irq_restore(flags);
+
+	put_cpu_var(cyc2ns);
 }
 
 unsigned long long sched_clock(void)
Index: linux/arch/x86/lib/delay_32.c
===================================================================
--- linux.orig/arch/x86/lib/delay_32.c
+++ linux/arch/x86/lib/delay_32.c
@@ -38,17 +38,21 @@ static void delay_loop(unsigned long loo
 		:"0" (loops));
 }
 
-/* TSC based delay: */
+/* cpu_clock() [TSC] based delay: */
 static void delay_tsc(unsigned long loops)
 {
-	unsigned long bclock, now;
+	unsigned long long start, stop, now;
+	int this_cpu;
+
+	preempt_disable();
+
+	this_cpu = smp_processor_id();
+	start = now = cpu_clock(this_cpu);
+	stop = start + loops;
+
+	while ((long long)(stop - now) > 0)
+		now = cpu_clock(this_cpu);
 
-	preempt_disable();		/* TSC's are per-cpu */
-	rdtscl(bclock);
-	do {
-		rep_nop();
-		rdtscl(now);
-	} while ((now-bclock) < loops);
 	preempt_enable();
 }
 
Index: linux/arch/x86/lib/delay_64.c
===================================================================
--- linux.orig/arch/x86/lib/delay_64.c
+++ linux/arch/x86/lib/delay_64.c
@@ -26,19 +26,28 @@ int read_current_timer(unsigned long *ti
 	return 0;
 }
 
-void __delay(unsigned long loops)
+/* cpu_clock() [TSC] based delay: */
+static void delay_tsc(unsigned long loops)
 {
-	unsigned bclock, now;
+	unsigned long long start, stop, now;
+	int this_cpu;
+
+	preempt_disable();
+
+	this_cpu = smp_processor_id();
+	start = now = cpu_clock(this_cpu);
+	stop = start + loops;
+
+	while ((long long)(stop - now) > 0)
+		now = cpu_clock(this_cpu);
 
-	preempt_disable();		/* TSC's are pre-cpu */
-	rdtscl(bclock);
-	do {
-		rep_nop(); 
-		rdtscl(now);
-	}
-	while ((now-bclock) < loops);
 	preempt_enable();
 }
+
+void __delay(unsigned long loops)
+{
+	delay_tsc(loops);
+}
 EXPORT_SYMBOL(__delay);
 
 inline void __const_udelay(unsigned long xloops)
Index: linux/drivers/acpi/processor_idle.c
===================================================================
--- linux.orig/drivers/acpi/processor_idle.c
+++ linux/drivers/acpi/processor_idle.c
@@ -531,6 +531,11 @@ static void acpi_processor_idle(void)
 
 	case ACPI_STATE_C3:
 		/*
+		 * Must be done before busmaster disable as we might
+		 * need to access HPET !
+		 */
+		acpi_state_timer_broadcast(pr, cx, 1);
+		/*
 		 * disable bus master
 		 * bm_check implies we need ARB_DIS
 		 * !bm_check implies we need cache flush
@@ -557,7 +562,6 @@ static void acpi_processor_idle(void)
 		/* Get start time (ticks) */
 		t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 		/* Invoke C3 */
-		acpi_state_timer_broadcast(pr, cx, 1);
 		/* Tell the scheduler that we are going deep-idle: */
 		sched_clock_idle_sleep_event();
 		acpi_cstate_enter(cx);
@@ -1401,9 +1405,6 @@ static int acpi_idle_enter_simple(struct
 	if (acpi_idle_suspend)
 		return(acpi_idle_enter_c1(dev, state));
 
-	if (pr->flags.bm_check)
-		acpi_idle_update_bm_rld(pr, cx);
-
 	local_irq_disable();
 	current_thread_info()->status &= ~TS_POLLING;
 	/*
@@ -1418,13 +1419,21 @@ static int acpi_idle_enter_simple(struct
 		return 0;
 	}
 
+	/*
+	 * Must be done before busmaster disable as we might need to
+	 * access HPET !
+	 */
+	acpi_state_timer_broadcast(pr, cx, 1);
+
+	if (pr->flags.bm_check)
+		acpi_idle_update_bm_rld(pr, cx);
+
 	if (cx->type == ACPI_STATE_C3)
 		ACPI_FLUSH_CPU_CACHE();
 
 	t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 	/* Tell the scheduler that we are going deep-idle: */
 	sched_clock_idle_sleep_event();
-	acpi_state_timer_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 	t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 
Index: linux/include/asm-x86/timer.h
===================================================================
--- linux.orig/include/asm-x86/timer.h
+++ linux/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
 #define _ASMi386_TIMER_H
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/percpu.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *		ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,44 @@ extern int recalibrate_cpu_khz(void);
  *	And since SC is a constant power of two, we can convert the div
  *  into a shift.
  *
- *  We can use khz divisor instead of mhz to keep a better percision, since
+ *  We can use khz divisor instead of mhz to keep a better precision, since
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-extern unsigned long cyc2ns_scale __read_mostly;
+
+struct cyc2ns_params {
+	unsigned long scale;
+	unsigned long long offset;
+};
+
+DECLARE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(struct cyc2ns_params *params,
+					       unsigned long long cyc)
 {
-	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	return ((cyc * params->scale) >> CYC2NS_SCALE_FACTOR) + params->offset;
 }
 
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	struct cyc2ns_params *params;
+	unsigned long flags;
+	unsigned long long ns;
+
+	params = &get_cpu_var(cyc2ns);
+
+	local_irq_save(flags);
+	ns = __cycles_2_ns(params, cyc);
+	local_irq_restore(flags);
+
+	put_cpu_var(cyc2ns);
+	return ns;
+}
 
 #endif
Index: linux/kernel/hrtimer.c
===================================================================
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -850,6 +850,14 @@ hrtimer_start(struct hrtimer *timer, kti
 #ifdef CONFIG_TIME_LOW_RES
 		tim = ktime_add(tim, base->resolution);
 #endif
+		/*
+		 * Careful here: User space might have asked for a
+		 * very long sleep, so the add above might result in a
+		 * negative number, which enqueues the timer in front
+		 * of the queue.
+		 */
+		if (tim.tv64 < 0)
+			tim.tv64 = KTIME_MAX;
 	}
 	timer->expires = tim;
 
Index: linux/kernel/lockdep.c
===================================================================
--- linux.orig/kernel/lockdep.c
+++ linux/kernel/lockdep.c
@@ -2654,10 +2654,15 @@ static void check_flags(unsigned long fl
 	if (!debug_locks)
 		return;
 
-	if (irqs_disabled_flags(flags))
-		DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled);
-	else
-		DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled);
+	if (irqs_disabled_flags(flags)) {
+		if (DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)) {
+			printk("possible reason: unannotated irqs-off.\n");
+		}
+	} else {
+		if (DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)) {
+			printk("possible reason: unannotated irqs-on.\n");
+		}
+	}
 
 	/*
 	 * We dont accurately track softirq state in e.g.
Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -573,11 +573,6 @@ static int __init printk_time_setup(char
 
 __setup("time", printk_time_setup);
 
-__attribute__((weak)) unsigned long long printk_clock(void)
-{
-	return sched_clock();
-}
-
 /* Check if we have any console registered that can be called early in boot. */
 static int have_callable_console(void)
 {
@@ -628,30 +623,57 @@ asmlinkage int printk(const char *fmt, .
 /* cpu currently holding logbuf_lock */
 static volatile unsigned int printk_cpu = UINT_MAX;
 
+const char printk_recursion_bug_msg [] =
+			KERN_CRIT "BUG: recent printk recursion!\n";
+static int printk_recursion_bug;
+
 asmlinkage int vprintk(const char *fmt, va_list args)
 {
+	static int log_level_unknown = 1;
+	static char printk_buf[1024];
+
 	unsigned long flags;
-	int printed_len;
+	int printed_len = 0;
+	int this_cpu;
 	char *p;
-	static char printk_buf[1024];
-	static int log_level_unknown = 1;
 
 	boot_delay_msec();
 
 	preempt_disable();
-	if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id())
-		/* If a crash is occurring during printk() on this CPU,
-		 * make sure we can't deadlock */
-		zap_locks();
-
 	/* This stops the holder of console_sem just where we want him */
 	raw_local_irq_save(flags);
+	this_cpu = smp_processor_id();
+
+	/*
+	 * Ouch, printk recursed into itself!
+	 */
+	if (unlikely(printk_cpu == this_cpu)) {
+		/*
+		 * If a crash is occurring during printk() on this CPU,
+		 * then try to get the crash message out but make sure
+		 * we can't deadlock. Otherwise just return to avoid the
+		 * recursion and return - but flag the recursion so that
+		 * it can be printed at the next appropriate moment:
+		 */
+		if (!oops_in_progress) {
+			printk_recursion_bug = 1;
+			goto out_restore_irqs;
+		}
+		zap_locks();
+	}
+
 	lockdep_off();
 	spin_lock(&logbuf_lock);
-	printk_cpu = smp_processor_id();
+	printk_cpu = this_cpu;
 
+	if (printk_recursion_bug) {
+		printk_recursion_bug = 0;
+		strcpy(printk_buf, printk_recursion_bug_msg);
+		printed_len = sizeof(printk_recursion_bug_msg);
+	}
 	/* Emit the output into the temporary buffer */
-	printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args);
+	printed_len += vscnprintf(printk_buf + printed_len,
+				  sizeof(printk_buf), fmt, args);
 
 	/*
 	 * Copy the output into log_buf.  If the caller didn't provide
@@ -680,7 +702,11 @@ asmlinkage int vprintk(const char *fmt, 
 					loglev_char = default_message_loglevel
 						+ '0';
 				}
-				t = printk_clock();
+				if (panic_timeout) {
+					panic_timeout = 0;
+					printk("recurse!\n");
+				}
+				t = cpu_clock(printk_cpu);
 				nanosec_rem = do_div(t, 1000000000);
 				tlen = sprintf(tbuf,
 						"<%c>[%5lu.%06lu] ",
@@ -744,6 +770,7 @@ asmlinkage int vprintk(const char *fmt, 
 		printk_cpu = UINT_MAX;
 		spin_unlock(&logbuf_lock);
 		lockdep_on();
+out_restore_irqs:
 		raw_local_irq_restore(flags);
 	}
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -488,7 +488,12 @@ unsigned long long cpu_clock(int cpu)
 
 	local_irq_save(flags);
 	rq = cpu_rq(cpu);
-	update_rq_clock(rq);
+	/*
+	 * Only call sched_clock() if the scheduler has already been
+	 * initialized (some code might call cpu_clock() very early):
+	 */
+	if (rq->idle)
+		update_rq_clock(rq);
 	now = rq->clock;
 	local_irq_restore(flags);
 
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -511,8 +511,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
-		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se) &&
-				task_of(se)->policy != SCHED_BATCH)
+		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
 			vruntime -= sysctl_sched_latency;
 
 		/* ensure we never gain time by being placed backwards. */
Index: linux/kernel/time/clockevents.c
===================================================================
--- linux.orig/kernel/time/clockevents.c
+++ linux/kernel/time/clockevents.c
@@ -78,6 +78,11 @@ int clockevents_program_event(struct clo
 	unsigned long long clc;
 	int64_t delta;
 
+	if (unlikely(expires.tv64 < 0)) {
+		WARN_ON_ONCE(1);
+		return -ETIME;
+	}
+
 	delta = ktime_to_ns(ktime_sub(expires, now));
 
 	if (delta <= 0)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 12:25       ` Ingo Molnar
@ 2007-12-07 12:35         ` Ingo Molnar
  2007-12-07 12:40           ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 12:35 UTC (permalink / raw)
  To: stefano.brivio
  Cc: Nick Piggin, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown, Guillaume Chazarain


* Ingo Molnar <mingo@elte.hu> wrote:

> OK, here's a rollup of 11 patches that relate to this. I hoped we 
> could hold this off until 2.6.25, but it seems more urgent as per 
> Stefano's testing, as udelay() and drivers are affected as well.
> 
> Stefano, could you try this on top of a recent-ish Linus tree - does 
> this resolve all issues? (without introducing new ones ;-)

Updated version attached below.

> +DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;

__read_mostly is not a good idea for PER_CPU variables.
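
For illustration, a minimal sketch (not part of the patch): DEFINE_PER_CPU()
already carries a per-CPU section annotation, and as far as I can see an
extra __read_mostly section annotation conflicts with it, so the variable
can end up outside the per-CPU area:

	/* correct: the per-CPU section annotation stands alone */
	DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns);

	/* avoid: a second section annotation fights the per-CPU one */
	/* DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly; */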

	Ingo

Index: linux/arch/arm/kernel/time.c
===================================================================
--- linux.orig/arch/arm/kernel/time.c
+++ linux/arch/arm/kernel/time.c
@@ -79,17 +79,6 @@ static unsigned long dummy_gettimeoffset
 }
 #endif
 
-/*
- * An implementation of printk_clock() independent from
- * sched_clock().  This avoids non-bootable kernels when
- * printk_clock is enabled.
- */
-unsigned long long printk_clock(void)
-{
-	return (unsigned long long)(jiffies - INITIAL_JIFFIES) *
-			(1000000000 / HZ);
-}
-
 static unsigned long next_rtc_update;
 
 /*
Index: linux/arch/ia64/kernel/time.c
===================================================================
--- linux.orig/arch/ia64/kernel/time.c
+++ linux/arch/ia64/kernel/time.c
@@ -344,33 +344,6 @@ udelay (unsigned long usecs)
 }
 EXPORT_SYMBOL(udelay);
 
-static unsigned long long ia64_itc_printk_clock(void)
-{
-	if (ia64_get_kr(IA64_KR_PER_CPU_DATA))
-		return sched_clock();
-	return 0;
-}
-
-static unsigned long long ia64_default_printk_clock(void)
-{
-	return (unsigned long long)(jiffies_64 - INITIAL_JIFFIES) *
-		(1000000000/HZ);
-}
-
-unsigned long long (*ia64_printk_clock)(void) = &ia64_default_printk_clock;
-
-unsigned long long printk_clock(void)
-{
-	return ia64_printk_clock();
-}
-
-void __init
-ia64_setup_printk_clock(void)
-{
-	if (!(sal_platform_features & IA64_SAL_PLATFORM_FEATURE_ITC_DRIFT))
-		ia64_printk_clock = ia64_itc_printk_clock;
-}
-
 /* IA64 doesn't cache the timezone */
 void update_vsyscall_tz(void)
 {
Index: linux/arch/x86/kernel/process_32.c
===================================================================
--- linux.orig/arch/x86/kernel/process_32.c
+++ linux/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
 		smp_mb();
 
 		local_irq_disable();
-		if (!need_resched())
+		if (!need_resched()) {
+			ktime_t t0, t1;
+			u64 t0n, t1n;
+
+			t0 = ktime_get();
+			t0n = ktime_to_ns(t0);
 			safe_halt();	/* enables interrupts racelessly */
-		else
-			local_irq_enable();
+			local_irq_disable();
+			t1 = ktime_get();
+			t1n = ktime_to_ns(t1);
+			sched_clock_idle_wakeup_event(t1n - t0n);
+		}
+		local_irq_enable();
 		current_thread_info()->status |= TS_POLLING;
 	} else {
 		/* loop is done by the caller */
Index: linux/arch/x86/kernel/tsc_32.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_32.c
+++ linux/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include <linux/jiffies.h>
 #include <linux/init.h>
 #include <linux/dmi.h>
+#include <linux/percpu.h>
 
 #include <asm/delay.h>
 #include <asm/tsc.h>
@@ -78,15 +79,32 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns);
+
+static void set_cyc2ns_scale(unsigned long cpu_khz)
 {
-	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	struct cyc2ns_params *params;
+	unsigned long flags;
+	unsigned long long tsc_now, ns_now;
+
+	rdtscll(tsc_now);
+	params = &get_cpu_var(cyc2ns);
+
+	local_irq_save(flags);
+	ns_now = __cycles_2_ns(params, tsc_now);
+
+	params->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	params->offset += ns_now - __cycles_2_ns(params, tsc_now);
+	local_irq_restore(flags);
+
+	put_cpu_var(cyc2ns);
 }
 
 /*
Index: linux/arch/x86/kernel/tsc_64.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_64.c
+++ linux/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include <asm/hpet.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,25 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+DEFINE_PER_CPU(struct cyc2ns_params, cyc2ns) __read_mostly;
 
-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz)
 {
-	cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+	struct cyc2ns_params *params;
+	unsigned long flags;
+	unsigned long long tsc_now, ns_now;
 
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	return (cyc * cyc2ns_scale) >> NS_SCALE;
+	rdtscll(tsc_now);
+	params = &get_cpu_var(cyc2ns);
+
+	local_irq_save(flags);
+	ns_now = __cycles_2_ns(params, tsc_now);
+
+	params->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	params->offset += ns_now - __cycles_2_ns(params, tsc_now);
+	local_irq_restore(flags);
+
+	put_cpu_var(cyc2ns);
 }
 
 unsigned long long sched_clock(void)
Index: linux/arch/x86/lib/delay_32.c
===================================================================
--- linux.orig/arch/x86/lib/delay_32.c
+++ linux/arch/x86/lib/delay_32.c
@@ -38,17 +38,21 @@ static void delay_loop(unsigned long loo
 		:"0" (loops));
 }
 
-/* TSC based delay: */
+/* cpu_clock() [TSC] based delay: */
 static void delay_tsc(unsigned long loops)
 {
-	unsigned long bclock, now;
+	unsigned long long start, stop, now;
+	int this_cpu;
+
+	preempt_disable();
+
+	this_cpu = smp_processor_id();
+	start = now = cpu_clock(this_cpu);
+	stop = start + loops;
+
+	while ((long long)(stop - now) > 0)
+		now = cpu_clock(this_cpu);
 
-	preempt_disable();		/* TSC's are per-cpu */
-	rdtscl(bclock);
-	do {
-		rep_nop();
-		rdtscl(now);
-	} while ((now-bclock) < loops);
 	preempt_enable();
 }
 
Index: linux/arch/x86/lib/delay_64.c
===================================================================
--- linux.orig/arch/x86/lib/delay_64.c
+++ linux/arch/x86/lib/delay_64.c
@@ -26,19 +26,28 @@ int read_current_timer(unsigned long *ti
 	return 0;
 }
 
-void __delay(unsigned long loops)
+/* cpu_clock() [TSC] based delay: */
+static void delay_tsc(unsigned long loops)
 {
-	unsigned bclock, now;
+	unsigned long long start, stop, now;
+	int this_cpu;
+
+	preempt_disable();
+
+	this_cpu = smp_processor_id();
+	start = now = cpu_clock(this_cpu);
+	stop = start + loops;
+
+	while ((long long)(stop - now) > 0)
+		now = cpu_clock(this_cpu);
 
-	preempt_disable();		/* TSC's are pre-cpu */
-	rdtscl(bclock);
-	do {
-		rep_nop(); 
-		rdtscl(now);
-	}
-	while ((now-bclock) < loops);
 	preempt_enable();
 }
+
+void __delay(unsigned long loops)
+{
+	delay_tsc(loops);
+}
 EXPORT_SYMBOL(__delay);
 
 inline void __const_udelay(unsigned long xloops)
Index: linux/drivers/acpi/processor_idle.c
===================================================================
--- linux.orig/drivers/acpi/processor_idle.c
+++ linux/drivers/acpi/processor_idle.c
@@ -531,6 +531,11 @@ static void acpi_processor_idle(void)
 
 	case ACPI_STATE_C3:
 		/*
+		 * Must be done before busmaster disable as we might
+		 * need to access HPET !
+		 */
+		acpi_state_timer_broadcast(pr, cx, 1);
+		/*
 		 * disable bus master
 		 * bm_check implies we need ARB_DIS
 		 * !bm_check implies we need cache flush
@@ -557,7 +562,6 @@ static void acpi_processor_idle(void)
 		/* Get start time (ticks) */
 		t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 		/* Invoke C3 */
-		acpi_state_timer_broadcast(pr, cx, 1);
 		/* Tell the scheduler that we are going deep-idle: */
 		sched_clock_idle_sleep_event();
 		acpi_cstate_enter(cx);
@@ -1401,9 +1405,6 @@ static int acpi_idle_enter_simple(struct
 	if (acpi_idle_suspend)
 		return(acpi_idle_enter_c1(dev, state));
 
-	if (pr->flags.bm_check)
-		acpi_idle_update_bm_rld(pr, cx);
-
 	local_irq_disable();
 	current_thread_info()->status &= ~TS_POLLING;
 	/*
@@ -1418,13 +1419,21 @@ static int acpi_idle_enter_simple(struct
 		return 0;
 	}
 
+	/*
+	 * Must be done before busmaster disable as we might need to
+	 * access HPET !
+	 */
+	acpi_state_timer_broadcast(pr, cx, 1);
+
+	if (pr->flags.bm_check)
+		acpi_idle_update_bm_rld(pr, cx);
+
 	if (cx->type == ACPI_STATE_C3)
 		ACPI_FLUSH_CPU_CACHE();
 
 	t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 	/* Tell the scheduler that we are going deep-idle: */
 	sched_clock_idle_sleep_event();
-	acpi_state_timer_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 	t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 
Index: linux/include/asm-x86/timer.h
===================================================================
--- linux.orig/include/asm-x86/timer.h
+++ linux/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
 #define _ASMi386_TIMER_H
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/percpu.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *		ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,44 @@ extern int recalibrate_cpu_khz(void);
  *	And since SC is a constant power of two, we can convert the div
  *  into a shift.
  *
- *  We can use khz divisor instead of mhz to keep a better percision, since
+ *  We can use khz divisor instead of mhz to keep a better precision, since
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-extern unsigned long cyc2ns_scale __read_mostly;
+
+struct cyc2ns_params {
+	unsigned long scale;
+	unsigned long long offset;
+};
+
+DECLARE_PER_CPU(struct cyc2ns_params, cyc2ns);
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(struct cyc2ns_params *params,
+					       unsigned long long cyc)
 {
-	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	return ((cyc * params->scale) >> CYC2NS_SCALE_FACTOR) + params->offset;
 }
 
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	struct cyc2ns_params *params;
+	unsigned long flags;
+	unsigned long long ns;
+
+	params = &get_cpu_var(cyc2ns);
+
+	local_irq_save(flags);
+	ns = __cycles_2_ns(params, cyc);
+	local_irq_restore(flags);
+
+	put_cpu_var(cyc2ns);
+	return ns;
+}
 
 #endif
Index: linux/kernel/hrtimer.c
===================================================================
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -850,6 +850,14 @@ hrtimer_start(struct hrtimer *timer, kti
 #ifdef CONFIG_TIME_LOW_RES
 		tim = ktime_add(tim, base->resolution);
 #endif
+		/*
+		 * Careful here: User space might have asked for a
+		 * very long sleep, so the add above might result in a
+		 * negative number, which enqueues the timer in front
+		 * of the queue.
+		 */
+		if (tim.tv64 < 0)
+			tim.tv64 = KTIME_MAX;
 	}
 	timer->expires = tim;
 
Index: linux/kernel/lockdep.c
===================================================================
--- linux.orig/kernel/lockdep.c
+++ linux/kernel/lockdep.c
@@ -2654,10 +2654,15 @@ static void check_flags(unsigned long fl
 	if (!debug_locks)
 		return;
 
-	if (irqs_disabled_flags(flags))
-		DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled);
-	else
-		DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled);
+	if (irqs_disabled_flags(flags)) {
+		if (DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)) {
+			printk("possible reason: unannotated irqs-off.\n");
+		}
+	} else {
+		if (DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)) {
+			printk("possible reason: unannotated irqs-on.\n");
+		}
+	}
 
 	/*
 	 * We dont accurately track softirq state in e.g.
Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -573,11 +573,6 @@ static int __init printk_time_setup(char
 
 __setup("time", printk_time_setup);
 
-__attribute__((weak)) unsigned long long printk_clock(void)
-{
-	return sched_clock();
-}
-
 /* Check if we have any console registered that can be called early in boot. */
 static int have_callable_console(void)
 {
@@ -628,30 +623,57 @@ asmlinkage int printk(const char *fmt, .
 /* cpu currently holding logbuf_lock */
 static volatile unsigned int printk_cpu = UINT_MAX;
 
+const char printk_recursion_bug_msg [] =
+			KERN_CRIT "BUG: recent printk recursion!\n";
+static int printk_recursion_bug;
+
 asmlinkage int vprintk(const char *fmt, va_list args)
 {
+	static int log_level_unknown = 1;
+	static char printk_buf[1024];
+
 	unsigned long flags;
-	int printed_len;
+	int printed_len = 0;
+	int this_cpu;
 	char *p;
-	static char printk_buf[1024];
-	static int log_level_unknown = 1;
 
 	boot_delay_msec();
 
 	preempt_disable();
-	if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id())
-		/* If a crash is occurring during printk() on this CPU,
-		 * make sure we can't deadlock */
-		zap_locks();
-
 	/* This stops the holder of console_sem just where we want him */
 	raw_local_irq_save(flags);
+	this_cpu = smp_processor_id();
+
+	/*
+	 * Ouch, printk recursed into itself!
+	 */
+	if (unlikely(printk_cpu == this_cpu)) {
+		/*
+		 * If a crash is occurring during printk() on this CPU,
+		 * then try to get the crash message out but make sure
+		 * we can't deadlock. Otherwise just return to avoid the
+		 * recursion and return - but flag the recursion so that
+		 * it can be printed at the next appropriate moment:
+		 */
+		if (!oops_in_progress) {
+			printk_recursion_bug = 1;
+			goto out_restore_irqs;
+		}
+		zap_locks();
+	}
+
 	lockdep_off();
 	spin_lock(&logbuf_lock);
-	printk_cpu = smp_processor_id();
+	printk_cpu = this_cpu;
 
+	if (printk_recursion_bug) {
+		printk_recursion_bug = 0;
+		strcpy(printk_buf, printk_recursion_bug_msg);
+		printed_len = sizeof(printk_recursion_bug_msg);
+	}
 	/* Emit the output into the temporary buffer */
-	printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args);
+	printed_len += vscnprintf(printk_buf + printed_len,
+				  sizeof(printk_buf), fmt, args);
 
 	/*
 	 * Copy the output into log_buf.  If the caller didn't provide
@@ -680,7 +702,11 @@ asmlinkage int vprintk(const char *fmt, 
 					loglev_char = default_message_loglevel
 						+ '0';
 				}
-				t = printk_clock();
+				if (panic_timeout) {
+					panic_timeout = 0;
+					printk("recurse!\n");
+				}
+				t = cpu_clock(printk_cpu);
 				nanosec_rem = do_div(t, 1000000000);
 				tlen = sprintf(tbuf,
 						"<%c>[%5lu.%06lu] ",
@@ -744,6 +770,7 @@ asmlinkage int vprintk(const char *fmt, 
 		printk_cpu = UINT_MAX;
 		spin_unlock(&logbuf_lock);
 		lockdep_on();
+out_restore_irqs:
 		raw_local_irq_restore(flags);
 	}
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -488,7 +488,12 @@ unsigned long long cpu_clock(int cpu)
 
 	local_irq_save(flags);
 	rq = cpu_rq(cpu);
-	update_rq_clock(rq);
+	/*
+	 * Only call sched_clock() if the scheduler has already been
+	 * initialized (some code might call cpu_clock() very early):
+	 */
+	if (rq->idle)
+		update_rq_clock(rq);
 	now = rq->clock;
 	local_irq_restore(flags);
 
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -511,8 +511,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
-		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se) &&
-				task_of(se)->policy != SCHED_BATCH)
+		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
 			vruntime -= sysctl_sched_latency;
 
 		/* ensure we never gain time by being placed backwards. */
Index: linux/kernel/time/clockevents.c
===================================================================
--- linux.orig/kernel/time/clockevents.c
+++ linux/kernel/time/clockevents.c
@@ -78,6 +78,11 @@ int clockevents_program_event(struct clo
 	unsigned long long clc;
 	int64_t delta;
 
+	if (unlikely(expires.tv64 < 0)) {
+		WARN_ON_ONCE(1);
+		return -ETIME;
+	}
+
 	delta = ktime_to_ns(ktime_sub(expires, now));
 
 	if (delta <= 0)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 12:35         ` Ingo Molnar
@ 2007-12-07 12:40           ` Ingo Molnar
  2007-12-07 14:54             ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 12:40 UTC (permalink / raw)
  To: stefano.brivio
  Cc: Nick Piggin, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown, Guillaume Chazarain


* Ingo Molnar <mingo@elte.hu> wrote:

> > Stefano, could you try this on top of a recent-ish Linus tree - does 
> > this resolve all issues? (without introducing new ones ;-)
> 
> Updated version attached below.

Third update: the cpufreq callbacks are not quite OK yet.

	Ingo

Index: linux/arch/arm/kernel/time.c
===================================================================
--- linux.orig/arch/arm/kernel/time.c
+++ linux/arch/arm/kernel/time.c
@@ -79,17 +79,6 @@ static unsigned long dummy_gettimeoffset
 }
 #endif
 
-/*
- * An implementation of printk_clock() independent from
- * sched_clock().  This avoids non-bootable kernels when
- * printk_clock is enabled.
- */
-unsigned long long printk_clock(void)
-{
-	return (unsigned long long)(jiffies - INITIAL_JIFFIES) *
-			(1000000000 / HZ);
-}
-
 static unsigned long next_rtc_update;
 
 /*
Index: linux/arch/ia64/kernel/time.c
===================================================================
--- linux.orig/arch/ia64/kernel/time.c
+++ linux/arch/ia64/kernel/time.c
@@ -344,33 +344,6 @@ udelay (unsigned long usecs)
 }
 EXPORT_SYMBOL(udelay);
 
-static unsigned long long ia64_itc_printk_clock(void)
-{
-	if (ia64_get_kr(IA64_KR_PER_CPU_DATA))
-		return sched_clock();
-	return 0;
-}
-
-static unsigned long long ia64_default_printk_clock(void)
-{
-	return (unsigned long long)(jiffies_64 - INITIAL_JIFFIES) *
-		(1000000000/HZ);
-}
-
-unsigned long long (*ia64_printk_clock)(void) = &ia64_default_printk_clock;
-
-unsigned long long printk_clock(void)
-{
-	return ia64_printk_clock();
-}
-
-void __init
-ia64_setup_printk_clock(void)
-{
-	if (!(sal_platform_features & IA64_SAL_PLATFORM_FEATURE_ITC_DRIFT))
-		ia64_printk_clock = ia64_itc_printk_clock;
-}
-
 /* IA64 doesn't cache the timezone */
 void update_vsyscall_tz(void)
 {
Index: linux/arch/x86/kernel/process_32.c
===================================================================
--- linux.orig/arch/x86/kernel/process_32.c
+++ linux/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
 		smp_mb();
 
 		local_irq_disable();
-		if (!need_resched())
+		if (!need_resched()) {
+			ktime_t t0, t1;
+			u64 t0n, t1n;
+
+			t0 = ktime_get();
+			t0n = ktime_to_ns(t0);
 			safe_halt();	/* enables interrupts racelessly */
-		else
-			local_irq_enable();
+			local_irq_disable();
+			t1 = ktime_get();
+			t1n = ktime_to_ns(t1);
+			sched_clock_idle_wakeup_event(t1n - t0n);
+		}
+		local_irq_enable();
 		current_thread_info()->status |= TS_POLLING;
 	} else {
 		/* loop is done by the caller */
Index: linux/arch/x86/lib/delay_32.c
===================================================================
--- linux.orig/arch/x86/lib/delay_32.c
+++ linux/arch/x86/lib/delay_32.c
@@ -38,17 +38,21 @@ static void delay_loop(unsigned long loo
 		:"0" (loops));
 }
 
-/* TSC based delay: */
+/* cpu_clock() [TSC] based delay: */
 static void delay_tsc(unsigned long loops)
 {
-	unsigned long bclock, now;
+	unsigned long long start, stop, now;
+	int this_cpu;
+
+	preempt_disable();
+
+	this_cpu = smp_processor_id();
+	start = now = cpu_clock(this_cpu);
+	stop = start + loops;
+
+	while ((long long)(stop - now) > 0)
+		now = cpu_clock(this_cpu);
 
-	preempt_disable();		/* TSC's are per-cpu */
-	rdtscl(bclock);
-	do {
-		rep_nop();
-		rdtscl(now);
-	} while ((now-bclock) < loops);
 	preempt_enable();
 }
 
Index: linux/arch/x86/lib/delay_64.c
===================================================================
--- linux.orig/arch/x86/lib/delay_64.c
+++ linux/arch/x86/lib/delay_64.c
@@ -26,19 +26,28 @@ int read_current_timer(unsigned long *ti
 	return 0;
 }
 
-void __delay(unsigned long loops)
+/* cpu_clock() [TSC] based delay: */
+static void delay_tsc(unsigned long loops)
 {
-	unsigned bclock, now;
+	unsigned long long start, stop, now;
+	int this_cpu;
+
+	preempt_disable();
+
+	this_cpu = smp_processor_id();
+	start = now = cpu_clock(this_cpu);
+	stop = start + loops;
+
+	while ((long long)(stop - now) > 0)
+		now = cpu_clock(this_cpu);
 
-	preempt_disable();		/* TSC's are pre-cpu */
-	rdtscl(bclock);
-	do {
-		rep_nop(); 
-		rdtscl(now);
-	}
-	while ((now-bclock) < loops);
 	preempt_enable();
 }
+
+void __delay(unsigned long loops)
+{
+	delay_tsc(loops);
+}
 EXPORT_SYMBOL(__delay);
 
 inline void __const_udelay(unsigned long xloops)
Index: linux/drivers/acpi/processor_idle.c
===================================================================
--- linux.orig/drivers/acpi/processor_idle.c
+++ linux/drivers/acpi/processor_idle.c
@@ -531,6 +531,11 @@ static void acpi_processor_idle(void)
 
 	case ACPI_STATE_C3:
 		/*
+		 * Must be done before busmaster disable as we might
+		 * need to access HPET !
+		 */
+		acpi_state_timer_broadcast(pr, cx, 1);
+		/*
 		 * disable bus master
 		 * bm_check implies we need ARB_DIS
 		 * !bm_check implies we need cache flush
@@ -557,7 +562,6 @@ static void acpi_processor_idle(void)
 		/* Get start time (ticks) */
 		t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 		/* Invoke C3 */
-		acpi_state_timer_broadcast(pr, cx, 1);
 		/* Tell the scheduler that we are going deep-idle: */
 		sched_clock_idle_sleep_event();
 		acpi_cstate_enter(cx);
@@ -1401,9 +1405,6 @@ static int acpi_idle_enter_simple(struct
 	if (acpi_idle_suspend)
 		return(acpi_idle_enter_c1(dev, state));
 
-	if (pr->flags.bm_check)
-		acpi_idle_update_bm_rld(pr, cx);
-
 	local_irq_disable();
 	current_thread_info()->status &= ~TS_POLLING;
 	/*
@@ -1418,13 +1419,21 @@ static int acpi_idle_enter_simple(struct
 		return 0;
 	}
 
+	/*
+	 * Must be done before busmaster disable as we might need to
+	 * access HPET !
+	 */
+	acpi_state_timer_broadcast(pr, cx, 1);
+
+	if (pr->flags.bm_check)
+		acpi_idle_update_bm_rld(pr, cx);
+
 	if (cx->type == ACPI_STATE_C3)
 		ACPI_FLUSH_CPU_CACHE();
 
 	t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 	/* Tell the scheduler that we are going deep-idle: */
 	sched_clock_idle_sleep_event();
-	acpi_state_timer_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 	t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 
Index: linux/kernel/hrtimer.c
===================================================================
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -850,6 +850,14 @@ hrtimer_start(struct hrtimer *timer, kti
 #ifdef CONFIG_TIME_LOW_RES
 		tim = ktime_add(tim, base->resolution);
 #endif
+		/*
+		 * Careful here: User space might have asked for a
+		 * very long sleep, so the add above might result in a
+		 * negative number, which enqueues the timer in front
+		 * of the queue.
+		 */
+		if (tim.tv64 < 0)
+			tim.tv64 = KTIME_MAX;
 	}
 	timer->expires = tim;
 
Index: linux/kernel/lockdep.c
===================================================================
--- linux.orig/kernel/lockdep.c
+++ linux/kernel/lockdep.c
@@ -2654,10 +2654,15 @@ static void check_flags(unsigned long fl
 	if (!debug_locks)
 		return;
 
-	if (irqs_disabled_flags(flags))
-		DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled);
-	else
-		DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled);
+	if (irqs_disabled_flags(flags)) {
+		if (DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)) {
+			printk("possible reason: unannotated irqs-off.\n");
+		}
+	} else {
+		if (DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)) {
+			printk("possible reason: unannotated irqs-on.\n");
+		}
+	}
 
 	/*
 	 * We dont accurately track softirq state in e.g.
Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -573,11 +573,6 @@ static int __init printk_time_setup(char
 
 __setup("time", printk_time_setup);
 
-__attribute__((weak)) unsigned long long printk_clock(void)
-{
-	return sched_clock();
-}
-
 /* Check if we have any console registered that can be called early in boot. */
 static int have_callable_console(void)
 {
@@ -628,30 +623,57 @@ asmlinkage int printk(const char *fmt, .
 /* cpu currently holding logbuf_lock */
 static volatile unsigned int printk_cpu = UINT_MAX;
 
+const char printk_recursion_bug_msg [] =
+			KERN_CRIT "BUG: recent printk recursion!\n";
+static int printk_recursion_bug;
+
 asmlinkage int vprintk(const char *fmt, va_list args)
 {
+	static int log_level_unknown = 1;
+	static char printk_buf[1024];
+
 	unsigned long flags;
-	int printed_len;
+	int printed_len = 0;
+	int this_cpu;
 	char *p;
-	static char printk_buf[1024];
-	static int log_level_unknown = 1;
 
 	boot_delay_msec();
 
 	preempt_disable();
-	if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id())
-		/* If a crash is occurring during printk() on this CPU,
-		 * make sure we can't deadlock */
-		zap_locks();
-
 	/* This stops the holder of console_sem just where we want him */
 	raw_local_irq_save(flags);
+	this_cpu = smp_processor_id();
+
+	/*
+	 * Ouch, printk recursed into itself!
+	 */
+	if (unlikely(printk_cpu == this_cpu)) {
+		/*
+		 * If a crash is occurring during printk() on this CPU,
+		 * then try to get the crash message out but make sure
+		 * we can't deadlock. Otherwise just return to avoid the
+		 * recursion and return - but flag the recursion so that
+		 * it can be printed at the next appropriate moment:
+		 */
+		if (!oops_in_progress) {
+			printk_recursion_bug = 1;
+			goto out_restore_irqs;
+		}
+		zap_locks();
+	}
+
 	lockdep_off();
 	spin_lock(&logbuf_lock);
-	printk_cpu = smp_processor_id();
+	printk_cpu = this_cpu;
 
+	if (printk_recursion_bug) {
+		printk_recursion_bug = 0;
+		strcpy(printk_buf, printk_recursion_bug_msg);
+		printed_len = sizeof(printk_recursion_bug_msg);
+	}
 	/* Emit the output into the temporary buffer */
-	printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args);
+	printed_len += vscnprintf(printk_buf + printed_len,
+				  sizeof(printk_buf), fmt, args);
 
 	/*
 	 * Copy the output into log_buf.  If the caller didn't provide
@@ -680,7 +702,11 @@ asmlinkage int vprintk(const char *fmt, 
 					loglev_char = default_message_loglevel
 						+ '0';
 				}
-				t = printk_clock();
+				if (panic_timeout) {
+					panic_timeout = 0;
+					printk("recurse!\n");
+				}
+				t = cpu_clock(printk_cpu);
 				nanosec_rem = do_div(t, 1000000000);
 				tlen = sprintf(tbuf,
 						"<%c>[%5lu.%06lu] ",
@@ -744,6 +770,7 @@ asmlinkage int vprintk(const char *fmt, 
 		printk_cpu = UINT_MAX;
 		spin_unlock(&logbuf_lock);
 		lockdep_on();
+out_restore_irqs:
 		raw_local_irq_restore(flags);
 	}
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -488,7 +488,12 @@ unsigned long long cpu_clock(int cpu)
 
 	local_irq_save(flags);
 	rq = cpu_rq(cpu);
-	update_rq_clock(rq);
+	/*
+	 * Only call sched_clock() if the scheduler has already been
+	 * initialized (some code might call cpu_clock() very early):
+	 */
+	if (rq->idle)
+		update_rq_clock(rq);
 	now = rq->clock;
 	local_irq_restore(flags);
 
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -511,8 +511,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
-		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se) &&
-				task_of(se)->policy != SCHED_BATCH)
+		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
 			vruntime -= sysctl_sched_latency;
 
 		/* ensure we never gain time by being placed backwards. */
Index: linux/kernel/time/clockevents.c
===================================================================
--- linux.orig/kernel/time/clockevents.c
+++ linux/kernel/time/clockevents.c
@@ -78,6 +78,11 @@ int clockevents_program_event(struct clo
 	unsigned long long clc;
 	int64_t delta;
 
+	if (unlikely(expires.tv64 < 0)) {
+		WARN_ON_ONCE(1);
+		return -ETIME;
+	}
+
 	delta = ktime_to_ns(ktime_sub(expires, now));
 
 	if (delta <= 0)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch] x86: scale cyc_2_nsec according to CPU frequency
  2007-12-07  8:02     ` Guillaume Chazarain
  2007-12-07  8:51       ` Ingo Molnar
@ 2007-12-07 13:55       ` Ingo Molnar
  2007-12-07 14:27         ` Guillaume Chazarain
  1 sibling, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 13:55 UTC (permalink / raw)
  To: Guillaume Chazarain
  Cc: Thomas Gleixner, Stefano Brivio, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Andrew Morton


* Guillaume Chazarain <guichaz@yahoo.fr> wrote:

> > > Hmrpf. sched_clock() is used for the timestamps of the printks. We 
> > > need to find a better solution than killing off the tsc 
> > > access completely.
> > 
> > Something like http://lkml.org/lkml/2007/3/16/291 that would need 
> > some refresh?
> 
> And here is a refreshed one just for testing with 2.6-git. The 64 bit 
> part is a shamelessly untested copy/paste as I cannot test it.

Guillaume, I've updated your patch with a handful of changes - see the 
result below.

Firstly, we don't need the 'offset' anymore because cpu_clock() maintains 
offsets itself. This simplifies the math and speeds up the sched_clock() 
common case.

Secondly, with PER_CPU variables we need to update them for all possible 
CPUs - otherwise CPUs that never get a cpufreq notification would be left 
with a zero scaling factor. (not all CPUs are cpufreq capable)
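
In sketch form, this is what the tsc_init()/tsc_calibrate() hunks below do
(the cpufreq notifier then corrects any CPU whose speed diverges):

	for_each_possible_cpu(cpu)
		set_cyc2ns_scale(cpu_khz, cpu);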

Thirdly, we can be a bit smarter and faster by using the fact that 
local_irq_disable() is preempt-safe - so we can use per_cpu() instead of 
get_cpu_var().
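
A minimal sketch of that last point (new_scale is an illustrative name,
not from the patch): get_cpu_var() adds its own preempt_disable(), which
is redundant once hard interrupts are off:

	unsigned long flags;

	local_irq_save(flags);	/* blocks preemption too: we stay on this CPU */
	per_cpu(cyc2ns, smp_processor_id()) = new_scale;
	local_irq_restore(flags);

	/* equivalent but slower: */
	get_cpu_var(cyc2ns) = new_scale;	/* implies preempt_disable() */
	put_cpu_var(cyc2ns);			/* implies preempt_enable() */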

	Ingo

----------------->
Subject: x86: scale cyc_2_nsec according to CPU frequency
From: "Guillaume Chazarain" <guichaz@yahoo.fr>

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ mingo@elte.hu: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/tsc_32.c |   41 +++++++++++++++++++++++++++-----
 arch/x86/kernel/tsc_64.c |   59 +++++++++++++++++++++++++++++++++++++++--------
 include/asm-x86/timer.h  |   23 ++++++++++++++----
 3 files changed, 102 insertions(+), 21 deletions(-)

Index: linux-x86.q/arch/x86/kernel/tsc_32.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/tsc_32.c
+++ linux-x86.q/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include <linux/jiffies.h>
 #include <linux/init.h>
 #include <linux/dmi.h>
+#include <linux/percpu.h>
 
 #include <asm/delay.h>
 #include <asm/tsc.h>
@@ -78,15 +79,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
+
+	local_irq_save(flags);
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	printk("CPU#%d: changed cyc2ns scale from %ld to %ld\n",
+		cpu, prev_scale, *scale);
+	local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +256,9 @@ time_cpufreq_notifier(struct notifier_bl
 						ref_freq, freq->new);
 			if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
 				tsc_khz = cpu_khz;
-				set_cyc2ns_scale(cpu_khz);
+				preempt_disable();
+				set_cyc2ns_scale(cpu_khz, smp_processor_id());
+				preempt_enable();
 				/*
 				 * TSC based sched_clock turns
 				 * to junk w/ cpufreq
@@ -367,6 +386,8 @@ static inline void check_geode_tsc_relia
 
 void __init tsc_init(void)
 {
+	int cpu;
+
 	if (!cpu_has_tsc || tsc_disable)
 		goto out_no_tsc;
 
@@ -380,7 +401,15 @@ void __init tsc_init(void)
 				(unsigned long)cpu_khz / 1000,
 				(unsigned long)cpu_khz % 1000);
 
-	set_cyc2ns_scale(cpu_khz);
+	/*
+	 * Secondary CPUs do not run through tsc_init(), so set up
+	 * all the scale factors for all CPUs, assuming the same
+	 * speed as the bootup CPU. (cpufreq notifiers will fix this
+	 * up if their speed diverges)
+	 */
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(cpu_khz, cpu);
+
 	use_tsc_delay();
 
 	/* Check and install the TSC clocksource */
Index: linux-x86.q/arch/x86/kernel/tsc_64.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/tsc_64.c
+++ linux-x86.q/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include <asm/hpet.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,50 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ *		ns = cycles / (freq / ns_per_sec)
+ *		ns = cycles * (ns_per_sec / freq)
+ *		ns = cycles * (10^9 / (cpu_khz * 10^3))
+ *		ns = cycles * (10^6 / cpu_khz)
+ *
+ *	Then we use scaling math (suggested by george@mvista.com) to get:
+ *		ns = cycles * (10^6 * SC / cpu_khz) / SC
+ *		ns = cycles * cyc2ns_scale / SC
+ *
+ *	And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  We can use khz divisor instead of mhz to keep a better precision, since
+ *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ *  (mathieu.desnoyers@polymtl.ca)
+ *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
+ *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
 
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	return (cyc * cyc2ns_scale) >> NS_SCALE;
+	local_irq_save(flags);
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	printk("CPU#%d: changed cyc2ns scale from %ld to %ld\n",
+		cpu, prev_scale, *scale);
+
+	local_irq_restore(flags);
 }
 
 unsigned long long sched_clock(void)
@@ -100,7 +135,9 @@ static int time_cpufreq_notifier(struct 
 			mark_tsc_unstable("cpufreq changes");
 	}
 
-	set_cyc2ns_scale(tsc_khz_ref);
+	preempt_disable();
+	set_cyc2ns_scale(tsc_khz_ref, smp_processor_id());
+	preempt_enable();
 
 	return 0;
 }
@@ -151,7 +188,7 @@ static unsigned long __init tsc_read_ref
 void __init tsc_calibrate(void)
 {
 	unsigned long flags, tsc1, tsc2, tr1, tr2, pm1, pm2, hpet1, hpet2;
-	int hpet = is_hpet_enabled();
+	int hpet = is_hpet_enabled(), cpu;
 
 	local_irq_save(flags);
 
@@ -206,7 +243,9 @@ void __init tsc_calibrate(void)
 	}
 
 	tsc_khz = tsc2 / tsc1;
-	set_cyc2ns_scale(tsc_khz);
+
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(tsc_khz, cpu);
 }
 
 /*
Index: linux-x86.q/include/asm-x86/timer.h
===================================================================
--- linux-x86.q.orig/include/asm-x86/timer.h
+++ linux-x86.q/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
 #define _ASMi386_TIMER_H
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/percpu.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *		ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,32 @@ extern int recalibrate_cpu_khz(void);
  *	And since SC is a constant power of two, we can convert the div
  *  into a shift.
  *
- *  We can use khz divisor instead of mhz to keep a better percision, since
+ *  We can use khz divisor instead of mhz to keep a better precision, since
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-extern unsigned long cyc2ns_scale __read_mostly;
+
+DECLARE_PER_CPU(unsigned long, cyc2ns);
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
-	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	return cyc * per_cpu(cyc2ns, smp_processor_id()) >> CYC2NS_SCALE_FACTOR;
 }
 
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	unsigned long long ns;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	ns = __cycles_2_ns(cyc);
+	local_irq_restore(flags);
+
+	return ns;
+}
 
 #endif

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch] x86: scale cyc_2_nsec according to CPU frequency
  2007-12-07 13:55       ` [patch] x86: scale cyc_2_nsec according to CPU frequency Ingo Molnar
@ 2007-12-07 14:27         ` Guillaume Chazarain
  2007-12-07 14:52           ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Guillaume Chazarain @ 2007-12-07 14:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Stefano Brivio, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Andrew Morton

On Fri, 7 Dec 2007 14:55:25 +0100,
Ingo Molnar <mingo@elte.hu> wrote:

> Firstly, we dont need the 'offset' anymore because cpu_clock() maintains 
> offsets itself.

Yes, but a lower-quality one. __update_rq_clock() tries to compensate
for large clock jumps at jiffy resolution, while my offset arranges
for a very smooth frequency transition.

I agree with keeping a single offset, but I liked the fact that with my
patch the clock had no jump at all on frequency changes.
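
Concretely, the offset was chosen so the clock is continuous at the switch
point; a sketch using the cyc2ns_params names from the earlier rollup:

	/* sample the clock with the old scale */
	ns_now = __cycles_2_ns(params, tsc_now);
	/* switch scales, then shift the offset so ns(tsc_now) is unchanged */
	params->scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) / cpu_khz;
	params->offset += ns_now - __cycles_2_ns(params, tsc_now);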

> + *  ns += offset to avoid sched_clock jumps with cpufreq

I guess this needs to go away if I don't make my point :-(

> +	printk("CPU#%d: changed cyc2ns scale from %ld to %ld\n",
> +		cpu, prev_scale, *scale);

Pointing it out just to be sure it does not end up in the final version ;-)

Thanks for cleaning up my mess ;-)

-- 
Guillaume

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch] x86: scale cyc_2_nsec according to CPU frequency
  2007-12-07 14:27         ` Guillaume Chazarain
@ 2007-12-07 14:52           ` Ingo Molnar
  2007-12-08 15:57             ` Arjan van de Ven
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 14:52 UTC (permalink / raw)
  To: Guillaume Chazarain
  Cc: Thomas Gleixner, Stefano Brivio, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Andrew Morton


* Guillaume Chazarain <guichaz@yahoo.fr> wrote:

> On Fri, 7 Dec 2007 14:55:25 +0100,
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > Firstly, we dont need the 'offset' anymore because cpu_clock() 
> > maintains offsets itself.
> 
> Yes, but a lower-quality one. __update_rq_clock() tries to compensate 
> for large clock jumps at jiffy resolution, while my offset arranges 
> for a very smooth frequency transition.

Yes, but that would be easy to fix up by calling 
sched_clock_idle_wakeup_event(0) when doing a frequency transition, 
without burdening the normal sched_clock() codepath with the offset. See 
the attached latest version.
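
In outline, that is what set_cyc2ns_scale() below does (a sketch; the
cpu_khz check is omitted here):

	local_irq_save(flags);
	sched_clock_idle_sleep_event();		/* snapshot the rq clock */
	per_cpu(cyc2ns, cpu) = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) / cpu_khz;
	sched_clock_idle_wakeup_event(0);	/* resync as a zero-length sleep */
	local_irq_restore(flags);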

	Ingo

--------------->
Subject: x86: scale cyc_2_nsec according to CPU frequency
From: "Guillaume Chazarain" <guichaz@yahoo.fr>

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ mingo@elte.hu: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/tsc_32.c |   45 +++++++++++++++++++++++++++++++----
 arch/x86/kernel/tsc_64.c |   59 +++++++++++++++++++++++++++++++++++++++--------
 include/asm-x86/timer.h  |   23 ++++++++++++++----
 3 files changed, 106 insertions(+), 21 deletions(-)

Index: linux-x86.q/arch/x86/kernel/tsc_32.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/tsc_32.c
+++ linux-x86.q/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include <linux/jiffies.h>
 #include <linux/init.h>
 #include <linux/dmi.h>
+#include <linux/percpu.h>
 
 #include <asm/delay.h>
 #include <asm/tsc.h>
@@ -78,15 +79,35 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
+
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	/*
+	 * Start smoothly with the new frequency:
+	 */
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +260,9 @@ time_cpufreq_notifier(struct notifier_bl
 						ref_freq, freq->new);
 			if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
 				tsc_khz = cpu_khz;
-				set_cyc2ns_scale(cpu_khz);
+				preempt_disable();
+				set_cyc2ns_scale(cpu_khz, smp_processor_id());
+				preempt_enable();
 				/*
 				 * TSC based sched_clock turns
 				 * to junk w/ cpufreq
@@ -367,6 +390,8 @@ static inline void check_geode_tsc_relia
 
 void __init tsc_init(void)
 {
+	int cpu;
+
 	if (!cpu_has_tsc || tsc_disable)
 		goto out_no_tsc;
 
@@ -380,7 +405,15 @@ void __init tsc_init(void)
 				(unsigned long)cpu_khz / 1000,
 				(unsigned long)cpu_khz % 1000);
 
-	set_cyc2ns_scale(cpu_khz);
+	/*
+	 * Secondary CPUs do not run through tsc_init(), so set up
+	 * all the scale factors for all CPUs, assuming the same
+	 * speed as the bootup CPU. (cpufreq notifiers will fix this
+	 * up if their speed diverges)
+	 */
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(cpu_khz, cpu);
+
 	use_tsc_delay();
 
 	/* Check and install the TSC clocksource */
Index: linux-x86.q/arch/x86/kernel/tsc_64.c
===================================================================
--- linux-x86.q.orig/arch/x86/kernel/tsc_64.c
+++ linux-x86.q/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include <asm/hpet.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,50 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ *		ns = cycles / (freq / ns_per_sec)
+ *		ns = cycles * (ns_per_sec / freq)
+ *		ns = cycles * (10^9 / (cpu_khz * 10^3))
+ *		ns = cycles * (10^6 / cpu_khz)
+ *
+ *	Then we use scaling math (suggested by george@mvista.com) to get:
+ *		ns = cycles * (10^6 * SC / cpu_khz) / SC
+ *		ns = cycles * cyc2ns_scale / SC
+ *
+ *	And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  We can use khz divisor instead of mhz to keep a better precision, since
+ *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ *  (mathieu.desnoyers@polymtl.ca)
+ *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
+ *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
 
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	return (cyc * cyc2ns_scale) >> NS_SCALE;
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
 }
 
 unsigned long long sched_clock(void)
@@ -100,7 +135,9 @@ static int time_cpufreq_notifier(struct 
 			mark_tsc_unstable("cpufreq changes");
 	}
 
-	set_cyc2ns_scale(tsc_khz_ref);
+	preempt_disable();
+	set_cyc2ns_scale(tsc_khz_ref, smp_processor_id());
+	preempt_enable();
 
 	return 0;
 }
@@ -151,7 +188,7 @@ static unsigned long __init tsc_read_ref
 void __init tsc_calibrate(void)
 {
 	unsigned long flags, tsc1, tsc2, tr1, tr2, pm1, pm2, hpet1, hpet2;
-	int hpet = is_hpet_enabled();
+	int hpet = is_hpet_enabled(), cpu;
 
 	local_irq_save(flags);
 
@@ -206,7 +243,9 @@ void __init tsc_calibrate(void)
 	}
 
 	tsc_khz = tsc2 / tsc1;
-	set_cyc2ns_scale(tsc_khz);
+
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(tsc_khz, cpu);
 }
 
 /*
Index: linux-x86.q/include/asm-x86/timer.h
===================================================================
--- linux-x86.q.orig/include/asm-x86/timer.h
+++ linux-x86.q/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
 #define _ASMi386_TIMER_H
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/percpu.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *		ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,32 @@ extern int recalibrate_cpu_khz(void);
  *	And since SC is a constant power of two, we can convert the div
  *  into a shift.
  *
- *  We can use khz divisor instead of mhz to keep a better percision, since
+ *  We can use khz divisor instead of mhz to keep a better precision, since
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-extern unsigned long cyc2ns_scale __read_mostly;
+
+DECLARE_PER_CPU(unsigned long, cyc2ns);
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
-	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	return cyc * per_cpu(cyc2ns, smp_processor_id()) >> CYC2NS_SCALE_FACTOR;
 }
 
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	unsigned long long ns;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	ns = __cycles_2_ns(cyc);
+	local_irq_restore(flags);
+
+	return ns;
+}
 
 #endif
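
As a worked example of the scaling above, with assumed numbers for a 
2 GHz CPU (cpu_khz = 2000000, CYC2NS_SCALE_FACTOR = 10):

	scale = (1000000 << 10) / 2000000 = 512
	ns    = (cyc * 512) >> 10         = cyc / 2

i.e. 0.5 ns per cycle at 2 GHz, as expected.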

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 12:40           ` Ingo Molnar
@ 2007-12-07 14:54             ` Ingo Molnar
  2007-12-07 16:46               ` Guillaume Chazarain
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 14:54 UTC (permalink / raw)
  To: stefano.brivio
  Cc: Nick Piggin, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown, Guillaume Chazarain


* Ingo Molnar <mingo@elte.hu> wrote:

> third update. the cpufreq callbacks are not quite OK yet.

fourth update - the cpufreq callbacks are back. This is a version that 
is supposed to fix all known aspects of TSC and frequency-change 
weirdness.

	Ingo

Index: linux/arch/arm/kernel/time.c
===================================================================
--- linux.orig/arch/arm/kernel/time.c
+++ linux/arch/arm/kernel/time.c
@@ -79,17 +79,6 @@ static unsigned long dummy_gettimeoffset
 }
 #endif
 
-/*
- * An implementation of printk_clock() independent from
- * sched_clock().  This avoids non-bootable kernels when
- * printk_clock is enabled.
- */
-unsigned long long printk_clock(void)
-{
-	return (unsigned long long)(jiffies - INITIAL_JIFFIES) *
-			(1000000000 / HZ);
-}
-
 static unsigned long next_rtc_update;
 
 /*
Index: linux/arch/ia64/kernel/time.c
===================================================================
--- linux.orig/arch/ia64/kernel/time.c
+++ linux/arch/ia64/kernel/time.c
@@ -344,33 +344,6 @@ udelay (unsigned long usecs)
 }
 EXPORT_SYMBOL(udelay);
 
-static unsigned long long ia64_itc_printk_clock(void)
-{
-	if (ia64_get_kr(IA64_KR_PER_CPU_DATA))
-		return sched_clock();
-	return 0;
-}
-
-static unsigned long long ia64_default_printk_clock(void)
-{
-	return (unsigned long long)(jiffies_64 - INITIAL_JIFFIES) *
-		(1000000000/HZ);
-}
-
-unsigned long long (*ia64_printk_clock)(void) = &ia64_default_printk_clock;
-
-unsigned long long printk_clock(void)
-{
-	return ia64_printk_clock();
-}
-
-void __init
-ia64_setup_printk_clock(void)
-{
-	if (!(sal_platform_features & IA64_SAL_PLATFORM_FEATURE_ITC_DRIFT))
-		ia64_printk_clock = ia64_itc_printk_clock;
-}
-
 /* IA64 doesn't cache the timezone */
 void update_vsyscall_tz(void)
 {
Index: linux/arch/x86/kernel/process_32.c
===================================================================
--- linux.orig/arch/x86/kernel/process_32.c
+++ linux/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
 		smp_mb();
 
 		local_irq_disable();
-		if (!need_resched())
+		if (!need_resched()) {
+			ktime_t t0, t1;
+			u64 t0n, t1n;
+
+			t0 = ktime_get();
+			t0n = ktime_to_ns(t0);
 			safe_halt();	/* enables interrupts racelessly */
-		else
-			local_irq_enable();
+			local_irq_disable();
+			t1 = ktime_get();
+			t1n = ktime_to_ns(t1);
+			sched_clock_idle_wakeup_event(t1n - t0n);
+		}
+		local_irq_enable();
 		current_thread_info()->status |= TS_POLLING;
 	} else {
 		/* loop is done by the caller */
Index: linux/arch/x86/kernel/tsc_32.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_32.c
+++ linux/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include <linux/jiffies.h>
 #include <linux/init.h>
 #include <linux/dmi.h>
+#include <linux/percpu.h>
 
 #include <asm/delay.h>
 #include <asm/tsc.h>
@@ -78,15 +79,35 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
+
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	/*
+	 * Start smoothly with the new frequency:
+	 */
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +260,9 @@ time_cpufreq_notifier(struct notifier_bl
 						ref_freq, freq->new);
 			if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
 				tsc_khz = cpu_khz;
-				set_cyc2ns_scale(cpu_khz);
+				preempt_disable();
+				set_cyc2ns_scale(cpu_khz, smp_processor_id());
+				preempt_enable();
 				/*
 				 * TSC based sched_clock turns
 				 * to junk w/ cpufreq
@@ -367,6 +390,8 @@ static inline void check_geode_tsc_relia
 
 void __init tsc_init(void)
 {
+	int cpu;
+
 	if (!cpu_has_tsc || tsc_disable)
 		goto out_no_tsc;
 
@@ -380,7 +405,15 @@ void __init tsc_init(void)
 				(unsigned long)cpu_khz / 1000,
 				(unsigned long)cpu_khz % 1000);
 
-	set_cyc2ns_scale(cpu_khz);
+	/*
+	 * Secondary CPUs do not run through tsc_init(), so set up
+	 * all the scale factors for all CPUs, assuming the same
+	 * speed as the bootup CPU. (cpufreq notifiers will fix this
+	 * up if their speed diverges)
+	 */
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(cpu_khz, cpu);
+
 	use_tsc_delay();
 
 	/* Check and install the TSC clocksource */
Index: linux/arch/x86/kernel/tsc_64.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_64.c
+++ linux/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include <asm/hpet.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,50 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ *		ns = cycles / (freq / ns_per_sec)
+ *		ns = cycles * (ns_per_sec / freq)
+ *		ns = cycles * (10^9 / (cpu_khz * 10^3))
+ *		ns = cycles * (10^6 / cpu_khz)
+ *
+ *	Then we use scaling math (suggested by george@mvista.com) to get:
+ *		ns = cycles * (10^6 * SC / cpu_khz) / SC
+ *		ns = cycles * cyc2ns_scale / SC
+ *
+ *	And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  We can use khz divisor instead of mhz to keep a better precision, since
+ *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ *  (mathieu.desnoyers@polymtl.ca)
+ *
+ *  ns += offset to avoid sched_clock jumps with cpufreq
+ *
+ *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
 
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	return (cyc * cyc2ns_scale) >> NS_SCALE;
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
 }
 
 unsigned long long sched_clock(void)
@@ -100,7 +135,9 @@ static int time_cpufreq_notifier(struct 
 			mark_tsc_unstable("cpufreq changes");
 	}
 
-	set_cyc2ns_scale(tsc_khz_ref);
+	preempt_disable();
+	set_cyc2ns_scale(tsc_khz_ref, smp_processor_id());
+	preempt_enable();
 
 	return 0;
 }
@@ -151,7 +188,7 @@ static unsigned long __init tsc_read_ref
 void __init tsc_calibrate(void)
 {
 	unsigned long flags, tsc1, tsc2, tr1, tr2, pm1, pm2, hpet1, hpet2;
-	int hpet = is_hpet_enabled();
+	int hpet = is_hpet_enabled(), cpu;
 
 	local_irq_save(flags);
 
@@ -206,7 +243,9 @@ void __init tsc_calibrate(void)
 	}
 
 	tsc_khz = tsc2 / tsc1;
-	set_cyc2ns_scale(tsc_khz);
+
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(tsc_khz, cpu);
 }
 
 /*
Index: linux/drivers/acpi/processor_idle.c
===================================================================
--- linux.orig/drivers/acpi/processor_idle.c
+++ linux/drivers/acpi/processor_idle.c
@@ -531,6 +531,11 @@ static void acpi_processor_idle(void)
 
 	case ACPI_STATE_C3:
 		/*
+		 * Must be done before busmaster disable as we might
+		 * need to access HPET !
+		 */
+		acpi_state_timer_broadcast(pr, cx, 1);
+		/*
 		 * disable bus master
 		 * bm_check implies we need ARB_DIS
 		 * !bm_check implies we need cache flush
@@ -557,7 +562,6 @@ static void acpi_processor_idle(void)
 		/* Get start time (ticks) */
 		t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 		/* Invoke C3 */
-		acpi_state_timer_broadcast(pr, cx, 1);
 		/* Tell the scheduler that we are going deep-idle: */
 		sched_clock_idle_sleep_event();
 		acpi_cstate_enter(cx);
@@ -1401,9 +1405,6 @@ static int acpi_idle_enter_simple(struct
 	if (acpi_idle_suspend)
 		return(acpi_idle_enter_c1(dev, state));
 
-	if (pr->flags.bm_check)
-		acpi_idle_update_bm_rld(pr, cx);
-
 	local_irq_disable();
 	current_thread_info()->status &= ~TS_POLLING;
 	/*
@@ -1418,13 +1419,21 @@ static int acpi_idle_enter_simple(struct
 		return 0;
 	}
 
+	/*
+	 * Must be done before busmaster disable as we might need to
+	 * access HPET !
+	 */
+	acpi_state_timer_broadcast(pr, cx, 1);
+
+	if (pr->flags.bm_check)
+		acpi_idle_update_bm_rld(pr, cx);
+
 	if (cx->type == ACPI_STATE_C3)
 		ACPI_FLUSH_CPU_CACHE();
 
 	t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 	/* Tell the scheduler that we are going deep-idle: */
 	sched_clock_idle_sleep_event();
-	acpi_state_timer_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 	t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 
Index: linux/include/asm-x86/timer.h
===================================================================
--- linux.orig/include/asm-x86/timer.h
+++ linux/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
 #define _ASMi386_TIMER_H
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/percpu.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *		ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,32 @@ extern int recalibrate_cpu_khz(void);
  *	And since SC is a constant power of two, we can convert the div
  *  into a shift.
  *
- *  We can use khz divisor instead of mhz to keep a better percision, since
+ *  We can use khz divisor instead of mhz to keep a better precision, since
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-extern unsigned long cyc2ns_scale __read_mostly;
+
+DECLARE_PER_CPU(unsigned long, cyc2ns);
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
-	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	return cyc * per_cpu(cyc2ns, smp_processor_id()) >> CYC2NS_SCALE_FACTOR;
 }
 
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	unsigned long long ns;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	ns = __cycles_2_ns(cyc);
+	local_irq_restore(flags);
+
+	return ns;
+}
 
 #endif
Index: linux/kernel/hrtimer.c
===================================================================
--- linux.orig/kernel/hrtimer.c
+++ linux/kernel/hrtimer.c
@@ -850,6 +850,14 @@ hrtimer_start(struct hrtimer *timer, kti
 #ifdef CONFIG_TIME_LOW_RES
 		tim = ktime_add(tim, base->resolution);
 #endif
+		/*
+		 * Careful here: User space might have asked for a
+		 * very long sleep, so the add above might result in a
+		 * negative number, which enqueues the timer in front
+		 * of the queue.
+		 */
+		if (tim.tv64 < 0)
+			tim.tv64 = KTIME_MAX;
 	}
 	timer->expires = tim;
 
Index: linux/kernel/lockdep.c
===================================================================
--- linux.orig/kernel/lockdep.c
+++ linux/kernel/lockdep.c
@@ -2654,10 +2654,15 @@ static void check_flags(unsigned long fl
 	if (!debug_locks)
 		return;
 
-	if (irqs_disabled_flags(flags))
-		DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled);
-	else
-		DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled);
+	if (irqs_disabled_flags(flags)) {
+		if (DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)) {
+			printk("possible reason: unannotated irqs-off.\n");
+		}
+	} else {
+		if (DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)) {
+			printk("possible reason: unannotated irqs-on.\n");
+		}
+	}
 
 	/*
 	 * We dont accurately track softirq state in e.g.
Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -573,11 +573,6 @@ static int __init printk_time_setup(char
 
 __setup("time", printk_time_setup);
 
-__attribute__((weak)) unsigned long long printk_clock(void)
-{
-	return sched_clock();
-}
-
 /* Check if we have any console registered that can be called early in boot. */
 static int have_callable_console(void)
 {
@@ -628,30 +623,57 @@ asmlinkage int printk(const char *fmt, .
 /* cpu currently holding logbuf_lock */
 static volatile unsigned int printk_cpu = UINT_MAX;
 
+const char printk_recursion_bug_msg [] =
+			KERN_CRIT "BUG: recent printk recursion!\n";
+static int printk_recursion_bug;
+
 asmlinkage int vprintk(const char *fmt, va_list args)
 {
+	static int log_level_unknown = 1;
+	static char printk_buf[1024];
+
 	unsigned long flags;
-	int printed_len;
+	int printed_len = 0;
+	int this_cpu;
 	char *p;
-	static char printk_buf[1024];
-	static int log_level_unknown = 1;
 
 	boot_delay_msec();
 
 	preempt_disable();
-	if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id())
-		/* If a crash is occurring during printk() on this CPU,
-		 * make sure we can't deadlock */
-		zap_locks();
-
 	/* This stops the holder of console_sem just where we want him */
 	raw_local_irq_save(flags);
+	this_cpu = smp_processor_id();
+
+	/*
+	 * Ouch, printk recursed into itself!
+	 */
+	if (unlikely(printk_cpu == this_cpu)) {
+		/*
+		 * If a crash is occurring during printk() on this CPU,
+		 * then try to get the crash message out but make sure
+		 * we can't deadlock. Otherwise just return to avoid the
+		 * recursion and return - but flag the recursion so that
+		 * it can be printed at the next appropriate moment:
+		 */
+		if (!oops_in_progress) {
+			printk_recursion_bug = 1;
+			goto out_restore_irqs;
+		}
+		zap_locks();
+	}
+
 	lockdep_off();
 	spin_lock(&logbuf_lock);
-	printk_cpu = smp_processor_id();
+	printk_cpu = this_cpu;
 
+	if (printk_recursion_bug) {
+		printk_recursion_bug = 0;
+		strcpy(printk_buf, printk_recursion_bug_msg);
+		printed_len = strlen(printk_recursion_bug_msg);
+	}
 	/* Emit the output into the temporary buffer */
-	printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args);
+	printed_len += vscnprintf(printk_buf + printed_len,
+				  sizeof(printk_buf) - printed_len, fmt, args);
 
 	/*
 	 * Copy the output into log_buf.  If the caller didn't provide
@@ -680,7 +702,11 @@ asmlinkage int vprintk(const char *fmt, 
 					loglev_char = default_message_loglevel
 						+ '0';
 				}
-				t = printk_clock();
+				if (panic_timeout) {
+					panic_timeout = 0;
+					printk("recurse!\n");
+				}
+				t = cpu_clock(printk_cpu);
 				nanosec_rem = do_div(t, 1000000000);
 				tlen = sprintf(tbuf,
 						"<%c>[%5lu.%06lu] ",
@@ -744,6 +770,7 @@ asmlinkage int vprintk(const char *fmt, 
 		printk_cpu = UINT_MAX;
 		spin_unlock(&logbuf_lock);
 		lockdep_on();
+out_restore_irqs:
 		raw_local_irq_restore(flags);
 	}
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -488,7 +488,12 @@ unsigned long long cpu_clock(int cpu)
 
 	local_irq_save(flags);
 	rq = cpu_rq(cpu);
-	update_rq_clock(rq);
+	/*
+	 * Only call sched_clock() if the scheduler has already been
+	 * initialized (some code might call cpu_clock() very early):
+	 */
+	if (rq->idle)
+		update_rq_clock(rq);
 	now = rq->clock;
 	local_irq_restore(flags);
 
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -511,8 +511,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
-		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se) &&
-				task_of(se)->policy != SCHED_BATCH)
+		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
 			vruntime -= sysctl_sched_latency;
 
 		/* ensure we never gain time by being placed backwards. */
Index: linux/kernel/time/clockevents.c
===================================================================
--- linux.orig/kernel/time/clockevents.c
+++ linux/kernel/time/clockevents.c
@@ -78,6 +78,11 @@ int clockevents_program_event(struct clo
 	unsigned long long clc;
 	int64_t delta;
 
+	if (unlikely(expires.tv64 < 0)) {
+		WARN_ON_ONCE(1);
+		return -ETIME;
+	}
+
 	delta = ktime_to_ns(ktime_sub(expires, now));
 
 	if (delta <= 0)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 14:54             ` Ingo Molnar
@ 2007-12-07 16:46               ` Guillaume Chazarain
  2007-12-07 17:57                 ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Guillaume Chazarain @ 2007-12-07 16:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: stefano.brivio, Nick Piggin, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

On Fri, 7 Dec 2007 15:54:18 +0100,
Ingo Molnar <mingo@elte.hu> wrote:

> This is a version that 
> is supposed to fix all known aspects of TSC and frequency-change 
> weirdness.

Tested it with frequency changes, the clock is as smooth as I like
it :-)

The only remaining sched_clock user in need of conversion seems to be
lockdep.

Great work.

-- 
Guillaume

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 11:17     ` Ingo Molnar
@ 2007-12-07 16:48       ` Nick Piggin
  2007-12-08  0:50         ` Nick Piggin
  0 siblings, 1 reply; 47+ messages in thread
From: Nick Piggin @ 2007-12-07 16:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

On Friday 07 December 2007 22:17, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > ah, printk_clock() still uses sched_clock(), not jiffies. So it's
> > > not the jiffies counter that goes back and forth, it's sched_clock()
> > > - so this is a printk timestamps anomaly, not related to jiffies. I
> > > thought we have fixed this bug in the printk code already:
> > > sched_clock() is a 'raw' interface that should not be used directly
> > > - the proper interface is cpu_clock(cpu).
> >
> > It's a single CPU box, so sched_clock() jumping would still be
> > problematic, no?
>
> sched_clock() is an internal API - the non-jumping API to be used by
> printk is cpu_clock().

You know why sched_clock jumps when the TSC frequency changes, right?
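
(For concreteness, with a single global scale factor and assumed 
numbers: at 2 GHz the factor is 512, so 10^12 elapsed cycles read as

	ns = (10^12 * 512)  >> 10 =  500 * 10^9		(500 seconds)

while after a switch to 1 GHz the factor becomes 1024 and the very 
same counter reads as

	ns = (10^12 * 1024) >> 10 = 1000 * 10^9		(1000 seconds)

i.e. sched_clock() output leaps the instant the factor is swapped, 
unless the change is rebased against an offset.)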


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 16:46               ` Guillaume Chazarain
@ 2007-12-07 17:57                 ` Ingo Molnar
  2007-12-08 15:06                   ` Mark Lord
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-07 17:57 UTC (permalink / raw)
  To: Guillaume Chazarain
  Cc: stefano.brivio, Nick Piggin, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown


* Guillaume Chazarain <guichaz@yahoo.fr> wrote:

> On Fri, 7 Dec 2007 15:54:18 +0100,
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > This is a version that 
> > is supposed to fix all known aspects of TSC and frequency-change 
> > weirdness.
> 
> Tested it with frequency changes, the clock is as smooth as I like it
> :-)

ok, great :-)

> The only remaining sched_clock user in need of conversion seems to be 
> lockdep.

yeah - for CONFIG_LOCKSTAT - but that needs to be done even more 
carefully, due to rq->lock being lockdep-checked. We can perhaps try a 
lock-less cpu_clock() version - other CPUs are not supposed to update 
rq->clock.
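
Something like this hypothetical sketch could do - the name 
cpu_clock_nolock() is invented here, and it relies on rq->clock only 
ever being written by its own CPU, so a racy read is tolerable for 
lock statistics:

	unsigned long long cpu_clock_nolock(int cpu)
	{
		return cpu_rq(cpu)->clock;	/* no rq->lock taken */
	}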

> Great work.

thanks. I do get the impression that most of this can/should wait until 
2.6.25. The patches look quite dangerous.

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 16:48       ` Nick Piggin
@ 2007-12-08  0:50         ` Nick Piggin
  2007-12-08  0:57           ` Nick Piggin
  0 siblings, 1 reply; 47+ messages in thread
From: Nick Piggin @ 2007-12-08  0:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

On Saturday 08 December 2007 03:48, Nick Piggin wrote:
> On Friday 07 December 2007 22:17, Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > ah, printk_clock() still uses sched_clock(), not jiffies. So it's
> > > > not the jiffies counter that goes back and forth, it's sched_clock()
> > > > - so this is a printk timestamps anomaly, not related to jiffies. I
> > > > thought we have fixed this bug in the printk code already:
> > > > sched_clock() is a 'raw' interface that should not be used directly
> > > > - the proper interface is cpu_clock(cpu).
> > >
> > > It's a single CPU box, so sched_clock() jumping would still be
> > > problematic, no?
> >
> > sched_clock() is an internal API - the non-jumping API to be used by
> > printk is cpu_clock().
>
> You know why sched_clock jumps when the TSC frequency changes, right?

Ah, hmm, I don't know why I wrote that :)

I guess your patch is fairly complex but it should work if the plan
is to convert all sched_clock users to use cpu_clock, e.g. lockdep
as well.

So it looks good to me, thanks for fixing this.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-08  0:50         ` Nick Piggin
@ 2007-12-08  0:57           ` Nick Piggin
  2007-12-08  8:52             ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Nick Piggin @ 2007-12-08  0:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

On Saturday 08 December 2007 11:50, Nick Piggin wrote:

> I guess your patch is fairly complex but it should work

I should also add that although complex, it should have a
much smaller TSC delta window in which the wrong scaling
factor can get applied to it (I guess it is about as good
as you can possibly get). So I do like it :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-08  0:57           ` Nick Piggin
@ 2007-12-08  8:52             ` Ingo Molnar
  2007-12-08 23:37               ` Guillaume Chazarain
  2007-12-12  4:42               ` Nick Piggin
  0 siblings, 2 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-08  8:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Saturday 08 December 2007 11:50, Nick Piggin wrote:
> 
> > I guess your patch is fairly complex but it should work
> 
> I should also add that although complex, it should have a much smaller 
> TSC delta window in which the wrong scaling factor can get applied to 
> it (I guess it is about as good as you can possibly get). So I do like 
> it :)

ok :-)

the scariest bit isn't even the scaling, i think - that is a fairly 
straightforward and clean PER_CPU-ization of the global scaling factor, 
and its hookup with cpufreq events (and the credit for that goes to 
Guillaume Chazarain). We could even split it into two to make it even 
less scary and more bisectable.

the scariest bit is the adding of cpu_clock() to kernel/printk.c so late 
in the game, and the anti-recursion code i did there. Maybe, because this 
only affects CONFIG_PRINTK_TIME, we could try it even for v2.6.24.

i've now completed a couple of hundred random bootups on x86 overnight, 
with the full stack of these patches applied, and no problems.

Could have a really critical look at the two patches below and give your 
Reviewed-by line(s) if you agree with having them in v2.6.24? I'd feel a 
lot better about this that way :-) I do have the feeling that it makes 
printk printout a lot more robust in general, independently of the 
cpu_clock() change - especially with more complex consoles like 
netconsole or fbcon.

	Ingo

--------------------> 
Subject: printk: make printk more robust by not allowing recursion
From: Ingo Molnar <mingo@elte.hu>

make printk more robust by allowing recursion only if there's a crash
going on. Also add recursion detection.

I've tested it with an artificially injected printk recursion - instead
of a lockup or spontaneous reboot or other crash, the output was a
well-controlled:

[   41.057335] SysRq : <2>BUG: recent printk recursion!
[   41.057335] loglevel0-8 reBoot Crashdump show-all-locks(D) tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks

also do all this printk-debug logic with irqs disabled.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/printk.c |   48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -628,30 +628,57 @@ asmlinkage int printk(const char *fmt, .
 /* cpu currently holding logbuf_lock */
 static volatile unsigned int printk_cpu = UINT_MAX;
 
+const char printk_recursion_bug_msg [] =
+			KERN_CRIT "BUG: recent printk recursion!\n";
+static int printk_recursion_bug;
+
 asmlinkage int vprintk(const char *fmt, va_list args)
 {
+	static int log_level_unknown = 1;
+	static char printk_buf[1024];
+
 	unsigned long flags;
-	int printed_len;
+	int printed_len = 0;
+	int this_cpu;
 	char *p;
-	static char printk_buf[1024];
-	static int log_level_unknown = 1;
 
 	boot_delay_msec();
 
 	preempt_disable();
-	if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id())
-		/* If a crash is occurring during printk() on this CPU,
-		 * make sure we can't deadlock */
-		zap_locks();
-
 	/* This stops the holder of console_sem just where we want him */
 	raw_local_irq_save(flags);
+	this_cpu = smp_processor_id();
+
+	/*
+	 * Ouch, printk recursed into itself!
+	 */
+	if (unlikely(printk_cpu == this_cpu)) {
+		/*
+		 * If a crash is occurring during printk() on this CPU,
+		 * then try to get the crash message out but make sure
+		 * we can't deadlock. Otherwise just return to avoid the
+		 * recursion and return - but flag the recursion so that
+		 * it can be printed at the next appropriate moment:
+		 */
+		if (!oops_in_progress) {
+			printk_recursion_bug = 1;
+			goto out_restore_irqs;
+		}
+		zap_locks();
+	}
+
 	lockdep_off();
 	spin_lock(&logbuf_lock);
-	printk_cpu = smp_processor_id();
+	printk_cpu = this_cpu;
 
+	if (printk_recursion_bug) {
+		printk_recursion_bug = 0;
+		strcpy(printk_buf, printk_recursion_bug_msg);
+		printed_len = strlen(printk_recursion_bug_msg);
+	}
 	/* Emit the output into the temporary buffer */
-	printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args);
+	printed_len += vscnprintf(printk_buf + printed_len,
+				  sizeof(printk_buf) - printed_len, fmt, args);
 
 	/*
 	 * Copy the output into log_buf.  If the caller didn't provide
@@ -744,6 +771,7 @@ asmlinkage int vprintk(const char *fmt, 
 		printk_cpu = UINT_MAX;
 		spin_unlock(&logbuf_lock);
 		lockdep_on();
+out_restore_irqs:
 		raw_local_irq_restore(flags);
 	}

--------------------> 
Subject: sched: fix CONFIG_PRINT_TIME's reliance on sched_clock()
From: Ingo Molnar <mingo@elte.hu>

Stefano Brivio reported weird printk timestamp behavior during
CPU frequency changes:

  http://bugzilla.kernel.org/show_bug.cgi?id=9475

fix CONFIG_PRINT_TIME's reliance on sched_clock() and use cpu_clock()
instead.

Reported-and-bisected-by: Stefano Brivio <stefano.brivio@polimi.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/printk.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c
+++ linux/kernel/printk.c
@@ -707,7 +707,7 @@ asmlinkage int vprintk(const char *fmt, 
 					loglev_char = default_message_loglevel
 						+ '0';
 				}
-				t = printk_clock();
+				t = cpu_clock(printk_cpu);
 				nanosec_rem = do_div(t, 1000000000);
 				tlen = sprintf(tbuf,
 						"<%c>[%5lu.%06lu] ",

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-07 17:57                 ` Ingo Molnar
@ 2007-12-08 15:06                   ` Mark Lord
  2007-12-08 15:13                     ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Mark Lord @ 2007-12-08 15:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Guillaume Chazarain, stefano.brivio, Nick Piggin, Robert Love,
	linux-kernel, Dave Jones, Rafael J. Wysocki, Michael Buesch,
	Thomas Gleixner, Andrew Morton, Len Brown

Ingo Molnar wrote:
>...
> thanks. I do get the impression that most of this can/should wait until 
> 2.6.25. The patches look quite dangerous.
..

I confess to not really trying hard to understand everything in this thread,
but the implication seems to be that this bug might affect udelay()
and possibly jiffies?

If so, then fixing it has to be a *must* for 2.6.24, as otherwise we'll get
all sorts of once-in-a-while odd driver bugs... like maybe these two for starters:

[Bug 9492] 2.6.24:  false double-clicks from USB mouse
[Bug 9489] 20000+ wake-ups/second in 2.6.24

Neither of which happens often enough to explain or debug,
but either of which *could* be caused by some weird jiffies thing maybe.

???

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-08 15:06                   ` Mark Lord
@ 2007-12-08 15:13                     ` Ingo Molnar
  2007-12-08 15:27                       ` Michael Buesch
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-08 15:13 UTC (permalink / raw)
  To: Mark Lord
  Cc: Guillaume Chazarain, stefano.brivio, Nick Piggin, Robert Love,
	linux-kernel, Dave Jones, Rafael J. Wysocki, Michael Buesch,
	Thomas Gleixner, Andrew Morton, Len Brown


* Mark Lord <lkml@rtr.ca> wrote:

> Ingo Molnar wrote:
>> ...
>> thanks. I do get the impression that most of this can/should wait until 
>> 2.6.25. The patches look quite dangerous.
> ..
>
> I confess to not really trying hard to understand everything in this 
> thread, but the implication seems to be that this bug might affect 
> udelay() and possibly jiffies?

no, it cannot affect jiffies. (jiffies was a red herring all along)

udelay() cannot be affected either - sched_clock() has no effect on 
udelay(). _But_, when there are TSC problems, TSC-based udelay() 
suffers too, so the phenomena may _seem_ related.
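
The TSC-based delay loop has roughly this shape - a from-memory 
sketch along the lines of arch/x86/lib/delay_32.c, not a verbatim 
copy:

	static void delay_tsc(unsigned long loops)
	{
		unsigned long bclock, now;

		preempt_disable();		/* TSCs are per-CPU */
		rdtscl(bclock);
		do {
			rep_nop();
			rdtscl(now);
		} while ((now - bclock) < loops);
		preempt_enable();
	}

so a TSC that changes rate mid-loop directly stretches or shrinks 
the delay.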

> If so, then fixing it has to be a *must* for 2.6.24, as otherwise 
> we'll get all sorts of once-in-a-while odd driver bugs... like maybe these 
> two for starters:
>
> [Bug 9492] 2.6.24:  false double-clicks from USB mouse
> [Bug 9489] 20000+ wake-ups/second in 2.6.24

iirc these high rate wakeups happened on 2.6.22 too.

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-08 15:13                     ` Ingo Molnar
@ 2007-12-08 15:27                       ` Michael Buesch
  2007-12-08 15:33                         ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Michael Buesch @ 2007-12-08 15:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark Lord, Guillaume Chazarain, stefano.brivio, Nick Piggin,
	Robert Love, linux-kernel, Dave Jones, Rafael J. Wysocki,
	Thomas Gleixner, Andrew Morton, Len Brown

On Saturday 08 December 2007 16:13:41 Ingo Molnar wrote:
> 
> * Mark Lord <lkml@rtr.ca> wrote:
> 
> > Ingo Molnar wrote:
> >> ...
> >> thanks. I do get the impression that most of this can/should wait until 
> >> 2.6.25. The patches look quite dangerous.
> > ..
> >
> > I confess to not really trying hard to understand everything in this 
> > thread, but the implication seems to be that this bug might affect 
> > udelay() and possibly jiffies?
> 
> no, it cannot affect jiffies. (jiffies was a red herring all along)
> 
> udelay() cannot be affected either - sched_clock() has no effect on 
> udelay(). _But_, when there are TSC problems, TSC-based udelay() 
> suffers too, so the phenomena may _seem_ related.

What about msleep()? I suspect problems in b43 because of this issue:
msleep() returning too early. Is that possible with this bug?

-- 
Greetings Michael.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-08 15:27                       ` Michael Buesch
@ 2007-12-08 15:33                         ` Ingo Molnar
  2007-12-08 15:36                           ` Michael Buesch
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-08 15:33 UTC (permalink / raw)
  To: Michael Buesch
  Cc: Mark Lord, Guillaume Chazarain, stefano.brivio, Nick Piggin,
	Robert Love, linux-kernel, Dave Jones, Rafael J. Wysocki,
	Thomas Gleixner, Andrew Morton, Len Brown


* Michael Buesch <mb@bu3sch.de> wrote:

> On Saturday 08 December 2007 16:13:41 Ingo Molnar wrote:
> > 
> > * Mark Lord <lkml@rtr.ca> wrote:
> > 
> > > Ingo Molnar wrote:
> > >> ...
> > >> thanks. I do get the impression that most of this can/should wait until 
> > >> 2.6.25. The patches look quite dangerous.
> > > ..
> > >
> > > I confess to not really trying hard to understand everything in this 
> > > thread, but the implication seems to be that this bug might affect 
> > > udelay() and possibly jiffies?
> > 
> > no, it cannot affect jiffies. (jiffies was a red herring all along)
> > 
> > udelay() cannot be affected either - sched_clock() has no effect on 
> > udelay(). _But_, when there are TSC problems, TSC-based udelay() 
> > suffers too, so the phenomena may _seem_ related.
> 
> What about msleep()? I suspect problems in b43 because of this issue: 
> msleep() returning too early. Is that possible with this bug?

i cannot see how. You can verify msleep by running something like this:

  while :; do time usleep 111000; done

you should see a steady stream of:

 real    0m0.113s
 real    0m0.113s
 real    0m0.113s

(on an idle system). If it fluctuates, with occasional longer delays, 
there's some timer problem present.
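
A kernel-side variant of the same check - a hypothetical snippet to 
drop into some module init path:

	ktime_t t0, t1;

	t0 = ktime_get();
	msleep(111);
	t1 = ktime_get();
	printk(KERN_INFO "msleep(111) took %lld ns\n",
	       ktime_to_ns(ktime_sub(t1, t0)));

ktime_get() is clocksource-based while msleep() is driven by jiffies, 
so the two references are largely independent.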

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-08 15:33                         ` Ingo Molnar
@ 2007-12-08 15:36                           ` Michael Buesch
  2007-12-08 15:41                             ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Michael Buesch @ 2007-12-08 15:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mark Lord, Guillaume Chazarain, stefano.brivio, Nick Piggin,
	Robert Love, linux-kernel, Dave Jones, Rafael J. Wysocki,
	Thomas Gleixner, Andrew Morton, Len Brown

On Saturday 08 December 2007 16:33:27 Ingo Molnar wrote:
> 
> * Michael Buesch <mb@bu3sch.de> wrote:
> 
> > On Saturday 08 December 2007 16:13:41 Ingo Molnar wrote:
> > > 
> > > * Mark Lord <lkml@rtr.ca> wrote:
> > > 
> > > > Ingo Molnar wrote:
> > > >> ...
> > > >> thanks. I do get the impression that most of this can/should wait until 
> > > >> 2.6.25. The patches look quite dangerous.
> > > > ..
> > > >
> > > > I confess to not really trying hard to understand everything in this 
> > > > thread, but the implication seems to be that this bug might affect 
> > > > udelay() and possibly jiffies ?
> > > 
> > > no, it cannot affect jiffies. (jiffies was a red herring all along)
> > > 
> > > udelay() cannot be affected either - sched_clock() has no effect on 
> > > udelay(). _But_, when there are TSC problems, TSC-based udelay() 
> > > suffers too, so the phenomena may _seem_ related.
> > 
> > What about msleep()? I suspect problems in b43 because of this issue: 
> > msleep() returning too early. Is that possible with this bug?
> 
> i cannot see how. You can verify msleep by running something like this:
> 
>   while :; do time usleep 111000; done
> 
> you should see a steady stream of:
> 
>  real    0m0.113s
>  real    0m0.113s
>  real    0m0.113s
> 
> (on an idle system). If it fluctuates, with occasional longer delays, 
> there's some timer problem present.

Do the sleeping and the timing use different time references?
I mean, if they use the same reference and that reference fluctuates,
you won't see it in the result.

But anyway, Stefano. Can you test this?

-- 
Greetings Michael.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-08 15:36                           ` Michael Buesch
@ 2007-12-08 15:41                             ` Ingo Molnar
  0 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-08 15:41 UTC (permalink / raw)
  To: Michael Buesch
  Cc: Mark Lord, Guillaume Chazarain, stefano.brivio, Nick Piggin,
	Robert Love, linux-kernel, Dave Jones, Rafael J. Wysocki,
	Thomas Gleixner, Andrew Morton, Len Brown


* Michael Buesch <mb@bu3sch.de> wrote:

> > i cannot see how. You can verify msleep by running something like this:
> > 
> >   while :; do time usleep 111000; done
> > 
> > you should see a steady stream of:
> > 
> >  real    0m0.113s
> >  real    0m0.113s
> >  real    0m0.113s
> > 
> > (on an idle system). If it fluctuates, with occasional longer delays, 
> > there's some timer problem present.
> 
> Do the sleeping and the timing use different time references?

yes. But the really paranoid can do this from another box:

  while :; do time ssh testbox usleep 111000; done

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch] x86: scale cyc_2_nsec according to CPU frequency
  2007-12-07 14:52           ` Ingo Molnar
@ 2007-12-08 15:57             ` Arjan van de Ven
  2007-12-08 19:16               ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Arjan van de Ven @ 2007-12-08 15:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Guillaume Chazarain, Thomas Gleixner, Stefano Brivio,
	Robert Love, linux-kernel, Dave Jones, Rafael J. Wysocki,
	Michael Buesch, Andrew Morton

On Fri, 7 Dec 2007 15:52:06 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Guillaume Chazarain <guichaz@yahoo.fr> wrote:
> 
> > On Fri, 7 Dec 2007 14:55:25 +0100,
> > Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > Firstly, we dont need the 'offset' anymore because cpu_clock() 
> > > maintains offsets itself.
> > 
> > Yes, but a lower quality one. __update_rq_clock tries to compensate 
> > large jumping clocks with a jiffy resolution, while my offset
> > arranges for a very smooth frequency transition.
> 
> yes, but that would be easy to fix up via calling 
> sched_clock_idle_wakeup_event(0) when doing a frequency transition, 
> without burdening the normal sched_clock() codepath with the offset.
> See the attached latest version.


can this deal with dual/quad core where the frequency of one core
changes if the software changes the frequency of the other core?


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch] x86: scale cyc_2_nsec according to CPU frequency
  2007-12-08 15:57             ` Arjan van de Ven
@ 2007-12-08 19:16               ` Ingo Molnar
  2007-12-08 20:18                 ` Arjan van de Ven
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2007-12-08 19:16 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Guillaume Chazarain, Thomas Gleixner, Stefano Brivio,
	Robert Love, linux-kernel, Dave Jones, Rafael J. Wysocki,
	Michael Buesch, Andrew Morton


* Arjan van de Ven <arjan@infradead.org> wrote:

> > > > Firstly, we dont need the 'offset' anymore because cpu_clock() 
> > > > maintains offsets itself.
> > > 
> > > Yes, but a lower quality one. __update_rq_clock tries to 
> > > compensate large jumping clocks with a jiffy resolution, while my 
> > > offset arranges for a very smooth frequency transition.
> > 
> > yes, but that would be easy to fix up via calling 
> > sched_clock_idle_wakeup_event(0) when doing a frequency transition, 
> > without burdening the normal sched_clock() codepath with the offset. 
> > See the attached latest version.
> 
> can this deal with dual/quad core where the frequency of one core 
> changes if the software changes the frequency of the other core?

doesn't the notifier still get run on the target CPU?

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch] x86: scale cyc_2_nsec according to CPU frequency
  2007-12-08 19:16               ` Ingo Molnar
@ 2007-12-08 20:18                 ` Arjan van de Ven
  0 siblings, 0 replies; 47+ messages in thread
From: Arjan van de Ven @ 2007-12-08 20:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Guillaume Chazarain, Thomas Gleixner, Stefano Brivio,
	Robert Love, linux-kernel, Dave Jones, Rafael J. Wysocki,
	Michael Buesch, Andrew Morton

On Sat, 8 Dec 2007 20:16:29 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > > > > Firstly, we dont need the 'offset' anymore because
> > > > > cpu_clock() maintains offsets itself.
> > > > 
> > > > Yes, but a lower quality one. __update_rq_clock tries to 
> > > > compensate large jumping clocks with a jiffy resolution, while
> > > > my offset arranges for a very smooth frequency transition.
> > > 
> > > yes, but that would be easy to fix up via calling 
> > > sched_clock_idle_wakeup_event(0) when doing a frequency
> > > transition, without burdening the normal sched_clock() codepath
> > > with the offset. See the attached latest version.
> > 
> > can this deal with dual/quad core where the frequency of one core 
> > changes if the software changes the frequency of the other core?
> 
> doesn't the notifier still get run on the target CPU?
> 

.... if and only if the BIOS actually gives correct information to the
OS.

In reality... that's not a very common thing in this field, sadly

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-08  8:52             ` Ingo Molnar
@ 2007-12-08 23:37               ` Guillaume Chazarain
  2007-12-12  4:42               ` Nick Piggin
  1 sibling, 0 replies; 47+ messages in thread
From: Guillaume Chazarain @ 2007-12-08 23:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Stefano Brivio, Robert Love, linux-kernel,
	Dave Jones, Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

On Dec 8, 2007 9:52 AM, Ingo Molnar <mingo@elte.hu> wrote:

> the scariest bit isn't even the scaling, i think - that is a fairly
> straightforward and clean PER_CPU-ization of the global scaling factor,
> and its hookup with cpufreq events (and the credit for that goes to
> Guillaume Chazarain).

To be fair, the cpufreq hooks were already there; I just did a buggy percpu
conversion and added an offset that you removed ;-)

-- 
Guillaume

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-08  8:52             ` Ingo Molnar
  2007-12-08 23:37               ` Guillaume Chazarain
@ 2007-12-12  4:42               ` Nick Piggin
  2007-12-12 10:44                 ` Ingo Molnar
  1 sibling, 1 reply; 47+ messages in thread
From: Nick Piggin @ 2007-12-12  4:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown

On Saturday 08 December 2007 19:52, Ingo Molnar wrote:

> the scariest bit is the adding of cpu_clock() to kernel/printk.c so late
> in the game, and the anti-recursion code i did there. Maybe, because this
> only affects CONFIG_PRINTK_TIME, we could try it even for v2.6.24.

Printk recursion I guess shouldn't happen, but if there is a bug
somewhere in eg. the scheduler locking, then it may trigger, right?

Probably a pretty rare case; however, it would be nice to be able to
find out where the recursion comes from. Can you put an instruction
pointer in the recursion message, perhaps?


> i've now completed a couple of hundred random bootups on x86 overnight,
> with the full stack of these patches applied, and no problems.
>
> Could have a really critical look at the two patches below and give your
> Reviewed-by line(s) if you agree with having them in v2.6.24? I'd feel a
> lot better about this that way :-) I do have the feeling that it makes
> printk printout a lot more robust in general, independently of the
> cpu_clock() change - especially with more complex consoles like
> netconsole or fbcon.

Reviewed-by: Nick Piggin <npiggin@suse.de>

for both of them. However, I don't feel good about getting this into 2.6.24
just for printk timestamps :P But if you can convince others, I won't
stand in your way ;)

BTW, shouldn't we still disable the tsc for this machine for 2.6.24?
Because even if you did get these patches in, the sched_clock still
goes pretty wild if the tsc is not constant. cpu_clock can filter it
somewhat, but it would still lead to wrong values...

Nick

>
> 	Ingo
>
> -------------------->
> Subject: printk: make printk more robust by not allowing recursion
> From: Ingo Molnar <mingo@elte.hu>
>
> make printk more robust by allowing recursion only if there's a crash
> going on. Also add recursion detection.
>
> I've tested it with an artificially injected printk recursion - instead
> of a lockup or spontaneous reboot or other crash, the output was a
> well-controlled:
>
> [   41.057335] SysRq : <2>BUG: recent printk recursion!
> [   41.057335] loglevel0-8 reBoot Crashdump show-all-locks(D) tErm Full
> kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync
> showTasks Unmount shoW-blocked-tasks
>
> also do all this printk-debug logic with irqs disabled.
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  kernel/printk.c |   48 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 38 insertions(+), 10 deletions(-)
>
> Index: linux/kernel/printk.c
> ===================================================================
> --- linux.orig/kernel/printk.c
> +++ linux/kernel/printk.c
> @@ -628,30 +628,57 @@ asmlinkage int printk(const char *fmt, .
>  /* cpu currently holding logbuf_lock */
>  static volatile unsigned int printk_cpu = UINT_MAX;
>
> +const char printk_recursion_bug_msg [] =
> +			KERN_CRIT "BUG: recent printk recursion!\n";
> +static int printk_recursion_bug;
> +
>  asmlinkage int vprintk(const char *fmt, va_list args)
>  {
> +	static int log_level_unknown = 1;
> +	static char printk_buf[1024];
> +
>  	unsigned long flags;
> -	int printed_len;
> +	int printed_len = 0;
> +	int this_cpu;
>  	char *p;
> -	static char printk_buf[1024];
> -	static int log_level_unknown = 1;
>
>  	boot_delay_msec();
>
>  	preempt_disable();
> -	if (unlikely(oops_in_progress) && printk_cpu == smp_processor_id())
> -		/* If a crash is occurring during printk() on this CPU,
> -		 * make sure we can't deadlock */
> -		zap_locks();
> -
>  	/* This stops the holder of console_sem just where we want him */
>  	raw_local_irq_save(flags);
> +	this_cpu = smp_processor_id();
> +
> +	/*
> +	 * Ouch, printk recursed into itself!
> +	 */
> +	if (unlikely(printk_cpu == this_cpu)) {
> +		/*
> +		 * If a crash is occurring during printk() on this CPU,
> +		 * then try to get the crash message out but make sure
> +		 * we can't deadlock. Otherwise just return to avoid the
> +		 * recursion and return - but flag the recursion so that
> +		 * it can be printed at the next appropriate moment:
> +		 */
> +		if (!oops_in_progress) {
> +			printk_recursion_bug = 1;
> +			goto out_restore_irqs;
> +		}
> +		zap_locks();
> +	}
> +
>  	lockdep_off();
>  	spin_lock(&logbuf_lock);
> -	printk_cpu = smp_processor_id();
> +	printk_cpu = this_cpu;
>
> +	if (printk_recursion_bug) {
> +		printk_recursion_bug = 0;
> +		strcpy(printk_buf, printk_recursion_bug_msg);
> +		printed_len = strlen(printk_recursion_bug_msg);
> +	}
>  	/* Emit the output into the temporary buffer */
> -	printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args);
> +	printed_len += vscnprintf(printk_buf + printed_len,
> +				  sizeof(printk_buf) - printed_len, fmt, args);
>
>  	/*
>  	 * Copy the output into log_buf.  If the caller didn't provide
> @@ -744,6 +771,7 @@ asmlinkage int vprintk(const char *fmt,
>  		printk_cpu = UINT_MAX;
>  		spin_unlock(&logbuf_lock);
>  		lockdep_on();
> +out_restore_irqs:
>  		raw_local_irq_restore(flags);
>  	}
>
> -------------------->
> Subject: sched: fix CONFIG_PRINT_TIME's reliance on sched_clock()
> From: Ingo Molnar <mingo@elte.hu>
>
> Stefano Brivio reported weird printk timestamp behavior during
> CPU frequency changes:
>
>   http://bugzilla.kernel.org/show_bug.cgi?id=9475
>
> fix CONFIG_PRINT_TIME's reliance on sched_clock() and use cpu_clock()
> instead.
>
> Reported-and-bisected-by: Stefano Brivio <stefano.brivio@polimi.it>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  kernel/printk.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux/kernel/printk.c
> ===================================================================
> --- linux.orig/kernel/printk.c
> +++ linux/kernel/printk.c
> @@ -707,7 +707,7 @@ asmlinkage int vprintk(const char *fmt,
>  					loglev_char = default_message_loglevel
>  						+ '0';
>  				}
> -				t = printk_clock();
> +				t = cpu_clock(printk_cpu);
>  				nanosec_rem = do_div(t, 1000000000);
>  				tlen = sprintf(tbuf,
>  						"<%c>[%5lu.%06lu] ",

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] scheduler: fix x86 regression in native_sched_clock
  2007-12-12  4:42               ` Nick Piggin
@ 2007-12-12 10:44                 ` Ingo Molnar
  0 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2007-12-12 10:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Stefano Brivio, Robert Love, linux-kernel, Dave Jones,
	Rafael J. Wysocki, Michael Buesch, Thomas Gleixner,
	Andrew Morton, Len Brown


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > the scariest bit is the adding of cpu_clock() to kernel/printk.c so 
> > late in the game, and the anti-recursion code i did there. Maybe 
> > because this only affects CONFIG_PRINTK_TIME we could try it even 
> > for v2.6.24.
> 
> Printk recursion I guess shouldn't happen, but if there is a bug 
> somewhere in eg. the scheduler locking, then it may trigger, right?

or we just crash somewhere. It's all about risk management - printk is 
crucial, and with more complex codepaths being touched in printk it 
might make sense to just add built-in recursion protection into printk 
via my patch.

> Probably a pretty rare case; however, it would be nice to be able to find 
> out where the recursion comes from. Can you put an instruction pointer 
> in the recursion message, perhaps?

yeah, as i mentioned, if this were occurring in practice we could always 
save the stacktrace of the incident and output that. I opted for the 
simplest approach first. Thanks for your Reviewed-by, i've queued it up 
for 2.6.25.
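
A minimal form of that suggestion, as a hypothetical sketch on top of 
the queued patch ('printk_recursion_caller' is an invented name):

	static void *printk_recursion_caller;

	...
		if (!oops_in_progress) {
			printk_recursion_bug = 1;
			/* remember where the recursion came from: */
			printk_recursion_caller = __builtin_return_address(0);
			goto out_restore_irqs;
		}

the deferred message could then include it via %p.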

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread

Thread overview: 47+ messages
2007-12-07  1:19 [PATCH] scheduler: fix x86 regression in native_sched_clock Stefano Brivio
2007-12-07  5:29 ` Nick Piggin
2007-12-07  5:51 ` Thomas Gleixner
2007-12-07  7:18   ` Guillaume Chazarain
2007-12-07  8:02     ` Guillaume Chazarain
2007-12-07  8:51       ` Ingo Molnar
2007-12-07  9:29         ` Guillaume Chazarain
2007-12-07  9:59           ` Ingo Molnar
2007-12-07 13:55       ` [patch] x86: scale cyc_2_nsec according to CPU frequency Ingo Molnar
2007-12-07 14:27         ` Guillaume Chazarain
2007-12-07 14:52           ` Ingo Molnar
2007-12-08 15:57             ` Arjan van de Ven
2007-12-08 19:16               ` Ingo Molnar
2007-12-08 20:18                 ` Arjan van de Ven
2007-12-07 10:37   ` [PATCH] scheduler: fix x86 regression in native_sched_clock Andi Kleen
2007-12-07  8:45 ` Ingo Molnar
2007-12-07 10:32   ` Andrew Morton
2007-12-07 10:40     ` Ingo Molnar
2007-12-07 11:07       ` Ingo Molnar
2007-12-07 11:09       ` Andrew Morton
2007-12-07 11:12         ` Ingo Molnar
2007-12-07 11:13   ` Nick Piggin
2007-12-07 11:17     ` Ingo Molnar
2007-12-07 16:48       ` Nick Piggin
2007-12-08  0:50         ` Nick Piggin
2007-12-08  0:57           ` Nick Piggin
2007-12-08  8:52             ` Ingo Molnar
2007-12-08 23:37               ` Guillaume Chazarain
2007-12-12  4:42               ` Nick Piggin
2007-12-12 10:44                 ` Ingo Molnar
2007-12-07 11:18     ` Guillaume Chazarain
2007-12-07 11:57       ` Guillaume Chazarain
2007-12-07 11:23     ` stefano.brivio
2007-12-07 12:11       ` Ingo Molnar
2007-12-07 12:25       ` Ingo Molnar
2007-12-07 12:35         ` Ingo Molnar
2007-12-07 12:40           ` Ingo Molnar
2007-12-07 14:54             ` Ingo Molnar
2007-12-07 16:46               ` Guillaume Chazarain
2007-12-07 17:57                 ` Ingo Molnar
2007-12-08 15:06                   ` Mark Lord
2007-12-08 15:13                     ` Ingo Molnar
2007-12-08 15:27                       ` Michael Buesch
2007-12-08 15:33                         ` Ingo Molnar
2007-12-08 15:36                           ` Michael Buesch
2007-12-08 15:41                             ` Ingo Molnar
2007-12-07 11:24     ` Ingo Molnar
