* [PATCH 00/15] cleanups and optimizations
@ 2013-12-12 14:08 Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 01/15] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
                   ` (16 more replies)
  0 siblings, 17 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

This series contains the preempt_enable_no_resched() cleanups that include
spin_lock_bh() optimizations and local_clock() optimizations.

This patch series does not include the generic idle loop stuff used to 'fix'
the idle injection crap. Rafael, can we make that happen by simply ignoring
pm-qos like they already do? That way we at least have that part of the
cleanups done, and you and Jacob can sort out the pm-qos stuff at your own
leisure.

Thomas, can you merge this series?



* [PATCH 01/15] x86, acpi, idle: Restructure the mwait idle routines
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-19 20:09   ` [tip:x86/idle] " tip-bot for Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 02/15] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel, Rafael Wysocki

[-- Attachment #1: peter_zijlstra-x86_acpi_idle-restructure_the_mwait_idle_routines.patch --]
[-- Type: text/plain, Size: 6939 bytes --]

People seem to delight in writing wrong and broken mwait idle routines;
collapse the lot.

This leaves mwait_play_dead() as the sole remaining user of __mwait();
new __mwait() users are probably doing it wrong.

Also remove __sti_mwait() as it's unused.

Cc: arjan@linux.intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: lenb@kernel.org
Cc: rui.zhang@intel.com
Acked-by: Rafael Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/mwait.h       |   40 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h   |   23 ---------------------
 arch/x86/kernel/acpi/cstate.c      |   23 ---------------------
 drivers/acpi/acpi_pad.c            |    5 ----
 drivers/acpi/processor_idle.c      |   15 -------------
 drivers/idle/intel_idle.c          |    8 -------
 drivers/thermal/intel_powerclamp.c |    4 ---
 7 files changed, 43 insertions(+), 75 deletions(-)

--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_X86_MWAIT_H
 #define _ASM_X86_MWAIT_H
 
+#include <linux/sched.h>
+
 #define MWAIT_SUBSTATE_MASK		0xf
 #define MWAIT_CSTATE_MASK		0xf
 #define MWAIT_SUBSTATE_SIZE		4
@@ -13,4 +15,42 @@
 
 #define MWAIT_ECX_INTERRUPT_BREAK	0x1
 
+static inline void __monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
+{
+	if (!current_set_polling_and_test()) {
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
+		__monitor((void *)&current_thread_info()->flags, 0, 0);
+		if (!need_resched())
+			__mwait(eax, ecx);
+	}
+	__current_clr_polling();
+}
+
 #endif /* _ASM_X86_MWAIT_H */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -700,29 +700,6 @@ static inline void sync_core(void)
 #endif
 }
 
-static inline void __monitor(const void *eax, unsigned long ecx,
-			     unsigned long edx)
-{
-	/* "monitor %eax, %ecx, %edx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc8;"
-		     :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
-	/* "mwait %eax, %ecx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
-	trace_hardirqs_on();
-	/* "mwait %eax, %ecx;" */
-	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsi
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
-	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(ax, cx);
-	}
-}
-
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 {
 	unsigned int cpu = smp_processor_id();
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,10 +193,7 @@ static int power_saving_thread(void *dat
 					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 			stop_critical_timings();
 
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			smp_mb();
-			if (!need_resched())
-				__mwait(power_saving_mwait_eax, 1);
+			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
 			start_critical_timings();
 			if (lapic_marked_unstable)
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpu
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	lapic_timer_state_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 
@@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
@@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpu
 		}
 	}
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	acpi_unlazy_tlb(smp_processor_id());
 
 	/* Tell the scheduler that we are going deep-idle: */
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -359,13 +359,7 @@ static int intel_idle(struct cpuidle_dev
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
-	if (!current_set_polling_and_test()) {
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(eax, ecx);
-	}
+	mwait_idle_with_hints(eax, ecx);
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
 			 */
 			local_touch_nmi();
 			stop_critical_timings();
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			cpu_relax(); /* allow HT sibling to run */
-			__mwait(eax, ecx);
+			mwait_idle_with_hints(eax, ecx);
 			start_critical_timings();
 			atomic_inc(&idle_wakeup_counter);
 		}




* [PATCH 02/15] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 01/15] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 03/15] preempt, locking: Rework local_bh_{dis,en}able() Peter Zijlstra
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peterz-preempt_fold_need_resched.patch --]
[-- Type: text/plain, Size: 3574 bytes --]

With various drivers wanting to inject idle time, we get people
calling idle routines outside of the idle loop proper.

Therefore we need to be extra careful about not missing
TIF_NEED_RESCHED -> PREEMPT_NEED_RESCHED propagations.

While looking at this, I also realized there's a small window in the
existing idle loop where we can miss TIF_NEED_RESCHED: when it hits
right after the tif_need_resched() test at the end of the loop but
right before the need_resched() test at the start of the loop.

So move preempt_fold_need_resched() out of the loop where we're
guaranteed to have TIF_NEED_RESCHED set.
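
For reference, a stripped-down sketch of the idle loop (polling
bookkeeping and tick details elided) showing where the window sits and
where the fold moves to:

  while (1) {
	tick_nohz_idle_enter();

	while (!need_resched()) {		/* (A) start-of-loop test */
		/* ... arch_cpu_idle() / polling idle ... */

		if (tif_need_resched())		/* (B) old end-of-loop fold */
			set_preempt_need_resched();
	}

	/*
	 * A wakeup setting TIF_NEED_RESCHED after (B) but before the
	 * next (A) was missed by the old fold.  Folding here instead is
	 * always correct: falling out of the inner loop means
	 * TIF_NEED_RESCHED is set, so a plain set of
	 * PREEMPT_NEED_RESCHED suffices.
	 */
	preempt_set_need_resched();

	tick_nohz_idle_exit();
	schedule_preempt_disabled();
  }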

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/mwait.h |    2 +-
 include/linux/preempt.h      |   15 +++++++++++++++
 include/linux/sched.h        |   15 +++++++++++++++
 kernel/cpu/idle.c            |   17 ++++++++++-------
 kernel/sched/core.c          |    3 +--
 5 files changed, 42 insertions(+), 10 deletions(-)

--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -50,7 +50,7 @@ static inline void mwait_idle_with_hints
 		if (!need_resched())
 			__mwait(eax, ecx);
 	}
-	__current_clr_polling();
+	current_clr_polling();
 }
 
 #endif /* _ASM_X86_MWAIT_H */
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -116,6 +116,21 @@ do { \
 
 #endif /* CONFIG_PREEMPT_COUNT */
 
+#ifdef CONFIG_PREEMPT
+#define preempt_set_need_resched() \
+do { \
+	set_preempt_need_resched(); \
+} while (0)
+#define preempt_fold_need_resched() \
+do { \
+	if (tif_need_resched()) \
+		set_preempt_need_resched(); \
+} while (0)
+#else
+#define preempt_set_need_resched() do { } while (0)
+#define preempt_fold_need_resched() do { } while (0)
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 
 struct preempt_notifier;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2628,6 +2628,21 @@ static inline bool __must_check current_
 }
 #endif
 
+static inline void current_clr_polling(void)
+{
+	__current_clr_polling();
+
+	/*
+	 * Ensure we check TIF_NEED_RESCHED after we clear the polling bit.
+	 * Once the bit is cleared, we'll get IPIs with every new
+	 * TIF_NEED_RESCHED and the IPI handler, scheduler_ipi(), will also
+	 * fold.
+	 */
+	smp_mb(); /* paired with resched_task() */
+
+	preempt_fold_need_resched();
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -105,14 +105,17 @@ static void cpu_idle_loop(void)
 				__current_set_polling();
 			}
 			arch_cpu_idle_exit();
-			/*
-			 * We need to test and propagate the TIF_NEED_RESCHED
-			 * bit here because we might not have send the
-			 * reschedule IPI to idle tasks.
-			 */
-			if (tif_need_resched())
-				set_preempt_need_resched();
 		}
+
+		/*
+		 * Since we fell out of the loop above, we know
+		 * TIF_NEED_RESCHED must be set, propagate it into
+		 * PREEMPT_NEED_RESCHED.
+		 *
+		 * This is required because for polling idle loops we will
+		 * not have had an IPI to fold the state for us.
+		 */
+		preempt_set_need_resched();
 		tick_nohz_idle_exit();
 		schedule_preempt_disabled();
 	}
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1499,8 +1499,7 @@ void scheduler_ipi(void)
 	 * TIF_NEED_RESCHED remotely (for the first time) will also send
 	 * this IPI.
 	 */
-	if (tif_need_resched())
-		set_preempt_need_resched();
+	preempt_fold_need_resched();
 
 	if (llist_empty(&this_rq()->wake_list)
 			&& !tick_nohz_full_cpu(smp_processor_id())




* [PATCH 03/15] preempt, locking: Rework local_bh_{dis,en}able()
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 01/15] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 02/15] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 04/15] locking: Optimize lock_bh functions Peter Zijlstra
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peter_zijlstra-inline-local_bh_disable.patch --]
[-- Type: text/plain, Size: 4902 bytes --]

Currently local_bh_disable() is out-of-line for no apparent reason,
so inline it to save a few cycles on call/return nonsense; the
function body is a single add on x86 (a few extra loads and stores on
load/store archs).

Also expose two new local_bh functions:

  __local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);

which implement the actual local_bh_{dis,en}able() behaviour.

The next patch uses the exposed @cnt argument to optimize bh lock
functions.
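
As an illustration of what the @cnt argument buys (example_lock_bh()
and example_unlock_bh() are hypothetical; the real conversions are in
the next patch), a _bh_ lock can fold the softirq and preempt
accounting into a single preempt_count operation each way:

  static inline void example_lock_bh(raw_spinlock_t *lock)
  {
	/* one preempt_count add instead of local_bh_disable() + preempt_disable() */
	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET);
	do_raw_spin_lock(lock);
  }

  static inline void example_unlock_bh(raw_spinlock_t *lock)
  {
	do_raw_spin_unlock(lock);
	/* one preempt_count sub, softirq and resched checks included */
	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET);
  }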

With build fixes from Jacob Pan.

Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/bottom_half.h  |   32 +++++++++++++++++++++++++++++---
 include/linux/hardirq.h      |    1 +
 include/linux/preempt_mask.h |    1 -
 kernel/softirq.c             |   35 ++++++-----------------------------
 4 files changed, 36 insertions(+), 33 deletions(-)

--- a/include/linux/bottom_half.h
+++ b/include/linux/bottom_half.h
@@ -1,9 +1,35 @@
 #ifndef _LINUX_BH_H
 #define _LINUX_BH_H
 
-extern void local_bh_disable(void);
+#include <linux/preempt.h>
+#include <linux/preempt_mask.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
+#else
+static __always_inline void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
+{
+	preempt_count_add(cnt);
+	barrier();
+}
+#endif
+
+static inline void local_bh_disable(void)
+{
+	__local_bh_disable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
+
 extern void _local_bh_enable(void);
-extern void local_bh_enable(void);
-extern void local_bh_enable_ip(unsigned long ip);
+extern void __local_bh_enable_ip(unsigned long ip, unsigned int cnt);
+
+static inline void local_bh_enable_ip(unsigned long ip)
+{
+	__local_bh_enable_ip(ip, SOFTIRQ_DISABLE_OFFSET);
+}
+
+static inline void local_bh_enable(void)
+{
+	__local_bh_enable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
 
 #endif /* _LINUX_BH_H */
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/ftrace_irq.h>
 #include <linux/vtime.h>
+#include <asm/hardirq.h>
 
 
 extern void synchronize_irq(unsigned int irq);
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -2,7 +2,6 @@
 #define LINUX_PREEMPT_MASK_H
 
 #include <linux/preempt.h>
-#include <asm/hardirq.h>
 
 /*
  * We put the hardirq and softirq counter into the preemption
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -89,7 +89,7 @@ static void wakeup_softirqd(void)
  * where hardirqs are disabled legitimately:
  */
 #ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip, unsigned int cnt)
+void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
 {
 	unsigned long flags;
 
@@ -114,21 +114,9 @@ static void __local_bh_disable(unsigned
 	if (preempt_count() == cnt)
 		trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 }
-#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
-{
-	preempt_count_add(cnt);
-	barrier();
-}
+EXPORT_SYMBOL(__local_bh_disable_ip);
 #endif /* CONFIG_TRACE_IRQFLAGS */
 
-void local_bh_disable(void)
-{
-	__local_bh_disable(_RET_IP_, SOFTIRQ_DISABLE_OFFSET);
-}
-
-EXPORT_SYMBOL(local_bh_disable);
-
 static void __local_bh_enable(unsigned int cnt)
 {
 	WARN_ON_ONCE(!irqs_disabled());
@@ -151,7 +139,7 @@ void _local_bh_enable(void)
 
 EXPORT_SYMBOL(_local_bh_enable);
 
-static inline void _local_bh_enable_ip(unsigned long ip)
+void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 {
 	WARN_ON_ONCE(in_irq() || irqs_disabled());
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -166,7 +154,7 @@ static inline void _local_bh_enable_ip(u
 	 * Keep preemption disabled until we are done with
 	 * softirq processing:
  	 */
-	preempt_count_sub(SOFTIRQ_DISABLE_OFFSET - 1);
+	preempt_count_sub(cnt - 1);
 
 	if (unlikely(!in_interrupt() && local_softirq_pending())) {
 		/*
@@ -182,18 +170,7 @@ static inline void _local_bh_enable_ip(u
 #endif
 	preempt_check_resched();
 }
-
-void local_bh_enable(void)
-{
-	_local_bh_enable_ip(_RET_IP_);
-}
-EXPORT_SYMBOL(local_bh_enable);
-
-void local_bh_enable_ip(unsigned long ip)
-{
-	_local_bh_enable_ip(ip);
-}
-EXPORT_SYMBOL(local_bh_enable_ip);
+EXPORT_SYMBOL(__local_bh_enable_ip);
 
 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
@@ -264,7 +241,7 @@ asmlinkage void __do_softirq(void)
 	pending = local_softirq_pending();
 	account_irq_enter_time(current);
 
-	__local_bh_disable(_RET_IP_, SOFTIRQ_OFFSET);
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
 
 	cpu = smp_processor_id();




* [PATCH 04/15] locking: Optimize lock_bh functions
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (2 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 03/15] preempt, locking: Rework local_bh_{dis,en}able() Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 05/15] sched, net: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peter_zijlstra-remove_preempt_enable_no_sched-from-locks.patch --]
[-- Type: text/plain, Size: 6103 bytes --]

Currently all _bh_ lock functions do two preempt_count operations:

  local_bh_disable();
  preempt_disable();

and for the unlock:

  preempt_enable_no_resched();
  local_bh_enable();

Since it's a waste of perfectly good cycles to modify the same variable
twice when you can do it in one go, use the new
__local_bh_{dis,en}able_ip() functions that allow us to provide a
preempt_count value to add/sub.

So define SOFTIRQ_LOCK_OFFSET as the offset a _bh_ lock needs to
add/sub to be done in one go.

As a bonus it gets rid of the preempt_enable_no_resched() usage.
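
Roughly, in terms of the preempt_count arithmetic (assuming the usual
layout where SOFTIRQ_DISABLE_OFFSET is 2 * SOFTIRQ_OFFSET = 512 and
PREEMPT_CHECK_OFFSET is 1):

  /* old: two read-modify-write operations on preempt_count */
  local_bh_disable();			/* preempt_count += 512 */
  preempt_disable();			/* preempt_count += 1   */
  /* ... critical section ... */
  preempt_enable_no_resched();		/* preempt_count -= 1   */
  local_bh_enable();			/* preempt_count -= 512, softirq check */

  /* new: a single operation each way via the @cnt argument */
  __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);	/* += 513 */
  /* ... critical section ... */
  __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);	/* -= 513, softirq + resched checks */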

This reduces 1000 loops of:

  spin_lock_bh(&bh_lock);
  spin_unlock_bh(&bh_lock);

from 53596 cycles to 51995 cycles. I didn't do enough measurements to
say for absolutely sure that the result is significant, but the few
runs I did for each suggest it is.

Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/preempt_mask.h     |   15 +++++++++++++++
 include/linux/rwlock_api_smp.h   |   12 ++++--------
 include/linux/spinlock_api_smp.h |   12 ++++--------
 include/linux/spinlock_api_up.h  |   16 +++++++++++-----
 4 files changed, 34 insertions(+), 21 deletions(-)

--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -78,6 +78,21 @@
 #endif
 
 /*
+ * The preempt_count offset needed for things like:
+ *
+ *  spin_lock_bh()
+ *
+ * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and
+ * softirqs, such that unlock sequences of:
+ *
+ *  spin_unlock();
+ *  local_bh_enable();
+ *
+ * Work as expected.
+ */
+#define SOFTIRQ_LOCK_OFFSET (SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET)
+
+/*
  * Are we running in atomic context?  WARNING: this macro cannot
  * always detect atomic context; in particular, it cannot know about
  * held spinlocks in non-preemptible kernels.  Thus it should not be
--- a/include/linux/rwlock_api_smp.h
+++ b/include/linux/rwlock_api_smp.h
@@ -172,8 +172,7 @@ static inline void __raw_read_lock_irq(r
 
 static inline void __raw_read_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_read_trylock, do_raw_read_lock);
 }
@@ -200,8 +199,7 @@ static inline void __raw_write_lock_irq(
 
 static inline void __raw_write_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock);
 }
@@ -250,8 +248,7 @@ static inline void __raw_read_unlock_bh(
 {
 	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_read_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
@@ -275,8 +272,7 @@ static inline void __raw_write_unlock_bh
 {
 	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_write_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 #endif /* __LINUX_RWLOCK_API_SMP_H */
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -131,8 +131,7 @@ static inline void __raw_spin_lock_irq(r
 
 static inline void __raw_spin_lock_bh(raw_spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
 }
@@ -174,20 +173,17 @@ static inline void __raw_spin_unlock_bh(
 {
 	spin_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_spin_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 static inline int __raw_spin_trylock_bh(raw_spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	if (do_raw_spin_trylock(lock)) {
 		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 		return 1;
 	}
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	return 0;
 }
 
--- a/include/linux/spinlock_api_up.h
+++ b/include/linux/spinlock_api_up.h
@@ -24,11 +24,14 @@
  * flags straight, to suppress compiler warnings of unused lock
  * variables, and to add the proper checker annotations:
  */
+#define ___LOCK(lock) \
+  do { __acquire(lock); (void)(lock); } while (0)
+
 #define __LOCK(lock) \
-  do { preempt_disable(); __acquire(lock); (void)(lock); } while (0)
+  do { preempt_disable(); ___LOCK(lock); } while (0);
 
 #define __LOCK_BH(lock) \
-  do { local_bh_disable(); __LOCK(lock); } while (0)
+  do { __local_bh_disable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); ___LOCK(lock); } while (0)
 
 #define __LOCK_IRQ(lock) \
   do { local_irq_disable(); __LOCK(lock); } while (0)
@@ -36,12 +39,15 @@
 #define __LOCK_IRQSAVE(lock, flags) \
   do { local_irq_save(flags); __LOCK(lock); } while (0)
 
+#define ___UNLOCK(lock) \
+  do { __release(lock); (void)(lock); } while (0)
+
 #define __UNLOCK(lock) \
-  do { preempt_enable(); __release(lock); (void)(lock); } while (0)
+  do { preempt_enable(); ___UNLOCK(lock); } while (0)
 
 #define __UNLOCK_BH(lock) \
-  do { preempt_enable_no_resched(); local_bh_enable(); \
-	  __release(lock); (void)(lock); } while (0)
+  do { __local_bh_enable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); \
+       ___UNLOCK(lock); } while (0)
 
 #define __UNLOCK_IRQ(lock) \
   do { local_irq_enable(); __UNLOCK(lock); } while (0)




* [PATCH 05/15] sched, net: Clean up preempt_enable_no_resched() abuse
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (3 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 04/15] locking: Optimize lock_bh functions Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 06/15] sched, net: Fixup busy_loop_us_clock() Peter Zijlstra
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel, David S. Miller

[-- Attachment #1: peterz-fixup-weird-preempt_enable_no_resched-usage.patch --]
[-- Type: text/plain, Size: 1171 bytes --]

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule(), or if we know preemption cannot actually be enabled
by that statement because more preempt_count 'refs' are known to be held.
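
For reference, a minimal sketch of the one pattern where skipping the
resched check is legitimate, because the very next statement
reschedules anyway:

  preempt_enable_no_resched();
  schedule();

In the tcp_recvmsg() case below neither condition holds: execution just
continues, so plain preempt_enable(), which includes the resched check,
is the correct form.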

Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 net/ipv4/tcp.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1623,11 +1623,11 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
 		    !sysctl_tcp_low_latency &&
 		    net_dma_find_channel()) {
-			preempt_enable_no_resched();
+			preempt_enable();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
 		} else {
-			preempt_enable_no_resched();
+			preempt_enable();
 		}
 	}
 #endif




* [PATCH 06/15] sched, net: Fixup busy_loop_us_clock()
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (4 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 05/15] sched, net: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 07/15] sched, thermal: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel, David S. Miller

[-- Attachment #1: peterz-fixup-busy_poll.patch --]
[-- Type: text/plain, Size: 2280 bytes --]

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule(), or if we know preemption cannot actually be enabled
by that statement because more preempt_count 'refs' are known to be held.

This busy_poll stuff looks to be completely and utterly broken;
sched_clock() can return utter garbage with interrupts enabled (rare,
but still) and it can drift unbounded between CPUs.

This means that if you get preempted/migrated and your new CPU is
years behind the previous CPU, we get to busy spin for a _very_ long
time.

There is a _REASON_ sched_clock() warns about preemptability -
papering over it with a preempt_disable()/preempt_enable_no_resched()
is just terminal brain damage on so many levels.

Replace the sched_clock() usage with local_clock(), which has a
bounded drift between CPUs (<2 jiffies).

There is a further problem with the entire busy wait poll thing in
that the spin time is additive to the syscall timeout, not inclusive.
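
As a rough illustration (simplified pseudo-code, not the actual
sk_busy_loop() code) of how the unbounded drift bites: the end time is
computed from one clock read and compared against later reads, so
migrating to a CPU whose sched_clock() runs far behind stretches the
spin by the full drift:

  end_time = busy_loop_us_clock() + poll_usecs;	/* computed on CPU A */

  /* ... preempted and migrated to CPU B, whose sched_clock() is
     N seconds behind CPU A's ... */

  while (!packet_arrived() && busy_loop_us_clock() < end_time)
	cpu_relax();	/* spins up to N seconds longer than intended */

With local_clock() the inter-CPU error, and hence the worst-case
over-spin, is bounded to a couple of jiffies.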

Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/net/busy_poll.h |   19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,10 @@ static inline bool net_busy_loop_on(void
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
 static inline u64 busy_loop_us_clock(void)
 {
-	u64 rc;
-
-	preempt_disable_notrace();
-	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
-
-	return rc >> 10;
-}
-#else /* CONFIG_DEBUG_PREEMPT */
-static inline u64 busy_loop_us_clock(void)
-{
-	return sched_clock() >> 10;
+	return local_clock() >> 10;
 }
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {




* [PATCH 07/15] sched, thermal: Clean up preempt_enable_no_resched() abuse
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (5 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 06/15] sched, net: Fixup busy_loop_us_clock() Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 08/15] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peterz-fixup-intel-powerclamp.patch --]
[-- Type: text/plain, Size: 977 bytes --]

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule(), or if we know preemption cannot actually be enabled
by that statement because more preempt_count 'refs' are known to be held.

Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/thermal/intel_powerclamp.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -443,7 +443,7 @@ static int clamp_thread(void *arg)
 			atomic_inc(&idle_wakeup_counter);
 		}
 		tick_nohz_idle_exit();
-		preempt_enable_no_resched();
+		preempt_enable();
 	}
 	del_timer_sync(&wakeup_timer);
 	clear_bit(cpunr, cpu_clamping_mask);




* [PATCH 08/15] preempt: Take away preempt_enable_no_resched() from modules
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (6 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 07/15] sched, thermal: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 09/15] x86: Use mul_u64_u32_shr() for native_sched_clock() Peter Zijlstra
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel, Rusty Russell

[-- Attachment #1: peterz-hide-preempt_enable_no_resched-modules.patch --]
[-- Type: text/plain, Size: 2207 bytes --]

Discourage drivers/modules from being creative with preemption.

Sadly it is all implemented in macros and inlines, so if they want to
do evil they still can, but at least try and discourage some.
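
The effect is purely a build-time nudge: under MODULE the macros simply
stop existing, so hypothetical out-of-tree code like the below no
longer compiles, while built-in code is unaffected:

  /* hypothetical module code */
  preempt_disable();
  do_something_clever();
  preempt_enable_no_resched();	/* no longer defined when MODULE is set */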

Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/preempt.h |   22 ++++++++++++++++++++--
 include/linux/uaccess.h |    5 ++++-
 2 files changed, 24 insertions(+), 3 deletions(-)

--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -64,7 +64,11 @@ do { \
 } while (0)
 
 #else
-#define preempt_enable() preempt_enable_no_resched()
+#define preempt_enable() \
+do { \
+	barrier(); \
+	preempt_count_dec(); \
+} while (0)
 #define preempt_check_resched() do { } while (0)
 #endif
 
@@ -93,7 +97,11 @@ do { \
 		__preempt_schedule_context(); \
 } while (0)
 #else
-#define preempt_enable_notrace() preempt_enable_no_resched_notrace()
+#define preempt_enable_notrace() \
+do { \
+	barrier(); \
+	__preempt_count_dec(); \
+} while (0)
 #endif
 
 #else /* !CONFIG_PREEMPT_COUNT */
@@ -126,6 +134,16 @@ do { \
 #define preempt_fold_need_resched() do { } while (0)
 #endif
 
+#ifdef MODULE
+/*
+ * Modules have no business playing preemption tricks.
+ */
+#undef sched_preempt_enable_no_resched
+#undef preempt_enable_no_resched
+#undef preempt_enable_no_resched_notrace
+#undef preempt_check_resched
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 
 struct preempt_notifier;
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -25,13 +25,16 @@ static inline void pagefault_disable(voi
 
 static inline void pagefault_enable(void)
 {
+#ifndef CONFIG_PREEMPT
 	/*
 	 * make sure to issue those last loads/stores before enabling
 	 * the pagefault handler again.
 	 */
 	barrier();
 	preempt_count_dec();
-	preempt_check_resched();
+#else
+	preempt_enable();
+#endif
 }
 
 #ifndef ARCH_HAS_NOCACHE_UACCESS




* [PATCH 09/15] x86: Use mul_u64_u32_shr() for native_sched_clock()
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (7 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 08/15] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 10/15] x86: Move some code around Peter Zijlstra
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peterz-cycles2ns-math64.patch --]
[-- Type: text/plain, Size: 3766 bytes --]

Use mul_u64_u32_shr() so that x86_64 can use a single 64x64->128 mul.
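
For reference, a small user-space model of what the helper computes
(this mirrors the generic 128-bit definition of mul_u64_u32_shr() from
linux/math64.h):

  #include <stdint.h>

  /* (a * mul) >> shift, keeping the high bits of the 96-bit intermediate;
     on x86_64 this boils down to a single 64x64->128 mul. */
  static uint64_t mul_u64_u32_shr(uint64_t a, uint32_t mul, unsigned int shift)
  {
	return (uint64_t)(((unsigned __int128)a * mul) >> shift);
  }

  /* cycles -> ns as in __cycles_2_ns(): ns = offset + ((cyc * cyc2ns) >> 10) */
  static uint64_t cycles_to_ns(uint64_t cyc, uint64_t offset, uint32_t cyc2ns)
  {
	return offset + mul_u64_u32_shr(cyc, cyc2ns, 10);
  }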

before:

0000000000000560 <native_sched_clock>:
 560:   44 8b 1d 00 00 00 00    mov    0x0(%rip),%r11d        # 567 <native_sched_clock+0x7>
 567:   55                      push   %rbp
 568:   48 89 e5                mov    %rsp,%rbp
 56b:   45 85 db                test   %r11d,%r11d
 56e:   75 4f                   jne    5bf <native_sched_clock+0x5f>
 570:   0f 31                   rdtsc  
 572:   89 c0                   mov    %eax,%eax
 574:   48 c1 e2 20             shl    $0x20,%rdx
 578:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
 57f:   48 09 c2                or     %rax,%rdx
 582:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
 589:   65 8b 04 25 00 00 00    mov    %gs:0x0,%eax
 590:   00 
 591:   48 98                   cltq   
 593:   48 8b 34 c5 00 00 00    mov    0x0(,%rax,8),%rsi
 59a:   00 
 59b:   48 89 d0                mov    %rdx,%rax
 59e:   81 e2 ff 03 00 00       and    $0x3ff,%edx
 5a4:   48 c1 e8 0a             shr    $0xa,%rax
 5a8:   48 0f af 14 0e          imul   (%rsi,%rcx,1),%rdx
 5ad:   48 0f af 04 0e          imul   (%rsi,%rcx,1),%rax
 5b2:   5d                      pop    %rbp
 5b3:   48 03 04 3e             add    (%rsi,%rdi,1),%rax
 5b7:   48 c1 ea 0a             shr    $0xa,%rdx
 5bb:   48 01 d0                add    %rdx,%rax
 5be:   c3                      retq 

after:

0000000000000550 <native_sched_clock>:
 550:   8b 3d 00 00 00 00       mov    0x0(%rip),%edi        # 556 <native_sched_clock+0x6>
 556:   55                      push   %rbp
 557:   48 89 e5                mov    %rsp,%rbp
 55a:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
 55e:   85 ff                   test   %edi,%edi
 560:   75 2c                   jne    58e <native_sched_clock+0x3e>
 562:   0f 31                   rdtsc  
 564:   89 c0                   mov    %eax,%eax
 566:   48 c1 e2 20             shl    $0x20,%rdx
 56a:   48 09 c2                or     %rax,%rdx
 56d:   65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
 574:   00 00 
 576:   89 c0                   mov    %eax,%eax
 578:   48 f7 e2                mul    %rdx
 57b:   65 48 8b 0c 25 00 00    mov    %gs:0x0,%rcx
 582:   00 00 
 584:   c9                      leaveq 
 585:   48 0f ac d0 0a          shrd   $0xa,%rdx,%rax
 58a:   48 01 c8                add    %rcx,%rax
 58d:   c3                      retq 

                        MAINLINE   POST

    sched_clock_stable: 1	   1
    (cold) sched_clock: 329841     331312
    (cold) local_clock: 301773     310296
    (warm) sched_clock: 38375      38247
    (warm) local_clock: 100371     102713
    (warm) rdtsc:       27340      27289
    sched_clock_stable: 0          0
    (cold) sched_clock: 382634     372706
    (cold) local_clock: 396890     399275
    (warm) sched_clock: 38194      38124
    (warm) local_clock: 143452     148698
    (warm) rdtsc:       27345      27365

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/timer.h |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -4,6 +4,7 @@
 #include <linux/pm.h>
 #include <linux/percpu.h>
 #include <linux/interrupt.h>
+#include <linux/math64.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -57,10 +58,8 @@ DECLARE_PER_CPU(unsigned long long, cyc2
 
 static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
-	int cpu = smp_processor_id();
-	unsigned long long ns = per_cpu(cyc2ns_offset, cpu);
-	ns += mult_frac(cyc, per_cpu(cyc2ns, cpu),
-			(1UL << CYC2NS_SCALE_FACTOR));
+	unsigned long long ns = this_cpu_read(cyc2ns_offset);
+	ns += mul_u64_u32_shr(cyc, this_cpu_read(cyc2ns), CYC2NS_SCALE_FACTOR);
 	return ns;
 }
 




* [PATCH 10/15] x86: Move some code around
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (8 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 09/15] x86: Use mul_u64_u32_shr() for native_sched_clock() Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs Peter Zijlstra
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peterz-tsc-move.patch --]
[-- Type: text/plain, Size: 6559 bytes --]

There are no __cycles_2_ns() users outside of arch/x86/kernel/tsc.c,
so move it there.

There are no cycles_2_ns() users.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/timer.h |   59 ----------------------
 arch/x86/kernel/tsc.c        |  112 +++++++++++++++++++++++--------------------
 2 files changed, 61 insertions(+), 110 deletions(-)

--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -13,66 +13,7 @@ extern int recalibrate_cpu_khz(void);
 
 extern int no_timer_check;
 
-/* Accelerators for sched_clock()
- * convert from cycles(64bits) => nanoseconds (64bits)
- *  basic equation:
- *		ns = cycles / (freq / ns_per_sec)
- *		ns = cycles * (ns_per_sec / freq)
- *		ns = cycles * (10^9 / (cpu_khz * 10^3))
- *		ns = cycles * (10^6 / cpu_khz)
- *
- *	Then we use scaling math (suggested by george@mvista.com) to get:
- *		ns = cycles * (10^6 * SC / cpu_khz) / SC
- *		ns = cycles * cyc2ns_scale / SC
- *
- *	And since SC is a constant power of two, we can convert the div
- *  into a shift.
- *
- *  We can use khz divisor instead of mhz to keep a better precision, since
- *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
- *  (mathieu.desnoyers@polymtl.ca)
- *
- *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
- *
- * In:
- *
- * ns = cycles * cyc2ns_scale / SC
- *
- * Although we may still have enough bits to store the value of ns,
- * in some cases, we may not have enough bits to store cycles * cyc2ns_scale,
- * leading to an incorrect result.
- *
- * To avoid this, we can decompose 'cycles' into quotient and remainder
- * of division by SC.  Then,
- *
- * ns = (quot * SC + rem) * cyc2ns_scale / SC
- *    = quot * cyc2ns_scale + (rem * cyc2ns_scale) / SC
- *
- *			- sqazi@google.com
- */
-
 DECLARE_PER_CPU(unsigned long, cyc2ns);
 DECLARE_PER_CPU(unsigned long long, cyc2ns_offset);
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
-
-static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
-{
-	unsigned long long ns = this_cpu_read(cyc2ns_offset);
-	ns += mul_u64_u32_shr(cyc, this_cpu_read(cyc2ns), CYC2NS_SCALE_FACTOR);
-	return ns;
-}
-
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	unsigned long long ns;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	ns = __cycles_2_ns(cyc);
-	local_irq_restore(flags);
-
-	return ns;
-}
-
 #endif /* _ASM_X86_TIMER_H */
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,6 +38,66 @@ static int __read_mostly tsc_unstable;
 static int __read_mostly tsc_disabled = -1;
 
 int tsc_clocksource_reliable;
+
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ *              ns = cycles / (freq / ns_per_sec)
+ *              ns = cycles * (ns_per_sec / freq)
+ *              ns = cycles * (10^9 / (cpu_khz * 10^3))
+ *              ns = cycles * (10^6 / cpu_khz)
+ *
+ *      Then we use scaling math (suggested by george@mvista.com) to get:
+ *              ns = cycles * (10^6 * SC / cpu_khz) / SC
+ *              ns = cycles * cyc2ns_scale / SC
+ *
+ *      And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  We can use khz divisor instead of mhz to keep a better precision, since
+ *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ *  (mathieu.desnoyers@polymtl.ca)
+ *
+ *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+
+DEFINE_PER_CPU(unsigned long, cyc2ns);
+DEFINE_PER_CPU(unsigned long long, cyc2ns_offset);
+
+#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	unsigned long long ns = this_cpu_read(cyc2ns_offset);
+	ns += mul_u64_u32_shr(cyc, this_cpu_read(cyc2ns), CYC2NS_SCALE_FACTOR);
+	return ns;
+}
+
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
+{
+	unsigned long long tsc_now, ns_now, *offset;
+	unsigned long flags, *scale;
+
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+	offset = &per_cpu(cyc2ns_offset, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = cycles_2_ns(tsc_now);
+
+	if (cpu_khz) {
+		*scale = ((NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) +
+				cpu_khz / 2) / cpu_khz;
+		*offset = ns_now - mult_frac(tsc_now, *scale,
+					     (1UL << CYC2NS_SCALE_FACTOR));
+	}
+
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
+}
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
@@ -62,7 +122,7 @@ u64 native_sched_clock(void)
 	rdtscll(this_offset);
 
 	/* return the value in ns */
-	return __cycles_2_ns(this_offset);
+	return cycles_2_ns(this_offset);
 }
 
 /* We need to define a real function for sched_clock, to override the
@@ -589,56 +649,6 @@ int recalibrate_cpu_khz(void)
 EXPORT_SYMBOL(recalibrate_cpu_khz);
 
 
-/* Accelerators for sched_clock()
- * convert from cycles(64bits) => nanoseconds (64bits)
- *  basic equation:
- *              ns = cycles / (freq / ns_per_sec)
- *              ns = cycles * (ns_per_sec / freq)
- *              ns = cycles * (10^9 / (cpu_khz * 10^3))
- *              ns = cycles * (10^6 / cpu_khz)
- *
- *      Then we use scaling math (suggested by george@mvista.com) to get:
- *              ns = cycles * (10^6 * SC / cpu_khz) / SC
- *              ns = cycles * cyc2ns_scale / SC
- *
- *      And since SC is a constant power of two, we can convert the div
- *  into a shift.
- *
- *  We can use khz divisor instead of mhz to keep a better precision, since
- *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
- *  (mathieu.desnoyers@polymtl.ca)
- *
- *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
- */
-
-DEFINE_PER_CPU(unsigned long, cyc2ns);
-DEFINE_PER_CPU(unsigned long long, cyc2ns_offset);
-
-static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
-{
-	unsigned long long tsc_now, ns_now, *offset;
-	unsigned long flags, *scale;
-
-	local_irq_save(flags);
-	sched_clock_idle_sleep_event();
-
-	scale = &per_cpu(cyc2ns, cpu);
-	offset = &per_cpu(cyc2ns_offset, cpu);
-
-	rdtscll(tsc_now);
-	ns_now = __cycles_2_ns(tsc_now);
-
-	if (cpu_khz) {
-		*scale = ((NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) +
-				cpu_khz / 2) / cpu_khz;
-		*offset = ns_now - mult_frac(tsc_now, *scale,
-					     (1UL << CYC2NS_SCALE_FACTOR));
-	}
-
-	sched_clock_idle_wakeup_event(0);
-	local_irq_restore(flags);
-}
-
 static unsigned long long cyc2ns_suspend;
 
 void tsc_save_sched_clock_state(void)




* [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (9 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 10/15] x86: Move some code around Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2014-06-16 17:13   ` Viresh Kumar
  2013-12-12 14:08 ` [PATCH 12/15] sched: Remove local_irq_disable() from the clocks Peter Zijlstra
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peterz-tsc-latch.patch --]
[-- Type: text/plain, Size: 11282 bytes --]

Use a ring-buffer-like multi-version object structure which allows
always having a coherent object; we use this to avoid having to
disable IRQs while reading sched_clock() and to avoid problems when
getting an NMI while changing the cyc2ns data.
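
For a reader the new API pairs up like this (a minimal usage sketch;
cycles_2_ns() below open-codes the same protocol to shave a few
instructions, and arch_perf_update_userpage() uses it as shown in the
patch):

  struct cyc2ns_data *data;
  u64 ns;

  data = cyc2ns_read_begin();	/* pins the current {mul, shift, offset} */
  ns = data->cyc2ns_offset +
       mul_u64_u32_shr(cyc, data->cyc2ns_mul, data->cyc2ns_shift);
  cyc2ns_read_end(data);	/* lets a pending writer reuse the slot */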

                        MAINLINE   PRE        POST

    sched_clock_stable: 1          1          1
    (cold) sched_clock: 329841     331312     257223
    (cold) local_clock: 301773     310296     309889
    (warm) sched_clock: 38375      38247      25280
    (warm) local_clock: 100371     102713     85268
    (warm) rdtsc:       27340      27289      24247
    sched_clock_stable: 0          0          0
    (cold) sched_clock: 382634     372706     301224
    (cold) local_clock: 396890     399275     399870
    (warm) sched_clock: 38194      38124      25630
    (warm) local_clock: 143452     148698     129629
    (warm) rdtsc:       27345      27365      24307

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/timer.h     |   23 +++
 arch/x86/kernel/cpu/perf_event.c |   14 +-
 arch/x86/kernel/tsc.c            |  229 ++++++++++++++++++++++++++++++++++-----
 3 files changed, 236 insertions(+), 30 deletions(-)

--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -13,7 +13,26 @@ extern int recalibrate_cpu_khz(void);
 
 extern int no_timer_check;
 
-DECLARE_PER_CPU(unsigned long, cyc2ns);
-DECLARE_PER_CPU(unsigned long long, cyc2ns_offset);
+/*
+ * We use the full linear equation: f(x) = a + b*x, in order to allow
+ * a continuous function in the face of dynamic freq changes.
+ *
+ * Continuity means that when our frequency changes our slope (b); we want to
+ * ensure that: f(t) == f'(t), which gives: a + b*t == a' + b'*t.
+ *
+ * Without an offset (a) the above would not be possible.
+ *
+ * See the comment near cycles_2_ns() for details on how we compute (b).
+ */
+struct cyc2ns_data {
+	u32 cyc2ns_mul;
+	u32 cyc2ns_shift;
+	u64 cyc2ns_offset;
+	u32 __count;
+	/* u32 hole */
+}; /* 24 bytes -- do not grow */
+
+extern struct cyc2ns_data *cyc2ns_read_begin(void);
+extern void cyc2ns_read_end(struct cyc2ns_data *);
 
 #endif /* _ASM_X86_TIMER_H */
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1883,6 +1883,8 @@ static struct pmu pmu = {
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
 {
+	struct cyc2ns_data *data;
+
 	userpg->cap_user_time = 0;
 	userpg->cap_user_time_zero = 0;
 	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
@@ -1891,13 +1893,17 @@ void arch_perf_update_userpage(struct pe
 	if (!sched_clock_stable)
 		return;
 
+	data = cyc2ns_read_begin();
+
 	userpg->cap_user_time = 1;
-	userpg->time_mult = this_cpu_read(cyc2ns);
-	userpg->time_shift = CYC2NS_SCALE_FACTOR;
-	userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
+	userpg->time_mult = data->cyc2ns_mul;
+	userpg->time_shift = data->cyc2ns_shift;
+	userpg->time_offset = data->cyc2ns_offset - now;
 
 	userpg->cap_user_time_zero = 1;
-	userpg->time_zero = this_cpu_read(cyc2ns_offset);
+	userpg->time_zero = data->cyc2ns_offset;
+
+	cyc2ns_read_end(data);
 }
 
 /*
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -39,7 +39,119 @@ static int __read_mostly tsc_disabled =
 
 int tsc_clocksource_reliable;
 
-/* Accelerators for sched_clock()
+/*
+ * Use a ring-buffer like data structure, where a writer advances the head by
+ * writing a new data entry and a reader advances the tail when it observes a
+ * new entry.
+ *
+ * Writers are made to wait on readers until there's space to write a new
+ * entry.
+ *
+ * This means that we can always use an {offset, mul} pair to compute a ns
+ * value that is 'roughly' in the right direction, even if we're writing a new
+ * {offset, mul} pair during the clock read.
+ *
+ * The down-side is that we can no longer guarantee strict monotonicity anymore
+ * (assuming the TSC was that to begin with), because while we compute the
+ * intersection point of the two clock slopes and make sure the time is
+ * continuous at the point of switching; we can no longer guarantee a reader is
+ * strictly before or after the switch point.
+ *
+ * It does mean a reader no longer needs to disable IRQs in order to avoid
+ * CPU-Freq updates messing with his times, and similarly an NMI reader will
+ * no longer run the risk of hitting half-written state.
+ */
+
+struct cyc2ns {
+	struct cyc2ns_data data[2];	/*  0 + 2*24 = 48 */
+	struct cyc2ns_data *head;	/* 48 + 8    = 56 */
+	struct cyc2ns_data *tail;	/* 56 + 8    = 64 */
+}; /* exactly fits one cacheline */
+
+static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);
+
+struct cyc2ns_data *cyc2ns_read_begin(void)
+{
+	struct cyc2ns_data *head;
+
+	preempt_disable();
+
+	head = this_cpu_read(cyc2ns.head);
+	/*
+	 * Ensure we observe the entry when we observe the pointer to it.
+	 * matches the wmb from cyc2ns_write_end().
+	 */
+	smp_read_barrier_depends();
+	head->__count++;
+	barrier();
+
+	return head;
+}
+
+void cyc2ns_read_end(struct cyc2ns_data *head)
+{
+	barrier();
+	/*
+	 * If we're the outer most nested read; update the tail pointer
+	 * when we're done. This notifies possible pending writers
+	 * that we've observed the head pointer and that the other
+	 * entry is now free.
+	 */
+	if (!--head->__count) {
+		/*
+		 * x86-TSO does not reorder writes with older reads;
+		 * therefore once this write becomes visible to another
+		 * cpu, we must be finished reading the cyc2ns_data.
+		 *
+		 * matches with cyc2ns_write_begin().
+		 */
+		this_cpu_write(cyc2ns.tail, head);
+	}
+	preempt_enable();
+}
+
+/*
+ * Begin writing a new @data entry for @cpu.
+ *
+ * Assumes some sort of write side lock; currently 'provided' by the assumption
+ * that cpufreq will call its notifiers sequentially.
+ */
+static struct cyc2ns_data *cyc2ns_write_begin(int cpu)
+{
+	struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
+	struct cyc2ns_data *data = c2n->data;
+
+	if (data == c2n->head)
+		data++;
+
+	/* XXX send an IPI to @cpu in order to guarantee a read? */
+
+	/*
+	 * When we observe the tail write from cyc2ns_read_end(),
+	 * the cpu must be done with that entry and its safe
+	 * to start writing to it.
+	 */
+	while (c2n->tail == data)
+		cpu_relax();
+
+	return data;
+}
+
+static void cyc2ns_write_end(int cpu, struct cyc2ns_data *data)
+{
+	struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
+
+	/*
+	 * Ensure the @data writes are visible before we publish the
+	 * entry. Matches the data-depencency in cyc2ns_read_begin().
+	 */
+	smp_wmb();
+
+	ACCESS_ONCE(c2n->head) = data;
+}
+
+/*
+ * Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *              ns = cycles / (freq / ns_per_sec)
@@ -61,49 +173,106 @@ int tsc_clocksource_reliable;
  *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
 
-DEFINE_PER_CPU(unsigned long, cyc2ns);
-DEFINE_PER_CPU(unsigned long long, cyc2ns_offset);
-
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
+static void cyc2ns_data_init(struct cyc2ns_data *data)
+{
+	data->cyc2ns_mul = 1U << CYC2NS_SCALE_FACTOR;
+	data->cyc2ns_shift = CYC2NS_SCALE_FACTOR;
+	data->cyc2ns_offset = 0;
+	data->__count = 0;
+}
+
+static void cyc2ns_init(int cpu)
+{
+	struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
+
+	cyc2ns_data_init(&c2n->data[0]);
+	cyc2ns_data_init(&c2n->data[1]);
+
+	c2n->head = c2n->data;
+	c2n->tail = c2n->data;
+}
+
 static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
-	unsigned long long ns = this_cpu_read(cyc2ns_offset);
-	ns += mul_u64_u32_shr(cyc, this_cpu_read(cyc2ns), CYC2NS_SCALE_FACTOR);
+	struct cyc2ns_data *data, *tail;
+	unsigned long long ns;
+
+	/*
+	 * See cyc2ns_read_*() for details; replicated in order to avoid
+	 * an extra few instructions that came with the abstraction.
+	 * Notably, it allows us to only do the __count and tail update
+	 * dance when it's actually needed.
+	 */
+
+	preempt_disable();
+	data = this_cpu_read(cyc2ns.head);
+	tail = this_cpu_read(cyc2ns.tail);
+
+	if (likely(data == tail)) {
+		ns = data->cyc2ns_offset;
+		ns += mul_u64_u32_shr(cyc, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
+	} else {
+		data->__count++;
+
+		barrier();
+
+		ns = data->cyc2ns_offset;
+		ns += mul_u64_u32_shr(cyc, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
+
+		barrier();
+
+		if (!--data->__count)
+			this_cpu_write(cyc2ns.tail, data);
+	}
+	preempt_enable();
+
 	return ns;
 }
 
+/* XXX surely we already have this someplace in the kernel?! */
+#define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))
+
 static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	unsigned long long tsc_now, ns_now, *offset;
-	unsigned long flags, *scale;
+	unsigned long long tsc_now, ns_now;
+	struct cyc2ns_data *data;
+	unsigned long flags;
 
 	local_irq_save(flags);
 	sched_clock_idle_sleep_event();
 
-	scale = &per_cpu(cyc2ns, cpu);
-	offset = &per_cpu(cyc2ns_offset, cpu);
+	if (!cpu_khz)
+		goto done;
+
+	data = cyc2ns_write_begin(cpu);
 
 	rdtscll(tsc_now);
 	ns_now = cycles_2_ns(tsc_now);
 
-	if (cpu_khz) {
-		*scale = ((NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) +
-				cpu_khz / 2) / cpu_khz;
-		*offset = ns_now - mult_frac(tsc_now, *scale,
-					     (1UL << CYC2NS_SCALE_FACTOR));
-	}
+	/*
+	 * Compute a new multiplier as per the above comment and ensure our
+	 * time function is continuous; see the comment near struct
+	 * cyc2ns_data.
+	 */
+	data->cyc2ns_mul = DIV_ROUND(NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR, cpu_khz);
+	data->cyc2ns_shift = CYC2NS_SCALE_FACTOR;
+	data->cyc2ns_offset = ns_now -
+		mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
+
+	cyc2ns_write_end(cpu, data);
 
+done:
 	sched_clock_idle_wakeup_event(0);
 	local_irq_restore(flags);
 }
-
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
 u64 native_sched_clock(void)
 {
-	u64 this_offset;
+	u64 tsc_now;
 
 	/*
 	 * Fall back to jiffies if there's no TSC available:
@@ -119,10 +288,10 @@ u64 native_sched_clock(void)
 	}
 
 	/* read the Time Stamp Counter: */
-	rdtscll(this_offset);
+	rdtscll(tsc_now);
 
 	/* return the value in ns */
-	return cycles_2_ns(this_offset);
+	return cycles_2_ns(tsc_now);
 }
 
 /* We need to define a real function for sched_clock, to override the
@@ -678,11 +847,21 @@ void tsc_restore_sched_clock_state(void)
 
 	local_irq_save(flags);
 
-	__this_cpu_write(cyc2ns_offset, 0);
+	/*
+	 * We're coming out of suspend, there's no concurrency yet; don't
+	 * bother being nice about the RCU stuff, just write to both
+	 * data fields.
+	 */
+
+	this_cpu_write(cyc2ns.data[0].cyc2ns_offset, 0);
+	this_cpu_write(cyc2ns.data[1].cyc2ns_offset, 0);
+
 	offset = cyc2ns_suspend - sched_clock();
 
-	for_each_possible_cpu(cpu)
-		per_cpu(cyc2ns_offset, cpu) = offset;
+	for_each_possible_cpu(cpu) {
+		per_cpu(cyc2ns.data[0].cyc2ns_offset, cpu) = offset;
+		per_cpu(cyc2ns.data[1].cyc2ns_offset, cpu) = offset;
+	}
 
 	local_irq_restore(flags);
 }
@@ -1005,8 +1184,10 @@ void __init tsc_init(void)
 	 * speed as the bootup CPU. (cpufreq notifiers will fix this
 	 * up if their speed diverges)
 	 */
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
+		cyc2ns_init(cpu);
 		set_cyc2ns_scale(cpu_khz, cpu);
+	}
 
 	if (tsc_disabled > 0)
 		return;
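
To make the two-slot protocol described above concrete, the following is a
minimal user-space sketch -- not part of the patch -- that models
cyc2ns_read_begin()/cyc2ns_read_end() and cyc2ns_write_begin()/
cyc2ns_write_end() with C11 atomics standing in for the kernel's per-cpu data,
barrier() and smp_wmb(). The c2n_*() names are hypothetical, and the model
assumes a single reader context (in the kernel that is what per-cpu data plus
preempt_disable() provide).

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct c2n_data {
	uint32_t mul;
	uint64_t offset;
	unsigned int count;			/* nesting count of active readers */
};

struct c2n {
	struct c2n_data data[2];		/* the two slots */
	_Atomic(struct c2n_data *) head;	/* last published entry */
	_Atomic(struct c2n_data *) tail;	/* last entry the readers have left */
};

static struct c2n_data *c2n_read_begin(struct c2n *c)
{
	/* acquire pairs with the release in c2n_write_end() */
	struct c2n_data *head = atomic_load_explicit(&c->head, memory_order_acquire);

	head->count++;
	return head;
}

static void c2n_read_end(struct c2n *c, struct c2n_data *head)
{
	/* the outermost reader publishes the tail, freeing the other slot */
	if (!--head->count)
		atomic_store_explicit(&c->tail, head, memory_order_release);
}

static struct c2n_data *c2n_write_begin(struct c2n *c)
{
	struct c2n_data *data = &c->data[0];

	if (data == atomic_load_explicit(&c->head, memory_order_relaxed))
		data = &c->data[1];

	/* wait until no reader still holds the candidate slot */
	while (atomic_load_explicit(&c->tail, memory_order_acquire) == data)
		;

	return data;
}

static void c2n_write_end(struct c2n *c, struct c2n_data *data)
{
	/* release makes the slot contents visible before the head update */
	atomic_store_explicit(&c->head, data, memory_order_release);
}

int main(void)
{
	struct c2n c = { .data = { { .mul = 1 }, { .mul = 1 } } };
	struct c2n_data *w, *r;

	atomic_init(&c.head, &c.data[0]);
	atomic_init(&c.tail, &c.data[0]);

	w = c2n_write_begin(&c);	/* picks the unused slot, data[1] */
	w->mul = 3;
	w->offset = 100;
	c2n_write_end(&c, w);

	r = c2n_read_begin(&c);		/* sees the newly published slot */
	printf("ns = %llu\n", (unsigned long long)(r->offset + 42ULL * r->mul));
	c2n_read_end(&c, r);

	return 0;
}

A reader that races with the writer simply keeps using the old slot until the
new head is published, which is why a clock read during a frequency change is
only 'roughly' right but never half-written.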



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 12/15] sched: Remove local_irq_disable() from the clocks
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (10 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Peter Zijlstra
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peterz-sched_clock_irq.patch --]
[-- Type: text/plain, Size: 3165 bytes --]

Now that x86 no longer requires IRQs disabled for sched_clock() and
ia64 never had this requirement (it doesn't seem to do cpufreq at
all), we can remove the requirement of disabling IRQs.

                        MAINLINE   PRE        POST

    sched_clock_stable: 1          1          1
    (cold) sched_clock: 329841     257223     221876
    (cold) local_clock: 301773     309889     234692
    (warm) sched_clock: 38375      25280      25602
    (warm) local_clock: 100371     85268      33265
    (warm) rdtsc:       27340      24247      24214
    sched_clock_stable: 0          0          0
    (cold) sched_clock: 382634     301224     235941
    (cold) local_clock: 396890     399870     297017
    (warm) sched_clock: 38194      25630      25233
    (warm) local_clock: 143452     129629     71234
    (warm) rdtsc:       27345      24307      24245

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/clock.c |   34 ++++++----------------------------
 1 file changed, 6 insertions(+), 28 deletions(-)

--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -26,9 +26,10 @@
  * at 0 on boot (but people really shouldn't rely on that).
  *
  * cpu_clock(i)       -- can be used from any context, including NMI.
- * sched_clock_cpu(i) -- must be used with local IRQs disabled (implied by NMI)
  * local_clock()      -- is cpu_clock() on the current cpu.
  *
+ * sched_clock_cpu(i)
+ *
  * How:
  *
  * The implementation either uses sched_clock() when
@@ -50,15 +51,6 @@
  * Furthermore, explicit sleep and wakeup hooks allow us to account for time
  * that is otherwise invisible (TSC gets stopped).
  *
- *
- * Notes:
- *
- * The !IRQ-safetly of sched_clock() and sched_clock_cpu() comes from things
- * like cpufreq interrupts that can change the base clock (TSC) multiplier
- * and cause funny jumps in time -- although the filtering provided by
- * sched_clock_cpu() should mitigate serious artifacts we cannot rely on it
- * in general since for !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK we fully rely on
- * sched_clock().
  */
 #include <linux/spinlock.h>
 #include <linux/hardirq.h>
@@ -242,20 +234,20 @@ u64 sched_clock_cpu(int cpu)
 	struct sched_clock_data *scd;
 	u64 clock;
 
-	WARN_ON_ONCE(!irqs_disabled());
-
 	if (sched_clock_stable)
 		return sched_clock();
 
 	if (unlikely(!sched_clock_running))
 		return 0ull;
 
+	preempt_disable();
 	scd = cpu_sdc(cpu);
 
 	if (cpu != smp_processor_id())
 		clock = sched_clock_remote(scd);
 	else
 		clock = sched_clock_local(scd);
+	preempt_enable();
 
 	return clock;
 }
@@ -316,14 +308,7 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeu
  */
 u64 cpu_clock(int cpu)
 {
-	u64 clock;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	clock = sched_clock_cpu(cpu);
-	local_irq_restore(flags);
-
-	return clock;
+	return sched_clock_cpu(cpu);
 }
 
 /*
@@ -335,14 +320,7 @@ u64 cpu_clock(int cpu)
  */
 u64 local_clock(void)
 {
-	u64 clock;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	clock = sched_clock_cpu(smp_processor_id());
-	local_irq_restore(flags);
-
-	return clock;
+	return sched_clock_cpu(raw_smp_processor_id());
 }
 
 #else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (11 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 12/15] sched: Remove local_irq_disable() from the clocks Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2014-01-21 22:28   ` Sasha Levin
  2013-12-12 14:08 ` [PATCH 14/15] sched, clock: Fixup clear_sched_clock_stable() Peter Zijlstra
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peterz-sched_clock_static_key.patch --]
[-- Type: text/plain, Size: 6919 bytes --]

In order to avoid the runtime condition and variable load turn
sched_clock_stable into a static_key.

Also provide a shorter implementation of local_clock() and
cpu_clock(int) when sched_clock_stable==1.

                        MAINLINE   PRE       POST

    sched_clock_stable: 1          1         1
    (cold) sched_clock: 329841     221876    215295
    (cold) local_clock: 301773     234692    220773
    (warm) sched_clock: 38375      25602     25659
    (warm) local_clock: 100371     33265     27242
    (warm) rdtsc:       27340      24214     24208
    sched_clock_stable: 0          0         0
    (cold) sched_clock: 382634     235941    237019
    (cold) local_clock: 396890     297017    294819
    (warm) sched_clock: 38194      25233     25609
    (warm) local_clock: 143452     71234     71232
    (warm) rdtsc:       27345      24245     24243

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kernel/cpu/amd.c        |    2 -
 arch/x86/kernel/cpu/intel.c      |    2 -
 arch/x86/kernel/cpu/perf_event.c |    4 +--
 arch/x86/kernel/tsc.c            |    6 ++---
 include/linux/sched.h            |    4 ++-
 kernel/sched/clock.c             |   41 ++++++++++++++++++++++++++++++++-------
 kernel/sched/debug.c             |    2 -
 kernel/time/tick-sched.c         |    2 -
 kernel/trace/ring_buffer.c       |    2 -
 9 files changed, 47 insertions(+), 18 deletions(-)

--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -487,7 +487,7 @@ static void early_init_amd(struct cpuinf
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
 		if (!check_tsc_unstable())
-			sched_clock_stable = 1;
+			set_sched_clock_stable();
 	}
 
 #ifdef CONFIG_X86_64
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -93,7 +93,7 @@ static void early_init_intel(struct cpui
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
 		if (!check_tsc_unstable())
-			sched_clock_stable = 1;
+			set_sched_clock_stable();
 	}
 
 	/* Penwell and Cloverview have the TSC which doesn't sleep on S3 */
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1890,7 +1890,7 @@ void arch_perf_update_userpage(struct pe
 	userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
 	userpg->pmc_width = x86_pmu.cntval_bits;
 
-	if (!sched_clock_stable)
+	if (!sched_clock_stable())
 		return;
 
 	data = cyc2ns_read_begin();
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -753,7 +753,7 @@ static unsigned long long cyc2ns_suspend
 
 void tsc_save_sched_clock_state(void)
 {
-	if (!sched_clock_stable)
+	if (!sched_clock_stable())
 		return;
 
 	cyc2ns_suspend = sched_clock();
@@ -773,7 +773,7 @@ void tsc_restore_sched_clock_state(void)
 	unsigned long flags;
 	int cpu;
 
-	if (!sched_clock_stable)
+	if (!sched_clock_stable())
 		return;
 
 	local_irq_save(flags);
@@ -920,7 +920,7 @@ void mark_tsc_unstable(char *reason)
 {
 	if (!tsc_unstable) {
 		tsc_unstable = 1;
-		sched_clock_stable = 0;
+		clear_sched_clock_stable();
 		disable_sched_clock_irqtime();
 		pr_info("Marking TSC unstable due to %s\n", reason);
 		/* Change only the rating, when not registered */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1879,7 +1879,9 @@ static inline void sched_clock_idle_wake
  * but then during bootup it turns out that sched_clock()
  * is reliable after all:
  */
-extern int sched_clock_stable;
+extern int sched_clock_stable(void);
+extern void set_sched_clock_stable(void);
+extern void clear_sched_clock_stable(void);
 
 extern void sched_clock_tick(void);
 extern void sched_clock_idle_sleep_event(void);
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -58,6 +58,7 @@
 #include <linux/percpu.h>
 #include <linux/ktime.h>
 #include <linux/sched.h>
+#include <linux/static_key.h>
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -74,7 +75,27 @@ EXPORT_SYMBOL_GPL(sched_clock);
 __read_mostly int sched_clock_running;
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-__read_mostly int sched_clock_stable;
+static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+
+int sched_clock_stable(void)
+{
+	if (static_key_false(&__sched_clock_stable))
+		return false;
+	return true;
+}
+
+void set_sched_clock_stable(void)
+{
+	if (!sched_clock_stable())
+		static_key_slow_dec(&__sched_clock_stable);
+}
+
+void clear_sched_clock_stable(void)
+{
+	/* XXX worry about clock continuity */
+	if (sched_clock_stable())
+		static_key_slow_inc(&__sched_clock_stable);
+}
 
 struct sched_clock_data {
 	u64			tick_raw;
@@ -234,7 +255,7 @@ u64 sched_clock_cpu(int cpu)
 	struct sched_clock_data *scd;
 	u64 clock;
 
-	if (sched_clock_stable)
+	if (sched_clock_stable())
 		return sched_clock();
 
 	if (unlikely(!sched_clock_running))
@@ -257,7 +278,7 @@ void sched_clock_tick(void)
 	struct sched_clock_data *scd;
 	u64 now, now_gtod;
 
-	if (sched_clock_stable)
+	if (sched_clock_stable())
 		return;
 
 	if (unlikely(!sched_clock_running))
@@ -308,7 +329,10 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeu
  */
 u64 cpu_clock(int cpu)
 {
-	return sched_clock_cpu(cpu);
+	if (static_key_false(&__sched_clock_stable))
+		return sched_clock_cpu(cpu);
+
+	return sched_clock();
 }
 
 /*
@@ -320,7 +344,10 @@ u64 cpu_clock(int cpu)
  */
 u64 local_clock(void)
 {
-	return sched_clock_cpu(raw_smp_processor_id());
+	if (static_key_false(&__sched_clock_stable))
+		return sched_clock_cpu(raw_smp_processor_id());
+
+	return sched_clock();
 }
 
 #else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
@@ -340,12 +367,12 @@ u64 sched_clock_cpu(int cpu)
 
 u64 cpu_clock(int cpu)
 {
-	return sched_clock_cpu(cpu);
+	return sched_clock();
 }
 
 u64 local_clock(void)
 {
-	return sched_clock_cpu(0);
+	return sched_clock();
 }
 
 #endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -371,7 +371,7 @@ static void sched_debug_header(struct se
 	PN(cpu_clk);
 	P(jiffies);
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-	P(sched_clock_stable);
+	P(sched_clock_stable());
 #endif
 #undef PN
 #undef P
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -177,7 +177,7 @@ static bool can_stop_full_tick(void)
 	 * TODO: kick full dynticks CPUs when
 	 * sched_clock_stable is set.
 	 */
-	if (!sched_clock_stable) {
+	if (!sched_clock_stable()) {
 		trace_tick_stop(0, "unstable sched clock\n");
 		/*
 		 * Don't allow the user to think they can get
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2558,7 +2558,7 @@ rb_reserve_next_event(struct ring_buffer
 		if (unlikely(test_time_stamp(delta))) {
 			int local_clock_stable = 1;
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-			local_clock_stable = sched_clock_stable;
+			local_clock_stable = sched_clock_stable();
 #endif
 			WARN_ONCE(delta > (1ULL << 59),
 				  KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",
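
The static_key polarity above is easy to misread (it comes up again later in
this thread), so here is a small stand-alone model -- an editorial sketch, not
kernel code -- of what the patch evaluates to, assuming the usual jump-label
semantics: STATIC_KEY_INIT starts out disabled, and static_key_false() returns
true only once the key has been enabled (while hinting that this branch is
normally not taken). With the inverted test, the key effectively counts "clock
unstable", which is why clear_sched_clock_stable() uses static_key_slow_inc()
and set_sched_clock_stable() uses static_key_slow_dec():

#include <assert.h>
#include <stdbool.h>

static int __sched_clock_stable_key;	/* models the static_key enable count */

static bool key_false(const int *key)	/* models static_key_false() */
{
	return *key > 0;		/* true once the key has been enabled */
}

static bool model_sched_clock_stable(void)
{
	if (key_false(&__sched_clock_stable_key))
		return false;
	return true;
}

int main(void)
{
	/* boot default: key disabled => clock reported stable */
	assert(model_sched_clock_stable());

	/* clear_sched_clock_stable(): static_key_slow_inc() */
	__sched_clock_stable_key++;
	assert(!model_sched_clock_stable());

	/* set_sched_clock_stable(): static_key_slow_dec() */
	__sched_clock_stable_key--;
	assert(model_sched_clock_stable());

	return 0;
}

The hot path thus takes the out-of-line branch only in the unstable case;
whether the flag ends up with the right value during early boot is a separate
question, discussed further down the thread.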



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 14/15] sched, clock: Fixup clear_sched_clock_stable()
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (12 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-12 14:08 ` [PATCH 15/15] x86: Avoid a runtime condition in native_sched_clock() Peter Zijlstra
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peterz-sched_clock_static_key-fixup.patch --]
[-- Type: text/plain, Size: 8195 bytes --]

The below tells us the static_key conversion has a problem; since the
exact point of clearing that flag isn't too important, delay the flip
and use a workqueue to process it.

[ ] TSC synchronization [CPU#0 -> CPU#22]:
[ ] Measured 8 cycles TSC warp between CPUs, turning off TSC clock.
[ ] 
[ ] ======================================================
[ ] [ INFO: possible circular locking dependency detected ]
[ ] 3.13.0-rc3-01745-g848b0d0322cb-dirty #637 Not tainted
[ ] -------------------------------------------------------
[ ] swapper/0/1 is trying to acquire lock:
[ ]  (jump_label_mutex){+.+...}, at: [<ffffffff8115a637>] jump_label_lock+0x17/0x20
[ ] 
[ ] but task is already holding lock:
[ ]  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60
[ ] 
[ ] which lock already depends on the new lock.
[ ] 
[ ] 
[ ] the existing dependency chain (in reverse order) is:
[ ] 
[ ] -> #1 (cpu_hotplug.lock){+.+.+.}:
[ ]        [<ffffffff810def00>] lock_acquire+0x90/0x130
[ ]        [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0
[ ]        [<ffffffff81093fdc>] get_online_cpus+0x3c/0x60
[ ]        [<ffffffff8104cc67>] arch_jump_label_transform+0x37/0x130
[ ]        [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80
[ ]        [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0
[ ]        [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0
[ ]        [<ffffffff810c0f65>] sched_feat_set+0xf5/0x100
[ ]        [<ffffffff810c5bdc>] set_numabalancing_state+0x2c/0x30
[ ]        [<ffffffff81d12f3d>] numa_policy_init+0x1af/0x1b7
[ ]        [<ffffffff81cebdf4>] start_kernel+0x35d/0x41f
[ ]        [<ffffffff81ceb5a5>] x86_64_start_reservations+0x2a/0x2c
[ ]        [<ffffffff81ceb6a2>] x86_64_start_kernel+0xfb/0xfe
[ ] 
[ ] -> #0 (jump_label_mutex){+.+...}:
[ ]        [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0
[ ]        [<ffffffff810def00>] lock_acquire+0x90/0x130
[ ]        [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0
[ ]        [<ffffffff8115a637>] jump_label_lock+0x17/0x20
[ ]        [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0
[ ]        [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20
[ ]        [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70
[ ]        [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150
[ ]        [<ffffffff81076612>] native_cpu_up+0x3a2/0x890
[ ]        [<ffffffff810941cb>] _cpu_up+0xdb/0x160
[ ]        [<ffffffff810942c9>] cpu_up+0x79/0x90
[ ]        [<ffffffff81d0af6b>] smp_init+0x60/0x8c
[ ]        [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197
[ ]        [<ffffffff8164e32e>] kernel_init+0xe/0x130
[ ]        [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0
[ ] 
[ ] other info that might help us debug this:
[ ] 
[ ]  Possible unsafe locking scenario:
[ ] 
[ ]        CPU0                    CPU1
[ ]        ----                    ----
[ ]   lock(cpu_hotplug.lock);
[ ]                                lock(jump_label_mutex);
[ ]                                lock(cpu_hotplug.lock);
[ ]   lock(jump_label_mutex);
[ ] 
[ ]  *** DEADLOCK ***
[ ] 
[ ] 2 locks held by swapper/0/1:
[ ]  #0:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81094037>] cpu_maps_update_begin+0x17/0x20
[ ]  #1:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60
[ ] 
[ ] stack backtrace:
[ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637
[ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
[ ]  ffffffff82c9c270 ffff880236843bb8 ffffffff8165c5f5 ffffffff82c9c270
[ ]  ffff880236843bf8 ffffffff81658c02 ffff880236843c80 ffff8802368586a0
[ ]  ffff880236858678 0000000000000001 0000000000000002 ffff880236858000
[ ] Call Trace:
[ ]  [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a
[ ]  [<ffffffff81658c02>] print_circular_bug+0x1f9/0x207
[ ]  [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0
[ ]  [<ffffffff816680ff>] ? __atomic_notifier_call_chain+0x8f/0xb0
[ ]  [<ffffffff810def00>] lock_acquire+0x90/0x130
[ ]  [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20
[ ]  [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20
[ ]  [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0
[ ]  [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20
[ ]  [<ffffffff8115a637>] jump_label_lock+0x17/0x20
[ ]  [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0
[ ]  [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20
[ ]  [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70
[ ]  [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150
[ ]  [<ffffffff81076612>] native_cpu_up+0x3a2/0x890
[ ]  [<ffffffff810941cb>] _cpu_up+0xdb/0x160
[ ]  [<ffffffff810942c9>] cpu_up+0x79/0x90
[ ]  [<ffffffff81d0af6b>] smp_init+0x60/0x8c
[ ]  [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197
[ ]  [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
[ ]  [<ffffffff8164e32e>] kernel_init+0xe/0x130
[ ]  [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0
[ ]  [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
[ ] ------------[ cut here ]------------
[ ] WARNING: CPU: 0 PID: 1 at /usr/src/linux-2.6/kernel/smp.c:374 smp_call_function_many+0xad/0x300()
[ ] Modules linked in:
[ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637
[ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
[ ]  0000000000000009 ffff880236843be0 ffffffff8165c5f5 0000000000000000
[ ]  ffff880236843c18 ffffffff81093d8c 0000000000000000 0000000000000000
[ ]  ffffffff81ccd1a0 ffffffff810ca951 0000000000000000 ffff880236843c28
[ ] Call Trace:
[ ]  [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a
[ ]  [<ffffffff81093d8c>] warn_slowpath_common+0x8c/0xc0
[ ]  [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0
[ ]  [<ffffffff81093dda>] warn_slowpath_null+0x1a/0x20
[ ]  [<ffffffff8110b72d>] smp_call_function_many+0xad/0x300
[ ]  [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30
[ ]  [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30
[ ]  [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0
[ ]  [<ffffffff8110ba96>] smp_call_function+0x46/0x80
[ ]  [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30
[ ]  [<ffffffff8110bb3c>] on_each_cpu+0x3c/0xa0
[ ]  [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20
[ ]  [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0
[ ]  [<ffffffff8104f964>] text_poke_bp+0x64/0xd0
[ ]  [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20
[ ]  [<ffffffff8104ccde>] arch_jump_label_transform+0xae/0x130
[ ]  [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80
[ ]  [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0
[ ]  [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0
[ ]  [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20
[ ]  [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70
[ ]  [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150
[ ]  [<ffffffff81076612>] native_cpu_up+0x3a2/0x890
[ ]  [<ffffffff810941cb>] _cpu_up+0xdb/0x160
[ ]  [<ffffffff810942c9>] cpu_up+0x79/0x90
[ ]  [<ffffffff81d0af6b>] smp_init+0x60/0x8c
[ ]  [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197
[ ]  [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
[ ]  [<ffffffff8164e32e>] kernel_init+0xe/0x130
[ ]  [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0
[ ]  [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
[ ] ---[ end trace 6ff1df5620c49d26 ]---
[ ] tsc: Marking TSC unstable due to check_tsc_sync_source failed

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/clock.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -59,6 +59,7 @@
 #include <linux/ktime.h>
 #include <linux/sched.h>
 #include <linux/static_key.h>
+#include <linux/workqueue.h>
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -90,13 +91,20 @@ void set_sched_clock_stable(void)
 		static_key_slow_dec(&__sched_clock_stable);
 }
 
-void clear_sched_clock_stable(void)
+static void __clear_sched_clock_stable(struct work_struct *work)
 {
 	/* XXX worry about clock continuity */
 	if (sched_clock_stable())
 		static_key_slow_inc(&__sched_clock_stable);
 }
 
+static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
+
+void clear_sched_clock_stable(void)
+{
+	schedule_work(&sched_clock_work);
+}
+
 struct sched_clock_data {
 	u64			tick_raw;
 	u64			tick_gtod;
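
The pattern here -- punting the static_key flip to process context because
static_key_slow_inc() needs jump_label_mutex (and, per the lockdep report
above, the text patching path also takes the hotplug lock) while the caller is
itself deep inside CPU bringup -- is worth seeing in isolation. Below is a
rough user-space analogue, not kernel code: a plain mutex stands in for the
locks the caller must not take, a helper thread stands in for the workqueue,
and all names are hypothetical.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* stands in for jump_label_mutex: must not be taken from the bringup path */
static pthread_mutex_t slow_lock = PTHREAD_MUTEX_INITIALIZER;
static bool clock_unstable;

/* models __clear_sched_clock_stable(): runs later, in worker context */
static void *clear_stable_work(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&slow_lock);
	clock_unstable = true;
	pthread_mutex_unlock(&slow_lock);
	return NULL;
}

/* models clear_sched_clock_stable(): only queues the work, takes no locks */
static pthread_t queue_clear_stable(void)
{
	pthread_t worker;

	pthread_create(&worker, NULL, clear_stable_work, NULL);
	return worker;
}

int main(void)
{
	/*
	 * Caller context that, like mark_tsc_unstable() called from
	 * check_tsc_sync_source(), must not block on slow_lock itself:
	 * it just schedules the flip and carries on.
	 */
	pthread_t w = queue_clear_stable();

	pthread_join(w, NULL);
	printf("clock_unstable = %d\n", clock_unstable);
	return 0;
}

The trade-off, as the changelog notes, is that the flag flips a little later
than the event that triggered it, which is acceptable here because the exact
point of clearing it isn't important.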



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 15/15] x86: Avoid a runtime condition in native_sched_clock()
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (13 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 14/15] sched, clock: Fixup clear_sched_clock_stable() Peter Zijlstra
@ 2013-12-12 14:08 ` Peter Zijlstra
  2013-12-13  3:30 ` [PATCH 00/15] cleanups and optimizations Mike Galbraith
  2013-12-13 10:49 ` Eliezer Tamir
  16 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-12 14:08 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, Peter Zijlstra
  Cc: linux-kernel

[-- Attachment #1: peterz-tsc-static_key.patch --]
[-- Type: text/plain, Size: 2011 bytes --]

Use a static_key to avoid touching tsc_disabled and a runtime
condition in native_sched_clock() -- fewer cachelines touched is always
better.

                        MAINLINE   PRE       POST

    sched_clock_stable: 1          1         1
    (cold) sched_clock: 329841     215295    213039
    (cold) local_clock: 301773     220773    216084
    (warm) sched_clock: 38375      25659     25231
    (warm) local_clock: 100371     27242     27601
    (warm) rdtsc:       27340      24208     24203
    sched_clock_stable: 0          0         0
    (cold) sched_clock: 382634     237019    240055
    (cold) local_clock: 396890     294819    299942
    (warm) sched_clock: 38194      25609     25276
    (warm) local_clock: 143452     71232     73232
    (warm) rdtsc:       27345      24243     24244

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kernel/tsc.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -11,6 +11,7 @@
 #include <linux/clocksource.h>
 #include <linux/percpu.h>
 #include <linux/timex.h>
+#include <linux/static_key.h>
 
 #include <asm/hpet.h>
 #include <asm/timer.h>
@@ -37,6 +38,8 @@ static int __read_mostly tsc_unstable;
    erroneous rdtsc usage on !cpu_has_tsc processors */
 static int __read_mostly tsc_disabled = -1;
 
+static struct static_key __use_tsc = STATIC_KEY_INIT;
+
 int tsc_clocksource_reliable;
 
 /*
@@ -243,7 +246,7 @@ u64 native_sched_clock(void)
 	 *   very important for it to be as fast as the platform
 	 *   can achieve it. )
 	 */
-	if (unlikely(tsc_disabled)) {
+	if (!static_key_false(&__use_tsc)) {
 		/* No locking but a rare wrong value is not a big deal: */
 		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
 	}
@@ -1152,7 +1155,9 @@ void __init tsc_init(void)
 		return;
 
 	/* now allow native_sched_clock() to use rdtsc */
+
 	tsc_disabled = 0;
+	static_key_slow_inc(&__use_tsc);
 
 	if (!no_sched_irq_time)
 		enable_sched_clock_irqtime();



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/15] cleanups and optimizations
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (14 preceding siblings ...)
  2013-12-12 14:08 ` [PATCH 15/15] x86: Avoid a runtime condition in native_sched_clock() Peter Zijlstra
@ 2013-12-13  3:30 ` Mike Galbraith
  2013-12-13 10:49 ` Eliezer Tamir
  16 siblings, 0 replies; 67+ messages in thread
From: Mike Galbraith @ 2013-12-13  3:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Ingo Molnar, hpa, paulmck, Thomas Gleixner,
	John Stultz, Andy Lutomirski, linux-kernel

On Thu, 2013-12-12 at 15:08 +0100, Peter Zijlstra wrote: 
> This series contains the preempt_enable_no_resched() cleanups that include
> spin_lock_bh() optimizations and local_clock() optimizations.

Those clock optimizations look particularly yummy.

-Mike


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/15] cleanups and optimizations
  2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
                   ` (15 preceding siblings ...)
  2013-12-13  3:30 ` [PATCH 00/15] cleanups and optimizations Mike Galbraith
@ 2013-12-13 10:49 ` Eliezer Tamir
  2013-12-13 13:56   ` Peter Zijlstra
  16 siblings, 1 reply; 67+ messages in thread
From: Eliezer Tamir @ 2013-12-13 10:49 UTC (permalink / raw)
  To: Peter Zijlstra, Arjan van de Ven, lenb, rjw, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski
  Cc: linux-kernel

On 12/12/2013 16:08, Peter Zijlstra wrote:
> This series contains the preempt_enable_no_resched() cleanups that include
> spin_lock_bh() optimizations and local_clock() optimizations.

I'm trying to test this on tip/master.
Patch 3 fails to apply on kernel/softirq.c

Thanks,
Eliezer

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/15] cleanups and optimizations
  2013-12-13 10:49 ` Eliezer Tamir
@ 2013-12-13 13:56   ` Peter Zijlstra
  2013-12-16 17:48     ` Eliezer Tamir
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-13 13:56 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Arjan van de Ven, lenb, rjw, rui.zhang, jacob.jun.pan,
	Mike Galbraith, Ingo Molnar, hpa, paulmck, Thomas Gleixner,
	John Stultz, Andy Lutomirski, linux-kernel

On Fri, Dec 13, 2013 at 12:49:01PM +0200, Eliezer Tamir wrote:
> On 12/12/2013 16:08, Peter Zijlstra wrote:
> > This series contains the preempt_enable_no_resched() cleanups that include
> > spin_lock_bh() optimizations and local_clock() optimizations.
> 
> I'm trying to test this on tip/master.
> Patch 3 fails to apply on kernel/softirq.c

https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/log/?h=preempt_clock

Applied just fine here.. stuffed it in a git tree.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/15] cleanups and optimizations
  2013-12-13 13:56   ` Peter Zijlstra
@ 2013-12-16 17:48     ` Eliezer Tamir
  2013-12-17 13:32       ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Eliezer Tamir @ 2013-12-16 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, rui.zhang, jacob.jun.pan,
	Mike Galbraith, Ingo Molnar, hpa, paulmck, Thomas Gleixner,
	John Stultz, Andy Lutomirski, linux-kernel

On 13/12/2013 15:56, Peter Zijlstra wrote:
> On Fri, Dec 13, 2013 at 12:49:01PM +0200, Eliezer Tamir wrote:
>> On 12/12/2013 16:08, Peter Zijlstra wrote:
>>> This series contains the preempt_enable_no_resched() cleanups that include
>>> spin_lock_bh() optimizations and local_clock() optimizations.
>>
>> I'm trying to test this on tip/master.
>> Patch 3 fails to apply on kernel/softirq.c
> 
> https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/log/?h=preempt_clock
> 
> Applied just fine here.. stuffed it in a git tree.

Thanks,
That worked.

My test is, netperf TCP_RR 1 byte, with busy polling enabled, 30s run.

I see 92.0K RR with your patches, 91.8K RR without them.
This is barely above the noise level, but I'm pretty sure it's real.
In any case I don't see any regression.

Thanks,
Eliezer

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/15] cleanups and optimizations
  2013-12-16 17:48     ` Eliezer Tamir
@ 2013-12-17 13:32       ` Peter Zijlstra
  2013-12-17 14:03         ` Eliezer Tamir
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-17 13:32 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Arjan van de Ven, lenb, rjw, rui.zhang, jacob.jun.pan,
	Mike Galbraith, Ingo Molnar, hpa, paulmck, Thomas Gleixner,
	John Stultz, Andy Lutomirski, linux-kernel

On Mon, Dec 16, 2013 at 07:48:32PM +0200, Eliezer Tamir wrote:
> On 13/12/2013 15:56, Peter Zijlstra wrote:
> > On Fri, Dec 13, 2013 at 12:49:01PM +0200, Eliezer Tamir wrote:
> >> On 12/12/2013 16:08, Peter Zijlstra wrote:
> >>> This series contains the preempt_enable_no_resched() cleanups that include
> >>> spin_lock_bh() optimizations and local_clock() optimizations.
> >>
> >> I'm trying to test this on tip/master.
> >> Patch 3 fails to apply on kernel/softirq.c
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/log/?h=preempt_clock
> > 
> > Applied just fine here.. stuffed it in a git tree.
> 
> Thanks,
> That worked.
> 
> My test is, netperf TCP_RR 1 byte, with busy polling enabled, 30s run.
> 
> I see 92.0K RR with your patches, 91.8K RR without them.
> This is barely above the noise level, but I'm pretty sure it's real.
> In any case I don't see any regression.

Awesomeness.. you'll work on subtracting the spin time from the sleep
time?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/15] cleanups and optimizations
  2013-12-17 13:32       ` Peter Zijlstra
@ 2013-12-17 14:03         ` Eliezer Tamir
  2013-12-17 15:13           ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Eliezer Tamir @ 2013-12-17 14:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, rui.zhang, jacob.jun.pan,
	Mike Galbraith, Ingo Molnar, hpa, paulmck, Thomas Gleixner,
	John Stultz, Andy Lutomirski, linux-kernel, Eliezer Tamir,
	Eliezer Tamir

On 17/12/2013 15:32, Peter Zijlstra wrote:
> 
> Awesomeness.. you'll work on subtracting the spin time from the sleep
> time?

Me or someone on our team will work on it.

I'm not sure that subtracting the spin time is the optimal thing to do.

The busy poll time is supposed to be limited to something less than 1ms.
(I'm using 50us in most of my tests)
This is typically orders of magnitude smaller than the poll timeout.
Would it make more sense to just enforce a limit on poll time?

What do you think?

Thanks,
Eliezer

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/15] cleanups and optimizations
  2013-12-17 14:03         ` Eliezer Tamir
@ 2013-12-17 15:13           ` Peter Zijlstra
  2013-12-17 18:19             ` Eliezer Tamir
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-17 15:13 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Arjan van de Ven, lenb, rjw, rui.zhang, jacob.jun.pan,
	Mike Galbraith, Ingo Molnar, hpa, paulmck, Thomas Gleixner,
	John Stultz, Andy Lutomirski, linux-kernel, Eliezer Tamir

On Tue, Dec 17, 2013 at 04:03:58PM +0200, Eliezer Tamir wrote:
> n 17/12/2013 15:32, Peter Zijlstra wrote:
> > 
> > Awesomeness.. you'll work on subtracting the spin time from the sleep
> > time?
> 
> Me or someone on our team will work on it.
> 
> I'm not sure that subtracting the spin time is the optimal thing to do.
> 
> The busy poll time is supposed to be limited to something less than 1ms.
> (I'm using 50us in most of my tests)
> This is typically orders of magnitude smaller than the poll timeout.
> Would it make more sense to just enforce a limit on poll time?
> 
> What do you think?

I've no idea what people normally expect of select/poll wakeup
granularity but typically we already have 50us of timer slack, although
RT tasks go without this.

It shouldn't be too hard to simply subtract the busy wait from the
expire ktime and avoid the entire issue though, so why not DTRT if it's
fairly straight fwd to do so?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/15] cleanups and optimizations
  2013-12-17 15:13           ` Peter Zijlstra
@ 2013-12-17 18:19             ` Eliezer Tamir
  2013-12-17 22:11               ` Thomas Gleixner
  0 siblings, 1 reply; 67+ messages in thread
From: Eliezer Tamir @ 2013-12-17 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, rui.zhang, jacob.jun.pan,
	Mike Galbraith, Ingo Molnar, hpa, paulmck, Thomas Gleixner,
	John Stultz, Andy Lutomirski, linux-kernel, Eliezer Tamir

On 17/12/2013 17:13, Peter Zijlstra wrote:
> On Tue, Dec 17, 2013 at 04:03:58PM +0200, Eliezer Tamir wrote:
>>
>> I'm not sure that subtracting the spin time is the optimal thing to do.
>>
>> The busy poll time is supposed to be limited to something less than 1ms.
>> (I'm using 50us in most of my tests)
>> This is typically orders of magnitude smaller than the poll timeout.
>> Would it make more sense to just enforce a limit on poll time?
>>
>> What do you think?
> 
> I've no idea what people normally expect of select/poll wakeup
> granularity but typically we already have 50us of timer slack, although
> RT tasks go without this.

If RT tasks can't accept 50us of fuzziness, then the
path of least astonishment would be to have fully accurate
timekeeping, as you suggested. OK, so that's the plan.

Thanks,
Eliezer


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/15] cleanups and optimizations
  2013-12-17 18:19             ` Eliezer Tamir
@ 2013-12-17 22:11               ` Thomas Gleixner
  0 siblings, 0 replies; 67+ messages in thread
From: Thomas Gleixner @ 2013-12-17 22:11 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Peter Zijlstra, Arjan van de Ven, lenb, rjw, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	John Stultz, Andy Lutomirski, linux-kernel, Eliezer Tamir

On Tue, 17 Dec 2013, Eliezer Tamir wrote:
> On 17/12/2013 17:13, Peter Zijlstra wrote:
> > On Tue, Dec 17, 2013 at 04:03:58PM +0200, Eliezer Tamir wrote:
> >>
> >> I'm not sure that subtracting the spin time is the optimal thing to do.
> >>
> >> The busy poll time is supposed to be limited to something less than 1ms.
> >> (I'm using 50us in most of my tests)
> >> This is typically orders of magnitude smaller than the poll timeout.
> >> Would it make more sense to just enforce a limit on poll time?
> >>
> >> What do you think?
> > 
> > I've no idea what people normally expect of select/poll wakeup
> > granularity but typically we already have 50us of timer slack, although
> > RT tasks go without this.
> 
> If RT tasks can't accept 50us of fuzziness, then the
> path of least astonishment would be to have fully accurate
> timekeeping, as you suggested. OK, so that's the plan.

No, the plan is to avoid busy loops in the first place.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [tip:x86/idle] x86, acpi, idle: Restructure the mwait idle routines
  2013-12-12 14:08 ` [PATCH 01/15] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
@ 2013-12-19 20:09   ` tip-bot for Peter Zijlstra
  2013-12-19 20:13     ` H. Peter Anvin
  0 siblings, 1 reply; 67+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-12-19 20:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, arjan, rui.zhang, peterz, bitbucket,
	rafael.j.wysocki, lenb, tglx, hpa, jacob.jun.pan

Commit-ID:  16824255394f55adf31b9a96a9965d8c15bdac4c
Gitweb:     http://git.kernel.org/tip/16824255394f55adf31b9a96a9965d8c15bdac4c
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 12 Dec 2013 15:08:36 +0100
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Thu, 19 Dec 2013 11:54:44 -0800

x86, acpi, idle: Restructure the mwait idle routines

People seem to delight in writing wrong and broken mwait idle routines;
collapse the lot.

This leaves mwait_play_dead() the sole remaining user of __mwait() and
new __mwait() users are probably doing it wrong.

Also remove __sti_mwait() as it's unused.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jacob Jun Pan <jacob.jun.pan@linux.intel.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Len Brown <lenb@kernel.org>
Cc: Rui Zhang <rui.zhang@intel.com>
Acked-by: Rafael Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131212141654.616820819@infradead.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/mwait.h       | 40 ++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h   | 23 ----------------------
 arch/x86/kernel/acpi/cstate.c      | 23 ----------------------
 drivers/acpi/acpi_pad.c            |  5 +----
 drivers/acpi/processor_idle.c      | 15 --------------
 drivers/idle/intel_idle.c          | 11 +----------
 drivers/thermal/intel_powerclamp.c |  4 +---
 7 files changed, 43 insertions(+), 78 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 2f366d0..361b02e 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_X86_MWAIT_H
 #define _ASM_X86_MWAIT_H
 
+#include <linux/sched.h>
+
 #define MWAIT_SUBSTATE_MASK		0xf
 #define MWAIT_CSTATE_MASK		0xf
 #define MWAIT_SUBSTATE_SIZE		4
@@ -13,4 +15,42 @@
 
 #define MWAIT_ECX_INTERRUPT_BREAK	0x1
 
+static inline void __monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
+{
+	if (!current_set_polling_and_test()) {
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
+		__monitor((void *)&current_thread_info()->flags, 0, 0);
+		if (!need_resched())
+			__mwait(eax, ecx);
+	}
+	__current_clr_polling();
+}
+
 #endif /* _ASM_X86_MWAIT_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7b034a4..24821f5 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -700,29 +700,6 @@ static inline void sync_core(void)
 #endif
 }
 
-static inline void __monitor(const void *eax, unsigned long ecx,
-			     unsigned long edx)
-{
-	/* "monitor %eax, %ecx, %edx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc8;"
-		     :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
-	/* "mwait %eax, %ecx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
-	trace_hardirqs_on();
-	/* "mwait %eax, %ecx;" */
-	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index d2b7f27..e69182f 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
-	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(ax, cx);
-	}
-}
-
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 {
 	unsigned int cpu = smp_processor_id();
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index fc6008f..509452a 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,10 +193,7 @@ static int power_saving_thread(void *data)
 					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 			stop_critical_timings();
 
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			smp_mb();
-			if (!need_resched())
-				__mwait(power_saving_mwait_eax, 1);
+			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
 			start_critical_timings();
 			if (lapic_marked_unstable)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 644516d..f90c56c 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	lapic_timer_state_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 
@@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
@@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 		}
 	}
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	acpi_unlazy_tlb(smp_processor_id());
 
 	/* Tell the scheduler that we are going deep-idle: */
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index f80b700..efec405 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -375,16 +375,7 @@ static int intel_idle(struct cpuidle_device *dev,
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
-	if (!current_set_polling_and_test()) {
-
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(eax, ecx);
-	}
+	mwait_idle_with_hints(eax, ecx);
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 8f181b3..e8275f2 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
 			 */
 			local_touch_nmi();
 			stop_critical_timings();
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			cpu_relax(); /* allow HT sibling to run */
-			__mwait(eax, ecx);
+			mwait_idle_with_hints(eax, ecx);
 			start_critical_timings();
 			atomic_inc(&idle_wakeup_counter);
 		}

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [tip:x86/idle] x86, acpi, idle: Restructure the mwait idle routines
  2013-12-19 20:09   ` [tip:x86/idle] " tip-bot for Peter Zijlstra
@ 2013-12-19 20:13     ` H. Peter Anvin
  2013-12-19 20:16       ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: H. Peter Anvin @ 2013-12-19 20:13 UTC (permalink / raw)
  To: mingo, linux-kernel, arjan, peterz, rui.zhang, bitbucket,
	rafael.j.wysocki, lenb, tglx, jacob.jun.pan, hpa,
	linux-tip-commits

On 12/19/2013 12:09 PM, tip-bot for Peter Zijlstra wrote:
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 8f181b3..e8275f2 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
>  			 */
>  			local_touch_nmi();
>  			stop_critical_timings();
> -			__monitor((void *)&current_thread_info()->flags, 0, 0);
> -			cpu_relax(); /* allow HT sibling to run */
> -			__mwait(eax, ecx);
> +			mwait_idle_with_hints(eax, ecx);
>  			start_critical_timings();
>  			atomic_inc(&idle_wakeup_counter);
>  		}
> 

Should this cpu_relax() be in the common code as well?

	-hpa


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [tip:x86/idle] x86, acpi, idle: Restructure the mwait idle routines
  2013-12-19 20:13     ` H. Peter Anvin
@ 2013-12-19 20:16       ` Peter Zijlstra
  0 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2013-12-19 20:16 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: mingo, linux-kernel, arjan, rui.zhang, bitbucket,
	rafael.j.wysocki, lenb, tglx, jacob.jun.pan, hpa,
	linux-tip-commits

On Thu, Dec 19, 2013 at 12:13:29PM -0800, H. Peter Anvin wrote:
> On 12/19/2013 12:09 PM, tip-bot for Peter Zijlstra wrote:
> > diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> > index 8f181b3..e8275f2 100644
> > --- a/drivers/thermal/intel_powerclamp.c
> > +++ b/drivers/thermal/intel_powerclamp.c
> > @@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
> >  			 */
> >  			local_touch_nmi();
> >  			stop_critical_timings();
> > -			__monitor((void *)&current_thread_info()->flags, 0, 0);
> > -			cpu_relax(); /* allow HT sibling to run */
> > -			__mwait(eax, ecx);
> > +			mwait_idle_with_hints(eax, ecx);
> >  			start_critical_timings();
> >  			atomic_inc(&idle_wakeup_counter);
> >  		}
> > 
> 
> Should this cpu_relax() be in the common code as well?

I don't think so; it seems weird to allow a sibling some time between
monitor and wait because the moment wait stalls the cpu the sibling gets
all the time it wants.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2013-12-12 14:08 ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Peter Zijlstra
@ 2014-01-21 22:28   ` Sasha Levin
  2014-01-22 10:45     ` Peter Zijlstra
  2014-01-22 17:14     ` Peter Zijlstra
  0 siblings, 2 replies; 67+ messages in thread
From: Sasha Levin @ 2014-01-21 22:28 UTC (permalink / raw)
  To: Peter Zijlstra, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski
  Cc: linux-kernel

On 12/12/2013 09:08 AM, Peter Zijlstra wrote:
> In order to avoid the runtime condition and variable load turn
> sched_clock_stable into a static_key.
>
> Also provide a shorter implementation of local_clock() and
> cpu_clock(int) when sched_clock_stable==1.
>
>                          MAINLINE   PRE       POST
>
>      sched_clock_stable: 1          1         1
>      (cold) sched_clock: 329841     221876    215295
>      (cold) local_clock: 301773     234692    220773
>      (warm) sched_clock: 38375      25602     25659
>      (warm) local_clock: 100371     33265     27242
>      (warm) rdtsc:       27340      24214     24208
>      sched_clock_stable: 0          0         0
>      (cold) sched_clock: 382634     235941    237019
>      (cold) local_clock: 396890     297017    294819
>      (warm) sched_clock: 38194      25233     25609
>      (warm) local_clock: 143452     71234     71232
>      (warm) rdtsc:       27345      24245     24243
>
> Signed-off-by: Peter Zijlstra<peterz@infradead.org>

Hi Peter,

This patch seems to be causing an issue with booting a KVM guest. It seems that it
causes the time to go random during early boot process:

	[    0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
	[    0.000000]   NODE_DATA [mem 0xcfa42000-0xcfa72fff]
	[    0.000000]     NODE_DATA(30) on node 1
	[    0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
	[    0.000000]   NODE_DATA [mem 0xcfa11000-0xcfa41fff]
	[    0.000000]     NODE_DATA(31) on node 1
	[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
	[    0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
	[133538.294040] Zone ranges:
	[133538.294338]   DMA      [mem 0x00001000-0x00ffffff]
	[133538.294804]   DMA32    [mem 0x01000000-0xffffffff]
	[133538.295223]   Normal   [mem 0x100000000-0x142fffffff]
	[133538.295670] Movable zone start for each node

Looking at the code, initially I thought that the problem is with:

	+void set_sched_clock_stable(void)
	+{
	+       if (!sched_clock_stable())
	+               static_key_slow_dec(&__sched_clock_stable);
	+}
	+
	+void clear_sched_clock_stable(void)
	+{
	+       /* XXX worry about clock continuity */
	+       if (sched_clock_stable())
	+               static_key_slow_inc(&__sched_clock_stable);
	+}

I think the jump label inc/dec is reversed here. We would want to inc it when enabling
and dec when disabling, no?
However, trying to reverse the two didn't help. I was still seeing the same odd behaviour.

I tried doing a simple conversion to using a simple var like before, which looks like this:

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 6bd6a67..a035932 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -76,26 +76,21 @@ EXPORT_SYMBOL_GPL(sched_clock);
  __read_mostly int sched_clock_running;

  #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+static int __sched_clock_stable;

  int sched_clock_stable(void)
  {
-       if (static_key_false(&__sched_clock_stable))
-               return false;
-       return true;
+       return __sched_clock_stable;
  }

  void set_sched_clock_stable(void)
  {
-       if (!sched_clock_stable())
-               static_key_slow_dec(&__sched_clock_stable);
+       __sched_clock_stable = 1;
  }

  static void __clear_sched_clock_stable(struct work_struct *work)
  {
-       /* XXX worry about clock continuity */
-       if (sched_clock_stable())
-               static_key_slow_inc(&__sched_clock_stable);
+       __sched_clock_stable = 0;
  }

  static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
@@ -340,7 +335,7 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
   */
  u64 cpu_clock(int cpu)
  {
-       if (static_key_false(&__sched_clock_stable))
+       if (!sched_clock_stable())
                 return sched_clock_cpu(cpu);

         return sched_clock();
@@ -355,7 +350,7 @@ u64 cpu_clock(int cpu)
   */
  u64 local_clock(void)
  {
-       if (static_key_false(&__sched_clock_stable))
+       if (!sched_clock_stable())
                 return sched_clock_cpu(raw_smp_processor_id());

         return sched_clock();


This has corrected the issue:

	[    0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
	[    0.000000]   NODE_DATA [mem 0xcfa11000-0xcfa41fff]
	[    0.000000]     NODE_DATA(31) on node 1
	[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
	[    0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
	[    0.000000] Zone ranges:
	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
	[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
	[    0.000000]   Normal   [mem 0x100000000-0x142fffffff]
	[    0.000000] Movable zone start for each node
	[    0.000000] Early memory node ranges
		[ timing is correct for the rest of the boot]

At this point, I thought that there was something up with jump labels being used this early (?) and
tried compiling with CONFIG_JUMP_LABEL=n, but this didn't solve the issue.

This makes me think there's something different related to jump labels that we're missing, as the
no-jump-label config should be very similar to the patch I did above; I just can't figure
out what it is.
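
For reference, my understanding of the !CONFIG_JUMP_LABEL fallback (a rough sketch from memory,
not the actual include/linux/jump_label.h code) is that static_key_false() degenerates into a
plain atomic counter test:

	struct static_key {
		atomic_t enabled;		/* 0 == key off, >0 == key on */
	};
	#define STATIC_KEY_INIT	{ .enabled = ATOMIC_INIT(0) }

	static __always_inline bool static_key_false(struct static_key *key)
	{
		return unlikely(atomic_read(&key->enabled) > 0);
	}

	/* static_key_slow_inc()/static_key_slow_dec() just bump/drop ->enabled */

which is what made me expect the no-jump-label build to behave like the plain-int patch above.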


Thanks,
Sasha

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-21 22:28   ` Sasha Levin
@ 2014-01-22 10:45     ` Peter Zijlstra
  2014-01-22 11:59       ` Peter Zijlstra
  2014-01-22 12:00       ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Markus Trippelsdorf
  2014-01-22 17:14     ` Peter Zijlstra
  1 sibling, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-22 10:45 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, linux-kernel,
	dyoung

On Tue, Jan 21, 2014 at 05:28:37PM -0500, Sasha Levin wrote:
> On 12/12/2013 09:08 AM, Peter Zijlstra wrote:
> 
> This patch seems to be causing an issue with booting a KVM guest. It seems that it
> causes the time to go random during early boot process:
> 
> 	[    0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
> 	[    0.000000]   NODE_DATA [mem 0xcfa42000-0xcfa72fff]
> 	[    0.000000]     NODE_DATA(30) on node 1
> 	[    0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
> 	[    0.000000]   NODE_DATA [mem 0xcfa11000-0xcfa41fff]
> 	[    0.000000]     NODE_DATA(31) on node 1
> 	[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> 	[    0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
> 	[133538.294040] Zone ranges:
> 	[133538.294338]   DMA      [mem 0x00001000-0x00ffffff]
> 	[133538.294804]   DMA32    [mem 0x01000000-0xffffffff]
> 	[133538.295223]   Normal   [mem 0x100000000-0x142fffffff]
> 	[133538.295670] Movable zone start for each node
> 
> Looking at the code, initially I though that the problem is with:
> 
> 	+void set_sched_clock_stable(void)
> 	+{
> 	+       if (!sched_clock_stable())
> 	+               static_key_slow_dec(&__sched_clock_stable);
> 	+}
> 	+
> 	+void clear_sched_clock_stable(void)
> 	+{
> 	+       /* XXX worry about clock continuity */
> 	+       if (sched_clock_stable())
> 	+               static_key_slow_inc(&__sched_clock_stable);
> 	+}
> 
> I think the jump label inc/dec is reversed here. We would want to inc
> it when enabling and dec when disabling, no?

I got terribly confused with that static_key trainwreck. I know I
definitely got it wrong a few times.

I helped write the jump label stuff, but the current interface is
horrible; I really couldn't figure out what was what anymore :-(

The current code seems to work for me in that my machine ends up with
sched_clock_stable() == true and when I call clear_sched_clock_stable()
it returns false and nothing explodes.
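
For my own sanity, the way I read the (default-off) key semantics -- a
simplified sketch of the behaviour, not the real jump-label code:

	static struct static_key __key = STATIC_KEY_INIT;	/* starts disabled */

	/*
	 * static_key_false(&__key) is false while the key is disabled and
	 * only becomes true after static_key_slow_inc().  With the patch's
	 *
	 *	sched_clock_stable() == !static_key_false(&__sched_clock_stable)
	 *
	 * that gives:
	 *
	 *	key disabled (boot default)	-> sched_clock_stable() == true
	 *	after static_key_slow_inc()	-> sched_clock_stable() == false
	 *	after static_key_slow_dec()	-> back to true
	 *
	 * IOW the key encodes "clock is unstable", which is why
	 * set_sched_clock_stable() does a dec and clear_...() does an inc.
	 */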

> However, trying to reverse the two didn't help. I was still seeing the
> same odd behaviour.

I got a crash when I flipped the inc and dec ;-)

> I tried doing a simple conversion to using a simple var like before,
> which looks like this:

<snip>

> This has corrected the issue:
> 
> 	[    0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
> 	[    0.000000]   NODE_DATA [mem 0xcfa11000-0xcfa41fff]
> 	[    0.000000]     NODE_DATA(31) on node 1
> 	[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> 	[    0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
> 	[    0.000000] Zone ranges:
> 	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
> 	[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
> 	[    0.000000]   Normal   [mem 0x100000000-0x142fffffff]
> 	[    0.000000] Movable zone start for each node
> 	[    0.000000] Early memory node ranges
> 		[ timing is correct for the rest of the boot]
> 
> At this point, I thought that there's something up with jump labels
> being used this early (?) and tried compiling with
> CONFIG_JUMP_LABELS=n, this didn't solve the issue.
> 
> This makes me thing there's something different related to jumplabels
> we're missing, as the no-jumplabel config should be very similar to
> the patch I did above, I just can't figure out what it is.

Does this kvm thing ever have sched_clock_stable() == true? My machine ends
up calling set_sched_clock_stable() very early indeed, in
early_init_intel().

I suspect what happens is that, the way I wrote it, the jump_label is
true at boot, so sched_clock_stable() == true. My initial
set_sched_clock_stable() call then does nothing, as the state is already
good.

KVM doesn't do this: it runs with sched_clock_stable() == true for a
while, then detects that the TSC really isn't stable, calls
clear_sched_clock_stable(), and you get this jump in time.

Now the next patch 'fixes' this patch by adding a workqueue to delay the
actual jump_label poke; that fixed a few weird lockdep errors.

But the above is still obviously wrong for machines that do not call
set_sched_clock_stable() on boot.

Ho humm.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 10:45     ` Peter Zijlstra
@ 2014-01-22 11:59       ` Peter Zijlstra
  2014-01-23  1:53         ` Dave Young
  2014-01-23 16:46         ` [tip:sched/urgent] sched/clock: Fixup early initialization tip-bot for Peter Zijlstra
  2014-01-22 12:00       ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Markus Trippelsdorf
  1 sibling, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-22 11:59 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, linux-kernel,
	dyoung

On Wed, Jan 22, 2014 at 11:45:32AM +0100, Peter Zijlstra wrote:
> Ho humm.

OK, so I had me a ponder; does the below fix things for you and David?
I've only done a boot test on real proper hardware :-)

---
 kernel/sched/clock.c | 42 +++++++++++++++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 6bd6a6731b21..6bbcd97f4532 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -77,35 +77,45 @@ __read_mostly int sched_clock_running;
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+static int __sched_clock_stable_early;
 
 int sched_clock_stable(void)
 {
-	if (static_key_false(&__sched_clock_stable))
-		return false;
-	return true;
+	return static_key_false(&__sched_clock_stable);
 }
 
 void set_sched_clock_stable(void)
 {
+	__sched_clock_stable_early = 1;
+
+	smp_mb(); /* matches sched_clock_init() */
+
+	if (!sched_clock_running)
+		return;
+
 	if (!sched_clock_stable())
-		static_key_slow_dec(&__sched_clock_stable);
+		static_key_slow_inc(&__sched_clock_stable);
 }
 
 static void __clear_sched_clock_stable(struct work_struct *work)
 {
 	/* XXX worry about clock continuity */
 	if (sched_clock_stable())
-		static_key_slow_inc(&__sched_clock_stable);
+		static_key_slow_dec(&__sched_clock_stable);
 }
 
 static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
 
 void clear_sched_clock_stable(void)
 {
-	if (keventd_up())
-		schedule_work(&sched_clock_work);
-	else
-		__clear_sched_clock_stable(&sched_clock_work);
+	__sched_clock_stable_early = 0;
+
+	smp_mb(); /* matches sched_clock_init() */
+
+	if (!sched_clock_running)
+		return;
+
+	schedule_work(&sched_clock_work);
 }
 
 struct sched_clock_data {
@@ -140,6 +150,20 @@ void sched_clock_init(void)
 	}
 
 	sched_clock_running = 1;
+
+	/*
+	 * Ensure that it is impossible to not do a static_key update.
+	 *
+	 * Either {set,clear}_sched_clock_stable() must see sched_clock_running
+	 * and do the update, or we must see their __sched_clock_stable_early
+	 * and do the update, or both.
+	 */
+	smp_mb(); /* matches {set,clear}_sched_clock_stable() */
+
+	if (__sched_clock_stable_early)
+		set_sched_clock_stable();
+	else
+		clear_sched_clock_stable();
 }
 
 /*

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 10:45     ` Peter Zijlstra
  2014-01-22 11:59       ` Peter Zijlstra
@ 2014-01-22 12:00       ` Markus Trippelsdorf
  2014-01-22 12:07         ` Peter Zijlstra
  1 sibling, 1 reply; 67+ messages in thread
From: Markus Trippelsdorf @ 2014-01-22 12:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On 2014.01.22 at 11:45 +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 05:28:37PM -0500, Sasha Levin wrote:
> > On 12/12/2013 09:08 AM, Peter Zijlstra wrote:
> > 
> > This patch seems to be causing an issue with booting a KVM guest. It seems that it
> > causes the time to go random during early boot process:
> > 
> > 	[    0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
> > 	[    0.000000]   NODE_DATA [mem 0xcfa42000-0xcfa72fff]
> > 	[    0.000000]     NODE_DATA(30) on node 1
> > 	[    0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
> > 	[    0.000000]   NODE_DATA [mem 0xcfa11000-0xcfa41fff]
> > 	[    0.000000]     NODE_DATA(31) on node 1
> > 	[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> > 	[    0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
> > 	[133538.294040] Zone ranges:
> > 	[133538.294338]   DMA      [mem 0x00001000-0x00ffffff]
> > 	[133538.294804]   DMA32    [mem 0x01000000-0xffffffff]
> > 	[133538.295223]   Normal   [mem 0x100000000-0x142fffffff]
> > 	[133538.295670] Movable zone start for each node
> > 

FYI it happens on real hardware on my machine:
...
[    0.000000] Hierarchical RCU implementation.
[    0.000000] NR_IRQS:4352 nr_irqs:712 16
[    0.000000] spurious 8259A interrupt: IRQ7.
[    0.000000] Console: colour VGA+ 80x25
[    0.000000] console [tty0] enabled
[    0.000000] hpet clockevent registered
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.003333] tsc: Detected 3210.681 MHz processor
[   60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
[   60.375240] pid_max: default: 32768 minimum: 301
[   60.375259] Mount-cache hash table entries: 256
[   60.375373] tseg: 0000000000
[   60.375377] CPU: Physical Processor ID: 0
[   60.375377] CPU: Processor Core ID: 0
[   60.375378] mce: CPU supports 6 MCE banks
[   60.375382] LVT offset 0 assigned for vector 0xf9
[   60.375384] process: using AMD E400 aware idle routine
[   60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
...
-- 
Markus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 12:00       ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Markus Trippelsdorf
@ 2014-01-22 12:07         ` Peter Zijlstra
  2014-01-22 12:16           ` Peter Zijlstra
  2014-01-22 12:26           ` Markus Trippelsdorf
  0 siblings, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-22 12:07 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> FYI it happens on real hardware on my machine:
> ...
> [    0.000000] Hierarchical RCU implementation.
> [    0.000000] NR_IRQS:4352 nr_irqs:712 16
> [    0.000000] spurious 8259A interrupt: IRQ7.
> [    0.000000] Console: colour VGA+ 80x25
> [    0.000000] console [tty0] enabled
> [    0.000000] hpet clockevent registered
> [    0.000000] tsc: Fast TSC calibration using PIT
> [    0.003333] tsc: Detected 3210.681 MHz processor
> [   60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> [   60.375240] pid_max: default: 32768 minimum: 301
> [   60.375259] Mount-cache hash table entries: 256
> [   60.375373] tseg: 0000000000
> [   60.375377] CPU: Physical Processor ID: 0
> [   60.375377] CPU: Processor Core ID: 0
> [   60.375378] mce: CPU supports 6 MCE banks
> [   60.375382] LVT offset 0 assigned for vector 0xf9
> [   60.375384] process: using AMD E400 aware idle routine
> [   60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8

Should have always happened like that, I think. From the log it looks
like it happens at the moment we switch from jiffies to the actual TSC in
arch/x86/kernel/tsc.c:sched_clock().

I don't think I changed the logic there, just switched from a condition
to a jump_label.
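
Back of the envelope -- assuming the printk timestamp is simply tsc/khz
once we switch over (simplified; the real code uses the mult/shift trick
rather than a divide):

	ns = cyc * 1000000ULL / tsc_khz;

	/*
	 * 60.375238 s * 3210681 kHz ~= 1.94e11 cycles, i.e. roughly what
	 * the TSC reads after ~60 s of firmware/bootloader/early boot, so
	 * the moment we start converting the raw TSC that whole history
	 * shows up in the printk timestamp at once.
	 */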

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 12:07         ` Peter Zijlstra
@ 2014-01-22 12:16           ` Peter Zijlstra
  2014-01-22 12:26           ` Markus Trippelsdorf
  1 sibling, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-22 12:16 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On Wed, Jan 22, 2014 at 01:07:57PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> > FYI it happens on real hardware on my machine:
> > ...
> > [    0.000000] Hierarchical RCU implementation.
> > [    0.000000] NR_IRQS:4352 nr_irqs:712 16
> > [    0.000000] spurious 8259A interrupt: IRQ7.
> > [    0.000000] Console: colour VGA+ 80x25
> > [    0.000000] console [tty0] enabled
> > [    0.000000] hpet clockevent registered
> > [    0.000000] tsc: Fast TSC calibration using PIT
> > [    0.003333] tsc: Detected 3210.681 MHz processor
> > [   60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> > [   60.375240] pid_max: default: 32768 minimum: 301
> > [   60.375259] Mount-cache hash table entries: 256
> > [   60.375373] tseg: 0000000000
> > [   60.375377] CPU: Physical Processor ID: 0
> > [   60.375377] CPU: Processor Core ID: 0
> > [   60.375378] mce: CPU supports 6 MCE banks
> > [   60.375382] LVT offset 0 assigned for vector 0xf9
> > [   60.375384] process: using AMD E400 aware idle routine
> > [   60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
> 
> Should have always happened like that I think. From the log it looks
> like the moment we switch from jiffies to actual TSC in
> arch/x86/kernel/tsc.c:sched_clock().
> 
> I don't think I changed the logic there, just switched from a condition
> to a jump_label.

On that, the bestest solution is to completely rip out the no_tsc
option; it's daft. The few people still running chips without TSC can use
X86_TSC=n.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 12:07         ` Peter Zijlstra
  2014-01-22 12:16           ` Peter Zijlstra
@ 2014-01-22 12:26           ` Markus Trippelsdorf
  2014-01-22 12:30             ` Peter Zijlstra
  1 sibling, 1 reply; 67+ messages in thread
From: Markus Trippelsdorf @ 2014-01-22 12:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> > FYI it happens on real hardware on my machine:
> > ...
> > [    0.000000] Hierarchical RCU implementation.
> > [    0.000000] NR_IRQS:4352 nr_irqs:712 16
> > [    0.000000] spurious 8259A interrupt: IRQ7.
> > [    0.000000] Console: colour VGA+ 80x25
> > [    0.000000] console [tty0] enabled
> > [    0.000000] hpet clockevent registered
> > [    0.000000] tsc: Fast TSC calibration using PIT
> > [    0.003333] tsc: Detected 3210.681 MHz processor
> > [   60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> > [   60.375240] pid_max: default: 32768 minimum: 301
> > [   60.375259] Mount-cache hash table entries: 256
> > [   60.375373] tseg: 0000000000
> > [   60.375377] CPU: Physical Processor ID: 0
> > [   60.375377] CPU: Processor Core ID: 0
> > [   60.375378] mce: CPU supports 6 MCE banks
> > [   60.375382] LVT offset 0 assigned for vector 0xf9
> > [   60.375384] process: using AMD E400 aware idle routine
> > [   60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
> 
> Should have always happened like that I think. From the log it looks
> like the moment we switch from jiffies to actual TSC in
> arch/x86/kernel/tsc.c:sched_clock().
> 
> I don't think I changed the logic there, just switched from a condition
> to a jump_label.

Well, v3.13 was fine. So it's definitely a regression. But it may be
another issue. I will try to bisect later.

-- 
Markus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 12:26           ` Markus Trippelsdorf
@ 2014-01-22 12:30             ` Peter Zijlstra
  2014-01-22 13:14               ` Markus Trippelsdorf
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-22 12:30 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On Wed, Jan 22, 2014 at 01:26:09PM +0100, Markus Trippelsdorf wrote:
> On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> > > FYI it happens on real hardware on my machine:
> > > ...
> > > [    0.000000] Hierarchical RCU implementation.
> > > [    0.000000] NR_IRQS:4352 nr_irqs:712 16
> > > [    0.000000] spurious 8259A interrupt: IRQ7.
> > > [    0.000000] Console: colour VGA+ 80x25
> > > [    0.000000] console [tty0] enabled
> > > [    0.000000] hpet clockevent registered
> > > [    0.000000] tsc: Fast TSC calibration using PIT
> > > [    0.003333] tsc: Detected 3210.681 MHz processor
> > > [   60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> > > [   60.375240] pid_max: default: 32768 minimum: 301
> > > [   60.375259] Mount-cache hash table entries: 256
> > > [   60.375373] tseg: 0000000000
> > > [   60.375377] CPU: Physical Processor ID: 0
> > > [   60.375377] CPU: Processor Core ID: 0
> > > [   60.375378] mce: CPU supports 6 MCE banks
> > > [   60.375382] LVT offset 0 assigned for vector 0xf9
> > > [   60.375384] process: using AMD E400 aware idle routine
> > > [   60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
> > 
> > Should have always happened like that I think. From the log it looks
> > like the moment we switch from jiffies to actual TSC in
> > arch/x86/kernel/tsc.c:sched_clock().
> > 
> > I don't think I changed the logic there, just switched from a condition
> > to a jump_label.
> 
> Well, v3.13 was fine. So it's definitely a regression. But it may be
> another issue. I will try to bisect later.

OK, weird, I'll see if I can spot anything.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 12:30             ` Peter Zijlstra
@ 2014-01-22 13:14               ` Markus Trippelsdorf
  2014-01-22 14:26                 ` Sasha Levin
  0 siblings, 1 reply; 67+ messages in thread
From: Markus Trippelsdorf @ 2014-01-22 13:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On 2014.01.22 at 13:30 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 01:26:09PM +0100, Markus Trippelsdorf wrote:
> > On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> > > > FYI it happens on real hardware on my machine:
> > > > ...
> > > > [    0.000000] Hierarchical RCU implementation.
> > > > [    0.000000] NR_IRQS:4352 nr_irqs:712 16
> > > > [    0.000000] spurious 8259A interrupt: IRQ7.
> > > > [    0.000000] Console: colour VGA+ 80x25
> > > > [    0.000000] console [tty0] enabled
> > > > [    0.000000] hpet clockevent registered
> > > > [    0.000000] tsc: Fast TSC calibration using PIT
> > > > [    0.003333] tsc: Detected 3210.681 MHz processor
> > > > [   60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> > > > [   60.375240] pid_max: default: 32768 minimum: 301
> > > > [   60.375259] Mount-cache hash table entries: 256
> > > > [   60.375373] tseg: 0000000000
> > > > [   60.375377] CPU: Physical Processor ID: 0
> > > > [   60.375377] CPU: Processor Core ID: 0
> > > > [   60.375378] mce: CPU supports 6 MCE banks
> > > > [   60.375382] LVT offset 0 assigned for vector 0xf9
> > > > [   60.375384] process: using AMD E400 aware idle routine
> > > > [   60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
> > > 
> > > Should have always happened like that I think. From the log it looks
> > > like the moment we switch from jiffies to actual TSC in
> > > arch/x86/kernel/tsc.c:sched_clock().
> > > 
> > > I don't think I changed the logic there, just switched from a condition
> > > to a jump_label.
> > 
> > Well, v3.13 was fine. So it's definitely a regression. But it may be
> > another issue. I will try to bisect later.
> 
> OK, weird, I'll see if I can spot anything.

Unfortunately the issue is unbisectable (but the remaining commits are
all yours):

git bisect start
# bad: [df32e43a54d04eda35d2859beaf90e3864d53288] Merge branch 'akpm' (incoming from Andrew)
git bisect bad df32e43a54d04eda35d2859beaf90e3864d53288
# good: [d8ec26d7f8287f5788a494f56e8814210f0e64be] Linux 3.13
git bisect good d8ec26d7f8287f5788a494f56e8814210f0e64be
# bad: [de4fe30af1620b5117d65489621a5037913e7a92] Merge tag 'staging-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect bad de4fe30af1620b5117d65489621a5037913e7a92
# good: [2783980525df12b9c49e8b4baaff06abc7f2f8f2] staging: comedi: usbduxsigma: removing unneccesay attached info
git bisect good 2783980525df12b9c49e8b4baaff06abc7f2f8f2
# good: [9326657abe1a83ed4b4f396b923ca1217fd50cba] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 9326657abe1a83ed4b4f396b923ca1217fd50cba
# bad: [82b51734b4f228c76b6064b6e899d9d3d4c17c1a] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect bad 82b51734b4f228c76b6064b6e899d9d3d4c17c1a
# bad: [5d4863e4cc4dc12d1d5e42da3cb5d38c535e4ad6] Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 5d4863e4cc4dc12d1d5e42da3cb5d38c535e4ad6
# bad: [a0fa1dd3cdbccec9597fe53b6177a9aa6e20f2f8] Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad a0fa1dd3cdbccec9597fe53b6177a9aa6e20f2f8
# good: [0aeeeebac8d8304501680f12862784341f4bee7b] sched: Remove unused parameter from nohz_balancer_kick()
git bisect good 0aeeeebac8d8304501680f12862784341f4bee7b
# bad: [c9c8986847d2f4fc474c10ee08afa57e7474096d] Merge branch 'x86/idle' into sched/core
git bisect bad c9c8986847d2f4fc474c10ee08afa57e7474096d
# skip: [20d1c86a57762f0a33a78988e3fc8818316badd4] sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
git bisect skip 20d1c86a57762f0a33a78988e3fc8818316badd4
# skip: [35af99e646c7f7ea46dc2977601e9e71a51dadd5] sched/clock, x86: Use a static_key for sched_clock_stable
git bisect skip 35af99e646c7f7ea46dc2977601e9e71a51dadd5
# good: [7d590cca7cd2cce4ed7c47d221d6f90566653ba8] x86, idle: Add memory barriers around clflush in mwait_play_dead()
git bisect good 7d590cca7cd2cce4ed7c47d221d6f90566653ba8
# skip: [57c67da274f3fab38e08d2c9edf08b89e1d9c71d] sched/clock, x86: Move some cyc2ns() code around
git bisect skip 57c67da274f3fab38e08d2c9edf08b89e1d9c71d
# good: [c726099ec224be8078d91072207053ff9a1ad6fc] sched: Factor out the on_null_domain() checks in trigger_load_balance()
git bisect good c726099ec224be8078d91072207053ff9a1ad6fc
# skip: [ef08f0fff87630d4f67ceb09514d8b444df833f8] sched/clock: Remove local_irq_disable() from the clocks
git bisect skip ef08f0fff87630d4f67ceb09514d8b444df833f8
# skip: [62b94a08da1bae9d187d49dfcd6665af393750f8] sched/preempt: Take away preempt_enable_no_resched() from modules
git bisect skip 62b94a08da1bae9d187d49dfcd6665af393750f8
# skip: [5dd12c2152743747ca9f50ef80281e54cc416dc0] sched/clock, x86: Use mul_u64_u32_shr() for native_sched_clock()
git bisect skip 5dd12c2152743747ca9f50ef80281e54cc416dc0
# skip: [10b033d434c25a6c9e0f4f4dc2418af1b8236c63] sched/clock, x86: Avoid a runtime condition in native_sched_clock()
git bisect skip 10b033d434c25a6c9e0f4f4dc2418af1b8236c63
# skip: [6577e42a3e1633afe762f47da9e00061ee4b9a5e] sched/clock: Fix up clear_sched_clock_stable()
git bisect skip 6577e42a3e1633afe762f47da9e00061ee4b9a5e
# bad: [0bd3a173d711857fc9f583eb5825386cc08f3948] sched/preempt, locking: Rework local_bh_{dis,en}able()
git bisect bad 0bd3a173d711857fc9f583eb5825386cc08f3948
# skip: [9ea4c380066fbe23fe0da7f4abfabc444f2467f4] locking: Optimize lock_bh functions
git bisect skip 9ea4c380066fbe23fe0da7f4abfabc444f2467f4
# only skipped commits left to test
# possible first bad commit: [0bd3a173d711857fc9f583eb5825386cc08f3948] sched/preempt, locking: Rework local_bh_{dis,en}able()
# possible first bad commit: [10b033d434c25a6c9e0f4f4dc2418af1b8236c63] sched/clock, x86: Avoid a runtime condition in native_sched_clock()
# possible first bad commit: [6577e42a3e1633afe762f47da9e00061ee4b9a5e] sched/clock: Fix up clear_sched_clock_stable()
# possible first bad commit: [35af99e646c7f7ea46dc2977601e9e71a51dadd5] sched/clock, x86: Use a static_key for sched_clock_stable
# possible first bad commit: [ef08f0fff87630d4f67ceb09514d8b444df833f8] sched/clock: Remove local_irq_disable() from the clocks
# possible first bad commit: [20d1c86a57762f0a33a78988e3fc8818316badd4] sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
# possible first bad commit: [57c67da274f3fab38e08d2c9edf08b89e1d9c71d] sched/clock, x86: Move some cyc2ns() code around
# possible first bad commit: [5dd12c2152743747ca9f50ef80281e54cc416dc0] sched/clock, x86: Use mul_u64_u32_shr() for native_sched_clock()
# possible first bad commit: [62b94a08da1bae9d187d49dfcd6665af393750f8] sched/preempt: Take away preempt_enable_no_resched() from modules
# possible first bad commit: [9ea4c380066fbe23fe0da7f4abfabc444f2467f4] locking: Optimize lock_bh functions

I've skipped the commits because of compile errors:

make[1]: Nothing to be done for 'all'.
  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
make[1]: Nothing to be done for 'relocs'.
  CHK     include/generated/utsrelease.h
  CC      arch/x86/kernel/asm-offsets.s
In file included from include/linux/spinlock.h:271:0,
                 from include/linux/mmzone.h:7,
                 from include/linux/gfp.h:4,
                 from include/linux/slab.h:14,
                 from include/linux/crypto.h:24,
                 from arch/x86/kernel/asm-offsets.c:8:
include/linux/spinlock_api_smp.h: In function ‘__raw_spin_lock_bh’:
include/linux/spinlock_api_smp.h:134:2: error: implicit declaration of function ‘__local_bh_disable_ip’ [-Werror=implicit-function-declaration]
  __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
  ^
include/linux/spinlock_api_smp.h:134:34: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
  __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
                                  ^
include/linux/spinlock_api_smp.h:134:34: note: each undeclared identifier is reported only once for each function it appears in
include/linux/spinlock_api_smp.h: In function ‘__raw_spin_unlock_bh’:
include/linux/spinlock_api_smp.h:176:2: error: implicit declaration of function ‘__local_bh_enable_ip’ [-Werror=implicit-function-declaration]
  __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
  ^
include/linux/spinlock_api_smp.h:176:33: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
  __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
                                 ^
include/linux/spinlock_api_smp.h: In function ‘__raw_spin_trylock_bh’:
include/linux/spinlock_api_smp.h:181:34: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
  __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
                                  ^
In file included from include/linux/spinlock_api_smp.h:190:0,
                 from include/linux/spinlock.h:271,
                 from include/linux/mmzone.h:7,
                 from include/linux/gfp.h:4,
                 from include/linux/slab.h:14,
                 from include/linux/crypto.h:24,
                 from arch/x86/kernel/asm-offsets.c:8:
include/linux/rwlock_api_smp.h: In function ‘__raw_read_lock_bh’:
include/linux/rwlock_api_smp.h:175:34: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
  __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
                                  ^
include/linux/rwlock_api_smp.h: In function ‘__raw_write_lock_bh’:
include/linux/rwlock_api_smp.h:202:34: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
  __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
                                  ^
include/linux/rwlock_api_smp.h: In function ‘__raw_read_unlock_bh’:
include/linux/rwlock_api_smp.h:251:33: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
  __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
                                 ^
include/linux/rwlock_api_smp.h: In function ‘__raw_write_unlock_bh’:
include/linux/rwlock_api_smp.h:275:33: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
  __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
                                 ^
cc1: some warnings being treated as errors
/usr/src/linux/./Kbuild:81: recipe for target 'arch/x86/kernel/asm-offsets.s' failed
make[1]: *** [arch/x86/kernel/asm-offsets.s] Error 1
Makefile:859: recipe for target 'prepare0' failed
make: *** [prepare0] Error 2

-- 
Markus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 13:14               ` Markus Trippelsdorf
@ 2014-01-22 14:26                 ` Sasha Levin
  2014-01-22 18:35                   ` Markus Trippelsdorf
  0 siblings, 1 reply; 67+ messages in thread
From: Sasha Levin @ 2014-01-22 14:26 UTC (permalink / raw)
  To: Markus Trippelsdorf, Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, linux-kernel,
	dyoung

On 01/22/2014 08:14 AM, Markus Trippelsdorf wrote:
> On 2014.01.22 at 13:30 +0100, Peter Zijlstra wrote:
>> >On Wed, Jan 22, 2014 at 01:26:09PM +0100, Markus Trippelsdorf wrote:
>>> > >On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
>>>> > > >On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
>>>>> > > > >FYI it happens on real hardware on my machine:
>>>>> > > > >...
>>>>> > > > >[    0.000000] Hierarchical RCU implementation.
>>>>> > > > >[    0.000000] NR_IRQS:4352 nr_irqs:712 16
>>>>> > > > >[    0.000000] spurious 8259A interrupt: IRQ7.
>>>>> > > > >[    0.000000] Console: colour VGA+ 80x25
>>>>> > > > >[    0.000000] console [tty0] enabled
>>>>> > > > >[    0.000000] hpet clockevent registered
>>>>> > > > >[    0.000000] tsc: Fast TSC calibration using PIT
>>>>> > > > >[    0.003333] tsc: Detected 3210.681 MHz processor
>>>>> > > > >[   60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
>>>>> > > > >[   60.375240] pid_max: default: 32768 minimum: 301
>>>>> > > > >[   60.375259] Mount-cache hash table entries: 256
>>>>> > > > >[   60.375373] tseg: 0000000000
>>>>> > > > >[   60.375377] CPU: Physical Processor ID: 0
>>>>> > > > >[   60.375377] CPU: Processor Core ID: 0
>>>>> > > > >[   60.375378] mce: CPU supports 6 MCE banks
>>>>> > > > >[   60.375382] LVT offset 0 assigned for vector 0xf9
>>>>> > > > >[   60.375384] process: using AMD E400 aware idle routine
>>>>> > > > >[   60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
>>>> > > >
>>>> > > >Should have always happened like that I think. From the log it looks
>>>> > > >like the moment we switch from jiffies to actual TSC in
>>>> > > >arch/x86/kernel/tsc.c:sched_clock().
>>>> > > >
>>>> > > >I don't think I changed the logic there, just switched from a condition
>>>> > > >to a jump_label.
>>> > >
>>> > >Well, v3.13 was fine. So it's definitely a regression. But it may be
>>> > >another issue. I will try to bisect later.
>> >
>> >OK, weird, I'll see if I can spot anything.
> Unfortunately the issue is unbisectable (but the remaining commits are
> all yours):

I've actually bisected it previously by fixing the build errors manually, and that took me
to this patch you see in the subject line :)


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-21 22:28   ` Sasha Levin
  2014-01-22 10:45     ` Peter Zijlstra
@ 2014-01-22 17:14     ` Peter Zijlstra
  2014-01-22 22:31       ` Sasha Levin
  1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-22 17:14 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, linux-kernel

On Tue, Jan 21, 2014 at 05:28:37PM -0500, Sasha Levin wrote:
> 	[    0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
> 	[    0.000000]   NODE_DATA [mem 0xcfa42000-0xcfa72fff]
> 	[    0.000000]     NODE_DATA(30) on node 1
> 	[    0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
> 	[    0.000000]   NODE_DATA [mem 0xcfa11000-0xcfa41fff]
> 	[    0.000000]     NODE_DATA(31) on node 1
> 	[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> 	[    0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
> 	[133538.294040] Zone ranges:
> 	[133538.294338]   DMA      [mem 0x00001000-0x00ffffff]
> 	[133538.294804]   DMA32    [mem 0x01000000-0xffffffff]
> 	[133538.295223]   Normal   [mem 0x100000000-0x142fffffff]
> 	[133538.295670] Movable zone start for each node

OK, took me a while to fiddle with KVM and all the various muck around
that to reproduce. But I can confirm the below does fix the issue for
me.

I'm hoping not to have to re-introduce the keventd_up() check, but I
need to figure out what hardware triggered that and test again.

---
Subject: sched/clock: Fixup early sched_clock initialization
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 22 Jan 2014 12:59:18 +0100

The code would assume sched_clock_stable() and switch to !stable
later; this switch brings a discontinuity in time.

The discontinuity on switching from stable to unstable was always
present, but previously we would set stable/unstable before
initializing TSC and usually stick to the one we start out with.

So the static_key bits brought an extra switch where there previously
wasn't one.

Things are further complicated by the fact that we cannot use
static_key as early as we usually call set_sched_clock_stable().

Fix things by tracking the stable state in a regular variable and only
setting the static_key to the right state in sched_clock_init(), which is
run right after late_time_init->tsc_init().

Before this we would not be using the TSC anyway.

Fixes: 35af99e646c7 ("sched/clock, x86: Use a static_key for sched_clock_stable")
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: paulmck@linux.vnet.ibm.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: rui.zhang@intel.com
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: dyoung@redhat.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140122115918.GG3694@twins.programming.kicks-ass.net
---
 kernel/sched/clock.c |   53 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 41 insertions(+), 12 deletions(-)

--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -77,35 +77,50 @@ __read_mostly int sched_clock_running;
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+static int __sched_clock_stable_early;
 
 int sched_clock_stable(void)
 {
-	if (static_key_false(&__sched_clock_stable))
-		return false;
-	return true;
+	return static_key_false(&__sched_clock_stable);
 }
 
-void set_sched_clock_stable(void)
+static void __set_sched_clock_stable(void)
 {
 	if (!sched_clock_stable())
-		static_key_slow_dec(&__sched_clock_stable);
+		static_key_slow_inc(&__sched_clock_stable);
+}
+
+void set_sched_clock_stable(void)
+{
+	__sched_clock_stable_early = 1;
+
+	smp_mb(); /* matches sched_clock_init() */
+
+	if (!sched_clock_running)
+		return;
+
+	__set_sched_clock_stable();
 }
 
 static void __clear_sched_clock_stable(struct work_struct *work)
 {
 	/* XXX worry about clock continuity */
 	if (sched_clock_stable())
-		static_key_slow_inc(&__sched_clock_stable);
+		static_key_slow_dec(&__sched_clock_stable);
 }
 
 static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
 
 void clear_sched_clock_stable(void)
 {
-	if (keventd_up())
-		schedule_work(&sched_clock_work);
-	else
-		__clear_sched_clock_stable(&sched_clock_work);
+	__sched_clock_stable_early = 0;
+
+	smp_mb(); /* matches sched_clock_init() */
+
+	if (!sched_clock_running)
+		return;
+
+	schedule_work(&sched_clock_work);
 }
 
 struct sched_clock_data {
@@ -140,6 +155,20 @@ void sched_clock_init(void)
 	}
 
 	sched_clock_running = 1;
+
+	/*
+	 * Ensure that it is impossible to not do a static_key update.
+	 *
+	 * Either {set,clear}_sched_clock_stable() must see sched_clock_running
+	 * and do the update, or we must see their __sched_clock_stable_early
+	 * and do the update, or both.
+	 */
+	smp_mb(); /* matches {set,clear}_sched_clock_stable() */
+
+	if (__sched_clock_stable_early)
+		__set_sched_clock_stable();
+	else
+		__clear_sched_clock_stable(NULL);
 }
 
 /*
@@ -340,7 +369,7 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeu
  */
 u64 cpu_clock(int cpu)
 {
-	if (static_key_false(&__sched_clock_stable))
+	if (!sched_clock_stable())
 		return sched_clock_cpu(cpu);
 
 	return sched_clock();
@@ -355,7 +384,7 @@ u64 cpu_clock(int cpu)
  */
 u64 local_clock(void)
 {
-	if (static_key_false(&__sched_clock_stable))
+	if (!sched_clock_stable())
 		return sched_clock_cpu(raw_smp_processor_id());
 
 	return sched_clock();

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 14:26                 ` Sasha Levin
@ 2014-01-22 18:35                   ` Markus Trippelsdorf
  2014-01-22 18:42                     ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Markus Trippelsdorf @ 2014-01-22 18:35 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Peter Zijlstra, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On 2014.01.22 at 09:26 -0500, Sasha Levin wrote:
> On 01/22/2014 08:14 AM, Markus Trippelsdorf wrote:
> > On 2014.01.22 at 13:30 +0100, Peter Zijlstra wrote:
> >> >On Wed, Jan 22, 2014 at 01:26:09PM +0100, Markus Trippelsdorf wrote:
> >>> > >On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
> >>>> > > >On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> >>>>> > > > >FYI it happens on real hardware on my machine:
> >>>>> > > > >...
> >>>>> > > > >[    0.000000] Hierarchical RCU implementation.
> >>>>> > > > >[    0.000000] NR_IRQS:4352 nr_irqs:712 16
> >>>>> > > > >[    0.000000] spurious 8259A interrupt: IRQ7.
> >>>>> > > > >[    0.000000] Console: colour VGA+ 80x25
> >>>>> > > > >[    0.000000] console [tty0] enabled
> >>>>> > > > >[    0.000000] hpet clockevent registered
> >>>>> > > > >[    0.000000] tsc: Fast TSC calibration using PIT
> >>>>> > > > >[    0.003333] tsc: Detected 3210.681 MHz processor
> >>>>> > > > >[   60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> >>>>> > > > >[   60.375240] pid_max: default: 32768 minimum: 301
> >>>>> > > > >[   60.375259] Mount-cache hash table entries: 256
> >>>>> > > > >[   60.375373] tseg: 0000000000
> >>>>> > > > >[   60.375377] CPU: Physical Processor ID: 0
> >>>>> > > > >[   60.375377] CPU: Processor Core ID: 0
> >>>>> > > > >[   60.375378] mce: CPU supports 6 MCE banks
> >>>>> > > > >[   60.375382] LVT offset 0 assigned for vector 0xf9
> >>>>> > > > >[   60.375384] process: using AMD E400 aware idle routine
> >>>>> > > > >[   60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
> >>>> > > >
> >>>> > > >Should have always happened like that I think. From the log it looks
> >>>> > > >like the moment we switch from jiffies to actual TSC in
> >>>> > > >arch/x86/kernel/tsc.c:sched_clock().
> >>>> > > >
> >>>> > > >I don't think I changed the logic there, just switched from a condition
> >>>> > > >to a jump_label.
> >>> > >
> >>> > >Well, v3.13 was fine. So it's definitely a regression. But it may be
> >>> > >another issue. I will try to bisect later.
> >> >
> >> >OK, weird, I'll see if I can spot anything.
> > Unfortunately the issue is unbisectable (but the remaining commits are
> > all yours):
> 
> I've actually bisected it previously by fixing the build errors manually, and that took me
> to this patch you see in the subject line :)

But this is a different issue. I've bisected it to:

commit 20d1c86a57762f0a33a78988e3fc8818316badd4
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Nov 29 15:40:29 2013 +0100

    sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs

Reverting the commit "fixes" the issue:
...
[    0.000000] Hierarchical RCU implementation.
[    0.000000] NR_IRQS:4352 nr_irqs:712 16
[    0.000000] spurious 8259A interrupt: IRQ7.
[    0.000000] Console: colour VGA+ 80x25
[    0.000000] console [tty0] enabled
[    0.000000] hpet clockevent registered
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.003333] tsc: Detected 3211.075 MHz processor
[    0.000006] Calibrating delay loop (skipped), value calculated using timer frequency.. 6424.73 BogoMIPS (lpj=10703583)
[    0.000007] pid_max: default: 32768 minimum: 301
[    0.000026] Mount-cache hash table entries: 256
[    0.000139] tseg: 0000000000
[    0.000143] CPU: Physical Processor ID: 0
[    0.000144] CPU: Processor Core ID: 0
[    0.000145] mce: CPU supports 6 MCE banks
[    0.000148] LVT offset 0 assigned for vector 0xf9
[    0.000151] process: using AMD E400 aware idle routine
[    0.000152] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
...

-- 
Markus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 18:35                   ` Markus Trippelsdorf
@ 2014-01-22 18:42                     ` Peter Zijlstra
  2014-01-22 18:45                       ` Markus Trippelsdorf
  2014-01-22 19:09                       ` Markus Trippelsdorf
  0 siblings, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-22 18:42 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:

> >FYI it happens on real hardware on my machine:

> >[   60.375384] process: using AMD E400 aware idle routine

> But this is a different issue. I've bisected it to:
> 
> commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Fri Nov 29 15:40:29 2013 +0100
> 
>     sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
> 
> Reverting the commit "fixes" the issue:

Hurm.. what kind of AMD machine is that?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 18:42                     ` Peter Zijlstra
@ 2014-01-22 18:45                       ` Markus Trippelsdorf
  2014-01-22 19:17                         ` Josh Boyer
  2014-01-22 19:09                       ` Markus Trippelsdorf
  1 sibling, 1 reply; 67+ messages in thread
From: Markus Trippelsdorf @ 2014-01-22 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
> 
> > >FYI it happens on real hardware on my machine:
> 
> > >[   60.375384] process: using AMD E400 aware idle routine
> 
> > But this is a different issue. I've bisected it to:
> > 
> > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> > Author: Peter Zijlstra <peterz@infradead.org>
> > Date:   Fri Nov 29 15:40:29 2013 +0100
> > 
> >     sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
> > 
> > Reverting the commit "fixes" the issue:
> 
> Hurm.. what kind of AMD machine is that?

It's an ancient one:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 4
model name      : AMD Phenom(tm) II X4 955 Processor
stepping        : 2
microcode       : 0x10000db
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3
dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit 
wdt hw_pstate npt lbrv svm_lock nrip_save
bogomips        : 6424.73
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate



-- 
Markus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 18:42                     ` Peter Zijlstra
  2014-01-22 18:45                       ` Markus Trippelsdorf
@ 2014-01-22 19:09                       ` Markus Trippelsdorf
  2014-01-22 19:12                         ` Markus Trippelsdorf
  1 sibling, 1 reply; 67+ messages in thread
From: Markus Trippelsdorf @ 2014-01-22 19:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
> 
> > >FYI it happens on real hardware on my machine:
> 
> > >[   60.375384] process: using AMD E400 aware idle routine
> 
> > But this is a different issue. I've bisected it to:
> > 
> > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> > Author: Peter Zijlstra <peterz@infradead.org>
> > Date:   Fri Nov 29 15:40:29 2013 +0100
> > 
> >     sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
> > 
> > Reverting the commit "fixes" the issue:
> 
> Hurm..

Turns out the fix is simple:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a3acbac2ee72..d90f7a11d573 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1198,8 +1198,8 @@ void __init tsc_init(void)
 	 * up if their speed diverges)
 	 */
 	for_each_possible_cpu(cpu) {
-		cyc2ns_init(cpu);
 		set_cyc2ns_scale(cpu_khz, cpu);
+		cyc2ns_init(cpu);
 	}
 
 	if (tsc_disabled > 0)

-- 
Markus

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 19:09                       ` Markus Trippelsdorf
@ 2014-01-22 19:12                         ` Markus Trippelsdorf
  2014-01-22 20:16                           ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Markus Trippelsdorf @ 2014-01-22 19:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On 2014.01.22 at 20:09 +0100, Markus Trippelsdorf wrote:
> On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
> > 
> > > >FYI it happens on real hardware on my machine:
> > 
> > > >[   60.375384] process: using AMD E400 aware idle routine
> > 
> > > But this is a different issue. I've bisected it to:
> > > 
> > > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> > > Author: Peter Zijlstra <peterz@infradead.org>
> > > Date:   Fri Nov 29 15:40:29 2013 +0100
> > > 
> > >     sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
> > > 
> > > Reverting the commit "fixes" the issue:
> > 
> > Hurm..
> 
> Turns out the fix is simple:

No. It isn't. I've tested the wrong kernel. Please ignore.

Thanks.

-- 
Markus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 18:45                       ` Markus Trippelsdorf
@ 2014-01-22 19:17                         ` Josh Boyer
  0 siblings, 0 replies; 67+ messages in thread
From: Josh Boyer @ 2014-01-22 19:17 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Peter Zijlstra, Sasha Levin, Arjan van de Ven, Len Brown,
	Rafael J. Wysocki, Eliezer Tamir, Zhang Rui, jacob.jun.pan,
	Mike Galbraith, Ingo Molnar, H. Peter Anvin, Paul McKenney,
	Thomas Gleixner, John Stultz, Andy Lutomirski,
	Linux-Kernel@Vger. Kernel. Org, Dave Young

On Wed, Jan 22, 2014 at 1:45 PM, Markus Trippelsdorf
<markus@trippelsdorf.de> wrote:
> On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
>> On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
>>
>> > >FYI it happens on real hardware on my machine:
>>
>> > >[   60.375384] process: using AMD E400 aware idle routine
>>
>> > But this is a different issue. I've bisected it to:
>> >
>> > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
>> > Author: Peter Zijlstra <peterz@infradead.org>
>> > Date:   Fri Nov 29 15:40:29 2013 +0100
>> >
>> >     sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
>> >
>> > Reverting the commit "fixes" the issue:
>>
>> Hurm.. what kind of AMD machine is that?
>
> It's an ancient one:
>
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 16
> model           : 4
> model name      : AMD Phenom(tm) II X4 955 Processor
> stepping        : 2
> microcode       : 0x10000db
> cpu MHz         : 800.000
> cache size      : 512 KB
> physical id     : 0
> siblings        : 4
> core id         : 0
> cpu cores       : 4
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 5
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3
> dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit
> wdt hw_pstate npt lbrv svm_lock nrip_save
> bogomips        : 6424.73
> TLB size        : 1024 4K pages
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate

I'm seeing the odd printk timestamp jump going from 0.xxx to 49.xx on
an i7-2600, so it isn't just old hardware.

[    0.000000] kmemleak: Kernel memory leak detector disabled
[    0.000000] ODEBUG: 32 of 32 active objects replaced
[    0.000000] hpet clockevent registered
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.001000] tsc: Detected 3392.351 MHz processor
[   49.777360] Calibrating delay loop (skipped), value calculated
using timer frequency.. 6784.70 BogoMIPS (lpj=3392351)
[   49.777363] pid_max: default: 32768 minimum: 301
[   49.777572] Security Framework initialized


josh

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 19:12                         ` Markus Trippelsdorf
@ 2014-01-22 20:16                           ` Peter Zijlstra
  2014-01-22 21:08                             ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-22 20:16 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On Wed, Jan 22, 2014 at 08:12:54PM +0100, Markus Trippelsdorf wrote:
> On 2014.01.22 at 20:09 +0100, Markus Trippelsdorf wrote:
> > On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
> > > 
> > > > >FYI it happens on real hardware on my machine:
> > > 
> > > > >[   60.375384] process: using AMD E400 aware idle routine
> > > 
> > > > But this is a different issue. I've bisected it to:
> > > > 
> > > > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> > > > Author: Peter Zijlstra <peterz@infradead.org>
> > > > Date:   Fri Nov 29 15:40:29 2013 +0100
> > > > 
> > > >     sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
> > > > 
> > > > Reverting the commit "fixes" the issue:
> > > 
> > > Hurm..
> > 
> > Turns out the fix is simple:
> 
> No. It isn't. I've tested the wrong kernel. Please ignore.

I think it's the right region to look through. My current suspect is the
linear continuity fit with the initial 'random' multiplier.

That initial 'random' multiplier can get us quite high, and we'll fit
the function to match that but continue at a sane rate.
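
For reference, my mental model of that fit -- a simplified sketch of what
set_cyc2ns_scale() does, not the actual tsc.c code:

	/*
	 * sched_clock time is computed as:
	 *
	 *	ns(cyc) = (((cyc - tsc_now) * mult) >> shift) + ns_now
	 *
	 * where ns_now = cycles_2_ns(tsc_now) is evaluated with the *old*
	 * multiplier, so that the clock stays continuous across a re-scale.
	 *
	 * If the very first multiplier is that essentially random boot
	 * value, ns_now is already way off by tsc_init() time and the fit
	 * dutifully carries that offset forward at the (now sane) rate.
	 */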

I'll try and prod a little more later this evening as time permits.



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 20:16                           ` Peter Zijlstra
@ 2014-01-22 21:08                             ` Peter Zijlstra
  2014-01-22 21:17                               ` Markus Trippelsdorf
                                                 ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-22 21:08 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

> 
> I think its the right region to look through. My current suspect is the
> linear continuity fit with the initial 'random' multiplier.
> 
> That initial 'random' multiplier can get us quite high, and we'll fit
> the function to match that but continue at a sane rate.
> 
> I'll try and prod a little more later this evening as time permits.

Does this cure things?

---
 arch/x86/kernel/tsc.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a3acbac2ee72..bb04148c5fe0 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -237,7 +237,7 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 /* XXX surely we already have this someplace in the kernel?! */
 #define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))
 
-static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu, bool origin)
 {
 	unsigned long long tsc_now, ns_now;
 	struct cyc2ns_data *data;
@@ -252,7 +252,10 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 	data = cyc2ns_write_begin(cpu);
 
 	rdtscll(tsc_now);
-	ns_now = cycles_2_ns(tsc_now);
+	if (origin)
+		ns_now = 0;
+	else
+		ns_now = cycles_2_ns(tsc_now);
 
 	/*
 	 * Compute a new multiplier as per the above comment and ensure our
@@ -926,7 +929,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
 			mark_tsc_unstable("cpufreq changes");
 	}
 
-	set_cyc2ns_scale(tsc_khz, freq->cpu);
+	set_cyc2ns_scale(tsc_khz, freq->cpu, false);
 
 	return 0;
 }
@@ -1199,7 +1202,7 @@ void __init tsc_init(void)
 	 */
 	for_each_possible_cpu(cpu) {
 		cyc2ns_init(cpu);
-		set_cyc2ns_scale(cpu_khz, cpu);
+		set_cyc2ns_scale(cpu_khz, cpu, true);
 	}
 
 	if (tsc_disabled > 0)

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 21:08                             ` Peter Zijlstra
@ 2014-01-22 21:17                               ` Markus Trippelsdorf
  2014-01-23  9:48                                 ` Peter Zijlstra
  2014-01-22 23:53                               ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Josh Boyer
  2014-01-23  1:53                               ` Dave Young
  2 siblings, 1 reply; 67+ messages in thread
From: Markus Trippelsdorf @ 2014-01-22 21:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On 2014.01.22 at 22:08 +0100, Peter Zijlstra wrote:
> > 
> > I think its the right region to look through. My current suspect is the
> > linear continuity fit with the initial 'random' multiplier.
> > 
> > That initial 'random' multiplier can get us quite high, and we'll fit
> > the function to match that but continue at a sane rate.
> > 
> > I'll try and prod a little more later this evening as time permits.
> 
> Does this cure things?

Yes. Thanks Peter.

-- 
Markus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 17:14     ` Peter Zijlstra
@ 2014-01-22 22:31       ` Sasha Levin
  0 siblings, 0 replies; 67+ messages in thread
From: Sasha Levin @ 2014-01-22 22:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, paulmck,
	Thomas Gleixner, John Stultz, Andy Lutomirski, linux-kernel

On 01/22/2014 12:14 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 05:28:37PM -0500, Sasha Levin wrote:
>> 	[    0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
>> 	[    0.000000]   NODE_DATA [mem 0xcfa42000-0xcfa72fff]
>> 	[    0.000000]     NODE_DATA(30) on node 1
>> 	[    0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
>> 	[    0.000000]   NODE_DATA [mem 0xcfa11000-0xcfa41fff]
>> 	[    0.000000]     NODE_DATA(31) on node 1
>> 	[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
>> 	[    0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
>> 	[133538.294040] Zone ranges:
>> 	[133538.294338]   DMA      [mem 0x00001000-0x00ffffff]
>> 	[133538.294804]   DMA32    [mem 0x01000000-0xffffffff]
>> 	[133538.295223]   Normal   [mem 0x100000000-0x142fffffff]
>> 	[133538.295670] Movable zone start for each node
>
> OK, took me a while to fiddle with KVM and all the various muck around
> that to reproduce. But I can confirm the below does fix the issue for
> me.
>
> I'm hoping to not have to re-introcude the kevents_up() check, but I
> need to figure out what hardware triggered that and test again.
>
> ---
> Subject: sched/clock: Fixup early sched_clock initialization
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed, 22 Jan 2014 12:59:18 +0100
>
> The code would assume sched_clock_stable() and switch to !stable
> later, this switch brings a discontinuity in time.
>
> The discontinuity on switching from stable to unstable was always
> present, but previously we would set stable/unstable before
> initializing TSC and usually stick to the one we start out with.
>
> So the static_key bits brought an extra switch where there previously
> wasn't one.
>
> Things are further complicated by the fact that we cannot use
> static_key as early as we usually call set_sched_clock_stable().
>
> Fix things by tracking the stable state in a regular variable and only
> set the static_key to the right state on sched_clock_init(), which is
> ran right after late_time_init->tsc_init().
>
> Before this we would not be using the TSC anyway.
>
> Fixes: 35af99e646c7 ("sched/clock, x86: Use a static_key for sched_clock_stable")
> Cc: jacob.jun.pan@linux.intel.com
> Cc: Mike Galbraith <bitbucket@online.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: hpa@zytor.com
> Cc: paulmck@linux.vnet.ibm.com
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: lenb@kernel.org
> Cc: rjw@rjwysocki.net
> Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
> Cc: rui.zhang@intel.com
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Reported-by: dyoung@redhat.com
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20140122115918.GG3694@twins.programming.kicks-ass.net

The patch fixes the problem for me.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 21:08                             ` Peter Zijlstra
  2014-01-22 21:17                               ` Markus Trippelsdorf
@ 2014-01-22 23:53                               ` Josh Boyer
  2014-01-23  1:53                               ` Dave Young
  2 siblings, 0 replies; 67+ messages in thread
From: Josh Boyer @ 2014-01-22 23:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Markus Trippelsdorf, Sasha Levin, Arjan van de Ven, Len Brown,
	Rafael J. Wysocki, Eliezer Tamir, Zhang Rui, jacob.jun.pan,
	Mike Galbraith, Ingo Molnar, H. Peter Anvin, Paul McKenney,
	Thomas Gleixner, John Stultz, Andy Lutomirski,
	Linux-Kernel@Vger. Kernel. Org, Dave Young

On Wed, Jan 22, 2014 at 4:08 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> I think its the right region to look through. My current suspect is the
>> linear continuity fit with the initial 'random' multiplier.
>>
>> That initial 'random' multiplier can get us quite high, and we'll fit
>> the function to match that but continue at a sane rate.
>>
>> I'll try and prod a little more later this evening as time permits.
>
> Does this cure things?

This seems to correct the time jump I was seeing on my i7 box as well.

josh

>
> ---
>  arch/x86/kernel/tsc.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index a3acbac2ee72..bb04148c5fe0 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -237,7 +237,7 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
>  /* XXX surely we already have this someplace in the kernel?! */
>  #define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))
>
> -static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> +static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu, bool origin)
>  {
>         unsigned long long tsc_now, ns_now;
>         struct cyc2ns_data *data;
> @@ -252,7 +252,10 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
>         data = cyc2ns_write_begin(cpu);
>
>         rdtscll(tsc_now);
> -       ns_now = cycles_2_ns(tsc_now);
> +       if (origin)
> +               ns_now = 0;
> +       else
> +               ns_now = cycles_2_ns(tsc_now);
>
>         /*
>          * Compute a new multiplier as per the above comment and ensure our
> @@ -926,7 +929,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
>                         mark_tsc_unstable("cpufreq changes");
>         }
>
> -       set_cyc2ns_scale(tsc_khz, freq->cpu);
> +       set_cyc2ns_scale(tsc_khz, freq->cpu, false);
>
>         return 0;
>  }
> @@ -1199,7 +1202,7 @@ void __init tsc_init(void)
>          */
>         for_each_possible_cpu(cpu) {
>                 cyc2ns_init(cpu);
> -               set_cyc2ns_scale(cpu_khz, cpu);
> +               set_cyc2ns_scale(cpu_khz, cpu, true);
>         }
>
>         if (tsc_disabled > 0)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 11:59       ` Peter Zijlstra
@ 2014-01-23  1:53         ` Dave Young
  2014-01-23 16:46         ` [tip:sched/urgent] sched/clock: Fixup early initialization tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 67+ messages in thread
From: Dave Young @ 2014-01-23  1:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel

On 01/22/14 at 12:59pm, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 11:45:32AM +0100, Peter Zijlstra wrote:
> > Ho humm.
> 
> OK, so I had me a ponder; does the below fix things for you and David?
> I've only done a boot test on real proper hardware :-)
> 
> ---
>  kernel/sched/clock.c | 42 +++++++++++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index 6bd6a6731b21..6bbcd97f4532 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -77,35 +77,45 @@ __read_mostly int sched_clock_running;
>  
>  #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
>  static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
> +static int __sched_clock_stable_early;
>  
>  int sched_clock_stable(void)
>  {
> -	if (static_key_false(&__sched_clock_stable))
> -		return false;
> -	return true;
> +	return static_key_false(&__sched_clock_stable);
>  }
>  
>  void set_sched_clock_stable(void)
>  {
> +	__sched_clock_stable_early = 1;
> +
> +	smp_mb(); /* matches sched_clock_init() */
> +
> +	if (!sched_clock_running)
> +		return;
> +
>  	if (!sched_clock_stable())
> -		static_key_slow_dec(&__sched_clock_stable);
> +		static_key_slow_inc(&__sched_clock_stable);
>  }
>  
>  static void __clear_sched_clock_stable(struct work_struct *work)
>  {
>  	/* XXX worry about clock continuity */
>  	if (sched_clock_stable())
> -		static_key_slow_inc(&__sched_clock_stable);
> +		static_key_slow_dec(&__sched_clock_stable);
>  }
>  
>  static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
>  
>  void clear_sched_clock_stable(void)
>  {
> -	if (keventd_up())
> -		schedule_work(&sched_clock_work);
> -	else
> -		__clear_sched_clock_stable(&sched_clock_work);
> +	__sched_clock_stable_early = 0;
> +
> +	smp_mb(); /* matches sched_clock_init() */
> +
> +	if (!sched_clock_running)
> +		return;
> +
> +	schedule_work(&sched_clock_work);
>  }
>  
>  struct sched_clock_data {
> @@ -140,6 +150,20 @@ void sched_clock_init(void)
>  	}
>  
>  	sched_clock_running = 1;
> +
> +	/*
> +	 * Ensure that it is impossible to not do a static_key update.
> +	 *
> +	 * Either {set,clear}_sched_clock_stable() must see sched_clock_running
> +	 * and do the update, or we must see their __sched_clock_stable_early
> +	 * and do the update, or both.
> +	 */
> +	smp_mb(); /* matches {set,clear}_sched_clock_stable() */
> +
> +	if (__sched_clock_stable_early)
> +		set_sched_clock_stable();
> +	else
> +		clear_sched_clock_stable();
>  }
>  
>  /*

It does not fix the prink time issue, here is the log:
[    0.000000] efi: mem26: type=6, attr=0x800000000000000f, range=[0x000000000dbe0000-0x000000000dc00000) (0MB)
[    0.000000] DMI not present or invalid.
[    0.000000] Hypervisor detected: KVM
[    0.000000] e820: last_pfn = 0xdbe0 max_arch_pfn = 0x400000000
[    0.000000] PAT not supported by CPU.
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000] init_memory_mapping: [mem 0x0aa00000-0x0abfffff]
[    0.000000] init_memory_mapping: [mem 0x08000000-0x0a9fffff]
[    0.000000] init_memory_mapping: [mem 0x00100000-0x07ffffff]
[    0.000000] init_memory_mapping: [mem 0x0ac00000-0x0bd93fff]
[    0.000000] init_memory_mapping: [mem 0x0bdc1000-0x0d580fff]
[    0.000000] init_memory_mapping: [mem 0x0d5e5000-0x0dbdffff]
[    0.000000] RAMDISK: [mem 0x0ac0e000-0x0b583fff]
[    0.000000] ACPI: RSDP 000000000d5e0014 000024 (v02 OVMF  )
[    0.000000] ACPI: XSDT 000000000d5df0e8 00003C (v01 OVMF   OVMFEDK2 20130221      01000013)
[    0.000000] ACPI: FACP 000000000d5de000 0000F4 (v03 OVMF   OVMFEDK2 20130221 OVMF 00000099)
[    0.000000] ACPI: DSDT 000000000d5dc000 000D57 (v01 INTEL  OVMF     00000004 INTL 20120913)
[    0.000000] ACPI: FACS 000000000d5e4000 000040
[    0.000000] ACPI: APIC 000000000d5dd000 000078 (v01 OVMF   OVMFEDK2 20130221 OVMF 00000099)
[    0.000000] ACPI: SSDT 000000000d5db000 000057 (v01 REDHAT OVMF     00000001 INTL 20120913)
[    0.000000] crashkernel reservation failed - No suitable area found.
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000000] kvm-clock: cpu 0, msr 0:d401001, boot clock
[65465.267798] Zone ranges:
[65465.268914]   DMA      [mem 0x00001000-0x00ffffff]
[65465.271107]   DMA32    [mem 0x01000000-0xffffffff]
[65465.273348]   Normal   empty
[65465.274683] Movable zone start for each node
[65465.276646] Early memory node ranges
[65465.278321]   node   0: [mem 0x00001000-0x0009ffff]
[65465.280572]   node   0: [mem 0x00100000-0x0bd93fff]
[65465.282825]   node   0: [mem 0x0bdc1000-0x0d580fff]
[65465.285084]   node   0: [mem 0x0d5e5000-0x0dbdffff]
[65465.289251] ACPI: PM-Timer IO Port: 0xb008
[65465.291105] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[65465.293766] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[65465.296413] ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
[65465.299460] IOAPIC[0]: apic_id 1, version 17, address 0xfec00000, GSI 0-23
[65465.302607] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[65465.305524] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[65465.308622] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[65465.311685] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[65465.314766] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[65465.317792] Using ACPI (MADT) for SMP configuration information
[65465.320608] smpboot: Allowing 1 CPUs, 0 hotplug CPUs
[65465.322958] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[65465.325861] PM: Registered nosave memory: [mem 0x0bd94000-0x0bdc0fff]
[65465.328809] PM: Registered nosave memory: [mem 0x0d581000-0x0d5d8fff]
[65465.331770] PM: Registered nosave memory: [mem 0x0d5d9000-0x0d5e0fff]
[65465.334716] PM: Registered nosave memory: [mem 0x0d5e1000-0x0d5e4fff]
[65465.337723] e820: [mem 0x0dc00000-0xffffffff] available for PCI devices
[65465.340880] Booting paravirtualized kernel on KVM
[65465.343045] setup_percpu: NR_CPUS:16 nr_cpumask_bits:16 nr_cpu_ids:1 nr_node_ids:1
[65465.346736] PERCPU: Embedded 28 pages/cpu @ffff88000a800000 s83392 r8192 d23104 u2097152
[65465.350469] kvm-clock: cpu 0, msr 0:d401001, primary cpu clock
[65465.353143] KVM setup async PF for cpu 0
[65465.354969] kvm-stealtime: cpu 0, msr a80dfc0
[65465.357124] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 53096
[65465.360905] Kernel command line: root=UUID=4522081c-614f-43ba-927b-1ef26d69fe20 ro console=ttyS0 earlyprintk=serial,ttyS0 nomodeset selinux=0 crashkernel=128M
[65465.367711] PID hash table entries: 1024 (order: 1, 8192 bytes)
[65465.370534] Dentry cache hash table entries: 32768 (order: 6, 262144 bytes)
[65465.373903] Inode-cache hash table entries: 16384 (order: 5, 131072 bytes)
[65465.377467] Memory: 144368K/224184K available (4748K kernel code, 788K rwdata, 2376K rodata, 888K init, 8968K bss, 79816K reserved)
[65465.382968] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[65465.386095] Preemptible hierarchical RCU implementation.
[65465.388602] 	RCU debugfs-based tracing is enabled.
[65465.390851] 	CONFIG_RCU_FANOUT set to non-default value of 32
[65465.393569] 	RCU dyntick-idle grace-period acceleration is enabled.
[65465.396494] 	RCU restricting CPUs from NR_CPUS=16 to nr_cpu_ids=1.
[65465.399422] 	Offload RCU callbacks from all CPUs
[65465.401594] 	Offload RCU callbacks from CPUs: 0.
[65465.403781] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[65465.406906] NO_HZ: Full dynticks CPUs: 1-15.
[65465.408963] NR_IRQS:4352 nr_irqs:256 16
[65465.411104] Console: colour dummy device 80x25
[65465.413229] console [ttyS0] enabled
[65465.413229] console [ttyS0] enabled
[65465.416579] bootconsole [earlyser0] disabled
[65465.416579] bootconsole [earlyser0] disabled
[65465.420729] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[65465.424454] ... MAX_LOCKDEP_SUBCLASSES:  8
[65465.426418] ... MAX_LOCK_DEPTH:          48
[65465.428422] ... MAX_LOCKDEP_KEYS:        8191
[65465.430509] ... CLASSHASH_SIZE:          4096
[65465.432576] ... MAX_LOCKDEP_ENTRIES:     16384
[65465.434717] ... MAX_LOCKDEP_CHAINS:      32768
[65465.436858] ... CHAINHASH_SIZE:          16384
[65465.438991]  memory used by lock dependency info: 5855 kB
[65465.441630]  per task-struct memory footprint: 1920 bytes
[65465.444477] tsc: Detected 2793.268 MHz processor
[65465.446663] BUG: unable to handle kernel NULL pointer dereference at 0000000000000182
[65465.450471] IP: [<ffffffff81074023>] __queue_work+0x45/0x1ee
[65465.453195] PGD 0 
[65465.454270] Oops: 0000 [#1] PREEMPT SMP 
[65465.456286] Modules linked in:
[65465.457815] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0+ #15
[65465.460610] task: ffffffff81719490 ti: ffffffff816f8000 task.ti: ffffffff816f8000
[65465.464099] RIP: 0010:[<ffffffff81074023>]  [<ffffffff81074023>] __queue_work+0x45/0x1ee
[65465.467932] RSP: 0000:ffffffff816f9eb8  EFLAGS: 00010046
[65465.470418] RAX: 0000000000000006 RBX: 0000000000000292 RCX: 0000000000000030
[65465.473790] RDX: ffffffff817328e0 RSI: 0000000000000000 RDI: 0000000000000010
[65465.477159] RBP: ffffffff816f9ee8 R08: ffffffff817b6ac8 R09: 00000000ffffffff
[65465.480531] R10: 00000000fffea071 R11: 0000000225c17d03 R12: 0000000000000000
[65465.483927] R13: ffffffff817328e0 R14: ffffffff81857ac0 R15: 000000000b584000
[65465.487348] FS:  0000000000000000(0000) GS:ffff88000a800000(0000) knlGS:0000000000000000
[65465.491138] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[65465.493900] CR2: 0000000000000182 CR3: 0000000002714000 CR4: 00000000000006b0
[65465.497214] Stack:
[65465.498208]  0000001081857ac0 0000000000000292 000000000032dcd5 ffff88000b585680
[65465.501853]  ffffffff81857ac0 000000000b584000 ffffffff816f9f20 ffffffff8107420f
[65465.505492]  00000010816f9f30 0000000000000000 ffffffff817328e0 0000000000014280
[65465.509180] Call Trace:
[65465.510385]  [<ffffffff8107420f>] queue_work_on+0x43/0x7c
[65465.512932]  [<ffffffff810868a5>] clear_sched_clock_stable+0x32/0x34
[65465.515985]  [<ffffffff81086921>] sched_clock_init+0x7a/0x7f
[65465.518696]  [<ffffffff817d4cd8>] start_kernel+0x351/0x3fa
[65465.521367]  [<ffffffff817d4795>] ? repair_env_string+0x58/0x58
[65465.524159]  [<ffffffff817d4120>] ? early_idt_handlers+0x120/0x120
[65465.527044]  [<ffffffff817d4498>] x86_64_start_reservations+0x2a/0x2c
[65465.530038]  [<ffffffff817d458d>] x86_64_start_kernel+0xf3/0x100
[65465.532832] Code: 25 30 d2 72 81 f6 c4 02 74 21 80 3d 91 22 73 00 00 75 18 be 31 05 00 00 48 c7 c7 ca 0b 64 81 e8 80 c9 fe ff c6 05 77 22 73 00 01 <41> f6 84 24 82 01 00 00 01 74 59 65 48 8b 3c 25 c0 c9 00 00 f6 
[65465.544401] RIP  [<ffffffff81074023>] __queue_work+0x45/0x1ee
[65465.547201]  RSP <ffffffff816f9eb8>
[65465.548880] CR2: 0000000000000182
[65465.550462] ---[ end trace 8bf023a4e6e5d79e ]---
[65465.552655] Kernel panic - not syncing: Attempted to kill the idle task!

Thanks
Dave


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 21:08                             ` Peter Zijlstra
  2014-01-22 21:17                               ` Markus Trippelsdorf
  2014-01-22 23:53                               ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Josh Boyer
@ 2014-01-23  1:53                               ` Dave Young
  2014-01-23  2:10                                 ` Dave Young
  2 siblings, 1 reply; 67+ messages in thread
From: Dave Young @ 2014-01-23  1:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Markus Trippelsdorf, Sasha Levin, Arjan van de Ven, lenb, rjw,
	Eliezer Tamir, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, paulmck, Thomas Gleixner, John Stultz,
	Andy Lutomirski, linux-kernel

On 01/22/14 at 10:08pm, Peter Zijlstra wrote:
> > 
> > I think its the right region to look through. My current suspect is the
> > linear continuity fit with the initial 'random' multiplier.
> > 
> > That initial 'random' multiplier can get us quite high, and we'll fit
> > the function to match that but continue at a sane rate.
> > 
> > I'll try and prod a little more later this evening as time permits.
> 
> Does this cure things?

Peter, the odd timestamp still happens with this patch for me.

> 
> ---
>  arch/x86/kernel/tsc.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index a3acbac2ee72..bb04148c5fe0 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -237,7 +237,7 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
>  /* XXX surely we already have this someplace in the kernel?! */
>  #define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))
>  
> -static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> +static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu, bool origin)
>  {
>  	unsigned long long tsc_now, ns_now;
>  	struct cyc2ns_data *data;
> @@ -252,7 +252,10 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
>  	data = cyc2ns_write_begin(cpu);
>  
>  	rdtscll(tsc_now);
> -	ns_now = cycles_2_ns(tsc_now);
> +	if (origin)
> +		ns_now = 0;
> +	else
> +		ns_now = cycles_2_ns(tsc_now);
>  
>  	/*
>  	 * Compute a new multiplier as per the above comment and ensure our
> @@ -926,7 +929,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
>  			mark_tsc_unstable("cpufreq changes");
>  	}
>  
> -	set_cyc2ns_scale(tsc_khz, freq->cpu);
> +	set_cyc2ns_scale(tsc_khz, freq->cpu, false);
>  
>  	return 0;
>  }
> @@ -1199,7 +1202,7 @@ void __init tsc_init(void)
>  	 */
>  	for_each_possible_cpu(cpu) {
>  		cyc2ns_init(cpu);
> -		set_cyc2ns_scale(cpu_khz, cpu);
> +		set_cyc2ns_scale(cpu_khz, cpu, true);
>  	}
>  
>  	if (tsc_disabled > 0)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-23  1:53                               ` Dave Young
@ 2014-01-23  2:10                                 ` Dave Young
  2014-01-23 16:56                                   ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Dave Young @ 2014-01-23  2:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Markus Trippelsdorf, Sasha Levin, Arjan van de Ven, lenb, rjw,
	Eliezer Tamir, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, paulmck, Thomas Gleixner, John Stultz,
	Andy Lutomirski, linux-kernel

On 01/23/14 at 09:53am, Dave Young wrote:
> On 01/22/14 at 10:08pm, Peter Zijlstra wrote:
> > > 
> > > I think its the right region to look through. My current suspect is the
> > > linear continuity fit with the initial 'random' multiplier.
> > > 
> > > That initial 'random' multiplier can get us quite high, and we'll fit
> > > the function to match that but continue at a sane rate.
> > > 
> > > I'll try and prod a little more later this evening as time permits.
> > 
> > Does this cure things?
> 
> Peter, the odd timestamp still happens with this patch for me.

Hmm, seems my physical machine is booting fine with this patch. The kvm
guest problem still exists, but that kvm thing might be a different problem.

> 
> > 
> > ---
> >  arch/x86/kernel/tsc.c | 11 +++++++----
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index a3acbac2ee72..bb04148c5fe0 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -237,7 +237,7 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> >  /* XXX surely we already have this someplace in the kernel?! */
> >  #define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))
> >  
> > -static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> > +static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu, bool origin)
> >  {
> >  	unsigned long long tsc_now, ns_now;
> >  	struct cyc2ns_data *data;
> > @@ -252,7 +252,10 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> >  	data = cyc2ns_write_begin(cpu);
> >  
> >  	rdtscll(tsc_now);
> > -	ns_now = cycles_2_ns(tsc_now);
> > +	if (origin)
> > +		ns_now = 0;
> > +	else
> > +		ns_now = cycles_2_ns(tsc_now);
> >  
> >  	/*
> >  	 * Compute a new multiplier as per the above comment and ensure our
> > @@ -926,7 +929,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
> >  			mark_tsc_unstable("cpufreq changes");
> >  	}
> >  
> > -	set_cyc2ns_scale(tsc_khz, freq->cpu);
> > +	set_cyc2ns_scale(tsc_khz, freq->cpu, false);
> >  
> >  	return 0;
> >  }
> > @@ -1199,7 +1202,7 @@ void __init tsc_init(void)
> >  	 */
> >  	for_each_possible_cpu(cpu) {
> >  		cyc2ns_init(cpu);
> > -		set_cyc2ns_scale(cpu_khz, cpu);
> > +		set_cyc2ns_scale(cpu_khz, cpu, true);
> >  	}
> >  
> >  	if (tsc_disabled > 0)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-22 21:17                               ` Markus Trippelsdorf
@ 2014-01-23  9:48                                 ` Peter Zijlstra
  2014-01-23 10:01                                   ` Markus Trippelsdorf
                                                     ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-23  9:48 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On Wed, Jan 22, 2014 at 10:17:40PM +0100, Markus Trippelsdorf wrote:
> Yes. Thanks Peter.
> 

Ah, a much simpler patch that should have the same effect:

---
Subject: sched/x86/tsc: Initialize multiplier to 0
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 22 Jan 2014 22:08:14 +0100

Since we keep the clock value linearly continuous on frequency change,
make sure the initial multiplier is 0, such that our initial value is
0. Without this we compute the initial value at whatever the TSC has
managed to reach since power-on.

Fixes: 20d1c86a57762 ("sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs")
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: paulmck@linux.vnet.ibm.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: dyoung@redhat.com
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kernel/tsc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -180,7 +180,7 @@ static void cyc2ns_write_end(int cpu, st
 
 static void cyc2ns_data_init(struct cyc2ns_data *data)
 {
-	data->cyc2ns_mul = 1U << CYC2NS_SCALE_FACTOR;
+	data->cyc2ns_mul = 0;
 	data->cyc2ns_shift = CYC2NS_SCALE_FACTOR;
 	data->cyc2ns_offset = 0;
 	data->__count = 0;

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-23  9:48                                 ` Peter Zijlstra
@ 2014-01-23 10:01                                   ` Markus Trippelsdorf
  2014-01-23 10:04                                     ` Peter Zijlstra
  2014-01-23 13:32                                   ` Josh Boyer
  2014-01-23 16:46                                   ` [tip:sched/urgent] sched/x86/tsc: Initialize multiplier to 0 tip-bot for Peter Zijlstra
  2 siblings, 1 reply; 67+ messages in thread
From: Markus Trippelsdorf @ 2014-01-23 10:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On 2014.01.23 at 10:48 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 10:17:40PM +0100, Markus Trippelsdorf wrote:
> 
> Ah much simpler patch that should have the same effect:

Yes. FWIW it also seems to be fine.

-- 
Markus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-23 10:01                                   ` Markus Trippelsdorf
@ 2014-01-23 10:04                                     ` Peter Zijlstra
  0 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-23 10:04 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Sasha Levin, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	paulmck, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, dyoung

On Thu, Jan 23, 2014 at 11:01:00AM +0100, Markus Trippelsdorf wrote:
> On 2014.01.23 at 10:48 +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2014 at 10:17:40PM +0100, Markus Trippelsdorf wrote:
> > 
> > Ah much simpler patch that should have the same effect:
> 
> Yes. FWIW it also seems to be fine.

Thanks.. the wonders of sleep eh :-)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-23  9:48                                 ` Peter Zijlstra
  2014-01-23 10:01                                   ` Markus Trippelsdorf
@ 2014-01-23 13:32                                   ` Josh Boyer
  2014-01-23 16:46                                   ` [tip:sched/urgent] sched/x86/tsc: Initialize multiplier to 0 tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 67+ messages in thread
From: Josh Boyer @ 2014-01-23 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Markus Trippelsdorf, Sasha Levin, Arjan van de Ven, Len Brown,
	Rafael J. Wysocki, Eliezer Tamir, Zhang Rui, jacob.jun.pan,
	Mike Galbraith, Ingo Molnar, H. Peter Anvin, Paul McKenney,
	Thomas Gleixner, John Stultz, Andy Lutomirski,
	Linux-Kernel@Vger. Kernel. Org, Dave Young

On Thu, Jan 23, 2014 at 4:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jan 22, 2014 at 10:17:40PM +0100, Markus Trippelsdorf wrote:
>> Yes. Thanks Peter.
>>
>
> Ah much simpler patch that should have the same effect:

This fixes the issue on my baremetal i7 machine as well.

josh

> ---
> Subject: sched/x86/tsc: Initialize multiplier to 0
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed, 22 Jan 2014 22:08:14 +0100
>
> Since we keep the clock value linearly continuous on frequency change,
> make sure the initial multiplier is 0, such that out initial value is
> 0. Without this we compute the initial value at whatever the TSC has
> managed to reach since power-on.
>
> Fixes: 20d1c86a57762 ("sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs")
> Cc: lenb@kernel.org
> Cc: rjw@rjwysocki.net
> Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
> Cc: rui.zhang@intel.com
> Cc: jacob.jun.pan@linux.intel.com
> Cc: Mike Galbraith <bitbucket@online.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: hpa@zytor.com
> Cc: paulmck@linux.vnet.ibm.com
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Sasha Levin <sasha.levin@oracle.com>
> Cc: dyoung@redhat.com
> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/x86/kernel/tsc.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -180,7 +180,7 @@ static void cyc2ns_write_end(int cpu, st
>
>  static void cyc2ns_data_init(struct cyc2ns_data *data)
>  {
> -       data->cyc2ns_mul = 1U << CYC2NS_SCALE_FACTOR;
> +       data->cyc2ns_mul = 0;
>         data->cyc2ns_shift = CYC2NS_SCALE_FACTOR;
>         data->cyc2ns_offset = 0;
>         data->__count = 0;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [tip:sched/urgent] sched/clock: Fixup early initialization
  2014-01-22 11:59       ` Peter Zijlstra
  2014-01-23  1:53         ` Dave Young
@ 2014-01-23 16:46         ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 67+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-23 16:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, sasha.levin, hpa, mingo, arjan, peterz, bitbucket,
	john.stultz, luto, eliezer.tamir, tglx

Commit-ID:  d375b4e0fa3771343b370be0d876a1963c02e0a0
Gitweb:     http://git.kernel.org/tip/d375b4e0fa3771343b370be0d876a1963c02e0a0
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Jan 2014 12:59:18 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 23 Jan 2014 14:48:36 +0100

sched/clock: Fixup early initialization

The code would assume sched_clock_stable() and switch to !stable
later, this switch brings a discontinuity in time.

The discontinuity on switching from stable to unstable was always
present, but previously we would set stable/unstable before
initializing TSC and usually stick to the one we start out with.

So the static_key bits brought an extra switch where there previously
wasn't one.

Things are further complicated by the fact that we cannot use
static_key as early as we usually call set_sched_clock_stable().

Fix things by tracking the stable state in a regular variable and only
set the static_key to the right state on sched_clock_init(), which is
ran right after late_time_init->tsc_init().

Before this we would not be using the TSC anyway.

Reported-and-Tested-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: dyoung@redhat.com
Fixes: 35af99e646c7 ("sched/clock, x86: Use a static_key for sched_clock_stable")
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: paulmck@linux.vnet.ibm.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: rui.zhang@intel.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140122115918.GG3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/clock.c | 53 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 41 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 6bd6a67..43c2bcc 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -77,35 +77,50 @@ __read_mostly int sched_clock_running;
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+static int __sched_clock_stable_early;
 
 int sched_clock_stable(void)
 {
-	if (static_key_false(&__sched_clock_stable))
-		return false;
-	return true;
+	return static_key_false(&__sched_clock_stable);
 }
 
-void set_sched_clock_stable(void)
+static void __set_sched_clock_stable(void)
 {
 	if (!sched_clock_stable())
-		static_key_slow_dec(&__sched_clock_stable);
+		static_key_slow_inc(&__sched_clock_stable);
+}
+
+void set_sched_clock_stable(void)
+{
+	__sched_clock_stable_early = 1;
+
+	smp_mb(); /* matches sched_clock_init() */
+
+	if (!sched_clock_running)
+		return;
+
+	__set_sched_clock_stable();
 }
 
 static void __clear_sched_clock_stable(struct work_struct *work)
 {
 	/* XXX worry about clock continuity */
 	if (sched_clock_stable())
-		static_key_slow_inc(&__sched_clock_stable);
+		static_key_slow_dec(&__sched_clock_stable);
 }
 
 static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
 
 void clear_sched_clock_stable(void)
 {
-	if (keventd_up())
-		schedule_work(&sched_clock_work);
-	else
-		__clear_sched_clock_stable(&sched_clock_work);
+	__sched_clock_stable_early = 0;
+
+	smp_mb(); /* matches sched_clock_init() */
+
+	if (!sched_clock_running)
+		return;
+
+	schedule_work(&sched_clock_work);
 }
 
 struct sched_clock_data {
@@ -140,6 +155,20 @@ void sched_clock_init(void)
 	}
 
 	sched_clock_running = 1;
+
+	/*
+	 * Ensure that it is impossible to not do a static_key update.
+	 *
+	 * Either {set,clear}_sched_clock_stable() must see sched_clock_running
+	 * and do the update, or we must see their __sched_clock_stable_early
+	 * and do the update, or both.
+	 */
+	smp_mb(); /* matches {set,clear}_sched_clock_stable() */
+
+	if (__sched_clock_stable_early)
+		__set_sched_clock_stable();
+	else
+		__clear_sched_clock_stable(NULL);
 }
 
 /*
@@ -340,7 +369,7 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
  */
 u64 cpu_clock(int cpu)
 {
-	if (static_key_false(&__sched_clock_stable))
+	if (!sched_clock_stable())
 		return sched_clock_cpu(cpu);
 
 	return sched_clock();
@@ -355,7 +384,7 @@ u64 cpu_clock(int cpu)
  */
 u64 local_clock(void)
 {
-	if (static_key_false(&__sched_clock_stable))
+	if (!sched_clock_stable())
 		return sched_clock_cpu(raw_smp_processor_id());
 
 	return sched_clock();

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [tip:sched/urgent] sched/x86/tsc: Initialize multiplier to 0
  2014-01-23  9:48                                 ` Peter Zijlstra
  2014-01-23 10:01                                   ` Markus Trippelsdorf
  2014-01-23 13:32                                   ` Josh Boyer
@ 2014-01-23 16:46                                   ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 67+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-23 16:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, sasha.levin, hpa, mingo, arjan, peterz, bitbucket,
	luto, john.stultz, eliezer.tamir, markus, tglx

Commit-ID:  5e3c1afd4587e70c201bf7224b51f747c9a3dfa8
Gitweb:     http://git.kernel.org/tip/5e3c1afd4587e70c201bf7224b51f747c9a3dfa8
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Jan 2014 22:08:14 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 23 Jan 2014 14:48:36 +0100

sched/x86/tsc: Initialize multiplier to 0

Since we keep the clock value linearly continuous on frequency change,
make sure the initial multiplier is 0, such that our initial value is 0.
Without this we compute the initial value at whatever the TSC has
managed to reach since power-on.

Reported-and-Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Fixes: 20d1c86a57762 ("sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs")
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: paulmck@linux.vnet.ibm.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: dyoung@redhat.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140123094804.GP30183@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/tsc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a3acbac..19e5adb 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -180,7 +180,7 @@ static void cyc2ns_write_end(int cpu, struct cyc2ns_data *data)
 
 static void cyc2ns_data_init(struct cyc2ns_data *data)
 {
-	data->cyc2ns_mul = 1U << CYC2NS_SCALE_FACTOR;
+	data->cyc2ns_mul = 0;
 	data->cyc2ns_shift = CYC2NS_SCALE_FACTOR;
 	data->cyc2ns_offset = 0;
 	data->__count = 0;

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-23  2:10                                 ` Dave Young
@ 2014-01-23 16:56                                   ` Peter Zijlstra
  2014-01-24  3:15                                     ` Dave Young
  0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2014-01-23 16:56 UTC (permalink / raw)
  To: Dave Young
  Cc: Markus Trippelsdorf, Sasha Levin, Arjan van de Ven, lenb, rjw,
	Eliezer Tamir, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, paulmck, Thomas Gleixner, John Stultz,
	Andy Lutomirski, linux-kernel

On Thu, Jan 23, 2014 at 10:10:28AM +0800, Dave Young wrote:
> Hmm, seems the my physical machine is booting fine with this patch. kvm
> guest problem still exist, but that kvm thing might be other problem.

Dave, could you try with tip/master, Ingo pushed out all the fixes
gathered.

Lacking that, could you perhaps provide the .config you use for your
guest and a kvm invocation?

I used a defconfig with KVM_GUEST=y, which was enough to enable kvmclock
support and then started with:

kvm -kernel defconfig-build/arch/x86/boot/bzImage -nographic -append "console=ttyS0" -smp 2

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-23 16:56                                   ` Peter Zijlstra
@ 2014-01-24  3:15                                     ` Dave Young
  2014-01-24  7:58                                       ` Ingo Molnar
  0 siblings, 1 reply; 67+ messages in thread
From: Dave Young @ 2014-01-24  3:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Markus Trippelsdorf, Sasha Levin, Arjan van de Ven, lenb, rjw,
	Eliezer Tamir, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, paulmck, Thomas Gleixner, John Stultz,
	Andy Lutomirski, linux-kernel

On 01/23/14 at 05:56pm, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 10:10:28AM +0800, Dave Young wrote:
> > Hmm, seems the my physical machine is booting fine with this patch. kvm
> > guest problem still exist, but that kvm thing might be other problem.
> 
> Dave, could you try with tip/master, Ingo pushed out all the fixes
> gathered.

tip/master works fine for me.

> 
> Lacking that, could you perhaps provide the .config you use for your
> guest and a kvm invocation?
> 
> I used a defconfig with KVM_GUEST=y, which was enough to enable kvmclock
> support and then started with:
> 
> kvm -kernel defconfig-build/arch/x86/boot/bzImage -nographic -append "console=ttyS0" -smp 2

I use qemu-system-x86_64 with EFI OVMF firmware. I think the problem
has been resolved in tip/master; if you still want the .config, please
let me know.
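
For reference, an invocation along these lines (the paths and options
here are assumed for illustration, not my exact command):

qemu-system-x86_64 -enable-kvm -smp 1 -m 256 \
    -bios /usr/share/ovmf/OVMF.fd \
    -kernel arch/x86/boot/bzImage \
    -append "console=ttyS0 earlyprintk=serial,ttyS0" \
    -nographic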

Thanks
Dave

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable
  2014-01-24  3:15                                     ` Dave Young
@ 2014-01-24  7:58                                       ` Ingo Molnar
  0 siblings, 0 replies; 67+ messages in thread
From: Ingo Molnar @ 2014-01-24  7:58 UTC (permalink / raw)
  To: Dave Young
  Cc: Peter Zijlstra, Markus Trippelsdorf, Sasha Levin,
	Arjan van de Ven, lenb, rjw, Eliezer Tamir, rui.zhang,
	jacob.jun.pan, Mike Galbraith, hpa, paulmck, Thomas Gleixner,
	John Stultz, Andy Lutomirski, linux-kernel


* Dave Young <dyoung@redhat.com> wrote:

> On 01/23/14 at 05:56pm, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2014 at 10:10:28AM +0800, Dave Young wrote:
> > > Hmm, seems the my physical machine is booting fine with this patch. kvm
> > > guest problem still exist, but that kvm thing might be other problem.
> > 
> > Dave, could you try with tip/master, Ingo pushed out all the fixes
> > gathered.
> 
> tip/master works fine for me.

Ok, thanks for testing - I'll send those bits to Linus later today.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs
  2013-12-12 14:08 ` [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs Peter Zijlstra
@ 2014-06-16 17:13   ` Viresh Kumar
  2014-06-17 12:15     ` Peter Zijlstra
  0 siblings, 1 reply; 67+ messages in thread
From: Viresh Kumar @ 2014-06-16 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, Rafael J. Wysocki, Eliezer Tamir,
	Zhang Rui, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Paul McKenney, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, registosites, srivatsa

Cc'ing Mauro/Rafael/Srivatsa..

On Thu, Dec 12, 2013 at 7:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> Use a ring-buffer like multi-version object structure which allows
> always having a coherent object; we use this to avoid having to
> disable IRQs while reading sched_clock() and avoids a problem when
> getting an NMI while changing the cyc2ns data.
>
>                         MAINLINE   PRE        POST
>
>     sched_clock_stable: 1          1          1
>     (cold) sched_clock: 329841     331312     257223
>     (cold) local_clock: 301773     310296     309889
>     (warm) sched_clock: 38375      38247      25280
>     (warm) local_clock: 100371     102713     85268
>     (warm) rdtsc:       27340      27289      24247
>     sched_clock_stable: 0          0          0
>     (cold) sched_clock: 382634     372706     301224
>     (cold) local_clock: 396890     399275     399870
>     (warm) sched_clock: 38194      38124      25630
>     (warm) local_clock: 143452     148698     129629
>     (warm) rdtsc:       27345      27365      24307
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/x86/include/asm/timer.h     |   23 +++
>  arch/x86/kernel/cpu/perf_event.c |   14 +-
>  arch/x86/kernel/tsc.c            |  229 ++++++++++++++++++++++++++++++++++-----
>  3 files changed, 236 insertions(+), 30 deletions(-)

Hi Peter,

We have been following a bug around cpufreq changes in 3.14
(https://bugzilla.kernel.org/show_bug.cgi?id=77201), and our tests
narrowed it down to this patch, after which things aren't working
as expected.

Mauro (registosites@hotmail.com) has reported that since 3.14,
after doing a hot-unplug/hotplug of CPU1, the cpufreq POSTCHANGE notification
for every second frequency change hangs his machine.

The last working tag was 3.13.8; it started failing in 3.14 and still fails
with the latest 3.15 as well.

model name : AMD Turion(tm) 64 X2 Mobile Technology TL-64
CPUFreq driver: powernow-k8 ..
CPUs: Only two CPUs, sharing clock line


We have asked him to do lots of tests, as we initially thought it was
because of some cpufreq changes, but we finally came to the conclusion
that reverting the two commits below fixes his issue on 3.14:

20d1c86 sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
57c67da sched/clock, x86: Move some cyc2ns() code around

(some more commits were reverted as well in order to revert these two cleanly,
but he still had issues after those other reverts; only after reverting these two
did his problem get fixed).

He couldn't do a bisect, as it was broken by compilation errors.

Can you give any pointers, so that he can get it fixed?

Thanks.

--
Viresh

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs
  2014-06-16 17:13   ` Viresh Kumar
@ 2014-06-17 12:15     ` Peter Zijlstra
  2014-06-17 12:42       ` Viresh Kumar
  2014-06-17 12:47       ` Mauro
  0 siblings, 2 replies; 67+ messages in thread
From: Peter Zijlstra @ 2014-06-17 12:15 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Arjan van de Ven, lenb, Rafael J. Wysocki, Eliezer Tamir,
	Zhang Rui, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Paul McKenney, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, registosites, srivatsa

[-- Attachment #1: Type: text/plain, Size: 2515 bytes --]

On Mon, Jun 16, 2014 at 10:43:38PM +0530, Viresh Kumar wrote:
> Cc'ing Mauro/Rafael/Srivatsa..
> 
> On Thu, Dec 12, 2013 at 7:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Use a ring-buffer like multi-version object structure which allows
> > always having a coherent object; we use this to avoid having to
> > disable IRQs while reading sched_clock() and avoids a problem when
> > getting an NMI while changing the cyc2ns data.
> >
> >                         MAINLINE   PRE        POST
> >
> >     sched_clock_stable: 1          1          1
> >     (cold) sched_clock: 329841     331312     257223
> >     (cold) local_clock: 301773     310296     309889
> >     (warm) sched_clock: 38375      38247      25280
> >     (warm) local_clock: 100371     102713     85268
> >     (warm) rdtsc:       27340      27289      24247
> >     sched_clock_stable: 0          0          0
> >     (cold) sched_clock: 382634     372706     301224
> >     (cold) local_clock: 396890     399275     399870
> >     (warm) sched_clock: 38194      38124      25630
> >     (warm) local_clock: 143452     148698     129629
> >     (warm) rdtsc:       27345      27365      24307
> >
> > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > ---
> >  arch/x86/include/asm/timer.h     |   23 +++
> >  arch/x86/kernel/cpu/perf_event.c |   14 +-
> >  arch/x86/kernel/tsc.c            |  229 ++++++++++++++++++++++++++++++++++-----
> >  3 files changed, 236 insertions(+), 30 deletions(-)
> 
> Hi Peter,
> 
> We have been following a bug around cpufreq changes in 3.14
> (https://bugzilla.kernel.org/show_bug.cgi?id=77201) and our tests
> narrowed it down to this patch after which things aren't working
> as expected.

> Mauro (registosites@hotmail.com) have reported that since 3.14
> after doing hot-unplug/hotplug of CPU1, cpufreq POSTCHANGE notifications
> for every second frequency change hangs his machine ..

Ah, just a freeze?

> Last working tag was: 3.13.8 and it started failing from 3.14 and fails
> for latest 3.15 as well..
> 
> model name : AMD Turion(tm) 64 X2 Mobile Technology TL-64
> CPUFreq driver: powernow-k8 ..
> CPUs: Only two CPUs, sharing clock line

What's specific to this particular CPU?

> Can you give any pointers, so that he can get it fixed ?

So what you can try is force a cyc2ns read before the write in
set_cyc2ns_scale(). I think its possible that if we do not do the read,
the write will wait for a 'free' slot indefinitely.
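
For reference, a rough sketch of the write-side wait being referred to
(paraphrased from the cyc2ns rewrite in this series; the real code in
arch/x86/kernel/tsc.c may differ in detail). The writer spins until the
per-cpu tail pointer has moved off the slot it wants to reuse, and the
tail only advances when a reader on that CPU drops its last reference
in cyc2ns_read_end():

/* sketch only, not a verbatim copy */
static struct cyc2ns_data *cyc2ns_write_begin(int cpu)
{
	struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
	struct cyc2ns_data *data = c2n->data;

	if (data == c2n->head)
		data++;		/* write the entry readers are not using */

	/*
	 * If no read/read_end pair has run on this cpu since the last
	 * write, tail still points at @data and this loop never exits,
	 * which would match the freeze described above.
	 */
	while (c2n->tail == data)
		cpu_relax();

	return data;
}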

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs
  2014-06-17 12:15     ` Peter Zijlstra
@ 2014-06-17 12:42       ` Viresh Kumar
  2014-06-17 16:32         ` Mauro
  2014-06-17 12:47       ` Mauro
  1 sibling, 1 reply; 67+ messages in thread
From: Viresh Kumar @ 2014-06-17 12:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Len Brown, Rafael J. Wysocki, Eliezer Tamir,
	Zhang Rui, jacob.jun.pan, Mike Galbraith, Ingo Molnar,
	H. Peter Anvin, Paul McKenney, Thomas Gleixner, John Stultz,
	Andy Lutomirski, linux-kernel, registosites, Srivatsa S. Bhat

On 17 June 2014 17:45, Peter Zijlstra <peterz@infradead.org> wrote:
> Ah, just a freeze?

As per what he reported and his screen dumps, it looks like yes, a
freeze. IOW, no prints at all on the terminal, not even the ones issued
before reaching set_cyc2ns_scale().

>> CPUs: Only two CPUs, sharing clock line
>
> What's specific to this particular CPU?

You meant why I mentioned this information? Nothing specific
actually; I just mentioned it to make all the information available that
might be useful. Probably it's not :)

> So what you can try is force a cyc2ns read before the write in
> set_cyc2ns_scale(). I think its possible that if we do not do the read,
> the write will wait for a 'free' slot indefinitely.

Hmm, I see. Thanks for your quick response.

Are you suggesting something like this ? :

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 57e5ce1..290ac03 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -249,6 +249,9 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
        if (!cpu_khz)
                goto done;

+       data = cyc2ns_read_begin();
+       cyc2ns_read_end(data);
+
        data = cyc2ns_write_begin(cpu);

        rdtscll(tsc_now);


@Mauro: Please try this once Peter confirms.

--
viresh

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs
  2014-06-17 12:15     ` Peter Zijlstra
  2014-06-17 12:42       ` Viresh Kumar
@ 2014-06-17 12:47       ` Mauro
  1 sibling, 0 replies; 67+ messages in thread
From: Mauro @ 2014-06-17 12:47 UTC (permalink / raw)
  To: Peter Zijlstra, Viresh Kumar
  Cc: Arjan van de Ven, lenb, Rafael J. Wysocki, Eliezer Tamir,
	Zhang Rui, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Paul McKenney, Thomas Gleixner, John Stultz, Andy Lutomirski,
	linux-kernel, srivatsa

[-- Attachment #1: Type: text/plain, Size: 2877 bytes --]

On 17-06-2014 13:15, Peter Zijlstra wrote:
> On Mon, Jun 16, 2014 at 10:43:38PM +0530, Viresh Kumar wrote:
>> Cc'ing Mauro/Rafael/Srivatsa..
>>
>> On Thu, Dec 12, 2013 at 7:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> Use a ring-buffer like multi-version object structure which allows
>>> always having a coherent object; we use this to avoid having to
>>> disable IRQs while reading sched_clock() and avoids a problem when
>>> getting an NMI while changing the cyc2ns data.
>>>
>>>                         MAINLINE   PRE        POST
>>>
>>>     sched_clock_stable: 1          1          1
>>>     (cold) sched_clock: 329841     331312     257223
>>>     (cold) local_clock: 301773     310296     309889
>>>     (warm) sched_clock: 38375      38247      25280
>>>     (warm) local_clock: 100371     102713     85268
>>>     (warm) rdtsc:       27340      27289      24247
>>>     sched_clock_stable: 0          0          0
>>>     (cold) sched_clock: 382634     372706     301224
>>>     (cold) local_clock: 396890     399275     399870
>>>     (warm) sched_clock: 38194      38124      25630
>>>     (warm) local_clock: 143452     148698     129629
>>>     (warm) rdtsc:       27345      27365      24307
>>>
>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>>> ---
>>>  arch/x86/include/asm/timer.h     |   23 +++
>>>  arch/x86/kernel/cpu/perf_event.c |   14 +-
>>>  arch/x86/kernel/tsc.c            |  229 ++++++++++++++++++++++++++++++++++-----
>>>  3 files changed, 236 insertions(+), 30 deletions(-)
>>
>> Hi Peter,
>>
>> We have been following a bug around cpufreq changes in 3.14
>> (https://bugzilla.kernel.org/show_bug.cgi?id=77201) and our tests
>> narrowed it down to this patch after which things aren't working
>> as expected.
> 
>> Mauro (registosites@hotmail.com) has reported that since 3.14,
>> after doing hot-unplug/hotplug of CPU1, the cpufreq POSTCHANGE
>> notification for every second frequency change hangs his machine ..
> 
> Ah, just a freeze?

If you mean no output whatsoever and frozen to the point of needing a
hard shutdown/reboot, yes.

>> Last working tag was: 3.13.8 and it started failing from 3.14 and fails
>> for latest 3.15 as well..
>>
>> model name : AMD Turion(tm) 64 X2 Mobile Technology TL-64
>> CPUFreq driver: powernow-k8 ..
>> CPUs: Only two CPUs, sharing clock line
> 
> What's specific to this particular CPU?

That is a very good question, for which I don't have an answer; as far
as I know it is just like any other AMD mobile Turion X2 CPU.

>> Can you give any pointers, so that he can get it fixed ?
> 
> So what you can try is force a cyc2ns read before the write in
> set_cyc2ns_scale(). I think it's possible that if we do not do the read,
> the write will wait for a 'free' slot indefinitely.

For that I'll need some help/pointers.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs
  2014-06-17 12:42       ` Viresh Kumar
@ 2014-06-17 16:32         ` Mauro
  0 siblings, 0 replies; 67+ messages in thread
From: Mauro @ 2014-06-17 16:32 UTC (permalink / raw)
  To: Viresh Kumar, Peter Zijlstra
  Cc: Arjan van de Ven, Len Brown, Rafael J. Wysocki, Eliezer Tamir,
	Zhang Rui, jacob.jun.pan, Mike Galbraith, Ingo Molnar,
	H. Peter Anvin, Paul McKenney, Thomas Gleixner, John Stultz,
	Andy Lutomirski, linux-kernel, Srivatsa S. Bhat

On 17-06-2014 13:42, Viresh Kumar wrote:
>> So what you can try is force a cyc2ns read before the write in
>> set_cyc2ns_scale(). I think it's possible that if we do not do the read,
>> the write will wait for a 'free' slot indefinitely.
> 
> Hmm, I see. Thanks for your quick response.
> 
> Are you suggesting something like this ? :
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 57e5ce1..290ac03 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -249,6 +249,9 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
>         if (!cpu_khz)
>                 goto done;
> 
> +       data = cyc2ns_read_begin();
> +       cyc2ns_read_end(data);
> +
>         data = cyc2ns_write_begin(cpu);
> 
>         rdtscll(tsc_now);
> 
> 
> @Mauro: Please try this once Peter confirms.

I have tried this and it doesn't solve the problem; the machine still
hangs on the second frequency change after CPU offline/online.

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2014-06-17 16:32 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-12 14:08 [PATCH 00/15] cleanups and optimizations Peter Zijlstra
2013-12-12 14:08 ` [PATCH 01/15] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
2013-12-19 20:09   ` [tip:x86/idle] " tip-bot for Peter Zijlstra
2013-12-19 20:13     ` H. Peter Anvin
2013-12-19 20:16       ` Peter Zijlstra
2013-12-12 14:08 ` [PATCH 02/15] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
2013-12-12 14:08 ` [PATCH 03/15] preempt, locking: Rework local_bh_{dis,en}able() Peter Zijlstra
2013-12-12 14:08 ` [PATCH 04/15] locking: Optimize lock_bh functions Peter Zijlstra
2013-12-12 14:08 ` [PATCH 05/15] sched, net: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
2013-12-12 14:08 ` [PATCH 06/15] sched, net: Fixup busy_loop_us_clock() Peter Zijlstra
2013-12-12 14:08 ` [PATCH 07/15] sched, thermal: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
2013-12-12 14:08 ` [PATCH 08/15] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
2013-12-12 14:08 ` [PATCH 09/15] x86: Use mul_u64_u32_shr() for native_sched_clock() Peter Zijlstra
2013-12-12 14:08 ` [PATCH 10/15] x86: Move some code around Peter Zijlstra
2013-12-12 14:08 ` [PATCH 11/15] x86: Rewrite cyc2ns to avoid the need to disable IRQs Peter Zijlstra
2014-06-16 17:13   ` Viresh Kumar
2014-06-17 12:15     ` Peter Zijlstra
2014-06-17 12:42       ` Viresh Kumar
2014-06-17 16:32         ` Mauro
2014-06-17 12:47       ` Mauro
2013-12-12 14:08 ` [PATCH 12/15] sched: Remove local_irq_disable() from the clocks Peter Zijlstra
2013-12-12 14:08 ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Peter Zijlstra
2014-01-21 22:28   ` Sasha Levin
2014-01-22 10:45     ` Peter Zijlstra
2014-01-22 11:59       ` Peter Zijlstra
2014-01-23  1:53         ` Dave Young
2014-01-23 16:46         ` [tip:sched/urgent] sched/clock: Fixup early initialization tip-bot for Peter Zijlstra
2014-01-22 12:00       ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Markus Trippelsdorf
2014-01-22 12:07         ` Peter Zijlstra
2014-01-22 12:16           ` Peter Zijlstra
2014-01-22 12:26           ` Markus Trippelsdorf
2014-01-22 12:30             ` Peter Zijlstra
2014-01-22 13:14               ` Markus Trippelsdorf
2014-01-22 14:26                 ` Sasha Levin
2014-01-22 18:35                   ` Markus Trippelsdorf
2014-01-22 18:42                     ` Peter Zijlstra
2014-01-22 18:45                       ` Markus Trippelsdorf
2014-01-22 19:17                         ` Josh Boyer
2014-01-22 19:09                       ` Markus Trippelsdorf
2014-01-22 19:12                         ` Markus Trippelsdorf
2014-01-22 20:16                           ` Peter Zijlstra
2014-01-22 21:08                             ` Peter Zijlstra
2014-01-22 21:17                               ` Markus Trippelsdorf
2014-01-23  9:48                                 ` Peter Zijlstra
2014-01-23 10:01                                   ` Markus Trippelsdorf
2014-01-23 10:04                                     ` Peter Zijlstra
2014-01-23 13:32                                   ` Josh Boyer
2014-01-23 16:46                                   ` [tip:sched/urgent] sched/x86/tsc: Initialize multiplier to 0 tip-bot for Peter Zijlstra
2014-01-22 23:53                               ` [PATCH 13/15] sched: Use a static_key for sched_clock_stable Josh Boyer
2014-01-23  1:53                               ` Dave Young
2014-01-23  2:10                                 ` Dave Young
2014-01-23 16:56                                   ` Peter Zijlstra
2014-01-24  3:15                                     ` Dave Young
2014-01-24  7:58                                       ` Ingo Molnar
2014-01-22 17:14     ` Peter Zijlstra
2014-01-22 22:31       ` Sasha Levin
2013-12-12 14:08 ` [PATCH 14/15] sched, clock: Fixup clear_sched_clock_stable() Peter Zijlstra
2013-12-12 14:08 ` [PATCH 15/15] x86: Avoid a runtime condition in native_sched_clock() Peter Zijlstra
2013-12-13  3:30 ` [PATCH 00/15] cleanups and optimizations Mike Galbraith
2013-12-13 10:49 ` Eliezer Tamir
2013-12-13 13:56   ` Peter Zijlstra
2013-12-16 17:48     ` Eliezer Tamir
2013-12-17 13:32       ` Peter Zijlstra
2013-12-17 14:03         ` Eliezer Tamir
2013-12-17 15:13           ` Peter Zijlstra
2013-12-17 18:19             ` Eliezer Tamir
2013-12-17 22:11               ` Thomas Gleixner
