[RFC PATCH 0/6] timekeeping: Missing timekeeping update detection

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/6] timekeeping: Missing timekeeping update detection
@ 2013-08-21 16:42 Frederic Weisbecker
  2013-08-21 16:42 ` [RFC PATCH 1/6] sched: Let arch tell us if sched clock is NMI-safe Frederic Weisbecker
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2013-08-21 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Paul E. McKenney, John Stultz, Steven Rostedt,
	Don Zickus

Hi,

With the timekeeping going to be maintained by full system idle detection
patchset from Paul, it seems that the guarantees that enforce timekeeping
progression are going to grow in complexity enough to deserve some automated
checking.

So here is a proposition in the form of a timekeeping watchdog. It
uses periodic NMIs that poll on any suspicious drift between jiffies
and a global cpu clock progression.

Thanks.

Frederic Weisbecker (6):
  sched: Let arch tell us if sched clock is NMI-safe
  x86: nsecs to cycles conversion
  x86: Tell that sched clock is callable in nmi
  seqlock: Add raw_seqbegin() for non-waiting readers
  jiffies: Add jiffies_to_nsecs
  timekeeping: Debug missing timekeeping updates

 arch/Kconfig                       |    8 ++
 arch/x86/Kconfig                   |    2 +
 arch/x86/include/asm/cycles.h      |   11 +++
 arch/x86/kernel/apic/hw_nmi.c      |    7 --
 include/linux/jiffies.h            |    6 ++
 include/linux/seqlock.h            |    5 ++
 include/linux/time.h               |   11 +++
 kernel/time/Makefile               |    1 +
 kernel/time/tick-sched.c           |    4 +
 kernel/time/timekeeping.c          |    1 +
 kernel/time/timekeeping_selftest.c |  125 ++++++++++++++++++++++++++++++++++++
 kernel/watchdog.c                  |    3 +-
 lib/Kconfig.debug                  |   13 ++++
 13 files changed, 189 insertions(+), 8 deletions(-)
 create mode 100644 arch/x86/include/asm/cycles.h
 create mode 100644 kernel/time/timekeeping_selftest.c

-- 
1.7.5.4


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/6] sched: Let arch tell us if sched clock is NMI-safe
  2013-08-21 16:42 [RFC PATCH 0/6] timekeeping: Missing timekeeping update detection Frederic Weisbecker
@ 2013-08-21 16:42 ` Frederic Weisbecker
  2013-08-21 16:42 ` [RFC PATCH 2/6] x86: nsecs to cycles conversion Frederic Weisbecker
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2013-08-21 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Paul E. McKenney, John Stultz, Steven Rostedt,
	Don Zickus

sched_clock() should be fast, scalable and not use any lock. As a
resuly it should be safely called from NMIs.

Now just in case there might be some implementation details proper
to some archs that make sched_clock() not reliable or not safe in NMIs,
lets provide a way through Kconfig for archs to testify about that
support.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Don Zickus <dzickus@redhat.com>
---
 arch/Kconfig |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 1feb169..52ad235 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -393,6 +393,11 @@ config HAVE_UNDERSCORE_SYMBOL_PREFIX
 	  Some architectures generate an _ in front of C symbols; things like
 	  module loading and assembly files need to know about this.
 
+config HAVE_SCHED_CLOCK_NMI
+	bool
+	help
+	  Architecture's sched_clock() implementation is safely callable from  NMIs.
+
 #
 # ABI hall of shame
 #
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 2/6] x86: nsecs to cycles conversion
  2013-08-21 16:42 [RFC PATCH 0/6] timekeeping: Missing timekeeping update detection Frederic Weisbecker
  2013-08-21 16:42 ` [RFC PATCH 1/6] sched: Let arch tell us if sched clock is NMI-safe Frederic Weisbecker
@ 2013-08-21 16:42 ` Frederic Weisbecker
  2013-08-21 18:26   ` Don Zickus
  2013-08-21 16:42 ` [RFC PATCH 3/6] x86: Tell that sched clock is callable in nmi Frederic Weisbecker
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Frederic Weisbecker @ 2013-08-21 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Paul E. McKenney, John Stultz, Steven Rostedt,
	Don Zickus

hw_nmi_get_sample_period() is simply a conversion from a period
to cycles. Lets generalize the API naming so that it can be used for
wider purpose than just watchdog perf event settings. Also it makes the
function name less opaque about what it really does.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Don Zickus <dzickus@redhat.com>
---
 arch/Kconfig                  |    3 +++
 arch/x86/Kconfig              |    1 +
 arch/x86/include/asm/cycles.h |   11 +++++++++++
 arch/x86/kernel/apic/hw_nmi.c |    7 -------
 kernel/watchdog.c             |    3 ++-
 lib/Kconfig.debug             |    1 +
 6 files changed, 18 insertions(+), 8 deletions(-)
 create mode 100644 arch/x86/include/asm/cycles.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 52ad235..cc4d14a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -398,6 +398,9 @@ config HAVE_SCHED_CLOCK_NMI
 	help
 	  Architecture's sched_clock() implementation is safely callable from  NMIs.
 
+config HAVE_NSECS_TO_CYCLES
+	bool
+
 #
 # ABI hall of shame
 #
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b32ebf9..7fbda5c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -124,6 +124,7 @@ config X86
 	select COMPAT_OLD_SIGACTION if IA32_EMULATION
 	select RTC_LIB
 	select HAVE_DEBUG_STACKOVERFLOW
+	select HAVE_NSECS_TO_CYCLES
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/cycles.h b/arch/x86/include/asm/cycles.h
new file mode 100644
index 0000000..7dedeb3
--- /dev/null
+++ b/arch/x86/include/asm/cycles.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_X86_CYCLES_H
+#define _ASM_X86_CYCLES_H
+
+#include <linux/time.h>
+#include <asm/tsc.h>
+
+static inline u64 nsecs_to_cycles(u64 nsecs)
+{
+	return (u64)((cpu_khz) * nsecs) / NSEC_PER_MSEC;
+}
+#endif /* #ifndef _ASM_X86_CYCLES_H */
diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
index a698d71..4015906 100644
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -19,13 +19,6 @@
 #include <linux/module.h>
 #include <linux/delay.h>
 
-#ifdef CONFIG_HARDLOCKUP_DETECTOR
-u64 hw_nmi_get_sample_period(int watchdog_thresh)
-{
-	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
-}
-#endif
-
 #ifdef arch_trigger_all_cpu_backtrace
 /* For reliability, we're prepared to waste bits here. */
 static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 1241d8c..e04887c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -26,6 +26,7 @@
 #include <linux/sched/rt.h>
 
 #include <asm/irq_regs.h>
+#include <asm/cycles.h>
 #include <linux/kvm_para.h>
 #include <linux/perf_event.h>
 
@@ -417,7 +418,7 @@ static int watchdog_nmi_enable(unsigned int cpu)
 		goto out_enable;
 
 	wd_attr = &wd_hw_attr;
-	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
+	wd_attr->sample_period = nsecs_to_cycles(watchdog_thresh * NSEC_PER_SEC);
 
 	/* Try to register using hardware perf events */
 	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1501aa5..568179d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -634,6 +634,7 @@ config HARDLOCKUP_DETECTOR
 	def_bool y
 	depends on LOCKUP_DETECTOR && !HAVE_NMI_WATCHDOG
 	depends on PERF_EVENTS && HAVE_PERF_EVENTS_NMI
+	depends on HAVE_NSECS_TO_CYCLES
 
 config BOOTPARAM_HARDLOCKUP_PANIC
 	bool "Panic (Reboot) On Hard Lockups"
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 3/6] x86: Tell that sched clock is callable in nmi
  2013-08-21 16:42 [RFC PATCH 0/6] timekeeping: Missing timekeeping update detection Frederic Weisbecker
  2013-08-21 16:42 ` [RFC PATCH 1/6] sched: Let arch tell us if sched clock is NMI-safe Frederic Weisbecker
  2013-08-21 16:42 ` [RFC PATCH 2/6] x86: nsecs to cycles conversion Frederic Weisbecker
@ 2013-08-21 16:42 ` Frederic Weisbecker
  2013-08-21 16:42 ` [RFC PATCH 4/6] seqlock: Add raw_seqbegin() for non-waiting readers Frederic Weisbecker
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2013-08-21 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Paul E. McKenney, John Stultz, Steven Rostedt,
	Don Zickus

The x86 version of sched_clock() seem to be safely callable
from NMIs, tell about that.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Don Zickus <dzickus@redhat.com>
---
 arch/x86/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7fbda5c..fc7c34c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -125,6 +125,7 @@ config X86
 	select RTC_LIB
 	select HAVE_DEBUG_STACKOVERFLOW
 	select HAVE_NSECS_TO_CYCLES
+	select HAVE_SCHED_CLOCK_NMI
 
 config INSTRUCTION_DECODER
 	def_bool y
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 4/6] seqlock: Add raw_seqbegin() for non-waiting readers
  2013-08-21 16:42 [RFC PATCH 0/6] timekeeping: Missing timekeeping update detection Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2013-08-21 16:42 ` [RFC PATCH 3/6] x86: Tell that sched clock is callable in nmi Frederic Weisbecker
@ 2013-08-21 16:42 ` Frederic Weisbecker
  2013-08-21 16:42 ` [RFC PATCH 5/6] jiffies: Add jiffies_to_nsecs Frederic Weisbecker
  2013-08-21 16:42 ` [RFC PATCH 6/6] timekeeping: Debug missing timekeeping updates Frederic Weisbecker
  5 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2013-08-21 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Paul E. McKenney, John Stultz, Steven Rostedt,
	Don Zickus

Add the seqlock version of raw_seqcount_begin() for readers
that can't wait for writers to complete their update and
who can live with seqlock read acquirement failure.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Don Zickus <dzickus@redhat.com>
---
 include/linux/seqlock.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 1829905..aa845de 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -203,6 +203,11 @@ static inline unsigned read_seqbegin(const seqlock_t *sl)
 	return read_seqcount_begin(&sl->seqcount);
 }
 
+static inline unsigned raw_seqbegin(const seqlock_t *sl)
+{
+	return raw_seqcount_begin(&sl->seqcount);
+}
+
 static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
 {
 	return read_seqcount_retry(&sl->seqcount, start);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 5/6] jiffies: Add jiffies_to_nsecs
  2013-08-21 16:42 [RFC PATCH 0/6] timekeeping: Missing timekeeping update detection Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2013-08-21 16:42 ` [RFC PATCH 4/6] seqlock: Add raw_seqbegin() for non-waiting readers Frederic Weisbecker
@ 2013-08-21 16:42 ` Frederic Weisbecker
  2013-08-21 16:42 ` [RFC PATCH 6/6] timekeeping: Debug missing timekeeping updates Frederic Weisbecker
  5 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2013-08-21 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Paul E. McKenney, John Stultz, Steven Rostedt,
	Don Zickus

This is going to be used by the timekeeping debugger.

It's a first naive version, at least to make sure that future ad hoc
implementations don't poliferate.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Don Zickus <dzickus@redhat.com>
---
 include/linux/jiffies.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 97ba4e7..77a24da 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -294,6 +294,12 @@ extern unsigned long preset_lpj;
  */
 extern unsigned int jiffies_to_msecs(const unsigned long j);
 extern unsigned int jiffies_to_usecs(const unsigned long j);
+
+static inline u64 jiffies_to_nsecs(const unsigned long j)
+{
+	return jiffies_to_usecs(j) * NSEC_PER_USEC;
+}
+
 extern unsigned long msecs_to_jiffies(const unsigned int m);
 extern unsigned long usecs_to_jiffies(const unsigned int u);
 extern unsigned long timespec_to_jiffies(const struct timespec *value);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 6/6] timekeeping: Debug missing timekeeping updates
  2013-08-21 16:42 [RFC PATCH 0/6] timekeeping: Missing timekeeping update detection Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2013-08-21 16:42 ` [RFC PATCH 5/6] jiffies: Add jiffies_to_nsecs Frederic Weisbecker
@ 2013-08-21 16:42 ` Frederic Weisbecker
  2013-08-21 17:25   ` John Stultz
  5 siblings, 1 reply; 11+ messages in thread
From: Frederic Weisbecker @ 2013-08-21 16:42 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Paul E. McKenney, John Stultz, Steven Rostedt,
	Don Zickus

With the full dynticks feature and the tricky full system idle
detection code that is coming soon, it becomes necessary to have
some debug code that makes sure that the timekeeping is always
maintained and moving forward as expected.

This provides a simple detection of missing timekeeping updates
inspired by the lockup detector's use of CPU cycles clock.

The jiffies are compared to the cpu clock after several snapshots taken
from NMIs that trigger after arbitrary CPU cycles period overflow.

If the jiffies progression appears to drift too far away from the CPU
clock's, this triggers a warning.

We just make sure not to account the tiny code on irq entry that
may have stale jiffies values before tick_check_nohz() is called
after the CPU is woken up while the system went full idle for some
time.

Same goes for idle exit in case the tick were stopped but idle
was polling on need_resched().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Don Zickus <dzickus@redhat.com>
---
 include/linux/time.h               |   11 +++
 kernel/time/Makefile               |    1 +
 kernel/time/tick-sched.c           |    4 +
 kernel/time/timekeeping.c          |    1 +
 kernel/time/timekeeping_selftest.c |  125 ++++++++++++++++++++++++++++++++++++
 lib/Kconfig.debug                  |   12 ++++
 6 files changed, 154 insertions(+), 0 deletions(-)
 create mode 100644 kernel/time/timekeeping_selftest.c

diff --git a/include/linux/time.h b/include/linux/time.h
index d5d229b..39d8493 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -181,6 +181,17 @@ extern s32 timekeeping_get_tai_offset(void);
 extern void timekeeping_set_tai_offset(s32 tai_offset);
 extern void timekeeping_clocktai(struct timespec *ts);
 
+#ifdef CONFIG_DEBUG_TIMEKEEPING
+extern void timekeeping_selftest_update(void);
+extern void timekeeping_selftest_start(void);
+extern void timekeeping_selftest_stop(void);
+
+#else
+static inline void timekeeping_selftest_update(void) { }
+static inline void timekeeping_selftest_start(void) { }
+static inline void timekeeping_selftest_stop(void) { }
+#endif
+
 struct tms;
 extern void do_sys_times(struct tms *);
 
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 9250130..cbea072 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
 obj-$(CONFIG_TIMER_STATS)			+= timer_stats.o
 obj-$(CONFIG_DEBUG_FS)				+= timekeeping_debug.o
+obj-$(CONFIG_DEBUG_TIMEKEEPING)			+= timekeeping_selftest.o
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e77edc9..dcba0e1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -771,6 +771,8 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 
 		if (!was_stopped && ts->tick_stopped)
 			ts->idle_jiffies = ts->last_jiffies;
+
+		timekeeping_selftest_stop();
 	}
 }
 
@@ -937,6 +939,7 @@ void tick_nohz_idle_exit(void)
 	if (ts->tick_stopped) {
 		tick_nohz_restart_sched_tick(ts, now);
 		tick_nohz_account_idle_ticks(ts);
+		timekeeping_selftest_start();
 	}
 
 	local_irq_enable();
@@ -1049,6 +1052,7 @@ static inline void tick_check_nohz(int cpu)
 	if (ts->tick_stopped) {
 		tick_nohz_update_jiffies(now);
 		tick_nohz_kick_tick(cpu, now);
+		timekeeping_selftest_start();
 	}
 }
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 48b9fff..1dcb216 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1583,6 +1583,7 @@ struct timespec get_monotonic_coarse(void)
 void do_timer(unsigned long ticks)
 {
 	jiffies_64 += ticks;
+	timekeeping_selftest_update();
 	update_wall_time();
 	calc_global_load(ticks);
 }
diff --git a/kernel/time/timekeeping_selftest.c b/kernel/time/timekeeping_selftest.c
new file mode 100644
index 0000000..f66331f
--- /dev/null
+++ b/kernel/time/timekeeping_selftest.c
@@ -0,0 +1,125 @@
+#include "tick-internal.h"
+#include <linux/perf_event.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <asm/cycles.h>
+
+#define TIMEKEEPING_CHECK_DELAY (100 * NSEC_PER_MSEC) /* 0.1 sec */
+#define TIMEKEEPING_MAX_DELAY (500 * NSEC_PER_MSEC) /* 0.5 sec */
+
+static long jiffies_snap;
+static u64 clock_snap;
+static int warned;
+static DEFINE_PER_CPU(bool, selftest_running) = true;
+
+static struct perf_event_attr event_hw_attr = {
+	.type		= PERF_TYPE_HARDWARE,
+	.config		= PERF_COUNT_HW_CPU_CYCLES,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+};
+
+static DEFINE_PER_CPU(struct perf_event *, cpu_event);
+
+static void timekeeping_missing_warn(long jiffies_old, u64 clock_old,
+				     long jiffies_now, u64 clock_now)
+{
+	unsigned int nsecs;
+
+	pr_err("INFO: Suspicion of missing timekeeping update\n");
+	nsecs = do_div(clock_old, NSEC_PER_SEC);
+	pr_err("Jiffies last update at %llu.%u with jiffies=%ld\n", clock_old, nsecs, jiffies_old);
+	nsecs = do_div(clock_now, NSEC_PER_SEC);
+	pr_err("Now is %llu.%u, and jiffies = %ld\n", clock_now, nsecs, jiffies_now);
+	WARN_ON(1);
+}
+
+static void selftest_event_overflow(struct perf_event *event,
+				    struct perf_sample_data *data,
+				    struct pt_regs *regs)
+{
+	long jiffies_now, jiffies_old;
+	u64 clock_now, clock_old;
+	s64 old_diff, new_diff;
+	int seq;
+
+	/* Quick check if already warned, stop there */
+	if (warned)
+		return;
+
+	/* Allow out of date jiffies on sleeping CPUs */
+	if (!__this_cpu_read(selftest_running))
+		return;
+
+	/*
+	 * If we race with a jiffies updater, we can't wait for it
+	 * as it may well be the current CPU
+	 */
+	seq = raw_seqbegin(&jiffies_lock);
+
+	jiffies_old = jiffies_snap;
+	clock_old = clock_snap;
+
+	jiffies_now = ACCESS_ONCE(jiffies);
+	clock_now = sched_clock_cpu(0);
+
+	old_diff = (s64)clock_snap - jiffies_to_nsecs(jiffies_snap);
+	new_diff = (s64)clock_now - jiffies_to_nsecs(jiffies_now);
+
+	if (abs(new_diff - old_diff) < TIMEKEEPING_MAX_DELAY)
+		return;
+
+	/* Abort if we raced with a snapshot updater */
+	if (read_seqretry(&jiffies_lock, seq))
+		return;
+
+	if (!xchg(&warned, 1))
+		timekeeping_missing_warn(jiffies_old, clock_old, jiffies_now, clock_now);
+}
+
+/*
+ * Called under jiffies lock, so NMIs don't see half way updates
+ */
+void timekeeping_selftest_update(void)
+{
+	jiffies_snap = jiffies;
+	/*
+	 * sched_clock_cpu(random constant nr) is used to mimic a global
+	 * monotonic clock. It's a hacky equivalent to trace_clock_global().
+	 * We need that to compare two snapshots reliably.
+	 */
+	clock_snap = sched_clock_cpu(0);
+}
+
+void timekeeping_selftest_start(void)
+{
+	__this_cpu_write(selftest_running, true);
+}
+
+void timekeeping_selftest_stop(void)
+{
+	__this_cpu_write(selftest_running, false);
+}
+
+static int __init timekeeping_selftest_init(void)
+{
+	int cpu;
+
+	event_hw_attr.sample_period = nsecs_to_cycles(TIMEKEEPING_CHECK_DELAY);
+
+	/* TODO: proper CPU hotplug handling ?*/
+	get_online_cpus();
+
+	for_each_online_cpu(cpu) {
+		struct perf_event *event;
+		event = perf_event_create_kernel_counter(&event_hw_attr, cpu, NULL,
+							 selftest_event_overflow, NULL);
+		if (event)
+			per_cpu(cpu_event, cpu) = event;
+	}
+
+	put_online_cpus();
+
+	return 0;
+}
+core_initcall(timekeeping_selftest_init);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 568179d..419f2e9 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -788,6 +788,18 @@ config TIMER_STATS
 	  (it defaults to deactivated on bootup and will only be activated
 	  if some application like powertop activates it explicitly).
 
+config DEBUG_TIMEKEEPING
+	bool "Debug timekeeping"
+	depends on HAVE_SCHED_CLOCK_NMI
+	depends on PERF_EVENTS && HAVE_PERF_EVENTS_NMI && HAVE_NSECS_TO_CYCLES
+	help
+	  This enables detection of missing timekeeping updates. This option
+	  is mostly useful to check the correctness of the full system idle
+	  detection used by full dynticks to adaptively run the timekeeper
+	  in order to keep reasonable power consumption. But it can also be
+	  used anytime to ensure that timekeeping is correctly moving foward
+	  when it is expected to.
+
 config DEBUG_PREEMPT
 	bool "Debug preemptible kernel"
 	depends on DEBUG_KERNEL && PREEMPT && TRACE_IRQFLAGS_SUPPORT
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 6/6] timekeeping: Debug missing timekeeping updates
  2013-08-21 16:42 ` [RFC PATCH 6/6] timekeeping: Debug missing timekeeping updates Frederic Weisbecker
@ 2013-08-21 17:25   ` John Stultz
  2013-08-30 11:05     ` Frederic Weisbecker
  0 siblings, 1 reply; 11+ messages in thread
From: John Stultz @ 2013-08-21 17:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Paul E. McKenney, Steven Rostedt, Don Zickus

On 08/21/2013 09:42 AM, Frederic Weisbecker wrote:
> With the full dynticks feature and the tricky full system idle
> detection code that is coming soon, it becomes necessary to have
> some debug code that makes sure that the timekeeping is always
> maintained and moving forward as expected.
>
> This provides a simple detection of missing timekeeping updates
> inspired by the lockup detector's use of CPU cycles clock.
>
> The jiffies are compared to the cpu clock after several snapshots taken
> from NMIs that trigger after arbitrary CPU cycles period overflow.
>
> If the jiffies progression appears to drift too far away from the CPU
> clock's, this triggers a warning.
>
> We just make sure not to account the tiny code on irq entry that
> may have stale jiffies values before tick_check_nohz() is called
> after the CPU is woken up while the system went full idle for some
> time.
>
> Same goes for idle exit in case the tick were stopped but idle
> was polling on need_resched().

So you're using sched_clock to try to detect timekeeping
inconsistencies. Hrm.. Do you have some examples of where this debug
infrastructure helped out?

A few thoughts:

1) Why are you using jiffies as the timekeeping reference instead of
reading some of actual timekeeping values? Jiffies usage has been
intentionally on the decline, and since the dynticks infrastructure
landed, jiffies are just derived from the timekeeping core, so its so
its sort of strange to see it used for this.

2) This seems very similar to the old lost-ticks compensation code we
had prior to the clocksource infrastructure, and seems like it might
suffer from some of the issues seen there. For instance, sched_clock has
been historically looser in its correctness requirements then the
timekeeping code, so using it to validate the more strict timekeeping
code, makes me worry we might see cases of false positives.

3) I'm also curious (maybe skeptical) as if sched_clock is reliable
enough to use for validating time, then we likely are using that same
hardware as the timekeeping clocksource. Thus cases where I'd suspect
you'd see likely issues w/ nohz, like clocksource counter overflows
being missed on quick wrapping clcoksources wouldn't really apply.

Personally, I've been thinking the timekeeping update code could use
some improvements/warnings around cases where update delay is larger
then the clocksource max_deferment - possibly falling back to a slower
overflow-proof multiply as is done in the CLOCK_SOURCE_SUSPEND_NONSTOP
resume case. This would allow more robust behaivor in cases like kvm
guests being paused for unreasonable lengths of time, and could also
provide very similar NOHZ debug warnings (assuming the clocksource
doesn't wrap quickly - but again, in those cases, I'm not confident we
can trust sched_clock either).

thanks
-john

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/6] x86: nsecs to cycles conversion
  2013-08-21 16:42 ` [RFC PATCH 2/6] x86: nsecs to cycles conversion Frederic Weisbecker
@ 2013-08-21 18:26   ` Don Zickus
  2013-08-30 10:35     ` Frederic Weisbecker
  0 siblings, 1 reply; 11+ messages in thread
From: Don Zickus @ 2013-08-21 18:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Paul E. McKenney, John Stultz, Steven Rostedt

On Wed, Aug 21, 2013 at 06:42:17PM +0200, Frederic Weisbecker wrote:
> hw_nmi_get_sample_period() is simply a conversion from a period
> to cycles. Lets generalize the API naming so that it can be used for
> wider purpose than just watchdog perf event settings. Also it makes the
> function name less opaque about what it really does.
> 

<snip>

> diff --git a/arch/x86/include/asm/cycles.h b/arch/x86/include/asm/cycles.h
> new file mode 100644
> index 0000000..7dedeb3
> --- /dev/null
> +++ b/arch/x86/include/asm/cycles.h
> @@ -0,0 +1,11 @@
> +#ifndef _ASM_X86_CYCLES_H
> +#define _ASM_X86_CYCLES_H
> +
> +#include <linux/time.h>
> +#include <asm/tsc.h>
> +
> +static inline u64 nsecs_to_cycles(u64 nsecs)
> +{
> +	return (u64)((cpu_khz) * nsecs) / NSEC_PER_MSEC;
> +}
> +#endif /* #ifndef _ASM_X86_CYCLES_H */
> diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
> index a698d71..4015906 100644
> --- a/arch/x86/kernel/apic/hw_nmi.c
> +++ b/arch/x86/kernel/apic/hw_nmi.c
> @@ -19,13 +19,6 @@
>  #include <linux/module.h>
>  #include <linux/delay.h>
>  
> -#ifdef CONFIG_HARDLOCKUP_DETECTOR
> -u64 hw_nmi_get_sample_period(int watchdog_thresh)
> -{
> -	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
> -}
> -#endif
> -
>  #ifdef arch_trigger_all_cpu_backtrace
>  /* For reliability, we're prepared to waste bits here. */
>  static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 1241d8c..e04887c 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -26,6 +26,7 @@
>  #include <linux/sched/rt.h>
>  
>  #include <asm/irq_regs.h>
> +#include <asm/cycles.h>
>  #include <linux/kvm_para.h>
>  #include <linux/perf_event.h>

Do you have to wrap the above with #ifdef CONFIG_HARDLOCKUP_DETECTOR?
Otherwise non-x86 boxes won't compile unless I missed stub
nsecs_to_cycles() somewhere.

Cheers,
Don

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/6] x86: nsecs to cycles conversion
  2013-08-21 18:26   ` Don Zickus
@ 2013-08-30 10:35     ` Frederic Weisbecker
  0 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2013-08-30 10:35 UTC (permalink / raw)
  To: Don Zickus
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Paul E. McKenney, John Stultz, Steven Rostedt

On Wed, Aug 21, 2013 at 02:26:46PM -0400, Don Zickus wrote:
> On Wed, Aug 21, 2013 at 06:42:17PM +0200, Frederic Weisbecker wrote:
> > hw_nmi_get_sample_period() is simply a conversion from a period
> > to cycles. Lets generalize the API naming so that it can be used for
> > wider purpose than just watchdog perf event settings. Also it makes the
> > function name less opaque about what it really does.
> > 
> 
> <snip>
> 
> > diff --git a/arch/x86/include/asm/cycles.h b/arch/x86/include/asm/cycles.h
> > new file mode 100644
> > index 0000000..7dedeb3
> > --- /dev/null
> > +++ b/arch/x86/include/asm/cycles.h
> > @@ -0,0 +1,11 @@
> > +#ifndef _ASM_X86_CYCLES_H
> > +#define _ASM_X86_CYCLES_H
> > +
> > +#include <linux/time.h>
> > +#include <asm/tsc.h>
> > +
> > +static inline u64 nsecs_to_cycles(u64 nsecs)
> > +{
> > +	return (u64)((cpu_khz) * nsecs) / NSEC_PER_MSEC;
> > +}
> > +#endif /* #ifndef _ASM_X86_CYCLES_H */
> > diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
> > index a698d71..4015906 100644
> > --- a/arch/x86/kernel/apic/hw_nmi.c
> > +++ b/arch/x86/kernel/apic/hw_nmi.c
> > @@ -19,13 +19,6 @@
> >  #include <linux/module.h>
> >  #include <linux/delay.h>
> >  
> > -#ifdef CONFIG_HARDLOCKUP_DETECTOR
> > -u64 hw_nmi_get_sample_period(int watchdog_thresh)
> > -{
> > -	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
> > -}
> > -#endif
> > -
> >  #ifdef arch_trigger_all_cpu_backtrace
> >  /* For reliability, we're prepared to waste bits here. */
> >  static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
> > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > index 1241d8c..e04887c 100644
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -26,6 +26,7 @@
> >  #include <linux/sched/rt.h>
> >  
> >  #include <asm/irq_regs.h>
> > +#include <asm/cycles.h>
> >  #include <linux/kvm_para.h>
> >  #include <linux/perf_event.h>
> 
> Do you have to wrap the above with #ifdef CONFIG_HARDLOCKUP_DETECTOR?
> Otherwise non-x86 boxes won't compile unless I missed stub
> nsecs_to_cycles() somewhere.

Good catch, I forgot that.

Thanks.

> 
> Cheers,
> Don

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 6/6] timekeeping: Debug missing timekeeping updates
  2013-08-21 17:25   ` John Stultz
@ 2013-08-30 11:05     ` Frederic Weisbecker
  0 siblings, 0 replies; 11+ messages in thread
From: Frederic Weisbecker @ 2013-08-30 11:05 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Paul E. McKenney, Steven Rostedt, Don Zickus

On Wed, Aug 21, 2013 at 10:25:57AM -0700, John Stultz wrote:
> On 08/21/2013 09:42 AM, Frederic Weisbecker wrote:
> > With the full dynticks feature and the tricky full system idle
> > detection code that is coming soon, it becomes necessary to have
> > some debug code that makes sure that the timekeeping is always
> > maintained and moving forward as expected.
> >
> > This provides a simple detection of missing timekeeping updates
> > inspired by the lockup detector's use of CPU cycles clock.
> >
> > The jiffies are compared to the cpu clock after several snapshots taken
> > from NMIs that trigger after arbitrary CPU cycles period overflow.
> >
> > If the jiffies progression appears to drift too far away from the CPU
> > clock's, this triggers a warning.
> >
> > We just make sure not to account the tiny code on irq entry that
> > may have stale jiffies values before tick_check_nohz() is called
> > after the CPU is woken up while the system went full idle for some
> > time.
> >
> > Same goes for idle exit in case the tick were stopped but idle
> > was polling on need_resched().
> 
> So you're using sched_clock to try to detect timekeeping
> inconsistencies. Hrm.. Do you have some examples of where this debug
> infrastructure helped out?

Yeah, currently the full dynticks implies to keep a CPU with the tick alive all
the time. This way we make sure that busy CPUs have a reliable timekeeping even
when they run tickless.

We could make the timekeeper go to sleep when the system is fully idle and wake
it up when there is at least a full dynticks CPU alive. But the simple concept of
full idle detection is actually not that obvious. We need to maintain some atomic
counter of busy CPUs, send an IPI to the timekeeper when a single CPU wakes up, and
order all that correctly.

But maintaining such atomic counter is bad for scalability.

So Paul MckKenney is working on a full system idle detection that reuse the RCU
extended grace period detection infrastructure. He enhanced it with a state machine
based on smp barriers and atomic ops. The thing appears to be much more scalable
that a single busy CPU counters.

All in one it makes sure that timekeeping is always well maintained while keeping
reasonable power consumption by allowing the timekeeper to sleep when it considers
that it's time to do so.

I don't understand the state macchine completely though, so my brain takes it as a lemma.
And in any case it's a quite complicated piece on which the timekeeping progression depends.

So I thought we really need to start adding some automated detection of missing timekeeping.

> 
> A few thoughts:
> 
> 1) Why are you using jiffies as the timekeeping reference instead of
> reading some of actual timekeeping values? Jiffies usage has been
> intentionally on the decline, and since the dynticks infrastructure
> landed, jiffies are just derived from the timekeeping core, so its so
> its sort of strange to see it used for this.

That's because it's a simple counter so it's simpler to compute diffs and
comparisons on top of it. I prefered to use it instead of gtod values to
keep the NMI fast check path simple enough.

> 
> 2) This seems very similar to the old lost-ticks compensation code we
> had prior to the clocksource infrastructure, and seems like it might
> suffer from some of the issues seen there. For instance, sched_clock has
> been historically looser in its correctness requirements then the
> timekeeping code, so using it to validate the more strict timekeeping
> code, makes me worry we might see cases of false positives.

Yeah I was worried about that too. That's why that detection allows large
drifts between jiffies and sched clock (around 0.5 secs).

Do you think of other clock base I could use instead? sched clock also has
the advantage to be fast and NMI-safe, at least in x86.

> 
> 3) I'm also curious (maybe skeptical) as if sched_clock is reliable
> enough to use for validating time, then we likely are using that same
> hardware as the timekeeping clocksource. Thus cases where I'd suspect
> you'd see likely issues w/ nohz, like clocksource counter overflows
> being missed on quick wrapping clcoksources wouldn't really apply.

Hmm, but I thought timekeeping_max_deferment would takes care of clocksource
overflows?

> 
> 
> Personally, I've been thinking the timekeeping update code could use
> some improvements/warnings around cases where update delay is larger
> then the clocksource max_deferment - possibly falling back to a slower
> overflow-proof multiply as is done in the CLOCK_SOURCE_SUSPEND_NONSTOP
> resume case. This would allow more robust behaivor in cases like kvm
> guests being paused for unreasonable lengths of time, and could also
> provide very similar NOHZ debug warnings (assuming the clocksource
> doesn't wrap quickly - but again, in those cases, I'm not confident we
> can trust sched_clock either).

May be. I don't know enough the details of timekeeping to debate there :)

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-08-30 11:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-21 16:42 [RFC PATCH 0/6] timekeeping: Missing timekeeping update detection Frederic Weisbecker
2013-08-21 16:42 ` [RFC PATCH 1/6] sched: Let arch tell us if sched clock is NMI-safe Frederic Weisbecker
2013-08-21 16:42 ` [RFC PATCH 2/6] x86: nsecs to cycles conversion Frederic Weisbecker
2013-08-21 18:26   ` Don Zickus
2013-08-30 10:35     ` Frederic Weisbecker
2013-08-21 16:42 ` [RFC PATCH 3/6] x86: Tell that sched clock is callable in nmi Frederic Weisbecker
2013-08-21 16:42 ` [RFC PATCH 4/6] seqlock: Add raw_seqbegin() for non-waiting readers Frederic Weisbecker
2013-08-21 16:42 ` [RFC PATCH 5/6] jiffies: Add jiffies_to_nsecs Frederic Weisbecker
2013-08-21 16:42 ` [RFC PATCH 6/6] timekeeping: Debug missing timekeeping updates Frederic Weisbecker
2013-08-21 17:25   ` John Stultz
2013-08-30 11:05     ` Frederic Weisbecker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).