All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
@ 2012-07-10 22:43 John Stultz
  2012-07-10 22:43 ` [PATCH 1/6] hrtimer: Provide clock_was_set_delayed() John Stultz
                   ` (11 more replies)
  0 siblings, 12 replies; 38+ messages in thread
From: John Stultz @ 2012-07-10 22:43 UTC (permalink / raw)
  To: Linux Kernel
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, stable

Over the weekend, Thomas got a chance to review the leap second fix
in more detail and had a few additional changes he wanted to make
to improve performance as well as style.

So this iteration includes his modifications.

Once merged, I'll be working to get the backports finished as quickly
as I can and sent to -stable.

thanks
-john

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Prarit Bhargava <prarit@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: stable@vger.kernel.org


John Stultz (3):
  hrtimer: Provide clock_was_set_delayed()
  timekeeping: Fix leapsecond triggered load spike issue
  hrtimer: Update hrtimer base offsets each hrtimer_interrupt

Thomas Gleixner (3):
  timekeeping: Maintain ktime_t based offsets for hrtimers
  hrtimer: Move lock held region in hrtimer_interrupt()
  timekeeping: Provide hrtimer update function

 include/linux/hrtimer.h   |    6 ++++-
 kernel/hrtimer.c          |   53 ++++++++++++++++++++++++++------------
 kernel/time/timekeeping.c |   63 +++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 103 insertions(+), 19 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
@ 2012-07-10 22:43 ` John Stultz
  2012-07-11 12:15   ` Prarit Bhargava
  2012-07-11 21:40   ` [tip:timers/urgent] " tip-bot for John Stultz
  2012-07-10 22:43 ` [PATCH 2/6] timekeeping: Fix leapsecond triggered load spike issue John Stultz
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 38+ messages in thread
From: John Stultz @ 2012-07-10 22:43 UTC (permalink / raw)
  To: Linux Kernel
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, stable

clock_was_set() cannot be called from hard interrupt context because
it calls on_each_cpu(). For fixing the widely reported leap seconds
issue it's necessary to call it from the timer interrupt context.

Provide a new function which denotes it in the hrtimer cpu base
structure of the cpu on which it is called and raising the timer
softirq.

We then execute the clock_was_set() notificiation in the timer softirq
context in hrtimer_run_pending().

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Prarit Bhargava <prarit@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: stable@vger.kernel.org
Reported-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
---
 include/linux/hrtimer.h |    5 ++++-
 kernel/hrtimer.c        |   20 ++++++++++++++++++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index fd0dc30..4c3dac8 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -165,6 +165,7 @@ enum  hrtimer_base_type {
  * @lock:		lock protecting the base and associated clock bases
  *			and timers
  * @active_bases:	Bitfield to mark bases with active timers
+ * @clock_was_set:	Indicates that clock was set from irq context.
  * @expires_next:	absolute time of the next event which was scheduled
  *			via clock_set_next_event()
  * @hres_active:	State of high resolution mode
@@ -177,7 +178,8 @@ enum  hrtimer_base_type {
  */
 struct hrtimer_cpu_base {
 	raw_spinlock_t			lock;
-	unsigned long			active_bases;
+	unsigned int			active_bases;
+	unsigned int			clock_was_set;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	ktime_t				expires_next;
 	int				hres_active;
@@ -309,6 +311,7 @@ static inline int hrtimer_is_hres_active(struct hrtimer *timer)
 #endif
 
 extern void clock_was_set(void);
+extern void clock_was_set_delayed(void);
 #ifdef CONFIG_TIMERFD
 extern void timerfd_clock_was_set(void);
 #else
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index ae34bf5..7c20cb8 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -756,6 +756,19 @@ void clock_was_set(void)
 }
 
 /*
+ * Called from timekeeping code to reprogramm the hrtimer interrupt
+ * device. If called from the timer interrupt context we defer it to
+ * softirq context.
+ */
+void clock_was_set_delayed(void)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
+	cpu_base->clock_was_set = 1;
+	__raise_softirq_irqoff(TIMER_SOFTIRQ);
+}
+
+/*
  * During resume we might have to reprogram the high resolution timer
  * interrupt (on the local CPU):
  */
@@ -1413,6 +1426,13 @@ static inline void __hrtimer_peek_ahead_timers(void) { }
  */
 void hrtimer_run_pending(void)
 {
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
+	if (cpu_base->clock_was_set) {
+		cpu_base->clock_was_set = 0;
+		clock_was_set();
+	}
+
 	if (hrtimer_hres_active())
 		return;
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/6] timekeeping: Fix leapsecond triggered load spike issue
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
  2012-07-10 22:43 ` [PATCH 1/6] hrtimer: Provide clock_was_set_delayed() John Stultz
@ 2012-07-10 22:43 ` John Stultz
  2012-07-11 21:41   ` [tip:timers/urgent] " tip-bot for John Stultz
  2012-07-10 22:43 ` [PATCH 3/6] timekeeping: Maintain ktime_t based offsets for hrtimers John Stultz
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2012-07-10 22:43 UTC (permalink / raw)
  To: Linux Kernel
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, stable

The timekeeping code misses an update of the hrtimer subsystem after a
leap second happened. Due to that timers based on CLOCK_REALTIME are
either expiring a second early or late depending on whether a leap
second has been inserted or deleted until an operation is initiated
which causes that update. Unless the update happens by some other
means this discrepancy between the timekeeping and the hrtimer data
stays forever and timers are expired either early or late.

The reported immediate workaround - $ data -s "`date`" - is causing a
call to clock_was_set() which updates the hrtimer data structures.
See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix

Add the missing clock_was_set() call to update_wall_time() in case of
a leap second event. The actual update is deferred to softirq context
as the necessary smp function call cannot be invoked from hard
interrupt context.

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Prarit Bhargava <prarit@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: stable@vger.kernel.org
Reported-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
---
 kernel/time/timekeeping.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6f46a00..a413e59 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -963,6 +963,8 @@ static cycle_t logarithmic_accumulation(cycle_t offset, int shift)
 		leap = second_overflow(timekeeper.xtime.tv_sec);
 		timekeeper.xtime.tv_sec += leap;
 		timekeeper.wall_to_monotonic.tv_sec -= leap;
+		if (leap)
+			clock_was_set_delayed();
 	}
 
 	/* Accumulate raw time */
@@ -1079,6 +1081,8 @@ static void update_wall_time(void)
 		leap = second_overflow(timekeeper.xtime.tv_sec);
 		timekeeper.xtime.tv_sec += leap;
 		timekeeper.wall_to_monotonic.tv_sec -= leap;
+		if (leap)
+			clock_was_set_delayed();
 	}
 
 	timekeeping_update(false);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/6] timekeeping: Maintain ktime_t based offsets for hrtimers
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
  2012-07-10 22:43 ` [PATCH 1/6] hrtimer: Provide clock_was_set_delayed() John Stultz
  2012-07-10 22:43 ` [PATCH 2/6] timekeeping: Fix leapsecond triggered load spike issue John Stultz
@ 2012-07-10 22:43 ` John Stultz
  2012-07-11 21:42   ` [tip:timers/urgent] " tip-bot for Thomas Gleixner
  2012-07-10 22:43 ` [PATCH 4/6] hrtimer: Move lock held region in hrtimer_interrupt() John Stultz
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2012-07-10 22:43 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	stable, John Stultz

From: Thomas Gleixner <tglx@linutronix.de>

We need to update the hrtimer clock offsets from the hrtimer interrupt
context. To avoid conversions from timespec to ktime_t maintain a
ktime_t based representation of those offsets in the timekeeper. This
puts the conversion overhead into the code which updates the
underlying offsets and provides fast accessible values in the hrtimer
interrupt.

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Prarit Bhargava <prarit@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
---
 kernel/time/timekeeping.c |   25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index a413e59..1c038da 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -70,6 +70,12 @@ struct timekeeper {
 	/* The raw monotonic time for the CLOCK_MONOTONIC_RAW posix clock. */
 	struct timespec raw_time;
 
+	/* Offset clock monotonic -> clock realtime */
+	ktime_t offs_real;
+
+	/* Offset clock monotonic -> clock boottime */
+	ktime_t offs_boot;
+
 	/* Seqlock for all timekeeper values */
 	seqlock_t lock;
 };
@@ -172,6 +178,14 @@ static inline s64 timekeeping_get_ns_raw(void)
 	return clocksource_cyc2ns(cycle_delta, clock->mult, clock->shift);
 }
 
+static void update_rt_offset(void)
+{
+	struct timespec tmp, *wtm = &timekeeper.wall_to_monotonic;
+
+	set_normalized_timespec(&tmp, -wtm->tv_sec, -wtm->tv_nsec);
+	timekeeper.offs_real = timespec_to_ktime(tmp);
+}
+
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(bool clearntp)
 {
@@ -179,6 +193,7 @@ static void timekeeping_update(bool clearntp)
 		timekeeper.ntp_error = 0;
 		ntp_clear();
 	}
+	update_rt_offset();
 	update_vsyscall(&timekeeper.xtime, &timekeeper.wall_to_monotonic,
 			 timekeeper.clock, timekeeper.mult);
 }
@@ -604,6 +619,7 @@ void __init timekeeping_init(void)
 	}
 	set_normalized_timespec(&timekeeper.wall_to_monotonic,
 				-boot.tv_sec, -boot.tv_nsec);
+	update_rt_offset();
 	timekeeper.total_sleep_time.tv_sec = 0;
 	timekeeper.total_sleep_time.tv_nsec = 0;
 	write_sequnlock_irqrestore(&timekeeper.lock, flags);
@@ -612,6 +628,12 @@ void __init timekeeping_init(void)
 /* time in seconds when suspend began */
 static struct timespec timekeeping_suspend_time;
 
+static void update_sleep_time(struct timespec t)
+{
+	timekeeper.total_sleep_time = t;
+	timekeeper.offs_boot = timespec_to_ktime(t);
+}
+
 /**
  * __timekeeping_inject_sleeptime - Internal function to add sleep interval
  * @delta: pointer to a timespec delta value
@@ -630,8 +652,7 @@ static void __timekeeping_inject_sleeptime(struct timespec *delta)
 	timekeeper.xtime = timespec_add(timekeeper.xtime, *delta);
 	timekeeper.wall_to_monotonic =
 			timespec_sub(timekeeper.wall_to_monotonic, *delta);
-	timekeeper.total_sleep_time = timespec_add(
-					timekeeper.total_sleep_time, *delta);
+	update_sleep_time(timespec_add(timekeeper.total_sleep_time, *delta));
 }
 
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/6] hrtimer: Move lock held region in hrtimer_interrupt()
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (2 preceding siblings ...)
  2012-07-10 22:43 ` [PATCH 3/6] timekeeping: Maintain ktime_t based offsets for hrtimers John Stultz
@ 2012-07-10 22:43 ` John Stultz
  2012-07-10 22:43 ` [PATCH 4/6] hrtimers: " John Stultz
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-10 22:43 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	stable, John Stultz

From: Thomas Gleixner <tglx@linutronix.de>

We need to update the base offsets from this code and we need to do
that under base->lock. Move the lock held region around the
ktime_get() calls. The ktime_get() calls are going to be replaced with
a function which gets the time and the offsets atomically.

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Prarit Bhargava <prarit@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
---
 kernel/hrtimer.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 7c20cb8..14a260a 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1263,11 +1263,10 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	cpu_base->nr_events++;
 	dev->next_event.tv64 = KTIME_MAX;
 
+	raw_spin_lock(&cpu_base->lock);
 	entry_time = now = ktime_get();
 retry:
 	expires_next.tv64 = KTIME_MAX;
-
-	raw_spin_lock(&cpu_base->lock);
 	/*
 	 * We set expires_next to KTIME_MAX here with cpu_base->lock
 	 * held to prevent that a timer is enqueued in our queue via
@@ -1344,6 +1343,7 @@ retry:
 	 * interrupt routine. We give it 3 attempts to avoid
 	 * overreacting on some spurious event.
 	 */
+	raw_spin_lock(&cpu_base->lock);
 	now = ktime_get();
 	cpu_base->nr_retries++;
 	if (++retries < 3)
@@ -1356,6 +1356,7 @@ retry:
 	 */
 	cpu_base->nr_hangs++;
 	cpu_base->hang_detected = 1;
+	raw_spin_unlock(&cpu_base->lock);
 	delta = ktime_sub(now, entry_time);
 	if (delta.tv64 > cpu_base->max_hang_time.tv64)
 		cpu_base->max_hang_time = delta;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/6] hrtimers: Move lock held region in hrtimer_interrupt()
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (3 preceding siblings ...)
  2012-07-10 22:43 ` [PATCH 4/6] hrtimer: Move lock held region in hrtimer_interrupt() John Stultz
@ 2012-07-10 22:43 ` John Stultz
  2012-07-11 21:43   ` [tip:timers/urgent] " tip-bot for Thomas Gleixner
  2012-07-10 22:43 ` [PATCH 5/6] timekeeping: Provide hrtimer update function John Stultz
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2012-07-10 22:43 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	stable, John Stultz

From: Thomas Gleixner <tglx@linutronix.de>

We need to update the base offsets from this code and we need to do
that under base->lock. Move the lock held region around the
ktime_get() calls. The ktime_get() calls are going to be replaced with
a function which gets the time and the offsets atomically.

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Prarit Bhargava <prarit@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
---
 kernel/hrtimer.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 7c20cb8..14a260a 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1263,11 +1263,10 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	cpu_base->nr_events++;
 	dev->next_event.tv64 = KTIME_MAX;
 
+	raw_spin_lock(&cpu_base->lock);
 	entry_time = now = ktime_get();
 retry:
 	expires_next.tv64 = KTIME_MAX;
-
-	raw_spin_lock(&cpu_base->lock);
 	/*
 	 * We set expires_next to KTIME_MAX here with cpu_base->lock
 	 * held to prevent that a timer is enqueued in our queue via
@@ -1344,6 +1343,7 @@ retry:
 	 * interrupt routine. We give it 3 attempts to avoid
 	 * overreacting on some spurious event.
 	 */
+	raw_spin_lock(&cpu_base->lock);
 	now = ktime_get();
 	cpu_base->nr_retries++;
 	if (++retries < 3)
@@ -1356,6 +1356,7 @@ retry:
 	 */
 	cpu_base->nr_hangs++;
 	cpu_base->hang_detected = 1;
+	raw_spin_unlock(&cpu_base->lock);
 	delta = ktime_sub(now, entry_time);
 	if (delta.tv64 > cpu_base->max_hang_time.tv64)
 		cpu_base->max_hang_time = delta;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 5/6] timekeeping: Provide hrtimer update function
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (4 preceding siblings ...)
  2012-07-10 22:43 ` [PATCH 4/6] hrtimers: " John Stultz
@ 2012-07-10 22:43 ` John Stultz
  2012-07-11 21:44   ` [tip:timers/urgent] " tip-bot for Thomas Gleixner
  2012-07-10 22:43 ` [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt John Stultz
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2012-07-10 22:43 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	stable, John Stultz

From: Thomas Gleixner <tglx@linutronix.de>

To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Prarit Bhargava <prarit@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
---
 include/linux/hrtimer.h   |    1 +
 kernel/time/timekeeping.c |   34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 4c3dac8..25b214e 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -323,6 +323,7 @@ extern ktime_t ktime_get(void);
 extern ktime_t ktime_get_real(void);
 extern ktime_t ktime_get_boottime(void);
 extern ktime_t ktime_get_monotonic_offset(void);
+extern ktime_t ktime_get_update_offsets(ktime_t *offs_real, ktime_t *offs_boot);
 
 DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1c038da..269b1fe 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1271,6 +1271,40 @@ void get_xtime_and_monotonic_and_sleep_offset(struct timespec *xtim,
 	} while (read_seqretry(&timekeeper.lock, seq));
 }
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+/**
+ * ktime_get_update_offsets - hrtimer helper
+ * @offs_real:	pointer to storage for monotonic -> realtime offset
+ * @offs_boot:	pointer to storage for monotonic -> boottime offset
+ *
+ * Returns current monotonic time and updates the offsets
+ * Called from hrtimer_interupt() or retrigger_next_event()
+ */
+ktime_t ktime_get_update_offsets(ktime_t *offs_real, ktime_t *offs_boot)
+{
+	ktime_t now;
+	unsigned int seq;
+	u64 secs, nsecs;
+
+	do {
+		seq = read_seqbegin(&timekeeper.lock);
+
+		secs = timekeeper.xtime.tv_sec;
+		nsecs = timekeeper.xtime.tv_nsec;
+		nsecs += timekeeping_get_ns();
+		/* If arch requires, add in gettimeoffset() */
+		nsecs += arch_gettimeoffset();
+
+		*offs_real = timekeeper.offs_real;
+		*offs_boot = timekeeper.offs_boot;
+	} while (read_seqretry(&timekeeper.lock, seq));
+
+	now = ktime_add_ns(ktime_set(secs, 0), nsecs);
+	now = ktime_sub(now, *offs_real);
+	return now;
+}
+#endif
+
 /**
  * ktime_get_monotonic_offset() - get wall_to_monotonic in ktime_t format
  */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (5 preceding siblings ...)
  2012-07-10 22:43 ` [PATCH 5/6] timekeeping: Provide hrtimer update function John Stultz
@ 2012-07-10 22:43 ` John Stultz
  2012-07-11 21:45   ` [tip:timers/urgent] " tip-bot for John Stultz
                     ` (2 more replies)
  2012-07-10 22:53 ` [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (4 subsequent siblings)
  11 siblings, 3 replies; 38+ messages in thread
From: John Stultz @ 2012-07-10 22:43 UTC (permalink / raw)
  To: Linux Kernel
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, stable

The update of the hrtimer base offsets on all cpus cannot be made
atomically from the timekeeper.lock held and interrupt disabled region
as smp function calls are not allowed there.

clock_was_set(), which enforces the update on all cpus, is called
either from preemptible process context in case of do_settimeofday()
or from the softirq context when the offset modification happened in
the timer interrupt itself due to a leap second.

In both cases there is a race window for an hrtimer interrupt between
dropping timekeeper lock, enabling interrupts and clock_was_set()
issuing the updates. Any interrupt which arrives in that window will
see the new time but operate on stale offsets.

So we need to make sure that an hrtimer interrupt always sees a
consistent state of time and offsets.

ktime_get_update_offsets() allows us to get the current monotonic time
and update the per cpu hrtimer base offsets from hrtimer_interrupt()
to capture a consistent state of monotonic time and the offsets. The
function replaces the existing ktime_get() calls in hrtimer_interrupt().

The overhead of the new function vs. ktime_get() is minimal as it just
adds two store operations.

This ensures that any changes to realtime or boottime offsets are
noticed and stored into the per-cpu hrtimer base structures, prior to
any hrtimer expiration and guarantees that timers are not expired early.

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Prarit Bhargava <prarit@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
---
 kernel/hrtimer.c |   28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 14a260a..669522b 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -657,6 +657,14 @@ static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
 	return 0;
 }
 
+static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
+{
+	ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
+	ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
+
+	return ktime_get_update_offsets(offs_real, offs_boot);
+}
+
 /*
  * Retrigger next event is called after clock was set
  *
@@ -665,22 +673,12 @@ static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
 static void retrigger_next_event(void *arg)
 {
 	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
-	struct timespec realtime_offset, xtim, wtm, sleep;
 
 	if (!hrtimer_hres_active())
 		return;
 
-	/* Optimized out for !HIGH_RES */
-	get_xtime_and_monotonic_and_sleep_offset(&xtim, &wtm, &sleep);
-	set_normalized_timespec(&realtime_offset, -wtm.tv_sec, -wtm.tv_nsec);
-
-	/* Adjust CLOCK_REALTIME offset */
 	raw_spin_lock(&base->lock);
-	base->clock_base[HRTIMER_BASE_REALTIME].offset =
-		timespec_to_ktime(realtime_offset);
-	base->clock_base[HRTIMER_BASE_BOOTTIME].offset =
-		timespec_to_ktime(sleep);
-
+	hrtimer_update_base(base);
 	hrtimer_force_reprogram(base, 0);
 	raw_spin_unlock(&base->lock);
 }
@@ -710,7 +708,6 @@ static int hrtimer_switch_to_hres(void)
 		base->clock_base[i].resolution = KTIME_HIGH_RES;
 
 	tick_setup_sched_timer();
-
 	/* "Retrigger" the interrupt to get things going */
 	retrigger_next_event(NULL);
 	local_irq_restore(flags);
@@ -1264,7 +1261,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	dev->next_event.tv64 = KTIME_MAX;
 
 	raw_spin_lock(&cpu_base->lock);
-	entry_time = now = ktime_get();
+	entry_time = now = hrtimer_update_base(cpu_base);
 retry:
 	expires_next.tv64 = KTIME_MAX;
 	/*
@@ -1342,9 +1339,12 @@ retry:
 	 * We need to prevent that we loop forever in the hrtimer
 	 * interrupt routine. We give it 3 attempts to avoid
 	 * overreacting on some spurious event.
+	 *
+	 * Acquire base lock for updating the offsets and retrieving
+	 * the current time.
 	 */
 	raw_spin_lock(&cpu_base->lock);
-	now = ktime_get();
+	now = hrtimer_update_base(cpu_base);
 	cpu_base->nr_retries++;
 	if (++retries < 3)
 		goto retry;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (6 preceding siblings ...)
  2012-07-10 22:43 ` [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt John Stultz
@ 2012-07-10 22:53 ` John Stultz
  2012-07-12 22:43   ` Jiri Bohac
  2012-07-10 23:00 ` John Stultz
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2012-07-10 22:53 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Ingo Molnar, Peter Zijlstra, Prarit Bhargava, Thomas Gleixner, stable

On 07/10/2012 03:43 PM, John Stultz wrote:
> Over the weekend, Thomas got a chance to review the leap second fix
> in more detail and had a few additional changes he wanted to make
> to improve performance as well as style.
>
> So this iteration includes his modifications.
>
> Once merged, I'll be working to get the backports finished as quickly
> as I can and sent to -stable.

Also, Thomas pointed out that the entire fix might be too large for this 
late in the -rc cycle.

Thus we may want to merge only the first two patches as the "simple fix" 
which will avoid the major problems seen for 3.5, and then leave the 
last four patches, which really enforce correctness, for 3.6 merge 
window (backporting variants  to -stable).

But I'll leave that to Thomas/Ingo/Linus to decide.

thanks
-john


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (7 preceding siblings ...)
  2012-07-10 22:53 ` [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
@ 2012-07-10 23:00 ` John Stultz
  2012-07-13  0:43   ` John Stultz
  2012-07-11 10:59 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2012-07-10 23:00 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Ingo Molnar, Peter Zijlstra, Prarit Bhargava, Thomas Gleixner, stable

[-- Attachment #1: Type: text/plain, Size: 542 bytes --]

On 07/10/2012 03:43 PM, John Stultz wrote:
> Over the weekend, Thomas got a chance to review the leap second fix
> in more detail and had a few additional changes he wanted to make
> to improve performance as well as style.
>
> So this iteration includes his modifications.
>
> Once merged, I'll be working to get the backports finished as quickly
> as I can and sent to -stable.
>

Once again, here's the test case I've been using to stress test 
leapsecond insertion/deletion (the deletion testing is new since last 
round).

thanks
-john


[-- Attachment #2: leap-a-day.c --]
[-- Type: text/x-csrc, Size: 5474 bytes --]

/* Leap second stress test
 *              by: john stultz (johnstul@us.ibm.com)
 *              (C) Copyright IBM 2012
 *              Licensed under the GPLv2
 *
 *  This test signals the kernel to insert a leap second
 *  every day at midnight GMT. This allows for stessing the
 *  kernel's leap-second behavior, as well as how well applications
 *  handle the leap-second discontinuity.
 *
 *  Usage: leap-a-day [-s]
 *
 *  Options:
 *	-s:	Each iteration, set the date to 10 seconds before midnight GMT.
 *		This speeds up the number of leapsecond transitions tested,
 *		but because it calls settimeofday frequently, advancing the
 *		time by 24 hours every ~16 seconds, it may cause application
 *		disruption.
 *
 *  Other notes: Disabling NTP prior to running this is advised, as the two
 *		 may conflict in thier commands to the kernel.
 *
 *  To build:
 *	$ gcc leap-a-day.c -o leap-a-day -lrt
 */



#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/timex.h>
#include <string.h>
#include <signal.h>

#define NSEC_PER_SEC 1000000000ULL

/* returns 1 if a <= b, 0 otherwise */
static inline int in_order(struct timespec a, struct timespec b)
{
        if(a.tv_sec < b.tv_sec)
                return 1;
        if(a.tv_sec > b.tv_sec)
                return 0;
        if(a.tv_nsec > b.tv_nsec)
                return 0;
        return 1;
}

struct timespec timespec_add(struct timespec ts, unsigned long long ns)
{
	ts.tv_nsec += ns;
	while(ts.tv_nsec >= NSEC_PER_SEC) {
		ts.tv_nsec -= NSEC_PER_SEC;
		ts.tv_sec++;
	}
	return ts;
}


char* time_state_str(int state)
{
	switch (state) {
		case TIME_OK:	return "TIME_OK";
		case TIME_INS:	return "TIME_INS";
		case TIME_DEL:	return "TIME_DEL";
		case TIME_OOP:	return "TIME_OOP";
		case TIME_WAIT:	return "TIME_WAIT";
		case TIME_BAD:	return "TIME_BAD";
	}
	return "ERROR"; 
}

/* clear NTP time_status & time_state */
void clear_time_state(void)
{
	struct timex tx;
	int ret;

	/*
	 * XXX - The fact we have to call this twice seems
	 * to point to a slight issue in the kernel's ntp state
	 * managment. Needs to be investigated further.
	 */

	tx.modes = ADJ_STATUS;
	tx.status = STA_PLL;
	ret = adjtimex(&tx);

	tx.modes = ADJ_STATUS;
	tx.status = 0;
	ret = adjtimex(&tx);
}

/* Make sure we cleanup on ctrl-c */
void handler(int unused)
{
	clear_time_state();
	exit(0);
}


/* Test for known hrtimer failure */
void test_hrtimer_failure(void)
{
	struct timespec now, target;

	clock_gettime(CLOCK_REALTIME, &now);
	target = timespec_add(now, NSEC_PER_SEC/2);
	clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &target, NULL);
	clock_gettime(CLOCK_REALTIME, &now);

	if (!in_order(target, now)) {
		printf("Note: hrtimer early expiration failure observed.\n");
	}

}



int main(int argc, char** argv) 
{
	struct timeval tv;
	struct timex tx;
	struct timespec ts;
	int settime = 0;
	int insert = 1;

	signal(SIGINT, handler);
	signal(SIGKILL, handler);
	printf("This runs continuously. Press ctrl-c to stop\n");


	/* Process arguments */
	if (argc > 1) {
		if (!strncmp(argv[1], "-s", 2)) {
			printf("Setting time to speed up testing\n");
			settime = 1;
		} else {
			printf("Usage: %s [-s]\n", argv[0]);
			printf("	-s: Set time to right before leap second each iteration\n");
		}
	}

	printf("\n");
	while (1) {
		int ret;
		time_t now, next_leap;
		/* Get the current time */
		gettimeofday(&tv, NULL);

		/* Calculate the next possible leap second 23:59:60 GMT */
		tv.tv_sec += 86400 - (tv.tv_sec % 86400);
		next_leap = ts.tv_sec = tv.tv_sec;

		if (settime) {
			tv.tv_sec -= 10;
			settimeofday(&tv, NULL);
			printf("Setting time to %s", ctime(&ts.tv_sec));
		}

		/* Reset NTP time state */
		clear_time_state();

		/* Set the leap second insert flag */
		tx.modes = ADJ_STATUS;
		if (insert) 
			tx.status = STA_INS;
		else
			tx.status = STA_DEL;
		ret = adjtimex(&tx);
		if (ret < 0 ) {
			printf("Error: Problem setting STA_INS/STA_DEL!: %s\n",
							time_state_str(ret));
			return -1;
		}

		/* Validate STA_INS was set */
		ret = adjtimex(&tx);
		if (tx.status != STA_INS && tx.status != STA_DEL) {
			printf("Error: STA_INS/STA_DEL not set!: %s\n",
							time_state_str(ret));
			return -1;
		}

		printf("Scheduling leap second for %s", ctime(&next_leap));
	
		/* Wake up 3 seconds before leap */
		ts.tv_sec -= 3;
		while(clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &ts, NULL))
			printf("Something woke us up, returning to sleep\n");

		/* Validate STA_INS is still set */
		ret = adjtimex(&tx);
		if (tx.status != STA_INS && tx.status != STA_DEL) {
			printf("Something cleared STA_INS/STA_DEL, setting it again.\n");
			tx.modes = ADJ_STATUS;
			if (insert) 
				tx.status = STA_INS;
			else
				tx.status = STA_DEL;
			ret = adjtimex(&tx);
		}

		/* Check adjtimex output every half second */
		now = tx.time.tv_sec;
		while (now < next_leap+2) {
			char buf[26];
			ret = adjtimex(&tx);

			ctime_r(&tx.time.tv_sec, buf);
			buf[strlen(buf)-1] = 0; /*remove trailing\n */

			printf("%s + %6ld us\t%s\n",
					buf,
					tx.time.tv_usec, 
					time_state_str(ret));
			now = tx.time.tv_sec;
			/* Sleep for another half second */
			ts.tv_sec = 0;
			ts.tv_nsec = NSEC_PER_SEC/2;
			clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);			
		}
		/* Switch to using other mode */
		insert = !insert;

		/* Note if kernel has known hrtimer failure */
		test_hrtimer_failure();

		printf("Leap complete\n\n");
	}

	clear_time_state();
	return 0;
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (8 preceding siblings ...)
  2012-07-10 23:00 ` John Stultz
@ 2012-07-11 10:59 ` Peter Zijlstra
  2012-07-11 11:17 ` Ingo Molnar
  2012-07-11 12:16 ` Prarit Bhargava
  11 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2012-07-11 10:59 UTC (permalink / raw)
  To: John Stultz
  Cc: Linux Kernel, Ingo Molnar, Prarit Bhargava, Thomas Gleixner, stable

On Tue, 2012-07-10 at 18:43 -0400, John Stultz wrote:
> Over the weekend, Thomas got a chance to review the leap second fix
> in more detail and had a few additional changes he wanted to make
> to improve performance as well as style.
> 
> So this iteration includes his modifications.
> 
> Once merged, I'll be working to get the backports finished as quickly
> as I can and sent to -stable. 

I think we all dislike 1/6, but I can't come up with anything saner
either. The rest looks good.

The only thing I found (and already mentioned to Thomas) is that the
Changelog of 1/6 could be improved by mentioning why we're in hardirq
context to begin with.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (9 preceding siblings ...)
  2012-07-11 10:59 ` Peter Zijlstra
@ 2012-07-11 11:17 ` Ingo Molnar
  2012-07-12 12:32   ` Prarit Bhargava
  2012-07-11 12:16 ` Prarit Bhargava
  11 siblings, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2012-07-11 11:17 UTC (permalink / raw)
  To: John Stultz
  Cc: Linux Kernel, Peter Zijlstra, Prarit Bhargava, Thomas Gleixner, stable


* John Stultz <johnstul@us.ibm.com> wrote:

> Over the weekend, Thomas got a chance to review the leap 
> second fix in more detail and had a few additional changes he 
> wanted to make to improve performance as well as style.
> 
> So this iteration includes his modifications.

Yep, looks much saner now that the separate 'delayed' parameter 
is gone. For all patches:

Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-10 22:43 ` [PATCH 1/6] hrtimer: Provide clock_was_set_delayed() John Stultz
@ 2012-07-11 12:15   ` Prarit Bhargava
  2012-07-11 12:45     ` Thomas Gleixner
  2012-07-11 21:40   ` [tip:timers/urgent] " tip-bot for John Stultz
  1 sibling, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2012-07-11 12:15 UTC (permalink / raw)
  To: John Stultz
  Cc: Linux Kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, stable



On 07/10/2012 06:43 PM, John Stultz wrote:
> clock_was_set() cannot be called from hard interrupt context because
> it calls on_each_cpu(). For fixing the widely reported leap seconds
> issue it's necessary to call it from the timer interrupt context.
> 
> Provide a new function which denotes it in the hrtimer cpu base
> structure of the cpu on which it is called and raising the timer
> softirq.
> 
> We then execute the clock_was_set() notificiation in the timer softirq
> context in hrtimer_run_pending().

I wish there was a nicer way to do this ... but looking at the code I can't
figure out a better way.  (no offense John, it's just the way the code is ;) )

P.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
  2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
                   ` (10 preceding siblings ...)
  2012-07-11 11:17 ` Ingo Molnar
@ 2012-07-11 12:16 ` Prarit Bhargava
  11 siblings, 0 replies; 38+ messages in thread
From: Prarit Bhargava @ 2012-07-11 12:16 UTC (permalink / raw)
  To: John Stultz
  Cc: Linux Kernel, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, stable



On 07/10/2012 06:43 PM, John Stultz wrote:
> Over the weekend, Thomas got a chance to review the leap second fix
> in more detail and had a few additional changes he wanted to make
> to improve performance as well as style.
> 
> So this iteration includes his modifications.
> 
> Once merged, I'll be working to get the backports finished as quickly
> as I can and sent to -stable.
> 
> thanks
> -john
> 

Add an Acked-by: Prarit Bhargava <prarit@redhat.com>

John -- I'll do some testing of this patchset today and tomorrow and let you
know the results.

P.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-11 12:15   ` Prarit Bhargava
@ 2012-07-11 12:45     ` Thomas Gleixner
  2012-07-11 13:05       ` Peter Zijlstra
  2012-07-11 13:05       ` Prarit Bhargava
  0 siblings, 2 replies; 38+ messages in thread
From: Thomas Gleixner @ 2012-07-11 12:45 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: John Stultz, Linux Kernel, Ingo Molnar, Peter Zijlstra, stable

On Wed, 11 Jul 2012, Prarit Bhargava wrote:
> On 07/10/2012 06:43 PM, John Stultz wrote:
> > clock_was_set() cannot be called from hard interrupt context because
> > it calls on_each_cpu(). For fixing the widely reported leap seconds
> > issue it's necessary to call it from the timer interrupt context.
> > 
> > Provide a new function which denotes it in the hrtimer cpu base
> > structure of the cpu on which it is called and raising the timer
> > softirq.
> > 
> > We then execute the clock_was_set() notificiation in the timer softirq
> > context in hrtimer_run_pending().
> 
> I wish there was a nicer way to do this ... but looking at the code I can't
> figure out a better way.  (no offense John, it's just the way the code is ;) )

Yeah, I had the same discussion with Peter earlier today. There is
only a rather limited set of options.

1) Retrigger the timer interrupt vectors on all CPUs - except the one
   we are running on, but we have no interface for that at the moment

2) Do the nasty __smp_call_function_single() hack

   Preallocate call_single_data for all cpus and do a
   __smp_call_function_single() on all online cpus.

   This can be called from hard interrupt context or irq disabled
   regions.

   That would allow to get rid of the whole delay magic all
   together.

Thoughts?

	tglx


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-11 12:45     ` Thomas Gleixner
@ 2012-07-11 13:05       ` Peter Zijlstra
  2012-07-11 15:18         ` Thomas Gleixner
  2012-07-11 13:05       ` Prarit Bhargava
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2012-07-11 13:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Prarit Bhargava, John Stultz, Linux Kernel, Ingo Molnar, stable

On Wed, 2012-07-11 at 14:45 +0200, Thomas Gleixner wrote:
> On Wed, 11 Jul 2012, Prarit Bhargava wrote:
> > On 07/10/2012 06:43 PM, John Stultz wrote:
> > > clock_was_set() cannot be called from hard interrupt context because
> > > it calls on_each_cpu(). For fixing the widely reported leap seconds
> > > issue it's necessary to call it from the timer interrupt context.
> > > 
> > > Provide a new function which denotes it in the hrtimer cpu base
> > > structure of the cpu on which it is called and raising the timer
> > > softirq.
> > > 
> > > We then execute the clock_was_set() notificiation in the timer softirq
> > > context in hrtimer_run_pending().
> > 
> > I wish there was a nicer way to do this ... but looking at the code I can't
> > figure out a better way.  (no offense John, it's just the way the code is ;) )
> 
> Yeah, I had the same discussion with Peter earlier today. There is
> only a rather limited set of options.
> 
> 1) Retrigger the timer interrupt vectors on all CPUs - except the one
>    we are running on, but we have no interface for that at the moment
> 
> 2) Do the nasty __smp_call_function_single() hack
> 
>    Preallocate call_single_data for all cpus and do a
>    __smp_call_function_single() on all online cpus.
> 
>    This can be called from hard interrupt context or irq disabled
>    regions.
> 
>    That would allow to get rid of the whole delay magic all
>    together.
> 
> Thoughts?

The __smp_call_function_single() thing isn't particularly pretty either
and a lot more code to boot.. 

static DEFINE_PER_CPU(struct call_single_data, cws_csd);

void clock_was_set(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct call_single_data *csd = &per_cpu(cws_csd, cpu);

		if (csd->flags & CSD_FLAG_LOCK)
			continue; /* a pending request is good enough */

		csd->func = retrigger_next_event;

		__smp_call_function_single(cpu, csd, 0);
	}

	timerfd_clock_was_set();
}

It also is a for_each_cpu loop with preemption disabled, not pretty :/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-11 12:45     ` Thomas Gleixner
  2012-07-11 13:05       ` Peter Zijlstra
@ 2012-07-11 13:05       ` Prarit Bhargava
  2012-07-11 13:38         ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2012-07-11 13:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, Linux Kernel, Ingo Molnar, Peter Zijlstra, stable


>> I wish there was a nicer way to do this ... but looking at the code I can't
>> figure out a better way.  (no offense John, it's just the way the code is ;) )
> 
> Yeah, I had the same discussion with Peter earlier today. There is
> only a rather limited set of options.
> 
> 1) Retrigger the timer interrupt vectors on all CPUs - except the one
>    we are running on, but we have no interface for that at the moment
> 
> 2) Do the nasty __smp_call_function_single() hack
> 
>    Preallocate call_single_data for all cpus and do a
>    __smp_call_function_single() on all online cpus.
> 
>    This can be called from hard interrupt context or irq disabled
>    regions.
> 
>    That would allow to get rid of the whole delay magic all
>    together.
> 
> Thoughts?
> 

Both of those options seem like a lot of work for something that happens once
every 3-4 years, and may not happen ever again[1].  Based on that statement, if
we're going to modify code I would prefer that it be as lightweight as possible.
 So, in terms of the kernel, option 2 is likely the best way to go rather than
introducing new code that will be used once every 3-4 years.

I keep asking the question of why the mechanism of inserting a leap second isn't
moved into userspace ntpd (or some other appropriate daemon).  I suppose there
is a risk of ntpd being starved out on heavily loaded systems...

P.

[1] http://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-11 13:05       ` Prarit Bhargava
@ 2012-07-11 13:38         ` Peter Zijlstra
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2012-07-11 13:38 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Thomas Gleixner, John Stultz, Linux Kernel, Ingo Molnar, stable

On Wed, 2012-07-11 at 09:05 -0400, Prarit Bhargava wrote:
> 
> Both of those options seem like a lot of work for something that happens once
> every 3-4 years, and may not happen ever again[1].  Based on that statement, if
> we're going to modify code I would prefer that it be as lightweight as possible.
>  So, in terms of the kernel, option 2 is likely the best way to go rather than
> introducing new code that will be used once every 3-4 years. 

Full agreed, however if we implement clock_was_set() like I just
proposed we'd use that code for every time the clock was modified, which
is a lot more often.

That said, I'm not a particular fan of it..

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-11 13:05       ` Peter Zijlstra
@ 2012-07-11 15:18         ` Thomas Gleixner
  2012-07-11 15:56           ` Peter Zijlstra
  2012-07-11 16:47           ` John Stultz
  0 siblings, 2 replies; 38+ messages in thread
From: Thomas Gleixner @ 2012-07-11 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Prarit Bhargava, John Stultz, Linux Kernel, Ingo Molnar, stable

On Wed, 11 Jul 2012, Peter Zijlstra wrote:
> On Wed, 2012-07-11 at 14:45 +0200, Thomas Gleixner wrote:
> > On Wed, 11 Jul 2012, Prarit Bhargava wrote:
> > > On 07/10/2012 06:43 PM, John Stultz wrote:
> > > > clock_was_set() cannot be called from hard interrupt context because
> > > > it calls on_each_cpu(). For fixing the widely reported leap seconds
> > > > issue it's necessary to call it from the timer interrupt context.
> > > > 
> > > > Provide a new function which denotes it in the hrtimer cpu base
> > > > structure of the cpu on which it is called and raising the timer
> > > > softirq.
> > > > 
> > > > We then execute the clock_was_set() notificiation in the timer softirq
> > > > context in hrtimer_run_pending().
> > > 
> > > I wish there was a nicer way to do this ... but looking at the code I can't
> > > figure out a better way.  (no offense John, it's just the way the code is ;) )
> > 
> > Yeah, I had the same discussion with Peter earlier today. There is
> > only a rather limited set of options.
> > 
> > 1) Retrigger the timer interrupt vectors on all CPUs - except the one
> >    we are running on, but we have no interface for that at the moment
> > 
> > 2) Do the nasty __smp_call_function_single() hack
> > 
> >    Preallocate call_single_data for all cpus and do a
> >    __smp_call_function_single() on all online cpus.
> > 
> >    This can be called from hard interrupt context or irq disabled
> >    regions.
> > 
> >    That would allow to get rid of the whole delay magic all
> >    together.
> > 
> > Thoughts?
> 
> The __smp_call_function_single() thing isn't particularly pretty either
> and a lot more code to boot.. 
> 
> static DEFINE_PER_CPU(struct call_single_data, cws_csd);
> 
> void clock_was_set(void)
> {
> 	int cpu;
> 
> 	for_each_online_cpu(cpu) {
> 		struct call_single_data *csd = &per_cpu(cws_csd, cpu);
> 
> 		if (csd->flags & CSD_FLAG_LOCK)
> 			continue; /* a pending request is good enough */
> 
> 		csd->func = retrigger_next_event;
> 
> 		__smp_call_function_single(cpu, csd, 0);
> 	}
> 
> 	timerfd_clock_was_set();
> }
> 
> It also is a for_each_cpu loop with preemption disabled, not pretty :/

Right. I think with the atomic update of the offset in the timer
interrupt we are on the safe side. The main problem of timers expiring
early forever is covered by this.

Thinking more about it.

If time goes backwards, then the IPI is pointless. The already armed
clockevent device will fire too early, hrtimer_interrupt will update
and just rearm it. That's one "spurious" event.

So we only need it in the case of time going forward. 

Though with the leap second the maximum observable delay is 1 second
on a completely idle core. Surely nothing to worry about for an event
which happens rarely. So we could safely avoid the whole delayed
business and just do the timerfd notification, though I wonder if even
that is necessary in the leap second case.

On NOHZ=n systems the IPI is pointless as well. The maximum lateness
will be 10ms for HZ=100. Nothing we should worry about.

That leaves NOHZ enabled systems and there we might be clever and
avoid the IPIs to those cores which are not idle and let the tick
interrupt deal with it. And we can make the calls async and just let
them raise the hrtimer softirq on those cores, which will run the
hrtimer interrupt code and take care of everything.

Thoughts?

	tglx





^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-11 15:18         ` Thomas Gleixner
@ 2012-07-11 15:56           ` Peter Zijlstra
  2012-07-11 16:47           ` John Stultz
  1 sibling, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2012-07-11 15:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Prarit Bhargava, John Stultz, Linux Kernel, Ingo Molnar, stable

On Wed, 2012-07-11 at 17:18 +0200, Thomas Gleixner wrote:

> Right. I think with the atomic update of the offset in the timer
> interrupt we are on the safe side. The main problem of timers expiring
> early forever is covered by this.
> 
> Thinking more about it.
> 
> If time goes backwards, then the IPI is pointless. The already armed
> clockevent device will fire too early, hrtimer_interrupt will update
> and just rearm it. That's one "spurious" event.
> 
> So we only need it in the case of time going forward. 
> 
> Though with the leap second the maximum observable delay is 1 second
> on a completely idle core. Surely nothing to worry about for an event
> which happens rarely. So we could safely avoid the whole delayed
> business and just do the timerfd notification, though I wonder if even
> that is necessary in the leap second case.
> 
> On NOHZ=n systems the IPI is pointless as well. The maximum lateness
> will be 10ms for HZ=100. Nothing we should worry about.
> 
> That leaves NOHZ enabled systems and there we might be clever and
> avoid the IPIs to those cores which are not idle and let the tick
> interrupt deal with it. And we can make the calls async and just let
> them raise the hrtimer softirq on those cores, which will run the
> hrtimer interrupt code and take care of everything.
> 
> Thoughts?


static void nohz_hrtimer_softirq(void *unused)
{
	raise_softirq(HRTIMER_SOFTIRQ);
}

static void kick_nohz_cpus(void)
{
	smp_call_function_many(nohz.idle_cpus_mask, nohz_hrtimer_softirq, NULL, 0);
}

Same problem as before though, can't be sending IPIs while in hardirq
context.. 

And you cannot do the same trick with a CFD as with the CSD, some CPUs
might need it again while others are still pending.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-11 15:18         ` Thomas Gleixner
  2012-07-11 15:56           ` Peter Zijlstra
@ 2012-07-11 16:47           ` John Stultz
  2012-07-12  7:44             ` Jan Ceuleers
  1 sibling, 1 reply; 38+ messages in thread
From: John Stultz @ 2012-07-11 16:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Prarit Bhargava, Linux Kernel, Ingo Molnar, stable

On 07/11/2012 08:18 AM, Thomas Gleixner wrote:
> That leaves NOHZ enabled systems and there we might be clever and
> avoid the IPIs to those cores which are not idle and let the tick
> interrupt deal with it. And we can make the calls async and just let
> them raise the hrtimer softirq on those cores, which will run the
> hrtimer interrupt code and take care of everything.

I'm not familiar with the details of the code that tracks which cores 
are idle or not, but I'd worry with this approach that there might be 
further races in determining which cores are idle and which cores are 
about to go idle with stale base offsets.

I'll see if my worry is unfounded, but it might be a bit too clever for 
rare events.

thanks
-john


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [tip:timers/urgent] hrtimer: Provide clock_was_set_delayed()
  2012-07-10 22:43 ` [PATCH 1/6] hrtimer: Provide clock_was_set_delayed() John Stultz
  2012-07-11 12:15   ` Prarit Bhargava
@ 2012-07-11 21:40   ` tip-bot for John Stultz
  1 sibling, 0 replies; 38+ messages in thread
From: tip-bot for John Stultz @ 2012-07-11 21:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, johnstul, jengelh, tglx, prarit

Commit-ID:  f55a6faa384304c89cfef162768e88374d3312cb
Gitweb:     http://git.kernel.org/tip/f55a6faa384304c89cfef162768e88374d3312cb
Author:     John Stultz <johnstul@us.ibm.com>
AuthorDate: Tue, 10 Jul 2012 18:43:19 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 11 Jul 2012 23:34:37 +0200

hrtimer: Provide clock_was_set_delayed()

clock_was_set() cannot be called from hard interrupt context because
it calls on_each_cpu().

For fixing the widely reported leap seconds issue it is necessary to
call it from hard interrupt context, i.e. the timer tick code, which
does the timekeeping updates.

Provide a new function which denotes it in the hrtimer cpu base
structure of the cpu on which it is called and raise the hrtimer
softirq. We then execute the clock_was_set() notificiation from
softirq context in run_hrtimer_softirq(). The hrtimer softirq is
rarely used, so polling the flag there is not a performance issue.

[ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get
  rid of all this ifdeffery ASAP ]

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h |    9 ++++++++-
 kernel/hrtimer.c        |   20 ++++++++++++++++++++
 2 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index fd0dc30..c9ec940 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -165,6 +165,7 @@ enum  hrtimer_base_type {
  * @lock:		lock protecting the base and associated clock bases
  *			and timers
  * @active_bases:	Bitfield to mark bases with active timers
+ * @clock_was_set:	Indicates that clock was set from irq context.
  * @expires_next:	absolute time of the next event which was scheduled
  *			via clock_set_next_event()
  * @hres_active:	State of high resolution mode
@@ -177,7 +178,8 @@ enum  hrtimer_base_type {
  */
 struct hrtimer_cpu_base {
 	raw_spinlock_t			lock;
-	unsigned long			active_bases;
+	unsigned int			active_bases;
+	unsigned int			clock_was_set;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	ktime_t				expires_next;
 	int				hres_active;
@@ -286,6 +288,8 @@ extern void hrtimer_peek_ahead_timers(void);
 # define MONOTONIC_RES_NSEC	HIGH_RES_NSEC
 # define KTIME_MONOTONIC_RES	KTIME_HIGH_RES
 
+extern void clock_was_set_delayed(void);
+
 #else
 
 # define MONOTONIC_RES_NSEC	LOW_RES_NSEC
@@ -306,6 +310,9 @@ static inline int hrtimer_is_hres_active(struct hrtimer *timer)
 {
 	return 0;
 }
+
+static inline void clock_was_set_delayed(void) { }
+
 #endif
 
 extern void clock_was_set(void);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index ae34bf5..3c24fb2 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -717,6 +717,19 @@ static int hrtimer_switch_to_hres(void)
 	return 1;
 }
 
+/*
+ * Called from timekeeping code to reprogramm the hrtimer interrupt
+ * device. If called from the timer interrupt context we defer it to
+ * softirq context.
+ */
+void clock_was_set_delayed(void)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
+	cpu_base->clock_was_set = 1;
+	__raise_softirq_irqoff(HRTIMER_SOFTIRQ);
+}
+
 #else
 
 static inline int hrtimer_hres_active(void) { return 0; }
@@ -1395,6 +1408,13 @@ void hrtimer_peek_ahead_timers(void)
 
 static void run_hrtimer_softirq(struct softirq_action *h)
 {
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
+	if (cpu_base->clock_was_set) {
+		cpu_base->clock_was_set = 0;
+		clock_was_set();
+	}
+
 	hrtimer_peek_ahead_timers();
 }
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:timers/urgent] timekeeping: Fix leapsecond triggered load spike issue
  2012-07-10 22:43 ` [PATCH 2/6] timekeeping: Fix leapsecond triggered load spike issue John Stultz
@ 2012-07-11 21:41   ` tip-bot for John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for John Stultz @ 2012-07-11 21:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, johnstul, jengelh, tglx, prarit

Commit-ID:  4873fa070ae84a4115f0b3c9dfabc224f1bc7c51
Gitweb:     http://git.kernel.org/tip/4873fa070ae84a4115f0b3c9dfabc224f1bc7c51
Author:     John Stultz <johnstul@us.ibm.com>
AuthorDate: Tue, 10 Jul 2012 18:43:20 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 11 Jul 2012 23:34:37 +0200

timekeeping: Fix leapsecond triggered load spike issue

The timekeeping code misses an update of the hrtimer subsystem after a
leap second happened. Due to that timers based on CLOCK_REALTIME are
either expiring a second early or late depending on whether a leap
second has been inserted or deleted until an operation is initiated
which causes that update. Unless the update happens by some other
means this discrepancy between the timekeeping and the hrtimer data
stays forever and timers are expired either early or late.

The reported immediate workaround - $ data -s "`date`" - is causing a
call to clock_was_set() which updates the hrtimer data structures.
See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix

Add the missing clock_was_set() call to update_wall_time() in case of
a leap second event. The actual update is deferred to softirq context
as the necessary smp function call cannot be invoked from hard
interrupt context.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timekeeping.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6f46a00..a413e59 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -963,6 +963,8 @@ static cycle_t logarithmic_accumulation(cycle_t offset, int shift)
 		leap = second_overflow(timekeeper.xtime.tv_sec);
 		timekeeper.xtime.tv_sec += leap;
 		timekeeper.wall_to_monotonic.tv_sec -= leap;
+		if (leap)
+			clock_was_set_delayed();
 	}
 
 	/* Accumulate raw time */
@@ -1079,6 +1081,8 @@ static void update_wall_time(void)
 		leap = second_overflow(timekeeper.xtime.tv_sec);
 		timekeeper.xtime.tv_sec += leap;
 		timekeeper.wall_to_monotonic.tv_sec -= leap;
+		if (leap)
+			clock_was_set_delayed();
 	}
 
 	timekeeping_update(false);

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:timers/urgent] timekeeping: Maintain ktime_t based offsets for hrtimers
  2012-07-10 22:43 ` [PATCH 3/6] timekeeping: Maintain ktime_t based offsets for hrtimers John Stultz
@ 2012-07-11 21:42   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2012-07-11 21:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, johnstul, tglx, prarit

Commit-ID:  5b9fe759a678e05be4937ddf03d50e950207c1c0
Gitweb:     http://git.kernel.org/tip/5b9fe759a678e05be4937ddf03d50e950207c1c0
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 10 Jul 2012 18:43:21 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 11 Jul 2012 23:34:38 +0200

timekeeping: Maintain ktime_t based offsets for hrtimers

We need to update the hrtimer clock offsets from the hrtimer interrupt
context. To avoid conversions from timespec to ktime_t maintain a
ktime_t based representation of those offsets in the timekeeper. This
puts the conversion overhead into the code which updates the
underlying offsets and provides fast accessible values in the hrtimer
interrupt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timekeeping.c |   25 +++++++++++++++++++++++--
 1 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index a413e59..1c038da 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -70,6 +70,12 @@ struct timekeeper {
 	/* The raw monotonic time for the CLOCK_MONOTONIC_RAW posix clock. */
 	struct timespec raw_time;
 
+	/* Offset clock monotonic -> clock realtime */
+	ktime_t offs_real;
+
+	/* Offset clock monotonic -> clock boottime */
+	ktime_t offs_boot;
+
 	/* Seqlock for all timekeeper values */
 	seqlock_t lock;
 };
@@ -172,6 +178,14 @@ static inline s64 timekeeping_get_ns_raw(void)
 	return clocksource_cyc2ns(cycle_delta, clock->mult, clock->shift);
 }
 
+static void update_rt_offset(void)
+{
+	struct timespec tmp, *wtm = &timekeeper.wall_to_monotonic;
+
+	set_normalized_timespec(&tmp, -wtm->tv_sec, -wtm->tv_nsec);
+	timekeeper.offs_real = timespec_to_ktime(tmp);
+}
+
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(bool clearntp)
 {
@@ -179,6 +193,7 @@ static void timekeeping_update(bool clearntp)
 		timekeeper.ntp_error = 0;
 		ntp_clear();
 	}
+	update_rt_offset();
 	update_vsyscall(&timekeeper.xtime, &timekeeper.wall_to_monotonic,
 			 timekeeper.clock, timekeeper.mult);
 }
@@ -604,6 +619,7 @@ void __init timekeeping_init(void)
 	}
 	set_normalized_timespec(&timekeeper.wall_to_monotonic,
 				-boot.tv_sec, -boot.tv_nsec);
+	update_rt_offset();
 	timekeeper.total_sleep_time.tv_sec = 0;
 	timekeeper.total_sleep_time.tv_nsec = 0;
 	write_sequnlock_irqrestore(&timekeeper.lock, flags);
@@ -612,6 +628,12 @@ void __init timekeeping_init(void)
 /* time in seconds when suspend began */
 static struct timespec timekeeping_suspend_time;
 
+static void update_sleep_time(struct timespec t)
+{
+	timekeeper.total_sleep_time = t;
+	timekeeper.offs_boot = timespec_to_ktime(t);
+}
+
 /**
  * __timekeeping_inject_sleeptime - Internal function to add sleep interval
  * @delta: pointer to a timespec delta value
@@ -630,8 +652,7 @@ static void __timekeeping_inject_sleeptime(struct timespec *delta)
 	timekeeper.xtime = timespec_add(timekeeper.xtime, *delta);
 	timekeeper.wall_to_monotonic =
 			timespec_sub(timekeeper.wall_to_monotonic, *delta);
-	timekeeper.total_sleep_time = timespec_add(
-					timekeeper.total_sleep_time, *delta);
+	update_sleep_time(timespec_add(timekeeper.total_sleep_time, *delta));
 }
 
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:timers/urgent] hrtimers: Move lock held region in hrtimer_interrupt()
  2012-07-10 22:43 ` [PATCH 4/6] hrtimers: " John Stultz
@ 2012-07-11 21:43   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2012-07-11 21:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, johnstul, a.p.zijlstra, tglx, prarit

Commit-ID:  196951e91262fccda81147d2bcf7fdab08668b40
Gitweb:     http://git.kernel.org/tip/196951e91262fccda81147d2bcf7fdab08668b40
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 10 Jul 2012 18:43:23 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 11 Jul 2012 23:34:38 +0200

hrtimers: Move lock held region in hrtimer_interrupt()

We need to update the base offsets from this code and we need to do
that under base->lock. Move the lock held region around the
ktime_get() calls. The ktime_get() calls are going to be replaced with
a function which gets the time and the offsets atomically.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/hrtimer.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 3c24fb2..8f320af 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1263,11 +1263,10 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	cpu_base->nr_events++;
 	dev->next_event.tv64 = KTIME_MAX;
 
+	raw_spin_lock(&cpu_base->lock);
 	entry_time = now = ktime_get();
 retry:
 	expires_next.tv64 = KTIME_MAX;
-
-	raw_spin_lock(&cpu_base->lock);
 	/*
 	 * We set expires_next to KTIME_MAX here with cpu_base->lock
 	 * held to prevent that a timer is enqueued in our queue via
@@ -1344,6 +1343,7 @@ retry:
 	 * interrupt routine. We give it 3 attempts to avoid
 	 * overreacting on some spurious event.
 	 */
+	raw_spin_lock(&cpu_base->lock);
 	now = ktime_get();
 	cpu_base->nr_retries++;
 	if (++retries < 3)
@@ -1356,6 +1356,7 @@ retry:
 	 */
 	cpu_base->nr_hangs++;
 	cpu_base->hang_detected = 1;
+	raw_spin_unlock(&cpu_base->lock);
 	delta = ktime_sub(now, entry_time);
 	if (delta.tv64 > cpu_base->max_hang_time.tv64)
 		cpu_base->max_hang_time = delta;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:timers/urgent] timekeeping: Provide hrtimer update function
  2012-07-10 22:43 ` [PATCH 5/6] timekeeping: Provide hrtimer update function John Stultz
@ 2012-07-11 21:44   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2012-07-11 21:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, johnstul, a.p.zijlstra, tglx, prarit

Commit-ID:  f6c06abfb3972ad4914cef57d8348fcb2932bc3b
Gitweb:     http://git.kernel.org/tip/f6c06abfb3972ad4914cef57d8348fcb2932bc3b
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 10 Jul 2012 18:43:24 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 11 Jul 2012 23:34:39 +0200

timekeeping: Provide hrtimer update function

To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h   |    1 +
 kernel/time/timekeeping.c |   34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index c9ec940..cc07d27 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -327,6 +327,7 @@ extern ktime_t ktime_get(void);
 extern ktime_t ktime_get_real(void);
 extern ktime_t ktime_get_boottime(void);
 extern ktime_t ktime_get_monotonic_offset(void);
+extern ktime_t ktime_get_update_offsets(ktime_t *offs_real, ktime_t *offs_boot);
 
 DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1c038da..269b1fe 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1271,6 +1271,40 @@ void get_xtime_and_monotonic_and_sleep_offset(struct timespec *xtim,
 	} while (read_seqretry(&timekeeper.lock, seq));
 }
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+/**
+ * ktime_get_update_offsets - hrtimer helper
+ * @offs_real:	pointer to storage for monotonic -> realtime offset
+ * @offs_boot:	pointer to storage for monotonic -> boottime offset
+ *
+ * Returns current monotonic time and updates the offsets
+ * Called from hrtimer_interupt() or retrigger_next_event()
+ */
+ktime_t ktime_get_update_offsets(ktime_t *offs_real, ktime_t *offs_boot)
+{
+	ktime_t now;
+	unsigned int seq;
+	u64 secs, nsecs;
+
+	do {
+		seq = read_seqbegin(&timekeeper.lock);
+
+		secs = timekeeper.xtime.tv_sec;
+		nsecs = timekeeper.xtime.tv_nsec;
+		nsecs += timekeeping_get_ns();
+		/* If arch requires, add in gettimeoffset() */
+		nsecs += arch_gettimeoffset();
+
+		*offs_real = timekeeper.offs_real;
+		*offs_boot = timekeeper.offs_boot;
+	} while (read_seqretry(&timekeeper.lock, seq));
+
+	now = ktime_add_ns(ktime_set(secs, 0), nsecs);
+	now = ktime_sub(now, *offs_real);
+	return now;
+}
+#endif
+
 /**
  * ktime_get_monotonic_offset() - get wall_to_monotonic in ktime_t format
  */

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:timers/urgent] hrtimer: Update hrtimer base offsets each hrtimer_interrupt
  2012-07-10 22:43 ` [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt John Stultz
@ 2012-07-11 21:45   ` tip-bot for John Stultz
  2012-07-15 15:22     ` Andreas Schwab
       [not found]   ` <m2y5mlnj5z.fsf__49536.0585897744$1342365803$gmane$org@igel.home>
  2 siblings, 0 replies; 38+ messages in thread
From: tip-bot for John Stultz @ 2012-07-11 21:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, johnstul, tglx, prarit

Commit-ID:  5baefd6d84163443215f4a99f6a20f054ef11236
Gitweb:     http://git.kernel.org/tip/5baefd6d84163443215f4a99f6a20f054ef11236
Author:     John Stultz <johnstul@us.ibm.com>
AuthorDate: Tue, 10 Jul 2012 18:43:25 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 11 Jul 2012 23:34:39 +0200

hrtimer: Update hrtimer base offsets each hrtimer_interrupt

The update of the hrtimer base offsets on all cpus cannot be made
atomically from the timekeeper.lock held and interrupt disabled region
as smp function calls are not allowed there.

clock_was_set(), which enforces the update on all cpus, is called
either from preemptible process context in case of do_settimeofday()
or from the softirq context when the offset modification happened in
the timer interrupt itself due to a leap second.

In both cases there is a race window for an hrtimer interrupt between
dropping timekeeper lock, enabling interrupts and clock_was_set()
issuing the updates. Any interrupt which arrives in that window will
see the new time but operate on stale offsets.

So we need to make sure that an hrtimer interrupt always sees a
consistent state of time and offsets.

ktime_get_update_offsets() allows us to get the current monotonic time
and update the per cpu hrtimer base offsets from hrtimer_interrupt()
to capture a consistent state of monotonic time and the offsets. The
function replaces the existing ktime_get() calls in hrtimer_interrupt().

The overhead of the new function vs. ktime_get() is minimal as it just
adds two store operations.

This ensures that any changes to realtime or boottime offsets are
noticed and stored into the per-cpu hrtimer base structures, prior to
any hrtimer expiration and guarantees that timers are not expired early.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/hrtimer.c |   28 ++++++++++++++--------------
 1 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 8f320af..6db7a5e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -657,6 +657,14 @@ static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
 	return 0;
 }
 
+static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
+{
+	ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
+	ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
+
+	return ktime_get_update_offsets(offs_real, offs_boot);
+}
+
 /*
  * Retrigger next event is called after clock was set
  *
@@ -665,22 +673,12 @@ static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
 static void retrigger_next_event(void *arg)
 {
 	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
-	struct timespec realtime_offset, xtim, wtm, sleep;
 
 	if (!hrtimer_hres_active())
 		return;
 
-	/* Optimized out for !HIGH_RES */
-	get_xtime_and_monotonic_and_sleep_offset(&xtim, &wtm, &sleep);
-	set_normalized_timespec(&realtime_offset, -wtm.tv_sec, -wtm.tv_nsec);
-
-	/* Adjust CLOCK_REALTIME offset */
 	raw_spin_lock(&base->lock);
-	base->clock_base[HRTIMER_BASE_REALTIME].offset =
-		timespec_to_ktime(realtime_offset);
-	base->clock_base[HRTIMER_BASE_BOOTTIME].offset =
-		timespec_to_ktime(sleep);
-
+	hrtimer_update_base(base);
 	hrtimer_force_reprogram(base, 0);
 	raw_spin_unlock(&base->lock);
 }
@@ -710,7 +708,6 @@ static int hrtimer_switch_to_hres(void)
 		base->clock_base[i].resolution = KTIME_HIGH_RES;
 
 	tick_setup_sched_timer();
-
 	/* "Retrigger" the interrupt to get things going */
 	retrigger_next_event(NULL);
 	local_irq_restore(flags);
@@ -1264,7 +1261,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	dev->next_event.tv64 = KTIME_MAX;
 
 	raw_spin_lock(&cpu_base->lock);
-	entry_time = now = ktime_get();
+	entry_time = now = hrtimer_update_base(cpu_base);
 retry:
 	expires_next.tv64 = KTIME_MAX;
 	/*
@@ -1342,9 +1339,12 @@ retry:
 	 * We need to prevent that we loop forever in the hrtimer
 	 * interrupt routine. We give it 3 attempts to avoid
 	 * overreacting on some spurious event.
+	 *
+	 * Acquire base lock for updating the offsets and retrieving
+	 * the current time.
 	 */
 	raw_spin_lock(&cpu_base->lock);
-	now = ktime_get();
+	now = hrtimer_update_base(cpu_base);
 	cpu_base->nr_retries++;
 	if (++retries < 3)
 		goto retry;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-11 16:47           ` John Stultz
@ 2012-07-12  7:44             ` Jan Ceuleers
  2012-07-12 12:29               ` Prarit Bhargava
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Ceuleers @ 2012-07-12  7:44 UTC (permalink / raw)
  To: John Stultz
  Cc: Thomas Gleixner, Peter Zijlstra, Prarit Bhargava, Linux Kernel,
	Ingo Molnar, stable

On 07/11/2012 06:47 PM, John Stultz wrote:
> I'll see if my worry is unfounded, but it might be a bit too clever for rare events.

Full ACK.

There is an unfortunate history of critical-to-moderately-serious bugs in
the leap second handling, so I submit that what is needed is a simple,
obviously-correct and robust mechanism. Robust statically, but also in the
face of code churn because these code paths are exercised so rarely out in
the wild.

Just my opinion, FWIW.

Jan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] hrtimer: Provide clock_was_set_delayed()
  2012-07-12  7:44             ` Jan Ceuleers
@ 2012-07-12 12:29               ` Prarit Bhargava
  0 siblings, 0 replies; 38+ messages in thread
From: Prarit Bhargava @ 2012-07-12 12:29 UTC (permalink / raw)
  To: Jan Ceuleers
  Cc: John Stultz, Thomas Gleixner, Peter Zijlstra, Linux Kernel,
	Ingo Molnar, stable



On 07/12/2012 03:44 AM, Jan Ceuleers wrote:
> On 07/11/2012 06:47 PM, John Stultz wrote:
>> I'll see if my worry is unfounded, but it might be a bit too clever for rare events.
> 
> Full ACK.
> 
> There is an unfortunate history of critical-to-moderately-serious bugs in
> the leap second handling, so I submit that what is needed is a simple,
> obviously-correct and robust mechanism. Robust statically, but also in the
> face of code churn because these code paths are exercised so rarely out in
> the wild.
> 
> Just my opinion, FWIW.
> 

Ditto - and it's not just FWIW.

John (and everyone else), I think we're over-thinking this.  Would it be nice to
get an extremely elegant solution to this?  Yeah ... it would.  But the reality
is that we're not going to get there and IMO we're making things too complex for
this little piece of code.

Acked-by: Prarit Bhargava <prarit@redhat.com>

IMO, this is the simplest way to move forward with this code.

P.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
  2012-07-11 11:17 ` Ingo Molnar
@ 2012-07-12 12:32   ` Prarit Bhargava
  0 siblings, 0 replies; 38+ messages in thread
From: Prarit Bhargava @ 2012-07-12 12:32 UTC (permalink / raw)
  To: John Stultz
  Cc: Ingo Molnar, Linux Kernel, Peter Zijlstra, Thomas Gleixner, stable



On 07/11/2012 07:17 AM, Ingo Molnar wrote:
> 
> * John Stultz <johnstul@us.ibm.com> wrote:
> 
>> Over the weekend, Thomas got a chance to review the leap 
>> second fix in more detail and had a few additional changes he 
>> wanted to make to improve performance as well as style.


John,

FYI -- Using a mix of AMD and Intel systems (big and small, large memory and
small memory footprint), current test runs are ~18 hours at this point without
any issues seen, using a slightly modified version of your leap-a-day.c .

P.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
  2012-07-10 22:53 ` [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
@ 2012-07-12 22:43   ` Jiri Bohac
  2012-07-12 23:58     ` John Stultz
  0 siblings, 1 reply; 38+ messages in thread
From: Jiri Bohac @ 2012-07-12 22:43 UTC (permalink / raw)
  To: John Stultz
  Cc: Linux Kernel, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, stable

On Tue, Jul 10, 2012 at 03:53:59PM -0700, John Stultz wrote:
> On 07/10/2012 03:43 PM, John Stultz wrote:
> >Over the weekend, Thomas got a chance to review the leap second fix
> >in more detail and had a few additional changes he wanted to make
> >to improve performance as well as style.
> >
> >So this iteration includes his modifications.
> >
> >Once merged, I'll be working to get the backports finished as quickly
> >as I can and sent to -stable.

looking at the proposed 2.6.32.y stable patch at:
http://git.linaro.org/gitweb?p=people/jstultz/linux.git;a=commitdiff;h=18d208632bf17aed56c581b882868b2be44be71e;hp=6d224606bb8eec78027522d6dd5abfea8108c41a
Is this the final version you are about to send to -stable?

In 2.6.32 timekeeping_leap_insert() is not called from the timer
interrupt, but from the leap_timer hrtimer.

I think the new clock_was_set_timer will thus not be called by
irq_exit() because TIMER_SOFTIRQ has not been raised. Unless
TIMER_SOFTIRQ is raised, clock_was_set() will not be called until
the next periodic timer interrupt, correct?

Wouldn't the original schedule_work() approach work better for
2.6.32?

Or do you plan backporting the most recent version to 2.6.32?

Thanks,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
  2012-07-12 22:43   ` Jiri Bohac
@ 2012-07-12 23:58     ` John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-12 23:58 UTC (permalink / raw)
  To: Jiri Bohac
  Cc: Linux Kernel, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, stable

On 07/12/2012 03:43 PM, Jiri Bohac wrote:
> On Tue, Jul 10, 2012 at 03:53:59PM -0700, John Stultz wrote:
>> On 07/10/2012 03:43 PM, John Stultz wrote:
>>> Over the weekend, Thomas got a chance to review the leap second fix
>>> in more detail and had a few additional changes he wanted to make
>>> to improve performance as well as style.
>>>
>>> So this iteration includes his modifications.
>>>
>>> Once merged, I'll be working to get the backports finished as quickly
>>> as I can and sent to -stable.
> looking at the proposed 2.6.32.y stable patch at:
> http://git.linaro.org/gitweb?p=people/jstultz/linux.git;a=commitdiff;h=18d208632bf17aed56c581b882868b2be44be71e;hp=6d224606bb8eec78027522d6dd5abfea8108c41a
> Is this the final version you are about to send to -stable?
No, this isn't what I'm sending to -stable.  That was my backport that 
was done was prior to merging Thomas' modifications from over the 
weekend. Having, so far, done this backporting 3 times or so,  I figured 
I'd just wait until something got committed upstream before trying to 
backport it again. :)


> In 2.6.32 timekeeping_leap_insert() is not called from the timer
> interrupt, but from the leap_timer hrtimer.
>
> I think the new clock_was_set_timer will thus not be called by
> irq_exit() because TIMER_SOFTIRQ has not been raised. Unless
> TIMER_SOFTIRQ is raised, clock_was_set() will not be called until
> the next periodic timer interrupt, correct?
>
> Wouldn't the original schedule_work() approach work better for
> 2.6.32?
>
> Or do you plan backporting the most recent version to 2.6.32?

I'll be backporting & testing the most recent version once it is 
committed upstream.

thanks
-john


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated)
  2012-07-10 23:00 ` John Stultz
@ 2012-07-13  0:43   ` John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-13  0:43 UTC (permalink / raw)
  To: Linux Kernel
  Cc: Ingo Molnar, Peter Zijlstra, Prarit Bhargava, Thomas Gleixner, stable

[-- Attachment #1: Type: text/plain, Size: 849 bytes --]

On 07/10/2012 04:00 PM, John Stultz wrote:
> On 07/10/2012 03:43 PM, John Stultz wrote:
>> Over the weekend, Thomas got a chance to review the leap second fix
>> in more detail and had a few additional changes he wanted to make
>> to improve performance as well as style.
>>
>> So this iteration includes his modifications.
>>
>> Once merged, I'll be working to get the backports finished as quickly
>> as I can and sent to -stable.
>>
>
> Once again, here's the test case I've been using to stress test 
> leapsecond insertion/deletion (the deletion testing is new since last 
> round).

Prarit noticed a small bug with the test case. We're not initializing 
the timespec tv_nsec value before sleeping, so in some cases 
clock_nanosleep will return -EINVAL.

This version fixes this and cleans up some of the timespec/timeval usage.

thanks
-john


[-- Attachment #2: leap-a-day.c --]
[-- Type: text/x-csrc, Size: 5539 bytes --]

/* Leap second stress test
 *              by: john stultz (johnstul@us.ibm.com)
 *              (C) Copyright IBM 2012
 *              Licensed under the GPLv2
 *
 *  This test signals the kernel to insert a leap second
 *  every day at midnight GMT. This allows for stessing the
 *  kernel's leap-second behavior, as well as how well applications
 *  handle the leap-second discontinuity.
 *
 *  Usage: leap-a-day [-s]
 *
 *  Options:
 *	-s:	Each iteration, set the date to 10 seconds before midnight GMT.
 *		This speeds up the number of leapsecond transitions tested,
 *		but because it calls settimeofday frequently, advancing the
 *		time by 24 hours every ~16 seconds, it may cause application
 *		disruption.
 *
 *  Other notes: Disabling NTP prior to running this is advised, as the two
 *		 may conflict in thier commands to the kernel.
 *
 *  To build:
 *	$ gcc leap-a-day.c -o leap-a-day -lrt
 */



#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/timex.h>
#include <string.h>
#include <signal.h>

#define NSEC_PER_SEC 1000000000ULL

/* returns 1 if a <= b, 0 otherwise */
static inline int in_order(struct timespec a, struct timespec b)
{
        if(a.tv_sec < b.tv_sec)
                return 1;
        if(a.tv_sec > b.tv_sec)
                return 0;
        if(a.tv_nsec > b.tv_nsec)
                return 0;
        return 1;
}

struct timespec timespec_add(struct timespec ts, unsigned long long ns)
{
	ts.tv_nsec += ns;
	while(ts.tv_nsec >= NSEC_PER_SEC) {
		ts.tv_nsec -= NSEC_PER_SEC;
		ts.tv_sec++;
	}
	return ts;
}


char* time_state_str(int state)
{
	switch (state) {
		case TIME_OK:	return "TIME_OK";
		case TIME_INS:	return "TIME_INS";
		case TIME_DEL:	return "TIME_DEL";
		case TIME_OOP:	return "TIME_OOP";
		case TIME_WAIT:	return "TIME_WAIT";
		case TIME_BAD:	return "TIME_BAD";
	}
	return "ERROR"; 
}

/* clear NTP time_status & time_state */
void clear_time_state(void)
{
	struct timex tx;
	int ret;

	/*
	 * XXX - The fact we have to call this twice seems
	 * to point to a slight issue in the kernel's ntp state
	 * managment. Needs to be investigated further.
	 */

	tx.modes = ADJ_STATUS;
	tx.status = STA_PLL;
	ret = adjtimex(&tx);

	tx.modes = ADJ_STATUS;
	tx.status = 0;
	ret = adjtimex(&tx);
}

/* Make sure we cleanup on ctrl-c */
void handler(int unused)
{
	clear_time_state();
	exit(0);
}


/* Test for known hrtimer failure */
void test_hrtimer_failure(void)
{
	struct timespec now, target;

	clock_gettime(CLOCK_REALTIME, &now);
	target = timespec_add(now, NSEC_PER_SEC/2);
	clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &target, NULL);
	clock_gettime(CLOCK_REALTIME, &now);

	if (!in_order(target, now)) {
		printf("Note: hrtimer early expiration failure observed.\n");
	}

}



int main(int argc, char** argv) 
{
	int settime = 0;
	int insert = 1;

	signal(SIGINT, handler);
	signal(SIGKILL, handler);
	printf("This runs continuously. Press ctrl-c to stop\n");


	/* Process arguments */
	if (argc > 1) {
		if (!strncmp(argv[1], "-s", 2)) {
			printf("Setting time to speed up testing\n");
			settime = 1;
		} else {
			printf("Usage: %s [-s]\n", argv[0]);
			printf("	-s: Set time to right before leap second each iteration\n");
		}
	}

	printf("\n");
	while (1) {
		int ret;
		struct timespec ts;
		struct timex tx;
		time_t now, next_leap;

		/* Get the current time */
		clock_gettime(CLOCK_REALTIME, &ts);

		/* Calculate the next possible leap second 23:59:60 GMT */
		next_leap = ts.tv_sec;
		next_leap += 86400 - (next_leap % 86400);


		if (settime) {
			struct timeval tv;
			tv.tv_sec = next_leap - 10;
			tv.tv_usec = 0;
			settimeofday(&tv, NULL);
			printf("Setting time to %s", ctime(&tv.tv_sec));
		}

		/* Reset NTP time state */
		clear_time_state();

		/* Set the leap second insert flag */
		tx.modes = ADJ_STATUS;
		if (insert) 
			tx.status = STA_INS;
		else
			tx.status = STA_DEL;
		ret = adjtimex(&tx);
		if (ret < 0 ) {
			printf("Error: Problem setting STA_INS/STA_DEL!: %s\n",
							time_state_str(ret));
			return -1;
		}

		/* Validate STA_INS was set */
		ret = adjtimex(&tx);
		if (tx.status != STA_INS && tx.status != STA_DEL) {
			printf("Error: STA_INS/STA_DEL not set!: %s\n",
							time_state_str(ret));
			return -1;
		}

		printf("Scheduling leap second for %s", ctime(&next_leap));
	
		/* Wake up 3 seconds before leap */
		ts.tv_sec = next_leap - 3;
		ts.tv_nsec = 0;

		while(clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &ts, NULL))
			printf("Something woke us up, returning to sleep\n");

		/* Validate STA_INS is still set */
		ret = adjtimex(&tx);
		if (tx.status != STA_INS && tx.status != STA_DEL) {
			printf("Something cleared STA_INS/STA_DEL, setting it again.\n");
			tx.modes = ADJ_STATUS;
			if (insert) 
				tx.status = STA_INS;
			else
				tx.status = STA_DEL;
			ret = adjtimex(&tx);
		}

		/* Check adjtimex output every half second */
		now = tx.time.tv_sec;
		while (now < next_leap+2) {
			char buf[26];
			ret = adjtimex(&tx);

			ctime_r(&tx.time.tv_sec, buf);
			buf[strlen(buf)-1] = 0; /*remove trailing\n */

			printf("%s + %6ld us\t%s\n",
					buf,
					tx.time.tv_usec, 
					time_state_str(ret));
			now = tx.time.tv_sec;
			/* Sleep for another half second */
			ts.tv_sec = 0;
			ts.tv_nsec = NSEC_PER_SEC/2;
			clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);			
		}
		/* Switch to using other mode */
		insert = !insert;

		/* Note if kernel has known hrtimer failure */
		test_hrtimer_failure();

		printf("Leap complete\n\n");
	}

	clear_time_state();
	return 0;
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt
  2012-07-10 22:43 ` [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt John Stultz
@ 2012-07-15 15:22     ` Andreas Schwab
  2012-07-15 15:22     ` Andreas Schwab
       [not found]   ` <m2y5mlnj5z.fsf__49536.0585897744$1342365803$gmane$org@igel.home>
  2 siblings, 0 replies; 38+ messages in thread
From: Andreas Schwab @ 2012-07-15 15:22 UTC (permalink / raw)
  To: John Stultz
  Cc: Linux Kernel, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, stable, Benjamin Herrenschmidt, linuxppc-dev

This breaks resume on the iBook G4 (PowerBook6,7).  Apparently during or
before noirq resume the system is hanging by the same amount of time as
the system was sleeping.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt
@ 2012-07-15 15:22     ` Andreas Schwab
  0 siblings, 0 replies; 38+ messages in thread
From: Andreas Schwab @ 2012-07-15 15:22 UTC (permalink / raw)
  To: John Stultz
  Cc: Prarit Bhargava, Peter Zijlstra, Linux Kernel, stable,
	Thomas Gleixner, linuxppc-dev, Ingo Molnar

This breaks resume on the iBook G4 (PowerBook6,7).  Apparently during or
before noirq resume the system is hanging by the same amount of time as
the system was sleeping.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt
       [not found]   ` <m2y5mlnj5z.fsf__49536.0585897744$1342365803$gmane$org@igel.home>
@ 2012-07-15 16:02     ` Andreas Schwab
  0 siblings, 0 replies; 38+ messages in thread
From: Andreas Schwab @ 2012-07-15 16:02 UTC (permalink / raw)
  To: John Stultz
  Cc: Prarit Bhargava, Peter Zijlstra, Linux Kernel, stable,
	Thomas Gleixner, linuxppc-dev, Ingo Molnar

Andreas Schwab <schwab@linux-m68k.org> writes:

> This breaks resume on the iBook G4 (PowerBook6,7).  Apparently during or
> before noirq resume the system is hanging by the same amount of time as
> the system was sleeping.

The point where the time is wasted actually appears to be _after_ resume
(the elapsed time for the resume itself isn't affected).

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt
  2012-07-15 15:22     ` Andreas Schwab
@ 2012-07-15 20:28       ` Rafael J. Wysocki
  -1 siblings, 0 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2012-07-15 20:28 UTC (permalink / raw)
  To: Andreas Schwab, John Stultz
  Cc: Linux Kernel, Ingo Molnar, Peter Zijlstra, Prarit Bhargava,
	Thomas Gleixner, stable, Benjamin Herrenschmidt, linuxppc-dev,
	Linux PM list

On Sunday, July 15, 2012, Andreas Schwab wrote:
> This breaks resume on the iBook G4 (PowerBook6,7).  Apparently during or
> before noirq resume the system is hanging by the same amount of time as
> the system was sleeping.

I'm able to reproduce this problem on Toshiba Portege R500 with similar
symptoms, although that box sometimes hangs hard during resume from system
suspend with the "caps lock" LED blinking.

Reverting the patch fixes the problem 100% of the time.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt
@ 2012-07-15 20:28       ` Rafael J. Wysocki
  0 siblings, 0 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2012-07-15 20:28 UTC (permalink / raw)
  To: Andreas Schwab, John Stultz
  Cc: Prarit Bhargava, Peter Zijlstra, Linux PM list, Linux Kernel,
	stable, Thomas Gleixner, linuxppc-dev, Ingo Molnar

On Sunday, July 15, 2012, Andreas Schwab wrote:
> This breaks resume on the iBook G4 (PowerBook6,7).  Apparently during or
> before noirq resume the system is hanging by the same amount of time as
> the system was sleeping.

I'm able to reproduce this problem on Toshiba Portege R500 with similar
symptoms, although that box sometimes hangs hard during resume from system
suspend with the "caps lock" LED blinking.

Reverting the patch fixes the problem 100% of the time.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2012-07-15 20:53 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-10 22:43 [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
2012-07-10 22:43 ` [PATCH 1/6] hrtimer: Provide clock_was_set_delayed() John Stultz
2012-07-11 12:15   ` Prarit Bhargava
2012-07-11 12:45     ` Thomas Gleixner
2012-07-11 13:05       ` Peter Zijlstra
2012-07-11 15:18         ` Thomas Gleixner
2012-07-11 15:56           ` Peter Zijlstra
2012-07-11 16:47           ` John Stultz
2012-07-12  7:44             ` Jan Ceuleers
2012-07-12 12:29               ` Prarit Bhargava
2012-07-11 13:05       ` Prarit Bhargava
2012-07-11 13:38         ` Peter Zijlstra
2012-07-11 21:40   ` [tip:timers/urgent] " tip-bot for John Stultz
2012-07-10 22:43 ` [PATCH 2/6] timekeeping: Fix leapsecond triggered load spike issue John Stultz
2012-07-11 21:41   ` [tip:timers/urgent] " tip-bot for John Stultz
2012-07-10 22:43 ` [PATCH 3/6] timekeeping: Maintain ktime_t based offsets for hrtimers John Stultz
2012-07-11 21:42   ` [tip:timers/urgent] " tip-bot for Thomas Gleixner
2012-07-10 22:43 ` [PATCH 4/6] hrtimer: Move lock held region in hrtimer_interrupt() John Stultz
2012-07-10 22:43 ` [PATCH 4/6] hrtimers: " John Stultz
2012-07-11 21:43   ` [tip:timers/urgent] " tip-bot for Thomas Gleixner
2012-07-10 22:43 ` [PATCH 5/6] timekeeping: Provide hrtimer update function John Stultz
2012-07-11 21:44   ` [tip:timers/urgent] " tip-bot for Thomas Gleixner
2012-07-10 22:43 ` [PATCH 6/6] hrtimer: Update hrtimer base offsets each hrtimer_interrupt John Stultz
2012-07-11 21:45   ` [tip:timers/urgent] " tip-bot for John Stultz
2012-07-15 15:22   ` [PATCH 6/6] " Andreas Schwab
2012-07-15 15:22     ` Andreas Schwab
2012-07-15 20:28     ` Rafael J. Wysocki
2012-07-15 20:28       ` Rafael J. Wysocki
     [not found]   ` <m2y5mlnj5z.fsf__49536.0585897744$1342365803$gmane$org@igel.home>
2012-07-15 16:02     ` Andreas Schwab
2012-07-10 22:53 ` [PATCH 0/6] Fix for leapsecond caused hrtimer/futex issue (updated) John Stultz
2012-07-12 22:43   ` Jiri Bohac
2012-07-12 23:58     ` John Stultz
2012-07-10 23:00 ` John Stultz
2012-07-13  0:43   ` John Stultz
2012-07-11 10:59 ` Peter Zijlstra
2012-07-11 11:17 ` Ingo Molnar
2012-07-12 12:32   ` Prarit Bhargava
2012-07-11 12:16 ` Prarit Bhargava

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.