[PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency

From: Nicolai Stange <nicstange@gmail.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, John Stultz <john.stultz@linaro.org>,
	Borislav Petkov <bp@suse.de>, Paolo Bonzini <pbonzini@redhat.com>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	"Christopher S. Hall" <christopher.s.hall@intel.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	linux-kernel@vger.kernel.org,
	Nicolai Stange <nicstange@gmail.com>
Subject: [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency
Date: Sun, 10 Jul 2016 21:30:47 +0200	[thread overview]
Message-ID: <20160710193047.18320-5-nicstange@gmail.com> (raw)
In-Reply-To: <20160710193047.18320-1-nicstange@gmail.com>

With NOHZ_FULL and one single well-isolated, CPU consumptive task, one
would expect approximately one clockevent interrupt per second. However, on
my Intel Haswell where the monotonic clock is the TSC monotonic clock and
the clockevent device is the TSC deadline device, it turns out that every
second, there are two such interrupts: the first one arrives always
approximately ~50us before the scheduled deadline as programmed by
tick_nohz_stop_sched_tick() through the hrtimer API. The
__hrtimer_run_queues() called in this interrupt detects that the queued
tick_sched_timer hasn't expired yet and simply does nothing except
reprogramming the clock event device to fire shortly after again.

These too early programmed deadlines are explained as follows:
clockevents_program_event() programs the clockevent device to fire
after
  f_event * delta_t_progr
clockevent device cycles where f_event is the clockevent device's hardware
frequency and delta_t_progr is the requested time interval. After that many
clockevent device cycles have elapsed, the device underlying the monotonic
clock, that is the monotonic raw clock has seen f_raw / f_event as many
cycles.
The ktime_get() called from __hrtimer_run_queues() interprets those
cycles to run at the frequency of the monotonic clock. Summarizing:
  delta_t_perc = 1/f_mono * f_raw/f_event * f_event * delta_t_progr
               = f_raw / f_mono * delta_t_progr
with f_mono being the monotonic clock's frequency and delta_t_perc being
the elapsed time interval as perceived by __hrtimer_run_queues().

Now, f_mono is not a constant, but is dynamically adjusted in
timekeeping_adjust() in order to compensate for the NTP error. With the
large values of delta_t_progr of 10^9ns with NOHZ_FULL, the error made
becomes significant and results in the double timer interrupts described
above.

Compensate for this error by multiplying delta_t_progr with f_mono / f_raw
in clockevents_program_event() before actually programming the clockevent
device.

Namely, introduce a helper, timekeeping_mono_interval_to_raw(), which
converts a given time interval from the monotonic clock's perception to
that of the raw monotonic clock by multiplying the value by f_mono / f_raw.
Call that helper from clockevents_program_event() in order to obtain a
suitable time interval to program the clockevent device with.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
---
 kernel/time/clockevents.c |  1 +
 kernel/time/timekeeping.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/time/timekeeping.h |  1 +
 3 files changed, 61 insertions(+)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index a9b76a4..4bccf04 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -329,6 +329,7 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
 		return dev->set_next_ktime(expires, dev);
 
 	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
+	delta = timekeeping_mono_interval_to_raw(delta);
 	if (delta <= 0)
 		return force ? clockevents_program_min_delta(dev) : -ETIME;
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index dcd5ce6..51dfbbb 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -23,6 +23,7 @@
 #include <linux/stop_machine.h>
 #include <linux/pvclock_gtod.h>
 #include <linux/compiler.h>
+#include <asm/div64.h>
 
 #include "tick-internal.h"
 #include "ntp_internal.h"
@@ -2133,6 +2134,64 @@ out:
 }
 
 /**
+ * timekeeper_mono_interval_to_raw - Convert mono interval to raw's perception.
+ * @interval: Time interval as measured by the mono clock.
+ *
+ * Converts the given time interval as measured by the monotonic clock to
+ * what would have been measured by the raw monotonic clock in the meanwhile.
+ * The monotonic clock's frequency gets dynamically adjusted every now and then
+ * in order to compensate for the differences to NTP. OTOH, the clockevents
+ * devices are not affected by this adjustment, i.e. they keep ticking at some
+ * fixed hardware frequency which may be assumed to have a constant ratio to
+ * the fixed raw monotonic clock's frequency. This function provides a means
+ * to convert time intervals from the dynamic frequency monotonic clock to
+ * the fixed frequency hardware world.
+ *
+ * If interval < 0, zero is returned. If an overflow happens during the
+ * calculation, KTIME_MAX is returned.
+ */
+s64 timekeeping_mono_interval_to_raw(s64 interval)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+	u32 raw_mult = tk->tkr_raw.mult, mono_mult = tk->tkr_mono.mult;
+	u64 raw, tmp;
+
+	/* The overflow checks below can't deal with negative intervals. */
+	if (interval <= 0)
+		return 0;
+
+	/*
+	 * Calculate
+	 *   raw = f_mono / f_raw * interval
+	 *       = (raw_mult / 2^raw_shift) / (mono_mult / 2^mono_shift)
+	 *            * interval
+	 * where f_mono and f_raw denote the frequencies of the monotonic
+	 * and raw clock respectively.
+	 *
+	 * Note that the monotonic and raw clocks' shifts are equal and fixed,
+	 * that is they cancel.
+	 */
+
+	/* First, calculate interval * raw_mult while checking for overflow. */
+	raw = ((u64)interval >> 32) * raw_mult; /* Upper half of interval */
+	if (raw >> 32)
+		return KTIME_MAX;
+	raw <<= 32;
+	tmp = ((u64)interval & U32_MAX) * raw_mult; /* Lower half of interval */
+	if (U64_MAX - raw < tmp)
+		return KTIME_MAX;
+	raw += tmp;
+
+	/* Finally, do raw /= mono_mult with proper rounding. */
+	if (U64_MAX - raw < mono_mult / 2)
+		return KTIME_MAX;
+	raw += mono_mult / 2;
+	do_div(raw, mono_mult);
+
+	return (s64)raw;
+}
+
+/**
  * getboottime64 - Return the real time of system boot.
  * @ts:		pointer to the timespec64 to be set
  *
diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h
index 704f595..40a0fa9 100644
--- a/kernel/time/timekeeping.h
+++ b/kernel/time/timekeeping.h
@@ -18,6 +18,7 @@ extern void timekeeping_resume(void);
 
 extern void do_timer(unsigned long ticks);
 extern void update_wall_time(void);
+extern s64 timekeeping_mono_interval_to_raw(s64 interval);
 
 extern seqlock_t jiffies_lock;
 
-- 
2.9.0