From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933489AbcGJMYv (ORCPT ); Sun, 10 Jul 2016 08:24:51 -0400 Received: from mail-wm0-f66.google.com ([74.125.82.66]:32924 "EHLO mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933432AbcGJMYi (ORCPT ); Sun, 10 Jul 2016 08:24:38 -0400 From: Nicolai Stange To: Thomas Gleixner Cc: Ingo Molnar , "H. Peter Anvin" , x86@kernel.org, John Stultz , Borislav Petkov , Paolo Bonzini , Viresh Kumar , Hidehiro Kawai , "Peter Zijlstra (Intel)" , "Christopher S. Hall" , Adrian Hunter , Suresh Siddha , linux-kernel@vger.kernel.org, Nicolai Stange Subject: [PATCH 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency Date: Sun, 10 Jul 2016 14:23:45 +0200 Message-Id: <20160710122345.13061-5-nicstange@gmail.com> X-Mailer: git-send-email 2.9.0 In-Reply-To: <20160710122345.13061-1-nicstange@gmail.com> References: <20160710122345.13061-1-nicstange@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org With NOHZ_FULL and one single well-isolated, CPU consumptive task, one would expect approximately one clockevent interrupt per second. However, on my Intel Haswell where the monotonic clock is the TSC monotonic clock and the clockevent device is the TSC deadline device, it turns out that every second, there are two such interrupts: the first one arrives always approximately ~50us before the scheduled deadline as programmed by tick_nohz_stop_sched_tick() through the hrtimer API. The __hrtimer_run_queues() called in this interrupt detects that the queued tick_sched_timer hasn't expired yet and simply does nothing except reprogramming the clock event device to fire shortly after again. These too early programmed deadlines are explained as follows: clockevents_program_event() programs the clockevent device to fire after f_event * delta_t_progr clockevent device cycles where f_event is the clockevent device's hardware frequency and delta_t_progr is the requested time interval. After that many clockevent device cycles have elapsed, the device underlying the monotonic clock, that is the monotonic raw clock has seen f_raw / f_event as many cycles. The ktime_get() called from __hrtimer_run_queues() interprets those cycles to run at the frequency of the monotonic clock. Summarizing: delta_t_perc = 1/f_mono * f_raw/f_event * f_event * delta_t_progr = f_raw / f_mono * delta_t_progr with f_mono being the monotonic clock's frequency and delta_t_perc being the elapsed time interval as perceived by __hrtimer_run_queues(). Now, f_mono is not a constant, but is dynamically adjusted in timekeeping_adjust() in order to compensate for the NTP error. With the large values of delta_t_progr of 10^9ns with NOHZ_FULL, the error made becomes significant and results in the double timer interrupts described above. Compensate for this error by multiplying delta_t_progr with f_mono / f_raw in clockevents_program_event() before actually programming the clockevent device. Namely, introduce a helper, timekeeping_mono_interval_to_raw(), which converts a given time interval from the monotonic clock's perception to that of the raw monotonic clock by multiplying the value by f_mono / f_raw. Call that helper from clockevents_program_event() in order to obtain a suitable time interval to program the clockevent device with. Signed-off-by: Nicolai Stange --- kernel/time/clockevents.c | 1 + kernel/time/timekeeping.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++ kernel/time/timekeeping.h | 1 + 3 files changed, 52 insertions(+) diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index a9b76a4..4bccf04 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -329,6 +329,7 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, return dev->set_next_ktime(expires, dev); delta = ktime_to_ns(ktime_sub(expires, ktime_get())); + delta = timekeeping_mono_interval_to_raw(delta); if (delta <= 0) return force ? clockevents_program_min_delta(dev) : -ETIME; diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index dcd5ce6..b1e7a41 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -2133,6 +2133,56 @@ out: } /** + * timekeeper_mono_interval_to_raw - Convert mono interval to raw's perception. + * @interval: Time interval as measured by the mono clock. + * + * Converts the given time interval as measured by the monotonic clock to + * what would have been measured by the raw monotonic clock in the meanwhile. + * The monotonic clock's frequency gets dynamically adjusted every now and then + * in order to compensate for the differences to NTP. OTOH, the clockevents + * devices are not affected by this adjustment, i.e. they keep ticking at some + * fixed hardware frequency which may be assumed to have a constant ratio to + * the fixed raw monotonic clock's frequency. This function provides a means + * to convert time intervals from the dynamic frequency monotonic clock to + * the fixed frequency hardware world. + * + * The returned value is clamped to [0, KTIME_MAX]. In particular, + * KTIME_MAX is returned on overflow and zero for negative values of interval. + */ +s64 timekeeping_mono_interval_to_raw(s64 interval) +{ + struct timekeeper *tk = &tk_core.timekeeper; + u32 raw_mult = tk->tkr_raw.mult, mono_mult = tk->tkr_mono.mult; + s64 raw; + + /* The overflow checks below can't deal with negative intervals. */ + if (interval <= 0) + return interval; + + /* + * Calculate + * raw = f_mono / f_raw * interval + * = (raw_mult / 2^raw_shift) / (mono_mult / 2^mono_shift) + * * interval + * where f_mono and f_raw denote the frequencies of the monotonic + * and raw clock respectively. + * + * Note that the monotonic and raw clock shifts' are equal and fixed, + * that is they cancel. + */ + if (KTIME_MAX / raw_mult < interval) + return KTIME_MAX; + raw = interval * raw_mult; + + if (KTIME_MAX - mono_mult / 2 < raw) + return KTIME_MAX; + raw += mono_mult / 2; /* round properly */ + raw /= mono_mult; + + return raw; +} + +/** * getboottime64 - Return the real time of system boot. * @ts: pointer to the timespec64 to be set * diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h index 704f595..40a0fa9 100644 --- a/kernel/time/timekeeping.h +++ b/kernel/time/timekeeping.h @@ -18,6 +18,7 @@ extern void timekeeping_resume(void); extern void do_timer(unsigned long ticks); extern void update_wall_time(void); +extern s64 timekeeping_mono_interval_to_raw(s64 interval); extern seqlock_t jiffies_lock; -- 2.9.0