linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/4] avoid double timer interrupt with nohz and Intel TSC
@ 2016-07-10 19:30 Nicolai Stange
  2016-07-10 19:30 ` [PATCH v2 1/4] arch, x86, tsc deadline clockevent dev: reduce frequency roundoff error Nicolai Stange
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Nicolai Stange @ 2016-07-10 19:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, x86, John Stultz, Borislav Petkov,
	Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel, Nicolai Stange

With a single task running on a NOHZ CPU on an Intel Haswell, I recognized
that I did not only get the one expected local_timer APIC interrupt, but
two per second at minimum.

Further tracing showed that the first one preceedes the programmed deadline
by up to ~50us and hence, it did nothing except for reprogramming the TSC
deadline clockevent device to trigger shortly thereafter again.

FYI, the trace looks like this:

  <...>-2938  [007] d.h.   420.753164: local_timer_entry: vector=239
  <...>-2938  [007] d.h.   420.753164: __hrtimer_run_queues <-hrtimer_interrupt
  <...>-2938  [007] d.h.   420.753184: local_timer_entry: vector=239
  <...>-2938  [007] d.h.   420.753184: __hrtimer_run_queues <-hrtimer_interrupt
  <...>-2938  [007] d.h.   420.753195: tick_sched_timer <-__hrtimer_run_queues
  <...>-2938  [007] d.h.   421.752170: local_timer_entry: vector=239
  <...>-2938  [007] d.h.   421.752171: __hrtimer_run_queues <-hrtimer_interrupt
  <...>-2938  [007] d.h.   421.752202: local_timer_entry: vector=239
  <...>-2938  [007] d.h.   421.752202: __hrtimer_run_queues <-hrtimer_interrupt
  <...>-2938  [007] d.h.   421.752202: tick_sched_timer <-__hrtimer_run_queues

It turns out that this too early programmed TSC deadline is caused by
inaccuracies in some frequency calculations which become significant if
the timer periods become large as it is the case for nohz with one task
(delta = 10^9ns).

The first three patches address inaccuracies entering the TSC deadline
clockevent devices' frequency.

The fourth patch is the most important one as it addresses the error
of largest relative magnitude. It is caused by the assumption in the
clockevents core that the ratio of the monotonic clock's frequency to that
of the clockevent device's is a constant. Since the monotonic clock's
frequency gets dynamically adjusted in order to compensate for NTP errors,
this is not true.

With this patchset applied, the trace looks like this:

  <...>-23609 [007] d.h.  1811.586658: local_timer_entry: vector=239
  <...>-23609 [007] d.h.  1811.586680: __hrtimer_run_queues <-hrtimer_interrupt
  <...>-23609 [007] d.h.  1811.586680: tick_sched_timer <-__hrtimer_run_queues
  <...>-23609 [007] d.h.  1812.585659: local_timer_entry: vector=239
  <...>-23609 [007] d.h.  1812.585666: __hrtimer_run_queues <-hrtimer_interrupt
  <...>-23609 [007] d.h.  1812.585666: tick_sched_timer <-__hrtimer_run_queues
  <...>-23609 [007] d.h.  1813.584661: local_timer_entry: vector=239
  <...>-23609 [007] d.h.  1813.584668: __hrtimer_run_queues <-hrtimer_interrupt
  <...>-23609 [007] d.h.  1813.584668: tick_sched_timer <-__hrtimer_run_queues

Please note that the first three TSC-patches might not be necessary to
get this result. In fact, [3/4] ("arch, x86, tsc: inform TSC deadline
clockevent device about") is somewhat counterproductive in the sense that
on my system, it usually corrects the TSC deadline device's frequency
towards lower values and thus, facilitates the too-early interrupt
behaviour initially described. However, I decided to send them along with
the fourth patch because
 - I tested the fourth patch in this setting
 - I believe that a greater accurracy of the TSC deadline device is
   worthwhile on its own


Changes to v1:
  - [1/4] ("arch, x86, tsc deadline clockevent dev: reduce frequency
            roundoff error")
      No changes to the patch. Note that the v1 mail could not be delivered
      to the author of the TSC_DIVISOR introducing commit 279f1461432c
      ("x86: apic: Use tsc deadline for oneshot when available"),
      Suresh Siddha <suresh.b.siddha@intel.com>, so I had to remove him
      from the CC list.

  - [2/4] ("arch, x86, tsc deadline clockevent dev: reduce TSC_DIVISOR
            to 2")
       Likewise.

  - [3/4] ("arch, x86, tsc: inform TSC deadline clockevent device about
            recalibration")
      Silence the kbuild test robot on ARCH=i386 by wrapping the new call
      to lapic_update_tsc_freq() from arch/x86/kernel/tsc.c in an
      #ifdef CONFIG_X86_LOCAL_APIC.

  - [4/4] ("kernel/time/clockevents: compensate for monotonic clock's
            dynamic frequency")
      I carelessly did two non-do_div() 64 bit divisions and the kbuild
      test robot complained. The first one was in charge of checking a
      u64 by u32 multiplication for overflow. Purge it by splitting the
      u64 factor in two halves and multiplying each on its own while
      simultaneously doing the overflow checks. Make the second u64
      division explicit by invoking do_div().


Nicolai Stange (4):
  arch, x86, tsc deadline clockevent dev: reduce frequency roundoff
    error
  arch, x86, tsc deadline clockevent dev: reduce TSC_DIVISOR to 2
  arch, x86, tsc: inform TSC deadline clockevent device about
    recalibration
  kernel/time/clockevents: compensate for monotonic clock's dynamic
    frequency

 arch/x86/include/asm/apic.h |  1 +
 arch/x86/kernel/apic/apic.c | 29 ++++++++++++++++++++--
 arch/x86/kernel/tsc.c       |  6 +++++
 kernel/time/clockevents.c   |  1 +
 kernel/time/timekeeping.c   | 59 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/time/timekeeping.h   |  1 +
 6 files changed, 95 insertions(+), 2 deletions(-)

-- 
2.9.0

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 1/4] arch, x86, tsc deadline clockevent dev: reduce frequency roundoff error
  2016-07-10 19:30 [PATCH v2 0/4] avoid double timer interrupt with nohz and Intel TSC Nicolai Stange
@ 2016-07-10 19:30 ` Nicolai Stange
  2016-07-10 19:30 ` [PATCH v2 2/4] arch, x86, tsc deadline clockevent dev: reduce TSC_DIVISOR to 2 Nicolai Stange
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Nicolai Stange @ 2016-07-10 19:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, x86, John Stultz, Borislav Petkov,
	Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel, Nicolai Stange

In setup_APIC_timer(), the registered clockevent device's frequency
is calculated by first dividing tsc_khz by TSC_DIVISOR and multiplying
it with 1000 afterwards.

The multiplication with 1000 is done for converting from kHz to Hz and the
division by TSC_DIVISOR is carried out in order to make sure that the final
result fits into an u32.

However, with the order given in this calculation, the roundoff error
introduced by the division gets magnified by a factor of 1000 by the
following multiplication.

Increase the accuracy by reversing the order of the division and
multiplication. In order not to overflow during this calculation, cast
temporarily to u64.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
---
 arch/x86/kernel/apic/apic.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 89a5bce..dce654c 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -563,7 +563,8 @@ static void setup_APIC_timer(void)
 				    CLOCK_EVT_FEAT_DUMMY);
 		levt->set_next_event = lapic_next_deadline;
 		clockevents_config_and_register(levt,
-						(tsc_khz / TSC_DIVISOR) * 1000,
+						(u32)(((u64)tsc_khz * 1000) /
+							TSC_DIVISOR),
 						0xF, ~0UL);
 	} else
 		clockevents_register_device(levt);
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 2/4] arch, x86, tsc deadline clockevent dev: reduce TSC_DIVISOR to 2
  2016-07-10 19:30 [PATCH v2 0/4] avoid double timer interrupt with nohz and Intel TSC Nicolai Stange
  2016-07-10 19:30 ` [PATCH v2 1/4] arch, x86, tsc deadline clockevent dev: reduce frequency roundoff error Nicolai Stange
@ 2016-07-10 19:30 ` Nicolai Stange
  2016-07-10 19:30 ` [PATCH v2 3/4] arch, x86, tsc: inform TSC deadline clockevent device about recalibration Nicolai Stange
  2016-07-10 19:30 ` [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency Nicolai Stange
  3 siblings, 0 replies; 10+ messages in thread
From: Nicolai Stange @ 2016-07-10 19:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, x86, John Stultz, Borislav Petkov,
	Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel, Nicolai Stange

In order to avoid overflowing an u32, the TSC deadline clockevent device's
frequency is divided by TSC_DIVISOR at registration.

The TSC_DIVISOR is currently defined as equaling 32 which allows for a
TSC frequency as high as 2^32 / 10^9ns * 32 = 137 GHz.

OTOH, larger values of TSC_DIVISOR introduce bigger roundoff errors into
the device's frequency.

A value of 2 for TSC_DIVISOR allows for a TSC frequency of
2^32 / 10^9ns * 2 = 8.5 GHz which is still way larger than anything to
expect in the next years.

Reduce the TSC deadline clockevent device's TSC_DIVISOR from 32 down to 2.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
---
 arch/x86/kernel/apic/apic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dce654c..1d22c72 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -311,7 +311,7 @@ int lapic_get_maxlvt(void)
 
 /* Clock divisor */
 #define APIC_DIVISOR 16
-#define TSC_DIVISOR  32
+#define TSC_DIVISOR  2
 
 /*
  * This function sets up the local APIC timer, with a timeout of
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 3/4] arch, x86, tsc: inform TSC deadline clockevent device about recalibration
  2016-07-10 19:30 [PATCH v2 0/4] avoid double timer interrupt with nohz and Intel TSC Nicolai Stange
  2016-07-10 19:30 ` [PATCH v2 1/4] arch, x86, tsc deadline clockevent dev: reduce frequency roundoff error Nicolai Stange
  2016-07-10 19:30 ` [PATCH v2 2/4] arch, x86, tsc deadline clockevent dev: reduce TSC_DIVISOR to 2 Nicolai Stange
@ 2016-07-10 19:30 ` Nicolai Stange
  2016-07-10 19:30 ` [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency Nicolai Stange
  3 siblings, 0 replies; 10+ messages in thread
From: Nicolai Stange @ 2016-07-10 19:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, x86, John Stultz, Borislav Petkov,
	Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel, Nicolai Stange

The TSC deadline clockevent devices' configuration and registration
happens before the TSC frequency calibration is refined in
tsc_refine_calibration_work().

This results in the TSC clocksource and the TSC deadline clockevent
devices being configured with slightly different frequencies: the former
gets the refined one and the latter are configured with the inaccurate
frequency detected earlier by means of the
"Fast TSC calibration using PIT".

Within the APIC code, introduce the notifier function
lapic_update_tsc_freq() which reconfigures all per-CPU TSC deadline
clockevent devices with the current tsc_khz.

Call it from the TSC code after TSC calibration refinement has happened.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
---
 arch/x86/include/asm/apic.h |  1 +
 arch/x86/kernel/apic/apic.c | 24 ++++++++++++++++++++++++
 arch/x86/kernel/tsc.c       |  6 ++++++
 3 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index bc27611..971f446 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -135,6 +135,7 @@ extern void init_apic_mappings(void);
 void register_lapic_address(unsigned long address);
 extern void setup_boot_APIC_clock(void);
 extern void setup_secondary_APIC_clock(void);
+extern void lapic_update_tsc_freq(void);
 extern int APIC_init_uniprocessor(void);
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 1d22c72..85238a8 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -571,6 +571,30 @@ static void setup_APIC_timer(void)
 }
 
 /*
+ * Install the updated TSC frequency from recalibration at the TSC
+ * deadline clockevent devices.
+ */
+static void __lapic_update_tsc_freq(void *info)
+{
+	struct clock_event_device *levt = this_cpu_ptr(&lapic_events);
+
+	if (!this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+		return;
+
+	clockevents_config(levt, (u32)(((u64)tsc_khz * 1000) / TSC_DIVISOR));
+}
+
+void lapic_update_tsc_freq(void)
+{
+	/*
+	 * The clockevent device's ->mult and ->shift can both be
+	 * changed. In order to avoid races, schedule the frequency
+	 * update code on each CPU.
+	 */
+	on_each_cpu(__lapic_update_tsc_freq, NULL, 0);
+}
+
+/*
  * In this functions we calibrate APIC bus clocks to the external timer.
  *
  * We want to do the calibration only once since we want to have local timer
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 38ba6de..e0a6e20 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -22,6 +22,7 @@
 #include <asm/nmi.h>
 #include <asm/x86_init.h>
 #include <asm/geode.h>
+#include <asm/apic.h>
 
 unsigned int __read_mostly cpu_khz;	/* TSC clocks / usec, not used here */
 EXPORT_SYMBOL(cpu_khz);
@@ -1193,6 +1194,11 @@ static void tsc_refine_calibration_work(struct work_struct *work)
 		(unsigned long)tsc_khz / 1000,
 		(unsigned long)tsc_khz % 1000);
 
+#ifdef CONFIG_X86_LOCAL_APIC
+	/* Inform the TSC deadline clockevent devices about the recalibration */
+	lapic_update_tsc_freq();
+#endif
+
 out:
 	if (boot_cpu_has(X86_FEATURE_ART))
 		art_related_clocksource = &clocksource_tsc;
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency
  2016-07-10 19:30 [PATCH v2 0/4] avoid double timer interrupt with nohz and Intel TSC Nicolai Stange
                   ` (2 preceding siblings ...)
  2016-07-10 19:30 ` [PATCH v2 3/4] arch, x86, tsc: inform TSC deadline clockevent device about recalibration Nicolai Stange
@ 2016-07-10 19:30 ` Nicolai Stange
  2016-07-11  6:32   ` Nicolai Stange
  3 siblings, 1 reply; 10+ messages in thread
From: Nicolai Stange @ 2016-07-10 19:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, x86, John Stultz, Borislav Petkov,
	Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel, Nicolai Stange

With NOHZ_FULL and one single well-isolated, CPU consumptive task, one
would expect approximately one clockevent interrupt per second. However, on
my Intel Haswell where the monotonic clock is the TSC monotonic clock and
the clockevent device is the TSC deadline device, it turns out that every
second, there are two such interrupts: the first one arrives always
approximately ~50us before the scheduled deadline as programmed by
tick_nohz_stop_sched_tick() through the hrtimer API. The
__hrtimer_run_queues() called in this interrupt detects that the queued
tick_sched_timer hasn't expired yet and simply does nothing except
reprogramming the clock event device to fire shortly after again.

These too early programmed deadlines are explained as follows:
clockevents_program_event() programs the clockevent device to fire
after
  f_event * delta_t_progr
clockevent device cycles where f_event is the clockevent device's hardware
frequency and delta_t_progr is the requested time interval. After that many
clockevent device cycles have elapsed, the device underlying the monotonic
clock, that is the monotonic raw clock has seen f_raw / f_event as many
cycles.
The ktime_get() called from __hrtimer_run_queues() interprets those
cycles to run at the frequency of the monotonic clock. Summarizing:
  delta_t_perc = 1/f_mono * f_raw/f_event * f_event * delta_t_progr
               = f_raw / f_mono * delta_t_progr
with f_mono being the monotonic clock's frequency and delta_t_perc being
the elapsed time interval as perceived by __hrtimer_run_queues().

Now, f_mono is not a constant, but is dynamically adjusted in
timekeeping_adjust() in order to compensate for the NTP error. With the
large values of delta_t_progr of 10^9ns with NOHZ_FULL, the error made
becomes significant and results in the double timer interrupts described
above.

Compensate for this error by multiplying delta_t_progr with f_mono / f_raw
in clockevents_program_event() before actually programming the clockevent
device.

Namely, introduce a helper, timekeeping_mono_interval_to_raw(), which
converts a given time interval from the monotonic clock's perception to
that of the raw monotonic clock by multiplying the value by f_mono / f_raw.
Call that helper from clockevents_program_event() in order to obtain a
suitable time interval to program the clockevent device with.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
---
 kernel/time/clockevents.c |  1 +
 kernel/time/timekeeping.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/time/timekeeping.h |  1 +
 3 files changed, 61 insertions(+)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index a9b76a4..4bccf04 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -329,6 +329,7 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
 		return dev->set_next_ktime(expires, dev);
 
 	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
+	delta = timekeeping_mono_interval_to_raw(delta);
 	if (delta <= 0)
 		return force ? clockevents_program_min_delta(dev) : -ETIME;
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index dcd5ce6..51dfbbb 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -23,6 +23,7 @@
 #include <linux/stop_machine.h>
 #include <linux/pvclock_gtod.h>
 #include <linux/compiler.h>
+#include <asm/div64.h>
 
 #include "tick-internal.h"
 #include "ntp_internal.h"
@@ -2133,6 +2134,64 @@ out:
 }
 
 /**
+ * timekeeper_mono_interval_to_raw - Convert mono interval to raw's perception.
+ * @interval: Time interval as measured by the mono clock.
+ *
+ * Converts the given time interval as measured by the monotonic clock to
+ * what would have been measured by the raw monotonic clock in the meanwhile.
+ * The monotonic clock's frequency gets dynamically adjusted every now and then
+ * in order to compensate for the differences to NTP. OTOH, the clockevents
+ * devices are not affected by this adjustment, i.e. they keep ticking at some
+ * fixed hardware frequency which may be assumed to have a constant ratio to
+ * the fixed raw monotonic clock's frequency. This function provides a means
+ * to convert time intervals from the dynamic frequency monotonic clock to
+ * the fixed frequency hardware world.
+ *
+ * If interval < 0, zero is returned. If an overflow happens during the
+ * calculation, KTIME_MAX is returned.
+ */
+s64 timekeeping_mono_interval_to_raw(s64 interval)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+	u32 raw_mult = tk->tkr_raw.mult, mono_mult = tk->tkr_mono.mult;
+	u64 raw, tmp;
+
+	/* The overflow checks below can't deal with negative intervals. */
+	if (interval <= 0)
+		return 0;
+
+	/*
+	 * Calculate
+	 *   raw = f_mono / f_raw * interval
+	 *       = (raw_mult / 2^raw_shift) / (mono_mult / 2^mono_shift)
+	 *            * interval
+	 * where f_mono and f_raw denote the frequencies of the monotonic
+	 * and raw clock respectively.
+	 *
+	 * Note that the monotonic and raw clocks' shifts are equal and fixed,
+	 * that is they cancel.
+	 */
+
+	/* First, calculate interval * raw_mult while checking for overflow. */
+	raw = ((u64)interval >> 32) * raw_mult; /* Upper half of interval */
+	if (raw >> 32)
+		return KTIME_MAX;
+	raw <<= 32;
+	tmp = ((u64)interval & U32_MAX) * raw_mult; /* Lower half of interval */
+	if (U64_MAX - raw < tmp)
+		return KTIME_MAX;
+	raw += tmp;
+
+	/* Finally, do raw /= mono_mult with proper rounding. */
+	if (U64_MAX - raw < mono_mult / 2)
+		return KTIME_MAX;
+	raw += mono_mult / 2;
+	do_div(raw, mono_mult);
+
+	return (s64)raw;
+}
+
+/**
  * getboottime64 - Return the real time of system boot.
  * @ts:		pointer to the timespec64 to be set
  *
diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h
index 704f595..40a0fa9 100644
--- a/kernel/time/timekeeping.h
+++ b/kernel/time/timekeeping.h
@@ -18,6 +18,7 @@ extern void timekeeping_resume(void);
 
 extern void do_timer(unsigned long ticks);
 extern void update_wall_time(void);
+extern s64 timekeeping_mono_interval_to_raw(s64 interval);
 
 extern seqlock_t jiffies_lock;
 
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency
  2016-07-10 19:30 ` [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency Nicolai Stange
@ 2016-07-11  6:32   ` Nicolai Stange
  2016-07-11  8:32     ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Nicolai Stange @ 2016-07-11  6:32 UTC (permalink / raw)
  To: Nicolai Stange
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, John Stultz,
	Borislav Petkov, Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel

Nicolai Stange <nicstange@gmail.com> writes:

> With NOHZ_FULL and one single well-isolated, CPU consumptive task, one
> would expect approximately one clockevent interrupt per second. However, on
> my Intel Haswell where the monotonic clock is the TSC monotonic clock and
> the clockevent device is the TSC deadline device, it turns out that every
> second, there are two such interrupts: the first one arrives always
> approximately ~50us before the scheduled deadline as programmed by
> tick_nohz_stop_sched_tick() through the hrtimer API. The
> __hrtimer_run_queues() called in this interrupt detects that the queued
> tick_sched_timer hasn't expired yet and simply does nothing except
> reprogramming the clock event device to fire shortly after again.
>
> These too early programmed deadlines are explained as follows:
> clockevents_program_event() programs the clockevent device to fire
> after
>   f_event * delta_t_progr
> clockevent device cycles where f_event is the clockevent device's hardware
> frequency and delta_t_progr is the requested time interval. After that many
> clockevent device cycles have elapsed, the device underlying the monotonic
> clock, that is the monotonic raw clock has seen f_raw / f_event as many
> cycles.
> The ktime_get() called from __hrtimer_run_queues() interprets those
> cycles to run at the frequency of the monotonic clock. Summarizing:
>   delta_t_perc = 1/f_mono * f_raw/f_event * f_event * delta_t_progr
>                = f_raw / f_mono * delta_t_progr
> with f_mono being the monotonic clock's frequency and delta_t_perc being
> the elapsed time interval as perceived by __hrtimer_run_queues().
>
> Now, f_mono is not a constant, but is dynamically adjusted in
> timekeeping_adjust() in order to compensate for the NTP error. With the
> large values of delta_t_progr of 10^9ns with NOHZ_FULL, the error made
> becomes significant and results in the double timer interrupts described
> above.
>
> Compensate for this error by multiplying delta_t_progr with f_mono / f_raw
> in clockevents_program_event() before actually programming the clockevent
> device.
>
> Namely, introduce a helper, timekeeping_mono_interval_to_raw(), which
> converts a given time interval from the monotonic clock's perception to
> that of the raw monotonic clock by multiplying the value by f_mono / f_raw.
> Call that helper from clockevents_program_event() in order to obtain a
> suitable time interval to program the clockevent device with.
>
> Signed-off-by: Nicolai Stange <nicstange@gmail.com>
> ---
>  kernel/time/clockevents.c |  1 +
>  kernel/time/timekeeping.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/time/timekeeping.h |  1 +
>  3 files changed, 61 insertions(+)
>
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index a9b76a4..4bccf04 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -329,6 +329,7 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
>  		return dev->set_next_ktime(expires, dev);
>  
>  	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
> +	delta = timekeeping_mono_interval_to_raw(delta);
>  	if (delta <= 0)
>  		return force ? clockevents_program_min_delta(dev) : -ETIME;
>  
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index dcd5ce6..51dfbbb 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -23,6 +23,7 @@
>  #include <linux/stop_machine.h>
>  #include <linux/pvclock_gtod.h>
>  #include <linux/compiler.h>
> +#include <asm/div64.h>
>  
>  #include "tick-internal.h"
>  #include "ntp_internal.h"
> @@ -2133,6 +2134,64 @@ out:
>  }
>  
>  /**
> + * timekeeper_mono_interval_to_raw - Convert mono interval to raw's perception.
> + * @interval: Time interval as measured by the mono clock.
> + *
> + * Converts the given time interval as measured by the monotonic clock to
> + * what would have been measured by the raw monotonic clock in the meanwhile.
> + * The monotonic clock's frequency gets dynamically adjusted every now and then
> + * in order to compensate for the differences to NTP. OTOH, the clockevents
> + * devices are not affected by this adjustment, i.e. they keep ticking at some
> + * fixed hardware frequency which may be assumed to have a constant ratio to
> + * the fixed raw monotonic clock's frequency. This function provides a means
> + * to convert time intervals from the dynamic frequency monotonic clock to
> + * the fixed frequency hardware world.
> + *
> + * If interval < 0, zero is returned. If an overflow happens during the
> + * calculation, KTIME_MAX is returned.
> + */
> +s64 timekeeping_mono_interval_to_raw(s64 interval)
> +{
> +	struct timekeeper *tk = &tk_core.timekeeper;
> +	u32 raw_mult = tk->tkr_raw.mult, mono_mult = tk->tkr_mono.mult;
> +	u64 raw, tmp;
> +
> +	/* The overflow checks below can't deal with negative intervals. */
> +	if (interval <= 0)
> +		return 0;
> +
> +	/*
> +	 * Calculate
> +	 *   raw = f_mono / f_raw * interval
> +	 *       = (raw_mult / 2^raw_shift) / (mono_mult / 2^mono_shift)
> +	 *            * interval
> +	 * where f_mono and f_raw denote the frequencies of the monotonic
> +	 * and raw clock respectively.
> +	 *
> +	 * Note that the monotonic and raw clocks' shifts are equal and fixed,
> +	 * that is they cancel.
> +	 */
> +
> +	/* First, calculate interval * raw_mult while checking for overflow. */

After thinking further about this, I had to recognize that
  (raw_mult - mono_mult) * interval
is *much* less likely to overflow.

So, I'll send a v3 doing

  raw = interval + (((raw_mult - mono_mult) * interval) / mono_mult)

during the course of the day.



> +	raw = ((u64)interval >> 32) * raw_mult; /* Upper half of interval */
> +	if (raw >> 32)
> +		return KTIME_MAX;
> +	raw <<= 32;
> +	tmp = ((u64)interval & U32_MAX) * raw_mult; /* Lower half of interval */
> +	if (U64_MAX - raw < tmp)
> +		return KTIME_MAX;
> +	raw += tmp;
> +
> +	/* Finally, do raw /= mono_mult with proper rounding. */
> +	if (U64_MAX - raw < mono_mult / 2)
> +		return KTIME_MAX;
> +	raw += mono_mult / 2;
> +	do_div(raw, mono_mult);
> +
> +	return (s64)raw;
> +}
> +
> +/**
>   * getboottime64 - Return the real time of system boot.
>   * @ts:		pointer to the timespec64 to be set
>   *
> diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h
> index 704f595..40a0fa9 100644
> --- a/kernel/time/timekeeping.h
> +++ b/kernel/time/timekeeping.h
> @@ -18,6 +18,7 @@ extern void timekeeping_resume(void);
>  
>  extern void do_timer(unsigned long ticks);
>  extern void update_wall_time(void);
> +extern s64 timekeeping_mono_interval_to_raw(s64 interval);
>  
>  extern seqlock_t jiffies_lock;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency
  2016-07-11  6:32   ` Nicolai Stange
@ 2016-07-11  8:32     ` Thomas Gleixner
  2016-07-12 11:10       ` Nicolai Stange
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2016-07-11  8:32 UTC (permalink / raw)
  To: Nicolai Stange
  Cc: Ingo Molnar, H. Peter Anvin, x86, John Stultz, Borislav Petkov,
	Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel

On Mon, 11 Jul 2016, Nicolai Stange wrote:
> > +	raw = ((u64)interval >> 32) * raw_mult; /* Upper half of interval */
> > +	if (raw >> 32)
> > +		return KTIME_MAX;
> > +	raw <<= 32;
> > +	tmp = ((u64)interval & U32_MAX) * raw_mult; /* Lower half of interval */
> > +	if (U64_MAX - raw < tmp)
> > +		return KTIME_MAX;
> > +	raw += tmp;
> > +
> > +	/* Finally, do raw /= mono_mult with proper rounding. */
> > +	if (U64_MAX - raw < mono_mult / 2)
> > +		return KTIME_MAX;
> > +	raw += mono_mult / 2;
> > +	do_div(raw, mono_mult);
> > +
> > +	return (s64)raw;

That's a complete insanity. No way that we are going to do all this math in
the CE programming path.

If you want to address the issue, then you need to find a way to adjust the
mult/shift factors of the clock event device occasionally.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency
  2016-07-11  8:32     ` Thomas Gleixner
@ 2016-07-12 11:10       ` Nicolai Stange
  2016-07-12 15:04         ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Nicolai Stange @ 2016-07-12 11:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nicolai Stange, Ingo Molnar, H. Peter Anvin, x86, John Stultz,
	Borislav Petkov, Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel

Thomas Gleixner <tglx@linutronix.de> writes:

> On Mon, 11 Jul 2016, Nicolai Stange wrote:
>> > +	raw = ((u64)interval >> 32) * raw_mult; /* Upper half of interval */
>> > +	if (raw >> 32)
>> > +		return KTIME_MAX;
>> > +	raw <<= 32;
>> > +	tmp = ((u64)interval & U32_MAX) * raw_mult; /* Lower half of interval */
>> > +	if (U64_MAX - raw < tmp)
>> > +		return KTIME_MAX;
>> > +	raw += tmp;
>> > +
>> > +	/* Finally, do raw /= mono_mult with proper rounding. */
>> > +	if (U64_MAX - raw < mono_mult / 2)
>> > +		return KTIME_MAX;
>> > +	raw += mono_mult / 2;
>> > +	do_div(raw, mono_mult);
>> > +
>> > +	return (s64)raw;
>
> That's a complete insanity. No way that we are going to do all this math in
> the CE programming path.
>
> If you want to address the issue, then you need to find a way to adjust the
> mult/shift factors of the clock event device occasionally.

I tried adjusting the clock event device's ->mult, triggered by
timekeeping_apply_adjustment() and it works well.

I think that in order to avoid error accumulation, it is best not to do
any incremental updates to ->mult, but introduce a new ->mult_mono and
recalculate the latter from the former.

Now, the ->mult_mono needs to get updated when the driver updates
->mult. Certainly, hooking into clockevents_register_device() and
clockevents_update_freq() is the method of choice here. However,
there are a handful of drivers which set ->mult from
->set_state_oneshot() either by direct assignment or through
clockevents_config():
  drivers/clocksource/sh_cmt.c
  drivers/clocksource/sh_tmu.c
  drivers/clocksource/em_sti.c
  drivers/clocksource/h8300_timer8.c
Converting these to clockevents_update_freq() seems straightforward
though.


Another issue is that ->min_delta_ns and ->max_delta_ns are measured in
raw clock time while the delta in clockevents_program_event() would now
be interpreted as being in monotonic clock time:
  clc = ((unsigned long long) delta * dev->mult_mono) >> dev->shift;

Ideally, I'd like to get rid of ->min_delta_ns and ->max_delta_ns
alltogether and consistently use the ->min_delta_ticks and
->max_delta_ticks instead. AFAICS, ->min_delta_ns is really needed only
for setting dev->next_event in clockevents_program_min_delta().
dev->next_event is read only from __clockevents_update_freq() for
reprogramming purposes and thus, assuming 0 for ->delta_min_ns in
clockevents_program_min_delta() would probably work: a reprogramming
would invoke clockevents_program_min_delta() once again.

The downside of this approach is that a quick grep reveals 40 clockevent
device drivers whose initialization code would need to get touched in
order to convert them from min_delta_ns/max_delta_ns to
min_delta_ticks/max_delta_ticks.


So, the question is whether I should do all of this or whether the
doubled timer interrupts aren't annoying enough to justify such a big
change?


Thanks,

Nicolai

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency
  2016-07-12 11:10       ` Nicolai Stange
@ 2016-07-12 15:04         ` Thomas Gleixner
  2016-07-13 13:08           ` Nicolai Stange
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2016-07-12 15:04 UTC (permalink / raw)
  To: Nicolai Stange
  Cc: Ingo Molnar, H. Peter Anvin, x86, John Stultz, Borislav Petkov,
	Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel

On Tue, 12 Jul 2016, Nicolai Stange wrote:
> I tried adjusting the clock event device's ->mult, triggered by
> timekeeping_apply_adjustment() and it works well.
> 
> I think that in order to avoid error accumulation, it is best not to do
> any incremental updates to ->mult, but introduce a new ->mult_mono and
> recalculate the latter from the former.
> 
> Now, the ->mult_mono needs to get updated when the driver updates
> ->mult. Certainly, hooking into clockevents_register_device() and
> clockevents_update_freq() is the method of choice here. However,
> there are a handful of drivers which set ->mult from
> ->set_state_oneshot() either by direct assignment or through
> clockevents_config():
>   drivers/clocksource/sh_cmt.c
>   drivers/clocksource/sh_tmu.c
>   drivers/clocksource/em_sti.c
>   drivers/clocksource/h8300_timer8.c
> Converting these to clockevents_update_freq() seems straightforward
> though.

Ok.
 
> Another issue is that ->min_delta_ns and ->max_delta_ns are measured in
> raw clock time while the delta in clockevents_program_event() would now
> be interpreted as being in monotonic clock time:
>   clc = ((unsigned long long) delta * dev->mult_mono) >> dev->shift;

Does that really matter much?

> Ideally, I'd like to get rid of ->min_delta_ns and ->max_delta_ns
> alltogether and consistently use the ->min_delta_ticks and
> ->max_delta_ticks instead. AFAICS, ->min_delta_ns is really needed only
> for setting dev->next_event in clockevents_program_min_delta().
> dev->next_event is read only from __clockevents_update_freq() for
> reprogramming purposes and thus, assuming 0 for ->delta_min_ns in
> clockevents_program_min_delta() would probably work: a reprogramming
> would invoke clockevents_program_min_delta() once again.

I completely fail to parse the above paragraph. 
 
> The downside of this approach is that a quick grep reveals 40 clockevent
> device drivers whose initialization code would need to get touched in
> order to convert them from min_delta_ns/max_delta_ns to
> min_delta_ticks/max_delta_ticks.
> 
> So, the question is whether I should do all of this or whether the
> doubled timer interrupts aren't annoying enough to justify such a big
> change?

Can you provide an initial patch which does the adjustment w/o all the related
churn so we can see how intrusive that gets?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency
  2016-07-12 15:04         ` Thomas Gleixner
@ 2016-07-13 13:08           ` Nicolai Stange
  0 siblings, 0 replies; 10+ messages in thread
From: Nicolai Stange @ 2016-07-13 13:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nicolai Stange, Ingo Molnar, H. Peter Anvin, x86, John Stultz,
	Borislav Petkov, Paolo Bonzini, Viresh Kumar, Hidehiro Kawai,
	Peter Zijlstra (Intel),
	Christopher S. Hall, Adrian Hunter, linux-kernel

Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, 12 Jul 2016, Nicolai Stange wrote:
>> Another issue is that ->min_delta_ns and ->max_delta_ns are measured in
>> raw clock time while the delta in clockevents_program_event() would now
>> be interpreted as being in monotonic clock time:
>>   clc = ((unsigned long long) delta * dev->mult_mono) >> dev->shift;
>
> Does that really matter much?
>
>> Ideally, I'd like to get rid of ->min_delta_ns and ->max_delta_ns
>> alltogether and consistently use the ->min_delta_ticks and
>> ->max_delta_ticks instead. AFAICS, ->min_delta_ns is really needed only
>> for setting dev->next_event in clockevents_program_min_delta().
>> dev->next_event is read only from __clockevents_update_freq() for
>> reprogramming purposes and thus, assuming 0 for ->delta_min_ns in
>> clockevents_program_min_delta() would probably work: a reprogramming
>> would invoke clockevents_program_min_delta() once again.
>
> I completely fail to parse the above paragraph. 
>  
>> The downside of this approach is that a quick grep reveals 40 clockevent
>> device drivers whose initialization code would need to get touched in
>> order to convert them from min_delta_ns/max_delta_ns to
>> min_delta_ticks/max_delta_ticks.
>> 
>> So, the question is whether I should do all of this or whether the
>> doubled timer interrupts aren't annoying enough to justify such a big
>> change?
>
> Can you provide an initial patch which does the adjustment w/o all the related
> churn so we can see how intrusive that gets?

Please see the RFC tagged series at
  http://lkml.kernel.org/g/20160713130017.8202-1-nicstange@gmail.com
I tried to answer/address your above questions in the cover letter.

Note that I split the x86 TSC related patches off:
  http://lkml.kernel.org/g/20160713130344.8319-1-nicstange@gmail.com

Thanks,

Nicolai

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-07-13 13:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-10 19:30 [PATCH v2 0/4] avoid double timer interrupt with nohz and Intel TSC Nicolai Stange
2016-07-10 19:30 ` [PATCH v2 1/4] arch, x86, tsc deadline clockevent dev: reduce frequency roundoff error Nicolai Stange
2016-07-10 19:30 ` [PATCH v2 2/4] arch, x86, tsc deadline clockevent dev: reduce TSC_DIVISOR to 2 Nicolai Stange
2016-07-10 19:30 ` [PATCH v2 3/4] arch, x86, tsc: inform TSC deadline clockevent device about recalibration Nicolai Stange
2016-07-10 19:30 ` [PATCH v2 4/4] kernel/time/clockevents: compensate for monotonic clock's dynamic frequency Nicolai Stange
2016-07-11  6:32   ` Nicolai Stange
2016-07-11  8:32     ` Thomas Gleixner
2016-07-12 11:10       ` Nicolai Stange
2016-07-12 15:04         ` Thomas Gleixner
2016-07-13 13:08           ` Nicolai Stange

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).