linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC] timekeeping: Fix clock stability with nohz
@ 2013-11-14 14:50 Miroslav Lichvar
  2013-11-14 18:01 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Miroslav Lichvar @ 2013-11-14 14:50 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, Prarit Bhargava, Richard Cochran, Miroslav Lichvar

Since commit 5eb6d205 the system clock is kept separately from NTP time
and it is synchronized by adjusting its multiplier in a feedback loop.
This works well when the updates are done regularly. With nohz and idle
system, however, the loop becomes unstable at a certain update interval.
The loop overcorrects and the frequency doesn't settle down. The clock
has a large error, which seems to grow quadratically with update
interval.

If the constants used in the loop were modified for a maximum update
interval, it would be stable, but too slow to keep up with NTP. Without
knowing when will be the next update it's difficult to implement a loop
that is both fast and stable.

This patch fixes the problem by postponing update of NTP tick length in
the clock and setting the multiplier directly without feedback loop by
dividing the tick length with clock cycles. Previously, a change in tick
length was applied immediately to all ticks since last update, which
caused a jump in the NTP error proportional to the change and the update
interval and which had to be corrected later by the loop.

The only source of the accounted NTP error is now lack of resolution in
the clock multiplier. The error is corrected by adding 0 or 1 to the
calculated multiplier. With a constant tick length, the multiplier will
be switching between the two values. The loop is greatly simplified and
there is no problem with stability. The maximum error is limited by the
update interval.

In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
update the maximum error went down from 480 microseconds to 55
nanoseconds.

In a real test on idle machine comparing raw TSC and clock_gettime()
time stamps, the maximum error went down from microseconds to tens of
nanoseconds. A test with clock synchronized to a PTP hardware clock by
phc2sys from linuxptp now shows no difference when running with nohz
enabled and disabled, the clock seems to be stable to few tens of
nanoseconds.

TODO:
- add forced update_wall_time() and call it after adjtimex() to improve
  accuracy of phase adjustments done by quickly changing NTP frequency
- check if there are any issues with suspend

Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
---
 include/linux/timekeeper_internal.h |   4 +
 kernel/time/timekeeping.c           | 209 +++++-------------------------------
 2 files changed, 31 insertions(+), 182 deletions(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index c1825eb..b91ad4b 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -34,12 +34,16 @@ struct timekeeper {
 	/* Clock shifted nano seconds */
 	u64			xtime_nsec;
 
+	/* Tick used for calculation of NTP error. */
+	u64			ntp_tick;
 	/* Difference between accumulated time and NTP time in ntp
 	 * shifted nano seconds. */
 	s64			ntp_error;
 	/* Shift conversion between clock shifted nano seconds and
 	 * ntp shifted nano seconds. */
 	u32			ntp_error_shift;
+	/* Correction applied to mult to reduce the error. */
+	u32			mult_ntp_correction;
 
 	/*
 	 * wall_to_monotonic is what we need to add to xtime (or xtime corrected
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3abf534..6ee57f7 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -146,6 +146,9 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	 * to counteract clock drifting.
 	 */
 	tk->mult = clock->mult;
+	/* zero frequency offset */
+	tk->ntp_tick = tk->xtime_interval << tk->ntp_error_shift;
+	tk->mult_ntp_correction = 0;
 }
 
 /* Timekeeper helper functions. */
@@ -1048,200 +1051,42 @@ static int __init timekeeping_init_ops(void)
 device_initcall(timekeeping_init_ops);
 
 /*
- * If the error is already larger, we look ahead even further
- * to compensate for late or lost adjustments.
+ * Adjust the multiplier according to current NTP tick.
  */
-static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
-						 s64 error, s64 *interval,
-						 s64 *offset)
+static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 {
-	s64 tick_error, i;
-	u32 look_ahead, adj;
-	s32 error2, mult;
+	u32 new_mult;
 
-	/*
-	 * Use the current error value to determine how much to look ahead.
-	 * The larger the error the slower we adjust for it to avoid problems
-	 * with losing too many ticks, otherwise we would overadjust and
-	 * produce an even larger error.  The smaller the adjustment the
-	 * faster we try to adjust for it, as lost ticks can do less harm
-	 * here.  This is tuned so that an error of about 1 msec is adjusted
-	 * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
-	 */
-	error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
-	error2 = abs(error2);
-	for (look_ahead = 0; error2 > 0; look_ahead++)
-		error2 >>= 2;
+	/* Avoid division if the tick didn't change */
+	if (tk->ntp_tick == ntp_tick_length()) {
+		new_mult = tk->mult - tk->mult_ntp_correction;
+	} else {
+		tk->ntp_tick = ntp_tick_length();
+		new_mult = div64_u64(tk->ntp_tick, tk->cycle_interval) >>
+			tk->ntp_error_shift;
+	}
 
 	/*
-	 * Now calculate the error in (1 << look_ahead) ticks, but first
-	 * remove the single look ahead already included in the error.
+	 * Speed the clock up by 1 if it's behind NTP time. If there is no
+	 * remainder from the tick division, the clock will stay ahead of
+	 * NTP time until a non-divisible tick is encountered.
 	 */
-	tick_error = ntp_tick_length() >> (tk->ntp_error_shift + 1);
-	tick_error -= tk->xtime_interval >> 1;
-	error = ((error - tick_error) >> look_ahead) + tick_error;
-
-	/* Finally calculate the adjustment shift value.  */
-	i = *interval;
-	mult = 1;
-	if (error < 0) {
-		error = -error;
-		*interval = -*interval;
-		*offset = -*offset;
-		mult = -1;
-	}
-	for (adj = 0; error > i; adj++)
-		error >>= 1;
+	tk->mult_ntp_correction = tk->ntp_error > 0 ? 1 : 0;
+	new_mult += tk->mult_ntp_correction;
 
-	*interval <<= adj;
-	*offset <<= adj;
-	return mult << adj;
-}
-
-/*
- * Adjust the multiplier to reduce the error value,
- * this is optimized for the most common adjustments of -1,0,1,
- * for other values we can do a bit more work.
- */
-static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
-{
-	s64 error, interval = tk->cycle_interval;
-	int adj;
+	if (new_mult == tk->mult)
+		return;
 
-	/*
-	 * The point of this is to check if the error is greater than half
-	 * an interval.
-	 *
-	 * First we shift it down from NTP_SHIFT to clocksource->shifted nsecs.
-	 *
-	 * Note we subtract one in the shift, so that error is really error*2.
-	 * This "saves" dividing(shifting) interval twice, but keeps the
-	 * (error > interval) comparison as still measuring if error is
-	 * larger than half an interval.
-	 *
-	 * Note: It does not "save" on aggravation when reading the code.
-	 */
-	error = tk->ntp_error >> (tk->ntp_error_shift - 1);
-	if (error > interval) {
-		/*
-		 * We now divide error by 4(via shift), which checks if
-		 * the error is greater than twice the interval.
-		 * If it is greater, we need a bigadjust, if its smaller,
-		 * we can adjust by 1.
-		 */
-		error >>= 2;
-		/*
-		 * XXX - In update_wall_time, we round up to the next
-		 * nanosecond, and store the amount rounded up into
-		 * the error. This causes the likely below to be unlikely.
-		 *
-		 * The proper fix is to avoid rounding up by using
-		 * the high precision tk->xtime_nsec instead of
-		 * xtime.tv_nsec everywhere. Fixing this will take some
-		 * time.
-		 */
-		if (likely(error <= interval))
-			adj = 1;
-		else
-			adj = timekeeping_bigadjust(tk, error, &interval, &offset);
-	} else {
-		if (error < -interval) {
-			/* See comment above, this is just switched for the negative */
-			error >>= 2;
-			if (likely(error >= -interval)) {
-				adj = -1;
-				interval = -interval;
-				offset = -offset;
-			} else {
-				adj = timekeeping_bigadjust(tk, error, &interval, &offset);
-			}
-		} else {
-			goto out_adjust;
-		}
-	}
+	tk->mult = new_mult;
+	tk->xtime_interval = tk->cycle_interval * tk->mult;
 
 	if (unlikely(tk->clock->maxadj &&
-		(tk->mult + adj > tk->clock->mult + tk->clock->maxadj))) {
+		(tk->mult > tk->clock->mult + tk->clock->maxadj))) {
 		printk_once(KERN_WARNING
 			"Adjusting %s more than 11%% (%ld vs %ld)\n",
-			tk->clock->name, (long)tk->mult + adj,
+			tk->clock->name, (long)tk->mult,
 			(long)tk->clock->mult + tk->clock->maxadj);
 	}
-	/*
-	 * So the following can be confusing.
-	 *
-	 * To keep things simple, lets assume adj == 1 for now.
-	 *
-	 * When adj != 1, remember that the interval and offset values
-	 * have been appropriately scaled so the math is the same.
-	 *
-	 * The basic idea here is that we're increasing the multiplier
-	 * by one, this causes the xtime_interval to be incremented by
-	 * one cycle_interval. This is because:
-	 *	xtime_interval = cycle_interval * mult
-	 * So if mult is being incremented by one:
-	 *	xtime_interval = cycle_interval * (mult + 1)
-	 * Its the same as:
-	 *	xtime_interval = (cycle_interval * mult) + cycle_interval
-	 * Which can be shortened to:
-	 *	xtime_interval += cycle_interval
-	 *
-	 * So offset stores the non-accumulated cycles. Thus the current
-	 * time (in shifted nanoseconds) is:
-	 *	now = (offset * adj) + xtime_nsec
-	 * Now, even though we're adjusting the clock frequency, we have
-	 * to keep time consistent. In other words, we can't jump back
-	 * in time, and we also want to avoid jumping forward in time.
-	 *
-	 * So given the same offset value, we need the time to be the same
-	 * both before and after the freq adjustment.
-	 *	now = (offset * adj_1) + xtime_nsec_1
-	 *	now = (offset * adj_2) + xtime_nsec_2
-	 * So:
-	 *	(offset * adj_1) + xtime_nsec_1 =
-	 *		(offset * adj_2) + xtime_nsec_2
-	 * And we know:
-	 *	adj_2 = adj_1 + 1
-	 * So:
-	 *	(offset * adj_1) + xtime_nsec_1 =
-	 *		(offset * (adj_1+1)) + xtime_nsec_2
-	 *	(offset * adj_1) + xtime_nsec_1 =
-	 *		(offset * adj_1) + offset + xtime_nsec_2
-	 * Canceling the sides:
-	 *	xtime_nsec_1 = offset + xtime_nsec_2
-	 * Which gives us:
-	 *	xtime_nsec_2 = xtime_nsec_1 - offset
-	 * Which simplfies to:
-	 *	xtime_nsec -= offset
-	 *
-	 * XXX - TODO: Doc ntp_error calculation.
-	 */
-	tk->mult += adj;
-	tk->xtime_interval += interval;
-	tk->xtime_nsec -= offset;
-	tk->ntp_error -= (interval - offset) << tk->ntp_error_shift;
-
-out_adjust:
-	/*
-	 * It may be possible that when we entered this function, xtime_nsec
-	 * was very small.  Further, if we're slightly speeding the clocksource
-	 * in the code above, its possible the required corrective factor to
-	 * xtime_nsec could cause it to underflow.
-	 *
-	 * Now, since we already accumulated the second, cannot simply roll
-	 * the accumulated second back, since the NTP subsystem has been
-	 * notified via second_overflow. So instead we push xtime_nsec forward
-	 * by the amount we underflowed, and add that amount into the error.
-	 *
-	 * We'll correct this error next time through this function, when
-	 * xtime_nsec is not as small.
-	 */
-	if (unlikely((s64)tk->xtime_nsec < 0)) {
-		s64 neg = -(s64)tk->xtime_nsec;
-		tk->xtime_nsec = 0;
-		tk->ntp_error += neg << tk->ntp_error_shift;
-	}
-
 }
 
 /**
@@ -1321,7 +1166,7 @@ static cycle_t logarithmic_accumulation(struct timekeeper *tk, cycle_t offset,
 	tk->raw_time.tv_nsec = raw_nsecs;
 
 	/* Accumulate error between NTP and clock interval */
-	tk->ntp_error += ntp_tick_length() << shift;
+	tk->ntp_error += tk->ntp_tick << shift;
 	tk->ntp_error -= (tk->xtime_interval + tk->xtime_remainder) <<
 						(tk->ntp_error_shift + shift);
 
@@ -1406,7 +1251,7 @@ static void update_wall_time(void)
 			shift--;
 	}
 
-	/* correct the clock when NTP error is too big */
+	/* Update the multiplier */
 	timekeeping_adjust(tk, offset);
 
 	/*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-11-14 14:50 [PATCH RFC] timekeeping: Fix clock stability with nohz Miroslav Lichvar
@ 2013-11-14 18:01 ` Rik van Riel
  2013-11-16  7:03 ` Richard Cochran
  2013-11-18 20:46 ` John Stultz
  2 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2013-11-14 18:01 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: John Stultz, linux-kernel, Prarit Bhargava, Richard Cochran

On 11/14/2013 09:50 AM, Miroslav Lichvar wrote:
> Since commit 5eb6d205 the system clock is kept separately from NTP time
> and it is synchronized by adjusting its multiplier in a feedback loop.
> This works well when the updates are done regularly. With nohz and idle
> system, however, the loop becomes unstable at a certain update interval.
> The loop overcorrects and the frequency doesn't settle down. The clock
> has a large error, which seems to grow quadratically with update
> interval.

> In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
> error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
> update the maximum error went down from 480 microseconds to 55
> nanoseconds.
>
> In a real test on idle machine comparing raw TSC and clock_gettime()
> time stamps, the maximum error went down from microseconds to tens of
> nanoseconds. A test with clock synchronized to a PTP hardware clock by
> phc2sys from linuxptp now shows no difference when running with nohz
> enabled and disabled, the clock seems to be stable to few tens of
> nanoseconds.

Looks like a big improvement to me.

Also very useful for virtual machines, which have no
good control over when the timekeeping routines will
run, but which can see what time it is when they do
run...

> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-11-14 14:50 [PATCH RFC] timekeeping: Fix clock stability with nohz Miroslav Lichvar
  2013-11-14 18:01 ` Rik van Riel
@ 2013-11-16  7:03 ` Richard Cochran
  2013-11-18 21:28   ` John Stultz
  2013-11-18 20:46 ` John Stultz
  2 siblings, 1 reply; 19+ messages in thread
From: Richard Cochran @ 2013-11-16  7:03 UTC (permalink / raw)
  To: Miroslav Lichvar; +Cc: John Stultz, linux-kernel, Prarit Bhargava

On Thu, Nov 14, 2013 at 03:50:40PM +0100, Miroslav Lichvar wrote:

>  include/linux/timekeeper_internal.h |   4 +
>  kernel/time/timekeeping.c           | 209 +++++-------------------------------
>  2 files changed, 31 insertions(+), 182 deletions(-)

This looks like an impressive simplification...

> -	 * So the following can be confusing.

Yep.

So I really have no idea how the deleted code worked (or didn't work
for nohz), but I can confirm that nohz time keeping is broken under
light system load. Running a high load (like recompiling the kernel on
all cores for CONFIG_HZ_PERIODIC=y ;) hides the issue, but that is
obviously not the right solution.

Out of my ignorance, two questions spring to mind.

1. Considering the simplicity of Miroslav's patch, what was the
   benefit of the much more complicated code in the first place?

2. Does this patch work in the CONFIG_HZ_PERIODIC case just as well as
   the deleted code?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-11-14 14:50 [PATCH RFC] timekeeping: Fix clock stability with nohz Miroslav Lichvar
  2013-11-14 18:01 ` Rik van Riel
  2013-11-16  7:03 ` Richard Cochran
@ 2013-11-18 20:46 ` John Stultz
  2013-11-20 18:39   ` Miroslav Lichvar
  2 siblings, 1 reply; 19+ messages in thread
From: John Stultz @ 2013-11-18 20:46 UTC (permalink / raw)
  To: Miroslav Lichvar; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On 11/14/2013 06:50 AM, Miroslav Lichvar wrote:
> Since commit 5eb6d205 the system clock is kept separately from NTP time
> and it is synchronized by adjusting its multiplier in a feedback loop.
> This works well when the updates are done regularly. With nohz and idle
> system, however, the loop becomes unstable at a certain update interval.
> The loop overcorrects and the frequency doesn't settle down. The clock
> has a large error, which seems to grow quadratically with update
> interval.
>
> If the constants used in the loop were modified for a maximum update
> interval, it would be stable, but too slow to keep up with NTP. Without
> knowing when will be the next update it's difficult to implement a loop
> that is both fast and stable.

So yes, trying to sort out how much of an adjustment to make when you
don't know how long the next interval will be is a difficult problem to
solve.

I'm definitely interested in ways to improve this. That said, there are
some major subtleties in the code you're removing. I'm never been a huge
fan of the code in question, but I did manage to sit down at one point
with some paper and see the design was right, its designed to be quite
efficient and has been mostly left alone for a number of years, so its
reasonably time tested (at least for the requirements it was designed
for). That said, its terribly opaque code, so I usually have to to
re-establish that proof to myself every time I look at it in depth. :)

So quite a bit of care and caution will be needed.

Also, I know you've had to have spent quite a bit of time researching
this issue and coming up with the solution, so I want to be clear I
really appreciate that! But if we are to replace this core logic, I'll
want to make sure I throughly understand your solution, so forgive me if
I have to ask quite a number of stupid sounding (or just stupid)
questions to clarify things and I'll likely propose alternative
solutions that may seem/be bone headed or overly cautious.  So please
bear with me through this. :)

> This patch fixes the problem by postponing update of NTP tick length in
> the clock and setting the multiplier directly without feedback loop by
> dividing the tick length with clock cycles. Previously, a change in tick
> length was applied immediately to all ticks since last update, which
> caused a jump in the NTP error proportional to the change and the update
> interval and which had to be corrected later by the loop.

Hmm. Reading this, but not having studied your patch in depth, is
interesting. It originally was that we only applied any NTP adjustment
to future changes. Also, since at that time the tick length was only
changed on the second overflow, we ran into some problems, since there
were some applications using the ntp interface to make very small
adjustments for only a few ms. Thus we changed (see fdcedf7b75808 -
time: apply NTP frequency/tick changes immediately) it to adjust the
tick length immediately.

If this is the core issue, then maybe would revisiting that change alone
improve things?

Also I'm not sure if I quite follow the bit about "a change in the tick
was applied immediately to all ticks since the last update", since at
least with classic nohz, if someone is making a change to the ntp tick
length, we're not idle, so we should still be getting regular ticks. So
at most, it should be the modified ntp tick length is immediately
applied to the current  tick in progress, as well as following ticks.
Now with nohz_full this is definitely more complicated, but I want to
make sure I really understand what you're seeing. So are the issues
you're seeing w/ classic idle nohz or nohz full?


One of the issues I've recently started to worry about with the existing
code, is that the ntp_error value can really only hold about 2 seconds
worth of error. Now, that's the internal loop adjustment error (ie: the
error from what ntpd specified the kernel to be and where the kernel is
which should be very small), not the actual drift from the ntp server,
so it really shouldn't be problematic, but as we do get into *very*
large nohz idle values, it seems possible that we could accumulate more
then 2 seconds and see overflows of ntp_error, which could cause
instability.  So I'm curious if you were seeing anything like this, or
at least ask if you've already eliminated this as a potential source of
the problem you were seeing.

> The only source of the accounted NTP error is now lack of resolution in
> the clock multiplier. The error is corrected by adding 0 or 1 to the
> calculated multiplier. With a constant tick length, the multiplier will
> be switching between the two values. The loop is greatly simplified and
> there is no problem with stability. The maximum error is limited by the
> update interval.

Hrmm.. So I wonder how well this works with fairly coarse grained
clocksources? Quite a bit of the previous design (which is still looks
mostly in place with a superficial review of your patch) is focused on
keeping our long-term accuracy correct via the internally managed
feedback loop, even if we had a very coarse clocksource shift value. 
While improving the nohz performance is desired, I'm not sure we want to
also throw away that managed precision.

Also, if the ntp_mult_correction is always 0 or 1, (which handles if
we're too slow) how does it handle if we're running too fast?

Apologies if I'm missing something in your code.


> In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
> error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
> update the maximum error went down from 480 microseconds to 55
> nanoseconds.

Curious what sort of a environment you're using for simulation (as well
as the real test below)?

Also just want to clarify what/how you're measuring when you say
"maximum error".


> In a real test on idle machine comparing raw TSC and clock_gettime()
> time stamps, the maximum error went down from microseconds to tens of
> nanoseconds. A test with clock synchronized to a PTP hardware clock by
> phc2sys from linuxptp now shows no difference when running with nohz
> enabled and disabled, the clock seems to be stable to few tens of
> nanoseconds.

So this all sounds great!

I'm currently running your patch through some basic sanity checks in a
VM and its looking ok so far, so that's promising.
(my tests are here, if you want to run them as well:
https://github.com/johnstultz-work/timetests.git )

There's a few spots where I'm a bit concerned we're missing some
replacement for some of the accounting you removed (you don't seem to be
adjusting the non-accumulated offset value when modifying the
multiplier), and the division in the hotpath is a bit concerning (but
maybe that can be pre-computed in another path or approximated into
shifts?).

I'll want to reproduce your results, so let me know about your testing
methods. I may also want to generate some more intricate tests so we can
really see the if long-term precision and stability is really as
unaffected you make it sound.

But this is all very interesting! Thanks again for digging into this
issue and sending the patch!

thanks
-john

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-11-16  7:03 ` Richard Cochran
@ 2013-11-18 21:28   ` John Stultz
  2013-11-19 14:13     ` Richard Cochran
  2013-11-21 10:12     ` Miroslav Lichvar
  0 siblings, 2 replies; 19+ messages in thread
From: John Stultz @ 2013-11-18 21:28 UTC (permalink / raw)
  To: Richard Cochran, Miroslav Lichvar; +Cc: linux-kernel, Prarit Bhargava

On 11/15/2013 11:03 PM, Richard Cochran wrote:
> On Thu, Nov 14, 2013 at 03:50:40PM +0100, Miroslav Lichvar wrote:
>
>>  include/linux/timekeeper_internal.h |   4 +
>>  kernel/time/timekeeping.c           | 209 +++++-------------------------------
>>  2 files changed, 31 insertions(+), 182 deletions(-)
> This looks like an impressive simplification...

Indeed!

>
>> -	 * So the following can be confusing.
> Yep.
>
> So I really have no idea how the deleted code worked (or didn't work
> for nohz), but I can confirm that nohz time keeping is broken under
> light system load. Running a high load (like recompiling the kernel on
> all cores for CONFIG_HZ_PERIODIC=y ;) hides the issue, but that is
> obviously not the right solution.
Thanks for confirming the issue, and yes, forcing periodic hz isn't the
right solution.

That said there is a bit of a tension between very long nohz periods and
very accurate ntp syncing. Its a bit like sleeping at the wheel: you can
either get your rest, or stay on the road. :)  

It may be that we need to feed some of the ntp error state into the
tick-scheduling logic, so we don't try to sleep very long when we know
we're on a curvy bend.

Also, just for clarity's sake, when you're seeing things "broken",
curious how/what you are measuring?

Also is this brokenness something that has been around for awhile for
you or more recently cropped up?  I'm wondering as nohz idle has been
around for quite a few years and I've not seen too many complaints. So
I'm wondering if I'm looking in the right places, or if something has
regressed recently, or if its just that accuracy expectations have gone up?


> Out of my ignorance, two questions spring to mind.
>
> 1. Considering the simplicity of Miroslav's patch, what was the
>    benefit of the much more complicated code in the first place?

The much more complicated code was designed by Roman quite awhile back
(2006-ish). He was extremely bright, but produced very opaque code. It
made it very difficult to review, but once you sat down for awhile and
worked out the math, it ended up being correct and very efficient.

His concern was mostly about making the timer interrupt very fast on
very slow hardware (m68k was the architecture he co-maintained at the
time). So the code is very optimized for steady state, where the
adjustment value is 1,0, or -1. And we avoid doing any expensive math
operations (no multiplies, no divides), even in the non-steady state path.

Now it was designed before nohz was common (and back when nohz for a
full second was considered a aggresive length), so I could very well
believe that it has been stretched past its limit.  So I think your
point above about understanding exactly how it doesn't work for nohz is key.

Now, there is some parts of his look-ahead logic that I never quite
understood the rational for (see the top part of timekeeping_bigadjust).
At the time it was designed, I preferred a more PID-like approach, but
it was more computationally expensive and I couldn't measure any benefit
to my approach over his, so that part stuck around.


> 2. Does this patch work in the CONFIG_HZ_PERIODIC case just as well as
>    the deleted code?

As I mentioned in my other email, I'm a little concerned about some of
the accounting that is removed. At a blackbox level, its not handling
all the same values, so I suspect there's some issues here, but they may
not be very significant or difficult to fix.

So while I'm very cautious to throwing out the complex code which has
worked relatively well for quite awhile, I am very interested in coming
up with a solution that is easier to understand and validate as correct.

thanks
-john

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-11-18 21:28   ` John Stultz
@ 2013-11-19 14:13     ` Richard Cochran
  2013-11-27 10:07       ` Richard Cochran
  2013-11-21 10:12     ` Miroslav Lichvar
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Cochran @ 2013-11-19 14:13 UTC (permalink / raw)
  To: John Stultz; +Cc: Miroslav Lichvar, linux-kernel, Prarit Bhargava

On Mon, Nov 18, 2013 at 01:28:07PM -0800, John Stultz wrote:
 
> Also, just for clarity's sake, when you're seeing things "broken",
> curious how/what you are measuring?

Here is the background for this issue. The linuxptp stack has a
program called phc2sys whose purpose is to synchronize the Linux
system clock with a PTP hardware clock typically residing on a PCIe
card. This program uses the PTP_SYS_OFFSET ioctl to read the clocks.

One user on the list was complaining that the reported time and
frequency errors were unacceptably large. After vainly trying
different PI weights, we found out that the poor performance is
due to the nohz setting in the kernel.

See below for example logs.
 
> Also is this brokenness something that has been around for awhile for
> you or more recently cropped up?  I'm wondering as nohz idle has been
> around for quite a few years and I've not seen too many complaints. So
> I'm wondering if I'm looking in the right places, or if something has
> regressed recently, or if its just that accuracy expectations have gone up?

I have always avoided nohz like the plague. IIRC, it would not even
compile on ARM for the longest time. In any case, I really don't know
when this issue appeared. At some point, nohz became the Kconfig
default, and I did not notice, and so now I had it enabled by
accident.

Here are two sample runs of phc2sys. The PHC is an Intel i210 PCIe
card. I simply set the PHC time to the Linux system time (using
testptp -s), and then let PHC clock run free. During the test, the
phc2sys program is the only active program, and the system is
otherwise idle.

In this test, the update rate is once per second. When using longer
intervals, the problem becomes worse.

In the listings, "sys offset" is measured in nanosecond using the
PTP_SYS_OFFSET ioctl, "freq" is the PI servo output in ppb, and
"delay" is the time it took to read the two clocks in nanoseconds.

* Periodic Case

  CONFIG_HZ_PERIODIC=y
  CONFIG_NO_HZ=y
  CONFIG_HZ_1000=y

  Here we observe a time error of about 100 nanoseconds peak to peak
  and a frequency error within about .2 PPM.

sudo ./phc2sys -s eth3 -m -q -l7 -O0
phc2sys[5050.678]: PI servo: sync interval 1.000 kp 0.700 ki 0.300000
phc2sys[5051.678]: sys offset    287239 s0 freq      +0 delay   5164
phc2sys[5052.678]: sys offset    303841 s1 freq  +16600 delay   5161
phc2sys[5053.679]: sys offset        11 s2 freq  +16611 delay   5184
phc2sys[5054.679]: sys offset        -6 s2 freq  +16597 delay   5191
phc2sys[5055.679]: sys offset        -6 s2 freq  +16595 delay   5121
phc2sys[5056.679]: sys offset       -32 s2 freq  +16567 delay   5184
phc2sys[5057.679]: sys offset        -6 s2 freq  +16584 delay   5197
phc2sys[5058.679]: sys offset        10 s2 freq  +16598 delay   5186
phc2sys[5059.679]: sys offset        14 s2 freq  +16605 delay   5162
phc2sys[5060.680]: sys offset       -12 s2 freq  +16583 delay   5212
phc2sys[5061.680]: sys offset       -15 s2 freq  +16576 delay   5196
phc2sys[5062.680]: sys offset        -4 s2 freq  +16583 delay   5186
phc2sys[5063.680]: sys offset        21 s2 freq  +16607 delay   5187
phc2sys[5064.680]: sys offset         5 s2 freq  +16597 delay   5196
phc2sys[5065.680]: sys offset        -1 s2 freq  +16593 delay   5192
phc2sys[5066.680]: sys offset        13 s2 freq  +16606 delay   5177
phc2sys[5067.680]: sys offset         5 s2 freq  +16602 delay   5212
phc2sys[5068.681]: sys offset        11 s2 freq  +16610 delay   5191
phc2sys[5069.681]: sys offset       -19 s2 freq  +16583 delay   5203
phc2sys[5070.681]: sys offset        -3 s2 freq  +16593 delay   5185
phc2sys[5071.681]: sys offset         9 s2 freq  +16604 delay   5197
phc2sys[5072.681]: sys offset         4 s2 freq  +16602 delay   5176
phc2sys[5073.681]: sys offset         7 s2 freq  +16606 delay   5196
phc2sys[5074.681]: sys offset        -6 s2 freq  +16595 delay   5186
phc2sys[5075.681]: sys offset       -28 s2 freq  +16572 delay   5192
phc2sys[5076.682]: sys offset        48 s2 freq  +16639 delay   5214

* NO_HZ 

  CONFIG_NO_HZ_COMMON=y
  CONFIG_NO_HZ_IDLE=y
  CONFIG_NO_HZ=y
  CONFIG_RCU_FAST_NO_HZ=y
  CONFIG_HZ_250=y

  Here we observe a time error of about 3 microseconds peak to peak
  and a frequency error within about 3 PPM.

cori@cher1293:~/git/linuxptp$ sudo ./phc2sys -s eth3 -m -q -l7 -O0
phc2sys[105.837]: PI servo: sync interval 1.000 kp 0.700 ki 0.300000
phc2sys[106.837]: sys offset    135052 s0 freq      +0 delay   5189
phc2sys[107.837]: sys offset    151449 s1 freq  +16394 delay   5174
phc2sys[108.837]: sys offset       246 s2 freq  +16640 delay   5179
phc2sys[109.838]: sys offset      -185 s2 freq  +16283 delay   5171
phc2sys[110.838]: sys offset      -552 s2 freq  +15860 delay   5192
phc2sys[111.838]: sys offset      1383 s2 freq  +17630 delay   5179
phc2sys[112.838]: sys offset      -737 s2 freq  +15925 delay   5158
phc2sys[113.838]: sys offset      1147 s2 freq  +17587 delay   5171
phc2sys[114.839]: sys offset     -1593 s2 freq  +15192 delay   5186
phc2sys[115.839]: sys offset     -1006 s2 freq  +15301 delay   5161
phc2sys[116.839]: sys offset      1157 s2 freq  +17162 delay   5191
phc2sys[117.839]: sys offset       410 s2 freq  +16762 delay   5179
phc2sys[118.839]: sys offset         0 s2 freq  +16475 delay   5184
phc2sys[119.839]: sys offset        20 s2 freq  +16495 delay   5159
phc2sys[120.840]: sys offset       473 s2 freq  +16954 delay   5172
phc2sys[121.840]: sys offset      -690 s2 freq  +15933 delay   5186
phc2sys[122.840]: sys offset      -437 s2 freq  +15979 delay   5186
phc2sys[123.840]: sys offset      -229 s2 freq  +16056 delay   5179
phc2sys[124.840]: sys offset       357 s2 freq  +16573 delay   5179
phc2sys[125.840]: sys offset       285 s2 freq  +16608 delay   5121
phc2sys[126.841]: sys offset       512 s2 freq  +16921 delay   5179
phc2sys[127.841]: sys offset      -944 s2 freq  +15618 delay   5179
phc2sys[128.841]: sys offset       338 s2 freq  +16617 delay   5158
phc2sys[129.841]: sys offset      1275 s2 freq  +17655 delay   5171
phc2sys[130.841]: sys offset      -198 s2 freq  +16565 delay   5339
phc2sys[131.841]: sys offset      -702 s2 freq  +16002 delay   5179
phc2sys[132.842]: sys offset      -664 s2 freq  +15829 delay   5101
phc2sys[133.842]: sys offset       -38 s2 freq  +16256 delay   5184
phc2sys[134.842]: sys offset       663 s2 freq  +16945 delay   5164
phc2sys[135.842]: sys offset     -1066 s2 freq  +15415 delay   5187
phc2sys[136.842]: sys offset      1202 s2 freq  +17363 delay   5172
phc2sys[137.842]: sys offset       289 s2 freq  +16811 delay   5161
phc2sys[138.843]: sys offset     -1495 s2 freq  +15114 delay   5154
phc2sys[139.843]: sys offset       877 s2 freq  +17037 delay   5179
phc2sys[140.843]: sys offset      -311 s2 freq  +16112 delay   5186
phc2sys[141.843]: sys offset      1509 s2 freq  +17839 delay   5179

Thanks,
Richard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-11-18 20:46 ` John Stultz
@ 2013-11-20 18:39   ` Miroslav Lichvar
  2013-12-03  0:53     ` John Stultz
  0 siblings, 1 reply; 19+ messages in thread
From: Miroslav Lichvar @ 2013-11-20 18:39 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On Mon, Nov 18, 2013 at 12:46:00PM -0800, John Stultz wrote:
> Hmm. Reading this, but not having studied your patch in depth, is
> interesting. It originally was that we only applied any NTP adjustment
> to future changes. Also, since at that time the tick length was only
> changed on the second overflow, we ran into some problems, since there
> were some applications using the ntp interface to make very small
> adjustments for only a few ms. Thus we changed (see fdcedf7b75808 -
> time: apply NTP frequency/tick changes immediately) it to adjust the
> tick length immediately.
> 
> If this is the core issue, then maybe would revisiting that change alone
> improve things?

No, this doesn't seem to be related to the problem with stability.
Applying changes in tick length immediately is definitely a useful
feature. One item in my todo list for this patch is to make such phase
adjustments more accurate by forcing the timekeeper update immediately
after adjtimex().

> Also I'm not sure if I quite follow the bit about "a change in the tick
> was applied immediately to all ticks since the last update", since at
> least with classic nohz, if someone is making a change to the ntp tick
> length, we're not idle, so we should still be getting regular ticks. So
> at most, it should be the modified ntp tick length is immediately
> applied to the current  tick in progress, as well as following ticks.

I meant the tick length used in the NTP error calculation
in the logarithmic accumulation function (tk->ntp_error +=
ntp_tick_length() << shift). The code calls second_overflow() before
the NTP error is updated, which means the NTP error will jump by the
difference in the tick length multiplied by number of accumulated
ticks in that step and that's extra work for the loop to correct. This
must not happen in the design I'm proposing.

Probably a stupid question, but can adjtimex() or clock_gettime() be
called before the clock is updated after an idle period? I'm wondering
if this assumption would allow the timekeeper to correct the NTP error
simply by stepping the clock.

> Now with nohz_full this is definitely more complicated, but I want to
> make sure I really understand what you're seeing. So are the issues
> you're seeing w/ classic idle nohz or nohz full?

I was using only kernels with classic idle nohz.

> One of the issues I've recently started to worry about with the existing
> code, is that the ntp_error value can really only hold about 2 seconds
> worth of error. Now, that's the internal loop adjustment error (ie: the
> error from what ntpd specified the kernel to be and where the kernel is
> which should be very small), not the actual drift from the ntp server,
> so it really shouldn't be problematic, but as we do get into *very*
> large nohz idle values, it seems possible that we could accumulate more
> then 2 seconds and see overflows of ntp_error, which could cause
> instability.  So I'm curious if you were seeing anything like this, or
> at least ask if you've already eliminated this as a potential source of
> the problem you were seeing.

That doesn't seem to be related to the problem I'm seeing. With 1Hz
updates I think ntp_error was always in microseconds or milliseconds
at most.

> > The only source of the accounted NTP error is now lack of resolution in
> > the clock multiplier. The error is corrected by adding 0 or 1 to the
> > calculated multiplier. With a constant tick length, the multiplier will
> > be switching between the two values. The loop is greatly simplified and
> > there is no problem with stability. The maximum error is limited by the
> > update interval.
> 
> Hrmm.. So I wonder how well this works with fairly coarse grained
> clocksources? Quite a bit of the previous design (which is still looks
> mostly in place with a superficial review of your patch) is focused on
> keeping our long-term accuracy correct via the internally managed
> feedback loop, even if we had a very coarse clocksource shift value. 
> While improving the nohz performance is desired, I'm not sure we want to
> also throw away that managed precision.

There shouldn't be any regressions in the long-term frequency
accuracy. Short-term frequency accuracy should be better as the bug is
fixed :). 

However, the PLL/FLL and singleshot phase adjustments will be possibly
even less accurate as the whole accumulated interval will use only one
tick length. In the current code, the adjustments could be made
accurate by splitting the accumulated interval at start of second and
updating NTP error for each part with the right tick length. In the
design I'm proposing that will no longer be possible.

> Also, if the ntp_mult_correction is always 0 or 1, (which handles if
> we're too slow) how does it handle if we're running too fast?

If tick length is not divisible by cycles, with zero
ntp_mult_correction the clock will be running slower. If it is
divisible, the clock will be running at the exact same rate and stay
slightly ahead. How much it will be ahead depends on how long was the
last interval with correction of 1. It could use -1 to slow the clock
down in this case, but I'm not sure it's worth the extra frequency
error.

> > In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
> > error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
> > update the maximum error went down from 480 microseconds to 55
> > nanoseconds.
> 
> Curious what sort of a environment you're using for simulation (as well
> as the real test below)?

I compile the kernel timekeeping.c file into a test program where a
fake TSC is provided to the kernel code and I can easily control the
tick length and timekeeper updates. The program collects pairs of
TSC/getnstimeofday() values and then it runs linear regression through
the points to see the frequency offset and how stable the clock was.

http://mlichvar.fedorapeople.org/tmp/tk_test.tar.gz

The real test uses the system TSC and clock_gettime() instead. The
clock needs to be exercised a bit to disturb the tick adjusting code.
Setting a non-zero frequency offset with ntptime(8) or adjtimex(8) or
running ntpd shortly and resetting the PLL offset seems to trigger the
problem reliably.

http://mlichvar.fedorapeople.org/tmp/tstsc.c

> Also just want to clarify what/how you're measuring when you say
> "maximum error".

It's the distance from the furthest point to the regression line. If
you monitor the value of ntp_error, you should see similar results.

> There's a few spots where I'm a bit concerned we're missing some
> replacement for some of the accounting you removed (you don't seem to be
> adjusting the non-accumulated offset value when modifying the
> multiplier),

Yes, I missed that part. There could be time jumps observed. I was
considering to change the code to accumulate partial ticks, so this
wouldn't be needed.

> and the division in the hotpath is a bit concerning (but
> maybe that can be pre-computed in another path or approximated into
> shifts?).

In this design the division needs to be exact, otherwise the ntp_error
correction wouldn't work. Do you have any suggestions on how to
replace it?

> I'll want to reproduce your results, so let me know about your testing
> methods. I may also want to generate some more intricate tests so we can
> really see the if long-term precision and stability is really as
> unaffected you make it sound.

Ok, please let me know if my tests don't work for you or something is
not clear. I tried to make it a bit more readable, but I'm not sure I
succeeded.

> But this is all very interesting! Thanks again for digging into this
> issue and sending the patch!

Thanks for looking into it. I agree this is a rather radical change
and it needs to be done very carefully. At this point, I'd like to
know if you think there are no fundamental problems in the design and
whether it would be an acceptable replacement.

A different approach to fix this problem would be controlling the
maximum idle interval from the tick adjusting code so that the NTP
error corrections can be accurate. That would further increase the
complexity of the code and increase the interrupt rate.

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-11-18 21:28   ` John Stultz
  2013-11-19 14:13     ` Richard Cochran
@ 2013-11-21 10:12     ` Miroslav Lichvar
  1 sibling, 0 replies; 19+ messages in thread
From: Miroslav Lichvar @ 2013-11-21 10:12 UTC (permalink / raw)
  To: John Stultz; +Cc: Richard Cochran, linux-kernel, Prarit Bhargava

On Mon, Nov 18, 2013 at 01:28:07PM -0800, John Stultz wrote:
> Also is this brokenness something that has been around for awhile for
> you or more recently cropped up?  I'm wondering as nohz idle has been
> around for quite a few years and I've not seen too many complaints. So
> I'm wondering if I'm looking in the right places, or if something has
> regressed recently, or if its just that accuracy expectations have gone up?

I think the problem was there right from the beginning when the
internal synchronization loop was introduced, but before PTP hardware
clocks there might not had been a reference which could be used to
evaluate the stability of the system clock at this level. The quantum
mechanical property of the problem that it disappears when you
increase sampling doesn't help much either :).

IIRC the first time I noticed something wasn't quite alright was in
2009 when I wrote a PPS refclock driver for chrony. With a GPS
receiver connected to serial port the synchronization seemed to work
better when the system was loaded. My explanation at the time was that
it's a HW limitation, the interrupt latency/jitter is worse when the
CPU is idle. I had no idea there could be a problem in the kernel
timekeeping.

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-11-19 14:13     ` Richard Cochran
@ 2013-11-27 10:07       ` Richard Cochran
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Cochran @ 2013-11-27 10:07 UTC (permalink / raw)
  To: John Stultz; +Cc: Miroslav Lichvar, linux-kernel, Prarit Bhargava

On Tue, Nov 19, 2013 at 03:13:19PM +0100, Richard Cochran wrote:
> 
> In this test, the update rate is once per second. When using longer
> intervals, the problem becomes worse.

Here is another pair of example runs on an idle system, this time with
a 32 second update interval.

* Periodic Case

  CONFIG_HZ_PERIODIC=y
  CONFIG_NO_HZ=y
  CONFIG_HZ_1000=y

  The peak to peak time error is about 600 nanoseconds.

sudo ./phc2sys -s eth3 -m -q -l7 -O0 -P0 -I0 -R.03125
phc2sys[118.045]: PI servo: sync interval 32.000 kp 0.022 ki 0.009375
phc2sys[150.045]: sys offset    894205 s0 freq      +0 delay   5172
phc2sys[182.045]: sys offset   1416429 s1 freq  +16319 delay   5176
phc2sys[214.045]: sys offset         7 s2 freq  +16320 delay   5172
phc2sys[246.045]: sys offset       140 s2 freq  +16324 delay   5179
phc2sys[278.045]: sys offset       346 s2 freq  +16332 delay   5171
phc2sys[310.045]: sys offset       494 s2 freq  +16339 delay   5184
phc2sys[342.046]: sys offset       485 s2 freq  +16344 delay   5191
phc2sys[374.046]: sys offset       251 s2 freq  +16341 delay   5172
phc2sys[406.046]: sys offset       153 s2 freq  +16340 delay   5174
phc2sys[438.046]: sys offset       156 s2 freq  +16342 delay   5193
phc2sys[470.046]: sys offset        39 s2 freq  +16340 delay   5179
phc2sys[502.046]: sys offset       173 s2 freq  +16344 delay   5076
phc2sys[534.046]: sys offset       286 s2 freq  +16349 delay   5171
phc2sys[566.047]: sys offset       130 s2 freq  +16347 delay   5159
phc2sys[598.047]: sys offset       -64 s2 freq  +16342 delay   5191
phc2sys[630.047]: sys offset       176 s2 freq  +16349 delay   5184
phc2sys[662.047]: sys offset       248 s2 freq  +16353 delay   5167
phc2sys[694.047]: sys offset       166 s2 freq  +16353 delay   5187
phc2sys[726.047]: sys offset       227 s2 freq  +16356 delay   5179
phc2sys[758.047]: sys offset        91 s2 freq  +16354 delay   5177
phc2sys[790.048]: sys offset       -12 s2 freq  +16352 delay   5179
phc2sys[822.048]: sys offset       -44 s2 freq  +16351 delay   5189
phc2sys[854.048]: sys offset       -99 s2 freq  +16349 delay   5159
phc2sys[886.048]: sys offset        53 s2 freq  +16352 delay   5184
phc2sys[918.048]: sys offset       241 s2 freq  +16359 delay   5172
phc2sys[950.048]: sys offset       293 s2 freq  +16363 delay   5191
phc2sys[982.048]: sys offset       -92 s2 freq  +16353 delay   5179
phc2sys[1014.048]: sys offset       -61 s2 freq  +16354 delay   5172
phc2sys[1046.049]: sys offset       -42 s2 freq  +16354 delay   5186

* NO_HZ 

  CONFIG_NO_HZ_COMMON=y
  CONFIG_NO_HZ_IDLE=y
  CONFIG_NO_HZ=y
  CONFIG_RCU_FAST_NO_HZ=y
  CONFIG_HZ_250=y

  The peak to peak time error reaches 9.5 microseconds.

sudo ./phc2sys -s eth3 -m -q -l7 -O0 -P0 -I0 -R.03125
phc2sys[2036.649]: PI servo: sync interval 32.000 kp 0.022 ki 0.009375
phc2sys[2068.650]: sys offset   1322846 s0 freq      +0 delay   5184
phc2sys[2100.650]: sys offset   1845801 s1 freq  +16342 delay   5172
phc2sys[2132.650]: sys offset      -643 s2 freq  +16322 delay   5197
phc2sys[2164.650]: sys offset       111 s2 freq  +16340 delay   5192
phc2sys[2196.650]: sys offset     -2110 s2 freq  +16271 delay   5174
phc2sys[2228.650]: sys offset      4545 s2 freq  +16460 delay   5172
phc2sys[2260.650]: sys offset     -3666 s2 freq  +16246 delay   5160
phc2sys[2292.651]: sys offset      4454 s2 freq  +16465 delay   5183
phc2sys[2324.651]: sys offset     -1017 s2 freq  +16336 delay   5166
phc2sys[2356.651]: sys offset     -1248 s2 freq  +16319 delay   5159
phc2sys[2388.651]: sys offset      2824 s2 freq  +16435 delay   5174
phc2sys[2420.651]: sys offset     -2253 s2 freq  +16302 delay   5174
phc2sys[2452.651]: sys offset      -431 s2 freq  +16338 delay   5186
phc2sys[2484.651]: sys offset       374 s2 freq  +16359 delay   5191
phc2sys[2516.652]: sys offset      1198 s2 freq  +16389 delay   5176
phc2sys[2548.652]: sys offset     -1803 s2 freq  +16306 delay   5184
phc2sys[2580.652]: sys offset      -295 s2 freq  +16336 delay   5179
phc2sys[2612.652]: sys offset       753 s2 freq  +16366 delay   5188
phc2sys[2644.652]: sys offset      -907 s2 freq  +16321 delay   5184
phc2sys[2676.652]: sys offset      -128 s2 freq  +16337 delay   5111
phc2sys[2708.652]: sys offset     -2000 s2 freq  +16277 delay   5179
phc2sys[2740.653]: sys offset      2270 s2 freq  +16392 delay   5174
phc2sys[2772.653]: sys offset       718 s2 freq  +16365 delay   5191
phc2sys[2804.653]: sys offset     -1004 s2 freq  +16318 delay   5172
phc2sys[2836.653]: sys offset      -331 s2 freq  +16329 delay   5184
phc2sys[2868.653]: sys offset      1667 s2 freq  +16389 delay   5081
phc2sys[2900.653]: sys offset     -1154 s2 freq  +16316 delay   5192
phc2sys[2932.653]: sys offset      5693 s2 freq  +16519 delay   5184
phc2sys[2964.654]: sys offset     -3833 s2 freq  +16275 delay   5171
 
Thanks,
Richard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-11-20 18:39   ` Miroslav Lichvar
@ 2013-12-03  0:53     ` John Stultz
  2013-12-03  4:03       ` John Stultz
  0 siblings, 1 reply; 19+ messages in thread
From: John Stultz @ 2013-12-03  0:53 UTC (permalink / raw)
  To: Miroslav Lichvar; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On 11/20/2013 10:39 AM, Miroslav Lichvar wrote:
> On Mon, Nov 18, 2013 at 12:46:00PM -0800, John Stultz wrote:
>>> In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
>>> error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
>>> update the maximum error went down from 480 microseconds to 55
>>> nanoseconds.
>> Curious what sort of a environment you're using for simulation (as well
>> as the real test below)?
> I compile the kernel timekeeping.c file into a test program where a
> fake TSC is provided to the kernel code and I can easily control the
> tick length and timekeeper updates. The program collects pairs of
> TSC/getnstimeofday() values and then it runs linear regression through
> the points to see the frequency offset and how stable the clock was.
>
> http://mlichvar.fedorapeople.org/tmp/tk_test.tar.gz

So.. this doesn't build for me.

timekeeping.o: In function `change_clocksource':
timekeeping.c:(.text+0x535): undefined reference to `lock_acquire'
timekeeping.c:(.text+0x588): undefined reference to `lock_release'
timekeeping.o: In function `__getnstimeofday':
timekeeping.c:(.text+0x675): undefined reference to `trace_hardirqs_off'
timekeeping.c:(.text+0x69b): undefined reference to `lock_acquire'
timekeeping.c:(.text+0x6b0): undefined reference to `lock_release'
timekeeping.c:(.text+0x6c2): undefined reference to `trace_hardirqs_on'
timekeeping.c:(.text+0x77c): undefined reference to `trace_hardirqs_off'


Do you need a special .config in your kernel source? Changing lockdep
and irq tracer configs don't seem to fix things for me.



>> But this is all very interesting! Thanks again for digging into this
>> issue and sending the patch!
> Thanks for looking into it. I agree this is a rather radical change
> and it needs to be done very carefully. At this point, I'd like to
> know if you think there are no fundamental problems in the design and
> whether it would be an acceptable replacement.
>
> A different approach to fix this problem would be controlling the
> maximum idle interval from the tick adjusting code so that the NTP
> error corrections can be accurate. That would further increase the
> complexity of the code and increase the interrupt rate.

So I'm still trying to break apart the larger change you've made into
smaller functional steps.

The main two differences I see are:

1) Calculating the mult value directly from the ntp_tick_length() value
(via the division) each update cycle.

2) Drastically simplifying the ntp_error correction to a 0/1 multiplier
adjustment (assuming the calculated mult will be slightly slow, due to
truncating the remainder in the division) based on the sign of the error.

(and I'm ignoring the previously mentioned accounting changes that
appear to be improperly dropped ;)

This makes me suspect the main issue is we're over-correcting in the
timekeeping_bigadjust() logic with nohz and I'm curious if instead we
could either limit timekeeping_bigadjust() adjustment to achieve the
same stability? In particular, bigadjust()'s exponential adjustment of
mult seems problematic.

Am I missing or glossing over anything else that is key to your changes?

thanks
-john





^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-12-03  0:53     ` John Stultz
@ 2013-12-03  4:03       ` John Stultz
  2013-12-06 14:26         ` Miroslav Lichvar
  0 siblings, 1 reply; 19+ messages in thread
From: John Stultz @ 2013-12-03  4:03 UTC (permalink / raw)
  To: Miroslav Lichvar; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On 12/02/2013 04:53 PM, John Stultz wrote:
> On 11/20/2013 10:39 AM, Miroslav Lichvar wrote:
>> On Mon, Nov 18, 2013 at 12:46:00PM -0800, John Stultz wrote:
>>>> In a simulation with 1GHz TSC clock and 10Hz clock update the maximum
>>>> error went down from 4.7 microseconds to 5.5 nanoseconds. With 1Hz
>>>> update the maximum error went down from 480 microseconds to 55
>>>> nanoseconds.
>>> Curious what sort of a environment you're using for simulation (as well
>>> as the real test below)?
>> I compile the kernel timekeeping.c file into a test program where a
>> fake TSC is provided to the kernel code and I can easily control the
>> tick length and timekeeper updates. The program collects pairs of
>> TSC/getnstimeofday() values and then it runs linear regression through
>> the points to see the frequency offset and how stable the clock was.
>>
>> http://mlichvar.fedorapeople.org/tmp/tk_test.tar.gz
> So.. this doesn't build for me.
>
> timekeeping.o: In function `change_clocksource':
> timekeeping.c:(.text+0x535): undefined reference to `lock_acquire'
> timekeeping.c:(.text+0x588): undefined reference to `lock_release'
> timekeeping.o: In function `__getnstimeofday':
> timekeeping.c:(.text+0x675): undefined reference to `trace_hardirqs_off'
> timekeeping.c:(.text+0x69b): undefined reference to `lock_acquire'
> timekeeping.c:(.text+0x6b0): undefined reference to `lock_release'
> timekeeping.c:(.text+0x6c2): undefined reference to `trace_hardirqs_on'
> timekeeping.c:(.text+0x77c): undefined reference to `trace_hardirqs_off'
>
>
> Do you need a special .config in your kernel source? Changing lockdep
> and irq tracer configs don't seem to fix things for me.

Finally found a config to get it working (disabling kernel debugging
seems to work), and am currently trying to fixup the missing symbols
(although I'm getting segfaults from various inline cli's :)

Very cool simulator, by the way. Do you plan to have a git repo at some
point for it?


>
>>> But this is all very interesting! Thanks again for digging into this
>>> issue and sending the patch!
>> Thanks for looking into it. I agree this is a rather radical change
>> and it needs to be done very carefully. At this point, I'd like to
>> know if you think there are no fundamental problems in the design and
>> whether it would be an acceptable replacement.
>>
>> A different approach to fix this problem would be controlling the
>> maximum idle interval from the tick adjusting code so that the NTP
>> error corrections can be accurate. That would further increase the
>> complexity of the code and increase the interrupt rate.
> So I'm still trying to break apart the larger change you've made into
> smaller functional steps.
>
> The main two differences I see are:
>
> 1) Calculating the mult value directly from the ntp_tick_length() value
> (via the division) each update cycle.
>
> 2) Drastically simplifying the ntp_error correction to a 0/1 multiplier
> adjustment (assuming the calculated mult will be slightly slow, due to
> truncating the remainder in the division) based on the sign of the error.
>
> (and I'm ignoring the previously mentioned accounting changes that
> appear to be improperly dropped ;)
>
> This makes me suspect the main issue is we're over-correcting in the
> timekeeping_bigadjust() logic with nohz and I'm curious if instead we
> could either limit timekeeping_bigadjust() adjustment to achieve the
> same stability? In particular, bigadjust()'s exponential adjustment of
> mult seems problematic.

Tuning the look_ahead seems to solve most of the issues, at least using
your simulator. But this may not be yet ideal for all cases. But
overall, I'm glad, since that particular code was always the most
opaque, so this gives some good reason for us to sort it out and get it
documented better.

See the patch below. I'm doing some actual testing with it to see if its
maybe too dampened.

thanks
-john


diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3abf534..872c9c5 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1068,7 +1068,7 @@ static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
 	 * here.  This is tuned so that an error of about 1 msec is adjusted
 	 * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
 	 */
-	error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
+	error2 = tk->ntp_error >> (NTP_SCALE_SHIFT/2);
 	error2 = abs(error2);
 	for (look_ahead = 0; error2 > 0; look_ahead++)
 		error2 >>= 2;


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-12-03  4:03       ` John Stultz
@ 2013-12-06 14:26         ` Miroslav Lichvar
  2013-12-06 18:09           ` John Stultz
  2013-12-07  1:43           ` John Stultz
  0 siblings, 2 replies; 19+ messages in thread
From: Miroslav Lichvar @ 2013-12-06 14:26 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On Mon, Dec 02, 2013 at 08:03:17PM -0800, John Stultz wrote:
> On 12/02/2013 04:53 PM, John Stultz wrote:
> Finally found a config to get it working (disabling kernel debugging
> seems to work), and am currently trying to fixup the missing symbols
> (although I'm getting segfaults from various inline cli's :)

Patches are welcome :).

> Very cool simulator, by the way. Do you plan to have a git repo at some
> point for it?

It's now at https://github.com/mlichvar/linux-tktest

I'm considering to include it in https://github.com/mlichvar/clknetsim
as an optional replacement of the somewhat idealized clock which is
currently implemented there. This would allow us to see the whole
picture with applications controlling the clock.

> See the patch below. I'm doing some actual testing with it to see if its
> maybe too dampened.

It seems to fix the problem with stability, that's good. But the
response seems to be very slow now. In the simulated test with 10Hz
clock update it takes about 1000 updates (100 seconds!) for the loop
to converge to the correct frequency.

With the current tktest code from git:
n: 30, slope: 1.00 (1.00 GHz), dev: 3.1 ns, max: 3.6 ns, freq: -100.43404 ppm

You can see here the frequency is off by 0.43 ppm, that's after the 20
skipped updates.

When the sampling interval is changed to 100*50 ticks:
n: 30, slope: 1.00 (1.00 GHz), dev: 2146.9 ns, max: 5446.5 ns, freq: -100.07928 ppm

Only when the warmup period is extended to 100*1000 ticks, it produces
a nice fit:
n: 30, slope: 1.00 (1.00 GHz), dev: 7.3 ns, max: 12.2 ns, freq: -100.00004 ppm

This graph shows the value of tk->mult as it changes with clock
updates:
http://mlichvar.fedorapeople.org/tmp/tk_test1.png

When the TSC frequency is set to 100 MHz, it becomes more pronounced:
http://mlichvar.fedorapeople.org/tmp/tk_test2.png

I'm worried about the artifacts in the response, is that a bug?

> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1068,7 +1068,7 @@ static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
>  	 * here.  This is tuned so that an error of about 1 msec is adjusted
>  	 * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
>  	 */
> -	error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
> +	error2 = tk->ntp_error >> (NTP_SCALE_SHIFT/2);
>  	error2 = abs(error2);
>  	for (look_ahead = 0; error2 > 0; look_ahead++)
>  		error2 >>= 2;
> 

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-12-06 14:26         ` Miroslav Lichvar
@ 2013-12-06 18:09           ` John Stultz
  2013-12-06 18:37             ` Miroslav Lichvar
  2013-12-07  1:43           ` John Stultz
  1 sibling, 1 reply; 19+ messages in thread
From: John Stultz @ 2013-12-06 18:09 UTC (permalink / raw)
  To: Miroslav Lichvar; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On 12/06/2013 06:26 AM, Miroslav Lichvar wrote:
> On Mon, Dec 02, 2013 at 08:03:17PM -0800, John Stultz wrote:
>> On 12/02/2013 04:53 PM, John Stultz wrote:
>> Finally found a config to get it working (disabling kernel debugging
>> seems to work), and am currently trying to fixup the missing symbols
>> (although I'm getting segfaults from various inline cli's :)
> Patches are welcome :).
>
>> Very cool simulator, by the way. Do you plan to have a git repo at some
>> point for it?
> It's now at https://github.com/mlichvar/linux-tktest
>
> I'm considering to include it in https://github.com/mlichvar/clknetsim
> as an optional replacement of the somewhat idealized clock which is
> currently implemented there. This would allow us to see the whole
> picture with applications controlling the clock.

Interesting!  I don't think I've seen clknetsim before. I'll have to
look at it more closely!


>> See the patch below. I'm doing some actual testing with it to see if its
>> maybe too dampened.
> It seems to fix the problem with stability, that's good. But the
> response seems to be very slow now. In the simulated test with 10Hz
> clock update it takes about 1000 updates (100 seconds!) for the loop
> to converge to the correct frequency.

Yea. That was my concern that it over dampens the correction. In my
tests on actual systems it doesn't seem to cause much change in the
overall convergence picture with ntp, so I'll have to look more closely.

Just to be clear, when you say 10Hz clock update, what exactly are you
changing, as that doesn't quite match to the terminology in the tktest
simulator (ie: are you changing the ticks count?).


> With the current tktest code from git:
> n: 30, slope: 1.00 (1.00 GHz), dev: 3.1 ns, max: 3.6 ns, freq: -100.43404 ppm
>
> You can see here the frequency is off by 0.43 ppm, that's after the 20
> skipped updates.
>
> When the sampling interval is changed to 100*50 ticks:
> n: 30, slope: 1.00 (1.00 GHz), dev: 2146.9 ns, max: 5446.5 ns, freq: -100.07928 ppm
>
> Only when the warmup period is extended to 100*1000 ticks, it produces
> a nice fit:
> n: 30, slope: 1.00 (1.00 GHz), dev: 7.3 ns, max: 12.2 ns, freq: -100.00004 ppm


I get the first and the last numbers, but the middle are different for
me. Are you just setting:

diff --git a/tk_test.c b/tk_test.c
index e44a488..680f315 100644
--- a/tk_test.c
+++ b/tk_test.c
@@ -82,7 +82,7 @@ void tk_test(uint64_t *ts_x, uint64_t *ts_y, int samples, int 
 
        advance_ticks(freq, 1, 1, 200);
        ntp_freq -= 100000;
-       advance_ticks(freq, 100, 1, 20);
+       advance_ticks(freq, 100, 1, 50);
 
        for (i = 0; i < samples; i++) {
                getnstimeofday(&ts);

?

> This graph shows the value of tk->mult as it changes with clock
> updates:
> http://mlichvar.fedorapeople.org/tmp/tk_test1.png
>
> When the TSC frequency is set to 100 MHz, it becomes more pronounced:
> http://mlichvar.fedorapeople.org/tmp/tk_test2.png
>
> I'm worried about the artifacts in the response, is that a bug?

It does look strange. And again so I can reproduce this, how are you
generating the charts?


thanks
-john


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-12-06 18:09           ` John Stultz
@ 2013-12-06 18:37             ` Miroslav Lichvar
  0 siblings, 0 replies; 19+ messages in thread
From: Miroslav Lichvar @ 2013-12-06 18:37 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On Fri, Dec 06, 2013 at 10:09:03AM -0800, John Stultz wrote:
> On 12/06/2013 06:26 AM, Miroslav Lichvar wrote:
> > It seems to fix the problem with stability, that's good. But the
> > response seems to be very slow now. In the simulated test with 10Hz
> > clock update it takes about 1000 updates (100 seconds!) for the loop
> > to converge to the correct frequency.
> 
> Yea. That was my concern that it over dampens the correction. In my
> tests on actual systems it doesn't seem to cause much change in the
> overall convergence picture with ntp, so I'll have to look more closely.

In a few quick tests with phc2sys I didn't see any changes either.
But I couldn't get the wakeup rate on my test system below 20 per
second even after playing with powertop and killing everything, so I'm
not sure if that means anything.

What is the wakeup rate on your test system?

My feeling is that the internal kernel loop should be faster than the
application's loop.

> Just to be clear, when you say 10Hz clock update, what exactly are you
> changing, as that doesn't quite match to the terminology in the tktest
> simulator (ie: are you changing the ticks count?).

Yes, the second argument of advance_ticks(), 100 for 10 Hz (the
current value) and 1000 for 1 Hz.

> > When the sampling interval is changed to 100*50 ticks:
> > n: 30, slope: 1.00 (1.00 GHz), dev: 2146.9 ns, max: 5446.5 ns, freq: -100.07928 ppm

> I get the first and the last numbers, but the middle are different for
> me. Are you just setting:

> -       advance_ticks(freq, 100, 1, 20);
> +       advance_ticks(freq, 100, 1, 50);
>  
>         for (i = 0; i < samples; i++) {
>                 getnstimeofday(&ts);

No, I'm setting the one in the loop:

                ts_x[i] = simtsc;
                ts_y[i] = ts.tv_sec * 1000000000ULL + ts.tv_nsec;
 
-               advance_ticks(freq, 100, 1, 1);
+               advance_ticks(freq, 100, 1, 50);

> > When the TSC frequency is set to 100 MHz, it becomes more pronounced:
> > http://mlichvar.fedorapeople.org/tmp/tk_test2.png
> >
> > I'm worried about the artifacts in the response, is that a bug?
> 
> It does look strange. And again so I can reproduce this, how are you
> generating the charts?

I changed TSC_FREQ, added "printk("mult: %d\n", tk->mult);" to
update_wall_time() in timekeeping.c, grepped the output for "^mult"
and used this command in gnuplot on display the data:

plot [] [167755330:167755390] "log.mult" using 2

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-12-06 14:26         ` Miroslav Lichvar
  2013-12-06 18:09           ` John Stultz
@ 2013-12-07  1:43           ` John Stultz
  2013-12-07 17:56             ` Richard Cochran
  2013-12-10 10:20             ` Miroslav Lichvar
  1 sibling, 2 replies; 19+ messages in thread
From: John Stultz @ 2013-12-07  1:43 UTC (permalink / raw)
  To: Miroslav Lichvar; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On 12/06/2013 06:26 AM, Miroslav Lichvar wrote:
> This graph shows the value of tk->mult as it changes with clock
> updates:
> http://mlichvar.fedorapeople.org/tmp/tk_test1.png
>
> When the TSC frequency is set to 100 MHz, it becomes more pronounced:
> http://mlichvar.fedorapeople.org/tmp/tk_test2.png
>
> I'm worried about the artifacts in the response, is that a bug?

So now being able to reproduce this, I can see the jaggy
artifacts/discontinuities in the mult curve are basically where the
look_ahead logic is being adjusted.

My understanding is the lookahead bit is trying to evaluate the order of
magnitude in error and use that to dampen the adjustment. This means
when we have large errors, we adjust more slowly. When we have small
errors we can adjust more quickly.

So each discontinuity, you're basically seeing the dampening effect
decreased, causing the adjustment slope to increase.


Being that the bigadjust code, and specifically this lookahead bit, has
always been the most opaque logic to me, I figured I'd spend some time
looking at alternatives, and came up with one approach that tries to
mimic your patch, but tries to be more in line with the existing logic.

It seems to do fairly well in the simulator:
n: 30, slope: 1.00 (1.00 GHz), dev: 3.2 ns, max: 3.6 ns, freq: -99.95677 ppm

Basically in the big-error case, we calculate the adjustment from the
current tick error (and the assumption is that is where the majority of
the large error is coming from), leaving the normal +1/-1 adjustments to
the cumulative error.

But I'm a little worried its maybe too reactive, focusing really only on
the per-tick error magnitude, so I'm not sure how it will handle real
world ntp fluctuation or things like granularity error.

I also went through and removed some of the over-optimization that the
compiler should handle, so the code is a bit more readable.

Anyway, let me know what you think and I'll run some tests on it this
weekend.

thanks
-john


diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3abf534..bfb36fd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1056,42 +1056,24 @@ static __always_inline int
timekeeping_bigadjust(struct timekeeper *tk,
                          s64 *offset)
 {
     s64 tick_error, i;
-    u32 look_ahead, adj;
-    s32 error2, mult;
+    u32 adj;
+    s32 mult = 1;
 
-    /*
-     * Use the current error value to determine how much to look ahead.
-     * The larger the error the slower we adjust for it to avoid problems
-     * with losing too many ticks, otherwise we would overadjust and
-     * produce an even larger error.  The smaller the adjustment the
-     * faster we try to adjust for it, as lost ticks can do less harm
-     * here.  This is tuned so that an error of about 1 msec is adjusted
-     * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
-     */
-    error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
-    error2 = abs(error2);
-    for (look_ahead = 0; error2 > 0; look_ahead++)
-        error2 >>= 2;
+    /* Calculate current tick error */
+    tick_error = ntp_tick_length() >> (tk->ntp_error_shift );
+    tick_error -= tk->xtime_interval;
 
-    /*
-     * Now calculate the error in (1 << look_ahead) ticks, but first
-     * remove the single look ahead already included in the error.
-     */
-    tick_error = ntp_tick_length() >> (tk->ntp_error_shift + 1);
-    tick_error -= tk->xtime_interval >> 1;
-    error = ((error - tick_error) >> look_ahead) + tick_error;
-
-    /* Finally calculate the adjustment shift value.  */
-    i = *interval;
-    mult = 1;
-    if (error < 0) {
-        error = -error;
+    if (tick_error < 0) {
         *interval = -*interval;
         *offset = -*offset;
-        mult = -1;
+        mult = -mult;
     }
-    for (adj = 0; error > i; adj++)
-        error >>= 1;
+
+    /* Sort out the magnitude of the correction */
+    tick_error = abs(tick_error);
+    i = abs(*interval);
+    for (adj = 0; tick_error > i; adj++)
+        tick_error >>= 1;
 
     *interval <<= adj;
     *offset <<= adj;
@@ -1114,41 +1096,23 @@ static void timekeeping_adjust(struct timekeeper
*tk, s64 offset)
      *
      * First we shift it down from NTP_SHIFT to clocksource->shifted nsecs.
      *
-     * Note we subtract one in the shift, so that error is really error*2.
-     * This "saves" dividing(shifting) interval twice, but keeps the
-     * (error > interval) comparison as still measuring if error is
-     * larger than half an interval.
-     *
-     * Note: It does not "save" on aggravation when reading the code.
+     * Then we meausre if error is larger than half an interval.
      */
-    error = tk->ntp_error >> (tk->ntp_error_shift - 1);
-    if (error > interval) {
-        /*
-         * We now divide error by 4(via shift), which checks if
-         * the error is greater than twice the interval.
-         * If it is greater, we need a bigadjust, if its smaller,
-         * we can adjust by 1.
-         */
-        error >>= 2;
+    error = tk->ntp_error >> (tk->ntp_error_shift);
+    if (error > interval/2) {
         /*
-         * XXX - In update_wall_time, we round up to the next
-         * nanosecond, and store the amount rounded up into
-         * the error. This causes the likely below to be unlikely.
-         *
-         * The proper fix is to avoid rounding up by using
-         * the high precision tk->xtime_nsec instead of
-         * xtime.tv_nsec everywhere. Fixing this will take some
-         * time.
+         * We now checks if the error is greater than twice the
+         * interval. If it is greater, we need a bigadjust, if its
+         * smaller, we can adjust by 1.
          */
-        if (likely(error <= interval))
+        if (likely(error <= interval*2))
             adj = 1;
         else
             adj = timekeeping_bigadjust(tk, error, &interval, &offset);
     } else {
-        if (error < -interval) {
+        if (error < -interval/2) {
             /* See comment above, this is just switched for the negative */
-            error >>= 2;
-            if (likely(error >= -interval)) {
+            if (likely(error >= -interval*2)) {
                 adj = -1;
                 interval = -interval;
                 offset = -offset;


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-12-07  1:43           ` John Stultz
@ 2013-12-07 17:56             ` Richard Cochran
  2013-12-07 22:16               ` John Stultz
  2013-12-10 10:20             ` Miroslav Lichvar
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Cochran @ 2013-12-07 17:56 UTC (permalink / raw)
  To: John Stultz; +Cc: Miroslav Lichvar, linux-kernel, Prarit Bhargava

On Fri, Dec 06, 2013 at 05:43:45PM -0800, John Stultz wrote:
> Anyway, let me know what you think and I'll run some tests on it this
> weekend.
> 
> thanks
> -john
> 
> 
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 3abf534..bfb36fd 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1056,42 +1056,24 @@ static __always_inline int
> timekeeping_bigadjust(struct timekeeper *tk,
>                           s64 *offset)

John,

Any chance of posting this against a normal kernel?  I am preparing
"real" tests comparing the three different patches in this thread on
v3.12.3, but this one does not apply.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-12-07 17:56             ` Richard Cochran
@ 2013-12-07 22:16               ` John Stultz
  0 siblings, 0 replies; 19+ messages in thread
From: John Stultz @ 2013-12-07 22:16 UTC (permalink / raw)
  To: Richard Cochran; +Cc: Miroslav Lichvar, linux-kernel, Prarit Bhargava

[-- Attachment #1: Type: text/plain, Size: 991 bytes --]

On 12/07/2013 09:56 AM, Richard Cochran wrote:
> On Fri, Dec 06, 2013 at 05:43:45PM -0800, John Stultz wrote:
>> Anyway, let me know what you think and I'll run some tests on it this
>> weekend.
>>
>> thanks
>> -john
>>
>>
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 3abf534..bfb36fd 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1056,42 +1056,24 @@ static __always_inline int
>> timekeeping_bigadjust(struct timekeeper *tk,
>>                           s64 *offset)
> John,
>
> Any chance of posting this against a normal kernel?  I am preparing
> "real" tests comparing the three different patches in this thread on
> v3.12.3, but this one does not apply.

Sorry about that. :(  About half the time I try pasting in a patch,
thunderbird seems to decide to ignore the preformat setting and corrupts
whitespace in the patch.

Anyway, patch is attached. Many thanks for the additional testing and
review!

thanks
-john

[-- Attachment #2: 0001-reworking-bigadjust-for-nohz.patch --]
[-- Type: text/x-diff, Size: 4361 bytes --]

>From 3fbbd9ade38419245af22902a98ea221e7b36c94 Mon Sep 17 00:00:00 2001
From: John Stultz <john.stultz@linaro.org>
Date: Fri, 6 Dec 2013 17:25:21 -0800
Subject: [PATCH] my first attempt at reworking bigadjust for nohz

Takes a similar approach as Miroslav's but using the bigadjust method.

Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/timekeeping.c | 80 +++++++++++++----------------------------------
 1 file changed, 22 insertions(+), 58 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 947ba25..46f4bd2 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1056,42 +1056,24 @@ static __always_inline int timekeeping_bigadjust(struct timekeeper *tk,
 						 s64 *offset)
 {
 	s64 tick_error, i;
-	u32 look_ahead, adj;
-	s32 error2, mult;
+	u32 adj;
+	s32 mult = 1;
 
-	/*
-	 * Use the current error value to determine how much to look ahead.
-	 * The larger the error the slower we adjust for it to avoid problems
-	 * with losing too many ticks, otherwise we would overadjust and
-	 * produce an even larger error.  The smaller the adjustment the
-	 * faster we try to adjust for it, as lost ticks can do less harm
-	 * here.  This is tuned so that an error of about 1 msec is adjusted
-	 * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
-	 */
-	error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ);
-	error2 = abs(error2);
-	for (look_ahead = 0; error2 > 0; look_ahead++)
-		error2 >>= 2;
+	/* Calculate current tick error */
+	tick_error = ntp_tick_length() >> (tk->ntp_error_shift );
+	tick_error -= tk->xtime_interval;
 
-	/*
-	 * Now calculate the error in (1 << look_ahead) ticks, but first
-	 * remove the single look ahead already included in the error.
-	 */
-	tick_error = ntp_tick_length() >> (tk->ntp_error_shift + 1);
-	tick_error -= tk->xtime_interval >> 1;
-	error = ((error - tick_error) >> look_ahead) + tick_error;
-
-	/* Finally calculate the adjustment shift value.  */
-	i = *interval;
-	mult = 1;
-	if (error < 0) {
-		error = -error;
+	if (tick_error < 0) {
 		*interval = -*interval;
 		*offset = -*offset;
-		mult = -1;
+		mult = -mult;
 	}
-	for (adj = 0; error > i; adj++)
-		error >>= 1;
+
+	/* Sort out the magnitude of the correction */
+	tick_error = abs(tick_error);
+	i = abs(*interval);
+	for (adj = 0; tick_error > i; adj++)
+		tick_error >>= 1;
 
 	*interval <<= adj;
 	*offset <<= adj;
@@ -1114,41 +1096,23 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset)
 	 *
 	 * First we shift it down from NTP_SHIFT to clocksource->shifted nsecs.
 	 *
-	 * Note we subtract one in the shift, so that error is really error*2.
-	 * This "saves" dividing(shifting) interval twice, but keeps the
-	 * (error > interval) comparison as still measuring if error is
-	 * larger than half an interval.
-	 *
-	 * Note: It does not "save" on aggravation when reading the code.
+	 * Then we meausre if error is larger than half an interval.
 	 */
-	error = tk->ntp_error >> (tk->ntp_error_shift - 1);
-	if (error > interval) {
-		/*
-		 * We now divide error by 4(via shift), which checks if
-		 * the error is greater than twice the interval.
-		 * If it is greater, we need a bigadjust, if its smaller,
-		 * we can adjust by 1.
-		 */
-		error >>= 2;
+	error = tk->ntp_error >> (tk->ntp_error_shift);
+	if (error > interval/2) {
 		/*
-		 * XXX - In update_wall_time, we round up to the next
-		 * nanosecond, and store the amount rounded up into
-		 * the error. This causes the likely below to be unlikely.
-		 *
-		 * The proper fix is to avoid rounding up by using
-		 * the high precision tk->xtime_nsec instead of
-		 * xtime.tv_nsec everywhere. Fixing this will take some
-		 * time.
+		 * We now checks if the error is greater than twice the
+		 * interval. If it is greater, we need a bigadjust, if its
+		 * smaller, we can adjust by 1.
 		 */
-		if (likely(error <= interval))
+		if (likely(error <= interval*2))
 			adj = 1;
 		else
 			adj = timekeeping_bigadjust(tk, error, &interval, &offset);
 	} else {
-		if (error < -interval) {
+		if (error < -interval/2) {
 			/* See comment above, this is just switched for the negative */
-			error >>= 2;
-			if (likely(error >= -interval)) {
+			if (likely(error >= -interval*2)) {
 				adj = -1;
 				interval = -interval;
 				offset = -offset;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-12-07  1:43           ` John Stultz
  2013-12-07 17:56             ` Richard Cochran
@ 2013-12-10 10:20             ` Miroslav Lichvar
  2013-12-10 11:26               ` Miroslav Lichvar
  1 sibling, 1 reply; 19+ messages in thread
From: Miroslav Lichvar @ 2013-12-10 10:20 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On Fri, Dec 06, 2013 at 05:43:45PM -0800, John Stultz wrote:
> Being that the bigadjust code, and specifically this lookahead bit, has
> always been the most opaque logic to me, I figured I'd spend some time
> looking at alternatives, and came up with one approach that tries to
> mimic your patch, but tries to be more in line with the existing logic.
> 
> It seems to do fairly well in the simulator:
> n: 30, slope: 1.00 (1.00 GHz), dev: 3.2 ns, max: 3.6 ns, freq: -99.95677 ppm

Hm, this shows a 0.043ppm error in the frequency. It doesn't seem to go
away even when I use a long sampling interval or give it more time to
settle down. Is that an expected side effect of the patch?

> Basically in the big-error case, we calculate the adjustment from the
> current tick error (and the assumption is that is where the majority of
> the large error is coming from), leaving the normal +1/-1 adjustments to
> the cumulative error.

The normal +1/-1 adjustment doesn't seem to be active in the
simulation, at least in the default settings with 100ppm offset. When
I print the error variable in timekeeping_adjust() I can see its
absolute value stays above interval*2, so timekeeping_bigadjust() is
called on every update. The bigadjust correction seems too weak to
bring the error down to activate the normal +1/-1 adjustment, the
error keeps increasing and the frequency is slighly off.

What does the following line from your patch mean?

        tick_error -= tk->xtime_interval;

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] timekeeping: Fix clock stability with nohz
  2013-12-10 10:20             ` Miroslav Lichvar
@ 2013-12-10 11:26               ` Miroslav Lichvar
  0 siblings, 0 replies; 19+ messages in thread
From: Miroslav Lichvar @ 2013-12-10 11:26 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Prarit Bhargava, Richard Cochran

On Tue, Dec 10, 2013 at 11:20:51AM +0100, Miroslav Lichvar wrote:
> What does the following line from your patch mean?
> 
>         tick_error -= tk->xtime_interval;

Ok, I think I understand how it should work. There are two loops, the
bigadjust one is correcting only for ntp tick length and the other for
the cumulative error. I think it might work better if they were both
active at the same time instead of switching between them according to
the current ntp error.

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-12-10 11:26 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-14 14:50 [PATCH RFC] timekeeping: Fix clock stability with nohz Miroslav Lichvar
2013-11-14 18:01 ` Rik van Riel
2013-11-16  7:03 ` Richard Cochran
2013-11-18 21:28   ` John Stultz
2013-11-19 14:13     ` Richard Cochran
2013-11-27 10:07       ` Richard Cochran
2013-11-21 10:12     ` Miroslav Lichvar
2013-11-18 20:46 ` John Stultz
2013-11-20 18:39   ` Miroslav Lichvar
2013-12-03  0:53     ` John Stultz
2013-12-03  4:03       ` John Stultz
2013-12-06 14:26         ` Miroslav Lichvar
2013-12-06 18:09           ` John Stultz
2013-12-06 18:37             ` Miroslav Lichvar
2013-12-07  1:43           ` John Stultz
2013-12-07 17:56             ` Richard Cochran
2013-12-07 22:16               ` John Stultz
2013-12-10 10:20             ` Miroslav Lichvar
2013-12-10 11:26               ` Miroslav Lichvar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).