linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] clocksource: Add heuristics to avoid switching away from TSC due to timer delay
@ 2018-11-30 21:17 Roland Dreier
  2018-12-04 11:56 ` Thomas Gleixner
  0 siblings, 1 reply; 3+ messages in thread
From: Roland Dreier @ 2018-11-30 21:17 UTC (permalink / raw)
  To: Thomas Gleixner, John Stultz, Stephen Boyd, linux-kernel

On a modern x86 system, the TSC is used as a clocksource, with HPET
used in the clocksource watchdog to make sure that the TSC is stable.

If the clocksource watchdog_timer is delayed for an extremely long
time (for example if softirqs are being serviced in ksoftirqd, and
realtime threads are starving ksoftirqd), then the 32-bit HPET counter
may wrap around.  For example, with an HPET running at 24 MHz, 2^32
cycles is about 179 seconds - a long time for timers to be starved,
but possible with a poorly behaved realtime thread.

If this happens, since the TSC is a 64-bit counter and won't wrap, the
watchdog will detect skew - the TSC interval will be 179 seconds
longer than the HPET interval - and will mark the TSC as unstable.
This causes the system to switch to the HPET as a clocksource, which
has a huge negative performance impact.

In this case, switching to the HPET just makes a bad situation (timers
starved) that the system might recover from turn permanently even
worse (more expensive clock_gettime() calls), due to a spurious false
positive detection of TSC instability.

To improve this, add some heuristics to detect cases where the
watchdog is delayed long enough for the instability detection to be
likely to be wrong:

 - If the clocksource being tested (eg TSC) has counted so many cycles
   that converting to nsecs will overflow multiplication, *AND* the
   watchdog clocksource (eg HPET) shows that the watchdog timer has
   missed its interval by at least a factor of 3, skip marking the
   clocksource as unstable for a timer interation.  This is not
   perfect - for example it is possible for the watchdog clocksource
   to wrap around and show a small interval - but at least in the
   specific x86 it is unlikely, since the watchdog interval is a small
   fraction of the wraparound interval.

 - If there is a skew between the clocksource being tested and the
   watchdog clocksource that is at least as big as the wraparound
   interval for the watchdog clocksource, then don't mark the
   clocksource as unstable.  Again, this might fail to mark a
   clocksource as unstable for one iteration, but it is unlikely that
   the instability is bad enough that we will see a larger skew than
   the wraparound interval for many iterations.

These heuristics are imperfect but are chosen to make false detection
of instability much less likely, while leaving detection of true
instability very likely within a few clocksource watchdog iterations.

Signed-off-by: Roland Dreier <roland@purestorage.com>
---
 kernel/time/clocksource.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index ffe081623aec..f1b3d8ff2437 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -243,12 +243,47 @@ static void clocksource_watchdog(struct timer_list *unused)
 					     watchdog->shift);
 
 		delta = clocksource_delta(csnow, cs->cs_last, cs->mask);
+
+		/* If the cycle delta is beyond what we can safely
+		 * convert to nsecs, and the watchdog clocksource
+		 * suggests that we've overslept, skip checking this
+		 * iteration to avoid marking a clocksource as
+		 * unstable because of a severely delayed timer. */
+		if (delta > cs->max_cycles &&
+		    wd_nsec > 3 * jiffies_to_nsecs(WATCHDOG_INTERVAL)) {
+			pr_warn("timekeeping watchdog: Clocksource '%s' not checked due to apparent long timer delay:\n",
+				cs->name);
+			pr_warn("                      Delta %llx > max_cycles %llx, wd_nsec %lld\n",
+				delta, cs->max_cycles, wd_nsec);
+			continue;
+		}
+
 		cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
 		wdlast = cs->wd_last; /* save these in case we print them */
 		cslast = cs->cs_last;
 		cs->cs_last = csnow;
 		cs->wd_last = wdnow;
 
+		/* If the clocksource interval is far off from the
+		 * watchdog clocksource interval but the interval is
+		 * big enough that the watchdog may have wrapped
+		 * around (again due to a severely delayed timer),
+		 * skip this iteration.  For example, this saves us
+		 * from marking the TSC as unstable just because the
+		 * 32-bit HPET wrapped around on x86. */
+		if (abs(cs_nsec - wd_nsec) >
+		    clocksource_cyc2ns(watchdog->max_cycles, watchdog->mult,
+				       watchdog->shift) - WATCHDOG_THRESHOLD) {
+			pr_warn("timekeeping watchdog: Clocksource '%s' not checked due to apparent timer delay:\n",
+				cs->name);
+			pr_warn("                      Skew %lld watchdog wrap %lld\n",
+				abs(cs_nsec - wd_nsec),
+				clocksource_cyc2ns(watchdog->max_cycles,
+						   watchdog->mult,
+						   watchdog->shift));
+			continue;
+		}
+
 		if (atomic_read(&watchdog_reset_pending))
 			continue;
 
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] clocksource: Add heuristics to avoid switching away from TSC due to timer delay
  2018-11-30 21:17 [PATCH] clocksource: Add heuristics to avoid switching away from TSC due to timer delay Roland Dreier
@ 2018-12-04 11:56 ` Thomas Gleixner
  2018-12-04 19:33   ` Roland Dreier
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Gleixner @ 2018-12-04 11:56 UTC (permalink / raw)
  To: Roland Dreier; +Cc: John Stultz, Stephen Boyd, linux-kernel

Roland,

On Fri, 30 Nov 2018, Roland Dreier wrote:
>  		delta = clocksource_delta(csnow, cs->cs_last, cs->mask);
> +
> +		/* If the cycle delta is beyond what we can safely
> +		 * convert to nsecs, and the watchdog clocksource
> +		 * suggests that we've overslept, skip checking this
> +		 * iteration to avoid marking a clocksource as
> +		 * unstable because of a severely delayed timer. */

  		 /*
		  * Proper multiline comments look like this not like
		  * the above.
		  */

That aside. Why are you trying to do heuristics on the delta?

We have way better information than that. The watchdog timer expiry time is
known and we can determine the exact delay of the timer.

The watchdog clocksource provides the maximum 'idle' time, i.e. the time
between two reads, in clocksource::max_idle_ns. That value is filled in
when the clocksource is configured.

So without doing speculation we can make an informed decision:

    elapsed = jiffies_to_nsec(jiffies - watchdog_timer->expires) +
    	      WATCHDOG_INTERVAL_NS;

    if (elapsed > wdcs->max_idle_ns) {
       	    Skip ......
    }

Hmm?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] clocksource: Add heuristics to avoid switching away from TSC due to timer delay
  2018-12-04 11:56 ` Thomas Gleixner
@ 2018-12-04 19:33   ` Roland Dreier
  0 siblings, 0 replies; 3+ messages in thread
From: Roland Dreier @ 2018-12-04 19:33 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: john.stultz, sboyd, LKML

>                  /*
>                   * Proper multiline comments look like this not like
>                   * the above.
>                   */

Got it, will fix next time around.

> That aside. Why are you trying to do heuristics on the delta?
>
> We have way better information than that. The watchdog timer expiry time is
> known and we can determine the exact delay of the timer.
>
> The watchdog clocksource provides the maximum 'idle' time, i.e. the time
> between two reads, in clocksource::max_idle_ns. That value is filled in
> when the clocksource is configured.
>
> So without doing speculation we can make an informed decision:
>
>     elapsed = jiffies_to_nsec(jiffies - watchdog_timer->expires) +
>               WATCHDOG_INTERVAL_NS;
>
>     if (elapsed > wdcs->max_idle_ns) {
>             Skip ......
>     }

Yes, that makes more sense than what I was doing, although I'm not
sure on the details.  Just missed that idea.

Why are you adding the watchdog interval to the calculated elapsed
time?  It seems we have an issue exactly if jiffies -
watchdog_timer->expires is too big, without adding the interval we
tried to wait in on top.  Also I think we might want to be careful
that jiffies is >= the expires time - or is it not possible that a
timer fires one jiffy early?

Also for full generality it seems we should check against the
clocksource max_idle_ns as well - for x86 TSC is wider than HPET but
there may be other architectures that could hit the same problem, just
with the clocksource being checked wrapping around instead of the
watchdog clocksource.  Right?

Thanks!
  Roland

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2018-12-04 19:33 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-30 21:17 [PATCH] clocksource: Add heuristics to avoid switching away from TSC due to timer delay Roland Dreier
2018-12-04 11:56 ` Thomas Gleixner
2018-12-04 19:33   ` Roland Dreier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).