Re: [PATCH clocksource] Reject bogus watchdog clocksource measurements

From: "Paul E. McKenney" <paulmck@kernel.org>
To: Feng Tang <feng.tang@intel.com>
Cc: linux-kernel@vger.kernel.org, clm@meta.com, jstultz@google.com,
	tglx@linutronix.de, sboyd@kernel.org, longman@redhat.com
Subject: Re: [PATCH clocksource] Reject bogus watchdog clocksource measurements
Date: Tue, 1 Nov 2022 12:06:27 -0700	[thread overview]
Message-ID: <20221101190627.GI5600@paulmck-ThinkPad-P17-Gen-1> (raw)
In-Reply-To: <Y2CyBGNM0rMI6nCG@feng-clx>

On Tue, Nov 01, 2022 at 01:43:32PM +0800, Feng Tang wrote:
> On Mon, Oct 31, 2022 at 10:42:12AM -0700, Paul E. McKenney wrote:
> 
> [...]
> > > > @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > >  			continue;
> > > >  		}
> > > >  		if (wd_nsec > (wdi << 2)) {
> > > 
> > > Just recalled one thing, that it may be better to check 'cs_nsec' 
> > > instead of 'wd_nsec', as some watchdog may have small wrap-around
> > > value. IIRC, HPET's counter is 32 bits long and wraps at about
> > > 300 seconds, and PMTIMER's counter is 24 bits which wraps at about
> > > 3 ~ 4 seconds. So when a long stall of the watchdog timer happens,
> > > the watchdog's value could 'overflow' many times.
> > > 
> > > And usually the 'current' closcksource has longer wrap time than
> > > the watchdog.
> > 
> > Why not both?
> 
> You mean checking both clocksource and the watchdog? It's fine for
> me, though I still trust clocksource more.

OK, good, I will check both.  You never know what future hardware
might bring.

I also reversed the order of the checks, so that it now checks for heavy
load before too-short interval.  The purpose is to automatically avoid
being fooled by clock wrap.

> I checked some old emails and found some long stall logs for reference.
> 
> * one stall of 471 seconds
> 
>  [ 2410.694068] clocksource: timekeeping watchdog on CPU262: Marking clocksource 'tsc' as unstable because the skew is too large:
>  [ 2410.706920] clocksource:                       'hpet' wd_nsec: 0 wd_now: ffd70be2 wd_last: 40da633b mask: ffffffff
>  [ 2410.718583] clocksource:                       'tsc' cs_nsec: 471766594285 cs_now: 44f62c184e9 cs_last: 394a7a43771 mask: ffffffffffffffff
>  [ 2410.732568] clocksource:                       'tsc' is current clocksource.
>  [ 2410.740553] tsc: Marking TSC unstable due to clocksource watchdog
>  [ 2410.747611] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>  [ 2410.757321] sched_clock: Marking unstable (2398804490960, 11943006672)<-(2419023952548, -8276474713)
>  [ 2410.767741] clocksource: Checking clocksource tsc synchronization from CPU 233 to CPUs 0,73,93-94,226,454,602,821.
>  [ 2410.784045] clocksource: Switched to clocksource hpet
> 
> 
> * another one of 5 seconds
> 
>  [ 3302.211708] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
>  [ 3302.211710] clocksource:                       'acpi_pm' wd_nsec: 312227950 wd_now: 92367f wd_last: 8128bd mask: ffffff
>  [ 3302.211712] clocksource:                       'tsc' cs_nsec: 4999196389 cs_now: 9e811223a9754 cs_last: 9e80e767df194 mask: ffffffffffffffff
>  [ 3302.211714] clocksource:                       'tsc' is current clocksource.
>  [ 3302.211716] tsc: Marking TSC unstable due to clocksource watchdog

Very good, thank you!  I believe that both of these would be handled
by the updated commit (see below for the update).

> >  		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
> > 
> > > > -			/* This can happen on busy systems, which can delay the watchdog. */
> > > > -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			bool needwarn = false;
> > > > +			u64 wd_lb;
> > > > +
> > > > +			cs->wd_bogus_count++;
> > > > +			if (!cs->wd_bogus_shift) {
> > > > +				needwarn = true;
> > > > +			} else {
> > > > +				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > > > +				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > > > +				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > > > +					needwarn = true;
> > > 
> > > I'm not sure if we need to check the last_bogus counter, or just
> > > the current interval 'cs_nsec' is what we care, and some code
> > > like this ?
> > 
> > I thought we wanted exponential backoff?  Do you really get that from
> > the changes below?
> 
> Aha, I misunderstood your words. I thought to only report one time for
> each 2, 4, 8, ... 256 seconds stall, and after that only report stall
> of 512+ seconds. So your approach looks good to me, as our intention is
> to avoid the flood of warning message.

Sounds good, thank you!

Please see below for a patch to be squashed into the original.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit eaee921daa7091f0eb731c9217ccc638ed5f8baf
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue Nov 1 12:02:18 2022 -0700

    squash! clocksource: Exponential backoff for load-induced bogus watchdog reads
    
    [ paulmck: Apply Feng Tang feedback. ]
    
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 6537ffa02e445..de8047b6720f5 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -442,12 +442,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 
 		/* Check for bogus measurements. */
 		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
-		if (wd_nsec < (wdi >> 2)) {
-			/* This usually indicates broken timer code or hardware. */
-			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
-			continue;
-		}
-		if (wd_nsec > (wdi << 2)) {
+		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
 			bool needwarn = false;
 			u64 wd_lb;
 
@@ -470,6 +465,12 @@ static void clocksource_watchdog(struct timer_list *unused)
 			}
 			continue;
 		}
+		/* Check too-short measurements second to handle wrap. */
+		if (wd_nsec < (wdi >> 2) || cs_nsec < (wdi >> 2)) {
+			/* This usually indicates broken timer code or hardware. */
+			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			continue;
+		}
 
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;