Re: [PATCH clocksource] Reject bogus watchdog clocksource measurements

From: Feng Tang <feng.tang@intel.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: <linux-kernel@vger.kernel.org>, <clm@meta.com>,
	<jstultz@google.com>, <tglx@linutronix.de>, <sboyd@kernel.org>,
	<longman@redhat.com>
Subject: Re: [PATCH clocksource] Reject bogus watchdog clocksource measurements
Date: Tue, 1 Nov 2022 13:43:32 +0800
Message-ID: <Y2CyBGNM0rMI6nCG@feng-clx>
In-Reply-To: <20221031174212.GB5600@paulmck-ThinkPad-P17-Gen-1>

On Mon, Oct 31, 2022 at 10:42:12AM -0700, Paul E. McKenney wrote:

[...]
> > > @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >  			continue;
> > >  		}
> > >  		if (wd_nsec > (wdi << 2)) {
> > 
> > Just recalled one thing: it may be better to check 'cs_nsec'
> > instead of 'wd_nsec', as some watchdogs have a small wrap-around
> > value. IIRC, HPET's counter is 32 bits wide and wraps at about
> > 300 seconds, and PMTIMER's counter is 24 bits wide and wraps at
> > about 3 ~ 4 seconds. So when a long stall of the watchdog timer
> > happens, the watchdog's value could 'overflow' many times.
> > 
> > And usually the 'current' clocksource has a longer wrap time than
> > the watchdog.
> 
> Why not both?

You mean checking both the clocksource and the watchdog? That's fine
with me, though I still trust the clocksource more.
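
To make the "check both" idea concrete, here is a minimal user-space
sketch of the two bounds applied to either measurement. It is not the
kernel code: check_interval() and measurement_is_bogus() are
hypothetical helper names, and WDI_NS assumes the watchdog interval is
half a second. The values in the usage note come from the acpi_pm log
quoted in this message.

```c
#include <stdint.h>

/* Assumption: WATCHDOG_INTERVAL is half a second, expressed here in ns. */
#define WDI_NS 500000000ULL

enum wd_verdict { WD_OK, WD_TOO_SHORT, WD_TOO_LONG };

/* Hypothetical helper: apply the (wdi >> 2) and (wdi << 2) bounds. */
static enum wd_verdict check_interval(uint64_t nsec)
{
	if (nsec < (WDI_NS >> 2))
		return WD_TOO_SHORT;	/* less than a quarter interval */
	if (nsec > (WDI_NS << 2))
		return WD_TOO_LONG;	/* more than four intervals */
	return WD_OK;
}

/* Hypothetical helper: reject the sample if either clock looks bogus. */
static int measurement_is_bogus(uint64_t wd_nsec, uint64_t cs_nsec)
{
	return check_interval(wd_nsec) != WD_OK ||
	       check_interval(cs_nsec) != WD_OK;
}
```

With the acpi_pm numbers (wd_nsec 312227950, cs_nsec 4999196389), the
watchdog reading alone falls inside the bounds, while the clocksource
reading exceeds four intervals, so only the combined check rejects it.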

I checked some old emails and found some long stall logs for reference.

* one stall of 471 seconds

 [ 2410.694068] clocksource: timekeeping watchdog on CPU262: Marking clocksource 'tsc' as unstable because the skew is too large:
 [ 2410.706920] clocksource:                       'hpet' wd_nsec: 0 wd_now: ffd70be2 wd_last: 40da633b mask: ffffffff
 [ 2410.718583] clocksource:                       'tsc' cs_nsec: 471766594285 cs_now: 44f62c184e9 cs_last: 394a7a43771 mask: ffffffffffffffff
 [ 2410.732568] clocksource:                       'tsc' is current clocksource.
 [ 2410.740553] tsc: Marking TSC unstable due to clocksource watchdog
 [ 2410.747611] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
 [ 2410.757321] sched_clock: Marking unstable (2398804490960, 11943006672)<-(2419023952548, -8276474713)
 [ 2410.767741] clocksource: Checking clocksource tsc synchronization from CPU 233 to CPUs 0,73,93-94,226,454,602,821.
 [ 2410.784045] clocksource: Switched to clocksource hpet

* another one of 5 seconds

 [ 3302.211708] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
 [ 3302.211710] clocksource:                       'acpi_pm' wd_nsec: 312227950 wd_now: 92367f wd_last: 8128bd mask: ffffff
 [ 3302.211712] clocksource:                       'tsc' cs_nsec: 4999196389 cs_now: 9e811223a9754 cs_last: 9e80e767df194 mask: ffffffffffffffff
 [ 3302.211714] clocksource:                       'tsc' is current clocksource.
 [ 3302.211716] tsc: Marking TSC unstable due to clocksource watchdog
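
The acpi_pm numbers above can be cross-checked with a small user-space
sketch. It assumes the ACPI PM timer runs at its standard 3.579545 MHz
and uses a plain division instead of the kernel's mult/shift
conversion; masked_delta(), cycles_to_ns() and wrap_period_ns() are
hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/* Assumption: the ACPI PM timer ticks at its standard 3.579545 MHz. */
#define PMTMR_HZ	3579545ULL
#define NSEC_PER_SEC	1000000000ULL

/* Cycles elapsed on a counter that is only 'mask' wide; wraps are lost. */
static uint64_t masked_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}

/*
 * Plain cycles-to-ns conversion. Good enough for 24-bit and 32-bit
 * counters; the multiplication would overflow for very large counts.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t hz)
{
	return cycles * NSEC_PER_SEC / hz;
}

/* Time for a counter with the given mask to wrap once, in ns. */
static uint64_t wrap_period_ns(uint64_t mask, uint64_t hz)
{
	return cycles_to_ns(mask + 1, hz);
}
```

Under that frequency assumption, masked_delta(0x92367f, 0x8128bd,
0xffffff) converts to roughly 312 ms, in line with the logged wd_nsec,
and wrap_period_ns(0xffffff, PMTMR_HZ) is about 4.69 s. Subtracting one
wrap period from the logged cs_nsec of 4999196389 also leaves roughly
312 ms, which is consistent with the 24-bit counter having wrapped
once during the stall.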

> 
>  		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
> 
> > > -			/* This can happen on busy systems, which can delay the watchdog. */
> > > -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > +			bool needwarn = false;
> > > +			u64 wd_lb;
> > > +
> > > +			cs->wd_bogus_count++;
> > > +			if (!cs->wd_bogus_shift) {
> > > +				needwarn = true;
> > > +			} else {
> > > +				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > > +				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > > +				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > > +					needwarn = true;
> > 
> > I'm not sure whether we need to check the last_bogus counter, or
> > whether the current interval 'cs_nsec' is all we care about, with
> > some code like this?
> 
> I thought we wanted exponential backoff?  Do you really get that from
> the changes below?

Aha, I misunderstood you. I was thinking of reporting only once for
each stall of 2, 4, 8, ... 256 seconds, and after that only reporting
stalls of 512+ seconds. Your approach looks good to me, as our intention
is to avoid a flood of warning messages.
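
For reference, the exponential backoff as I now understand it can be
sketched in user space as below. This is only an illustration of the
idea, not the patch itself: struct bogus_state and bogus_should_warn()
are hypothetical names, WDI_NS assumes a half-second watchdog interval,
and the timestamp is assumed to come from some monotonic source rather
than from the watchdog counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: WATCHDOG_INTERVAL is half a second, expressed here in ns. */
#define WDI_NS		500000000ULL
#define MAX_SHIFT	10	/* caps the quiet period at 1024 intervals */

/* Hypothetical per-clocksource backoff state. */
struct bogus_state {
	uint64_t last_warn_ns;		/* time of the last warning */
	int shift;			/* current backoff exponent */
	unsigned long count;		/* bogus measurements seen so far */
	unsigned long count_last;	/* count at the last warning */
};

/* Returns true if this bogus measurement should be reported. */
static bool bogus_should_warn(struct bogus_state *st, uint64_t now_ns)
{
	bool needwarn = false;

	st->count++;
	if (!st->shift)
		needwarn = true;	/* first bogus measurement: always warn */
	else if (((uint64_t)1 << st->shift) * WDI_NS <= now_ns - st->last_warn_ns)
		needwarn = true;	/* quiet period has elapsed */

	if (needwarn) {
		st->last_warn_ns = now_ns;
		if (st->shift < MAX_SHIFT)
			st->shift++;	/* double the next quiet period */
		st->count_last = st->count;
	}
	return needwarn;
}
```

With these assumptions, the quiet period between warnings doubles from
1 second up to 512 seconds, and count - count_last gives the number of
suppressed measurements to report in the next warning.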

Thanks,
Feng

> And should we be using something like the jiffies counter to measure the
> exponential backoff?
> 
> 							Thanx, Paul
> 
> > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > index daac05aedf56..3910dbb9b960 100644
> > --- a/include/linux/clocksource.h
> > +++ b/include/linux/clocksource.h
> > @@ -125,7 +125,6 @@ struct clocksource {
> >  	struct list_head	wd_list;
> >  	u64			cs_last;
> >  	u64			wd_last;
> > -	u64			wd_last_bogus;
> >  	int			wd_bogus_shift;
> >  	unsigned long		wd_bogus_count;
> >  	unsigned long		wd_bogus_count_last;
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index 6537ffa02e44..8e6d498b1492 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -442,28 +442,18 @@ static void clocksource_watchdog(struct timer_list *unused)
> >  
> >  		/* Check for bogus measurements. */
> >  		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > -		if (wd_nsec < (wdi >> 2)) {
> > +		if (cs_nsec < (wdi >> 2)) {
> >  			/* This usually indicates broken timer code or hardware. */
> > -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > +			pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), cs->name, cs_nsec, WATCHDOG_INTERVAL);
> >  			continue;
> >  		}
> > -		if (wd_nsec > (wdi << 2)) {
> > -			bool needwarn = false;
> > -			u64 wd_lb;
> > -
> > +		if (cs_nsec > (wdi << 2)) {
> >  			cs->wd_bogus_count++;
> > -			if (!cs->wd_bogus_shift) {
> > -				needwarn = true;
> > -			} else {
> > -				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > -				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > -				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > -					needwarn = true;
> > -			}
> > -			if (needwarn) {
> > +			if (!cs->wd_bogus_shift ||
> > +			    (1 << cs->wd_bogus_shift) * wdi <= cs_nsec) {
> >  				/* This can happen on busy systems, which can delay the watchdog. */
> > -				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > -				cs->wd_last_bogus = wdnow;
> > +				pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), cs->name, cs_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > +
> >  				if (cs->wd_bogus_shift < 10)
> >  					cs->wd_bogus_shift++;
> >  				cs->wd_bogus_count_last = cs->wd_bogus_count;
> > 
> > Thanks,
> > Feng
> > 
> > 
> > > +			}
> > > +			if (needwarn) {
> > > +				/* This can happen on busy systems, which can delay the watchdog. */
> > > +				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > > +				cs->wd_last_bogus = wdnow;
> > > +				if (cs->wd_bogus_shift < 10)
> > > +					cs->wd_bogus_shift++;
> > > +				cs->wd_bogus_count_last = cs->wd_bogus_count;
> > > +			}
> > >  			continue;
> > >  		}
> > >