Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when high read latency detected

From: Waiman Long <longman@redhat.com>
To: Feng Tang <feng.tang@intel.com>,
	John Stultz <john.stultz@linaro.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Stephen Boyd <sboyd@kernel.org>,
	x86@kernel.org, Peter Zijlstra <peterz@infradead.org>,
	"Paul E . McKenney" <paulmck@kernel.org>
Cc: linux-kernel@vger.kernel.org, Tim Chen <tim.c.chen@intel.com>
Subject: Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when high read latency detected
Date: Tue, 20 Dec 2022 11:11:08 -0500
Message-ID: <6fb04ee9-ce77-4835-2ad1-b7f8419cfb77@redhat.com>
In-Reply-To: <20221220082512.186283-1-feng.tang@intel.com>

On 12/20/22 03:25, Feng Tang wrote:
> A bug was reported on 8-socket x86 machines where the TSC was wrongly
> disabled while the system was under heavy workload.
>
>   [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
>   [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
>   [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
>   [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
>   [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
>   [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
>   [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>   [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
>   [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
>   [ 821.067990] clocksource: Switched to clocksource hpet
>
> This can be reproduced when the system is running the memory-intensive
> 'stream' test, or some stress-ng subcases such as 'ioport'.
>
> The reason is that when the system is under heavy load, the read
> latency of a clocksource can be very high. This can be seen even with
> the lightweight TSC read, and is much worse for external clocksources
> based on MMIO or IO port reads, which makes the watchdog check inaccurate.
>
> As the clocksource watchdog is a lifetime check that runs twice a
> second, there is no need to rush it when the system is under heavy
> load and the clocksource read latency is very high. In that case,
> suspend the watchdog timer for 5 minutes.
>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>   kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
>   1 file changed, 32 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 9cf32ccda715..8cd74b89d577 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
>   }
>   EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
>   
> +static inline void clocksource_reset_watchdog(void)
> +{
> +	struct clocksource *cs;
> +
> +	list_for_each_entry(cs, &watchdog_list, wd_list)
> +		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> +}
> +
> +
>   static void clocksource_watchdog(struct timer_list *unused)
>   {
>   	u64 csnow, wdnow, cslast, wdlast, delta;
> @@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>   	int64_t wd_nsec, cs_nsec;
>   	struct clocksource *cs;
>   	enum wd_read_status read_ret;
> +	unsigned long extra_wait = 0;
>   	u32 md;
>   
>   	spin_lock(&watchdog_lock);
> @@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
>   
>   		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
>   
> -		if (read_ret != WD_READ_SUCCESS) {
> -			if (read_ret == WD_READ_UNSTABLE)
> -				/* Clock readout unreliable, so give it up. */
> -				__clocksource_unstable(cs);
> +		if (read_ret == WD_READ_UNSTABLE) {
> +			/* Clock readout unreliable, so give it up. */
> +			__clocksource_unstable(cs);
>   			continue;
>   		}
>   
> +		/*
> +		 * When WD_READ_SKIP is returned, the system is likely under
> +		 * very heavy load, where the latency of reading the
> +		 * watchdog/clocksource is very high and affects the accuracy
> +		 * of the watchdog check. So give the system some space and
> +		 * suspend the watchdog check for 5 minutes.
> +		 */
> +		if (read_ret == WD_READ_SKIP) {
> +			/*
> +			 * As the watchdog timer will be suspended, and
> +			 * cs->last could stay unchanged for 5 minutes, reset
> +			 * the counters.
> +			 */
> +			clocksource_reset_watchdog();
> +			extra_wait = HZ * 300;
> +			break;
> +		}
> +
>   		/* Clocksource initialized ? */
>   		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
>   		    atomic_read(&watchdog_reset_pending)) {
> @@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>   	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
>   	 */
>   	if (!timer_pending(&watchdog_timer)) {
> -		watchdog_timer.expires += WATCHDOG_INTERVAL;
> +		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
>   		add_timer_on(&watchdog_timer, next_cpu);
>   	}
>   out:
> @@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
>   	watchdog_running = 0;
>   }
>   
> -static inline void clocksource_reset_watchdog(void)
> -{
> -	struct clocksource *cs;
> -
> -	list_for_each_entry(cs, &watchdog_list, wd_list)
> -		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> -}
> -
>   static void clocksource_resume_watchdog(void)
>   {
>   	atomic_inc(&watchdog_reset_pending);

It looks reasonable to me. Thanks for the patch.

Acked-by: Waiman Long <longman@redhat.com>
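
For readers following the thread, the rescheduling decision in the quoted patch can be sketched in userspace C roughly as below. This is only an illustration under stated assumptions, not kernel code: every `sketch_`/`SKETCH_` name is hypothetical, the tick rate of 250 is an assumed value, and the skew check, locking and per-clocksource flag reset are all left out. It shows just the control flow: an unstable read marks that source and moves on, while a skipped read stops the walk and pushes the next timer expiry out by 5 minutes.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; the tick rate is an assumption, not a kernel value. */
#define SKETCH_HZ            250UL               /* ticks per second (assumed) */
#define SKETCH_WD_INTERVAL   (SKETCH_HZ / 2)     /* watchdog runs twice a second */
#define SKETCH_SUSPEND_TICKS (SKETCH_HZ * 300)   /* 5 minutes, as in the patch */

/* Hypothetical mirror of the three read outcomes named in the patch. */
enum sketch_read_status {
	SKETCH_READ_SUCCESS,
	SKETCH_READ_UNSTABLE,
	SKETCH_READ_SKIP,
};

/*
 * Walk the per-clocksource read results in list order and return how many
 * ticks to add to the timer expiry. A SKIP result ends the walk and adds the
 * 5-minute suspension. *nr_unstable receives the number of sources that
 * would have been marked unstable before the walk ended.
 */
unsigned long sketch_next_delay(const enum sketch_read_status *st, size_t n,
				size_t *nr_unstable)
{
	unsigned long extra_wait = 0;
	size_t i;

	assert(st != NULL || n == 0);
	*nr_unstable = 0;

	for (i = 0; i < n; i++) {
		if (st[i] == SKETCH_READ_UNSTABLE) {
			/* Readout unreliable: give up on this source only. */
			(*nr_unstable)++;
			continue;
		}
		if (st[i] == SKETCH_READ_SKIP) {
			/* High read latency: stop checking and back off. */
			extra_wait = SKETCH_SUSPEND_TICKS;
			break;
		}
		/* SUCCESS: the real watchdog would run its skew check here. */
	}

	return SKETCH_WD_INTERVAL + extra_wait;
}
```

In this sketch a single skipped read delays the next check for every clocksource on the list, including those after it that were never examined in this pass, which matches the `break` in the quoted hunk.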