* [RFC PATCH] clocksource: skip check while watchdog hung up or unstable
@ 2021-08-11  9:55 brookxu
  2021-08-11 12:44 ` Thomas Gleixner
  2021-08-11 13:00 ` kernel test robot
  0 siblings, 2 replies; 8+ messages in thread
From: brookxu @ 2021-08-11  9:55 UTC (permalink / raw)
  To: john.stultz, tglx, sboyd; +Cc: linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

After commit 1f45f1f3 ("clocksource: Make clocksource validation work
for all clocksources"), wd_nsec may be 0 in some scenarios, for example
when the watchdog is delayed for a long time or the watchdog
experiences a time warp.

We found a problem when testing NVMe disks with fio while multiple
queue interrupts of a disk were mapped to a single CPU. IO interrupt
processing delayed the watchdog for a long time (155 seconds), after
which the system reported the TSC as unstable and switched the
clocksource to hpet. It seems that this scenario cannot be handled by
optimizing softirq. When wd_nsec is 0, the machine or the watchdog is
likely in an unstable state, so the verification result is unreliable.
Is it possible for us to skip the current check in this case?
1. If the watchdog is delayed because the system is busy, and the
   clocksource is switched to hpet due to a wrong judgment, the
   performance degradation may directly cause the machine to be
   unavailable and cause more problems.
2. If the watchdog has a time warp, we should not rely on hpet to
   directly mark the TSC as unstable.

The watchdog timer is later re-armed on another CPU; if that CPU is
not busy, the stability of the TSC can still be checked there.
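
For illustration only, here is a small userspace model of why a
155-second delay can yield wd_nsec == 0 and a false "unstable" verdict.
It is a sketch, not the kernel code: model_delta() mirrors my
understanding of the validated clocksource_delta() (result forced to 0
when the upper half of the mask range is reached), and the HPET rate
(14.31818 MHz), the 32-bit mask, the 1/16 s threshold and all model_*
names are assumptions introduced for this example.

```c
#include <stdint.h>
#include <stdlib.h>

#define NSEC_PER_SEC    1000000000LL
#define WD_THRESHOLD_NS (NSEC_PER_SEC >> 4)	/* assumed skew threshold: 62.5 ms */
#define HPET_HZ         14318180ULL		/* assumed typical HPET frequency */
#define HPET_MASK       0xffffffffULL		/* assumed 32-bit HPET counter */

/* Masked delta; returns 0 once the delta exceeds half the counter range. */
static uint64_t model_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	uint64_t ret = (now - last) & mask;

	return (ret & ~(mask >> 1)) ? 0 : ret;
}

/* Watchdog interval in ns as seen after the timer was delayed delay_sec. */
static int64_t model_wd_nsec(uint64_t delay_sec)
{
	uint64_t now = (delay_sec * HPET_HZ) & HPET_MASK;	/* last read was 0 */
	uint64_t delta = model_delta(now, 0, HPET_MASK);

	return (int64_t)(delta * NSEC_PER_SEC / HPET_HZ);
}

/* Skew check without the proposed skip; the TSC is assumed to be exact. */
static int model_flags_unstable(uint64_t delay_sec)
{
	int64_t cs_nsec = (int64_t)delay_sec * NSEC_PER_SEC;
	int64_t wd_nsec = model_wd_nsec(delay_sec);

	return llabs(cs_nsec - wd_nsec) > WD_THRESHOLD_NS;
}

/* Same check with the proposed skip: an interval of 0 is not evaluated. */
static int model_flags_unstable_with_skip(uint64_t delay_sec)
{
	if (!model_wd_nsec(delay_sec))
		return 0;

	return model_flags_unstable(delay_sec);
}
```

Under these assumptions a 32-bit HPET reaches half its range after
about 150 seconds, so a 100-second delay is still measured correctly,
while a 155-second delay collapses to 0 and the perfectly good TSC is
blamed for 155 seconds of skew unless the check is skipped.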

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 kernel/time/clocksource.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b89c76e1c02c..9b9014d67f1d 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -399,6 +399,13 @@ static void clocksource_watchdog(struct timer_list *unused)
 		cs->cs_last = csnow;
 		cs->wd_last = wdnow;
 
+		if (!wd_nsec) {
+			pr_warn("timekeeping watchdog on CPU%d seems hung up or unstable:\n", smp_processor_id());
+			pr_warn("'%s' wd_now: %llx wd_last: %llx mask: %llx\n",
+				watchdog->name, wdnow, wdlast, watchdog->mask);
+			continue;
+		}
+
 		if (atomic_read(&watchdog_reset_pending))
 			continue;
 
-- 
2.30.0



Thread overview: 8+ messages
2021-08-11  9:55 [RFC PATCH] clocksource: skip check while watchdog hung up or unstable brookxu
2021-08-11 12:44 ` Thomas Gleixner
2021-08-11 13:18   ` brookxu
2021-08-11 14:01     ` Thomas Gleixner
2021-08-11 15:26       ` brookxu
2021-08-12 10:53         ` Thomas Gleixner
2021-08-13  0:54           ` brookxu
2021-08-11 13:00 ` kernel test robot
