linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] clocksource: Avoid incorrect hpet fallback
@ 2021-11-10 22:17 Waiman Long
  2021-11-10 22:17 ` [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Waiman Long @ 2021-11-10 22:17 UTC (permalink / raw)
  To: John Stultz, Thomas Gleixner, Stephen Boyd
  Cc: linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker, Feng Tang, Paul E. McKenney,
	Waiman Long

It was found that when an x86 system was stressed by running various
benchmark suites, the clocksource watchdog might occasionally mark TSC
as unstable and fall back to hpet, which has a significant impact on
system performance.

The current watchdog clocksource skew threshold of 50us has been found
to be insufficient. Patch 1 changes it back to the 100us value used
before commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
threshold"). Patch 2 adds a Kconfig option to allow kernel builders to
control the actual threshold to be used.

Waiman Long (2):
  clocksource: Avoid accidental unstable marking of clocksources
  clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW

 kernel/time/Kconfig       |  9 ++++++
 kernel/time/clocksource.c | 58 ++++++++++++++++++++++++++++++---------
 2 files changed, 54 insertions(+), 13 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-10 22:17 [PATCH 0/2] clocksource: Avoid incorrect hpet fallback Waiman Long
@ 2021-11-10 22:17 ` Waiman Long
  2021-11-11  4:57   ` Feng Tang
  2021-11-10 22:17 ` [PATCH 2/2] clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2021-11-10 22:17 UTC (permalink / raw)
  To: John Stultz, Thomas Gleixner, Stephen Boyd
  Cc: linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker, Feng Tang, Paul E. McKenney,
	Waiman Long

Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
threshold"), it has been observed that the tsc clocksource can sometimes
fall back to hpet on both Intel and AMD systems, especially when they are
running stressful benchmarking workloads. Of the 23 systems tested with
a v5.14 kernel, 10 switched to the hpet clocksource during the test run.

Falling back to hpet causes a drastic reduction in benchmark
performance. For example, fio test results can drop by up to 70%,
and iperf3 results by up to 80%.

Four hpet fallbacks happened during bootup:

  [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
  [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
  [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
  [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable

Other fallbacks happened while the systems were running stressful
benchmarks. For example:

  [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
  [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable

Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
changed the skew margin from 100us to 50us. I think this is too small
and can easily be exceeded when running stressful workloads on a
thermally stressed system, so it is switched back to 100us. On the
other hand, the minimum uncertainty margin does not appear to need an
increase, so it is kept at 100us as well.

Even a maximum skew margin of 100us may be too small for some systems
during bootup, especially if those systems are under thermal stress. To
rule out the case where a large skew is simply due to the system being
too busy, slowing down the reads of both the watchdog and the
clocksource, a final check is done by reading the watchdog time once
more and comparing that consecutive watchdog read-back delay against
WATCHDOG_MAX_SKEW/2. If the delay exceeds this limit, the system is
assumed to be too busy: a warning is printed to the console and the
watchdog check is skipped for this round. For example:

  [    8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy

Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/time/clocksource.c | 53 +++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 13 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba..c503f5ab4101 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -107,7 +107,7 @@ static u64 suspend_start;
  * This delay could be due to SMIs, NMIs, or to VCPU preemptions.  Used as
  * a lower bound for cs->uncertainty_margin values when registering clocks.
  */
-#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)
+#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
 
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
 static void clocksource_watchdog_work(struct work_struct *work);
@@ -205,11 +205,17 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
 static int verify_n_cpus = 8;
 module_param(verify_n_cpus, int, 0644);
 
-static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
+enum wd_read_status {
+	WD_READ_SUCCESS,
+	WD_READ_UNSTABLE,
+	WD_READ_SKIP
+};
+
+static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
 {
 	unsigned int nretries;
-	u64 wd_end, wd_delta;
-	int64_t wd_delay;
+	u64 wd_end, wd_end2, wd_delta;
+	int64_t wd_delay, wd_delay2;
 
 	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
 		local_irq_disable();
@@ -226,13 +232,30 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
 				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
 					smp_processor_id(), watchdog->name, nretries);
 			}
-			return true;
+			return WD_READ_SUCCESS;
 		}
 	}
 
+	/*
+	 * Now compute the delay of a consecutive watchdog read to see if
+	 * there is too much external interference causing significant delay
+	 * in reading both the clocksource and the watchdog.
+	 *
+	 * If the consecutive watchdog read-back delay > WATCHDOG_MAX_SKEW/2,
+	 * report the system as busy and skip the current watchdog test.
+	 */
+	wd_end2 = watchdog->read(watchdog);
+	wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
+	wd_delay2 = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
+	if (wd_delay2 > WATCHDOG_MAX_SKEW/2) {
+		pr_warn("timekeeping watchdog on CPU%d: %s consecutive read-back delay of %lldns, system too busy\n",
+			smp_processor_id(), watchdog->name, wd_delay2);
+		return WD_READ_SKIP;
+	}
+
 	pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
 		smp_processor_id(), watchdog->name, wd_delay, nretries);
-	return false;
+	return WD_READ_UNSTABLE;
 }
 
 static u64 csnow_mid;
@@ -356,6 +379,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	int next_cpu, reset_pending;
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
+	enum wd_read_status read_ret;
 	u32 md;
 
 	spin_lock(&watchdog_lock);
@@ -373,9 +397,12 @@ static void clocksource_watchdog(struct timer_list *unused)
 			continue;
 		}
 
-		if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
-			/* Clock readout unreliable, so give it up. */
-			__clocksource_unstable(cs);
+		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
+
+		if (read_ret != WD_READ_SUCCESS) {
+			if (read_ret == WD_READ_UNSTABLE)
+				/* Clock readout unreliable, so give it up. */
+				__clocksource_unstable(cs);
 			continue;
 		}
 
@@ -1061,7 +1088,7 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
 	/*
 	 * If the uncertainty margin is not specified, calculate it.
 	 * If both scale and freq are non-zero, calculate the clock
-	 * period, but bound below at 2*WATCHDOG_MAX_SKEW.  However,
+	 * period, but bound below at WATCHDOG_MAX_SKEW.  However,
 	 * if either of scale or freq is zero, be very conservative and
 	 * take the tens-of-milliseconds WATCHDOG_THRESHOLD value for the
 	 * uncertainty margin.  Allow stupidly small uncertainty margins
@@ -1070,12 +1097,12 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
 	 */
 	if (scale && freq && !cs->uncertainty_margin) {
 		cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
-		if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
-			cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
+		if (cs->uncertainty_margin < WATCHDOG_MAX_SKEW)
+			cs->uncertainty_margin = WATCHDOG_MAX_SKEW;
 	} else if (!cs->uncertainty_margin) {
 		cs->uncertainty_margin = WATCHDOG_THRESHOLD;
 	}
-	WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
+	WARN_ON_ONCE(cs->uncertainty_margin < WATCHDOG_MAX_SKEW);
 
 	/*
 	 * Ensure clocksources that have large 'mult' values don't overflow
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/2] clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
  2021-11-10 22:17 [PATCH 0/2] clocksource: Avoid incorrect hpet fallback Waiman Long
  2021-11-10 22:17 ` [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources Waiman Long
@ 2021-11-10 22:17 ` Waiman Long
  2021-11-10 22:32 ` [PATCH 0/2] clocksource: Avoid incorrect hpet fallback Paul E. McKenney
  2021-11-11  1:23 ` Feng Tang
  3 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2021-11-10 22:17 UTC (permalink / raw)
  To: John Stultz, Thomas Gleixner, Stephen Boyd
  Cc: linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker, Feng Tang, Paul E. McKenney,
	Waiman Long

A watchdog maximum skew of 100us may still be too small for some
systems or architectures. It may also be too small when certain kernel
debug config options are enabled.  So add a new Kconfig option,
CLOCKSOURCE_WATCHDOG_MAX_SKEW_US, to give kernel builders more control
over the threshold used for marking a clocksource as unstable.
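
For example, a kernel builder who wants a more relaxed threshold could
set something like the following in the kernel configuration (the value
shown is only illustrative):

  CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=150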

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/time/Kconfig       | 9 +++++++++
 kernel/time/clocksource.c | 7 ++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 04bfd62f5e5c..27b7868b5c30 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,5 +181,14 @@ config HIGH_RES_TIMERS
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.
 
+config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+	int "Clocksource watchdog maximum allowable skew (in μs)"
+	depends on CLOCKSOURCE_WATCHDOG
+	range 50 1000
+	default 100
+	help
+	  Specify the maximum amount of allowable watchdog skew in
+	  microseconds before reporting the clocksource to be unstable.
+
 endmenu
 endif
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c503f5ab4101..2e605fc40e82 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -107,7 +107,12 @@ static u64 suspend_start;
  * This delay could be due to SMIs, NMIs, or to VCPU preemptions.  Used as
  * a lower bound for cs->uncertainty_margin values when registering clocks.
  */
-#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
+#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+#define MAX_SKEW_USEC	CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+#else
+#define MAX_SKEW_USEC	100
+#endif
+#define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)
 
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
 static void clocksource_watchdog_work(struct work_struct *work);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/2] clocksource: Avoid incorrect hpet fallback
  2021-11-10 22:17 [PATCH 0/2] clocksource: Avoid incorrect hpet fallback Waiman Long
  2021-11-10 22:17 ` [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources Waiman Long
  2021-11-10 22:17 ` [PATCH 2/2] clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW Waiman Long
@ 2021-11-10 22:32 ` Paul E. McKenney
  2021-11-10 23:25   ` Waiman Long
  2021-11-11  1:23 ` Feng Tang
  3 siblings, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2021-11-10 22:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, linux-kernel,
	Peter Zijlstra, Cassio Neri, Linus Walleij, Colin Ian King,
	Frederic Weisbecker, Feng Tang

On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> It was found that when an x86 system was being stressed by running
> various different benchmark suites, the clocksource watchdog might
> occasionally mark TSC as unstable and fall back to hpet which will
> have a signficant impact on system performance.
> 
> The current watchdog clocksource skew threshold of 50us is found to be
> insufficient. So it is changed back to 100us before commit 2e27e793e280
> ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
> adds a Kconfig option to allow kernel builder to control the actual
> threshold to be used.
> 
> Waiman Long (2):
>   clocksource: Avoid accidental unstable marking of clocksources
>   clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW

The ability to control the fine-grained threshold seems useful, but is
the TSC still marked unstable when this commit from -rcu is applied?
It has passed significant testing on other workloads.

2a43fb0479aa ("clocksource: Forgive repeated long-latency watchdog clocksource reads")

If the patch below takes care of your situation, my thought is to
also take your second patch, which would allow people to set the
cutoff more loosely or more tightly, as their situation dictates.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 2a43fb0479aa53c6ff866aa26df2708de0c86b12
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu May 27 11:03:28 2021 -0700

    clocksource: Forgive repeated long-latency watchdog clocksource reads
    
    Currently, the clocksource watchdog reacts to repeated long-latency
    clocksource reads by marking that clocksource unstable on the theory that
    these long-latency reads are a sign of a serious problem.  And this theory
    does in fact have real-world support in the form of firmware issues [1].
    
    However, it is also possible to trigger this using stress-ng on what
    the stress-ng man page terms "poorly designed hardware" [2].  And it
    is not necessarily a bad thing for the kernel to diagnose cases where
    heavy memory-contention workloads are being run on hardware that is not
    designed for this sort of use.
    
    Nevertheless, it is quite possible that real-world use will result in
    some situation requiring that high-stress workloads run on hardware
    not designed to accommodate them, and also requiring that the kernel
    refrain from marking clocksources unstable.
    
    Therefore, react to persistent long-latency reads by leaving the
    clocksource alone, but using the old 62.5-millisecond skew-detection
    threshold.  In addition, the offending clocksource is marked for
    re-initialization, which both restarts that clocksource with a clean bill
    of health and avoids false-positive skew reports on later watchdog checks.
    Once marked for re-initialization, the clocksource is not subjected to
    further watchdog checking until a subsequent successful read from that
    clocksource that is free of excessive delays.
    
    However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
    attempts result in long latencies, a warning (splat) will be emitted.
    This kernel boot parameter defaults to 1000, and this warning can be
    disabled by setting it to zero or to a negative value.
    
    [ paulmck: Apply feedback from Chao Gao and Oliver Sang. ]
    
    Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
    Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
    Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
    Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
    Tested-by: Chao Gao <chao.gao@intel.com>
    Tested-by: Jiri Wiesner <jwiesner@suse.de>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 316027c3aadce..61d2436ae9dfc 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -600,6 +600,14 @@
 			loops can be debugged more effectively on production
 			systems.
 
+	clocksource.max_cswd_coarse_reads= [KNL]
+			Number of consecutive clocksource_watchdog()
+			coarse reads (that is, clocksource reads that
+			were unduly delayed) that are permitted before
+			the kernel complains (gently).	Set to a value
+			less than or equal to zero to suppress these
+			complaints.
+
 	clocksource.max_cswd_read_retries= [KNL]
 			Number of clocksource_watchdog() retries due to
 			external delays before the clock will be marked
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 1d42d4b173271..3e925d9ffc31c 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -110,6 +110,7 @@ struct clocksource {
 	int			rating;
 	enum clocksource_ids	id;
 	enum vdso_clock_mode	vdso_clock_mode;
+	unsigned int		n_coarse_reads;
 	unsigned long		flags;
 
 	int			(*enable)(struct clocksource *cs);
@@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
 #define TIMER_ACPI_DECLARE(name, table_id, fn)		\
 	ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
 
+extern int max_cswd_coarse_reads;
 extern ulong max_cswd_read_retries;
 void clocksource_verify_percpu(struct clocksource *cs);
 
diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
index df922f49d171b..7e82500c400b9 100644
--- a/kernel/time/clocksource-wdtest.c
+++ b/kernel/time/clocksource-wdtest.c
@@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
 		else if (i <= max_cswd_read_retries)
 			s = ", expect message";
 		else
-			s = ", expect clock skew";
+			s = ", expect coarse-grained clock skew check and re-initialization";
 		pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
 		WRITE_ONCE(wdtest_ktime_read_ndelays, i);
 		schedule_timeout_uninterruptible(2 * HZ);
 		WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
-		WARN_ON_ONCE((i <= max_cswd_read_retries) !=
-			     !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
+		WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
 		wdtest_ktime_clocksource_reset();
 	}
 
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba6..553b479c4e73e 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
 	spin_unlock_irqrestore(&watchdog_lock, flags);
 }
 
+int max_cswd_coarse_reads = 1000;
+module_param(max_cswd_coarse_reads, int, 0644);
+EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
 ulong max_cswd_read_retries = 3;
 module_param(max_cswd_read_retries, ulong, 0644);
 EXPORT_SYMBOL_GPL(max_cswd_read_retries);
@@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
 				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
 					smp_processor_id(), watchdog->name, nretries);
 			}
-			return true;
+			cs->n_coarse_reads = 0;
+			return false;
 		}
+		WARN_ONCE(max_cswd_coarse_reads > 0 &&
+			  !(++cs->n_coarse_reads % max_cswd_coarse_reads),
+			  "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
 	}
 
-	pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
-		smp_processor_id(), watchdog->name, wd_delay, nretries);
-	return false;
+	if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
+		pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
+			smp_processor_id(), watchdog->name, wd_delay, nretries);
+	} else {
+		pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
+			smp_processor_id(), watchdog->name, wd_delay, nretries);
+	}
+	return true;
 }
 
 static u64 csnow_mid;
@@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	int next_cpu, reset_pending;
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
+	bool coarse;
 	u32 md;
 
 	spin_lock(&watchdog_lock);
@@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
 			continue;
 		}
 
-		if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
-			/* Clock readout unreliable, so give it up. */
-			__clocksource_unstable(cs);
-			continue;
-		}
+		coarse = cs_watchdog_read(cs, &csnow, &wdnow);
 
 		/* Clocksource initialized ? */
 		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
 		    atomic_read(&watchdog_reset_pending)) {
-			cs->flags |= CLOCK_SOURCE_WATCHDOG;
+			if (!coarse)
+				cs->flags |= CLOCK_SOURCE_WATCHDOG;
 			cs->wd_last = wdnow;
 			cs->cs_last = csnow;
 			continue;
@@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
 			continue;
 
 		/* Check the deviation from the watchdog clocksource. */
-		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+		if (coarse) {
+			md = 62500 * NSEC_PER_USEC;
+			cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+			pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
+		} else {
+			md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+		}
 		if (abs(cs_nsec - wd_nsec) > md) {
 			pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
 				smp_processor_id(), cs->name);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/2] clocksource: Avoid incorrect hpet fallback
  2021-11-10 22:32 ` [PATCH 0/2] clocksource: Avoid incorrect hpet fallback Paul E. McKenney
@ 2021-11-10 23:25   ` Waiman Long
  2021-11-11  0:04     ` Paul E. McKenney
  0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2021-11-10 23:25 UTC (permalink / raw)
  To: paulmck
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, linux-kernel,
	Peter Zijlstra, Cassio Neri, Linus Walleij, Frederic Weisbecker,
	Feng Tang


On 11/10/21 17:32, Paul E. McKenney wrote:
> On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
>> It was found that when an x86 system was being stressed by running
>> various different benchmark suites, the clocksource watchdog might
>> occasionally mark TSC as unstable and fall back to hpet which will
>> have a signficant impact on system performance.
>>
>> The current watchdog clocksource skew threshold of 50us is found to be
>> insufficient. So it is changed back to 100us before commit 2e27e793e280
>> ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
>> adds a Kconfig option to allow kernel builder to control the actual
>> threshold to be used.
>>
>> Waiman Long (2):
>>    clocksource: Avoid accidental unstable marking of clocksources
>>    clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
> The ability to control the fine-grained threshold seems useful, but is
> the TSC still marked unstable when this commit from -rcu is applied?
> It has passed significant testing on other workloads.
>
> 2a43fb0479aa ("clocksource: Forgive repeated long-latency watchdog clocksource reads")
>
> If the patch below takes care of your situation, my thought is to
> also take your second patch, which would allow people to set the
> cutoff more loosely or more tightly, as their situation dictates.
>
> Thoughts?

That is commit 14dbb29eda51 ("clocksource: Forgive repeated long-latency
watchdog clocksource reads") in your linux-rcu git tree. From reading
the patch, I believe it should be able to address the hpet fallback
problem that Red Hat has encountered. Your patch says it is an
out-of-tree patch. Are you planning to mainline it?

Patch 1 of this series contains some testing data from runs that caused
hpet fallback. In summary, a clock skew threshold of 100us was found to
be enough to avoid the problem during benchmark runs. However, we have
some cases where TSC was marked unstable at bootup time with a skew of
200us or more, which, I believe, was caused by the thermal stress the
system was experiencing after running stressful benchmarks for hours.

In the end, we had to revert your clocksource patches before shipping
the RHEL9 beta last week.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/2] clocksource: Avoid incorrect hpet fallback
  2021-11-10 23:25   ` Waiman Long
@ 2021-11-11  0:04     ` Paul E. McKenney
  2021-11-11  1:19       ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2021-11-11  0:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, linux-kernel,
	Peter Zijlstra, Cassio Neri, Linus Walleij, Frederic Weisbecker,
	Feng Tang

On Wed, Nov 10, 2021 at 06:25:14PM -0500, Waiman Long wrote:
> 
> On 11/10/21 17:32, Paul E. McKenney wrote:
> > On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> > > It was found that when an x86 system was being stressed by running
> > > various different benchmark suites, the clocksource watchdog might
> > > occasionally mark TSC as unstable and fall back to hpet which will
> > > have a signficant impact on system performance.
> > > 
> > > The current watchdog clocksource skew threshold of 50us is found to be
> > > insufficient. So it is changed back to 100us before commit 2e27e793e280
> > > ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
> > > adds a Kconfig option to allow kernel builder to control the actual
> > > threshold to be used.
> > > 
> > > Waiman Long (2):
> > >    clocksource: Avoid accidental unstable marking of clocksources
> > >    clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
> > The ability to control the fine-grained threshold seems useful, but is
> > the TSC still marked unstable when this commit from -rcu is applied?
> > It has passed significant testing on other workloads.
> > 
> > 2a43fb0479aa ("clocksource: Forgive repeated long-latency watchdog clocksource reads")
> > 
> > If the patch below takes care of your situation, my thought is to
> > also take your second patch, which would allow people to set the
> > cutoff more loosely or more tightly, as their situation dictates.
> > 
> > Thoughts?
> 
> That is commit 14dbb29eda51 ("clocksource: Forgive repeated long-latency
> watchdog clocksource reads") in your linux-rcu git tree. From reading the
> patch, I believe it should be able to address the hpet fallback problem that
> Red Hat had encountered. Your patch said it was an out-of-tree patch. Are
> you planning to mainline it?

Yes, I expect to submit it into the next merge window (not the current
v5.16 merge window, but v5.17).  However, if your situation is urgent, and
if it works for you, I could submit it as a fix for an earlier regression.

> Patch 1 of this series contains some testing data that caused hpet fallback
> in our testing runs. In summary, a clock skew of 100us is found to be enough
> to avoid the problem with benchmark runs. However, we have some cases where
> TSC was marked unstable at bootup time with a skew of 200us or more which, I
> believe, was caused by the thermal stress that the system was experiencing
> after running stressful benchmarks for hours.

This sort of thing does show some value for allowing the threshold to
be adjusted.  I hope that it does not prove necessary to dynamically
adjust the threshold based on CPU clock frequency, but you never know.

> At the end, we have to revert your clocksource patches before shipping RHEL9
> beta last week.

Which has the disadvantage of leaving the initial clock-skew issues,
but I do understand that introducing one problem even while fixing
another one still counts as a regression.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/2] clocksource: Avoid incorrect hpet fallback
  2021-11-11  0:04     ` Paul E. McKenney
@ 2021-11-11  1:19       ` Waiman Long
  0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2021-11-11  1:19 UTC (permalink / raw)
  To: paulmck
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, linux-kernel,
	Peter Zijlstra, Cassio Neri, Linus Walleij, Frederic Weisbecker,
	Feng Tang


On 11/10/21 19:04, Paul E. McKenney wrote:
> On Wed, Nov 10, 2021 at 06:25:14PM -0500, Waiman Long wrote:
>> On 11/10/21 17:32, Paul E. McKenney wrote:
>>> On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
>>>> It was found that when an x86 system was being stressed by running
>>>> various different benchmark suites, the clocksource watchdog might
>>>> occasionally mark TSC as unstable and fall back to hpet which will
>>>> have a signficant impact on system performance.
>>>>
>>>> The current watchdog clocksource skew threshold of 50us is found to be
>>>> insufficient. So it is changed back to 100us before commit 2e27e793e280
>>>> ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
>>>> adds a Kconfig option to allow kernel builder to control the actual
>>>> threshold to be used.
>>>>
>>>> Waiman Long (2):
>>>>     clocksource: Avoid accidental unstable marking of clocksources
>>>>     clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
>>> The ability to control the fine-grained threshold seems useful, but is
>>> the TSC still marked unstable when this commit from -rcu is applied?
>>> It has passed significant testing on other workloads.
>>>
>>> 2a43fb0479aa ("clocksource: Forgive repeated long-latency watchdog clocksource reads")
>>>
>>> If the patch below takes care of your situation, my thought is to
>>> also take your second patch, which would allow people to set the
>>> cutoff more loosely or more tightly, as their situation dictates.
>>>
>>> Thoughts?
>> That is commit 14dbb29eda51 ("clocksource: Forgive repeated long-latency
>> watchdog clocksource reads") in your linux-rcu git tree. From reading the
>> patch, I believe it should be able to address the hpet fallback problem that
>> Red Hat had encountered. Your patch said it was an out-of-tree patch. Are
>> you planning to mainline it?
> Yes, I expect to submit it into the next merge window (not the current
> v5.16 merge window, but v5.17).  However, if your situation is urgent, and
> if it works for you, I could submit it as a fix for an earlier regression.

I will build a test kernel based on your patch and ask our benchmarking
group to run their test suites. It will take a day or two to get a
definitive answer, though I believe it should fix the issue.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/2] clocksource: Avoid incorrect hpet fallback
  2021-11-10 22:17 [PATCH 0/2] clocksource: Avoid incorrect hpet fallback Waiman Long
                   ` (2 preceding siblings ...)
  2021-11-10 22:32 ` [PATCH 0/2] clocksource: Avoid incorrect hpet fallback Paul E. McKenney
@ 2021-11-11  1:23 ` Feng Tang
  2021-11-11  1:30   ` Waiman Long
  3 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-11-11  1:23 UTC (permalink / raw)
  To: Waiman Long, Paul E. McKenney
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, linux-kernel,
	Peter Zijlstra, Cassio Neri, Linus Walleij, Colin Ian King,
	Frederic Weisbecker, Paul E. McKenney

Hi Waiman, Paul,

On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> It was found that when an x86 system was being stressed by running
> various different benchmark suites, the clocksource watchdog might
> occasionally mark TSC as unstable and fall back to hpet which will
> have a signficant impact on system performance.
 
We've seen similar cases while running the 'netperf' and 'lockbus/ioport'
cases of the 'stress-ng' tool.

In those scenarios, the clocksource used by the kernel is tsc, while
hpet is used as the watchdog. And when the skew happens, we found that
mostly it is hpet's 'fault': when the system is under extreme pressure,
a read of hpet can take a long time, and even two consecutive reads of
hpet can have a big gap (up to 1ms+) between them. So the skew we saw
is actually caused by hpet rather than tsc, as a tsc read is a
lightweight CPU operation.

I tried the following patch to detect such skew of the watchdog itself,
and avoid wrongly judging the tsc to be unstable. It does help in
our tests; please help to review.

And one further idea is to also add two consecutive reads of the current
clocksource, compare that gap with the watchdog's, and skip the check
if the watchdog's is bigger.
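
Roughly, that extra check could look like the untested sketch below
(not part of the patch that follows; the variable names are made up):

	u64 cs_now1, cs_now2, cs_delta;
	int64_t cs_seq_delay;

	/* Measure the clocksource's own consecutive-read gap ... */
	cs_now1 = cs->read(cs);
	cs_now2 = cs->read(cs);
	cs_delta = clocksource_delta(cs_now2, cs_now1, cs->mask);
	cs_seq_delay = clocksource_cyc2ns(cs_delta, cs->mult, cs->shift);

	/* ... and skip this round if the watchdog's gap is the bigger one. */
	if (wd_seq_delay > cs_seq_delay)
		return WATCHDOG_READ_SCREWED;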


From f68c668f19abc77c409795ccc1800de3f4fcfc5a Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Wed, 10 Nov 2021 21:12:10 +0800
Subject: [PATCH RFC] clocksource: add detection for skew of watchdog

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 kernel/time/clocksource.c | 39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba6..6fbef74b0ed74 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -205,34 +205,54 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
 static int verify_n_cpus = 8;
 module_param(verify_n_cpus, int, 0644);
 
-static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
+enum {
+	WATCHDOG_READ_SCREWED,
+	CLOCKSOURCE_READ_SCREWED,
+	BOTH_READ_FINE,
+};
+
+static int cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
 {
 	unsigned int nretries;
-	u64 wd_end, wd_delta;
-	int64_t wd_delay;
+	u64 wd_end, wd_delta, wd_third;
+	int64_t wd_delay, wd_seq_delay;
+
 
 	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
 		local_irq_disable();
 		*wdnow = watchdog->read(watchdog);
 		*csnow = cs->read(cs);
 		wd_end = watchdog->read(watchdog);
+		wd_third = watchdog->read(watchdog);
 		local_irq_enable();
 
 		wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
 		wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
 					      watchdog->shift);
+		/*
+		 * If a big delay is seen between 2 consecutive reads of the
+		 * watchdog, then there must be something wrong with the watchdog
+		 * itself, or the system is under some extreme pressure; no need
+		 * to judge the clocksource for this run.
+		 */
+		wd_delta = clocksource_delta(wd_third, wd_end, watchdog->mask);
+		wd_seq_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
+						watchdog->shift);
+		if (wd_seq_delay > WATCHDOG_MAX_SKEW / 2)
+			return WATCHDOG_READ_SCREWED;
+
 		if (wd_delay <= WATCHDOG_MAX_SKEW) {
 			if (nretries > 1 || nretries >= max_cswd_read_retries) {
 				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
 					smp_processor_id(), watchdog->name, nretries);
 			}
-			return true;
+			return BOTH_READ_FINE;
 		}
 	}
 
 	pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
 		smp_processor_id(), watchdog->name, wd_delay, nretries);
-	return false;
+	return CLOCKSOURCE_READ_SCREWED;
 }
 
 static u64 csnow_mid;
@@ -373,10 +393,17 @@ static void clocksource_watchdog(struct timer_list *unused)
 			continue;
 		}
 
-		if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
+		switch (cs_watchdog_read(cs, &csnow, &wdnow)) {
+		case WATCHDOG_READ_SCREWED:
+			/* skip checking this 'cs' */
+			continue;
+		case CLOCKSOURCE_READ_SCREWED:
 			/* Clock readout unreliable, so give it up. */
 			__clocksource_unstable(cs);
 			continue;
+		case BOTH_READ_FINE:
+		default:
+			;
 		}
 
 		/* Clocksource initialized ? */
-- 
2.27.0

Thanks,
Feng

> The current watchdog clocksource skew threshold of 50us is found to be
> insufficient. So it is changed back to 100us before commit 2e27e793e280
> ("clocksource: Reduce clocksource-skew threshold") in patch 1. Patch 2
> adds a Kconfig option to allow kernel builder to control the actual
> threshold to be used.
> 
> Waiman Long (2):
>   clocksource: Avoid accidental unstable marking of clocksources
>   clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
> 
>  kernel/time/Kconfig       |  9 ++++++
>  kernel/time/clocksource.c | 58 ++++++++++++++++++++++++++++++---------
>  2 files changed, 54 insertions(+), 13 deletions(-)
> 
> -- 
> 2.27.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/2] clocksource: Avoid incorrect hpet fallback
  2021-11-11  1:23 ` Feng Tang
@ 2021-11-11  1:30   ` Waiman Long
  2021-11-11  1:53     ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2021-11-11  1:30 UTC (permalink / raw)
  To: Feng Tang, Paul E. McKenney
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, linux-kernel,
	Peter Zijlstra, Cassio Neri, Linus Walleij, Frederic Weisbecker


On 11/10/21 20:23, Feng Tang wrote:
> Hi Waiman, Paul,
>
> On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
>> It was found that when an x86 system was being stressed by running
>> various different benchmark suites, the clocksource watchdog might
>> occasionally mark TSC as unstable and fall back to hpet which will
>> have a signficant impact on system performance.
>   
> We've seen similar cases while running 'netperf' and 'lockbus/ioport'
> cases of 'stress-ng' tool.
>
> In those scenarios, the clocksource used by kernel is tsc, while
> hpet is used as watchdog. And when the "screwing" happens, we found
> mostly it's the hpet's 'fault', that when system is under extreme
> pressure, the read of hpet could take a long time, and even 2
> consecutive read of hpet will have a big gap (up to 1ms+) in between.
> So the screw we saw is actually caused by hpet instead of tsc, as
> tsc read is a lightweight cpu operation
>
> I tried the following patch to detect the screw of watchdog itself,
> and avoid wrongly judging the tsc to be unstable. It does help in
> our tests, please help to review.
>
> And one futher idea is to also adding 2 consecutive read of current
> clocksource, and compare its gap with watchdog's, and skip the check
> if the watchdog's is bigger.

That is what I found too. I also did a second watchdog read to compare
the consecutive read delay against half the threshold and skip the test
if it exceeds that. My patch is actually similar in concept to yours.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/2] clocksource: Avoid incorrect hpet fallback
  2021-11-11  1:30   ` Waiman Long
@ 2021-11-11  1:53     ` Feng Tang
  2021-11-11  3:07       ` Paul E. McKenney
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-11-11  1:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Paul E. McKenney, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Frederic Weisbecker

On Wed, Nov 10, 2021 at 08:30:10PM -0500, Waiman Long wrote:
> 
> On 11/10/21 20:23, Feng Tang wrote:
> > Hi Waiman, Paul,
> > 
> > On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> > > It was found that when an x86 system was being stressed by running
> > > various different benchmark suites, the clocksource watchdog might
> > > occasionally mark TSC as unstable and fall back to hpet which will
> > > have a signficant impact on system performance.
> > We've seen similar cases while running 'netperf' and 'lockbus/ioport'
> > cases of 'stress-ng' tool.
> > 
> > In those scenarios, the clocksource used by kernel is tsc, while
> > hpet is used as watchdog. And when the "screwing" happens, we found
> > mostly it's the hpet's 'fault', that when system is under extreme
> > pressure, the read of hpet could take a long time, and even 2
> > consecutive read of hpet will have a big gap (up to 1ms+) in between.
> > So the screw we saw is actually caused by hpet instead of tsc, as
> > tsc read is a lightweight cpu operation
> > 
> > I tried the following patch to detect the screw of watchdog itself,
> > and avoid wrongly judging the tsc to be unstable. It does help in
> > our tests, please help to review.
> > 
> > And one futher idea is to also adding 2 consecutive read of current
> > clocksource, and compare its gap with watchdog's, and skip the check
> > if the watchdog's is bigger.
> 
> That is what I found too. And I also did a 2nd watchdog read to compare the
> consecutive delay versus half the threshold and skip the test if it exceeds
> it. My patch is actually similar in concept to what your patch does.
 
Aha, yes, I missed that.

I just got to the office, saw the discussion around the 0/2 patch and
replied without going through the patches; sorry about that.

0day reported some cases from stress-ng testing, and we are still testing
the different cases we've seen.

Thanks,
Feng

> Cheers,
> Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/2] clocksource: Avoid incorrect hpet fallback
  2021-11-11  1:53     ` Feng Tang
@ 2021-11-11  3:07       ` Paul E. McKenney
  0 siblings, 0 replies; 24+ messages in thread
From: Paul E. McKenney @ 2021-11-11  3:07 UTC (permalink / raw)
  To: Feng Tang
  Cc: Waiman Long, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Frederic Weisbecker

On Thu, Nov 11, 2021 at 09:53:31AM +0800, Feng Tang wrote:
> On Wed, Nov 10, 2021 at 08:30:10PM -0500, Waiman Long wrote:
> > 
> > On 11/10/21 20:23, Feng Tang wrote:
> > > Hi Waiman, Paul,
> > > 
> > > On Wed, Nov 10, 2021 at 05:17:30PM -0500, Waiman Long wrote:
> > > > It was found that when an x86 system was being stressed by running
> > > > various different benchmark suites, the clocksource watchdog might
> > > > occasionally mark TSC as unstable and fall back to hpet which will
> > > > have a signficant impact on system performance.
> > > We've seen similar cases while running 'netperf' and 'lockbus/ioport'
> > > cases of 'stress-ng' tool.
> > > 
> > > In those scenarios, the clocksource used by kernel is tsc, while
> > > hpet is used as watchdog. And when the "screwing" happens, we found
> > > mostly it's the hpet's 'fault', that when system is under extreme
> > > pressure, the read of hpet could take a long time, and even 2
> > > consecutive read of hpet will have a big gap (up to 1ms+) in between.
> > > So the screw we saw is actually caused by hpet instead of tsc, as
> > > tsc read is a lightweight cpu operation
> > > 
> > > I tried the following patch to detect the screw of watchdog itself,
> > > and avoid wrongly judging the tsc to be unstable. It does help in
> > > our tests, please help to review.
> > > 
> > > And one futher idea is to also adding 2 consecutive read of current
> > > clocksource, and compare its gap with watchdog's, and skip the check
> > > if the watchdog's is bigger.
> > 
> > That is what I found too. And I also did a 2nd watchdog read to compare the
> > consecutive delay versus half the threshold and skip the test if it exceeds
> > it. My patch is actually similar in concept to what your patch does.
>  
> Aha, yes, I missed that.
> 
> I just got to office, and saw the disucssion around 0/2 patch and replied,
> without going through the patches, sorry about that.
> 
> 0day reported some cases about stress-ng testing, and we are still testing
> differenct cases we've seen.

Whichever way it happens, I agree that an extra read to correctly
attribute the latency could be very effective!  Good on you both!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-10 22:17 ` [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources Waiman Long
@ 2021-11-11  4:57   ` Feng Tang
  2021-11-11 14:43     ` Paul E. McKenney
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-11-11  4:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, linux-kernel,
	Peter Zijlstra, Cassio Neri, Linus Walleij, Colin Ian King,
	Frederic Weisbecker, Paul E. McKenney

On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> threshold"), it is found that tsc clocksource fallback to hpet can
> sometimes happen on both Intel and AMD systems especially when they are
> running stressful benchmarking workloads. Of the 23 systems tested with
> a v5.14 kernel, 10 of them have switched to hpet clock source during
> the test run.
> 
> The result of falling back to hpet is a drastic reduction of performance
> when running benchmarks. For example, the fio performance tests can
> drop up to 70% whereas the iperf3 performance can drop up to 80%.
> 
> 4 hpet fallbacks happened during bootup. They were:
> 
>   [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
>   [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
>   [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
>   [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> 
> Other fallbacks happen when the systems were running stressful
> benchmarks. For example:
> 
>   [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
>   [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> 
> Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> changed the skew margin from 100us to 50us. I think this is too small
> and can easily be exceeded when running some stressful workloads on
> a thermally stressed system.  So it is switched back to 100us. On
> the other hand, it doesn't look like we need to increase the minimum
> uncertainty margin. So it is kept the same at 100us too.
> 
> Even a maximum skew margin of 100us may be too small in for some systems
> when booting up especially if those systems are under thermal stress. To
> eliminate the case that the large skew is due to the system being too
> busy slowing down the reading of both the watchdog and the clocksource,
> a final check is done by reading watchdog time again and comparing the
> consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> that delay exceeds the limit, we assume that the system is just too
> busy. A warning will be printed to the console and the watchdog check
> is then skipped for this round. For example:
> 
>   [    8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
 

I think it may be better to add more details about the root cause.
Other than that, it looks good to me, as we have tested a similar patch
on our test platforms.

Reviewed-by: Feng Tang <feng.tang@intel.com>

Thanks,
Feng

> Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/time/clocksource.c | 53 +++++++++++++++++++++++++++++----------
>  1 file changed, 40 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index b8a14d2fb5ba..c503f5ab4101 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -107,7 +107,7 @@ static u64 suspend_start;
>   * This delay could be due to SMIs, NMIs, or to VCPU preemptions.  Used as
>   * a lower bound for cs->uncertainty_margin values when registering clocks.
>   */
> -#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)
> +#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
>  
>  #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
>  static void clocksource_watchdog_work(struct work_struct *work);
> @@ -205,11 +205,17 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
>  static int verify_n_cpus = 8;
>  module_param(verify_n_cpus, int, 0644);
>  
> -static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> +enum wd_read_status {
> +	WD_READ_SUCCESS,
> +	WD_READ_UNSTABLE,
> +	WD_READ_SKIP
> +};
> +
> +static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
>  {
>  	unsigned int nretries;
> -	u64 wd_end, wd_delta;
> -	int64_t wd_delay;
> +	u64 wd_end, wd_end2, wd_delta;
> +	int64_t wd_delay, wd_delay2;
>  
>  	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
>  		local_irq_disable();
> @@ -226,13 +232,30 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
>  				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
>  					smp_processor_id(), watchdog->name, nretries);
>  			}
> -			return true;
> +			return WD_READ_SUCCESS;
>  		}
>  	}
>  
> +	/*
> +	 * Now compute delay in consecutive watchdog read to see if there is
> +	 * too much external interferences that cause significant delay in
> +	 * reading both clocksource and watchdog.
> +	 *
> +	 * If the consecutive watchdog read-back delay > WATCHDOG_MAX_SKEW/2,
> +	 * report system busy and skip the current watchdog test.
> +	 */
> +	wd_end2 = watchdog->read(watchdog);
> +	wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
> +	wd_delay2 = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
> +	if (wd_delay2 > WATCHDOG_MAX_SKEW/2) {
> +		pr_warn("timekeeping watchdog on CPU%d: %s consecutive read-back delay of %lldns, system too busy\n",
> +			smp_processor_id(), watchdog->name, wd_delay2);
> +		return WD_READ_SKIP;
> +	}
> +
>  	pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
>  		smp_processor_id(), watchdog->name, wd_delay, nretries);
> -	return false;
> +	return WD_READ_UNSTABLE;
>  }
>  
>  static u64 csnow_mid;
> @@ -356,6 +379,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>  	int next_cpu, reset_pending;
>  	int64_t wd_nsec, cs_nsec;
>  	struct clocksource *cs;
> +	enum wd_read_status read_ret;
>  	u32 md;
>  
>  	spin_lock(&watchdog_lock);
> @@ -373,9 +397,12 @@ static void clocksource_watchdog(struct timer_list *unused)
>  			continue;
>  		}
>  
> -		if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> -			/* Clock readout unreliable, so give it up. */
> -			__clocksource_unstable(cs);
> +		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> +
> +		if (read_ret != WD_READ_SUCCESS) {
> +			if (read_ret == WD_READ_UNSTABLE)
> +				/* Clock readout unreliable, so give it up. */
> +				__clocksource_unstable(cs);
>  			continue;
>  		}
>  
> @@ -1061,7 +1088,7 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
>  	/*
>  	 * If the uncertainty margin is not specified, calculate it.
>  	 * If both scale and freq are non-zero, calculate the clock
> -	 * period, but bound below at 2*WATCHDOG_MAX_SKEW.  However,
> +	 * period, but bound below at WATCHDOG_MAX_SKEW.  However,
>  	 * if either of scale or freq is zero, be very conservative and
>  	 * take the tens-of-milliseconds WATCHDOG_THRESHOLD value for the
>  	 * uncertainty margin.  Allow stupidly small uncertainty margins
> @@ -1070,12 +1097,12 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
>  	 */
>  	if (scale && freq && !cs->uncertainty_margin) {
>  		cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
> -		if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
> -			cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
> +		if (cs->uncertainty_margin < WATCHDOG_MAX_SKEW)
> +			cs->uncertainty_margin = WATCHDOG_MAX_SKEW;
>  	} else if (!cs->uncertainty_margin) {
>  		cs->uncertainty_margin = WATCHDOG_THRESHOLD;
>  	}
> -	WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
> +	WARN_ON_ONCE(cs->uncertainty_margin < WATCHDOG_MAX_SKEW);
>  
>  	/*
>  	 * Ensure clocksources that have large 'mult' values don't overflow
> -- 
> 2.27.0

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-11  4:57   ` Feng Tang
@ 2021-11-11 14:43     ` Paul E. McKenney
  2021-11-12  5:44       ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2021-11-11 14:43 UTC (permalink / raw)
  To: Feng Tang
  Cc: Waiman Long, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker

On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> > threshold"), it is found that tsc clocksource fallback to hpet can
> > sometimes happen on both Intel and AMD systems especially when they are
> > running stressful benchmarking workloads. Of the 23 systems tested with
> > a v5.14 kernel, 10 of them have switched to hpet clock source during
> > the test run.
> > 
> > The result of falling back to hpet is a drastic reduction of performance
> > when running benchmarks. For example, the fio performance tests can
> > drop up to 70% whereas the iperf3 performance can drop up to 80%.
> > 
> > 4 hpet fallbacks happened during bootup. They were:
> > 
> >   [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> >   [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> >   [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> >   [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> > 
> > Other fallbacks happen when the systems were running stressful
> > benchmarks. For example:
> > 
> >   [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> >   [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> > 
> > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> > changed the skew margin from 100us to 50us. I think this is too small
> > and can easily be exceeded when running some stressful workloads on
> > a thermally stressed system.  So it is switched back to 100us. On
> > the other hand, it doesn't look like we need to increase the minimum
> > uncertainty margin. So it is kept the same at 100us too.
> > 
> > Even a maximum skew margin of 100us may be too small in for some systems
> > when booting up especially if those systems are under thermal stress. To
> > eliminate the case that the large skew is due to the system being too
> > busy slowing down the reading of both the watchdog and the clocksource,
> > a final check is done by reading watchdog time again and comparing the
> > consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> > that delay exceeds the limit, we assume that the system is just too
> > busy. A warning will be printed to the console and the watchdog check
> > is then skipped for this round. For example:
> > 
> >   [    8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
>  
> 
> I think it may be better to add more details about the root cause, other
> than that it looks good to me, as we tested similar patch on our test
> platforms.
> 
> Reviewed-by: Feng Tang <feng.tang@intel.com>

Thank you both!

I agree on the bit about root cause.  Would it make sense to compare the
difference between HPET reads 1 and 2 (containing the read of the TSC)
and the difference between HPET reads 2 and 3?  If the 2-1 difference was
no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
was no more than (say) 20 microseconds more than the 3-2 difference,
this could be considered a good-as-it-gets read, ending the retry loop.
Then if the 3-1 difference was greater than the default (100 microseconds
in current -rcu), that difference could be substituted for that particular
clocksource watchdog check.  With a console message noting the unusually
high overhead (but not a splat).

So if it took 75 microseconds for each HPET read and 1 microsecond for
the TSC read, then 226 microseconds would be substituted for the default
of 100 microseconds for that cycle's skew cutoff.  Unless the previous
skew cutoff was larger, in which case the previous cutoff should be
used instead.  Either way, the current cutoff is recorded for comparison
for the next clocksource watchdog check.

If the 3-1 difference was greater than 62.5 milliseconds, a warning should
probably be emitted anyway.
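
Something like the following completely untested sketch, where the
wd1/wd2/wd3 parameter names and the helper itself are made up purely for
illustration (they are not from Waiman's patch):

/*
 * Untested sketch only.  wd1/wd2/wd3 are the three watchdog (HPET)
 * reads, with the clocksource (TSC) read taken between wd1 and wd2.
 * *cutoff holds the previous skew cutoff in nanoseconds (initially
 * the 100us default) and is carried forward to the next check.
 */
static bool wd_reads_good_as_it_gets(u64 wd1, u64 wd2, u64 wd3,
				     int64_t *cutoff)
{
	int64_t d21, d32, d31;

	d21 = clocksource_cyc2ns(clocksource_delta(wd2, wd1, watchdog->mask),
				 watchdog->mult, watchdog->shift);
	d32 = clocksource_cyc2ns(clocksource_delta(wd3, wd2, watchdog->mask),
				 watchdog->mult, watchdog->shift);
	d31 = d21 + d32;

	/* If 2-1 is much worse than 3-2, this read is not as good as it gets. */
	if (d21 > d32 + d32 / 7 && d21 > d32 + 20 * NSEC_PER_USEC)
		return false;

	/* Widen this cycle's skew cutoff if the reads themselves were slow. */
	if (d31 > *cutoff)
		*cutoff = d31;
	if (d31 > 62500 * NSEC_PER_USEC)
		pr_warn("clocksource watchdog: read-back delay of %lldns\n", d31);
	return true;
}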

Or did you have something else in mind?

							Thanx, Paul

> Thanks,
> Feng
> 
> > Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
> > Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> > Signed-off-by: Waiman Long <longman@redhat.com>
> > ---
> >  kernel/time/clocksource.c | 53 +++++++++++++++++++++++++++++----------
> >  1 file changed, 40 insertions(+), 13 deletions(-)
> > 
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index b8a14d2fb5ba..c503f5ab4101 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -107,7 +107,7 @@ static u64 suspend_start;
> >   * This delay could be due to SMIs, NMIs, or to VCPU preemptions.  Used as
> >   * a lower bound for cs->uncertainty_margin values when registering clocks.
> >   */
> > -#define WATCHDOG_MAX_SKEW (50 * NSEC_PER_USEC)
> > +#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
> >  
> >  #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> >  static void clocksource_watchdog_work(struct work_struct *work);
> > @@ -205,11 +205,17 @@ EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> >  static int verify_n_cpus = 8;
> >  module_param(verify_n_cpus, int, 0644);
> >  
> > -static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > +enum wd_read_status {
> > +	WD_READ_SUCCESS,
> > +	WD_READ_UNSTABLE,
> > +	WD_READ_SKIP
> > +};
> > +
> > +static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> >  {
> >  	unsigned int nretries;
> > -	u64 wd_end, wd_delta;
> > -	int64_t wd_delay;
> > +	u64 wd_end, wd_end2, wd_delta;
> > +	int64_t wd_delay, wd_delay2;
> >  
> >  	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
> >  		local_irq_disable();
> > @@ -226,13 +232,30 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> >  				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> >  					smp_processor_id(), watchdog->name, nretries);
> >  			}
> > -			return true;
> > +			return WD_READ_SUCCESS;
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * Now compute delay in consecutive watchdog read to see if there is
> > +	 * too much external interferences that cause significant delay in
> > +	 * reading both clocksource and watchdog.
> > +	 *
> > +	 * If the consecutive watchdog read-back delay > WATCHDOG_MAX_SKEW/2,
> > +	 * report system busy and skip the current watchdog test.
> > +	 */
> > +	wd_end2 = watchdog->read(watchdog);
> > +	wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
> > +	wd_delay2 = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
> > +	if (wd_delay2 > WATCHDOG_MAX_SKEW/2) {
> > +		pr_warn("timekeeping watchdog on CPU%d: %s consecutive read-back delay of %lldns, system too busy\n",
> > +			smp_processor_id(), watchdog->name, wd_delay2);
> > +		return WD_READ_SKIP;
> > +	}
> > +
> >  	pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> >  		smp_processor_id(), watchdog->name, wd_delay, nretries);
> > -	return false;
> > +	return WD_READ_UNSTABLE;
> >  }
> >  
> >  static u64 csnow_mid;
> > @@ -356,6 +379,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> >  	int next_cpu, reset_pending;
> >  	int64_t wd_nsec, cs_nsec;
> >  	struct clocksource *cs;
> > +	enum wd_read_status read_ret;
> >  	u32 md;
> >  
> >  	spin_lock(&watchdog_lock);
> > @@ -373,9 +397,12 @@ static void clocksource_watchdog(struct timer_list *unused)
> >  			continue;
> >  		}
> >  
> > -		if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> > -			/* Clock readout unreliable, so give it up. */
> > -			__clocksource_unstable(cs);
> > +		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> > +
> > +		if (read_ret != WD_READ_SUCCESS) {
> > +			if (read_ret == WD_READ_UNSTABLE)
> > +				/* Clock readout unreliable, so give it up. */
> > +				__clocksource_unstable(cs);
> >  			continue;
> >  		}
> >  
> > @@ -1061,7 +1088,7 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
> >  	/*
> >  	 * If the uncertainty margin is not specified, calculate it.
> >  	 * If both scale and freq are non-zero, calculate the clock
> > -	 * period, but bound below at 2*WATCHDOG_MAX_SKEW.  However,
> > +	 * period, but bound below at WATCHDOG_MAX_SKEW.  However,
> >  	 * if either of scale or freq is zero, be very conservative and
> >  	 * take the tens-of-milliseconds WATCHDOG_THRESHOLD value for the
> >  	 * uncertainty margin.  Allow stupidly small uncertainty margins
> > @@ -1070,12 +1097,12 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
> >  	 */
> >  	if (scale && freq && !cs->uncertainty_margin) {
> >  		cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
> > -		if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
> > -			cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
> > +		if (cs->uncertainty_margin < WATCHDOG_MAX_SKEW)
> > +			cs->uncertainty_margin = WATCHDOG_MAX_SKEW;
> >  	} else if (!cs->uncertainty_margin) {
> >  		cs->uncertainty_margin = WATCHDOG_THRESHOLD;
> >  	}
> > -	WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
> > +	WARN_ON_ONCE(cs->uncertainty_margin < WATCHDOG_MAX_SKEW);
> >  
> >  	/*
> >  	 * Ensure clocksources that have large 'mult' values don't overflow
> > -- 
> > 2.27.0

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-11 14:43     ` Paul E. McKenney
@ 2021-11-12  5:44       ` Feng Tang
  2021-11-12 13:47         ` Paul E. McKenney
  2021-11-13  3:43         ` Waiman Long
  0 siblings, 2 replies; 24+ messages in thread
From: Feng Tang @ 2021-11-12  5:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Waiman Long, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker

On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> > On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> > > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> > > threshold"), it is found that tsc clocksource fallback to hpet can
> > > sometimes happen on both Intel and AMD systems especially when they are
> > > running stressful benchmarking workloads. Of the 23 systems tested with
> > > a v5.14 kernel, 10 of them have switched to hpet clock source during
> > > the test run.
> > > 
> > > The result of falling back to hpet is a drastic reduction of performance
> > > when running benchmarks. For example, the fio performance tests can
> > > drop up to 70% whereas the iperf3 performance can drop up to 80%.
> > > 
> > > 4 hpet fallbacks happened during bootup. They were:
> > > 
> > >   [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> > >   [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> > >   [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> > >   [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> > > 
> > > Other fallbacks happen when the systems were running stressful
> > > benchmarks. For example:
> > > 
> > >   [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> > >   [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> > > 
> > > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> > > changed the skew margin from 100us to 50us. I think this is too small
> > > and can easily be exceeded when running some stressful workloads on
> > > a thermally stressed system.  So it is switched back to 100us. On
> > > the other hand, it doesn't look like we need to increase the minimum
> > > uncertainty margin. So it is kept the same at 100us too.
> > > 
> > > Even a maximum skew margin of 100us may be too small in for some systems
> > > when booting up especially if those systems are under thermal stress. To
> > > eliminate the case that the large skew is due to the system being too
> > > busy slowing down the reading of both the watchdog and the clocksource,
> > > a final check is done by reading watchdog time again and comparing the
> > > consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> > > that delay exceeds the limit, we assume that the system is just too
> > > busy. A warning will be printed to the console and the watchdog check
> > > is then skipped for this round. For example:
> > > 
> > >   [    8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
> >  
> > 
> > I think it may be better to add more details about the root cause, other
> > than that it looks good to me, as we tested similar patch on our test
> > platforms.
> > 
> > Reviewed-by: Feng Tang <feng.tang@intel.com>
> 
> Thank you both!
> 
> I agree on the bit about root cause.  Would it make sense to compare the
> difference between HPET reads 1 and 2 (containing the read of the TSC)
> and the difference between HPET reads 2 and 3?  If the 2-1 difference was
> no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
> was no more than (say) 20 microseconds more than the 3-2 difference,
> this could be considered a good-as-it-gets read, ending the retry loop.
> Then if the 3-1 difference was greater than the default (100 microseconds
> in current -rcu), that difference could be substituted for that particular
> clocksource watchdog check.  With a console message noting the unusually
> high overhead (but not a splat).
> 
> So if it took 75 microseconds for each HPET read and 1 microsecond for
> the TSC read, then 226 microseconds would be substituted for the default
> of 100 microseconds for that cycle's skew cutoff.  Unless the previous
> skew cutoff was larger, in which case the previous cutoff should be
> used instead.  Either way, the current cutoff is recorded for comparison
> for the next clocksource watchdog check.
> 
> If the 3-1 difference was greater than 62.5 milliseconds, a warning should
> probably be emitted anyway.
 
I can test the patch with our cases that could reproduce the problem.

> Or did you have something else in mind?

I'm not sure about the details of Waiman's cases, but in our cases (stress-ng)
the delays between watchdog (HPET here) reads were not linear; the debug
data shows the 3-2 difference could sometimes be bigger or much bigger
than the 2-1 difference.

The reason could be that the gap between two reads depends hugely on the
system pressure at the time the three HPET reads happen. On our test box
(a 2-socket Cascade Lake AP server), the 2-1 and 3-2 differences are
normally a stable ~2.5 us, while under stress they can be bumped to
anywhere from 6 us to 2800 us.

So I think checking the 3-2 difference plus increasing the max retries
to 10 may be a simple way forward: if the watchdog read is found to be
abnormally long, we skip this round of checking.

Thanks,
Feng


> 
> 							Thanx, Paul
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-12  5:44       ` Feng Tang
@ 2021-11-12 13:47         ` Paul E. McKenney
  2021-11-13  3:43         ` Waiman Long
  1 sibling, 0 replies; 24+ messages in thread
From: Paul E. McKenney @ 2021-11-12 13:47 UTC (permalink / raw)
  To: Feng Tang
  Cc: Waiman Long, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker

On Fri, Nov 12, 2021 at 01:44:17PM +0800, Feng Tang wrote:
> On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> > > On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > > > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> > > > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> > > > threshold"), it is found that tsc clocksource fallback to hpet can
> > > > sometimes happen on both Intel and AMD systems especially when they are
> > > > running stressful benchmarking workloads. Of the 23 systems tested with
> > > > a v5.14 kernel, 10 of them have switched to hpet clock source during
> > > > the test run.
> > > > 
> > > > The result of falling back to hpet is a drastic reduction of performance
> > > > when running benchmarks. For example, the fio performance tests can
> > > > drop up to 70% whereas the iperf3 performance can drop up to 80%.
> > > > 
> > > > 4 hpet fallbacks happened during bootup. They were:
> > > > 
> > > >   [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> > > >   [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> > > >   [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> > > >   [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> > > > 
> > > > Other fallbacks happen when the systems were running stressful
> > > > benchmarks. For example:
> > > > 
> > > >   [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> > > >   [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> > > > 
> > > > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> > > > changed the skew margin from 100us to 50us. I think this is too small
> > > > and can easily be exceeded when running some stressful workloads on
> > > > a thermally stressed system.  So it is switched back to 100us. On
> > > > the other hand, it doesn't look like we need to increase the minimum
> > > > uncertainty margin. So it is kept the same at 100us too.
> > > > 
> > > > Even a maximum skew margin of 100us may be too small in for some systems
> > > > when booting up especially if those systems are under thermal stress. To
> > > > eliminate the case that the large skew is due to the system being too
> > > > busy slowing down the reading of both the watchdog and the clocksource,
> > > > a final check is done by reading watchdog time again and comparing the
> > > > consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> > > > that delay exceeds the limit, we assume that the system is just too
> > > > busy. A warning will be printed to the console and the watchdog check
> > > > is then skipped for this round. For example:
> > > > 
> > > >   [    8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
> > >  
> > > 
> > > I think it may be better to add more details about the root cause, other
> > > than that it looks good to me, as we tested similar patch on our test
> > > platforms.
> > > 
> > > Reviewed-by: Feng Tang <feng.tang@intel.com>
> > 
> > Thank you both!
> > 
> > I agree on the bit about root cause.  Would it make sense to compare the
> > difference between HPET reads 1 and 2 (containing the read of the TSC)
> > and the difference between HPET reads 2 and 3?  If the 2-1 difference was
> > no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
> > was no more than (say) 20 microseconds more than the 3-2 difference,
> > this could be considered a good-as-it-gets read, ending the retry loop.
> > Then if the 3-1 difference was greater than the default (100 microseconds
> > in current -rcu), that difference could be substituted for that particular
> > clocksource watchdog check.  With a console message noting the unusually
> > high overhead (but not a splat).
> > 
> > So if it took 75 microseconds for each HPET read and 1 microsecond for
> > the TSC read, then 226 microseconds would be substituted for the default
> > of 100 microseconds for that cycle's skew cutoff.  Unless the previous
> > skew cutoff was larger, in which case the previous cutoff should be
> > used instead.  Either way, the current cutoff is recorded for comparison
> > for the next clocksource watchdog check.
> > 
> > If the 3-1 difference was greater than 62.5 milliseconds, a warning should
> > probably be emitted anyway.
>  
> I can test the patch with our cases that could reproduce the problem.

Much appreciated!

> > Or did you have something else in mind?
> 
> I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
> the delay between watchdog's (HPET here) read were not linear, that
> from debug data, sometimes the 3-2 difference could be bigger or much
> bigger than the 2-1 difference.
> 
> The reason could be the gap between 2 reads depends hugely on the system
> pressure at that time that 3 HPET read happens. On our test box (a
> 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> about 2.5 us,  while under the stress it could be bumped to from 6 us
> to 2800 us.
> 
> So I think checking the 3-2 difference plus increasing the max retries
> to 10 may be a simple way, if the watchdog read is found to be
> abnormally long, we skip this round of check.

OK, given your data, it might indeed be best to keep things simple for
the time being.

So when you suggested adding more details about root cause, what did you
have in mind?  More details on Waiman's commit log?  Some other strategy
for extracting additional latency information?  Something else entirely?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-12  5:44       ` Feng Tang
  2021-11-12 13:47         ` Paul E. McKenney
@ 2021-11-13  3:43         ` Waiman Long
  2021-11-14 15:54           ` Paul E. McKenney
  1 sibling, 1 reply; 24+ messages in thread
From: Waiman Long @ 2021-11-13  3:43 UTC (permalink / raw)
  To: Feng Tang, Paul E. McKenney
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, linux-kernel,
	Peter Zijlstra, Cassio Neri, Linus Walleij, Colin Ian King,
	Frederic Weisbecker

On 11/12/21 00:44, Feng Tang wrote:
> On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
>> On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
>>> On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
>>>> Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
>>>> detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
>>>> threshold"), it is found that tsc clocksource fallback to hpet can
>>>> sometimes happen on both Intel and AMD systems especially when they are
>>>> running stressful benchmarking workloads. Of the 23 systems tested with
>>>> a v5.14 kernel, 10 of them have switched to hpet clock source during
>>>> the test run.
>>>>
>>>> The result of falling back to hpet is a drastic reduction of performance
>>>> when running benchmarks. For example, the fio performance tests can
>>>> drop up to 70% whereas the iperf3 performance can drop up to 80%.
>>>>
>>>> 4 hpet fallbacks happened during bootup. They were:
>>>>
>>>>    [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
>>>>    [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
>>>>    [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
>>>>    [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
>>>>
>>>> Other fallbacks happen when the systems were running stressful
>>>> benchmarks. For example:
>>>>
>>>>    [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
>>>>    [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
>>>>
>>>> Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
>>>> changed the skew margin from 100us to 50us. I think this is too small
>>>> and can easily be exceeded when running some stressful workloads on
>>>> a thermally stressed system.  So it is switched back to 100us. On
>>>> the other hand, it doesn't look like we need to increase the minimum
>>>> uncertainty margin. So it is kept the same at 100us too.
>>>>
>>>> Even a maximum skew margin of 100us may be too small in for some systems
>>>> when booting up especially if those systems are under thermal stress. To
>>>> eliminate the case that the large skew is due to the system being too
>>>> busy slowing down the reading of both the watchdog and the clocksource,
>>>> a final check is done by reading watchdog time again and comparing the
>>>> consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
>>>> that delay exceeds the limit, we assume that the system is just too
>>>> busy. A warning will be printed to the console and the watchdog check
>>>> is then skipped for this round. For example:
>>>>
>>>>    [    8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
>>>   
>>>
>>> I think it may be better to add more details about the root cause, other
>>> than that it looks good to me, as we tested similar patch on our test
>>> platforms.
>>>
>>> Reviewed-by: Feng Tang <feng.tang@intel.com>
>> Thank you both!
>>
>> I agree on the bit about root cause.  Would it make sense to compare the
>> difference between HPET reads 1 and 2 (containing the read of the TSC)
>> and the difference between HPET reads 2 and 3?  If the 2-1 difference was
>> no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
>> was no more than (say) 20 microseconds more than the 3-2 difference,
>> this could be considered a good-as-it-gets read, ending the retry loop.
>> Then if the 3-1 difference was greater than the default (100 microseconds
>> in current -rcu), that difference could be substituted for that particular
>> clocksource watchdog check.  With a console message noting the unusually
>> high overhead (but not a splat).
>>
>> So if it took 75 microseconds for each HPET read and 1 microsecond for
>> the TSC read, then 226 microseconds would be substituted for the default
>> of 100 microseconds for that cycle's skew cutoff.  Unless the previous
>> skew cutoff was larger, in which case the previous cutoff should be
>> used instead.  Either way, the current cutoff is recorded for comparison
>> for the next clocksource watchdog check.
>>
>> If the 3-1 difference was greater than 62.5 milliseconds, a warning should
>> probably be emitted anyway.
>   
> I can test the patch with our cases that could reproduce the problem.
>
>> Or did you have something else in mind?
> I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
> the delay between watchdog's (HPET here) read were not linear, that
> from debug data, sometimes the 3-2 difference could be bigger or much
> bigger than the 2-1 difference.
>
> The reason could be the gap between 2 reads depends hugely on the system
> pressure at that time that 3 HPET read happens. On our test box (a
> 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> about 2.5 us,  while under the stress it could be bumped to from 6 us
> to 2800 us.
>
> So I think checking the 3-2 difference plus increasing the max retries
> to 10 may be a simple way, if the watchdog read is found to be
> abnormally long, we skip this round of check.

On one of the test systems, I measured that the normal delay
(hpet->tsc->hpet) was a bit over 2us, and a bit more than 4us at bootup
time. However, the same system under stress could have a delay of over
200us at bootup time. When I measured the consecutive hpet delay, it was
about 180us. So the hpet reads did dominate the total clocksource read
delay.

I would not suggest increasing the max retries, as it may still fail in
most cases because the system stress will likely not go away within a
short time; we would just be wasting CPU time. I believe we should simply
skip the check if it is the watchdog read that is causing most of the
delay.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-13  3:43         ` Waiman Long
@ 2021-11-14 15:54           ` Paul E. McKenney
  2021-11-15  2:08             ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2021-11-14 15:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Feng Tang, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker

On Fri, Nov 12, 2021 at 10:43:15PM -0500, Waiman Long wrote:
> On 11/12/21 00:44, Feng Tang wrote:
> > On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> > > > On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > > > > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> > > > > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> > > > > threshold"), it is found that tsc clocksource fallback to hpet can
> > > > > sometimes happen on both Intel and AMD systems especially when they are
> > > > > running stressful benchmarking workloads. Of the 23 systems tested with
> > > > > a v5.14 kernel, 10 of them have switched to hpet clock source during
> > > > > the test run.
> > > > > 
> > > > > The result of falling back to hpet is a drastic reduction of performance
> > > > > when running benchmarks. For example, the fio performance tests can
> > > > > drop up to 70% whereas the iperf3 performance can drop up to 80%.
> > > > > 
> > > > > 4 hpet fallbacks happened during bootup. They were:
> > > > > 
> > > > >    [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> > > > >    [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> > > > >    [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> > > > >    [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> > > > > 
> > > > > Other fallbacks happen when the systems were running stressful
> > > > > benchmarks. For example:
> > > > > 
> > > > >    [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> > > > >    [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> > > > > 
> > > > > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> > > > > changed the skew margin from 100us to 50us. I think this is too small
> > > > > and can easily be exceeded when running some stressful workloads on
> > > > > a thermally stressed system.  So it is switched back to 100us. On
> > > > > the other hand, it doesn't look like we need to increase the minimum
> > > > > uncertainty margin. So it is kept the same at 100us too.
> > > > > 
> > > > > Even a maximum skew margin of 100us may be too small in for some systems
> > > > > when booting up especially if those systems are under thermal stress. To
> > > > > eliminate the case that the large skew is due to the system being too
> > > > > busy slowing down the reading of both the watchdog and the clocksource,
> > > > > a final check is done by reading watchdog time again and comparing the
> > > > > consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> > > > > that delay exceeds the limit, we assume that the system is just too
> > > > > busy. A warning will be printed to the console and the watchdog check
> > > > > is then skipped for this round. For example:
> > > > > 
> > > > >    [    8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
> > > > 
> > > > I think it may be better to add more details about the root cause, other
> > > > than that it looks good to me, as we tested similar patch on our test
> > > > platforms.
> > > > 
> > > > Reviewed-by: Feng Tang <feng.tang@intel.com>
> > > Thank you both!
> > > 
> > > I agree on the bit about root cause.  Would it make sense to compare the
> > > difference between HPET reads 1 and 2 (containing the read of the TSC)
> > > and the difference between HPET reads 2 and 3?  If the 2-1 difference was
> > > no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
> > > was no more than (say) 20 microseconds more than the 3-2 difference,
> > > this could be considered a good-as-it-gets read, ending the retry loop.
> > > Then if the 3-1 difference was greater than the default (100 microseconds
> > > in current -rcu), that difference could be substituted for that particular
> > > clocksource watchdog check.  With a console message noting the unusually
> > > high overhead (but not a splat).
> > > 
> > > So if it took 75 microseconds for each HPET read and 1 microsecond for
> > > the TSC read, then 226 microseconds would be substituted for the default
> > > of 100 microseconds for that cycle's skew cutoff.  Unless the previous
> > > skew cutoff was larger, in which case the previous cutoff should be
> > > used instead.  Either way, the current cutoff is recorded for comparison
> > > for the next clocksource watchdog check.
> > > 
> > > If the 3-1 difference was greater than 62.5 milliseconds, a warning should
> > > probably be emitted anyway.
> > I can test the patch with our cases that could reproduce the problem.
> > 
> > > Or did you have something else in mind?
> > I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
> > the delay between watchdog's (HPET here) read were not linear, that
> > from debug data, sometimes the 3-2 difference could be bigger or much
> > bigger than the 2-1 difference.
> > 
> > The reason could be the gap between 2 reads depends hugely on the system
> > pressure at that time that 3 HPET read happens. On our test box (a
> > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > about 2.5 us,  while under the stress it could be bumped to from 6 us
> > to 2800 us.
> > 
> > So I think checking the 3-2 difference plus increasing the max retries
> > to 10 may be a simple way, if the watchdog read is found to be
> > abnormally long, we skip this round of check.
> 
> On one of the test system, I had measured that normal delay
> (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> bootup time. However, the same system under stress could have a delay of
> over 200us at bootup time. When I measured the consecutive hpet delay, it
> was about 180us. So hpet read did dominate the total clocksource read delay.

Thank you both for the data!

> I would not suggest increasing the max retries as it may still fail in most
> cases because the system stress will likely not be going away within a short
> time. So we are likely just wasting cpu times. I believe we should just skip
> it if it is the watchdog read that is causing most of the delay.

If anything, adding that extra read would cause me to -reduce- the number
of retries to avoid increasing the per-watchdog overhead.

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-14 15:54           ` Paul E. McKenney
@ 2021-11-15  2:08             ` Feng Tang
  2021-11-15  3:24               ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-11-15  2:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Waiman Long, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker

On Sun, Nov 14, 2021 at 07:54:07AM -0800, Paul E. McKenney wrote:
> On Fri, Nov 12, 2021 at 10:43:15PM -0500, Waiman Long wrote:
> > On 11/12/21 00:44, Feng Tang wrote:
> > > On Thu, Nov 11, 2021 at 06:43:11AM -0800, Paul E. McKenney wrote:
> > > > On Thu, Nov 11, 2021 at 12:57:03PM +0800, Feng Tang wrote:
> > > > > On Wed, Nov 10, 2021 at 05:17:31PM -0500, Waiman Long wrote:
> > > > > > Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
> > > > > > detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
> > > > > > threshold"), it is found that tsc clocksource fallback to hpet can
> > > > > > sometimes happen on both Intel and AMD systems especially when they are
> > > > > > running stressful benchmarking workloads. Of the 23 systems tested with
> > > > > > a v5.14 kernel, 10 of them have switched to hpet clock source during
> > > > > > the test run.
> > > > > > 
> > > > > > The result of falling back to hpet is a drastic reduction of performance
> > > > > > when running benchmarks. For example, the fio performance tests can
> > > > > > drop up to 70% whereas the iperf3 performance can drop up to 80%.
> > > > > > 
> > > > > > 4 hpet fallbacks happened during bootup. They were:
> > > > > > 
> > > > > >    [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
> > > > > >    [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
> > > > > >    [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
> > > > > >    [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
> > > > > > 
> > > > > > Other fallbacks happen when the systems were running stressful
> > > > > > benchmarks. For example:
> > > > > > 
> > > > > >    [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
> > > > > >    [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
> > > > > > 
> > > > > > Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
> > > > > > changed the skew margin from 100us to 50us. I think this is too small
> > > > > > and can easily be exceeded when running some stressful workloads on
> > > > > > a thermally stressed system.  So it is switched back to 100us. On
> > > > > > the other hand, it doesn't look like we need to increase the minimum
> > > > > > uncertainty margin. So it is kept the same at 100us too.
> > > > > > 
> > > > > > Even a maximum skew margin of 100us may be too small in for some systems
> > > > > > when booting up especially if those systems are under thermal stress. To
> > > > > > eliminate the case that the large skew is due to the system being too
> > > > > > busy slowing down the reading of both the watchdog and the clocksource,
> > > > > > a final check is done by reading watchdog time again and comparing the
> > > > > > consecutive watchdog timing read delay against WATCHDOG_MAX_SKEW/2. If
> > > > > > that delay exceeds the limit, we assume that the system is just too
> > > > > > busy. A warning will be printed to the console and the watchdog check
> > > > > > is then skipped for this round. For example:
> > > > > > 
> > > > > >    [    8.789316] clocksource: timekeeping watchdog on CPU13: hpet consecutive read-back delay of 174541ns, system too busy
> > > > > 
> > > > > I think it may be better to add more details about the root cause, other
> > > > > than that it looks good to me, as we tested similar patch on our test
> > > > > platforms.
> > > > > 
> > > > > Reviewed-by: Feng Tang <feng.tang@intel.com>
> > > > Thank you both!
> > > > 
> > > > I agree on the bit about root cause.  Would it make sense to compare the
> > > > difference between HPET reads 1 and 2 (containing the read of the TSC)
> > > > and the difference between HPET reads 2 and 3?  If the 2-1 difference was
> > > > no more than (say) 8/7ths of the 3-2 difference, or the 2-1 difference
> > > > was no more than (say) 20 microseconds more than the 3-2 difference,
> > > > this could be considered a good-as-it-gets read, ending the retry loop.
> > > > Then if the 3-1 difference was greater than the default (100 microseconds
> > > > in current -rcu), that difference could be substituted for that particular
> > > > clocksource watchdog check.  With a console message noting the unusually
> > > > high overhead (but not a splat).
> > > > 
> > > > So if it took 75 microseconds for each HPET read and 1 microsecond for
> > > > the TSC read, then 226 microseconds would be substituted for the default
> > > > of 100 microseconds for that cycle's skew cutoff.  Unless the previous
> > > > skew cutoff was larger, in which case the previous cutoff should be
> > > > used instead.  Either way, the current cutoff is recorded for comparison
> > > > for the next clocksource watchdog check.
> > > > 
> > > > If the 3-1 difference was greater than 62.5 milliseconds, a warning should
> > > > probably be emitted anyway.
> > > I can test the patch with our cases that could reproduce the problem.
> > > 
> > > > Or did you have something else in mind?
> > > I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
> > > the delay between watchdog's (HPET here) read were not linear, that
> > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > bigger than the 2-1 difference.
> > > 
> > > The reason could be the gap between 2 reads depends hugely on the system
> > > pressure at that time that 3 HPET read happens. On our test box (a
> > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > about 2.5 us,  while under the stress it could be bumped to from 6 us
> > > to 2800 us.
> > > 
> > > So I think checking the 3-2 difference plus increasing the max retries
> > > to 10 may be a simple way, if the watchdog read is found to be
> > > abnormally long, we skip this round of check.
> > 
> > On one of the test system, I had measured that normal delay
> > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > bootup time. However, the same system under stress could have a delay of
> > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > was about 180us. So hpet read did dominate the total clocksource read delay.
> 
> Thank you both for the data!
> 
> > I would not suggest increasing the max retries as it may still fail in most
> > cases because the system stress will likely not be going away within a short
> > time. So we are likely just wasting cpu times. I believe we should just skip
> > it if it is the watchdog read that is causing most of the delay.
> 
> If anything, adding that extra read would cause me to -reduce- the number
> of retries to avoid increasing the per-watchdog overhead.
 
I understand Waiman's concern here. In our test patch, the consecutive
watchdog read delay check is done inside the retry loop alongside the
'cs' read, and once an abnormal delay is found, the watchdog check is
skipped without waiting for the max retries to complete.

Our test data show that the consecutive delay is not always big even
when the system is heavily stressed; that's why I suggest increasing the
retries.

Thanks,
Feng

> Thoughts?
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-15  2:08             ` Feng Tang
@ 2021-11-15  3:24               ` Waiman Long
  2021-11-15  7:59                 ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2021-11-15  3:24 UTC (permalink / raw)
  To: Feng Tang, Paul E. McKenney
  Cc: John Stultz, Thomas Gleixner, Stephen Boyd, linux-kernel,
	Peter Zijlstra, Cassio Neri, Linus Walleij, Colin Ian King,
	Frederic Weisbecker


On 11/14/21 21:08, Feng Tang wrote:
> Or did you have something else in mind?
>>>> I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
>>>> the delay between watchdog's (HPET here) read were not linear, that
>>>> from debug data, sometimes the 3-2 difference could be bigger or much
>>>> bigger than the 2-1 difference.
>>>>
>>>> The reason could be the gap between 2 reads depends hugely on the system
>>>> pressure at that time that 3 HPET read happens. On our test box (a
>>>> 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
>>>> about 2.5 us,  while under the stress it could be bumped to from 6 us
>>>> to 2800 us.
>>>>
>>>> So I think checking the 3-2 difference plus increasing the max retries
>>>> to 10 may be a simple way, if the watchdog read is found to be
>>>> abnormally long, we skip this round of check.
>>> On one of the test system, I had measured that normal delay
>>> (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
>>> bootup time. However, the same system under stress could have a delay of
>>> over 200us at bootup time. When I measured the consecutive hpet delay, it
>>> was about 180us. So hpet read did dominate the total clocksource read delay.
>> Thank you both for the data!
>>
>>> I would not suggest increasing the max retries as it may still fail in most
>>> cases because the system stress will likely not be going away within a short
>>> time. So we are likely just wasting cpu times. I believe we should just skip
>>> it if it is the watchdog read that is causing most of the delay.
>> If anything, adding that extra read would cause me to -reduce- the number
>> of retries to avoid increasing the per-watchdog overhead.
>   
> I understand Waiman's concern here, and in our test patch, the 2
> consecutive watchdog read delay check is done inside this retrying
> loop accompanying the 'cs' read, and once an abnormal delay is found,
> the watchdog check is skipped without waiting for the max-retries to
> complete.
>
> Our test data shows the consecutive delay is not always big even when
> the system is much stressed, that's why I suggest to increase the
> retries.

If we need a large number of retries to avoid triggering the unstable
TSC message, we should consider increasing the threshold instead. Right?

That is why my patch 2 makes the max skew value a configurable option so 
that we can tune it if necessary.
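
Roughly speaking, the C side of that wiring would look something like the
following (the Kconfig symbol name here is illustrative only; see patch 2
for the real one):

#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
#define MAX_SKEW_USEC	CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
#else
#define MAX_SKEW_USEC	100	/* keep the current 100us default */
#endif

#define WATCHDOG_MAX_SKEW	(MAX_SKEW_USEC * NSEC_PER_USEC)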

Cheers,
Longman


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-15  3:24               ` Waiman Long
@ 2021-11-15  7:59                 ` Feng Tang
  2021-11-15 14:07                   ` Paul E. McKenney
  2021-11-15 19:19                   ` Waiman Long
  0 siblings, 2 replies; 24+ messages in thread
From: Feng Tang @ 2021-11-15  7:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Paul E. McKenney, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker

On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
> 
> On 11/14/21 21:08, Feng Tang wrote:
> > Or did you have something else in mind?
> > > > > I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
> > > > > the delay between watchdog's (HPET here) read were not linear, that
> > > > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > > > bigger than the 2-1 difference.
> > > > > 
> > > > > The reason could be the gap between 2 reads depends hugely on the system
> > > > > pressure at that time that 3 HPET read happens. On our test box (a
> > > > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > > > about 2.5 us,  while under the stress it could be bumped to from 6 us
> > > > > to 2800 us.
> > > > > 
> > > > > So I think checking the 3-2 difference plus increasing the max retries
> > > > > to 10 may be a simple way, if the watchdog read is found to be
> > > > > abnormally long, we skip this round of check.
> > > > On one of the test system, I had measured that normal delay
> > > > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > > > bootup time. However, the same system under stress could have a delay of
> > > > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > > > was about 180us. So hpet read did dominate the total clocksource read delay.
> > > Thank you both for the data!
> > > 
> > > > I would not suggest increasing the max retries as it may still fail in most
> > > > cases because the system stress will likely not be going away within a short
> > > > time. So we are likely just wasting cpu times. I believe we should just skip
> > > > it if it is the watchdog read that is causing most of the delay.
> > > If anything, adding that extra read would cause me to -reduce- the number
> > > of retries to avoid increasing the per-watchdog overhead.
> > I understand Waiman's concern here, and in our test patch, the 2
> > consecutive watchdog read delay check is done inside this retrying
> > loop accompanying the 'cs' read, and once an abnormal delay is found,
> > the watchdog check is skipped without waiting for the max-retries to
> > complete.
> > 
> > Our test data shows the consecutive delay is not always big even when
> > the system is much stressed, that's why I suggest to increase the
> > retries.
> 
> If we need a large number of retries to avoid triggering the unstable TSC
> message, we should consider increase the threshod instead. Right?
> 
> That is why my patch 2 makes the max skew value a configurable option so
> that we can tune it if necessary.

I'm fine with it, though the ideal case I was hoping for is that with
carefully picked values for max_retries/skew_threshold we could save
users from having to configure these. But given the complexity of all
the HW out there, it's not an easy goal.

And I still suggest putting the consecutive watchdog read check inside
the retry loop, so that it can bail out early when an abnormal delay is
detected.
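
As a rough, untested sketch of that idea on top of cs_watchdog_read()
from the patch (the retry-count reporting is omitted here for brevity):

	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
		local_irq_disable();
		*wdnow = watchdog->read(watchdog);
		*csnow = cs->read(cs);
		wd_end = watchdog->read(watchdog);
		wd_end2 = watchdog->read(watchdog);
		local_irq_enable();

		wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
		wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
					      watchdog->shift);
		if (wd_delay <= WATCHDOG_MAX_SKEW)
			return WD_READ_SUCCESS;

		/*
		 * Two back-to-back watchdog reads took too long, so the
		 * system is too busy for a meaningful check; bail out of
		 * the retry loop right away instead of burning the
		 * remaining retries.
		 */
		wd_delta = clocksource_delta(wd_end2, wd_end, watchdog->mask);
		wd_delay2 = clocksource_cyc2ns(wd_delta, watchdog->mult,
					       watchdog->shift);
		if (wd_delay2 > WATCHDOG_MAX_SKEW / 2) {
			pr_warn("timekeeping watchdog on CPU%d: %s consecutive read-back delay of %lldns, system too busy\n",
				smp_processor_id(), watchdog->name, wd_delay2);
			return WD_READ_SKIP;
		}
	}

	pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
		smp_processor_id(), watchdog->name, wd_delay, nretries);
	return WD_READ_UNSTABLE;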

Another thing is that we may need to set 'watchdog_reset_pending':
under stress there could be many consecutive "skipped" watchdog checks,
and the saved values of 'cs' and 'watchdog' should be reset.
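
For the skip case, something along these lines in clocksource_watchdog()
might do it (again untested, reusing the existing watchdog_reset_pending
counter; the real handling may well end up looking different):

		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);

		if (read_ret == WD_READ_UNSTABLE) {
			/* Clock readout unreliable, so give it up. */
			__clocksource_unstable(cs);
			continue;
		}

		if (read_ret == WD_READ_SKIP) {
			/*
			 * The watchdog read itself was too slow, so the
			 * saved 'cs' and 'watchdog' values are suspect;
			 * force them to be re-initialized on the next pass.
			 */
			atomic_inc(&watchdog_reset_pending);
			continue;
		}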

Thanks,
Feng


> Cheers,
> Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-15  7:59                 ` Feng Tang
@ 2021-11-15 14:07                   ` Paul E. McKenney
  2021-11-16  1:36                     ` Feng Tang
  2021-11-15 19:19                   ` Waiman Long
  1 sibling, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2021-11-15 14:07 UTC (permalink / raw)
  To: Feng Tang
  Cc: Waiman Long, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker

On Mon, Nov 15, 2021 at 03:59:15PM +0800, Feng Tang wrote:
> On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
> > 
> > On 11/14/21 21:08, Feng Tang wrote:
> > > Or did you have something else in mind?
> > > > > > I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
> > > > > > the delay between watchdog's (HPET here) read were not linear, that
> > > > > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > > > > bigger than the 2-1 difference.
> > > > > > 
> > > > > > The reason could be the gap between 2 reads depends hugely on the system
> > > > > > pressure at that time that 3 HPET read happens. On our test box (a
> > > > > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > > > > about 2.5 us,  while under the stress it could be bumped to from 6 us
> > > > > > to 2800 us.
> > > > > > 
> > > > > > So I think checking the 3-2 difference plus increasing the max retries
> > > > > > to 10 may be a simple way, if the watchdog read is found to be
> > > > > > abnormally long, we skip this round of check.
> > > > > On one of the test system, I had measured that normal delay
> > > > > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > > > > bootup time. However, the same system under stress could have a delay of
> > > > > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > > > > was about 180us. So hpet read did dominate the total clocksource read delay.
> > > > Thank you both for the data!
> > > > 
> > > > > I would not suggest increasing the max retries as it may still fail in most
> > > > > cases because the system stress will likely not be going away within a short
> > > > > time. So we are likely just wasting cpu times. I believe we should just skip
> > > > > it if it is the watchdog read that is causing most of the delay.
> > > > If anything, adding that extra read would cause me to -reduce- the number
> > > > of retries to avoid increasing the per-watchdog overhead.
> > > I understand Waiman's concern here, and in our test patch, the 2
> > > consecutive watchdog read delay check is done inside this retrying
> > > loop accompanying the 'cs' read, and once an abnormal delay is found,
> > > the watchdog check is skipped without waiting for the max-retries to
> > > complete.
> > > 
> > > Our test data shows the consecutive delay is not always big even when
> > > the system is much stressed, that's why I suggest to increase the
> > > retries.
> > 
> > If we need a large number of retries to avoid triggering the unstable TSC
> > message, we should consider increase the threshod instead. Right?
> > 
> > That is why my patch 2 makes the max skew value a configurable option so
> > that we can tune it if necessary.
> 
> I'm fine with it, though the ideal case I expected is with carefully
> picked values for max_retries/screw_threshhold, we could save the users
> from configuring these. But given the complexity of all HWs out there,
> it's not an easy goal.

That is my goal as well, but I expect that more experience, testing,
and patches will be required to reach that goal.

> And I still suggest to put the consecutive watchdog read check inside
> the retry loop, so that it could bail out early when detecting the
> abnormal delay.

If the HPET read shows abnormal delay, agreed.  But if the abnormal
delay is only in the clocksource under test (TSC in this case), then
a re-read seems to me to make sense.

> Another thing is we may need to set the 'watchdog_reset_pending', as
> under the stress, there could be consecutive many times of "skipping"
> watchdog check, and the saved value of 'cs' and 'watchdog' should be
> reset.

My thought was to count a read failure only if the HPET read did not
have excessive delays.  This means that a cache-buster workload could 
indefinitely delay a clock-skew check, which was one reason that I
was thinking in terms of using the actual measured delays to set the
clock-skew check criterion.

Either way, something like Waiman's patch checking the HPET delay looks
to me to be valuable.

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-15  7:59                 ` Feng Tang
  2021-11-15 14:07                   ` Paul E. McKenney
@ 2021-11-15 19:19                   ` Waiman Long
  1 sibling, 0 replies; 24+ messages in thread
From: Waiman Long @ 2021-11-15 19:19 UTC (permalink / raw)
  To: Feng Tang
  Cc: Paul E. McKenney, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker


On 11/15/21 02:59, Feng Tang wrote:
> On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
>> On 11/14/21 21:08, Feng Tang wrote:
>>> Or did you have something else in mind?
>>>>>> I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
>>>>>> the delay between watchdog's (HPET here) read were not linear, that
>>>>>> from debug data, sometimes the 3-2 difference could be bigger or much
>>>>>> bigger than the 2-1 difference.
>>>>>>
>>>>>> The reason could be the gap between 2 reads depends hugely on the system
>>>>>> pressure at that time that 3 HPET read happens. On our test box (a
>>>>>> 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
>>>>>> about 2.5 us,  while under the stress it could be bumped to from 6 us
>>>>>> to 2800 us.
>>>>>>
>>>>>> So I think checking the 3-2 difference plus increasing the max retries
>>>>>> to 10 may be a simple way, if the watchdog read is found to be
>>>>>> abnormally long, we skip this round of check.
>>>>> On one of the test system, I had measured that normal delay
>>>>> (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
>>>>> bootup time. However, the same system under stress could have a delay of
>>>>> over 200us at bootup time. When I measured the consecutive hpet delay, it
>>>>> was about 180us. So hpet read did dominate the total clocksource read delay.
>>>> Thank you both for the data!
>>>>
>>>>> I would not suggest increasing the max retries as it may still fail in most
>>>>> cases because the system stress will likely not be going away within a short
>>>>> time. So we are likely just wasting cpu times. I believe we should just skip
>>>>> it if it is the watchdog read that is causing most of the delay.
>>>> If anything, adding that extra read would cause me to -reduce- the number
>>>> of retries to avoid increasing the per-watchdog overhead.
>>> I understand Waiman's concern here, and in our test patch, the 2
>>> consecutive watchdog read delay check is done inside this retrying
>>> loop accompanying the 'cs' read, and once an abnormal delay is found,
>>> the watchdog check is skipped without waiting for the max-retries to
>>> complete.
>>>
>>> Our test data shows the consecutive delay is not always big even when
>>> the system is much stressed, that's why I suggest to increase the
>>> retries.
>> If we need a large number of retries to avoid triggering the unstable TSC
>> message, we should consider increase the threshod instead. Right?
>>
>> That is why my patch 2 makes the max skew value a configurable option so
>> that we can tune it if necessary.
> I'm fine with it, though the ideal case I expected is with carefully
> picked values for max_retries/screw_threshhold, we could save the users
> from configuring these. But given the complexity of all HWs out there,
> it's not an easy goal.
>
> And I still suggest to put the consecutive watchdog read check inside
> the retry loop, so that it could bail out early when detecting the
> abnormal delay.
Yes, I think that makes sense.
>
> Another thing is we may need to set the 'watchdog_reset_pending', as
> under the stress, there could be consecutive many times of "skipping"
> watchdog check, and the saved value of 'cs' and 'watchdog' should be
> reset.

I agree with that too. IOW, we can combine the best parts of both patches.
I will post an updated patch for that.
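
Roughly what I have in mind is something like the sketch below. This is an
untested illustration against kernel/time/clocksource.c, not the actual
patch: the WD_READ_* return values, the extra back-to-back watchdog read,
and the use of WATCHDOG_MAX_SKEW/2 as the cutoff are illustrative choices
that may change in the posted version.

enum wd_read_status {
	WD_READ_SUCCESS,
	WD_READ_UNSTABLE,
	WD_READ_SKIP
};

static enum wd_read_status cs_watchdog_read(struct clocksource *cs,
					    u64 *csnow, u64 *wdnow)
{
	u64 wd_end, wd_end2, wd_delay, wd_seq_delay;
	int nretries;

	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
		local_irq_disable();
		*wdnow = watchdog->read(watchdog);
		*csnow = cs->read(cs);
		wd_end = watchdog->read(watchdog);
		wd_end2 = watchdog->read(watchdog); /* consecutive wd read */
		local_irq_enable();

		/* Delay of the whole wd->cs->wd window */
		wd_delay = clocksource_cyc2ns(clocksource_delta(wd_end, *wdnow,
				watchdog->mask), watchdog->mult, watchdog->shift);
		if (wd_delay <= WATCHDOG_MAX_SKEW / 2)
			return WD_READ_SUCCESS;

		/*
		 * If even two back-to-back watchdog reads take too long,
		 * the delay is in the watchdog itself, not in 'cs'. Skip
		 * this round instead of burning the remaining retries.
		 */
		wd_seq_delay = clocksource_cyc2ns(clocksource_delta(wd_end2, wd_end,
				watchdog->mask), watchdog->mult, watchdog->shift);
		if (wd_seq_delay > WATCHDOG_MAX_SKEW / 2)
			return WD_READ_SKIP;
	}

	/* Retries exhausted with the delay in 'cs': let it be marked unstable */
	return WD_READ_UNSTABLE;
}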

Cheers,
Longman


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-15 14:07                   ` Paul E. McKenney
@ 2021-11-16  1:36                     ` Feng Tang
  2021-11-16 20:36                       ` Paul E. McKenney
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-11-16  1:36 UTC (permalink / raw)
  To: Paul E. McKenney, Waiman Long
  Cc: Waiman Long, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker

On Mon, Nov 15, 2021 at 06:07:09AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 15, 2021 at 03:59:15PM +0800, Feng Tang wrote:
> > On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
> > > 
> > > On 11/14/21 21:08, Feng Tang wrote:
> > > > Or did you have something else in mind?
> > > > > > > I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
> > > > > > > the delay between watchdog's (HPET here) read were not linear, that
> > > > > > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > > > > > bigger than the 2-1 difference.
> > > > > > > 
> > > > > > > The reason could be the gap between 2 reads depends hugely on the system
> > > > > > > pressure at that time that 3 HPET read happens. On our test box (a
> > > > > > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > > > > > about 2.5 us,  while under the stress it could be bumped to from 6 us
> > > > > > > to 2800 us.
> > > > > > > 
> > > > > > > So I think checking the 3-2 difference plus increasing the max retries
> > > > > > > to 10 may be a simple way, if the watchdog read is found to be
> > > > > > > abnormally long, we skip this round of check.
> > > > > > On one of the test system, I had measured that normal delay
> > > > > > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > > > > > bootup time. However, the same system under stress could have a delay of
> > > > > > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > > > > > was about 180us. So hpet read did dominate the total clocksource read delay.
> > > > > Thank you both for the data!
> > > > > 
> > > > > > I would not suggest increasing the max retries as it may still fail in most
> > > > > > cases because the system stress will likely not be going away within a short
> > > > > > time. So we are likely just wasting cpu times. I believe we should just skip
> > > > > > it if it is the watchdog read that is causing most of the delay.
> > > > > If anything, adding that extra read would cause me to -reduce- the number
> > > > > of retries to avoid increasing the per-watchdog overhead.
> > > > I understand Waiman's concern here, and in our test patch, the 2
> > > > consecutive watchdog read delay check is done inside this retrying
> > > > loop accompanying the 'cs' read, and once an abnormal delay is found,
> > > > the watchdog check is skipped without waiting for the max-retries to
> > > > complete.
> > > > 
> > > > Our test data shows the consecutive delay is not always big even when
> > > > the system is much stressed, that's why I suggest to increase the
> > > > retries.
> > > 
> > > If we need a large number of retries to avoid triggering the unstable TSC
> > > message, we should consider increasing the threshold instead. Right?
> > > 
> > > That is why my patch 2 makes the max skew value a configurable option so
> > > that we can tune it if necessary.
> > 
> > I'm fine with it, though the ideal case I expected is with carefully
> > picked values for max_retries/skew_threshold, we could save the users
> > from configuring these. But given the complexity of all HWs out there,
> > it's not an easy goal.
> 
> That is my goal as well, but I expect that more experience, testing,
> and patches will be required to reach that goal.
> 
> > And I still suggest to put the consecutive watchdog read check inside
> > the retry loop, so that it could bail out early when detecting the
> > abnormal delay.
> 
> If the HPET read shows abnormal delay, agreed.  But if the abnormal
> delay is only in the clocksource under test (TSC in this case), then
> a re-read seems to me to make sense.

Yes, I agree. The retry logic you introduced does help to filter out
many false alarms from the watchdog.

> > Another thing is we may need to set the 'watchdog_reset_pending', as
> > under the stress, there could be consecutive many times of "skipping"
> > watchdog check, and the saved value of 'cs' and 'watchdog' should be
> > reset.
> 
> My thought was to count a read failure only if the HPET read did not
> have excessive delays.  This means that a cache-buster workload could 
> indefinitely delay a clock-skew check, which was one reason that I
> was thinking in terms of using the actual measured delays to set the
> clock-skew check criterion.
> 
> Either way, something like Waiman's patch checking the HPET delay looks
> to me to be valuable.

Yes, and Waiman is working on a new version.
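
For the skip handling, the caller side could look roughly like the fragment
below. This is only an illustrative sketch of the per-clocksource loop in
clocksource_watchdog(); cs_watchdog_read() returning a tri-state
(success / unstable / skip) is an assumed interface here, not what is in
the tree today:

	switch (cs_watchdog_read(cs, &csnow, &wdnow)) {
	case WD_READ_UNSTABLE:
		/* Readout unreliable even after retries: give it up */
		__clocksource_unstable(cs);
		continue;
	case WD_READ_SKIP:
		/*
		 * The watchdog reads themselves were slow (stressed
		 * system), so the skew data is meaningless: skip this
		 * round and make sure the saved 'cs'/'watchdog' values
		 * get re-initialized on the next pass.
		 */
		atomic_inc(&watchdog_reset_pending);
		continue;
	case WD_READ_SUCCESS:
		break;
	}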

btw, here is our easy reproducer (the case you have worked on with Oliver
Sang), running stress-ng's ioport case (192 is the number of CPUs on the
test box):

 sudo stress-ng --timeout 30 --times --verify --metrics-brief --ioport 192

Thanks,
Feng

> Thoughts?
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources
  2021-11-16  1:36                     ` Feng Tang
@ 2021-11-16 20:36                       ` Paul E. McKenney
  0 siblings, 0 replies; 24+ messages in thread
From: Paul E. McKenney @ 2021-11-16 20:36 UTC (permalink / raw)
  To: Feng Tang
  Cc: Waiman Long, John Stultz, Thomas Gleixner, Stephen Boyd,
	linux-kernel, Peter Zijlstra, Cassio Neri, Linus Walleij,
	Colin Ian King, Frederic Weisbecker

On Tue, Nov 16, 2021 at 09:36:51AM +0800, Feng Tang wrote:
> On Mon, Nov 15, 2021 at 06:07:09AM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 15, 2021 at 03:59:15PM +0800, Feng Tang wrote:
> > > On Sun, Nov 14, 2021 at 10:24:56PM -0500, Waiman Long wrote:
> > > > 
> > > > On 11/14/21 21:08, Feng Tang wrote:
> > > > > Or did you have something else in mind?
> > > > > > > > I'm not sure the detail in  Waiman's cases, and in our cases (stress-ng)
> > > > > > > > the delay between watchdog's (HPET here) read were not linear, that
> > > > > > > > from debug data, sometimes the 3-2 difference could be bigger or much
> > > > > > > > bigger than the 2-1 difference.
> > > > > > > > 
> > > > > > > > The reason could be the gap between 2 reads depends hugely on the system
> > > > > > > > pressure at that time that 3 HPET read happens. On our test box (a
> > > > > > > > 2-Socket Cascade Lake AP server), the 2-1 and 3-2 difference are stably
> > > > > > > > about 2.5 us,  while under the stress it could be bumped to from 6 us
> > > > > > > > to 2800 us.
> > > > > > > > 
> > > > > > > > So I think checking the 3-2 difference plus increasing the max retries
> > > > > > > > to 10 may be a simple way, if the watchdog read is found to be
> > > > > > > > abnormally long, we skip this round of check.
> > > > > > > On one of the test system, I had measured that normal delay
> > > > > > > (hpet->tsc->hpet) was normally a bit over 2us. It was a bit more than 4us at
> > > > > > > bootup time. However, the same system under stress could have a delay of
> > > > > > > over 200us at bootup time. When I measured the consecutive hpet delay, it
> > > > > > > was about 180us. So hpet read did dominate the total clocksource read delay.
> > > > > > Thank you both for the data!
> > > > > > 
> > > > > > > I would not suggest increasing the max retries as it may still fail in most
> > > > > > > cases because the system stress will likely not be going away within a short
> > > > > > > time. So we are likely just wasting cpu times. I believe we should just skip
> > > > > > > it if it is the watchdog read that is causing most of the delay.
> > > > > > If anything, adding that extra read would cause me to -reduce- the number
> > > > > > of retries to avoid increasing the per-watchdog overhead.
> > > > > I understand Waiman's concern here, and in our test patch, the 2
> > > > > consecutive watchdog read delay check is done inside this retrying
> > > > > loop accompanying the 'cs' read, and once an abnormal delay is found,
> > > > > the watchdog check is skipped without waiting for the max-retries to
> > > > > complete.
> > > > > 
> > > > > Our test data shows the consecutive delay is not always big even when
> > > > > the system is much stressed, that's why I suggest to increase the
> > > > > retries.
> > > > 
> > > > If we need a large number of retries to avoid triggering the unstable TSC
> > > > message, we should consider increasing the threshold instead. Right?
> > > > 
> > > > That is why my patch 2 makes the max skew value a configurable option so
> > > > that we can tune it if necessary.
> > > 
> > > I'm fine with it, though the ideal case I expected is with carefully
> > > picked values for max_retries/skew_threshold, we could save the users
> > > from configuring these. But given the complexity of all HWs out there,
> > > it's not an easy goal.
> > 
> > That is my goal as well, but I expect that more experience, testing,
> > and patches will be required to reach that goal.
> > 
> > > And I still suggest to put the consecutive watchdog read check inside
> > > the retry loop, so that it could bail out early when detecting the
> > > abnormal delay.
> > 
> > If the HPET read shows abnormal delay, agreed.  But if the abnormal
> > delay is only in the clocksource under test (TSC in this case), then
> > a re-read seems to me to make sense.
> 
> Yes, I agree. The retry logic you introduced does help to filter out
> many false alarms from the watchdog.
> 
> > > Another thing is we may need to set the 'watchdog_reset_pending', as
> > > under the stress, there could be consecutive many times of "skipping"
> > > watchdog check, and the saved value of 'cs' and 'watchdog' should be
> > > reset.
> > 
> > My thought was to count a read failure only if the HPET read did not
> > have excessive delays.  This means that a cache-buster workload could 
> > indefinitely delay a clock-skew check, which was one reason that I
> > was thinking in terms of using the actual measured delays to set the
> > clock-skew check criterion.
> > 
> > Either way, something like Waiman's patch checking the HPET delay looks
> > to me to be valuable.
> 
> Yes, and Waiman is working on a new version.

Looking forward to seeing it!

> btw, here is our easy reproducer (the case you have worked on with Oliver
> Sang), running stress-ng's ioport case (192 is the number of CPUs on the
> test box):
> 
>  sudo stress-ng --timeout 30 --times --verify --metrics-brief --ioport 192

Good to know, thank you!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-11-16 20:36 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-10 22:17 [PATCH 0/2] clocksource: Avoid incorrect hpet fallback Waiman Long
2021-11-10 22:17 ` [PATCH 1/2] clocksource: Avoid accidental unstable marking of clocksources Waiman Long
2021-11-11  4:57   ` Feng Tang
2021-11-11 14:43     ` Paul E. McKenney
2021-11-12  5:44       ` Feng Tang
2021-11-12 13:47         ` Paul E. McKenney
2021-11-13  3:43         ` Waiman Long
2021-11-14 15:54           ` Paul E. McKenney
2021-11-15  2:08             ` Feng Tang
2021-11-15  3:24               ` Waiman Long
2021-11-15  7:59                 ` Feng Tang
2021-11-15 14:07                   ` Paul E. McKenney
2021-11-16  1:36                     ` Feng Tang
2021-11-16 20:36                       ` Paul E. McKenney
2021-11-15 19:19                   ` Waiman Long
2021-11-10 22:17 ` [PATCH 2/2] clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW Waiman Long
2021-11-10 22:32 ` [PATCH 0/2] clocksource: Avoid incorrect hpet fallback Paul E. McKenney
2021-11-10 23:25   ` Waiman Long
2021-11-11  0:04     ` Paul E. McKenney
2021-11-11  1:19       ` Waiman Long
2021-11-11  1:23 ` Feng Tang
2021-11-11  1:30   ` Waiman Long
2021-11-11  1:53     ` Feng Tang
2021-11-11  3:07       ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).