linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3
@ 2023-01-25  0:27 Paul E. McKenney
  2023-01-25  0:27 ` [PATCH v2 clocksource 1/7] clocksource: Print clocksource name when clocksource is tested unstable Paul E. McKenney
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-01-25  0:27 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing

Hello!

This series contains clocksource-watchdog updates:

1.	Print clocksource name when clocksource is tested unstable,
	courtesy of Yunying Sun.

2.	Loosen clocksource watchdog constraints.

3.	Improve read-back-delay message.

4.	Improve "skew is too large" messages.

5.	Suspend the watchdog temporarily when high read latency detected,
	courtesy of Feng Tang.

6.	Verify HPET and PMTMR when TSC unverified.

7.	x86/tsc: Add option to force frequency recalibration with HW
	timer, courtesy of Feng Tang.

Changes since v1:

o	Add system descriptions to #5 and #6 in response to questions
	from Thomas Gleixner.

o	Add patch #7.

						Thanx, Paul

------------------------------------------------------------------------

 arch/x86/kernel/tsc.c                             |   34 +++++++++-
 b/Documentation/admin-guide/kernel-parameters.txt |    4 +
 b/arch/x86/include/asm/time.h                     |    1 
 b/arch/x86/kernel/hpet.c                          |    2 
 b/arch/x86/kernel/tsc.c                           |    5 +
 b/drivers/clocksource/acpi_pm.c                   |    6 +
 b/kernel/time/Kconfig                             |    6 +
 b/kernel/time/clocksource.c                       |    4 -
 kernel/time/clocksource.c                         |   72 +++++++++++++++-------
 9 files changed, 104 insertions(+), 30 deletions(-)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 clocksource 1/7] clocksource: Print clocksource name when clocksource is tested unstable
  2023-01-25  0:27 [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3 Paul E. McKenney
@ 2023-01-25  0:27 ` Paul E. McKenney
  2023-01-25  0:27 ` [PATCH v2 clocksource 2/7] clocksource: Loosen clocksource watchdog constraints Paul E. McKenney
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-01-25  0:27 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing, Yunying Sun,
	Paul E . McKenney

From: Yunying Sun <yunying.sun@intel.com>

Some "TSC fall back to HPET" messages appear on systems having more than
2 NUMA nodes:

clocksource: timekeeping watchdog on CPU168: hpet read-back delay of 4296200ns, attempt 4, marking unstable

The "hpet" here is misleading the clocksource watchdog is really
doing repeated reads of "hpet" in order to check for unrelated delays.
Therefore, print the name of the clocksource under test, prefixed by
"wd-" and suffixed by "-wd", for example, "wd-tsc-wd".

Signed-off-by: Yunying Sun <yunying.sun@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/time/clocksource.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 9cf32ccda715d..4a2c3bb92e2e9 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -257,8 +257,8 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
 			goto skip_test;
 	}
 
-	pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
-		smp_processor_id(), watchdog->name, wd_delay, nretries);
+	pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd read-back delay of %lldns, attempt %d, marking unstable\n",
+		smp_processor_id(), cs->name, wd_delay, nretries);
 	return WD_READ_UNSTABLE;
 
 skip_test:
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 clocksource 2/7] clocksource: Loosen clocksource watchdog constraints
  2023-01-25  0:27 [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3 Paul E. McKenney
  2023-01-25  0:27 ` [PATCH v2 clocksource 1/7] clocksource: Print clocksource name when clocksource is tested unstable Paul E. McKenney
@ 2023-01-25  0:27 ` Paul E. McKenney
  2023-01-25  0:27 ` [PATCH v2 clocksource 3/7] clocksource: Improve read-back-delay message Paul E. McKenney
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-01-25  0:27 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Paul E. McKenney

Currently, MAX_SKEW_USEC is set to 100 microseconds, which has worked
reasonably well.  However, NTP is willing to tolerate 500 microseconds
of skew per second, and a clocksource that is good enough for NTP should
be good enough for the clocksource watchdog.  The watchdog's skew is
controlled by MAX_SKEW_USEC and the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
Kconfig option.  However, these values are doubled before being associated
with a clocksource's ->uncertainty_margin, and the ->uncertainty_margin
values of the pair of clocksource's being compared are summed before
checking against the skew.

Therefore, set both MAX_SKEW_USEC and the default for the
CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option to 125 microseconds of
skew per second, resulting in 500 microseconds of skew per second in
the clocksource watchdog's skew comparison.

Suggested-by Rik van Riel <riel@surriel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/time/Kconfig       |  6 +++++-
 kernel/time/clocksource.c | 15 +++++++++------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index a41753be1a2bf..bae8f11070bef 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -200,10 +200,14 @@ config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
 	int "Clocksource watchdog maximum allowable skew (in μs)"
 	depends on CLOCKSOURCE_WATCHDOG
 	range 50 1000
-	default 100
+	default 125
 	help
 	  Specify the maximum amount of allowable watchdog skew in
 	  microseconds before reporting the clocksource to be unstable.
+	  The default is based on a half-second clocksource watchdog
+	  interval and NTP's maximum frequency drift of 500 parts
+	  per million.	If the clocksource is good enough for NTP,
+	  it is good enough for the clocksource watchdog!
 
 endmenu
 endif
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 4a2c3bb92e2e9..a3d19f6660ac7 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -95,6 +95,11 @@ static char override_name[CS_NAME_LEN];
 static int finished_booting;
 static u64 suspend_start;
 
+/*
+ * Interval: 0.5sec.
+ */
+#define WATCHDOG_INTERVAL (HZ >> 1)
+
 /*
  * Threshold: 0.0312s, when doubled: 0.0625s.
  * Also a default for cs->uncertainty_margin when registering clocks.
@@ -106,11 +111,14 @@ static u64 suspend_start;
  * clocksource surrounding a read of the clocksource being validated.
  * This delay could be due to SMIs, NMIs, or to VCPU preemptions.  Used as
  * a lower bound for cs->uncertainty_margin values when registering clocks.
+ *
+ * The default of 500 parts per million is based on NTP's limits.
+ * If a clocksource is good enough for NTP, it is good enough for us!
  */
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
 #define MAX_SKEW_USEC	CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
 #else
-#define MAX_SKEW_USEC	100
+#define MAX_SKEW_USEC	(125 * WATCHDOG_INTERVAL / HZ)
 #endif
 
 #define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)
@@ -140,11 +148,6 @@ static inline void clocksource_watchdog_unlock(unsigned long *flags)
 static int clocksource_watchdog_kthread(void *data);
 static void __clocksource_change_rating(struct clocksource *cs, int rating);
 
-/*
- * Interval: 0.5sec.
- */
-#define WATCHDOG_INTERVAL (HZ >> 1)
-
 static void clocksource_watchdog_work(struct work_struct *work)
 {
 	/*
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 clocksource 3/7] clocksource: Improve read-back-delay message
  2023-01-25  0:27 [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3 Paul E. McKenney
  2023-01-25  0:27 ` [PATCH v2 clocksource 1/7] clocksource: Print clocksource name when clocksource is tested unstable Paul E. McKenney
  2023-01-25  0:27 ` [PATCH v2 clocksource 2/7] clocksource: Loosen clocksource watchdog constraints Paul E. McKenney
@ 2023-01-25  0:27 ` Paul E. McKenney
  2023-01-25  0:27 ` [PATCH v2 clocksource 4/7] clocksource: Improve "skew is too large" messages Paul E. McKenney
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-01-25  0:27 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Paul E. McKenney, John Stultz

When cs_watchdog_read() is unable to get a qualifying clocksource read
within the limit set by max_cswd_read_retries, it prints a message
and marks the clocksource under test as unstable.  But that message is
unclear to anyone unfamiliar with the code:

clocksource: timekeeping watchdog on CPU13: wd-tsc-wd read-back delay 1000614ns, attempt 3, marking unstable

Therefore, add some context so that the message appears as follows:

clocksource: timekeeping watchdog on CPU13: wd-tsc-wd excessive read-back delay of 1000614ns vs. limit of 125000ns, wd-wd read-back delay only 27ns, attempt 3, marking tsc unstable

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
---
 kernel/time/clocksource.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index a3d19f6660ac7..b59914953809f 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -260,8 +260,8 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
 			goto skip_test;
 	}
 
-	pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd read-back delay of %lldns, attempt %d, marking unstable\n",
-		smp_processor_id(), cs->name, wd_delay, nretries);
+	pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd excessive read-back delay of %lldns vs. limit of %ldns, wd-wd read-back delay only %lldns, attempt %d, marking %s unstable\n",
+		smp_processor_id(), cs->name, wd_delay, WATCHDOG_MAX_SKEW, wd_seq_delay, nretries, cs->name);
 	return WD_READ_UNSTABLE;
 
 skip_test:
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 clocksource 4/7] clocksource: Improve "skew is too large" messages
  2023-01-25  0:27 [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3 Paul E. McKenney
                   ` (2 preceding siblings ...)
  2023-01-25  0:27 ` [PATCH v2 clocksource 3/7] clocksource: Improve read-back-delay message Paul E. McKenney
@ 2023-01-25  0:27 ` Paul E. McKenney
  2023-01-25  0:27 ` [PATCH v2 clocksource 5/7] clocksource: Suspend the watchdog temporarily when high read latency detected Paul E. McKenney
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-01-25  0:27 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Paul E. McKenney, John Stultz

When clocksource_watchdog() detects excessive clocksource skew compared
to the watchdog clocksource, it marks the clocksource under test as
unstable and prints several lines worth of message.  But that message
is unclear to anyone unfamiliar with the code:

clocksource: timekeeping watchdog on CPU2: Marking clocksource 'wdtest-ktime' as unstable because the skew is too large:
clocksource:                       'kvm-clock' wd_nsec: 400744390 wd_now: 612625c2c wd_last: 5fa7f7c66 mask: ffffffffffffffff
clocksource:                       'wdtest-ktime' cs_nsec: 600744034 cs_now: 173081397a292d4f cs_last: 17308139565a8ced mask: ffffffffffffffff
clocksource:                       'kvm-clock' (not 'wdtest-ktime') is current clocksource.

Therefore, add the following line near the end of that message:

Clocksource 'wdtest-ktime' skewed 199999644 ns (199 ms) over watchdog 'kvm-clock' interval of 400744390 ns (400 ms)

This new line clearly indicates the amount of skew between the two
clocksources, along with the duration of the time interval over which
the skew occurred, both in nanoseconds and milliseconds.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
---
 kernel/time/clocksource.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b59914953809f..fc486cd972635 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -446,12 +446,20 @@ static void clocksource_watchdog(struct timer_list *unused)
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
 		if (abs(cs_nsec - wd_nsec) > md) {
+			u64 cs_wd_msec;
+			u64 wd_msec;
+			u32 wd_rem;
+
 			pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
 				smp_processor_id(), cs->name);
 			pr_warn("                      '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
 				watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
 			pr_warn("                      '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
 				cs->name, cs_nsec, csnow, cslast, cs->mask);
+			cs_wd_msec = div_u64_rem(cs_nsec - wd_nsec, 1000U * 1000U, &wd_rem);
+			wd_msec = div_u64_rem(wd_nsec, 1000U * 1000U, &wd_rem);
+			pr_warn("                      Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
+				cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
 			if (curr_clocksource == cs)
 				pr_warn("                      '%s' is current clocksource.\n", cs->name);
 			else if (curr_clocksource)
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 clocksource 5/7] clocksource: Suspend the watchdog temporarily when high read latency detected
  2023-01-25  0:27 [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3 Paul E. McKenney
                   ` (3 preceding siblings ...)
  2023-01-25  0:27 ` [PATCH v2 clocksource 4/7] clocksource: Improve "skew is too large" messages Paul E. McKenney
@ 2023-01-25  0:27 ` Paul E. McKenney
  2023-01-25  0:27 ` [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified Paul E. McKenney
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-01-25  0:27 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing, Waiman Long,
	John Stultz, Paul E . McKenney

From: Feng Tang <feng.tang@intel.com>

Bugs have been reported on 8 sockets x86 machines in which the TSC was
wrongly disabled when the system is under heavy workload.

 [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
 [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
 [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
 [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
 [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
 [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
 [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
 [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
 [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
 [ 821.067990] clocksource: Switched to clocksource hpet

This can be reproduced by running memory intensive 'stream' tests,
or some of the stress-ng subcases such as 'ioport'.

The reason for these issues is the when system is under heavy load, the
read latency of the clocksources can be very high.  Even lightweight TSC
reads can show high latencies, and latencies are much worse for external
clocksources such as HPET or the APIC PM timer.  These latencies can
result in false-positive clocksource-unstable determinations.

These issues were initially reported by a customer running on a production
system, and this problem was reproduced on several generations of Xeon
servers, especially when running the stress-ng test.  These Xeon servers
were not production systems, but they did have the latest steppings
and firmware.

Given that the clocksource watchdog is a continual diagnostic check with
frequency of twice a second, there is no need to rush it when the system
is under heavy load.  Therefore, when high clocksource read latencies
are detected, suspend the watchdog timer for 5 minutes.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index fc486cd972635..91836b727cef5 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -387,6 +387,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
 }
 EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
 
+static inline void clocksource_reset_watchdog(void)
+{
+	struct clocksource *cs;
+
+	list_for_each_entry(cs, &watchdog_list, wd_list)
+		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+}
+
+
 static void clocksource_watchdog(struct timer_list *unused)
 {
 	u64 csnow, wdnow, cslast, wdlast, delta;
@@ -394,6 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
 	enum wd_read_status read_ret;
+	unsigned long extra_wait = 0;
 	u32 md;
 
 	spin_lock(&watchdog_lock);
@@ -413,13 +423,30 @@ static void clocksource_watchdog(struct timer_list *unused)
 
 		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
 
-		if (read_ret != WD_READ_SUCCESS) {
-			if (read_ret == WD_READ_UNSTABLE)
-				/* Clock readout unreliable, so give it up. */
-				__clocksource_unstable(cs);
+		if (read_ret == WD_READ_UNSTABLE) {
+			/* Clock readout unreliable, so give it up. */
+			__clocksource_unstable(cs);
 			continue;
 		}
 
+		/*
+		 * When WD_READ_SKIP is returned, it means the system is likely
+		 * under very heavy load, where the latency of reading
+		 * watchdog/clocksource is very big, and affect the accuracy of
+		 * watchdog check. So give system some space and suspend the
+		 * watchdog check for 5 minutes.
+		 */
+		if (read_ret == WD_READ_SKIP) {
+			/*
+			 * As the watchdog timer will be suspended, and
+			 * cs->last could keep unchanged for 5 minutes, reset
+			 * the counters.
+			 */
+			clocksource_reset_watchdog();
+			extra_wait = HZ * 300;
+			break;
+		}
+
 		/* Clocksource initialized ? */
 		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
 		    atomic_read(&watchdog_reset_pending)) {
@@ -523,7 +550,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
 	 */
 	if (!timer_pending(&watchdog_timer)) {
-		watchdog_timer.expires += WATCHDOG_INTERVAL;
+		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
 		add_timer_on(&watchdog_timer, next_cpu);
 	}
 out:
@@ -548,14 +575,6 @@ static inline void clocksource_stop_watchdog(void)
 	watchdog_running = 0;
 }
 
-static inline void clocksource_reset_watchdog(void)
-{
-	struct clocksource *cs;
-
-	list_for_each_entry(cs, &watchdog_list, wd_list)
-		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
-}
-
 static void clocksource_resume_watchdog(void)
 {
 	atomic_inc(&watchdog_reset_pending);
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-01-25  0:27 [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3 Paul E. McKenney
                   ` (4 preceding siblings ...)
  2023-01-25  0:27 ` [PATCH v2 clocksource 5/7] clocksource: Suspend the watchdog temporarily when high read latency detected Paul E. McKenney
@ 2023-01-25  0:27 ` Paul E. McKenney
  2023-01-26 10:57   ` Daniel Lezcano
  2023-02-01 10:24   ` Thomas Gleixner
  2023-01-25  0:27 ` [PATCH v2 clocksource 7/7] x86/tsc: Add option to force frequency recalibration with HW timer Paul E. McKenney
  2023-02-03  4:36 ` PATCH v2 clocksource 8/7] clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested Paul E. McKenney
  7 siblings, 2 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-01-25  0:27 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Paul E. McKenney, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Daniel Lezcano, Waiman Long, x86

On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
TSC is disabled.  This works well much of the time, but there is the
occasional production-level system that meets all of these criteria, but
which still has a TSC that skews significantly from atomic-clock time.
This is usually attributed to a firmware or hardware fault.  Yes, the
various NTP daemons do express their opinions of userspace-to-atomic-clock
time skew, but they put them in various places, depending on the daemon
and distro in question.  It would therefore be good for the kernel to
have some clue that there is a problem.

The old behavior of marking the TSC unstable is a non-starter because a
great many workloads simply cannot tolerate the overheads and latencies
of the various non-TSC clocksources.  In addition, NTP-corrected systems
sometimes can tolerate significant kernel-space time skew as long as
the userspace time sources are within epsilon of atomic-clock time.

Therefore, when watchdog verification of TSC is disabled, enable it for
HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
time-skew diagnostic without degrading the system's performance.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Waiman Long <longman@redhat.com>
Cc: <x86@kernel.org>
Tested-by: Feng Tang <feng.tang@intel.com>
---
 arch/x86/include/asm/time.h   | 1 +
 arch/x86/kernel/hpet.c        | 2 ++
 arch/x86/kernel/tsc.c         | 5 +++++
 drivers/clocksource/acpi_pm.c | 6 ++++--
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h
index 8ac563abb567b..a53961c64a567 100644
--- a/arch/x86/include/asm/time.h
+++ b/arch/x86/include/asm/time.h
@@ -8,6 +8,7 @@
 extern void hpet_time_init(void);
 extern void time_init(void);
 extern bool pit_timer_init(void);
+extern bool tsc_clocksource_watchdog_disabled(void);
 
 extern struct clock_event_device *global_clock_event;
 
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 71f336425e58a..c8eb1ac5125ab 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -1091,6 +1091,8 @@ int __init hpet_enable(void)
 	if (!hpet_counting())
 		goto out_nohpet;
 
+	if (tsc_clocksource_watchdog_disabled())
+		clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
 	clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
 
 	if (id & HPET_ID_LEGSUP) {
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a78e73da4a74b..af3782fb6200c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1186,6 +1186,11 @@ static void __init tsc_disable_clocksource_watchdog(void)
 	clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
 }
 
+bool tsc_clocksource_watchdog_disabled(void)
+{
+	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY);
+}
+
 static void __init check_system_tsc_reliable(void)
 {
 #if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
index 279ddff81ab49..82338773602ca 100644
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -23,6 +23,7 @@
 #include <linux/pci.h>
 #include <linux/delay.h>
 #include <asm/io.h>
+#include <asm/time.h>
 
 /*
  * The I/O port the PMTMR resides at.
@@ -210,8 +211,9 @@ static int __init init_acpi_pm_clocksource(void)
 		return -ENODEV;
 	}
 
-	return clocksource_register_hz(&clocksource_acpi_pm,
-						PMTMR_TICKS_PER_SEC);
+	if (tsc_clocksource_watchdog_disabled())
+		clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
+	return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
 }
 
 /* We use fs_initcall because we want the PCI fixups to have run
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 clocksource 7/7] x86/tsc: Add option to force frequency recalibration with HW timer
  2023-01-25  0:27 [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3 Paul E. McKenney
                   ` (5 preceding siblings ...)
  2023-01-25  0:27 ` [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified Paul E. McKenney
@ 2023-01-25  0:27 ` Paul E. McKenney
  2023-02-03  4:36 ` PATCH v2 clocksource 8/7] clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested Paul E. McKenney
  7 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-01-25  0:27 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, x86, linux-doc,
	Paul E . McKenney

From: Feng Tang <feng.tang@intel.com>

The kernel assumes that the TSC frequency which is provided by the
hardware / firmware via MSRs or CPUID(0x15) is correct after applying
a few basic consistency checks. This disables the TSC recalibration
against HPET or PM timer.

As a result there is no mechanism to validate that frequency in cases
where a firmware or hardware defect is suspected. And there was case
that some user used atomic clock to measure the TSC frequency and
reported an inaccuracy issue, which was later fixed in firmware.

Add an option 'recalibrate' for 'tsc' kernel parameter to force the
tsc freq recalibration with HPET or PM timer, and warn if the
deviation from previous value is more than about 500 PPM, which
provides a way to verify the data from hardware / firmware.

There is no functional change to existing work flow.

Recently there was a real-world case: "The 40ms/s divergence between
TSC and HPET was observed on hardware that is quite recent" [1], on
that platform the TSC frequence 1896 MHz was got from CPUID(0x15),
and the force-reclibration with HPET/PMTIMER both calibrated out
value of 1975 MHz, which also matched with check from software
'chronyd', indicating it's a problem of BIOS or firmware.

[Thanks tglx for helping improving the commit log]
[ paulmck: Wordsmith Kconfig help text. ]

[1]. https://lore.kernel.org/lkml/20221117230910.GI4001@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: <x86@kernel.org>
Cc: <linux-doc@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 .../admin-guide/kernel-parameters.txt         |  4 +++
 arch/x86/kernel/tsc.c                         | 34 ++++++++++++++++---
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6cfa6e3996cf7..95f0d104c2322 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6369,6 +6369,10 @@
 			in situations with strict latency requirements (where
 			interruptions from clocksource watchdog are not
 			acceptable).
+			[x86] recalibrate: force recalibration against a HW timer
+			(HPET or PM timer) on systems whose TSC frequency was
+			obtained from HW or FW using either an MSR or CPUID(0x15).
+			Warn if the difference is more than 500 ppm.
 
 	tsc_early_khz=  [X86] Skip early TSC calibration and use the given
 			value instead. Useful when the early TSC frequency discovery
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index af3782fb6200c..a5371c6d4b64b 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -48,6 +48,8 @@ static DEFINE_STATIC_KEY_FALSE(__use_tsc);
 
 int tsc_clocksource_reliable;
 
+static int __read_mostly tsc_force_recalibrate;
+
 static u32 art_to_tsc_numerator;
 static u32 art_to_tsc_denominator;
 static u64 art_to_tsc_offset;
@@ -303,6 +305,8 @@ static int __init tsc_setup(char *str)
 		mark_tsc_unstable("boot parameter");
 	if (!strcmp(str, "nowatchdog"))
 		no_tsc_watchdog = 1;
+	if (!strcmp(str, "recalibrate"))
+		tsc_force_recalibrate = 1;
 	return 1;
 }
 
@@ -1379,6 +1383,25 @@ static void tsc_refine_calibration_work(struct work_struct *work)
 	else
 		freq = calc_pmtimer_ref(delta, ref_start, ref_stop);
 
+	/* Will hit this only if tsc_force_recalibrate has been set */
+	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
+
+		/* Warn if the deviation exceeds 500 ppm */
+		if (abs(tsc_khz - freq) > (tsc_khz >> 11)) {
+			pr_warn("Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!\n");
+			pr_info("Previous calibrated TSC freq:\t %lu.%03lu MHz\n",
+				(unsigned long)tsc_khz / 1000,
+				(unsigned long)tsc_khz % 1000);
+		}
+
+		pr_info("TSC freq recalibrated by [%s]:\t %lu.%03lu MHz\n",
+			hpet ? "HPET" : "PM_TIMER",
+			(unsigned long)freq / 1000,
+			(unsigned long)freq % 1000);
+
+		return;
+	}
+
 	/* Make sure we're within 1% */
 	if (abs(tsc_khz - freq) > tsc_khz/100)
 		goto out;
@@ -1412,8 +1435,10 @@ static int __init init_tsc_clocksource(void)
 	if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
 		return 0;
 
-	if (tsc_unstable)
-		goto unreg;
+	if (tsc_unstable) {
+		clocksource_unregister(&clocksource_tsc_early);
+		return 0;
+	}
 
 	if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
 		clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;
@@ -1426,9 +1451,10 @@ static int __init init_tsc_clocksource(void)
 		if (boot_cpu_has(X86_FEATURE_ART))
 			art_related_clocksource = &clocksource_tsc;
 		clocksource_register_khz(&clocksource_tsc, tsc_khz);
-unreg:
 		clocksource_unregister(&clocksource_tsc_early);
-		return 0;
+
+		if (!tsc_force_recalibrate)
+			return 0;
 	}
 
 	schedule_delayed_work(&tsc_irqwork, 0);
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-01-25  0:27 ` [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified Paul E. McKenney
@ 2023-01-26 10:57   ` Daniel Lezcano
  2023-02-01  0:50     ` Paul E. McKenney
  2023-02-01 10:24   ` Thomas Gleixner
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Lezcano @ 2023-01-26 10:57 UTC (permalink / raw)
  To: Paul E. McKenney, tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Waiman Long, x86


Hi Thomas,

are you ok with this patch ? Shall I pick it ?

Thanks

   -- Daniel


On 25/01/2023 01:27, Paul E. McKenney wrote:
> On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
> NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
> TSC is disabled.  This works well much of the time, but there is the
> occasional production-level system that meets all of these criteria, but
> which still has a TSC that skews significantly from atomic-clock time.
> This is usually attributed to a firmware or hardware fault.  Yes, the
> various NTP daemons do express their opinions of userspace-to-atomic-clock
> time skew, but they put them in various places, depending on the daemon
> and distro in question.  It would therefore be good for the kernel to
> have some clue that there is a problem.
> 
> The old behavior of marking the TSC unstable is a non-starter because a
> great many workloads simply cannot tolerate the overheads and latencies
> of the various non-TSC clocksources.  In addition, NTP-corrected systems
> sometimes can tolerate significant kernel-space time skew as long as
> the userspace time sources are within epsilon of atomic-clock time.
> 
> Therefore, when watchdog verification of TSC is disabled, enable it for
> HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
> time-skew diagnostic without degrading the system's performance.
> 
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: <x86@kernel.org>
> Tested-by: Feng Tang <feng.tang@intel.com>
> ---
>   arch/x86/include/asm/time.h   | 1 +
>   arch/x86/kernel/hpet.c        | 2 ++
>   arch/x86/kernel/tsc.c         | 5 +++++
>   drivers/clocksource/acpi_pm.c | 6 ++++--
>   4 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h
> index 8ac563abb567b..a53961c64a567 100644
> --- a/arch/x86/include/asm/time.h
> +++ b/arch/x86/include/asm/time.h
> @@ -8,6 +8,7 @@
>   extern void hpet_time_init(void);
>   extern void time_init(void);
>   extern bool pit_timer_init(void);
> +extern bool tsc_clocksource_watchdog_disabled(void);
>   
>   extern struct clock_event_device *global_clock_event;
>   
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index 71f336425e58a..c8eb1ac5125ab 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -1091,6 +1091,8 @@ int __init hpet_enable(void)
>   	if (!hpet_counting())
>   		goto out_nohpet;
>   
> +	if (tsc_clocksource_watchdog_disabled())
> +		clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
>   	clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
>   
>   	if (id & HPET_ID_LEGSUP) {
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index a78e73da4a74b..af3782fb6200c 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1186,6 +1186,11 @@ static void __init tsc_disable_clocksource_watchdog(void)
>   	clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
>   }
>   
> +bool tsc_clocksource_watchdog_disabled(void)
> +{
> +	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY);
> +}
> +
>   static void __init check_system_tsc_reliable(void)
>   {
>   #if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
> diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
> index 279ddff81ab49..82338773602ca 100644
> --- a/drivers/clocksource/acpi_pm.c
> +++ b/drivers/clocksource/acpi_pm.c
> @@ -23,6 +23,7 @@
>   #include <linux/pci.h>
>   #include <linux/delay.h>
>   #include <asm/io.h>
> +#include <asm/time.h>
>   
>   /*
>    * The I/O port the PMTMR resides at.
> @@ -210,8 +211,9 @@ static int __init init_acpi_pm_clocksource(void)
>   		return -ENODEV;
>   	}
>   
> -	return clocksource_register_hz(&clocksource_acpi_pm,
> -						PMTMR_TICKS_PER_SEC);
> +	if (tsc_clocksource_watchdog_disabled())
> +		clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
> +	return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
>   }
>   
>   /* We use fs_initcall because we want the PCI fixups to have run

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-01-26 10:57   ` Daniel Lezcano
@ 2023-02-01  0:50     ` Paul E. McKenney
  0 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-02-01  0:50 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: tglx, linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland,
	maz, kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Waiman Long, x86

On Thu, Jan 26, 2023 at 11:57:41AM +0100, Daniel Lezcano wrote:
> 
> Hi Thomas,
> 
> are you ok with this patch ? Shall I pick it ?

Seeing no response, I sent a pull request.

If it would be helpful for me to make the pilgrimmage to (say) Intel
Jones Farm, I can easily make that trip.

							Thanx, Paul

> Thanks
> 
>   -- Daniel
> 
> 
> On 25/01/2023 01:27, Paul E. McKenney wrote:
> > On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
> > NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
> > TSC is disabled.  This works well much of the time, but there is the
> > occasional production-level system that meets all of these criteria, but
> > which still has a TSC that skews significantly from atomic-clock time.
> > This is usually attributed to a firmware or hardware fault.  Yes, the
> > various NTP daemons do express their opinions of userspace-to-atomic-clock
> > time skew, but they put them in various places, depending on the daemon
> > and distro in question.  It would therefore be good for the kernel to
> > have some clue that there is a problem.
> > 
> > The old behavior of marking the TSC unstable is a non-starter because a
> > great many workloads simply cannot tolerate the overheads and latencies
> > of the various non-TSC clocksources.  In addition, NTP-corrected systems
> > sometimes can tolerate significant kernel-space time skew as long as
> > the userspace time sources are within epsilon of atomic-clock time.
> > 
> > Therefore, when watchdog verification of TSC is disabled, enable it for
> > HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
> > time-skew diagnostic without degrading the system's performance.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: <x86@kernel.org>
> > Tested-by: Feng Tang <feng.tang@intel.com>
> > ---
> >   arch/x86/include/asm/time.h   | 1 +
> >   arch/x86/kernel/hpet.c        | 2 ++
> >   arch/x86/kernel/tsc.c         | 5 +++++
> >   drivers/clocksource/acpi_pm.c | 6 ++++--
> >   4 files changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h
> > index 8ac563abb567b..a53961c64a567 100644
> > --- a/arch/x86/include/asm/time.h
> > +++ b/arch/x86/include/asm/time.h
> > @@ -8,6 +8,7 @@
> >   extern void hpet_time_init(void);
> >   extern void time_init(void);
> >   extern bool pit_timer_init(void);
> > +extern bool tsc_clocksource_watchdog_disabled(void);
> >   extern struct clock_event_device *global_clock_event;
> > diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> > index 71f336425e58a..c8eb1ac5125ab 100644
> > --- a/arch/x86/kernel/hpet.c
> > +++ b/arch/x86/kernel/hpet.c
> > @@ -1091,6 +1091,8 @@ int __init hpet_enable(void)
> >   	if (!hpet_counting())
> >   		goto out_nohpet;
> > +	if (tsc_clocksource_watchdog_disabled())
> > +		clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
> >   	clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
> >   	if (id & HPET_ID_LEGSUP) {
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index a78e73da4a74b..af3782fb6200c 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1186,6 +1186,11 @@ static void __init tsc_disable_clocksource_watchdog(void)
> >   	clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
> >   }
> > +bool tsc_clocksource_watchdog_disabled(void)
> > +{
> > +	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY);
> > +}
> > +
> >   static void __init check_system_tsc_reliable(void)
> >   {
> >   #if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
> > diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
> > index 279ddff81ab49..82338773602ca 100644
> > --- a/drivers/clocksource/acpi_pm.c
> > +++ b/drivers/clocksource/acpi_pm.c
> > @@ -23,6 +23,7 @@
> >   #include <linux/pci.h>
> >   #include <linux/delay.h>
> >   #include <asm/io.h>
> > +#include <asm/time.h>
> >   /*
> >    * The I/O port the PMTMR resides at.
> > @@ -210,8 +211,9 @@ static int __init init_acpi_pm_clocksource(void)
> >   		return -ENODEV;
> >   	}
> > -	return clocksource_register_hz(&clocksource_acpi_pm,
> > -						PMTMR_TICKS_PER_SEC);
> > +	if (tsc_clocksource_watchdog_disabled())
> > +		clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
> > +	return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
> >   }
> >   /* We use fs_initcall because we want the PCI fixups to have run
> 
> -- 
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-01-25  0:27 ` [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified Paul E. McKenney
  2023-01-26 10:57   ` Daniel Lezcano
@ 2023-02-01 10:24   ` Thomas Gleixner
  2023-02-01 15:10     ` Feng Tang
                       ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Thomas Gleixner @ 2023-02-01 10:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Paul E. McKenney, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Daniel Lezcano, Waiman Long, x86

Paul!

On Tue, Jan 24 2023 at 16:27, Paul E. McKenney wrote:
> On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
> NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
> TSC is disabled.  This works well much of the time, but there is the
> occasional production-level system that meets all of these criteria, but
> which still has a TSC that skews significantly from atomic-clock time.
> This is usually attributed to a firmware or hardware fault.  Yes, the
> various NTP daemons do express their opinions of userspace-to-atomic-clock
> time skew, but they put them in various places, depending on the daemon
> and distro in question.  It would therefore be good for the kernel to
> have some clue that there is a problem.
>
> The old behavior of marking the TSC unstable is a non-starter because a
> great many workloads simply cannot tolerate the overheads and latencies
> of the various non-TSC clocksources.  In addition, NTP-corrected systems
> sometimes can tolerate significant kernel-space time skew as long as
> the userspace time sources are within epsilon of atomic-clock time.
>
> Therefore, when watchdog verification of TSC is disabled, enable it for
> HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
> time-skew diagnostic without degrading the system's performance.

I'm more than unhappy about this. We finally have a point where the TSC
watchdog overhead can go away without adding TSC=reliable to the kernel
commandline.

Now you add an unconditionally enforce the watchdog again in a way which
even cannot be disabled on the kernel command line.

Patently bad idea, no cookies for you!

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-02-01 10:24   ` Thomas Gleixner
@ 2023-02-01 15:10     ` Feng Tang
  2023-02-01 19:26     ` Waiman Long
  2023-02-01 19:51     ` Paul E. McKenney
  2 siblings, 0 replies; 22+ messages in thread
From: Feng Tang @ 2023-02-01 15:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Daniel Lezcano, Waiman Long, x86

Hi Thomas,

On Wed, Feb 01, 2023 at 11:24:14AM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Tue, Jan 24 2023 at 16:27, Paul E. McKenney wrote:
> > On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
> > NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
> > TSC is disabled.  This works well much of the time, but there is the
> > occasional production-level system that meets all of these criteria, but
> > which still has a TSC that skews significantly from atomic-clock time.
> > This is usually attributed to a firmware or hardware fault.  Yes, the
> > various NTP daemons do express their opinions of userspace-to-atomic-clock
> > time skew, but they put them in various places, depending on the daemon
> > and distro in question.  It would therefore be good for the kernel to
> > have some clue that there is a problem.
> >
> > The old behavior of marking the TSC unstable is a non-starter because a
> > great many workloads simply cannot tolerate the overheads and latencies
> > of the various non-TSC clocksources.  In addition, NTP-corrected systems
> > sometimes can tolerate significant kernel-space time skew as long as
> > the userspace time sources are within epsilon of atomic-clock time.
> >
> > Therefore, when watchdog verification of TSC is disabled, enable it for
> > HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
> > time-skew diagnostic without degrading the system's performance.
> 
> I'm more than unhappy about this. We finally have a point where the TSC
> watchdog overhead can go away without adding TSC=reliable to the kernel
> commandline.
> 
> Now you add an unconditionally enforce the watchdog again in a way which
> even cannot be disabled on the kernel command line.

Yes, this is a valid concern. Waiman, Paul and I discussed this and
had some proposal to handle this side effect, like only watchdoging 
HPET/ACPI-PM timer for a short period of time in this case.
https://lore.kernel.org/lkml/20221227183819.GI4001@paulmck-ThinkPad-P17-Gen-1/
My bad that I didn't follow up as my proposed code looked ugly as
bringing more complexsities. Does the idea of setting a watchdog
time limit sound fine to you?  

Thanks,
Feng

> Patently bad idea, no cookies for you!
> 
> Thanks,
> 
>         tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-02-01 10:24   ` Thomas Gleixner
  2023-02-01 15:10     ` Feng Tang
@ 2023-02-01 19:26     ` Waiman Long
  2023-02-01 19:55       ` Paul E. McKenney
  2023-02-01 19:51     ` Paul E. McKenney
  2 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2023-02-01 19:26 UTC (permalink / raw)
  To: Thomas Gleixner, Paul E. McKenney
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Daniel Lezcano,
	x86

On 2/1/23 05:24, Thomas Gleixner wrote:
> Paul!
>
> On Tue, Jan 24 2023 at 16:27, Paul E. McKenney wrote:
>> On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
>> NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
>> TSC is disabled.  This works well much of the time, but there is the
>> occasional production-level system that meets all of these criteria, but
>> which still has a TSC that skews significantly from atomic-clock time.
>> This is usually attributed to a firmware or hardware fault.  Yes, the
>> various NTP daemons do express their opinions of userspace-to-atomic-clock
>> time skew, but they put them in various places, depending on the daemon
>> and distro in question.  It would therefore be good for the kernel to
>> have some clue that there is a problem.
>>
>> The old behavior of marking the TSC unstable is a non-starter because a
>> great many workloads simply cannot tolerate the overheads and latencies
>> of the various non-TSC clocksources.  In addition, NTP-corrected systems
>> sometimes can tolerate significant kernel-space time skew as long as
>> the userspace time sources are within epsilon of atomic-clock time.
>>
>> Therefore, when watchdog verification of TSC is disabled, enable it for
>> HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
>> time-skew diagnostic without degrading the system's performance.
> I'm more than unhappy about this. We finally have a point where the TSC
> watchdog overhead can go away without adding TSC=reliable to the kernel
> commandline.
>
> Now you add an unconditionally enforce the watchdog again in a way which
> even cannot be disabled on the kernel command line.
>
> Patently bad idea, no cookies for you!

I have a similar concern about this patch as well. That is why I was 
suggesting to have this enabled for a limited time after boot for sanity 
checking purpose only.

The previous "[PATCH clocksource 5/6] clocksource: Suspend the watchdog 
temporarily when high read latency detected" patch, however, should be 
fine. Right?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-02-01 10:24   ` Thomas Gleixner
  2023-02-01 15:10     ` Feng Tang
  2023-02-01 19:26     ` Waiman Long
@ 2023-02-01 19:51     ` Paul E. McKenney
  2 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-02-01 19:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Daniel Lezcano,
	Waiman Long, x86

On Wed, Feb 01, 2023 at 11:24:14AM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Tue, Jan 24 2023 at 16:27, Paul E. McKenney wrote:
> > On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
> > NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
> > TSC is disabled.  This works well much of the time, but there is the
> > occasional production-level system that meets all of these criteria, but
> > which still has a TSC that skews significantly from atomic-clock time.
> > This is usually attributed to a firmware or hardware fault.  Yes, the
> > various NTP daemons do express their opinions of userspace-to-atomic-clock
> > time skew, but they put them in various places, depending on the daemon
> > and distro in question.  It would therefore be good for the kernel to
> > have some clue that there is a problem.
> >
> > The old behavior of marking the TSC unstable is a non-starter because a
> > great many workloads simply cannot tolerate the overheads and latencies
> > of the various non-TSC clocksources.  In addition, NTP-corrected systems
> > sometimes can tolerate significant kernel-space time skew as long as
> > the userspace time sources are within epsilon of atomic-clock time.
> >
> > Therefore, when watchdog verification of TSC is disabled, enable it for
> > HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
> > time-skew diagnostic without degrading the system's performance.
> 
> I'm more than unhappy about this. We finally have a point where the TSC
> watchdog overhead can go away without adding TSC=reliable to the kernel
> commandline.
> 
> Now you add an unconditionally enforce the watchdog again in a way which
> even cannot be disabled on the kernel command line.
> 
> Patently bad idea, no cookies for you!

What can I say?  40,000 parts per million TSC clock skew did raise some
eyebrows, and therefore the complete suppressing of that diagnostic
completely was not at all a welcome development.

So how about the (untested) patch below, either on top of the existing
series or folded into e57818b20b0b ("clocksource: Verify HPET and PMTMR
when TSC unverified")?

The idea is to provide TSC checking of HPET and PMTMR only when the
TSC is deemed reliable and the new tsc=watchdog kernel boot parameter
is provided.  If both tsc=watchdog and tsc=nowatchdog are provided,
tsc=watchdog wins and a console message is emitted (no splat).

To restate, with this patch, unless the sysadmin asks for it, there will
be no clocksource watchdog unless there also would have been one without
this patch.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 115829d71d0ca..d681f9252aaa7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6382,6 +6382,12 @@
 			(HPET or PM timer) on systems whose TSC frequency was
 			obtained from HW or FW using either an MSR or CPUID(0x15).
 			Warn if the difference is more than 500 ppm.
+			[x86] watchdog: Use TSC as the watchdog clocksource with
+			which to check other HW timers (HPET or PM timer), but
+			only on systems where TSC has been deemed trustworthy.
+			This will be suppressed by an earlier tsc=nowatchdog and
+			can be overridden by a later tsc=nowatchdog.  A console
+			message will flag any such suppression or overriding.
 
 	tsc_early_khz=  [X86] Skip early TSC calibration and use the given
 			value instead. Useful when the early TSC frequency discovery
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a5371c6d4b64b..306c233c98d84 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -294,6 +294,7 @@ __setup("notsc", notsc_setup);
 
 static int no_sched_irq_time;
 static int no_tsc_watchdog;
+static int tsc_as_watchdog;
 
 static int __init tsc_setup(char *str)
 {
@@ -303,10 +304,22 @@ static int __init tsc_setup(char *str)
 		no_sched_irq_time = 1;
 	if (!strcmp(str, "unstable"))
 		mark_tsc_unstable("boot parameter");
-	if (!strcmp(str, "nowatchdog"))
+	if (!strcmp(str, "nowatchdog")) {
 		no_tsc_watchdog = 1;
+		if (tsc_as_watchdog)
+			pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
+				 __func__);
+		tsc_as_watchdog = 0;
+	}
 	if (!strcmp(str, "recalibrate"))
 		tsc_force_recalibrate = 1;
+	if (!strcmp(str, "watchdog")) {
+		if (no_tsc_watchdog)
+			pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
+				 __func__);
+		else
+			tsc_as_watchdog = 1;
+	}
 	return 1;
 }
 
@@ -1192,7 +1205,8 @@ static void __init tsc_disable_clocksource_watchdog(void)
 
 bool tsc_clocksource_watchdog_disabled(void)
 {
-	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY);
+	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
+	       tsc_as_watchdog && !no_tsc_watchdog;
 }
 
 static void __init check_system_tsc_reliable(void)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-02-01 19:26     ` Waiman Long
@ 2023-02-01 19:55       ` Paul E. McKenney
  2023-02-02  3:40         ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Paul E. McKenney @ 2023-02-01 19:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, feng.tang,
	zhengjun.xing, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Daniel Lezcano, x86

On Wed, Feb 01, 2023 at 02:26:29PM -0500, Waiman Long wrote:
> On 2/1/23 05:24, Thomas Gleixner wrote:
> > Paul!
> > 
> > On Tue, Jan 24 2023 at 16:27, Paul E. McKenney wrote:
> > > On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
> > > NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
> > > TSC is disabled.  This works well much of the time, but there is the
> > > occasional production-level system that meets all of these criteria, but
> > > which still has a TSC that skews significantly from atomic-clock time.
> > > This is usually attributed to a firmware or hardware fault.  Yes, the
> > > various NTP daemons do express their opinions of userspace-to-atomic-clock
> > > time skew, but they put them in various places, depending on the daemon
> > > and distro in question.  It would therefore be good for the kernel to
> > > have some clue that there is a problem.
> > > 
> > > The old behavior of marking the TSC unstable is a non-starter because a
> > > great many workloads simply cannot tolerate the overheads and latencies
> > > of the various non-TSC clocksources.  In addition, NTP-corrected systems
> > > sometimes can tolerate significant kernel-space time skew as long as
> > > the userspace time sources are within epsilon of atomic-clock time.
> > > 
> > > Therefore, when watchdog verification of TSC is disabled, enable it for
> > > HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
> > > time-skew diagnostic without degrading the system's performance.
> > I'm more than unhappy about this. We finally have a point where the TSC
> > watchdog overhead can go away without adding TSC=reliable to the kernel
> > commandline.
> > 
> > Now you add an unconditionally enforce the watchdog again in a way which
> > even cannot be disabled on the kernel command line.
> > 
> > Patently bad idea, no cookies for you!
> 
> I have a similar concern about this patch as well. That is why I was
> suggesting to have this enabled for a limited time after boot for sanity
> checking purpose only.

Fair enough!

If the watchdog checking of HPET and/or PMTMR against TSC only happens
only when the sysadm asks for it, would you still want to have the ability
to enable such watchdog checking at boot time, and then to disable it
once the system had been running for some limited time?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-02-01 19:55       ` Paul E. McKenney
@ 2023-02-02  3:40         ` Waiman Long
  2023-02-02  4:54           ` Paul E. McKenney
  2023-02-02  7:57           ` Thomas Gleixner
  0 siblings, 2 replies; 22+ messages in thread
From: Waiman Long @ 2023-02-02  3:40 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, feng.tang,
	zhengjun.xing, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Daniel Lezcano, x86

On 2/1/23 14:55, Paul E. McKenney wrote:
> On Wed, Feb 01, 2023 at 02:26:29PM -0500, Waiman Long wrote:
>> On 2/1/23 05:24, Thomas Gleixner wrote:
>>> Paul!
>>>
>>> On Tue, Jan 24 2023 at 16:27, Paul E. McKenney wrote:
>>>> On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
>>>> NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
>>>> TSC is disabled.  This works well much of the time, but there is the
>>>> occasional production-level system that meets all of these criteria, but
>>>> which still has a TSC that skews significantly from atomic-clock time.
>>>> This is usually attributed to a firmware or hardware fault.  Yes, the
>>>> various NTP daemons do express their opinions of userspace-to-atomic-clock
>>>> time skew, but they put them in various places, depending on the daemon
>>>> and distro in question.  It would therefore be good for the kernel to
>>>> have some clue that there is a problem.
>>>>
>>>> The old behavior of marking the TSC unstable is a non-starter because a
>>>> great many workloads simply cannot tolerate the overheads and latencies
>>>> of the various non-TSC clocksources.  In addition, NTP-corrected systems
>>>> sometimes can tolerate significant kernel-space time skew as long as
>>>> the userspace time sources are within epsilon of atomic-clock time.
>>>>
>>>> Therefore, when watchdog verification of TSC is disabled, enable it for
>>>> HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
>>>> time-skew diagnostic without degrading the system's performance.
>>> I'm more than unhappy about this. We finally have a point where the TSC
>>> watchdog overhead can go away without adding TSC=reliable to the kernel
>>> commandline.
>>>
>>> Now you add an unconditionally enforce the watchdog again in a way which
>>> even cannot be disabled on the kernel command line.
>>>
>>> Patently bad idea, no cookies for you!
>> I have a similar concern about this patch as well. That is why I was
>> suggesting to have this enabled for a limited time after boot for sanity
>> checking purpose only.
> Fair enough!
>
> If the watchdog checking of HPET and/or PMTMR against TSC only happens
> only when the sysadm asks for it, would you still want to have the ability
> to enable such watchdog checking at boot time, and then to disable it
> once the system had been running for some limited time?

Yes, being optional is another way to avoid the overhead for the 
majority of users. The paranoids can turn it on if they want to.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-02-02  3:40         ` Waiman Long
@ 2023-02-02  4:54           ` Paul E. McKenney
  2023-02-02  7:57           ` Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-02-02  4:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, feng.tang,
	zhengjun.xing, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Daniel Lezcano, x86

On Wed, Feb 01, 2023 at 10:40:56PM -0500, Waiman Long wrote:
> On 2/1/23 14:55, Paul E. McKenney wrote:
> > On Wed, Feb 01, 2023 at 02:26:29PM -0500, Waiman Long wrote:
> > > On 2/1/23 05:24, Thomas Gleixner wrote:
> > > > Paul!
> > > > 
> > > > On Tue, Jan 24 2023 at 16:27, Paul E. McKenney wrote:
> > > > > On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
> > > > > NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
> > > > > TSC is disabled.  This works well much of the time, but there is the
> > > > > occasional production-level system that meets all of these criteria, but
> > > > > which still has a TSC that skews significantly from atomic-clock time.
> > > > > This is usually attributed to a firmware or hardware fault.  Yes, the
> > > > > various NTP daemons do express their opinions of userspace-to-atomic-clock
> > > > > time skew, but they put them in various places, depending on the daemon
> > > > > and distro in question.  It would therefore be good for the kernel to
> > > > > have some clue that there is a problem.
> > > > > 
> > > > > The old behavior of marking the TSC unstable is a non-starter because a
> > > > > great many workloads simply cannot tolerate the overheads and latencies
> > > > > of the various non-TSC clocksources.  In addition, NTP-corrected systems
> > > > > sometimes can tolerate significant kernel-space time skew as long as
> > > > > the userspace time sources are within epsilon of atomic-clock time.
> > > > > 
> > > > > Therefore, when watchdog verification of TSC is disabled, enable it for
> > > > > HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
> > > > > time-skew diagnostic without degrading the system's performance.
> > > > I'm more than unhappy about this. We finally have a point where the TSC
> > > > watchdog overhead can go away without adding TSC=reliable to the kernel
> > > > commandline.
> > > > 
> > > > Now you add an unconditionally enforce the watchdog again in a way which
> > > > even cannot be disabled on the kernel command line.
> > > > 
> > > > Patently bad idea, no cookies for you!
> > > I have a similar concern about this patch as well. That is why I was
> > > suggesting to have this enabled for a limited time after boot for sanity
> > > checking purpose only.
> > Fair enough!
> > 
> > If the watchdog checking of HPET and/or PMTMR against TSC only happens
> > only when the sysadm asks for it, would you still want to have the ability
> > to enable such watchdog checking at boot time, and then to disable it
> > once the system had been running for some limited time?
> 
> Yes, being optional is another way to avoid the overhead for the majority of
> users. The paranoids can turn it on if they want to.

Very good, thank you!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-02-02  3:40         ` Waiman Long
  2023-02-02  4:54           ` Paul E. McKenney
@ 2023-02-02  7:57           ` Thomas Gleixner
  2023-02-04  1:27             ` Paul E. McKenney
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2023-02-02  7:57 UTC (permalink / raw)
  To: Waiman Long, paulmck
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Daniel Lezcano,
	x86

On Wed, Feb 01 2023 at 22:40, Waiman Long wrote:
> On 2/1/23 14:55, Paul E. McKenney wrote:
>>>>> Therefore, when watchdog verification of TSC is disabled, enable it for
>>>>> HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
>>>>> time-skew diagnostic without degrading the system's performance.
>>>> I'm more than unhappy about this. We finally have a point where the TSC
>>>> watchdog overhead can go away without adding TSC=reliable to the kernel
>>>> commandline.
>>>>
>>>> Now you add an unconditionally enforce the watchdog again in a way which
>>>> even cannot be disabled on the kernel command line.
>>>>
>>>> Patently bad idea, no cookies for you!
>>> I have a similar concern about this patch as well. That is why I was
>>> suggesting to have this enabled for a limited time after boot for sanity
>>> checking purpose only.
>> Fair enough!
>>
>> If the watchdog checking of HPET and/or PMTMR against TSC only happens
>> only when the sysadm asks for it, would you still want to have the ability
>> to enable such watchdog checking at boot time, and then to disable it
>> once the system had been running for some limited time?
>
> Yes, being optional is another way to avoid the overhead for the 
> majority of users. The paranoids can turn it on if they want to.

Yes, opt-in is good enough.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* PATCH v2 clocksource 8/7] clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested
  2023-01-25  0:27 [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3 Paul E. McKenney
                   ` (6 preceding siblings ...)
  2023-01-25  0:27 ` [PATCH v2 clocksource 7/7] x86/tsc: Add option to force frequency recalibration with HW timer Paul E. McKenney
@ 2023-02-03  4:36 ` Paul E. McKenney
  2023-02-06 19:57   ` Waiman Long
  7 siblings, 1 reply; 22+ messages in thread
From: Paul E. McKenney @ 2023-02-03  4:36 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing, longman

Unconditionally enabling TSC watchdog checking of the HPET and PMTMR
clocksources can degrade latency and performance.  Therefore, provide
a new "watchdog" option to the tsc= boot parameter that opts into such
checking.  Note that tsc=watchdog is overridden by a tsc=nowatchdog
regardless of their relative positions in the list of boot parameters.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 95f0d104c2322..7b4df6d89d3c3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6373,6 +6373,12 @@
 			(HPET or PM timer) on systems whose TSC frequency was
 			obtained from HW or FW using either an MSR or CPUID(0x15).
 			Warn if the difference is more than 500 ppm.
+			[x86] watchdog: Use TSC as the watchdog clocksource with
+			which to check other HW timers (HPET or PM timer), but
+			only on systems where TSC has been deemed trustworthy.
+			This will be suppressed by an earlier tsc=nowatchdog and
+			can be overridden by a later tsc=nowatchdog.  A console
+			message will flag any such suppression or overriding.
 
 	tsc_early_khz=  [X86] Skip early TSC calibration and use the given
 			value instead. Useful when the early TSC frequency discovery
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a5371c6d4b64b..306c233c98d84 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -294,6 +294,7 @@ __setup("notsc", notsc_setup);
 
 static int no_sched_irq_time;
 static int no_tsc_watchdog;
+static int tsc_as_watchdog;
 
 static int __init tsc_setup(char *str)
 {
@@ -303,10 +304,22 @@ static int __init tsc_setup(char *str)
 		no_sched_irq_time = 1;
 	if (!strcmp(str, "unstable"))
 		mark_tsc_unstable("boot parameter");
-	if (!strcmp(str, "nowatchdog"))
+	if (!strcmp(str, "nowatchdog")) {
 		no_tsc_watchdog = 1;
+		if (tsc_as_watchdog)
+			pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
+				 __func__);
+		tsc_as_watchdog = 0;
+	}
 	if (!strcmp(str, "recalibrate"))
 		tsc_force_recalibrate = 1;
+	if (!strcmp(str, "watchdog")) {
+		if (no_tsc_watchdog)
+			pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
+				 __func__);
+		else
+			tsc_as_watchdog = 1;
+	}
 	return 1;
 }
 
@@ -1192,7 +1205,8 @@ static void __init tsc_disable_clocksource_watchdog(void)
 
 bool tsc_clocksource_watchdog_disabled(void)
 {
-	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY);
+	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
+	       tsc_as_watchdog && !no_tsc_watchdog;
 }
 
 static void __init check_system_tsc_reliable(void)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified
  2023-02-02  7:57           ` Thomas Gleixner
@ 2023-02-04  1:27             ` Paul E. McKenney
  0 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-02-04  1:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Waiman Long, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, feng.tang,
	zhengjun.xing, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Daniel Lezcano, x86

On Thu, Feb 02, 2023 at 08:57:39AM +0100, Thomas Gleixner wrote:
> On Wed, Feb 01 2023 at 22:40, Waiman Long wrote:
> > On 2/1/23 14:55, Paul E. McKenney wrote:
> >>>>> Therefore, when watchdog verification of TSC is disabled, enable it for
> >>>>> HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
> >>>>> time-skew diagnostic without degrading the system's performance.
> >>>> I'm more than unhappy about this. We finally have a point where the TSC
> >>>> watchdog overhead can go away without adding TSC=reliable to the kernel
> >>>> commandline.
> >>>>
> >>>> Now you add an unconditionally enforce the watchdog again in a way which
> >>>> even cannot be disabled on the kernel command line.
> >>>>
> >>>> Patently bad idea, no cookies for you!
> >>> I have a similar concern about this patch as well. That is why I was
> >>> suggesting to have this enabled for a limited time after boot for sanity
> >>> checking purpose only.
> >> Fair enough!
> >>
> >> If the watchdog checking of HPET and/or PMTMR against TSC only happens
> >> only when the sysadm asks for it, would you still want to have the ability
> >> to enable such watchdog checking at boot time, and then to disable it
> >> once the system had been running for some limited time?
> >
> > Yes, being optional is another way to avoid the overhead for the 
> > majority of users. The paranoids can turn it on if they want to.
> 
> Yes, opt-in is good enough.

I have added this commit to my clocksource branch:

2ff7dacc88b0 clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested

It is passing my internal tests, and if it does fine for a couple of
days in -next, I will send an updated pull request.

							Thanx, Paul

------------------------------------------------------------------------

The full clocksource branch:

beaa1ffe551c3 clocksource: Print clocksource name when clocksource is tested unstable
c37e85c135cea clocksource: Loosen clocksource watchdog constraints
f092eb34b3304 clocksource: Improve read-back-delay message
dd029269947a3 clocksource: Improve "skew is too large" messages
b7082cdfc464b clocksource: Suspend the watchdog temporarily when high read latency detected
a7ec817d55421 x86/tsc: Add option to force frequency recalibration with HW timer
efc8b329c7fdc clocksource: Verify HPET and PMTMR when TSC unverified
2ff7dacc88b0c clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATCH v2 clocksource 8/7] clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested
  2023-02-03  4:36 ` PATCH v2 clocksource 8/7] clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested Paul E. McKenney
@ 2023-02-06 19:57   ` Waiman Long
  2023-02-07  1:08     ` Paul E. McKenney
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2023-02-06 19:57 UTC (permalink / raw)
  To: paulmck, tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing

On 2/2/23 23:36, Paul E. McKenney wrote:
> Unconditionally enabling TSC watchdog checking of the HPET and PMTMR
> clocksources can degrade latency and performance.  Therefore, provide
> a new "watchdog" option to the tsc= boot parameter that opts into such
> checking.  Note that tsc=watchdog is overridden by a tsc=nowatchdog
> regardless of their relative positions in the list of boot parameters.
>
> Reported-by: Thomas Gleixner <tglx@linutronix.de>
> Reported-by: Waiman Long <longman@redhat.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 95f0d104c2322..7b4df6d89d3c3 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6373,6 +6373,12 @@
>   			(HPET or PM timer) on systems whose TSC frequency was
>   			obtained from HW or FW using either an MSR or CPUID(0x15).
>   			Warn if the difference is more than 500 ppm.
> +			[x86] watchdog: Use TSC as the watchdog clocksource with
> +			which to check other HW timers (HPET or PM timer), but
> +			only on systems where TSC has been deemed trustworthy.
> +			This will be suppressed by an earlier tsc=nowatchdog and
> +			can be overridden by a later tsc=nowatchdog.  A console
> +			message will flag any such suppression or overriding.
>   
>   	tsc_early_khz=  [X86] Skip early TSC calibration and use the given
>   			value instead. Useful when the early TSC frequency discovery
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index a5371c6d4b64b..306c233c98d84 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -294,6 +294,7 @@ __setup("notsc", notsc_setup);
>   
>   static int no_sched_irq_time;
>   static int no_tsc_watchdog;
> +static int tsc_as_watchdog;
>   
>   static int __init tsc_setup(char *str)
>   {
> @@ -303,10 +304,22 @@ static int __init tsc_setup(char *str)
>   		no_sched_irq_time = 1;
>   	if (!strcmp(str, "unstable"))
>   		mark_tsc_unstable("boot parameter");
> -	if (!strcmp(str, "nowatchdog"))
> +	if (!strcmp(str, "nowatchdog")) {
>   		no_tsc_watchdog = 1;
> +		if (tsc_as_watchdog)
> +			pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
> +				 __func__);
> +		tsc_as_watchdog = 0;
> +	}
>   	if (!strcmp(str, "recalibrate"))
>   		tsc_force_recalibrate = 1;
> +	if (!strcmp(str, "watchdog")) {
> +		if (no_tsc_watchdog)
> +			pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
> +				 __func__);
> +		else
> +			tsc_as_watchdog = 1;
> +	}
>   	return 1;
>   }
>   
> @@ -1192,7 +1205,8 @@ static void __init tsc_disable_clocksource_watchdog(void)
>   
>   bool tsc_clocksource_watchdog_disabled(void)
>   {
> -	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY);
> +	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
> +	       tsc_as_watchdog && !no_tsc_watchdog;
>   }
>   
>   static void __init check_system_tsc_reliable(void)
>
Acked-by: Waiman Long <longman@redhat.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATCH v2 clocksource 8/7] clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested
  2023-02-06 19:57   ` Waiman Long
@ 2023-02-07  1:08     ` Paul E. McKenney
  0 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2023-02-07  1:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: tglx, linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland,
	maz, kernel-team, neeraju, ak, feng.tang, zhengjun.xing

On Mon, Feb 06, 2023 at 02:57:55PM -0500, Waiman Long wrote:
> On 2/2/23 23:36, Paul E. McKenney wrote:
> > Unconditionally enabling TSC watchdog checking of the HPET and PMTMR
> > clocksources can degrade latency and performance.  Therefore, provide
> > a new "watchdog" option to the tsc= boot parameter that opts into such
> > checking.  Note that tsc=watchdog is overridden by a tsc=nowatchdog
> > regardless of their relative positions in the list of boot parameters.
> > 
> > Reported-by: Thomas Gleixner <tglx@linutronix.de>
> > Reported-by: Waiman Long <longman@redhat.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 95f0d104c2322..7b4df6d89d3c3 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -6373,6 +6373,12 @@
> >   			(HPET or PM timer) on systems whose TSC frequency was
> >   			obtained from HW or FW using either an MSR or CPUID(0x15).
> >   			Warn if the difference is more than 500 ppm.
> > +			[x86] watchdog: Use TSC as the watchdog clocksource with
> > +			which to check other HW timers (HPET or PM timer), but
> > +			only on systems where TSC has been deemed trustworthy.
> > +			This will be suppressed by an earlier tsc=nowatchdog and
> > +			can be overridden by a later tsc=nowatchdog.  A console
> > +			message will flag any such suppression or overriding.
> >   	tsc_early_khz=  [X86] Skip early TSC calibration and use the given
> >   			value instead. Useful when the early TSC frequency discovery
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index a5371c6d4b64b..306c233c98d84 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -294,6 +294,7 @@ __setup("notsc", notsc_setup);
> >   static int no_sched_irq_time;
> >   static int no_tsc_watchdog;
> > +static int tsc_as_watchdog;
> >   static int __init tsc_setup(char *str)
> >   {
> > @@ -303,10 +304,22 @@ static int __init tsc_setup(char *str)
> >   		no_sched_irq_time = 1;
> >   	if (!strcmp(str, "unstable"))
> >   		mark_tsc_unstable("boot parameter");
> > -	if (!strcmp(str, "nowatchdog"))
> > +	if (!strcmp(str, "nowatchdog")) {
> >   		no_tsc_watchdog = 1;
> > +		if (tsc_as_watchdog)
> > +			pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
> > +				 __func__);
> > +		tsc_as_watchdog = 0;
> > +	}
> >   	if (!strcmp(str, "recalibrate"))
> >   		tsc_force_recalibrate = 1;
> > +	if (!strcmp(str, "watchdog")) {
> > +		if (no_tsc_watchdog)
> > +			pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
> > +				 __func__);
> > +		else
> > +			tsc_as_watchdog = 1;
> > +	}
> >   	return 1;
> >   }
> > @@ -1192,7 +1205,8 @@ static void __init tsc_disable_clocksource_watchdog(void)
> >   bool tsc_clocksource_watchdog_disabled(void)
> >   {
> > -	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY);
> > +	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
> > +	       tsc_as_watchdog && !no_tsc_watchdog;
> >   }
> >   static void __init check_system_tsc_reliable(void)
> > 
> Acked-by: Waiman Long <longman@redhat.com>

Applied, thank you!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-02-07  1:08 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-25  0:27 [PATCH clocksource v2 0/7] Clocksource watchdog updates for v6.3 Paul E. McKenney
2023-01-25  0:27 ` [PATCH v2 clocksource 1/7] clocksource: Print clocksource name when clocksource is tested unstable Paul E. McKenney
2023-01-25  0:27 ` [PATCH v2 clocksource 2/7] clocksource: Loosen clocksource watchdog constraints Paul E. McKenney
2023-01-25  0:27 ` [PATCH v2 clocksource 3/7] clocksource: Improve read-back-delay message Paul E. McKenney
2023-01-25  0:27 ` [PATCH v2 clocksource 4/7] clocksource: Improve "skew is too large" messages Paul E. McKenney
2023-01-25  0:27 ` [PATCH v2 clocksource 5/7] clocksource: Suspend the watchdog temporarily when high read latency detected Paul E. McKenney
2023-01-25  0:27 ` [PATCH v2 clocksource 6/7] clocksource: Verify HPET and PMTMR when TSC unverified Paul E. McKenney
2023-01-26 10:57   ` Daniel Lezcano
2023-02-01  0:50     ` Paul E. McKenney
2023-02-01 10:24   ` Thomas Gleixner
2023-02-01 15:10     ` Feng Tang
2023-02-01 19:26     ` Waiman Long
2023-02-01 19:55       ` Paul E. McKenney
2023-02-02  3:40         ` Waiman Long
2023-02-02  4:54           ` Paul E. McKenney
2023-02-02  7:57           ` Thomas Gleixner
2023-02-04  1:27             ` Paul E. McKenney
2023-02-01 19:51     ` Paul E. McKenney
2023-01-25  0:27 ` [PATCH v2 clocksource 7/7] x86/tsc: Add option to force frequency recalibration with HW timer Paul E. McKenney
2023-02-03  4:36 ` PATCH v2 clocksource 8/7] clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested Paul E. McKenney
2023-02-06 19:57   ` Waiman Long
2023-02-07  1:08     ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).