[PATCH clocksource 0/3] Reject bogus watchdog clocksource measurements

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH clocksource 0/3] Reject bogus watchdog clocksource measurements
@ 2022-11-14 23:28 Paul E. McKenney
  2022-11-14 23:28 ` [PATCH clocksource 1/3] clocksource: " Paul E. McKenney
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-14 23:28 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing

Hello!

This series adds comments classifying bogus measurements and adds capped
exponential backoff for messages that are likely caused by overly busy
systems.

1.	Reject bogus watchdog clocksource measurements.

2.	Add comments to classify bogus measurements.

3.	Exponential backoff for load-induced bogus watchdog reads.

Changes since v1:

o	Merge the "Reject bogus watchdog clocksource measurements" [1] series
	with the "Clocksource-watchdog classification and backoff" [2] series.

o	Updated messages to indicate timer bug for too-short watchdog
	intervals based on Feng Tang feedback.

o	Added explanatory comments before error messages based on Feng
	Tang feedback.

o	Check both the reference clocksource and the clocksource under
	test to handle clocksource-internal counter wrap, based on Feng
	Tang feedback.

o	Reversed the order of the checks, so that the too-long check
	preceeds the too-short check.  This avoids too-short false
	positives due to counter wrap.

o	Simplified exponential-backoff time checks based on feedback
	from Waiman Long.

o	Updated the exponentially backed off messages to more clearly
	indicate that the counts are instances since the last printed
	message.

o	Applied Reviewed-by tags.

[1] https://lore.kernel.org/all/20221019230904.GA2502730@paulmck-ThinkPad-P17-Gen-1
[2] https://lore.kernel.org/all/20221102184001.GA1306489@paulmck-ThinkPad-P17-Gen-1/

						Thanx, Paul

------------------------------------------------------------------------

 b/include/linux/clocksource.h |    3 +++
 b/kernel/time/clocksource.c   |   13 ++++++++++++-
 kernel/time/clocksource.c     |   33 +++++++++++++++++++++++++++------
 3 files changed, 42 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-14 23:28 [PATCH clocksource 0/3] Reject bogus watchdog clocksource measurements Paul E. McKenney
@ 2022-11-14 23:28 ` Paul E. McKenney
  2022-11-17 21:57   ` Thomas Gleixner
  2022-11-14 23:28 ` [PATCH clocksource 2/3] clocksource: Add comments to classify bogus measurements Paul E. McKenney
  2022-11-14 23:28 ` [PATCH clocksource 3/3] clocksource: Exponential backoff for load-induced bogus watchdog reads Paul E. McKenney
  2 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-14 23:28 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Paul E. McKenney, Chris Mason, John Stultz, Waiman Long

One remaining clocksource-skew issue involves extreme CPU overcommit,
which can cause the clocksource watchdog measurements to be delayed by
tens of seconds.  This in turn means that a clock-skew criterion that
is appropriate for a 500-millisecond interval will instead give lots of
false positives.

Therefore, check for the watchdog clocksource reporting much larger or
much less than the time specified by WATCHDOG_INTERVAL.  In these cases,
print a pr_warn() warning and refrain from marking the clocksource under
test as being unstable.

Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Waiman Long <longman@redhat.com>
---
 kernel/time/clocksource.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 8058bec87acee..dcaf38c062161 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -386,7 +386,7 @@ EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
 
 static void clocksource_watchdog(struct timer_list *unused)
 {
-	u64 csnow, wdnow, cslast, wdlast, delta;
+	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
 	int next_cpu, reset_pending;
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
@@ -440,6 +440,17 @@ static void clocksource_watchdog(struct timer_list *unused)
 		if (atomic_read(&watchdog_reset_pending))
 			continue;
 
+		/* Check for bogus measurements. */
+		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
+		if (wd_nsec < (wdi >> 2)) {
+			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			continue;
+		}
+		if (wd_nsec > (wdi << 2)) {
+			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			continue;
+		}
+
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
 		if (abs(cs_nsec - wd_nsec) > md) {
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH clocksource 2/3] clocksource: Add comments to classify bogus measurements
  2022-11-14 23:28 [PATCH clocksource 0/3] Reject bogus watchdog clocksource measurements Paul E. McKenney
  2022-11-14 23:28 ` [PATCH clocksource 1/3] clocksource: " Paul E. McKenney
@ 2022-11-14 23:28 ` Paul E. McKenney
  2022-11-14 23:28 ` [PATCH clocksource 3/3] clocksource: Exponential backoff for load-induced bogus watchdog reads Paul E. McKenney
  2 siblings, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-14 23:28 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Paul E. McKenney, Waiman Long, John Stultz

An extremely busy system can delay the clocksource watchdog, so that
the corresponding too-long bogus-measurement error does not necessarily
imply an error in the system.  However, a too-short bogus-measurement
error likely indicates a bug in hardware, firmware or software.

Therefore, add comments clarifying these bogus-measurement pr_warn()s.

Reported-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
---
 kernel/time/clocksource.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index dcaf38c062161..3f5317faf891f 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -443,10 +443,12 @@ static void clocksource_watchdog(struct timer_list *unused)
 		/* Check for bogus measurements. */
 		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
 		if (wd_nsec < (wdi >> 2)) {
+			/* This usually indicates broken timer code or hardware. */
 			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
 			continue;
 		}
 		if (wd_nsec > (wdi << 2)) {
+			/* This can happen on busy systems, which can delay the watchdog. */
 			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
 			continue;
 		}
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH clocksource 3/3] clocksource: Exponential backoff for load-induced bogus watchdog reads
  2022-11-14 23:28 [PATCH clocksource 0/3] Reject bogus watchdog clocksource measurements Paul E. McKenney
  2022-11-14 23:28 ` [PATCH clocksource 1/3] clocksource: " Paul E. McKenney
  2022-11-14 23:28 ` [PATCH clocksource 2/3] clocksource: Add comments to classify bogus measurements Paul E. McKenney
@ 2022-11-14 23:28 ` Paul E. McKenney
  2 siblings, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-14 23:28 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Paul E. McKenney, Waiman Long, John Stultz

The clocksource watchdog will reject measurements that are excessively
delayed, that is, by more than 1.5 seconds beyond the intended 0.5-second
watchdog interval.  On an extremely busy system, this can result in a
console message being printed every two seconds.  This is excessively
noisy for a non-error condition.

Therefore, apply exponential backoff to these messages.  This exponential
backoff is capped at 1024 times the watchdog interval, which comes to
not quite one message per ten minutes.

Please note that the bogus watchdog reads that occur when the watchdog
interval is less than 0.125 seconds are still printed unconditionally
because these likely correspond to a serious error condition in the
timer code or hardware.

[ paulmck: Apply Feng Tang feedback. ]
[ paulmck: Apply Waiman Long feedback. ]

Reported-by: Waiman Long <longman@redhat.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
---
 include/linux/clocksource.h |  3 +++
 kernel/time/clocksource.c   | 31 +++++++++++++++++++++++++------
 2 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 1d42d4b173271..23b73f2293d6d 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -125,6 +125,9 @@ struct clocksource {
 	struct list_head	wd_list;
 	u64			cs_last;
 	u64			wd_last;
+	u64			wd_last_bogus;
+	int			wd_bogus_shift;
+	unsigned long		wd_bogus_count;
 #endif
 	struct module		*owner;
 };
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 3f5317faf891f..4015ec6503a52 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -442,14 +442,33 @@ static void clocksource_watchdog(struct timer_list *unused)
 
 		/* Check for bogus measurements. */
 		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
-		if (wd_nsec < (wdi >> 2)) {
-			/* This usually indicates broken timer code or hardware. */
-			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
+			bool needwarn = false;
+			u64 wd_lb;
+
+			cs->wd_bogus_count++;
+			if (!cs->wd_bogus_shift) {
+				needwarn = true;
+			} else {
+				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
+				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
+					needwarn = true;
+			}
+			if (needwarn) {
+				/* This can happen on busy systems, which can delay the watchdog. */
+				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu since last message), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count);
+				cs->wd_last_bogus = wdnow;
+				if (cs->wd_bogus_shift < 10)
+					cs->wd_bogus_shift++;
+				cs->wd_bogus_count = 0;
+			}
 			continue;
 		}
-		if (wd_nsec > (wdi << 2)) {
-			/* This can happen on busy systems, which can delay the watchdog. */
-			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+		/* Check too-short measurements second to handle wrap. */
+		if (wd_nsec < (wdi >> 2) || cs_nsec < (wdi >> 2)) {
+			/* This usually indicates broken timer code or hardware. */
+			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
 			continue;
 		}
 
-- 
2.31.1.189.g2e36527f23


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-14 23:28 ` [PATCH clocksource 1/3] clocksource: " Paul E. McKenney
@ 2022-11-17 21:57   ` Thomas Gleixner
  2022-11-17 23:09     ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2022-11-17 21:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing,
	Paul E. McKenney, Chris Mason, John Stultz, Waiman Long

Paul!

On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
>  
> +		/* Check for bogus measurements. */
> +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> +		if (wd_nsec < (wdi >> 2)) {
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}
> +		if (wd_nsec > (wdi << 2)) {
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}

This is really getting ridiculous.

The clocksource watchdog is supposed to run periodically with period =
WATCHDOG_INTERVAL.

That periodic schedule depends on the clocksource which is monitored. If
the clocksource runs fast the period is shortened and if it runs slow is
prolonged.

Now you add checks:

 1) If the period observed by the watchdog clocksource is less than 1/4
    of the expected period, everything is fine.

 2) If the period observed by the watchdog clocksource is greater than 4
    times the expected period, everything is fine.

IOW, you are preventing detection of one class of problems which caused
us to implement the watchdog clocksource in the first place.

You are preventing it by making the watchdog decision circular dependent
on the clocksource it is supposed to watch. IOW, you put a fox in charge
of the henhouse. That's a really brilliant plan.

But what's worse is the constant stream of heuristics which make the
clocksource watchdog "work" under workloads which are simply impossible
to be handled by its current implementation.

If I look at the full set of them by now, I'm pretty sure that a real
TSC fail would not be noticed anymore because there are more exceptions
and excuses why a particular measurement it bogus or invalid or
whatever.

I didn't do a full analysis yet, but I have a hard time to convince
myself that - assumed we add this gem - the watchdog will be anything
else than a useless waste of CPU cycles as there is always one of the
heuristics declaring that everything is fine along with incomprehensible
messages in dmesg which will create more confusion to most people than
being helpful.

This is hunting us for 20+ years now and why do we still need this? I'm
pretty sure that farcebook does not run their server farms on 20 years
old silicon.

That means even the largest customers have not been able to convince the
CPU manufactures to fix this idiocy by now, right? Either that or they
did not even try because it's simpler to "fix" it in software.

That's the only two explanations I have for the constant stream of
voodoo logic. Both are just a proof for my claim that this industry just
"works" by chance.

Can we stop this pretty please and either come up with something
fundamentally different or just admit defeat and remove the whole thing?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-17 21:57   ` Thomas Gleixner
@ 2022-11-17 23:09     ` Paul E. McKenney
  2022-11-21  0:55       ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-17 23:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, john.stultz, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, neeraju, ak, feng.tang, zhengjun.xing, Chris Mason,
	John Stultz, Waiman Long

On Thu, Nov 17, 2022 at 10:57:34PM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
> >  
> > +		/* Check for bogus measurements. */
> > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > +		if (wd_nsec < (wdi >> 2)) {
> > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > +			continue;
> > +		}
> > +		if (wd_nsec > (wdi << 2)) {
> > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > +			continue;
> > +		}
> 
> This is really getting ridiculous.

I have absolutely no argument with this statement, and going back a
long time.  ;-)

But the set of systems that caused me to send this turned out to have
real divergence between HPET and TSC, and 40 milliseconds per second of
divergence at that.  So not only do you hate this series, but it is also
the case that this series doesn't help with the problem at hand.

I will therefore set this series aside.

And I sure hope that it is the TSC and not HPET that is correct on those
systems, because they have constant_tsc, nonstop_tsc, and tsc_adjust
all set in their lscpu output.  :-/

And yes, I backported Feng Tang's "b50db7095fe0 ("x86/tsc: Disable
clocksource watchdog for TSC on qualified platorms") commit for them.
Worked like a charm, so thank you, Feng Tang!

But as long as I started responding, I might as well continue.  ;-)

> The clocksource watchdog is supposed to run periodically with period =
> WATCHDOG_INTERVAL.
> 
> That periodic schedule depends on the clocksource which is monitored. If
> the clocksource runs fast the period is shortened and if it runs slow is
> prolonged.
> 
> Now you add checks:
> 
>  1) If the period observed by the watchdog clocksource is less than 1/4
>     of the expected period, everything is fine.

s/less than/more than/, then agreed, give or take that a console message
is printed.  Also, the third patch adds a check of the clocksource under
check, so that if either the watchdog clocksource or the clocksource
under test says that the period observed is less than 1/4 fo the expected
period, we print the message but don't turn off the clocksource under
test.

>  2) If the period observed by the watchdog clocksource is greater than 4
>     times the expected period, everything is fine.

s/greater than/less than/, then agreed, again give or take the console
message and the subsequent check of both clocksources.

> IOW, you are preventing detection of one class of problems which caused
> us to implement the watchdog clocksource in the first place.

The problem is still memorialized via those console-log message.

> You are preventing it by making the watchdog decision circular dependent
> on the clocksource it is supposed to watch. IOW, you put a fox in charge
> of the henhouse. That's a really brilliant plan.
> 
> But what's worse is the constant stream of heuristics which make the
> clocksource watchdog "work" under workloads which are simply impossible
> to be handled by its current implementation.
> 
> If I look at the full set of them by now, I'm pretty sure that a real
> TSC fail would not be noticed anymore because there are more exceptions
> and excuses why a particular measurement it bogus or invalid or
> whatever.

It really did spot the 40ms/s divergence between HPET and TSC.

> I didn't do a full analysis yet, but I have a hard time to convince
> myself that - assumed we add this gem - the watchdog will be anything
> else than a useless waste of CPU cycles as there is always one of the
> heuristics declaring that everything is fine along with incomprehensible
> messages in dmesg which will create more confusion to most people than
> being helpful.

The messages certainly could be improved.  I freely confess to having
focused solely on eliminating false positives.

And to good effect.  The overwhelming bulk of our fleet seems to have
good timers.  The issues are concentrated in systems with dying CPUs and
in a few of the quite-new systems.  Had this happened ten years ago,
we would both be engaged in activities that are far more productive.
Or, failing that, far more fun.

> This is hunting us for 20+ years now and why do we still need this? I'm
> pretty sure that farcebook does not run their server farms on 20 years
> old silicon.

The 40ms/s divergence between TSC and HPET was observed on hardware that
is quite recent.  Again, I sure hope that it is HPET that is unreliable
rather than TSC, but either way, we have a very modern piece of hardware
with at least one very iffy clocksource.  Needless to say, this does not
build confidence.  Yes, it might be a firmware or some sort of calibration
problem, but this system is definitely well beyond the simulation and
FPGA stage.

> That means even the largest customers have not been able to convince the
> CPU manufactures to fix this idiocy by now, right? Either that or they
> did not even try because it's simpler to "fix" it in software.

Worse yet, the manufacturers that did fix it were not helped by having
reliable per-CPU clocks.  The response was of the form "Nice, but given
that it does not work on x86, we won't be relying on it."

Can't win for losing.  ;-)

> That's the only two explanations I have for the constant stream of
> voodoo logic. Both are just a proof for my claim that this industry just
> "works" by chance.

I am sure that you meant to say "just additional proof", but yes.  :-/

My concern is that the next vendor will make all the same mistakes all
over again.  It would be better for the kernel (or something equally
ubiquitous) to yell at them right away rather than for hapless users to
have to figure it out.

> Can we stop this pretty please and either come up with something
> fundamentally different or just admit defeat and remove the whole thing?

I am not at all married to the current code, but I really would like
offending system manufacturers to get their hands slapped during hardware
bringup rather than such bugs being inflicted on their unwary users.

Do you have any thoughts on what you would want a fundamentally different
approach to look like?

Yes, I do hear the "null approach" as your current preferred option.  ;-)

But having just met up with a system with seriously divergent clocks,
I am a bit shy of that approach.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-17 23:09     ` Paul E. McKenney
@ 2022-11-21  0:55       ` Feng Tang
  2022-11-21 15:21         ` Paul E. McKenney
  2022-11-21 18:14         ` Paul E. McKenney
  0 siblings, 2 replies; 28+ messages in thread
From: Feng Tang @ 2022-11-21  0:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Thu, Nov 17, 2022 at 03:09:10PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 17, 2022 at 10:57:34PM +0100, Thomas Gleixner wrote:
> > Paul!
> > 
> > On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
> > >  
> > > +		/* Check for bogus measurements. */
> > > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > > +		if (wd_nsec < (wdi >> 2)) {
> > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > +			continue;
> > > +		}
> > > +		if (wd_nsec > (wdi << 2)) {
> > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > +			continue;
> > > +		}
> > 
> > This is really getting ridiculous.
> 
> I have absolutely no argument with this statement, and going back a
> long time.  ;-)
> 
> But the set of systems that caused me to send this turned out to have
> real divergence between HPET and TSC, and 40 milliseconds per second of
> divergence at that.  So not only do you hate this series, but it is also
> the case that this series doesn't help with the problem at hand.

The drift is about 4% which is quite big. It seems that this is
either problem of HPET/TSC's hardware/firmware, or the problem of
frequency calibration for HPET/TSC. TSC calibration is complex,
as it could be done from different methods depending on hardware
and firmware, could you share the kernel boot log related with
tsc/hpet and clocksource? 

Also if your platform has acpi PM_TIMER, you may try "nohpet"
to use PM_TIMER instead of HPET and check if there is also big
drift between TSC and PM_TIMER.

Thanks,
Feng



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-21  0:55       ` Feng Tang
@ 2022-11-21 15:21         ` Paul E. McKenney
  2022-11-21 18:14         ` Paul E. McKenney
  1 sibling, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-21 15:21 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Mon, Nov 21, 2022 at 08:55:15AM +0800, Feng Tang wrote:
> On Thu, Nov 17, 2022 at 03:09:10PM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 17, 2022 at 10:57:34PM +0100, Thomas Gleixner wrote:
> > > Paul!
> > > 
> > > On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
> > > >  
> > > > +		/* Check for bogus measurements. */
> > > > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > > > +		if (wd_nsec < (wdi >> 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > > +		if (wd_nsec > (wdi << 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > 
> > > This is really getting ridiculous.
> > 
> > I have absolutely no argument with this statement, and going back a
> > long time.  ;-)
> > 
> > But the set of systems that caused me to send this turned out to have
> > real divergence between HPET and TSC, and 40 milliseconds per second of
> > divergence at that.  So not only do you hate this series, but it is also
> > the case that this series doesn't help with the problem at hand.
> 
> The drift is about 4% which is quite big. It seems that this is
> either problem of HPET/TSC's hardware/firmware, or the problem of
> frequency calibration for HPET/TSC. TSC calibration is complex,
> as it could be done from different methods depending on hardware
> and firmware, could you share the kernel boot log related with
> tsc/hpet and clocksource? 

Thank you for looking into this!

Knucklehead here failed to save away the dmesg.  Let me see what
I can drum up.

> Also if your platform has acpi PM_TIMER, you may try "nohpet"
> to use PM_TIMER instead of HPET and check if there is also big
> drift between TSC and PM_TIMER.

Let me see about giving this a go, and again, thank you!

						Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-21  0:55       ` Feng Tang
  2022-11-21 15:21         ` Paul E. McKenney
@ 2022-11-21 18:14         ` Paul E. McKenney
  2022-11-22 15:55           ` Feng Tang
  1 sibling, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-21 18:14 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Mon, Nov 21, 2022 at 08:55:15AM +0800, Feng Tang wrote:
> On Thu, Nov 17, 2022 at 03:09:10PM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 17, 2022 at 10:57:34PM +0100, Thomas Gleixner wrote:
> > > Paul!
> > > 
> > > On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
> > > >  
> > > > +		/* Check for bogus measurements. */
> > > > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > > > +		if (wd_nsec < (wdi >> 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > > +		if (wd_nsec > (wdi << 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > 
> > > This is really getting ridiculous.
> > 
> > I have absolutely no argument with this statement, and going back a
> > long time.  ;-)
> > 
> > But the set of systems that caused me to send this turned out to have
> > real divergence between HPET and TSC, and 40 milliseconds per second of
> > divergence at that.  So not only do you hate this series, but it is also
> > the case that this series doesn't help with the problem at hand.
> 
> The drift is about 4% which is quite big. It seems that this is
> either problem of HPET/TSC's hardware/firmware, or the problem of
> frequency calibration for HPET/TSC. TSC calibration is complex,
> as it could be done from different methods depending on hardware
> and firmware, could you share the kernel boot log related with
> tsc/hpet and clocksource? 
> 
> Also if your platform has acpi PM_TIMER, you may try "nohpet"
> to use PM_TIMER instead of HPET and check if there is also big
> drift between TSC and PM_TIMER.

The kernel is built with CONFIG_X86_PM_TIMER=y, so I was guessing
that there is an ACPI PM_TIMER.  Except that when I booted
without your "Disable clocksource watchdog for TSC on qualified
platforms" patch, I get the following:

[   44.303035] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[   44.347034] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
[   44.374034] clocksource:                       'tsc-early' cs_nsec: 588042081 cs_now: 66c486d157 cs_last: 6682125e5e mask: ffffffffffffffff
[   44.403034] clocksource:                       No current clocksource.
[   44.418034] tsc: Marking TSC unstable due to clocksource watchdog

If PM_TIMER was involved, I would expect 'acpi_pm' instead of
refined-jiffies.  Or am I misinterpreting the output and/or code?

Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
clocksource_hpet.flags?

I am sending the full console output off-list.  Hey, you asked for it!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-21 18:14         ` Paul E. McKenney
@ 2022-11-22 15:55           ` Feng Tang
  2022-11-22 22:07             ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Feng Tang @ 2022-11-22 15:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Mon, Nov 21, 2022 at 10:14:49AM -0800, Paul E. McKenney wrote:
> > > I have absolutely no argument with this statement, and going back a
> > > long time.  ;-)
> > > 
> > > But the set of systems that caused me to send this turned out to have
> > > real divergence between HPET and TSC, and 40 milliseconds per second of
> > > divergence at that.  So not only do you hate this series, but it is also
> > > the case that this series doesn't help with the problem at hand.
> > 
> > The drift is about 4% which is quite big. It seems that this is
> > either problem of HPET/TSC's hardware/firmware, or the problem of
> > frequency calibration for HPET/TSC. TSC calibration is complex,
> > as it could be done from different methods depending on hardware
> > and firmware, could you share the kernel boot log related with
> > tsc/hpet and clocksource? 
> > 
> > Also if your platform has acpi PM_TIMER, you may try "nohpet"
> > to use PM_TIMER instead of HPET and check if there is also big
> > drift between TSC and PM_TIMER.
> 
> The kernel is built with CONFIG_X86_PM_TIMER=y, so I was guessing
> that there is an ACPI PM_TIMER.  Except that when I booted
> without your "Disable clocksource watchdog for TSC on qualified
> platforms" patch, I get the following:
> 
> [   44.303035] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> [   44.347034] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> [   44.374034] clocksource:                       'tsc-early' cs_nsec: 588042081 cs_now: 66c486d157 cs_last: 6682125e5e mask: ffffffffffffffff
> [   44.403034] clocksource:                       No current clocksource.
> [   44.418034] tsc: Marking TSC unstable due to clocksource watchdog
 
Aha, we've met similar error (TSC being judged 'unstable' by
'refined-jiffies') before, and our root cause is discussed in [1].
In our case, we had early serial console enabled, which made it
easier to be reproduced.

That was a trigger for us to proposed severl solutions before Thomas
suggested to disable tsc watchdog for all qualified platforms.

[1]. https://lore.kernel.org/lkml/20201126012421.GA92582@shbuild999.sh.intel.com/

> If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> refined-jiffies.  Or am I misinterpreting the output and/or code?

It's about timing. On a typical server platform, the clocksources
init order could be:
  refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 

From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
the tsc.

> Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> clocksource_hpet.flags?

Maybe try below patch, which will skip watchdog for 'tsc-early',
while giving 'acpi_pm' timer a chance to watchdog 'tsc'.

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..9840f0131764 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
 	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
 	.read			= read_tsc,
 	.mask			= CLOCKSOURCE_MASK(64),
-	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
-				  CLOCK_SOURCE_MUST_VERIFY,
+	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
 	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
 	.enable			= tsc_cs_enable,
 	.resume			= tsc_resume,


> I am sending the full console output off-list.  Hey, you asked for it!  ;-)

Thanks for sharing!

Thanks,
Feng

> 
> 							Thanx, Paul

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-22 15:55           ` Feng Tang
@ 2022-11-22 22:07             ` Paul E. McKenney
  2022-11-23  2:36               ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-22 22:07 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Tue, Nov 22, 2022 at 11:55:51PM +0800, Feng Tang wrote:
> On Mon, Nov 21, 2022 at 10:14:49AM -0800, Paul E. McKenney wrote:
> > > > I have absolutely no argument with this statement, and going back a
> > > > long time.  ;-)
> > > > 
> > > > But the set of systems that caused me to send this turned out to have
> > > > real divergence between HPET and TSC, and 40 milliseconds per second of
> > > > divergence at that.  So not only do you hate this series, but it is also
> > > > the case that this series doesn't help with the problem at hand.
> > > 
> > > The drift is about 4% which is quite big. It seems that this is
> > > either problem of HPET/TSC's hardware/firmware, or the problem of
> > > frequency calibration for HPET/TSC. TSC calibration is complex,
> > > as it could be done from different methods depending on hardware
> > > and firmware, could you share the kernel boot log related with
> > > tsc/hpet and clocksource? 
> > > 
> > > Also if your platform has acpi PM_TIMER, you may try "nohpet"
> > > to use PM_TIMER instead of HPET and check if there is also big
> > > drift between TSC and PM_TIMER.
> > 
> > The kernel is built with CONFIG_X86_PM_TIMER=y, so I was guessing
> > that there is an ACPI PM_TIMER.  Except that when I booted
> > without your "Disable clocksource watchdog for TSC on qualified
> > platforms" patch, I get the following:
> > 
> > [   44.303035] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > [   44.347034] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> > [   44.374034] clocksource:                       'tsc-early' cs_nsec: 588042081 cs_now: 66c486d157 cs_last: 6682125e5e mask: ffffffffffffffff
> > [   44.403034] clocksource:                       No current clocksource.
> > [   44.418034] tsc: Marking TSC unstable due to clocksource watchdog
>  
> Aha, we've met similar error (TSC being judged 'unstable' by
> 'refined-jiffies') before, and our root cause is discussed in [1].
> In our case, we had early serial console enabled, which made it
> easier to be reproduced.
> 
> That was a trigger for us to proposed severl solutions before Thomas
> suggested to disable tsc watchdog for all qualified platforms.
> 
> [1]. https://lore.kernel.org/lkml/20201126012421.GA92582@shbuild999.sh.intel.com/

Thank you for the analysis!

> > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > refined-jiffies.  Or am I misinterpreting the output and/or code?
> 
> It's about timing. On a typical server platform, the clocksources
> init order could be:
>   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> 
> >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> the tsc.
> 
> > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > clocksource_hpet.flags?
> 
> Maybe try below patch, which will skip watchdog for 'tsc-early',
> while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index cafacb2e58cc..9840f0131764 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
>  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
>  	.read			= read_tsc,
>  	.mask			= CLOCKSOURCE_MASK(64),
> -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> -				  CLOCK_SOURCE_MUST_VERIFY,
> +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
>  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
>  	.enable			= tsc_cs_enable,
>  	.resume			= tsc_resume,

Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
for TSC on qualified platorms") mitigates the issue so we are good for
the immediate future, at least assuming reliable TSC.

But it also disables checking against HPET, hence my question about
marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.

Would that make sense?

							Thanx, Paul

> > I am sending the full console output off-list.  Hey, you asked for it!  ;-)
> 
> Thanks for sharing!
> 
> Thanks,
> Feng
> 
> > 
> > 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-22 22:07             ` Paul E. McKenney
@ 2022-11-23  2:36               ` Feng Tang
  2022-11-23 21:23                 ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Feng Tang @ 2022-11-23  2:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Tue, Nov 22, 2022 at 02:07:12PM -0800, Paul E. McKenney wrote:
[...] 
> > > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > > refined-jiffies.  Or am I misinterpreting the output and/or code?
> > 
> > It's about timing. On a typical server platform, the clocksources
> > init order could be:
> >   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> > 
> > >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> > initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> > the tsc.
> > 
> > > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > > clocksource_hpet.flags?
> > 
> > Maybe try below patch, which will skip watchdog for 'tsc-early',
> > while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> > 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index cafacb2e58cc..9840f0131764 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
> >  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
> >  	.read			= read_tsc,
> >  	.mask			= CLOCKSOURCE_MASK(64),
> > -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> > -				  CLOCK_SOURCE_MUST_VERIFY,
> > +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
> >  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
> >  	.enable			= tsc_cs_enable,
> >  	.resume			= tsc_resume,
> 
> Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
> for TSC on qualified platorms") mitigates the issue so we are good for
> the immediate future, at least assuming reliable TSC.
> 
> But it also disables checking against HPET, hence my question about
> marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
> on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.

IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
report that the 500ms watchdog timer had big impact on some parallel
workload on big servers, that was another factor for us to seek
stopping the timer.

Is this about the concern of possible TSC frequency calibration
issue, as the 40 ms per second drift between HPET and TSC? With 
b50db7095fe0 backported, we also have another patch to force TSC
calibration for those platforms which get the TSC freq directly
from CPUID or MSR and don't have such info in dmesg:
 "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 

https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/

We did met tsc calibration issue due to some firmware issue, and
this can help to catch it. You can try it if you think it's relevant.

Thanks,
Feng

> Would that make sense?
> 
> 							Thanx, Paul
> 
> > > I am sending the full console output off-list.  Hey, you asked for it!  ;-)
> > 
> > Thanks for sharing!
> > 
> > Thanks,
> > Feng
> > 
> > > 
> > > 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-23  2:36               ` Feng Tang
@ 2022-11-23 21:23                 ` Paul E. McKenney
  2022-11-28  2:15                   ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-23 21:23 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Wed, Nov 23, 2022 at 10:36:04AM +0800, Feng Tang wrote:
> On Tue, Nov 22, 2022 at 02:07:12PM -0800, Paul E. McKenney wrote:
> [...] 
> > > > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > > > refined-jiffies.  Or am I misinterpreting the output and/or code?
> > > 
> > > It's about timing. On a typical server platform, the clocksources
> > > init order could be:
> > >   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> > > 
> > > >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> > > initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> > > the tsc.
> > > 
> > > > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > > > clocksource_hpet.flags?
> > > 
> > > Maybe try below patch, which will skip watchdog for 'tsc-early',
> > > while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> > > 
> > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > > index cafacb2e58cc..9840f0131764 100644
> > > --- a/arch/x86/kernel/tsc.c
> > > +++ b/arch/x86/kernel/tsc.c
> > > @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
> > >  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
> > >  	.read			= read_tsc,
> > >  	.mask			= CLOCKSOURCE_MASK(64),
> > > -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> > > -				  CLOCK_SOURCE_MUST_VERIFY,
> > > +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
> > >  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
> > >  	.enable			= tsc_cs_enable,
> > >  	.resume			= tsc_resume,
> > 
> > Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
> > for TSC on qualified platorms") mitigates the issue so we are good for
> > the immediate future, at least assuming reliable TSC.
> > 
> > But it also disables checking against HPET, hence my question about
> > marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
> > on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.
> 
> IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> report that the 500ms watchdog timer had big impact on some parallel
> workload on big servers, that was another factor for us to seek
> stopping the timer.

Another approach would be to slow it down.  Given the tighter bounds
on skew, it could be done every (say) 10 seconds while allowing
2 milliseconds skew instead of the current 100 microseconds.

> Is this about the concern of possible TSC frequency calibration
> issue, as the 40 ms per second drift between HPET and TSC? With 
> b50db7095fe0 backported, we also have another patch to force TSC
> calibration for those platforms which get the TSC freq directly
> from CPUID or MSR and don't have such info in dmesg:
>  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> 
> https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> 
> We did met tsc calibration issue due to some firmware issue, and
> this can help to catch it. You can try it if you think it's relevant.

I am giving this a go, thank you!

						Thanx, Paul

> Thanks,
> Feng
> 
> > Would that make sense?
> > 
> > 							Thanx, Paul
> > 
> > > > I am sending the full console output off-list.  Hey, you asked for it!  ;-)
> > > 
> > > Thanks for sharing!
> > > 
> > > Thanks,
> > > Feng
> > > 
> > > > 
> > > > 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-23 21:23                 ` Paul E. McKenney
@ 2022-11-28  2:15                   ` Feng Tang
  2022-11-29 19:29                     ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Feng Tang @ 2022-11-28  2:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Wed, Nov 23, 2022 at 01:23:48PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 23, 2022 at 10:36:04AM +0800, Feng Tang wrote:
> > On Tue, Nov 22, 2022 at 02:07:12PM -0800, Paul E. McKenney wrote:
> > [...] 
> > > > > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > > > > refined-jiffies.  Or am I misinterpreting the output and/or code?
> > > > 
> > > > It's about timing. On a typical server platform, the clocksources
> > > > init order could be:
> > > >   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> > > > 
> > > > >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> > > > initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> > > > the tsc.
> > > > 
> > > > > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > > > > clocksource_hpet.flags?
> > > > 
> > > > Maybe try below patch, which will skip watchdog for 'tsc-early',
> > > > while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> > > > 
> > > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > > > index cafacb2e58cc..9840f0131764 100644
> > > > --- a/arch/x86/kernel/tsc.c
> > > > +++ b/arch/x86/kernel/tsc.c
> > > > @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
> > > >  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
> > > >  	.read			= read_tsc,
> > > >  	.mask			= CLOCKSOURCE_MASK(64),
> > > > -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> > > > -				  CLOCK_SOURCE_MUST_VERIFY,
> > > > +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
> > > >  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
> > > >  	.enable			= tsc_cs_enable,
> > > >  	.resume			= tsc_resume,
> > > 
> > > Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
> > > for TSC on qualified platorms") mitigates the issue so we are good for
> > > the immediate future, at least assuming reliable TSC.
> > > 
> > > But it also disables checking against HPET, hence my question about
> > > marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
> > > on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.
> > 
> > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > report that the 500ms watchdog timer had big impact on some parallel
> > workload on big servers, that was another factor for us to seek
> > stopping the timer.
> 
> Another approach would be to slow it down.  Given the tighter bounds
> on skew, it could be done every (say) 10 seconds while allowing
> 2 milliseconds skew instead of the current 100 microseconds.

Yes, this can reduce the OS noise much. One problem is if we make it
a general interface, there is some clocksource whose warp time is
less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
know if other ARCHs have similar cases.

> 
> > Is this about the concern of possible TSC frequency calibration
> > issue, as the 40 ms per second drift between HPET and TSC? With 
> > b50db7095fe0 backported, we also have another patch to force TSC
> > calibration for those platforms which get the TSC freq directly
> > from CPUID or MSR and don't have such info in dmesg:
> >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > 
> > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > 
> > We did met tsc calibration issue due to some firmware issue, and
> > this can help to catch it. You can try it if you think it's relevant.
> 
> I am giving this a go, thank you!

Thanks for spending time testing it!

Thanks,
Feng

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-28  2:15                   ` Feng Tang
@ 2022-11-29 19:29                     ` Paul E. McKenney
  2022-11-30  1:38                       ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-29 19:29 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Mon, Nov 28, 2022 at 10:15:43AM +0800, Feng Tang wrote:
> On Wed, Nov 23, 2022 at 01:23:48PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 23, 2022 at 10:36:04AM +0800, Feng Tang wrote:
> > > On Tue, Nov 22, 2022 at 02:07:12PM -0800, Paul E. McKenney wrote:
> > > [...] 
> > > > > > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > > > > > refined-jiffies.  Or am I misinterpreting the output and/or code?
> > > > > 
> > > > > It's about timing. On a typical server platform, the clocksources
> > > > > init order could be:
> > > > >   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> > > > > 
> > > > > >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> > > > > initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> > > > > the tsc.
> > > > > 
> > > > > > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > > > > > clocksource_hpet.flags?
> > > > > 
> > > > > Maybe try below patch, which will skip watchdog for 'tsc-early',
> > > > > while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> > > > > 
> > > > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > > > > index cafacb2e58cc..9840f0131764 100644
> > > > > --- a/arch/x86/kernel/tsc.c
> > > > > +++ b/arch/x86/kernel/tsc.c
> > > > > @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
> > > > >  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
> > > > >  	.read			= read_tsc,
> > > > >  	.mask			= CLOCKSOURCE_MASK(64),
> > > > > -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> > > > > -				  CLOCK_SOURCE_MUST_VERIFY,
> > > > > +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
> > > > >  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
> > > > >  	.enable			= tsc_cs_enable,
> > > > >  	.resume			= tsc_resume,
> > > > 
> > > > Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
> > > > for TSC on qualified platorms") mitigates the issue so we are good for
> > > > the immediate future, at least assuming reliable TSC.
> > > > 
> > > > But it also disables checking against HPET, hence my question about
> > > > marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
> > > > on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.
> > > 
> > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > report that the 500ms watchdog timer had big impact on some parallel
> > > workload on big servers, that was another factor for us to seek
> > > stopping the timer.
> > 
> > Another approach would be to slow it down.  Given the tighter bounds
> > on skew, it could be done every (say) 10 seconds while allowing
> > 2 milliseconds skew instead of the current 100 microseconds.
> 
> Yes, this can reduce the OS noise much. One problem is if we make it
> a general interface, there is some clocksource whose warp time is
> less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> know if other ARCHs have similar cases.

Maybe a simpler approach is for systems with such high sensitivity to
OS noise to simply disable the clocksource watchdog.  ;-)

> > > Is this about the concern of possible TSC frequency calibration
> > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > b50db7095fe0 backported, we also have another patch to force TSC
> > > calibration for those platforms which get the TSC freq directly
> > > from CPUID or MSR and don't have such info in dmesg:
> > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > 
> > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > 
> > > We did met tsc calibration issue due to some firmware issue, and
> > > this can help to catch it. You can try it if you think it's relevant.
> > 
> > I am giving this a go, thank you!
> 
> Thanks for spending time testing it!

And here are the results from setting tsc_force_recalibrate to 1:

$ dmesg | grep -E 'calibrat|clocksource'
[    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
[   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[   49.762432] clocksource: Switched to clocksource tsc-early
[   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   55.640270] clocksource: Switched to clocksource tsc
[   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
[   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
[   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz

Apologies for the delay, but reconfigurations put the system off the
net for some time.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-29 19:29                     ` Paul E. McKenney
@ 2022-11-30  1:38                       ` Feng Tang
  2022-11-30  4:12                         ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Feng Tang @ 2022-11-30  1:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
[...]
> > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > workload on big servers, that was another factor for us to seek
> > > > stopping the timer.
> > > 
> > > Another approach would be to slow it down.  Given the tighter bounds
> > > on skew, it could be done every (say) 10 seconds while allowing
> > > 2 milliseconds skew instead of the current 100 microseconds.
> > 
> > Yes, this can reduce the OS noise much. One problem is if we make it
> > a general interface, there is some clocksource whose warp time is
> > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > know if other ARCHs have similar cases.
> 
> Maybe a simpler approach is for systems with such high sensitivity to
> OS noise to simply disable the clocksource watchdog.  ;-)

That's what the reported did, test with and without "tsc=reliable"
parameter :)

And AFAIK, many customers with big server farms hate to add more
cmdline parameters when we suggested so.

> > > > Is this about the concern of possible TSC frequency calibration
> > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > calibration for those platforms which get the TSC freq directly
> > > > from CPUID or MSR and don't have such info in dmesg:
> > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > 
> > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > 
> > > > We did met tsc calibration issue due to some firmware issue, and
> > > > this can help to catch it. You can try it if you think it's relevant.
> > > 
> > > I am giving this a go, thank you!
> > 
> > Thanks for spending time testing it!
> 
> And here are the results from setting tsc_force_recalibrate to 1:
> 
> $ dmesg | grep -E 'calibrat|clocksource'
> [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [   49.762432] clocksource: Switched to clocksource tsc-early
> [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   55.640270] clocksource: Switched to clocksource tsc
> [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz

Looks like there is really something wrong here. I assume the first
number '1896 MHz' is got from CPUID(0x15)'s math calculation.

I thinks 2 more things could be try:

* add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
  ACPI PM_TIMER to do the calibration, say a third-party check.
* If the system don't have auto-adjusted time setting like NTP, I
  guess the system time will have obvious drift comparing to a normal
  clock or a mobile phone time, as the deviation is about 4%, which
  is 2.4 minutes per hour.

> Apologies for the delay, but reconfigurations put the system off the
> net for some time.

No problem at all, it's your holiday time! Thanks for trying this!

Thanks,
Feng

> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-30  1:38                       ` Feng Tang
@ 2022-11-30  4:12                         ` Paul E. McKenney
  2022-11-30  4:49                           ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-30  4:12 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> [...]
> > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > workload on big servers, that was another factor for us to seek
> > > > > stopping the timer.
> > > > 
> > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > 
> > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > a general interface, there is some clocksource whose warp time is
> > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > know if other ARCHs have similar cases.
> > 
> > Maybe a simpler approach is for systems with such high sensitivity to
> > OS noise to simply disable the clocksource watchdog.  ;-)
> 
> That's what the reported did, test with and without "tsc=reliable"
> parameter :)
> 
> And AFAIK, many customers with big server farms hate to add more
> cmdline parameters when we suggested so.

It can be surprisingly hard.  It is sometimes easier to patch the kernel
to change the default.  Part of the problem is getting the right set
of command-line parameters associated with the right versions of the
kernel in the not-uncommon case where different portions of the server
farm are running different versions of the kernel.

> > > > > Is this about the concern of possible TSC frequency calibration
> > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > calibration for those platforms which get the TSC freq directly
> > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > 
> > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > 
> > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > 
> > > > I am giving this a go, thank you!
> > > 
> > > Thanks for spending time testing it!
> > 
> > And here are the results from setting tsc_force_recalibrate to 1:
> > 
> > $ dmesg | grep -E 'calibrat|clocksource'
> > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > [   49.762432] clocksource: Switched to clocksource tsc-early
> > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   55.640270] clocksource: Switched to clocksource tsc
> > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> 
> Looks like there is really something wrong here. I assume the first
> number '1896 MHz' is got from CPUID(0x15)'s math calculation.

How about this from earlier in that same console?

[    0.000000] efi: EFI v2.80 by American Megatrends
[    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
[    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
[    0.000000] SMBIOS 3.5.0 present.
[    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
[    0.000000] tsc: Detected 1900.000 MHz processor
[    0.000000] tsc: Detected 1896.000 MHz TSC

> I thinks 2 more things could be try:
> 
> * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
>   ACPI PM_TIMER to do the calibration, say a third-party check.

OK, getting things teed up for TSC recalibration and nohpet.

> * If the system don't have auto-adjusted time setting like NTP, I
>   guess the system time will have obvious drift comparing to a normal
>   clock or a mobile phone time, as the deviation is about 4%, which
>   is 2.4 minutes per hour.

No ntpd, but there is a chronyd.

I will let you know what happens with HPET disabled and TSC recalibration
enabled.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-30  4:12                         ` Paul E. McKenney
@ 2022-11-30  4:49                           ` Feng Tang
  2022-11-30  5:16                             ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Feng Tang @ 2022-11-30  4:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Tue, Nov 29, 2022 at 08:12:06PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> > On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> > [...]
> > > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > > workload on big servers, that was another factor for us to seek
> > > > > > stopping the timer.
> > > > > 
> > > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > > 
> > > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > > a general interface, there is some clocksource whose warp time is
> > > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > > know if other ARCHs have similar cases.
> > > 
> > > Maybe a simpler approach is for systems with such high sensitivity to
> > > OS noise to simply disable the clocksource watchdog.  ;-)
> > 
> > That's what the reported did, test with and without "tsc=reliable"
> > parameter :)
> > 
> > And AFAIK, many customers with big server farms hate to add more
> > cmdline parameters when we suggested so.
> 
> It can be surprisingly hard.  It is sometimes easier to patch the kernel
> to change the default. 

Indeed, sometimes we were askd to provide patch than cmdline parameters :)

> Part of the problem is getting the right set
> of command-line parameters associated with the right versions of the
> kernel in the not-uncommon case where different portions of the server
> farm are running different versions of the kernel.
> 
> > > > > > Is this about the concern of possible TSC frequency calibration
> > > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > > calibration for those platforms which get the TSC freq directly
> > > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > > 
> > > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > > 
> > > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > > 
> > > > > I am giving this a go, thank you!
> > > > 
> > > > Thanks for spending time testing it!
> > > 
> > > And here are the results from setting tsc_force_recalibrate to 1:
> > > 
> > > $ dmesg | grep -E 'calibrat|clocksource'
> > > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > [   49.762432] clocksource: Switched to clocksource tsc-early
> > > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   55.640270] clocksource: Switched to clocksource tsc
> > > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> > 
> > Looks like there is really something wrong here. I assume the first
> > number '1896 MHz' is got from CPUID(0x15)'s math calculation.
> 
> How about this from earlier in that same console?
> 
> [    0.000000] efi: EFI v2.80 by American Megatrends
> [    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
> [    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
> [    0.000000] SMBIOS 3.5.0 present.
> [    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
> [    0.000000] tsc: Detected 1900.000 MHz processor
> [    0.000000] tsc: Detected 1896.000 MHz TSC

I'm still not sure, but it's likely from CPUID(0x15). I met cases
that even severs of same generations get their tsc frequence from
different sources.

I locally have debug patch to check it:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..82ddb4b0529a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -654,8 +654,11 @@ unsigned long native_calibrate_tsc(void)
 	 * frequency and is the most accurate one so far we have. This
 	 * is considered a known frequency.
 	 */
-	if (crystal_khz != 0)
+	if (crystal_khz != 0) {
+		printk("tsc: using CPUID[0x15] crystal_khz= %d kHz ebx=%d eax=%d\n",
+		crystal_khz, ebx_numerator, eax_denominator);
 		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+	}
 
 	/*
 	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
@@ -668,6 +671,7 @@ unsigned long native_calibrate_tsc(void)
 		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
 		crystal_khz = eax_base_mhz * 1000 *
 			eax_denominator / ebx_numerator;
+		printk("tsc: using CPUID[0x16] base_khz=%d kHz\n", crystal_khz);
 	}
 
 	if (crystal_khz == 0)

This remind me that maybe we can add a line in dmesg telling people
which exact soure that the TSC frequency comes from (CPUID, MSR or
calibration from HW timers like HPET/PM_TIMER).

> > I thinks 2 more things could be try:
> > 
> > * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
> >   ACPI PM_TIMER to do the calibration, say a third-party check.
> 
> OK, getting things teed up for TSC recalibration and nohpet.
> 
> > * If the system don't have auto-adjusted time setting like NTP, I
> >   guess the system time will have obvious drift comparing to a normal
> >   clock or a mobile phone time, as the deviation is about 4%, which
> >   is 2.4 minutes per hour.
> 
> No ntpd, but there is a chronyd.
> 
> I will let you know what happens with HPET disabled and TSC recalibration
> enabled.

Thanks!

- Feng

> 							Thanx, Paul

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-30  4:49                           ` Feng Tang
@ 2022-11-30  5:16                             ` Paul E. McKenney
  2022-11-30  5:35                               ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-30  5:16 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Wed, Nov 30, 2022 at 12:49:24PM +0800, Feng Tang wrote:
> On Tue, Nov 29, 2022 at 08:12:06PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> > > On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> > > [...]
> > > > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > > > workload on big servers, that was another factor for us to seek
> > > > > > > stopping the timer.
> > > > > > 
> > > > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > > > 
> > > > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > > > a general interface, there is some clocksource whose warp time is
> > > > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > > > know if other ARCHs have similar cases.
> > > > 
> > > > Maybe a simpler approach is for systems with such high sensitivity to
> > > > OS noise to simply disable the clocksource watchdog.  ;-)
> > > 
> > > That's what the reported did, test with and without "tsc=reliable"
> > > parameter :)
> > > 
> > > And AFAIK, many customers with big server farms hate to add more
> > > cmdline parameters when we suggested so.
> > 
> > It can be surprisingly hard.  It is sometimes easier to patch the kernel
> > to change the default. 
> 
> Indeed, sometimes we were askd to provide patch than cmdline parameters :)

Do any of them like CONFIG_BOOT_CONFIG_EMBED?  ;-)

> > Part of the problem is getting the right set
> > of command-line parameters associated with the right versions of the
> > kernel in the not-uncommon case where different portions of the server
> > farm are running different versions of the kernel.
> > 
> > > > > > > Is this about the concern of possible TSC frequency calibration
> > > > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > > > calibration for those platforms which get the TSC freq directly
> > > > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > > > 
> > > > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > > > 
> > > > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > > > 
> > > > > > I am giving this a go, thank you!
> > > > > 
> > > > > Thanks for spending time testing it!
> > > > 
> > > > And here are the results from setting tsc_force_recalibrate to 1:
> > > > 
> > > > $ dmesg | grep -E 'calibrat|clocksource'
> > > > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > > > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > [   49.762432] clocksource: Switched to clocksource tsc-early
> > > > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > [   55.640270] clocksource: Switched to clocksource tsc
> > > > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> > > 
> > > Looks like there is really something wrong here. I assume the first
> > > number '1896 MHz' is got from CPUID(0x15)'s math calculation.
> > 
> > How about this from earlier in that same console?
> > 
> > [    0.000000] efi: EFI v2.80 by American Megatrends
> > [    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
> > [    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
> > [    0.000000] SMBIOS 3.5.0 present.
> > [    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
> > [    0.000000] tsc: Detected 1900.000 MHz processor
> > [    0.000000] tsc: Detected 1896.000 MHz TSC
> 
> I'm still not sure, but it's likely from CPUID(0x15). I met cases
> that even severs of same generations get their tsc frequence from
> different sources.
> 
> I locally have debug patch to check it:
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index cafacb2e58cc..82ddb4b0529a 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -654,8 +654,11 @@ unsigned long native_calibrate_tsc(void)
>  	 * frequency and is the most accurate one so far we have. This
>  	 * is considered a known frequency.
>  	 */
> -	if (crystal_khz != 0)
> +	if (crystal_khz != 0) {
> +		printk("tsc: using CPUID[0x15] crystal_khz= %d kHz ebx=%d eax=%d\n",
> +		crystal_khz, ebx_numerator, eax_denominator);
>  		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +	}
>  
>  	/*
>  	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
> @@ -668,6 +671,7 @@ unsigned long native_calibrate_tsc(void)
>  		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
>  		crystal_khz = eax_base_mhz * 1000 *
>  			eax_denominator / ebx_numerator;
> +		printk("tsc: using CPUID[0x16] base_khz=%d kHz\n", crystal_khz);
>  	}
>  
>  	if (crystal_khz == 0)

I will give this one a go, thank you!

> This remind me that maybe we can add a line in dmesg telling people
> which exact soure that the TSC frequency comes from (CPUID, MSR or
> calibration from HW timers like HPET/PM_TIMER).

Thus far, it seems that this information could be useful.

> > > I thinks 2 more things could be try:
> > > 
> > > * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
> > >   ACPI PM_TIMER to do the calibration, say a third-party check.
> > 
> > OK, getting things teed up for TSC recalibration and nohpet.
> > 
> > > * If the system don't have auto-adjusted time setting like NTP, I
> > >   guess the system time will have obvious drift comparing to a normal
> > >   clock or a mobile phone time, as the deviation is about 4%, which
> > >   is 2.4 minutes per hour.
> > 
> > No ntpd, but there is a chronyd.
> > 
> > I will let you know what happens with HPET disabled and TSC recalibration
> > enabled.

And here you go:

[    0.000000] tsc: Detected 1900.000 MHz processor
[    0.000000] tsc: Detected 1896.000 MHz TSC
[    5.254788] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   25.669260] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   32.182958] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[   57.785756] clocksource: Switched to clocksource tsc-early
[   58.106661] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[   63.578433] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   63.602169] clocksource: Switched to clocksource tsc
[   64.636612] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
[   64.664615] tsc: Previous calibrated TSC freq:        1896.000 MHz
[   64.677714] tsc: TSC freq recalibrated by [PM_TIMER]:         1975.000 MHz

Same same, give or take, anyway.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-30  5:16                             ` Paul E. McKenney
@ 2022-11-30  5:35                               ` Feng Tang
  2022-11-30  5:50                                 ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Feng Tang @ 2022-11-30  5:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Tue, Nov 29, 2022 at 09:16:00PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 30, 2022 at 12:49:24PM +0800, Feng Tang wrote:
> > On Tue, Nov 29, 2022 at 08:12:06PM -0800, Paul E. McKenney wrote:
> > > On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> > > > On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> > > > [...]
> > > > > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > > > > workload on big servers, that was another factor for us to seek
> > > > > > > > stopping the timer.
> > > > > > > 
> > > > > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > > > > 
> > > > > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > > > > a general interface, there is some clocksource whose warp time is
> > > > > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > > > > know if other ARCHs have similar cases.
> > > > > 
> > > > > Maybe a simpler approach is for systems with such high sensitivity to
> > > > > OS noise to simply disable the clocksource watchdog.  ;-)
> > > > 
> > > > That's what the reported did, test with and without "tsc=reliable"
> > > > parameter :)
> > > > 
> > > > And AFAIK, many customers with big server farms hate to add more
> > > > cmdline parameters when we suggested so.
> > > 
> > > It can be surprisingly hard.  It is sometimes easier to patch the kernel
> > > to change the default. 
> > 
> > Indeed, sometimes we were askd to provide patch than cmdline parameters :)
> 
> Do any of them like CONFIG_BOOT_CONFIG_EMBED?  ;-)
> 
> > > Part of the problem is getting the right set
> > > of command-line parameters associated with the right versions of the
> > > kernel in the not-uncommon case where different portions of the server
> > > farm are running different versions of the kernel.
> > > 
> > > > > > > > Is this about the concern of possible TSC frequency calibration
> > > > > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > > > > calibration for those platforms which get the TSC freq directly
> > > > > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > > > > 
> > > > > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > > > > 
> > > > > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > > > > 
> > > > > > > I am giving this a go, thank you!
> > > > > > 
> > > > > > Thanks for spending time testing it!
> > > > > 
> > > > > And here are the results from setting tsc_force_recalibrate to 1:
> > > > > 
> > > > > $ dmesg | grep -E 'calibrat|clocksource'
> > > > > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > > > > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > > [   49.762432] clocksource: Switched to clocksource tsc-early
> > > > > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > [   55.640270] clocksource: Switched to clocksource tsc
> > > > > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> > > > 
> > > > Looks like there is really something wrong here. I assume the first
> > > > number '1896 MHz' is got from CPUID(0x15)'s math calculation.
> > > 
> > > How about this from earlier in that same console?
> > > 
> > > [    0.000000] efi: EFI v2.80 by American Megatrends
> > > [    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
> > > [    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
> > > [    0.000000] SMBIOS 3.5.0 present.
> > > [    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
> > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > 
> > I'm still not sure, but it's likely from CPUID(0x15). I met cases
> > that even severs of same generations get their tsc frequence from
> > different sources.
> > 
> > I locally have debug patch to check it:
> > 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index cafacb2e58cc..82ddb4b0529a 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -654,8 +654,11 @@ unsigned long native_calibrate_tsc(void)
> >  	 * frequency and is the most accurate one so far we have. This
> >  	 * is considered a known frequency.
> >  	 */
> > -	if (crystal_khz != 0)
> > +	if (crystal_khz != 0) {
> > +		printk("tsc: using CPUID[0x15] crystal_khz= %d kHz ebx=%d eax=%d\n",
> > +		crystal_khz, ebx_numerator, eax_denominator);
> >  		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > +	}
> >  
> >  	/*
> >  	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
> > @@ -668,6 +671,7 @@ unsigned long native_calibrate_tsc(void)
> >  		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
> >  		crystal_khz = eax_base_mhz * 1000 *
> >  			eax_denominator / ebx_numerator;
> > +		printk("tsc: using CPUID[0x16] base_khz=%d kHz\n", crystal_khz);
> >  	}
> >  
> >  	if (crystal_khz == 0)
> 
> I will give this one a go, thank you!
> 
> > This remind me that maybe we can add a line in dmesg telling people
> > which exact soure that the TSC frequency comes from (CPUID, MSR or
> > calibration from HW timers like HPET/PM_TIMER).
> 
> Thus far, it seems that this information could be useful.
> 
> > > > I thinks 2 more things could be try:
> > > > 
> > > > * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
> > > >   ACPI PM_TIMER to do the calibration, say a third-party check.
> > > 
> > > OK, getting things teed up for TSC recalibration and nohpet.
> > > 
> > > > * If the system don't have auto-adjusted time setting like NTP, I
> > > >   guess the system time will have obvious drift comparing to a normal
> > > >   clock or a mobile phone time, as the deviation is about 4%, which
> > > >   is 2.4 minutes per hour.
> > > 
> > > No ntpd, but there is a chronyd.
> > > 
> > > I will let you know what happens with HPET disabled and TSC recalibration
> > > enabled.
> 
> And here you go:
> 
> [    0.000000] tsc: Detected 1900.000 MHz processor
> [    0.000000] tsc: Detected 1896.000 MHz TSC
> [    5.254788] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [   25.669260] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   32.182958] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [   57.785756] clocksource: Switched to clocksource tsc-early
> [   58.106661] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [   63.578433] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   63.602169] clocksource: Switched to clocksource tsc
> [   64.636612] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> [   64.664615] tsc: Previous calibrated TSC freq:        1896.000 MHz
> [   64.677714] tsc: TSC freq recalibrated by [PM_TIMER]:         1975.000 MHz
 
Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
and it matches the 40ms drift per second you mentioned earlier, this
seems like the CPUID(0x15) gave the wrong frequence number.

Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
which has deviation.

Either way, I think the HW/FW could have some problem.

Thanks,
Feng

> Same same, give or take, anyway.
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-30  5:35                               ` Feng Tang
@ 2022-11-30  5:50                                 ` Paul E. McKenney
  2022-11-30  6:00                                   ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-11-30  5:50 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, john.stultz, sboyd, corbet,
	Mark.Rutland, maz, kernel-team, neeraju, ak, zhengjun.xing,
	Chris Mason, John Stultz, Waiman Long

On Wed, Nov 30, 2022 at 01:35:08PM +0800, Feng Tang wrote:
> On Tue, Nov 29, 2022 at 09:16:00PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2022 at 12:49:24PM +0800, Feng Tang wrote:
> > > On Tue, Nov 29, 2022 at 08:12:06PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> > > > > On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> > > > > [...]
> > > > > > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > > > > > workload on big servers, that was another factor for us to seek
> > > > > > > > > stopping the timer.
> > > > > > > > 
> > > > > > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > > > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > > > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > > > > > 
> > > > > > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > > > > > a general interface, there is some clocksource whose warp time is
> > > > > > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > > > > > know if other ARCHs have similar cases.
> > > > > > 
> > > > > > Maybe a simpler approach is for systems with such high sensitivity to
> > > > > > OS noise to simply disable the clocksource watchdog.  ;-)
> > > > > 
> > > > > That's what the reported did, test with and without "tsc=reliable"
> > > > > parameter :)
> > > > > 
> > > > > And AFAIK, many customers with big server farms hate to add more
> > > > > cmdline parameters when we suggested so.
> > > > 
> > > > It can be surprisingly hard.  It is sometimes easier to patch the kernel
> > > > to change the default. 
> > > 
> > > Indeed, sometimes we were askd to provide patch than cmdline parameters :)
> > 
> > Do any of them like CONFIG_BOOT_CONFIG_EMBED?  ;-)
> > 
> > > > Part of the problem is getting the right set
> > > > of command-line parameters associated with the right versions of the
> > > > kernel in the not-uncommon case where different portions of the server
> > > > farm are running different versions of the kernel.
> > > > 
> > > > > > > > > Is this about the concern of possible TSC frequency calibration
> > > > > > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > > > > > calibration for those platforms which get the TSC freq directly
> > > > > > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > > > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > > > > > 
> > > > > > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > > > > > 
> > > > > > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > > > > > 
> > > > > > > > I am giving this a go, thank you!
> > > > > > > 
> > > > > > > Thanks for spending time testing it!
> > > > > > 
> > > > > > And here are the results from setting tsc_force_recalibrate to 1:
> > > > > > 
> > > > > > $ dmesg | grep -E 'calibrat|clocksource'
> > > > > > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > > > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > > > > > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > > > [   49.762432] clocksource: Switched to clocksource tsc-early
> > > > > > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > > > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > > [   55.640270] clocksource: Switched to clocksource tsc
> > > > > > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > > > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > > > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> > > > > 
> > > > > Looks like there is really something wrong here. I assume the first
> > > > > number '1896 MHz' is got from CPUID(0x15)'s math calculation.
> > > > 
> > > > How about this from earlier in that same console?
> > > > 
> > > > [    0.000000] efi: EFI v2.80 by American Megatrends
> > > > [    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
> > > > [    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
> > > > [    0.000000] SMBIOS 3.5.0 present.
> > > > [    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
> > > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > 
> > > I'm still not sure, but it's likely from CPUID(0x15). I met cases
> > > that even severs of same generations get their tsc frequence from
> > > different sources.
> > > 
> > > I locally have debug patch to check it:
> > > 
> > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > > index cafacb2e58cc..82ddb4b0529a 100644
> > > --- a/arch/x86/kernel/tsc.c
> > > +++ b/arch/x86/kernel/tsc.c
> > > @@ -654,8 +654,11 @@ unsigned long native_calibrate_tsc(void)
> > >  	 * frequency and is the most accurate one so far we have. This
> > >  	 * is considered a known frequency.
> > >  	 */
> > > -	if (crystal_khz != 0)
> > > +	if (crystal_khz != 0) {
> > > +		printk("tsc: using CPUID[0x15] crystal_khz= %d kHz ebx=%d eax=%d\n",
> > > +		crystal_khz, ebx_numerator, eax_denominator);
> > >  		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > > +	}
> > >  
> > >  	/*
> > >  	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
> > > @@ -668,6 +671,7 @@ unsigned long native_calibrate_tsc(void)
> > >  		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
> > >  		crystal_khz = eax_base_mhz * 1000 *
> > >  			eax_denominator / ebx_numerator;
> > > +		printk("tsc: using CPUID[0x16] base_khz=%d kHz\n", crystal_khz);
> > >  	}
> > >  
> > >  	if (crystal_khz == 0)
> > 
> > I will give this one a go, thank you!
> > 
> > > This remind me that maybe we can add a line in dmesg telling people
> > > which exact soure that the TSC frequency comes from (CPUID, MSR or
> > > calibration from HW timers like HPET/PM_TIMER).
> > 
> > Thus far, it seems that this information could be useful.
> > 
> > > > > I thinks 2 more things could be try:
> > > > > 
> > > > > * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
> > > > >   ACPI PM_TIMER to do the calibration, say a third-party check.
> > > > 
> > > > OK, getting things teed up for TSC recalibration and nohpet.
> > > > 
> > > > > * If the system don't have auto-adjusted time setting like NTP, I
> > > > >   guess the system time will have obvious drift comparing to a normal
> > > > >   clock or a mobile phone time, as the deviation is about 4%, which
> > > > >   is 2.4 minutes per hour.
> > > > 
> > > > No ntpd, but there is a chronyd.
> > > > 
> > > > I will let you know what happens with HPET disabled and TSC recalibration
> > > > enabled.
> > 
> > And here you go:
> > 
> > [    0.000000] tsc: Detected 1900.000 MHz processor
> > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > [    5.254788] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [   25.669260] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   32.182958] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > [   57.785756] clocksource: Switched to clocksource tsc-early
> > [   58.106661] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [   63.578433] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   63.602169] clocksource: Switched to clocksource tsc
> > [   64.636612] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > [   64.664615] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > [   64.677714] tsc: TSC freq recalibrated by [PM_TIMER]:         1975.000 MHz
>  
> Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> and it matches the 40ms drift per second you mentioned earlier, this
> seems like the CPUID(0x15) gave the wrong frequence number.
> 
> Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> which has deviation.
> 
> Either way, I think the HW/FW could have some problem.

And this time with your printk()s:

[    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
[    0.000000] tsc: Detected 1900.000 MHz processor
[    0.000000] tsc: Detected 1896.000 MHz TSC
[    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[   57.823933] clocksource: Switched to clocksource tsc-early
[   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   63.637323] clocksource: Switched to clocksource tsc
[   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
[   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
[   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz

What would be good next steps to check up on the hardware and firmware?

(My next step involves a pillow, but will follow up tomorrow morning
Pacific Time.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-30  5:50                                 ` Paul E. McKenney
@ 2022-11-30  6:00                                   ` Feng Tang
  2022-12-01 17:24                                     ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Feng Tang @ 2022-11-30  6:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, ak, zhengjun.xing, Chris Mason, John Stultz,
	Waiman Long

On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
[...]
> >  
> > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > and it matches the 40ms drift per second you mentioned earlier, this
> > seems like the CPUID(0x15) gave the wrong frequence number.
> > 
> > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > which has deviation.
> > 
> > Either way, I think the HW/FW could have some problem.
> 
> And this time with your printk()s:
> 
> [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> [    0.000000] tsc: Detected 1900.000 MHz processor
> [    0.000000] tsc: Detected 1896.000 MHz TSC
> [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [   57.823933] clocksource: Switched to clocksource tsc-early
> [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   63.637323] clocksource: Switched to clocksource tsc
> [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz

This confirms the tsc frequency is calculated from CPUID(0x15).

> What would be good next steps to check up on the hardware and firmware?

Maybe raise it to vendor? I have no idea how to check thos black boxes :)


> (My next step involves a pillow, but will follow up tomorrow morning
> Pacific Time.)
 
Really thanks for checking this through late night!

Thanks,
Feng

> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-11-30  6:00                                   ` Feng Tang
@ 2022-12-01 17:24                                     ` Paul E. McKenney
  2022-12-02  1:10                                       ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-12-01 17:24 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, ak, zhengjun.xing, Chris Mason, John Stultz,
	Waiman Long

On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> [...]
> > >  
> > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > and it matches the 40ms drift per second you mentioned earlier, this
> > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > 
> > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > which has deviation.
> > > 
> > > Either way, I think the HW/FW could have some problem.
> > 
> > And this time with your printk()s:
> > 
> > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > [    0.000000] tsc: Detected 1900.000 MHz processor
> > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > [   57.823933] clocksource: Switched to clocksource tsc-early
> > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   63.637323] clocksource: Switched to clocksource tsc
> > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> 
> This confirms the tsc frequency is calculated from CPUID(0x15).
> 
> > What would be good next steps to check up on the hardware and firmware?
> 
> Maybe raise it to vendor? I have no idea how to check thos black boxes :)

Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)

> > (My next step involves a pillow, but will follow up tomorrow morning
> > Pacific Time.)
>  
> Really thanks for checking this through late night!

No problem, and I guess it is instead the day after tomorrow, but
I thought you might be interested in chronyd's opinion:

[root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
        40001.074911             0.002098

In contrast, on my Fedora laptop:

$ sudo cat /var/lib/chrony/drift
            2.074313             0.186606

I am (perhaps incorrectly) taking this to indicate that TSC is in fact
drifting with respect to standard time.

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-12-01 17:24                                     ` Paul E. McKenney
@ 2022-12-02  1:10                                       ` Feng Tang
  2022-12-02  1:44                                         ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Feng Tang @ 2022-12-02  1:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, ak, zhengjun.xing, Chris Mason, John Stultz,
	Waiman Long

On Thu, Dec 01, 2022 at 09:24:05AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> > On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> > [...]
> > > >  
> > > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > > and it matches the 40ms drift per second you mentioned earlier, this
> > > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > > 
> > > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > > which has deviation.
> > > > 
> > > > Either way, I think the HW/FW could have some problem.
> > > 
> > > And this time with your printk()s:
> > > 
> > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > [   57.823933] clocksource: Switched to clocksource tsc-early
> > > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   63.637323] clocksource: Switched to clocksource tsc
> > > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> > 
> > This confirms the tsc frequency is calculated from CPUID(0x15).
> > 
> > > What would be good next steps to check up on the hardware and firmware?
> > 
> > Maybe raise it to vendor? I have no idea how to check thos black boxes :)
> 
> Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)
> 
> > > (My next step involves a pillow, but will follow up tomorrow morning
> > > Pacific Time.)
> >  
> > Really thanks for checking this through late night!
> 
> No problem, and I guess it is instead the day after tomorrow, but
> I thought you might be interested in chronyd's opinion:
> 
> [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
>         40001.074911             0.002098
> 
> In contrast, on my Fedora laptop:
> 
> $ sudo cat /var/lib/chrony/drift
>             2.074313             0.186606
> 
> I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> drifting with respect to standard time.
 
This info is very useful! It further confirms the CPUID(0x15) gave
the wrong frequency info. 

Also I don't think TSC itself is drifting, and the drift some from
the wrong match calculation(1896 MHz), if we give it the correct
number (likely 1975 MHz here), there shouldn't be big chrony drift
like your Fedora laptop.

Thanks,
Feng


> Thoughts?
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-12-02  1:10                                       ` Feng Tang
@ 2022-12-02  1:44                                         ` Paul E. McKenney
  2022-12-02  2:02                                           ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-12-02  1:44 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, ak, zhengjun.xing, Chris Mason, John Stultz,
	Waiman Long

On Fri, Dec 02, 2022 at 09:10:12AM +0800, Feng Tang wrote:
> On Thu, Dec 01, 2022 at 09:24:05AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> > > On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> > > [...]
> > > > >  
> > > > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > > > and it matches the 40ms drift per second you mentioned earlier, this
> > > > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > > > 
> > > > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > > > which has deviation.
> > > > > 
> > > > > Either way, I think the HW/FW could have some problem.
> > > > 
> > > > And this time with your printk()s:
> > > > 
> > > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > [   57.823933] clocksource: Switched to clocksource tsc-early
> > > > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > [   63.637323] clocksource: Switched to clocksource tsc
> > > > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> > > 
> > > This confirms the tsc frequency is calculated from CPUID(0x15).
> > > 
> > > > What would be good next steps to check up on the hardware and firmware?
> > > 
> > > Maybe raise it to vendor? I have no idea how to check thos black boxes :)
> > 
> > Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)
> > 
> > > > (My next step involves a pillow, but will follow up tomorrow morning
> > > > Pacific Time.)
> > >  
> > > Really thanks for checking this through late night!
> > 
> > No problem, and I guess it is instead the day after tomorrow, but
> > I thought you might be interested in chronyd's opinion:
> > 
> > [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
> >         40001.074911             0.002098
> > 
> > In contrast, on my Fedora laptop:
> > 
> > $ sudo cat /var/lib/chrony/drift
> >             2.074313             0.186606
> > 
> > I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> > drifting with respect to standard time.
>  
> This info is very useful! It further confirms the CPUID(0x15) gave
> the wrong frequency info. 

So the TSC is just doing what it is told.  ;-)

This indicates a firmware problem?

> Also I don't think TSC itself is drifting, and the drift some from
> the wrong match calculation(1896 MHz), if we give it the correct
> number (likely 1975 MHz here), there shouldn't be big chrony drift
> like your Fedora laptop.

Resetting so that the clocksource watchdog once again gets rid of TSC,
but leaving nohpet:

[    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
[    0.000000] tsc: Detected 1900.000 MHz processor
[    0.000000] tsc: Detected 1896.000 MHz TSC
[    5.287750] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   17.963947] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   19.996287] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[   20.040287] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
[   20.067286] clocksource:                       'tsc-early' cs_nsec: 588021368 cs_now: 581c1eb378 cs_last: 57d9aad9e8 mask: ffffffffffffffff
 [   20.096286] clocksource:                       No current clocksource.
 [   20.111286] tsc: Marking TSC unstable due to clocksource watchdog
 [   24.582541] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
 [   49.739301] clocksource: Switched to clocksource refined-jiffies
 [   50.046356] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
 [   50.066475] clocksource: Switched to clocksource acpi_pm

# cat /var/lib/chrony/drift 
            1.372570             0.020049

I interpret this to mean that acpi_pm (and thus from prior observations,
HPET as well) are counting at the correct rate.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-12-02  1:44                                         ` Paul E. McKenney
@ 2022-12-02  2:02                                           ` Feng Tang
  2022-12-02 22:24                                             ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Feng Tang @ 2022-12-02  2:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, ak, zhengjun.xing, Chris Mason, John Stultz,
	Waiman Long

On Thu, Dec 01, 2022 at 05:44:28PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 02, 2022 at 09:10:12AM +0800, Feng Tang wrote:
> > On Thu, Dec 01, 2022 at 09:24:05AM -0800, Paul E. McKenney wrote:
> > > On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> > > > On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> > > > [...]
> > > > > >  
> > > > > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > > > > and it matches the 40ms drift per second you mentioned earlier, this
> > > > > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > > > > 
> > > > > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > > > > which has deviation.
> > > > > > 
> > > > > > Either way, I think the HW/FW could have some problem.
> > > > > 
> > > > > And this time with your printk()s:
> > > > > 
> > > > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > > > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > > [   57.823933] clocksource: Switched to clocksource tsc-early
> > > > > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > [   63.637323] clocksource: Switched to clocksource tsc
> > > > > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> > > > 
> > > > This confirms the tsc frequency is calculated from CPUID(0x15).
> > > > 
> > > > > What would be good next steps to check up on the hardware and firmware?
> > > > 
> > > > Maybe raise it to vendor? I have no idea how to check thos black boxes :)
> > > 
> > > Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)
> > > 
> > > > > (My next step involves a pillow, but will follow up tomorrow morning
> > > > > Pacific Time.)
> > > >  
> > > > Really thanks for checking this through late night!
> > > 
> > > No problem, and I guess it is instead the day after tomorrow, but
> > > I thought you might be interested in chronyd's opinion:
> > > 
> > > [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
> > >         40001.074911             0.002098
> > > 
> > > In contrast, on my Fedora laptop:
> > > 
> > > $ sudo cat /var/lib/chrony/drift
> > >             2.074313             0.186606
> > > 
> > > I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> > > drifting with respect to standard time.
> >  
> > This info is very useful! It further confirms the CPUID(0x15) gave
> > the wrong frequency info. 
> 
> So the TSC is just doing what it is told.  ;-)
> 
> This indicates a firmware problem?
> 
> > Also I don't think TSC itself is drifting, and the drift some from
> > the wrong match calculation(1896 MHz), if we give it the correct
> > number (likely 1975 MHz here), there shouldn't be big chrony drift
> > like your Fedora laptop.
> 
> Resetting so that the clocksource watchdog once again gets rid of TSC,
> but leaving nohpet:
> 
> [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> [    0.000000] tsc: Detected 1900.000 MHz processor
> [    0.000000] tsc: Detected 1896.000 MHz TSC
> [    5.287750] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [   17.963947] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   19.996287] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> [   20.040287] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> [   20.067286] clocksource:                       'tsc-early' cs_nsec: 588021368 cs_now: 581c1eb378 cs_last: 57d9aad9e8 mask: ffffffffffffffff
>  [   20.096286] clocksource:                       No current clocksource.
>  [   20.111286] tsc: Marking TSC unstable due to clocksource watchdog
>  [   24.582541] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
>  [   49.739301] clocksource: Switched to clocksource refined-jiffies
>  [   50.046356] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
>  [   50.066475] clocksource: Switched to clocksource acpi_pm
> 
> # cat /var/lib/chrony/drift 
>             1.372570             0.020049
> 
> I interpret this to mean that acpi_pm (and thus from prior observations,
> HPET as well) are counting at the correct rate.

Correct. And this is a good news! that 1975 MHz seems to be the right
number.

Could you try below patch, it should override the CPUID calculation
and forced to use HW timer calibrated number:

---
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a78e73da4a74..68a2fea4961d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1417,7 +1417,8 @@ static int __init init_tsc_clocksource(void)
 	 * When TSC frequency is known (retrieved via MSR or CPUID), we skip
 	 * the refined calibration and directly register it as a clocksource.
 	 */
-	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
+//	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
+	if (false) {
 		if (boot_cpu_has(X86_FEATURE_ART))
 			art_related_clocksource = &clocksource_tsc;
 		clocksource_register_khz(&clocksource_tsc, tsc_khz);
---

And together with b50db7095fe0 "x86/tsc: Disable clocksource watchdog
for TSC on qualified platorms". I assume this will tell TSC to use
1975 MHZ as its frequency.


Thanks,
Feng


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-12-02  2:02                                           ` Feng Tang
@ 2022-12-02 22:24                                             ` Paul E. McKenney
  2022-12-03  2:51                                               ` Feng Tang
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2022-12-02 22:24 UTC (permalink / raw)
  To: Feng Tang
  Cc: Thomas Gleixner, linux-kernel, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, ak, zhengjun.xing, Chris Mason, John Stultz,
	Waiman Long

On Fri, Dec 02, 2022 at 10:02:59AM +0800, Feng Tang wrote:
> On Thu, Dec 01, 2022 at 05:44:28PM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 02, 2022 at 09:10:12AM +0800, Feng Tang wrote:
> > > On Thu, Dec 01, 2022 at 09:24:05AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> > > > > On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> > > > > [...]
> > > > > > >  
> > > > > > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > > > > > and it matches the 40ms drift per second you mentioned earlier, this
> > > > > > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > > > > > 
> > > > > > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > > > > > which has deviation.
> > > > > > > 
> > > > > > > Either way, I think the HW/FW could have some problem.
> > > > > > 
> > > > > > And this time with your printk()s:
> > > > > > 
> > > > > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > > > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > > > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > > > > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > > > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > > > [   57.823933] clocksource: Switched to clocksource tsc-early
> > > > > > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > > > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > > [   63.637323] clocksource: Switched to clocksource tsc
> > > > > > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > > > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > > > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> > > > > 
> > > > > This confirms the tsc frequency is calculated from CPUID(0x15).
> > > > > 
> > > > > > What would be good next steps to check up on the hardware and firmware?
> > > > > 
> > > > > Maybe raise it to vendor? I have no idea how to check thos black boxes :)
> > > > 
> > > > Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)
> > > > 
> > > > > > (My next step involves a pillow, but will follow up tomorrow morning
> > > > > > Pacific Time.)
> > > > >  
> > > > > Really thanks for checking this through late night!
> > > > 
> > > > No problem, and I guess it is instead the day after tomorrow, but
> > > > I thought you might be interested in chronyd's opinion:
> > > > 
> > > > [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
> > > >         40001.074911             0.002098
> > > > 
> > > > In contrast, on my Fedora laptop:
> > > > 
> > > > $ sudo cat /var/lib/chrony/drift
> > > >             2.074313             0.186606
> > > > 
> > > > I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> > > > drifting with respect to standard time.
> > >  
> > > This info is very useful! It further confirms the CPUID(0x15) gave
> > > the wrong frequency info. 
> > 
> > So the TSC is just doing what it is told.  ;-)
> > 
> > This indicates a firmware problem?
> > 
> > > Also I don't think TSC itself is drifting, and the drift some from
> > > the wrong match calculation(1896 MHz), if we give it the correct
> > > number (likely 1975 MHz here), there shouldn't be big chrony drift
> > > like your Fedora laptop.
> > 
> > Resetting so that the clocksource watchdog once again gets rid of TSC,
> > but leaving nohpet:
> > 
> > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > [    0.000000] tsc: Detected 1900.000 MHz processor
> > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > [    5.287750] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [   17.963947] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   19.996287] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > [   20.040287] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> > [   20.067286] clocksource:                       'tsc-early' cs_nsec: 588021368 cs_now: 581c1eb378 cs_last: 57d9aad9e8 mask: ffffffffffffffff
> >  [   20.096286] clocksource:                       No current clocksource.
> >  [   20.111286] tsc: Marking TSC unstable due to clocksource watchdog
> >  [   24.582541] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> >  [   49.739301] clocksource: Switched to clocksource refined-jiffies
> >  [   50.046356] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> >  [   50.066475] clocksource: Switched to clocksource acpi_pm
> > 
> > # cat /var/lib/chrony/drift 
> >             1.372570             0.020049
> > 
> > I interpret this to mean that acpi_pm (and thus from prior observations,
> > HPET as well) are counting at the correct rate.
> 
> Correct. And this is a good news! that 1975 MHz seems to be the right
> number.
> 
> Could you try below patch, it should override the CPUID calculation
> and forced to use HW timer calibrated number:
> 
> ---
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index a78e73da4a74..68a2fea4961d 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1417,7 +1417,8 @@ static int __init init_tsc_clocksource(void)
>  	 * When TSC frequency is known (retrieved via MSR or CPUID), we skip
>  	 * the refined calibration and directly register it as a clocksource.
>  	 */
> -	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> +//	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> +	if (false) {
>  		if (boot_cpu_has(X86_FEATURE_ART))
>  			art_related_clocksource = &clocksource_tsc;
>  		clocksource_register_khz(&clocksource_tsc, tsc_khz);
> ---
> 
> And together with b50db7095fe0 "x86/tsc: Disable clocksource watchdog
> for TSC on qualified platorms". I assume this will tell TSC to use
> 1975 MHZ as its frequency.

This did not change things, but when I hardcoded tsc_khz to 1975, the
clocksource watchdog no longer disables TSC and chrony shows drifts of
less than 2.0.  (As opposed to about 40,000 without the hardcoding.)

So yes, forcing 1975 makes TSC work nicely.  Yay!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH clocksource 1/3] clocksource: Reject bogus watchdog clocksource measurements
  2022-12-02 22:24                                             ` Paul E. McKenney
@ 2022-12-03  2:51                                               ` Feng Tang
  0 siblings, 0 replies; 28+ messages in thread
From: Feng Tang @ 2022-12-03  2:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, linux-kernel, sboyd, corbet, Mark.Rutland, maz,
	kernel-team, ak, zhengjun.xing, Chris Mason, John Stultz,
	Waiman Long

On Fri, Dec 02, 2022 at 02:24:02PM -0800, Paul E. McKenney wrote:
[...]
> > > > > > > (My next step involves a pillow, but will follow up tomorrow morning
> > > > > > > Pacific Time.)
> > > > > >  
> > > > > > Really thanks for checking this through late night!
> > > > > 
> > > > > No problem, and I guess it is instead the day after tomorrow, but
> > > > > I thought you might be interested in chronyd's opinion:
> > > > > 
> > > > > [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
> > > > >         40001.074911             0.002098
> > > > > 
> > > > > In contrast, on my Fedora laptop:
> > > > > 
> > > > > $ sudo cat /var/lib/chrony/drift
> > > > >             2.074313             0.186606
> > > > > 
> > > > > I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> > > > > drifting with respect to standard time.
> > > >  
> > > > This info is very useful! It further confirms the CPUID(0x15) gave
> > > > the wrong frequency info. 
> > > 
> > > So the TSC is just doing what it is told.  ;-)
> > > 
> > > This indicates a firmware problem?
> > > 
> > > > Also I don't think TSC itself is drifting, and the drift some from
> > > > the wrong match calculation(1896 MHz), if we give it the correct
> > > > number (likely 1975 MHz here), there shouldn't be big chrony drift
> > > > like your Fedora laptop.
> > > 
> > > Resetting so that the clocksource watchdog once again gets rid of TSC,
> > > but leaving nohpet:
> > > 
> > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > [    5.287750] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > [   17.963947] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   19.996287] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > > [   20.040287] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> > > [   20.067286] clocksource:                       'tsc-early' cs_nsec: 588021368 cs_now: 581c1eb378 cs_last: 57d9aad9e8 mask: ffffffffffffffff
> > >  [   20.096286] clocksource:                       No current clocksource.
> > >  [   20.111286] tsc: Marking TSC unstable due to clocksource watchdog
> > >  [   24.582541] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > >  [   49.739301] clocksource: Switched to clocksource refined-jiffies
> > >  [   50.046356] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > >  [   50.066475] clocksource: Switched to clocksource acpi_pm
> > > 
> > > # cat /var/lib/chrony/drift 
> > >             1.372570             0.020049
> > > 
> > > I interpret this to mean that acpi_pm (and thus from prior observations,
> > > HPET as well) are counting at the correct rate.
> > 
> > Correct. And this is a good news! that 1975 MHz seems to be the right
> > number.
> > 
> > Could you try below patch, it should override the CPUID calculation
> > and forced to use HW timer calibrated number:
> > 
> > ---
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index a78e73da4a74..68a2fea4961d 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1417,7 +1417,8 @@ static int __init init_tsc_clocksource(void)
> >  	 * When TSC frequency is known (retrieved via MSR or CPUID), we skip
> >  	 * the refined calibration and directly register it as a clocksource.
> >  	 */
> > -	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> > +//	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> > +	if (false) {
> >  		if (boot_cpu_has(X86_FEATURE_ART))
> >  			art_related_clocksource = &clocksource_tsc;
> >  		clocksource_register_khz(&clocksource_tsc, tsc_khz);
> > ---
> > 
> > And together with b50db7095fe0 "x86/tsc: Disable clocksource watchdog
> > for TSC on qualified platorms". I assume this will tell TSC to use
> > 1975 MHZ as its frequency.
> 
> This did not change things,

Maybe the kernel is with the tsc reclibrate patch? it will not use
the reclibrated frequency for a KNOWN_FREQ case, but just print
out the info. Anyway this doesn't matter, as it's just for debug,
and your way already proved the 1975 MHz is the right number :)

> but when I hardcoded tsc_khz to 1975, the
> clocksource watchdog no longer disables TSC and chrony shows drifts of
> less than 2.0.  (As opposed to about 40,000 without the hardcoding.)
> 
> So yes, forcing 1975 makes TSC work nicely.  Yay!  ;-)

Great to know it works and we have some workaround before a final
solution :)

Thanks,
Feng

> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-12-03  2:55 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-14 23:28 [PATCH clocksource 0/3] Reject bogus watchdog clocksource measurements Paul E. McKenney
2022-11-14 23:28 ` [PATCH clocksource 1/3] clocksource: " Paul E. McKenney
2022-11-17 21:57   ` Thomas Gleixner
2022-11-17 23:09     ` Paul E. McKenney
2022-11-21  0:55       ` Feng Tang
2022-11-21 15:21         ` Paul E. McKenney
2022-11-21 18:14         ` Paul E. McKenney
2022-11-22 15:55           ` Feng Tang
2022-11-22 22:07             ` Paul E. McKenney
2022-11-23  2:36               ` Feng Tang
2022-11-23 21:23                 ` Paul E. McKenney
2022-11-28  2:15                   ` Feng Tang
2022-11-29 19:29                     ` Paul E. McKenney
2022-11-30  1:38                       ` Feng Tang
2022-11-30  4:12                         ` Paul E. McKenney
2022-11-30  4:49                           ` Feng Tang
2022-11-30  5:16                             ` Paul E. McKenney
2022-11-30  5:35                               ` Feng Tang
2022-11-30  5:50                                 ` Paul E. McKenney
2022-11-30  6:00                                   ` Feng Tang
2022-12-01 17:24                                     ` Paul E. McKenney
2022-12-02  1:10                                       ` Feng Tang
2022-12-02  1:44                                         ` Paul E. McKenney
2022-12-02  2:02                                           ` Feng Tang
2022-12-02 22:24                                             ` Paul E. McKenney
2022-12-03  2:51                                               ` Feng Tang
2022-11-14 23:28 ` [PATCH clocksource 2/3] clocksource: Add comments to classify bogus measurements Paul E. McKenney
2022-11-14 23:28 ` [PATCH clocksource 3/3] clocksource: Exponential backoff for load-induced bogus watchdog reads Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).