clocksource: don't run watchdog forever

Message ID 1614653665-20905-1-git-send-email-feng.tang@intel.com

Commit Message

Feng Tang March 2, 2021, 2:54 a.m. UTC
The clocksource watchdog runs every 500 ms, which creates some OS
noise. As clocksource wreckage (especially for clocksources with a
per-cpu reading hook) usually happens shortly after a CPU is brought
up or after the system resumes from a sleep state, add a time limit
so that the clocksource watchdog only runs for a limited period,
while making sure it runs at least twice on each CPU.

Regarding performance data, there is no improvement with the
micro-benchmarks we have (hackbench/netperf/fio/will-it-scale
etc.), but it clearly reduces the periodic timer interrupts, and
may help in the following cases:
* When some CPUs are isolated to run only scientific or high
  performance computing tasks on a NOHZ_FULL kernel, where there
  are almost no interrupts, this makes those CPUs even quieter.
* On a cluster which runs a lot of systems in parallel with
  barriers, there are always enough systems running the watchdog
  and making everyone else wait.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/clocksource.h |  7 +++++++
 kernel/cpu.c                |  3 +++
 kernel/time/clocksource.c   | 31 +++++++++++++++++++++++++++++--
 3 files changed, 39 insertions(+), 2 deletions(-)

Comments

Peter Zijlstra March 2, 2021, 9:16 a.m. UTC | #1
On Tue, Mar 02, 2021 at 10:54:24AM +0800, Feng Tang wrote:
> clocksource watchdog runs every 500ms, which creates some OS noise.
> As the clocksource wreckage (especially for those that has per-cpu
> reading hook) usually happens shortly after CPU is brought up or
> after system resumes from sleep state, so add a time limit for
> clocksource watchdog to only run for a period of time, and make
> sure it run at least twice for each CPU.
> 
> Regarding performance data, there is no improvement data with the
> micro-benchmarks we have like hackbench/netperf/fio/will-it-scale
> etc. But it obviously reduces periodic timer interrupts, and may
> help in following cases:
> * When some CPUs are isolated to only run scientific or high
>   performance computing tasks on a NOHZ_FULL kernel, where there
>   is almost no interrupts, this could make it more quiet
> * On a cluster which runs a lot of systems in parallel with
>   barriers there are always enough systems which run the watchdog
>   and make everyone else wait
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>

Urgh.. so this hopes and prays that the TSC wreckage happens in the
first 10 minutes after boot.

Given the previous patch, the watchdog wouldn't be running at all on
modern machines, so why wreck it for the old machines where it's
actually needed?
Feng Tang March 2, 2021, 12:06 p.m. UTC | #2
Hi Peter,

On Tue, Mar 02, 2021 at 10:16:37AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 02, 2021 at 10:54:24AM +0800, Feng Tang wrote:
> > clocksource watchdog runs every 500ms, which creates some OS noise.
> > As the clocksource wreckage (especially for those that has per-cpu
> > reading hook) usually happens shortly after CPU is brought up or
> > after system resumes from sleep state, so add a time limit for
> > clocksource watchdog to only run for a period of time, and make
> > sure it run at least twice for each CPU.
> > 
> > Regarding performance data, there is no improvement data with the
> > micro-benchmarks we have like hackbench/netperf/fio/will-it-scale
> > etc. But it obviously reduces periodic timer interrupts, and may
> > help in following cases:
> > * When some CPUs are isolated to only run scientific or high
> >   performance computing tasks on a NOHZ_FULL kernel, where there
> >   is almost no interrupts, this could make it more quiet
> > * On a cluster which runs a lot of systems in parallel with
> >   barriers there are always enough systems which run the watchdog
> >   and make everyone else wait
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> Urgh.. so this hopes and prays that the TSC wrackage happens in the
> first 10 minutes after boot.

Yes, the 10 minutes part is only based on our past experience, and we
can make it longer. But if there were real cases where the wreckage
happened long after the CPU is brought up, like days later, then this
patch won't help much.

> Given the previous patch, the watchdog wouldn't be running at all on
> modern machines, so why wreck it for the old machines where it's
> actually needed?

Yes, this is for older x86 machines and some other architectures.

Thanks,
Feng
Thomas Gleixner March 3, 2021, 3:50 p.m. UTC | #3
On Tue, Mar 02 2021 at 20:06, Feng Tang wrote:
> On Tue, Mar 02, 2021 at 10:16:37AM +0100, Peter Zijlstra wrote:
>> On Tue, Mar 02, 2021 at 10:54:24AM +0800, Feng Tang wrote:
>> > clocksource watchdog runs every 500ms, which creates some OS noise.
>> > As the clocksource wreckage (especially for those that has per-cpu
>> > reading hook) usually happens shortly after CPU is brought up or
>> > after system resumes from sleep state, so add a time limit for
>> > clocksource watchdog to only run for a period of time, and make
>> > sure it run at least twice for each CPU.
>> > 
>> > Regarding performance data, there is no improvement data with the
>> > micro-benchmarks we have like hackbench/netperf/fio/will-it-scale
>> > etc. But it obviously reduces periodic timer interrupts, and may
>> > help in following cases:
>> > * When some CPUs are isolated to only run scientific or high
>> >   performance computing tasks on a NOHZ_FULL kernel, where there
>> >   is almost no interrupts, this could make it more quiet
>> > * On a cluster which runs a lot of systems in parallel with
>> >   barriers there are always enough systems which run the watchdog
>> >   and make everyone else wait
>> > 
>> > Signed-off-by: Feng Tang <feng.tang@intel.com>
>> 
>> Urgh.. so this hopes and prays that the TSC wrackage happens in the
>> first 10 minutes after boot.

which is wishful thinking....

> Yes, the 10 minutes part is only based on our past experience and we
> can make it longer. But if there was real case that the wrackage happened
> long after CPU is brought up like days, then this patch won't help
> much.

It really depends on the BIOS wreckage. On one of my machines it takes
up to a day, depending on the workload.

Anything pre-TSC_ADJUST wants the watchdog on. With TSC_ADJUST available
we can probably avoid it.

There is a caveat though. If the machine never goes idle then TSC adjust
is not able to detect a potential wreckage. OTOH, most of the broken
BIOSes tweak the TSC only by a few cycles, and that is usually detectable
during boot. So we might be clever about it and schedule a check every
hour if, during the first 10 minutes, a modification of TSC adjust is
seen on any CPU.

Where is this TSC_DISABLE_WRITE bit again?

Thanks,

        tglx
Feng Tang March 4, 2021, 7:43 a.m. UTC | #4
Hi Thomas,

On Wed, Mar 03, 2021 at 04:50:31PM +0100, Thomas Gleixner wrote:
> On Tue, Mar 02 2021 at 20:06, Feng Tang wrote:
> > On Tue, Mar 02, 2021 at 10:16:37AM +0100, Peter Zijlstra wrote:
> >> On Tue, Mar 02, 2021 at 10:54:24AM +0800, Feng Tang wrote:
> >> > clocksource watchdog runs every 500ms, which creates some OS noise.
> >> > As the clocksource wreckage (especially for those that has per-cpu
> >> > reading hook) usually happens shortly after CPU is brought up or
> >> > after system resumes from sleep state, so add a time limit for
> >> > clocksource watchdog to only run for a period of time, and make
> >> > sure it run at least twice for each CPU.
> >> > 
> >> > Regarding performance data, there is no improvement data with the
> >> > micro-benchmarks we have like hackbench/netperf/fio/will-it-scale
> >> > etc. But it obviously reduces periodic timer interrupts, and may
> >> > help in following cases:
> >> > * When some CPUs are isolated to only run scientific or high
> >> >   performance computing tasks on a NOHZ_FULL kernel, where there
> >> >   is almost no interrupts, this could make it more quiet
> >> > * On a cluster which runs a lot of systems in parallel with
> >> >   barriers there are always enough systems which run the watchdog
> >> >   and make everyone else wait
> >> > 
> >> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> >> 
> >> Urgh.. so this hopes and prays that the TSC wrackage happens in the
> >> first 10 minutes after boot.
> 
> which is wishful thinking....
> 
> > Yes, the 10 minutes part is only based on our past experience and we
> > can make it longer. But if there was real case that the wrackage happened
> > long after CPU is brought up like days, then this patch won't help
> > much.
> 
> It really depends on the BIOS wreckage. On one of my machine it takes up
> to a day depending on the workload.

Thanks for sharing the info.

> Anything pre TSC_ADJUST wants the watchdog on. With TSC ADJUST available
> we can probably avoid it.
> 
> There is a caveat though. If the machine never goes idle then TSC adjust
> is not able to detect a potential wreckage. OTOH, most of the broken
> BIOSes tweak TSC only by a few cycles and that is usually detectable
> during boot. So we might be clever about it and schedule a check every
> hour when during the first 10 minutes a modification of TSC adjust is
> seen on any CPU.

I don't have much experience with tsc_adjust, so let me try to
understand it. Does the 'modification of TSC adjust' here cover 2 cases?
* The first read of the 'TSC_ADJUST' MSR on a just-booted CPU returns a
  non-zero value
* A following read of 'TSC_ADJUST' doesn't equal the 'tsc_adjust' value
  saved in per-cpu data

Also, does the patch ("x86/tsc: mark tsc reliable for qualified platforms")
need to wait till this caveat case is solved? 

Thanks,
Feng

> 
> Where is this TSC_DISABLE_WRITE bit again?
> 
> Thanks,
> 
>         tglx
>
Thomas Gleixner March 4, 2021, 2:15 p.m. UTC | #5
Feng,

On Thu, Mar 04 2021 at 15:43, Feng Tang wrote:
> On Wed, Mar 03, 2021 at 04:50:31PM +0100, Thomas Gleixner wrote:
>> Anything pre TSC_ADJUST wants the watchdog on. With TSC ADJUST available
>> we can probably avoid it.
>> 
>> There is a caveat though. If the machine never goes idle then TSC adjust
>> is not able to detect a potential wreckage. OTOH, most of the broken
>> BIOSes tweak TSC only by a few cycles and that is usually detectable
>> during boot. So we might be clever about it and schedule a check every
>> hour when during the first 10 minutes a modification of TSC adjust is
>> seen on any CPU.
>
> I don't have much experience with tsc_adjust, and try to understand it:
> The 'modification of TSC' here has 2 cases: ? 
> * First read of 'TSC_ADJUST' MSR of a just boot CPU returns non-zero
> value

That's catching stupid BIOSes which set the TSC to random values during
boot/reboot. That's a one off boot issue and not a real problem. The
kernel fixes it up and is done with it. Nothing to care about.

> * Following read of 'TSC_ADJUST' doesn't equal to the 'tsc_adjust' value
>   saved in per-cpu data.

That's where we catch broken BIOS/SMI implementations which try to
"hide" the time wasted in BIOS/SMI by setting the TSC back to the value
they saved on SMI entry. That was a popular BIOS "feature" some years
ago, but it seems the BIOS tinkerers finally gave up on it.

>> Where is this TSC_DISABLE_WRITE bit again?

I'm serious about this. Once the kernel has taken over a CPU there is
absolutely no reason for any context to write to the TSC/TSC_ADJUST
register ever again. So having a mechanism to prevent writes would
surely help to make the TSC more trustworthy.

> Also, does the patch ("x86/tsc: mark tsc reliable for qualified platforms")
> need to wait till this caveat case is solved?

Yes, but that should be trivial to do. 

Thanks,

        tglx
Feng Tang March 5, 2021, 2:30 a.m. UTC | #6
On Thu, Mar 04, 2021 at 03:15:13PM +0100, Thomas Gleixner wrote:
> Feng,
> 
> On Thu, Mar 04 2021 at 15:43, Feng Tang wrote:
> > On Wed, Mar 03, 2021 at 04:50:31PM +0100, Thomas Gleixner wrote:
> >> Anything pre TSC_ADJUST wants the watchdog on. With TSC ADJUST available
> >> we can probably avoid it.
> >> 
> >> There is a caveat though. If the machine never goes idle then TSC adjust
> >> is not able to detect a potential wreckage. OTOH, most of the broken
> >> BIOSes tweak TSC only by a few cycles and that is usually detectable
> >> during boot. So we might be clever about it and schedule a check every
> >> hour when during the first 10 minutes a modification of TSC adjust is
> >> seen on any CPU.
> >
> > I don't have much experience with tsc_adjust, and try to understand it:
> > The 'modification of TSC' here has 2 cases: ? 
> > * First read of 'TSC_ADJUST' MSR of a just boot CPU returns non-zero
> > value
> 
> That's catching stupid BIOSes which set the TSC to random values during
> boot/reboot. That's a one off boot issue and not a real problem. The
> kernel fixes it up and is done with it. Nothing to care about.
> 
> > * Following read of 'TSC_ADJUST' doesn't equal to the 'tsc_adjust' value
> >   saved in per-cpu data.
> 
> That's where we catch broken BIOS/SMI implementations which try to
> "hide" the time wasted in BIOS/SMI by setting the TSC back to the value
> they saved on SMI entry. That was a popular BIOS "feature" some years
> ago, but it seems the BIOS tinkerers finally gave up on it.
 
Thanks for the detailed explanation! I understand now.

> >> Where is this TSC_DISABLE_WRITE bit again?
> 
> I'm serious about this. Once the kernel has taken over a CPU there is
> absolutely no reason for any context to write to the TSC/TSC_ADJUST
> register ever again. So having a mechanism to prevent writes would
> surely help to make the TSC more trustworthy.
> 
> > Also, does the patch ("x86/tsc: mark tsc reliable for qualified platforms")
> > need to wait till this caveat case is solved?
> 
> Yes, but that should be trivial to do. 

Ok, I see.

Thanks,
Feng

> 
> Thanks,
> 
>         tglx
Feng Tang March 25, 2021, 8:34 a.m. UTC | #7
Hi Thomas,

On Wed, Mar 03, 2021 at 04:50:31PM +0100, Thomas Gleixner wrote:
> On Tue, Mar 02 2021 at 20:06, Feng Tang wrote:
> > On Tue, Mar 02, 2021 at 10:16:37AM +0100, Peter Zijlstra wrote:
> >> On Tue, Mar 02, 2021 at 10:54:24AM +0800, Feng Tang wrote:
> >> > clocksource watchdog runs every 500ms, which creates some OS noise.
> >> > As the clocksource wreckage (especially for those that has per-cpu
> >> > reading hook) usually happens shortly after CPU is brought up or
> >> > after system resumes from sleep state, so add a time limit for
> >> > clocksource watchdog to only run for a period of time, and make
> >> > sure it run at least twice for each CPU.
> >> > 
> >> > Regarding performance data, there is no improvement data with the
> >> > micro-benchmarks we have like hackbench/netperf/fio/will-it-scale
> >> > etc. But it obviously reduces periodic timer interrupts, and may
> >> > help in following cases:
> >> > * When some CPUs are isolated to only run scientific or high
> >> >   performance computing tasks on a NOHZ_FULL kernel, where there
> >> >   is almost no interrupts, this could make it more quiet
> >> > * On a cluster which runs a lot of systems in parallel with
> >> >   barriers there are always enough systems which run the watchdog
> >> >   and make everyone else wait
> >> > 
> >> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> >> 
> >> Urgh.. so this hopes and prays that the TSC wrackage happens in the
> >> first 10 minutes after boot.
> 
> which is wishful thinking....
> 
> > Yes, the 10 minutes part is only based on our past experience and we
> > can make it longer. But if there was real case that the wrackage happened
> > long after CPU is brought up like days, then this patch won't help
> > much.
> 
> It really depends on the BIOS wreckage. On one of my machine it takes up
> to a day depending on the workload.
> 
> Anything pre TSC_ADJUST wants the watchdog on. With TSC ADJUST available
> we can probably avoid it.
> 
> There is a caveat though. If the machine never goes idle then TSC adjust
> is not able to detect a potential wreckage. OTOH, most of the broken
> BIOSes tweak TSC only by a few cycles and that is usually detectable
> during boot. So we might be clever about it and schedule a check every
> hour when during the first 10 minutes a modification of TSC adjust is
> seen on any CPU.

I've thought about implementing this (sorry for the delay), and would
like to clarify something to make sure I understand it correctly: is
this hourly check only for x86's tsc_adjust being overridden by the
BIOS, and not for the general kernel watchdog? As the current
clocksources have different wrap times (the acpi_pm timer wraps around
every ~4 seconds, and the hpet wraps about every 300 seconds), we can
only either keep doing the watchdog check or cancel it.

If so, we can start a timer that fires 10 minutes later to check it,
and extend the timer to 1 hour if there is no tsc_adjust override.

I've also checked one open-source BIOS project, EDK2
(https://github.com/tianocore/edk2): I did some grepping and couldn't
find any place writing to the tsc_adjust MSR, which may give us more
confidence that fewer and fewer BIOSes will wrongly write to the
tsc_adjust MSR :)
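Such a search might look something like the sketch below (my own construction, not the exact command used). The regex matches either the MSR's symbolic name or its index 0x3B; expect some false positives, since any literal 0x3B will match.

```shell
# Search a firmware source tree for references to IA32_TSC_ADJUST
# (MSR index 0x3B). Case-insensitive since code styles vary.
find_tsc_adjust_refs() {
    grep -rniE 'TSC_ADJUST|0x0*3b\b' "$1" 2>/dev/null
}

# Example: find_tsc_adjust_refs ~/src/edk2
```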

Thanks,
Feng
Thomas Gleixner March 25, 2021, 11:40 a.m. UTC | #8
On Thu, Mar 25 2021 at 16:34, Feng Tang wrote:
> On Wed, Mar 03, 2021 at 04:50:31PM +0100, Thomas Gleixner wrote:
>
> I've checked one open-sourced BIOS code project: EDK2 (https://github.com/tianocore/edk2),
> where I did some grep and can't find places writting to tsc_adjust msr,
> which can give us more confidence that fewer and fewer BIOS will wrongly
> write to tsc_adjust msr :)

The problem is not EDK2. The problem is in the $VENDOR value add which
is _not_ part of EDK2.

So looking at that code does not make me more confident at all.

Thanks,

        tglx

Patch

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 86d143d..cf428a2 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -252,6 +252,13 @@  static inline void __clocksource_update_freq_khz(struct clocksource *cs, u32 khz
 	__clocksource_update_freq_scale(cs, 1000, khz);
 }
 
+#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
+extern void clocksource_kick_watchdog(void);
+#else
+static inline void clocksource_kick_watchdog(void) { }
+#endif
+
+
 #ifdef CONFIG_ARCH_CLOCKSOURCE_INIT
 extern void clocksource_arch_init(struct clocksource *cs);
 #else
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 1b6302e..fdf3c69 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -32,6 +32,7 @@ 
 #include <linux/relay.h>
 #include <linux/slab.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/clocksource.h>
 
 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
@@ -1286,6 +1287,8 @@  static int cpu_up(unsigned int cpu, enum cpuhp_state target)
 	}
 
 	err = _cpu_up(cpu, 0, target);
+
+	clocksource_kick_watchdog();
 out:
 	cpu_maps_update_done();
 	return err;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index cce484a..aba985a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -104,6 +104,7 @@  static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
 static DEFINE_SPINLOCK(watchdog_lock);
 static int watchdog_running;
 static atomic_t watchdog_reset_pending;
+static unsigned long watchdog_stop_time;	/* in jiffies */
 
 static inline void clocksource_watchdog_lock(unsigned long *flags)
 {
@@ -295,10 +296,16 @@  static void clocksource_watchdog(struct timer_list *unused)
 		next_cpu = cpumask_first(cpu_online_mask);
 
 	/*
-	 * Arm timer if not already pending: could race with concurrent
-	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
+	 * Arm timer if not already pending and the check time window
+	 * has not passed yet: could race with the concurrent pair
+	 * clocksource_stop_watchdog() / clocksource_start_watchdog().
 	 */
 	if (!timer_pending(&watchdog_timer)) {
+		if (time_after(jiffies, watchdog_stop_time)) {
+			atomic_inc(&watchdog_reset_pending);
+			watchdog_running = 0;
+			goto out;
+		}
 		watchdog_timer.expires += WATCHDOG_INTERVAL;
 		add_timer_on(&watchdog_timer, next_cpu);
 	}
@@ -308,6 +315,16 @@  static void clocksource_watchdog(struct timer_list *unused)
 
 static inline void clocksource_start_watchdog(void)
 {
+	unsigned long check_ticks;
+
+	/*
+	 * As the timer is bounced across all CPUs, make sure each
+	 * CPU can run the timer at least twice, and that the check
+	 * runs for at least 10 minutes.
+	 */
+	check_ticks = max_t(unsigned long, num_possible_cpus(), 600) * HZ;
+	watchdog_stop_time = jiffies + check_ticks;
+
 	if (watchdog_running || !watchdog || list_empty(&watchdog_list))
 		return;
 	timer_setup(&watchdog_timer, clocksource_watchdog, 0);
@@ -324,6 +341,15 @@  static inline void clocksource_stop_watchdog(void)
 	watchdog_running = 0;
 }
 
+void clocksource_kick_watchdog(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&watchdog_lock, flags);
+	clocksource_start_watchdog();
+	spin_unlock_irqrestore(&watchdog_lock, flags);
+}
+
 static inline void clocksource_reset_watchdog(void)
 {
 	struct clocksource *cs;
@@ -618,6 +644,7 @@  void clocksource_resume(void)
 			cs->resume(cs);
 
 	clocksource_resume_watchdog();
+	clocksource_kick_watchdog();
 }
 
 /**