[V2] kernel/watchdog: fix spurious hard lockups
diff mbox series

Message ID 20170621144118.5939-1-kan.liang@intel.com
State New, archived
Headers show
Series
  • [V2] kernel/watchdog: fix spurious hard lockups
Related show

Commit Message

Liang, Kan June 21, 2017, 2:41 p.m. UTC
From: Kan Liang <Kan.liang@intel.com>

Some users reported spurious NMI watchdog timeouts.

We now have more and more systems where the Turbo range is wide enough
that the NMI watchdog expires faster than the soft watchdog timer that
updates the interrupt tick the NMI watchdog relies on.

This problem was originally added by commit 58687acba592
("lockup_detector: Combine nmi_watchdog and softlockup detector").
Previously the NMI watchdog would always check jiffies, which were
ticking fast enough. But now the backing is quite slow so the expire
time becomes more sensitive.

For mainline the right fix is to switch the NMI watchdog to reference
cycles, which tick always at the same rate independent of turbo mode.
But this is requires some complicated changes in perf, which are too
difficult to backport. Since we need a stable fix to just increase the
NMI watchdog rate here to avoid the spurious timeouts. This is not an
ideal fix because a 3x as large Turbo range could still fail, but for
now that's not likely.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
Cc: stable@vger.kernel.org
Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and
softlockup detector")
---

The right fix for mainline can be found here.
perf/x86/intel: enable CPU ref_cycles for GP counter
perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
https://patchwork.kernel.org/patch/9779087/
https://patchwork.kernel.org/patch/9779089/

Change since V1:
 - Restrict the period in hw_nmi.c for Intel platform. (Don Zickus)

 arch/x86/kernel/apic/hw_nmi.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Thomas Gleixner June 21, 2017, 3:12 p.m. UTC | #1
On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
>  
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> +/*
> + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> + * can tick faster than the measured CPU Frequency due to Turbo mode.
> + * That can lead to spurious timeouts.
> + * To workaround the issue, extending the period by 3 times.
> + */
>  u64 hw_nmi_get_sample_period(int watchdog_thresh)
>  {
> -	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
> +	return (u64)(cpu_khz) * 1000 * watchdog_thresh * 3;

The maximum turbo frequency of any given machine can be retrieved.

So why don't you simply take that ratio into account and apply it for the
machines which have those insane turbo loaders? That's not a huge effort,
can be easily backported and does not inflict this unconditially.

So what you want is:

	return get_max_turbo_khz() * 1000 * watchdog_thresh;

Where get_max_turbo_khz() by default returns cpu_khz for non turbo
motors.

And instead of silently doing this it should emit a info into dmesg:

   	u64 period, max_khz = get_max_turbo_khz();
	static int once;

	period = max_khz * 1000 * watchdog_thresh;

	if (max_khz != cpu_khz && !once) {
		unsigned int msec = period / cpu_khz;

		once = 1;
		pr_info("Adjusted watchdog threshold to %u.%04u sec\n",
			msec / 1000, msec % 1000);
	}

	return period;

Hmm?

Thanks,

	tglx
Liang, Kan June 21, 2017, 3:47 p.m. UTC | #2
> On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
> >
> >  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > +/*
> > + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event,
> which
> > + * can tick faster than the measured CPU Frequency due to Turbo mode.
> > + * That can lead to spurious timeouts.
> > + * To workaround the issue, extending the period by 3 times.
> > + */
> >  u64 hw_nmi_get_sample_period(int watchdog_thresh)  {
> > -	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
> > +	return (u64)(cpu_khz) * 1000 * watchdog_thresh * 3;
> 
> The maximum turbo frequency of any given machine can be retrieved.

The maximum turbo frequency is determined by the model of processor.
I'm not sure if there is a generic way to get the maximum turbo frequency.
Is there?

Thanks,
Kan

> 
> So why don't you simply take that ratio into account and apply it for the
> machines which have those insane turbo loaders? That's not a huge effort,
> can be easily backported and does not inflict this unconditially.
> 
> So what you want is:
> 
> 	return get_max_turbo_khz() * 1000 * watchdog_thresh;
> 
> Where get_max_turbo_khz() by default returns cpu_khz for non turbo
> motors.
> 
> And instead of silently doing this it should emit a info into dmesg:
> 
>    	u64 period, max_khz = get_max_turbo_khz();
> 	static int once;
> 
> 	period = max_khz * 1000 * watchdog_thresh;
> 
> 	if (max_khz != cpu_khz && !once) {
> 		unsigned int msec = period / cpu_khz;
> 
> 		once = 1;
> 		pr_info("Adjusted watchdog threshold to %u.%04u sec\n",
> 			msec / 1000, msec % 1000);
> 	}
> 
> 	return period;
> 
> Hmm?
> 
> Thanks,
> 
> 	tglx
Andi Kleen June 21, 2017, 5:07 p.m. UTC | #3
On Wed, Jun 21, 2017 at 05:12:06PM +0200, Thomas Gleixner wrote:
> On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
> >  
> >  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > +/*
> > + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> > + * can tick faster than the measured CPU Frequency due to Turbo mode.
> > + * That can lead to spurious timeouts.
> > + * To workaround the issue, extending the period by 3 times.
> > + */
> >  u64 hw_nmi_get_sample_period(int watchdog_thresh)
> >  {
> > -	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
> > +	return (u64)(cpu_khz) * 1000 * watchdog_thresh * 3;
> 
> The maximum turbo frequency of any given machine can be retrieved.

Not reliably, e.g. not in virtualization. Also it would require
model specific checks, so as soon as you have a new model and an
old kernel it could still randomly fail.

-Andi
Prarit Bhargava June 21, 2017, 5:40 p.m. UTC | #4
On 06/21/2017 11:47 AM, Liang, Kan wrote:
> 
> 
>> On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
>>>
>>>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>>> +/*
>>> + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event,
>> which
>>> + * can tick faster than the measured CPU Frequency due to Turbo mode.
>>> + * That can lead to spurious timeouts.
>>> + * To workaround the issue, extending the period by 3 times.
>>> + */
>>>  u64 hw_nmi_get_sample_period(int watchdog_thresh)  {
>>> -	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
>>> +	return (u64)(cpu_khz) * 1000 * watchdog_thresh * 3;
>>
>> The maximum turbo frequency of any given machine can be retrieved.
> 
> The maximum turbo frequency is determined by the model of processor.
> I'm not sure if there is a generic way to get the maximum turbo frequency.
> Is there?
> 

cpufreq_quick_get_max()

but iff cpufreq subsystem is initialized.  O/w 0 is returned for the freq.

Quick test shows the correct turbo max of 3700000 on my 2000000 (2.00GHz)
system.

P.

> Thanks,
> Kan
> 
>>
>> So why don't you simply take that ratio into account and apply it for the
>> machines which have those insane turbo loaders? That's not a huge effort,
>> can be easily backported and does not inflict this unconditially.
>>
>> So what you want is:
>>
>> 	return get_max_turbo_khz() * 1000 * watchdog_thresh;
>>
>> Where get_max_turbo_khz() by default returns cpu_khz for non turbo
>> motors.
>>
>> And instead of silently doing this it should emit a info into dmesg:
>>
>>    	u64 period, max_khz = get_max_turbo_khz();
>> 	static int once;
>>
>> 	period = max_khz * 1000 * watchdog_thresh;
>>
>> 	if (max_khz != cpu_khz && !once) {
>> 		unsigned int msec = period / cpu_khz;
>>
>> 		once = 1;
>> 		pr_info("Adjusted watchdog threshold to %u.%04u sec\n",
>> 			msec / 1000, msec % 1000);
>> 	}
>>
>> 	return period;
>>

>> Hmm?
>>
>> Thanks,
>>
>> 	tglx
>
Thomas Gleixner June 21, 2017, 7:59 p.m. UTC | #5
On Wed, 21 Jun 2017, Andi Kleen wrote:
> On Wed, Jun 21, 2017 at 05:12:06PM +0200, Thomas Gleixner wrote:
> > On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
> > >  
> > >  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > > +/*
> > > + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> > > + * can tick faster than the measured CPU Frequency due to Turbo mode.
> > > + * That can lead to spurious timeouts.
> > > + * To workaround the issue, extending the period by 3 times.
> > > + */
> > >  u64 hw_nmi_get_sample_period(int watchdog_thresh)
> > >  {
> > > -	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
> > > +	return (u64)(cpu_khz) * 1000 * watchdog_thresh * 3;
> > 
> > The maximum turbo frequency of any given machine can be retrieved.
> 
> Not reliably, e.g. not in virtualization. Also it would require
> model specific checks, so as soon as you have a new model and an
> old kernel it could still randomly fail.

And that's in no way an argument for breaking every existing setup which
relies on the way stuff works now. Lots of crap on new models does not work
with older kernels.

Fact is, this is a user visible change and people have pointed out, that it
will break their setups and expectations. So, no this is not going to
happen with just slapping a randomly chosen factor on it.

Fact is, that the watchdog works this way since it got implemented and it
really can sensibly argued that the way it works is correct.

The treshold is based on the non-turbo max frequency of the CPU. That's how
the period is calculated. And that makes sense in terms of frequency
scaling in either direction.

If your CPU is stuck for 1 second @2GHZ, then it wastes exactly the same
amount of cycles when it is stuck for 0.5 seconds @4Ghz. So the watchdog
can rightfully kick in after that and tell the world that crap is stuck.

If your newfangled machine triggers the watchdog after 0.2 seconds, then
you already have a mechanism to fix that. It's a user space interface after
all. There are lot of things which need to be adjusted with new machines
and if the boot default of 10 seconds is not sufficient, then something is
really wrong with these systems.

TBH, I rather whish the hard lockup watchdog treshhold would be
configurable in milliseconds rather than seconds to catch crap faster.

Thanks,

	tglx
Thomas Gleixner June 21, 2017, 9:53 p.m. UTC | #6
On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
> We now have more and more systems where the Turbo range is wide enough
> that the NMI watchdog expires faster than the soft watchdog timer that
> updates the interrupt tick the NMI watchdog relies on.
> 
> This problem was originally added by commit 58687acba592
> ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> Previously the NMI watchdog would always check jiffies, which were
> ticking fast enough. But now the backing is quite slow so the expire
> time becomes more sensitive.

And slapping a factor 3 on the NMI period is the wrong answer to the
problem. The simple solution would be to increase the hrtimer frequency,
but that's not really desired either.

Find an untested patch below, which should cure the issue.

Thanks,

	tglx
	
8<---------------
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -96,6 +96,7 @@ config X86
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select GENERIC_TIME_VSYSCALL
+	select HARDLOCKUP_CHECK_TIMESTAMP	if X86_64
 	select HAVE_ACPI_APEI			if ACPI
 	select HAVE_ACPI_APEI_NMI		if ACPI
 	select HAVE_ALIGNED_STRUCT_PAGE		if SLUB
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -155,6 +155,14 @@ extern int sysctl_hardlockup_all_cpu_bac
 #define sysctl_softlockup_all_cpu_backtrace 0
 #define sysctl_hardlockup_all_cpu_backtrace 0
 #endif
+
+#if defined(CONFIG_HARDLOCKUP_CHECK_TIMESTAMP) && \
+    defined(CONFIG_HARDLOCKUP_DETECTOR)
+void watchdog_update_hrtimer_threshold(u64 period);
+#else
+static inline void watchdog_update_hrtimer_threshold(u64 period) { }
+#endif
+
 extern bool is_hardlockup(void);
 struct ctl_table;
 extern int proc_watchdog(struct ctl_table *, int ,
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -161,6 +161,7 @@ static void set_sample_period(void)
 	 * hardlockup detector generates a warning
 	 */
 	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
+	watchdog_update_hrtimer_threshold(sample_period);
 }
 
 /* Commands for resetting the watchdog */
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -70,6 +70,54 @@ void touch_nmi_watchdog(void)
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
 
+#ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
+static DEFINE_PER_CPU(ktime_t, last_timestamp);
+static ktime_t watchdog_hrtimer_sample_threshold __read_mostly;
+
+void watchdog_update_hrtimer_threshold(u64 period)
+{
+	/*
+	 * The hrtimer runs with a period of (watchdog_threshold * 2) / 5
+	 *
+	 * So it runs effectively with 2.5 times the rate of the NMI
+	 * watchdog. That means the hrtimer should fire 2-3 times before
+	 * the NMI watchdog expires. The NMI watchdog on x86 is based on
+	 * unhalted CPU cycles, so if Turbo-Mode is enabled the CPU cycles
+	 * might run way faster than expected and the NMI fires in a
+	 * smaller period than the one deduced from the nominal CPU
+	 * frequency. Depending on the Turbo-Mode factor this might be fast
+	 * enough to get the NMI period smaller than the hrtimer watchdog
+	 * period and trigger false positives.
+	 *
+	 * The sample threshold is used to check in the NMI handler whether
+	 * the minimum time between two NMI samples has elapsed. That
+	 * prevents false positives.
+	 *
+	 * Set this to 4/5 of the actual watchdog threshold period so the
+	 * hrtimer is guaranteed to fire at least once within the real
+	 * watchdog threshold.
+	 */
+	watchdog_hrtimer_sample_threshold = period * 2;
+}
+
+static bool watchdog_check_timestamp(void)
+{
+	ktime_t delta, now = ktime_get_mono_fast_ns();
+
+	delta = now - __this_cpu_read(last_timestamp);
+	if (delta < watchdog_hrtimer_sample_threshold)
+		return false;
+	__this_cpu_write(last_timestamp, now);
+	return true;
+}
+#else
+static inline bool watchdog_check_timestamp(void)
+{
+	return true;
+}
+#endif
+
+
 static struct perf_event_attr wd_hw_attr = {
 	.type		= PERF_TYPE_HARDWARE,
 	.config		= PERF_COUNT_HW_CPU_CYCLES,
@@ -94,6 +142,9 @@ static void watchdog_overflow_callback(s
 		return;
 	}
 
+	if (!watchdog_check_timestamp())
+		return;
+
 	/* check for a hardlockup
 	 * This is done by making sure our timer interrupt
 	 * is incrementing.  The timer interrupt should have
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -806,6 +806,9 @@ config HARDLOCKUP_DETECTOR
 	depends on LOCKUP_DETECTOR && !HAVE_NMI_WATCHDOG
 	depends on PERF_EVENTS && HAVE_PERF_EVENTS_NMI
 
+config HARDLOCKUP_CHECK_TIMESTAMP
+	bool
+
 config BOOTPARAM_HARDLOCKUP_PANIC
 	bool "Panic (Reboot) On Hard Lockups"
 	depends on HARDLOCKUP_DETECTOR
Thomas Gleixner June 22, 2017, 3:33 p.m. UTC | #7
On Wed, 21 Jun 2017, Thomas Gleixner wrote:

> On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
> > We now have more and more systems where the Turbo range is wide enough
> > that the NMI watchdog expires faster than the soft watchdog timer that
> > updates the interrupt tick the NMI watchdog relies on.
> > 
> > This problem was originally added by commit 58687acba592
> > ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> > Previously the NMI watchdog would always check jiffies, which were
> > ticking fast enough. But now the backing is quite slow so the expire
> > time becomes more sensitive.
> 
> And slapping a factor 3 on the NMI period is the wrong answer to the
> problem. The simple solution would be to increase the hrtimer frequency,
> but that's not really desired either.

Thinking a bit more about it. Increasing the hrtimer frequency and
maintaining the current frequency of softlockup_watchdog wakeups, would be
probably the most trivial workaround for now.

Thanks,

	tglx
Don Zickus June 22, 2017, 3:44 p.m. UTC | #8
On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
> > We now have more and more systems where the Turbo range is wide enough
> > that the NMI watchdog expires faster than the soft watchdog timer that
> > updates the interrupt tick the NMI watchdog relies on.
> > 
> > This problem was originally added by commit 58687acba592
> > ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> > Previously the NMI watchdog would always check jiffies, which were
> > ticking fast enough. But now the backing is quite slow so the expire
> > time becomes more sensitive.
> 
> And slapping a factor 3 on the NMI period is the wrong answer to the
> problem. The simple solution would be to increase the hrtimer frequency,
> but that's not really desired either.
> 
> Find an untested patch below, which should cure the issue.

A simple low pass filter.  It compiles. :-) I don't think I have knowledge
to test it.  Kan?

Cheers,
Don

> 
> Thanks,
> 
> 	tglx
> 	
> 8<---------------
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -96,6 +96,7 @@ config X86
>  	select GENERIC_STRNCPY_FROM_USER
>  	select GENERIC_STRNLEN_USER
>  	select GENERIC_TIME_VSYSCALL
> +	select HARDLOCKUP_CHECK_TIMESTAMP	if X86_64
>  	select HAVE_ACPI_APEI			if ACPI
>  	select HAVE_ACPI_APEI_NMI		if ACPI
>  	select HAVE_ALIGNED_STRUCT_PAGE		if SLUB
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -155,6 +155,14 @@ extern int sysctl_hardlockup_all_cpu_bac
>  #define sysctl_softlockup_all_cpu_backtrace 0
>  #define sysctl_hardlockup_all_cpu_backtrace 0
>  #endif
> +
> +#if defined(CONFIG_HARDLOCKUP_CHECK_TIMESTAMP) && \
> +    defined(CONFIG_HARDLOCKUP_DETECTOR)
> +void watchdog_update_hrtimer_threshold(u64 period);
> +#else
> +static inline void watchdog_update_hrtimer_threshold(u64 period) { }
> +#endif
> +
>  extern bool is_hardlockup(void);
>  struct ctl_table;
>  extern int proc_watchdog(struct ctl_table *, int ,
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -161,6 +161,7 @@ static void set_sample_period(void)
>  	 * hardlockup detector generates a warning
>  	 */
>  	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
> +	watchdog_update_hrtimer_threshold(sample_period);
>  }
>  
>  /* Commands for resetting the watchdog */
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -70,6 +70,54 @@ void touch_nmi_watchdog(void)
>  }
>  EXPORT_SYMBOL(touch_nmi_watchdog);
>  
> +#ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
> +static DEFINE_PER_CPU(ktime_t, last_timestamp);
> +static ktime_t watchdog_hrtimer_sample_threshold __read_mostly;
> +
> +void watchdog_update_hrtimer_threshold(u64 period)
> +{
> +	/*
> +	 * The hrtimer runs with a period of (watchdog_threshold * 2) / 5
> +	 *
> +	 * So it runs effectively with 2.5 times the rate of the NMI
> +	 * watchdog. That means the hrtimer should fire 2-3 times before
> +	 * the NMI watchdog expires. The NMI watchdog on x86 is based on
> +	 * unhalted CPU cycles, so if Turbo-Mode is enabled the CPU cycles
> +	 * might run way faster than expected and the NMI fires in a
> +	 * smaller period than the one deduced from the nominal CPU
> +	 * frequency. Depending on the Turbo-Mode factor this might be fast
> +	 * enough to get the NMI period smaller than the hrtimer watchdog
> +	 * period and trigger false positives.
> +	 *
> +	 * The sample threshold is used to check in the NMI handler whether
> +	 * the minimum time between two NMI samples has elapsed. That
> +	 * prevents false positives.
> +	 *
> +	 * Set this to 4/5 of the actual watchdog threshold period so the
> +	 * hrtimer is guaranteed to fire at least once within the real
> +	 * watchdog threshold.
> +	 */
> +	watchdog_hrtimer_sample_threshold = period * 2;
> +}
> +
> +static bool watchdog_check_timestamp(void)
> +{
> +	ktime_t delta, now = ktime_get_mono_fast_ns();
> +
> +	delta = now - __this_cpu_read(last_timestamp);
> +	if (delta < watchdog_hrtimer_sample_threshold)
> +		return false;
> +	__this_cpu_write(last_timestamp, now);
> +	return true;
> +}
> +#else
> +static inline bool watchdog_check_timestamp(void)
> +{
> +	return true;
> +}
> +#endif
> +
> +
>  static struct perf_event_attr wd_hw_attr = {
>  	.type		= PERF_TYPE_HARDWARE,
>  	.config		= PERF_COUNT_HW_CPU_CYCLES,
> @@ -94,6 +142,9 @@ static void watchdog_overflow_callback(s
>  		return;
>  	}
>  
> +	if (!watchdog_check_timestamp())
> +		return;
> +
>  	/* check for a hardlockup
>  	 * This is done by making sure our timer interrupt
>  	 * is incrementing.  The timer interrupt should have
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -806,6 +806,9 @@ config HARDLOCKUP_DETECTOR
>  	depends on LOCKUP_DETECTOR && !HAVE_NMI_WATCHDOG
>  	depends on PERF_EVENTS && HAVE_PERF_EVENTS_NMI
>  
> +config HARDLOCKUP_CHECK_TIMESTAMP
> +	bool
> +
>  config BOOTPARAM_HARDLOCKUP_PANIC
>  	bool "Panic (Reboot) On Hard Lockups"
>  	depends on HARDLOCKUP_DETECTOR
Liang, Kan June 22, 2017, 3:48 p.m. UTC | #9
> Subject: Re: [PATCH V2] kernel/watchdog: fix spurious hard lockups
> 
> On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
> > > We now have more and more systems where the Turbo range is wide
> > > enough that the NMI watchdog expires faster than the soft watchdog
> > > timer that updates the interrupt tick the NMI watchdog relies on.
> > >
> > > This problem was originally added by commit 58687acba592
> > > ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> > > Previously the NMI watchdog would always check jiffies, which were
> > > ticking fast enough. But now the backing is quite slow so the expire
> > > time becomes more sensitive.
> >
> > And slapping a factor 3 on the NMI period is the wrong answer to the
> > problem. The simple solution would be to increase the hrtimer
> > frequency, but that's not really desired either.
> >
> > Find an untested patch below, which should cure the issue.
> 
> A simple low pass filter.  It compiles. :-) I don't think I have knowledge to test
> it.  Kan?
> 

Yes, we are doing the test.

Thanks,
Kan

> Cheers,
> Don
> 
> >
> > Thanks,
> >
> > 	tglx
> >
> > 8<---------------
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -96,6 +96,7 @@ config X86
> >  	select GENERIC_STRNCPY_FROM_USER
> >  	select GENERIC_STRNLEN_USER
> >  	select GENERIC_TIME_VSYSCALL
> > +	select HARDLOCKUP_CHECK_TIMESTAMP	if X86_64
> >  	select HAVE_ACPI_APEI			if ACPI
> >  	select HAVE_ACPI_APEI_NMI		if ACPI
> >  	select HAVE_ALIGNED_STRUCT_PAGE		if SLUB
> > --- a/include/linux/nmi.h
> > +++ b/include/linux/nmi.h
> > @@ -155,6 +155,14 @@ extern int sysctl_hardlockup_all_cpu_bac  #define
> > sysctl_softlockup_all_cpu_backtrace 0  #define
> > sysctl_hardlockup_all_cpu_backtrace 0  #endif
> > +
> > +#if defined(CONFIG_HARDLOCKUP_CHECK_TIMESTAMP) && \
> > +    defined(CONFIG_HARDLOCKUP_DETECTOR)
> > +void watchdog_update_hrtimer_threshold(u64 period); #else static
> > +inline void watchdog_update_hrtimer_threshold(u64 period) { } #endif
> > +
> >  extern bool is_hardlockup(void);
> >  struct ctl_table;
> >  extern int proc_watchdog(struct ctl_table *, int ,
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -161,6 +161,7 @@ static void set_sample_period(void)
> >  	 * hardlockup detector generates a warning
> >  	 */
> >  	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
> > +	watchdog_update_hrtimer_threshold(sample_period);
> >  }
> >
> >  /* Commands for resetting the watchdog */
> > --- a/kernel/watchdog_hld.c
> > +++ b/kernel/watchdog_hld.c
> > @@ -70,6 +70,54 @@ void touch_nmi_watchdog(void)  }
> > EXPORT_SYMBOL(touch_nmi_watchdog);
> >
> > +#ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP static
> > +DEFINE_PER_CPU(ktime_t, last_timestamp); static ktime_t
> > +watchdog_hrtimer_sample_threshold __read_mostly;
> > +
> > +void watchdog_update_hrtimer_threshold(u64 period) {
> > +	/*
> > +	 * The hrtimer runs with a period of (watchdog_threshold * 2) / 5
> > +	 *
> > +	 * So it runs effectively with 2.5 times the rate of the NMI
> > +	 * watchdog. That means the hrtimer should fire 2-3 times before
> > +	 * the NMI watchdog expires. The NMI watchdog on x86 is based on
> > +	 * unhalted CPU cycles, so if Turbo-Mode is enabled the CPU cycles
> > +	 * might run way faster than expected and the NMI fires in a
> > +	 * smaller period than the one deduced from the nominal CPU
> > +	 * frequency. Depending on the Turbo-Mode factor this might be fast
> > +	 * enough to get the NMI period smaller than the hrtimer watchdog
> > +	 * period and trigger false positives.
> > +	 *
> > +	 * The sample threshold is used to check in the NMI handler whether
> > +	 * the minimum time between two NMI samples has elapsed. That
> > +	 * prevents false positives.
> > +	 *
> > +	 * Set this to 4/5 of the actual watchdog threshold period so the
> > +	 * hrtimer is guaranteed to fire at least once within the real
> > +	 * watchdog threshold.
> > +	 */
> > +	watchdog_hrtimer_sample_threshold = period * 2; }
> > +
> > +static bool watchdog_check_timestamp(void) {
> > +	ktime_t delta, now = ktime_get_mono_fast_ns();
> > +
> > +	delta = now - __this_cpu_read(last_timestamp);
> > +	if (delta < watchdog_hrtimer_sample_threshold)
> > +		return false;
> > +	__this_cpu_write(last_timestamp, now);
> > +	return true;
> > +}
> > +#else
> > +static inline bool watchdog_check_timestamp(void) {
> > +	return true;
> > +}
> > +#endif
> > +
> > +
> >  static struct perf_event_attr wd_hw_attr = {
> >  	.type		= PERF_TYPE_HARDWARE,
> >  	.config		= PERF_COUNT_HW_CPU_CYCLES,
> > @@ -94,6 +142,9 @@ static void watchdog_overflow_callback(s
> >  		return;
> >  	}
> >
> > +	if (!watchdog_check_timestamp())
> > +		return;
> > +
> >  	/* check for a hardlockup
> >  	 * This is done by making sure our timer interrupt
> >  	 * is incrementing.  The timer interrupt should have
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -806,6 +806,9 @@ config HARDLOCKUP_DETECTOR
> >  	depends on LOCKUP_DETECTOR && !HAVE_NMI_WATCHDOG
> >  	depends on PERF_EVENTS && HAVE_PERF_EVENTS_NMI
> >
> > +config HARDLOCKUP_CHECK_TIMESTAMP
> > +	bool
> > +
> >  config BOOTPARAM_HARDLOCKUP_PANIC
> >  	bool "Panic (Reboot) On Hard Lockups"
> >  	depends on HARDLOCKUP_DETECTOR
Thomas Gleixner June 23, 2017, 8:01 a.m. UTC | #10
On Thu, 22 Jun 2017, Don Zickus wrote:
> On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
> > > We now have more and more systems where the Turbo range is wide enough
> > > that the NMI watchdog expires faster than the soft watchdog timer that
> > > updates the interrupt tick the NMI watchdog relies on.
> > > 
> > > This problem was originally added by commit 58687acba592
> > > ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> > > Previously the NMI watchdog would always check jiffies, which were
> > > ticking fast enough. But now the backing is quite slow so the expire
> > > time becomes more sensitive.
> > 
> > And slapping a factor 3 on the NMI period is the wrong answer to the
> > problem. The simple solution would be to increase the hrtimer frequency,
> > but that's not really desired either.
> > 
> > Find an untested patch below, which should cure the issue.
> 
> A simple low pass filter.  It compiles. :-) I don't think I have knowledge
> to test it.  Kan?

Yes, and it has an interesting twist. It's only working once we have
switched to TSC as clocksource.

As long as jiffies are the clocksource, this will miserably fail because
when the hrtimer interrupt is not delivered jiffies wont be incremented
either and the NMI will say: Oh. not enough time elapsed. Lather, rinse and
repeat.

One simple way to fix this is with the delta patch below.

Thanks,

	tglx

8<--------------------------
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -72,6 +72,7 @@ EXPORT_SYMBOL(touch_nmi_watchdog);
 
 #ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
 static DEFINE_PER_CPU(ktime_t, last_timestamp);
+static DEFINE_PER_CPU(unsigned int, nmi_rearmed);
 static ktime_t watchdog_hrtimer_sample_threshold __read_mostly;
 
 void watchdog_update_hrtimer_threshold(u64 period)
@@ -105,8 +106,11 @@ static bool watchdog_check_timestamp(voi
 	ktime_t delta, now = ktime_get_mono_fast_ns();
 
 	delta = now - __this_cpu_read(last_timestamp);
-	if (delta < watchdog_hrtimer_sample_threshold)
-		return false;
+	if (delta < watchdog_hrtimer_sample_threshold) {
+		if (__this_cpu_inc_return(nmi_rearmed) < 10)
+			return false;
+	}
+	__this_cpu_write(nmi_rearmed, 0);
 	__this_cpu_write(last_timestamp, now);
 	return true;
 }
Don Zickus June 23, 2017, 4:29 p.m. UTC | #11
On Fri, Jun 23, 2017 at 10:01:55AM +0200, Thomas Gleixner wrote:
> On Thu, 22 Jun 2017, Don Zickus wrote:
> > On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > > On Wed, 21 Jun 2017, kan.liang@intel.com wrote:
> > > > We now have more and more systems where the Turbo range is wide enough
> > > > that the NMI watchdog expires faster than the soft watchdog timer that
> > > > updates the interrupt tick the NMI watchdog relies on.
> > > > 
> > > > This problem was originally added by commit 58687acba592
> > > > ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> > > > Previously the NMI watchdog would always check jiffies, which were
> > > > ticking fast enough. But now the backing is quite slow so the expire
> > > > time becomes more sensitive.
> > > 
> > > And slapping a factor 3 on the NMI period is the wrong answer to the
> > > problem. The simple solution would be to increase the hrtimer frequency,
> > > but that's not really desired either.
> > > 
> > > Find an untested patch below, which should cure the issue.
> > 
> > A simple low pass filter.  It compiles. :-) I don't think I have knowledge
> > to test it.  Kan?
> 
> Yes, and it has an interesting twist. It's only working once we have
> switched to TSC as clocksource.
> 
> As long as jiffies are the clocksource, this will miserably fail because
> when the hrtimer interrupt is not delivered jiffies wont be incremented
> either and the NMI will say: Oh. not enough time elapsed. Lather, rinse and
> repeat.
> 
> One simple way to fix this is with the delta patch below.

Hmm, all this work for a temp fix.  Kan, how much longer until the real fix
of having perf count the right cycles?

Cheers,
Don

> 
> Thanks,
> 
> 	tglx
> 
> 8<--------------------------
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -72,6 +72,7 @@ EXPORT_SYMBOL(touch_nmi_watchdog);
>  
>  #ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
>  static DEFINE_PER_CPU(ktime_t, last_timestamp);
> +static DEFINE_PER_CPU(unsigned int, nmi_rearmed);
>  static ktime_t watchdog_hrtimer_sample_threshold __read_mostly;
>  
>  void watchdog_update_hrtimer_threshold(u64 period)
> @@ -105,8 +106,11 @@ static bool watchdog_check_timestamp(voi
>  	ktime_t delta, now = ktime_get_mono_fast_ns();
>  
>  	delta = now - __this_cpu_read(last_timestamp);
> -	if (delta < watchdog_hrtimer_sample_threshold)
> -		return false;
> +	if (delta < watchdog_hrtimer_sample_threshold) {
> +		if (__this_cpu_inc_return(nmi_rearmed) < 10)
> +			return false;
> +	}
> +	__this_cpu_write(nmi_rearmed, 0);
>  	__this_cpu_write(last_timestamp, now);
>  	return true;
>  }
> 
>
Thomas Gleixner June 23, 2017, 9:50 p.m. UTC | #12
On Fri, 23 Jun 2017, Don Zickus wrote:
> Hmm, all this work for a temp fix.  Kan, how much longer until the real fix
> of having perf count the right cycles?

Quite a while. The approach is wilfully breaking the user space ABI, which
is not going to happen.

And there is a simpler solution as well, as I said here:

    http://lkml.kernel.org/r/alpine.DEB.2.20.1706221730520.1885@nanos

Thanks,

	tglx
Don Zickus June 26, 2017, 8:19 p.m. UTC | #13
On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> On Fri, 23 Jun 2017, Don Zickus wrote:
> > Hmm, all this work for a temp fix.  Kan, how much longer until the real fix
> > of having perf count the right cycles?
> 
> Quite a while. The approach is wilfully breaking the user space ABI, which
> is not going to happen.
> 
> And there is a simpler solution as well, as I said here:
> 
>     http://lkml.kernel.org/r/alpine.DEB.2.20.1706221730520.1885@nanos

Hi Thomas,

So, you are saying instead of slowing down the perf counter, speed up the
hrtimer to sample more frequently like so:

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 03e0b69..8ff49de 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -160,7 +160,7 @@ static void set_sample_period(void)
 	 * and hard thresholds) to increment before the
 	 * hardlockup detector generates a warning
 	 */
-	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
+	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 10);
 }
 
 /* Commands for resetting the watchdog */


That is another way of doing it.  It just hits all the arches.  It does seem
cleaner as the watchdog_thresh value still retains it correct meaning.  Are
the laptop folks going to yell at me some more for waking their systems up
more? :-)

Cheers,
Don
Thomas Gleixner June 26, 2017, 8:30 p.m. UTC | #14
On Mon, 26 Jun 2017, Don Zickus wrote:
> On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > Hmm, all this work for a temp fix.  Kan, how much longer until the real fix
> > > of having perf count the right cycles?
> > 
> > Quite a while. The approach is wilfully breaking the user space ABI, which
> > is not going to happen.
> > 
> > And there is a simpler solution as well, as I said here:
> > 
> >     http://lkml.kernel.org/r/alpine.DEB.2.20.1706221730520.1885@nanos
> 
> Hi Thomas,
> 
> So, you are saying instead of slowing down the perf counter, speed up the
> hrtimer to sample more frequently like so:
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 03e0b69..8ff49de 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -160,7 +160,7 @@ static void set_sample_period(void)
>  	 * and hard thresholds) to increment before the
>  	 * hardlockup detector generates a warning
>  	 */
> -	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
> +	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 10);
>  }
> 
>  /* Commands for resetting the watchdog */
> 
> 
> That is another way of doing it.  It just hits all the arches.  It does seem
> cleaner as the watchdog_thresh value still retains it correct meaning.  Are
> the laptop folks going to yell at me some more for waking their systems up
> more? :-)

Yes, that's bound to happen. You might make them less angry if you wake the
softlockup thread only on every second hrtimer expiry, i.e. keeping the
current wakeup rate.  But I can't promise that this will significantly
lower their wrath. :)

Thanks,

	tglx
Don Zickus June 27, 2017, 8:12 p.m. UTC | #15
On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > Hmm, all this work for a temp fix.  Kan, how much longer until the real fix
> > > of having perf count the right cycles?
> > 
> > Quite a while. The approach is wilfully breaking the user space ABI, which
> > is not going to happen.
> > 
> > And there is a simpler solution as well, as I said here:
> > 
> >     http://lkml.kernel.org/r/alpine.DEB.2.20.1706221730520.1885@nanos
> 
> Hi Thomas,
> 
> So, you are saying instead of slowing down the perf counter, speed up the
> hrtimer to sample more frequently like so:
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 03e0b69..8ff49de 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -160,7 +160,7 @@ static void set_sample_period(void)
>  	 * and hard thresholds) to increment before the
>  	 * hardlockup detector generates a warning
>  	 */
> -	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
> +	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 10);
>  }

Hi Kan,

Will the above patch work for you?

Cheers,
Don
Liang, Kan June 27, 2017, 8:49 p.m. UTC | #16
> On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > Hmm, all this work for a temp fix.  Kan, how much longer until the
> > > > real fix of having perf count the right cycles?
> > >
> > > Quite a while. The approach is wilfully breaking the user space ABI,
> > > which is not going to happen.
> > >
> > > And there is a simpler solution as well, as I said here:
> > >
> > >
> > > http://lkml.kernel.org/r/alpine.DEB.2.20.1706221730520.1885@nanos
> >
> > Hi Thomas,
> >
> > So, you are saying instead of slowing down the perf counter, speed up
> > the hrtimer to sample more frequently like so:
> >
> > diff --git a/kernel/watchdog.c b/kernel/watchdog.c index
> > 03e0b69..8ff49de 100644
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -160,7 +160,7 @@ static void set_sample_period(void)
> >  	 * and hard thresholds) to increment before the
> >  	 * hardlockup detector generates a warning
> >  	 */
> > -	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
> > +	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC /
> 10);
> >  }
> 
> Hi Kan,
> 
> Will the above patch work for you?
> 

I haven't heard back any test result yet.

The above patch looks good to me.
But I'm not sure if /10 is enough. We may need /15.
Anyway, I think we will test /10 first.

Which workaround do you prefer, the above one or the one checking timestamp?


Thanks,
Kan
Don Zickus June 27, 2017, 9:09 p.m. UTC | #17
On Tue, Jun 27, 2017 at 08:49:19PM +0000, Liang, Kan wrote:
> 
> > On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > > Hmm, all this work for a temp fix.  Kan, how much longer until the
> > > > > real fix of having perf count the right cycles?
> > > >
> > > > Quite a while. The approach is wilfully breaking the user space ABI,
> > > > which is not going to happen.
> > > >
> > > > And there is a simpler solution as well, as I said here:
> > > >
> > > >
> > > > http://lkml.kernel.org/r/alpine.DEB.2.20.1706221730520.1885@nanos
> > >
> > > Hi Thomas,
> > >
> > > So, you are saying instead of slowing down the perf counter, speed up
> > > the hrtimer to sample more frequently like so:
> > >
> > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c index
> > > 03e0b69..8ff49de 100644
> > > --- a/kernel/watchdog.c
> > > +++ b/kernel/watchdog.c
> > > @@ -160,7 +160,7 @@ static void set_sample_period(void)
> > >  	 * and hard thresholds) to increment before the
> > >  	 * hardlockup detector generates a warning
> > >  	 */
> > > -	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
> > > +	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC /
> > 10);
> > >  }
> > 
> > Hi Kan,
> > 
> > Will the above patch work for you?
> > 
> 
> I haven't heard back any test result yet.
> 
> The above patch looks good to me.
> But I'm not sure if /10 is enough. We may need /15.
> Anyway, I think we will test /10 first.
> 
> Which workaround do you prefer, the above one or the one checking timestamp?

Let's go with this one, it is simpler.

Cheers,
Don
Andi Kleen June 27, 2017, 11:48 p.m. UTC | #18
> I haven't heard back any test result yet.
> 
> The above patch looks good to me.

This needs performance testing.  It may slow down performance or latency sensitive workloads.

> Which workaround do you prefer, the above one or the one checking timestamp?

I prefer the earlier patch, it has far less risk of performance issues.

-Andi
Don Zickus June 28, 2017, 7 p.m. UTC | #19
On Tue, Jun 27, 2017 at 04:48:22PM -0700, Andi Kleen wrote:
> > I haven't heard back any test result yet.
> > 
> > The above patch looks good to me.
> 
> This needs performance testing.  It may slow down performance or latency sensitive workloads.

More motivation to work through the issues with the proposed real fix? :-)

> 
> > Which workaround do you prefer, the above one or the one checking timestamp?
> 
> I prefer the earlier patch, it has far less risk of performance issues.

But now you are slowing down the nmi_watchdog so much that the
watchdog_thresh hold becomes meaningless, no? (granted the turbo-mode blows
it out of the water too)  So now folks who depend on the 10/5/1/whatever second
reliability lose that.  I think that might be unfair too.

The hrtimer increase maintains that and just adds a few more
interrupts/second.

Cheers,
Don
Andi Kleen June 28, 2017, 8:14 p.m. UTC | #20
On Wed, Jun 28, 2017 at 03:00:08PM -0400, Don Zickus wrote:
> On Tue, Jun 27, 2017 at 04:48:22PM -0700, Andi Kleen wrote:
> > > I haven't heard back any test result yet.
> > > 
> > > The above patch looks good to me.
> > 
> > This needs performance testing.  It may slow down performance or latency sensitive workloads.
> 
> More motivation to work through the issues with the proposed real fix? :-)
> 
> > 
> > > Which workaround do you prefer, the above one or the one checking timestamp?
> > 
> > I prefer the earlier patch, it has far less risk of performance issues.
> 
> But now you are slowing down the nmi_watchdog so much that the
> watchdog_thresh hold becomes meaningless, no? (granted the turbo-mode blows
> it out of the water too)  So now folks who depend on the 10/5/1/whatever second
> reliability lose that.  I think that might be unfair too.

What do you mean with reliability? If you need guarantees of resetting
you always need another separate hardware watchdog (like the TCO watchdog),
as the CPU could be hung up enough that even the NMI watchdog is not 
functional anymore.

So relying solely on the NMI watchdog doesn't make any sense.

It can be a useful debugging tool for a specific class of bugs: 
when kernel software is looping forever.

But if that happens does it really matter how many iterations the
loop does before it is stopped?

Even the current timeout is essentially eternity in CPU time, and 3x
eternity is still eternity.

> The hrtimer increase maintains that and just adds a few more
> interrupts/second.

Interruptions are a big deal for many people.

-Andi
Don Zickus June 29, 2017, 3:44 p.m. UTC | #21
On Wed, Jun 28, 2017 at 01:14:04PM -0700, Andi Kleen wrote:
> It can be a useful debugging tool for a specific class of bugs: 
> when kernel software is looping forever.
> 
> But if that happens does it really matter how many iterations the
> loop does before it is stopped?
> 
> Even the current timeout is essentially eternity in CPU time, and 3x
> eternity is still eternity.

That isn't true.  We have customers that test the accuracy and file bugs.  I
had to write a RHEL whitepaper a number of years ago explaining why the
softlockup took 62 seconds to fire instead of 60.

Customers were changing the watchdog_thresh when the system slowed down to
purposely trigger a panic (which exposed race conditions leading Uli to
redesign the sysctl interface).

I don't feel like explaining to our customers how we regressed on our
watchdog accuracy.  It is exhausting, especially if it is a debug feature.

> 
> > The hrtimer increase maintains that and just adds a few more
> > interrupts/second.
> 
> Interruptions are a big deal for many people.

Yes, and we probably have customers that will complain on that too.

Either solution is a lose/lose.  And yes, we will probably get bit by the
false NMI problems on those Intel boxes.  This is why I was preferring a
real solution.

The question is, if the real solution is going to take a while, what is the
least sucky solution for now?  Or how do we minimize it to a specific class
of Intel boxes.

Cheers,
Don
Andi Kleen June 29, 2017, 4:12 p.m. UTC | #22
On Thu, Jun 29, 2017 at 11:44:06AM -0400, Don Zickus wrote:
> On Wed, Jun 28, 2017 at 01:14:04PM -0700, Andi Kleen wrote:
> > It can be a useful debugging tool for a specific class of bugs: 
> > when kernel software is looping forever.
> > 
> > But if that happens does it really matter how many iterations the
> > loop does before it is stopped?
> > 
> > Even the current timeout is essentially eternity in CPU time, and 3x
> > eternity is still eternity.
> 
> That isn't true.  We have customers that test the accuracy and file bugs.  I
> had to write a RHEL whitepaper a number of years ago explaining why the
> softlockup took 62 seconds to fire instead of 60.

Ok that makes sense.

It seems like a broken QA test from your customer, not a real issue, 
but yes explaining and documenting that can be difficult.

> 
> The question is, if the real solution is going to take a while, what is the
> least sucky solution for now?  Or how do we minimize it to a specific class
> of Intel boxes.

You can't minimize it because there's no forward looking solution
to detect a large turbo range, and also whatever issue you have in the
generic case would apply to them too.

Thomas' patch to modulate the frequency seemed reasonable to me.
It made the NMI watchdog depend on accurate ktime, but that's probably ok.

-Andi
Don Zickus June 29, 2017, 4:26 p.m. UTC | #23
On Thu, Jun 29, 2017 at 09:12:20AM -0700, Andi Kleen wrote:
> On Thu, Jun 29, 2017 at 11:44:06AM -0400, Don Zickus wrote:
> > On Wed, Jun 28, 2017 at 01:14:04PM -0700, Andi Kleen wrote:
> > > It can be a useful debugging tool for a specific class of bugs: 
> > > when kernel software is looping forever.
> > > 
> > > But if that happens does it really matter how many iterations the
> > > loop does before it is stopped?
> > > 
> > > Even the current timeout is essentially eternity in CPU time, and 3x
> > > eternity is still eternity.
> > 
> > That isn't true.  We have customers that test the accuracy and file bugs.  I
> > had to write a RHEL whitepaper a number of years ago explaining why the
> > softlockup took 62 seconds to fire instead of 60.
> 
> Ok that makes sense.
> 
> It seems like a broken QA test from your customer, not a real issue, 

Agreed.

> but yes explaining and documenting that can be difficult.

Yes.

> 
> > 
> > The question is, if the real solution is going to take a while, what is the
> > least sucky solution for now?  Or how do we minimize it to a specific class
> > of Intel boxes.
> 
> You can't minimize it because there's no forward looking solution
> to detect a large turbo range, and also whatever issue you have in the
> generic case would apply to them too.
> 
> Thomas' patch to modulate the frequency seemed reasonable to me.
> It made the NMI watchdog depend on accurate ktime, but that's probably ok.

Ok, did Kan finish testing this patch (with the small fix on top)?

Cheers,
Don
Andi Kleen June 29, 2017, 4:36 p.m. UTC | #24
> > Thomas' patch to modulate the frequency seemed reasonable to me.
> > It made the NMI watchdog depend on accurate ktime, but that's probably ok.
> 
> Ok, did Kan finish testing this patch (with the small fix on top)?

Kan doesn't have the specific hardware to test it. We've been waiting
for another team to do the testing. I'll ping them.

-Andi
Liang, Kan July 17, 2017, 1:24 a.m. UTC | #25
> On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > Hmm, all this work for a temp fix.  Kan, how much longer until the
> > > > real fix of having perf count the right cycles?
> > >
> > > Quite a while. The approach is wilfully breaking the user space ABI,
> > > which is not going to happen.
> > >
> > > And there is a simpler solution as well, as I said here:
> > >
> > >
> > > http://lkml.kernel.org/r/alpine.DEB.2.20.1706221730520.1885@nanos
> >
> > Hi Thomas,
> >
> > So, you are saying instead of slowing down the perf counter, speed up
> > the hrtimer to sample more frequently like so:
> >
> > diff --git a/kernel/watchdog.c b/kernel/watchdog.c index
> > 03e0b69..8ff49de 100644
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -160,7 +160,7 @@ static void set_sample_period(void)
> >  	 * and hard thresholds) to increment before the
> >  	 * hardlockup detector generates a warning
> >  	 */
> > -	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
> > +	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC /
> 10);
> >  }
> 
> Hi Kan,
> 
> Will the above patch work for you?

Hi Don & Thomas,

Sorry for the late response. We just finished the tests for all proposed patches.

There are three proposed patches so far.
Patch 1: The patch as above which speed up the hrtimer.
Patch 2: Thomas's first proposal.
https://patchwork.kernel.org/patch/9803033/
https://patchwork.kernel.org/patch/9805903/
Patch 3: my original proposal which increase the NMI watchdog timeout by 3X
https://patchwork.kernel.org/patch/9802053/

According to our test, only patch 3 works well.
The other two patches will hang the system eventually.
For patch 1, the system hang after running our test case for ~1 hour.
For patch 2, the system hang in running the overnight test.
There is no error message shown when the system hang. So I don't know the
root cause yet.

BTW: We set 1 to watchdog_thresh when we did the test.
It's believed that can speed up the failure.

Thanks,
Kan
Thomas Gleixner July 17, 2017, 7:14 a.m. UTC | #26
On Mon, 17 Jul 2017, Liang, Kan wrote:
> There are three proposed patches so far.
> Patch 1: The patch as above which speed up the hrtimer.
> Patch 2: Thomas's first proposal.
> https://patchwork.kernel.org/patch/9803033/
> https://patchwork.kernel.org/patch/9805903/
> Patch 3: my original proposal which increase the NMI watchdog timeout by 3X
> https://patchwork.kernel.org/patch/9802053/
> 
> According to our test, only patch 3 works well.
> The other two patches will hang the system eventually.
> For patch 1, the system hang after running our test case for ~1 hour.
> For patch 2, the system hang in running the overnight test.
> There is no error message shown when the system hang. So I don't know the
> root cause yet.

That doesn't make sense. What's the exact test procedure?

> BTW: We set 1 to watchdog_thresh when we did the test.
> It's believed that can speed up the failure.

Believe is not really a technical measure....

Thanks,

	tglx
Liang, Kan July 17, 2017, 12:18 p.m. UTC | #27
> 
> On Mon, 17 Jul 2017, Liang, Kan wrote:
> > There are three proposed patches so far.
> > Patch 1: The patch as above which speed up the hrtimer.
> > Patch 2: Thomas's first proposal.
> > https://patchwork.kernel.org/patch/9803033/
> > https://patchwork.kernel.org/patch/9805903/
> > Patch 3: my original proposal which increase the NMI watchdog timeout
> > by 3X https://patchwork.kernel.org/patch/9802053/
> >
> > According to our test, only patch 3 works well.
> > The other two patches will hang the system eventually.
> > For patch 1, the system hang after running our test case for ~1 hour.
> > For patch 2, the system hang in running the overnight test.
> > There is no error message shown when the system hang. So I don't know
> > the root cause yet.
> 
> That doesn't make sense. What's the exact test procedure?

I don't know the exact test procedure. The test case is from our customer.
I only know that the test case makes calls into the x11 libs.

> 
> > BTW: We set 1 to watchdog_thresh when we did the test.
> > It's believed that can speed up the failure.
> 
> Believe is not really a technical measure....
> 

1 is a valid value for watchdog_thresh.
It was set through the standard proc interface.
/proc/sys/kernel/watchdog_thresh
It should not impacts the final test result.

Thanks,
Kan
Thomas Gleixner July 17, 2017, 1:13 p.m. UTC | #28
On Mon, 17 Jul 2017, Liang, Kan wrote:
> > That doesn't make sense. What's the exact test procedure?
> 
> I don't know the exact test procedure. The test case is from our customer.
> I only know that the test case makes calls into the x11 libs.

Sigh. This starts to be silly. You test something and have no idea what it
does?

> > > According to our test, only patch 3 works well.
> > > The other two patches will hang the system eventually.

Hang the system eventually? Does that mean that the system stops working
and the watchdog does not catch the problem?

> > > BTW: We set 1 to watchdog_thresh when we did the test.
> > > It's believed that can speed up the failure.
> > 
> > Believe is not really a technical measure....
> > 
> 
> 1 is a valid value for watchdog_thresh.
> It was set through the standard proc interface.
> /proc/sys/kernel/watchdog_thresh
> It should not impacts the final test result.

I know that 1 is a valid value and I know how that can be set. Still, it
does not help if you believe that setting the threshold to 1 can speed up
the failure. Either you know it for sure or not. You can believe in god or
whatever, but here we talk about facts.

Please start coming up with facts and proper explanations.

Thanks,

	tglx
Don Zickus July 17, 2017, 2:46 p.m. UTC | #29
On Mon, Jul 17, 2017 at 01:24:23AM +0000, Liang, Kan wrote:
> Hi Don & Thomas,
> 
> Sorry for the late response. We just finished the tests for all proposed patches.
> 
> There are three proposed patches so far.
> Patch 1: The patch as above which speed up the hrtimer.
> Patch 2: Thomas's first proposal.
> https://patchwork.kernel.org/patch/9803033/
> https://patchwork.kernel.org/patch/9805903/
> Patch 3: my original proposal which increase the NMI watchdog timeout by 3X
> https://patchwork.kernel.org/patch/9802053/
> 
> According to our test, only patch 3 works well.
> The other two patches will hang the system eventually.
> For patch 1, the system hang after running our test case for ~1 hour.
> For patch 2, the system hang in running the overnight test.
> There is no error message shown when the system hang. So I don't know the
> root cause yet.

Hi Kan,

Thanks for the feedback.  Odd that the different patches had different
results.  What is more odd to me is the hang.  I thought these were all
false lockups that prematurely panic'd and rebooted the box.

Is the machine configured to panic on hardlockup and reboot?  Perhaps kdump
is enabled to store the console log for review upon reboot?

It almost implies that a hardlockup did happen but isnt' being detected
until later??
> 
> BTW: We set 1 to watchdog_thresh when we did the test.
> It's believed that can speed up the failure.

Sure, you/they look for 1 second hangs instead of 10 second ones.  But with
patch3 it is more like 3 seconds'ish vs 30 second'ish.

As Thomas asked, I would also be interested in the way the test works.  The
hang doesn't make sense.

Cheers,
Don
Liang, Kan July 17, 2017, 2:46 p.m. UTC | #30
> On Mon, 17 Jul 2017, Liang, Kan wrote:
> > > That doesn't make sense. What's the exact test procedure?
> >
> > I don't know the exact test procedure. The test case is from our customer.
> > I only know that the test case makes calls into the x11 libs.
> 
> Sigh. This starts to be silly. You test something and have no idea what it does?

As I said, the test case is from our customer. They only share binaries with us.
Actually, it's more proper to call it test suite. It includes dozens of small test.
I just reproduced the issue and verified all the three patches in our lab.
Then I report it here as request immediately.
So I know little about the test case for now. 
I will share more when I learn more.
Sorry for that.

> 
> > > > According to our test, only patch 3 works well.
> > > > The other two patches will hang the system eventually.
> 
> Hang the system eventually? Does that mean that the system stops working
> and the watchdog does not catch the problem?


Right, the system stops working and the watchdog does not catch the problem.

> 
> > > > BTW: We set 1 to watchdog_thresh when we did the test.
> > > > It's believed that can speed up the failure.
> > >
> > > Believe is not really a technical measure....
> > >
> >
> > 1 is a valid value for watchdog_thresh.
> > It was set through the standard proc interface.
> > /proc/sys/kernel/watchdog_thresh
> > It should not impacts the final test result.
> 
> I know that 1 is a valid value and I know how that can be set. Still, it does not
> help if you believe that setting the threshold to 1 can speed up the failure.
> Either you know it for sure or not. You can believe in god or whatever, but
> here we talk about facts.

I personally didn't compare the difference between 1 and default 10 for this
test case.
Before we had the test case from customer, we developed other micro
which can reproduce the similar issue.
For that micro, 1 can speed up the failure.
(BTW: all the three patches can fix the issue which was reproduced by that micro.)

If you think it's meaningful to verify 10 as well, I can do the compare.

Thanks,
Kan
Thomas Gleixner July 17, 2017, 3 p.m. UTC | #31
On Mon, 17 Jul 2017, Liang, Kan wrote:
> > > > > According to our test, only patch 3 works well.
> > > > > The other two patches will hang the system eventually.
> > 
> > Hang the system eventually? Does that mean that the system stops working
> > and the watchdog does not catch the problem?
> 
> Right, the system stops working and the watchdog does not catch the problem.

What exactly means: "stops working" ? Just that you observe that the system
does not make progress or is not reacting to key strokes or what?

And what is the lockup, which is detected in the other case? Which code
path causes the lockup?

> I personally didn't compare the difference between 1 and default 10 for this
> test case.
> Before we had the test case from customer, we developed other micro
> which can reproduce the similar issue.
> For that micro, 1 can speed up the failure.
> (BTW: all the three patches can fix the issue which was reproduced by that micro.)
> 
> If you think it's meaningful to verify 10 as well, I can do the compare.

It might be worth a try, but unless we can either get hands on the test
scenario or at least have a proper explanation of what it is doing
including the expected outcome, i.e. what is the 'system is locked up'
failure which should be detected by the watchdog, I can't tell anything.

Thanks,

	tglx
Liang, Kan Aug. 15, 2017, 1:16 a.m. UTC | #32
> On Mon, Jul 17, 2017 at 01:24:23AM +0000, Liang, Kan wrote:
> > Hi Don & Thomas,
> >
> > Sorry for the late response. We just finished the tests for all proposed
> patches.
> >
> > There are three proposed patches so far.
> > Patch 1: The patch as above which speed up the hrtimer.
> > Patch 2: Thomas's first proposal.
> > https://patchwork.kernel.org/patch/9803033/
> > https://patchwork.kernel.org/patch/9805903/
> > Patch 3: my original proposal which increase the NMI watchdog timeout
> > by 3X https://patchwork.kernel.org/patch/9802053/
> >
> > According to our test, only patch 3 works well.
> > The other two patches will hang the system eventually.
> > For patch 1, the system hang after running our test case for ~1 hour.
> > For patch 2, the system hang in running the overnight test.
> > There is no error message shown when the system hang. So I don't know
> > the root cause yet.
> 
> Hi Kan,
> 
> Thanks for the feedback.  Odd that the different patches had different results.
> What is more odd to me is the hang.  I thought these were all false lockups
> that prematurely panic'd and rebooted the box.
> 
> Is the machine configured to panic on hardlockup and reboot?  Perhaps
> kdump is enabled to store the console log for review upon reboot?
> 
> It almost implies that a hardlockup did happen but isnt' being detected until
> later??
> >
> > BTW: We set 1 to watchdog_thresh when we did the test.
> > It's believed that can speed up the failure.
> 
> Sure, you/they look for 1 second hangs instead of 10 second ones.  But with
> patch3 it is more like 3 seconds'ish vs 30 second'ish.
> 
> As Thomas asked, I would also be interested in the way the test works.  The
> hang doesn't make sense.
> 

Hi Don and Thomas, 

Sorry for the late response.

We have confirmed that the hardlock with "speed up the hrtimer" patch is
actually another issue. Tim has already proposed a patch to fix it. 
Here is his patch. https://lkml.org/lkml/2017/8/14/1000

This patch which speed up the hrtimer (https://lkml.org/lkml/2017/6/26/685)
is decent to fix the spurious hard lockups.
Tested-by: Kan Liang <kan.liang@intel.com>

Please consider to merge it into both mainline and stable tree.

Thanks,
Kan
Linus Torvalds Aug. 15, 2017, 1:28 a.m. UTC | #33
On Mon, Aug 14, 2017 at 6:16 PM, Liang, Kan <kan.liang@intel.com> wrote:
>
> We have confirmed that the hardlock with "speed up the hrtimer" patch is
> actually another issue.

Good.

However:

> Tim has already proposed a patch to fix it.
> Here is his patch. https://lkml.org/lkml/2017/8/14/1000

Ugh. I hate that patch, and hope we can find a proper fix for it. But
I'll respond separately to it, and it's independent of the watchdog
issue.

                  Linus
Thomas Gleixner Aug. 15, 2017, 7:50 a.m. UTC | #34
On Tue, 15 Aug 2017, Liang, Kan wrote:
> This patch which speed up the hrtimer (https://lkml.org/lkml/2017/6/26/685)
> is decent to fix the spurious hard lockups.
> Tested-by: Kan Liang <kan.liang@intel.com>
> 
> Please consider to merge it into both mainline and stable tree.

Well, it 'fixes' the problem, but at the same time it imposes a higher
frequency of hrtimer interrupts and a higher frequency of soft lockup
thread wakeups. I'm not convinced that this is the right thing to do, even
if the patch itself is simple and small.

Did you run the patch which implements the low pass filter? Does it fix the
issue as well? It's slightly larger, but does not come with the downsides
of the real simple one. Appended for reference.

Thanks,

	tglx

8<---------------------
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -100,6 +100,7 @@ config X86
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select GENERIC_TIME_VSYSCALL
+	select HARDLOCKUP_CHECK_TIMESTAMP	if X86_64
 	select HAVE_ACPI_APEI			if ACPI
 	select HAVE_ACPI_APEI_NMI		if ACPI
 	select HAVE_ALIGNED_STRUCT_PAGE		if SLUB
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -168,6 +168,14 @@ extern int sysctl_hardlockup_all_cpu_bac
 #define sysctl_softlockup_all_cpu_backtrace 0
 #define sysctl_hardlockup_all_cpu_backtrace 0
 #endif
+
+#if defined(CONFIG_HARDLOCKUP_CHECK_TIMESTAMP) && \
+    defined(CONFIG_HARDLOCKUP_DETECTOR)
+void watchdog_update_hrtimer_threshold(u64 period);
+#else
+static inline void watchdog_update_hrtimer_threshold(u64 period) { }
+#endif
+
 extern bool is_hardlockup(void);
 struct ctl_table;
 extern int proc_watchdog(struct ctl_table *, int ,
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -240,6 +240,7 @@ static void set_sample_period(void)
 	 * hardlockup detector generates a warning
 	 */
 	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
+	watchdog_update_hrtimer_threshold(sample_period);
 }
 
 /* Commands for resetting the watchdog */
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -37,6 +37,62 @@ void arch_touch_nmi_watchdog(void)
 }
 EXPORT_SYMBOL(arch_touch_nmi_watchdog);
 
+#ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
+static DEFINE_PER_CPU(ktime_t, last_timestamp);
+static DEFINE_PER_CPU(unsigned int, nmi_rearmed);
+static ktime_t watchdog_hrtimer_sample_threshold __read_mostly;
+
+void watchdog_update_hrtimer_threshold(u64 period)
+{
+	/*
+	 * The hrtimer runs with a period of (watchdog_threshold * 2) / 5
+	 *
+	 * So it runs effectively with 2.5 times the rate of the NMI
+	 * watchdog. That means the hrtimer should fire 2-3 times before
+	 * the NMI watchdog expires. The NMI watchdog on x86 is based on
+	 * unhalted CPU cycles, so if Turbo-Mode is enabled the CPU cycles
+	 * might run way faster than expected and the NMI fires in a
+	 * smaller period than the one deduced from the nominal CPU
+	 * frequency. Depending on the Turbo-Mode factor this might be fast
+	 * enough to get the NMI period smaller than the hrtimer watchdog
+	 * period and trigger false positives.
+	 *
+	 * The sample threshold is used to check in the NMI handler whether
+	 * the minimum time between two NMI samples has elapsed. That
+	 * prevents false positives.
+	 *
+	 * Set this to 4/5 of the actual watchdog threshold period so the
+	 * hrtimer is guaranteed to fire at least once within the real
+	 * watchdog threshold.
+	 */
+	watchdog_hrtimer_sample_threshold = period * 2;
+}
+
+static bool watchdog_check_timestamp(void)
+{
+	ktime_t delta, now = ktime_get_mono_fast_ns();
+
+	delta = now - __this_cpu_read(last_timestamp);
+	if (delta < watchdog_hrtimer_sample_threshold) {
+		/*
+		 * If ktime is jiffies based, a stalled timer would prevent
+		 * jiffies from being incremented and the filter would look
+		 * at a stale timestamp and never trigger.
+		 */
+		if (__this_cpu_inc_return(nmi_rearmed) < 10)
+			return false;
+	}
+	__this_cpu_write(nmi_rearmed, 0);
+	__this_cpu_write(last_timestamp, now);
+	return true;
+}
+#else
+static inline bool watchdog_check_timestamp(void)
+{
+	return true;
+}
+#endif
+
 static struct perf_event_attr wd_hw_attr = {
 	.type		= PERF_TYPE_HARDWARE,
 	.config		= PERF_COUNT_HW_CPU_CYCLES,
@@ -61,6 +117,9 @@ static void watchdog_overflow_callback(s
 		return;
 	}
 
+	if (!watchdog_check_timestamp())
+		return;
+
 	/* check for a hardlockup
 	 * This is done by making sure our timer interrupt
 	 * is incrementing.  The timer interrupt should have
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -798,6 +798,13 @@ config HARDLOCKUP_DETECTOR_PERF
 	select SOFTLOCKUP_DETECTOR
 
 #
+# Enables a timestamp based low pass filter to compensate for perf based
+# hard lockup detection which runs too fast due to turbo modes.
+#
+config HARDLOCKUP_CHECK_TIMESTAMP
+	bool
+
+#
 # arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
 # lockup detector rather than the perf based detector.
 #
Liang, Kan Aug. 17, 2017, 3:45 p.m. UTC | #35
> On Tue, 15 Aug 2017, Liang, Kan wrote:
> > This patch which speed up the hrtimer
> (https://lkml.org/lkml/2017/6/26/685)
> > is decent to fix the spurious hard lockups.
> > Tested-by: Kan Liang <kan.liang@intel.com>
> >
> > Please consider to merge it into both mainline and stable tree.
> 
> Well, it 'fixes' the problem, but at the same time it imposes a higher
> frequency of hrtimer interrupts and a higher frequency of soft lockup
> thread wakeups. I'm not convinced that this is the right thing to do, even
> if the patch itself is simple and small.
> 
> Did you run the patch which implements the low pass filter? Does it fix the
> issue as well? It's slightly larger, but does not come with the downsides
> of the real simple one. Appended for reference.

I just finished the test for the low pass filter patch.
Yes, it also fixes the watchdog false positive issue.

Thanks,
Kan

> 
> Thanks,
> 
> 	tglx
> 
> 8<---------------------
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -100,6 +100,7 @@ config X86
>  	select GENERIC_STRNCPY_FROM_USER
>  	select GENERIC_STRNLEN_USER
>  	select GENERIC_TIME_VSYSCALL
> +	select HARDLOCKUP_CHECK_TIMESTAMP	if X86_64
>  	select HAVE_ACPI_APEI			if ACPI
>  	select HAVE_ACPI_APEI_NMI		if ACPI
>  	select HAVE_ALIGNED_STRUCT_PAGE		if SLUB
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -168,6 +168,14 @@ extern int sysctl_hardlockup_all_cpu_bac
>  #define sysctl_softlockup_all_cpu_backtrace 0
>  #define sysctl_hardlockup_all_cpu_backtrace 0
>  #endif
> +
> +#if defined(CONFIG_HARDLOCKUP_CHECK_TIMESTAMP) && \
> +    defined(CONFIG_HARDLOCKUP_DETECTOR)
> +void watchdog_update_hrtimer_threshold(u64 period);
> +#else
> +static inline void watchdog_update_hrtimer_threshold(u64 period) { }
> +#endif
> +
>  extern bool is_hardlockup(void);
>  struct ctl_table;
>  extern int proc_watchdog(struct ctl_table *, int ,
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -240,6 +240,7 @@ static void set_sample_period(void)
>  	 * hardlockup detector generates a warning
>  	 */
>  	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
> +	watchdog_update_hrtimer_threshold(sample_period);
>  }
> 
>  /* Commands for resetting the watchdog */
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -37,6 +37,62 @@ void arch_touch_nmi_watchdog(void)
>  }
>  EXPORT_SYMBOL(arch_touch_nmi_watchdog);
> 
> +#ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
> +static DEFINE_PER_CPU(ktime_t, last_timestamp);
> +static DEFINE_PER_CPU(unsigned int, nmi_rearmed);
> +static ktime_t watchdog_hrtimer_sample_threshold __read_mostly;
> +
> +void watchdog_update_hrtimer_threshold(u64 period)
> +{
> +	/*
> +	 * The hrtimer runs with a period of (watchdog_threshold * 2) / 5
> +	 *
> +	 * So it runs effectively with 2.5 times the rate of the NMI
> +	 * watchdog. That means the hrtimer should fire 2-3 times before
> +	 * the NMI watchdog expires. The NMI watchdog on x86 is based on
> +	 * unhalted CPU cycles, so if Turbo-Mode is enabled the CPU cycles
> +	 * might run way faster than expected and the NMI fires in a
> +	 * smaller period than the one deduced from the nominal CPU
> +	 * frequency. Depending on the Turbo-Mode factor this might be fast
> +	 * enough to get the NMI period smaller than the hrtimer watchdog
> +	 * period and trigger false positives.
> +	 *
> +	 * The sample threshold is used to check in the NMI handler whether
> +	 * the minimum time between two NMI samples has elapsed. That
> +	 * prevents false positives.
> +	 *
> +	 * Set this to 4/5 of the actual watchdog threshold period so the
> +	 * hrtimer is guaranteed to fire at least once within the real
> +	 * watchdog threshold.
> +	 */
> +	watchdog_hrtimer_sample_threshold = period * 2;
> +}
> +
> +static bool watchdog_check_timestamp(void)
> +{
> +	ktime_t delta, now = ktime_get_mono_fast_ns();
> +
> +	delta = now - __this_cpu_read(last_timestamp);
> +	if (delta < watchdog_hrtimer_sample_threshold) {
> +		/*
> +		 * If ktime is jiffies based, a stalled timer would prevent
> +		 * jiffies from being incremented and the filter would look
> +		 * at a stale timestamp and never trigger.
> +		 */
> +		if (__this_cpu_inc_return(nmi_rearmed) < 10)
> +			return false;
> +	}
> +	__this_cpu_write(nmi_rearmed, 0);
> +	__this_cpu_write(last_timestamp, now);
> +	return true;
> +}
> +#else
> +static inline bool watchdog_check_timestamp(void)
> +{
> +	return true;
> +}
> +#endif
> +
>  static struct perf_event_attr wd_hw_attr = {
>  	.type		= PERF_TYPE_HARDWARE,
>  	.config		= PERF_COUNT_HW_CPU_CYCLES,
> @@ -61,6 +117,9 @@ static void watchdog_overflow_callback(s
>  		return;
>  	}
> 
> +	if (!watchdog_check_timestamp())
> +		return;
> +
>  	/* check for a hardlockup
>  	 * This is done by making sure our timer interrupt
>  	 * is incrementing.  The timer interrupt should have
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -798,6 +798,13 @@ config HARDLOCKUP_DETECTOR_PERF
>  	select SOFTLOCKUP_DETECTOR
> 
>  #
> +# Enables a timestamp based low pass filter to compensate for perf based
> +# hard lockup detection which runs too fast due to turbo modes.
> +#
> +config HARDLOCKUP_CHECK_TIMESTAMP
> +	bool
> +
> +#
>  # arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their
> own hard
>  # lockup detector rather than the perf based detector.
>  #

Patch
diff mbox series

diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
index c73c9fb281e1..716d44e986f9 100644
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -20,9 +20,15 @@ 
 #include <linux/delay.h>
 
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
+/*
+ * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
+ * can tick faster than the measured CPU Frequency due to Turbo mode.
+ * That can lead to spurious timeouts.
+ * To workaround the issue, extending the period by 3 times.
+ */
 u64 hw_nmi_get_sample_period(int watchdog_thresh)
 {
-	return (u64)(cpu_khz) * 1000 * watchdog_thresh;
+	return (u64)(cpu_khz) * 1000 * watchdog_thresh * 3;
 }
 #endif