linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] watchdog:  touch_nmi_watchdog should only touch local cpu not every one
@ 2010-11-05  1:18 Don Zickus
  2010-11-05 13:51 ` Sergey Senozhatsky
  2010-11-07 22:09 ` Frederic Weisbecker
  0 siblings, 2 replies; 8+ messages in thread
From: Don Zickus @ 2010-11-05  1:18 UTC (permalink / raw)
  To: fweisbec, Peter Zijlstra
  Cc: Ingo Molnar, LKML, akpm, sergey.senozhatsky, Don Zickus

I ran into a scenario where while one cpu was stuck and should have panic'd
because of the NMI watchdog, it didn't.  The reason was another cpu was spewing
stack dumps on to the console.  Upon investigation, I noticed that when writing
to the console and also when dumping the stack, the watchdog is touched.

This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu
just spins forever.

This change causes the semantics of touch_nmi_watchdog to be changed slightly.
Previously, I accidentally changed the semantics and we noticed there was a
codepath in which touch_nmi_watchdog could be touched from a preemtible area.
That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled.  I believe
it was the acpi code.

My attempt here re-introduces the change to have the touch_nmi_watchdog() code
only touch the local cpu instead of all of the cpus.  But instead of using
__get_cpu_var(), I use the __raw_get_cpu_var() version.

This avoids the preemption problem.  However my reasoning wasn't because I was
trying to be lazy.  Instead I rationalized it as, well if preemption is enabled
then interrupts should be enabled to and the NMI watchdog will have no reason
to trigger.  So it won't matter if the wrong cpu is touched because the percpu
interrupt counters the NMI watchdog uses should still be incrementing.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 kernel/watchdog.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index dc8e168..dd0c140 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -141,6 +141,21 @@ void touch_all_softlockup_watchdogs(void)
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 void touch_nmi_watchdog(void)
 {
+	/*
+	 * Using __raw here because some code paths have
+	 * preemption enabled.  If preemption is enabled
+	 * then interrupts should be enabled too, in which
+	 * case we shouldn't have to worry about the watchdog
+	 * going off.
+	 */
+	__raw_get_cpu_var(watchdog_nmi_touch) = true;
+
+	touch_softlockup_watchdog();
+}
+EXPORT_SYMBOL(touch_nmi_watchdog);
+
+void touch_all_nmi_watchdogs(void)
+{
 	if (watchdog_enabled) {
 		unsigned cpu;
 
@@ -151,7 +166,7 @@ void touch_nmi_watchdog(void)
 	}
 	touch_softlockup_watchdog();
 }
-EXPORT_SYMBOL(touch_nmi_watchdog);
+EXPORT_SYMBOL(touch_all_nmi_watchdogs);
 
 #endif
 
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] watchdog:  touch_nmi_watchdog should only touch local cpu not every one
  2010-11-05  1:18 [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one Don Zickus
@ 2010-11-05 13:51 ` Sergey Senozhatsky
  2010-11-05 19:58   ` Andrew Morton
  2010-11-07 22:09 ` Frederic Weisbecker
  1 sibling, 1 reply; 8+ messages in thread
From: Sergey Senozhatsky @ 2010-11-05 13:51 UTC (permalink / raw)
  To: Don Zickus
  Cc: fweisbec, Peter Zijlstra, Ingo Molnar, LKML, akpm, sergey.senozhatsky

[-- Attachment #1: Type: text/plain, Size: 1181 bytes --]

On (11/04/10 21:18), Don Zickus wrote:
>  void touch_nmi_watchdog(void)
>  {
> +	/*
> +	 * Using __raw here because some code paths have
> +	 * preemption enabled.  If preemption is enabled
> +	 * then interrupts should be enabled too, in which
> +	 * case we shouldn't have to worry about the watchdog
> +	 * going off.
> +	 */
> +	__raw_get_cpu_var(watchdog_nmi_touch) = true;
> +
> +	touch_softlockup_watchdog();
> +}
> +EXPORT_SYMBOL(touch_nmi_watchdog);
> +
> +void touch_all_nmi_watchdogs(void)
> +{
>  	if (watchdog_enabled) {
>  		unsigned cpu;
>  
> @@ -151,7 +166,7 @@ void touch_nmi_watchdog(void)
>  	}
>  	touch_softlockup_watchdog();
>  }
> -EXPORT_SYMBOL(touch_nmi_watchdog);
> +EXPORT_SYMBOL(touch_all_nmi_watchdogs);
>  

Hello,
Seems like no one is actually calling touch_all_nmi_watchdogs, as for now. 
Right?


Minor nit

	touch_all_nmi_watchdogs:
	...
	for_each_present_cpu(cpu) {
		if (per_cpu(watchdog_nmi_touch, cpu) != true)
			per_cpu(watchdog_nmi_touch, cpu) = true;
	}


which is, I belive, could be simplified to 
	for_each_present_cpu(cpu) {
		per_cpu(watchdog_nmi_touch, cpu) = true;
	}



	Sergey

[-- Attachment #2: Type: application/pgp-signature, Size: 316 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] watchdog:  touch_nmi_watchdog should only touch local cpu not every one
  2010-11-05 13:51 ` Sergey Senozhatsky
@ 2010-11-05 19:58   ` Andrew Morton
  2010-11-08 13:37     ` Don Zickus
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2010-11-05 19:58 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Don Zickus, fweisbec, Peter Zijlstra, Ingo Molnar, LKML

On Fri, 5 Nov 2010 15:51:18 +0200
Sergey Senozhatsky <sergey.senozhatsky@gmail.com> wrote:

> On (11/04/10 21:18), Don Zickus wrote:
> >  void touch_nmi_watchdog(void)
> >  {
> > +	/*
> > +	 * Using __raw here because some code paths have
> > +	 * preemption enabled.  If preemption is enabled
> > +	 * then interrupts should be enabled too, in which
> > +	 * case we shouldn't have to worry about the watchdog
> > +	 * going off.
> > +	 */
> > +	__raw_get_cpu_var(watchdog_nmi_touch) = true;
> > +
> > +	touch_softlockup_watchdog();
> > +}
> > +EXPORT_SYMBOL(touch_nmi_watchdog);
> > +
> > +void touch_all_nmi_watchdogs(void)
> > +{
> >  	if (watchdog_enabled) {
> >  		unsigned cpu;
> >  
> > @@ -151,7 +166,7 @@ void touch_nmi_watchdog(void)
> >  	}
> >  	touch_softlockup_watchdog();
> >  }
> > -EXPORT_SYMBOL(touch_nmi_watchdog);
> > +EXPORT_SYMBOL(touch_all_nmi_watchdogs);
> >  
> 
> Hello,
> Seems like no one is actually calling touch_all_nmi_watchdogs, as for now. 
> Right?

Yes, there doesn't seem a lot of point in adding the interface unless
we have callers.

> 
> Minor nit
> 
> 	touch_all_nmi_watchdogs:
> 	...
> 	for_each_present_cpu(cpu) {
> 		if (per_cpu(watchdog_nmi_touch, cpu) != true)
> 			per_cpu(watchdog_nmi_touch, cpu) = true;
> 	}
> 
> 
> which is, I belive, could be simplified to 
> 	for_each_present_cpu(cpu) {
> 		per_cpu(watchdog_nmi_touch, cpu) = true;
> 	}

We sometimes do this trick to avoid dirtying lots of cachelines which
already held the correct value.  It'll be extra-benefical when dealing
with other CPU's data, I expect.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] watchdog:  touch_nmi_watchdog should only touch local cpu not every one
  2010-11-05  1:18 [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one Don Zickus
  2010-11-05 13:51 ` Sergey Senozhatsky
@ 2010-11-07 22:09 ` Frederic Weisbecker
  2010-11-08 13:38   ` Don Zickus
  1 sibling, 1 reply; 8+ messages in thread
From: Frederic Weisbecker @ 2010-11-07 22:09 UTC (permalink / raw)
  To: Don Zickus; +Cc: Peter Zijlstra, Ingo Molnar, LKML, akpm, sergey.senozhatsky

On Thu, Nov 04, 2010 at 09:18:52PM -0400, Don Zickus wrote:
> I ran into a scenario where while one cpu was stuck and should have panic'd
> because of the NMI watchdog, it didn't.  The reason was another cpu was spewing
> stack dumps on to the console.  Upon investigation, I noticed that when writing
> to the console and also when dumping the stack, the watchdog is touched.
> 
> This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu
> just spins forever.
> 
> This change causes the semantics of touch_nmi_watchdog to be changed slightly.
> Previously, I accidentally changed the semantics and we noticed there was a
> codepath in which touch_nmi_watchdog could be touched from a preemtible area.
> That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled.  I believe
> it was the acpi code.
> 
> My attempt here re-introduces the change to have the touch_nmi_watchdog() code
> only touch the local cpu instead of all of the cpus.  But instead of using
> __get_cpu_var(), I use the __raw_get_cpu_var() version.
> 
> This avoids the preemption problem.  However my reasoning wasn't because I was
> trying to be lazy.  Instead I rationalized it as, well if preemption is enabled
> then interrupts should be enabled to and the NMI watchdog will have no reason
> to trigger.  So it won't matter if the wrong cpu is touched because the percpu
> interrupt counters the NMI watchdog uses should still be incrementing.
> 
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---
>  kernel/watchdog.c |   17 ++++++++++++++++-
>  1 files changed, 16 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index dc8e168..dd0c140 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -141,6 +141,21 @@ void touch_all_softlockup_watchdogs(void)
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>  void touch_nmi_watchdog(void)
>  {
> +	/*
> +	 * Using __raw here because some code paths have
> +	 * preemption enabled.  If preemption is enabled
> +	 * then interrupts should be enabled too, in which
> +	 * case we shouldn't have to worry about the watchdog
> +	 * going off.
> +	 */
> +	__raw_get_cpu_var(watchdog_nmi_touch) = true;
> +
> +	touch_softlockup_watchdog();
> +}
> +EXPORT_SYMBOL(touch_nmi_watchdog);



Did the old watchdog also touched every CPUs?

That doesn't appear to be a good thing, we may indeed miss hardlockups
because of that.

And it seems you can drop touch_all_nmi_watchdogs() as, like others
pointed out, there are no users of it.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] watchdog:  touch_nmi_watchdog should only touch local cpu not every one
  2010-11-05 19:58   ` Andrew Morton
@ 2010-11-08 13:37     ` Don Zickus
  0 siblings, 0 replies; 8+ messages in thread
From: Don Zickus @ 2010-11-08 13:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sergey Senozhatsky, fweisbec, Peter Zijlstra, Ingo Molnar, LKML

On Fri, Nov 05, 2010 at 12:58:55PM -0700, Andrew Morton wrote:
> On Fri, 5 Nov 2010 15:51:18 +0200
> Sergey Senozhatsky <sergey.senozhatsky@gmail.com> wrote:
> 
> > On (11/04/10 21:18), Don Zickus wrote:
> > >  void touch_nmi_watchdog(void)
> > >  {
> > > +	/*
> > > +	 * Using __raw here because some code paths have
> > > +	 * preemption enabled.  If preemption is enabled
> > > +	 * then interrupts should be enabled too, in which
> > > +	 * case we shouldn't have to worry about the watchdog
> > > +	 * going off.
> > > +	 */
> > > +	__raw_get_cpu_var(watchdog_nmi_touch) = true;
> > > +
> > > +	touch_softlockup_watchdog();
> > > +}
> > > +EXPORT_SYMBOL(touch_nmi_watchdog);
> > > +
> > > +void touch_all_nmi_watchdogs(void)
> > > +{
> > >  	if (watchdog_enabled) {
> > >  		unsigned cpu;
> > >  
> > > @@ -151,7 +166,7 @@ void touch_nmi_watchdog(void)
> > >  	}
> > >  	touch_softlockup_watchdog();
> > >  }
> > > -EXPORT_SYMBOL(touch_nmi_watchdog);
> > > +EXPORT_SYMBOL(touch_all_nmi_watchdogs);
> > >  
> > 
> > Hello,
> > Seems like no one is actually calling touch_all_nmi_watchdogs, as for now. 
> > Right?
> 
> Yes, there doesn't seem a lot of point in adding the interface unless
> we have callers.

Yeah I wasn't sure how to deal with that.  It didn't seem like any of the
callers was relying on the fact that touch_nmi_watchdog() touched
everyone.  I just provided it as an option in case I misread someone's use
of the touch_nmi_watchdog.

I'll repost and remove it then.

Thanks for the feedback.

Cheers,
Don

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] watchdog:  touch_nmi_watchdog should only touch local cpu not every one
  2010-11-07 22:09 ` Frederic Weisbecker
@ 2010-11-08 13:38   ` Don Zickus
  0 siblings, 0 replies; 8+ messages in thread
From: Don Zickus @ 2010-11-08 13:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Ingo Molnar, LKML, akpm, sergey.senozhatsky

On Sun, Nov 07, 2010 at 11:09:11PM +0100, Frederic Weisbecker wrote:
> > +	/*
> > +	 * Using __raw here because some code paths have
> > +	 * preemption enabled.  If preemption is enabled
> > +	 * then interrupts should be enabled too, in which
> > +	 * case we shouldn't have to worry about the watchdog
> > +	 * going off.
> > +	 */
> > +	__raw_get_cpu_var(watchdog_nmi_touch) = true;
> > +
> > +	touch_softlockup_watchdog();
> > +}
> > +EXPORT_SYMBOL(touch_nmi_watchdog);
> 
> 
> 
> Did the old watchdog also touched every CPUs?

Yeah, unfortunately.

Cheers,
Don

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one
  2013-12-05 20:42 ` [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one Ben Zhang
@ 2013-12-16 15:55   ` Don Zickus
  0 siblings, 0 replies; 8+ messages in thread
From: Don Zickus @ 2013-12-16 15:55 UTC (permalink / raw)
  To: Ben Zhang; +Cc: linux-kernel, Andrew Morton, Ingo Molnar, Frederic Weisbecker

On Thu, Dec 05, 2013 at 12:42:24PM -0800, Ben Zhang wrote:
> I ran into a scenario where while one cpu was stuck and should have panic'd
> because of the NMI watchdog, it didn't.  The reason was another cpu was spewing
> stack dumps on to the console.  Upon investigation, I noticed that when writing
> to the console and also when dumping the stack, the watchdog is touched.
> 
> This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu
> just spins forever.
> 
> This change causes the semantics of touch_nmi_watchdog to be changed slightly.
> Previously, I accidentally changed the semantics and we noticed there was a
> codepath in which touch_nmi_watchdog could be touched from a preemtible area.
> That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled.  I believe
> it was the acpi code.
> 
> My attempt here re-introduces the change to have the touch_nmi_watchdog() code
> only touch the local cpu instead of all of the cpus.  But instead of using
> __get_cpu_var(), I use the __raw_get_cpu_var() version.
> 
> This avoids the preemption problem.  However my reasoning wasn't because I was
> trying to be lazy.  Instead I rationalized it as, well if preemption is enabled
> then interrupts should be enabled to and the NMI watchdog will have no reason
> to trigger.  So it won't matter if the wrong cpu is touched because the percpu
> interrupt counters the NMI watchdog uses should still be incrementing.

Hi Andrew,

I'm ok with this patch, though it does alter the behaviour of how
touch_nmi_watchdog works.  For the most part I don't think most callers
need to touch all of the watchdogs (on each cpu).  Perhaps a corner case
will pop up (the scheduler?? to mimic touch_all_softlockup_watchdogs() ).

But this does address an issue where if a system is locked up and one cpu
is spewing out useful debug messages (or error messages), the hard lockup
will fail to go off.  We have seen this on RHEL also.

I wouldn't mind see this get soaked in linux-next if possible.  And see
what falls out.

Cheers,
Don

> 
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> Signed-off-by: Ben Zhang <benzh@chromium.org>
> ---
>  kernel/watchdog.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 4431610..613f611 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -158,14 +158,14 @@ void touch_all_softlockup_watchdogs(void)
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>  void touch_nmi_watchdog(void)
>  {
> -	if (watchdog_user_enabled) {
> -		unsigned cpu;
> -
> -		for_each_present_cpu(cpu) {
> -			if (per_cpu(watchdog_nmi_touch, cpu) != true)
> -				per_cpu(watchdog_nmi_touch, cpu) = true;
> -		}
> -	}
> +	/*
> +	 * Using __raw here because some code paths have
> +	 * preemption enabled.  If preemption is enabled
> +	 * then interrupts should be enabled too, in which
> +	 * case we shouldn't have to worry about the watchdog
> +	 * going off.
> +	 */
> +	__raw_get_cpu_var(watchdog_nmi_touch) = true;
>  	touch_softlockup_watchdog();
>  }
>  EXPORT_SYMBOL(touch_nmi_watchdog);
> -- 
> 1.8.5.1
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one
  2013-12-05  3:12 [PATCH v2] watchdog: Add a sysctl to disable soft lockup detector Don Zickus
@ 2013-12-05 20:42 ` Ben Zhang
  2013-12-16 15:55   ` Don Zickus
  0 siblings, 1 reply; 8+ messages in thread
From: Ben Zhang @ 2013-12-05 20:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Don Zickus, Andrew Morton, Ingo Molnar, Frederic Weisbecker, Ben Zhang

I ran into a scenario where while one cpu was stuck and should have panic'd
because of the NMI watchdog, it didn't.  The reason was another cpu was spewing
stack dumps on to the console.  Upon investigation, I noticed that when writing
to the console and also when dumping the stack, the watchdog is touched.

This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu
just spins forever.

This change causes the semantics of touch_nmi_watchdog to be changed slightly.
Previously, I accidentally changed the semantics and we noticed there was a
codepath in which touch_nmi_watchdog could be touched from a preemtible area.
That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled.  I believe
it was the acpi code.

My attempt here re-introduces the change to have the touch_nmi_watchdog() code
only touch the local cpu instead of all of the cpus.  But instead of using
__get_cpu_var(), I use the __raw_get_cpu_var() version.

This avoids the preemption problem.  However my reasoning wasn't because I was
trying to be lazy.  Instead I rationalized it as, well if preemption is enabled
then interrupts should be enabled to and the NMI watchdog will have no reason
to trigger.  So it won't matter if the wrong cpu is touched because the percpu
interrupt counters the NMI watchdog uses should still be incrementing.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Ben Zhang <benzh@chromium.org>
---
 kernel/watchdog.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 4431610..613f611 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -158,14 +158,14 @@ void touch_all_softlockup_watchdogs(void)
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 void touch_nmi_watchdog(void)
 {
-	if (watchdog_user_enabled) {
-		unsigned cpu;
-
-		for_each_present_cpu(cpu) {
-			if (per_cpu(watchdog_nmi_touch, cpu) != true)
-				per_cpu(watchdog_nmi_touch, cpu) = true;
-		}
-	}
+	/*
+	 * Using __raw here because some code paths have
+	 * preemption enabled.  If preemption is enabled
+	 * then interrupts should be enabled too, in which
+	 * case we shouldn't have to worry about the watchdog
+	 * going off.
+	 */
+	__raw_get_cpu_var(watchdog_nmi_touch) = true;
 	touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-12-16 15:55 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-05  1:18 [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one Don Zickus
2010-11-05 13:51 ` Sergey Senozhatsky
2010-11-05 19:58   ` Andrew Morton
2010-11-08 13:37     ` Don Zickus
2010-11-07 22:09 ` Frederic Weisbecker
2010-11-08 13:38   ` Don Zickus
2013-12-05  3:12 [PATCH v2] watchdog: Add a sysctl to disable soft lockup detector Don Zickus
2013-12-05 20:42 ` [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one Ben Zhang
2013-12-16 15:55   ` Don Zickus

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).