All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] watchdog:  touch_nmi_watchdog should only touch local cpu not every one
@ 2010-11-08 16:44 Don Zickus
  2010-11-08 18:05 ` Frederic Weisbecker
  0 siblings, 1 reply; 3+ messages in thread
From: Don Zickus @ 2010-11-08 16:44 UTC (permalink / raw)
  To: fweisbec, Peter Zijlstra
  Cc: Ingo Molnar, LKML, akpm, sergey.senozhatsky, Don Zickus

I ran into a scenario where while one cpu was stuck and should have panic'd
because of the NMI watchdog, it didn't.  The reason was another cpu was spewing
stack dumps on to the console.  Upon investigation, I noticed that when writing
to the console and also when dumping the stack, the watchdog is touched.

This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu
just spins forever.

This change causes the semantics of touch_nmi_watchdog to be changed slightly.
Previously, I accidentally changed the semantics and we noticed there was a
codepath in which touch_nmi_watchdog could be touched from a preemtible area.
That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled.  I believe
it was the acpi code.

My attempt here re-introduces the change to have the touch_nmi_watchdog() code
only touch the local cpu instead of all of the cpus.  But instead of using
__get_cpu_var(), I use the __raw_get_cpu_var() version.

This avoids the preemption problem.  However my reasoning wasn't because I was
trying to be lazy.  Instead I rationalized it as, well if preemption is enabled
then interrupts should be enabled to and the NMI watchdog will have no reason
to trigger.  So it won't matter if the wrong cpu is touched because the percpu
interrupt counters the NMI watchdog uses should still be incrementing.

V2:  remove touch_all_nmi_watchdog code

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 kernel/watchdog.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index dc8e168..dd0c140 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -141,6 +141,21 @@ void touch_all_softlockup_watchdogs(void)
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 void touch_nmi_watchdog(void)
 {
+	/*
+	 * Using __raw here because some code paths have
+	 * preemption enabled.  If preemption is enabled
+	 * then interrupts should be enabled too, in which
+	 * case we shouldn't have to worry about the watchdog
+	 * going off.
+	 */
+	__raw_get_cpu_var(watchdog_nmi_touch) = true;
+
+	touch_softlockup_watchdog();
+}
+EXPORT_SYMBOL(touch_nmi_watchdog);
+
+void touch_all_nmi_watchdogs(void)
+{
 	if (watchdog_enabled) {
 		unsigned cpu;
 
@@ -151,7 +166,7 @@ void touch_nmi_watchdog(void)
 	}
 	touch_softlockup_watchdog();
 }
-EXPORT_SYMBOL(touch_nmi_watchdog);
+EXPORT_SYMBOL(touch_all_nmi_watchdogs);
 
 #endif
 
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] watchdog:  touch_nmi_watchdog should only touch local cpu not every one
  2010-11-08 16:44 [PATCH v2] watchdog: touch_nmi_watchdog should only touch local cpu not every one Don Zickus
@ 2010-11-08 18:05 ` Frederic Weisbecker
  2010-11-08 18:28   ` Don Zickus
  0 siblings, 1 reply; 3+ messages in thread
From: Frederic Weisbecker @ 2010-11-08 18:05 UTC (permalink / raw)
  To: Don Zickus; +Cc: Peter Zijlstra, Ingo Molnar, LKML, akpm, sergey.senozhatsky

On Mon, Nov 08, 2010 at 11:44:35AM -0500, Don Zickus wrote:
> I ran into a scenario where while one cpu was stuck and should have panic'd
> because of the NMI watchdog, it didn't.  The reason was another cpu was spewing
> stack dumps on to the console.  Upon investigation, I noticed that when writing
> to the console and also when dumping the stack, the watchdog is touched.
> 
> This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu
> just spins forever.
> 
> This change causes the semantics of touch_nmi_watchdog to be changed slightly.
> Previously, I accidentally changed the semantics and we noticed there was a
> codepath in which touch_nmi_watchdog could be touched from a preemtible area.
> That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled.  I believe
> it was the acpi code.
> 
> My attempt here re-introduces the change to have the touch_nmi_watchdog() code
> only touch the local cpu instead of all of the cpus.  But instead of using
> __get_cpu_var(), I use the __raw_get_cpu_var() version.
> 
> This avoids the preemption problem.  However my reasoning wasn't because I was
> trying to be lazy.  Instead I rationalized it as, well if preemption is enabled
> then interrupts should be enabled to and the NMI watchdog will have no reason
> to trigger.  So it won't matter if the wrong cpu is touched because the percpu
> interrupt counters the NMI watchdog uses should still be incrementing.
> 
> V2:  remove touch_all_nmi_watchdog code


Are you sure you did? :)


> +void touch_all_nmi_watchdogs(void)
> +{
>  	if (watchdog_enabled) {
>  		unsigned cpu;
>  
> @@ -151,7 +166,7 @@ void touch_nmi_watchdog(void)
>  	}
>  	touch_softlockup_watchdog();
>  }
> -EXPORT_SYMBOL(touch_nmi_watchdog);
> +EXPORT_SYMBOL(touch_all_nmi_watchdogs);
>  
>  #endif
>  
> -- 
> 1.7.2.3
> 


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] watchdog:  touch_nmi_watchdog should only touch local cpu not every one
  2010-11-08 18:05 ` Frederic Weisbecker
@ 2010-11-08 18:28   ` Don Zickus
  0 siblings, 0 replies; 3+ messages in thread
From: Don Zickus @ 2010-11-08 18:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Ingo Molnar, LKML, akpm, sergey.senozhatsky

On Mon, Nov 08, 2010 at 07:05:23PM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 08, 2010 at 11:44:35AM -0500, Don Zickus wrote:
> > I ran into a scenario where while one cpu was stuck and should have panic'd
> > because of the NMI watchdog, it didn't.  The reason was another cpu was spewing
> > stack dumps on to the console.  Upon investigation, I noticed that when writing
> > to the console and also when dumping the stack, the watchdog is touched.
> > 
> > This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu
> > just spins forever.
> > 
> > This change causes the semantics of touch_nmi_watchdog to be changed slightly.
> > Previously, I accidentally changed the semantics and we noticed there was a
> > codepath in which touch_nmi_watchdog could be touched from a preemtible area.
> > That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled.  I believe
> > it was the acpi code.
> > 
> > My attempt here re-introduces the change to have the touch_nmi_watchdog() code
> > only touch the local cpu instead of all of the cpus.  But instead of using
> > __get_cpu_var(), I use the __raw_get_cpu_var() version.
> > 
> > This avoids the preemption problem.  However my reasoning wasn't because I was
> > trying to be lazy.  Instead I rationalized it as, well if preemption is enabled
> > then interrupts should be enabled to and the NMI watchdog will have no reason
> > to trigger.  So it won't matter if the wrong cpu is touched because the percpu
> > interrupt counters the NMI watchdog uses should still be incrementing.
> > 
> > V2:  remove touch_all_nmi_watchdog code
> 
> 
> Are you sure you did? :)

Hmm.. Odd.  I remember making and committing the changes.  But now I can't
find them.

Thanks for catching that! I'll send out a version 3.

Cheers,
Don

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2010-11-08 18:29 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-08 16:44 [PATCH v2] watchdog: touch_nmi_watchdog should only touch local cpu not every one Don Zickus
2010-11-08 18:05 ` Frederic Weisbecker
2010-11-08 18:28   ` Don Zickus

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.