linux-kernel.vger.kernel.org archive mirror
* [PATCH] Allow users to force a panic on NMI
@ 2006-05-11 21:49 Don Zickus
  2006-05-18 19:17 ` Pavel Machek
  0 siblings, 1 reply; 6+ messages in thread
From: Don Zickus @ 2006-05-11 21:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: ak

To quote Alan Cox:

The default Linux behaviour on an NMI of either memory or unknown is to
continue operation. For many environments such as scientific computing
it is preferable that the box is taken out and the error dealt with than
an uncorrected parity/ECC error get propagated.

A small number of systems do generate NMIs for bizarre random reasons
such as power management so the default is unchanged. In other respects
the new proc/sys entry works like the existing panic controls already in
that directory.

This is separate from the EDAC support: EDAC allows supported chipsets to
handle ECC errors well, while this change allows unsupported cases to at
least panic rather than cause problems further down the line.

Signed-off-by: Don Zickus <dzickus@redhat.com>

---

This is just a refreshed post of Alan's original patch
<http://www.ussg.iu.edu/hypermail/linux/kernel/0510.2/1208.html>, with
hopes this time it sticks. :)

It applies cleanly on top of my other nmi patches.  

Cheers,
Don


Index: linux-don/arch/i386/kernel/traps.c
===================================================================
--- linux-don.orig/arch/i386/kernel/traps.c
+++ linux-don/arch/i386/kernel/traps.c
@@ -602,6 +602,8 @@ static void mem_parity_error(unsigned ch
 			"to continue\n");
 	printk(KERN_EMERG "You probably have a hardware problem with your RAM "
 			"chips\n");
+	if (panic_on_unrecovered_nmi)
+                panic("NMI: Not continuing");
 
 	/* Clear and disable the memory parity error line. */
 	clear_mem_error(reason);
@@ -637,6 +639,10 @@ static void unknown_nmi_error(unsigned c
 		reason, smp_processor_id());
 	printk("Dazed and confused, but trying to continue\n");
 	printk("Do you have a strange power saving mode enabled?\n");
+
+	if (panic_on_unrecovered_nmi)
+                panic("NMI: Not continuing");
+
 }
 
 static DEFINE_SPINLOCK(nmi_print_lock);
Index: linux-don/arch/x86_64/kernel/traps.c
===================================================================
--- linux-don.orig/arch/x86_64/kernel/traps.c
+++ linux-don/arch/x86_64/kernel/traps.c
@@ -608,6 +608,8 @@ mem_parity_error(unsigned char reason, s
 {
 	printk("Uhhuh. NMI received. Dazed and confused, but trying to continue\n");
 	printk("You probably have a hardware problem with your RAM chips\n");
+	if (panic_on_unrecovered_nmi)
+               panic("NMI: Not continuing");
 
 	/* Clear and disable the memory parity error line. */
 	reason = (reason & 0xf) | 4;
@@ -633,6 +635,10 @@ unknown_nmi_error(unsigned char reason, 
 {	printk("Uhhuh. NMI received for unknown reason %02x.\n", reason);
 	printk("Dazed and confused, but trying to continue\n");
 	printk("Do you have a strange power saving mode enabled?\n");
+
+	if (panic_on_unrecovered_nmi)
+                panic("NMI: Not continuing");
+
 }
 
 /* Runs on IST stack. This code must keep interrupts off all the time.
Index: linux-don/include/linux/kernel.h
===================================================================
--- linux-don.orig/include/linux/kernel.h
+++ linux-don/include/linux/kernel.h
@@ -178,6 +178,7 @@ extern void bust_spinlocks(int yes);
 extern int oops_in_progress;		/* If set, an oops, panic(), BUG() or die() is in progress */
 extern int panic_timeout;
 extern int panic_on_oops;
+extern int panic_on_unrecovered_nmi;
 extern int tainted;
 extern const char *print_tainted(void);
 extern void add_taint(unsigned);
Index: linux-don/kernel/sysctl.c
===================================================================
--- linux-don.orig/kernel/sysctl.c
+++ linux-don/kernel/sysctl.c
@@ -644,6 +644,14 @@ static ctl_table kern_table[] = {
 #endif
 #if defined(CONFIG_X86)
 	{
+		.ctl_name	= KERN_PANIC_ON_NMI,
+		.procname	= "panic_on_unrecovered_nmi",
+		.data		= &panic_on_unrecovered_nmi,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= KERN_BOOTLOADER_TYPE,
 		.procname	= "bootloader_type",
 		.data		= &bootloader_type,
Index: linux-don/include/linux/sysctl.h
===================================================================
--- linux-don.orig/include/linux/sysctl.h
+++ linux-don/include/linux/sysctl.h
@@ -149,6 +149,7 @@ enum
 	KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
 	KERN_IA64_UNALIGNED=72, /* int: ia64 unaligned userland trap enable */
 	KERN_NMI_ENABLED=73, /* int: enable/disable nmi watchdog */
+	KERN_PANIC_ON_NMI=74, /* int: whether we will panic on an unrecovered NMI */
 };
 
 
Index: linux-don/kernel/panic.c
===================================================================
--- linux-don.orig/kernel/panic.c
+++ linux-don/kernel/panic.c
@@ -21,6 +21,7 @@
 #include <linux/kexec.h>
 
 int panic_on_oops;
+int panic_on_unrecovered_nmi;
 int tainted;
 static int pause_on_oops;
 static int pause_on_oops_flag;
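
For readers who want to try the new knob from userspace, here is a minimal
sketch, not part of the patch. It assumes the entry shows up as
/proc/sys/kernel/panic_on_unrecovered_nmi, which is what the procname in
kern_table suggests. The helper names read_int_sysctl() and
write_int_sysctl() are made up for illustration.

```c
#include <stdio.h>

/* Path implied by the "panic_on_unrecovered_nmi" procname in kern_table
 * (an assumption; check your kernel before relying on it). */
#define PANIC_ON_NMI_PATH "/proc/sys/kernel/panic_on_unrecovered_nmi"

/* Hypothetical helper: read one integer from a sysctl-style file.
 * Returns 0 on success, -1 on failure. */
int read_int_sysctl(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Hypothetical helper: write one integer to a sysctl-style file.
 * The real entry has mode 0644, so writing it needs root.
 * Returns 0 on success, -1 on failure. */
int write_int_sysctl(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%d\n", value) > 0);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Calling write_int_sysctl(PANIC_ON_NMI_PATH, 1) as root should turn the
panic on; writing 0 restores the existing "try to continue" behaviour.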


* Re: [PATCH] Allow users to force a panic on NMI
  2006-05-11 21:49 [PATCH] Allow users to force a panic on NMI Don Zickus
@ 2006-05-18 19:17 ` Pavel Machek
  2006-05-25 17:39   ` Paul Jackson
  2006-05-26 17:59   ` [PATCH] x86 clean up nmi panic messages Don Zickus
  0 siblings, 2 replies; 6+ messages in thread
From: Pavel Machek @ 2006-05-18 19:17 UTC (permalink / raw)
  To: Don Zickus; +Cc: linux-kernel, ak

Hi!

> To quote Alan Cox:
> 
> The default Linux behaviour on an NMI of either memory or unknown is to
> continue operation. For many environments such as scientific computing
> it is preferable that the box is taken out and the error dealt with than
> an uncorrected parity/ECC error get propagated.

> This is just a refreshed post of Alan's original patch
> <http://www.ussg.iu.edu/hypermail/linux/kernel/0510.2/1208.html>, with
> hopes this time it sticks. :)
> 
> It applies cleanly on top of my other nmi patches.  

> Index: linux-don/arch/i386/kernel/traps.c
> ===================================================================
> --- linux-don.orig/arch/i386/kernel/traps.c
> +++ linux-don/arch/i386/kernel/traps.c
> @@ -602,6 +602,8 @@ static void mem_parity_error(unsigned ch
>  			"to continue\n");
>  	printk(KERN_EMERG "You probably have a hardware problem with your RAM "
>  			"chips\n");
> +	if (panic_on_unrecovered_nmi)
> +                panic("NMI: Not continuing");
>  
>  	/* Clear and disable the memory parity error line. */
>  	clear_mem_error(reason);
> @@ -637,6 +639,10 @@ static void unknown_nmi_error(unsigned c
>  		reason, smp_processor_id());
>  	printk("Dazed and confused, but trying to continue\n");

'Trying to contninue'

>  	printk("Do you have a strange power saving mode enabled?\n");
> +
> +	if (panic_on_unrecovered_nmi)
> +                panic("NMI: Not continuing");
> +

'not really'. Move printks around so it makes sense...

> Index: linux-don/arch/x86_64/kernel/traps.c
> ===================================================================
> --- linux-don.orig/arch/x86_64/kernel/traps.c
> +++ linux-don/arch/x86_64/kernel/traps.c
> @@ -608,6 +608,8 @@ mem_parity_error(unsigned char reason, s
>  {
>  	printk("Uhhuh. NMI received. Dazed and confused, but trying to continue\n");
>  	printk("You probably have a hardware problem with your RAM chips\n");
> +	if (panic_on_unrecovered_nmi)
> +               panic("NMI: Not continuing");
>  
>  	/* Clear and disable the memory parity error line. */

same here.

> @@ -633,6 +635,10 @@ unknown_nmi_error(unsigned char reason, 
>  {	printk("Uhhuh. NMI received for unknown reason %02x.\n", reason);
>  	printk("Dazed and confused, but trying to continue\n");
>  	printk("Do you have a strange power saving mode enabled?\n");
> +
> +	if (panic_on_unrecovered_nmi)
> +                panic("NMI: Not continuing");
> +
>  }
>  

and here.

-- 
Thanks for all the (sleeping) penguins.


* Re: [PATCH] Allow users to force a panic on NMI
  2006-05-18 19:17 ` Pavel Machek
@ 2006-05-25 17:39   ` Paul Jackson
  2006-05-25 21:45     ` Pavel Machek
  2006-05-26 17:59   ` [PATCH] x86 clean up nmi panic messages Don Zickus
  1 sibling, 1 reply; 6+ messages in thread
From: Paul Jackson @ 2006-05-25 17:39 UTC (permalink / raw)
  To: Pavel Machek; +Cc: dzickus, linux-kernel, ak

> >  	printk("Dazed and confused, but trying to continue\n");
> 
> 'Trying to contninue'

I'm pretty sure that the correct spelling is "continue",
not "contninue".  Are you suggesting otherwise?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH] Allow users to force a panic on NMI
  2006-05-25 17:39   ` Paul Jackson
@ 2006-05-25 21:45     ` Pavel Machek
  0 siblings, 0 replies; 6+ messages in thread
From: Pavel Machek @ 2006-05-25 21:45 UTC (permalink / raw)
  To: Paul Jackson; +Cc: dzickus, linux-kernel, ak

On Thu 25-05-06 10:39:16, Paul Jackson wrote:
> > >  	printk("Dazed and confused, but trying to continue\n");
> > 
> > 'Trying to contninue'
> 
> I'm pretty sure that the correct spelling is "continue",
> not "contninue".  Are you suggesting otherwise?

No, that was a typo. But it is wrong to print the "...trying to continue"
message when the machine is going to be halted the next millisecond.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* [PATCH] x86 clean up nmi panic messages
  2006-05-18 19:17 ` Pavel Machek
  2006-05-25 17:39   ` Paul Jackson
@ 2006-05-26 17:59   ` Don Zickus
  2006-05-27 16:33     ` Pavel Machek
  1 sibling, 1 reply; 6+ messages in thread
From: Don Zickus @ 2006-05-26 17:59 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel, ak

Clean up some of the output messages on the nmi error paths to make more
sense when they are displayed.  This is mainly a cosmetic fix and
shouldn't impact any normal code path.  

Signed-off-by:  Don Zickus <dzickus@redhat.com>

---

Pavel, 

I hope this patch addresses your concerns.  This will apply on top of my
other patch.  

Cheers,
Don

> > Index: linux-don/arch/i386/kernel/traps.c
> > ===================================================================
> > --- linux-don.orig/arch/i386/kernel/traps.c
> > +++ linux-don/arch/i386/kernel/traps.c
> > @@ -602,6 +602,8 @@ static void mem_parity_error(unsigned ch
> >  			"to continue\n");
> >  	printk(KERN_EMERG "You probably have a hardware problem with your RAM "
> >  			"chips\n");
> > +	if (panic_on_unrecovered_nmi)
> > +                panic("NMI: Not continuing");
> >  
> >  	/* Clear and disable the memory parity error line. */
> >  	clear_mem_error(reason);
> > @@ -637,6 +639,10 @@ static void unknown_nmi_error(unsigned c
> >  		reason, smp_processor_id());
> >  	printk("Dazed and confused, but trying to continue\n");
> 
> 'Trying to contninue'
> 
> >  	printk("Do you have a strange power saving mode enabled?\n");
> > +
> > +	if (panic_on_unrecovered_nmi)
> > +                panic("NMI: Not continuing");
> > +
> 
> 'not really'. Move printks around so it makes sense...


Index: linux-don/arch/i386/kernel/traps.c
===================================================================
--- linux-don.orig/arch/i386/kernel/traps.c
+++ linux-don/arch/i386/kernel/traps.c
@@ -598,13 +598,15 @@ gp_in_kernel:
 
 static void mem_parity_error(unsigned char reason, struct pt_regs * regs)
 {
-	printk(KERN_EMERG "Uhhuh. NMI received. Dazed and confused, but trying "
-			"to continue\n");
+	printk(KERN_EMERG "Uhhuh. NMI received for unknown reason %02x on "
+		"CPU %d.\n", reason, smp_processor_id());
 	printk(KERN_EMERG "You probably have a hardware problem with your RAM "
 			"chips\n");
 	if (panic_on_unrecovered_nmi)
                 panic("NMI: Not continuing");
 
+	printk(KERN_EMERG "Dazed and confused, but trying to continue\n");
+
 	/* Clear and disable the memory parity error line. */
 	clear_mem_error(reason);
 }
@@ -635,14 +637,13 @@ static void unknown_nmi_error(unsigned c
 		return;
 	}
 #endif
-	printk("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
-		reason, smp_processor_id());
-	printk("Dazed and confused, but trying to continue\n");
-	printk("Do you have a strange power saving mode enabled?\n");
-
+	printk(KERN_EMERG "Uhhuh. NMI received for unknown reason %02x on "
+		"CPU %d.\n", reason, smp_processor_id());
+	printk(KERN_EMERG "Do you have a strange power saving mode enabled?\n");
 	if (panic_on_unrecovered_nmi)
                 panic("NMI: Not continuing");
 
+	printk(KERN_EMERG "Dazed and confused, but trying to continue\n");
 }
 
 static DEFINE_SPINLOCK(nmi_print_lock);
Index: linux-don/arch/x86_64/kernel/traps.c
===================================================================
--- linux-don.orig/arch/x86_64/kernel/traps.c
+++ linux-don/arch/x86_64/kernel/traps.c
@@ -606,10 +606,15 @@ asmlinkage void __kprobes do_general_pro
 static __kprobes void
 mem_parity_error(unsigned char reason, struct pt_regs * regs)
 {
-	printk("Uhhuh. NMI received. Dazed and confused, but trying to continue\n");
-	printk("You probably have a hardware problem with your RAM chips\n");
+	printk(KERN_EMERG "Uhhuh. NMI received for unknown reason %02x.\n",
+		reason);
+	printk(KERN_EMERG "You probably have a hardware problem with your "
+		"RAM chips\n");
+
 	if (panic_on_unrecovered_nmi)
-               panic("NMI: Not continuing");
+		panic("NMI: Not continuing");
+
+	printk(KERN_EMERG "Dazed and confused, but trying to continue\n");
 
 	/* Clear and disable the memory parity error line. */
 	reason = (reason & 0xf) | 4;
@@ -632,13 +637,15 @@ io_check_error(unsigned char reason, str
 
 static __kprobes void
 unknown_nmi_error(unsigned char reason, struct pt_regs * regs)
-{	printk("Uhhuh. NMI received for unknown reason %02x.\n", reason);
-	printk("Dazed and confused, but trying to continue\n");
-	printk("Do you have a strange power saving mode enabled?\n");
+{
+	printk(KERN_EMERG "Uhhuh. NMI received for unknown reason %02x.\n",
+		reason);
+	printk(KERN_EMERG "Do you have a strange power saving mode enabled?\n");
 
 	if (panic_on_unrecovered_nmi)
-                panic("NMI: Not continuing");
+		panic("NMI: Not continuing");
 
+	printk(KERN_EMERG "Dazed and confused, but trying to continue\n");
 }
 
 /* Runs on IST stack. This code must keep interrupts off all the time.
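
To make the intended ordering easy to check outside the kernel, here is a
rough userspace model of the reordered unknown_nmi_error() path. It is only
a sketch: model_unknown_nmi() is a made-up name, the messages are collected
in a buffer instead of going through printk(), and panic() is replaced by an
early return because the real call never comes back.

```c
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for the kernel flag of the same name. */
int panic_on_unrecovered_nmi;

/* Hypothetical model of the reordered unknown_nmi_error().
 * Writes the messages into buf and returns 1 if the kernel would have
 * panicked, 0 if it would have continued, -1 if buf is too small. */
int model_unknown_nmi(unsigned char reason, char *buf, size_t len)
{
	int n;

	/* Diagnostics come first, whatever happens next. */
	n = snprintf(buf, len,
		"Uhhuh. NMI received for unknown reason %02x.\n"
		"Do you have a strange power saving mode enabled?\n",
		(unsigned int)reason);
	if (n < 0 || (size_t)n >= len)
		return -1;

	if (panic_on_unrecovered_nmi) {
		/* Stand-in for panic("NMI: Not continuing"). */
		snprintf(buf + n, len - n, "NMI: Not continuing\n");
		return 1;
	}

	/* Only claim to continue when we really are going to. */
	snprintf(buf + n, len - n,
		"Dazed and confused, but trying to continue\n");
	return 0;
}
```

With the flag set, the buffer ends at "NMI: Not continuing" and never
contains the "trying to continue" line, which is the point of moving that
printk below the panic check.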


* Re: [PATCH] x86 clean up nmi panic messages
  2006-05-26 17:59   ` [PATCH] x86 clean up nmi panic messages Don Zickus
@ 2006-05-27 16:33     ` Pavel Machek
  0 siblings, 0 replies; 6+ messages in thread
From: Pavel Machek @ 2006-05-27 16:33 UTC (permalink / raw)
  To: Don Zickus; +Cc: linux-kernel, ak

Hi!

> Clean up some of the output messages on the nmi error paths to make more
> sense when they are displayed.  This is mainly a cosmetic fix and
> shouldn't impact any normal code path.  
> 
> Signed-off-by:  Don Zickus <dzickus@redhat.com>

ACK.

> Pavel, 
> 
> I hope this patch addresses your concerns.  This will apply on top of my
> other patch.  

Yes, thanks.

-- 
Thanks for all the (sleeping) penguins.


