linux-kernel.vger.kernel.org archive mirror
* bluesmoke, machine check exception, reboot
@ 2002-05-28 12:19 Corin Hartland-Swann
  2002-05-28 13:50 ` Alan Cox
  0 siblings, 1 reply; 7+ messages in thread
From: Corin Hartland-Swann @ 2002-05-28 12:19 UTC
  To: linux-kernel


Hi there,

I have a Dual PIII-1000 running 2.4.18, and am occasionally getting the
following error:

> CPU 1: Machine Check Exception: 0000000000000004
> Bank 1: f200000000000115
> Kernel panic: CPU context corrupt

This results in a hard lock (unable to use magic SysRQ key to sync or
reboot, etc). I located these errors in arch/i386/kernel/bluesmoke.c in
the function intel_machine_check(). From what I have read on lkml it is
probably a result of the processor overheating and causing errors.

I intend to add another fan to stop this from happening, but in the
meantime is there anything I can do to get the machine to reboot after the
panic? After the last time that this happened, I set
/proc/sys/kernel/panic to 10, but it hasn't happened since then so I can't
tell whether it will work. The error listed above is the entire error
before the machine fails - there is no register dump or anything after
that.

Do you think it will manage to reboot with a hopelessly confused
processor?

Thanks,

Corin

/------------------------+-------------------------------------\
| Corin Hartland-Swann   |    Tel: +44 (0) 20 7491 2000        |
| Commerce Internet Ltd  |    Fax: +44 (0) 20 7491 2010        |
| 22 Cavendish Buildings | Mobile: +44 (0) 79 5854 0027        |
| Gilbert Street         |                                     |
| Mayfair                |    Web: http://www.commerce.uk.net/ |
| London W1K 5HJ         | E-Mail: cdhs@commerce.uk.net        |
\------------------------+-------------------------------------/




* Re: bluesmoke, machine check exception, reboot
  2002-05-28 13:50 ` Alan Cox
@ 2002-05-28 13:46   ` Corin Hartland-Swann
  2002-05-28 15:09     ` Alan Cox
  2002-05-28 15:13     ` Randy.Dunlap
  0 siblings, 2 replies; 7+ messages in thread
From: Corin Hartland-Swann @ 2002-05-28 13:46 UTC
  To: Alan Cox; +Cc: linux-kernel


Alan,

On 28 May 2002, Alan Cox wrote:
> On Tue, 2002-05-28 at 13:19, Corin Hartland-Swann wrote:
> > I have a Dual PIII-1000 running 2.4.18, and am occasionally getting the
> > following error:
> >
> > > CPU 1: Machine Check Exception: 0000000000000004
> > > Bank 1: f200000000000115
> > > Kernel panic: CPU context corrupt

I just found another set of messages in the logs as well:

CPU 1: Machine Check Exception: 0000000000000004
Bank 1: b200000000000115
Kernel panic: CPU context corrupt

> > This results in a hard lock (unable to use magic SysRQ key to sync or
> > reboot, etc). I located these errors in arch/i386/kernel/bluesmoke.c in
> > the function intel_machine_check(). From what I have read on lkml it is
> > probably a result of the processor overheating and causing errors.
>
> It may even be a faulty processor. If you are running the processor to
> spec and your heatsink/fan/voltage all check out you may want to see
> about getting the CPU replaced. That's a data cache L1 read error, it
> appears.

How do you work out what the numbers mean? Is there some kind of reference
to it, or are you just Alan "decodes machine check exceptions in his head"
Cox :) From the code it seems to be some kind of MCG status and MC0 status
- but of course, I have no idea what that means...

> That /proc setting should cause a reboot although after an MCE all
> things are a little undefined

After checking the logs (above) I found that the two times this has
happened it has managed to write it to the logs. Is the fact that it
sync()d a good indication that it will manage to reboot OK?

Thanks,

Corin




* Re: bluesmoke, machine check exception, reboot
  2002-05-28 12:19 bluesmoke, machine check exception, reboot Corin Hartland-Swann
@ 2002-05-28 13:50 ` Alan Cox
  2002-05-28 13:46   ` Corin Hartland-Swann
  0 siblings, 1 reply; 7+ messages in thread
From: Alan Cox @ 2002-05-28 13:50 UTC
  To: Corin Hartland-Swann; +Cc: linux-kernel

On Tue, 2002-05-28 at 13:19, Corin Hartland-Swann wrote:
> I have a Dual PIII-1000 running 2.4.18, and am occasionally getting the
> following error:
> 
> > CPU 1: Machine Check Exception: 0000000000000004
> > Bank 1: f200000000000115
> > Kernel panic: CPU context corrupt
> 
> This results in a hard lock (unable to use magic SysRQ key to sync or
> reboot, etc). I located these errors in arch/i386/kernel/bluesmoke.c in
> the function intel_machine_check(). From what I have read on lkml it is
> probably a result of the processor overheating and causing errors.

It may even be a faulty processor. If you are running the processor to
spec and your heatsink/fan/voltage all check out you may want to see
about getting the CPU replaced. That's a data cache L1 read error, it
appears.

> meantime is there anything I can do to get the machine to reboot after the
> panic? After the last time that this happened, I set
> /proc/sys/kernel/panic to 10, but it hasn't happened since then so I can't
> tell whether it will work. The error listed above is the entire error
> before the machine fails - there is no register dump or anything after
> that.

That /proc setting should cause a reboot, although after an MCE all
things are a little undefined.

> Do you think it will manage to reboot with a hopelessly confused
> processor?

Should do
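
The panic timeout discussed above is a plain integer, in seconds, exposed
at /proc/sys/kernel/panic; a value of 0 means the kernel does not reboot
automatically after a panic. The following is a minimal sketch of reading
and setting it. The function names are hypothetical, writing to the real
file needs root, and the path parameter exists only so the sketch can be
tried against an ordinary file first.

```python
from pathlib import Path

# Real sysctl file named in the thread; everything else here is illustrative.
PANIC_SYSCTL = Path("/proc/sys/kernel/panic")


def read_panic_timeout(path=PANIC_SYSCTL):
    """Return the number of seconds the kernel waits before rebooting after a panic."""
    return int(Path(path).read_text().strip())


def write_panic_timeout(seconds, path=PANIC_SYSCTL):
    """Set the panic timeout in seconds (requires root for the real sysctl file)."""
    Path(path).write_text(f"{int(seconds)}\n")
```

Calling write_panic_timeout(10) as root would be the equivalent of the
setting Corin describes. It does not survive a reboot, so it would need
to be reapplied at boot time.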



* Re: bluesmoke, machine check exception, reboot
  2002-05-28 13:46   ` Corin Hartland-Swann
@ 2002-05-28 15:09     ` Alan Cox
  2002-05-28 15:13     ` Randy.Dunlap
  1 sibling, 0 replies; 7+ messages in thread
From: Alan Cox @ 2002-05-28 15:09 UTC
  To: Corin Hartland-Swann; +Cc: linux-kernel

On Tue, 2002-05-28 at 14:46, Corin Hartland-Swann wrote:
> CPU 1: Machine Check Exception: 0000000000000004
> Bank 1: b200000000000115
> Kernel panic: CPU context corrupt
>
> How do you work out what the numbers mean? Is there some kind of reference
> to it, or are you just Alan "decodes machine check exceptions in his head"
> Cox :) From the code it seems to be some kind of MCG status and MC0 status
> - but of course, I have no idea what that means...

I contemplate them in zen peace and they speak to me 8). The MCE value
is the flags from the control register. The Bank n value is a dump of
the register that explains what the fault is. The decoding rules are in
the Intel Pentium III documentation set.
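
As a rough illustration of the decoding described here, the sketch below
splits the two logged values into their flag bits. The bit positions
follow the MCG_STATUS and MCi_STATUS register layouts in the Intel
manuals; the table and function names are invented for this example and
are not taken from bluesmoke.c or parsemce.

```python
# Flag bits of the global machine check status register (the "MCE value").
MCG_STATUS_FLAGS = {
    0: "RIPV (restart IP valid)",
    1: "EIPV (error IP valid)",
    2: "MCIP (machine check in progress)",
}

# Flag bits in the top of a per-bank status register (the "Bank n value").
MCI_STATUS_FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def set_flags(value, table):
    """Return the names of all flags in `table` whose bit is set, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


# Values copied from the log lines quoted in this thread.
print("MCG status:", set_flags(0x0000000000000004, MCG_STATUS_FLAGS))
for bank in (0xF200000000000115, 0xB200000000000115):
    print(f"Bank {bank:016x}:", set_flags(bank, MCI_STATUS_FLAGS))
```

For the logged values, 0x4 decodes to MCIP alone, so the restart IP is
not marked valid. Both bank values carry VAL, UC, EN and PCC, and the
f2... value also has OVER set. The PCC bit would seem to be what lies
behind the "CPU context corrupt" panic text, though that link is an
inference rather than something stated in the thread.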

> After checking the logs (above) I found that the two times this has
> happened it has managed to write it to the logs. Is the fact that it
> sync()d a good indication that it will manage to reboot OK?

Is the fact the airbag deployed a good indication that it will deploy if
you keep crashing into walls? It's logging a CPU error where it decides
the CPU is in an unrecoverable state. The odds are pretty good, but each
time you are taking the risk that it won't, and if it's a hardware problem,
that it might simply drop dead for good.

Alan



* Re: bluesmoke, machine check exception, reboot
  2002-05-28 13:46   ` Corin Hartland-Swann
  2002-05-28 15:09     ` Alan Cox
@ 2002-05-28 15:13     ` Randy.Dunlap
  2002-05-29 13:46       ` Corin Hartland-Swann
  1 sibling, 1 reply; 7+ messages in thread
From: Randy.Dunlap @ 2002-05-28 15:13 UTC
  To: Corin Hartland-Swann; +Cc: Alan Cox, linux-kernel

On Tue, 28 May 2002, Corin Hartland-Swann wrote:

| Alan,
|
| On 28 May 2002, Alan Cox wrote:
| > On Tue, 2002-05-28 at 13:19, Corin Hartland-Swann wrote:
| > > I have a Dual PIII-1000 running 2.4.18, and am occasionally getting the
| > > following error:
| > >
| > > > CPU 1: Machine Check Exception: 0000000000000004
| > > > Bank 1: f200000000000115
| > > > Kernel panic: CPU context corrupt
|
| I just found another set of messages in the logs as well:
|
| CPU 1: Machine Check Exception: 0000000000000004
| Bank 1: b200000000000115
| Kernel panic: CPU context corrupt
|
| > > This results in a hard lock (unable to use magic SysRQ key to sync or
| > > reboot, etc). I located these errors in arch/i386/kernel/bluesmoke.c in
| > > the function intel_machine_check(). From what I have read on lkml it is
| > > probably a result of the processor overheating and causing errors.
| >
| > It may even be a faulty processor. If you are running the processor to
| > spec and your heatsink/fan/voltage all check out you may want to see
| > about getting the CPU replaced. That's a data cache L1 read error, it
| > appears.
|
| How do you work out what the numbers mean? Is there some kind of reference
| to it, or are you just Alan "decodes machine check exceptions in his head"
| Cox :) From the code it seems to be some kind of MCG status and MC0 status
| - but of course, I have no idea what that means...

Appendix E of
"IA-32 Intel Architecture Software Developer's Manual
Volume 3 : System Programming Guide" is
"INTERPRETING MACHINE-CHECK ERROR CODES".
You can download it from the developer.intel.com website.

Dave Jones has also begun a program called "parsemce".
You can get it at
http://www.codemonkey.org.uk/cruft/parsemce.c and
compile/run it.

| > That /proc setting should cause a reboot although after an MCE all
| > things are a little undefined
|
| After checking the logs (above) I found that the two times this has
| happened it has managed to write it to the logs. Is the fact that it
| sync()d a good indication that it will manage to reboot OK?
|
| Thanks,
| Corin

-- 
~Randy



* Re: bluesmoke, machine check exception, reboot
  2002-05-28 15:13     ` Randy.Dunlap
@ 2002-05-29 13:46       ` Corin Hartland-Swann
  2002-05-29 15:08         ` Dave Jones
  0 siblings, 1 reply; 7+ messages in thread
From: Corin Hartland-Swann @ 2002-05-29 13:46 UTC
  To: Randy.Dunlap, Dave Jones; +Cc: linux-kernel


Hi Randy/Dave,

On Tue, 28 May 2002, Randy.Dunlap wrote:
> On Tue, 28 May 2002, Corin Hartland-Swann wrote:
> | On 28 May 2002, Alan Cox wrote:
> | > On Tue, 2002-05-28 at 13:19, Corin Hartland-Swann wrote:
> | > > I have a Dual PIII-1000 running 2.4.18, and am occasionally getting the
> | > > following error:
> | > >
> | > > > CPU 1: Machine Check Exception: 0000000000000004
> | > > > Bank 1: f200000000000115
> | > > > Kernel panic: CPU context corrupt
> |
> | I just found another set of messages in the logs as well:
> |
> | CPU 1: Machine Check Exception: 0000000000000004
> | Bank 1: b200000000000115
> | Kernel panic: CPU context corrupt
> |
> | > > This results in a hard lock (unable to use magic SysRQ key to sync or
> | > > reboot, etc). I located these errors in arch/i386/kernel/bluesmoke.c in
> | > > the function intel_machine_check(). From what I have read on lkml it is
> | > > probably a result of the processor overheating and causing errors.
> | >
> | > It may even be a faulty processor. If you are running the processor to
> | > spec and your heatsink/fan/voltage all check out you may want to see
> | > about getting the CPU replaced. That's a data cache L1 read error, it
> | > appears.
> |
> | How do you work out what the numbers mean? Is there some kind of reference
> | to it, or are you just Alan "decodes machine check exceptions in his head"
> | Cox :) From the code it seems to be some kind of MCG status and MC0 status
> | - but of course, I have no idea what that means...
>
> Appendix E of
> "IA-32 Intel Architecture Software Developer's Manual
> Volume 3 : System Programming Guide" is
> "INTERPRETING MACHINE-CHECK ERROR CODES".
> You can download it from developer.intel.com website.
>
> Dave Jones has also begun a program called "parsemce".
> You can get it at
> http://www.codemonkey.org.uk/cruft/parsemce.c and
> compile/run it.

I ran it through parsemce as suggested (thanks Randy), and got the
following output:

Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(4): b200000000040151 @ 0
	External tag parity error
	Uncorrectable ECC error
	Correctable ECC error
	Address in addr register valid
	MISC register information valid
	Error overflow
	Memory heirarchy error
	Request: Generic error
	Transaction type : Instruction
	Memory/IO : Reserved

A question for Dave mainly - is this output valid considering that I am
running stock 2.4.18 (I hadn't applied the patch you've done to fix typos
in this - and now I've lost the damn thing)?

Does this output give any more information about possible causes?

Also http://marc.theaimsgroup.com/?l=linux-kernel&m=101338603328639&w=2
mentions a tool decodemca - is that a previous name for parsemce?

Thanks again,

Corin




* Re: bluesmoke, machine check exception, reboot
  2002-05-29 13:46       ` Corin Hartland-Swann
@ 2002-05-29 15:08         ` Dave Jones
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Jones @ 2002-05-29 15:08 UTC
  To: Corin Hartland-Swann; +Cc: Randy.Dunlap, linux-kernel

On Wed, May 29, 2002 at 02:46:51PM +0100, Corin Hartland-Swann wrote:
 > I ran it through parsemce as suggested (thanks Randy), and got the
 > following output:

parsemce lacks any command line parsing (I just haven't got around to it
yet), so you'll have to hack the values in the code at around line 200 to
match the values in your logs.
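
For readers without the tool to hand, here is a minimal stand-in sketch
(not parsemce itself) that takes the bank value as the hex string that
appears in the log and decodes only the low 16 bits, assuming they hold
a "memory hierarchy error" code of the form 0000 0001 RRRR TTLL as
described in the Intel manual. The table and function names are made up
for this example.

```python
# Sub-field encodings of the memory hierarchy error code (0000 0001 RRRR TTLL).
REQUEST = {
    0: "ERR (generic error)",
    1: "RD (generic read)",
    2: "WR (generic write)",
    3: "DRD (data read)",
    4: "DWR (data write)",
    5: "IRD (instruction fetch)",
    6: "PREFETCH",
    7: "EVICT",
    8: "SNOOP",
}
TRANSACTION = {0: "I (instruction)", 1: "D (data)", 2: "G (generic)"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG (generic)"}


def decode_memory_hierarchy(bank_status):
    """Decode bits 15:0 of a bank status value.

    Returns None when the code is not of the memory hierarchy form.
    """
    code = bank_status & 0xFFFF
    if code >> 8 != 0x01:
        return None
    return {
        "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
        "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
        "level": LEVEL[code & 0x3],
    }


# Bank values exactly as they appear in the logs quoted in this thread.
for logged in ("f200000000000115", "b200000000000115"):
    print(logged, decode_memory_hierarchy(int(logged, 16)))
```

Both logged values end in 0115, which this sketch reads as a generic
read, data transaction, level L1, in line with the "data cache L1 read
error" reading given earlier in the thread. Note that the parsebank line
in the quoted output shows b200000000040151 rather than the logged
b200000000000115, so the value entered into the tool is worth
double-checking.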

 > Also http://marc.theaimsgroup.com/?l=linux-kernel&m=101338603328639&w=2
 > mentions a tool decodemca - is that a previous name for parsemce?

At the time of that message no tool existed at all. Once I got it to
the state it's in now, I called it 'parse' instead of 'decode' for some
reason. There is no alternative tool, just a thinko on my part.

    Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs


Thread overview: 7+ messages
2002-05-28 12:19 bluesmoke, machine check exception, reboot Corin Hartland-Swann
2002-05-28 13:50 ` Alan Cox
2002-05-28 13:46   ` Corin Hartland-Swann
2002-05-28 15:09     ` Alan Cox
2002-05-28 15:13     ` Randy.Dunlap
2002-05-29 13:46       ` Corin Hartland-Swann
2002-05-29 15:08         ` Dave Jones
