linux-kernel.vger.kernel.org archive mirror
* Machine check exception? (2.4.6+SMP+VIA)
@ 2001-07-07 20:32 Vibol Hou
  2001-07-07 20:54 ` H. Peter Anvin
  2001-07-07 21:41 ` Alan Cox
  0 siblings, 2 replies; 12+ messages in thread
From: Vibol Hou @ 2001-07-07 20:32 UTC (permalink / raw)
  To: Linux-Kernel

Hi,

I was running 2.4.6-stable in SMP mode on a dual P3-1GHz machine (VIA 694D
Chipset / MSI-6321 M/B + ) and the following message popped up after which
the system hardlocked (no SysRQ input).  What does this message mean?

CPU 1: Machine Check Exception: 0000000000000004
Bank 1: b200000000000115
Kernel panic: CPU context corrupt

Message from syslogd@delta at Sat Jul  7 13:18:36 2001 ...
delta kernel: CPU 1: Machine Check Exception: 0000000000000004

Message from syslogd@delta at Sat Jul  7 13:18:36 2001 ...
delta kernel: Bank 1: b200000000000115

Message from syslogd@delta at Sat Jul  7 13:18:36 2001 ...
delta kernel: Kernel panic: CPU context corrupt

--
Vibol Hou



* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-07 20:32 Machine check exception? (2.4.6+SMP+VIA) Vibol Hou
@ 2001-07-07 20:54 ` H. Peter Anvin
  2001-07-07 21:41 ` Alan Cox
  1 sibling, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2001-07-07 20:54 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <HDEBKHLDKIDOBMHPKDDKEEAIEMAA.vhou@khmer.cc>
By author:    "Vibol Hou" <vhou@khmer.cc>
In newsgroup: linux.dev.kernel
>
> Hi,
> 
> I was running 2.4.6-stable in SMP mode on a dual P3-1GHz machine (VIA 694D
> Chipset / MSI-6321 M/B + ) and the following message popped up after which
> the system hardlocked (no SysRQ input).  What does this message mean?
> 
> CPU 1: Machine Check Exception: 0000000000000004
> Bank 1: b200000000000115
> Kernel panic: CPU context corrupt
> 
> Message from syslogd@delta at Sat Jul  7 13:18:36 2001 ...
> delta kernel: CPU 1: Machine Check Exception: 0000000000000004
> 
> Message from syslogd@delta at Sat Jul  7 13:18:36 2001 ...
> delta kernel: Bank 1: b200000000000115
> 
> Message from syslogd@delta at Sat Jul  7 13:18:36 2001 ...
> delta kernel: Kernel panic: CPU context corrupt
> 

It means your hardware is bad.

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt


* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-07 20:32 Machine check exception? (2.4.6+SMP+VIA) Vibol Hou
  2001-07-07 20:54 ` H. Peter Anvin
@ 2001-07-07 21:41 ` Alan Cox
  2001-07-08  7:28   ` Chris Wedgwood
  1 sibling, 1 reply; 12+ messages in thread
From: Alan Cox @ 2001-07-07 21:41 UTC (permalink / raw)
  To: Vibol Hou; +Cc: Linux-Kernel

> I was running 2.4.6-stable in SMP mode on a dual P3-1GHz machine (VIA 694D
> Chipset / MSI-6321 M/B + ) and the following message popped up after which
> the system hardlocked (no SysRQ input).  What does this message mean?
> 
> CPU 1: Machine Check Exception: 0000000000000004
> Bank 1: b200000000000115
> Kernel panic: CPU context corrupt

It means your processor flagged a fault. The b2....115 number decodes to info
about the fault cause if you grab the PIII manual.

Stupid things like overheating or wrong voltages can also trigger it.
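
For illustration, the first number in the panic (0000000000000004) appears to be the global machine-check status value (MCG_STATUS in Intel's manuals). A minimal sketch of reading it, assuming the documented layout of its three low flag bits (RIPV, EIPV, MCIP); the struct and function names are hypothetical, not taken from the kernel:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: the three low flag bits of MCG_STATUS. */
struct mcg_status_flags {
    bool ripv;  /* bit 0: restart instruction pointer is valid */
    bool eipv;  /* bit 1: error instruction pointer is valid */
    bool mcip;  /* bit 2: a machine check is in progress */
};

/* Split a raw 64-bit MCG_STATUS value into its flag bits. */
static struct mcg_status_flags decode_mcg_status(uint64_t value)
{
    struct mcg_status_flags f;

    f.ripv = (value >> 0) & 1;
    f.eipv = (value >> 1) & 1;
    f.mcip = (value >> 2) & 1;
    return f;
}
```

Calling decode_mcg_status(0x0000000000000004ULL) gives MCIP set with RIPV and EIPV clear, i.e. a machine check was in progress and the CPU did not report a valid point at which to restart.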



* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-07 21:41 ` Alan Cox
@ 2001-07-08  7:28   ` Chris Wedgwood
  2001-07-08 14:00     ` Alan Cox
  2001-07-08 17:32     ` Vibol Hou
  0 siblings, 2 replies; 12+ messages in thread
From: Chris Wedgwood @ 2001-07-08  7:28 UTC (permalink / raw)
  To: Alan Cox; +Cc: Vibol Hou, Linux-Kernel

On Sat, Jul 07, 2001 at 10:41:23PM +0100, Alan Cox wrote:

    It means your processor flagged a fault. The b2....115 number
    decodes to info about the fault cause if you grab the PIII manual.

    Stupid things like overheating or wrong voltages can also trigger it.

Is there any reason why, with proper MCE checking for both K7 and PIII
we can't automatically off-line processors when they start doing bad
things?

Sure, it's a pretty lousy thing to do, but if it buys you a few
minutes and allows userland to initiate some kind of remedy
(pager("HELP"); system("shutdown"); sort of thing)...

Also, I'm pretty sure I was seeing overheating problems or something
on a K7 at one point, but never saw MCE; I take it this code only
exists fully in -ac kernels? I looked in Linus' tree and couldn't see
anything.




  --cw


* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-08  7:28   ` Chris Wedgwood
@ 2001-07-08 14:00     ` Alan Cox
  2001-07-08 15:33       ` Dave Jones
  2001-07-08 17:32     ` Vibol Hou
  1 sibling, 1 reply; 12+ messages in thread
From: Alan Cox @ 2001-07-08 14:00 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Alan Cox, Vibol Hou, Linux-Kernel

> Is there any reason why, with proper MCE checking for both K7 and PIII
> we can't automatically off-line processors when they start doing bad
> things?

Architectural limitations. It's entirely possible that the cache of the dying
processor contains exclusive copies of arbitrary data.

> Also, I'm pretty sure I was seeing overheating problems or something
> on a K7 at one point, but never saw MCE; I take it this code only
> exists fully in -ac kernels? I looked in Linus' tree and couldn't see
> anything.

Only -ac has K7 MCE enabled right now - also MCE is not guaranteed to catch
problems.



* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-08 14:00     ` Alan Cox
@ 2001-07-08 15:33       ` Dave Jones
  2001-07-08 17:04         ` Chris Wedgwood
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Jones @ 2001-07-08 15:33 UTC (permalink / raw)
  To: Alan Cox; +Cc: Chris Wedgwood, Vibol Hou, Linux-Kernel

On Sun, 8 Jul 2001, Alan Cox wrote:

> Only -ac has K7 MCE enabled right now - also MCE is not guaranteed to catch
> problems.

Actually you merged that with Linus a few revisions back iirc.

regards,

Dave.

-- 
| Dave Jones.        http://www.suse.de/~davej
| SuSE Labs



* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-08 15:33       ` Dave Jones
@ 2001-07-08 17:04         ` Chris Wedgwood
  2001-07-08 17:09           ` Dave Jones
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Wedgwood @ 2001-07-08 17:04 UTC (permalink / raw)
  To: Dave Jones; +Cc: Alan Cox, Vibol Hou, Linux-Kernel


On Sun, Jul 08, 2001 at 05:33:59PM +0200, Dave Jones wrote:

    Actually you merged that with Linus a few revisions back iirc.

I don't see it for K7/AMD:

cw:tty5@tapu(kernel)$ pwd
/home/cw/wk/linux/linux-2.4.7-pre2+O_DIRECT/arch/i386/kernel

cw:tty5@tapu(kernel)$ grep machine_check\(struct\ pt bluesmoke.c
static void intel_machine_check(struct pt_regs * regs, long error_code)
static void pentium_machine_check(struct pt_regs * regs, long error_code)
static void winchip_machine_check(struct pt_regs * regs, long error_code)
static void unexpected_machine_check(struct pt_regs * regs, long error_code)
void do_machine_check(struct pt_regs * regs, long error_code)




  --cw


* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-08 17:04         ` Chris Wedgwood
@ 2001-07-08 17:09           ` Dave Jones
  2001-07-08 17:18             ` Chris Wedgwood
  2001-07-08 20:39             ` H. Peter Anvin
  0 siblings, 2 replies; 12+ messages in thread
From: Dave Jones @ 2001-07-08 17:09 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Alan Cox, Vibol Hou, Linux-Kernel

On Mon, 9 Jul 2001, Chris Wedgwood wrote:

>     Actually you merged that with Linus a few revisions back iirc.
> I don't see it for K7/AMD:

> cw:tty5@tapu(kernel)$ grep machine_check\(struct\ pt bluesmoke.c
> static void intel_machine_check(struct pt_regs * regs, long error_code)

There is no K7 specific implementation. It's the same as the Intel MSRs.

From the comment in the file:

        case X86_VENDOR_AMD:
            /*
             *  AMD K7 machine check is Intel like
             */
            if(c->x86 == 6)
                intel_mcheck_init(c);
            break;


regards,

Dave.

-- 
| Dave Jones.        http://www.suse.de/~davej
| SuSE Labs



* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-08 17:09           ` Dave Jones
@ 2001-07-08 17:18             ` Chris Wedgwood
  2001-07-08 20:39             ` H. Peter Anvin
  1 sibling, 0 replies; 12+ messages in thread
From: Chris Wedgwood @ 2001-07-08 17:18 UTC (permalink / raw)
  To: Dave Jones; +Cc: Alan Cox, Vibol Hou, Linux-Kernel

On Sun, Jul 08, 2001 at 07:09:11PM +0200, Dave Jones wrote:

    There is no K7 specific implementation. It's the same as the Intel
    MSRs.

Ah thanks, missed that.


  --cw


* RE: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-08  7:28   ` Chris Wedgwood
  2001-07-08 14:00     ` Alan Cox
@ 2001-07-08 17:32     ` Vibol Hou
  2001-07-08 20:40       ` H. Peter Anvin
  1 sibling, 1 reply; 12+ messages in thread
From: Vibol Hou @ 2001-07-08 17:32 UTC (permalink / raw)
  To: Chris Wedgwood, Alan Cox; +Cc: Vibol Hou, Linux-Kernel

Hrm,

First off, thanks for the direction Alan, Peter, and Chris.

So I've flipped through the Intel docs and read up on the MCA for P2/3
processors.  I've decoded the info from the MC0_STATUS register that was
given to me in the Bank 1: b200000000000115 line.  The 0115 MCA code
indicates a DCACHEL1_RD error, so it seems the L1 cache is bad.  This does
not seem to be heat-related, though, since lm_sensors indicates similar
temperature readings for both CPUs (within 0.3 degrees Celsius of each
other, at around 30 C).
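
The decode described above can be sketched roughly as follows. This assumes the bank status layout from Intel's manuals (flag bits 63..57 and the MCA error code in bits 15:0) and the compound "cache hierarchy error" format 0000 0001 RRRR TTLL; all struct and function names are hypothetical, chosen only for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: fields of one machine-check bank status value. */
struct mci_status_fields {
    bool val;          /* bit 63: register contents are valid */
    bool over;         /* bit 62: an earlier error was overwritten */
    bool uc;           /* bit 61: error was not corrected */
    bool en;           /* bit 60: reporting was enabled for this error */
    bool miscv;        /* bit 59: the MISC register holds more info */
    bool addrv;        /* bit 58: the ADDR register holds an address */
    bool pcc;          /* bit 57: processor context corrupt */
    uint16_t mca_code; /* bits 15:0: MCA error code */
};

static struct mci_status_fields decode_mci_status(uint64_t s)
{
    struct mci_status_fields f;

    f.val   = (s >> 63) & 1;
    f.over  = (s >> 62) & 1;
    f.uc    = (s >> 61) & 1;
    f.en    = (s >> 60) & 1;
    f.miscv = (s >> 59) & 1;
    f.addrv = (s >> 58) & 1;
    f.pcc   = (s >> 57) & 1;
    f.mca_code = (uint16_t)(s & 0xFFFF);
    return f;
}

/* True if the code has the cache hierarchy form 0000 0001 RRRR TTLL. */
static bool is_cache_hierarchy_error(uint16_t code)
{
    return (code & 0xFF00) == 0x0100;
}

/* RRRR (bits 7:4): the kind of request that hit the error. */
static const char *cache_request(uint16_t code)
{
    static const char *const names[] = {
        "ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"
    };
    unsigned int r = (code >> 4) & 0xF;

    return r < sizeof(names) / sizeof(names[0]) ? names[r] : "?";
}

/* TT (bits 3:2): I = instruction, D = data, G = generic. */
static const char *cache_type(uint16_t code)
{
    static const char *const names[] = { "I", "D", "G", "?" };

    return names[(code >> 2) & 0x3];
}

/* LL (bits 1:0): cache level, LG = generic. */
static const char *cache_level(uint16_t code)
{
    static const char *const names[] = { "L0", "L1", "L2", "LG" };

    return names[code & 0x3];
}
```

Fed the value from the panic, 0xb200000000000115, this gives VAL, UC, EN and PCC set and an MCA code of 0x0115, which splits into RD / D / L1, consistent with the DCACHEL1_RD reading.  The PCC bit set also matches the "CPU context corrupt" panic text.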

That probably explains why the system hardlocked quickly each time there was
heavy I/O and processing with SMP enabled and the full 1GB of memory
installed.  Only after I removed one of the memory sticks did the system
begin spitting out oopses and MCEs.

I also wonder, however, if this could be due to the 2nd processor not
getting enough voltage.  I don't know the S-SPEC of the processor, but I
think it's the same as the 1st.  However, the voltage reading for CPU 2 is
.05v lower at 1.65v.  Any processor gurus here?

Thanks,
Vibol


* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-08 17:09           ` Dave Jones
  2001-07-08 17:18             ` Chris Wedgwood
@ 2001-07-08 20:39             ` H. Peter Anvin
  1 sibling, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2001-07-08 20:39 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.30.0107081907440.28660-100000@Appserv.suse.de>
By author:    Dave Jones <davej@suse.de>
In newsgroup: linux.dev.kernel
>
> On Mon, 9 Jul 2001, Chris Wedgwood wrote:
> 
> >     Actually you merged that with Linus a few revisions back iirc.
> > I don't see it for K7/AMD:
> 
> > cw:tty5@tapu(kernel)$ grep machine_check\(struct\ pt bluesmoke.c
> > static void intel_machine_check(struct pt_regs * regs, long error_code)
> 
> There is no K7 specific implementation. It's the same as the Intel MSRs.
> 
> From the comment in the file:
> 
>         case X86_VENDOR_AMD:
>             /*
>              *  AMD K7 machine check is Intel like
>              */
>             if(c->x86 == 6)
>                 intel_mcheck_init(c);
>             break;
> 
> 

Note that I released a patch to make bluesmoke a lot more generic
quite a while ago.  Linus was in the "I don't want to even hear about
anything but critical bugfixes" mode at that point, so it didn't get
integrated.

If anyone is interested, it is at:

http://www.kernel.org/pub/linux/kernel/people/hpa/bluesmoke-2.4.0-test11-pre5-3.diff.gz

Let me know if you want me to bring it forward.

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt


* Re: Machine check exception? (2.4.6+SMP+VIA)
  2001-07-08 17:32     ` Vibol Hou
@ 2001-07-08 20:40       ` H. Peter Anvin
  0 siblings, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2001-07-08 20:40 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <NDBBKKONDOBLNCIOPCGHIEKOIKAA.vhou@khmer.cc>
By author:    "Vibol Hou" <vhou@khmer.cc>
In newsgroup: linux.dev.kernel
> 
> I also wonder, however, if this could be due to the 2nd processor not
> getting enough voltage.  I don't know the S-SPEC of the processor, but I
> think it's the same as the 1st.  However, the voltage reading for CPU 2 is
> .05v lower at 1.65v.  Any processor gurus here?
> 

That sounds a bit suspicious indeed, and could certainly cause that
kind of error.

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
