linux-kernel.vger.kernel.org archive mirror
* MCEs
@ 2008-10-24 12:45 Felix von Leitner
  2008-10-24 16:23 ` MCEs Tony Vroon
  2008-10-24 18:04 ` MCEs Andi Kleen
  0 siblings, 2 replies; 7+ messages in thread
From: Felix von Leitner @ 2008-10-24 12:45 UTC (permalink / raw)
  To: Linux Kernel Mailing list

I am getting frequent MCEs (machine check exceptions) on my Linux desktop
when I am encoding TV recordings to H.264 using mencoder.  It is a
dual-core box running 2.6.27 (but I have had the problem for a while now).

This is the kind of MCE that freezes the box and causes a panic.  The
trace does not end up in syslog.  I found a program called mcelog which
I am supposed to call regularly from cron, but how can that help me when
the first MCE I get insta-panics the box?

Now the most common causes for MCEs are apparently heat issues and bad
memory.  I can rule out both.  Could this be an artifact of some bad
ACPI tables?

How do you debug this kind of problem?

Thanks,

Felix


* Re: MCEs
  2008-10-24 12:45 MCEs Felix von Leitner
@ 2008-10-24 16:23 ` Tony Vroon
  2008-10-24 18:04 ` MCEs Andi Kleen
  1 sibling, 0 replies; 7+ messages in thread
From: Tony Vroon @ 2008-10-24 16:23 UTC (permalink / raw)
  To: Felix von Leitner; +Cc: Linux Kernel Mailing list


On Fri, 2008-10-24 at 14:45 +0200, Felix von Leitner wrote:
> Now the most common causes for MCEs are apparently heat issues and bad
> memory.  I can rule out both.

Are you sure? I have had MCEs and instability for a while now, and using
mcelog --k8 --dmi /dev/mcelog

I finally got a clear "this component is at fault" message, pinpointing
DIMM 4 on CPU 2. I shuffled the DIMMs around and then used the machine
again.
The message shifted with the DIMM, to DIMM 1 on CPU 1. Memtest86+
doesn't appear to stress the hardware enough to provoke single or
multi-bit errors though. (So a few successful passes in memtest86+ do
not rule out a RAM problem.)

Temperatures can also get high at locations in the machine that have no
sensors (specifically voltage regulators). To check for heat problems
you could run your tower case lying on its side on the floor, so hot air
rises up past the PCI/PCIe cards instead of getting trapped underneath
them.

Note that LKML isn't the friendliest place to get help debugging MCEs, as
an MCE will be considered a hardware fault and thus off-topic.
Consider an MCE like a 'check engine' light in your car. It doesn't tell
you what's wrong, just that it's bad and should be investigated.

Regards,
Tony V.



* Re: MCEs
  2008-10-24 12:45 MCEs Felix von Leitner
  2008-10-24 16:23 ` MCEs Tony Vroon
@ 2008-10-24 18:04 ` Andi Kleen
  2008-10-24 21:44   ` MCEs Felix von Leitner
  2008-10-24 23:07   ` MCEs Felix von Leitner
  1 sibling, 2 replies; 7+ messages in thread
From: Andi Kleen @ 2008-10-24 18:04 UTC (permalink / raw)
  To: Felix von Leitner; +Cc: Linux Kernel Mailing list

Felix von Leitner <felix-linuxkernel@fefe.de> writes:

> This is the kind of MCE that freezes the box and causes a panic.  The
> trace does not end up in syslog.  I found a program called mcelog which
> I am supposed to call regularly from cron, but how can that help me when
> the first MCE I get insta-panics the box?

When you do a warm boot (not power cycle, but reset button or 
panic=30) then the panic mce will be logged after reboot.
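
A minimal sketch of how one might check that the automatic reboot is
actually configured, assuming the timeout was passed as panic=N on the
kernel command line. The function name panic_timeout is hypothetical and
only for illustration; the running value can also be read from
/proc/sys/kernel/panic.

```python
def panic_timeout(cmdline: str):
    """Return N from the last panic=N token on a kernel command line.

    Returns None when no panic= parameter is present or its value is
    not an integer.
    """
    timeout = None
    for token in cmdline.split():
        if token.startswith("panic="):
            try:
                timeout = int(token[len("panic="):])
            except ValueError:
                # Ignore malformed values and keep looking.
                pass
    return timeout
```

On a running system this could be called as
panic_timeout(open("/proc/cmdline").read()). A positive result means the
kernel reboots itself that many seconds after a panic, which gives the warm
boot described above without touching the power switch.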

> Now the most common causes for MCEs are apparently heat issues and bad
> memory.  I can rule out both.  Could this be an artifact of some bad
> ACPI tables?
>
> How do you debug this kind of problem?

It's some sort of hardware problem, debugging it typically
either involves fixing the cooling or exchanging components.

-Andi



* Re: MCEs
  2008-10-24 18:04 ` MCEs Andi Kleen
@ 2008-10-24 21:44   ` Felix von Leitner
  2008-10-24 23:07   ` MCEs Felix von Leitner
  1 sibling, 0 replies; 7+ messages in thread
From: Felix von Leitner @ 2008-10-24 21:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel Mailing list

> > This is the kind of MCE that freezes the box and causes a panic.  The
> > trace does not end up in syslog.  I found a program called mcelog which
> > I am supposed to call regularly from cron, but how can that help me when
> > the first MCE I get insta-panics the box?
> When you do a warm boot (not power cycle, but reset button or 
> panic=30) then the panic mce will be logged after reboot.

Maybe I'm doing something wrong here.
I run mcelog from the shell after the boot.
Nothing happens, I get the shell prompt right back.

How do I know it's really an MCE?  So far my symptoms are: the machine
freezes, then a panic dump scrolls by, and the only text I can see on my
screen is the stack dump, which contains something along the lines of
machine_check().  Nothing shows up in syslog; probably the kernel
decides the box is too hosed to log anything.

Felix


* Re: MCEs
  2008-10-24 18:04 ` MCEs Andi Kleen
  2008-10-24 21:44   ` MCEs Felix von Leitner
@ 2008-10-24 23:07   ` Felix von Leitner
  2008-10-25  7:00     ` MCEs Andi Kleen
  1 sibling, 1 reply; 7+ messages in thread
From: Felix von Leitner @ 2008-10-24 23:07 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel Mailing list

> It's some sort of hardware problem, debugging it typically
> either involves fixing the cooling or exchanging components.

Using an older kernel I get actual MCE messages that I can write down
and decode using mcelog.

It looks like the MCE handler is buggy in current kernels and will cause
a panic instead of the MCE messages (or maybe in addition to and the
panic obscures them by scrolling them out of my 80x25 screen).

Is there a howto or wiki somewhere on how to parse mcelog decoded
messages?

Felix


* Re: MCEs
  2008-10-24 23:07   ` MCEs Felix von Leitner
@ 2008-10-25  7:00     ` Andi Kleen
  2008-10-25 10:05       ` MCEs Felix von Leitner
  0 siblings, 1 reply; 7+ messages in thread
From: Andi Kleen @ 2008-10-25  7:00 UTC (permalink / raw)
  To: Felix von Leitner; +Cc: Andi Kleen, Linux Kernel Mailing list

On Sat, Oct 25, 2008 at 01:07:48AM +0200, Felix von Leitner wrote:
> > It's some sort of hardware problem, debugging it typically
> > either involves fixing the cooling or exchanging components.
> 
> Using an older kernel I get actual MCE messages that I can write down
> and decode using mcelog.

You mean when you switch back to the old kernel it's recoverable
and then switch to the new kernel it is not?

If yes then bisect it please.

> 
> It looks like the MCE handler is buggy in current kernels and will cause
> a panic instead of the MCE messages (or maybe in addition to and the

The CPU tells the kernel if a machine check should cause a panic or not.
Usually when you actually get an MC _E_xception it's panic time.
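
As a rough illustration of what the CPU reports, the sketch below decodes
the architectural flag bits of a machine check bank status register
(MCi_STATUS) as laid out in the x86 vendor manuals. The function name and
the example value are made up for illustration, and the kernel's real
panic decision also looks at other state, so treat this as an aid for
reading a written-down STATUS value, not as the kernel's logic.

```python
# Architectural flag bits in the top byte of an x86 MCi_STATUS register.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds valid error information
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds the address of the error
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [
        name
        for bit, name in sorted(MCI_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        # Bits 15:0 carry the architectural MCA error code.
        "mca_error_code": status & 0xFFFF,
        # Bits 31:16 carry a model-specific error code.
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


# Hypothetical value, not taken from any real log.
print(decode_mci_status(0xB200000000000185))
```

An uncorrected error (UC) with corrupt processor context (PCC) is the
kind of event from which execution cannot safely continue.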

> panic obscures them by scrolling them out of my 80x25 screen).

Panic should be one line, unless you're bitten by the 2.6.27 
smp_call_function in panic bug (see 
http://bugzilla.kernel.org/show_bug.cgi?id=11569)

Please post concrete logs from a serial or netconsole.
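
For netconsole, the panicking machine sends its console output as UDP
datagrams to a second machine, so all that is needed on the receiving
side is something that listens and writes to a file. A minimal sketch of
such a receiver follows, assuming netconsole's default target port 6666;
the function names and the log file name are hypothetical.

```python
import socket


def open_listener(host: str = "0.0.0.0", port: int = 6666) -> socket.socket:
    """Bind a UDP socket on the port netconsole is configured to target."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_messages(sock: socket.socket, logfile, count=None) -> int:
    """Append received datagrams to logfile.

    Stops after `count` datagrams, or runs until interrupted when
    count is None. Returns the number of datagrams written.
    """
    received = 0
    while count is None or received < count:
        data, _addr = sock.recvfrom(65535)
        logfile.write(data.decode("utf-8", errors="replace"))
        # Flush immediately so nothing is lost if the listener is killed.
        logfile.flush()
        received += 1
    return received
```

To capture a panic, open a file such as netconsole.log in append mode on
the second machine and pass it to receive_messages(open_listener(), f)
before starting the encoding job on the affected box.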

-Andi

-- 
ak@linux.intel.com


* Re: MCEs
  2008-10-25  7:00     ` MCEs Andi Kleen
@ 2008-10-25 10:05       ` Felix von Leitner
  0 siblings, 0 replies; 7+ messages in thread
From: Felix von Leitner @ 2008-10-25 10:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel Mailing list

Thus spake Andi Kleen (andi@firstfloor.org):
> > Using an older kernel I get actual MCE messages that I can write down
> > and decode using mcelog.
> You mean when you switch back to the old kernel it's recoverable
> and then switch to the new kernel it is not?

No, it's not recoverable, but I get the error messages and can write them
down.  Previously I just got the stack dump.

> > panic obscures them by scrolling them out of my 80x25 screen).
> Panic should be one line, unless you're bitten by the 2.6.27 
> smp_call_function in panic bug (see 
> http://bugzilla.kernel.org/show_bug.cgi?id=11569)

That's exactly what happened.

Felix

