linux-kernel.vger.kernel.org archive mirror
* Request for MCE decode (AMD Barcelona, fam 10h)
@ 2008-09-07  2:32 Tony Vroon
  2008-09-07  3:16 ` Jeroen van Rijn
  2008-09-08 10:55 ` Andi Kleen
  0 siblings, 2 replies; 11+ messages in thread
From: Tony Vroon @ 2008-09-07  2:32 UTC (permalink / raw)
  To: LKML

[-- Attachment #1: Type: text/plain, Size: 3767 bytes --]

On a Tyan-based system with intermittent but persistent instability, I
have finally received a message that something might actually be wrong
in hardware. Could you decode:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 4 MISC c000000001000000 
STATUS fa00002000020c0f MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 4 BANK 4 MISC c000000001000000 
STATUS fa00000000070f0f MCGSTATUS 0

This appeared while the 3Ware 9550SXU-8LP RAID controller reported a
disk corruption:
3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
Machine check events logged
3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=2, LBA=0x74907F9.

This is on:
Linux prometheus 2.6.27-rc5-00283-g70bb089 #1 SMP Sat Sep 6 13:52:51 BST 2008 x86_64 Quad-Core AMD Opteron(tm) Processor 2354 AuthenticAMD GNU/Linux

This is a 2x Opteron 2354 (so 8 core) system, on a Tyan S2915-E
mainboard with the v2.07 BIOS. The system is equipped with 16GB RAM,
populated as 8x Kingston KVR667D2D4P5/2G.

Configuration for the RAID controller, in case it is relevant:
/c0 Driver Version = 2.26.02.011
/c0 Model = 9550SXU-8LP
/c0 Available Memory = 112MB
/c0 Firmware Version = FE9X 3.08.00.029
/c0 Bios Version = BE9X 3.10.00.003
/c0 Boot Loader Version = BL9X 3.02.00.001
/c0 Serial Number = [scrubbed]
/c0 PCB Version = Rev 032
/c0 PCHIP Version = 1.60
/c0 ACHIP Version = 1.90
/c0 Number of Ports = 8
/c0 Number of Drives = 6
/c0 Number of Units = 1
/c0 Total Optimal Units = 1
/c0 Not Optimal Units = 0 
/c0 JBOD Export Policy = off
/c0 Disk Spinup Policy = 2
/c0 Spinup Stagger Time Policy (sec) = 1
/c0 Auto-Carving Policy = off
/c0 Auto-Carving Size = 2048 GB
/c0 Auto-Rebuild Policy = on
/c0 Controller Bus Type = PCIX
/c0 Controller Bus Width = 64 bits
/c0 Controller Bus Speed = 133 Mhz

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    VERIFYING      -       12      256K    3492.41   ON     OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     698.63 GB   1465149168   [scrubbed]      
p1     OK               u0     698.63 GB   1465149168   [scrubbed] 
p2     OK               u0     698.63 GB   1465149168   [scrubbed] 
p3     OK               u0     698.63 GB   1465149168   [scrubbed] 
p4     OK               u0     698.63 GB   1465149168   [scrubbed] 
p5     OK               u0     698.63 GB   1465149168   [scrubbed] 
p6     NOT-PRESENT      -      -           -             -
p7     NOT-PRESENT      -      -           -             -

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       255    04-Jun-2008

I have stripped the machine down to its bare minimum configuration, but
the instability continues. On an older BIOS, enabling the IOMMU option
in the BIOS seemed to cause hard crashes with an alarming frequency.
However, this all but disappeared in v2.05 and upwards.
As disabling the IOMMU option costs me RAM, I have not yet done so.
However, I will happily flip BIOS settings and run further tests for
you. I'm not getting any work done on this workstation and that needs to
stop.
I realize that the linux kernel may be entirely blameless in this
situation, but I'd like to have some peer insight before I run after
vendors.

Regards,
Tony V.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]
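Editor's note: the raw records in the report above follow a fixed two-line layout ("CPU n BANK n MISC x" followed by "STATUS x MCGSTATUS x"). Below is a minimal sketch, in Python, of pulling those fields out of pasted log text so they can be inspected as numbers. The names `RECORD`, `parse_mce_records`, and `REPORT` are hypothetical helpers made up for this illustration; they are not part of mcelog or the kernel, and the regular expression assumes only the exact layout shown in this thread.

```python
import re

# Assumed layout: the two-line record exactly as printed in the report above.
RECORD = re.compile(
    r"CPU\s+(?P<cpu>\d+)\s+BANK\s+(?P<bank>\d+)\s+MISC\s+(?P<misc>[0-9a-fA-F]+)\s+"
    r"STATUS\s+(?P<status>[0-9a-fA-F]+)\s+MCGSTATUS\s+(?P<mcgstatus>[0-9a-fA-F]+)"
)


def parse_mce_records(text):
    """Return one dict per 'CPU .. BANK .. MISC .. / STATUS .. MCGSTATUS ..' record."""
    records = []
    for m in RECORD.finditer(text):
        records.append({
            "cpu": int(m["cpu"]),
            "bank": int(m["bank"]),
            # The register values are printed as bare hexadecimal.
            "misc": int(m["misc"], 16),
            "status": int(m["status"], 16),
            "mcgstatus": int(m["mcgstatus"], 16),
        })
    return records


# The two events from the report, copied verbatim.
REPORT = """\
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 4 MISC c000000001000000
STATUS fa00002000020c0f MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 4 BANK 4 MISC c000000001000000
STATUS fa00000000070f0f MCGSTATUS 0
"""

if __name__ == "__main__":
    for rec in parse_mce_records(REPORT):
        print(f"cpu={rec['cpu']} bank={rec['bank']} status={rec['status']:#018x}")
```

The pattern matches across the line break because `\s+` also consumes newlines and trailing spaces. It will not match copies quoted with a leading "> " in replies; strip the quote prefix first if that is needed.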


* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-07  2:32 Request for MCE decode (AMD Barcelona, fam 10h) Tony Vroon
@ 2008-09-07  3:16 ` Jeroen van Rijn
  2008-09-07  4:22   ` Tony Vroon
  2008-09-08 10:55 ` Andi Kleen
  1 sibling, 1 reply; 11+ messages in thread
From: Jeroen van Rijn @ 2008-09-07  3:16 UTC (permalink / raw)
  To: Tony Vroon; +Cc: LKML

On Sun, Sep 7, 2008 at 4:32 AM, Tony Vroon <tony@vroon.org> wrote:
> On a Tyan-based system with intermittent but persistent instability, I
> have finally received a message that something might actually be wrong
> in hardware. Could you decode:
>
> MCE 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 BANK 4 MISC c000000001000000
> STATUS fa00002000020c0f MCGSTATUS 0
> MCE 1
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 4 BANK 4 MISC c000000001000000
> STATUS fa00000000070f0f MCGSTATUS 0

Hi Tony,

Not easily, and it's too late to parse
arch/x86/kernel/cpu/mcheck/mce_64.c and find out what it means before
I nod off. Still, before I sign off, have you tried running "mcelog
--ascii"? It needs to be run on the machine the check occurred on. It
might give you something to go on before the cavalry arrives.

Best regards,
  Jeroen.


* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-07  3:16 ` Jeroen van Rijn
@ 2008-09-07  4:22   ` Tony Vroon
  0 siblings, 0 replies; 11+ messages in thread
From: Tony Vroon @ 2008-09-07  4:22 UTC (permalink / raw)
  To: Jeroen van Rijn; +Cc: LKML

[-- Attachment #1: Type: text/plain, Size: 1306 bytes --]

On Sun, 2008-09-07 at 05:16 +0200, Jeroen van Rijn wrote:
> Still, before I sign off, have you tried running "mcelog
> --ascii"? It needs to be run on the machine the check occurred on. It
> might give you something to go on before the cavalry arrives.

That worked, thank you. Had to feed it back in as /dev/mcelog was empty,
but it made a bit more sense of it:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache        bit57 = processor context corrupt
       bit59 = misc error valid
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node observed, request didn't time out
      generic error mem transaction
      generic access, level generic'
STATUS fa00002000020c0f MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 4 0 data cache        bit57 = processor context corrupt
       bit59 = misc error valid
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'generic participation, request timed out
      generic error mem transaction
      generic access, level generic'
STATUS fa00000000070f0f MCGSTATUS 0

> Best regards,
>   Jeroen.

Regards,
Tony V.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]
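Editor's note: the flag lines in the mcelog output above (bit57, bit59, bit61, bit62) and the "bus error" text can be reproduced by hand from the STATUS values. The sketch below assumes the architectural x86 MCi_STATUS layout: flag bits 57 to 63, a model-specific code in bits 31:16, and a 16-bit error code in bits 15:0 whose bus-error form is 0000 1PPT RRRR IILL. The function name `decode_status` and the short field labels (SRC/RES/OBS/GEN and so on) are choices made for this illustration, not mcelog's wording, and the model-specific code is only extracted, not interpreted.

```python
# Architectural flag bits in the top byte of an x86 MCi_STATUS value.
STATUS_FLAGS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "error uncorrected",
    60: "error reporting enabled",
    59: "misc error valid",
    58: "error address valid",
    57: "processor context corrupt",
}

# Field encodings of the compound bus-error code, 0000 1PPT RRRR IILL.
PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]            # PP
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]                # RRRR
MEM_OR_IO = ["MEM", "reserved", "IO", "GEN"]            # II
LEVEL = ["L0", "L1", "L2", "LG"]                        # LL


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in STATUS_FLAGS.items()
                  if (status >> bit) & 1],
        # Bits 31:16 are model specific; returned raw, not interpreted here.
        "model_specific_code": (status >> 16) & 0xFFFF,
        "error_code": code,
    }
    if code >> 11 == 1:  # 0000 1xxx xxxx xxxx marks a bus error
        rrrr = (code >> 4) & 0xF
        result["bus_error"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timed_out": bool((code >> 8) & 0x1),
            "request": REQUEST[rrrr] if rrrr < len(REQUEST) else "reserved",
            "mem_or_io": MEM_OR_IO[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    for value in (0xfa00002000020c0f, 0xfa00000000070f0f):
        print(f"{value:#018x}", decode_status(value))
```

For the two values in this thread, the sketch reports the same flags mcelog printed (plus the valid and enabled bits), "OBS, no timeout" for the first event and "GEN, timed out" for the second, which lines up with "local node observed, request didn't time out" and "generic participation, request timed out" above.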


* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-07  2:32 Request for MCE decode (AMD Barcelona, fam 10h) Tony Vroon
  2008-09-07  3:16 ` Jeroen van Rijn
@ 2008-09-08 10:55 ` Andi Kleen
  2008-09-08 11:13   ` Jeroen van Rijn
                     ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Andi Kleen @ 2008-09-08 10:55 UTC (permalink / raw)
  To: Tony Vroon; +Cc: LKML

Tony Vroon <tony@vroon.org> writes:

> HARDWARE ERROR. This is *NOT* a software problem!
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Please contact your hardware vendor
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> I realize that the linux kernel may be entirely blameless in this
> situation,

It is, like mcelog told you.

> but I'd like to have some peer insight before I run after
> vendors.

It unfortunately turns out that mcelog logging is a tricky
psychological problem. What should the warning above have
looked like so that you would not have required "peer insight"
and actually just contacted your hardware vendor? 

Thank you.

-Andi (who wonders if <blink> tags in syslog would be useful 
to solve this)


* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-08 10:55 ` Andi Kleen
@ 2008-09-08 11:13   ` Jeroen van Rijn
  2008-09-08 12:22   ` Tony Vroon
  2008-09-08 13:55   ` Pavel Machek
  2 siblings, 0 replies; 11+ messages in thread
From: Jeroen van Rijn @ 2008-09-08 11:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tony Vroon, LKML

On Mon, Sep 8, 2008 at 12:55 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Tony Vroon <tony@vroon.org> writes:
>
>> HARDWARE ERROR. This is *NOT* a software problem!
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> Please contact your hardware vendor
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>> I realize that the linux kernel may be entirely blameless in this
>> situation,
>
> It is, like mcelog told you.
>
>> but I'd like to have some peer insight before I run after
>> vendors.
>
> It unfortunately turns out that mcelog logging is a tricky
> psychological problem. What should the warning above have
> looked like so that you would not have required "peer insight"
> and actually just contacted your hardware vendor?

I suppose mcelog might be extended to point at possible tools to get a
second opinion, in case the admin would like to be entirely
certain. In their position I can understand them when their vendor
asks them if it's the hardware and what tests they've run to rule out
software.

Think, for example, of a machine check that might point to faulty RAM:
mcelog could direct the admin to run memcheck if its output alone isn't
compelling enough.

> Thank you.
>
> -Andi (who wonders if <blink> tags in syslog would be useful
> to solve this)

Yikes, ixnay to the <blinkay>. Next people will ask for flash support
to get all-singing and -dancing error messages.

-- Jeroen.


* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-08 10:55 ` Andi Kleen
  2008-09-08 11:13   ` Jeroen van Rijn
@ 2008-09-08 12:22   ` Tony Vroon
  2008-09-08 14:04     ` Andi Kleen
  2008-09-08 13:55   ` Pavel Machek
  2 siblings, 1 reply; 11+ messages in thread
From: Tony Vroon @ 2008-09-08 12:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML

[-- Attachment #1: Type: text/plain, Size: 495 bytes --]

On Mon, 2008-09-08 at 12:55 +0200, Andi Kleen wrote:
> It unfortunately turns out that mcelog logging is a tricky
> psychological problem. What should the warning above have
> looked like so that you would not have required "peer insight"
> and actually just contacted your hardware vendor? 

Indication of the faulty part, so I know whether to contact AMD or Tyan.
Without a clear idea of which it could quickly turn into an infinite
redirect loop between the two.

Regards,
Tony V.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]


* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-08 10:55 ` Andi Kleen
  2008-09-08 11:13   ` Jeroen van Rijn
  2008-09-08 12:22   ` Tony Vroon
@ 2008-09-08 13:55   ` Pavel Machek
  2008-09-08 14:00     ` Andi Kleen
  2 siblings, 1 reply; 11+ messages in thread
From: Pavel Machek @ 2008-09-08 13:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tony Vroon, LKML

On Mon 2008-09-08 12:55:40, Andi Kleen wrote:
> Tony Vroon <tony@vroon.org> writes:
> 
> > HARDWARE ERROR. This is *NOT* a software problem!
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Please contact your hardware vendor
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> > I realize that the linux kernel may be entirely blameless in this
> > situation,
> 
> It is, like mcelog told you.

Ugh, actually this is not right. AFAIK MCEs can be triggered by stuff
like PCI aborts, which in turn can be caused by software.

If you really want me to contact hw vendor, you need to be a lot more
specific.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-08 13:55   ` Pavel Machek
@ 2008-09-08 14:00     ` Andi Kleen
  0 siblings, 0 replies; 11+ messages in thread
From: Andi Kleen @ 2008-09-08 14:00 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Andi Kleen, Tony Vroon, LKML

On Mon, Sep 08, 2008 at 03:55:58PM +0200, Pavel Machek wrote:
> On Mon 2008-09-08 12:55:40, Andi Kleen wrote:
> > Tony Vroon <tony@vroon.org> writes:
> > 
> > > HARDWARE ERROR. This is *NOT* a software problem!
> >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > Please contact your hardware vendor
> >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > > I realize that the linux kernel may be entirely blameless in this
> > > situation,
> > 
> > It is, like mcelog told you.
> 
> Ugh, actually this is not right. AFAIK MCEs can be triggered by stuff
> like PCI aborts, which in turn can be caused by software.

PCI aborts don't normally cause machine checks, no.

-Andi
-- 
ak@linux.intel.com


* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-08 12:22   ` Tony Vroon
@ 2008-09-08 14:04     ` Andi Kleen
  2008-09-08 15:52       ` Tony Vroon
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2008-09-08 14:04 UTC (permalink / raw)
  To: Tony Vroon; +Cc: Andi Kleen, LKML

On Mon, Sep 08, 2008 at 01:22:39PM +0100, Tony Vroon wrote:
> On Mon, 2008-09-08 at 12:55 +0200, Andi Kleen wrote:
> > It unfortunately turns out that mcelog logging is a tricky
> > psychological problem. What should the warning above have
> > looked like so that you would not have required "peer insight"
> > and actually just contacted your hardware vendor? 
> 
> Indication of the faulty part, so I know whether to contact AMD or Tyan.
> Without a clear idea of which it could quickly turn into an infinite
> redirect loop between the two.

Ok so you wanted linux-kernel to diagnose your hardware for you?

For DIMMs you can get that with --dmi if you run the latest mcelog
and if it's a memory problem. 

Unfortunately the BIOS vendors in their wisdom often deliver incorrect
DMI tables, so the information is not always very useful.
 
-Andi
-- 
ak@linux.intel.com



* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-08 14:04     ` Andi Kleen
@ 2008-09-08 15:52       ` Tony Vroon
  2008-09-08 16:25         ` Jeroen van Rijn
  0 siblings, 1 reply; 11+ messages in thread
From: Tony Vroon @ 2008-09-08 15:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML

[-- Attachment #1: Type: text/plain, Size: 620 bytes --]

On Mon, 2008-09-08 at 16:04 +0200, Andi Kleen wrote:
> Ok so you wanted linux-kernel to diagnose your hardware for you?

I was hoping for some help in narrowing it down, yes. Jeroen's reply was
very helpful, and more along the lines of what I was expecting. I have
contacted all vendors involved now, and it looks like the system RAM is
not fully compatible with the mainboard.

With regard to the message, I would suggest an alternative wording such
as this:

A hardware component in your system is failing.
Please contact your hardware vendor(s).
If unsure, contact your CPU vendor first.

Regards,
Tony V.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]


* Re: Request for MCE decode (AMD Barcelona, fam 10h)
  2008-09-08 15:52       ` Tony Vroon
@ 2008-09-08 16:25         ` Jeroen van Rijn
  0 siblings, 0 replies; 11+ messages in thread
From: Jeroen van Rijn @ 2008-09-08 16:25 UTC (permalink / raw)
  To: Tony Vroon; +Cc: Andi Kleen, LKML

On Mon, Sep 8, 2008 at 5:52 PM, Tony Vroon <tony@vroon.org> wrote:
> On Mon, 2008-09-08 at 16:04 +0200, Andi Kleen wrote:
>> Ok so you wanted linux-kernel to diagnose your hardware for you?
>
> I was hoping for some help in narrowing it down, yes. Jeroen's reply was
> very helpful, and more along the lines of what I was expecting. I have
> contacted all vendors involved now, and it looks like the system RAM is
> not fully compatible with the mainboard.

I'm happy to see you've got it narrowed down enough.

> With regard to the message, I would suggest an alternative wording such
> as this:
>
> A hardware component in your system is failing.
> Please contact your hardware vendor(s).
> If unsure, contact your CPU vendor first.

Or this:
"A hardware component in your system is failing.
[insert specific bit if MCE is certain enough about what part]
If you can, try to narrow it down by placing it in another mainboard
(assuming you have one available), or run [memcheck, another tool].
Then contact the hardware vendor(s) in question, if uncertain, try
your CPU vendor first."

> Regards,
> Tony V.

Jeroen.


end of thread, other threads:[~2008-09-08 16:25 UTC | newest]

Thread overview: 11+ messages
2008-09-07  2:32 Request for MCE decode (AMD Barcelona, fam 10h) Tony Vroon
2008-09-07  3:16 ` Jeroen van Rijn
2008-09-07  4:22   ` Tony Vroon
2008-09-08 10:55 ` Andi Kleen
2008-09-08 11:13   ` Jeroen van Rijn
2008-09-08 12:22   ` Tony Vroon
2008-09-08 14:04     ` Andi Kleen
2008-09-08 15:52       ` Tony Vroon
2008-09-08 16:25         ` Jeroen van Rijn
2008-09-08 13:55   ` Pavel Machek
2008-09-08 14:00     ` Andi Kleen
