linux-kernel.vger.kernel.org archive mirror
From: Tony Vroon <tony@vroon.org>
To: LKML <linux-kernel@vger.kernel.org>
Subject: Request for MCE decode (AMD Barcelona, fam 10h)
Date: Sun, 07 Sep 2008 03:32:22 +0100
Message-ID: <1220754742.8530.12.camel@localhost>

On a Tyan-based system with intermittent but persistent instability, I
have finally received a message that something might actually be wrong
in hardware. Could you decode:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 4 MISC c000000001000000 
STATUS fa00002000020c0f MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 4 BANK 4 MISC c000000001000000 
STATUS fa00000000070f0f MCGSTATUS 0
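
As a starting point for the decode, the two STATUS values above can be split into the architectural x86 machine-check fields. The following is a minimal sketch, not a full decoder: it assumes the standard MCi_STATUS layout (VAL in bit 63 down to PCC in bit 57, MCA error code in bits 15:0), and it assumes the extended error code sits in bits 20:16, which should be checked against AMD's BKDG for family 10h. The function name `decode_mci_status` is hypothetical.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return {
        "val": bit(63),    # register holds a valid error
        "over": bit(62),   # an earlier error was overwritten
        "uc": bit(61),     # error was not corrected by hardware
        "en": bit(60),     # reporting was enabled for this error
        "miscv": bit(59),  # MISC register holds valid data
        "addrv": bit(58),  # ADDR register holds a valid address
        "pcc": bit(57),    # processor context may be corrupt
        # Assumption: extended error code in bits 20:16 (family 10h, bank 4).
        "ext_error_code": (status >> 16) & 0x1F,
        "mca_error_code": status & 0xFFFF,
    }


FLAG_NAMES = ("val", "over", "uc", "en", "miscv", "addrv", "pcc")

# The two STATUS values from the log above, keyed by reporting CPU.
for cpu, status in ((0, 0xFA00002000020C0F), (4, 0xFA00000000070F0F)):
    fields = decode_mci_status(status)
    flags = ",".join(name.upper() for name in FLAG_NAMES if fields[name])
    print(
        f"CPU {cpu}: flags={flags} "
        f"ext=0x{fields['ext_error_code']:02x} "
        f"code=0x{fields['mca_error_code']:04x}"
    )
```

Under these assumptions both events have VAL, OVER, UC, EN, MISCV and PCC set and ADDRV clear, so no fault address was recorded. The extended codes come out as 0x02 and 0x07, with MCA error codes 0x0c0f and 0x0f0f. Mapping those codes to named error types requires the vendor documentation.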

This appeared while the 3Ware 9550SXU-8LP RAID controller was reporting
disk corruption:
3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
Machine check events logged
3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=2, LBA=0x74907F9.

This is on:
Linux prometheus 2.6.27-rc5-00283-g70bb089 #1 SMP Sat Sep 6 13:52:51 BST 2008 x86_64 Quad-Core AMD Opteron(tm) Processor 2354 AuthenticAMD GNU/Linux

This is a 2x Opteron 2354 (so 8 core) system, on a Tyan S2915-E
mainboard with the v2.07 BIOS. The system is equipped with 16GB RAM,
populated as 8x Kingston KVR667D2D4P5/2G.

Configuration for the RAID controller, in case it is relevant:
/c0 Driver Version = 2.26.02.011
/c0 Model = 9550SXU-8LP
/c0 Available Memory = 112MB
/c0 Firmware Version = FE9X 3.08.00.029
/c0 Bios Version = BE9X 3.10.00.003
/c0 Boot Loader Version = BL9X 3.02.00.001
/c0 Serial Number = [scrubbed]
/c0 PCB Version = Rev 032
/c0 PCHIP Version = 1.60
/c0 ACHIP Version = 1.90
/c0 Number of Ports = 8
/c0 Number of Drives = 6
/c0 Number of Units = 1
/c0 Total Optimal Units = 1
/c0 Not Optimal Units = 0 
/c0 JBOD Export Policy = off
/c0 Disk Spinup Policy = 2
/c0 Spinup Stagger Time Policy (sec) = 1
/c0 Auto-Carving Policy = off
/c0 Auto-Carving Size = 2048 GB
/c0 Auto-Rebuild Policy = on
/c0 Controller Bus Type = PCIX
/c0 Controller Bus Width = 64 bits
/c0 Controller Bus Speed = 133 Mhz

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    VERIFYING      -       12      256K    3492.41   ON     OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     698.63 GB   1465149168   [scrubbed]      
p1     OK               u0     698.63 GB   1465149168   [scrubbed] 
p2     OK               u0     698.63 GB   1465149168   [scrubbed] 
p3     OK               u0     698.63 GB   1465149168   [scrubbed] 
p4     OK               u0     698.63 GB   1465149168   [scrubbed] 
p5     OK               u0     698.63 GB   1465149168   [scrubbed] 
p6     NOT-PRESENT      -      -           -             -
p7     NOT-PRESENT      -      -           -             -

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours    LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       255      04-Jun-2008

I have stripped the machine down to its bare minimum configuration, but
the instability continues. On an older BIOS, enabling the IOMMU option
in the BIOS seemed to cause hard crashes with alarming frequency.
However, this all but disappeared in v2.05 and upwards.
As disabling the IOMMU option costs me RAM, I have not yet done so.
However, I will happily flip BIOS settings and run further tests for
you. I'm not getting any work done on this workstation and that needs to
stop.
I realize that the Linux kernel may be entirely blameless in this
situation, but I'd like to have some peer insight before I run after
vendors.

Regards,
Tony V.

