linux-kernel.vger.kernel.org archive mirror
* machine check errors
@ 2006-01-12 23:37 don fisher
  2006-01-13 10:10 ` Alan Cox
  2006-01-13 15:31 ` Roger Heflin
  0 siblings, 2 replies; 5+ messages in thread
From: don fisher @ 2006-01-12 23:37 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 973 bytes --]

I have a Tyan S2892 board with a pair of Opteron 288 dual-core CPUs and 
16GB DRAM. I receive the errors shown in the attached file, mcelog. These 
appear to occur when free memory becomes small, there is a lot in the 
cache, and there is a lot of IO.

The Tyan S2892 has an Nvidia Crush K8-04, which I think they call the 
southbridge. My errors appear to be related to the north bridge. There 
is an AMD 8131 PCI-X controller that runs the PCI slots. There is a 
3WARE 9500-12 located in one of the PCI-X slots.

I have run Memtest86+-1.65 for 24 hours without errors. I recently 
upgraded the BIOS to V2.00 without any remarkable changes.

I am running 2.6.15 within a current Fedora Core4 configuration.

I would appreciate any advice as to how to proceed. I have not noticed 
any adverse behavior from the MCEs, but that could be masked if data is 
transferred or ???.

Could there be any connection with the memory cache? Thanks in advance 
for your assistance.

don
-- 


[-- Attachment #2: mcelog --]
[-- Type: text/plain, Size: 5701 bytes --]

MCE 0
CPU 2 4 northbridge TSC 967b1992c66 
ADDR 2a52cb5f0 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 40b9
       bit32 = err cpu0
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d45cc00140080813 MCGSTATUS 0
MCE 1
CPU 2 4 northbridge TSC a101bbf7338 
ADDR 2922df698 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 2
CPU 2 4 northbridge TSC ab885e5bdbe 
ADDR 2922df698 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 0
CPU 2 4 northbridge TSC b60f17bf394 
ADDR 2918d98d0 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 1
CPU 2 4 northbridge TSC c095ba23a7e 
ADDR 2918cdff0 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 40b9
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d45cc00040080a13 MCGSTATUS 0
MCE 2
CPU 2 4 northbridge TSC cb1c7387269 
ADDR 2bf0cfa50 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 3
CPU 2 4 northbridge TSC d5a315f1a34 
ADDR 2900df990 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 20e8
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d474400020080a13 MCGSTATUS 0
MCE 4
CPU 2 4 northbridge TSC e029b8504a6 
ADDR 2900dd030 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 5
CPU 2 4 northbridge TSC eab071b5316 
ADDR 291ac9d98 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 6
CPU 2 4 northbridge TSC f537141ab1c 
ADDR 2918dfe78 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 40b9
       bit33 = err cpu1
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d45cc00240080813 MCGSTATUS 0
MCE 7
CPU 2 4 northbridge TSC ffbdb67fd26 
ADDR 2beac9010 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 40b9
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d45cc00040080a13 MCGSTATUS 0
MCE 0
CPU 2 4 northbridge TSC 10a446fe04fa 
ADDR 2cfcc9870 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 40b9
       bit32 = err cpu0
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d45cc00140080813 MCGSTATUS 0
MCE 1
CPU 2 4 northbridge TSC 114cb12451a2 
ADDR 291ac9630 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
       bit33 = err cpu1
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d428c00260080813 MCGSTATUS 0
MCE 2
CPU 2 4 northbridge TSC 11f51cba9b82 
ADDR 2c10cb010 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 20e8
       bit32 = err cpu0
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d474400120080813 MCGSTATUS 0
MCE 3
CPU 2 4 northbridge TSC 129d86e0d26a 
ADDR 294ec9390 
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
       bit32 = err cpu0
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d428c00160080813 MCGSTATUS 0
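
The STATUS words in the log can be picked apart by hand. The sketch below is only an illustration: the meanings of bits 32, 33, 46 and 62 are copied from the mcelog output above, bit 61 is the architectural "uncorrected" flag, and the way the 16-bit syndrome is split across bits 54:47 and 31:24 is an assumption about the K8 northbridge layout that happens to reproduce the syndromes mcelog printed. The function name decode_nb_status is hypothetical.

```python
def decode_nb_status(status: int) -> dict:
    """Decode selected fields of a K8 northbridge MCE STATUS value.

    Hypothetical helper: bit meanings follow the mcelog output in this
    thread; the syndrome split is an assumption checked against that log.
    """
    syndrome_low = (status >> 47) & 0xFF    # assumed: syndrome bits 7:0
    syndrome_high = (status >> 24) & 0xFF   # assumed: syndrome bits 15:8
    return {
        "overflow": bool((status >> 62) & 1),       # bit62 in the log
        "uncorrected": bool((status >> 61) & 1),    # architectural UC bit
        "corrected_ecc": bool((status >> 46) & 1),  # bit46 in the log
        "err_cpu0": bool((status >> 32) & 1),       # bit32 in the log
        "err_cpu1": bool((status >> 33) & 1),       # bit33 in the log
        "syndrome": (syndrome_high << 8) | syndrome_low,
    }


if __name__ == "__main__":
    # First record of the attached log.
    fields = decode_nb_status(0xD45CC00140080813)
    print(hex(fields["syndrome"]), fields)
```

Every record in the attachment has the corrected-ECC bit set and the uncorrected bit clear, which is the pattern the decoded text lines already report.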


* Re: machine check errors
  2006-01-12 23:37 machine check errors don fisher
@ 2006-01-13 10:10 ` Alan Cox
  2006-01-13 15:31 ` Roger Heflin
  1 sibling, 0 replies; 5+ messages in thread
From: Alan Cox @ 2006-01-13 10:10 UTC (permalink / raw)
  To: don fisher; +Cc: linux-kernel

On Thu, 2006-01-12 at 16:37 -0700, don fisher wrote:
> CPU 2 4 northbridge TSC 967b1992c66 
> ADDR 2a52cb5f0 
>   Northbridge Chipkill ECC error
>   Chipkill ECC syndrome = 40b9

Corrected ECC errors from memory. You've got bad memory, but because you
have ECC memory it was able to recover from the failure.

Alan



* RE: machine check errors
  2006-01-12 23:37 machine check errors don fisher
  2006-01-13 10:10 ` Alan Cox
@ 2006-01-13 15:31 ` Roger Heflin
  2006-01-14  0:28   ` Doug Thompson
  1 sibling, 1 reply; 5+ messages in thread
From: Roger Heflin @ 2006-01-13 15:31 UTC (permalink / raw)
  To: 'don fisher', linux-kernel

 

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of don fisher
> Sent: Thursday, January 12, 2006 5:37 PM
> To: linux-kernel@vger.kernel.org
> Subject: machine check errors
> 
> I have a Tyan S2892 board with a pair of Opteron 288 dual-core 
> CPUs and 16GB DRAM. I receive the errors shown in the 
> attached file, mcelog. These appear to occur when free 
> memory becomes small, there is a lot in the cache, and there 
> is a lot of IO.

You probably mean Opteron 285s or Opteron 280s.

> 
> The Tyan S2892 has an Nvidia Crush K8-04, which I think they 
> call the southbridge. My errors appear to be related to the 
> north bridge. There is an AMD 8131 PCI-X controller that runs 
> the PCI slots. There is a 3WARE 9500-12 located in one of the 
> PCI-X slots.
> 
> I have run Memtest86+-1.65 for 24 hours without errors. I 
> recently upgraded the BIOS to V2.00 without any remarkable changes.

Does memtest86+ support reading ECC errors on that motherboard?
If it does not, memtest won't tell you anything: the hardware ECC
will correct the errors and memtest will not find anything. If that
version of memtest is ECC-aware, it will register an ECC error.
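
Why a pattern test can pass on failing memory is easy to show with a toy code. The sketch below uses a Hamming(7,4) code, which is an assumption made purely for illustration; it is not the Chipkill code the Opteron memory controller uses. A single flipped bit is repaired on read, so a tester that only compares the data it wrote with the data it read back sees no mismatch; only the "corrected" flag records that anything happened.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # 1-indexed codeword positions holding data
PARITY_POSITIONS = (1, 2, 4)    # power-of-two positions hold parity


def encode(nibble):
    """Store a 4-bit value as a 7-bit Hamming codeword (index 0 unused)."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in PARITY_POSITIONS:
        # Each parity bit covers the positions whose index has bit p set.
        for pos in range(1, 8):
            if pos != p and pos & p:
                code[p] ^= code[pos]
    return code


def read_with_ecc(code):
    """Return (nibble, corrected), repairing a single flipped bit if present."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos         # non-zero syndrome = position of the flip
    fixed = list(code)
    if syndrome:
        fixed[syndrome] ^= 1
    nibble = sum(fixed[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, syndrome != 0


if __name__ == "__main__":
    stored = encode(0b1010)
    stored[6] ^= 1                  # simulate one bad memory cell
    value, corrected = read_with_ecc(stored)
    # A memtest-style comparison passes; only the flag shows the fault.
    print(value == 0b1010, corrected)
```

A tester that is ECC-aware would be looking at the equivalent of the corrected flag rather than at the data alone.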

> 
> I am running 2.6.15 within a current Fedora Core4 configuration.
> 
> I would appreciate any advice as to how to proceed. I have 
> not noticed any adverse behavior from the MCEs, but that 
> could be masked if data is transferred or ???.

Download edac/bluesmoke from SourceForge, then compile and install
it. It monitors ECC errors from Linux and should tell you
whether you are getting ECC errors.

If you were running certain other Linux distributions, they would not
report MCEs because they are missing the mcelog program, but the errors
would still be there.

> 
> Could there be any connection with the memory cache? Thanks 
> in advance for your assistance.
> 
> don

Non-fatal MCEs are usually ECC faults, and *USUALLY* track back
to bad memory, though the cause can also be an overheating CPU or a
problematic CPU, or, rarely, the motherboard could be at fault.

ECC/MCE counts will get worse under load. Unless the problem is
really severe, you won't see them at idle.
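
One way to see whether the errors cluster is to tally the mcelog records by syndrome and by physical page. The sketch below assumes the text layout of the attachment in this thread ("ADDR <hex>" and "Chipkill ECC syndrome = <hex>" lines) and a 4 KiB page size; the function name tally is hypothetical. In the attached log, syndrome 6051 appears in 8 of the 15 records, 40b9 in 5 and 20e8 in 2. A small set of recurring syndromes may hint at a particular failing device, but that reading is a guess, not something the log proves.

```python
import re
from collections import Counter

SYNDROME = re.compile(r"Chipkill ECC syndrome = ([0-9a-fA-F]+)")
ADDR = re.compile(r"^ADDR ([0-9a-fA-F]+)")


def tally(log_text):
    """Count mcelog records per ECC syndrome and per 4 KiB physical page."""
    syndromes = Counter()
    pages = Counter()
    for raw in log_text.splitlines():
        line = raw.strip()
        match = SYNDROME.search(line)
        if match:
            syndromes[match.group(1).lower()] += 1
            continue
        match = ADDR.match(line)
        if match:
            pages[int(match.group(1), 16) >> 12] += 1   # assumed 4 KiB pages
    return syndromes, pages


SAMPLE = """\
MCE 0
CPU 2 4 northbridge TSC 967b1992c66
ADDR 2a52cb5f0
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 40b9
MCE 1
CPU 2 4 northbridge TSC a101bbf7338
ADDR 2922df698
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
MCE 2
CPU 2 4 northbridge TSC ab885e5bdbe
ADDR 2922df698
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = 6051
"""

if __name__ == "__main__":
    by_syndrome, by_page = tally(SAMPLE)
    for syndrome, count in by_syndrome.most_common():
        print(syndrome, count)
    for page, count in by_page.most_common():
        print(hex(page), count)
```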

                     Roger
			   Atipa Technologies



* RE: machine check errors
  2006-01-13 15:31 ` Roger Heflin
@ 2006-01-14  0:28   ` Doug Thompson
  0 siblings, 0 replies; 5+ messages in thread
From: Doug Thompson @ 2006-01-14  0:28 UTC (permalink / raw)
  To: 'don fisher', linux-kernel

> 
> Download edac/bluesmoke from sourceforge and compile and install
> it, this will monitor ecc errors from linux, and should tell you
> if you are getting ecc errors.
> 

Download bluesmoke from bluesmoke.sourceforge.net for now, as EDAC does
not have the Opteron driver yet. It is in the queue.

(EDAC is the name of the new module that will be in the kernel shortly.
Bluesmoke is the legacy name for EDAC.)
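
Once an EDAC driver is loaded for the memory controller, the corrected and uncorrected error counts can be read from sysfs. The sketch below assumes the layout used by mainline EDAC (/sys/devices/system/edac/mc/mcN/ce_count and ue_count); the standalone bluesmoke package discussed here may expose its counters differently, and the function name read_edac_counts is hypothetical.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} from EDAC sysfs.

    Assumes the mainline EDAC sysfs layout; returns an empty dict when no
    EDAC driver is loaded or the directory does not exist.
    """
    counts = {}
    for controller in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[controller.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

Sampling these counters before and after a heavy IO run would show whether corrected errors rise under load.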

doug t




"If you think Education is expensive, just try Ignorance"

"Don't tell people HOW to do things, tell them WHAT you
want and they will surprise you with their ingenuity."
                   Gen George Patton



* Re: machine check errors
       [not found] <5ukd5-3kf-17@gated-at.bofh.it>
@ 2006-01-13  4:37 ` Robert Hancock
  0 siblings, 0 replies; 5+ messages in thread
From: Robert Hancock @ 2006-01-13  4:37 UTC (permalink / raw)
  To: linux-kernel

don fisher wrote:
> I have a Tyan S2892 board with a pair of Opteron 288 dual-core CPUs and 
> 16GB DRAM. I receive the errors shown in the attached file, mcelog. These 
> appear to occur when free memory becomes small, there is a lot in the 
> cache, and there is a lot of IO.
> 
> The Tyan S2892 has an Nvidia Crush K8-04, which I think they call the 
> southbridge. My errors appear to be related to the north bridge. There 
> is an AMD 8131 PCI-X controller that runs the PCI slots. There is a 
> 3WARE 9500-12 located in one of the PCI-X slots.
> 
> I have run Memtest86+-1.65 for 24 hours without errors. I recently 
> upgraded the BIOS to V2.00 without any remarkable changes.
> 
> I am running 2.6.15 within a current Fedora Core4 configuration.
> 
> I would appreciate any advice as to how to proceed. I have not noticed 
> any adverse behavior from the MCEs, but that could be masked if data is 
> transferred or ???.
> 
> Could there be any connection with the memory cache? Thanks in advance 
> for your assistance.

I would say you likely do have some bad RAM; that seems to be what those 
MCEs are indicating. Depending on the configuration, Memtest86 may not 
find all the errors if they are being corrected by ECC.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


