Re: Opteron 6276 Corrected ECC errors

From: Michael Madore @ 2013-02-05 16:34 UTC
  To: linux-kernel

> On Wed, Jan 30, 2013 at 11:29:47AM -0500, Michael Madore wrote:
>> Supermicro H8QGi-F server board (AMD SR5690/SR5670/SP5100 Chipset)
>> 4 X AMD Opteron 6276 processors
>> 32 X 8GB (256GB) DDR3-1600 ECC Registered memory
>> Debian with kernel 3.2.35-2
>>
>> We have received the following two hardware errors:
>>
>> 9/10/12
>>
>> [591006.120039] [Hardware Error]: CPU:58 MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x9842c000000c0176
>> [591006.120048] [Hardware Error]: Combined Unit Error: VB Data/ECC error.
>> [591006.120052] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
>>
>> 1/21/12
>>
>> [549004.336097] [Hardware Error]: CPU:40 MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c3444e0001f010b
>> [549004.336111] [Hardware Error]:       MC4_ADDR: 0x000000000000e480
>> [549004.336117] [Hardware Error]: Northbridge Error (node 5): ECC Error in the Probe Filter directory.
>> [549004.336125] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN
>>
>> If I understand correctly, both of these errors represent single bit
>> corrected errors in the CPU cache.
>
> Internal CPU structures: the first is in the victim buffer and the
> second in the probe filter, which is part of L3.
>
>> On both occasions the system continued to function normally after the
>> error was reported.
>
> As expected; both are single-bit ECC errors which were corrected and
> system state wasn't influenced.
>
>> Is receiving two such errors (on different CPUs) over such a time span
>> cause for concern?
>
> Not really. I'd say, only if the error rate starts increasing over time
> and the error types keep repeating.
>
>> The end user is concerned there is a serious hardware problem. I'm
>> reluctant to start replacing CPUs, however, without seeing a repeated
>> pattern of errors.
>
> Yes, no need to replace; simply watch the error rates. Checking the
> temperature of the CPUs and possibly improving cooling are some of the
> things that come to mind.

Hi Boris,

Thank you for the information.  The system has just received a third error:

This is on a different node than the previous two errors.  And each
node has its own L3, correct?  Would you still advocate watching and
waiting?
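
As a cross-check on the two status values quoted above, here is a
minimal Python sketch that decodes an MCi_STATUS value into its flag
bits and cache error fields.  The bit positions and lookup tables are
assumptions based on my reading of the AMD family 15h register layout
and of the strings the kernel prints; decode_status() and the table
names are hypothetical helpers, not part of any existing tool.  Please
verify against the BKDG before relying on it.

```python
# Assumed MCi_STATUS flag bit positions (AMD family 15h); not authoritative.
STATUS_FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
    "CECC": 46, "UECC": 45,
}

# Lookup tables for the low 16-bit error code, mirroring the strings seen
# in the quoted log lines ("cache level", "tx", "mem-tx").
CACHE_LEVEL = ["L0", "L1", "L2", "L3/GEN"]
TX_TYPE = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Decode a 64-bit MCi_STATUS value into flags and cache error fields."""
    flags = {name: bool((status >> bit) & 1)
             for name, bit in STATUS_FLAGS.items()}
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    # Memory/cache error codes have the form 0000 0001 RRRR TTLL.
    if (code & 0xFF00) == 0x0100:
        rrrr = (code >> 4) & 0xF
        result["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "-"
        result["tx"] = TX_TYPE[(code >> 2) & 0x3]
        result["cache_level"] = CACHE_LEVEL[code & 0x3]
    return result


# The two status values from the quoted reports.
for value in (0x9842C000000C0176, 0x9C3444E0001F010B):
    decoded = decode_status(value)
    set_flags = [name for name, on in decoded["flags"].items() if on]
    print(hex(value), set_flags,
          decoded.get("cache_level"), decoded.get("tx"), decoded.get("mem_tx"))
```

If the assumed layout is right, the decoded fields should agree with
what the kernel already printed for each error: corrected (UC clear,
CECC set), with AddrV set only on the second one.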

Thanks,

Mike
