* MCE error log
@ 2009-01-06 13:00 Zdenek Kabelac
  2009-01-06 18:42 ` Valdis.Kletnieks
  2009-01-06 18:49 ` Andi Kleen
  0 siblings, 2 replies; 5+ messages in thread
From: Zdenek Kabelac @ 2009-01-06 13:00 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hi

I've noticed mcelog output with weird content:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 57976afd
STATUS 88380100 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 53e61034
STATUS 88370100 MCGSTATUS 0

I'm running a T61 with 2GB of RAM. In this directory
/sys/devices/system/machinecheck/machinecheck1
I could only see  bank0ctl ... bank5ctl - so where is bank 128 ?
(As there are no timestamps, I can hardly guess how often this happens.)

Is it a kernel bug or a chipset bug?
Should I start to worry about the stability of my machine?

Zdenek
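
A minimal sketch of the check described above: list which bankNctl files exist under the machinecheck directory and see whether the bank number reported by mcelog is among them. The directory path is the one quoted in the message; the helper name list_mce_banks is hypothetical. A bank number with no matching file would be consistent with the replies in this thread, which treat 128 as something other than a real hardware bank.

```python
import glob
import os
import re


def list_mce_banks(cpu_dir):
    """Return the sorted bank numbers that have a bankNctl file in cpu_dir."""
    banks = []
    for path in glob.glob(os.path.join(cpu_dir, "bank*ctl")):
        match = re.fullmatch(r"bank(\d+)ctl", os.path.basename(path))
        if match:
            banks.append(int(match.group(1)))
    return sorted(banks)


if __name__ == "__main__":
    # Path taken from the message above; it may not exist on other kernels.
    cpu_dir = "/sys/devices/system/machinecheck/machinecheck1"
    banks = list_mce_banks(cpu_dir)
    print("banks with a control file:", banks)
    print("bank 128 present:", 128 in banks)
```

On a system without that directory the list is simply empty, so the sketch reports nothing rather than failing.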

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: MCE error log
  2009-01-06 13:00 MCE error log Zdenek Kabelac
@ 2009-01-06 18:42 ` Valdis.Kletnieks
  2009-01-25 23:51   ` Vegard Nossum
  2009-01-06 18:49 ` Andi Kleen
  1 sibling, 1 reply; 5+ messages in thread
From: Valdis.Kletnieks @ 2009-01-06 18:42 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: Linux Kernel Mailing List


On Tue, 06 Jan 2009 14:00:03 +0100, Zdenek Kabelac said:

> CPU 1 BANK 128 TSC 57976afd

> I could only see  bank0ctl ... bank5ctl - so where is bank 128 ?

I've had bank 128 reported before. Turned out it was for thermal events caused
by dust bunnies clogging a cooling vent.  I never did find an official
statement that's what 128 is for, but I did find a bunch of hints....

What does lm_sensors say the CPU temp is sitting at?




* Re: MCE error log
  2009-01-06 13:00 MCE error log Zdenek Kabelac
  2009-01-06 18:42 ` Valdis.Kletnieks
@ 2009-01-06 18:49 ` Andi Kleen
  2009-01-07 17:44   ` Zdenek Kabelac
  1 sibling, 1 reply; 5+ messages in thread
From: Andi Kleen @ 2009-01-06 18:49 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: Linux Kernel Mailing List

"Zdenek Kabelac" <zdenek.kabelac@gmail.com> writes:

> /sys/devices/system/machinecheck/machinecheck1
> I could only see  bank0ctl ... bank5ctl - so where is bank 128 ?

Update your mcelog. Newer versions decode it.

-Andi

-- 
ak@linux.intel.com


* Re: MCE error log
  2009-01-06 18:49 ` Andi Kleen
@ 2009-01-07 17:44   ` Zdenek Kabelac
  0 siblings, 0 replies; 5+ messages in thread
From: Zdenek Kabelac @ 2009-01-07 17:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel Mailing List

2009/1/6 Andi Kleen <andi@firstfloor.org>:
> "Zdenek Kabelac" <zdenek.kabelac@gmail.com> writes:
>
>> /sys/devices/system/machinecheck/machinecheck1
>> I could only see  bank0ctl ... bank5ctl - so where is bank 128 ?
>
> Update your mcelog. Newer versions decode it.

Ok

I've replaced the binary with your latest code.

Here is the new trace:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC 54ab7bf6
Processor core below trip temperature. Throttling disabled
STATUS 88380100 MCGSTATUS 0

So what does this message mean now?

Zdenek
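
One possible way to read the STATUS values in the traces above, sketched under an assumption this thread does not confirm: that for a thermal event the STATUS field holds the raw contents of the IA32_THERM_STATUS MSR, with the bit layout given in Intel's Software Developer's Manual. The helper name decode_therm_status is made up for illustration, and only a few of the register's fields are extracted.

```python
def decode_therm_status(status):
    """Split a value into IA32_THERM_STATUS fields (assumed layout, see lead-in)."""
    return {
        "above_trip": bool(status & (1 << 0)),   # bit 0: thermal status
        "threshold2": bool(status & (1 << 8)),   # bit 8: thermal threshold #2 status
        "readout": (status >> 16) & 0x7F,        # bits 22:16: digital readout
        "resolution": (status >> 27) & 0xF,      # bits 30:27: resolution in degrees C
        "valid": bool(status & (1 << 31)),       # bit 31: reading valid
    }


# The two STATUS values from the first message in the thread.
for raw in (0x88380100, 0x88370100):
    fields = decode_therm_status(raw)
    print(f"{raw:#010x}: {fields}")
```

Under that assumption, 0x88380100 has the thermal status bit clear, which agrees with mcelog's "below trip temperature" line, and carries a valid digital readout of 56 (0x38).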


* Re: MCE error log
  2009-01-06 18:42 ` Valdis.Kletnieks
@ 2009-01-25 23:51   ` Vegard Nossum
  0 siblings, 0 replies; 5+ messages in thread
From: Vegard Nossum @ 2009-01-25 23:51 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Zdenek Kabelac, Linux Kernel Mailing List, Maciej W. Rozycki

On Tue, Jan 6, 2009 at 7:42 PM,  <Valdis.Kletnieks@vt.edu> wrote:
> On Tue, 06 Jan 2009 14:00:03 +0100, Zdenek Kabelac said:
>
>> CPU 1 BANK 128 TSC 57976afd
>
>> I could only see  bank0ctl ... bank5ctl - so where is bank 128 ?
>
> I've had bank 128 reported before. Turned out it was for thermal events caused
> by dust bunnies clogging a cooling vent.  I never did find an official
> statement that's what 128 is for, but I did find a bunch of hints....
>
> What does lm_sensors say the CPU temp is sitting at?

I get this also:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC dc963a087
Processor core below trip temperature. Throttling disabled
STATUS 882c0100 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 THERMAL EVENT TSC dc970c0d0
Processor core below trip temperature. Throttling disabled
STATUS 882d0200 MCGSTATUS 0

and in kernel log:

Machine check events logged
CPU0: Temperature/speed normal
CPU1: Temperature/speed normal

This has been happening since I installed an x86_64 kernel instead of a
32-bit one. Maybe this explains those weird (never fatal) APIC errors I
always used to get before (error 40, invalid vector received AFAIR)? In
any case, the APIC errors are no longer seen, and the frequency of the
MCEs is about that of the APIC errors. What I can say is that they seem
to appear sooner when there are a lot of interrupts, e.g. disk
or network activity. What is the correlation?

Temperature seems completely normal whenever it happens:

# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +58.0°C  (high = +100.0°C, crit = +100.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Core 1:      +59.0°C  (high = +100.0°C, crit = +100.0°C)

Anyway, the system works fine, so there's not much to worry about. But I am curious...


Vegard
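
A small sketch of how the sensors output quoted above could be checked against its own thresholds. It parses only the coretemp line format shown in this message; other chips and lm_sensors versions may print different layouts, and the helper name parse_core_temps is hypothetical.

```python
import re

# The sensors output from the message above, copied verbatim.
SENSORS_OUTPUT = """\
coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +58.0°C  (high = +100.0°C, crit = +100.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Core 1:      +59.0°C  (high = +100.0°C, crit = +100.0°C)
"""

CORE_LINE = re.compile(
    r"^(Core \d+):\s+([+-]?[\d.]+)°C\s+"
    r"\(high = ([+-]?[\d.]+)°C, crit = ([+-]?[\d.]+)°C\)"
)


def parse_core_temps(text):
    """Map each 'Core N' label to its temperature, high and crit values."""
    readings = {}
    for line in text.splitlines():
        match = CORE_LINE.match(line)
        if match:
            name, temp, high, crit = match.groups()
            readings[name] = {
                "temp": float(temp),
                "high": float(high),
                "crit": float(crit),
            }
    return readings


for core, r in parse_core_temps(SENSORS_OUTPUT).items():
    margin = r["crit"] - r["temp"]
    print(f"{core}: {r['temp']:.1f} C, {margin:.1f} C below critical")
```

For the readings quoted here, both cores sit roughly 40 degrees below the critical threshold, which fits the observation that the temperature looks normal when the events are logged.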

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

