linux-kernel.vger.kernel.org archive mirror
* Help with machine check exception
@ 2006-01-12 16:30 Orion Poplawski
  2006-01-12 17:07 ` Orion Poplawski
  2006-01-12 17:32 ` Roger Heflin
  0 siblings, 2 replies; 6+ messages in thread
From: Orion Poplawski @ 2006-01-12 16:30 UTC
  To: linux-kernel

Can someone help determine the problem here?  Does it definitely point 
to a bad CPU, or possibly a bad motherboard?

Thanks!

CPU 0: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 184fcd0553e4
Kernel panic - not syncing: Machine check

Call Trace: <#MC> <ffffffff80134831>{panic+133} <ffffffff8034329c>{_spin_trylock+9}
        <ffffffff8010f4d3>{oops_begin+90} <ffffffff801149f8>{print_mce+136}
        <ffffffff80114abf>{mcheck_timer+0} <ffffffff801150cd>{do_machine_check+752}
        <ffffffff8010ee6f>{machine_check+127}  <EOE>
  NMI Watchdog detected LOCKUP on CPU 0
CPU 0
Modules linked in: nfs lockd nfs_acl ipv6 parport_pc lp parport autofs4
sunrpc xfs exportfs dm_mod video button battery ac ohci_hcd i2c_amd8111
i2c_amd756 i2c_core shpchp eepro100 e100 mii tg3 floppy ext3 jbd
Pid: 14041, comm: srt Tainted: G   M  2.6.14-1.1656_FC4smp #1
RIP: 0010:[<ffffffff80118242>] <ffffffff80118242>{__smp_call_function+107}
RSP: 0000:ffffffff804a8358  EFLAGS: 00000002
RAX: 0000000000000002 RBX: 0000000000000003 RCX: 0000ffff0000ffff
RDX: 0000000000000004 RSI: 0000000000000020 RDI: ffffffff80523be0
RBP: 0000000000000000 R08: ffff8100826b71e0 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff8011abcb R12: ffffffff80117f0b
R13: 0000000000000000 R14: ffffffff80360d29 R15: 0000000000000001
FS:  00002aaaaae8ad00(0000) GS:ffffffff80518000(0000) knlGS:00000000f7fab6c0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffffc07000 CR3: 00000000c2f42000 CR4: 00000000000006e0
Process srt (pid: 14041, threadinfo ffff8100bd0ba000, task ffff8100822000c0)
Stack: ffffffff80117f0b 0000000000000000 0000000000000002 0000000000000000
        0000000000014f00 0000000000000001 0000000000000000 0000000000000000
        0000184fcd054eab ffffffff801182a0
Call Trace: <#MC> <ffffffff80117f0b>{smp_really_stop_cpu+0} <ffffffff801182a0>{smp_send_stop+43}
        <ffffffff8013483d>{panic+145} <ffffffff8034329c>{_spin_trylock+9}
        <ffffffff8010f4d3>{oops_begin+90} <ffffffff801149f8>{print_mce+136}
        <ffffffff80114abf>{mcheck_timer+0} <ffffffff801150cd>{do_machine_check+752}
        <ffffffff8010ee6f>{machine_check+127}  <EOE>

Code: 8b 44 24 10 39 c3 75 f6 85 ed 75 14 66 90 eb 18 f3 90 8b 44
console shuts up ...
  <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():1

Call Trace: <NMI> <ffffffff8013603a>{profile_task_exit+21} <ffffffff801371f8>{do_exit+34}
        <ffffffff8025456d>{do_unblank_screen+40} <ffffffff8010f593>{bad_intr+0}
        <ffffffff80118ed0>{nmi_watchdog_tick+242} <ffffffff8010f835>{default_do_nmi+137}
        <ffffffff80117f0b>{smp_really_stop_cpu+0} <ffffffff80118ff9>{do_nmi+69}
        <ffffffff8010eb97>{nmi+127} <ffffffff80117f0b>{smp_really_stop_cpu+0}
        <ffffffff8011abcb>{flat_send_IPI_mask+0} <ffffffff80118242>{__smp_call_function+107}
         <EOE>  <#MC> <ffffffff80117f0b>{smp_really_stop_cpu+0}
        <ffffffff801182a0>{smp_send_stop+43} <ffffffff8013483d>{panic+145}
        <ffffffff8034329c>{_spin_trylock+9} <ffffffff8010f4d3>{oops_begin+90}
        <ffffffff801149f8>{print_mce+136} <ffffffff80114abf>{mcheck_timer+0}
        <ffffffff801150cd>{do_machine_check+752} <ffffffff8010ee6f>{machine_check+127}
         <EOE>
APIC error on CPU0: 00(08)
Kernel panic - not syncing: Aiee, killing interrupt handler!
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

Call Trace:<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <NMI> <7>APIC error on CPU0: 08(08)
<ffffffff80134831>{panic+133}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <ffffffff8034334a>{_spin_unlock_irq+14}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
<ffffffff80342cd1>{__down_read+50}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <7>APIC error on CPU0: 08(08)
<ffffffff803432e8>{_spin_lock_irqsave+9}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
<ffffffff801fde91>{__up_read+19}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <ffffffff80137255>{do_exit+127}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <ffffffff8025456d>{do_unblank_screen+40}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <ffffffff8010f593>{bad_intr+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <ffffffff80118ed0>{nmi_watchdog_tick+242}<7>APIC error on CPU0: 08(08)
  <ffffffff8010f835>{default_do_nmi+137}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
<ffffffff80117f0b>{smp_really_stop_cpu+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <ffffffff80118ff9>{do_nmi+69}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
<ffffffff8010eb97>{nmi+127}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <ffffffff80117f0b>{smp_really_stop_cpu+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
<ffffffff8011abcb>{flat_send_IPI_mask+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <ffffffff80118242>{__smp_call_function+107}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
  <EOE> <7>APIC error on CPU0: 08(08)
  <#MC> <7>APIC error on CPU0: 08(08)
<ffffffff80117f0b>{smp_really_stop_cpu+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
<ffffffff801182a0>{smp_send_stop+43}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <ffffffff8013483d>{panic+145}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
<ffffffff8034329c>{_spin_trylock+9}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <ffffffff8010f4d3>{oops_begin+90}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
<ffffffff801149f8>{print_mce+136}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
  <ffffffff80114abf>{mcheck_timer+0}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

        <7>APIC error on CPU0: 08(08)
<ffffffff801150cd>{do_machine_check+752}<7>APIC error on CPU0: 08(08)
  <ffffffff8010ee6f>{machine_check+127}<7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

         <EOE> <7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)

  <7>APIC error on CPU0: 08(08)
APIC error on CPU0: 08(08)



* Re: Help with machine check exception
  2006-01-12 16:30 Help with machine check exception Orion Poplawski
@ 2006-01-12 17:07 ` Orion Poplawski
  2006-01-12 17:33   ` Alan Cox
  2006-01-12 17:32 ` Roger Heflin
  1 sibling, 1 reply; 6+ messages in thread
From: Orion Poplawski @ 2006-01-12 17:07 UTC
  To: linux-kernel

Orion Poplawski wrote:
> Can someone help determine the problem here?  Does it definitely point 
> to a bad CPU, or possibly a bad motherboard?
> 
> Thanks!
> 


mcelog decode states:

CPU 0 4 northbridge TSC 184fcd0553e4
   Northbridge Watchdog error
        bit57 = processor context corrupt
        bit61 = error uncorrected
   bus error 'generic participation, request timed out
       generic error mem transaction
       generic access, level generic'
STATUS b200000000070f0f MCGSTATUS 4
Kernel panic - not syncing: Machine check
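
For reference, the flags mcelog prints can be read straight off the STATUS value. Below is a minimal Python sketch, assuming the standard x86 MCi_STATUS layout (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC) and the bus-error code pattern 0000 1PPT RRRR IILL in the low 16 bits. The helper name and the dictionary keys are made up for illustration and are not part of mcelog.

```python
# Bit positions in the architectural x86 MCi_STATUS register.
STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error uncorrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split an MCi_STATUS value into flags and bus-error fields (hypothetical helper)."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    code = status & 0xFFFF  # MCA error code lives in the low 16 bits
    bus = None
    if (code >> 11) == 0b00001:  # pattern 0000 1PPT RRRR IILL marks a bus error
        bus = {
            "participation": (code >> 9) & 0x3,  # PP, 3 = generic
            "timed_out": bool((code >> 8) & 1),  # T
            "request": (code >> 4) & 0xF,        # RRRR, 0 = generic error
            "memory_io": (code >> 2) & 0x3,      # II, 3 = generic
            "level": code & 0x3,                 # LL, 3 = generic
        }
    return {"flags": flags, "error_code": code, "bus": bus}


decoded = decode_mci_status(0xB200000000070F0F)
set_flags = sorted(name for name, on in decoded["flags"].items() if on)
print(set_flags)
print(decoded["bus"])
```

The set flags include UC and PCC, which matches the "error uncorrected" and "processor context corrupt" lines above, and the bus fields match the "generic ... request timed out" text. The northbridge-specific extended code in the bits above the low 16 is left undecoded here.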



* RE: Help with machine check exception
  2006-01-12 16:30 Help with machine check exception Orion Poplawski
  2006-01-12 17:07 ` Orion Poplawski
@ 2006-01-12 17:32 ` Roger Heflin
  1 sibling, 0 replies; 6+ messages in thread
From: Roger Heflin @ 2006-01-12 17:32 UTC
  To: 'Orion Poplawski', linux-kernel

 

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of 
> Orion Poplawski
> Sent: Thursday, January 12, 2006 10:30 AM
> To: linux-kernel@vger.kernel.org
> Subject: Help with machine check exception
> 
> Can someone help determine the problem here?  Does it 
> definitely point to a bad CPU, or possibly a bad motherboard?
> 
> Thanks!
> 
> CPU 0: Machine Check Exception:                4 Bank 4: 
> b200000000070f0f
> TSC 184fcd0553e4
> Kernel panic - not syncing: Machine check
> 

If this is an Opteron, the cause is most likely the CPU or memory. A
DIMM failing in the right way will cause it, and I have seen a CPU
cause it. I don't know that I have seen a motherboard cause it, and we
have fixed a fair number of these errors. If it is memory, it can be
any of the DIMMs attached to that CPU.

I have also seen this error kill a machine on boot-up, but that looked
more like something had been cleared improperly, and it may only affect
much older versions of 2.6. In that case the hardware is not broken,
and after a reboot the error cannot be reproduced.

                       Roger



* Re: Help with machine check exception
  2006-01-12 17:07 ` Orion Poplawski
@ 2006-01-12 17:33   ` Alan Cox
  2006-01-12 17:52     ` Orion Poplawski
  0 siblings, 1 reply; 6+ messages in thread
From: Alan Cox @ 2006-01-12 17:33 UTC
  To: Orion Poplawski; +Cc: linux-kernel

On Thu, 2006-01-12 at 10:07 -0700, Orion Poplawski wrote:
> mcelog decode states:
> 
> CPU 0 4 northbridge TSC 184fcd0553e4
>    Northbridge Watchdog error
>         bit57 = processor context corrupt
>         bit61 = error uncorrected
>    bus error 'generic participation, request timed out
>        generic error mem transaction
>        generic access, level generic'
> STATUS b200000000070f0f MCGSTATUS 4
> Kernel panic - not syncing: Machine check

Could be RAM, CPU or motherboard, or even a power glitch of course.

Before you panic I'd suggest that you check the machine is being
adequately cooled (especially the CPU) and that the ram and cpu are all
well socketed.

memtest86+ will help test for memory problems and may be worth an
overnight run.



* Re: Help with machine check exception
  2006-01-12 17:33   ` Alan Cox
@ 2006-01-12 17:52     ` Orion Poplawski
  2006-01-12 19:18       ` Roger Heflin
  0 siblings, 1 reply; 6+ messages in thread
From: Orion Poplawski @ 2006-01-12 17:52 UTC
  To: linux-kernel

Alan Cox wrote:
> On Thu, 2006-01-12 at 10:07 -0700, Orion Poplawski wrote:
>> mcelog decode states:
>>
>> CPU 0 4 northbridge TSC 184fcd0553e4
>>    Northbridge Watchdog error
>>         bit57 = processor context corrupt
>>         bit61 = error uncorrected
>>    bus error 'generic participation, request timed out
>>        generic error mem transaction
>>        generic access, level generic'
>> STATUS b200000000070f0f MCGSTATUS 4
>> Kernel panic - not syncing: Machine check
> 
> Could be ram cpu or motherboard, even a power glitch of course.
> 
> Before you panic I'd suggest that you check the machine is being
> adequately cooled (especially the CPU) and that the ram and cpu are all
> well socketed.
> 
> memtest86+ will help test for memory problems and may be worth an
> overnight run
> 

Well, I've swapped memory with an identical machine and the problem
stayed with the original machine.  The crash is fairly frequent (about
once every 1-2 days of operation).

adm1027-i2c-0-2e
Adapter: SMBus AMD8111 adapter at 10e0
V1.5:      +2.601 V  (min =  +1.42 V, max =  +1.58 V)   ALARM
VCore:     +1.304 V  (min =  +1.48 V, max =  +1.63 V)   ALARM
V3.3:      +3.326 V  (min =  +3.13 V, max =  +3.47 V)
V5:       +5.117 V  (min =  +4.74 V, max =  +5.26 V)
V12:      +12.094 V  (min = +11.38 V, max = +12.62 V)
CPU_Fan:      0 RPM  (min = 4000 RPM)                     ALARM
fan2:         0 RPM  (min =    0 RPM)
fan3:         0 RPM  (min =    0 RPM)
fan4:      4981 RPM  (min =    0 RPM)
CPU:      +50.00°C  (low  =   +10°C, high =   +50°C)
Board:    +34.00°C  (low  =   +10°C, high =   +35°C)
Remote:   +52.50°C  (low  =   +10°C, high =   +35°C)     ALARM
CPU_PWM:   255
Fan2_PWM:  255
Fan3_PWM:  255
vid:      +1.550 V  (VRM Version 9.1)
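
To pick out the voltage rails that sit outside their configured limits, here is a minimal Python sketch that parses the voltage lines of the output above. The regular expression and the function name are made up for illustration, and it only handles the "value V (min = ..., max = ...)" line shape shown here. Whether the limits and scaling are configured correctly for this board is not known, so a flagged rail is a hint rather than a diagnosis.

```python
import re

# Voltage lines copied from the sensors output above.
SENSORS_TEXT = """\
V1.5:      +2.601 V  (min =  +1.42 V, max =  +1.58 V)   ALARM
VCore:     +1.304 V  (min =  +1.48 V, max =  +1.63 V)   ALARM
V3.3:      +3.326 V  (min =  +3.13 V, max =  +3.47 V)
V5:       +5.117 V  (min =  +4.74 V, max =  +5.26 V)
V12:      +12.094 V  (min = +11.38 V, max = +12.62 V)
"""

NUMBER = r"[+-]?\d+(?:\.\d+)?"
LINE_RE = re.compile(
    rf"^(?P<name>[^:]+):\s+(?P<value>{NUMBER}) V\s+"
    rf"\(min =\s*(?P<lo>{NUMBER}) V, max =\s*(?P<hi>{NUMBER}) V\)"
)


def out_of_range(text):
    """Return (name, value, min, max) for each rail outside its limits."""
    bad = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not voltage readings
        value, lo, hi = (float(match.group(k)) for k in ("value", "lo", "hi"))
        if not lo <= value <= hi:
            bad.append((match.group("name"), value, lo, hi))
    return bad


flagged = out_of_range(SENSORS_TEXT)
for name, value, lo, hi in flagged:
    print(f"{name}: {value} V is outside {lo}..{hi} V")
```

On the readings above this flags V1.5 and VCore, the same two rails that carry the ALARM marker.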


I would have expected 2 CPU temps (being dual-processor).  Maybe Remote 
is the second.

With 4 copies of burnK7:

machine with problems:

CPU:      +71.25°C  (low  =   +10°C, high =   +50°C)     ALARM
Board:    +47.25°C  (low  =   +10°C, high =   +35°C)     ALARM
Remote:   +68.50°C  (low  =   +10°C, high =   +35°C)     ALARM

machine without:

CPU:      +61.25°C  (low  =   +10°C, high =   +50°C)     ALARM
Board:    +47.25°C  (low  =   +10°C, high =   +35°C)     ALARM
Remote:   +74.25°C  (low  =   +10°C, high =   +35°C)     ALARM


So *maybe* cooling?






* RE: Help with machine check exception
  2006-01-12 17:52     ` Orion Poplawski
@ 2006-01-12 19:18       ` Roger Heflin
  0 siblings, 0 replies; 6+ messages in thread
From: Roger Heflin @ 2006-01-12 19:18 UTC
  To: 'Orion Poplawski', linux-kernel

 

> 
> I would have expected 2 CPU temps (being dual-processor).  
> Maybe Remote is the second.
> 
> With 4 copies of burnK7:
> 
> machine with problems:
> 
> CPU:      +71.25°C  (low  =   +10°C, high =   +50°C)     ALARM
> Board:    +47.25°C  (low  =   +10°C, high =   +35°C)     ALARM
> Remote:   +68.50°C  (low  =   +10°C, high =   +35°C)     ALARM
> 
> machine without:
> 
> CPU:      +61.25°C  (low  =   +10°C, high =   +50°C)     ALARM
> Board:    +47.25°C  (low  =   +10°C, high =   +35°C)     ALARM
> Remote:   +74.25°C  (low  =   +10°C, high =   +35°C)     ALARM
> 
> 
> So *maybe* cooling?
> 

That is a little on the warm side. I believe AMD's posted limit
is 70C for most of their chips, assuming the temperature is measured
at the point the 70C limit refers to.
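
As a quick check of the quoted burnK7 readings against that figure, here is a small Python sketch. The 70C value is the limit as recalled in this message, not a verified datasheet number, and the machine labels and function name are made up for illustration.

```python
LIMIT_C = 70.0  # limit as recalled in this message; an assumption, not a datasheet value

# CPU and Remote readings under burnK7 load, taken from the quoted text.
readings = {
    ("problem machine", "CPU"): 71.25,
    ("problem machine", "Remote"): 68.50,
    ("good machine", "CPU"): 61.25,
    ("good machine", "Remote"): 74.25,
}


def headroom(temps, limit=LIMIT_C):
    """Return limit minus reading for each sensor; negative means over the limit."""
    return {key: limit - value for key, value in temps.items()}


margins = headroom(readings)
for (machine, sensor), margin in margins.items():
    state = "over" if margin < 0 else "under"
    print(f"{machine} {sensor}: {abs(margin):.2f}C {state} the limit")
```

If Remote really is the second CPU, each machine has one sensor over the assumed limit, so temperature alone does not obviously separate the failing machine from the working one.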

Certain CPUs also seem to have more issues than others, so one CPU
out of a batch can be OK with a certain setup, and another from the
same batch will hit MCEs under similar conditions.

Did you build the machines yourself or did you buy them this way?

Machines getting MCEs that often will fail the burn-in testing that
we use here.

And machines that produce those kinds of temps will also fail our
burn-in process just because that seems a bit too warm.

                           Roger


