linux-kernel.vger.kernel.org archive mirror
* Machine check expection panic
@ 2003-08-06 22:35 kwijibo
  2003-08-06 23:05 ` Matt Mackall
  2003-08-07  0:27 ` Dave Jones
  0 siblings, 2 replies; 10+ messages in thread
From: kwijibo @ 2003-08-06 22:35 UTC (permalink / raw)
  To: linux-kernel

I decided to try out the new 2.6.0-test2 kernel today but
ran into a problem with booting it.  I narrowed it down to
the machine check exception code.  I get this panic from
the kernel on boot when I have it enabled:

CPU0: Machine Check Exception: 0000000000000004
Bank0: f606200000000833 at 0000000000004040
Kernel Panic: CPU context corrupt.

I disabled this option in the kernel and recompiled, and everything
went smoothly.  I figured there might actually be
something wrong with the CPU, but I can boot fine with RedHat's
2.4.20-19 kernel, which I *think* includes machine check exception
code.  I have no beef with leaving the exception code out but I figured
someone on this list may want to know.

Little bit of hardware info:
Tyan 2466 motherboard
2 Athlon MP 1200 processors

Steve




* Re: Machine check expection panic
  2003-08-06 22:35 Machine check expection panic kwijibo
@ 2003-08-06 23:05 ` Matt Mackall
  2003-08-07  0:27 ` Dave Jones
  1 sibling, 0 replies; 10+ messages in thread
From: Matt Mackall @ 2003-08-06 23:05 UTC (permalink / raw)
  To: kwijibo; +Cc: linux-kernel

On Wed, Aug 06, 2003 at 04:35:33PM -0600, kwijibo@zianet.com wrote:
> I decided to try out the new 2.6.0-test2 kernel today but
> ran into a problem with booting it.  I narrowed it down to
> the machine check exception code.  I get this panic from
> the kernel on boot when I have it enabled
> 
> CPU0: Machine Check Exception: 0000000000000004
> Bank0: f606200000000833 at 0000000000004040
> Kernel Panic: CPU context corrupt.

$ parsemce -b 0 -e 0000000000000004 -s f606200000000833 -a 0000000000004040
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(0): f606200000000833 @ 4040
        External tag parity error
        Uncorrectable ECC error
        CPU state corrupt. Restart not possible
        Address in addr register valid
        Error enabled in control register
        Error not corrected.
        Error overflow
        Bus and interconnect error
        Participation: Local processor originated request
        Timeout: Request did not timeout
        Request: Generic error
        Transaction type : Instruction
        Memory/IO : Other

Looks like corruption with your L2 cache. Odds are it's heat-related.
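
For anyone without parsemce at hand, here is a minimal sketch of how the
flag lines above fall out of the raw values. It assumes the architectural
bit positions for MCG_STATUS (RIPV bit 0, MCIP bit 2) and MCi_STATUS
(VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57). The helper
names are made up for illustration and are not taken from parsemce or the
kernel, and the model-specific low error-code bits are not decoded.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* MCG_STATUS bits (architectural). */
#define MCG_STATUS_RIPV   (1ULL << 0)   /* restart IP valid */
#define MCG_STATUS_MCIP   (1ULL << 2)   /* machine check in progress */

/* MCi_STATUS flag bits (architectural). */
#define MCI_STATUS_VAL    (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_OVER   (1ULL << 62)  /* an earlier error was overwritten */
#define MCI_STATUS_UC     (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_EN     (1ULL << 60)  /* error enabled in control register */
#define MCI_STATUS_MISCV  (1ULL << 59)  /* misc register valid */
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* address register valid */
#define MCI_STATUS_PCC    (1ULL << 57)  /* processor context corrupt */

/* Hypothetical helpers for the global status register. */
static int mcg_in_progress(uint64_t mcg)
{
	return (mcg & MCG_STATUS_MCIP) != 0;
}

static int mcg_restartable(uint64_t mcg)
{
	return (mcg & MCG_STATUS_RIPV) != 0;
}

/* Nonzero when the bank reports a valid error with PCC set. */
static int mci_context_corrupt(uint64_t status)
{
	return (status & MCI_STATUS_VAL) && (status & MCI_STATUS_PCC);
}

/*
 * Write the names of the flag bits set in status into buf, separated by
 * spaces. Returns the number of flags set; names that do not fit in buf
 * are dropped.
 */
static int mci_status_flags(uint64_t status, char *buf, size_t len)
{
	static const struct {
		uint64_t bit;
		const char *name;
	} flags[] = {
		{ MCI_STATUS_VAL,   "VAL"   },
		{ MCI_STATUS_OVER,  "OVER"  },
		{ MCI_STATUS_UC,    "UC"    },
		{ MCI_STATUS_EN,    "EN"    },
		{ MCI_STATUS_MISCV, "MISCV" },
		{ MCI_STATUS_ADDRV, "ADDRV" },
		{ MCI_STATUS_PCC,   "PCC"   },
	};
	size_t used = 0;
	int found = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		int n;

		if (!(status & flags[i].bit))
			continue;
		found++;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", flags[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* did not fit: drop this name */
			continue;
		}
		used += (size_t)n;
	}
	return found;
}
```

With the values from the report, mci_status_flags(0xf606200000000833ULL, ...)
fills the buffer with "VAL OVER UC EN ADDRV PCC", and MCG_STATUS 0x4 has MCIP
set with RIPV clear, which lines up with the "Restart IP invalid" and "CPU
state corrupt" lines in the parsemce output.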

> I disabled this option in the kernel and recompiled and everything
> went smooth.  I figured maybe there could actually possibly be
> something wrong with the CPU but I can boot with RedHat's
> 2.4.20-19 kernel fine which I *think* includes machine check exception
> code.  I have no beef with leaving the exception code out but I figured
> someone on this list may want to know. 
> 
> Little bit of hardware info:
> Tyan 2466 motherboard
> 2 Athlon MP 1200 processors

-- 
Matt Mackall : http://www.selenic.com : of or relating to the moon


* Re: Machine check expection panic
  2003-08-06 22:35 Machine check expection panic kwijibo
  2003-08-06 23:05 ` Matt Mackall
@ 2003-08-07  0:27 ` Dave Jones
  1 sibling, 0 replies; 10+ messages in thread
From: Dave Jones @ 2003-08-07  0:27 UTC (permalink / raw)
  To: kwijibo; +Cc: linux-kernel

On Wed, Aug 06, 2003 at 04:35:33PM -0600, kwijibo@zianet.com wrote:
 > I decided to try out the new 2.6.0-test2 kernel today but
 > ran into a problem with booting it.  I narrowed it down to
 > the machine check exception code.  I get this panic from
 > the kernel on boot when I have it enabled
 > 
 > CPU0: Machine Check Exception: 0000000000000004
 > Bank0: f606200000000833 at 0000000000004040
 > Kernel Panic: CPU context corrupt.

This is a bugfix from the 2.4 kernel that never made it into 2.5.
Chances are you (and many other Athlon users) are hitting problems
because of this chunk.

Already pushed to Linus/Andrew.

		Dave


# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1055  -> 1.1056 
#	arch/i386/kernel/cpu/mcheck/k7.c	1.4     -> 1.5    
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/08/06	davej@redhat.com	1.1056
# stupid off by one
# --------------------------------------------
#
diff -Nru a/arch/i386/kernel/cpu/mcheck/k7.c b/arch/i386/kernel/cpu/mcheck/k7.c
--- a/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
+++ b/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
@@ -81,7 +81,7 @@
 		wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
 	nr_mce_banks = l & 0xff;
 
-	for (i=0; i<nr_mce_banks; i++) {
+	for (i=1; i<nr_mce_banks; i++) {
 		wrmsr (MSR_IA32_MC0_CTL+4*i, 0xffffffff, 0xffffffff);
 		wrmsr (MSR_IA32_MC0_STATUS+4*i, 0x0, 0x0);
 	}


* Re: Machine check expection panic
  2003-08-11 10:15           ` Petr Vandrovec
@ 2003-08-11 11:34             ` Bartlomiej Zolnierkiewicz
  0 siblings, 0 replies; 10+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2003-08-11 11:34 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: ak, kwijibo, Dave Jones, richard.brunner, linux-kernel


Just "me too".

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: 8000000000002140

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(tm) XP 1700+
stepping        : 1
cpu MHz         : 1467.033
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 2883.58

--bartlomiej

On Mon, 11 Aug 2003, Petr Vandrovec wrote:

> Out of curiosity, I never got MCE on my system at home (last kernel
> before one below was 2.6.0-test2, and it did not complain for
> different kernels at least since November 2001), yet after recent MCE
> changes I got during fsck:
>
> MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
> Bank 0: f65980000000baff



* Re: Machine check expection panic
  2003-08-10 21:04         ` kwijibo
@ 2003-08-11 10:15           ` Petr Vandrovec
  2003-08-11 11:34             ` Bartlomiej Zolnierkiewicz
  0 siblings, 1 reply; 10+ messages in thread
From: Petr Vandrovec @ 2003-08-11 10:15 UTC (permalink / raw)
  To: ak; +Cc: kwijibo, Dave Jones, richard.brunner, linux-kernel

On Sun, Aug 10, 2003 at 03:04:01PM -0600, kwijibo@zianet.com wrote:
> Out of curiosity I decided to try this on some other Athlon
> systems I have.  I tried it on a dual Athlon MP 2400(2GHz)
> system with a Tyan 2462 motherboard.  Also I tried it on a
> single Athlon XP 1800 with a Asus A7V motherboard.  They
> both booted fine with the 2.6.0-test2 kernel and the machine
> exception code in it.  So I am thinking either it is something
> with the older CPU's or the CPU is actually borked.  Like I said
> though I have been using those 1.2GHz processors for a long time
> with no problems.

Out of curiosity: I never got an MCE on my system at home (the last kernel
before the one below was 2.6.0-test2, and no kernel has complained
since at least November 2001), yet after the recent MCE
changes I got this during fsck:

MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: f65980000000baff

The system booted these 2.4.x kernels, which are supposed to contain the
off-by-one fix, without complaints (first number = no. of boots):

8 Linux version 2.4.20-30 (root@ppc) (gcc version 3.3 (Debian))
1 Linux version 2.4.20-4GB-athlon (root@Athlon.suse.de) (gcc version 3.3 20030226 (prerelease) (SuSE Linux))
4 Linux version 2.4.20-sp (root@ppc) (gcc version 3.3 (Debian))
1 Linux version 2.4.21-0.11mdk (quintela@bi.mandrakesoft.com) (gcc version 3.2.2 (Mandrake Linux 9.1 3.2.2-2mdk))
2 Linux version 2.4.21-0.11mdksecure (quintela@bi.mandrakesoft.com) (gcc version 3.2.2 (Mandrake Linux 9.1 3.2.2-2mdk))
2 Linux version 2.4.21-0.13mdk (quintela@bi.mandrakesoft.com) (gcc version 3.2.2 (Mandrake Linux 9.1 3.2.2-3mdk))
4 Linux version 2.4.21-pre7 (root@ppc) (gcc version 3.2.3 20030331 (Debian prerelease))

Any idea what's going wrong?
					Best regards,
						Petr Vandrovec
						vandrove@vc.cvut.cz

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 4
model name	: AMD Athlon(tm) Processor
stepping	: 2
cpu MHz		: 1009.064
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips	: 1986.56


Linux version 2.6.0-test3-c1149 (root@ppc) (gcc version 3.3.1 (Debian)) #1 SMP Sun Aug 10 19:42:22 CEST 2003
Video mode to be used for restore is f00
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
 BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000001ffec000 (usable)
 BIOS-e820: 000000001ffec000 - 000000001ffef000 (ACPI data)
 BIOS-e820: 000000001ffef000 - 000000001ffff000 (reserved)
 BIOS-e820: 000000001ffff000 - 0000000020000000 (ACPI NVS)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
511MB LOWMEM available.
On node 0 totalpages: 131052
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 126956 pages, LIFO batch:16
  HighMem zone: 0 pages, LIFO batch:1
ACPI: RSDP (v000 ASUS                       ) @ 0x000f6a90
ACPI: RSDT (v001 ASUS   A7V      12336.12337) @ 0x1ffec000
ACPI: FADT (v001 ASUS   A7V      12336.12337) @ 0x1ffec080
ACPI: BOOT (v001 ASUS   A7V      12336.12337) @ 0x1ffec040
ACPI: DSDT (v001   ASUS A7V      00000.04096) @ 0x00000000
ACPI: BIOS passes blacklist
ACPI: MADT not present
Building zonelist for node : 0
Kernel command line: BOOT_IMAGE=Linux ro root=2105 video=matrox:vesa:0x117,fv:85 video=matroxfb:vesa:0x117,fv:85 nmi_watchdog=1 devfs=nomount
Local APIC disabled by BIOS -- reenabling.
Found and enabled local APIC!
Initializing CPU#0
PID hash table entries: 2048 (order 11: 16384 bytes)
Detected 1009.064 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1986.56 BogoMIPS
Memory: 514556k/524208k available (2195k kernel code, 8856k reserved, 672k data, 364k init, 0k highmem)
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
-> /dev
-> /dev/console
-> /root
CPU:     After generic identify, caps: 0183fbff c1c7fbff 00000000 00000000
CPU:     After vendor identify, caps: 0183fbff c1c7fbff 00000000 00000000
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 256K (64 bytes/line)
CPU:     After all inits, caps: 0183fbff c1c7fbff 00000000 00000020
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Enabling fast FPU save and restore... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
CPU0: AMD Athlon(tm) Processor stepping 02
per-CPU timeslice cutoff: 731.16 usecs.
task migration cache decay timeout: 1 msecs.
SMP motherboard not detected.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
testing NMI watchdog ... OK.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1008.0833 MHz.
..... host bus clock speed is 201.0766 MHz.
Starting migration thread for cpu 0
CPUS done 2
Initializing RT netlink socket
PCI: PCI BIOS revision 2.10 entry at 0xf1180, last bus=1
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
BIO: pool of 256 setup, 15Kb (60 bytes/bio)
biovec pool[0]:   1 bvecs: 256 entries (12 bytes)
biovec pool[1]:   4 bvecs: 256 entries (48 bytes)
biovec pool[2]:  16 bvecs: 256 entries (192 bytes)
biovec pool[3]:  64 bvecs: 256 entries (768 bytes)
biovec pool[4]: 128 bvecs: 256 entries (1536 bytes)
biovec pool[5]: 256 bvecs: 256 entries (3072 bytes)
ACPI: Subsystem revision 20030714
spurious 8259A interrupt: IRQ7.
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
drivers/usb/core/usb.c: registered new driver usbfs
drivers/usb/core/usb.c: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: if you experience problems, try using option 'pci=noacpi' or even 'acpi=off'
matroxfb: Matrox G450 detected
matroxfb: MTRR's turned on
matroxfb: 1024x768x16bpp (virtual: 1024x8190)
matroxfb: framebuffer at 0xCE000000, mapped to 0xe080f000, size 33554432
Console: switching to colour frame buffer device 128x48
fb0: MATROX frame buffer device
pty: 256 Unix98 ptys configured
SBF: Simple Boot Flag extension found and enabled.
SBF: Setting boot flags 0x1
Machine check exception polling timer started.
IA-32 Microcode Update Driver: v1.11 <tigran@veritas.com>
Journalled Block Device driver loaded
devfs: v1.22 (20021013) Richard Gooch (rgooch@atnf.csiro.au)
devfs: boot_options: 0x0
Initializing Cryptographic API
PCI: Disabling Via external APIC routing
ACPI: Power Button (FF) [PWRF]
ACPI: Processor [CPU0] (supports C1 C2, 16 throttling states)
Real Time Clock Driver v1.11
Non-volatile memory driver v1.2
Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
parport0: PC-style at 0x378 (0x778) [PCSPP,TRISTATE]
parport0: cpp_daisy: aa5500ff(38)
parport0: assign_addrs: aa5500ff(38)
parport0: cpp_daisy: aa5500ff(38)
parport0: assign_addrs: aa5500ff(38)
parport_pc: Via 686A parallel port: io=0x378
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 0000:00:04.1
VP_IDE: chipset revision 16
VP_IDE: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: VIA vt82c686a (rev 22) IDE UDMA66 controller on pci0000:00:04.1
    ide0: BM-DMA at 0xd800-0xd807, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0xd808-0xd80f, BIOS settings: hdc:DMA, hdd:pio
hda: PLEXTOR CD-R PX-W2410A, ATAPI CD/DVD-ROM drive
Using anticipatory scheduling elevator
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: TOSHIBA MK6409MAV, ATA DISK drive
ide1 at 0x170-0x177,0x376 on irq 15
PDC20265: IDE controller at PCI slot 0000:00:11.0
PDC20265: chipset revision 2
PDC20265: 100% native mode on irq 10
PDC20265: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
    ide2: BM-DMA at 0x8400-0x8407, BIOS settings: hde:pio, hdf:pio
    ide3: BM-DMA at 0x8408-0x840f, BIOS settings: hdg:pio, hdh:pio
hde: WDC WD1200BB-00CAA1, ATA DISK drive
ide2 at 0x9800-0x9807,0x9402 on irq 10
hdh: WDC WD1200BB-00CAA1, ATA DISK drive
ide3 at 0x9000-0x9007,0x8802 on irq 10
hdc: max request size: 128KiB
hdc: 12685680 sectors (6495 MB), CHS=13424/15/63, UDMA(33)
 /dev/ide/host0/bus1/target0/lun0: p1
hde: max request size: 128KiB
hde: 234441648 sectors (120034 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
 /dev/ide/host2/bus0/target0/lun0: p1 p2 < p5 p6 p7 >
hdh: max request size: 128KiB
hdh: 234441648 sectors (120034 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
 /dev/ide/host2/bus1/target1/lun0: p1 p2 < p5 p6 >
hda: ATAPI 40X CD-ROM CD-R/RW drive, 4096kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
matroxfb_crtc2: secondary head of fb0 was registered as fb1
drivers/usb/host/uhci-hcd.c: USB Universal Host Controller Interface driver v2.1
uhci-hcd 0000:00:04.2: UHCI Host Controller
uhci-hcd 0000:00:04.2: irq 9, io base 0000d400
uhci-hcd 0000:00:04.2: new USB bus registered, assigned bus number 1
hub 1-0:0: USB hub found
hub 1-0:0: 2 ports detected
uhci-hcd 0000:00:04.3: UHCI Host Controller
uhci-hcd 0000:00:04.3: irq 9, io base 0000d000
uhci-hcd 0000:00:04.3: new USB bus registered, assigned bus number 2
hub 2-0:0: USB hub found
hub 2-0:0: 2 ports detected
mice: PS/2 mouse device common for all mice
input: PC Speaker
input: ImExPS/2 Generic Explorer Mouse on isa0060/serio1
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
i2c /dev entries driver module version 2.7.0 (20021208)
oprofile: using NMI interrupt.
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 2048 buckets, 32Kbytes
TCP: Hash tables configured (established 16384 bind 21845)
Initializing IPsec netlink socket
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
BIOS EDD facility v0.09 2003-Jan-22, 3 devices found
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 364k freed
hub 2-0:0: debounce: port 2: delay 100ms stable 4 status 0x101
hub 2-0:0: new USB device on port 2, assigned address 2
hub 2-2:0: USB hub found
hub 2-2:0: 4 ports detected
Adding 1951856k swap on /dev/hde6.  Priority:-1 extents:1
MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.
Bank 0: f65980000000baff
ne2k-pci.c:v1.02 10/19/2000 D. Becker/P. Gortmaker
  http://www.scyld.com/network/ne2k-pci.html
eth0: RealTek RTL-8029 found at 0xa000, IRQ 10, 00:C0:26:30:B0:2D.


* Re: Machine check expection panic
  2003-08-10 13:07       ` Andi Kleen
@ 2003-08-10 21:04         ` kwijibo
  2003-08-11 10:15           ` Petr Vandrovec
  0 siblings, 1 reply; 10+ messages in thread
From: kwijibo @ 2003-08-10 21:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Dave Jones, richard.brunner, linux-kernel

Out of curiosity I decided to try this on some other Athlon
systems I have.  I tried it on a dual Athlon MP 2400 (2GHz)
system with a Tyan 2462 motherboard.  I also tried it on a
single Athlon XP 1800 with an Asus A7V motherboard.  They
both booted fine with the 2.6.0-test2 kernel and the machine
check exception code in it.  So I am thinking either it is something
with the older CPUs or the CPU is actually borked.  Like I said
though, I have been using those 1.2GHz processors for a long time
with no problems.

Steve

Andi Kleen wrote:

>> The CPU's aren't overclocked and have worked fine for
>> me under much heavier loads than booting a kernel for
>
> It could be corrected ECC errors in the cache. If that
> happens I would consider it a hardware problem
> (now hidden with the disabled bank).
>
>> at least a year. Using the 2.4 kernel that is. Once
>> I remove the exception code from the kernel it boots
>> fine and runs fine under any load I put it under.
>
> I maintain that such a magic hack needs at least a big fat comment.
>
> I still find the change very suspicious, there isn't any errata that
> says that bank 0 is bad on Athlon.
>
> Also disabling a whole bank just for some buggy CPUs is quite a sledgehammer,
> it would be probably better to identify the bank 0 sub unit that causes it
> and only turn that off.
>
> -Andi




* Re: Machine check expection panic
  2003-08-10  8:12     ` kwijibo
@ 2003-08-10 13:07       ` Andi Kleen
  2003-08-10 21:04         ` kwijibo
  0 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2003-08-10 13:07 UTC (permalink / raw)
  To: kwijibo; +Cc: Andi Kleen, Dave Jones, richard.brunner, linux-kernel

> The CPU's aren't overclocked and have worked fine for
> me under much heavier loads than booting a kernel for

It could be corrected ECC errors in the cache. If that
happens I would consider it a hardware problem
(now hidden with the disabled bank).

> at least a year. Using the 2.4 kernel that is. Once
> I remove the exception code from the kernel it boots
> fine and runs fine under any load I put it under.

I maintain that such a magic hack needs at least a big fat comment.

I still find the change very suspicious; there isn't any erratum that
says that bank 0 is bad on Athlon.

Also, disabling a whole bank just for some buggy CPUs is quite a sledgehammer;
it would probably be better to identify the bank 0 sub unit that causes it
and only turn that off.

-Andi
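
A rough sketch of what that narrower approach might look like: instead of
skipping bank 0 entirely, write MC0_CTL with only selected enable bits
cleared. Which bit (if any) corresponds to the troublesome sub unit is not
established anywhere in this thread, so the bit number below is a
placeholder, and the type and helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit MSR value split into the two halves that wrmsr() takes. */
struct msr_val {
	uint32_t lo;
	uint32_t hi;
};

/* PLACEHOLDER: not a documented Athlon bit, purely for illustration. */
#define SUSPECT_MC0_CTL_BIT  2

/* Enable every error source in a bank except those set in disable_mask. */
static struct msr_val mc_ctl_enable_except(uint64_t disable_mask)
{
	uint64_t ctl = ~0ULL & ~disable_mask;
	struct msr_val v = { (uint32_t)ctl, (uint32_t)(ctl >> 32) };

	return v;
}

/* Bank 0 control value with only the placeholder bit turned off. */
static struct msr_val mc0_ctl_suspect_masked(void)
{
	assert(SUSPECT_MC0_CTL_BIT < 64);
	return mc_ctl_enable_except(1ULL << SUSPECT_MC0_CTL_BIT);
}
```

The init loop could then keep starting at bank 0, pass these two halves to
wrmsr() for MC0_CTL, and continue to write 0xffffffff, 0xffffffff for the
remaining banks.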



* Re: Machine check expection panic
  2003-08-07  1:00   ` Andi Kleen
  2003-08-07  1:34     ` Dave Jones
@ 2003-08-10  8:12     ` kwijibo
  2003-08-10 13:07       ` Andi Kleen
  1 sibling, 1 reply; 10+ messages in thread
From: kwijibo @ 2003-08-10  8:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Dave Jones, richard.brunner, linux-kernel


Andi Kleen wrote:

>Dave Jones <davej@redhat.com> writes:
>
>>#
>>diff -Nru a/arch/i386/kernel/cpu/mcheck/k7.c b/arch/i386/kernel/cpu/mcheck/k7.c
>>--- a/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
>>+++ b/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
>>@@ -81,7 +81,7 @@
>> 		wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
>> 	nr_mce_banks = l & 0xff;
>> 
>>-	for (i=0; i<nr_mce_banks; i++) {
>>+	for (i=1; i<nr_mce_banks; i++) {
>
>The change looks rather suspicious to me.
>
>Bank 0 is the data cache unit (DC)
>
>Do you have an errata that says that the DC bank is bad on all Athlons?
>
>Normally BIOS or microcode are supposed to turn off bad MCEs by 
>masking them in another register. Maybe the person's CPU has a 
>real problem that is just masked now, e.g. it could be overclocked
>and stress the cache too much.
>
The CPUs aren't overclocked and have worked fine for
me for at least a year under much heavier loads than
booting a kernel. Using the 2.4 kernel, that is. Once
I remove the exception code from the kernel it boots
fine and runs fine under any load I put it under.

>
>The original MCE was:
>
>Status: (4) Machine Check in progress.
>Restart IP invalid.
>parsebank(0): f606200000000833 @ 4040
>        External tag parity error
>        Uncorrectable ECC error
>        CPU state corrupt. Restart not possible
>        Address in addr register valid
>        Error enabled in control register
>        Error not corrected.
>        Error overflow
>        Bus and interconnect error
>        Participation: Local processor originated request
>        Timeout: Request did not timeout
>        Request: Generic error
>        Transaction type : Instruction
>        Memory/IO : Other
>
>Tyan 2466 motherboard
>2 Athlon MP 1200 processors  (1200?)
>
Should say 1.2 GHz processors, I imagine. AMD and their
wacky naming schemes. This is before they had their
wacky number scheme.

Steve





* Re: Machine check expection panic
  2003-08-07  1:00   ` Andi Kleen
@ 2003-08-07  1:34     ` Dave Jones
  2003-08-10  8:12     ` kwijibo
  1 sibling, 0 replies; 10+ messages in thread
From: Dave Jones @ 2003-08-07  1:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: richard.brunner, linux-kernel, kwijibo

On Thu, Aug 07, 2003 at 03:00:14AM +0200, Andi Kleen wrote:

 > The change looks rather suspicious to me.

It's been in 2.4 for months, it solved the same problem there as
many people are now seeing in 2.6.  The "I don't get MCEs in 2.4
but I get them in 2.6" reports are numerous, and I don't buy the
"2.6 stresses hardware more" theory for a second.
 
 > Bank 0 is the data cache unit (DC)
 > Do you have an errata that says that the DC bank is bad on all Athlons?

Hmm, I thought this was actually documented, but I can't seem to find
it in any of the docs I have. There are however gaps between the
errata numbers in a few cases, so it's possible it was removed in
a later version of the revision guide.  Richard?
 
 > Normally BIOS or microcode are supposed to turn off bad MCEs by 
 > masking them in another register. Maybe the person's CPU has a 
 > real problem that is just masked now, e.g. it could be overclocked
 > and stress the cache too much.

I recall seeing Athlon owners complain when I 'fixed' this problem
using an inverse of this patch in 2.4.19-pre3. For pre4, Marcelo
backed it out, and people were happy again.

Whether it's documented or not, there are boxes out there that don't
like having that bank enabled.

		Dave

-- 
 Dave Jones     http://www.codemonkey.org.uk


* Re: Machine check expection panic
       [not found] ` <20030807002722.GA3579@suse.de.suse.lists.linux.kernel>
@ 2003-08-07  1:00   ` Andi Kleen
  2003-08-07  1:34     ` Dave Jones
  2003-08-10  8:12     ` kwijibo
  0 siblings, 2 replies; 10+ messages in thread
From: Andi Kleen @ 2003-08-07  1:00 UTC (permalink / raw)
  To: Dave Jones; +Cc: richard.brunner, linux-kernel, kwijibo

Dave Jones <davej@redhat.com> writes:
> #
> diff -Nru a/arch/i386/kernel/cpu/mcheck/k7.c b/arch/i386/kernel/cpu/mcheck/k7.c
> --- a/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
> +++ b/arch/i386/kernel/cpu/mcheck/k7.c	Wed Aug  6 23:33:40 2003
> @@ -81,7 +81,7 @@
>  		wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
>  	nr_mce_banks = l & 0xff;
>  
> -	for (i=0; i<nr_mce_banks; i++) {
> +	for (i=1; i<nr_mce_banks; i++) {

The change looks rather suspicious to me.

Bank 0 is the data cache unit (DC)

Do you have an errata that says that the DC bank is bad on all Athlons?

Normally BIOS or microcode are supposed to turn off bad MCEs by 
masking them in another register. Maybe the person's CPU has a 
real problem that is just masked now, e.g. it could be overclocked
and stress the cache too much.

The original MCE was:

Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(0): f606200000000833 @ 4040
        External tag parity error
        Uncorrectable ECC error
        CPU state corrupt. Restart not possible
        Address in addr register valid
        Error enabled in control register
        Error not corrected.
        Error overflow
        Bus and interconnect error
        Participation: Local processor originated request
        Timeout: Request did not timeout
        Request: Generic error
        Transaction type : Instruction
        Memory/IO : Other

Tyan 2466 motherboard
2 Athlon MP 1200 processors  (1200?)

-Andi

