* Where do the "Machine Check Exceptions" come from?
@ 2004-01-29 10:01 Kai Militzer
From: Kai Militzer @ 2004-01-29 10:01 UTC
  To: linux-kernel

Hello!

We have a server running here that is behaving very strangely.

It all started when the machine began crashing at two-day intervals with
the following message in the log:

Jan  6 22:39:01 CPU 0: Machine Check Exception: 0000000000000004
Jan  6 22:39:01 Bank 4: b200000000040151
Jan  6 22:39:01 Kernel panic: CPU context corrupt
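
The two hexadecimal values in this message can be split into named bits.
The sketch below assumes the first value is the global machine-check
status register (IA32_MCG_STATUS) and the second is the status register
of bank 4 (IA32_MC4_STATUS), and it uses the architectural bit layout
from Intel's manuals. The helper names are made up for illustration.

```python
# Sketch: decode the values from the panic message.
# Assumption: first value = IA32_MCG_STATUS, second = IA32_MC4_STATUS.

MCG_STATUS_FLAGS = {
    0: "RIPV",   # restart instruction pointer valid
    1: "EIPV",   # error instruction pointer valid
    2: "MCIP",   # machine check in progress
}

MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds an address
    57: "PCC",    # processor context corrupt
}


def set_flags(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank_status(status):
    """Split a bank status value into its flags and two 16-bit code fields."""
    return {
        "flags": set_flags(status, MCI_STATUS_FLAGS),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


mcg = set_flags(0x0000000000000004, MCG_STATUS_FLAGS)
bank4 = decode_bank_status(0xB200000000040151)

print("MCG_STATUS:", mcg)
print("Bank 4 flags:", bank4["flags"])
print("Model-specific code:", hex(bank4["model_specific_code"]))
print("MCA error code:", hex(bank4["mca_error_code"]))
```

Under these assumptions the PCC flag is set in the bank 4 value, which
matches the "CPU context corrupt" text of the panic. The flags are
reported by the processor itself, not derived by the kernel.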

So we took the machine out of production and started to search for the
problem. We first thought it must be a hardware error, so we ran
memtest86 for a long time (over 10 passes) without any errors. We then
booted from a Knoppix CD and ran burnMMX and burnP4. Nothing happened;
everything ran smoothly.

So we thought it might be some other system component, so we removed
everything not needed (network card, SCSI controller) and swapped the
video card. We then booted from Knoppix again and ran a lot of kernel
compiles overnight (on the hard disk, not on a ramdisk) --> all went
smoothly.

We then booted the original system (without all the unneeded hardware),
started compiling kernels, and it crashed after a day. This was strange.
So we looked in our changelog and realized that the crashing started
when we changed the running kernel from a vanilla 2.4.19 to a vanilla
2.4.23.

We thought it could be something in the new kernel. So we took a new
2.4.24 with the config from 2.4.23 (make oldconfig) and tested -->
the system crashed after compiling kernels for a day.

So there must be something else. The next step was to take the config
from the 2.4.19 kernel and do a "make oldconfig" with the 2.4.24. The
system has now been running for two days without a crash. So it must be
something that has changed between the two configs.

To compare them, I took the config from the faulty 2.4.23 kernel and
also ran "make oldconfig" with the working 2.4.19 config on the 2.4.23
kernel.

The output of "diff faulty_config running_config" is attached at the
end of this mail.

Any ideas which new option made the kernel crash? Over the next few
days I will try the three options that are compiled directly into the
kernel (not as modules) and will send an update if I can find out what
causes this behavior.

Best regards

Kai Militzer

++++output of diff+++++

153c153
< CONFIG_BLK_STATS=y
---
> # CONFIG_BLK_STATS is not set
194c194
< CONFIG_IP_NF_TFTP=m
---
> # CONFIG_IP_NF_TFTP is not set
200c200
< CONFIG_IP_NF_MATCH_PKTTYPE=m
---
> # CONFIG_IP_NF_MATCH_PKTTYPE is not set
204,206c204,206
< CONFIG_IP_NF_MATCH_RECENT=m
< CONFIG_IP_NF_MATCH_ECN=m
< CONFIG_IP_NF_MATCH_DSCP=m
---
> # CONFIG_IP_NF_MATCH_RECENT is not set
> # CONFIG_IP_NF_MATCH_ECN is not set
> # CONFIG_IP_NF_MATCH_DSCP is not set
211c211
< CONFIG_IP_NF_MATCH_HELPER=m
---
> # CONFIG_IP_NF_MATCH_HELPER is not set
213c213
< CONFIG_IP_NF_MATCH_CONNTRACK=m
---
> # CONFIG_IP_NF_MATCH_CONNTRACK is not set
227d226
< CONFIG_IP_NF_NAT_TFTP=m
230,231c229,230
< CONFIG_IP_NF_TARGET_ECN=m
< CONFIG_IP_NF_TARGET_DSCP=m
---
> # CONFIG_IP_NF_TARGET_ECN is not set
> # CONFIG_IP_NF_TARGET_DSCP is not set
238c237
< CONFIG_IP_NF_ARP_MANGLE=m
---
> # CONFIG_IP_NF_ARP_MANGLE is not set
329c328
< CONFIG_BLK_DEV_GENERIC=y
---
> # CONFIG_BLK_DEV_GENERIC is not set
557c556
< CONFIG_B44=m
---
> # CONFIG_B44 is not set
565c564
< CONFIG_E100=m
---
> # CONFIG_E100 is not set
593,594c592
< CONFIG_E1000=m
< # CONFIG_E1000_NAPI is not set
---
> # CONFIG_E1000 is not set
599c597
< CONFIG_R8169=m
---
> # CONFIG_R8169 is not set
712c710
< CONFIG_HW_RANDOM=m
---
> # CONFIG_HW_RANDOM is not set
910c908
< CONFIG_DEBUG_STACKOVERFLOW=y
---
> # CONFIG_DEBUG_STACKOVERFLOW is not set
927c925
< CONFIG_CRC32=m
---
> # CONFIG_CRC32 is not set
929c927
< CONFIG_ZLIB_DEFLATE=m
---
> # CONFIG_ZLIB_DEFLATE is not set

+++++end output of diff++++
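
A plain "diff" of two configs reports line positions along with the
changes. As a rough sketch, the comparison can also be done per option,
treating "is not set" and a missing line as the same state. The function
names below are made up for illustration, and the sample input is only a
short excerpt of the options listed above.

```python
import re

# "CONFIG_FOO=y" / "CONFIG_FOO=m" / "CONFIG_FOO=<value>"
SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
# "# CONFIG_FOO is not set"
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option name in a kernel config to its value ('n' if unset)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def config_delta(faulty, running):
    """Return {option: (faulty_value, running_value)} for differing options."""
    delta = {}
    for name in sorted(set(faulty) | set(running)):
        a = faulty.get(name, "n")    # a missing option counts as unset
        b = running.get(name, "n")
        if a != b:
            delta[name] = (a, b)
    return delta


faulty = parse_config("""\
CONFIG_E1000=m
# CONFIG_E1000_NAPI is not set
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_CRC32=m
""")
running = parse_config("""\
# CONFIG_E1000 is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_CRC32 is not set
""")

delta = config_delta(faulty, running)
for name, (a, b) in delta.items():
    print(f"{name}: faulty={a} running={b}")
```

With this normalization, CONFIG_E1000_NAPI drops out of the result: it
is unset in one config and absent from the other, so it cannot be the
option that makes a difference.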

-- 
Kai Militzer                 WESTEND GmbH  |  Internet-Business-Provider
Technik                      CISCO Systems Partner - Authorized Reseller
                             Lütticher Straße 10      Tel 0241/701333-11
km@westend.com               D-52064 Aachen              Fax 0241/911879




* Re: Where do the "Machine Check Exceptions" come from? [update]
@ 2004-02-02 13:51 Kai Militzer
From: Kai Militzer @ 2004-02-02 13:51 UTC
  To: linux-kernel; +Cc: Kai Militzer

Hello everyone!

I have an update on reproducing the strange kernel crashes on a
2.4.24 kernel.

> It all started when the machine began crashing at two-day intervals with
> the following message in the log:

> Jan  6 22:39:01 CPU 0: Machine Check Exception: 0000000000000004
> Jan  6 22:39:01 Bank 4: b200000000040151
> Jan  6 22:39:01 Kernel panic: CPU context corrupt

That's the message that always appears.

We then ran the tests described in my original mail.

> So there must be something else. The next step was to take the config
> from the 2.4.19 kernel and do a "make oldconfig" with the 2.4.24. The
> system has now been running for two days without a crash. So it must be
> something that has changed between the two configs.

The kernel ran for four days without crashing. So I started to enable
options that were enabled in the crashing kernel.

I started with this option, just on a hunch.

> < CONFIG_DEBUG_STACKOVERFLOW=y
> ---
> > # CONFIG_DEBUG_STACKOVERFLOW is not set

It was not set in the kernel that ran for four days, but it was set in
the one that crashed. I activated it (i.e. CONFIG_DEBUG_STACKOVERFLOW=y),
compiled the kernel and let it run under load over the weekend (starting
on Friday). This morning (Monday) it crashed. So I would say it was
CONFIG_DEBUG_STACKOVERFLOW.
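
Each test run here takes days, so the number of kernels that have to be
built matters. The following is a minimal sketch of narrowing down the
differing options by halving instead of testing them one at a time. It
assumes exactly one option is responsible and that the options can be
enabled independently of each other. The "crashes" callback is
hypothetical: in practice it stands for building a kernel with the given
options enabled on top of the stable config and running it under load.
Here it is simulated using the result reported above.

```python
def find_culprit(options, crashes):
    """Narrow a list of options down to the one that triggers the crash.

    crashes(enabled) is a hypothetical callback that returns True if a
    kernel built with exactly `enabled` turned on (on top of the stable
    config) crashed. Returns the culprit and the number of test builds.
    """
    candidates = list(options)
    builds = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        builds += 1
        if crashes(half):
            candidates = half                      # culprit is in the tested half
        else:
            candidates = candidates[len(half):]    # culprit is in the other half
    return candidates[0], builds


# The options that differ between the two configs, in the order of the diff.
differing = [
    "CONFIG_BLK_STATS", "CONFIG_IP_NF_TFTP", "CONFIG_IP_NF_MATCH_PKTTYPE",
    "CONFIG_IP_NF_MATCH_RECENT", "CONFIG_IP_NF_MATCH_ECN",
    "CONFIG_IP_NF_MATCH_DSCP", "CONFIG_IP_NF_MATCH_HELPER",
    "CONFIG_IP_NF_MATCH_CONNTRACK", "CONFIG_IP_NF_NAT_TFTP",
    "CONFIG_IP_NF_TARGET_ECN", "CONFIG_IP_NF_TARGET_DSCP",
    "CONFIG_IP_NF_ARP_MANGLE", "CONFIG_BLK_DEV_GENERIC", "CONFIG_B44",
    "CONFIG_E100", "CONFIG_E1000", "CONFIG_R8169", "CONFIG_HW_RANDOM",
    "CONFIG_DEBUG_STACKOVERFLOW", "CONFIG_CRC32", "CONFIG_ZLIB_DEFLATE",
]


def simulated_crash(enabled):
    # Stand-in for a multi-day test run, based on the observation above.
    return "CONFIG_DEBUG_STACKOVERFLOW" in enabled


culprit, builds = find_culprit(differing, simulated_crash)
print(culprit, "found after", builds, "test builds")
```

A "no crash" result is only as reliable as the length of the test run,
so with a crash that takes up to two days to appear, each half would
need to run at least as long as the four-day stable run before it is
ruled out.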

Does anyone have an idea why this option makes the kernel crash?
Shouldn't this option prevent the kernel from crashing?

If more information is needed (e.g. full kernel config, hardware specs,
etc.) please let me know.

Regards

Kai





* Re[3]: 2.6.2 Compile Failure - Redhat 7.3 Distro
@ 2004-02-08 12:13 Nick Warne
From: Nick Warne @ 2004-02-08 12:13 UTC
  To: linux-kernel

> Hello Robert,
>
>   Sighs... I guess I have to look at making a new set of RPM packages
> for 7.3 distros to upgrade glibc, gcc and a few other packages so that
> they are able to compile the kernel.
>
>  Thanks.

Saturday, February 7, 2004, 4:31:25 PM, you wrote:

RFM> Elikster wrote:

> > fs/proc/array.c: In function `proc_pid_stat':
> > fs/proc/array.c:398: Unrecognizable insn:
> > (insn/i 721 1009 1003 (parallel[
> > (set (reg:SI 0 eax)
> > (asm_operands ("") ("=a") 0[
> > (reg:DI 1 edx)
> > ]
> > [
> > (asm_input:DI ("A"))
> > ]  ("include/linux/times.h") 38))
> > (set (reg:SI 1 edx)
> > (asm_operands ("") ("=d") 1[
> > (reg:DI 1 edx)
> > ]
> > [
> > (asm_input:DI ("A"))
> > ]  ("include/linux/times.h") 38))
> > (clobber (reg:QI 19 dirflag))
> > (clobber (reg:QI 18 fpsr))
> > (clobber (reg:QI 17 flags))
> > ] ) -1 (insn_list 715 (nil))
> > (nil))
> > fs/proc/array.c:398: confused by earlier errors, bailing out
> > make[2]: *** [fs/proc/array.o] Error 1
> > make[1]: *** [fs/proc] Error 2
> > make: *** [fs] Error 2
> > root@longmont [/usr/src/linux-2.6.2]#

I was getting all sorts of odd errors, supposedly in the code, with the
Red Hat GCC:

gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.1)

It used to build OK, then report an error somewhere, but never in the
same place or file. Funnily enough, a kernel built through 5 or 6 of
these errors (i.e. just running 'make' again from where it broke to
carry on the build) works fine.

Well, for me it turned out that GCC gets very _flaky_ if the CPU
(K6 233) gets too hot...

...I just put on a bigger heatsink, and it fixed it!

The machine has never shown any other symptoms of overheating, except
when I build a kernel.
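
The pattern described here (errors that move around between runs) can
be told apart from a real compiler or source problem by repeating the
same build and noting where each attempt stops. A rough sketch of that
check follows. The function name is made up, and apart from
fs/proc/array.o the file names in the example are hypothetical.

```python
def classify_failures(stopped_at):
    """Classify repeated builds of the same tree.

    stopped_at lists, per build attempt, the file the build stopped at,
    or None if that attempt succeeded.
    """
    failures = [name for name in stopped_at if name is not None]
    if not failures:
        return "no failures"
    same_place = len(set(failures)) == 1
    every_run = len(failures) == len(stopped_at)
    if same_place and every_run:
        # Always the same file: points at the compiler or the source.
        return "deterministic"
    # Moving or occasional failures: suspect hardware (heat, memory).
    return "intermittent"


# Every attempt stops at the same file, as in the quoted report.
print(classify_failures(["fs/proc/array.o"] * 3))

# Hypothetical runs that stop in different places or sometimes succeed.
print(classify_failures(["kernel/sched.o", None, "mm/memory.o"]))
```

A "deterministic" result does not prove that the hardware is healthy,
and an "intermittent" one does not prove that it is faulty. It only
suggests where to look first.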

Nick 

