* lockups with 2.4.2x
@ 2003-09-21  2:56 evil
  2003-09-21  5:37 ` Christian
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: evil @ 2003-09-21  2:56 UTC
  To: linux-kernel

Hi y'all,

i am in need of some help tracking down mysterious lock-ups of my
machine. with any vanilla kernel above 2.4.19 the machine boots, then
after 1 to 3 minutes it freezes. no Oops on the
console, no warnings (e.g. no high load or mem-pressure), even SysRq is
not working.

i've already setup a script, writing

/proc/modules|locks|ksyms

to the disk every 2 seconds, because i suspected some 3rd party module i
usually load to be the reason for the freezes. but it was not. please
tell me if other files are more interesting for this, or what else i can
do to get to the source of the problem.
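
A snapshot loop like the one described could look roughly like the
sketch below. This is not the script used on this machine; it is a
minimal Python illustration, and the output directory and the function
names snapshot() and run() are assumptions made up for the example. The
one detail that matters for a hard freeze is the fsync: without it the
last snapshots may still be in the page cache when the box locks up.

```python
import os
import time

# Files named in the message; /proc/ksyms exists on 2.4 kernels only.
PROC_FILES = ["/proc/modules", "/proc/locks", "/proc/ksyms"]


def snapshot(paths, out_dir, stamp):
    """Copy each readable file in paths to out_dir/<name>.<stamp> and force it to disk."""
    written = []
    for path in paths:
        try:
            with open(path, "rb") as src:
                data = src.read()
        except OSError:
            continue  # file not present on this kernel, skip it
        target = os.path.join(out_dir, "%s.%s" % (os.path.basename(path), stamp))
        with open(target, "wb") as dst:
            dst.write(data)
            dst.flush()
            os.fsync(dst.fileno())  # push the data out before the next freeze can hit
        written.append(target)
    return written


def run(out_dir="/var/tmp/freeze-log", interval=2.0):
    """Take a snapshot every `interval` seconds; out_dir is a hypothetical path."""
    os.makedirs(out_dir, exist_ok=True)
    while True:
        snapshot(PROC_FILES, out_dir, time.strftime("%Y%m%d-%H%M%S"))
        time.sleep(interval)
```

Calling run() starts the loop; after a freeze and reboot, the newest
files in the output directory show the last state that reached the disk.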

the machine:
Dual Athlon, 1GB RAM (HighMem enabled), gcc 3.3.1, libc 2.3.2,
(Debian/Testing) some more infos are on:

http://nerdbynature.de/bits/freeze/config|cpuinfo|dmesg|lspci

(directory listing follows...)

I really appreciate some help here, i don't know where to start
searching since no errors are shown :-(

below are some further infos, but i don't know if they are related to
this issue.

Thank you for your time,
Christian.

before this whole mess i was using 2.4.19, but i wanted to upgrade to
2.4.20, 2.4.21, did not make it, due to lack of time or need. 2.4.19 was
running fine. but i need netfilter now, so i had to recompile
modules+kernel!  but, mysteriously, i fail to recompile my 2.4.19. i did
"make mrproper", then even untar'ed a new archive, took the old config,
"make oldconfig" went ok. then
"make dep bzImage modules modules_install" stops within "bzImage":

net/network.o(.text+0xf125): In function `rtnetlink_rcv':
: undefined reference to `rtnetlink_rcv_skb'
make: *** [vmlinux] Error 1

there was another issue with a vanilla 2.4.19, but i tried to fix it:

http://nerdbynature.de/bits/freeze/semaphore.c.patch

hm, is this gcc related? it did not like some missing "" terminations.
ok, after fixing this, the error above showed up.

so, i thought it was time to try 2.4.2x (now 2.4.22), but i am somehow
stuck now. on some completely different machines (PPC32 + single
AMD/Athlon), 2.4.2x kernels are running very fine.





* Re: lockups with 2.4.2x
  2003-09-21  2:56 lockups with 2.4.2x evil
@ 2003-09-21  5:37 ` Christian
  2003-09-21  5:38 ` Willy Tarreau
  2003-09-23  0:23 ` Christian Kujau
  2 siblings, 0 replies; 6+ messages in thread
From: Christian @ 2003-09-21  5:37 UTC
  To: linux-kernel

Some updates from me:

evil aka Christian wrote:
> http://nerdbynature.de/bits/freeze/config|cpuinfo|dmesg|lspci
> 
> (directory listing follows...)

now it's all here: http://www.nerdbynature.de/bits/

> before this whole mess i was using 2.4.19, but i wanted to upgrade to
> 2.4.20, 2.4.21, did not make it, due to lack of time or need. 2.4.19 was
> running fine. but i need netfilter now, so i had to recompile
> modules+kernel!  but, mysteriously, i fail to recompile my 2.4.19. i did

i was able to recompile 2.4.19, but i had to use gcc-2.95.4 (from 
debian/testing.) Kernel boots fine, even loads my 3rd party module
(ISDN/CAPI related, taints the kernel), no freezes. i have compiled 
2.4.22 with gcc2.95 too, but it's still freezing.

Thanks,
Christian.
-- 
BOFH excuse #201:

RPC_PMAP_FAILURE




* Re: lockups with 2.4.2x
  2003-09-21  2:56 lockups with 2.4.2x evil
  2003-09-21  5:37 ` Christian
@ 2003-09-21  5:38 ` Willy Tarreau
  2003-09-21 13:26   ` Christian
  2003-09-23  0:23 ` Christian Kujau
  2 siblings, 1 reply; 6+ messages in thread
From: Willy Tarreau @ 2003-09-21  5:38 UTC
  To: evil; +Cc: linux-kernel

On Sun, Sep 21, 2003 at 04:56:14AM +0200, evil wrote:
 
> the machine:
> Dual Athlon, 1GB RAM (HighMem enabled), gcc 3.3.1, libc 2.3.2,
> (Debian/Testing) some more infos are on:
> 
> http://nerdbynature.de/bits/freeze/config|cpuinfo|dmesg|lspci

Hmmm, there is a lot of hardware in this box. Have you tried disabling IDE ?
ServeRaid ? SymBIOS ? Your hangs may be related to an updatedb or slocate
indexing all your filesystems, and triggering a bug in one of those drivers.
Also, the DMESG shows that you have an AMD bug on your CPUs, and tells you
that if you have problems, you should restart with 'noapic'. Did you try it ?
You could also try to boot in 'nosmp' mode, and even with network unplugged.
I believe it will be relatively quick to find the problem if the system
usually hangs in no more than 3 minutes.

You may also have a defect in your RAM. Someone else here had problems since
2.4.20, and only when saving disks to tape. It was finally tracked down to
a RAM problem which only showed up on SMP with newer kernels which seem to
torture the hardware a bit more. So if your GB ram is 4*256, you can try to
remove 2 sticks and see what happens.

Hope this helps,
Willy



* Re: lockups with 2.4.2x
  2003-09-21  5:38 ` Willy Tarreau
@ 2003-09-21 13:26   ` Christian
  2003-09-21 21:49     ` Christian
  0 siblings, 1 reply; 6+ messages in thread
From: Christian @ 2003-09-21 13:26 UTC
  To: linux-kernel

Willy Tarreau wrote:
> Hmmm, there is a lot of hardware in this box. Have you tried disabling IDE ?
> ServeRaid ? SymBIOS ?

hm, yes, i could disable the ServeRaid module. gotta find out how to 
disable the builtin IDE / Symbios other than recompile the kernel...

can i do this by giving "ide0=noprobe ide1=noprobe ..." on the 
boot-prompt? my rootdisk used to be on hda, will see if i can do something.

> Also, the DMESG shows that you have an AMD bug on your CPUs, and tells you
> that if you have problems, you should restart with 'noapic'. Did you try it ?

Oh, no, I did not. sorry. i'll try it.

> You could also try to boot in 'nosmp' mode, and even with network unplugged.

hm, will try this too.

> I believe it will be relatively quick to find the problem if the system
> usually hangs in no more than 3 minutes.

the worst thing on this machine is the pre-boot process, where all 
the controllers are "Initializing..."  and "Checking...", and even the 
ServeRaid controller sometimes wants 3 minutes to settle :-)

> You may also have a defect in your RAM.

even though it is "ECC" RAM, i'll give it a try.

Thank you for the quick reply, i'll try these things out, it will take 
some time, so maybe i'm back in the evening.

Christian.
(if I only could use my .sig this time :-))
-- 
BOFH excuse #301:

appears to be a Slow/Narrow SCSI-0 Interface problem




* Re: lockups with 2.4.2x
  2003-09-21 13:26   ` Christian
@ 2003-09-21 21:49     ` Christian
  0 siblings, 0 replies; 6+ messages in thread
From: Christian @ 2003-09-21 21:49 UTC
  To: linux-kernel

Christian wrote:
> hm, yes, i could disable the ServeRaid module. gotta find out how to 
> disable the builtin IDE / Symbios other than recompile the kernel...
[...]

i took your advice and booted with "nosmp" and "noapic" into
single-user. then i enabled all modules i used to load. i tried to
produce some I/O with "updatedb" and "find /" and so on, everything
looked fine.

the most time consuming part was starting some apps out of my
init-scripts, to see if they survive some minutes while using the system.

i finally narrowed it down to a few applications left, but further
testing is required. (due to my lack of time, i'll go on tomorrow)
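
The narrowing-down described above is essentially a bisection over the
list of things started at boot. A minimal Python sketch of that idea
follows; it is an illustration, not the procedure actually used here.
The function name find_culprit() and the freezes_with() callback are
assumptions, and the sketch assumes exactly one candidate triggers the
freeze on its own.

```python
def find_culprit(candidates, freezes_with):
    """Halve the candidate list until a single item remains.

    freezes_with(subset) must return True if the machine locked up with
    only that subset started. Assumes exactly one candidate is at fault.
    """
    remaining = list(candidates)
    if not remaining:
        raise ValueError("need at least one candidate")
    while len(remaining) > 1:
        half = remaining[: len(remaining) // 2]
        if freezes_with(half):
            remaining = half                   # culprit is in the half just tested
        else:
            remaining = remaining[len(half):]  # culprit is in the untested half
    return remaining[0]
```

Since every failing trial hangs the machine, freezes_with() cannot
really be automated in one run; in practice each answer comes from a
manual boot-and-wait test and is noted down by hand. With N init-script
entries this needs about log2(N) reboots instead of N.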

>> You may also have a defect in your RAM.

i did not try removing RAM or booting a 2.4.22 with no "HIGHMEM" set.

oh, and can it be, that when i boot with "nosmp noapic", the use of APIC
is forced and APIC is initialized upon booting? i don't have the exact
dmesg output right now, but i remembered sth. like this.

Thank you,
Christian.
-- 
BOFH excuse #417:

Computer room being moved.  Our systems are down for the weekend.








* Re: lockups with 2.4.2x
  2003-09-21  2:56 lockups with 2.4.2x evil
  2003-09-21  5:37 ` Christian
  2003-09-21  5:38 ` Willy Tarreau
@ 2003-09-23  0:23 ` Christian Kujau
  2 siblings, 0 replies; 6+ messages in thread
From: Christian Kujau @ 2003-09-23  0:23 UTC
  To: linux-kernel

hi again,


i seem to have found the source of my problem here. a little application named 
"dnetc" (RC5-72 number cruncher, see http://www.distributed.net) causes 
the lockups when running under 2.4.22. no joke, i too don't want to 
believe this, but it's quite reproducible.

ok, the thing is: yes, i can live without this st00pid number-cruncher, 
so my system won't crash. otoh, i wonder why this little userspace 
application crashes a whole system so badly, that it's not even able to 
give an Oops. i will try to "strace ./dnetc" or execute it in gdb, 
but i'm pretty sure, it will crash anyway, and the system won't have 
time to show errors or even write them to the disk.
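
One way around the "no time to write to disk" problem is to ship the
trace output to a second machine line by line, so that whatever was
emitted before the freeze survives there. The Python sketch below
illustrates that idea only; forward_output() is a made-up name, the
receiving host and port are assumptions, and UDP gives no delivery
guarantee, so the very last lines may still be lost.

```python
import socket
import subprocess


def forward_output(cmd, host, port):
    """Run cmd and send each line of its combined stdout/stderr as one UDP datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    sent = 0
    try:
        for line in proc.stdout:
            sock.sendto(line, (host, port))  # one datagram per output line
            sent += 1
    finally:
        proc.stdout.close()
        proc.wait()
        sock.close()
    return sent
```

For example, forward_output(["strace", "-f", "./dnetc"], "192.0.2.10", 5140)
would run the client under strace and send the trace to a UDP listener
on another box; the address and port are placeholders, and any listener
that writes received datagrams to a file will do on the other end.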

btw, the "dnetc" is some kind of "open source but please don't look into 
it" thing, and even if we manage to compile it, the produced output is 
invalid.

oh, and i still did not try removing memory or booting with no 
"highmem" or a "not preemptible kernel" and such. will do so, if that helps.

Thanks for reading,
Christian.

-- 
BOFH excuse #104:

backup tape overwritten with copy of system manager's favourite CD


