Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])

linux-kernel.vger.kernel.org archive mirror

* Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])
@ 2002-10-15  1:08 harisri
  2002-10-15 13:05 ` Srihari Vijayaraghavan
  0 siblings, 1 reply; 6+ messages in thread
From: harisri @ 2002-10-15  1:08 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, harisri

Hello Andrea,

> this smells like a problem with one of your modules. Please make 100%
> sure you use exactly the same .config for both 2.4.20pre10 and
> 2.4.20pre10aa1 and please try to find which is the module that is
> crashing the kernel after it's being loaded. Expect always different
> kind of crashes and oopses. You can also try to turn on the slab
> debugging option in the kernel hacking menu.

Yes I am using the same .config file from 2.4.20-pre10 on 
2.4.20-pre10aa1 (of course I run make oldconfig, and accept the default 
setting that shows up on 2.4.20-pre10aa1)

I think you are right, it has something to do with the kernel modules.

> > Code;  c01e55e2 <fast_clear_page+12/50>
> 
> you also may want to configure the kernel as i686 instead of K7 so
> fast_clear_page won't be used to see if it makes any difference.

OK, that didn't really help. Even a kernel compiled for i386 crashes, but 
the K7-optimised kernel crashes at Athlon speed :-)

> the place where the oops happens is most certainly not the problem,
> either something is wrong with fast_clear_page for whatever hardware
> reason, or more likely the module loaded by modprobe is corrupting the
> freelist and alloc_pages returned garbage.
> 
> btw, how much memory do you have? If you've more than 800M it could be a
> broken driver using pte_offset by hand, try to reproduce with mem=800m
> in such case. To fix this you should find which is the module that is
> destabilizing the kernel.

My computer has 512 MB RAM. No highmem.

I am able to trigger the issue (after 3 attempts [1]) with,
CONFIG_AGP=m
CONFIG_AGP_AMD=y
CONFIG_DRM=y
CONFIG_DRM_RADEON=m

Without them I could not trigger the issue (after 5 attempts [1]), so I 
suspect it has something to do with them. Each test takes a long time, 
though, so I expect to have good answers in a couple of days.

[1]
1. Login to XFree86/Gnome
2. Start Mozilla, Evolution, OpenOffice Writer/Calc/Impress, Konqueror, 
KMail. And exit them all.
3. mke2fs -j /dev/hdc9; mount /dev/hdc9 /test; cd /test; dd if=/dev/zero of=zero bs=1024 count=2097152; cd /
4. Redo step 2
5. Log out and log in and redo step 2
6. Unmount /test

Repeating the above test cycle a few times makes the system oops (on the 
3rd attempt or so) when the AGP/AMD/DRM/Radeon options are enabled.
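
For scale, the dd in step 3 writes bs times count bytes. A minimal sketch of the arithmetic follows; dd_size_mib is a hypothetical helper name (not part of dd), and it assumes a shell with 64-bit arithmetic:

```shell
#!/bin/sh
# Hypothetical helper: size in MiB of the file written by
# "dd bs=<bs> count=<count>", assuming every block is written in full.
dd_size_mib() {
    bs=$1
    count=$2
    echo $(( bs * count / 1024 / 1024 ))
}

# dd_size_mib 1024 2097152 prints 2048
```

That is 2 GiB, four times the 512 MB of RAM reported above, so the file cannot stay cached in memory in full.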

Thanks for your help.

Hari
harisri@bigpond.com


* Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])
  2002-10-15  1:08 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved]) harisri
@ 2002-10-15 13:05 ` Srihari Vijayaraghavan
  2002-10-15 14:13   ` Srihari Vijayaraghavan
  0 siblings, 1 reply; 6+ messages in thread
From: Srihari Vijayaraghavan @ 2002-10-15 13:05 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Hello Andrea,

> > this smells like a problem with one of your modules. Please make 100%
> > sure you use exactly the same .config for both 2.4.20pre10 and
> > 2.4.20pre10aa1 and please try to find which is the module that is
> > crashing the kernel after it's being loaded. Expect always different
> > kind of crashes and oopses. You can also try to turn on the slab
> > debugging option in the kernel hacking menu.

That is precisely the reason. The bad news is that the system crashes when 
agpgart and radeon are compiled as modules, and the good news is that I am 
unable to crash it when they are not.

Mainline (2.4.20-pre10) is stable when agpgart and radeon are compiled as 
modules.

The problem is much easier to reproduce than I thought: just logging in and 
out of XFree86/Gnome a few times (3 or more in my case) is more than enough 
to crash it.

Here is the .config which is stable in -aa1:
CONFIG_AGP=y
CONFIG_AGP_AMD=y
CONFIG_DRM=y
CONFIG_DRM_NEW=y
CONFIG_DRM_RADEON=y

Here is the .config which destabilises the -aa1 kernel:
CONFIG_AGP=m
CONFIG_AGP_AMD=y
CONFIG_DRM=y
CONFIG_DRM_NEW=y
CONFIG_DRM_RADEON=m
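
To confirm that two configurations differ only in the intended options, the files can be compared mechanically. This is a rough sketch, not a standard kernel tool: config_diff is a hypothetical helper, and it assumes plain CONFIG_NAME=value lines (a value that itself contains "=" would be cut short):

```shell
#!/bin/sh
# Hypothetical helper: print CONFIG_ options whose values differ between
# two .config files, one per line, as "NAME first-value second-value".
config_diff() {
    awk -F= '
        /^CONFIG_/ {
            if (FNR == NR) { first[$1] = $2; next }   # lines of the first file
            second[$1] = $2                           # lines of the second file
        }
        END {
            for (k in first) {
                if (!(k in second))              print k, first[k], "unset"
                else if (first[k] != second[k])  print k, first[k], second[k]
            }
            for (k in second)
                if (!(k in first)) print k, "unset", second[k]
        }' "$1" "$2" | sort
}
```

Run on files holding the two fragments above (stable first), it would report only CONFIG_AGP and CONFIG_DRM_RADEON, each changing from y to m.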

Unfortunately the system just reboots without leaving any oops information in 
the system logs. If you want, I can try a few older versions of -aa to find 
out when it started happening.

Thanks for your help.
-- 
Hari
harisri@bigpond.com


* Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])
  2002-10-15 13:05 ` Srihari Vijayaraghavan
@ 2002-10-15 14:13   ` Srihari Vijayaraghavan
  2002-10-16  5:06     ` 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved]) [solved2? ac97] Andrea Arcangeli
  0 siblings, 1 reply; 6+ messages in thread
From: Srihari Vijayaraghavan @ 2002-10-15 14:13 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Hello,

> That precisely is the reason. The bad news is that system crashes when
> agpgart and radeon are compiled as modules, and the good news is that I am
> unable to crash it when they are not.

My goodness, I spoke too soon, I guess. The -aa kernel crashes whether 
agpgart and radeon are modules or not.
 
> Mainline (2.4.20-pre10) is stable when agpgart and radeon are compiled as
> modules.

That holds true still.

> The problem is much easier to reproduce than I thought, just log in and log
> out of XFree86/Gnome few times (3 or more times in my case) is more than
> adequate to crash it.

That is still the case.
-- 
Hari
harisri@bigpond.com



* Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved]) [solved2? ac97]
  2002-10-15 14:13   ` Srihari Vijayaraghavan
@ 2002-10-16  5:06     ` Andrea Arcangeli
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2002-10-16  5:06 UTC (permalink / raw)
  To: Srihari Vijayaraghavan
  Cc: linux-kernel, hpj, mcelrath+kernel, pellegrini, lists, mroos, willi

On Wed, Oct 16, 2002 at 12:13:02AM +1000, Srihari Vijayaraghavan wrote:
> Hello,
> 
> > That precisely is the reason. The bad news is that system crashes when
> > agpgart and radeon are compiled as modules, and the good news is that I am
> > unable to crash it when they are not.
> 
> My goodness, I have spoken too early I guess. The -aa kernel crashes whether 
> agpgart and radeon are modules or not.

I've been running this kernel for 5 days now, very often under heavy load
(also with thousands of tasks from volanomark in the background, aio, and a
flood of writes from /dev/zero), and there's no sign of instability. The only
exception is a rare tcp race that has been reported for 2.4.19 on l-k too.
It's not fatal: it only deadlocks the tcp connection, and you have to kill
the task because readmsg will never return until it gets a signal. I tried to
debug it with no luck yet, but it is most certainly a mainline issue too and
it triggers only during heavy load.

You probably did something incidentally (not part of your regression
test loop) that corrupted memory. The regression test is a workload that
will show you if the corruption has happened in the past or not, but the
regression test loop is not the thing that is generating the corruption.
The regression test loop is what gets _harmed_ by the corruption, it's
not the culprit.

My crystal ball is telling me that you could reproduce it easily on my
tree because when you finally feel stable and that you can restart doing
your usual work without worrying about oopses, you enjoy yourself
playing some music to relax. And you don't play music while you
try to reproduce the problem because you're busy stressing
the kernel, and in turn you can't reproduce the bug. Is she right? ;)

Please try with CONFIG_SOUND=n and make sure to run:

	rm -r /lib/modules/2.4.20-pre10aa1

before "make modules_install" to avoid running stale modules (also enable
modversions just in case).

I see a pile of oopses all showing ac97 loaded into the kernel, some
also for 2.4.19, but they may be unrelated problems of course. A number
of reports showing definite random mm corruption like yours on top of
2.4.20-pre vanilla (not -aa) have most certainly been affected by the
ac97 bug too (I'm CC'ing the other affected testers, they can try the
same as you). I never tried ac97 (I've a couple of boxes that could
handle it, but I never attempted to play sound on those yet and the
chipset may be different so it may not trigger for me after all even if
I could load that module).

Hint: in the past I found it easier to reproduce various module bugs with a
loop like this:

	while :; do insmod ac97_codec.o; rmmod ac97_codec.o; done

you can try the above and see if it triggers in seconds.
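
A bounded variant of that loop can be easier to leave unattended, since it stops at the first failed load or unload. This is only a sketch: stress_module is a hypothetical helper, and the INSMOD and RMMOD variables are assumptions added here so the loop can be dry-run with stand-in commands:

```shell
#!/bin/sh
# Hypothetical helper: load and unload a module a fixed number of times,
# stopping at the first failure. INSMOD and RMMOD are illustrative
# overrides; they default to the real insmod and rmmod commands.
stress_module() {
    module=$1
    rounds=${2:-100}
    i=0
    while [ "$i" -lt "$rounds" ]; do
        ${INSMOD:-insmod} "$module" || { echo "insmod failed in round $i" >&2; return 1; }
        ${RMMOD:-rmmod} "$module" || { echo "rmmod failed in round $i" >&2; return 1; }
        i=$((i + 1))
    done
    echo "completed $rounds load/unload rounds of $module"
}
```

It needs root to do anything real, and a kernel that oopses partway through will of course never print the final line.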

From grepping the l-k db it seems the bug was introduced in 2.4.19.
So I would suggest you try to reproduce after a:

	rm -r 2.4.20pre10aa1/drivers/sound
	cp -a 2.4.18/drivers/sound 2.4.20pre10aa1/drivers
	cd 2.4.20pre10aa1; make oldconfig ...

(of course you can replace 2.4.20pre10aa1 with 2.4.20pre11 vanilla or
2.4.20pre10ac2)

and see if the instability goes away?

Marcelo also included some further ac97 patch in pre11, maybe
2.4.20pre11aa1 will fix it, you may want to give it a try too when I
release it (OTOH, I'm fixing what seems to be a design bug in the o1
scheduler that is apparently generating a huge cpu waste, so I don't
guarantee that the very first release with these changes will be as
solid as 2.4.20pre10aa1 ;)

Thanks for all the reports,

Andrea


* Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])
  2002-10-13  1:53     ` 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved]) Srihari Vijayaraghavan
@ 2002-10-13 22:42       ` Andrea Arcangeli
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2002-10-13 22:42 UTC (permalink / raw)
  To: Srihari Vijayaraghavan; +Cc: linux-kernel

On Sun, Oct 13, 2002 at 11:53:29AM +1000, Srihari Vijayaraghavan wrote:
> Oct 11 22:43:19 localhost kernel: Process modprobe (pid: 1675, 

this smells like a problem with one of your modules. Please make 100%
sure you use exactly the same .config for both 2.4.20pre10 and
2.4.20pre10aa1 and please try to find which is the module that is
crashing the kernel after it's being loaded. Expect always different
kind of crashes and oopses. You can also try to turn on the slab
debugging option in the kernel hacking menu.

> Code;  c01e55e2 <fast_clear_page+12/50>

you also may want to configure the kernel as i686 instead of K7 so
fast_clear_page won't be used to see if it makes any difference.

> The mainline (2.4.20-pre10) does not exhibit this issue. Unlike 
> 2.4.20-pre8aa1, 2.4.20-pre10aa1 rebooted itself after the above oops.
> 
> I am hoping some of these oops might reveal the real issue/reason/bug to 
> kernel developers one of these days.

the place where the oops happens is most certainly not the problem,
either something is wrong with fast_clear_page for whatever hardware
reason, or more likely the module loaded by modprobe is corrupting the
freelist and alloc_pages returned garbage.

btw, how much memory do you have? If you've more than 800M it could be a
broken driver using pte_offset by hand, try to reproduce with mem=800m
in such case. To fix this you should find which is the module that is
destabilizing the kernel.

thanks for the reports.

Andrea


* 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])
  2002-10-10 10:17   ` Srihari Vijayaraghavan
@ 2002-10-13  1:53     ` Srihari Vijayaraghavan
  2002-10-13 22:42       ` Andrea Arcangeli
  0 siblings, 1 reply; 6+ messages in thread
From: Srihari Vijayaraghavan @ 2002-10-13  1:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrea Arcangeli

Hello,

On Thursday 10 October 2002 20:17, Srihari Vijayaraghavan wrote:
> Thanks. Unfortunately that did not fix the problem.
>
> I was able to reproduce 4 more oops. (all happened one after other)
>
> ksymoops 2.4.5 on i686 2.4.20-pre8aa2-p1.  Options used

Here is a similar oops report from 2.4.20-pre10aa1.

ksymoops 2.4.5 on i686 2.4.20-pre10aa1.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.20-pre10aa1/ (default)
     -m /boot/System.map-2.4.20-pre10aa1 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Oct 11 22:43:19 localhost kernel: Unable to handle kernel paging request at virtual address cbe8e000
Oct 11 22:43:19 localhost kernel: c01e55e2
Oct 11 22:43:19 localhost kernel: *pde = 0bc001e3
Oct 11 22:43:19 localhost kernel: Oops: 0002 2.4.20-pre10aa1 #3 Fri Oct 11 22:10:08 EST 2002
Oct 11 22:43:19 localhost kernel: CPU:    0
Oct 11 22:43:19 localhost kernel: EIP:    0010:[<c01e55e2>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Oct 11 22:43:19 localhost kernel: EFLAGS: 00013246
Oct 11 22:43:19 localhost kernel: eax: 0000003f   ebx: cbe8e000   ecx: c9f8e000   edx: 00000000
Oct 11 22:43:19 localhost kernel: esi: c3f7d4b0   edi: 000004b0   ebp: c120c084   esp: c9f8feac
Oct 11 22:43:19 localhost kernel: ds: 0018   es: 0018   ss: 0018
Oct 11 22:43:19 localhost kernel: Process modprobe (pid: 1675, stackpage=c9f8f000)
Oct 11 22:43:19 localhost kernel: Stack: 00104025 c0126952 cbe8e000 c95bc420 4212c1fc dff87e00 cbc1a140 c0126d7e
Oct 11 22:43:19 localhost kernel:        dff87e00 cbc1a140 c3f7d4b0 c95bc420 00000001 4212c1fc c9f8ff24 dff87e00
Oct 11 22:43:19 localhost kernel:        cbc1a140 4212c1fc c9f8e000 c011240a dff87e00 cbc1a140 4212c1fc 00000001
Oct 11 22:43:19 localhost kernel: Call Trace:    [<c0126952>] [<c0126d7e>] [<c011240a>] [<c012869f>] [<c01289d2>]
Oct 11 22:43:19 localhost kernel:   [<c0128a54>] [<c0112260>] [<c01075b0>]
Oct 11 22:43:19 localhost kernel: Code: 0f e7 03 0f e7 43 08 0f e7 43 10 0f e7 43 18 0f e7 43 20 0f


>>EIP; c01e55e2 <fast_clear_page+12/50>   <=====

>>ebx; cbe8e000 <[sr_mod].bss.end+54ea1a9/1925c229>
>>ecx; c9f8e000 <[sr_mod].bss.end+35ea1a9/1925c229>
>>esi; c3f7d4b0 <[agpgart].bss.end+200695/1b93265>
>>edi; 000004b0 Before first symbol
>>ebp; c120c084 <_end+f86b14/166cb10>
>>esp; c9f8feac <[sr_mod].bss.end+35ec055/1925c229>

Trace; c0126952 <do_anonymous_page+a2/110>
Trace; c0126d7e <handle_mm_fault+8e/160>
Trace; c011240a <do_page_fault+1aa/5a0>
Trace; c012869f <unmap_fixup+12f/140>
Trace; c01289d2 <do_munmap+292/2d0>
Trace; c0128a54 <sys_munmap+44/80>
Trace; c0112260 <do_page_fault+0/5a0>
Trace; c01075b0 <error_code+34/3c>

Code;  c01e55e2 <fast_clear_page+12/50>
00000000 <_EIP>:
Code;  c01e55e2 <fast_clear_page+12/50>   <=====
   0:   0f e7 03                  movntq %mm0,(%ebx)   <=====
Code;  c01e55e5 <fast_clear_page+15/50>
   3:   0f e7 43 08               movntq %mm0,0x8(%ebx)
Code;  c01e55e9 <fast_clear_page+19/50>
   7:   0f e7 43 10               movntq %mm0,0x10(%ebx)
Code;  c01e55ed <fast_clear_page+1d/50>
   b:   0f e7 43 18               movntq %mm0,0x18(%ebx)
Code;  c01e55f1 <fast_clear_page+21/50>
   f:   0f e7 43 20               movntq %mm0,0x20(%ebx)
Code;  c01e55f5 <fast_clear_page+25/50>
  13:   0f 00 00                  sldtl  (%eax)


1 warning issued.  Results may not be reliable.
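
As a reading aid, the number after "Oops:" is the i386 page-fault error code, whose low three bits say what kind of access faulted. A small sketch of the decoding follows; decode_oops_code is a hypothetical helper written for this note, not part of ksymoops:

```shell
#!/bin/sh
# Hypothetical helper: decode the low three bits of an i386 page-fault
# error code, given in hex as it appears after "Oops:".
#   bit 0: 0 = page not present, 1 = protection fault
#   bit 1: 0 = read,             1 = write
#   bit 2: 0 = kernel mode,      1 = user mode
decode_oops_code() {
    code=$((0x$1))
    if [ $((code & 1)) -ne 0 ]; then cause="protection fault"; else cause="page not present"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 4)) -ne 0 ]; then mode="user mode"; else mode="kernel mode"; fi
    echo "$access, $cause, $mode"
}

# decode_oops_code 0002 prints: write, page not present, kernel mode
```

That matches the trace: the faulting instruction is a movntq store to (%ebx), and ebx holds cbe8e000, the address in the "Unable to handle kernel paging request" line.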

The mainline (2.4.20-pre10) does not exhibit this issue. Unlike 
2.4.20-pre8aa1, 2.4.20-pre10aa1 rebooted itself after the above oops.

I am hoping some of these oopses might reveal the real issue/reason/bug to 
kernel developers one of these days.

And my sincere thanks for your help.
-- 
Hari
harisri@bigpond.com



end of thread, other threads:[~2002-10-16  8:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
2002-10-15  1:08 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved]) harisri
2002-10-15 13:05 ` Srihari Vijayaraghavan
2002-10-15 14:13   ` Srihari Vijayaraghavan
2002-10-16  5:06     ` 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved]) [solved2? ac97] Andrea Arcangeli
  -- strict thread matches above, loose matches on Subject: below --
2002-10-05  2:47 Linux-2.4.20-pre8-aa2 oops report Srihari Vijayaraghavan
2002-10-10  1:26 ` Linux-2.4.20-pre8-aa2 oops report. [solved] Andrea Arcangeli
2002-10-10 10:17   ` Srihari Vijayaraghavan
2002-10-13  1:53     ` 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved]) Srihari Vijayaraghavan
2002-10-13 22:42       ` Andrea Arcangeli
