linux-kernel.vger.kernel.org archive mirror
* Re: 2.4.22-pre lockups (now decoded oops for pre10)
       [not found] <20030808002918.723abb08.skraw@ithnet.com>
@ 2003-08-08 14:54 ` Marcelo Tosatti
  2003-08-08 15:05   ` Stephan von Krawczynski
  0 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2003-08-08 14:54 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Andrew Morton, andrea, Alan Cox, linux-kernel



On Fri, 8 Aug 2003, Stephan von Krawczynski wrote:

> On Thu, 7 Aug 2003 14:49:17 -0700
> Andrew Morton <akpm@osdl.org> wrote:
> 
> > Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> > >
> > >  Anyway, you seem to be getting random memory corruption and I have no idea
> > >  what the hell may be causing it.
> > > 
> > >  Andrea? Andrew? Alan? _Any_ helpful comments?  
> > 
> > Not really, sorry.  Ugly.
> > 
> > What was the last kernel which didn't crash?
> > 
> > You're showing a huge set of reiserfs diffs there, mostly cosmetic though.
> > 
> > Running memtest86 for 12 hours is needed.
> > 
> > Going back to the last-known-kernel would be useful, just to verify that
> > the hardware is still good (some connector could have become resistive, or
> > the power supply could have drifted, etc).
> > 
> > Would it be possible to try a different filesystem on that box?
> > 
> > Do we know of other people who are using late 2.4 kernels on server-grade
> > hardware?  If so, are they doing OK?
> 
> I can give you this additional info:
> I tried about everything back to 2.4.21 release, and even this crashes on the
> box. BUT it is _not_ the only box I can crash 2.4.21. I have another hardware
> (also SMP) based not on Serverworks but on VIA chipset and with no 64 bit pci
> and it crashes with 2.4.21 around every 10 - 20 days. It definitely does not
> with 2.4.19. 

Do you have any traces of the other box crash? 

> The only requirement for my usual test-box is a working tg3 driver for the GBit
> ethernet link.

> Ah yes, and from the long series of tests I can tell that the box won't crash
> with UP kernel. I can re-check that with rc1 if this is useful.

Okay. That's useful information. How hard would it be for you to try ext3
as the filesystem (as Andrew suggested) ? 



* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-08 14:54 ` 2.4.22-pre lockups (now decoded oops for pre10) Marcelo Tosatti
@ 2003-08-08 15:05   ` Stephan von Krawczynski
  2003-08-08 15:33     ` Marcelo Tosatti
  2003-08-10 14:23     ` Keith Owens
  0 siblings, 2 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-08 15:05 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: akpm, andrea, alan, linux-kernel

On Fri, 8 Aug 2003 11:54:39 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> > I can give you this additional info:
> > I tried about everything back to 2.4.21 release, and even this crashes on
> > the box. BUT it is _not_ the only box I can crash 2.4.21. I have another
> > hardware(also SMP) based not on Serverworks but on VIA chipset and with no
> > 64 bit pci and it crashes with 2.4.21 around every 10 - 20 days. It
> > definitely does not with 2.4.19. 
> 
> Do you have any traces of the other box crash? 

Not at hand, but can prepare for the next crash during the weekend.

> > The only requirement for my usual test-box is a working tg3 driver for the
> > GBit ethernet link.
> 
> > Ah yes, and from the long series of tests I can tell that the box won't
> > crash with UP kernel. I can re-check that with rc1 if this is useful.
> 
> Okay. That's useful information. How hard would it be for you to try ext3
> as the filesystem (as Andrew suggested) ? 

Well, if that provides further info I will do. I will try to achieve over the
weekend, I need some spare volumes for conversion (by copy) :-)

Regards,
Stephan


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-08 15:05   ` Stephan von Krawczynski
@ 2003-08-08 15:33     ` Marcelo Tosatti
  2003-08-10 21:35       ` Stephan von Krawczynski
  2003-08-13 10:55       ` Stephan von Krawczynski
  2003-08-10 14:23     ` Keith Owens
  1 sibling, 2 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2003-08-08 15:33 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: akpm, andrea, alan, linux-kernel



On Fri, 8 Aug 2003, Stephan von Krawczynski wrote:

> On Fri, 8 Aug 2003 11:54:39 -0300 (BRT)
> Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> 
> > > I can give you this additional info:
> > > I tried about everything back to 2.4.21 release, and even this crashes on
> > > the box. BUT it is _not_ the only box I can crash 2.4.21. I have another
> > > hardware(also SMP) based not on Serverworks but on VIA chipset and with no
> > > 64 bit pci and it crashes with 2.4.21 around every 10 - 20 days. It
> > > definitely does not with 2.4.19. 
> > 
> > Do you have any traces of the other box crash? 
> 
> Not at hand, but can prepare for the next crash during the weekend.
> 
> > > The only requirement for my usual test-box is a working tg3 driver for the
> > > GBit ethernet link.
> > 
> > > Ah yes, and from the long series of tests I can tell that the box won't
> > > crash with UP kernel. I can re-check that with rc1 if this is useful.
> > 
> > Okay. That's useful information. How hard would it be for you to try ext3
> > as the filesystem (as Andrew suggested) ? 
> 
> Well, if that provides further info I will do. I will try to achieve over the
> weekend, I need some spare volumes for conversion (by copy) :-)

That will provide further information yes. We can then know if the problem 
is reiserfs specific or not, which is VERY useful.

Again, thanks for your efforts helping us track down the problem.



* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-08 15:05   ` Stephan von Krawczynski
  2003-08-08 15:33     ` Marcelo Tosatti
@ 2003-08-10 14:23     ` Keith Owens
  1 sibling, 0 replies; 56+ messages in thread
From: Keith Owens @ 2003-08-10 14:23 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel

On Fri, 8 Aug 2003 17:05:36 +0200, 
Stephan von Krawczynski <skraw@ithnet.com> wrote:
>Well, if that provides further info I will do. I will try to achieve over the
>weekend, I need some spare volumes for conversion (by copy) :-)

FWIW, there are kdb patches for 2.4.22-pre8 onwards.  They also fit
2.4.22-rc1.

ftp://oss.sgi.com/projects/kdb/download/v4.3/kdb-v4.3-2.4.22-pre8-common-8.bz2
ftp://oss.sgi.com/projects/kdb/download/v4.3/kdb-v4.3-2.4.22-pre8-i386-5.bz2



* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-08 15:33     ` Marcelo Tosatti
@ 2003-08-10 21:35       ` Stephan von Krawczynski
  2003-08-10 23:23         ` Neil Brown
  2003-08-13 10:55       ` Stephan von Krawczynski
  1 sibling, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-10 21:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: akpm, andrea, alan, linux-kernel

On Fri, 8 Aug 2003 12:33:28 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> > > > Ah yes, and from the long series of tests I can tell that the box won't
> > > > crash with UP kernel. I can re-check that with rc1 if this is useful.
> > > 
> > > Okay. That's useful information.

During this weekend I did several tests around SMP and UP, and I can definitely
confirm the box does not crash under rc2-UP kernel, but collapses within hours
under rc2-SMP.

> > > How hard would it be for you to try ext3 
> > > as the filesystem (as Andrew suggested) ? 

I spent half the weekend to turn the setup from reiserfs over to ext3
completely preserving the data. The box runs now with rc2-SMP-ext3 (no reiserfs
present any longer). I will send notice if/when it crashes.

From looking at the tests so far I would say the setup is remarkably slower in
terms of writing to ext3 via nfs and sync option set. I think especially the
"sync" is very visible - unlike reiserfs.

Regards,
Stephan


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-10 21:35       ` Stephan von Krawczynski
@ 2003-08-10 23:23         ` Neil Brown
  2003-08-11  9:33           ` Stephan von Krawczynski
  0 siblings, 1 reply; 56+ messages in thread
From: Neil Brown @ 2003-08-10 23:23 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Marcelo Tosatti, akpm, andrea, alan, linux-kernel

On Sunday August 10, skraw@ithnet.com wrote:
> 
> From looking at the tests so far I would say the setup is remarkably slower in
> terms of writing to ext3 via nfs and sync option set. I think especially the
> "sync" is very visible - unlike reiserfs.

  data=journal
makes nfsd go noticeably faster over ext3.  Having an external journal
is even better.

NeilBrown
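
For readers who want to try the suggestion above, a minimal sketch of the
two settings follows. The device names and the mount point are hypothetical
placeholders and do not come from this thread. data=journal is the standard
ext3 mount option; the external-journal commands are the usual e2fsprogs
invocations, shown as comments because mke2fs creates a NEW filesystem and
destroys existing data.

```
# /etc/fstab entry (hypothetical device and mount point)
/dev/sdb1   /export   ext3   defaults,data=journal   1 2

# External journal on a separate device (hypothetical /dev/sdc1).
# Both filesystems must use the same block size.
#   mke2fs -O journal_dev /dev/sdc1
#   mke2fs -j -J device=/dev/sdc1 /dev/sdb1
```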


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-10 23:23         ` Neil Brown
@ 2003-08-11  9:33           ` Stephan von Krawczynski
  2003-08-18 20:43             ` Mike Fedyk
  0 siblings, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-11  9:33 UTC (permalink / raw)
  To: Neil Brown; +Cc: marcelo, akpm, andrea, alan, linux-kernel

On Mon, 11 Aug 2003 09:23:20 +1000
Neil Brown <neilb@cse.unsw.edu.au> wrote:

> On Sunday August 10, skraw@ithnet.com wrote:
> > 
> > From looking at the tests so far I would say the setup is remarkably slower
> > in terms of writing to ext3 via nfs and sync option set. I think especially
> > the "sync" is very visible - unlike reiserfs.
> 
>   data=journal
> makes nfsd go noticeably faster over ext3.  Having an external journal
> is even better.

Uh, forgive my ignorance. "journal" means metadata+data journaling. If I have
large data movement, how can that be even faster? Ok, I see the facts around
sync'ing the fs. But anyway the data size written should be nearly doubled
compared to data=ordered. Reiserfs journaling has to be real incredible in
comparison to ext3(ordered). I have the impression that large files are hit
most.


Regards,
Stephan


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-08 15:33     ` Marcelo Tosatti
  2003-08-10 21:35       ` Stephan von Krawczynski
@ 2003-08-13 10:55       ` Stephan von Krawczynski
  2003-08-13 14:53         ` Marcelo Tosatti
  1 sibling, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-13 10:55 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: akpm, andrea, alan, linux-kernel

On Fri, 8 Aug 2003 12:33:28 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> That will provide further information yes. We can then know if the problem 
> is reiserfs specific or not, which is VERY useful.
> 
> Again, thanks for your efforts helping us track down the problem.

Status update:

uptime:
 12:45pm  up 2 days 19:39,  18 users,  load average: 2.02, 2.05, 2.06

Running SMP. So far no crash happened under ext3. 
Still I see the tar-verification errors. None on the first day, 2 on the second
and 2 today so far.
I see a growing possibility that the former crashes are directly linked to a
reiserfs problem, maybe broken SMP-locking.
If it survives until sunday I will revert all ext3 back to reiserfs to be sure
it still crashes, then ideas for patches will be welcome :-)

Up to sunday I can try to look deeper into the verification troubles. To be
honest I already doubt today that I will see a crash with ext3 until sunday...

Regards,
Stephan



* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 10:55       ` Stephan von Krawczynski
@ 2003-08-13 14:53         ` Marcelo Tosatti
  2003-08-13 14:59           ` Oleg Drokin
  2003-08-13 15:21           ` Jim Gifford
  0 siblings, 2 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2003-08-13 14:53 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: akpm, andrea, alan, linux-kernel, mason, green



On Wed, 13 Aug 2003, Stephan von Krawczynski wrote:

> On Fri, 8 Aug 2003 12:33:28 -0300 (BRT)
> Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> 
> > That will provide further information yes. We can then know if the problem 
> > is reiserfs specific or not, which is VERY useful.
> > 
> > Again, thanks for your efforts helping us track down the problem.
> 
> Status update:
> 
> uptime:
>  12:45pm  up 2 days 19:39,  18 users,  load average: 2.02, 2.05, 2.06
> 
> Running SMP. So far no crash happened under ext3. 
> Still I see the tar-verification errors. None on the first day, 2 on the second
> and 2 today so far.
> I see a growing possibility that the former crashes are directly linked to a
> reiserfs problem, maybe broken SMP-locking.
> If it survives until sunday I will revert all ext3 back to reiserfs to be sure
> it still crashes, then ideas for patches will be welcome :-)

Great you tracked it down. Your previous traces almost always involved
reiserfs calls, which is another indicator that reiserfs is probably the
problem here.

Chris, Oleg, it might be nice if you guys could look at previous oops
reports by Stephan. 



* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 14:53         ` Marcelo Tosatti
@ 2003-08-13 14:59           ` Oleg Drokin
  2003-08-13 15:12             ` Stephan von Krawczynski
  2003-08-13 15:21           ` Jim Gifford
  1 sibling, 1 reply; 56+ messages in thread
From: Oleg Drokin @ 2003-08-13 14:59 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Stephan von Krawczynski, akpm, andrea, alan, linux-kernel, mason

Hello!

On Wed, Aug 13, 2003 at 11:53:09AM -0300, Marcelo Tosatti wrote:

> > Running SMP. So far no crash happened under ext3. 
> > Still I see the tar-verification errors. None on the first day, 2 on the second

But tar verification errors are still bad, right?

> > it still crashes, then ideas for patches will be welcome :-)
> Great you tracked it down. Your previous traces almost always involved
> reiserfs calls, which is another indicator that reiserfs is probably the
> problem here.

reiserfs is just probably a bit more sensitive to memory corruptions.

> Chris, Oleg, it might be nice if you guys could look at previous oops
> reports by Stephan. 

All of them looked like memory corruptions of unknown reason to me.

Bye,
    Oleg


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 14:59           ` Oleg Drokin
@ 2003-08-13 15:12             ` Stephan von Krawczynski
  2003-08-13 15:30               ` Oleg Drokin
  0 siblings, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-13 15:12 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: marcelo, akpm, andrea, alan, linux-kernel, mason

On Wed, 13 Aug 2003 18:59:40 +0400
Oleg Drokin <green@namesys.com> wrote:

> Hello!
> 
> On Wed, Aug 13, 2003 at 11:53:09AM -0300, Marcelo Tosatti wrote:
> 
> > > Running SMP. So far no crash happened under ext3. 
> > > Still I see the tar-verification errors. None on the first day, 2 on the
> > > second
> 
> But tar verification errors are still bad, right?

Sure. Maybe both topics are unrelated. I can't tell.

> > > it still crashes, then ideas for patches will be welcome :-)
> > Great you tracked it down. Your previous traces almost always involved
> > reiserfs calls, which is another indicator that reiserfs is probably the
> > problem here.
> 
> reiserfs is just probably a bit more sensitive to memory corruptions.
> 
> > Chris, Oleg, it might be nice if you guys could look at previous oops
> > reports by Stephan. 
> 
> All of them looked like memory corruptions of unknown reason to me.

Well, that's exactly the reason why I am awaiting some more days of
up-and-running ext3. After how many days will you be convinced that a random
memory corruption should have hit the ext3 system that bad, that it should have
crashed?
I can add another week if you want me to, just tell me. The only thing I don't
want is that any doubts are left after testing ...
Still, current 2 days uptime is early stage, so let's give it some more time.

Regards,
Stephan


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 14:53         ` Marcelo Tosatti
  2003-08-13 14:59           ` Oleg Drokin
@ 2003-08-13 15:21           ` Jim Gifford
  2003-08-13 17:08             ` Marcelo Tosatti
  1 sibling, 1 reply; 56+ messages in thread
From: Jim Gifford @ 2003-08-13 15:21 UTC (permalink / raw)
  To: Marcelo Tosatti, Stephan von Krawczynski
  Cc: akpm, andrea, alan, linux-kernel, mason, green


----- Original Message ----- 
From: "Marcelo Tosatti" <marcelo@conectiva.com.br>
To: "Stephan von Krawczynski" <skraw@ithnet.com>
Cc: <akpm@osdl.org>; <andrea@suse.de>; <alan@lxorguk.ukuu.org.uk>;
<linux-kernel@vger.kernel.org>; <mason@suse.com>; <green@namesys.com>
Sent: Wednesday, August 13, 2003 7:53 AM
Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)


>
>
> On Wed, 13 Aug 2003, Stephan von Krawczynski wrote:
>
> > On Fri, 8 Aug 2003 12:33:28 -0300 (BRT)
> > Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> >
> > > That will provide further information yes. We can then know if the problem
> > > is reiserfs specific or not, which is VERY useful.
> > >
> > > Again, thanks for your efforts helping us track down the problem.
> >
> > Status update:
> >
> > uptime:
> >  12:45pm  up 2 days 19:39,  18 users,  load average: 2.02, 2.05, 2.06
> >
> > Running SMP. So far no crash happened under ext3.
> > Still I see the tar-verification errors. None on the first day, 2 on the second
> > and 2 today so far.
> > I see a growing possibility that the former crashes are directly linked to a
> > reiserfs problem, maybe broken SMP-locking.
> > If it survives until sunday I will revert all ext3 back to reiserfs to be sure
> > it still crashes, then ideas for patches will be welcome :-)
>
> Great you tracked it down. Your previous traces almost always involved
> reiserfs calls, which is another indicator that reiserfs is probably the
> problem here.
>
> Chris, Oleg, it might be nice if you guys could look at previous oops
> reports by Stephan.
>
Marcelo,
    Could this be related to the issues I was having? Since rc1 I have not
had any problems, and I have all the iptables stuff running again. My
machine is smp and is using ext3.




* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 15:12             ` Stephan von Krawczynski
@ 2003-08-13 15:30               ` Oleg Drokin
  2003-08-13 16:04                 ` Stephan von Krawczynski
  0 siblings, 1 reply; 56+ messages in thread
From: Oleg Drokin @ 2003-08-13 15:30 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: marcelo, akpm, andrea, alan, linux-kernel, mason

Hello!

On Wed, Aug 13, 2003 at 05:12:24PM +0200, Stephan von Krawczynski wrote:

> Well, that's exactly the reason why I am awaiting some more days of
> up-and-running ext3. After how many days will you be convinced that a random
> memory corruption should have hit the ext3 system that bad, that it should have
> crashed?

Well, I'd prefer that you spend time to figure out at which exact
2.4.21-pre version the crashes in reiserfs started to appear. ;)

> I can add another week if you want me to, just tell me. The only thing I don't
> want is that any doubts are left after testing ...

It would be interesting to look at fsck results on the fs after some time of
testing.
Probably it would be easier for you to make it crash (if there are crash
possibility at all) if you enable JBD debugging.

Bye,
    Oleg


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 15:30               ` Oleg Drokin
@ 2003-08-13 16:04                 ` Stephan von Krawczynski
  2003-08-13 16:34                   ` Oleg Drokin
  2003-08-18 15:06                   ` Andrea Arcangeli
  0 siblings, 2 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-13 16:04 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: marcelo, akpm, andrea, alan, linux-kernel, mason

On Wed, 13 Aug 2003 19:30:09 +0400
Oleg Drokin <green@namesys.com> wrote:

> Hello!
> 
> On Wed, Aug 13, 2003 at 05:12:24PM +0200, Stephan von Krawczynski wrote:
> 
> > Well, that's exactly the reason why I am awaiting some more days of
> > up-and-running ext3. After how many days will you be convinced that a
> > random memory corruption should have hit the ext3 system that bad, that it
> > should have crashed?
> 
> Well, I'd prefer that you spend time to figure out at which exact
> 2.4.21-pre version the crashes in reiserfs started to appear. ;)

Well, Oleg, I'd love to, but there is an immanent problem with that. If
I check pre-X and it crashes, everything is fine, because I have a certain
result of the test. If it does not crash within 3 days, then I have a problem.
How long do I wait before stating the pre is good? It could take months to test
10 pre's ... That cannot be the way to find out what is going on. 
On the other hand: 
- no UP kernel ever crashed. So we can at least talk about an SMP-race.
- 2.4.20 does not crash
- 2.4.21 does crash
If we can add "ext3 does not crash" to the list, then I really hope we can use
some brain and give good selection of patches between 2.4.20 and 2.4.21 that
may cause the troubles.
How many suspects do we have? We can at least begin to create a list of things
that went in between .20 and .21, or not?
If possible I can then patch out all of them and retry. So there is much less
time spent for testing. 
I mean, have you looked at the length of this thread already?

> > I can add another week if you want me to, just tell me. The only thing I
> > don't want is that any doubts are left after testing ...
> 
> It would be interesting to look at fsck results on the fs after some time of
> testing.

You mean I should do an fsck on sunday?

> Probably it would be easier for you to make it crash (if there are crash
> possibility at all) if you enable JBD debugging.

I have never seen this in real life. Is it possible to turn this on when
handling >100 GB of data or will some debug output flood the box?

Regards,
Stephan
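
The time concern above can be put in rough numbers. The sketch below is only
an illustration. The figure of 10 candidate pre-releases comes from the
message; the 14-day observation window per kernel is an assumed value (the
thread never settles on one), and the halving strategy is a generic binary
search between a known-good and a known-bad release, not a procedure anyone
in this thread spelled out.

```python
def linear_tests(candidates: int) -> int:
    """Worst case when every candidate version is tested in turn."""
    return candidates


def bisect_tests(candidates: int) -> int:
    """Worst-case number of tests for a binary search.

    With `candidates` untested versions between a known-good and a
    known-bad release, the first bad version is one of candidates + 1
    possibilities, so ceil(log2(candidates + 1)) tests are enough.
    For non-negative integers that value equals candidates.bit_length().
    """
    return candidates.bit_length()


PRE_RELEASES = 10   # from the message: "10 pre's"
WAIT_DAYS = 14      # ASSUMED observation window per kernel

linear_days = linear_tests(PRE_RELEASES) * WAIT_DAYS
bisect_days = bisect_tests(PRE_RELEASES) * WAIT_DAYS

print(f"one by one: up to {linear_days} days")
print(f"bisection:  up to {bisect_days} days")
```

Even with halving, the search still costs weeks under this assumption, and a
kernel that survives the window is not proven good, which is exactly the
objection raised in the message.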


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 16:04                 ` Stephan von Krawczynski
@ 2003-08-13 16:34                   ` Oleg Drokin
  2003-08-13 22:19                     ` Stephan von Krawczynski
  2003-08-18 15:06                   ` Andrea Arcangeli
  1 sibling, 1 reply; 56+ messages in thread
From: Oleg Drokin @ 2003-08-13 16:34 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: marcelo, akpm, andrea, alan, linux-kernel, mason

Hello!

On Wed, Aug 13, 2003 at 06:04:05PM +0200, Stephan von Krawczynski wrote:
> > > Well, that's exactly the reason why I am awaiting some more days of
> > > up-and-running ext3. After how many days will you be convinced that a
> > > random memory corruption should have hit the ext3 system that bad, that it
> > > should have crashed?
> > Well, I'd prefer that you spend time to figure out at which exact
> > 2.4.21-pre version the crashes in reiserfs started to appear. ;)
> Well, Oleg, I'd love to, but there is an immanent problem with that. If
> I check pre-X and it crashes, everything is fine, because I have a certain
> result of the test. If it does not crash within 3 days, then I have a problem.
> How long do I wait before stating the pre is good? It could take months to test

You seem to be getting corruptions in at least 2 days for now, though.
And reiserfs seems to trigger the problem even faster (and may be
even more faster if you enable CONFIG_REISERFS_CHECK).

> 10 pre's ... That cannot be the way to find out what is going on. 
> On the other hand: 
> - no UP kernel ever crashed. So we can at least talk about an SMP-race.

There is still huge field to look at.

> - 2.4.20 does not crash
> - 2.4.21 does crash

diff is 20M in size.

> If we can add "ext3 does not crash" to the list, then I really hope we can use
> some brain and give good selection of patches between 2.4.20 and 2.4.21 that
> may cause the troubles.

There were not much changes in reiserfs. All those patches can easily be
reverted just for verification purposes. Let me know when you are ready/want
to test this variant and I will send you a diff.

> How many suspects do we have? We can at least begin to create a list of things

Well, suspects are all used drivers, VM, filesystem itself, arch code.

> that went in between .20 and .21, or not?

Lots of changes, 2.4.20->2.4.21 was a long trip.

> If possible I can then patch out all of them and retry. So there is much less
> time spent for testing. 
> I mean, have you looked at the length of this thread already?

Yes, I did.
Now if only we can get someone to reproduce your problems...

> > > I can add another week if you want me to, just tell me. The only thing I
> > > don't want is that any doubts are left after testing ...
> > It would be interesting to look at fsck results on the fs after some time of
> > testing.
> You mean I should do an fsck on sunday?

Yes, whenever you decide you have waited long enough (provided that it won't
crash) and decide to stop testing, please run fsck on that testing fs.

> > Probably it would be easier for you to make it crash (if there are crash
> > possibility at all) if you enable JBD debugging.
> I have never seen this in real life. Is it possible to turn this on when
> handling >100 GB of data or will some debug output flood the box?

It only enables some more checks, not debug output.

Bye,
    Oleg


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 15:21           ` Jim Gifford
@ 2003-08-13 17:08             ` Marcelo Tosatti
  0 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2003-08-13 17:08 UTC (permalink / raw)
  To: Jim Gifford; +Cc: Stephan von Krawczynski, linux-kernel



On Wed, 13 Aug 2003, Jim Gifford wrote:

> Marcelo,
>     Could this be related to the issues I was having? Since rc1 I have not
> had any problems, and I have all the iptables stuff running again. My
> machine is smp and is using ext3.

Jim,

Don't think so. Your problems were caused by additional netfilter patches
or the dazuko module -- Stephan is not using any of those. 



* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 16:34                   ` Oleg Drokin
@ 2003-08-13 22:19                     ` Stephan von Krawczynski
  2003-08-14  8:45                       ` Oleg Drokin
  0 siblings, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-13 22:19 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: marcelo, akpm, andrea, alan, linux-kernel, mason

On Wed, 13 Aug 2003 20:34:52 +0400
Oleg Drokin <green@namesys.com> wrote:

> You seem to be getting corruptions in at least 2 days for now, though.
> And reiserfs seems to trigger the problem even faster (and may be
> even more faster if you enable CONFIG_REISERFS_CHECK).

Well, I have an idea how to find out more about these verify problems. Basically
I would try to patch tar to output the differing areas to stdout in hexdump
format or the like. Only I need some time to make this work out. I hope to find
some pattern about this corruption. 

> > 10 pre's ... That cannot be the way to find out what is going on. 
> > On the other hand: 
> > - no UP kernel ever crashed. So we can at least talk about an SMP-race.
> 
> There is still huge field to look at.
> 
> > - 2.4.20 does not crash
> > - 2.4.21 does crash
> 
> diff is 20M in size.
> 
> > If we can add "ext3 does not crash" to the list, then I really hope we can
> > use some brain and give good selection of patches between 2.4.20 and 2.4.21
> > that may cause the troubles.
> 
> There were not much changes in reiserfs. All those patches can easily be
> reverted just for verification purposes. Let me know when you are ready/want
> to test this variant and I will send you a diff.

Hm, my primary belief is that something _around_ reiserfs has changed
semantics.

> > If possible I can then patch out all of them and retry. So there is much
> > less time spent for testing. 
> > I mean, have you looked at the length of this thread already?
> 
> Yes, I did.
> Now if only we can get someone to reproduce your problems...

Hm, I believe nobody in fact tried a setup like mine. As I have clear
indication that I can trigger it simply by using an SMP box, installing SuSE
8.2, compiling stock 2.4.22-rc2 kernel exporting some reiserfs to a nfs-client
of your choice and starting copying data with sizes around 100GB back and
forth.

> > > > I can add another week if you want me to, just tell me. The only thing
> > > > I don't want is that any doubts are left after testing ...
> > > It would be interesting to look at fsck results on the fs after some time
> > > of testing.
> > You mean I should do an fsck on sunday?
> 
> Yes, whenever you decide you have waited long enough (provided that it won't
> crash) and decide to stop testing, please run fsck on that testing fs.

Ok, will do that.

> 
> > > Probably it would be easier for you to make it crash (if there are crash
> > > possibility at all) if you enable JBD debugging.
> > I have never seen this in real life. Is it possible to turn this on when
> > handling >100 GB of data or will some debug output flood the box?
> 
> It only enables some more checks, not debug output.

Does this work for ext3, reiserfs or both?

Regards,
Stephan
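
The idea of dumping the differing areas does not strictly need a patched
tar. As a rough sketch of the same technique, the Python below compares two
byte strings (for example a source file and its copy read back over NFS) and
prints each differing region in a hexdump-like layout. The function names
and the output format are made up for illustration; they are not taken from
tar or from this thread.

```python
from typing import List, Tuple


def differing_ranges(a: bytes, b: bytes) -> List[Tuple[int, int]]:
    """Return half-open (start, end) byte ranges where a and b differ."""
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i              # a new differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        # A length mismatch is reported as one trailing range.
        ranges.append((common, max(len(a), len(b))))
    return ranges


def hex_line(data: bytes, start: int, end: int) -> str:
    """Format data[start:end] as an offset followed by hex bytes."""
    return f"{start:08x}  " + " ".join(f"{byte:02x}" for byte in data[start:end])


def report(a: bytes, b: bytes) -> None:
    """Print both versions of every differing region."""
    for start, end in differing_ranges(a, b):
        print(f"differs at bytes {start}..{end - 1}")
        print("  expected: " + hex_line(a, start, end))
        print("  found:    " + hex_line(b, start, end))


# Hypothetical sample data: the copy has two bytes zeroed out.
report(b"linux-2.4.22", b"linux-2.4.\x00\x00")
```

For files in the 100GB range the comparison would have to read both files in
fixed-size chunks instead of holding them in memory; the per-chunk logic
stays the same.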


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 22:19                     ` Stephan von Krawczynski
@ 2003-08-14  8:45                       ` Oleg Drokin
  2003-08-14 17:26                         ` Marcelo Tosatti
  2003-08-15 10:13                         ` 2.4.22-pre lockups (now decoded oops for pre10) Stephan von Krawczynski
  0 siblings, 2 replies; 56+ messages in thread
From: Oleg Drokin @ 2003-08-14  8:45 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: marcelo, akpm, andrea, alan, linux-kernel, mason

Hello!

> > You seem to be getting corruptions in at least 2 days for now, though.
> > And reiserfs seems to trigger the problem even faster (and may be
> > even more faster if you enable CONFIG_REISERFS_CHECK).
> Well, I have an idea how to find out more about these verify problems. Basically
> I would try to patch tar to output the differing areas to stdout in hexdump
> format or the like. Only I need some time to make this work out. I hope to find
> some pattern about this corruption. 

Yes, that would be interesting.

> > > If we can add "ext3 does not crash" to the list, then I really hope we can
> > > use some brain and give good selection of patches between 2.4.20 and 2.4.21
> > > that may cause the troubles.
> > There were not much changes in reiserfs. All those patches can easily be
> > reverted just for verification purposes. Let me know when you are ready/want
> > to test this variant and I will send you a diff.
> Hm, my primary belief is that something _around_ reiserfs has changed
> semantics.

Well. Might be, but this is unlikely. And I do not remember anything like that.
I will take a closer look, though.

> > > If possible I can then patch out all of them and retry. So there is much
> > > less time spent for testing. 
> > > I mean, have you looked at the length of this thread already?
> > Yes, I did.
> > Now if only we can get someone to reproduce your problems...
> Hm, I believe nobody in fact tried a setup like mine. As I have clear
> indication that I can trigger it simply by using an SMP box, installing SuSE
> 8.2, compiling stock 2.4.22-rc2 kernel exporting some reiserfs to a nfs-client
> of your choice and starting copying data with sizes around 100GB back and
> forth.

Sounds like quite a typical setup for some tasks (like clusters, I guess).

> > > > Probably it would be easier for you to make it crash (if there are crash
> > > > possibility at all) if you enable JBD debugging.
> > > I have never seen this in real life. Is it possible to turn this on when
> > > handling >100 GB of data or will some debug output flood the box?
> > It only enables some more checks, not debug output.
> Does this work for ext3, reiserfs or both?

This works for ext3.
For reiserfs we have a similar compile-time option called
CONFIG_REISERFS_CHECK.

Thank you for all the time and efforts you are putting into finding out
the cause.

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-14  8:45                       ` Oleg Drokin
@ 2003-08-14 17:26                         ` Marcelo Tosatti
  2003-08-14 17:42                           ` Stephan von Krawczynski
  2003-08-15 10:13                         ` 2.4.22-pre lockups (now decoded oops for pre10) Stephan von Krawczynski
  1 sibling, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2003-08-14 17:26 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: Stephan von Krawczynski, akpm, andrea, alan, linux-kernel, mason



On Thu, 14 Aug 2003, Oleg Drokin wrote:

> Thank you for all the time and efforts you are putting into finding out
> the cause.

Stephan,

How are things going? Is the machine still alive and well? 



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-14 17:26                         ` Marcelo Tosatti
@ 2003-08-14 17:42                           ` Stephan von Krawczynski
  2003-08-15  2:08                             ` Chris Mason
  0 siblings, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-14 17:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: green, akpm, andrea, alan, linux-kernel, mason

On Thu, 14 Aug 2003 14:26:33 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> 
> 
> On Thu, 14 Aug 2003, Oleg Drokin wrote:
> 
> > Thank you for all the time and efforts you are putting into finding out
> > the cause.
> 
> Stephan,
> 
> How are things going? Is the machine still alive and well? 

Hello Marcelo,

the system is up and running, currently:

  7:40pm  up 4 days  2:34,  21 users,  load average: 2.07, 2.10, 2.06

there is still the verification issue, today I added another 50 GB to the data
stream, and therefore got additional 3 verification  errors. But this seems to
have no influence on the stability. Box feels ok, reacts completely normal, no
strange output in any logs.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-14 17:42                           ` Stephan von Krawczynski
@ 2003-08-15  2:08                             ` Chris Mason
  2003-08-15  9:40                               ` Stephan von Krawczynski
  2003-08-15 10:28                               ` Stephan von Krawczynski
  0 siblings, 2 replies; 56+ messages in thread
From: Chris Mason @ 2003-08-15  2:08 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Marcelo Tosatti, green, akpm, andrea, alan, linux-kernel

On Thu, 2003-08-14 at 13:42, Stephan von Krawczynski wrote:

> Hello Marcelo,
> 
> the system is up and running, currently:
> 
>   7:40pm  up 4 days  2:34,  21 users,  load average: 2.07, 2.10, 2.06
> 
> there is still the verification issue, today I added another 50 GB to the data
> stream, and therefore got additional 3 verification  errors. But this seems to
> have no influence on the stability. Box feels ok, reacts completely normal, no
> strange output in any logs.

Just to second Oleg's messages so far, the verification issues are still
serious, it could be the same kind of memory corruptions that could be
causing crashes on reiserfs, just in a different place.

We need to find out if a specific kernel release is causing these
corruptions.  There are lots of different ways to go about it, I would
suggest a combination of fsx (triggers IO and does verification) and
usemem (sucks down ram) from the ext3 cvs progs.

When you can reliably cause either fsx-linux errors or system hangs in a
short period of time, then we can try different prereleases to find the
offending code.
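
The search over prereleases could be sketched as a plain bisection. This is
only an illustration: first_bad and the crashes callback are hypothetical
names, and the sketch assumes that once a release crashes, every later one
does too, and that the test gives a reliable answer in a bounded time.

```python
def first_bad(releases, crashes):
    """Return the first release for which crashes(release) is true.

    releases is ordered oldest to newest; releases[0] is assumed to be
    known good and releases[-1] known bad, so neither is tested again.
    """
    lo, hi = 0, len(releases) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(releases[mid]):
            hi = mid   # bug is already present at mid
        else:
            lo = mid   # mid is still good
    return releases[hi]
```

With a known-good and a known-bad kernel at the ends of the list, each test
run halves the remaining range, so about ten candidates need only three or
four runs instead of ten.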

(download details here: http://www.zipworld.com.au/~akpm/linux/ext3/)

Run 4 or so fsx-linux programs (each to its own file) and use usemem to
put your box into swap.  That should hit it pretty quickly, and any
errors from fsx indicate problems.
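
A rough launcher for that load, as a sketch only. It assumes fsx-linux takes
the test file as its argument and that usemem accepts a size in megabytes
via "-m"; check both against the programs you actually built. The function
names, file names and default sizes are made up for this example.

```python
import subprocess


def build_commands(workdir, fsx_count=4, usemem_mb=2048,
                   fsx_bin="fsx-linux", usemem_cmd=("usemem", "-m")):
    """Build one command per fsx-linux instance plus one memory hog.

    Each fsx-linux instance gets its own file under workdir. The usemem
    invocation ("-m <megabytes>") is an assumption, see the note above.
    """
    cmds = [[fsx_bin, f"{workdir}/fsx-{i}.dat"] for i in range(fsx_count)]
    cmds.append([*usemem_cmd, str(usemem_mb)])
    return cmds


def launch(cmds):
    """Start all commands in parallel and return the Popen handles."""
    return [subprocess.Popen(cmd) for cmd in cmds]
```

Pick usemem_mb larger than the free RAM on the box so that it really goes
into swap; any fsx-linux process that exits with an error is the signal to
look for.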

-chris



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-15  2:08                             ` Chris Mason
@ 2003-08-15  9:40                               ` Stephan von Krawczynski
  2003-08-15 10:28                               ` Stephan von Krawczynski
  1 sibling, 0 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-15  9:40 UTC (permalink / raw)
  To: Chris Mason; +Cc: marcelo, green, akpm, andrea, alan, linux-kernel

On 14 Aug 2003 22:08:58 -0400
Chris Mason <mason@suse.com> wrote:

> On Thu, 2003-08-14 at 13:42, Stephan von Krawczynski wrote:
> 
> > Hello Marcelo,
> > 
> > the system is up and running, currently:
> > 
> >   7:40pm  up 4 days  2:34,  21 users,  load average: 2.07, 2.10, 2.06
> > 
> > there is still the verification issue, today I added another 50 GB to the
> > data stream, and therefore got additional 3 verification  errors. But this
> > seems to have no influence on the stability. Box feels ok, reacts
> > completely normal, no strange output in any logs.
> 
> Just to second Oleg's messages so far, the verification issues are still
> serious, it could be the same kind of memory corruptions that could be
> causing crashes on reiserfs, just in a different place.

Well, as you expected, I have an oops for you; it happened just this morning:

ksymoops 2.4.8 on i686 2.4.22-rc2.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.22-rc2/ (default)
     -m /boot/System.map-2.4.22-rc2 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

NMI Watchdog detected LOCKUP on CPU0, eip c01457c3, registers:
CPU:    0
EIP:    0010:[<c01457c3>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000046
eax: 00000019   ebx: effc5c7c   ecx: 00000000   edx: effc6c7c
esi: 00000001   edi: 00000202   ebp: c13956c0   esp: f6ae1e8c
ds: 0018   es: 0018   ss: 0018
Process setiathome (pid: 2696, stackpage=f6ae1000)
Stack: f79ba218 effc5c7c f710eab8 00000008 c02165ea effc5c7c 00000001 ffffffff
       f79ba298 f79ba218 00000001 00000010 00000001 f710ea00 c0216a0f f710ea00
       00000001 00000000 00000001 00000001 ffffffff ffffffff 0000001c 00000000
Call Trace:    [<c02165ea>] [<c0216a0f>] [<c024a47a>] [<c020f6b8>] [<c020f568>]
  [<c01226da>] [<c0122563>] [<c01222d6>] [<c0109508>] [<c010c048>]
Code: 75 eb a8 01 0f 44 f1 8b 52 28 39 da 75 ea c6 05 64 5d 30 c0


>>EIP; c01457c3 <end_buffer_io_async+63/b0>   <=====

>>ebx; effc5c7c <_end+2fbfe61c/38462a00>
>>edx; effc6c7c <_end+2fbff61c/38462a00>
>>ebp; c13956c0 <_end+fce060/38462a00>
>>esp; f6ae1e8c <_end+3671a82c/38462a00>

Trace; c02165ea <__scsi_end_request+ba/250>
Trace; c0216a0f <scsi_io_completion+15f/430>
Trace; c024a47a <rw_intr+5a/200>
Trace; c020f6b8 <scsi_finish_command+98/d0>
Trace; c020f568 <scsi_bottom_half_handler+c8/f0>
Trace; c01226da <bh_action+6a/70>
Trace; c0122563 <tasklet_hi_action+53/a0>
Trace; c01222d6 <do_softirq+76/e0>
Trace; c0109508 <do_IRQ+d8/f0>
Trace; c010c048 <call_do_IRQ+5/d>

Code;  c01457c3 <end_buffer_io_async+63/b0>
00000000 <_EIP>:
Code;  c01457c3 <end_buffer_io_async+63/b0>   <=====
   0:   75 eb                     jne    ffffffed <_EIP+0xffffffed>   <=====
Code;  c01457c5 <end_buffer_io_async+65/b0>
   2:   a8 01                     test   $0x1,%al
Code;  c01457c7 <end_buffer_io_async+67/b0>
   4:   0f 44 f1                  cmove  %ecx,%esi
Code;  c01457ca <end_buffer_io_async+6a/b0>
   7:   8b 52 28                  mov    0x28(%edx),%edx
Code;  c01457cd <end_buffer_io_async+6d/b0>
   a:   39 da                     cmp    %ebx,%edx
Code;  c01457cf <end_buffer_io_async+6f/b0>
   c:   75 ea                     jne    fffffff8 <_EIP+0xfffffff8>
Code;  c01457d1 <end_buffer_io_async+71/b0>
   e:   c6 05 64 5d 30 c0 00      movb   $0x0,0xc0305d64


1 warning issued.  Results may not be reliable.


Obviously the problem seems a lot harder to trigger with ext3, but nevertheless
comes up (this time around 5 days). I will try Chris' suggestions  and see what
happens. I'll keep you informed.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-14  8:45                       ` Oleg Drokin
  2003-08-14 17:26                         ` Marcelo Tosatti
@ 2003-08-15 10:13                         ` Stephan von Krawczynski
  2003-08-15 10:31                           ` Oleg Drokin
  1 sibling, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-15 10:13 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: marcelo, akpm, andrea, alan, linux-kernel, mason

Hello Oleg,

there was a question about fsck'ing the ext3 filesystems. Since it crashed
today I did check them now and no errors or warnings showed up. Everything
seems clean. I don't exactly understand what that tells you. I guess you mean
the fs metadata may have been hit, too. Seems not.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-15  2:08                             ` Chris Mason
  2003-08-15  9:40                               ` Stephan von Krawczynski
@ 2003-08-15 10:28                               ` Stephan von Krawczynski
  2003-08-15 12:55                                 ` Chris Mason
  1 sibling, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-15 10:28 UTC (permalink / raw)
  To: Chris Mason; +Cc: marcelo, green, akpm, andrea, alan, linux-kernel

On 14 Aug 2003 22:08:58 -0400
Chris Mason <mason@suse.com> wrote:

> Run 4 or so fsx-linux programs (each to its own file) and use usemem to
> put your box into swap.  That should hit it pretty quickly, and any
> errors from fsx indicate problems.

Question: how do I make fsx-linux use big filesizes (GB range) ?

Regards,
Stephan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-15 10:13                         ` 2.4.22-pre lockups (now decoded oops for pre10) Stephan von Krawczynski
@ 2003-08-15 10:31                           ` Oleg Drokin
  0 siblings, 0 replies; 56+ messages in thread
From: Oleg Drokin @ 2003-08-15 10:31 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: marcelo, akpm, andrea, alan, linux-kernel, mason

Hello!

On Fri, Aug 15, 2003 at 12:13:21PM +0200, Stephan von Krawczynski wrote:

> there was a question about fsck'ing the ext3 filesystems. Since it crashed
> today I did check them now and no errors or warnings showed up. Everything
> seems clean. I don't exactly understand what that tells you. I guess you mean
> the fs metadata may have been hit, too. Seems not.

Yes. And from what I remember, all the oopses on reiserfs were about list
corruption and that sort of thing, so not metadata, but kernel
data was damaged somehow.
And your last oops confirms that.
end_buffer_io_async has a loop that runs with irqs disabled.
And this loop in your case should only have one iteration (you run with 4k
blocksize, I presume), going through the one buffer attached to a page.
Also at least one of the oopses you posted prior to that had signs of
buffer list corruption (maybe even two).
So it seems something changes buffer lists under our feet without doing
proper locking.
I am not sure how this relates to data corruption, though.
Ok, at least now there seems to be something definite to look for in changes.
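
To illustrate why a damaged buffer list shows up as a hard lockup rather
than an ordinary oops, here is a toy model of walking the ring of buffers
that belong to one page. Buf, this_page, walk_ring and the step limit are
hypothetical names for this sketch; the real loop has no such limit, so it
spins with irqs off until the NMI watchdog fires.

```python
class Buf:
    """One buffer; this_page links the buffers of a page into a ring."""

    def __init__(self, name):
        self.name = name
        self.this_page = self  # a single buffer forms a ring with itself


def walk_ring(start, limit=1000):
    """Follow this_page links until we are back at start.

    Returns the number of other buffers visited, or None if the walk
    did not return to start within limit steps (ring is corrupted).
    """
    steps = 0
    tmp = start.this_page
    while tmp is not start:
        steps += 1
        if steps >= limit:
            return None  # never came back: the kernel loop would spin here
        tmp = tmp.this_page
    return steps
```

With one 4k buffer per page the walk ends immediately. If some other code
rewrites a link so that the ring no longer leads back to the starting
buffer, the walk never terminates.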

Thank you.

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-15 10:28                               ` Stephan von Krawczynski
@ 2003-08-15 12:55                                 ` Chris Mason
  2003-08-20 14:21                                   ` 2.4.22-pre lockups (yet another oops for rc2) Stephan von Krawczynski
  2003-09-05  9:24                                   ` 2.4.22-pre lockups (case closed) Stephan von Krawczynski
  0 siblings, 2 replies; 56+ messages in thread
From: Chris Mason @ 2003-08-15 12:55 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: marcelo, green, akpm, andrea, alan, linux-kernel

On Fri, 2003-08-15 at 06:28, Stephan von Krawczynski wrote:
> On 14 Aug 2003 22:08:58 -0400
> Chris Mason <mason@suse.com> wrote:
> 
> > Run 4 or so fsx-linux programs (each to its own file) and use usemem to
> > put your box into swap.  That should hit it pretty quickly, and any
> > errors from fsx indicate problems.
> 
> Question: how do I make fsx-linux use big filesizes (GB range) ?

You don't really need to, fsx-linux is pretty good at triggering
problems with its default file size.  Usually you just need some other
load in place to chew up ram.

-chris



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-13 16:04                 ` Stephan von Krawczynski
  2003-08-13 16:34                   ` Oleg Drokin
@ 2003-08-18 15:06                   ` Andrea Arcangeli
  2003-08-18 20:19                     ` Stephan von Krawczynski
  1 sibling, 1 reply; 56+ messages in thread
From: Andrea Arcangeli @ 2003-08-18 15:06 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Oleg Drokin, marcelo, akpm, alan, linux-kernel, mason

On Wed, Aug 13, 2003 at 06:04:05PM +0200, Stephan von Krawczynski wrote:
> On Wed, 13 Aug 2003 19:30:09 +0400
> Oleg Drokin <green@namesys.com> wrote:
> 
> > Hello!
> > 
> > On Wed, Aug 13, 2003 at 05:12:24PM +0200, Stephan von Krawczynski wrote:
> > 
> > > Well, that's exactly the reason why I am awaiting some more days of
> > > up-and-running ext3. After how many days will you be convinced that a
> > > random memory corruption should have hit the ext3 system that bad, that it
> > > should have crashed?
> > 
> > Well, I'd prefer that you spend time to figure out at which exact
> > 2.4.21-pre version the crashes in reiserfs started to appear. ;)
> 
> Well, Oleg, I'd love to, but there is an immanent problem with that. If
> I check pre-X and it crashes, everything is fine, because I have a certain
> result of the test. If it does not crash within 3 days, then I have a problem.
> How long do I wait before stating the pre is good? It could take months to test
> 10 pre's ... That cannot be the way to find out what is going on. 
> On the other hand: 
> - no UP kernel ever crashed. So we can at least talk about an SMP-race.
> - 2.4.20 does not crash
> - 2.4.21 does crash

an SMP kernel puts double the stress on the mem bus, so it might
still be ram that went bad around the time you upgraded from 2.4.19. Or
it may simply be a buggy smp motherboard, or whatever.

Of course I can't be sure but we can't exclude it.

Andrea

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-18 15:06                   ` Andrea Arcangeli
@ 2003-08-18 20:19                     ` Stephan von Krawczynski
  2003-08-18 20:58                       ` Mike Fedyk
  2003-08-18 22:31                       ` Andrea Arcangeli
  0 siblings, 2 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-18 20:19 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: green, marcelo, akpm, alan, linux-kernel, mason

On Mon, 18 Aug 2003 17:06:25 +0200
Andrea Arcangeli <andrea@suse.de> wrote:

> an SMP kernel puts double the stress on the mem bus, so it might
> still be ram that went bad around the time you upgraded from 2.4.19. Or
> it may simply be a buggy smp motherboard, or whatever.
> 
> Of course I can't be sure but we can't exclude it.

It is unlikely for bad ram to survive memtest for several hours.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-11  9:33           ` Stephan von Krawczynski
@ 2003-08-18 20:43             ` Mike Fedyk
  0 siblings, 0 replies; 56+ messages in thread
From: Mike Fedyk @ 2003-08-18 20:43 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Neil Brown, marcelo, akpm, andrea, alan, linux-kernel

On Mon, Aug 11, 2003 at 11:33:02AM +0200, Stephan von Krawczynski wrote:
> On Mon, 11 Aug 2003 09:23:20 +1000
> Neil Brown <neilb@cse.unsw.edu.au> wrote:
> 
> > On Sunday August 10, skraw@ithnet.com wrote:
> > > 
> > > From looking at the tests so far I would say the setup is remarkably slower
> > > in terms of writing to ext3 via nfs and sync option set. I think especially
> > > the "sync" is very visible - unlike reiserfs.
> > 
> >   data=journal
> > makes nfsd go noticeably faster over ext3.  Having an external journal
> > is even better.
> 
> Uh, forgive my ignorance. "journal" means metadata+data journaling. If I have
> large data movement, how can that be even faster? Ok, I see the facts around
> sync'ing the fs. But anyway the data size written should be nearly doubled
> compared to data=ordered. Reiserfs journaling has to be real incredible in
> comparison to ext3(ordered). I have the impression that large files are hit
> most.

You enlarge your journal (larger for more activity).

The idea is that the sync puts all data+meta-data into the journal, and once
that's complete the sync returns.

After the sync returns, the data is written from the journal asynchronously
in the background while you're not waiting.

If your system is stressed to its limit, this won't work, but in the common
case, it will speed up your nfs server.

Though, after using reiserfs for a while, writeout is smoother.  There
aren't the spikes like with ext3 (even in ordered mode), but that could be
due to the 30 second timeout on reiserfs compared to 5 second for ext3
before writes are committed to disk (without memory pressure).

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-18 20:19                     ` Stephan von Krawczynski
@ 2003-08-18 20:58                       ` Mike Fedyk
  2003-08-18 22:31                       ` Andrea Arcangeli
  1 sibling, 0 replies; 56+ messages in thread
From: Mike Fedyk @ 2003-08-18 20:58 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Andrea Arcangeli, green, marcelo, akpm, alan, linux-kernel, mason

On Mon, Aug 18, 2003 at 10:19:21PM +0200, Stephan von Krawczynski wrote:
> On Mon, 18 Aug 2003 17:06:25 +0200
> Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > an SMP kernel puts double the stress on the mem bus, so it might
> > still be ram that went bad around the time you upgraded from 2.4.19. Or
> > it may simply be a buggy smp motherboard, or whatever.
> > 
> > Of course I can't be sure but we can't exclude it.
> 
> It is unlikely for bad ram to survive memtest for several hours.

How many hours?

Are you using memtest 3.0 that supports larger amounts of memory, and has
specific tests for ECC (ie disabling it)?

Are you doing a full run with all tests, and not just the standard tests?
(you should let it complete one, or preferably two or three in this mode)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-18 20:19                     ` Stephan von Krawczynski
  2003-08-18 20:58                       ` Mike Fedyk
@ 2003-08-18 22:31                       ` Andrea Arcangeli
  2003-08-19  1:12                         ` Mike Fedyk
  1 sibling, 1 reply; 56+ messages in thread
From: Andrea Arcangeli @ 2003-08-18 22:31 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: green, marcelo, akpm, alan, linux-kernel, mason

On Mon, Aug 18, 2003 at 10:19:21PM +0200, Stephan von Krawczynski wrote:
> On Mon, 18 Aug 2003 17:06:25 +0200
> Andrea Arcangeli <andrea@suse.de> wrote:
> 
> > an SMP kernel puts double the stress on the mem bus, so it might
> > still be ram that went bad around the time you upgraded from 2.4.19. Or
> > it may simply be a buggy smp motherboard, or whatever.
> > 
> > Of course I can't be sure but we can't exclude it.
> 
> It is unlikely for bad ram to survive memtest for several hours.

memtest is single threaded, UP kernel works fine too.

Andrea

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-18 22:31                       ` Andrea Arcangeli
@ 2003-08-19  1:12                         ` Mike Fedyk
  2003-08-19  7:12                           ` Stephan von Krawczynski
  0 siblings, 1 reply; 56+ messages in thread
From: Mike Fedyk @ 2003-08-19  1:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephan von Krawczynski, green, marcelo, akpm, alan, linux-kernel, mason

On Tue, Aug 19, 2003 at 12:31:27AM +0200, Andrea Arcangeli wrote:
> On Mon, Aug 18, 2003 at 10:19:21PM +0200, Stephan von Krawczynski wrote:
> > On Mon, 18 Aug 2003 17:06:25 +0200
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > 
> > > an SMP kernel puts double the stress on the mem bus, so it might
> > > still be ram that went bad around the time you upgraded from 2.4.19. Or
> > > it may simply be a buggy smp motherboard, or whatever.
> > > 
> > > Of course I can't be sure but we can't exclude it.
> > 
> > It is unlikely for bad ram to survive memtest for several hours.
> 
> memtest is single threaded, UP kernel works fine too.

Are you saying that one CPU can't saturate the memory bus?  Or maybe we're
hitting something on the CPU bus, or just that SMP will change the timings
and stress things differently?  Or that if memtest doesn't test from the
second CPU then it could be a faulty cpu/L2?

Grr, has anything been done to verify the hardware is running within specs
and isn't too hot?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-19  1:12                         ` Mike Fedyk
@ 2003-08-19  7:12                           ` Stephan von Krawczynski
  2003-08-19 13:10                             ` Alan Cox
  2003-08-19 13:27                             ` Andrea Arcangeli
  0 siblings, 2 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-19  7:12 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: andrea, green, marcelo, akpm, alan, linux-kernel, mason

On Mon, 18 Aug 2003 18:12:08 -0700
Mike Fedyk <mfedyk@matchmail.com> wrote:

> > > It is unlikely for bad ram to survive memtest for several hours.
> > 
> > memtest is single threaded, UP kernel works fine too.
> 
> Are you saying that one CPU can't saturate the memory bus?  Or maybe we're
> hitting something on the CPU bus, or just that SMP will change the timings
> and stress things differently?  Or that if memtest doesn't test from the
> second CPU then it could be a faulty cpu/L2?

Well, if memtest does not use a second available CPU then probably we should
ask the author about this...
 
> Grr, has anything been done to verify the hardware is running within specs
> and isn't too hot?

In fact we are talking about datacenter environment with air conditioning and
the like.
Besides the favourite test box I have others (already mentioned in this thread)
- SMP with completely different hw - where I can make 2.4.21 and above crash,
too.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-19  7:12                           ` Stephan von Krawczynski
@ 2003-08-19 13:10                             ` Alan Cox
  2003-08-19 14:18                               ` Stephan von Krawczynski
  2003-08-19 13:27                             ` Andrea Arcangeli
  1 sibling, 1 reply; 56+ messages in thread
From: Alan Cox @ 2003-08-19 13:10 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Mike Fedyk, andrea, green, Marcelo Tosatti, akpm,
	Linux Kernel Mailing List, mason

On Maw, 2003-08-19 at 08:12, Stephan von Krawczynski wrote:
> > Are you saying that one CPU can't saturate the memory bus?  Or maybe we're
> > hitting something on the CPU bus, or just that SMP will change the timings
> > and stress things differently?  Or that if memtest doesn't test from the
> > second CPU then it could be a faulty cpu/L2?
> 
> Well, if memtest does not use a second available CPU then probably we should
> ask the author about this...

I'm sure he'd give you a quote for adding SMP support if you asked.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-19  7:12                           ` Stephan von Krawczynski
  2003-08-19 13:10                             ` Alan Cox
@ 2003-08-19 13:27                             ` Andrea Arcangeli
  1 sibling, 0 replies; 56+ messages in thread
From: Andrea Arcangeli @ 2003-08-19 13:27 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Mike Fedyk, green, marcelo, akpm, alan, linux-kernel, mason

On Tue, Aug 19, 2003 at 09:12:43AM +0200, Stephan von Krawczynski wrote:
> Besides the favourite test box I have others (already mentioned in this thread)
> - SMP with completely different hw - where I can make 2.4.21 and above crash,
> too.

Did you post any backtrace for those other boxes yet? It would be
especially useful if you could demonstrate the same random mm corruption
with different ram/motherboard/cpus (I mean all of them different), if
the devices are the same that's ok (since it could be a software bug in
a driver).

At the moment I doubt a bug in the common code since AFAIK you are the
only one running into this sort of corruption and at the very least I
can't trigger it here (OTOH maybe it triggers with only one certain
application).

(just for clarity: with my previous posts I didn't mean it's not a
software bug, I only wanted to point out that with the current info we
cannot completely exclude a hardware issue yet)

Andrea

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-19 13:10                             ` Alan Cox
@ 2003-08-19 14:18                               ` Stephan von Krawczynski
  2003-08-19 18:00                                 ` Mike Fedyk
  0 siblings, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-19 14:18 UTC (permalink / raw)
  To: Alan Cox; +Cc: mfedyk, andrea, green, marcelo, akpm, linux-kernel, mason

On 19 Aug 2003 14:10:22 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> On Maw, 2003-08-19 at 08:12, Stephan von Krawczynski wrote:
> > > Are you saying that one CPU can't saturate the memory bus?  Or maybe
> > > we're hitting something on the CPU bus, or just that SMP will change the
> > > timings and stress things differently?  Or that if memtest doesn't test
> > > from the second CPU then it could be a faulty cpu/L2?
> > 
> > Well, if memtest does not use a second available CPU then probably we
> > should ask the author about this...
> 
> I'm sure he'd give you a quote for adding SMP support if you asked.

Well, actually I don't want to burn his time as long as I don't see a need
for it. Since I am pretty confident that I can make the box work in SMP under
2.4.20, a memtest will most certainly not give any additional information, be
it running UP or SMP.
Instead I will invest another day and convert the whole system back to
reiserfs, because the ext3 fs cannot be used under 2.4.20 - I don't know why.
Additionally reiserfs is better for testing possible patches because it crashes
in much shorter time than ext3 setup.
2.4.20 setup gives me a simple testcase to prove people right or wrong that are
talking about a hardware issue.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-19 14:18                               ` Stephan von Krawczynski
@ 2003-08-19 18:00                                 ` Mike Fedyk
  2003-08-19 21:58                                   ` Stephan von Krawczynski
  0 siblings, 1 reply; 56+ messages in thread
From: Mike Fedyk @ 2003-08-19 18:00 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Alan Cox, andrea, green, marcelo, akpm, linux-kernel, mason

On Tue, Aug 19, 2003 at 04:18:32PM +0200, Stephan von Krawczynski wrote:
> On 19 Aug 2003 14:10:22 +0100
> Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> 
> > On Maw, 2003-08-19 at 08:12, Stephan von Krawczynski wrote:
> > > > Are you saying that one CPU can't saturate the memory bus?  Or maybe
> > > > we're hitting something on the CPU bus, or just that SMP will change the
> > > > timings and stress things differently?  Or that if memtest doesn't test
> > > > from the second CPU then it could be a faulty cpu/L2?
> > > 
> > > Well, if memtest does not use a second available CPU then probably we
> > > should ask the author about this...
> > 
> > I'm sure he'd give you a quote for adding SMP support if you asked.
> 
> Well, actually I don't want to burn his time as long as I don't see a need
> for it. Since I am pretty confident that I can make the box work in SMP under
> 2.4.20, a memtest will most certainly not give any additional information, be
> it running UP or SMP.
> Instead I will invest another day and convert the whole system back to
> reiserfs, because the ext3 fs cannot be used under 2.4.20 - I don't know why.
> Additionally reiserfs is better for testing possible patches because it crashes
> in much shorter time than ext3 setup.
> 2.4.20 setup gives me a simple testcase to prove people right or wrong that are
> talking about a hardware issue.

Are you doing a lot of directory operations, or is it mostly just large
amounts of data transferring over NFS?

The reason I ask is that I know that at least JFS and possibly XFS use
trees for their directory structures, and might show similar problems (with
their heavy use of trees) if you did a lot of directory operations on the
other filesystems.

Then maybe it could rule out reiserfs.  Though it still did show up on ext3...

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-19 18:00                                 ` Mike Fedyk
@ 2003-08-19 21:58                                   ` Stephan von Krawczynski
  0 siblings, 0 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-19 21:58 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: alan, andrea, green, marcelo, akpm, linux-kernel, mason

On Tue, 19 Aug 2003 11:00:28 -0700
Mike Fedyk <mfedyk@matchmail.com> wrote:

> Are you doing a lot of directory operations, or is it mostly just large
> amounts of data transferring over NFS?

In fact, almost no directory operations take place. Large data movement is the
primary test.

Regards,
Stephan

* Re: 2.4.22-pre lockups (yet another oops for rc2)
  2003-08-15 12:55                                 ` Chris Mason
@ 2003-08-20 14:21                                   ` Stephan von Krawczynski
  2003-09-05  9:24                                   ` 2.4.22-pre lockups (case closed) Stephan von Krawczynski
  1 sibling, 0 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-20 14:21 UTC (permalink / raw)
  To: Chris Mason; +Cc: marcelo, green, akpm, andrea, alan, linux-kernel

Hello all,

today's oops is:

ksymoops 2.4.8 on i686 2.4.22-rc2.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.22-rc2/ (default)
     -m /boot/System.map-2.4.22-rc2 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

kernel BUG at slab.c:1225!
invalid operand: 0000
CPU:    1
EIP:    0010:[<c0137ebd>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010046
eax: 00000005   ebx: 00000005   ecx: 00000088   edx: 00000000
esi: f6df2000   edi: f6df20a0   ebp: f6df2348   esp: c345df04
ds: 0018   es: 0018   ss: 0018
Process kswapd (pid: 5, stackpage=c345d000)
Stack: f6df234c f6df2348 f6df23cc f6df2000 c0139107 c342b4d0 f6df2000 f6df2348
       c342b4d0 0000007d c346040c c3460400 c01384e2 c342b4d0 f6df234c 00000000 
       00000001 00000000 00000000 00000000 00000020 000001d0 00000020 00000006
Call Trace:    [<c0139107>] [<c01384e2>] [<c0139c78>] [<c0139d2e>] [<c0139e3c>]
  [<c0139ec8>] [<c0139ff8>] [<c0139f60>] [<c0105000>] [<c010592e>] [<c0139f60>]
Code: 0f 0b c9 04 44 92 2c c0 8b 44 86 18 83 f8 ff 75 eb 89 f6 8b


>>EIP; c0137ebd <kmem_extra_free_checks+6d/a0>   <=====

>>esi; f6df2000 <_end+36a2a9a0/38462a00>
>>edi; f6df20a0 <_end+36a2aa40/38462a00>
>>ebp; f6df2348 <_end+36a2ace8/38462a00>
>>esp; c345df04 <_end+30968a4/38462a00>

Trace; c0139107 <kmem_cache_free_one+f7/220>
Trace; c01384e2 <kmem_cache_reap+b2/290>
Trace; c0139c78 <shrink_caches+28/a0>
Trace; c0139d2e <try_to_free_pages_zone+3e/60>
Trace; c0139e3c <kswapd_balance_pgdat+4c/b0>
Trace; c0139ec8 <kswapd_balance+28/40>
Trace; c0139ff8 <kswapd+98/c0>
Trace; c0139f60 <kswapd+0/c0>
Trace; c0105000 <_stext+0/0>
Trace; c010592e <arch_kernel_thread+2e/40>
Trace; c0139f60 <kswapd+0/c0>

Code;  c0137ebd <kmem_extra_free_checks+6d/a0>
00000000 <_EIP>:
Code;  c0137ebd <kmem_extra_free_checks+6d/a0>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c0137ebf <kmem_extra_free_checks+6f/a0>
   2:   c9                        leave  
Code;  c0137ec0 <kmem_extra_free_checks+70/a0>
   3:   04 44                     add    $0x44,%al
Code;  c0137ec2 <kmem_extra_free_checks+72/a0>
   5:   92                        xchg   %eax,%edx
Code;  c0137ec3 <kmem_extra_free_checks+73/a0>
   6:   2c c0                     sub    $0xc0,%al
Code;  c0137ec5 <kmem_extra_free_checks+75/a0>
   8:   8b 44 86 18               mov    0x18(%esi,%eax,4),%eax
Code;  c0137ec9 <kmem_extra_free_checks+79/a0>
   c:   83 f8 ff                  cmp    $0xffffffff,%eax
Code;  c0137ecc <kmem_extra_free_checks+7c/a0>
   f:   75 eb                     jne    fffffffc <_EIP+0xfffffffc>
Code;  c0137ece <kmem_extra_free_checks+7e/a0>
  11:   89 f6                     mov    %esi,%esi
Code;  c0137ed0 <kmem_extra_free_checks+80/a0>
  13:   8b 00                     mov    (%eax),%eax


1 warning issued.  Results may not be reliable.


This is still with ext3 and about 24 hours uptime (rough guess).

Regards,
Stephan
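
A side note on the "Code:" bytes in the oops above. On i386 2.4 kernels built
with verbose BUG() reporting, the ud2a opcode (0f 0b) is followed by a 16-bit
line number and a 32-bit pointer to the file name, so the instructions that
ksymoops prints right after ud2a (leave, add, xchg, sub) are most likely that
data rather than real code. The sketch below assumes this layout; the function
name is made up for illustration.

```python
import struct

def decode_bug_bytes(code_hex):
    """Decode 'ud2a; .word line; .long file_ptr' from oops 'Code:' bytes."""
    raw = bytes.fromhex(code_hex.replace(" ", ""))
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("does not start with ud2a (0f 0b)")
    # Little-endian: 2-byte line number, then 4-byte pointer to the file name.
    line, file_ptr = struct.unpack_from("<HI", raw, 2)
    return line, file_ptr

# First eight bytes of the 'Code:' line from the oops above.
line, file_ptr = decode_bug_bytes("0f 0b c9 04 44 92 2c c0")
print(line, hex(file_ptr))  # 1225 0xc02c9244
```

The decoded line number agrees with the "kernel BUG at slab.c:1225!" message,
which supports reading those bytes as data.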


* Re: 2.4.22-pre lockups (case closed)
  2003-08-15 12:55                                 ` Chris Mason
  2003-08-20 14:21                                   ` 2.4.22-pre lockups (yet another oops for rc2) Stephan von Krawczynski
@ 2003-09-05  9:24                                   ` Stephan von Krawczynski
  2003-09-05 13:37                                     ` Andrea Arcangeli
  1 sibling, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-09-05  9:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: marcelo, mason, green, akpm, andrea, alan, tejun, chris

Hello all,

I would like to give you the last update on the story:

short: hardware problem

long:
The box had two different types of RAM (both registered ECC) in it: two 1 GByte
modules and four 256 MByte modules, for a total of 3 GByte. I found out that the
box runs flawlessly when using only the GByte modules _or_ only the 256 MByte
modules, but not the mix. All modules are from the same vendor. The problem in
the mixed setup does not show up in UP mode (memtest works!). It does not even
show up straight away; it takes days, but it is always there.
In fact, even though I sank weeks of work into this, I am pretty happy that it
turned out not to be a kernel problem.
For the other setups that showed SMP-specific weirdness TeJun may have found
interesting explanations. I updated them all to 2.4.22 and have not seen any
problem yet.
For me it was really interesting to see that reiserfs setups obviously have a
completely different memory footprint than ext3, and altogether there seems to
be a remarkable difference between later kernels and earlier ones. The problem
showed up very rarely on 2.4.21 and below, but within 2 days with 2.4.22.
Thanks to all who lent me their ears on the topic, and sorry for wasting your
time.

Regards,
Stephan

PS: Obviously there are rare cases where SMP support in memtest _could_ make
a difference ;-)


* Re: 2.4.22-pre lockups (case closed)
  2003-09-05  9:24                                   ` 2.4.22-pre lockups (case closed) Stephan von Krawczynski
@ 2003-09-05 13:37                                     ` Andrea Arcangeli
  0 siblings, 0 replies; 56+ messages in thread
From: Andrea Arcangeli @ 2003-09-05 13:37 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: linux-kernel, marcelo, mason, green, akpm, alan, tejun, chris

On Fri, Sep 05, 2003 at 11:24:00AM +0200, Stephan von Krawczynski wrote:
> Hello all,
> 
> I would like to give you the last update on the story:
> 
> short: hardware problem
> 
> long:
> The box had two different types of RAM (both registered ECC) in it: two 1 GByte
> modules and four 256 MByte modules, for a total of 3 GByte. I found out that the
> box runs flawlessly when using only the GByte modules _or_ only the 256 MByte
> modules, but not the mix. All modules are from the same vendor. The problem in
> the mixed setup does not show up in UP mode (memtest works!). It does not even
> show up straight away; it takes days, but it is always there.
> In fact, even though I sank weeks of work into this, I am pretty happy that it
> turned out not to be a kernel problem.

thanks for demonstrating this.

> For the other setups that showed SMP-specific weirdness TeJun may have found
> interesting explanations. I updated them all to 2.4.22 and have not seen any
> problem yet.
> For me it was really interesting to see that reiserfs setups obviously have a
> completely different memory footprint than ext3, and altogether there seems to
> be a remarkable difference between later kernels and earlier ones. The problem
> showed up very rarely on 2.4.21 and below, but within 2 days with 2.4.22.

normally that indicates the kernel is somehow using the resources more
efficiently; it's usually a good sign from a kernel standpoint. I have heard
of things like this also happening during major upgrades, like from 2.2
to 2.4.

> Thanks to all who lent me their ears on the topic, and sorry for wasting your
> time.

you're very welcome.

> PS: Obviously there are rare cases where SMP support in memtest _could_ make
> a difference ;-)

;)

Andrea

/*
 * If you refuse to depend on closed software for a critical
 * part of your business, these links may be useful:
 *
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
 * rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
 * http://www.cobite.com/cvsps/
 *
 * svn://svn.kernel.org/linux-2.6/trunk
 * svn://svn.kernel.org/linux-2.4/trunk
 */

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-18 20:39                 ` Stephan von Krawczynski
@ 2003-08-18 21:09                   ` Mike Fedyk
  0 siblings, 0 replies; 56+ messages in thread
From: Mike Fedyk @ 2003-08-18 21:09 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: reiser, linux-kernel

On Mon, Aug 18, 2003 at 10:39:46PM +0200, Stephan von Krawczynski wrote:
> On Mon, 18 Aug 2003 13:29:49 -0700
> Mike Fedyk <mfedyk@matchmail.com> wrote:
> 
> > > I'd say "two things differ", without trailing "s". I am not even sure if
> > > "bitmaps" shouldn't be singular "bitmap" instead.
> > 
> > "bitmaps" with your changes would be correct.
> > 
> > Though, just turn "bitmaps" into "bitmap" and it should be fine.  I can't
> > really think of a phrase specific enough for the error message without
> > adding enough text to make it two lines, which wouldn't be good.
> > 
> > "Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmap differs"
> 
> Hm, but:
> 
> "a and b differ"

1) "Comparing bitmaps.. vpf-10640: The on-disk and correct bitmap differ"

> "a differs from b"

2) "Comparing bitmaps.. vpf-10640: The on-disk differs from the correct bitmap"

> 
> or not?
> 
> Alternatives:
> 
> "a and b are different"

3) "Comparing bitmaps.. vpf-10640: The on-disk and correct are different"

> 
> But if you use "are" here, you cannot use "differs" above, right?
> 

Yes.

I kinda like (1), or the original changed to "bitmap" instead of
"bitmaps".

Mike

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-18 20:29               ` Mike Fedyk
@ 2003-08-18 20:39                 ` Stephan von Krawczynski
  2003-08-18 21:09                   ` Mike Fedyk
  0 siblings, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-18 20:39 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: reiser, linux-kernel

On Mon, 18 Aug 2003 13:29:49 -0700
Mike Fedyk <mfedyk@matchmail.com> wrote:

> > I'd say "two things differ", without trailing "s". I am not even sure if
> > "bitmaps" shouldn't be singular "bitmap" instead.
> 
> "bitmaps" with your changes would be correct.
> 
> Though, just turn "bitmaps" into "bitmap" and it should be fine.  I can't
> really think of a phrase specific enough for the error message without
> adding enough text to make it two lines, which wouldn't be good.
> 
> "Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmap differs"

Hm, but:

"a and b differ"
"a differs from b"

or not?

Alternatives:

"a and b are different"

But if you use "are" here, you cannot use "differs" above, right?

Regards,
Stephan

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-07 13:32             ` Stephan von Krawczynski
@ 2003-08-18 20:29               ` Mike Fedyk
  2003-08-18 20:39                 ` Stephan von Krawczynski
  0 siblings, 1 reply; 56+ messages in thread
From: Mike Fedyk @ 2003-08-18 20:29 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Hans Reiser, linux-kernel

On Thu, Aug 07, 2003 at 03:32:57PM +0200, Stephan von Krawczynski wrote:
> On Thu, 07 Aug 2003 17:18:16 +0400
> Hans Reiser <reiser@namesys.com> wrote:
> 
> > >On Thu, 7 Aug 2003, Stephan von Krawczynski wrote:
> > >>for this one. Hint: spelling in reiserfsck should be checked ;-)
> > >
> > where?
> 
> Hello Hans,
> 
> I am not a native English speaker, but 
> "Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmaps differs"
> sounds uncomfortable to my ears ;-)
> I'd say "two things differ", without trailing "s". I am not even sure if
> "bitmaps" shouldn't be singular "bitmap" instead.

"bitmaps" with your changes would be correct.

Though, just turn "bitmaps" into "bitmap" and it should be fine.  I can't
really think of a phrase specific enough for the error message without
adding enough text to make it two lines, which wouldn't be good.

"Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmap differs"

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-06  9:09     ` Willy Tarreau
  2003-08-06  9:36       ` Stephan von Krawczynski
@ 2003-08-18 14:23       ` Andrea Arcangeli
  1 sibling, 0 replies; 56+ messages in thread
From: Andrea Arcangeli @ 2003-08-18 14:23 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Stephan von Krawczynski, Marcelo Tosatti, linux-kernel, green

On Wed, Aug 06, 2003 at 11:09:20AM +0200, Willy Tarreau wrote:
> On Wed, Aug 06, 2003 at 09:41:50AM +0200, Stephan von Krawczynski wrote:
>  
> > Code;  c0144b14 <__remove_from_queues+14/30>
> > 00000000 <_EIP>:
> > Code;  c0144b14 <__remove_from_queues+14/30>   <=====
> >    0:   89 02                     mov    %eax,(%edx)   <=====
> > Code;  c0144b16 <__remove_from_queues+16/30>
> >    2:   c7 41 30 00 00 00 00      movl   $0x0,0x30(%ecx)
> > Code;  c0144b1d <__remove_from_queues+1d/30>
> >    9:   89 4c 24 04               mov    %ecx,0x4(%esp,1)
> > Code;  c0144b21 <__remove_from_queues+21/30>
> >    d:   e9 7a ff ff ff            jmp    ffffff8c <_EIP+0xffffff8c>
> > Code;  c0144b26 <__remove_from_queues+26/30>
> >   12:   8d 76 00                  lea    0x0(%esi),%esi
> 
> once again, it's *pprev=next which is causing trouble, with pprev=6 this
> time (fs/buffer.c:523). There really seems to be something playing badly with
> this...
> 
> I find it amazing that such widely used portions of code only trigger panics on
> your system! Either it's a rare combination of several components/drivers, or
> a strange hardware problem, although I can't imagine which (cpu? bus locking?).

normally it's bad RAM (or anyway a problem with the memory) when bugs
trigger in that place reproducibly. The list walking trashes the L2 and
that puts more stress on the RAM. If it were random memory corruption
(software), it would more likely crash in different places (though it's
not guaranteed ;).

Andrea
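
To make the "*pprev=next" failure mode concrete: in a pprev-style hash chain,
each entry stores the address of the pointer that points at it, and unlinking
writes through that address. If the stored value is garbage (6 in the oops
quoted above), the write lands on that bogus address. The toy model below is a
sketch only: the dictionary-backed memory, the field offsets and all names are
hypothetical and are not taken from fs/buffer.c.

```python
class Fault(Exception):
    """Raised when the toy memory is accessed at an unmapped address."""

class ToyMemory:
    def __init__(self):
        self.words = {}                      # address -> stored word

    def read(self, addr):
        if addr not in self.words:
            raise Fault(f"bad read at {addr:#x}")
        return self.words[addr]

    def write(self, addr, value):
        if addr not in self.words:
            raise Fault(f"bad write at {addr:#x}")
        self.words[addr] = value

B_NEXT, B_PPREV = 0, 4                       # hypothetical field offsets

def hash_unlink(mem, bh):
    """Remove entry bh from its chain using the pprev back-pointer."""
    pprev = mem.read(bh + B_PPREV)
    if pprev:
        nxt = mem.read(bh + B_NEXT)
        if nxt:
            mem.write(nxt + B_PPREV, pprev)
        mem.write(pprev, nxt)                # *pprev = next
        mem.write(bh + B_PPREV, 0)

mem = ToyMemory()
bucket, bh = 0x100, 0x200
mem.words.update({bucket: bh, bh + B_NEXT: 0, bh + B_PPREV: bucket})

hash_unlink(mem, bh)                         # healthy entry: bucket becomes empty
print(mem.words[bucket])                     # 0

mem.words[bh + B_PPREV] = 6                  # simulate a corrupted back-pointer
try:
    hash_unlink(mem, bh)
except Fault as err:
    print(err)                               # bad write at 0x6
```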

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-07 12:45         ` Marcelo Tosatti
       [not found]           ` <3F325198.2010301@namesys.com>
@ 2003-08-07 15:52           ` Stephan von Krawczynski
  1 sibling, 0 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-07 15:52 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: andrea, linux-kernel, green

On Thu, 7 Aug 2003 09:45:36 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> The decoded oops should be sufficient. 

Well, how about this one:


ksymoops 2.4.8 on i686 2.4.22-rc1.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.22-rc1/ (default)
     -m /boot/System.map-2.4.22-rc1 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Unable to handle kernel paging request at virtual address 63eabdb3
c0145f31 
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c0145f31>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206
eax: 00000000   ebx: 00000000   ecx: 00000061   edx: 63eabd93
esi: 00000000   edi: 00001000   ebp: 00000000   esp: c34f7e60
ds: 0018   es: 0018   ss: 0018
Process kupdated (pid: 7, stackpage=c34f7000)
Stack: 00000000 f7afb1f0 c0146018 00000000 c01312e9 00000000 c1849dd0 00001000
       00001000 00000803 c014823a c1849dd0 00001000 00000000 f79b7fa4 00001e18
       c0148428 f79b7fa4 00001e18 00001000 e9640000 00000000 00000803 00001000
Call Trace:    [<c0146018>] [<c01312e9>] [<c014823a>] [<c0148428>] [<c0145b36>]
  [<c0197328>] [<c019ceb9>] [<c019c4f5>] [<c0188e94>] [<c01498cb>] [<c014887c>]
  [<c0148be9>] [<c0105000>] [<c010592e>] [<c0148af0>]
Code: 8b 42 20 a3 30 c6 37 c0 8d 41 ff a3 34 c6 37 c0 c6 05 c0 bb


>>EIP; c0145f31 <get_unused_buffer_head+21/b0>   <=====

>>esp; c34f7e60 <_end+314cc40/3852ee40>

Trace; c0146018 <create_buffers+28/100>
Trace; c01312e9 <find_or_create_page+109/110>
Trace; c014823a <grow_dev_page+7a/c0>
Trace; c0148428 <grow_buffers+98/110>
Trace; c0145b36 <getblk+46/80>
Trace; c0197328 <journal_getblk+28/30>
Trace; c019ceb9 <do_journal_end+139/bb0>
Trace; c019c4f5 <flush_old_commits+135/1d0>
Trace; c0188e94 <reiserfs_write_super+64/90>
Trace; c01498cb <sync_supers+14b/170>
Trace; c014887c <sync_old_buffers+3c/b0>
Trace; c0148be9 <kupdate+f9/130>
Trace; c0105000 <_stext+0/0>
Trace; c010592e <arch_kernel_thread+2e/40>
Trace; c0148af0 <kupdate+0/130>

Code;  c0145f31 <get_unused_buffer_head+21/b0>
00000000 <_EIP>:
Code;  c0145f31 <get_unused_buffer_head+21/b0>   <=====
   0:   8b 42 20                  mov    0x20(%edx),%eax   <=====
Code;  c0145f34 <get_unused_buffer_head+24/b0>
   3:   a3 30 c6 37 c0            mov    %eax,0xc037c630
Code;  c0145f39 <get_unused_buffer_head+29/b0>
   8:   8d 41 ff                  lea    0xffffffff(%ecx),%eax
Code;  c0145f3c <get_unused_buffer_head+2c/b0>
   b:   a3 34 c6 37 c0            mov    %eax,0xc037c634
Code;  c0145f41 <get_unused_buffer_head+31/b0>
  10:   c6 05 c0 bb 00 00 00      movb   $0x0,0xbbc0


1 warning issued.  Results may not be reliable.


After that I received this one:


ksymoops 2.4.8 on i686 2.4.22-rc1.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.22-rc1/ (default)
     -m /boot/System.map-2.4.22-rc1 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

 NMI Watchdog detected LOCKUP on CPU1, eip c011a747, registers:
CPU:    1
EIP:    0010:[<c011a747>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000082
eax: cef0b8dc   ebx: cef0b894   ecx: 00000001   edx: 00000003  
esi: 00000008   edi: cef0b8dc   ebp: ec8efe48   esp: ec8efe28
ds: 0018   es: 0018   ss: 0018
Process tar (pid: 13603, stackpage=ec8ef000)
Stack: 00000000 cef0b894 00000000 00000282 00000003 cef0b894 00000008 cef0b8dc
       00000000 c01c4f41 00000000 cef0b894 00000000 0001679d cef0b894 00001000 
       c0146c87 00000000 cef0b894 cef0b894 00000004 cef0b894 ec8ee000 00000001
Call Trace:    [<c01c4f41>] [<c0146c87>] [<c013ae92>] [<c0119630>] [<c0130d7e>]
  [<c017ff50>] [<c013146f>] [<c0131751>] [<c0131d50>] [<c0131ffc>] [<c0131d50>]
  [<c014328b>] [<c010782f>]
Code: 7e f9 e9 d9 ec ff ff 80 38 00 f3 90 7e f9 e9 5d ed ff ff 80 


>>EIP; c011a747 <.text.lock.sched+3f/178>   <=====

>>eax; cef0b8dc <_end+eb606bc/3852ee40>
>>ebx; cef0b894 <_end+eb60674/3852ee40>
>>edi; cef0b8dc <_end+eb606bc/3852ee40>
>>ebp; ec8efe48 <_end+2c544c28/3852ee40>
>>esp; ec8efe28 <_end+2c544c08/3852ee40>

Trace; c01c4f41 <submit_bh+a1/c0>
Trace; c0146c87 <block_read_full_page+2d7/2f0>
Trace; c013ae92 <__alloc_pages+42/190>
Trace; c0119630 <wait_for_completion+70/b0>
Trace; c0130d7e <page_cache_read+be/e0>
Trace; c017ff50 <reiserfs_get_block+0/1490>
Trace; c013146f <generic_file_readahead+af/1a0>
Trace; c0131751 <do_generic_file_read+1c1/470>
Trace; c0131d50 <file_read_actor+0/110>
Trace; c0131ffc <generic_file_read+19c/1b0>
Trace; c0131d50 <file_read_actor+0/110>
Trace; c014328b <sys_read+9b/180>
Trace; c010782f <system_call+33/38>

Code;  c011a747 <.text.lock.sched+3f/178>
00000000 <_EIP>:
Code;  c011a747 <.text.lock.sched+3f/178>   <=====
   0:   7e f9                     jle    fffffffb <_EIP+0xfffffffb>   <=====
Code;  c011a749 <.text.lock.sched+41/178>
   2:   e9 d9 ec ff ff            jmp    ffffece0 <_EIP+0xffffece0>
Code;  c011a74e <.text.lock.sched+46/178>
   7:   80 38 00                  cmpb   $0x0,(%eax)
Code;  c011a751 <.text.lock.sched+49/178>
   a:   f3 90                     repz nop 
Code;  c011a753 <.text.lock.sched+4b/178>
   c:   7e f9                     jle    7 <_EIP+0x7>
Code;  c011a755 <.text.lock.sched+4d/178>
   e:   e9 5d ed ff ff            jmp    ffffed70 <_EIP+0xffffed70>
Code;  c011a75a <.text.lock.sched+52/178>
  13:   80 00 00                  addb   $0x0,(%eax)


1 warning issued.  Results may not be reliable.


There were no I/O errors or any other spectacular things happening. It just
died while I was sitting right next to it during the verify run of tar.

Regards,
Stephan
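
As a cross-check when reading the first oops above: the faulting instruction
is "mov 0x20(%edx),%eax" with edx = 63eabd93, and the reported fault address
should be the register value plus the displacement. The "Oops:" number is the
i386 page-fault error code, where bit 0 distinguishes a protection fault from
a not-present page, bit 1 a write from a read, and bit 2 user from kernel
mode. A small sketch with illustrative helper names:

```python
def effective_address(base, displacement):
    """Address of disp(%reg) on 32-bit x86 (wraps at 2**32)."""
    return (base + displacement) & 0xFFFFFFFF

def decode_oops_code(code):
    """Split the i386 page-fault error code printed as 'Oops: NNNN'."""
    return {
        "cause": "protection fault" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }

# First oops: mov 0x20(%edx),%eax with edx = 63eabd93 and 'Oops: 0000'.
print(hex(effective_address(0x63EABD93, 0x20)))  # 0x63eabdb3
print(decode_oops_code(0x0000))
```

The computed address matches the "virtual address 63eabdb3" line, and code
0000 decodes to a kernel-mode read of a page that is not present, which is
what a load through a garbage pointer in edx would produce.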

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
       [not found]           ` <3F325198.2010301@namesys.com>
@ 2003-08-07 13:32             ` Stephan von Krawczynski
  2003-08-18 20:29               ` Mike Fedyk
  0 siblings, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-07 13:32 UTC (permalink / raw)
  To: Hans Reiser; +Cc: linux-kernel

On Thu, 07 Aug 2003 17:18:16 +0400
Hans Reiser <reiser@namesys.com> wrote:

> >On Thu, 7 Aug 2003, Stephan von Krawczynski wrote:
> >>for this one. Hint: spelling in reiserfsck should be checked ;-)
> >
> where?

Hello Hans,

I am not a native English speaker, but 
"Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmaps differs"
sounds uncomfortable to my ears ;-)
I'd say "two things differ", without trailing "s". I am not even sure if
"bitmaps" shouldn't be singular "bitmap" instead.

But, as stated, I am not a native speaker, so I can't be sure.

Regards,
Stephan

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-07  2:14       ` Stephan von Krawczynski
  2003-08-07  5:35         ` Oleg Drokin
@ 2003-08-07 12:45         ` Marcelo Tosatti
       [not found]           ` <3F325198.2010301@namesys.com>
  2003-08-07 15:52           ` Stephan von Krawczynski
  1 sibling, 2 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2003-08-07 12:45 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: andrea, linux-kernel, green



On Thu, 7 Aug 2003, Stephan von Krawczynski wrote:

> On Wed, 6 Aug 2003 15:15:39 -0300 (BRT)
> Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
> 
> > Stephan,
> > 
> > I'm pretty worried about this problem.
> > 
> > Your oopses seem to be the result of some kind of memory corruption. On
> > the other oopses we could see the kernel oopsing on
> > remove_page_from_hash_queue due to corrupted pointers (as Willy pointed 
> > out). 
> > 
> > Can you please try to crash your box again with 
> > 
> > CONFIG_DEBUG_SLAB=y 
> > 
> > Again, thanks a lot for your reports.
> 
> Ok, I have two things. 
> First, another oops. I upgraded the system to rc1 yesterday and it did not
> survive a single day. Here's the decoded oops, the box was "clean" meaning no
> weird modules or the like:
> 
> 
> ksymoops 2.4.8 on i686 2.4.22-rc1.  Options used
>      -V (default)
>      -k /proc/ksyms (default)
>      -l /proc/modules (default)
>      -o /lib/modules/2.4.22-rc1/ (default)
>      -m /boot/System.map-2.4.22-rc1 (default)
> 
> Warning: You did not tell me where to find symbol information.  I will
> assume that the log matches the kernel and modules that are running
> right now and I'll use the default options above for symbol resolution.
> If the current kernel and/or modules do not match the log, you can get
> more accurate output by telling me the kernel version and where to find
> map, modules, ksyms etc.  ksymoops -h explains the options.
> 
> Unable to handle kernel NULL pointer dereference at virtual address 00000004
> c0145060
> *pde = 00000000
> Oops: 0002
> CPU:    1
> EIP:    0010:[<c0145060>]    Not tainted   
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010283
> eax: 00000000   ebx: c822feb4   ecx: c822fe60   edx: e07e7780
> esi: 00000000   edi: e07e7780   ebp: f59bfe3c   esp: f59bfe2c
> ds: 0018   es: 0018   ss: 0018
> Process nfsd (pid: 1737, stackpage=f59bf000)
> Stack: f0cce7a0 00000001 f59bfe38 c822fe60 f0cce7f4 eec54ef4 00000000 e07e7760
>        f59be000 f59bfea8 c0183ef5 e07e7780 e07e77cc c02ed880 e07e7760 f8c84fc8
>        f59bfea8 dfe6c960 00000000 e07e7760 dfe6c960 00000000 f59c6e04 f59bfea8
> Call Trace:    [<c0183ef5>] [<f8c84fc8>] [<f8c856f1>] [<f8c8cee4>] [<f8c8e295>]
>   [<f8c923f4>] [<f8c80699>] [<f8c65938>] [<f8c923f4>] [<f8c91a38>] [<f8c91a58>]
>   [<f8c80411>] [<c010592e>] [<f8c80210>]
> Code: 89 50 04 c7 41 54 00 00 00 00 c7 43 04 00 00 00 00 8b 44 24
> 
> 
> >>EIP; c0145060 <fsync_buffers_list+50/1b0>   <=====
> 
> >>ebx; c822feb4 <_end+7e84c94/3852ee40>
> >>ecx; c822fe60 <_end+7e84c40/3852ee40>
> >>edx; e07e7780 <_end+2043c560/3852ee40>
> >>edi; e07e7780 <_end+2043c560/3852ee40>
> >>ebp; f59bfe3c <_end+35614c1c/3852ee40>
> >>esp; f59bfe2c <_end+35614c0c/3852ee40>
> 
> Trace; c0183ef5 <reiserfs_sync_file+65/d0>
> Trace; f8c84fc8 <[nfsd]nfsd_sync+78/d0>
> Trace; f8c856f1 <[nfsd]nfsd_commit+a1/b0>
> Trace; f8c8cee4 <[nfsd]nfsd3_proc_commit+94/130>
> Trace; f8c8e295 <[nfsd]nfs3svc_decode_commitargs+35/e0>
> Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
> Trace; f8c80699 <[nfsd]nfsd_dispatch+119/21d>
> Trace; f8c65938 <[sunrpc]svc_process+4d8/570>
> Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
> Trace; f8c91a38 <[nfsd]nfsd_version3+0/10>
> Trace; f8c91a58 <[nfsd]nfsd_program+0/28>
> Trace; f8c80411 <[nfsd]nfsd+201/370>
> Trace; c010592e <arch_kernel_thread+2e/40>
> Trace; f8c80210 <[nfsd]nfsd+0/370>
> 
> Code;  c0145060 <fsync_buffers_list+50/1b0>
> 00000000 <_EIP>:
> Code;  c0145060 <fsync_buffers_list+50/1b0>   <=====
>    0:   89 50 04                  mov    %edx,0x4(%eax)   <=====
> Code;  c0145063 <fsync_buffers_list+53/1b0>
>    3:   c7 41 54 00 00 00 00      movl   $0x0,0x54(%ecx)
> Code;  c014506a <fsync_buffers_list+5a/1b0>
>    a:   c7 43 04 00 00 00 00      movl   $0x0,0x4(%ebx)
> Code;  c0145071 <fsync_buffers_list+61/1b0>
>   11:   8b 44 24 00               mov    0x0(%esp,1),%eax
> 
> 
> 1 warning issued.  Results may not be reliable.
> 
> 
> As you can see reiserfs seems involved. Regarding reiserfs and my last postings
> I can assure you that all reiserfs partitions were checked via reiserfsck right
> before installation of rc1 - as Oleg advised - and found:
> "Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmaps differs"
> I was told to use --fix-fixable option which I did and it indeed fixed the
> problem. Trying reiserfsck after that found no errors any more. So I see no
> chance that corrupt data on the media (through former crashes) is responsible
> for this one. Hint: spelling in reiserfsck should be checked ;-)

It might be a problem in reiserfs. You're getting oopses in different
places with different stack traces, which is weird. 

I'll take a closer look at this oops now. 

> Second, I re-install the box with CONFIG_DEBUG_SLAB="y" right now. Please tell
> me if I should perform special steps (SYSRQ or the like) after the next crash
> happens, or if the decoded oops will be sufficient.

The decoded oops should be sufficient. 
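
For readers unfamiliar with the decoded form: each raw address in the trace is
turned into <symbol+offset/size> by finding the nearest symbol at or below it.
The sketch below shows that lookup in rough outline and is not ksymoops's
actual implementation. The three-entry table is reconstructed from the
addresses and offsets in the quoted trace (for example, c0145060 minus 0x50
gives the start of fsync_buffers_list); a real tool would read System.map and
/proc/ksyms instead.

```python
import bisect

# (start address, size, name), sorted by start; values derived from the trace.
SYMBOLS = [
    (0xC0105900, 0x40, "arch_kernel_thread"),
    (0xC0145010, 0x1B0, "fsync_buffers_list"),
    (0xC0183E90, 0xD0, "reiserfs_sync_file"),
]
STARTS = [start for start, _, _ in SYMBOLS]

def resolve(addr):
    """Return 'name+offset/size' for addr, or None if no symbol covers it."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return None
    start, size, name = SYMBOLS[i]
    offset = addr - start
    if offset >= size:
        return None
    return f"{name}+{offset:x}/{size:x}"

print(resolve(0xC0145060))  # fsync_buffers_list+50/1b0
print(resolve(0xC0183EF5))  # reiserfs_sync_file+65/d0
```

This is also why the ksymoops warning about symbol information matters: if
the table does not match the kernel that produced the oops, the same lookup
returns plausible but wrong names.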


* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-07  2:14       ` Stephan von Krawczynski
@ 2003-08-07  5:35         ` Oleg Drokin
  2003-08-07 12:45         ` Marcelo Tosatti
  1 sibling, 0 replies; 56+ messages in thread
From: Oleg Drokin @ 2003-08-07  5:35 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Marcelo Tosatti, andrea, linux-kernel

Hello!

On Thu, Aug 07, 2003 at 04:14:40AM +0200, Stephan von Krawczynski wrote:

> Unable to handle kernel NULL pointer dereference at virtual address 00000004

Hm, a NULL pointer in the j_dirty_buffers list. This basically cannot happen.
This is a circularly linked list of buffers, and we add entries to it via the
standard list functions, so the linkage is maintained automatically.

> Trace; c0183ef5 <reiserfs_sync_file+65/d0>
> Trace; f8c84fc8 <[nfsd]nfsd_sync+78/d0>
> Code;  c0145060 <fsync_buffers_list+50/1b0>
> 00000000 <_EIP>:
> Code;  c0145060 <fsync_buffers_list+50/1b0>   <=====
>    0:   89 50 04                  mov    %edx,0x4(%eax)   <=====

> As you can see reiserfs seems involved. Regarding reiserfs and my last postings
> I can assure you that all reiserfs partitions were checked via reiserfsck right
> before installation of rc1 - as Oleg advised - and found:
> "Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmaps differs"

That might explain your prior "freeing already free block" messages.

> I was told to use --fix-fixable option which I did and it indeed fixed the
> problem. Trying reiserfsck after that found no errors any more. So I see no
> chance that corrupt data on the media (through former crashes) is responsible
> for this one. Hint: spelling in reiserfsck should be checked ;-)

Yes, but how the condition that triggered the oops came about is totally unclear to me.

Bye,
    Oleg
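
To illustrate why a NULL link should not occur in a correctly maintained list:
in a circular doubly linked list every node's next and prev always point at
some node (an empty head points at itself), so unlinking never dereferences
NULL. With 32-bit pointers and next stored first, prev sits at offset 4, which
is consistent with the faulting write to address 00000004 when next is NULL.
The Python model below assumes the usual add/delete semantics of such lists;
the class and function names are illustrative and this is not kernel code.

```python
class Node:
    """List node; a fresh node is a one-element ring pointing at itself."""
    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self

def list_add(new, head):
    """Insert new right after head."""
    new.next, new.prev = head.next, head
    head.next.prev = new
    head.next = new

def list_del(entry):
    """Unlink entry; the first write is the one that faults if next is NULL."""
    entry.next.prev = entry.prev
    entry.prev.next = entry.next
    entry.next = entry.prev = entry

def ring_is_consistent(head):
    """Walk once around and check that every link is mirrored."""
    node = head
    while True:
        if node.next is None or node.next.prev is not node:
            return False
        node = node.next
        if node is head:
            return True

head, a, b = Node("head"), Node("a"), Node("b")
list_add(a, head)
list_add(b, head)
list_del(a)
print(ring_is_consistent(head))  # True

b.next = None                    # simulate a link overwritten from outside
print(ring_is_consistent(head))  # False
```

Since the standard operations preserve the ring, a NULL link points to
something outside the list code overwriting the node, whether software or
hardware.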

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-06 18:15     ` Marcelo Tosatti
@ 2003-08-07  2:14       ` Stephan von Krawczynski
  2003-08-07  5:35         ` Oleg Drokin
  2003-08-07 12:45         ` Marcelo Tosatti
  0 siblings, 2 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-07  2:14 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: andrea, linux-kernel, green

On Wed, 6 Aug 2003 15:15:39 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> Stephan,
> 
> I'm pretty worried about this problem.
> 
> Your oopses seem to be the result of some kind of memory corruption. On
> the other oopses we could see the kernel oopsing on
> remove_page_from_hash_queue due to corrupted pointers (as Willy pointed 
> out). 
> 
> Can you please try to crash your box again with 
> 
> CONFIG_DEBUG_SLAB=y 
> 
> Again, thanks a lot for your reports.

Ok, I have two things. 
First, another oops. I upgraded the system to rc1 yesterday and it did not
survive a single day. Here's the decoded oops, the box was "clean" meaning no
weird modules or the like:


ksymoops 2.4.8 on i686 2.4.22-rc1.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.22-rc1/ (default)
     -m /boot/System.map-2.4.22-rc1 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Unable to handle kernel NULL pointer dereference at virtual address 00000004
c0145060
*pde = 00000000
Oops: 0002
CPU:    1
EIP:    0010:[<c0145060>]    Not tainted   
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010283
eax: 00000000   ebx: c822feb4   ecx: c822fe60   edx: e07e7780
esi: 00000000   edi: e07e7780   ebp: f59bfe3c   esp: f59bfe2c
ds: 0018   es: 0018   ss: 0018
Process nfsd (pid: 1737, stackpage=f59bf000)
Stack: f0cce7a0 00000001 f59bfe38 c822fe60 f0cce7f4 eec54ef4 00000000 e07e7760
       f59be000 f59bfea8 c0183ef5 e07e7780 e07e77cc c02ed880 e07e7760 f8c84fc8
       f59bfea8 dfe6c960 00000000 e07e7760 dfe6c960 00000000 f59c6e04 f59bfea8
Call Trace:    [<c0183ef5>] [<f8c84fc8>] [<f8c856f1>] [<f8c8cee4>] [<f8c8e295>]
  [<f8c923f4>] [<f8c80699>] [<f8c65938>] [<f8c923f4>] [<f8c91a38>] [<f8c91a58>]
  [<f8c80411>] [<c010592e>] [<f8c80210>]
Code: 89 50 04 c7 41 54 00 00 00 00 c7 43 04 00 00 00 00 8b 44 24


>>EIP; c0145060 <fsync_buffers_list+50/1b0>   <=====

>>ebx; c822feb4 <_end+7e84c94/3852ee40>
>>ecx; c822fe60 <_end+7e84c40/3852ee40>
>>edx; e07e7780 <_end+2043c560/3852ee40>
>>edi; e07e7780 <_end+2043c560/3852ee40>
>>ebp; f59bfe3c <_end+35614c1c/3852ee40>
>>esp; f59bfe2c <_end+35614c0c/3852ee40>

Trace; c0183ef5 <reiserfs_sync_file+65/d0>
Trace; f8c84fc8 <[nfsd]nfsd_sync+78/d0>
Trace; f8c856f1 <[nfsd]nfsd_commit+a1/b0>
Trace; f8c8cee4 <[nfsd]nfsd3_proc_commit+94/130>
Trace; f8c8e295 <[nfsd]nfs3svc_decode_commitargs+35/e0>
Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
Trace; f8c80699 <[nfsd]nfsd_dispatch+119/21d>
Trace; f8c65938 <[sunrpc]svc_process+4d8/570>
Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
Trace; f8c91a38 <[nfsd]nfsd_version3+0/10>
Trace; f8c91a58 <[nfsd]nfsd_program+0/28>
Trace; f8c80411 <[nfsd]nfsd+201/370>
Trace; c010592e <arch_kernel_thread+2e/40>
Trace; f8c80210 <[nfsd]nfsd+0/370>

Code;  c0145060 <fsync_buffers_list+50/1b0>
00000000 <_EIP>:
Code;  c0145060 <fsync_buffers_list+50/1b0>   <=====
   0:   89 50 04                  mov    %edx,0x4(%eax)   <=====
Code;  c0145063 <fsync_buffers_list+53/1b0>
   3:   c7 41 54 00 00 00 00      movl   $0x0,0x54(%ecx)
Code;  c014506a <fsync_buffers_list+5a/1b0>
   a:   c7 43 04 00 00 00 00      movl   $0x0,0x4(%ebx)
Code;  c0145071 <fsync_buffers_list+61/1b0>
  11:   8b 44 24 00               mov    0x0(%esp,1),%eax


1 warning issued.  Results may not be reliable.
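
To spell out where the fault address 00000004 comes from: the faulting
instruction stores %edx to 0x4(%eax), and eax is 00000000 in the register dump,
so the store lands at 0 + 4. That would be consistent with a write to a field
at offset 4 through a NULL structure pointer, though the dump alone does not
prove which field. A minimal sketch of the arithmetic in C; effective_address
is a made-up helper name, not a kernel function:

```c
#include <stdint.h>

/*
 * Effective address of a base+displacement memory operand such as
 * 0x4(%eax): the register value plus the signed displacement,
 * wrapping at 32 bits as on i386.
 */
uint32_t effective_address(uint32_t base, int32_t disp)
{
    return base + (uint32_t)disp;
}
```

With base 0x00000000 and displacement 0x4 this yields 0x00000004, the address
reported in the oops. The same arithmetic applied to the pre10 oops
(mov %eax,(%edx) with edx = 00000006) gives the reported address 00000006.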


As you can see, reiserfs seems to be involved. Regarding reiserfs and my last
postings, I can assure you that all reiserfs partitions were checked with
reiserfsck right before the installation of rc1, as Oleg advised, and it found:
"Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmaps differs"
I was told to use the --fix-fixable option, which I did, and it indeed fixed the
problem. Running reiserfsck after that found no more errors. So I see no
chance that corrupt data on the media (from former crashes) is responsible
for this one. Hint: spelling in reiserfsck should be checked ;-)

Second, I am re-installing the box with CONFIG_DEBUG_SLAB=y right now. Please
tell me whether I should perform special steps (SysRq or the like) after the
next crash happens, or whether the decoded oops will be sufficient.

Regards,
Stephan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-06  7:41   ` 2.4.22-pre lockups (now decoded oops for pre10) Stephan von Krawczynski
  2003-08-06  8:58     ` Oleg Drokin
  2003-08-06  9:09     ` Willy Tarreau
@ 2003-08-06 18:15     ` Marcelo Tosatti
  2003-08-07  2:14       ` Stephan von Krawczynski
  2 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2003-08-06 18:15 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: andrea, linux-kernel, green



On Wed, 6 Aug 2003, Stephan von Krawczynski wrote:

> Unable to handle kernel NULL pointer dereference at virtual address 00000006
> c0144b14
> *pde = 00000000
> Oops: 0002
> CPU:    1
> EIP:    0010:[<c0144b14>]    Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010246
> eax: 00000000   ebx: f0f66540   ecx: f0f66540   edx: 00000006
> esi: f0f66540   edi: f0f66540   ebp: c2ce0350   esp: c345df24
> ds: 0018   es: 0018   ss: 0018
> Process kswapd (pid: 5, stackpage=c345d000)
> Stack: c0147ddf f0f66540 00000000 c2ce0350 0001bcad c02eab68 c0139228 c2ce0350
>        000001d0 00000200 000001d0 00000016 00000020 000001d0 00000020 00000006
>        c01394b3 00000006 c345c000 c02eab68 000001d0 00000006 c02eab68 00000000 
> Call Trace:    [<c0147ddf>] [<c0139228>] [<c01394b3>] [<c013952e>] [<c013963c>]
>   [<c01396c8>] [<c01397f8>] [<c0139760>] [<c0105000>] [<c010592e>] [<c0139760>]
> Code: 89 02 c7 41 30 00 00 00 00 89 4c 24 04 e9 7a ff ff ff 8d 76 
> 
> 
> >>EIP; c0144b14 <__remove_from_queues+14/30>   <=====
> 
> >>ebx; f0f66540 <_end+30bbb320/3852ee40>
> >>ecx; f0f66540 <_end+30bbb320/3852ee40>
> >>esi; f0f66540 <_end+30bbb320/3852ee40>
> >>edi; f0f66540 <_end+30bbb320/3852ee40>
> >>ebp; c2ce0350 <_end+2935130/3852ee40>
> >>esp; c345df24 <_end+30b2d04/3852ee40>

Stephan,

I'm pretty worried about this problem.

Your oopses seem to be the result of some kind of memory corruption. On
the other oopses we could see the kernel oopsing on
remove_page_from_hash_queue due to corrupted pointers (as Willy pointed 
out). 

Can you please try to crash your box again with 

CONFIG_DEBUG_SLAB=y 

Again, thanks a lot for your reports.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-06  9:36       ` Stephan von Krawczynski
@ 2003-08-06 12:45         ` Willy Tarreau
  0 siblings, 0 replies; 56+ messages in thread
From: Willy Tarreau @ 2003-08-06 12:45 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: marcelo, andrea, linux-kernel, green, alan

> Hm, the hardware may not be that widespread. I guess not many people are really
> using SMP, 64 bit PCI network, 3 GB RAM, 3ware RAID5 and serverworks board
> altogether in one box. I can't fight the impression it has something to do with
> locking issues. It doesn't look exactly like a hardware problem, you would not
> expect crashes on the same type of code then.

Well, it depends... I once had an overclocked CPU which died in only one
case: a car simulator, where it always crashed on exactly the same race,
at the same position in the round! I even knew that if I could get past that
position, it was OK for another round! So I later used that game as a
reliability test when I was not sure about the origin of a crash :-)
It seems that a particular sequence of data and/or code could reliably trigger
it, although parallel makes never hurt it.

> The question is: what additional information is needed to find the underlying
> problem?

Perhaps cache poisoning could help. Alan has already used this technique
extensively in the past, and might still have a patch which could apply to your
kernel without too many changes. Alan?

On the other hand, you could also do it by hand, but it's a little hard. You
have to find every place where there's a free and write particular data before
the free, if possible data which identifies who freed the page.

Then after the next crash, you can identify who used the page last. It can
sometimes lead you to some driver missing a lock. But that's not certain.
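
As a rough illustration of the idea (a userspace sketch only, not a kernel
patch), the free path can fill the object with a tag naming the caller, and a
later check can tell whether the poison is still intact. Everything here is
hypothetical: the fixed pool, OBJ_SIZE, NUM_OBJS, and the pool_alloc, pool_free
and pool_check names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBJ_SIZE 64   /* payload bytes per object (arbitrary) */
#define NUM_OBJS 4    /* objects in the toy pool (arbitrary) */

struct slot {
    int in_use;
    unsigned char data[OBJ_SIZE];
};

static struct slot pool[NUM_OBJS];

/* Map a payload pointer back to its slot. */
static struct slot *slot_of(const void *p)
{
    return (struct slot *)((unsigned char *)p - offsetof(struct slot, data));
}

/* Hand out the first free object, zeroed. Returns NULL if none is free. */
void *pool_alloc(void)
{
    for (size_t i = 0; i < NUM_OBJS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            memset(pool[i].data, 0, OBJ_SIZE);
            return pool[i].data;
        }
    }
    return NULL;
}

/* Free an object, filling it with a 32-bit tag that identifies the freer. */
void pool_free(void *p, uint32_t who)
{
    struct slot *s = slot_of(p);

    for (size_t off = 0; off + sizeof(who) <= OBJ_SIZE; off += sizeof(who))
        memcpy(s->data + off, &who, sizeof(who));
    s->in_use = 0;
}

/*
 * Inspect a freed object. Returns the freer's tag if the poison is
 * intact, or 0 if some word differs, i.e. something wrote to the
 * object after it was freed.
 */
uint32_t pool_check(const void *p)
{
    const struct slot *s = slot_of(p);
    uint32_t first, cur;

    memcpy(&first, s->data, sizeof(first));
    for (size_t off = 0; off + sizeof(cur) <= OBJ_SIZE; off += sizeof(cur)) {
        memcpy(&cur, s->data + off, sizeof(cur));
        if (cur != first)
            return 0;
    }
    return first;
}
```

After a crash, an intact tag points at whoever freed the object last, while a
damaged pattern shows that someone kept writing to memory they no longer owned.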

Cheers,
Willy


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-06  9:09     ` Willy Tarreau
@ 2003-08-06  9:36       ` Stephan von Krawczynski
  2003-08-06 12:45         ` Willy Tarreau
  2003-08-18 14:23       ` Andrea Arcangeli
  1 sibling, 1 reply; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-06  9:36 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: marcelo, andrea, linux-kernel, green

On Wed, 6 Aug 2003 11:09:20 +0200
Willy Tarreau <willy@w.ods.org> wrote:

> On Wed, Aug 06, 2003 at 09:41:50AM +0200, Stephan von Krawczynski wrote:
>  
> > Code;  c0144b14 <__remove_from_queues+14/30>
> > 00000000 <_EIP>:
> > Code;  c0144b14 <__remove_from_queues+14/30>   <=====
> >    0:   89 02                     mov    %eax,(%edx)   <=====
> > Code;  c0144b16 <__remove_from_queues+16/30>
> >    2:   c7 41 30 00 00 00 00      movl   $0x0,0x30(%ecx)
> > Code;  c0144b1d <__remove_from_queues+1d/30>
> >    9:   89 4c 24 04               mov    %ecx,0x4(%esp,1)
> > Code;  c0144b21 <__remove_from_queues+21/30>
> >    d:   e9 7a ff ff ff            jmp    ffffff8c <_EIP+0xffffff8c>
> > Code;  c0144b26 <__remove_from_queues+26/30>
> >   12:   8d 76 00                  lea    0x0(%esi),%esi
> 
> once again, it's *pprev=next which is causing trouble, with pprev=6 this
> time (fs/buffer.c:523). There really seems to be something playing badly with
> this...
> 
> I find it amazing that such widely used portions of code only trigger panics on
> your system! Either it's a rare combination of several components/drivers,
> or a strange hardware problem, although I can't imagine which (cpu? bus
> locking?).

Hm, the hardware may not be that widespread. I guess not many people are really
using SMP, 64 bit PCI network, 3 GB RAM, 3ware RAID5 and serverworks board
altogether in one box. I can't fight the impression it has something to do with
locking issues. It doesn't look exactly like a hardware problem, you would not
expect crashes on the same type of code then.
The question is: what additional information is needed to find the underlying
problem?

Regards,
Stephan

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-06  7:41   ` 2.4.22-pre lockups (now decoded oops for pre10) Stephan von Krawczynski
  2003-08-06  8:58     ` Oleg Drokin
@ 2003-08-06  9:09     ` Willy Tarreau
  2003-08-06  9:36       ` Stephan von Krawczynski
  2003-08-18 14:23       ` Andrea Arcangeli
  2003-08-06 18:15     ` Marcelo Tosatti
  2 siblings, 2 replies; 56+ messages in thread
From: Willy Tarreau @ 2003-08-06  9:09 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Marcelo Tosatti, andrea, linux-kernel, green

On Wed, Aug 06, 2003 at 09:41:50AM +0200, Stephan von Krawczynski wrote:
 
> Code;  c0144b14 <__remove_from_queues+14/30>
> 00000000 <_EIP>:
> Code;  c0144b14 <__remove_from_queues+14/30>   <=====
>    0:   89 02                     mov    %eax,(%edx)   <=====
> Code;  c0144b16 <__remove_from_queues+16/30>
>    2:   c7 41 30 00 00 00 00      movl   $0x0,0x30(%ecx)
> Code;  c0144b1d <__remove_from_queues+1d/30>
>    9:   89 4c 24 04               mov    %ecx,0x4(%esp,1)
> Code;  c0144b21 <__remove_from_queues+21/30>
>    d:   e9 7a ff ff ff            jmp    ffffff8c <_EIP+0xffffff8c>
> Code;  c0144b26 <__remove_from_queues+26/30>
>   12:   8d 76 00                  lea    0x0(%esi),%esi

once again, it's *pprev=next which is causing trouble, with pprev=6 this
time (fs/buffer.c:523). There really seems to be something playing badly with
this...
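
For anyone following along, the kind of unlink involved can be sketched in
userspace as below. This is a simplified illustration, not the code from
fs/buffer.c; struct node, chain_add and chain_unlink are made-up names. Each
entry stores the address of the pointer that points at it (pprev), so removal
is a store through that address, and a garbage pprev such as 6 makes that store
fault.

```c
#include <stddef.h>

/* Hypothetical stand-in for an entry on a singly linked hash chain. */
struct node {
    struct node *next;    /* next entry in the chain, or NULL */
    struct node **pprev;  /* address of the pointer that points at us */
};

/* Insert n at the head of the chain rooted at *head. */
void chain_add(struct node **head, struct node *n)
{
    n->next = *head;
    n->pprev = head;
    if (*head)
        (*head)->pprev = &n->next;
    *head = n;
}

/* Unlink n from its chain. Returns 0 on success, -1 if n was not linked. */
int chain_unlink(struct node *n)
{
    struct node **pprev = n->pprev;
    struct node *next;

    if (!pprev)
        return -1;
    next = n->next;
    if (next)
        next->pprev = pprev;
    *pprev = next;        /* the store that blows up when pprev is corrupt */
    n->pprev = NULL;
    n->next = NULL;
    return 0;
}
```

Nothing in the unlink can validate pprev itself, so whatever overwrote it has
already happened by the time the oops is seen.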

I find it amazing that such widely used portions of code only trigger panics on
your system! Either it's a rare combination of several components/drivers, or
a strange hardware problem, although I can't imagine which (cpu? bus locking?).

Cheers,
Willy


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-06  7:41   ` 2.4.22-pre lockups (now decoded oops for pre10) Stephan von Krawczynski
@ 2003-08-06  8:58     ` Oleg Drokin
  2003-08-06  9:09     ` Willy Tarreau
  2003-08-06 18:15     ` Marcelo Tosatti
  2 siblings, 0 replies; 56+ messages in thread
From: Oleg Drokin @ 2003-08-06  8:58 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Marcelo Tosatti, andrea, linux-kernel

Hello!

On Wed, Aug 06, 2003 at 09:41:50AM +0200, Stephan von Krawczynski wrote:

> > Is this _STOCK_ 2.4.22-pre10 (no vmware, no other modules) ? 
> Hello Marcelo,
> today I have a fresh -pre10 oops for you.
> Everything seems to start with (there is no i/o error or the like, is it
> possible that the fs got damaged during former crashes?):

Well, you'd better run reiserfsck after crashes with binary modules just to make sure everything is ok.

> sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478481)[dev:blocknr]:
> bit already cleared
> sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478445)[dev:blocknr]:
> bit already cleared
> sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478441)[dev:blocknr]:
> bit already cleared
> sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478348)[dev:blocknr]:
> bit already cleared

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.4.22-pre lockups (now decoded oops for pre10)
  2003-08-05 16:40 ` Marcelo Tosatti
@ 2003-08-06  7:41   ` Stephan von Krawczynski
  2003-08-06  8:58     ` Oleg Drokin
                       ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Stephan von Krawczynski @ 2003-08-06  7:41 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: andrea, linux-kernel, green

On Tue, 5 Aug 2003 13:40:48 -0300 (BRT)
Marcelo Tosatti <marcelo@conectiva.com.br> wrote:

> 
> Stephan,
> 
> Is this _STOCK_ 2.4.22-pre10 (no vmware, no other modules) ? 

Hello Marcelo,

today I have a fresh -pre10 oops for you.

Everything seems to start with (there is no i/o error or the like, is it
possible that the fs got damaged during former crashes?):

sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478481)[dev:blocknr]:
bit already cleared
sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478445)[dev:blocknr]:
bit already cleared
sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478441)[dev:blocknr]:
bit already cleared
sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478348)[dev:blocknr]:
bit already cleared

And then:

ksymoops 2.4.8 on i686 2.4.22-pre10.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.22-pre10/ (default)
     -m /boot/System.map-2.4.22-pre10 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Unable to handle kernel NULL pointer dereference at virtual address 00000006
c0144b14
*pde = 00000000
Oops: 0002
CPU:    1
EIP:    0010:[<c0144b14>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000   ebx: f0f66540   ecx: f0f66540   edx: 00000006
esi: f0f66540   edi: f0f66540   ebp: c2ce0350   esp: c345df24
ds: 0018   es: 0018   ss: 0018
Process kswapd (pid: 5, stackpage=c345d000)
Stack: c0147ddf f0f66540 00000000 c2ce0350 0001bcad c02eab68 c0139228 c2ce0350
       000001d0 00000200 000001d0 00000016 00000020 000001d0 00000020 00000006
       c01394b3 00000006 c345c000 c02eab68 000001d0 00000006 c02eab68 00000000 
Call Trace:    [<c0147ddf>] [<c0139228>] [<c01394b3>] [<c013952e>] [<c013963c>]
  [<c01396c8>] [<c01397f8>] [<c0139760>] [<c0105000>] [<c010592e>] [<c0139760>]
Code: 89 02 c7 41 30 00 00 00 00 89 4c 24 04 e9 7a ff ff ff 8d 76 


>>EIP; c0144b14 <__remove_from_queues+14/30>   <=====

>>ebx; f0f66540 <_end+30bbb320/3852ee40>
>>ecx; f0f66540 <_end+30bbb320/3852ee40>
>>esi; f0f66540 <_end+30bbb320/3852ee40>
>>edi; f0f66540 <_end+30bbb320/3852ee40>
>>ebp; c2ce0350 <_end+2935130/3852ee40>
>>esp; c345df24 <_end+30b2d04/3852ee40>

Trace; c0147ddf <try_to_free_buffers+7f/170>
Trace; c0139228 <shrink_cache+298/3b0>
Trace; c01394b3 <shrink_caches+63/a0>
Trace; c013952e <try_to_free_pages_zone+3e/60>
Trace; c013963c <kswapd_balance_pgdat+4c/b0>
Trace; c01396c8 <kswapd_balance+28/40>
Trace; c01397f8 <kswapd+98/c0>
Trace; c0139760 <kswapd+0/c0>
Trace; c0105000 <_stext+0/0>
Trace; c010592e <arch_kernel_thread+2e/40>
Trace; c0139760 <kswapd+0/c0>

Code;  c0144b14 <__remove_from_queues+14/30>
00000000 <_EIP>:
Code;  c0144b14 <__remove_from_queues+14/30>   <=====
   0:   89 02                     mov    %eax,(%edx)   <=====
Code;  c0144b16 <__remove_from_queues+16/30>
   2:   c7 41 30 00 00 00 00      movl   $0x0,0x30(%ecx)
Code;  c0144b1d <__remove_from_queues+1d/30>
   9:   89 4c 24 04               mov    %ecx,0x4(%esp,1)
Code;  c0144b21 <__remove_from_queues+21/30>
   d:   e9 7a ff ff ff            jmp    ffffff8c <_EIP+0xffffff8c>
Code;  c0144b26 <__remove_from_queues+26/30>
  12:   8d 76 00                  lea    0x0(%esi),%esi


1 warning issued.  Results may not be reliable.
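
For reference, the "Oops: 0002" line is the i386 page-fault error code. Bit 0
says whether the fault was a protection violation (1) or a not-present page
(0), bit 1 whether it was a write (1) or a read (0), and bit 2 whether the CPU
was in user mode (1) or kernel mode (0). A small sketch of decoding it; struct
fault_info and decode_fault_code are illustrative names only:

```c
#include <stdint.h>

/* Decoded view of the low three bits of an i386 page-fault error code. */
struct fault_info {
    int protection;  /* 1 = protection violation, 0 = page not present */
    int write;       /* 1 = write access, 0 = read access */
    int user;        /* 1 = fault in user mode, 0 = in kernel mode */
};

struct fault_info decode_fault_code(uint32_t code)
{
    struct fault_info f;

    f.protection = (code >> 0) & 1;
    f.write      = (code >> 1) & 1;
    f.user       = (code >> 2) & 1;
    return f;
}
```

For 0x0002 this gives a kernel-mode write to a not-present page, which fits the
store to address 00000006 shown above.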

Regards,
Stephan



^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2003-09-05 13:37 UTC | newest]

Thread overview: 56+ messages
     [not found] <20030808002918.723abb08.skraw@ithnet.com>
2003-08-08 14:54 ` 2.4.22-pre lockups (now decoded oops for pre10) Marcelo Tosatti
2003-08-08 15:05   ` Stephan von Krawczynski
2003-08-08 15:33     ` Marcelo Tosatti
2003-08-10 21:35       ` Stephan von Krawczynski
2003-08-10 23:23         ` Neil Brown
2003-08-11  9:33           ` Stephan von Krawczynski
2003-08-18 20:43             ` Mike Fedyk
2003-08-13 10:55       ` Stephan von Krawczynski
2003-08-13 14:53         ` Marcelo Tosatti
2003-08-13 14:59           ` Oleg Drokin
2003-08-13 15:12             ` Stephan von Krawczynski
2003-08-13 15:30               ` Oleg Drokin
2003-08-13 16:04                 ` Stephan von Krawczynski
2003-08-13 16:34                   ` Oleg Drokin
2003-08-13 22:19                     ` Stephan von Krawczynski
2003-08-14  8:45                       ` Oleg Drokin
2003-08-14 17:26                         ` Marcelo Tosatti
2003-08-14 17:42                           ` Stephan von Krawczynski
2003-08-15  2:08                             ` Chris Mason
2003-08-15  9:40                               ` Stephan von Krawczynski
2003-08-15 10:28                               ` Stephan von Krawczynski
2003-08-15 12:55                                 ` Chris Mason
2003-08-20 14:21                                   ` 2.4.22-pre lockups (yet another oops for rc2) Stephan von Krawczynski
2003-09-05  9:24                                   ` 2.4.22-pre lockups (case closed) Stephan von Krawczynski
2003-09-05 13:37                                     ` Andrea Arcangeli
2003-08-15 10:13                         ` 2.4.22-pre lockups (now decoded oops for pre10) Stephan von Krawczynski
2003-08-15 10:31                           ` Oleg Drokin
2003-08-18 15:06                   ` Andrea Arcangeli
2003-08-18 20:19                     ` Stephan von Krawczynski
2003-08-18 20:58                       ` Mike Fedyk
2003-08-18 22:31                       ` Andrea Arcangeli
2003-08-19  1:12                         ` Mike Fedyk
2003-08-19  7:12                           ` Stephan von Krawczynski
2003-08-19 13:10                             ` Alan Cox
2003-08-19 14:18                               ` Stephan von Krawczynski
2003-08-19 18:00                                 ` Mike Fedyk
2003-08-19 21:58                                   ` Stephan von Krawczynski
2003-08-19 13:27                             ` Andrea Arcangeli
2003-08-13 15:21           ` Jim Gifford
2003-08-13 17:08             ` Marcelo Tosatti
2003-08-10 14:23     ` Keith Owens
2003-08-02 12:27 2.4.22-pre lockups (decoded oops for pre8) Stephan von Krawczynski
2003-08-05 16:40 ` Marcelo Tosatti
2003-08-06  7:41   ` 2.4.22-pre lockups (now decoded oops for pre10) Stephan von Krawczynski
2003-08-06  8:58     ` Oleg Drokin
2003-08-06  9:09     ` Willy Tarreau
2003-08-06  9:36       ` Stephan von Krawczynski
2003-08-06 12:45         ` Willy Tarreau
2003-08-18 14:23       ` Andrea Arcangeli
2003-08-06 18:15     ` Marcelo Tosatti
2003-08-07  2:14       ` Stephan von Krawczynski
2003-08-07  5:35         ` Oleg Drokin
2003-08-07 12:45         ` Marcelo Tosatti
     [not found]           ` <3F325198.2010301@namesys.com>
2003-08-07 13:32             ` Stephan von Krawczynski
2003-08-18 20:29               ` Mike Fedyk
2003-08-18 20:39                 ` Stephan von Krawczynski
2003-08-18 21:09                   ` Mike Fedyk
2003-08-07 15:52           ` Stephan von Krawczynski
