linux-kernel.vger.kernel.org archive mirror
* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
@ 2003-11-29 17:41 pinotj
  2003-12-02  0:36 ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: pinotj @ 2003-11-29 17:41 UTC (permalink / raw)
  To: manfred, torvalds; +Cc: akpm, linux-kernel

I triggered the slab oops with a very small kernel -test11 (~700KB):

---
slab: double free detected in cache 'biovec-1', objp c12c2df0, objn 122, slabp c12c2000, s_mem c12c2280, bufctl ffffffff
---

The oops occurs very quickly during compilation.

Here is the .config file (I removed unset options). The full config can be found at http://cercle-daejeon.homelinux.org/misc/config-small.txt if you want to diff it against another config, for example.

BTW, I don't understand why I can't remove the config options for the game port, mouse and DOS partition. Compilation without DMA fails.

I will reduce it further by removing FB, and try unsetting some options under "linux for embedded systems" and in debug.

I hope this helps to find the problem.

Jerome Pinot

---
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_UID16=y
CONFIG_GENERIC_ISA_DMA=y

CONFIG_CLEAN_COMPILE=y
CONFIG_STANDALONE=y
CONFIG_BROKEN_ON_SMP=y

CONFIG_LOG_BUF_SHIFT=14
CONFIG_KALLSYMS=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y

CONFIG_X86_PC=y
CONFIG_M386=y
CONFIG_X86_L1_CACHE_SHIFT=4
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_F00F_BUG=y
CONFIG_NOHIGHMEM=y

CONFIG_PCI=y
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y

CONFIG_BINFMT_ELF=y

CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

CONFIG_BLK_DEV_IDEDISK=y

CONFIG_BLK_DEV_IDEPCI=y
CONFIG_BLK_DEV_GENERIC=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_BLK_DEV_ADMA=y
CONFIG_BLK_DEV_IDEDMA=y

CONFIG_INPUT=y

CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768

CONFIG_SOUND_GAMEPORT=y
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y

CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y

CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y

CONFIG_FB=y
CONFIG_FB_VESA=y
CONFIG_VIDEO_SELECT=y

CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_PCI_CONSOLE=y
CONFIG_FONTS=y
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y

CONFIG_XFS_FS=y

CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_RAMFS=y

CONFIG_MSDOS_PARTITION=y

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_IOVIRT=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_FRAME_POINTER=y

CONFIG_X86_BIOS_REBOOT=y
CONFIG_PC=y
---



* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
  2003-11-29 17:41 Re: [Oops] i386 mm/slab.c (cache_flusharray) pinotj
@ 2003-12-02  0:36 ` Linus Torvalds
  2003-12-02  1:37   ` Nathan Scott
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2003-12-02  0:36 UTC (permalink / raw)
  To: pinotj; +Cc: manfred, Andrew Morton, Kernel Mailing List, nathans



On Sat, 29 Nov 2003 pinotj@club-internet.fr wrote:
>
> I triggered the slab oops with a very small kernel -test11 (~700KB):

The only thing that looks at _all_ likely to explain the problem is

> CONFIG_XFS_FS=y

since there aren't that many XFS users I know of. It's also now the only
thing that uses buffer heads in your config, so..

I assume it's not an option to try another filesystem on this setup, but
it's entirely possible that the 2.6.x buffer-head removal has impacted XFS
negatively - although I'm a bit surprised at how easily you seem to show
problems, since XFS actually has active maintenance.

Nathan - I don't know if you follow linux-kernel, but Jerome Pinot has
been having bad slab problems for some time now. Do normal XFS users
compile with slab debugging turned on?

		Linus


* Re: [Oops]  i386 mm/slab.c (cache_flusharray)
  2003-12-02  0:36 ` Linus Torvalds
@ 2003-12-02  1:37   ` Nathan Scott
  2003-12-02  6:44     ` Nathan Scott
  0 siblings, 1 reply; 17+ messages in thread
From: Nathan Scott @ 2003-12-02  1:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: pinotj, manfred, Andrew Morton, Kernel Mailing List

On Mon, Dec 01, 2003 at 04:36:33PM -0800, Linus Torvalds wrote:
> 
> I assume it's not an option to try another filesystem on this setup, but
> it's entirely possible that the 2.6.x buffer-head removal has impacted XFS
> negatively - although I'm a bit surprised at how easily you seem to show
> problems, since XFS actually has active maintenance.
> 
> Nathan - I don't know if you follow linux-kernel, but Jerome Pinot has

Yep, although I try to filter out "noise" and have inadvertently
missed this discussion so far.

> been having bad slab problems for some time now. Do normal XFS users
> compile with slab debugging turned on?

Hmm - I know I do - my nightly QA testing runs with this set.
Let me dig through the archives and catch up a bit on this issue;
I'll get back to you.

thanks.

-- 
Nathan


* Re: [Oops]  i386 mm/slab.c (cache_flusharray)
  2003-12-02  1:37   ` Nathan Scott
@ 2003-12-02  6:44     ` Nathan Scott
  2003-12-02 18:05       ` Mike Fedyk
  0 siblings, 1 reply; 17+ messages in thread
From: Nathan Scott @ 2003-12-02  6:44 UTC (permalink / raw)
  To: Linus Torvalds, pinotj; +Cc: manfred, Andrew Morton, Kernel Mailing List

Hi there,

On Tue, Dec 02, 2003 at 12:37:16PM +1100, Nathan Scott wrote:
> On Mon, Dec 01, 2003 at 04:36:33PM -0800, Linus Torvalds wrote:
> > 
> > I assume it's not an option to try another filesystem on this setup, but
> > it's entirely possible that the 2.6.x buffer-head removal has impacted XFS
> > negatively - although I'm a bit surprised at how easily you seem to show
> > problems, since XFS actually has active maintenance.
> > 
> > Nathan - I don't know if you follow linux-kernel, but Jerome Pinot has
> 
> Yep, although I try to filter out "noise" and have inadvertently
> missed this discussion so far.
> 
> > been having bad slab problems for some time now. Do normal XFS users
> > compile with slab debugging turned on?
> 
> Hmm - I know I do - my nightly QA testing runs with this set.
> Let me dig through the archives and catch up a bit on this issue;
> I'll get back to you.

OK, I've run XFS through hours and hours of very heavy stress now,
using a variety of different tests, and have tried different mount
and mkfs options as well.  And with a few kernel compiles thrown in
in the background for good measure.  Either we have quite different
hardware configs, compilers, etc; or this is something else.  This
was done with preempt enabled too (which I usually test without).

I'm not seeing anything to suggest random slab corruption, and I'm
so far unable to trip things up as easily as you're able to, Jerome.
Do you have just a very small amount of memory perhaps?  I can try
running while very low on memory, but that's the only other obvious
thing I can think of atm.

cheers.

-- 
Nathan


* Re: [Oops]  i386 mm/slab.c (cache_flusharray)
  2003-12-02  6:44     ` Nathan Scott
@ 2003-12-02 18:05       ` Mike Fedyk
  2003-12-02 20:05         ` Nathan Scott
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Fedyk @ 2003-12-02 18:05 UTC (permalink / raw)
  To: Nathan Scott
  Cc: Linus Torvalds, pinotj, manfred, Andrew Morton, Kernel Mailing List

On Tue, Dec 02, 2003 at 05:44:18PM +1100, Nathan Scott wrote:
> I'm not seeing anything to suggest random slab corruption, and I'm
> so far unable to trip things up as easily as you're able to, Jerome.
> Do you have just a very small amount of memory perhaps?  I can try
> running while very low on memory, but that's the only other obvious
> thing I can think of atm.

How about XFS on DM on RAID?


* Re: [Oops]  i386 mm/slab.c (cache_flusharray)
  2003-12-02 18:05       ` Mike Fedyk
@ 2003-12-02 20:05         ` Nathan Scott
  0 siblings, 0 replies; 17+ messages in thread
From: Nathan Scott @ 2003-12-02 20:05 UTC (permalink / raw)
  To: Kernel Mailing List

On Tue, Dec 02, 2003 at 10:05:40AM -0800, Mike Fedyk wrote:
> On Tue, Dec 02, 2003 at 05:44:18PM +1100, Nathan Scott wrote:
> > I'm not seeing anything to suggest random slab corruption, and I'm
> > so far unable to trip things up as easily as you're able to, Jerome.
> > Do you have just a very small amount of memory perhaps?  I can try
> > running while very low on memory, but that's the only other obvious
> > thing I can think of atm.
> 
> How about XFS on DM on RAID?

I didn't see references to those in Jerome's original description,
so I haven't been testing those in this context.  But I need to do
more testing on those subsystems anyway, so, will do.

cheers.

-- 
Nathan


* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
@ 2003-12-09  0:57 pinotj
  0 siblings, 0 replies; 17+ messages in thread
From: pinotj @ 2003-12-09  0:57 UTC (permalink / raw)
  To: torvalds, nathans; +Cc: neilb, manfred, akpm, linux-kernel

Here are the results of this weekend's testing on test11.
Things didn't go exactly as expected, unfortunately, but there are interesting results.
First, I confirm that using patch-xfs together with patch-slab (and without CONFIG_DEBUG_SLAB) makes my system boot normally.
I don't mention it again below, but I always applied patch-printk to get more verbose output in case of a slab oops.

I. Test on "small" kernel (http://cercle-daejeon.homelinux.org/misc/config-small2.txt except in first part of A.2.)

 A. With XFS (and without Ext3)

  1. no patch
  I couldn't reproduce the oops and I have no explanation at this time. I changed some parameters in the config (mostly debug options, to have the same config for all kernels), but even with the config I used before, I couldn't trigger the oops again. I first thought CONFIG_DEBUG_SLAB might be the main problem in this story (which would explain the last tests), but I got the same non-oops with CONFIG_DEBUG_SLAB=y. That doesn't mean the config change fixes the problem, though. The previous tests with CONFIG_PREEMPT=y/n showed that the oops can sometimes be very tricky to trigger. More tests are needed here.

  2. patch-xfs & patch-slab
  Oops during the first compile. In the first test, it was kswapd that complained. I thought it could be a side effect of my config (CONFIG_SWAP=n) and not enough RAM, even if it looked strange. That's why I decided to use config-small2.txt for all the tests (which I redid). In this case, I got a very similar oops, from `ld`:

  ---
  ld: page allocation failure. order:0, mode:0x8d0
  Unable to handle kernel NULL pointer dereference at virtual address 00000074
  c01d4cbd
  *pde = 00000000
  Oops: 0002 [#1]
  CPU:    0
  EIP:    0060:[_xfs_trans_alloc+149/160]    Not tainted
  EIP:    0060:[<c01d4cbd>]    Not tainted
  ---

  Full oops: http://cercle-daejeon.homelinux.org/misc/oops-small2-xfs-patched.txt
  First config used: http://cercle-daejeon.homelinux.org/misc/config-small.txt
   and oops of kswapd http://cercle-daejeon.homelinux.org/misc/oops-small-xfs-patched.txt

  3. patch-bio
  No oops triggered

 B. With Ext3 (and without XFS)

  1. no patch
  same as I.A.1
  2. patch-xfs & patch-slab
  Compilations looked good but I got a lot of errors in my logs:

  ---
  kernel: ld: page allocation failure. order:0, mode:0x50
  last message repeated 31 times
  klogd: page allocation failure. order:0, mode:0x50
  last message repeated 63 times
  kswapd0: page allocation failure. order:0, mode:0x50
  ENOMEM in journal_alloc_journal_head, retrying.
  ion failure. order:0, mode:0x50
  kswapd0: page allocation failure. order:0, mode:0x50
  last message repeated 291 times
  ---

  Full log: http://cercle-daejeon.homelinux.org/misc/error-small2-ext3-patched.txt
  NB: I ran `swapoff -a && mkswap /dev/hda3 && swapon -a` a few days earlier, so swap should be clean

  3. patch-bio
  As A.3, no oops triggered

II. Tests on "big" kernel, support for XFS and Ext3 (http://cercle-daejeon.homelinux.org/misc/config-big.txt)

 A. no patch
 Same as I.A.1

 B. patch-xfs & patch-slab
 Oops very similar to I.A.2
 Full oops: http://cercle-daejeon.homelinux.org/misc/oops-big-patched.txt

 C. patch-bio
 Here is the interesting thing. Until now, I hadn't seen any problem with patch-bio, but this time, very easily, I got a kind of double triple Lutz, starting with the infamous "BUG at mm/slab.c". There are almost 800 lines.
 Full oops: http://cercle-daejeon.homelinux.org/misc/oops-big-patch-bio-full.txt
 Oops via ksymoops: http://cercle-daejeon.homelinux.org/misc/oops-big-patch-bio.txt

III. Conclusion

Well, it's not easy to find a pattern here. Changes in configuration can modify the behavior of the oops, but we cannot correlate them with any one setting. I will try to get the oops back by changing the config.
The tests in I. confirm that it's not an XFS-only problem; it seems to affect page allocation for filesystems in general.
I hope these oopses will be clearer to you. I still have no problem with test9.

Jerome Pinot
(I hope I didn't miss anything)



* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
  2003-12-05 22:57 pinotj
@ 2003-12-05 23:02 ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2003-12-05 23:02 UTC (permalink / raw)
  To: pinotj; +Cc: nathans, neilb, manfred, akpm, linux-kernel



On Fri, 5 Dec 2003 pinotj@club-internet.fr wrote:
>
> 1. Is it still useful to get all the backtraces of the last xfs oops?

No, I'm assuming that was due to the slab interaction.

> 2. I will test patch-slab and patch-xfs on test11,
> CONFIG_DEBUG_PAGEALLOC (only). Test on XFS root and ext3 with "small"
> and "big" kernels.

Sounds good.

> 3. What about Manfred's patch-bio? I haven't had much time to try it
> yet, but it seems to stabilize things too. Should I use it alone or with
> the other patches?

It would be interesting to hear as much as possible about this: if
Manfred's bio patch makes a difference, it's less intrusive than mine, and
as such interesting.

On the other hand, despite the small size of Manfred's patch, it does have
a big impact: since 128 bytes is a "watermark" for the slab debugging, the
patch which appears less intrusive does in fact still cause a big amount
of changes.

Anyway, the more you feel like testing, the better. But use your own
judgements.

Thanks a lot for the effort, btw,

		Linus


* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
@ 2003-12-05 22:57 pinotj
  2003-12-05 23:02 ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: pinotj @ 2003-12-05 22:57 UTC (permalink / raw)
  To: torvalds, nathans; +Cc: neilb, manfred, akpm, linux-kernel

>From: Linus Torvalds <torvalds@osdl.org>
[...]
>Jerome - can you test Nathan's patch together with my "avoid the
>complicated slab logic"? The slab avoidance thing got ext3 stable for you,
>now with Nathan's patch hopefully XFS will be stable too.
[...]

OK, I will do intensive tests this weekend; I have time. I just want to confirm a few things:

1. Is it still useful to get all the backtraces of the last xfs oops?
2. I will test patch-slab and patch-xfs on test11, CONFIG_DEBUG_PAGEALLOC (only). Test on XFS root and ext3 with "small" and "big" kernels.
3. What about Manfred's patch-bio? I haven't had much time to try it yet, but it seems to stabilize things too. Should I use it alone or with the other patches?

Thanks all for your help

Jerome Pinot



* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
@ 2003-11-27 18:42 pinotj
  0 siblings, 0 replies; 17+ messages in thread
From: pinotj @ 2003-11-27 18:42 UTC (permalink / raw)
  To: manfred, torvalds; +Cc: akpm, linux-kernel

First, some news.

2.6.0-test11 gives the same oops during the second compilation of a kernel. The vanilla kernel with PREEMPT always oopses the same way. In any case, it's always reproducible.

2.6.0-test11 + Manfred's patch doesn't hang, but I found a slab error in the logs that occurred during a compilation. (I didn't see this with -test10; was I lucky?)

So there is no longer any way for my system to run a kernel newer than -test9 without problems.

>From: Manfred Spraul <manfred@colorfullife.com>
[...]
>There are several sources for the "-1": My initial guess was either a bug in slab, or a bad memory cell (only one bit difference). 
>Thus I sent him a patch that changes multiple bits. Result: It remained a single bit change, i.e it's proven that slab doesn't write BUFCTL_END into the wrong slot.

Thanks for your explanation.
Should I try with the L1 and/or L2 cache disabled on my computer (I don't know if it's safe)?
I trust my hardware, but it's better to get some facts.

Jerome Pinot

(Between LFS/BLFS, kernel compilation and test compilations, I will surely break some kind of load average record :-)



* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
  2003-11-25 17:30 pinotj
@ 2003-11-25 22:51 ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2003-11-25 22:51 UTC (permalink / raw)
  To: pinotj; +Cc: manfred, akpm, linux-kernel



On Tue, 25 Nov 2003 pinotj@club-internet.fr wrote:
>
> 3. 2.6.0-test10 vanilla + CONFIG_PREEMPT=y + patch printk + patch magic numbers
> The patch solves the problem: I can compile a kernel 4 times and do heavy work in parallel (load average around 1.2 for 2 hours) without any problems.

Those magic numbers don't make any sense. In particular, SLAB_LIMIT is
clearly bogus both in the original version and in the "magic number
patch". The only place that uses SLAB_LIMIT is the code that decides how
many entries fit in one slab, and quite frankly, it makes no _sense_ to
have a SLAB_LIMIT that is big enough to be unsigned.

"SLAB_LIMIT" should be something in the few hundreds, maybe.

Manfred?  What is the logic behind those nonsensical numbers?

		Linus
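
The point that SLAB_LIMIT only matters as an upper bound on how many entries fit in one slab can be illustrated with a rough calculation. The sketch below is not the code from mm/slab.c: the function name, the 28-byte descriptor size and the single-page slab are assumptions made purely for illustration, and alignment and off-slab descriptors are ignored. It only assumes that each object costs its own size plus one kmem_bufctl_t (an unsigned int, 4 bytes on i386).

```c
/* Assumed values for illustration only; not taken from mm/slab.c. */
#define SKETCH_PAGE_SIZE   4096u  /* i386 page size */
#define SKETCH_SLAB_HEADER 28u    /* assumed size of the on-slab descriptor */
#define SKETCH_BUFCTL_SIZE 4u     /* one unsigned int index per object */

/*
 * Hypothetical helper: rough number of objects of the given size that fit
 * in a one-page slab when the descriptor and the bufctl array live in the
 * same page. Returns 0 for a zero object size.
 */
static unsigned int sketch_objs_per_slab(unsigned int objsize)
{
    if (objsize == 0)
        return 0;
    return (SKETCH_PAGE_SIZE - SKETCH_SLAB_HEADER) /
           (objsize + SKETCH_BUFCTL_SIZE);
}
```

Under these assumptions a 60-byte object gives 63 entries per page and even a 4-byte object stays just under 512, so a limit anywhere near 0xffffFFFD is never approached.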


* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
@ 2003-11-25 17:30 pinotj
  2003-11-25 22:51 ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: pinotj @ 2003-11-25 17:30 UTC (permalink / raw)
  To: manfred; +Cc: torvalds, akpm, linux-kernel

Here are the results for test10 about the oops in slab.c.
When I say `compilation` with no explanation, it means compiling 2.6.0-test10 while running that same kernel, with my .config file (vmlinuz is 2.5M).

1.  2.6.0-test10 vanilla + CONFIG_PREEMPT=y + patch printk
The kernel oopses during compilation, as for 2.6.0-test9: at around 80% of the task.

---
slab: double free detected in cache 'buffer_head', objp cc2dbb58, objnr 42, slabp cc2db000, s_mem cc2db180, bufctl ffffffff.
------------[ cut here ]------------
kernel BUG at mm/slab.c:1956!
invalid operand: 0000 [#1]
CPU:    0
EIP:    0060:[free_block+357/784]    Not tainted
EIP:    0060:[<c015ad55>]    Not tainted
EFLAGS: 00010092
EIP is at free_block+0x165/0x310
eax: 00000080   ebx: 00000000   ecx: c0697854   edx: c05714f8
esi: cc2db000   edi: cc2db018   ebp: cf821c68   esp: cf821c34
ds: 007b   es: 007b   ss: 0068
Process kswapd0 (pid: 8, threadinfo=cf820000 task=cf849960)
Stack: c0504f40 c0505b3d cc2dbb58 0000002a cc2db000 cc2db180 ffffffff 0000002a
       cc2dbb58 0000000d cffdef08 c9e93180 00000010 cf821ca0 c015afda cffed800
       cffdef08 00000010 00000282 c11c89a0 00000000 00000001 cffee730 00000010
 Call Trace:
 [cache_flusharray+218/688] cache_flusharray+0xda/0x2b0
 [<c015afda>] cache_flusharray+0xda/0x2b0
 [kmem_cache_free+429/912] kmem_cache_free+0x1ad/0x390
---

full log can be found here: http://cercle-daejeon.homelinux.org/oops-full.txt
Cleaned oops (ksymoops):    http://cercle-daejeon.homelinux.org/oops.txt
Config of the kernel :      http://cercle-daejeon.homelinux.org/config.txt

NB: I don't get any oops if I compile the kernel with default settings (vmlinuz around 1.5M)

2. 2.6.0-test10 vanilla + CONFIG_PREEMPT=n + patch printk
Argh, an oops at the speed of light at the beginning of compilation. Too fast to catch anything in the logs.
This invalidates the initial idea that PREEMPT has a bad effect; it's exactly the opposite in this case.
On the second try, compilation was OK once, twice, and failed the third time.
Again, no logs, but this time I wrote down the printk:

---
slab: double free detected in cache 'bio', objp c888fc28, objnr 42, slabp c888f000, s_mem c888f1000, bufctl ffffffff.
---

This case is not easily reproducible.

3. 2.6.0-test10 vanilla + CONFIG_PREEMPT=y + patch printk + patch magic numbers
The patch solves the problem: I can compile a kernel 4 times and do heavy work in parallel (load average around 1.2 for 2 hours) without any problems.

Conclusion: well, this confirms some facts for 2.6.0-test10:

- The oops is reproducible with CONFIG_PREEMPT=y every time the load is heavy.
The load limit for my system (AMD tbird 1.2GHz, 256MB) is somewhere between the compilation of a 1.5M kernel (default settings) and a 2.5M kernel (custom settings). In the second case, it always occurs at about 80% of the compilation.

- The oops occurs even with CONFIG_PREEMPT=n, but it is not really reproducible. It needs heavy load too, but that's not enough. The system hangs really quickly, with no logs.

Finally, as a reminder, here are Manfred's patches used here (printk and magic numbers):

diff -Nru a/mm/slab.c b/mm/slab.c 2003-11-22 09:00:00 +0900
--- a/mm/slab.c         2003-11-22 08:43:02.189656536 +0900
+++ b/mm/slab.c         2003-11-22 08:45:44.158033600 +0900
@@ -1952,8 +1952,7 @@
                check_slabp(cachep, slabp);
 #if DEBUG
                if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
-                       printk(KERN_ERR "slab: double free detected in cache '%s', objp %p.\n",
-                                               cachep->name, objp);
+                       printk(KERN_ERR "slab: double free detected in cache '%s', objp %p, objnr %d, slabp %p, s_mem %p, bufctl %x.\n", cachep->name, objp, objnr, slabp, slabp->s_mem, slab_bufctl(slabp)[objnr]);
                        BUG();
                }
 #endif

diff -Nru a/mm/slab.c b/mm/slab.c 2003-11-22 09:00:00 +0900
--- a/mm/slab.c         2003-11-22 08:43:02.189656536 +0900
+++ b/mm/slab.c         2003-11-22 08:45:44.158033600 +0900
@@ -153,9 +153,9 @@
  * is less than 512 (PAGE_SIZE<<3), but greater than 256.
  */

-#define BUFCTL_END     0xffffFFFF
-#define BUFCTL_FREE    0xffffFFFE
-#define        SLAB_LIMIT      0xffffFFFD
+#define BUFCTL_END     0xfeffFFFF
+#define BUFCTL_FREE    0xf7ffFFFF
+#define SLAB_LIMIT     0xf0ffFFFD
 typedef unsigned int kmem_bufctl_t;

 /* Max number of objs-per-slab for caches which use off-slab slabs.

Regards,

Jerome Pinot
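
The fields added by the printk patch above allow a quick sanity check on each report. The sketch below assumes that the object index is derived as objnr = (objp - s_mem) / objsize; that relationship is an assumption here, not something confirmed in this thread. implied_objsize() is a hypothetical helper, not kernel code, and it simply recovers the object size that a given report implies.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): given the objp, s_mem and objnr
 * fields from the extended "double free" printk, return the object size
 * implied by objnr = (objp - s_mem) / objsize. Returns 0 if the fields are
 * inconsistent (objnr is 0, objp lies below s_mem, or the offset is not an
 * exact multiple of objnr).
 */
static uint32_t implied_objsize(uint32_t objp, uint32_t s_mem, uint32_t objnr)
{
    uint32_t offset;

    if (objnr == 0 || objp < s_mem)
        return 0;
    offset = objp - s_mem;
    if (offset % objnr != 0)
        return 0;
    return offset / objnr;
}
```

For the buffer_head report in this message (objp cc2dbb58, s_mem cc2db180, objnr 42) this gives 60 bytes with no remainder, so the index printed in that report is at least self-consistent with the addresses.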



* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
@ 2003-11-24 15:20 pinotj
  0 siblings, 0 replies; 17+ messages in thread
From: pinotj @ 2003-11-24 15:20 UTC (permalink / raw)
  To: manfred; +Cc: torvalds, akpm, linux-kernel

Sorry for the delay,

>From: Linus Torvalds <torvalds@osdl.org>
[...]
>> Summary: Oops reproducible under heavy load, bug in mm/slab.c
>
>Do you have CONFIG_PREEMPT on, and if so, does it go away if you compile without PREEMPT? We have at least one other bug that seems to be dependent on CONFIG_PREEMPT

I compiled without PREEMPT and at first it seemed good. I could compile a kernel again without problems.
But later, I got the same oops when doing something else
(like `./configure` in parallel with a `make install` on another tty), so CONFIG_PREEMPT doesn't seem to be the cause, unfortunately, but rather a parameter that can affect the probability of getting the oops.

>From: Manfred Spraul <manfred@colorfullife.com>
[...]
>>slab: double free detected in cache 'buffer_head', objp cc3f9798, objnr 26, slabp cc3f9000, s_mem cc3f9180 bufctl f7ffffff.  
>>
>Good.
>
>+#define BUFCTL_END	0xfeffFFFF
>+#define BUFCTL_FREE	0xf7ffFFFE
>+#define	SLAB_LIMIT	0xf0ffFFFD

This seems to solve the problem: no oops during kernel compilation. Unfortunately, considering what I wrote just above, I'm not so sure it's really solved. Now I will switch to 2.6.0-test10 and run the tests again (alone, with this patch, with CONFIG_PREEMPT=n).

>f7ffffff is not a valid value, slab never writes that into a bufctl. 
>Someone did a ++ or "|= 1", or a hw bug.
>I think the Athlon cpus have ECC for the L2 cache - could you check in 
>the bios that ECC checking is enabled?

Well, it's a cheap mainboard (VIA K7S5A) with a cheap BIOS. I can only {en,dis}able the L1/L2 cache; there is nothing about ECC. DRAM is set to safe.
But as I said, I had no problem with 2.6.0-test9 vanilla. I compiled all my LFS/BLFS with it over several days.
I even use it these days as a rescue kernel to compile the others.

Thanks for your help,

Jerome Pinot
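
The single-bit argument quoted above can be checked mechanically. The sketch below is only an illustration (bits_differing() is a hypothetical helper name): it counts how many bits differ between the bufctl value reported in the log and each of the patched constants. The observed f7ffffff is one bit away from BUFCTL_FREE (0xf7ffFFFE), which is what a stray `++` or `|= 1` would produce, but two bits away from BUFCTL_END (0xfeffFFFF).

```c
#include <stdint.h>

/* Count the bits that differ between two 32-bit bufctl values. */
static unsigned int bits_differing(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;   /* set bits mark the positions that differ */
    unsigned int n = 0;

    while (x) {
        x &= x - 1;       /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With the original constants the same check gives bits_differing(0xfffffffe, 0xffffffff) == 1, where the corrupted value happened to coincide with BUFCTL_END; with the patched constants that coincidence is gone but the one-bit distance from BUFCTL_FREE remains.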



* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
  2003-11-21 18:12 pinotj
@ 2003-11-21 22:48 ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2003-11-21 22:48 UTC (permalink / raw)
  To: pinotj; +Cc: akpm, manfred, linux-kernel


On Fri, 21 Nov 2003 pinotj@club-internet.fr wrote:
>
> Summary: Oops reproducible under heavy load, bug in mm/slab.c

Do you have CONFIG_PREEMPT on, and if so, does it go away if you compile
without PREEMPT? We have at least one other bug that seems to be dependent
on CONFIG_PREEMPT.

		Linus




* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
@ 2003-11-21 18:12 pinotj
  2003-11-21 22:48 ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: pinotj @ 2003-11-21 18:12 UTC (permalink / raw)
  To: akpm, manfred; +Cc: linux-kernel

----Original message----
>Date: Wed, 19 Nov 2003 18:09:43 -0800
>From: Andrew Morton <akpm@osdl.org>
>To: pinotj@club-internet.fr
>Cc: linux-kernel@vger.kernel.org
>Subject: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
[...]
>Well it's interesting that it is repeatable.
>
>First thing to do is to eliminate hardware failures:
>
>1: Is the oops always the same, or does the machine crash in other ways,
>   with different backtraces?
>
>2: Try running memtest86 on that machine for 12 hours or more.
>
>3: Can the problem be reproduced on other machines?
>
>4: try a different compiler version.

First, some results from my tests (not finished yet)

0. Increase verbosity of the printk (thanks to Manfred):
(compilation of kernel)
slab: double free detected in cache 'buffer_head', objp c4c8e3d8, objnr 10,
slabp c4c8e000, s_mem c4c8e180, bufctl ffffffff.
(compilation of firebird)
slab: double free detected in cache 'pte_chain', objp c18a6600, objnr 10,
slabp c18a6000, s_mem c18a6100, bufctl ffffffff.

1. Reproducibility: yes, it oopses each time I try to compile a kernel, for example, at around 75% of the task.
It always oopses if I compile for a fairly long time.
One funny thing, though: I got one oops without a freeze. After the gcc error, I went back to the shell, but when I called ksymoops 2 commands later, everything froze.
About the backtrace, well, I'm not sure. Are you talking about what follows the `call trace`, etc.? The problem is that the system doesn't always have time to flush everything to the log; I often got only the printk. But I always got the cache_flusharray entry in first position.

2. Memory test: not yet, I need some time. But as I said, I never had an oops before, with 2.6.0-test from 4 to 9. I compiled all my LFS with 2.6.0-test9 vanilla without problems.
3. Change of compiler: I confirm the same problem with gcc 2.95.3, 3.2.3, 3.3.1.
x. ACPI: same oops with `acpi=off pci=noacpi` at boot.

Summary: Oops reproducible under heavy load, bug in mm/slab.c
I don't have this problem with 2.6.0-test9 and earlier.
The problem appeared in the latest patches, before 15 November
(cset-20031115_0206), so I looked for something wrong there.

I tried to remove some of the latest patches (mm/ioremap.c,
mm/filemap.c, mm/memory.c) but still got the oops.
It must be another patch. Which other one can I remove to test?

Regards,

Jerome



* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
@ 2003-11-20  2:40 pinotj
  0 siblings, 0 replies; 17+ messages in thread
From: pinotj @ 2003-11-20  2:40 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel

----Original message----
>Date: Wed, 19 Nov 2003 18:09:43 -0800
>From: Andrew Morton <akpm@osdl.org>
>To: pinotj@club-internet.fr
>Cc: linux-kernel@vger.kernel.org
>Subject: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
[...]
>>  Is there anything I can do to help figure out where the problem comes from?
>
>Well it's interesting that it is repeatable.
>First thing to do is to eliminate hardware failures:
>
>1: Is the oops always the same, or does the machine crash in other ways, with different backtraces?

Well, I can't check right now, but it seemed to be the same. I will keep the next 5 oopses with the same distro and diff them to be sure.

>2: Try running memtest86 on that machine for 12 hours or more.

OK

>3: Can the problem be reproduced on other machines?

Unfortunately, I can't use any other computer for this (or I will lose some friends :-) If there have already been other reports of this bug, it could be good to compare hardware and/or environment with those other people.

>4: try a different compiler version.

I already tried gcc 3.2.3 and 3.3.1 (2.95.3 to be confirmed)

I will run the tests (1, 2 and the confirmation for 4) and give you the results tomorrow.

Just an idea: could it be an ACPI problem?
I will try some boot parameters too, to be sure...

>Thanks.

You're welcome

Jerome Pinot



* Re: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
@ 2003-11-20  1:50 pinotj
  0 siblings, 0 replies; 17+ messages in thread
From: pinotj @ 2003-11-20  1:50 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel

>Date: Wed, 19 Nov 2003 17:07:53 -0800
>From: Andrew Morton <akpm@osdl.org>
>To: pinotj@club-internet.fr
>Cc: linux-kernel@vger.kernel.org
>Subject: Re: [Oops]  i386 mm/slab.c (cache_flusharray)
>
>pinotj@club-internet.fr wrote:
>>
>> kernel BUG at mm/slab.c:1957!
[...]
>
>urgh, there are several reports of this and it's always the buffer_head
>slab.  The code in there is trivial so perhaps it's just that the large
>number of buffer_heads makes them a fat target.
>
>You should have also seen the message "slab: double free detected in cache
>'buffer_head', objp 0xNNNNNNNN".

Yeah, you're right, I forgot to mention it; it was just above the oops in the logs:
---
slab: double free detected in cache 'buffer_head', objp cd09af18.
------------[ cut here ]------------
kernel BUG at mm/slab.c:1957!
---

>Don't know, sorry.

Is there anything I can do to help figure out where the problem comes from? Anyway, thanks for your answer.

Regards,

Jerome Pinot



end of thread, other threads:[~2003-12-09  0:57 UTC | newest]

Thread overview: 17+ messages
2003-11-29 17:41 Re: [Oops] i386 mm/slab.c (cache_flusharray) pinotj
2003-12-02  0:36 ` Linus Torvalds
2003-12-02  1:37   ` Nathan Scott
2003-12-02  6:44     ` Nathan Scott
2003-12-02 18:05       ` Mike Fedyk
2003-12-02 20:05         ` Nathan Scott
  -- strict thread matches above, loose matches on Subject: below --
2003-12-09  0:57 pinotj
2003-12-05 22:57 pinotj
2003-12-05 23:02 ` Linus Torvalds
2003-11-27 18:42 pinotj
2003-11-25 17:30 pinotj
2003-11-25 22:51 ` Linus Torvalds
2003-11-24 15:20 pinotj
2003-11-21 18:12 pinotj
2003-11-21 22:48 ` Linus Torvalds
2003-11-20  2:40 pinotj
2003-11-20  1:50 pinotj
