linux-kernel.vger.kernel.org archive mirror
* Reproducable OOPS with MD RAID-5 on 2.6.0-test11
@ 2003-12-01 14:06 Kevin P. Fleming
  2003-12-01 14:11 ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-01 14:06 UTC (permalink / raw)
  To: LKML, Linux-raid maillist

I've got a new system here with six SATA disks set up in a RAID-5 array 
(no partition tables, using the whole disks). I then used LVM2 tools to 
make the RAID array a physical volume, created a logical volume and 
formatted that volume with an XFS filesystem.

Mounting the filesystem and copying over the 2.6 kernel source tree 
produces this OOPS (and is pretty reproducible):

kernel BUG at fs/bio.c:177!
invalid operand: 0000 [#1]
CPU:    0
EIP:    0060:[<c014db9a>]    Not tainted
EFLAGS: 00010246
EIP is at bio_put+0x2c/0x36
eax: 00000000   ebx: f6221080   ecx: c1182180   edx: edcbf780
esi: c577b998   edi: 00000002   ebp: edcbf780   esp: f78ffeb0
ds: 007b   es: 007b   ss: 0068
Process md0_raid5 (pid: 65, threadinfo=f78fe000 task=f7924080)
Stack: c71e2640 c021d88d edcbf780 00000000 00000001 c1182180 00000009 
0001000
        edcbf780 00000000 00000000 00000000 c014e2fc edcbf780 00000000 
00000000
        f23a0ff0 f23a0ff0 edcbf7c0 c02ca51d edcbf780 00000000 00000000 
00000000
Call Trace:
  [<c021d88d>] bio_end_io_pagebuf+0x9a/0x138
  [<c014e2fc>] bio_endio+0x59/0x7e
  [<c02ca51d>] clone_endio+0x82/0xb5
  [<c02c0dc3>] handle_stripe+0x8f2/0xec0
  [<c02c17d1>] raid5d+0x71/0x105
  [<c02c898c>] md_thread+0xde/0x15c
  [<c011984b>] default_wake_function+0x0/0x12
  [<c02c88ae>] md_thread+0x0/0x15c
  [<c0107049>] kernel_thread_helper+0x5/0xb

Code: 0f 0b b1 00 bc 94 34 c0 eb d8 56 53 83 ec 08 8b 44 24 18 8b

Hardware is a 2.6CGHz P4, 1G of RAM (4G highmem enabled), SMP kernel but 
no preemption. Kernel config is at:

http://www.backtobasicsmgmt.com/bucky/bucky.config

(I'm subscribed to linux-kernel but not linux-raid, so please CC me on 
any linux-raid responses. Thanks!)



* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-01 14:06 Reproducable OOPS with MD RAID-5 on 2.6.0-test11 Kevin P. Fleming
@ 2003-12-01 14:11 ` Jens Axboe
  2003-12-01 14:15   ` Kevin P. Fleming
  2003-12-01 23:06   ` Reproducable OOPS with MD RAID-5 on 2.6.0-test11 - with XFS Neil Brown
  0 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2003-12-01 14:11 UTC (permalink / raw)
  To: Kevin P. Fleming; +Cc: LKML, Linux-raid maillist

On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> I've got a new system here with six SATA disks set up in a RAID-5 array 
> (no partition tables, using the whole disks). I then used LVM2 tools to 
> make the RAID array a physical volume, created a logical volume and 
> formatted that volume with an XFS filesystem.
> 
> Mounting the filesystem and copying over the 2.6 kernel source tree 
> produces this OOPS (and is pretty reproducible):
> 
> kernel BUG at fs/bio.c:177!

It's doing a put on an already freed bio, that's really bad.
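
For reference, the check that fires at fs/bio.c:177 is the reference-count
sanity check at the top of bio_put(). Roughly (paraphrased from memory of
the 2.6 sources, not an exact copy):

#include <linux/bio.h>

void bio_put(struct bio *bio)
{
	/* putting a bio whose refcount is already zero means it has
	 * been completed/freed once too often */
	BIO_BUG_ON(!atomic_read(&bio->bi_cnt));

	/* the last put frees it */
	if (atomic_dec_and_test(&bio->bi_cnt)) {
		bio->bi_next = NULL;
		bio->bi_destructor(bio);
	}
}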

> invalid operand: 0000 [#1]
> CPU:    0
> EIP:    0060:[<c014db9a>]    Not tainted
> EFLAGS: 00010246
> EIP is at bio_put+0x2c/0x36
> eax: 00000000   ebx: f6221080   ecx: c1182180   edx: edcbf780
> esi: c577b998   edi: 00000002   ebp: edcbf780   esp: f78ffeb0
> ds: 007b   es: 007b   ss: 0068
> Process md0_raid5 (pid: 65, threadinfo=f78fe000 task=f7924080)
> Stack: c71e2640 c021d88d edcbf780 00000000 00000001 c1182180 00000009 
> 0001000
>        edcbf780 00000000 00000000 00000000 c014e2fc edcbf780 00000000 
> 00000000
>        f23a0ff0 f23a0ff0 edcbf7c0 c02ca51d edcbf780 00000000 00000000 
> 00000000
> Call Trace:
>  [<c021d88d>] bio_end_io_pagebuf+0x9a/0x138
>  [<c014e2fc>] bio_endio+0x59/0x7e
>  [<c02ca51d>] clone_endio+0x82/0xb5
>  [<c02c0dc3>] handle_stripe+0x8f2/0xec0
>  [<c02c17d1>] raid5d+0x71/0x105
>  [<c02c898c>] md_thread+0xde/0x15c
>  [<c011984b>] default_wake_function+0x0/0x12
>  [<c02c88ae>] md_thread+0x0/0x15c
>  [<c0107049>] kernel_thread_helper+0x5/0xb

Odds are it's a raid5 bug.

> Hardware is a 2.6CGHz P4, 1G of RAM (4G highmem enabled), SMP kernel but 
> no preemption. Kernel config is at:

Are you using ide or libata as the backing for the sata drives?

-- 
Jens Axboe



* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-01 14:11 ` Jens Axboe
@ 2003-12-01 14:15   ` Kevin P. Fleming
  2003-12-01 15:51     ` Jens Axboe
  2003-12-01 23:06   ` Reproducable OOPS with MD RAID-5 on 2.6.0-test11 - with XFS Neil Brown
  1 sibling, 1 reply; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-01 14:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: LKML, Linux-raid maillist

Jens Axboe wrote:

>>Hardware is a 2.6CGHz P4, 1G of RAM (4G highmem enabled), SMP kernel but 
>>no preemption. Kernel config is at:
> 
> 
> Are you using ide or libata as the backing for the sata drives?
> 

libata, two of the disks are on an ICH5 and the other four are on a 
Promise SATA150 TX4.



* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-01 14:15   ` Kevin P. Fleming
@ 2003-12-01 15:51     ` Jens Axboe
  2003-12-02  4:02       ` Kevin P. Fleming
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2003-12-01 15:51 UTC (permalink / raw)
  To: Kevin P. Fleming; +Cc: LKML, Linux-raid maillist

On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> Jens Axboe wrote:
> 
> >>Hardware is a 2.6CGHz P4, 1G of RAM (4G highmem enabled), SMP kernel but 
> >>no preemption. Kernel config is at:
> >
> >
> >Are you using ide or libata as the backing for the sata drives?
> >
> 
> libata, two of the disks are on an ICH5 and the other four are on a 
> Promise SATA150 TX4.

Alright, so no bouncing should be happening. Could you boot with
mem=800m (and reproduce) just to rule it out completely?

-- 
Jens Axboe



* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11 - with XFS
  2003-12-01 14:11 ` Jens Axboe
  2003-12-01 14:15   ` Kevin P. Fleming
@ 2003-12-01 23:06   ` Neil Brown
  1 sibling, 0 replies; 20+ messages in thread
From: Neil Brown @ 2003-12-01 23:06 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Kevin P. Fleming, LKML, Linux-raid maillist, Eric Jensen

On Monday December 1, axboe@suse.de wrote:
> On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> > I've got a new system here with six SATA disks set up in a RAID-5 array 
> > (no partition tables, using the whole disks). I then used LVM2 tools to 
> > make the RAID array a physical volume, created a logical volume and 
> > formatted that volume with an XFS filesystem.
> > 
> > Mounting the filesystem and copying over the 2.6 kernel source tree 
> > produces this OOPS (and is pretty reproducible):
> > 
> > kernel BUG at fs/bio.c:177!
> 
> It's doing a put on an already freed bio, that's really bad.
> 

That makes 2 bug reports that seem to suggest that raid5 is calling
bi_end_io twice on the one bio. 

The other one was from Eric Jensen <ej@xmission.com>
with Subject: PROBLEM: 2.6.0-test10 BUG/panic in mpage_end_io_read
on  26 Nov 2003 

Both involve xfs and raid5.
I, of course, am tempted to blame xfs.....

In this case, I don't think that raid5 calling bi_end_io twice would
cause the problem, as the bi_end_io that raid5 calls is clone_endio,
and that has an atomic_t to make sure it only calls its own bi_end_io
(bio_end_io_pagebuf) once, even if it were called multiple times itself.
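
For anyone not familiar with that pattern, it is the usual "only the last
completion ends the parent" idiom.  A minimal sketch of the idea (names
invented for illustration -- this is not the actual drivers/md/dm.c code):

#include <linux/bio.h>

/* illustrative only */
struct example_io {
	struct bio *original;
	atomic_t    io_count;	/* one reference per outstanding clone */
	int         error;
};

static int example_clone_endio(struct bio *clone, unsigned int done, int error)
{
	struct example_io *io = clone->bi_private;

	if (clone->bi_size)		/* partial completion, more to come */
		return 1;
	if (error)
		io->error = error;

	/* even if this were (wrongly) called twice, only the transition
	 * to zero forwards the completion to the parent's bi_end_io */
	if (atomic_dec_and_test(&io->io_count))
		bio_endio(io->original, io->original->bi_size, io->error);

	bio_put(clone);
	return 0;
}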

So I'm wondering if xfs might be doing something funny after
submitting the request to raid5... though I don't find that convincing
either.

In this report, the IO seems to have been requested from the
pagebuf stuff (fs/xfs/pagebuf/page_buf.c).  In the other one it
is coming from mpage, presumably from inside xfs/linux/xfs_aops.c.
These are very different code paths and are unlikely to share a bug
like this.

Which does tend to point the finger back at raid5. :-(

I'd love to see some more reports of similar bugs, in the hope that
they might shed some more light.

NeilBrown


* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-01 15:51     ` Jens Axboe
@ 2003-12-02  4:02       ` Kevin P. Fleming
  2003-12-02  4:15         ` Mike Fedyk
  2003-12-02  8:27         ` Jens Axboe
  0 siblings, 2 replies; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-02  4:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: LKML, Linux-raid maillist, linux-lvm

Jens Axboe wrote:

> Alright, so no bouncing should be happening. Could you boot with
> mem=800m (and reproduce) just to rule it out completely?

Tested with mem=800m, problem still occurs. Additional test was done 
without device-mapper in place, though, and I could not reproduce the 
problem! I copied > 500MB of stuff to the XFS filesystem created using 
the entire /dev/md/0 device without a single unusual message. I then 
unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a 
3G volume on the array, made an XFS filesystem on it, mounted it, and 
tried copying data over. The oops message came back.

I'm copying this message to linux-lvm; the original oops message is 
repeated below for the benefit of those list readers. I've got one more 
round of testing to do (after the array resyncs itself), which is to try 
a filesystem other than XFS.

----

kernel BUG at fs/bio.c:177!
invalid operand: 0000 [#1]
CPU:    0
EIP:    0060:[<c014db9a>]    Not tainted
EFLAGS: 00010246
EIP is at bio_put+0x2c/0x36
eax: 00000000   ebx: f6221080   ecx: c1182180   edx: edcbf780
esi: c577b998   edi: 00000002   ebp: edcbf780   esp: f78ffeb0
ds: 007b   es: 007b   ss: 0068
Process md0_raid5 (pid: 65, threadinfo=f78fe000 task=f7924080)
Stack: c71e2640 c021d88d edcbf780 00000000 00000001 c1182180 00000009 
0001000
        edcbf780 00000000 00000000 00000000 c014e2fc edcbf780 00000000 
00000000
        f23a0ff0 f23a0ff0 edcbf7c0 c02ca51d edcbf780 00000000 00000000 
00000000
Call Trace:
  [<c021d88d>] bio_end_io_pagebuf+0x9a/0x138
  [<c014e2fc>] bio_endio+0x59/0x7e
  [<c02ca51d>] clone_endio+0x82/0xb5
  [<c02c0dc3>] handle_stripe+0x8f2/0xec0
  [<c02c17d1>] raid5d+0x71/0x105
  [<c02c898c>] md_thread+0xde/0x15c
  [<c011984b>] default_wake_function+0x0/0x12
  [<c02c88ae>] md_thread+0x0/0x15c
  [<c0107049>] kernel_thread_helper+0x5/0xb

Code: 0f 0b b1 00 bc 94 34 c0 eb d8 56 53 83 ec 08 8b 44 24 18 8b




* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-02  4:02       ` Kevin P. Fleming
@ 2003-12-02  4:15         ` Mike Fedyk
  2003-12-02 13:11           ` Kevin P. Fleming
  2003-12-02  8:27         ` Jens Axboe
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Fedyk @ 2003-12-02  4:15 UTC (permalink / raw)
  To: Kevin P. Fleming; +Cc: Jens Axboe, LKML, Linux-raid maillist, linux-lvm

On Mon, Dec 01, 2003 at 09:02:40PM -0700, Kevin P. Fleming wrote:
> Tested with mem=800m, problem still occurs. Additional test was done 
> without device-mapper in place, though, and I could not reproduce the 
> problem! I copied > 500MB of stuff to the XFS filesystem created using 
> the entire /dev/md/0 device without a single unusual message. I then 
> unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a 
> 3G volume on the array, made an XFS filesystem on it, mounted it, and 
> tried copying data over. The oops message came back.

Can you try with DM on a regular disk, instead of sw raid?


* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-02  4:02       ` Kevin P. Fleming
  2003-12-02  4:15         ` Mike Fedyk
@ 2003-12-02  8:27         ` Jens Axboe
  2003-12-02 10:10           ` Nathan Scott
  2003-12-02 18:23           ` Linus Torvalds
  1 sibling, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2003-12-02  8:27 UTC (permalink / raw)
  To: Kevin P. Fleming; +Cc: LKML, Linux-raid maillist, linux-lvm

On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> Jens Axboe wrote:
> 
> >Alright, so no bouncing should be happening. Could you boot with
> >mem=800m (and reproduce) just to rule it out completely?
> 
> Tested with mem=800m, problem still occurs. Additional test was done 

Suspected as much, just wanted to make sure.

> without device-mapper in place, though, and I could not reproduce the 
> problem! I copied > 500MB of stuff to the XFS filesystem created using 
> the entire /dev/md/0 device without a single unusual message. I then 
> unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a 
> 3G volume on the array, made an XFS filesystem on it, mounted it, and 
> tried copying data over. The oops message came back.

Smells like a bio stacking problem in raid/dm then. I'll take a quick
look and see if anything obvious pops up, otherwise the maintainers of
those areas should take a closer look.

> I'm copying this message to linux-lvm; the original oops message is 
> repeated below for the benefit of those list readers. I've got one more 
> round of testing to do (after the array resyncs itself), which is to try 
> a filesystem other than XFS.

That might be a good idea, although it's not very likely to be an XFS
problem as it happens further down the io stack. It should trigger just
as happily on IDE or SCSI if that was the case.

-- 
Jens Axboe



* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-02  8:27         ` Jens Axboe
@ 2003-12-02 10:10           ` Nathan Scott
  2003-12-02 13:15             ` Kevin P. Fleming
  2003-12-03  3:32             ` Nathan Scott
  2003-12-02 18:23           ` Linus Torvalds
  1 sibling, 2 replies; 20+ messages in thread
From: Nathan Scott @ 2003-12-02 10:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Kevin P. Fleming, LKML, Linux-raid maillist, linux-lvm

On Tue, Dec 02, 2003 at 09:27:13AM +0100, Jens Axboe wrote:
> On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> > 
> > without device-mapper in place, though, and I could not reproduce the 
> > problem! I copied > 500MB of stuff to the XFS filesystem created using 
> > the entire /dev/md/0 device without a single unusual message. I then 
> > unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a 
> > 3G volume on the array, made an XFS filesystem on it, mounted it, and 
> > tried copying data over. The oops message came back.
> 
> Smells like a bio stacking problem in raid/dm then. I'll take a quick
> look and see if anything obvious pops up, otherwise the maintainers of
> those areas should take a closer look.

One thing that might be of interest - XFS does tend to pass
variable size requests down to the block layer, and this has
tripped up md and other drivers in 2.4 in the distant past.

Log IO is typically 512 byte aligned (as opposed to block or
page size aligned), as are IOs into several of XFS' metadata
structures.
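
To make that concrete: a log write may be a single 512-byte chunk at a
sector-aligned (but not page-aligned) offset, i.e. something built roughly
like the sketch below (illustrative only, not XFS's actual pagebuf code --
the helper name and arguments are invented):

#include <linux/bio.h>
#include <linux/blkdev.h>

static void submit_512b_write(struct block_device *bdev, struct page *page,
			      sector_t sector, bio_end_io_t *end_io, void *priv)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	bio->bi_bdev    = bdev;
	bio->bi_sector  = sector;	/* 512-byte granularity */
	bio->bi_end_io  = end_io;
	bio->bi_private = priv;

	/* only 512 bytes of the page, not a whole block/page */
	bio_add_page(bio, page, 512, 0);

	submit_bio(WRITE, bio);
}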

> > I'm copying this message to linux-lvm; the original oops message is 
> > repeated below for the benefit of those list readers. I've got one more 
> > round of testing to do (after the array resyncs itself), which is to try 
> > a filesystem other than XFS.
> 
> That might be a good idea, although it's not very likely to be an XFS
> problem as it happens further down the io stack. It should trigger just
> as happily on IDE or SCSI if that was the case.

I would tend to agree (but will happily fix things if proven
wrong ;) - I took a brief look through dm & md this afternoon
but nothing obvious jumped out at me.  I'm not particularly
familiar with that code though.

cheers.

-- 
Nathan


* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-02  4:15         ` Mike Fedyk
@ 2003-12-02 13:11           ` Kevin P. Fleming
  0 siblings, 0 replies; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-02 13:11 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Jens Axboe, LKML, Linux-raid maillist, linux-lvm

Mike Fedyk wrote:

> Can you try with DM on regular disk tm, instead of sw raid?

Tested, failure does not occur.




* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-02 10:10           ` Nathan Scott
@ 2003-12-02 13:15             ` Kevin P. Fleming
  2003-12-03  3:32             ` Nathan Scott
  1 sibling, 0 replies; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-02 13:15 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Jens Axboe, LKML, Linux-raid maillist, linux-lvm

Nathan Scott wrote:

> One thing that might be of interest - XFS does tend to pass
> variable size requests down to the block layer, and this has
> tripped up md and other drivers in 2.4 in the distant past.
> 
> Log IO is typically 512 byte aligned (as opposed to block or
> page size aligned), as are IOs into several of XFS' metadata
> structures.

Hey, thanks for the pointer! I think we're getting somewhere now. Here's 
a recap of the tested combinations:

XFS on raw disk: OK
XFS on LVM2 on single disk: OK
XFS on LVM2 on RAID-5: fails
ext2 on LVM2 on RAID-5: OK

I just tested XFS on LVM2 on RAID-5 using "-l sunit=8" while creating 
the filesystem to force log writes to be block-sized and block-aligned; 
this seems to work :-) I have not been able to force a failure using my 
test script. ATM the system is still running a RAID-5 resync of the 
array, but that should only make the problem more likely, not less.

So this does appear to be an md/dm stacking problem that is exposed by 
XFS sending non-block-sized and/or non-block-aligned IOs.



* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-02  8:27         ` Jens Axboe
  2003-12-02 10:10           ` Nathan Scott
@ 2003-12-02 18:23           ` Linus Torvalds
  2003-12-04  1:12             ` Simon Kirby
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2003-12-02 18:23 UTC (permalink / raw)
  To: Jens Axboe, Neil Brown
  Cc: Kevin P. Fleming, LKML, Linux-raid maillist, linux-lvm



On Tue, 2 Dec 2003, Jens Axboe wrote:
>
> Smells like a bio stacking problem in raid/dm then. I'll take a quick
> look and see if anything obvious pops up, otherwise the maintainers of
> those areas should take a closer look.

There are several other problem reports which start to smell like md/raid.

> That might be a good idea, although it's not very likely to be an XFS
> problem as it happens further down the io stack. It should trigger just
> as happily on IDE or SCSI if that was the case.

There's one (by Alan Buxey) that I attributed to PREEMPT which happens on
UP with ext3 and raid0:

	md: Autodetecting RAID arrays.
	md: autorun ...
	md: considering hdd1 ...
	md:  adding hdd1 ...
	md:  adding hda1 ...
	md: created md0
	md: bind<hda1>
	md: bind<hdd1>
	md: running: <hdd1><hda1>
	raid1: raid set md0 active with 2 out of 2 mirrors
	md: ... autorun DONE.

and that one shows strange memory corruption problems too.

NOTE! The fact that it only happens with PREEMPT for some people is not
necessarily a sign of preempt-only trouble: PREEMPT should really be
equivalent to SMP-safe, but there are some things that are much more
likely with preemption than with normal SMP.

In particular, preempt will cause every single (final) unlock to check
whether there is something else runnable with a higher priority, so it
opens up races a lot - if you touch something just outside (in particular:
_after_) the locked region, preempt is much more likely to show a race
that on SMP might be just a few instructions long.
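
A trivial (and purely illustrative) example of the kind of window I mean:

#include <linux/spinlock.h>

struct widget {
	spinlock_t lock;
	int	   refcount;
	int	   stats;
};

static void buggy_update(struct widget *w)
{
	spin_lock(&w->lock);
	w->refcount--;
	spin_unlock(&w->lock);

	/* With CONFIG_PREEMPT the unlock above is a natural preemption
	 * point: if another task drops the last reference and frees *w
	 * right here, the store below is a use-after-free.  On SMP the
	 * same race exists, but the window is only a few instructions
	 * wide. */
	w->stats++;
}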

		Linus


* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-02 10:10           ` Nathan Scott
  2003-12-02 13:15             ` Kevin P. Fleming
@ 2003-12-03  3:32             ` Nathan Scott
  2003-12-03 17:13               ` Linus Torvalds
  1 sibling, 1 reply; 20+ messages in thread
From: Nathan Scott @ 2003-12-03  3:32 UTC (permalink / raw)
  To: Jens Axboe, Kevin P. Fleming; +Cc: LKML, Linux-raid maillist, linux-lvm

On Tue, Dec 02, 2003 at 09:10:02PM +1100, Nathan Scott wrote:
> On Tue, Dec 02, 2003 at 09:27:13AM +0100, Jens Axboe wrote:
> > On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> > > 
> > > without device-mapper in place, though, and I could not reproduce the 
> > > problem! I copied > 500MB of stuff to the XFS filesystem created using 
> > > the entire /dev/md/0 device without a single unusual message. I then 
> > > unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a 
> > > 3G volume on the array, made an XFS filesystem on it, mounted it, and 
> > > tried copying data over. The oops message came back.
> > 
> > Smells like a bio stacking problem in raid/dm then. I'll take a quick
> > look and see if anything obvious pops up, otherwise the maintainers of
> > those areas should take a closer look.
> 
> One thing that might be of interest - XFS does tend to pass
> variable size requests down to the block layer, and this has
> tripped up md and other drivers in 2.4 in the distant past.
> 
> Log IO is typically 512 byte aligned (as opposed to block or
> page size aligned), as are IOs into several of XFS' metadata
> structures.

The XFS tests just tripped up a panic in raid5 in -test11 -- a kdb
stacktrace follows.  Seems to be reproducible, but not always the
same test that causes it.  And I haven't seen a double bio_put yet,
this first problem keeps getting in the way I guess.

Looks like it's in a raid5 kernel thread, doing asynchronous stuff,
so I don't really have any extra hints about what XFS was doing at
the time for y'all either.

cheers.

--
Nathan

XFS mounting filesystem md0
Unable to handle kernel paging request at virtual address d1c92c00
 printing eip:
c0387be6
*pde = 00048067
*pte = 11c92000
Oops: 0000 [#1]
CPU:    3
EIP:    0060:[<c0387be6>]    Not tainted
EFLAGS: 00010086
EIP is at handle_stripe+0xda6/0xef0
eax: f315df94   ebx: 00000000   ecx: 00000000   edx: f6d25ef8
esi: d1c92bfc   edi: d1c92bfc   ebp: f36d3f88   esp: f36d3ef8
ds: 007b   es: 007b   ss: 0068
Process md0_raid5 (pid: 1435, threadinfo=f36d2000 task=f684a9d0)
Stack: f6d25ef8 f2f84ebc f302e000 00000020 f2f84fc0 f7127000 f712760c f36d3f30
       f315de3c f7101ef8 00000000 00000000 f36d3f3c f315df68 c04fde00 f7f9a9d0
       f684a9d0 df449de8 00000000 f315df94 00000000 00000000 00000001 00000000
Call Trace:
 [<c0388173>] raid5d+0x73/0x120
 [<c039048c>] md_thread+0xbc/0x180
 [<c0118ef0>] default_wake_function+0x0/0x30
 [<c03903d0>] md_thread+0x0/0x180
 [<c010750d>] kernel_thread_helper+0x5/0x18

Code: 8b 56 04 8b 48 58 8b 58 5c 8b 06 83 c1 08 83 d3 00 39 da 72

Entering kdb (current=0xf684a9d0, pid 1435) on processor 3 Oops: Oops
due to oops @ 0xc0387be6
eax = 0xf315df94 ebx = 0x00000000 ecx = 0x00000000 edx = 0xf6d25ef8
esi = 0xd1c92bfc edi = 0xd1c92bfc esp = 0xf36d3ef8 eip = 0xc0387be6
ebp = 0xf36d3f88 xss = 0xc0390068 xcs = 0x00000060 eflags = 0x00010086
xds = 0xf6d2007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf36d3ec4
[3]kdb> bt
Stack traceback for pid 1435
0xf684a9d0     1435        1  1    3   R  0xf684ad00 *md0_raid5
EBP        EIP        Function (args)
0xf36d3f88 0xc0387be6 handle_stripe+0xda6 (0xf315dea0, 0x292, 0xf36d2000, 0xf5e90578, 0xf5e90580)
                               kernel <NULL> 0x0 0xc0386e40 0xc0387d30
0xf36d3fa4 0xc0388173 raid5d+0x73 (0xf6d25ef8, 0x0, 0xf36d2000, 0xf36d2000, 0xf36d2000)
                               kernel <NULL> 0x0 0xc0388100 0xc0388220
0xf36d3fec 0xc039048c md_thread+0xbc
                               kernel <NULL> 0x0 0xc03903d0 0xc0390550
           0xc010750d kernel_thread_helper+0x5
                               kernel <NULL> 0x0 0xc0107508 0xc0107520
[3]kdb>



* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-03  3:32             ` Nathan Scott
@ 2003-12-03 17:13               ` Linus Torvalds
  0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2003-12-03 17:13 UTC (permalink / raw)
  To: Nathan Scott
  Cc: Jens Axboe, Kevin P. Fleming, LKML, Linux-raid maillist, linux-lvm



On Wed, 3 Dec 2003, Nathan Scott wrote:
>
> The XFS tests just tripped up a panic in raid5 in -test11 -- a kdb
> stacktrace follows.  Seems to be reproducible, but not always the
> same test that causes it.  And I haven't seen a double bio_put yet,
> this first problem keeps getting in the way I guess.

Ok, debugging this oops makes me _think_ that the problem comes from here:

	raid5.c: around line 1000:
			....
                            wbi = dev->written;
                            dev->written = NULL;
                            while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
                                    wbi2 = wbi->bi_next;
                                    if (--wbi->bi_phys_segments == 0) {
                                            md_write_end(conf->mddev);
                                            wbi->bi_next = return_bi;
                                            return_bi = wbi;
                                    }
                                    wbi = wbi2;
                            }
			....

where it appears that the "wbi->bi_sector" access takes a page fault,
probably due to DEBUG_PAGEALLOC. It appears that somebody has already
finished (and thus freed) that bio.

I dunno - I can't follow what that code does at all.

One problem is that the slab code - because it caches the slabs and shares
pages between different slab entries - will not trigger the bugs that
DEBUG_PAGEALLOC would show very easily. So here's my ugly hack once more,
to see if that makes the bug show up more repeatably and quickly. Nathan?

			Linus


-+- slab-debug-on-steroids -+-

NOTE! For this patch to make sense, you have to enable the page allocator
debugging thing (CONFIG_DEBUG_PAGEALLOC), and you have to live with the
fact that it wastes a _lot_ of memory.

There's another problem with this patch: if the bug is actually in the
slab code itself, this will obviously not find it, since it disables that
code entirely.

===== mm/slab.c 1.110 vs edited =====
--- 1.110/mm/slab.c	Tue Oct 21 22:10:10 2003
+++ edited/mm/slab.c	Mon Dec  1 15:29:06 2003
@@ -1906,6 +1906,21 @@

 static inline void * __cache_alloc (kmem_cache_t *cachep, int flags)
 {
+#if 1
+	void *ptr = (void*)__get_free_pages(flags, cachep->gfporder);
+	if (ptr) {
+		struct page *page = virt_to_page(ptr);
+		SET_PAGE_CACHE(page, cachep);
+		SET_PAGE_SLAB(page, 0x01020304);
+		if (cachep->ctor) {
+			unsigned long ctor_flags = SLAB_CTOR_CONSTRUCTOR;
+			if (!(flags & __GFP_WAIT))
+				ctor_flags |= SLAB_CTOR_ATOMIC;
+			cachep->ctor(ptr, cachep, ctor_flags);
+		}
+	}
+	return ptr;
+#else
 	unsigned long save_flags;
 	void* objp;
 	struct array_cache *ac;
@@ -1925,6 +1940,7 @@
 	local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, __builtin_return_address(0));
 	return objp;
+#endif
 }

 /*
@@ -2042,6 +2058,15 @@
  */
 static inline void __cache_free (kmem_cache_t *cachep, void* objp)
 {
+#if 1
+	{
+		struct page *page = virt_to_page(objp);
+		int order = cachep->gfporder;
+		if (cachep->dtor)
+			cachep->dtor(objp, cachep, 0);
+		__free_pages(page, order);
+	}
+#else
 	struct array_cache *ac = ac_data(cachep);

 	check_irq_off();
@@ -2056,6 +2081,7 @@
 		cache_flusharray(cachep, ac);
 		ac_entry(ac)[ac->avail++] = objp;
 	}
+#endif
 }

 /**



* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-02 18:23           ` Linus Torvalds
@ 2003-12-04  1:12             ` Simon Kirby
  2003-12-04  1:23               ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Simon Kirby @ 2003-12-04  1:12 UTC (permalink / raw)
  To: Linux-raid maillist
  Cc: Jens Axboe, Neil Brown, Kevin P. Fleming, LKML, Linus Torvalds,
	linux-lvm

On Tue, Dec 02, 2003 at 10:23:17AM -0800, Linus Torvalds wrote:

> On Tue, 2 Dec 2003, Jens Axboe wrote:
> >
> > Smells like a bio stacking problem in raid/dm then. I'll take a quick
> > look and see if anything obvious pops up, otherwise the maintainers of
> > those areas should take a closer look.
> 
> There are several other problem reports which start to smell like md/raid.

Btw, I had trouble creating a linear array (probably not very common)
with size > 2 TB.  I expected it to just complain, but it ended up
resulting in hard lockups.  With a slight size change (one drive
removed), it instead printed a double-fault Oops and all sorts of neat
stuff.  I suspect the hashing was overflowing and writing bits in
unexpected places.
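
The arithmetic makes the wraparound obvious: a sector offset past 2 TB
needs more than 32 bits, so a 32-bit "unsigned long" silently wraps.  A
quick userspace illustration (the numbers are examples only):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* ~2.5 TB expressed in 512-byte sectors */
	unsigned long long sectors = 5ULL << 30;	/* 5368709120 */
	uint32_t wrapped = (uint32_t)sectors;		/* i386 "unsigned long" */

	printf("real offset : %llu sectors (~2.5 TB)\n", sectors);
	printf("after wrap  : %u sectors (~%u GB)\n",
	       wrapped, wrapped / (2048 * 1024));
	return 0;
}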

In any event, this patch against 2.6.0-test11 compiles without warnings,
boots, and (bonus) actually works:

--- drivers/md/linear.c.orig	2003-10-29 08:16:35.000000000 -0800
+++ drivers/md/linear.c	2003-12-03 16:19:59.000000000 -0800
@@ -214,10 +214,11 @@
 		char b[BDEVNAME_SIZE];
 
 		printk("linear_make_request: Block %llu out of bounds on "
-			"dev %s size %ld offset %ld\n",
+			"dev %s size %lld offset %lld\n",
 			(unsigned long long)block,
 			bdevname(tmp_dev->rdev->bdev, b),
-			tmp_dev->size, tmp_dev->offset);
+			(unsigned long long)tmp_dev->size,
+			(unsigned long long)tmp_dev->offset);
 		bio_io_error(bio, bio->bi_size);
 		return 0;
 	}
--- include/linux/raid/linear.h.orig	2003-11-26 12:45:44.000000000 -0800
+++ include/linux/raid/linear.h	2003-12-03 16:18:00.000000000 -0800
@@ -5,8 +5,8 @@
 
 struct dev_info {
 	mdk_rdev_t	*rdev;
-	unsigned long	size;
-	unsigned long	offset;
+	sector_t	size;
+	sector_t	offset;
 };
 
 typedef struct dev_info dev_info_t;

Simon-


* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-04  1:12             ` Simon Kirby
@ 2003-12-04  1:23               ` Linus Torvalds
  2003-12-04  4:31                 ` Simon Kirby
  2003-12-04 20:53                 ` Herbert Xu
  0 siblings, 2 replies; 20+ messages in thread
From: Linus Torvalds @ 2003-12-04  1:23 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Linux-raid maillist, Jens Axboe, Neil Brown, Kevin P. Fleming,
	LKML, linux-lvm



On Wed, 3 Dec 2003, Simon Kirby wrote:
>
> In any event, this patch against 2.6.0-test11 compiles without warnings,
> boots, and (bonus) actually works:

Really? This actually makes a difference for you? I don't see why it
should matter: even if the sector offsets would overflow, why would that
cause _oopses_?

[ Insert theme to "The Twilight Zone" ]

Neil, Jens, any ideas?

		Linus


* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-04  1:23               ` Linus Torvalds
@ 2003-12-04  4:31                 ` Simon Kirby
  2003-12-05  6:55                   ` Theodore Ts'o
  2003-12-04 20:53                 ` Herbert Xu
  1 sibling, 1 reply; 20+ messages in thread
From: Simon Kirby @ 2003-12-04  4:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-raid maillist, Jens Axboe, Neil Brown, Kevin P. Fleming,
	LKML, linux-lvm

On Wed, Dec 03, 2003 at 05:23:02PM -0800, Linus Torvalds wrote:

> On Wed, 3 Dec 2003, Simon Kirby wrote:
> >
> > In any event, this patch against 2.6.0-test11 compiles without warnings,
> > boots, and (bonus) actually works:
> 
> Really? This actually makes a difference for you? I don't see why it
> should matter: even if the sector offsets would overflow, why would that
> cause _oopses_?
> 
> [ Insert theme to "The Twilight Zone" ]

Without the patches, the box gets as far as assembling the array and
activating it, but dies on "mke2fs".  Running mke2fs through strace shows
that it stops during the early stages, before it even tries to write
anything.  mke2fs appears to seek through the whole device and do a bunch
of small reads at various points, and as soon as it tries to read from an
offset > 2 TB, it hangs.

When I first tried this, something with the configuration caused it to
hang so that even nmi_watchdog didn't work.  I first assumed it was the
result of some sort of current spike from all of the drives working at
once, but after gettng it to work with an array size < 2 TB and after
seeing different strange Oopses with different total sizes (by removing
some drives), the problem appeared to be software-related.  I added some
printk()s and found the problem occurred shortly after an overflow in
linear.c:which_dev().

As soon as I saw the overflow I made the connection and corrected the
variable types, but I didn't bother to figure out why it decided to
blow up before.

I can put an unpatched kernel back on the box and do some more testing
if it would be helpful.

Simon-


* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-04  1:23               ` Linus Torvalds
  2003-12-04  4:31                 ` Simon Kirby
@ 2003-12-04 20:53                 ` Herbert Xu
  2003-12-04 21:06                   ` Linus Torvalds
  1 sibling, 1 reply; 20+ messages in thread
From: Herbert Xu @ 2003-12-04 20:53 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds <torvalds@osdl.org> wrote:
> 
> Really? This actually makes a difference for you? I don't see why it
> should matter: even if the sector offsets would overflow, why would that
> cause _oopses_?

Apart from the printk, he also changed dev_info_t which means that any
place that uses it will be using the 64-bit type now.
-- 
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email:  Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-04 20:53                 ` Herbert Xu
@ 2003-12-04 21:06                   ` Linus Torvalds
  0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2003-12-04 21:06 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linux-kernel



On Fri, 5 Dec 2003, Herbert Xu wrote:
>
> Linus Torvalds <torvalds@osdl.org> wrote:
> >
> > Really? This actually makes a difference for you? I don't see why it
> > should matter: even if the sector offsets would overflow, why would that
> > cause _oopses_?
>
> Apart from the printk, he also changed dev_info_t which means that any
> place that uses it will be using the 64-bit type now.

I wasn't looking at the printk, I was looking at those 64-bit types. My
argument was that while the small size is incorrect, it shouldn't cause
system stability issues per se - it should just cause IO to potentially
"wrap around" and go to the wrong place on disk.

Which is very serious in itself, of course - but what surprised me was the
quoted system stability things.

Anyway, that patch only matters for the LINEAR MD module, and only for
2TB+ aggregate disks at that, so it doesn't explain any of the other
problematic behaviour. Something else is up.

		Linus


* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
  2003-12-04  4:31                 ` Simon Kirby
@ 2003-12-05  6:55                   ` Theodore Ts'o
  0 siblings, 0 replies; 20+ messages in thread
From: Theodore Ts'o @ 2003-12-05  6:55 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Linus Torvalds, Linux-raid maillist, Jens Axboe, Neil Brown,
	Kevin P. Fleming, LKML, linux-lvm

On Wed, Dec 03, 2003 at 08:31:06PM -0800, Simon Kirby wrote:
> 
> Without the patches, the box gets as far as assembling the array and
> activating it, but dies on "mke2fs".  Running mke2fs through strace shows
> that it stops during the early stages, before it even tries to write
> anything.  mke2fs appears to seek through the whole device and do a bunch
> of small reads at various points, and as soon as it tries to read from an
> offset > 2 TB, it hangs.

It sounds like mke2fs tried using BLKGETSIZE ioctl, but given that
this returns the number of 512 byte sectors in a device in a 4 byte
word, the BLKGETSIZE ioctl quite rightly threw up its hands and said,
"sorry, I can't tell you the correct size."

mke2fs then fell back to its backup algorithm, which uses a modified
binary search to find the size of the device.  It starts by seeing if
the device is at least 1k, then checks whether it is at least 2k, 4k,
8k, 16k, 32k, 64k, 128k, etc.  So it sounds like it's dying when it
tries to seek past 2TB using llseek.

It would probably be worthwhile to write a little test program which
opens the disk, llseeks to 2TB+1, and then tries reading a byte.  If
that fails, then there's definitely a bug somewhere in the device
driver....
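
Something along these lines would do (a rough sketch; the device path is
just an example, adjust as needed):

#define _LARGEFILE64_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/md0";	/* example */
	char c;
	int fd = open64(dev, O_RDONLY);

	if (fd < 0) { perror("open64"); return 1; }

	/* seek to 2TB + 1 byte, then try to read a single byte */
	if (lseek64(fd, (2ULL << 40) + 1, SEEK_SET) == (off64_t)-1) {
		perror("lseek64");
		return 1;
	}
	if (read(fd, &c, 1) != 1) {
		perror("read");
		return 1;
	}
	printf("read one byte past 2TB: OK\n");
	return 0;
}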

						- Ted
