* Reproducable OOPS with MD RAID-5 on 2.6.0-test11
@ 2003-12-01 14:06 Kevin P. Fleming
2003-12-01 14:11 ` Jens Axboe
0 siblings, 1 reply; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-01 14:06 UTC (permalink / raw)
To: LKML, Linux-raid maillist
I've got a new system here with six SATA disks set up in a RAID-5 array
(no partition tables, using the whole disks). I then used LVM2 tools to
make the RAID array a physical volume, created a logical volume and
formatted that volume with an XFS filesystem.
Mounting the filesystem and copying over the 2.6 kernel source tree
produces this OOPS (and is pretty reproducible):
kernel BUG at fs/bio.c:177!
invalid operand: 0000 [#1]
CPU: 0
EIP: 0060:[<c014db9a>] Not tainted
EFLAGS: 00010246
EIP is at bio_put+0x2c/0x36
eax: 00000000 ebx: f6221080 ecx: c1182180 edx: edcbf780
esi: c577b998 edi: 00000002 ebp: edcbf780 esp: f78ffeb0
ds: 007b es: 007b ss: 0068
Process md0_raid5 (pid: 65, threadinfo=f78fe000 task=f7924080)
Stack: c71e2640 c021d88d edcbf780 00000000 00000001 c1182180 00000009
0001000
edcbf780 00000000 00000000 00000000 c014e2fc edcbf780 00000000
00000000
f23a0ff0 f23a0ff0 edcbf7c0 c02ca51d edcbf780 00000000 00000000
00000000
Call Trace:
[<c021d88d>] bio_end_io_pagebuf+0x9a/0x138
[<c014e2fc>] bio_endio+0x59/0x7e
[<c02ca51d>] clone_endio+0x82/0xb5
[<c02c0dc3>] handle_stripe+0x8f2/0xec0
[<c02c17d1>] raid5d+0x71/0x105
[<c02c898c>] md_thread+0xde/0x15c
[<c011984b>] default_wake_function+0x0/0x12
[<c02c88ae>] md_thread+0x0/0x15c
[<c0107049>] kernel_thread_helper+0x5/0xb
Code: 0f 0b b1 00 bc 94 34 c0 eb d8 56 53 83 ec 08 8b 44 24 18 8b
Hardware is a 2.6CGHz P4, 1G of RAM (4G highmem enabled), SMP kernel but
no preemption. Kernel config is at:
http://www.backtobasicsmgmt.com/bucky/bucky.config
(I'm subscribed to linux-kernel but not linux-raid, so please CC me on
any linux-raid responses. Thanks!)
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-01 14:06 Reproducable OOPS with MD RAID-5 on 2.6.0-test11 Kevin P. Fleming
@ 2003-12-01 14:11 ` Jens Axboe
2003-12-01 14:15 ` Kevin P. Fleming
2003-12-01 23:06 ` Reproducable OOPS with MD RAID-5 on 2.6.0-test11 - with XFS Neil Brown
0 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2003-12-01 14:11 UTC (permalink / raw)
To: Kevin P. Fleming; +Cc: LKML, Linux-raid maillist
On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> I've got a new system here with six SATA disks set up in a RAID-5 array
> (no partition tables, using the whole disks). I then used LVM2 tools to
> make the RAID array a physical volume, created a logical volume and
> formatted that volume with an XFS filesystem.
>
> Mounting the filesystem and copying over the 2.6 kernel source tree
> produces this OOPS (and is pretty reproducible):
>
> kernel BUG at fs/bio.c:177!
It's doing a put on an already freed bio, that's really bad.
> invalid operand: 0000 [#1]
> CPU: 0
> EIP: 0060:[<c014db9a>] Not tainted
> EFLAGS: 00010246
> EIP is at bio_put+0x2c/0x36
> eax: 00000000 ebx: f6221080 ecx: c1182180 edx: edcbf780
> esi: c577b998 edi: 00000002 ebp: edcbf780 esp: f78ffeb0
> ds: 007b es: 007b ss: 0068
> Process md0_raid5 (pid: 65, threadinfo=f78fe000 task=f7924080)
> Stack: c71e2640 c021d88d edcbf780 00000000 00000001 c1182180 00000009
> 0001000
> edcbf780 00000000 00000000 00000000 c014e2fc edcbf780 00000000
> 00000000
> f23a0ff0 f23a0ff0 edcbf7c0 c02ca51d edcbf780 00000000 00000000
> 00000000
> Call Trace:
> [<c021d88d>] bio_end_io_pagebuf+0x9a/0x138
> [<c014e2fc>] bio_endio+0x59/0x7e
> [<c02ca51d>] clone_endio+0x82/0xb5
> [<c02c0dc3>] handle_stripe+0x8f2/0xec0
> [<c02c17d1>] raid5d+0x71/0x105
> [<c02c898c>] md_thread+0xde/0x15c
> [<c011984b>] default_wake_function+0x0/0x12
> [<c02c88ae>] md_thread+0x0/0x15c
> [<c0107049>] kernel_thread_helper+0x5/0xb
Odds are it's a raid5 bug.
> Hardware is a 2.6CGHz P4, 1G of RAM (4G highmem enabled), SMP kernel but
> no preemption. Kernel config is at:
Are you using ide or libata as the backing for the sata drives?
--
Jens Axboe
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-01 14:11 ` Jens Axboe
@ 2003-12-01 14:15 ` Kevin P. Fleming
2003-12-01 15:51 ` Jens Axboe
2003-12-01 23:06 ` Reproducable OOPS with MD RAID-5 on 2.6.0-test11 - with XFS Neil Brown
1 sibling, 1 reply; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-01 14:15 UTC (permalink / raw)
To: Jens Axboe; +Cc: LKML, Linux-raid maillist
Jens Axboe wrote:
>>Hardware is a 2.6CGHz P4, 1G of RAM (4G highmem enabled), SMP kernel but
>>no preemption. Kernel config is at:
>
>
> Are you using ide or libata as the backing for the sata drives?
>
libata, two of the disks are on an ICH5 and the other four are on a
Promise SATA150 TX4.
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-01 14:15 ` Kevin P. Fleming
@ 2003-12-01 15:51 ` Jens Axboe
2003-12-02 4:02 ` Kevin P. Fleming
0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2003-12-01 15:51 UTC (permalink / raw)
To: Kevin P. Fleming; +Cc: LKML, Linux-raid maillist
On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> Jens Axboe wrote:
>
> >>Hardware is a 2.6CGHz P4, 1G of RAM (4G highmem enabled), SMP kernel but
> >>no preemption. Kernel config is at:
> >
> >
> >Are you using ide or libata as the backing for the sata drives?
> >
>
> libata, two of the disks are on an ICH5 and the other four are on a
> Promise SATA150 TX4.
Alright, so no bouncing should be happening. Could you boot with
mem=800m (and reproduce) just to rule it out completely?
--
Jens Axboe
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11 - with XFS
2003-12-01 14:11 ` Jens Axboe
2003-12-01 14:15 ` Kevin P. Fleming
@ 2003-12-01 23:06 ` Neil Brown
1 sibling, 0 replies; 20+ messages in thread
From: Neil Brown @ 2003-12-01 23:06 UTC (permalink / raw)
To: Jens Axboe; +Cc: Kevin P. Fleming, LKML, Linux-raid maillist, Eric Jensen
On Monday December 1, axboe@suse.de wrote:
> On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> > I've got a new system here with six SATA disks set up in a RAID-5 array
> > (no partition tables, using the whole disks). I then used LVM2 tools to
> > make the RAID array a physical volume, created a logical volume and
> > formatted that volume with an XFS filesystem.
> >
> > Mounting the filesystem and copying over the 2.6 kernel source tree
> > produces this OOPS (and is pretty reproducible):
> >
> > kernel BUG at fs/bio.c:177!
>
> It's doing a put on an already freed bio, that's really bad.
>
That makes 2 bug reports that seem to suggest that raid5 is calling
bi_end_io twice on the one bio.
The other one was from Eric Jensen <ej@xmission.com>
with Subject: PROBLEM: 2.6.0-test10 BUG/panic in mpage_end_io_read
on 26 Nov 2003
Both involve xfs and raid5.
I, of course, am tempted to blame xfs.....
In this case, I don't think that raid5 calling bi_end_io twice would
cause the problem, as the bi_end_io that raid5 calls is clone_endio,
and that has an atomic_t to make sure it only calls its own bi_end_io
(bio_end_io_pagebuf) once, even if it were called multiple times itself.
So I'm wondering if xfs might be doing something funny after
submitting the request to raid5... though I don't find that convincing
either.
In this report, the IO seems to have been requested from the
pagebuf stuff (fs/xfs/pagebuf/page_buf.c). In the other one it
is coming from mpage, presumably from inside xfs/linux/xfs_aops.c
These are very different code paths and are unlikely to share a bug
like this.
Which does tend to point the finger back at raid5. :-(
I'd love to see some more reports of similar bugs, in the hope that
they might shed some more light.
NeilBrown
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-01 15:51 ` Jens Axboe
@ 2003-12-02 4:02 ` Kevin P. Fleming
2003-12-02 4:15 ` Mike Fedyk
2003-12-02 8:27 ` Jens Axboe
0 siblings, 2 replies; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-02 4:02 UTC (permalink / raw)
To: Jens Axboe; +Cc: LKML, Linux-raid maillist, linux-lvm
Jens Axboe wrote:
> Alright, so no bouncing should be happening. Could you boot with
> mem=800m (and reproduce) just to rule it out completely?
Tested with mem=800m, problem still occurs. Additional test was done
without device-mapper in place, though, and I could not reproduce the
problem! I copied > 500MB of stuff to the XFS filesystem created using
the entire /dev/md/0 device without a single unusual message. I then
unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a
3G volume on the array, made an XFS filesystem on it, mounted it, and
tried copying data over. The oops message came back.
I'm copying this message to linux-lvm; the original oops message is
repeated below for the benefit of those list readers. I've got one more
round of testing to do (after the array resyncs itself), which is to try
a filesystem other than XFS.
----
kernel BUG at fs/bio.c:177!
invalid operand: 0000 [#1]
CPU: 0
EIP: 0060:[<c014db9a>] Not tainted
EFLAGS: 00010246
EIP is at bio_put+0x2c/0x36
eax: 00000000 ebx: f6221080 ecx: c1182180 edx: edcbf780
esi: c577b998 edi: 00000002 ebp: edcbf780 esp: f78ffeb0
ds: 007b es: 007b ss: 0068
Process md0_raid5 (pid: 65, threadinfo=f78fe000 task=f7924080)
Stack: c71e2640 c021d88d edcbf780 00000000 00000001 c1182180 00000009
0001000
edcbf780 00000000 00000000 00000000 c014e2fc edcbf780 00000000
00000000
f23a0ff0 f23a0ff0 edcbf7c0 c02ca51d edcbf780 00000000 00000000
00000000
Call Trace:
[<c021d88d>] bio_end_io_pagebuf+0x9a/0x138
[<c014e2fc>] bio_endio+0x59/0x7e
[<c02ca51d>] clone_endio+0x82/0xb5
[<c02c0dc3>] handle_stripe+0x8f2/0xec0
[<c02c17d1>] raid5d+0x71/0x105
[<c02c898c>] md_thread+0xde/0x15c
[<c011984b>] default_wake_function+0x0/0x12
[<c02c88ae>] md_thread+0x0/0x15c
[<c0107049>] kernel_thread_helper+0x5/0xb
Code: 0f 0b b1 00 bc 94 34 c0 eb d8 56 53 83 ec 08 8b 44 24 18 8b
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-02 4:02 ` Kevin P. Fleming
@ 2003-12-02 4:15 ` Mike Fedyk
2003-12-02 13:11 ` Kevin P. Fleming
2003-12-02 8:27 ` Jens Axboe
1 sibling, 1 reply; 20+ messages in thread
From: Mike Fedyk @ 2003-12-02 4:15 UTC (permalink / raw)
To: Kevin P. Fleming; +Cc: Jens Axboe, LKML, Linux-raid maillist, linux-lvm
On Mon, Dec 01, 2003 at 09:02:40PM -0700, Kevin P. Fleming wrote:
> Tested with mem=800m, problem still occurs. Additional test was done
> without device-mapper in place, though, and I could not reproduce the
> problem! I copied > 500MB of stuff to the XFS filesystem created using
> the entire /dev/md/0 device without a single unusual message. I then
> unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a
> 3G volume on the array, made an XFS filesystem on it, mounted it, and
> tried copying data over. The oops message came back.
Can you try with DM on a regular disk, instead of sw raid?
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-02 4:02 ` Kevin P. Fleming
2003-12-02 4:15 ` Mike Fedyk
@ 2003-12-02 8:27 ` Jens Axboe
2003-12-02 10:10 ` Nathan Scott
2003-12-02 18:23 ` Linus Torvalds
1 sibling, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2003-12-02 8:27 UTC (permalink / raw)
To: Kevin P. Fleming; +Cc: LKML, Linux-raid maillist, linux-lvm
On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> Jens Axboe wrote:
>
> >Alright, so no bouncing should be happening. Could you boot with
> >mem=800m (and reproduce) just to rule it out completely?
>
> Tested with mem=800m, problem still occurs. Additional test was done
Suspected as much, just wanted to make sure.
> without device-mapper in place, though, and I could not reproduce the
> problem! I copied > 500MB of stuff to the XFS filesystem created using
> the entire /dev/md/0 device without a single unusual message. I then
> unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a
> 3G volume on the array, made an XFS filesystem on it, mounted it, and
> tried copying data over. The oops message came back.
Smells like a bio stacking problem in raid/dm then. I'll take a quick
look and see if anything obvious pops up, otherwise the maintainers of
those areas should take a closer look.
> I'm copying this message to linux-lvm; the original oops message is
> repeated below for the benefit of those list readers. I've got one more
> round of testing to do (after the array resyncs itself), which is to try
> a filesystem other than XFS.
That might be a good idea, although it's not very likely to be an XFS
problem as it happens further down the io stack. It should trigger just
as happily on IDE or SCSI if that was the case.
--
Jens Axboe
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-02 8:27 ` Jens Axboe
@ 2003-12-02 10:10 ` Nathan Scott
2003-12-02 13:15 ` Kevin P. Fleming
2003-12-03 3:32 ` Nathan Scott
2003-12-02 18:23 ` Linus Torvalds
1 sibling, 2 replies; 20+ messages in thread
From: Nathan Scott @ 2003-12-02 10:10 UTC (permalink / raw)
To: Jens Axboe; +Cc: Kevin P. Fleming, LKML, Linux-raid maillist, linux-lvm
On Tue, Dec 02, 2003 at 09:27:13AM +0100, Jens Axboe wrote:
> On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> >
> > without device-mapper in place, though, and I could not reproduce the
> > problem! I copied > 500MB of stuff to the XFS filesystem created using
> > the entire /dev/md/0 device without a single unusual message. I then
> > unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a
> > 3G volume on the array, made an XFS filesystem on it, mounted it, and
> > tried copying data over. The oops message came back.
>
> Smells like a bio stacking problem in raid/dm then. I'll take a quick
> look and see if anything obvious pops up, otherwise the maintainers of
> those areas should take a closer look.
One thing that might be of interest - XFS does tend to pass
variable-sized requests down to the block layer, and this has
tripped up md and other drivers in 2.4 in the distant past.
Log IO is typically 512 byte aligned (as opposed to block or
page size aligned), as are IOs into several of XFS' metadata
structures.
> > I'm copying this message to linux-lvm; the original oops message is
> > repeated below for the benefit of those list readers. I've got one more
> > round of testing to do (after the array resyncs itself), which is to try
> > a filesystem other than XFS.
>
> That might be a good idea, although it's not very likely to be an XFS
> problem as it happens further down the io stack. It should trigger just
> as happily on IDE or SCSI if that was the case.
I would tend to agree (but will happily fix things if proven
wrong ;) - I took a brief look through dm & md this afternoon
but nothing obvious jumped out at me. I'm not particularly
familiar with that code though.
cheers.
--
Nathan
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-02 4:15 ` Mike Fedyk
@ 2003-12-02 13:11 ` Kevin P. Fleming
0 siblings, 0 replies; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-02 13:11 UTC (permalink / raw)
To: Mike Fedyk; +Cc: Jens Axboe, LKML, Linux-raid maillist, linux-lvm
Mike Fedyk wrote:
> Can you try with DM on a regular disk, instead of sw raid?
Tested, failure does not occur.
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-02 10:10 ` Nathan Scott
@ 2003-12-02 13:15 ` Kevin P. Fleming
2003-12-03 3:32 ` Nathan Scott
1 sibling, 0 replies; 20+ messages in thread
From: Kevin P. Fleming @ 2003-12-02 13:15 UTC (permalink / raw)
To: Nathan Scott; +Cc: Jens Axboe, LKML, Linux-raid maillist, linux-lvm
Nathan Scott wrote:
> One thing that might be of interest - XFS does tend to pass
> variable size requests down to the block layer, and this has
> tripped up md and other drivers in 2.4 in the distant past.
>
> Log IO is typically 512 byte aligned (as opposed to block or
> page size aligned), as are IOs into several of XFS' metadata
> structures.
Hey, thanks for the pointer! I think we're getting somewhere now. Here's
a recap of the tested combinations:
XFS on raw disk: OK
XFS on LVM2 on single disk: OK
XFS on LVM2 on RAID-5: fails
ext2 on LVM2 on RAID-5: OK
I just tested XFS on LVM2 on RAID-5 using "-l sunit=8" while creating
the filesystem to force log writes to be block-sized and block-aligned;
this seems to work :-) I have not been able to force a failure using my
test script, although ATM the system is still running a RAID-5 resync of
the array, but that should only make the problem more likely, not less.
So, this does appear to be an md/dm stacking problem that is exposed by
XFS sending non-block-sized and/or non-block-aligned IOs.
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-02 8:27 ` Jens Axboe
2003-12-02 10:10 ` Nathan Scott
@ 2003-12-02 18:23 ` Linus Torvalds
2003-12-04 1:12 ` Simon Kirby
1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2003-12-02 18:23 UTC (permalink / raw)
To: Jens Axboe, Neil Brown
Cc: Kevin P. Fleming, LKML, Linux-raid maillist, linux-lvm
On Tue, 2 Dec 2003, Jens Axboe wrote:
>
> Smells like a bio stacking problem in raid/dm then. I'll take a quick
> look and see if anything obvious pops up, otherwise the maintainers of
> those areas should take a closer look.
There are several other problem reports which start to smell like md/raid.
> That might be a good idea, although it's not very likely to be an XFS
> problem as it happens further down the io stack. It should trigger just
> as happily on IDE or SCSI if that was the case.
There's one (by Alan Buxey) that I attributed to PREEMPT which happens on
UP with ext3 and raid0:
md: Autodetecting RAID arrays.
md: autorun ...
md: considering hdd1 ...
md: adding hdd1 ...
md: adding hda1 ...
md: created md0
md: bind<hda1>
md: bind<hdd1>
md: running: <hdd1><hda1>
raid1: raid set md0 active with 2 out of 2 mirrors
md: ... autorun DONE.
and that one shows strange memory corruption problems too.
NOTE! The fact that it only happens with PREEMPT for some people is not
necessarily a sign of preempt-only trouble: PREEMPT should really be
equivalent to SMP-safe, but there are some things that are much more
likely with preemption than with normal SMP.
In particular, preempt will cause every single (final) unlock to check
whether there is something else runnable with a higher priority, so it
opens up races a lot - if you touch something just outside (in particular:
_after_) the locked region, preempt is much more likely to show a race
that on SMP might be just a few instructions long.
Linus
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-02 10:10 ` Nathan Scott
2003-12-02 13:15 ` Kevin P. Fleming
@ 2003-12-03 3:32 ` Nathan Scott
2003-12-03 17:13 ` Linus Torvalds
1 sibling, 1 reply; 20+ messages in thread
From: Nathan Scott @ 2003-12-03 3:32 UTC (permalink / raw)
To: Jens Axboe, Kevin P. Fleming; +Cc: LKML, Linux-raid maillist, linux-lvm
On Tue, Dec 02, 2003 at 09:10:02PM +1100, Nathan Scott wrote:
> On Tue, Dec 02, 2003 at 09:27:13AM +0100, Jens Axboe wrote:
> > On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> > >
> > > without device-mapper in place, though, and I could not reproduce the
> > > problem! I copied > 500MB of stuff to the XFS filesystem created using
> > > the entire /dev/md/0 device without a single unusual message. I then
> > > unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a
> > > 3G volume on the array, made an XFS filesystem on it, mounted it, and
> > > tried copying data over. The oops message came back.
> >
> > Smells like a bio stacking problem in raid/dm then. I'll take a quick
> > look and see if anything obvious pops up, otherwise the maintainers of
> > those areas should take a closer look.
>
> One thing that might be of interest - XFS does tend to pass
> variable size requests down to the block layer, and this has
> tripped up md and other drivers in 2.4 in the distant past.
>
> Log IO is typically 512 byte aligned (as opposed to block or
> page size aligned), as are IOs into several of XFS' metadata
> structures.
The XFS tests just tripped up a panic in raid5 in -test11 -- a kdb
stacktrace follows. Seems to be reproducible, but not always the
same test that causes it. And I haven't seen a double bio_put yet,
this first problem keeps getting in the way I guess.
Looks like it's in a raid5 kernel thread, doing asynchronous stuff, so
I don't really have any extra hints about what XFS was doing at the
time for y'all either.
cheers.
--
Nathan
XFS mounting filesystem md0
Unable to handle kernel paging request at virtual address d1c92c00
printing eip:
c0387be6
*pde = 00048067
*pte = 11c92000
Oops: 0000 [#1]
CPU: 3
EIP: 0060:[<c0387be6>] Not tainted
EFLAGS: 00010086
EIP is at handle_stripe+0xda6/0xef0
eax: f315df94 ebx: 00000000 ecx: 00000000 edx: f6d25ef8
esi: d1c92bfc edi: d1c92bfc ebp: f36d3f88 esp: f36d3ef8
ds: 007b es: 007b ss: 0068
Process md0_raid5 (pid: 1435, threadinfo=f36d2000 task=f684a9d0)
Stack: f6d25ef8 f2f84ebc f302e000 00000020 f2f84fc0 f7127000 f712760c f36d3f30
f315de3c f7101ef8 00000000 00000000 f36d3f3c f315df68 c04fde00 f7f9a9d0
f684a9d0 df449de8 00000000 f315df94 00000000 00000000 00000001 00000000
Call Trace:
[<c0388173>] raid5d+0x73/0x120
[<c039048c>] md_thread+0xbc/0x180
[<c0118ef0>] default_wake_function+0x0/0x30
[<c03903d0>] md_thread+0x0/0x180
[<c010750d>] kernel_thread_helper+0x5/0x18
Code: 8b 56 04 8b 48 58 8b 58 5c 8b 06 83 c1 08 83 d3 00 39 da 72
Entering kdb (current=0xf684a9d0, pid 1435) on processor 3 Oops: Oops
due to oops @ 0xc0387be6
eax = 0xf315df94 ebx = 0x00000000 ecx = 0x00000000 edx = 0xf6d25ef8
esi = 0xd1c92bfc edi = 0xd1c92bfc esp = 0xf36d3ef8 eip = 0xc0387be6
ebp = 0xf36d3f88 xss = 0xc0390068 xcs = 0x00000060 eflags = 0x00010086
xds = 0xf6d2007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf36d3ec4
[3]kdb> bt
Stack traceback for pid 1435
0xf684a9d0 1435 1 1 3 R 0xf684ad00 *md0_raid5
EBP EIP Function (args)
0xf36d3f88 0xc0387be6 handle_stripe+0xda6 (0xf315dea0, 0x292, 0xf36d2000, 0xf5e90578, 0xf5e90580)
kernel <NULL> 0x0 0xc0386e40 0xc0387d30
0xf36d3fa4 0xc0388173 raid5d+0x73 (0xf6d25ef8, 0x0, 0xf36d2000, 0xf36d2000, 0xf36d2000)
kernel <NULL> 0x0 0xc0388100 0xc0388220
0xf36d3fec 0xc039048c md_thread+0xbc
kernel <NULL> 0x0 0xc03903d0 0xc0390550
0xc010750d kernel_thread_helper+0x5
kernel <NULL> 0x0 0xc0107508 0xc0107520
[3]kdb>
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-03 3:32 ` Nathan Scott
@ 2003-12-03 17:13 ` Linus Torvalds
0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2003-12-03 17:13 UTC (permalink / raw)
To: Nathan Scott
Cc: Jens Axboe, Kevin P. Fleming, LKML, Linux-raid maillist, linux-lvm
On Wed, 3 Dec 2003, Nathan Scott wrote:
>
> The XFS tests just tripped up a panic in raid5 in -test11 -- a kdb
> stacktrace follows. Seems to be reproducible, but not always the
> same test that causes it. And I haven't seen a double bio_put yet,
> this first problem keeps getting in the way I guess.
Ok, debugging this oops makes me _think_ that the problem comes from here:
raid5.c: around line 1000:

	....
	wbi = dev->written;
	dev->written = NULL;
	while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
		wbi2 = wbi->bi_next;
		if (--wbi->bi_phys_segments == 0) {
			md_write_end(conf->mddev);
			wbi->bi_next = return_bi;
			return_bi = wbi;
		}
		wbi = wbi2;
	}
	....
where it appears that the "wbi->bi_sector" access takes a page fault,
probably due to PAGE_ALLOC debugging. It appears that somebody has already
finished (and thus free'd) that bio.
I dunno - I can't follow what that code does at all.
One problem is that the slab code - because it caches the slabs and shares
pages between different slab entries - will not trigger the bugs that
DEBUG_PAGEALLOC would show very easily. So here's my ugly hack once more,
to see if that makes the bug show up more repeatably and quicker. Nathan?
Linus
-+- slab-debug-on-steroids -+-
NOTE! For this patch to make sense, you have to enable the page allocator
debugging thing (CONFIG_DEBUG_PAGEALLOC), and you have to live with the
fact that it wastes a _lot_ of memory.
There's another problem with this patch: if the bug is actually in the
slab code itself, this will obviously not find it, since it disables that
code entirely.
===== mm/slab.c 1.110 vs edited =====
--- 1.110/mm/slab.c Tue Oct 21 22:10:10 2003
+++ edited/mm/slab.c Mon Dec 1 15:29:06 2003
@@ -1906,6 +1906,21 @@
static inline void * __cache_alloc (kmem_cache_t *cachep, int flags)
{
+#if 1
+ void *ptr = (void*)__get_free_pages(flags, cachep->gfporder);
+ if (ptr) {
+ struct page *page = virt_to_page(ptr);
+ SET_PAGE_CACHE(page, cachep);
+ SET_PAGE_SLAB(page, 0x01020304);
+ if (cachep->ctor) {
+ unsigned long ctor_flags = SLAB_CTOR_CONSTRUCTOR;
+ if (!(flags & __GFP_WAIT))
+ ctor_flags |= SLAB_CTOR_ATOMIC;
+ cachep->ctor(ptr, cachep, ctor_flags);
+ }
+ }
+ return ptr;
+#else
unsigned long save_flags;
void* objp;
struct array_cache *ac;
@@ -1925,6 +1940,7 @@
local_irq_restore(save_flags);
objp = cache_alloc_debugcheck_after(cachep, flags, objp, __builtin_return_address(0));
return objp;
+#endif
}
/*
@@ -2042,6 +2058,15 @@
*/
static inline void __cache_free (kmem_cache_t *cachep, void* objp)
{
+#if 1
+ {
+ struct page *page = virt_to_page(objp);
+ int order = cachep->gfporder;
+ if (cachep->dtor)
+ cachep->dtor(objp, cachep, 0);
+ __free_pages(page, order);
+ }
+#else
struct array_cache *ac = ac_data(cachep);
check_irq_off();
@@ -2056,6 +2081,7 @@
cache_flusharray(cachep, ac);
ac_entry(ac)[ac->avail++] = objp;
}
+#endif
}
/**
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-02 18:23 ` Linus Torvalds
@ 2003-12-04 1:12 ` Simon Kirby
2003-12-04 1:23 ` Linus Torvalds
0 siblings, 1 reply; 20+ messages in thread
From: Simon Kirby @ 2003-12-04 1:12 UTC (permalink / raw)
To: Linux-raid maillist
Cc: Jens Axboe, Neil Brown, Kevin P. Fleming, LKML, Linus Torvalds,
linux-lvm
On Tue, Dec 02, 2003 at 10:23:17AM -0800, Linus Torvalds wrote:
> On Tue, 2 Dec 2003, Jens Axboe wrote:
> >
> > Smells like a bio stacking problem in raid/dm then. I'll take a quick
> > look and see if anything obvious pops up, otherwise the maintainers of
> > those areas should take a closer look.
>
> There are several other problem reports which start to smell like md/raid.
Btw, I had trouble creating a linear array (probably not very common)
with size > 2 TB. I expected it to just complain, but it ended up
resulting in hard lockups. With a slight size change (removed one
drive), it ended up printing a double fault Oops, and all sorts of neat
stuff. I suspect the hashing was overflowing and writing bits in
unexpected places.
In any event, this patch against 2.6.0-test11 compiles without warnings,
boots, and (bonus) actually works:
--- drivers/md/linear.c.orig 2003-10-29 08:16:35.000000000 -0800
+++ drivers/md/linear.c 2003-12-03 16:19:59.000000000 -0800
@@ -214,10 +214,11 @@
char b[BDEVNAME_SIZE];
printk("linear_make_request: Block %llu out of bounds on "
- "dev %s size %ld offset %ld\n",
+ "dev %s size %lld offset %lld\n",
(unsigned long long)block,
bdevname(tmp_dev->rdev->bdev, b),
- tmp_dev->size, tmp_dev->offset);
+ (unsigned long long)tmp_dev->size,
+ (unsigned long long)tmp_dev->offset);
bio_io_error(bio, bio->bi_size);
return 0;
}
--- include/linux/raid/linear.h.orig 2003-11-26 12:45:44.000000000 -0800
+++ include/linux/raid/linear.h 2003-12-03 16:18:00.000000000 -0800
@@ -5,8 +5,8 @@
struct dev_info {
mdk_rdev_t *rdev;
- unsigned long size;
- unsigned long offset;
+ sector_t size;
+ sector_t offset;
};
typedef struct dev_info dev_info_t;
Simon-
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-04 1:12 ` Simon Kirby
@ 2003-12-04 1:23 ` Linus Torvalds
2003-12-04 4:31 ` Simon Kirby
2003-12-04 20:53 ` Herbert Xu
0 siblings, 2 replies; 20+ messages in thread
From: Linus Torvalds @ 2003-12-04 1:23 UTC (permalink / raw)
To: Simon Kirby
Cc: Linux-raid maillist, Jens Axboe, Neil Brown, Kevin P. Fleming,
LKML, linux-lvm
On Wed, 3 Dec 2003, Simon Kirby wrote:
>
> In any event, this patch against 2.6.0-test11 compiles without warnings,
> boots, and (bonus) actually works:
Really? This actually makes a difference for you? I don't see why it
should matter: even if the sector offsets would overflow, why would that
cause _oopses_?
[ Insert theme to "The Twilight Zone" ]
Neil, Jens, any ideas?
Linus
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-04 1:23 ` Linus Torvalds
@ 2003-12-04 4:31 ` Simon Kirby
2003-12-05 6:55 ` Theodore Ts'o
2003-12-04 20:53 ` Herbert Xu
1 sibling, 1 reply; 20+ messages in thread
From: Simon Kirby @ 2003-12-04 4:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux-raid maillist, Jens Axboe, Neil Brown, Kevin P. Fleming,
LKML, linux-lvm
On Wed, Dec 03, 2003 at 05:23:02PM -0800, Linus Torvalds wrote:
> On Wed, 3 Dec 2003, Simon Kirby wrote:
> >
> > In any event, this patch against 2.6.0-test11 compiles without warnings,
> > boots, and (bonus) actually works:
>
> Really? This actually makes a difference for you? I don't see why it
> should matter: even if the sector offsets would overflow, why would that
> cause _oopses_?
>
> [ Insert theme to "The Twilight Zone" ]
Without the patches, the box gets as far as assembling the array and
activating it, but dies on "mke2fs". Running mke2fs through strace shows
that it stops during the early stages, before it even tries to write
anything. mke2fs appears to seek through the whole device and do a bunch
of small reads at various points, and as soon as it tries to read from an
offset > 2 TB, it hangs.
When I first tried this, something with the configuration caused it to
hang so that even nmi_watchdog didn't work. I first assumed it was the
result of some sort of current spike from all of the drives working at
once, but after getting it to work with an array size < 2 TB and after
seeing different strange Oopses with different total sizes (by removing
some drives), the problem appeared to be software-related. I added some
printk()s and found the problem occurred shortly after an overflow in
linear.c:which_dev().
As soon as I saw the overflow I made the connection and corrected the
variable types, but I didn't bother to figure out why it decided to
blow up before.
I can put an unpatched kernel back on the box and do some more testing
if it would be helpful.
Simon-
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-04 1:23 ` Linus Torvalds
2003-12-04 4:31 ` Simon Kirby
@ 2003-12-04 20:53 ` Herbert Xu
2003-12-04 21:06 ` Linus Torvalds
1 sibling, 1 reply; 20+ messages in thread
From: Herbert Xu @ 2003-12-04 20:53 UTC (permalink / raw)
To: Linus Torvalds, linux-kernel
Linus Torvalds <torvalds@osdl.org> wrote:
>
> Really? This actually makes a difference for you? I don't see why it
> should matter: even if the sector offsets would overflow, why would that
> cause _oopses_?
Apart from the printk, he also changed dev_info_t which means that any
place that uses it will be using the 64-bit type now.
--
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-04 20:53 ` Herbert Xu
@ 2003-12-04 21:06 ` Linus Torvalds
0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2003-12-04 21:06 UTC (permalink / raw)
To: Herbert Xu; +Cc: linux-kernel
On Fri, 5 Dec 2003, Herbert Xu wrote:
>
> Linus Torvalds <torvalds@osdl.org> wrote:
> >
> > Really? This actually makes a difference for you? I don't see why it
> > should matter: even if the sector offsets would overflow, why would that
> > cause _oopses_?
>
> Apart from the printk, he also changed dev_info_t which means that any
> place that uses it will be using the 64-bit type now.
I wasn't looking at the printk, I was looking at those 64-bit types. My
argument was that while the small size is incorrect, it shouldn't cause
system stability issues per se - it should just cause IO to potentially
"wrap around" and go to the wrong place on disk.
Which is very serious in itself, of course - but what surprised me was the
quoted system stability things.
Anyway, that patch only matters for the LINEAR MD module, and only for
2TB+ aggregate disks at that, so it doesn't explain any of the other
problematic behaviour. Something else is up.
Linus
* Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11
2003-12-04 4:31 ` Simon Kirby
@ 2003-12-05 6:55 ` Theodore Ts'o
0 siblings, 0 replies; 20+ messages in thread
From: Theodore Ts'o @ 2003-12-05 6:55 UTC (permalink / raw)
To: Simon Kirby
Cc: Linus Torvalds, Linux-raid maillist, Jens Axboe, Neil Brown,
Kevin P. Fleming, LKML, linux-lvm
On Wed, Dec 03, 2003 at 08:31:06PM -0800, Simon Kirby wrote:
>
> Without the patches, the box gets as far as assembling the array and
> activating it, but dies on "mke2fs". Running mke2fs through strace shows
> that it stops during the early stages, before it even tries to write
> anything. mke2fs appears to seek through the whole device and do a bunch
> of small reads at various points, and as soon as it tries to read from an
> offset > 2 TB, it hangs.
It sounds like mke2fs tried the BLKGETSIZE ioctl, but since that returns
the number of 512-byte sectors in a device as a 4-byte word, the ioctl
quite rightly threw up its hands and said, "sorry, I can't tell you the
correct size."
mke2fs then fell back to its backup algorithm, which uses a modified
binary search to find the size of the device. It starts by checking
whether the device is at least 1k, then whether it is at least 2k, 4k,
8k, 16k, 32k, 64k, 128k, etc. So it sounds like it's dying when it
tries to seek past 2TB using llseek().
It would probably be worthwhile to write a little test program which
opens the disk, llseeks to 2TB+1, and then tries reading a byte. If
that fails, then there's definitely a bug somewhere in the device
driver....
- Ted
end of thread, other threads:[~2003-12-05 6:55 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
2003-12-01 14:06 Reproducable OOPS with MD RAID-5 on 2.6.0-test11 Kevin P. Fleming
2003-12-01 14:11 ` Jens Axboe
2003-12-01 14:15 ` Kevin P. Fleming
2003-12-01 15:51 ` Jens Axboe
2003-12-02 4:02 ` Kevin P. Fleming
2003-12-02 4:15 ` Mike Fedyk
2003-12-02 13:11 ` Kevin P. Fleming
2003-12-02 8:27 ` Jens Axboe
2003-12-02 10:10 ` Nathan Scott
2003-12-02 13:15 ` Kevin P. Fleming
2003-12-03 3:32 ` Nathan Scott
2003-12-03 17:13 ` Linus Torvalds
2003-12-02 18:23 ` Linus Torvalds
2003-12-04 1:12 ` Simon Kirby
2003-12-04 1:23 ` Linus Torvalds
2003-12-04 4:31 ` Simon Kirby
2003-12-05 6:55 ` Theodore Ts'o
2003-12-04 20:53 ` Herbert Xu
2003-12-04 21:06 ` Linus Torvalds
2003-12-01 23:06 ` Reproducable OOPS with MD RAID-5 on 2.6.0-test11 - with XFS Neil Brown