* RAID 10 resync leading to attempt to access beyond end of device
@ 2007-02-14 22:08 John Stilson
  2007-02-14 23:37 ` Neil Brown
  0 siblings, 1 reply; 8+ messages in thread
From: John Stilson @ 2007-02-14 22:08 UTC (permalink / raw)
To: linux-raid

Hi,

I'm experiencing what appears to be a kernel bug in the raid10 driver:
immediately after a resync completes, an access beyond the end of the
rebuilt disk is attempted, which causes the disk to be failed.

The system is a single-processor dual-core Xeon 3000 at 1.86GHz. It has
four 250GB drives, two each on two channels of an Intel ICH7. It's
running Fedora Core 4 with a custom-compiled unpatched 2.6.20 kernel. I
can provide the kernel itself, config, etc. on request.

Here is the full dmesg output from where /dev/sdc1, part of /dev/md0,
was intentionally failed and re-added using:

  mdadm /dev/md0 -f /dev/sdc1
  mdadm /dev/md0 -r /dev/sdc1
  mdadm /dev/md0 -a /dev/sdc1

Feb 14 16:20:18 testsvr kernel: raid10: Disk failure on sdc1, disabling device.
Feb 14 16:20:18 testsvr kernel: Operation continuing on 3 devices
Feb 14 16:20:18 testsvr kernel: RAID10 conf printout:
Feb 14 16:20:18 testsvr kernel:  --- wd:3 rd:4
Feb 14 16:20:18 testsvr kernel:  disk 0, wo:0, o:1, dev:sda9
Feb 14 16:20:18 testsvr kernel:  disk 1, wo:0, o:1, dev:sdb1
Feb 14 16:20:18 testsvr kernel:  disk 2, wo:1, o:0, dev:sdc1
Feb 14 16:20:18 testsvr kernel:  disk 3, wo:0, o:1, dev:sdd1
Feb 14 16:20:18 testsvr kernel: RAID10 conf printout:
Feb 14 16:20:18 testsvr kernel:  --- wd:3 rd:4
Feb 14 16:20:18 testsvr kernel:  disk 0, wo:0, o:1, dev:sda9
Feb 14 16:20:18 testsvr kernel:  disk 1, wo:0, o:1, dev:sdb1
Feb 14 16:20:18 testsvr kernel:  disk 3, wo:0, o:1, dev:sdd1
Feb 14 16:20:20 testsvr kernel: md: unbind<sdc1>
Feb 14 16:20:20 testsvr kernel: md: export_rdev(sdc1)
Feb 14 16:20:23 testsvr kernel: md: bind<sdc1>
Feb 14 16:20:23 testsvr kernel: RAID10 conf printout:
Feb 14 16:20:23 testsvr kernel:  --- wd:3 rd:4
Feb 14 16:20:23 testsvr kernel:  disk 0, wo:0, o:1, dev:sda9
Feb 14 16:20:23 testsvr kernel:  disk 1, wo:0, o:1, dev:sdb1
Feb 14 16:20:23 testsvr kernel:  disk 2, wo:1, o:1, dev:sdc1
Feb 14 16:20:23 testsvr kernel:  disk 3, wo:0, o:1, dev:sdd1
Feb 14 16:20:23 testsvr kernel: md: recovery of RAID array md0
Feb 14 16:20:23 testsvr kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Feb 14 16:20:23 testsvr kernel: md: using maximum available idle IO bandwidth (but not more than 40000 KB/sec) for recovery.
Feb 14 16:20:23 testsvr kernel: md: using 128k window, over a total of 8040320 blocks.
Feb 14 16:23:45 testsvr kernel: md: md0: recovery done.
Feb 14 16:23:45 testsvr kernel: attempt to access beyond end of device
Feb 14 16:23:45 testsvr kernel: sdc1: rw=1, want=901904331651136, limit=16081002
Feb 14 16:23:45 testsvr kernel: raid10: Disk failure on sdc1, disabling device.
Feb 14 16:23:45 testsvr kernel: Operation continuing on 3 devices
Feb 14 16:23:45 testsvr kernel: RAID10 conf printout:
Feb 14 16:23:45 testsvr kernel:  --- wd:3 rd:4
Feb 14 16:23:45 testsvr kernel:  disk 0, wo:0, o:1, dev:sda9
Feb 14 16:23:45 testsvr kernel:  disk 1, wo:0, o:1, dev:sdb1
Feb 14 16:23:45 testsvr kernel:  disk 2, wo:1, o:0, dev:sdc1
Feb 14 16:23:45 testsvr kernel:  disk 3, wo:0, o:1, dev:sdd1
Feb 14 16:23:45 testsvr kernel: RAID10 conf printout:
Feb 14 16:23:45 testsvr kernel:  --- wd:3 rd:4
Feb 14 16:23:45 testsvr kernel:  disk 0, wo:0, o:1, dev:sda9
Feb 14 16:23:45 testsvr kernel:  disk 1, wo:0, o:1, dev:sdb1
Feb 14 16:23:45 testsvr kernel:  disk 3, wo:0, o:1, dev:sdd1

I made the kernel OOPS during handle_bad_sector in ll_rw_blk.c to try
to get a backtrace; however, the backtrace looks mildly suspicious, so
I think it may not be a good indicator.
Here it is anyway:

Feb 13 14:25:23 testsvr kernel: Oops: 0000 [#1]
Feb 13 14:25:23 testsvr kernel: SMP
Feb 13 14:25:23 testsvr kernel: CPU:    0
Feb 13 14:25:23 testsvr kernel: EIP:    0060:[<c022b55a>]    Not tainted VLI
Feb 13 14:25:23 testsvr kernel: EFLAGS: 00010296   (2.6.19.1 #3)
Feb 13 14:25:23 testsvr kernel: EIP is at handle_bad_sector+0x96/0xf0
Feb 13 14:25:23 testsvr kernel: eax: 00000039   ebx: 00000001   ecx: f6a7c9c0   edx: 00000082
Feb 13 14:25:23 testsvr kernel: esi: 00000000   edi: f6a7c9c0   ebp: f7451e58   esp: f7451df4
Feb 13 14:25:23 testsvr kernel: ds: 007b   es: 007b   ss: 0068
Feb 13 14:25:23 testsvr kernel: Process md0_raid10 (pid: 2267, ti=f7450000 task=f6ed2550 task.ti=f7450000)
Feb 13 14:25:23 testsvr kernel: Stack: c044c950 f7451e2c 00000001 00000102 f7ee0208 00f5606a 00000000 00000002
Feb 13 14:25:23 testsvr kernel:        f7fb0408 eac0d400 00000001 00000102 f7ee0208 00000001 31646473 00000000
Feb 13 14:25:23 testsvr kernel:        f6e80000 00000086 c0124ce1 00000086 f6e81bc0 f7fb0408 f7ee0208 f6a7c9c0
Feb 13 14:25:23 testsvr kernel: Call Trace:
Feb 13 14:25:23 testsvr kernel:  [<c0124ce1>] __mod_timer+0x8e/0xa5
Feb 13 14:25:23 testsvr kernel:  [<c022b618>] generic_make_request+0x64/0x21e
Feb 13 14:25:23 testsvr kernel:  [<c0238369>] kobject_release+0x0/0x17
Feb 13 14:25:23 testsvr kernel:  [<c02de475>] scsi_request_fn+0x15b/0x36e
Feb 13 14:25:23 testsvr kernel:  [<c022c421>] generic_unplug_device+0x1b/0x2a
Feb 13 14:25:23 testsvr kernel:  [<c03315d7>] unplug_slaves+0x5c/0xa2
Feb 13 14:25:23 testsvr kernel:  [<c033341f>] raid10d+0x564/0xc79
Feb 13 14:25:23 testsvr kernel:  [<c040616a>] schedule+0x31e/0x8ed
Feb 13 14:25:23 testsvr kernel:  [<c0406ae1>] schedule_timeout+0x72/0xb0
Feb 13 14:25:23 testsvr kernel:  [<c0406ae1>] schedule_timeout+0x72/0xb0
Feb 13 14:25:23 testsvr kernel:  [<c034416e>] md_thread+0x40/0x103
Feb 13 14:25:23 testsvr kernel:  [<c012f47c>] autoremove_wake_function+0x0/0x4b
Feb 13 14:25:23 testsvr kernel:  [<c034412e>] md_thread+0x0/0x103
Feb 13 14:25:23 testsvr kernel:  [<c012f397>] kthread+0xfc/0x100
Feb 13 14:25:23 testsvr kernel:  [<c012f29b>] kthread+0x0/0x100
Feb 13 14:25:23 testsvr kernel:  [<c0103997>] kernel_thread_helper+0x7/0x10

Any help would be appreciated. I'm available to try any test -- this is
a test server that I can perform any kind of wild test on.

-John
* Re: RAID 10 resync leading to attempt to access beyond end of device
  2007-02-14 22:08 RAID 10 resync leading to attempt to access beyond end of device John Stilson
@ 2007-02-14 23:37 ` Neil Brown
  [not found]   ` <e1e9d81a0702141606r7dea6288qea942cee2d978ee2@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Neil Brown @ 2007-02-14 23:37 UTC (permalink / raw)
To: John Stilson; +Cc: linux-raid

On Wednesday February 14, john9601@gmail.com wrote:
> Feb 14 16:23:45 testsvr kernel: attempt to access beyond end of device
> Feb 14 16:23:45 testsvr kernel: sdc1: rw=1, want=901904331651136,
> limit=16081002

That 'want=' value is an enormous number! 52 bits. Looks a lot like an
uninitialised variable somewhere.

What does

  grep . /sys/block/md*/md/dev-*/offset

show while the resync is running? How about

  grep . /sys/block/md*/md/dev-*/size

And can you give me the output of "mdadm --detail" on the array?

Thanks,
NeilBrown
[parent not found: <e1e9d81a0702141606r7dea6288qea942cee2d978ee2@mail.gmail.com>]
[parent not found: <17875.57273.543122.581106@notabene.brown>]
[parent not found: <e1e9d81a0702142051v152c4c8dme2b20e1c53e1f4b2@mail.gmail.com>]
* Re: RAID 10 resync leading to attempt to access beyond end of device
  [not found] ` <e1e9d81a0702142051v152c4c8dme2b20e1c53e1f4b2@mail.gmail.com>
@ 2007-02-15 18:02 ` John Stilson
  2007-02-15 18:23   ` John Stilson
  2007-02-16  2:25   ` RAID 10 resync leading to attempt to access beyond end of device Neil Brown
  0 siblings, 2 replies; 8+ messages in thread
From: John Stilson @ 2007-02-15 18:02 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid

Ok, tried the patch and got a kernel BUG this time (BUG_ON(k == conf->copies)?)

-John

Feb 15 12:52:35 testsvr kernel: md: recovery of RAID array md0
Feb 15 12:52:35 testsvr kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Feb 15 12:52:35 testsvr kernel: md: using maximum available idle IO bandwidth (but not more than 40000 KB/sec) for recovery.
Feb 15 12:52:35 testsvr kernel: md: using 128k window, over a total of 8040320 blocks.
Feb 15 12:55:57 testsvr kernel: ------------[ cut here ]------------
Feb 15 12:55:57 testsvr kernel: kernel BUG at drivers/md/raid10.c:1804!
Feb 15 12:55:57 testsvr kernel: invalid opcode: 0000 [#1]
Feb 15 12:55:57 testsvr kernel: SMP
Feb 15 12:55:57 testsvr kernel: Modules linked in:
Feb 15 12:55:57 testsvr kernel: CPU:    0
Feb 15 12:55:57 testsvr kernel: EIP:    0060:[<c036bbe8>]    Not tainted VLI
Feb 15 12:55:57 testsvr kernel: EFLAGS: 00010246   (2.6.20test1 #3)
Feb 15 12:55:57 testsvr kernel: EIP is at sync_request+0x43d/0x928
Feb 15 12:55:57 testsvr kernel: eax: c2330e14   ebx: c2330dc0   ecx: 00000003   edx: 00000000
Feb 15 12:55:57 testsvr kernel: esi: f68b30c0   edi: f782d4c0   ebp: 00000002   esp: f7397e58
Feb 15 12:55:57 testsvr kernel: ds: 007b   es: 007b   ss: 0068
Feb 15 12:55:57 testsvr kernel: Process md0_resync (pid: 2589, ti=f7396000 task=f7ade030 task.ti=f7396000)
Feb 15 12:55:57 testsvr kernel: Stack: f7397eac 00000000 00000024 00f55e00 00000000 f717fa00 00000000 00000000
Feb 15 12:55:57 testsvr kernel:        00000080 00000000 00000000 00000000 00000003 00000100 00000000 00000001
Feb 15 12:55:57 testsvr kernel:        c020307c 00443eb0 00000000 00f55f00 00000000 00000400 c036b7ab 00f55e00
Feb 15 12:55:57 testsvr kernel: Call Trace:
Feb 15 12:55:57 testsvr kernel:  [<c020307c>] __next_cpu+0x12/0x1f
Feb 15 12:55:57 testsvr kernel:  [<c036b7ab>] sync_request+0x0/0x928
Feb 15 12:55:57 testsvr kernel:  [<c037fade>] md_do_sync+0x581/0xa07
Feb 15 12:55:57 testsvr kernel:  [<c037a997>] md_thread+0x0/0xdc
Feb 15 12:55:57 testsvr kernel:  [<c037aa5d>] md_thread+0xc6/0xdc
Feb 15 12:55:57 testsvr kernel:  [<c0114004>] complete+0x38/0x47
Feb 15 12:55:57 testsvr kernel:  [<c0129eb2>] kthread+0xab/0xcf
Feb 15 12:55:57 testsvr kernel:  [<c0129e07>] kthread+0x0/0xcf
Feb 15 12:55:57 testsvr kernel:  [<c01041cb>] kernel_thread_helper+0x7/0x10
Feb 15 12:55:57 testsvr kernel: =======================
Feb 15 12:55:57 testsvr kernel: Code: 4f 04 8b 01 f0 ff 80 9c 00 00 00 f0 ff 03 31 ed 8d 43 34 eb 0c 8b 4c 24 30 39 08 74 09 45 83 c0 10 3b 6f 1c 7c ef 3b 6f 1c 75 04 <0f> 0b eb fe 8b 4b 38 c1 e5 04 89 71 08 89 59 3c c7 41 34 ba b6
Feb 15 12:55:57 testsvr kernel: EIP: [<c036bbe8>] sync_request+0x43d/0x928 SS:ESP 0068:f7397e58

On 2/14/07, John Stilson <john9601@gmail.com> wrote:
> Wow thanks for the quick response. I will try this tomorrow morning
> and let you know.
>
> -John
>
> On 2/14/07, Neil Brown <neilb@suse.de> wrote:
> >
> > Thanks for the extra detail.  I think I've nailed it.
> > Does this fix it for you?
> >
> > Thanks,
> > NeilBrown
> >
> > Signed-off-by: Neil Brown <neilb@suse.de>
> >
> > ### Diffstat output
> >  ./drivers/md/raid10.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
> > --- .prev/drivers/md/raid10.c	2007-02-15 13:57:34.000000000 +1100
> > +++ ./drivers/md/raid10.c	2007-02-15 15:20:04.000000000 +1100
> > @@ -420,7 +420,7 @@ static sector_t raid10_find_virt(conf_t
> >  		if (dev < 0)
> >  			dev += conf->raid_disks;
> >  	} else {
> > -		while (sector > conf->stride) {
> > +		while (sector >= conf->stride) {
> >  			sector -= conf->stride;
> >  			if (dev < conf->near_copies)
> >  				dev += conf->raid_disks - conf->near_copies;
> > @@ -1747,6 +1747,7 @@ static sector_t sync_request(mddev_t *md
> >  				for (k=0; k<conf->copies; k++)
> >  					if (r10_bio->devs[k].devnum == i)
> >  						break;
> > +				BUG_ON(k == conf->copies);
> >  				bio = r10_bio->devs[1].bio;
> >  				bio->bi_next = biolist;
> >  				biolist = bio;
> > @@ -1973,6 +1974,7 @@ static int run(mddev_t *mddev)
> >  	conf->far_offset = fo;
> >  	conf->chunk_mask = (sector_t)(mddev->chunk_size>>9)-1;
> >  	conf->chunk_shift = ffz(~mddev->chunk_size) - 9;
> > +	mddev->size &= ~(conf->chunk_mask >> 1);
> >  	if (fo)
> >  		conf->stride = 1 << conf->chunk_shift;
> >  	else {
* Re: RAID 10 resync leading to attempt to access beyond end of device
  2007-02-15 18:02 ` John Stilson
@ 2007-02-15 18:23 ` John Stilson
  2007-02-15 18:28   ` (unknown) Derek Yeung
  2007-02-16  2:25   ` RAID 10 resync leading to attempt to access beyond end of device Neil Brown
  1 sibling, 1 reply; 8+ messages in thread
From: John Stilson @ 2007-02-15 18:23 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid

Oh, an additional piece of information I just realized I had not put in
my original email: this failure only happens intermittently -- 50%-75%
of the time a rebuild occurs.

-John

On 2/15/07, John Stilson <john9601@gmail.com> wrote:
> Ok, tried the patch and got a kernel BUG this time (BUG_ON(k == conf->copies)?)
>
> -John
>
> Feb 15 12:52:35 testsvr kernel: md: recovery of RAID array md0
> Feb 15 12:52:35 testsvr kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> Feb 15 12:52:35 testsvr kernel: md: using maximum available idle IO bandwidth (but not more than 40000 KB/sec) for recovery.
> Feb 15 12:52:35 testsvr kernel: md: using 128k window, over a total of 8040320 blocks.
> Feb 15 12:55:57 testsvr kernel: ------------[ cut here ]------------
> Feb 15 12:55:57 testsvr kernel: kernel BUG at drivers/md/raid10.c:1804!
> Feb 15 12:55:57 testsvr kernel: invalid opcode: 0000 [#1]
> Feb 15 12:55:57 testsvr kernel: SMP
> Feb 15 12:55:57 testsvr kernel: Modules linked in:
> Feb 15 12:55:57 testsvr kernel: CPU:    0
> Feb 15 12:55:57 testsvr kernel: EIP:    0060:[<c036bbe8>]    Not tainted VLI
> Feb 15 12:55:57 testsvr kernel: EFLAGS: 00010246   (2.6.20test1 #3)
> Feb 15 12:55:57 testsvr kernel: EIP is at sync_request+0x43d/0x928
> Feb 15 12:55:57 testsvr kernel: eax: c2330e14   ebx: c2330dc0   ecx: 00000003   edx: 00000000
> Feb 15 12:55:57 testsvr kernel: esi: f68b30c0   edi: f782d4c0   ebp: 00000002   esp: f7397e58
> Feb 15 12:55:57 testsvr kernel: ds: 007b   es: 007b   ss: 0068
> Feb 15 12:55:57 testsvr kernel: Process md0_resync (pid: 2589, ti=f7396000 task=f7ade030 task.ti=f7396000)
> Feb 15 12:55:57 testsvr kernel: Stack: f7397eac 00000000 00000024 00f55e00 00000000 f717fa00 00000000 00000000
> Feb 15 12:55:57 testsvr kernel:        00000080 00000000 00000000 00000000 00000003 00000100 00000000 00000001
> Feb 15 12:55:57 testsvr kernel:        c020307c 00443eb0 00000000 00f55f00 00000000 00000400 c036b7ab 00f55e00
> Feb 15 12:55:57 testsvr kernel: Call Trace:
> Feb 15 12:55:57 testsvr kernel:  [<c020307c>] __next_cpu+0x12/0x1f
> Feb 15 12:55:57 testsvr kernel:  [<c036b7ab>] sync_request+0x0/0x928
> Feb 15 12:55:57 testsvr kernel:  [<c037fade>] md_do_sync+0x581/0xa07
> Feb 15 12:55:57 testsvr kernel:  [<c037a997>] md_thread+0x0/0xdc
> Feb 15 12:55:57 testsvr kernel:  [<c037aa5d>] md_thread+0xc6/0xdc
> Feb 15 12:55:57 testsvr kernel:  [<c0114004>] complete+0x38/0x47
> Feb 15 12:55:57 testsvr kernel:  [<c0129eb2>] kthread+0xab/0xcf
> Feb 15 12:55:57 testsvr kernel:  [<c0129e07>] kthread+0x0/0xcf
> Feb 15 12:55:57 testsvr kernel:  [<c01041cb>] kernel_thread_helper+0x7/0x10
> Feb 15 12:55:57 testsvr kernel: =======================
> Feb 15 12:55:57 testsvr kernel: Code: 4f 04 8b 01 f0 ff 80 9c 00 00 00 f0 ff 03 31 ed 8d 43 34 eb 0c 8b 4c 24 30 39 08 74 09 45 83 c0 10 3b 6f 1c 7c ef 3b 6f 1c 75 04 <0f> 0b eb fe 8b 4b 38 c1 e5 04 89 71 08 89 59 3c c7 41 34 ba b6
> Feb 15 12:55:57 testsvr kernel: EIP: [<c036bbe8>] sync_request+0x43d/0x928 SS:ESP 0068:f7397e58
>
> On 2/14/07, John Stilson <john9601@gmail.com> wrote:
> > Wow thanks for the quick response. I will try this tomorrow morning
> > and let you know.
> >
> > -John
> >
> > On 2/14/07, Neil Brown <neilb@suse.de> wrote:
> > >
> > > Thanks for the extra detail.  I think I've nailed it.
> > > Does this fix it for you?
> > >
> > > Thanks,
> > > NeilBrown
> > >
> > > Signed-off-by: Neil Brown <neilb@suse.de>
> > >
> > > ### Diffstat output
> > >  ./drivers/md/raid10.c |    4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
> > > --- .prev/drivers/md/raid10.c	2007-02-15 13:57:34.000000000 +1100
> > > +++ ./drivers/md/raid10.c	2007-02-15 15:20:04.000000000 +1100
> > > @@ -420,7 +420,7 @@ static sector_t raid10_find_virt(conf_t
> > >  		if (dev < 0)
> > >  			dev += conf->raid_disks;
> > >  	} else {
> > > -		while (sector > conf->stride) {
> > > +		while (sector >= conf->stride) {
> > >  			sector -= conf->stride;
> > >  			if (dev < conf->near_copies)
> > >  				dev += conf->raid_disks - conf->near_copies;
> > > @@ -1747,6 +1747,7 @@ static sector_t sync_request(mddev_t *md
> > >  				for (k=0; k<conf->copies; k++)
> > >  					if (r10_bio->devs[k].devnum == i)
> > >  						break;
> > > +				BUG_ON(k == conf->copies);
> > >  				bio = r10_bio->devs[1].bio;
> > >  				bio->bi_next = biolist;
> > >  				biolist = bio;
> > > @@ -1973,6 +1974,7 @@ static int run(mddev_t *mddev)
> > >  	conf->far_offset = fo;
> > >  	conf->chunk_mask = (sector_t)(mddev->chunk_size>>9)-1;
> > >  	conf->chunk_shift = ffz(~mddev->chunk_size) - 9;
> > > +	mddev->size &= ~(conf->chunk_mask >> 1);
> > >  	if (fo)
> > >  		conf->stride = 1 << conf->chunk_shift;
> > >  	else {
* (unknown)
  2007-02-15 18:23 ` John Stilson
@ 2007-02-15 18:28 ` Derek Yeung
  2007-02-15 18:53   ` (unknown) Derek Yeung
  0 siblings, 1 reply; 8+ messages in thread
From: Derek Yeung @ 2007-02-15 18:28 UTC (permalink / raw)
To: linux-raid

help
* (unknown)
  2007-02-15 18:28 ` (unknown) Derek Yeung
@ 2007-02-15 18:53 ` Derek Yeung
  0 siblings, 0 replies; 8+ messages in thread
From: Derek Yeung @ 2007-02-15 18:53 UTC (permalink / raw)
To: Derek Yeung; +Cc: linux-raid

unsubscribe linux-raid
* Re: RAID 10 resync leading to attempt to access beyond end of device
  2007-02-15 18:02 ` John Stilson
  2007-02-15 18:23 ` John Stilson
@ 2007-02-16  2:25 ` Neil Brown
  2007-02-19 17:16   ` John Stilson
  1 sibling, 1 reply; 8+ messages in thread
From: Neil Brown @ 2007-02-16 2:25 UTC (permalink / raw)
To: John Stilson; +Cc: linux-raid

On Thursday February 15, john9601@gmail.com wrote:
> Ok tried the patch and got a kernel BUG this time (BUG_ON(k == conf->copies)?)

Thanks... obviously I missed some subtlety. I think I have it right now.
I've tested this against a setup which I think is sufficiently identical
to yours this time (now that I know what the important parameter is:
device size), but if you could test it too, that would be great.

This patch is in place of the previous patch.

Thanks,
NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid10.c |   39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2007-02-15 13:57:34.000000000 +1100
+++ ./drivers/md/raid10.c	2007-02-16 13:23:55.000000000 +1100
@@ -420,7 +420,7 @@ static sector_t raid10_find_virt(conf_t
 		if (dev < 0)
 			dev += conf->raid_disks;
 	} else {
-		while (sector > conf->stride) {
+		while (sector >= conf->stride) {
 			sector -= conf->stride;
 			if (dev < conf->near_copies)
 				dev += conf->raid_disks - conf->near_copies;
@@ -1747,6 +1747,8 @@ static sector_t sync_request(mddev_t *md
 				for (k=0; k<conf->copies; k++)
 					if (r10_bio->devs[k].devnum == i)
 						break;
+
+				BUG_ON(k == conf->copies);
 				bio = r10_bio->devs[1].bio;
 				bio->bi_next = biolist;
 				biolist = bio;
@@ -1967,19 +1969,30 @@ static int run(mddev_t *mddev)
 	if (!conf->tmppage)
 		goto out_free_conf;

+	conf->mddev = mddev;
+	conf->raid_disks = mddev->raid_disks;
 	conf->near_copies = nc;
 	conf->far_copies = fc;
 	conf->copies = nc*fc;
 	conf->far_offset = fo;
 	conf->chunk_mask = (sector_t)(mddev->chunk_size>>9)-1;
 	conf->chunk_shift = ffz(~mddev->chunk_size) - 9;
+	size = mddev->size >> (conf->chunk_shift-1);
+	sector_div(size, fc);
+	size = size * conf->raid_disks;
+	sector_div(size, nc);
+	/* 'size' is now the number of chunks in the array */
+	/* calculate "used chunks per device" in 'stride' */
+	stride = size * conf->copies;
+	sector_div(stride, conf->raid_disks);
+	mddev->size = stride << (conf->chunk_shift-1);
+
 	if (fo)
-		conf->stride = 1 << conf->chunk_shift;
-	else {
-		stride = mddev->size >> (conf->chunk_shift-1);
+		stride = 1;
+	else
 		sector_div(stride, fc);
-		conf->stride = stride << conf->chunk_shift;
-	}
+	conf->stride = stride << conf->chunk_shift;
+
 	conf->r10bio_pool = mempool_create(NR_RAID10_BIOS, r10bio_pool_alloc,
 					   r10bio_pool_free, conf);
 	if (!conf->r10bio_pool) {
@@ -2009,8 +2022,6 @@ static int run(mddev_t *mddev)

 		disk->head_position = 0;
 	}
-	conf->raid_disks = mddev->raid_disks;
-	conf->mddev = mddev;
 	spin_lock_init(&conf->device_lock);
 	INIT_LIST_HEAD(&conf->retry_list);

@@ -2052,16 +2063,8 @@ static int run(mddev_t *mddev)
 	/*
 	 * Ok, everything is just fine now
 	 */
-	if (conf->far_offset) {
-		size = mddev->size >> (conf->chunk_shift-1);
-		size *= conf->raid_disks;
-		size <<= conf->chunk_shift;
-		sector_div(size, conf->far_copies);
-	} else
-		size = conf->stride * conf->raid_disks;
-	sector_div(size, conf->near_copies);
-	mddev->array_size = size/2;
-	mddev->resync_max_sectors = size;
+	mddev->array_size = size << (conf->chunk_shift-1);
+	mddev->resync_max_sectors = size << conf->chunk_shift;

 	mddev->queue->issue_flush_fn = raid10_issue_flush;
 	mddev->queue->backing_dev_info.congested_fn = raid10_congested;
* Re: RAID 10 resync leading to attempt to access beyond end of device
  2007-02-16  2:25 ` RAID 10 resync leading to attempt to access beyond end of device Neil Brown
@ 2007-02-19 17:16 ` John Stilson
  0 siblings, 0 replies; 8+ messages in thread
From: John Stilson @ 2007-02-19 17:16 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid

Hey Neil,

I tested this new patch and it seems to work! I'm going to do some more
vigorous testing, and I'll let you know if any more issues bubble out.

Thanks!

-John

On 2/15/07, Neil Brown <neilb@suse.de> wrote:
> On Thursday February 15, john9601@gmail.com wrote:
> > Ok tried the patch and got a kernel BUG this time (BUG_ON(k == conf->copies)?)
>
> Thanks.... obviously I missed some subtlety.  I think I have it right
> now.
> I've tested this against a setup which I think is sufficiently
> identical to yours this time (now that I know what the important
> parameters are: device size), but if you could test it too, that would
> be great.
>
> This patch is in place of the previous patch.
>
> Thanks,
> NeilBrown
>
> Signed-off-by: Neil Brown <neilb@suse.de>
>
> ### Diffstat output
>  ./drivers/md/raid10.c |   39 +++++++++++++++++++++------------------
>  1 file changed, 21 insertions(+), 18 deletions(-)
>
> diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
> --- .prev/drivers/md/raid10.c	2007-02-15 13:57:34.000000000 +1100
> +++ ./drivers/md/raid10.c	2007-02-16 13:23:55.000000000 +1100
> @@ -420,7 +420,7 @@ static sector_t raid10_find_virt(conf_t
>  		if (dev < 0)
>  			dev += conf->raid_disks;
>  	} else {
> -		while (sector > conf->stride) {
> +		while (sector >= conf->stride) {
>  			sector -= conf->stride;
>  			if (dev < conf->near_copies)
>  				dev += conf->raid_disks - conf->near_copies;
> @@ -1747,6 +1747,8 @@ static sector_t sync_request(mddev_t *md
>  				for (k=0; k<conf->copies; k++)
>  					if (r10_bio->devs[k].devnum == i)
>  						break;
> +
> +				BUG_ON(k == conf->copies);
>  				bio = r10_bio->devs[1].bio;
>  				bio->bi_next = biolist;
>  				biolist = bio;
> @@ -1967,19 +1969,30 @@ static int run(mddev_t *mddev)
>  	if (!conf->tmppage)
>  		goto out_free_conf;
>
> +	conf->mddev = mddev;
> +	conf->raid_disks = mddev->raid_disks;
>  	conf->near_copies = nc;
>  	conf->far_copies = fc;
>  	conf->copies = nc*fc;
>  	conf->far_offset = fo;
>  	conf->chunk_mask = (sector_t)(mddev->chunk_size>>9)-1;
>  	conf->chunk_shift = ffz(~mddev->chunk_size) - 9;
> +	size = mddev->size >> (conf->chunk_shift-1);
> +	sector_div(size, fc);
> +	size = size * conf->raid_disks;
> +	sector_div(size, nc);
> +	/* 'size' is now the number of chunks in the array */
> +	/* calculate "used chunks per device" in 'stride' */
> +	stride = size * conf->copies;
> +	sector_div(stride, conf->raid_disks);
> +	mddev->size = stride << (conf->chunk_shift-1);
> +
>  	if (fo)
> -		conf->stride = 1 << conf->chunk_shift;
> -	else {
> -		stride = mddev->size >> (conf->chunk_shift-1);
> +		stride = 1;
> +	else
>  		sector_div(stride, fc);
> -		conf->stride = stride << conf->chunk_shift;
> -	}
> +	conf->stride = stride << conf->chunk_shift;
> +
>  	conf->r10bio_pool = mempool_create(NR_RAID10_BIOS, r10bio_pool_alloc,
>  					   r10bio_pool_free, conf);
>  	if (!conf->r10bio_pool) {
> @@ -2009,8 +2022,6 @@ static int run(mddev_t *mddev)
>
>  		disk->head_position = 0;
>  	}
> -	conf->raid_disks = mddev->raid_disks;
> -	conf->mddev = mddev;
>  	spin_lock_init(&conf->device_lock);
>  	INIT_LIST_HEAD(&conf->retry_list);
>
> @@ -2052,16 +2063,8 @@ static int run(mddev_t *mddev)
>  	/*
>  	 * Ok, everything is just fine now
>  	 */
> -	if (conf->far_offset) {
> -		size = mddev->size >> (conf->chunk_shift-1);
> -		size *= conf->raid_disks;
> -		size <<= conf->chunk_shift;
> -		sector_div(size, conf->far_copies);
> -	} else
> -		size = conf->stride * conf->raid_disks;
> -	sector_div(size, conf->near_copies);
> -	mddev->array_size = size/2;
> -	mddev->resync_max_sectors = size;
> +	mddev->array_size = size << (conf->chunk_shift-1);
> +	mddev->resync_max_sectors = size << conf->chunk_shift;
>
>  	mddev->queue->issue_flush_fn = raid10_issue_flush;
>  	mddev->queue->backing_dev_info.congested_fn = raid10_congested;
end of thread, other threads:[~2007-02-19 17:16 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-14 22:08 RAID 10 resync leading to attempt to access beyond end of device John Stilson
2007-02-14 23:37 ` Neil Brown
     [not found]   ` <e1e9d81a0702141606r7dea6288qea942cee2d978ee2@mail.gmail.com>
     [not found]     ` <17875.57273.543122.581106@notabene.brown>
     [not found]       ` <e1e9d81a0702142051v152c4c8dme2b20e1c53e1f4b2@mail.gmail.com>
2007-02-15 18:02         ` John Stilson
2007-02-15 18:23           ` John Stilson
2007-02-15 18:28             ` (unknown) Derek Yeung
2007-02-15 18:53               ` (unknown) Derek Yeung
2007-02-16  2:25           ` RAID 10 resync leading to attempt to access beyond end of device Neil Brown
2007-02-19 17:16             ` John Stilson