* raid1 + 2.6.27.7 issues
From: Jon Nelson @ 2009-02-09 16:36 UTC
To: LinuxRaid

I found some time to get back to some raid1 issues I have been having.
Briefly: I have a pair of machines, each with an 80G hard drive. One
machine exports its drive over NBD (network block device) to the
other. When both devices are available, they are combined via MD into
a raid1. The raid1 looks like this:

/dev/md11:
        Version : 1.00
  Creation Time : Mon Dec 15 07:06:13 2008
     Raid Level : raid1
     Array Size : 78123988 (74.50 GiB 80.00 GB)
  Used Dev Size : 156247976 (149.01 GiB 160.00 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Feb  9 09:53:13 2009
          State : active, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 9% complete

           Name : turnip:11  (local to host turnip)
           UUID : cf24d099:9e174a79:2a2f6797:dcff1420
         Events : 90220

    Number   Major   Minor   RaidDevice State
       2      43       0        0      writemostly spare rebuilding   /dev/nbd0
       3       8       0        1      active sync   /dev/sda

The typical use case for me is this: I will run the array (/dev/md11)
in degraded mode (without /dev/nbd0) for a week or so. At some point,
I will try to synchronize the underlying devices. To do this I use:

  mdadm /dev/md11 --re-add /dev/nbd0

The issue I encounter is this: the array goes into *recovery* mode
rather than *resync*, despite the fact that /dev/nbd0 was at one point
a full (in-sync) member of the array. Typically, less than 1/3 of the
array needs to be resynchronized, often much less than that. I base
this on the --examine-bitmap output from /dev/sda. Today it says:

  Bitmap : 19074 bits (chunks), 6001 dirty (31.5%)

which is a substantially higher percentage than usual.

Indications of a problem: --examine and --examine-bitmap report an
Events count which agrees for /dev/sda but does *not* agree for
/dev/nbd0. From today, the --examine and --examine-bitmap output for
/dev/nbd0:

          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : cf24d099:9e174a79:2a2f6797:dcff1420
           Name : turnip:11  (local to host turnip)
  Creation Time : Mon Dec 15 07:06:13 2008
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 160086384 (76.34 GiB 81.96 GB)
     Array Size : 156247976 (74.50 GiB 80.00 GB)
  Used Dev Size : 156247976 (74.50 GiB 80.00 GB)
   Super Offset : 160086512 sectors
          State : clean
    Device UUID : 01524a75:c309869c:6da972c9:084115c6

Internal Bitmap : 2 sectors from superblock
          Flags : write-mostly
    Update Time : Mon Feb  9 09:52:58 2009
       Checksum : 64058b3b - correct
         Events : 90192

     Array Slot : 2 (failed, failed, empty, 1)
    Array State : _u 2 failed

       Filename : /dev/nbd0
          Magic : 6d746962
        Version : 4
           UUID : cf24d099:9e174a79:2a2f6797:dcff1420
         Events : 81596
 Events Cleared : 81570
          State : OK
      Chunksize : 4 MB
         Daemon : 5s flush period
     Write Mode : Allow write behind, max 256
      Sync Size : 78123988 (74.50 GiB 80.00 GB)
         Bitmap : 19074 bits (chunks), 0 dirty (0.0%)

As you can see, --examine says there are 90192 events, while
--examine-bitmap says there are 81596. So that seems to be a bug or
some other issue.

What am I doing wrong that causes the array to go into "recovery"
instead of "resync" mode? It's clearly showing /dev/nbd0 as a *spare* -
does this have anything to do with it?

--
Jon
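A minimal sketch of this kind of two-host mirror, for anyone wanting to
reproduce the setup. The host name, port, and write-behind value below
are illustrative, not taken from Jon's configuration:

  # on the remote machine: export its disk over NBD (port is arbitrary)
  nbd-server 2000 /dev/sdb

  # on the local machine: attach the export, then mirror the local disk
  # against it, with an internal write-intent bitmap and the network
  # leg marked write-mostly with write-behind
  nbd-client remote-host 2000 /dev/nbd0
  mdadm --create /dev/md11 --level=1 --raid-devices=2 \
        --metadata=1.0 --bitmap=internal --write-behind=256 \
        /dev/sda --write-mostly /dev/nbd0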
* Re: raid1 + 2.6.27.7 issues
From: Iain Rauch @ 2009-02-09 16:47 UTC
To: LinuxRaid

> The typical use case for me is this: I will run the array (/dev/md11)
> in degraded mode (without /dev/nbd0) for a week or so. At some point,
> I will try to synchronize the underlying devices.

Sounds like rsync is more suited to your application. Why are you
using RAID?

Iain
* Re: raid1 + 2.6.27.7 issues
From: Ray Van Dolson @ 2009-02-09 16:59 UTC
To: Iain Rauch; +Cc: LinuxRaid

On Mon, Feb 09, 2009 at 08:47:33AM -0800, Iain Rauch wrote:
> Sounds like rsync is more suited to your application.
> Why are you using RAID?

Seems academic to me. Whatever the reasons, the above _should_ work,
should it not?

Could this be NBD's fault somehow?

Ray
* Re: raid1 + 2.6.27.7 issues
From: Jon Nelson @ 2009-02-09 17:49 UTC
Cc: LinuxRaid

On Mon, Feb 9, 2009 at 10:59 AM, Ray Van Dolson <rvandolson@esri.com> wrote:
> Seems academic to me. Whatever the reasons, the above _should_ work,
> should it not?
>
> Could this be NBD's fault somehow?

I don't think so. This is how I *remove* the nbd device:

  mdadm /dev/md11 --fail /dev/nbd0
  sleep 3
  mdadm /dev/md11 --remove /dev/nbd0

and then finally:

  nbd-client -d /dev/nbd0

If necessary, I can try to simulate the problem by using a local
logical volume or some such (see the loopback sketch below).

--
Jon
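One way to take NBD out of the picture entirely is to stand the same
kind of array up on loopback devices. A sketch, with illustrative file
names and sizes:

  # two sparse 1 GiB files standing in for the disks
  dd if=/dev/zero of=/tmp/d0.img bs=1M count=0 seek=1024
  dd if=/dev/zero of=/tmp/d1.img bs=1M count=0 seek=1024
  losetup /dev/loop0 /tmp/d0.img
  losetup /dev/loop1 /tmp/d1.img

  mdadm --create /dev/md12 --level=1 --raid-devices=2 \
        --bitmap=internal /dev/loop0 --write-mostly /dev/loop1

  # run the same fail / remove / re-add cycle against /dev/loop1
  mdadm /dev/md12 --fail /dev/loop1
  mdadm /dev/md12 --remove /dev/loop1
  mdadm /dev/md12 --re-add /dev/loop1
  cat /proc/mdstat   # does it report 'recovery' or 'resync'?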
* Re: raid1 + 2.6.27.7 issues
From: Ray Van Dolson @ 2009-02-09 17:53 UTC
To: Jon Nelson; +Cc: LinuxRaid

On Mon, Feb 09, 2009 at 09:49:21AM -0800, Jon Nelson wrote:
> I don't think so. This is how I *remove* the nbd device:
>
>   mdadm /dev/md11 --fail /dev/nbd0
>   sleep 3
>   mdadm /dev/md11 --remove /dev/nbd0
>
> and then finally:
>
>   nbd-client -d /dev/nbd0
>
> If necessary, I can try to simulate the problem by using a local
> logical volume or some such.

Don't want to send you on a wild goose chase, but I'd be interested to
see the results of that, and maybe the results of the same with an
iSCSI-backed block device vs. NBD.

I just thought there were some subtle differences between NBD and
other "SAN over network" protocols. No idea how they'd play into the
scenario you're describing, however.

Ray
* Re: raid1 + 2.6.27.7 issues
From: Jon Nelson @ 2009-02-09 18:13 UTC
Cc: LinuxRaid

On Mon, Feb 9, 2009 at 11:53 AM, Ray Van Dolson <rvandolson@esri.com> wrote:
> Don't want to send you on a wild goose chase, but I'd be interested
> to see the results of that, and maybe the results of the same with an
> iSCSI-backed block device vs. NBD.

I'm trying with AoE. I've never had any luck setting up a software
iSCSI device; maybe I'll give that a try one of these days. Once the
device is done /recovering/ (4.5% done, 3 hours to go) we'll see if
AoE behaves any differently. I doubt it.

--
Jon
* Re: raid1 + 2.6.27.7 issues
From: Jon Nelson @ 2009-02-09 20:56 UTC
Cc: LinuxRaid

On Mon, Feb 9, 2009 at 1:30 PM, Bill Davidsen <davidsen@tmr.com> wrote:
> Jon Nelson wrote:
>> I don't think so. This is how I *remove* the nbd device:
>>
>>   mdadm /dev/md11 --fail /dev/nbd0
>>   sleep 3
>>   mdadm /dev/md11 --remove /dev/nbd0
>>
>> and then finally nbd-client -d /dev/nbd0
>
> I would try this with just the first step. I looked at the code
> briefly, and I think the write intent bitmap will not get built for a
> removed device but will for failed. In any case it's certainly
> something you can easily try.

Nope. :-( --remove is necessary:

  turnip:~ # mdadm --fail /dev/md11 /dev/nbd0
  mdadm: set /dev/nbd0 faulty in /dev/md11
  turnip:~ # mdadm --re-add /dev/md11 /dev/nbd0
  mdadm: Cannot open /dev/nbd0: Device or resource busy
  turnip:~ # mdadm --add /dev/md11 /dev/nbd0
  mdadm: Cannot open /dev/nbd0: Device or resource busy

--
Jon
* Re: raid1 + 2.6.27.7 issues
From: Jon Nelson @ 2009-02-09 17:25 UTC
To: Iain Rauch; +Cc: LinuxRaid

On Mon, Feb 9, 2009 at 10:47 AM, Iain Rauch <groups@email.iain.rauch.co.uk> wrote:
> Sounds like rsync is more suited to your application.

If I were transferring files, rsync would be fine. But files have
mutable state (applications may be writing to them at the time of
transfer, etc.) and a whole host of other issues.

> Why are you using RAID?

I want a *mirror* of the block device.

--
Jon
* Re: raid1 + 2.6.27.7 issues
From: Neil Brown @ 2009-02-09 22:17 UTC
To: Jon Nelson; +Cc: LinuxRaid

On Monday February 9, jnelson-linux-raid@jamponi.net wrote:
> The typical use case for me is this: I will run the array (/dev/md11)
> in degraded mode (without /dev/nbd0) for a week or so. At some point,
> I will try to synchronize the underlying devices. To do this I use:
>
>   mdadm /dev/md11 --re-add /dev/nbd0
>
> The issue I encounter is this: the array goes into *recovery* mode
> rather than *resync*, despite the fact that /dev/nbd0 was at one
> point a full (in-sync) member of the array. Typically, less than 1/3
> of the array needs to be resynchronized, often much less than that.

I've managed to reproduce this.

If you fail the write-mostly device when the array is 'clean' (as
reported by --examine), it works as expected. If you fail it when the
array is 'active', you get the full recovery.

The array is 'active' if there have been any writes in the last 200
msecs, and clean otherwise.

I'll have to have a bit of a think about this and figure out what the
correct fix is. Nag me if you haven't heard anything by the end of the
week.

Thanks,
NeilBrown
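Neil's observation suggests an interim workaround: wait for the array
to go 'clean' before failing the network leg. A sketch using the
standard md sysfs attributes (the polling loop is illustrative; the
clean/active transition is governed by safe_mode_delay):

  # only fail /dev/nbd0 once md reports the array 'clean'
  while [ "$(cat /sys/block/md11/md/array_state)" != "clean" ]; do
      sleep 1
  done
  mdadm /dev/md11 --fail /dev/nbd0
  sleep 3
  mdadm /dev/md11 --remove /dev/nbd0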
* Re: raid1 + 2.6.27.7 issues
From: Jon Nelson @ 2009-02-10 0:07 UTC
Cc: LinuxRaid

On Mon, Feb 9, 2009 at 4:17 PM, Neil Brown <neilb@suse.de> wrote:
...
> I've managed to reproduce this.
>
> If you fail the write-mostly device when the array is 'clean' (as
> reported by --examine), it works as expected. If you fail it when the
> array is 'active', you get the full recovery.
>
> The array is 'active' if there have been any writes in the last 200
> msecs, and clean otherwise.
>
> I'll have to have a bit of a think about this and figure out what the
> correct fix is. Nag me if you haven't heard anything by the end of
> the week.

Can-do. Here are some more wrinkles:

Wrinkle "A": I can't un-do "write-mostly". I followed the md.txt docs
that ship with the kernel, which suggest that the following should
work (the exact commands are sketched after this message):

1. I let the array come up to sync.
2. I echoed "-writemostly" into /sys/block/md11/md/dev-nbd0/state.
3. A 'cat state' showed "in_sync,write_mostly" before, and "in_sync"
   after.
4. --fail and --remove /dev/nbd0.
5. --re-add /dev/nbd0.
6. 'cat state' shows "in_sync,write_mostly" again. D'oh!

Wrinkle "B": When I did the above and --re-add'ed /dev/nbd0, it went
into "recovery" mode, which completed instantly. My recollection of
"recovery" is that it does not update the bitmap until the entire
process is complete. Is this correct? If so, I'd like to try to
convince you (Neil Brown) that it's worthwhile to behave the same WRT
the bitmap and up-to-dateness regardless of whether it's recovery or
resync.

I'm including the --examine and --examine-bitmap output from both
/dev/nbd0 and /dev/sda:

/dev/nbd0:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : cf24d099:9e174a79:2a2f6797:dcff1420
           Name : turnip:11  (local to host turnip)
  Creation Time : Mon Dec 15 07:06:13 2008
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 160086384 (76.34 GiB 81.96 GB)
     Array Size : 156247976 (74.50 GiB 80.00 GB)
  Used Dev Size : 156247976 (74.50 GiB 80.00 GB)
   Super Offset : 160086512 sectors
          State : clean
    Device UUID : 01524a75:c309869c:6da972c9:084115c6

Internal Bitmap : 2 sectors from superblock
          Flags : write-mostly
    Update Time : Mon Feb  9 17:49:21 2009
       Checksum : 6404fbcd - correct
         Events : 90426

     Array Slot : 2 (failed, failed, 0, 1)
    Array State : Uu 2 failed

       Filename : /dev/nbd0
          Magic : 6d746962
        Version : 4
           UUID : cf24d099:9e174a79:2a2f6797:dcff1420
         Events : 90426
 Events Cleared : 90398
          State : OK
      Chunksize : 4 MB
         Daemon : 5s flush period
     Write Mode : Allow write behind, max 256
      Sync Size : 78123988 (74.50 GiB 80.00 GB)
         Bitmap : 19074 bits (chunks), 0 dirty (0.0%)

/dev/sda:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : cf24d099:9e174a79:2a2f6797:dcff1420
           Name : turnip:11  (local to host turnip)
  Creation Time : Mon Dec 15 07:06:13 2008
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 160086384 (76.34 GiB 81.96 GB)
     Array Size : 156247976 (74.50 GiB 80.00 GB)
  Used Dev Size : 156247976 (74.50 GiB 80.00 GB)
   Super Offset : 160086512 sectors
          State : clean
    Device UUID : 0059434c:ecef51a0:2974482d:ba38f944

Internal Bitmap : 2 sectors from superblock
    Update Time : Mon Feb  9 17:57:34 2009
       Checksum : 2184ad61 - correct
         Events : 90446

     Array Slot : 3 (failed, failed, failed, 1)
    Array State : _U 3 failed

       Filename : /dev/sda
          Magic : 6d746962
        Version : 4
           UUID : cf24d099:9e174a79:2a2f6797:dcff1420
         Events : 90446
 Events Cleared : 90398
          State : OK
      Chunksize : 4 MB
         Daemon : 5s flush period
     Write Mode : Allow write behind, max 256
      Sync Size : 78123988 (74.50 GiB 80.00 GB)
         Bitmap : 19074 bits (chunks), 0 dirty (0.0%)

--
Jon
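The wrinkle "A" sequence, condensed into the exact commands (same
device names as above; the per-device state file semantics are per the
kernel's Documentation/md.txt):

  # clear the write-mostly flag via sysfs
  echo -writemostly > /sys/block/md11/md/dev-nbd0/state
  cat /sys/block/md11/md/dev-nbd0/state    # shows: in_sync

  # cycle the device out and back in
  mdadm /dev/md11 --fail /dev/nbd0
  mdadm /dev/md11 --remove /dev/nbd0
  mdadm /dev/md11 --re-add /dev/nbd0

  cat /sys/block/md11/md/dev-nbd0/state    # shows: in_sync,write_mostly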
* Re: raid1 + 2.6.27.7 issues
From: Neil Brown @ 2009-02-11 5:06 UTC
To: Jon Nelson; +Cc: LinuxRaid

On Monday February 9, jnelson-linux-raid@jamponi.net wrote:
> On Mon, Feb 9, 2009 at 4:17 PM, Neil Brown <neilb@suse.de> wrote:
> > I'll have to have a bit of a think about this and figure out what
> > the correct fix is. Nag me if you haven't heard anything by the end
> > of the week.

See below...

> Can-do. Here are some more wrinkles:
>
> Wrinkle "A": I can't un-do "write-mostly". I followed the md.txt docs
> that ship with the kernel, which suggest that the following should
> work:

You want mdadm 2.6.8:

  mdadm /dev/md0 --re-add --readwrite /dev/whatever

... or you would if it actually worked... That's odd, I cannot have
tested that.... I'll have to think about that too.

> Wrinkle "B": When I did the above and --re-add'ed /dev/nbd0, it went
> into "recovery" mode, which completed instantly. My recollection of
> "recovery" is that it does not update the bitmap until the entire
> process is complete. Is this correct? If so, I'd like to try to
> convince you (Neil Brown) that it's worthwhile to behave the same WRT
> the bitmap and up-to-dateness regardless of whether it's recovery or
> resync.

If the recovery is completing instantly, I wonder why you care exactly
when in that instant the bitmap is updated... but I suspect that is
missing the point. No, the bitmap isn't updated during recovery...
Maybe it could be... More thinking.

Meanwhile, this patch should fix your original problem.

commit 67ad8eaf70c5ca2948b482138d3f88764b3e8ee5
Author: NeilBrown <neilb@suse.de>
Date:   Wed Feb 11 15:33:21 2009 +1100

    md: never clear a bit from the write-intent bitmap when the array
    is degraded.

    It is safe to clear a bit from the write-intent bitmap for a raid1
    if we know the data has been written to all devices, which is what
    the current test does. But it is not always safe to update the
    'events_cleared' counter in that case, because one request could
    complete successfully after some other request has partially
    failed.

    So simply disable the clearing and updating of events_cleared
    whenever the array is degraded. This might end up not clearing
    some bits that could safely be cleared, but it is the safest
    approach.

    Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 01e3cff..d875172 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -386,7 +386,8 @@ static void raid1_end_write_request(struct bio *bio, int error)
 		/* clear the bitmap if all writes complete successfully */
 		bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
 				r1_bio->sectors,
-				!test_bit(R1BIO_Degraded, &r1_bio->state),
+				!test_bit(R1BIO_Degraded, &r1_bio->state)
+				&& !r1_bio->mddev->degraded,
 				behind);
 		md_write_end(r1_bio->mddev);
 		raid_end_bio_io(r1_bio);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6736d6d..9797a85 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -332,7 +332,8 @@ static void raid10_end_write_request(struct bio *bio, int error)
 		/* clear the bitmap if all writes complete successfully */
 		bitmap_endwrite(r10_bio->mddev->bitmap, r10_bio->sector,
 				r10_bio->sectors,
-				!test_bit(R10BIO_Degraded, &r10_bio->state),
+				!test_bit(R10BIO_Degraded, &r10_bio->state) &&
+				!r10_bio->mddev->degraded,
 				0);
 		md_write_end(r10_bio->mddev);
 		raid_end_bio_io(r10_bio);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a5ba080..4d71cce 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2076,7 +2076,8 @@ static void handle_stripe_clean_event(raid5_conf_t *conf,
 					bitmap_endwrite(conf->mddev->bitmap,
 							sh->sector,
 							STRIPE_SECTORS,
-							!test_bit(STRIPE_DEGRADED, &sh->state),
+							!test_bit(STRIPE_DEGRADED, &sh->state) &&
+							!conf->mddev->degraded,
 							0);
 				}
 			}

NeilBrown
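A rough way to exercise the fix, sketched below: keep writes in flight
so the array stays 'active', fail and later re-add the network leg,
and confirm that the dirty-bit count survives and only a partial
resync runs. Mount point, host name, and sizes are illustrative:

  # keep the array 'active' with background writes
  dd if=/dev/zero of=/mnt/md11/scratch bs=1M count=512 oflag=direct &

  mdadm /dev/md11 --fail /dev/nbd0
  mdadm /dev/md11 --remove /dev/nbd0
  nbd-client -d /dev/nbd0

  # ...later: re-attach and re-add; with the fix, --examine-bitmap on
  # /dev/sda should still show a modest dirty count, and /proc/mdstat
  # a short bitmap-based resync rather than a full recovery
  nbd-client remote-host 2000 /dev/nbd0
  mdadm /dev/md11 --re-add /dev/nbd0
  mdadm --examine-bitmap /dev/sda
  cat /proc/mdstat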