* RAID1 removing failed disk returns EBUSY
@ 2014-10-27 20:27 Joe Lawrence
  2014-10-28 21:41 ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread

From: Joe Lawrence @ 2014-10-27 20:27 UTC (permalink / raw)
  To: linux-raid; +Cc: Bill Kuzeja

Hi Neil,

We've encountered changes in MD and mdadm that have broken our automated
disk removal script. In the past, we've been able to run the following
after a RAID1 component disk removal:

  % echo fail > /sys/block/md3/md/dev-sdr5/state
  % echo remove > /sys/block/md3/md/dev-sdr5/state

However, the latest RHEL6.6 code drop has rebased onto sufficiently recent
MD kernel and mdadm changes that the previous commands occasionally fail,
like so:

  * MD array is usually resyncing or checking
  * Component disk /dev/sdr removed via HBA sysfs PCI removal
  * The following UDEV rule fires:

    SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
        RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"

  % mdadm --detail /dev/md3
  /dev/md3:
          Version : 1.1
    Creation Time : Tue Oct 14 17:31:59 2014
       Raid Level : raid1
       Array Size : 25149440 (23.98 GiB 25.75 GB)
    Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
     Raid Devices : 2
    Total Devices : 2
      Persistence : Superblock is persistent

    Intent Bitmap : Internal

      Update Time : Wed Oct 15 14:22:34 2014
            State : active, degraded
   Active Devices : 1
  Working Devices : 1
   Failed Devices : 1
    Spare Devices : 0

             Name : localhost.localdomain:3
             UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
           Events : 142

      Number   Major   Minor   RaidDevice   State
         0      65      21         0        faulty
         1      65       5         1        active sync   /dev/sdj5

All attempts to remove this device fail:

  % echo remove > /sys/block/md3/md/dev-sdr5/state
  -bash: echo: write error: Device or resource busy

This can be traced to state_store():

	} else if (cmd_match(buf, "remove")) {
		if (rdev->raid_disk >= 0)
			err = -EBUSY;

After much debugging and systemtapping, I think I've figured out that the
sysfs scripting may fail after the following combination of changes:

  mdadm  8af530b07fce "Enhance incremental removal."
  kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
                       badblocks during md_stop_writes"

With these two changes:

1 - On the user side, mdadm tries to set the array_state to read-auto on
    incremental removal (as invoked by the UDEV rule).

2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag,
    wake up the mddev->thread and, if there is a sync_thread, set
    MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.

When md_check_recovery() gets a chance to run as part of the raid1d()
mddev->thread, it may never reach an invocation of
remove_and_add_spares(), for there are *many* conditional early exits
along the way -- for example, if MD_RECOVERY_FROZEN is set, the following
condition will bounce out of the routine:

	if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
	    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
		goto unlock;

The next time around, MD_RECOVERY_NEEDED will have been cleared, so all
future tests will return 0 and the negation will always take the early
exit path.

Back in md_set_readonly(), it may notice that the MD is still in use, so
it clears MD_RECOVERY_FROZEN and then returns -EBUSY, without setting
mddev->ro. But the damage has been done, as conditions have been set such
that md_check_recovery() will never call remove_and_add_spares().

This would also explain why an "idle" sync_action clears the wedge: it
sets MD_RECOVERY_NEEDED, allowing md_check_recovery() to continue
executing to remove_and_add_spares().

As far as I can tell, this is what is happening to prevent the "remove"
write to /sys/block/md3/md/dev-sdr5/state from succeeding. There are
certainly a lot of little bit-states between disk removal, UDEV mdadm,
and various MD kernel threads, so apologies if I missed an important
transition.

Would you consider writing "idle" to the MD array sync_action file a safe
and reasonable intermediate workaround step for our script?

And of course, any suggestions as to whether this is intended behavior
(i.e., the removed component disk is failed, but stuck in the array)?

This is fairly easy for us to reproduce with multiple MD arrays per disk
(one per partition) and interrupting a raid check on all of them
(especially when they are delayed waiting for the first to finish) by
removing the component disk via sysfs PCI removal. We can provide
additional debug or testing if required.

Regards,

-- Joe
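[Editor's note: the fail/idle/remove sequence discussed above could be scripted roughly as follows. This is a sketch only -- the array path, component name, and the `remove_component` helper are illustrative placeholders, not part of the thread.]

```shell
#!/bin/sh
# Sketch of the hardened removal sequence: write "idle" to sync_action
# between "fail" and "remove" so MD_RECOVERY_NEEDED gets re-set and
# md_check_recovery() can reach remove_and_add_spares().
# MD_SYSFS and DEV are assumed example values.

MD_SYSFS=${MD_SYSFS:-/sys/block/md3/md}
DEV=${DEV:-dev-sdr5}

remove_component() {
    echo fail   > "$MD_SYSFS/$DEV/state"
    # Workaround: kick the recovery thread out of its wedged state.
    echo idle   > "$MD_SYSFS/sync_action"
    echo remove > "$MD_SYSFS/$DEV/state"
}
```

Calling `remove_component` with `MD_SYSFS` pointed at a real array would perform the sequence; the "idle" write is believed safe per the discussion below in the thread.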
* Re: RAID1 removing failed disk returns EBUSY
  2014-10-27 20:27 RAID1 removing failed disk returns EBUSY Joe Lawrence
@ 2014-10-28 21:41 ` NeilBrown
  2014-10-29 17:36   ` Joe Lawrence
  0 siblings, 1 reply; 22+ messages in thread

From: NeilBrown @ 2014-10-28 21:41 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-raid, Bill Kuzeja

On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence <joe.lawrence@stratus.com>
wrote:

> Hi Neil,
>
> We've encountered changes in MD and mdadm that have broken our automated
> disk removal script.
[...]
> This is fairly easy for us to reproduce with multiple MD arrays per disk
> (one per partition) and interrupting a raid check on all of them
> (especially when they are delayed waiting for the first to finish) by
> removing the component disk via sysfs PCI removal. We can provide
> additional debug or testing if required.

Hi Joe,
thanks for the detailed analysis!!

I think the correct fix would be that MD_RECOVERY_NEEDED should be set
after clearing MD_RECOVERY_FROZEN, like the patch below.
Can you confirm that it works for you?

Writing 'idle' should in general be safe, so that could be used as an
interim.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c03d87b6890a..2c73fcb82593 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 			printk("md: %s still in use.\n",mdname(mddev));
 		if (did_freeze) {
 			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			md_wakeup_thread(mddev->thread);
 		}
 		err = -EBUSY;
@@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 		mddev->ro = 1;
 		set_disk_ro(mddev->gendisk, 1);
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 		err = 0;
 	}
@@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	mutex_unlock(&mddev->open_mutex);
 	if (did_freeze) {
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 		md_wakeup_thread(mddev->thread);
 	}
 	return -EBUSY;
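[Editor's note: the wedge Joe describes, and why the patch's extra set_bit() call clears it, can be mimicked with a toy user-space model. The NEEDED/FROZEN variables and the shell function below are deliberate simplifications, not kernel code.]

```shell
#!/bin/sh
# Toy model of md_check_recovery()'s early-exit condition:
#   if (!test_and_clear_bit(MD_RECOVERY_NEEDED, ...) ||
#       test_bit(MD_RECOVERY_FROZEN, ...))
#           goto unlock;
# Note test_and_clear_bit() runs first, so NEEDED is consumed even
# when FROZEN forces the early exit.

NEEDED=0
FROZEN=0
LAST=""

md_check_recovery() {
    was_needed=$NEEDED
    NEEDED=0                          # test_and_clear_bit consumes the flag
    if [ "$was_needed" -eq 0 ] || [ "$FROZEN" -eq 1 ]; then
        LAST="early-exit"
        return
    fi
    LAST="remove_and_add_spares"
}

NEEDED=1          # 1. disk fails: MD_RECOVERY_NEEDED set
FROZEN=1          # 2. md_set_readonly() freezes recovery
md_check_recovery
echo "$LAST"      # early-exit (and NEEDED was consumed)

FROZEN=0          # 3. md_set_readonly() sees the array busy, clears
md_check_recovery #    FROZEN, returns -EBUSY; NEEDED stays 0
echo "$LAST"      # early-exit -- wedged on every future pass

NEEDED=1          # 4. "idle" write (or the patch's set_bit) re-arms it
md_check_recovery
echo "$LAST"      # remove_and_add_spares
```

Step 3 is the wedge: nothing re-sets NEEDED, so every subsequent pass early-exits; the patch adds the step-4 re-arm wherever FROZEN is cleared.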
* Re: RAID1 removing failed disk returns EBUSY
  2014-10-28 21:41 ` NeilBrown
@ 2014-10-29 17:36   ` Joe Lawrence
  2014-11-13 14:05     ` Joe Lawrence
  0 siblings, 1 reply; 22+ messages in thread

From: Joe Lawrence @ 2014-10-29 17:36 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Bill Kuzeja

On Wed, 29 Oct 2014 08:41:13 +1100
NeilBrown <neilb@suse.de> wrote:

> On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence <joe.lawrence@stratus.com>
> wrote:
[...]
> I think the correct fix would be that MD_RECOVERY_NEEDED should be set
> after clearing MD_RECOVERY_FROZEN, like the patch below.
> Can you confirm that it works for you?
>
> Writing 'idle' should in general be safe, so that could be used as an
> interim.
[...]

Hi Neil,

In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
pulled disk from the MD array. This is okay for our intentions, but I
wanted to make sure that it's okay to skip any failed-but-not-removed
state.

Tested-by: Joe Lawrence <joe.lawrence@stratus.com>

and should this have a

Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes")

tag to mark for stable?

Thanks,

-- Joe
* Re: RAID1 removing failed disk returns EBUSY
  2014-10-29 17:36   ` Joe Lawrence
@ 2014-11-13 14:05     ` Joe Lawrence
  2014-11-16 23:03       ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread

From: Joe Lawrence @ 2014-11-13 14:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Bill Kuzeja

On Wed, 29 Oct 2014 13:36:04 -0400
Joe Lawrence <joe.lawrence@stratus.com> wrote:
[...]
> Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
>
> and should this have a
>
> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes")
>
> tag to mark for stable?

Hi Neil,

Would you like me to write up a proper patch, or is this one in the queue?

Thanks,

-- Joe
* Re: RAID1 removing failed disk returns EBUSY 2014-11-13 14:05 ` Joe Lawrence @ 2014-11-16 23:03 ` NeilBrown 2015-01-14 12:41 ` XiaoNi 0 siblings, 1 reply; 22+ messages in thread From: NeilBrown @ 2014-11-16 23:03 UTC (permalink / raw) To: Joe Lawrence; +Cc: linux-raid, Bill Kuzeja [-- Attachment #1: Type: text/plain, Size: 8622 bytes --] On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence <joe.lawrence@stratus.com> wrote: > On Wed, 29 Oct 2014 13:36:04 -0400 > Joe Lawrence <joe.lawrence@stratus.com> wrote: > > > On Wed, 29 Oct 2014 08:41:13 +1100 > > NeilBrown <neilb@suse.de> wrote: > > > > > On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence <joe.lawrence@stratus.com> > > > wrote: > > > > > > > Hi Neil, > > > > > > > > We've encountered changes in MD and mdadm that have broken our automated > > > > disk removal script. In the past, we've been able to run the following > > > > after a RAID1 disk component removal: > > > > > > > > % echo fail > /sys/block/md3/md/dev-sdr5/state > > > > % echo remove > /sys/block/md3/md/dev-sdr5/state > > > > > > > > However, the latest RHEL6.6 code drop has rebased to sufficiently recent > > > > MD kernel and mdadm changes, in which the previous commands occasionally > > > > fail like so: > > > > > > > > * MD array is usually resyncing or checking > > > > * Component disk /dev/sdr removed via HBA sysfs PCI removal > > > > * Following UDEV rule fires: > > > > > > > > SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \ > > > > RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" > > > > > > > > % mdadm --detail /dev/md3 > > > > /dev/md3: > > > > Version : 1.1 > > > > Creation Time : Tue Oct 14 17:31:59 2014 > > > > Raid Level : raid1 > > > > Array Size : 25149440 (23.98 GiB 25.75 GB) > > > > Used Dev Size : 25149440 (23.98 GiB 25.75 GB) > > > > Raid Devices : 2 > > > > Total Devices : 2 > > > > Persistence : Superblock is persistent > > > > > > > > Intent Bitmap : Internal > > > > > > > > Update Time : Wed Oct 15 14:22:34 2014 > > > > 
State : active, degraded > > > > Active Devices : 1 > > > > Working Devices : 1 > > > > Failed Devices : 1 > > > > Spare Devices : 0 > > > > > > > > Name : localhost.localdomain:3 > > > > UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8 > > > > Events : 142 > > > > > > > > Number Major Minor RaidDevice State > > > > 0 65 21 0 faulty > > > > 1 65 5 1 active sync /dev/sdj5 > > > > > > > > All attempts to remove this device fail: > > > > > > > > % echo remove > /sys/block/md3/md/dev-sdr5/state > > > > -bash: echo: write error: Device or resource busy > > > > > > > > This can be traced to state_store(): > > > > > > > > } else if (cmd_match(buf, "remove")) { > > > > if (rdev->raid_disk >= 0) > > > > err = -EBUSY; > > > > > > > > After much debugging and systemtapping, I think I've figured out that the > > > > sysfs scripting may fail after the following combination of changes: > > > > > > > > mdadm 8af530b07fce "Enhance incremental removal." > > > > kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack > > > > badblocks during md_stop_writes" > > > > > > > > With these two changes: > > > > > > > > 1 - On the user side, mdadm is trying to set the array_state to read-auto > > > > on incremental removal (as invoked by UDEV rule). > > > > > > > > 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag, > > > > wake up the mddev->thread and if there is a sync_thread, it will set > > > > MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL. 
> > > > > > > > When md_check_recovery() gets a chance to run as part of the > > > > raid1d() mddev->thread, it may or may not ever get to > > > > an invocation of remove_and_add_spares(), for there are but *many* > > > > conditional early exits along the way -- for example, if > > > > MD_RECOVERY_FROZEN is set, the following condition will bounce out of > > > > the routine: > > > > > > > > if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) || > > > > test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) > > > > goto unlock; > > > > > > > > the next time around, MD_RECOVERY_NEEDED will have been cleared, so > > > > all future tests will return 0 and the negation will always take the > > > > early exit path. > > > > > > > > Back in md_set_readonly(), it may notice that the MD is still in use, > > > > so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, without > > > > setting mddev->ro. But the damage has been done as conditions have > > > > been set such that md_check_recovery() will never call > > > > remove_and_add_spares(). > > > > > > > > This would also explain why an "idle" sync_action clears the wedge: it > > > > sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue executing > > > > to remove_and_add_spares(). > > > > > > > > As far as I can tell, this is what is happening to prevent the "remove" > > > > write to /sys/block/md3/md/dev-sdr5/state from succeeding. There are > > > > certainly a lot of little bit-states between disk removal, UDEV mdadm, and > > > > various MD kernel threads, so apologies if I missed an important > > > > transition. > > > > > > > > Would you consider writing "idle" to the MD array sync_action file as a > > > > safe and reasonable intermediate workaround step for our script? > > > > > > > > And of course, any suggestions to whether this is intended behavior (ie, > > > > the removed component disk is failed, but stuck in the array)? 
> > > > > > > > This is fairly easy for us to reproduce with multiple MD arrays per disk > > > > (one per partition) and interrupting a raid check on all of them > > > > (especially when they are delayed waiting for the first to finish) by > > > > removing the component disk via sysfs PCI removal. We can provide > > > > additional debug or testing if required. > > > > > > > > > > Hi Joe, > > > thanks for the details analysis!! > > > > > > I think the correct fix would be that MD_RECOVERY_NEEDED should be set after > > > clearing MD_RECOVERY_FROZEN, like the patch below. > > > Can you confirm that it works for you? > > > > > > Writing 'idle' should in general be safe, so that could be used as an interim. > > > > > > Thanks, > > > NeilBrown > > > > > > diff --git a/drivers/md/md.c b/drivers/md/md.c > > > index c03d87b6890a..2c73fcb82593 100644 > > > --- a/drivers/md/md.c > > > +++ b/drivers/md/md.c > > > @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) > > > printk("md: %s still in use.\n",mdname(mddev)); > > > if (did_freeze) { > > > clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); > > > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > > > md_wakeup_thread(mddev->thread); > > > } > > > err = -EBUSY; > > > @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) > > > mddev->ro = 1; > > > set_disk_ro(mddev->gendisk, 1); > > > clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); > > > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > > > + md_wakeup_thread(mddev->thread); > > > sysfs_notify_dirent_safe(mddev->sysfs_state); > > > err = 0; > > > } > > > @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode, > > > mutex_unlock(&mddev->open_mutex); > > > if (did_freeze) { > > > clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); > > > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > > > md_wakeup_thread(mddev->thread); > > > } > > > return -EBUSY; > > > > Hi Neil, > > 
> > In my tests, the UDEV "mdadm -If" invocation fails *and* removes the > > pulled disk from the MD array. This is okay for our intentions, but I > > wanted to make sure that it's okay to skip any failed-but-not-removed > > state. > > > > Tested-by: Joe Lawrence <joe.lawrence@stratus.com> > > > > and should this have a > > > > Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes") > > > > tag to mark for stable? > > > Hi Neil, > > Would you like me to write up a proper patch, or is this one in the queue? > Several times over the last week I've thought that I should probably push that patch along ... but each time something else seemed more interesting. But it's a new week now. I've just posted a pull request. Thanks for the prompt (and the report and testing of course). NeilBrown ^ permalink raw reply [flat|nested] 22+ messages in thread
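To make the wedge concrete, here is a toy shell model of the flag interaction analyzed above and addressed by Neil's patch. It only mimics the test_and_clear_bit()/test_bit() gate with shell arithmetic; the names mirror the kernel bits, but none of this is kernel code:

```shell
#!/bin/sh
# Toy model of the md_check_recovery() early-exit gate (illustration only).
NEEDED=1; FROZEN=2
recovery=$(( NEEDED | FROZEN ))     # md_set_readonly() froze while NEEDED was pending

check_recovery() {
    local had_needed=$(( recovery & NEEDED ))
    recovery=$(( recovery & ~NEEDED ))           # test_and_clear_bit(MD_RECOVERY_NEEDED)
    if [ "$had_needed" -eq 0 ] || [ $(( recovery & FROZEN )) -ne 0 ]; then
        echo "early exit"
        return
    fi
    echo "remove_and_add_spares"
}

check_recovery                      # FROZEN set -> early exit, and NEEDED is now lost
recovery=$(( recovery & ~FROZEN ))  # unpatched md_set_readonly() error path: clear FROZEN only
check_recovery                      # NEEDED already cleared -> wedged
recovery=$(( recovery | NEEDED ))   # the fix: also set NEEDED after clearing FROZEN
check_recovery                      # now reaches remove_and_add_spares
```

Run as-is, the three printed lines show the early exit, the wedge, and the post-fix path in that order — the same progression as writing "idle" to sync_action, which simply sets the NEEDED bit again.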
* Re: RAID1 removing failed disk returns EBUSY 2014-11-16 23:03 ` NeilBrown @ 2015-01-14 12:41 ` XiaoNi 2015-01-15 13:22 ` Joe Lawrence 0 siblings, 1 reply; 22+ messages in thread From: XiaoNi @ 2015-01-14 12:41 UTC (permalink / raw) To: NeilBrown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja On 11/17/2014 07:03 AM, NeilBrown wrote: > On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence<joe.lawrence@stratus.com> > wrote: > >> On Wed, 29 Oct 2014 13:36:04 -0400 >> Joe Lawrence<joe.lawrence@stratus.com> wrote: >> >>> On Wed, 29 Oct 2014 08:41:13 +1100 >>> NeilBrown<neilb@suse.de> wrote: >>> >>>> On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence<joe.lawrence@stratus.com> >>>> wrote: >>>> >>>>> Hi Neil, >>>>> >>>>> We've encountered changes in MD and mdadm that have broken our automated >>>>> disk removal script. In the past, we've been able to run the following >>>>> after a RAID1 disk component removal: >>>>> >>>>> % echo fail> /sys/block/md3/md/dev-sdr5/state >>>>> % echo remove> /sys/block/md3/md/dev-sdr5/state >>>>> >>>>> However, the latest RHEL6.6 code drop has rebased to sufficiently recent >>>>> MD kernel and mdadm changes, in which the previous commands occasionally >>>>> fail like so: >>>>> >>>>> * MD array is usually resyncing or checking >>>>> * Component disk /dev/sdr removed via HBA sysfs PCI removal >>>>> * Following UDEV rule fires: >>>>> >>>>> SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \ >>>>> RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" >>>>> >>>>> % mdadm --detail /dev/md3 >>>>> /dev/md3: >>>>> Version : 1.1 >>>>> Creation Time : Tue Oct 14 17:31:59 2014 >>>>> Raid Level : raid1 >>>>> Array Size : 25149440 (23.98 GiB 25.75 GB) >>>>> Used Dev Size : 25149440 (23.98 GiB 25.75 GB) >>>>> Raid Devices : 2 >>>>> Total Devices : 2 >>>>> Persistence : Superblock is persistent >>>>> >>>>> Intent Bitmap : Internal >>>>> >>>>> Update Time : Wed Oct 15 14:22:34 2014 >>>>> State : active, degraded >>>>> Active Devices : 1 >>>>> Working Devices : 1 
>>>>> Failed Devices : 1 >>>>> Spare Devices : 0 >>>>> >>>>> Name : localhost.localdomain:3 >>>>> UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8 >>>>> Events : 142 >>>>> >>>>> Number Major Minor RaidDevice State >>>>> 0 65 21 0 faulty >>>>> 1 65 5 1 active sync /dev/sdj5 >>>>> >>>>> All attempts to remove this device fail: >>>>> >>>>> % echo remove> /sys/block/md3/md/dev-sdr5/state >>>>> -bash: echo: write error: Device or resource busy >>>>> >>>>> This can be traced to state_store(): >>>>> >>>>> } else if (cmd_match(buf, "remove")) { >>>>> if (rdev->raid_disk>= 0) >>>>> err = -EBUSY; >>>>> >>>>> After much debugging and systemtapping, I think I've figured out that the >>>>> sysfs scripting may fail after the following combination of changes: >>>>> >>>>> mdadm 8af530b07fce "Enhance incremental removal." >>>>> kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack >>>>> badblocks during md_stop_writes" >>>>> >>>>> With these two changes: >>>>> >>>>> 1 - On the user side, mdadm is trying to set the array_state to read-auto >>>>> on incremental removal (as invoked by UDEV rule). >>>>> >>>>> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag, >>>>> wake up the mddev->thread and if there is a sync_thread, it will set >>>>> MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL. 
>>>>> >>>>> When md_check_recovery() gets a chance to run as part of the >>>>> raid1d() mddev->thread, it may or may not ever get to >>>>> an invocation of remove_and_add_spares(), for there are but *many* >>>>> conditional early exits along the way -- for example, if >>>>> MD_RECOVERY_FROZEN is set, the following condition will bounce out of >>>>> the routine: >>>>> >>>>> if (!test_and_clear_bit(MD_RECOVERY_NEEDED,&mddev->recovery) || >>>>> test_bit(MD_RECOVERY_FROZEN,&mddev->recovery)) >>>>> goto unlock; >>>>> >>>>> the next time around, MD_RECOVERY_NEEDED will have been cleared, so >>>>> all future tests will return 0 and the negation will always take the >>>>> early exit path. >>>>> >>>>> Back in md_set_readonly(), it may notice that the MD is still in use, >>>>> so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, without >>>>> setting mddev->ro. But the damage has been done as conditions have >>>>> been set such that md_check_recovery() will never call >>>>> remove_and_add_spares(). >>>>> >>>>> This would also explain why an "idle" sync_action clears the wedge: it >>>>> sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue executing >>>>> to remove_and_add_spares(). >>>>> >>>>> As far as I can tell, this is what is happening to prevent the "remove" >>>>> write to /sys/block/md3/md/dev-sdr5/state from succeeding. There are >>>>> certainly a lot of little bit-states between disk removal, UDEV mdadm, and >>>>> various MD kernel threads, so apologies if I missed an important >>>>> transition. >>>>> >>>>> Would you consider writing "idle" to the MD array sync_action file as a >>>>> safe and reasonable intermediate workaround step for our script? >>>>> >>>>> And of course, any suggestions to whether this is intended behavior (ie, >>>>> the removed component disk is failed, but stuck in the array)? 
>>>>> >>>>> This is fairly easy for us to reproduce with multiple MD arrays per disk >>>>> (one per partition) and interrupting a raid check on all of them >>>>> (especially when they are delayed waiting for the first to finish) by >>>>> removing the component disk via sysfs PCI removal. We can provide >>>>> additional debug or testing if required. >>>>> >>>> Hi Joe, >>>> thanks for the details analysis!! >>>> >>>> I think the correct fix would be that MD_RECOVERY_NEEDED should be set after >>>> clearing MD_RECOVERY_FROZEN, like the patch below. >>>> Can you confirm that it works for you? >>>> >>>> Writing 'idle' should in general be safe, so that could be used as an interim. >>>> >>>> Thanks, >>>> NeilBrown >>>> >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c >>>> index c03d87b6890a..2c73fcb82593 100644 >>>> --- a/drivers/md/md.c >>>> +++ b/drivers/md/md.c >>>> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) >>>> printk("md: %s still in use.\n",mdname(mddev)); >>>> if (did_freeze) { >>>> clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery); >>>> + set_bit(MD_RECOVERY_NEEDED,&mddev->recovery); >>>> md_wakeup_thread(mddev->thread); >>>> } >>>> err = -EBUSY; >>>> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) >>>> mddev->ro = 1; >>>> set_disk_ro(mddev->gendisk, 1); >>>> clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery); >>>> + set_bit(MD_RECOVERY_NEEDED,&mddev->recovery); >>>> + md_wakeup_thread(mddev->thread); >>>> sysfs_notify_dirent_safe(mddev->sysfs_state); >>>> err = 0; >>>> } >>>> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode, >>>> mutex_unlock(&mddev->open_mutex); >>>> if (did_freeze) { >>>> clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery); >>>> + set_bit(MD_RECOVERY_NEEDED,&mddev->recovery); >>>> md_wakeup_thread(mddev->thread); >>>> } >>>> return -EBUSY; >>> Hi Neil, >>> >>> In my tests, the UDEV "mdadm -If" invocation fails *and* removes 
the >>> pulled disk from the MD array. This is okay for our intentions, but I >>> wanted to make sure that it's okay to skip any failed-but-not-removed >>> state. >>> >>> Tested-by: Joe Lawrence<joe.lawrence@stratus.com> >>> >>> and should this have a >>> >>> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes") >>> >>> tag to mark for stable? >> >> Hi Neil, >> >> Would you like me to write up a proper patch, or is this one in the queue? >> > Several times over the last week I've thought that I should probably push > that patch along ... but each time something else seemed more interesting. > But it's a new week now. I've just posted a pull request. > > Thanks for the prompt (and the report and testing of course). > > NeilBrown Hi Neil and Joe Any update on this? I tried this with 3.18.2 and the problem still exists. When it tries to remove the failed disk, the kernel finds the Blocked flag set in rdev->flags, so it can't remove the disk. Is this the right reason? Best Regards Xiao ^ permalink raw reply [flat|nested] 22+ messages in thread
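As a triage aid for the Blocked-flag question above, the component's sysfs state file can be inspected before retrying the remove. The helper below is only a sketch: "blocked" and "faulty" are real md state tokens, but the classifications merely restate this thread's hypotheses, not definitive kernel behavior:

```shell
#!/bin/sh
# Sketch: triage why "remove" might return EBUSY, keyed off the
# comma-separated flags in the component's sysfs state file.
classify_rdev_state() {
    case ",$1," in
        *,blocked,*) echo "blocked flag set: writes to the rdev are held off" ;;
        *,faulty,*)  echo "faulty but not blocked: EBUSY suggests the frozen-recovery wedge" ;;
        *)           echo "still active: fail the device first" ;;
    esac
}

# e.g. classify_rdev_state "$(cat /sys/block/md0/md/dev-sdb1/state)"
classify_rdev_state "faulty,blocked"   # -> blocked flag set: writes to the rdev are held off
```

If the state shows plain "faulty" yet "remove" still fails, that points back at the frozen-recovery analysis rather than at an unacknowledged-badblocks block.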
* Re: RAID1 removing failed disk returns EBUSY 2015-01-14 12:41 ` XiaoNi @ 2015-01-15 13:22 ` Joe Lawrence 2015-01-16 5:20 ` Xiao Ni 0 siblings, 1 reply; 22+ messages in thread From: Joe Lawrence @ 2015-01-15 13:22 UTC (permalink / raw) To: XiaoNi; +Cc: NeilBrown, linux-raid, Bill Kuzeja On Wed, 14 Jan 2015 20:41:16 +0800 XiaoNi <xni@redhat.com> wrote: > On 11/17/2014 07:03 AM, NeilBrown wrote: > > On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence<joe.lawrence@stratus.com> > > wrote: > > > >> On Wed, 29 Oct 2014 13:36:04 -0400 > >> Joe Lawrence<joe.lawrence@stratus.com> wrote: > >> > >>> On Wed, 29 Oct 2014 08:41:13 +1100 > >>> NeilBrown<neilb@suse.de> wrote: > >>> > >>>> On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence<joe.lawrence@stratus.com> > >>>> wrote: > >>>> > >>>>> Hi Neil, > >>>>> > >>>>> We've encountered changes in MD and mdadm that have broken our automated > >>>>> disk removal script. In the past, we've been able to run the following > >>>>> after a RAID1 disk component removal: > >>>>> > >>>>> % echo fail> /sys/block/md3/md/dev-sdr5/state > >>>>> % echo remove> /sys/block/md3/md/dev-sdr5/state > >>>>> > >>>>> However, the latest RHEL6.6 code drop has rebased to sufficiently recent > >>>>> MD kernel and mdadm changes, in which the previous commands occasionally > >>>>> fail like so: > >>>>> > >>>>> * MD array is usually resyncing or checking > >>>>> * Component disk /dev/sdr removed via HBA sysfs PCI removal > >>>>> * Following UDEV rule fires: > >>>>> > >>>>> SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \ > >>>>> RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" > >>>>> > >>>>> % mdadm --detail /dev/md3 > >>>>> /dev/md3: > >>>>> Version : 1.1 > >>>>> Creation Time : Tue Oct 14 17:31:59 2014 > >>>>> Raid Level : raid1 > >>>>> Array Size : 25149440 (23.98 GiB 25.75 GB) > >>>>> Used Dev Size : 25149440 (23.98 GiB 25.75 GB) > >>>>> Raid Devices : 2 > >>>>> Total Devices : 2 > >>>>> Persistence : Superblock is persistent > >>>>> > >>>>> Intent 
Bitmap : Internal > >>>>> > >>>>> Update Time : Wed Oct 15 14:22:34 2014 > >>>>> State : active, degraded > >>>>> Active Devices : 1 > >>>>> Working Devices : 1 > >>>>> Failed Devices : 1 > >>>>> Spare Devices : 0 > >>>>> > >>>>> Name : localhost.localdomain:3 > >>>>> UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8 > >>>>> Events : 142 > >>>>> > >>>>> Number Major Minor RaidDevice State > >>>>> 0 65 21 0 faulty > >>>>> 1 65 5 1 active sync /dev/sdj5 > >>>>> > >>>>> All attempts to remove this device fail: > >>>>> > >>>>> % echo remove> /sys/block/md3/md/dev-sdr5/state > >>>>> -bash: echo: write error: Device or resource busy > >>>>> > >>>>> This can be traced to state_store(): > >>>>> > >>>>> } else if (cmd_match(buf, "remove")) { > >>>>> if (rdev->raid_disk>= 0) > >>>>> err = -EBUSY; > >>>>> > >>>>> After much debugging and systemtapping, I think I've figured out that the > >>>>> sysfs scripting may fail after the following combination of changes: > >>>>> > >>>>> mdadm 8af530b07fce "Enhance incremental removal." > >>>>> kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack > >>>>> badblocks during md_stop_writes" > >>>>> > >>>>> With these two changes: > >>>>> > >>>>> 1 - On the user side, mdadm is trying to set the array_state to read-auto > >>>>> on incremental removal (as invoked by UDEV rule). > >>>>> > >>>>> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag, > >>>>> wake up the mddev->thread and if there is a sync_thread, it will set > >>>>> MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL. 
> >>>>> > >>>>> When md_check_recovery() gets a chance to run as part of the > >>>>> raid1d() mddev->thread, it may or may not ever get to > >>>>> an invocation of remove_and_add_spares(), for there are but *many* > >>>>> conditional early exits along the way -- for example, if > >>>>> MD_RECOVERY_FROZEN is set, the following condition will bounce out of > >>>>> the routine: > >>>>> > >>>>> if (!test_and_clear_bit(MD_RECOVERY_NEEDED,&mddev->recovery) || > >>>>> test_bit(MD_RECOVERY_FROZEN,&mddev->recovery)) > >>>>> goto unlock; > >>>>> > >>>>> the next time around, MD_RECOVERY_NEEDED will have been cleared, so > >>>>> all future tests will return 0 and the negation will always take the > >>>>> early exit path. > >>>>> > >>>>> Back in md_set_readonly(), it may notice that the MD is still in use, > >>>>> so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, without > >>>>> setting mddev->ro. But the damage has been done as conditions have > >>>>> been set such that md_check_recovery() will never call > >>>>> remove_and_add_spares(). > >>>>> > >>>>> This would also explain why an "idle" sync_action clears the wedge: it > >>>>> sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue executing > >>>>> to remove_and_add_spares(). > >>>>> > >>>>> As far as I can tell, this is what is happening to prevent the "remove" > >>>>> write to /sys/block/md3/md/dev-sdr5/state from succeeding. There are > >>>>> certainly a lot of little bit-states between disk removal, UDEV mdadm, and > >>>>> various MD kernel threads, so apologies if I missed an important > >>>>> transition. > >>>>> > >>>>> Would you consider writing "idle" to the MD array sync_action file as a > >>>>> safe and reasonable intermediate workaround step for our script? > >>>>> > >>>>> And of course, any suggestions to whether this is intended behavior (ie, > >>>>> the removed component disk is failed, but stuck in the array)? 
> >>>>> > >>>>> This is fairly easy for us to reproduce with multiple MD arrays per disk > >>>>> (one per partition) and interrupting a raid check on all of them > >>>>> (especially when they are delayed waiting for the first to finish) by > >>>>> removing the component disk via sysfs PCI removal. We can provide > >>>>> additional debug or testing if required. > >>>>> > >>>> Hi Joe, > >>>> thanks for the details analysis!! > >>>> > >>>> I think the correct fix would be that MD_RECOVERY_NEEDED should be set after > >>>> clearing MD_RECOVERY_FROZEN, like the patch below. > >>>> Can you confirm that it works for you? > >>>> > >>>> Writing 'idle' should in general be safe, so that could be used as an interim. > >>>> > >>>> Thanks, > >>>> NeilBrown > >>>> > >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c > >>>> index c03d87b6890a..2c73fcb82593 100644 > >>>> --- a/drivers/md/md.c > >>>> +++ b/drivers/md/md.c > >>>> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) > >>>> printk("md: %s still in use.\n",mdname(mddev)); > >>>> if (did_freeze) { > >>>> clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery); > >>>> + set_bit(MD_RECOVERY_NEEDED,&mddev->recovery); > >>>> md_wakeup_thread(mddev->thread); > >>>> } > >>>> err = -EBUSY; > >>>> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev) > >>>> mddev->ro = 1; > >>>> set_disk_ro(mddev->gendisk, 1); > >>>> clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery); > >>>> + set_bit(MD_RECOVERY_NEEDED,&mddev->recovery); > >>>> + md_wakeup_thread(mddev->thread); > >>>> sysfs_notify_dirent_safe(mddev->sysfs_state); > >>>> err = 0; > >>>> } > >>>> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode, > >>>> mutex_unlock(&mddev->open_mutex); > >>>> if (did_freeze) { > >>>> clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery); > >>>> + set_bit(MD_RECOVERY_NEEDED,&mddev->recovery); > >>>> md_wakeup_thread(mddev->thread); > >>>> } > >>>> return 
-EBUSY; > >>> Hi Neil, > >>> > >>> In my tests, the UDEV "mdadm -If" invocation fails *and* removes the > >>> pulled disk from the MD array. This is okay for our intentions, but I > >>> wanted to make sure that it's okay to skip any failed-but-not-removed > >>> state. > >>> > >>> Tested-by: Joe Lawrence<joe.lawrence@stratus.com> > >>> > >>> and should this have a > >>> > >>> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes") > >>> > >>> tag to mark for stable? > >> > >> Hi Neil, > >> > >> Would you like me to write up a proper patch, or is this one in the queue? > >> > > Several times over the last week I've thought that I should probably push > > that patch along ... but each time something else seemed more interesting. > > But it's a new week now. I've just posted a pull request. > > > > Thanks for the prompt (and the report and testing of course). > > > > NeilBrown > Hi Neil and Joe > > Any update for this? I tried this with 3.18.2 and the problem still > exists. > > When it tried to remove the failed disk. it find the Blocked flag in > rdev->flags is > set. So it can't remove the disk. Is this the right reason? Hi Xiao, It's been a while since I've looked at this patch, but it looks like it made it into 3.18, so it should be present on 3.18.2. What version of mdadm are you running? Does writing an "idle" sync_action clear this condition? Regards, -- Joe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-15 13:22 ` Joe Lawrence @ 2015-01-16 5:20 ` Xiao Ni 2015-01-16 15:10 ` Joe Lawrence 0 siblings, 1 reply; 22+ messages in thread From: Xiao Ni @ 2015-01-16 5:20 UTC (permalink / raw) To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja ----- Original Message ----- > From: "Joe Lawrence" <joe.lawrence@stratus.com> > To: "XiaoNi" <xni@redhat.com> > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > Sent: Thursday, January 15, 2015 9:22:10 PM > Subject: Re: RAID1 removing failed disk returns EBUSY > > On Wed, 14 Jan 2015 20:41:16 +0800 > XiaoNi <xni@redhat.com> wrote: > > > On 11/17/2014 07:03 AM, NeilBrown wrote: > > > On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence<joe.lawrence@stratus.com> > > > wrote: > > > > > >> On Wed, 29 Oct 2014 13:36:04 -0400 > > >> Joe Lawrence<joe.lawrence@stratus.com> wrote: > > >> > > >>> On Wed, 29 Oct 2014 08:41:13 +1100 > > >>> NeilBrown<neilb@suse.de> wrote: > > >>> > > >>>> On Mon, 27 Oct 2014 16:27:48 -0400 Joe > > >>>> Lawrence<joe.lawrence@stratus.com> > > >>>> wrote: > > >>>> > > >>>>> Hi Neil, > > >>>>> > > >>>>> We've encountered changes in MD and mdadm that have broken our > > >>>>> automated > > >>>>> disk removal script. 
In the past, we've been able to run the > > >>>>> following > > >>>>> after a RAID1 disk component removal: > > >>>>> > > >>>>> % echo fail> /sys/block/md3/md/dev-sdr5/state > > >>>>> % echo remove> /sys/block/md3/md/dev-sdr5/state > > >>>>> > > >>>>> However, the latest RHEL6.6 code drop has rebased to sufficiently > > >>>>> recent > > >>>>> MD kernel and mdadm changes, in which the previous commands > > >>>>> occasionally > > >>>>> fail like so: > > >>>>> > > >>>>> * MD array is usually resyncing or checking > > >>>>> * Component disk /dev/sdr removed via HBA sysfs PCI removal > > >>>>> * Following UDEV rule fires: > > >>>>> > > >>>>> SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \ > > >>>>> RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" > > >>>>> > > >>>>> % mdadm --detail /dev/md3 > > >>>>> /dev/md3: > > >>>>> Version : 1.1 > > >>>>> Creation Time : Tue Oct 14 17:31:59 2014 > > >>>>> Raid Level : raid1 > > >>>>> Array Size : 25149440 (23.98 GiB 25.75 GB) > > >>>>> Used Dev Size : 25149440 (23.98 GiB 25.75 GB) > > >>>>> Raid Devices : 2 > > >>>>> Total Devices : 2 > > >>>>> Persistence : Superblock is persistent > > >>>>> > > >>>>> Intent Bitmap : Internal > > >>>>> > > >>>>> Update Time : Wed Oct 15 14:22:34 2014 > > >>>>> State : active, degraded > > >>>>> Active Devices : 1 > > >>>>> Working Devices : 1 > > >>>>> Failed Devices : 1 > > >>>>> Spare Devices : 0 > > >>>>> > > >>>>> Name : localhost.localdomain:3 > > >>>>> UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8 > > >>>>> Events : 142 > > >>>>> > > >>>>> Number Major Minor RaidDevice State > > >>>>> 0 65 21 0 faulty > > >>>>> 1 65 5 1 active sync /dev/sdj5 > > >>>>> > > >>>>> All attempts to remove this device fail: > > >>>>> > > >>>>> % echo remove> /sys/block/md3/md/dev-sdr5/state > > >>>>> -bash: echo: write error: Device or resource busy > > >>>>> > > >>>>> This can be traced to state_store(): > > >>>>> > > >>>>> } else if (cmd_match(buf, "remove")) { > > >>>>> if (rdev->raid_disk>= 0) > > 
>>>>> err = -EBUSY; > > >>>>> > > >>>>> After much debugging and systemtapping, I think I've figured out that > > >>>>> the > > >>>>> sysfs scripting may fail after the following combination of changes: > > >>>>> > > >>>>> mdadm 8af530b07fce "Enhance incremental removal." > > >>>>> kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has > > >>>>> unack > > >>>>> badblocks during md_stop_writes" > > >>>>> > > >>>>> With these two changes: > > >>>>> > > >>>>> 1 - On the user side, mdadm is trying to set the array_state to > > >>>>> read-auto > > >>>>> on incremental removal (as invoked by UDEV rule). > > >>>>> > > >>>>> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN > > >>>>> flag, > > >>>>> wake up the mddev->thread and if there is a sync_thread, it will > > >>>>> set > > >>>>> MD_RECOVERY_INTR and then wait until the sync_thread is set to > > >>>>> NULL. > > >>>>> > > >>>>> When md_check_recovery() gets a chance to run as part of the > > >>>>> raid1d() mddev->thread, it may or may not ever get to > > >>>>> an invocation of remove_and_add_spares(), for there are but > > >>>>> *many* > > >>>>> conditional early exits along the way -- for example, if > > >>>>> MD_RECOVERY_FROZEN is set, the following condition will bounce > > >>>>> out of > > >>>>> the routine: > > >>>>> > > >>>>> if > > >>>>> (!test_and_clear_bit(MD_RECOVERY_NEEDED,&mddev->recovery) > > >>>>> || > > >>>>> test_bit(MD_RECOVERY_FROZEN,&mddev->recovery)) > > >>>>> goto unlock; > > >>>>> > > >>>>> the next time around, MD_RECOVERY_NEEDED will have been cleared, > > >>>>> so > > >>>>> all future tests will return 0 and the negation will always take > > >>>>> the > > >>>>> early exit path. > > >>>>> > > >>>>> Back in md_set_readonly(), it may notice that the MD is still in > > >>>>> use, > > >>>>> so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, > > >>>>> without > > >>>>> setting mddev->ro. 
But the damage has been done as conditions > > >>>>> have > > >>>>> been set such that md_check_recovery() will never call > > >>>>> remove_and_add_spares(). > > >>>>> > > >>>>> This would also explain why an "idle" sync_action clears the wedge: > > >>>>> it > > >>>>> sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue > > >>>>> executing > > >>>>> to remove_and_add_spares(). > > >>>>> > > >>>>> As far as I can tell, this is what is happening to prevent the > > >>>>> "remove" > > >>>>> write to /sys/block/md3/md/dev-sdr5/state from succeeding. There are > > >>>>> certainly a lot of little bit-states between disk removal, UDEV > > >>>>> mdadm, and > > >>>>> various MD kernel threads, so apologies if I missed an important > > >>>>> transition. > > >>>>> > > >>>>> Would you consider writing "idle" to the MD array sync_action file as > > >>>>> a > > >>>>> safe and reasonable intermediate workaround step for our script? > > >>>>> > > >>>>> And of course, any suggestions to whether this is intended behavior > > >>>>> (ie, > > >>>>> the removed component disk is failed, but stuck in the array)? > > >>>>> > > >>>>> This is fairly easy for us to reproduce with multiple MD arrays per > > >>>>> disk > > >>>>> (one per partition) and interrupting a raid check on all of them > > >>>>> (especially when they are delayed waiting for the first to finish) by > > >>>>> removing the component disk via sysfs PCI removal. We can provide > > >>>>> additional debug or testing if required. > > >>>>> > > >>>> Hi Joe, > > >>>> thanks for the details analysis!! > > >>>> > > >>>> I think the correct fix would be that MD_RECOVERY_NEEDED should be set > > >>>> after > > >>>> clearing MD_RECOVERY_FROZEN, like the patch below. > > >>>> Can you confirm that it works for you? > > >>>> > > >>>> Writing 'idle' should in general be safe, so that could be used as an > > >>>> interim. 
> > >>>> > > >>>> Thanks, > > >>>> NeilBrown > > >>>> > > >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c > > >>>> index c03d87b6890a..2c73fcb82593 100644 > > >>>> --- a/drivers/md/md.c > > >>>> +++ b/drivers/md/md.c > > >>>> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, > > >>>> struct block_device *bdev) > > >>>> printk("md: %s still in use.\n",mdname(mddev)); > > >>>> if (did_freeze) { > > >>>> clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery); > > >>>> + set_bit(MD_RECOVERY_NEEDED,&mddev->recovery); > > >>>> md_wakeup_thread(mddev->thread); > > >>>> } > > >>>> err = -EBUSY; > > >>>> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, > > >>>> struct block_device *bdev) > > >>>> mddev->ro = 1; > > >>>> set_disk_ro(mddev->gendisk, 1); > > >>>> clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery); > > >>>> + set_bit(MD_RECOVERY_NEEDED,&mddev->recovery); > > >>>> + md_wakeup_thread(mddev->thread); > > >>>> sysfs_notify_dirent_safe(mddev->sysfs_state); > > >>>> err = 0; > > >>>> } > > >>>> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int > > >>>> mode, > > >>>> mutex_unlock(&mddev->open_mutex); > > >>>> if (did_freeze) { > > >>>> clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery); > > >>>> + set_bit(MD_RECOVERY_NEEDED,&mddev->recovery); > > >>>> md_wakeup_thread(mddev->thread); > > >>>> } > > >>>> return -EBUSY; > > >>> Hi Neil, > > >>> > > >>> In my tests, the UDEV "mdadm -If" invocation fails *and* removes the > > >>> pulled disk from the MD array. This is okay for our intentions, but I > > >>> wanted to make sure that it's okay to skip any failed-but-not-removed > > >>> state. > > >>> > > >>> Tested-by: Joe Lawrence<joe.lawrence@stratus.com> > > >>> > > >>> and should this have a > > >>> > > >>> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has > > >>> unack badblocks during md_stop_writes") > > >>> > > >>> tag to mark for stable? 
> > >> > > >> Hi Neil, > > >> > > >> Would you like me to write up a proper patch, or is this one in the > > >> queue? > > >> > > > Several times over the last week I've thought that I should probably push > > > that patch along ... but each time something else seemed more > > > interesting. > > > But it's a new week now. I've just posted a pull request. > > > > > > Thanks for the prompt (and the report and testing of course). > > > > > > NeilBrown > > Hi Neil and Joe > > > > Any update for this? I tried this with 3.18.2 and the problem still > > exists. > > > > When it tried to remove the failed disk. it find the Blocked flag in > > rdev->flags is > > set. So it can't remove the disk. Is this the right reason? > > Hi Xiao, > > It's been a while since I've looked at this patch, but it looks like it > made it into 3.18, so it should be present on 3.18.2. > > What version of mdadm are you running? > > Does writing an "idle" sync_action clear this condition? > > Regards, > > -- Joe > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Hi Joe Thanks for reminding me. I hadn't tried that. Now the disk can be removed successfully after writing "idle" to sync_action. I had wrongly thought that the patch referenced in this mail fixed the problem. Best Regards Xiao ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-16 5:20 ` Xiao Ni @ 2015-01-16 15:10 ` Joe Lawrence 2015-01-19 2:33 ` Xiao Ni 0 siblings, 1 reply; 22+ messages in thread From: Joe Lawrence @ 2015-01-16 15:10 UTC (permalink / raw) To: Xiao Ni; +Cc: NeilBrown, linux-raid, Bill Kuzeja On Fri, 16 Jan 2015 00:20:12 -0500 Xiao Ni <xni@redhat.com> wrote: > > Hi Joe > > Thanks for reminding me. I didn't do that. Now it can remove successfully after writing > "idle" to sync_action. > > I thought wrongly that the patch referenced in this mail is fixed for the problem. So it sounds like even with 3.18 and a new mdadm, this bug still persists? -- Joe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-16 15:10 ` Joe Lawrence @ 2015-01-19 2:33 ` Xiao Ni 2015-01-19 17:56 ` Joe Lawrence 2015-01-29 3:52 ` NeilBrown 0 siblings, 2 replies; 22+ messages in thread From: Xiao Ni @ 2015-01-19 2:33 UTC (permalink / raw) To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja ----- Original Message ----- > From: "Joe Lawrence" <joe.lawrence@stratus.com> > To: "Xiao Ni" <xni@redhat.com> > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > Sent: Friday, January 16, 2015 11:10:31 PM > Subject: Re: RAID1 removing failed disk returns EBUSY > > On Fri, 16 Jan 2015 00:20:12 -0500 > Xiao Ni <xni@redhat.com> wrote: > > > > Hi Joe > > > > Thanks for reminding me. I didn't do that. Now it can remove > > successfully after writing > > "idle" to sync_action. > > > > I thought wrongly that the patch referenced in this mail is fixed for > > the problem. > > So it sounds like even with 3.18 and a new mdadm, this bug still > persists? > > -- Joe > > -- Hi Joe I'm a little confused now. Does the patch 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable resolve the problem? My environment is: [root@dhcp-12-133 mdadm]# mdadm --version mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest upstream) [root@dhcp-12-133 mdadm]# uname -r 3.18.2 My steps are: [root@dhcp-12-133 mdadm]# lsblk sdb 8:16 0 931.5G 0 disk └─sdb1 8:17 0 5G 0 part sdc 8:32 0 186.3G 0 disk sdd 8:48 0 931.5G 0 disk └─sdd1 8:49 0 5G 0 part [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 --assume-clean mdadm: Note: this array has metadata at the start and may not be suitable as a boot device. If you plan to store '/boot' on this device please ensure that your boot-loader understands md/v1.x metadata, or use --metadata=0.90 mdadm: Defaulting to version 1.2 metadata mdadm: array /dev/md0 started. Then I unplug the disk. 
[root@dhcp-12-133 mdadm]# lsblk sdc 8:32 0 186.3G 0 disk sdd 8:48 0 931.5G 0 disk └─sdd1 8:49 0 5G 0 part └─md0 9:0 0 5G 0 raid1 [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state -bash: echo: write error: Device or resource busy [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state Now, after I write "idle" to sync_action, the disk can be removed as you said in the mail. It's a good workaround. Is this OK? Best Regards Xiao ^ permalink raw reply [flat|nested] 22+ messages in thread
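Xiao's sequence above can be folded into one helper for removal scripts that need the workaround. This is a sketch only: md_remove_component() and the MD_SYSFS override are hypothetical names (the override exists solely so the function can be exercised against a scratch directory), and error handling is minimal:

```shell
#!/bin/sh
# Sketch of the scripted component removal with the "idle" kick folded in.
# MD_SYSFS is a hypothetical override; by default we talk to the real sysfs tree.
md_remove_component() {
    local md=$1 dev=$2
    local base=${MD_SYSFS:-/sys/block}/$md/md

    echo faulty > "$base/dev-$dev/state"
    # "remove" returns EBUSY while recovery is wedged; writing "idle" sets
    # MD_RECOVERY_NEEDED so md_check_recovery() can reach
    # remove_and_add_spares(), after which the remove is retried.
    if ! echo remove > "$base/dev-$dev/state" 2>/dev/null; then
        echo idle > "$base/sync_action"
        echo remove > "$base/dev-$dev/state"
    fi
}
```

On a real array this is just `md_remove_component md0 sdb1`: the fail/remove pair from the start of the thread, with the "idle" write inserted between attempts only when the first remove fails.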
* Re: RAID1 removing failed disk returns EBUSY 2015-01-19 2:33 ` Xiao Ni @ 2015-01-19 17:56 ` Joe Lawrence 2015-01-20 7:16 ` Xiao Ni 2015-01-29 3:52 ` NeilBrown 1 sibling, 1 reply; 22+ messages in thread From: Joe Lawrence @ 2015-01-19 17:56 UTC (permalink / raw) To: Xiao Ni; +Cc: NeilBrown, linux-raid, Bill Kuzeja On Sun, 18 Jan 2015 21:33:50 -0500 Xiao Ni <xni@redhat.com> wrote: > > > ----- Original Message ----- > > From: "Joe Lawrence" <joe.lawrence@stratus.com> > > To: "Xiao Ni" <xni@redhat.com> > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > > Sent: Friday, January 16, 2015 11:10:31 PM > > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > On Fri, 16 Jan 2015 00:20:12 -0500 > > Xiao Ni <xni@redhat.com> wrote: > > > > > > Hi Joe > > > > > > Thanks for reminding me. I didn't do that. Now it can remove > > > successfully after writing > > > "idle" to sync_action. > > > > > > I thought wrongly that the patch referenced in this mail is fixed for > > > the problem. > > > > So it sounds like even with 3.18 and a new mdadm, this bug still > > persists? > > > > -- Joe > > > > -- > > Hi Joe > > I'm a little confused now. Does the patch 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable > resolve the problem? > > My environment is: > > [root@dhcp-12-133 mdadm]# mdadm --version > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest upstream) > [root@dhcp-12-133 mdadm]# uname -r > 3.18.2 > > > My steps are: > > [root@dhcp-12-133 mdadm]# lsblk > sdb 8:16 0 931.5G 0 disk > └─sdb1 8:17 0 5G 0 part > sdc 8:32 0 186.3G 0 disk > sdd 8:48 0 931.5G 0 disk > └─sdd1 8:49 0 5G 0 part > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 --assume-clean > mdadm: Note: this array has metadata at the start and > may not be suitable as a boot device. 
If you plan to > store '/boot' on this device please ensure that > your boot-loader understands md/v1.x metadata, or use > --metadata=0.90 > mdadm: Defaulting to version 1.2 metadata > mdadm: array /dev/md0 started. > > Then I unplug the disk. > > [root@dhcp-12-133 mdadm]# lsblk > sdc 8:32 0 186.3G 0 disk > sdd 8:48 0 931.5G 0 disk > └─sdd1 8:49 0 5G 0 part > └─md0 9:0 0 5G 0 raid1 > [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > -bash: echo: write error: Device or resource busy > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > > > Now after I set idle to sync_action, it can be removed as you said in the mail. > It's a good workaround. Is this OK? > > Best Regards > Xiao Hi Xiao, According to my notes, the "idle" sync_action was always a viable workaround, with or without this change. Neil's patch should have made it possible to issue only a "faulty" and "remove" to remove the RAID component. I don't have an exact version, but it appears that my mdadm version was an upstream git from the Oct 27th timeframe. -- Joe -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-19 17:56 ` Joe Lawrence @ 2015-01-20 7:16 ` Xiao Ni 2015-01-23 15:11 ` Joe Lawrence 0 siblings, 1 reply; 22+ messages in thread From: Xiao Ni @ 2015-01-20 7:16 UTC (permalink / raw) To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja ----- Original Message ----- > From: "Joe Lawrence" <joe.lawrence@stratus.com> > To: "Xiao Ni" <xni@redhat.com> > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > Sent: Tuesday, January 20, 2015 1:56:50 AM > Subject: Re: RAID1 removing failed disk returns EBUSY > > On Sun, 18 Jan 2015 21:33:50 -0500 > Xiao Ni <xni@redhat.com> wrote: > > > > > > > ----- Original Message ----- > > > From: "Joe Lawrence" <joe.lawrence@stratus.com> > > > To: "Xiao Ni" <xni@redhat.com> > > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill > > > Kuzeja" <william.kuzeja@stratus.com> > > > Sent: Friday, January 16, 2015 11:10:31 PM > > > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > > > On Fri, 16 Jan 2015 00:20:12 -0500 > > > Xiao Ni <xni@redhat.com> wrote: > > > > > > > > Hi Joe > > > > > > > > Thanks for reminding me. I didn't do that. Now it can remove > > > > successfully after writing > > > > "idle" to sync_action. > > > > > > > > I thought wrongly that the patch referenced in this mail is fixed > > > > for > > > > the problem. > > > > > > So it sounds like even with 3.18 and a new mdadm, this bug still > > > persists? > > > > > > -- Joe > > > > > > -- > > > > Hi Joe > > > > I'm a little confused now. Does the patch > > 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable > > resolve the problem? 
> > > > My environment is: > > > > [root@dhcp-12-133 mdadm]# mdadm --version > > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest > > upstream) > > [root@dhcp-12-133 mdadm]# uname -r > > 3.18.2 > > > > > > My steps are: > > > > [root@dhcp-12-133 mdadm]# lsblk > > sdb 8:16 0 931.5G 0 disk > > └─sdb1 8:17 0 5G 0 part > > sdc 8:32 0 186.3G 0 disk > > sdd 8:48 0 931.5G 0 disk > > └─sdd1 8:49 0 5G 0 part > > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 > > --assume-clean > > mdadm: Note: this array has metadata at the start and > > may not be suitable as a boot device. If you plan to > > store '/boot' on this device please ensure that > > your boot-loader understands md/v1.x metadata, or use > > --metadata=0.90 > > mdadm: Defaulting to version 1.2 metadata > > mdadm: array /dev/md0 started. > > > > Then I unplug the disk. > > > > [root@dhcp-12-133 mdadm]# lsblk > > sdc 8:32 0 186.3G 0 disk > > sdd 8:48 0 931.5G 0 disk > > └─sdd1 8:49 0 5G 0 part > > └─md0 9:0 0 5G 0 raid1 > > [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state > > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > > -bash: echo: write error: Device or resource busy > > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action > > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > > > > > > Now after I set idle to sync_action, it can be removed as you said in > > the mail. > > It's a good workaround. Is this OK? > > > > Best Regards > > Xiao > > Hi Xiao, > > According to my notes, the "idle" sync_action was always a viable > workaround, with or with this change. > > Neil's patch should have made it possible to issue only a > "faulty" and "remove" to remove the RAID component. > > I don't have an exact version, but it appears that my mdadm version was > an upstream git from Oct 27-th timeframe. > > -- Joe > Joe Thanks for the explanation. 
So echoing "idle" to sync_action is a workaround without the patch. It looks like the patch is not enough to fix the problem. Have you tried the new patch? Does the problem still exist in your environment? If your environment has no problem, can you give me the version number? I'll try the same version too. Best Regards Xiao -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-20 7:16 ` Xiao Ni @ 2015-01-23 15:11 ` Joe Lawrence 2015-01-30 2:19 ` Xiao Ni 0 siblings, 1 reply; 22+ messages in thread From: Joe Lawrence @ 2015-01-23 15:11 UTC (permalink / raw) To: Xiao Ni; +Cc: NeilBrown, linux-raid, Bill Kuzeja On Tue, 20 Jan 2015 02:16:46 -0500 Xiao Ni <xni@redhat.com> wrote: > Joe > > Thanks for the explanation. So echo "idle" to sync_action is a workaround > without the patch. > > It looks like the patch is not enough to fix the problem. > Do you have a try with the new patch? Is the problem still exist in > your environment? > > If your environment have no problem, can you give me the version number? I'll > have a try with the same version too. Hi Xiao, Bill and I did some more testing yesterday and I think we've figured out the confusion. Running a 3.18+ kernel and an upstream mdadm, it was the udev invocation of "mdadm -If <dev>" that was automatically removing the device for us. If we ran with an older mdadm and got the MD wedged in the faulty condition, then nothing we echoed into the sysfs state file ('idle' 'fail' or 'remove') would change anything. I think this agrees with your testing report. So two things: 1 - Did you make / make install the latest mdadm and see it try to run mdadm -If on the removed disk? (You could also try manually running it.) 2 - I think the sysfs interface to the removed disks is still broken in cases where (1) doesn't occur. Thanks, -- Joe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-23 15:11 ` Joe Lawrence @ 2015-01-30 2:19 ` Xiao Ni 2015-01-30 4:27 ` Xiao Ni 0 siblings, 1 reply; 22+ messages in thread From: Xiao Ni @ 2015-01-30 2:19 UTC (permalink / raw) To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja ----- Original Message ----- > From: "Joe Lawrence" <joe.lawrence@stratus.com> > To: "Xiao Ni" <xni@redhat.com> > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > Sent: Friday, January 23, 2015 11:11:29 PM > Subject: Re: RAID1 removing failed disk returns EBUSY > > On Tue, 20 Jan 2015 02:16:46 -0500 > Xiao Ni <xni@redhat.com> wrote: > > Joe > > > > Thanks for the explanation. So echo "idle" to sync_action is a > > workaround > > without the patch. > > > > It looks like the patch is not enough to fix the problem. > > Do you have a try with the new patch? Is the problem still exist in > > your environment? > > > > If your environment have no problem, can you give me the version number? > > I'll > > have a try with the same version too. > > Hi Xiao, > > Bill and I did some more testing yesterday and I think we've figured > out the confusion. Running a 3.18+ kernel and an upstream mdadm, it > was the udev invocation of "mdadm -If <dev>" that was automatically > removing the device for us. > > If we ran with an older mdadm and got the MD wedged in the faulty > condition, then nothing we echoed into the sysfs state file ('idle' > 'fail' or 'remove') would change anything. I think this agrees with > your testing report. > > So two things: > > 1 - Did you make / make install the latest mdadm and see it try to run > mdadm -If on the removed disk? (You could also try manually running > it.) I made sure I have installed the latest mdadm: [root@dhcp-12-133 ~]# mdadm --version mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 That proves it, right?
It's strange that when I ran mdadm -If, I got: [root@dhcp-12-133 ~]# mdadm -If sdc mdadm: sdc does not appear to be a component of any array [root@dhcp-12-133 ~]# cat /proc/mdstat Personalities : [raid1] md0 : active (auto-read-only) raid1 sdd1[1] sdc1[0](F) 5238784 blocks super 1.2 [2/1] [_U] unused devices: <none> I unplugged the device manually from the machine. The machine is on my desk. > > 2 - I think the sysfs interface to the removed disks is still broken in > cases where (1) doesn't occur. > > Thanks, > > -- Joe > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-30 2:19 ` Xiao Ni @ 2015-01-30 4:27 ` Xiao Ni 0 siblings, 0 replies; 22+ messages in thread From: Xiao Ni @ 2015-01-30 4:27 UTC (permalink / raw) To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja ----- Original Message ----- > From: "Xiao Ni" <xni@redhat.com> > To: "Joe Lawrence" <joe.lawrence@stratus.com> > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > Sent: Friday, January 30, 2015 10:19:01 AM > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > ----- Original Message ----- > > From: "Joe Lawrence" <joe.lawrence@stratus.com> > > To: "Xiao Ni" <xni@redhat.com> > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" > > <william.kuzeja@stratus.com> > > Sent: Friday, January 23, 2015 11:11:29 PM > > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > On Tue, 20 Jan 2015 02:16:46 -0500 > > Xiao Ni <xni@redhat.com> wrote: > > > Joe > > > > > > Thanks for the explanation. So echo "idle" to sync_action is a > > > workaround > > > without the patch. > > > > > > It looks like the patch is not enough to fix the problem. > > > Do you have a try with the new patch? Is the problem still exist in > > > your environment? > > > > > > If your environment have no problem, can you give me the version > > > number? > > > I'll > > > have a try with the same version too. > > > > Hi Xiao, > > > > Bill and I did some more testing yesterday and I think we've figured > > out the confusion. Running a 3.18+ kernel and an upstream mdadm, it > > was the udev invocation of "mdadm -If <dev>" that was automatically > > removing the device for us. > > > > If we ran with an older mdadm and got the MD wedged in the faulty > > condition, then nothing we echoed into the sysfs state file ('idle' > > 'fail' or 'remove') would change anything. I think this agrees with > > your testing report. 
> > > > So two things: > > > > 1 - Did you make / make install the latest mdadm and see it try to run > > mdadm -If on the removed disk? (You could also try manually running > > it.) > > I make sure I have install the latest mdadm > [root@dhcp-12-133 ~]# mdadm --version > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 > > It can prove this, right? > > It's strange when I ran mdadm -If > > [root@dhcp-12-133 ~]# mdadm -If sdc > mdadm: sdc does not appear to be a component of any array > [root@dhcp-12-133 ~]# cat /proc/mdstat > Personalities : [raid1] > md0 : active (auto-read-only) raid1 sdd1[1] sdc1[0](F) > 5238784 blocks super 1.2 [2/1] [_U] > > unused devices: <none> > > I unplug the device manually from the machine. The machine is on my desk. Hi Joe Sorry for this. I input the command wrongly. [root@dhcp-12-133 ~]# mdadm -If sdc1 mdadm: set sdc1 faulty in md0 mdadm: hot remove failed for sdc1: Device or resource busy > > > > > > 2 - I think the sysfs interface to the removed disks is still broken in > > cases where (1) doesn't occur. > > > > Thanks, > > > > -- Joe > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-19 2:33 ` Xiao Ni 2015-01-19 17:56 ` Joe Lawrence @ 2015-01-29 3:52 ` NeilBrown 2015-01-29 12:14 ` Xiao Ni 1 sibling, 1 reply; 22+ messages in thread From: NeilBrown @ 2015-01-29 3:52 UTC (permalink / raw) To: Xiao Ni; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja [-- Attachment #1: Type: text/plain, Size: 5752 bytes --] On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: > > > ----- Original Message ----- > > From: "Joe Lawrence" <joe.lawrence@stratus.com> > > To: "Xiao Ni" <xni@redhat.com> > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > > Sent: Friday, January 16, 2015 11:10:31 PM > > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > On Fri, 16 Jan 2015 00:20:12 -0500 > > Xiao Ni <xni@redhat.com> wrote: > > > > > > Hi Joe > > > > > > Thanks for reminding me. I didn't do that. Now it can remove > > > successfully after writing > > > "idle" to sync_action. > > > > > > I thought wrongly that the patch referenced in this mail is fixed for > > > the problem. > > > > So it sounds like even with 3.18 and a new mdadm, this bug still > > persists? > > > > -- Joe > > > > -- > > Hi Joe > > I'm a little confused now. Does the patch 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable > resolve the problem? > > My environment is: > > [root@dhcp-12-133 mdadm]# mdadm --version > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest upstream) > [root@dhcp-12-133 mdadm]# uname -r > 3.18.2 > > > My steps are: > > [root@dhcp-12-133 mdadm]# lsblk > sdb 8:16 0 931.5G 0 disk > └─sdb1 8:17 0 5G 0 part > sdc 8:32 0 186.3G 0 disk > sdd 8:48 0 931.5G 0 disk > └─sdd1 8:49 0 5G 0 part > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 --assume-clean > mdadm: Note: this array has metadata at the start and > may not be suitable as a boot device. 
If you plan to > store '/boot' on this device please ensure that > your boot-loader understands md/v1.x metadata, or use > --metadata=0.90 > mdadm: Defaulting to version 1.2 metadata > mdadm: array /dev/md0 started. > > Then I unplug the disk. > > [root@dhcp-12-133 mdadm]# lsblk > sdc 8:32 0 186.3G 0 disk > sdd 8:48 0 931.5G 0 disk > └─sdd1 8:49 0 5G 0 part > └─md0 9:0 0 5G 0 raid1 > [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > -bash: echo: write error: Device or resource busy > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > I cannot reproduce this - using linux 3.18.2. I'd be surprised if mdadm version affects things. This error (Device or resource busy) implies that rdev->raid_disk is >= 0 (tested in state_store()). ->raid_disk is set to -1 by remove_and_add_spares() providing: 1/ it isn't Blocked (which is very unlikely) 2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and 3/ nr_pending is zero. So it seems most likely that either: 1/ nr_pending is non-zero, or 2/ remove_and_add_spares() didn't run. nr_pending can only get set if IO is generated, and your sequence of steps doesn't show any IO. It is possible that something else (e.g. started by udev) triggered some IO. How long that IO can stay pending might depend on exactly how you unplug the device. In my tests I used echo 1 > /sys/block/sdXX/../../delete which may have a different effect to what you do. However the fact that writing 'idle' to sync_action releases the device seems to suggest the nr_pending has dropped to zero. So either - remove_and_add_spares didn't run, or - remove_and_add_spares ran during a small window when nr_pending was elevated, and then didn't run again when nr_pending was reduced to zero. Ahh.... that rings bells....
I have the following patch in the SLES kernel which I have not yet applied to mainline (and given how old it is, that is really slack of me). Can you apply the following and see if the symptom goes away please? Thanks, NeilBrown From: Hannes Reinecke <hare@suse.de> Date: Thu, 26 Jul 2012 11:12:18 +0200 Subject: [PATCH] md: wakeup thread upon rdev_dec_pending() After each call to rdev_dec_pending() we should wake up the md thread if the device is found to be faulty. Otherwise we'll incur heavy delays on failing devices. Signed-off-by: Neil Brown <nfbrown@suse.de> Signed-off-by: Hannes Reinecke <hare@suse.de>

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 03cec5bdcaae..4cc2f59b2994 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -439,13 +439,6 @@ struct mddev {
 	void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
 };
 
-static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
-{
-	int faulty = test_bit(Faulty, &rdev->flags);
-	if (atomic_dec_and_test(&rdev->nr_pending) && faulty)
-		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-}
-
 static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
 {
 	atomic_add(nr_sectors, &bdev->bd_contains->bd_disk->sync_io);
@@ -624,4 +617,14 @@ static inline int mddev_check_plugged(struct mddev *mddev)
 	return !!blk_check_plugged(md_unplug, mddev,
 				   sizeof(struct blk_plug_cb));
 }
+
+static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
+{
+	int faulty = test_bit(Faulty, &rdev->flags);
+	if (atomic_dec_and_test(&rdev->nr_pending) && faulty) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	}
+}
+
 #endif /* _MD_MD_H */

[-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 811 bytes --] ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-29 3:52 ` NeilBrown @ 2015-01-29 12:14 ` Xiao Ni 2015-02-02 6:36 ` NeilBrown 0 siblings, 1 reply; 22+ messages in thread From: Xiao Ni @ 2015-01-29 12:14 UTC (permalink / raw) To: NeilBrown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja ----- Original Message ----- > From: "NeilBrown" <neilb@suse.de> > To: "Xiao Ni" <xni@redhat.com> > Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > Sent: Thursday, January 29, 2015 11:52:17 AM > Subject: Re: RAID1 removing failed disk returns EBUSY > > On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: > > > > > > > ----- Original Message ----- > > > From: "Joe Lawrence" <joe.lawrence@stratus.com> > > > To: "Xiao Ni" <xni@redhat.com> > > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill > > > Kuzeja" <william.kuzeja@stratus.com> > > > Sent: Friday, January 16, 2015 11:10:31 PM > > > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > > > On Fri, 16 Jan 2015 00:20:12 -0500 > > > Xiao Ni <xni@redhat.com> wrote: > > > > > > > > Hi Joe > > > > > > > > Thanks for reminding me. I didn't do that. Now it can remove > > > > successfully after writing > > > > "idle" to sync_action. > > > > > > > > I thought wrongly that the patch referenced in this mail is fixed > > > > for > > > > the problem. > > > > > > So it sounds like even with 3.18 and a new mdadm, this bug still > > > persists? > > > > > > -- Joe > > > > > > -- > > > > Hi Joe > > > > I'm a little confused now. Does the patch > > 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable > > resolve the problem? 
> > > > My environment is: > > > > [root@dhcp-12-133 mdadm]# mdadm --version > > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest > > upstream) > > [root@dhcp-12-133 mdadm]# uname -r > > 3.18.2 > > > > > > My steps are: > > > > [root@dhcp-12-133 mdadm]# lsblk > > sdb 8:16 0 931.5G 0 disk > > └─sdb1 8:17 0 5G 0 part > > sdc 8:32 0 186.3G 0 disk > > sdd 8:48 0 931.5G 0 disk > > └─sdd1 8:49 0 5G 0 part > > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 > > --assume-clean > > mdadm: Note: this array has metadata at the start and > > may not be suitable as a boot device. If you plan to > > store '/boot' on this device please ensure that > > your boot-loader understands md/v1.x metadata, or use > > --metadata=0.90 > > mdadm: Defaulting to version 1.2 metadata > > mdadm: array /dev/md0 started. > > > > Then I unplug the disk. > > > > [root@dhcp-12-133 mdadm]# lsblk > > sdc 8:32 0 186.3G 0 disk > > sdd 8:48 0 931.5G 0 disk > > └─sdd1 8:49 0 5G 0 part > > └─md0 9:0 0 5G 0 raid1 > > [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state > > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > > -bash: echo: write error: Device or resource busy > > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action > > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > > > > I cannot reproduce this - using linux 3.18.2. I'd be surprised if mdadm > version affects things. Hi Neil I'm very curious, because I can reproduce it on my machine 100% of the time. > > This error (Device or resoource busy) implies that rdev->raid_disk is >= 0 > (tested in state_store()). > > ->raid_disk is set to -1 by remove_and_add_spares() providing: > 1/ it isn't Blocked (which is very unlikely) > 2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and > 3/ nr_pending is zero. I remember I tried to check those reasons. It really is reason 1, the one that seemed very unlikely. I added some code to the function array_state_show():

array_state_show(struct mddev *mddev, char *page) {
	enum array_state st = inactive;
	struct md_rdev *rdev;

	rdev_for_each_rcu(rdev, mddev) {
		printk(KERN_ALERT "search for %s\n", rdev->bdev->bd_disk->disk_name);
		if (test_bit(Blocked, &rdev->flags))
			printk(KERN_ALERT "rdev is Blocked\n");
		else
			printk(KERN_ALERT "rdev is not Blocked\n");
	}

After I echo 1 > /sys/block/sdc/device/delete, I ran these commands:

[root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state
read-auto
[root@dhcp-12-133 md]# dmesg
[ 2679.559185] search for sdc
[ 2679.559189] rdev is Blocked
[ 2679.559190] search for sdb
[ 2679.559190] rdev is not Blocked

So sdc is Blocked. > > So it seems most likely that either: > 1/ nr_pending is non-zero, or > 2/ remove_and_add_spares() didn't run. > > nr_pending can only get set if IO is generated, and your sequence of steps > don't show any IO. It is possible that something else (e.g. started by udev) > triggered some IO. How long that IO can stay pending might depend on exactly > how you unplug the device. > In my tests I used > echo 1 > /sys/block/sdXX/../../delete > which may have a different effect to what you do. > > However the fact that writing 'idle' to sync_action releases the device seems > to suggest the nr_pending has dropped to zero. So either > - remove_and_add_spares didn't run, or > - remove_and_add_spares ran during a small window when nr_pending was > elevated, and then didn't run again when nr_pending was reduced to zero. > > Ahh.... that rings bells.... > > I have the following patch in the SLES kernel which I have applied to > mainline yet (and given how old it is, that is really slack of me). > > Can you apply the following and see if the symptom goes away please? I have tried the patch; the problem still exists.
> > Thanks, > NeilBrown > > From: Hannes Reinecke <hare@suse.de> > Date: Thu, 26 Jul 2012 11:12:18 +0200 > Subject: [PATCH] md: wakeup thread upon rdev_dec_pending() > > After each call to rdev_dec_pending() we should wakeup the > md thread if the device is found to be faulty. > Otherwise we'll incur heavy delays on failing devices. > > Signed-off-by: Neil Brown <nfbrown@suse.de> > Signed-off-by: Hannes Reinecke <hare@suse.de> > > diff --git a/drivers/md/md.h b/drivers/md/md.h > index 03cec5bdcaae..4cc2f59b2994 100644 > --- a/drivers/md/md.h > +++ b/drivers/md/md.h > @@ -439,13 +439,6 @@ struct mddev { > void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); > }; > > -static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev > *mddev) > -{ > - int faulty = test_bit(Faulty, &rdev->flags); > - if (atomic_dec_and_test(&rdev->nr_pending) && faulty) > - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > -} > - > static inline void md_sync_acct(struct block_device *bdev, unsigned long > nr_sectors) > { > atomic_add(nr_sectors, &bdev->bd_contains->bd_disk->sync_io); > @@ -624,4 +617,14 @@ static inline int mddev_check_plugged(struct mddev > *mddev) > return !!blk_check_plugged(md_unplug, mddev, > sizeof(struct blk_plug_cb)); > } > + > +static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev > *mddev) > +{ > + int faulty = test_bit(Faulty, &rdev->flags); > + if (atomic_dec_and_test(&rdev->nr_pending) && faulty) { > + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > + md_wakeup_thread(mddev->thread); > + } > +} > + > #endif /* _MD_MD_H */ > > -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-01-29 12:14 ` Xiao Ni @ 2015-02-02 6:36 ` NeilBrown 2015-02-03 8:10 ` Xiao Ni 0 siblings, 1 reply; 22+ messages in thread From: NeilBrown @ 2015-02-02 6:36 UTC (permalink / raw) To: Xiao Ni; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja [-- Attachment #1: Type: text/plain, Size: 6808 bytes --] On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: > > > ----- Original Message ----- > > From: "NeilBrown" <neilb@suse.de> > > To: "Xiao Ni" <xni@redhat.com> > > Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > > Sent: Thursday, January 29, 2015 11:52:17 AM > > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: > > > > > > > > > > > ----- Original Message ----- > > > > From: "Joe Lawrence" <joe.lawrence@stratus.com> > > > > To: "Xiao Ni" <xni@redhat.com> > > > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill > > > > Kuzeja" <william.kuzeja@stratus.com> > > > > Sent: Friday, January 16, 2015 11:10:31 PM > > > > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > > > > > On Fri, 16 Jan 2015 00:20:12 -0500 > > > > Xiao Ni <xni@redhat.com> wrote: > > > > > > > > > > Hi Joe > > > > > > > > > > Thanks for reminding me. I didn't do that. Now it can remove > > > > > successfully after writing > > > > > "idle" to sync_action. > > > > > > > > > > I thought wrongly that the patch referenced in this mail is fixed > > > > > for > > > > > the problem. > > > > > > > > So it sounds like even with 3.18 and a new mdadm, this bug still > > > > persists? > > > > > > > > -- Joe > > > > > > > > -- > > > > > > Hi Joe > > > > > > I'm a little confused now. Does the patch > > > 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable > > > resolve the problem? 
> > > > > > My environment is: > > > > > > [root@dhcp-12-133 mdadm]# mdadm --version > > > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest > > > upstream) > > > [root@dhcp-12-133 mdadm]# uname -r > > > 3.18.2 > > > > > > > > > My steps are: > > > > > > [root@dhcp-12-133 mdadm]# lsblk > > > sdb 8:16 0 931.5G 0 disk > > > └─sdb1 8:17 0 5G 0 part > > > sdc 8:32 0 186.3G 0 disk > > > sdd 8:48 0 931.5G 0 disk > > > └─sdd1 8:49 0 5G 0 part > > > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 > > > --assume-clean > > > mdadm: Note: this array has metadata at the start and > > > may not be suitable as a boot device. If you plan to > > > store '/boot' on this device please ensure that > > > your boot-loader understands md/v1.x metadata, or use > > > --metadata=0.90 > > > mdadm: Defaulting to version 1.2 metadata > > > mdadm: array /dev/md0 started. > > > > > > Then I unplug the disk. > > > > > > [root@dhcp-12-133 mdadm]# lsblk > > > sdc 8:32 0 186.3G 0 disk > > > sdd 8:48 0 931.5G 0 disk > > > └─sdd1 8:49 0 5G 0 part > > > └─md0 9:0 0 5G 0 raid1 > > > [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state > > > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > > > -bash: echo: write error: Device or resource busy > > > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action > > > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state > > > > > > > I cannot reproduce this - using linux 3.18.2. I'd be surprised if mdadm > > version affects things. > > Hi Neil > > I'm very curious, because it can reproduce in my machine 100%. > > > > > This error (Device or resoource busy) implies that rdev->raid_disk is >= 0 > > (tested in state_store()). > > > > ->raid_disk is set to -1 by remove_and_add_spares() providing: > > 1/ it isn't Blocked (which is very unlikely) > > 2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and > > 3/ nr_pending is zero. 
> > I remember I have tried to check those reasons. But it really is the reason 1 > which is very unlikely. > > I add some code in the function array_state_show > > array_state_show(struct mddev *mddev, char *page) { > enum array_state st = inactive; > struct md_rdev *rdev; > > rdev_for_each_rcu(rdev, mddev) { > printk(KERN_ALERT "search for %s\n", rdev->bdev->bd_disk->disk_name); > if (test_bit(Blocked, &rdev->flags)) > printk(KERN_ALERT "rdev is Blocked\n"); > else > printk(KERN_ALERT "rdev is not Blocked\n"); > } > > When I echo 1 > /sys/block/sdc/device/delete, then I ran the command: > > [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state > read-auto ^^^^^^^^^ I think that is half the explanation. You must have the md_mod.start_ro parameter set to '1'. > [root@dhcp-12-133 md]# dmesg > [ 2679.559185] search for sdc > [ 2679.559189] rdev is Blocked > [ 2679.559190] search for sdb > [ 2679.559190] rdev is not Blocked > > So sdc is Blocked and that is the other half - thanks. (yes, I was wrong. Sometimes it is easier than being right, but still yields results). When a device fails, it is Blocked until the metadata is updated to record the failure. This ensures that no writes succeed without writing to that device, until we are certain that no read will try reading from that device, even after a crash/restart. Blocked is cleared after the metadata is written, but read-auto (and read-only) devices never write out their metadata. So Blocked doesn't get cleared. When you "echo idle > .../sync_action" one of the side effects is to switch from 'read-auto' to fully active. This allows the metadata to be written, Blocked to be cleared, and the device to be removed. If you echo none > /sys/block/md0/md/dev-sdc/slot first, then the remove will work. We could possibly fix it with something like the following, but I'm not sure I like it. There is no guarantee that I can see which would ensure the superblock got updated before the first write if the array switches to read/write. 
NeilBrown diff --git a/drivers/md/md.c b/drivers/md/md.c index 9233c71138f1..b3d1e8e5e067 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev *mddev, rdev_for_each(rdev, mddev) if ((this == NULL || rdev == this) && rdev->raid_disk >= 0 && - !test_bit(Blocked, &rdev->flags) && + (!test_bit(Blocked, &rdev->flags) || mddev->ro) && (test_bit(Faulty, &rdev->flags) || ! test_bit(In_sync, &rdev->flags)) && atomic_read(&rdev->nr_pending)==0) { [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 811 bytes --] ^ permalink raw reply related [flat|nested] 22+ messages in thread
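The one-line change in Neil's diff above can be condensed into a toy truth-table check. This is plain Python, not kernel code; the parameter names are simplified stand-ins for the rdev/mddev fields tested in remove_and_add_spares():

```python
def can_detach(raid_disk, blocked, faulty, nr_pending, ro, patched=False):
    """Toy model of the remove_and_add_spares() condition quoted above.

    A failed member is detached (raid_disk would be set to -1) only when
    it occupies a slot, is Faulty, has no pending I/O, and is not
    Blocked -- unless the proposed patch applies and the array is
    read-only, in which case Blocked is ignored.
    """
    unblocked = not blocked or (patched and ro)
    return raid_disk >= 0 and unblocked and faulty and nr_pending == 0

# Read-auto array: metadata is never written, so Blocked stays set and
# the unpatched test refuses to detach -> "remove" returns EBUSY.
assert not can_detach(raid_disk=0, blocked=True, faulty=True,
                      nr_pending=0, ro=True)

# With the proposed "|| mddev->ro" change, the same device detaches.
assert can_detach(raid_disk=0, blocked=True, faulty=True,
                  nr_pending=0, ro=True, patched=True)
```

This mirrors the thread's diagnosis: on a read-auto array the Blocked bit is never cleared, so the unpatched predicate can never become true for the failed member.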
* Re: RAID1 removing failed disk returns EBUSY 2015-02-02 6:36 ` NeilBrown @ 2015-02-03 8:10 ` Xiao Ni 2015-06-10 6:26 ` XiaoNi 0 siblings, 1 reply; 22+ messages in thread From: Xiao Ni @ 2015-02-03 8:10 UTC (permalink / raw) To: NeilBrown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja ----- Original Message ----- > From: "NeilBrown" <neilb@suse.de> > To: "Xiao Ni" <xni@redhat.com> > Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > Sent: Monday, February 2, 2015 2:36:01 PM > Subject: Re: RAID1 removing failed disk returns EBUSY > > On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: > > > > > > > ----- Original Message ----- > > > From: "NeilBrown" <neilb@suse.de> > > > To: "Xiao Ni" <xni@redhat.com> > > > Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, > > > linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > > > Sent: Thursday, January 29, 2015 11:52:17 AM > > > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > > > On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "Joe Lawrence" <joe.lawrence@stratus.com> > > > > > To: "Xiao Ni" <xni@redhat.com> > > > > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill > > > > > Kuzeja" <william.kuzeja@stratus.com> > > > > > Sent: Friday, January 16, 2015 11:10:31 PM > > > > > Subject: Re: RAID1 removing failed disk returns EBUSY > > > > > > > > > > On Fri, 16 Jan 2015 00:20:12 -0500 > > > > > Xiao Ni <xni@redhat.com> wrote: > > > > > > > > > > > > Hi Joe > > > > > > > > > > > > Thanks for reminding me. I didn't do that. Now it can remove > > > > > > successfully after writing > > > > > > "idle" to sync_action. > > > > > > > > > > > > I thought wrongly that the patch referenced in this mail is > > > > > > fixed > > > > > > for > > > > > > the problem. 
> > > > > > > > > > So it sounds like even with 3.18 and a new mdadm, this bug still > > > > > persists? > > > > > > > > > > -- Joe > > > > > > > > > > -- > > > > > > > > Hi Joe > > > > > > > > I'm a little confused now. Does the patch > > > > 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable > > > > resolve the problem? > > > > > > > > My environment is: > > > > > > > > [root@dhcp-12-133 mdadm]# mdadm --version > > > > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest > > > > upstream) > > > > [root@dhcp-12-133 mdadm]# uname -r > > > > 3.18.2 > > > > > > > > > > > > My steps are: > > > > > > > > [root@dhcp-12-133 mdadm]# lsblk > > > > sdb 8:16 0 931.5G 0 disk > > > > └─sdb1 8:17 0 5G 0 part > > > > sdc 8:32 0 186.3G 0 disk > > > > sdd 8:48 0 931.5G 0 disk > > > > └─sdd1 8:49 0 5G 0 part > > > > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 > > > > /dev/sdd1 > > > > --assume-clean > > > > mdadm: Note: this array has metadata at the start and > > > > may not be suitable as a boot device. If you plan to > > > > store '/boot' on this device please ensure that > > > > your boot-loader understands md/v1.x metadata, or use > > > > --metadata=0.90 > > > > mdadm: Defaulting to version 1.2 metadata > > > > mdadm: array /dev/md0 started. > > > > > > > > Then I unplug the disk. > > > > > > > > [root@dhcp-12-133 mdadm]# lsblk > > > > sdc 8:32 0 186.3G 0 disk > > > > sdd 8:48 0 931.5G 0 disk > > > > └─sdd1 8:49 0 5G 0 part > > > > └─md0 9:0 0 5G 0 raid1 > > > > [root@dhcp-12-133 mdadm]# echo faulty > > > > > /sys/block/md0/md/dev-sdb1/state > > > > [root@dhcp-12-133 mdadm]# echo remove > > > > > /sys/block/md0/md/dev-sdb1/state > > > > -bash: echo: write error: Device or resource busy > > > > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action > > > > [root@dhcp-12-133 mdadm]# echo remove > > > > > /sys/block/md0/md/dev-sdb1/state > > > > > > > > > > I cannot reproduce this - using linux 3.18.2. 
I'd be surprised if mdadm > > > version affects things. > > > > Hi Neil > > > > I'm very curious, because it can reproduce in my machine 100%. > > > > > > > > This error (Device or resoource busy) implies that rdev->raid_disk is >= > > > 0 > > > (tested in state_store()). > > > > > > ->raid_disk is set to -1 by remove_and_add_spares() providing: > > > 1/ it isn't Blocked (which is very unlikely) > > > 2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and > > > 3/ nr_pending is zero. > > > > I remember I have tired to check those reasons. But it's really is the > > reason 1 > > which is very unlikely. > > > > I add some code in the function array_state_show > > > > array_state_show(struct mddev *mddev, char *page) { > > enum array_state st = inactive; > > struct md_rdev *rdev; > > > > rdev_for_each_rcu(rdev, mddev) { > > printk(KERN_ALERT "search for %s\n", > > rdev->bdev->bd_disk->disk_name); > > if (test_bit(Blocked, &rdev->flags)) > > printk(KERN_ALERT "rdev is Blocked\n"); > > else > > printk(KERN_ALERT "rdev is not Blocked\n"); > > } > > > > When I echo 1 > /sys/block/sdc/device/delete, then I ran command: > > > > [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state > > read-auto > ^^^^^^^^^ > > I think that is half the explanation. > You must have the md_mod.start_ro parameter set to '1'. > > > > [root@dhcp-12-133 md]# dmesg > > [ 2679.559185] search for sdc > > [ 2679.559189] rdev is Blocked > > [ 2679.559190] search for sdb > > [ 2679.559190] rdev is not Blocked > > > > So sdc is Blocked > > and that is the other half - thanks. > (yes, I was wrong. Sometimes it is easier than being right, but still > yields results). > > When a device fails, it is Blocked until the metadata is updated to record > the failure. This ensures that no writes succeed without writing to that > device, until we a certain that no read will try reading from that device, > even after a crash/restart. 
> > Blocked is cleared after the metadata is written, but read-auto (and > read-only) devices never write out their metadata. So blocked doesn't get > cleared. > > When you "echo idle > .../sync_action" one of the side effects is to with > from 'read-auto' to fully active. This allows the metadata to be written, > Blocked to be cleared, and the device to be removed. > > If you > echo none > /sys/block/md0/md/dev-sdc/slot > > first, then the remove will work. > > We could possibly fix it with something like the following, but I'm not sure > I like it. There is no guarantee that I can see which would ensure the > superblock got updated before the first write if the array switch to > read/write. > > NeilBrown > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 9233c71138f1..b3d1e8e5e067 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev *mddev, > rdev_for_each(rdev, mddev) > if ((this == NULL || rdev == this) && > rdev->raid_disk >= 0 && > - !test_bit(Blocked, &rdev->flags) && > + (!test_bit(Blocked, &rdev->flags) || mddev->ro) && > (test_bit(Faulty, &rdev->flags) || > ! test_bit(In_sync, &rdev->flags)) && > atomic_read(&rdev->nr_pending)==0) { > > > Hi Neil I have tried the patch and it fixes the problem. I'm sorry that I can't offer better advice on this; I'm not familiar with the metadata handling in md. I'll try to find more time to read the md code. Best Regards Xiao -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-02-03 8:10 ` Xiao Ni @ 2015-06-10 6:26 ` XiaoNi 2015-06-17 2:51 ` Neil Brown 0 siblings, 1 reply; 22+ messages in thread From: XiaoNi @ 2015-06-10 6:26 UTC (permalink / raw) To: NeilBrown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja On 02/03/2015 04:10 PM, Xiao Ni wrote: > > ----- Original Message ----- >> From: "NeilBrown" <neilb@suse.de> >> To: "Xiao Ni" <xni@redhat.com> >> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> >> Sent: Monday, February 2, 2015 2:36:01 PM >> Subject: Re: RAID1 removing failed disk returns EBUSY >> >> On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: >> >>> >>> ----- Original Message ----- >>>> From: "NeilBrown" <neilb@suse.de> >>>> To: "Xiao Ni" <xni@redhat.com> >>>> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, >>>> linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> >>>> Sent: Thursday, January 29, 2015 11:52:17 AM >>>> Subject: Re: RAID1 removing failed disk returns EBUSY >>>> >>>> On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: >>>> >>>>> >>>>> ----- Original Message ----- >>>>>> From: "Joe Lawrence" <joe.lawrence@stratus.com> >>>>>> To: "Xiao Ni" <xni@redhat.com> >>>>>> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill >>>>>> Kuzeja" <william.kuzeja@stratus.com> >>>>>> Sent: Friday, January 16, 2015 11:10:31 PM >>>>>> Subject: Re: RAID1 removing failed disk returns EBUSY >>>>>> >>>>>> On Fri, 16 Jan 2015 00:20:12 -0500 >>>>>> Xiao Ni <xni@redhat.com> wrote: >>>>>>> Hi Joe >>>>>>> >>>>>>> Thanks for reminding me. I didn't do that. Now it can remove >>>>>>> successfully after writing >>>>>>> "idle" to sync_action. >>>>>>> >>>>>>> I thought wrongly that the patch referenced in this mail is >>>>>>> fixed >>>>>>> for >>>>>>> the problem. >>>>>> So it sounds like even with 3.18 and a new mdadm, this bug still >>>>>> persists? 
>>>>>> >>>>>> -- Joe >>>>>> >>>>>> -- >>>>> Hi Joe >>>>> >>>>> I'm a little confused now. Does the patch >>>>> 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable >>>>> resolve the problem? >>>>> >>>>> My environment is: >>>>> >>>>> [root@dhcp-12-133 mdadm]# mdadm --version >>>>> mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest >>>>> upstream) >>>>> [root@dhcp-12-133 mdadm]# uname -r >>>>> 3.18.2 >>>>> >>>>> >>>>> My steps are: >>>>> >>>>> [root@dhcp-12-133 mdadm]# lsblk >>>>> sdb 8:16 0 931.5G 0 disk >>>>> └─sdb1 8:17 0 5G 0 part >>>>> sdc 8:32 0 186.3G 0 disk >>>>> sdd 8:48 0 931.5G 0 disk >>>>> └─sdd1 8:49 0 5G 0 part >>>>> [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 >>>>> /dev/sdd1 >>>>> --assume-clean >>>>> mdadm: Note: this array has metadata at the start and >>>>> may not be suitable as a boot device. If you plan to >>>>> store '/boot' on this device please ensure that >>>>> your boot-loader understands md/v1.x metadata, or use >>>>> --metadata=0.90 >>>>> mdadm: Defaulting to version 1.2 metadata >>>>> mdadm: array /dev/md0 started. >>>>> >>>>> Then I unplug the disk. >>>>> >>>>> [root@dhcp-12-133 mdadm]# lsblk >>>>> sdc 8:32 0 186.3G 0 disk >>>>> sdd 8:48 0 931.5G 0 disk >>>>> └─sdd1 8:49 0 5G 0 part >>>>> └─md0 9:0 0 5G 0 raid1 >>>>> [root@dhcp-12-133 mdadm]# echo faulty > >>>>> /sys/block/md0/md/dev-sdb1/state >>>>> [root@dhcp-12-133 mdadm]# echo remove > >>>>> /sys/block/md0/md/dev-sdb1/state >>>>> -bash: echo: write error: Device or resource busy >>>>> [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action >>>>> [root@dhcp-12-133 mdadm]# echo remove > >>>>> /sys/block/md0/md/dev-sdb1/state >>>>> >>>> I cannot reproduce this - using linux 3.18.2. I'd be surprised if mdadm >>>> version affects things. >>> Hi Neil >>> >>> I'm very curious, because it can reproduce in my machine 100%. 
>>> >>>> This error (Device or resoource busy) implies that rdev->raid_disk is >= >>>> 0 >>>> (tested in state_store()). >>>> >>>> ->raid_disk is set to -1 by remove_and_add_spares() providing: >>>> 1/ it isn't Blocked (which is very unlikely) >>>> 2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and >>>> 3/ nr_pending is zero. >>> I remember I have tired to check those reasons. But it's really is the >>> reason 1 >>> which is very unlikely. >>> >>> I add some code in the function array_state_show >>> >>> array_state_show(struct mddev *mddev, char *page) { >>> enum array_state st = inactive; >>> struct md_rdev *rdev; >>> >>> rdev_for_each_rcu(rdev, mddev) { >>> printk(KERN_ALERT "search for %s\n", >>> rdev->bdev->bd_disk->disk_name); >>> if (test_bit(Blocked, &rdev->flags)) >>> printk(KERN_ALERT "rdev is Blocked\n"); >>> else >>> printk(KERN_ALERT "rdev is not Blocked\n"); >>> } >>> >>> When I echo 1 > /sys/block/sdc/device/delete, then I ran command: >>> >>> [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state >>> read-auto >> ^^^^^^^^^ >> >> I think that is half the explanation. >> You must have the md_mod.start_ro parameter set to '1'. >> >> >>> [root@dhcp-12-133 md]# dmesg >>> [ 2679.559185] search for sdc >>> [ 2679.559189] rdev is Blocked >>> [ 2679.559190] search for sdb >>> [ 2679.559190] rdev is not Blocked >>> >>> So sdc is Blocked >> and that is the other half - thanks. >> (yes, I was wrong. Sometimes it is easier than being right, but still >> yields results). >> >> When a device fails, it is Blocked until the metadata is updated to record >> the failure. This ensures that no writes succeed without writing to that >> device, until we a certain that no read will try reading from that device, >> even after a crash/restart. >> >> Blocked is cleared after the metadata is written, but read-auto (and >> read-only) devices never write out their metadata. So blocked doesn't get >> cleared. 
>> >> When you "echo idle > .../sync_action" one of the side effects is to with >> from 'read-auto' to fully active. This allows the metadata to be written, >> Blocked to be cleared, and the device to be removed. >> >> If you >> echo none > /sys/block/md0/md/dev-sdc/slot >> >> first, then the remove will work. >> >> We could possibly fix it with something like the following, but I'm not sure >> I like it. There is no guarantee that I can see which would ensure the >> superblock got updated before the first write if the array switch to >> read/write. >> >> NeilBrown >> >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index 9233c71138f1..b3d1e8e5e067 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev *mddev, >> rdev_for_each(rdev, mddev) >> if ((this == NULL || rdev == this) && >> rdev->raid_disk >= 0 && >> - !test_bit(Blocked, &rdev->flags) && >> + (!test_bit(Blocked, &rdev->flags) || mddev->ro) && >> (test_bit(Faulty, &rdev->flags) || >> ! test_bit(In_sync, &rdev->flags)) && >> atomic_read(&rdev->nr_pending)==0) { >> >> >> > Hi Neil > > I have tried the patch and the problem can be fixed by it. But I'm sorry that I can't > give more advices for better idea about this. I'm not familiar with the metadata part about > the md. I'll try to get more time to read the code about md. > Hi Neil I don't see the patch in linux-stable, did you miss it? Best Regards Xiao -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: RAID1 removing failed disk returns EBUSY 2015-06-10 6:26 ` XiaoNi @ 2015-06-17 2:51 ` Neil Brown 2015-06-25 9:42 ` Xiao Ni 0 siblings, 1 reply; 22+ messages in thread From: Neil Brown @ 2015-06-17 2:51 UTC (permalink / raw) To: XiaoNi; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja On Wed, 10 Jun 2015 14:26:41 +0800 XiaoNi <xni@redhat.com> wrote: > > > On 02/03/2015 04:10 PM, Xiao Ni wrote: > > > > ----- Original Message ----- > >> From: "NeilBrown" <neilb@suse.de> > >> To: "Xiao Ni" <xni@redhat.com> > >> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > >> Sent: Monday, February 2, 2015 2:36:01 PM > >> Subject: Re: RAID1 removing failed disk returns EBUSY > >> > >> On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: > >> > >>> > >>> ----- Original Message ----- > >>>> From: "NeilBrown" <neilb@suse.de> > >>>> To: "Xiao Ni" <xni@redhat.com> > >>>> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, > >>>> linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > >>>> Sent: Thursday, January 29, 2015 11:52:17 AM > >>>> Subject: Re: RAID1 removing failed disk returns EBUSY > >>>> > >>>> On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: > >>>> > >>>>> > >>>>> ----- Original Message ----- > >>>>>> From: "Joe Lawrence" <joe.lawrence@stratus.com> > >>>>>> To: "Xiao Ni" <xni@redhat.com> > >>>>>> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill > >>>>>> Kuzeja" <william.kuzeja@stratus.com> > >>>>>> Sent: Friday, January 16, 2015 11:10:31 PM > >>>>>> Subject: Re: RAID1 removing failed disk returns EBUSY > >>>>>> > >>>>>> On Fri, 16 Jan 2015 00:20:12 -0500 > >>>>>> Xiao Ni <xni@redhat.com> wrote: > >>>>>>> Hi Joe > >>>>>>> > >>>>>>> Thanks for reminding me. I didn't do that. Now it can remove > >>>>>>> successfully after writing > >>>>>>> "idle" to sync_action. 
> >>>>>>> > >>>>>>> I thought wrongly that the patch referenced in this mail is > >>>>>>> fixed > >>>>>>> for > >>>>>>> the problem. > >>>>>> So it sounds like even with 3.18 and a new mdadm, this bug still > >>>>>> persists? > >>>>>> > >>>>>> -- Joe > >>>>>> > >>>>>> -- > >>>>> Hi Joe > >>>>> > >>>>> I'm a little confused now. Does the patch > >>>>> 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable > >>>>> resolve the problem? > >>>>> > >>>>> My environment is: > >>>>> > >>>>> [root@dhcp-12-133 mdadm]# mdadm --version > >>>>> mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest > >>>>> upstream) > >>>>> [root@dhcp-12-133 mdadm]# uname -r > >>>>> 3.18.2 > >>>>> > >>>>> > >>>>> My steps are: > >>>>> > >>>>> [root@dhcp-12-133 mdadm]# lsblk > >>>>> sdb 8:16 0 931.5G 0 disk > >>>>> └─sdb1 8:17 0 5G 0 part > >>>>> sdc 8:32 0 186.3G 0 disk > >>>>> sdd 8:48 0 931.5G 0 disk > >>>>> └─sdd1 8:49 0 5G 0 part > >>>>> [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 > >>>>> /dev/sdd1 > >>>>> --assume-clean > >>>>> mdadm: Note: this array has metadata at the start and > >>>>> may not be suitable as a boot device. If you plan to > >>>>> store '/boot' on this device please ensure that > >>>>> your boot-loader understands md/v1.x metadata, or use > >>>>> --metadata=0.90 > >>>>> mdadm: Defaulting to version 1.2 metadata > >>>>> mdadm: array /dev/md0 started. > >>>>> > >>>>> Then I unplug the disk. 
> >>>>> > >>>>> [root@dhcp-12-133 mdadm]# lsblk > >>>>> sdc 8:32 0 186.3G 0 disk > >>>>> sdd 8:48 0 931.5G 0 disk > >>>>> └─sdd1 8:49 0 5G 0 part > >>>>> └─md0 9:0 0 5G 0 raid1 > >>>>> [root@dhcp-12-133 mdadm]# echo faulty > > >>>>> /sys/block/md0/md/dev-sdb1/state > >>>>> [root@dhcp-12-133 mdadm]# echo remove > > >>>>> /sys/block/md0/md/dev-sdb1/state > >>>>> -bash: echo: write error: Device or resource busy > >>>>> [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action > >>>>> [root@dhcp-12-133 mdadm]# echo remove > > >>>>> /sys/block/md0/md/dev-sdb1/state > >>>>> > >>>> I cannot reproduce this - using linux 3.18.2. I'd be surprised if mdadm > >>>> version affects things. > >>> Hi Neil > >>> > >>> I'm very curious, because it can reproduce in my machine 100%. > >>> > >>>> This error (Device or resoource busy) implies that rdev->raid_disk is >= > >>>> 0 > >>>> (tested in state_store()). > >>>> > >>>> ->raid_disk is set to -1 by remove_and_add_spares() providing: > >>>> 1/ it isn't Blocked (which is very unlikely) > >>>> 2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and > >>>> 3/ nr_pending is zero. > >>> I remember I have tired to check those reasons. But it's really is the > >>> reason 1 > >>> which is very unlikely. > >>> > >>> I add some code in the function array_state_show > >>> > >>> array_state_show(struct mddev *mddev, char *page) { > >>> enum array_state st = inactive; > >>> struct md_rdev *rdev; > >>> > >>> rdev_for_each_rcu(rdev, mddev) { > >>> printk(KERN_ALERT "search for %s\n", > >>> rdev->bdev->bd_disk->disk_name); > >>> if (test_bit(Blocked, &rdev->flags)) > >>> printk(KERN_ALERT "rdev is Blocked\n"); > >>> else > >>> printk(KERN_ALERT "rdev is not Blocked\n"); > >>> } > >>> > >>> When I echo 1 > /sys/block/sdc/device/delete, then I ran command: > >>> > >>> [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state > >>> read-auto > >> ^^^^^^^^^ > >> > >> I think that is half the explanation. 
> >> You must have the md_mod.start_ro parameter set to '1'. > >> > >> > >>> [root@dhcp-12-133 md]# dmesg > >>> [ 2679.559185] search for sdc > >>> [ 2679.559189] rdev is Blocked > >>> [ 2679.559190] search for sdb > >>> [ 2679.559190] rdev is not Blocked > >>> > >>> So sdc is Blocked > >> and that is the other half - thanks. > >> (yes, I was wrong. Sometimes it is easier than being right, but still > >> yields results). > >> > >> When a device fails, it is Blocked until the metadata is updated to record > >> the failure. This ensures that no writes succeed without writing to that > >> device, until we a certain that no read will try reading from that device, > >> even after a crash/restart. > >> > >> Blocked is cleared after the metadata is written, but read-auto (and > >> read-only) devices never write out their metadata. So blocked doesn't get > >> cleared. > >> > >> When you "echo idle > .../sync_action" one of the side effects is to with > >> from 'read-auto' to fully active. This allows the metadata to be written, > >> Blocked to be cleared, and the device to be removed. > >> > >> If you > >> echo none > /sys/block/md0/md/dev-sdc/slot > >> > >> first, then the remove will work. > >> > >> We could possibly fix it with something like the following, but I'm not sure > >> I like it. There is no guarantee that I can see which would ensure the > >> superblock got updated before the first write if the array switch to > >> read/write. > >> > >> NeilBrown > >> > >> diff --git a/drivers/md/md.c b/drivers/md/md.c > >> index 9233c71138f1..b3d1e8e5e067 100644 > >> --- a/drivers/md/md.c > >> +++ b/drivers/md/md.c > >> @@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev *mddev, > >> rdev_for_each(rdev, mddev) > >> if ((this == NULL || rdev == this) && > >> rdev->raid_disk >= 0 && > >> - !test_bit(Blocked, &rdev->flags) && > >> + (!test_bit(Blocked, &rdev->flags) || mddev->ro) && > >> (test_bit(Faulty, &rdev->flags) || > >> ! 
test_bit(In_sync, &rdev->flags)) && > >> atomic_read(&rdev->nr_pending)==0) { > >> > >> > >> > > Hi Neil > > > > I have tried the patch and the problem can be fixed by it. But I'm > > sorry that I can't > > give more advices for better idea about this. I'm not familiar with the > > metadata part about > > the md. I'll try to get more time to read the code about md. > > > Hi Neil > > I don't see the patch in linux-stable, do you miss this? I don't believe this bug is sufficiently serious for the patch to go to -stable. However it does need to be fixed - thanks for the reminder. I've just queued the following patch which I am happy with. If you could confirm that it works for you, I would appreciate that. Thanks, NeilBrown From: Neil Brown <neilb@suse.de> Date: Wed, 17 Jun 2015 12:31:46 +1000 Subject: [PATCH] md: clear Blocked flag on failed devices when array is read-only. The Blocked flag indicates that a device has failed but that this fact hasn't been recorded in the metadata yet. Writes to such devices cannot be allowed until the metadata has been updated. On a read-only array, the Blocked flag will never be cleared. This prevents the device being removed from the array. If the metadata is being handled by the kernel (i.e. !mddev->external), then we can be sure that if the array is switched to writable, then a metadata update will happen and will record the failure. So we don't need the flag set. If metadata is externally managed, it is up to the external manager to clear the 'blocked' flag. Reported-by: XiaoNi <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> diff --git a/drivers/md/md.c b/drivers/md/md.c index 3d339e2..5a6681a 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8125,6 +8125,15 @@ void md_check_recovery(struct mddev *mddev) int spares = 0; if (mddev->ro) { + struct md_rdev *rdev; + if (!mddev->external && mddev->in_sync) + /* 'Blocked' flag not needed as failed devices + * will be recorded if array switched to read/write. 
+ * Leaving it set will prevent the device + * from being removed. + */ + rdev_for_each(rdev, mddev) + clear_bit(Blocked, &rdev->flags); /* On a read-only array we can: * - remove failed devices * - add already-in_sync devices if the array itself -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 22+ messages in thread
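The queued patch above takes the opposite tack from the February diff: instead of relaxing the removal test, it drops the Blocked flag up front when the array is read-only, the metadata is kernel-managed (!mddev->external), and the array is in_sync. A rough Python sketch of that hunk's effect (simplified stand-in names, not the kernel code):

```python
def check_recovery(ro, external, in_sync, rdev_flags):
    """Toy sketch of the queued md_check_recovery() hunk: on a read-only,
    kernel-managed, in-sync array, Blocked can be cleared, because any
    later switch to read/write will update the metadata and record the
    failure anyway.  Each element of rdev_flags is a set of flag names.
    """
    if ro and not external and in_sync:
        for flags in rdev_flags:
            flags.discard("Blocked")
    return rdev_flags

# Failed member on a read-auto array with kernel-managed metadata:
devs = [{"Blocked", "Faulty"}, {"In_sync"}]
devs = check_recovery(ro=True, external=False, in_sync=True, rdev_flags=devs)
assert devs[0] == {"Faulty"}   # Blocked cleared; "remove" can now succeed

# Externally managed metadata: the flag is left for userspace to clear.
devs = [{"Blocked", "Faulty"}]
devs = check_recovery(ro=True, external=True, in_sync=True, rdev_flags=devs)
assert "Blocked" in devs[0]
```

With Blocked gone, the unmodified remove_and_add_spares() condition is satisfied for the faulty member, which is why this version no longer needs the "|| mddev->ro" escape hatch.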
* Re: RAID1 removing failed disk returns EBUSY 2015-06-17 2:51 ` Neil Brown @ 2015-06-25 9:42 ` Xiao Ni 0 siblings, 0 replies; 22+ messages in thread From: Xiao Ni @ 2015-06-25 9:42 UTC (permalink / raw) To: Neil Brown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja ----- Original Message ----- > From: "Neil Brown" <neilb@suse.de> > To: "XiaoNi" <xni@redhat.com> > Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > Sent: Wednesday, June 17, 2015 10:51:51 AM > Subject: Re: RAID1 removing failed disk returns EBUSY > > On Wed, 10 Jun 2015 14:26:41 +0800 > XiaoNi <xni@redhat.com> wrote: > > > > > > > On 02/03/2015 04:10 PM, Xiao Ni wrote: > > > > > > ----- Original Message ----- > > >> From: "NeilBrown" <neilb@suse.de> > > >> To: "Xiao Ni" <xni@redhat.com> > > >> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, > > >> linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > > >> Sent: Monday, February 2, 2015 2:36:01 PM > > >> Subject: Re: RAID1 removing failed disk returns EBUSY > > >> > > >> On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote: > > >> > > >>> > > >>> ----- Original Message ----- > > >>>> From: "NeilBrown" <neilb@suse.de> > > >>>> To: "Xiao Ni" <xni@redhat.com> > > >>>> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, > > >>>> linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com> > > >>>> Sent: Thursday, January 29, 2015 11:52:17 AM > > >>>> Subject: Re: RAID1 removing failed disk returns EBUSY > > >>>> > > >>>> On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> > > >>>> wrote: > > >>>> > > >>>>> > > >>>>> ----- Original Message ----- > > >>>>>> From: "Joe Lawrence" <joe.lawrence@stratus.com> > > >>>>>> To: "Xiao Ni" <xni@redhat.com> > > >>>>>> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill > > >>>>>> Kuzeja" <william.kuzeja@stratus.com> > > >>>>>> Sent: Friday, January 16, 2015 11:10:31 PM > > >>>>>> 
Subject: Re: RAID1 removing failed disk returns EBUSY > > >>>>>> > > >>>>>> On Fri, 16 Jan 2015 00:20:12 -0500 > > >>>>>> Xiao Ni <xni@redhat.com> wrote: > > >>>>>>> Hi Joe > > >>>>>>> > > >>>>>>> Thanks for reminding me. I didn't do that. Now it can remove > > >>>>>>> successfully after writing > > >>>>>>> "idle" to sync_action. > > >>>>>>> > > >>>>>>> I thought wrongly that the patch referenced in this mail is > > >>>>>>> fixed > > >>>>>>> for > > >>>>>>> the problem. > > >>>>>> So it sounds like even with 3.18 and a new mdadm, this bug still > > >>>>>> persists? > > >>>>>> > > >>>>>> -- Joe > > >>>>>> > > >>>>>> -- > > >>>>> Hi Joe > > >>>>> > > >>>>> I'm a little confused now. Does the patch > > >>>>> 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable > > >>>>> resolve the problem? > > >>>>> > > >>>>> My environment is: > > >>>>> > > >>>>> [root@dhcp-12-133 mdadm]# mdadm --version > > >>>>> mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014 (this is the newest > > >>>>> upstream) > > >>>>> [root@dhcp-12-133 mdadm]# uname -r > > >>>>> 3.18.2 > > >>>>> > > >>>>> > > >>>>> My steps are: > > >>>>> > > >>>>> [root@dhcp-12-133 mdadm]# lsblk > > >>>>> sdb 8:16 0 931.5G 0 disk > > >>>>> └─sdb1 8:17 0 5G 0 part > > >>>>> sdc 8:32 0 186.3G 0 disk > > >>>>> sdd 8:48 0 931.5G 0 disk > > >>>>> └─sdd1 8:49 0 5G 0 part > > >>>>> [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 > > >>>>> /dev/sdd1 > > >>>>> --assume-clean > > >>>>> mdadm: Note: this array has metadata at the start and > > >>>>> may not be suitable as a boot device. If you plan to > > >>>>> store '/boot' on this device please ensure that > > >>>>> your boot-loader understands md/v1.x metadata, or use > > >>>>> --metadata=0.90 > > >>>>> mdadm: Defaulting to version 1.2 metadata > > >>>>> mdadm: array /dev/md0 started. > > >>>>> > > >>>>> Then I unplug the disk. 
> > >>>>> > > >>>>> [root@dhcp-12-133 mdadm]# lsblk > > >>>>> sdc 8:32 0 186.3G 0 disk > > >>>>> sdd 8:48 0 931.5G 0 disk > > >>>>> └─sdd1 8:49 0 5G 0 part > > >>>>> └─md0 9:0 0 5G 0 raid1 > > >>>>> [root@dhcp-12-133 mdadm]# echo faulty > > > >>>>> /sys/block/md0/md/dev-sdb1/state > > >>>>> [root@dhcp-12-133 mdadm]# echo remove > > > >>>>> /sys/block/md0/md/dev-sdb1/state > > >>>>> -bash: echo: write error: Device or resource busy > > >>>>> [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action > > >>>>> [root@dhcp-12-133 mdadm]# echo remove > > > >>>>> /sys/block/md0/md/dev-sdb1/state > > >>>>> > > >>>> I cannot reproduce this - using linux 3.18.2. I'd be surprised if > > >>>> mdadm > > >>>> version affects things. > > >>> Hi Neil > > >>> > > >>> I'm very curious, because it can reproduce in my machine 100%. > > >>> > > >>>> This error (Device or resoource busy) implies that rdev->raid_disk is > > >>>> >= > > >>>> 0 > > >>>> (tested in state_store()). > > >>>> > > >>>> ->raid_disk is set to -1 by remove_and_add_spares() providing: > > >>>> 1/ it isn't Blocked (which is very unlikely) > > >>>> 2/ hot_remove_disk succeeds, which it will if nr_pending is zero, > > >>>> and > > >>>> 3/ nr_pending is zero. > > >>> I remember I have tired to check those reasons. But it's really is > > >>> the > > >>> reason 1 > > >>> which is very unlikely. 
> > >>>
> > >>> I added some code to the function array_state_show:
> > >>>
> > >>> array_state_show(struct mddev *mddev, char *page)
> > >>> {
> > >>>         enum array_state st = inactive;
> > >>>         struct md_rdev *rdev;
> > >>>
> > >>>         rdev_for_each_rcu(rdev, mddev) {
> > >>>                 printk(KERN_ALERT "search for %s\n",
> > >>>                        rdev->bdev->bd_disk->disk_name);
> > >>>                 if (test_bit(Blocked, &rdev->flags))
> > >>>                         printk(KERN_ALERT "rdev is Blocked\n");
> > >>>                 else
> > >>>                         printk(KERN_ALERT "rdev is not Blocked\n");
> > >>>         }
> > >>>
> > >>> After I echo 1 > /sys/block/sdc/device/delete, I ran the command:
> > >>>
> > >>> [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state
> > >>> read-auto
> > >>    ^^^^^^^^^
> > >>
> > >> I think that is half the explanation.
> > >> You must have the md_mod.start_ro parameter set to '1'.
> > >>
> > >>> [root@dhcp-12-133 md]# dmesg
> > >>> [ 2679.559185] search for sdc
> > >>> [ 2679.559189] rdev is Blocked
> > >>> [ 2679.559190] search for sdb
> > >>> [ 2679.559190] rdev is not Blocked
> > >>>
> > >>> So sdc is Blocked
> > >>
> > >> and that is the other half - thanks.
> > >> (yes, I was wrong.  Sometimes it is easier than being right, but still
> > >> yields results).
> > >>
> > >> When a device fails, it is Blocked until the metadata is updated to
> > >> record the failure.  This ensures that no writes succeed without
> > >> writing to that device, until we are certain that no read will try
> > >> reading from that device, even after a crash/restart.
> > >>
> > >> Blocked is cleared after the metadata is written, but read-auto (and
> > >> read-only) arrays never write out their metadata.  So Blocked doesn't
> > >> get cleared.
> > >>
> > >> When you "echo idle > .../sync_action" one of the side effects is to
> > >> switch from 'read-auto' to fully active.  This allows the metadata to
> > >> be written, Blocked to be cleared, and the device to be removed.
> > >>
> > >> If you
> > >>     echo none > /sys/block/md0/md/dev-sdc/slot
> > >> first, then the remove will work.
> > >>
> > >> We could possibly fix it with something like the following, but I'm
> > >> not sure I like it.  There is no guarantee that I can see which would
> > >> ensure the superblock got updated before the first write if the array
> > >> switched to read/write.
> > >>
> > >> NeilBrown
> > >>
> > >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >> index 9233c71138f1..b3d1e8e5e067 100644
> > >> --- a/drivers/md/md.c
> > >> +++ b/drivers/md/md.c
> > >> @@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev *mddev,
> > >>  	rdev_for_each(rdev, mddev)
> > >>  		if ((this == NULL || rdev == this) &&
> > >>  		    rdev->raid_disk >= 0 &&
> > >> -		    !test_bit(Blocked, &rdev->flags) &&
> > >> +		    (!test_bit(Blocked, &rdev->flags) || mddev->ro) &&
> > >>  		    (test_bit(Faulty, &rdev->flags) ||
> > >>  		     ! test_bit(In_sync, &rdev->flags)) &&
> > >>  		    atomic_read(&rdev->nr_pending)==0) {
> > >>
> > > Hi Neil
> > >
> > > I have tried the patch and the problem is fixed by it.  But I'm sorry
> > > that I can't offer any advice toward a better idea for this.  I'm not
> > > familiar with the metadata part of md.  I'll try to get more time to
> > > read the md code.
> >
> > Hi Neil
> >
> > I don't see the patch in linux-stable, did you miss this?
>
> I don't believe this bug is sufficiently serious for the patch to go to
> -stable.  However it does need to be fixed - thanks for the reminder.
>
> I've just queued the following patch, which I am happy with.  If you
> could confirm that it works for you, I would appreciate that.
>
> Thanks,
> NeilBrown
>
>
> From: Neil Brown <neilb@suse.de>
> Date: Wed, 17 Jun 2015 12:31:46 +1000
> Subject: [PATCH] md: clear Blocked flag on failed devices when array is
>  read-only.
>
> The Blocked flag indicates that a device has failed but that this
> fact hasn't been recorded in the metadata yet.  Writes to such
> devices cannot be allowed until the metadata has been updated.
>
> On a read-only array, the Blocked flag will never be cleared.
> This prevents the device being removed from the array.
>
> If the metadata is being handled by the kernel
> (i.e. !mddev->external), then we can be sure that if the array is
> switched to writable, a metadata update will happen and will
> record the failure.  So we don't need the flag set.
>
> If metadata is externally managed, it is up to the external manager
> to clear the 'blocked' flag.
>
> Reported-by: XiaoNi <xni@redhat.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3d339e2..5a6681a 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8125,6 +8125,15 @@ void md_check_recovery(struct mddev *mddev)
>  		int spares = 0;
>
>  		if (mddev->ro) {
> +			struct md_rdev *rdev;
> +			if (!mddev->external && mddev->in_sync)
> +				/* 'Blocked' flag not needed as failed devices
> +				 * will be recorded if array switched to read/write.
> +				 * Leaving it set will prevent the device
> +				 * from being removed.
> +				 */
> +				rdev_for_each(rdev, mddev)
> +					clear_bit(Blocked, &rdev->flags);
>  			/* On a read-only array we can:
>  			 * - remove failed devices
>  			 * - add already-in_sync devices if the array itself
>

Hi Neil

Sorry for the late response on this.  I have tried the patch.  When I
unplug the disk (sdc1) which belongs to the raid1, the directory
/sys/block/md0/md/dev-sdc1 is deleted.  I haven't read the code for
unplugging a device, so is this what you want?

Best Regards
Xiao
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 22+ messages in thread
end of thread, other threads:[~2015-06-25  9:42 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-27 20:27 RAID1 removing failed disk returns EBUSY Joe Lawrence
2014-10-28 21:41 ` NeilBrown
2014-10-29 17:36   ` Joe Lawrence
2014-11-13 14:05     ` Joe Lawrence
2014-11-16 23:03       ` NeilBrown
2015-01-14 12:41         ` XiaoNi
2015-01-15 13:22           ` Joe Lawrence
2015-01-16  5:20             ` Xiao Ni
2015-01-16 15:10               ` Joe Lawrence
2015-01-19  2:33                 ` Xiao Ni
2015-01-19 17:56                   ` Joe Lawrence
2015-01-20  7:16                     ` Xiao Ni
2015-01-23 15:11                       ` Joe Lawrence
2015-01-30  2:19                         ` Xiao Ni
2015-01-30  4:27                           ` Xiao Ni
2015-01-29  3:52                     ` NeilBrown
2015-01-29 12:14                       ` Xiao Ni
2015-02-02  6:36                         ` NeilBrown
2015-02-03  8:10                           ` Xiao Ni
2015-06-10  6:26                             ` XiaoNi
2015-06-17  2:51                               ` Neil Brown
2015-06-25  9:42                                 ` Xiao Ni