* RAID1 removing failed disk returns EBUSY
@ 2014-10-27 20:27 Joe Lawrence
  2014-10-28 21:41 ` NeilBrown
From: Joe Lawrence @ 2014-10-27 20:27 UTC (permalink / raw)
  To: linux-raid; +Cc: Bill Kuzeja

Hi Neil,

We've encountered changes in MD and mdadm that have broken our automated
disk removal script.  In the past, we've been able to run the following
after a RAID1 disk component removal:

% echo fail > /sys/block/md3/md/dev-sdr5/state
% echo remove > /sys/block/md3/md/dev-sdr5/state

However, the latest RHEL6.6 code drop has rebased onto recent enough MD
kernel and mdadm changes that the previous commands now occasionally fail.
The failing scenario looks like this:

* MD array is usually resyncing or checking
* Component disk /dev/sdr removed via HBA sysfs PCI removal
* Following UDEV rule fires:

SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
        RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"

% mdadm --detail /dev/md3
/dev/md3:
        Version : 1.1
  Creation Time : Tue Oct 14 17:31:59 2014
     Raid Level : raid1
     Array Size : 25149440 (23.98 GiB 25.75 GB)
  Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Oct 15 14:22:34 2014
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : localhost.localdomain:3
           UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
         Events : 142

    Number   Major   Minor   RaidDevice State
       0      65       21        0      faulty
       1      65        5        1      active sync   /dev/sdj5

All attempts to remove this device fail: 

% echo remove > /sys/block/md3/md/dev-sdr5/state
-bash: echo: write error: Device or resource busy

This can be traced to state_store():

        } else if (cmd_match(buf, "remove")) {
                if (rdev->raid_disk >= 0)
                        err = -EBUSY;

After much debugging and systemtapping, I think I've figured out that the
sysfs scripting may fail after the following combination of changes:

mdadm  8af530b07fce "Enhance incremental removal."
kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
                     badblocks during md_stop_writes"

With these two changes:

1 - On the user side, mdadm tries to set the array_state to read-auto
    on incremental removal (as invoked by the UDEV rule).

2 - On the kernel side, md_set_readonly() sets the MD_RECOVERY_FROZEN flag,
    wakes up mddev->thread and, if there is a sync_thread, sets
    MD_RECOVERY_INTR and then waits until the sync_thread is set to NULL.

    When md_check_recovery() gets a chance to run as part of the
    raid1d() mddev->thread, it may never reach its invocation of
    remove_and_add_spares(), for there are *many* conditional early exits
    along the way -- for example, if MD_RECOVERY_FROZEN is set, the
    following condition bounces out of the routine:

                if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
                    test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))             
                        goto unlock;

    The next time around, MD_RECOVERY_NEEDED will have been cleared, so
    every subsequent test_and_clear_bit() returns 0 and the negated test
    always takes the early exit path.

    Back in md_set_readonly(), it may notice that the MD device is still
    in use, so it clears MD_RECOVERY_FROZEN and returns -EBUSY without
    setting mddev->ro.  But the damage has been done: the recovery flags
    are now in a state where md_check_recovery() will never call
    remove_and_add_spares().

This would also explain why writing "idle" to sync_action clears the wedge:
it sets MD_RECOVERY_NEEDED, allowing md_check_recovery() to continue all the
way to remove_and_add_spares().

As far as I can tell, this is what is happening to prevent the "remove"
write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
certainly a lot of little bit-states between disk removal, UDEV mdadm, and
various MD kernel threads, so apologies if I missed an important
transition.

Would you consider writing "idle" to the MD array sync_action file as a
safe and reasonable intermediate workaround step for our script?
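
For clarity, the workaround step in our script would amount to something
like the following (same md3 / dev-sdr5 names as in the example above):

% echo idle > /sys/block/md3/md/sync_action
% echo remove > /sys/block/md3/md/dev-sdr5/state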

And of course, any suggestions as to whether this is intended behavior
(i.e., the removed component disk is failed, but stuck in the array)?

This is fairly easy for us to reproduce with multiple MD arrays per disk
(one per partition): start a raid check on all of them, then interrupt the
checks (especially while the later arrays are still delayed waiting for the
first to finish) by removing the component disk via sysfs PCI removal
(rough sketch below).  We can provide additional debugging or testing if
required.
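
Roughly, the reproduction amounts to something like the following, where
the PCI address is a placeholder for whichever slot holds the component
disk:

% for f in /sys/block/md*/md/sync_action ; do echo check > $f ; done
% echo 1 > /sys/bus/pci/devices/<component-disk-PCI-address>/remove

after which the UDEV rule above fires mdadm -If and the subsequent "remove"
write occasionally fails with EBUSY as shown above.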

Regards,

-- Joe


* Re: RAID1 removing failed disk returns EBUSY
  2014-10-27 20:27 RAID1 removing failed disk returns EBUSY Joe Lawrence
@ 2014-10-28 21:41 ` NeilBrown
  2014-10-29 17:36   ` Joe Lawrence
From: NeilBrown @ 2014-10-28 21:41 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-raid, Bill Kuzeja


On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence <joe.lawrence@stratus.com>
wrote:

> Hi Neil,
> 
> We've encountered changes in MD and mdadm that have broken our automated
> disk removal script.  In the past, we've been able to run the following
> after a RAID1 disk component removal:
> 
> % echo fail > /sys/block/md3/md/dev-sdr5/state
> % echo remove > /sys/block/md3/md/dev-sdr5/state
> 
> However, the latest RHEL6.6 code drop has rebased to sufficiently recent
> MD kernel and mdadm changes, in which the previous commands occasionally
> fail like so:
> 
> * MD array is usually resyncing or checking
> * Component disk /dev/sdr removed via HBA sysfs PCI removal
> * Following UDEV rule fires:
> 
> SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
>         RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> 
> % mdadm --detail /dev/md3
> /dev/md3:
>         Version : 1.1
>   Creation Time : Tue Oct 14 17:31:59 2014
>      Raid Level : raid1
>      Array Size : 25149440 (23.98 GiB 25.75 GB)
>   Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
>    Raid Devices : 2
>   Total Devices : 2
>     Persistence : Superblock is persistent
> 
>   Intent Bitmap : Internal
> 
>     Update Time : Wed Oct 15 14:22:34 2014
>           State : active, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
> 
>            Name : localhost.localdomain:3
>            UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
>          Events : 142
> 
>     Number   Major   Minor   RaidDevice State
>        0      65       21        0      faulty
>        1      65        5        1      active sync   /dev/sdj5
> 
> All attempts to remove this device fail: 
> 
> % echo remove > /sys/block/md3/md/dev-sdr5/state
> -bash: echo: write error: Device or resource busy
> 
> This can be traced to state_store():
> 
>         } else if (cmd_match(buf, "remove")) {
>                 if (rdev->raid_disk >= 0)
>                         err = -EBUSY;
> 
> After much debugging and systemtapping, I think I've figured out that the
> sysfs scripting may fail after the following combination of changes:
> 
> mdadm  8af530b07fce "Enhance incremental removal."
> kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
>                      badblocks during md_stop_writes"
> 
> With these two changes:
> 
> 1 - On the user side, mdadm is trying to set the array_state to read-auto
>     on incremental removal (as invoked by UDEV rule). 
> 
> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag,
>     wake up the mddev->thread and if there is a sync_thread, it will set
>     MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.
> 
>     When md_check_recovery() gets a chance to run as part of the
>     raid1d() mddev->thread, it may or may not ever get to
>     an invocation of remove_and_add_spares(), for there are but *many*
>     conditional early exits along the way -- for example, if
>     MD_RECOVERY_FROZEN is set, the following condition will bounce out of
>     the routine:
> 
>                 if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
>                     test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))             
>                         goto unlock;
> 
>     the next time around, MD_RECOVERY_NEEDED will have been cleared, so
>     all future tests will return 0 and the negation will always take the
>     early exit path.
> 
>     Back in md_set_readonly(), it may notice that the MD is still in use,
>     so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, without
>     setting mddev->ro.  But the damage has been done as conditions have
>     been set such that md_check_recovery() will never call
>     remove_and_add_spares().
> 
> This would also explain why an "idle" sync_action clears the wedge: it
> sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue executing
> to remove_and_add_spares().
> 
> As far as I can tell, this is what is happening to prevent the "remove"
> write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> certainly a lot of little bit-states between disk removal, UDEV mdadm, and
> various MD kernel threads, so apologies if I missed an important
> transition.
> 
> Would you consider writing "idle" to the MD array sync_action file as a
> safe and reasonable intermediate workaround step for our script?
> 
> And of course, any suggestions to whether this is intended behavior (ie,
> the removed component disk is failed, but stuck in the array)?
> 
> This is fairly easy for us to reproduce with multiple MD arrays per disk
> (one per partition) and interrupting a raid check on all of them
> (especially when they are delayed waiting for the first to finish) by
> removing the component disk via sysfs PCI removal.  We can provide
> additional debug or testing if required.
> 

Hi Joe,
 thanks for the detailed analysis!!

I think the correct fix would be that MD_RECOVERY_NEEDED should be set after
clearing MD_RECOVERY_FROZEN, like the patch below.
Can you confirm that it works for you?

Writing 'idle' should in general be safe, so that could be used as an interim.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c03d87b6890a..2c73fcb82593 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 		printk("md: %s still in use.\n",mdname(mddev));
 		if (did_freeze) {
 			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			md_wakeup_thread(mddev->thread);
 		}
 		err = -EBUSY;
@@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 		mddev->ro = 1;
 		set_disk_ro(mddev->gendisk, 1);
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 		err = 0;
 	}
@@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
 		mutex_unlock(&mddev->open_mutex);
 		if (did_freeze) {
 			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			md_wakeup_thread(mddev->thread);
 		}
 		return -EBUSY;



* Re: RAID1 removing failed disk returns EBUSY
  2014-10-28 21:41 ` NeilBrown
@ 2014-10-29 17:36   ` Joe Lawrence
  2014-11-13 14:05     ` Joe Lawrence
From: Joe Lawrence @ 2014-10-29 17:36 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Bill Kuzeja

On Wed, 29 Oct 2014 08:41:13 +1100
NeilBrown <neilb@suse.de> wrote:

> On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence <joe.lawrence@stratus.com>
> wrote:
> 
> > Hi Neil,
> > 
> > We've encountered changes in MD and mdadm that have broken our automated
> > disk removal script.  In the past, we've been able to run the following
> > after a RAID1 disk component removal:
> > 
> > % echo fail > /sys/block/md3/md/dev-sdr5/state
> > % echo remove > /sys/block/md3/md/dev-sdr5/state
> > 
> > However, the latest RHEL6.6 code drop has rebased to sufficiently recent
> > MD kernel and mdadm changes, in which the previous commands occasionally
> > fail like so:
> > 
> > * MD array is usually resyncing or checking
> > * Component disk /dev/sdr removed via HBA sysfs PCI removal
> > * Following UDEV rule fires:
> > 
> > SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
> >         RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> > 
> > % mdadm --detail /dev/md3
> > /dev/md3:
> >         Version : 1.1
> >   Creation Time : Tue Oct 14 17:31:59 2014
> >      Raid Level : raid1
> >      Array Size : 25149440 (23.98 GiB 25.75 GB)
> >   Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
> >    Raid Devices : 2
> >   Total Devices : 2
> >     Persistence : Superblock is persistent
> > 
> >   Intent Bitmap : Internal
> > 
> >     Update Time : Wed Oct 15 14:22:34 2014
> >           State : active, degraded
> >  Active Devices : 1
> > Working Devices : 1
> >  Failed Devices : 1
> >   Spare Devices : 0
> > 
> >            Name : localhost.localdomain:3
> >            UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
> >          Events : 142
> > 
> >     Number   Major   Minor   RaidDevice State
> >        0      65       21        0      faulty
> >        1      65        5        1      active sync   /dev/sdj5
> > 
> > All attempts to remove this device fail: 
> > 
> > % echo remove > /sys/block/md3/md/dev-sdr5/state
> > -bash: echo: write error: Device or resource busy
> > 
> > This can be traced to state_store():
> > 
> >         } else if (cmd_match(buf, "remove")) {
> >                 if (rdev->raid_disk >= 0)
> >                         err = -EBUSY;
> > 
> > After much debugging and systemtapping, I think I've figured out that the
> > sysfs scripting may fail after the following combination of changes:
> > 
> > mdadm  8af530b07fce "Enhance incremental removal."
> > kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
> >                      badblocks during md_stop_writes"
> > 
> > With these two changes:
> > 
> > 1 - On the user side, mdadm is trying to set the array_state to read-auto
> >     on incremental removal (as invoked by UDEV rule). 
> > 
> > 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag,
> >     wake up the mddev->thread and if there is a sync_thread, it will set
> >     MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.
> > 
> >     When md_check_recovery() gets a chance to run as part of the
> >     raid1d() mddev->thread, it may or may not ever get to
> >     an invocation of remove_and_add_spares(), for there are but *many*
> >     conditional early exits along the way -- for example, if
> >     MD_RECOVERY_FROZEN is set, the following condition will bounce out of
> >     the routine:
> > 
> >                 if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> >                     test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))             
> >                         goto unlock;
> > 
> >     the next time around, MD_RECOVERY_NEEDED will have been cleared, so
> >     all future tests will return 0 and the negation will always take the
> >     early exit path.
> > 
> >     Back in md_set_readonly(), it may notice that the MD is still in use,
> >     so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, without
> >     setting mddev->ro.  But the damage has been done as conditions have
> >     been set such that md_check_recovery() will never call
> >     remove_and_add_spares().
> > 
> > This would also explain why an "idle" sync_action clears the wedge: it
> > sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue executing
> > to remove_and_add_spares().
> > 
> > As far as I can tell, this is what is happening to prevent the "remove"
> > write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> > certainly a lot of little bit-states between disk removal, UDEV mdadm, and
> > various MD kernel threads, so apologies if I missed an important
> > transition.
> > 
> > Would you consider writing "idle" to the MD array sync_action file as a
> > safe and reasonable intermediate workaround step for our script?
> > 
> > And of course, any suggestions to whether this is intended behavior (ie,
> > the removed component disk is failed, but stuck in the array)?
> > 
> > This is fairly easy for us to reproduce with multiple MD arrays per disk
> > (one per partition) and interrupting a raid check on all of them
> > (especially when they are delayed waiting for the first to finish) by
> > removing the component disk via sysfs PCI removal.  We can provide
> > additional debug or testing if required.
> > 
> 
> Hi Joe,
>  thanks for the details analysis!!
> 
> I think the correct fix would be that MD_RECOVERY_NEEDED should be set after
> clearing MD_RECOVERY_FROZEN, like the patch below.
> Can you confirm that it works for you?
> 
> Writing 'idle' should in general be safe, so that could be used as an interim.
> 
> Thanks,
> NeilBrown
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c03d87b6890a..2c73fcb82593 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
>  		printk("md: %s still in use.\n",mdname(mddev));
>  		if (did_freeze) {
>  			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> +			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>  			md_wakeup_thread(mddev->thread);
>  		}
>  		err = -EBUSY;
> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
>  		mddev->ro = 1;
>  		set_disk_ro(mddev->gendisk, 1);
>  		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +		md_wakeup_thread(mddev->thread);
>  		sysfs_notify_dirent_safe(mddev->sysfs_state);
>  		err = 0;
>  	}
> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
>  		mutex_unlock(&mddev->open_mutex);
>  		if (did_freeze) {
>  			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> +			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>  			md_wakeup_thread(mddev->thread);
>  		}
>  		return -EBUSY;

Hi Neil,

In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
pulled disk from the MD array.  This is okay for our intentions, but I
wanted to make sure that it's okay to skip any failed-but-not-removed
state.

Tested-by: Joe Lawrence <joe.lawrence@stratus.com>

and should this have a

Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes")

tag to mark for stable?

Thanks,

-- Joe


* Re: RAID1 removing failed disk returns EBUSY
  2014-10-29 17:36   ` Joe Lawrence
@ 2014-11-13 14:05     ` Joe Lawrence
  2014-11-16 23:03       ` NeilBrown
From: Joe Lawrence @ 2014-11-13 14:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Bill Kuzeja

On Wed, 29 Oct 2014 13:36:04 -0400
Joe Lawrence <joe.lawrence@stratus.com> wrote:

> On Wed, 29 Oct 2014 08:41:13 +1100
> NeilBrown <neilb@suse.de> wrote:
> 
> > On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence <joe.lawrence@stratus.com>
> > wrote:
> > 
> > > Hi Neil,
> > > 
> > > We've encountered changes in MD and mdadm that have broken our automated
> > > disk removal script.  In the past, we've been able to run the following
> > > after a RAID1 disk component removal:
> > > 
> > > % echo fail > /sys/block/md3/md/dev-sdr5/state
> > > % echo remove > /sys/block/md3/md/dev-sdr5/state
> > > 
> > > However, the latest RHEL6.6 code drop has rebased to sufficiently recent
> > > MD kernel and mdadm changes, in which the previous commands occasionally
> > > fail like so:
> > > 
> > > * MD array is usually resyncing or checking
> > > * Component disk /dev/sdr removed via HBA sysfs PCI removal
> > > * Following UDEV rule fires:
> > > 
> > > SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
> > >         RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> > > 
> > > % mdadm --detail /dev/md3
> > > /dev/md3:
> > >         Version : 1.1
> > >   Creation Time : Tue Oct 14 17:31:59 2014
> > >      Raid Level : raid1
> > >      Array Size : 25149440 (23.98 GiB 25.75 GB)
> > >   Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
> > >    Raid Devices : 2
> > >   Total Devices : 2
> > >     Persistence : Superblock is persistent
> > > 
> > >   Intent Bitmap : Internal
> > > 
> > >     Update Time : Wed Oct 15 14:22:34 2014
> > >           State : active, degraded
> > >  Active Devices : 1
> > > Working Devices : 1
> > >  Failed Devices : 1
> > >   Spare Devices : 0
> > > 
> > >            Name : localhost.localdomain:3
> > >            UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
> > >          Events : 142
> > > 
> > >     Number   Major   Minor   RaidDevice State
> > >        0      65       21        0      faulty
> > >        1      65        5        1      active sync   /dev/sdj5
> > > 
> > > All attempts to remove this device fail: 
> > > 
> > > % echo remove > /sys/block/md3/md/dev-sdr5/state
> > > -bash: echo: write error: Device or resource busy
> > > 
> > > This can be traced to state_store():
> > > 
> > >         } else if (cmd_match(buf, "remove")) {
> > >                 if (rdev->raid_disk >= 0)
> > >                         err = -EBUSY;
> > > 
> > > After much debugging and systemtapping, I think I've figured out that the
> > > sysfs scripting may fail after the following combination of changes:
> > > 
> > > mdadm  8af530b07fce "Enhance incremental removal."
> > > kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
> > >                      badblocks during md_stop_writes"
> > > 
> > > With these two changes:
> > > 
> > > 1 - On the user side, mdadm is trying to set the array_state to read-auto
> > >     on incremental removal (as invoked by UDEV rule). 
> > > 
> > > 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag,
> > >     wake up the mddev->thread and if there is a sync_thread, it will set
> > >     MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.
> > > 
> > >     When md_check_recovery() gets a chance to run as part of the
> > >     raid1d() mddev->thread, it may or may not ever get to
> > >     an invocation of remove_and_add_spares(), for there are but *many*
> > >     conditional early exits along the way -- for example, if
> > >     MD_RECOVERY_FROZEN is set, the following condition will bounce out of
> > >     the routine:
> > > 
> > >                 if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> > >                     test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))             
> > >                         goto unlock;
> > > 
> > >     the next time around, MD_RECOVERY_NEEDED will have been cleared, so
> > >     all future tests will return 0 and the negation will always take the
> > >     early exit path.
> > > 
> > >     Back in md_set_readonly(), it may notice that the MD is still in use,
> > >     so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, without
> > >     setting mddev->ro.  But the damage has been done as conditions have
> > >     been set such that md_check_recovery() will never call
> > >     remove_and_add_spares().
> > > 
> > > This would also explain why an "idle" sync_action clears the wedge: it
> > > sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue executing
> > > to remove_and_add_spares().
> > > 
> > > As far as I can tell, this is what is happening to prevent the "remove"
> > > write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> > > certainly a lot of little bit-states between disk removal, UDEV mdadm, and
> > > various MD kernel threads, so apologies if I missed an important
> > > transition.
> > > 
> > > Would you consider writing "idle" to the MD array sync_action file as a
> > > safe and reasonable intermediate workaround step for our script?
> > > 
> > > And of course, any suggestions to whether this is intended behavior (ie,
> > > the removed component disk is failed, but stuck in the array)?
> > > 
> > > This is fairly easy for us to reproduce with multiple MD arrays per disk
> > > (one per partition) and interrupting a raid check on all of them
> > > (especially when they are delayed waiting for the first to finish) by
> > > removing the component disk via sysfs PCI removal.  We can provide
> > > additional debug or testing if required.
> > > 
> > 
> > Hi Joe,
> >  thanks for the details analysis!!
> > 
> > I think the correct fix would be that MD_RECOVERY_NEEDED should be set after
> > clearing MD_RECOVERY_FROZEN, like the patch below.
> > Can you confirm that it works for you?
> > 
> > Writing 'idle' should in general be safe, so that could be used as an interim.
> > 
> > Thanks,
> > NeilBrown
> > 
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index c03d87b6890a..2c73fcb82593 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> >  		printk("md: %s still in use.\n",mdname(mddev));
> >  		if (did_freeze) {
> >  			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > +			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> >  			md_wakeup_thread(mddev->thread);
> >  		}
> >  		err = -EBUSY;
> > @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> >  		mddev->ro = 1;
> >  		set_disk_ro(mddev->gendisk, 1);
> >  		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > +		md_wakeup_thread(mddev->thread);
> >  		sysfs_notify_dirent_safe(mddev->sysfs_state);
> >  		err = 0;
> >  	}
> > @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
> >  		mutex_unlock(&mddev->open_mutex);
> >  		if (did_freeze) {
> >  			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > +			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> >  			md_wakeup_thread(mddev->thread);
> >  		}
> >  		return -EBUSY;
> 
> Hi Neil,
> 
> In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
> pulled disk from the MD array.  This is okay for our intentions, but I
> wanted to make sure that it's okay to skip any failed-but-not-removed
> state.
> 
> Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
> 
> and should this have a
> 
> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes")
> 
> tag to mark for stable?


Hi Neil,

Would you like me to write up a proper patch, or is this one in the queue?

Thanks,

-- Joe


* Re: RAID1 removing failed disk returns EBUSY
  2014-11-13 14:05     ` Joe Lawrence
@ 2014-11-16 23:03       ` NeilBrown
  2015-01-14 12:41         ` XiaoNi
From: NeilBrown @ 2014-11-16 23:03 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-raid, Bill Kuzeja


On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence <joe.lawrence@stratus.com>
wrote:

> On Wed, 29 Oct 2014 13:36:04 -0400
> Joe Lawrence <joe.lawrence@stratus.com> wrote:
> 
> > On Wed, 29 Oct 2014 08:41:13 +1100
> > NeilBrown <neilb@suse.de> wrote:
> > 
> > > On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence <joe.lawrence@stratus.com>
> > > wrote:
> > > 
> > > > Hi Neil,
> > > > 
> > > > We've encountered changes in MD and mdadm that have broken our automated
> > > > disk removal script.  In the past, we've been able to run the following
> > > > after a RAID1 disk component removal:
> > > > 
> > > > % echo fail > /sys/block/md3/md/dev-sdr5/state
> > > > % echo remove > /sys/block/md3/md/dev-sdr5/state
> > > > 
> > > > However, the latest RHEL6.6 code drop has rebased to sufficiently recent
> > > > MD kernel and mdadm changes, in which the previous commands occasionally
> > > > fail like so:
> > > > 
> > > > * MD array is usually resyncing or checking
> > > > * Component disk /dev/sdr removed via HBA sysfs PCI removal
> > > > * Following UDEV rule fires:
> > > > 
> > > > SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
> > > >         RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> > > > 
> > > > % mdadm --detail /dev/md3
> > > > /dev/md3:
> > > >         Version : 1.1
> > > >   Creation Time : Tue Oct 14 17:31:59 2014
> > > >      Raid Level : raid1
> > > >      Array Size : 25149440 (23.98 GiB 25.75 GB)
> > > >   Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
> > > >    Raid Devices : 2
> > > >   Total Devices : 2
> > > >     Persistence : Superblock is persistent
> > > > 
> > > >   Intent Bitmap : Internal
> > > > 
> > > >     Update Time : Wed Oct 15 14:22:34 2014
> > > >           State : active, degraded
> > > >  Active Devices : 1
> > > > Working Devices : 1
> > > >  Failed Devices : 1
> > > >   Spare Devices : 0
> > > > 
> > > >            Name : localhost.localdomain:3
> > > >            UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
> > > >          Events : 142
> > > > 
> > > >     Number   Major   Minor   RaidDevice State
> > > >        0      65       21        0      faulty
> > > >        1      65        5        1      active sync   /dev/sdj5
> > > > 
> > > > All attempts to remove this device fail: 
> > > > 
> > > > % echo remove > /sys/block/md3/md/dev-sdr5/state
> > > > -bash: echo: write error: Device or resource busy
> > > > 
> > > > This can be traced to state_store():
> > > > 
> > > >         } else if (cmd_match(buf, "remove")) {
> > > >                 if (rdev->raid_disk >= 0)
> > > >                         err = -EBUSY;
> > > > 
> > > > After much debugging and systemtapping, I think I've figured out that the
> > > > sysfs scripting may fail after the following combination of changes:
> > > > 
> > > > mdadm  8af530b07fce "Enhance incremental removal."
> > > > kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
> > > >                      badblocks during md_stop_writes"
> > > > 
> > > > With these two changes:
> > > > 
> > > > 1 - On the user side, mdadm is trying to set the array_state to read-auto
> > > >     on incremental removal (as invoked by UDEV rule). 
> > > > 
> > > > 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag,
> > > >     wake up the mddev->thread and if there is a sync_thread, it will set
> > > >     MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.
> > > > 
> > > >     When md_check_recovery() gets a chance to run as part of the
> > > >     raid1d() mddev->thread, it may or may not ever get to
> > > >     an invocation of remove_and_add_spares(), for there are but *many*
> > > >     conditional early exits along the way -- for example, if
> > > >     MD_RECOVERY_FROZEN is set, the following condition will bounce out of
> > > >     the routine:
> > > > 
> > > >                 if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> > > >                     test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))             
> > > >                         goto unlock;
> > > > 
> > > >     the next time around, MD_RECOVERY_NEEDED will have been cleared, so
> > > >     all future tests will return 0 and the negation will always take the
> > > >     early exit path.
> > > > 
> > > >     Back in md_set_readonly(), it may notice that the MD is still in use,
> > > >     so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, without
> > > >     setting mddev->ro.  But the damage has been done as conditions have
> > > >     been set such that md_check_recovery() will never call
> > > >     remove_and_add_spares().
> > > > 
> > > > This would also explain why an "idle" sync_action clears the wedge: it
> > > > sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue executing
> > > > to remove_and_add_spares().
> > > > 
> > > > As far as I can tell, this is what is happening to prevent the "remove"
> > > > write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> > > > certainly a lot of little bit-states between disk removal, UDEV mdadm, and
> > > > various MD kernel threads, so apologies if I missed an important
> > > > transition.
> > > > 
> > > > Would you consider writing "idle" to the MD array sync_action file as a
> > > > safe and reasonable intermediate workaround step for our script?
> > > > 
> > > > And of course, any suggestions to whether this is intended behavior (ie,
> > > > the removed component disk is failed, but stuck in the array)?
> > > > 
> > > > This is fairly easy for us to reproduce with multiple MD arrays per disk
> > > > (one per partition) and interrupting a raid check on all of them
> > > > (especially when they are delayed waiting for the first to finish) by
> > > > removing the component disk via sysfs PCI removal.  We can provide
> > > > additional debug or testing if required.
> > > > 
> > > 
> > > Hi Joe,
> > >  thanks for the details analysis!!
> > > 
> > > I think the correct fix would be that MD_RECOVERY_NEEDED should be set after
> > > clearing MD_RECOVERY_FROZEN, like the patch below.
> > > Can you confirm that it works for you?
> > > 
> > > Writing 'idle' should in general be safe, so that could be used as an interim.
> > > 
> > > Thanks,
> > > NeilBrown
> > > 
> > > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > > index c03d87b6890a..2c73fcb82593 100644
> > > --- a/drivers/md/md.c
> > > +++ b/drivers/md/md.c
> > > @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> > >  		printk("md: %s still in use.\n",mdname(mddev));
> > >  		if (did_freeze) {
> > >  			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > > +			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > >  			md_wakeup_thread(mddev->thread);
> > >  		}
> > >  		err = -EBUSY;
> > > @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> > >  		mddev->ro = 1;
> > >  		set_disk_ro(mddev->gendisk, 1);
> > >  		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > > +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > > +		md_wakeup_thread(mddev->thread);
> > >  		sysfs_notify_dirent_safe(mddev->sysfs_state);
> > >  		err = 0;
> > >  	}
> > > @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
> > >  		mutex_unlock(&mddev->open_mutex);
> > >  		if (did_freeze) {
> > >  			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > > +			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > >  			md_wakeup_thread(mddev->thread);
> > >  		}
> > >  		return -EBUSY;
> > 
> > Hi Neil,
> > 
> > In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
> > pulled disk from the MD array.  This is okay for our intentions, but I
> > wanted to make sure that it's okay to skip any failed-but-not-removed
> > state.
> > 
> > Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
> > 
> > and should this have a
> > 
> > Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes")
> > 
> > tag to mark for stable?
> 
> 
> Hi Neil,
> 
> Would you like me to write up a proper patch, or is this one in the queue?
> 

Several times over the last week I've thought that I should probably push
that patch along ... but each time something else seemed more interesting.
But it's a new week now.  I've just posted a pull request.

Thanks for the prompt (and the report and testing of course).

NeilBrown



* Re: RAID1 removing failed disk returns EBUSY
  2014-11-16 23:03       ` NeilBrown
@ 2015-01-14 12:41         ` XiaoNi
  2015-01-15 13:22           ` Joe Lawrence
From: XiaoNi @ 2015-01-14 12:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja

On 11/17/2014 07:03 AM, NeilBrown wrote:
> On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence<joe.lawrence@stratus.com>
> wrote:
>
>> On Wed, 29 Oct 2014 13:36:04 -0400
>> Joe Lawrence<joe.lawrence@stratus.com>  wrote:
>>
>>> On Wed, 29 Oct 2014 08:41:13 +1100
>>> NeilBrown<neilb@suse.de>  wrote:
>>>
>>>> On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence<joe.lawrence@stratus.com>
>>>> wrote:
>>>>
>>>>> Hi Neil,
>>>>>
>>>>> We've encountered changes in MD and mdadm that have broken our automated
>>>>> disk removal script.  In the past, we've been able to run the following
>>>>> after a RAID1 disk component removal:
>>>>>
>>>>> % echo fail>  /sys/block/md3/md/dev-sdr5/state
>>>>> % echo remove>  /sys/block/md3/md/dev-sdr5/state
>>>>>
>>>>> However, the latest RHEL6.6 code drop has rebased to sufficiently recent
>>>>> MD kernel and mdadm changes, in which the previous commands occasionally
>>>>> fail like so:
>>>>>
>>>>> * MD array is usually resyncing or checking
>>>>> * Component disk /dev/sdr removed via HBA sysfs PCI removal
>>>>> * Following UDEV rule fires:
>>>>>
>>>>> SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
>>>>>          RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
>>>>>
>>>>> % mdadm --detail /dev/md3
>>>>> /dev/md3:
>>>>>          Version : 1.1
>>>>>    Creation Time : Tue Oct 14 17:31:59 2014
>>>>>       Raid Level : raid1
>>>>>       Array Size : 25149440 (23.98 GiB 25.75 GB)
>>>>>    Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
>>>>>     Raid Devices : 2
>>>>>    Total Devices : 2
>>>>>      Persistence : Superblock is persistent
>>>>>
>>>>>    Intent Bitmap : Internal
>>>>>
>>>>>      Update Time : Wed Oct 15 14:22:34 2014
>>>>>            State : active, degraded
>>>>>   Active Devices : 1
>>>>> Working Devices : 1
>>>>>   Failed Devices : 1
>>>>>    Spare Devices : 0
>>>>>
>>>>>             Name : localhost.localdomain:3
>>>>>             UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
>>>>>           Events : 142
>>>>>
>>>>>      Number   Major   Minor   RaidDevice State
>>>>>         0      65       21        0      faulty
>>>>>         1      65        5        1      active sync   /dev/sdj5
>>>>>
>>>>> All attempts to remove this device fail:
>>>>>
>>>>> % echo remove>  /sys/block/md3/md/dev-sdr5/state
>>>>> -bash: echo: write error: Device or resource busy
>>>>>
>>>>> This can be traced to state_store():
>>>>>
>>>>>          } else if (cmd_match(buf, "remove")) {
>>>>>                  if (rdev->raid_disk>= 0)
>>>>>                          err = -EBUSY;
>>>>>
>>>>> After much debugging and systemtapping, I think I've figured out that the
>>>>> sysfs scripting may fail after the following combination of changes:
>>>>>
>>>>> mdadm  8af530b07fce "Enhance incremental removal."
>>>>> kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
>>>>>                       badblocks during md_stop_writes"
>>>>>
>>>>> With these two changes:
>>>>>
>>>>> 1 - On the user side, mdadm is trying to set the array_state to read-auto
>>>>>      on incremental removal (as invoked by UDEV rule).
>>>>>
>>>>> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag,
>>>>>      wake up the mddev->thread and if there is a sync_thread, it will set
>>>>>      MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.
>>>>>
>>>>>      When md_check_recovery() gets a chance to run as part of the
>>>>>      raid1d() mddev->thread, it may or may not ever get to
>>>>>      an invocation of remove_and_add_spares(), for there are but *many*
>>>>>      conditional early exits along the way -- for example, if
>>>>>      MD_RECOVERY_FROZEN is set, the following condition will bounce out of
>>>>>      the routine:
>>>>>
>>>>>                  if (!test_and_clear_bit(MD_RECOVERY_NEEDED,&mddev->recovery) ||
>>>>>                      test_bit(MD_RECOVERY_FROZEN,&mddev->recovery))
>>>>>                          goto unlock;
>>>>>
>>>>>      the next time around, MD_RECOVERY_NEEDED will have been cleared, so
>>>>>      all future tests will return 0 and the negation will always take the
>>>>>      early exit path.
>>>>>
>>>>>      Back in md_set_readonly(), it may notice that the MD is still in use,
>>>>>      so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, without
>>>>>      setting mddev->ro.  But the damage has been done as conditions have
>>>>>      been set such that md_check_recovery() will never call
>>>>>      remove_and_add_spares().
>>>>>
>>>>> This would also explain why an "idle" sync_action clears the wedge: it
>>>>> sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue executing
>>>>> to remove_and_add_spares().
>>>>>
>>>>> As far as I can tell, this is what is happening to prevent the "remove"
>>>>> write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
>>>>> certainly a lot of little bit-states between disk removal, UDEV mdadm, and
>>>>> various MD kernel threads, so apologies if I missed an important
>>>>> transition.
>>>>>
>>>>> Would you consider writing "idle" to the MD array sync_action file as a
>>>>> safe and reasonable intermediate workaround step for our script?
>>>>>
>>>>> And of course, any suggestions to whether this is intended behavior (ie,
>>>>> the removed component disk is failed, but stuck in the array)?
>>>>>
>>>>> This is fairly easy for us to reproduce with multiple MD arrays per disk
>>>>> (one per partition) and interrupting a raid check on all of them
>>>>> (especially when they are delayed waiting for the first to finish) by
>>>>> removing the component disk via sysfs PCI removal.  We can provide
>>>>> additional debug or testing if required.
>>>>>
>>>> Hi Joe,
>>>>   thanks for the details analysis!!
>>>>
>>>> I think the correct fix would be that MD_RECOVERY_NEEDED should be set after
>>>> clearing MD_RECOVERY_FROZEN, like the patch below.
>>>> Can you confirm that it works for you?
>>>>
>>>> Writing 'idle' should in general be safe, so that could be used as an interim.
>>>>
>>>> Thanks,
>>>> NeilBrown
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index c03d87b6890a..2c73fcb82593 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
>>>>   		printk("md: %s still in use.\n",mdname(mddev));
>>>>   		if (did_freeze) {
>>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
>>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
>>>>   			md_wakeup_thread(mddev->thread);
>>>>   		}
>>>>   		err = -EBUSY;
>>>> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
>>>>   		mddev->ro = 1;
>>>>   		set_disk_ro(mddev->gendisk, 1);
>>>>   		clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
>>>> +		set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
>>>> +		md_wakeup_thread(mddev->thread);
>>>>   		sysfs_notify_dirent_safe(mddev->sysfs_state);
>>>>   		err = 0;
>>>>   	}
>>>> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
>>>>   		mutex_unlock(&mddev->open_mutex);
>>>>   		if (did_freeze) {
>>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
>>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
>>>>   			md_wakeup_thread(mddev->thread);
>>>>   		}
>>>>   		return -EBUSY;
>>> Hi Neil,
>>>
>>> In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
>>> pulled disk from the MD array.  This is okay for our intentions, but I
>>> wanted to make sure that it's okay to skip any failed-but-not-removed
>>> state.
>>>
>>> Tested-by: Joe Lawrence<joe.lawrence@stratus.com>
>>>
>>> and should this have a
>>>
>>> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes")
>>>
>>> tag to mark for stable?
>>
>> Hi Neil,
>>
>> Would you like me to write up a proper patch, or is this one in the queue?
>>
> Several times over the last week I've thought that I should probably push
> that patch along ... but each time something else seemed more interesting.
> But it's a new week now.  I've just posted a pull request.
>
> Thanks for the prompt (and the report and testing of course).
>
> NeilBrown
Hi Neil and Joe

     Any update on this?  I tried with 3.18.2 and the problem still
exists.

    When it tries to remove the failed disk, it finds that the Blocked flag
in rdev->flags is set, so it can't remove the disk.  Is this the right
reason?

Best Regards
Xiao


* Re: RAID1 removing failed disk returns EBUSY
  2015-01-14 12:41         ` XiaoNi
@ 2015-01-15 13:22           ` Joe Lawrence
  2015-01-16  5:20             ` Xiao Ni
From: Joe Lawrence @ 2015-01-15 13:22 UTC (permalink / raw)
  To: XiaoNi; +Cc: NeilBrown, linux-raid, Bill Kuzeja

On Wed, 14 Jan 2015 20:41:16 +0800
XiaoNi <xni@redhat.com> wrote:

> On 11/17/2014 07:03 AM, NeilBrown wrote:
> > On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence<joe.lawrence@stratus.com>
> > wrote:
> >
> >> On Wed, 29 Oct 2014 13:36:04 -0400
> >> Joe Lawrence<joe.lawrence@stratus.com>  wrote:
> >>
> >>> On Wed, 29 Oct 2014 08:41:13 +1100
> >>> NeilBrown<neilb@suse.de>  wrote:
> >>>
> >>>> On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence<joe.lawrence@stratus.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Neil,
> >>>>>
> >>>>> We've encountered changes in MD and mdadm that have broken our automated
> >>>>> disk removal script.  In the past, we've been able to run the following
> >>>>> after a RAID1 disk component removal:
> >>>>>
> >>>>> % echo fail>  /sys/block/md3/md/dev-sdr5/state
> >>>>> % echo remove>  /sys/block/md3/md/dev-sdr5/state
> >>>>>
> >>>>> However, the latest RHEL6.6 code drop has rebased to sufficiently recent
> >>>>> MD kernel and mdadm changes, in which the previous commands occasionally
> >>>>> fail like so:
> >>>>>
> >>>>> * MD array is usually resyncing or checking
> >>>>> * Component disk /dev/sdr removed via HBA sysfs PCI removal
> >>>>> * Following UDEV rule fires:
> >>>>>
> >>>>> SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
> >>>>>          RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> >>>>>
> >>>>> % mdadm --detail /dev/md3
> >>>>> /dev/md3:
> >>>>>          Version : 1.1
> >>>>>    Creation Time : Tue Oct 14 17:31:59 2014
> >>>>>       Raid Level : raid1
> >>>>>       Array Size : 25149440 (23.98 GiB 25.75 GB)
> >>>>>    Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
> >>>>>     Raid Devices : 2
> >>>>>    Total Devices : 2
> >>>>>      Persistence : Superblock is persistent
> >>>>>
> >>>>>    Intent Bitmap : Internal
> >>>>>
> >>>>>      Update Time : Wed Oct 15 14:22:34 2014
> >>>>>            State : active, degraded
> >>>>>   Active Devices : 1
> >>>>> Working Devices : 1
> >>>>>   Failed Devices : 1
> >>>>>    Spare Devices : 0
> >>>>>
> >>>>>             Name : localhost.localdomain:3
> >>>>>             UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
> >>>>>           Events : 142
> >>>>>
> >>>>>      Number   Major   Minor   RaidDevice State
> >>>>>         0      65       21        0      faulty
> >>>>>         1      65        5        1      active sync   /dev/sdj5
> >>>>>
> >>>>> All attempts to remove this device fail:
> >>>>>
> >>>>> % echo remove>  /sys/block/md3/md/dev-sdr5/state
> >>>>> -bash: echo: write error: Device or resource busy
> >>>>>
> >>>>> This can be traced to state_store():
> >>>>>
> >>>>>          } else if (cmd_match(buf, "remove")) {
> >>>>>                  if (rdev->raid_disk>= 0)
> >>>>>                          err = -EBUSY;
> >>>>>
> >>>>> After much debugging and systemtapping, I think I've figured out that the
> >>>>> sysfs scripting may fail after the following combination of changes:
> >>>>>
> >>>>> mdadm  8af530b07fce "Enhance incremental removal."
> >>>>> kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has unack
> >>>>>                       badblocks during md_stop_writes"
> >>>>>
> >>>>> With these two changes:
> >>>>>
> >>>>> 1 - On the user side, mdadm is trying to set the array_state to read-auto
> >>>>>      on incremental removal (as invoked by UDEV rule).
> >>>>>
> >>>>> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag,
> >>>>>      wake up the mddev->thread and if there is a sync_thread, it will set
> >>>>>      MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.
> >>>>>
> >>>>>      When md_check_recovery() gets a chance to run as part of the
> >>>>>      raid1d() mddev->thread, it may or may not ever get to
> >>>>>      an invocation of remove_and_add_spares(), for there are but *many*
> >>>>>      conditional early exits along the way -- for example, if
> >>>>>      MD_RECOVERY_FROZEN is set, the following condition will bounce out of
> >>>>>      the routine:
> >>>>>
> >>>>>                  if (!test_and_clear_bit(MD_RECOVERY_NEEDED,&mddev->recovery) ||
> >>>>>                      test_bit(MD_RECOVERY_FROZEN,&mddev->recovery))
> >>>>>                          goto unlock;
> >>>>>
> >>>>>      the next time around, MD_RECOVERY_NEEDED will have been cleared, so
> >>>>>      all future tests will return 0 and the negation will always take the
> >>>>>      early exit path.
> >>>>>
> >>>>>      Back in md_set_readonly(), it may notice that the MD is still in use,
> >>>>>      so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY, without
> >>>>>      setting mddev->ro.  But the damage has been done as conditions have
> >>>>>      been set such that md_check_recovery() will never call
> >>>>>      remove_and_add_spares().
> >>>>>
> >>>>> This would also explain why an "idle" sync_action clears the wedge: it
> >>>>> sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue executing
> >>>>> to remove_and_add_spares().
> >>>>>
> >>>>> As far as I can tell, this is what is happening to prevent the "remove"
> >>>>> write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> >>>>> certainly a lot of little bit-states between disk removal, UDEV mdadm, and
> >>>>> various MD kernel threads, so apologies if I missed an important
> >>>>> transition.
> >>>>>
> >>>>> Would you consider writing "idle" to the MD array sync_action file as a
> >>>>> safe and reasonable intermediate workaround step for our script?
> >>>>>
> >>>>> And of course, any suggestions to whether this is intended behavior (ie,
> >>>>> the removed component disk is failed, but stuck in the array)?
> >>>>>
> >>>>> This is fairly easy for us to reproduce with multiple MD arrays per disk
> >>>>> (one per partition) and interrupting a raid check on all of them
> >>>>> (especially when they are delayed waiting for the first to finish) by
> >>>>> removing the component disk via sysfs PCI removal.  We can provide
> >>>>> additional debug or testing if required.
> >>>>>
> >>>> Hi Joe,
> >>>>   thanks for the details analysis!!
> >>>>
> >>>> I think the correct fix would be that MD_RECOVERY_NEEDED should be set after
> >>>> clearing MD_RECOVERY_FROZEN, like the patch below.
> >>>> Can you confirm that it works for you?
> >>>>
> >>>> Writing 'idle' should in general be safe, so that could be used as an interim.
> >>>>
> >>>> Thanks,
> >>>> NeilBrown
> >>>>
> >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >>>> index c03d87b6890a..2c73fcb82593 100644
> >>>> --- a/drivers/md/md.c
> >>>> +++ b/drivers/md/md.c
> >>>> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> >>>>   		printk("md: %s still in use.\n",mdname(mddev));
> >>>>   		if (did_freeze) {
> >>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> >>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> >>>>   			md_wakeup_thread(mddev->thread);
> >>>>   		}
> >>>>   		err = -EBUSY;
> >>>> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> >>>>   		mddev->ro = 1;
> >>>>   		set_disk_ro(mddev->gendisk, 1);
> >>>>   		clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> >>>> +		set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> >>>> +		md_wakeup_thread(mddev->thread);
> >>>>   		sysfs_notify_dirent_safe(mddev->sysfs_state);
> >>>>   		err = 0;
> >>>>   	}
> >>>> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
> >>>>   		mutex_unlock(&mddev->open_mutex);
> >>>>   		if (did_freeze) {
> >>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> >>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> >>>>   			md_wakeup_thread(mddev->thread);
> >>>>   		}
> >>>>   		return -EBUSY;
> >>> Hi Neil,
> >>>
> >>> In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
> >>> pulled disk from the MD array.  This is okay for our intentions, but I
> >>> wanted to make sure that it's okay to skip any failed-but-not-removed
> >>> state.
> >>>
> >>> Tested-by: Joe Lawrence<joe.lawrence@stratus.com>
> >>>
> >>> and should this have a
> >>>
> >>> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes")
> >>>
> >>> tag to mark for stable?
> >>
> >> Hi Neil,
> >>
> >> Would you like me to write up a proper patch, or is this one in the queue?
> >>
> > Several times over the last week I've thought that I should probably push
> > that patch along ... but each time something else seemed more interesting.
> > But it's a new week now.  I've just posted a pull request.
> >
> > Thanks for the prompt (and the report and testing of course).
> >
> > NeilBrown
> Hi Neil and Joe
> 
>      Any update for this? I tried this with 3.18.2 and the problem still 
> exists.
> 
>     When it tried to remove the failed disk. it find the Blocked flag in 
> rdev->flags is
> set. So it can't remove the disk. Is this the right reason?

Hi Xiao,

It's been a while since I've looked at this patch, but it looks like it
made it into 3.18, so it should be present on 3.18.2.

What version of mdadm are you running? 

Does writing an "idle" sync_action clear this condition?
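
If it is indeed the Blocked flag, the rdev state file should list "blocked",
and manually clearing it would be a useful data point -- something along
these lines (using the device names from my original report):

% cat /sys/block/md3/md/dev-sdr5/state
% echo -blocked > /sys/block/md3/md/dev-sdr5/state
% echo remove > /sys/block/md3/md/dev-sdr5/state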

Regards,

-- Joe



* Re: RAID1 removing failed disk returns EBUSY
  2015-01-15 13:22           ` Joe Lawrence
@ 2015-01-16  5:20             ` Xiao Ni
  2015-01-16 15:10               ` Joe Lawrence
From: Xiao Ni @ 2015-01-16  5:20 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja



----- Original Message -----
> From: "Joe Lawrence" <joe.lawrence@stratus.com>
> To: "XiaoNi" <xni@redhat.com>
> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> Sent: Thursday, January 15, 2015 9:22:10 PM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> On Wed, 14 Jan 2015 20:41:16 +0800
> XiaoNi <xni@redhat.com> wrote:
> 
> > On 11/17/2014 07:03 AM, NeilBrown wrote:
> > > On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence<joe.lawrence@stratus.com>
> > > wrote:
> > >
> > >> On Wed, 29 Oct 2014 13:36:04 -0400
> > >> Joe Lawrence<joe.lawrence@stratus.com>  wrote:
> > >>
> > >>> On Wed, 29 Oct 2014 08:41:13 +1100
> > >>> NeilBrown<neilb@suse.de>  wrote:
> > >>>
> > >>>> On Mon, 27 Oct 2014 16:27:48 -0400 Joe
> > >>>> Lawrence<joe.lawrence@stratus.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Neil,
> > >>>>>
> > >>>>> We've encountered changes in MD and mdadm that have broken our
> > >>>>> automated
> > >>>>> disk removal script.  In the past, we've been able to run the
> > >>>>> following
> > >>>>> after a RAID1 disk component removal:
> > >>>>>
> > >>>>> % echo fail>  /sys/block/md3/md/dev-sdr5/state
> > >>>>> % echo remove>  /sys/block/md3/md/dev-sdr5/state
> > >>>>>
> > >>>>> However, the latest RHEL6.6 code drop has rebased to sufficiently
> > >>>>> recent
> > >>>>> MD kernel and mdadm changes, in which the previous commands
> > >>>>> occasionally
> > >>>>> fail like so:
> > >>>>>
> > >>>>> * MD array is usually resyncing or checking
> > >>>>> * Component disk /dev/sdr removed via HBA sysfs PCI removal
> > >>>>> * Following UDEV rule fires:
> > >>>>>
> > >>>>> SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
> > >>>>>          RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> > >>>>>
> > >>>>> % mdadm --detail /dev/md3
> > >>>>> /dev/md3:
> > >>>>>          Version : 1.1
> > >>>>>    Creation Time : Tue Oct 14 17:31:59 2014
> > >>>>>       Raid Level : raid1
> > >>>>>       Array Size : 25149440 (23.98 GiB 25.75 GB)
> > >>>>>    Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
> > >>>>>     Raid Devices : 2
> > >>>>>    Total Devices : 2
> > >>>>>      Persistence : Superblock is persistent
> > >>>>>
> > >>>>>    Intent Bitmap : Internal
> > >>>>>
> > >>>>>      Update Time : Wed Oct 15 14:22:34 2014
> > >>>>>            State : active, degraded
> > >>>>>   Active Devices : 1
> > >>>>> Working Devices : 1
> > >>>>>   Failed Devices : 1
> > >>>>>    Spare Devices : 0
> > >>>>>
> > >>>>>             Name : localhost.localdomain:3
> > >>>>>             UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
> > >>>>>           Events : 142
> > >>>>>
> > >>>>>      Number   Major   Minor   RaidDevice State
> > >>>>>         0      65       21        0      faulty
> > >>>>>         1      65        5        1      active sync   /dev/sdj5
> > >>>>>
> > >>>>> All attempts to remove this device fail:
> > >>>>>
> > >>>>> % echo remove>  /sys/block/md3/md/dev-sdr5/state
> > >>>>> -bash: echo: write error: Device or resource busy
> > >>>>>
> > >>>>> This can be traced to state_store():
> > >>>>>
> > >>>>>          } else if (cmd_match(buf, "remove")) {
> > >>>>>                  if (rdev->raid_disk>= 0)
> > >>>>>                          err = -EBUSY;
> > >>>>>
> > >>>>> After much debugging and systemtapping, I think I've figured out that
> > >>>>> the
> > >>>>> sysfs scripting may fail after the following combination of changes:
> > >>>>>
> > >>>>> mdadm  8af530b07fce "Enhance incremental removal."
> > >>>>> kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has
> > >>>>> unack
> > >>>>>                       badblocks during md_stop_writes"
> > >>>>>
> > >>>>> With these two changes:
> > >>>>>
> > >>>>> 1 - On the user side, mdadm is trying to set the array_state to
> > >>>>> read-auto
> > >>>>>      on incremental removal (as invoked by UDEV rule).
> > >>>>>
> > >>>>> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN
> > >>>>> flag,
> > >>>>>      wake up the mddev->thread and if there is a sync_thread, it will
> > >>>>>      set
> > >>>>>      MD_RECOVERY_INTR and then wait until the sync_thread is set to
> > >>>>>      NULL.
> > >>>>>
> > >>>>>      When md_check_recovery() gets a chance to run as part of the
> > >>>>>      raid1d() mddev->thread, it may or may not ever get to
> > >>>>>      an invocation of remove_and_add_spares(), for there are but
> > >>>>>      *many*
> > >>>>>      conditional early exits along the way -- for example, if
> > >>>>>      MD_RECOVERY_FROZEN is set, the following condition will bounce
> > >>>>>      out of
> > >>>>>      the routine:
> > >>>>>
> > >>>>>                  if
> > >>>>>                  (!test_and_clear_bit(MD_RECOVERY_NEEDED,&mddev->recovery)
> > >>>>>                  ||
> > >>>>>                      test_bit(MD_RECOVERY_FROZEN,&mddev->recovery))
> > >>>>>                          goto unlock;
> > >>>>>
> > >>>>>      the next time around, MD_RECOVERY_NEEDED will have been cleared,
> > >>>>>      so
> > >>>>>      all future tests will return 0 and the negation will always take
> > >>>>>      the
> > >>>>>      early exit path.
> > >>>>>
> > >>>>>      Back in md_set_readonly(), it may notice that the MD is still in
> > >>>>>      use,
> > >>>>>      so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY,
> > >>>>>      without
> > >>>>>      setting mddev->ro.  But the damage has been done as conditions
> > >>>>>      have
> > >>>>>      been set such that md_check_recovery() will never call
> > >>>>>      remove_and_add_spares().
> > >>>>>
> > >>>>> This would also explain why an "idle" sync_action clears the wedge:
> > >>>>> it
> > >>>>> sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue
> > >>>>> executing
> > >>>>> to remove_and_add_spares().
> > >>>>>
> > >>>>> As far as I can tell, this is what is happening to prevent the
> > >>>>> "remove"
> > >>>>> write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> > >>>>> certainly a lot of little bit-states between disk removal, UDEV
> > >>>>> mdadm, and
> > >>>>> various MD kernel threads, so apologies if I missed an important
> > >>>>> transition.
> > >>>>>
> > >>>>> Would you consider writing "idle" to the MD array sync_action file as
> > >>>>> a
> > >>>>> safe and reasonable intermediate workaround step for our script?
> > >>>>>
> > >>>>> And of course, any suggestions to whether this is intended behavior
> > >>>>> (ie,
> > >>>>> the removed component disk is failed, but stuck in the array)?
> > >>>>>
> > >>>>> This is fairly easy for us to reproduce with multiple MD arrays per
> > >>>>> disk
> > >>>>> (one per partition) and interrupting a raid check on all of them
> > >>>>> (especially when they are delayed waiting for the first to finish) by
> > >>>>> removing the component disk via sysfs PCI removal.  We can provide
> > >>>>> additional debug or testing if required.
> > >>>>>
> > >>>> Hi Joe,
> > >>>>   thanks for the details analysis!!
> > >>>>
> > >>>> I think the correct fix would be that MD_RECOVERY_NEEDED should be set
> > >>>> after
> > >>>> clearing MD_RECOVERY_FROZEN, like the patch below.
> > >>>> Can you confirm that it works for you?
> > >>>>
> > >>>> Writing 'idle' should in general be safe, so that could be used as an
> > >>>> interim.
> > >>>>
> > >>>> Thanks,
> > >>>> NeilBrown
> > >>>>
> > >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >>>> index c03d87b6890a..2c73fcb82593 100644
> > >>>> --- a/drivers/md/md.c
> > >>>> +++ b/drivers/md/md.c
> > >>>> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev,
> > >>>> struct block_device *bdev)
> > >>>>   		printk("md: %s still in use.\n",mdname(mddev));
> > >>>>   		if (did_freeze) {
> > >>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> > >>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> > >>>>   			md_wakeup_thread(mddev->thread);
> > >>>>   		}
> > >>>>   		err = -EBUSY;
> > >>>> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev,
> > >>>> struct block_device *bdev)
> > >>>>   		mddev->ro = 1;
> > >>>>   		set_disk_ro(mddev->gendisk, 1);
> > >>>>   		clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> > >>>> +		set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> > >>>> +		md_wakeup_thread(mddev->thread);
> > >>>>   		sysfs_notify_dirent_safe(mddev->sysfs_state);
> > >>>>   		err = 0;
> > >>>>   	}
> > >>>> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int
> > >>>> mode,
> > >>>>   		mutex_unlock(&mddev->open_mutex);
> > >>>>   		if (did_freeze) {
> > >>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> > >>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> > >>>>   			md_wakeup_thread(mddev->thread);
> > >>>>   		}
> > >>>>   		return -EBUSY;
> > >>> Hi Neil,
> > >>>
> > >>> In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
> > >>> pulled disk from the MD array.  This is okay for our intentions, but I
> > >>> wanted to make sure that it's okay to skip any failed-but-not-removed
> > >>> state.
> > >>>
> > >>> Tested-by: Joe Lawrence<joe.lawrence@stratus.com>
> > >>>
> > >>> and should this have a
> > >>>
> > >>> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has
> > >>> unack badblocks during md_stop_writes")
> > >>>
> > >>> tag to mark for stable?
> > >>
> > >> Hi Neil,
> > >>
> > >> Would you like me to write up a proper patch, or is this one in the
> > >> queue?
> > >>
> > > Several times over the last week I've thought that I should probably push
> > > that patch along ... but each time something else seemed more
> > > interesting.
> > > But it's a new week now.  I've just posted a pull request.
> > >
> > > Thanks for the prompt (and the report and testing of course).
> > >
> > > NeilBrown
> > Hi Neil and Joe
> > 
> >      Any update for this? I tried this with 3.18.2 and the problem still
> > exists.
> > 
> >     When it tried to remove the failed disk. it find the Blocked flag in
> > rdev->flags is
> > set. So it can't remove the disk. Is this the right reason?
> 
> Hi Xiao,
> 
> It's been a while since I've looked at this patch, but it looks like it
> made it into 3.18, so it should be present on 3.18.2.
> 
> What version of mdadm are you running?
> 
> Does writing an "idle" sync_action clear this condition?
> 
> Regards,
> 
> -- Joe
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Hi Joe

   Thanks for reminding me. I hadn't done that. Now the disk can be removed successfully
after writing "idle" to sync_action.

   I wrongly thought that the patch referenced in this mail had fixed the problem.

Best Regards
Xiao


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-16  5:20             ` Xiao Ni
@ 2015-01-16 15:10               ` Joe Lawrence
  2015-01-19  2:33                 ` Xiao Ni
  0 siblings, 1 reply; 22+ messages in thread
From: Joe Lawrence @ 2015-01-16 15:10 UTC (permalink / raw)
  To: Xiao Ni; +Cc: NeilBrown, linux-raid, Bill Kuzeja

On Fri, 16 Jan 2015 00:20:12 -0500
Xiao Ni <xni@redhat.com> wrote:
> 
> Hi Joe
> 
>    Thanks for reminding me. I didn't do that. Now it can remove successfully after writing
> "idle" to sync_action.
> 
>    I thought wrongly that the patch referenced in this mail is fixed for the problem.

So it sounds like even with 3.18 and a new mdadm, this bug still
persists?

-- Joe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-16 15:10               ` Joe Lawrence
@ 2015-01-19  2:33                 ` Xiao Ni
  2015-01-19 17:56                   ` Joe Lawrence
  2015-01-29  3:52                   ` NeilBrown
  0 siblings, 2 replies; 22+ messages in thread
From: Xiao Ni @ 2015-01-19  2:33 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja



----- Original Message -----
> From: "Joe Lawrence" <joe.lawrence@stratus.com>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> Sent: Friday, January 16, 2015 11:10:31 PM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> On Fri, 16 Jan 2015 00:20:12 -0500
> Xiao Ni <xni@redhat.com> wrote:
> > 
> > Hi Joe
> > 
> >    Thanks for reminding me. I didn't do that. Now it can remove
> >    successfully after writing
> > "idle" to sync_action.
> > 
> >    I thought wrongly that the patch referenced in this mail is fixed for
> >    the problem.
> 
> So it sounds like even with 3.18 and a new mdadm, this bug still
> persists?
> 
> -- Joe
> 
> --

Hi Joe

   I'm a little confused now. Does the patch 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
resolve the problem?

   My environment is:

[root@dhcp-12-133 mdadm]# mdadm --version
mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest upstream)
[root@dhcp-12-133 mdadm]# uname -r
3.18.2


   My steps are:

[root@dhcp-12-133 mdadm]# lsblk 
sdb                       8:16   0 931.5G  0 disk 
└─sdb1                    8:17   0     5G  0 part 
sdc                       8:32   0 186.3G  0 disk 
sdd                       8:48   0 931.5G  0 disk 
└─sdd1                    8:49   0     5G  0 part 
[root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 --assume-clean
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

   Then I unplugged the disk.

[root@dhcp-12-133 mdadm]# lsblk 
sdc                       8:32   0 186.3G  0 disk  
sdd                       8:48   0 931.5G  0 disk  
└─sdd1                    8:49   0     5G  0 part  
  └─md0                   9:0    0     5G  0 raid1 
[root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state 
[root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state 
-bash: echo: write error: Device or resource busy
[root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action 
[root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state 


   Now, after I write "idle" to sync_action, the disk can be removed as you said in the mail.
It's a good workaround. Is this OK? 

Best Regards
Xiao


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-19  2:33                 ` Xiao Ni
@ 2015-01-19 17:56                   ` Joe Lawrence
  2015-01-20  7:16                     ` Xiao Ni
  2015-01-29  3:52                   ` NeilBrown
  1 sibling, 1 reply; 22+ messages in thread
From: Joe Lawrence @ 2015-01-19 17:56 UTC (permalink / raw)
  To: Xiao Ni; +Cc: NeilBrown, linux-raid, Bill Kuzeja

On Sun, 18 Jan 2015 21:33:50 -0500
Xiao Ni <xni@redhat.com> wrote:

> 
> 
> ----- Original Message -----
> > From: "Joe Lawrence" <joe.lawrence@stratus.com>
> > To: "Xiao Ni" <xni@redhat.com>
> > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> > Sent: Friday, January 16, 2015 11:10:31 PM
> > Subject: Re: RAID1 removing failed disk returns EBUSY
> > 
> > On Fri, 16 Jan 2015 00:20:12 -0500
> > Xiao Ni <xni@redhat.com> wrote:
> > > 
> > > Hi Joe
> > > 
> > >    Thanks for reminding me. I didn't do that. Now it can remove
> > >    successfully after writing
> > > "idle" to sync_action.
> > > 
> > >    I thought wrongly that the patch referenced in this mail is fixed for
> > >    the problem.
> > 
> > So it sounds like even with 3.18 and a new mdadm, this bug still
> > persists?
> > 
> > -- Joe
> > 
> > --
> 
> Hi Joe
> 
>    I'm a little confused now. Does the patch 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
> resolve the problem?
> 
>    My environment is:
> 
> [root@dhcp-12-133 mdadm]# mdadm --version
> mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest upstream)
> [root@dhcp-12-133 mdadm]# uname -r
> 3.18.2
> 
> 
>    My steps are:
> 
> [root@dhcp-12-133 mdadm]# lsblk 
> sdb                       8:16   0 931.5G  0 disk 
> └─sdb1                    8:17   0     5G  0 part 
> sdc                       8:32   0 186.3G  0 disk 
> sdd                       8:48   0 931.5G  0 disk 
> └─sdd1                    8:49   0     5G  0 part 
> [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 --assume-clean
> mdadm: Note: this array has metadata at the start and
>     may not be suitable as a boot device.  If you plan to
>     store '/boot' on this device please ensure that
>     your boot-loader understands md/v1.x metadata, or use
>     --metadata=0.90
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md0 started.
> 
>    Then I unplug the disk.
> 
> [root@dhcp-12-133 mdadm]# lsblk 
> sdc                       8:32   0 186.3G  0 disk  
> sdd                       8:48   0 931.5G  0 disk  
> └─sdd1                    8:49   0     5G  0 part  
>   └─md0                   9:0    0     5G  0 raid1 
> [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state 
> [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state 
> -bash: echo: write error: Device or resource busy
> [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action 
> [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state 
> 
> 
>    Now after I set idle to sync_action, it can be removed as you said in the mail.
> It's a good workaround. Is this OK? 
> 
> Best Regards
> Xiao

Hi Xiao,

According to my notes, the "idle" sync_action was always a viable
workaround, with or without this change.

Neil's patch should have made it possible to issue only a
"faulty" and "remove" to remove the RAID component.

I don't have an exact version, but it appears that my mdadm was built
from an upstream git snapshot from the Oct 27th timeframe.
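
For reference, a minimal (untested) sketch of the removal sequence this
thread converged on; the array and component names below are placeholders,
and the "idle" write is only needed as a workaround on kernels without
Neil's fix:

    #!/bin/sh
    # Fail a RAID1 component and remove it, falling back to writing
    # "idle" to sync_action if the remove returns EBUSY.
    dev_state=/sys/block/md0/md/dev-sdX1/state
    sync_action=/sys/block/md0/md/sync_action

    echo faulty > "$dev_state"
    if ! echo remove > "$dev_state" 2>/dev/null; then
            # kick the md thread so remove_and_add_spares() gets another chance
            echo idle > "$sync_action"
            echo remove > "$dev_state"
    fi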

-- Joe

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-19 17:56                   ` Joe Lawrence
@ 2015-01-20  7:16                     ` Xiao Ni
  2015-01-23 15:11                       ` Joe Lawrence
  0 siblings, 1 reply; 22+ messages in thread
From: Xiao Ni @ 2015-01-20  7:16 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja



----- Original Message -----
> From: "Joe Lawrence" <joe.lawrence@stratus.com>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> Sent: Tuesday, January 20, 2015 1:56:50 AM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> On Sun, 18 Jan 2015 21:33:50 -0500
> Xiao Ni <xni@redhat.com> wrote:
> 
> > 
> > 
> > ----- Original Message -----
> > > From: "Joe Lawrence" <joe.lawrence@stratus.com>
> > > To: "Xiao Ni" <xni@redhat.com>
> > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill
> > > Kuzeja" <william.kuzeja@stratus.com>
> > > Sent: Friday, January 16, 2015 11:10:31 PM
> > > Subject: Re: RAID1 removing failed disk returns EBUSY
> > > 
> > > On Fri, 16 Jan 2015 00:20:12 -0500
> > > Xiao Ni <xni@redhat.com> wrote:
> > > > 
> > > > Hi Joe
> > > > 
> > > >    Thanks for reminding me. I didn't do that. Now it can remove
> > > >    successfully after writing
> > > > "idle" to sync_action.
> > > > 
> > > >    I thought wrongly that the patch referenced in this mail is fixed
> > > >    for
> > > >    the problem.
> > > 
> > > So it sounds like even with 3.18 and a new mdadm, this bug still
> > > persists?
> > > 
> > > -- Joe
> > > 
> > > --
> > 
> > Hi Joe
> > 
> >    I'm a little confused now. Does the patch
> >    45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
> > resolve the problem?
> > 
> >    My environment is:
> > 
> > [root@dhcp-12-133 mdadm]# mdadm --version
> > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest
> > upstream)
> > [root@dhcp-12-133 mdadm]# uname -r
> > 3.18.2
> > 
> > 
> >    My steps are:
> > 
> > [root@dhcp-12-133 mdadm]# lsblk
> > sdb                       8:16   0 931.5G  0 disk
> > └─sdb1                    8:17   0     5G  0 part
> > sdc                       8:32   0 186.3G  0 disk
> > sdd                       8:48   0 931.5G  0 disk
> > └─sdd1                    8:49   0     5G  0 part
> > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1
> > --assume-clean
> > mdadm: Note: this array has metadata at the start and
> >     may not be suitable as a boot device.  If you plan to
> >     store '/boot' on this device please ensure that
> >     your boot-loader understands md/v1.x metadata, or use
> >     --metadata=0.90
> > mdadm: Defaulting to version 1.2 metadata
> > mdadm: array /dev/md0 started.
> > 
> >    Then I unplug the disk.
> > 
> > [root@dhcp-12-133 mdadm]# lsblk
> > sdc                       8:32   0 186.3G  0 disk
> > sdd                       8:48   0 931.5G  0 disk
> > └─sdd1                    8:49   0     5G  0 part
> >   └─md0                   9:0    0     5G  0 raid1
> > [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state
> > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state
> > -bash: echo: write error: Device or resource busy
> > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action
> > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state
> > 
> > 
> >    Now after I set idle to sync_action, it can be removed as you said in
> >    the mail.
> > It's a good workaround. Is this OK?
> > 
> > Best Regards
> > Xiao
> 
> Hi Xiao,
> 
> According to my notes, the "idle" sync_action was always a viable
> workaround, with or with this change.
> 
> Neil's patch should have made it possible to issue only a
> "faulty" and "remove" to remove the RAID component.
> 
> I don't have an exact version, but it appears that my mdadm version was
> an upstream git from Oct 27-th timeframe.
> 
> -- Joe
> 

Joe

   Thanks for the explanation. So echoing "idle" to sync_action is a workaround
even without the patch.

   It looks like the patch is not enough to fix the problem.
Have you tried the new patch? Does the problem still exist in
your environment?

   If your environment has no problem, can you give me the version numbers? I'll
try with the same versions too.

Best Regards
Xiao
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-20  7:16                     ` Xiao Ni
@ 2015-01-23 15:11                       ` Joe Lawrence
  2015-01-30  2:19                         ` Xiao Ni
  0 siblings, 1 reply; 22+ messages in thread
From: Joe Lawrence @ 2015-01-23 15:11 UTC (permalink / raw)
  To: Xiao Ni; +Cc: NeilBrown, linux-raid, Bill Kuzeja

On Tue, 20 Jan 2015 02:16:46 -0500
Xiao Ni <xni@redhat.com> wrote:
> Joe
> 
>    Thanks for the explanation. So echo "idle" to sync_action is a workaround
> without the patch. 
>  
>    It looks like the patch is not enough to fix the problem. 
> Do you have a try with the new patch? Is the problem still exist in
> your environment?
> 
>    If your environment have no problem, can you give me the version number? I'll 
> have a try with the same version too. 

Hi Xiao,

Bill and I did some more testing yesterday and I think we've figured
out the confusion.  Running a 3.18+ kernel and an upstream mdadm, it
was the udev invocation of "mdadm -If <dev>" that was automatically
removing the device for us.

If we ran with an older mdadm and got the MD wedged in the faulty
condition, then nothing we echoed into the sysfs state file ('idle',
'fail', or 'remove') would change anything.  I think this agrees with
your testing report.

So two things:

1 - Did you make / make install the latest mdadm and see it try to run
mdadm -If on the removed disk?  (You could also try manually running
it.)

2 - I think the sysfs interface to the removed disks is still broken in
cases where (1) doesn't occur.

Thanks,

-- Joe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-19  2:33                 ` Xiao Ni
  2015-01-19 17:56                   ` Joe Lawrence
@ 2015-01-29  3:52                   ` NeilBrown
  2015-01-29 12:14                     ` Xiao Ni
  1 sibling, 1 reply; 22+ messages in thread
From: NeilBrown @ 2015-01-29  3:52 UTC (permalink / raw)
  To: Xiao Ni; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja

[-- Attachment #1: Type: text/plain, Size: 5752 bytes --]

On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:

> 
> 
> ----- Original Message -----
> > From: "Joe Lawrence" <joe.lawrence@stratus.com>
> > To: "Xiao Ni" <xni@redhat.com>
> > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> > Sent: Friday, January 16, 2015 11:10:31 PM
> > Subject: Re: RAID1 removing failed disk returns EBUSY
> > 
> > On Fri, 16 Jan 2015 00:20:12 -0500
> > Xiao Ni <xni@redhat.com> wrote:
> > > 
> > > Hi Joe
> > > 
> > >    Thanks for reminding me. I didn't do that. Now it can remove
> > >    successfully after writing
> > > "idle" to sync_action.
> > > 
> > >    I thought wrongly that the patch referenced in this mail is fixed for
> > >    the problem.
> > 
> > So it sounds like even with 3.18 and a new mdadm, this bug still
> > persists?
> > 
> > -- Joe
> > 
> > --
> 
> Hi Joe
> 
>    I'm a little confused now. Does the patch 45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
> resolve the problem?
> 
>    My environment is:
> 
> [root@dhcp-12-133 mdadm]# mdadm --version
> mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest upstream)
> [root@dhcp-12-133 mdadm]# uname -r
> 3.18.2
> 
> 
>    My steps are:
> 
> [root@dhcp-12-133 mdadm]# lsblk 
> sdb                       8:16   0 931.5G  0 disk 
> └─sdb1                    8:17   0     5G  0 part 
> sdc                       8:32   0 186.3G  0 disk 
> sdd                       8:48   0 931.5G  0 disk 
> └─sdd1                    8:49   0     5G  0 part 
> [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 --assume-clean
> mdadm: Note: this array has metadata at the start and
>     may not be suitable as a boot device.  If you plan to
>     store '/boot' on this device please ensure that
>     your boot-loader understands md/v1.x metadata, or use
>     --metadata=0.90
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md0 started.
> 
>    Then I unplug the disk.
> 
> [root@dhcp-12-133 mdadm]# lsblk 
> sdc                       8:32   0 186.3G  0 disk  
> sdd                       8:48   0 931.5G  0 disk  
> └─sdd1                    8:49   0     5G  0 part  
>   └─md0                   9:0    0     5G  0 raid1 
> [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state 
> [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state 
> -bash: echo: write error: Device or resource busy
> [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action 
> [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state 
> 

I cannot reproduce this - using linux 3.18.2.  I'd be surprised if mdadm
version affects things.

This error (Device or resource busy) implies that rdev->raid_disk is >= 0
(tested in state_store()).

->raid_disk is set to -1 by remove_and_add_spares() providing:
  1/ it isn't Blocked (which is very unlikely)
  2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and
  3/ nr_pending is zero.

So it seems most likely that either:
 1/ nr_pending is non-zero, or
 2/ remove_and_add_spares() didn't run.

nr_pending can only get set if IO is generated, and your sequence of steps
doesn't show any IO.  It is possible that something else (e.g. started by udev)
triggered some IO.  How long that IO can stay pending might depend on exactly
how you unplug the device.
In my tests I used
   echo 1 > /sys/block/sdXX/../../delete
which may have a different effect to what you do.
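
Just to make the comparison concrete, these are the two removal methods
being discussed (the PCI address below is only a placeholder):

    # SCSI-level removal of one disk (equivalent to the 'delete' I used above)
    echo 1 > /sys/block/sdXX/device/delete

    # PCI-level removal of the whole controller/slot via sysfs
    echo 1 > /sys/bus/pci/devices/0000:0b:00.0/remove

They tear the device down through different paths, so any outstanding IO
may be completed or failed at different points.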

However the fact that writing 'idle' to sync_action releases the device seems
to suggest the nr_pending has dropped to zero.  So either
  - remove_and_add_spares didn't run, or
  - remove_and_add_spares ran during a small window when nr_pending was
    elevated, and then didn't run again when nr_pending was reduced to zero.

Ahh.... that rings bells....

I have the following patch in the SLES kernel which I have not applied to
mainline yet (and given how old it is, that is really slack of me).

Can you apply the following and see if the symptom goes away please?

Thanks,
NeilBrown

From: Hannes Reinecke <hare@suse.de>
Date: Thu, 26 Jul 2012 11:12:18 +0200
Subject: [PATCH] md: wakeup thread upon rdev_dec_pending()

After each call to rdev_dec_pending() we should wakeup the
md thread if the device is found to be faulty.
Otherwise we'll incur heavy delays on failing devices.

Signed-off-by: Neil Brown <nfbrown@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 03cec5bdcaae..4cc2f59b2994 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -439,13 +439,6 @@ struct mddev {
 	void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
 };
 
-static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
-{
-	int faulty = test_bit(Faulty, &rdev->flags);
-	if (atomic_dec_and_test(&rdev->nr_pending) && faulty)
-		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-}
-
 static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
 {
 	atomic_add(nr_sectors, &bdev->bd_contains->bd_disk->sync_io);
@@ -624,4 +617,14 @@ static inline int mddev_check_plugged(struct mddev *mddev)
 	return !!blk_check_plugged(md_unplug, mddev,
 				   sizeof(struct blk_plug_cb));
 }
+
+static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
+{
+	int faulty = test_bit(Faulty, &rdev->flags);
+	if (atomic_dec_and_test(&rdev->nr_pending) && faulty) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	}
+}
+
 #endif /* _MD_MD_H */


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-29  3:52                   ` NeilBrown
@ 2015-01-29 12:14                     ` Xiao Ni
  2015-02-02  6:36                       ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread
From: Xiao Ni @ 2015-01-29 12:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja



----- Original Message -----
> From: "NeilBrown" <neilb@suse.de>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> Sent: Thursday, January 29, 2015 11:52:17 AM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:
> 
> > 
> > 
> > ----- Original Message -----
> > > From: "Joe Lawrence" <joe.lawrence@stratus.com>
> > > To: "Xiao Ni" <xni@redhat.com>
> > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill
> > > Kuzeja" <william.kuzeja@stratus.com>
> > > Sent: Friday, January 16, 2015 11:10:31 PM
> > > Subject: Re: RAID1 removing failed disk returns EBUSY
> > > 
> > > On Fri, 16 Jan 2015 00:20:12 -0500
> > > Xiao Ni <xni@redhat.com> wrote:
> > > > 
> > > > Hi Joe
> > > > 
> > > >    Thanks for reminding me. I didn't do that. Now it can remove
> > > >    successfully after writing
> > > > "idle" to sync_action.
> > > > 
> > > >    I thought wrongly that the patch referenced in this mail is fixed
> > > >    for
> > > >    the problem.
> > > 
> > > So it sounds like even with 3.18 and a new mdadm, this bug still
> > > persists?
> > > 
> > > -- Joe
> > > 
> > > --
> > 
> > Hi Joe
> > 
> >    I'm a little confused now. Does the patch
> >    45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
> > resolve the problem?
> > 
> >    My environment is:
> > 
> > [root@dhcp-12-133 mdadm]# mdadm --version
> > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest
> > upstream)
> > [root@dhcp-12-133 mdadm]# uname -r
> > 3.18.2
> > 
> > 
> >    My steps are:
> > 
> > [root@dhcp-12-133 mdadm]# lsblk
> > sdb                       8:16   0 931.5G  0 disk
> > └─sdb1                    8:17   0     5G  0 part
> > sdc                       8:32   0 186.3G  0 disk
> > sdd                       8:48   0 931.5G  0 disk
> > └─sdd1                    8:49   0     5G  0 part
> > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1
> > --assume-clean
> > mdadm: Note: this array has metadata at the start and
> >     may not be suitable as a boot device.  If you plan to
> >     store '/boot' on this device please ensure that
> >     your boot-loader understands md/v1.x metadata, or use
> >     --metadata=0.90
> > mdadm: Defaulting to version 1.2 metadata
> > mdadm: array /dev/md0 started.
> > 
> >    Then I unplug the disk.
> > 
> > [root@dhcp-12-133 mdadm]# lsblk
> > sdc                       8:32   0 186.3G  0 disk
> > sdd                       8:48   0 931.5G  0 disk
> > └─sdd1                    8:49   0     5G  0 part
> >   └─md0                   9:0    0     5G  0 raid1
> > [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state
> > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state
> > -bash: echo: write error: Device or resource busy
> > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action
> > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state
> > 
> 
> I cannot reproduce this - using linux 3.18.2.  I'd be surprised if mdadm
> version affects things.

Hi Neil

   I'm very curious, because it can be reproduced on my machine 100% of the time.

> 
> This error (Device or resoource busy) implies that rdev->raid_disk is >= 0
> (tested in state_store()).
> 
> ->raid_disk is set to -1 by remove_and_add_spares() providing:
>   1/ it isn't Blocked (which is very unlikely)
>   2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and
>   3/ nr_pending is zero.

   I remember I tried to check those reasons. But it really is reason 1,
which is very unlikely.

   I added some debug code to the function array_state_show:

    array_state_show(struct mddev *mddev, char *page) {
        enum array_state st = inactive;
        struct md_rdev *rdev;

        rdev_for_each_rcu(rdev, mddev) {
                printk(KERN_ALERT "search for %s\n", rdev->bdev->bd_disk->disk_name);
                if (test_bit(Blocked, &rdev->flags))
                        printk(KERN_ALERT "rdev is Blocked\n");
                else
                        printk(KERN_ALERT "rdev is not Blocked\n");
        }
        ...

  After I echo 1 > /sys/block/sdc/device/delete, I ran this command:

[root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state 
read-auto
[root@dhcp-12-133 md]# dmesg 
[ 2679.559185] search for sdc
[ 2679.559189] rdev is Blocked
[ 2679.559190] search for sdb
[ 2679.559190] rdev is not Blocked
   
  So sdc is Blocked


> 
> So it seems most likely that either:
>  1/ nr_pending is non-zero, or
>  2/ remove_and_add_spares() didn't run.
> 
> nr_pending can only get set if IO is generated, and your sequence of steps
> don't show any IO.  It is possible that something else (e.g. started by udev)
> triggered some IO.  How long that IO can stay pending might depend on exactly
> how you unplug the device.
> In my tests I used
>    echo 1 > /sys/block/sdXX/../../delete
> which may have a different effect to what you do.
> 
> However the fact that writing 'idle' to sync_action releases the device seems
> to suggest the nr_pending has dropped to zero.  So either
>   - remove_and_add_spares didn't run, or
>   - remove_and_add_spares ran during a small window when nr_pending was
>     elevated, and then didn't run again when nr_pending was reduced to zero.
> 
> Ahh.... that rings bells....
> 
> I have the following patch in the SLES kernel which I have applied to
> mainline yet (and given how old it is, that is really slack of me).
> 
> Can you apply the following and see if the symptom goes away please?

   I have tried the patch, and the problem still exists.
> 
> Thanks,
> NeilBrown
> 
> From: Hannes Reinecke <hare@suse.de>
> Date: Thu, 26 Jul 2012 11:12:18 +0200
> Subject: [PATCH] md: wakeup thread upon rdev_dec_pending()
> 
> After each call to rdev_dec_pending() we should wakeup the
> md thread if the device is found to be faulty.
> Otherwise we'll incur heavy delays on failing devices.
> 
> Signed-off-by: Neil Brown <nfbrown@suse.de>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> 
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 03cec5bdcaae..4cc2f59b2994 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -439,13 +439,6 @@ struct mddev {
>  	void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
>  };
>  
> -static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev
> *mddev)
> -{
> -	int faulty = test_bit(Faulty, &rdev->flags);
> -	if (atomic_dec_and_test(&rdev->nr_pending) && faulty)
> -		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> -}
> -
>  static inline void md_sync_acct(struct block_device *bdev, unsigned long
>  nr_sectors)
>  {
>  	atomic_add(nr_sectors, &bdev->bd_contains->bd_disk->sync_io);
> @@ -624,4 +617,14 @@ static inline int mddev_check_plugged(struct mddev
> *mddev)
>  	return !!blk_check_plugged(md_unplug, mddev,
>  				   sizeof(struct blk_plug_cb));
>  }
> +
> +static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev
> *mddev)
> +{
> +	int faulty = test_bit(Faulty, &rdev->flags);
> +	if (atomic_dec_and_test(&rdev->nr_pending) && faulty) {
> +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +		md_wakeup_thread(mddev->thread);
> +	}
> +}
> +
>  #endif /* _MD_MD_H */
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-23 15:11                       ` Joe Lawrence
@ 2015-01-30  2:19                         ` Xiao Ni
  2015-01-30  4:27                           ` Xiao Ni
  0 siblings, 1 reply; 22+ messages in thread
From: Xiao Ni @ 2015-01-30  2:19 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja



----- Original Message -----
> From: "Joe Lawrence" <joe.lawrence@stratus.com>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> Sent: Friday, January 23, 2015 11:11:29 PM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> On Tue, 20 Jan 2015 02:16:46 -0500
> Xiao Ni <xni@redhat.com> wrote:
> > Joe
> > 
> >    Thanks for the explanation. So echo "idle" to sync_action is a
> >    workaround
> > without the patch.
> >  
> >    It looks like the patch is not enough to fix the problem.
> > Do you have a try with the new patch? Is the problem still exist in
> > your environment?
> > 
> >    If your environment have no problem, can you give me the version number?
> >    I'll
> > have a try with the same version too.
> 
> Hi Xiao,
> 
> Bill and I did some more testing yesterday and I think we've figured
> out the confusion.  Running a 3.18+ kernel and an upstream mdadm, it
> was the udev invocation of "mdadm -If <dev>" that was automatically
> removing the device for us.
> 
> If we ran with an older mdadm and got the MD wedged in the faulty
> condition, then nothing we echoed into the sysfs state file ('idle'
> 'fail' or 'remove')  would change anything.  I think this agrees with
> your testing report.
> 
> So two things:
> 
> 1 - Did you make / make install the latest mdadm and see it try to run
> mdadm -If on the removed disk?  (You could also try manually running
> it.)

  I made sure I have installed the latest mdadm:
  [root@dhcp-12-133 ~]# mdadm --version
  mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014

  That proves it, right?

  Something strange happens when I run mdadm -If:

[root@dhcp-12-133 ~]# mdadm -If sdc
mdadm: sdc does not appear to be a component of any array
[root@dhcp-12-133 ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active (auto-read-only) raid1 sdd1[1] sdc1[0](F)
      5238784 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>

  I unplugged the device manually from the machine. The machine is on my desk.


> 
> 2 - I think the sysfs interface to the removed disks is still broken in
> cases where (1) doesn't occur.
> 
> Thanks,
> 
> -- Joe
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-30  2:19                         ` Xiao Ni
@ 2015-01-30  4:27                           ` Xiao Ni
  0 siblings, 0 replies; 22+ messages in thread
From: Xiao Ni @ 2015-01-30  4:27 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: NeilBrown, linux-raid, Bill Kuzeja



----- Original Message -----
> From: "Xiao Ni" <xni@redhat.com>
> To: "Joe Lawrence" <joe.lawrence@stratus.com>
> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> Sent: Friday, January 30, 2015 10:19:01 AM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> 
> 
> ----- Original Message -----
> > From: "Joe Lawrence" <joe.lawrence@stratus.com>
> > To: "Xiao Ni" <xni@redhat.com>
> > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill Kuzeja"
> > <william.kuzeja@stratus.com>
> > Sent: Friday, January 23, 2015 11:11:29 PM
> > Subject: Re: RAID1 removing failed disk returns EBUSY
> > 
> > On Tue, 20 Jan 2015 02:16:46 -0500
> > Xiao Ni <xni@redhat.com> wrote:
> > > Joe
> > > 
> > >    Thanks for the explanation. So echo "idle" to sync_action is a
> > >    workaround
> > > without the patch.
> > >  
> > >    It looks like the patch is not enough to fix the problem.
> > > Do you have a try with the new patch? Is the problem still exist in
> > > your environment?
> > > 
> > >    If your environment have no problem, can you give me the version
> > >    number?
> > >    I'll
> > > have a try with the same version too.
> > 
> > Hi Xiao,
> > 
> > Bill and I did some more testing yesterday and I think we've figured
> > out the confusion.  Running a 3.18+ kernel and an upstream mdadm, it
> > was the udev invocation of "mdadm -If <dev>" that was automatically
> > removing the device for us.
> > 
> > If we ran with an older mdadm and got the MD wedged in the faulty
> > condition, then nothing we echoed into the sysfs state file ('idle'
> > 'fail' or 'remove')  would change anything.  I think this agrees with
> > your testing report.
> > 
> > So two things:
> > 
> > 1 - Did you make / make install the latest mdadm and see it try to run
> > mdadm -If on the removed disk?  (You could also try manually running
> > it.)
> 
>   I make sure I have install the latest mdadm
>   [root@dhcp-12-133 ~]# mdadm --version
>   mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014
> 
>   It can prove this, right?
> 
>   It's strange when I ran mdadm -If
> 
> [root@dhcp-12-133 ~]# mdadm -If sdc
> mdadm: sdc does not appear to be a component of any array
> [root@dhcp-12-133 ~]# cat /proc/mdstat
> Personalities : [raid1]
> md0 : active (auto-read-only) raid1 sdd1[1] sdc1[0](F)
>       5238784 blocks super 1.2 [2/1] [_U]
>       
> unused devices: <none>
> 
>   I unplug the device manually from the machine. The machine is on my desk.

Hi Joe

   Sorry about that. I typed the command incorrectly:

[root@dhcp-12-133 ~]# mdadm -If sdc1
mdadm: set sdc1 faulty in md0
mdadm: hot remove failed for sdc1: Device or resource busy

> 
> 
> > 
> > 2 - I think the sysfs interface to the removed disks is still broken in
> > cases where (1) doesn't occur.
> > 
> > Thanks,
> > 
> > -- Joe
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-01-29 12:14                     ` Xiao Ni
@ 2015-02-02  6:36                       ` NeilBrown
  2015-02-03  8:10                         ` Xiao Ni
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2015-02-02  6:36 UTC (permalink / raw)
  To: Xiao Ni; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja

[-- Attachment #1: Type: text/plain, Size: 6808 bytes --]

On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:

> 
> 
> ----- Original Message -----
> > From: "NeilBrown" <neilb@suse.de>
> > To: "Xiao Ni" <xni@redhat.com>
> > Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> > Sent: Thursday, January 29, 2015 11:52:17 AM
> > Subject: Re: RAID1 removing failed disk returns EBUSY
> > 
> > On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:
> > 
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Joe Lawrence" <joe.lawrence@stratus.com>
> > > > To: "Xiao Ni" <xni@redhat.com>
> > > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill
> > > > Kuzeja" <william.kuzeja@stratus.com>
> > > > Sent: Friday, January 16, 2015 11:10:31 PM
> > > > Subject: Re: RAID1 removing failed disk returns EBUSY
> > > > 
> > > > On Fri, 16 Jan 2015 00:20:12 -0500
> > > > Xiao Ni <xni@redhat.com> wrote:
> > > > > 
> > > > > Hi Joe
> > > > > 
> > > > >    Thanks for reminding me. I didn't do that. Now it can remove
> > > > >    successfully after writing
> > > > > "idle" to sync_action.
> > > > > 
> > > > >    I thought wrongly that the patch referenced in this mail is fixed
> > > > >    for
> > > > >    the problem.
> > > > 
> > > > So it sounds like even with 3.18 and a new mdadm, this bug still
> > > > persists?
> > > > 
> > > > -- Joe
> > > > 
> > > > --
> > > 
> > > Hi Joe
> > > 
> > >    I'm a little confused now. Does the patch
> > >    45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
> > > resolve the problem?
> > > 
> > >    My environment is:
> > > 
> > > [root@dhcp-12-133 mdadm]# mdadm --version
> > > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest
> > > upstream)
> > > [root@dhcp-12-133 mdadm]# uname -r
> > > 3.18.2
> > > 
> > > 
> > >    My steps are:
> > > 
> > > [root@dhcp-12-133 mdadm]# lsblk
> > > sdb                       8:16   0 931.5G  0 disk
> > > └─sdb1                    8:17   0     5G  0 part
> > > sdc                       8:32   0 186.3G  0 disk
> > > sdd                       8:48   0 931.5G  0 disk
> > > └─sdd1                    8:49   0     5G  0 part
> > > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1
> > > --assume-clean
> > > mdadm: Note: this array has metadata at the start and
> > >     may not be suitable as a boot device.  If you plan to
> > >     store '/boot' on this device please ensure that
> > >     your boot-loader understands md/v1.x metadata, or use
> > >     --metadata=0.90
> > > mdadm: Defaulting to version 1.2 metadata
> > > mdadm: array /dev/md0 started.
> > > 
> > >    Then I unplug the disk.
> > > 
> > > [root@dhcp-12-133 mdadm]# lsblk
> > > sdc                       8:32   0 186.3G  0 disk
> > > sdd                       8:48   0 931.5G  0 disk
> > > └─sdd1                    8:49   0     5G  0 part
> > >   └─md0                   9:0    0     5G  0 raid1
> > > [root@dhcp-12-133 mdadm]# echo faulty > /sys/block/md0/md/dev-sdb1/state
> > > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state
> > > -bash: echo: write error: Device or resource busy
> > > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action
> > > [root@dhcp-12-133 mdadm]# echo remove > /sys/block/md0/md/dev-sdb1/state
> > > 
> > 
> > I cannot reproduce this - using linux 3.18.2.  I'd be surprised if mdadm
> > version affects things.
> 
> Hi Neil
> 
>    I'm very curious, because it can reproduce in my machine 100%.
> 
> > 
> > This error (Device or resoource busy) implies that rdev->raid_disk is >= 0
> > (tested in state_store()).
> > 
> > ->raid_disk is set to -1 by remove_and_add_spares() providing:
> >   1/ it isn't Blocked (which is very unlikely)
> >   2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and
> >   3/ nr_pending is zero.
> 
>    I remember I have tired to check those reasons. But it's really is the reason 1
> which is very unlikely.
> 
>    I add some code in the function array_state_show
> 
>     array_state_show(struct mddev *mddev, char *page) {
>         enum array_state st = inactive;
>         struct md_rdev *rdev;
> 
>         rdev_for_each_rcu(rdev, mddev) {
>                 printk(KERN_ALERT "search for %s\n", rdev->bdev->bd_disk->disk_name);
>                 if (test_bit(Blocked, &rdev->flags))
>                         printk(KERN_ALERT "rdev is Blocked\n");
>                 else
>                         printk(KERN_ALERT "rdev is not Blocked\n");
>     }
> 
>   When I echo 1 > /sys/block/sdc/device/delete, then I ran command:
> 
> [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state 
> read-auto
  ^^^^^^^^^

I think that is half the explanation.
You must have the md_mod.start_ro parameter set to '1'.
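
If you want to double-check, the parameter is visible at runtime (this
assumes md is built as the md_mod module, so it shows up under
/sys/module/md_mod):

    # 1 means newly started arrays come up 'auto-read-only'
    cat /sys/module/md_mod/parameters/start_ro

    # it is typically set on the kernel command line, e.g. md_mod.start_ro=1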


> [root@dhcp-12-133 md]# dmesg 
> [ 2679.559185] search for sdc
> [ 2679.559189] rdev is Blocked
> [ 2679.559190] search for sdb
> [ 2679.559190] rdev is not Blocked
>    
>   So sdc is Blocked

and that is the other half - thanks.
(yes, I was wrong.  Sometimes it is easier than being right, but still
yields results).

When a device fails, it is Blocked until the metadata is updated to record
the failure.  This ensures that no writes succeed without writing to that
device, until we are certain that no read will try reading from that device,
even after a crash/restart.

Blocked is cleared after the metadata is written, but read-auto (and
read-only) arrays never write out their metadata.  So Blocked doesn't get
cleared.

When you "echo idle > .../sync_action" one of the side effects is to switch
from 'read-auto' to fully active.  This allows the metadata to be written,
Blocked to be cleared, and the device to be removed.

If you 
  echo none > /sys/block/md0/md/dev-sdc/slot

first, then the remove will work.
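
For illustration, the whole sequence on your array would look roughly like
this (the exact flag list printed by the per-device state file may differ,
but it should include "blocked" while the device is stuck):

    cat /sys/block/md0/md/array_state               # shows read-auto
    cat /sys/block/md0/md/dev-sdc1/state            # shows faulty,blocked
    echo none > /sys/block/md0/md/dev-sdc1/slot     # detach from its raid slot first
    echo remove > /sys/block/md0/md/dev-sdc1/state  # now succeeds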

We could possibly fix it with something like the following, but I'm not sure
I like it.  There is no guarantee that I can see which would ensure the
superblock got updated before the first write if the array switches to
read/write.

NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9233c71138f1..b3d1e8e5e067 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev *mddev,
 	rdev_for_each(rdev, mddev)
 		if ((this == NULL || rdev == this) &&
 		    rdev->raid_disk >= 0 &&
-		    !test_bit(Blocked, &rdev->flags) &&
+		    (!test_bit(Blocked, &rdev->flags) || mddev->ro) &&
 		    (test_bit(Faulty, &rdev->flags) ||
 		     ! test_bit(In_sync, &rdev->flags)) &&
 		    atomic_read(&rdev->nr_pending)==0) {



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-02-02  6:36                       ` NeilBrown
@ 2015-02-03  8:10                         ` Xiao Ni
  2015-06-10  6:26                           ` XiaoNi
  0 siblings, 1 reply; 22+ messages in thread
From: Xiao Ni @ 2015-02-03  8:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja



----- Original Message -----
> From: "NeilBrown" <neilb@suse.de>
> To: "Xiao Ni" <xni@redhat.com>
> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> Sent: Monday, February 2, 2015 2:36:01 PM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:
> 
> > 
> > 
> > ----- Original Message -----
> > > From: "NeilBrown" <neilb@suse.de>
> > > To: "Xiao Ni" <xni@redhat.com>
> > > Cc: "Joe Lawrence" <joe.lawrence@stratus.com>,
> > > linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> > > Sent: Thursday, January 29, 2015 11:52:17 AM
> > > Subject: Re: RAID1 removing failed disk returns EBUSY
> > > 
> > > On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:
> > > 
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > > From: "Joe Lawrence" <joe.lawrence@stratus.com>
> > > > > To: "Xiao Ni" <xni@redhat.com>
> > > > > Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill
> > > > > Kuzeja" <william.kuzeja@stratus.com>
> > > > > Sent: Friday, January 16, 2015 11:10:31 PM
> > > > > Subject: Re: RAID1 removing failed disk returns EBUSY
> > > > > 
> > > > > On Fri, 16 Jan 2015 00:20:12 -0500
> > > > > Xiao Ni <xni@redhat.com> wrote:
> > > > > > 
> > > > > > Hi Joe
> > > > > > 
> > > > > >    Thanks for reminding me. I didn't do that. Now it can remove
> > > > > >    successfully after writing
> > > > > > "idle" to sync_action.
> > > > > > 
> > > > > >    I thought wrongly that the patch referenced in this mail is
> > > > > >    fixed
> > > > > >    for
> > > > > >    the problem.
> > > > > 
> > > > > So it sounds like even with 3.18 and a new mdadm, this bug still
> > > > > persists?
> > > > > 
> > > > > -- Joe
> > > > > 
> > > > > --
> > > > 
> > > > Hi Joe
> > > > 
> > > >    I'm a little confused now. Does the patch
> > > >    45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
> > > > resolve the problem?
> > > > 
> > > >    My environment is:
> > > > 
> > > > [root@dhcp-12-133 mdadm]# mdadm --version
> > > > mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest
> > > > upstream)
> > > > [root@dhcp-12-133 mdadm]# uname -r
> > > > 3.18.2
> > > > 
> > > > 
> > > >    My steps are:
> > > > 
> > > > [root@dhcp-12-133 mdadm]# lsblk
> > > > sdb                       8:16   0 931.5G  0 disk
> > > > └─sdb1                    8:17   0     5G  0 part
> > > > sdc                       8:32   0 186.3G  0 disk
> > > > sdd                       8:48   0 931.5G  0 disk
> > > > └─sdd1                    8:49   0     5G  0 part
> > > > [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1
> > > > /dev/sdd1
> > > > --assume-clean
> > > > mdadm: Note: this array has metadata at the start and
> > > >     may not be suitable as a boot device.  If you plan to
> > > >     store '/boot' on this device please ensure that
> > > >     your boot-loader understands md/v1.x metadata, or use
> > > >     --metadata=0.90
> > > > mdadm: Defaulting to version 1.2 metadata
> > > > mdadm: array /dev/md0 started.
> > > > 
> > > >    Then I unplug the disk.
> > > > 
> > > > [root@dhcp-12-133 mdadm]# lsblk
> > > > sdc                       8:32   0 186.3G  0 disk
> > > > sdd                       8:48   0 931.5G  0 disk
> > > > └─sdd1                    8:49   0     5G  0 part
> > > >   └─md0                   9:0    0     5G  0 raid1
> > > > [root@dhcp-12-133 mdadm]# echo faulty >
> > > > /sys/block/md0/md/dev-sdb1/state
> > > > [root@dhcp-12-133 mdadm]# echo remove >
> > > > /sys/block/md0/md/dev-sdb1/state
> > > > -bash: echo: write error: Device or resource busy
> > > > [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action
> > > > [root@dhcp-12-133 mdadm]# echo remove >
> > > > /sys/block/md0/md/dev-sdb1/state
> > > > 
> > > 
> > > I cannot reproduce this - using linux 3.18.2.  I'd be surprised if mdadm
> > > version affects things.
> > 
> > Hi Neil
> > 
> >    I'm very curious, because it can reproduce in my machine 100%.
> > 
> > > 
> > > This error (Device or resoource busy) implies that rdev->raid_disk is >=
> > > 0
> > > (tested in state_store()).
> > > 
> > > ->raid_disk is set to -1 by remove_and_add_spares() providing:
> > >   1/ it isn't Blocked (which is very unlikely)
> > >   2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and
> > >   3/ nr_pending is zero.
> > 
> >    I remember I have tired to check those reasons. But it's really is the
> >    reason 1
> > which is very unlikely.
> > 
> >    I add some code in the function array_state_show
> > 
> >     array_state_show(struct mddev *mddev, char *page) {
> >         enum array_state st = inactive;
> >         struct md_rdev *rdev;
> > 
> >         rdev_for_each_rcu(rdev, mddev) {
> >                 printk(KERN_ALERT "search for %s\n",
> >                 rdev->bdev->bd_disk->disk_name);
> >                 if (test_bit(Blocked, &rdev->flags))
> >                         printk(KERN_ALERT "rdev is Blocked\n");
> >                 else
> >                         printk(KERN_ALERT "rdev is not Blocked\n");
> >     }
> > 
> >   When I echo 1 > /sys/block/sdc/device/delete, then I ran command:
> > 
> > [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state
> > read-auto
>   ^^^^^^^^^
> 
> I think that is half the explanation.
> You must have the md_mod.start_ro parameter set to '1'.
> 
> 
> > [root@dhcp-12-133 md]# dmesg
> > [ 2679.559185] search for sdc
> > [ 2679.559189] rdev is Blocked
> > [ 2679.559190] search for sdb
> > [ 2679.559190] rdev is not Blocked
> >    
> >   So sdc is Blocked
> 
> and that is the other half - thanks.
> (yes, I was wrong.  Sometimes it is easier than being right, but still
> yields results).
> 
> When a device fails, it is Blocked until the metadata is updated to record
> the failure.  This ensures that no writes succeed without writing to that
> > device, until we are certain that no read will try reading from that device,
> even after a crash/restart.
> 
> Blocked is cleared after the metadata is written, but read-auto (and
> read-only) devices never write out their metadata.  So blocked doesn't get
> cleared.
> 
> > When you "echo idle > .../sync_action" one of the side effects is to switch
> from 'read-auto' to fully active.  This allows the metadata to be written,
> Blocked to be cleared, and the device to be removed.
> 
> If you
>   echo none > /sys/block/md0/md/dev-sdc/slot
> 
> first, then the remove will work.
> 
> We could possibly fix it with something like the following, but I'm not sure
> I like it.  There is no guarantee that I can see which would ensure the
> > superblock got updated before the first write if the array switches to
> read/write.
> 
> NeilBrown
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 9233c71138f1..b3d1e8e5e067 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev *mddev,
>  	rdev_for_each(rdev, mddev)
>  		if ((this == NULL || rdev == this) &&
>  		    rdev->raid_disk >= 0 &&
> -		    !test_bit(Blocked, &rdev->flags) &&
> +		    (!test_bit(Blocked, &rdev->flags) || mddev->ro) &&
>  		    (test_bit(Faulty, &rdev->flags) ||
>  		     ! test_bit(In_sync, &rdev->flags)) &&
>  		    atomic_read(&rdev->nr_pending)==0) {
> 
> 
> 

Hi Neil

   I have tried the patch and it fixes the problem. I'm sorry that I can't offer
a better suggestion for this; I'm not familiar with the metadata handling in md.
I'll try to find more time to read the md code.
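
   For reference, the failing sequence and the two workarounds mentioned above
look roughly like this on my machine (only a sketch; md0 and dev-sdb1 are the
names from my reproduction, and the array has to be in the read-auto state for
the first remove to fail):

    # failed member stays Blocked while the array is read-auto
    echo faulty > /sys/block/md0/md/dev-sdb1/state
    echo remove > /sys/block/md0/md/dev-sdb1/state   # fails with EBUSY

    # workaround 1: go active so the superblock is written and Blocked cleared
    echo idle > /sys/block/md0/md/sync_action
    echo remove > /sys/block/md0/md/dev-sdb1/state   # succeeds

    # workaround 2 (alternative): unassign the slot first
    echo none > /sys/block/md0/md/dev-sdb1/slot
    echo remove > /sys/block/md0/md/dev-sdb1/state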

Best Regards
Xiao

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-02-03  8:10                         ` Xiao Ni
@ 2015-06-10  6:26                           ` XiaoNi
  2015-06-17  2:51                             ` Neil Brown
  0 siblings, 1 reply; 22+ messages in thread
From: XiaoNi @ 2015-06-10  6:26 UTC (permalink / raw)
  To: NeilBrown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja



On 02/03/2015 04:10 PM, Xiao Ni wrote:
>
> ----- Original Message -----
>> From: "NeilBrown" <neilb@suse.de>
>> To: "Xiao Ni" <xni@redhat.com>
>> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
>> Sent: Monday, February 2, 2015 2:36:01 PM
>> Subject: Re: RAID1 removing failed disk returns EBUSY
>>
>> On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:
>>
>>>
>>> ----- Original Message -----
>>>> From: "NeilBrown" <neilb@suse.de>
>>>> To: "Xiao Ni" <xni@redhat.com>
>>>> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>,
>>>> linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
>>>> Sent: Thursday, January 29, 2015 11:52:17 AM
>>>> Subject: Re: RAID1 removing failed disk returns EBUSY
>>>>
>>>> On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:
>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Joe Lawrence" <joe.lawrence@stratus.com>
>>>>>> To: "Xiao Ni" <xni@redhat.com>
>>>>>> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill
>>>>>> Kuzeja" <william.kuzeja@stratus.com>
>>>>>> Sent: Friday, January 16, 2015 11:10:31 PM
>>>>>> Subject: Re: RAID1 removing failed disk returns EBUSY
>>>>>>
>>>>>> On Fri, 16 Jan 2015 00:20:12 -0500
>>>>>> Xiao Ni <xni@redhat.com> wrote:
>>>>>>> Hi Joe
>>>>>>>
>>>>>>>     Thanks for reminding me. I didn't do that. Now it can remove
>>>>>>>     successfully after writing
>>>>>>> "idle" to sync_action.
>>>>>>>
>>>>>>>     I thought wrongly that the patch referenced in this mail is
>>>>>>>     fixed
>>>>>>>     for
>>>>>>>     the problem.
>>>>>> So it sounds like even with 3.18 and a new mdadm, this bug still
>>>>>> persists?
>>>>>>
>>>>>> -- Joe
>>>>>>
>>>>>> --
>>>>> Hi Joe
>>>>>
>>>>>     I'm a little confused now. Does the patch
>>>>>     45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
>>>>> resolve the problem?
>>>>>
>>>>>     My environment is:
>>>>>
>>>>> [root@dhcp-12-133 mdadm]# mdadm --version
>>>>> mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest
>>>>> upstream)
>>>>> [root@dhcp-12-133 mdadm]# uname -r
>>>>> 3.18.2
>>>>>
>>>>>
>>>>>     My steps are:
>>>>>
>>>>> [root@dhcp-12-133 mdadm]# lsblk
>>>>> sdb                       8:16   0 931.5G  0 disk
>>>>> └─sdb1                    8:17   0     5G  0 part
>>>>> sdc                       8:32   0 186.3G  0 disk
>>>>> sdd                       8:48   0 931.5G  0 disk
>>>>> └─sdd1                    8:49   0     5G  0 part
>>>>> [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1
>>>>> /dev/sdd1
>>>>> --assume-clean
>>>>> mdadm: Note: this array has metadata at the start and
>>>>>      may not be suitable as a boot device.  If you plan to
>>>>>      store '/boot' on this device please ensure that
>>>>>      your boot-loader understands md/v1.x metadata, or use
>>>>>      --metadata=0.90
>>>>> mdadm: Defaulting to version 1.2 metadata
>>>>> mdadm: array /dev/md0 started.
>>>>>
>>>>>     Then I unplug the disk.
>>>>>
>>>>> [root@dhcp-12-133 mdadm]# lsblk
>>>>> sdc                       8:32   0 186.3G  0 disk
>>>>> sdd                       8:48   0 931.5G  0 disk
>>>>> └─sdd1                    8:49   0     5G  0 part
>>>>>    └─md0                   9:0    0     5G  0 raid1
>>>>> [root@dhcp-12-133 mdadm]# echo faulty >
>>>>> /sys/block/md0/md/dev-sdb1/state
>>>>> [root@dhcp-12-133 mdadm]# echo remove >
>>>>> /sys/block/md0/md/dev-sdb1/state
>>>>> -bash: echo: write error: Device or resource busy
>>>>> [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action
>>>>> [root@dhcp-12-133 mdadm]# echo remove >
>>>>> /sys/block/md0/md/dev-sdb1/state
>>>>>
>>>> I cannot reproduce this - using linux 3.18.2.  I'd be surprised if mdadm
>>>> version affects things.
>>> Hi Neil
>>>
>>>     I'm very curious, because it reproduces on my machine 100% of the time.
>>>
>>>> This error (Device or resource busy) implies that rdev->raid_disk is >=
>>>> 0
>>>> (tested in state_store()).
>>>>
>>>> ->raid_disk is set to -1 by remove_and_add_spares() providing:
>>>>    1/ it isn't Blocked (which is very unlikely)
>>>>    2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and
>>>>    3/ nr_pending is zero.
>>>     I remember I have tried to check those reasons. But it really is
>>>     reason 1,
>>> which is very unlikely.
>>>
>>>     I add some code in the function array_state_show
>>>
>>>      array_state_show(struct mddev *mddev, char *page) {
>>>          enum array_state st = inactive;
>>>          struct md_rdev *rdev;
>>>
>>>          rdev_for_each_rcu(rdev, mddev) {
>>>                  printk(KERN_ALERT "search for %s\n",
>>>                  rdev->bdev->bd_disk->disk_name);
>>>                  if (test_bit(Blocked, &rdev->flags))
>>>                          printk(KERN_ALERT "rdev is Blocked\n");
>>>                  else
>>>                          printk(KERN_ALERT "rdev is not Blocked\n");
>>>      }
>>>
>>>    When I echo 1 > /sys/block/sdc/device/delete, then I ran command:
>>>
>>> [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state
>>> read-auto
>>    ^^^^^^^^^
>>
>> I think that is half the explanation.
>> You must have the md_mod.start_ro parameter set to '1'.
>>
>>
>>> [root@dhcp-12-133 md]# dmesg
>>> [ 2679.559185] search for sdc
>>> [ 2679.559189] rdev is Blocked
>>> [ 2679.559190] search for sdb
>>> [ 2679.559190] rdev is not Blocked
>>>     
>>>    So sdc is Blocked
>> and that is the other half - thanks.
>> (yes, I was wrong.  Sometimes it is easier than being right, but still
>> yields results).
>>
>> When a device fails, it is Blocked until the metadata is updated to record
>> the failure.  This ensures that no writes succeed without writing to that
>> device, until we are certain that no read will try reading from that device,
>> even after a crash/restart.
>>
>> Blocked is cleared after the metadata is written, but read-auto (and
>> read-only) devices never write out their metadata.  So blocked doesn't get
>> cleared.
>>
>> When you "echo idle > .../sync_action" one of the side effects is to switch
>> from 'read-auto' to fully active.  This allows the metadata to be written,
>> Blocked to be cleared, and the device to be removed.
>>
>> If you
>>    echo none > /sys/block/md0/md/dev-sdc/slot
>>
>> first, then the remove will work.
>>
>> We could possibly fix it with something like the following, but I'm not sure
>> I like it.  There is no guarantee that I can see which would ensure the
>> superblock got updated before the first write if the array switches to
>> read/write.
>>
>> NeilBrown
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 9233c71138f1..b3d1e8e5e067 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev *mddev,
>>   	rdev_for_each(rdev, mddev)
>>   		if ((this == NULL || rdev == this) &&
>>   		    rdev->raid_disk >= 0 &&
>> -		    !test_bit(Blocked, &rdev->flags) &&
>> +		    (!test_bit(Blocked, &rdev->flags) || mddev->ro) &&
>>   		    (test_bit(Faulty, &rdev->flags) ||
>>   		     ! test_bit(In_sync, &rdev->flags)) &&
>>   		    atomic_read(&rdev->nr_pending)==0) {
>>
>>
>>
> Hi Neil
>
>     I have tried the patch and the problem can be fixed by it. But I'm sorry that I can't
> give more advices for better idea about this. I'm not familiar with the metadata part about
> the md. I'll try to get more time to read the code about md.
>
Hi Neil

     I don't see the patch in linux-stable. Did this one get missed?

Best Regards
Xiao

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-06-10  6:26                           ` XiaoNi
@ 2015-06-17  2:51                             ` Neil Brown
  2015-06-25  9:42                               ` Xiao Ni
  0 siblings, 1 reply; 22+ messages in thread
From: Neil Brown @ 2015-06-17  2:51 UTC (permalink / raw)
  To: XiaoNi; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja

On Wed, 10 Jun 2015 14:26:41 +0800
XiaoNi <xni@redhat.com> wrote:

> 
> 
> On 02/03/2015 04:10 PM, Xiao Ni wrote:
> >
> > ----- Original Message -----
> >> From: "NeilBrown" <neilb@suse.de>
> >> To: "Xiao Ni" <xni@redhat.com>
> >> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> >> Sent: Monday, February 2, 2015 2:36:01 PM
> >> Subject: Re: RAID1 removing failed disk returns EBUSY
> >>
> >> On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:
> >>
> >>>
> >>> ----- Original Message -----
> >>>> From: "NeilBrown" <neilb@suse.de>
> >>>> To: "Xiao Ni" <xni@redhat.com>
> >>>> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>,
> >>>> linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> >>>> Sent: Thursday, January 29, 2015 11:52:17 AM
> >>>> Subject: Re: RAID1 removing failed disk returns EBUSY
> >>>>
> >>>> On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:
> >>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> From: "Joe Lawrence" <joe.lawrence@stratus.com>
> >>>>>> To: "Xiao Ni" <xni@redhat.com>
> >>>>>> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill
> >>>>>> Kuzeja" <william.kuzeja@stratus.com>
> >>>>>> Sent: Friday, January 16, 2015 11:10:31 PM
> >>>>>> Subject: Re: RAID1 removing failed disk returns EBUSY
> >>>>>>
> >>>>>> On Fri, 16 Jan 2015 00:20:12 -0500
> >>>>>> Xiao Ni <xni@redhat.com> wrote:
> >>>>>>> Hi Joe
> >>>>>>>
> >>>>>>>     Thanks for reminding me. I didn't do that. Now it can remove
> >>>>>>>     successfully after writing
> >>>>>>> "idle" to sync_action.
> >>>>>>>
> >>>>>>>     I thought wrongly that the patch referenced in this mail is
> >>>>>>>     fixed
> >>>>>>>     for
> >>>>>>>     the problem.
> >>>>>> So it sounds like even with 3.18 and a new mdadm, this bug still
> >>>>>> persists?
> >>>>>>
> >>>>>> -- Joe
> >>>>>>
> >>>>>> --
> >>>>> Hi Joe
> >>>>>
> >>>>>     I'm a little confused now. Does the patch
> >>>>>     45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
> >>>>> resolve the problem?
> >>>>>
> >>>>>     My environment is:
> >>>>>
> >>>>> [root@dhcp-12-133 mdadm]# mdadm --version
> >>>>> mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest
> >>>>> upstream)
> >>>>> [root@dhcp-12-133 mdadm]# uname -r
> >>>>> 3.18.2
> >>>>>
> >>>>>
> >>>>>     My steps are:
> >>>>>
> >>>>> [root@dhcp-12-133 mdadm]# lsblk
> >>>>> sdb                       8:16   0 931.5G  0 disk
> >>>>> └─sdb1                    8:17   0     5G  0 part
> >>>>> sdc                       8:32   0 186.3G  0 disk
> >>>>> sdd                       8:48   0 931.5G  0 disk
> >>>>> └─sdd1                    8:49   0     5G  0 part
> >>>>> [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1
> >>>>> /dev/sdd1
> >>>>> --assume-clean
> >>>>> mdadm: Note: this array has metadata at the start and
> >>>>>      may not be suitable as a boot device.  If you plan to
> >>>>>      store '/boot' on this device please ensure that
> >>>>>      your boot-loader understands md/v1.x metadata, or use
> >>>>>      --metadata=0.90
> >>>>> mdadm: Defaulting to version 1.2 metadata
> >>>>> mdadm: array /dev/md0 started.
> >>>>>
> >>>>>     Then I unplug the disk.
> >>>>>
> >>>>> [root@dhcp-12-133 mdadm]# lsblk
> >>>>> sdc                       8:32   0 186.3G  0 disk
> >>>>> sdd                       8:48   0 931.5G  0 disk
> >>>>> └─sdd1                    8:49   0     5G  0 part
> >>>>>    └─md0                   9:0    0     5G  0 raid1
> >>>>> [root@dhcp-12-133 mdadm]# echo faulty >
> >>>>> /sys/block/md0/md/dev-sdb1/state
> >>>>> [root@dhcp-12-133 mdadm]# echo remove >
> >>>>> /sys/block/md0/md/dev-sdb1/state
> >>>>> -bash: echo: write error: Device or resource busy
> >>>>> [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action
> >>>>> [root@dhcp-12-133 mdadm]# echo remove >
> >>>>> /sys/block/md0/md/dev-sdb1/state
> >>>>>
> >>>> I cannot reproduce this - using linux 3.18.2.  I'd be surprised if mdadm
> >>>> version affects things.
> >>> Hi Neil
> >>>
> >>>     I'm very curious, because it reproduces on my machine 100% of the time.
> >>>
> >>>> This error (Device or resource busy) implies that rdev->raid_disk is >=
> >>>> 0
> >>>> (tested in state_store()).
> >>>>
> >>>> ->raid_disk is set to -1 by remove_and_add_spares() providing:
> >>>>    1/ it isn't Blocked (which is very unlikely)
> >>>>    2/ hot_remove_disk succeeds, which it will if nr_pending is zero, and
> >>>>    3/ nr_pending is zero.
> >>>     I remember I have tried to check those reasons. But it really is
> >>>     reason 1,
> >>> which is very unlikely.
> >>>
> >>>     I add some code in the function array_state_show
> >>>
> >>>      array_state_show(struct mddev *mddev, char *page) {
> >>>          enum array_state st = inactive;
> >>>          struct md_rdev *rdev;
> >>>
> >>>          rdev_for_each_rcu(rdev, mddev) {
> >>>                  printk(KERN_ALERT "search for %s\n",
> >>>                  rdev->bdev->bd_disk->disk_name);
> >>>                  if (test_bit(Blocked, &rdev->flags))
> >>>                          printk(KERN_ALERT "rdev is Blocked\n");
> >>>                  else
> >>>                          printk(KERN_ALERT "rdev is not Blocked\n");
> >>>      }
> >>>
> >>>    When I echo 1 > /sys/block/sdc/device/delete, then I ran command:
> >>>
> >>> [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state
> >>> read-auto
> >>    ^^^^^^^^^
> >>
> >> I think that is half the explanation.
> >> You must have the md_mod.start_ro parameter set to '1'.
> >>
> >>
> >>> [root@dhcp-12-133 md]# dmesg
> >>> [ 2679.559185] search for sdc
> >>> [ 2679.559189] rdev is Blocked
> >>> [ 2679.559190] search for sdb
> >>> [ 2679.559190] rdev is not Blocked
> >>>     
> >>>    So sdc is Blocked
> >> and that is the other half - thanks.
> >> (yes, I was wrong.  Sometimes it is easier than being right, but still
> >> yields results).
> >>
> >> When a device fails, it is Blocked until the metadata is updated to record
> >> the failure.  This ensures that no writes succeed without writing to that
> >> device, until we are certain that no read will try reading from that device,
> >> even after a crash/restart.
> >>
> >> Blocked is cleared after the metadata is written, but read-auto (and
> >> read-only) devices never write out their metadata.  So blocked doesn't get
> >> cleared.
> >>
> >> When you "echo idle > .../sync_action" one of the side effects is to switch
> >> from 'read-auto' to fully active.  This allows the metadata to be written,
> >> Blocked to be cleared, and the device to be removed.
> >>
> >> If you
> >>    echo none > /sys/block/md0/md/dev-sdc/slot
> >>
> >> first, then the remove will work.
> >>
> >> We could possibly fix it with something like the following, but I'm not sure
> >> I like it.  There is no guarantee that I can see which would ensure the
> >> superblock got updated before the first write if the array switches to
> >> read/write.
> >>
> >> NeilBrown
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index 9233c71138f1..b3d1e8e5e067 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev *mddev,
> >>   	rdev_for_each(rdev, mddev)
> >>   		if ((this == NULL || rdev == this) &&
> >>   		    rdev->raid_disk >= 0 &&
> >> -		    !test_bit(Blocked, &rdev->flags) &&
> >> +		    (!test_bit(Blocked, &rdev->flags) || mddev->ro) &&
> >>   		    (test_bit(Faulty, &rdev->flags) ||
> >>   		     ! test_bit(In_sync, &rdev->flags)) &&
> >>   		    atomic_read(&rdev->nr_pending)==0) {
> >>
> >>
> >>
> > Hi Neil
> >
> >     I have tried the patch and the problem can be fixed by it. But I'm sorry that I can't
> > give more advices for better idea about this. I'm not familiar with the metadata part about
> > the md. I'll try to get more time to read the code about md.
> >
> Hi Neil
> 
>      I don't see the patch in linux-stable, do you miss this?

I don't believe this bug is sufficiently serious for the patch to go to
-stable.  However it does need to be fixed - thanks for the reminder.

I've just queued the following patch which I am happy with.  If you
could confirm that it works for you, I would appreciate that.

Thanks,
NeilBrown


From: Neil Brown <neilb@suse.de>
Date: Wed, 17 Jun 2015 12:31:46 +1000
Subject: [PATCH] md: clear Blocked flag on failed devices when array is
 read-only.

The Blocked flag indicates that a device has failed but that this
fact hasn't been recorded in the metadata yet.  Writes to such
devices cannot be allowed until the metadata has been updated.

On a read-only array, the Blocked flag will never be cleared.
This prevents the device being removed from the array.

If the metadata is being handled by the kernel
(i.e. !mddev->external), then we can be sure that if the array is
switched to writable, then a metadata update will happen and will
record the failure.  So we don't need the flag set.

If metadata is externally managed, it is up to the external manager
to clear the 'blocked' flag.

Reported-by: XiaoNi <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3d339e2..5a6681a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8125,6 +8125,15 @@ void md_check_recovery(struct mddev *mddev)
 		int spares = 0;
 
 		if (mddev->ro) {
+			struct md_rdev *rdev;
+			if (!mddev->external && mddev->in_sync)
+				/* 'Blocked' flag not needed as failed devices
+				 * will be recorded if array switched to read/write.
+				 * Leaving it set will prevent the device
+				 * from being removed.
+				 */
+				rdev_for_each(rdev, mddev)
+					clear_bit(Blocked, &rdev->flags);
 			/* On a read-only array we can:
 			 * - remove failed devices
 			 * - add already-in_sync devices if the array itself
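
If you want to re-test with this applied, a minimal check along the lines of
your original reproduction would be roughly (a sketch only; it assumes
md_mod's start_ro parameter is 1, as in your setup, and that sdb1/sdd1 are
scratch partitions):

    echo 1 > /sys/module/md_mod/parameters/start_ro
    mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1 /dev/sdd1 --assume-clean
    echo 1 > /sys/block/sdb/device/delete       # simulate the unplug
    cat /sys/block/md0/md/array_state           # expect read-auto
    echo faulty > /sys/block/md0/md/dev-sdb1/state
    # without the patch the member stayed Blocked and 'remove' gave EBUSY;
    # with it the failed member should be removable (or already dropped
    # by md_check_recovery)
    echo remove > /sys/block/md0/md/dev-sdb1/state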



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: RAID1 removing failed disk returns EBUSY
  2015-06-17  2:51                             ` Neil Brown
@ 2015-06-25  9:42                               ` Xiao Ni
  0 siblings, 0 replies; 22+ messages in thread
From: Xiao Ni @ 2015-06-25  9:42 UTC (permalink / raw)
  To: Neil Brown; +Cc: Joe Lawrence, linux-raid, Bill Kuzeja



----- Original Message -----
> From: "Neil Brown" <neilb@suse.de>
> To: "XiaoNi" <xni@redhat.com>
> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>, linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> Sent: Wednesday, June 17, 2015 10:51:51 AM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> On Wed, 10 Jun 2015 14:26:41 +0800
> XiaoNi <xni@redhat.com> wrote:
> 
> > 
> > 
> > On 02/03/2015 04:10 PM, Xiao Ni wrote:
> > >
> > > ----- Original Message -----
> > >> From: "NeilBrown" <neilb@suse.de>
> > >> To: "Xiao Ni" <xni@redhat.com>
> > >> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>,
> > >> linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> > >> Sent: Monday, February 2, 2015 2:36:01 PM
> > >> Subject: Re: RAID1 removing failed disk returns EBUSY
> > >>
> > >> On Thu, 29 Jan 2015 07:14:16 -0500 (EST) Xiao Ni <xni@redhat.com> wrote:
> > >>
> > >>>
> > >>> ----- Original Message -----
> > >>>> From: "NeilBrown" <neilb@suse.de>
> > >>>> To: "Xiao Ni" <xni@redhat.com>
> > >>>> Cc: "Joe Lawrence" <joe.lawrence@stratus.com>,
> > >>>> linux-raid@vger.kernel.org, "Bill Kuzeja" <william.kuzeja@stratus.com>
> > >>>> Sent: Thursday, January 29, 2015 11:52:17 AM
> > >>>> Subject: Re: RAID1 removing failed disk returns EBUSY
> > >>>>
> > >>>> On Sun, 18 Jan 2015 21:33:50 -0500 (EST) Xiao Ni <xni@redhat.com>
> > >>>> wrote:
> > >>>>
> > >>>>>
> > >>>>> ----- Original Message -----
> > >>>>>> From: "Joe Lawrence" <joe.lawrence@stratus.com>
> > >>>>>> To: "Xiao Ni" <xni@redhat.com>
> > >>>>>> Cc: "NeilBrown" <neilb@suse.de>, linux-raid@vger.kernel.org, "Bill
> > >>>>>> Kuzeja" <william.kuzeja@stratus.com>
> > >>>>>> Sent: Friday, January 16, 2015 11:10:31 PM
> > >>>>>> Subject: Re: RAID1 removing failed disk returns EBUSY
> > >>>>>>
> > >>>>>> On Fri, 16 Jan 2015 00:20:12 -0500
> > >>>>>> Xiao Ni <xni@redhat.com> wrote:
> > >>>>>>> Hi Joe
> > >>>>>>>
> > >>>>>>>     Thanks for reminding me. I didn't do that. Now it can remove
> > >>>>>>>     successfully after writing
> > >>>>>>> "idle" to sync_action.
> > >>>>>>>
> > >>>>>>>     I thought wrongly that the patch referenced in this mail is
> > >>>>>>>     fixed
> > >>>>>>>     for
> > >>>>>>>     the problem.
> > >>>>>> So it sounds like even with 3.18 and a new mdadm, this bug still
> > >>>>>> persists?
> > >>>>>>
> > >>>>>> -- Joe
> > >>>>>>
> > >>>>>> --
> > >>>>> Hi Joe
> > >>>>>
> > >>>>>     I'm a little confused now. Does the patch
> > >>>>>     45eaf45dfa4850df16bc2e8e7903d89021137f40 from linux-stable
> > >>>>> resolve the problem?
> > >>>>>
> > >>>>>     My environment is:
> > >>>>>
> > >>>>> [root@dhcp-12-133 mdadm]# mdadm --version
> > >>>>> mdadm - v3.3.2-18-g93d3bd3 - 18th December 2014  (this is the newest
> > >>>>> upstream)
> > >>>>> [root@dhcp-12-133 mdadm]# uname -r
> > >>>>> 3.18.2
> > >>>>>
> > >>>>>
> > >>>>>     My steps are:
> > >>>>>
> > >>>>> [root@dhcp-12-133 mdadm]# lsblk
> > >>>>> sdb                       8:16   0 931.5G  0 disk
> > >>>>> └─sdb1                    8:17   0     5G  0 part
> > >>>>> sdc                       8:32   0 186.3G  0 disk
> > >>>>> sdd                       8:48   0 931.5G  0 disk
> > >>>>> └─sdd1                    8:49   0     5G  0 part
> > >>>>> [root@dhcp-12-133 mdadm]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdb1
> > >>>>> /dev/sdd1
> > >>>>> --assume-clean
> > >>>>> mdadm: Note: this array has metadata at the start and
> > >>>>>      may not be suitable as a boot device.  If you plan to
> > >>>>>      store '/boot' on this device please ensure that
> > >>>>>      your boot-loader understands md/v1.x metadata, or use
> > >>>>>      --metadata=0.90
> > >>>>> mdadm: Defaulting to version 1.2 metadata
> > >>>>> mdadm: array /dev/md0 started.
> > >>>>>
> > >>>>>     Then I unplug the disk.
> > >>>>>
> > >>>>> [root@dhcp-12-133 mdadm]# lsblk
> > >>>>> sdc                       8:32   0 186.3G  0 disk
> > >>>>> sdd                       8:48   0 931.5G  0 disk
> > >>>>> └─sdd1                    8:49   0     5G  0 part
> > >>>>>    └─md0                   9:0    0     5G  0 raid1
> > >>>>> [root@dhcp-12-133 mdadm]# echo faulty >
> > >>>>> /sys/block/md0/md/dev-sdb1/state
> > >>>>> [root@dhcp-12-133 mdadm]# echo remove >
> > >>>>> /sys/block/md0/md/dev-sdb1/state
> > >>>>> -bash: echo: write error: Device or resource busy
> > >>>>> [root@dhcp-12-133 mdadm]# echo idle > /sys/block/md0/md/sync_action
> > >>>>> [root@dhcp-12-133 mdadm]# echo remove >
> > >>>>> /sys/block/md0/md/dev-sdb1/state
> > >>>>>
> > >>>> I cannot reproduce this - using linux 3.18.2.  I'd be surprised if
> > >>>> mdadm
> > >>>> version affects things.
> > >>> Hi Neil
> > >>>
> > >>>     I'm very curious, because it reproduces on my machine 100% of the time.
> > >>>
> > >>>> This error (Device or resource busy) implies that rdev->raid_disk is
> > >>>> >=
> > >>>> 0
> > >>>> (tested in state_store()).
> > >>>>
> > >>>> ->raid_disk is set to -1 by remove_and_add_spares() providing:
> > >>>>    1/ it isn't Blocked (which is very unlikely)
> > >>>>    2/ hot_remove_disk succeeds, which it will if nr_pending is zero,
> > >>>>    and
> > >>>>    3/ nr_pending is zero.
> > >>>     I remember I have tried to check those reasons. But it really is
> > >>>     reason 1,
> > >>> which is very unlikely.
> > >>>
> > >>>     I add some code in the function array_state_show
> > >>>
> > >>>      array_state_show(struct mddev *mddev, char *page) {
> > >>>          enum array_state st = inactive;
> > >>>          struct md_rdev *rdev;
> > >>>
> > >>>          rdev_for_each_rcu(rdev, mddev) {
> > >>>                  printk(KERN_ALERT "search for %s\n",
> > >>>                  rdev->bdev->bd_disk->disk_name);
> > >>>                  if (test_bit(Blocked, &rdev->flags))
> > >>>                          printk(KERN_ALERT "rdev is Blocked\n");
> > >>>                  else
> > >>>                          printk(KERN_ALERT "rdev is not Blocked\n");
> > >>>      }
> > >>>
> > >>>    When I echo 1 > /sys/block/sdc/device/delete, then I ran command:
> > >>>
> > >>> [root@dhcp-12-133 md]# cat /sys/block/md0/md/array_state
> > >>> read-auto
> > >>    ^^^^^^^^^
> > >>
> > >> I think that is half the explanation.
> > >> You must have the md_mod.start_ro parameter set to '1'.
> > >>
> > >>
> > >>> [root@dhcp-12-133 md]# dmesg
> > >>> [ 2679.559185] search for sdc
> > >>> [ 2679.559189] rdev is Blocked
> > >>> [ 2679.559190] search for sdb
> > >>> [ 2679.559190] rdev is not Blocked
> > >>>     
> > >>>    So sdc is Blocked
> > >> and that is the other half - thanks.
> > >> (yes, I was wrong.  Sometimes it is easier than being right, but still
> > >> yields results).
> > >>
> > >> When a device fails, it is Blocked until the metadata is updated to
> > >> record
> > >> the failure.  This ensures that no writes succeed without writing to
> > >> that
> > >> device, until we are certain that no read will try reading from that
> > >> device,
> > >> even after a crash/restart.
> > >>
> > >> Blocked is cleared after the metadata is written, but read-auto (and
> > >> read-only) devices never write out their metadata.  So blocked doesn't
> > >> get
> > >> cleared.
> > >>
> > >> When you "echo idle > .../sync_action" one of the side effects is to
> > >> switch
> > >> from 'read-auto' to fully active.  This allows the metadata to be
> > >> written,
> > >> Blocked to be cleared, and the device to be removed.
> > >>
> > >> If you
> > >>    echo none > /sys/block/md0/md/dev-sdc/slot
> > >>
> > >> first, then the remove will work.
> > >>
> > >> We could possibly fix it with something like the following, but I'm not
> > >> sure
> > >> I like it.  There is no guarantee that I can see which would ensure the
> > >> superblock got updated before the first write if the array switches to
> > >> read/write.
> > >>
> > >> NeilBrown
> > >>
> > >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >> index 9233c71138f1..b3d1e8e5e067 100644
> > >> --- a/drivers/md/md.c
> > >> +++ b/drivers/md/md.c
> > >> @@ -7528,7 +7528,7 @@ static int remove_and_add_spares(struct mddev
> > >> *mddev,
> > >>   	rdev_for_each(rdev, mddev)
> > >>   		if ((this == NULL || rdev == this) &&
> > >>   		    rdev->raid_disk >= 0 &&
> > >> -		    !test_bit(Blocked, &rdev->flags) &&
> > >> +		    (!test_bit(Blocked, &rdev->flags) || mddev->ro) &&
> > >>   		    (test_bit(Faulty, &rdev->flags) ||
> > >>   		     ! test_bit(In_sync, &rdev->flags)) &&
> > >>   		    atomic_read(&rdev->nr_pending)==0) {
> > >>
> > >>
> > >>
> > > Hi Neil
> > >
> > >     I have tried the patch and the problem can be fixed by it. But I'm
> > >     sorry that I can't
> > > give more advices for better idea about this. I'm not familiar with the
> > > metadata part about
> > > the md. I'll try to get more time to read the code about md.
> > >
> > Hi Neil
> > 
> >      I don't see the patch in linux-stable, do you miss this?
> 
> I don't believe this bug is sufficiently serious for the patch to go to
> -stable.  However it does need to be fixed - thanks for the reminder.
> 
> I've just queued the following patch which I am happy with.  If you
> could confirm that it works for you, I would appreciate that.
> 
> Thanks,
> NeilBrown
> 
> 
> From: Neil Brown <neilb@suse.de>
> Date: Wed, 17 Jun 2015 12:31:46 +1000
> Subject: [PATCH] md: clear Blocked flag on failed devices when array is
>  read-only.
> 
> The Blocked flag indicates that a device has failed but that this
> fact hasn't been recorded in the metadata yet.  Writes to such
> devices cannot be allowed until the metadata has been updated.
> 
> On a read-only array, the Blocked flag will never be cleared.
> This prevents the device being removed from the array.
> 
> If the metadata is being handled by the kernel
> (i.e. !mddev->external), then we can be sure that if the array is
> switched to writable, then a metadata update will happen and will
> record the failure.  So we don't need the flag set.
> 
> If metadata is externally managed, it is up to the external manager
> to clear the 'blocked' flag.
> 
> Reported-by: XiaoNi <xni@redhat.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3d339e2..5a6681a 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8125,6 +8125,15 @@ void md_check_recovery(struct mddev *mddev)
>  		int spares = 0;
>  
>  		if (mddev->ro) {
> +			struct md_rdev *rdev;
> +			if (!mddev->external && mddev->in_sync)
> +				/* 'Blocked' flag not needed as failed devices
> +				 * will be recorded if array switched to read/write.
> +				 * Leaving it set will prevent the device
> +				 * from being removed.
> +				 */
> +				rdev_for_each(rdev, mddev)
> +					clear_bit(Blocked, &rdev->flags);
>  			/* On a read-only array we can:
>  			 * - remove failed devices
>  			 * - add already-in_sync devices if the array itself
> 
> 
Hi Neil

Sorry for the late response.

I have tried the patch. When I unplug the disk (sdc1) that belongs to the raid1 array, the directory
/sys/block/md0/md/dev-sdc1 is deleted. I haven't read the code path for device unplug yet, so is this
the behaviour you expect?
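
In case it is useful, this is roughly how I checked the state after the
unplug (a sketch; md0 and dev-sdc1 are the names from this test):

    ls /sys/block/md0/md/                    # dev-sdc1 should no longer appear
    cat /sys/block/md0/md/array_state        # should still report read-auto
    mdadm --detail /dev/md0                  # failed slot should show as removed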

Best Regards
Xiao

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-06-25  9:42 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-27 20:27 RAID1 removing failed disk returns EBUSY Joe Lawrence
2014-10-28 21:41 ` NeilBrown
2014-10-29 17:36   ` Joe Lawrence
2014-11-13 14:05     ` Joe Lawrence
2014-11-16 23:03       ` NeilBrown
2015-01-14 12:41         ` XiaoNi
2015-01-15 13:22           ` Joe Lawrence
2015-01-16  5:20             ` Xiao Ni
2015-01-16 15:10               ` Joe Lawrence
2015-01-19  2:33                 ` Xiao Ni
2015-01-19 17:56                   ` Joe Lawrence
2015-01-20  7:16                     ` Xiao Ni
2015-01-23 15:11                       ` Joe Lawrence
2015-01-30  2:19                         ` Xiao Ni
2015-01-30  4:27                           ` Xiao Ni
2015-01-29  3:52                   ` NeilBrown
2015-01-29 12:14                     ` Xiao Ni
2015-02-02  6:36                       ` NeilBrown
2015-02-03  8:10                         ` Xiao Ni
2015-06-10  6:26                           ` XiaoNi
2015-06-17  2:51                             ` Neil Brown
2015-06-25  9:42                               ` Xiao Ni
