* mdadm: can't remove failed/detached drives when using metadata 1.x
@ 2011-02-10 15:28 Rémi Rérolle
  2011-02-14  3:27 ` NeilBrown
  0 siblings, 1 reply; 4+ messages in thread
From: Rémi Rérolle @ 2011-02-10 15:28 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid

Hi Neil,

I recently came across what I believe is a regression in mdadm,
introduced in version 3.1.3.

It seems that, when using metadata 1.x, removing failed/detached
drives no longer works.

Here's a quick example:

[root@GrosCinq ~]# mdadm -C /dev/md4 -l1 -n2 --metadata=1.0 /dev/sdc1 /dev/sdd1
mdadm: array /dev/md4 started.
[root@GrosCinq ~]#
[root@GrosCinq ~]# mdadm --wait /dev/md4
[root@GrosCinq ~]#
[root@GrosCinq ~]# mdadm -D /dev/md4
/dev/md4:
         Version : 1.0
   Creation Time : Thu Feb 10 13:56:31 2011
      Raid Level : raid1
      Array Size : 1953096 (1907.64 MiB 1999.97 MB)
   Used Dev Size : 1953096 (1907.64 MiB 1999.97 MB)
    Raid Devices : 2
   Total Devices : 2
     Persistence : Superblock is persistent

     Update Time : Thu Feb 10 13:56:46 2011
           State : clean
  Active Devices : 2
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0

            Name : GrosCinq:4  (local to host GrosCinq)
            UUID : bbfef508:252e7ce1:c95d4a03:8beb3cbd
          Events : 17

     Number   Major   Minor   RaidDevice State
        0       8        1        0      active sync   /dev/sdc1
        1       8       49        1      active sync   /dev/sdd1

[root@GrosCinq ~]# mdadm --fail /dev/md4 /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md4
[root@GrosCinq ~]#
[root@GrosCinq ~]# mdadm -D /dev/md4 | tail -n 6

     Number   Major   Minor   RaidDevice State
        0       0        0        0      removed
        1       8       49        1      active sync   /dev/sdd1

        0       8        1        -      faulty spare   /dev/sdc1
[root@GrosCinq ~]#
[root@GrosCinq ~]# mdadm --remove /dev/md4 failed
[root@GrosCinq ~]#
[root@GrosCinq ~]# mdadm -D /dev/md4 | tail -n 6

     Number   Major   Minor   RaidDevice State
        0       0        0        0      removed
        1       8       49        1      active sync   /dev/sdd1

        0       8        1        -      faulty spare   /dev/sdc1
[root@GrosCinq ~]#

This happens with mdadm 3.1.4, 3.1.3 and even 3.2, but not with 3.1.2.
I did a git bisect to isolate the regression, and the guilty commit
appears to be:

b3b4e8a: "Avoid skipping devices where removing all faulty/detached
           devices."

As stated in the commit, this only affects metadata 1.x; with 0.9
there is no problem. I also tested with detached drives, as well as
with raid5/6, and hit the same issue. With detached drives it's even
more annoying, since --remove detached is the only way to remove the
device without restarting the array; for a failed drive, the device
name can still be used.
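
For reference, the two forms look like this (illustrative commands
reusing the device names from the example above, not a fresh
transcript):

# keyword form -- the only option for a detached drive; regressed here
mdadm --remove /dev/md4 detached
mdadm --remove /dev/md4 failed

# device-name form -- still works for a failed-but-present drive
mdadm --remove /dev/md4 /dev/sdc1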

Do you have any idea of the reason behind this regression? Should this
patch only apply in the case of 0.9 metadata?

Regards,

-- 
Rémi


* Re: mdadm: can't remove failed/detached drives when using metadata 1.x
  2011-02-10 15:28 mdadm: can't remove failed/detached drives when using metadata 1.x Rémi Rérolle
@ 2011-02-14  3:27 ` NeilBrown
  2011-02-14 14:05   ` Rémi Rérolle
  0 siblings, 1 reply; 4+ messages in thread
From: NeilBrown @ 2011-02-14  3:27 UTC (permalink / raw)
  To: Rémi Rérolle; +Cc: linux-raid

On Thu, 10 Feb 2011 16:28:12 +0100 Rémi Rérolle <rrerolle@lacie.com> wrote:

> Hi Neil,
> 
> I recently came across what I believe is a regression in mdadm,
> introduced in version 3.1.3.
> 
> It seems that, when using metadata 1.x, removing failed/detached
> drives no longer works.
> 
> Here's a quick example:
> 
> [root@GrosCinq ~]# mdadm -C /dev/md4 -l1 -n2 --metadata=1.0 /dev/sdc1 /dev/sdd1
> mdadm: array /dev/md4 started.
> [root@GrosCinq ~]#
> [root@GrosCinq ~]# mdadm --wait /dev/md4
> [root@GrosCinq ~]#
> [root@GrosCinq ~]# mdadm -D /dev/md4
> /dev/md4:
>          Version : 1.0
>    Creation Time : Thu Feb 10 13:56:31 2011
>       Raid Level : raid1
>       Array Size : 1953096 (1907.64 MiB 1999.97 MB)
>    Used Dev Size : 1953096 (1907.64 MiB 1999.97 MB)
>     Raid Devices : 2
>    Total Devices : 2
>      Persistence : Superblock is persistent
> 
>      Update Time : Thu Feb 10 13:56:46 2011
>            State : clean
>   Active Devices : 2
> Working Devices : 2
>   Failed Devices : 0
>    Spare Devices : 0
> 
>             Name : GrosCinq:4  (local to host GrosCinq)
>             UUID : bbfef508:252e7ce1:c95d4a03:8beb3cbd
>           Events : 17
> 
>      Number   Major   Minor   RaidDevice State
>         0       8        1        0      active sync   /dev/sdc1
>         1       8       49        1      active sync   /dev/sdd1
> 
> [root@GrosCinq ~]# mdadm --fail /dev/md4 /dev/sdc1
> mdadm: set /dev/sdc1 faulty in /dev/md4
> [root@GrosCinq ~]#
> [root@GrosCinq ~]# mdadm -D /dev/md4 | tail -n 6
> 
>      Number   Major   Minor   RaidDevice State
>         0       0        0        0      removed
>         1       8       49        1      active sync   /dev/sdd1
> 
>         0       8        1        -      faulty spare   /dev/sdc1
> [root@GrosCinq ~]#
> [root@GrosCinq ~]# mdadm --remove /dev/md4 failed
> [root@GrosCinq ~]#
> [root@GrosCinq ~]# mdadm -D /dev/md4 | tail -n 6
> 
>      Number   Major   Minor   RaidDevice State
>         0       0        0        0      removed
>         1       8       49        1      active sync   /dev/sdd1
> 
>         0       8        1        -      faulty spare   /dev/sdc1
> [root@GrosCinq ~]#
> 
> This happens with mdadm 3.1.4, 3.1.3 and even 3.2, but not with 3.1.2.
> I did a git bisect to isolate the regression, and the guilty commit
> appears to be:
> 
> b3b4e8a: "Avoid skipping devices where removing all faulty/detached
>            devices."
> 
> As stated in the commit, this only affects metadata 1.x; with 0.9
> there is no problem. I also tested with detached drives, as well as
> with raid5/6, and hit the same issue. With detached drives it's even
> more annoying, since --remove detached is the only way to remove the
> device without restarting the array; for a failed drive, the device
> name can still be used.
> 
> Do you have any idea of the reason behind this regression? Should this
> patch only apply in the case of 0.9 metadata?
> 
> Regards,
> 


Thanks for the report - especially for bisecting it down to the erroneous
commit!

This patch should fix the regression.  I'll ensure it is in all future
releases.

Thanks,
NeilBrown


diff --git a/Manage.c b/Manage.c
index 481c165..8c86a53 100644
--- a/Manage.c
+++ b/Manage.c
@@ -421,7 +421,7 @@ int Manage_subdevs(char *devname, int fd,
 				dnprintable = dvname;
 				break;
 			}
-			if (jnext == 0)
+			if (next != dv)
 				continue;
 		} else if (strcmp(dv->devname, "detached") == 0) {
 			if (dv->disposition != 'r' && dv->disposition != 'f') {
@@ -461,7 +461,7 @@ int Manage_subdevs(char *devname, int fd,
 				dnprintable = dvname;
 				break;
 			}
-			if (jnext == 0)
+			if (next != dv)
 				continue;
 		} else if (strcmp(dv->devname, "missing") == 0) {
 			if (dv->disposition != 'a' || dv->re_add == 0) {
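
For anyone following along, here is a toy model of the loop pattern the
guard sits in -- a simplified sketch written while reading the diff
above, not the actual Manage.c code; the entry type and the
faulty-slot scan are invented for illustration:

/* toy_remove_failed.c -- sketch only, NOT the real Manage.c.
 * It models the pattern the patch touches: a keyword entry such as
 * "failed" is expanded one array slot at a time, and the loop stays
 * on the same devlist entry until no matching slot remains.
 */
#include <stdio.h>

struct entry { const char *name; struct entry *next; };

/* stand-in for the real sysfs/ioctl scan; slots 0 and 2 are "faulty" */
static int find_next_faulty_slot(int from)
{
	static const int faulty[4] = { 1, 0, 1, 0 };
	for (int j = from; j < 4; j++)
		if (faulty[j])
			return j;
	return -1;
}

int main(void)
{
	struct entry failed = { "failed", NULL };
	struct entry *dv, *next;
	int j = 0, jnext = 0;

	for (dv = &failed; dv; dv = next, j = jnext) {
		next = dv->next;	/* default: move to the next entry */
		jnext = 0;

		int slot = find_next_faulty_slot(j);
		if (slot >= 0) {
			next = dv;	/* stay on this keyword entry */
			jnext = slot + 1;
		}
		if (next != dv)		/* the fixed guard: nothing left */
			continue;

		printf("removing faulty device in slot %d\n", slot);
	}
	return 0;
}

The point is that the fixed test keys off whether the scan actually
found another device to act on, rather than off jnext's numeric value.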


* Re: mdadm: can't remove failed/detached drives when using metadata 1.x
  2011-02-14  3:27 ` NeilBrown
@ 2011-02-14 14:05   ` Rémi Rérolle
  2011-02-15  0:05     ` NeilBrown
  0 siblings, 1 reply; 4+ messages in thread
From: Rémi Rérolle @ 2011-02-14 14:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 14/02/2011 04:27, NeilBrown wrote:
> On Thu, 10 Feb 2011 16:28:12 +0100 Rémi Rérolle <rrerolle@lacie.com> wrote:
>
>> Hi Neil,
>>
>> I recently came across what I believe is a regression in mdadm,
>> introduced in version 3.1.3.
>>
>> It seems that, when using metadata 1.x, removing failed/detached
>> drives no longer works.
>>
>> Here's a quick example:
>>
>> [root@GrosCinq ~]# mdadm -C /dev/md4 -l1 -n2 --metadata=1.0 /dev/sdc1 /dev/sdd1
>> mdadm: array /dev/md4 started.
>> [root@GrosCinq ~]#
>> [root@GrosCinq ~]# mdadm --wait /dev/md4
>> [root@GrosCinq ~]#
>> [root@GrosCinq ~]# mdadm -D /dev/md4
>> /dev/md4:
>>           Version : 1.0
>>     Creation Time : Thu Feb 10 13:56:31 2011
>>        Raid Level : raid1
>>        Array Size : 1953096 (1907.64 MiB 1999.97 MB)
>>     Used Dev Size : 1953096 (1907.64 MiB 1999.97 MB)
>>      Raid Devices : 2
>>     Total Devices : 2
>>       Persistence : Superblock is persistent
>>
>>       Update Time : Thu Feb 10 13:56:46 2011
>>             State : clean
>>    Active Devices : 2
>> Working Devices : 2
>>    Failed Devices : 0
>>     Spare Devices : 0
>>
>>              Name : GrosCinq:4  (local to host GrosCinq)
>>              UUID : bbfef508:252e7ce1:c95d4a03:8beb3cbd
>>            Events : 17
>>
>>       Number   Major   Minor   RaidDevice State
>>          0       8        1        0      active sync   /dev/sdc1
>>          1       8       49        1      active sync   /dev/sdd1
>>
>> [root@GrosCinq ~]# mdadm --fail /dev/md4 /dev/sdc1
>> mdadm: set /dev/sdc1 faulty in /dev/md4
>> [root@GrosCinq ~]#
>> [root@GrosCinq ~]# mdadm -D /dev/md4 | tail -n 6
>>
>>       Number   Major   Minor   RaidDevice State
>>          0       0        0        0      removed
>>          1       8       49        1      active sync   /dev/sdd1
>>
>>          0       8        1        -      faulty spare   /dev/sdc1
>> [root@GrosCinq ~]#
>> [root@GrosCinq ~]# mdadm --remove /dev/md4 failed
>> [root@GrosCinq ~]#
>> [root@GrosCinq ~]# mdadm -D /dev/md4 | tail -n 6
>>
>>       Number   Major   Minor   RaidDevice State
>>          0       0        0        0      removed
>>          1       8       49        1      active sync   /dev/sdd1
>>
>>          0       8        1        -      faulty spare   /dev/sdc1
>> [root@GrosCinq ~]#
>>
>> This happens with mdadm 3.1.4, 3.1.3 and even 3.2, but not with 3.1.2.
>> I did a git bisect to isolate the regression, and the guilty commit
>> appears to be:
>>
>> b3b4e8a: "Avoid skipping devices where removing all faulty/detached
>>             devices."
>>
>> As stated in the commit, this only affects metadata 1.x; with 0.9
>> there is no problem. I also tested with detached drives, as well as
>> with raid5/6, and hit the same issue. With detached drives it's even
>> more annoying, since --remove detached is the only way to remove the
>> device without restarting the array; for a failed drive, the device
>> name can still be used.
>>
>> Do you have any idea of the reason behind this regression? Should this
>> patch only apply in the case of 0.9 metadata?
>>
>> Regards,
>>
>
>
> Thanks for the report - especially for bisecting it down to the erroneous
> commit!
>
> This patch should fix the regression.  I'll ensure it is in all future
> releases.
>

Hi Neil,

I've tested your patch with the setup that was causing me trouble. It 
did fix the regression.

Thanks!

Rémi

> Thanks,
> NeilBrown
>
>
> diff --git a/Manage.c b/Manage.c
> index 481c165..8c86a53 100644
> --- a/Manage.c
> +++ b/Manage.c
> @@ -421,7 +421,7 @@ int Manage_subdevs(char *devname, int fd,
>   				dnprintable = dvname;
>   				break;
>   			}
> -			if (jnext == 0)
> +			if (next != dv)
>   				continue;
>   		} else if (strcmp(dv->devname, "detached") == 0) {
>   			if (dv->disposition != 'r' && dv->disposition != 'f') {
> @@ -461,7 +461,7 @@ int Manage_subdevs(char *devname, int fd,
>   				dnprintable = dvname;
>   				break;
>   			}
> -			if (jnext == 0)
> +			if (next != dv)
>   				continue;
>   		} else if (strcmp(dv->devname, "missing") == 0) {
>   			if (dv->disposition != 'a' || dv->re_add == 0) {
>


* Re: mdadm: can't remove failed/detached drives when using metadata 1.x
  2011-02-14 14:05   ` Rémi Rérolle
@ 2011-02-15  0:05     ` NeilBrown
  0 siblings, 0 replies; 4+ messages in thread
From: NeilBrown @ 2011-02-15  0:05 UTC (permalink / raw)
  To: Rémi Rérolle; +Cc: linux-raid

On Mon, 14 Feb 2011 15:05:25 +0100 Rémi Rérolle <rrerolle@lacie.com> wrote:


> > Thanks for the report - especially for bisecting it down to the erroneous
> > commit!
> >
> > This patch should fix the regression.  I'll ensure it is in all future
> > releases.
> >
> 
> Hi Neil,
> 
> I've tested your patch with the setup that was causing me trouble. It 
> did fix the regression.
> 

Great - thanks for the confirmation.

NeilBrown

