* 4 out of 16 drives show up as 'removed'
@ 2011-12-07 20:42 Eli Morris
  2011-12-07 20:51 ` Mathias Burén
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Eli Morris @ 2011-12-07 20:42 UTC (permalink / raw)
  To: linux-raid

Hi All,

 I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 

These are the dreaded Caviar Green drives. We bought 16 of them as an upgrade for a hardware RAID originally, because the tech from that company said they would work fine. After running them for a while, four drives dropped out of that array. So I put them in the software RAID expansion chassis they are in now, thinking I might have better luck. In this configuration, this happened once before. That time, the drives looked to all have significant numbers of bad sectors, so I got those ones replaced and thought that that might have been the problem all along. Now it has happened again. So I have two fairly predictable questions and I'm hoping someone might be able to offer a suggestion:

1) Any ideas on how to get this array working again without starting from scratch? It's all backup data, so it's not do or die, but it is also 30 TB and I really don't want to rebuild the whole thing again from scratch.

I tried the re-add command and the error was something like 'not allowed'

2) Any idea on how to stop this from happening again? I was thinking of playing with the disk timeout in the OS (not the one on the drive firmware). 
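
Something along these lines is what I have in mind, though I'm not sure it's the right knob - 180 is a guess, and sdb is just an example device:

   cat /sys/block/sdb/device/timeout        # current SCSI command timeout in seconds (default 30)
   echo 180 > /sys/block/sdb/device/timeout # raise it well above the drive's internal error recovery time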

If anyone can help, I'd greatly appreciate it, because, at this point, I have no idea what to do about this mess.

Thanks!

Eli


[root@stratus ~]# mdadm --detail /dev/md5
/dev/md5:
        Version : 1.2
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
   Raid Devices : 16
  Total Devices : 13
    Persistence : Superblock is persistent

    Update Time : Mon Dec  5 12:52:46 2011
          State : active, FAILED, Not Started
 Active Devices : 12
Working Devices : 13
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
           UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
         Events : 32

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       8       65        4      active sync   /dev/sde1
       5       8       81        5      active sync   /dev/sdf1
       6       8       97        6      active sync   /dev/sdg1
       7       8      113        7      active sync   /dev/sdh1
       8       0        0        8      removed
       9       8      145        9      active sync   /dev/sdj1
      10       8      161       10      active sync   /dev/sdk1
      11       8      177       11      active sync   /dev/sdl1
      12       8      193       12      active sync   /dev/sdm1
      13       8      209       13      active sync   /dev/sdn1
      14       0        0       14      removed
      15       0        0       15      removed

      16       8      225        -      spare   /dev/sdo1
[root@stratus ~]# 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-07 20:42 4 out of 16 drives show up as 'removed' Eli Morris
@ 2011-12-07 20:51 ` Mathias Burén
  2011-12-07 20:57 ` NeilBrown
  2011-12-09 19:38 ` Stan Hoeppner
  2 siblings, 0 replies; 24+ messages in thread
From: Mathias Burén @ 2011-12-07 20:51 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

On 7 December 2011 20:42, Eli Morris <ermorris@ucsc.edu> wrote:
> Hi All,
>
>  I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live.
>
> These are the dreaded Caviar Green drives. We bought 16 of them as an upgrade for a hardware RAID originally, because the tech from that company said they would work fine. After running them for a while, four drives dropped out of that array. So I put them in the software RAID expansion chassis they are in now, thinking I might have better luck. In this configuration, this happened once before. That time, the drives looked to all have significant numbers of bad sectors, so I got those ones replaced and thought that that might have been the problem all along. Now it has happened again. So I have two fairly predictable questions and I'm hoping someone might be able to offer a suggestion:
>
> 1) Any ideas on how to get this array working again without starting from scratch? It's all backup data, so it's not do or die, but it is also 30 TB and I really don't want to rebuild the whole thing again from scratch.
>
> I tried the re-add command and the error was something like 'not allowed'
>
> 2) Any idea on how to stop this from happening again? I was thinking of playing with the disk timeout in the OS (not the one on the drive firmware).
>
> If anyone can help, I'd greatly appreciate it, because, at this point, I have no idea what to do about this mess.
>
> Thanks!
>
> Eli
>
>
> [root@stratus ~]# mdadm --detail /dev/md5
> /dev/md5:
>        Version : 1.2
>  Creation Time : Wed Oct 12 16:32:41 2011
>     Raid Level : raid5
>  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>   Raid Devices : 16
>  Total Devices : 13
>    Persistence : Superblock is persistent
>
>    Update Time : Mon Dec  5 12:52:46 2011
>          State : active, FAILED, Not Started
>  Active Devices : 12
> Working Devices : 13
>  Failed Devices : 0
>  Spare Devices : 1
>
>         Layout : left-symmetric
>     Chunk Size : 512K
>
>           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
>           UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
>         Events : 32
>
>    Number   Major   Minor   RaidDevice State
>       0       8        1        0      active sync   /dev/sda1
>       1       0        0        1      removed
>       2       8       33        2      active sync   /dev/sdc1
>       3       8       49        3      active sync   /dev/sdd1
>       4       8       65        4      active sync   /dev/sde1
>       5       8       81        5      active sync   /dev/sdf1
>       6       8       97        6      active sync   /dev/sdg1
>       7       8      113        7      active sync   /dev/sdh1
>       8       0        0        8      removed
>       9       8      145        9      active sync   /dev/sdj1
>      10       8      161       10      active sync   /dev/sdk1
>      11       8      177       11      active sync   /dev/sdl1
>      12       8      193       12      active sync   /dev/sdm1
>      13       8      209       13      active sync   /dev/sdn1
>      14       0        0       14      removed
>      15       0        0       15      removed
>
>      16       8      225        -      spare   /dev/sdo1
> [root@stratus ~]#
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Hi,

To rule out bad disks, can you post the smartctl -a output of all the
removed drives? (If you can get the OS to see them again.)
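
Something like this would gather them all in one go (sdb/sdi/sdo/sdp are
just placeholders for whatever the removed drives come back as):

    for d in /dev/sd{b,i,o,p}; do echo "=== $d ==="; smartctl -a $d; done > smart-removed.txt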

Also, do you have any log files from when this happened? (kernel log,
dmesg, syslog etc)

Regards,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-07 20:42 4 out of 16 drives show up as 'removed' Eli Morris
  2011-12-07 20:51 ` Mathias Burén
@ 2011-12-07 20:57 ` NeilBrown
  2011-12-07 22:00   ` Eli Morris
  2011-12-09 19:38 ` Stan Hoeppner
  2 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2011-12-07 20:57 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4337 bytes --]

On Wed, 7 Dec 2011 12:42:26 -0800 Eli Morris <ermorris@ucsc.edu> wrote:

> Hi All,
> 
>  I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 
> 
> These are the dreaded Caviar Green drives. We bought 16 of them as an upgrade for a hardware RAID originally, because the tech from that company said they would work fine. After running them for a while, four drives dropped out of that array. So I put them in the software RAID expansion chassis they are in now, thinking I might have better luck. In this configuration, this happened once before. That time, the drives looked to all have significant numbers of bad sectors, so I got those ones replaced and thought that that might have been the problem all along. Now it has happened again. So I have two fairly predictable questions and I'm hoping someone might be able to offer a suggestion:
> 
> 1) Any ideas on how to get this array working again without starting from scratch? It's all backup data, so it's not do or die, but it is also 30 TB and I really don't want to rebuild the whole thing again from scratch.

1/ Stop the array
    mdadm -S /dev/md5

2/ Make sure you can read all of the devices
 
    mdadm -E /dev/some-device

3/ When you are confident that the hardware is actually working, reassemble
   the array with --force

    mdadm -A /dev/md5 --force /dev/sd[a-o]1
(or whatever gets you a list of devices.)
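
For step 2 it is also worth eyeballing the event counts and update times on all
the members before forcing anything - just a sketch, adjust the glob to your
devices:

    for d in /dev/sd[a-p]1; do
        echo "== $d"
        mdadm -E $d | grep -E 'Events|Update Time|Array State'
    done

If the 'removed' drives show event counts close to the rest, a forced assembly
has a good chance of coming back consistent.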

> 
> I tried the re-add command and the error was something like 'not allowed'
> 
> 2) Any idea on how to stop this from happening again? I was thinking of playing with the disk timeout in the OS (not the one on the drive firmware). 

Cannot help there, sorry - and you really should solve this issue before you
put the array back together or it'll just all happen again.

NeilBrown

> 
> If anyone can help, I'd greatly appreciate it, because, at this point, I have no idea what to do about this mess.
> 
> Thanks!
> 
> Eli
> 
> 
> [root@stratus ~]# mdadm --detail /dev/md5
> /dev/md5:
>         Version : 1.2
>   Creation Time : Wed Oct 12 16:32:41 2011
>      Raid Level : raid5
>   Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>    Raid Devices : 16
>   Total Devices : 13
>     Persistence : Superblock is persistent
> 
>     Update Time : Mon Dec  5 12:52:46 2011
>           State : active, FAILED, Not Started
>  Active Devices : 12
> Working Devices : 13
>  Failed Devices : 0
>   Spare Devices : 1
> 
>          Layout : left-symmetric
>      Chunk Size : 512K
> 
>            Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
>            UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
>          Events : 32
> 
>     Number   Major   Minor   RaidDevice State
>        0       8        1        0      active sync   /dev/sda1
>        1       0        0        1      removed
>        2       8       33        2      active sync   /dev/sdc1
>        3       8       49        3      active sync   /dev/sdd1
>        4       8       65        4      active sync   /dev/sde1
>        5       8       81        5      active sync   /dev/sdf1
>        6       8       97        6      active sync   /dev/sdg1
>        7       8      113        7      active sync   /dev/sdh1
>        8       0        0        8      removed
>        9       8      145        9      active sync   /dev/sdj1
>       10       8      161       10      active sync   /dev/sdk1
>       11       8      177       11      active sync   /dev/sdl1
>       12       8      193       12      active sync   /dev/sdm1
>       13       8      209       13      active sync   /dev/sdn1
>       14       0        0       14      removed
>       15       0        0       15      removed
> 
>       16       8      225        -      spare   /dev/sdo1
> [root@stratus ~]# 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-07 20:57 ` NeilBrown
@ 2011-12-07 22:00   ` Eli Morris
  2011-12-07 22:16     ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Morris @ 2011-12-07 22:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Dec 7, 2011, at 12:57 PM, NeilBrown wrote:

> On Wed, 7 Dec 2011 12:42:26 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> 
>> Hi All,
>> 
>> I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 
>> 
>> These are the dreaded Caviar Green drives. We bought 16 of them as an upgrade for a hardware RAID originally, because the tech from that company said they would work fine. After running them for a while, four drives dropped out of that array. So I put them in the software RAID expansion chassis they are in now, thinking I might have better luck. In this configuration, this happened once before. That time, the drives looked to all have significant numbers of bad sectors, so I got those ones replaced and thought that that might have been the problem all along. Now it has happened again. So I have two fairly predictable questions and I'm hoping someone might be able to offer a suggestion:
>> 
>> 1) Any ideas on how to get this array working again without starting from scratch? It's all backup data, so it's not do or die, but it is also 30 TB and I really don't want to rebuild the whole thing again from scratch.
> 
> 1/ Stop the array
>    mdadm -S /dev/md5
> 
> 2/ Make sure you can read all of the devices
> 
>    mdadm -E /dev/some-device
> 
> 3/ When you are confident that the hardware is actually working, reassemble
>   the array with --force
> 
>    mdadm -A /dev/md5 --force /dev/sd[a-o]1
> (or whatever gets you a list of devices.)
> 
>> 
>> I tried the re-add command and the error was something like 'not allowed'
>> 
>> 2) Any idea on how to stop this from happening again? I was thinking of playing with the disk timeout in the OS (not the one on the drive firmware). 
> 
> Cannot help there, sorry - and you really should solve this issue before you
> put the array back together or it'll just all happen again.
> 
> NeilBrown
> 
>> 
>> If anyone can help, I'd greatly appreciate it, because, at this point, I have no idea what to do about this mess.
>> 
>> Thanks!
>> 
>> Eli
>> 
>> 
>> [root@stratus ~]# mdadm --detail /dev/md5
>> /dev/md5:
>>        Version : 1.2
>>  Creation Time : Wed Oct 12 16:32:41 2011
>>     Raid Level : raid5
>>  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>>   Raid Devices : 16
>>  Total Devices : 13
>>    Persistence : Superblock is persistent
>> 
>>    Update Time : Mon Dec  5 12:52:46 2011
>>          State : active, FAILED, Not Started
>> Active Devices : 12
>> Working Devices : 13
>> Failed Devices : 0
>>  Spare Devices : 1
>> 
>>         Layout : left-symmetric
>>     Chunk Size : 512K
>> 
>>           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
>>           UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
>>         Events : 32
>> 
>>    Number   Major   Minor   RaidDevice State
>>       0       8        1        0      active sync   /dev/sda1
>>       1       0        0        1      removed
>>       2       8       33        2      active sync   /dev/sdc1
>>       3       8       49        3      active sync   /dev/sdd1
>>       4       8       65        4      active sync   /dev/sde1
>>       5       8       81        5      active sync   /dev/sdf1
>>       6       8       97        6      active sync   /dev/sdg1
>>       7       8      113        7      active sync   /dev/sdh1
>>       8       0        0        8      removed
>>       9       8      145        9      active sync   /dev/sdj1
>>      10       8      161       10      active sync   /dev/sdk1
>>      11       8      177       11      active sync   /dev/sdl1
>>      12       8      193       12      active sync   /dev/sdm1
>>      13       8      209       13      active sync   /dev/sdn1
>>      14       0        0       14      removed
>>      15       0        0       15      removed
>> 
>>      16       8      225        -      spare   /dev/sdo1
>> [root@stratus ~]# 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Hi Neil,

Thanks. I gave it a try and I think I got close to getting it back. Maybe. Here is the output from one of the drives that showed up as 'removed' below. It looks OK to me, but I'm not really sure what trouble signs to look for. After stopping the array, I tried to reconstruct it, and here is what I got below. I don't know why the drives would be busy. Short of rebooting, which I can't do at the moment, is there a way to check why they are busy and force them to stop? I don't have them mounted or anything. Or do you think that means the hardware is not responding properly?

Thanks,

Eli

mdadm -A /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1
mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
mdadm: failed to add /dev/sdp1 to /dev/md5: Device or resource busy
mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.

[root@stratus ~]# mdadm -E /dev/sdo1
/dev/sdo1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a31ac4c9:a37eb6e8:97aa7298:b3b0bf8d

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : f80904f1 - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-07 22:00   ` Eli Morris
@ 2011-12-07 22:16     ` NeilBrown
  2011-12-07 23:42       ` Eli Morris
  2011-12-08 19:17       ` Eli Morris
  0 siblings, 2 replies; 24+ messages in thread
From: NeilBrown @ 2011-12-07 22:16 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 5940 bytes --]

On Wed, 7 Dec 2011 14:00:00 -0800 Eli Morris <ermorris@ucsc.edu> wrote:

> 
> On Dec 7, 2011, at 12:57 PM, NeilBrown wrote:
> 
> > On Wed, 7 Dec 2011 12:42:26 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> > 
> >> Hi All,
> >> 
> >> I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 
> >> 
> >> These are the dreaded Caviar Green drives. We bought 16 of them as an upgrade for a hardware RAID originally, because the tech from that company said they would work fine. After running them for a while, four drives dropped out of that array. So I put them in the software RAID expansion chassis they are in now, thinking I might have better luck. In this configuration, this happened once before. That time, the drives looked to all have significant numbers of bad sectors, so I got those ones replaced and thought that that might have been the problem all along. Now it has happened again. So I have two fairly predictable questions and I'm hoping someone might be able to offer a suggestion:
> >> 
> >> 1) Any ideas on how to get this array working again without starting from scratch? It's all backup data, so it's not do or die, but it is also 30 TB and I really don't want to rebuild the whole thing again from scratch.
> > 
> > 1/ Stop the array
> >    mdadm -S /dev/md5
> > 
> > 2/ Make sure you can read all of the devices
> > 
> >    mdadm -E /dev/some-device
> > 
> > 3/ When you are confident that the hardware is actually working, reassemble
> >   the array with --force
> > 
> >    mdadm -A /dev/md5 --force /dev/sd[a-o]1
> > (or whatever gets you a list of devices.)
> > 
> >> 
> >> I tried the re-add command and the error was something like 'not allowed'
> >> 
> >> 2) Any idea on how to stop this from happening again? I was thinking of playing with the disk timeout in the OS (not the one on the drive firmware). 
> > 
> > Cannot help there, sorry - and you really should solve this issue before you
> > put the array back together or it'll just all happen again.
> > 
> > NeilBrown
> > 
> >> 
> >> If anyone can help, I'd greatly appreciate it, because, at this point, I have no idea what to do about this mess.
> >> 
> >> Thanks!
> >> 
> >> Eli
> >> 
> >> 
> >> [root@stratus ~]# mdadm --detail /dev/md5
> >> /dev/md5:
> >>        Version : 1.2
> >>  Creation Time : Wed Oct 12 16:32:41 2011
> >>     Raid Level : raid5
> >>  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
> >>   Raid Devices : 16
> >>  Total Devices : 13
> >>    Persistence : Superblock is persistent
> >> 
> >>    Update Time : Mon Dec  5 12:52:46 2011
> >>          State : active, FAILED, Not Started
> >> Active Devices : 12
> >> Working Devices : 13
> >> Failed Devices : 0
> >>  Spare Devices : 1
> >> 
> >>         Layout : left-symmetric
> >>     Chunk Size : 512K
> >> 
> >>           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
> >>           UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
> >>         Events : 32
> >> 
> >>    Number   Major   Minor   RaidDevice State
> >>       0       8        1        0      active sync   /dev/sda1
> >>       1       0        0        1      removed
> >>       2       8       33        2      active sync   /dev/sdc1
> >>       3       8       49        3      active sync   /dev/sdd1
> >>       4       8       65        4      active sync   /dev/sde1
> >>       5       8       81        5      active sync   /dev/sdf1
> >>       6       8       97        6      active sync   /dev/sdg1
> >>       7       8      113        7      active sync   /dev/sdh1
> >>       8       0        0        8      removed
> >>       9       8      145        9      active sync   /dev/sdj1
> >>      10       8      161       10      active sync   /dev/sdk1
> >>      11       8      177       11      active sync   /dev/sdl1
> >>      12       8      193       12      active sync   /dev/sdm1
> >>      13       8      209       13      active sync   /dev/sdn1
> >>      14       0        0       14      removed
> >>      15       0        0       15      removed
> >> 
> >>      16       8      225        -      spare   /dev/sdo1
> >> [root@stratus ~]# 
> >> 
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> Hi Neil,
> 
> Thanks. I gave it a try and I think I got close to getting it back. Maybe. Here is the output from one of the drives that showed up as 'removed' below. It looks OK to me, but I'm not really sure what trouble signs to look for. After stopping the array, I tried to reconstruct it, and here is what I got below. I don't know why the drives would be busy. Short of rebooting, which I can't do at the moment, is there a way to check why they are busy and force them to stop? I don't have them mounted or anything. Or do you think that means the hardware is not responding properly?
> 
> Thanks,
> 
> Eli
> 
> mdadm -A /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1
> mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
> mdadm: failed to add /dev/sdp1 to /dev/md5: Device or resource busy
> mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.

This means that the device is busy....
Maybe it got attached to another md array.  What is in /proc/mdstat?  Maybe you
have to stop something else.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-07 22:16     ` NeilBrown
@ 2011-12-07 23:42       ` Eli Morris
  2011-12-08 19:17       ` Eli Morris
  1 sibling, 0 replies; 24+ messages in thread
From: Eli Morris @ 2011-12-07 23:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Dec 7, 2011, at 2:16 PM, NeilBrown wrote:

> On Wed, 7 Dec 2011 14:00:00 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> 
>> 
>> On Dec 7, 2011, at 12:57 PM, NeilBrown wrote:
>> 
>>> On Wed, 7 Dec 2011 12:42:26 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 
>>>> 
>>>> These are the dreaded Caviar Green drives. We bought 16 of them as an upgrade for a hardware RAID originally, because the tech from that company said they would work fine. After running them for a while, four drives dropped out of that array. So I put them in the software RAID expansion chassis they are in now, thinking I might have better luck. In this configuration, this happened once before. That time, the drives looked to all have significant numbers of bad sectors, so I got those ones replaced and thought that that might have been the problem all along. Now it has happened again. So I have two fairly predictable questions and I'm hoping someone might be able to offer a suggestion:
>>>> 
>>>> 1) Any ideas on how to get this array working again without starting from scratch? It's all backup data, so it's not do or die, but it is also 30 TB and I really don't want to rebuild the whole thing again from scratch.
>>> 
>>> 1/ Stop the array
>>>   mdadm -S /dev/md5
>>> 
>>> 2/ Make sure you can read all of the devices
>>> 
>>>   mdadm -E /dev/some-device
>>> 
>>> 3/ When you are confident that the hardware is actually working, reassemble
>>>  the array with --force
>>> 
>>>   mdadm -A /dev/md5 --force /dev/sd[a-o]1
>>> (or whatever gets you a list of devices.)
>>> 
>>>> 
>>>> I tried the re-add command and the error was something like 'not allowed'
>>>> 
>>>> 2) Any idea on how to stop this from happening again? I was thinking of playing with the disk timeout in the OS (not the one on the drive firmware). 
>>> 
>>> Cannot help there, sorry - and you really should solve this issue before you
>>> put the array back together or it'll just all happen again.
>>> 
>>> NeilBrown
>>> 
>>>> 
>>>> If anyone can help, I'd greatly appreciate it, because, at this point, I have no idea what to do about this mess.
>>>> 
>>>> Thanks!
>>>> 
>>>> Eli
>>>> 
>>>> 
>>>> [root@stratus ~]# mdadm --detail /dev/md5
>>>> /dev/md5:
>>>>       Version : 1.2
>>>> Creation Time : Wed Oct 12 16:32:41 2011
>>>>    Raid Level : raid5
>>>> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>>>>  Raid Devices : 16
>>>> Total Devices : 13
>>>>   Persistence : Superblock is persistent
>>>> 
>>>>   Update Time : Mon Dec  5 12:52:46 2011
>>>>         State : active, FAILED, Not Started
>>>> Active Devices : 12
>>>> Working Devices : 13
>>>> Failed Devices : 0
>>>> Spare Devices : 1
>>>> 
>>>>        Layout : left-symmetric
>>>>    Chunk Size : 512K
>>>> 
>>>>          Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
>>>>          UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
>>>>        Events : 32
>>>> 
>>>>   Number   Major   Minor   RaidDevice State
>>>>      0       8        1        0      active sync   /dev/sda1
>>>>      1       0        0        1      removed
>>>>      2       8       33        2      active sync   /dev/sdc1
>>>>      3       8       49        3      active sync   /dev/sdd1
>>>>      4       8       65        4      active sync   /dev/sde1
>>>>      5       8       81        5      active sync   /dev/sdf1
>>>>      6       8       97        6      active sync   /dev/sdg1
>>>>      7       8      113        7      active sync   /dev/sdh1
>>>>      8       0        0        8      removed
>>>>      9       8      145        9      active sync   /dev/sdj1
>>>>     10       8      161       10      active sync   /dev/sdk1
>>>>     11       8      177       11      active sync   /dev/sdl1
>>>>     12       8      193       12      active sync   /dev/sdm1
>>>>     13       8      209       13      active sync   /dev/sdn1
>>>>     14       0        0       14      removed
>>>>     15       0        0       15      removed
>>>> 
>>>>     16       8      225        -      spare   /dev/sdo1
>>>> [root@stratus ~]# 
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> 
>> 
>> Hi Neil,
>> 
>> Thanks. I gave it a try and I think I got close to getting it back. Maybe. Here is the output from one of the drives that showed up as 'removed' below. It looks OK to me, but I'm not really sure what trouble signs to look for. After stopping the array, I tried to reconstruct it, and here is what I got below. I don't know why the drives would be busy. Short of rebooting, which I can't do at the moment, is there a way to check why they are busy and force them to stop? I don't have them mounted or anything. Or do you think that means the hardware is not responding properly?
>> 
>> Thanks,
>> 
>> Eli
>> 
>> mdadm -A /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1
>> mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
>> mdadm: failed to add /dev/sdp1 to /dev/md5: Device or resource busy
>> mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.
> 
> This means that the device is busy....
> Maybe it got attached to another md array.  What is in /proc/mdstat?  Maybe you
> have to stop something else.
> 
> NeilBrown


I can't imagine what would be using it though. It was being used as part of this RAID set, so it's not going to be mounted somewhere else.

[root@stratus ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
unused devices: <none>

thanks,

Eli


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-07 22:16     ` NeilBrown
  2011-12-07 23:42       ` Eli Morris
@ 2011-12-08 19:17       ` Eli Morris
  2011-12-08 19:51         ` NeilBrown
  1 sibling, 1 reply; 24+ messages in thread
From: Eli Morris @ 2011-12-08 19:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Dec 7, 2011, at 2:16 PM, NeilBrown wrote:

> On Wed, 7 Dec 2011 14:00:00 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> 
>> 
>> On Dec 7, 2011, at 12:57 PM, NeilBrown wrote:
>> 
>>> On Wed, 7 Dec 2011 12:42:26 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 
>>>> 
>>>> These are the dreaded Caviar Green drives. We bought 16 of them as an upgrade for a hardware RAID originally, because the tech from that company said they would work fine. After running them for a while, four drives dropped out of that array. So I put them in the software RAID expansion chassis they are in now, thinking I might have better luck. In this configuration, this happened once before. That time, the drives looked to all have significant numbers of bad sectors, so I got those ones replaced and thought that that might have been the problem all along. Now it has happened again. So I have two fairly predictable questions and I'm hoping someone might be able to offer a suggestion:
>>>> 
>>>> 1) Any ideas on how to get this array working again without starting from scratch? It's all backup data, so it's not do or die, but it is also 30 TB and I really don't want to rebuild the whole thing again from scratch.
>>> 
>>> 1/ Stop the array
>>>   mdadm -S /dev/md5
>>> 
>>> 2/ Make sure you can read all of the devices
>>> 
>>>   mdadm -E /dev/some-device
>>> 
>>> 3/ When you are confident that the hardware is actually working, reassemble
>>>  the array with --force
>>> 
>>>   mdadm -A /dev/md5 --force /dev/sd[a-o]1
>>> (or whatever gets you a list of devices.)
>>> 
>>>> 
>>>> I tried the re-add command and the error was something like 'not allowed'
>>>> 
>>>> 2) Any idea on how to stop this from happening again? I was thinking of playing with the disk timeout in the OS (not the one on the drive firmware). 
>>> 
>>> Cannot help there, sorry - and you really should solve this issue before you
>>> put the array back together or it'll just all happen again.
>>> 
>>> NeilBrown
>>> 
>>>> 
>>>> If anyone can help, I'd greatly appreciate it, because, at this point, I have no idea what to do about this mess.
>>>> 
>>>> Thanks!
>>>> 
>>>> Eli
>>>> 
>>>> 
>>>> [root@stratus ~]# mdadm --detail /dev/md5
>>>> /dev/md5:
>>>>       Version : 1.2
>>>> Creation Time : Wed Oct 12 16:32:41 2011
>>>>    Raid Level : raid5
>>>> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>>>>  Raid Devices : 16
>>>> Total Devices : 13
>>>>   Persistence : Superblock is persistent
>>>> 
>>>>   Update Time : Mon Dec  5 12:52:46 2011
>>>>         State : active, FAILED, Not Started
>>>> Active Devices : 12
>>>> Working Devices : 13
>>>> Failed Devices : 0
>>>> Spare Devices : 1
>>>> 
>>>>        Layout : left-symmetric
>>>>    Chunk Size : 512K
>>>> 
>>>>          Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
>>>>          UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
>>>>        Events : 32
>>>> 
>>>>   Number   Major   Minor   RaidDevice State
>>>>      0       8        1        0      active sync   /dev/sda1
>>>>      1       0        0        1      removed
>>>>      2       8       33        2      active sync   /dev/sdc1
>>>>      3       8       49        3      active sync   /dev/sdd1
>>>>      4       8       65        4      active sync   /dev/sde1
>>>>      5       8       81        5      active sync   /dev/sdf1
>>>>      6       8       97        6      active sync   /dev/sdg1
>>>>      7       8      113        7      active sync   /dev/sdh1
>>>>      8       0        0        8      removed
>>>>      9       8      145        9      active sync   /dev/sdj1
>>>>     10       8      161       10      active sync   /dev/sdk1
>>>>     11       8      177       11      active sync   /dev/sdl1
>>>>     12       8      193       12      active sync   /dev/sdm1
>>>>     13       8      209       13      active sync   /dev/sdn1
>>>>     14       0        0       14      removed
>>>>     15       0        0       15      removed
>>>> 
>>>>     16       8      225        -      spare   /dev/sdo1
>>>> [root@stratus ~]# 
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> 
>> 
>> Hi Neil,
>> 
>> Thanks. I gave it a try and I think I got close to getting it back. Maybe. Here is the output from one of the drives that showed up as 'removed' below. It looks OK to me, but I'm not really sure what trouble signs to look for. After stopping the array, I tried to reconstruct it, and here is what I got below. I don't know why the drives would be busy. Short of rebooting, which I can't do at the moment, is there a way to check why they are busy and force them to stop? I don't have them mounted or anything. Or do you think that means the hardware is not responding properly?
>> 
>> Thanks,
>> 
>> Eli
>> 
>> mdadm -A /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1
>> mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
>> mdadm: failed to add /dev/sdp1 to /dev/md5: Device or resource busy
>> mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.
> 
> This means that the device is busy....
> > Maybe it got attached to another md array.  What is in /proc/mdstat?  Maybe you
> have to stop something else.
> 
> NeilBrown

I found somewhere that dmraid can grab the drives and not release them, so I removed the dmraid packages and set the nodmraid flag on the boot line. Since I did that, I get:

mdadm: cannot open device /dev/sda1: Device or resource busy
mdadm: /dev/sda1 has no superblock - assembly aborted

which is a little odd, since last time it complained that /dev/sdo1 and /dev/sdp1 were busy and didn't say anything about /dev/sda1. Anyway though, I read some instructions here:

http://en.wikipedia.org/wiki/Mdadm#Known_problems

that suggest that I zero the superblock on /dev/sda1

I don't know too much about this, but I thought the superblock contained information about the RAID array. If I zero it, will that screw up the array that I'm trying to recover or is it the thing to try? I also am wondering if this might have caused the problem to begin with, like dmraid grabbed four of my drives when I did the last routine reboot, since I had four drives come up as "removed" all of a sudden. 

thanks for any advice,

Eli
 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-08 19:17       ` Eli Morris
@ 2011-12-08 19:51         ` NeilBrown
  2011-12-08 20:39           ` Eli Morris
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2011-12-08 19:51 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 8086 bytes --]

On Thu, 8 Dec 2011 11:17:12 -0800 Eli Morris <ermorris@ucsc.edu> wrote:

> 
> On Dec 7, 2011, at 2:16 PM, NeilBrown wrote:
> 
> > On Wed, 7 Dec 2011 14:00:00 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> > 
> >> 
> >> On Dec 7, 2011, at 12:57 PM, NeilBrown wrote:
> >> 
> >>> On Wed, 7 Dec 2011 12:42:26 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> >>> 
> >>>> Hi All,
> >>>> 
> >>>> I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 
> >>>> 
> >>>> These are the dreaded Caviar Green drives. We bought 16 of them as an upgrade for a hardware RAID originally, because the tech from that company said they would work fine. After running them for a while, four drives dropped out of that array. So I put them in the software RAID expansion chassis they are in now, thinking I might have better luck. In this configuration, this happened once before. That time, the drives looked to all have significant numbers of bad sectors, so I got those ones replaced and thought that that might have been the problem all along. Now it has happened again. So I have two fairly predictable questions and I'm hoping someone might be able to offer a suggestion:
> >>>> 
> >>>> 1) Any ideas on how to get this array working again without starting from scratch? It's all backup data, so it's not do or die, but it is also 30 TB and I really don't want to rebuild the whole thing again from scratch.
> >>> 
> >>> 1/ Stop the array
> >>>   mdadm -S /dev/md5
> >>> 
> >>> 2/ Make sure you can read all of the devices
> >>> 
> >>>   mdadm -E /dev/some-device
> >>> 
> >>> 3/ When you are confident that the hardware is actually working, reassemble
> >>>  the array with --force
> >>> 
> >>>   mdadm -A /dev/md5 --force /dev/sd[a-o]1
> >>> (or whatever gets you a list of devices.)
> >>> 
> >>>> 
> >>>> I tried the re-add command and the error was something like 'not allowed'
> >>>> 
> >>>> 2) Any idea on how to stop this from happening again? I was thinking of playing with the disk timeout in the OS (not the one on the drive firmware). 
> >>> 
> >>> Cannot help there, sorry - and you really should solve this issue before you
> >>> put the array back together or it'll just all happen again.
> >>> 
> >>> NeilBrown
> >>> 
> >>>> 
> >>>> If anyone can help, I'd greatly appreciate it, because, at this point, I have no idea what to do about this mess.
> >>>> 
> >>>> Thanks!
> >>>> 
> >>>> Eli
> >>>> 
> >>>> 
> >>>> [root@stratus ~]# mdadm --detail /dev/md5
> >>>> /dev/md5:
> >>>>       Version : 1.2
> >>>> Creation Time : Wed Oct 12 16:32:41 2011
> >>>>    Raid Level : raid5
> >>>> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
> >>>>  Raid Devices : 16
> >>>> Total Devices : 13
> >>>>   Persistence : Superblock is persistent
> >>>> 
> >>>>   Update Time : Mon Dec  5 12:52:46 2011
> >>>>         State : active, FAILED, Not Started
> >>>> Active Devices : 12
> >>>> Working Devices : 13
> >>>> Failed Devices : 0
> >>>> Spare Devices : 1
> >>>> 
> >>>>        Layout : left-symmetric
> >>>>    Chunk Size : 512K
> >>>> 
> >>>>          Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
> >>>>          UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
> >>>>        Events : 32
> >>>> 
> >>>>   Number   Major   Minor   RaidDevice State
> >>>>      0       8        1        0      active sync   /dev/sda1
> >>>>      1       0        0        1      removed
> >>>>      2       8       33        2      active sync   /dev/sdc1
> >>>>      3       8       49        3      active sync   /dev/sdd1
> >>>>      4       8       65        4      active sync   /dev/sde1
> >>>>      5       8       81        5      active sync   /dev/sdf1
> >>>>      6       8       97        6      active sync   /dev/sdg1
> >>>>      7       8      113        7      active sync   /dev/sdh1
> >>>>      8       0        0        8      removed
> >>>>      9       8      145        9      active sync   /dev/sdj1
> >>>>     10       8      161       10      active sync   /dev/sdk1
> >>>>     11       8      177       11      active sync   /dev/sdl1
> >>>>     12       8      193       12      active sync   /dev/sdm1
> >>>>     13       8      209       13      active sync   /dev/sdn1
> >>>>     14       0        0       14      removed
> >>>>     15       0        0       15      removed
> >>>> 
> >>>>     16       8      225        -      spare   /dev/sdo1
> >>>> [root@stratus ~]# 
> >>>> 
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >>>> the body of a message to majordomo@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> 
> >> 
> >> Hi Neil,
> >> 
> >> Thanks. I gave it a try and I think I got close to getting it back. Maybe. Here is the output from one of the drives that showed up as 'removed' below. It looks OK to me, but I'm not really sure what trouble signs to look for. After stopping the array, I tried to reconstruct it, and here is what I got below. I don't know why the drives would be busy. Short of rebooting, which I can't do at the moment, is there a way to check why they are busy and force them to stop? I don't have them mounted or anything. Or do you think that means the hardware is not responding properly?
> >> 
> >> Thanks,
> >> 
> >> Eli
> >> 
> >> mdadm -A /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1
> >> mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
> >> mdadm: failed to add /dev/sdp1 to /dev/md5: Device or resource busy
> >> mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.
> > 
> > This means that the device is busy....
> > Maybe it got attached to another md array.  What is in /proc/mdstat?  Maybe you
> > have to stop something else.
> > 
> > NeilBrown
> 
> I found somewhere that dmraid can grab the drives and not release them, so I removed the dmraid packages and set the nodmraid flag on the boot line. Since I did that, I get:
> 
> mdadm: cannot open device /dev/sda1: Device or resource busy
> mdadm: /dev/sda1 has no superblock - assembly aborted
> 
> which is a little odd, since last time it complained that /dev/sdo1 and /dev/sdp1 were busy and didn't say anything about /dev/sda1. Anyway though, I read some instructions here:
> 
> http://en.wikipedia.org/wiki/Mdadm#Known_problems
> 
> that suggest that I zero the superblock on /dev/sda1
> 
> I don't know too much about this, but I thought the superblock contained information about the RAID array. If I zero it, will that screw up the array that I'm trying to recover or is it the thing to try? I also am wondering if this might have caused the problem to begin with, like dmraid grabbed four of my drives when I did the last routine reboot, since I had four drives come up as "removed" all of a sudden. 
> 
> thanks for any advice,
> 
> Eli
>  


Don't zero anything until you are sure you know what the problem is and why
that would fix it.  It probably won't in this case.

There are a number of things that can keep a device busy:
 - mounted filesystem - unlikely here
 - enabled as swap - unlikely
 - in an md array - /proc/mdstat shows there aren't any
 - in a dm array - "dmsetup table" will show you, "dmsetup remove_all" will
   remove the dm arrays
 - some process has an exclusive open - again, unlikely.

Cannot think of anything else just now.
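
A quick way to walk through that list, using sdo1 as the example device
(substitute whichever partitions refuse to join):

    grep sdo1 /proc/mounts /proc/swaps   # mounted or used as swap?
    cat /proc/mdstat                     # claimed by another md array?
    dmsetup table                        # claimed by device-mapper?
    ls /sys/block/sdo/sdo1/holders/      # any kernel-level holder of the partition?
    fuser -v /dev/sdo1                   # any process holding it open?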

Are there any messages appearing in the kernel logs (or 'dmesg' output)
when you try to assemble the array?

Try running the --assemble with --verbose and post the result.

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-08 19:51         ` NeilBrown
@ 2011-12-08 20:39           ` Eli Morris
  2011-12-08 20:59             ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Morris @ 2011-12-08 20:39 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Dec 8, 2011, at 11:51 AM, NeilBrown wrote:

> On Thu, 8 Dec 2011 11:17:12 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> 
>> 
>> On Dec 7, 2011, at 2:16 PM, NeilBrown wrote:
>> 
>>> On Wed, 7 Dec 2011 14:00:00 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
>>> 
>>>> 
>>>> On Dec 7, 2011, at 12:57 PM, NeilBrown wrote:
>>>> 
>>>>> On Wed, 7 Dec 2011 12:42:26 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 
>>>>>> 
>>>>>> These are the dreaded Caviar Green drives. We bought 16 of them as an upgrade for a hardware RAID originally, because the tech from that company said they would work fine. After running them for a while, four drives dropped out of that array. So I put them in the software RAID expansion chassis they are in now, thinking I might have better luck. In this configuration, this happened once before. That time, the drives looked to all have significant numbers of bad sectors, so I got those ones replaced and thought that that might have been the problem all along. Now it has happened again. So I have two fairly predictable questions and I'm hoping someone might be able to offer a suggestion:
>>>>>> 
>>>>>> 1) Any ideas on how to get this array working again without starting from scratch? It's all backup data, so it's not do or die, but it is also 30 TB and I really don't want to rebuild the whole thing again from scratch.
>>>>> 
>>>>> 1/ Stop the array
>>>>>  mdadm -S /dev/md5
>>>>> 
>>>>> 2/ Make sure you can read all of the devices
>>>>> 
>>>>>  mdadm -E /dev/some-device
>>>>> 
>>>>> 3/ When you are confident that the hardware is actually working, reassemble
>>>>> the array with --force
>>>>> 
>>>>>  mdadm -A /dev/md5 --force /dev/sd[a-o]1
>>>>> (or whatever gets you a list of devices.)
>>>>> 
>>>>>> 
>>>>>> I tried the re-add command and the error was something like 'not allowed'
>>>>>> 
>>>>>> 2) Any idea on how to stop this from happening again? I was thinking of playing with the disk timeout in the OS (not the one on the drive firmware). 
>>>>> 
>>>>> Cannot help there, sorry - and you really should solve this issue before you
>>>>> put the array back together or it'll just all happen again.
>>>>> 
>>>>> NeilBrown
>>>>> 
>>>>>> 
>>>>>> If anyone can help, I'd greatly appreciate it, because, at this point, I have no idea what to do about this mess.
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> Eli
>>>>>> 
>>>>>> 
>>>>>> [root@stratus ~]# mdadm --detail /dev/md5
>>>>>> /dev/md5:
>>>>>>      Version : 1.2
>>>>>> Creation Time : Wed Oct 12 16:32:41 2011
>>>>>>   Raid Level : raid5
>>>>>> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>>>>>> Raid Devices : 16
>>>>>> Total Devices : 13
>>>>>>  Persistence : Superblock is persistent
>>>>>> 
>>>>>>  Update Time : Mon Dec  5 12:52:46 2011
>>>>>>        State : active, FAILED, Not Started
>>>>>> Active Devices : 12
>>>>>> Working Devices : 13
>>>>>> Failed Devices : 0
>>>>>> Spare Devices : 1
>>>>>> 
>>>>>>       Layout : left-symmetric
>>>>>>   Chunk Size : 512K
>>>>>> 
>>>>>>         Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
>>>>>>         UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
>>>>>>       Events : 32
>>>>>> 
>>>>>>  Number   Major   Minor   RaidDevice State
>>>>>>     0       8        1        0      active sync   /dev/sda1
>>>>>>     1       0        0        1      removed
>>>>>>     2       8       33        2      active sync   /dev/sdc1
>>>>>>     3       8       49        3      active sync   /dev/sdd1
>>>>>>     4       8       65        4      active sync   /dev/sde1
>>>>>>     5       8       81        5      active sync   /dev/sdf1
>>>>>>     6       8       97        6      active sync   /dev/sdg1
>>>>>>     7       8      113        7      active sync   /dev/sdh1
>>>>>>     8       0        0        8      removed
>>>>>>     9       8      145        9      active sync   /dev/sdj1
>>>>>>    10       8      161       10      active sync   /dev/sdk1
>>>>>>    11       8      177       11      active sync   /dev/sdl1
>>>>>>    12       8      193       12      active sync   /dev/sdm1
>>>>>>    13       8      209       13      active sync   /dev/sdn1
>>>>>>    14       0        0       14      removed
>>>>>>    15       0        0       15      removed
>>>>>> 
>>>>>>    16       8      225        -      spare   /dev/sdo1
>>>>>> [root@stratus ~]# 
>>>>>> 
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> 
>>>> 
>>>> Hi Neil,
>>>> 
>>>> Thanks. I gave it a try and I think I got close to getting it back. Maybe. Here is the output from one of the drives that showed up as 'removed' below. It looks OK to me, but I'm not really sure what trouble signs to look for. After stopping the array, I tried to reconstruct it, and here is what I got below. I don't know why the drives would be busy. Short of rebooting, which I can't do at the moment, is there a way to check why they are busy and force them to stop? I don't have them mounted or anything. Or do you think that means the hardware is not responding properly?
>>>> 
>>>> Thanks,
>>>> 
>>>> Eli
>>>> 
>>>> mdadm -A /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1
>>>> mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
>>>> mdadm: failed to add /dev/sdp1 to /dev/md5: Device or resource busy
>>>> mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.
>>> 
>>> This means that the device is busy....
>>> Maybe it got attached to another md array.  What is in /proc/mdstat?  Maybe you
>>> have to stop something else.
>>> 
>>> NeilBrown
>> 
>> I found somewhere that dmraid can grab the drives and not release them, so I removed the dmraid packages and set the nodmraid flag on the boot line. Since I did that, I get:
>> 
>> mdadm: cannot open device /dev/sda1: Device or resource busy
>> mdadm: /dev/sda1 has no superblock - assembly aborted
>> 
>> which is a little odd, since last time it complained that /dev/sdo1 and /dev/sdp1 were busy and didn't say anything about /dev/sda1. Anyway though, I read some instructions here:
>> 
>> http://en.wikipedia.org/wiki/Mdadm#Known_problems
>> 
>> that suggest that I zero the superblock on /dev/sda1
>> 
>> I don't know too much about this, but I thought the superblock contained information about the RAID array. If I zero it, will that screw up the array that I'm trying to recover or is it the thing to try? I also am wondering if this might have caused the problem to begin with, like dmraid grabbed four of my drives when I did the last routine reboot, since I had four drives come up as "removed" all of a sudden. 
>> 
>> thanks for any advice,
>> 
>> Eli
>> 
> 
> 
> Don't zero anything until you are sure you know what the problem is and why
> that would fix it.  It probably won't in this case.
> 
> There are a number of things that can keep a device busy:
> - mounted filesystem - unlikely here
> - enabled as swap - unlikely
> - in an md array - /proc/mdstat shows there aren't any
> - in a dm array - "dmsetup table" will show you, "dmsetup remove_all" will
>   remove the dm arrays
> - some process has an exclusive open - again, unlikely.
> 
> Cannot think of anything else just now.
> 
> Are there any messages appearing in the kernel logs (or 'dmesg' output)
> when you try to assemble the array?
> 
> Try running the --assemble with --verbose and post the result.
> 
> NeilBrown
> 

Thanks. I think I know why /dev/sda was busy last time. I didn't realize that even if the assemble produced an inactive array, it needed to be stopped prior to trying again. After I stopped that array, I assembled the array again and I got the same problem drives as before. Log file output follows.
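
In other words, roughly:

    mdadm -S /dev/md5      # stop the half-assembled, inactive array
    cat /proc/mdstat       # confirm it is gone
    mdadm --verbose --assemble /dev/md5 --force ...   # then retry (full command below)

and that cleared the earlier /dev/sda1 'busy' complaint.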

thanks for your help,

Eli



Here is the messages log:

Dec  8 12:22:29 stratus kernel: md: bind<sdc1>
Dec  8 12:22:29 stratus kernel: md: bind<sdd1>
Dec  8 12:22:29 stratus kernel: md: bind<sde1>
Dec  8 12:22:29 stratus kernel: md: bind<sdf1>
Dec  8 12:22:29 stratus kernel: md: bind<sdg1>
Dec  8 12:22:29 stratus kernel: md: bind<sdh1>
Dec  8 12:22:29 stratus kernel: md: bind<sdj1>
Dec  8 12:22:29 stratus kernel: md: bind<sdk1>
Dec  8 12:22:29 stratus kernel: md: bind<sdl1>
Dec  8 12:22:29 stratus kernel: md: bind<sdm1>
Dec  8 12:22:29 stratus kernel: md: bind<sdn1>
Dec  8 12:22:29 stratus kernel: md: bind<sdb1>
Dec  8 12:22:29 stratus kernel: md: bind<sdi1>
Dec  8 12:22:29 stratus kernel: md: export_rdev(sdo1)
Dec  8 12:22:29 stratus kernel: md: bind<sda1>

Here is dmesg output:

md: bind<sdc1>
md: bind<sdd1>
md: bind<sde1>
md: bind<sdf1>
md: bind<sdg1>
md: bind<sdh1>
md: bind<sdj1>
md: bind<sdk1>
md: bind<sdl1>
md: bind<sdm1>
md: bind<sdn1>
md: bind<sdb1>
md: bind<sdi1>
md: export_rdev(sdo1)
md: bind<sda1>


and here is the verbose assemble output:

[root@stratus log]# mdadm --verbose --assemble /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 
mdadm: looking for devices for /dev/md5
mdadm: /dev/sda1 is identified as a member of /dev/md5, slot 0.
mdadm: /dev/sdb1 is identified as a member of /dev/md5, slot -1.
mdadm: /dev/sdc1 is identified as a member of /dev/md5, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md5, slot 3.
mdadm: /dev/sde1 is identified as a member of /dev/md5, slot 4.
mdadm: /dev/sdf1 is identified as a member of /dev/md5, slot 5.
mdadm: /dev/sdg1 is identified as a member of /dev/md5, slot 6.
mdadm: /dev/sdh1 is identified as a member of /dev/md5, slot 7.
mdadm: /dev/sdi1 is identified as a member of /dev/md5, slot -1.
mdadm: /dev/sdj1 is identified as a member of /dev/md5, slot 9.
mdadm: /dev/sdk1 is identified as a member of /dev/md5, slot 10.
mdadm: /dev/sdl1 is identified as a member of /dev/md5, slot 11.
mdadm: /dev/sdm1 is identified as a member of /dev/md5, slot 12.
mdadm: /dev/sdn1 is identified as a member of /dev/md5, slot 13.
mdadm: /dev/sdo1 is identified as a member of /dev/md5, slot -1.
mdadm: no uptodate device for slot 1 of /dev/md5
mdadm: added /dev/sdc1 to /dev/md5 as 2
mdadm: added /dev/sdd1 to /dev/md5 as 3
mdadm: added /dev/sde1 to /dev/md5 as 4
mdadm: added /dev/sdf1 to /dev/md5 as 5
mdadm: added /dev/sdg1 to /dev/md5 as 6
mdadm: added /dev/sdh1 to /dev/md5 as 7
mdadm: no uptodate device for slot 8 of /dev/md5
mdadm: added /dev/sdj1 to /dev/md5 as 9
mdadm: added /dev/sdk1 to /dev/md5 as 10
mdadm: added /dev/sdl1 to /dev/md5 as 11
mdadm: added /dev/sdm1 to /dev/md5 as 12
mdadm: added /dev/sdn1 to /dev/md5 as 13
mdadm: no uptodate device for slot 14 of /dev/md5
mdadm: no uptodate device for slot 15 of /dev/md5
mdadm: added /dev/sdb1 to /dev/md5 as -1
mdadm: added /dev/sdi1 to /dev/md5 as -1
mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
mdadm: added /dev/sda1 to /dev/md5 as 0
mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.




Here is the output of dmsetup table, and I don't see anything beyond entries that make sense (vol2 and vol3 are LVM volume groups that assemble 2 TB LUNs from hardware RAIDs):

[root@stratus log]# dmsetup table
VolGroup-lv_swap: 0 20578304 linear 65:2 3664791552
VolGroup-lv_root: 0 104857600 linear 65:2 2048
vol2-vol2: 0 4294942720 linear 65:49 384
vol2-vol2: 4294942720 4294942720 linear 65:65 384
vol2-vol2: 8589885440 1177706496 linear 65:81 384
vol3-vol3: 0 4294942720 linear 65:97 384
vol3-vol3: 4294942720 4294942720 linear 65:113 384
vol3-vol3: 8589885440 4294942720 linear 65:129 384
vol3-vol3: 12884828160 789798912 linear 65:145 384
VolGroup-lv_home: 0 3633963008 linear 65:17 2048
VolGroup-lv_home: 3633963008 3559931904 linear 65:2 104859648 

[root@stratus log]# lsscsi
[0:0:0:0]    disk    ATA      WDC WD20EADS-00S 0A01  /dev/sda 
[0:0:1:0]    disk    ATA      WDC WD20EADS-32S 0A01  /dev/sdb 
[0:0:2:0]    disk    ATA      WDC WD20EADS-00S 0A01  /dev/sdc 
[0:0:3:0]    disk    ATA      WDC WD20EADS-00S 0A01  /dev/sdd 
[0:0:4:0]    disk    ATA      WDC WD20EADS-00S 0A01  /dev/sde 
[0:0:5:0]    disk    ATA      WDC WD20EADS-00W 0A01  /dev/sdf 
[0:0:6:0]    disk    ATA      WDC WD20EADS-00S 0A01  /dev/sdg 
[0:0:7:0]    disk    ATA      WDC WD2001FASS-0 1D05  /dev/sdh 
[0:0:8:0]    disk    ATA      WDC WD20EADS-00S 0A01  /dev/sdi 
[0:0:9:0]    disk    ATA      WDC WD20EARS-00M AB51  /dev/sdj 
[0:0:10:0]   disk    ATA      WDC WD20EADS-00S 0A01  /dev/sdk 
[0:0:11:0]   disk    ATA      WDC WD20EADS-00S 0A01  /dev/sdl 
[0:0:12:0]   disk    ATA      WDC WD20EARX-00P AB51  /dev/sdm 
[0:0:13:0]   disk    ATA      WDC WD20EADS-00S 0A01  /dev/sdn 
[0:0:14:0]   disk    ATA      WDC WD20EADS-00S 0A01  /dev/sdo 
[0:0:15:0]   disk    ATA      WDC WD20EADS-00S 0A01  /dev/sdp 
[0:0:16:0]   enclosu Areca    ARC-8026-.01.11. 0111  -       
[2:0:0:0]    cd/dvd  SONY     DVD-ROM DDU810A  KD38  /dev/sr0 
[4:2:0:0]    disk    DELL     PERC 5/i         1.03  /dev/sdq 
[4:2:1:0]    disk    DELL     PERC 5/i         1.03  /dev/sdr 
[5:0:0:0]    disk    RaidWeb. com              R0.0  /dev/sds 
[6:0:2:0]    disk    RaidWeb. Com              0001  /dev/sdt 
[6:0:2:1]    disk    RaidWeb. Com              0001  /dev/sdu 
[6:0:2:2]    disk    RaidWeb. Com              0001  /dev/sdv 
[6:0:3:0]    disk    RaidWeb. Com              0001  /dev/sdw 
[6:0:3:1]    disk    RaidWeb. Com              0001  /dev/sdx 
[6:0:3:2]    disk    RaidWeb. Com              0001  /dev/sdy 
[6:0:3:3]    disk    RaidWeb. Com              0001  /dev/sdz 
[7:0:0:0]    cd/dvd  Dell     Virtual  CDROM   123   /dev/sr1 
[8:0:0:0]    disk    Dell     Virtual  Floppy  123   /dev/sdaa
[root@stratus log]# 




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-08 20:39           ` Eli Morris
@ 2011-12-08 20:59             ` NeilBrown
  2011-12-08 21:42               ` Eli Morris
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2011-12-08 20:59 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2739 bytes --]

On Thu, 8 Dec 2011 12:39:10 -0800 Eli Morris <ermorris@ucsc.edu> wrote:

> 

> 
> and here is the verbose assemble output:
> 
> [root@stratus log]# mdadm --verbose --assemble /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 
> mdadm: looking for devices for /dev/md5
> mdadm: /dev/sda1 is identified as a member of /dev/md5, slot 0.
> mdadm: /dev/sdb1 is identified as a member of /dev/md5, slot -1.
> mdadm: /dev/sdc1 is identified as a member of /dev/md5, slot 2.
> mdadm: /dev/sdd1 is identified as a member of /dev/md5, slot 3.
> mdadm: /dev/sde1 is identified as a member of /dev/md5, slot 4.
> mdadm: /dev/sdf1 is identified as a member of /dev/md5, slot 5.
> mdadm: /dev/sdg1 is identified as a member of /dev/md5, slot 6.
> mdadm: /dev/sdh1 is identified as a member of /dev/md5, slot 7.
> mdadm: /dev/sdi1 is identified as a member of /dev/md5, slot -1.
> mdadm: /dev/sdj1 is identified as a member of /dev/md5, slot 9.
> mdadm: /dev/sdk1 is identified as a member of /dev/md5, slot 10.
> mdadm: /dev/sdl1 is identified as a member of /dev/md5, slot 11.
> mdadm: /dev/sdm1 is identified as a member of /dev/md5, slot 12.
> mdadm: /dev/sdn1 is identified as a member of /dev/md5, slot 13.
> mdadm: /dev/sdo1 is identified as a member of /dev/md5, slot -1.
> mdadm: no uptodate device for slot 1 of /dev/md5
> mdadm: added /dev/sdc1 to /dev/md5 as 2
> mdadm: added /dev/sdd1 to /dev/md5 as 3
> mdadm: added /dev/sde1 to /dev/md5 as 4
> mdadm: added /dev/sdf1 to /dev/md5 as 5
> mdadm: added /dev/sdg1 to /dev/md5 as 6
> mdadm: added /dev/sdh1 to /dev/md5 as 7
> mdadm: no uptodate device for slot 8 of /dev/md5
> mdadm: added /dev/sdj1 to /dev/md5 as 9
> mdadm: added /dev/sdk1 to /dev/md5 as 10
> mdadm: added /dev/sdl1 to /dev/md5 as 11
> mdadm: added /dev/sdm1 to /dev/md5 as 12
> mdadm: added /dev/sdn1 to /dev/md5 as 13
> mdadm: no uptodate device for slot 14 of /dev/md5
> mdadm: no uptodate device for slot 15 of /dev/md5
> mdadm: added /dev/sdb1 to /dev/md5 as -1
> mdadm: added /dev/sdi1 to /dev/md5 as -1
> mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
> mdadm: added /dev/sda1 to /dev/md5 as 0
> mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.
> 
> 

Thanks.

I know what the 'busy' thing is now.
sdo1 appears to be the 'same' as some other device in some way.

Also it looks like you might have turned some drives into spares
unintentionally, though I'm not sure.

Could you please send "mdadm --examine" output for all of these drives and
I'll have a look.

Thanks,
NeilBrown




[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-08 20:59             ` NeilBrown
@ 2011-12-08 21:42               ` Eli Morris
  2011-12-08 22:50                 ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Morris @ 2011-12-08 21:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Dec 8, 2011, at 12:59 PM, NeilBrown wrote:

> On Thu, 8 Dec 2011 12:39:10 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> 
>> 
> 
>> 
>> and here is the verbose assemble output:
>> 
>> [root@stratus log]# mdadm --verbose --assemble /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 
>> mdadm: looking for devices for /dev/md5
>> mdadm: /dev/sda1 is identified as a member of /dev/md5, slot 0.
>> mdadm: /dev/sdb1 is identified as a member of /dev/md5, slot -1.
>> mdadm: /dev/sdc1 is identified as a member of /dev/md5, slot 2.
>> mdadm: /dev/sdd1 is identified as a member of /dev/md5, slot 3.
>> mdadm: /dev/sde1 is identified as a member of /dev/md5, slot 4.
>> mdadm: /dev/sdf1 is identified as a member of /dev/md5, slot 5.
>> mdadm: /dev/sdg1 is identified as a member of /dev/md5, slot 6.
>> mdadm: /dev/sdh1 is identified as a member of /dev/md5, slot 7.
>> mdadm: /dev/sdi1 is identified as a member of /dev/md5, slot -1.
>> mdadm: /dev/sdj1 is identified as a member of /dev/md5, slot 9.
>> mdadm: /dev/sdk1 is identified as a member of /dev/md5, slot 10.
>> mdadm: /dev/sdl1 is identified as a member of /dev/md5, slot 11.
>> mdadm: /dev/sdm1 is identified as a member of /dev/md5, slot 12.
>> mdadm: /dev/sdn1 is identified as a member of /dev/md5, slot 13.
>> mdadm: /dev/sdo1 is identified as a member of /dev/md5, slot -1.
>> mdadm: no uptodate device for slot 1 of /dev/md5
>> mdadm: added /dev/sdc1 to /dev/md5 as 2
>> mdadm: added /dev/sdd1 to /dev/md5 as 3
>> mdadm: added /dev/sde1 to /dev/md5 as 4
>> mdadm: added /dev/sdf1 to /dev/md5 as 5
>> mdadm: added /dev/sdg1 to /dev/md5 as 6
>> mdadm: added /dev/sdh1 to /dev/md5 as 7
>> mdadm: no uptodate device for slot 8 of /dev/md5
>> mdadm: added /dev/sdj1 to /dev/md5 as 9
>> mdadm: added /dev/sdk1 to /dev/md5 as 10
>> mdadm: added /dev/sdl1 to /dev/md5 as 11
>> mdadm: added /dev/sdm1 to /dev/md5 as 12
>> mdadm: added /dev/sdn1 to /dev/md5 as 13
>> mdadm: no uptodate device for slot 14 of /dev/md5
>> mdadm: no uptodate device for slot 15 of /dev/md5
>> mdadm: added /dev/sdb1 to /dev/md5 as -1
>> mdadm: added /dev/sdi1 to /dev/md5 as -1
>> mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
>> mdadm: added /dev/sda1 to /dev/md5 as 0
>> mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.
>> 
>> 
> 
> Thanks.
> 
> I know what the 'busy' thing is now.
> sdo1 appears to be the 'same' as some other device in some way.
> 
> Also it looks like you might have turned some drives into spares
> unintentionally, though I'm not sure.
> 
> Could you please send "mdadm --examine" output for all of these drives and
> I'll have a look.
> 
> Thanks,
> NeilBrown
> 
> 
> 

Thanks Neil. I wasn't sure if you wanted the output of all the drives or just the 'removed' ones, so here is the output for all the drives in the array.

Just FYI, I don't know what I could have done to make these spares. Between when things worked fine and when they did not, I did not make any hardware or configuration changes to the array.

Thanks!

Eli


[root@stratus log]# mdadm --examine /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 2f2219e1:52ba8e9f:59a99614:19023955

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 9d39863 - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 4457f369:e449cbd3:b1d00987:d0b3ec45

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 2a11360a - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 03b89c44:e2db0b0a:7fea4aa4:6ac80958

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 6a59573f - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 26be9b16:d2c39c72:498f3873:d8e60059

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 74ce088b - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sde1
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 5b3c19ab:034cd1f3:708cf5ca:a3be4093

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 1c7ce3e6 - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 4
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdf1
/dev/sdf1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 0c48ca56:414f4d53:21da2d1a:7d519272

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 5633d35f - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 5
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdg1
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f6e4c4f9:4b00903e:e8ccf189:f345287c

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 5dcb0892 - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 6
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdh1
/dev/sdh1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 793be893:3192ae21:57f59ea7:97f76b4f

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : cbfdcb0e - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 7
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdi1
/dev/sdi1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 1cad41bf:75b756e2:71a83b96:cc3bf444

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 9c24592f - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdj1
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 74a09868:6978117b:8f2c4e22:f0e30ab0

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : d55f39d4 - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 9
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdk1
/dev/sdk1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 88ec1371:4cd85c1b:62f5deaa:2821bc7b

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : d267ebd7 - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 10
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdl1
/dev/sdl1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 9a009907:d9ff6097:4ac84cbf:8726797f

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : fd1bffbe - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 11
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdm1
/dev/sdm1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 481d50b4:65352be3:f65d8152:69e11c40

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 4975a288 - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 12
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdn1
/dev/sdn1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 3898edca:756c1f76:b33a66a0:a2dcce69

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 6a9e2c7f - correct
         Events : 32

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 13
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdo1
/dev/sdo1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a31ac4c9:a37eb6e8:97aa7298:b3b0bf8d

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : f80904f1 - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)

[root@stratus log]# mdadm --examine /dev/sdp1
/dev/sdp1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3189ca06:ccf973d0:7ef41366:98a75a32
           Name : stratus.pmc.ucsc.edu:5  (local to host stratus.pmc.ucsc.edu)
  Creation Time : Wed Oct 12 16:32:41 2011
     Raid Level : raid5
   Raid Devices : 16

 Avail Dev Size : 3907025039 (1863.01 GiB 2000.40 GB)
     Array Size : 58605358080 (27945.21 GiB 30005.94 GB)
  Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 66e0b547:26c72e70:a6756af6:eccaef5a

    Update Time : Mon Dec  5 12:52:46 2011
       Checksum : 289af87f - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : A.AAAAAA.AAAAA.. ('A' == active, '.' == missing)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-08 21:42               ` Eli Morris
@ 2011-12-08 22:50                 ` NeilBrown
  2011-12-08 23:03                   ` Eli Morris
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2011-12-08 22:50 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 5229 bytes --]

On Thu, 8 Dec 2011 13:42:44 -0800 Eli Morris <ermorris@ucsc.edu> wrote:

> 
> On Dec 8, 2011, at 12:59 PM, NeilBrown wrote:
> 
> > On Thu, 8 Dec 2011 12:39:10 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> > 
> >> 
> > 
> >> 
> >> and here is the verbose assemble output:
> >> 
> >> [root@stratus log]# mdadm --verbose --assemble /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 
> >> mdadm: looking for devices for /dev/md5
> >> mdadm: /dev/sda1 is identified as a member of /dev/md5, slot 0.
> >> mdadm: /dev/sdb1 is identified as a member of /dev/md5, slot -1.
> >> mdadm: /dev/sdc1 is identified as a member of /dev/md5, slot 2.
> >> mdadm: /dev/sdd1 is identified as a member of /dev/md5, slot 3.
> >> mdadm: /dev/sde1 is identified as a member of /dev/md5, slot 4.
> >> mdadm: /dev/sdf1 is identified as a member of /dev/md5, slot 5.
> >> mdadm: /dev/sdg1 is identified as a member of /dev/md5, slot 6.
> >> mdadm: /dev/sdh1 is identified as a member of /dev/md5, slot 7.
> >> mdadm: /dev/sdi1 is identified as a member of /dev/md5, slot -1.
> >> mdadm: /dev/sdj1 is identified as a member of /dev/md5, slot 9.
> >> mdadm: /dev/sdk1 is identified as a member of /dev/md5, slot 10.
> >> mdadm: /dev/sdl1 is identified as a member of /dev/md5, slot 11.
> >> mdadm: /dev/sdm1 is identified as a member of /dev/md5, slot 12.
> >> mdadm: /dev/sdn1 is identified as a member of /dev/md5, slot 13.
> >> mdadm: /dev/sdo1 is identified as a member of /dev/md5, slot -1.
> >> mdadm: no uptodate device for slot 1 of /dev/md5
> >> mdadm: added /dev/sdc1 to /dev/md5 as 2
> >> mdadm: added /dev/sdd1 to /dev/md5 as 3
> >> mdadm: added /dev/sde1 to /dev/md5 as 4
> >> mdadm: added /dev/sdf1 to /dev/md5 as 5
> >> mdadm: added /dev/sdg1 to /dev/md5 as 6
> >> mdadm: added /dev/sdh1 to /dev/md5 as 7
> >> mdadm: no uptodate device for slot 8 of /dev/md5
> >> mdadm: added /dev/sdj1 to /dev/md5 as 9
> >> mdadm: added /dev/sdk1 to /dev/md5 as 10
> >> mdadm: added /dev/sdl1 to /dev/md5 as 11
> >> mdadm: added /dev/sdm1 to /dev/md5 as 12
> >> mdadm: added /dev/sdn1 to /dev/md5 as 13
> >> mdadm: no uptodate device for slot 14 of /dev/md5
> >> mdadm: no uptodate device for slot 15 of /dev/md5
> >> mdadm: added /dev/sdb1 to /dev/md5 as -1
> >> mdadm: added /dev/sdi1 to /dev/md5 as -1
> >> mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
> >> mdadm: added /dev/sda1 to /dev/md5 as 0
> >> mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.
> >> 
> >> 
> > 
> > Thanks.
> > 
> > I know what the 'busy' thing is now.
> > sdo1 appears to be the 'same' as some other device in some way.
> > 
> > Also it looks like you might have turned some drives into spares
> > unintentionally, though I'm not sure.
> > 
> > Could you please send "mdadm --examine" output for all of these drives and
> > I'll have a look.
> > 
> > Thanks,
> > NeilBrown
> > 
> > 
> > 
> 
> Thanks Neil. I wasn't sure if you wanted the output of all the drives or just the 'removed' ones, so here is the output for all the drives in the array.
> 
> Just FYI, I don't know what I could have done to make these spares. Between when things worked fine and when they did not, I did not make any hardware or configuration changes to the array.
> 

Thanks.  I did want it all (it is always better to give too much rather than
too little - so thanks).

Those devices have been turned into spares.  Maybe an "--add" command or
possibly even a "--re-add", though it shouldn't.  Newer versions of mdadm are
more careful about this.

You need to re-"Create" the array.  This doesn't affect the data, just writes
new metadata.
It looks like it is safe to assume that none of the devices have been
renamed.  However if you have any reason to believe that the devices don't
belong in the array in the 'obvious' order, you should let me know or adjust
the command below accordingly.

You want to create the array exactly as it was, and you want to make sure
it doesn't immediately start to resync, just in case something goes wrong and
we want to try again.

All the 'Data Offset's are the same and are 2048 (1M) which is the current
default so that is good.
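
(If you want to double-check that yourself first, something along the lines
of

  mdadm --examine /dev/sd[a-p]1 | grep 'Data Offset'

should report 2048 sectors for every member.)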

So:
  mdadm --create /dev/md5 -l5 --layout=left-symmetric --chunk=512 \
  --raid-disks=16  --assume-clean /dev/sd[a-p]

This will over-write all the metadata but not touch the data.

Then you probably want to
  fsck -n /dev/md5

to make sure it looks good.  If it does,

 echo check > /sys/block/md5/md/sync_action

That will read all blocks and  make sure parity is correct.  When it finishes
check
   /sys/block/md5/md/mismatch_cnt

if this is zero or close to zero, then it is looking very good.
If it is a lot more than zero (as  > 10000) then we probably need to think
again.
If it is small but non-zero, then "echo repair > ...the same /sync_action"
will fix it up.
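
(written out in full, that is:  echo repair > /sys/block/md5/md/sync_action )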

If fsck showed any issues, run
  fsck -f /dev/md5
to fix them, then mount the filesystem and all should be good.

What version of mdadm do you have?

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-08 22:50                 ` NeilBrown
@ 2011-12-08 23:03                   ` Eli Morris
  2011-12-09  3:20                     ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Morris @ 2011-12-08 23:03 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Dec 8, 2011, at 2:50 PM, NeilBrown wrote:

> On Thu, 8 Dec 2011 13:42:44 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> 
>> 
>> On Dec 8, 2011, at 12:59 PM, NeilBrown wrote:
>> 
>>> On Thu, 8 Dec 2011 12:39:10 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
>>> 
>>>> 
>>> 
>>>> 
>>>> and here is the verbose assemble output:
>>>> 
>>>> [root@stratus log]# mdadm --verbose --assemble /dev/md5 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 
>>>> mdadm: looking for devices for /dev/md5
>>>> mdadm: /dev/sda1 is identified as a member of /dev/md5, slot 0.
>>>> mdadm: /dev/sdb1 is identified as a member of /dev/md5, slot -1.
>>>> mdadm: /dev/sdc1 is identified as a member of /dev/md5, slot 2.
>>>> mdadm: /dev/sdd1 is identified as a member of /dev/md5, slot 3.
>>>> mdadm: /dev/sde1 is identified as a member of /dev/md5, slot 4.
>>>> mdadm: /dev/sdf1 is identified as a member of /dev/md5, slot 5.
>>>> mdadm: /dev/sdg1 is identified as a member of /dev/md5, slot 6.
>>>> mdadm: /dev/sdh1 is identified as a member of /dev/md5, slot 7.
>>>> mdadm: /dev/sdi1 is identified as a member of /dev/md5, slot -1.
>>>> mdadm: /dev/sdj1 is identified as a member of /dev/md5, slot 9.
>>>> mdadm: /dev/sdk1 is identified as a member of /dev/md5, slot 10.
>>>> mdadm: /dev/sdl1 is identified as a member of /dev/md5, slot 11.
>>>> mdadm: /dev/sdm1 is identified as a member of /dev/md5, slot 12.
>>>> mdadm: /dev/sdn1 is identified as a member of /dev/md5, slot 13.
>>>> mdadm: /dev/sdo1 is identified as a member of /dev/md5, slot -1.
>>>> mdadm: no uptodate device for slot 1 of /dev/md5
>>>> mdadm: added /dev/sdc1 to /dev/md5 as 2
>>>> mdadm: added /dev/sdd1 to /dev/md5 as 3
>>>> mdadm: added /dev/sde1 to /dev/md5 as 4
>>>> mdadm: added /dev/sdf1 to /dev/md5 as 5
>>>> mdadm: added /dev/sdg1 to /dev/md5 as 6
>>>> mdadm: added /dev/sdh1 to /dev/md5 as 7
>>>> mdadm: no uptodate device for slot 8 of /dev/md5
>>>> mdadm: added /dev/sdj1 to /dev/md5 as 9
>>>> mdadm: added /dev/sdk1 to /dev/md5 as 10
>>>> mdadm: added /dev/sdl1 to /dev/md5 as 11
>>>> mdadm: added /dev/sdm1 to /dev/md5 as 12
>>>> mdadm: added /dev/sdn1 to /dev/md5 as 13
>>>> mdadm: no uptodate device for slot 14 of /dev/md5
>>>> mdadm: no uptodate device for slot 15 of /dev/md5
>>>> mdadm: added /dev/sdb1 to /dev/md5 as -1
>>>> mdadm: added /dev/sdi1 to /dev/md5 as -1
>>>> mdadm: failed to add /dev/sdo1 to /dev/md5: Device or resource busy
>>>> mdadm: added /dev/sda1 to /dev/md5 as 0
>>>> mdadm: /dev/md5 assembled from 12 drives and 2 spares - not enough to start the array.
>>>> 
>>>> 
>>> 
>>> Thanks.
>>> 
>>> I know what the 'busy' thing is now.
>>> sdo1 appears to be the 'same' as some other device in some way.
>>> 
>>> Also it looks like you might have turned some drives into spares
>>> unintentionally, though I'm not sure.
>>> 
>>> Could you please send "mdadm --examine" output for all of these drives and
>>> I'll have a look.
>>> 
>>> Thanks,
>>> NeilBrown
>>> 
>>> 
>>> 
>> 
>> Thanks Neil. I wasn't sure if you wanted the output of all the drives or just the 'removed' ones, so here is the output for all the drives in the array.
>> 
>> Just FYI, I don't know what I could have done to make these spares. Between when things worked fine and when they did not, I did not make any hardware or configuration changes to the array.
>> 
> 
> Thanks.  I did want it all (it is always better to give too much rather than
> too little - so thanks).
> 
> Those devices have been turned into spares.  Maybe an "--add" command or
> possibly even a "--re-add", though it shouldn't.  Newer versions of mdadm are
> more careful about this.
> 
> You need to re-"Create" the array.  This doesn't affect the data, just writes
> new metadata.
> It looks like it is safe to assume that none of the devices have been
> renamed.  However if you have any reason to believe that the devices don't
> belong in the array in the 'obvious' order, you should let me know or adjust
> the command below accordingly.
> 
> You want to create the array exactly as it was, and you want to make sure
> it doesn't immediately start to resync, just in case something goes wrong and
> we want to try again.
> 
> All the 'Data Offset's are the same and are 2048 (1M) which is the current
> default so that is good.
> 
> So:
>  mdadm --create /dev/md5 -l5 --layout=left-symmetric --chunk=512 \
>  --raid-disks=16  --assume-clean /dev/sd[a-p]
> 
> This will over-write all the metadata but not touch the data.
> 
> Then you probably want to
>  fsck -n /dev/md5
> 
> to make sure it looks good.  If it does,
> 
> echo check > /sys/block/md5/md/sync_action
> 
> That will read all blocks and  make sure parity is correct.  When it finishes
> check
>   /sys/block/md5/md/mismatch_cnt
> 
> if this is zero or close to zero, then it is looking very good.
> If it is a lot more than zero (say > 10000) then we probably need to think
> again.
> If it is small but non-zero, then "echo repair > ...the same /sync_action"
> will fix it up.
> 
> If fsck showed any issues, run
>  fsck -f /dev/md5
> to fix them, then mount the filesystem and all should be good.
> 
> What version of mdadm do you have?
> 
> Thanks,
> NeilBrown
> 


Hi Neil,

Thanks. I have:

mdadm - v3.1.3 - 6th August 2010, which is the default up to date version on Centos 6.0

One question: is the command:

>  mdadm --create /dev/md5 -l5 --layout=left-symmetric --chunk=512 \
>  --raid-disks=16  --assume-clean /dev/sd[a-p]

Or 

 mdadm --create /dev/md5 -l5 --layout=left-symmetric --chunk=512 \
 --raid-disks=16  --assume-clean /dev/sd[a-p]1



Note the partition number on the drives: /dev/sd[a-p]1 instead of /dev/sd[a-p]

thanks,

Eli

mdadm --create /dev/md5 -l5 --layout=left-symmetric --chunk=512 \
>  --raid-disks=16  --assume-clean /dev/sd[a-p]
mdadm: partition table exists on /dev/sda but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdd but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sde but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdf but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdg but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdh but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdi but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdj but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdk but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdl but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdm but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdn but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdo but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdp but will be lost or
       meaningless after creating array
Continue creating array? n







^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-08 23:03                   ` Eli Morris
@ 2011-12-09  3:20                     ` NeilBrown
  2011-12-09  6:58                       ` Eli Morris
  2011-12-09 16:40                       ` Asdo
  0 siblings, 2 replies; 24+ messages in thread
From: NeilBrown @ 2011-12-09  3:20 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 710 bytes --]

On Thu, 8 Dec 2011 15:03:54 -0800 Eli Morris <ermorris@ucsc.edu> wrote:

> 
> Hi Neil,
> 
> Thanks. I have:
> 
> mdadm - v3.1.3 - 6th August 2010, which is the default up to date version on Centos 6.0
> 
> One question: is the command:
> 
> >  mdadm --create /dev/md5 -l5 --layout=left-symmetric --chunk=512 \
> >  --raid-disks=16  --assume-clean /dev/sd[a-p]
> 
> Or 
> 
>  mdadm --create /dev/md5 -l5 --layout=left-symmetric --chunk=512 \
>  --raid-disks=16  --assume-clean /dev/sd[a-p]1
> 
> 
> 
> Note the partition number on the drives: /dev/sd[a-p]1 instead of /dev/sd[a-p]

Yes, you are correct, you need the '1' at the end.
Sorry, and thanks for being careful.

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-09  3:20                     ` NeilBrown
@ 2011-12-09  6:58                       ` Eli Morris
  2011-12-09 15:31                         ` John Stoffel
  2011-12-09 16:40                       ` Asdo
  1 sibling, 1 reply; 24+ messages in thread
From: Eli Morris @ 2011-12-09  6:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Dec 8, 2011, at 7:20 PM, NeilBrown wrote:

> On Thu, 8 Dec 2011 15:03:54 -0800 Eli Morris <ermorris@ucsc.edu> wrote:
> 
>> 
>> Hi Neil,
>> 
>> Thanks. I have:
>> 
>> mdadm - v3.1.3 - 6th August 2010, which is the default up to date version on Centos 6.0
>> 
>> One question: is the command:
>> 
>>> mdadm --create /dev/md5 -l5 --layout=left-symmetric --chunk=512 \
>>> --raid-disks=16  --assume-clean /dev/sd[a-p]
>> 
>> Or 
>> 
>> mdadm --create /dev/md5 -l5 --layout=left-symmetric --chunk=512 \
>> --raid-disks=16  --assume-clean /dev/sd[a-p]1
>> 
>> 
>> 
>> Note the partition number on the drives: /dev/sd[a-p]1 instead of /dev/sd[a-p]
> 
> Yes, you are correct, you need the '1' at the end.
> Sorry, and thanks for being careful.
> 
> NeilBrown
> 

Wow. Thanks so much. I haven't verified the entire array, but it looks fine. I'm still really wondering what you think might have happened. As I mentioned, everything was OK and then the four drives just showed up as "removed" when I went to write to the filesystem on the array. Do you think it has to do with the hardware, or is there something I can correct in software so that this won't happen again? I suspect hardware, since after it happened, I ran lsscsi and the four drives failed to show up on the list. These are the Caviar Green drives that people seem to have problems with in RAID arrays. I was thinking about changing the timeout tolerance in the OS to something several minutes long. Maybe you've seen something like this before? I'm just not sure what happened, but, as you might imagine, I'm keen for it to not happen every few months.
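
The knob I have in mind is the per-device command timeout the kernel uses. As a rough, untested sketch (the 180 seconds is only a guess on my part):

  # raise the kernel-side SCSI command timeout on each member disk
  for d in /sys/block/sd[a-p]/device/timeout; do echo 180 > "$d"; done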

Again, thank you. I'm grateful to have it working again.

Eli


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-09  6:58                       ` Eli Morris
@ 2011-12-09 15:31                         ` John Stoffel
  0 siblings, 0 replies; 24+ messages in thread
From: John Stoffel @ 2011-12-09 15:31 UTC (permalink / raw)
  To: Eli Morris; +Cc: NeilBrown, linux-raid

>>>>> "Eli" == Eli Morris <ermorris@ucsc.edu> writes:

Eli> Wow. Thanks so much. I haven't verified the entire array, but it
Eli> looks fine. I'm still really wondering what you think might have
Eli> happened. As I mentioned, everything was OK and then the four
Eli> drives just showed up as "removed" when I went to write to the
Eli> filesystem on the array. Do you think it has to do with the
Eli> hardware or is there something I can correct in software so that
Eli> this won't happen again? I suspect hardware, since after it
Eli> happened, I ran lsscsi and the four drives failed to show up on
Eli> the list. 

I personally suspect that you have hardware issues, possibly cables,
possibly the enclosure you're using.  When four drives that are all on
a single path drop out, it points strongly to the path or some other
common issue.

Have you checked your power supply as well?  Maybe you're running just
slightly overloaded and something happened to work the system, so the
voltage dropped just enough to cause the problem.

If you feel ok, you could try to run the RAID check while at the same
time stressing the system with something CPU intensive to load down
the CPU(s) on the system.  
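
A quick and dirty way to do that (rough sketch, adjust to taste):

  echo check > /sys/block/md5/md/sync_action
  # meanwhile, in another shell, keep every core busy:
  for i in 1 2 3 4; do md5sum /dev/zero & done
  # then "killall md5sum" once the check completes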

But in general, when things are stable again, bring down the system,
and double check all your cables are seated properly, controllers are
tight, etc.  

Eli> These are the Caviar Green drives that people seem to have
Eli> problems with in RAID arrays. I was thinking about changing the
Eli> timeout tolerance in the OS to something several minutes
Eli> long. Maybe you've seen something like this before? I'm just not
Eli> sure what happened, but, as you might imagine, I'm keen for it to
Eli> not happen every few months.

I don't remember if you posted any logs from dmesg when this crapped
out, but you might see something in there.

John

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-09  3:20                     ` NeilBrown
  2011-12-09  6:58                       ` Eli Morris
@ 2011-12-09 16:40                       ` Asdo
  1 sibling, 0 replies; 24+ messages in thread
From: Asdo @ 2011-12-09 16:40 UTC (permalink / raw)
  To: linux-raid

Dear Neil,
this issue went OK for the OP (and thanks for your continuous support), 
however, exactly this situation is my worst nightmare regarding MD RAID.

It seems to me that MD has no mechanism to safeguard against the
situation of a disconnected cable (one holding multiple drives), and I
think this could cause major confusion for the user, potentially
followed by major data loss.

I think it should be possible, in principle, to implement a mechanism
that discriminates between a cable disconnect (affecting multiple
drives) and a single failed drive:

The technique would be:
BEFORE failing a drive for any symptom that *could* be caused by a
cable disconnect, (maybe wait a couple of seconds and then) perform a
read and/or a write against each of the drives of the array (not
cached, mandatorily from the platters, and synced in the case of a
write). If multiple drives which were believed to be working do not
respond to such a read/write command, then assume a cable is
disconnected and either block the array (is there a blocked state like
for other Linux block devices? if not, it should be implemented) or set
it read-only. Or, worst case, disassemble the array. But DO NOT proceed
with failing the drive. OTOH, if all the other drives respond
correctly, assume it's not a cable problem and go ahead with failing
the drive that was supposed to be failed.

The current behaviour is not good because MD will start recording all
the failed drives in the metadata of the good drives before discovering
that there are so many failed drives that the array cannot be kept
running at all.

So you end up with an array that is down, but which also has an
inconsistent state (I think writes could have been performed between
the first discovered failure and the last discovered failure, so the
array would indeed be inconsistent) and which no longer assembles
cleanly.

Thank you


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-07 20:42 4 out of 16 drives show up as 'removed' Eli Morris
  2011-12-07 20:51 ` Mathias Burén
  2011-12-07 20:57 ` NeilBrown
@ 2011-12-09 19:38 ` Stan Hoeppner
  2011-12-09 22:07   ` Eli Morris
  2 siblings, 1 reply; 24+ messages in thread
From: Stan Hoeppner @ 2011-12-09 19:38 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

On 12/7/2011 2:42 PM, Eli Morris wrote:

>  I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 
> 
> These are the dreaded Caviar Green drives.

Eli, you masochist.  ;)

> 2) Any idea on how to stop this from happening again?

You already know the answer.  You're simply ignoring/avoiding it.  It
was given to you by a half dozen eminently qualified people over on the
XFS list when you had an entire chassis blow out many months ago, losing
many students' doctoral thesis data IIRC.  I've since used that saga
many times as evidence against using "green" drives, esp the WD models,
for anything but casual desktop storage.

The only permanent fix is to replace the drives with models meant for
your use case, such as the WD RE4 or Seagate Constellation, to name two.
 Unfortunately, right now, due to the flooding in Thailand, the price of
all drives across the board has doubled as a result of constricted
supply.  Given you don't have funds in the budget to replace them
anyway, at any price, it seems you are simply screwed in this regard.

One thing I would suggest though, if you got the vendor tech's statement
in writing WRT the WD Green drives being compatible with their RAID
chassis, I'd lean hard on them to fix the issue, as it was their rec
that prompted your purchase, causing this problem in the first place, no?

-- 
Stan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-09 19:38 ` Stan Hoeppner
@ 2011-12-09 22:07   ` Eli Morris
  2011-12-10  2:29     ` Stan Hoeppner
  2011-12-10 17:28     ` wilsonjonathan
  0 siblings, 2 replies; 24+ messages in thread
From: Eli Morris @ 2011-12-09 22:07 UTC (permalink / raw)
  To: stan; +Cc: linux-raid


On Dec 9, 2011, at 11:38 AM, Stan Hoeppner wrote:

> On 12/7/2011 2:42 PM, Eli Morris wrote:
> 
>> I thought maybe someone could help me out. I have a 16 disk software RAID that we use for backup. This is at least the second time this happened- all at once, four of the drives report as 'removed' when none of them actually were. These drives also disappeared from the 'lsscsi' list until I restarted the disk expansion chassis where they live. 
>> 
>> These are the dreaded Caviar Green drives.
> 
> Eli, you masochist.  ;)
> 
>> 2) Any idea on how to stop this from happening again?
> 
> You already know the answer.  You're simply ignoring/avoiding it.  It
> was given to you by a half dozen imminently qualified people over on the
> XFS list when you had an entire chassis blow out many months ago, losing
> many students' doctoral thesis data IIRC.  I've since used that saga
> many times as evidence against using "green" drives, esp the WD models,
> for anything but casual desktop storage.
> 
> The only permanent fix is to replace the drives with models meant for
> your use case, such as the WD RE4 or Seagate Constellation, to name two.
> Unfortunately, right now, due to the flooding in Thailand, the price of
> all drives across the board has doubled as a result of constricted
> supply.  


> Given you don't have funds in the budget to replace them
> anyway, at any price, it seems you are simply screwed in this regard.

Well, I think you may be on to something here. So I have two choices:

1) Admit defeat and wash my hands of the lack of adequate backups.

2) Risk my time and ridicule and ask around to see if there is any way to make these drives work in this configuration. From what I understand of these drives, when they hit a bad block they spend so long trying to recover and remap it that the RAID layer times out waiting for them, instead of the disks cutting their own recovery short and responding to the RAID request, and so they get marked as failed. Now, why four drives would do this simultaneously, and why, when I look at the SMART data, they don't show up as having any bad blocks, I have no idea. Maybe these drives are spinning down and not responding within a certain time and that is why they showed up as removed? I don't know; maybe there is a way to keep that from happening??

My further understanding is that one can control, in the OS, the timeout for drives that are in an expansion bay, which is how they are now configured in our system. But, look, I'll admit that I'm no expert on this issue and someone might have a better suggestion or will tell me why that is not the right idea / a bad idea, whatever. And if using these drives is just impossible (which it very well might be - YES, I'm getting very sick of trying to find a way to make these work), then so be it. 
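
For what it's worth, the drive-side knobs I keep seeing mentioned look roughly like this (an untested sketch, run per drive with sdX standing in for each member; I don't even know whether these particular drives accept the SCT ERC command, since many Greens reportedly don't):

  # ask the drive to give up on a failing sector after 7 seconds ("TLER"-style)
  smartctl -l scterc,70,70 /dev/sdX
  # disable the drive's own standby/spin-down timer
  hdparm -S 0 /dev/sdX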

I agree with you and everyone else that tells me that these drives shouldn't be used in RAIDs. I will never buy these type of drives again and I will never recommend these drives to anyone else. I'd really like to just chuck these drives off the roof of a building and buy new good quality ones. I'd REALLY like to do that. When more funding is available, I'll be doing just that. 

The meltdown that we did have in our lab was due to a pretty unfortunate chain of events: we lost 4 drives out of 16 in one RAID unit within a few days (those four are different drives and have nothing to do with the Caviar Greens), I was on vacation at the time and did not replace the failing drives as fast as they failed, and during this time the other backup RAID (the one with the Caviar Green drives) also failed with four drives. 

So, that's not so great. As you mention in your last paragraph, the reason why we had Caviar Green drives to begin with is that our RAID vendor recommended them to us specifically for use in the RAID where they failed. I spoke with him after they failed and he insists that these drives were not the problem and that they are used without problem in similar RAIDs. He seems like a good guy, but ultimately, I have no way of knowing what to think of that. He thinks the four drives 'failed' because of a backplane issue, but, since the unit is older and out of warranty, and thus costly, that isn't really worth investigating.


thanks,

Eli

> 
> One thing I would suggest though, if you got the vendor tech's statement
> in writing WRT the WD Green drives being compatible with their RAID
> chassis, I'd lean hard on them to fix the issue, as it was their rec
> that prompted your purchase, causing this problem in the first place, no?
> 
> -- 
> Stan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-09 22:07   ` Eli Morris
@ 2011-12-10  2:29     ` Stan Hoeppner
  2011-12-10  4:57       ` Eli Morris
  2011-12-10 17:28     ` wilsonjonathan
  1 sibling, 1 reply; 24+ messages in thread
From: Stan Hoeppner @ 2011-12-10  2:29 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

On 12/9/2011 4:07 PM, Eli Morris wrote:

> So, that's not so great. As you mention in your last paragraph, the reason why we had Caviar Green drives to begin with is that our RAID vendor recommended them to us specifically for use in the RAID where they failed. I spoke with him after they failed and he insists that these drives were not the problem and that they are used without problem in similar RAIDs. He seems like a good guy, but ultimately, I have no way of knowing what to think of that. He thinks the four drives 'failed' because of a backplane issue, but, since the unit is older and out of warranty, and thus costly, that isn't really worth investigating.

Sure it is, if your data has value.  The style of backplane you have,
4x3 IIRC, is cheap.  If one board is flaky, replace it.  They normally
run only a couple hundred dollars, assuming your OEM still has some in
inventory.

If not, and you have $1500 squirreled away somewhere in the budget, grab
one of these and move the drives over:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047

Sure, the Norco is definitely a low dollar 24 drive SAS/SATA JBOD unit.
 But the Areca expander module is built around the LSI SAS 2x36 ASIC,
the gold standard SAS expander chip on the market.

Do you have any dollars in your yearly budget for hardware
maintenance/replacement?

-- 
Stan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-10  2:29     ` Stan Hoeppner
@ 2011-12-10  4:57       ` Eli Morris
  2011-12-11  1:15         ` Stan Hoeppner
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Morris @ 2011-12-10  4:57 UTC (permalink / raw)
  To: stan; +Cc: linux-raid


On Dec 9, 2011, at 6:29 PM, Stan Hoeppner wrote:

> On 12/9/2011 4:07 PM, Eli Morris wrote:
> 
>> So, that's not so great. As you mention in your last paragraph, the reason why we had Caviar Green drives to begin with is that our RAID vendor recommended them to us specifically for use in the RAID where they failed. I spoke with him after they failed and he insists that these drives were not the problem and that they are used without problem in similar RAIDs. He seems like a good guy, but ultimately, I have no way of knowing what to think of that. He thinks the four drives 'failed' because of a backplane issue, but, since the unit is older and out of warranty, and thus costly, that isn't really worth investigating.
> 
> Sure it is, if your data has value.  The style of backplane you have,
> 4x3 IIRC, is cheap.  If one board is flaky, replace it.  They normally
> run only a couple hundred dollars, assuming your OEM still has some in
> inventory.
> 
> If not, and you have $1500 squirreled away somewhere in the budget, grab
> one of these and move the drives over:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047
> 
> Sure, the Norco is definitely a low dollar 24 drive SAS/SATA JBOD unit.
> But the Areca expander module is built around the LSI SAS 2x36 ASIC,
> the gold standard SAS expander chip on the market.
> 
> Do you have any dollars in your yearly budget for hardware
> maintenance/replacement?
> 
> -- 
> Stan

Hi Stan,

It's funny you should mention getting a SAS/SATA JBOD unit. When I was told that the RAID unit we had might have a backplane issue, I decided to try putting these drives in a JBOD expander module and using a software RAID configuration with them, since I have read in a few places that this gets around the TLER problem with these particular drives, and that if we did have a backplane or controller problem, doing so would get around that as well. Thus I did buy a JBOD expander, I put the drives in it, and here we are today with this latest failure: the drives are in the SAS/SATA JBOD expander with mdadm as the controller. So maybe our thinking isn't too far apart ;<)

Now I could replace the backplane of the original RAID (if we can get one for a reasonable price), put these silly drives back in it, and hope the problem goes away, but I'm not convinced that the backplane is the issue. It might be, but I'm not sure I want to bet money on it. I think it is more likely a problem with these drives and some sort of timeout issue related to TLER, or a power-saving spin-down of the drives that mdadm has a problem with. I feel like the most likely fix is something related to that. One other thing: the four drives that originally 'failed' back when they were in the hardware RAID unit (and they weren't dead drives - they just showed up as removed, same as this time) all had quite a few bad blocks, so I sent those back and got replacements. 

Since the symptoms were the same in the hardware and software RAIDs and the drives themselves seem to be OK, it leads me back to some sort of timeout issue where they are not responding to a command within a certain amount of time and thus show up as 'removed' - not failed, but 'removed'.
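
One thing I can check, assuming smartmontools and hdparm are installed
(/dev/sdb below is just a placeholder for each member disk), is whether
the drives are parking heads or spinning down behind md's back:

smartctl -A /dev/sdb | grep -E 'Load_Cycle_Count|Start_Stop_Count'   # a rapidly climbing load cycle count points at aggressive head parking
hdparm -C /dev/sdb                                                   # reports active/idle vs. standby (spun down)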

Regarding the hardware RAID, at some point when I have time, I'll put in our original, much lower capacity disks that shipped with the unit about six years ago and see if they work OK in the unit with the suspect backplane. That way, I hope to show whether the unit really does have bad hardware or whether it was the Caviar Green drives that were causing the problem. 

We don't have a yearly budget per se. We have about $6000 total for maintenance, hardware, and software for the next 2.5 years to support about $200,000 worth of hardware. Almost as bad as losing data would be having something break that we need in order to run and then not being able to replace it for lack of funds. I'm not sure what happens then. The lab is constantly applying for grants, so if one comes in, everything could change and we could have some money again. It's just hard to say whether or when that will happen.

thanks,

Eli



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-09 22:07   ` Eli Morris
  2011-12-10  2:29     ` Stan Hoeppner
@ 2011-12-10 17:28     ` wilsonjonathan
  2011-12-10 17:43       ` wilsonjonathan
  1 sibling, 1 reply; 24+ messages in thread
From: wilsonjonathan @ 2011-12-10 17:28 UTC (permalink / raw)
  To: Eli Morris; +Cc: stan, linux-raid

On Fri, 2011-12-09 at 14:07 -0800, Eli Morris wrote:

> My further understanding is that one can control the timeout in the OS of drives that are in an expansion bay, such as they are now configured in our system. But, look, I'll admit that that I'm no expert in this issue and someone might have a better suggestion or will tell me why that is not the right idea / a bad idea, whatever. And if using these drives is just impossible (which very well might be - YES, I'm getting very sick of trying to find a way to make these work), then so be it. 
> 

I'm setting up my first RAID server for home use, so power reduction when
not in use is important; hence my drives and the host link power
management need to be able to go into low-power mode. This is what I
have gleaned from the net!


First, issue the following (taken and edited from
http://www.excaliburtech.net/archives/83 )

(the top one works in Red Hat 4):
for a in /sys/class/scsi_device/*/device/timeout; do echo -n "$a "; cat "$a"; done
or
for a in /sys/class/scsi_generic/*/device/timeout; do echo -n "$a "; cat "$a"; done

You should see results similar to:

/sys/class/scsi_device/0:0:0:0/device/timeout 30
/sys/class/scsi_device/2:0:0:0/device/timeout 30
/sys/class/scsi_device/4:0:0:0/device/timeout 30


If you then, as root, run the following for each entry above:

echo 120 > /sys/class/...   (use the full path names displayed
previously)

Then re-run the first command; they should all now show 120. This is more
than enough to cope with disks winding up, doing some stuff, maybe a bit
of alignment correction, and then replying to the md stack.

From what I have read (although documentation on the net is rarely up to
date) the md stack will wait forever on an action until either an error
is returned or the data is. There is no "time out" within the md stack
as there is with a hardware raid controller.

Mind you, I don't know anything about SAS cables or controllers, so it's
possible they may have hardware/software timeouts in their own right.

It might be worth investigating also (note this is for sata, so may not
be applicable):

for a in /sys/class/scsi_host/*/link_power_management_policy; do echo -n "$a "; cat "$a"; done

which should show:

/sys/class/scsi_host/host0/link_power_management_policy max_performance
/sys/class/scsi_host/host1/link_power_management_policy max_performance


I guess another place to look is the sdparm/hdparm data for the disks,
to see what options are set regarding wind-down, etc.
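
For example (an untested sketch; /dev/sdb is just a placeholder), hdparm
can show and change the relevant power settings:

hdparm -B /dev/sdb     # query the APM level; 255 means power management is disabled
hdparm -C /dev/sdb     # report the current power state (active/idle or standby)
hdparm -S 0 /dev/sdb   # disable the standby (spin-down) timer, if that turns out to be the culprit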



--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-10 17:28     ` wilsonjonathan
@ 2011-12-10 17:43       ` wilsonjonathan
  0 siblings, 0 replies; 24+ messages in thread
From: wilsonjonathan @ 2011-12-10 17:43 UTC (permalink / raw)
  To: Eli Morris; +Cc: stan, linux-raid

On Sat, 2011-12-10 at 17:28 +0000, wilsonjonathan wrote:

> 
> If you (as root) for each entry above:
> 
> echo 120 > /sys/class/...   (use the full path names displayed from
> previous here)

Addendum: these commands only last until reboot, so they will need to be
in a startup script.
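
Something like this in rc.local (or your distro's equivalent) should
re-apply them at boot. An untested sketch; adjust the glob and the
120-second value as needed:

# e.g. appended to /etc/rc.d/rc.local (path varies by distro)
for a in /sys/class/scsi_device/*/device/timeout; do
    echo 120 > "$a"                 # raise the SCSI command timeout to 120s
done
for a in /sys/class/scsi_host/*/link_power_management_policy; do
    echo max_performance > "$a"     # keep the SATA links from dropping into low-power states
done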


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 4 out of 16 drives show up as 'removed'
  2011-12-10  4:57       ` Eli Morris
@ 2011-12-11  1:15         ` Stan Hoeppner
  0 siblings, 0 replies; 24+ messages in thread
From: Stan Hoeppner @ 2011-12-11  1:15 UTC (permalink / raw)
  To: Eli Morris; +Cc: linux-raid

On 12/9/2011 10:57 PM, Eli Morris wrote:
> 
> On Dec 9, 2011, at 6:29 PM, Stan Hoeppner wrote:
> 
>> On 12/9/2011 4:07 PM, Eli Morris wrote:
>>
>>> So, that's not so great. As you mention in your last paragraph, the reason why we had Caviar Green drives to begin with is that our RAID vendor recommended them to us specifically for use in the RAID where they failed. I spoke with him after they failed and he insists that these drives were not the problem and that they are used without problem in similar RAIDs. He seems like a good guy, but ultimately, I have no way of knowing what to think of that. He thinks the four drives 'failed' because of a backplane issue, but, since the unit is older and out of warranty, and thus costly, that isn't really worth investigating.
>>
>> Sure it is, if your data has value.  The style of backplane you have,
>> 4x3 IIRC, is cheap.  If one board is flaky, replace it.  They normally
>> run only a couple hundred dollars, assuming your OEM still has some in
>> inventory.
>>
>> If not, and you have $1500 squirreled away somewhere in the budget, grab
>> one of these and move the drives over:
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047
>>
>> Sure, the Norco is definitely a low dollar 24 drive SAS/SATA JBOD unit.
>> But the Areca expander module is built around the LSI SAS 2x36 ASIC,
>> the gold standard SAS expander chip on the market.
>>
>> Do you have any dollars in your yearly budget for hardware
>> maintenance/replacement?
>>
>> -- 
>> Stan
> 
> Hi Stan,
> 
> It's funny you should mention getting a SAS/SATA JBOD unit. When I was told that the RAID unit we had might have a backplane issue, I decided to try put these drives in a JBOD expander module and use a software RAID configuration with the drives, since I have read in a few various places that this gets around the TLER problem with these particular drives and if we did have a backplane or controller problem, doing so would get around that as well. Thus I did buy a JBOD expander and I put the drives in them and here we are today with this latest failure- with the drives in the SAS/SATA JBOD expander using mdadm as the controller. So maybe our thinking isn't too far apart ;<)

That depends.  Which JBOD/expander did you acquire?  Make/model?  There
are a few hundred on the market, of vastly different quality.  Not all
SAS expanders are equal.  Some will never work out of the box with some
HBAs and RAID cards, some will have intermittent problems such as drive
drops, some will just work.  Sticking with the LSI based stuff gives you
the best odds of success.

If a chassis backplane is defective, switching drives to a good chassis
will solve that problem.  However, it won't solve any Green drive
related problems.  BTW, an SAS expander doesn't have anything to do with
TLER, ERC, CCTL, etc, and won't fix such problems.

AFAIK mdraid isn't as picky about read/write timeouts as hardware RAID
controllers are.  Others here are more qualified to speak to how mdraid
handles this than I am.

You didn't mention which SAS HBA you're using with this JBOD setup.  If
it's a Marvell SAS chipset on the HBA, that would be a big problem as
well, and would yield similar drive dropout problems.  SuperMicro has a
cheap dual SFF8088 card using this chipset, HighPoint as well.  If
you're using either of those, or anything with a Marvell chip, swap it
out with a recent PCIe LSI ASIC based HBA, such as the LSI 9200-8e:

http://www.lsi.com/products/storagecomponents/Pages/LSISAS9200-8e.aspx
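
If you're not sure what silicon is on the card, something like this will
identify the controller (just an illustration; output format varies by
distro):

lspci -nn | grep -iE 'sas|scsi|raid'   # look for LSI vs. Marvell in the storage controller line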

> Now I could replace the backplane of the original RAID (if we can get one for a reasonable price) and put these silly drives back in it and hope the problem goes away, but I'm not convinced that the backplane is the issue. It might be the issue, but I'm not sure I want to bet money on it. I think it is more likely a problem with these drives and some sort of timing out issue related to TLER or a power saving spin down of the drives that mdadm has a problem with. I feel like the most likely fix is something related to that. One other thing, the four drives that originally 'failed' back when they were in the hardware RAID unit (and they weren't dead drives-they just showed up as removed - same as this time), all had quite a few bad blocks, so I sent those back and got replacements. 

Replacing a backplane only helps if you indeed have a defective
backplane.  I'd need to see the internal design of the RAID chassis to
determine if the *simultaneous* dropping of 4 drives is very likely a
backplane issue or not.  If enterprise drives were involved here, I'd
say it's almost a certainty.  The fact these are WD Green drives makes
such determinations far more difficult.

> Since the symptoms were the same in the hardware and software RAIDs and the drives themselves seem to be OK, it leads me back to some sort of timeout issue where they are not responding to a command in a certain amount of time and thus show up as 'removed' - not failed, but 'removed'

Recalling your thread on XFS, drives dropped sequentially over time in
one RAID chassis of the 5 stitched together with an mdadm linear
concat--four did not drop simultaneously.  Then you had drives drop from
your D2D backup array, but again, I don't believe you stated multiple
drives dropping simultaneously.

Define these drives being "OK" in this context.  Surface scanning them
and reading the SMART data can show no errors all day long, but they'll
still often drop out of arrays.  There is no relationship between one
and the other.

TTBOMK, not a single reputable storage vendor integrates WD's Green
drives in their packaged SAS/SATA or FC/iSCSI RAID array chassis.  That
alone is instructive.

> Regarding the hardware RAID, at some point when I have time, I'll put our original much lower capacity disks that shipped with the unit about six years ago in and see if they work OK in the unit with the suspect backplane. In that way, I hope to show if the unit really does have bad hardware or if it was the Caviar Green drives that were causing the problem. 

Very good idea.  Assuming the original drives are still ok.  I'd
thoroughly test each one individually on the bench first.
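
A non-destructive per-drive check could look something like this
(a sketch; /dev/sdb stands in for whichever drive is on the bench):

smartctl -t long /dev/sdb   # start the drive's extended self-test; read results later with: smartctl -l selftest /dev/sdb
badblocks -sv /dev/sdb      # read-only surface scan (no -w, so it won't touch the data)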

> We don't have a yearly budget per se. We have about $6000 total for maintenance, hardware, and software for the next 2.5 years to support about $200,000 worth of hardware. Almost as bad as losing data would be something breaking that is needed to run that we then couldn't replace for lack of funds. I'm not sure what happens then. Now the lab is constantly applying for grants. So if one comes in, everything could change and we could have some money again. It's just hard to say if that will happen or not or when.

That's just sad.  Your hardware arrays are out of warranty.  You have 5
of them stitched together with mdraid linear and XFS atop.  Unsuitable
drives and associated problems aside, what's your plan to overcome
complete loss of one of those 5 units due to, say, controller failure?
If you can't buy a replacement controller FRU, you're looking at
purchasing another array with drives, with less than $6K to play with?

I feel for ya.  You're in a pickle.

-- 
Stan

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2011-12-11  1:15 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-07 20:42 4 out of 16 drives show up as 'removed' Eli Morris
2011-12-07 20:51 ` Mathias Burén
2011-12-07 20:57 ` NeilBrown
2011-12-07 22:00   ` Eli Morris
2011-12-07 22:16     ` NeilBrown
2011-12-07 23:42       ` Eli Morris
2011-12-08 19:17       ` Eli Morris
2011-12-08 19:51         ` NeilBrown
2011-12-08 20:39           ` Eli Morris
2011-12-08 20:59             ` NeilBrown
2011-12-08 21:42               ` Eli Morris
2011-12-08 22:50                 ` NeilBrown
2011-12-08 23:03                   ` Eli Morris
2011-12-09  3:20                     ` NeilBrown
2011-12-09  6:58                       ` Eli Morris
2011-12-09 15:31                         ` John Stoffel
2011-12-09 16:40                       ` Asdo
2011-12-09 19:38 ` Stan Hoeppner
2011-12-09 22:07   ` Eli Morris
2011-12-10  2:29     ` Stan Hoeppner
2011-12-10  4:57       ` Eli Morris
2011-12-11  1:15         ` Stan Hoeppner
2011-12-10 17:28     ` wilsonjonathan
2011-12-10 17:43       ` wilsonjonathan
