* raid 5 crashed
@ 2016-05-10 21:28 bobzer
  2016-05-11 12:09 ` Mikael Abrahamsson
  2016-05-11 13:15 ` Robin Hill
  0 siblings, 2 replies; 27+ messages in thread
From: bobzer @ 2016-05-10 21:28 UTC (permalink / raw)
  To: linux-raid

hi everyone,

I'm in panic mode :-( I have a raid 5 with 4 disks, but 2 of them now
show as removed. Yesterday a power outage knocked one disk out of the
array: sd[bcd]1 were fine and reported sde1 as removed, while sde1
itself claimed everything was fine.
So I stopped the raid, zeroed the superblock of sde1, started the raid
and re-added sde1. It then started rebuilding, and I think it had time
to finish before this new problem (I'm not 100% sure it finished, but I
believe so).
The data was accessible, so I went to sleep.
Today I found the raid in this state:

root@serveur:/home/math# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sun Mar  4 22:49:14 2012
     Raid Level : raid5
     Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
  Used Dev Size : 1953510784 (1863.01 GiB 2000.40 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Fri May  6 17:44:02 2016
          State : clean, FAILED
 Active Devices : 2
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 128K

           Name : debian:0
           UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
         Events : 892482

    Number   Major   Minor   RaidDevice State
       3       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       4       0        0        4      removed
       6       0        0        6      removed

       4       8       17        -      faulty   /dev/sdb1
       5       8       65        -      spare   /dev/sde1


root@serveur:/home/math# mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
           Name : debian:0
  Creation Time : Sun Mar  4 22:49:14 2012
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907021954 (1863.01 GiB 2000.40 GB)
     Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
  Used Dev Size : 3907021568 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1960 sectors, after=386 sectors
          State : clean
    Device UUID : 9bececcb:d520ca38:fd88d956:5718e361

    Update Time : Fri May  6 02:07:00 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : dc2a133a - correct
         Events : 892215

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)


root@serveur:/home/math# mdadm --examine /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
           Name : debian:0
  Creation Time : Sun Mar  4 22:49:14 2012
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907021954 (1863.01 GiB 2000.40 GB)
     Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
  Used Dev Size : 3907021568 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1960 sectors, after=386 sectors
          State : clean
    Device UUID : 1ecaf51c:3289a902:7bb71a93:237c68e8

    Update Time : Fri May  6 17:58:27 2016
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : b9d6aa84 - correct
         Events : 892484

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 0
   Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)

root@serveur:/home/math# mdadm --examine /dev/sdd1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
           Name : debian:0
  Creation Time : Sun Mar  4 22:49:14 2012
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907021954 (1863.01 GiB 2000.40 GB)
     Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
  Used Dev Size : 3907021568 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=0 sectors, after=386 sectors
          State : clean
    Device UUID : 406c4cb5:c188e4a9:7ed8be9f:14a49b16

    Update Time : Fri May  6 17:58:27 2016
  Bad Block Log : 512 entries available at offset 2032 sectors
       Checksum : 343f9cd0 - correct
         Events : 892484

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 1
   Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)


root@serveur:/home/math# mdadm --examine /dev/sde1
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x8
     Array UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
           Name : debian:0
  Creation Time : Sun Mar  4 22:49:14 2012
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907025072 (1863.01 GiB 2000.40 GB)
     Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
  Used Dev Size : 3907021568 (1863.01 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1960 sectors, after=3504 sectors
          State : clean
    Device UUID : f2e9c1ec:2852cf21:1a588581:b9f49a8b

    Update Time : Fri May  6 17:58:27 2016
  Bad Block Log : 512 entries available at offset 72 sectors - bad
blocks present.
       Checksum : 3a65b8bc - correct
         Events : 892484

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : spare
   Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)



PLEASE help me :-) I don't know what to do, so I haven't touched anything
to avoid doing something stupid.
A thousand thanks.

PS: I just noticed this; I hope it doesn't make my case worse:
root@serveur:/home/math# cat /etc/mdadm/mdadm.conf
DEVICE /dev/sd[bcd]1
ARRAY /dev/md0 metadata=1.2 name=debian:0
UUID=bf3c605b:9699aa55:d45119a2:7ba58d56


* Re: raid 5 crashed
  2016-05-10 21:28 raid 5 crashed bobzer
@ 2016-05-11 12:09 ` Mikael Abrahamsson
  2016-05-11 13:15 ` Robin Hill
  1 sibling, 0 replies; 27+ messages in thread
From: Mikael Abrahamsson @ 2016-05-11 12:09 UTC (permalink / raw)
  To: bobzer; +Cc: linux-raid

On Tue, 10 May 2016, bobzer wrote:

> PLEASE help me :-) I don't know what to do, so I haven't touched anything
> to avoid doing something stupid. A thousand thanks.

What does dmesg say? Did you get a read error on one of the remaining 3
drives? If so, you need to look through the list archives for how to use
dd_rescue to get as much data off that drive as possible, and try again.
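
With GNU ddrescue that would be something along these lines (the device
names here are only placeholders; double-check which drive is which
before copying anything):

# clone the failing member onto a new, equal-or-larger disk, keeping a
# mapfile so the unreadable sectors can be retried later without starting over
ddrescue -d -f -r3 /dev/sdOLD1 /dev/sdNEW1 rescue.mapfile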

It would also help if you told us which kernel version and mdadm version
you're using.

Before you try again, also do this:

for x in /sys/block/sd[a-z] ; do
        echo 180 > $x/device/timeout    # allow up to 180s for slow error recovery
done

This makes sure the kernel will not kick out a drive that takes a long
time to respond due to read errors.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: raid 5 crashed
  2016-05-10 21:28 raid 5 crashed bobzer
  2016-05-11 12:09 ` Mikael Abrahamsson
@ 2016-05-11 13:15 ` Robin Hill
  2016-05-26  3:06   ` bobzer
  1 sibling, 1 reply; 27+ messages in thread
From: Robin Hill @ 2016-05-11 13:15 UTC (permalink / raw)
  To: bobzer; +Cc: linux-raid


On Tue May 10, 2016 at 11:28:31PM +0200, bobzer wrote:

> hi everyone,
> 
> I'm in panic mode :-( I have a raid 5 with 4 disks, but 2 of them now
> show as removed. Yesterday a power outage knocked one disk out of the
> array: sd[bcd]1 were fine and reported sde1 as removed, while sde1
> itself claimed everything was fine.
> So I stopped the raid, zeroed the superblock of sde1, started the raid
> and re-added sde1. It then started rebuilding, and I think it had time
> to finish before this new problem (I'm not 100% sure it finished, but I
> believe so).
> The data was accessible, so I went to sleep.
> Today I found the raid in this state:
> 
> root@serveur:/home/math# mdadm -D /dev/md0
> /dev/md0:
>         Version : 1.2
>   Creation Time : Sun Mar  4 22:49:14 2012
>      Raid Level : raid5
>      Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
>   Used Dev Size : 1953510784 (1863.01 GiB 2000.40 GB)
>    Raid Devices : 4
>   Total Devices : 4
>     Persistence : Superblock is persistent
> 
>     Update Time : Fri May  6 17:44:02 2016
>           State : clean, FAILED
>  Active Devices : 2
> Working Devices : 3
>  Failed Devices : 1
>   Spare Devices : 1
> 
>          Layout : left-symmetric
>      Chunk Size : 128K
> 
>            Name : debian:0
>            UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
>          Events : 892482
> 
>     Number   Major   Minor   RaidDevice State
>        3       8       33        0      active sync   /dev/sdc1
>        1       8       49        1      active sync   /dev/sdd1
>        4       0        0        4      removed
>        6       0        0        6      removed
> 
>        4       8       17        -      faulty   /dev/sdb1
>        5       8       65        -      spare   /dev/sde1
> 
So this reports /dev/sdb1 faulty and /dev/sde1 spare. That would
indicate that the rebuild hadn't finished.

> root@serveur:/home/math# mdadm --examine /dev/sdb1
> /dev/sdb1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
>            Name : debian:0
>   Creation Time : Sun Mar  4 22:49:14 2012
>      Raid Level : raid5
>    Raid Devices : 4
> 
>  Avail Dev Size : 3907021954 (1863.01 GiB 2000.40 GB)
>      Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
>   Used Dev Size : 3907021568 (1863.01 GiB 2000.40 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=1960 sectors, after=386 sectors
>           State : clean
>     Device UUID : 9bececcb:d520ca38:fd88d956:5718e361
> 
>     Update Time : Fri May  6 02:07:00 2016
>   Bad Block Log : 512 entries available at offset 72 sectors
>        Checksum : dc2a133a - correct
>          Events : 892215
> 
>          Layout : left-symmetric
>      Chunk Size : 128K
> 
>    Device Role : Active device 2
>    Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
> 
We can see /dev/sdb1 has a lower event count than the others and also
that it indicates all the drives in the array were active when it was
last running. That would strongly suggest that it was not in the array
when /dev/sde1 was added to rebuild. The update time is also nearly 16
hours earlier than that of the other drives.

> root@serveur:/home/math# mdadm --examine /dev/sdc1
> /dev/sdc1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
>            Name : debian:0
>   Creation Time : Sun Mar  4 22:49:14 2012
>      Raid Level : raid5
>    Raid Devices : 4
> 
>  Avail Dev Size : 3907021954 (1863.01 GiB 2000.40 GB)
>      Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
>   Used Dev Size : 3907021568 (1863.01 GiB 2000.40 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=1960 sectors, after=386 sectors
>           State : clean
>     Device UUID : 1ecaf51c:3289a902:7bb71a93:237c68e8
> 
>     Update Time : Fri May  6 17:58:27 2016
>   Bad Block Log : 512 entries available at offset 72 sectors
>        Checksum : b9d6aa84 - correct
>          Events : 892484
> 
>          Layout : left-symmetric
>      Chunk Size : 128K
> 
>    Device Role : Active device 0
>    Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)
> 
> root@serveur:/home/math# mdadm --examine /dev/sdd1
> /dev/sdd1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
>            Name : debian:0
>   Creation Time : Sun Mar  4 22:49:14 2012
>      Raid Level : raid5
>    Raid Devices : 4
> 
>  Avail Dev Size : 3907021954 (1863.01 GiB 2000.40 GB)
>      Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
>   Used Dev Size : 3907021568 (1863.01 GiB 2000.40 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=0 sectors, after=386 sectors
>           State : clean
>     Device UUID : 406c4cb5:c188e4a9:7ed8be9f:14a49b16
> 
>     Update Time : Fri May  6 17:58:27 2016
>   Bad Block Log : 512 entries available at offset 2032 sectors
>        Checksum : 343f9cd0 - correct
>          Events : 892484
> 
>          Layout : left-symmetric
>      Chunk Size : 128K
> 
>    Device Role : Active device 1
>    Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)
> 
These two drives contain the same information. They indicate that they
were the only 2 running members in the array when they were last updated.

> root@serveur:/home/math# mdadm --examine /dev/sde1
> /dev/sde1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x8
>      Array UUID : bf3c605b:9699aa55:d45119a2:7ba58d56
>            Name : debian:0
>   Creation Time : Sun Mar  4 22:49:14 2012
>      Raid Level : raid5
>    Raid Devices : 4
> 
>  Avail Dev Size : 3907025072 (1863.01 GiB 2000.40 GB)
>      Array Size : 5860532352 (5589.04 GiB 6001.19 GB)
>   Used Dev Size : 3907021568 (1863.01 GiB 2000.40 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>    Unused Space : before=1960 sectors, after=3504 sectors
>           State : clean
>     Device UUID : f2e9c1ec:2852cf21:1a588581:b9f49a8b
> 
>     Update Time : Fri May  6 17:58:27 2016
>   Bad Block Log : 512 entries available at offset 72 sectors - bad
> blocks present.
>        Checksum : 3a65b8bc - correct
>          Events : 892484
> 
>          Layout : left-symmetric
>      Chunk Size : 128K
> 
>    Device Role : spare
>    Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)
> 
And finally /dev/sde1 shows as a spare, with the rest of the data
matching /dev/sdc1 and /dev/sdd1.

> PLEASE help me :-) I don't know what to do, so I haven't touched anything
> to avoid doing something stupid.
> A thousand thanks.
> 
> PS: I just noticed this; I hope it doesn't make my case worse:
> root@serveur:/home/math# cat /etc/mdadm/mdadm.conf
> DEVICE /dev/sd[bcd]1
> ARRAY /dev/md0 metadata=1.2 name=debian:0
> UUID=bf3c605b:9699aa55:d45119a2:7ba58d56
>

From the data here, it looks to me as though /dev/sdb1 failed originally
(hence it thinks the array was complete). Either /dev/sde1 then also
failed, or you proceeded to zero the superblock on the wrong drive.
You really need to look through the system logs and verify what happened
when and to what disk (if you rebooted at any point, the drive ordering
may have changed, so don't take for granted that the drive names are
consistent throughout).
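
For example, something like this should pull out most of the relevant
events (the log locations are the usual Debian ones; adjust if yours
differ, and use zgrep for rotated .gz files):

# look for md state changes, link resets and I/O errors around the outage
grep -iE 'md0|md:|ata[0-9]+|sd[b-e]' /var/log/syslog /var/log/kern.log | less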

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |



* Re: raid 5 crashed
  2016-05-11 13:15 ` Robin Hill
@ 2016-05-26  3:06   ` bobzer
  2016-05-27 19:19     ` bobzer
  0 siblings, 1 reply; 27+ messages in thread
From: bobzer @ 2016-05-26  3:06 UTC (permalink / raw)
  To: linux-raid, Robin Hill, Mikael Abrahamsson

Thanks for your help.
I took a while to answer because, unlucky me, the power supply of my
laptop fried, so I had no laptop and no way to work on my raid :-(
Anyway, I got a new one :-)

For the dmesg, I pasted it here: http://pastebin.com/whUHs256

root@serveur:~# uname -a
Linux serveur 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2+deb7u1 x86_64 GNU/Linux
root@serveur:~# mdadm -V
mdadm - v3.3-78-gf43f5b3 - 02nd avril 2014

About zeroing the superblock on the wrong device: I hope I didn't do
that, and I really don't think I did, because I was careful and at that
time the raid was working.

I don't know what to do. If I use dd_rescue and can't get back 100% of
the data, would I still be able to start the raid anyway?
What are my risks if I try something like:
mdadm --assemble --force /dev/md0 /dev/sd[bcde]1

Thank you very much for your time,
Mathieu



* Re: raid 5 crashed
  2016-05-26  3:06   ` bobzer
@ 2016-05-27 19:19     ` bobzer
  2016-05-30 15:01       ` bobzer
  0 siblings, 1 reply; 27+ messages in thread
From: bobzer @ 2016-05-27 19:19 UTC (permalink / raw)
  To: linux-raid, Robin Hill, Mikael Abrahamsson

hi,

I'm afraid of making the problem worse, but I received a new HD to do a
dd_rescue :-)
I'm ready to buy another HD, but the problem is that I don't know
what the best way is to recover my data.

My question is: is there a way to test whether the data/raid is ok
without taking the risk of losing anything more?

help me please :-(

best regards
Mathieu


* Re: raid 5 crashed
  2016-05-27 19:19     ` bobzer
@ 2016-05-30 15:01       ` bobzer
  2016-05-30 19:04         ` Anthonys Lists
  0 siblings, 1 reply; 27+ messages in thread
From: bobzer @ 2016-05-30 15:01 UTC (permalink / raw)
  To: linux-raid, Robin Hill, Mikael Abrahamsson

Hi,

I did a dd_rescue which recovered a lot, but not everything (in the map
below, '+' ranges were read successfully and '-' ranges could not be read):
# Rescue Logfile. Created by GNU ddrescue version 1.19
# Command line: ddrescue -d -f -r3 /dev/sdf1
wdc/9bececcb-d520-ca38-fd88-d9565718e361.dd
wdc/9bececcb-d520-ca38-fd88-d9565718e361.mapfile
# Start time:   2016-05-29 22:32:38
# Current time: 2016-05-29 23:04:49
# Finished
# current_pos  current_status
0x129F7DB6000     +
#      pos        size  status
0x00000000  0x1118E2AB200  +
0x1118E2AB200  0x00001000  -
0x1118E2AC200  0x00090000  +
0x1118E33C200  0x00001000  -
0x1118E33D200  0x00817000  +
0x1118EB54200  0x00001000  -
0x1118EB55200  0x0016B000  +
0x1118ECC0200  0x00001000  -
0x1118ECC1200  0x641976000  +
0x117D0637200  0x00001000  -
0x117D0638200  0x000FE000  +
0x117D0736200  0x00001000  -
0x117D0737200  0x000EC000  +
0x117D0823200  0x00001000  -
0x117D0824200  0x0010C000  +
0x117D0930200  0x00001000  -
0x117D0931200  0x0010C000  +
0x117D0A3D200  0x00001000  -
0x117D0A3E200  0x00375000  +
0x117D0DB3200  0x00001000  -
0x117D0DB4200  0x0010B000  +
0x117D0EBF200  0x00002000  -
0x117D0EC1200  0x0010C000  +
0x117D0FCD200  0x00001000  -
0x117D0FCE200  0x001EB000  +
0x117D11B9200  0x00001000  -
0x117D11BA200  0x00112000  +
0x117D12CC200  0x00001000  -
0x117D12CD200  0x00077000  +
0x117D1344200  0x00001000  -
0x117D1345200  0x000EB000  +
0x117D1430200  0x00001000  -
0x117D1431200  0x000FD000  +
0x117D152E200  0x00001000  -
0x117D152F200  0x0010C000  +
0x117D163B200  0x00001000  -
0x117D163C200  0x00251000  +
0x117D188D200  0x00002000  -
0x117D188F200  0x12264B3000  +
0x129F7D42200  0x00001000  -
0x129F7D43200  0x0004F000  +
0x129F7D92200  0x00001000  -
0x129F7D93200  0x00022000  +
0x129F7DB5200  0x00001000  -
0x129F7DB6200  0xA7C90DA200  +

I don't know if I can recover more, but I tried to reassemble the raid
and it worked; during the rebuild, however, sdb1 failed again.
So my data are there, but I'm not sure what to do to get them all back.

thanks




* Re: raid 5 crashed
  2016-05-30 15:01       ` bobzer
@ 2016-05-30 19:04         ` Anthonys Lists
  2016-05-30 22:00           ` bobzer
  0 siblings, 1 reply; 27+ messages in thread
From: Anthonys Lists @ 2016-05-30 19:04 UTC (permalink / raw)
  To: bobzer, linux-raid, Mikael Abrahamsson

On 30/05/2016 16:01, bobzer wrote:
> HI,
>
> i did a dd_rescue which recovered a lot but not everything :
> # Rescue Logfile. Created by GNU ddrescue version 1.19
> # Command line: ddrescue -d -f -r3 /dev/sdf1
> wdc/9bececcb-d520-ca38-fd88-d9565718e361.dd
> wdc/9bececcb-d520-ca38-fd88-d9565718e361.mapfile
> # Start time:   2016-05-29 22:32:38
> # Current time: 2016-05-29 23:04:49
> # Finished
> # current_pos  current_status
> 0x129F7DB6000     +
> #      pos        size  status
> 0x00000000  0x1118E2AB200  +
> 0x1118E2AB200  0x00001000  -
> 0x1118E2AC200  0x00090000  +
> 0x1118E33C200  0x00001000  -
> 0x1118E33D200  0x00817000  +
> 0x1118EB54200  0x00001000  -
> 0x1118EB55200  0x0016B000  +
> 0x1118ECC0200  0x00001000  -
> 0x1118ECC1200  0x641976000  +
> 0x117D0637200  0x00001000  -
> 0x117D0638200  0x000FE000  +
> 0x117D0736200  0x00001000  -
> 0x117D0737200  0x000EC000  +
> 0x117D0823200  0x00001000  -
> 0x117D0824200  0x0010C000  +
> 0x117D0930200  0x00001000  -
> 0x117D0931200  0x0010C000  +
> 0x117D0A3D200  0x00001000  -
> 0x117D0A3E200  0x00375000  +
> 0x117D0DB3200  0x00001000  -
> 0x117D0DB4200  0x0010B000  +
> 0x117D0EBF200  0x00002000  -
> 0x117D0EC1200  0x0010C000  +
> 0x117D0FCD200  0x00001000  -
> 0x117D0FCE200  0x001EB000  +
> 0x117D11B9200  0x00001000  -
> 0x117D11BA200  0x00112000  +
> 0x117D12CC200  0x00001000  -
> 0x117D12CD200  0x00077000  +
> 0x117D1344200  0x00001000  -
> 0x117D1345200  0x000EB000  +
> 0x117D1430200  0x00001000  -
> 0x117D1431200  0x000FD000  +
> 0x117D152E200  0x00001000  -
> 0x117D152F200  0x0010C000  +
> 0x117D163B200  0x00001000  -
> 0x117D163C200  0x00251000  +
> 0x117D188D200  0x00002000  -
> 0x117D188F200  0x12264B3000  +
> 0x129F7D42200  0x00001000  -
> 0x129F7D43200  0x0004F000  +
> 0x129F7D92200  0x00001000  -
> 0x129F7D93200  0x00022000  +
> 0x129F7DB5200  0x00001000  -
> 0x129F7DB6200  0xA7C90DA200  +
>
> I don't know if I can recover more, but I tried to reassemble the RAID
> and it worked; during the rebuild, however, sdb1 failed again.
> So my data is there, but I'm not sure what to do to get it all back.
>
> Thanks
>
>
Did you follow Mikael's advice? Can you do smartctl on the drive (I 
think the option you want is "smartctl -x" - display "extended all") and 
post the output?
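
For example, something along these lines (a sketch; the device names are
only illustrative, so substitute the real array members):

for d in /dev/sd[b-e]; do smartctl -x "$d" > smart-$(basename "$d").txt; done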

We're assuming you're using proper raid-capable drives, but if you 
aren't then the drives could be fine but giving up under load and 
causing your raid problems. On the other hand, if your drives are proper 
raid drives, then they're probably not fit for anything more than the 
scrapheap.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-05-30 19:04         ` Anthonys Lists
@ 2016-05-30 22:00           ` bobzer
  2016-05-31 13:45             ` Phil Turmel
  0 siblings, 1 reply; 27+ messages in thread
From: bobzer @ 2016-05-30 22:00 UTC (permalink / raw)
  To: Anthonys Lists; +Cc: linux-raid, Mikael Abrahamsson

I did follow his advice. As for smartctl, I can't run it right now, but I
will post the output.
My drives are normal desktop-grade Seagates, not NAS nor enterprise, so I
will retry, but I will limit the speed of the reconstruction with:
echo 40000 > /proc/sys/dev/raid/speed_limit_max
echo 1000 > /proc/sys/dev/raid/speed_limit_min
I hope that will let the rebuild finish.
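
(For reference, a minimal sketch assuming the array is /dev/md0: if I read
the md documentation right, the speed_limit_* values are per-device KiB/s,
and the rebuild can be watched with

watch cat /proc/mdstat

or checked once with mdadm -D /dev/md0.)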


On Mon, May 30, 2016 at 3:04 PM, Anthonys Lists
<antlists@youngman.org.uk> wrote:
> On 30/05/2016 16:01, bobzer wrote:
>>
>> HI,
>>
>> i did a dd_rescue which recovered a lot but not everything :
>> # Rescue Logfile. Created by GNU ddrescue version 1.19
>> # Command line: ddrescue -d -f -r3 /dev/sdf1
>> wdc/9bececcb-d520-ca38-fd88-d9565718e361.dd
>> wdc/9bececcb-d520-ca38-fd88-d9565718e361.mapfile
>> # Start time:   2016-05-29 22:32:38
>> # Current time: 2016-05-29 23:04:49
>> # Finished
>> # current_pos  current_status
>> 0x129F7DB6000     +
>> #      pos        size  status
>> 0x00000000  0x1118E2AB200  +
>> 0x1118E2AB200  0x00001000  -
>> 0x1118E2AC200  0x00090000  +
>> 0x1118E33C200  0x00001000  -
>> 0x1118E33D200  0x00817000  +
>> 0x1118EB54200  0x00001000  -
>> 0x1118EB55200  0x0016B000  +
>> 0x1118ECC0200  0x00001000  -
>> 0x1118ECC1200  0x641976000  +
>> 0x117D0637200  0x00001000  -
>> 0x117D0638200  0x000FE000  +
>> 0x117D0736200  0x00001000  -
>> 0x117D0737200  0x000EC000  +
>> 0x117D0823200  0x00001000  -
>> 0x117D0824200  0x0010C000  +
>> 0x117D0930200  0x00001000  -
>> 0x117D0931200  0x0010C000  +
>> 0x117D0A3D200  0x00001000  -
>> 0x117D0A3E200  0x00375000  +
>> 0x117D0DB3200  0x00001000  -
>> 0x117D0DB4200  0x0010B000  +
>> 0x117D0EBF200  0x00002000  -
>> 0x117D0EC1200  0x0010C000  +
>> 0x117D0FCD200  0x00001000  -
>> 0x117D0FCE200  0x001EB000  +
>> 0x117D11B9200  0x00001000  -
>> 0x117D11BA200  0x00112000  +
>> 0x117D12CC200  0x00001000  -
>> 0x117D12CD200  0x00077000  +
>> 0x117D1344200  0x00001000  -
>> 0x117D1345200  0x000EB000  +
>> 0x117D1430200  0x00001000  -
>> 0x117D1431200  0x000FD000  +
>> 0x117D152E200  0x00001000  -
>> 0x117D152F200  0x0010C000  +
>> 0x117D163B200  0x00001000  -
>> 0x117D163C200  0x00251000  +
>> 0x117D188D200  0x00002000  -
>> 0x117D188F200  0x12264B3000  +
>> 0x129F7D42200  0x00001000  -
>> 0x129F7D43200  0x0004F000  +
>> 0x129F7D92200  0x00001000  -
>> 0x129F7D93200  0x00022000  +
>> 0x129F7DB5200  0x00001000  -
>> 0x129F7DB6200  0xA7C90DA200  +
>>
>> i don't know if i can recover more but i tried to reassemble the raid
>> and it work but during the rebuilding the sdb1 failed again
>> so my data are there but i'm not sure to know what to do to get them all.
>>
>> thanks
>>
>>
> Did you follow Mikael's advice? Can you do smartctl on the drive (I think
> the option you want is "smartctl -x" - display "extended all") and post the
> output?
>
> We're assuming you're using proper raid-capable drives, but if you aren't
> then the drives could be fine but giving up under load and causing your raid
> problems. On the other hand, if your drives are proper raid drives, then
> they're probably not fit for anything more than the scrapheap.
>
> Cheers,
> Wol

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-05-30 22:00           ` bobzer
@ 2016-05-31 13:45             ` Phil Turmel
  2016-05-31 18:49               ` Wols Lists
  0 siblings, 1 reply; 27+ messages in thread
From: Phil Turmel @ 2016-05-31 13:45 UTC (permalink / raw)
  To: bobzer, Anthonys Lists; +Cc: linux-raid, Mikael Abrahamsson

On 05/30/2016 06:00 PM, bobzer wrote:
> I did follow his advice for the smartctl i can't right now but i will
> post the output
> my drives are desktop grade normal seagate, not NAS nor entreprise so
> i will retry but i will limit the speed of reconstruction with :
> echo 40000 > /proc/sys/dev/raid/speed_limit_max
> echo 1000 > /proc/sys/dev/raid/speed_limit_min
> I hope that it will permit to finish the rebuilding

No, that is unlikely to help.  You need to read about "timeout mismatch"
as you clearly have a bad case of it.  This is not a new issue, as you
can see from the dates of the following:

http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2
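
The short version, as a sketch (device names are illustrative, and not every
desktop drive supports SCT ERC): check whether each member accepts error
recovery control, cap it at 7 seconds if it does, and raise the kernel's
command timeout if it doesn't.

# does the drive support SCT Error Recovery Control?
smartctl -l scterc /dev/sdb

# if yes: limit read/write error recovery to 7.0 seconds (the values are in
# tenths of a second); repeat for every member, on every boot
smartctl -l scterc,70,70 /dev/sdb

# if no: give the drive plenty of time before the SCSI layer gives up on it
echo 180 > /sys/block/sdb/device/timeout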

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-05-31 13:45             ` Phil Turmel
@ 2016-05-31 18:49               ` Wols Lists
  2016-06-01  1:48                 ` Brad Campbell
  0 siblings, 1 reply; 27+ messages in thread
From: Wols Lists @ 2016-05-31 18:49 UTC (permalink / raw)
  To: Phil Turmel, bobzer; +Cc: linux-raid, Mikael Abrahamsson

On 31/05/16 14:45, Phil Turmel wrote:
> On 05/30/2016 06:00 PM, bobzer wrote:
>> I did follow his advice for the smartctl i can't right now but i will
>> post the output
>> my drives are desktop grade normal seagate, not NAS nor entreprise so
>> i will retry but i will limit the speed of reconstruction with :
>> echo 40000 > /proc/sys/dev/raid/speed_limit_max
>> echo 1000 > /proc/sys/dev/raid/speed_limit_min
>> I hope that it will permit to finish the rebuilding
> 
> No, that is unlikely to help.  You need to read about "timeout mismatch"
> as you clearly have a bad case of it.  This is not a new issue, as you
> can see from the dates of the following:
> 
> http://marc.info/?l=linux-raid&m=139050322510249&w=2
> http://marc.info/?l=linux-raid&m=135863964624202&w=2
> http://marc.info/?l=linux-raid&m=135811522817345&w=1
> http://marc.info/?l=linux-raid&m=133761065622164&w=2
> http://marc.info/?l=linux-raid&m=132477199207506
> http://marc.info/?l=linux-raid&m=133665797115876&w=2
> http://marc.info/?l=linux-raid&m=142487508806844&w=3
> http://marc.info/?l=linux-raid&m=144535576302583&w=2
> 
And you need to follow Mikael's advice EVERY boot. It sounds like this
is your problem. So once you've managed to get your array reconstructed,
you need to replace your drives FAST.

In fact, with your setup, I'd be inclined to GIVE UP RIGHT NOW trying to
reconstruct the raid as-is. Get four replacement NAS drives (they can
be 3TB drives if you want to increase the space; this technique should
work regardless...).

When the new drives arrive, copy each old drive in turn onto a new drive...

dd if=/dev/sda of=/dev/sde

etc etc. Expect it to take a little while ... (others may chime in and
tell you to use ddrescue rather than dd - I don't know which one is
best, both should work).
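
(If you do go the ddrescue route, a rough sketch, with made-up device names,
so triple-check which disk is which before running anything:

ddrescue -f -n /dev/sdOLD /dev/sdNEW /root/sdOLD.map
ddrescue -f -r3 /dev/sdOLD /dev/sdNEW /root/sdOLD.map

The first pass skips the slow retries on bad areas, the second pass goes back
and retries them, and the map file lets you stop and resume safely.)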

Once you've copied and replaced all four drives, your system should boot
and recover without difficulty. And if there IS a problem, at least you
still have the original drives UNTOUCHED.

Once you've got your system back, you should be able to claim the new
space if you did buy bigger drives. The old drives are probably fine -
you've just gone over the size limit at which fatal errors become a
probability for non-NAS drives.

I've got two 3TB Seagate Barracudas in a mirror. I can get away with a
mirror, I hope, but there's no way I'd go to raid 5 without proper NAS
drives (I'm a bit gutted - I originally bought the Barracudas intending
to do just that, but 3x3TB is just asking for trouble!)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-05-31 18:49               ` Wols Lists
@ 2016-06-01  1:48                 ` Brad Campbell
  2016-06-01  3:46                   ` Edward Kuns
  2016-06-01 15:42                   ` Wols Lists
  0 siblings, 2 replies; 27+ messages in thread
From: Brad Campbell @ 2016-06-01  1:48 UTC (permalink / raw)
  To: Wols Lists, Phil Turmel, bobzer; +Cc: linux-raid, Mikael Abrahamsson

On 01/06/16 02:49, Wols Lists wrote:

> When the new drives arrive, copy each old drive in turn onto a new drive...
>
> dd if=/dev/sda of=/dev/sde
>
> etc etc. Expect it to take a little while ... (others may chime in and
> tell you to use ddrescue rather than dd - I don't know which one is
> best, both should work).

Do NOT use dd on a drive with bad sectors.... ever....

Using dd like you prescribe above will simply abort when it hits the 
first read error. Using dd with the 'noerror' option will appear to work 
and make you feel all warm and fuzzy, until you eventually realize that 
when it encounters a read error, it skips that input block but does 
*not* pad the output appropriately. So you wind up with everything after 
the first read error in the wrong place on the disk. That will never end 
well.

Use one of the ddrescue style of applications to guarantee everything 
comes out where it needs to be regardless of input read errors.

Now, having said that :

Much better to try and get the array running in a read-only state with 
all disks in place and clone the data from the array rather than the 
disks after they've been ddrescued. In the case of a running array, a 
read error on one of the array members will see the RAID attempt to get 
the data from elsewhere (a reconstruction), whereas a read from a disc 
cloned with ddrescue will happily just report what was a faulty sector 
as a big pile of zeros, and *poof* your data is gone.

Set the timeouts appropriately (and conservatively) to give the disks 
time to actually report they can't read the sector. This will allow md 
to try and get it elsewhere rather than kicking the disc out because the 
storage stack timed it out as faulty.

Brad.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01  1:48                 ` Brad Campbell
@ 2016-06-01  3:46                   ` Edward Kuns
  2016-06-01  4:07                     ` Brad Campbell
  2016-06-01 15:42                   ` Wols Lists
  1 sibling, 1 reply; 27+ messages in thread
From: Edward Kuns @ 2016-06-01  3:46 UTC (permalink / raw)
  To: Brad Campbell
  Cc: Wols Lists, Phil Turmel, bobzer, linux-raid, Mikael Abrahamsson

On Tue, May 31, 2016 at 8:48 PM, Brad Campbell
<lists2009@fnarfbargle.com> wrote:
> Much better to try and get the array running in a read-only state with all
> disks in place and clone the data from the array rather than the disks after
> they've been ddrescued. In the case of a running array, a read error on one
> of the array members will see the RAID attempt to get the data from
> elsewhere (a reconstruction), whereas a read from a disc cloned with
> ddrescue will happily just report what was a faulty sector as a big pile of
> zeros, and *poof* your data is gone.

My understanding is that mdraid will kick out a drive with an unrecoverable
hardware error on a single sector.  (Is this incorrect?)  How do you add
the drive back in and get the raid in a read-only mode that won't kick out
drives for failures, thus allowing you to fully recover data on a (say)
raid5 with bad sectors on every drive at different sector offsets?

I'm asking hypothetically.  I have my arrays scrubbed weekly to prevent
this kind of surprise, and I keep multiple backups of my most important and
irreplaceable data.

In the past, I had a mirror that kept kicking out one drive.  I never lost
any data.  Between a decent backup policy and luck, I never experienced two
drive failures at the same time.  It's likely that the drive was timing out
on an unrecoverable read error, but the OS gave up first.  This was before
I knew to fix that default (mis)tuning, as was discussed here recently.
Last fall, that drive totally failed, and since, I've been paying much more
attention to care-and-maintenance!

          Thanks,

           Eddie

P.S. Sorry for the double-response to those on the CC list... I forgot
to tell gmail to use plain text.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01  3:46                   ` Edward Kuns
@ 2016-06-01  4:07                     ` Brad Campbell
  2016-06-01  5:23                       ` Edward Kuns
  2016-06-01 15:36                       ` Wols Lists
  0 siblings, 2 replies; 27+ messages in thread
From: Brad Campbell @ 2016-06-01  4:07 UTC (permalink / raw)
  To: Edward Kuns
  Cc: Wols Lists, Phil Turmel, bobzer, linux-raid, Mikael Abrahamsson

On 01/06/16 11:46, Edward Kuns wrote:
> On Tue, May 31, 2016 at 8:48 PM, Brad Campbell
> <lists2009@fnarfbargle.com> wrote:
>> Much better to try and get the array running in a read-only state with all
>> disks in place and clone the data from the array rather than the disks after
>> they've been ddrescued. In the case of a running array, a read error on one
>> of the array members will see the RAID attempt to get the data from
>> elsewhere (a reconstruction), whereas a read from a disc cloned with
>> ddrescue will happily just report what was a faulty sector as a big pile of
>> zeros, and *poof* your data is gone.
>
> My understanding is that mdraid will kick out a drive with an unrecoverable
> hardware error on a single sector.  (Is this incorrect?)  How do you add
> the drive back in and get the raid in a read-only mode that won't kick out
> drives for failures, thus allowing you to fully recover data on a (say)
> raid5 with bad sectors on every drive at different sectors offsets?

Yes, that is incorrect. If your timeouts are configured correctly the 
drive will report an uncorrectable error up the stack. MD will try to 
get the data from elsewhere and it will try to re-write the bad sector 
with the reconstructed data.

If your timeouts are *wrong* however, the drive will go away and try 
desperately to read the sector. This most often will take well in excess 
of the default 30 second ata stack timeout. So the ata stack will poke 
the drive after 30 seconds, but the drive is still tied up trying to 
recover the data. It sees this as the drive going away, and reports to 
MD that the drive is gone. *Bang* it's out of the array.

There are other issues associated with md trying to re-write the sector 
and failing (and some complexities around whether or not you have a bad 
block list) that may kick the drive if the write fails, but invariably 
it's a read timeout issue that causes this problem.

> I'm asking hypothetically.  I have my arrays scrubbed weekly to prevent
> this kind of surprise, and I keep multiple backups of my most important and
> irreplaceable data.

Certainly best practice. I was more lax until I encountered a 
misbehaving SIL controller that silently corrupted a significant 
proportion of a 16TB array.

> In the past, I had a mirror that kept kicking out one drive.  I never lost
> any data.  Between a decent backup policy and luck, I never experienced two
> drive failures at the same time.  It's likely that the drive was timing out
> on an unrecoverable read error, but the OS gave up first.  This was before
> I knew to fix that default (mis)tuning, as was discussed here recently.
> Last fall, that drive totally failed, and since, I've been paying much more
> attention to care-and-maintenance!

I do daily short SMART tests, weekly long SMART tests and monthly 
"check" scrubs of the RAID(s). I'm not saying it's best practice, but 
I've had several drives turn up SMART failures and managed to address 
those before it became an issue on any of the arrays.

I've used the '--force' parameter to mdadm to assemble an array after a 
sata controller failure and had good results, but I wasn't dealing with 
bad drives and the need to make things read-only, so I defer to those 
with more experience in that regard.

I have however done a *lot* of data recovery on single drives over the 
years and can absolutely vouch that dd will leave you in tears.


Regards,
Brad


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01  4:07                     ` Brad Campbell
@ 2016-06-01  5:23                       ` Edward Kuns
  2016-06-01  5:28                         ` Brad Campbell
  2016-06-01 15:36                       ` Wols Lists
  1 sibling, 1 reply; 27+ messages in thread
From: Edward Kuns @ 2016-06-01  5:23 UTC (permalink / raw)
  To: Brad Campbell
  Cc: Wols Lists, Phil Turmel, bobzer, linux-raid, Mikael Abrahamsson

On 01/06/16 11:46, Edward Kuns wrote:
> My understanding is that mdraid will kick out a drive with an
> unrecoverable hardware error on a single sector.  (Is this incorrect?)

On Tue, May 31, 2016 at 11:07 PM, Brad Campbell
<lists2009@fnarfbargle.com> wrote:
> Yes, that is incorrect. If your timeouts are configured correctly the drive
> will report an uncorrectable error up the stack. MD will try to get the data
> from elsewhere and it will try to re-write the bad sector with the
> reconstructed data.

Ah ha.  OK.  That makes a lot more sense.  Thanks.  And I *know* I had
bad timeouts on consumer drives before I replaced this drive.

> I do daily short SMART tests, weekly long SMART tests and monthly "check"
> scrubs of the RAID(s). I'm not saying it's best practice, but I've had
> several drives turn up SMART failures and managed to address those before it
> became an issue on any of the arrays.

As far as SMART, I'm running with the default Fedora behavior in that
area.  I don't believe it does any automatic tests, unless the default
logwatch behavior does SMART tests.  I'll add this to my "todo" list.
Thanks.

              Eddie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01  5:23                       ` Edward Kuns
@ 2016-06-01  5:28                         ` Brad Campbell
  0 siblings, 0 replies; 27+ messages in thread
From: Brad Campbell @ 2016-06-01  5:28 UTC (permalink / raw)
  To: Edward Kuns
  Cc: Wols Lists, Phil Turmel, bobzer, linux-raid, Mikael Abrahamsson

On 01/06/16 13:23, Edward Kuns wrote:

>> I do daily short SMART tests, weekly long SMART tests and monthly "check"
>> scrubs of the RAID(s). I'm not saying it's best practice, but I've had
>> several drives turn up SMART failures and managed to address those before it
>> became an issue on any of the arrays.
>
> As far as SMART, I'm running with the default Fedora behavior in that
> area.  I don't believe it does any automatic tests, unless the default
> logwatch behavior does SMART tests.  I'll add this to my "todo" list.
> Thanks.

I just add this into smartd.conf on every machine I have (all Debian 
based systems). Make sure it's the first DEVICESCAN line as all lines 
after this are ignored.

Mail to brad, short tests every day at 2am and long tests at 4am on a 
Sunday.

DEVICESCAN -m brad -s (L/../../7/04|S/../.././02) -M exec 
/usr/share/smartmontools/smartd-runner

By default I've never had any system schedule smart tests.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01  4:07                     ` Brad Campbell
  2016-06-01  5:23                       ` Edward Kuns
@ 2016-06-01 15:36                       ` Wols Lists
  2016-06-01 23:15                         ` Brad Campbell
  1 sibling, 1 reply; 27+ messages in thread
From: Wols Lists @ 2016-06-01 15:36 UTC (permalink / raw)
  To: Brad Campbell, Edward Kuns
  Cc: Phil Turmel, bobzer, linux-raid, Mikael Abrahamsson

On 01/06/16 05:07, Brad Campbell wrote:
> I have however done a *lot* of data recovery on single drives over the
> years and can absolutely vouch that dd will leave you in tears.

Good to know! I've regularly used dd on drives, but not on ones that
were in trouble (maybe once ...)

But now that drives are at the point where you cannot guarantee an
error-free scan even in a single pass, I guess I'll have to make sure I use
ddrescue from now on. I usually just read an (old) drive into a file on a
new, larger hard disk, loop-mount it, and proceed from there ... never had
any trouble so far ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01  1:48                 ` Brad Campbell
  2016-06-01  3:46                   ` Edward Kuns
@ 2016-06-01 15:42                   ` Wols Lists
  2016-06-01 17:28                     ` Phil Turmel
  1 sibling, 1 reply; 27+ messages in thread
From: Wols Lists @ 2016-06-01 15:42 UTC (permalink / raw)
  To: Brad Campbell, Phil Turmel, bobzer; +Cc: linux-raid, Mikael Abrahamsson

On 01/06/16 02:48, Brad Campbell wrote:
> Now, having said that :
> 
> Much better to try and get the array running in a read-only state with
> all disks in place and clone the data from the array rather than the
> disks after they've been ddrescued. In the case of a running array, a
> read error on one of the array members will see the RAID attempt to get
> the data from elsewhere (a reconstruction), whereas a read from a disc
> cloned with ddrescue will happily just report what was a faulty sector
> as a big pile of zeros, and *poof* your data is gone.
> 
> Set the timeouts appropriately (and conservatively) to give the disks
> time to actually report they can't read the sector. This will allow md
> to try and get it elsewhere rather than kicking the disc out because the
> storage stack timed it out as faulty.

Okay - so would this be better (a lot slower, possibly, but safe ...)

Use dd - so it DOES bomb on error! - and only replace the drive once
you've got a clean read off it. With 2TB drives, that should work so
long as they're not faulty. And if it's - JUST - a timeout issue,
this'll work fine?

Cheers,
Wol

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01 15:42                   ` Wols Lists
@ 2016-06-01 17:28                     ` Phil Turmel
  0 siblings, 0 replies; 27+ messages in thread
From: Phil Turmel @ 2016-06-01 17:28 UTC (permalink / raw)
  To: Wols Lists, Brad Campbell, bobzer; +Cc: linux-raid, Mikael Abrahamsson

On 06/01/2016 11:42 AM, Wols Lists wrote:

> Okay - so would this be better (a lot slower, possibly, but safe ...)
> 
> Use dd - so it DOES bomb on error! - and only replace the drive once
> you've got a clean read off it. With 2TB drives, that should work so
> long as they're not faulty. And if it's - JUST - a timeout issue,
> this'll work fine?

If there are errors, you'll never get a clean read.  (Short of the moon
and stars aligning for a near-miracle.)  ddrescue and similar replace
those errors with zeros to successfully retrieve less than 100% of your
data.

The whole point of keeping it in the array is to get the correct data
from the array's redundancy wherever the disk has unfixed read errors.
And with correct timeouts, to *FIX* that read error.  Please read *all*
of the links I posted on why and how this is.

Side note:  In these situations, you should *not* use overlays, as that
prevents the *FIX* part from happening.

Temporarily setting the timeouts for non-raid drives is a one-liner:

for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done
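
To make that survive reboots, one option (a sketch, not the only way) is a
udev rule, for example in a hypothetical /etc/udev/rules.d/60-disk-timeout.rules:

ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"

or simply putting the one-liner above into /etc/rc.local.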

Phil


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01 15:36                       ` Wols Lists
@ 2016-06-01 23:15                         ` Brad Campbell
  2016-06-02  5:52                           ` Mikael Abrahamsson
  2016-06-02 14:01                           ` Wols Lists
  0 siblings, 2 replies; 27+ messages in thread
From: Brad Campbell @ 2016-06-01 23:15 UTC (permalink / raw)
  To: Wols Lists, Edward Kuns
  Cc: Phil Turmel, bobzer, linux-raid, Mikael Abrahamsson



On 01/06/16 23:36, Wols Lists wrote:
> On 01/06/16 05:07, Brad Campbell wrote:
>> I have however done a *lot* of data recovery on single drives over the
>> years and can absolutely vouch that dd will leave you in tears.
> Good to know! I've regularly used dd on drives, but not on ones that
> were in trouble (maybe once ...)
>
> But now drives are at the point that you cannot guarantee an error-free
> scan even on just one pass,
People keep saying that. I've never encountered it. I suspect it's just 
not the problem that the hysterical ranting makes it out to be (either 
that or the pile of cheap and nasty drives I have here are model citizens).
I've *never* seen a read error unless the drive was in trouble, and that 
includes running dd reads in a loop over multiple days continuously.
If it were that bad I'd see drives failing SMART long tests routinely 
also, and that does not happen either.

-- 

Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01 23:15                         ` Brad Campbell
@ 2016-06-02  5:52                           ` Mikael Abrahamsson
  2016-06-02 14:01                           ` Wols Lists
  1 sibling, 0 replies; 27+ messages in thread
From: Mikael Abrahamsson @ 2016-06-02  5:52 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-raid

On Thu, 2 Jun 2016, Brad Campbell wrote:

> People keep saying that. I've never encountered it. I suspect it's just not

Well, I have had drives that would occasionally throw a read error, but MD 
requires that read error to happen three times before re-writing the 
sector, and that never happened. See earlier discussions I had with Neil 
on the topic. But you're correct, I don't see this on normally functioning 
drives. Swapped out that drive (it didn't have any specific SMART errors 
either) and everything was fine. I don't know what was wrong with it, 
might have been something flying around in there causing spurious 
problems.

> the problem that the hysterical ranting makes it out to be (either that or 
> the pile of cheap and nasty drives I have here are model citizens).
> I've *never* seen a read error unless the drive was in trouble, and that 
> includes running dd reads in a loop over multiple days continuously.
> If it were that bad I'd see drives failing SMART long tests routinely also, 
> and that does not happen either.

I've seen enough read errors that I nowadays only run RAID6, never RAID5. 
I'd also venture to say that considering the amount of people who come on 
the list and who come on the #linux-raid IRC channel with "raid5, 
one-drive-failed, and now I have read error on another drive so my array 
doesn't resync, what should I do?", I'd say this is a real problem. It's 
not however like "if you have a good drive, reading it 5 times will yield 
a read error". The vendor bit error rate specification doesn't work like 
that, so totally agree with you there.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-01 23:15                         ` Brad Campbell
  2016-06-02  5:52                           ` Mikael Abrahamsson
@ 2016-06-02 14:01                           ` Wols Lists
  2016-06-02 15:27                             ` Andreas Klauer
  2016-06-03  1:05                             ` Brad Campbell
  1 sibling, 2 replies; 27+ messages in thread
From: Wols Lists @ 2016-06-02 14:01 UTC (permalink / raw)
  To: Brad Campbell, Edward Kuns
  Cc: Phil Turmel, bobzer, linux-raid, Mikael Abrahamsson

On 02/06/16 00:15, Brad Campbell wrote:
> People keep saying that. I've never encountered it. I suspect it's just
> not the problem that the hysterical ranting makes it out to be (either
> that or the pile of cheap and nasty drives I have here are model citizens).
> I've *never* seen a read error unless the drive was in trouble, and that
> includes running dd reads in a loop over multiple days continuously.
> If it were that bad I'd see drives failing SMART long tests routinely
> also, and that does not happen either.

Note I didn't say you *will* see an error. BUT. If I recall correctly,
the specs say that one read error per 10TB read is acceptable for a
desktop drive that is designated healthy. In other words, if a 4TB drive
throws an error every third pass, then according to the spec it's a
perfectly healthy drive.
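
(For scale, taking the usual '1 in 10^14 bits' figure literally: 10^14 / 8 =
1.25 x 10^13 bytes, i.e. roughly 12.5 TB read per allowed error, which is
where the 'about one error per 10 TB' rule of thumb comes from.)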

Yes. We know that most drives are far better than spec, and if it
degrades to spec then it's probably heading for failure, but the fact
remains. If you have 3 x 4TB desktop drives in an array, then the spec
says you should expect, and be able to deal with, an error EVERY time
you scan the array. (Yes, I know I would probably be panicking if I got
even one error, too :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-02 14:01                           ` Wols Lists
@ 2016-06-02 15:27                             ` Andreas Klauer
  2016-06-03  1:05                             ` Brad Campbell
  1 sibling, 0 replies; 27+ messages in thread
From: Andreas Klauer @ 2016-06-02 15:27 UTC (permalink / raw)
  To: Wols Lists
  Cc: Brad Campbell, Edward Kuns, Phil Turmel, bobzer, linux-raid,
	Mikael Abrahamsson

On Thu, Jun 02, 2016 at 03:01:35PM +0100, Wols Lists wrote:
> If you have 3 x 4TB desktop drives in an array, then the spec
> says you should expect, and be able to deal with, an error EVERY time
> you scan the array.

It doesn't happen in practice, though. (Thank god.)

There was a paper about disk failures that said URE was simply not useful. 
(Empirical Measurements of Disk Failure Rates).

There's this ZDnet article that declared RAID5 dead in 2009 but it still 
works fine for me.

I just ignore the URE spec entirely.
(Until someone can prove that it actually matters.)

IMHO the main reason people notice disk failures during rebuilds is that
they never ever tested their disks for read errors before. You should do 
so regularly.

A long SMART self-test takes ages on a large disk; on a busy server with
today's disk sizes it can take days, which is why people avoid running
them (the other reason is laziness).

However, SMART also supports selective self-tests; so you can run a 
relatively short test every day and cover the entire disk over time.
You can schedule these partial tests at night when server load is lowest.
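
For example (a sketch; the range is arbitrary and selective self-test support
varies from drive to drive):

# test roughly the first 50 GB worth of LBAs now ...
smartctl -t select,0-100000000 /dev/sda
# ... and on following nights continue from where the last test stopped
smartctl -t select,next /dev/sda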

I think mdadm can also do a selective RAID check by fiddling with some 
variables in /proc but there is no obvious way of doing so via 
the userspace program.
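
On reasonably recent kernels those knobs actually live under sysfs rather
than /proc. A sketch, assuming the array is md0 and remembering that the
values are in 512-byte sectors:

echo 0         > /sys/block/md0/md/sync_min
echo 104857600 > /sys/block/md0/md/sync_max     # only the first ~50 GB
echo check     > /sys/block/md0/md/sync_action
# progress is in /sys/block/md0/md/sync_completed; 'echo idle > sync_action'
# stops it, and raising sync_min/sync_max next time covers the next chunk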

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-02 14:01                           ` Wols Lists
  2016-06-02 15:27                             ` Andreas Klauer
@ 2016-06-03  1:05                             ` Brad Campbell
  2016-06-03  7:52                               ` Mikael Abrahamsson
  1 sibling, 1 reply; 27+ messages in thread
From: Brad Campbell @ 2016-06-03  1:05 UTC (permalink / raw)
  To: Wols Lists, Edward Kuns
  Cc: Phil Turmel, bobzer, linux-raid, Mikael Abrahamsson

On 02/06/16 22:01, Wols Lists wrote:
> On 02/06/16 00:15, Brad Campbell wrote:
>> People keep saying that. I've never encountered it. I suspect it's just
>> not the problem that the hysterical ranting makes it out to be (either
>> that or the pile of cheap and nasty drives I have here are model citizens).
>> I've *never* seen a read error unless the drive was in trouble, and that
>> includes running dd reads in a loop over multiple days continuously.
>> If it were that bad I'd see drives failing SMART long tests routinely
>> also, and that does not happen either.
>
> Note I didn't say you *will* see an error. BUT. If I recall correctly,
> the specs say that one read error per 10TB read is acceptable for a
> desktop drive that is designated healthy. In other words, if a 4TB drive
> throws an error every third pass, then according to the spec it's a
> perfectly healthy drive.
>
> Yes. We know that most drives are far better than spec, and if it
> degrades to spec then it's probably heading for failure, but the fact
> remains. If you have 3 x 4TB desktop drives in an array, then the spec
> says you should expect, and be able to deal with, an error EVERY time
> you scan the array.

No, it really doesn't. Those URE figures say '< 1 in 10^14', not '= 1 in
10^14'. So that's a statistical worst case rather than a "this is what
you should expect". In addition, it's not a linear extrapolation, it's a
probability.

By that logic I should "expect" to roll a 6 at least once every 6 dice 
rolls.

You can't extrapolate statistical figures like that. Just the same as 
you can't calculate drive failures from MTBF figures.

Just perform regular read tests on all drives and periodic array scrubs 
and you'll be much better off.

I've never had a reported URE on any of my arrays with SAS drives, most
have reallocated sectors. They perform background reads periodically and 
auto-reallocate anything that is looking dodgy.

SATA drives don't do that, but we can manage that externally with long 
SMART tests and array scrubs to force rewrite/reallocation.

Just don't go trying to extrapolate from manufacturers' probability data.
There are plenty of garbage web pages littered around the net where 
"experts" do that, leading to 'hysterical ranting' about how the world 
is ending and RAID5 is the devil. Sure RAID5 can be an issue when 
dealing with a catastrophic drive failure requiring a rebuild if you 
don't look after your drives, and I use and prefer RAID6 to mitigate 
that, but it's not the end of the world.


Now, on an interesting, related and completely different note. To get 
back to the concept of using dd or dd_rescue, I had a thought last night 
and I've never seen it mentioned anywhere.

When you clone a dud drive using dd_rescue, it creates a bad block log.

The reason we don't like doing this is because when you put the 
replacement drive back into the array, md does not see the errors and 
will happily return zero data when it reads any sector that was bad on 
the old drive.

hdparm has a neat feature called --make-bad-sector. It uses a feature of 
the ATA protocol to write a sector that contains an invalid CRC, so the 
drive returns an error when you try and read it. The sector is restored 
by a normal re-write, so no reallocation or permanent damage takes place.

If we took the bad block list from the dd_rescue, and fed it to hdparm 
to create bad sectors in all those locations on the cloned disk, md 
would get a bad sector on read and attempt a recovery rather than 
returning zero. This would "in theory" cause a re-write of good data
back to that disk and minimise the chance of data loss.
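
A rough sketch of that idea (untested; it assumes GNU awk for strtonum,
512-byte logical sectors, and that the mapfile offsets are relative to
whatever was cloned, so if that was a partition the partition's start LBA
must be added before handing sectors to hdparm, which addresses the whole
disk):

awk '$3 == "-" { printf "%d %d\n", strtonum($1)/512, strtonum($2)/512 }' \
    sdOLD.mapfile |
while read sector count; do
    for ((i = 0; i < count; i++)); do
        # destroys the sector contents on the clone, but a later re-write
        # (e.g. by md recovery) restores it to a normal readable sector
        hdparm --yes-i-know-what-i-am-doing --make-bad-sector $((sector + i)) /dev/sdNEW
    done
done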

This might be a useful "last ditch" recovery method to allow you to 
bring up an array with a cloned disk and minimise data loss. On the 
other hand, let's say you are using it to bring up a RAID 5 with 2 failed
disks. One completely dead and one that you managed to clone most of. 
When you extract the data from the running and degraded array, md will 
pass the read error up the stack when it encounters the bad sectors, 
allowing your copy or rsync session to log which files are affected as 
you backup the remaining contents rather than just return silently 
corrupted files.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-03  1:05                             ` Brad Campbell
@ 2016-06-03  7:52                               ` Mikael Abrahamsson
  2016-06-03 15:27                                 ` bobzer
  0 siblings, 1 reply; 27+ messages in thread
From: Mikael Abrahamsson @ 2016-06-03  7:52 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-raid

On Fri, 3 Jun 2016, Brad Campbell wrote:

> If we took the bad block list from the dd_rescue, and fed it to hdparm 
> to create bad sectors in all those locations on the cloned disk, md 
> would get a bad sector on read and attempt a recovery rather than 
> returning zero, This would "in theory" cause a re-write of good data 
> back to that disk and minimise the chance of data loss.

Why would you use dd_rescue on a drive unless you're doing it because 
you're getting UREs on your remaining component drives on your degraded 
RAID5 so it won't complete resync? That's at least the most common 
use-case for ddrescue I've seen in md-raid scenarios.

> This might be a useful "last ditch" recovery method to allow you to 
> bring up an array with a cloned disk and minimise data loss. On the 
> other hand, lets say you are using it to bring up a RAID 5 with 2 failed 
> disks. One completely dead and one that you managed to clone most of. 
> When you extract the data from the running and degraded array, md will 
> pass the read error up the stack when it encounters the bad sectors, 
> allowing your copy or rsync session to log which files are affected as 
> you backup the remaining contents rather than just return silently 
> corrupted files.

This use case makes a lot more sense to me than the first one. Knowing 
what files are now bad would be very useful.

But wouldn't it be better to put the known errors in the new bad_blocks 
list in md that I believe is a fairly recent feature?

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-03  7:52                               ` Mikael Abrahamsson
@ 2016-06-03 15:27                                 ` bobzer
  2016-06-03 16:31                                   ` Sarah Newman
  0 siblings, 1 reply; 27+ messages in thread
From: bobzer @ 2016-06-03 15:27 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Brad Campbell, linux-raid

I'm still trying to recover my data. I did a ddrescue of the partition
only (I regret that; I don't know why I didn't take the whole disk :-( ).
Unluckily, I'm having trouble making all the disks I need visible in my
server (it's a VM and I'm short of SATA ports on the host).
I have a SATA controller card which is not recognized by ESXi but works
perfectly with Linux, so I'm using GParted Live.
In GParted Live the mdadm version is 3.4 (28th January 2016); on my
server it is mdadm v3.3-78-gf43f5b3 (2nd April 2014).

So I'm asking myself: is it better to try to rebuild the array from my
original server or from GParted Live?

I used losetup to turn my dd image into a /dev/loop device and took the
risk of starting the RAID, but it doesn't work.
I did:
mdadm --assemble --force --name=/dev/md0 /dev/sdb1 /dev/sdd1
/dev/loop2 /dev/sdc1
It answers:
mdadm: device /dev/sdb1 exist but is not an md array

Except that is not true: sdb1 and sdd1 are the correct drives, and sdc1 is
the one seen as a spare.

So I don't know what to do :-(

Thank you for your advice.

PS: Sorry for my English, it's not my first language.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-03 15:27                                 ` bobzer
@ 2016-06-03 16:31                                   ` Sarah Newman
  2016-06-04  2:56                                     ` bobzer
  0 siblings, 1 reply; 27+ messages in thread
From: Sarah Newman @ 2016-06-03 16:31 UTC (permalink / raw)
  To: bobzer; +Cc: linux-raid

On 06/03/2016 08:27 AM, bobzer wrote:

> i use losetup to make my dd image a dev/loop device and took the risk
> of start the raid but it doesn't work:
> i did :
> mdadm --assemble --force --name=/dev/md0 /dev/sdb1 /dev/sdd1
> /dev/loop2 /dev/sdc1
> it answer :
> mdadm: device /dev/sdb1 exist but is not an md array
> 
> except that is not true, sdb1 and sdd1 are the correct drive. sdc1 is
> the one saw as a spare

The --name option is for when you want to call the device something other than just md0. If you use /dev/md0 instead of --name=/dev/md0 it might work.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: raid 5 crashed
  2016-06-03 16:31                                   ` Sarah Newman
@ 2016-06-04  2:56                                     ` bobzer
  0 siblings, 0 replies; 27+ messages in thread
From: bobzer @ 2016-06-04  2:56 UTC (permalink / raw)
  To: Sarah Newman; +Cc: linux-raid

> The --name is for if you want to call the device something other than just md0. If you use /dev/md0 instead of --name=/dev/md0 it might work.

Thanks, in fact it was just that ...
So I did:
mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/loop1 /dev/sdd1
And now I'm just hoping that it will work... all fingers crossed :-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-06-04  2:56 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-10 21:28 raid 5 crashed bobzer
2016-05-11 12:09 ` Mikael Abrahamsson
2016-05-11 13:15 ` Robin Hill
2016-05-26  3:06   ` bobzer
2016-05-27 19:19     ` bobzer
2016-05-30 15:01       ` bobzer
2016-05-30 19:04         ` Anthonys Lists
2016-05-30 22:00           ` bobzer
2016-05-31 13:45             ` Phil Turmel
2016-05-31 18:49               ` Wols Lists
2016-06-01  1:48                 ` Brad Campbell
2016-06-01  3:46                   ` Edward Kuns
2016-06-01  4:07                     ` Brad Campbell
2016-06-01  5:23                       ` Edward Kuns
2016-06-01  5:28                         ` Brad Campbell
2016-06-01 15:36                       ` Wols Lists
2016-06-01 23:15                         ` Brad Campbell
2016-06-02  5:52                           ` Mikael Abrahamsson
2016-06-02 14:01                           ` Wols Lists
2016-06-02 15:27                             ` Andreas Klauer
2016-06-03  1:05                             ` Brad Campbell
2016-06-03  7:52                               ` Mikael Abrahamsson
2016-06-03 15:27                                 ` bobzer
2016-06-03 16:31                                   ` Sarah Newman
2016-06-04  2:56                                     ` bobzer
2016-06-01 15:42                   ` Wols Lists
2016-06-01 17:28                     ` Phil Turmel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.