* [Recovery] RAID10 hdd failureS help requested
@ 2013-09-24 13:12 Karel Walters
  2013-09-24 14:23 ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Karel Walters @ 2013-09-24 13:12 UTC (permalink / raw)
  To: linux-raid

Hopefully someone can help me with this.
I have a 7-drive raid10 array.
A single drive failed last night, and the 7th drive, a spare, was trying
to take over for the failed drive.
During the re-sync a second drive failed and the re-sync stopped.

Now I know I should replace the failed drives but I would like to have
them online one more time for some critical files that were produced
last night.

As it stands, I have tried the following:

Removing the drive from the array and re-adding it, which failed with:
mdadm: --re-add for /dev/sdd1 to /dev/md1 is not possible

A forced reassembly, which failed with:
mdadm: failed to add /dev/sde1 to /dev/md1: Device or resource busy
mdadm: failed to add /dev/sdj1 to /dev/md1: Device or resource busy
mdadm: failed to RUN_ARRAY /dev/md1: Input/output error
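
(Roughly of this form; the exact options may have differed:)

mdadm /dev/md1 --remove /dev/sdd1
mdadm /dev/md1 --re-add /dev/sdd1

mdadm --assemble --force /dev/md1 /dev/sd[c-j]1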

From what I read online I should re-create the array with
--assume-clean, but I am quite hesitant to do so, since a single typo
means the destruction of my raid array.

Could someone please advise?


Appended below is the output from --examine and --detail:

/dev/md1:
        Version : 1.2
  Creation Time : Thu Apr 26 11:33:56 2012
     Raid Level : raid10
  Used Dev Size : -1
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Tue Sep 24 13:52:16 2013
          State : active, degraded, Not Started
 Active Devices : 4
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 2

         Layout : far=2
     Chunk Size : 64K

           Name : phobos:0  (local to host phobos)
           UUID : d2d84ef9:7981fe2b:2140e1cb:1ec86f38
         Events : 6298

    Number   Major   Minor   RaidDevice State
       8       8       81        0      active sync   /dev/sdf1
       7       8      129        1      active sync   /dev/sdi1
       2       8      113        2      active sync   /dev/sdh1
       3       0        0        3      removed
       4       0        0        4      removed
       6       8       97        5      active sync   /dev/sdg1

       9       8       33        -      spare   /dev/sdc1
      10       8       49        -      spare   /dev/sdd1


/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : d2d84ef9:7981fe2b:2140e1cb:1ec86f38
           Name : phobos:0  (local to host phobos)
  Creation Time : Thu Apr 26 11:33:56 2012
     Raid Level : raid10
   Raid Devices : 6

 Avail Dev Size : 5860526080 (2794.52 GiB 3000.59 GB)
     Array Size : 8790788352 (8383.55 GiB 9001.77 GB)
  Used Dev Size : 5860525568 (2794.52 GiB 3000.59 GB)
    Data Offset : 4096 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 3b46e91f:bd2a3481:f1818a46:19e77284

    Update Time : Tue Sep 24 13:52:16 2013
       Checksum : 2d4b1c52 - correct
         Events : 0

         Layout : far=2
     Chunk Size : 64K

   Device Role : spare
   Array State : AAA..A ('A' == active, '.' == missing)
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : d2d84ef9:7981fe2b:2140e1cb:1ec86f38
           Name : phobos:0  (local to host phobos)
  Creation Time : Thu Apr 26 11:33:56 2012
     Raid Level : raid10
   Raid Devices : 6

 Avail Dev Size : 5860526080 (2794.52 GiB 3000.59 GB)
     Array Size : 8790788352 (8383.55 GiB 9001.77 GB)
  Used Dev Size : 5860525568 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 0be3d9d6:376a68fe:ef7df0e5:9b369dea

    Update Time : Tue Sep 24 13:52:16 2013
       Checksum : 67003c20 - correct
         Events : 0

         Layout : far=2
     Chunk Size : 64K

   Device Role : spare
   Array State : AAA..A ('A' == active, '.' == missing)
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : d2d84ef9:7981fe2b:2140e1cb:1ec86f38
           Name : phobos:0  (local to host phobos)
  Creation Time : Thu Apr 26 11:33:56 2012
     Raid Level : raid10
   Raid Devices : 6

 Avail Dev Size : 5860526080 (2794.52 GiB 3000.59 GB)
     Array Size : 8790788352 (8383.55 GiB 9001.77 GB)
  Used Dev Size : 5860525568 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 21d32b5b:4817285e:928eb39d:f9c7090b

    Update Time : Tue Sep 24 13:52:16 2013
       Checksum : 23417b46 - correct
         Events : 0

         Layout : far=2
     Chunk Size : 64K

   Device Role : spare
   Array State : AAA..A ('A' == active, '.' == missing)
/dev/sdf1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : d2d84ef9:7981fe2b:2140e1cb:1ec86f38
           Name : phobos:0  (local to host phobos)
  Creation Time : Thu Apr 26 11:33:56 2012
     Raid Level : raid10
   Raid Devices : 6

 Avail Dev Size : 5860526080 (2794.52 GiB 3000.59 GB)
     Array Size : 8790788352 (8383.55 GiB 9001.77 GB)
  Used Dev Size : 5860525568 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a300c489:7a42f41a:2cbf23eb:1db87ae8

    Update Time : Tue Sep 24 13:52:16 2013
       Checksum : 39870d51 - correct
         Events : 6298

         Layout : far=2
     Chunk Size : 64K

   Device Role : Active device 0
   Array State : AAA..A ('A' == active, '.' == missing)
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : d2d84ef9:7981fe2b:2140e1cb:1ec86f38
           Name : phobos:0  (local to host phobos)
  Creation Time : Thu Apr 26 11:33:56 2012
     Raid Level : raid10
   Raid Devices : 6

 Avail Dev Size : 5860526991 (2794.52 GiB 3000.59 GB)
     Array Size : 8790788352 (8383.55 GiB 9001.77 GB)
  Used Dev Size : 5860525568 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a9cd4f3b:c562c203:3b2c6d3a:2df429bd

    Update Time : Tue Sep 24 13:52:16 2013
       Checksum : f7d9a74b - correct
         Events : 6298

         Layout : far=2
     Chunk Size : 64K

   Device Role : Active device 5
   Array State : AAA..A ('A' == active, '.' == missing)
/dev/sdh1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : d2d84ef9:7981fe2b:2140e1cb:1ec86f38
           Name : phobos:0  (local to host phobos)
  Creation Time : Thu Apr 26 11:33:56 2012
     Raid Level : raid10
   Raid Devices : 6

 Avail Dev Size : 5860526080 (2794.52 GiB 3000.59 GB)
     Array Size : 8790788352 (8383.55 GiB 9001.77 GB)
  Used Dev Size : 5860525568 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : b2439c7e:6e1048b9:87da7045:3c2d5249

    Update Time : Tue Sep 24 13:52:16 2013
       Checksum : 87d7aec6 - correct
         Events : 6298

         Layout : far=2
     Chunk Size : 64K

   Device Role : Active device 2
   Array State : AAA..A ('A' == active, '.' == missing)
/dev/sdi1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : d2d84ef9:7981fe2b:2140e1cb:1ec86f38
           Name : phobos:0  (local to host phobos)
  Creation Time : Thu Apr 26 11:33:56 2012
     Raid Level : raid10
   Raid Devices : 6

 Avail Dev Size : 5860526080 (2794.52 GiB 3000.59 GB)
     Array Size : 8790788352 (8383.55 GiB 9001.77 GB)
  Used Dev Size : 5860525568 (2794.52 GiB 3000.59 GB)
    Data Offset : 4096 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : cc3485c3:08acfa6b:25c11841:f80e07b7

    Update Time : Tue Sep 24 13:52:16 2013
       Checksum : e8d00bd9 - correct
         Events : 6298

         Layout : far=2
     Chunk Size : 64K

   Device Role : Active device 1
   Array State : AAA..A ('A' == active, '.' == missing)
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : d2d84ef9:7981fe2b:2140e1cb:1ec86f38
           Name : phobos:0  (local to host phobos)
  Creation Time : Thu Apr 26 11:33:56 2012
     Raid Level : raid10
   Raid Devices : 6

 Avail Dev Size : 5860526080 (2794.52 GiB 3000.59 GB)
     Array Size : 8790788352 (8383.55 GiB 9001.77 GB)
  Used Dev Size : 5860525568 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 2e8d438b:16ad5bb7:6e2dafcd:804b37ce

    Update Time : Tue Sep 24 13:52:16 2013
       Checksum : 9fb6061d - correct
         Events : 6298

         Layout : far=2
     Chunk Size : 64K

   Device Role : spare
   Array State : AAA..A ('A' == active, '.' == missing)


* Re: [Recovery] RAID10 hdd failureS help requested
  2013-09-24 13:12 [Recovery] RAID10 hdd failureS help requested Karel Walters
@ 2013-09-24 14:23 ` Phil Turmel
       [not found]   ` <CAB4fJqezb0sWcUUgRPd4BXoWr3hNBp725gv8xnMOPmcqU8RiRw@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2013-09-24 14:23 UTC (permalink / raw)
  To: Karel Walters; +Cc: linux-raid

Hi Karel,

On 09/24/2013 09:12 AM, Karel Walters wrote:
> Hopefully someone can help me with this.

Likely.

> I have a 7-drive raid10 array.
> A single drive failed last night, and the 7th drive, a spare, was trying
> to take over for the failed drive.
> During the re-sync a second drive failed and the re-sync stopped.

Oh, if I had a dollar for every time I write the following:

Your report sounds like the classic timeout mismatch problem when using
non-raid (consumer) drives in a raid array.  You will need to spend some
time reading archived messages on this list to understand the problem.
I recommend searching for various combinations of "scterc", "error
recovery", "timeout mismatch", "ure", and "unrecoverable read error".
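
A quick way to see where each drive stands on both counts, assuming
smartctl is installed (the device range is just an example):

for x in /dev/sd[c-j] ; do
    echo "== $x"
    smartctl -l scterc $x      # prints the SCT ERC settings, or a warning if unsupported
    echo "driver timeout: $(cat /sys/block/${x##*/}/device/timeout)s"
done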

> Now I know I should replace the failed drives but I would like to have
> them online one more time for some critical files that were produced
> last night.

If the problem is timeout mismatch, your drives are probably fine.

> As it stands, I have tried the following:
> 
> Removing the drive from the array and re-adding it, which failed with:
> mdadm: --re-add for /dev/sdd1 to /dev/md1 is not possible
> 
> A forced reassembly, which failed with:
> mdadm: failed to add /dev/sde1 to /dev/md1: Device or resource busy
> mdadm: failed to add /dev/sdj1 to /dev/md1: Device or resource busy
> mdadm: failed to RUN_ARRAY /dev/md1: Input/output error
> 
> From what I read online I should re-create the array with
> --assume-clean, but I am quite hesitant to do so, since a single typo
> means the destruction of my raid array.
> 
> Could someone please advise?
> 
> 
> Appended below is the output from --examine and --detail:
> 
> /dev/md1:
>         Version : 1.2
>   Creation Time : Thu Apr 26 11:33:56 2012
>      Raid Level : raid10
>   Used Dev Size : -1
>    Raid Devices : 6
>   Total Devices : 6
>     Persistence : Superblock is persistent
> 
>     Update Time : Tue Sep 24 13:52:16 2013
>           State : active, degraded, Not Started

This suggests you should try "mdadm /dev/md1 --run" before anything
else.  The drives that have dropped out should not have broken the far
mirrors (I think).
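
Roughly (the read-only mount point is only an example):

mdadm /dev/md1 --run
cat /proc/mdstat            # should now show md1 running, degraded
mount -o ro /dev/md1 /mnt   # copy the critical files off read-only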

If this works, take your backup right away. (But fix the timeouts if
that is part of your problem.)

If that doesn't work, report the following:

dmesg

for x in /sys/block/*/device/timeout ; do echo $x : $(< $x) ; done

for x in /dev/sd[c-i] ; do echo $x ; smartctl -x $x ; done

HTH,

Phil


* Re: [Recovery] RAID10 hdd failureS help requested
       [not found]   ` <CAB4fJqezb0sWcUUgRPd4BXoWr3hNBp725gv8xnMOPmcqU8RiRw@mail.gmail.com>
@ 2013-09-24 15:50     ` Phil Turmel
       [not found]       ` <CAB4fJqerQy7PJzK4+WSNAh7YCcHmwoAqB5vMrXeSYqzWawAS+A@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2013-09-24 15:50 UTC (permalink / raw)
  To: Karel Walters; +Cc: linux-raid

Hi Karel,

Please use reply-to-all on kernel.org lists, trim replies, and avoid
top-posting.

On 09/24/2013 11:07 AM, Karel Walters wrote:
> Dear Phil,
> 
> Thank you for the quick response!
> Unfortunately that does not work.
> The drives did fail their SMART test, one short and one long.
> That is how I judged they are indeed broken.
> 
> Thanks already!
> 
> Indeed these are consumer Seagate 7200RPM drives.
> 
> /sys/block/sda/device/timeout : 30
> /sys/block/sdb/device/timeout : 30
> /sys/block/sdc/device/timeout : 30
> /sys/block/sdd/device/timeout : 30
> /sys/block/sde/device/timeout : 30
> /sys/block/sdf/device/timeout : 30
> /sys/block/sdg/device/timeout : 30
> /sys/block/sdh/device/timeout : 30
> /sys/block/sdi/device/timeout : 30
> /sys/block/sdj/device/timeout : 30
> /sys/block/sdk/device/timeout : 30
> /sys/block/sdl/device/timeout : 30
> /sys/block/sdm/device/timeout : 30
> /sys/block/sdn/device/timeout : 30

Allow me to select critical info from these smartctl reports:

> /dev/sdc
> Device Model:     WDC WD30EFRX-68AX9N0
> Serial Number:    WD-WCC1T1255024
> SMART overall-health self-assessment test result: PASSED
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    0
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)

/dev/sdc is healthy and has appropriate timeouts.

> /dev/sdd
> Device Model:     ST3000DM001-9YN166
> Serial Number:    W1F09XLV
> SMART overall-health self-assessment test result: PASSED
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    8
> 197 Current_Pending_Sector  -O--C-   096   096   000    -    656
> 198 Offline_Uncorrectable   ----C-   096   096   000    -    656
> Warning: device does not support SCT Data Table command
> Warning: device does not support SCT Error Recovery Control command

/dev/sdd is technically healthy, but approaching failure, and has been
neglected.  It has many pending sectors.  You clearly have not been
scrubbing your array, and if you had, it would have been bumped out of
your array long ago for timeout mismatch.

> /dev/sde
> Device Model:     ST3000DM001-9YN166
> Serial Number:    W1F0AXTQ
> SMART overall-health self-assessment test result: PASSED
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    144
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    144
> Warning: device does not support SCT Data Table command
> Warning: device does not support SCT Error Recovery Control command

/dev/sde is technically healthy, and probably healthy in fact.  But like
/dev/sdd, it has many pending sectors due to lack of scrubbing.  And if
you had been scrubbing, the timeout mismatch would have kicked it out
anyway.

> /dev/sdf
> Device Model:     ST3000DM001-9YN166
> Serial Number:    W1F0B6X6
> SMART overall-health self-assessment test result: PASSED
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> Warning: device does not support SCT Data Table command
> Warning: device does not support SCT Error Recovery Control command

/dev/sdf is healthy.  But it has the timeout mismatch problem.

> /dev/sdg
> Device Model:     ST3000DM001-9YN166
> Serial Number:    S1F04BZT
> SMART overall-health self-assessment test result: PASSED
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> Warning: device does not support SCT Data Table command
> Warning: device does not support SCT Error Recovery Control command

/dev/sdg is healthy.  But it has the timeout mismatch problem.

> /dev/sdh
> Device Model:     ST3000DM001-9YN166
> Serial Number:    W1F0B9ER
> SMART overall-health self-assessment test result: PASSED
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> Warning: device does not support SCT Data Table command
> Warning: device does not support SCT Error Recovery Control command

/dev/sdh is healthy.  But it has the timeout mismatch problem.

> /dev/sdi
> Device Model:     WDC WD30EFRX-68AX9N0
> Serial Number:    WD-WMC1T2341606
> SMART overall-health self-assessment test result: PASSED
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    0
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)

/dev/sdi is healthy and has appropriate timeouts.


Before you do anything else, you have to compensate for the drives that
don't support error recovery control:

for x in /sys/block/sd[d-h]/device/timeout ; do echo 180 >$x ; done

You must do this for all of your Seagate drives on every powerup or your
arrays will always kick drives out instead of fixing the accumulating
pending errors.  (Pending errors are repaired or relocated by writing to
them.  MD will do this automatically on read errors, but cannot do so if
the drive won't respond in 30 seconds.)
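
One way to make that stick across reboots is a udev rule keyed on the
drive model (the file name and model glob below are assumptions; check
the exact model string with "smartctl -i" first).  Dropping the same
for-loop into a boot script such as /etc/rc.local also works.

# /etc/udev/rules.d/60-raid-timeout.rules  (sketch)
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTRS{model}=="ST3000DM001*", RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"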

{ In the future, buy drives that wake up with ERC enabled (like your WD
Reds), or at least capable of enabling ERC (at every powerup). }
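
For drives that do support ERC but power up with it disabled, enabling
it at each boot looks something like this (7.0 seconds read/write, given
in tenths of a second; sdX is a placeholder):

smartctl -l scterc,70,70 /dev/sdX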

Next, you will have to figure out which of the bumped drives belongs in
which slot in the array.  An old dmesg (from before the failures) or an
archived "mdadm --detail" would tell us that.  This is important,
because you *will* need to use --create --assume-clean as the drives are
now marked as spare--the info needed for forced assembly is gone.

You will also need to make sure that the create operation results in the
correct data offset on each device before accessing the array.
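
The superblocks still record the data offsets (and whatever role
information survives), so something like this collects it in one place
(device range is illustrative):

for x in /dev/sd[c-j]1 ; do
    echo "== $x"
    mdadm --examine $x | egrep 'Device Role|Data Offset|Device UUID'
done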

Phil


* Re: [Recovery] RAID10 hdd failureS help requested
       [not found]       ` <CAB4fJqerQy7PJzK4+WSNAh7YCcHmwoAqB5vMrXeSYqzWawAS+A@mail.gmail.com>
@ 2013-09-24 17:09         ` Phil Turmel
  2013-09-24 18:18           ` Karel Walters
  0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2013-09-24 17:09 UTC (permalink / raw)
  To: Karel Walters; +Cc: linux-raid

Hi Karel,

On 09/24/2013 12:28 PM, Karel Walters wrote:
> Will find a way to do proper scrubbing and alter the timeouts on startup.
>> for x in /sys/block/sd[d-h]/device/timeout ; do echo 180 >$x ; done
> done!

Good.

>> { In the future, buy drives that wake up with ERC enabled (like your WD
>> Reds), or at least capable of enabling ERC (at every powerup). }
> Reds are on the desk next to me and will replace the raid array.

Very Good.  Mind you, the Seagates are good enough drives; they just
aren't suited to raid arrays.  Changing the driver timeouts will get you
by, but when you do encounter an error, the three-minute pause will kick
many applications in the teeth.  I have a few Seagates like this kicking
around that I use for offsite backups.

>> Next, you will have to figure out which of the bumped drives belongs in
>> which slot in the array.  An old dmesg (from before the failures) or an
>> archived "mdadm --detail" would tell us that.  This is important,
>> because you *will* need to use --create --assume-clean as the drives are
>> now marked as spare--the info needed for forced assembly is gone.
> 
> This is a problem for me and maybe a harsh lesson.  I added an old
> dmesg output at the end, but I'm not too sure about it.

Yes, that dmesg did the trick.  The drive that failed first was #3, and
the drive that failed second was #4.  You should create a list of which
drive serial number corresponds to which raid device role, with a third
column showing the current device name.
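
A quick way to pair the current device names with serial numbers,
assuming smartctl (the raid roles themselves have to come from the old
dmesg):

for x in /dev/sd[c-j] ; do
    echo "$x  $(smartctl -i $x | grep 'Serial Number')"
done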

Then we can construct an "mdadm --create --assume-clean" command that
generates the correct order.  And I would leave the partially synced
spare out entirely.

Then, to deal with the large number of pending sectors, you'll need to do
a "check" scrub with a very low speed limit, to keep you from exceeding
the 10/hour read error limit in the MD kernel driver.

{ Or you can scrub at full speed until it kicks drives out, then force
assemble and restart the scrub.  Many times over in your case. }

Phil


* Re: [Recovery] RAID10 hdd failureS help requested
  2013-09-24 17:09         ` Phil Turmel
@ 2013-09-24 18:18           ` Karel Walters
  2013-09-24 19:05             ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Karel Walters @ 2013-09-24 18:18 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Hi Phil,

Thank you for all the great help so far!

> Yes, that dmesg did the trick.  The drive that failed first was #3, and
> the drive that failed second was #4.  You should create a list of which
> drive serial number corresponds to which raid device role, with a third
> column showing the current device name.

Serial no.            Old name    Role   Current name
WD-WCC1T1255024       -           -      /dev/sdc1  (new drive)
W1F09XLV              /dev/sdb1   [3]    /dev/sdd1  (failed drive 1)
W1F0AXTQ              /dev/sdc1   [4]    /dev/sde1  (failed drive 2)
W1F0B6X6              /dev/sdd1   [0]    /dev/sdf1
S1F04BZT              /dev/sde1   [5]    /dev/sdg1
W1F0B9ER              /dev/sdf1   [2]    /dev/sdh1
WD-WMC1T2341606       /dev/sdg1   [1]    /dev/sdi1
S1F04CWH              /dev/sdh1   [6]    /dev/sdj1  (partially rebuilt spare)

> Then, to deal with the large number of pending sectors, you'll need to do
> a "check" scrub with a very low speed limit, to keep you from exceeding
> the 10/hour read error limit in the MD kernel driver.

echo 1000 > /proc/sys/dev/raid/speed_limit_min
echo 10000 > /proc/sys/dev/raid/speed_limit_max
echo check > /sys/block/md1/md/sync_action

Would this be ok in such a case?

Karel


* Re: [Recovery] RAID10 hdd failureS help requested
  2013-09-24 18:18           ` Karel Walters
@ 2013-09-24 19:05             ` Phil Turmel
  2013-09-24 19:14               ` Karel Walters
  0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2013-09-24 19:05 UTC (permalink / raw)
  To: Karel Walters; +Cc: linux-raid

On 09/24/2013 02:18 PM, Karel Walters wrote:
>  Hi Phil,
> 
> Thank you for all the great help so far!
> 
>> Yes, that dmesg did the trick.  The drive that failed first was #3, and
>> the drive that failed second was #4.  You should create a list of which
>> drive serial number corresponds to which raid device role, with a third
>> column showing the current device name.
> 
> Serial no.            Old name    Role   Current name
> WD-WCC1T1255024       -           -      /dev/sdc1  (new drive)
> W1F09XLV              /dev/sdb1   [3]    /dev/sdd1  (failed drive 1)
> W1F0AXTQ              /dev/sdc1   [4]    /dev/sde1  (failed drive 2)
> W1F0B6X6              /dev/sdd1   [0]    /dev/sdf1
> S1F04BZT              /dev/sde1   [5]    /dev/sdg1
> W1F0B9ER              /dev/sdf1   [2]    /dev/sdh1
> WD-WMC1T2341606       /dev/sdg1   [1]    /dev/sdi1
> S1F04CWH              /dev/sdh1   [6]    /dev/sdj1  (partially rebuilt spare)

Ok, so your create operation will be:

mdadm --create /dev/md1 --level=10 -n 6 --layout=f2 --chunk=64
--data-offset=variable /dev/sdd1:2048 /dev/sdg1:4096 /dev/sdf1:2048
/dev/sdb1:2048 /dev/sdc1:2048 /dev/sde1:2048

I'm actually guessing that /dev/sdb1 and /dev/sdc1 need offset 2048 like
the original devices, not the 4096 of a device added later (newer
mdadm).  With the mixed offsets, you need mdadm version 3.3.

Use "fsck -n" to verify the array before mounting anything, just in case
one or both of those drives really does need :4096.
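
A rough checking sequence, with /mnt only as an example mount point:

mdadm --detail /dev/md1                          # confirm order, chunk, layout
mdadm --examine /dev/sdb1 | grep 'Data Offset'   # spot-check the offsets
fsck -n /dev/md1                                 # read-only check, no repairs
mount -o ro /dev/md1 /mnt                        # then copy critical data off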

>> Then, to deal with the large number of pending sectors, you'll need to do
>> a "check" scrub with a very low speed limit, to keep you from exceeding
>> the 10/hour read error limit in the MD kernel driver.
> 
> echo 1000 > /proc/sys/dev/raid/speed_limit_min
> echo 10000 > /proc/sys/dev/raid/speed_limit_max
> echo check > /sys/block/md1/md/sync_action
> 
> Would this be ok in such a case?

Looks ok.  You may want to experiment with the limits while it is in progress.

Phil


* Re: [Recovery] RAID10 hdd failureS help requested
  2013-09-24 19:05             ` Phil Turmel
@ 2013-09-24 19:14               ` Karel Walters
  2013-09-24 21:19                 ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Karel Walters @ 2013-09-24 19:14 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Hi Phil,
> Ok, so your create operation will be:
>
> mdadm --create /dev/md1 --level=10 -n 6 --layout=f2 --chunk=64
> --data-offset=variable /dev/sdd1:2048 /dev/sdg1:4096 /dev/sdf1:2048
> /dev/sdb1:2048 /dev/sdc1:2048 /dev/sde1:2048

No --assume-clean?

> I'm actually guessing that /dev/sdb1 and /dev/sdc1 need offset 2048 like
> the original devices, not the 4096 of a device added later (newer
> mdadm).  With the mixed offsets, you need mdadm version 3.3.

I created and used it with mdadm 3.2.5; do I really need to get 3.3?

> Use "fsck -n" to verify the array before mounting anything, just in case
> one or both of those drives really does need :4096.

Great!


* Re: [Recovery] RAID10 hdd failureS help requested
  2013-09-24 19:14               ` Karel Walters
@ 2013-09-24 21:19                 ` Phil Turmel
  2013-09-25 12:55                   ` Karel Walters
  0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2013-09-24 21:19 UTC (permalink / raw)
  To: Karel Walters; +Cc: linux-raid

On 09/24/2013 03:14 PM, Karel Walters wrote:
> Hi Phil,
>> Ok, so your create operation will be:
>>
>> mdadm --create /dev/md1 --level=10 -n 6 --layout=f2 --chunk=64
>> --data-offset=variable /dev/sdd1:2048 /dev/sdg1:4096 /dev/sdf1:2048
>> /dev/sdb1:2048 /dev/sdc1:2048 /dev/sde1:2048
> 
> No --assume-clean?

Whoops!  Yes, add the --assume-clean.

>> I'm actually guessing that /dev/sdb1 and /dev/sdc1 need offset 2048 like
>> the original devices, not the 4096 of a device added later (newer
>> mdadm).  With the mixed offsets, you need mdadm version 3.3.
> 
> I created and used it with mdadm 3.2.5; do I really need to get 3.3?

Yes, 3.2.5 doesn't have the variable offset syntax.

Phil


* Re: [Recovery] RAID10 hdd failureS help requested
  2013-09-24 21:19                 ` Phil Turmel
@ 2013-09-25 12:55                   ` Karel Walters
  0 siblings, 0 replies; 9+ messages in thread
From: Karel Walters @ 2013-09-25 12:55 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

> Whoops!  Yes, add the --assume-clean.
>
Well, it didn't work, and I couldn't wait any longer to bring back a
working raid array.
Thanks for all the help!
I learned a lot, and may this be an extra lesson to keep a record of the
original creation parameters of the raid array :)

Karel

