* RAID 6 recovery (it's not looking good)
@ 2008-12-16 15:17 Iain Rauch
  2008-12-16 19:18 ` Bernd Schubert
  0 siblings, 1 reply; 8+ messages in thread
From: Iain Rauch @ 2008-12-16 15:17 UTC (permalink / raw)
  To: linux-raid

Hi,

Here's the situation:

24 disk array.
Disk fails - usage continues for a while.
Power cut - array in unknown state at the time, but I expect it was just
running degraded.
Restart and assemble the array.
*story continues further down.

1 disk is way out of sync and 1 disk doesn't work. 2 disks are marked spare
- the faulty one and another. I think I need to set the status of one of the
'spare' disks to clean and then assemble the array. I can then rebuild the
disk that is way out of sync, and when I have a replacement, rebuild the
failed disk. Is this possible? Even if most of the data is corrupt, it
doesn't matter, as long as I can assemble the array and get at some of it. At
the very least I'd like to be able to see the names of the files I had on it.
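
From what I've read, the usual route is not to set a spare's state by hand
but to stop the array and force-assemble from the freshest superblocks -
something like the sketch below, where the glob is only illustrative (the
dead disk should be left out) and -f lets mdadm bump a slightly stale event
count:

    mdadm -S /dev/md0
    mdadm -A -f /dev/md0 /dev/sd[a-x]1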

Hope someone can help, but I'm not holding my breath.

Iain


root@skinner:/home/iain# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May 31 17:24:32 2007
     Raid Level : raid6
     Array Size : 10744267776 (10246.53 GiB 11002.13 GB)
  Used Dev Size : 488375808 (465.75 GiB 500.10 GB)
   Raid Devices : 24
  Total Devices : 22
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Oct 18 20:52:53 2008
          State : clean, degraded
 Active Devices : 22
Working Devices : 22
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 128K

           UUID : 2aa31867:40c370a7:61c202b9:07c4b1c4
         Events : 0.1273570

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       81        2      active sync   /dev/sdf1
       3       8       33        3      active sync   /dev/sdc1
       4       8      225        4      active sync   /dev/sdo1
       5       8      241        5      active sync   /dev/sdp1
       6       8      193        6      active sync   /dev/sdm1
       7       8      209        7      active sync   /dev/sdn1
       8      65       17        8      active sync   /dev/sdr1
       9      65       81        9      active sync   /dev/sdv1
      10       0        0       10      removed
      11      65      113       11      active sync   /dev/sdx1
      12       0        0       12      removed
      13      65       49       13      active sync   /dev/sdt1
      14      65        1       14      active sync   /dev/sdq1
      15      65       65       15      active sync   /dev/sdu1
      16       8      129       16      active sync   /dev/sdi1
      17       8      161       17      active sync   /dev/sdk1
      18       8      145       18      active sync   /dev/sdj1
      19       8      177       19      active sync   /dev/sdl1
      20       8        1       20      active sync   /dev/sda1
      21       8       97       21      active sync   /dev/sdg1
      22       8       17       22      active sync   /dev/sdb1
      23       8      113       23      active sync   /dev/sdh1

Not so long after:

root@skinner:/home/iain# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May 31 17:24:32 2007
     Raid Level : raid6
     Array Size : 10744267776 (10246.53 GiB 11002.13 GB)
  Used Dev Size : 488375808 (465.75 GiB 500.10 GB)
   Raid Devices : 24
  Total Devices : 22
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Dec 16 12:56:38 2008
          State : clean, degraded
 Active Devices : 21
Working Devices : 21
 Failed Devices : 1
  Spare Devices : 0

     Chunk Size : 128K

           UUID : 2aa31867:40c370a7:61c202b9:07c4b1c4
         Events : 0.1273576

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       81        2      active sync   /dev/sdf1
       3       8       33        3      active sync   /dev/sdc1
       4       8      225        4      active sync   /dev/sdo1
       5       8      241        5      active sync   /dev/sdp1
       6       8      193        6      active sync   /dev/sdm1
       7       8      209        7      active sync   /dev/sdn1
       8      65       17        8      active sync   /dev/sdr1
       9      65       81        9      active sync   /dev/sdv1
      10       0        0       10      removed
      11      65      113       11      active sync   /dev/sdx1
      12       0        0       12      removed
      13      65       49       13      active sync   /dev/sdt1
      14      65        1       14      active sync   /dev/sdq1
      15       0        0       15      removed
      16       8      129       16      active sync   /dev/sdi1
      17       8      161       17      active sync   /dev/sdk1
      18       8      145       18      active sync   /dev/sdj1
      19       8      177       19      active sync   /dev/sdl1
      20       8        1       20      active sync   /dev/sda1
      21       8       97       21      active sync   /dev/sdg1
      22       8       17       22      active sync   /dev/sdb1
      23       8      113       23      active sync   /dev/sdh1

      24      65       65        -      faulty spare   /dev/sdu1

root@skinner:/home/iain# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdd1[0] sdh1[23] sdb1[22] sdg1[21] sda1[20] sdl1[19]
sdj1[18] sdk1[17] sdi1[16] sdu1[24](F) sdq1[14] sdt1[13] sdx1[11] sdv1[9]
sdr1[8] sdn1[7] sdm1[6] sdp1[5] sdo1[4] sdc1[3] sdf1[2] sde1[1]
      10744267776 blocks level 6, 128k chunk, algorithm 2 [24/21]
[UUUUUUUUUU_U_UU_UUUUUUUU]

root@skinner:/home/iain# mdadm -E /dev/sd[a-x]1 | grep Events
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1273578
         Events : 0.1271254
         Events : 0.1273578
         Events : 0.1273572
         Events : 0.1273578
         Events : 0.1273570
         Events : 0.1273578

So from this I think sds1 was the one that first failed. It looks like it,
but the drive letters have been reallocated since the reboot.
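
(Next time, something like this would tie the event counts to the device
names in one go:

    mdadm -E /dev/sd[a-x]1 | egrep '^/dev|Events')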

root@skinner:/home/iain# mdadm -E /dev/sds1
/dev/sds1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 2aa31867:40c370a7:61c202b9:07c4b1c4
  Creation Time : Thu May 31 17:24:32 2007
     Raid Level : raid6
  Used Dev Size : 488375808 (465.75 GiB 500.10 GB)
     Array Size : 10744267776 (10246.53 GiB 11002.13 GB)
   Raid Devices : 24
  Total Devices : 24
Preferred Minor : 0

    Update Time : Sat Oct 18 12:56:28 2008
          State : clean
 Active Devices : 24
Working Devices : 24
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 2ab633fc - correct
         Events : 0.1271254

     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this    12      65       97       12      active sync   /dev/sdw1

   0     0       8      145        0      active sync   /dev/sdj1
   1     1       8      177        1      active sync   /dev/sdl1
   2     2       8      129        2      active sync   /dev/sdi1
   3     3       8      161        3      active sync   /dev/sdk1
   4     4       8      225        4      active sync   /dev/sdo1
   5     5       8      241        5      active sync   /dev/sdp1
   6     6       8      193        6      active sync   /dev/sdm1
   7     7       8      209        7      active sync   /dev/sdn1
   8     8      65       65        8      active sync   /dev/sdu1
   9     9      65       49        9      active sync   /dev/sdt1
  10    10      65       33       10      active sync   /dev/sds1
  11    11      65        1       11      active sync   /dev/sdq1
  12    12      65       97       12      active sync   /dev/sdw1
  13    13      65      113       13      active sync   /dev/sdx1
  14    14      65       81       14      active sync   /dev/sdv1
  15    15      65       17       15      active sync   /dev/sdr1
  16    16       8       97       16      active sync   /dev/sdg1
  17    17       8      113       17      active sync   /dev/sdh1
  18    18       8       65       18      active sync   /dev/sde1
  19    19       8       81       19      active sync   /dev/sdf1
  20    20       8       49       20      active sync   /dev/sdd1
  21    21       8       33       21      active sync   /dev/sdc1
  22    22       8       17       22      active sync   /dev/sdb1
  23    23       8        1       23      active sync   /dev/sda1

Next:
Start without sds1, as that was the one that was left behind while there was
still activity going on.

root@skinner:/home/iain# mdadm -v -S /dev/md0
mdadm: stopped /dev/md0

root@skinner:/home/iain# mdadm -A /dev/md0 /dev/sd[abcdefghijklmnopqrtuvwx]1
mdadm: /dev/md0 assembled from 21 drives - not enough to start the array.

root@skinner:/home/iain# mdadm -A -f -v /dev/md0
/dev/sd[abcdefghijklmnopqrtuvwx]1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 20.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 22.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 21.
mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 23.
mdadm: /dev/sdi1 is identified as a member of /dev/md0, slot 16.
mdadm: /dev/sdj1 is identified as a member of /dev/md0, slot 18.
mdadm: /dev/sdk1 is identified as a member of /dev/md0, slot 17.
mdadm: /dev/sdl1 is identified as a member of /dev/md0, slot 19.
mdadm: /dev/sdm1 is identified as a member of /dev/md0, slot 6.
mdadm: /dev/sdn1 is identified as a member of /dev/md0, slot 7.
mdadm: /dev/sdo1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdp1 is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdq1 is identified as a member of /dev/md0, slot 14.
mdadm: /dev/sdr1 is identified as a member of /dev/md0, slot 8.
mdadm: /dev/sdt1 is identified as a member of /dev/md0, slot 13.
mdadm: /dev/sdu1 is identified as a member of /dev/md0, slot 15.
mdadm: /dev/sdv1 is identified as a member of /dev/md0, slot 9.
mdadm: /dev/sdw1 is identified as a member of /dev/md0, slot 10.
mdadm: /dev/sdx1 is identified as a member of /dev/md0, slot 11.
mdadm: forcing event count in /dev/sdu1(15) from 1273572 upto 1273578
mdadm: clearing FAULTY flag for device 19 in /dev/md0 for /dev/sdu1
mdadm: added /dev/sde1 to /dev/md0 as 1
mdadm: added /dev/sdf1 to /dev/md0 as 2
mdadm: added /dev/sdc1 to /dev/md0 as 3
mdadm: added /dev/sdo1 to /dev/md0 as 4
mdadm: added /dev/sdp1 to /dev/md0 as 5
mdadm: added /dev/sdm1 to /dev/md0 as 6
mdadm: added /dev/sdn1 to /dev/md0 as 7
mdadm: added /dev/sdr1 to /dev/md0 as 8
mdadm: added /dev/sdv1 to /dev/md0 as 9
mdadm: added /dev/sdw1 to /dev/md0 as 10
mdadm: added /dev/sdx1 to /dev/md0 as 11
mdadm: no uptodate device for slot 12 of /dev/md0
mdadm: added /dev/sdt1 to /dev/md0 as 13
mdadm: added /dev/sdq1 to /dev/md0 as 14
mdadm: added /dev/sdu1 to /dev/md0 as 15
mdadm: added /dev/sdi1 to /dev/md0 as 16
mdadm: added /dev/sdk1 to /dev/md0 as 17
mdadm: added /dev/sdj1 to /dev/md0 as 18
mdadm: added /dev/sdl1 to /dev/md0 as 19
mdadm: added /dev/sda1 to /dev/md0 as 20
mdadm: added /dev/sdg1 to /dev/md0 as 21
mdadm: added /dev/sdb1 to /dev/md0 as 22
mdadm: added /dev/sdh1 to /dev/md0 as 23
mdadm: added /dev/sdd1 to /dev/md0 as 0
mdadm: /dev/md0 has been started with 22 drives (out of 24).


root@skinner:/home/iain# mdadm --add /dev/md0 /dev/sd[ws]1
mdadm: re-added /dev/sds1
mdadm: re-added /dev/sdw1
root@skinner:/home/iain# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May 31 17:24:32 2007
     Raid Level : raid6
     Array Size : 10744267776 (10246.53 GiB 11002.13 GB)
  Used Dev Size : 488375808 (465.75 GiB 500.10 GB)
   Raid Devices : 24
  Total Devices : 24
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Dec 16 13:10:29 2008
          State : clean, degraded, recovering
 Active Devices : 22
Working Devices : 24
 Failed Devices : 0
  Spare Devices : 2

     Chunk Size : 128K

 Rebuild Status : 0% complete

           UUID : 2aa31867:40c370a7:61c202b9:07c4b1c4
         Events : 0.1273586

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       81        2      active sync   /dev/sdf1
       3       8       33        3      active sync   /dev/sdc1
       4       8      225        4      active sync   /dev/sdo1
       5       8      241        5      active sync   /dev/sdp1
       6       8      193        6      active sync   /dev/sdm1
       7       8      209        7      active sync   /dev/sdn1
       8      65       17        8      active sync   /dev/sdr1
       9      65       81        9      active sync   /dev/sdv1
      25      65       33       10      spare rebuilding   /dev/sds1
      11      65      113       11      active sync   /dev/sdx1
      12       0        0       12      removed
      13      65       49       13      active sync   /dev/sdt1
      14      65        1       14      active sync   /dev/sdq1
      15      65       65       15      active sync   /dev/sdu1
      16       8      129       16      active sync   /dev/sdi1
      17       8      161       17      active sync   /dev/sdk1
      18       8      145       18      active sync   /dev/sdj1
      19       8      177       19      active sync   /dev/sdl1
      20       8        1       20      active sync   /dev/sda1
      21       8       97       21      active sync   /dev/sdg1
      22       8       17       22      active sync   /dev/sdb1
      23       8      113       23      active sync   /dev/sdh1

      24      65       97        -      spare   /dev/sdw1

root@skinner:/home/iain# mount -a
mount: /dev/md0: can't read superblock

root@skinner:/mnt/md0raid# xfs_check /dev/md0
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_check.  If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
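
(Before resorting to xfs_repair -L, a read-only mount attempt seems the
safest next step - XFS normally replays the log even on an ro mount, so the
norecovery option should skip that if the array can't take writes; the mount
point here is just an example:

    mount -o ro,norecovery /dev/md0 /mnt/md0raid

xfs_repair -L stays the last resort, since it throws the log away.)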

root@skinner:/mnt/md0raid# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdw1[24](S) sds1[25](S) sdd1[0] sdh1[23] sdb1[22]
sdg1[21] sda1[20] sdl1[19] sdj1[18] sdk1[17] sdi1[16] sdu1[26](F) sdq1[14]
sdt1[13] sdx1[11] sdv1[9] sdr1[8] sdn1[7] sdm1[6] sdp1[5] sdo1[4] sdc1[3]
sdf1[2] sde1[1]
      10744267776 blocks level 6, 128k chunk, algorithm 2 [24/21]
[UUUUUUUUUU_U_UU_UUUUUUUU]
      
root@skinner:/mnt/md0raid# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May 31 17:24:32 2007
     Raid Level : raid6
     Array Size : 10744267776 (10246.53 GiB 11002.13 GB)
  Used Dev Size : 488375808 (465.75 GiB 500.10 GB)
   Raid Devices : 24
  Total Devices : 24
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Dec 16 13:13:55 2008
          State : clean, degraded
 Active Devices : 21
Working Devices : 23
 Failed Devices : 1
  Spare Devices : 2

     Chunk Size : 128K

           UUID : 2aa31867:40c370a7:61c202b9:07c4b1c4
         Events : 0.1273598

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       81        2      active sync   /dev/sdf1
       3       8       33        3      active sync   /dev/sdc1
       4       8      225        4      active sync   /dev/sdo1
       5       8      241        5      active sync   /dev/sdp1
       6       8      193        6      active sync   /dev/sdm1
       7       8      209        7      active sync   /dev/sdn1
       8      65       17        8      active sync   /dev/sdr1
       9      65       81        9      active sync   /dev/sdv1
      10       0        0       10      removed
      11      65      113       11      active sync   /dev/sdx1
      12       0        0       12      removed
      13      65       49       13      active sync   /dev/sdt1
      14      65        1       14      active sync   /dev/sdq1
      15       0        0       15      removed
      16       8      129       16      active sync   /dev/sdi1
      17       8      161       17      active sync   /dev/sdk1
      18       8      145       18      active sync   /dev/sdj1
      19       8      177       19      active sync   /dev/sdl1
      20       8        1       20      active sync   /dev/sda1
      21       8       97       21      active sync   /dev/sdg1
      22       8       17       22      active sync   /dev/sdb1
      23       8      113       23      active sync   /dev/sdh1

      24      65       97        -      spare   /dev/sdw1
      25      65       33        -      spare   /dev/sds1
      26      65       65        -      faulty spare   /dev/sdu1

root@skinner:/mnt/md0raid# mdadm -E /dev/sd[a-x]1 | grep Events
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273589
         Events : 0.1273600
         Events : 0.1273600
         Events : 0.1273600

root@skinner:/mnt/md0raid# mdadm -v -S /dev/md0
mdadm: stopped /dev/md0
root@skinner:/mnt/md0raid# mdadm -A /dev/md0
/dev/sd[abcdefghijklmnopqrstvwx]1
mdadm: /dev/md0 assembled from 21 drives and 2 spares - not enough to start
the array.

sdu has spontaneously changed to sdy
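
(With the letters moving around like this, checking the drive serial numbers
rather than the letters seems the safer way to keep track of the physical
disks - e.g., assuming the drives answer hdparm:

    hdparm -I /dev/sdy | grep 'Serial Number')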

root@skinner:/mnt/md0raid# mdadm -E /dev/sd[a-z]1 | grep
"Events\|/dev/sd[a-z]1:"
/dev/sda1:
         Events : 0.1273608
/dev/sdb1:
         Events : 0.1273608
/dev/sdc1:
         Events : 0.1273608
/dev/sdd1:
         Events : 0.1273608
/dev/sde1:
         Events : 0.1273608
/dev/sdf1:
         Events : 0.1273608
/dev/sdg1:
         Events : 0.1273608
/dev/sdh1:
         Events : 0.1273608
/dev/sdi1:
         Events : 0.1273608
/dev/sdj1:
         Events : 0.1273608
/dev/sdk1:
         Events : 0.1273608
/dev/sdl1:
         Events : 0.1273608
/dev/sdm1:
         Events : 0.1273608
/dev/sdn1:
         Events : 0.1273608
/dev/sdo1:
         Events : 0.1273608
/dev/sdp1:
         Events : 0.1273608
/dev/sdq1:
         Events : 0.1273608
/dev/sdr1:
         Events : 0.1273608
/dev/sds1:
         Events : 0.1273608
/dev/sdt1:
         Events : 0.1273608
/dev/sdv1:
         Events : 0.1273608
/dev/sdw1:
         Events : 0.1273608
/dev/sdx1:
         Events : 0.1273608
/dev/sdy1:
         Events : 0.1273589

root@skinner:/mnt/md0raid# mdadm -v -A -f /dev/md0
/dev/sd[abcdefghijklmnopqrtvwxy]1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 20.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 22.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 21.
mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 23.
mdadm: /dev/sdi1 is identified as a member of /dev/md0, slot 16.
mdadm: /dev/sdj1 is identified as a member of /dev/md0, slot 18.
mdadm: /dev/sdk1 is identified as a member of /dev/md0, slot 17.
mdadm: /dev/sdl1 is identified as a member of /dev/md0, slot 19.
mdadm: /dev/sdm1 is identified as a member of /dev/md0, slot 6.
mdadm: /dev/sdn1 is identified as a member of /dev/md0, slot 7.
mdadm: /dev/sdo1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdp1 is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdq1 is identified as a member of /dev/md0, slot 14.
mdadm: /dev/sdr1 is identified as a member of /dev/md0, slot 8.
mdadm: /dev/sdt1 is identified as a member of /dev/md0, slot 13.
mdadm: /dev/sdv1 is identified as a member of /dev/md0, slot 9.
mdadm: /dev/sdw1 is identified as a member of /dev/md0, slot 24.
mdadm: /dev/sdx1 is identified as a member of /dev/md0, slot 11.
mdadm: /dev/sdy1 is identified as a member of /dev/md0, slot 15.
mdadm: forcing event count in /dev/sdy1(15) from 1273589 upto 1273608
mdadm: clearing FAULTY flag for device 22 in /dev/md0 for /dev/sdy1
mdadm: added /dev/sde1 to /dev/md0 as 1
mdadm: added /dev/sdf1 to /dev/md0 as 2
mdadm: added /dev/sdc1 to /dev/md0 as 3
mdadm: added /dev/sdo1 to /dev/md0 as 4
mdadm: added /dev/sdp1 to /dev/md0 as 5
mdadm: added /dev/sdm1 to /dev/md0 as 6
mdadm: added /dev/sdn1 to /dev/md0 as 7
mdadm: added /dev/sdr1 to /dev/md0 as 8
mdadm: added /dev/sdv1 to /dev/md0 as 9
mdadm: no uptodate device for slot 10 of /dev/md0
mdadm: added /dev/sdx1 to /dev/md0 as 11
mdadm: no uptodate device for slot 12 of /dev/md0
mdadm: added /dev/sdt1 to /dev/md0 as 13
mdadm: added /dev/sdq1 to /dev/md0 as 14
mdadm: added /dev/sdy1 to /dev/md0 as 15
mdadm: added /dev/sdi1 to /dev/md0 as 16
mdadm: added /dev/sdk1 to /dev/md0 as 17
mdadm: added /dev/sdj1 to /dev/md0 as 18
mdadm: added /dev/sdl1 to /dev/md0 as 19
mdadm: added /dev/sda1 to /dev/md0 as 20
mdadm: added /dev/sdg1 to /dev/md0 as 21
mdadm: added /dev/sdb1 to /dev/md0 as 22
mdadm: added /dev/sdh1 to /dev/md0 as 23
mdadm: added /dev/sdw1 to /dev/md0 as 24
mdadm: added /dev/sdd1 to /dev/md0 as 0
mdadm: /dev/md0 has been started with 22 drives (out of 24) and 1 spare.

root@skinner:/mnt/md0raid# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May 31 17:24:32 2007
     Raid Level : raid6
     Array Size : 10744267776 (10246.53 GiB 11002.13 GB)
  Used Dev Size : 488375808 (465.75 GiB 500.10 GB)
   Raid Devices : 24
  Total Devices : 23
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Dec 16 14:43:57 2008
          State : clean, degraded
 Active Devices : 21
Working Devices : 22
 Failed Devices : 1
  Spare Devices : 1

     Chunk Size : 128K

           UUID : 2aa31867:40c370a7:61c202b9:07c4b1c4
         Events : 0.1273614

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       81        2      active sync   /dev/sdf1
       3       8       33        3      active sync   /dev/sdc1
       4       8      225        4      active sync   /dev/sdo1
       5       8      241        5      active sync   /dev/sdp1
       6       8      193        6      active sync   /dev/sdm1
       7       8      209        7      active sync   /dev/sdn1
       8      65       17        8      active sync   /dev/sdr1
       9      65       81        9      active sync   /dev/sdv1
      10       0        0       10      removed
      11      65      113       11      active sync   /dev/sdx1
      12       0        0       12      removed
      13      65       49       13      active sync   /dev/sdt1
      14      65        1       14      active sync   /dev/sdq1
      15       0        0       15      removed
      16       8      129       16      active sync   /dev/sdi1
      17       8      161       17      active sync   /dev/sdk1
      18       8      145       18      active sync   /dev/sdj1
      19       8      177       19      active sync   /dev/sdl1
      20       8        1       20      active sync   /dev/sda1
      21       8       97       21      active sync   /dev/sdg1
      22       8       17       22      active sync   /dev/sdb1
      23       8      113       23      active sync   /dev/sdh1

      24      65       97        -      spare   /dev/sdw1
      25      65      129        -      faulty spare   /dev/sdy1




* Re: RAID 6 recovery (it's not looking good)
  2008-12-16 15:17 RAID 6 recovery (it's not looking good) Iain Rauch
@ 2008-12-16 19:18 ` Bernd Schubert
  2008-12-16 20:31   ` Iain Rauch
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2008-12-16 19:18 UTC (permalink / raw)
  To: Iain Rauch; +Cc: linux-raid

Hello Iain,

can you please describe what the *present* status is?

> /dev/md0 has been started with 22 drives (out of 24) and 1 spare

So in short, you had a failure of 3 drives, reassembled the array with 22
drives, and while it was rebuilding another drive failed?

If so, take this last failed drive, clone it to a new drive (e.g. with
dd_rescue) and continue.
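
Roughly like this, with either tool - the device names are only placeholders,
so double-check which disk is which before copying (GNU ddrescue keeps a log
file, so an interrupted copy can resume):

    dd_rescue /dev/failed_disk /dev/new_disk
    # or:
    ddrescue -f /dev/failed_disk /dev/new_disk rescue.log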

(Sorry, but this is by far too much output below for my tired eyes. 
Sometimes a short description is more helpful).


Cheers,
Bernd



On Tue, Dec 16, 2008 at 03:17:37PM +0000, Iain Rauch wrote:
> Hi,
> 
> Here's the situation:
> 
> 24 disk array.
> Disk fails - usage continues for a while.
> Power cut - array in unknown state at the time, but I expect it was just
> running degraded.
> Restart and assemble the array.
> [...]


* Re: RAID 6 recovery (it's not looking good)
  2008-12-16 19:18 ` Bernd Schubert
@ 2008-12-16 20:31   ` Iain Rauch
  2008-12-16 23:59     ` Bernd Schubert
  0 siblings, 1 reply; 8+ messages in thread
From: Iain Rauch @ 2008-12-16 20:31 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-raid

> Hello Iain,
> 
> can you please describe what the *present* status is?
> 
>> /dev/md0 has been started with 22 drives (out of 24) and 1 spare
> 
> So in short, you had a failure of 3 drives, reassembled the array with 22
> drives, and while it was rebuilding another drive failed?
> 
> If so, take this last failed drive, clone it to a new drive (e.g. with
> dd_rescue) and continue.
> 
> (Sorry, but this is by far too much output below for my tired eyes.
> Sometimes a short description is more helpful).
> 

I'll see if I can do that.

If I can't get anything useful off sdu (the latest to fail), can I change sdw
from spare to active sync? sds is the spare drive the array is trying to
recover to, and it was the one that became out of sync while the array ran
degraded.

I think maybe sdw was only set to faulty because it was the last one to be
recognised and the array got assembled without it. (The system won't boot
with all the drives on together.)

Here is what mdadm -E has to say about each disk:

http://iain.rauch.co.uk/stuff/skinner-2008-12-16/


Regards,

Iain




* Re: RAID 6 recovery (it's not looking good)
  2008-12-16 20:31   ` Iain Rauch
@ 2008-12-16 23:59     ` Bernd Schubert
  2008-12-19 12:29       ` Iain Rauch
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2008-12-16 23:59 UTC (permalink / raw)
  To: Iain Rauch; +Cc: linux-raid

On Tue, Dec 16, 2008 at 08:31:08PM +0000, Iain Rauch wrote:
> > [...]
> 
> I'll see if I can do that.
> 
> If I can't get anything useful off sdu (the latest to fail) can I change sdw
> from spare to active sync? sds is the spare drive it's trying to recover to
> and was the one that became out of sync as it ran in degraded mode.
> 
> I think sdw maybe sdw was only set to faulty because it was the last one to
> be recognised and the array got assembled without it. (The system won't boot
> with all the drives on together).
> 
> Here is what mdadm -E has to say about each disk:
> 
> http://iain.rauch.co.uk/stuff/skinner-2008-12-16/
> 

I'm still tired (now even more ;-) ). Just check again whether /dev/sdu
really was the latest to fail and, if so, clone that one.
I also suggest reassembling it without an immediate raid rebuild.
First check your data and only then add new drives to the raid.
Once you start a raid rebuild, there is no way to go back. We recently had
the problem of three failed disks as well, but we could only get the data
back by assembling the array not with the latest failed disk, but with the
2nd latest (don't ask why).

So in short

1) clone disk

2) mdadm --assemble --force /dev/mdX /dev/sda1 /dev/sdb1 ... /dev/sdx1

===> Use only **22** devices here.

3) Mount and check data, maybe even a read-only fsck

4) Add two new disks.
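
For 3) and 4), roughly - the mount point and device names are only examples:

    mount -o ro /dev/md0 /mnt/md0raid
    mdadm --add /dev/md0 /dev/sdnew1 /dev/sdnew2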


Hope it helps,
Bernd




* Re: RAID 6 recovery (it's not looking good)
  2008-12-16 23:59     ` Bernd Schubert
@ 2008-12-19 12:29       ` Iain Rauch
  2008-12-19 12:36         ` Bernd Schubert
  0 siblings, 1 reply; 8+ messages in thread
From: Iain Rauch @ 2008-12-19 12:29 UTC (permalink / raw)
  To: linux-raid

> I'm still tired (now even more ;-) ). Just check again whether /dev/sdu
> really was the latest to fail and, if so, clone that one.
> I also suggest reassembling it without an immediate raid rebuild.
> First check your data and only then add new drives to the raid.
> Once you start a raid rebuild, there is no way to go back. We recently had
> the problem of three failed disks as well, but we could only get the data
> back by assembling the array not with the latest failed disk, but with the
> 2nd latest (don't ask why).
> 
> So in short
> 
> 1) clone disk
> 
> 2) mdadm --assemble --force /dev/mdX /dev/sda1 /dev/sdb1 ... /dev/sdx1
> 
> ===> Use only **22** devices here.
> 
> 3) Mount and check data, maybe even a read-only fsck
> 
> 4) Add two new disks.
> 
> 
> Hope it helps,
> Bernd

Well I cloned the disk and force started the array with 22 drives. I mounted
the file system read-only and it did appear to be intact :)

The problem is I cloned the failed drive to a 1.5TB Seagate, and it has the
freezing issue. After 12h of rebuilding (out of 50) that drive got kicked.
I'm gonna see if updating the FW on the drive helps, but otherwise I'll just
have to get another decent drive.

Is there any way to have mdadm be more patient and not kick the drive, or
let me put it back in and continue the rebuild of another drive? I don't
believe the drive will operate for 50h straight.


Regards,

Iain




* Re: RAID 6 recovery (it's not looking good)
  2008-12-19 12:29       ` Iain Rauch
@ 2008-12-19 12:36         ` Bernd Schubert
  2009-01-15 11:41           ` RAID 6 recovery (it's not looking good) *More problems* Iain Rauch
  0 siblings, 1 reply; 8+ messages in thread
From: Bernd Schubert @ 2008-12-19 12:36 UTC (permalink / raw)
  To: Iain Rauch; +Cc: linux-raid

On Fri, Dec 19, 2008 at 12:29:30PM +0000, Iain Rauch wrote:
> > I'm still tired (now even more ;-) ). Just check again whether /dev/sdu really
> > was the latest to fail and, if so, clone that one.
> > I also suggest reassembling it without an immediate raid rebuild:
> > first check your data, and only then add new drives to the raid.
> > Once you start a raid rebuild there is no way to go back. We recently
> > also had the problem of three failed disks, but we could only get the
> > data back by assembling the array not with the latest failed disk, but
> > with the 2nd latest (don't ask why).
> > 
> > So in short
> > 
> > 1) clone disk
> > 
> > 2) mdadm --assemble --force /dev/mdX /dev/sda1 /dev/sdb1 ... /dev/sdx1
> > 
> > ===> Use only **22** devices here.
> > 
> > 3) Mount and check data, maybe even a read-only fsck
> > 
> > 4) Add two new disks.
> > 
> > 
> > Hope it helps,
> > Bernd
> 
> Well I cloned the disk and force started the array with 22 drives. I mounted
> the file system read-only and it did appear to be intact :)

I'm glad to hear that.

> 
> The problem is I cloned the failed drive to a 1.5TB Seagate, and it has the
> freezing issue. After 12h of rebuilding (out of 50) that drive got kicked.
> I'm gonna see if updating the FW on the drive helps, but otherwise I'll just
> have to get another decent drive.
> 
> Is there any way to have mdadm be more patient and not kick the drive, or
> let me put it back in and continue the rebuild of another drive? I don't
> believe the drive will operate for 50h straight.

I think the rebuild would continue if you used bitmaps. You can add a
bitmap with "mdadm --grow --bitmap=internal /dev/mdX", but I'm not sure
whether that works on a degraded md device. At least it won't work
during the rebuild phase.
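
For what it's worth, a rough sketch of the bitmap route (untested here,
and as said it may be refused while the array is degraded or rebuilding):

# add a write-intent bitmap so an interrupted resync can pick up again
mdadm --grow --bitmap=internal /dev/md0

# if a member with bitmap coverage gets kicked, re-adding it should only
# resync the regions the bitmap marks as dirty
mdadm /dev/md0 --re-add /dev/sds1    # device name is just an example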

Cheers,
Bernd


* Re: RAID 6 recovery (it's not looking good) *More problems*
  2008-12-19 12:36         ` Bernd Schubert
@ 2009-01-15 11:41           ` Iain Rauch
  2009-01-17 13:13             ` RAID 6 recovery (it's not looking good) *Even More problems* Iain Rauch
  0 siblings, 1 reply; 8+ messages in thread
From: Iain Rauch @ 2009-01-15 11:41 UTC (permalink / raw)
  To: linux-raid; +Cc: Bernd Schubert

I finally got the array to a state where it has 24/24 drives up.
Unfortunately, after copying some data onto it, it now comes up with IO
errors.


Please help,

Iain.


Here's what I've done so far:

root@skinner:/# umount /mnt/md0raid
umount: /mnt/md0raid: device is busy
umount: /mnt/md0raid: device is busy
root@skinner:/# fuser -m /mnt/md0raid
Cannot stat /mnt/md0raid: Input/output error
Cannot stat /mnt/md0raid: Input/output error
Cannot stat /mnt/md0raid: Input/output error
Cannot stat file /proc/9651/fd/4: Input/output error
root@skinner:/# fuser -m /dev/md0
Cannot stat file /proc/9651/fd/4: Input/output error
root@skinner:/# umount -l /mnt/md0raid
root@skinner:/# xfs_check /dev/md0
xfs_check: /dev/md0 contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library
root@skinner:/# dmesg | grep -i xfs
[196225.294919] XFS mounting filesystem md0
[196226.008338] Ending clean XFS mount for filesystem: md0
[204347.455334] XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1563 of
file fs/xfs/xfs_alloc.c.  Caller 0xf8c21e90
[204347.455374]  [<f8c215eb>] xfs_free_ag_extent+0x53b/0x730 [xfs]
[204347.455400]  [<f8c21e90>] xfs_free_extent+0xe0/0x110 [xfs]
[204347.455441]  [<f8c21e90>] xfs_free_extent+0xe0/0x110 [xfs]
[204347.455503]  [<f8c2d360>] xfs_bmap_finish+0x140/0x190 [xfs]
[204347.455535]  [<f8c37900>] xfs_bunmapi+0x0/0xfb0 [xfs]
[204347.455555]  [<f8c55fcf>] xfs_itruncate_finish+0x24f/0x3b0 [xfs]
[204347.455618]  [<f8c77289>] xfs_inactive+0x469/0x500 [xfs]
[204347.455660]  [<f8c825e2>] xfs_fs_clear_inode+0x32/0x70 [xfs]
[204347.455779] xfs_force_shutdown(md0,0x8) called from line 4261 of file
fs/xfs/xfs_bmap.c.  Return address = 0xf8c82fec

root@skinner:/# xfs_repair -n /dev/md0
xfs_repair: /dev/md0 contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library
root@skinner:/# xfs_repair -fn /dev/md0
        - creating 2 worker thread(s)
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
        - 01:52:55: scanning filesystem freespace - 118 of 118 allocation
groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 01:52:55: scanning agi unlinked lists - 118 of 118 allocation
groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
data fork in ino 1075468390 claims free block 838867328
<snip>
data fork in ino 1075468390 claims free block 838867811
        - agno = 4
bad nblocks 1863041 for inode 2147484215, would reset to 1898317
        - agno = 5
        - agno = 6
        - agno = 7
data fork in ino 3221610517 claims free block 3623910585
imap claims in-use inode 3221610517 is free, would correct imap
        - agno = 8
<snip>
        - agno = 117
data fork in ino 3758128252 claims free block 2952790138
data fork in ino 3758128252 claims free block 2952790139
imap claims in-use inode 3758128252 is free, would correct imap
        - 02:02:39: process known inodes and inode discovery - 55360 of
55360 inodes done
        - process newly discovered inodes...
        - 02:02:39: process newly discovered inodes - 118 of 118 allocation
groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 02:06:26: setting up duplicate extent list - 24 of 118 allocation
groups done
    - 02:06:26: Phase 4: elapsed time 3 minutes, 47 seconds - processed 6
allocation groups per minute
    - 02:06:26: Phase 4: 20% done - estimated remaining time 14 minutes, 49
seconds
        - 02:21:26: setting up duplicate extent list - 75 of 118 allocation
groups done
    - 02:21:26: Phase 4: elapsed time 18 minutes, 47 seconds - processed 3
allocation groups per minute
    - 02:21:26: Phase 4: 63% done - estimated remaining time 10 minutes, 46
seconds
        - 02:36:26: setting up duplicate extent list - 104 of 118 allocation
groups done
    - 02:36:26: Phase 4: elapsed time 33 minutes, 47 seconds - processed 3
allocation groups per minute
    - 02:36:26: Phase 4: 88% done - estimated remaining time 4 minutes, 32
seconds
        - 02:42:59: setting up duplicate extent list - 118 of 118 allocation
groups done
        - 02:51:26: setting up duplicate extent list - 118 of 118 allocation
groups done
    - 02:51:26: Phase 4: elapsed time 48 minutes, 47 seconds - processed 2
allocation groups per minute
    - 02:51:26: Phase 4: 100% done - estimated remaining time
        - 03:06:26: setting up duplicate extent list - 118 of 118 allocation
groups done
    - 03:06:26: Phase 4: elapsed time 1 hour, 3 minutes, 47 seconds -
processed 1 allocation groups per minute
    - 03:06:26: Phase 4: 100% done - estimated remaining time
        - 03:21:27: setting up duplicate extent list - 118 of 118 allocation
groups done
    - 03:21:27: Phase 4: elapsed time 1 hour, 18 minutes, 48 seconds -
processed 1 allocation groups per minute
    - 03:21:27: Phase 4: 100% done - estimated remaining time
        - 03:36:26: setting up duplicate extent list - 118 of 118 allocation
groups done
    - 03:36:26: Phase 4: elapsed time 1 hour, 33 minutes, 47 seconds -
processed 1 allocation groups per minute
    - 03:36:26: Phase 4: 100% done - estimated remaining time
        - 03:51:26: setting up duplicate extent list - 118 of 118 allocation
groups done
    - 03:51:26: Phase 4: elapsed time 1 hour, 48 minutes, 47 seconds -
processed 1 allocation groups per minute
    - 03:51:26: Phase 4: 100% done - estimated remaining time
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - 04:06:26: check for inodes claiming duplicate blocks - 12480 of
55360 inodes done
    - 04:06:26: Phase 4: elapsed time 2 hours, 3 minutes, 47 seconds -
processed 100 inodes per minute
    - 04:06:26: Phase 4: 22% done - estimated remaining time 7 hours, 5
minutes, 18 seconds
        - agno = 3
entry ":2eDS_Store" at block 0 offset 72 in directory inode 1073752707
references free inode 1073752709
    would clear inode number in entry at offset 72...
entry ":2eDS_Store" at block 0 offset 72 in directory inode 1073752708
references free inode 1073753109
    would clear inode number in entry at offset 72...
entry ":2epar_done" at block 0 offset 72 in directory inode 1073753094
references free inode 1073753095
    would clear inode number in entry at offset 72...
        - agno = 4
bad nblocks 1863041 for inode 2147484215, would reset to 1898317
        - 04:21:26: check for inodes claiming duplicate blocks - 23744 of
55360 inodes done
    - 04:21:26: Phase 4: elapsed time 2 hours, 18 minutes, 47 seconds -
processed 171 inodes per minute
    - 04:21:26: Phase 4: 42% done - estimated remaining time 3 hours, 4
minutes, 47 seconds
        - agno = 5
        - agno = 6
entry ":2eDS_Store" at block 0 offset 72 in directory inode 3221234251
references free inode 3221234252
    would clear inode number in entry at offset 72...
entry ":2epar_done" at block 0 offset 96 in directory inode 3221234251
references free inode 3221234253
    would clear inode number in entry at offset 96...
        - 04:36:26: check for inodes claiming duplicate blocks - 39360 of
55360 inodes done
    - 04:36:26: Phase 4: elapsed time 2 hours, 33 minutes, 47 seconds -
processed 255 inodes per minute
    - 04:36:26: Phase 4: 71% done - estimated remaining time 1 hour, 2
minutes, 30 seconds
        - agno = 7
        - agno = 117
        - 04:51:26: check for inodes claiming duplicate blocks - 49664 of
55360 inodes done
    - 04:51:26: Phase 4: elapsed time 2 hours, 48 minutes, 47 seconds -
processed 294 inodes per minute
    - 04:51:26: Phase 4: 89% done - estimated remaining time 19 minutes, 21
seconds
        - 04:59:54: check for inodes claiming duplicate blocks - 55360 of
55360 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem starting at / ...
entry ":2eDS_Store" in directory inode 1073752707 points to free inode
1073752709, would junk entry
entry ":2eDS_Store" in directory inode 1073752708 points to free inode
1073753109, would junk entry
entry ":2epar_done" in directory inode 1073753094 points to free inode
1073753095, would junk entry
entry ":2eDS_Store" in directory inode 3221234251 points to free inode
3221234252, would junk entry
entry ":2epar_done" in directory inode 3221234251 points to free inode
3221234253, would junk entry
        - 05:02:01: traversing filesystem - 118 of 118 allocation groups
done
        - traversal finished ...
        - traversing all unattached subtrees ...
        - traversals finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 3758128252, would move to lost+found
Phase 7 - verify link counts...
        - 05:02:05: verify link counts - 55360 of 55360 inodes done
No modify flag set, skipping filesystem flush and exiting.
root@skinner:/# 

Syslog has got lots of these:

Jan 15 00:01:45 skinner kernel: [203766.124587] SCSI device sdd: 976773168
512-byte hdwr sectors (500108 MB)
Jan 15 00:01:45 skinner kernel: [203766.132971] sdd: Write Protect is off
Jan 15 00:01:45 skinner kernel: [203766.132976] sdd: Mode Sense: 00 3a 00 00
Jan 15 00:01:45 skinner kernel: [203766.134301] SCSI device sdd: write
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 15 00:01:58 skinner kernel: [203780.291734] ata17.00: exception Emask
0x10 SAct 0x0 SErr 0x90000 action 0x2 frozen
Jan 15 00:01:58 skinner kernel: [203780.291763] ata17.00: cmd
c8/00:20:5f:24:7b/00:00:00:00:00/e2 tag 0 cdb 0x0 data 16384 in
Jan 15 00:01:58 skinner kernel: [203780.291765]          res
ff/ff:ff:ff:ff:ff/d0:d0:d0:d0:d0/ff Emask 0x12 (ATA bus error)
Jan 15 00:01:58 skinner kernel: [203780.292896] ata17: hard resetting port
Jan 15 00:01:59 skinner kernel: [203781.551363] ata17: COMRESET failed
(device not ready)
Jan 15 00:01:59 skinner kernel: [203781.551404] ata17: hardreset failed,
retrying in 5 secs
Jan 15 00:02:04 skinner kernel: [203786.548252] ata17: hard resetting port
Jan 15 00:02:05 skinner kernel: [203787.427001] ata17: SATA link up 1.5 Gbps
(SStatus 113 SControl 310)
Jan 15 00:02:06 skinner kernel: [203787.443227] ata17.00: configured for
UDMA/33
Jan 15 00:02:06 skinner kernel: [203787.443238] ata17: EH complete

Syslog around when it happened:

Jan 15 00:11:15 skinner kernel: [204335.803986] SCSI device sde: write
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 15 00:11:15 skinner kernel: [204336.338284] raid5:md0: read error
corrected (8 sectors at 58272000 on sde1)
Jan 15 00:11:15 skinner kernel: [204336.338290] raid5:md0: read error
corrected (8 sectors at 58272008 on sde1)
Jan 15 00:11:15 skinner afpd[9470]: 3214789.36KB read, 6901.94KB written
Jan 15 00:11:15 skinner afpd[9470]: dsi_stream_write: Broken pipe
Jan 15 00:11:15 skinner afpd[9470]: Connection terminated
Jan 15 00:11:15 skinner afpd[9651]: Warning: No CNID scheme for volume
/mnt/md0raid/. Using default.
Jan 15 00:11:15 skinner afpd[9651]: Setting uid/gid to 0/0
Jan 15 00:11:15 skinner afpd[9651]: CNID DB initialized using Sleepycat
Software: Berkeley DB 4.2.52: (December  3, 2003)
Jan 15 00:11:17 skinner afpd[5506]: server_child[1] 9470 exited 1
Jan 15 00:11:17 skinner afpd[9651]: ipc_write: command: 2, pid: 9651,
msglen: 24
Jan 15 00:11:17 skinner afpd[5506]: ipc_read: command: 2, pid: 9651, len: 24
Jan 15 00:11:17 skinner afpd[5506]: Setting clientid (len 16) for 9651,
boottime 496E72CC
Jan 15 00:11:17 skinner afpd[5506]: ipc_get_session: len: 24, idlen 16, time
496e72cc
Jan 15 00:11:17 skinner afpd[9651]: ipc_write: command: 2, pid: 9651,
msglen: 24
Jan 15 00:11:17 skinner afpd[5506]: ipc_read: command: 2, pid: 9651, len: 24
Jan 15 00:11:17 skinner afpd[5506]: Setting clientid (len 16) for 9651,
boottime 496E72CC
Jan 15 00:11:17 skinner afpd[5506]: ipc_get_session: len: 24, idlen 16, time
496e72cc
Jan 15 00:11:17 skinner afpd[9653]: ASIP session:548(5) from
192.168.0.2:49345(8)
Jan 15 00:11:17 skinner afpd[5506]: server_child[1] 9653 done
Jan 15 00:11:26 skinner kernel: [204347.455334] XFS internal error
XFS_WANT_CORRUPTED_GOTO at line 1563 of file fs/xfs/xfs_alloc.c.  Caller
0xf8c21e90
Jan 15 00:11:26 skinner kernel: [204347.455374]  [pg0+947631595/1069122560]
xfs_free_ag_extent+0x53b/0x730 [xfs]
Jan 15 00:11:26 skinner kernel: [204347.455400]  [pg0+947633808/1069122560]
xfs_free_extent+0xe0/0x110 [xfs]
Jan 15 00:11:26 skinner kernel: [204347.455441]  [pg0+947633808/1069122560]
xfs_free_extent+0xe0/0x110 [xfs]
Jan 15 00:11:26 skinner kernel: [204347.455503]  [pg0+947680096/1069122560]
xfs_bmap_finish+0x140/0x190 [xfs]
Jan 15 00:11:26 skinner kernel: [204347.455535]  [pg0+947722496/1069122560]
xfs_bunmapi+0x0/0xfb0 [xfs]
Jan 15 00:11:26 skinner kernel: [204347.455555]  [pg0+947847119/1069122560]
xfs_itruncate_finish+0x24f/0x3b0 [xfs]
Jan 15 00:11:26 skinner kernel: [204347.455618]  [pg0+947982985/1069122560]
xfs_inactive+0x469/0x500 [xfs]
Jan 15 00:11:26 skinner kernel: [204347.455645]  [mutex_lock+8/32]
mutex_lock+0x8/0x20
Jan 15 00:11:26 skinner kernel: [204347.455660]  [pg0+948028898/1069122560]
xfs_fs_clear_inode+0x32/0x70 [xfs]
Jan 15 00:11:26 skinner kernel: [204347.455679]  [dentry_iput+132/144]
dentry_iput+0x84/0x90
Jan 15 00:11:26 skinner kernel: [204347.455688]  [clear_inode+159/336]
clear_inode+0x9f/0x150
Jan 15 00:11:26 skinner kernel: [204347.455691]
[truncate_inode_pages+23/32] truncate_inode_pages+0x17/0x20
Jan 15 00:11:26 skinner kernel: [204347.455698]
[generic_delete_inode+234/256] generic_delete_inode+0xea/0x100
Jan 15 00:11:26 skinner kernel: [204347.455704]  [iput+86/112]
iput+0x56/0x70
Jan 15 00:11:26 skinner kernel: [204347.455709]  [do_unlinkat+238/336]
do_unlinkat+0xee/0x150
Jan 15 00:11:26 skinner kernel: [204347.455747]  [syscall_call+7/11]
syscall_call+0x7/0xb
Jan 15 00:11:26 skinner kernel: [204347.455775]  =======================
Jan 15 00:11:26 skinner kernel: [204347.455779] xfs_force_shutdown(md0,0x8)
called from line 4261 of file fs/xfs/xfs_bmap.c.  Return address =
0xf8c82fec
Jan 15 00:11:26 skinner kernel: [204347.520962] Filesystem "md0": Corruption
of in-memory data detected.  Shutting down filesystem: md0
Jan 15 00:11:26 skinner kernel: [204347.520989] Please umount the
filesystem, and rectify the problem(s)

sdc seems to have had a few errors, roughly hourly, before this happened
  (ATA Error Count: 104)
sdd doesn't have any SMART errors.
sde shows Spin_Up_Time: 8320, last error at disk power-on lifetime: 56 hours.
In fact quite a few disks seem to have had non-fatal errors recently.
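
(smartctl can pull those per-disk summaries in one go - the device list
here is just an example:)

# print health status, attributes and the logged ATA errors per disk
for d in /dev/sdc /dev/sdd /dev/sde; do
    echo "=== $d ==="
    smartctl -H -A -l error "$d"
done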


> <snip - full quote of the previous exchange trimmed>



* Re: RAID 6 recovery (it's not looking good) *Even More problems*
  2009-01-15 11:41           ` RAID 6 recovery (it's not looking good) *More problems* Iain Rauch
@ 2009-01-17 13:13             ` Iain Rauch
  0 siblings, 0 replies; 8+ messages in thread
From: Iain Rauch @ 2009-01-17 13:13 UTC (permalink / raw)
  To: linux-raid; +Cc: Bernd Schubert

OK, now I think I've screwed things up royally.

After cloning the partition from the 1.5TB disk to a replacement 500GB, it
wouldn't assemble as it had a different UUID.

So I tried assembling with --assume-clean, and the first time I got
mount: /dev/md0: can't read superblock

So I tried a different order and got
mount: Structure needs cleaning

And then trying the first order again I got
mdadm: failed to open /dev/sdp1 after earlier success - aborting

One drive started rebuilding, so that must be of no use; one drive is very
dodgy and I wouldn't trust it to have correct data; and one drive is the
replacement that somehow got a different UUID. So the problem is that three
drives have issues - plus sdp, per the error above.

I still haven't formatted the 1.5TB drive, in case you think it could be of
any use, but I'd just like to know if it's time to give up.
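
For completeness, the only remaining trick I know of is the last-resort
recreate with --assume-clean. I haven't tried it; it rewrites all the
superblocks, and every value (level, chunk, metadata version, and above all
the exact device order) must match what mdadm -E reported beforehand. A
sketch only, with example values:

# first record every member's superblock, before touching anything
for d in /dev/sd[a-x]1; do mdadm -E "$d"; done > superblocks.txt

# LAST RESORT: recreate in place; geometry and device order must match
# the saved superblocks exactly, or the data is scrambled for good
mdadm --create /dev/md0 --assume-clean --metadata=0.90 \
      --level=6 --raid-devices=24 --chunk=128 \
      /dev/sdd1 /dev/sde1 ... /dev/sdh1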


Iain


> <snip - full quote of the previous message trimmed>




end of thread

Thread overview: 8 messages
2008-12-16 15:17 RAID 6 recovery (it's not looking good) Iain Rauch
2008-12-16 19:18 ` Bernd Schubert
2008-12-16 20:31   ` Iain Rauch
2008-12-16 23:59     ` Bernd Schubert
2008-12-19 12:29       ` Iain Rauch
2008-12-19 12:36         ` Bernd Schubert
2009-01-15 11:41           ` RAID 6 recovery (it's not looking good) *More problems* Iain Rauch
2009-01-17 13:13             ` RAID 6 recovery (it's not looking good) *Even More problems* Iain Rauch
