Diagnosis of assembly failure and attempted recovery

* Diagnosis of assembly failure and attempted recovery - help needed
@ 2010-05-30  9:20 Dave Fisher
  2010-05-31  3:55 ` Neil Brown
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Fisher @ 2010-05-30  9:20 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4380 bytes --]

Hi,

My machine suffered a system crash, a couple of days ago. Although the
OS appeared to be still running, there was no means of input by any
external device (except the power switch), so I power cycled it. When
it came back up, it was obvious that there was a problem with the RAID
10 array containing my /home partition (c. 2TB). The crash was only
the latest of a recent series.

First, I ran some diagnostics, whose results are printed in the second
text attachment to this email (the first attachment tells you what I
know about the current state of the array, i.e. after my
intervention).

The results shown in the second attachment, together with the recent
crashes and some previous experience, led me to believe that the four
partitions in the array were not actually (or seriously) damaged, but
simply out of synch.

So I looked up the linux-raid mailing list thread in which I had
reported my previous problem:
http://www.spinics.net/lists/raid/msg22811.html

Unfortunately, in a moment of reckless hope and blind panic I then did
something very stupid ... I applied the 'solution' which Neil Brown
had recommended for my previous RAID failures, without thinking
through the differences in the new context.

... I realised this stupidity, at almost exactly at the moment when
the ENTER key sprang back up after sending the following command:

$ sudo mdadm --assemble --force --verbose /dev/md1 /dev/sdf4 /dev/sdg4
/dev/sdh4 /dev/sdi4

Producing these results some time later:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md_d0 : inactive sdi2[0](S)
      9767424 blocks

md1 : active raid10 sdf4[4] sdg4[1] sdh4[2]
      1931767808 blocks 64K chunks 2 near-copies [4/2] [_UU_]
      [=====>...............]  recovery = 29.4% (284005568/965883904)
finish=250.0min speed=45440K/sec

unused devices: <none>

$ sudo mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sun May 30 00:25:19 2010
          State : clean, degraded, recovering
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2, far=1
     Chunk Size : 64K

 Rebuild Status : 25% complete

           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
         Events : 0.8079536

    Number   Major   Minor   RaidDevice State
       4       8       84        0      spare rebuilding   /dev/sdf4
       1       8      100        1      active sync   /dev/sdg4
       2       8      116        2      active sync   /dev/sdh4
       3       0        0        3      removed

This result temporally raised my hopes because it indicated recovery
in a degraded state ... and I had read somewhere
(http://www.aput.net/~jheiss/raid10/) that 'degraded' meant "lost one
or more drives but has not lost the right combination of drives to
completely fail"

Unfortunately this result also raised my fears, because the
"RaidDevice State" indicated that it was treating /dev/sdf4 as the
spare and writing to it ... whereas I believed that /dev/sdf4 was
supposed to be a full member of the array ... and that /dev/sdj4 was
supposed to be the spare.

I think this belief is confirmed by these data on /dev/sdj4 (from the
second attachment):

    Update Time : Tue Oct  6 18:01:45 2009
    Events : 370

It may be too late, but at this point I came to my senses and resolved
to stop tinkering and to email the following questions instead.

QUESTION 1: Have I now wrecked any chance of recovering the data, or
have I been lucky enough to retain enough data to rebuild the entire
array by employing /dev/sdi4 and/or /dev/sdj4?

QUESTION 2: If I have had 'the luck of the stupid', how do I proceed
safely with the recovery?

QUESTION 3: If I have NOT been unfeasibly lucky, is there any way of
recovering some of the data files from the raw partitions?

N.B. I would be more than happy to recover data at the date shown by
/dev/sdi4's update time. The non-backed-up, business critical data,
has not been modified in several weeks.

I hope you can help and I'd be desperately grateful for it.

Best wishes,

Dave Fisher

[-- Attachment #2: post-recovery-raid-diagnostics.txt --]
[-- Type: text/plain, Size: 5412 bytes --]

$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md_d0 : inactive sdi2[0](S)
      9767424 blocks

md1 : active raid10 sdf4[4](F) sdg4[5](F) sdh4[2]
      1931767808 blocks 64K chunks 2 near-copies [4/1] [__U_]

unused devices: <none>

$ sudo mdadm -E /dev/sd{f,g,h,i,j}4 
/dev/sdf4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 1

    Update Time : Sun May 30 04:47:20 2010
          State : clean
 Active Devices : 1
Working Devices : 2
 Failed Devices : 2
  Spare Devices : 1
       Checksum : 7d4a18fc - correct
         Events : 8079558

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     4       8       84        4      spare   /dev/sdf4

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       0        0        3      faulty removed
   4     4       8       84        4      spare   /dev/sdf4
/dev/sdg4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 1

    Update Time : Sun May 30 04:25:29 2010
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 1
       Checksum : 7d4a13de - correct
         Events : 8079557

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8      100        1      active sync   /dev/sdg4

   0     0       0        0        0      removed
   1     1       8      100        1      active sync   /dev/sdg4
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       0        0        3      faulty removed
   4     4       8       84        4      spare   /dev/sdf4
/dev/sdh4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 1

    Update Time : Sun May 30 08:50:37 2010
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 2
  Spare Devices : 0
       Checksum : 7d4a5230 - correct
         Events : 8079565

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8      116        2      active sync   /dev/sdh4

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       0        0        3      faulty removed
/dev/sdi4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1

    Update Time : Mon May 24 02:12:54 2010
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 7d3a6276 - correct
         Events : 7828427

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8      132        0      active sync   /dev/sdi4

   0     0       8      132        0      active sync   /dev/sdi4
   1     1       8      100        1      active sync   /dev/sdg4
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       8       84        3      active sync   /dev/sdf4
/dev/sdj4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 5
Preferred Minor : 1

    Update Time : Tue Oct  6 18:01:45 2009
          State : clean
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1
       Checksum : 7b1d23e4 - correct
         Events : 370

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8      148        3      active sync   /dev/sdj4

   0     0       8      132        0      active sync   /dev/sdi4
   1     1       8      100        1      active sync   /dev/sdg4
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       8      148        3      active sync   /dev/sdj4
   4     4       8       84        4      spare   /dev/sdf4

[-- Attachment #3: pre-recovery-raid-diagnostics.txt --]
[-- Type: text/plain, Size: 5493 bytes --]

$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : inactive sdh4[2](S) sdf4[3](S) sdg4[1](S) sdi4[0](S)
      3863535616 blocks
unused devices: <none>

sudo mdadm --examine /dev/md1
mdadm: No md superblock detected on /dev/md1.

$ sudo mdadm --examine /dev/sdf4
/dev/sdf4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1

    Update Time : Mon May 24 02:12:54 2010
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 7d3a624c - correct
         Events : 7828427

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       84        3      active sync   /dev/sdf4

   0     0       8      132        0      active sync   /dev/sdi4
   1     1       8      100        1      active sync   /dev/sdg4
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       8       84        3      active sync   /dev/sdf4
</pre>

$ sudo mdadm --examine /dev/sdg4
/dev/sdg4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1

    Update Time : Sat May 29 01:12:30 2010
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 7ccd4c92 - correct
         Events : 8079459

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8      100        1      active sync   /dev/sdg4

   0     0       0        0        0      removed
   1     1       8      100        1      active sync   /dev/sdg4
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       0        0        3      faulty removed

$ sudo mdadm --examine /dev/sdh4
/dev/sdh4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1

    Update Time : Sat May 29 01:26:30 2010
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 2
  Spare Devices : 0
       Checksum : 7d4898bb - correct
         Events : 8079505

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8      116        2      active sync   /dev/sdh4

   0     0       0        0        0      removed
   1     1       0        0        1      faulty removed
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       0        0        3      faulty removed

$ sudo mdadm --examine /dev/sdi4
/dev/sdi4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1

    Update Time : Mon May 24 02:12:54 2010
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 7d3a6276 - correct
         Events : 7828427

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8      132        0      active sync   /dev/sdi4

   0     0       8      132        0      active sync   /dev/sdi4
   1     1       8      100        1      active sync   /dev/sdg4
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       8       84        3      active sync   /dev/sdf4

$ sudo mdadm --examine /dev/sdj4
[sudo] password for davef: 
/dev/sdj4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : f4ddbd55:206c7f81:b855f41b:37d33d37
  Creation Time : Tue May  6 02:06:45 2008
     Raid Level : raid10
  Used Dev Size : 965883904 (921.14 GiB 989.07 GB)
     Array Size : 1931767808 (1842.28 GiB 1978.13 GB)
   Raid Devices : 4
  Total Devices : 5
Preferred Minor : 1

    Update Time : Tue Oct  6 18:01:45 2009
          State : clean
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1
       Checksum : 7b1d23e4 - correct
         Events : 370

         Layout : near=2, far=1
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8      148        3      active sync   /dev/sdj4

   0     0       8      132        0      active sync   /dev/sdi4
   1     1       8      100        1      active sync   /dev/sdg4
   2     2       8      116        2      active sync   /dev/sdh4
   3     3       8      148        3      active sync   /dev/sdj4
   4     4       8       84        4      spare   /dev/sdf4

^ permalink raw reply	[flat|nested] 3+ messages in thread