From: importantdata <importantdata@protonmail.com>
To: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: [solved] 3 drive RAID5 with 1 bad drive, 1 drive active but not clean and a single clean drive
Date: Wed, 03 Feb 2021 06:07:43 +0000
Message-ID: <1xVN7rJL7OEF2ZKf4PrEYr8js-QWLWTpjtyYFFo9a67zcH7AhHdQbmfimfB0hSGWYa2YP9DgJYdfd-GzYE_p__TXZU9HZkvkF-YWRCkAExs=@protonmail.com>




Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, February 2, 2021 10:04 PM, importantdata <importantdata@protonmail.com> wrote:

> Hello,
>
> So, I have a bad situation. I run a RAID5 array with 3 drives, and I noticed one had fallen out of the array. I need to set up better monitoring; it turns out this happened quite some time ago (back in November!). That left two drives and put me into a bit of a scare, so I decided to move the important data off to a different array. I created a directory in /mnt, made a new LVM partition, formatted it as ext4, and started syncing stuff over... I kind of forgot to actually mount it though, so all that data was syncing right back into the bad array (whoops!). The server died this morning; I assume the extra stress may have done something to the drives, or perhaps it filled up root and panicked. In either case I could not boot. I set up a fresh OS on my other RAID array, got some tools installed, and now I am trying to assemble the bad RAID array just enough to pull out my data. My data is contained in an LVM volume inside the RAID array.
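>
> (The intended sequence was the usual create / format / mount / copy dance; the mount is the step I skipped. Roughly, with placeholder names and sizes rather than my real ones:)
>
>     lvcreate -L 500G -n recovery vg_other       # new LV on the healthy array (example names)
>     mkfs.ext4 /dev/vg_other/recovery
>     mkdir /mnt/recovery
>     mount /dev/vg_other/recovery /mnt/recovery  # <-- the step I forgot
>     rsync -a /important/stuff/ /mnt/recovery/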
>
> I have attached the examine and mdadm output for the drives; as you can see, sdd has a really old update time. That drive was having lots of I/O errors. So I want to use sde and sdf to assemble a read-only array, activate the LVM, mount it, and copy my important data off. They are 185 events apart, so I assume there will be some slight data corruption, but I am hoping it's mostly fine and likely just part of my bad rsync.
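>
> (The event counts and update times come from running mdadm --examine against each member partition; roughly what I compared was:)
>
>     mdadm --examine /dev/sd[def]2 | grep -E 'Update Time|Events'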
>
> Unfortunately I don't know what mdadm version was used to make this array, or the OS version, as that's all stuck on the dead array. Here is what I am running on the new install I am using to try to recover the data:
> Linux 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux
> mdadm - v4.1 - 2018-10-01
>
> The initial -D looked like this:
>
>     root@kglhost-1:~# mdadm -D /dev/md1
>     /dev/md1:
>                Version : 1.2
>             Raid Level : raid0
>          Total Devices : 3
>            Persistence : Superblock is persistent
>
>                  State : inactive
>        Working Devices : 3
>
>                   Name : blah:1
>                   UUID : fba7c062:e352fa39:fdc09bf9:e21c4617
>                 Events : 18094545
>
>         Number   Major   Minor   RaidDevice
>
>            -       8       82        -        /dev/sdf2
>            -       8       66        -        /dev/sde2
>            -       8       50        -        /dev/sdd2
>
>
> I tried to run the array but that failed:
>
>     # mdadm -o -R  /dev/md1
>     mdadm: failed to start array /dev/md/1: Input/output error
>
>
> In dmesg it says:
>
>     [Tue Feb  2 21:14:42 2021] md: kicking non-fresh sdd2 from array!
>     [Tue Feb  2 21:14:42 2021] md: kicking non-fresh sdf2 from array!
>     [Tue Feb  2 21:14:42 2021] md/raid:md1: device sde2 operational as raid disk 1
>     [Tue Feb  2 21:14:42 2021] md/raid:md1: not enough operational devices (2/3 failed)
>     [Tue Feb  2 21:14:42 2021] md/raid:md1: failed to run raid set.
>     [Tue Feb  2 21:14:42 2021] md: pers->run() failed ...
>
>
>
> That made the array look like so:
>
>     # mdadm -D /dev/md1
>     /dev/md1:
>                Version : 1.2
>          Creation Time : Thu Jul 30 21:34:20 2015
>             Raid Level : raid5
>          Used Dev Size : 489615360 (466.93 GiB 501.37 GB)
>           Raid Devices : 3
>          Total Devices : 1
>            Persistence : Superblock is persistent
>
>            Update Time : Tue Feb  2 13:55:02 2021
>                  State : active, FAILED, Not Started
>         Active Devices : 1
>        Working Devices : 1
>         Failed Devices : 0
>          Spare Devices : 0
>
>                 Layout : left-symmetric
>             Chunk Size : 512K
>
>     Consistency Policy : unknown
>
>                   Name : blah:1
>                   UUID : fba7c062:e352fa39:fdc09bf9:e21c4617
>                 Events : 18094730
>
>         Number   Major   Minor   RaidDevice State
>            -       0        0        0      removed
>            -       0        0        1      removed
>            -       0        0        2      removed
>
>            -       8       66        1      sync   /dev/sde2
>
>
> I was hoping that --assume-clean might be helpful, but it seems I can't assemble with that option:
>
>     # mdadm --assemble --assume-clean -o  /dev/md1 /dev/sde /dev/sdf
>     mdadm: :option --assume-clean not valid in assemble mode
>
>
> So I tried a more normal assemble, but there were not enough drives to start the array:
>
>     # mdadm --assemble -o  /dev/md1 /dev/sde2 /dev/sdf2
>     mdadm: /dev/md1 assembled from 1 drive - not enough to start the array.
>
>
> mdstat looked like this a while ago:
>
>     Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>     md1 : inactive sdf2[2](S) sde2[1](S)
>           979230720 blocks super 1.2
>
>     md2 : active raid5 sdb1[1] sda1[0] sdc1[2]
>           3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>           bitmap: 1/15 pages [4KB], 65536KB chunk
>
>     md0 : active raid1 sde1[4]
>           242624 blocks super 1.0 [3/1] [_U_]
>
>     unused devices: <none>
>
>
>
> Now it looks like this:
>
>     Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>     md1 : inactive sdf2[2](S) sde2[1](S)
>           979230720 blocks super 1.2
>
>     md2 : active raid5 sdb1[1] sda1[0] sdc1[2]
>           3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>           bitmap: 0/15 pages [0KB], 65536KB chunk
>
>     md0 : active raid1 sde1[4]
>           242624 blocks super 1.0 [3/1] [_U_]
>
>     unused devices: <none>
>
>
>
> I am really concerned about --force... and https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn does nothing to alleviate those fears.
>
> Anyone have suggestions on what to do next?
> Thanks!

Hello again,

So it looks like I fixed my situation.


I decided to go for the overlay option in:
https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID

I initially went past that because it looked too complex. The parallel command makes everything look really difficult, and I try not to run commands unless I understand what they are doing. So I spent a few minutes playing with parallel, learning how it works and what underlying commands actually get executed, and once I had figured that out I went ahead with the process. It might be a good idea for that page to list the actual commands broken down step by step, so people are not scared away by that section.
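
For anyone else put off by that page, here is roughly what the parallel one-liners boil down to for a single member device (shown for /dev/sde2; the 4G overlay size is just the wiki's example value, so treat this as a sketch rather than a copy-paste recipe):

```
# Sparse file that will absorb all writes instead of the real disk
truncate -s 4G overlay-sde2

# Attach the overlay file to a free loop device
loop=$(losetup -f --show -- overlay-sde2)

# Size of the real member in 512-byte sectors, which is what device-mapper expects
size=$(blockdev --getsz /dev/sde2)

# Copy-on-write snapshot: reads come from /dev/sde2, writes land in the loop device
echo "0 $size snapshot /dev/sde2 $loop P 8" | dmsetup create sde2

# Repeat for each member, then collect the /dev/mapper/* names into $OVERLAYS
```

The array then gets assembled from the /dev/mapper devices instead of the real partitions, so even a --force assemble cannot touch the underlying disks.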

Once the overlays were in place, I ran into an issue.

The array looked like this:

```
# mdadm -D /dev/md1
/dev/md1:
           Version : 1.2
        Raid Level : raid0
     Total Devices : 2
       Persistence : Superblock is persistent

             State : inactive
   Working Devices : 2

              Name : blah:1
              UUID : fba7c062:e352fa39:fdc09bf9:e21c4617
            Events : 18094730

    Number   Major   Minor   RaidDevice

       -     253        3        -        /dev/dm-3
       -     253        4        -        /dev/dm-4
```

So I tried to assemble it:

```
# mdadm --assemble --force /dev/md1
mdadm: Merging with already-assembled /dev/md/1
mdadm: forcing event count in /dev/dm-4(2) from 18094545 upto 18094730
mdadm: clearing FAULTY flag for device 0 in /dev/md/1 for /dev/dm-4
mdadm: Marking array /dev/md/1 as 'clean'
mdadm: failed to RUN_ARRAY /dev/md/1: Input/output error
```
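
The "Merging with already-assembled /dev/md/1" line was the clue: md1 was still sitting there half-assembled from my earlier attempts against the real partitions, so it had to be stopped before assembling from the overlays. The stop itself didn't make it into my paste; it was just an ordinary stop:

```
mdadm --stop /dev/md1
```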

I was really worried, but I stopped md1 and tried again, and it worked!

```
# mdadm --assemble --force /dev/md1 $OVERLAYS
mdadm: /dev/md1 has been started with 2 drives (out of 3).
```

LVM detected the volumes right away and I was able to mount them without issue. I have pulled the important bits off, and it's still going fine, so I am now pulling off the less critical but very nice-to-have data as well.
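
For completeness, the LVM side amounted to little more than the usual activate-and-mount; roughly (volume group, LV, and mount point names here are examples, not my real ones):

```
vgscan                                          # rescan physical volumes now that md1 exists again
vgchange -ay vg_data                            # activate the VG that lives on md1 (example name)
mkdir -p /mnt/recover
mount -o ro /dev/vg_data/lv_data /mnt/recover   # read-only out of caution
rsync -a /mnt/recover/important/ /mnt/safe/     # copy off to the healthy array
```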

So, a success story for mdadm: recovery with little to no data loss.

