Date: Wed, 03 Feb 2021 06:07:43 +0000
To: linux-raid@vger.kernel.org
From: importantdata
Subject: Re: [solved] 3 drive RAID5 with 1 bad drive, 1 drive active but not clean and a single clean drive

------- Original Message -------
On Tuesday, February 2, 2021 10:04 PM, importantdata wrote:

> Hello,
>
> So, I have a bad situation. I run a raid5 array with 3 drives, and I noticed one had fallen out of the array. I need to set up better monitoring; it turns out this was quite some time ago (back in Nov!). This leaves two drives and puts me into a bit of a scare. So I decided to move the important data off to a different array. I created a dir in /mnt, made a new LVM partition, formatted it as EXT4 and started syncing stuff over... Kinda forgot to actually mount it though, so all that data was syncing right back into the bad array (whoops!). The server died this morning; I assume the extra stress may have done something with the drives, or perhaps it filled up root and panicked. In either case I could not boot. I set up a fresh OS on my other raid array, got some tools installed, and now I am working on trying to assemble the bad raid array enough to pull out my data. My data is contained within an LVM inside the raid array.
>
> I have attached the examine and mdadm output of the drives; as you can see, sdd has a really old update time. That drive was having lots of i/o errors. So I want to use sde and sdf to assemble a read-only array, assemble the LVM, mount it, then copy my important data off.
> They are 185 events apart, so I assume there will be some slight data corruption. But I am hoping it's mostly fine and likely just part of my bad rsync.
>
> Unfortunately I don't know what mdadm version was used to make this array, or the OS version, as that's all stuck on the dead array. Here is what I am running on the new install I am using to try to recover the data:
>
> Linux 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux
> mdadm - v4.1 - 2018-10-01
>
> The initial -D looked like this:
>
> root@kglhost-1:~# mdadm -D /dev/md1
> /dev/md1:
>            Version : 1.2
>         Raid Level : raid0
>      Total Devices : 3
>        Persistence : Superblock is persistent
>
>              State : inactive
>    Working Devices : 3
>
>               Name : blah:1
>               UUID : fba7c062:e352fa39:fdc09bf9:e21c4617
>             Events : 18094545
>
>     Number   Major   Minor   RaidDevice
>
>        -       8       82        -        /dev/sdf2
>        -       8       66        -        /dev/sde2
>        -       8       50        -        /dev/sdd2
>
> I tried to run the array but that failed:
>
> # mdadm -o -R /dev/md1
> mdadm: failed to start array /dev/md/1: Input/output error
>
> In dmesg it says:
>
> [Tue Feb  2 21:14:42 2021] md: kicking non-fresh sdd2 from array!
> [Tue Feb  2 21:14:42 2021] md: kicking non-fresh sdf2 from array!
> [Tue Feb  2 21:14:42 2021] md/raid:md1: device sde2 operational as raid disk 1
> [Tue Feb  2 21:14:42 2021] md/raid:md1: not enough operational devices (2/3 failed)
> [Tue Feb  2 21:14:42 2021] md/raid:md1: failed to run raid set.
> [Tue Feb  2 21:14:42 2021] md: pers->run() failed ...
>
> That made the array look like so:
>
> # mdadm -D /dev/md1
> /dev/md1:
>            Version : 1.2
>      Creation Time : Thu Jul 30 21:34:20 2015
>         Raid Level : raid5
>      Used Dev Size : 489615360 (466.93 GiB 501.37 GB)
>       Raid Devices : 3
>      Total Devices : 1
>        Persistence : Superblock is persistent
>
>        Update Time : Tue Feb  2 13:55:02 2021
>              State : active, FAILED, Not Started
>     Active Devices : 1
>    Working Devices : 1
>     Failed Devices : 0
>      Spare Devices : 0
>
>             Layout : left-symmetric
>         Chunk Size : 512K
>
> Consistency Policy : unknown
>
>               Name : blah:1
>               UUID : fba7c062:e352fa39:fdc09bf9:e21c4617
>             Events : 18094730
>
>     Number   Major   Minor   RaidDevice   State
>        -       0       0        0         removed
>        -       0       0        1         removed
>        -       0       0        2         removed
>
>        -       8       66       1         sync   /dev/sde2
>
> I was hoping that --assume-clean might be helpful, but it seems I can't assemble with that option:
>
> # mdadm --assemble --assume-clean -o /dev/md1 /dev/sde /dev/sdf
> mdadm: :option --assume-clean not valid in assemble mode
>
> So I tried a more normal assemble, but it does not have enough drives to start the array:
>
> # mdadm --assemble -o /dev/md1 /dev/sde2 /dev/sdf2
> mdadm: /dev/md1 assembled from 1 drive - not enough to start the array.
>
> mdstat looked like this a while ago:
>
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> md1 : inactive sdf2[2](S) sde2[1](S)
>       979230720 blocks super 1.2
>
> md2 : active raid5 sdb1[1] sda1[0] sdc1[2]
>       3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>       bitmap: 1/15 pages [4KB], 65536KB chunk
>
> md0 : active raid1 sde1[4]
>       242624 blocks super 1.0 [3/1] [_U_]
>
> unused devices: <none>
>
> Now it looks like so:
>
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> md1 : inactive sdf2[2](S) sde2[1](S)
>       979230720 blocks super 1.2
>
> md2 : active raid5 sdb1[1] sda1[0] sdc1[2]
>       3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>       bitmap: 0/15 pages [0KB], 65536KB chunk
>
> md0 : active raid1 sde1[4]
>       242624 blocks super 1.0 [3/1] [_U_]
>
> unused devices: <none>
>
> I am really concerned about --force... and https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn does nothing to alleviate those fears.
>
> Anyone have suggestions on what to do next?
> Thanks!

Hello again,

So it looks like I fixed my situation. I decided to go for the overlay option in:

https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID

I initially went past that because it looked too complex. The parallel command makes everything look really difficult, and I try not to run commands unless I understand what they are doing. So I spent a few minutes playing with parallel, learning how it works and what underlying commands are being executed. Once I figured that out I went ahead with the process. It might be a good idea to list the actual commands, broken down, so people are not scared away by that section (a rough breakdown is sketched at the end of this mail).

Once the overlays were in place I ran into an issue. The array looked like this:

```
# mdadm -D /dev/md1
/dev/md1:
           Version : 1.2
        Raid Level : raid0
     Total Devices : 2
       Persistence : Superblock is persistent

             State : inactive
   Working Devices : 2

              Name : blah:1
              UUID : fba7c062:e352fa39:fdc09bf9:e21c4617
            Events : 18094730

    Number   Major   Minor   RaidDevice

       -     253        3        -        /dev/dm-3
       -     253        4        -        /dev/dm-4
```

So I tried to assemble:

```
# mdadm --assemble --force /dev/md1
mdadm: Merging with already-assembled /dev/md/1
mdadm: forcing event count in /dev/dm-4(2) from 18094545 upto 18094730
mdadm: clearing FAULTY flag for device 0 in /dev/md/1 for /dev/dm-4
mdadm: Marking array /dev/md/1 as 'clean'
mdadm: failed to RUN_ARRAY /dev/md/1: Input/output error
```

I was really worried, but I stopped md1, tried again, and it worked!

```
# mdadm --assemble --force /dev/md1 $OVERLAYS
mdadm: /dev/md1 has been started with 2 drives (out of 3).
```

LVM detected the partitions right away and I was able to mount them without issue. I have pulled the important bits off, and since it's still going fine I am pulling less critical but very nice to have data off as well.

So a success story for mdadm: recovery with little to no data loss.
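
Since I suggested breaking the commands down, here is roughly what I worked out the wiki's parallel one-liner expands to for my two devices. This is reconstructed from memory rather than pasted from my shell history, so treat it as a sketch: the working directory, overlay file names and the 100G sparse size are illustrative placeholders, not exactly what I ran.

```
# Build copy-on-write overlays so that nothing is ever written to the real partitions.
cd /root/overlays                 # placeholder scratch dir with room for the COW data

DEVICES="/dev/sde2 /dev/sdf2"
OVERLAYS=""

for dev in $DEVICES; do
    name=$(basename "$dev")                   # sde2, sdf2
    sectors=$(blockdev --getsz "$dev")        # device size in 512-byte sectors

    # Sparse file that absorbs every write made to the overlay device.
    truncate -s 100G "overlay-$name.img"

    # Attach the sparse file to the next free loop device.
    loop=$(losetup --find --show "overlay-$name.img")

    # Device-mapper snapshot: reads fall through to $dev, writes land in $loop.
    dmsetup create "$name" --table "0 $sectors snapshot $dev $loop P 8"

    OVERLAYS="$OVERLAYS /dev/mapper/$name"
done

echo "OVERLAYS=$OVERLAYS"         # these are the /dev/dm-* devices that mdadm sees
```

The nice property is that sde2 and sdf2 stay untouched: if an assembly attempt goes wrong you just tear the snapshots down (dmsetup remove, losetup -d, delete the files) and start over.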
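
And for the archives, since I only described this step in prose: once md1 was up on the overlays, getting at the data was the usual LVM activation plus a read-only mount. The VG/LV names and mount points below are made-up placeholders, not my real ones.

```
# Make LVM notice the PV inside the assembled /dev/md1 and activate the volume group.
vgscan
vgchange -ay vg_data              # placeholder volume group name

# Mount the logical volume read-only and copy the important data to the healthy array.
mkdir -p /mnt/recovery
mount -o ro /dev/vg_data/lv_stuff /mnt/recovery     # placeholder LV name
rsync -a /mnt/recovery/important/ /mnt/md2-safe/important/
```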