* Requesting assistance recovering RAID-5 array
@ 2020-03-31  0:04 Daniel Jones
  2020-03-31  0:24 ` antlists
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Jones @ 2020-03-31  0:04 UTC (permalink / raw)
  To: linux-raid

Hello,

I've ended up with my array in an unpleasant condition and am unsure how
best to attempt recovery.  Any assistance from this list would be greatly
appreciated.

The short version: I have a 4 device RAID-5 array currently degraded
to 3 devices. The superblock is missing from 3 out of 4 drives. I've
also lost track of which device was originally /dev/sd[bcde] and doubt
they are in their original order.

How did I end up here?

1) Dec 2018: Created RAID-5 array on three HDDs on CentOS 7

2) Jul 2019: Added fourth HDD to array.

3) Mar 22 2020: One drive in array failed (originally /dev/sdb). **Due
to an outgoing email issue I was not aware of this until (5a) below.**

4) Yesterday: Blissfully unaware of (3), I did a planned upgrade of the
Mobo/CPU/Boot-HDD in the chassis. This went poorly, as follows.

  a) After connecting the four drives to the new mobo I noted that the
BIOS would not recognize the drive in bay #4.

  b) After booting into the "new" system, mdadm did not recognize the array.
     Shut down and replaced various SATA/power cables; at some point bay
#4 was recognized.
     Array still not recognized by mdadm.

  c) Put the old Mobo/CPU/boot-HDD back into the chassis to try to recover
to the "last known good" state, still using the reconfigured SATA/power cables.
     The drive in bay #4 is still recognized.
     Due to all the part swapping I doubt the disks still match their
original sdb/sdc/sdd/sde mapping.

  d) After booting into the "old" system, mdadm does not recognize the array.

     ** Discover that the superblock appears overwritten on three out
of the four drives. **

     Find anecdotal reports online of superblock deletion when moving
arrays between motherboards:

       https://serverfault.com/questions/580761/is-mdadm-raid-toast
       https://forum.openmediavault.org/index.php?thread/11625-raid5-missing-superblocks-after-restart/
(see comments by Nordmann)
       Note that Nordmann claims "Sometimes it occurs that one single
drive out of the array doesnt get affected"

  e) Give up for the day.

5) Today: Looked at things with fresh eyes.

  a) Discovered the Mar-22 drive failure in /var/spool/mail/root.
Working assumption is that the bay #4 drive is the one that was /dev/sdb
at the time of failure.
  b) Collected the information posted below.

So, here is my current situation as I see it.

  A four-disk RAID-5 array that degraded to three-disk a week ago with
the failure of what was then /dev/sdb.
  Due to the moving of cables I am no longer confident that
/dev/sd[bcde] are still what they once were.
  I suspect the original failed /dev/sdb is the bay #4 drive, but I'm not
completely sure.
  Three of the four disks have erased superblocks for unknown reasons.
  Doing a full 'dd' backup of the four disks is not feasible, but if I
can get them to assemble and mount one time I can copy off the data I
need.
  I think the best chance at data recovery is to do a --create to
replace the missing superblocks, but am unsure of the best way in
light of the degraded state of the array.

The only "good news" I have at this point is that I've done nothing at
this time to intentionally overwrite anything.

Information:

Here is the mdadm failure message from 3/22:

        Date: Sun, 22 Mar 2020 13:12:50 -0600 (MDT)
        This is an automatically generated mail message from mdadm running on hulk
        A Fail event had been detected on md device /dev/md/0.
        It could be related to component device /dev/sdb.
        Faithfully yours, etc.
        P.S. The /proc/mdstat file currently contains the following:
        Personalities : [raid6] [raid5] [raid4]
        md0 : active raid5 sdd[3] sde[4] sdc[1] sdb[0](F)
              29298914304 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
              [========>............]  check = 40.2% (3927650944/9766304768) finish=1271.6min speed=76523K/sec
              bitmap: 0/73 pages [0KB], 65536KB chunk

Followed shortly by:

        Date: Sun, 22 Mar 2020 13:21:20 -0600 (MDT)
        This is an automatically generated mail message from mdadm running on hulk
        A DegradedArray event had been detected on md device /dev/md/0.
        Faithfully yours, etc.
        P.S. The /proc/mdstat file currently contains the following:
        Personalities : [raid6] [raid5] [raid4]
        md0 : active raid5 sdc[1] sde[4] sdd[3]
              29298914304 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
              bitmap: 2/73 pages [8KB], 65536KB chunk


Actions today:

# cat /proc/mdstat
Personalities :
md0 : inactive sdc[1](S)
      9766306304 blocks super 1.2

unused devices: <none>


#  mdadm --stop /dev/md0
mdadm: stopped /dev/md0


# mdadm -E /dev/sd[bcde]
/dev/sdb:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 423d9a8e:636a5f08:56ecbd90:282e478b
           Name : hulk:0  (local to host hulk)
  Creation Time : Wed Dec 26 14:13:35 2018
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 19532612608 sectors (9313.88 GiB 10000.70 GB)
     Array Size : 29298914304 KiB (27941.62 GiB 30002.09 GB)
  Used Dev Size : 19532609536 sectors (9313.87 GiB 10000.70 GB)
    Data Offset : 261120 sectors
   Super Offset : 8 sectors
   Unused Space : before=261040 sectors, after=3072 sectors
          State : clean
    Device UUID : 31fa9d90:a407908d:d4d7c7cc:e362b8a5

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Mar 29 15:43:14 2020
  Bad Block Log : 512 entries available at offset 48 sectors
       Checksum : d01e7462 - correct
         Events : 103087

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : .AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)
/dev/sde:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

# gdisk -l /dev/sdb
GPT fdisk (gdisk) version 0.8.10

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdb: 19532873728 sectors, 9.1 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): A0CB08EC-4CA4-4A87-8848-5ED928708E84
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 19532873694
Partitions will be aligned on 2048-sector boundaries
Total free space is 19532873661 sectors (9.1 TiB)

Number  Start (sector)    End (sector)  Size       Code  Name

# gdisk -l /dev/sdc
GPT fdisk (gdisk) version 0.8.10

Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

Caution! After loading partitions, the CRC doesn't check out!
Warning! Main partition table CRC mismatch! Loaded backup partition table
instead of main partition table!

Warning! One or more CRCs don't match. You should repair the disk!

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: damaged

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
Disk /dev/sdc: 19532873728 sectors, 9.1 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): C41898B6-A81D-41B9-BE14-F2AB6D71D8EF
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 19532873694
Partitions will be aligned on 2048-sector boundaries
Total free space is 19532873661 sectors (9.1 TiB)

Number  Start (sector)    End (sector)  Size       Code  Name

# gdisk -l /dev/sdd
GPT fdisk (gdisk) version 0.8.10

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdd: 19532873728 sectors, 9.1 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): A0CB08EC-4CA4-4A87-8848-5ED928708E84
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 19532873694
Partitions will be aligned on 2048-sector boundaries
Total free space is 19532873661 sectors (9.1 TiB)

Number  Start (sector)    End (sector)  Size       Code  Name

# gdisk -l /dev/sde
GPT fdisk (gdisk) version 0.8.10

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sde: 19532873728 sectors, 9.1 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): FB0B481A-6258-4F61-BA60-6AAC8F663DA8
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 19532873694
Partitions will be aligned on 2048-sector boundaries
Total free space is 19532873661 sectors (9.1 TiB)

Number  Start (sector)    End (sector)  Size       Code  Name

# lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
NAME                   SIZE FSTYPE            TYPE MOUNTPOINT
sda                  238.5G                   disk
├─sda1                 500M xfs               part /boot
└─sda2                 238G LVM2_member       part
  ├─centos_hulk-root    50G xfs               lvm  /
  ├─centos_hulk-swap     2G swap              lvm  [SWAP]
  └─centos_hulk-home 185.9G xfs               lvm  /home
sdb                    9.1T                   disk
sdc                    9.1T linux_raid_member disk
sdd                    9.1T                   disk
sde                    9.1T                   disk


# smartctl -H -i -l scterc /dev/sdb
smartctl 7.0 2018-12-30 r4883
[x86_64-linux-3.10.0-1062.18.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    *removed*
LU WWN Device Id: 5 000cca 26ccc09f6
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Mar 30 17:21:04 2020 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

# smartctl -H -i -l scterc /dev/sdc
smartctl 7.0 2018-12-30 r4883
[x86_64-linux-3.10.0-1062.18.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    *removed*
LU WWN Device Id: 5 000cca 273dd833e
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Mar 30 17:21:24 2020 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

# smartctl -H -i -l scterc /dev/sdd
smartctl 7.0 2018-12-30 r4883
[x86_64-linux-3.10.0-1062.18.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    *removed*
LU WWN Device Id: 5 000cca 273e1f716
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Mar 30 17:21:43 2020 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

# smartctl -H -i -l scterc /dev/sde
smartctl 7.0 2018-12-30 r4883
[x86_64-linux-3.10.0-1062.18.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    *removed*
LU WWN Device Id: 5 000cca 267d8594f
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Mar 30 17:21:58 2020 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)


I am genuinely over my head at this point and unsure how to proceed.
My logic tells me the best choice is to attempt a --create to try to
rebuild the missing superblocks, but I'm not clear if I should try
devices=4 (the true size of the array) or devices=3 (the size it was
last operating in).  I'm also not sure of what device order to use
since I have likely scrambled /dev/sd[bcde] and am concerned about
what happens when I bring the previously disabled drive back into the
array.

Can anybody provide any guidance?

Thanks,
DJ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-03-31  0:04 Requesting assistance recovering RAID-5 array Daniel Jones
@ 2020-03-31  0:24 ` antlists
  2020-03-31  0:51   ` Daniel Jones
  0 siblings, 1 reply; 25+ messages in thread
From: antlists @ 2020-03-31  0:24 UTC (permalink / raw)
  To: Daniel Jones, linux-raid

On 31/03/2020 01:04, Daniel Jones wrote:
> I am genuinely over my head at this point and unsure how to proceed.
> My logic tells me the best choice is to attempt a --create to try to
> rebuild the missing superblocks, but I'm not clear if I should try
> devices=4 (the true size of the array) or devices=3 (the size it was
> last operating in).  I'm also not sure of what device order to use
> since I have likely scrambled /dev/sd[bcde] and am concerned about
> what happens when I bring the previously disabled drive back into the
> array.

Don't even THINK of --create until the experts have chimed in !!!

https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn

The lsdrv information is crucial - that recovers pretty much all the 
config information that is available, and massively increases the 
chances of a successful --create, if you do have to go down that route...

If your drives are 1TB, I would *seriously* consider getting hold of a 
4TB drive - they're not expensive - to make a backup. And read up on 
overlays.
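
For reference, the overlay trick boils down to roughly the following per
disk (a rough, untested sketch with sizes and device names as
placeholders - the wiki page has the full recipe):

truncate -s 50G /tmp/overlay-sdb           # sparse file to absorb any writes
loop=$(losetup -f --show /tmp/overlay-sdb) # attach it to a free loop device
size=$(blockdev --getsz /dev/sdb)          # origin size in 512-byte sectors
dmsetup create sdb-overlay --table "0 $size snapshot /dev/sdb $loop P 8"
# ...then experiment against /dev/mapper/sdb-overlay, never /dev/sdb itself.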

Hopefully we can recover your data without too much grief, but this will 
all help.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-03-31  0:24 ` antlists
@ 2020-03-31  0:51   ` Daniel Jones
  2020-03-31  1:27     ` crowston.name
  2020-03-31  1:48     ` Phil Turmel
  0 siblings, 2 replies; 25+ messages in thread
From: Daniel Jones @ 2020-03-31  0:51 UTC (permalink / raw)
  To: antlists; +Cc: linux-raid

Greetings Wol,

> Don't even THINK of --create until the experts have chimed in !!!

Yes, I have had impure thoughts, but fortunately (?) I've done nothing
yet to intentionally write to the drives.

> If your drives are 1TB, I would *seriously* consider getting hold of a 4TB drive - they're not expensive - to make a backup. And read up on overlays.

The array drives are 10TB each.  I understand the concept of overlays in
general (I have used them in a container context) and have skimmed the
wiki, but have not yet acted.

> The lsdrv information is crucial - that recovers pretty much all the config information that is available

Attached.

$ ./lsdrv
PCI [pata_marvell] 02:00.0 IDE interface: Marvell Technology Group
Ltd. 88SE6101/6102 single-port PATA133 interface (rev b2)
└scsi 0:x:x:x [Empty]
PCI [ahci] 00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10
Family) SATA AHCI Controller
├scsi 2:0:0:0 ATA      M4-CT256M4SSD2   {0000000012050904283E}
│└sda 238.47g [8:0] Partitioned (dos)
│ ├sda1 500.00m [8:1] xfs {8ed274ce-4cf6-4804-88f8-0213c002a716}
│ │└Mounted as /dev/sda1 @ /boot
│ └sda2 237.99g [8:2] PV LVM2_member 237.92g used, 64.00m free
{kn8lMS-0Cy8-xpsR-QRTk-CTRG-Eh1J-lmtfws}
│  └VG centos_hulk 237.98g 64.00m free {P5MVrD-UMGG-0IO9-zFNq-8zd2-lycX-oYqe5L}
│   ├dm-2 185.92g [253:2] LV home xfs {39075ece-de0a-4ace-b291-cc22aff5a4b2}
│   │└Mounted as /dev/mapper/centos_hulk-home @ /home
│   ├dm-0 50.00g [253:0] LV root xfs {68ffae87-7b51-4392-b3b8-59a7aa13ea68}
│   │└Mounted as /dev/mapper/centos_hulk-root @ /
│   └dm-1 2.00g [253:1] LV swap swap {f2da9893-93f0-42a1-ba86-5f3b3a72cc9b}
├scsi 3:0:0:0 ATA      WDC WD100EMAZ-00 {1DGVH01Z}
│└sdb 9.10t [8:16] Partitioned (gpt)
├scsi 4:0:0:0 ATA      WDC WD100EMAZ-00 {2YJ2XMPD}
│└sdc 9.10t [8:32] MD raid5 (4) inactive 'hulk:0'
{423d9a8e-636a-5f08-56ec-bd90282e478b}
├scsi 5:0:0:0 ATA      WDC WD100EMAZ-00 {2YJDR8LD}
│└sdd 9.10t [8:48] Partitioned (gpt)
└scsi 6:0:0:0 ATA      WDC WD100EMAZ-00 {JEHRKH2Z}
 └sde 9.10t [8:64] Partitioned (gpt)

Cheers,
DJ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-03-31  0:51   ` Daniel Jones
@ 2020-03-31  1:27     ` crowston.name
  2020-03-31  1:50       ` Phil Turmel
  2020-03-31  1:48     ` Phil Turmel
  1 sibling, 1 reply; 25+ messages in thread
From: crowston.name @ 2020-03-31  1:27 UTC (permalink / raw)
  To: Daniel Jones, antlists; +Cc: linux-raid

I got the following error trying to run lsdrv: 

./lsdrv 
Traceback (most recent call last):
  File "./lsdrv", line 423, in <module>
    probe_block('/sys/block/'+x)
  File "./lsdrv", line 340, in probe_block
    blk.__dict__.update(extractvars(runx(['vol_id', '--export', '/dev/block/'+blk.dev])))
  File "./lsdrv", line 125, in runx
    out, err = sub.communicate()
  File "/usr/lib/python2.5/subprocess.py", line 667, in communicate
    return self._communicate(input)
  File "/usr/lib/python2.5/subprocess.py", line 1138, in _communicate
    rlist, wlist, xlist = select.select(read_set, write_set, [])
select.error: (4, 'Interrupted system call')




Kevin Crowston
206 Meadowbrook Dr.
Syracuse, NY 13210 USA
Phone: +1 (315) 464-0272
Fax: +1 (815) 550-2155

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-03-31  0:51   ` Daniel Jones
  2020-03-31  1:27     ` crowston.name
@ 2020-03-31  1:48     ` Phil Turmel
  2020-03-31  2:09       ` Daniel Jones
  1 sibling, 1 reply; 25+ messages in thread
From: Phil Turmel @ 2020-03-31  1:48 UTC (permalink / raw)
  To: Daniel Jones, antlists; +Cc: linux-raid

Hi Daniel,

{ Convention on all kernel.org lists is to avoid top-posting and to trim 
unnecessary quoted material.  Please do so going forward. }

On 3/30/20 8:51 PM, Daniel Jones wrote:
> Greetings Wol,
> 
>> Don't even THINK of --create until the experts have chimed in !!!

Unfortunately, your new motherboard or your distro appears to have 
reacted to the presence of whole-disk raid members by establishing GUID 
Partition Tables (GPT) on them, blowing away those drives' superblocks.

Personally, I like the idea of whole-disk raid members, and did so for a 
while, until reports like yours made me change my ways.  Sorry.

You absolutely will need to use --create in this situation.  Much of the 
data necessary is available from the one remaining superblock, which you 
nicely included in your original report.

Since a --create will be needed, I recommend adjusting offsets to work 
with new partitions on each drive that start at a 1MB offset.

> Yes, I have had impure thoughts, but fortunately (?) I've done nothing
> yet to intentionally write to the drives.

Thank you.  This makes it much easier to help you.

[trim /]

>> The lsdrv information is crucial - that recovers pretty much all the config information that is available
> 
> Attached.
> 
> $ ./lsdrv
> PCI [pata_marvell] 02:00.0 IDE interface: Marvell Technology Group
> Ltd. 88SE6101/6102 single-port PATA133 interface (rev b2)
> └scsi 0:x:x:x [Empty]
> PCI [ahci] 00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10
> Family) SATA AHCI Controller
> ├scsi 2:0:0:0 ATA      M4-CT256M4SSD2   {0000000012050904283E}
> │└sda 238.47g [8:0] Partitioned (dos)
> │ ├sda1 500.00m [8:1] xfs {8ed274ce-4cf6-4804-88f8-0213c002a716}
> │ │└Mounted as /dev/sda1 @ /boot
> │ └sda2 237.99g [8:2] PV LVM2_member 237.92g used, 64.00m free
> {kn8lMS-0Cy8-xpsR-QRTk-CTRG-Eh1J-lmtfws}
> │  └VG centos_hulk 237.98g 64.00m free {P5MVrD-UMGG-0IO9-zFNq-8zd2-lycX-oYqe5L}
> │   ├dm-2 185.92g [253:2] LV home xfs {39075ece-de0a-4ace-b291-cc22aff5a4b2}
> │   │└Mounted as /dev/mapper/centos_hulk-home @ /home
> │   ├dm-0 50.00g [253:0] LV root xfs {68ffae87-7b51-4392-b3b8-59a7aa13ea68}
> │   │└Mounted as /dev/mapper/centos_hulk-root @ /
> │   └dm-1 2.00g [253:1] LV swap swap {f2da9893-93f0-42a1-ba86-5f3b3a72cc9b}
> ├scsi 3:0:0:0 ATA      WDC WD100EMAZ-00 {1DGVH01Z}
> │└sdb 9.10t [8:16] Partitioned (gpt)
> ├scsi 4:0:0:0 ATA      WDC WD100EMAZ-00 {2YJ2XMPD}
> │└sdc 9.10t [8:32] MD raid5 (4) inactive 'hulk:0'
> {423d9a8e-636a-5f08-56ec-bd90282e478b}
> ├scsi 5:0:0:0 ATA      WDC WD100EMAZ-00 {2YJDR8LD}
> │└sdd 9.10t [8:48] Partitioned (gpt)
> └scsi 6:0:0:0 ATA      WDC WD100EMAZ-00 {JEHRKH2Z}
>   └sde 9.10t [8:64] Partitioned (gpt)

No shocks here.  But due to the incomplete array, useful details are 
missing.  In particular, knowledge of the filesystem or nested structure 
(LVM?) present on the array will be needed to identify the real data 
offsets of the three mangled members.  (lsdrv is really intended to 
document critical details of a healthy system to minimize this kind of 
uncertainty when it eventually breaks.)

Please tell us what you can.  If it was another LVM volume group, please 
look for backups of the LVM metadata, typically in /etc/lvm/backup/.

Or we can make educated guesses until read-only access presents working 
or near-working content.

> Cheers,
> DJ

Regards,

Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-03-31  1:27     ` crowston.name
@ 2020-03-31  1:50       ` Phil Turmel
  0 siblings, 0 replies; 25+ messages in thread
From: Phil Turmel @ 2020-03-31  1:50 UTC (permalink / raw)
  To: crowston.name, Daniel Jones, antlists; +Cc: linux-raid

Hi Kevin,

On 3/30/20 9:27 PM, crowston.name wrote:
> I got the following error trying to run lsdrv:
> 
> ./lsdrv
> Traceback (most recent call last):
>    File "./lsdrv", line 423, in <module>
>      probe_block('/sys/block/'+x)
>    File "./lsdrv", line 340, in probe_block
>      blk.__dict__.update(extractvars(runx(['vol_id', '--export', '/dev/block/'+blk.dev])))
>    File "./lsdrv", line 125, in runx
>      out, err = sub.communicate()
>    File "/usr/lib/python2.5/subprocess.py", line 667, in communicate
>      return self._communicate(input)
>    File "/usr/lib/python2.5/subprocess.py", line 1138, in _communicate
>      rlist, wlist, xlist = select.select(read_set, write_set, [])
> select.error: (4, 'Interrupted system call')

Please don't hijack threads with unrelated reports.

Please report this on github where I can track it, but note that I'm not 
even trying to support python 2.5.

Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-03-31  1:48     ` Phil Turmel
@ 2020-03-31  2:09       ` Daniel Jones
  2020-03-31 12:00         ` Phil Turmel
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Jones @ 2020-03-31  2:09 UTC (permalink / raw)
  To: Phil Turmel; +Cc: antlists, linux-raid

Hello Phil,

> your new motherboard or your distro appears to have reacted to the presence of whole-disk raid members by establishing Gnu Partition Tables on them, blowing away those drives' superblocks.

Yes, this was an unpleasant surprise.  Won't build them this way again.

> In particular, knowledge of the filesystem or nested structure (LVM?) present on the array will be needed to identify the real data offsets of the three mangled members.

I don't have the history of original creation, but I'm fairly certain
it was something straightforward like:

  mdadm --create /dev/md0 {parameters}
  sudo mkfs.ext4 /dev/md0
  mount /dev/md0 /mnt/raid5

After the array was corrupted I needed to comment out the mount from
my fstab, which was as follows (confirming ext4):

    /dev/md0                                      /mnt/raid5
   ext4    defaults        0       0

Cheers,
DJ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-03-31  2:09       ` Daniel Jones
@ 2020-03-31 12:00         ` Phil Turmel
  2020-03-31 13:36           ` Daniel Jones
  2020-04-01  3:39           ` Daniel Jones
  0 siblings, 2 replies; 25+ messages in thread
From: Phil Turmel @ 2020-03-31 12:00 UTC (permalink / raw)
  To: Daniel Jones; +Cc: antlists, linux-raid

Good morning Daniel,

On 3/30/20 10:09 PM, Daniel Jones wrote:
> Hello Phil,
>> In particular, knowledge of the filesystem or nested structure (LVM?) present on the array will be needed to identify the real data offsets of the three mangled members.
> 
> I don't have the history of original creation, but I'm fairly certain
> it was something straightforward like:
> 
>    mdadm --create /dev/md0 {parameters}
>    sudo mkfs.ext4 /dev/md0
>    mount /dev/md0 /mnt/raid5
> 
> After the array was corrupted I needed to comment out the mount from
> my fstab, which was as follows (confirming ext4):
> 
>      /dev/md0                                      /mnt/raid5
>     ext4    defaults        0       0

Ok.  This should be relatively easy, if a bit time consuming.  Things we 
know:

1) array layout, and chunk size: 512k or 1024 sectors
2) Active device #1 offset 261124 sectors.
3) The array had bad block logging turned on.  We won't re-enable this 
mis-feature.  It is default, so you must turn it off in your --create.

Things we don't know:

1) Data offsets for other drives.  However, the one we know appears to 
be the typical value you'd get from one reshape after a modern default 
creation (262144).  There are good odds that the others are at this 
offset, except the newest one, which might be at 262144.  You'll have to 
test four combinations: all at 261124 plus one at a time at 262144.

2) Member order for the other drives.  Three drives taken three at a 
time is six combinations.

3) Identity of the first drive kicked out. (Or do we know?)  If not 
known, there's four more combinations: whether to leave out or one of 
three left out.

That yields either 24 or 96 different --create --assume-clean 
combinations to test to find the one that gives you the cleanest 
filesystem in a read-only fsck.  (Do NOT mount!  Even a read-only mount 
will write to the filesystem.  Only test with fsck -n.)

Start by creating partitions on all devices, preferably at 2048 sectors. 
  (Should be the default offered.)  Use data offsets 259076 and 260100 
instead of 261124 and 262144.
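
A non-interactive way to do that, if you have sgdisk handy (a sketch, not
a tested recipe - and note that writing the GPT partition entries will
overwrite the area holding the one surviving superblock on /dev/sdc, since
the 1.2 superblock sits 4K into the disk, so make sure the -E output above
is saved first):

for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    sgdisk -n 1:2048:0 -t 1:fd00 "$d"  # one partition from sector 2048, type fd00 = Linux RAID
done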

I recommend writing out all the combinations before you start and 
keeping the fsck -n output from each until you have the final version 
you want.

Yeah, I'd write a script to do it all for me, if your best guess 
combination doesn't yield a good filesystem.
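
A skeleton for such a script might look like the following (untested
sketch: it only walks the six member orders with the known drive fixed in
slot 1 and all four drives present, at a single offset - extend it for the
"missing" and offset variations, and preferably point it at overlays
rather than the raw disks):

#!/bin/bash
offset=130560   # data offset in KiB (261120 sectors / 2); adjust per combination
for order in "sdb sdd sde" "sdb sde sdd" "sdd sdb sde" \
             "sdd sde sdb" "sde sdb sdd" "sde sdd sdb"; do
    set -- $order
    mdadm --stop /dev/md0 2>/dev/null
    # --run suppresses the "appears to contain..." confirmation prompt
    mdadm --create /dev/md0 --run --assume-clean --level=5 --chunk=512K \
          --raid-devices=4 --data-offset=$offset \
          "/dev/$1" /dev/sdc "/dev/$2" "/dev/$3"
    echo "=== order: $1 sdc $2 $3 ==="
    fsck -n /dev/md0
done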

> Cheers,
> DJ

Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-03-31 12:00         ` Phil Turmel
@ 2020-03-31 13:36           ` Daniel Jones
  2020-04-01  3:39           ` Daniel Jones
  1 sibling, 0 replies; 25+ messages in thread
From: Daniel Jones @ 2020-03-31 13:36 UTC (permalink / raw)
  To: Phil Turmel; +Cc: antlists, linux-raid

Hi Phil,

Thanks for the guidance.  I'll think this through and come back with
any questions.

It turns out that I do have a file in /etc/lvm/backup/ but I believe
it is a red herring.  The file date is from 2016, well before this
array was created, and must represent some long-ago configuration
of this machine.

Regards,
DJ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-03-31 12:00         ` Phil Turmel
  2020-03-31 13:36           ` Daniel Jones
@ 2020-04-01  3:39           ` Daniel Jones
  2020-04-01  4:45             ` Phil Turmel
  1 sibling, 1 reply; 25+ messages in thread
From: Daniel Jones @ 2020-04-01  3:39 UTC (permalink / raw)
  To: Phil Turmel; +Cc: antlists, linux-raid

Hello Phil, et al.,

Phil, after reading through your email I have some questions.

> The array had bad block logging turned on.  We won't re-enable this
> mis-feature.  It is default, so you must turn it off in your --create.

Am I able to turn off during --create?  The man page for mdadm on my
system (mdadm - v4.1 - 2018-10-01) suggests that --update=no-bbl can
be used for --assemble and --manage but doesn't list it for --create.

> However, the one we know appears to be the typical you'd get from
> one reshape after a modern default creation (262144).

> You'll have to test four combinations: all at 261124 plus one at a
> time at 262144.

I'm confused by the offsets. The one remaining superblock I have
reports "Data Offset : 261120 sectors".  Your email mentions 261124
and 262144. I don't understand how these three values are related?

I think it is most likely that my one existing superblock with 261120
is one of the original three drives and not the fourth drive that was
added later.  (Based on the position in drive bay).

So possible offsets (I'm still not clear on this) could be:

a) all 261120
b) all 261124
c) all 262144
d) three at 261120, one at 262144
e) three at 261120, one at 261124
f) three at 261124, one at 261120
g) three at 261124, one at 262144
h) three at 262144, one at 261120
i) three at 262144, one at 261124

( this ignores the combinations of not knowing which drive gets the
oddball offset )
( this also ignores for now the offsets of 259076 and 260100 mentioned below )

> 2) Member order for the other drives.  Three drives taken three at a
> time is six combinations.
>
> 3) Identity of the first drive kicked out. (Or do we know?)  If not
> known, there's four more combinations: whether to leave out or one of
> three left out.

Can I make any tentative conclusions from this information:

  Device Role : Active device 1
  Array State : .AAA ('A' == active, '.' == missing, 'R' == replacing)

I know /dev/sde is the device that didn't initially respond to BIOS
and suspect it is the "missing" drive from my superblock.

I know that /dev/sdc is the drive with a working superblock that
reports itself as "Active device 1".

I don't know how mdadm counts things (starting at 0 or starting at 1,
left to right or right to left, including or excluding the missing
drive).

Would it be reasonable for a first guess that:

.AAA = sde sdd sdc sdb  (assuming the order is missing, active 0,
active 1, active 2) ?

Procedure questions:

If I understand all the above, attempted recovery is going to be along
the lines of:

mdadm --create /dev/md0 --force --assume-clean --readonly
--data-offset=261120 --chunk=512K --level=5 --raid-devices=4 missing
/dev/sdd /dev/sdc /dev/sdb
fsck -n /dev/md0

Subject to:
Don't know if --force is desirable in this case?
Might need to try different offsets from above.  Don't know how to set
offsets if they are different per drive.
Should I start with guessing "missing" for 1 or should I start with all 4?
Might need to try all device orders.

> Start by creating partitions on all devices, preferably at 2048 sectors.
> (Should be the default offered.)  Use data offsets 259076 and 260100
> instead of 261124 and 262144.

If I understand, this is an alternative to recreating the whole-disk
mdadm containing one partition. Instead it would involve creating new
partition tables on each physical drive, creating one partition per
table, writing superblocks to the new /dev/sd[bcde]1 with offsets
adjusted by either 2044 or 2048 sectors, and then doing the fsck on
the assembled RAID.

I think the advantage proposed here is that it prevents this
"automated superblock overwrite" from happening again if/when I try
the motherboard upgrade, but the risk I'm not comfortable with yet is
going beyond "do the minimum to get it working again". Although it
isn't practical for me to do a dd full backup of these drives, if I
can get the array mounted again I can copy off the most important data
before doing a grander repartitioning.

Can you advise on any of the above?

Thanks,
DJ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01  3:39           ` Daniel Jones
@ 2020-04-01  4:45             ` Phil Turmel
  2020-04-01  6:03               ` Daniel Jones
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Turmel @ 2020-04-01  4:45 UTC (permalink / raw)
  To: Daniel Jones; +Cc: antlists, linux-raid

Hi Daniel,

On 3/31/20 11:39 PM, Daniel Jones wrote:
> Hello Phil, et al.,
> 
> Phil, after reading through your email I have some questions.
> 
>> The array had bad block logging turned on.  We won't re-enable this
>> mis-feature.  It is default, so you must turn it off in your --create.
> 
> Am I able to turn off during --create?  The man page for mdadm on my
> system (mdadm - v4.1 - 2018-10-01) suggests that --update=no-bbl can
> be used for --assemble and --manage but doesn't list it for --create.

Uhm, self compile and see what you get.  In these situations, relying on 
a potentially buggy system mdadm is not recommended.  But if still not 
available in create, fix it afterwards.  You definitely do not want this.
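
If your build really can't do it at create time, something along these
lines afterwards should drop it (device list illustrative only):

mdadm --stop /dev/md0
mdadm --assemble /dev/md0 --update=no-bbl /dev/sd[bcde]1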

>> However, the one we know appears to be the typical you'd get from
>> one reshape after a modern default creation (262144).
> 
>> You'll have to test four combinations: all at 261124 plus one at a
>> time at 262144.
> 
> I'm confused by the offsets. The one remaining superblock I have
> reports "Data Offset : 261120 sectors".  Your email mentions 261124
> and 262144. I don't understand how these three values are related?

Yeah, doing math in one's head quickly sometimes yields a fail.  262144 
is 128MB in sectors.  Minus 1024 sectors (your chunk size) yields 
261120.  /:

> I think it is most likely that my one existing superblock with 261120
> is one of the original three drives and not the fourth drive that was
> added later.  (Based on the position in drive bay).
> 
> So possible offsets (I'm still not clear on this) could be:
> 
> a) all 261120

Yes.

> b) all 261124

No.

> c) all 262144

No.

> d) three at 261120, one at 262144

Yes.

> e) three at 261120, one at 261124
> f) three at 261124, one at 261120
> g) three at 261124, one at 262144

No, no, and no.

> h) three at 262144, one at 261120

Extremely unlikely.  Not in my recommended combinations to check.

> i) three at 262144, one at 261124

No.

> ( this ignores the combinations of not knowing which drive gets the
> oddball offset )
> ( this also ignores for now the offsets of 259076 and 260100 mentioned below )
> 
>> 2) Member order for the other drives.  Three drives taken three at a
>> time is six combinations.
>>
>> 3) Identity of the first drive kicked out. (Or do we know?)  If not
>> known, there's four more combinations: whether to leave out or one of
>> three left out.
> 
> Can I make any tentative conclusions from this information:
> 
>    Device Role : Active device 1
>    Array State : .AAA ('A' == active, '.' == missing, 'R' == replacing)

This device will always be listed as the 2nd member in all of your 
--create commands, and always with the offset of 261120 - 2048.
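(That is, 261120 - 2048 = 259072 sectors, or 129536 if your mdadm takes 
the offset in kilobytes.)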

> I know /dev/sde is the device that didn't initially respond to BIOS
> and suspect it is the "missing" drive from my superblock.

That eliminates the combinations of (3).  Section (2) becomes three 
drives taken two at a time (since you don't know which device role 
/dev/sde had).  But that is still six combinations.

> I know that /dev/sdc is the drive with a working superblock that
> reports itself as "Active device 1".

Right, as above.

> I don't know how mdadm counts things (starting at 0 or starting at 1,
> left to right or right to left, including or excluding the missing
> drive).

Active devices start with zero.

> Would it be reasonable for a first guess that:
> 
> .AAA = sde sdd sdc sdb  (assuming the order is missing, active 0,
> active 1, active 2) ?

No.  Order is always active devices 0-3, with one of those replaced (in 
order) with "missing".

> Procedure questions:
> 
> If I understand all the above, attempted recovery is going to be along
> the lines of:
> 
> mdadm --create /dev/md0 --force --assume-clean --readonly
> --data-offset=261120 --chunk=512K --level=5 --raid-devices=4 missing
> /dev/sdd /dev/sdc /dev/sdb
> fsck -n /dev/md0

Yes, but with the order above, and with --data-offset=variable when 
mixing them.

> Subject to:
> Don't know if --force is desirable in this case?

Not applicable to --create.

> Might need to try different offsets from above.  Don't know how to set
> offsets if they are different per drive.

man page.

> Should I start with guessing "missing" for 1 or should I start with all 4?
> Might need to try all device orders.
> 
>> Start by creating partitions on all devices, preferably at 2048 sectors.
>> (Should be the default offered.)  Use data offsets 259076 and 260100
>> instead of 261124 and 262144.
> 
> If I understand, this is an alternative to recreating the whole-disk
> mdadm containing one partition. Instead it would involve creating new
> partition tables on each physical drive, creating one partition per
> table, writing superblocks to the new /dev/sd[bcde]1 with offsets
> adjusted by either 2044 or 2048 sectors, and then doing the fsck on
> the assembled RAID.

Yes, 2048.

> I think the advantage proposed here is that it prevents this
> "automated superblock overwrite" from happening again if/when I try
> the motherboard upgrade, but the risk I'm not comfortable with yet is
> going beyond "do the minimum to get it working again". Although it
> isn't practical for me to do a dd full backup of these drives, if I
> can get the array mounted again I can copy off the most important data
> before doing a grander repartitioning.

It's virtually impossible to correct at any time other than create, so 
do it now.  The "minimum" is a rather brutal situation.  Fix it right.

> Can you advise on any of the above?
> 
> Thanks,
> DJ
> 

Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01  4:45             ` Phil Turmel
@ 2020-04-01  6:03               ` Daniel Jones
  2020-04-01 12:15                 ` Wols Lists
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Jones @ 2020-04-01  6:03 UTC (permalink / raw)
  To: Phil Turmel; +Cc: antlists, linux-raid

Thanks Phil,

I'll read this a couple of times and try some commands (likely on an
overlay) tomorrow.

Regards,
DJ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01  6:03               ` Daniel Jones
@ 2020-04-01 12:15                 ` Wols Lists
  2020-04-01 12:55                   ` Phil Turmel
  0 siblings, 1 reply; 25+ messages in thread
From: Wols Lists @ 2020-04-01 12:15 UTC (permalink / raw)
  To: Daniel Jones, Phil Turmel; +Cc: linux-raid

On 01/04/20 07:03, Daniel Jones wrote:
> Thanks Phil,
> 
> I'll read this a couple of times and try some commands (likely on an
> overlay) tomorrow.

If you CAN overlay, then DO. If you can't back up the drives, the more
you can do to protect them from being accidentally written to, the better.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01 12:15                 ` Wols Lists
@ 2020-04-01 12:55                   ` Phil Turmel
  2020-04-01 15:21                     ` Daniel Jones
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Turmel @ 2020-04-01 12:55 UTC (permalink / raw)
  To: Wols Lists, Daniel Jones; +Cc: linux-raid

On 4/1/20 8:15 AM, Wols Lists wrote:
> On 01/04/20 07:03, Daniel Jones wrote:
>> Thanks Phil,
>>
>> I'll read this a couple of times and try some commands (likely on an
>> overlay) tomorrow.
> 
> If you CAN overlay, then DO. If you can't back up the drives, the more
> you can do to protect them from being accidentally written to, the better.

I have to admit that I pretty much never use overlays.  But then I'm 
entirely confident of what any given mdadm/lvm/fdisk operation will do, 
in regards to writing to devices.  And since it is burned into my psyche 
that raid is *not* backup, only an uptime aid, I keep good external 
backups. (:

Lacking confidence and lacking backups are both good reasons for using 
overlays.  I'm not entirely sure the mental effort for a novice to learn 
to use overlays is time better spent than learning enough about MD and 
mdadm for confident use.

I do think initial recovery efforts with --assemble and --assemble 
--force do not need to be done with overlays.  They are so safe and so 
likely to quickly yield a working array that I think overlays should be 
recommended only for invasive tasks needed after these --assemble 
operations fail.

--create is a very invasive operation.

Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01 12:55                   ` Phil Turmel
@ 2020-04-01 15:21                     ` Daniel Jones
  2020-04-01 15:38                       ` Phil Turmel
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Jones @ 2020-04-01 15:21 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, linux-raid

Hi Phil, Wols,

(Sorry for the top-post in my last reply).

I'm working through everything Phil recommended.  I am also using
overlays exactly as documented on the
"Irreversible_mdadm_failure_recovery" wiki.  Things look very
favorable so far.

A quick question on what I'm doing:

As per Phil's suggestion to put the array inside partitions, I have
created partitions inside each of /dev/mapper/sd[bcde].  The gdisk
operations end with the message:

  Warning: The kernel is still using the old partition table.
  The new table will be used at the next reboot.
  The operation has completed successfully.

My question is: how do I get the kernel to recognize the
/dev/mapper/sd[bcde]1 partitions I have created?  Rebooting doesn't do
anything, as the overlay loop files aren't something that gets
recognized during boot.

Thanks,
DJ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01 15:21                     ` Daniel Jones
@ 2020-04-01 15:38                       ` Phil Turmel
  2020-04-01 15:39                         ` Phil Turmel
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Turmel @ 2020-04-01 15:38 UTC (permalink / raw)
  To: Daniel Jones; +Cc: Wols Lists, linux-raid

On 4/1/20 11:21 AM, Daniel Jones wrote:
> Hi Phil, Wols,

> A quick question on what I'm doing?
> 
> As per Phil's suggestion to put the array inside partitions, I have
> created partitions inside each of /dev/mapper/sd[bcde].  The gdisk
> operations end with the message:
> 
>    Warning: The kernel is still using the old partition table.
>    The new table will be used at the next reboot.
>    The operation has completed successfully.
> 
> My question is how to get the kernel to recognize the
> /dev/mapper/sd[bcde]1 partitions I have created?  Rebooting doesn't do
> anything as the overlay loop files aren't something that gets
> recognized during boot.

I would create the partition tables once on the live disks.  Then make 
overlays for the partitions on each test.

Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01 15:38                       ` Phil Turmel
@ 2020-04-01 15:39                         ` Phil Turmel
  2020-04-01 18:07                           ` Daniel Jones
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Turmel @ 2020-04-01 15:39 UTC (permalink / raw)
  To: Daniel Jones; +Cc: Wols Lists, linux-raid

On 4/1/20 11:38 AM, Phil Turmel wrote:
> On 4/1/20 11:21 AM, Daniel Jones wrote:
>> Hi Phil, Wols,

>> My question is how to get the kernel to recognize the
>> /dev/mapper/sd[bcde]1 partitions I have created?  Rebooting doesn't do
>> anything as the overlay loop files aren't something that gets
>> recognized during boot.
> 
> I would create the partition tables once on the live disks.  Then make 
> overlays for the partitions on each test.

And use partprobe if needed to tell the kernel to re-read the partition 
tables.
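
For example, assuming the partition tables were written to the real disks 
as suggested above:

  # partprobe /dev/sd[bcde]

(or blockdev --rereadpt per disk).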

Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01 15:39                         ` Phil Turmel
@ 2020-04-01 18:07                           ` Daniel Jones
  2020-04-01 18:32                             ` Phil Turmel
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Jones @ 2020-04-01 18:07 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, linux-raid

Hi Phil,

So far so good.

1: I have run gdisk on each physical drive to create a new partition.

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048     19532873694   9.1 TiB     8300  Linux filesystem

2: Everything from here is on overlays. I tested many combinations of
--create. This appears to be the correct one:

mdadm --create /dev/md0 --assume-clean --data-offset=129536 --level=5
--chunk=512K --raid-devices=4 missing /dev/mapper/sdc1
/dev/mapper/sdd1 /dev/mapper/sde1

Data offset was calculated as (261120-2048)/2 = 129536, since my mdadm
expects it in kB rather than sectors.
All six combinations of device orders were tested; bcde was the only
one that fsck liked.
Array was tested in configs of bcde, Xcde, bcXe, bcdX (where X is missing).
Configs that passed fsck were mounted and data inspected.

  bcde = did not contain last files known to be written to array
  Xcde = **did** contain last files known to be written to array
  bcXe = fsck 400,000+ errors
  bcdX = did not contain last files known to be written to array
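
For the record, the order testing above is easy to script against the 
overlay devices; a rough sketch, not the exact commands used here (--run 
merely suppresses mdadm's confirmation prompts, and fsck -n writes 
nothing):

  for order in "c d e" "c e d" "d c e" "d e c" "e c d" "e d c"; do
      devs=""
      for d in $order; do devs="$devs /dev/mapper/sd${d}1"; done
      mdadm --stop /dev/md0 2>/dev/null
      mdadm --create /dev/md0 --assume-clean --run --data-offset=129536 \
            --level=5 --chunk=512K --raid-devices=4 missing $devs
      echo "=== missing $order ==="
      fsck -n /dev/md0
  done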

3: I then attempted to add the removed drive (still using overlay).

# mdadm --manage /dev/md0 --re-add /dev/mapper/sdb1
mdadm: --re-add for /dev/mapper/sdb1 to /dev/md0 is not possible

# mdadm --manage /dev/md0 --add /dev/mapper/sdb1
mdadm: added /dev/mapper/sdb1

It ran like this for a short while:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 dm-3[4] dm-6[3] dm-5[2] dm-4[1]
      29298917376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
      [>....................]  recovery =  0.0% (457728/9766305792)
finish=18089.5min speed=8997K/sec
      bitmap: 0/73 pages [0KB], 65536KB chunk

Then it ended in this state:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 dm-3[4](F) dm-6[3] dm-5[2] dm-4[1]
      29298917376 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
      bitmap: 2/73 pages [8KB], 65536KB chunk

unused devices: <none>

# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Wed Apr  1 11:34:26 2020
        Raid Level : raid5
        Array Size : 29298917376 (27941.63 GiB 30002.09 GB)
     Used Dev Size : 9766305792 (9313.88 GiB 10000.70 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Apr  1 11:41:27 2020
             State : clean, degraded
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 1
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : hulk:0  (local to host hulk)
              UUID : 29e8195c:1da9c101:209c7751:5fc7d1b9
            Events : 37

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     253        4        1      active sync   /dev/dm-4
       2     253        5        2      active sync   /dev/dm-5
       3     253        6        3      active sync   /dev/dm-6

       4     253        3        -      faulty   /dev/dm-3


# mdadm -E /dev/mapper/sd[bcde]1
mdadm: No md superblock detected on /dev/mapper/sdb1.
/dev/mapper/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 29e8195c:1da9c101:209c7751:5fc7d1b9
           Name : hulk:0  (local to host hulk)
  Creation Time : Wed Apr  1 11:34:26 2020
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 19532612575 sectors (9313.88 GiB 10000.70 GB)
     Array Size : 29298917376 KiB (27941.63 GiB 30002.09 GB)
  Used Dev Size : 19532611584 sectors (9313.88 GiB 10000.70 GB)
    Data Offset : 259072 sectors
   Super Offset : 8 sectors
   Unused Space : before=258992 sectors, after=991 sectors
          State : clean
    Device UUID : 683a2ac8:e9242cda:e522c872:f86ca9b5

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Apr  1 11:41:27 2020
  Bad Block Log : 512 entries available at offset 48 sectors
       Checksum : 1e17785b - correct
         Events : 37

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : .AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/mapper/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 29e8195c:1da9c101:209c7751:5fc7d1b9
           Name : hulk:0  (local to host hulk)
  Creation Time : Wed Apr  1 11:34:26 2020
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 19532612575 sectors (9313.88 GiB 10000.70 GB)
     Array Size : 29298917376 KiB (27941.63 GiB 30002.09 GB)
  Used Dev Size : 19532611584 sectors (9313.88 GiB 10000.70 GB)
    Data Offset : 259072 sectors
   Super Offset : 8 sectors
   Unused Space : before=258992 sectors, after=991 sectors
          State : clean
    Device UUID : 63885fec:40e5f57f:59f73757:958d5cf6

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Apr  1 11:41:27 2020
  Bad Block Log : 512 entries available at offset 48 sectors
       Checksum : d397bbf - correct
         Events : 37

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : .AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/mapper/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 29e8195c:1da9c101:209c7751:5fc7d1b9
           Name : hulk:0  (local to host hulk)
  Creation Time : Wed Apr  1 11:34:26 2020
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 19532612575 sectors (9313.88 GiB 10000.70 GB)
     Array Size : 29298917376 KiB (27941.63 GiB 30002.09 GB)
  Used Dev Size : 19532611584 sectors (9313.88 GiB 10000.70 GB)
    Data Offset : 259072 sectors
   Super Offset : 8 sectors
   Unused Space : before=258992 sectors, after=991 sectors
          State : clean
    Device UUID : ad3fa4d3:20a0582a:098b31d1:38f2b248

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Apr  1 11:41:27 2020
  Bad Block Log : 512 entries available at offset 48 sectors
       Checksum : 6b30e63c - correct
         Events : 37

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : .AAA ('A' == active, '.' == missing, 'R' == replacing)

4: Summary

The drives have had physical partitions written.
I think I've found the correct offset and device order to use --create
to restore the array to the degraded state it was in before the
superblocks were overwritten.
I'm not sure why the --add doesn't work.

Thanks so much for your help this far.

Regards,
DJ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01 18:07                           ` Daniel Jones
@ 2020-04-01 18:32                             ` Phil Turmel
  2020-04-03 18:29                               ` Daniel Jones
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Turmel @ 2020-04-01 18:32 UTC (permalink / raw)
  To: Daniel Jones; +Cc: Wols Lists, linux-raid

Hi Daniel,

On 4/1/20 2:07 PM, Daniel Jones wrote:
> Hi Phil,
> 
> So far so good.

Yes.

> # mdadm --manage /dev/md0 --add /dev/mapper/sdb1
> mdadm: added /dev/mapper/sdb1

Don't do this.  Overlays can't really handle the amount of data that 
would be involved, and you definitely don't want to rebuild yet.

> 4: Summary
> 
> The drives have had physical partitions written.
> I think I've found the correct offset and device order to use --create
> to restore the array to the degraded state it was in before the
> superblocks were overwritten.

Yes.

> I'm not sure why the --add doesn't work.

Don't do the --add operation until you've copied anything critical in 
the array to external backups (while running with 3 of 4).  The reason 
is that any not-yet-discovered URE on those three will certainly crash 
the array during rebuild.  It could still crash while copying critical 
stuff, but you can repeatedly --assemble --force to keep going with the 
next items to back up.

Only when you've backed up everything possible do you --add the fourth 
drive back into the array.
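
Concretely, something along these lines once the degraded array has been 
re-created on the real partitions (mount point, destination and the 
"critical-stuff" path are placeholders):

  # mdadm --assemble --force /dev/md0 /dev/sd[cde]1
  # mount -o ro /dev/md0 /mnt/md0
  # rsync -a /mnt/md0/critical-stuff/ /path/to/external/backup/

If a URE drops the array mid-copy, repeat the --assemble --force and carry 
on with the next items; only after everything needed is copied:

  # mdadm /dev/md0 --add /dev/sdb1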

> Thanks so much for your help this far.

You're welcome.

> Regards,
> DJ

Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-01 18:32                             ` Phil Turmel
@ 2020-04-03 18:29                               ` Daniel Jones
  2020-04-03 18:34                                 ` Phil Turmel
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Jones @ 2020-04-03 18:29 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, linux-raid

Hello again,

> Don't do the --add operation until you've copied anything critical in the array to external backups (while running with 3 of 4).

Everything from the array has been backed up elsewhere.

Up until now the only writes intentionally done to the physical drives
have been the new partition tables.  Everything else has been through
the overlay.

Now I think I'm ready to run a --create as follows on the physical drives:
mdadm --create /dev/md0 --assume-clean --data-offset=129536 --level=5
--chunk=512K --raid-devices=4 missing /dev/sdc1 /dev/sdd1 /dev/sde1

After that I'd try to re-add the rejected drive?
mdadm --manage /dev/md0 --add /dev/sdb1

Part of me wonders about just rebuilding the whole thing and then
copying the data back, but I don't know that it would be any better than
this path.

Thanks,
DJ

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-03 18:29                               ` Daniel Jones
@ 2020-04-03 18:34                                 ` Phil Turmel
  2020-04-03 18:42                                   ` Daniel Jones
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Turmel @ 2020-04-03 18:34 UTC (permalink / raw)
  To: Daniel Jones; +Cc: Wols Lists, linux-raid

On 4/3/20 2:29 PM, Daniel Jones wrote:
> Hello again,
> 
>> Don't do the --add operation until you've copied anything critical in the array to external backups (while running with 3 of 4).
> 
> Everything from the array has been backed up elsewhere.
> 
> Up until now the only writes intentionally done to the physical drives
> have been the new partition tables.  Everything else has been through
> the overlay.
> 
> Now I think I'm ready to run a --create as follows on the physical drives:
> mdadm --create /dev/md0 --assume-clean --data-offset=129536 --level=5
> --chunk=512K --raid-devices=4 missing /dev/sdc1 /dev/sdd1 /dev/sde1
> 
> After that I'd try to re-add the rejected drive?
> mdadm --manage /dev/md0 --add /dev/sdb1
> 
> Part of me wonders about just rebuilding the whole thing and then
> copying the data back, but I don't know that it would be any better than
> this path.

Sounds like a risk-free decision.  mdadm --create --assume-clean 
followed by a proper fsck will be lots faster than mdadm --create, mkfs, 
and copying.

I'd go fast.

Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-03 18:34                                 ` Phil Turmel
@ 2020-04-03 18:42                                   ` Daniel Jones
  2020-04-03 18:43                                     ` Phil Turmel
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Jones @ 2020-04-03 18:42 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Wols Lists, linux-raid

After the "--create missing /dev/sdc1 /dev/sdd1 /dev/sde1"  and the
fsck, is "mdadm --manage /dev/md0 --add /dev/sdb" the correct syntax
for attempting to add?

-DJ

On Fri, Apr 3, 2020 at 12:34 PM Phil Turmel <philip@turmel.org> wrote:
>
> On 4/3/20 2:29 PM, Daniel Jones wrote:
> > Hello again,
> >
> >> Don't do the --add operation until you've copied anything critical in the array to external backups (while running with 3 of 4).
> >
> > Everything from the array has been backed up elsewhere.
> >
> > Up until now the only writes intentionally done to the physical drives
> > have been the new partition tables.  Everything else has been through
> > the overlay.
> >
> > Now I think I'm ready to run a --create as follows on the physical drives:
> > mdadm --create /dev/md0 --assume-clean --data-offset=129536 --level=5
> > --chunk=512K --raid-devices=4 missing /dev/sdc1 /dev/sdd1 /dev/sde1
> >
> > After that I'd try to re-add the rejected drive?
> > mdadm --manage /dev/md0 --add /dev/sdb1
> >
> > Part of me wonders about just rebuilding the whole thing and then
> > copying the data back, but I don't know that it would be any better than
> > this path.
>
> Sounds like a risk-free decision.  mdadm --create --assume-clean
> followed by a proper fsck will be lots faster than mdadm --create, mkfs,
> and copying.
>
> I'd go fast.
>
> Phil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-03 18:42                                   ` Daniel Jones
@ 2020-04-03 18:43                                     ` Phil Turmel
  2020-04-03 20:13                                       ` Adam Goryachev
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Turmel @ 2020-04-03 18:43 UTC (permalink / raw)
  To: Daniel Jones; +Cc: Wols Lists, linux-raid


On 4/3/20 2:42 PM, Daniel Jones wrote:
> After the "--create missing /dev/sdc1 /dev/sdd1 /dev/sde1"  and the
> fsck, is "mdadm --manage /dev/md0 --add /dev/sdb" the correct syntax
> for attempting to add?

You can leave out "--manage".  But yes.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-03 18:43                                     ` Phil Turmel
@ 2020-04-03 20:13                                       ` Adam Goryachev
  2020-04-03 20:14                                         ` Phil Turmel
  0 siblings, 1 reply; 25+ messages in thread
From: Adam Goryachev @ 2020-04-03 20:13 UTC (permalink / raw)
  To: Phil Turmel, Daniel Jones; +Cc: Wols Lists, linux-raid


On 4/4/20 05:43, Phil Turmel wrote:
>
> On 4/3/20 2:42 PM, Daniel Jones wrote:
>> After the "--create missing /dev/sdc1 /dev/sdd1 /dev/sde1"  and the
>> fsck, is "mdadm --manage /dev/md0 --add /dev/sdb" the correct syntax
>> for attempting to add?
>
> You can leave out "--manage".  But yes.

I was mostly following this, but might have missed something here so 
this is just a suggestion to double check....

If you are trying to use partitions instead of whole devices (to prevent 
this happening again in future), then shouldn't you use:

mdadm --manage /dev/md0 --add /dev/sdb1

ie, sdb1 not sdb....

Regards,
Adam

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Requesting assistance recovering RAID-5 array
  2020-04-03 20:13                                       ` Adam Goryachev
@ 2020-04-03 20:14                                         ` Phil Turmel
  0 siblings, 0 replies; 25+ messages in thread
From: Phil Turmel @ 2020-04-03 20:14 UTC (permalink / raw)
  To: Adam Goryachev, Daniel Jones; +Cc: Wols Lists, linux-raid

On 4/3/20 4:13 PM, Adam Goryachev wrote:
> 
> On 4/4/20 05:43, Phil Turmel wrote:
>>
>> On 4/3/20 2:42 PM, Daniel Jones wrote:
>>> After the "--create missing /dev/sdc1 /dev/sdd1 /dev/sde1"  and the
>>> fsck, is "mdadm --manage /dev/md0 --add /dev/sdb" the correct syntax
>>> for attempting to add?
>>
>> You can leave out "--manage".  But yes.
> 
> I was mostly following this, but might have missed something here so 
> this is just a suggestion to double check....
> 
> If you are trying to use partitions instead of whole devices (to prevent 
> this happening again in future), then shouldn't you use:
> 
> mdadm --manage /dev/md0 --add /dev/sdb1
> 
> ie, sdb1 not sdb....

Yes.  Good catch.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-04-03 20:14 UTC | newest]

Thread overview: 25+ messages
2020-03-31  0:04 Requesting assistance recovering RAID-5 array Daniel Jones
2020-03-31  0:24 ` antlists
2020-03-31  0:51   ` Daniel Jones
2020-03-31  1:27     ` crowston.name
2020-03-31  1:50       ` Phil Turmel
2020-03-31  1:48     ` Phil Turmel
2020-03-31  2:09       ` Daniel Jones
2020-03-31 12:00         ` Phil Turmel
2020-03-31 13:36           ` Daniel Jones
2020-04-01  3:39           ` Daniel Jones
2020-04-01  4:45             ` Phil Turmel
2020-04-01  6:03               ` Daniel Jones
2020-04-01 12:15                 ` Wols Lists
2020-04-01 12:55                   ` Phil Turmel
2020-04-01 15:21                     ` Daniel Jones
2020-04-01 15:38                       ` Phil Turmel
2020-04-01 15:39                         ` Phil Turmel
2020-04-01 18:07                           ` Daniel Jones
2020-04-01 18:32                             ` Phil Turmel
2020-04-03 18:29                               ` Daniel Jones
2020-04-03 18:34                                 ` Phil Turmel
2020-04-03 18:42                                   ` Daniel Jones
2020-04-03 18:43                                     ` Phil Turmel
2020-04-03 20:13                                       ` Adam Goryachev
2020-04-03 20:14                                         ` Phil Turmel
